
MinutesFeb23

Introduction

Minutes of the Facilities Integration Program meeting, Feb 23, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Charles, Aaron, Karthik, Dave, AK, Bob, Jason, John, Sarah, Michael, Saul, Torre, John B, Xin, Hiro, Horst, Wei, Armen, Kaushik, Mark
  • Apologies: none this week!

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

Tier 3 References:
  • Links to the ATLAS T3 working group Twikis are here
  • The T3g setup guide is here
  • The users' guide to T3g is here

last week(s):

  • The xrootd RPM is under test; it still needs work.
  • Arizona is coming online.
  • Rik is migrating out of Tier 3 management to analysis support, but will stay closely involved since T3 and analysis are closely related.
this week:

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • Running out of jobs - down to the US and Canada now. There is a new request for 80M simulation; Borut will define these tasks soon.
    • Cross-cloud production, promoted by Rod Walker: MWT2 and AGLT2 are set up. This smooths production of high-priority tasks.
    • Simultaneously, Simone is performing sonar tests. These need to go beyond the star channel.
    • LHC wide-area connectivity is becoming important.
    • Distributed analysis will be much more of a challenge; PD2P will be transferring the entire dataset. Q: could this be improved to transfer only the parts users request? Jobs could be sent to multiple Tier 2's if the dataset could be split; not favored - smaller datasets would be better.
    • Beyond-pledge issue: parameters in schedconfig are not getting transferred into the pilot; this depends on Condor-G.
  • this week:
    • Plenty of jobs; no major issues.
    • Charles reported on walltime efficiency with the onset of pileup jobs from the CERN cloud (at MWT2), which are causing NFS load. A report went to Borut and the ADC operations list. Kaushik: should be forwarded to the software group.
    • This brings us to CVMFS, which would mitigate the load on NFS.
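    • As a reminder of what CVMFS buys us here: worker nodes read the ATLAS software area from a locally cached CVMFS mount instead of the site NFS server. A minimal sketch (Python, for illustration only; the mount point /cvmfs/atlas.cern.ch and running this on a worker node are assumptions) of a per-node sanity check:
      import os
      import sys

      # Assumed mount point for the ATLAS software repository served by CVMFS.
      # With CVMFS in place, releases are read from the local CVMFS cache
      # instead of hammering the site NFS server.
      REPO = "/cvmfs/atlas.cern.ch"

      if not os.path.isdir(REPO):
          sys.exit("CVMFS repository %s is not mounted on this node" % REPO)

      # Listing the top level forces the client to fetch the repository catalog,
      # which is a cheap end-to-end check of proxy and cache configuration.
      print("CVMFS OK, sample entries: %s" % sorted(os.listdir(REPO))[:5])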

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=127502
    
    1)  2/9: Problems accessing the panda monitor.  Issue was traced back to a faulty NAT configuration on the monitor machines - now solved.  https://savannah.cern.ch/bugs/?77992, eLog 21950.
    2)  2/9:  Bob at AGLT2: From 15:12 to 16:00 or so we had a network snafu at AGLT2 that took our primary DNS server off-line.  Some 70 jobs were lost, but all else seems to have recovered satisfactorily.  
    During part of this time I set auto-pilots off to keep load away from our gate-keeper.  These have now been turned back on.
    3)  2/9: Jobs were failing at several U.S. sites with "transfer timeout" errors.  Issue understood - from Yuri: Most of the jobs succeeded on the second attempt. The output file transfers failed because these are 
    validation tasks (high priority), so they have very low transfer timeouts in Panda, since such tasks could run at T1s. If they go to T2s and the storage or network is busy, then these jobs will fail due to timeout. 
    (Another possibility is to increase these timeouts in Panda.)  https://savannah.cern.ch/bugs/index.php?78000, eLog 21981.
    4)  2/10: File transfer errors between BNL & RAL - " [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]."  Issue was reported as solved (intermittent 
    problem on the wide area network link between RAL and BNL), but later recurred (high load on the  dCache core servers), and the ticket was re-opened.  ggus 67214 in-progress, eLog 21973.
    5)  2/10: HU_ATLAS_Tier2 - job failures with the error "Could not update Panda server, EC = 7168," and eventually "Pilot received a panda server signal to kill job."  Eventually the problem went away.  
    ggus 67237 closed, eLog 21978.  (Note: a similar issue was reported at FZK in the DE cloud - see: https://gus.fzk.de/ws/ticket_info.php?ticket=66783.)
    6)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN state one day after 
    updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    7)  2/12 - 2/14: The ATLAS s/w installation system was off-line from 12/02/2011 16:00 UTC to 14/02/2011 15:00 UTC, due to electrical power maintenance which affected all of the Tier2's in Roma.
    8)  2/13: WISC_DATADISK & _LOCALGROUPDISK file transfer errors.  ggus 67276 was initially (erroneously) opened and directed to BNL, but the problem was actually on the WISC end.  This ticket 
    was closed, and issue followed in ggus 66280.  Site admin (Wen) reported the issue was resolved - all tickets now closed.  eLog 22037.
    9)  2/13: BNL-OSG2_MCDISK to NDGF-T1_MCDISK file transfer errors (source) - "[INTERNAL_ERROR] Source file/user checksum mismatch]."  Not a BNL problem - Stephane pointed out that the file 
    in question was corrupt in all sites (i.e., file was corrupted at the time it was generated).  See details in eLog 22057.  ggus 67251 closed.
    10)  2/14: The new DN:
    /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management
    now being used for DDM Functional Tests.  eLog 22046.  (Also gradually being rolled out in various clouds for site services.)
    11)  2/14: MWT2 postponed maintenance originally scheduled for 2/15 in order to participate in tier-2 testing.
    12)  2/14: AGLT2 maintenance outage (network optimizations, firmware upgrades, OSG upgrades, other tasks).  Work completed in the evening - test jobs successful, production & analysis queues 
    set back on-line.  eLog 22100.
    13)  2/15: BNL-OSG2_DATADISK file transfer failures - "[AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] cannot create archive repository: No space left on device]."  Hiro 
    reported the issue was resolved - ggus 67492 closed, eLog 22105.
    14)  2/15: HU_ATLAS_Tier2 - John reported an overheating problem (glycol failure) that caused some filesystem hosts to shutdown.
    Issue resolved - systems back on-line.  eLog 22111, https://savannah.cern.ch/support/index.php?119248.
    15)  2/15: BU_ATLAS_Tier2o & OU_OCHEP_SWT2 - job failures with the error "SFN not set in LFC for guid (check LFC server version)."  Seems to have been a transient issue - disappeared.  
    ggus  67501 / RT 19464 closed, eLog 22133.
    16)  2/15: WISC_DATADISK - file transfer errors due to a certificate problem (could not map the new 'ddmadmin' DN).  Wen added the entry to the local mapfile - issue resolved.  
    ggus 67495 closed, eLog 22137.
    
    Follow-ups from earlier reports:
    
    (i)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (ii)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  
    Also https://savannah.cern.ch/bugs/index.php?77139.
    1/25: Update from Shawn:
    I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can 
    track the "repair" at http://bourricot.cern.ch/dq2/consistency/
    Let me know if there are further issues.
    Update 1/28: files were declared 'recovered' - Savannah 77036 closed.  (77139 dealt with the same issue.)  ggus 66150 in-progress.
    (iii)  1/19: BNL - a user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on 
    t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro: There is a known issue with users with Israel CA certificates having problems accessing BNL 
    and MWT2. This is being actively investigated right now. Until this gets completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other sites 
    (LOCALGROUPDISK area) for downloading.
    (iv)  1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during 
    TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]."  https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
    (v)  1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value."  Consolidated into a single goc ticket, 
    https://ticket.grid.iu.edu/goc/viewer?id=9871.  Will be resolved in a new OSG release currently being tested in the ITB.
    (vi)  2/6: AGLT2_PRODDISK to BNL-OSG2_MCDISK file transfer errors (source) - " [GENERAL_FAILURE] RQueued]."  ggus 67081 in-progress, eLog 21935.
    
    • Transfer timeout issue with the higher-priority tasks - there was a tighter constraint on the timeout definition. Jobs succeed on the second or third attempt.
    • Maintenance in the software distribution system over the weekend.
    • Some of the carryover issues above are probably resolved.
    • Hiro: the panda monitor is falsely reporting failures after two weeks - when it was a day.
    • Hiro: high-priority tasks would be more likely to succeed if given a larger share in DDM. Can Panda request a higher priority with a different share? Kaushik will look into it.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=128505
    
    1)  2/16: SWT2_CPB - file transfer failures from the site to BNL and other tier-2's.  High load on the storage caused XrootdFS to become unresponsive.  Load gradually came down and the 
    SRM service was restarted.  Monitoring this issue.  RT 19475 / ggus 67576 closed.
    2)  2/17: From Dave at Illinois - Yesterday afternoon, by accident, the MTU on a 10Gb interface was set incorrectly (reset from 9000 back to 1500).  This caused half of the worker 
    nodes at IllinoisHEP to hang when trying to write back to the SE.  I found and fixed the MTU problem this morning, but unfortunately, the jobs that were running on those worker nodes 
    died in the process. So I assume that sometime in the next few hours many production jobs (120 or so) will show up as failing with lost heartbeat.  (See the MTU check sketch after this list.)
    3)  2/17: Job failures at BU_ATLAS_Tier2o with the error "No space left on device."  The issue was with analysis jobs occasionally requesting a large number of input files.  Resolved by 
    setting the panda schedconfigdb parameter 'maxinputsize' from 14 to 7 GB.  ggus 67565 closed, eLog 22381.
    4)  2/18: New pilot release from Paul (SULU 45e) to address an issue with the storage system at RAL.
    5)  2/20: From Alessandra Forti (update to issue with the new ddmadmin cert): New DN without email field has been deployed.  Related ggus tickets closed.
    6)  2/20: File transfer failures from SLACXRD_PRODDISK to BNL-OSG2_DATADISK with source errors "failed to contact on remote SRM."  From Wei - One of the data servers 
    went down. This is fixed.  ggus 67690 closed, eLog 22283.  Later, 2/21, from Wei: We may need to take a short power outage on Tuesday to reset the host's AC.
    7) 2/21: From Rob at MWT2 - We have had an AC unit fail this morning and as a result have had to shut down a number of worker nodes (storage and head node services are unaffected). 
    There will be production and some analysis job failures as a result.  Later
    that day: AC has been restored; bringing nodes back online.  eLog 22303/08/17.
    8)  2/23: WISC file transfer failures with SRM errors like "DESTINATION + srmls is failing with Connection reset."  Issue understood - from Wen: It's fixed now. Just now one cron job 
    failed to update the grid CAs, which caused all grid certificate authentications to fail.  ggus 67836 can probably be closed at this point.  eLog 22379.
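    Regarding the MTU mismatch in item 2: a minimal sketch (Python, for illustration only; the 9000-byte jumbo-frame target comes from the report above, the rest is assumed) of a per-node check that would catch an interface silently falling back to 1500:
      import glob
      import os

      EXPECTED_MTU = 9000  # jumbo frames, as configured on the 10Gb interfaces above

      # Walk the kernel's view of every network interface and flag any MTU
      # that has fallen back to a smaller value (e.g. 1500).
      for path in glob.glob("/sys/class/net/*/mtu"):
          iface = os.path.basename(os.path.dirname(path))
          with open(path) as f:
              mtu = int(f.read().strip())
          if iface != "lo" and mtu < EXPECTED_MTU:
              print("WARNING: %s has MTU %d (expected %d)" % (iface, mtu, EXPECTED_MTU))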
    
    Follow-ups from earlier reports:
    
    (i)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (ii)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  
    Also https://savannah.cern.ch/bugs/index.php?77139.
    1/25: Update from Shawn:
    I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you 
    can track the "repair" at http://bourricot.cern.ch/dq2/consistency/  Let me know if there are further issues.
    Update 1/28: files were declared 'recovered' - Savannah 77036 closed.  (77139 dealt with the same issue.)  ggus 66150 in-progress.
    Update 2/20: Last failed transfers reported have completed successfully.  ggus 66150 closed.
    (iii)  1/19: BNL - a user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on 
    t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro: There is a known issue with users with Israel CA certificates having problems accessing BNL and 
    MWT2. This is being actively investigated right now. Until this gets completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other sites 
    (LOCALGROUPDISK area) for downloading.
    (iv)  1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during 
    TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]."  https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
    Update 2/23:  This issue is apparently resolved.  Related/similar problems that were being tracked in other tickets now closed.
    (v)  1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value."  Consolidated into a single goc ticket, 
    https://ticket.grid.iu.edu/goc/viewer?id=9871.  Will be resolved in a new OSG release currently being tested in the ITB.
    (vi)  2/6: AGLT2_PRODDISK to BNL-OSG2_MCDISK file transfer errors (source) - " [GENERAL_FAILURE] RQueued]."  ggus 67081 in-progress, eLog 21935.
    Update 2/20: Failed transfers have now completed.  ggus 67081 closed.
    (vii)  2/10: File transfer errors between BNL & RAL - " [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]."  Issue was reported as 
    solved (intermittent problem on the wide area network link between RAL and BNL), but later recurred (high load on the  dCache core servers), and the ticket was re-opened.  
    ggus 67214 in-progress, eLog 21973.
    Update 2/23: Issue considered to be resolved (no recent errors of this type).  ggus 67214 closed.
    (viii)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN state 
    one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    
    • Michael: HI (heavy ion) recon jobs failing at BNL - the report was "out of memory". Investigated - traced to the stack size (ulimit) setting. If this is set to unlimited, it leaves roughly 400 MB less memory available to the job. Consulting experts for guidance on setting this parameter. (Untuned, it is 8 MB for SL5.) This is a general issue - not only for the T1.
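    • For reference on the parameter in question: a minimal sketch (Python, for illustration only; capping the limit at the 8 MB SL5 default rather than 'unlimited' is an assumption about the eventual tuning, not the agreed fix) of inspecting and adjusting the stack-size limit that 'ulimit -s' controls:
      import resource

      # Inspect the current stack-size limit (the same limit "ulimit -s" reports).
      soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

      def fmt(value):
          if value == resource.RLIM_INFINITY:
              return "unlimited"
          return "%d kB" % (value // 1024)

      print("stack soft limit: %s, hard limit: %s" % (fmt(soft), fmt(hard)))

      # Hypothetical cap: drop the soft limit from 'unlimited' to the untuned
      # 8 MB SL5 default, so the job does not lose the ~400 MB discussed above.
      eight_mb = 8 * 1024 * 1024
      if soft == resource.RLIM_INFINITY or soft > eight_mb:
          resource.setrlimit(resource.RLIMIT_STACK, (eight_mb, hard))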

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Last week's meeting was skipped - the perfSONAR performance matrix was sent around. Sites are requested to please follow up.
    • LHCOPN meeting tomorrow in Lyon - a need for better monitoring; Jason will send summary notes.
    • DYNES - there will be a phased deployment: first the PI and co-PI sites, then 10 sites at a time, etc. There was a meeting and an announcement at Joint Techs last week; the goal is to deploy all sites in the instrument by the end of the year. There may be a separate call for additional participants. Everyone who applied has been provisionally accepted.
  • this week:
    • Jason: the OPN meeting discussed adding new sites to the network; a working group will follow soon. The approach relies on open exchange points rather than the MONARC model. What does it mean to connect? Still working on the Nagios monitor.

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Development release in dq2 for the physical path.
  • RPMs from OSG - adler32 bug fixed; will work on testing the re-installation.
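  • For context on the checksum involved: a minimal sketch (Python, for illustration only; the chunk size and command-line usage are arbitrary choices, not the OSG tool's implementation) of computing an adler32 file checksum the way storage tools typically do, in fixed-size chunks:
    import sys
    import zlib

    def adler32_of(path, blocksize=1024 * 1024):
        """Compute the adler32 checksum of a file, reading it in chunks."""
        value = 1  # adler32 starting value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(blocksize)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF  # force an unsigned 32-bit result

    if __name__ == "__main__":
        print("%08x" % adler32_of(sys.argv[1]))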
this week:

Site news and issues (all sites)

  • T1:
    • last week(s): New version of the OSG software stack - a faulty RSV probe affected availability. Storage expansion of ~2 PB almost coming online. BNL now has 72 fibers, with north and south shore routing into NYC. An additional 10G circuit will be coming online in March. 100G to BNL - equipment has been installed (optical switches) - mostly alpha equipment, so caution; it will be part of the ESnet testbed, maybe late next year. Late 2013/14 for production.
    • this week: Two major issues - more for the future: 1) A cloud computing initiative at BNL: elastically expand the computing capacity installed at BNL by dynamically adding cloud resources. Configure a worker node and make it available operationally - put it in the cloud. Working with Magellan, and anticipate Amazon. Taking shape; time horizon 1-2 months. 2) Expand to a grid site on the cloud - gradually adding functionality. All will be done as R&D activities with ADC. 3) CVMFS - high on our list: set up a replica server at BNL, synching from CERN; more testing needed, involving the firewall. 4) Deploying autopyfactory. There was a missing job wrapper - now provided by Jose. 5) Another R&D area: Alexei and Maxim invited to work on a NoSQL DB evaluation (Cassandra). Completed the install of 3 powerful nodes to be used for benchmarks and evaluation.

  • AGLT2:
    • last week: Took a downtime on Monday and took care of a number of issues: dCache 1.9.10-4 on the head nodes; multiple spanning tree running, cleanly talking with Cisco now; SSDs for the dCache heads delayed; PERC card firmware updates complete. MSU - no spanning tree there, to eliminate a loop.
    • this week: All is well. One dCache pool server is acting up with a NIC problem.

  • NET2:
    • last week(s): HU panda queues were down for a couple of hours due to an AC problem in the room which houses the fileservers. perfSONAR tuned up. Ordered a 10G switch to run full analysis jobs at HU; analysis capacity can't ramp up until it is installed.
    • this week: Improvements for the upcoming run - ramp up I/O capacity to above 1 GB/s; internal rearrangements. Will be ramping up HU analysis. Anticipate requiring a second 10G link. Looking at merging two large GPFS volumes. Multiple nodes for the lsm mover to HU, multiple nodes for gridftp; evaluating ClusterNFS. Gatekeeper - will be doubling its capacity in CPU and memory. Low-level issues - WLCG reporting verification; a pcache-related problem at BU.

  • MWT2:
    • last week(s): Connectivity/network issue with new R410s at UC resolved (static routes), nodes working fine. Postponed downtime so as to participate in "Big Tier2" testing by Rod - jobs run from CA and DE clouds (but at low levels). Next downtime will involve a dCache update. Otherwise all is fine.
    • this week: Running smoothly - doing mostly cross-cloud production. Want to make sure the performance and contribution are associated with the US cloud - will consult with Valeri. Panglia needs to be checked.

  • SWT2 (UTA):
    • last week: Quiet in the past week. Replaced the installations on the perfSONAR hosts and updated them with the latest patches. Production running smoothly. Will be getting back to federated xrootd.
    • this week: All is well. A storage load issue caused SRM failures; the load resolved after a restart.

  • SWT2 (OU):
    • last week: Waiting for Dell to install the extra nodes: 18 dual quad-core x5620's with 32 GB RAM.
    • this week: March 20 for Dell install.

  • WT2:
    • last week(s): All is running well with a light load at SLAC. We handed over a Proof cluster for Tier 3 users to run test jobs. We are discussing two power outages in April and one in May (and three more in April that are not supposed to affect us).
    • this week: A storage node developed a problem; power cycles and resets didn't help. Moving data off it, which is triggering some DDM errors.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases on sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe, and get notified of release installation & validation activity at their site.

  • last report(s)
    • IU and BU have now migrated.
    • 3 sites left: WT2, SWT2-UTA, HU
    • Waiting on confirmation from Alessandro; have requested completion by March 1.
  • this meeting:
    • Focusing on WT2 - there is a proxy issue.
    • No new jobs yet at SWT2 and HU - jobs are timing out, not running.
    • There is also Tufts (BDII publishing).

AOB

  • last week
  • this week


-- RobertGardner - 22 Feb 2011
