
MinutesMar2

Introduction

Minutes of the Facilities Integration Program meeting, March 2, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Fred, Aaron, Shawn, Charles, Rob, Karthik, Michael, Dave, Saul, AK, Jason, John DeStefano, Torre, Booker, Tomasz, Sarah, Patrick, Horst, Wei, Bob, Tom, Mark, Alden, Armen, Hiro
  • Apologies: Kaushik

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

Tier 3 References:
  • The links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

last week(s):

  • xrootd rpm under test, needs work.
  • Arizona coming online.
  • Rik is migrating out of Tier 3 management to analysis support, but will stay closely involved since T3 and analysis are closely related.
this week:

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=128505
    
    1)  2/16: SWT2_CPB - file transfer failures from the site to BNL/other Tier 2's.  High load on the storage caused XrootdFS to become unresponsive.  Load gradually came down and the 
    SRM service was restarted.  Monitoring this issue.  RT 19475 / ggus 67576 closed.
    2)  2/17: From Dave at Illinois - Yesterday afternoon the MTU on a 10Gb interface was accidentally set incorrectly (reset from 9000 back to 1500).  This caused half of the worker 
    nodes at IllinoisHEP to hang when trying to write back to the SE.  I found and fixed the MTU problem this morning, but unfortunately the jobs that were running on those worker nodes 
    died in the process, so over the next few hours many production jobs (120 or so) will show up as failing with lost heartbeat.  (A quick MTU sanity-check sketch follows this list.)
    3)  2/17: Job failures at BU_ATLAS_Tier2o with the error "No space left on device."  The issue was analysis jobs occasionally requesting a large number of input files.  Resolved by 
    setting the panda schedconfigdb parameter 'maxinputsize' from 14 to 7 GB.  ggus 67565 closed, eLog 22381.
    4)  2/18: New pilot release from Paul (SULU 45e) to address an issue with the storage system at RAL.
    5)  2/20: From Alessandra Forti (update to issue with the new ddmadmin cert): New DN without email field has been deployed.  Related ggus tickets closed.
    6)  2/20: File transfer failures from SLACXRD_PRODDISK to BNL-OSG2_DATADISK with source errors "failed to contact on remote SRM."  From Wei - One of the data servers 
    went down; this is fixed.  ggus 67690 closed, eLog 22283.  Later, 2/21, from Wei: We may need to take a short power outage on Tuesday to reset the host's AC.
    7) 2/21: From Rob at MWT2 - We have had an AC unit fail this morning and as a result have had to shut down a number of worker nodes (storage and head node services are unaffected). 
    There will be production and some analysis job failures as a result.  Later
    that day: AC has been restored; bringing nodes back online.  eLog 22303/08/17.
    8)  2/23: WISC file transfer failures with SRM "Connection reset" errors on the destination (srmls also failing).  Issue understood - from Wen: It's fixed now.  A cron job 
    failed to update the grid CAs, which meant that grid certificates could not be authenticated.  ggus 67836 can probably be closed at this point.  eLog 22379.
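    Regarding item 2 above, a minimal sketch (not site code) of the kind of MTU sanity check that would catch this sort of drift on worker nodes; the interface name and expected jumbo-frame value are assumptions for illustration:
    
    #!/usr/bin/env python
    # Minimal sketch: warn if an interface's MTU has drifted from the expected
    # jumbo-frame setting (item 2 above: a 10Gb interface reset from 9000 to 1500).
    # The interface name and expected value below are illustrative assumptions.
    import sys
    
    EXPECTED = {"eth2": 9000}   # hypothetical: interface -> expected MTU
    
    def current_mtu(iface):
        """Read the live MTU from sysfs (Linux only)."""
        with open("/sys/class/net/%s/mtu" % iface) as f:
            return int(f.read().strip())
    
    def main():
        bad = []
        for iface, want in EXPECTED.items():
            try:
                have = current_mtu(iface)
            except IOError:
                print("WARN: cannot read MTU for %s" % iface)
                continue
            if have != want:
                bad.append((iface, have, want))
        for iface, have, want in bad:
            print("ALERT: %s MTU is %d, expected %d" % (iface, have, want))
        sys.exit(1 if bad else 0)
    
    if __name__ == "__main__":
        main()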
    
    Follow-ups from earlier reports:
    
    (i)  1/9: AGLT2 - low level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (ii)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  
    Also https://savannah.cern.ch/bugs/index.php?77139.
    1/25: Update from Shawn:
    I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you 
    can track the "repair" at http://bourricot.cern.ch/dq2/consistency/  Let me know if there are further issues.
    Update 1/28: files were declared 'recovered' - Savannah 77036 closed.  (77139 dealt with the same issue.)  ggus 66150 in-progress.
    Update 2/20: Last failed transfers reported have completed successfully.  ggus 66150 closed.
    (iii)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on 
    t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro: There is a known issue with the Israel CA causing problems for users accessing BNL and 
    MWT2.  This is being actively investigated.  Until it is completely resolved, users are advised to submit a DaTRI request to transfer their datasets to some other sites 
    (LOCALGROUPDISK area) and download from there.
    (iv)  1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during 
    TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]."  https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
    Update 2/23:  This issue is apparently resolved.  Related/similar problems that were being tracked in other tickets now closed.
    (v)  1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value."  Consolidated into a single goc ticket, 
    https://ticket.grid.iu.edu/goc/viewer?id=9871.  Will be resolved in a new OSG release currently being tested in the ITB.
    (vi)  2/6: AGLT2_PRODDISK to BNL-OSG2_MCDISK file transfer errors (source) - " [GENERAL_FAILURE] RQueued]."  ggus 67081 in-progress, eLog 21935.
    Update 2/20: Failed transfers have now completed.  ggus 67081 closed.
    (vii)  2/10: File transfer errors between BNL & RAL - " [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]."  Issue was reported as 
    solved (intermittent problem on the wide area network link between RAL and BNL), but later recurred (high load on the  dCache core servers), and the ticket was re-opened.  
    ggus 67214 in-progress, eLog 21973.
    Update 2/23: Issue considered to be resolved (no recent errors of this type).  ggus 67214 closed.
    (viii)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN state 
    one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    
    • Michael: HI recon jobs failing at BNL - reported as "out of memory".  Investigation traced it to the stack size (ulimit) setting: if this is set to unlimited, roughly 400 MB less memory is available to the job.  Consulting experts for guidance on setting this parameter (un-tuned, it is 8 MB on SL5).  This is a general issue, not only for the T1.  (A sketch of reporting and capping the stack limit follows below.)
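    Following up on the stack-size note above, a minimal sketch (not the production job wrapper) of reporting the inherited stack soft limit and capping it; the 8 MB cap mirrors the SL5 default mentioned above, and whether to cap at all is exactly the question being put to the experts:
    
    # Minimal sketch: report the stack soft limit a job inherits and optionally cap
    # it at the SL5 default of 8 MB noted above. The cap value, and whether to cap
    # at all, are assumptions pending the expert guidance mentioned in the minutes.
    import resource
    
    EIGHT_MB = 8 * 1024 * 1024
    
    def report_and_cap_stack(cap=EIGHT_MB):
        soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
        print("stack rlimit: soft=%s hard=%s" % (fmt(soft), fmt(hard)))
        if soft == resource.RLIM_INFINITY or soft > cap:
            # Lower only the soft limit; child processes can still raise it
            # back up to the hard limit if they need to.
            resource.setrlimit(resource.RLIMIT_STACK, (cap, hard))
            print("stack soft limit capped at %d bytes" % cap)
    
    if __name__ == "__main__":
        report_and_cap_stack()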
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=129380
    
    1)  2/24: MWT2_UC - job failures with "lsm-get failed: time out after 5400 seconds" errors.  From Aaron: We performed a dCache upgrade yesterday, 3/1, 
    which has improved our stability at the moment. This can probably be closed, as new tickets will be opened if new failures occur.  ggus 67887 in-progress 
    (and will be closed), eLog 22425.
    2)  2/25: From Rob at MWT2_UC: Overnight there were dCache failures at MWT2. Experts are investigating.  eLog 22437.  Later that day, from Sarah: We've passed some 
    mass-transfer tests and completed test jobs.  We're turning the FTS channels back on and will continue to monitor the situation.
    3)  2/25: UTD-HEP set off-line due to an A/C compressor problem.  eLog 22454.
    4)  2/26 - 2/27: shifters reported some problems with the panda monitor (certain pages throwing errors, etc.).  Issue eventually went away - Valeri reported the problem 
    was fixed.  https://savannah.cern.ch/bugs/index.php?78770.  Also see https://savannah.cern.ch/bugs/index.php?78780 - voatlas20 was down for a period of time.  eLog 22530.
    5)  2/28: New pilot release from Paul (SULU 46a).  See details here: http://www-hep.uta.edu/~sosebee/ADCoS/New-pilot-version-SULU-46a.html
    6)  2/28: Xin noticed a large backlog of stale panda pilots for several U.S. sites.  They were cleaned out.
    7)  3/1:  MWT2_UC maintenance outage (update dCache, perform local network tests).  Work completed, queues back on-line as of ~3:15 p.m. CST.  eLog 22578.
    8)  3/2 a.m.: MWT2_UC job failures with errors like:
    "Error details: pilot: Get error: Failed to get PoolFileCatalog|Log put error: Could not figure out destination path from dst_se (guid=3920b517-03bb-4ae6-8ddf-d7c298a79a96 
    lfn=log.261508._039922.job.log.tgz.20): list index out of range."  Apparently a problem with the new pilot release (#5 above); a fix is being prepared.  ggus 68156 in-progress, eLog 22594.  (An illustrative sketch of this failure mode follows this item.)
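    For item 8 above, an illustrative sketch only (not the pilot's code) of how a destination-path parse can die with "list index out of range" when the dst_se string lacks an expected field, plus a defensive variant that fails with a clearer message; the dst_se format shown is a hypothetical example:
    
    # Illustrative only - not the pilot's parser. A naive
    #   dst_se.split("?SFN=")[1]
    # raises IndexError ("list index out of range") when the separator is missing;
    # the guarded version below reports the bad input instead.
    def destination_path(dst_se):
        parts = dst_se.split("?SFN=")
        if len(parts) < 2 or not parts[1]:
            raise ValueError("cannot determine destination path from dst_se=%r" % dst_se)
        return parts[1]
    
    if __name__ == "__main__":
        ok = "token:ATLASDATADISK:srm://host:8443/srm/managerv2?SFN=/pnfs/site/atlasdatadisk/file.root"
        print(destination_path(ok))
        try:
            destination_path("token:ATLASDATADISK:srm://host:8443/srm/managerv2")
        except ValueError as err:
            print("handled: %s" % err)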
    
    Follow-ups from earlier reports:
    
    (i)  1/9: AGLT2 - low level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (ii)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on 
    t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue with the Israel CA causing problems for users accessing BNL and MWT2.  This is being actively investigated.  Until it is completely resolved, users are 
    advised to submit a DaTRI request to transfer their datasets to some other sites (LOCALGROUPDISK area) and download from there.
    (iii)  1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value."  Consolidated into a single goc ticket, 
    https://ticket.grid.iu.edu/goc/viewer?id=9871.  Will be resolved in a new OSG release currently being tested in the ITB.
    (iv)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN 
    state one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    (v)  2/23: WISC file transfer failures with SRM "Connection reset" errors on the destination (srmls also failing).  Issue understood - from Wen: It's fixed now.  A cron job 
    failed to update the grid CAs, which meant that grid certificates could not be authenticated.  ggus 67836 can probably be closed at this point.  eLog 22379.
    Update 2/25: Issue is resolved, ggus 67386 closed.  eLog 22402.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Jason: The OPN meeting discussed adding new sites to the network. A WG will be released soon. Rely on open exchange points rather than the MONARC model. What does it mean to connect? Still working on the Nagios monitor.
  • this week:
    • Had a meeting yesterday - see email for notes.
    • Good news - perfsonar plots for throughput and latency nearly green.
    • OU, BNL, Illinois issues addressed
    • The MWT2_IU to AGLT2 path has a unique component that is slowing things down, and it's asymmetric. This is the only issue we've seen so far.
    • Action item: all T2's to get another load test in ASAP. Sites should contact Hiro and monitor the results; it is an hour-long test.
    • More problems in the network likely with the new ATLAS computing model - could our monitoring system be more broadly adopted in ATLAS? Encourage new sites to adopt a perfsonar infrastructure.
    • Will ATLAS do something globally? Part of LHCONE, for example.
    • Timeframe for 10G monitoring. Testing with a server at UM - dual integrated 10G NICs. Probably with the next hardware purchase. Can a single box run both roles (throughput and latency)?

Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
this week:
  • Running tests with the current architecture. At MWT2, now using xrootd as a front end to dCache - more stable than dcap (we do see a performance hit for wide-area access, requiring tuning in the xrootd client).
  • Working on re-architecting with LFC. Removal of DQ2 timestamps - working with Hiro and Simone. Will need to test this against storage.
  • Renormalization of paths at MWT2 for the global namespace (see the sketch after this list).
  • Working with sites in Europe with DPM backend.
  • Will talk on this Tuesday morning.
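  The path "renormalization" mentioned above amounts to rewriting site-local storage paths into the common global namespace exported by the federation. A minimal sketch of that rewrite, with both prefixes purely hypothetical (the real mapping is site-specific and lives in the site's xrootd/LFC configuration, not in a script like this):
  
  # Minimal sketch, all paths hypothetical: rewrite a site-local storage path
  # into a common global-namespace path for the federation.
  LOCAL_PREFIX = "/pnfs/site.example.edu/atlasdatadisk"   # hypothetical local root
  GLOBAL_PREFIX = "/atlas/dq2"                            # hypothetical global root
  
  def to_global(local_path):
      if not local_path.startswith(LOCAL_PREFIX):
          raise ValueError("path %r is outside the exported area" % local_path)
      return GLOBAL_PREFIX + local_path[len(LOCAL_PREFIX):]
  
  if __name__ == "__main__":
      print(to_global(LOCAL_PREFIX + "/mc10_7TeV/ESD/some.dataset/file.root.1"))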

Site news and issues (all sites)

  • T1:
    • last week(s): Two major issues - more for the future: 1) cloud computing initiative at BNL; elastically expand the computing capacity installed at BNL by dynamically adding cloud resources. Configure a worker node and make it available operationally - put it in the cloud. Working with Magellan, and anticipate Amazon. Taking shape; time horizon 1-2 months. 2) Expand to a grid site on the cloud, gradually adding functionality. All will be done within R&D activities with ADC. 3) CVMFS - high on our list: set up a replica server at BNL, synching from CERN; more testing requires firewall work. 4) Deploying auto-py factory. There was a missing job wrapper - now provided by Jose. 5) Another R&D area: Alexei and Maxim invited to work on a NoSQL DB evaluation (Cassandra). Completed install of 3 powerful nodes to be used for benchmarks and evaluation.
    • this week: BNL has its own PRODDISK area now. Deployed about 2PB of disk, in production. Will need to remove some of the storage.

  • AGLT2:
    • last week: All is well. One dcache pool server acting up w/ a NIC problem.
    • this week: All is working well. Have had some checksum failures - chasing this down. Users are attempting to get files that were once here but are no longer present. Are user jobs unknowingly removing files under the usatlas1 account? Looking at options to trap the remove command and log these calls. Want to get the lsm installed here to instrument I/O.

  • NET2:
    • last week(s): Improvements for the upcoming run - ramp up I/O capacity to above 1 GB/s; internal rearrangements. Will be ramping up HU analysis. Anticipate requiring a second 10G link. Looking at merging two large GPFS volumes. Multiple nodes for the lsm mover to HU, multiple nodes for gridftp; evaluating ClusterNFS. Gatekeeper - will be doubling its capacity in CPU and memory. Low-level issues - WLCG reporting verification; a pcache-related problem at BU.
    • this week: Tier 3 hardware is on the way; ordering a new rack of worker nodes (looking at the R410). Working to get additional 10G links, maybe even a 40G link. DYNES application approved! Since someone was asking last time: debris from MCDISK.

  • MWT2:
    • last week(s): Running smoothly - doing mostly cross-cloud production. Want to make sure performance and contribution are associated with the US cloud - will consult with Valeri. Panglia needs to be checked.
    • this week: Downtime yesterday - dCache upgraded to 1.9.5-24. Evaluating CVMFS at MWT2_IU. Migrated monitoring services (Cacti, Ganglia, etc.) onto a new machine using KVM. Finishing final plans for the new server room at UC - adding an additional 30-ton CRAC unit; some construction already complete - raised floor, cooling infrastructure, new transformer and UPS delivered. At IU we'll have to take a downtime to re-arrange the server room; no exact date yet, but will announce. Hiro notes that there were some additional subscriptions made over the weekend - these could have caused the lockup.

  • SWT2 (UTA):
    • last week: All is well. A storage load issue caused SRM failures; the load resolved after a restart.
    • this week: The grid monitor has been getting lost, causing load issues - a cron job was not running correctly and was not deleting gass-cache files (a cleanup sketch follows below). Maintenance yesterday at SWT2_UTA. Network connectivity into the analysis cluster is currently over 1G links; working with the networking folks to get a 10G switch. Will be looking to update OSG, hopefully the new OSG release will be out soon. Periodic failures in SAM testing - probably one of the storage nodes is getting too busy.
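    A sketch of the kind of cleanup the broken cron job was meant to perform: remove gass-cache files older than a cutoff. The directory and age below are hypothetical assumptions, and a real cleanup should follow the site's own policy (the default here is a dry run).
    
    # Minimal sketch of a periodic gass-cache cleanup. Directory and age are
    # hypothetical assumptions; dry_run defaults to True for safety.
    import os, time
    
    CACHE_DIR = "/home/osg/.globus/.gass_cache"   # hypothetical location
    MAX_AGE_DAYS = 7
    
    def clean(cache_dir=CACHE_DIR, max_age_days=MAX_AGE_DAYS, dry_run=True):
        cutoff = time.time() - max_age_days * 86400
        for root, _dirs, files in os.walk(cache_dir):
            for name in files:
                path = os.path.join(root, name)
                try:
                    if os.path.getmtime(path) < cutoff:
                        print("%s %s" % ("would remove" if dry_run else "removing", path))
                        if not dry_run:
                            os.remove(path)
                except OSError:
                    pass   # file vanished or is unreadable; skip it
    
    if __name__ == "__main__":
        clean()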

  • SWT2 (OU):
    • last week: March 20 for Dell install.
    • this week: Hiro's throughput test showing 400 MB/s.

  • WT2:
    • last week(s): Storage node developed a problem... power cycles and resets didn't help. Moving data off - triggering some DDM errors.
    • this week: Last week's problem with a Dell storage machine - replaced the CPU and memory, though the machine has not been stressed since the repair. Planning 3 major outages, each lasting a day or two: March, April, early May. Will be setting final dates soon.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.  Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report(s)
    • IU and BU have now migrated.
    • 3 sites left: WT2, SWT2-UTA, HU
    • Waiting on confirmation from Alessandro; have requested completion by March 1.
    • Focusing on WT2 - there is a proxy issue
    • No new jobs yet sent to SWT2, HU - jobs are timing out, not running.
    • There is also Tufts. BDII publishing.
  • this meeting:
    • One of the problems at SLAC is lack of outbound links, and the new procedure will probably use gridftp. Discussing options with them.

AOB

  • last week
  • this week
    • No meeting again next week


-- RobertGardner - 01 Mar 2011
