
MinutesDec14

Introduction

Minutes of the Facilities Integration Program meeting, December 14, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (6 to mute) - announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Nate, Dave, Patrick, Rob, Saul, Bob, Horst, Wei, Shawn, John, Armen, Mark, Fred, Wensheng, Hiro, Tom, Kaushik, Tomaz, Xin
  • Apologies: Michael, Jason

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Friday (1pm CDT, bi-weekly - convened by Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • New integration phase - work in progress
      • Networking issues:
        • Slow transfers from our Tier 2s and Tier 1s (trans-Atlantic and Canada). We need to get into the habit of asking network experts for help, as well as relying on our own tools. ESnet can help.
        • We had problems getting CERN's attention for Calibration data to AGLT2.
        • Perfsonar was key to resolving this issue.
        • Tier 1 service coordination meeting is a forum to raise issues.
      • Federated data storage meeting last week in Lyon
      • CVMFS deployment at sites
      • Tier2D networking
      • Use of Vidyo - lack of Ubuntu support; worries about room echoes and efficiency.
      • LHC performance & outlook
        • record peak luminosity for heavy ions
        • cryo problem, but the machine is recovering
        • Good data taking efficiency - 94%
        • December 7 was the last day of data taking; the technical stop lasts until the second week of February, with data taking restarting in April.
      • More about Tier2D connectivity - the issue is a firewall at the EU Tier 1s affecting Tier 2 traffic. It is being discussed at the daily WLCG meeting. We'll need to shut off clouds with poor connections.
      • LHCONE testing ADC-wide; a small number of sites should be evaluated. Shawn will set up a separate perfSONAR mesh for this.
      • Procurement - Interlagos - C6145 with 128 cores - they have received one at BNL. Shuwei has been testing it. The HS06 benchmark came in at about 1000. It did not perform as expected for ATLAS jobs: 2.5 hours for 24 reco jobs versus 50 minutes on a Westmere. Seeing this for all processing steps. Getting in touch with Dell and AMD. (Interestingly there is no difference for a single job.) See the worked numbers after this list.
      • OSG pacman versus native packaging - the goal is to remove ATLAS-specific components from the worker node client (i.e., the LFC Python interfaces) and provide them via CVMFS.
      • The OSG proposal review took place last week; outcome to follow.
    • this week
      • Winter shutdown has begun - no pp beams again until April
      • Review status of CVMFS deployment at sites (see the probe sketch after this list):
        • BNL DONE
        • AGLT2 DONE
        • MWT2 DONE
        • MWT2_IllinoisHEP DONE
        • NET2_BU: problem with nodes that have small local disks; will follow HU.
        • NET2_HU: in progress, deployed. Need to get jobs using it; will depend a bit on the admin's availability.
        • SWT2_UTA: in progress for the prod-only cluster; could finish by end of year. For the prod-analysis cluster, will defer to next year - not enough time.
        • SWT2_OU: everything is ready, but deferring to next year until new head nodes arrive for squid. Wants to wait until the beginning of next year to do the OSCER cluster at the same time. Worried about making changes during the break.
        • WT2: it is deployed; working with a new CE as well. Validation jobs are failing, as are install jobs. dq2 client, gcc, and cctools are not being installed. Log files not visible.
      • GEANT4 production: communication with Gabriele this week - still gathering some of the details. When available, will test at MWT2 and update SupportingGEANT4.
      • OSG All Hands meeting: March 19-23, 2012 (University of Nebraska at Lincoln). Program being discussed. As last year, part of the meeting will include a co-located US ATLAS session and a joint USCMS/OSG session.
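
For the Interlagos comparison above, a quick back-of-the-envelope calculation (a sketch only; it assumes the 24 reconstruction jobs ran concurrently on the C6145, which is what the quoted wall times imply):

    # Per-job wall-time comparison using the numbers quoted in the minutes.
    interlagos_hours_for_24_jobs = 2.5    # C6145 (Interlagos, 128 cores): 24 concurrent reco jobs
    westmere_minutes_per_job = 50.0       # Westmere: one reco job

    # With the jobs running concurrently, the per-job wall time on the C6145 is the full 2.5 h.
    interlagos_minutes_per_job = interlagos_hours_for_24_jobs * 60.0
    slowdown = interlagos_minutes_per_job / westmere_minutes_per_job

    print(f"Interlagos: {interlagos_minutes_per_job:.0f} min/job, "
          f"Westmere: {westmere_minutes_per_job:.0f} min/job, "
          f"slowdown ~{slowdown:.1f}x")   # roughly 3x slower per reco job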
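
As a quick way to verify the CVMFS deployments listed above on a worker node, something like the following could be used (a minimal sketch; /cvmfs/atlas.cern.ch is the standard ATLAS repository mount point, and the deeper probe is only attempted if the CVMFS client tools are installed):

    import os
    import shutil
    import subprocess

    REPO = "/cvmfs/atlas.cern.ch"         # standard ATLAS CVMFS repository

    def cvmfs_mounted(repo=REPO):
        """True if the repository is mounted, readable, and non-empty."""
        try:
            # Listing the directory also triggers an autofs mount if needed.
            return len(os.listdir(repo)) > 0
        except OSError:
            return False

    if __name__ == "__main__":
        if cvmfs_mounted():
            print("CVMFS repository available:", REPO)
            # Deeper client-side check, if the cvmfs tools are on the node.
            if shutil.which("cvmfs_config"):
                subprocess.call(["cvmfs_config", "probe"])
        else:
            print("CVMFS repository NOT available:", REPO)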

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Alden reporting - all has been going smoothly.
    • Have solid backlog of production available
    • Suggests minimal changes
  • this week:
    • Difficulties with Panda monitoring - it is being addressed.
    • Production and analysis are chugging along.
    • Is there a problem with the BNL_CVMFS test site? It is taking a large number of jobs, causing problems for Panda. Condor issue at BNL. Need to follow up with Xin or someone at BNL. Likely an ongoing issue - requires manual reassignments. Hiro will investigate the issue at BNL.

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • All basically okay, see: MinutesDataManageNov29
    • A few problems with areas at various sites leading to blacklistings, but these have been addressed.
    • Deletion service is going okay.
    • USERDISK will be cleaned up next week by Hiro
    • See discussions of Rucio
  • this week:
    • Generally the storage is okay.
    • Deletion errors - more than 10 within 4 hours creates a GGUS ticket. Most think this is not worth a ticket (see the threshold sketch after this list).
      • Types of errors: LFC permissions issue errors, usually associated with USERDISK at AGLT2 and NET2. May need help from Hiro and Shawn. Sometimes these seem to get resolved without intervention. Are there remnants left in the LFC? Armen will send a list to Shawn.
      • OU has problems, also due to Bestman1. Same as UTD.
      • Deletion service is not getting a callback from SRM - getting timeouts instead. These were at Bestman1 failures.
      • Wisconsin failures - because they deleted files locally, the service isn't finding them. They then got blacklisted.
    • Storage reporting not working at BU.
    • DATADISK reporting at SLAC is incorrectly low.
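
The alarm criterion mentioned above (more than 10 deletion errors within 4 hours raises a GGUS ticket) is just a sliding-window count. A minimal sketch, where only the threshold and window come from the discussion and everything else is illustrative:

    from datetime import datetime, timedelta

    # Shift alarm rule discussed above: >10 deletion errors within 4 hours -> GGUS ticket.
    THRESHOLD = 10
    WINDOW = timedelta(hours=4)

    def needs_ticket(error_times, now=None, threshold=THRESHOLD, window=WINDOW):
        """error_times: datetimes of deletion errors for one site/endpoint."""
        now = now or datetime.utcnow()
        recent = [t for t in error_times if now - t <= window]
        return len(recent) > threshold

    # Example: 12 errors spread over the last two hours would trigger a ticket.
    now = datetime.utcnow()
    errors = [now - timedelta(minutes=10 * i) for i in range(12)]
    print(needs_ticket(errors, now))      # True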

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=164763
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-12_5_2011.html
    
    1)  12/1: Functional test transfers to SLACXRD_DATADISK were failing with timeout errors.  From Wei: We are collecting info from various US sites to debug trans-atlantic 
    network between FR/DE and US.  Based on the performance data we've collected, it is probably not surprising to see timeouts.  ggus 76938 closed, eLog 31978.
    2)  12/1: From Hiro: BNL had problems with http proxy and a few other things.   This prevented any outbound http request, resulting in failed jobs.   It has been rectified.  
    Jobs are running normally now.  (Among other things this affected pandamover, so some sites were draining for a period of time due to lack of activated jobs.)
    3)  12/1: From Sarah - MWT2 queues migrated from pbs to condor jobmanager.
    4)  12/3: NET2 - file transfer errors ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]").  Jobs also failing with stage-in errors.  Site reported that 
    the SRM service was hung. It was restarted - fixed the problem.  ggus 77015 closed, eLog 32081.
    5)  12/3: Savannah tickets 89538 and 89539 were opened for SWT2_CPB and WISC, respectively, related to DDM deletion service errors.  It was pointed out that the 
    syntax in the deletion commands contained malformed SURL's, resulting in the failures.  See more details in:
    https://savannah.cern.ch/bugs/?89538.  Issue closed.  eLog 32224/32076. 
    6)  12/6: Some users reported this error when trying to retrieve (with dq2-get) a file from WISC: "[SRM_INVALID_PATH] No such file or directory."  ggus 77088 in-progress.
    7)  ggus 77122 was opened on 12/7 regarding the deletion errors at WISC, in-progress.
    
    
    Follow-ups from earlier reports:
    (i)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    (ii)  10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF.  This is the same type of issue that 
    has been observed at several US tier-2d's when attempting to copy job output files to other clouds.  Working to understand this problem and decide how best to handle 
    these situations.  Discussion in eLog 30170.
    (iii)  11/27: BELLARMINE-T3_DATADISK - file transfer errors ("failed to contact on remote SRM
    [httpg://tier3-atlas2.bellarmine.edu:8443/srm/v2/server]").  From Horst: It looks like this was a network problem which went away again, since now
    my srm tests are working again.  ggus 76846 in-progress, eLog 31869.
    Update 11/29: ggus 76846 / RT 21277 closed.
    
    • Second generation of the DDM dashboard has been released.
    • Discussion of CVMFS cache corruption failing lots of jobs at UTD. Which version was being used?
    • Why isn't the job using a checksum? Wensheng will check a Savannah thread on this topic. (A small checksum sketch appears at the end of this section.)
  • this meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=164764
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-12_5_2011.html
    
    1)  12/7:  SLAC - ggus 77220 was opened due to a group of "lost heartbeat" failed jobs.  Seems to have been a short-lived, transient problem, so the ticket was closed.  
    eLog 32237.
    2)  12/8: DUKE_LOCALGROUPDISK - ggus 77255 opened for deletion errors at the site.  Ticket marked as 'solved' the same day.  eLog 32253.
    3)  12/8: OU_OSCER_ATLAS - ggus 77257 / RT 21351 opened due to job failures with seg fault errors.  Long-standing issue at the site, no solution has been found 
    as yet.  Tickets closed, eLog 32256.
    4)  12/9 p.m.: Hiro reported that a storage server at BNL was down.  Host restored later that evening, but job stage-in failures continued for a while due to high levels 
    of requests to the server.  Issue eventually resolved, eLog 32305.
    5)  12/9 Last Comp@Point1 shift for this year.  eLog 32284.
    6)  12/10 a.m.: from Rob at MWT2: One of the storage pools rebooted this morning most likely causing these failures. An unusual event, we're investigating why.  
    eLog 32308.
    7)  12/11:  AGLT2 - ggus 77330 opened due to DDM deletion errors at the site (~8400 over a four hour period).  Ticket in-progress - eLog 32317.  ggus 77341 also 
    opened for deletion errors at the site on 12/12 - in-progress.  eLog 32326.  Also ggus 77436/eLog 32383 on 12/14.
    8)  12/11: NET2 - ggus 77332 opened due to DDM deletion errors at the site (~1050 over a four hour period).  From Saul: Our adler checksumming was getting backed 
    up causing those errors. We added I/O resources and the errors should stop now.  Ticket was marked as solved the next day, and then additional errors were reported 
    ~eight hours later.  Final ticket status is 'unsolved'.  eLog 32318/50.  (Duplicate ggus ticket 77439 was opened/closed on 12/14.)
    9)  12/12: BNL - HPSS tape archive off line during the week of Dec 12 for a software upgrade.
    10)  12/12: ggus 77361 was opened due to an issue with the reporting of ATLAS s/w releases in BDII, originally for BNL.  Turned out to be a more widespread issue.  
    See the ggus ticket (still 'in-progress') for more details.  eLog 32338.
    11)  12/12: UTD-HEP - ggus 77382 opened due to DDM deletion errors at the site (~21 over a four hour period).  Ticket 'assigned' - eLog 32351.  (Duplicate ggus 
    ticket 77440 was opened/closed on 12/14.)
    12)  12/12: OU_OCHEP_SWT2 - job failures with errors like "/storage/app/gridapp/atlas_app/atlas_rel/16.6.7/cmtsite/setup.sh: No such file or directory."  Issue was a 
    missing soft link in the release area, since restored.  On 12/13 ggus 77383 / RT 21371 were updated with a report about jobs failing due to checksum errors, and also 
    other ATLAS s/w release area problems.  Awaiting a re-installation of 16.6.7 - this is expected to fix the problem(s).  eLog 32356/63.
    13)  12/12-12/13: SLACXRD_LOCALGROUPDISK file transfer errors ("[DDM Site Services internal] Timelimit of 604800 seconds exceeded").  Example of general 
    transfer issues between US and EU cloud sites.  Under investigation.  See details from Wei in eLog 32371.
    14)  12/13: MWT2 - downtime Tuesday, Dec 13th to upgrade dCache from v1.9.5 to 1.9.12.  Completed as of ~5:30 p.m. CST - FTS channels active and both queues 
    back online.  eLog 32375.
    15)  12/13: NERSC - downtime Tuesday, Dec. 13th from 7AM-5PM Pacific time.  ggus 77417 was opened for file transfer failures during this time - shifter wasn't aware 
    site was off-line.  Outage didn't appear in the atlas downtime calendar, announcement only sent to US cloud support.  eLog 32373.
    16)  12/13: SLAC - file transfer errors with the source error "[GRIDFTP_ERROR] an end-of-file was reachedglobus_xio: An end of file occurred (possibly the destination 
    disk is full)."  https://savannah.cern.ch/bugs/?89835, eLog 32380.
    17)  12/13: Two PandaMon hosts with problems (PandaMon_voatlas141, PandaMon_voatlas240).  Originally reported on 12/10.  See eLog 32329/79.
    http://savannah.cern.ch/bugs/?89834.
    18)  12/14: OU_OCHEP_SWT2 - ggus 77348 / RT 21378 opened due to DDM deletion errors at the site (~500 over a four hour period).  Ticket 'assigned' - eLog 32387.
    
    Follow-ups from earlier reports:
    
    (i)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    (ii)  10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF.  This is the same type of issue that 
    has been observed at several US tier-2d's when attempting to copy job output files to other clouds.  Working to understand this problem and decide how best to handle 
    these situations.  Discussion in eLog 30170.
    (iii)  12/6: Some users reported this error when trying to retrieve (with dq2-get) a file from WISC: "[SRM_INVALID_PATH] No such file or directory."  
    ggus 77088 in-progress.
    (iv)  ggus 77122 was opened on 12/7 regarding the deletion errors at WISC, in-progress.  WISC was blacklisted on 12/8 (Savannah site exclusion:
    https://savannah.cern.ch/support/index.php?125163).  eLog 32263.
    
    • We are getting overwhelmed with tickets for deletion errors. This is partly because deletion errors have been added to the shift operations checklist.
    • Point 1 shifts ended December 9. Those issues have rolled into ADCOS shifts.
    • There was a discussion about ATLAS releases at BNL - turns out to be a BDII-Panda issue.
    • There have been failures at OU - to be confirmed by Horst.
    • T3 at NERSC: a downtime was announced, but the shifter still opened a ticket. Was there a propagation failure to the downtime calendar?
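
On the checksum question above: an adler32 checksum (the type ATLAS DDM generally uses, and the kind NET2's storage was computing) can be streamed over a file without holding it in memory. A minimal sketch only - the file name is hypothetical and this is not the sites' actual tooling:

    import zlib

    def adler32_of_file(path, chunk_size=4 * 1024 * 1024):
        """Stream a file through zlib.adler32 in chunks (the whole file is read once)."""
        value = 1                         # adler32 seed value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF         # report as unsigned 32-bit

    if __name__ == "__main__":
        # Hypothetical file name, just to show usage.
        print(f"{adler32_of_file('some_output_file.root'):08x}")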

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Notes from last meeting were sent out
    • SWT2_UTA
  • this week:
    • Meeting was last week. Work to get LHCONE early adopters tested. Making sure perfsonar is installed.
    • Tom working on modular dashboard; adding alerts for primitive services. At some point real contact email addresses will be used.
    • Lookup services at some sites have problems.
    • Ian Gable joined from Canadian cloud - most sites will be up. They will appear in the dashboard.
    • Using an R310 for perfSONAR on a 10G host. Have not done the final set of tests. (Dell does not support 10G on the R310; it does on the R610.) Want to do a 10G-to-10G test before making a recommendation.
    • Canadian sites will all be 10G (w/ X520 NICs) hosts. Will have 1G-10G issues.
    • 2G and 4 cores is plenty.
    • How do we transition the US facility? This should be a good activity for Q2.
    • AGLT2-MWT2 regional issues.

Federated Xrootd deployment in the US

last week(s) this week:
  • Last meeting MinutesFedXrootdDec2
  • Hiro has been developing N2N to enlarge the heuristics - believed to cover 99%.

Tier 3 GS

last meeting: this meeting:
  • none

Site news and issues (all sites)

  • T1:
    • last week(s):
      • Looking at a large procurement - something like 200 machines - before data taking resumes. PB of disk, Dec 8 delivery.
      • Chimera migration and upgrade; running stably, no scalability issues.
    • this week:
      • Xin has discussed the CVMFS test site queue with the Condor admin - believes it has been stable at 100.

  • AGLT2:
    • last week(s):
      • A bit of a compute node purchase, but mostly infrastructure.
      • Bob - there was a gmake clean in a user's prun job. Got in contact with the user - urged them to fix it before submitting to the grid again.
    • this week:
      • XFS pools have had issues with dCache locality, running full. High pool cost factors and high load; queuing gets high. Not consistent among pools. Happens when writes are enabled. They did not see this in Europe. Will resize pools to hold 5% in reserve; tried this on one pool and it fixed the problem (see the sketch below). Will need to adjust the amount of storage delivered; may have to acquire more. Also tracking fragmentation (XFS can be defragmented online).
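
A back-of-the-envelope view of the 5% free-space reserve being tried above (only the 5% figure comes from the report; the pool size is hypothetical):

    # Effect of holding back 5% of each dCache pool as a free-space reserve.
    RESERVE_FRACTION = 0.05

    def usable_tb(pool_size_tb, reserve=RESERVE_FRACTION):
        """Capacity that can actually be filled once the reserve is held back."""
        return pool_size_tb * (1.0 - reserve)

    pool_tb = 40.0                        # hypothetical pool size
    print(f"{pool_tb} TB pool -> {usable_tb(pool_tb):.1f} TB usable "
          f"({pool_tb * RESERVE_FRACTION:.1f} TB held in reserve)")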

  • NET2:
    • last week(s):
      • New storage is up and available. There were corruption problems - need to make sure firmware is up to date for 3 TB drives.
      • CVMFS work is on-going.
      • Perfsonar - ready
      • DYNES work on-going
    • this week:
      • Partially drained. New storage is on order, along with new servers and worker nodes. Have some performance results with the C6145 (Magny-Cours); seeing better scaling on Z->ee Geant4 simulation.

  • MWT2:
    • last week:
      • Working on getting site completely converted to use MWT2-condor
      • 720 TB being installed at UC
    • this week:

  • SWT2 (UTA):
    • last week:
      • Looking at infrastructure, and a bit of storage.
      • Moving production cluster to CVMFS
      • Bestman2 rpms
      • APAC grid certs (issue with Bestman2, it seems) - email address in the DN, related to Jetty constraints. Can the signing policy files be modified to fix the issue, as before? Otherwise, what to do given that Bestman2 support is going away?
    • this week:
      • Has been working with Alex at LBL diagnosing a Bestman2-APAC grid cert problem. Have a fix which seems to be working. Need to get user proxies to test.
      • Analysis queue was excluded - some files were sent without long-form registration, which caused an issue since direct-access xrootd is not handled by the pilot with the short form. Subscriptions were done using the test channel rather than the prod channel; perhaps this is the cause. Will follow up with Hiro.
      • Updated netmon instances to have new perfsonar and traceroute tests.

  • SWT2 (OU):
    • last week:
      • Seeing lots of high-I/O jobs - not a problem, however.
      • Will do the CVMFS deployment in the background.
    • this week:
      • Ordered 3 new headnodes.
      • Failed 16.6.7 jobs - had asked Alessandro to install 64-bit releases, which wiped out the 32-bit releases; this was on OCHEP.

  • WT2:
    • last week(s):
    • this week:
      • Added gridftp server yesterday.
      • Some users attempting 64-bit prun with Athena cannot load python-LFC. The log file indicates attempts to use /usr/bin/python.

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and GLOW. Have problems with Engage, since they want gridftp from the worker nodes. Possible problems with the GLOW VO cert.
  • GEANT4 campaign, SupportingGEANT4
this week
  • See above

CVMFS deployment discussion

See TestingCVMFS

previously:

  • Wei: the instructions seem to ask us to use the squid at BNL. Hardware recommendation? Same as what you're already using.
  • Load has not been observed to be high at AGLT2. Squid is single-threaded, so multi-core is not an issue. Want a good amount of memory, to avoid hitting local disk.
  • At AGLT2 - recommend multiple squids, with compute nodes configured not to hit a remote proxy. Doug claims it will still fail over to Stratum 1 regardless. (See the sketch at the end of this section.)
this week:
  • See above.
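
To make the "multiple squids, fail over anyway" recommendation above concrete, here is a sketch of composing the client proxy setting. The CVMFS_HTTP_PROXY syntax ('|' load-balances within a group, ';' separates fail-over groups, DIRECT means no proxy) is standard CVMFS client configuration; the squid host names are hypothetical.

    # Compose a CVMFS_HTTP_PROXY line for /etc/cvmfs/default.local following the
    # AGLT2-style recommendation: several local squids load-balanced, with a
    # fail-over group behind them.  Host names here are purely hypothetical.
    local_squids = ["squid1.example.edu:3128", "squid2.example.edu:3128"]
    fallback_groups = ["DIRECT"]          # last resort: go straight to the Stratum 1 servers

    # '|' load-balances within a group, ';' separates fail-over groups.
    proxy_line = ";".join(["|".join(f"http://{h}" for h in local_squids)] + fallback_groups)

    print(f'CVMFS_HTTP_PROXY="{proxy_line}"')
    # -> CVMFS_HTTP_PROXY="http://squid1.example.edu:3128|http://squid2.example.edu:3128;DIRECT"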

AOB

last week this week


-- RobertGardner - 13 Dec 2011
