
MinutesNov30

Introduction

Minutes of the Facilities Integration Program meeting, November 30, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (6 to mute); announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Torre, Saul, Dave, Michael, John, Tom, Fred, Bob, Shawn, Kaushik, Alden, Horst, Hiro, Xin, Patrick, Armen, Wensheng
  • Apologies: Jason, Wei
  • Guests: Gabriele, Marco

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Friday (1pm CDT, bi-weekly - convened by Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • New integration phase - WIP NEW
      • Networking issues:
        • Slow transfers between our Tier 2s and Tier 1s (trans-Atlantic and Canada). We need to get into the habit of asking network experts for help, as well as relying on our own tools. ESnet can help.
        • We had problems getting CERN's attention for calibration data transfers to AGLT2.
        • perfSONAR was key to resolving this issue.
        • The Tier 1 service coordination meeting is a forum for raising such issues.
    • this week
      • Federated data storage meeting last week in Lyon
      • CVMFS deployment at sites
      • Tier2D networking
      • Use of Vidyo - lack of Ubuntu support; worries about room echoes and efficiency.
      • LHC performance & outlook
        • record peak luminosity for heavy ions
        • cryo problem, but the machine is recovering
        • Good data taking efficiency - 94%
        • December 7 is the last day of data taking; the technical stop will last until the second week of February, and data taking will restart in April.
      • More about Tier2D connectivity - the issue is a firewall at the EU Tier 1s affecting Tier 2 traffic. It is being discussed at the daily WLCG meeting. We'll need to shut off clouds with poor connections.
      • LHCONE testing ADC-wide; a small number of sites should be evaluated. Shawn will set up a separate perfSONAR mesh for this.
      • Procurement - Interlagos (Dell C6145, 128 cores) - one has been received at BNL, where Shuwei has been testing it. The HS06 benchmark result was about 1000, but the machine did not perform as expected for ATLAS jobs: 2.5 hours for 24 reco jobs versus 50 minutes on a Westmere, and the same pattern is seen for all processing steps. Getting in touch with Dell and AMD. (Interestingly, there is no difference for a single job.)
      • OSG pacman versus native packaging - the goal is to remove ATLAS-specific components from the worker node client (i.e. the LFC Python interfaces) and provide them via CVMFS; see the sketch at the end of this list.
      • The OSG proposal review took place last week. Outcome.
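      • As a rough illustration of the CVMFS-based worker node client idea above (a sketch only; the CVMFS path below is a placeholder, not an agreed layout), a job could pick up the LFC Python bindings from CVMFS instead of from the pacman-installed client:

        # Hedged sketch: locate the LFC Python bindings in CVMFS rather than in the
        # OSG worker node client.  The repository path is hypothetical; the matching
        # shared libraries would also need to be on LD_LIBRARY_PATH.
        import sys

        CVMFS_LFC_PYTHON = "/cvmfs/atlas.cern.ch/repo/sw/lfc/python"  # placeholder path

        sys.path.insert(0, CVMFS_LFC_PYTHON)
        import lfc  # the same python-lfc module the worker node client currently provides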

GEANT4 opportunistic access

  • Gabriele Garzoglio <garzogli@fnal.gov> - OSG user support
  • Running validation jobs on OSG, historically 50%/50% OSG and EGEE
  • They are using CVMFS now
  • http://jira.opensciencegrid.org/browse/REQUESTS-19
  • SupportingGEANT4
  • Two main things: granting access to the geant4 VO (via GUMS or a gridmap file), and allowing access to three CVMFS repositories (see the sketch at the end of this list).
  • Question about the size of the squid cache - 300 GB?
  • On the worker node - about 5.5 GB.
  • Try at MWT2
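  • A possible CVMFS client configuration for the repository part of the request, as a sketch only - the repository names are placeholders, not the actual three repositories GEANT4 asked for, and the cache sizes simply echo the figures discussed above:

    # /etc/cvmfs/default.local (sketch; placeholder names and values)
    CVMFS_REPOSITORIES=geant4.example.org,geant4-data.example.org,geant4-tools.example.org
    CVMFS_QUOTA_LIMIT=5500                               # MB; roughly the ~5.5 GB per worker node noted above
    CVMFS_HTTP_PROXY="http://squid.example.edu:3128"     # local squid, backed by ~300 GB of cache disk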

Focus topic: Illinois Campus Cluster Pilot Project (Dave Lesny)

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Spoke with Borut today - making sure all clouds are full; will have enough G4 to keep us going for many months. We should never see drains. If we see them, raise an alarm; it's a technical issue.
    • Regional production may be coming to RAC soon.
    • All is well otherwise.
    • Doug: will output of group-level activities go into the US cloud? Not sure; there is a CREM meeting tomorrow. Doug is urged to make subscriptions. PD2P only looks at DATADISK - Kaushik suggests this as an alternative.
  • this week:
    • Alden reporting - all has been going smoothly.
    • Have solid backlog of production available
    • Suggests minimal changes

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=163385
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-11_21_2011.html
    
    
    1)  11/16 - 11/17: MWT2_UC - Sarah noticed the site was draining due to jobs not progressing from 'assigned' to 'activated'.  The problem was traced to a configuration 
    issue on the panda servers following their migration to VMs - now fixed.
    2)  11/16 - 11/17: Job failures at OU_OCHEP_SWT2 with the error "Required CMTCONFIG (x86_64-slc5-gcc43-opt) incompatible with that of local system 
    (local cmtconfig not set)."  Some of the release 16.6.7 caches had not yet been re-installed - now completed, so the issue appears to be resolved. 
    http://savannah.cern.ch/bugs/?88904 closed, eLog 31548.  (See item #10 from last week's summary.)
    3)  11/16: SLAC - file transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  Wei reported the problem 
    had been fixed.  ggus 76524 closed, eLog 31539.
    4)  11/17: UTD-HEP - file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  Site set off-line for a maintenance 
    outage.  http://savannah.cern.ch/support/?124772 (Savannah site exclusion).
    Update 11/19: outage completed - test jobs successful, site set back on-line.  ggus 76570 closed, eLog 31550/614.
    5)  11/18: SLAC - file transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  Wei reported that a dataserver 
    went off-line, now back up.  ggus 76575 closed, eLog 31578.
    6)  11/18: NET2 - file transfer failures at several sites with NET2 as the source (example: "...has trouble with canonical path. cannot access it").  From Saul: We're 
    having a file system problem this evening with GPFS which will cause jobs to fail with get/put errors. We've turned off FTS, ddm endpoints and put our PanDA 
    queues in brokeroff.  ggus 76587.  Later, a GPFS hardware problem was fixed - issue resolved, ggus ticket closed.
    7)  11/18: AGLT2_CALIBDISK transfer errors ("user has no permission to create file /pnfs/aglt2.org/atlascalibdisk/...").  ggus 76576 in-progress, eLog 31563.
    8)  11/19: Michael reported that an SRM issue at BNL which was causing file transfer failures had been resolved.  eLog 31617.
    9)  11/19: OU_OCHEP_SWT2 - file transfer errors ("failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]").  A restart of the SRM 
    service fixed the problem.  ggus 76626 / RT 21254 closed, eLog 31625.
    10) 11/21 - 11/22: AGLT2 - Job failures with pilot errors like "Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2703, Could not secure the connection)|
    Log put error: lsm-put failed (201)."  Issue resolved - test jobs successful, site set back on-line.  ggus 76684 closed, eLog 31677/93/727.  (Site was in a scheduled 
    downtime 21 November, 17:00 – 20:00, to clean-up the dCache database.)  Savannah site exclusion:
    https://savannah.cern.ch/support/index.php?124827.
    11)  11/21 - 11/22: MWT2_UC was draining due to a lack of activated jobs.  Wensheng noticed a couple of issues affecting pandamover (not site specific), and 
    eventually the problem seemed to go away.  More details in the associated e-mail thread.
    12) 11/22 early a.m.: ADCR db's not accessible for ~30 minutes.  (Affects, among other services, access to the panda servers.)  Issue possibly related to an intervention 
    on a network switch around the same time.  eLog 31707.
    13)  11/22: OU_OCHEP_SWT2 - jobs failing with the error "Release 16.6.7 jobs failed with Required CMTCONFIG (x86_64-slc5-gcc43-opt) incompatible with that of 
    local system."  (A similar release issue occurred around 11/12 - 11/15, see ggus 76278.)  Apparently some of the cache re-installs for release 16.6.7 were still 
    needed - now completed.  Test jobs to the site successful - set back on-line (finally...) on 11/27.  ggus 76708 / RT 21264 closed, 
    https://savannah.cern.ch/support/?124858 (Savannah site exclusion), eLog 31732/859.
    14)  11/22: MWT2 sites were set off-line due to a crashed NFS server.  Once the service was migrated to a new server, test jobs were submitted and completed successfully, 
    so the queues were set back on-line.  eLog 31739.
    15)  11/23: SWT2_CPB - failed file transfers, due to a storage server going off-line.  The cooling fan on the NIC died, now replaced.  Server back to available, transfers 
    succeeding.  http://savannah.cern.ch/support/?124876 (Savannah site exclusion), eLog 31768.
    
    Follow-ups from earlier reports:
    (i)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    (ii)  10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF.  This is the same type of issue 
    that has been observed at several US tier-2d's when attempting to copy job output files to other clouds.  Working to understand this problem and decide how best 
    to handle these situations.  Discussion in eLog 30170.
    (iii)  10/14: AGLT2 file transfer errors ("[NO_PROGRESS] No markers indicating progress received for more than 180 seconds").  Probably not an issue on the AGLT2 
    side, but rather slowness on the remote end (in this case TRIUMF).  There was a parallel issue causing job failures (NFS server hung up), which has since been 
    resolved.  ggus 75302 closed, eLog 30415.
    Update 10/15: ggus 75348 was opened for the site, again initially related to job failures due to slow output file transfer timeouts.  What was probably a real site issue 
    (jobs failing with "ERROR 201 Copy command exited with status 256 | Log put error: lsm-put failed (201)") began appearing on 10/17.  Most likely related to a dCache 
    service restart.  Ticket in-progress, eLog 30443, https://savannah.cern.ch/support/?124073 (Savannah site exclusion).
    Update 11/3: https://savannah.cern.ch/support/?124073 closed, but ggus 75348 is still 'in-progress'.
    Update 11/17: ggus 75348 marked as 'solved'.
    (iv)  11/11: UTD-HEP - job failures with errors like "Payload stdout file too big: 3889399268 B (larger than limit 2147483648 B)."  Seems to be a site issue, since 
    jobs from the same tasks run successfully elsewhere.   http://savannah.cern.ch/bugs/?88774, eLog 31306.
    (v)  11/11: SLAC - job failures with a message like "Error accessing path/file for root file..."  From Wei: I see many failed jobs. They were retries of several missing files. 
    I manually checked the file in the ticket, along with a few other files used by other failed jobs. They are not present in our storage. The storage logs show that they 
    were deleted a few hours before the jobs. The log also shows that they were all deleted by the SRM host, all at 13:43 UTC (5:43 PST).  ggus 76264 in-progress, 
    eLog 31307.
    (vi)  11/12: NET2 - job failures with the error "!!WARNING!!3000!! Trf setup file does not exist at: /atlasgrid/Grid3-app/atlas_app/atlas_rel/16.6.8/AtlasProduction/
    16.6.8.2/AtlasProductionRunTime/cmt/setup.sh."  Site investigating - ggus 76271, eLog 31469.
    Update 11/17: a later kit validation restored a missing link in the release area - issue resolved.  ggus 76271 closed.
    

  • this meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=164357
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-11_28_2011.html
    
    1)  11/23: ADCR DB's down.  Issue was a disk hardware failure.  Services restored as of early a.m. 11/24.  eLog 31789/802.
    2)  11/23: related to 1) above, central LFC service was unavailable due to being unable to contact the adcr_lfc Oracle database.  This resulted in a very large number 
    of failed jobs at many sites.  ggus 76769 was opened for job failures at AGLT2 during this time, but it was not a site issue, rather a consequence of the LFC outage.  ggus 76769 
    closed, 76770 (ticket for the LFC outage) also closed.  https://savannah.cern.ch/bugs/?89216, eLog 31784.
    3)  11/24 - 11/25: transfer of output datasets was taking a long time.  Tadashi noticed a python2.5 problem on the panda server machines, such that datasets were not 
    getting closed properly.  The panda server was modified to use curl instead of pycurl, and this appears to have fixed the problem (see the sketch after this summary). 
    More details in: http://savannah.cern.ch/support/?124915, plus the associated e-mail thread.
    4)  11/25: OU_OCHEP_SWT2_DATADISK file transfer errors (" [SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/
    CN=531497/CN=Robot: ATLAS Data Management").  Problem solved by Horst by implementing a round-robin adler32 checksum method.  ggus 76830 / RT 21275 
    closed, eLog 31849.  https://savannah.cern.ch/support/index.php?124934 (Savannah site exclusion).
    5)  11/27 early a.m.: MWT2_UC file transfer failures (" failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]").  From Sarah: The dCache 
    headnode had run out of diskspace and was causing services to fail. I have freed up diskspace and restarted dCache.  ggus 76840 in-progress, eLog 31854.
    Update 11/29: Issue seems to be resolved - ggus 76840 closed.  eLog 31924.
    6)  11/27: BELLARMINE-T3_DATADISK - file transfer errors ("failed to contact on remote SRM
    [httpg://tier3-atlas2.bellarmine.edu:8443/srm/v2/server]").  From Horst: It looks like this was a network problem which went away again, since now
    my srm tests are working again.  ggus 76846 in-progress, eLog 31869.
    
    Follow-ups from earlier reports:
    (i)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    (ii)  10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF.  This is the same type of issue 
    that has been observed at several US tier-2d's when attempting to copy job output files to other clouds.  Working to understand this problem and decide how best 
    to handle these situations.  Discussion in eLog 30170.
    (iii)  11/11: UTD-HEP - job failures with errors like "Payload stdout file too big: 3889399268 B (larger than limit 2147483648 B)."  Seems to be a site issue, since 
    jobs from the same tasks run successfully elsewhere.   http://savannah.cern.ch/bugs/?88774, eLog 31306.
    Update 11/29: It was pointed out in the Savannah ticket that these errors could be associated with a corrupt db release file (in cvmfs).  Since all of the recent failed 
    jobs were occurring on two specific WNs, the site admin ran a 'service cvmfs flush' on these hosts, and this appears (at least so far) to have fixed the problem.  
    During this period ggus 76757 was opened, and closed with the status 'not a site issue' - eLog 31886. 
    (iv)  11/11: SLAC - job failures with a message like "Error accessing path/file for root file..."  From Wei: I see many failed jobs. They were retries of several missing files. 
    I manually checked the file in the ticket, along with a few other files used by other failed jobs. They are not present in our storage. The storage logs show that 
    they were deleted a few hours before the jobs. The log also shows that they were all deleted by the SRM host, all at 13:43 UTC (5:43 PST).  ggus 76264 in-progress, 
    eLog 31307.
    Update 11/29 from Wei: I don't think we know the reason why we don't have the data files, and it is no longer happening.  ggus 76264 closed.
    (v)  11/18: AGLT2_CALIBDISK transfer errors ("user has no permission to create file /pnfs/aglt2.org/atlascalibdisk/...").  ggus 76576 in-progress, eLog 31563.
    Update 11/29: No recent occurrences of this error - issue appears to be resolved.  Closed ggus 76576 - eLog 31926.
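    • Regarding item 3) in the summary above: a minimal sketch of the general approach of shelling out to the curl binary instead of using the pycurl bindings. The function and its parameters are hypothetical and are not taken from the actual panda server code:

      # Hedged sketch: perform an authenticated HTTPS POST via the curl command-line
      # tool rather than through pycurl.  Paths and arguments are placeholders.
      import subprocess

      def https_post(url, data, cert, key, capath="/etc/grid-security/certificates"):
          """POST url-encoded data to an HTTPS endpoint using the curl CLI."""
          cmd = [
              "curl", "--silent", "--show-error",
              "--cert", cert, "--key", key, "--capath", capath,
              "--data", data,
              url,
          ]
          result = subprocess.run(cmd, capture_output=True, text=True, check=True)
          return result.stdout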
    
    • Second generation of the DDM dashboard has been released.
    • Discussion of CVMFS cache corruption failing lots of jobs at UTD. Which version was being used?
    • Why isn't the job using a checksum? Wensheng will check a savannah thread on this topic.
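    • On the checksum point: a generic sketch (not the actual site mover code) of computing the adler32 checksum DDM uses for transfer verification, reading the file in chunks so large data files need not fit in memory:

      # Hedged sketch: adler32 of a file in the zero-padded 8-digit hex form that
      # DDM tools typically report.
      import zlib

      def adler32_of_file(path, chunk_size=1024 * 1024):
          checksum = 1  # adler32 starting value
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(chunk_size), b""):
                  checksum = zlib.adler32(chunk, checksum)
          return "%08x" % (checksum & 0xFFFFFFFF)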

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US

last week(s):
this week:
  • Meeting this week

Tier 3 GS

last meeting:
this meeting:

Site news and issues (all sites)

  • T1:
    • last week(s):
      • Have a PB of disk on order (Nexan, 3 TB drives); need to have this in place for the next reprocessing round. Will evaluate Hitachi.
    • this week:
      • Looking at a large procurement - something like 200 machines - before data taking resumes. A PB of disk, Dec 8 delivery.
      • Chimera migration and upgrade completed; running stably, with no scalability issues.

  • AGLT2:
    • last week(s):
    • this week:
      • A bit of a compute node purchase, but mostly infrastructure.
      • Bob - there was a gmake clean in a user's prun job. Got in contact with the user, who was urged to fix it before submitting to the grid again.

  • NET2:
    • last week(s):
    • this week:
      • New storage is up and available. There were corruption problems - need to make sure firmware is up to date for 3 TB drives.
      • CVMFS work is on-going.
      • Perfsonar - ready
      • DYNES work on-going

  • MWT2:
    • last week:
    • this week:
      • Working on getting site completely converted to use MWT2-condor
      • 720 TB being installed at UC

  • SWT2 (UTA):
    • last week:
    • this week:
      • Looking at infrastructure, and a bit of storage.
      • Moving production cluster to CVMFS
      • Bestman2 rpms
      • APAC grid certs (an issue with BestMan2, it seems) - an email address in the DN, related to Jetty constraints. Can the signing policy files be modified to fix the issue, as before? Otherwise, what to do, given that BestMan2 support is going away?

  • SWT2 (OU):
    • last week:
    • this week:
      • Seeing lots of high-I/O jobs - not a problem, however.
      • Will do the CVMFS deployment in the background.

  • WT2:
    • last week(s):
    • this week:

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and GLOW. Have problems with Engage, since they want GridFTP from the worker node. Possible problems with the GLOW VO cert.
this week

CVMFS deployment discussion

See TestingCVMFS

previously:

  • Wei: instructions seem to ask us to use the squid at BNL. Hardware recommendation? Same as what you're already using.
  • Load has not been observed to be high at AGLT2. Squid is single-threaded, so multi-core is not an issue. You want a good amount of memory, so as to avoid hitting local disk.
  • At AGLT2 - recommend multiple squids; compute nodes are configured not to hit a remote proxy. Doug claims it will still fail over to a stratum 1 regardless.
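  • As an illustration of the proxy setup just described (a sketch; host names are placeholders, not actual AGLT2 or BNL values), the client-side proxy string would list the local squids and only go direct to a stratum 1 as a last resort:

    # /etc/cvmfs/default.local (sketch; placeholder host names)
    # Proxies within one "|" group are load-balanced; ";DIRECT" as the final group
    # lets clients reach the stratum 1 servers if every local squid is unreachable.
    CVMFS_HTTP_PROXY="http://squid1.example.edu:3128|http://squid2.example.edu:3128;DIRECT"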
this week:

AOB

last week
this week


-- RobertGardner - 28 Nov 2011
