
MinutesJune8

Introduction

Minutes of the Facilities Integration Program meeting, June 8, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); please announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Jason, Charles, Nate, Michael, Karthik, Aaron, AK, Fred, Saul, Booker, Patrick, Sarah, Andy, John, Justin, Armen, Doug, Kaushik, Mark, Alden, Wensheng, Bob, Tom, Horst, Dave
  • Apologies: John DeStefano
  • Guest: John McGee

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Integration program from this quarter, FY11Q3
      • Discussion about upcoming meeting on virtual machines and config management, FacilitiesMeetingConfigVM
        • See agenda - format will be improvised as topics are discussed, and interactive
      • The mode of operation has changed for multi-cloud production: input files now come from remote Tier 1s. There is a performance issue - transfers are fairly slow, causing sites to drain. Network links have not been optimized; this needs to be addressed with ADC. We expect this to improve with LHCONE, but that won't be for a while.
      • Hiro has attempted optimization of FTS settings, but this has not helped.
      • Sarah notes MWT2 is draining anyway; there is a problem delivering pilots. Is the timefloor setting configured?
      • Need to investigate any issues with the pilot rate, but also the transfer rates back to the destination cloud. Hiro notes that small-file transfers are dominated by per-transfer setup overhead (see the throughput sketch at the end of this section). Transfers to Lyon in particular are problematic.
      • Need some monitoring plots to show this back to ADC. Hiro has some of these.
    • this week
      • Recap from May 30 - June 2, ADC Technical Interchange Meeting @ JINR (Dubna), https://indico.cern.ch/conferenceDisplay.py?confId=132486
      • Introducing Engage VO - opportunistic access to US ATLAS sites (below)
      • Federated Xrootd program of work
      • Virtual machine workshop next week at BNL - the list of names in the Doodle poll will be included on the gate access list.
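
A back-of-the-envelope illustration of the small-file transfer problem noted above: when the per-file setup cost (SRM negotiation, GridFTP session establishment) is comparable to the time spent actually moving bytes, effective throughput collapses. This is a minimal sketch with assumed numbers, not measurements from Hiro's monitoring.

    # Illustrative sketch (assumed numbers): effective throughput of one file
    # transfer is file_size / (setup_overhead + file_size / link_bandwidth).

    def effective_mbps(file_size_mb, setup_overhead_s, link_mbps):
        """Return achieved throughput in Mb/s for a single file transfer."""
        transfer_s = (file_size_mb * 8.0) / link_mbps   # time spent moving bytes
        total_s = setup_overhead_s + transfer_s         # plus per-file setup cost
        return (file_size_mb * 8.0) / total_s

    # Assumed: ~10 s of SRM/GridFTP setup per file on a 1 Gb/s path.
    for size_mb in (10, 100, 1000):
        print("%5d MB file -> %6.1f Mb/s effective" %
              (size_mb, effective_mbps(size_mb, 10.0, 1000.0)))

With these assumed numbers a 10 MB file achieves only about 8 Mb/s end to end, which is why many small files can drain a site even on a fast link.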

OSG Opportunistic Access (Rob)

last week(s)
  • HCC @ SLAC: security requirements discussion thread last week - articulation of Glidein-WMS requirements from Burt; meeting today.
  • HCC @ UTA: enabled.
  • HCC @ NE: not yet.
  • Engage - setting up conf. call to discuss support issues and requirements
this week
  • Engage VO introduction - overview, requirements, questions.
  • John McGee, Mats Rynge, Steven Cox
  • Overview presentation: http://www.renci.org/~scox/engage_at_atlas/engage-atlas-v2.0.pdf
  • Support wiki: https://twiki.grid.iu.edu/bin/view/Engagement/EngageAtUSATLAS
  • Worker node outbound access - what are the restrictions at sites? E.g., at BNL no direct outbound access is permitted. (A connectivity-probe sketch follows this list.)
  • SL5 is okay - works for all applications with a couple of exceptions.
  • What needs to be known about the applications? Varies a bit by user/application.
  • What will be the first application to run? They would start with a straightforward application.
  • What about storage? They don't use SRM access at the moment. Michael points out the advantages of SRM as a control mechanism.
  • Steve is the technical rep; engage-team@opensciencegrid.org.
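
Since the worker-node outbound-access question will come up at every site, a small probe along the lines of the sketch below could be run as a grid job to report what a site actually permits; at sites like BNL that forbid direct outbound access it should simply report failures. The host/port list is hypothetical, not an agreed test.

    # Minimal sketch: probe outbound TCP connectivity from a worker node.
    # The probe list is illustrative only; replace it with whatever services
    # an Engage application actually needs to reach.
    import socket

    PROBES = [
        ("www.renci.org", 443),          # example: project web server
        ("xrootd.example.org", 1094),    # hypothetical data server
    ]

    def can_connect(host, port, timeout=5.0):
        try:
            sock = socket.create_connection((host, port), timeout)
            sock.close()
            return True
        except (socket.error, socket.timeout):
            return False

    if __name__ == "__main__":
        for host, port in PROBES:
            status = "OK" if can_connect(host, port) else "BLOCKED/unreachable"
            print("%s:%d %s" % (host, port, status))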

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Mark believes a number of factors contributed to MWT2 not staying full
    • There was a brokerage issue last weekend having to do with space available issues (Rod Walker found the issue) preventing jobs from running in the US cloud; a change in schedconfig fixed the problem.
  • this week:
    • PD2P algorithm will be changed - Tier 1s had been favored; it will change to MOU share. Tier 2s are brokerage only. (An illustrative weighting sketch follows this list.)
    • How does this coexist with GROUPDISK subscriptions?
    • Dubna workshop was full of talks - fewer discussions, in contrast with Napoli.
    • Doug - talks should be representative of the workshop
    • Transfer backlog - related to star-channel usage; concurrent transfers compete within the same star channel.
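
The minutes do not spell out the new PD2P algorithm; the sketch below only illustrates the general idea of weighting data placement by MOU share rather than favoring Tier 1s, with made-up share numbers. It is not the PanDA implementation.

    # Illustrative sketch only: choose a destination for a PD2P replica with
    # probability proportional to a (made-up) MOU share, instead of always
    # preferring a Tier 1.
    import random

    MOU_SHARE = {          # hypothetical pledged shares, arbitrary units
        "AGLT2": 20,
        "MWT2": 23,
        "NET2": 15,
        "SWT2": 22,
        "WT2": 20,
    }

    def choose_site(shares, rng=random):
        """Weighted random choice of a site keyed by share."""
        total = float(sum(shares.values()))
        r = rng.uniform(0, total)
        upto = 0.0
        for site, share in shares.items():
            upto += share
            if r <= upto:
                return site
        return site  # guard against floating-point rounding at the upper edge

    print(choose_site(MOU_SHARE))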

Federated Xrootd deployment in the US (Charles, Doug, Hiro, Wei)

Data Management and Storage Validation (Armen)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (this week provided by Jarka Schovancova):
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_16_2011.html
    
    1)  5/26: NET2 - file transfer errors to DATADISK.  (Issue related to the performance of checksum calculations, Bestman crashes, etc.)  See discussion 
    thread in https://ggus.eu/ws/ticket_info.php?ticket=70973, eLog 25826.
    2)  5/27: New pilot version from Paul (SULU 47e), produced to help with a production problem at LYON.  This had the effect of generating thousands of 
    errors at two FR cloud sites (see for example https://ggus.eu/ws/ticket_info.php?ticket=71032).  Problem under investigation.
    3)  5/28: Job brokerage was broken in the US & IT clouds.  Issue was a disk space check against an incorrect value.  Problem resolved.
    4)  5/29: MWT2_UC - job failures with transfer timeout errors.  From Rob: Not a site problem - caused by low concurrency settings for FTS instances at FR, 
    CERN for transfers from MWT2 endpoints.  ggus 71036 closed, eLog 25993.
    5)  5/31: ADCR database maintenance (switch db services back to original hardware - see eLog 25529 and thread therein for original issue).  Affected 
    services: ADCR_DQ2, ADCR_DQ2_LOCATION, ADCR_DQ2_TRACER, ADCR_PANDA, ADCR_PANDAMON, ADCR_PRODSYS, ADCR_AMI.  
    Duration ~one hour.  Work completed as of ~4:00 a.m. CST.  eLog 25949/50.
    6)  5/30-5/31: OU_OCHEP_SWT2 file transfer failures (two issues: (i) incorrect checksums, (ii) files with zero bytes size).  Horst reported that the issue 
    is resolved.  https://rt.racf.bnl.gov/rt/Ticket/Display.html?id=20106 closed, eLog 25943.
    7)  5/31: From Sarah at MWT2_IU: We have a storage pool off-line with disk issues at MWT2_IU.  We have paused the scheduler to prevent new jobs from 
    starting while it is down, and are working to bring it back online. We may see some transfers fail for files on the pool while it is off-line.
    8)  5/31: UTD-HEP set off-line at request of site admin (cleaning dark data from the storage).  eLog 25944.
    9)  6/1: Start of TAG reprocessing campaign (p-tag: p586).  From Jonas Strandberg: This will be a light-weight campaign starting from the merged AODs 
    and producing just the TAG and the FASTMON as output which are both very small.
    
    Follow-ups from earlier reports:
    (i)  5/17: AGLT2_USERDISK to MAIGRID_LOCALGROUPDISK file transfer failures ("globus_ftp_client: Connection timed out").  Appears to be a network 
    routing problem between the sites.  ggus 70671 in-progress, eLog 25480.
    Update 5/24: NGI_DE helpdesk personnel are working on the problem.  ggus ticket appended with additional info.
    Update 5/31 from Shawn: I am marking this as resolved but the solution seems to be that the remote site only has commercial network peering and will 
    be unable to connect to AGLT2 and WestGrid because of this. Not sure if the systems involved have been configured to limit their interactions to reachable 
    sites.  ggus 70671 closed,  eLog 25905.
    (ii)  5/19: SMU_LOCALGROUPDISK - DDM failures with "error during TRANSFER_FINALIZATION file/user checksum mismatch."  Justin at SMU thinks this 
    issue has been resolved.  Awaiting confirmation so ggus 70737 can be closed.  eLog 25537.
    Update 5/27: resolution of the problem confirmed - ggus 70737 closed.
    (iii)  5/24: NET2 - DDM transfer errors. Saul reported that the underlying issue was a networking problem that caused a gatekeeper to become overloaded. 
    Thinks the issue is now resolved. https://gus.fzk.de/ws/ticket_info.php?ticket=70844, eLog 25722.  Savannah site exclusion: 
    https://savannah.cern.ch/support/?121125.
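
Several of the items above (the SMU checksum mismatch in follow-up (ii), the OU zero-byte files in item 6) come down to comparing a file's adler32 checksum on disk with the value recorded in DDM. A minimal local check is sketched below; the catalogued value is assumed to be obtained separately (e.g. from dq2-ls or the LFC), and the path and value shown are placeholders.

    # Sketch: compute the adler32 checksum of a local file, in the zero-padded
    # hex form ATLAS DDM uses, and compare it with the catalogued value.
    import zlib

    def adler32_of(path, blocksize=1024 * 1024):
        value = 1  # adler32 seed
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                value = zlib.adler32(block, value)
        return "%08x" % (value & 0xffffffff)

    # Hypothetical usage: catalog_checksum would come from dq2-ls/LFC output.
    local = adler32_of("/path/to/suspect/file.root")   # placeholder path
    catalog_checksum = "ad:0123abcd"                    # placeholder value
    print("match" if catalog_checksum.endswith(local) else "MISMATCH")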
    
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-6_6_2011.txt
    
    1)  6/1: OU_OCHEP - large numbers of job failures, primarily from tasks using release 15.6.12.9.  Re-installation of 15.6.12 started as of early a.m. 6/2.  
    ggus 71162 / RT 20131 in-progress, eLog 25994, site set off-line: https://savannah.cern.ch/support/index.php?121288.
    Update 6/7: additional problems with other releases (both production and analysis job failures), so these are being re-installed as well.  
    Details in the ggus / RT tickets.
    2)  6/2 early a.m.: SLAC DDM errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  Prod & analy queues 
    set off-line, blacklisted in DDM.  Wei reported that the SRM service was down for a period of time, and subsequently restarted.  Test jobs were successful - 
    queues set back on-line.  ggus 71171 closed, eLog 26045, https://savannah.cern.ch/support/index.php?121295.
    3)  6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..."  Not a site issue, but rather related to the problem seen recently with 
    transfers between US tier-2's and European destinations (under investigation).  ggus 71177 closed, eLog 26032.
    Update 6/7: still see large numbers of these kinds of job failures.  ggus 71314, eLog 26202.
    See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
    4)  6/3-6/4: SLAC maintenance outage (power work) - June 3rd 5pm to June 4th 11:59pm PDT.
    Update from Wei, late p.m. 6/4: The outage is over. Services are back online.
    5)  6/6 early a.m.: ADCR db problems - restart necessary.  Investigating to determine the cause of the problem.  eLog 26151.
    6)  6/6: MWT2_UC - job failures with the error "Exception caught in runJob."  Suspicion is that these errors coincided with the db outage (5 above).  
    ggus 71243 still open, can probably be closed.
    7)  6/6-6/7: Test jobs submitted to the (new) MWT2 queue were stuck waiting due to lack of information regarding atlas releases for the site.  Sarah updated an 
    OIM entry for the site (this may have been preventing the site's BDII information from being forwarded to the CERN instance). 
    8)  6/7: New pilot release from Paul (SULU 47f) - details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_SULU_47f.html
    
    Follow-ups from earlier reports:
    (i)  5/24: NET2 - DDM transfer errors. Saul reported that the underlying issue was a networking problem that caused a gatekeeper to become overloaded. 
    Thinks the issue is now resolved. https://gus.fzk.de/ws/ticket_info.php?ticket=70844, eLog 25722.  
    Savannah site exclusion: https://savannah.cern.ch/support/?121125.
    Update 6/2: ggus 70844 was closed on this date, but later re-opened when some DDM transfer errors reappeared.  Saul & John reported that a Bestman 
    restart and updating the CRL's solved this latest issue.  Waiting for ggus 70844 to be closed.
    Update 6/4: transfer errors re-appeared.  See additional discussion in ggus 70844.
    Update 6/8: Saul reported that the issue is solved.  Details in ggus 70844 (again closed).
    (ii)  5/26: NET2 - file transfer errors to DATADISK.  (Issue related to the performance of checksum calculations, Bestman crashes, etc.)  See discussion 
    thread in https://ggus.eu/ws/ticket_info.php?ticket=70973, eLog 25826.
    Update 6/2: Issues resolved - ggus 70973 closed. 
    (iii)  5/31: UTD-HEP set off-line at request of site admin (cleaning dark data from the storage).  eLog 25944.
    Update 6/2: disk clean-up completed, site set to 'test', jobs submitted.  eLog 26030.
    Update 6/7: Jobs were not starting up due to an issue with the gatekeeper, now apparently resolved.  Test jobs completed successfully, site set back 
    'on-line'.  eLog 26203.
    (iv)  6/1: Start of TAG reprocessing campaign (p-tag: p586).  From Jonas Strandberg: This will be a light-weight campaign starting from the merged AODs 
    and producing just the TAG and the FASTMON as output which are both very small.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

CVMFS

See TestingCVMFS

last week:

  • Xin has checked with Alessandro regarding the final form of the CVMFS repository - still being tested
  • The LCG VO_ATLAS_SW_DIR will be used/assumed by the pilot.
  • Pilot wrapper needs to change to recognize the LCG environment; suggests testing at Illinois. (See the sketch after this list.)
  • Time frame from Alessandro - depends on testing the repo
  • Jose is writing a new pilot wrapper for OSG sites
  • Dave - will serve as a test site; notes that the pilot and the new layout are not in sync at the moment.
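
As context for the pilot-wrapper change discussed above, the sketch below shows the kind of check a wrapper might perform: if the site exports VO_ATLAS_SW_DIR (the LCG convention) and it points at a real directory, use it; otherwise fall back to the traditional OSG $OSG_APP layout. This is an illustration under those assumptions, not Jose's actual wrapper code.

    # Illustrative sketch of the environment check a pilot wrapper might do.
    # Variable names follow the LCG and OSG conventions mentioned above; the
    # fallback logic is an assumption, not the production wrapper.
    import os

    def atlas_sw_dir():
        """Return the ATLAS software area to use on this worker node."""
        sw_dir = os.environ.get("VO_ATLAS_SW_DIR")      # LCG/CVMFS convention
        if sw_dir and os.path.isdir(sw_dir):
            return sw_dir
        osg_app = os.environ.get("OSG_APP")             # traditional OSG layout
        if osg_app and os.path.isdir(os.path.join(osg_app, "atlas_app")):
            return os.path.join(osg_app, "atlas_app")
        raise RuntimeError("no ATLAS software area found on this node")

    if __name__ == "__main__":
        print("Using ATLAS software area: " + atlas_sw_dir())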

this week:

  • We will need to have a more dedicated discussion on this offline.

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • Links to the ATLAS T3 working group TWikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
  • US ATLAS Tier3 RT Tickets

last week(s):

  • Doug not here
this week:

Tier 3GS site reports (Doug Benjamin, Joe, AK, Taeksu)

last week:
  • None reported

this week:

  • AK - CIO is looking at a number of issues that came from Jason.

Site news and issues (all sites)

  • T1:
    • last week:
      • Hiro - all is well; site services (SS) update.
    • this week:
      • All is well. Pleased to see the available headroom in cooling capacity at BNL. There was an automatic shutdown yesterday at FNAL.
      • Autopyfactory progress.
      • Jose making good progress w/ glexec. 8 out of 10 Tier 1 centers in production.

  • AGLT2:
    • last week(s):
      • all WNs rebuilt
      • CVMFS - ready to turn on site-wide
      • lsm-pcache updated (a few little things found, /pnfs nacc mount needed)
      • dcap - round robin issue evident for calib sites.
      • Want to update dCache to 1.9.12-3, which is now golden; downtime? Wait a couple of weeks (for the PLHC results to go out)
    • this week:
      • All is well.

  • NET2:
    • last week(s):
      • Internal I/O ramp-up progress still ongoing
      • Found a lot of "get" issues; investigating
    • this week:
      • Will do a major ramp-up of analysis jobs next week.
      • HU will be in downtime next week.

  • MWT2:
    • last week:
      • UC: no recurrence of Chimera crashes since dcache upgrade to 1.9.5-26
      • Sarah - MWT2 queue development, Condor preemption - successful pilots
      • Illinois - CVMFS testing: testing the new repository from Doug; had to put in a few softlinks, but got things running successfully; testing access to conditions data; there are problems with the pilot; Xin notes that in two weeks Alessandro will have completed this work; participating in HTPC testing
    • this week:
      • UC: testing srvadmin 6.5.0 x86_64 RPMs from Dell, as well as 6.3.0-0001 firmware on the PERC 6/E and 6/i cards in our R710s, to reduce sense errors
      • IU:
      • UIUC:

  • SWT2 (UTA):
    • last week:
      • Issue with a data server Monday night - resolved.
    • this week:
      • Working on CVMFS issues - main focus.

  • SWT2 (OU):
    • last week:
      • Waiting for MP jobs from Doug.
    • this week:
      • Re-installing corrupted releases.

  • WT2:
    • last week(s):
      • 38 R410s online this morning. Will update
    • this week:
      • Completed the last round of power outages.
      • Prep with the HCC VO - new sub-cluster. Agreement with SLAC security. HCC and pilot factory firewall exceptions.

Carryover issues (any updates?)

Python + LFC bindings, clients (Charles)

last week(s):
  • We've had an update from Alain Roy/VDT - delays because of personnel availability, but progress on build is being made, expect more concrete news soon.
this week:
  • wlcg-client being tested by Marco Mambelli
  • wn-client being tested at UC (a minimal smoke-test sketch is below)
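
A minimal smoke test for the rebuilt client tarballs, of the kind these tests presumably include: check that the LFC Python bindings import and that the basic client commands resolve on the PATH. The exact set of tools to require is an assumption.

    # Sketch of a smoke test for the wlcg-client / wn-client tarballs:
    # verify the Python LFC bindings load and a few client tools are on PATH.
    import subprocess

    def have_command(name):
        """True if 'name' resolves on PATH (uses the standard 'which' utility)."""
        return subprocess.call(["which", name],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE) == 0

    def have_lfc_bindings():
        try:
            import lfc  # SWIG bindings shipped with the LFC client
            return True
        except ImportError:
            return False

    if __name__ == "__main__":
        print("lfc python bindings: " + ("ok" if have_lfc_bindings() else "MISSING"))
        for tool in ("lcg-cp", "lcg-ls", "lfc-ls", "srmcp"):   # assumed tool list
            print("%-8s %s" % (tool, "ok" if have_command(tool) else "MISSING"))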

WLCG accounting (Karthik)

last week:
this week:
  • There was an interoperability meeting - ball is in Brian's court. There seem to be plans.
  • Overall seems to be good progress.

HTPC configuration for AthenaMP testing (Horst, Dave)

last week
  • Dave: Doug Smith is back, so there has been lots of activity; 16.6.5 ran successfully at Illinois. 20 jobs using 16.6.6 all ran well, with all options. Lots of progress.
  • Horst - CVMFS + Pilots setup
  • Suggestion - look for IO performance measures.
this week
  • Queue is set up and working; Doug has been running lots of jobs. Some are failing, but others succeed. (See the wrapper sketch below.)
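
For reference, a sketch of the HTPC wrapper step that hands a whole-node slot to AthenaMP. The ATHENA_PROC_NUMBER environment variable is the standard AthenaMP convention; discovering the core count via cpu_count and the job-options name are assumptions for illustration.

    # Sketch: export the core count of a whole-node (HTPC) slot to AthenaMP.
    import multiprocessing
    import os
    import subprocess

    def run_athenamp(job_options):
        ncores = multiprocessing.cpu_count()          # assume whole-node slot
        env = dict(os.environ, ATHENA_PROC_NUMBER=str(ncores))
        # athena.py must already be on PATH from the release setup.
        return subprocess.call(["athena.py", job_options], env=env)

    if __name__ == "__main__":
        run_athenamp("MyJobOptions.py")   # hypothetical job options file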

AOB

last week:
this week:
  • No meeting next week (BNL virtual machines workshop).
  • Fred - there are some discrepancies in the RSV reliability and availability numbers being reported. This has to do with maintenance downtimes scheduled across the UTC day boundary. Tracking the issue with the GOC. (See the sketch below.)
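
To make the discrepancy concrete: a scheduled downtime that crosses 00:00 UTC is split across two UTC days, so each day is charged only part of the outage, and tools that bucket by local time instead of UTC will disagree. The sketch below shows the splitting arithmetic with made-up times.

    # Sketch: split a maintenance window across UTC day boundaries, the way
    # daily availability accounting does.  The example window is made up.
    from datetime import datetime, timedelta

    def split_by_utc_day(start, end):
        """Yield (date, downtime_hours) pairs for a window given in UTC."""
        cur = start
        while cur < end:
            next_midnight = datetime(cur.year, cur.month, cur.day) + timedelta(days=1)
            chunk_end = min(end, next_midnight)
            yield cur.date(), (chunk_end - cur).total_seconds() / 3600.0
            cur = chunk_end

    # Hypothetical downtime: 20:00 UTC June 3 to 04:00 UTC June 4 (8 hours).
    for day, hours in split_by_utc_day(datetime(2011, 6, 3, 20, 0),
                                       datetime(2011, 6, 4, 4, 0)):
        print(day, "%.1f h of scheduled downtime" % hours)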


-- RobertGardner - 07 Jun 2011



Attachments


pdf FAX-20110608.pdf (48.7K) | CharlesWaldman, 08 Jun 2011 - 12:06 | Federated ATLAS XRootd status / plan
 