
MinutesSep7

Introduction

Minutes of the Facilities Integration Program meeting, Sep 7, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (6 to mute); please announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Hari, AK, Nate, Rob, Bob, John DeStefano, Dave, Saul, Fred, Booker, Shawn, Jason, Horst, Michael, Tom, Mark, Alden, Patrick, Armen, Kaushik, Wensheng, John, Torre, Hiro, Tomaz, Wei
  • Apologies:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • NEW Facilities going bi-weekly!
      • Good time for opportunistic access; this week, GLOW - see below.
      • Evolving worker node requirements & considerations - perhaps discussion next week.
      • Federated Xrootd workshop scheduled for September 12-13, Chicago, see further: https://indico.cern.ch/conferenceDisplay.py?confId=149453
      • Tentative dates for next full US ATLAS Facilities workshop, October 11-12, 2011; location would be SMU (TBC soon)
      • What's up with production?
      • LHC machine is running well - stable beams above 50%
    • this week
      • Xrootd workshop scheduled for September 12-13, Chicago: https://indico.cern.ch/conferenceDisplay.py?confId=149453
      • October 11-12, SMU, https://indico.cern.ch/conferenceDisplay.py?confId=151862
      • IntegrationPhase18 - overview for FY11Q4
      • SiteCertificationP18 - site tasks for FY11Q4
      • Topics to discuss today: CVMFS deployment (see yesterday's presentation at ADC weekly)
      • Pledges for 2012, 2013 to be delivered
        • Will be submitted Sep 16, taking the proposed numbers
        • 2.2 PB of disk and 12.5 kHS06 per T2
        • Capacity above the pledge is at the discretion of the US collaboration - this will require a discussion
      • Machine technical stop - machine development went very well - now at beta-star ~ 1 m; good prospects for new data, and lots of it
      • Alexei's note regarding distribution of two versions of AODs to T2's; expected 85 TB per T2 "A".
    • future topics:
      • September 14: local-site-mover statistics - Tom

Procurements discussion

  • Meeting the storage pledge is the first priority - e.g. MWT2
  • Infrastructure costs need to be considered - encourage PIs
  • ME and RWG will have individual discussions with PIs

CVMFS deployment discussion

See TestingCVMFS

previously:

  • Dave: On Monday stratum 1 servers switched over to the new, final URL; working just fine at Illinois.
  • New rpm to become available by end of week.
  • Generally - we should not convert a large site for scalability.
  • MWT2, AGLT2 would be ready to do a large scale test.
this week:
  • Automounter problems at AGLT2 - had to upgrade to fix. Can partially go down. Will try again tomorrow. Alessandro's validation to be redone.
  • There was a problem with "browse mode" - it needs to be disabled, otherwise Alessandro's validation scripts falsely find mounted directories. Didn't want to change the rpm, so browse mode was turned off (see the sketch at the end of this list).
  • Conditions data from HOTDISK
    • Had a problem using conditions data out of HOTDISK; Alessandro's tool is older
    • Mixed long & short URLs in the LFC were the cause
    • Alternatively, Alessandro updates the tools
  • So AGLT2 will deploy and provide reference implementation
  • Illinois has used both hotdisk and cvmfs for conditions data - no mix of URLs at Illinois. Has contacted Alessandro about the dq2-client; there is a claimed bug in PFC creation (fix in the dev branch).
  • Run site validation by-hand in advance? Alessandro did this locally at AGLT2.
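
A note on why browse mode confuses the validation: with autofs browse mode enabled, the /cvmfs repository directories appear in listings even when nothing is actually mounted, so a simple "does the directory exist" probe passes falsely. The sketch below is only illustrative (it is not Alessandro's validation code); the repository path is the standard ATLAS one, and the probe forces the automount by listing the directory before trusting it.

    #!/usr/bin/env python
    # Illustrative probe only -- not the ADC validation script itself.
    # With autofs browse mode on, /cvmfs/<repo> shows up as an empty directory
    # even when the repository is not mounted, so os.path.isdir() alone gives a
    # false positive.  Listing the directory forces the automount; requiring a
    # non-empty listing plus os.path.ismount() is the reliable check.
    import os

    REPO = "/cvmfs/atlas.cern.ch"   # standard ATLAS repository path

    def naive_check(path):
        # Fooled by browse mode: the stub directory exists either way.
        return os.path.isdir(path)

    def forced_check(path):
        # Reading the directory triggers autofs to mount it (or fail).
        try:
            entries = os.listdir(path)
        except OSError:
            return False
        return bool(entries) and os.path.ismount(path)

    if __name__ == "__main__":
        print "naive check: ", naive_check(REPO)
        print "forced check:", forced_check(REPO)

Turning browse mode off removes the stub directories, so even a naive existence check behaves correctly, which is why the sites disabled it rather than changing the rpm.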

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • On Friday Borut aborted a large number of mc11 s128-n tags. Said he would resubmit.
    • On Monday started to drain. s127 tag jobs available. mc10 is finished, but some groups still requesting this.
    • Completely out! No email from Borut.
    • User analysis has been mostly constant
    • Reprocessing campaign (Jonas)? Has not started - scheduled to start today or tomorrow.
  • this week:
    • Full again, doing remainder of G4 samples
    • Last Wednesday there was a notable bump in US analysis activity: 5K, then 10K jobs for a few days.
    • Second half of the reprocessing campaign will be coming.
    • Remind sites to keep analysis cycles open
    • Michael: 3500 minimum analysis jobs at BNL; cap at 4000
    • Discussion of distribution of analysis

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • Overall things are okay; maintenance activities in various places
    • Good news - progress with understanding deletions, a factor of 2 gain in rate; this is mainly at BNL. Higher than 7 Hz, sometimes 10-12 Hz, reducing the backlog; expect to finish in the next week (a back-of-the-envelope sketch follows this section). Had been struggling with this for several months.
    • MCDISK cleanup proceeding. BNLTAPE - finished, BNLDISK - nearly finished. Hope to complete the legacy space token.
    • LFC errors still with us - Hiro will talk with Shawn and Saul (user directories need to be fixed - ACL problems).
    • Space needed for BNLGROUPDISK
    • Next USERDISK clean-up - two or three weeks. Will need to send email by end of the week.
  • this week:
    • BNL MCDISK now gone; the storage was released for new storage
    • Tier 2's looking okay
    • Discussed cleaning up
    • USERDISK cleanup emails sent. There were suggestions for doing this cleanup more often
    • Discussed LOCALGROUPDISK cleanup; not centrally done, but in need of a policy
    • Central deletion not a problem at the moment
    • Some errors with LFC permissions
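
For the deletion rates quoted above, a back-of-the-envelope sketch of how a sustained rate translates into backlog drain time; the backlog figure below is a placeholder, since the minutes do not give one.

    # Rough arithmetic only: how long a deletion backlog takes to clear at a
    # sustained rate.  The 5M-file backlog is a placeholder, not a reported number.
    def days_to_clear(backlog_files, rate_hz):
        files_per_day = rate_hz * 86400.0
        return backlog_files / files_per_day

    if __name__ == "__main__":
        for rate in (7.0, 10.0, 12.0):
            print "%4.1f Hz -> %.1f days for a 5M-file backlog" % (rate, days_to_clear(5e6, rate))

At 10 Hz the central deletion works through roughly 860k files per day, so finishing "in the next week" is consistent with a backlog of a few million files.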

Shifters report (Mark)

  • last week: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=1&confId=150161
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-8_30_2011.txt
    
    1)  8/24: Huge backlog of 'waiting' jobs in the U.S. cloud.  Problem was due to a missing input dataset.  Details here: https://savannah.cern.ch/bugs/?85856, 
    eLog 28857.
    2)  8/25 early a.m.: panda servers in the load-balancing pool were very slow to respond.  Issue traced to a full /tmp partition on two of the three hosts.  
    Space freed up - back to normal.  https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/28847.
    3)  8/26: WISC DDM transfer failures ("Unable to connect to c091.chtc.wisc.edu:2811globus_xio: Operation was canceled globus_xio: Operation timed out]").  
    Site blacklisted due to the errors (http://savannah.cern.ch/support/?123167),  ggus 73847.
    Update 8/30: Wen at WISC reported the problem has been fixed.  ggus ticket closed, eLog 29002.
    4)  8/26: ggus 73463 re-opened regarding backlog in transfer of datasets to AGLT2_CALIBDISK.  Ticket is currently 'in-progress'.  See Shawn's comment 
    in the ticket.
    5)  8/27: BNL - emergency shutdown due to hurricane Irene.  U.S. cloud set off-line.  Services restored on Monday 8/29.  eLog 28906/56/67, 
    http://savannah.cern.ch/support/?123173
    6)  8/29: AGLT2 - DDM transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  Problem is due to disk issue 
    on dCache admin node.  OIM outage declared, ggus 73908 in-progress, eLog 28952.
    7)  8/30: BNL - some file transfer failures due to a high load on the SE name space management component. (This was caused by a temporary lack of disk 
    space on the system disk.)  Issue resolved, transfers now succeeding.  eLog 28977.
    8)  8/30: UPENN file transfer failures ("failed to contact on remote SRM [httpg://srm.hep.upenn.edu:8443/srm/v2/server]").  Site blacklisted in DDM.  Issue 
    resolved as of 8/31 a.m. - ggus 73944 closed, http://savannah.cern.ch/support/?123233 (site exclusion ticket) closed and site whitelisted, eLog 29003.
    9)  8/31: File transfer errors between UPENN_LOCALGROUPDISK => SLACXRD_USERDISK ("SOURCE error during TRANSFER_PREPARATION phase: 
    [INTERNAL_ERROR] Source file/user checksum mismatch]").  http://savannah.cern.ch/bugs/?86222, eLog 28990.  (Is this error related to 8) above?)
    10)  File transfer failures from SLACXRD_USERDISK => PIC_SCRATCHDISK with the error "source file doesn't exist."  
    https://savannah.cern.ch/bugs/index.php?86214, eLog 28989.
    
    Follow-ups from earlier reports:
    
    (i)  7/16: Initial express stream reprocessing (ES1) for the release 17 campaign started. The streams being run are the express stream and the 
    CosmicCalo stream. The reconstruction r-tag which is being used is r2541.  This reprocessing campaign (running at Tier-1's) uses release 17.0.2.3 and 
    DBrelease 16.2.1.1.  Currently merging tasks are defined with p-tags p628 and p629.  More details here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DataPreparationReprocessing
    Update 8/29: the first phase of the release 17 reprocessing campaign completed.  More details in the above link.
    (ii)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on lxplus, 
    so presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan developers to come 
    up with a solution.
    Update 8/23: Patrick requested that the user run some new tests to check whether a BeStMan patch he installed addresses this issue.
    (iii)  7/31: SMU_LOCALGROUPDISK - DDM transfer failures with the error "[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=
    Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]."  From Justin: Certificate has expired. New cert request was put in a few days ago.  
    https://ggus.eu/ws/ticket_info.php?ticket=73070 in-progress, eLog 27876, site blacklisted: https://savannah.cern.ch/support/index.php?122540
    Update 8/31: Justin reported that the problem has been resolved and the site tested - ggus and Savannah tickets closed,  eLog 29007.
    (iv)  8/22: AGLT2 - job failures due to checksum error on input file FSRelease-0.7.1.2.tar.gz.  ggus 73694 in-progress, eLog 28679.
    (v)  8/22: NERSC_LOCALGROUPDISK - DDM errors such as "destination file failed on the SRM with error [SRM_ABORTED]."  ggus 73717 in-progress, 
    eLog 28700.
    
    

  • this meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=153768
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-9_5_2011.html
    
    1)  9/2: New express stream reprocessing campaign started (ES2).  More info here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/Summer2011Reprocessing
    2)  9/2: SMU_LOCALGROUPDISK - file transfer errors with "failed to contact on remote SRM [httpg://smuosgse.hpc.smu.edu:8443/srm/v2/server]"  
    Justin reported the problem was fixed.   ggus 74032 closed, eLog 29055.  ggus 74037 was also opened the next morning, but it appears to be a duplicate 
    and can therefore probably be closed.  eLog 29068.
    3)  9/4: SLAC - job failures due to output copy transfer timeouts.  https://savannah.cern.ch/bugs/index.php?86378.  It appears the transfers are succeeding 
    on subsequent attempts, so possibly a temporary problem?  eLog 29085.
    4)  9/4: BNL-OSG2_DATADISK => FZK-LCG2_MCTAPE file transfer failures ("FIRST_MARKER_TIMEOUT] First non-zero marker not received within 
    300 seconds").  From Jane at BNL: As checked, quite some pools were in high load and shown as not seen, which should be the cause of the failure.  
    Now, the system is in good shape and I don't see problems. As tested, the transfer of the file was also fine. The problem should be gone.  ggus 74050 closed, 
    eLog 29088.
    5)  9/7 a.m.: from Michael at BNL - We just noticed that there were some production jobs failing because a missing DB release file (DBRelease-13.7.1.tar.gz). 
    This file was successfully replicated to BNL's SE several months ago but was recently removed for unknown reasons. A copy of the file was restored meanwhile.  
    eLog 29137.
    
    Follow-ups from earlier reports:
    
    (i)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on lxplus, so 
    presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan developers to come up 
    with a solution.
    Update 8/23: Patrick requested that the user run some new tests to check whether a BeStMan patch he installed addresses this issue.
    (ii)  8/22: AGLT2 - job failures due to checksum error on input file FSRelease-0.7.1.2.tar.gz.  ggus 73694 in-progress, eLog 28679.
    Update 9/2 from Shawn: Not sure what the original problem was.  Current tests with 'srmcp' and 'lcg-cp' show the file checksum is OK.  ggus 73694 closed.
    (iii)  8/22: NERSC_LOCALGROUPDISK - DDM errors such as "destination file failed on the SRM with error [SRM_ABORTED]."  ggus 73717 in-progress, 
    eLog 28700.
    Update 9/5: File transfer failures reported in the ggus ticket eventually succeeded - ticket closed.
    (iv)  8/26: ggus 73463 re-opened regarding backlog in transfer of datasets to AGLT2_CALIBDISK.  Ticket is currently 'in-progress'.  See Shawn's comment 
    in the ticket.
    (v)  8/29: AGLT2 - DDM transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  Problem is due to disk issue on 
    dCache admin node.  OIM outage declared, ggus 73908 in-progress, eLog 28952.
    Update 8/31: Issue resolved, test jobs successful, queue set on-line.  ggus 73908 closed,  eLog 29010.
    (vi)  8/30: UPENN file transfer failures ("failed to contact on remote SRM [httpg://srm.hep.upenn.edu:8443/srm/v2/server]").  Site blacklisted in DDM.  Issue 
    resolved as of 8/31 a.m. - ggus 73944 closed, http://savannah.cern.ch/support/?123233 (site exclusion ticket) closed and site whitelisted, eLog 29003.
    (vii)  8/31: File transfer errors between UPENN_LOCALGROUPDISK => SLACXRD_USERDISK ("SOURCE error during TRANSFER_PREPARATION phase: 
    [INTERNAL_ERROR] Source file/user checksum mismatch]").  http://savannah.cern.ch/bugs/?86222, eLog 28990.  (Is this error related to (vi) above?)
    (viii)  File transfer failures from SLACXRD_USERDISK => PIC_SCRATCHDISK with the error "source file doesn't exist."  
    https://savannah.cern.ch/bugs/index.php?86214, eLog 28989.
    

  • Job failures at SLAC; heavy ion jobs - stage-out jobs timed out, but later recovered. The transfers back to the DE cloud were timing out.
  • Issue: where to report tickets. Getting lost in Savannah? GGUS/RT - usually pretty good.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • LHCOPN has nearly completed deployment of perfsonar and is using Tom's monitoring system
    • Italian cloud expanding, now eager to extend to cross-cloud; Canadian cloud coming online.
    • Demonstrators for DYNES - phase A deployment complete; beginning phase B. Demo could be available shortly - to demonstrate circuits between sites.
  • this week:
    • Meeting was last week. Saul gave a presentation on wide area transfers to/from BU; will do this at AGLT2
    • New perfsonar hardware (10G) being tested
    • Pre-release testing - expect sometime this month
    • Update on the modular dashboard: adding a configuration GUI and a matrix-on-demand feature.
    • DYNES - multi-continent demo for GLIF conference (caltech, michigan, SPRACE) - Standard DYNES toolkit, FTD, etc. Deploying Group B sites now, hope to complete by SC.
      • How can we utilize DYNES?
      • Michael - would like a demo of how dynamic circuits can improve connectivity at T3; SMU would be a good use-case
      • Try for demo for SMU meeting - goal.

Federated Xrootd deployment in the US

last week(s):
this week:

Tier 3 - Bellarmine (AK)

  • Horst has added a Panda queue
  • Awaiting ToA entry
  • Test jobs by next meeting.

Site news and issues (all sites)

  • T1:
    • last week(s):
      • As mentioned earlier, empty directories issue. Otherwise very smooth.
      • Uptick in analysis jobs - at the Tier 1 and generally overall.
      • Chimera upgrade is taking shape.
    • this week:
      • Hurricane Irene response last week, exercised emergency shutdown procedures. Everything powered off. Restart went smoothly.
      • Lost almost nothing on the ATLAS side - on the RHIC side, 40 worker nodes had issues
      • 12K disk drives
      • Increase bandwidth to Tier 2s from BNL - new 10G circuit available. UC to be migrated off to a new 10G. Other circuit waiting for new I2 switch to come online at MANLAN, ~ 2 weeks.
      • LHCONE proceeding in Europe. Hope to see a couple of sites in US participating.

  • AGLT2:
    • last week(s):
      • Major problem - lost local networking at UM; switch had flow-control on while others didn't; fixed, recovering services.
    • this week:
      • Electrical storm - 3 power outages. Generators failed for unknown reasons. Control mechanisms for PDUs in the College racks. Restored everything by noon, some cleanup on Monday.
      • Changed dCache to a site-aware configuration. All reads come from the local pool node; copies are made transparently from the other site, and an LRU algorithm removes older cached replicas. Since turning this on, no issues, working well. Hoping to reduce inter-site bandwidth. lsm-db set up to accumulate statistics.
      • Racked 360 TB of new storage today!

  • NET2:
    • last week(s):
      • ATLAS Workshop at Boston University next week: http://physics.bu.edu/sites/atlas-workshop-2011/agenda/
      • Storage arrived for new rack with 3 TB drives
      • Worker node purchase in the plans
      • Next week ATLAS physics workshop in Boston
      • Still busy with IO program
      • Wide area networking to other sites - will discuss in next throughput meeting.
      • Smooth running
    • this week:
      • Racking up new storage - 430 TB usable; will be used as GPFS pools, segmented with a fast cache in front.
      • Running blade chassis with direct reading
      • John has AMD 6272 machines - running 8 production jobs; will have some performance results, jobs and HS06.
      • Throughput study - uses gridftp logs, making a package (see the sketch below)
      • ACL issue being addressed
      • Moved checksumming load to the gridftp host
      • Anticipating retiring equipment - how to "sell"
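
A rough sketch of the kind of GridFTP-log parsing such a throughput study relies on, assuming the stock globus-gridftp-server transfer log format (key=value records where DATE= is the end time, START= the start time, NBYTES= the bytes moved, and CODE=226 marks a completed transfer); the actual NET2 package may read different fields or logs.

    #!/usr/bin/env python
    # Sketch of a GridFTP transfer-log throughput summary.  Assumes the stock
    # globus-gridftp-server transfer log format; the real NET2 package may differ.
    import re
    import sys
    from datetime import datetime

    STAMP = "%Y%m%d%H%M%S.%f"    # e.g. 20110907123456.789012

    def parse(line):
        fields = dict(re.findall(r'(\S+)=(\S+)', line))
        if fields.get("CODE") != "226":           # skip failed/partial transfers
            return None
        if "START" not in fields or "DATE" not in fields or "NBYTES" not in fields:
            return None
        start = datetime.strptime(fields["START"], STAMP)
        end = datetime.strptime(fields["DATE"], STAMP)
        dt = end - start
        secs = dt.days * 86400 + dt.seconds + dt.microseconds / 1e6
        return float(fields["NBYTES"]), secs

    def summarize(logfile):
        total_bytes, total_secs, n = 0.0, 0.0, 0
        for line in open(logfile):
            rec = parse(line)
            if rec:
                total_bytes += rec[0]
                total_secs += rec[1]
                n += 1
        if n:
            print "%d transfers, %.1f GB, %.1f MB/s averaged over transfer time" % (
                n, total_bytes / 1e9, total_bytes / 1e6 / max(total_secs, 1e-6))

    if __name__ == "__main__":
        summarize(sys.argv[1])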

  • MWT2:
    • last week:
      • Retiring IU endpoints; both physical sites are represented by one endpoint
      • Working on storage procurement
    • this week:
      • MWT2 storage upgrade
      • IllinoisHEP testing the new wn-client, python 2.6, dq2 client. Need to validate against these.

  • SWT2 (UTA):
    • last week:
      • Purchase cycle
    • this week:
      • Been on vacation - few problems, all is well.

  • SWT2 (OU):
    • last week:
    • this week:
      • all is well
      • quotes in for new storage

  • WT2:
    • last week(s):
      • Upgraded CE, added a new CE - encountered problems; GIP and BDII - got help from Burt H.
    • this week:
      • Bringing storage online
      • 68 R410s: 48 GB RAM, 1 x 500 GB disk plus system disk; 1 Gb networking; Cisco

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • Dan Bradley joined the meeting
  • SupportingGLOW
  • Science being supported - economists, animal science (statistics), physics; varies month-to-month
  • Space: no space requirements on the site, other than worker node
  • Few requirements - just must run under Condor
  • Some jobs use http-squid to get files; others use condor transfer mechanisms with TCP.
  • Using the site squid proxy, as advertised in the environment (may need to check this); see the sketch after this list.
  • Typical run time? Aim for 2 hours, some are running longer; they're running under glideins - so sites will see only the pilot. Glidein lifespan? Day or so.
  • Preemption is okay, expected.
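
To illustrate the squid point above: a minimal sketch of an opportunistic job pulling an input file over HTTP through the site proxy advertised in the environment. OSG_SQUID_LOCATION is the usual OSG site attribute, but whether the GLOW jobs read exactly that variable is an assumption here, and the URL is a placeholder.

    #!/usr/bin/env python
    # Minimal sketch of an opportunistic job fetching input via the site squid.
    # OSG_SQUID_LOCATION is the usual OSG-advertised proxy (an assumption that
    # GLOW jobs use this exact variable); the URL below is a placeholder.
    import os
    import urllib2

    def fetch_via_squid(url, outfile):
        proxy = os.environ.get("OSG_SQUID_LOCATION", "UNAVAILABLE")
        if proxy and proxy.upper() != "UNAVAILABLE":
            handler = urllib2.ProxyHandler({"http": "http://%s" % proxy})
            opener = urllib2.build_opener(handler)
        else:
            opener = urllib2.build_opener()        # no proxy advertised: go direct
        data = opener.open(url).read()
        out = open(outfile, "wb")
        out.write(data)
        out.close()
        return len(data)

    if __name__ == "__main__":
        n = fetch_via_squid("http://repo.example.org/inputs/sample.dat", "sample.dat")
        print "fetched %d bytes" % n
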
this week

Python + LFC bindings, clients

last week(s):
  • New OSG release coming which has an updated LFC, python client interfaces, etc., supporting the new worker-node client and wlcg-client
this week:
  • OSG 1.2.21 released, has the updates (see the sketch below).
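
As a quick illustration of the LFC python client interface shipped with the updated worker-node client, a minimal replica lookup; this assumes the standard lfc module, an LFC_HOST set in the environment, and a valid grid proxy, and the LFN below is a placeholder.

    #!/usr/bin/env python
    # Minimal sketch using the LFC python bindings from the updated wn-client.
    # Requires LFC_HOST in the environment and a valid grid proxy; the LFN is a
    # placeholder, not a real catalog entry.
    import os
    import sys
    import lfc

    LFN = "/grid/atlas/users/someuser/somefile.root"   # placeholder path

    def list_replicas(lfn):
        # Print the SURL of each registered replica of the LFN.
        res, replicas = lfc.lfc_getreplica(lfn, "", "")
        if res != 0:
            print >> sys.stderr, "lfc_getreplica failed for %s" % lfn
            return 1
        for rep in replicas:
            print rep.sfn
        return 0

    if __name__ == "__main__":
        if "LFC_HOST" not in os.environ:
            sys.exit("Set LFC_HOST before running, e.g. export LFC_HOST=<your LFC>")
        sys.exit(list_replicas(LFN))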

AOB

last week
  • None.
this week


-- RobertGardner - 06 Sep 2011
