
MinutesOct5

Introduction

Minutes of the Facilities Integration Program meeting, Oct 5, 2011
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute): *announce yourself in a quiet moment after you connect*
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: AK, Earle, Torre, Hari, Shawn, Doug, Alden, Dave, Wei, Saul, Patrick, Mark, Kaushik, Armen, Tom, Xin, Horst, Fred, Sarah, Hiro, John Brunelle
  • Apologies: Jason, Michael

Integration program update (Rob, Michael)

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Lots of MC going on - see Borut's talk at the ADC weekly about production. Will keep us busy for a while.
    • User analysis about the same as before.
    • RAC - we have a new request for e-gamma regional production. These are being more or less auto-approved, since we have capacity above pledge. Kaushik sees no need to change the system. Michael would like to come back to verifying that resources above pledge are being prioritized for US physicists; Alden is investigating the algorithm and the relevant data from the Panda DB.
  • this week:
    • There was a drop in production over the weekend; the cause is not clear. No mention at all of drainage on any list or report.
    • Something of a mystery, since Borut wanted more resources last week; 30% was to be set aside for the MC backlog, so some reduction is perhaps not surprising.
    • No big production campaigns being discussed, mainly ongoing MC - should be smooth.
    • Tier2D issue: ~2500 failed jobs last weekend due to a backlog in the FR cloud. There is also a preliminary discussion about moving to a completely cloud-less model: open it up so data can go from anywhere to anywhere. There will be a special meeting on Monday. Are we operating with the right kind of channels? Regarding liaison with other FTS admins, the first point of contact should be the cloud-support list. At a minimum this needs to be brought up at software week.

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • See meeting from yesterday.
    • USERDISK cleanup has begun.
    • Attempting to understand rates of deletion among Tier 2s (should be the same)
    • BNL legacy space tokens have finally been cleaned up. Closed.
    • Discussion to increase frequency of USERDISK deletions, and improve monitoring
  • this week:
    • cf. MinutesDataManageOct4
    • Overall storage status is good.
    • Asking for PRODDISK cleanup at SLAC and MWT2
    • BNL and NET2 are continuing USERDISK cleanup. Struggling with parallel DATADISK cleanups.
    • Deletion rates are 4-5 Hz at Tier 2s, higher at the Tier 1.
    • Hiro has prepared the next round of deletions
    • LFC ACL issues - no problems currently, but the issue is not fully understood.
    • Tier 3 storage and deletion issues for Wisconsin - have sent email to them to take action.

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=154280
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-9_26_2011.html
    
    
    1)  9/22: From Torre - Changeover of Panda server, Bamboo, Panda monitor grid certificates to a new (Graeme) certificate is 
    complete. Long term solution of switching to robot certificates is being worked on.  eLog 29603.
    2)  9/23: MWT2 - power interruption, took some WN's off-line that weren't on UPS power.  Also issue with the SRM service 
    causing file transfer errors.  Latter resolved by adding additional memory to the SRM door.  eLog 29649.
    3)  9/23-9/24 early a.m.: SLAC file transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  
    Issue resolved - from Wei: It is fixed. The problem was due to a continuing problem with the storage power supply and disk after a power 
    glitch on 9/20, plus a previously unseen bug in the storage code. We put a temporary workaround in place for that bug.  ggus 74622 closed, eLog 29662.
    
    Follow-ups from earlier reports:
    
    (i)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files 
    successfully on lxplus, so presumably not a site problem.  Requested additional debugging info from the user, investigating further.  
    ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan 
    developers to come up with a solution.
    Update 8/23: Patrick requested that the user run some new tests to check whether a BeStMan patch he installed addresses 
    this issue.
    Update 9/18: User reported that the issue at SWT2_CPB was resolved, but the same problem is still seen at some other sites.
    (ii)  8/26: ggus 73463 re-opened regarding backlog in transfer of datasets to AGLT2_CALIBDISK.  Ticket is currently 'in-progress'.  
    See Shawn's comment in the ticket.
    Update 9/23: ggus 74594 was opened for this same issue - closed, with a cross-reference to 73463.  eLog 29624, 29698.
    (iii)  9/2: New express stream reprocessing campaign started (ES2).  More info here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/Summer2011Reprocessing
    Update 9/14: second phase has begun - more details at above link.
    Update 9/26: second phase of the bulk reprocessing completed.
    (iv)  9/12: OUHEP_ITB - jobs waiting due to missing release 16.6.6.1.  Site had not been enabled for release installation system.  
    Horst and Alessandro now working on this.  ggus 74238 in-progress, eLog 29272.
    Update 9/20: Horst added new information to the ticket - see https://rt.racf.bnl.gov/rt/Ticket/Display.html?id=20856.
    (v)  9/19: NET2 - file transfer failures ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]").  SRM had 
    stopped responding, restarting bestman seemed to fix the problem.  Monitoring the situation - ggus 74464 in-progress, eLog 29539.
    Update 9/22: No recent errors of this type - ggus 74464 closed.
    

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=157447
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-10_3_2011.html 
    
    1)  9/29: WISC file transfer errors ("Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914:500-open() fail500 End").  
    Issue resolved by Wen at the site.  ggus 74794 closed, eLog 29890.  Transfer errors reappeared on 10/1 - issue was a full disk.  
    Some space was freed up, ggus 74855 closed, eLog 29892.
    2)  10/1: SLAC - file transfer errors ("...has trouble with canonical path. cannot access it.").  Issue was a power outage at the site.  
    Services restored during the day on 10/2.  ggus 74854 closed, eLog 29925, https://savannah.cern.ch/support/index.php?123820 
    (Savannah site exclusion).
    3)  10/2: SWT2_CPB - file transfers were failing with error "...has trouble with canonical path. cannot access it."  Issue was due to 
    the xrootdfs layer on the SRM host - a restart solved the problem.  ggus 74861 / RT 20936 closed, eLog 29952.  (Since the site was 
    set off-line by a shifter, test jobs were verified at the site prior to setting it back on-line.)
    4)  10/3: NERSC - file transfer errors such as "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  
    Issue resolved by the site admin (SRM server was in a strange state; re-started).  https://savannah.cern.ch/bugs/index.php?87394 
    closed, eLog 29971.
    5)  10/4: OU_OCHEP_SWT2 - maintenance outage for a storage system re-boot.  Back on-line as of ~9:30 a.m. CST.
    6)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    
    Follow-ups from earlier reports:
    
    (i)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files 
    successfully on lxplus, so presumably not a site problem.  Requested additional debugging info from the user, investigating further.  
    ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan 
    developers to come up with a solution.
    Update 8/23: Patrick requested that the user run some new tests to check whether a BeStMan patch he installed addresses this issue.
    Update 9/18: User reported that the issue at SWT2_CPB was resolved, but the same problem is still seen at some other sites.
    (ii)  8/26: ggus 73463 re-opened regarding backlog in transfer of datasets to AGLT2_CALIBDISK.  Ticket is currently 'in-progress'.  
    See Shawn's comment in the ticket.
    Update 9/23: ggus 74594 was opened for this same issue - closed, with a cross-reference to 73463.  eLog 29624, 29698.
    (iii)  9/12: OUHEP_ITB - jobs waiting due to missing release 16.6.6.1.  Site had not been enabled for release installation system.  Horst
    and Alessandro now working on this.  ggus 74238 in-progress, eLog 29272.
    Update 9/20: Horst added new information to the ticket - see https://rt.racf.bnl.gov/rt/Ticket/Display.html?id=20856.
    Update 10/3: See latest entry in the ggus/RT tickets. 
    
    • Low level of problems in the past two weeks for US sites.
    • Three carryover issues above.
    • Certification of Bellarmine
    • Doug: ICB approval needed

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Had a meeting this week (during the off-week because of LHCONE / LHCOPN next week)
    • R310 hosts being tested (for 10g tests); changing test interval to 60s. Will implement ktune tuning on these.
    • AGLT2-MWT2_IU routing issue resolved (NLR versus I2 path); switched back last Friday.
    • Potential perfSONAR latency issue at SLAC.
    • Modular dashboard evolving, continues to be important
    • AGLT2 not getting calibration datasets in a timely manner; working the issue with CERN. New measurements indicate packet loss.
    • Scheduling LHCONE connectivity for Tier 2s - need a plan; Shawn will report back after LHCONE meeting next week.
    • Scheduling a new set of load tests. Rerouting three of the Tier 2s onto the new ESnet circuit. Want to verify with the new configuration.
    • Next meeting will be two weeks from yesterday.
    • Michael emphasizes we have LHCOPN monitoring for the first time; remarkable progress in the last week, in part thanks to Jason pushing the sites to deploy with the correct configuration. Aim to add more sites and measure, e.g. Italian sites, in a dedicated dashboard.
  • this week:
    • Long meeting yesterday. The next meeting will be in two weeks, another off-week meeting, to fit between other meetings.
    • perfSONAR node configuration will be available in the Dell matrix.
    • Transatlantic AGLT2 issue resolved; perfSONAR was used in the diagnosis.
    • New release of perfSONAR coming a week from Friday; updates will be via yum.
    • New kind of test: scheduled traceroute. Would like every Tier 2 to schedule this; instructions will follow (a minimal scheduling sketch appears after this list).
    • Dashboard discussions
    • LHCONE likely to go slower than we anticipated. Will stay engaged as things evolve, providing our feedback.
    • Mesh testing decisions need to be made for cross-cloud testing. The goal is to have each Tier 1 tested by at least one Tier 2 in the US.
    • New ESnet circuits to UC and SLAC; would like to schedule load tests before and after. Tests to all 5 Tier 2s; Hiro will set this up. Friday morning, 10 am Eastern, half-hour test. Have not done this in a while.
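    • A minimal sketch of how a site could run the scheduled traceroute test ahead of the official instructions, assuming plain cron scheduling; the peer hosts and log directory below are hypothetical placeholders, and the eventual instructions (likely based on the perfSONAR toolkit's own scheduler) should take precedence.

      #!/usr/bin/env python
      # Hypothetical helper: run traceroute to a few peer sites and archive the output.
      # Intended to be invoked from cron; the peer list and log path are placeholders,
      # not official US ATLAS endpoints.

      import os
      import subprocess
      import time

      PEERS = [
          "perfsonar1.example-tier2.org",   # hypothetical peer hosts
          "perfsonar2.example-tier2.org",
      ]
      LOGDIR = "/var/log/traceroute-tests"  # hypothetical log location

      def run_traceroute(host):
          # Run traceroute and return its combined output; never raise on failure.
          try:
              proc = subprocess.Popen(["traceroute", "-n", host],
                                      stdout=subprocess.PIPE,
                                      stderr=subprocess.STDOUT,
                                      universal_newlines=True)
              out, _ = proc.communicate()
              return out
          except OSError as err:
              return "traceroute failed for %s: %s\n" % (host, err)

      def main():
          if not os.path.isdir(LOGDIR):
              os.makedirs(LOGDIR)
          stamp = time.strftime("%Y%m%d-%H%M%S")
          logfile = os.path.join(LOGDIR, "traceroute-%s.log" % stamp)
          log = open(logfile, "w")
          for host in PEERS:
              log.write("=== %s %s ===\n" % (stamp, host))
              log.write(run_traceroute(host))
          log.close()

      if __name__ == "__main__":
          main()

    • A cron entry such as "0 * * * * /usr/local/bin/traceroute_test.py" would approximate an hourly schedule until the measurements are folded into perfSONAR itself.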

Federated Xrootd deployment in the US

last week(s) and this week:
  • Previous meeting minutes: MinutesFedXrootdSep23
  • Doug reporting poor WAN performance for copies. Transfer failures - mostly it seems to be timing out while finding the file (a simple timing check is sketched after this list).
    • Explicit problem for BNL-resident datasets
    • There are particular cases where the N2N lookup won't work
    • Wei is interested in looking into this
  • There will be a meeting this Friday
  • LOCALGROUPDISK
  • Doug will update canned example for D3PD
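  • A rough way to reproduce the "timing out finding the file" symptom from a worker node, assuming the xrdcp client is installed; the redirector hostname and file path below are hypothetical placeholders, not the actual federation endpoints.

    #!/usr/bin/env python
    # Hypothetical check: time how long the federation takes to locate and copy a file.
    # The redirector and global path are placeholders; substitute a real test file.

    import subprocess
    import time

    REDIRECTOR = "fed-redirector.example.org"        # hypothetical redirector
    TESTFILE = "/atlas/dq2/user/test/example.root"   # hypothetical global path

    def timed_copy(url, dest="/tmp/fedxrootd_test.root"):
        # Run xrdcp and report wall-clock time; a long wait before any data moves
        # usually points at a slow or failing name lookup (N2N) step.
        start = time.time()
        ret = subprocess.call(["xrdcp", "-f", url, dest])
        return ret, time.time() - start

    if __name__ == "__main__":
        url = "root://%s/%s" % (REDIRECTOR, TESTFILE)
        ret, elapsed = timed_copy(url)
        print("xrdcp exit code %d, %.1f s elapsed" % (ret, elapsed))

  • Comparing a BNL-resident dataset against a locally resident one with the same check would indicate whether the delay is in the lookup or in the transfer itself.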

Tier 3 GS

last meeting:
  • Bellarmine tokens added, FTS transfers validated.
  • CVMFS release validation jobs ongoing, okay; Horst will follow up with Alessandro.
  • UTD - currently offline.

this meeting:

  • UTD - needs to update Joe's DN
  • Mark will provide a list of important operations lists
  • AK wants to know what the next steps are.

Site news and issues (all sites)

  • T1:
    • last week(s):
      • Hurricane Irene response last week, exercised emergency shutdown procedures. Everything powered off. Restart went smoothly.
      • Lost almost nothing on the ATLAS side; on the RHIC side, 40 worker nodes had issues.
      • 12K disk drives
      • Increasing bandwidth to Tier 2s from BNL - new 10G circuit available. UC to be migrated to a new 10G. The other circuit is waiting for a new I2 switch to come online at MANLAN, ~2 weeks.
      • LHCONE proceeding in Europe. Hope to see a couple of sites in the US participating.
    • this week:

  • AGLT2:
    • last week(s):
      • Site-aware dCache tuning done; Tom will give a talk at the next meeting.
      • grid3-locations file corrupted for unknown reasons; may need further investigation.
      • Oct 22 - outage to check the generator
    • this week:
      • Connection to CERN fixed.
      • Preparing for generator test on Oct 22.
      • Will update OSG, OSG-wn, and wlcg-client
      • Have about 390 TB rolled out, but not yet allocated.

  • NET2:
    • last week(s):
      • Racking up new storage - 430 TB usable; will be used as GPFS pools, segmented with a fast cache in front.
      • Running blade chassis with direct reading
      • Bringing up new rack of storage
      • Three problems with SRM over the weekend: the 8024F switch; space reporting;
      • AMDs at HU running production - looking good, 35% faster per thread than hyper-threaded Intel Harpertown.
      • John working on long-standing ACL issue
    • this week:
      • Bringing up newest storage rack - powered up and being tested; online within two weeks; 430 TB usable.
      • Local networking re-arrangements to convert the 6509 to a pure 10G network. Getting a Dell switch to move the local network off it. Plugged directly into the NOX, so it must be a Cisco router.
      • Going to 2x10G to the wide area
      • Running 500 analysis jobs at HU routinely; 650 is the limit.

  • MWT2:
    • last week:
      • Tier 2 pilot project at Illinois campus cluster
    • this week:
      • Purchase request in for 720 TB of storage at UC; new headnodes at IU.
      • Dave: Illinois campus computing cluster pilot, Phase 1 - Condor flocking demonstrated between clusters.
      • Sarah - redoing tests with direct access, dcap vs xrootd. Looking at the impact of NAT on dcap. Seeing less of the problem, but it still persists. Want to look at a newer version of libdcap (a simple comparison sketch follows below).
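      • A minimal sketch of the kind of dcap-versus-xrootd comparison described above, assuming PyROOT is available on the worker node; the two URLs are hypothetical placeholders for the same file reached through both doors, not actual MWT2 endpoints.

        #!/usr/bin/env python
        # Hypothetical comparison of direct-access reads via dcap and xrootd using PyROOT.
        # Both URLs are placeholders; point them at the same file through each door.

        import time
        import ROOT

        URLS = {
            "dcap":   "dcap://dcache-door.example.org:22125/pnfs/example.org/data/test.root",
            "xrootd": "root://xrootd-door.example.org//pnfs/example.org/data/test.root",
        }

        def timed_open(label, url):
            # Open the file and report how long the open took and bytes read so far.
            start = time.time()
            f = ROOT.TFile.Open(url)
            if not f or f.IsZombie():
                print("%s: failed to open %s" % (label, url))
                return
            print("%s: opened OK, %d bytes read, %.2f s" %
                  (label, f.GetBytesRead(), time.time() - start))
            f.Close()

        if __name__ == "__main__":
            for label in sorted(URLS):
                timed_open(label, URLS[label])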

  • SWT2 (UTA):
    • last week:
      • Two incidents over the weekend - storage server and xrootdfs
      • federated xrootd working well
      • Downtime in ~2 weeks to install new storage; before the SMU workshop.
    • this week:
      • xrootdFS fell over and was turned back on; otherwise things are fine.

  • SWT2 (OU):
    • last week:
      • all is well
      • Disk is on order; 200 TB usable. May be able to get 3 TB DDN drives; investigating.
    • this week:
      • Had a storage reboot yesterday. Load issue on the Lustre servers - 9900 issues after the reboot. The load is gone now, though.
      • Asked IBM about adding 3 TB drives - can't do this without voiding warranty.

  • WT2:
    • last week(s):
      • Upgraded a CE and added a new CE; encountered problems with GIP and BDII; got help from Burt H.
      • bringing storage online
      • 68 R410 nodes, 48 GB RAM, one 500 GB disk plus a system disk; 1 Gb networking; Cisco.
    • this week:
      • Power outage last weekend (a power company outage).
      • Continuing to install the 68 R410 nodes.

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • Checked with each of the VOs - overall happy, even if not yet enabled on all sites.
this week
  • SLAC: enabled HCC and GLOW. Have problems with Engage, since they want gridftp from the worker node. Possible problems with the GLOW VO cert.

CVMFS deployment discussion

See TestingCVMFS

previously:

  • For the Tier 1, based on developments in ADC and ATLAS, a CVMFS-based Panda instance has been set up. This could be used for trigger reprocessing - there is a resource shortage at CERN, so this makes additional capacity available at the Tier 1. Xin and Chris Hollowell have this Puppetized. Releases have been tagged by Alessandro, so another site has been validated. Xin notes there were some tweaks needed to publish to the BDII, and this used a separate gatekeeper. Pilots will come from AutoPyFactory (the new version).
  • Status at Illinois: running for quite a while - no problems with production or analysis. Have run in both modes (HOTDISK & CVMFS) with conditions data.
  • AGLT2: there are some issues with releases being re-written in the grid3-locations file. Some interference with Xin's install jobs? Bob believes you cannot do this live - you can't switch from NFS-mounted releases to CVMFS releases.
  • MWT2 queue: running CVMFS and the new wn-client with python 2.6; test jobs ran fine and are running successfully; still working on the ANALY queue (a basic client-side check is sketched after this list).
  • There has been significant progress in terms of support from CERN IT.
  • Sites deploying in the coming weeks: SWT2_CPB cluster
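  • A basic client-side check a site could run after mounting CVMFS, assuming the standard /cvmfs/atlas.cern.ch mount point; the software subtree tested below is an illustrative example, not a required path.

    #!/usr/bin/env python
    # Quick sanity check that the ATLAS CVMFS repository is mounted and readable
    # on a worker node. The subtree tested is an example; adjust to a release
    # area the site actually advertises.

    import os
    import sys

    REPO = "/cvmfs/atlas.cern.ch"
    EXAMPLE_SUBTREE = os.path.join(REPO, "repo", "sw")   # illustrative subtree

    def check(path):
        if not os.path.isdir(path):
            print("MISSING: %s" % path)
            return False
        try:
            entries = os.listdir(path)
        except OSError as err:
            print("UNREADABLE: %s (%s)" % (path, err))
            return False
        print("OK: %s (%d entries)" % (path, len(entries)))
        return True

    if __name__ == "__main__":
        ok = check(REPO) and check(EXAMPLE_SUBTREE)
        sys.exit(0 if ok else 1)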
this week:

Python + LFC bindings, clients

last week(s):
  • NET2 - planning
  • AGLT2 - 3rd week of October
  • Now would be a good time to upgrade; future OSG releases will be rpm-based with new configuration methods.
this week:

AOB

last week
  • None.
this week
  • No meeting next week - SMU workshop.


-- RobertGardner - 04 Oct 2011
