
MinutesSep21

Introduction

Minutes of the Facilities Integration Program meeting, Sep 21, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line: (6 to mute) **announce yourself in a quiet moment after you connect**
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: AK, Hari, Dave, Rob, Nate, Fred, Torre, Michael, Shawn, Tom, Bob, Wei, Mark, Patrick, John B, Sarah, Alden, Armen, Kaushik, Hiro, Horst, Xin
  • Apologies: Jason

Integration program update (Rob, Michael)

Special topic: Local Site Mover statistics at AGLT2 (Tom)

  • Review next week.

CVMFS deployment discussion

See TestingCVMFS

previously:

  • Automounter problems at AGLT2 - had to upgrade to fix; the automounter can partially go down. Will try again tomorrow. Alessandro's validation to be redone.
  • There was a problem with "browse mode" - it needs to be disabled, otherwise Alessandro's scripts would falsely find mounted directories. We did not want to change the rpm, so browse mode was turned off instead (see the sketch after this list).
  • Conditions data from HOTDISK
    • Had a problem using conditions data out of HOTDISK; Alessandro's tool is older.
    • Mixed long and short URLs in the LFC caused the problem.
    • Alternatively, Alessandro updates the tools.
  • So AGLT2 will deploy and provide a reference implementation.
  • Illinois has used both HOTDISK and CVMFS for conditions data - no mix of URLs at Illinois. Illinois has contacted Alessandro about the dq2-client; a bug in PFC creation is claimed (fix in the dev branch).
  • Run site validation by hand in advance? Alessandro did this locally at AGLT2.
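A hedged illustration of the checks above: a small Python sketch (Python 2.6-era, matching the wn-client discussed below) that verifies autofs browse mode is disabled and that the configured CVMFS repositories actually mount. The /etc/sysconfig/autofs path and the BROWSE_MODE key are assumptions about a typical RHEL-style node; this is not Alessandro's validation procedure.

    # Sketch: pre-validation check on a CVMFS worker node (paths/keys assumed).
    import re
    import subprocess

    AUTOFS_SYSCONFIG = "/etc/sysconfig/autofs"   # assumed RHEL-style location

    def browse_mode_disabled(path=AUTOFS_SYSCONFIG):
        """True if BROWSE_MODE is explicitly set to 'no' in the autofs sysconfig."""
        try:
            text = open(path).read()
        except IOError:
            return False
        match = re.search(r'^\s*BROWSE_MODE\s*=\s*"?(\w+)"?', text, re.M)
        return bool(match) and match.group(1).lower() == "no"

    def repositories_mount():
        """Ask cvmfs_config to probe (mount and check) all configured repositories."""
        return subprocess.call(["cvmfs_config", "probe"]) == 0

    if __name__ == "__main__":
        print("browse mode off: %s" % browse_mode_disabled())
        print("repositories mount: %s" % repositories_mount())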
this week:
  • For the Tier 1, based on developments in ADC and ATLAS, a CVMFS-based Panda instance has been set up. This could be used for Trigger reprocessing - there is a resource shortage at CERN, so additional capacity is being made available at the Tier 1. Xin and Chris Hollowell (sp) have this Puppetized. Releases have been tagged by Alessandro, so another site has been validated. Xin notes there were some tweaks needed to publish to the BDII, and this used a separate gatekeeper. Pilots will be coming from the new version of AutoPyFactory.
  • Status at Illinois: running for quite a while - no problems with production or analysis. Have run in both modes (HOTDISK & CVMFS) with conditions data.
  • AGLT2: there are some issues with releases being re-written in the grid3-locations file. Possibly interference with Xin's install jobs? Bob believes you cannot do this live - you can't switch from NFS-mounted releases to CVMFS releases (a consistency-check sketch follows this list).
  • MWT2 queue: running CVMFS and the new wn-client with python 2.6; test jobs ran fine and production is running successfully; still working on the ANALY queue.
  • There has been significant progress in terms of support from CERN IT.
  • Sites converting in the coming weeks: the SWT2_CPB cluster.
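A hedged illustration related to the grid3-locations issue above: a Python sketch that compares the release tags advertised in a grid3-locations.txt file against what is visible under the CVMFS software area, to spot tags that were dropped or re-written. The file path, the VO-atlas-release- tag prefix, and the CVMFS directory layout are assumptions for illustration, not the procedure Xin's install jobs or Alessandro's validation actually use.

    # Sketch: compare grid3-locations tags against the CVMFS software area.
    # The paths and the tag prefix below are illustrative assumptions.
    import os

    GRID3_LOCATIONS = "/osg/app/etc/grid3-locations.txt"     # assumed location
    CVMFS_SW_AREA = "/cvmfs/atlas.cern.ch/repo/sw/software"  # assumed layout
    TAG_PREFIX = "VO-atlas-release-"

    def tags_from_grid3(path=GRID3_LOCATIONS):
        """Release numbers advertised in grid3-locations.txt (first column)."""
        releases = set()
        for line in open(path):
            fields = line.split()
            if fields and fields[0].startswith(TAG_PREFIX):
                releases.add(fields[0][len(TAG_PREFIX):])
        return releases

    def releases_in_cvmfs(area=CVMFS_SW_AREA):
        """Release directories visible one level below the CVMFS software area."""
        found = set()
        for branch in os.listdir(area):
            branch_dir = os.path.join(area, branch)
            if os.path.isdir(branch_dir):
                found.update(os.listdir(branch_dir))
        return found

    if __name__ == "__main__":
        missing = tags_from_grid3() - releases_in_cvmfs()
        print("advertised tags not found in CVMFS: %s" % sorted(missing))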

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Full again, doing remainder of G4 samples
    • Last Wednesday there was a notable bump in US analysis activity: 5K jobs, then 10K for a few days.
    • Second half of the reprocessing campaign will be coming.
    • Remind sites to keep analysis cycles open
    • Michael: 3500 minimum analysis jobs at BNL; cap at 4000
    • Discussion of distribution of analysis
  • this week:
    • Lots of MC production going on - see Borut's talk at the ADC weekly meeting. It will keep us busy for a while.
    • User analysis about the same as before.
    • RAC - we have a new request for e-gamma regional production. These are being more or less auto-approved, since we have capacity above pledge. Kaushik sees no need to change the system. Michael would like to come back to verifying that resources above pledge are being prioritized for US physicists; Alden is investigating the algorithm and the data from the Panda DB for the cloud.

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • Overall things are okay; maintenance activities in various places
    • Good news - progress in understanding deletions, with a factor-of-2 gain in rate, mainly at BNL: higher than 7 Hz, sometimes 10-12 Hz, reducing the backlog. Expect to finish in the next week. Had been struggling with this for several months.
    • MCDISK cleanup proceeding. BNLTAPE - finished, BNLDISK - nearly finished. Hope to complete the legacy space token.
    • LFC errors still with us - Hiro will talk with Shawn and Saul (user directories need to be fixed - ACL problems).
    • Space needed for BNLGROUPDISK
    • Next USERDISK clean-up - two or three weeks. Will need to send email by end of the week.
    • BNL MCDISK now gone, storage released for new storage
    • Tier 2 storage is looking okay
    • Discussed cleaning up
    • USERDISK cleanup emails sent. There were suggestions for doing this cleanup more often
    • Discussed LOCALGROUPDISK cleanup; not centrally done, but in need of a policy
    • Central deletion not a problem at the moment
    • Some errors in LFC permissions
  • this week:
    • See meeting from yesterday.
    • USERDISK cleanup has begun.
    • Attempting to understand the rates of deletion among the T2's (they should be the same; a rate-estimate sketch follows this list)
    • The BNL legacy space tokens have finally been cleaned up and closed.
    • Discussion to increase frequency of USERDISK deletions, and improve monitoring
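A hedged illustration for the rate comparison above: a tiny Python sketch that turns a hypothetical log of deletion timestamps (epoch seconds, one deleted file per line) into an average rate in Hz, so the same figure can be compared across Tier 2s. This is not the central deletion service's own accounting.

    # Sketch: average deletion rate from a hypothetical per-file timestamp log.
    def deletion_rate_hz(timestamps):
        """Average files/second over the span covered by the timestamps."""
        if len(timestamps) < 2:
            return 0.0
        span = max(timestamps) - min(timestamps)
        return len(timestamps) / span if span > 0 else 0.0

    def rate_from_log(path):
        """Read one epoch timestamp per line and return the average rate."""
        stamps = [float(line) for line in open(path) if line.strip()]
        return deletion_rate_hz(stamps)

    if __name__ == "__main__":
        # ~10 Hz would match the improved BNL numbers quoted earlier.
        print("deletion rate: %.1f Hz" % rate_from_log("deletions.log"))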

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-9_12_2011.html
    
    1)  9/8: SLAC - file transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  Issue resolved - from Wei: 
    We had a site wide networking issue which is addressed.  ggus 74156, eLog 29172.
    Update 9/12: no recent occurrences of this error - ggus 74156 closed, eLog 29274.
    2)  9/8: AGLT2 migrated to CVMFS for atlas s/w releases. 
    3)  9/10: AGLT2 - user had reported that jobs were failing with lsm errors - from Shawn: The source of the problem was a dCache mis-configuration which 
    prevented file "writes" from being able to find a write-pool. This was resolved on Sept 9 around 7:35 AM EST.  ggus 74096 closed.
    4)  9/10-9/11: MWT2_UC - local site mover issue, causing large number of job failures with stage-in errors.  Problem fixed - ggus 74209 closed, eLog 29251.
    5)  9/12: OUHEP_ITB - jobs waiting due to missing release 16.6.6.1.  Site had not been enabled for release installation system.  Horst and Alessandro now 
    working on this.  ggus 74238 in-progress, eLog 29272.
    6)  9/12: NET2 - job failures with the error "failed because it had non-empty stderr."  Saul reported the problem was fixed.  ggus 74261 closed, eLog 29294.
    7)  9/12: DDM transfer errors at WISC_DATADISK ("An end of file occurred(possibly the destination disk is full)").  ggus 74243 in-progress, eLog 29315.
    8)  9/14: Shifter reported that file transfers were failing from CERN-PROD to US cloud tier-2's.  Hiro fixed the problem - ggus 74293 closed, eLog 29356.
    
    Follow-ups from earlier reports:
    (i)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on lxplus, so 
    presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan developers to come 
    up with a solution.
    Update 8/23: Patrick requested that the user run some new tests to check whether a BeStMan patch he installed addresses this issue.
    (ii)  8/26: ggus 73463 re-opened regarding backlog in transfer of datasets to AGLT2_CALIBDISK.  Ticket is currently 'in-progress'.  See Shawn's comment 
    in the ticket.
    (iii)  8/31: File transfer errors between UPENN_LOCALGROUPDISK => SLACXRD_USERDISK ("SOURCE error during TRANSFER_PREPARATION phase: 
    [INTERNAL_ERROR] Source file/user checksum mismatch]").  http://savannah.cern.ch/bugs/?86222, eLog 28990.
    Update 9/14: No recent postings to Savannah 86222, and the errors seem to have gone away.  File this one away.
    (iv)  File transfer failures from SLACXRD_USERDISK => PIC_SCRATCHDISK with the error "source file doesn't exist."  
    https://savannah.cern.ch/bugs/index.php?86214, eLog 28989.
    Update 9/14: No postings to the Savannah ticket since 8/31.  File this one away.
    (v)  9/2: New express stream reprocessing campaign started (ES2).  More info here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/Summer2011Reprocessing
    Update 9/14: second phase has begun - more details at above link.
    (vi)  9/2: SMU_LOCALGROUPDISK - file transfer errors with "failed to contact on remote SRM [httpg://smuosgse.hpc.smu.edu:8443/srm/v2/server]"  
    Justin reported the problem was fixed.   ggus 74032 closed, eLog 29055.  ggus 74037 was also opened the next morning, but it appears to be a duplicate 
    and can therefore probably be closed.  eLog 29068.
    Update 9/12: ggus 74037 definitely a duplicate of 74032, which has already been closed.  74037 closed as well.
    (vii)  9/4: SLAC - job failures due to output copy transfer timeouts.  https://savannah.cern.ch/bugs/index.php?86378.  It appears the transfers are succeeding 
    on subsequent attempts, so possibly a temporary problem?  eLog 29085.
    Update 9/14: No postings to the Savannah ticket since 9/4.  File this one away.
    

  • Job failures at SLAC: heavy-ion jobs' stage-outs timed out, but later recovered. The transfers back to the DE cloud were timing out.
  • Issue: where to report tickets. Getting lost in Savannah? GGUS/RT - usually pretty good.

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=154279
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-9_19_2011.html
    
    1)  9/15: NET2 - Saul reported a problem with the site LSM nodes at BU.  HU_ATLAS_Tier2 and ANALY_HU_ATLAS_Tier2 were set to brokeroff while the 
    issue was being worked on - now fixed.
    Also on 9/16-9/17 site reported a problem with a network switch, also now fixed.
    2)  9/17: SWT2_CPB - file transfer failures - bad fan on a NIC in one of the storage servers took the host off-line.  Problem fixed, transfers resumed.  
    ggus 74424 / RT 20879 closed, eLog 29468.
    3)  9/18: Access to panda servers limited / timing out - affected number of running jobs.  Update from Tadashi: I've increased ServerLimit of httpd to 512 on 
    the panda servers.  The number of SYN_RECV connection has been decreased. I have the impression that all httpd processes of the panda server were 
    being occupied by requests for schedconfig and pilotcode.tar.gz.  eLog 29494/98/505.
    4)  9/19: SWT2_CPB - DDM transfer failures with the error "...has trouble with canonical path. cannot access it."  A restart of the xrootdFS process on the 
    SRM host resolved the problem.  (Possibly fallout from the issue in 2) above, but occurred more than 24 hours later, so not sure at this point.)  
    ggus 74439 / RT 20887 closed, eLog 29538.
    5)  9/19: NET2 - file transfer failures ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]").  SRM had stopped responding, restarting 
    bestman seemed to fix the problem.  Monitoring the situation - ggus 74464 in-progress, eLog 29539.
    6)  9/20: SLAC - power distribution issue, shut down many machines in the main machine room and affected some cooling systems.  Problem fixed.
    7)  9/21: NET2 - DDM errors ("NO_SPACE_LEFT").  No ggus ticket, since the site had already been blacklisted.  Saul reported this was actually a space 
    reporting issue, now fixed.  eLog 29583.
    
    Follow-ups from earlier reports:
    (i)  7/16: SWT2_CPB - user reported a problem when attempting to copy data from the site.  We were able to transfer the files successfully on lxplus, so 
    presumably not a site problem.  Requested additional debugging info from the user, investigating further.  ggus 72775 / RT 20459 in-progress.
    Update 7/26: Issue is related to how BeStMan SRM processes certificates which include an e-mail address.  Working with BeStMan developers to come 
    up with a solution.
    Update 8/23: Patrick requested that the user run some new tests to check whether a BeStMan patch he installed addresses this issue.
    Update 9/18: User reported that the issue at SWT2_CPB was resolved, but the same problem is still seen at some other sites.
    (ii)  8/26: ggus 73463 re-opened regarding backlog in transfer of datasets to AGLT2_CALIBDISK.  Ticket is currently 'in-progress'.  See Shawn's comment 
    in the ticket.
    (iii)  9/2: New express stream reprocessing campaign started (ES2).  More info here:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/Summer2011Reprocessing
    Update 9/14: second phase has begun - more details at above link.
    (iv)  9/12: OUHEP_ITB - jobs waiting due to missing release 16.6.6.1.  Site had not been enabled for release installation system.  Horst and Alessandro 
    now working on this.  ggus 74238 in-progress, eLog 29272.
    Update 9/20: Horst added new information to the ticket - see https://rt.racf.bnl.gov/rt/Ticket/Display.html?id=20856.
    (v)  9/12: DDM transfer errors at WISC_DATADISK ("An end of file occurred(possibly the destination disk is full)").  ggus 74243 in-progress, eLog 29315.
    Update 9/16: Issue reported to be fixed.  ggus 74243 closed.  Site was blacklisted in DDM for a period of time, since restored.
    https://savannah.cern.ch/support/index.php?123526.
    
    • Noted a weekend dip in production - connections to the Panda DB.
    • ADC ticketing issue to follow up on when the meeting rotates to a daylight time (new rotation policy).

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Meeting was last week. Saul gave a presentation on wide-area transfers to/from BU; the same will be done at AGLT2.
    • New perfsonar hardware (10G) being tested
    • Pre-release testing - expect sometime this month
    • Update on the modular dashboard: adding a configuration GUI and a matrix-on-demand feature.
    • DYNES - multi-continent demo for the GLIF conference (Caltech, Michigan, SPRACE) - standard DYNES toolkit, FTD, etc. Deploying Group B sites now; hope to complete by SC.
      • How can we utilize DYNES?
      • Michael - would like a demo of how dynamic circuits can improve connectivity at T3; SMU would be a good use-case
      • Goal: try for a demo at the SMU meeting.
  • this week:
    • Had a meeting this week (during the off-week because of LHCONE / LHCOPN next week)
    • R310 hosts being tested (for 10g tests); changing test interval to 60s. Will implement ktune tuning on these.
    • Routing AGLT2-MWT2_IU issue resolved. NLR versus I2 path, switched back last Friday.
    • Potential perfSONAR latency issue at SLAC.
    • Modular dashboard evolving, continues to be important
    • AGLT2 is not getting calibration datasets in a timely way; working the issue with CERN. New measurements indicate packet loss.
    • Scheduling LHCONE connectivity for Tier 2s - need a plan; Shawn will report back after LHCONE meeting next week.
    • Scheduling a new set of load tests. Rerouting three of the Tier 2s onto the new ESnet circuit; want to verify with the new configuration.
    • Next meeting will be two-weeks from yesterday
    • Michael emphasizes we have LHCOPN monitoring for the first time; remarkable progress in the last week, in part thanks to Jason pushing the sites to deploy with the correct configuration. Aim to add more sites and measure - e.g. Italian sites, in a dedicated dashboard.

Federated Xrootd deployment in the US

last week(s):

this week:

Tier 3 GS

last meeting:
  • Horst has added Panda queue
  • Awaiting ToA entry;
  • Test jobs by next meeting.

this meeting:

  • Bellarmine tokens added, FTS transfers validated.
  • CVMFS release validation jobs on-going, okay; Horst will follow up with Alessandro.
  • UTD - currently offline.

Site news and issues (all sites)

  • T1:
    • last week(s):
      • Hurricane Irene response last week, exercised emergency shutdown procedures. Everything powered off. Restart went smoothly.
      • Lost almost nothing on the ATLAS side - on the RHIC side, 40 worker nodes had issues
      • 12K disk drives
      • Increase bandwidth to Tier 2s from BNL - new 10G circuit available. UC to be migrated off to a new 10G. Other circuit waiting for new I2 switch to come online at MANLAN, ~ 2 weeks.
      • LHCONE proceeding in Europe. Hope to see a couple of sites in US participating.
    • this week:

  • AGLT2:
    • last week(s):
      • Electrical storm - 3 power outages. Generators failed for unknown reasons; issues with the control mechanisms for PDUs in the College racks. Restored everything by noon, with some cleanup on Monday.
      • Changed dCache to a site-aware configuration: all reads come from a local pool node, with transparent copies from the other site; an LRU algorithm removes older cached replicas. No issues since turning it on - working well. Hoping to reduce intersite bandwidth. lsm-db set up to accumulate statistics.
      • Racked 360 TB of new storage today!
    • this week:
      • Site aware dCache tuned - next meeting will be Tom's talk
      • grid3-locations file corrupted for unknown reasons; may need to investigate further.
      • Oct 22 - outage to check the generator.

  • NET2:
    • last week(s):
      • Racking up new storage - 430 TB usable; will use as GPFS pools, segmented with a fast cache in the front.
      • Running blade chassis with direct reading
      • John has AMD 6272 machines - running 8 production jobs; will have some performance results, jobs and HS06.
      • Throughput study - uses gridftp logs; making a package.
      • ACL issue being addressed
      • Moved checksumming load to the gridftp host
      • Anticipating retiring equipment - how to "sell" old equipment
    • this week:
      • Bringing up new rack of storage
      • Three problems with SRM over the weekend, including the 8024F switch and space reporting.
      • AMDs at HU running production - looking good, 35% faster than HT Intel Harpertown thread.
      • John working on long-standing ACL issue

  • MWT2:
    • last week:
      • MWT2 storage upgrade
      • IllinoisHEP testing new wn-client, python 2.6, dq2 client.
    • this week:
      • Tier 2 pilot project at Illinois campus cluster

  • SWT2 (UTA):
    • last week:
      • Been on vacation - few problems, all is well.
    • this week:
      • Two incidents over the weekend - storage server and xrootdfs
      • federated xrootd working well
      • Downtime in ~2 weeks to install new storage; before the SMU meeting.

  • SWT2 (OU):
    • last week:
      • all is well
      • quotes in for new storage
    • this week:
      • all is well
      • Disk is on order; 200 TB usable. May be able to get 3 TB DDN drives; investigating.

  • WT2:
    • last week(s):
      • Upgraded the CE and added a new CE - encountered problems with GIP and BDII; got help from Burt H.
      • bringing storage online
      • 68 R410s, 48 GB RAM, one 500 GB disk plus a system disk; 1 Gb networking; Cisco.
    • this week:

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • Dan Bradley joined the meeting
  • SupportingGLOW
  • Science being supported - economists, animal science (statistics), physics; varies month-to-month
  • Space: no space requirements on the site, other than worker node
  • Few requirements - just must run under Condor
  • Some jobs use http-squid to get files; others use condor transfer mechanisms with TCP.
  • Using the site squid proxy, as advertised in the environment (may need to check this; a fetch sketch follows this list).
  • Typical run time? Aim for 2 hours, some are running longer; they're running under glideins - so sites will see only the pilot. Glidein lifespan? Day or so.
  • Preemption is okay, expected.
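A hedged illustration of the squid usage noted above: a minimal Python 2 sketch that reads the site squid location from the environment and fetches a file through it. The OSG_SQUID_LOCATION variable, its "UNAVAILABLE" convention, and the example URL are assumptions for illustration; the GLOW jobs' actual transfer mechanism may differ.

    # Sketch: fetch a file via the site squid proxy advertised in the environment.
    import os
    import urllib2

    def open_via_site_squid(url):
        """Open a URL through the site squid if one is advertised, else directly."""
        squid = os.environ.get("OSG_SQUID_LOCATION", "")
        if squid and squid.upper() != "UNAVAILABLE":
            opener = urllib2.build_opener(
                urllib2.ProxyHandler({"http": "http://%s" % squid}))
        else:
            opener = urllib2.build_opener()
        return opener.open(url)

    if __name__ == "__main__":
        data = open_via_site_squid("http://example.org/input.dat").read()
        print("fetched %d bytes" % len(data))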
this week
  • Checked with each of the VOs - overall happy, even if not yet enabled on all sites.

Python + LFC bindings, clients

last week(s):
  • A new OSG release is coming with updated LFC and Python client interfaces, etc., supporting the new worker-node client and wlcg-client (a minimal usage check is sketched below).
  • OSG 1.2.21 released.
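A hedged illustration of the Python LFC bindings mentioned above: a minimal Python 2 smoke test that the lfc module imports and can look up replicas for one LFN. The LFC_HOST value and the example LFN are assumptions, and the lfc_getreplica call shown is the commonly used form; consult the worker-node client documentation for the authoritative interface.

    # Sketch: smoke-test the Python LFC bindings shipped with the new wn-client.
    import os
    import sys

    os.environ.setdefault("LFC_HOST", "lfc.example.org")   # assumed catalog host

    try:
        import lfc
    except ImportError as exc:
        sys.exit("LFC python bindings not available: %s" % exc)

    lfn = "/grid/atlas/users/someuser/somefile"             # hypothetical entry
    result, replicas = lfc.lfc_getreplica(lfn, "", "")
    if result == 0:
        for rep in replicas:
            print("replica: %s" % rep.sfn)
    else:
        print("no replicas found or lookup failed for %s" % lfn)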
this week:
  • NET2 - planning
  • AGLT2 - 3rd week of October
  • Now would be a good time to upgrade; future OSG releases will be rpm-based with new configuration methods.

AOB

last week
  • None.
this week


-- RobertGardner - 19 Sep 2011
