
MinutesAug4

Introduction

Minutes of the Facilities Integration Program meeting, Aug 4, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Dave, Rob, Patrick, Sarah, Jim, Aaron, Fred, Justin, Karthik, Saul, Nate, Rik, Charles, Booker, Wei, Shawn, Jason, Kaushik, Armen, Mark
  • Apologies: John B

Integration program update (Rob, Michael)

  • IntegrationPhase14 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Wei offers to have SLAC host the next facilities workshop, Wednesday & Thursday, October 13th & 14th; housing should be arranged ASAP
      • LHC technical stop completed last Thursday; working on high-intensity stable beams, possibly more data tonight
      • ATLAS special runs were either canceled or did not take place
      • ICHEP was quite successful, with many good presentations
    • this week
      • LHC performance has been on/off - a new high-intensity run is underway
      • Analysis jobs - the load has been spiky; the US is getting most of it, but the distribution across Tier 2s is uneven
      • Site admins occasionally get email from M.E. - there are recurring, common issues; for example, 3 sites failed yesterday. We need to reconsider reliability.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • The links to the ATLAS T3 working group Twikis are here
  • The draft users' guide to T3g is here
last week(s):
  • Looking at the Tier 3 design - the basics are covered, but there are still things to sort out.
  • Data management at Tier 3s is a major worry. Sites are using XrootdFS - how well is this working?
  • Regarding funding: most sites have not received funding. Evaluating Puppet as a technology for installing nodes.
  • Will start contacting Tier 3's later this week to assess progress.
  • Working groups gave final reports
  • Data management - exploring what's available in Xrootd itself; will be writing down some requirements.
  • Doug and Rik are traveling back from the software managers workshop
this week:
  • ARRA funds are beginning to materialize; there is a DOE site that tracks this information; expect funds by mid-August
  • Phase 1 version of Tier 3 is ready, updating documentation so that sites can start ordering servers
  • Phase 2 needs - analysis benchmarks for Tier 3 (Jim C); local data management; data access via grid (dq2-FTS); CVMFS - will need a mirror at BNL (see the sketch after this list); effort to use Puppet to streamline T3 install and hardware recovery
  • Xrootd federation demonstrator project
  • UTA has set up a Tier 3 in advance of the ARRA funding.
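As a rough illustration (not from the minutes) of the kind of check a Tier 3 site could run once the CVMFS mirror at BNL is in place, the sketch below probes the ATLAS CVMFS mount point on a node; the mount path /cvmfs/atlas.cern.ch and the pass/fail criterion are assumptions.

    #!/usr/bin/env python
    # Minimal sketch (illustration only): verify that the ATLAS CVMFS
    # repository is mounted and readable on a node.  The mount point
    # /cvmfs/atlas.cern.ch is an assumed site configuration detail.
    import os
    import sys

    CVMFS_REPO = "/cvmfs/atlas.cern.ch"   # assumed mount point

    def cvmfs_ok(repo=CVMFS_REPO):
        """Return True if the repository is mounted and lists at least one entry."""
        if not os.path.isdir(repo):
            return False
        try:
            entries = os.listdir(repo)    # forces the automount and a catalog read
        except OSError:
            return False
        return len(entries) > 0

    if __name__ == "__main__":
        if cvmfs_ok():
            print("CVMFS repository %s looks usable" % CVMFS_REPO)
            sys.exit(0)
        print("CVMFS repository %s is missing or unreadable" % CVMFS_REPO)
        sys.exit(1)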

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • Production sporadic in US and all clouds
    • Known problem with Transform Errors, not site issues, tickets submitted and issue is being resolved
    • Checksum issue at OU is being worked out in a GOC ticket
    • OU_OSCAR needs new releases installed before test jobs can run properly
    • Meeting minutes from today will be posted later
    • Many user jobs failing at sites, at least at SLAC and BNL
    • Investigation of some of these jobs determined that jobs are being submitted which are not an appropriate use of the resource
    • BNL long queue blocked by a single user submitting jobs that stuck and were only killed after 12 hours by the pilot
    • Need to determine how we handle these sorts of jobs - what is the appropriate response? Contact analysis shifters, educate users
    • Encourage better jobs: jobs that are more efficient in data use and length of processing
    • Sites could have the right/ability to kill/remove jobs which are running inefficiently.
  • this week:
    • New schedconfig modification techniques. Instructions can be found here: SchedConfig modification
    • expect production workloads to vary for the near term
    • production versus analysis: need to discuss min/max (floor/ceiling) for the analy queues in the RAC.
    • general discussion about getting more analysis jobs running, and response.
    • With PD2P, roughly 20% of the data is being distributed to Tier 2s.
    • Current model: dataset replications to Tier 2s are triggered, but jobs still run at the Tier 1 first. Efficient for 50% of the jobs (50% are reused, so far). A sketch of this trigger logic appears after this list.
    • Rebrokering is coming - users can use containers and send jobs to multiple sites. Then the long queue at BNL can be examined, and jobs sent to Tier 2s once the datasets have arrived.
    • Problems with user jobs - jobs submitted with prun are causing problems (for example, pyROOT+dcap). Forward these issues to DAST.
    • Users can be blacklisted via GGUS
    • Sites are free to set their own wall time limits
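The PD2P caching idea discussed above can be summarized in a short sketch: the first job touching a dataset runs at the Tier 1 and triggers an asynchronous replica to a Tier 2, and later jobs can be brokered to that Tier 2 once the replica completes. This is an illustration of the decision flow only, not the actual PanDA/PD2P code; the function names and site list are hypothetical, and real brokerage uses weights rather than a random choice.

    # Illustrative sketch of the PD2P-style caching flow (NOT the real PanDA/PD2P code).
    import random

    TIER1 = "BNL"
    TIER2_SITES = ["AGLT2", "MWT2", "NET2", "SWT2", "WT2"]   # hypothetical list

    replicated = {}   # dataset -> Tier 2 holding a completed replica
    pending = {}      # dataset -> Tier 2 where a replica is still in flight

    def request_replica(dataset):
        """First use of a dataset: trigger an asynchronous replica to a Tier 2."""
        site = random.choice(TIER2_SITES)      # real brokerage is weighted, not random
        pending[dataset] = site
        print("triggering replication of %s to %s" % (dataset, site))

    def choose_site(dataset):
        """Broker a job: prefer a Tier 2 replica if one is already complete."""
        if dataset in replicated:
            return replicated[dataset]         # the ~50% reuse case
        if dataset not in pending:
            request_replica(dataset)           # trigger on first touch
        return TIER1                           # this job still runs at the Tier 1

    def replica_completed(dataset):
        """Callback for when DDM reports the Tier 2 replica is complete."""
        replicated[dataset] = pending.pop(dataset)

    # usage sketch
    print(choose_site("data10_7TeV.someDataset.AOD"))   # -> BNL, replication triggered
    replica_completed("data10_7TeV.someDataset.AOD")
    print(choose_site("data10_7TeV.someDataset.AOD"))   # -> a Tier 2 with the replica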

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • No meeting this week
    • BNL has enough room
    • Some Tier2 sites getting new storage, SLAC has some problems in some space tokens
    • Last cleanup for data had a cutoff of April 1st.
    • SWT2 has some storage issues too; a quick cleanup process has been running since the weekend
  • this week:
    • Problems with central deletion at SLAC, which has hard limits on space tokens - deletion starts too late and is not efficient enough; should the deletion trigger threshold be lowered? Each space token's size must be fixed. (A sketch of the threshold idea follows this list.)
    • Michael: sites are going down more frequently; mostly SRM related. We see Bestman failures particularly. We may need a concentrated effort here to resolve Bestman reliability problems. Wei will drive the issue.
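A minimal sketch of the threshold idea mentioned above: with a space token of fixed size, deletion is triggered once usage crosses a configurable fraction, and lowering that fraction makes deletion start earlier. The numbers and function names are illustrative assumptions, not the behavior of the actual central deletion service.

    # Sketch of a lowered deletion-trigger threshold for a fixed-size space token.
    # Purely illustrative; the real deletion service differs in detail.
    TOKEN_SIZE_TB = 200.0      # hypothetical fixed space-token size
    TRIGGER_FRACTION = 0.85    # lowered from e.g. 0.95 so deletion starts earlier
    TARGET_FRACTION = 0.75     # delete back down to this fill level

    def deletion_needed(used_tb, size_tb=TOKEN_SIZE_TB, trigger=TRIGGER_FRACTION):
        """Start deletion once the token is more than `trigger` full."""
        return used_tb / size_tb >= trigger

    def amount_to_delete(used_tb, size_tb=TOKEN_SIZE_TB, target=TARGET_FRACTION):
        """How many TB must be removed to return to the target fill level."""
        return max(0.0, used_tb - target * size_tb)

    # usage sketch
    used = 175.0   # TB currently used
    if deletion_needed(used):
        print("trigger deletion of %.1f TB" % amount_to_delete(used))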

dCache local site mover development (Charles)

Work to explore/develop a common local site mover for dCache sites.
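As a starting point for the discussion, the sketch below shows what a minimal lsm-get-style wrapper around dccp could look like for a dCache site. The two-argument "lsm-get <source> <destination>" calling convention, the exit codes, and the retry policy are assumptions and may not match the pilot's actual local-site-mover interface.

    #!/usr/bin/env python
    # Minimal sketch of a dCache local site mover ("lsm-get <source> <destination>").
    # Assumptions: a two-argument calling convention, dccp available on the worker
    # node, exit code 0 on success and non-zero on failure.
    import os
    import subprocess
    import sys
    import time

    RETRIES = 3         # assumed retry count
    RETRY_DELAY = 30    # seconds between attempts

    def lsm_get(source, destination):
        """Copy one file out of dCache onto local scratch using dccp."""
        destdir = os.path.dirname(destination)
        if destdir and not os.path.isdir(destdir):
            os.makedirs(destdir)
        for attempt in range(1, RETRIES + 1):
            rc = subprocess.call(["dccp", source, destination])
            if rc == 0 and os.path.exists(destination):
                return 0
            sys.stderr.write("lsm-get: attempt %d failed (rc=%d)\n" % (attempt, rc))
            if attempt < RETRIES:
                time.sleep(RETRY_DELAY)
        return 1

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            sys.stderr.write("usage: lsm-get <source> <destination>\n")
            sys.exit(2)
        sys.exit(lsm_get(sys.argv[1], sys.argv[2]))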

last week:

  • No updates

this week:

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting (this week from Elena Korolkova):
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=102418
    
    1)  7/21: Transfer errors between NET2 and BNL - issue with DNS name resolution along the transfer path.  Problem resolved.   ggus 60351 (closed), eLog 14946.
    2)  7/22: From Michael at BNL:
    There are currently some transfer failures from/to BNL due to high load on the postgres database associated with the dCache namespace manager.
    Later: Transfer efficiency is back to normal (>95%). eLog 14988.
    3)  7/23: Transfer errors at SLAC:
    [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin'up after 3 tries]
    Issue resolved.  ggus 60436 (closed), eLog 15033.
    4)  7/23: Update to the system for site configuration changes in PanDA (Alden).  See:
    https://twiki.cern.ch/twiki/bin/view/Atlas/SchedConfigNewController#Configuration_Modification
    https://twiki.cern.ch/twiki/bin/view/Atlas/SchedConfigNewController#Configuration_Files
    https://twiki.cern.ch/twiki/bin/view/Atlas/SchedConfigNewController#New_Queue_InsertionDNS
    5)  7/25: BNL - Issue with DNS name resolution on acas1XXX hosts.  Problem fixed.  ggus 60450 (closed), eLog 15070, 74.
    6)  7/25 - 7/26: MWT2_IU transfers errors.  Issues resolved.  eLog 15118.
    7)  7/25 - 7/27: MWT2_UC - issue with the installation of release IBLProd_15_6_10_4_7_i686_slc5_gcc43_opt.  Sarah and Xin resolved the problem.  ggus 60449 (closed), eLog 15063.
    8)  7/27: MWT2_IU - PRODDISK errors:
    ~160 transfers failed due to:
    SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://iut2-dc1.iu.edu:8443/srm/managerv2]. Givin' up after 3 tries].  ggus 60601 (open), eLog 15156.
    
    Follow-ups from earlier reports:
    (i)  6/25: BNL - file transfer errors such as:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long].  From Hiro:
    BNL dCache does not allow logical file names longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of the file name.
    (Users should not put all metadata of files in the filename itself.) 
    I have contacted the DQ2 developers to limit the length.  Savannah 69217, eLog 14016.
    7/7: any updates on this issue?
    7/26: Savannah 69217 closed.
    (ii)  7/12-13: OU_OCHEP_SWT2: BeStMan/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), 
    currently being tracked in ggus 60047, RT 17509, eLog 14551.
    Update 7/14: issue still under investigation.  RT 17509, ggus 60047 closed.  Now tracked in RT 17568, ggus 60272.
    (iii)  7/14: OU - maintenance outage.  eLog 14568.
    Update 7/14 afternoon from Karthik:
    OU_OCHEP_SWT2 is back online now after the power outage. It should be ready to put back into production. Maybe a few test jobs to start with and if everything goes as expected then we 
    can switch it into real/full production mode?  Ans.: initial set of test jobs failed with LFC error.  
    Next set submitted following LFC re-start.
    (iv)  7/21: BNL - dCache maintenance outage, 21 Jul 2010 08h00 - 21 Jul 2010 18h00.
    Update: completed as of ~6:00 p.m. EST.  eLog 14935, Savannah 115814.
    
    • Still working on SRM issues at OU - Hiro and Horst to follow up offline.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=103131
    
    1)  7/28: From Charles at MWT2:
    After running a large number of test jobs, we have converted the schedconfig entries for ANALY_MWT2 to use remote IO (a.k.a. direct
    access) using dcap protocol, and are bringing ANALY_MWT2 back online.
    See the field 'copysetup' here:
    http://panda.cern.ch:25980/server/pandamon/query?tp=queue&id=ANALY_MWT2
    2)  7/29: From Michael:
    Transfers to/from MWT2/UC are currently failing. Site admins are investigating.
    Later:
    Charles fixed the problem. Transfers to MCDISK resumed.  eLog 15245.
    3)  7/30: MWT2_IU transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://iut2-dc1.iu.edu:8443/srm/managerv2]. Givin' up after 3 tries].  Issue resolved - from Sarah:
    The Chimera java process was unresponsive, causing SRM copies to fail. I've restarted Chimera and transfers are succeeding. We're continuing to investigate the cause.  ggus 60689 (closed), eLog 15271.
    4)  8/2: Charles at MWT2 noticed that jobs were running which used release 15.6.11.4; however, the release installation was still in progress - seems like the jobs started prematurely.  Installation completed, issue resolved.  
    eLog 15337/70, ggus 60736 (closed).
    5)  8/2: From Bob at AGLT2:
    At 6:10am the NFS servers hosting the VO home directories went down.  It was an odd thing.  The gatekeeper could continue to write to the mounted partition, but many WNs could not find it.  Consequently several hundred Condor jobs went into "hold".
    This was repaired around 11am.  Examination revealed that the gatekeeper files were flushed to the NFS server when the latter came back online, and I simply released all held Condor jobs to run, which they now seem to be doing quite happily.
    Auto-pilots were briefly stopped while the NFS server was brought back online.  These have been re-enabled and normal operations have resumed at AGLT2.
    6)  8/2: BNL - job failures due to "EXEPANDA_JOBKILL_NOLOCALSPACE."  From John at BNL:
    The problem was actually NFS disk space for the home directory of the production user. We have doubled the amount of space and that should resolve the issue.  ggus 60743 (closed), eLog 15360.
    7)  8/2: Monitoring for the DDM deletion service upgraded:
    (http://bourricot.cern.ch/dq2/deletion/)
    8)  8/3: New pilot release from Paul (v44h) -
    * The root file identification has been updated - a file is identified as a root file unless the substrings '.tar.gz', '.lib.tgz', '.raw.' (upper or lower case) appear in the file name. This covers DBRelease, user lib and ByteStream files. (A sketch of this rule appears at the end of this report.)
    9)  8/3: NET2 - DDM transfer problem, issue resolved by re-starting BeStMan.  eLog 15411.
    10)  8/3: AGLT2 - maintenance work on a dCache server at MSU.  Completed mid-afternoon.
    11)  8/3: FT failures from BNL to the tier2's - from Michael:
    The issue is understood and corrective actions have been taken. The failure rate is rapidly declining and is expected to fade away over the course of the ~hour.  eLog 15415.
    12)  8/3: FT failures at UTA_SWT2 - from Patrick:
    There was a problem in the networking at UTA_SWT2 that was causing a problem with DNS in the cluster and hence a problem with mapping Kors' cert.  The issue has been resolved and I have activated the FTS channel from UTA_SWT2 to BNL.  
    I will check it over the next couple of hours to verify everything is working ok.  ggus 60837 & RT 17727 (closed), eLog 15457.
    
    Follow-ups from earlier reports:
    (i)  7/12-13: OU_OCHEP_SWT2: BeStMan/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), 
    currently being tracked in ggus 60047, RT 17509, eLog 14551.
    Update 7/14: issue still under investigation.  RT 17509, ggus 60047 closed.  Now tracked in RT 17568, ggus 60272.
    Update, 8/2: issues with checksums now appear to point to underlying storage or file system issues.   See:
    https://ticket.grid.iu.edu/goc/viewer?id=8961
    (ii)  7/14: OU - maintenance outage.  eLog 14568.
    Update 7/14 afternoon from Karthik:
    OU_OCHEP_SWT2 is back online now after the power outage. It should be ready to put back into production. Maybe a few test jobs to start with and if everything goes as expected then we can switch it into real/full production mode?  
    Ans.: initial set of test jobs failed with LFC error.  Next set submitted following LFC re-start.
    (iii)  7/27: MWT2_IU - PRODDISK errors:
    ~160 transfers failed due to:
    SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://iut2-dc1.iu.edu:8443/srm/managerv2]. Givin' up after 3 tries].  ggus 60601 (open), eLog 15156.
    Update, 8/1: issue resolved, ggus 60601 closed.
    
    • Already reported above in production
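Item 8 above spells out the pilot v44h root-file identification rule precisely; the sketch below restates it in Python as a convenience. The function name is ours, not the pilot's, and this is not the pilot source code.

    # Sketch of the pilot v44h rule quoted in item 8: a file is treated as a ROOT
    # file unless its name contains '.tar.gz', '.lib.tgz' or '.raw.' (case-insensitive).
    NON_ROOT_MARKERS = ('.tar.gz', '.lib.tgz', '.raw.')

    def is_root_file(filename):
        lowered = filename.lower()
        return not any(marker in lowered for marker in NON_ROOT_MARKERS)

    # usage sketch (hypothetical file names)
    assert is_root_file("user10.SomeUser.AOD._00001.pool.root")
    assert not is_root_file("DBRelease-10.8.1.tar.gz")                                  # DBRelease
    assert not is_root_file("user10.SomeUser.analysis.lib.tgz")                         # user lib
    assert not is_root_file("data10_7TeV.00160387.physics.merge.RAW._lb0001._0001.1")   # ByteStream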

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Meeting yesterday: OU networking problems being investigated: packet loss somewhere along the paths.
    • Illinois issues are being investigated by local network techs, who are willing to apply some of the suggested Cisco changes
    • MWT2 still has an asymmetry in bandwidth between inbound and outbound.
    • lhcmon at BNL crashing, possibly due to load
    • perfSONAR RC2 preserves the existing perfSONAR setup; 3.2 should be released sometime in August. It is CentOS-based instead of Knoppix-based, and can be installed via CD or to a hard drive and upgraded with yum
    • AGLT2 testing a single perfSONAR box for both latency and bandwidth testing on the same machine
    • DYNES project submission to NSF MRI: received final notice of funding from NSF; kickoff meeting this Friday; messages about the project and planning expected to go out next week. The project is to create and provide a distributed instrument to dynamically allocate circuits.

  • this week:
    • Work at the OU site to determine the cause of the asymmetry
    • Illinois - waiting for a config change in mid-August
    • Also an asymmetry at UC, though not as severe
    • Meeting next Tuesday.
    • DYNES - dynamic provisioning, funded under the NSF MRI program; 40 sites will be instrumented

Site news and issues (all sites)

  • T1:
    • last week(s): All resources installed and in service for 2010. The new Condor version allows group quotas and resource shifting; 5K job slots can be allocated to analysis or production. Pedro working on staging services and performance improvements for tape/disk pool movement, plus more large-scale testing of all-tape retrieval.
    • this week: Working on optimizing the balancing of jobs across queues; pilot submission was an issue, addressed by Xin. An auto-pilot factory is being set up. Set up a well-connected host for data caching.

  • AGLT2:
    • last week: Performing well, no major problems. More shelves purchased, not yet in production. The UM site is space constrained and must retire equipment to make room. Needs the Dell matrix to be populated with new CPUs and better power data for certain models.
    • this week: Working on preparing for the next purchase round. Migrating some data in order to move out obsolete data. A few more shelves to bring into production. Updating the VM setup; will update the hardware supporting it.

  • NET2:
    • last week(s):
    • this week: The first of two new storage racks is up and running. Working on performance issues; it should be online soon. Life is much easier with PD2P in effect; however, there are fewer analysis jobs. Will bring up the HU analysis queue when John returns. SRM interruption - new checksum implemented.

  • MWT2:
    • last week(s): schedconfig work with Charles and Alden; Chimera testing continues; some postgres improvements; remote I/O enabled at ANALY_MWT2; testing libdcap++
    • this week: still working on Chimera investigations of slow deletes; new hardware arrived for the dCache head node; working on Maui adjustments

  • SWT2 (UTA):
    • last week: Space-related issues; deletion going on now, retired non-spacetoken data. Hope to retire an old storage rack (40TB) and replace it with a newer rack (200TB). Another issue is the SAM test failing sporadically; timeouts occur regularly, possibly network related. Maui work being done to redistribute jobs between nodes.
    • this week: Still working on bringing the 200 TB of new storage online.

  • SWT2 (OU):
    • last week: Timeout issues with DDM - the xrdadler32 checksum is causing timeouts, 20+ minutes per 2 GB file. Many suggestions for improvements: dd to test the disks, a different adler32 implementation, a different block size (a sketch of the block-size test follows this site's entries).
    • this week: Investigating a network instability issue with the head node. Checksum scalability issues. May need to update Lustre (hopefully it supports extended attributes for storing checksums).
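One of the suggestions above (a different adler32 implementation and block size) is easy to prototype; the sketch below computes an adler32 checksum with zlib using a configurable read block size, as a way to test whether block size rather than the algorithm dominates the 20+ minute xrdadler32 times. The 64 MB default is an assumption for testing, not a recommendation from the meeting.

    #!/usr/bin/env python
    # Sketch: adler32 checksum with a configurable block size, for comparing
    # against xrdadler32 timings.  The 64 MB default block size is an assumption.
    import sys
    import zlib

    def adler32(path, blocksize=64 * 1024 * 1024):
        """Return the adler32 checksum of `path` as an 8-character hex string."""
        value = 1   # adler32 seed
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                value = zlib.adler32(block, value)
        return "%08x" % (value & 0xffffffff)

    if __name__ == "__main__":
        for name in sys.argv[1:]:
            print("%s %s" % (adler32(name), name))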

  • WT2:
    • last week(s): New storage being installed; the storage group is looking at new hardware; deletions occurring. Analysis pilots not coming in fast enough regardless of nqueue; discussion of multi-payload jobs.
    • this week: Problem with SRM yesterday - BeStMan ran out of file descriptors; not sure of the cause, consulting Alex. All storage components are in place; the network and storage groups are working on bringing them online. Expect it online in about 2 weeks.
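A quick way to watch for the kind of file-descriptor exhaustion BeStMan hit is sketched below: it counts the open descriptors of a given PID under /proc and compares them to the soft limit in /proc/<pid>/limits. Linux-only, requires permission to read the process's /proc entries, and the 90% warning threshold is an arbitrary illustrative choice.

    #!/usr/bin/env python
    # Sketch: warn when a process (e.g. the BeStMan/SRM java process) approaches
    # its open-file-descriptor limit.  Linux-only; the 90% threshold is illustrative.
    import os
    import sys

    def open_fds(pid):
        """Number of file descriptors currently open for `pid`."""
        return len(os.listdir("/proc/%d/fd" % pid))

    def fd_limit(pid):
        """Soft 'Max open files' limit for `pid`, read from /proc/<pid>/limits."""
        with open("/proc/%d/limits" % pid) as f:
            for line in f:
                if line.startswith("Max open files"):
                    return int(line.split()[3])   # fourth field is the soft limit
        return None

    if __name__ == "__main__":
        pid = int(sys.argv[1])
        used, limit = open_fds(pid), fd_limit(pid)
        print("pid %d: %d of %s descriptors in use" % (pid, used, limit))
        if limit and used > 0.9 * limit:
            print("WARNING: descriptor usage is above 90% of the limit")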

Topic: production/analysis (Michael)

last week:
  • All Tier 2s need to provide 1000 analysis job slots
  • Issues raised: pilot submission, I/O efficiency (copy vs. direct read)
  • CVMFS discussion from the software managers meeting - specifically, finding someone to test it and understand its restrictions

this week:

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of validating process, completeness of releases on sites, etc.
  • last meeting(s)
    • Has checked with Alessandro - no show stoppers - but wants to check on a Tier 3 site
    • Will try UWISC next
    • PoolFileCatalog creation - the US cloud uses Hiro's patch; for now, a cron job will be run to update the PFC on the sites (a sketch of the catalog format appears below).
    • Alessandro will prepare some documentation on the new system.
    • Will do BNL first - maybe next week.
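For reference, the sketch below shows the general shape of the POOL XML file catalog that such a cron job would rewrite on each site; the GUID/PFN/LFN values are placeholders, and this is not Hiro's patch or Alessandro's tool, just an illustration of the catalog layout.

    # Illustration only: write a minimal PoolFileCatalog.xml of the kind a cron
    # job could refresh on a site.  The entries are placeholders.
    HEADER = ('<?xml version="1.0" encoding="UTF-8" standalone="no" ?>\n'
              '<!DOCTYPE POOLFILECATALOG SYSTEM "InMemory">\n'
              '<POOLFILECATALOG>\n')
    FOOTER = '</POOLFILECATALOG>\n'

    def write_pfc(entries, path="PoolFileCatalog.xml"):
        """entries: {guid: (pfn, lfn)} -> write a POOL XML catalog to `path`."""
        with open(path, "w") as f:
            f.write(HEADER)
            for guid, (pfn, lfn) in sorted(entries.items()):
                f.write('  <File ID="%s">\n' % guid)
                f.write('    <physical>\n      <pfn filetype="ROOT_All" name="%s"/>\n    </physical>\n' % pfn)
                f.write('    <logical>\n      <lfn name="%s"/>\n    </logical>\n  </File>\n' % lfn)
            f.write(FOOTER)

    # usage sketch with placeholder values
    write_pfc({"01234567-89AB-CDEF-0123-456789ABCDEF":
               ("dcap://dcache.example.org:22125/pnfs/example.org/data/cond.pool.root",
                "cond.pool.root")})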
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

AOB

  • last week
  • this week


-- RobertGardner - 02 Aug 2010



Attachments


dcache-access.2010.08.04.pptx.pdf (213.5K) - RobertGardner, 05 Aug 2010 - 13:06
 