
MinutesJul28

Introduction

Minutes of the Facilities Integration Program meeting, July 28, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Shawn, Arman, Mark, Karthik, Aaron, Charles, Nate, Sarah, Michael, Bob, Patrick, Marco, Jason, Justin, Wei, Hiro, Fred
  • Apologies: Rob (vacation)

Integration program update (Rob, Michael)

  • IntegrationPhase14 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • IntegrationPhase14 - call for T2 site volunteers and related, supporting Tier 3 program (Tier3IntegrationPhase2)
      • Report pledged capacities for 2011 by the end of September. Official experiment requests are being reviewed and scaled to US numbers; the pledged capacities must be installed by April 2011.
        • Disk storage capacity will ramp to 1.6 PB, then to 2.0 PB in 2012.
        • Planned numbers for 2012 also need to be reported.
      • Last week was a good week for the LHC, with a 1.6x10^31 luminosity run. The current stop will end tomorrow evening at CERN. The last 30 days have delivered the majority of the luminosity. Over the weekend there will be special runs - LAr32, Level-1 trigger, 13 MB/event, with a rate to Tier 0 of 2 GB/s (nominal 300 MB/s); see the rate estimate after this list. Apart from the physics interest, we want to see the response of the facility components to this rate.
      • ICHEP - lots of good presentations and physics results, a number of high priority tasks executed at the facility.
    • this week
      • Wei offers to have SLAC host the next facilities workshop on Wednesday and Thursday, October 13th and 14th; housing should be arranged ASAP
      • The LHC technical stop completed last Thursday; work continues on high-intensity stable beams, with possibly more data tonight
      • The ATLAS special runs were either canceled or did not take place
      • ICHEP quite successful, many good presentations
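
A quick sanity check of the special-run figures quoted above (13 MB/event exported to Tier 0 at roughly 2 GB/s, against a nominal 300 MB/s) gives the implied event rate. A minimal sketch using only the numbers recorded in these notes:

      # Rough estimate of the event rate implied by the special-run numbers
      # above: 13 MB/event written to Tier 0 at ~2 GB/s (nominal 300 MB/s).
      EVENT_SIZE_MB = 13.0        # LAr/Level-1 special-run event size, from the notes
      RATE_TO_T0_MB_S = 2000.0    # ~2 GB/s special-run export rate
      NOMINAL_MB_S = 300.0        # nominal export rate, for comparison

      event_rate_hz = RATE_TO_T0_MB_S / EVENT_SIZE_MB
      print("implied event rate: %.0f Hz" % event_rate_hz)                  # ~150 Hz
      print("factor above nominal rate: %.1fx" % (RATE_TO_T0_MB_S / NOMINAL_MB_S))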

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • A draft users' guide to T3g is here
last week(s):
  • Reviewing the Tier 3 design - the basics are covered, but there are still things to sort out.
  • Data management at Tier 3s remains a major concern - in particular, how well is XrootdFS working?
  • Regarding funding: most sites have not received funding. Evaluating Puppet as a technology for installing nodes.
  • Will start contacting Tier 3s later this week to assess progress.
  • Working groups gave final reports
  • Data management - exploring what's available in Xrootd itself; will be writing down some requirements.
this week:
  • Doug and Rik are traveling back from the software managers' workshop

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • Production is on/off again, mostly off.
    • There is an opportunity for regional production requests
    • Poorly defined validation jobs with lots of errors
    • Analysis is picking up again. PD2P is working well. Noted that 600/900 subscriptions were previously going unused.
  • this week:
    • Production sporadic in US and all clouds
    • Known problem with Transform Errors, not site issues, tickets submitted and issue is being resolved
    • Checksum issue at OU is being worked out in a GOC ticket
    • OU_OSCAR needs new releases installed before test jobs can run properly
    • Meeting minutes from today will be posted later
    • Many user jobs failing at sites, at least at SLAC and BNL
    • Investigation of some of these jobs determined that jobs are being submitted which are not an appropriate use of the resources
    • The BNL long queue was blocked by a single user submitting jobs that got stuck and were only killed by the pilot after 12 hours
    • Need to determine how we handle these sorts of jobs and what the appropriate response is: contact analysis shifters, educate users
    • Encourage better jobs: jobs that are more efficient in data use and length of processing
    • Sites could have the right/ability to kill/remove jobs which are running inefficiently (a sketch of such a check follows this list).
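
The discussion above about handling inefficient user jobs boils down to comparing CPU time against wall-clock time from batch-system records. A minimal sketch of such a check, assuming hypothetical job records with cpu_seconds and wall_seconds fields; the field names and thresholds are illustrative, not an agreed policy:

      # Illustrative check for "inefficient" analysis jobs: flag jobs whose
      # CPU/wall-time ratio stays low after a grace period. Field names and
      # thresholds are hypothetical stand-ins for real batch-system data.
      GRACE_SECONDS = 2 * 3600      # ignore jobs younger than two hours
      MIN_EFFICIENCY = 0.10         # flag jobs using less than 10% of a core

      def flag_inefficient(jobs):
          """Return the job records that look stuck or grossly inefficient."""
          flagged = []
          for job in jobs:
              wall = job["wall_seconds"]
              if wall < GRACE_SECONDS:
                  continue
              if job["cpu_seconds"] / float(wall) < MIN_EFFICIENCY:
                  flagged.append(job)
          return flagged

      sample = [
          {"id": 1, "user": "alice", "cpu_seconds": 7000, "wall_seconds": 7200},
          {"id": 2, "user": "bob",   "cpu_seconds": 300,  "wall_seconds": 43200},
      ]
      for job in flag_inefficient(sample):
          print("candidate for follow-up or kill: job %(id)s (user %(user)s)" % job)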

Data Management & Storage Validation (Kaushik)

dCache local site mover development (Charles)

Work to explore/develop a common local-site-mover for dcache sites.

last week:

  • Pedro has put the BNL local site mover implementation on svn@cern

this week:

  • No updates
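
For context, a local site mover is a thin, site-provided wrapper that the pilot calls to move a file between the worker node and the local storage element. A minimal sketch of what a dCache-flavored "get" wrapper could look like, assuming a dccp-based copy and an lsm-get <source> <destination> calling convention; this is an illustration, not the BNL implementation referenced above:

      #!/usr/bin/env python
      # Illustrative lsm-get-style wrapper for a dCache site: copy a file
      # from the SE to the worker node with dccp and sanity-check the result.
      import os
      import subprocess
      import sys

      def lsm_get(source, destination):
          """Copy source (e.g. a dcap:// URL or pnfs path) to the worker node."""
          ret = subprocess.call(["dccp", source, destination])
          if ret != 0:
              return ret                   # propagate the dccp exit code
          if not os.path.exists(destination) or os.path.getsize(destination) == 0:
              return 1                     # treat a missing or empty file as failure
          return 0

      if __name__ == "__main__":
          if len(sys.argv) != 3:
              sys.exit("usage: lsm-get <source> <destination>")
          sys.exit(lsm_get(sys.argv[1], sys.argv[2]))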

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=101862
    
    1)  7/15: AGLT2 - dCache maintenance outage.  Work completed, back on-line.
    2)  7/15: New pilot releases from Paul -
    (v44f):
    * A site mover for using CHIRP either as primary or secondary output file transfer has been included. Secondary output transfer refers to the transfer of an output file to a user specified destination (for fast file access) after a successful primary transfer to the SE. 
    Rodney Walker has prepared a wiki with more details:
    https://twiki.cern.ch/twiki/bin/view/Atlas/ChirpForUserOutput
    The CHIRP site mover is also expected to be used for primary output file transfers for CERNVM from the user machine to an intermediary storage area. A stand-alone tool (based on PanDA pilot) currently in development will be responsible for the final transfer 
    from the intermediary storage area to the SE.
    * In preparation for AthenaMP, the pilot now sets an env variable (ATHENA_PROC_NUMBER) for the number of cores. If new schedconfig.corecount is set, the pilot will use this number. If it is set to -1, the pilot will grab the number of cores using /proc/cpuinfo 
    (courtesy of Adrian Taga and Douglas Smith). 
    If not set, the env variable will not be set either.
    (v44g):
    * String conversion correction in file size check in new ChirpSiteMover.
    * Remote I/O tests at ANALY_MWT2 revealed a problem (uninitialized variable) with the LocalSiteMover.
    3)  7/15 - 7/19: Various DDM issues at BNL.  See discussion (thread) in eLog 14801.  ggus 60170.
    4)  7/19: SMU_LOCALGROUPDISK ddm errors.  Problematic subscriptions were canceled, ggus 60223 (closed), eLog 14869.
    5)  7/19: From John at NET2:
    HU_ATLAS_Tier2-lsf is back on-line after gratia filled our disk this morning.  I see our error rate rising, but I believe all the problems are solved now.  I'll have to bring this up with the OSG folks, since gratia has taken out our gatekeeper multiple times.  
    In this case, in just over a week, gratia dumped 5 GB of files, 
    including over 1 million files flat in one directory.
    6)  7/19 - 7/20: DDM errors at BNL-OSG2_DATADISK, such as:
    SRC SURL: srm://dcsrm.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/...
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:
    [GENERAL_FAILURE] AsyncWait] ACTIVITY: Data Consolidation
    From Pedro:
    There was a problem with a pool.  This has been resolved.  ggus 60249 (closed), eLog 14833.
    7)  7/21: BNL - dCache maintenance outage, 21 Jul 2010 08h00 - 21 Jul 2010 18h00.
    
    Follow-ups from earlier reports:
    (i)  6/25: BNL - file transfer errors such as:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long].  From Hiro:
    BNL dCache does not allow logical file names longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of the file name. (Users should not put all metadata of files in the filename itself.)  
    I have contacted the DQ2 developers to limit the length.  
    Savannah 69217, eLog 14016.
    7/7: any updates on this issue?
    (ii)  7/12-13: OU_OCHEP_SWT2: best/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), currently being tracked in ggus 60047, RT 17509, eLog 14551.
    Update 7/14: issue still under investigation.  RT 17509, ggus 60047 closed.  Now tracked in RT 17568, ggus 60272.
    (iii)  7/14: OU - maintenance outage.  eLog 14568.
    Update 7/14 afternoon from Karthik:
    OU_OCHEP_SWT2 is back online now after the power outage. It should be ready to put back into production. Maybe a few test jobs to start with and if everything goes as expected then we can switch it into real/full production mode?  
    Ans.: initial set of test jobs failed with LFC error.  
    Next set submitted following LFC re-start.
    (iv)  7/14: MWT2 maintenance outage - from Aaron:
    We will be taking a downtime tomorrow, July 14th starting at 9AM Central. This downtime will take all day, while we migrate our name services from PNFS to Chimera at this site.
    Update 7/15 from Aaron: The network disruption was fixed, and the upgrade is complete. MWT2_IU is now back online.  eLog 14649.
    
    
    • Still working on SRM issues at OU - Hiro and Horst to follow up offline.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting (this week from Elena Korolkova):
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=102418
    
    1)  7/21: Transfer errors between NET2 and BNL - issue with DNS name resolution along the transfer path.  Problem resolved.   ggus 60351 (closed), eLog 14946.
    2)  7/22: From Michael at BNL:
    There are currently some transfer failures from/to BNL due to high load on the postgres database associated with the dCache namespace manager.
    Later: Transfer efficiency is back to normal (>95%). eLog 14988.
    3)  7/23: Transfer errors at SLAC:
    [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin'up after 3 tries]
    Issue resolved.  ggus 60436 (closed), eLog 15033.
    4)  7/23: Update to the system for site configuration changes in PanDA (Alden).  See:
    https://twiki.cern.ch/twiki/bin/view/Atlas/SchedConfigNewController#Configuration_Modification
    https://twiki.cern.ch/twiki/bin/view/Atlas/SchedConfigNewController#Configuration_Files
    https://twiki.cern.ch/twiki/bin/view/Atlas/SchedConfigNewController#New_Queue_InsertionDNS
    5)  7/25: BNL - Issue with DNS name resolution on acas1XXX hosts.  Problem fixed.  ggus 60450 (closed), eLog 15070, 74.
    6)  7/25 - 7/26: MWT2_IU transfer errors.  Issues resolved.  eLog 15118.
    7)  7/25 - 7/27: MWT2_UC - issue with the installation of release IBLProd_15_6_10_4_7_i686_slc5_gcc43_opt.  Sarah and Xin resolved the problem.  ggus 60449 (closed), eLog 15063.
    8)  7/27: MWT2_IU - PRODDISK errors:
    ~160 transfers failed due to:
    SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://iut2-dc1.iu.edu:8443/srm/managerv2]. Givin' up after 3 tries].  ggus 60601 (open), eLog 15156.
    
    Follow-ups from earlier reports:
    (i)  6/25: BNL - file transfer errors such as:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long].  From Hiro:
    BNL dCache does not allow logical file names longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of the file name. 
    (Users should not put all metadata of files in the filename itself.) 
    I have contacted the DQ2 developers to limit the length.  Savannah 69217, eLog 14016.
    7/7: any updates on this issue?
    7/26: Savannah 69217 closed.
    (ii)  7/12-13: OU_OCHEP_SWT2: best/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), 
    currently being tracked in ggus 60047, RT 17509, eLog 14551.
    Update 7/14: issue still under investigation.  RT 17509, ggus 60047 closed.  Now tracked in RT 17568, ggus 60272.
    (iii)  7/14: OU - maintenance outage.  eLog 14568.
    Update 7/14 afternoon from Karthik:
    OU_OCHEP_SWT2 is back online now after the power outage. It should be ready to put back into production. Maybe a few test jobs to start with and if everything goes as expected then we can switch it 
    into real/full production mode?  
    Ans.: initial set of test jobs failed with LFC error.  
    Next set submitted following LFC re-start.
    (iv)  7/21: BNL - dCache maintenance outage, 21 Jul 2010 08h00 - 21 Jul 2010 18h00.
    Update: completed as of ~6:00 p.m. EST.  eLog 14935, Savannah 115814.
    
    • Already reported above in production
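
The pilot v44f note in last week's summary above describes how ATHENA_PROC_NUMBER is chosen for AthenaMP: use schedconfig.corecount if set, count cores from /proc/cpuinfo if corecount is -1, and leave the variable unset otherwise. A minimal sketch of that decision logic; the corecount argument stands in for whatever the pilot reads from schedconfig:

      # Sketch of the ATHENA_PROC_NUMBER logic from the pilot v44f notes above:
      # use corecount if set, count cores from /proc/cpuinfo if it is -1,
      # otherwise leave the environment variable unset.
      import os

      def count_cores_from_proc_cpuinfo():
          with open("/proc/cpuinfo") as cpuinfo:
              return sum(1 for line in cpuinfo if line.startswith("processor"))

      def set_athena_proc_number(corecount):
          """corecount stands in for the schedconfig.corecount field."""
          if corecount is None:
              return                        # leave ATHENA_PROC_NUMBER unset
          if corecount == -1:
              corecount = count_cores_from_proc_cpuinfo()
          os.environ["ATHENA_PROC_NUMBER"] = str(corecount)

      set_athena_proc_number(-1)
      print(os.environ.get("ATHENA_PROC_NUMBER"))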

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Please use DaTRI for subscription requests, in general.
    • Issue of getting analysis output back to Tier 3 sites if the source is a Tier 2 in another cloud. If the site is in ToA, use a normal DaTRI subscription request. For non-ToA transfers, the suggestion was to send to the closest Tier 2 and then dq2-get or dq2-FTS from there.
    • We need some testing here, and to organize a plan. Follow-up in a couple of weeks
    • Otherwise no big issues.
    • Bob: what is the correct procedure for disabling FTS during a maintenance? Blacklist? Hiro: should do both. Will send an email summarizing procedures.
  • this meeting:
    • From the last software week: the ATLASSCRATCHDISK area should be world-writable in both storage and LFC; otherwise people cannot do dq2-put or remove files using DDM. Details will follow via email
    • BDII issues reported/discussed earlier: tests were failing to register test results; a bug was fixed and gridview should now show test results for every hour. Notify Hiro of any missing test results
    • Errors where a site is shown as missing site data are still being tracked down
    • Direct transfers from Tier 2s to non-cloud sites caused a dCache crash at MWT2; too many transfers at once can overwhelm a site. FTS is being restricted to limit concurrent transfers on the STAR channels to avoid this state (a sketch of the general idea follows this list)
    • The FTS channel state (on/off) can be seen on the FTS monitor website's main page.
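
The MWT2 incident above is the usual failure mode when too many simultaneous transfers hit one storage endpoint; in production the control is the per-channel file limit in FTS. As an illustration of the same idea on the client side only, a sketch that caps concurrent copies with a semaphore; the copy tool, limit and URL pairs are placeholders:

      # Illustration of capping concurrent transfers so a storage endpoint is
      # not overwhelmed. The real control is the FTS channel file limit; this
      # just shows the same throttling idea client-side with placeholders.
      import subprocess
      import threading

      MAX_CONCURRENT = 10                       # placeholder concurrency limit
      slots = threading.BoundedSemaphore(MAX_CONCURRENT)

      def transfer(source, destination):
          with slots:                           # at most MAX_CONCURRENT at a time
              subprocess.call(["globus-url-copy", source, destination])

      def run_all(pairs):
          threads = [threading.Thread(target=transfer, args=pair) for pair in pairs]
          for t in threads:
              t.start()
          for t in threads:
              t.join()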

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Network asymmetries - is ESnet involved? Dave (Illinois) is investigating; possible issue with a campus switch
    • Notes from meeting:
      
      
    • From Jason (on travel this week):
      • An RC of the pS Performance Toolkit v3.2 was announced last week. We are currently asking ATLAS not to adopt this version yet (even for testing) since it lacks upgrade capability. The next RC (due in a week or two) will start to make this option available, and I will work with BU/OU/UMich/MSU/IU to test it out.
      • Working with OU/BNL on an ongoing performance problem (appears to be low, sporadic loss that is not allowing full TCP performance). I will be getting back to Horst and Karthik very soon with some results and a suggestion to get ESnet involved to examine some trouble spots.
    • New hardware box - replacing two systems with a single server; the developers are working on splitting the tasks among the available cores.

  • this week:
    • Meeting yesterday: OU networking problems being investigated: packet loss somewhere along the paths.
    • Illinois issues are being investigated by local network techs, willing to apply some of the Cisco changes suggested
    • MWT2 still has asymmetry in bandwidth between inbound and outbound (see the sketch after this list).
    • lhcmon at BNL crashing, possibly due to load
    • perfSONAR RC2 preserves the existing perfSONAR setup; 3.2 should be released sometime in August. CentOS-based instead of Knoppix; can be installed via CD or to a hard drive and upgraded with yum
    • AGLT2 testing a single perfSONAR box for both latency and bandwidth testing on the same machine
    • DYNES project submission to NSF MRI: received final notice of funding from NSF; kickoff meeting this Friday; messages about the project and planning expected to go out next week. The project is to create and provide a distributed instrument to dynamically allocate circuits.
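
For the bandwidth-asymmetry and packet-loss investigations above, the usual first step is to measure each direction separately with iperf against a server on the far end. A minimal sketch wrapping the client call; the host name is a placeholder, an iperf server must already be listening remotely, and testing the reverse direction means running the same test from the other side:

      # Run an iperf client test toward a remote bandwidth-test host and print
      # the raw report (Mbit/s). Host and duration are placeholders; testing
      # the opposite direction requires running the client from the far end.
      import subprocess

      REMOTE_HOST = "bwctl.example.org"   # placeholder bandwidth-test host
      DURATION_S = "20"

      def run_iperf(host, seconds):
          cmd = ["iperf", "-c", host, "-t", seconds, "-f", "m"]
          return subprocess.check_output(cmd).decode()

      print(run_iperf(REMOTE_HOST, DURATION_S))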

Site news and issues (all sites)

  • T1:
    • last week(s): Currently in maintenance - primary focus is to work on consistency and integrity. Databases dumped and restored, vacuumed. Deploying 2000 cores, going into production today, expect to finish by the end of the week.
    • this week: All resources installed and in service for 2010. The new Condor version supports group quotas, which allows resource shifting - e.g., 5K job slots can be allocated to analysis or production. Pedro is working on staging services and performance improvements for tape/disk pool movement, plus more large-scale testing of all-tape retrieval.

  • AGLT2:
    • last week: Dumping and restoring postgres database, recovering about 50%. Billing DB was the culprit. Upgrading dCache. May deploy another SSD for OSG APP. Looking into planning for next purchase round.
    • this week: Performing well, no major problems. More shelves purchased, not in production yet. The UM site is space-constrained; equipment must be retired to make room. Needs the Dell matrix to be populated with new CPUs and better power data for certain models.

  • NET2:
    • last week(s): Issue with gratia filling /tmp disk at HU. Karthik - sees similar issue at OU - in contact with Chris Green. (will submit a GOC ticket). HU site full of production jobs.
    • this week: No one from NET2

  • MWT2:
    • last week(s): Migrated from PNFS to Chimera. Ordered a new dCache head node. Studying SRM with load tests. Investigating direct-access dCache.
    • this week: schedconfig work with Charles and Alden, chimera testing continues, some postgres improvements, remote I/O enabled at ANALY_MWT2, testing libdcap++

  • SWT2 (UTA):
    • last week: Small outage yesterday to move a network connection in advance of 10G switchover; expect another outage next week. Analysis and production running fine over the last week.
    • this week: Space-related issues; deletion going on now, retired non-spacetoken data. Hope to retire an old storage rack (40 TB) and replace it with a newer rack (200 TB). The other issue is SAM tests failing sporadically - timeouts occur regularly, possibly network-related. Maui work being done to redistribute jobs between nodes.

  • SWT2 (OU):
    • last week: Working on a Bestman issue for several days; everything seems to be fine locally, but transfers are failing with timeouts. Will start an email thread to troubleshoot.
    • this week: Timeout issues with DDM; the xrdadler32 checksum is causing timeouts, 20+ minutes per 2 GB file. Many suggestions for improvements: dd to test the disks, a different adler32 implementation, a different block size (see the sketch after this list).

  • WT2:
    • last week(s): Have received all the new storage nodes; still awaiting networking equipment. This will bring SLAC to 1.4 PB when online.
    • this week: New storage being installed, storage group looking at new hardware, deletions occurring. Analysis pilots are not coming in fast enough regardless of nqueue; discussion of multi-payload jobs.
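
On the OU checksum problem above, one suggested test is a different adler32 implementation and block size. A minimal sketch that times zlib's incremental adler32 over a file with a configurable read size, which makes it easy to compare block sizes against the 20+ minutes per 2 GB file seen with xrdadler32; the file path is a placeholder:

      # Time an adler32 checksum with a configurable read block size, using
      # zlib's incremental adler32; useful for comparing against the slow
      # xrdadler32 behaviour reported at OU. The test file path is a placeholder.
      import time
      import zlib

      def adler32_file(path, block_size=4 * 1024 * 1024):
          checksum = 1                          # standard adler32 starting value
          with open(path, "rb") as f:
              while True:
                  block = f.read(block_size)
                  if not block:
                      break
                  checksum = zlib.adler32(block, checksum)
          return checksum & 0xffffffff          # force an unsigned 32-bit result

      path = "/tmp/testfile"                    # placeholder 2 GB test file
      start = time.time()
      print("adler32 = %08x" % adler32_file(path))
      print("elapsed: %.1f s" % (time.time() - start))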

Topic: production/analysis (Michael)

last week:
  • ATLAS has a 50/50 production/analysis split
  • Are we giving too small a share to analysis? Yes - some sites are running only a few hundred analysis jobs, well under 50%.
  • If there are issues to resolve we need to do this in the next few weeks.
  • Expect all sites to run 1000 concurrent analysis jobs.
  • At SLAC the limiting factors are disk space and fair-share priority for short-running jobs. Large number of activated jobs, few running.
  • What about multi-job pilots - Paul has a pilot release which does this.
  • Need to raise this at next meeting.

this week:

  • All Tier 2s need to provide 1000 analysis job slots
  • Issues raised: pilot submission, I/O efficiency (copy vs. direct read)
  • CVMFS discussion from the software managers' meeting, specifically finding someone to test it and understand its restrictions (see the sketch below)
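
On the CVMFS point above, the first thing a volunteer tester would check is simply whether the repository is visible and its release area readable on a worker node. A minimal sketch, assuming the ATLAS repository is mounted under /cvmfs/atlas.cern.ch (the path is an assumption for illustration):

      # Quick check that a CVMFS repository is visible and readable on this
      # node; the repository path below is an assumption for illustration.
      import os

      REPO = "/cvmfs/atlas.cern.ch"

      def check_cvmfs(repo):
          """Report whether the repository is readable and list its top level."""
          try:
              entries = os.listdir(repo)   # with autofs this also triggers the mount
          except OSError as err:
              return "CVMFS repository not available: %s" % err
          return "CVMFS OK, %d top-level entries, e.g. %s" % (len(entries), entries[:3])

      print(check_cvmfs(REPO))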

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of validating process, completeness of releases on sites, etc.
  • last meeting
    • Has checked with Alessandro - no show stoppers - but wants to check on a Tier 3 site
    • Will try UWISC next
    • PoolFileCatalog creation - the US cloud uses Hiro's patch; for now a cron job will run to update the PFC at the sites.
    • Alessandro will prepare some documentation on the new system.
    • Will do BNL first - maybe next week.
  • this meeting:
    • None

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week(s)
    • testing
      Site        Squid Installed   Squid Works   Fail-Over Works
      AGLT2       Yes               Yes           Yes
      ANL T3      Test jobs on analy queue don't start.
      BNL         Yes               Yes           No failover
      Duke        Missing 15.8.0
      Illinois    Test jobs on build failed / don't start
      MWT2 IU     Yes               Yes           Not tested
      MWT2 UC     Yes               Yes           Yes
      NET2 BU     Yes               Yes           Yes
      NET2 HU     Test jobs on analy queue don't start
      SWT2 CPB    Yes               Yes           No
      SWT2 UTA    Test job build failed
      SWT2 OU     No                Still upgrading hardware
      WT2         Yes               Yes           Yes
      
  • this week
    • No known issues for conditions data access
    • CVMFS discussed as distribution platform for pool condition files
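
For the squid and fail-over tests summarized in the table above, a basic reachability check is an HTTP fetch through the site squid with a fall-back to a direct fetch when the proxy fails. A minimal sketch; the proxy host, port and test URL are placeholders, not the actual Frontier/squid configuration:

      # Basic reachability check for a site squid used for conditions data:
      # try a fetch through the proxy, then fall back to a direct fetch.
      # The proxy and URL below are placeholders, not real Frontier settings.
      import urllib.request

      SQUID = "http://squid.example.org:3128"        # placeholder site squid
      TEST_URL = "http://frontier.example.org/test"  # placeholder test URL

      def fetch(url, proxy=None):
          handlers = [urllib.request.ProxyHandler({"http": proxy})] if proxy else []
          opener = urllib.request.build_opener(*handlers)
          return opener.open(url, timeout=30).status

      try:
          print("via squid: HTTP %s" % fetch(TEST_URL, SQUID))
      except Exception as err:
          print("squid fetch failed (%s), trying direct" % err)
          print("direct: HTTP %s" % fetch(TEST_URL))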

AOB

  • last week
  • this week
    • None


-- RobertGardner - 21 Jul 2010
