r4 - 16 Aug 2010 - 19:04:47 - MarkSosebee



Minutes of the Facilities Integration Program meeting, Aug 11, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Saul, Shawn, Rob, Aaron, Patrick, Michael, Tom, Hiro, Dave, Wensheng, Fred, Sarah,
  • Apologies: Horst, Wei, Mark, Nate, Rik

Integration program update (Rob, Michael)

  • IntegrationPhase14
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • LHC performance has been on/off - a new high-intensity run is underway.
      • Analysis jobs - the load has been spiky; the US is getting the most, but the distribution across T2s is uneven.
      • Site admins occasionally get email from M.E. - there are common issues across sites; for example, 3 sites failed yesterday. We need to reconsider reliability.
    • this week
      • Another face-to-face meeting October 13-14 at SLAC.
      • LHC luminosity record over the weekend (1 pb-1); discussions about increasing this further, possibly by increasing the center-of-mass energy; efficiency and reliability of the machine are improving, resulting in long runs.
      • T0 performance issues - prompt reconstruction events are overwhelming the T0 resources, requiring interruptions.
      • Lots of analysis data.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • Links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here
last week(s):
  • ARRA funds are beginning to materialize; there is a DOE site that tracks this information; funds are expected mid-August
  • Phase 1 version of Tier 3 is ready, updating documentation so that sites can start ordering servers
  • Phase 2 needs - analysis benchmarks for Tier 3 (Jim C); local data management; data access via grid (dq2-FTS); CVMFS - will need a mirror at BNL; effort to use Puppet to streamline T3 install and hardware recovery
  • Xrootd federation demonstrator project
  • UTA has set up a Tier 3 in advance of the ARRA funding.
this week:

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Problems with central deletion at SLAC, which has hard limits on space tokens - deletion starts too late and is not efficient enough; lower the deletion trigger threshold? Each space token size must be fixed.
    • Michael: sites are going down more frequently; mostly SRM related. We see Bestman failures particularly. We may need a concentrated effort here to resolve Bestman reliability problems. Wei will drive the issue.
  • this week:
    • No meeting this week due to ATLAS Americas workshop at UTA

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  7/28: From Charles at MWT2:
    After running a large number of test jobs, we have converted the schedconfig entries for ANALY_MWT2 to use remote IO (a.k.a. direct
    access) using dcap protocol, and are bringing ANALY_MWT2 back online.
    See the field 'copysetup' here:
    2)  7/29: From Michael:
    Transfers to/from MWT2/UC are currently failing. Site admins are investigating.
    Charles fixed the problem. Transfers to MCDISK resumed.  eLog 15245.
    3)  7/30: MWT2_IU transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://iut2-dc1.iu.edu:8443/srm/managerv2]. Givin' up after 3 tries].  Issue resolved - from Sarah:
    The Chimera java process was unresponsive, causing SRM copies to fail. I've restarted Chimera and transfers are succeeding. We're continuing to investigate the cause.  ggus 60689 (closed), eLog 15271.
    4)  8/2: Charles at MWT2 noticed that jobs were running which used a release whose installation was still in progress - it seems the jobs started prematurely.  Installation completed, issue resolved.  
    eLog 15337/70, ggus 60736 (closed).
    5)  8/2: From Bob at AGLT2:
    At 6:10am the NFS servers hosting the VO home directories went down.  It was an odd thing: the gatekeeper could continue to write to the mounted partition, but many WNs could not find it.  Consequently several hundred Condor jobs went "hold".
    This was repaired around 11am.  Examination revealed the gatekeeper files were flushed to the NFS server when the latter came back online, and I simply released all held Condor jobs to run, which they now seem to be doing quite happily.
    Auto-pilots were briefly stopped while the NFS server was brought back online.  These have been re-enabled and normal operations have resumed at AGLT2 .
    6)  8/2: BNL - job failures due to "EXEPANDA_JOBKILL_NOLOCALSPACE."  From John at BNL:
    The problem was actually NFS disk space for the home directory of the production user. We have doubled the amount of space and that should resolve the issue.  ggus 60743 (closed), eLog 15360.
    7)  8/2: Monitoring for the DDM deletion service upgraded:
    8)  8/3: New pilot release from Paul (v44h) -
    * The root file identification has been updated - a file is identified as a root file unless the substrings '.tar.gz', '.lib.tgz', '.raw.' (upper or lower case) appear in the file name. This covers DBRelease, user lib and ByteStream files.
    9)  8/3: NET2 - DDM transfer problem, issue resolved by re-starting BeStMan.  eLog 15411.
    10)  8/3: AGLT2 - maintenance work on a dCache server at MSU.  Completed mid-afternoon.
    11)  8/3: FT failures from BNL to the tier2's - from Michael:
    The issue is understood and corrective actions have been taken. The failure rate is rapidly declining and is expected to fade away over the course of the ~hour.  eLog 15415.
    12)  8/3: FT failures at UTA_SWT2 - from Patrick:
    There was a problem in the networking at UTA_SWT2 that was causing a problem with DNS in the cluster and hence a problem with mapping Kors' cert.  The issue has been resolved and I have activated the FTS channel from UTA_SWT2 to BNL.  
    I will check it over the next couple of hours to verify everything is working ok.  ggus 60837 & RT 17727 (closed), eLog 15457.
    Follow-ups from earlier reports:
    (i)  7/12-13: OU_OCHEP_SWT2: BeStMan/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), 
    currently being tracked in ggus 60047, RT 17509, eLog 14551.
    Update 7/14: issue still under investigation.  RT 17509, ggus 60047 closed.  Now tracked in RT 17568, ggus 60272.
    Update, 8/2: issues with checksums now appear to point to underlying storage or file system issues.   See:
    (ii)  7/14: OU - maintenance outage.  eLog 14568.
    Update 7/14 afternoon from Karthik:
    OU_OCHEP_SWT2 is back online now after the power outage. It should be ready to put back into production. Maybe a few test jobs to start with and if everything goes as expected then we can switch it into real/full production mode?  
    Ans.: initial set of test jobs failed with LFC error.  Next set submitted following LFC re-start.
    (iii)  7/27: MWT2_IU - PRODDISK errors:
    ~160 transfers failed due to:
    SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://iut2-dc1.iu.edu:8443/srm/managerv2]. Givin' up after 3 tries].  ggus 60601 (open), eLog 15156.
    Update, 8/1: issue resolved, ggus 60601 closed.
    • Still working on SRM issues at OU - Hiro and Horst to follow up offline.
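The root-file identification rule introduced in pilot v44h (item 8 above) amounts to a simple substring filter. The sketch below is an illustrative reconstruction, not the pilot's actual code:

```python
# Illustrative sketch (not the actual pilot code) of the v44h rule:
# a file is treated as a ROOT file unless its name contains one of the
# excluded substrings, matched case-insensitively.
EXCLUDED_SUBSTRINGS = ('.tar.gz', '.lib.tgz', '.raw.')

def is_root_file(lfn):
    """Covers the DBRelease, user lib, and ByteStream cases named above."""
    name = lfn.lower()
    return not any(s in name for s in EXCLUDED_SUBSTRINGS)

# DBRelease tarballs, user libs, and RAW/ByteStream files are excluded
# (file names here are hypothetical examples):
assert not is_root_file("DBRelease-10.2.1.tar.gz")
assert not is_root_file("user.somebody.0427112638.944666.lib.tgz")
assert not is_root_file("data10_7TeV.00158116.physics.RAW._lb0001.data")
# Everything else is assumed to be a ROOT file:
assert is_root_file("AOD.12345._000123.pool.root.1")
```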
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  8/4-8/5: IllinoisHEP - jobs were failing with the error "Remote and local checksums (of type adler32) do not match."  Issue resolved - from Dave at IllinoisHEP: The checksum errors were caused by a bad 10Ge network interface 
    on a dCache pool node.  This NIC has been removed and the pool node is currently using a 1Gb NIC until a replacement 10Gb NIC arrives.  ggus 60852 (closed), eLog 15438.
    2)  8/4-8/5: Upgrade of the DQ2 deletion agent caused problems with the ATLAS central catalogs.  The deletion service was temporarily stopped - s/w issue under investigation.  eLog 15494, Savannah 71070.
    3)  8/4 - 8/7: OU was planning to perform an upgrade to the Lustre file system, but it was decided to postpone this work pending a new release of the s/w (next couple of weeks).
    4)  8/5: BNL - Network maintenance was canceled due to technical issues - details:
    The technical problems that were encountered during the initial phase of the DL2 upgrade resulted in connectivity issues between the RCF and the rest of the BNL campus for an extended period. Connectivity between the US Atlas and the 
    wide area network was also affected, 
    but for a shorter period of time.  
    At this time, network connectivity has been restored and no further work on the network will be made today.
    5)   8/5: MWT2_UC - job failures with the error "Get error: Failed to get LFC replicas."  Issue resolved - from Aaron:
    These failures are likely due to an overzealous cleanup script which removed the job directory out from under the pilot. We have updated the script and should see these errors clear up once all jobs which have had this problem finish.  
    ggus 60891 (closed), eLog 15505.
    6)  8/6: DDM transfer errors at WISC like:
    failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].
    Issue resolved, from Wen:
    It's fixed. The SRM server had some problem.  ggus 60964 (closed), eLog 15544/46.
    7)  8/6: DDM transfer errors at MWT2_UC:
    failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries].
    Issue resolved - from Aaron:
    This was due to a stuck pnfsd instance, which was restarted and transfers are succeeding again.  ggus 60965 (closed), eLog 15548.
    8)  8/7-8/9: AGLT2 - job failures with the error "lfc-mkdir get was timed out after 600 seconds."  Issue resolved - from Bob:
    Our gatekeeper suffered through an outage, intermittent from Friday evening around 8pm EDT to Saturday morning at 6:30am when it stopped responding completely. gate01.aglt2.org was rebooted ~2:30pm on Sunday, EDT, and is now responding normally.  
    ggus 60968 (closed), eLog 15594.
    9)  8/8- 8/9: Transfer errors between BNL and MWT2_UC & SWT2_CPB - issues resolved.   See eLog 15582/601, ggus 60980 (closed).
    10)  8/9: New pilot release from Paul (v44i) -
    * The LFN length verification has been added to the pilot. In case of violation of the char limit, the pilot now sets error code 1190 / "LFN too long (exceeding limit of 150 characters)" with the corresponding proddb error EXEPANDA_LFNTOOLONG. 
    Panda monitor error codes + error wiki + Bamboo + proddb have been updated as well.
    11)  8/9: SWT2_CPB -  Transfer errors such as:
    [CONNECTION_ERROR] failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]. Givin' up after 3 tries].
    Issue resolved - from Patrick:
    The SRM service at SWT2_CPB became unresponsive. The service has been restarted after making a configuration change and transfers are succeeding.  ggus 61011, RT 17945 (both closed), eLog 15614.
    12)   8/10: From Saul at NET2, in response to SAM test failures at the site:
    We had a problem due to a bad ethernet port.  We appear to be OK now, but I'll keep an extra eye on things.
    13)  8/11 a.m.: MWT2_IU - job failures with the error "pilot: Put error: lsm-put failed (201): 201 Output file does not exist |Log put error: lsm-put failed (201): Output file does not exist."  Issue resolved - from Aaron:
    This was due to an error state in our SRM server. It has been restarted and the error is cleared.  ggus 61054 (closed), eLog 15690.
    Follow-ups from earlier reports:
    (i)  7/12-13: OU_OCHEP_SWT2: BeStMan/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), currently being tracked in ggus 60047, 
    RT 17509, eLog 14551.
    Update 7/14: issue still under investigation.  RT 17509, ggus 60047 closed.  Now tracked in RT 17568, ggus 60272.
    Update, 8/2: issues with checksums now appear to point to underlying storage or file system issues.   See:
    Update, 8/9 from Horst: progress with the slow adler32 checksum calculations - see details in RT 17568.
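The LFN length verification added in pilot v44i (item 10 above) is a simple guard on the name length. The following is an illustrative sketch, not the pilot's actual code:

```python
# Illustrative sketch (not the actual pilot code) of the v44i check:
# LFNs longer than 150 characters get error code 1190, corresponding
# to the proddb error EXEPANDA_LFNTOOLONG.
MAX_LFN_LENGTH = 150
ERR_LFNTOOLONG = 1190

def check_lfn_length(lfn):
    """Return (error_code, message); (0, "") means the LFN passes."""
    if len(lfn) > MAX_LFN_LENGTH:
        return (ERR_LFNTOOLONG,
                "LFN too long (exceeding limit of %d characters)" % MAX_LFN_LENGTH)
    return (0, "")

assert check_lfn_length("a" * 150) == (0, "")
assert check_lfn_length("a" * 151)[0] == ERR_LFNTOOLONG
```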

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Work at the OU site to determine the asymmetry
    • Illinois - waiting for a config change in mid-August
    • Also an asymmetry at UC, not as severe
    • Meeting next Tuesday.
    • DYNES - dynamic provisioning, funded NSF MRI program, 40 sites will be instrumented
  • this week:
    • Phone meeting notes:
      USATLAS Throughput Monitoring --- August 10, 2010
      Attending: Shawn, Andy, John, Dave, Sarah, Hiro
      Shawn was LATE...sorry everyone!   New system prevents meeting from starting without having the "chair" in the meeting first!  Next time Shawn will try to enter EARLY.   Sarah mentioned it might be good to have a "backup" chair...Shawn will need to look into whether two "accounts" can be used for a single meeting.  
      1) - Illinois has about another week before changes can be implemented.  Will cover status in the next meeting. 
        - OU Debugging (info from Jason's email) - The OU KOIs were re-located to a different part of the network after some debugging - they seem to be performing a little better than before (perhaps due to the first-hop switch?) but the problem still persists.  Right now I think Eli is working with GPN/ONEnet on a potential routing/switching fabric problem; after that he will start more testing from the OU core to a close ESnet machine (in Kansas) and then to BNL.
      2) Upgrade to perfSONAR 3.2 status.  Working well at BNL and AGLT2_UM so far.  Issue identified at both with the old URL no longer working.  Should be addressed in RC3?   John  planning to upgrade the other instance at BNL with RC2 since it is ready to go.  Jason (via email) had a question for the v3.2 release: Who is using PingER?   MWT2 is configured but not really using it.  AGLT2 has it configured and uses it sometimes.  Illinois not using it.  BNL not using.  Consensus was to allow the developers to delay implementing PingER until 3.2.1 comes out.  
      3) perfSONAR alerting - No activity from Sarah and Shawn yet...awaiting v3.2 release first.  Andy mentioned the Nagios plugin status.  Perhaps will be better documented and ready about the time of the 3.2 perfSONAR release.  Tom Wlodek (BNL) is willing to discuss adding a Nagios plugin to monitor US ATLAS perfSONAR instances.  Goal will be a central location to determine status of ALL USATLAS perfSONAR instances.  Alerting on "down" systems can be implemented but will need testing (perhaps alerting mailing list can just start out with Shawn until we know things are robust). 
       4) Round table discussing throughput-related issues.  John asked about new perfSONAR hardware status.  Update on the Dell R410 system: still TBD.  Goal is a single system capable of running both latency and bandwidth instances, segregated by IRQ binding and process binding to specific CPU cores.  The new hardware can also support 10GE, which could be used to let 10GE sites monitor bandwidth regularly.  Also a question about the new perfSONAR install-to-disk option.  Andy described that right now it will be done via a net-boot image; in the future, maybe a DVD-to-hard-disk install.  One possible nice feature is that YUM can be used to keep systems at sites up to date (this could also introduce problems which reduce robustness, so we have to be careful).   
      Plan is to meet again in two weeks.  Send along corrections and additions to these notes to the list.

Site news and issues (all sites)

  • T1:
    • last week(s): Working on optimizing the balancing of jobs across queues; pilot submission was an issue, addressed by Xin. Auto-pilot factory being set up. Set up a well-connected host for data caching.
    • this week: Reprocessing-from-tape exercise ongoing. Panda mover is used to stage data in for the reprocessing jobs; this is now done very efficiently with Pedro's changes. 30K completed reprocessing jobs per day with very good staging performance - 3 TB/hour coming off tape, distributed to servers at 1 GB/s.

  • AGLT2:
    • last week: Working on preparing for the next purchase round. Migrating some data in order to retire obsolete data. A few more shelves to bring into production. Updating the VM setup; will update the hardware supporting it.
    • this week: One dCache pool was misconfigured - it was brought in manually but with an old setup file; now fixed. Site is doing very well overall. Implemented the time-floor at 60 minutes (the length of time a pilot stays in the system). Reached 1600 jobs. Have a network issue - routing for the virtual circuit between BNL and UM was not correct. Working on the next purchase; need to get HS06 numbers. Looking at power consumption. Have two new servers: R710, X5650 processors, 24 GB RAM; 12 or 24 cores.

  • NET2:
    • last week(s): First of two new storage racks is up and running. Working on performance issues; should be online soon. Life is much easier with PD2P in effect, though there are fewer analysis jobs. Will bring up the HU analysis queue when John returns. SRM interruption - new checksum implemented.
    • this week: Brief local networking problem yesterday, otherwise things have been smooth. Expect new storage to be available tomorrow - will be a new DATADISK.

  • MWT2:
    • last week(s): Still investigating slow deletes in Chimera; new hardware arrived for the dCache head node; working on Maui adjustments.
    • this week: Continued problems with Chimera; in touch with the dCache team. Down for two hours. We have one pool offline. New storage head node is being built up this week. Remote-access issues being addressed.

  • SWT2 (UTA):
    • last week: Still working on bringing 200 TB of new storage online.
    • this week: Incident on Monday with BeStMan - it used too many file descriptors; restarted fine. Wei has a potential change for the xrootd data server.

  • SWT2 (OU):
    • last week: Investigating a network instability issue with the head node. Checksum scalability issues. May need to update Lustre (hopefully the new version supports extended attributes for storing checksums).
    • this week:
      But we're making pretty good progress with our site, so I'll give you a quick progress report here.
      As I already emailed to the RT ticket yesterday, we got the adler32 calculation and storing in an extended file attribute working, and we're now running an hourly cron job which does that on all new files in our SE, so that should already help quite a bit in terms of scaling.
      And I just worked out a way with Paul to have the pilot set the adler32 value on our SE at the end of each job, which should make this cron job unnecessary eventually; hopefully that will make it into the next pilot version next week, and it might be useful for other sites as well, depending on how or if they want to do something similar.
      So at this point we're waiting for Hiro to run some more transfer tests -- ideally starting slowly with 10 parallel transfers, to see how it goes -- and assuming that holds up, then set us to test again to see if everything holds up.
      We'll most likely have another 1-day downtime some time next week for a Lustre version upgrade, which will hopefully improve the storage throughput even more.

  • WT2:
    • last week(s): Problem with SRM yesterday, BM ran out of file descriptors; not sure of cause, consulting Alex. All storage components are in place, network and storage groups are working on bringing them online. Expect online in about 2 weeks.
    • this week: Storage installation in progress. Have set 'timefloor' and lowered nqueue for the analysis site; seems to be working. SRM space reporting uses a cache.
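The file-descriptor exhaustion seen with BeStMan at SWT2 and WT2 can be watched for with a generic Linux check that compares a process's open descriptors against its soft limit via /proc. This is an illustrative diagnostic sketch, not part of any BeStMan tooling:

```python
# Generic Linux diagnostic sketch (not BeStMan-specific) for spotting
# file-descriptor exhaustion before a service falls over.
import os

def fd_usage(pid):
    """Return (open_fds, soft_limit) for the given pid, read from /proc."""
    open_fds = len(os.listdir("/proc/%d/fd" % pid))
    soft_limit = None
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            # Line format: "Max open files   <soft>   <hard>   files"
            if line.startswith("Max open files"):
                soft_limit = int(line.split()[3])
    return open_fds, soft_limit

# Example: check this process itself; a monitor would check the SRM pid.
used, limit = fd_usage(os.getpid())
assert used > 0 and limit > 0
```

A cron or Nagios-style wrapper could alert when `used` approaches `limit`, prompting a restart before transfers start failing.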

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases on sites, etc.
  • last meeting(s)
    • Has checked with Alessandro - no showstoppers - but wants to check on a Tier 3 site
    • Will try UWISC next
    • PoolFileCatalog creation - the US cloud uses Hiro's patch; for now will run a cron job to update the PFC on the sites.
    • Alessandro will prepare some documentation on the new system.
    • Will do BNL first - maybe next week.
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

SVN access for scheddb

  • Site admins have access to update scheddb to change site configuration
  • Alden has been granting SVN access
  • Wensheng has offered to act as backup.


  • last week
  • this week

-- RobertGardner - 10 Aug 2010
