
MinutesJul21

Introduction

Minutes of the Facilities Integration Program meeting, July 21, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Rob, Torre, Shawn, Booker, Tom, Aaron, Charles, Nate, Armen, Mark, Karthik, Patrick, Horst, Xin, Wei, Fred, Rik, Doug, Sarah, Bob
  • Apologies: Jason, Saul

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Beam commissioning work at the LHC has been successful - good runs, with luminosity up to 10^30; some stability issues, but good progress.
      • No news yet on the next re-processing campaign.
      • WLCG collaboration meeting at Imperial College in London. Discussions were dominated by storage and data transfers. Fractional data access below the file level is being discussed; the over-arching theme is caching rather than large-scale pre-placement, and how to use the existing resources more efficiently. Many demonstrators have been proposed as follow-up to the Amsterdam brainstorming meeting - expect the next year to be interesting. Xrootd is a big topic, all over the place: a global redirector is being pushed by CMS, and with plugins other storage backends can be used. There is a search for industry standards; NFS v4.1 is promising, and a CERN-DESY partnership has formed - the idea is to optimize wide-area transfers as part of the data access/replication mechanisms available in NFS v4.1. The client and other missing pieces are becoming available. Wei is setting up a global redirector at SLAC. "Global dynamic inventory".
      • Quarterly reports are due. Part of this is to update the facilities spreadsheet (see CapacitySummary).
    • this week
      • IntegrationPhase14 - call for T2 site volunteers and related, supporting Tier 3 program (Tier3IntegrationPhase2)
      • Report pledged capacities for 2011 by the end of September. The official experiment requests are being reviewed and scaled to US numbers. The pledged capacities must be installed by April 2011.
        • Disk storage capacity will ramp to 1.6 PB, 2.0 PB in 2012.
        • Also, numbers for plans for 2012 need to be reported.
      • Last week was a good week for the LHC, with a run at 1.6x10^31. The current stop will end tomorrow evening at CERN. The last 30 days account for the majority of the delivered luminosity. Over the weekend there will be special runs - LAr32, Level 1 trigger, ~13 MB/event at a rate to Tier 0 of 2 GB/s (nominal 300 MB/s), i.e. roughly 150 events/s. Apart from the physics interest, we want to see the response of facility components to this rate.
      • ICHEP - lots of good presentations and physics results, a number of high priority tasks executed at the facility.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Have a finalized procedure for installing Tier 3s from scratch - it takes only one 8-hour day.
  • Tier-3 Panda to be added when Doug returns
  • Have been running functionality tests for Tier 3 (condor job submission, grid submission, etc.)
  • Xrootd demonstrator project - Doug setting up machines, will happen next week
  • manageTier3SW package will install all the ATLAS related software - to provide a uniform look-n-feel
this week:
  • Looking at the Tier 3 design - the basics are covered, but there are still things to sort out.
  • Data management at Tier 3s is a major worry. Sites are using XrootdFS, and it is not yet clear how well this is working.
  • Regarding funding: most sites have not received funding. Evaluating Puppet as a technology for installing nodes.
  • Will start contacting Tier 3s later this week to assess progress.
  • Working groups gave final reports
  • Data management - exploring what's available in Xrootd itself; will be writing down some requirements.

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=101231
    
    1)  7/9: IllinoisHEP - site had been off-line for software & hardware upgrades.  Test jobs successful - set back on-line.  eLog 14577.
    2)  7/9: BNL file transfer errors - from Michael:
    Due to high load on the dCache namespace manager we observe some transfer failures into and out of BNL.
    Issue resolved - eLog 14433.
    3)  7/10: BNL - jobs failing with lsm errors such as:
    !!WARNING!!2995!! lsm-put failed: time out after 5400 seconds
    Last error of this type occurred at around 16:25 UTC.  eLog 14452.
    Also, DDM errors at BNL-OSG2_MCDISK - from Pedro:
    Some storage services are being restarted to recover from the problem.  ggus 59963 (closed), eLog 14455.
    4)   7/10: BNL - from Pedro:
    The server holding the dCache namespace service had to be rebooted.  SRM was shut down during this period; expect failures.
    Restarts on 7/11,12 as well.  eLog 14516.
    5)  7/10 - 11: Large backlog of transferring jobs across most (all?) clouds.  Description of the problem:
    A blocked cron job was preventing output datasets from being frozen, which meant the dataset subscriptions could not proceed.
    Issue resolved, eLog 14505.
    6)  7/11 - 7/12: SLAC - some analysis jobs failed with the error "missing installation."  The s/w was installed, and subsequent attempts for the earlier failed jobs were successful.  ggus 59969 & RT 17492 (both closed).
    7)  7/12: pilot update from Paul (v44e):
    * Pilot is now sending batchID to the server using new job record field batchID. It is also sending the full pilot version string. Requested by Torre Wenaus.
    * Pilot is now cleaning up tcf_* files left behind from killed user jobs (which otherwise end up in the log tarball). Requested by Tadashi Maeno
    * Removed http://trshare.triumf.ca/~rodwalker from URL list in dynamic Pacman installation. Requested by Rodney Walker
    * Removed special command setup previously needed for setting up xrdcp at ANALY_CERN.
    * A killed job will now dump a stack trace. Requested by David Rousseau et al.
    8)  7/12: BNL - "lsm-put failed" errors - from Jane:
    Our PNFS server had been under very high load due to massive deletions, which caused timeouts for client requests. The server was rebooted and the situation has improved.  ggus 60035 (closed), eLog 14539.
    9)  7/12-13: OU_OCHEP_SWT2: BeStMan/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), currently being tracked in ggus 60047, RT 17509, eLog 14551.
    10)  7/13: DDM errors at WISC_GROUP ("source file doesn't exist").  Issue resolved - from Wen:
    It was caused by a data server failure. Now these files should be available.  ggus 60070 (closed), eLog 14575.
    11)  7/14: OU - maintenance outage.  eLog 14568.
    12)  7/14: MWT2 maintenance outage - from Aaron:
    We will be taking a downtime tomorrow, July 14th starting at 9AM Central. This downtime will take all day, while we migrate our name services from PNFS to Chimera at this site.
    
    Follow-ups from earlier reports:
    (i)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    Update 6/10, from Horst:
    We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system.
    Update 6/17: test jobs submitted which uncovered an FTS issue (fixed by Hiro).  As of 6/22 test jobs are still failing with ddm registration errors - under investigation.
    Update 7/6: additional incorrect entries in schedconfigdb discovered and were fixed (remnants from the pre-upgrade settings).  Will try new test jobs.
    Update 7/8: Latest modification to schedconfigdb seems to have resolved the last issue at OU.  Test jobs succeeded, site set back to on-line.
    eLog 14435.
    (ii)  6/12: WISC - job failures with errors like:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir.  Update 6/15 from Wen:
    The CS room network was still messed up by a power cut and the main servers are still not accessible. I have now brought up some other machines to run these services, and hope to have the main server back as soon as possible.  Savannah 115123 (open), eLog 13790.
    Update 7/13: no recent responses to this issue - give up.
    (iii)  6/25: BNL - file transfer errors such as:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long].  From Hiro:
    BNL dCache does not allow logical file names longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of the file names. (Users should not put all metadata of files in the filename itself.) 
    I have contacted the DQ2 developers to limit the length.  Savannah 69217, eLog 14016.
    7/7: any updates on this issue?
    (iv)  7/4 - 7/6: MWT2_UC - DDM errors:
    Failed to contact on remote SRM,  MWT2_UC_PERF-JETS space token.  From Aaron (7/6 p.m.):
    This was due to a load condition on our storage systems (uct2-dc1 and uct2-dc2) which caused many transfers to timeout, and a few to EOF.  These transfer errors ceased after the number of transfers decreased. We are currently doing another
    operation which may add some more load to these services, but this should be complete by the end of this evening and no further transfer errors are expected.  ggus 59690 (still open), Savannah 69500, eLog 14234/306.
    Update 7/7: high load condition resolved by site, dashboard shows efficient transfers over past 4 hours.  ggus 59690 closed.
    (v)  7/7: NET2 outage on Thursday, 7/8 - from Saul & John:
    ANALY_NET2, BU_ATLAS_Tier2o and HU_ATLAS_Tier2 will be down tomorrow to prepare our machine room for new storage racks.
    Update 7/8: outage ended, NET2 is back on-line.
    
    
    • Getting OU online
    • Trouble getting pilots at SLAC - is there a known problem somewhere? 10K jobs activated, only 100 running. Need to follow up with Xin.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=101862
    
    1)  7/15: AGLT2 - dCache maintenance outage.  Work completed, back on-line.
    2)  7/15: New pilot releases from Paul -
    (v44f):
    * A site mover for using CHIRP either as primary or secondary output file transfer has been included. Secondary output transfer refers to the transfer of an output file to a user specified destination (for fast file access) after a successful primary transfer to the SE. 
    Rodney Walker has prepared a wiki with more details:
    https://twiki.cern.ch/twiki/bin/view/Atlas/ChirpForUserOutput
    The CHIRP site mover is also expected to be used for primary output file transfers for CERNVM from the user machine to an intermediary storage area. A stand-alone tool (based on PanDA pilot) currently in development will be responsible for the final transfer 
    from the intermediary storage area to the SE.
    * In preparation for AthenaMP, the pilot now sets an env variable (ATHENA_PROC_NUMBER) for the number of cores. If the new schedconfig.corecount is set, the pilot will use this number; if it is set to -1, the pilot will grab the number of cores using /proc/cpuinfo (courtesy of Adrian Taga and Douglas Smith). If corecount is not set, the env variable will not be set either. (A minimal sketch of this logic appears at the end of this report.)
    (v44g):
    * String conversion correction in file size check in new ChirpSiteMover.
    * Remote I/O tests at ANALY_MWT2 revealed a problem (uninitialized variable) with the LocalSiteMover.
    3)  7/15 - 7/19: Various DDM issues at BNL.  See discussion (thread) in eLog 14801.  ggus 60170.
    4)  7/19: SMU_LOCALGROUPDISK ddm errors.  Problematic subscriptions were canceled, ggus 60223 (closed), eLog 14869.
    5)  7/19: From John at NET2:
    HU_ATLAS_Tier2-lsf is back on-line after gratia filled our disk this morning.  I see our error rate rising, but I believe all the problems are solved now.  I'll have to bring this up with the OSG folks, since gratia has taken out our gatekeeper multiple times.
    In this case, in just over a week, gratia dumped 5 GB of files, including over 1 million files flat in one directory.
    6)  7/19 - 7/20: DDM errors at BNL-OSG2_DATADISK, such as:
    SRC SURL: srm://dcsrm.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/...
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:
    [GENERAL_FAILURE] AsyncWait] ACTIVITY: Data Consolidation
    From Pedro:
    There was a problem with a pool.  This has been resolved.  ggus 60249 (closed), eLog 14833.
    7)  7/21: BNL - dCache maintenance outage, 21 Jul 2010 08h00 - 21 Jul 2010 18h00.
    
    Follow-ups from earlier reports:
    (i)  6/25: BNL - file transfer errors such as:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long].  From Hiro:
    BNL dCache does not allow logical file names longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of the file names. (Users should not put all metadata of files in the filename itself.) 
    I have contacted the DQ2 developers to limit the length.  
    Savannah 69217, eLog 14016.
    7/7: any updates on this issue?
    (ii)  7/12-13: OU_OCHEP_SWT2: BeStMan/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), currently being tracked in ggus 60047, RT 17509, eLog 14551.
    Update 7/14: issue still under investigation.  RT 17509, ggus 60047 closed.  Now tracked in RT 17568, ggus 60272.
    (iii)  7/14: OU - maintenance outage.  eLog 14568.
    Update 7/14 afternoon from Karthik:
    OU_OCHEP_SWT2 is back online now after the power outage. It should be ready to put back into production. Maybe a few test jobs to start with and if everything goes as expected then we can switch it into real/full production mode?  
    Ans.: initial set of test jobs failed with LFC error.  Next set submitted following LFC re-start.
    (iv)  7/14: MWT2 maintenance outage - from Aaron:
    We will be taking a downtime tomorrow, July 14th starting at 9AM Central. This downtime will take all day, while we migrate our name services from PNFS to Chimera at this site.
    Update 7/15 from Aaron: The network disruption was fixed, and the upgrade is complete. MWT2_IU is now back online.  eLog 14649.
    
    
    • Still working on SRM issues at OU
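    • A minimal sketch (in Python) of the corecount handling described in the pilot v44f item above, i.e. how ATHENA_PROC_NUMBER is derived from schedconfig.corecount. This is only an illustration of the described logic, not the actual pilot code; the function name and argument are made up for the example.

        import multiprocessing
        import os

        def set_athena_proc_number(corecount):
            """Illustrative sketch of the pilot v44f corecount handling."""
            if corecount is None:
                # schedconfig.corecount not set: leave ATHENA_PROC_NUMBER unset too
                return
            if corecount == -1:
                # grab the number of cores on the node (equivalent to reading /proc/cpuinfo)
                corecount = multiprocessing.cpu_count()
            os.environ["ATHENA_PROC_NUMBER"] = str(corecount)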

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • SAM test failures addressed - post Alessandro fixes
    • Checking into why site information is sometimes missing in BDII (according to error message)
    • FTS failure detector is still working - sites are turned off when the failure rate becomes excessive.
  • this meeting:
    • Please use DaTRI for subscription requests, in general.
    • Issue of getting analysis output back to Tier 3 sites when the source is a Tier 2 in another cloud. If the Tier 3 is in ToA, a normal DaTRI subscription request works. For non-ToA transfers, the suggestion was to subscribe the data to the closest Tier 2 and then use dq2-get or DQ2/FTS from there (a rough sketch of this decision appears after this list).
    • We need some testing here, and to organize a plan. Follow-up in a couple of weeks
    • Otherwise no big issues.
    • Bob: what is the correct procedure for disabling FTS during a maintenance? Blacklist? Hiro: should do both. Will send an email summarizing procedures.
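    • A rough sketch of the Tier 3 output-routing decision discussed above. All names here (the ToA membership set, the helper function, the site and dataset names) are hypothetical placeholders for illustration; the real steps are a DaTRI subscription request plus dq2-get or DQ2/FTS.

        TOA_SITES = {"SOME_T3_IN_TOA"}  # hypothetical ToA membership list

        def closest_tier2(t3_site):
            # placeholder: in practice the associated/closest Tier 2 is chosen by hand
            return "CLOSEST_TIER2"

        def deliver_output(dataset, t3_site):
            """Print the suggested delivery path for analysis output (illustration only)."""
            if t3_site in TOA_SITES:
                print("submit a normal DaTRI subscription of %s to %s" % (dataset, t3_site))
            else:
                t2 = closest_tier2(t3_site)
                print("subscribe %s to %s via DaTRI" % (dataset, t2))
                print("then pull with dq2-get (or use FTS) from %s to %s" % (t2, t3_site))

        deliver_output("user.someone.mydataset/", "MY_T3")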

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Network asymmetries - is ESnet involved? Dave (Illinois) investigating, possible issue with campus switch
    • Notes from meeting:
      	USATLAS Throughput Meeting Notes --- July 13, 2010
           ==================================================
      
      Attending: Shawn, Dave, Aaron, Karthik, Andy
      
      Excused: Jason, John, Horst
      
      1) Ongoing issues at Illinois and OU
      	a) Illinois ---  Andy and Dave working on the asymmetry between Illinois and sites on ESnet: outbound good but inbound is problematic.  Network buffers may not be large enough (especially Cisco 65xx switch/routers).  Dave reported there may be other issues causing packet loss that the Illinois network engineers are looking into.   Once other sources are resolved they will try the hold-queue settings.  Eli Dart and Joe Metzger are good sources for info on Cisco hold-queue changes.  Joe will update the ESnet "fasterdata" web pages with more detail on altering the Cisco hold-queue values. 
           b) OU --- Horst has seen some bad throughput numbers and the asymmetry at OU still seems to be present.  Karthik reports that the perfSONAR node is down now and a power outage is scheduled for tomorrow.  Tomorrow Karthik will look into why the node is down and bring it back up.  OneNet will provide access to their existing 1GE perfSONAR for dividing the network test into smaller segments.  **ACTION ITEM** Try bandwidth tests to/from the OneNet box and OU and BNL (Karthik, Horst, John, Jason).
      
      2) perfSONAR status---  New perfSONAR v3.2 RC1 is out but NOT ready for USATLAS use.  Need to wait for RC2 which will allow upgrading and preserving data.   Some changes in v3.2 for improving performance but will have to see the impact once we can test.
      
      3) perfSONAR monitoring
      
      4) Site reports - AOB.   Nothing to report.   Aaron reports MWT2 is running great.  Karthik confirmed he will diagnose the  perfSONAR box at OU tomorrow.
      
      Next meeting in two weeks at the usual time (3 PM Eastern).   Send along additions or corrections to the list.
      
      Thanks,
      
      Shawn

  • this week:
    • From Jason (on travel this week):
      • An RC of the pS Performance Toolkit v3.2 was announced last week. We are currently asking ATLAS not to adopt this version yet (even for testing) since it lacks upgrade capability. The next RC (due in a week or two) will start to make this option available, and I will work with BU/OU/UMich/MSU/IU to test it out.
      • Working with OU/BNL on an ongoing performance problem (appears to be low sporadic loss that is not allowing full TCP performance). I will be getting back to Horst and Karthik very soon with some results and a suggestion to get ESnet involved to examine some trouble spots. (A rough estimate of how little loss it takes to limit TCP throughput is sketched after this list.)
    • New hardware box - replacing two systems with a single server; the developers are working on splitting the task among the available cores.
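    • To illustrate why "low sporadic loss" matters so much: the standard Mathis et al. approximation bounds single-stream TCP throughput by MSS / (RTT * sqrt(loss)). The numbers below (1460-byte MSS, 40 ms RTT, 0.01% loss) are illustrative only, not measurements from the OU-BNL path.

        import math

        def mathis_throughput_mbps(mss_bytes, rtt_s, loss_rate):
            """Approximate upper bound on single-stream TCP throughput (Mbit/s)."""
            return (mss_bytes * 8 / rtt_s) / math.sqrt(loss_rate) / 1e6

        # Even 0.01% packet loss caps a single stream on a 40 ms path at ~29 Mbit/s:
        print(mathis_throughput_mbps(mss_bytes=1460, rtt_s=0.040, loss_rate=1e-4))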

Site news and issues (all sites)

  • T1:
    • last week(s): Network people are connecting the rest of the worker nodes. 20 Nexsan units are being racked, awaiting front-end servers (IBM servers, which will run OpenSolaris and ZFS). DDN is doing well. Partial power failure yesterday: lost a flywheel UPS, losing 1 MW. On a positive note, people on holiday responded quickly and brought back 5K disks in less than an hour.
    • this week: Currently in maintenance - primary focus is to work on consistency and integrity. Databases dumped and restored, vacuumed. Deploying 2000 cores, going into production today, expect to finish by the end of the week.

  • AGLT2:
    • last week: Running smoothly over the past weeks, good since we've been away. Alignment splitter. New baby boy (Zahary) at AGLT2! Congratulations to Tom!
    • this week: Dumping and restoring postgres database, recovering about 50%. Billing DB was the culprit. Upgrading dCache. May deploy another SSD for OSG APP. Looking into planning for next purchase round.

  • NET2:
    • last week(s): Will be taking a downtime tomorrow for machine room work. SRM transfers interrupted for a short while last week (user cert issue in the ATLAS VO). New server on the HU site to start up the analysis queue (2K job slots).
    • this week: Issue with gratia filling /tmp disk at HU. Karthik - sees similar issue at OU - in contact with Chris Green. (will submit a GOC ticket). HU site full of production jobs.

  • MWT2:
    • last week(s): Had a number of DDM transfers failing over the weekend, investigating - a large number of requests led to overloading of the dCache head nodes; also suspect the srmspacefile table is out of sync with pnfs. Brought another data server online (156 TB); another server was drained and its raids rebuilt (still in progress - expect online by week's end, adding another 156 TB; see the monitor at http://www.mwt2.org/sys/space). Problems with software raid-0/disk issues on about 32 compute nodes led to filesystem errors and ANALY job failures - reverting to the previous situation and studying the problem. Studies continue on xrootd using ANALY_MWT2_X. Several comparison hammercloud tests have been run.
    • this week: Migrated from pnfs to Chimera. Ordered a new dcache headnode.

  • SWT2 (UTA):
    • last week: Found and replaced a switch in the stack (bad flash); it has been stable operationally. Otherwise all is well. Received notice of a missing rpm on the compute nodes - tracked down.
    • this week: Small outage yesterday to move a network connection in advance of 10G switchover; expect another outage next week. Analysis and production running fine over the last week.

  • SWT2 (OU):
    • last week: Bringing the new OU cluster online - updating the LFC host name; the next step is to run more test jobs. Outage next Wednesday for UPS work.
    • this week: Working on a BeStMan issue for several days; everything seems to be fine locally, but transfers are failing with timeouts. Will start an email thread to troubleshoot.

  • WT2:
    • last week(s): Want to make space in groupdisk - doing some deletions; automatic deletion has not started. Had an AFS server failure, fortunately with no impact (no transfers were in flight at the time, no job failures). Setting up a global xrootd redirector for Tier 3 testing; there is no technical specification for this yet - discussions w/ Andy. Setting up a PROOF cluster: Dell R510s with 12 disks. Sent email to Walker about 6-core chips - expect an update early next week.
    • this week: Have received all the new storage nodes; still waiting on networking equipment. This will bring SLAC to 1.4 PB when online.

Topic: production/analysis (Michael)

  • ATLAS has 50/50 production/analysis
  • Are we giving too small a share to analysis? Yes - some sites are running only a few hundred analysis jobs, well under 50%.
  • If there are issues to resolve we need to do this in the next few weeks.
  • Expect all sites to run 1000 concurrent analysis jobs.
  • At SLAC - the limiting factors are disk space and fair-share priority for short-running jobs. Large number of activated jobs, few running.
  • What about multi-job pilots - Paul has a pilot release which does this.
  • Need to raise this at next meeting.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last meeting
    • Email alerts for release installation problems (w/ Alessandro's system): https://atlas-install.roma1.infn.it/atlas_install/protected/subscribe.php
    • OU, Illinois-T3, BNL testing
    • OU - waiting for full panda validation
    • Illinois - problems with jobs sent via WMS - lag in status updates
    • BNL - Alessandro tested poolfilecatalog creation - there were problems with the environment; Xin provided a patch.
    • Waiting for feedback from Alessandro
  • this meeting:
    • Has checked with Alessandro - no show stoppers - but wants to check on a Tier 3 site
    • Will try UWISC next
    • PoolFileCatalog creation - the US cloud uses Hiro's patch; for now, a cron job will be run to update the PFC at the sites.
    • Alessandro will prepare some documentation on the new system.
    • Will do BNL first - maybe next week.

dCache local site mover plans (Charles)

last week:
  • Meeting with Pedro and Shawn - to create a unified local site mover
  • Been on vacation

this week:

  • Pedro has put the BNL local site mover implementation on svn@cern
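  • For context, a local site mover is a small script the pilot invokes to copy a job's output file into site storage and report the result through its exit code (the "lsm-put failed" messages in the shifter reports come from this layer). Below is a generic, assumption-heavy sketch of an lsm-put style mover; it is NOT the BNL implementation Pedro committed, and the transfer command, arguments and return codes are placeholders.

      import subprocess
      import sys

      TIMEOUT = 5400  # seconds; roughly the timeout behind the "lsm-put failed: time out" errors

      def lsm_put(source, destination):
          """Copy one output file into site storage; return 0 on success, non-zero on failure."""
          try:
              # a plain 'cp' stands in for the real transfer tool (dccp, xrdcp, ...)
              result = subprocess.run(["cp", source, destination], timeout=TIMEOUT)
          except subprocess.TimeoutExpired:
              return 1  # non-zero exit: the pilot treats this as a failed put
          return result.returncode

      if __name__ == "__main__":
          sys.exit(lsm_put(sys.argv[1], sys.argv[2]))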

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

AOB

  • last week
  • this week


-- RobertGardner - 20 Jul 2010
