
MinutesSep1

Introduction

Minutes of the Facilities Integration Program meeting, Sep 1, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Attending

  • Meeting attendees: Alden, Saul, Michael, Charles, Nate, Rik, Jim, Kaushik, Mark, Sarah, Wei, Justin, Booker, Armen, Bob, Shawn, Horst, Karthik, Dave, John
  • Apologies: Hiro, Jason

Integration program update (Rob, Michael)

  • IntegrationPhase14 (new)
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • OSG Storage Forum and US ATLAS Distributed Facility workshop, upcoming.
      • Benchmarks of new hardware - will start collecting these (Bob) - HepSpecBenchmarks.html. Email results to Bob.
      • LHC continues to make progress - a couple of good fills recently, with integrated luminosity ~3 pb-1. Beam intensity is being increased by adding bunches (50 as of yesterday); the goal is 400 bunches by the end of the year (with an fb-1-scale goal by the end of next year). Tier 0 still encounters problems with overall bandwidth - export bandwidth must sometimes be reduced, but they catch up quickly and data distribution then proceeds very quickly. Real physics amounted to ~36 hours over the week.
      • Anticipated program at SLAC: data distribution for analysis, PD2P and associated data handling issues, and dynamic data deletion. Some of the sites are close to their capacity limits.
      • Analysis performance as it pertains to data access. Tier 3 development and integration into the facility. Cross-cloud traffic to Tier 3s is now getting into the production picture.
      • Idle CPUs - the RAC has a scheme to backfill with additional production or other types of analysis work.
    • this week
      • CVMFS evaluation at the Tier 2s - OU, UC, and AGLT2 are interested in evaluating it. See TestingCVMFS for details.
      • October 12-13: next facility workshop.
      • WLCG is asking for US ATLAS pledges for 2011 and preliminary pledges for 2012 - ~12K HS06.
      • Open ATLAS EB meeting on Tuesday. The next reprocessing campaign is shaping up: 7 TeV runs will be used, ~1B events. October 20 is the deadline for building the dataset; Nov 26 is the reprocessing deadline, in time for the next major physics conferences (La Thuile, March 2011). 6-8 weeks for the simulation, mostly at the Tier 2s.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working groups' TWikis are here
  • The draft users' guide to T3g is here
last week(s):
  • Meeting this morning with OSG Tier 3 support, plus Rob and Hiro. Discussion about registration of Tier 3s, and also about support.
  • We expect a large number of sites that are not part of DDM but use gridFTP and the BNL FTS.
  • Existing Tier 3s have SRM and are part of DDM. For new ones we need a well-defined set of steps to become part of the DDM system; Rob will start a page for this.
  • For gridFTP-only sites, we discussed how they should get registered.
  • Panda Tier 3 sites appear as ANALY queues, but we don't want DAST supporting them - how should this be handled?
this week:
  • OSG Storage Forum coming up - will get in contact with Tanya about a Tier 3 talk and requirements.
  • Working on a plan to deal with local data management.
  • Most attention at the moment is on figuring out funding for Tier 3s. Several sites have ordered their equipment.
  • Would like Panda monitoring to separate out Tier 3 sites, so that they are not monitored as part of the facility.

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • No meeting this week
    • Sites are full. Lots of deletions in progress.
    • Issue of sites w/o explicit space token boundaries has not been resolved - will need ADC intervention.
  • this week:
    • Armen: other clouds have the same problem. Armen and Wensheng are deleting manually when space gets too full; the problem is the deletion service itself. USERDISK has seen the most deletions (a sketch of this kind of cleanup policy follows this list).
    • Small transfer requests to Tier 3s were discussed in the RAC. Stephane notes these cannot be controlled on a site-by-site basis; the limit is 1/2 TB. Requires ACLs to be set correctly in the Tier 3 LFC.
    • The same policy applies for LOCALGROUPDISK.
    • Investigating many problems "awaiting subscriptions" - not an approval process issue.
    • Discussed the approval process - VOMS-based approval for a local site data manager. This is the long-term plan.
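    • The manual cleanups above amount to a simple policy: when a space token like USERDISK nears capacity, remove the least-recently-used datasets until usage drops below a target. A minimal Python sketch of that logic, with hypothetical dataset records and a placeholder delete_dataset() helper - an illustration only, not the actual DQ2/DDM deletion service:

      # Illustrative sketch of a greedy cleanup policy for a nearly full space token.
      # Dataset records and delete_dataset() are placeholders; real deletions go
      # through the DQ2/DDM machinery, not this script.
      from datetime import datetime

      TARGET_FRACTION = 0.85  # stop once usage falls below 85% of capacity

      def delete_dataset(name):
          print("would delete", name)  # placeholder for the real deletion call

      def cleanup(datasets, capacity_tb):
          """datasets: list of dicts with 'name', 'size_tb', 'last_access'."""
          used = sum(d["size_tb"] for d in datasets)
          for d in sorted(datasets, key=lambda d: d["last_access"]):  # LRU first
              if used <= TARGET_FRACTION * capacity_tb:
                  break
              delete_dataset(d["name"])
              used -= d["size_tb"]
          return used

      if __name__ == "__main__":
          example = [
              {"name": "user10.alice.test.0001", "size_tb": 2.0, "last_access": datetime(2010, 6, 1)},
              {"name": "user10.bob.test.0002", "size_tb": 1.5, "last_access": datetime(2010, 8, 20)},
          ]
          print("usage after cleanup (TB):", cleanup(example, capacity_tb=4.0))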

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting: he's on vacation this week.  See summary from Alessandra Forti here:
    http://www-hep.uta.edu/~sosebee/ADCoS/World-wide-Panda_ADCoS-report-%28Aug17-23-2010%29.txt
    
    1)  8/18: WISC - DDM transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  
    Resolved as of 8/23, ggus 61283 closed, eLog 16167.  (Note: ggus 61352 may be a related ticket?)
    2)  8/18 - 8/19: SWT2_CPB - Source file/user checksum mismatch errors affecting DDM transfers.  Issue resolved - from Patrick: The RAID in question rebuilt because of a failed drive and the problem has disappeared.  ggus 61249, RT 1791 both closed, eLog 15965.
    3)  8/18 - 8/19: SWT2_CPB - fiber cut on a major AT&T segment in the D-FW area.  Reduced network capacity during this time, so Hiro reduced FTS transfers to single file to help alleviate the load.  Cut fiber repaired, system back to normal.
    4)  8/19: OU_OCHEP transfer errors:
    [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server].  From Horst: This must've been another SE overload because of the Lustre timeout bug. It has
    resolved itself with an auto-restart of Bestman, therefore I turned the channel back on and am resolving this ticket.  ggus 61289, RT 17997 both closed, eLog 16019.
    5)  8/19: from Charles at UC:
    Our PNFS server at MWT2_UC is temporarily down due to a power problem. It will be back up ASAP. In the meanwhile, FTS channels for UC are paused.
    Later:
    Server is back up and channels are being reopened.
    6)  8/19: SRM problem at BNL - issue resolved.  From Michael:
    Most likely caused by a faulty DNS mapping file that was created by BNL's central IT services. Forensic investigations are still ongoing. Transfers resumed at good efficiency.
    7)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w releases 
    to be installed at OU_OSCER_ATLAS.  eLog 16119.
    8)  8/19: SWT2_CPB off-line for ~6 hours to add storage to the cluster.  Work completed, system back up as of late Thursday evening.  eLog 16072.
    9)  8/20 - 8/25: BNL - job failures with the error "Get error: lsm-get failed: time out."  Issue with a dCache pool - from Pedro (8/25):
    We still have a dcache pool offline.  It should be back by tomorrow noon.  In the meantime we move the files which are needed by hand but, of course, the jobs will fail on the first attempt.  ggus 61338/355, eLog 16154/230.
    10)  8/20: Job failures at IllinoisHEP with the error "lcg_cp: Communication error on send."  Issue resolved - from Dave:
    I believe this was due to some networking issues on campus.  ggus 61339 closed, eLog 16084.
    11)  8/23: BNL - file transfer errors from BNL to several destinations:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries].  From Michael:
    The problem is solved. The dCache core server lost connection with all peripheral dCache components. Transfers resumed after restart of the server.  ggus 61359 closed, eLog 16174.
    12)  8/24 - 8/25: SLAC - DDM errors: 'failed to contact on remote SRM': SLACXRD_DATADISK and SLACXRD_MCDISK.  From Wei:
    Disk failure. Have turned off FTS channels to give priority to RAID resync. Will turn them back online when resync finish.  ggus 61537 in progress, eLog 16223/33.
    
    Follow-ups from earlier reports:
    (i)  8/15-8/16: NERSC_HOTDISK request timeout transfer errors.  ggus 61148 (in progress), eLog 15906.
    Update from Hiro, 8/23: NERSC has fixed the problem. The transfers are working normally.  ggus 61148 closed.
    (ii)  8/16-8/17: MWT2_UC maintenance outage - from Aaron:
    We are taking a downtime tomorrow, Tuesday August 17th all day in order to migrate our PNFS service to new hardware.
    Site back on-line as of ~4:30 p.m. CST 8/17.  Some issues with pilots at the site - under investigation.
     Update, 8/24: Xin added a second submit host for ANALY_MWT2, as it appeared that a single one could not keep up, especially in situations where the analysis jobs are very short (<5 minutes, etc.) and cycle through the system at a high rate.  
     Also, the schedconfig variable "timefloor" (previously None) was set to 60 (i.e., pilots will run for at least 60 minutes if there are real jobs to pick up).  (A sketch of the timefloor logic follows this report.)
    (iii)  8/17: From Wei at SLAC:
    Due to power work we will need to shutdown some of our batch nodes. This will likely result in reduced capacity at WT2 (8/18 - 8/20). We may also use the opportunity to reconfigure our storage (If that happens, we will send out an outage notice).
    8/25: Presumably this outage is over?
    
     • Still working on SRM issues at OU - Hiro and Horst to follow up offline.
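     • On the timefloor item above: the idea is that a pilot keeps fetching jobs until a minimum wall-clock time has elapsed, rather than exiting after one short job. A minimal Python sketch of that logic, assuming a hypothetical get_job()/run_job() interface (not the actual PanDA pilot code):

       # Sketch of the "timefloor" idea: keep fetching jobs until a minimum
       # wall-clock time has passed, so very short analysis jobs do not require
       # a fresh pilot for every job. get_job()/run_job() are hypothetical stand-ins.
       import time

       TIMEFLOOR_MINUTES = 60  # the schedconfig value discussed above

       def get_job():
           return None  # placeholder: would ask the PanDA server for a job

       def run_job(job):
           pass  # placeholder: would execute the job payload

       def pilot_loop():
           start = time.time()
           while True:
               job = get_job()
               if job is None:
                   break  # no real jobs to pick up, so exit
               run_job(job)
               if (time.time() - start) / 60.0 >= TIMEFLOOR_MINUTES:
                   break  # minimum running time reached

       if __name__ == "__main__":
           pilot_loop()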
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting: not available this week (Yuri just back from vacation)
    1)  8/25: DDM failures on WISC_GROUP token - "cannot continue since no size has been returned after PrepareToGet or SrmStat."  Issue resolved.  ggus 61572 closed, eLog 16245.
    2)  8/26: ggus 60982 (transfer errors from SLACXRD_USERDISK to SLACXRD_LOCALGROUPDISK) was closed.  eLog 16268.
    3)   8/27: SWT2_CPB_USERDISK - several files were unavailable from this token.  Issue resolved - from Patrick:
    Sorry for the problems. The Xrootd system was inconsistent between the SRM and the disk contents. The issue has been resolved and the files are now available via the SRM.  (A 4 hour maintenance outage was taken 8/27 p.m. to fix this problem.)  
    ggus 61611 & RT 18043 closed, eLog 16349.
    4)  8/28: From Wei at SLAC: We will take WT2 down on Monday Aug 30, 10am to 5pm PDT (5pm - 12am UTC) for site maintenance and to bring additional storage online.  Maintenance completed on 8/30 - test jobs successful, 
    queues set back on-line as of ~1:30 p.m. CST.  eLog 16453.
    5)  8/29: BNL DDM errors - T0 exports failing to BNL-OSG2_DATADISK.  Initially seemed to be caused by a high load on pnfs01.  Problem recurred,
    from Jane: pnfs was just restarted to improve the situation.  ggus 61627 & goc #9155 both closed; eLog 16360/77.
    6)  8/30: BNL - upgrade of storage servers' BIOS - from Michael:
    This is to let shifters know that at BNL we are taking advantage of the LHC technical stop and conducting an upgrade of the BIOS of some of our storage servers. This maintenance is carried out in a "transparent" fashion, meaning the servers 
    will briefly go down for a reboot (takes ~10 minutes).  As files requested during that time are not available there may be a few transfer failures due to "locality is unavailable."  Maintenance completed - eLog 16419.
    7)  9/1: BNL - Major upgrade of the BNL network routing infrastructure that connects the US Atlas Tier 1 Facility at BNL to the Internet.  Duration: Wednesday, Sept 1, 2010 9:00AM EDT to 6:00PM EDT.  See eLog 16476.
    
    Follow-ups from earlier reports:
    (i)  8/18: WISC - DDM transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  
    Resolved as of 8/23, ggus 61283 closed, eLog 16167.  (Note: ggus 61352 may be a related ticket?)
    As of 8/31 ggus 61352 is still 'in progress'.
    (ii)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w 
    releases to be installed at OU_OSCER_ATLAS.  eLog 16119.
     As of 8/31 no updates about ATLAS s/w installs on OU_OSCER.
    (iii)  8/20 - 8/25: BNL - job failures with the error "Get error: lsm-get failed: time out."  Issue with a dCache pool - from Pedro (8/25):
    We still have a dcache pool offline.  It should be back by tomorrow noon.  In the meantime we move the files which are needed by hand but, of course, the jobs will fail on the first attempt.  ggus 61338/355, eLog 16154/230.
    As of 8/31 ggus 61338 'solved', 61355 'in progress'.
    (iv)  8/24 - 8/25: SLAC - DDM errors: 'failed to contact on remote SRM': SLACXRD_DATADISK and SLACXRD_MCDISK.  From Wei:
    Disk failure. Have turned off FTS channels to give priority to RAID resync. Will turn them back online when resync finish.  ggus 61537 in progress, eLog 16223/33.
    Update, 8/31: a bad hard drive was replaced - issue resolved.  ggus 61537 closed,  eLog 16327.
    
    • Quiet week.
     • Release brokering is nearly done (Alden).

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • USERDISK getting cleaned up at all sites
    • BNL-Italy low bandwidth still being investigated; discussed at WLCG management meeting - there are organizational issues. Will be discussed at WLCG Service Coordination meeting.
    • Hiro notes that multi-cloud transfers are very manpower intensive.
    • DQ2 accounting - looking at DATADISK and MCDISK - notes that a few sites are very close to the limit, in particular NET2 and SLAC.
    • Armen: relying on central deletion is not working. During last several weeks doing manual cleanups. After USERDISK cleanup will investigate DATADISK.
    • NET2 - moving storage into new rack of servers, so situation should improve.
  • this meeting:
    • Hiro reports that all US sites, including T3s, are now in the CERN BDII. Also, there is an issue with deletion of production data at BNL (at least) by the central deletion service; this could be a really bad bug and is under investigation by Vincent. In addition, central deletion has completely halted for US sites (including USERDISK) for an unknown reason; experts have been notified. In the near(?) future the deletion service will run at BNL instead of being located at CERN.
    • Armen: other clouds have the same problem. Armen and Wensheng are deleting manually when space gets too full; the problem is the service itself. USERDISK has seen the most deletions.

libdcap & direct access (Charles)

  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
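  • For reference, "direct access" in these HammerCloud tests means the analysis job opens files over dcap rather than copying them locally first. A minimal PyROOT sketch of that access pattern is below; the door host, port, and pnfs path are made-up examples, and ROOT needs its dCache plugin (built against libdcap) for this to work:

    # Minimal sketch of direct (dcap) read access, as exercised by the HC tests.
    # The dcap URL is a made-up example; ROOT's dCache plugin must be available
    # for TFile.Open to understand dcap:// paths.
    import ROOT

    url = "dcap://dcachedoor.example.edu:22125/pnfs/example.edu/data/user/somefile.root"
    f = ROOT.TFile.Open(url)
    if f and not f.IsZombie():
        f.ls()      # inventory of the file, read directly from the storage pool
        f.Close()
    else:
        print("could not open", url)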

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Work at the OU site to determine the source of the throughput asymmetry.
    • Illinois - waiting for a configuration change in mid-August.
    • Also an asymmetry at UC, though not as severe.
    • Meeting next Tuesday.
    • DYNES - dynamic provisioning, funded by the NSF MRI program; 40 sites will be instrumented.
    • Phone meeting notes (a Nagios-style threshold-check sketch, re item 3, follows below):
      USATLAS Throughput Meeting Notes --- August 24, 2010
      ====================================================
      
      Attending:  Shawn, Jason, Dave, Andy, Sarah, Horst, Karthik, Philippe, Aaron, Tom, Hiro
      
      1) perfSONAR identified issue resolution status:  
      	a) OU: Karthik reporting on OU testing.  Relocated test node which improved things.  Now using Tier-2 nodes directly testing to BNL (BNL->OU is OK, OU->BNL is poor).   Lots of tunings on OneNet and at BNL and OU.   Results from today suggest some issue in the tracepath but perhaps should stay focused on the current tools/tests for now.   Horst and Karthik are enabling  BWCTL on their 10GE hosts.  Hiro tested with real file transfers (BNL->OU is OK (1 file 100MB/sec, 5 streams), OU->BNL is poor performance (20MB/sec; same options; same file)).
           b) Illinois -  Configuration change made to the 6500...no change in dropped packets from BNL to Illinois (problem is opposite direction from OU).   Net admins see some possible issues but haven't had time to track things down yet and have a new testing node to help isolate the issue.  Trying to enable jumbo frames (MTU=9000).   (Bit of a discussion on jumbo frame issues seen at MWT2 and OU).   Tracepath may be useful to debug UC-IU jumbo problems involving PBS.   Some discussion about how NATs deal with jumbo frames on outside/normal frames on inside.  
      
      2) perfSONAR --- RC3 out maybe next week.  New version (CentOS/driver) seems to have changed the performance of the system at BNL (say 900 Mbps -> 700 Mbps ). This is possibly a show-stopper.  Trying new driver at BNL to see if it resolves things. Some new patches out to extract some timing info from old and current perfSONAR instances from Jason.  Philippe reported on patching/fixing systems at AGLT2_UM.   After 'myisamchk -er' on DB things were much faster.  Some nightly checks not in place or perhaps not working?   No progress yet on "single perfSONAR instance" Dell R410 at UM.  Waiting for the release to test it.  Very soon will have focused  access on testing/deploying on the R410.     Some discussion about install options for next perfSONAR release.   Jason is working on documenting the "to disk" install variant and it should be available when the release is ready. Will use a repo and YUM to maintain on disk version.  
      
      3) Monitoring perfSONAR ---  Plugins with "liveness" and thresholds are being worked on.  Work in progress to convert them into RPM.  Can be installed on perfSONAR host and/or Nagios server.  Plugins are perl scripts underneath.  Tom thought it would be straightforward to implement on the BNL Nagios server and can even provide a "perfSONAR" focused page showing just the perfSONAR related tests.  Andy mentioned that documentation will be ready in advance of the plugins and he will let us know so we can start discussing options.
      
      4) Round-table, site reports, other topics --- Hiro - Question: will perfSONAR be extended to each ATLAS site?  Jason:  perfSONAR MDM exists at Tier-1, Tier-0. Hiro points out problems with trying to debug Tier-1 issues and perfSONAR would be helpful.  Also question about Nagios plugin working with MDM variant; Andy: may work but slight differences may cause problems.   MWT2 has asymmetry involving UC, IU is OK.  
      
      Heads up: once the perfSONAR release is ready sites should be prepared to upgrade ASAP (within 1-2 weeks).  
      
      Plan to have our next meeting in 2 weeks.   Please send along any additions or corrections to these notes via email.
      
      Thanks,
      
      Shawn
    • DYNES - the first task is to form an external review committee, for example to select the sites participating in the infrastructure.
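    • On item 3 of the notes above (Nagios monitoring of perfSONAR): Nagios check plugins follow a standard convention - print a one-line status and exit 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. A minimal Python sketch of a throughput threshold check in that style is below; the thresholds are arbitrary examples, and the measured value is taken from the command line since the actual perfSONAR query interface is not shown in these notes.

      #!/usr/bin/env python
      # Sketch of a Nagios-style threshold check like the perfSONAR plugins
      # discussed above: print a single status line and use the standard exit
      # codes. The throughput value comes from the command line here; a real
      # plugin would query the perfSONAR measurement archive instead.
      import sys

      WARN_MBPS = 500.0   # arbitrary example thresholds
      CRIT_MBPS = 200.0

      def main():
          try:
              mbps = float(sys.argv[1])
          except (IndexError, ValueError):
              print("UNKNOWN - usage: check_throughput <Mbps>")
              return 3
          if mbps < CRIT_MBPS:
              print("CRITICAL - throughput %.0f Mbps" % mbps)
              return 2
          if mbps < WARN_MBPS:
              print("WARNING - throughput %.0f Mbps" % mbps)
              return 1
          print("OK - throughput %.0f Mbps" % mbps)
          return 0

      if __name__ == "__main__":
          sys.exit(main())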
  • this week:
    • From Jason:
      • We are still actively trying to figure out what is causing the performance problems between the new release and the KOI hardware.
      • The final release will be pushed back until we can fix this; RC3 will be out this week.
      • Still not recommending that the US ATLAS Tier 2s adopt it yet. Working with BNL/MSU/UM on testing solutions.

Site news and issues (all sites)

  • T1:
    • last week(s): Currently a problem with one of the storage pools - it is large, ~100K files/day to re-create the metadata. local-site-mover: Pedro is making progress on the handling of exceptions (have seen an error rate of 2-5%). Want data access problems to become invisible to jobs.
    • this week:

  • AGLT2:
    • last week: Putting in storage orders - R710 + MD1200 shelves. Also focused on network infrastructure upgrades and service nodes. Running with direct access for a while; have found some queuing of movers on the pools. (Note: MWT2 is running dcap with 1000 movers; there is a newer dcap version to test.)
    • this week:

  • NET2:
    • last week(s): 250 TB rack is online, DATADISK is being migrated. Working on a second rack. Working on HU analysis queue.
    • this week:

  • MWT2:
    • last week(s): Chimera issues resolved - due to a schema change; no crashes since. PNFS relocated to newer hardware, good performance.
    • this week:

  • SWT2 (UTA):
    • last week: 200 TB of storage added to the system - up and running without problems. Will be making modifications for multipilot.
    • this week:

  • SWT2 (OU):
    • last week: The Lustre upgrade seems to have stabilized things. Pilot update for the adler32 change (a checksum sketch follows below). Getting OSCER back online. Found problems installing new releases using Alessandro's system.
    • this week:
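    • On the adler32 item above: transfers are verified by comparing an adler32 checksum of the copied file against the catalogued value. A minimal Python sketch of computing that checksum with the standard zlib module (chunked, so large files need not fit in memory); the 8-digit zero-padded hex formatting is the usual convention, but treat the details as illustrative:

      # Sketch of an adler32 file checksum as used for grid transfer verification.
      # Uses Python's zlib; reads the file in chunks so it need not fit in memory.
      import zlib

      def adler32_of_file(path, blocksize=1024 * 1024):
          value = 1  # adler32 starting value
          with open(path, "rb") as f:
              while True:
                  chunk = f.read(blocksize)
                  if not chunk:
                      break
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xFFFFFFFF)

      if __name__ == "__main__":
          import sys
          print(adler32_of_file(sys.argv[1]))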

  • WT2:
    • last week(s): Disk failure yesterday - had to shut down FTS for a while; fixed. Installing additional storage next week.
    • this week:

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last meeting(s)
    • Has checked with Alessandro - no show stoppers - but wants to check on a Tier 3 site
    • Will try UWISC next
    • PoolFileCatalog creation - the US cloud uses Hiro's patch; for now, a cron job will be run to update the PFC at the sites (a sketch of the catalog format follows this list).
    • Alessandro will prepare some documentation on the new system.
    • Will do BNL first - maybe next week.
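    • For reference, the PFC refreshed by the cron job is an XML catalog mapping file GUIDs to physical and logical file names. A minimal Python sketch that writes such a fragment is below; the GUID, PFN, and LFN are made-up examples, and it illustrates the usual POOL catalog layout rather than Hiro's actual patch:

      # Sketch: write a minimal PoolFileCatalog.xml mapping a GUID to a PFN/LFN.
      # The entries are made-up examples; the real catalog is produced by the
      # PFC-update machinery (Hiro's patch / the cron job mentioned above).
      import xml.etree.ElementTree as ET

      def make_pfc(entries, outfile="PoolFileCatalog.xml"):
          """entries: iterable of (guid, pfn, lfn) tuples."""
          catalog = ET.Element("POOLFILECATALOG")
          for guid, pfn, lfn in entries:
              f = ET.SubElement(catalog, "File", ID=guid)
              phys = ET.SubElement(f, "physical")
              ET.SubElement(phys, "pfn", filetype="ROOT_All", name=pfn)
              logical = ET.SubElement(f, "logical")
              ET.SubElement(logical, "lfn", name=lfn)
          ET.ElementTree(catalog).write(outfile)

      if __name__ == "__main__":
          make_pfc([("ABCD1234-0000-0000-0000-000000000000",
                     "dcap://door.example.edu:22125/pnfs/example.edu/data/somefile.root",
                     "somefile.root")])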
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

AOB

  • last week
    • LHC accelerator outage next week, Monday-Thursday.
  • this week


-- RobertGardner - 30 Aug 2010
