
MinutesJan92013

Introduction

Minutes of the Facilities Integration Program meeting, January 9, 2013
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode: 4519
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”

Attending

  • Meeting attendees: Rob, James, Shawn, Saul, Jason, Sarah, Michael, Patrick, Bob, Dave, Mark, Fred, Hiro, Kaushik, Alden, Armen, Wei, Horst, Doug, Ilija
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

Storage deployment

last meeting:
  • Tier 1 - 2.6 PB arrived; deployment will start mid-December, possibly earlier.
  • AGLT2: Testing capacity and performance of the MD3260; decided not to use storage pools, which take up too much of the raw storage. Chose 20-disk RAID6. Getting set to bring the storage online by the end of the month, at least some of it by next week. https://www.aglt2.org/wiki/bin/view/AGLT2/Md3260Benchmark
  • MWT2: 1 PB of storage racked and in the process of being powered on.
  • NET2: Funding didn't arrive until quite late; placed an order for 720 TB of MD3260s at a very good price. Starting to arrive now; ETA first week of January. Will be moving to Holyoke at the end of March.
  • SWT2_UTA: POs sent in for 1.3 PB; no delivery date yet.
  • SWT2_OU:
  • WT2: DONE

this meeting:

  • AGLT2: 1.296 PB added to production; DONE
  • MWT2: 1 PB storage installed; will bring online pending internal network reconfiguration.
  • NET2: MD3200s arrived yesterday and are being moved into the machine room. 720 TB raw; will double this with FY12 funds, to provide 1.4 PB.
  • SWT2_UTA: 1 PB should be on order now; there were issues getting it ordered through the preferred UTA vendor, and delivery is uncertain. Also buying Force10 switches to improve network speed, making 10G available to all storage servers and to the wide area.
  • SWT2_OU: Touching base with DDN; expect to have a few hundred TB online in two weeks.
  • WT2: DONE
  • Tier 1 - 2.6 PB in production; it will gradually be made available to ATLAS. DONE (BNL currently shows over 11 PB.) Note - some aging hardware is becoming obsolete, so the total may shrink over 2013.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Reprocessing is virtually done, a success, so expect smaller loads
    • Production will continue through the holidays
    • US facility performance has been very good, low number of issues
    • Mark suggests that when DDM emails from Hiro arrive and the issue is more than transient, it is a good idea to cc the shift list (atlas-project-adc-operations-shifts@cern.ch).
  • this meeting:
    • PRODDISK squeezed due to group production
    • PRODDISK versus DATADISK: merge or not - still not settled.
    • Retiring PandaMover - to discuss with Alexei

Multicore configuration

last meeting
  • Will at BNL is close to a solution for dynamically partitioning resources in Condor so that multicore (MC) and high-memory slots can be requested; see the sketch at the end of this section.
  • Hope to have a solution by end of December.

this meeting:

  • Updates, if any.
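A minimal sketch of what the submit side of such a setup can look like, assuming recent HTCondor Python bindings and a worker node already configured with a partitionable slot (e.g. SLOT_TYPE_1 = cpus=100%, memory=100% with SLOT_TYPE_1_PARTITIONABLE = True). The executable, memory figure, and file names below are illustrative placeholders, not the actual BNL configuration.

    import htcondor

    # Describe an 8-core, high-memory job; on a startd with a partitionable
    # slot, a matching dynamic slot is carved out when this job is matched.
    sub = htcondor.Submit({
        "executable": "/bin/sleep",   # placeholder payload
        "arguments": "300",
        "request_cpus": "8",          # multicore (MC) request
        "request_memory": "16 GB",    # high-memory request
        "output": "mc_test.out",
        "error": "mc_test.err",
        "log": "mc_test.log",
    })

    schedd = htcondor.Schedd()
    result = schedd.submit(sub)       # newer bindings; older ones use a transaction
    print("submitted cluster", result.cluster())

In such a setup the partitioning itself happens on the worker-node side; the job only declares its CPU and memory requests.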

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    No report available this week - holidays.
    
    1)  12/27: PRODDISK tokens became very full at many sites due to big input files (AOD) for group production.  Stephane worked on more aggressive cleaning to 
    help ease the situation.  See: https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/41902.
    2)  12/27 ~10:30 a.m. EST: Saul reported a power outage affecting NET2 (BU site).  System restored as of ~6:00 p.m.
    3)  12/27: MWT2 reported problems with their SRM door.  Issue resolved as of early a.m. the next day.  eLog 41914.  However, https://ggus.eu/ws/ticket_info.php?ticket=89996 
    was opened on 12/28 at ~22:40 UTC for file transfer failures with SRM errors at the site.  Status in-progress, eLog 41921.
    4)  12/28: From Bob at AGLT2: We have had two separate and sustained outages of the NFS server that provides the OSG suite and VO home directories to our workers 
    in the last 24 hours.  We have just recovered from the second, where I was more or less forced to simply kill the whole job load here. Looks like about 2600 Production 
    and 1700 Analysis were dropped. This is complicated by intermittent network outages as well.
    5)  12/29: BNL_ATLAS_RCF - job failures due to problem with ATLAS release 17.2.8.1 (" /bin/sh: /usatlas/OSG/atlas_app/atlas_rel/17.2.8/cmtsite/setup.sh: No such file or 
    directory").  A problem with the installation of this software release was found, and has now been fixed.  https://ggus.eu/ws/ticket_info.php?ticket=90000 closed, eLog 41935.
    
    Follow-ups from earlier reports:
    
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to protect 
    it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    Not available this week - ADCoS meeting not held.
    
    1)  1/2: AGLT2 setting queues off-line in preparation for maintenance downtime (networking).  Some lingering network issues after coming back from the downtime period.  
    As of 1/5 issues appear to be resolved.  https://ggus.eu/ws/ticket_info.php?ticket=90103 was opened for file transfer problems post-downtime - closed on 1/7.  eLog 42015.
    2)  1/2: MWT2 - file transfer failures ("Cannot get a connection, pool error Timeout waiting for idle object; NestedThrowables").  As of 1/4 Sarah reported the SRM was back 
    up without errors - https://ggus.eu/ws/ticket_info.php?ticket=90047 closed, eLog 41986.  https://savannah.cern.ch/support/?134825 (Savannah site exclusion for blacklisting).
    3)  1/3: attempts to do a 'voms-proxy-init' fail using the VOMS server at BNL.  Issue resolved - https://ggus.eu/ws/ticket_info.php?ticket=90074 closed.  (See the ggus ticket 
    for troubleshooting details from John Hover.)  eLog 41985.
    4)  Beginning 1/6 evening several U.S. production sites were draining due to a lack of input files.  Issue seems to be a slowness in the delivery of inputs by pandamover, plus 
    backlog of transfers from foreign clouds to U.S. sites running jobs for those clouds.  See extended discussion in the e-mail lists.
    5)  1/8: Fiber cut in Chicago affected MWT2 networking.  From Rob 1/9 a.m.: There was a fiber cut in downtown Chicago late yesterday afternoon that affected routes to 
    all research networks.  As of 5 a.m. this morning it has been repaired and should be back to normal.
    
    Follow-ups from earlier reports:
    
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to protect it 
    from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    (iii)  12/27: MWT2 reported problems with their SRM door.  Issue resolved as of early a.m. the next day.  eLog 41914.  However, https://ggus.eu/ws/ticket_info.php?ticket=89996 
    was opened on 12/28 at ~22:40 UTC for file transfer failures with SRM errors at the site.  Status in-progress, eLog 41921.
    Update 1/7: Issues with the SRM service resolved.  Closed ggus 89996. eLog 42014.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • A beta version of the f-stream-enabled UCSD collector is available. It required changes to the monitoring protocol, which are only in xrootd 3.3.0-rc2. Ilija will change the dCache monitor to adhere to the new protocol.
  • today moving MWT2 to dCache 1.9.12-23.
  • work ongoing on adopting AGIS for tests
  • Hiro produced site-specific datasets for all the sites
  • new dashboard monitor ready
  • Three Italian sites added (Rome1, Frascati, Napoli); working on their monitoring issues.
  • PAT tutorial next week; will have to recheck that everything works. We should consider this a small dress rehearsal.
  • CERN and AGLT2 have a problem with the limited proxy access.
  • Doug will subscribe D3PD's to sites in the US.
this week
  • FDR coming up in two weeks
  • Savannah needed - Doug
  • Wei has discussed with Gerry the X509 issue that will stop certain sites from participating in the FDR (the work-around does not work). The problem is that jobs have limited proxies.
  • BNL added three more proxy servers

Site news and issues (all sites)

  • T1:
    • last meeting(s): 2.6 PB coming soon. Opportunistic resources at BNL: 2000 ATLAS jobs now running on nuclear resources (with eviction). Amazon cloud resources will be used for scalability testing. WN benchmarking report sent out for Sandy Bridge (production hardware from Dell); performance was much better than with the pre-production machines, about 30% better than 2.8 GHz Westmere (working on pricing). Required to provide a facility security plan to NSF: do sites have an institutional security plan? If so, share it with Michael. Hans and Borut to participate in turning the HLT farm into a cloud resource.
    • this meeting: Added another 10G link to the GPN for the WAN at BNL; seeing 3 GB/s at times to US and international sites, and 70 Gbps in/out of BNL. There is 80 Gbps of transatlantic capacity, most of it shared; statistics show LHCONE is the most prominent user of these links. Note - Chicago and New York use different transatlantic links, so likely no competition with Tier 2s.

  • AGLT2:
    • last meeting(s): Things have been working well; the focus has been the MD3260. Getting a lot of srmwatch errors, possibly correlated with a user asking for too many jobs. HC has the analysis queue back online. Had to suppress the number of running jobs.
    • this meeting: MiLAR upgrade to 100 Gbps. Had some network state issues at UM and MSU, now resolved. (Now have 100G wave capability). Storage now running SL6 - rebuilding pools, getting rid of SL5 hosts.

  • NET2:
    • last meeting(s): Problems this morning - spikes in gatekeeper load and DDM errors. A large number of dq2-gets may be the cause; investigating. Panda queues on the BU side are nearly drained. New storage starting to arrive. End-of-March move. HU running fine. The BU-HU network link may need maintenance.
    • this meeting:

  • MWT2:
    • last meeting(s): The big issue was a number of jobs failing at stage-in; Adler32 checksums indicate byte swapping. Narrowing down the source: the LHCONE configuration at IU, or the UC-IU physical path (an unintended change of path occurred). Also seeing poor performance on the IU internal network, and some checksum failures on DDM transfers.
    • this meeting: A network incident in Chicago caused a near-total loss of connectivity for ~12 hours yesterday. Checksum mismatches (causing both DDM and lsg-get errors) were traced to a Brocade firmware bug; the firmware has been reverted for now (see the checksum-verification sketch at the end of this section). Working on storage deployment and a local network upgrade to remove an internal bottleneck between compute and storage servers. UIUC compute purchase imminent.

  • SWT2 (UTA):
    • last meeting(s): There was a problem with Maui on the CPB cluster; it was drained and fixed. Seeing asymmetric network throughput to/from BNL; getting network staff to track it down.
    • this meeting:

  • SWT2 (OU):
    • last meeting(s): Disk order has gone out. Horst is getting close to having a clean-up script.
    • this meeting:

  • WT2:
    • last meeting(s): All is well.
    • this meeting:
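
Regarding the MWT2 checksum mismatches above, a minimal sketch of the kind of verification involved: re-compute a file's Adler-32 locally and compare it with the catalogued value. The script, its arguments, and the hex-string convention are illustrative; they are not the actual MWT2 or DDM tooling.

    import sys
    import zlib

    def adler32_of_file(path, chunk_size=1 << 20):
        """Stream a file through zlib.adler32 and return the 32-bit checksum."""
        value = 1  # Adler-32 starts at 1, not 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF

    if __name__ == "__main__":
        # usage: verify_adler32.py <file> <catalog checksum as 8-digit hex>
        path, expected = sys.argv[1], sys.argv[2].lower()
        actual = "%08x" % adler32_of_file(path)
        print("%s: computed %s, catalog %s" % (path, actual, expected))
        sys.exit(0 if actual == expected else 1)

A mismatch on an otherwise intact file is how problems like the byte swapping seen at MWT2 show up: the data arrives with the same length but a different checksum than the source recorded.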

AOB

last meeting

this meeting


-- RobertGardner - 08 Jan 2013
