


Minutes of the Facilities Integration Program meeting, January 23, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode: 4519
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”


  • Meeting attendees: Saul, Dave, Jason, Michael, John Brunelle, James, Patrick, Wei, Shawn, Sarah, Mark, Rob, Horst, Doug, John, Jose, Hiro, Kaushik, Mark, Fred
  • Apologies: none
  • Guests: John Hover and Jose

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Facilities spreadsheet update - see email to t2 list
      • Still in the process of adding disk at site
      • Large amounts of replication on-going
    • this week
      • Updated Facilities spreadsheet (installed capacities ending FY13Q1):
      • The next US ATLAS Computing Facilities Meeting will be held March 11, 2013, in Indianapolis, IN. It is co-located with the OSG All Hands Meeting, in particular the OSG Campus Infrastructures Community Meeting on Tuesday, March 12. Please indicate your interest and whether you intend to attend.
      • FAX FDR testing is ongoing and encountering a number of pilot/pilot-wrapper/python 2.6/CVMFS integration issues; John Hover is here today to discuss. Communicating with ADC management (Ueda, Simone and Stephane) about the scale of testing, so as not to interfere with the current large demand from high-priority tasks. Dataset placement.
      • Opportunistic cycles to meet high production demand for Moriond.
      • Storage deployment review below.
      • Quarterly reports (overdue!)
      • Michael - we're asked by ATLAS to find more CPU resources for Moriond, a short-term request until mid-February. All Tier 2 PIs have received the request. At BNL, will add 5000 job slots using Amazon EC2; an additional 3000 jobs are running now. For anyone associated with a sizable campus, it would be good to request additional campus resources. (An illustrative sketch of an EC2 spot request follows below.)
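
For illustration only: a minimal sketch of how extra EC2 capacity might be requested through the spot market with the boto library. This is not BNL's actual provisioning code; the region, AMI id, bid price, instance count, and instance type below are placeholders.

    # Illustrative only: request extra worker-node capacity on the EC2 spot market
    # with boto. Region, AMI id, bid price, count and instance type are placeholders.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")  # assumes credentials in the environment

    requests = conn.request_spot_instances(
        price="0.10",               # maximum bid in USD per instance-hour (placeholder)
        image_id="ami-00000000",    # placeholder worker-node image
        count=100,                  # number of instances requested
        instance_type="m1.large",   # placeholder instance type
        type="one-time",            # do not re-submit the request after termination
    )

    for req in requests:
        print("%s %s" % (req.id, req.state))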

FAX FDR pilot issues (John Hover)

  • The top-level wrapper sent to sites (bash) invokes a python wrapper, which in turn invokes Paul's pilot
  • Dependence on python and DQ2Client?
  • FAX worked fine, but then there were problems with standard analysis jobs
  • We have a work-around for the near term, but we need a longer-term strategy to deal with this across the grid production environments.
  • The modular wrapper developed by Jose is key (a rough sketch of the checks such a wrapper could make follows below)
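
A minimal sketch, not Jose's actual modular wrapper: it only illustrates the kind of checks the python wrapper layer could make before handing off to the pilot, given the python 2.6 and CVMFS integration issues mentioned above. The pilot command, CVMFS path, and exit codes are assumptions.

    #!/usr/bin/env python
    # Illustrative wrapper-level sanity checks before launching the pilot.
    # NOT the production wrapper; paths, command and exit codes are assumptions.
    import os
    import sys
    import subprocess

    PILOT_CMD = ["python", "pilot.py"]     # hypothetical pilot entry point
    CVMFS_REPO = "/cvmfs/atlas.cern.ch"    # assumed CVMFS mount point

    def main():
        # The pilot requires python >= 2.6 (one of the FDR integration issues).
        if sys.version_info < (2, 6):
            sys.stderr.write("python %s is too old; need >= 2.6\n" % sys.version.split()[0])
            return 64

        # Make sure the CVMFS repository is actually mounted and readable.
        if not os.path.isdir(CVMFS_REPO):
            sys.stderr.write("CVMFS repository %s not available\n" % CVMFS_REPO)
            return 65

        # Hand off to the pilot, propagating its exit code back to the bash wrapper.
        return subprocess.call(PILOT_CMD)

    if __name__ == "__main__":
        sys.exit(main())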

Tier2D evaluation at ADC (Doug)

Storage deployment

last meeting:
  • AGLT2: 1.296 PB added in production; DONE
  • MWT2: 1 PB storage installed; will bring online pending internal network reconfiguration.
  • NET2: MD3200's arrived yesterday; moving into machine room. 720 TB raw. Will double this with FY12 funds, to provide 1.4 PB.
  • SWT2_UTA: 1 PB should be on order now. There were issues getting it ordered through the preferred UTA vendor. Delivery uncertain. Also buying switches to improve network speed. Adding Force10 switches to have 10G available for all storage servers, also to the wide-area.
  • SWT2_OU: touching base with DDN; expect to have a few hundred TB online in two weeks.
  • WT2: DONE
  • Tier 1 - 2.6 PB in production; gradually will be made available to ATLAS DONE (BNL currently showing over 11 PB). Note - there is aging hardware that is becoming obsolete, so total may shrink over 2013.
this meeting:
  • Tier 1 DONE
  • WT2: DONE
  • MWT2: Still working on storage deployment; 6 R720 + MD1200's installed, powered. Re-configuring local network (plot below).
  • NET2: Racked and stacked. Electrical work is happening today. Optimistically new storage will be online next week. 720 TB raw.
  • SWT2_UTA: Equipment on order; expecting delivery before end of month. Won't take a downtime until after Moriond.
  • SWT2_OU: Week of Feb 4 is the scheduled downtime, but will postpone.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Mark suggests that when DDM emails from Hiro arrive, if the issue is more than transient, good idea to cc the shift list (atlas-project-adc-operations-shifts@cern.ch).
    • PRODDISK squeezed due to group production
    • PRODDISK versus DATADISK: merge or not - still not settled.
    • Retiring PandaMover - to discuss with Alexei
  • this meeting:
    • US Tier 2 starvation issue seems resolved. There were lots of problems with PandaMover getting the needed input files to Tier 2s. An emergency meeting last week led to a decision to increase the GP share at the Tier 1.
    • There will be a shift from production to analysis; group production is nearly finished.
    • At BNL we will move resources from production to analysis.
    • NET2 and SLAC do not seem to be getting enough jobs and/or pilots. Saul will open a thread on this.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  1/9: BNL_ATLAS_RCF - jobs from task 1138902 failing heavily at the site with the error "'libimf.so' from LD_PRELOAD cannot be preloaded."  Yuri requested 
    that the cache be re-installed at the site, and this fixed the problem.  https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/42208.
    2)  1/10: MWT2: file transfers failing with "Best pool <.....> too high : Infinity" indicating all write pools were full.  More pools were added, which solved the problem.
    3)  1/13: BNL_ATLAS_RCF - jobs failing with "lost heartbeat" errors.  Known issue (the site uses non-dedicated resources, and occasionally jobs can be evicted).  
    Closed https://ggus.eu/ws/ticket_info.php?ticket=90352, eLog 42085.
    4)  1/15: Details of the recent DDM dashboard 2.0 upgrade:
    Follow-ups from earlier reports:
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to 
    protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), 
    eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  1/16: BU_ATLAS_Tier2 - file transfer failures due to an expired host certificate.  Cert updated, issue resolved following an SRM restart to pick up the new cert.  
    https://ggus.eu/ws/ticket_info.php?ticket=90500 closed, eLog 42312.  (Duplicate ggus ticket 90504 also opened/closed during this period.)
    2)  1/16: UPENN file transfer failures with SRM errors.  Issue reported to be fixed (no details).  Subsequent transfers succeeding, so closed 
    https://ggus.eu/ws/ticket_info.php?ticket=90503.  eLog 42147.
    3)  1/17: New pilot release from Paul - see:
    4)  1/17: BNL_ATLAS_RCF: https://ggus.eu/ws/ticket_info.php?ticket=90524 was opened for the issue related to the release cache at the site (jobs failing with 
    "'libimf.so' from LD_PRELOAD cannot be preloaded").  Software was reinstalled, which fixed the problem.  ggus 90524 closed, eLog 42208.  (A quick library-load check for this kind of failure is sketched after this section.)
    5)  1/18: ggus 90548 / RT 22875 were opened for job failures at OU_OCHEP_SWT2 with "replica not found" pilot errors.  Not a site issue - such failures were seen at most 
    U.S. sites.  Seems to be an issue with PanDA/LFC.  Is the problem understood?  Both tickets closed, eLog 42183.
    6)  1/18: From Bob at AGLT2: gate01 hung around 5pm EST.  I have just rebooted the beast.  I suspect our whole load is lost, and will spend some time just cleaning it all 
    away.  In the meantime, the queues are set offline.  Later:  Both queues are now set back to hc testing.
    7)  1/19: BNL_CLOUD - 5k+ job failures (mostly "lost heartbeats").  Issue understood - BNL_CLOUD is using Amazon's EC2 service and acquires resources via their spot 
    pricing mechanism.  See more details in https://ggus.eu/ws/ticket_info.php?ticket=90591 (now closed) - eLog 42258.
    8)  1/19: The CRLs for the CERN CA expired (actually they did not expire, but were renewed too close to the expiration time), which led to various problems with grid services 
    and user jobs.  Most sites were auto-excluded during this time as well.  Issue resolved as of ~19:30 UTC.  See: https://ggus.eu/ws/ticket_info.php?ticket=90605 (and other 
    links contained in it), eLog 42254/55.
    9)  1/20: UTA_SWT2 - host certificate expired, resulting in DDM errors (transfers and deletions).  Cert was renewed - https://ggus.eu/ws/ticket_info.php?ticket=90612 & RT 22879 
    closed, eLog 42272.
    10)  1/21: NERSC destination file transfer failures - https://ggus.eu/ws/ticket_info.php?ticket=90619 in-progress, eLog 42283.
    11)  1/21: MWT2 (UC) transfer failures: destination & source: failed to contact on remote SRM - problem seemingly disappeared - https://ggus.eu/ws/ticket_info.php?ticket=90620 
    closed on 1/23 - eLog 42284.
    12)  1/21: SWT2_CPB DDM deletion errors - probably not a site issue, as some of the errors are related to datasets with malformed names, and others are deletion attempts for 
    very old datasets.  Working with DDM experts to resolve the issue.  https://ggus.eu/ws/ticket_info.php?ticket=90644 in-progress.
    Follow-ups from earlier reports:
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to protect it from 
    undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    • Reminder to look at updated pilot code.
    • Patrick: still seeing replica not found errors at CPB.
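
Since the "libimf.so from LD_PRELOAD cannot be preloaded" failures appear twice above, here is a purely illustrative check one could run on a worker node to see whether that library actually resolves; only the library name comes from the error message, everything else is an assumption.

    # Illustrative quick check: can the library named in the LD_PRELOAD error
    # (libimf.so, the Intel math library) be resolved and loaded on this node?
    import ctypes
    import os

    libname = "libimf.so"  # library reported in the pilot error
    print("LD_PRELOAD = %s" % os.environ.get("LD_PRELOAD", "<unset>"))
    print("LD_LIBRARY_PATH = %s" % os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    try:
        ctypes.CDLL(libname)
        print("%s loaded OK" % libname)
    except OSError as err:
        print("%s could not be loaded: %s" % (libname, err))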

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging on the DQ2 log page at http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and the transfer 'id' in the box and search; see the sketch after this list).
  • last meeting(s):
    • Part of the new dashboard has alerts and alarms
  • this meeting:
    • See email for notes
    • New perfSONAR release candidate for v3.3; five sites will participate in the testing: UC, IU, MSU, UNL, UM.
    • Consistently bad performance to/from RAL; how should we act in this regard? Hiro has added additional FTS channels for RAL to help.
    • Will discuss at the next WLCG Operations meeting. Hiro notes that inter-T2 throughput is slow.
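
As a small convenience sketch related to the FTS logging note above (not a documented interface): assuming the search output from the dq2log page has been saved to a local text file, this filters entries for the manual 'fts' plus transfer-id search.

    # Illustrative helper: grep a saved copy of the dq2log search output for FTS
    # entries mentioning a particular transfer id. The file name and the assumption
    # that entries are plain text lines are both hypothetical.
    import sys

    def fts_lines(path, transfer_id):
        with open(path) as handle:
            for line in handle:
                if "fts" in line.lower() and transfer_id in line:
                    yield line.rstrip()

    if __name__ == "__main__":
        # usage: python fts_grep.py saved_dq2log.txt <transfer-id>
        for entry in fts_lines(sys.argv[1], sys.argv[2]):
            print(entry)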

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • FDR coming up in two weeks
  • Savannah needed - Doug
  • Wei has discussed with Gerry the X509 issue that will stop certain sites from participating in the FDR. (The work-around does not work.) The problem is that jobs run with limited proxies.
  • BNL added three more proxy servers
this week
  • Link to FDR twiki: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/JanuaryFDR
  • Three major issues - distributing input data to all sites; still getting authorization errors at some sites (not many connections are affected).
  • Realistic analysis jobs from HC; sorting through pilot issues.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Added another 10G link to the GPN for the WAN at BNL. 3 GB/s at times to US and international sites; 70 Gbps in/out of BNL. There is 80 Gbps of transatlantic capacity, most of which is shared. Statistics show LHCONE is the most prominent user on these links. Note - Chicago and New York use different transatlantic links. Likely no competition with Tier 2s.
    • this meeting: John is currently writing a cost-aware cloud scheduler. It adds cost-driven policies for expanding onto "pay as you go" resources. The current demand-driven exercise is helping build a better understanding of Amazon's provisioning policies and cost model. No indication of bottlenecks into/out of storage. (An illustrative sketch of such a cost-driven policy follows below.)
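
As a rough illustration of the cost-aware idea (not John's scheduler, whose policy language and interfaces are not described in these minutes), a cost-driven policy might fill pending job-slot demand from the cheapest pools first; the pool names, capacities, and prices below are made up.

    # Illustrative cost-aware placement policy: fill pending job-slot demand from
    # the cheapest available pools first. Names, capacities and $/slot-hour are made up.
    POOLS = [
        {"name": "local_batch",  "free_slots": 2000, "usd_per_slot_hour": 0.00},
        {"name": "ec2_spot",     "free_slots": 5000, "usd_per_slot_hour": 0.02},
        {"name": "ec2_ondemand", "free_slots": 8000, "usd_per_slot_hour": 0.08},
    ]

    def plan(pending_slots, pools=POOLS):
        """Return a dict of pool name -> slots to acquire, cheapest pools first."""
        allocation = {}
        for pool in sorted(pools, key=lambda p: p["usd_per_slot_hour"]):
            if pending_slots <= 0:
                break
            take = min(pool["free_slots"], pending_slots)
            if take:
                allocation[pool["name"]] = take
                pending_slots -= take
        return allocation

    if __name__ == "__main__":
        # e.g. 5000 extra slots wanted for the Moriond production push
        print(plan(5000))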

  • AGLT2:
    • last meeting(s): MiLAR upgrade to 100 Gbps. Had some network state issues at UM and MSU, now resolved. (Now have 100G wave capability). Storage now running SL6 - rebuilding pools, getting rid of SL5 hosts.
    • this meeting: Things are running well. Lots of jobs in the analysis queue. Attempting to get a second peering into MiLAR, which will provide direct routes to Omnipop and MWT2.

  • NET2:
    • last meeting(s): Problems this morning - spikes in gatekeeper load and DDM errors. A large number of dq2-gets may be the cause; investigating. Panda queues on the BU side are nearly drained. New storage starting to arrive. End-of-March move. HU running fine. The BU-HU network link may need maintenance.
    • this meeting: Shorthanded at BU due to admins working on setting up Holyoke.

  • MWT2:
    • last meeting(s): A network incident in Chicago caused near-total loss of connectivity for ~12 hours yesterday. Checksum mis-matches (causing both DDM and lsg-get errors) - cause identified as a Brocade firmware bug; firmware reverted for now. Working on storage deployment and a local network upgrade to remove an internal bottleneck between compute and storage servers. UIUC compute purchase imminent.
    • this meeting: Job slots going unused. Switched gridftp transfers to new hardware. Illinois progressing on getting 100G. Lincoln getting additional resources at UC3 hooked up, about 500 job slots.

  • SWT2 (UTA):
    • last meeting(s): There was a problem with Maui on the CPB cluster; drained and fixed. Asymmetric network throughput to/from BNL; getting network staff to track it down.
    • this meeting: Deletion errors with Bestman for certain datasets; have a ticket open. The network issue is still being worked on by Jason; it looks like it might be a busy link in Chicago. Continued work on storage. Finalizing a purchase of additional compute nodes; will circulate specifications to the Tier 2 list.

  • SWT2 (OU):
    • last meeting(s): Disk order has gone out. Horst is getting close to having a clean-up script.
    • this meeting:

  • WT2:
    • last meeting(s): All is well.
    • this meeting: Unused job slots, lack of pilots.



-- RobertGardner - 23 Jan 2013

pdf MWT2-Network-Diagrams.pdf (177.2K) | RobertGardner, 23 Jan 2013 - 10:25 |
pdf Normalization-factors-USATLAS-v26-v1.pdf (489.2K) | RobertGardner, 23 Jan 2013 - 10:32 |
pdf storage.pdf (77.5K) | RobertGardner, 23 Jan 2013 - 10:46 |