r4 - 28 May 2014 - 14:26:57 - RobertGardner



Minutes of the bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:


  • Meeting attendees: Bob, Dave, Christopher, Saul, Shawn, Michael, Rob, Torre, Ilija, Fred, Armen, Alden, Sarah
  • Apologies: Horst, Kaushik
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, on-demand - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Ilija): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Pre-DC14 exercises: internal review of sites. Performance characteristics and stability under various loads.
    • this week
      • Computing element evolution. The old GRAM CE is not easily maintained, so the OSG software experts will be moving toward an HTCondor-CE. They will start by phasing it in in parallel with the GT2 CE, and would like the US ATLAS computing facility to support this activity. It will be tried out on the AGLT2 test gatekeeper. There is a gap for the other batch systems used at sites in the facility (LSF, SGE); the code base will need support for them. N.b. no decision about the replacement is needed yet. This should start almost immediately, to resolve any issues in time (perhaps for DC14, certainly before Run 2).
        • Wei: has a test machine that can be used for the HTCondor CE.
        • Saul: would like to get involved early. Can be ready right away - with help from the OSG Software team.
      • We can then signal the interested sites to the OSG Software team.
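The HTCondor-CE mentioned above maps incoming CE jobs onto the local batch system through its JobRouter. As a rough sketch of what a site trying it out might configure (the file path, route name, and batch type are illustrative; LSF and SGE sites would need their own batch type, which is where the support gap noted above lies):

```
# /etc/condor-ce/config.d/99-local-routes.conf  (path illustrative)
# Route jobs arriving at the CE to a local PBS batch system via the batch GAHP.
JOB_ROUTER_ENTRIES = \
   [ \
     GridResource = "batch pbs"; \
     name = "Local_PBS"; \
   ]
```

Sites running HTCondor as their batch system route natively; PBS/LSF/SGE sites go through the batch GAHP, which is the part needing broader support.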

Rucio Migration Status

  • c.f. yesterday's ADC weekly, https://indico.cern.ch/event/311435/
  • LUCILLE and AGLT2 migrated.
  • The CONNECT queue was missed in the PandaMover transition; it has now been moved. AGLT2 should be finished today. NET2: one last pass to go.
  • Dark data lists from Tomas: there is dark data in the (new) Rucio directories; a list was sent to AGLT2.
  • What is the sequence of things? Armen will send a clarifying message, and will talk to Tomas.
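The dark-data check behind Tomas's lists is conceptually a set comparison between a storage dump and a catalog dump (the same idea as the old CCC): files on storage but not in the catalog are dark; files in the catalog but not on storage are missing. A minimal sketch, with illustrative file names:

```python
# Compare a storage-element dump against a Rucio catalog dump.
# Files present on storage but not in the catalog are "dark";
# files in the catalog but absent from storage are "missing".
storage_dump = {"/rucio/data14/file1", "/rucio/data14/file2", "/rucio/data14/orphan"}
catalog_dump = {"/rucio/data14/file1", "/rucio/data14/file2", "/rucio/data14/lost"}

dark = storage_dump - catalog_dump      # candidates for deletion
missing = catalog_dump - storage_dump   # candidates for recovery / declaring lost

print(sorted(dark))     # ['/rucio/data14/orphan']
print(sorted(missing))  # ['/rucio/data14/lost']
```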

Pre-DC14 site readiness, internal review

Performance metrics

  • ROOTIO performance on xAOD
    • Hammer Cloud tests
    • Site-specific ROOTIO load metrics
      • CPU efficiency
      • Wall time relative to a reference (e.g. a "Sandybridge standard")
      • Overall job failure rate
      • In bins of #jobs: 10, 100, 500, 1000
  • Representative workflows
    • Evgen
    • G4sim
    • Reco
    • Pileup
    • Data reduction framework (slim/skim)
  • Quiescent FAX Cost Matrix (the default)
  • Loaded FAX Cost Matrix
  • % of FAX lookup failures (Wei's test)
  • Rucio readiness
    • Scalability testing, 1M files per day, site-centric metrics
  • Co-scheduled DDM+FAX load tests?
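Several of the metrics above reduce to simple arithmetic over job records; a sketch with illustrative sample numbers (not measurements):

```python
# Site-specific ROOTIO load metrics as simple arithmetic.
# All sample numbers below are illustrative, not measurements.

def cpu_efficiency(cpu_seconds, wall_seconds):
    """CPU time consumed as a fraction of wall time."""
    return cpu_seconds / wall_seconds

def relative_walltime(wall_seconds, reference_wall_seconds):
    """Wall time relative to a reference host (e.g. a 'Sandybridge standard')."""
    return wall_seconds / reference_wall_seconds

# Sustained rate implied by the Rucio scalability target of 1M files per day.
files_per_second = 1000000 / 86400.0

print(cpu_efficiency(3600, 4000))      # 0.9
print(relative_walltime(4400, 4000))   # 1.1
print(round(files_per_second, 1))      # 11.6
```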

Facility readiness

  • Final equipment in place (if applicable)
    • BNL:
    • AGLT2:
    • MWT2:
    • NET2:
    • SWT2:
    • WT2:

  • Final network configuration in place
    • BNL:
    • AGLT2:
    • MWT2:
    • NET2:
    • SWT2:
    • WT2:

  • Multicore readiness
    • BNL:
    • AGLT2:
    • MWT2:
    • NET2:
    • SWT2:
    • WT2:

  • FAX overflow testing
    • Activate
    • Tuning of JEDI parameters

  • Disk management readiness (all token areas)

  • User data set placements: requests, delivery, access

  • Opportunistic reach
    • DC14 tasks on opportunistic resources

  • Facility-to-Tier3 Capabilities

  • HPC readiness (US sites with ATLAS allocations: Titan, NERSC, ??)

Operations overview: Production and Analysis (Mark)

  • Production reference:
  • last meeting(s):
  • this meeting (5/14/14):
    • We are light on jobs - see the link below from the ADC weekly for a summary of new production. Probably a couple of weeks away.
    • PandaMover has been turned off in the US cloud. Need to figure out whether PRODDISK is being cleaned properly, and to check the cleanup policy. We should not see the issue of job inputs being removed before jobs begin.
    • Have all the sites been fully converted to Rucio? Need to find out where we stand. Mark will follow up.
    • May be some residual data from Panda mover.
  • this meeting (5/28/14):
    • Not much to report on the production front. Not sure when to expect large volumes of tasks.
    • Last week saw historic lows in site issues - not much to report.
    • There was a minor update to the pilot from Paul.

Shift Operations (Mark)

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Kai Leffhalm):
    1)  5/22: New pilot release from Paul (v58m) - details here:
    2)  5/23: BNL - file transfers were failing with "SOURCE file size is 0." Update from Iris: the host of the Chimera database has had high I/O wait since 1 a.m. (EST) today, 
    which affected DDM. The problem is fixed and transfers resumed, as monitored. https://ggus.eu/index.php?mode=ticket_info&ticket_id=105694 was closed, eLog 49501.
    3)  5/27: ADC Weekly meeting:
    Follow-ups from earlier reports:
    (i)  5/14: ANLASC_USERDISK - DDM deletion errors. https://ggus.eu/index.php?mode=ticket_info&ticket_id=105414, eLog 49452. Site waiting on ANL firewall 
    ports to be reopened. Downtime declared.

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (5/14/14)
    • Non-US Cloud production inputs were being cleaned centrally.
    • Next week AGLT2 will be migrated from LFC to Rucio Catalog. Armen will look into specific status.
    • Stress testing in Rucio.
  • meeting (5/28/14)
    • Reminder to SLAC to add space for USERDISK.
    • Hiro sent out an email that there will be a USERDISK cleanup.
    • LOCALGROUPDISK management tools: still being worked on; no timeline yet. A working prototype maybe by the end of summer.

DDM Operations (Hiro)

  • Reminder that the query to find files is different.
  • Not sure if the Rucio dump created by the Rucio team was sufficient, or not.
  • Wei notes there is a document, but it looks like the procedure doesn't work. Does the basic functionality exist? About half of the REST API commands are not working. Will discuss next week.
  • Hiro: agrees documentation is in poor shape.
  • Will still need the equivalent of a CCC to find dark data. Or missing data?
  • Hiro will coordinate issues.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference
meeting (5/14/14)
  • A few more sites have installed. PIC is now federating. Working with the Netherlands.
  • Last several days - running up to 700 jobs in parallel. Trying to understand.
  • Overflow rebrokering tests in progress. Submitted 20 tasks, 744 jobs, fully successful.
this meeting (5/28/14)
  • Added the Spanish cloud, the rest of GRIF, and (possibly) IN2P3.
  • SARA T1. Coverage up to 85% of files, 56% of sites.
  • Missing only TRIUMF and Nikhef. Goal is to get to 95%.
  • Failover working okay
  • Overflow tests are working nicely; took a long time given the quota. Gathering results for next year. Rates observed are what sites deliver.
  • Improvements to fax-ls; will be moving the tools to be Rucio-based.
  • Wei: Shu-wei has been testing redirection and performance. Wei is working with him; they have a good understanding of the delay sources and how to avoid them. TRIUMF is doing internal testing; they are concerned about RPMs not being signed by WLCG, which WLCG will work on. Working to change the architecture for SLAC; only an issue for xrootd storage. Patrick will have a look as well.
  • 100g testing will continue once local UC issues are resolved.

ATLAS Connect (Rob)

last week
  • Continued testing with Parrot 4.1.4rc5.
  • Focus remains getting a usable ATLAS software environment exported to the Stampede cluster.
  • Current strategy is to use an edge VM at TACC to mount CVMFS and a squid service; then export CVMFS with Chirp.
this week
  • The main issue is gaining access to ATLAS software from sites without CVMFS installed.
  • Looking to deliver CVMFS to the site via NFS, using the NFS export mode available in the client starting with 2.1.5. Working with the admins to get the NFS mount set up, on a per-job or per-node basis.
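As a sketch of what the NFS-based delivery could look like on an edge node (repository names, proxy URL, and export network below are illustrative; CVMFS_NFS_SOURCE is the client option that enables NFS-export mode):

```
# /etc/cvmfs/default.local on the edge node (values illustrative)
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
CVMFS_HTTP_PROXY="http://squid.example.edu:3128"
CVMFS_CACHE_BASE=/var/cache/cvmfs
CVMFS_NFS_SOURCE=yes    # required when re-exporting the mount over NFS

# /etc/exports: re-export the mounted repository to the worker nodes
# /cvmfs/atlas.cern.ch  10.0.0.0/16(ro,sync,no_root_squash)
```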

Site news and issues (all sites)

  • NET2:
    • last meeting(s): FTS3 poor-performance issue addressed by deploying more gridftp servers. Gearing up to purchase storage: 700 usable TB with 4 TB drives; half a rack.
    • this week:

  • MWT2:
    • last meeting(s): 100g testing to BNL.
    • this meeting:

  • SWT2 (UTA):
    • last meeting(s): Working on Big Panda job testing. Working on setting up a network test with new equipment, and on scheduling a downtime. Have seen a great improvement in stability by removing jobs using large amounts of memory (6 GB!). There are HammerCloud jobs consuming lots of memory; related to xrootd?
    • this meeting: All the network equipment is in; we've started stacking it and setting up LAGs, and will get through the configuration and set a downtime.

  • SWT2 (OU, OSCER):
    • last meeting(s): LHCONE - working with John Bigrow to set this up.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional and operational and active.
    • this meeting:

  • WT2:
    • last meeting(s): Network reconfiguration for the gridftp server. Observed a problem with the DigiCert CRL update, which caused problems with voms-proxy-init. Also there were GUMS issues. The OSG Gratia service was down for more than one day; were job statistics lost?
    • this meeting:

  • T1:
    • last meeting(s): Maintenance next Tuesday. Will integrate the two Arista 7500's and will replace obsolete network switches. Will then have full 100g inter-switch technology.
    • this meeting: Had a problem affecting storage on Friday: Chimera stopped working between the server and the storage backend, and a reboot was needed to recover; analysis by Hiro led to a decision to upgrade the system. Also working on replacement of aging worker nodes, plus a small increment in capacity. Running an extensive evaluation program of AMD processors, finding the 6000 series performs very well; decided to go with AMD rather than Intel (much more expensive than last year). Happy to share results (Saul is interested). Also considering Atom processors from HP; I/O was okay; likely relevant for the Tier3.

  • AGLT2:
    • last meeting(s): Juniper equipment in place at both sites - working on getting the last bit of fiber in place.
    • this meeting: 338k dark files in Rucio. Trouble with the gatekeeper around midnight last night; not sure what happened; memory and swap filled; haven't had a chance to clean up. Did update to the latest OSG gatekeeper. The EX9208 at MSU will be powered up and brought up to 40 Gbps. The EX9208 at UM is still running at 10G; problem with sub-interfaces on the 40G link aggregation.



-- RobertGardner - 27 May 2014
