MinutesJun112014

Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:


  • Meeting attendees: Mark Sosebee, Michael, Dave, Wei, Alden, Horst, Mayuko, Saul, Armen, Kaushik
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, on-demand - convened by Armen) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Ilija): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Computing element evolution: the old GRAM CE is aging and not easily maintained, so the OSG software experts will be moving toward an HTCondor-CE. They will start by phasing it in in parallel with the GT2 CE. Would like the US ATLAS computing facility to support this activity; will try it out on the AGLT2 test gatekeeper. There is a gap in terms of other batch systems used at other sites in the facility (LSF, SGE), and the code base will need support there. N.b. no extended discussion is needed about the replacement itself. Should start this almost immediately, to resolve any issues (perhaps during DC14, certainly before Run 2).
        • Wei: has a test machine that can be used for the HTCondor CE.
        • Saul: would like to get involved early. Can be ready right away - with help from the OSG Software team.
      • So we can signal to the OSG Software team that we are ready to proceed.
    • this week
      • Readiness of the facility for DC14 is the primary focus for the next month.
      • Rucio migration; there is also the issue of consistency checking with additional higher-level tools.
      • Enthusiastic about the Connect technology for bringing new computing architectures into our computing environment.
      • Event service - high expectations and hopes; should be ideal for opportunistic resources.
      • Still working on improving networking - finalizing program-funded upgrades.
      • Migration away from the GRAM-based CE. Transition plan from OSG Operations; Saul and Wei will be helping with the validation.
      • A TIM meeting is being planned for the Fall - likely no Fall Facilities meeting. There is also going to be a December meeting at CERN with a technical, sites-focused agenda. Would like to discuss DC14 results in advance.
      • ATLAS Analytics questionnaire and whitepaper.

Rucio Migration Status

last meeting
  • c.f. yesterday's ADC weekly, https://indico.cern.ch/event/311435/
  • LUCILLE and AGLT2 migrated.
  • The CONNECT queue was missed with respect to PanDA Mover; it has now been moved. AGLT2 - should be finished today. NET2 - one last pass.
  • Dark data lists from Tomas - there is dark data in the (new) Rucio directories; lists were sent to AGLT2.
  • What is the sequence of things? Armen will send a clarifying message, and will talk to Tomas.

this meeting

  • Functional testing? Armen - discussed with DDM ops, all tools must be working. Migration of the DQ2 catalog itself to the Rucio catalog is the last step.
  • Scalability testing (local)?
  • Armen: how will we treat dark data?

Condor CE

  • Saul - started with Tim Cartwright; will set up a separate machine to do this. Wei - has not started yet. (See the validation sketch below.)
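
The OSG htcondor-ce-client tools include condor_ce_trace, which submits a trace job through a CE and reports whether it completes. Below is a minimal sketch of how validation of the test gatekeepers could be scripted; the hostnames are placeholders, a valid grid proxy is assumed to be in place, and this is not the OSG team's official procedure.

import subprocess

# Hypothetical test gatekeepers standing in for the actual test hosts.
TEST_CES = [
    "test-gk.example.edu",
    "test-ce.example.org",
]

def trace_ce(host):
    """Run condor_ce_trace against one CE; return True if the test job succeeds."""
    try:
        subprocess.check_output(["condor_ce_trace", host], stderr=subprocess.STDOUT)
        print("%s: OK" % host)
        return True
    except subprocess.CalledProcessError as err:
        print("%s: FAILED (exit %d)" % (host, err.returncode))
        return False
    except OSError:
        print("condor_ce_trace not found on this machine")
        return False

if __name__ == "__main__":
    results = [trace_ce(ce) for ce in TEST_CES]
    print("%d/%d CEs passed condor_ce_trace" % (sum(results), len(results)))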

Pre-DC14 site readiness, internal review

  • Moved to ReadinessDC14
  • Will create a version of the site certification table for this specific testing.
  • Hiro: FTS tests have already run to a few sites. BNL's FTS seems okay; auto-tuning seems to be working. Saturated 10 Gbps to AGLT2 and MWT2 simultaneously. Michael: the tests should also be scaled by the number of files (see the sketch below). Also include SWT2.
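
As a rough way to compare FTS runs that move the same volume as a few large files versus many small ones, aggregate throughput can be computed from per-transfer records. A minimal sketch follows, assuming (size, start, end) tuples collected from FTS monitoring or transfer logs; the sample numbers are invented for illustration.

def aggregate_rate_gbps(transfers):
    """Aggregate throughput over the wall-clock span of a batch of transfers.

    transfers: list of (size_bytes, start_seconds, end_seconds) tuples.
    """
    total_bytes = sum(size for size, _, _ in transfers)
    wall = max(end for _, _, end in transfers) - min(start for _, start, _ in transfers)
    return 8.0 * total_bytes / wall / 1e9  # convert bytes/s to Gb/s

# Hypothetical batches moving the same total volume as a few large files
# or many small ones, to see whether per-file overhead limits the rate.
few_large = [(50e9, 0.0, 100.0), (50e9, 0.0, 105.0)]
many_small = [(1e9, i * 1.0, i * 1.0 + 10.0) for i in range(100)]

print("2 x 50 GB  : %.1f Gb/s" % aggregate_rate_gbps(few_large))
print("100 x 1 GB : %.1f Gb/s" % aggregate_rate_gbps(many_small))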

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • meeting (5/28/14):
    • Not much to report on the production front. Not sure when to expect large volumes of tasks.
    • Last week there have been historic lows of site issues - not much to report.
    • There was a minor update to the pilot from Paul
  • meeting (6/11/14):
    • Job levels are sporadically going up and down. There is some MCORE testing going on. It is important to keep a low rate of MCORE pilots running, with 1 or 2 slots always available; this allows for fast startup of samples.

Shift Operations (Mark)

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Gabriela Navarro):
    1) 6/8: SLACXRD - failing file transfers due to an expired host certificate. Certificate was updated - issue resolved. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=106041 was closed, eLog 49700.
    2)  6/3: ADC Weekly meeting:
    http://indico.cern.ch/event/311436/  (includes a talk about Rucio migration status)
    Follow-ups from earlier reports:

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (5/14/14)
    • Non-US Cloud production inputs were being cleaned centrally.
    • Next week AGLT2 will be migrated from LFC to Rucio Catalog. Armen will look into specific status.
    • Stress testing in Rucio.
  • meeting (5/28/14)
    • Reminder to SLAC to add space for USERDISK.
    • Hiro sent out an email that there will be a USERDISK cleanup.
    • LOCALGROUPDISK management tools - still being worked on; no timeline. A working prototype maybe by the end of summer.
  • meeting (6/11/14)
    • The USERDISK cleanup is now on-going.
    • No plans for LOCALGROUPDISK cleanup. The Rucio quota system might help with LOCALGROUPDISK policy (see the sketch after this list).
    • LOCALGROUPDISK management service - Kaushik: work is going on here to improve monitoring and the database schema.
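
As an illustration of the kind of policy check a quota-based LOCALGROUPDISK service might perform, the sketch below flags users over a nominal quota. It assumes a hypothetical per-user usage dump and does not call the real Rucio quota API; the quota value and usernames are made up.

DEFAULT_QUOTA_TB = 5.0  # hypothetical per-user quota

# Hypothetical usage dump: username -> bytes used on LOCALGROUPDISK
usage = {
    "alice": 2.1e12,
    "bob": 7.8e12,
    "carol": 0.4e12,
}

def over_quota(used_bytes, quota_tb=DEFAULT_QUOTA_TB):
    """True if a user's LOCALGROUPDISK usage exceeds the nominal quota."""
    return used_bytes > quota_tb * 1e12

for user, used in sorted(usage.items()):
    flag = "OVER QUOTA" if over_quota(used) else "ok"
    print("%-8s %6.1f TB  %s" % (user, used / 1e12, flag))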

DDM Operations (Hiro)

meeting (5/28/14):
  • Reminder that the query to find files is different under Rucio (see the path sketch after this list).
  • Not sure whether the Rucio dump created by the Rucio team was sufficient.
  • Wei: notes there is a document, but it does not appear to work. Does the basic functionality exist? About half of the REST API commands are not working. Will discuss next week.
  • Hiro: agrees documentation is in poor shape.
  • Will still need the equivalent of a CCC to find dark data. Or missing data?
  • Hiro will coordinate issues.
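
For reference, Rucio physical paths are normally computed from the scope and file name rather than looked up in a catalog as with the LFC. The sketch below follows the commonly described md5-based deterministic convention; the storage prefix is site-specific and the example scope/name values are made up, so the exact convention should be checked against the DDM documentation.

import hashlib

def rucio_path(scope, name, prefix="/atlas/rucio"):
    """Deterministic storage path for a Rucio file, given scope and name."""
    digest = hashlib.md5(("%s:%s" % (scope, name)).encode("utf-8")).hexdigest()
    # user.* and group.* scopes are split into separate directory levels
    if scope.startswith("user.") or scope.startswith("group."):
        scope_dirs = scope.replace(".", "/")
    else:
        scope_dirs = scope
    return "%s/%s/%s/%s/%s" % (prefix, scope_dirs, digest[0:2], digest[2:4], name)

print(rucio_path("mc14_13TeV", "EVNT.01234567._000001.pool.root.1"))
print(rucio_path("user.jdoe", "user.jdoe.test.output.root"))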

meeting (6/11/14)

  • You can no longer use the LFC dump.
  • Have not seen the new dump from Rucio yet - will need to get back to Vincent.
  • Should no longer have dark data - it can all be deleted.
  • There will be a centrally provided script for the CCC (see the sketch after this list).
  • CERN FTS had some issues today, which caused a backlog; the specific problem is unknown.
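
As an illustration of what such a consistency-check (CCC) script has to do, the sketch below compares a storage dump with a Rucio replica dump and reports dark and missing files. It assumes both dumps are plain text with one physical path per line; the file names are placeholders, not the format of the centrally provided script.

def load_paths(dump_file):
    """Read one physical path per line from a dump file."""
    with open(dump_file) as handle:
        return set(line.strip() for line in handle if line.strip())

storage = load_paths("storage_dump.txt")  # what is actually on disk
catalog = load_paths("rucio_dump.txt")    # what Rucio believes is on disk

dark = storage - catalog     # on disk but unknown to Rucio: deletion candidates
missing = catalog - storage  # known to Rucio but absent on disk: needs attention

print("dark files   : %d" % len(dark))
print("missing files: %d" % len(missing))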

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference
last meeting (5/28/14)
  • Added the Spanish cloud, the rest of GRIF, and IN2P3(?).
  • Also SARA T1. Coverage is up to 85% of files, 56% of sites.
  • Missing only TRIUMF and NIKHEF. The goal is to get to 95%.
  • Failover is working okay.
  • Overflow tests are working nicely; they took a long time given the quota. Gathering results for next year. The rates observed are what sites deliver.
  • Improvements to fax-ls; the tools will be moving to being Rucio-based.
  • Wei: Shu-wei has been testing redirection and performance. Wei is working with him - good understanding of the sources of delay and how to avoid them. TRIUMF - they are doing internal testing; they are concerned about RPMs not being signed by WLCG. WLCG will work on this issue. Working to change the architecture for SLAC; only an issue for Xrootd storage. Patrick will have a look as well.
  • 100g testing will continue once local UC issues are resolved.
this meeting (6/11/14)

ATLAS Connect (Rob)

last week
  • The main issue is gaining access to ATLAS software from sites without CVMFS installed.
  • Looking to deliver CVMFS to the site via NFS, supported in the client starting with version 2.1.5. Working with the admins to get the NFS mount set up, on a per-job or per-node basis.
this week
  • Much progress on several resource targets, as Dave Lesny will discuss.
  • Over the last two weeks, working on bringing four clusters into the Connect project, taking an alternative to native CVMFS: an NFS-based CVMFS server (see the sketch after this list).
  • Stampede: admins are working on delivering CVMFS via an NFS server; to deploy via their Lustre filesystem.
  • Midway cluster at Chicago - deployed several nodes with an NFS CVMFS solution. Proving to be successful - running ATLAS jobs; we bring everything the jobs need with us. Working to deploy into a larger environment.
  • ICCC - working with an NFS CVMFS component.
  • Odyssey - already an ATLAS-ready cluster, used opportunistically. A very large cluster - the max has been 550 jobs at a time; preemption is enabled.
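
As an illustration of the pre-flight check a job wrapper might run on clusters without native CVMFS, the sketch below looks for a usable atlas.cern.ch repository, whether FUSE-mounted or exported over NFS. The alternate mount point is a placeholder, not the actual Stampede/Midway path.

import os

CANDIDATE_ROOTS = [
    "/cvmfs/atlas.cern.ch",          # native CVMFS (or NFS mounted at the usual place)
    "/scratch/cvmfs/atlas.cern.ch",  # hypothetical site-local NFS export
]

def find_atlas_repo():
    """Return the first candidate root under which the software area has content."""
    for root in CANDIDATE_ROOTS:
        sw_dir = os.path.join(root, "repo", "sw")
        if os.path.isdir(sw_dir) and os.listdir(sw_dir):
            return root
    return None

repo = find_atlas_repo()
if repo:
    print("ATLAS software area found under %s" % repo)
else:
    raise SystemExit("No usable atlas.cern.ch repository visible on this node")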

Site news and issues (all sites)

  • AGLT2:
    • last meeting(s): 338k dark files in Rucio. Trouble with the gatekeeper around midnight last night - not sure what happened; memory and swap filled, and there has not been a chance to clean up. Did update to the latest OSG gatekeeper. The EX9208 at MSU will be powered up and brought up to 40 Gbps. The EX9208 at UM is still running at 10G; there is a problem with sub-interfaces on the 40G link aggregation.
    • this meeting: working hard on networking. There was a bug on the 40G interface on the Juniper; the forwarding table does not get updated. Going to CC-NIE. Working on connecting the two sites. Bringing up the EX9208 caused routing tables to overflow on the Juniper EX4200 (16k entries), which caused problems; the EX9208 holds 1M entries.

  • NET2:
    • last meeting(s): The FTS3 poor-performance issue was addressed by deploying more GridFTP servers. Gearing up to purchase storage: 700 usable TB with 4 TB drives, half a rack.
    • this week: Need to update FAX doors. Have not yet purchased storage from Dell. Working on Condor CE.

  • MWT2:
    • last meeting(s): 100g testing to BNL.
    • this meeting: At UIUC, will be updating GPFS from 3.5.16 to 3.5.18 to address file corruption; a downtime of 16 hours.

  • SWT2 (UTA):
    • last meeting(s): All the network equipment is in, and we've started stacking it and setting up LAGs; we will get through the configuration and then set a downtime.
    • this meeting: Close to scheduling the downtime for the networking upgrade. A prototype system is in place. One more call with Dell is needed to sign off on the configuration; expect to do the upgrade after that. Close to making a purchase for worker nodes, but not getting great pricing from Dell (Ivy Bridge). Michael: there is an on-going evaluation of Intel processors versus the AMD 6000 series, which seems to be much more cost effective and well-suited.

  • SWT2 (OU, OSCER):
    • last meeting(s): LHCONE - working with John Bigrow to set this up.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Fully functional, operational, and active.
    • this meeting:

  • WT2:
    • last meeting(s): Network reconfiguration for the GridFTP server. Observed a problem with the DigiCert CRL update - it caused problems with VOMS proxy init. Also there were GUMS issues. The OSG Gratia service was down for more than one day - were job statistics lost?
    • this meeting: Will have a short outage next week for power work.

  • T1:
    • last meeting(s): Had a problem affecting storage on Friday - Chimera stopped working between the server and the storage backend, and a reboot was needed to recover; analysis by Hiro led to a decision to upgrade the system. Also working on replacement of aging worker nodes, and a small increment in capacity. Running an extensive evaluation program of AMD processors, finding that the 6000 series is performing very well. Decided to go with AMD rather than Intel (much more expensive than last year). Happy to share results (Saul is interested). Also considering Atom processors from HP; I/O was okay. Likely relevant for the Tier 3.
    • this meeting: Started an initiative to phase out Oracle at the Tier 1; under discussion. WAN architecture changes. In the process of setting up an object store based on Ceph - a good opportunity to evaluate the technology under quasi-production conditions and with the event service. BNL, ESnet, and Amazon are in discussion to waive egress fees; there is an invitation to set up a grant for a long-term study, providing data to Amazon to assess and develop a business model for academic institutions. Worker node procurement: a 120-server purchase, most likely AMD-based.


last meeting / this meeting
  • OSG VO request next meeting
  • Alden: the schedconfig updating service was running on three servers; the machines were decommissioned, but the service has been restored - report any issues.
  • Please update the v31 spreadsheet.
  • Gratia-APEL reporting email backup.

-- RobertGardner - 10 Jun 2014
