Minutes of the Facilities Integration Program meeting, Dec 17, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Michael, Charles, Rob, Saul, Nurcan, Torre, Douglas, Alexei, Kaushik, Armen, Bob, Tom, Karthik, Hiro, Xin, Wensheng, Mark, Patrick, UT Dallas
  • Apologies: none
  • Guests:

Integration program update (Rob, Michael)

  • IntegrationPhase7
  • High level goals in Integration Phase 7:
    • Pilot integration with space tokens DONE
    • LFC deployed and commissioned: DDM, Panda-Mover, Panda fully integrated DONE
    • Transition to /atlas/Role=Production proxy for production DONE
    • Storage
      • Procurements - keep to schedule
      • Space management and replication
    • Network and Throughput
      • Monitoring infrastructure and new gridftp server deployed DONE
      • Throughput targets reached
    • Analysis
      • New benchmarks for analysis jobs coming from Nurcan
      • Support upcoming Jamborees
  • BNL Analysis Jamboree, Dec 15-18, 2008: agenda, BNLJamboreeDec2008
  • Tier 0/1/2/3 Jamboree - Jan 22, 2009
  • Next US ATLAS Facilities face-to-face meeting (past meetings):
    • Will be co-located with the OSG All-Hands meeting at the LIGO observatory in Livingston, LA, March 2-5, 2009 Agenda
    • US ATLAS: March 3, 9am-3pm - focus on readiness for data, and Tier 3 integration - organized by Maxim Potekhin
  • Tier 3 study group is making good progress - a draft document is available. Consolidating input regarding workflows, requirements.
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open usually): http://integrationcloud.campfirenow.com/1391f
  • Upcoming reviews:
    • Program managers review in January
    • February agency review

BNL Jamboree (Nurcan)

  • Nurcan has been following the on-going user analysis jobs. There have been a couple of issues to resolve; a number of issues were reported on hypernews, but nothing specific to the Jamboree.
  • There is also a CERN offline tutorial on-going, given by Alden, ending today.
  • BNL - scheduled maintenance to upgrade Condor. Found pilots failing - a problem with the home directory; further jobs getting stuck in Activated, now resolved. Long queue - registration in the LFC failing: jobs are submitted locally, so they don't have the usual OSG environment variables. Xin is investigating with Torre.
  • AGLT2 - observing a high failure rate - SRM authentication errors, caused by high load on the GUMS server.
    • Analysis user files going into the top-level directory - seeing timeouts.
    • USERDISK - needs to be partitioned at least by user. Kaushik will take the issue to Paul. Charles will summarize for Kaushik.
    • There seemed to be a lot of additional authentications going on - they came and went. There was also a lot of logfile rotation happening.
  • NET2 - a job stuck in the transferring state. The job completed, but there were possibly problems with registration. Will be in touch with Saul.
  • General problem - users need to know if queues are available. Will create a status page table showing each analysis queue online/offline from the analysis dashboard (a minimal sketch follows below). Will also elog user problems. Revisit whether an email to the user is needed.
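
A minimal sketch of the kind of online/offline status table mentioned above, purely for illustration - the queue names and the get_status() lookup are placeholders, not the actual analysis dashboard interface:

# Minimal sketch (not the actual dashboard code): print an online/offline
# table for the US analysis queues. The queue list and get_status() are
# placeholders; a real page would read status from the analysis dashboard.

ANALYSIS_QUEUES = [           # hypothetical queue names, for illustration
    "ANALY_BNL", "ANALY_AGLT2", "ANALY_MWT2", "ANALY_NET2",
    "ANALY_SWT2", "ANALY_OU", "ANALY_SLAC",
]

def get_status(queue):
    """Placeholder lookup: return 'online' or 'offline' for a queue."""
    return "online"

def print_status_table(queues):
    print("%-15s %s" % ("Queue", "Status"))
    print("-" * 25)
    for q in queues:
        print("%-15s %s" % (q, get_status(q)))

if __name__ == "__main__":
    print_status_table(ANALYSIS_QUEUES)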

Operations overview: Production (Kaushik)

  • Last week
    • Very little production - the task definition system was down due to the Panda monitoring system migration to CERN; it should be functional now. The Panda monitor at BNL may not give accurate information about tasks and DDM: the pieces maintained by Pavel and Alexei have moved to CERN and are not yet in sync with the BNL instances. The database migration from MySQL to Oracle will take time to sort out.
    • http://panda.cern.ch/
    • No new tasks were defined this week, which means no new jobs, except for re-processing validation jobs - shifters are filing bug reports.
    • Expect large numbers of jobs and tasks for the holidays - deadline Dec 15 - a month's supply.
  • this week
    • US ATLAS ADC Operations workshop at BNL
    • Glitch with a cache for reprocessing - Xin has patched at all US sites.
    • Lots of evgen now running, start of large scale holiday production.
    • Discussions at BNL - planning for next year: cosmic data run in May, beam at the end of summer. Keeping sites full and busy - what steps should we take locally? Improvements to software for production and shifts. Will send a summary of the meeting notes tomorrow.
      • Permissions for LFC - agreed to; sites do need to be fixed.
      • What to do about user datasets? For the moment, only those with the production role will be able to delete (same as production).
      • DDM - core members of the dq2 team are leaving.
      • Should we use Panda mover for output data? Under discussion.
      • Reprocessing discussion - Alexei is looking into data migration from Tier 2s outside the cloud.
      • Condor-G issues.
    • Alexei: agreed to have a phone call to discuss Panda mover.

Shifters report (Mark)

  • Distributed Computing Operations Meetings
  • ADC Shifters meeting
  • Have been running reprocessing tasks - there was a bad file from another cloud. Wensheng tracked it down and brought in a new copy; the files were replaced at the Tier 2s.
  • On-going issue at Harvard - stage-in problems. John contacted.
  • Problems reprocessing at Tier 2s when accessing the conditions database via COOL. Patrick: local compute nodes have a domain that is not recognized by POOL for the database lookup. Since release 13.0.35 there is an environment variable to set the domain name. Two options: patch the ATLAS WN script, or have the pilot set the environment variable to the site's main gatekeeper name (a minimal sketch follows this list). The latter is now implemented in the pilot by Paul. This is an old problem, dating from June.
  • Alexei - is Sasha in the loop? Yes.
  • SLAC gatekeeper issue. Authentication okay, but globus-job-run not working. Wei is out of town. Can Douglas help? Will look into it.
  • For a thorough overview, see Yuri's summary.
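
A minimal sketch of the pilot-side workaround described in the conditions database item above, purely as an illustration - the environment variable name ATLAS_CONDDB and the gatekeeper hostname are assumptions here, not confirmed details of the actual pilot patch:

import os
import socket

GATEKEEPER = "gatekeeper.example-site.edu"   # hypothetical site gatekeeper name

def set_conddb_lookup_host():
    """If the worker node has no usable domain, fall back to the gatekeeper."""
    node = socket.getfqdn()
    host = node if "." in node else GATEKEEPER
    os.environ["ATLAS_CONDDB"] = host         # assumed variable name
    return host

if __name__ == "__main__":
    print("Conditions DB lookup host set to: %s" % set_conddb_lookup_host())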

Analysis queues, FDR analysis (Nurcan)

Operations: DDM (Hiro)

  • LFC permissions problem - Charles and Hiro investigating.
  • Continuing to forward email to sites if there are problems with getting data.
  • Reminds everyone to pay attention to the dashboard monitoring emails.
  • Alexei - all AODs are being replicated to all Tier 2s: 25 TB over the next two weeks, going to the DATADISK token area (a rough rate estimate follows this list).
  • Hiro is testing a dataset replication monitoring program (dq2ping) - ignore its output for now.
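
A back-of-the-envelope check of the replication rate implied by the AOD item above (25 TB per Tier 2 over two weeks):

TOTAL_TB = 25.0       # AOD volume per Tier 2 DATADISK
DAYS = 14.0           # "over the next two weeks"

tb_per_day = TOTAL_TB / DAYS
mb_per_s = TOTAL_TB * 1e6 / (DAYS * 24 * 3600)   # 1 TB = 10^6 MB

print("~%.1f TB/day per site, i.e. a sustained ~%.0f MB/s" % (tb_per_day, mb_per_s))

This works out to roughly 1.8 TB/day, or about 20 MB/s sustained per site.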

LFC migration

  • SubCommitteeLFC
  • last week
    • Discussed Kaushik's proposal for ownership and roles for LFC and the sites. Kaushik will send around specific checks. The remaining issue is user dataset deletions.
    • New cleanse script for the LFC - see DQ2SiteCleanse. It handles DQ2 and LFC deletions; storage-specific deletion will need to be added, but this can be handled easily.
    • Will also need to add logic for managing PRODDISK as a cache, which differs from the other endpoints.
    • Hiro - suggestion to add a default group as a way of enabling deletion of datasets.
  • this week
    • Done at all sites.
    • DQ2SiteCleanse - Karthik is testing it at OU; stalled for the moment. The current version is v1.8; a verify mode will be added (a sketch of the kind of consistency pass involved follows this list). BU - Saul will check it out.
    • Kaushik: share with ATLAS in general.
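
A minimal sketch of the kind of catalog/storage consistency pass these cleanup tools perform. This is not DQ2SiteCleanse itself; the two listing functions are placeholders for the real LFC and storage queries, and the verify mode only reports, never deletes:

def list_lfc_replicas(site):
    """Placeholder: return the set of SURLs registered in the LFC for a site."""
    return set()

def list_storage_files(site):
    """Placeholder: return the set of SURLs actually present on storage."""
    return set()

def consistency_pass(site, verify_only=True):
    registered = list_lfc_replicas(site)
    on_disk = list_storage_files(site)
    ghosts = registered - on_disk    # catalog entries with no file behind them
    dark = on_disk - registered      # files on disk that the catalog cannot see
    print("%s: %d ghost entries, %d dark files" % (site, len(ghosts), len(dark)))
    if not verify_only:
        pass  # a real tool would remove ghost entries from the LFC/DQ2 here
    return ghosts, dark

if __name__ == "__main__":
    consistency_pass("EXAMPLE_SITE", verify_only=True)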

VDT Bestman-Xrootd

Throughput Initiative (Shawn)

  • Shawn is at CERN.

Holiday operations (Michael)

  • AGLT2 - all will be reading email.
  • NET2 - Saul will be checking.
  • MWT2 - laptops on.
  • OU - irregularly
  • SLAC - Doug will be checking in. Wei probably as well.

Site news and issues (all sites)

  • T1:
    • last week: Procurements are still going. 1 PB of storage is being set up by DDN - to be evaluated. WAN issues with the LHC OPN - a fiber disruption in London, now fixed. Hiro reported on the LFC daemon crash; working with David Smith, and will get an upgrade.
      • ATLAS DDM stress test - the goal is to stress the replication components (site services and catalogs): 2000 datasets at each site, 5-10K replication transactions per site per hour, 1-10M files involved. The idea is to generate lots of transactions affecting the site services and catalogs. Started this morning and will last 10 days. Suggestion from Charles about reducing the number of queries against the central catalog; he was asked to speak with Pedro. Pedro will move to BNL and head the storage management group at the Tier 1.
      • In discussion about a direct 10G link to Starlight - a dedicated circuit. The last mile on Long Island is to be put in place in January. Propose a meeting with ESnet and the Tier 2 people. Will be provided to US ATLAS at no cost to the project.
    • this week: discussion of the holiday schedule.
  • AGLT2:
    • last week: Had problems with the LFC - it was installed from the test cache and had to be re-installed, which was easy. Using monit software for restarting downed services. Working on the Lustre setup. Meeting with Merit in order to get dynamic circuit capability to MSU. Getting ready for the muon calibration workshop - Bob is setting things up for interactive users. Working on getting Charles's ccc.py consistency-check script running.
    • this week: seeing some ANALY jobs coming through. Only 7 TB available in DATADISK. Still working on Lustre - working on the admin machines. Planning to do this in an HA mode - primary and secondary.
  • NET2:
    • last week: Hardware - racking new storage and blades; hope to have everything online before Christmas. LFC migration completed. The MySQL database will be at BU, with two LFC instances. John has started a new job at Harvard, but he'll continue to work on NET2.
    • this week: Since the migration we have not run a lot of production - want to see things fully up before the holiday production. New blades are installed and in production; storage is installed but not yet online. Muon calibration workshop tomorrow.
  • MWT2:
    • last week: The DATADISK and PRODDISK token areas are completely cleaned up - all holes in datasets removed; there are now zero ghost or orphaned datasets. Bringing up new hardware - new storage nodes.
    • this week: running into problems hitting dCache full limits. Brought up the compute nodes; working on the dCache configuration.
  • SWT2 (UTA):
    • this week: CPB cluster back online. Upgraded LFC release. Working with Nurcan on TAG analysis with AODs.
  • SWT2 (OU):
    • last week: working on updating LFC with the OSCER admins. Trying out the clean-up script from Charles.
    • this week: still working on LFC cleanup exercise. Ibrix.
  • WT2:
    • last week: Upgraded LFC to stable version on Monday - went smoothly.
    • this week:

Carryover issues (any updates?)

Release installation via Pacballs + DMM (Xin, Fred)

  • last week:
    • Yesterday Torre began submitting jobs - they come in as usatlas2. Is the rate too high (based on SWT2_CPB_INSTALL observations)? There is increased load from the grid monitor due to the (separate) job submit host.
    • Xin is submitting one install job to each site.
  • this week:
    • Testing installation pilots on Tier2s. A couple of configuration problems - now fixed. Will be sending new jobs.

Squids and Frontier (Douglas)

  • last week:
    • Testing a read-real script to approximate a real conditions data access job, run repeatedly. The configuration between SLAC and BNL is figured out: first connecting to the squid cache at BNL, then to the Frontier service at BNL; then to the squid cache at SLAC, which forwards the HTTP request to the Frontier service at BNL (a configuration sketch follows at the end of this section). Tested.
    • Access times vary between jobs - 2000-2500 seconds mostly. Jumps to 3000-3500 seconds sometimes. Testing "zip levels" for compression (0-5).
    • Might yield improvements of ~10%.
    • Squid cache verified to be working correctly at SLAC.
    • Rerun script at CERN, 120-180 seconds, local Oracle access.
  • this week:
    • I presented an update on the conditions DB access testing at SLAC this morning at the ATLAS database group meeting. You can find the slides in the meeting agenda: http://indico.cern.ch/conferenceDisplay.py?confId=47447. Things are working fairly well now that the Frontier log issue is understood, and the SLAC squid server is in production use. This has been shown to work on SLAC batch nodes, and with multiple jobs running. More testing is needed to show that this can work for full production running, but hopefully this week. - Douglas
    • Problem with client slowness when writing logfiles to AFS; write the Frontier logfile to /tmp instead.
    • Now getting good results, 120 - 180 seconds in various configurations.
    • When everything is local, just 20 seconds.
    • Running on batch systems, accessing conditions data works. Will scale up. Question about what the profile will look like.
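
A minimal sketch of the squid/Frontier chain described above, using the frontier_client convention of listing the server and proxies in the FRONTIER_SERVER environment variable. The URLs and ports below are placeholders, not the actual BNL or SLAC endpoints:

import os

# The client tries the local (SLAC) squid first; the squid forwards the HTTP
# request to the Frontier service at BNL. The BNL squid is a fallback proxy.
frontier_server = (
    "(serverurl=http://frontier.example.bnl.gov:8000/atlr)"    # hypothetical Frontier @ BNL
    "(proxyurl=http://squid.example.slac.stanford.edu:3128)"   # hypothetical squid @ SLAC
    "(proxyurl=http://squid.example.bnl.gov:3128)"             # hypothetical squid @ BNL
)

os.environ["FRONTIER_SERVER"] = frontier_server
print("FRONTIER_SERVER = %s" % os.environ["FRONTIER_SERVER"])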

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • last week:
    • Saul re-installing latest version for BU/Harvard. Tested and working now.
    • Charles will do the dCache version.
    • Patrick will take a look at a version for xrootd, based on xrdcp (a minimal sketch follows this list).
  • this week:
    • Still a problem with the analysis queue at Harvard, so there may still be an issue with the BU lsm. Saul is investigating with John.
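
A minimal sketch of what an xrootd flavour of the local site mover "get" could look like, assuming a simple "lsm-get <source> <destination>" calling convention (see the LocalSiteMover specification for the actual interface). It just shells out to xrdcp and propagates the exit code so the pilot can react:

import subprocess
import sys

def lsm_get(source, destination):
    """Copy one file from xrootd storage to the local worker node via xrdcp."""
    cmd = ["xrdcp", "-f", source, destination]
    try:
        rc = subprocess.call(cmd)
    except OSError as exc:
        sys.stderr.write("lsm-get: could not run xrdcp: %s\n" % exc)
        return 1
    if rc != 0:
        sys.stderr.write("lsm-get: xrdcp failed with exit code %d\n" % rc)
    return rc

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.stderr.write("usage: lsm-get <source> <destination>\n")
        sys.exit(2)
    sys.exit(lsm_get(sys.argv[1], sys.argv[2]))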


  • We will skip the next two weeks' meetings. Next meeting in 2009.

-- RobertGardner - 16 Dec 2008
