r2 - 29 Oct 2008 - 14:28:16 - RobertGardner



Minutes of the Facilities Integration Program meeting, Oct 29, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Tom, Rob, Charles, John, Rich, Michael, Nurcan, Kaushik, Mark, Wei, Torre, Shawn, Fred, Horst, Karthik, Hiro, Xin, Armen, Yuri
  • Apologies: Sarah
  • Guests: none

Integration program update (Rob, Michael)

  • IntegrationPhase7 under construction
  • IntegrationPhase6 now complete, including SiteCertificationP6 and CapacitySummary (figures effective Sep 30)
  • Quarterly reports - needed for agency reporting. DONE
  • High level goals in Integration Phase 7:
    • Pilot integration with space tokens
    • LFC deployed and commissioned: DDM, Panda-Mover, Panda fully integrated
    • Transition to /atlas/Role=Production proxy for production
    • Storage
      • Procurements - keep to schedule
      • Space management and replication
    • Network and Throughput
      • Monitoring infrastructure and new gridftp server deployed
      • Throughput targets reached
    • Analysis
      • New benchmarks for analysis jobs coming from Nurcan
      • Upcoming Jamborees
    • Probably will hold another US ATLAS Tier 2/Tier3 meeting
      • Winter/Early Spring
      • Location: Duke (TBC)
  • OSG site admins meeting coming up: https://twiki.grid.iu.edu/bin/view/SiteCoordination/SiteAdminsWorkshop2008

Reliable, published facility services (Michael)

We need to make downtimes, and any ongoing problems, transparent and well known to the collaboration. We need to work out a structure for providing this information to the collaboration.
  • What are the agreed upon methods?
  • OIM https://oim.grid.iu.edu/ to announce downtimes
  • Where to "announce"? shifters, prodsys, oim, ....
  • Curl method for setting sites off/online in Panda - make this automatically inform OIM?
  • Would like to do this in one place. With one click. Note OIM requires a grid certificate.
  • Kaushik will provide a summary and send it around.
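The one-click idea above could start with a small wrapper that builds the status-change request. This is a minimal sketch only, assuming a hypothetical controller endpoint and parameter names - the real Panda curl interface and any OIM hook would differ:

```python
from urllib.parse import urlencode

# Hypothetical endpoint - a placeholder, not the real Panda control URL.
PANDA_CTRL = "https://panda.example.org/server/controller"

def status_change_url(site, status):
    """Build the URL a shifter would fetch (with a grid certificate,
    as OIM also requires) to mark a site online or offline."""
    if status not in ("online", "offline"):
        raise ValueError("status must be 'online' or 'offline'")
    return PANDA_CTRL + "?" + urlencode({"site": site, "status": status})
```

A single script like this could fetch the URL and then record the same downtime in OIM, giving the desired one-place, one-click announcement.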

Next procurements

  • Cover in site-reports

Operations overview: Production (Kaushik)

  • follow-up issues
    • Issue - there was a problem with release matching in Panda; matching is disabled for now, so all releases are assumed available at all OSG sites. DONE
    • DATADISK was mistakenly used instead of PRODDISK (correct only at the Tier 1). Sorted out. DONE
  • US production initiative - keeping queues filled. Two to three weeks.
  • New round of production by Borut should keep the cloud filled for the next couple of weeks.
  • We also need to keep the communication lines open with the US community - RAC meetings starting back up.

Shifters report

PRODDISK migration (Yuri, Paul, others)

  • Follow-up w/ status at each site
  • AGLT2 - done DONE
  • MWT2_UC, UC_ATLAS_MWT2 - done DONE
  • MWT2_IU - done DONE; IU_OSG: - now done DONE
  • SWT2_CPB - done DONE
  • SWT2_OU - need to install Bestman; will start tomorrow afternoon.
  • SLAC
    • Ready. However, Yuri reports there were problems with jobs being evicted.
    • Yuri will contact Paul to get the process started.
    • Paul sent word xrootd site mover is working.
  • BU - site back online; we expect new jobs to succeed. Using lcg-cp; Paul will include this in a future pilot release. DONE
  • Yuri notes that all configuration changes will now be made through pilotcontroller.py in SVN; all changes should go through Paul, Torre, or Tadashi, and the information there should match ToA.
  • Charles - what about putting a generic mover into the pilot which calls out to a site-provided script?
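Charles's suggestion might look something like the following. A rough sketch only, with a made-up script name - the real pilot site movers have their own interface:

```python
import subprocess

def generic_copy(src, dest, mover_cmd="site-mover"):
    """Run a site-provided transfer script (mover_cmd src dest) and
    return its exit code and combined output. The pilot would map a
    nonzero code onto its usual transfer-error handling."""
    try:
        proc = subprocess.run([mover_cmd, src, dest],
                              capture_output=True, text=True, timeout=3600)
        return proc.returncode, proc.stdout + proc.stderr
    except FileNotFoundError:
        return 1, "mover script not found: " + mover_cmd
```

Each site could then drop in whatever copy tool it prefers (lcg-cp, xrdcp, plain cp on a shared filesystem) without requiring a new pilot release.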

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/AtlasProtected/PathenaAnalysisQueuesUScloud
  • Continuing stress test w/ 400 TAG selection jobs.
    • MWT2 - passed, AGLT2 - 1/2 jobs failed due to a file server problem - fixing.
  • Second phase 10K jobs. Will start after next week, after LFC.
  • New instrumentation in pathena for running on datasets that have files on tape or disk. The Panda monitor has been updated to provide information about these jobs, and there is a new FAQ link explaining the policy for obtaining files on tape. The production server was updated yesterday. Pathena will create a shadow dataset of the files already on disk and proceed with those jobs; the remainder can be rerun once the files on tape are staged.
  • Issue of sampling very large datasets, e.g. 2000 files out of 16K RDO files. New policy: up to 2K files for pathena; for greater numbers, use production Panda.
  • DA session next week at ATLAS software week. 3 talks on pathena updates and analysis support. User proxy on worker nodes.

Operations: DDM (Hiro)

  • UTA - offline; BU - problems fixed.
  • We expect ESDs to be replicated shortly to DATADISK.

Throughput initiative - status (Shawn)

  • Focus on testing with new BNL doors. 700MB/s easily. Now testing at sites.
  • Perfsonar boxes. Supposed to be deployed and operational.
  • Hiro: can easily get to 200 MB/s to AGLT2, but not 400 MB/s for long periods. Checking configurations.
  • Will test at UC and Wisconsin today.
  • Will continue with other sites as available.
  • Rich: sites should let John Bigelow know when their Koi servers are running.

LFC migration

Site news and issues (all sites)

  • T1: have seen 900 MB/s sustained CERN to BNL. The system looked very good, with no excessive loads; note these were large files. FY09 procurement - 120 worker nodes at 2.8 GHz; 2 PB of disk, initially Thumper, will look at DDN. Planning an upgrade of the HPSS system (anticipated downtime would be 4 hours).
  • AGLT2: Equipment is arriving - received all storage equipment except two head nodes and two chassis for blade servers. Installing and bringing up equipment asap. Cabling and documenting.
  • NET2: PRODDISK complete. Perfsonar nodes are here; install pending security issues. HU up and running, but with a high failure rate, not sure whether it is site-specific. HU expanding to add 1000 cores. At BU - 128 cores of IBM blades. Storage ordered.
  • MWT2: Completed PRODDISK migration. LFC migrated at both sites. Equipment now arriving.
  • SWT2 (UTA): LFC migration in progress. Awaiting quotes from Dell.
  • SWT2 (OU): all okay.
  • WT2: all okay. Ready to migrate to proddisk and LFC.

Carryover issues (any updates?)

Long URLs in the catalog (Marco)

  • Our convention in the US is to use full URLs in the catalog.
  • There are a few changes implied for the pilot - Paul is aware.
  • What about dq2-put? Is the short URL used? Need to check w/ Mario.
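As a concrete illustration of the convention: a full catalog URL names the SRM endpoint and carries the file name in an SFN query, while the short form omits them. The exact forms below are illustrative assumptions, not taken from the catalog:

```python
def is_full_surl(url):
    """Heuristic check: a 'long' catalog entry names the SRM web-service
    endpoint and carries the file name in an ?SFN= query."""
    return url.startswith("srm://") and "?SFN=" in url

# Illustrative examples, not real catalog entries:
long_form = "srm://se.example.org:8443/srm/managerv2?SFN=/atlas/datafile"
short_form = "srm://se.example.org/atlas/datafile"
```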

Release installation via Pacballs + DMM (Xin, Fred)

  • status from last week
    • All pacballs now being transferred automatically to BNL (will test with 14.2.24 release)
    • Pilot - usatlas2 role - how to save the results to the output SE.
    • There is a question about where to send the production installation job logs.
  • Successfully ran an install job at BNL. Will try more sites this week.
  • Migration to production - will do this site by site.
  • 14.2.24 being built right now. Will check the timing of the pacball migration to BNL.


  • Fred was running jobs on a Tier 3, getting conditions data from the BNL database; tuning is needed to reduce download times. Carlos Gamboa at BNL offered suggestions for Oracle tuning at the host level. Wei notes that latency makes this around 20 minutes at SLAC, which brings the use of Frontier back into question. Michael - setting up the required infrastructure at BNL to support the distribution; use Squid at the Tier 2s. Timeframe: a few weeks.
  • Vote early, vote often

-- RobertGardner - 28 Oct 2008
