


Minutes of the Facilities Integration Program meeting, Aug 20, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Sarah, Fred, Rob, Rich, Michael, Mark, Tom, Wen, Jim, Horst, Karthik, Justin, Shawn, John, Jay, John, Hiro, Xin, Saul, Nurcan, Patrick, Wensheng, Bob
  • Apologies: Torre, Wei, Kaushik
  • Guests: none

Integration program update (Rob, Michael)

  • See IntegrationPhase6
  • Upcoming meetings
    • Analysis tutorial (CERN) - last week of August
    • 3-site Jamboree: 9-12 September
    • Next Tier2/Tier3 meeting - revised dates: 22-23 September, at BNL; webpage coming soon. Tickets can be booked now.
      • Meeting website to be set up.
  • Analysis requirements group (Jim)
    • The computing model is not very specific about how analysis will be done.
    • Acquiring input from the physics community - some regular phone meetings with experts.
    • Goal is to come up with guidelines for users on how to do analysis; the group is still forming.
    • Will need Facility input
  • Procedural issue - data integrity problem
    • Tracked down to a transfer problem between a Tier 2 and the Tier 1.
    • A serious problem that required a lot of cleanup effort.
    • Resources taken offline during investigations should be brought back in a controlled way; results are communicated back to the Facilities manager, who will then make a proposal to the Production manager.
    • Will develop a protocol for this.

Next procurements

  • Dell, IBM, Sun pricing news
  • Latest - Sun will provide information tomorrow
  • IBM still working

Internet2 monitoring hosts

  • Follow-up:
    • Orders placed for each site?
    • Resistance at SLAC to the agreed configuration - Wei will check why. Also a concern about Web100 kernels (or old kernels).
    • AGLT2 - ordering two sets of two; two-week delay from MSU.
    • MWT2_UC - ordered; _IU - ordered.
    • NET2 - will order this week; asked Koi for a copy of the PO.
    • SWT2 - will order this week; waiting on Koi. OU - will get to this soon.
  • Software available end of month, install as servers arrive.
    • Rich: code freeze by the end of the week; delivery date by Sep 4. Will be downloadable.
  • Fully deployed infrastructure at all sites by Sep 30

Storage capacity guidance

  • Storage capacity recommendations/guidance for the Facility (from Kaushik's model on MinutesJune11).
  • Review of recommendations (from Kaushik, last week); a quick totals sketch follows the note below:
Token       Now (minimum)   Oct 15th
-----       -------------   --------
PRODDISK    20 TB           20 TB
MCDISK      60 TB           66 TB
DATADISK    20 TB           168 TB
USERDISK    10 TB           35 TB
GROUPDISK   10 TB           10 TB

Note: as I said in my June presentation, these numbers do not include the 64 TB for US regional quota, 
which will most likely be distributed among USER, GROUP and LOCALUSER tokens.
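
For reference, a minimal sketch (a sanity check only, using just the figures in the table above; not an official requirement) that totals the recommended capacities per column:

    # Sketch: sum the recommended space-token capacities (TB) from the table above.
    tokens = {
        "PRODDISK":  (20, 20),
        "MCDISK":    (60, 66),
        "DATADISK":  (20, 168),
        "USERDISK":  (10, 35),
        "GROUPDISK": (10, 10),
    }
    now_total = sum(now for now, _ in tokens.values())
    oct_total = sum(oct15 for _, oct15 in tokens.values())
    print("Now (minimum): %d TB, Oct 15th: %d TB" % (now_total, oct_total))  # 120 TB, 299 TB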

Revised WLCG pledges

Operations overview: Production (Mark)

  • Production has ramped (nearly) back up.
  • Condor-G auto-submission at BNL ran into a problem with large-scale job eviction. The pilot logs contain messages claiming "grid resource not available, evicted". Some jobs show error code 1201 in Panda, though this is not really what is happening. 14K jobs were in a held state over the weekend. Removing the held jobs and restarting Condor improved the situation, but did not fully resolve it. There appears to be a problem in the submission logic - it may be too aggressive for the latency at large scale. A change was put in on Tuesday and the situation looks better; the problem is ongoing.
  • A few site level issues - AGLT2 authentication problem related w/ GUMS server (resolved; moved to new system).
  • Autopilot scheduler stopped sending jobs early Tuesday morning for some reason. Could not find the cause.
  • Missing releases at several sites - Xin installed them.
  • Removed SWT2 from production - older OS causing problems.
  • DDM central catalog server problem, caused by aliasing of servers.
  • Following up on last week's main issue: file checksum corruption
    • Wensheng + AGLT2 discovered that the problem was probably due to one pool, and that the error occurred during transmission. Should be resolved by now.
    • 94/97 files had the correct checksum (LRC matched).
    • Happened between July 3-6; still not sure of the exact cause.
    • Hiro has a script which checks transfers; now testing. (Could double the I/O load - files need to be written, then read back.) Will modify the DQ2 site services; transparent for dCache sites. Wei says this can be done for Xrootd as well (and BeStMan as well - NET2, OU, old SWT2).
    • CMS has looked into checksums in flight, thought to be too heavy. (?)
    • Could more of the data inventory be affected? Hiro has been looking at some sweeps; have not heard of others.
    • Implement Adler32 for pilots - Paul will work on this during the week (see the checksum sketch after this list).
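
A minimal sketch of the kind of post-transfer check being discussed - an illustration only (the function names are hypothetical; this is not Hiro's script or the pilot code): compute the Adler32 checksum of the written file and compare it against the value recorded in the catalog. The extra read pass is why this could roughly double the I/O load.

    import zlib

    def adler32_of_file(path, blocksize=1024 * 1024):
        """Compute the Adler32 checksum of a file, reading it in chunks."""
        checksum = 1  # Adler32 starts at 1
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                checksum = zlib.adler32(block, checksum)
        return "%08x" % (checksum & 0xffffffff)

    def verify_transfer(path, expected_adler32):
        """Re-read the transferred file and compare against the catalog checksum."""
        return adler32_of_file(path) == expected_adler32.lower().zfill(8)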

Shift report

  • To reiterate, we need to keep an eye on the job eviction problem. Site admins - look in the pilot logs and report to the shift (a small log-scanning sketch follows the excerpt below).
  • Typical errors:
    • From the panda monitor page for a job:
      • Error details: pilot: Payload failed: Interrupt failure code: 1201
      • Error details: pilot: Job killed by signal 15: Signal handler has set job result to FAILED, ec = 1201 (no time to send log)
    • From the 'pilot.log' file:
                000 (837929.000.000) 08/19 22:48:41 Job submitted from host: <>
                001 (837929.000.000) 08/19 21:01:22 Job executing on host: gt2osgserv01.slac.stanford.edu/jobmanager-lsf
                004 (837929.000.000) 08/19 23:43:16 Job was evicted.
                    (0) Job was not checkpointed.
                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                        0  -  Run Bytes Sent By Job
                        0  -  Run Bytes Received By Job
                009 (837929.000.000) 08/19 23:45:36 Job was aborted by the user, via condor_rm (by user sm)
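
As a rough illustration of what to look for in these logs (a hypothetical helper, not an official shift tool), a small scan of a Condor job log for eviction (event 004) and abort (event 009) records like those above:

    import re
    import sys

    # Condor job-log event lines start with a three-digit event code, the job id,
    # a date/time, and a message, e.g.:
    #   004 (837929.000.000) 08/19 23:43:16 Job was evicted.
    EVENT_RE = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
    INTERESTING = {"004": "evicted", "009": "aborted"}

    def scan(logfile):
        """Print the eviction/abort events found in a Condor job log."""
        with open(logfile) as f:
            for line in f:
                m = EVENT_RE.match(line.strip())
                if m and m.group(1) in INTERESTING:
                    code, jobid, when, msg = m.groups()
                    print("%s  job %s  %s: %s" % (when, jobid, INTERESTING[code], msg))

    if __name__ == "__main__":
        scan(sys.argv[1])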

Analysis queues, FDR analysis (Nurcan)

  • Functional testing - Kaushik suggested working with a large dataset to submit jobs to many sites, using a min-bias dataset. This is now a running task.
    • Nurcan requested replication to all sites.
    • Cron job - a little reluctant to do this right now, until dataset deletion is working (Charles thinks it will be working now).
    • Will manually saturate the sites and observe scaling issues - before the Jamboree.
  • From last week:
    • PathenaAnalysisQueues
    • Waiting to hear from the FDR-analysis users about which releases and datasets. Jim has new information; see the Jamboree twiki.
  • Will be testing analysis queues with Release 14.2.20.

RSV and WLCG SAM (Fred)

Operations: DDM (Hiro)

  • BNL had a PNFS-related problem; the symptom was inability to read/write. The PNFS daemon was stuck - load too high, with no CPU usage and no disk I/O. Cleared after a restart.
    • PNFS is a constant area of worry. There is constant load from various activities - e.g., analysis jobs opening lots of files. Not getting significant help from the developers. Will be instrumenting the environment. There is no way to throttle requests (probably coming from one or two users) that have a global impact. Chimera? There are no production sites using it; there is a testbed at BNL. There will be a visit from a Chimera developer.
  • AGLT2 - the GUMS server problem that was inhibiting DATADISK is fixed (now okay); troubleshooting the space-token-enabled pilot. Access to PRODDISK causing problems.
  • http://atladcops.cern.ch:8000/drmon/fdrmon_TiersInfo.html
  • FDR2 reprocessed datasets - results
  • FT (functional tests) look good.

Site news and issues (all sites)

  • T1: PNFS issue - instrumentation; Gridftp doors: new machines arrived, 10G NICs. Waiting for Force10 network cards (expected soon).
  • AGLT2: offline - job failures were either lost heartbeats or get-errors for space-token authorization. The GUMS server was sitting on unreliable hardware; now migrated to a VM and working well. Typically run 3 GUMS servers with round-robin failover.
  • NET2: running smoothly - news: still working on bringing up the Harvard site. Harvard: 4096 cores (expect to get a good fraction of them). New Intel Harpertowns. Will have a special "scavenge" queue. Lots of free capacity - used by a small astrophysics group, mostly cosmology. Storage - have a Thumper; installing a rack of Thumpers. Running Lustre.
  • MWT2: running smoothly. Have been using curl interface to adjust queue depth. Setting up additional space tokens.
  • SWT2 (UTA): NFS server problem (exporting usatlas1 home and releases).
  • SWT2 (OU): busy with the physics department renovation.
  • WT2:

Carryover issues (any updates?)

LFC migration

Pilot upgrade for space tokens (Kaushik (Paul))

  • Update from Shawn: lcg-cp working; still need to work on registration. Carry-over item.
  • There are issues with user analysis jobs too; authorization-related.

Release installation via Pacballs + DMM (Xin)

  • Xin: testing the script from Alessandro - a couple of modifications needed for OSG.
  • Fred is following up with Stan and Alexei about some of the DQ2 items; hopefully a report next week.

Throughput initiative - status (Shawn)

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

User LRC deletion (Charles)

  • Data deletion tool - completed. Maro - Testers needed. Updates at sites are needed.

WLCG accounting

OSG 1.0

  • Following development of Globus gatekeeper errors 17 and 43 at some sites in OSG 1.0


  • A separate subcommittee has been formed to redefine the whitepaper (Oct 1). Placeholder to follow developments.
  • Anything new? Chip sent a list of homework for people to do; the group has not met yet, but there is lots of activity.
  • Neng (UW): the question will be how to effectively get DPDs to the Tier 3s.


  • FDR 2c (reprocessed data) replicated to Tier2's: 100%
  • Dataset deletion at sites.

-- RobertGardner - 19 Aug 2008



Attachments:
  • Normalization-factors-USATLAS-v10.xls (61.5K) - RobertGardner, 19 Aug 2008 - Normalization table and WLCG pledge factors
  • Normalization-factors-USATLAS-v10-pledge.xls.pdf (34.9K) - RobertGardner, 20 Aug 2008 - WLCG pledges - v10