


Minutes of the Facilities Integration Program meeting, Sep 23, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Rob, Charles, Aaron, Michael, Sarah, Jason, Rich, John, Justin, Wei, Saul, Bob, Hiro, Torre, Shawn, Tom, Patrick, Kaushik, Armen, Fred
  • Apologies: Horst, Nurcan, Mark

Integration program update (Rob, Michael)

  • Introducing: SiteCertificationP10 - FY09Q04
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Site certification table has a new column for the lcg-utils update, as well as curl. Will update as needed. Note: the update is needed most critically on the PandaMover host.
      • ATLAS software week: http://indico.cern.ch/conferenceDisplay.py?confId=50976
      • Important issue: DQ2 site services consolidation
        • Will run at BNL for all sites in the US cloud, predicated on FTS 2.2 support for remote checksums; if this works, we'll consolidate. The new FTS is currently being tested. It needs to work for all storage back ends, including Bestman. AI: follow up with Simone, Hiro and Wei to bring a test instance to BNL and test with the Bestman sites.
      • Storage requirements: SpaceManagement
      • FabricUpgradeP10 - procurement discussion
      • Latest on lcg-utils and LFC:
         Begin forwarded message:
        For LFC: Just yesterday I got it building on one platform and hope to have it building on multiple platforms today. So it's in good shape.
        For lcg-utils: I upgraded (it was a painful set of changes in my build, but it's done) and Tanya did a test that showed everything working well. But about 30 minutes ago, Brian Bockelman told me that I might need to upgrade again to avoid a bug--I just contacted some folks in EGEE, and they confirmed that I should upgrade to a new version. *sigh* Hopefully I can get this done today as well.
        All of that said: I can almost certainly give you something for testing this week.
      • Specifically: GFAL: 1.11.8-1 (or maybe 1.11.9-1), lcg-utils: 1.7.6-1. These were certified recently, and 1.7.6-1 fixes a bug that was critical for CMS: https://savannah.cern.ch/bugs/index.php?52485 -alain
    • this week:
      • CapacitySummary
      • WLCG pledge
      • ATLAS has refined its resource requirements versus what was requested earlier. Revised requirements on the number of replicas of complete AOD datasets have led to a reduction, in particular at the T1 (9.5 PB down to 5 PB), but not at the T2's. The old 2010 requirement of 25K HS06 stays at the same level in terms of storage and CPU. The 2009 pledge is 2.5 PB of disk across all T2's (almost what we have); the target is 3.1 PB by the end of the year. 2010 requires a steep ramp-up in order to satisfy ATLAS plus the US reserve: the numbers go from 3 to 7 PB in 2010. In terms of CPU we're already doing very well: currently at 55 kHS06 against a target of 61 kHS06, and 76-80 kHS06 by the end of 2010 shouldn't be an issue. We really have to get CPU and storage balanced, so we need to more than double the storage space deployed by the end of 2009 (see the arithmetic sketch after this list). A table summarizing currently installed capacity was sent out; some sites need to ramp up significantly. The goal is to get all our T2's to the same level from the production perspective; it is harder to balance large and small sites, and this may take more than a year. The aim is to homogenize the level of resources across sites. Regarding technology, all our purchases should use the newest disk technology where possible, e.g. 2 TB drives, and avoid older technology because of the cost of space and power. Target dates will depend on machine performance. The overriding issue is supporting analysis.
      • Kaushik: need to plan for allocation of the storage among the space tokens. As we evolve toward 1 PB, we need step-by-step information about procurements. See Stefan's proposal from SW week.
      • Shawn: We need a schedule for required storage.
      • Armen is watching critical storage areas at T1/T2's.
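As a rough cross-check of the ramp-up figures above, here is a minimal back-of-the-envelope sketch in Python using only the rounded numbers quoted in the discussion; the 78 kHS06 value is simply the midpoint of the quoted 76-80 range, not an official number.

<verbatim>
# Illustrative arithmetic only, using the rounded figures quoted in the minutes.
t2_disk_now_pb = 2.5       # ~current T2 disk, close to the 2009 pledge
t2_disk_end09_pb = 3.1     # end-of-2009 target
t2_disk_2010_pb = 7.0      # 2010 target

cpu_now_khs06 = 55.0       # currently installed CPU
cpu_target_khs06 = 61.0    # near-term target
cpu_end2010_khs06 = 78.0   # assumed midpoint of the quoted 76-80 kHS06 range

print("Disk growth needed during 2010:   x%.1f" % (t2_disk_2010_pb / t2_disk_end09_pb))
print("Disk growth vs. today:            x%.1f" % (t2_disk_2010_pb / t2_disk_now_pb))
print("CPU headroom to near-term target: %.0f%%" % (100 * (cpu_target_khs06 / cpu_now_khs06 - 1)))
print("CPU growth needed by end of 2010: x%.1f" % (cpu_end2010_khs06 / cpu_now_khs06))
</verbatim>

The output (roughly x2.3-x2.8 for disk versus x1.4 for CPU) illustrates why storage, not CPU, is the pacing item.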

Rich Carlson

  • October 9: leaving Internet2, transitioning to DOE's Advanced Scientific Computing Research (ASCR) office - SciDAC, ESnet, NERSC
  • Will be working on middleware and networking research
  • Jason will be handling some of the I2 - LHC outreach activities

Tier 3 (Rik, Doug)

  • last week:
    • A statement of support for T3gs sites from the facility is needed
    • T3 interviews in progress - half the institutions have responded
    • About 5 T3gs sites have been identified (LT, UTD, U of I, Wisc, Tufts, Hampton)
    • 25 interviews so far. About half are starting from scratch with no resources at all, though they have some funds (budgets around 30-60K). The other half are sites with a T3g of some kind, or with departmental clusters to build on.
    • T3gs sites will require a support structure that is not currently available, so they will rely on community support. Will cooperate with OSG on getting instructions together for some T3 documentation.
    • pcache for T3's.
    • lsm for T3's.
  • this week:
    • still working on interviews
    • Doug feels we'll need t2-t3 'affinities'
    • T3 usability should be a focus in the next phase of integration program

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • mc09 tasks are arriving and running great (no need for manufactured queue fillers); 75K jobs finished
    • Reprocessing is a mess - tasks aborted
    • Simulation queue filling is going fine
  • this week:
    • Very successful reprocessing validation at BNL - went quickly, 20K jobs/day. Caveats: some jobs are crashing. A Panda tweak now orders jobs by success probability (see the sketch after this list). 2% job failures in the repro task.
    • Today: James Catmore given green light for full reprocessing. Will be fully Panda brokered.
    • The US cloud may get the majority of the jobs, and they may finish quickly.
    • Getting lots of mc09 validation tasks. There are some problems still.
    • There is cosmic data being distributed - keep an eye on storage
    • Data is flowing to BNL overnight from CERN at GB/s rates, going to tape
    • Volume of cosmic data for T2s? There is derived data being generated. AOD datasets are defined and may be distributed to T2s as well, probably via subscriptions
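The "order jobs by success probability" tweak mentioned above is, in essence, a sort of the dispatch queue by an estimated success rate. The sketch below is purely illustrative - it is not Panda code, and the job records and the smoothed probability estimate are assumptions - but it shows the idea.

<verbatim>
# Illustrative sketch only: dispatch the jobs most likely to succeed first.
def success_probability(job):
    """Estimate P(success) from recent history of similar jobs (hypothetical fields)."""
    attempts = job["recent_attempts"]
    successes = job["recent_successes"]
    # Laplace-smoothed success rate so brand-new task types are not starved.
    return (successes + 1) / (attempts + 2)

def order_for_dispatch(queued_jobs):
    """Return jobs sorted so the most promising ones are dispatched first."""
    return sorted(queued_jobs, key=success_probability, reverse=True)

if __name__ == "__main__":
    queue = [
        {"id": 1, "recent_attempts": 50, "recent_successes": 49},
        {"id": 2, "recent_attempts": 40, "recent_successes": 10},
        {"id": 3, "recent_attempts": 0,  "recent_successes": 0},
    ]
    print([j["id"] for j in order_for_dispatch(queue)])  # -> [1, 3, 2]
</verbatim>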

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri is on vacation this week -- weekly summary will resume once he's back.
    [ ESD reprocessing -- began over this past weekend, but large numbers of jobs were failing due to a s/w problem.  Will require a new cache -- expected to be ready by ~9/21.
    Until then shifters don't need to report reprocessing-related errors. ]  See: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCReproFromESDAug2009
    [ Other production generally running very smoothly this past week -- most tasks ("queue fillers") have low error rates. ]
    1)  9/11: WISC_DATADISK and WISC_MCDISK transfer errors: "unable to create dir."  From the site admin: the problem is solved -- the ownership of the cns directory had been changed to another user (no idea why) and has been changed back.  It works now.
    2)  9/11: Failed jobs at MWT2_IU & IU_OSG with LFC replica errors -- due to some files getting deleted from the storage during a prod disk cleanup.
    3)  9/13: Access problem with the LFC at AGLT2 resolved -- modified firewall settings to open the needed ports.
    4)  9/14: New pilot versions:
    (39a)
    * The internal pilot time-out command has been updated with some minor fixes by Charles.
    * In case a job definition contains a non-null/empty cmtconfig value, the pilot will use it instead of the static schedconfig value for cmtconfig validation.
    * "hotdisk" (highest priority) and "bnlt0d1" have been added to the LFC sorting algorithm.
    * Batch system identification now uses QSUB_REQNAME as the BQS identifier (used at Lyon) as an alternative to BQSCLUSTER.
    * Sites using dCacheLFCSiteMover (dccplfc copytool) can now use direct access mode in user analysis jobs.
    * In case of a get failure for an input file, the file info is now added to the metadata for later diagnostics.
    (39b)
    * The pilot has been patched to correct an issue seen with analysis jobs using the AtlasProduction cache ($SITEROOT not set, leading to problems with the release installation verification).
    5)  9/14: A/C issue at AGLT2 -- from Bob: the air-conditioning seems to have stabilized now; machines will be turned back on shortly to pick up jobs.  Note that we are taking this opportunity to apply patched kernels to the compute nodes, so we will generally be running somewhat lighter loads over the next few days.
    6)  9/14 - 9/15: Transfer errors at AGLT2 -- [FILE_EXISTS] and "[DQ2] Transfer validation failed" -- understood, issue resolved.  https://gus.fzk.de/ws/ticket_info.php?ticket=51506
    7)  9/15: Test jobs submitted to the UCITB_EDGE7 site quickly end with the pilot status "Output late or N/A."  Suchandra is investigating.
    Follow-ups from earlier reports:
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    (ii)  SLC5 upgrades will most likely happen sometime during the month of September at most sites.
    (iii)  UTD-HEP is working on installing a new RAID controller in their fileserver.  Will use this opportunity to do a clean-up of old data in their storage.
    • Sunday reprocessing jobs were okay
    • Monday - initial tasks were subscribed only at BNL - 1000s of failed jobs
    • A new software cache is needed - expect it in the first part of next week at the earliest. The decision was to let some of the jobs continue to fail, to learn what the problems are.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    [ ESD reprocessing -- restarted mid-week once the new s/w cache became available.  Most (all?) jobs are running with the flag  "--ignoreunknown accepted,"
    which means errors like "Unknown Transform error" can be ignored.  Primary error seen so far is "Athena ran out of memory." ]
    See: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCReproFromESDAug2009
    [Other production generally running very smoothly this past week -- most tasks ("queue fillers") have low error rates. ]
    1)  9/17: upgrade of dCache pool nodes at MWT2_UC to SL5.3.
    2)  9/17: From Xin, s/w patch for SLC4 ==> 5 migration:
    The patch fixes problems encountered by analysis jobs, which run on SL5 platform and involve compilation in the job.
    Other production jobs and SL4 platform sites are fine without it, while having it is harmless as well.
    3)  9/20: Test jobs completed successfully at UCITB_EDGE7.
    4)  9/21: Intermittent transfer errors at MWT2 sites likely due to ongoing testing -- from Charles:
    We've been running some throughput/load tests from UC to IU, which are almost certainly the cause of these transfer failures.
    I'll terminate the test now and the errors ought to clear up.  https://gus.fzk.de/ws/ticket_info.php?ticket=51697
    5)  9/23: UTD-HEP completed hardware maintenance (new RAID controller on fileserver) -- test jobs finished successfully, site set back to 'online'.
    6)  9/23: All jobs were failing at AGLT2 with "Put error: lfc-mkdir failed."  Hiro was able to fix a problem with an ACL -- site set back to 'online'.
    Follow-ups from earlier reports:
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, 
    but still a few remaining issues.
    (ii)  SLC5 upgrades are ongoing at the sites during the month of September.

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • The CosmicsAnalysis job using DB access has been successfully tested at FZK using Frontier and at DESY-HH using Squid/Frontier (by Johannes). The job has been put into HammerCloud and is now being tested at DE_PANDA; no submission to US sites yet.
    • TAG selection job has been put into HammerCloud and is now being tested (in DE cloud).
    • We now have 3 new analysis shifters confirmed; still waiting to hear from one person. I'm planning a training for them in October.
    • Jim C. contacted us about the status of the large containers for the stress test. Kaushik reported that we have a total of ~500M events produced. Only the first batch has been replicated to Tier 2's, as I had validated them (step09.00000011.jetStream_medcut.recon.AOD.a84/ with 97.69M events and step09.00000011.jetStream_lowcut.recon.AOD.a84/ with 27.49M events). The others are at BNL, waiting to be merged and put into new containers. Depending on the time scale of the stress test this can be done in a few days, as Kaushik reported.
  • this meeting:

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • One problem w/ BNL DQ2 - the host rebooted, which led to a downtime; unsure why.
    • What about checksumming through SRM/Bestman? It can do checksums on the fly, but it would be better to distribute the checksum calculation - can it be done on the data server nodes? The lcg-util checksum has been tested by Wei and it works.
    • Action item: need to have meeting w/ Alex, Wei, Hiro
  • this meeting:
    • Site services - make sure the Tier 0 shares are dropped and the blacklist is removed; those functions have been moved to CERN's DQ2.
    • BNL viewer: http://ddmv01.usatlas.bnl.gov:20050/dq2log/
    • Migration of site services to BNL will happen only if and after the FTS checksum works well. Testing now with CERN's FTS 2.2. dCache sites are okay; the current Bestman sites are under discussion with Wei, Hiro and Simone, since this requires some Bestman development. Bestman does not store the checksum, so it must be calculated every time: roughly 2 seconds for a 500 MB file, while Hiro is finding 33 seconds for FTS to get the data. LBL has been asked to provide a hook for an external package to compute the checksum; expect a new version within a week (see the sketch below). Saul: it might work since the machine is powerful. The client controls whether a checksum is requested; the lcg-cp client asks for it.
    • DQ2 upgrade expected next week.
    • T3 LFCs to move to BNL; will make a plan and contact each site.
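For reference, computing an Adler32 checksum on the fly, as the Bestman data servers would have to, is essentially one streaming pass over the file, so the cost is dominated by reading the data (consistent with the ~2 seconds per 500 MB quoted above). A minimal sketch using Python's zlib follows; the file path and the "remote" catalog value are placeholders, not real files or checksums.

<verbatim>
# Sketch of an on-the-fly Adler32 calculation and comparison (placeholder paths/values).
import zlib

def adler32_of_file(path, chunk_size=4 * 1024 * 1024):
    """Stream the file through zlib.adler32 and return the 8-hex-digit value."""
    value = 1  # adler32 starting value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)

# Compare the locally computed checksum with the value recorded in the
# catalog / reported by the remote end (hypothetical value shown here).
local = adler32_of_file("/data/atlas/somefile.AOD.pool.root")
remote = "1a2b3c4d"
print("checksum match:", local == remote)
</verbatim>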

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • last week
  • this week
    • Fred managed to run jobs using just environment variables to control access to conditions data (see the sketch after this list).
    • Getting ready for full scale test at MWT2.
    • Wei: is there a way to set up two squid servers at a site? Documentation needs to be updated.
    • Upgraded the Frontier servers at BNL to pick up the cache-consistency mechanism provided by ATLAS. Any T2 that uses squid needs to update.
    • All sites need to upgrade - John will send an email.
    • Fred will follow-up on placement of files in these areas
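A minimal sketch of the environment-variable approach mentioned above, assuming the standard Frontier client variables (FRONTIER_SERVER, and optionally FRONTIER_LOG_LEVEL); the server and proxy URLs and the job command below are placeholders, not the actual BNL or Tier-2 endpoints.

<verbatim>
# Sketch: point a job at a Frontier server and local Squid purely via the environment.
import os
import subprocess

os.environ["FRONTIER_SERVER"] = (
    "(serverurl=http://frontier.example.org:8000/atlr)"   # placeholder Frontier launchpad
    "(proxyurl=http://squid.example.org:3128)"             # placeholder site Squid
)
os.environ["FRONTIER_LOG_LEVEL"] = "warning"  # assumption: optional verbosity setting

# The actual Athena job command is a placeholder here; the point is only that
# the conditions-data access is steered by the environment, not by site config files.
subprocess.call(["echo", "athena MyConditionsJob.py"])
</verbatim>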

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • last week(s):
    • perfSONAR release next week (RC4). Would like to have a quick deployment once it's available - within a week's time span.
    • Following week would be to focus on configuration
    • October time frame - hope to have enough experience to make recommendations
    • VC test BNL-UC - the problem went away, but the cause is unknown.
  • this week:
    • perfSONAR is due to be released this Friday (Sep 25 2009).
      • Goal is to have all USATLAS Tier-2 sites updated by October 1, 2009
      • By October 7 all sites should have configured full-mesh BWCTL and OWAMP testing among all Tier-2 sites and BNL
    • Need to create a mapping for all Tier-3s to provide an associated Tier-2 for testing purposes
    • Hiro will update the automated file transfer testing to allow Tier-2 to Tier-3 transfers (SRM-to-SRM) using the existing load testing framework
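The two bookkeeping pieces above - a Tier-3 to Tier-2 association for testing, and a full mesh of BWCTL/OWAMP test pairs - can be sketched as follows. The Tier-3 entries in the mapping are hypothetical placeholders, not the agreed assignments.

<verbatim>
# Sketch of the Tier-3 -> Tier-2 mapping and the full mesh of test pairs.
from itertools import combinations

TIER2_AND_T1 = ["BNL", "AGLT2", "MWT2", "NET2", "SWT2", "WT2"]

# Hypothetical Tier-3 associations, to be replaced once the real mapping exists.
T3_TO_T2 = {
    "ExampleT3-A": "AGLT2",
    "ExampleT3-B": "SWT2",
}

def full_mesh(sites):
    """All unordered site pairs that need scheduled BWCTL/OWAMP tests."""
    return list(combinations(sorted(sites), 2))

if __name__ == "__main__":
    for pair in full_mesh(TIER2_AND_T1):
        print("test %s <-> %s" % pair)
    for t3, t2 in T3_TO_T2.items():
        print("T3 %s tests against %s" % (t3, t2))
</verbatim>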

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • Update to OSG 1.2.1 release this past week for pyopenssl patch
  • this week:
    • Validation of upgrade to lcg-utils in wn-client, as well as curl. OSG 1.2.3 has this update (relevant for wn-client, client)
    • Tested with the UCITB_EDGE7 site. Validation complete.

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
  • this week:
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.

Site news and issues (all sites)

  • T1:
    • last week(s): fine apart from a glitch with site services. The order for 120 worker nodes is out: 8-core nodes with 2.66 GHz Nehalem X5550s and 24 GB of fast RAM, in R410s; up to 4 drives can go in these 1U servers. 1000 analysis cores hitting DDN storage. 160 GB SSD drives for $400.
    • this week: the 120 worker nodes will be delayed by 2 weeks. Pedro has completed the pcache installation and is evaluating it. HOTDISK: the area has been distributed over 30 Thors (small amounts). Large number of reprocessing and validation jobs. Have observed efficiency issues getting jobs into the running state even though everything was ready - investigating the origin of the latency; increased nqueue. Pilot rate issue: the possibility of letting multiple jobs run from the same pilot will be looked at, but it's not a quick fix. Torre notes it's an opportune time to look at this.

  • AGLT2:
    • last week: next procurements - hope to order by next week; waiting on Dell and Sun. Conversion to SL5: a few issues with the SL5 build (problems compiling Athena jobs) - looking at what's different. Hope to transition over the next few weeks and then deprecate SL4 support. Applied new kernels on all nodes.
    • this week: no update.

  • NET2:
    • last week(s): running smoothly except for 17 Athena crashes this morning - investigating. Getting ready for procurements at BU and HU: all space and infrastructure defined, getting bids, looking at blades.
    • this week: HOTDISK space token deployed; proceeding with procurement

  • MWT2:
    • last week(s): the circuit tests of last week caused dCache pools to crash on the receiving end - want to reinvestigate after SL5.3 using kdump. Power interruption.
    • this week: SL5.3 upgrade in progress. dCache load testing between IU and UC - some stability issues have gone away. 1 GB/s UC to BNL.

  • SWT2 (UTA):
    • last week: OSG 1.2 and SL5 upgrade in the first/second week of October. Planning for a storage upgrade - looking at 1 PB of disk in the next purchase. Upgrading storage at SWT2_UTA, deprecating Ibrix.
    • this week: deployed HOTDISK last night - working fine. focusing on procurement at UTA and CPB sites.

  • SWT2 (OU):
    • last week: waiting on storage purchase - Joel investigating. Upgrade after storage.
    • this week: all is well.

  • WT2:
    • last week(s): SL5 migration (RHEL5) - set up a test machine and queue. Setting up additional squid servers.
    • this week: site services got messed up by a yum update; a side effect was that the update removed the FTS checksum code, so a number of transfers missed checksum validation. ITB site testing for a new version of xrootd - do we need a Panda site? The SE uses a lot of functions. R410s arrived: 40-41 servers (195 total, other experiments included). The xrootd client developer will provide a newer version; will do a test. Power outage at the end of the month to address safety concerns.

Tier 3 data transfers (Doug, Kaushik)

  • last week
    • no change
  • this week

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding an Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ACD ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay (Michael); would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5
  • this week

Getting OIM registrations correct for WLCG installed pledged capacity

  • last week
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for Posix-like filesystems) - requires OSG 1.0.4 or later. Note: not important for WLCG, since it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not seen yet a draft report.
    • Double check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
  • this week



-- RobertGardner - 22 Sep 2009
