Minutes of the Facilities Integration Program meeting, June 3, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Pedro, Tom, Shawn, Tom, Nurcan, Mark, Kaushik, Saul, Rupam, Patrick, Horst, Xin, Wensheng, Wei, Bob, John, Michael, Rob, Fred, Sarah, Torre
  • Apologies: none

Integration program update (Rob, Michael)

  • IntegrationPhase9 - FY09Q3
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • In third month of FY09Q3:
      • Please update FabricUpgradeP9 with any CPU, storage, or infrastructure procurements during this phase.
    • Squid deployment at Tier 2 - what we need:
      • ATLAS to validate the method - requires a discussion within ADC operations and physics (Jim C). Also need a validation job, e.g. one running over cosmics.
      • Sites within the facility to deploy Squid - June 24.
    • Tentative date & location for next face-to-face workshop: August 19, 20 @ UC
      • Idea is to get some Tier 3 sites to join this meeting
  • Other remarks

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Production - we have plenty of jobs, but there have been problems with STEP09 task definitions: roughly 20% of the jobs in every task failed. The JF35 AOD sample is now about 30M events.
    • A pre-test container has been made, and is being replicated to Tier 2s. 10M events, 40K files.
    • Bamboo rate to the US cloud was too low (1800 jobs/hour). Adjusted back up to 3000 by Tadashi - not clear what caused the drop.
    • Condor-G is still having trouble keeping queues filled; meeting later this afternoon. Shawn reports that AGLT2 filled up quickly after the shift to Oracle.
    • Target is still 100M events for STEP09. Plan is to switch to the new tarball from Borut, starting with new evgens.
    • Saul notes available capacity at HU is not getting used. Increase queue depth?
  • this week:
    • Lots of activity in parallel, involving all parts of our systems: replication, production, and analysis.
    • US sites appear to be performing well.
    • SLAC is deleting datasets with wrong checksums. Not clear why the transfers are fast.
    • Saw excellent rates yesterday across all T2's, to all sites.
    • Status page for STEP09 - see http://atladcops.cern.ch:8000/drmon/ftmon.html and http://panda.cern.ch:25980/server/pandamon/query?mode=listFunctionalTests
    • STEP09 user dataset generation for the US is going well - 50M events, but there were problems with event numbers; started yesterday with a new sample, so we're behind.
    • Lack of autopilots - Torre: a cron problem on 3 machines, being fixed.
    • Would like to generate 100M events. Should be able to do ~10M/day. Trying to get this done before June 12.
    • Subscriptions that Alexei made for the 10M pre-test container have all vanished.
    • HC continues to run - 19K jobs in US cloud w/ 90% efficiency.
    • Charles: lots of failures due to high memory footprints. Mark has seen this at a number of sites.
    • Merge jobs - are we finished? There are thousands of tasks in the backlog.
    • When does reprocessing start in the US? Should start now.

Shifters report (Mark)

  • Reference
  • last meeting:
    • Procedure for test jobs at sites returning from being offline - new change: see the shifter's wiki (updates go to CERN rather than BNL).
    • Oracle conversion for autopilot.
    • Sites drained over the weekend - the jobs were fast. Increases to nqueue didn't help; Torre reports updates in Oracle were not forwarded to MySQL correctly.
    • Additional operational points from Mark:
      1)  Minor pilot update -- v37d.  See announcement from Paul for details (5/20).
      2)  ANALY_NET2 off-line for maintenance at the end of last week -- successful test jobs -- site returned to online.
      3)  Was the queue depth at BNL modified over the past couple of weeks?  Xin noticed better (more efficient) use of the resources there as of 5/20 -- didn't see a follow-up.
      4)  USERDISK ==> SCRATCHDISK -- timing?  Need xroot modification?
      5)  VOMS service migration to new hardware at BNL on 5/21.
      6)  Obsolete queues in Panda -- Alessandra Forti suggested that we remove them or at least mark them as "obsolete".  Alden did some clean-up in this regard.
      7)  UTA_SWT2 is now back online.  Storage system (ibrix) was very full, causing failures of atlas release installation jobs.  Storage was cleaned, and recent installs are succeeding.  Test athena jobs finished, production resumed over the weekend.  (Thanks to Xin for all the help with the install problem.)
      8)  Test job submission procedure has been modified slightly to reflect the migration to Oracle at CERN.  See:
      9)  From Torre:  autopilots now using Oracle at CERN (last part of the mysql ==> Oracle migration).  Seems to be working well, but of course please report any issues (5/26).
      10)  Sites draining over the weekend.  Due to an insufficient number of pilots?  This may have gotten caught up in the mysql ==> Oracle migration -- Torre noticed that changes to "queue depth" were not getting forwarded between the systems.  No longer an issue as of 9).
      11)  Huge number of failed jobs from STEP09 production -- known issue.
      12)  Over this past weekend, a dCache server outage at MWT2 -- site was set off-line -- successful test jobs once the dCache issue was resolved -- production resumed (5/23).
  • this meeting:
    • dq2 server errors were causing a lot of failures at sites - jobs were failing at the registration step. The DQ2 tracker queries were the culprit.
    • Lots of lost heartbeats - mostly understood; they seem to be tapering off now. Happened at all sites at the same date and time, 6/1 ~8:40am. Related to condor-remove? Could we dig some of these up at the sites?
    • Rupam is starting to take shifts.
    • "setupper" failures are Panda server trying to register files, losing contact.
      1)  Panda db outage at CERN early Monday a.m. (6/1) -- many "lost heartbeat" failed jobs as a result.
      2)  Stage-in errors at AGLT2 caused by a mis-configuration in schedconfig.  Torre fixed the entry, production at the site resumed (6/1).
      3)  Pilot updates from Paul over the past week:  37e (5/29), 37f (6/1).  Details in announcements from Paul.
      4)  Issue with file staging errors at IU_OSG resolved -- site set back 'online' -- RT 12995 (6/1).
      5)  dCache issues at BNL Friday evening into Saturday (5/29 ==> 5/30).  Various postings to Usatlas-ddm-l discussing the problems.
      6)  Issue of STEP09 data having zero adler32 checksums resolved (thanks to Wei for noticing this) (5/31).
      7)  Early a.m. 6/3 -- one of the dq2 central catalog hosts at CERN had to be re-started.  Large number of DDM errors during the outage.
      8)  Transfer backlog early on 6/2 due to an issue with the dq2 upgrade at BNL that resulted in an incompatible version of the GFAL client software.  Fixed by Hiro.  https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/3861

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • The DB access job has been working with pathena at US sites since 5/21. First, the Panda server was fixed by Tadashi. Second, Sasha Vanyashin found that the AtlasProduction- installation was not configured properly for database access: an environment variable, ATLAS_CONDDB, was wrong at the sites. This was fixed by Xin, who will scan all installations at all US sites. AGLT2, BNL, and MWT2 are successful. SWT2 and SLAC have a problem reading the input files; Wei, Patrick, and Paul are following up - Paul suspects a file system problem. NET2 had a network card problem causing corrupted input files (bad checksums); Saul and John worked on a cleanup and switched from the 10G to the 1G interface. This job has been put into HammerCloud, but a few issues remain to be solved. I started a manual stress test with this job at AGLT2, BNL, and MWT2 on Friday; results to be looked at.
    • Status of pre-stress tests by HammerCloud for STEP09:
      • First test was run on 5/20. Saul summarized the results: I've had a look at the results from HC 363, 364, 365 and 366
        • http://atlas.bu.edu/hammercloud/363/ (susy validation job)
        • http://atlas.bu.edu/hammercloud/364/ (graviton job)
        • http://atlas.bu.edu/hammercloud/365/ (muon trigger job)
        • http://atlas.bu.edu/hammercloud/366/ (single top job)
          The situation is pretty easy to summarize:
          a) SWT2 has a checksum problem caught by the pilot
          b) NET2 has a checksum problem caused by a bad network card
          c) AGLT2 has their "Get error: lcg-cp timed out after 1800 seconds" problem
          I'm sure that Patrick knows about a); b) has been fixed but not in time to really be in the 2-day test. Shawn and Bob are working on c). These are the only three problems in the whole system! In particular, MWT2 and WT2 have essentially 100% success rates. There are no problems which are correlated to the particular HC tests 363-366. The same errors appear for all of these tests.
      • A couple of miscellaneous points: the Panda "currentPriority" number is sometimes negative now; I'm not sure if that's intentional. Many of the jobs in all tests are killed by the submitter; I'm assuming this isn't a problem, but I don't quite understand why so many are getting killed. Patrick looked at the SWT2 errors in more detail: there is not really a problem with checksums at SWT2 - there are known reasons why these get misreported. He reports errors with not finding the jobOption file at run time (24 failures), errors with segmentation violations (possibly corrupt input files, 7 failures), and 14 failures with input file problems and segmentation violations (9 failures) with "ERROR poolToObject: Could not get object for token" and "ERROR createObj PoolToDataObject() failed".
      • Second pre-stress test was submitted yesterday, 5/26. Results to be looked at. From Dan:
        I have scheduled a second pretest to start in 15 minutes. Changes are:
           - now using mc08.*merge.AOD.e*_s*_r6*tid* for all tests
           - NG cloud omitted (due to unzipped logfile problem).
           - LCG backend splits so that subjobs have a maximum of 30GB input
           - anti-blacklisting is on a (test,site) tuple (i.e. if a test,site fails >70%, resubmission at the (test,site) is disabled).
           - resubmission is using Graeme's load increasing algorithm (sketched in Python just below):
              - Previously, the bulk of jobs was resubmitted if %(completed or failed) was > 50%.
              - Now, when the percentage of jobs in the most recently submitted bulk drops below 30%, the bulk is resubmitted.
              - i.e. submit 11 jobs. All are in 'submitted' state. When only 3 are left in 'submitted' state, resubmit another 11.
              - This means HC will grow the number of jobs at each site to the actual capacity at the site.
              - changes to the HC storage so that we shouldn't run out of disk.
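        A minimal, hypothetical Python sketch of the load-increasing rule described above; count_submitted and submit_bulk are placeholder callables, not real HammerCloud code:

            import time

            BULK_SIZE = 11             # e.g. 11 jobs per bulk, as in the example above
            RESUBMIT_FRACTION = 0.30   # resubmit when <30% of the last bulk is still 'submitted'

            def run_load_increasing(site, count_submitted, submit_bulk, max_bulks=100, poll_secs=60):
                """Grow the load at `site`: send a bulk, then send another whenever fewer than
                30% of the most recently submitted bulk remain in the 'submitted' state."""
                submit_bulk(site, BULK_SIZE)
                bulks_sent = 1
                while bulks_sent < max_bulks:
                    time.sleep(poll_secs)
                    still_submitted = count_submitted(site)   # jobs from the last bulk not yet picked up
                    if still_submitted < RESUBMIT_FRACTION * BULK_SIZE:
                        submit_bulk(site, BULK_SIZE)          # site is absorbing jobs: grow the load
                        bulks_sent += 1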
    • Next week we'll start with the four job types at all sites - next Tuesday.
  • this meeting:
    • FacilityWGAPMinutesJun2
    • DB access jobs do not work at SLAC and SWT2 yet. The Tools/PyUtils-00-06-17 tag that Sebastian Binet provided to correctly process files with 'root://' and 'dcap://' URLs does not work there; he reported that it is not meant to be used for releases earlier than 15.1.0.
    • First STEP09 analysis stress test jobs have been started on 6/2. See a summary (number of jobs, efficiency) from slides in today's ADC dev meeting: http://indico.cern.ch/getFile.py/access?contribId=5&resId=0&materialId=slides&confId=60214
    • A user analysis challenge is being organized by J. Cochran for June 12th. A 10M-event container dataset has been made for the pre-test: step09.00000010.jetStream_pretest.recon.AOD.a84/. This has not been replicated to Tier 2s yet - the subscriptions disappeared, as Alexei reported, and he has resubscribed.
    • Expect many more to come in over the next few days.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • BNL site services problem over the weekend - too many callbacks to central services. Callbacks to dashboard were okay. Required manual intervention, now working.
    • IU problem yesterday - ongoing but intermittent today, being investigated.
    • Otherwise all okay at sites.
    • There were some missing datasets at AGLT2 - resubscribing everything on MCDISK to fill in holes.
    • Working on monitoring site services remotely.
    • Excluding upgrades, what is the administrative load for running DQ2 site services? At AGLT2 - not much of a load. At MWT2 - when we're running smoothly it's roughly zero. Opinion shared by Wei and John. What about the LFC? The real problem is maintaining consistency; moving the LFC to a central location wouldn't help.
  • this meeting:
    • All sites okay
    • Adler32 - all sites should update w/ Hiro's code. Should be a plug-in.
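      Not Hiro's plug-in, but a minimal Python sketch of the standard adler32 computation, handy for spot-checking a local file against the catalog value:

        import zlib

        def adler32_of_file(path, chunk_size=1024 * 1024):
            """Return the adler32 checksum of a file as the zero-padded
            8-character hex string that DDM tools typically report."""
            checksum = 1  # adler32 seed value
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    checksum = zlib.adler32(chunk, checksum)
            return "%08x" % (checksum & 0xFFFFFFFF)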

LFC @ Tier 3

  • last meeting:
    • Concerns about maintaining catalogs at Tier 3's.
    • Request from Jim C to move data optionally to Tier 3's (eg. from analysis jobs running at Tier 2s, Tier 1).
    • Kaushik, Torre discussing having a single LFC for Tier 3's - then Panda can provide an option to output data at Tier 3, using Panda mover.
    • Doug: what about subscriptions? Run a single site services for Tier 3.
    • So then Tier 3's would need to setup an SRM storage element.
    • Doug, Kaushik, Torre will discuss this offline.
    • Will need an LFC at BNL.
  • this meeting
    • Torre, Kaushik, and Doug met last week; the plan is to use Panda jobs to push data to Tier 3s. A Tier 3 would need an SRM and be in the ToA. A new Panda client would serve data rather than the full subscription model.
    • Kaushik is preparing draft for ATLAS review.
    • Does need an LFC someplace. This will be provided by BNL.
    • Few weeks of discussion to follow. Will take a month of development. Client is not too difficult.

Tier 3 issues (Doug)

  • last meeting
    • Tier 3 meeting at ANL last week - "1/2 successful".
    • Lots of discussion about VMs.
    • Opportunistic computing within a campus. How to use it efficiently?
    • Would like a data subscription method for Tier 3s.
    • Only 2 sites attending in person did not already have Tier 3s - having trouble contacting the community.
    • Will be scheduling a concurrent "dq2-get test" this week.
    • Pushing xrootd as baseline in the Tier 3s.
    • What fraction of sites will be GS? Discouraging this model.
  • this meeting
    • Rik and Doug have been doing dq2-get performance testing on a dataset with 158 files, O(100 GB) (see the timing sketch at the end of this list).
    • Tested from 3 places (ANL, FNAL, Duke) against most Tier 2's.
    • Rik reported on results to the stress-test.
    • Have seen copy errors of ~1%; also seen checksums not agreeing.
    • Lots of work on VMs. Looking like a good solution for Tier 3. Virtualize headnode systems.
    • No news on BNL srm-xrootd.
    • Would like to get analysis jobs similar to HC to use for Tier 3 validation (local submission).
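      A rough sketch of the kind of timing wrapper such a dq2-get test might use (assumes the DQ2 client is already set up in the environment and that dq2-get accepts a dataset name as its argument; the dataset name in the example is only a placeholder):

        import subprocess
        import time

        def time_dq2_get(dataset, dest_dir="."):
            """Run dq2-get for one dataset and report wall-clock time and exit status."""
            start = time.time()
            result = subprocess.run(["dq2-get", dataset], cwd=dest_dir)
            elapsed = time.time() - start
            print("%s: exit code %d, %.1f s" % (dataset, result.returncode, elapsed))
            return result.returncode, elapsed

        # Example with a placeholder dataset name:
        # time_dq2_get("step09.00000010.jetStream_pretest.recon.AOD.a84")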

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • SWT2, WT2, and NET2 have cleaned up ESDs.
    • AGLT2 (DQ2 reports 15 TB, SRM reports > 50 TB) and MWT2 (DQ2 reports 85 TB) - these should be deleted (why didn't central operations delete them?).
    • Need to send email to Kaushik with report for space available on DATADISK for STEP09.
    • Obsolete dataset - Charles has new script available - has run at UC and IU. Working on fixes for AGLT2, and adding deletion for sub- and dis-datasets. Can also be used to prune even after central deletion.
    • For analysis stress testing:
      • MCDISK - need 20 TB for stress-testing datasets, but we will also have merged datasets of unknown size.
      • USERDISK or SCRATCHDISK? Migrate to SCRATCHDISK after STEP09? Stay w/ USERDISK until after STEP09. (Also gives us time for the new xrootd release.)
  • this week:
    • Adler32 in the site services (SS)

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  • last week:
    • Need to check status of current perfsonar boxes. Next major milestone will be to deploy new Knoppix disk. Will have new driver for on-board NIC, and other work-arounds.
    • Milestones - production has been catching up. BNL has been sending >800 MB/s for the last 24 hours. 10G sites are to demonstrate 400 MB/s consistently.
    • What about Tier-3 throughput? Welcome to join Tuesday meetings now.
  • this week:
    • Google summer of code gridftp meeting.
    • Local throughput is approaching our milestones. Primary topic next week: reaching the milestone.

Site news and issues (all sites)

  • T1:
    • last week: the dedicated circuit will be activated tomorrow for SLAC. UTA - more effort will be needed; would like to call a meeting to arrange the last mile to SWT2_UTA (ESnet, I2, LEARN, ...). Working on storage services and access efficiency out of dCache using pnfsid - Hiro and Wensheng, working with Paul. Want to set up a test queue to test the modified pilot; expect scalability to be addressed. Evaluating technologies for the next storage purchase of ~1.5 PB. Vendors are hesitant to move to 1.5 or 2 TB drives. Will use Nexan + FC to the Thors to extend storage space; flexible and cost-effective.
    • this week: making progress with investigations of dCache scalability. pnfsid-based file addressing, implemented by Hiro with help from Wensheng and Paul, has been in production since Friday; it relieves load on the pnfs server in terms of metadata lookups. Seeing load creep up with regard to staging. The system seems very stable. 5 gridftp doors - will add more, since we have 30 GB/s. Disk procurement - will come up with 1.5 PB of disk by August. Will put FC behind the Thors - can add 2x32 TB of usable disk without adding management nodes, using Nexan technology. The Oracle implementation of Chimera has a severe bug and is not as well-tested as Postgres - a trigger is missing for Oracle and they asked us to port it; they obviously are not running the same test harness. For this reason we are still using pnfs.

  • AGLT2:
    • last week: Working on consistency within LFC and dCache components, and DQ2. Re-subscribing ~900 datasets to MCDISK. Lots of space available for STEP09. Fixing one residual problem with a gridftp server. Running 13 10G gridftp nodes.
    • this week: running pretty well recently. Working on consistency checking; hit a schema problem with Chimera+Postgres - Tigran responded about the schema error. Hope to have everything checked by the end of the day.

  • NET2:
    • last week(s): Running steadily but below capacity. The only real problem was a bad network card - replaced the 10G card with a 1G card, which stopped the corruption. Have a replacement. 300 MB/s transfer rate observed while replacing a dataset - the extra bandwidth is helping! HU - all going well.
    • this week: running smoothly. Using the 1G network card - have yet to get the Myricom card replaced. HU jobs failing, investigating. Have not run HC yet, since the queue is filled with user jobs. Should we be deleting STEP09 data which has adler32=0? Are the STEP09 data real data? What to do with this data? Central deletion. Adding Tufts to NET2.

  • MWT2:
    • last week(s): gplazma auth timeouts - have a configuration now that works. Some hardware instability with the new Dell servers. Problems with files not getting registered.
    • this week: converted 3 storage servers to 64-bit Linux. Stability improved; added 48 TB of disk. Running solid. High-memory jobs took out some nodes.

  • SWT2 (UTA):
    • last week: Mark: the main thing over the past week was that the old cluster had a full file system. Cleared up; weekend installs succeeded, now back in production.
    • this week: the analysis site for CPB is offline - not sure why. The SRM became unreachable last week - investigating with LBL. Issue with high-memory jobs - they disrupted PBS. Analysis jobs - direct reads through xrootd for input data cause seg faults; if the job is moved to read directly, it succeeds. Some kind of xrootd access problem. Cleaning space.

  • SWT2 (OU):
    • last week:
    • this week: all okay. Waiting for additional storage.

  • WT2:
    • last week: all okay. An earlier problem with SRM resolved. Working on next procurement.
    • this week: all okay. Investigating.

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both the Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates, as he currently chairs the Friday Frontier/Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at Tier 2s as developments are made. - rwg
  • last meeting(s):
    • Fred reports problems using Squid with release 15 - cannot access the Frontier server at BNL. Release 14 works. Has contacted Richard Hawking and is working on a solution.
    • Has tested direct Oracle access with help from Carlos, but found no significant performance improvement in spite of the reduced number of queries.
    • Sarah used the instructions for Tier 2 successfully.
  • this week:

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
    • Transfers of pacballs from CERN to BNL are now faster.
    • Increase priority of pacball transfers in pandamover - will consult Tadashi.
    • Move all sites to use new installation system. Need to add independent install queues - need Alden's help. Have 11 sites migrated already. Missing UTA, OU, Illinois, UTD.
    • MWT2_IU path different between gatekeeper and worker-node. Sarah will fix this so fall-back works.
    • Torre: do we need special queues? Why not use the conventional queues when a site is in test mode? Xin: send email to Kaushik, Torre, Tadashi.
  • this meeting:
    • Transfer of new pacball datasets to BNL is much improved. Tom/UK will handle the subscriptions.
    • Tadashi changed Panda mover to make install jobs highest priority, so releases should get installed very quickly.
    • Six new installation sites added to Panda - some configs need changes.

Tier 3 coordination plans (Doug, Jim C)

  • last report:
    • Upcoming meeting at ANL
    • Sent a survey to computing-contacts
    • A Tier 3 support list is being set up.
    • Need an RT queue for Tier 3
  • this report:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server (a small on-the-fly checksum sketch follows below).
    • Need to communicate w/ CERN regarding how this will work with FTS.
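      Not Alex's xrootd code, but a small Python illustration of what "on the fly" means here - the checksum is accumulated chunk by chunk as the data passes through, so no second read of the file is needed:

        import zlib

        class StreamingAdler32:
            """Accumulate an adler32 checksum over data chunks as they stream by."""
            def __init__(self):
                self._value = 1  # adler32 seed value

            def update(self, chunk):
                self._value = zlib.adler32(chunk, self._value)

            def hexdigest(self):
                return "%08x" % (self._value & 0xFFFFFFFF)

        # Usage: call update() on each buffer as the transfer writes it, then compare
        # hexdigest() with the catalog value as soon as the last chunk arrives.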
  • this week

Local Site Mover


  • last week
    • STEP09 - Jim was discussing moving it to June 18, probably outside STEP09 proper. Note ATLAS central operations is in the process of moving data into the regions, and so there will be data available then.
    • SVN checkout to local disk - you need subversion rpm installed locally.
    • Next Tier 2/3 workshop? Consider late August or mid-September. Hands-on component? Doug wants to write up a cookbook. Consider August 17.
  • this week

-- RobertGardner - 01 Jun 2009
