r5 - 01 Jun 2009 - 10:25:50 - RobertGardner



Minutes of the Facilities Integration Program meeting, May 27, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Michael, Saul, Rich, Doug, Charles, Rob, Torre, Horst, Shawn, Sarah, Wei, Armen, Nurcan, Kaushik, Mark, Karthik, Pedro, Fred
  • Apologies: none

Integration program update (Rob, Michael)

  • IntegrationPhase9 - FY09Q3
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Other remarks
    • Most important - completing preparations for STEP09 and the analysis stress tests.
    • Data replication for STEP09 announced this morning - same functionality as before, but rates small for now.
    • We have nearly completed network infrastructure between the Tier 1 and most Tier 2's. UTA still a concern - no easy path for a 10G link.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Getting back into full production after several site upgrades
    • US Cloud is the last cloud to move to the CERN Oracle Panda server, so we expect a few days of issues related to monitoring; the archive DB is still in MySQL. Will need to change the port manually if there are problems.
    • dCache upgrade at BNL yesterday - everything came back normally. Need to keep an eye on Panda mover.
    • Need to get IU and AGLT2 online asap.
    • Large number of production requests - including STEP09 production; expect a huge backlog.
      • 5 independent runs - 2 runs per Tier 2. (10 pb-1 per run, 100 M)
      • Same estimate as before: 15-20 TB
    • Concerns about central-cleanup operations. Expect 20-70 TB of data to be removed from DATADISK. Cleaning of obsolete and aborted datasets at MCDISK.
    • Will continue with Pandamover rather than DQ2.
  • this week:
    • Production - we have plenty of jobs, but there have been problems with STEP09 task definitions. Roughly 20% of jobs from every task failed. The JF35 AOD sample is now about 30M events.
    • A pre-test container has been made, and is being replicated to Tier 2s. 10M events, 40K files.
    • Bamboo rate to the US cloud was too small (1800 jobs/hour). Tadashi adjusted it back up to 3000 - we don't know what happened.
    • Condor-G still having trouble keeping queues filled; meeting later this afternoon. Shawn reports that after shift to Oracle AGLT2 filled up quickly.
    • Target is still 100 M for STEP09. Switch to new tarball from Borut, starting with new evgens. Want to switch to this.
    • Saul notes available capacity at HU is not getting used. Increase queue depth?
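For scale, the STEP09 production numbers quoted earlier (100 M events over 5 runs, 15-20 TB) imply a per-event size that can be sanity-checked with simple arithmetic. The even split per run is an assumption; the per-event figure is derived here, not quoted in the minutes:

```python
# Back-of-envelope check of the STEP09 sample sizing quoted in the minutes
# (5 runs, 100 M events total, 15-20 TB). The even split per run is an
# assumption; the per-event size is derived, not quoted.
runs = 5
events_per_run = 20_000_000
total_events = runs * events_per_run          # 100 M events
total_tb_low, total_tb_high = 15, 20

kb_per_event_low = total_tb_low * 1e9 / total_events    # TB -> kB
kb_per_event_high = total_tb_high * 1e9 / total_events

print(f"{total_events:,} events -> {kb_per_event_low:.0f}-{kb_per_event_high:.0f} kB/event")
```

Roughly 150-200 kB/event is a plausible AOD-scale size, so the 15-20 TB estimate looks internally consistent.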

Shifters report (Mark)

  • Reference
  • last meeting:
    • Large number of failed jobs at UTD - failures due to a missing release. Long installation time required - investigating. UTA_SWT2 - might have a problem with Ibrix.
    • Several updates to pilot, v36a.
    • RSV not respecting OIM downtime settings.
    • Panda monitor migration nearly complete.
    • Intermittent problem setting sites offline/online.
    • MWT2_IU test job submission.
    • STEP09 jobs - tasks created this week - might expect
    • SRM upgrade issue at SWT2 - updated later
  • this meeting:
    • Procedure for test jobs at sites after being offline - new change: see shifter's wiki (updates to CERN rather than BNL).
    • Oracle conversion for autopilot.
    • Site drainage over the weekend - there were fast jobs. There were increases to nqueue that didn't help; Torre reports updates at Oracle were not forwarded to MySQL correctly.
    • Additional operational points from Mark:
      1)  Minor pilot update -- v37d.  See announcement from Paul for details (5/20).
      2)  ANALY_NET2 off-line for maintenance at the end of last week -- successful test jobs -- site returned to online.
      3)  Was the queue depth at BNL modified over the past couple of weeks?  Xin noticed better (more efficient) use of the resources there as of 5/20 -- didn't see a follow-up.
      4)  USERDISK ==> SCRATCHDISK -- timing?  Need xroot modification?
      5)  VOMS service migration to new hardware at BNL on 5/21.
      6)  Obsolete queues in Panda -- Alessandra Forti suggested that we remove them, or at least mark them as "obsolete".  Alden did some clean-up in this regard.
      7)  UTA_SWT2 is now back online.  Storage system (ibrix) was very full, causing failures of atlas release installation jobs.  Storage was cleaned, and recent installs are succeeding.  Test athena jobs finished, production resumed over the weekend.  (Thanks to Xin for all the help with the install problem.)
      8)  Test job submission procedure has been modified slightly to reflect the migration to Oracle at CERN.  See:
      9)  From Torre:  autopilots now using Oracle at CERN (last part of the mysql ==> Oracle migration).  Seems to be working well, but of course please report any issues (5/26).
      10)  Sites draining over the weekend.  Due to an insufficient number of pilots?  This may have gotten caught up in the mysql ==> Oracle migration -- Torre noticed that changes to "queue depth" were not getting forwarded between the systems.  No longer an issue as of 9).
      11)  Huge number of failed jobs from STEP09 production -- known issue.
      12)  Over this past weekend, a dCache server outage at MWT2 -- site was set off-line -- successful test jobs once the dCache issue was resolved -- production resumed (5/23).

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • FacilityWGAPMinutesMay12
    • Status of test jobs:
      • TAG selection job was submitted to SWT2 yesterday, after understanding the xrootd-related problems last week. There are still failures; they need to be looked at today via the stress-test link above.
      • I was trying to set up a job that runs on AOD/DPD and requires DB access. Contacted Sasha Vanyashin; he said such a job exists and referred me to Katharina Fiekas. I asked for instructions so it can be tested and put into HammerCloud.
    • Status of integrating job types into HammerCloud: SUSYValidation, D3PD-making, and TAG selection jobs are now integrated, see: https://twiki.cern.ch/twiki/bin/view/Atlas/StressTestJobs. I requested tests last week; due to changes in the HammerCloud submission mechanism and the dq2 catalog problems last week, tests have been submitted this week:
    • Sites please look at the failures. Mostly input file problems, checksum failures (NET2), not finding input files (AGLT2).
    • We will need to look at the metrics and compare site performances. Any site admin is interested in doing this comparison? Please help.
    • Saul will look into checksum errors; also will help summarizing failures
  • this meeting:
    • DB access job has been working with pathena at US sites since 5/21. First, the Panda server was fixed by Tadashi. Second, Sasha Vanyashin found that the AtlasProduction- installation was not configured properly for database access: an environment variable, ATLAS_CONDDB, was wrong at the sites. This was fixed by Xin, who was going to scan all installations at all US sites. AGLT2, BNL, MWT2 are successful. SWT2 and SLAC have a problem reading the input files; Wei, Patrick, Paul are following up, and Paul suspects a file system problem. NET2 had a network card problem causing corrupted input files (bad checksums); Saul and John worked on a cleanup and switched from the 10G to the 1G interface. This job has been put into HammerCloud, although a few issues remain to be solved. I started a manual stress test with this job at AGLT2, BNL, MWT2 on Friday; results to be looked at.
    • Status of pre-stress tests by HammerCloud for STEP09:
      • First test was run on 5/20. Saul summarized the results: I've had a look at the results from HC 363, 364, 365 and 366
        • http://atlas.bu.edu/hammercloud/363/ (susy validation job)
        • http://atlas.bu.edu/hammercloud/364/ (graviton job)
        • http://atlas.bu.edu/hammercloud/365/ (muon trigger job)
        • http://atlas.bu.edu/hammercloud/366/ (single top job)
          The situation is pretty easy to summarize:
          a) SWT2 has a checksum problem caught by the pilot
          b) NET2 has a checksum problem caused by a bad network card
          c) AGLT2 has their "Get error: lcg-cp timed out after 1800 seconds" problem
          I'm sure that Patrick knows about a), b) has been fixed but not in
          time to really be in the 2 day test.  Shawn and Bob are working on c). These are the only three problems in the whole system!  In particular,
          MWT2 and WT2 have essentially 100% success rates.  There are no
          problems which are correlated to the particular HC test 363-366.  The
          same errors appear for all of these tests.
      • A couple of miscellaneous points: the Panda "currentPriority" number is sometimes negative now; I'm not sure if that's intentional. Many of these jobs in all tests are killed by the submitter. I'm assuming that this isn't a problem, but I don't quite understand why so many are getting killed.
      • Patrick looked at SWT2 errors in more detail: there is not really a problem with checksums at SWT2; there are known reasons why these get misreported. He reports errors with not finding the jobOption file at run time (24 failures), segmentation violations from possibly corrupt input files (7 failures), input file problems (14 failures), and segmentation violations (9 failures) with "ERROR poolToObject: Could not get object for token" and "ERROR createObj PoolToDataObject() failed".
      • Second pre-stress test was submitted yesterday, 5/26. Results to be looked at. From Dan:
        I have scheduled a second pretest to start in 15 minutes. Changes are:                                                  
           - now using mc08.*merge.AOD.e*_s*_r6*tid* for all tests                                                              
           - NG cloud omitted (due to unzipped logfile problem).                                                                
           - LCG backend splits so that subjobs have maximum of 30GB input                                                      
           - anti-blacklisting is on a (test,site) tuple (i.e. if a test,site                                                   
        fails >70%, resubmission at the (test,site) is disabled).                                                               
           - resubmission is using Graeme's load increasing algorithm:                                                          
              - Previously, the bulk of jobs was resubmitted if %(completed or                                                  
        failed) was > 50%.                                                                                                      
              - Now, when the percentage of jobs in the most recently                                                           
        submitted bulk drops below 30%, the bulk is resubmitted.                                                                
              - i.e. submit 11 jobs. All are in 'submitted' state. When only 3                                                  
        are left in 'submitted' state, resubmit another 11.                                                                     
              - This means HC will grow the number of jobs at each site to the                                                  
        actual capacity at the site.                                                                                            
              - changes to the HC storage so that we shouldn't run out of disk. 
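The resubmission and anti-blacklisting rules in Dan's note can be sketched as follows; this is a minimal illustration of the described logic, not HammerCloud's actual code:

```python
# Minimal sketch of the rules in Dan's note (not HammerCloud's actual code):
# - load-increasing resubmission: submit a bulk of N jobs; when fewer than
#   30% of the most recent bulk are still 'submitted', submit another bulk;
# - anti-blacklisting: stop resubmitting a (test, site) pair once it has
#   failed more than 70% of its finished jobs.
BULK_SIZE = 11
RESUBMIT_THRESHOLD = 0.30
BLACKLIST_THRESHOLD = 0.70

def should_resubmit(last_bulk_states):
    """True when the fraction of the last bulk still 'submitted' drops below 30%."""
    still_submitted = sum(1 for s in last_bulk_states if s == "submitted")
    return still_submitted / len(last_bulk_states) < RESUBMIT_THRESHOLD

def is_blacklisted(failed, completed):
    """True when a (test, site) pair has failed more than 70% of finished jobs."""
    finished = failed + completed
    return finished > 0 and failed / finished > BLACKLIST_THRESHOLD

# Example from the note: with 3 of 11 jobs left in 'submitted', resubmit.
print(should_resubmit(["submitted"] * 3 + ["running"] * 8))   # True
print(is_blacklisted(failed=8, completed=2))                  # True (80% > 70%)
```

Because each bulk is only resubmitted as the previous one drains, the number of jobs at a site grows toward that site's actual capacity, as the note says.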
  • Next week we'll start with the four job types at all sites - next Tuesday.
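As a starting point for the site-performance comparison requested above, a hypothetical sketch of tallying failures by site and cause; the records below are illustrative, not real HammerCloud output:

```python
from collections import Counter, defaultdict

# Hypothetical sketch: tally job failures by site and cause, as an admin
# might when comparing sites. The records below are illustrative only.
failures = [
    ("NET2", "checksum mismatch"),
    ("NET2", "checksum mismatch"),
    ("AGLT2", "input file not found"),
    ("AGLT2", "lcg-cp timed out"),
    ("SWT2", "checksum mismatch"),
]

by_site = defaultdict(Counter)
for site, cause in failures:
    by_site[site][cause] += 1

for site in sorted(by_site):
    top_cause, n = by_site[site].most_common(1)[0]
    total = sum(by_site[site].values())
    print(f"{site}: {total} failures, top cause: {top_cause} ({n})")
```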

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • dCache updates at UC, IU, and AGLT2
    • dq2 client has problems with queries for datasets with large numbers of files (LFC 30K file limitation); fixed but not yet released - will need to upgrade soon.
    • Two thumper pools at BNL will be fixed today.
    • pnfsid entries in LFC completed - will use for checking checksums on pools
    • Chimera upgrade caused checksum not to work any longer - investigating; Tigran sent mail today, will provide a simple method.
  • this meeting:
    • BNL site services problem over the weekend - too many callbacks to central services. Callbacks to dashboard were okay. Required manual intervention, now working.
    • IU problem yesterday. Problem - on-going but intermittent today, being investigated.
    • Otherwise all okay at sites.
    • There were some missing datasets at AGLT2 - resubscribing everything on MCDISK to fill in holes.
    • Working on monitoring site services remotely.
    • Excluding upgrades - what is the administrative load for running DQ2 site services? At AGLT2 - not much of a load. At MWT2 - when we're running smoothly it's roughly zero. Opinion shared by Wei and John. What about the LFC? The real problem is maintaining consistency. Moving the LFC to a central location wouldn't help.
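The consistency checking discussed above (e.g. using pnfsid entries in the LFC to verify checksums on pools) amounts to comparing two catalogs. A minimal sketch, with made-up identifiers and checksums:

```python
# Minimal sketch of the consistency problem discussed above: compare
# checksums recorded in the LFC (keyed by pnfsid) against checksums
# recomputed on the storage pools. All identifiers and checksums here
# are made up for illustration.
lfc_checksums = {"0000A1": "ad1e32ff", "0000A2": "9b2c0001", "0000A3": "11112222"}
pool_checksums = {"0000A1": "ad1e32ff", "0000A2": "deadbeef"}

mismatched = [p for p in lfc_checksums
              if p in pool_checksums and lfc_checksums[p] != pool_checksums[p]]
missing = [p for p in lfc_checksums if p not in pool_checksums]

print("checksum mismatches:", mismatched)   # ['0000A2']
print("missing on pools:", missing)         # ['0000A3']
```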

LFC @ Tier 3

  • Concerns about maintaining catalogs at Tier 3's.
  • Request from Jim C to move data optionally to Tier 3's (eg. from analysis jobs running at Tier 2s, Tier 1).
  • Kaushik, Torre discussing having a single LFC for Tier 3's - then Panda can provide an option to output data at Tier 3, using Panda mover.
  • Doug: what about subscriptions? Run a single site services for Tier 3.
  • So then Tier 3's would need to setup an SRM storage element.
  • Doug, Kaushik, Torre will discuss this offline.
  • Will need an LFC at BNL.

Tier 3 issues (Doug)

  • Tier 3 meeting at ANL last week - "1/2 successful".
  • Lots of discussion about VMs.
  • Opportunistic computing within a campus. How to use it efficiently?
  • Would like a data subscription method for Tier 3s.
  • Only 2 sites attending in person did not already have Tier 3's - having trouble contacting the community.
  • Will be scheduling a concurrent "dq2-get test". This week.
  • Pushing xrootd as baseline in the Tier 3s.
  • What fraction of sites will be GS? Discouraging this model.

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
  • this week:
    • ESDs have been cleaned up at SWT2, WT2, and NET2.
    • AGLT2 (DQ2 reports 15 TB, SRM reports > 50 TB); MWT2 - DQ2 has 85 TB. These should be deleted (why didn't central operations delete them?)
    • Need to send email to Kaushik with report for space available on DATADISK for STEP09.
    • Obsolete dataset - Charles has new script available - has run at UC and IU. Working on fixes for AGLT2, and adding deletion for sub- and dis-datasets. Can also be used to prune even after central deletion.
    • For analysis stress testing:
      • MCDISK - need 20 TB for stress-testing datasets. But we will also have merged datasets of unknown size.
      • USERDISK or SCRATCHDISK? Migrate to SCRATCHDISK after STEP09? Stay w/ USERDISK until after STEP09. (Also gives us time for the new xrootd release.)
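The cleanup of transient sub- and dis-datasets mentioned above can be illustrated with a simple name filter. The dataset names and the exact _sub/_dis naming pattern are assumptions for illustration; the real script's logic is not given in the minutes:

```python
import re

# Hypothetical sketch of one piece of a cleanup script: pick out transient
# _sub/_dis (subscription/dispatch) datasets from a listing. Names and the
# naming pattern are illustrative assumptions.
datasets = [
    "mc08.105001.pythia.merge.AOD.e357_s462_r635_tid027772",
    "mc08.105001.pythia.merge.AOD.e357_s462_r635_tid027772_sub012345",
    "mc08.105001.pythia.merge.AOD.e357_s462_r635_tid027772_dis054321",
]

transient = [d for d in datasets if re.search(r"_(sub|dis)\d+$", d)]
print(len(transient), "transient datasets flagged for deletion")
```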

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  • last week:
    • Sites are too busy to schedule throughput tests
    • Next release - late June
    • Next week during throughput timeslot - will be open for other uses, but no organized meeting.
  • this week:
    • Need to check status of current perfsonar boxes. Next major milestone will be to deploy new Knoppix disk. Will have new driver for on-board NIC, and other work-arounds.
    • Milestones - production has been catching up. BNL has been sending >800 MB/s for the last 24 hours. 10G sites are to sustain 400 MB/s consistently.
    • What about Tier-3 throughput? Welcome to join Tuesday meetings now.
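To make the throughput milestones above concrete, a quick conversion from sustained rate to daily volume (decimal units assumed):

```python
# Converting the sustained-rate milestones above into daily volume,
# assuming decimal units (1 TB = 10^6 MB).
def tb_per_day(mb_per_s):
    return mb_per_s * 86_400 / 1_000_000

print(f"400 MB/s sustained -> {tb_per_day(400):.1f} TB/day")  # Tier 2 target
print(f"800 MB/s sustained -> {tb_per_day(800):.1f} TB/day")  # BNL's recent rate
```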

Site news and issues (all sites)

  • T1:
    • last week: Condor-G job submission progress - new code from the Condor team to address scaling problems on the submit host. Want to stress-test the new code - found the submission rate decreased by 1/3 from before. Condor developers investigating. Postgres PNFS database upgraded to a 64-bit machine yesterday - all went well. Increased memory to 48 GB. BNL-AGLT2 circuit now in place. Next is SLAC.
    • this week: dedicated circuit will be activated tomorrow for SLAC. UTA - more effort will be needed; would like to call a meeting to discuss arranging the last mile to SWT2_UTA (ESnet, I2, LEARN, ...). Working on storage services and access efficiency out of dCache using pnfsid - Hiro, Wensheng, working with Paul. Want to set up a test queue to test the modified pilot. Expect scalability to be addressed. Evaluating technologies for the next storage purchase of ~1.5 PB. Vendors hesitant to move to 1.5 or 2 TB drives. Will use Nexsan + FC to Thors to extend storage space; flexible and cost-effective.

  • AGLT2:
    • last week: Upgraded dCache and Chimera. Seemed to be working over the weekend. A building power test sent an EPO signal to the storage servers; one mistakenly got its RAID reconfigured. PNFS mounting issue.
    • this week: Working on consistency within LFC and dCache components, and DQ2. Re-subscribing ~900 datasets to MCDISK. Lots of space available for STEP09. Fixing one residual problem with a gridftp server. Running 13 10G gridftp nodes.

  • NET2:
    • last week(s): HU ramped up to 500 jobs. Helping Tufts setup a Tier3 as a production end point; 300 cores opportunistic. Not too concerned about support issues. Still have a problem with 10G NIC, still producing checksum errors 1/(1K-10K files). Have a replacement Myricom NIC on order.
    • this week: Running steadily but below capacity. The only real problem was the bad network card - replaced the 10G with a 1G card, which stopped the corruption. Replacement on hand. 300 MB/s transfer rate observed while replacing a dataset. Extra bandwidth helping! HU - all going well.

  • MWT2:
    • last week(s): kernel & OS upgrades last week. Upgraded dCache to 1.9.2-5 - have gPlazma timeouts, but have new changes. UC back online. IU - difficulties changing site status and getting test jobs running.
    • this week: gPlazma auth timeouts - have a configuration now that works. Have some hardware instability with new Dell servers. Problems with files not getting registered.

  • SWT2 (UTA):
    • last week:
    • this week: Mark: main thing over the past week was old cluster had full file system. Cleared up, weekend installs succeeded, now back in production.

  • SWT2 (OU):
    • last week:
    • this week: all okay.

  • WT2:
    • last week:
    • this week: all okay. An earlier problem with SRM resolved. Working on next procurement.

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates as he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at Tier 2's as developments are made. - rwg
  • last meeting(s):
    • AGLT2 - two servers setup, working, talking to a front-end at BNL
    • Presentation at ATLAS Computing Workshop 4/15: Slides
    • Fred is looking into the Squid install instructions - Doug will help
    • Will try out instructions on MWT2; Patrick ~ 2 weeks; Saul will discuss with John.
  • this week:
    • Fred reports problems using Squid with release 15 - cannot access the Frontier server at BNL. Release 14 works. Has contacted Richard Hawkings; working on a solution.
    • Has tested direct Oracle access with help from Carlos, but found no significant performance improvement in spite of having reduced the number of queries.
    • Sarah used the instructions for Tier 2 successfully.

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
    • The new system is in production.
    • Discussion to add pacball creation into official release procedure; waiting for this for 15.0.0 - not ready yet. Issue is getting pacballs created quickly.
    • Trying to get the procedures standardized so it can be done by the production team. Fred will try to get Stan Thompson to do this.
    • Testing release installation publication against the development portal. Will move to the production portal next week.
    • Future: define a job that compares what's at a site with what is in the portal.
    • Tier 3 sites - this is difficult for Panda - the site needs to have a production queue. Probably need a new procedure.
    • Question: how are production caches installed in releases? Each is in its own pacball, which can be installed in the directory of the release that it's patching. Should Xin be a member of the SIT? Fred will discuss next week.
    • Xin will develop a plan and present in 3 weeks.
  • this meeting:
    • Transfers of pacballs from CERN to BNL are now quicker.
    • Increase priority of pacball transfers in pandamover - will consult Tadashi.
    • Move all sites to use new installation system. Need to add independent install queues - need Alden's help. Have 11 sites migrated already. Missing UTA, OU, Illinois, UTD.
    • MWT2_IU path different between gatekeeper and worker-node. Sarah will fix this so fall-back works.
    • Torre: do we need special queues? Why not use conventional queues when a site is in test mode? Xin: send email to Kaushik, Torre, Tadashi.

Tier 3 coordination plans (Doug, Jim C)

  • last report:
    • Upcoming meeting at ANL
    • Sent survey to computing-contacts
    • There is a Tier3 support list being set up.
    • Need an RT queue for Tier3
  • this report:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksum to xrootd. Alex developing something to calculate this on the fly. Expects to release this very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
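The on-the-fly Adler32 calculation discussed under "last week" can be sketched with Python's zlib, updating the checksum chunk-by-chunk as data streams through so no second pass over the file is needed. This is illustrative only, not the actual xrootd/gridftp implementation:

```python
import zlib

# Sketch of an on-the-fly Adler32 calculation of the kind discussed above:
# the checksum is updated chunk-by-chunk as data streams through, so no
# second pass over the file is needed. Illustrative only - not the actual
# xrootd/gridftp implementation.
def streaming_adler32(chunks):
    value = 1  # Adler32 seed value
    for block in chunks:
        value = zlib.adler32(block, value)
    return f"{value & 0xffffffff:08x}"

# Chunked and single-pass checksums agree, so the value can be computed
# while the data is being written. Both lines print 062c0215.
print(streaming_adler32([b"hel", b"lo"]))
print(streaming_adler32([b"hello"]))
```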

Tier3 networking (Rich)

  • last week
    • Reminder to advise campus infrastructure: Internet2 member meeting, April 27-29, in DC
    • http://events.internet2.edu/2009/spring-mm/index.html
    • Engage with the CIOs and program managers
    • Session 2:30-3:30 on Monday, 27-29 to focus on Tier 3 issues
    • Another session added for Wednesday, 2-4 pm.
  • this week

Local Site Mover


  • last week
    • dCache service interruption tomorrow. The postgres vacuum seems to flush the write-ahead logs to disk frequently. Will increase the logging buffer (with checkpoint segments) to 1-2 GB, as well as the write-ahead logging buffers, to decrease the load while vacuuming. May need to do another one at some point. Will publish settings.
    • OSG 1.0.1 to be shortly released.
  • this week
    • STEP09 - Jim was discussing moving it to June 18, probably outside STEP09 proper. Note ATLAS central operations is in the process of moving data into the regions, and so there will be data available then.
    • SVN checkout to local disk - you need subversion rpm installed locally.
    • Next Tier 2/3 workshop? Consider late August, mid-September. Hands on component? Doug - want to write up a cookbook. Consider August 17.

-- RobertGardner - 26 May 2009
