r9 - 09 Feb 2009 - 18:19:35 - HorstSeverini



Minutes of the Facilities Integration Program meeting, Jan 28, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Patrick, Saul, Michael, Shawn, Charles, Rob, Douglas, Wei, Sarah, Neng, Bob, Armen, Horst, Karthik, Nurcan, Hiro, Wensheng, Mark, Kaushik, Torre, Pedro
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

Operations overview: Production (Kaushik)

  • last meeting:
    • Lots of reprocessing tasks in the US - failure rates are very high (job definition problems).
    • Filled up Monday/yesterday (>6K jobs), now ~drained.
    • Potential Condor-G scaling issues (job eviction) - pilot submit host upgrade plan. Upgrade to a newer version of Condor and evaluate; this version has the changes made by the Condor team to accommodate Panda requirements. E.g., Condor's strategy of completing a job no matter what is at odds with the Panda philosophy (we can lose pilots; there is no need to retry failed pilots).
    • Working on job submission to HU. Problems at BU - perhaps missing files. John will work the issues w/ Mark offline.
    • Pilot queue data misloaded when scheddb server not reachable; gass_cache abused. Mark will follow-up with Paul.
  • this week:
    • See Michael's remarks above regarding reprocessing; decisions about the release will come in mid-February, with a freeze of the databases soon thereafter, but otherwise the dates above remain relevant.
    • Need to make sure there is sufficient capacity beyond PRODDISK
    • MC production has ramped back up - have plenty of tasks
    • Brokering has become an issue, some things need fixing. One problem is random evgen distribution - no longer going to all Tier 1's. Tasks getting stuck if there are missing evgen files.
    • There is also some throttling introduced for tasks to keep clouds from getting overloaded. However, the number is sometimes wrong by a factor of 10. Still an issue.
    • Will get back to policy of sending evgen to every cloud.
    • Expect to keep sites up and running for the next two-three weeks.
    • Still evicting lots of jobs from Condor-G (error code 1201). John is working on a new instance. UTA has been switched to the new submit host, and a significant reduction has been observed. Both IU_OSG and MWT2_IU also seem to be susceptible to this error.

Shifters report (Mark)

  • Distributed Computing Operations Meetings
  • last meeting:
  • this meeting:
    • Progress on getting OU_OSCER back into production.
    • Working on getting UTD back online. Test jobs are being submitted successfully. Could be some NFS server problems.
    • ANALY_AGLT2 offline - not sure why. Probably an oversight. Will turn back on.
       US-CA Panda-production shift report (Jan.19-26, 2009)
      I. General summary:
      During the past week (Jan.19-26) in U.S. and Canada
        Panda production service:
      - completed successfully 215,597 managed MC production, validation
        and remaining reprocessing jobs.
      - average ~30,800 jobs per day.
      - failed 32,317 jobs.
      - average job success rate ~86.94%.
      - 47 (42+5) active tasks run in U.S. and Canada (validation,data08,mc08).
      II. Site and FT/DDM related interruptions/issues/news.
      1) Mon Jan 19. AGLT2. More test jobs were submitted and finished
      successfully. Site set to 'online'. Elog #2810.
      2) Mon Jan 19. The site OU_OCHEP_SWT2 has been set offline until
      the release could be installed. A large number of jobs had failed.
      Elog #2811. Fixed.
      3) Tue Jan 20. SWT2_CPB set 'online'. Clean-up of the home directories
      for usatlas1 and usatlas3 has been completed, and this fixed the problem
      with the pilots. Test jobs have completed successfully.
      4) Wed Jan 21. MWT2_IU job failures: Get error: No such file or directory.
      ~460 jobs failed. RT ticket #11587 submitted. MWT2_IU set offline.
      An unintended side-effect of proddisk cleanup. Test jobs have
      been submitted and ran successfully; ticket closed and site set
      back online.
      5) Wed Jan 21. DE/FZK-LCG2 pilot: Get error: dccp get was timed out
      after 18000 seconds. GGUS #45484.
      6) Wed Jan 21. FR/LYON missing installation. GGUS #45491. More than 800 jobs failed in the last hours; the queues were switched to offline
      in Panda. Resolved on Thu Jan 22. Put back online as test jobs succeeded.
      7) Wed Jan 21. CA/SFU-LCG2 Get error: Replica not found. GGUS #45492.
      8) Wed Jan 21. UK/UKI-NORTHGRID-MAN-HEP1 pilot: Get error: rfcp failed:
      Connection refused. GGUS #45494. Resolved next day. Solution: Node had
      a faulty memory and has been fixed.
      9) Thu Jan 22. UK/RAL pilot Put error: Error in copying the file from
      job workdir to localSE. GGUS #45523. Switched to offline in Panda, too many failures. RAL fixed the issue by restarting their SRM nodes the same day. The site's queues were set back online. New job failures appeared 2h later, and the queues were set offline again.
      10) Fri Jan 23. Several hundred jobs failed at MWT2_IU due to missing release 14.5.1. Elog #2850. Experts were informed, installation done.
      11) Mon Jan 26. All jobs failing at NL/csTCDie: signal 15 or lost heartbeat. csTCDie set offline. It could be a walltime too short problem. GGUS #45616.
      12) Mon Jan 26. Numerous failures at FR/LPC, missing 14.2.10 and 14.2.20.
      72% failure rate. GGUS ticket # 45617.
      III. ATLAS Validation and ADC Operation Support Savannah bug-reports:
       -- data08_cos repro task 33113 job failures: TRF_EXC |
          IOError: (2, 'No such file or directory', 'jobReport.pickle')
          Savannah bug #46151. Crashes in ESDtoAOD and ESDtoDPD, many
          6th/7th attempts. Under investigation: ESD to AOD bad_alloc, etc.
          See Savannah bug #45764. Task set to "FINISHED".
       -- mc08 digi task 39224 failures: Athena failed: Non-zero
          failed job return code: 1. ~270 jobs from this task failed in LYON.
          Savannah bug #46152. Task is in "FINISHED". Bug closed: 15995/15996
          jobs done (temporary site-related issue).
       -- valid1 simul-reco task 39495 failures: TRF_INFILE_TOOFEW
          | Input file EVNT.039398._00001.pool.root.1 : too few events.
          Savannah bug #46266. ALL jobs failed up to 9 attempts. Task "ABORTED"
          and redefined with "valid2" prefix - thanks to Andreu.
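As a quick sanity check, the summary figures in section I of the report above are internally consistent - the daily average and success rate follow from the completed/failed counts (computed here from the numbers quoted in the report; the small difference from the reported ~86.94% likely reflects a slightly different denominator on the shifters' side):

```python
# Figures quoted in the US-CA Panda shift report summary (Jan 19-26, 2009).
completed = 215_597   # successfully completed jobs
failed = 32_317       # failed jobs
days = 7              # length of the reporting period

total = completed + failed
success_rate = 100.0 * completed / total
daily_average = completed / days

print(f"total jobs attempted: {total}")
print(f"success rate: {success_rate:.2f}%")      # ~86.96%, close to the reported ~86.94%
print(f"average per day: {daily_average:.0f}")   # ~30,800, matching the report
```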

Analysis queues, FDR analysis (Nurcan)

  • Analysis shifters meeting on 1/26/09
  • last meeting:
  • this meeting:
    • CERN offline software tutorial last week. Paul helped with pathena tutorials.
    • 2008 Software & Computing Post-Mortem Workshop last week. Distributed analysis user support post-mortem on Thursday, shifters discussions on Friday. Dan presented input from AtlasDAST, see the minutes/highlights of the discussion from analysis shifters meeting agenda above.
    • Analysis site issues:
      • TAG selection jobs: works at AGLT2, SLAC, BNL, does not work at OU, UTA (ERROR : could not import lfc) and MWT2. The associated AOD is not inserted into the PoolFileCatalog.xml file. Need help from Paul/Torre on site configuration parameters at OU, UTA (copysetup parameter in schedconfigDB?). I could not submit to NET2 yesterday and today due to the ERROR : failed to access LFC.
      • Urgent: Users report problems with BNL LFC: send2nsd: NS002 - connect error : Timed out, ERROR : Timed out, ERROR : failed to access LFC. This seems to be happening often, dq2-ls and dq2-get would work for datasets at BNL however pathena submission would not go through.
      • The same LFC problem above also occurs at heroatlas.fas.harvard.edu. Marco/Wensheng discussed the issue; HU uses BU's LFC. The site configuration for ANALY_HU_ATLAS_Tier2 should be checked.
    • Announcement of site status table from Nurcan last week: _A site status table is now available for all pathena sites from the Analysis Dashboard of the Panda monitor: http://panda.cern.ch:25880/server/pandamon/query?dash=analysis. Click on the "Show Analysis Site Status" button. We hope that this table will further help users check the site status if their jobs get stuck in the activated state, for instance. If your favorite site has a status of "Test", this means that the site is being tested after a major maintenance, upgrade etc., and will be put back online once the test jobs are successful. Please also continue to check the two eLogs (Shift eLog and CERN eLog) available under the "Panda info and help" link on the top left of the Panda monitor to see if any problem is already reported for the site in question: http://atlas003.uta.edu:8080/ADCoS/?mode=summary, https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/._

  • Continue email thread for TAG selection jobs.
  • Kaushik will provide a script which fails 50% of the time - increased since transition to LFC.

  • Experiences with prun and policy questions (Neng)
    • http://wisconsin.cern.ch/~nengxu/pathen_prun.ppt
    • Are we automatically getting all the AODs and central production DPDs? No - need to request these with new releases of datasets, and re-request datasets (via Kaushik).
    • At which stage do the AOD files get transferred? While the Panda jobs are running, or only after the whole task has finished?
    • When should users submit pathena jobs in case the AOD MC samples are not finished? Whenever they are available - pathena can do this incrementally, as new data arrives. However, DQ2 will only transfer datasets that are closed(?). Perhaps this is Alexei's policy? Need to check w/ Alexei.
    • Can users define the output dataset destination (WISC) while having the jobs run somewhere else? Currently not doing this, and not recommended, in order to avoid the risk of wasting CPU resources when things go wrong during stage-out. Kaushik has requested that requests from Neng be automatically approved from the web request form.
    • Problems running multiple jobs within a task using prun.
    • What about large input datasets produced "locally". This has to go into a GROUPDDISK quota. No one is allowed to store locally produced data on ATLAS resources. This should be done through the DDM page.
    • What are the policies of usage of PRUN: - running time. - input/output file size limitation (usage of --extFile.)
    • What type of jobs should/shouldn’t be run with PRUN?
    • Where can we find the policies?
      • For the moment the project is monitoring the use before declaring official policies. There are some built-in limitations.
    • Pedro will look into dataset deletion which should be possible with LFC SE's. We also need to look into Xrootd SE's.

Operations: DDM (Hiro)

  • last meeting:
    • Hiro at the dCache workshop
    • DDM stress test currently running, 10M files, involving all Tier 1s. DS subscribed from one, to all. Small files, stressing central components and local catalogs, and SRM. 14K files per hour (much higher than in production). Anticipated plan is to include the Tier 2s - more info from Simone later. Validating the latest version of the DDM code.
  • this meeting:
    • Reporting massive subscription exercise ~ 1M files/day. Registration failures, due to latency (RTT).
    • Discussion about hosting the Tier 0-Tier 1 export service.
    • BNL_GROUPDISK added to ToA.
    • AGLT2 DQ2 wasn't working for Tier 0 exports - fixed by Shawn.
    • SRM problems at UW
    • New site at UTA now online
    • LFC for OU - must be public. Fixed.
    • dq2-put problems reported.
    • New DDM monitor up and running (dq2ping); testing with a few sites. Can clean up test files with srmrm. Plan to monitor all the disk areas, except proddisk.
    • dCache workshop - can bypass namespace for reading files. New version of dccp to write to space token areas. Chimera should work better with multiple clients in comparison to pnfs. Nordugrid will do testing of this. New documentation coming, with more "how-to's". Can add commentary.
    • Another 10M transfer jobs planned - mid-Feb. During this phase there will be real throughput tests combined with the stress tests. And planning to include the Tier 2's.
    • Proxy delegation problem w/ FTS - the patch has been developed and is in the process of being released. Requires FTS 2.1; a back-port was done, though it is only operational on SL4 machines. We would need to carefully plan migrating to this.

Storage validation

  • last week:
    • See new task StorageValidation
    • Current ccc.py script works for dCache, though the coupling is pretty weak (uses a text file dump from dcache). Need to get dumps for xrootd and gpfs. Output is text.
    • Separates files into ghosts and orphans - can take further steps to clean up.
    • Strongly coupled to DQ2 and LFC.
    • Armen: there are discussions on-going to systematically describe the policy for bookkeeping and clean-ups, beyond emails to individuals. Follow-up in two weeks.
  • this week:
    • Have been using dq2site-cleanse in the past couple of weeks. Has exposed problems with the lfc python bindings.
    • Have developed a workaround for administrative purposes:
      • http://repo.mwt2.org/viewvc/admin-scripts/lfc/proddisk-cleanse.py?view=markup
      • see email to ddm-l
      • optimized for proddisk structure; not to be used for cleaning DATADISK, MCDISK etc.
      • Wei: ran the previous dq2site-cleanse at SLAC without problems - removing 30 TB consistently.
      • Pedro warns against multiple threads - experiences from other clouds; will consult with Vincent.
      • There is a problem with ACLs perhaps - from older version of the pilot. Follow-up offline.
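The ghost/orphan separation mentioned above (in ccc.py and the cleanse scripts) boils down to a set difference between a storage dump and the catalog contents. A minimal sketch, assuming plain path lists on both sides; the function name, example paths, and the exact ghost/orphan convention here are illustrative, not taken from ccc.py:

```python
def classify(storage_paths, catalog_paths):
    """Compare a storage dump against catalog (e.g. LFC) entries.

    Returns (ghosts, orphans):
      ghosts  - cataloged replicas with no corresponding file on storage
      orphans - files on storage with no catalog entry
    """
    on_disk = set(storage_paths)
    cataloged = set(catalog_paths)
    ghosts = sorted(cataloged - on_disk)
    orphans = sorted(on_disk - cataloged)
    return ghosts, orphans

# Toy example with hypothetical paths:
storage = ["/pnfs/site/mcdisk/f1", "/pnfs/site/mcdisk/f2"]
catalog = ["/pnfs/site/mcdisk/f2", "/pnfs/site/mcdisk/f3"]
ghosts, orphans = classify(storage, catalog)
print(ghosts)   # ['/pnfs/site/mcdisk/f3']
print(orphans)  # ['/pnfs/site/mcdisk/f1']
```

Once classified, the two lists call for different clean-up steps: ghosts need catalog entries removed, orphans need files deleted from storage (after verification).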

VDT Bestman-Xrootd

  • last week
    • BestMan
    • Horst - basic installation of bm-gateway done; needs to add the space token configuration. BestMan and the SRM interface are working. All tests look okay.
    • Doug and Ofer will be looking at this at BNL. follow-up in two weeks
  • Doug: installing at BNL (gateway only) and Duke (both gateway and xrootd back-end) from VDT.
  • Patrick: installed BM-gateway for UTA_SWT2. Will forward suggestions to osg-storage. There is also some work for SRM space monitoring that needs to be tracked.

Throughput Initiative (Shawn)

  • Notes from meeting this week:
            Notes for Throughput Call, January 27, 2009
Attending:  Shawn, Charles,  Saul,  Neng,  Hiro, Patrick, Rob
Excused:  Horst, Karthik (OU “closed”)
Recent network testing (OU and MWT2) discussion.  OU now has reasonable (3-4 Gbps) Iperf results from OU to BNL.   Still need to test BNL->OU.  MWT2 had some issues but this was mostly due to the cut-over for the new routing to BNL.   Will resume testing after new peering is in place on the new path. 
FYI:  Use 'ethtool -S ethX' before and after Iperf tests (saving the output to a file) so you can compare.
Load testing during the last week:  None.    No sites reported being ready during the next week.  
MonALISA Status at BNL: no report this week.
perfSONAR box issues:   UTA needs to check their boxes.  NET2 estimates 1-2 weeks.   IU needs to be checked.  
IU throughput issues:  delayed till next week.
Finish discussion of testing infrastructure.  Detailed discussion of how to monitor/track inter (and intra) site throughput.    Need to have easy access to PANDAMover and FTS data at a minimum.   How best to get this info and track it?   Where is the logical “tap” point?     Will think about how to get needed info.   Hiro may have an idea about how to track “production” throughput to discuss next meeting.
Site reports:
                BNL:  Tier-1 no updates.
                AGLT2:   Working on network problems.   Plan to have access to StarLight test-box later today/tomorrow.
                MWT2:  BNL peering tomorrow
                NET2:  perfSONAR operational in 1-2 weeks.
                SWT2: perfSONAR boxes will be checked tomorrow.
                WT2:  no report
                Wisconsin: No updates.    XROOTD issues to be resolved before further throughput testing.
AOB:  None.
Call next week at the regular time.  Please send along any corrections or additions to these notes.
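The 'ethtool -S' tip from the notes above can be scripted. This is a sketch under stated assumptions (the interface name, file names, and helper name are illustrative): save the NIC counters before and after the iperf run, then show only the counters that changed:

```shell
#!/bin/sh
# Compare two saved 'ethtool -S' counter dumps and print only counters
# whose values changed between the "before" and "after" snapshots.
compare_stats() {
    # args: before-file after-file (each sorted "counter: value" lines)
    # join pairs up counters by name; awk keeps rows where the value differs
    join -t: "$1" "$2" | awk -F: '$2 != $3 { printf "%s:%s ->%s\n", $1, $2, $3 }'
}

# Typical use (requires a real interface and iperf endpoint):
#   ethtool -S eth0 | sort > before.txt
#   iperf -c remote.host -t 30
#   ethtool -S eth0 | sort > after.txt
#   compare_stats before.txt after.txt
```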


Last week:

            USATLAS Throughput Call
January 20, 2009 at 3 PM Eastern
Attending: Charles, Tom,  Shawn, Wei, Jay, Horst, Karthik, Rich, Hiro.
Site Reports:
                BNL:  Things look good for 1GB/sec.   Email Hiro for redoing site  load-testing.   Need manpower at remote Tier-2s to help manage/monitor tests.  Need sites ready.

                AGLT2:   MSU site has 6 Dell storage nodes in production (5x10GE, 1x1GE).  Lustre  awaiting v1.8.   UM site has 5x10GE storage nodes.

                MWT2:  UC/IU bringing up new Dell storage nodes (3/site).   All have 10GE NICs.  Currently just configured as pool nodes.   Soon to add them as GridFTP doors, then ready for redoing load tests.   IU had problems when bringing up a second GridFTP door.


                SWT2:   OU reports not much change.   The 10GE NICs are mostly operational.  Ready to redo load-testing.

                WT2:  SLAC is ready to retest.   Got 380MB/sec using all servers.    Current maximum is 300-350 MB/sec (using two new servers).   
MonaLISA/perfSONAR:   MonALISA is installed/running.   Haven’t yet remade graph templates (were previously lost).     The perfSONAR results were reported by Rich with testing to/from the Internet-2 perfSONAR to/from BNL, OU, MSU and UM.      UM to BNL is problematic (known problem).  OU has issues sending.   BNL  may have periods of congestion.  Rich would like to include John Bigrow in future discussions about perfSONAR.
Long discussion about how best to use or extend perfSONAR.   Shawn doesn't want to lose or modify the existing USATLAS perfSONAR installation, which is based on locating identical hardware as close to Tier-n locations as possible and running the same software.  Jay wants to make sure we measure end-to-end using production systems, as this is the best measure of our real throughput.  Rich pointed out that OSG will soon be distributing perfSONAR clients, so that part will be easier.   Could be useful for hybrid testing.   Jay is interested in extending perfSONAR to include (USATLAS-like) GridFTP capability.   Discussion will be continued at next week's meeting (ran out of time).
Next meeting, January 27th at 3 PM.   Please send along corrections or additions to these notes.


  • Main focus is getting the perfSONAR infrastructure at all sites. Separate this from the host-level configuration issues on production resources. We feel it's important to track this and follow it over time.
  • The bwctl program tracks scheduled transfers between sites. Expect the testing to be light enough not to interfere with production.

Site news and issues (all sites)

  • T1:
    • last week: building addition making good progress.
    • this week: there will be a short LAN outage around noon (multicast forwarding issue to be studied on the backbone switch). This will be about 2 hours. On March 24 there will be an 18-hour outage, though the impact to ATLAS should be small. Still bringing up the 1 PB of disk. Would like to get rid of the data on distributed disk on compute nodes and retire it, ~100 TB. Over the course of February, upgrade dCache to v1.9. FTS upgrade as mentioned above. Direct links between BNL and sites - our network providers are very interested in helping here. Mike Conner w/ Esnet, in contact with Internet2. Expect links to BU and HU shortly. BNL performed well during the previous 10M DDM test.
  • AGLT2:
    • last week: waiting for a testpilot job to complete at the moment - the problem was related to a zombie dCache process and a problem with the NIC. Will clean up proddisk when Charles' new script becomes available. 400 job slots at MSU ready to come online.
    • this week: running well since dCache database vacuuming is working correctly. The number of running jobs in Panda tends to exceed the number actually running. Wenjing is having trouble with the new proddisk cleaner script. GLPI tracking, ticketing, and inventory system.
  • NET2:
    • last week: Running analysis jobs, not production jobs. There was a problem yesterday resulting from the cleanse script w/ permissions on the LFC. Now running a script to clean up the LFC, but it's slow (Sarah has a faster version). Still waiting for word from Kaushik on cleaning up MCDISK, DATADISK. New storage 336 TB raw, new GPFS volume. Harvard: still working on firewall/proxy server issues for worker nodes.
    • this week: Running production steadily since last week. Bringing up the new 336 TB storage; bringing HU site up - John working hard, working through local issues.
  • MWT2:
    • last week: brought up another Dell server ~ up to 200 TB. Running smoothly mostly. Marco working on running TAG-based analysis jobs; three sites have configuration problems in scheddb.
    • this week: brought up new worker nodes.
  • SWT2 (UTA):
    • last week: CPB running smoothly. Working on the upgrade of UTA_SWT2. Ibrix re-installed; will wrap up early next week.
    • this week: all well.
  • SWT2 (OU):
    • last week: gass cache filled up again, from usatlas jobs; thousands of directories are being created. Working on getting the OSCER cluster back up.
    • this week: all is well; OSCER work continues.
  • WT2:
    • last week: Cooling outage is over, everything coming back. Ran into problems changing ATLAS releases to a new NFS server. Xin: this is a well-known feature of Pacman. Will just need to re-install.
    • this week: all fine.

Carryover issues (any updates?)

Release installation via Pacballs + DMM (Xin, Fred)

  • last week:
    • Now working at BNL
    • There was a problem with proxy handling within the pilot. Fixed.
    • Now going through the sites and discovering new problems, e.g. usatlas2 permissions.
    • MWT2_IU, SWT2_UTA - worked all the way to the end, but ran into a permissions problem; needs to change the script.
    • There is a problem with the installation script (?).
    • Pacman / pacball version problems?

Squids and Frontier (Douglas)

  • last week:
    • Things are tuned now so that access times can be measured.
    • Squid at SLAC is now working well with lots of short jobs. Cache is setup with 36 GB disk for testing.
    • Will be working on jobs with different access patterns.
    • What's the plan in ATLAS generally? There are tests going on in Germany (Karlsruhe Tier 1). There is an upcoming workshop where this will be discussed. Also a muon calibration group is looking into this (lots of data at the beginning of the job).
    • How to try out with a real reco/reprocessing job?
    • We need to make sure this can be extended to the Tier 2s.
    • Discuss at next week's workshop

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week:
    • What about direct reading of files? Not relevant - only invoked for local copies.


  • None

-- RobertGardner - 20 Jan 2009
