


Minutes of the Facilities Integration Program meeting, Feb 4, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Fred, Rob, Michael, Tom, Sarah, Douglas, Horst & Karthik, Marco, John, Patrick, Armen, Bob, Hiro, Wei, Saul, Shawn, Kaushik, Neng
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • IntegrationPhase8
  • Facility working group on analy queue performance: FacilityWGAP NEW
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open usually): http://integrationcloud.campfirenow.com/1391f
  • Other remarks:
    • Re-reprocessing exercise will be defined - site validation starts March 2; all sites will be validated, extending to the Tier 2's. The bulk of jobs will start the week of March 9 and run for probably a couple of weeks. This is an extremely important exercise - avoid scheduled maintenance and interventions during this time period.
    • Tier-3 support issues - more requests are coming in. How we deal with this will require US ATLAS management decisions and resources. Please communicate Tier-3 requests to Rob and Michael.

Operations overview: Production (Kaushik)

  • last meeting(s):
    • Lots of reprocessing tasks in the US - failure rates are very high (job definition problems).
    • Potential Condor-G scaling issues (job eviction) - pilot submit host upgrade plan. Upgrade to a newer version of Condor and evaluate; this version has the changes made by the Condor team to accommodate Panda requirements. E.g., Condor's strategy of completing a job no matter what is at odds with the Panda philosophy (we can lose pilots; no need to retry failed pilots). Fixed at UTA and MWT2 DONE
    • Working on job submission to HU. Problems at BU - perhaps missing files. John will work the issues w/ Mark offline.
    • Pilot queue data misloaded when scheddb server not reachable; gass_cache abused. Mark will follow-up with Paul. (carryover)
    • Need to make sure there is sufficient capacity beyond PRODDISK
    • Brokering has become an issue, some things need fixing. One problem is random evgen distribution - no longer going to all Tier 1's. Tasks getting stuck if there are missing evgen files. Resolved DONE
    • There is also some throttling introduced for tasks to keep clouds from getting overloaded. However, the number is sometimes wrong by a factor of 10. Still an issue
    • Will get back to policy of sending evgen to every cloud.
    • Expect to keep sites up and running for the next two-three weeks.
    • Still evicting lots of jobs from Condor-G (error code 1201). John is working on a new instance. UTA has been switched to the new submit host and has observed a significant reduction. Both IU_OSG and MWT2_IU also seem to be susceptible to this error. Migrated to the new submit host instance.
  • this week:
    • Production ramping back up w/ brokering improvements.
    • Monte Carlo production should not be stopped because a Tier 1 is short of space.
    • Condor-G job evictions seem to have gone away at UTA and MWT2.
    • All is fine, we're in good shape so long as we get jobs from central production; keep bringing on resources.
    • Retries for transferring files & job recovery. Kaushik will follow-up with Paul.

Shifters report (Mark)

  • Distributed Computing Operations Meetings
  • last meeting:
    • Progress on getting OU_OSCER back into production.
    • Working on getting UTD back online. Test jobs are being submitted successfully. There could be some NFS server problems.
    • ANALY_AGLT2 offline - not sure why. Probably an oversight. Will turn back on.
  • this meeting:
    • Networking issue at BNL - temporary outage of queues
    • Noticed job evictions at SLAC due to wall-time limits
    • SRM issue at AGLT2 - fixed
    • UTD - test jobs succeeding now; write permissions for panda mover files
    • UTA_SWT2 - back online for production after major software work
    • 10K jobs are in transferring state - is this normal?

 US-CA Panda-production shift report (Jan.26 - Feb.2, 2009)

I. General summary:

During the past week (Jan.26-Feb.2) in U.S. and Canada
  Panda production service:

- completed successfully 274,409 managed MC production and validation jobs
- average ~39,201 jobs per day
- failed 25,852 jobs
- average job success rate ~91.39%.
- 83 (73+10) active tasks run in U.S. and Canada (validation,mc08,etc.)

II. Site and FT/DDM related interruptions/issues/news.

1) Tue Jan 27. SE problems (stage-in/out) at ASGC (Taiwan-LCG2).
In the last 6 hours the error rate was 77% (1639 failures).
E.g. Panda job 24314448, error details: pilot: Get error:
Copy command self timed out after 1814 s. GGUS ticket #45638.
Test jobs were sent to TW on Fri Jan 30 after more space was added
to the MCDISK token area. These succeeded, so the issue appears
to have been resolved.

2) Thu Jan 29. US/MWT2_UC: about 150 job failures (stage-in).
Get error: dccp get was timed out after 18000 seconds.
RT ticket #11721. Caused by a transient dCache pool error.
Resolved; ticket closed on Monday.

3) Thu Jan 29. UK/RAL pilot Put error: Error in copying the file
from job workdir to local SE. GGUS #45678.

4) Thu Jan 29. NL/NIKHEF-ELPROD pilot Get error: Failed to get LFC
replica. GGUS #45682.

5) Thu Jan 29. BNL_ATLAS_DDM queue switched to offline, PandaMover
suspended due to maintenance of NFS/AFS and local network at BNL T1.
Elog 2892. Restarted/activated next day Jan.30.

6) Thu Jan 29. A number of job failures at BNL, ~O(1000), during
the maintenance period. Elog ##28890,2890. BNL_ATLAS_T2 set offline,
back online on Friday.

7)  Thu Jan 29. FZK-LCG2 set to offline: SE problem. GGUS 45695.
Resolved next day. The site set back online.

8) Sat Jan 31. UC_ATLAS_MWT2 -- Get error: md5sum mismatch on input
file. ~140 jobs failed. Elog #2924. All of the failed jobs appear
to have run on the same node, c044.local. Update from Charles at UC:
"This same node has had problems before. I'm removing it from the Condor
queue (again) until we can figure out what is wrong with it."

9) Mon Feb 2. SLACXDR failures: a number of problematic WNs.
- The "fell0089" WN produced 173 job failures over the last 12h
because it lost access to the NFS-mounted ATLAS release area (nfs
mount point).
- A subset of other WNs at SLAC (bali0221-bali0252, boer0102-boer0135,
fell0127-fell0186) had a 100% failure rate (~500 jobs) for ATLAS
production jobs due to a configuration issue: jobs get evicted after
2-3h of running on these WNs and then fail with "lost heartbeat".
- Other regions of WNs at SLAC do not have these problems and run
ATLAS production jobs well, with a very high success rate.
Experts at SLAC were informed. Thanks to Wei, fell0089 was rebooted
and the other problematic nodes were closed to ATLAS jobs.

III. ATLAS Validation and ADC Operation Support Savannah bug-reports:

 -- mc08 simu+reco task 39291 failures: TRF_UNKNOWN |
    Peak-to-Val dist is 3.5, Val-to-Peak dist is -9.5; should not
    be negative. Savannah bug #46389. Some jobs failed repeatedly
    at various sites, up to 5 attempts. Known reco issue.
    Should be fixed in the new releases. Task is still "RUNNING".

 -- mc08 simu+reco task 39294 failures: ATH_EXC_PYT |
    MemoryError: ()"no mem for new parser". A number of jobs failed.
    Savannah bug #46392. Task "FINISHED" as 15995 jobs finished out
    of 16000 defined.

 -- mc08 simu+reco task 39290,32294 failures: TRF_SEGFAULT |
    (pid=15669 ppid=15668) received fatal signal 11 (Segmentation fault).
    Some jobs failed. Savannah bug #46393.


Analysis queues, FDR analysis (Nurcan)

  • Analysis shifters meeting on 1/26/09
  • last meeting:
    • Analysis site issues:
      • TAG selection jobs: work at AGLT2, SLAC and BNL; do not work at OU, UTA (ERROR : could not import lfc) or MWT2. The associated AOD is not inserted into the PoolFileCatalog.xml file. Need help from Paul/Torre on site configuration parameters at OU and UTA (copysetup parameter in schedconfigDB?). Could not submit to NET2 yesterday or today due to "ERROR : failed to access LFC".
      • Urgent: users report problems with the BNL LFC: send2nsd: NS002 - connect error : Timed out, ERROR : Timed out, ERROR : failed to access LFC. This seems to happen often: dq2-ls and dq2-get work for datasets at BNL, but pathena submission does not go through.
      • The same LFC problem above also occurs at heroatlas.fas.harvard.edu. Marco/Wensheng discussed the issue; HU uses BU's LFC. The site configuration for ANALY_HU_ATLAS_Tier2 should be checked.
    • Continue email thread for TAG selection jobs.
    • Kaushik will provide a script which fails 50% of the time - the failure rate increased since the transition to LFC. Resolved by Hiro; not an LFC problem.
    • Experiences with prun and policy questions (Neng)
      • http://wisconsin.cern.ch/~nengxu/pathen_prun.ppt
      • Are we automatically getting all the AODs and central production DPDs? No - these need to be requested with new releases of datasets, and datasets need to be re-requested (via Kaushik).
      • At which stage do the AOD files get transferred - while the Panda jobs are running, or only after the whole task has finished?
      • When should users submit pathena jobs if the AOD MC samples are not finished? Whenever they are available - pathena can do this incrementally, as new data arrives. However, DQ2 will only transfer datasets that are closed(?). Perhaps this is Alexei's policy? Need to check w/ Alexei.
      • Can users define the output dataset destination (WISC) while having the jobs run somewhere else? We are currently not doing this, and it is not recommended, in order to avoid wasting CPU resources when things go wrong during stage-out. Kaushik has requested that requests from Neng be automatically approved from the web request form.
      • Problems running multiple jobs within a task using prun.
      • What about large input datasets produced "locally"? These have to go into a GROUPDISK quota. No one is allowed to store locally produced data on ATLAS resources otherwise; this should be done through the DDM page.
      • What are the policies for usage of prun: running time, input/output file size limitations (usage of --extFile)?
      • What types of jobs should or shouldn't be run with prun?
      • Where can we find the policies?
        • For the moment the project is monitoring the use before declaring official policies. There are some built in limitations.
        • Pedro will look into dataset deletion which should be possible with LFC SE's. We also need to look into Xrootd SE's.
  • this meeting:
    • LFC timeout errors resolved - "secret fix": an http proxy server at BNL set up by Hiro (??)
    • The "could not import lfc" error is still unresolved (see the diagnostic sketch below).
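Since the two LFC symptoms above look similar from the user side, a minimal check along the following lines can help separate a missing or broken lfc python binding ("could not import lfc", typically a copysetup/PYTHONPATH issue) from an unreachable or timing-out LFC server ("failed to access LFC"). This is only a sketch: the host name and test path are placeholders, and the lfc_access() call is the assumed python wrapper of the C API.

  # Hedged diagnostic sketch (Python 2, as used by the grid middleware of the time).
  import os
  import sys

  os.environ.setdefault('LFC_HOST', 'lfc.example.org')   # placeholder host, set before importing lfc

  try:
      import lfc                                          # LFC python bindings
  except ImportError:
      print "lfc python bindings not importable (copysetup / PYTHONPATH problem?)"
      sys.exit(1)

  # lfc_access() is assumed to mirror POSIX access(); 0 means the path is reachable and readable.
  test_path = '/grid/atlas'                               # placeholder namespace path
  if lfc.lfc_access(test_path, os.R_OK) == 0:
      print "LFC reachable, %s is readable" % test_path
  else:
      print "LFC access failed (timeout or server-side error?)"
      sys.exit(2)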

Operations: DDM (Hiro)

  • last meeting:
    • Reported on the massive subscription exercise, ~1M files/day. Registration failures due to latency (RTT).
    • Discussion about hosting the Tier 0-Tier 1 export service.
    • BNL_GROUPDISK added to ToA.
    • AGLT2 DQ2 wasn't working for Tier 0 exports - fixed by Shawn.
    • SRM problems at UW
    • New site at UTA now online
    • LFC for OU - must be public. Fixed.
    • dq2-put problems reported.
    • New DDM monitor up and running (dq2ping); testing with a few sites. Test files can be cleaned up with srmrm (a hedged cleanup sketch follows this section). Plan to monitor all the disk areas except proddisk.
    • dCache workshop - can bypass namespace for reading files. New version of dccp to write to space token areas. Chimera should work better with multiple clients in comparison to pnfs. Nordugrid will do testing of this. New documentation coming, with more "how-to's". Can add commentary.
    • Another 10M transfer jobs planned - mid-Feb. During this phase there will be real throughput tests combined with the stress tests. And planning to include the Tier 2's.
    • Proxy delegation problem w/ FTS - the patch has been developed and is in the process of being released. Requires FTS 2.1. A back-port was done, though it is only operational on SL4 machines. We will need to plan the migration to this carefully.
  • this meeting:
    • AGLT2 problem - still recovering from the MSU network stack failure; should be up, test pilots are working.
    • UWISC down - still being fixed. There was a problem with a newly installed PROOF package, which killed an xrootd daemon.
    • UTA_SWT2 should be back online now - fully equipped with tokens and SRM; things should be straightened out.
    • Tier 3 support issue with Illinois - requiring effort, which is an issue.
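On the srmrm cleanup mentioned under "last meeting": a minimal sketch of how leftover dq2ping/load-test files could be removed from a list of SURLs is shown below. The srmrm invocation (one SURL per call) and the input file name are illustrative assumptions, not the actual mechanism of the new DDM monitor.

  # Hedged sketch: delete test files listed (one SURL per line) in a text file.
  # Assumes the SRM client command 'srmrm' is in PATH and accepts a single SURL argument.
  import subprocess
  import sys

  surl_list = sys.argv[1] if len(sys.argv) > 1 else 'test_file_surls.txt'   # placeholder file name

  for line in open(surl_list):
      surl = line.strip()
      if not surl or surl.startswith('#'):
          continue
      rc = subprocess.call(['srmrm', surl])               # one SURL per call keeps failures isolated
      print '%s %s' % ('removed' if rc == 0 else 'FAILED (rc=%d)' % rc, surl)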

Storage validation

  • See new task StorageValidation
  • last week:
    • Armen: there are on-going discussions to systematically describe the policy for bookkeeping and clean-ups, beyond emails to individuals. Follow-up in two weeks.
    • Have been using dq2site-cleanse in the past couple of weeks. Has exposed problems with the lfc python bindings.
    • Have developed a workaround for administrative purposes:
      • http://repo.mwt2.org/viewvc/admin-scripts/lfc/proddisk-cleanse.py?view=markup
      • see email to ddm-l
      • optimized for proddisk structure; not to be used for cleaning DATADISK, MCDISK etc.
      • Wei: ran the previous dq2site-cleanse at SLAC without problems - removing 30 TB consistently.
      • Pedro warns against multiple threads, based on experience from other clouds; will consult with Vincent.
      • There is a problem with ACLs perhaps - from an older version of the pilot. Follow-up offline.
  • this week:
    • proddisk-cleanse running now at OU; small problem w/ AGLT2 (running on a host w/ no access to pnfs) - fixed. (An orientation sketch of this kind of cleanup follows this list.)
    • Wenjing will run a full cleanup at AGLT2 today.
    • AGLT2 - what about MCDISK (now at 60 TB, 66 TB allocated)? These are central subscriptions - should be AODs. Does the estimate need revision? Kaushik will follow-up.
    • Need a tool for examining token capacities and allocations. Hiro working on this.
    • Armen - a tool will be supplied to list obsolete datasets. Have been analyzing BNL - Hiro has a monitoring tool under development. Will delete obsolete datasets from Tier 2's too.
    • ADC operations does not delete data in the US cloud - only functional tests and temporary datasets. Should we revisit this? We don't know what the deletion policy is, but we'd like to off-load to central operations as appropriate.
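The real cleanup tool is the proddisk-cleanse.py linked earlier in this section; purely for orientation, here is a stripped-down sketch of one way such a PRODDISK cleanup pass could be structured (look up replicas in LFC, delete sufficiently old ones from the SE, then unregister them). The lfc binding calls (lfc_getreplica, lfc_delreplica), the replica fields, the GUID list file and the 30-day cutoff are assumptions for illustration, not the behavior of the actual script.

  # Orientation-only sketch of a PRODDISK cleanup pass; not the real proddisk-cleanse.py.
  import os
  import subprocess
  import time

  os.environ.setdefault('LFC_HOST', 'lfc.example.org')    # placeholder host
  import lfc                                              # LFC python bindings (assumed API below)

  CUTOFF = time.time() - 30 * 86400                       # e.g. keep anything touched in the last 30 days

  # guids_to_check.txt: one GUID per line, harvested from proddisk bookkeeping (placeholder input)
  for guid in (l.strip() for l in open('guids_to_check.txt') if l.strip()):
      rc, replicas = lfc.lfc_getreplica('', guid, '')     # assumed wrapper: look up replicas by GUID
      if rc != 0 or not replicas:
          continue
      for rep in replicas:
          if rep.atime >= CUTOFF:                         # recently accessed, keep it
              continue
          if subprocess.call(['srmrm', rep.sfn]) == 0:    # physical delete on the SE first
              lfc.lfc_delreplica(guid, None, rep.sfn)     # then unregister the replica from LFC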

VDT Bestman-Xrootd

  • BestMan
  • last week
    • Doug: installing at BNL (gateway only) and Duke (both gateway and xrootd back-end) from VDT.
    • Patrick: installed BM-gateway for UTA_SWT2. Will forward suggestions to osg-storage. There is also some work on SRM space monitoring that needs to be tracked.
  • this week
    • Patrick: ibrix issue being addressed by osg-storage, to allow lcg-cp to work properly through their sudo interface. Wei: Alex has a new release which addressed this as well as space availability monitoring.
    • Wei: followed Tanya's document to install bm-xrd system; almost everything works.
    • Doug: waiting on hardware at BNL. Will report next week.

Throughput Initiative (Shawn)

  • Notes from meeting this week:
USATLAS Throughput Call Notes - February 3, 2009
Attending: Shawn, Neng, Hiro, Jay, Wei, Horst, Karthik, Sarah, Charles

Recent network testing discussion (All):
- OU: had successful testing.
- UC: yes, new peering to BNL in place. Need to re-test.
- AGLT2: got testing access to a 10GE machine at StarLight today. Plan to test tomorrow.

Continuing update on testing status and plans (Hiro):
- Nothing new. Still have to meet the goal of 1 GB/sec. This will likely depend on finding the current problems in our network paths and end systems before we can achieve that goal.

MonALISA update (new server/service status) (Jay):
- Service operational but no (re)configuration yet. Need to determine what type of tests would be useful to reimplement.

IU throughput issue - need to define the problem and produce a plan to resolve it (Sarah/Shawn):
- Original problem was oscillating (SRM transfer) up and down rates. Now using GridFTP transfers. Iperf didn't show the problem when it was occurring. Need to watch for this problem to recur.

Continue testing infrastructure discussion from last week (All):
- How to monitor throughput? Netflow for tracking ALL traffic between specific pairs of sites; perfSONAR for providing a baseline of what is happening on the network(s) between endpoints; regular "test production" transfers (monitored) to track end-to-end site-to-site throughput in a comparable way. Still the issue of FTS vs. PandaMover (no easy way to track both as a function of site). For the next meeting we need to determine a set of regular tests to be used for monitoring throughput.
- GOALS: 1) measure maximum throughput between BNL and the Tier 2s regularly, 2) track network-related changes via perfSONAR and possibly additional Iperf tests, 3) track end-to-end (disk-to-disk) production throughput between BNL and the Tier 2s regularly. (A hedged single-transfer probe sketch appears at the end of this section.)
- METAGOAL: to have sufficient information to be able to isolate throughput bottlenecks and new problems as they occur.

Site reports: status for renewed throughput testing (a rep from each site is asked to give a brief status report):
- Ran out of time. Next week the site reports will come first!

Meeting at the usual time next week. Two primary topics: 1) define plans for infrastructure monitoring, 2) discuss status/configuration/use of perfSONAR for USATLAS.
Please send along any corrections or edits to these notes.

  • last week:
    • Main focus is getting the perfSONAR infrastructure in place at all sites. Separate this from the host-level config issues on production resources. Feel it's important to track this and follow it over time.
    • The bwctl tool runs scheduled bandwidth tests between sites. Expect the testing to be light enough not to interfere with production.
  • this week:
    • continue to focus on monitoring
    • some sites working on local issues.
    • next week hope to have better info on monitoring.
    • GB/s milestone - delayed until site specific issues resolved.
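As a concrete illustration of the "regular test production transfers" idea in the notes above, a periodic disk-to-disk probe could be as simple as timing one fixed-size GridFTP copy and logging the rate, as sketched below. The host names, paths, file size and the plain globus-url-copy invocation are placeholders; this is only a sketch of the measurement, not the monitoring framework under discussion.

  # Hedged sketch of a cron-driven disk-to-disk throughput probe: time one fixed-size copy, log MB/s.
  import subprocess
  import time

  SRC = 'gsiftp://gridftp.tier2.example.edu/data/throughput/testfile_1GB'   # placeholder source
  DST = 'gsiftp://gridftp.bnl.example.gov/data/throughput/testfile_1GB'     # placeholder destination
  SIZE_MB = 1024.0                                                          # size of the test file

  start = time.time()
  rc = subprocess.call(['globus-url-copy', SRC, DST])
  elapsed = time.time() - start

  stamp = time.strftime('%Y-%m-%d %H:%M:%S')
  if rc == 0:
      print '%s %.1f MB/s (%.0f s for %.0f MB)' % (stamp, SIZE_MB / elapsed, elapsed, SIZE_MB)
  else:
      print '%s transfer FAILED (rc=%d)' % (stamp, rc)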

Pathena & Tier-3 (Doug B)

  • Meeting this week to discuss options for a lightweight panda at tier 3 - Doug, Torre, Marco, Rob
  • Local pilot submission, no external data transfers
  • Needs http interface for LFC
  • Common output space at the site
  • Run locally - from pilots to the panda server. The Tier 3 would need to be in Tiers of ATLAS (needs to be understood)
  • No OSG CE required
  • Need a working group of the Tier 3's to discuss these issues in detail.

Site news and issues (all sites)

  • T1:
    • last week: there will be a short LAN outage around noon (multicast forwarding issue to be studied on the backbone switch). This will be about 2 hours. On March 24 there will be an 18-hour outage, though the impact on ATLAS should be small. Still bringing up the 1 PB of disk. Would like to get rid of the data on the distributed disk on compute nodes and retire it, ~100 TB. Over the course of February, upgrade dCache to v1.9. FTS upgrade as mentioned above. Direct links between BNL and sites - our network providers are very interested in helping here. Mike Conner w/ ESnet, in contact with Internet2. Expect links to BU and HU shortly. BNL performed well during the previous 10M DDM test.
    • this week:
      • Scheduled intervention on the core switch to study the Foundry switch, but no definitive results; this had an impact on Tier 1 operations last Thursday/Friday. Still working on bringing up the Thor storage systems. Install image prepared for quick installation. Working with John from Harvard on Frontier deployment for muon reconstruction.
  • AGLT2:
    • last week: running well since dCache database vacuuming started working correctly. The number of running jobs shown in Panda tends to exceed the number actually running. Wenjing is having trouble with the new proddisk cleaner script. GLPI tracking/ticketing/inventory system.
    • this week: network stack issue at MSU - a node crashes with a memory problem and the switch stops forwarding traffic; taking up the issue with Dell. Large number of lost-heartbeat jobs - looking into the pilot logs, error 10; peaks on the hour and large peaks every three hours.
  • NET2:
    • last week: Running production steadily since last week. Bringing up the new 336 TB of storage; bringing the HU site up - John working hard, working through local issues.
    • this week: Still running steadily. HU: working on setting up Frontier. Douglas says a local squid setup is needed. Need a user job to test. Starting to record findings - see SquidTier2.
  • MWT2:
    • last week: brought up new worker nodes at UC.
    • this week: ESnet peering w/ BNL in place, but local campus routes are still being worked on. Issue yesterday with the Panda config getting clobbered - fixed.
  • SWT2 (UTA):
    • last week: all well.
    • this week: SWT2_UTA up and running now. ToA issue tracked down w/ Hiro. Will start running proddisk-cleaner and ccc checker on
  • SWT2 (OU):
    • last week: All is well. Working on getting OSCER cluster back up.
    • this week: ALL OK. -- Well, we still need help with the OSCER cluster; something's still not working right. Paul is helping.
  • WT2:
    • last week: All is fine.
    • this week: Mostly okay. Had a problem with preemption on a subset of the cluster due to another experiment's jobs.

Carryover issues (any updates?)

Release installation via Pacballs + DMM (Xin, Fred)

  • last week:
    • Now working at BNL
    • There was a problem with proxy handling within the pilot. Fixed.
    • Now going through the sites and discovering new problems, e.g. usatlas2 permissions.
    • MWT2_IU, SWT2_UTA - worked all the way to the end, but ran into a permissions problem; the script needs to be changed.
    • There is a problem with the installation script (?).
    • Pacman / pacball version problems?
  • this week
    • All Tier 2's have run test jobs successfully. Still working on HU site.
    • A second type of job - transformation cache
    • KV validation now standard as part of job
    • Release installations are registered in Alessandro's portal
    • Expect full production in two weeks.

Squids and Frontier (Douglas S)

  • last meeting:
    • Things are tuned now so that access times can be measured.
    • Squid at SLAC is now working well with lots of short jobs. The cache is set up with 36 GB of disk for testing.
    • Will be working on jobs with different access patterns.
    • What's the plan in ATLAS generally? There are tests going on in Germany (Karlsruhe Tier 1). There is an upcoming workshop where this will be discussed. Also, a muon calibration group is looking into this (lots of data at the beginning of the job).
    • How do we try this out with a real reco/reprocessing job?
    • We need to make sure this can be extended to the Tier 2s. (A quick squid cache-verification sketch follows at the end of this section.)
  • this week:
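For sites bringing up a squid for Frontier (see the NET2 report above), a quick sanity check is to fetch the same URL twice through the proxy and confirm that squid reports a cache HIT on the second request, as sketched below. The proxy address and test URL are placeholders, and whether a HIT appears depends on the origin server marking the response cacheable; this is an illustration, not a prescribed test.

  # Hedged sketch: verify a squid proxy is caching by requesting the same URL twice
  # and inspecting squid's X-Cache response header (expect MISS, then HIT if cacheable).
  import urllib2

  PROXY = 'http://localhost:3128'                           # default squid port; placeholder
  URL = 'http://frontier.example.org/Frontier/testpayload'  # placeholder payload URL

  opener = urllib2.build_opener(urllib2.ProxyHandler({'http': PROXY}))

  for attempt in (1, 2):
      resp = opener.open(URL)
      resp.read()                                           # pull the payload so squid can store it
      xcache = resp.info().getheader('X-Cache', 'no X-Cache header')
      print 'request %d: %s' % (attempt, xcache)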

Local Site Mover


  • Direct notification of site issues from GGUS portal into RT, without manual intervention. Fred will follow-up.
  • Wei: questions about release 15 coming up - which platforms (SL4, SL5) and which gcc (3.5, 4.3)? Kaushik will develop a validation and migration plan for the production system and facility.

-- RobertGardner - 03 Feb 2009
