
MinutesFeb11

Introduction

Minutes of the Facilities Integration Program meeting, Feb 11, 2009
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Kaushik, Charles, Rob, Sarah, Fred, Saul, Doug, Shawn, Patrick, Wei, Torre, Armen, Bob, Pedro, Xin, Hiro
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

Operations overview: Production (Kaushik)

  • last meeting(s):
    • Working on job submission to HU. Problems at BU - perhaps missing files. John will work through the issues w/ Mark offline.
    • Pilot queue data misloaded when the scheddb server is not reachable; gass_cache abused. Mark will follow up with Paul. (carryover)
    • Retries for transferring files & job recovery - pilot option. Kaushik will follow up with Paul.
  • this week:
    • running quite well
    • There were some pilot problems yesterday when introducing the adler32 changes at SWT2 and UTA: the checksums stored in the LFC are wrong, so the sites were set offline (see the checksum sketch after this list).
    • backlog of transfers to BNL - across several Tier 2 sites
    • End of month - reprocessing, and DDM stress test
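
Since the adler32 problem above (and the md5sum/adler32 mix-up noted in the shifters report below) comes down to comparing a locally computed checksum against the value registered in the LFC, here is a minimal Python sketch of how one might recompute a file's adler32 for such a comparison. The zero-padded 8-hex-digit string format is an assumption about the catalogue convention; this is not taken from the pilot code.

# Minimal sketch: recompute a file's adler32 locally so it can be compared
# against the checksum registered in the LFC. Reads in chunks so large data
# files are not loaded into memory. The zero-padded 8-hex-digit format is an
# assumption about how the catalogue stores the value.
import zlib

def file_adler32(path, blocksize=1024 * 1024):
    checksum = 1  # adler32 seed value
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            checksum = zlib.adler32(block, checksum)
    # Mask to 32 bits and format as 8 lowercase hex digits.
    return "%08x" % (checksum & 0xFFFFFFFF)

if __name__ == "__main__":
    import sys
    print(file_adler32(sys.argv[1]))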

Shifters report (Mark)

  • Distributed Computing Operations Meetings
  • last meeting:
    • Networking issue at BNL - temporary outage of queues
    • Noticed job evictions at SLAC due to wall-time limits
    • SRM issue at AGLT2 - fixed
    • UTD - test jobs succeeding now; write permissions for panda mover files
    • UTA_SWT2 - back online for production after major software production
    • 10K jobs are in the transferring state - is this normal?
  • this meeting:
    • report below
    • md5sums were overriding the adler32 checksums discussed above
    • old temp directories not getting cleaned up correctly at BNL - Paul fixed
    • UTD - working on bringing this back up. Need to re-install transformations.
    • Late Saturday night/Sunday there was a lack of pilots - not sure what alleviated the problem.
    • all sites have been migrated to new condor submit hosts
    • Missing libraries - corrupt release at OU: 14.2.0; Xin will follow up.
    • Large number of checksum errors at MWT2 - caused by a central subscription; trouble ticket submitted.
 
 US-CA Panda-production shift report (Feb.2-9, 2009)
___________________________________________________________

I. General summary:
  ---------------

During the past week (Feb.2-9) the Panda production service in the U.S. and Canada:

- completed successfully 217,320 managed MC production and validation jobs
- average ~31,046 jobs per day
- failed 29,601 jobs
- average job success rate ~88.23%.
- 75 (68+7) active tasks run in U.S. and Canada (validation,mc08,etc.)

II. Site and FT/DDM related interruptions/issues/news.
  -------------------------------------------------

1) Mon Feb 2. SLACXDR failures: a number of problematic WNs (NFS
problems, lost heartbeat, etc.) detected and reported to sysadm.
Resolved. Elog 2941.

2) Mon Feb 2. NIKHEF-ELPROD. >100 jobs failed with a get-error.
May be related to the SARA entry in BDII, or some glitch after
the downtime. Elog 2942.

3) Mon Feb 2. Stageout failures (put error) at ITEP, missing
directory. 100% failures, 36 in the last 6 hours. GGUS #45766.

4) Tue Feb 3. MWT2_UC back online after pilot/LFC problems
(the pilot was contacting the wrong host for LFC queries).
Fixed by Paul. After successful tests MWT2_UC and UC_ATLAS_MWT2
were set back online.

5) Tue Feb 3. Stage-in problems (srmcp timing out) at UKI-LT2-QMUL.
100% failure (28 jobs), site set offline, GGUS ticket 45819.

6) Tue Feb 3. More than 800 stage-out failures (~550 in the last
6 hours) at RAL and no jobs in transferring state. Comment appended to
GGUS #45818. Seems to be resolved.

7) Tue Feb 3. Jobs killed by signal 15 at PIC (>130, 58% failure
rate). Some other jobs have lost heartbeat. GGUS ticket 45820.

8) Fri Feb 6. Unscheduled downtime at UKI-SCOTGRID-GLASGOW.
Plenty of jobs failed (>3000 failures, 86%). The error
indicated an expired server certificate. Site set offline.
GGUS #46035.

9) Fri Feb 6. Stagein errors at UKI-NORTHGRID-LANCS-HEP,
probably gLiteWN is missing or misconfigured in nodes wnAAA.
Nodes nodeXXX work fine. GGUS #46034. Failures continued
on Sunday: 883 job failures, 98%. Site set offline. Elog #2104.

10) Fri Feb 6. CA|US clouds. Checksum mismatch in mc08.106321
dataset. The file and all the available replicas seem corrupted
(with a checksum different from DQ2). Elog #2091, 2093. Experts
and Alexei, as owner of the dataset, were informed on Saturday.
Savannah DDM RT ticket #46767 was submitted on Mon Feb 9. ~300 job
failures due to this at 5 different sites.

11) Fri Feb 6. AGLT2: a number of stage-out/stage-in errors.
RT ticket 11896. Multiple issues with dCache.
dCache on one machine was shut down while more disk trays
were added, and there were also some network interruptions. ~1500
job failures. Resolved on Sunday. But on Mon Feb 9 again ~100
job failures (stage-in/out, srm authentication). RT #11911.
Under investigation.

12) Sun Feb 8. Lost heartbeat jobs at UC_ATLAS_MWT2: 73% failure
rate. RT ticket 11903. Site set offline.

13) Sun Feb 8. Stagein errors at LIP-LISBON: 477 errors, 80%.
Site set offline. GGUS # 46042.

14) Sun Feb 8. A problem with pilot submission in the US cloud.
MWT2 and AGLT2 experienced intermittent pilot outages between
Saturday and Sunday. gridui10 & 12 seem to periodically mark as
down the queues they are submitting to. Elog 32107. Under
investigation.

15) Mon Feb 9. DE/FZK-LCG2 pilot: Get error: dccp get was timed
out after 18000 seconds. GGUS #46059.


III. ATLAS Validation and ADC Operation Support Savannah bug-reports:
--------------------------------------------------------------------

 -- 14.5.1.2 valid3 digi+reco task 39950 failures: ATH_JOP_NOTFOUND |
    IncludeError: include file SetJetConstants-02-000.py.
    Savannah bug #46576. All jobs failed after several attempts.
    Thanks to Borut the task was ABORTED. Reassigned as #46607. Understood:
    the 1st try with Reco_trf.py used the wrong syntax. Bug fix provided
    by David Rousseau. Bug closed.

 -- 14.5.1.2 valid1 digi+reco task 39910 failures: TRF_UNKNOWN | dummy
    version of propagation method with search of nearest surface called.
    Savannah bug #46711. Some jobs failed multiple attempts (up to 15!)
    at different sites due to this error. Task is still RUNNING
    (only 169 of 200 jobs finished). Similar bug #46543 for another
    14.5.2.1 valid1 task Ztautau_filter.recon exists.

Yuri

Analysis queues, FDR analysis (Nurcan)

  • Analysis shifters meeting on 1/26/09
  • last meeting:
    • LFC timeout errors at BNL - resolved.
  • this meeting:
    • TAG selection job failures
    • This week there were lots of problems reported by users submitting pathena jobs - actually a catalog issue with container datasets.
    • BNL queues now equally divided.
    • Brokering changing to distribute load to more Tier 2s
    • With http-lfc interface, can use a local condor submitter for analysis queues

Operations: DDM (Hiro)

  • last meeting:
    • New DDM monitor up and running (dq2ping); testing with a few sites. Can clean up test files with srmrm (a rough probe sketch follows at the end of this section). Plan to monitor all the disk areas, except proddisk.
    • Another 10M transfer jobs planned - mid-Feb. During this phase there will be real throughput tests combined with the stress tests. And planning to include the Tier 2's.
    • Proxy delegation problem w/ FTS - the patch has been developed and is in the process of being released. Requires FTS 2.1; a back-port was done, though it only runs on SL4 machines. We would need to carefully plan migrating to this.
    • AGLT2 problem - still recovering from the MSU network stack failure; should be up - test pilots are working.
    • UWISC down - still getting fixed. There was a problem with a newly installed PROOF package, which killed an xrootd daemon.
    • UTA_SWT2 should be back online now - fully equipped with tokens and SRM, should be straightened out
    • Tier 3 support issue with Illinois - requiring effort, which is an issue.
  • this meeting:
    • BNL_MCDISK has a problem - files are not being registered. A new DQ2 version coming at the end of the week will hopefully fix this.
    • BNL_PANDA - many datasets are still open. Is this an operations issue?
    • Pedro: there may be precision problems querying the DQ2 catalog. Will check creation date of the file.
    • Note - all clouds have jobs piling up in the red category.
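
As a companion to the dq2ping item above, a rough sketch (not the actual dq2ping code) of a site transfer probe: copy a small local test file to a site's SRM endpoint with srmcp, time it, then remove it with srmrm so that test files do not accumulate. The SURL, local path, and invocations below are placeholders/assumptions; real endpoints and client options differ per site.

# Rough sketch (not the actual dq2ping code) of a site transfer probe:
# srmcp a small test file to a placeholder SRM endpoint, report the elapsed
# time, then clean up with srmrm. Assumes the SRM client tools are in PATH
# and a valid grid proxy exists.
import subprocess
import time

LOCAL_FILE = "file:////tmp/dq2ping_testfile"   # srmcp's local-file URL form
TEST_SURL = ("srm://se.example.edu:8443/pnfs/example.edu/"
             "atlasscratchdisk/dq2ping/testfile")

def timed_call(cmd):
    """Run a command, returning (elapsed seconds, return code)."""
    start = time.time()
    rc = subprocess.call(cmd)
    return time.time() - start, rc

def probe_site():
    elapsed, rc = timed_call(["srmcp", LOCAL_FILE, TEST_SURL])
    status = "OK" if rc == 0 else "FAILED (rc=%d)" % rc
    print("copy %s in %.1f s" % (status, elapsed))
    # Always attempt cleanup, even if the copy failed part-way.
    timed_call(["srmrm", TEST_SURL])
    return rc == 0

if __name__ == "__main__":
    probe_site()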

Storage validation

  • See new task StorageValidation
  • last week:
    • Armen: there are on-going discussions to systematically describe the policy for bookkeeping and clean-ups, beyond emails to individuals. Follow up in two weeks.
    • proddisk-cleanse running now at OU; small problem w/ AGLT2 (running on host w/ no access to pnfs) - fixed.
    • Wenjing will run full cleanup at AGLT2 today.
    • AGLT2 - what about MCDISK (now at 60 TB, 66 TB allocated)? These subscriptions are central subscriptions - should be AODs. Does the estimate need revision? Kaushik will follow up.
    • Need a tool for examining token capacities and allocations. Hiro working on this.
    • Armen - a tool will be supplied to list obsolete datasets. Have been analyzing BNL - Hiro has a monitoring tool under development. Will delete obsolete datasets from Tier 2's too.
    • ADC operations does not delete data in the US cloud - only functional tests and temporary datasets. Should we revisit this? We don't know what the deletion policy is, but we'd like to off-load to central operations as appropriate.
  • this week:
    • Armen - still an on-going activity
    • proddisk-cleanse questions - may need a dedicated phone meeting to discuss space management; more tools becoming available (a toy space-token summary sketch follows below).
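
For the space-management discussion above, a toy illustration (not the tool Hiro is developing) of the kind of per-token report that keeps coming up: given used and allocated space for each token, print the fill fraction and flag tokens near their allocation. Only the MCDISK numbers come from these minutes; the other entries are made up.

# Toy illustration of a space-token usage report. Only the MCDISK numbers
# come from the minutes (60 TB used of 66 TB allocated); DATADISK and
# PRODDISK values below are invented for the example.
TOKENS_TB = {  # token name: (used TB, allocated TB)
    "MCDISK":   (60.0, 66.0),
    "DATADISK": (25.0, 100.0),
    "PRODDISK": (8.0, 10.0),
}

def report(tokens, warn_fraction=0.85):
    for name, (used, total) in sorted(tokens.items()):
        frac = used / total
        flag = "  <-- near allocation" if frac >= warn_fraction else ""
        print("%-10s %6.1f / %6.1f TB  (%5.1f%%)%s"
              % (name, used, total, 100 * frac, flag))

if __name__ == "__main__":
    report(TOKENS_TB)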

VDT Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Doug: installing at BNL (gateway only) and Duke (both gateway and xrootd back-end) from VDT.
    • Patrick: installed BM-gateway for UTA_SWT2. Will forward suggestions to osg-storage. There is also some work for SRM space monitoring that needs to be tracked.
    • Patrick: ibrix issue being addressed by osg-storage, to allow lcg-cp to work properly through their sudo interface. Wei: Alex has a new release which addressed this as well as space availability monitoring.
    • Wei: followed Tanya's document to install bm-xrd system; almost everything works.
    • Doug: waiting on hardware at BNL. Will report next week.
  • this week
    • Horst - has installed the newest version of bm-gateway, version i2.
    • Wei - there are new features for space querying, installed on the production service (version 2.2.1.2.i2).
    • Armen - note this is the version which monitors space tokens (usage, available)
    • Doug - there are problems with the instructions; send them to osg-storage.
    • Horst is having difficulty posting to osg-storage.

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  
USATLAS Throughput Call Notes for February 10th, 2009
              ===========================================
 
Attending: Shawn, Jason, Rich, Charles, Karthik, Hiro, John, Aaron, Rob, Neng, Brian
 
Topic: perfSONAR for USATLAS, led by Rich Carlson. John Bigrow has set up monitoring for USATLAS to/from BNL. Many sites included. Still waiting on MWT2_IU, NET2 and SLAC. Jay working on monitoring visualization. perfSONAR developers are involved. John asked about a Cacti interface. How can USATLAS Tier-2’s/Tier-3’s configure perfSONAR to be more useful? Brian described ESnet plans to deploy/configure perfSONAR to monitor 25-30 ESnet sites. Lots of discussion about perfSONAR as it exists and near-term plans. Focus on throughput usage for perfSONAR. Plan discussion at OSG All-hands meeting? Sites should look at http://code.google.com/p/perfsonar-ps/wiki/NPToolkitQuickStart . Discussions about a USATLAS view showing all ‘USATLAS perfSONAR Installations’ and their status. Additional discussion about perfSONAR improvements desired:
 
a)    Have a more useful (from the “USATLAS” user viewpoint) front page which quickly shows you the current “USATLAS” testing results
b)    Have a link which shows all “registered” USATLAS perfSONAR instances and their current up/down status
c)    Provide configuration or plugins for things like syslog/syslog-ng, Cacti, Nagios to help monitor the status of the perfSONAR instance/hardware. 
d)    Could email be sent on critical problems?
 
Big action item for USATLAS sites is to configure Tier-2/Tier-3 scheduled tests.
 
We would like a volunteer Tier-2 or Tier-3 who would be willing to do the following:
 
1)    Read through the URL above about setting up testing
2)    Using their perfSONAR infrastructure, set up testing to BNL (Tier-1) and the other USATLAS sites
3)    Document their steps and problems on a web page for other USATLAS sites to follow once it is working

I assume we can get good support from the perfSONAR folks to do this!
 
John Bigrow will verify the UC perfSONAR boxes are part of his BNL scheduled tests (and add them if not).   This is important to have running now, in advance of tomorrow’s circuit cut-over.   Rich will send the URL to see the current tests.
 
Our perfSONAR discussion took the whole timeslot so we have postponed the discussion about needed USATLAS monitoring/tests until next week.
 
Please send along corrections or additions to these notes, especially relevant URLs I may have missed.
 
Shawn
 

  • last week:
    • Main focus is getting the perfSONAR infrastructure at all sites. Separate this from the host-level config issues on production resources. Feel it's important to track this and follow it over time.
    • The bwctl program runs scheduled bandwidth tests between sites. Expect the testing to be light enough not to interfere with production.
    • continue to focus on monitoring
    • some sites working on local issues.
    • next week hope to have better info on monitoring.
    • GB/s milestone - delayed until site-specific issues are resolved.
  • this week:
    • Meeting focused mainly on perfsonar
    • perfSONAR is not intuitively useful the way we have it deployed
    • The web interface is not immediately useful; logging to site-level systems (nagios, syslog-ng) could be set up (a minimal status-poll sketch follows after this list).
    • Need to setup tests between sites.
    • New UC & BNL circuit now in place.
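
As a minimal sketch of the up/down status view asked for in item (b) of the notes above, and referenced in the list just above, the following polls each registered perfSONAR host's web interface and reports whether it answers. The hostnames are placeholders, and the assumption that each toolkit answers plain HTTP on the default port may not hold for every deployment; output like this could be fed into nagios or syslog-ng.

# Minimal sketch: poll perfSONAR hosts' web interfaces and report up/down.
# Hostnames are placeholders; the port/path served by a given toolkit
# installation is an assumption and may differ per site.
import urllib.request

PERFSONAR_HOSTS = [
    "ps-latency.tier2a.example.edu",
    "ps-bandwidth.tier2a.example.edu",
]

def is_up(host, timeout=10):
    try:
        urllib.request.urlopen("http://%s/" % host, timeout=timeout)
        return True
    except Exception:
        return False

if __name__ == "__main__":
    for host in PERFSONAR_HOSTS:
        print("%-35s %s" % (host, "UP" if is_up(host) else "DOWN"))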

Site news and issues (all sites)

  • T1:
    • last week: Scheduled intervention of the core switch to study the Foundry switch, but no definitive results; had an impact on Tier 1 operations last Thursday/Friday. Still working on bringing up the Thor storage systems. Install image prepared for quick installation. Working with John from Harvard on Frontier deployment for muon reconstruction.
    • this week: no report

  • AGLT2:
    • last week: network stack issue at MSU. A node crashes w/ a memory problem and the switch stops forwarding traffic. Taking up the issue with Dell. Large number of lost-heartbeat jobs - looking into the pilot logs; error 10 peaks on the hour, with large peaks every three hours.
    • this week: migrating dcache files off compute nodes to large servers; trouble bringing a pool online - possible migration side effect - solved (increased Java memory). Wenjing looking into database configurations. Large transfer backlog - probably not a local problem. dcache version 1.8-11-15. BNL is upgrading to 1.9.

  • NET2:
    • last week: Still running steadily. HU: working on setting up Frontier. Douglas says a local squid setup is needed. Need a user job to test. Start recording findings - see SquidTier2.
    • this week: Still working on storage. John: HU functioning okay. Have a problem w/ high gatekeeper loads. Xin notes an install job has been running for over a day.

  • MWT2:
    • last week: ESnet peering w/ BNL is in place, but local campus routes are still being worked on. Issue yesterday with the Panda config getting clobbered - fixed.
    • this week: One day downtime tomorrow for dCache upgrade. BNL-UC circuit established today.

  • SWT2 (UTA):
    • last week: SWT2_UTA up and running now. ToA issue tracked down w/ Hiro. Will start running proddisk-cleaner and ccc checker on
    • this week: Adler32 issue

  • SWT2 (OU):
    • last week: ALL OK. -- Well, we still need help with the OSCER cluster, something's still not working right. Paul is helping.
    • this week: Install issue; still not much progress with OSCER - will need to consult with Paul.

  • WT2:
    • last week: Mostly okay. Had problem with preemption on subset of cluster due to another experiment's jobs.
    • this week: GUMS server HD failed. Installed backup GUMS.

Carryover issues (any updates?)

Pathena & Tier-3 (Doug B)

  • Last week:
    • Meeting this week to discuss options for a lightweight panda at tier 3 - Doug, Torre, Marco, Rob
    • Local pilot submission, no external data transfers
    • Needs http interface for LFC
    • Common output space at the site
    • Run locally - from pilots to panda server. Tier 3 would need to be in Tiers of Atlas (needs to be understood)
    • No OSG CE required
    • Need a working group of the Tier 3's to discuss these issues in detail.
  • this week
    • http-LFC interface: Charles has developed a proof-of-concept setup. Pedro has indicated willingness to help - he will pass on knowledge of the Apache configuration and implement Oracle features. (A toy sketch of the lookup idea follows below.)
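
A toy sketch of the idea behind the http-LFC interface, not Charles' actual proof of concept: answer LFN-to-replica lookups over plain HTTP so a client without LFC libraries can resolve files. The URL scheme, port, and in-memory catalogue here are invented for illustration; the real setup would sit behind Apache and query the LFC/Oracle backend.

# Toy sketch (not the actual proof-of-concept): serve LFN -> replica SURL
# lookups over plain HTTP. The in-memory dictionary stands in for the real
# LFC/Oracle backend; the /replicas?lfn=... URL scheme is invented here.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

CATALOG = {  # placeholder catalogue: LFN -> list of replica SURLs
    "/grid/atlas/example/AOD.pool.root": [
        "srm://se.example.edu:8443/pnfs/example.edu/atlasdata/AOD.pool.root",
    ],
}

class LFCLookupHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        lfn = query.get("lfn", [""])[0]
        replicas = CATALOG.get(lfn)
        if replicas is None:
            self.send_response(404)
            self.end_headers()
            return
        body = ("\n".join(replicas) + "\n").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # e.g. curl 'http://localhost:8085/replicas?lfn=/grid/atlas/example/AOD.pool.root'
    HTTPServer(("", 8085), LFCLookupHandler).serve_forever()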

Release installation via Pacballs + DMM (Xin, Fred)

  • last week:
    • All Tier 2's have run test jobs successfully. Still working on HU site.
    • A second type of job - transformation cache
    • KV validation now standard as part of job
    • Release installations are registered in Alessandro's portal
    • Expect full production in two weeks.
  • this week
    • Next week - full production. Discussing with Alessandro switching the portal from development to production. Also, the code is not yet checked in, and existing releases still need to be published.

Squids and Frontier (Douglas S)

  • last meeting:
    • Harvard examining use of Squid for muon calibrations (John B)
    • There is a twiki page, SquidTier2, to organize work at the Tier-2 level
  • this week:
    • Douglas requesting help with real applications for testing Squid/Frontier
    • Some related discussions this morning at the database deployment meeting here.
    • Fred in touch w/ John Stefano.
    • AGLT2 tests - 130 simultaneous (short) jobs. Looks like a 6x speed-up. Doing tests without squid.
    • Wei - what is the squid cache refreshing policy? (A header-inspection sketch follows after this list.)
    • John - BNL, BU conference
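
On Wei's question about the squid cache refreshing policy: in a Frontier/squid setup, how long a cached response is reused is governed by the HTTP caching headers the server sends (Expires, Cache-Control max-age) together with squid's own refresh rules, so one practical check is simply to look at what a given server/proxy pair actually advertises. A small sketch follows; the proxy address and query URL are placeholders, not a real Frontier endpoint.

# Small sketch: fetch a URL through a squid proxy and print the headers that
# govern how long the response may be reused (plus squid's X-Cache hit/miss
# tag, if present). The proxy address and target URL are placeholders.
import urllib.request

PROXY = "http://squid.example.edu:3128"
URL = "http://frontier.example.edu:8000/Frontier/some/query"

def show_cache_headers(url, proxy):
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    response = opener.open(url, timeout=30)
    for header in ("Expires", "Cache-Control", "Age", "Last-Modified", "X-Cache"):
        print("%-15s %s" % (header, response.headers.get(header, "(not set)")))

if __name__ == "__main__":
    show_cache_headers(URL, PROXY)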

Local Site Mover

AOB

  • Direct notification of site issues from the GGUS portal into RT, without manual intervention. Fred will follow up - next week.
  • Wei: questions about release 15 coming up - which platforms (SL4, SL5) and which gcc (3.4 or 4.3). Kaushik will develop a validation and migration plan for the production system and facility - will follow up.


-- RobertGardner - 10 Feb 2009
