r3 - 09 Jan 2008 - 15:08:04 - RobertGardner

MinutesJan9

Introduction

Minutes of the Facilities Integration Program meeting, January 9, 2008
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Rob, Wei, Shawn, John Hover, Charles, John B, Horst, Karthik, Mark, Nurcan, Kaushik
  • Apologies: Wensheng

Integration program update (Rob, Michael)

  • Phase 4 plan
  • Phase 3 SiteCertificationP3 and Summary report
  • Overarching near term goals (previously December 15) are:
    • Establish 200 MB/s sustained throughput to all Tier2s
    • Establish analysis queues at all Tier2s
    • Replicate Rel 12 AODs to all Tier2s, for routine pathena analysis
    • Plus a number of other actions items in NotesTier2Nov30
      • Mark: Analysis queues: Mark will send a list of
      • Kaushik: AOD - release 13, working backwards in task definitions.
      • Mark: Prototype analysis task, on a site-by-site basis
      • Nurcan: standard SUSY plotting package.
      • SRM v2.2 testing, pinning - make a connection
  • FDR preparations
    • See Mark and Nurcan action items above.
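The 200 MB/s-per-Tier2 goal above invites a quick back-of-the-envelope aggregate. A minimal sketch, assuming a hypothetical count of five Tier2 sites (the actual number is not stated in these minutes):

```python
# Rough arithmetic for the 200 MB/s-per-Tier2 throughput goal.
# N_TIER2 = 5 is an illustrative assumption, not taken from the minutes.
TARGET_MBPS = 200                      # MB/s sustained to each Tier2
N_TIER2 = 5                            # hypothetical site count

aggregate = TARGET_MBPS * N_TIER2      # MB/s leaving BNL if all sites run at once
daily_tb = aggregate * 86400 / 1e6     # TB moved per day at that aggregate rate

print(f"aggregate: {aggregate} MB/s")  # 1000 MB/s
print(f"per day:   {daily_tb:.1f} TB") # 86.4 TB
```

At these assumed numbers, hitting the goal at every site simultaneously implies about 1 GB/s out of BNL, which is also the "burst rate at BNL" question raised in the throughput section.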

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Was running fine - main issue has been job shortage. Discussion w/ Ian, Pavel, Alex Reed - new MC coordinator.
    • Network problem at UTA, causing problems for Condor-G; Mark following up w/ UTA networking.
    • Michigan problem understood - there was one bad node, black hole. Condor scheduler is being used.
  • Production shift report (Nurcan/Mark)
    • Major problem is just lack of jobs - not sure if Eowyn is a problem. Probably not.
    • Will check pilot submission from Condor-G.
    • All sites doing well.
    • M5 reconstruction datasets - failed, Yuri is debugging.

SRM v2.2 and pinning (Gabriele)

  • Pinning functionality for T1D0-class data; a period of time is associated with each pin. How is this integrated w/ DDM?
  • Another initiative to do pinning within dCache using bring-online requests. A separate set of pools is being used.
    • Not sure what the plans are here - it involves DQ2.
    • Action item - report back with an update from Miguel
  • Kaushik: irrespective of SRM v2.2, pinning needs to be provided.
  • For production, we are essentially ready to pin input files.
  • Kaushik will provide a list to Gabriele. Selected RDOs and ESDs. Proactively manage disk space at BNL. User analysis files are taken care of.
  • Activity within OSG - participate.
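To make the pinning discussion above concrete, here is a minimal sketch of the bookkeeping idea only: a pin is a file identifier plus an expiry time. All names here are hypothetical; this is not the SRM v2.2 or dCache interface.

```python
import time

# Hypothetical illustration of pin bookkeeping: a pin is (file id, expiry).
# This is NOT the SRM v2.2 or dCache API - just the concept discussed above:
# an input file stays on disk for an associated period of time.
class PinRegistry:
    def __init__(self):
        self._pins = {}                              # file id -> expiry timestamp

    def pin(self, guid, lifetime_s, now=None):
        """Record that guid should stay on disk for lifetime_s seconds."""
        now = time.time() if now is None else now
        self._pins[guid] = now + lifetime_s

    def is_pinned(self, guid, now=None):
        """True while the pin's lifetime has not yet expired."""
        now = time.time() if now is None else now
        return self._pins.get(guid, 0) > now

reg = PinRegistry()
reg.pin("file-001", lifetime_s=3600, now=0)
print(reg.is_pinned("file-001", now=100))    # True: within the lifetime
print(reg.is_pinned("file-001", now=7200))   # False: pin expired
```

In the real system the storage element enforces the lifetime; the sketch only shows why a pin request naturally carries a duration, which is what has to be agreed with DDM.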

LFC (John)

  • Instance setup at BNL w/ RHEL4, not hard to install, MySQL backend
  • Need to work w/ Panda team about how to use it.
  • Can the development Panda instance be used? (Panda already using LFC in Canada, Europe)
  • Kaushik: can run test jobs. Setup a test site in Panda.
  • Steps
    • Setup panda test site (Mark Sosebee)
    • Setup in autopilot (Torre)
    • Check w/ Tadashi
    • Action item - John will organize meeting and will discuss with Mark

Operations: DDM (Alexei)

Analysis Queues (Bob, Mark)

Accounting (Shawn, Rob)

Summary of existing accounting issues.
  • See: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier2_view.html
  • Follow-up from last meeting:
    • SWT2_UTA still being addressed. Need VORS registration - not yet. Hopefully this week, may take until January.
      • Still not in VORS correctly. Will get fixed this week.
    • BNL accounting info was lost - Xin investigating. There was confusion on the WLCG APEL site - having to do with the change in the Gratia site name - they appear to have static mappings. Xin still investigating.
      • Being sorted out by Xin. Converging, hope to have correct accounting by end of week.
    • John W is still clarifying w/ EGEE people on naming convention. Xin will continue to push the issue w/ John Weigand. Michael will push this with Ruth.
  • Schedule a phone call w/ Sue to get the US Facility view available. Not done - need to follow-up (Rob).

Throughput initiative - status (Shawn)

  • See notes from meeting this week, MinutesTPJan7
  • Summary of actions from Shawn
  • Demonstrated 600 MB/s sustained over a long period last weekend.
  • A problem w/ files > 2 GB - Jay and Hiro investigating
  • Individual site testing - testing at UC in progress
  • Copying from /dev/zero results in 50 MB/s.
  • Asking for documentation on doors at each site. Will create a page in the twiki for this. Some of this is sensitive.
  • Next site: SLAC or IU
  • Future goals - demonstrate that from BNL we can do all sites at 200 MB/s simultaneously; what is the burst rate at BNL?
  • Jay: graphing. At each Tier2, would be nice to have Cacti / Ganglia - for the aggregate.
  • Merged AODs may grow to > 2GB
  • Action item for Jay and Hiro
  • Next throughput meeting on Monday.
  • US LHC Net meeting - Shawn will present ATLAS networking there - please send comments.
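The ~50 MB/s local baseline mentioned above can be reproduced with a simple timed write. This is a minimal sketch with illustrative sizes, not the actual test procedure used by the throughput group:

```python
import os
import tempfile
import time

# Hypothetical local write baseline: stream zeros into a scratch file and
# time it. Sizes are illustrative; point the temp file at the filesystem
# under test to get a meaningful number.
CHUNK = b"\0" * (1024 * 1024)   # 1 MiB of zeros per write
N = 64                          # total of 64 MiB

fd, path = tempfile.mkstemp()
t0 = time.perf_counter()
with os.fdopen(fd, "wb") as f:
    for _ in range(N):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())        # force the data to disk before stopping the clock
elapsed = time.perf_counter() - t0
os.remove(path)                 # clean up the scratch file

print(f"wrote {N} MiB in {elapsed:.2f}s -> {N / elapsed:.0f} MB/s")
```

A disk baseline like this helps separate storage-side limits from network limits before blaming the WAN path.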

Panda release installation jobs (Xin)

  • Testing a new submit host
  • Two issues:
    • need to clarify w/ Tadashi the different kinds of installation jobs
    • permission issue with DQ2. Need a convention/requirements for usatlas2 write access;
      • Not a problem: group permissions are set for usatlas1, usatlas2, usatlas3, usatlas4
      • All sites agree this isn't a problem
  • Next steps:
    • Will check installs at SLAC and BNL
    • Following week

OSG

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Follow-up on meeting w/ OSG: MinutesRSVNagiosDec21
  • Split of Nagios server into internal and external - still working on this. Work has now started.
  • Wisconsin LRC problem
  • Increase the timeouts for MWT2
  • RSV publishing to WLCG
    • Starting to publish this now
    • Dantong will follow-up

Site news and issues (all sites)

  • T1: Getting prepared for FDR; discussed how to distribute data. Rearrange space usage in dCache. 210 TB, 20-30 to be used for data in front of HPSS. Retiring some worker nodes, so will need to figure out where data goes. Pinning system starting to give good results (3 files out of 200K lost).
  • AGLT2: Running well recently - up to 850 jobs. Issues with Dell switches, upgraded firmware. Lessons: switches do not work as documented, Dell support lagging. The switches are stacked, and had to take the
  • NET2: no report
  • MWT2: We had a downtime at UC on Sunday, gatekeeper slowness being investigated. http://www.mwt2.org/sys/gatekeeper
  • SWT2_UTA: new cluster almost online (will have 75 TB online) - test jobs running. Running xrootd, running a gridftp door. Will add srm doors w/ new purchase. (Note need spare machines for PROOF clusters.) Finalizing purchase for next round - online in March (capacity 240 TB disk, >400 cores, ~100 servers).
  • SWT2_OU: all is well. Problems w/ motherboard on gridftp server (keeps crashing w/ dropped packets, not understood). Working on 10G upgrade.
  • WT2: using SRM load balancer from Bestman, working well. 2 new gridftp servers. External network - working on 10G network. Problem is identifying power for this - and Ganglia monitoring (for external viewing). Hardware - CPUs from purchase arrived (34 machines - 272 cores). Installing these now - expect them to come online by end of the month. Storage - not clear what the situation is given funding situation.

RT Queues and pending issues (Tomasz)

Carryover action items

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • none

-- RobertGardner - 08 Jan 2008
