
MinutesJan14

Introduction

Minutes of the Facilities Integration Program meeting, Jan 14, 2009
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Doug, Rob, Charles, Saul, Fred, Horst, Karthik, Michael, Rich, Patrick, Xin, Jim, Tom, Wei, Armen, Alden, Mark, John, Sarah
  • Apologies: Nurcan
  • Guests:

Integration program update (Rob, Michael)

Operations overview: Production (Mark)

  • last meeting:
    • We ran reprocessing at scale, ATLAS-wide. In the US things were handled differently, and all Tier 2s participated. Panda mover staging of input files was handled automatically. The overall result has been excellent: 43% of the ~0.5 PB of data was processed in the US cloud, split 54% at the Tier 1 and 46% at the Tier 2s. Transform errors were the main source of the failure rate. Much of the facility infrastructure was stressed and performed remarkably well. Three of the Tier 1s fell short.
    • Should expect large MC loads to come in soon (the large 'holiday production' from before). Lots of tasks, but they're all small (merge tasks - DPDs and AODs).
    • Expect to see a large amount of reprocessed data coming into DATADISK, ~30 TB.
    • PRODDISK - can be cleaned up now.
    • AGLT2 - added 12 TB to PRODDISK yesterday. Having problems cleaning up.
  • this week:
    • Lots of reprocessing tasks in the US - failure rates are very high (job definition problems).
    • Filled up Monday/yesterday (>6K jobs), now ~drained.
    • Potential Condor-G scaling issues (job eviction) - there is a pilot submit host upgrade plan: upgrade to a newer version of Condor and evaluate; this version has the changes made by the Condor team to accommodate Panda requirements. E.g., Condor's strategy of completing a job no matter what is at odds with the Panda philosophy (we can lose pilots; there is no need to retry failed pilots).
    • Working on job submission to HU. Problems at BU - perhaps missing files. John will work the issues w/ Mark offline.
    • Pilot queue data is misloaded when the scheddb server is not reachable; the gass_cache is abused. Mark will follow up with Paul.

Shifters report

  • Distributed Computing Operations Meetings
  • last meeting:
    • Pilots failing at different sites - was it a network problem? Probably - multiple sites affected.
    • Hot-backup of the Panda server - did this contribute?
    • A secondary effect is that pilot config data couldn't be downloaded. Consult Paul.
    • Pandamover transfers were coming in slowly, which caused timeouts in the pilot. The timeout was increased to 4 hours as a workaround. Hiro was consulted - the transfer rate looked reasonable. Shawn will follow up.
  • this meeting:

Analysis queues, FDR analysis (Alden, for Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/AtlasProtected/PathenaAnalysisQueuesUScloud
  • Analysis shifters meeting on 1/12/09
  • last meeting:
    • The long queue at Brookhaven needed to be reconfigured after the LFC migration; Condor-G is now used. Pilots were not running due to a large number in the idle state - Xin cleaned this up.
    • The BNL short queue was running fewer jobs than expected; now 420 slots, balanced.
    • Armen created a table for online/offline status - see analysis dashboard.
    • TAG selection jobs after the LFC migration: Marco sent jobs that use back navigation, requiring direct reading. All sites pass except AGLT2 and MWT2.
    • The CERN instance of the Panda monitor is slow. The monitor is run at CERN with the database at BNL. Also, log files are not available.
    • Analysis activity ramping back up.
    • Michael - had an incident with lack of space for analysis jobs, causing lots of job failures. Need to consolidate space at BNL. Hiro will be deleting files.
  • this meeting:

Operations: DDM (Hiro)

  • last meeting:
    • Generally things worked fine over the break.
    • No major problems. AGLT2 transfers to BNL failing this morning? Investigating.
  • this meeting:
    • Hiro is at the dCache workshop.
    • A DDM stress test is currently running: 10M files, involving all Tier 1s. Datasets are subscribed from one Tier 1 to all the others. Small files, stressing the central components, local catalogs, and SRM. 14K files per hour (much higher than in production; see the rough estimate below). The anticipated plan is to include the Tier 2s - more info from Simone later. This validates the latest version of the DDM code.
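
For scale, a rough back-of-the-envelope estimate of the quoted rate (a hedged sketch: it assumes the 14K files/hour figure applies serially to the whole 10M-file sample, whereas the real test fans out from one Tier 1 to all the others in parallel):

    # Rough scale estimate for the DDM stress test described above.
    # Assumption (not from the minutes): 14K files/hour applied to all 10M files serially.
    total_files = 10_000_000
    rate_per_hour = 14_000

    hours = total_files / rate_per_hour
    print(f"~{hours:.0f} hours, i.e. ~{hours / 24:.0f} days at 14K files/hour")
    # prints: ~714 hours, i.e. ~30 days at 14K files/hour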

Storage validation

  • See new task StorageValidation
  • The current ccc.py script works for dCache, though the coupling is pretty weak (it uses a text-file dump from dCache). Dumps are still needed for xrootd and GPFS. Output is text.
  • It separates files into ghosts and orphans - further steps can then be taken to clean up (a hedged sketch follows after this list).
  • Strongly coupled to DQ2 and LFC.
  • Armen: there are on-going discussions to systematically describe the policy for bookkeeping and clean-ups, beyond emails to individuals. Follow up in two weeks.
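
Not part of the minutes: a minimal sketch of the ghost/orphan split described above, assuming one plain-text dump of the storage namespace and one dump of the files registered in the LFC for the same site. The file names and the convention used here (orphans = files on storage with no catalog entry, ghosts = catalog entries with no file on storage) are illustrative assumptions; ccc.py's actual logic and definitions may differ.

    # Minimal storage-vs-catalog consistency check, in the spirit of ccc.py.
    # Inputs are assumed to be plain-text dumps, one path per line.

    def load_paths(filename):
        """Read one path per line, ignoring blank lines and comments."""
        with open(filename) as f:
            return {line.strip() for line in f if line.strip() and not line.startswith("#")}

    storage = load_paths("storage_dump.txt")  # what is physically on the SE (hypothetical file name)
    catalog = load_paths("lfc_dump.txt")      # what the LFC says should be there (hypothetical file name)

    orphans = storage - catalog  # on disk but not in the catalog: candidates for deletion
    ghosts = catalog - storage   # in the catalog but missing from disk: broken replicas

    print(f"{len(orphans)} orphan files on storage, {len(ghosts)} ghost catalog entries")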

Space reporting (Tomasz)

  • last meeting:
    • Sites should be reporting via curl for each token. Some tokens are working, others are not.
    • Decide on hourly reporting, for all tokens.
    • Tom needs warning and critical values.
    • Are dCache sites being monitored via srm-get-metadata on a per-token basis by CERN? There is a website for this somewhere. This is currently not available in the BestMan SRM.
    • Wei will gather technical requirements from Armen and Hiro, will communicate with Alex.
    • dCache sites should check to see if the space reporting is accurate.
  • this meeting:
    • More questions than answers - the goal is to get this through the SRM interface, as the Tier 1 does. This works for dCache but not for BestMan SEs. Wei is tracking this.
    • A second option is an information system with a dynamic information provider in the storage management system.
    • A third way is curl-based reporting from the sites to the Panda monitor; a database probably needs updating (a hedged sketch follows after this list).
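
A hypothetical sketch of the curl-style per-token reporting mentioned above, written in Python for concreteness. The monitor URL and the parameter names are placeholders, not the actual Panda monitor interface; the hourly-per-token cadence discussed last meeting would sit around a call like this.

    # Hypothetical per-space-token usage report to a central monitor.
    # The endpoint and parameter names below are illustrative assumptions only.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    MONITOR_URL = "https://example.org/spacemon/report"  # placeholder, not the real monitor URL

    def report_token(site, token, total_gb, used_gb):
        """Send one space-token measurement; intended to run hourly for each token."""
        params = urlencode({"site": site, "token": token,
                            "total_gb": total_gb, "used_gb": used_gb})
        with urlopen(f"{MONITOR_URL}?{params}", timeout=30) as resp:
            return resp.status

    # Example (commented out): report usage for one space token at a Tier 2.
    # report_token("MWT2", "DATADISK", total_gb=200_000, used_gb=150_000)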

VDT Bestman-Xrootd

  • BestMan
  • Horst - basic installation of bm-gateway is done; the space token configuration still needs to be added. BestMan and the SRM interface are working. All tests look okay.
  • Doug and Ofer will be looking at this at BNL. Follow up in two weeks.

Throughput Initiative (Shawn)

  • Next week.

Site news and issues (all sites)

  • T1:
    • last week: Took delivery of 31 Thumpers - being installed now; 1 PB of capacity to be added. These were ordered through UCI. Negotiating with Dell for CPU. Tier 2 / Tier 1 connectivity - dedicated circuits to Starlight. Will be setting up a meeting between the PIs and ESnet/I2 people.
    • this week: the building addition is making good progress. Also working on the Condor-G submission system.
  • AGLT2:
    • last week: space issues - PRODDISK is filling with reprocessing data. Investigating why dq2sitecleanse doesn't work at the moment (38 TB currently in use). Charles is helping. Holding off on Lustre work for now, waiting for the 1.8 release; the plan is to migrate one space token there. A source preparation error is being tracked down.
    • this week: waiting for a testpilot job to complete at the moment - the problem was related to a zombie dCache process, caused by a problem with the NIC. Will clean up PRODDISK when Charles' new script becomes available. 400 job slots at MSU are ready to come online.
  • NET2:
    • last week: The muon calibration workshop went well. Top priority is bringing new storage online; total capacity will be 336+168 TB raw. The HU site has been down over the break. Will start ramping up to use 1000 cores. Networking performance between BU and HU is being studied. Need to check token-by-token reporting.
    • this week: Running analysis jobs, not production jobs. There was a problem yesterday resulting from the cleanse script with permissions on the LFC. Now running a script to clean up the LFC, but it's slow (Sarah has a faster version). Still waiting for word from Kaushik on cleaning up MCDISK and DATADISK. New storage: 336 TB raw, a new GPFS volume. Harvard: still working on firewall/proxy server issues for the worker nodes.
  • MWT2:
    • last week: brought up the first new Dell storage unit and the first compute node. Pilot failures at IU - Globus error 22; the home directory was cleaned up.
    • this week: brought up another Dell server, now up to ~200 TB. Mostly running smoothly. Marco is working on running TAG-based analysis jobs; three sites have configuration problems in scheddb.
  • SWT2 (UTA):
    • last week: srm failing during the break - restart fixed this.
    • this week: CPB is running smoothly. Working on the upgrade of UTA_SWT2; Ibrix has been re-installed. Will wrap up early next week.
  • SWT2 (OU):
    • last week: 100 useable TB once funding arrives.
    • this week: the gass_cache filled up again from usatlas jobs; thousands of directories accumulated. Working on getting the OSCER cluster back up.
  • WT2:
    • last week: the cooling upgrade is in progress. Found reprocessing jobs putting lots of stress on the NFS servers. Separated the ATLAS home directories and the ATLAS releases. dq2sitecleanse.py was attempted - some problems deleting entries from the LFC; consulting with Charles.
    • this week: The cooling outage is over, and everything is coming back. Ran into problems moving the ATLAS releases to a new NFS server. Xin: this is a well-known feature of Pacman; the releases will just need to be re-installed.

Carryover issues (any updates?)

Release installation via Pacballs + DMM (Xin, Fred)

  • Now working at BNL
  • There was a problem with proxy handling within the pilot. Fixed.
  • Now going through the sites and discovering new problems, e.g. usatlas2 permissions.
  • MWT2_IU, SWT2_UTA - worked all the way to the end, but ran into a permissions problem; the script needs to be changed.
  • There is a problem with the installation script.
  • Pacman / pacball version problems?

Squids and Frontier (Douglas)

  • Things are tuned now so that access times can be measured.
  • The Squid at SLAC is now working well with lots of short jobs. The cache is set up with 36 GB of disk for testing.
  • Will be working on jobs with different access patterns.
  • What's the plan in ATLAS generally? There are tests going on in Germany (the Karlsruhe Tier 1). There is an upcoming workshop where this will be discussed. Also, a muon calibration group is looking into this (lots of data read at the beginning of the job).
  • How to try out with a real reco/reprocessing job?
  • We need to make sure this can be extended to the Tier 2s.
  • Discuss at next week's workshop

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week:
    • What about direct reading of files? Not relevant - the mover is only invoked for local copies (a hedged sketch follows after this list).
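
Not part of the minutes: a minimal sketch of what a local-copy wrapper of this kind might look like, assuming the mover is a site-provided script that the pilot calls with a source and a destination and that reports success through its exit code. The command name (xrdcp), the argument layout, and the retry policy are assumptions for illustration; the LocalSiteMover specification linked above defines the real interface.

    # Hypothetical local site mover wrapper: copy a file from the SE to the
    # worker node and signal success/failure via the exit code. Only used for
    # local copies, as noted above; direct reading bypasses it.
    import subprocess
    import sys

    def lsm_get(source, destination, retries=2):
        """Copy source -> destination with a few retries; return 0 on success."""
        rc = 1
        for attempt in range(1, retries + 2):
            rc = subprocess.run(["xrdcp", "-f", source, destination]).returncode
            if rc == 0:
                return 0
            print(f"copy attempt {attempt} failed (rc={rc})", file=sys.stderr)
        return rc

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("usage: lsm-get <source> <destination>", file=sys.stderr)
            sys.exit(2)
        sys.exit(lsm_get(sys.argv[1], sys.argv[2]))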

AOB

  • None


-- RobertGardner - 13 Jan 2009
