
MinutesJan7

Introduction

Minutes of the Facilities Integration Program meeting, Jan 7, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Tom, Patrick, Rob, Charles, Michael, Fred, Saul, Douglas, Nurcan, Shawn, Doug Benjamin, Wei, Rich, Karthik, Kaushik, Armen, Bob
  • Apologies: none
  • Guests:

Integration program update (Rob, Michael)

Operations overview: Production (Kaushik)

  • last meeting:
    • http://panda.cern.ch/
    • US ATLAS ADC Operations workshop at BNL
    • Glitch with a cache for reprocessing - Xin has patched at all US sites.
    • Lots of evgen now running, start of large scale holiday production.
    • Discussions at BNL, planning for next year: cosmic data run in May, beam at the end of summer. Keeping sites full and busy - what steps should we take locally? Improvements to software for production and shifts. A summary of the meeting notes will be sent tomorrow.
    • Permissions for LFC - agreed to; sites do still need to be fixed.
    • What to do about user datasets? For the moment, only those with the production role will be able to delete (same as production).
    • DDM - core members of the dq2 team are leaving. Should we use Panda mover for output data? Under discussion.
    • Reprocessing discussion. Alexei is looking into data migration from Tier 2's outside the cloud. Condor-G issues.
    • Alexei: agreed to have a phone call to discuss Panda mover.
  • this week:
    • We ran re-processing at scale, ATLAS world-wide. In the US things were handled differently: all Tier 2's participated, and Panda mover staging of input files was handled automatically. The overall result has been excellent: 43% of the ~1/2 PB of data was processed in the US cloud, with 54% at the Tier 1 and 46% at the Tier 2s. Transform errors account for most of the failure rate. Much of the facility infrastructure was stressed and performed remarkably well. Three of the Tier 1's fell short.
    • Should expect large MC loads to come in soon (the large 'holiday production' from before). Lots of tasks, but they're all small (merge tasks - DPDs and AODs).
    • Expect to see a large volume of reprocessed data coming into DATADISK, ~30 TB.
    • PRODDISK - can be cleaned up now.
    • AGLT2 - added 12 TB to PRODDISK yesterday. Having problems cleaning up.

Shifters report

  • Distributed Computing Operations Meetings
  • last meeting:
    • ADC Shifters meeting
    • Have been running reprocessing tasks - there was a bad file from another cloud. Wensheng tracked it down and brought in a new copy. Files were replaced at the Tier 2s.
    • On-going issue at Harvard - stage-in problems. John has been contacted.
    • Problems reprocessing at Tier 2's when accessing the conditions database via COOL. Patrick: the local compute nodes have a domain not recognized by POOL for the database lookup. Since release 13.0.35 there is an environment variable to set the domain name. A couple of options: patch the ATLAS WN script, or have the pilots set the environment variable to the site's main gatekeeper name. The latter is now implemented in the pilot by Paul. This is an old problem, dating back to June. (A rough sketch of the pilot-side approach follows this list.)
    • Alexei - is Sasha in the loop? Yes.
    • SLAC gatekeeper issue. Authentication okay, but globus-job-run not working. Wei is out of town. Can Douglas help? Will look into it.
    • For a thorough summary, look at Yuri's summary.
  • this meeting:
    • Pilots failing at different sites - was it a network problem? Probably - multiple sites were affected.
    • Hot-backup of the Panda server - did this contribute?
    • A secondary effect was that pilot config data couldn't be downloaded. Paul.
    • Pandamover transfers are coming in slowly - this caused timeouts in the pilot. The timeout has been increased to 4 hours as a workaround. Hiro was consulted - the rate looked reasonable. Shawn will follow up.
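As a rough illustration of the conditions-database workaround discussed under "last meeting" above, here is a minimal pilot-side sketch in Python. It assumes a hypothetical environment variable name (ATLAS_CONDDB) and a hypothetical queue-data field ('gatekeeper'); neither name is confirmed by these minutes.

    # Hypothetical sketch of the pilot-side workaround: use the site's main
    # gatekeeper name for the conditions-DB domain lookup when the worker
    # node's own domain is not recognized. Names below are illustrative only.
    import os
    import socket

    def set_conddb_lookup_host(queue_data):
        gatekeeper = queue_data.get("gatekeeper") or socket.getfqdn()
        os.environ["ATLAS_CONDDB"] = gatekeeper  # assumed variable name
        return gatekeeper

    # e.g. on a worker node whose local domain is not recognized:
    set_conddb_lookup_host({"gatekeeper": "gate01.aglt2.org"})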

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/AtlasProtected/PathenaAnalysisQueuesUScloud
  • Analysis shifters
  • The long queue at Brookhaven needed to be reconfigured after the LFC migration; Condor-G is now used. Pilots were not running due to a large number in the idle state - Xin cleaned this up.
  • The BNL short queue was running fewer jobs than expected; it now has 420 slots and is balanced.
  • Armen created a table for online/offline status - see analysis dashboard.
  • TAG selection jobs after the LFC migration: Marco sent jobs that use back navigation, requiring direct reading. All sites pass except AGLT2 and MWT2.
  • The CERN instance of the Panda monitor is slow. The monitor runs at CERN while its database is at BNL. Also, logfiles are not available.
  • Analysis activity ramping back up.
  • Michael - there was an incident where lack of space for analysis jobs caused many job failures. Need to consolidate space at BNL. Hiro will be deleting files.

Operations: DDM (Hiro)

  • last meeting:
    • Alexei - all AODs are being replicated to all Tier 2s: 25 TB over the next two weeks. Which token area? DATADISK.
    • Hiro is testing a dataset replication monitoring program (dq2ping) - don't pay attention to it yet.
  • this meeting:
    • Generally things worked fine over the break.
    • No major problems. AGLT2 transfers to BNL failing this morning?
    • prob

Space reporting (Tomasz)

On Mon, Dec 29, 2008 at 14:57, Tom Wlodek  wrote:
Hi, I would like to obtain info about disk space available at UC
computing sites. I go to the page:
http://panda.cern.ch:25880/server/pandamon/query?dash=prod
and I see a list of sites and disk space info. Unfortunately, of the US
sites I can see only SLAC and Great Lakes. Others - NETier2, SWTier2,
Midwest (Chicago and IU) - are missing.
Question 1: is there a problem here? Why do those sites not report?
Question 2: Am I looking at the right place? If not - what is the right
place?
Question 3: Is there another way to get this information from the panda
monitor?
Tom Wlodek

  • Sites should be reporting via curl for each token. Some tokens are working, others are not.
  • Decided on hourly reporting, for all tokens (a minimal reporting sketch follows this list).
  • Tom needs warning and critical values.
  • Are dCache sites being monitored via srm-get-metadata, on a per-token basis, by CERN? There is a website for this somewhere. This is not currently available with the BeStMan SRM.
  • Wei will gather technical requirements from Armen and Hiro, will communicate with Alex.
  • dCache sites should check to see if the space reporting is accurate.
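As a rough illustration of the per-token curl reporting mentioned in this list, here is a minimal sketch a site could run hourly from cron. The endpoint URL and parameter names are assumptions for illustration; the real reporting target and fields are not spelled out in these minutes.

    #!/usr/bin/env python
    # Hypothetical hourly per-token space report via curl.
    import subprocess

    REPORT_URL = "http://panda.cern.ch:25880/server/pandamon/query"  # assumed endpoint

    # total/used in GB per space token; real values would come from the local SRM/dCache
    TOKENS = {
        "ATLASDATADISK": {"total": 100000, "used": 62000},
        "ATLASPRODDISK": {"total": 30000, "used": 27000},
    }

    def report(site, tokens=TOKENS):
        for token, s in tokens.items():
            subprocess.check_call([
                "curl", "-sS", "-G", REPORT_URL,
                "--data-urlencode", "site=%s" % site,
                "--data-urlencode", "token=%s" % token,
                "--data-urlencode", "total=%d" % s["total"],
                "--data-urlencode", "used=%d" % s["used"],
            ])

    if __name__ == "__main__":
        report("MWT2_UC")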

VDT Bestman-Xrootd

  • last meeting
  • this meeting:
    • No updates
    • OU, SLAC, BU
    • Need instructions for setting up space token areas

Throughput Initiative (Shawn)

  • Plan to meet next week.

Site news and issues (all sites)

  • T1:
    • last week: discussion of holiday schedule
    • this week: Took delivery of 31 Thumpers - being installed now; 1 PB of storage to be added. These were ordered through UCI. Negotiating with Dell for CPU. Tier 2 / Tier 1 connectivity: dedicated circuits to Starlight. Will be setting up a meeting between the PIs and the ESnet/I2 people.
  • AGLT2:
    • last week: Seeing some ANALY jobs coming through. Only 7 TB available in DATADISK. Still working on Lustre, currently on the admin machines. Planning to run it in HA mode - primary and secondary.
    • this week: space issues - PRODDISK is filling with reprocessing data, and dq2sitecleanse doesn't work at the moment (38 TB in use); Charles is helping. Holding off on Lustre work for now, waiting for the 1.8 release; plan to migrate one space token there. A source preparation error is being tracked down.
  • NET2:
    • last week: Since the migration we have not run a lot of production; want to see things fully up before the holiday production. New blades installed and in production; storage installed but not yet online. Muon calibration workshop tomorrow.
    • this week: The muon calibration workshop went well. Top priority is bringing the new storage online; total capacity will be 336+168 TB raw. The HU site has been down over the break. Will start ramping up to use 1000 cores. Networking performance between BU and HU is being studied. Need to check token-by-token reporting.
  • MWT2:
    • last week: running into problems hitting dCache full limits. Brought up compute nodes; working on dCache configuration.
    • this week: brought up the first new Dell storage unit and the first compute node. Pilot failures at IU - globus error 22; homedir cleaned up.
  • SWT2 (UTA):
    • last week: CPB cluster back online. Upgraded LFC release. Working with Nurcan on TAG analysis with AODs.
    • this week: The SRM was failing during the break - a restart fixed this.
  • SWT2 (OU):
    • last week: still working on LFC cleanup exercise. Ibrix issues.
    • this week: 100 useable TB once funding arrives.
  • WT2:
    • last week:
    • this week: cooling upgrade in progress. Found reprocessing jobs putting a lot of stress on the NFS servers; separated the ATLAS home directory and the ATLAS releases areas. dq2sitecleanse.py was attempted - some problems deleting entries from the LFC; consulting with Charles.

Carryover issues (any updates?)

Release installation via Pacballs + DMM (Xin, Fred)

  • last week:
    • Testing installation pilots on Tier2s. A couple of configuration problems - now fixed. Will be sending new jobs.
  • this week:
    • no report

Squids and Frontier (Douglas)

  • last week:
    • I presented an update on the conddb access testing at SLAC this morning at the ATLAS database group meeting. You can find the slides in the meeting agenda: http://indico.cern.ch/conferenceDisplay.py?confId=47447. Things are working fairly well now that the Frontier log issue is understood, and the SLAC squid server is in production use. This has been shown to work on SLAC batch nodes, and with multiple jobs running. More testing remains to show that this can work for full production running, but hopefully this week. Douglas
    • Problem with client slowness writing logfiles to AFS. The Frontier logfile is now written to /tmp instead (a minimal sketch follows this list).
    • Now getting good results: 120 - 180 seconds in various configurations.
    • When everything is local, just 20 seconds.
    • Running on batch systems, accessing conditions data works. Will scale up. Question about what the profile will look like.
  • this week:
    • No update
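As a small illustration of the /tmp logfile workaround noted under "last week", here is a minimal wrapper sketch. It assumes the Frontier client honors FRONTIER_LOG_FILE / FRONTIER_LOG_LEVEL environment variables; treat the exact variable names, and the athena.py command shown, as assumptions rather than confirmed details.

    # Sketch: point the Frontier client log at local /tmp instead of AFS
    # before launching the job. Variable names are assumed, not confirmed here.
    import os
    import subprocess

    def run_with_local_frontier_log(cmd, jobid):
        env = dict(os.environ)
        env["FRONTIER_LOG_FILE"] = "/tmp/frontier_client_%s.log" % jobid
        env["FRONTIER_LOG_LEVEL"] = "warning"
        return subprocess.call(cmd, env=env)

    # e.g. run_with_local_frontier_log(["athena.py", "myJobOptions.py"], jobid="12345")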

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • last week:
    • Still a problem with the ANALY queue at Harvard, so there may still be an issue with the BU lsm. Saul is investigating with John.
  • this week:
    • No update.

AOB

  • Need to set up space token instructions.


-- RobertGardner - 22 Dec 2008
