r2 - 10 Dec 2008 - 14:35:39 - RobertGardner



Minutes of the Facilities Integration Program meeting, Dec 10, 2008
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Michael, Rob, Charles, Karthik, Patrick, Shawn, Saul, Wei, Sarah, Nurcan, Kaushik, Mark, Bob, Fred & Marco, Douglas, Armen
  • Apologies: Jim C, Horst, John
  • Guests: Rich

Integration program update (Rob, Michael)

  • IntegrationPhase7
  • High level goals in Integration Phase 7:
    • Pilot integration with space tokens DONE
    • LFC deployed and commissioned: DDM, Panda-Mover, Panda fully integrated DONE
    • Transition to /atlas/Role=Production proxy for production DONE
    • Storage
      • Procurements - keep to schedule
      • Space management and replication
    • Network and Throughput
      • Monitoring infrastructure and new gridftp server deployed DONE
      • Throughput targets reached
    • Analysis
      • New benchmarks for analysis jobs coming from Nurcan
      • Support upcoming Jamborees
  • ANL Analysis Jamboree, Dec 9-12
  • BNL Analysis Jamboree, Dec 15-18, 2008, agenda: BNLJamboreeDec2008
  • Muon Calibration and Alignment Workshop at Boston University on December 18-19, http://atlas.bu.edu/workshop/
    • Dear colleagues, There are two issues that we would like your help with for the upcoming Muon Calibration and Alignment Workshop at Boston University on December 18-19. Firstly, we are arranging a tutorial for the first afternoon which will focus on reconstruction and looking at cosmic data, and we would like your input to decide where to run the tutorial. We would like to prepare it in such a way that users would be able to use the same cluster of computers. We can easily get accounts for participants at Michigan or the Northeastern Tier 2 at BU. The tutorial is being done by people that run at Michigan, so using the Michigan cluster is probably the best for this reason. We can also use lxplus or BNL. There might be issues with users not having accounts at BNL and/or CERN. Please give us your preference, especially if you intend to participate in the tutorial and one of our potential choices is a show-stopper for you. If we choose Michigan or NET2, we will get accounts beforehand for the registered participants. If you would like to see the tutorial weighted more heavily toward reconstruction or looking at calibration ntuples, please also let us know. Secondly, please check http://atlas.bu.edu/workshop/participants.html to make sure you're registered. We will use this list to order food and arrange parking, so if you are not registered please do so as soon as possible. Thank you for your help! Robert Harrington Boston University
    • Dedicated machines at AGLT2 and BU.
  • Tier 0/1/2/3 Jamboree - Jan 22, 2009
  • Next US ATLAS Facilities face-to-face meeting (past meetings):
    • Will be co-located with the OSG All-Hands meeting at the LIGO observatory in Livingston, LA, March 2-5, 2009 Agenda
    • US ATLAS: March 3, 9am-3pm - focus on readiness for data, and Tier 3 integration - organized by Maxim Potekhin
  • Tier 3 study group is making good progress - a draft document is available. Consolidating input regarding workflows, requirements.
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open usually): http://integrationcloud.campfirenow.com/1391f
  • Upcoming reviews:
    • Program managers review in January
    • February agency review

Operations overview: Production (Kaushik)

  • Last week
  • this week
    • Very little production - the task definition system was down due to the Panda monitoring system migration to CERN; it should be functional now. The Panda monitor at BNL may not give accurate info about tasks and DDM: the pieces maintained by Pavel and Alexei have moved to CERN and are not yet in sync with the BNL instances. The DB migration from MySQL to Oracle will take time to sort out.
    • http://panda.cern.ch/
    • No new tasks defined this week == no new jobs. Except for re-processing validation jobs - shifters filing bug reports.
    • Expect large numbers of jobs and tasks for the holidays - deadline Dec 15 - a month's supply.

Shifters report

  • Mark - as noted, idle.
  • BU back online after the LFC changes. Some additional issues remain - John B investigating with Paul.
  • Pandamover jobs - Paul has supplied a fix for the pilot issue.
  • swt2-cpb back online; nfs server issues addressed, lfc upgraded.
  • See Yuri's weekly summary for details.

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • last week:
    • Now testing.
    • Week from today expect to have the final version working. This will bring up the Harvard site.
    • Test jobs sent by Paul worked. Code is now ready.
  • this week:
    • Saul re-installing latest version. Tested and working now.
    • Charles will do the dcache version
    • Patrick will take a look at version for xrootd. xrdcp.
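
The per-backend versions mentioned above (dCache via Charles, xrootd/xrdcp via Patrick) suggest a dispatch layer that picks the storage-specific copy tool. A minimal illustrative sketch follows; the function and table names are hypothetical, not the actual LocalSiteMover interface:

```python
# Hypothetical dispatch for a local site mover: choose the storage-specific
# copy command per backend. Names here are illustrative only.
COPY_COMMANDS = {
    "xrootd": ["xrdcp"],  # xrootd version under discussion
    "dcache": ["dccp"],   # dCache copy client
    "posix":  ["cp"],     # plain filesystem fallback
}

def build_copy_command(backend, src, dest):
    """Return the argv list a local site mover would run for this backend."""
    try:
        base = COPY_COMMANDS[backend]
    except KeyError:
        raise ValueError("unsupported storage backend: %s" % backend)
    return base + [src, dest]
```

The point of the table is that sites only swap the backend entry; the pilot-facing interface stays the same.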

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/AtlasProtected/PathenaAnalysisQueuesUScloud
  • Analysis shifts, http://indico.cern.ch/conferenceDisplay.py?confId=46430
  • Follow-up:
    • Pathena stress tests - Benjamin contacted for monitoring from ARDA dashboard. Waiting for response. Might do this as a joint activity w/ French cloud.
  • SLAC back online after configuration changes in the pilot and scheddb. Then there were problems with LFC. Waiting for Ganga robot jobs; have requested more - 3 submissions per day. Benjamin is still in the process of translating his code for ARDA.
  • Other sites: swt2-cpb back online with a 100% success rate; at NET2, Paul and John are cleaning up issues after the LFC migration.
  • During the LFC migration at BNL last week, some jobs were queued in the analy-bnl queue. The site was set to offline, unbeknownst to users. The issues are how to redirect jobs, and the availability of input files. Can Panda send an email to users if a site goes offline? Still under discussion.
  • Workshop at ANL this week; waiting on a cosmic data sample from CERN (Castor). May take a day or two if it's on tape. Marco will follow up with Stephane.

Operations: DDM (Hiro)

  • Not much activity in the last few days.
  • BNL LFC crash - due to an old installation; will be upgraded. (There are two machines behind an F5 switch.)
  • Follow-up on FTS delegation procedure.
  • Hiro would like to discontinue the myproxy service. Will send an email.
  • Restarting dq2-ping for mcdisk and datadisk.

LFC migration

  • SubCommitteeLFC
  • last week
    • A VDT fix is coming soon for the problem with one of the libraries used by the LFC daemon.
    • BNL: rolled back pilot changes related to space tokens, which had resulted in permissions issues. The long-term issue of using usatlas1 versus usatlas4 still needs to be worked out. Hiro has locked the LRC for write operations. Looks like it's all functioning; waiting for test jobs to complete.
    • BU: starting to install at Harvard.
  • this week
    • Discussed Kaushik's proposal for ownership and roles for LFC and the sites. Kaushik will send around specific checks. The remaining issue is dataset deletions.
    • New cleanse script for LFC - see DQ2SiteCleanse. Handling DQ2, LFC deletions - will need to add storage specific deletion, but this can be handled easily.
    • Will also need to add logic for managing PRODDISK as a cache, different from the other endpoints.
  • Hiro - suggestion to add a default group, as a way for enabling deletion of datasets.

Throughput Initiative (Shawn)

  • This week's meeting summary:
                        Throughput Meeting Notes for December 9th 2008
Attending: Rob, Karthik, Shawn, Sarah, Charles, Jay, Rich, Neng
MonALISA: Jay is still waiting on the "conduit" through the firewall. Software installed on the disk. Hopefully by next week we will have good news (an operational service).
AGLT2: Still trying to debug (find) the WAN issue. Now have access to an intermediate box in Chicago to help isolate potential fault locations.
IU: Still have an oscillation in throughput to debug.
OU: Network hosts are ready. Tests planned between BNL and OU to bring the systems online.
Wisconsin: New gridftp server tested; unstable, so reverted back to the original hosts. Still working on debugging and bringing up the new high-performance gridftp servers.
AOB: None
Send along corrections or additions. We plan to meet again next week.

  • Throughput target: 1 GB/s aggregate from BNL to multiple Tier 2 sites
  • New high-water marks for Tier 2's to check.
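
As a back-of-the-envelope check on the aggregate target, an even split across sites (an assumption; real shares vary with site capacity) gives the per-site rate to aim for:

```python
def per_site_target_MBps(aggregate_GBps, n_sites):
    """Split an aggregate throughput target evenly across sites, in MB/s.

    Assumes an even split, which is only a rough planning figure; actual
    per-site targets depend on each site's network and storage capacity.
    """
    return aggregate_GBps * 1000.0 / n_sites
```

For example, 1 GB/s spread over five Tier 2 sites works out to 200 MB/s per site.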

Site news and issues (all sites)

  • T1:
    • last week: Dantong: missing lib in worker node client, re-installed.
    • this week: Procurement still going. 1 PB of storage being set up by DDN, to be evaluated. WAN issues with the LHC OPN - a fiber disruption in London, now fixed. Hiro reported on the LFC daemon crash; working with David Smith, will get an upgrade. ATLAS DDM stress test: the goal is to stress the replication components - site services and catalogs - with 2000 datasets at each site, 5-10K replication transactions per site per hour, and 1-10M files involved; the idea is to generate lots of transactions affecting site services and catalogs. Started this morning; will last 10 days. Suggestion from Charles about reducing the number of queries against the central catalog; he was advised to speak with Pedro. Pedro will move to BNL and head the storage management group at the Tier 1.
    • In discussion about direct 10G to Starlight - dedicated circuit. Last mile on long island to be put in place in January. Propose a meeting with ESnet and Tier 2 people. Will be provided to US ATLAS at no cost to the project.
  • AGLT2:
    • last week: FTS proxy issue, otherwise all is well. In the process of bringing up Lustre; checking out Lustre+BestMan. LFC inconsistency yesterday - still looking into fixing it. Using Charles' script, but it needs modification. DB contents were lost for 27 hours; will need to scan and re-register. (Wei has a related question about how to retrieve the LFN given an observed orphan in the storage system.)
    • this week: Had problems with the LFC installed from the test cache - had to re-install, which was easy. Using monit software to restart downed services. Working on the Lustre setup. Meeting with Merit in order to get dynamic circuit capability to MSU. Getting ready for the muon calibration workshop - Bob setting things up for interactive users. Working on getting Charles' ccc.py script working to check consistency.
  • NET2:
    • last week: John has been working on LFC migration and local site mover - solving tricky HU firewall issues. BU running smoothly; HU down pending migrations. Contact with Tufts - might join in production. All storage hardware arrived, 100 cores of blades arrived. Still need to setup perf-sonar machine. Problem with analysis queue - related to mistake in panda pilot, which apparently was fixed.
    • this week: Hardware - racking new storage and blades. Hope to have everything online before Christmas. LFC migration completed. The MySQL database will be at BU, with two LFC instances. John has started a new job at Harvard, but he'll continue to work on NET2.
  • MWT2:
    • last week: Disk-catalog-DQ2 catalog consistency checking - recovering from some lost metadata, nearly done now. Down to 28 files missing. New Dell storage server - image installed, configuring raid, switch. Did some throughput IU-UC; 200 MB/s.
    • this week: Have got the DATADISK and PRODDISK tokenized areas completely cleaned up - all holes in datasets removed. Now have zero ghosted or orphaned datasets. Bringing up new hardware - new storage nodes.
  • SWT2 (UTA):
    • last week: NFS server failure last week; brought back with a replaced file system. It serves /home for grid user directories and the ATLAS releases. Kernel panics; switched the file system from ext3 to XFS.
    • this week: CPB cluster back online. Upgraded LFC release. Working with Nurcan on TAG analysis with AODs.
  • SWT2 (OU):
    • last week: all is well. LFC daemon died once - put an auto-restart script in place. Karthik putting up perfSONAR hosts - disk problems not well understood.
    • this week: working on updating LFC with the OSCER admins. Trying out the clean-up script from Charles.
  • WT2:
    • last week: production is working fine. Working w/ Paul on analysis queue. glexec is working correctly for remote users - test jobs successful. Frontier testing.
    • this week: Upgraded LFC to stable version on Monday - went smoothly.

Carryover issues (any updates?)

Release installation via Pacballs + DMM (Xin, Fred)

  • last week:
    • Torre is working on submission of installation jobs. Will be trying out sites.
    • Pacballs are being submitted automatically to all Tier 1's. DONE
  • this week:
    • Yesterday Torre began submitting jobs - come in as usatlas2. Is the rate too high (SWT2_CPB_INSTALL observations)? Increased load from grid monitor due to the (separate) job submit host.
    • Xin is submitting one install job to each site.

Squids and Frontier

  • Testing the read-real script - to approximate a real conditions data access job, running repeatedly. The configuration between SLAC and BNL is figured out: jobs connect to the squid cache at BNL, then to the Frontier service at BNL; or to the squid cache at SLAC, which forwards the HTTP request to Frontier at BNL. Tested.
  • Access times vary between jobs - 2000-2500 seconds mostly, with jumps to 3000-3500 seconds sometimes. Testing "zip levels" for compression (0-5).
  • Might yield improvements of ~10%.
  • Squid cache verified to be working correctly at SLAC.
  • Rerunning the script at CERN gives 120-180 seconds.
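
The repeated-run measurement described above can be sketched as follows; `time_repeated` and the `job` callable are illustrative helpers, not part of the actual read-real test harness:

```python
import time

def time_repeated(job, runs=5):
    """Run `job` repeatedly and collect per-run wall-clock times.

    Mirrors the read-real approach of running the same conditions-access
    job repeatedly, so a cold first pass (squid cache miss) can be
    compared with warm, cached passes.
    """
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        times.append(time.perf_counter() - start)
    return times
```

Comparing the first element against the rest of the list shows the cache effect; the median of the warm runs is the number worth tracking across zip levels.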


  • AOB: none

-- RobertGardner - 09 Dec 2008
