r3 - 16 Jul 2008 - 14:48:26 - RobertGardner



Minutes of the Facilities Integration Program meeting, July 16, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Wen, Rob, Michael, Shawn, Rich, Nurcan, Kaushik, Saul, Armen, Marco, Fred, Torre, Wei, Patrick, Karthik, Hiro, Tom
  • Apologies: none
  • Guests: none

Integration program update (Rob, Michael)

  • Overarching near term goals (from Phase 5):
    • Full and effective participation in FDR-2 exercises
    • Complete the benchmarks of 200 MB/s sustained disk-to-disk throughput to all Tier2s
    • SRM v2.2 functionality for all ATLAS sites DONE
  • Upcoming meetings:
  • Milestones from the Ann Arbor meeting: AnnArborNotesMay2008:
    • FDR2: data replication and analysis queues DONE
    • 200/400 MB/s T1-T2
    • OSG 1.0 deployed DONE - nearly complete
    • LFC evaluation and deployment strategy complete on-going
    • WLCG - SAM/RSV, reliability availability metrics for CE and SE reporting >80% for all sites.
    • Provisioning of capacities according to pledges on track for September 15 2008 deployment.
    • Network performance monitoring infrastructure deployed. see below
    • Revision to the Tier 3 white paper, and a reference Tier 3 facility defined. group formed
    • Analysis benchmarks demonstrated at increasing scale (100/200/500/1000 simultaneous jobs) at all Tier 2 facilities.
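For a sense of scale, the sustained-rate targets above can be converted into daily data volumes. This is a minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and a full 24 hours at the target rate; the function name is illustrative only.

```python
# Convert a sustained transfer rate into a daily data volume.
# Assumption: decimal units, 1 TB = 1,000,000 MB.
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000


def tb_per_day(rate_mb_per_s):
    """Return the volume in TB moved in 24 hours at a constant rate in MB/s."""
    return rate_mb_per_s * SECONDS_PER_DAY / MB_PER_TB


if __name__ == "__main__":
    for rate in (200, 400):
        print(f"{rate} MB/s sustained is about {tb_per_day(rate):.2f} TB/day")
```

Under these assumptions, 200 MB/s sustained corresponds to roughly 17 TB per day and 400 MB/s to roughly 35 TB per day.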

DATATAPE incident

  • 250-350 MB/s to tape at BNL from CERN; curious since no activity was announced by ADC operations.
  • Cosmic data replication has started. At some point replication to the Tier2s will start. Open questions: which space token, and how much space?

Next procurements

  • Standing agenda item, see CapacitySummary.
  • Follow-up on Dell storage offering.
  • Forwarding information from Shawn to Andy/UCI. A response has come back regarding node price, higher than the target. Expect more info within the next 48 hours.
  • MD1000 pricing not as we expected.
  • Deadline was Friday.
  • We need to get the FY08 disk in place by September 15.
  • Does IBM have a similar disk shelf setup?

Internet2 host specifications (from Rich)

  • Need a schedule to put the services in place in the facility.


I've been working with my colleagues from Internet2 and ESnet to define the hardware needed to run the perfSONAR monitoring packages. At present, we have tentatively identified the following hardware, which should perform this task. Our remaining task is to purchase a server and verify that it performs as expected. While we talked about a $400 price point, we are unable to put together a 1U server package that meets it. However, this server configuration comes close, and these are retail prices, so EDU or GOV discounts may apply.

Server config for perfSONAR monitoring/measurement

1U Server case & Motherboard ASUS RS100-X5/P12 $319.99 http://usa.asus.com/products.aspx?l1=9&l2=40&l3=116&l4=0&model=1969&modelmenu=1

CPU Intel E2200 Allendale 2.4 GHz $89.99 http://www.newegg.com/Product/Product.aspx?Item=N82E16819116070

Memory 2 GB Crucial CT783338 2x$23.99 = $47.98 http://www.crucial.com/store/mpartspecs.aspx?mtbpoid=9C0B41BAA5CA7304

Disk Western Digital Caviar 160 GB $43.99 http://www.newegg.com/Product/Product.aspx?Item=N82E16822136075

Slim DVD Lite-ON DVD-ROM $34.99 http://www.newegg.com/Product/Product.aspx?Item=N82E16827106224

Total price: $1073.88 + tax & shipping (this buys two 1U servers). I also looked into the server the Albany team built; they had a $400/CPU price, not $400/server, so we can't use that as a guide.
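The quoted total can be cross-checked against the parts list above. This is a minimal sketch using the retail prices as listed, held in integer cents to avoid floating-point rounding; the dictionary keys and the helper name are illustrative only.

```python
# Retail prices per server from the parts list above, in cents.
PARTS_CENTS = {
    "ASUS RS100-X5/P12 case and motherboard": 31999,
    "Intel E2200 CPU": 8999,
    "Crucial memory (2 modules at $23.99)": 2 * 2399,
    "Western Digital Caviar 160 GB disk": 4399,
    "Lite-ON slim DVD-ROM": 3499,
}
SERVERS = 2           # the quoted total covers two 1U servers
TARGET_CENTS = 40000  # the $400 per-server price point discussed


def dollars(cents):
    """Format an integer number of cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


per_server_cents = sum(PARTS_CENTS.values())
total_cents = SERVERS * per_server_cents
over_target_cents = per_server_cents - TARGET_CENTS

if __name__ == "__main__":
    print("Per server:", dollars(per_server_cents))
    print("Two servers:", dollars(total_cents))
    print("Over the $400 target, per server:", dollars(over_target_cents))
```

The parts sum to $536.94 per server, so two servers come to the quoted $1073.88, about $137 per server above the $400 target before any discounts.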


Follow-up issues

  • Storage capacity recommendations/guidance for the Facility (320 TB capacity, from Kaushik's model on MinutesJune11).
  • Revised WLCG pledges - need info by July 15. Action item for Rob

Operations overview: Production (Kaushik)

  • Follow-up on space token description assignments (PRODDISK, MCDISK, GROUPDISK, etc)
    • Kaushik will send reminder via email. Will re-work these numbers with latest from Kors.
    • Site naming issues w/ space tokens. DDM team would like to use "alternate name" in ToA. This name is how they aggregate space tokens from a particular site. Two associated issues - they use this name to connect to WLCG BDII. Should match w/ alternate name in ToA. But we don't want to do this.
    • Downtimes are published via BDII. But we want to avoid this - it can be done via OIM to WLCG GOCDB.
  • There have been sporadic groups of jobs to run; not much to do but complain.
  • Reprocessing at all Tier1 and Tier2 sites. Yuri and Sasha are working through all the issues. Most significant issue is Oracle DB at SLAC.
  • Regardless, Tier2 sites are still performing well for reprocessing.
  • M8 data starting to show up.
  • There are some missing files in the LRC.
  • Testing dq2- tools; finding missing md5sum and file size in the central DQ2 catalog. About 1/3 of entries (7M) did not have these in the central catalog. Also found a number of files (1.4M) that do not exist in DQ2. Only 2000 files were completely inconsistent.
  • Should we do a consistency check on the Tier2 LRC? Perhaps defer until the LFC migration is complete.
  • File registration problems at BNL - slow, then stopped. A problem with DQ2 site services not registering files: 30K files in the queue, already transferred but not registered. Moving the queue database to another machine solved the problem. Miguel had seen something similar with LFC sites.
  • See this in DDM dashboard - QUEUED, ATTEMPT DONE, but not FILE DONE. Inspect the site service logfile. Hiro may have seen a memory leak in the site services.
  • Charles reports defunct processes on DQ2 site services at UC.
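The kind of catalog consistency check discussed above can be sketched as a comparison of two catalog dumps. This is a hypothetical illustration only: it does not use the dq2- tools or any real catalog API, and it assumes each catalog has already been exported to a dictionary mapping a file identifier to its md5sum and size.

```python
# Hypothetical sketch: compare a central catalog dump with a site catalog dump.
# Each catalog maps a file identifier to {"md5sum": ..., "size": ...};
# a value of None means the attribute is not recorded.

def compare_catalogs(central, local):
    """Classify entries by the kind of mismatch found between two catalogs."""
    central_ids, local_ids = set(central), set(local)

    # Entries in the central catalog with no checksum or no size recorded.
    missing_metadata = sorted(
        fid for fid, entry in central.items()
        if entry.get("md5sum") is None or entry.get("size") is None
    )

    def differs(fid):
        # Only compare attributes that both catalogs actually record.
        for key in ("md5sum", "size"):
            a, b = central[fid].get(key), local[fid].get(key)
            if a is not None and b is not None and a != b:
                return True
        return False

    return {
        "missing_metadata": missing_metadata,
        "missing_in_central": sorted(local_ids - central_ids),
        "missing_in_local": sorted(central_ids - local_ids),
        "inconsistent": sorted(f for f in central_ids & local_ids if differs(f)),
    }


# Made-up example data, for illustration only.
central = {
    "f1": {"md5sum": "aa", "size": 10},
    "f2": {"md5sum": None, "size": None},
    "f3": {"md5sum": "cc", "size": 30},
}
local = {
    "f1": {"md5sum": "aa", "size": 10},
    "f2": {"md5sum": "bb", "size": 20},
    "f3": {"md5sum": "cc", "size": 31},
    "f4": {"md5sum": "dd", "size": 40},
}
report = compare_catalogs(central, local)

if __name__ == "__main__":
    for category, ids in report.items():
        print(category, ids)
```

Separating "metadata missing" from "inconsistent" mirrors the distinction in the notes between entries lacking md5sum/size and entries that are completely inconsistent.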

Shift report

  • Follow-up: There is a bug in the srmcp-fnal client. This has been fixed, but not in the currently deployed version: it is 2.0.3 and needs to be 2.0.8. It affects only dCache sites. Q: what services are affected?
  • No news.

Analysis queues, FDR analysis (Nurcan)

  • A wiki page was set up to show the online/offline status of the US pathena analysis queues as well as their availability for various athena releases/packages; see the page at PathenaAnalysisQueues.
  • A request came from the FR cloud to add FR analysis queues. Nurcan will organize the page to include other pathena analysis queues. Will need to contact other clouds to check whether they are interested in providing their info.

RSV and WLCG SAM (Fred)

  • See https://twiki.grid.iu.edu/twiki/bin/view/Operations/RsvSAMGridView for links to SAM and Gridview reporting consoles.
  • For scheduling downtimes, the OIM system: https://oim.grid.iu.edu/
  • Daily reports now sent to usatlas-grid-l.
  • SWT2 - naming issue being tracked.
  • NET2 - replacing gatekeeper today. Will add an SE.
  • With two gatekeepers in front of the same local job manager, there is a registration prescription so the logical OR is taken for the availability.
  • Follow-up:
    • Need to fix site names in OIM
    • Need to register all SE's
    • MWT2_IU_SE - why is it not reporting correctly? Just sent an email to osg-storage. Looks like the service is working, but there are problems with the client. Sarah is investigating.
    • Fred will drive the process with sites and the GOC. Need responsiveness on the part of sites not reporting correctly.
    • We will be publishing officially for July
    • Frequency of running RSV probes
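The effect of taking the logical OR across two gatekeepers can be illustrated with a small calculation. This is a minimal sketch, assuming each gatekeeper's probe results are available as a list of pass/fail samples over the same time slots; it is not the actual RSV or SAM algorithm, and the names and sample data are hypothetical.

```python
# Hypothetical sketch: a site counts as available in a time slot
# if at least one of its gatekeepers passed its probe in that slot.

def combined_availability(samples_by_gatekeeper):
    """Return the fraction of time slots in which any gatekeeper was up.

    samples_by_gatekeeper maps a gatekeeper name to a list of booleans,
    one per time slot; all lists must cover the same slots.
    """
    series = list(samples_by_gatekeeper.values())
    if not series or not series[0]:
        raise ValueError("need at least one gatekeeper with at least one sample")
    if any(len(s) != len(series[0]) for s in series):
        raise ValueError("all gatekeepers must have the same number of samples")
    combined = [any(slot) for slot in zip(*series)]  # logical OR per time slot
    return sum(combined) / len(combined)


# Made-up probe results for two gatekeepers over five time slots.
samples = {
    "gatekeeper_a": [True, True, False, False, True],
    "gatekeeper_b": [False, True, True, False, True],
}
site_availability = combined_availability(samples)

if __name__ == "__main__":
    print(f"Combined site availability: {site_availability:.0%}")
```

In this made-up example each gatekeeper alone is up in 3 of 5 slots, while the combined site is up in 4 of 5, which is why registering both gatekeepers against the same resource matters for the reported number.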

Operations: DDM (Hiro and/or Alexei)

Carryover - file exists problem

  • Last week there was the "file exists" problem @ BNL
    • Was this fixed at all affected facilities? Hiro believes this was isolated to one machine.
    • Same issue Kaushik discussed earlier - prevents registration.

LFC migration (John, Dantong)

  • SubCommitteeLFC
  • Two servers are placed behind a smart switch that does load balancing and failover. Now ready to formally begin migration.
  • Torre reports that the dq2 developers are encouraging use of the dq2- tools. dq2-get has been tested; dq2-put would need to be tested and perhaps.

wlcg-client (Marco)

  • Updated to use the new VDT client which fixed srmcp-fermi-client.
  • Still uses work-around to avoid the problem of two globus versions.

OSG 1.0 (Rob)

Site news and issues (all sites)

  • T1: Following dCache upgrade last week there were a couple of issues - there was a problem with one of the servers; once fixed, transfers resumed at the proper rate. 10G connectivity to each Thumper. Deployed more than 1000 new cores. Proposal for a scaling test in advance of the reprocessing task. Jay reports Monalisa graphs are working again. Can a time-stamp be added?
  • AGLT2: follow-up: pursuing why we're getting DATADISK errors - investigating. Perhaps back to usatlas3 vs usatlas1? Traced this to credential caching in dCache. Recently this has been okay, after disabling caching. /var partition filled on worker nodes, dcache log files. Progress from Paul on pilot+space tokens. Eager to switch.
  • NET2: gatekeeper upgrade, nothing else.
  • MWT2: Setting up a new analysis queue ANALY_MWT2_SHORT, still testing.
  • SWT2 (UTA): Rebuild going along
  • SWT2 (OU): waiting on a Dell quote.
  • WT2: follow-up: working on deploying glexec at SLAC, waiting to hear back from Maxim. Looking at replicating the conditions DB at SLAC - BaBar will allow use of their hardware initially. Will put in a PO for own servers. Channel bonding of three NICs on the gridftp door. Finding problems with transfers back to BNL - lower by 50%. Will be changing the names of the gatekeepers.

Carryover issues (any updates?)

Pilot upgrade for space tokens (Kaushik (Paul))

  • Update from Shawn. lcg-cp working; still need to work on registration. Carry-over

Release installation via Pacballs (Xin)

  • Need to follow up.

Throughput initiative - status (Shawn)

Notes from Monday's throughput meeting

Attending: Jay, Rich, Sarah, Charles, Dantong, Shawn
Horst, Wei – Unable to attend.
BNL doors:  New systems ordered but not arrived: two 10GE NICs (in/out), PCI-e slots, 16GB of RAM,  dual quad-core cpus.  Dantong can provide more details.
Some testing has occurred since our last meeting:  Sarah reported direct pool writes operational at MWT2.  Testing has shown  BNL->IU works directly to the pool node.  Achieved 500MB/sec UC->IU. 
FTS:  We need to learn more about FTS operational details.  How does it setup/negotiate a data transfer?  What tools/protocols are possible?  Jay can add FTS testing/monitoring to MonaLisa repository (and will if we get him details).
MWT2_IU: Direct writing configured…required unique pool/host certs.  Open port range 20000-25000.  A new gridftp server on 10GE link is running at IU.
MWT2_UC: Charles, 1 gridftp door.  Wanting to bring up 2nd 10GE Gridftp door.  Currently pools are on private network.
Want to initiate some testing between AGLT2 and MWT2 to debug/verify “direct pool” access and resulting throughput.  Wenjing, Sarah, Charles and Shawn will try to set this up and run some tests.
SLAC:  Wei reported they have channel bonding operational on one of their gridftp doors.
OU:  Still have issues with their 10GE host.  Suspect iptables issues.  Need to do further debugging.
Rich Carlson noted Joint-techs is next week.  Our call will go on as scheduled and people can join as available.   Rich reported that the new NDT/BWCTL distribution will be demo’ed there.   Also looking into getting a vendor to provide the fully built perfSONAR boxes that the USATLAS Tier-2’s need to deploy.
Next meeting: Monday, July 21st at 2 PM Eastern
Send along any additions or corrections.   Thanks,


Nagios monitoring subcommittee (Dantong)

  • Tom still on vacation. Dantong covering alerts.
  • There was an issue regarding RSV->(site-level) Nagios probes - Tom was to provide a package, GOC-RSV team waiting.

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

Site certification review

User LRC deletion (Charles)

  • Nurcan reports this is currently failing - Charles has addressed the reported bug. A new version is available for Nurcan to try; will follow up. Will email Nurcan today.
  • No change. Files are not getting deleted on server.

WLCG accounting


  • none

-- RobertGardner - 15 Jul 2008
