
MinutesJune10

Introduction

Minutes of the Facilities Integration Program meeting, June 10, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rich, Michael, Rob, Fred, Charles, Jim C, Wei, Sarah, Torre, Pedro, Patrick, Nurcan, Mark, Rupon, Saul, Shawn, Xin
  • Apologies: none

Integration program update (Rob, Michael)

  • IntegrationPhase9 - FY09Q3
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • In third month of FY09Q3:
      • Please update FabricUpgradeP9 with any CPU, storage, or infrastructure procurements during this phase.
      • Please update SiteCertificationP9 with the DQ2 site services update for Adler32, fabric upgrades, and Squid server / Frontier client deployment.
    • Squid deployment at Tier 2 - we need:
      • ATLAS to validate the method - this needs a discussion within ADC operations and physics (Jim C). Also need a validation job, e.g. one running over cosmics.
      • Sites within the facility to deploy Squid by June 24 (a minimal proxy-check sketch follows this list).
  • Other remarks
    • STEP09 still in full swing - ending this Friday.
    • Analysis jobs are putting stress on storage services.
    • Need to do a post-mortem afterwards and develop an action plan for fixing the major weaknesses.
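As a rough aid for the June 24 Squid deployments noted above, here is a minimal sketch (Python) that checks whether a local Squid forwards a Frontier-style HTTP request; the proxy host and Frontier URL are placeholders, not the actual site or BNL endpoints, and this is not the ATLAS validation job itself.

  import urllib.request

  # Placeholder endpoints -- substitute the site's Squid and the Frontier
  # server URL actually used by the ATLAS releases.
  SQUID_PROXY = "http://squid.mysite.edu:3128"         # 3128 is Squid's default port
  FRONTIER_URL = "http://frontier.example.org:8000/Frontier"

  def check_squid(proxy, url, timeout=30):
      """Fetch a Frontier URL through the local Squid and report cache headers."""
      opener = urllib.request.build_opener(
          urllib.request.ProxyHandler({"http": proxy}))
      with opener.open(url, timeout=timeout) as resp:
          # Squid normally adds an X-Cache header indicating HIT or MISS.
          print("HTTP status:", resp.status)
          print("X-Cache    :", resp.headers.get("X-Cache", "not present"))

  if __name__ == "__main__":
      check_squid(SQUID_PROXY, FRONTIER_URL)

Running it twice in a row should turn the X-Cache result from MISS to HIT if the proxy is caching as expected.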

Analysis stress plans (Jim C)

  • Contacted 30 most active users to submit jobs
  • Use Top mixing sample
  • Transfer results back using Rik Y's utility
  • Have heard back from most users
  • No major problems reported, some positive experiences as well
  • At next week's Jamboree will present summary
  • Next up - look at the pretest container, and then the "big" sample. Will extend to other clouds.
  • Blitz day on hold - need more users
  • Kaushik: pre-test container: complete at BNL, partial at the Tier 2's. Would be interesting to see how users' jobs perform. Will launch next week. Fraction of big sample: 50M JF35, about half already merged. 25M JF17 done. 15M G4 full sim found (some duplication). So it's about 70M to date. 20M jobs have "avalanched" - meaning they're starting to run now that the scout jobs have completed.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Lots of activity in parallel - it involves all parts of our systems: replication, production, and analysis.
    • US sites appear to be performing well.
    • SLAC - is deleting datasets with wrong checksums. Not sure why transfers are fast.
    • Have seen excellent rates across all T2's yesterday to all sites.
    • Status page for STEP09 - see http://atladcops.cern.ch:8000/drmon/ftmon.html and http://panda.cern.ch:25980/server/pandamon/query?mode=listFunctionalTests
    • STEP09 user dataset generation for the US is going well - 50M events, but there were problems with event numbers; started yesterday with a new sample, so we're behind.
    • Lack of autopilots; Torre: problem with crons on 3 machines, fixing.
    • Would like to generate 100M events. Should be able to do ~10M/day. Trying to get this done before June 12.
    • Subscriptions that Alexei made for the 10M pre-test container - all vanished.
    • HC continues to run - 19K jobs in US cloud w/ 90% efficiency.
    • Charles: lots of failures due to high memory footprints. Mark has seen this at a number of sites.
    • Merge jobs - are we finished? We have thousands of tasks in the backlog.
    • When does reprocessing start in the US? Should start now.
  • this week:
    • All running fine - MC production and reprocessing.
    • CERN having difficulty sending data to BNL.
    • Tier 2's are doing well.
    • Even though the tests will finish at 11 am CEST, there will be a 2-day period to clear the backlog.
    • Currently reduced bandwidth available BNL to CERN due to a fiber cut in Frankfurt: 4.2 Gbps rather than 10. Why is Fermilab still getting so much (7.2 Gbps)? The backup link BNL-TRIUMF is adding more load.

Shifters report (Mark)

  • Reference
  • last meeting:
    • DQ2 server errors were causing a lot of failures at sites - jobs were failing at the registration step. The DQ2 tracker queries were the culprit.
    • Lots of lost heartbeats - most understood; these seem to be tapering off now. Happened at all sites at the same date and time, 6/1 ~8:40am. Related to condor-remove? Could we dig some of these up at the sites?
    • Rupon - starting to take shifts.
    • "setupper" failures are Panda server trying to register files, losing contact.
      1)  Panda db outage at CERN early Monday a.m. (6/1) -- many "lost heartbeat" failed jobs as a result.
      2)  Stage-in errors at AGLT2 caused by a mis-configuration in schedconfig.  Torre fixed the entry, production at the site resumed (6/1).
      3)  Pilot updates from Paul over the past week:  37e (5/29), 37f (6/1).  Details in announcements from Paul.
      4)  Issue with file staging errors at IU_OSG resolved -- site set back 'online' -- RT 12995 (6/1).
      5)  dCache issues at BNL Friday evening into Saturday (5/29 ==> 5/30).  Various postings to Usatlas-ddm-l discussing the problems.
      6)  Issue of STEP09 data having zero adler32 checksums resolved (thanks to Wei for noticing this). 5/31
      7)  Early a.m. 6/3 -- one of the dq2 central catalog hosts at CERN had to be re-started.  Large number of DDM errors during the outage.
      8)  Transfer backlog early on 6/2 due to an issue with the DQ2 upgrade at BNL that resulted in an incompatible version of the GFAL client software. Fixed by Hiro.
      https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/3861

  • this meeting:
  • From Mark:
    1)  For much of the past week overall production efficiency was very good.
    2)  Mailing list for information regarding STEP09 activities:  atlas-project-adc-operations-step@cern.ch
    3)  Minor pilot update from Paul -- v37g -- (6/4) -- contents:  (i) Removed unnecessary panda monitor warning messages ('outputFiles has zero length' which is irrelevant for panda mover jobs, and the -t deprecation warning from lcg-ls); (ii) Code for checking file staging in analysis jobs has been updated to use BNL dCache file indices (code from Hiro).
    4)  Long-standing site issue at UTD-HEP resolved -- missing RPM -- (error was "No such file or directory: 'ntuple_rdotoesd.pmon.dat'").
    5)  On Thursday a lack of pilots caused several sites to drain.  Torre re-started the cron process, this cleared up the problem (6/4).
    6)  Analysis jobs using large amounts of memory caused machines to fall over at multiple sites (see the hedged memory-cap sketch after these notes).  See:
    https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/3900 -- no follow-ups?
    7)  AGLT2 drained early morning on Friday (6/5) -- turned out to be an issue with AFS -- resolved.
    8)  Tuesday morning (6/9) -- AGLT2 and MWT2_IU were draining due to a lack of pilots (from gridui11).  Torre moved pilot submission over to gridui07, and pilots began appearing at the sites.
    9)  Request from Shawn -- Can someone with access set 'copytoolin = dccp' for AGLT2 and ANALY_AGLT2 ?  (5/29) -- has this been done?
    10)   Just curious -- message from John at BU about adding Tufts U. to panda (5/29) -- didn't see any follow-up?
  • Problem with job failures this morning, pointing to task 68897 (Pavel). Does the task need to be redefined? 75K jobs have already completed.
  • Stage in problem at AGLT2 - Paul replied to Shawn.
  • UTD - SRM is not working. Came back online Monday, during troubleshooting activities at BNL. Michael requests notification to Hiro when bringing Tier 3 sites online.
  • Worries about supporting Tier 3's during coordinated production exercises. Jim C: there will be a Tier 3 discussion at next week's Jamboree.
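High-memory analysis jobs knocking over worker nodes (item 6 above, also seen at MWT2 and AGLT2) keep recurring across sites. As a hedged illustration only, the sketch below shows one way a pilot or batch wrapper could cap a payload's address space with a standard POSIX rlimit before launching it; the 2 GB cap and the payload command are assumptions, not the actual pilot behavior.

  import resource
  import subprocess

  # Assumed per-job cap; a real site would tune this to worker-node RAM.
  MEM_LIMIT_BYTES = 2 * 1024**3   # 2 GB address-space limit

  def _apply_limit():
      # Runs in the child just before exec; allocations beyond the cap fail
      # instead of driving the node into swap.
      resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT_BYTES, MEM_LIMIT_BYTES))

  def run_capped(cmd):
      """Run a payload command under a hard address-space limit."""
      proc = subprocess.Popen(cmd, preexec_fn=_apply_limit)
      return proc.wait()

  if __name__ == "__main__":
      # Hypothetical payload; a real pilot would launch the transformation here.
      rc = run_capped(["python", "-c", "print('payload running under rlimit')"])
      print("payload exit code:", rc)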

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • FacilityWGAPMinutesJun2
    • DB access jobs do not work at SLAC and SWT2 yet: the Tools/PyUtils-00-06-17 tag that Sebastien Binet provided to correctly process files with 'root://' and 'dcap://' URLs does not work with 14.5.1.4. He reported that it is not meant to be used for releases earlier than 15.1.0.
    • First STEP09 analysis stress test jobs have been started on 6/2. See a summary (number of jobs, efficiency) from slides in today's ADC dev meeting: http://indico.cern.ch/getFile.py/access?contribId=5&resId=0&materialId=slides&confId=60214
    • A user analysis challenge is being organized by J. Cochran for June 12th. A 10M-event container dataset has been made for the pre-test: step09.00000010.jetStream_pretest.recon.AOD.a84/. This has not been replicated to the Tier 2's yet; the subscriptions disappeared, as Alexei reported, and he has resubscribed.
    • Expect many more to come in over the next few days.
  • this meeting:
    • FacilityWGAPMinutesJun9
    • US cloud has the highest efficiency (95%) and the most jobs (136K), followed by UK and France.
    • Main failures are similar to last week's, but there were a couple of new failure modes (e.g. looping jobs).
    • More job types coming into HammerCloud: cosmics & DB access. Sebastien back-ported the PyUtils package. Have a cosmics job, however it produces a large output file.

DDM Operations (Hiro)

Tier 3 issues (Doug)

  • last meeting
    • Torre, Kaushik, and Doug met last week about using Panda jobs to push data to Tier 3's. A Tier 3 would need an SRM and would need to be in the ToA. A new Panda client would serve data rather than using the full subscription model.
    • Kaushik is preparing draft for ATLAS review.
    • Does need an LFC someplace. This will be provided by BNL.
    • Few weeks of discussion to follow. Will take a month of development. Client is not too difficult.
    • Rik and Doug have been doing dq2-get performance testing on a dataset with 158 files, O(100 GB).
    • Tested from 3 places (ANL, FNAL, Duke) against most Tier 2's.
    • Rik reported on results to the stress-test.
    • Has seen copy errors of ~1%; has also seen checksums not agreeing (a hedged timing/error-rate sketch follows this section).
    • Lots of work on VMs. Looking like a good solution for Tier 3. Virtualize headnode systems.
    • No news on BNL srm-xrootd.
    • Would like to get analysis jobs similar to HC to use for Tier 3 validation (local submission).
  • this meeting
    • no report
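To make the dq2-get performance testing described under "last meeting" a bit more concrete, here is a hedged sketch that times each file copy, tallies failures, and reports the aggregate rate and error fraction (the ~1% copy-error figure quoted above). The copy command is a deliberate placeholder; it is not the real dq2-get invocation or Rik's transfer utility.

  import os
  import subprocess
  import time

  # Placeholder copy command; swap in the site's actual transfer tool.
  COPY_CMD = ["cp"]

  def copy_and_time(sources, dest_dir):
      """Copy each source file, timing it; return (failed, total_bytes, total_secs)."""
      failed, total_bytes, total_secs = 0, 0, 0.0
      for src in sources:
          t0 = time.time()
          rc = subprocess.call(COPY_CMD + [src, dest_dir])
          dt = time.time() - t0
          if rc != 0:
              failed += 1
              continue
          dest = os.path.join(dest_dir, os.path.basename(src))
          total_bytes += os.path.getsize(dest)
          total_secs += dt
      return failed, total_bytes, total_secs

  def report(sources, dest_dir):
      failed, nbytes, secs = copy_and_time(sources, dest_dir)
      n = len(sources) or 1                      # guard against an empty file list
      rate = (nbytes / 2**20) / secs if secs else 0.0
      print("files: %d  failed: %d  error rate: %.1f%%"
            % (len(sources), failed, 100.0 * failed / n))
      print("aggregate rate: %.1f MB/s" % rate)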

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • Notes from meeting this week:

Throughput Meeting Notes for June 9

Attending: Rich, Sarah, Hiro, Rob, Jamie, Doug

NOTE: All sites should read below…there are a few requests and one important action item that each site is responsible for.

Long discussion about BNL network performance. Hiro has noted the production traffic is not performing as expected. The perfSONAR measurements show a significant difference between “inbound” vs “outbound” performance (outbound is worse). See the following perfSONAR URLs for more info:

https://lhcmon.bnl.gov

https://lhcmon.bnl.gov/gui/serviceTest.cgi?url=http://lhcmon.bnl.gov:8085/perfSONAR_PS/services/pSB&eventType=http://ggf.org/ns/nmwg/tools/iperf/2.0

Rich noted that: “If you use this URL

http://dc211.internet2.edu/cgi-bin/perfSONAR/serviceTest.cgi?url=http://lhcmon.bnl.gov:8085/perfSONAR_PS/services/pSB&eventType=http://ggf.org/ns/nmwg/tools/iperf/2.0

you will get more details for each graph. At least these graphs have labels and the src/dst names are listed in the graph. The flash graphs are scaleable and the mouse-over prints out the speed results for that point.”

Hiro has been sidetracked and hasn’t had time to finish the automated dataset movement. Will continue to work on it. Jay can provide visualization once tests are running.

Jamie reported on her project for the Globus Alliance on characterizing GridFTP over 10GE links. This is a Google Summer of Code project to characterize performance with different protocols, using combinations of memory and disk as source/destination. Jamie needs access to multiple sites and would like the US ATLAS Tier-2's to provide an account and access. She currently has access to OU, AGLT2, and BNL. If other sites are willing to provide Jamie access for GridFTP testing, please send her an email: Jamie Hegarty [jamie.e.hegarty 'at' gmail.com].

We really need to finish off our throughput milestones. To expedite this I need EACH site to provide a date they will do their throughput testing:

 
**ACTION ITEM** Each site needs to provide a date before the end of June for their throughput test demonstration. Send the following information to Shawn McKee and CC Hiro:
a) Date of test (sometime after June 14 when STEP09 ends and before July 1)
b) Site person name and contact information. This person will be responsible for watching their site during the test and documenting the result.
For each site the goal is a graph or table showing either:
i) 400 MB/sec (avg) for a 10GE-connected site
ii) Best possible result if you have a bottleneck below 10GE
Each site should provide this information by close of business Thursday (June 11th). Otherwise Shawn will assign dates and people!!
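For reference when scheduling these tests, a quick back-of-the-envelope check (assuming decimal units and ignoring protocol overhead) relates the 400 MB/s goal to the 10GE line rate and to the time needed to move a test dataset:

  # Rough unit conversion for the throughput milestone.
  goal_mb_per_s = 400                     # target average for a 10GE-connected site
  goal_gbps = goal_mb_per_s * 8 / 1000.0  # 400 MB/s -> 3.2 Gb/s
  link_gbps = 10                          # nominal 10GE line rate

  print("%d MB/s is %.1f Gb/s, i.e. %.0f%% of a 10GE link"
        % (goal_mb_per_s, goal_gbps, 100.0 * goal_gbps / link_gbps))

  # Time to move a hypothetical 1 TB test dataset at the milestone rate:
  dataset_tb = 1
  hours = dataset_tb * 1e6 / goal_mb_per_s / 3600
  print("moving %d TB at %d MB/s takes ~%.1f hours" % (dataset_tb, goal_mb_per_s, hours))

So the milestone corresponds to roughly a third of the 10GE capacity, leaving headroom for production traffic during the demonstration.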
 

Network troubleshooting help for the throughput demonstrations is available from this list: John Bigrow, Rich Carlson, Shawn McKee, Jay Packard, Dantong Yu.

Next perfSONAR release will be available prior to joint-techs meeting in July.

Doug raised the issue of testing at Tier-3's and suggested August may be a good timescale to push for this. In general the group wants to strongly encourage Tier-3's to deploy perfSONAR; the real issue is when. We want to ensure the perfSONAR version is robust and very much a “set-and-forget” installation. Rich confirmed that the perfSONAR client tools have been incorporated into the VDT. This should make it easy for sites to have this broadly available (at Tier-3's especially).

Shawn will be gone for the next two Throughput meeting slots. Is anyone willing to organize either of the next two meetings? (Basically just send a reminder email and take notes on whatever happens.) Please let me know ASAP if you can do this.

A reminder that the meeting info for future meetings has a new meeting number ‘1324’ instead of the old ‘1234’.

Please send along corrections or additions to these minutes to the list. Thanks,

Shawn

  • last week:
    • Google summer of code gridftp meeting.
    • Local throughput is approaching our milestones. Primary topic next week: reaching the milestone.
  • this week:
    • See notes
    • Major action item: every site needs to reserve a time for its bandwidth test.
    • OSG will provide the perfSONAR client tools as part of the VDT.
    • Rich raised the issue that the plots need to be checked.

Site news and issues (all sites)

  • T1:
    • last week: making progress with investigations of dCache scalability. pnfs-id based file addressing was implemented by Hiro with help from Wensheng and Paul, and has been in production since Friday; it relieves load on the pnfs server in terms of metadata lookups. Seeing load creeping up with regard to staging. System seems very stable. 5 gridftp doors - will add more, since we have 30 GB/s. Disk procurement - will come up with 1.5 PB of disk by August. Will put FC behind the Thor - can put 2x32 TB of useable disk without adding management nodes, using Nexsan technology. The Oracle implementation of Chimera has a severe bug; it is not as well tested as Postgres. A trigger is missing for Oracle, and they asked us to port it - they obviously are not running the same test harness. For this reason we are still using pnfs.
    • this week: STEP09 backlog started building on Monday, with no good explanation. Had an issue with write pools being out of balance, resolved by Pedro. Lots of effort was expended to understand the issue - it was only resolved after the monitoring at Tier 0 became available, which explained it. 2.5 GB/s constant rate. Noted that other Tier 1's were getting data okay; only BNL was being limited. The WLCG meeting was treating this as a "site problem". Yesterday had a staging record of 24K requests at a high rate, cleared within a few hours. The pnfsID method is fully implemented; Hiro developed the site mover with Paul, and the pnfs load is lowered significantly. Very positive developments here. Moving forward with farm procurement, will go to bid. 1.5 PB useable disk procurement underway; expect delivery end of July.

  • AGLT2:
    • last week: running pretty well recently. Working on consistency checking. Schema problem with Chimera+Postgres; Tigran responded about the schema error. Hope to have everything checked by the end of the day.
    • this week: most things working well. First issue: data transfers pausing but nothing showing in DQ2 - not sure why. Second: the Condor system dumped all running jobs over the weekend; rebooted the Condor master node to clear it up (memory leak in Condor 7.2.1?). Tom is looking at analysis jobs closely - they require a large number of reads/writes from scratch. Is this expected behavior? How to follow up with low-level metrics to ATLAS? Nurcan suggests sending to the atlas-dist-analysis-stress-testing-coord@cern.ch list. Access to mysql in the release area? Accessing the conditions database (Oracle)? Back-and-forth reading/writing? How is space checked in the work dir? Q: are these HC jobs realistic? Nurcan says yes - they're reading AODs and writing ntuples.

  • NET2:
    • last week(s): running smoothly. Using a 1G network card - have yet to get the Myricom card replaced. HU jobs failing, investigating. Have not run HC yet, since the queue is filled with user jobs. Should we be deleting step09 data which has adler32=0? Are the step09 data real data? What to do with this data? Central deletion. Adding Tufts to NET2.
    • this week: cleaning up corrupt data. Will put in Adler32 shortly. Checking analysis job performance. Will be getting a 10G NIC soon. Reorganizing the machine room. Fred notes there are problems with the SAM figures; in contact with the GOC.

  • MWT2:
    • last week(s): converted 3 storage servers to 64-bit Linux. Stability improved; added 48 TB of disk. Running solid. High-memory jobs took out some nodes.
    • this week: doing lots of analysis jobs. 200 input files, fails on the 170th file. Wei sees the same kind of behavior from the same user. Working on dataset consistency. Adler32 will be turned back on today.

  • SWT2 (UTA):
    • last week: analysis site for CPB offline - not sure why. SRM became unreachable last week - investigating with LBL. Issue with high-memory jobs - they disrupted PBS. Analysis jobs: with direct reads through xrootd for input data the job seg faults; if the job is moved to read directly, it succeeds. Some kind of problem talking to xrootd. Cleaning space.
    • this week: late Friday evening - problems with file transfers; they mysteriously stopped. Possible packet loss back to the FTS server. Still some low-level I/O errors; still tracking them down. Adler32 - need to debug some code from Wei. Analysis jobs - getting read errors from xrootd.

  • SWT2 (OU):
    • last week: all okay. Waiting for additional storage, 100 TB (Dell/DDN)
    • this week: throughput issue OU-BNL, may be maintenance related.

  • WT2:
    • last week: all okay.
    • this week: DQ2 transfer backlog - reduced the number of concurrent checksum calculations on the data nodes to 1, probably too small. When there are more than 500 analysis jobs, one of the data servers gets a large number of hosts talking to it, creating hangs; when the number of connections exceeds ~200, problems occur, ZFS becomes slow, etc. (a hedged connection-count sketch follows this list). A large number of analysis jobs are failing with local copy. Nurcan will check with Dan as to which protocol he's using.
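Since the WT2 hangs correlate with how many clients are hitting a single data server, here is a hedged monitoring sketch; it assumes the psutil package is installed on the data server and that xrootd is listening on its default port 1094, and it is only an illustration, not SLAC's actual monitoring.

  import psutil  # third-party; assumed available on the data server

  XROOTD_PORT = 1094      # xrootd default; adjust if the site uses a different port
  ALERT_THRESHOLD = 200   # roughly where ZFS slowdowns were observed

  def count_xrootd_clients():
      """Count established TCP connections to the local xrootd port (may need root)."""
      conns = psutil.net_connections(kind="tcp")
      return sum(1 for c in conns
                 if c.status == psutil.CONN_ESTABLISHED
                 and c.laddr and c.laddr[1] == XROOTD_PORT)

  if __name__ == "__main__":
      n = count_xrootd_clients()
      print("%d established xrootd connections" % n)
      if n > ALERT_THRESHOLD:
          print("warning: above the ~200-connection level where slowdowns were seen")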

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both the Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates since he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at the Tier 2's as developments are made. - rwg
  • last meeting(s):
    • Fred reports problems using Squid with release 15 - cannot access the Frontier server at BNL. Release 14 works. Has contacted Richard Hawkings; working on a solution.
    • Has tested direct Oracle access with help from Carlos, but found no significant performance improvement despite the reduced number of queries.
    • Sarah used the instructions for Tier 2 successfully.
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
    • Transfer of new pacball datasets to BNL is much improved. Tom/UK will handle the subscriptions.
    • Tadashi changed Panda mover to give install jobs the highest priority. Should see releases installed very quickly.
    • Six new installation sites added to Panda - some configs need changes.
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate them on the fly and expects to release it very soon. Want to supply this to the gridftp server (a minimal streaming-checksum sketch follows this list).
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
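As a point of reference for the on-the-fly checksum work noted under "last week", here is a minimal sketch of a streaming Adler32 computation using Python's zlib, formatted as the 8-hex-digit string that DDM conventionally compares; it is not Alex's xrootd implementation.

  import sys
  import zlib

  def adler32_of_file(path, chunk_size=1024 * 1024):
      """Compute the Adler32 checksum of a file without reading it all into memory."""
      value = 1  # Adler32 starting value
      with open(path, "rb") as f:
          while True:
              chunk = f.read(chunk_size)
              if not chunk:
                  break
              value = zlib.adler32(chunk, value)
      # Mask to 32 bits and print as 8 zero-padded hex digits; an all-zero string
      # here is what the "zero adler32" problem during STEP09 looked like.
      return "%08x" % (value & 0xffffffff)

  if __name__ == "__main__":
      print(adler32_of_file(sys.argv[1]))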

Local Site Mover

AOB

  • last week
  • this week


-- RobertGardner - 09 Jun 2009
