


Minutes of the Facilities Integration Program meeting, June 17, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Saul, Rob, Pedro, Sarah, Michael, Patrick, Wei, Tom, Nurcan, Armen, Rupam, Mark, Hiro, Xin, Horst, Karthik, John De Stefano, Kaushik
  • Apologies: none

Integration program update (Rob, Michael)

  • IntegrationPhase9 - FY09Q3
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • In third month of FY09Q3:
      • Please update FabricUpgradeP9 with any CPU, storage, or infrastructure procurements during this phase.
      • Please update SiteCertificationP9 with the DQ2 site services update for Adler32, the fabric upgrade, and the Squid server/Frontier client deployment
    • Squid deployment at Tier 2 - we need:
      • ATLAS to validate the method - this requires a discussion within ADC operations and physics (Jim C.), plus a validation job, e.g. one running over cosmics.
      • Sites within the facility to deploy Squid by June 24.
  • Other remarks
    • last week
      • STEP09 still in full swing - ending this Friday
      • Analysis jobs are putting stress on storage services
      • Need a post-mortem afterwards, and an action plan for fixing major weaknesses
    • this week
      • STEP09 went very well
      • Next exercise is around the corner - the cosmic run begins Monday and will last two weeks

Conditions data access from Tier 2, Tier 3 (Fred)

  • https://twiki.cern.ch/twiki/bin/view/Atlas/RemoteConditionsDataAccess
  • Needs to be solved quickly - Sasha, Richard Hawkings, Elizabeth Gallas, David Front.
  • Jobs are consuming database connections and holding them open for a long time.
  • Squid tests at AGLT2, MWT2, and WT2 were successful - caching reduces the load on the backend.
  • Fred will contact NET2 and SWT2 to set up Squid caches; validate with Fred's test jobs (see the cache-check sketch at the end of this section).
  • COOL conditions data will be subscribed to sites; owned by Sasha.
  • An XML file needs to be maintained at each site using Squid - for raw, ESD, or even AOD data; this needs to be coordinated centrally, since all sites will want to do this.
  • Michael has discussed this with Massimo; needs to bring this up at the ADC operations meeting.
  • More ATLAS-wide coordination is needed.
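  • A minimal sketch (not one of Fred's validation jobs) of a quick check a site could run after deploying its Squid, assuming a Frontier-style HTTP backend; the server and proxy host names below are placeholders. Fetching the same URL twice through the proxy and inspecting Squid's X-Cache response header (a HIT on the second request) shows whether the cache is actually being used:

      # Hypothetical cache check; the Frontier URL and Squid proxy are placeholders.
      import urllib.request

      FRONTIER_URL = "http://frontier.example.org:8000/atlr/Frontier"   # placeholder
      SQUID_PROXY = {"http": "http://squid.example.org:3128"}           # placeholder

      opener = urllib.request.build_opener(urllib.request.ProxyHandler(SQUID_PROXY))

      for attempt in (1, 2):
          with opener.open(FRONTIER_URL, timeout=30) as resp:
              # Squid adds an X-Cache header; HIT on the second request means
              # the object was served from the local cache.
              print("attempt %d: HTTP %s, X-Cache: %s"
                    % (attempt, resp.status, resp.headers.get("X-Cache", "n/a")))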

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • All running fine - MC production and reprocessing.
    • CERN having difficulty sending data to BNL.
    • Tier 2's are doing well.
    • Even though the tests will finish at 11 am CEST, there will be a 2-day period to clear the backlog.
    • Currently reduced bandwidth available from BNL to CERN due to a fiber cut in Frankfurt: 4.2 Gbps rather than 10. Why is Fermilab still getting so much (7.2 Gbps)? The backup link BNL-TRIUMF is adding more load.
  • this week:
    • STEP09 - all went well.
    • There will be plans for more user analysis later this week - Jim C.
    • Putting the pieces of the sample together into containers (150M events); only completed datasets can be added to containers.
    • The rate is currently 10M-15M events/day.
    • Size of the analysis container is ~20 TB, i.e. need about 25 TB in MCDISK.
    • Capacity: SWT2 (35 TB); NET2 (30 TB); WT2 (45 TB)
    • Will need space for cosmic data too - not sure of requirements.
    • Michael - notes that the effort required for step09 was quite high, how do we improve? Need input from sites providing feedback on experiences.
    • Saul - still seeing HU running below capacity - not understood? Problem w/ pilots?
    • What is the schedule for analysis stress testing - probably next week. There is a list prepared by Jim - and some jobs have already been tested.
    • 65K directories at the top level of PRODDISK hit a GPFS limit; a solution is needed. Discussion with Simone about changing the way directory structures are defined by DQ2. The same scheme needs to be followed for all PRODDISK areas. There are competing effects of nesting depth versus number of entries per directory (see the illustrative sketch below).
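    • An illustrative sketch (not the actual DQ2 scheme) of the nesting-versus-fan-out trade-off: hashing the dataset name and using the first hex characters as intermediate directory levels bounds the number of entries at the top level, at the cost of extra depth. The dataset name below is hypothetical.

      import hashlib
      import os

      def nested_path(base, dataset):
          # Two hex characters per level => at most 256 entries per directory,
          # instead of one top-level directory per dataset.
          digest = hashlib.md5(dataset.encode()).hexdigest()
          return os.path.join(base, digest[:2], digest[2:4], dataset)

      # Hypothetical dataset name, for illustration only.
      print(nested_path("/proddisk", "mc08.105001.pythia_minbias.recon.AOD.e357_s462_r541"))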

Shifters report (Mark)

  • Reference
  • last meeting:
    • From Mark:
      1)  For much of the past week overall production efficiency was very good.
      2)  Mailing list for information regarding STEP09 activities:  atlas-project-adc-operations-step@cern.ch
      3)  Minor pilot update from Paul -- v37g -- (6/4) -- contents:  (i) Removed unnecessary panda monitor warning messages ('outputFiles has zero length' which is irrelevant for panda mover jobs, and the -t deprecation warning from lcg-ls); (ii) Code for checking file staging in analysis jobs has been updated to use BNL dCache file indices (code from Hiro).
      4)  Long-standing site issue at UTD-HEP resolved -- missing RPM -- (error was "No such file or directory: 'ntuple_rdotoesd.pmon.dat'").
      5)  On Thursday a lack of pilots caused several sites to drain.  Torre re-started the cron process, this cleared up the problem (6/4).
      6)  Analysis jobs using large amounts of memory caused machines to fall over at multiple sites.  See:
      https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/3900 -- no follow-ups?
      7)  AGLT2 drained early morning on Friday (6/5) -- turned out to be an issue with AFS -- resolved.
      8)  Tuesday morning (6/9) -- AGLT2 and MWT2_IU were draining due to a lack of pilots (from gridui11).  Torre moved pilot submission over to gridui07, and pilots began appearing at the sites.
      9)  Request from Shawn -- Can someone with access set 'copytoolin = dccp' for AGLT2 and ANALY_AGLT2 ?  (5/29) -- has this been done?
      10)   Just curious -- message from John at BU about adding Tufts U. to panda (5/29) -- didn't see any follow-up?
    • Problem with job failures this morning, indicating task 68897 problem (Pavel). Needs to be redefined? 75K jobs already completed.
    • Stage in problem at AGLT2 - Paul replied to Shawn.
    • UTD - SRM is not working. Came back online Monday, during troubleshooting activities at BNL. Michael requests notification to Hiro when bringing Tier 3 sites online.
    • Worries about supporting Tier 3's during coordinated production exercises. Jim C: there will be a Tier 3 discussion at next week's Jamboree.

  • this meeting:
    1)  Ongoing work to bring Tufts into production.  Issues found with release installations.  https://rt-racf.bnl.gov/rt/index.html?q=13205
    2)  Storage issues apparently resolved at UTD-HEP, and test jobs submitted yesterday succeeded.  Site set back to 'on-line'.
    3)  Minor pilot updates from Paul:
    (v37h) 6/10:
    The exception handling in the pilots' internal time-out function has been updated. The exception handler did not catch one of the thrown exceptions, leading to the reported error "Get error: dccp failed with output: ec 1, output filedescriptor out of range in select()". This fix addresses an issue seen with [at least] a user job at ANALY_MWT2_SHORT where dccp got stuck and was not timed out cleanly.
    (v37i) 6/12:
    (i) File staging query at BNL now using http method instead of dc_check (user analysis jobs).
    (ii) Changed the time-out from 1800 s to 3600 s for lcg-cp/cr, since 1800 s appears to be too low (a minimal time-out wrapper sketch appears at the end of this list).
    4)  Last Wednesday Xin reported on an issue that had been at least partly responsible for some of the problems with pilot submissions from gridui11 at BNL:
    For the record, the problem was that Condor-G got stuck at gridui11 due to a known bug in the Globus client used in the Condor-G GAHP server. The Condor-G developers are aware of this and plan to detect and restart it automatically in the future.
    5)  Job recovery by the pilot re-enabled for BNL by Paul (6/11).
    6)  AGLT2 -- Condor schedd process crashed Saturday evening at around 6:00 p.m.  ~1400 resulting "lost heartbeat" jobs.
    7)  Job eviction issue at IllinoisHEP resolved (6/15) -- Condor preemption / suspension had to be disabled.
    8)  Queues at BNL began to be drained Tuesday evening (6/16) in preparation for today's dCache upgrade.
    9)  Maintenance outage at SLAC next week (6/21 - 6/25).  Wei plans to upgrade OS and ZFS on all storage servers during this time.
    10)  Bug in dq2 affected some of the subscriptions for STEP09.  From Simone:
    The bug appears as follows: a subscription T0->T1 is working, trying to put files at the T1. Meanwhile there is a subscription T1->T2. The latter looks for files at the T1, which is marked as the only possible source in the subscription (--source option), but the files are not yet at the T1. DDM decides that the source (the T1) is broken and gives up on the subscription. This should not happen and will be fixed in upcoming releases.
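    A minimal sketch (not Paul's pilot code) of the time-out pattern described in item 3: run a copy tool such as dccp or lcg-cp under a hard time-out and make sure every failure path, including the time-out itself, is caught rather than escaping as an unhandled exception. The source and destination paths below are placeholders.

      import subprocess

      COPY_TIMEOUT = 3600  # seconds; the pilot raised the lcg-cp/cr time-out from 1800 s to 3600 s

      def timed_copy(cmd):
          """Run a copy command; return (exit code, combined output) even on time-out."""
          try:
              result = subprocess.run(cmd, capture_output=True, text=True,
                                      timeout=COPY_TIMEOUT)
          except subprocess.TimeoutExpired:
              return 1, "copy command timed out after %d s" % COPY_TIMEOUT
          except OSError as err:          # e.g. copy tool not installed on the node
              return 1, "could not run %s: %s" % (cmd[0], err)
          return result.returncode, result.stdout + result.stderr

      # Placeholder source/destination, for illustration only.
      rc, output = timed_copy(["dccp", "dcap://door.example.org/pnfs/some/file", "/tmp/file"])
      print(rc, output)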

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • FacilityWGAPMinutesJun9
    • The US cloud has the highest efficiency (95%) and the most jobs (136K), followed by UK and France.
    • Main failures are similar to last week's, but there were a couple of new failure modes (e.g. looping jobs).
    • More job types coming into HammerCloud: cosmics & DB access. Sebastien back-ported the pyutils package. Have a cosmics job; however, it produces a large output file.
  • this meeting:
    • STEP09 ended on June 12th. See analysis job summaries (as of yesterday, June 16) for clouds and US sites below. Statistics are not final yet; killed jobs on June 12th will be excluded.
    • STEP09 post-mortems are due: July 1 for ATLAS internal, July 9-10 for WLCG. MWT2 signed up to provide feedback: https://twiki.cern.ch/twiki/bin/view/Atlas/Step09Feedback#T2
    • Status of the DB access job: tested the new tag PyUtils-00-03-56-02 that Sebastien provided to process files with root:// and dcap:// URLs; Release 14 jobs worked. However, jobs at SLAC ("not a ROOT file") and SWT2 (looping job) are still unsuccessful and need further debugging.
    • STEP09_AnalysisSummary_Clouds:
    • STEP09_AnalysisSummary_USsites:

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • UC: will finish this today.
    • SWT2 (a bit of a problem, Patrick looking into it) and NET2 (John coding a component) are not finished.
    • SLAC: had to restart DQ2 several times. 250 analysis jobs were hitting one xrootd server while others had only 20; not sure if this was caused by a DQ2 transfer problem. Hiro - files are moving now. Hiro will look over the logs.
    • Fix applied by Shawn to replace a missing gfl library, restarted. Not sure why this problem just surfaced.
  • this meeting:
    • UC cleaning issue understood
    • Working on a problem reported by Wei - the LFC service was being blocked because it appeared to be doing port scanning. A new DOE cross-lab cyber-security program detected SLAC and AGLT2 as port scanning: 30 ports were being accessed on Panda mover servers (not necessarily by Panda mover processes) from remote LFCs. What changed? John Hover is investigating and will provide a report.
    • AGLT2's FTS delegated proxy had a problem; Bob increased the frequency of the renewal cron job. The proxy's validity should probably be checked as well (a rough validity-check sketch appears at the end of this list).
    • Simone changed the ToA setting to enforce the production role, but it can be overridden.
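    • A rough sketch of the delegated-proxy validity check suggested above; the proxy path is a placeholder. It reads the certificate's end date with openssl and warns when only a few hours remain:

      import subprocess
      from datetime import datetime

      PROXY_FILE = "/var/tmp/fts_delegated_proxy.pem"   # placeholder path
      MIN_HOURS_LEFT = 6

      # openssl prints a line like "notAfter=Jun 18 15:00:00 2009 GMT"
      out = subprocess.run(
          ["openssl", "x509", "-noout", "-enddate", "-in", PROXY_FILE],
          capture_output=True, text=True, check=True).stdout.strip()
      expires = datetime.strptime(out.split("=", 1)[1].replace(" GMT", ""),
                                  "%b %d %H:%M:%S %Y")
      hours_left = (expires - datetime.utcnow()).total_seconds() / 3600.0
      print("proxy expires %s UTC (%.1f h left)" % (expires, hours_left))
      if hours_left < MIN_HOURS_LEFT:
          print("WARNING: delegated proxy is about to expire; renew it")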

Tier 3 issues (Doug)

  • last meeting(s)
    • Torre, Kaushik, and Doug met last week to discuss using Panda jobs to push data to Tier 3s. A Tier 3 would need an SRM and would need to be in the ToA. A new Panda client would serve data, rather than the full subscription model.
    • Kaushik is preparing draft for ATLAS review.
    • Does need an LFC someplace. This will be provided by BNL.
    • Few weeks of discussion to follow. Will take a month of development. Client is not too difficult.
    • Rik and Doug have been doing dq2-get performance testing on a dataset with 158 files, O(100 GB).
    • Tested from 3 places (ANL, FNAL, Duke) against most Tier 2's.
    • Rik reported on results to the stress-test.
    • Has seen copy errors at the ~1% level, and has also seen checksum mismatches (see the Adler32 verification sketch at the end of this section).
    • Lots of work on VMs. Looking like a good solution for Tier 3. Virtualize headnode systems.
    • No news on BNL srm-xrootd.
    • Would like to get analysis jobs similar to HC to use for Tier 3 validation (local submission).
  • this meeting
    • next week
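  • A sketch (not part of dq2-get) of how the copy errors and checksum mismatches seen in the dq2-get tests could be caught locally after a transfer: recompute the Adler32 of each downloaded file and compare it with the catalogue value. The file name and expected checksum below are hypothetical.

      import zlib

      def adler32_of(path, chunk=1024 * 1024):
          value = 1                                   # Adler32 starting value
          with open(path, "rb") as f:
              while True:
                  block = f.read(chunk)
                  if not block:
                      break
                  value = zlib.adler32(block, value)
          return "%08x" % (value & 0xFFFFFFFF)

      # Hypothetical file name and catalogue checksum, for illustration only.
      expected = {"AOD.067368._00001.pool.root.1": "ad0012cf"}
      for name, checksum in expected.items():
          print(name, "OK" if adler32_of(name) == checksum else "MISMATCH")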

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • last week:
    • Note:
      **ACTION ITEM**  Each site needs to provide a date before the end of June for their throughput test demonstration.  Send the following information to Shawn McKee and CC Hiro:
      a)      Date of test (sometime after June 14 when STEP09 ends and before July 1)
      b)      Site person name and contact information.  This person will be responsible for watching their site during the test and documenting the result.
      For each site the goal is a graph or table showing either:
      i)                    400MB/sec (avg) for a 10GE connected site
      ii)                   Best possible result if you have a bottleneck below 10GE
      Each site should provide this information by close of business Thursday (June 11th).   Otherwise Shawn will assign dates and people!!
    • Major action item is for every site to reserve a time for BW test
    • OSG will provide client tools as part of VDT
    • Rich raised the issue that the plots need to be checked.
  • this week:

Site news and issues (all sites)

  • T1:
    • last week: STEP09 backlog started building on Monday with no good explanation. Had an issue with write pools out of balance, resolved by Pedro. Lots of effort was expended to understand the issue - it was only resolved once monitoring at the Tier 0 became available, which explained it. 2.5 GB/s constant rate. Noted that other Tier 1's were getting data okay; only BNL was being limited. In the WLCG meeting this was treated as a "site problem". Yesterday saw a staging record of 24K requests at a high rate, cleared within a few hours. The pnfsID method is fully implemented; Hiro developed the site mover with Paul, and the pnfs load is significantly lower. Very positive developments here. Moving forward with the farm procurement, will go to bid. A 1.5 PB usable disk procurement is underway; expect delivery end of July.
    • this week: The biggest issue was that the Tier 1 didn't get the anticipated share of data exports during the STEP09 exercise. Issues with the output buffer at CERN - a meeting was held to understand this; will instrument and adjust the FTS configuration, and will meet again to discuss concrete steps. Pedro/dCache: maintenance done, upgraded. Jason is working on monitoring SRM probes for Ganglia.

  • AGLT2:
    • last week: most things working well. The first issue was data transfers pausing with nothing showing in DQ2 - not sure why. The second was that the Condor system dumped all running jobs over the weekend; rebooted the Condor master node to clear it up (memory leak in Condor 7.2.1?). Tom is looking closely at analysis jobs - they require a large number of reads/writes from scratch. Is this expected behavior? How to follow up with low-level metrics to ATLAS? Nurcan suggests sending to the atlas-dist-analysis-stress-testing-coord@cern.ch list. Access to MySQL in the release area? Accessing the conditions database (Oracle)? Back-and-forth reading/writing? How is space checked in the work dir? Q: are these HC jobs realistic? Nurcan says yes - they're reading AODs and writing ntuples.
    • this week: mostly covered above.

  • NET2:
    • last week(s): cleaning up corrupt data. Will put in Adler32 shortly. Checking analysis job performance. Will be getting 10G nic soon. Reorganizing machine room. Fred notes there are problems with SAM figures. In contact w/ the GOC.
    • this week: John finishing up the corrupt data issue. Squid installed. Discussion about IO_WAIT observations.

  • MWT2:
    • last week(s): Doing lots of analysis jobs. Jobs with 200 input files fail on the 170th file. Wei sees the same kind of behavior from the same user. Working on dataset consistency. Adler32 will be turned back on today.
    • this week: dCache is much more stable, but a few restarts during STEP09 are still being investigated. Only got about 200 MB/s during STEP09 due to an FTS configuration error. Would like to get FTS 2.0 working for UC to implement passive GridFTP for direct transfers. Throughput tests tomorrow.

  • SWT2 (UTA):
    • last week: Late Friday evening there were problems with file transfers, which mysteriously stopped. Possible packet loss back to the FTS server. Still some low-level I/O errors, still tracking them down. Adler32 - need to debug some code from Wei. Analysis jobs are getting read errors from xrootd.
    • this week: Will implement Adler32 today. Squid server host prepared, will begin installing.

  • SWT2 (OU):
    • last week: throughput issue OU-BNL, may be maintenance related.
    • this week: will revisit w/ Kaushik.

  • WT2:
    • last week: DQ2 transfer backlog - reduced the number of concurrent checksum calculations per node to 1, probably too small. When there are more than 500 analysis jobs, one of the data servers gets a large number of hosts talking to it, creating hangs. When the number of connections exceeds ~200, problems occur: ZFS becomes slow, etc. A large number of analysis jobs are failing with local copy. Nurcan will check with Dan as to which protocol he's using.
    • this week: Found performance problems with the Thumpers and file-not-found errors during STEP09. The old Solaris and ZFS versions have difficulty detecting defective drives; will upgrade. Found throughput to BNL reduced while analysis jobs are running. Found LFC registrations sometimes failing for DQ2 transfers; in discussions with the DQ2 developers. Hiro believes the data did get into the LFC - were the entries removed? Wei claims registrations for DATADISK specifically were stopped while MCDISK registrations were successful. A visitor from IN2P3 did I/O profiling and found the largest read block size was 300 bytes, with lots of jumps within a file. Is read-ahead then useful?

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both Tier 1 and Tier 2 facilities. I've put John down to report regularly on updates since he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John and others at the Tier 2's as developments are made. - rwg
  • last meeting(s):
  • this week:
    • See above

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
    • Transfer of new pacball datasets to BNL is much improved. Tom/UK will handle the subscriptions.
    • Tadashi changed Panda mover to give install jobs the highest priority. Should see releases installed very quickly.
    • Six new installation sites added to Panda - some configs need changes.
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the GridFTP server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week

Local Site Mover


  • last week
  • this week
    • Doug reports that we will have Tier 3's opting for the GS type. Is there a Tier 2 "best practices" that can be provided to Tier 3's?

-- RobertGardner - 16 Jun 2009



png Untitled1.png (121.8K) | NurcanOzturk, 17 Jun 2009 - 12:39 | STEP09_AnalysisSummary_Clouds
png Untitled2.png (127.3K) | NurcanOzturk, 17 Jun 2009 - 12:40 | STEP09_AnalysisSummary_USsites