


Minutes of the Facilities Integration Program meeting, July 15, 2009
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees:
  • Apologies: Rob

Integration program update (Rob, Michael)

Operations overview: Production (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    • Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: http://indico.cern.ch/conferenceDisplay.py?confId=63897
    • One particular topic discussed this week was the procedure for shifters to follow in cases of missing files (e.g., input files not found). A procedure is outlined in the ADCoS wiki, but there is a question of how long this could take in a situation with a large number of files. Not yet finalized.
    1. ) Update to the pilot (from Graeme while Paul is on vacation): A small change to the pilot code has been made in version 37l. This modifies the athena environment setup to get the correct trigger menus in athena > 14.5 (7/2).
    2. ) SRM restart at BNL (7/2) -- from Pedro: We needed to restart the SRM a few more times and change some settings on the SRM and PnfsManager, plus we had to quickly retire and clean some old pools and re-deploy them in the MCTAPE write pools.
    3. ) (7/2) Stage-in/out errors at OU due to occasional crash of the SE/LFC node -- a system is in place to detect this and perform an auto-restart. (This issue will be resolved once their new storage system is in place.)
    4. ) AGLT2 set back to 'on-line' following resolution of a missing-files issue. See: https://rt-racf.bnl.gov/rt/index.html?q=13427 (7/3).
    5. ) Auto-pilot wrapper was downloading the pilot code from BNL (which had a slightly out of date version) rather than CERN. Issue resolved (thanks to Torre & Graeme) See: http://savannah.cern.ch/bugs/?52757 (7/4).
    6. ) Over the weekend transfers to AGLT2_MCDISK were failing due to insufficient space. 20TB added (7/6). See: https://rt-racf.bnl.gov/rt/index.html?q=13454
    7. ) Intermittent stage-in issues at the NET2 sites (BU and HU) -- from Saul: This problem appeared when the top prodsys directory reached 65K subdirectories (a hard GPFS limit). We have gotten around the limit, but don't yet understand the missing files below. We're working on it.
    8. ) Stage-out & DDM transfer problems at AGLT2 Tuesday afternoon (7/7). From Shawn: We had some strange gPlazma issues this afternoon. It required a restart of the dCache admin headnode (and some subsequent gPlazma restarts) to resolve. Things seem to be OK now but we will continue to watch it. See: prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/4475.
    9. ) Stage-in errors at BNL Tuesday morning (>600 failed jobs). RT 13463. According to Pedro this issue is the same one as previously reported in https://rt-racf.bnl.gov/rt/Ticket/Display.html?id=13418. Will follow up in that thread. Problem seems to be network-related.
    10. ) Maintenance downtime at AGLT2 tomorrow (Thursday, 7/9). From Bob: On Thursday this week, July 9, we will upgrade various components of our cluster. Many nodes will be rebuilt, and dCache and network outages should be expected during this time. The upgrades will commence at 8am EDT on July 9, and we hope to be finished with all work by 6pm. If we are back up earlier than this, or need more time, we will send notification. An OIM outage for this time has been set. New condor jobs will not start after 4pm on July 8. I will set our site offline (AGLT2 and ANALY_AGLT2) shortly before that time so that no Idle jobs will remain when compute nodes stop accepting new jobs.
    11. ) Problem with transfers to WISC resolved (7/7). From Wen: We had a gridftp problem which blocked some transfers this afternoon, so we stopped the SRM server to redefine the gridftp configuration. The problem has already been solved. GGUS ticket 50095.
    12. ) Tuesday evening, 7/7 -- MWT2_UC failed jobs with errors about missing files. From Charles: All of these failures were coming from a single WN - uct2-c194. There was a filesystem problem on this node earlier today which led to a large number of failures.... the problem is fixed now.
    13. ) 7/7: Test jobs submitted to UCITB_EDGE7. The jobs failed, most likely due to a missing value for "cmtConfig" in schedconfigdb. Working on this. See for example: http://panda.cern.ch:25980/server/pandamon/query?job=1014372939.
    • Follow-ups from earlier items:
      • Any updates about bringing Tufts into production? (NET2 site)
      • Slow network performance between UTD-HEP and BNL not fully understood, but the site has been running production stably for at least a week or more. Software release installations are still slow, but this should get resolved later this summer when a hardware issue with a file server is addressed.
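The GPFS hard limit NET2 hit in item 7 (65K subdirectories in the top prodsys directory) can be watched for before it is reached. A minimal monitoring sketch, assuming the limit is 65536 (the report only says "65K") and using a throwaway directory for the demo rather than a real prodsys path:

```python
import os
import tempfile

# Assumed value; the report only says "65K subdirectories (a hard GPFS limit)".
GPFS_SUBDIR_LIMIT = 65536

def subdir_headroom(path, limit=GPFS_SUBDIR_LIMIT):
    """Count immediate subdirectories of `path` and report remaining headroom."""
    count = sum(1 for e in os.scandir(path) if e.is_dir(follow_symlinks=False))
    return count, limit - count

# Demo on a throwaway directory with three subdirectories; at a site this
# would point at the top prodsys directory instead.
demo = tempfile.mkdtemp()
for name in ("a", "b", "c"):
    os.mkdir(os.path.join(demo, name))
count, remaining = subdir_headroom(demo)
```

Run periodically (e.g. from cron) and alert once `remaining` drops below a comfortable margin, well before job submission starts failing.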

  • this meeting:

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • AnalysisStep09PostMortem - Thanks to all sites for providing detailed info here. Could we use some more info from BNL? The ATLAS post-mortem was on July 1st. A summary was presented by Graeme in ATLAS Week yesterday. WLCG post-mortem coming up, July 9-10.
    • Status of DB access job at SWT2: Patrick managed to run the job by hand. He made a change in InstallArea/python/PyUtils/AthFile.py. Now we are waiting for a response from Sebastien Binet on this.
    • Stress testing of DB access jobs at other sites: BNL, AGLT2, and MWT2 passed the stress test (200 jobs, >95% success rate). NET2 still has input files with bad checksums. This job is to be put into HammerCloud.
    • Status of cosmic job: Instructions from Hong Ma. I ran this job fine at BNL. It only runs at BNL, not at other sites (tried AGLT2), since BNL provides local pool files via the poolcond directory at /usatlas/workarea/atlas/PFC/catalogue/poolcond. Hong commented that the regular DB release does not come with these pool files, so unless something is done as at BNL, the job will fail. Need help from experts on this.
    • User analysis challenge in US with step09 samples:
      • Two samples are now replicated to US sites: step09.00000011.jetStream_medcut.recon.AOD.a84/ (estimated total size 14900 GB, 9769 files, 97.69M events), step09.00000011.jetStream_lowcut.recon.AOD.a84/ (estimated total size 3674 GB, 2749 files, 27.49M events)
      • Validation work ongoing. SUSYValidation job is successful, now running on full sample at SWT2.
      • Day of analysis challenge to be announced by Jim C. Sites are scheduling downtimes, AGLT2 (July 9) and SLAC (needs 4 days).
    • User analysis issues:
      • User reports on problems with "R__unzip: error in inflate" at SWT2. Issue discussed at AnalysisStep09PostMortem by Patrick and Wei. From Wei today:
        The problem, if it is as what the xrootd developer identified before, is a cache replacement issue in the ROOT client, which only happens if a job reads from xrootd servers directly. If a file is copied to the batch node, this cache is not used. According to the developer, the offending file will _likely_ repeat the problem most of the time.
        I am not sure whether upgrading the xrootd client package at a site will help (it may). It is a little risky because the buggy ROOT client is embedded in ATLAS releases. Using LD_LIBRARY_PATH at a site will modify a lot of things, so we must be very careful. To solve the root of the problem, we need the xrootd developers to work with the ROOT team and ATLAS.
      • Wei: why did this show up during step09? How is it triggered? A ROOT patch needs to go into the ATLAS release. Happening at SWT2 and SLAC; user's jobs are successful at NET2. Perhaps turn off direct reading. Can we reproduce the error? Nurcan claims the error was seen for several types of HC jobs.
        • We need to have a lightweight/simple way to recreate problems.
        • Need to coordinate updates to ATLAS releases and updates to ROOT; contact through David Q
      • We need to address the problem of distributing the conditions data COOL files and XML files. This is a serious issue that is not being addressed by ADC operations. Fred will follow up with Jim Shank. We should also discuss this as an item at the L2 management meeting.
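The step09 sample figures quoted above are internally consistent, and the arithmetic is worth a quick check: both samples come out at exactly 10,000 events per AOD file, with medcut files averaging about 1.5 GB each. A small sanity-check sketch:

```python
# Sanity check of the step09 sample figures quoted above (sizes in GB,
# file and event counts as reported).
samples = {
    "medcut": {"size_gb": 14900, "files": 9769, "events": 97.69e6},
    "lowcut": {"size_gb": 3674, "files": 2749, "events": 27.49e6},
}

derived = {}
for name, s in samples.items():
    derived[name] = {
        "avg_file_gb": s["size_gb"] / s["files"],      # average AOD file size
        "events_per_file": s["events"] / s["files"],   # events per AOD file
    }
```

Knowing the per-file size and event count helps when estimating stage-in volume per analysis job (each job above reads 5 input AODs, i.e. roughly 7.5 GB and 50k events for medcut).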

  • this meeting:
    • Status of DB access job at SWT2 and SLAC: Still waiting to hear back from Sebastien Binet on the fix Patrick made on PyUtils/AthFile.py module.
    • Stress testing of DB access jobs at other sites: NET2 still has corrupted files in the second try from mc08.105200.T1_McAtNlo_Jimmy.merge.AOD.e357_s462_r635_t53/. BNL, AGLT2, and MWT2 had passed the test.
    • User analysis challenge in US with step09 samples: Validation work ongoing with SUSYValidation jobs, one of the four job types used in HammerCloud submissions. The medcut sample makes 1955 jobs, the lowcut sample makes 551 jobs, each reading 5 input AODs. Status at sites:
      • SWT2: 212 jobs failed from the medcut sample with errors "R__unzip: error in inflate". This error was first seen for HammerCloud jobs, and later with user jobs. Patrick manually patched the release (14.5.x) using the libXrdClient.so client library from the latest xrootd release (instead of the version shipped with ROOT v5.18.00f). I'm retrying these jobs now. Jobs on the lowcut sample are running fine.
      • SLAC: 100% success in first try. The above patch was applied by Wei.
      • AGLT2: 5 jobs failed out of 2506. One ran on a bad node, since taken offline. The other four failed with "Could not open connection ! lcg_cp: Communication error". From Bob: "These seem to occur randomly at a low level, and usually all at about the same time." I'm retrying these jobs now.
      • MWT2: 4 jobs failed out of 2506 with errors like "lfc_creatg failed", "lcg_cp: Communication error". All successful upon retry.
      • NET2: Missing files in one of the tid datasets from the medcut sample, even though it appears as complete in the dashboard. 122 jobs failed in the first run, 51 failed on retry. As for the lowcut sample, 141 jobs failed with problems of type "SFN not set in LFC for guid BA56DA5F-C25D-DE11-88D2-001E4F14A8A4 (check LFC server version)". Paul commented that this happens when the LFC is under high load and has been fixed in later LFC server versions. Saul to check with John. I have sent a retry.
      • BNL: 17 jobs failed out of 2506. Errors due to "lfc_getreplicas failed", "lfc-mkdir failed", "lfc_creatg failed". I'm retrying these jobs now.
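Several of the failure modes above ("lcg_cp: Communication error", "lfc_creatg failed", "lfc_getreplicas failed") are transient and succeeded on manual retry. Retrying with backoff can be automated; a minimal sketch, where the wrapped command in the demo is a trivial stand-in rather than a real transfer invocation:

```python
import subprocess
import sys
import time

def run_with_retries(cmd, attempts=3, base_delay=30):
    """Run `cmd`, retrying with exponential backoff on a nonzero exit code.

    Aimed at transient failures (e.g. "lcg_cp: Communication error") that
    tend to succeed on a later attempt; returns the last CompletedProcess.
    """
    for attempt in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # 30s, 60s, 120s, ...
    return result

# Demo with a trivial command standing in for a transfer tool invocation.
demo = run_with_retries([sys.executable, "-c", "print('ok')"], base_delay=0)
```

Backoff matters here: the LFC "SFN not set" errors reportedly occur under high load, so immediate retries would only add to the load that caused the failure.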

DDM Operations (Hiro)

Tier 3 issues (Doug)

  • last meeting(s)
    • CERN virtualization workshop - discuss regarding head node services.
    • BNL is providing some hardware for virtualizing Tier 3 clusters.
    • Considering rPath
  • this meeting

Conditions data access from Tier 2, Tier 3 (Fred)

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • last week:
    • Note:
      **ACTION ITEM** Each site needs to provide a date before the end of June for their throughput test demonstration. Send the following information to Shawn McKee and CC Hiro:
      a) Date of test (sometime after June 14 when STEP09 ends and before July 1)
      b) Site person name and contact information. This person will be responsible for watching their site during the test and documenting the result.
      For each site the goal is a graph or table showing either:
      i) 400 MB/sec (avg) for a 10GE connected site
      ii) Best possible result if you have a bottleneck below 10GE
      Each site should provide this information by close of business Thursday (June 11th). Otherwise Shawn will assign dates and people!!
    • Last week BNL --> MWT2_UC throughput testing, see: http://integrationcloud.campfirenow.com/room/192199/transcript/2009/06/18
    • Performance not as good as hoped, but 400 MB/s milestone reached (peak only)
      • For some reason the individual file transfers were low
    • NET2 - need 10G NIC
    • SLAC - may need additional gridftp servers
    • UTA - CPB - iperf to dcdoor10 tends to vary: 300-400 Mbps at times, mostly 50-75 Mbps; 700-800 Mbps into _UTA (probably the return path).
    • OU - need new storage.
    • AGLT2 - directional issues

    • 1 GB/s throughput milestone going on now.
    • August 10 target to deploy the new perfSONAR package.
    • We need to begin using these tools during our throughput testing.
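The milestones above can be put in context with simple line-rate arithmetic (ignoring protocol overhead): a 10GE link carries at most 1250 MB/s, so the 400 MB/s milestone is about a third of line rate, while the 1 GB/s milestone needs 80% of it, which is why sub-10GE sites cannot reach it.

```python
# Back-of-envelope check of the throughput milestones against a 10GE link.
# Raw line rate only; real transfers lose some capacity to protocol overhead.
line_rate_MBps = 10_000 / 8            # 10 Gbps = 1250 MB/s
milestone_fraction = 400 / line_rate_MBps    # 400 MB/s milestone
one_gbs_fraction = 1000 / line_rate_MBps     # 1 GB/s milestone
```

This is also why NET2's pending 10G NIC matters: on a 1GE path the ceiling is 125 MB/s, below even the 400 MB/s target.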

  • this week:

OSG 1.2 validation (Rob, Xin)

  • last week:
    • Testing in progress on UCITB-EDGE7 - requires a scheddb change.
    • Xin - adding BNL as a test site in Panda
  • this week:
    • testing on UCITB - problems with staging out output files.

Site news and issues (all sites)

  • T1:
    • last week: HPSS maintenance on July 14. 52 units of storage coming to BNL today; expect to have this completed quickly. Have decided against using Thor extension units (expense) - will use the FC-connected Nexsan units. Have submitted an order for 120 worker nodes (Nehalem) (3 MSI2K goal). 3 Force10 ExaScale network chassis. Observed a couple of bottlenecks during step09. Will get a 60 Gbps backbone. HPSS inter-mover upgrade to 10 Gbps. Note the ATLAS resource request is still under discussion, not yet approved by the LHCC; the resource request for Tier 2's is at the same level as we've known from before.
    • this week:

  • AGLT2:
    • last week: Downtime planned for tomorrow, 8 am to 6 pm. Need a fix for glue schema reporting. Updating BIOS and firmware for storage controllers. Jumbo frames on the public network. There were some outages during the last two weeks, but these are understood. AFS to be upgraded - hope this will stabilize things. gPlazma on the headnode: monitoring the gPlazma logfile.
    • this week:

  • NET2:
    • last week(s): Working on a number of DDM issues. Squid and Frontier client now installed, waiting for Fred to test. About to install new Myricom 10G cards, then will repeat throughput tests. Getting 130 TB of storage.
    • this week:

  • MWT2:
    • last week(s): Have been running smoothly. Saturated the 10G link between IU and UC, 14K files transferred. dCache seems to be stable now. No failures. ~5K transfers/hour. Trying smaller files to boost the SRM rate.
    • this week:

  • SWT2 (UTA):
    • last week: CPB running okay. Working on analysis queue issue discussed above. Ibrix system being upgraded for SWT2 cluster. Still working on squid validation w/ frontier clients.
    • this week:

  • SWT2 (OU):
    • last week:
    • this week:

  • WT2:
    • last week: Preload library problem fixed. Working on a procurement - ~10 Thor units. ZFS tuning. Looking for a window to upgrade the Thumpers. The latest fix for the xrootd client is in the latest ROOT release.
    • this week:

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding an Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
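Adler32 lends itself to on-the-fly calculation because it can be updated incrementally as data streams past, without buffering the whole file. A minimal sketch of the idea using Python's zlib (not the actual xrootd implementation, which is in C++):

```python
import zlib

def adler32_stream(chunks):
    """Compute an Adler32 checksum incrementally over an iterable of byte
    chunks, the way an on-the-fly implementation sees data as it streams
    through a server. A minimal sketch, not the actual xrootd code."""
    checksum = 1  # Adler32 initial value
    for chunk in chunks:
        checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

# The streaming result matches a one-shot checksum of the whole payload.
data = b"example payload for checksum"
chunks = [data[i:i + 8] for i in range(0, len(data), 8)]
streamed = adler32_stream(chunks)
one_shot = zlib.adler32(data) & 0xFFFFFFFF
```

This incremental property is what makes it cheap to hand the checksum to the gridftp server (and on to FTS) at the end of a transfer, with no second pass over the file.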

Local Site Mover


  • last week
  • this week

-- RobertGardner - 14 Jul 2009
