


Minutes of the Facilities Integration Program meeting, July 22, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Fred, Rob, John DeStefano, Sarah, Jim C, Justin, Patrick & Mark, Booker, Kaushik, Armen, Shawn, Nurcan, Torre, Bob, Charles, Wei
  • Apologies: Michael, Saul, Horst

Integration program update (Rob, Michael)

  • Introducing: SiteCertificationP10 - FY09Q04 NEW
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
  • Other remarks
    • last week(s)
    • this week:
      Since yesterday we have observed much higher pathena-based DA activity in the German and now also in the UK clouds (plot attached).
      The load profile is such that essentially all sites have jobs running (~2.5k per cloud). The efficiency is not great; most problems, as expected, are related to data access (staging files onto the WN, but also staging the output back to the SE).
      Given that the load is fairly evenly spread across sites, it looks like another centrally controlled exercise, probably managed by HC (I haven't looked at the jobs).
      We have to run this kind of exercise as well, to test the machinery as a whole and to focus on the critical issues (e.g. data access).  Now that we have the 100M event container we should be in a position where we can/should start conducting these tests.
      Please discuss this important point at today's computing meeting (I won't be able to attend).
      • analysis-day-summary.png:
      • Jim C is in contact with Massimo to coordinate an ATLAS-wide user analysis stress test; it looks like it will proceed cloud-by-cloud.
      • There is testing ongoing in Germany, but these are HC tests.
      • Looks like they're experimenting with dcap, optimizing the analysis queues.
      • Torre: will be adding statistics site by site, by processing (job) type, etc. Working on it with Sergei; expected in the next week or two. Jim C would like to have statistics by user (to identify poorly trained users).
      • Nurcan: there is an option in pathena to identify the processing type (e.g. HammerCloud); the users could set this, for example.
      • Jim C: expect to have users start next Monday.

Operations overview: Production (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
  • this meeting:
    • Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=0&materialId=0&confId=64809
       1)  IU_OSG -- new data and scratch locations setup -- 'schedconfigdb' updates -- successful test jobs -- site set back to 'on-line'. (7/20-21)
      2)  7/16 -- Installation of new network cards at BU completed.  Test jobs submitted / completed successfully -- site set back to 'on-line'. 
      3)  Thursday morning, 7/16, from Bob at AGLT2:
      From 8:10am to 8:45am EDT today, a network switch crashed and resulted in a rack of compute nodes going "unavailable".  This was long enough that all jobs running on the nodes were lost.  Approximately 160 jobs were running on these nodes. 
      4)  Thursday morning, 7/16 -- dCache maintenance at BNL -- from Pedro:
      Due to transfer failures (to and from other sites) and job failures due to timeouts, dCache will be shut down to clean up the trash directory and start PnfsManager in a clean environment.  After we've guaranteed this service is working well, we will need to restart all pools to clean up orphan files.  When everything has finished, we will start dCache again.
      5)  Thursday evening, 7/16 -- issue with high transfer error rates between BNL and other tier 1's resolved -- from Pedro:
      There were some pools which couldn't be brought online due to an OS problem.  This has been fixed. All pools are available now.
      And later same evening:
      Our pnfs server had a high load due to the postgres vacuum.  SRM service usually gets stuck and queues requests up to the point where the queue in front of tomcat fills up and then it just drops HTTP requests.  During vacuum operation, the dbserver processes and the database stopped working.  The pnfs service had to be stopped, then killed and only then the load on the machine dropped.  Services were then resumed.  Once again, everything should be back to normal now.
      6)  Friday morning, 7/17 -- ~200 failed jobs at SWT2_CPB with the error "!!FAILED!!2990!! Required CMTCONFIG (i686-slc4-gcc34-opt) incompatible with that of local system (.: usage: .filename [arguments]) !!FAILED!!3000!! Setup failed"
      All of the failed jobs ran on the same WN (c8-38).  This node was removed from the batch system -- still investigating the cause.
      7)  Friday, 7/17 -- gk host cert expired at HU_ATLAS_Tier2 -- new cert installed on Tuesday (7/21) -- test jobs submitted -- site set back to 'on-line'.
      8)  Site UCITB_EDGE7 is now available (schedconfigdb updates, local site mover, etc.).  Various test jobs submitted to the site, including atlas release 15 jobs, as requested by Xin.
      9)  7/19-20 -- Site IllinoisHEP set 'off-line' -- jobs were failing due to a missing DBRelease file -- file re-subscribed -- site set back to 'on-line'.
      10)  7/21 -- jobs were failing at BNL with the error:
      "Error details: pilot: Put error: Voms proxy certificate does not exist or is too short. Command output: This node runs platform i386 313302|Log put error: Voms proxy certificate does not exist or is too short. Command output: This node runs platform i386 31326"  
      Issue resolved -- from Xin:
      It's due to the setup file change: the new setup file printed out the platform info, as you can see below ("This node runs platform i386"). The additional stdout appeared to confuse the pilot into thinking the voms-proxy is not valid. I commented out the "echo" line in the setup file shortly after I saw this error early this morning. The new jobs have run fine. (A sketch of a more defensive way to parse such command output follows after this list.)
      11)  Wednesday morning, 7/22 -- pilot update from Paul (37m).  Details:
      * Wrong guid problem. The previous pilot version generated a new guid for an output file if it failed to read the guid from the PoolFileCatalog file during job recovery. The pilot will now generate an error instead, 1123/EXEPANDA_MISSINGGUID/"Missing guid in output file list". It is not known why the pilot could not read the guid from the PFC in the found cases since the batch logs are only available for a short time.
      * The storage paths for output files in LocalSiteMover are now generated in the same way as in lcgcp site movers.
      * It is now possible to set up analysis trfs on Nordugrid. Andrej is testing.
      * Corrected the Nordugrid trf setup for Athena post 14.5 (same as Graeme's fix from v 37l).
      * For analysis jobs killed by users, the pilot will now kill the running processes in reverse order so that the athena process is killed first. Also, the time delay between the SIGTERM and SIGKILL signals has been increased from 10 to 60s (after killing the first process) to allow the athena stdout buffer to be properly flushed.
      12)  Follow-ups from earlier reports: 
      (i)  Proposed dates for a maintenance outage at SLAC -- from Wei:
      WT2/SLAC is looking for an outage window to address a number of things: electrical work, ATLAS storage OS upgrade, etc. We are currently looking at the week of Aug 17-21 to get all of this done. It is a tentative schedule, but we hope we won't be off by too much. Please take this into consideration when you plan to use ATLAS resources at SLAC, and let me know of any concerns.
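
As an aside on item 10, here is a minimal sketch (not the pilot's actual code) of a more defensive way to check the proxy lifetime, tolerant of extra chatter such as a platform banner echoed by a setup script. The voms-proxy-info -timeleft call prints the remaining lifetime in seconds; the 3600 s threshold is an assumption for illustration.

```python
import subprocess

def proxy_is_valid(min_seconds=3600):
    """Return True if the VOMS proxy has at least min_seconds of lifetime left.

    Rather than treating the whole command output as the lifetime value, scan
    for the first purely numeric line; extra lines such as
    'This node runs platform i386' (from an echo in a setup script) are ignored.
    Illustrative sketch only.
    """
    proc = subprocess.Popen(["voms-proxy-info", "-timeleft"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    output = proc.communicate()[0]
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():                 # the lifetime, in seconds
            return int(line) >= min_seconds
    return False                           # no numeric value found: treat the proxy as invalid
```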

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Status of DB access job at SWT2 and SLAC: Still waiting to hear back from Sebastien Binet on the fix Patrick made on PyUtils/AthFile.py module.
    • Stress testing of DB access jobs at other sites: NET2 still has corrupted files in the second try from mc08.105200.T1_McAtNlo_Jimmy.merge.AOD.e357_s462_r635_t53/. BNL, AGLT2, and MWT2 have passed the test.
    • User analysis challenge in US with step09 samples: Validation work ongoing with SUSYValidation jobs, one of the four job types used in HammerCloud submissions. The medcut sample makes 1955 jobs, the lowcut sample makes 551 jobs, each reading 5 input AOD's. Status at sites:
      • SWT2: 212 jobs failed from the medcut sample with errors "R__unzip: error in inflate". This error was first seen for HammerCloud jobs, and later with user jobs. Patrick manually patched the release (14.5.x) using the libXrdClient.so client library from the latest xrootd release (instead of the version shipped with ROOT v5.18.00f). I'm retrying these jobs now. Jobs on the lowcut sample are running fine.
      • SLAC: 100% success in first try. The above patch was applied by Wei.
      • AGLT2: 5 jobs failed out of 2506. One ran on a bad node, which was taken offline. The other four failed with "Could not open connection ! lcg_cp: Communication error ". From Bob: "These seem to occur randomly at a low level, and usually all at about the same time." I'm retrying these jobs now.
      • MWT2: 4 jobs failed out of 2506 with errors like "lfc_creatg failed", "lcg_cp: Communication error". All successful upon retry.
      • NET2: Missing files in one of the tid datasets from the medcut sample, even though it appears as complete in the dashboard. 122 jobs failed in the first run, 51 failed on retry. As for the lowcut sample, 141 jobs failed with problems of type "SFN not set in LFC for guid BA56DA5F-C25D-DE11-88D2-001E4F14A8A4 (check LFC server version)". Paul commented that this happens when the LFC is under a high load and has been fixed in later LFC server versions. Saul to check with John. I have sent a retry.
      • BNL: 17 jobs failed out of 2506. Errors due to "lfc_getreplicas failed", "lfc-mkdir failed", "lfc_creatg failed". I'm retrying these jobs now.
  • this meeting:
    • Status of DB access job at SWT2 and SLAC: Sebastien Binet reported that a fix was made to PyUtils/AthFile.py module while he was away. He is checking on the new tag made.
    • Stress testing of DB access jobs at other sites: Saul reported that there are corrupted files from the input dataset.
    • CERN ROOT libs in the ATLAS release: The manual patch worked at SLAC and SWT2 for 14.5.x and 15.x. Requested HammerCloud jobs using release 15.1.0; successful at both sites. Wei contacted David Quarrie. Response from David: "ATLAS does not plan on moving to ROOT 5.24.00 prior to the forthcoming LHC run, and the schedule for doing such an upgrade has not yet been decided. However, there are discussions ongoing whether to try to upgrade to a patched 5.22.00 version in the next week or so. I've added Rene and Pere to the thread since it might be possible to backport the fix to this new release (or it might already have been backported)."
    • User analysis challenge in US with step09 samples: All sites are ready except NET2.
      • NET2: missing files from the medcut sample. Are these completed now? Problems with "SFN not set in LFC for guid" for the lowcut sample. Paul commented: "There is indeed no SFN set (replica above) which is forbidden, i.e. you can not add an LFC entry without setting the SFN. Jean-Philippe told me a month ago that the only way this can happen is if the SFN entry has been modified after the registration". This needs investigation.
      • Any updates on LFC server upgrades? These should fix the problems with "lfc_getreplicas failed", "lfc-mkdir failed", "lfc_creatg failed", etc.
    • AGLT2 requested a HammerCloud test after their reconfiguration: Tests 513-516 submitted (http://gangarobot.cern.ch/hc), any news on the results from Bob?
    • ANALY_SWT2_CPB was again set offline w/o a reason we are aware of. Kaushik is following up with Alden (modification time and DN to be recorded).
    • Coming up: a site statistics table from Torre/Sergey, i.e. a table that will record site statistics on a daily basis, by site/cloud and by processing type. This will allow improved monitoring, performance analysis and problem detection without stressing the job table. Autogenerated plots of site analysis performance metrics (like job wait time and fine-grained execution times) will be added to the monitor so we can better examine/compare site performance and target problems and improvements. (A toy aggregation sketch follows after this list.)
    • Send HC requests to the stress-testing-coordination list. Specify the duration.
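
To make the daily statistics table idea above concrete, here is a toy aggregation sketch. The record fields used (day, cloud, site, processing_type, status, wait_seconds) are assumptions for illustration, not PanDA's actual schema.

```python
from collections import defaultdict

def daily_site_stats(job_records):
    """Roll job records up by (day, cloud, site, processing type) so a monitor
    can be driven from a small summary table instead of the full job table.
    Toy sketch; field names are assumed for illustration."""
    stats = defaultdict(lambda: {"jobs": 0, "failed": 0, "wait_time": 0.0})
    for job in job_records:
        key = (job["day"], job["cloud"], job["site"], job["processing_type"])
        entry = stats[key]
        entry["jobs"] += 1
        entry["failed"] += 1 if job["status"] == "failed" else 0
        entry["wait_time"] += job["wait_seconds"]
    for entry in stats.values():
        entry["avg_wait"] = entry["wait_time"] / entry["jobs"]   # average wait per key
    return dict(stats)
```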

DDM Operations (Hiro)

Tier 3 issues (Doug)

  • last meeting(s)
    • CERN virtualization workshop - discussion regarding head node services.
    • BNL is providing some hardware for virtualizing Tier 3 clusters.
    • Considering Rpath
  • this meeting
    • On vacation

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • last week
  • this week
    • Meeting on Tuesday: Richard Hawkings, David Front, Sasha, John, Fred, and Shawn met to plan for distributing conditions data.
    • Richard hopes to get changes to access in time for release 15.4.0, scheduled for July 29th.
    • Working on understanding how to access locally available conditions data from jobs.
    • Fred making sure management is informed.
    • The setup will allow using Frontier if an environment variable for the squid cache is set, indicating which Frontier server to use (see the sketch after this list).
    • Sites define the variable to locate their squid.
    • Fred iterating with Alessandro.
    • Will be set by Xin's setup script.
    • AGLT2 will be the test site.
    • Fred and John will serve as contacts.
    • NET2: Fred will be testing their squid.
    • Xin: when release is installed, it inserts the variable into a "sites" package in the release area.
    • UTA - installed, just need to test.
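
A minimal sketch of the selection logic described above, assuming the site-defined variable is called FRONTIER_SERVER and carries both the Frontier server URL and the local squid proxy; the actual variable name and format are set by the release/setup machinery (e.g. Xin's setup script) and may differ.

```python
import os

def conditions_access():
    """Decide how a job should read conditions data, per the scheme above.

    Assumption: the site/setup script exports FRONTIER_SERVER, e.g.
    "(serverurl=http://frontier.example.org:8000/atlr)(proxyurl=http://squid.example.org:3128)"
    (hostnames are placeholders).  If it is unset, fall back to whatever local
    conditions access the job would otherwise use.
    """
    frontier = os.environ.get("FRONTIER_SERVER")
    if frontier:
        return ("frontier", frontier)   # route conditions reads via Frontier + local squid
    return ("local", None)              # e.g. local replica or SQLite files

print(conditions_access())
```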

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
  • this week:
    • See meeting MinutesDataManageJul21
    • Complaints from users about getting their datasets from sites. Charles's consistency checker ccc.py is finding datasets that are marked complete but are missing files (a toy version of the check is sketched after this list).
    • In discussions w/ the DQ2 developers about this issue. Important: pathena will run if it thinks the dataset is complete, but will silently skip missing files - unless users are paying attention.
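
A toy version of the kind of consistency check ccc.py performs, under the assumption that we already have the list of files the catalog claims for each dataset and the set of files actually present at the site; the real checker talks to DQ2/LFC and the storage element, which this sketch does not.

```python
def find_incomplete_datasets(catalog_files, files_on_storage):
    """Report datasets that the catalog marks complete but that are missing
    files at the site (toy check; inputs are plain Python objects)."""
    problems = {}
    for dataset, lfns in catalog_files.items():
        missing = set(lfns) - files_on_storage
        if missing:
            problems[dataset] = sorted(missing)
    return problems

# Usage sketch with made-up names: one file of the dataset is absent locally.
catalog = {"some.dataset.AOD.x": ["AOD.001.pool.root", "AOD.002.pool.root"]}
on_disk = {"AOD.001.pool.root"}
print(find_incomplete_datasets(catalog, on_disk))   # {'some.dataset.AOD.x': ['AOD.002.pool.root']}
```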

Throughput Initiative (Shawn)

  • last week:
    • Note:
      **ACTION ITEM**  Each site needs to provide a date before the end of June for their throughput test demonstration.  Send the following information to Shawn McKee and CC Hiro:
      a)  Date of test (sometime after June 14 when STEP09 ends and before July 1)
      b)  Site person name and contact information.  This person will be responsible for watching their site during the test and documenting the result.
      For each site the goal is a graph or table showing either:
      i)   400 MB/sec (avg) for a 10GE connected site
      ii)  Best possible result if you have a bottleneck below 10GE
      Each site should provide this information by close of business Thursday (June 11th).  Otherwise Shawn will assign dates and people!!
    • Last week BNL --> MWT2_UC throughput testing, see: http://integrationcloud.campfirenow.com/room/192199/transcript/2009/06/18
    • Performance not as good as hoped, but 400 MB/s milestone reached (peak only)
      • For some reason the individual file transfers were low
    • NET2 - need 10G NIC
    • SLAC - may need additional gridftp servers
    • UTA - CPB - iperf tends to vary to dcdoor10. 300-400 Mbps, mostly 50-75 Mbps; 700-800 Mbps into _UTA (probably is coming back).
    • OU - need new storage.
    • AGLT2 - directional issues
    • 1 GB/s throughput milestone going on now.
    • August 10 target to deploy new perfsonar package.
    • We need to begin using these tools during our throughput testing.
  • this week:
    • perfsonar status: rc1 of the next version is available. Karthik will help test this version. When released, we want to deploy it at all sites by August 10. It adds new one-way delay (latency) measurements.
    • Throughput tests of last week: not quite able to reach 1 GB/s. The virtual circuits are not performing as well as they had. Mike O'Conner is studying the flows, seeing lost-packet ramps when circuits are in place. The packet loss is not caused by the circuit itself. Will use UC as a test case to study this.
    • Hiro: will run regular tests of 20 files, targeting 600 MB/s, at least once per day to each site (a scoring sketch follows below).
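
For reference, a small sketch of how a burst test like the one above (roughly 20 files per burst, aiming at 600 MB/s) could be scored; the input format (bytes, start time, end time per file) is an assumption for illustration.

```python
def aggregate_throughput(transfers):
    """transfers: list of (bytes_transferred, start_epoch, end_epoch), one per file.
    Returns the aggregate rate in MB/s over the burst's wall-clock window."""
    total_bytes = sum(b for b, _, _ in transfers)
    start = min(s for _, s, _ in transfers)
    end = max(e for _, _, e in transfers)
    return total_bytes / float(end - start) / 1e6

# Example: 20 files of 3.6 GB each moved within a 2-minute window -> 600 MB/s.
burst = [(3.6e9, 0, 120) for _ in range(20)]
print(aggregate_throughput(burst))   # 600.0
```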

OSG 1.2 validation (Rob, Xin)

Site news and issues (all sites)

  • T1:
    • last week: HPSS maintenance on July 14. 52 units of storage coming to BNL today; expect to have this completed quickly. Have decided against using Thor extension units (expense) - will use the FC-connected Nexsan units. Have submitted an order for 120 worker nodes (Nehalem) (3 MSI2K goal). 3 ExaScale F10 network chassis. Observed a couple of bottlenecks during step09; will get a 60 Gbps backbone. HPSS inter-mover upgrade to 10 Gbps. Note the ATLAS resource request is still under discussion, not yet approved by the LHCC; the resource request for Tier 2's is at the same level as we've known from before.
    • this week: no report.

  • AGLT2:
    • last week: Downtime planned for tomorrow, 8 am to 6 pm. Need a fix for glue schema reporting. Updating BIOS and firmware for the storage controllers. Jumbo frames on the public network. There were some outages during the last two weeks, but these are understood. AFS to be upgraded - hope this will stabilize things. gplazma on the head node; monitoring the gplazma logfile.
    • this week: working well generally, except for the strange DQ2 issue; the fix was to rebuild the DB and reboot. There were a large number of connections back to the central database - not quite sure what the cause was. See Bob's comparison of recent HC tests: found better results, but mixed. Looked at one of the jobs w/ poor I/O characteristics.

  • NET2:
    • last week(s): Working on a number of DDM issues. Squid and frontier client now installed, waiting for Fred to test. About to install new Myricom 10G cards, then will repeat the throughput tests. Getting 130 TB of storage.
    • this week: Recovering missing files from datasets. HU site problem - just needed to restart the services w/o ssh X forwarding.

  • MWT2:
    • last week(s): Have been running smoothly. Saturated the 10G link between IU and UC; 14K files transferred. dCache seems to be stable now, no failures. ~5K transfers/hour. Trying smaller files to boost the SRM rate.
    • this week: New block of public IPs from campus networking - will migrate our dCache pools to public IPs, which should improve input rates. dCache outage tomorrow.

  • SWT2 (UTA):
    • last week: CPB running okay. Working on analysis queue issue discussed above. Ibrix system being upgraded for SWT2 cluster. Still working on squid validation w/ frontier clients.
    • this week:

  • SWT2 (OU):
    • last week:
    • this week: Karthik: all is well.

  • WT2:
    • last week: preload library problem fixed. Working on a procurement of ~10 Thor units. ZFS tuning. Looking for a window to upgrade the Thumpers. The latest fix for the xrootd client is in the latest ROOT release.
    • this week: downtime for power outage - planned during the week of Tier 2 meeting.

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server (a generic streaming-checksum sketch follows after this list).
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
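
A generic streaming Adler32 sketch (not Alex's xrootd implementation), just to show that the checksum discussed above can be computed chunk by chunk, i.e. "on the fly", and rendered as the zero-padded hex string that checksum comparisons (e.g. via gridftp/FTS) typically expect.

```python
import zlib

def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler32 checksum of a file without holding it in memory,
    updating the running value one chunk at a time."""
    value = 1                                   # Adler32 seed value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)  # fold this chunk into the running checksum
    return "%08x" % (value & 0xffffffff)        # zero-padded hex, non-negative on all Python versions
```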

Local Site Mover


  • last week
  • this week
    • none

-- RobertGardner - 21 Jul 2009

  • attached by mistake - well, it's pretty:


jpg CE0127M.jpg (380.9K) | RobertGardner, 22 Jul 2009 - 12:16 |
png analysis-day-summary.png (29.0K) | RobertGardner, 22 Jul 2009 - 12:18 |