
MinutesJul29

Introduction

Minutes of the Facilities Integration Program meeting, July 29, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: John De Stefano, Rob, Saul, Tom, Michael, Kaushik, Bob, Torre, John, Wei, Nurcan, Justin, Armen, Rupom, Mark, Charles, Doug, Fred
  • Apologies: none

Integration program update (Rob, Michael)

  • Introducing: SiteCertificationP10 - FY09Q04 NEW
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
  • Other remarks
    • last week(s)
      Since yesterday we have observed much higher pathena-based DA activity in the German and now also the UK clouds (plot attached).
      The load profile is such that nearly all sites have jobs running (~2.5k per cloud). The efficiency is not great; most problems, as expected, are related to data access (staging files into the WN, but also staging the output back to the SE).
      Given that the load is spread fairly evenly across sites, it looks to me like another centrally controlled exercise, probably managed by HammerCloud (I haven't looked at the jobs).

      We have to run this kind of exercise as well, to test the machinery as a whole and to focus on the critical issues (e.g. data access). Now that we have the 100M event container we should be in a position to start conducting these tests.
      • Jim C is in contact with Massimo, coordinating an ATLAS-wide user analysis stress test; it looks like it will run cloud-by-cloud.
      • They are testing in Germany - but these are HC tests.
      • Looks like they're experimenting with dcap - optimizing analysis queues
      • Torre: will be adding statistics by site, processing (job) type, etc. Working on it with Sergei; expected in the next week or two. Jim C would like to have statistics by user (to identify poorly trained users).
      • Nurcan: there is an option in pathena to identify the processing type (e.g. HammerCloud); users could set this, for example.
      • Jim C: expect to have users start next Monday.
    • this week:
      • We have an ESD reprocessing activity coming up in the next two weeks. This will involve the Tier 2's. There is about 35 TB of data to be reprocessed.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Thanks again to everyone for keeping sites running well; the production rate is quite good.
    • Starting to run out of queue-filler jobs. Generating another batch of JF35 and JF17. Will build more containers - we may be up to 300M events(!). Kaushik will look into creating merge jobs.
    • Rock solid performance of sites in the past few weeks.
    • Starting to identify more regional production tasks.
  • this week:
    • MC production: we saw high failure rates this morning (27K jobs) due to a task problem. Event numbers became too large and hit an Athena limit (in evgen); the problem appeared at 250M events (2B unfiltered). A request was sent to Borut/Pavel to abort the task, and a ticket was submitted. Not sure how many days of processing we have left, since this large sample is dead.
    • Sites continuing to do well. Good efficiency.
    • Reprocessing exercise forthcoming: ESD-based reprocessing to start mid-August. It won't take long, but the data could be on tape. Armen is looking into which datasets will be required and working with Hiro to get them staged. The SL5 upgrade has exposed a CORAL issue for releases 15.1-3. All Tier 2's should participate. There are concerns that hacks may remain from the fast reprocessing done previously; we want to continue to have this capability at the Tier 2's, so we should test it.
    • Update: Borut has created a new tag and series for the dataset, so we can continue generating new events; therefore there should be no disruption. Kaushik was also able to get access to proddb.

Shifters report (Mark)

  • Reference
  • last meeting:
    • Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=0&materialId=0&confId=64809
       1)  IU_OSG -- new data and scratch locations setup -- 'schedconfigdb' updates -- successful test jobs -- site set back to 'on-line'. (7/20-21)
      2)  7/16 -- Installation of new network cards at BU completed.  Test jobs submitted / completed successfully -- site set back to 'on-line'. 
      3)  Thursday morning, 7/16, from Bob at AGLT2:
      From 8:10am to 8:45am EDT today, a network switch crashed and resulted in a rack of compute nodes going "unavailable".  This was long enough that all jobs running on the nodes were lost.  Approximately 160 jobs were running on these nodes. 
      4)  Thursday morning, 7/16 -- dCache maintenance at BNL -- from Pedro:
      Due to transfer failures (to and from other sites) and job failures due to timeouts, dCache will be shut down to clean up the trash directory and start PnfsManager in a clean environment.  After we've guaranteed this service is working well, we will need to restart all pools to clean up orphan files.  When everything has finished, we will start dCache again.
      5)  Thursday evening, 7/16 -- issue with high transfer error rates between BNL and other Tier 1's resolved -- from Pedro:
      There were some pools which couldn't be brought online due to an OS problem.  This has been fixed. All pools are available now.
      And later same evening:
      Our pnfs server had a high load due to the postgres vacuum.  SRM service usually gets stuck and queues requests up to the point where the queue in front of tomcat fills up and then it just drops HTTP requests.  During vacuum operation, the dbserver processes and the database stopped working.  The pnfs service had to be stopped, then killed and only then the load on the machine dropped.  Services were then resumed.  Once again, everything should be back to normal now.
      6)  Friday morning, 7/17 -- ~200 failed jobs at SWT2_CPB with the error:
      "!!FAILED!!2990!! Required CMTCONFIG (i686-slc4-gcc34-opt) incompatible with that of local system (.: usage: .filename [arguments]) !!FAILED!!3000!! Setup failed"
      All of the failed jobs ran on the same WN (c8-38).  This node was removed from the batch system -- still investigating the cause.
      http://savannah.cern.ch/bugs/?53326
      7)  Friday, 7/17 -- gk host cert expired at HU_ATLAS_Tier2 -- new cert installed on Tuesday (7/21) -- test jobs submitted -- site set back to 'on-line'.
      8)  Site UCITB_EDGE7 is now available (schedconfigdb updates, local site mover, etc.).  Various test jobs submitted to the site, including atlas release 15 jobs, as requested by Xin.
      9)  7/19-20 -- Site IllinoisHEP set 'off-line' -- jobs were failing due to a missing DBRelease file -- file re-subscribed -- site set back to 'on-line'.
      10)  7/21 -- jobs were failing at BNL with the error:
      "Error details: pilot: Put error: Voms proxy certificate does not exist or is too short. Command output: This node runs platform i386 313302|Log put error: Voms proxy certificate does not exist or is too short. Command output: This node runs platform i386 31326"  
      Issue resolved -- from Xin:
      It's due to the setup file change: the new setup file printed out the platform info, as you can see below ("This node runs platform i386"). The additional stdout appeared to confuse the pilot into thinking the voms-proxy is not valid. I commented out the "echo" line in the setup file shortly after I saw this error early this morning. The new jobs have run fine.
      11)  Wednesday morning, 7/22 -- pilot update from Paul (37m).  Details:
      * Wrong guid problem. The previous pilot version generated a new guid for an output file if it failed to read the guid from the PoolFileCatalog file during job recovery. The pilot will now generate an error instead, 1123/EXEPANDA_MISSINGGUID/"Missing guid in output file list". It is not known why the pilot could not read the guid from the PFC in the found cases since the batch logs are only available for a short time.
      * The storage paths for output files in LocalSiteMover are now generated in the same way as in lcgcp site movers.
      * It is now possible to set up analysis trfs on Nordugrid. Andrej is testing.
      * Corrected the Nordugrid trf setup for Athena post 14.5 (same as Graeme's fix from v 37l).
      * For analysis jobs killed by users, the pilot will now kill the running processes in reverse order so that the athena process is killed first. Also, the time delay between the SIGTERM and SIGKILL signals has been increased from 10 to 60s (after killing the first process) to allow the athena stdout buffer to be properly flushed. (A sketch of this kill ordering appears after this list.)
      12)  Follow-ups from earlier reports: 
      (i)  Proposed dates for maintenance outage at SLAC -- from Wei:
      WT2/SLAC is looking for outage window to address a number of things: electric works, ATLAS storage OS upgrade, etc. We are currently looking at the week of Aug 17-21 to get all these done. It is a tentative schedule but we hope we won't be off by too much. Please put this into consideration when you plan to use ATLAS resources at SLAC, and let me know of any concerns.
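    • A minimal sketch (Python) of the reverse-order kill described in item 11 above, under the assumption of a simple parent-first PID list; the helper below is illustrative, not the actual pilot code:
      import os
      import signal
      import time

      def kill_job_processes(pids, grace_seconds=60):
          """Kill job processes in reverse order so the athena payload dies first.

          pids is assumed to be ordered parent-first (e.g. [wrapper, trf, athena]).
          After SIGTERM reaches the first (i.e. last-started) process, wait
          grace_seconds so athena can flush its stdout buffer, then SIGKILL
          anything still alive.
          """
          for i, pid in enumerate(reversed(pids)):
              try:
                  os.kill(pid, signal.SIGTERM)   # polite request to exit
              except OSError:
                  continue                       # process already gone
              if i == 0:
                  time.sleep(grace_seconds)      # let athena flush its stdout
          for pid in reversed(pids):
              try:
                  os.kill(pid, signal.SIGKILL)   # force anything left over
              except OSError:
                  pass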
  • this meeting:
    • Fred would like to discuss the handling of this rt ticket: https://rt-racf.bnl.gov/rt/Ticket/Display.html?id=13449.
      • A GGUS ticket was filed for a data transfer problem between NET2 and SWT2 and routed to the SWT2 queue. This was the wrong queue - the problem was with NET2.
      • SWT2 made several attempts to indicate the problem was not at SWT2. The ticket sat for 3 weeks with no action.
      • The GOC complained to Fred. Who was responsible for switching the ticket?
      • Who is monitoring the queue, and who is responsible? Could SWT2 have the ability to move a ticket to another queue? (Answer: yes)
      • Michael - will discuss with Tier 1 staff.
      • Note: in future, contact John De Stefano for any RT concerns (unofficially)

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Status of DB access job at SWT2 and SLAC: Sebastien Binet reported that a fix was made to the PyUtils/AthFile.py module while he was away. He is checking on the new tag that was made.
    • Stress testing of DB access jobs at other sites: Saul reported that there are corrupted files from the input dataset.
    • CERN ROOT libs in ATLAS release: Manual patch worked at SLAC and SWT2 for 14.5.x and 15.x. Requested HammerCloud jobs from release 15.1.0, successful at both sites. Wei contacted David Quarrie. Response from David: "ATLAS does not plan on moving to ROOT 5.24.00 prior to the forthcoming LHC run, and the schedule for doing such an upgrade has not yet been decided. However, there are discussions ongoing whether to try to upgrade to a patched 5.22.00 version in the next week or so. I've added Rene and Pere to the thread since it might be possible to backport the fix to this new release (or it might already have been backported)."
    • User analysis challenge in US with step09 samples: All sites are ready except NET2.
      • NET2: missing files from medcut sample. Are these complete now? Problems with "SFN not set in LFC for guid" for the lowcut sample. Paul commented: "There is indeed no SFN set (replica above) which is forbidden, i.e. you can not add an LFC entry without setting the SFN. Jean-Philippe told me a month ago that the only way this can happen is if the SFN entry has been modified after the registration". This needs investigation.
      • Any updates on LFC server upgrades?: to supposedly fix the problems with "lfc_getreplicas failed", "lfc-mkdir failed", "lfc_creatg failed", etc.
    • AGLT2 requested a HammerCloud test after their reconfiguration: Tests 513-516 submitted (http://gangarobot.cern.ch/hc), any news on the results from Bob?
    • ANALY_SWT2_CPB was again set to offline w/o a reason we are aware of. Kaushik is following up with Alden (modification time and DN to be recorded).
    • Coming up: a site statistics table from Torre/Sergey - a table that will record site statistics on a daily basis, by site/cloud and by processing type. This will allow improved monitoring, performance analysis, and problem detection without stressing the job table. Autogenerated plots of site analysis performance metrics (like job wait time and fine-grained execution times) will be added to the monitor so we can better examine/compare site performance and target problems and improvements.
    • Send HC requests to stress-testing-coordination list. Specify the duration.
  • this meeting:
    • Any news from NET2 on the missing files from the medcut sample (mc08.105807.JF35_pythia_jet_filter.merge.AOD.e418_a84_t53_tid070499 in particular) and on the registration problems (SFN not set in LFC for guid ...) for the dataset step09.00000011.jetStream_lowcut.recon.AOD.a84/ ?
      • Saul: these have been replaced, should be good to go. Nurcan: worried registration problem has not been addressed. Saul will track this down.
    • How ANALY_SWT2_CPB was being set offline: understood. Aaron reported that 'client=crawl-66-249-71-XXX.googlebot.com' had been setting 'ANALY_SWT2_CPB-pbs' offline without using a proxy. An https and proxy requirement was put in place for the curl command (a sketch appears after this list).
    • SLAC requested a HammerCloud test last Friday: http://gangarobot.cern.ch/hc/all/test/, tests 533-536, all completed. Any news on the results?
    • ANALY_MWT2_SHORT - decommissioned. Charles is working with Alden to get the internal name straightened out.
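    • A hedged sketch of what a status-change request could look like once the https and proxy requirement is in place (see the ANALY_SWT2_CPB item above); the host, path, and query parameters are placeholders, not the real Panda monitor interface:
      import ssl
      import http.client

      # Placeholder endpoint - not the actual Panda monitor URL or parameters.
      HOST = "panda.example.org"
      PATH = "/server/pandamon/query?tpmes=setoffline&queue=ANALY_SWT2_CPB-pbs"

      # The server now requires https plus a grid proxy certificate, so an
      # anonymous crawler (e.g. googlebot) can no longer flip a queue offline.
      context = ssl.create_default_context()
      context.load_cert_chain("/tmp/x509up_u1000")  # proxy file holds cert and key

      conn = http.client.HTTPSConnection(HOST, context=context)
      conn.request("GET", PATH)
      response = conn.getresponse()
      print(response.status, response.read())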

DDM Operations (Hiro)

Tier 3 issues (Doug)

  • last meeting(s)
    • CERN virtualization workshop - discussion regarding head node services.
    • BNL is providing some hardware for virtualizing Tier 3 clusters.
    • Considering rPath.
  • this meeting
    • Working out details for the upcoming Tier 2/Tier 3 workshop - would like to discuss facilities at next week's Americas workshop at NYU.
    • Working on costing of analysis facilities at SLAC and BNL.
    • Progress on virtualization work is slow.
    • Will provide recipes in advance of the UC workshop for setting up a Tier 3.
    • Modifications to Panda are needed for data transfers to Tier 3s.

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • last week
    • https://twiki.cern.ch/twiki/bin/view/Atlas/RemoteConditionsDataAccess
    • Meeting on Tuesday with Richard Hawkings, David Front, Sasha, John, Fred, and Shawn to plan for distributing conditions data.
    • Richard hopes to get changes to access in time for release 15.4.0, scheduled for July 29th.
    • Working on understanding how to access locally available conditions data from jobs.
    • Fred making sure management is informed.
    • The setup will allow Frontier to be used if an environment variable for the squid cache is set, indicating which Frontier server to use (see the sketch after this list).
    • Sites define the variable to locate their squid.
    • Fred iterating with Alessandro.
    • Will be set by Xin's setup script.
    • AGLT2 will be the test site.
    • Fred and John will serve as contacts.
    • NET2: Fred will be testing their squid.
    • Xin: when release is installed, it inserts the variable into a "sites" package in the release area.
    • UTA - installed, just need to test.
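    • A minimal sketch of the environment-variable logic described above: if the variable locating the site squid is set, jobs use Frontier through that squid, otherwise they fall back to direct database access. The variable name FRONTIER_SERVER and the connection-string formats are assumptions for illustration, not the agreed convention:
      import os

      def conditions_connection():
          """Pick a conditions-data connection based on the site's squid variable.

          FRONTIER_SERVER is an assumed variable name here; the real name and
          format will come from Xin's setup script and the site configuration.
          """
          server = os.environ.get("FRONTIER_SERVER")
          if server:
              # Variable defined by the site: use Frontier via the local squid cache.
              return "frontier://%s" % server
          # Variable not set: fall back to direct Oracle access (illustrative string).
          return "oracle://ATLAS_COOLPROD"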
  • this week
    • Changes to release 15.4 to support this.
    • There is an open discussion about setting up the environment variables. Documentation will be set up.
    • Doug wants to know where to pull the Squid code - apparently it's not on the page.
    • Native SL5 for 64-bit is still far away; it just needs proper compat libs. There is an RPM being set up. There is a 'mega' RPM to drop in.
    • HU has already installed 64-bit SL5 and has addressed many of the compat lib issues.

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • See meeting MinutesDataManageJul21
    • Complaints from users about getting their datasets from sites. Charles' consistency checker ccc.py is finding datasets that are marked complete but are missing files (a sketch of such a check appears after this list).
    • In discussions with the DQ2 developers about this issue. Important: pathena will run if it thinks the dataset is complete, but will silently skip missing files - unless users are paying attention.
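    • A minimal sketch of the kind of check ccc.py performs, comparing the files a catalog claims a dataset contains against what is actually on storage; the inputs and layout below are assumptions for illustration, not the real tool:
      import os

      def missing_files(catalog_files, storage_dir):
          """Return catalog entries with no matching file on storage.

          catalog_files: iterable of (lfn, size) pairs from the dataset catalog
          (e.g. DQ2/LFC); storage_dir: local directory where the dataset lives.
          A dataset marked 'complete' that still yields entries here is exactly
          the silent-skip hazard described above.
          """
          missing = []
          for lfn, size in catalog_files:
              path = os.path.join(storage_dir, lfn)
              if not os.path.exists(path) or os.path.getsize(path) != size:
                  missing.append(lfn)
          return missing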
  • this week:
    • See meeting MinutesDataManageJul28
    • All storage at BNL is now Thumper/Thor-based (worker node storage retired).
    • New release of prodiskcleanse.py and ccc.py this week from Charles.

Throughput Initiative (Shawn)

  • last week:
    • perfSONAR status: rc1 of the next version is available. Karthik will help test this version. When released, we want to deploy it at all sites by August 10. It adds new one-way delay measurements.
    • Throughput tests last week: not quite able to reach 1 GB/s. The virtual circuits are not performing as well as they had. Mike O'Connor is studying the flows, seeing packet-loss ramps when circuits are in place; the packet loss is not caused by the circuit itself. Will use UC as a test case to study this.
    • Hiro: will run regular tests (20 files, 600 MB/s) at least once per day to each site.
  • this week:
    • no update

OSG 1.2 validation (Rob, Xin)

Site news and issues (all sites)

  • T1:
    • last week(s): HPSS maintenance on July 14. 52 units of storage coming to BNL today; expect to have this completed quickly. Have decided against using Thor extension units (expense) - will use the FC-connected Nexsan units instead. Have submitted an order for 120 worker nodes (Nehalem) toward the 3 MSI2K goal. Three Force10 ExaScale network chassis. Observed a couple of bottlenecks during step09; will get a 60 Gbps backbone. HPSS inter-mover upgrade to 10 Gbps. Note the ATLAS resource request is still under discussion, not yet approved by the LHCC; the resource request for Tier 2's is at the same level as we've known from before.
    • this week:
      • All the storage deployed (1.5 PB usable) is now in production. It was a smooth transition; the new hardware is working well and staging performance is greatly improved. Nehalem bids coming in the next few days. OS upgrade to SL5. Decoupled file systems for interactive and grid production queues. Lots of disk I/O - considering moving to an SSD system. Upgraded AFS. Change in the BNL site name; it is now reported properly.

  • AGLT2:
    • last week: working well generally, except for the strange DQ2 issue. The fix was to rebuild the DB and reboot. There were a large number of connections back to the central database; not quite sure what the cause was. See Bob's comparison of recent HC tests. Found better results, but mixed. Looked at one of the jobs with poor I/O characteristics.
    • this week: all running smoothly. Did have a Dell switch crash; recovered. Test pilot failures showing Athena failures; strange - all identified as segfaults. Possibly file-not-found? That did not seem to be the case. Do we have old test jobs?

  • NET2:
    • last week(s): Recovering missing files from datasets. HU site problem - just needed to restart the services without ssh X forwarding.
    • this week: Just back from vacation - John is working on the HU problems. Myricom cards installed; should be ready for throughput testing. There are a few data management issues still not quite understood, though all have been resolved; they could be related to file corruption from faulty NICs. At HU, jobs are failing with an lfc-mkdir problem. The site is not reporting to ReSS.

  • MWT2:
    • last week(s): New block of public IPs from campus networking - will migrate our dCache pools to public IPs, which should improve input rates. dCache outage tomorrow.
    • this week: ANALY_MWT2 set at 400; lsm switchover at IU on Saturday. Uncovered a problem with PNFS latency (files taking a long time to appear); it turned out to be a mount option on the compute nodes. On Monday, lost a shelf of data during an OS install. Memory requirements discussion - added lots more swap, 2G/2G RAM/swap. Fred is following up on how the formal requirement is determined. Accounting discrepancy for MWT2_IU corrected.

  • SWT2 (UTA):
    • last week: CPB running okay. Working on analysis queue issue discussed above. Ibrix system being upgraded for SWT2 cluster. Still working on squid validation w/ frontier clients.
    • this week: Power outage event yesterday. Building generator control system failed; fixed. Lost power to cluster - spent most of yesterday bringing systems back online. Problem with gatekeeper node - should be back up today.

  • SWT2 (OU):
    • last week:
    • this week: all running smoothly.

  • WT2:
    • last week: preload library problem fixed. Working on a procurement - ~10 Thor units. ZFS tuning. Looking for a window to upgrade the Thumpers. The latest fix for the xrootd client is in the latest ROOT release. Downtime for the power outage planned during the week of the Tier 2 meeting.
    • this week: running smoothly. CPU/walltime efficiency dropped from 60% to 45%; trying to understand why. Need to update OS and ZFS asap.

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate the checksum on the fly and expects to release it very soon. Want to supply the checksum to the gridftp server (a sketch of on-the-fly Adler32 calculation appears after this list).
    • Need to communicate w/ CERN regarding how this will work with FTS.
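    • A minimal sketch of computing an Adler32 checksum on the fly as data streams through, the kind of value that would be handed to the gridftp server; this is a generic illustration, not Alex's xrootd implementation:
      import zlib

      def adler32_of_stream(fileobj, chunk_size=1 << 20):
          """Compute the Adler32 checksum of a stream chunk by chunk,
          so the whole file never has to be held in memory."""
          checksum = zlib.adler32(b"")                  # seed value (1)
          for chunk in iter(lambda: fileobj.read(chunk_size), b""):
              checksum = zlib.adler32(chunk, checksum)  # running checksum
          return "%08x" % (checksum & 0xffffffff)       # 8-hex-digit form

      # Example usage:
      # with open("somefile.root", "rb") as f:
      #     print(adler32_of_stream(f))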
  • this week

Local Site Mover

AOB

  • last week
  • this week
    • none


-- RobertGardner - 28 Jul 2009
