


Minutes of the Facilities Integration Program meeting, Feb 3, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Rob, Aaron, Nate, Charles, Booker, Saul, Michael, Sarah, Wei, Fred, Justin, John B, John D, Kaushik, Mark, Bob, Armen, Tom, Doug, Rik, Torre
  • Apologies: Nurcan

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • Benchmarking I/O performance of virtual machines - could feed into the joint US CMS - US ATLAS meeting at the OSG All Hands
    • Report on ATLAS Tier 3 workshop
    • 70 registrants, IT auditorium completely full, > 30 EVO participants
    • Reports from all regions on status of Tier 3 efforts. Most are fully grid-compliant type T3, or co-located w/ T2,T1
    • Off-grid T3 are mostly US, Japan, Canada
    • Working groups and a program of work have been formed for 3 months - a prototype phase - covering: distributed storage (e.g. Lustre, xrootd, multi-site technology)
    • T3 support - desire to use something like HC
    • PROOF - mentioned, but few groups are using it actively. What are the limitations and issues?
    • Data distribution for T3 - as we have discussed several times; dq2-like tools being contemplated for this
    • CVMFS - to distribute / cache ATLAS releases
    • SW installation from Canada (Asoka) - looks promising
    • Try to move in the direction of creating ATLAS-wide standards, working with ADC
    • Q: what is the US ATLAS position regarding data replication? Then dq2-get via FTS with queueing; no cataloging, thus not managed; what about policy?
    • There is also the question of sharing data across T3s.
    • Can FTS 2.2 be modified to support gridftp-only endpoints? Discussed at the WLCG MB - expect an answer in ~2 weeks.
  • this week: (Rik)
    • further meetings this past week regarding T3 planning.
    • several working groups for T3's - distributed storage, DDM, PROOF; 3-month timeframe, intermediate report in 6 weeks.
    • output will be the recommended solutions
    • T3's in the US - will be working in parallel. Expect funding soon.
    • Expect a communication from Massimo regarding call for participation of working groups.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Ran out of jobs over the weekend
    • Now considering backfilling with regional production; polling US physics groups
    • There was a Panda DB incident, cause unknown, affecting analysis users; Torre intervened - pilots seem to be flowing again now.
  • this week:
    • Borut's report this morning
    • US T1 already validated for reprocessing - will discuss whether to include US T2's.
    • New jobs have arrived that should keep us busy for a while.
    • Presently in US cloud ~4000 analysis jobs running, ~5000 production jobs
    • Will also want to backfill our queues with regional production
    • There are some Condor-G scaling issues in keeping up with large numbers of short jobs.
    • Sites that are up are running quite well.
    • Discussion of site blacklisting and monitoring of sites with regards to DDM FT, SAM, Ganga robots, etc. At some point these will be automated.
    • Michael - there are discussions with Alessandro, Hiro and Michael on some of the technical issues. We don't expect our sites to get blacklisted.

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  1/20: File transfer errors at UTA_SWT2 related to an issue with one of the xrootd modules that was installed during an upgrade.  From Patrick:
     This is a problem related to one component in the newly installed version of Xrootd on the cluster.  We have reverted to an older version of the component and the cluster has been stable ever since.  We are awaiting an updated version of the problematic component from the Xrootd developers, 
    but can continue to operate in the current configuration.  RT 15099.
    2)  1/21: SLAC -- power restored following the outage on 1/19.
    3)  1/21 - 1/23: BU -- Jobs were failing due to a problem with atlas release 15.6.1 at the site.  From Xin:
    15.6.1 and caches are re-installed at BU, validation run finished o.k.  ggus 54825, RT 15107.
     4)  1/22: Failed jobs at AGLT2 with errors like "Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2702, Bad credentials)."  From Shawn:
     We had one of our 3 AFS servers drop three of its four disks last night.   We fixed it this morning and AFS was functional by around 9 PM Eastern.
    This could be related (but I'm not sure exactly how yet).   eLog 8848, Savannah 61769.
     5)  1/22: John at BU noticed an issue with dq2 transfers failing with LFC errors:
    LFC exception [Cannot connect to LFC [lfc://lfc.usatlas.bnl.gov:/grid/atlas]]
     This was due to an expired vo.racf.bnl.gov.30105.pem certificate.  An update is available.
    6)  1/23: Transfer errors at AGLT2_HOTDISK -- due to cert / proxy issue in 5) above.  Resolved.  ggus 54938, RT 15130.
    7)  1/23 - 1/27: U.S. cloud has been well below capacity with MC production.  Need new tasks to be assigned.
    8)  1/24: Failed jobs (missing input file errors) at BNL due to a problematic storage server.  System back on-line.   eLog 8907.
    9)  1/23 - 1/26: File transfer errors at BNL-OSG2_MCDISK due to very old (~2 years) missing files in the storage.  From Hiro:
     I have no idea why these old files were physically missing. I am guessing there was some problem with the (very old) DQ2. But this dataset
     should be in LYON, since that is the original site.  I removed the bad entries in the BNL LFC, which are very easy to find since the BNL LFC contains a PNFSID/storage unique-id entry for each file. All of these files are showing "-1", which means they were never there.  ggus 54937, RT 15129.
    Follow-ups from earlier reports:
    i)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: \'curl.config\'.  Update from Paul (1/6):
    For some reason, the file system did not allow the curl config file to be created so the open command threw an exception. I will improve the error handling in a later pilot version (the error occurred in the _Curl class responsible for the getJob operation used by the pilot).
     ii)  1/11: From Paul, discussing possible changes to the pilot to deal with the issue of orphaned processes left behind following athena crashes (seen by John at BU and Horst at OU); a hedged sketch of the pgid idea follows this report:
    Since the parent pid is set to 1 (a result of athena core dumping?), the pilot doesn't identify it as belonging to its chain of processes. I will see if the pilot can identify it by e.g. also looking at the pgid. [Added to the todo list].
    iii) Reminder: analysis jamboree at BNL 2/9 - 2/12.   
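     As an illustration of the pgid idea mentioned above: walk /proc and flag processes that have been
     reparented to init (ppid == 1) but still share the job's process group. This is only a minimal
     sketch of the technique, not the actual pilot code; the job_pgid argument and the clean-up policy
     are assumptions.
       import os

       def find_orphans(job_pgid):
           """Return pids reparented to init (ppid == 1) that still belong to the
           job's process group.  Minimal sketch only -- not pilot code."""
           orphans = []
           for pid in os.listdir('/proc'):
               if not pid.isdigit():
                   continue
               try:
                   with open('/proc/%s/stat' % pid) as f:
                       stat = f.read()
                   # Fields after the '(comm)' entry are: state, ppid, pgrp, ...
                   fields = stat.rsplit(')', 1)[1].split()
                   ppid, pgrp = int(fields[1]), int(fields[2])
               except (IOError, OSError, IndexError, ValueError):
                   continue  # process exited or its stat was unreadable; skip it
               if ppid == 1 and pgrp == job_pgid:
                   orphans.append(int(pid))
           return orphans

       # e.g. orphans = find_orphans(os.getpgid(payload_pid)); the pilot could then
       # send SIGTERM/SIGKILL to each surviving orphan.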

  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  1/28: New pilot version from Paul (42a) --
    * CERNVM modifications allowing the pilot to communicate with the dispatcher, run jobs inside the virtual machine. Output file handling done in a similar way to NG using mv site mover. Requested by CERNVM team.
    * Refactoring in various pilot codes to facilitate glExec integration (next couple of pilot versions) and multi-job handling.
    * Error handling added to curl class responsible for server communications.
     * Pilot now searches for orphan processes after a looping job is found and eliminates them. Requested by John Brunelle.
    * Changed 'source setup.sh' to 'source ./setup.sh' in payload setup. Requested by Rod Walker.
    * Fix for space token handling related to the issue of datasets being split between MCDISK and MCTAPE. Space tokens did not arrive to the pilot in the expected order. Requested by Stephane Jezequel et al.
     * Replaced slc5,gcc43 tags with the cmtconfig value in the setup used for SITEROOT extraction/dynamic trf installation. Seen to have caused problems at TRIUMF with SL(C)5 jobs (fixed earlier by installing a patch release). Requested by Rod Walker.
     2)  1/28: Issues with the storage systems at MWT2_IU following some system upgrades were resolved -- from Hiro:
     UC had experienced problems in the storage earlier today.  Some of the hosts for the storage disk pools ran out of memory due to too-small swap spaces after the recent upgrade of the system. To correct the problem, the affected hosts have been restarted, resulting in failures in jobs and transfers. 
    Since the fix, all are reported to be working correctly.
    Also, from Sarah (1/29):
     MWT2_IU has completed the upgrade of our gatekeeper to OSG 1.2.5, and worker nodes to SL5 x86_64, and test jobs have completed successfully.  We are setting the site back online in OIM, Panda and FTS.
     3)  1/28 - 2/1: DDM upgrades at NET2.  Noticed some CA-related issues after re-starting -- resolved, system now back on-line.  RT 15282.
    4)  1/28: Xavier has set up a twitter page for ADCoS shift announcements / news: https://twitter.com/ADCShifts.
    5)  1/29 (early a.m. U.S. time): panda server unavailable due to typo in the ToA -- issue resolved.  eLog 9060.
    6)  Sites SWT2_CPB & ANALY_SWT2_CPB will be off-line 2/2 - 2/5 for system upgrades.  eLog 9098.
    7)  2/1: SLAC -- CE is off-line -- from Wei:
    CE at SLAC is down. We got error from the disk array hosting Grid home and Atlas releases. Investigating.
     Site is currently off-line in Panda.
    8)  2/1: From Hiro, regarding one of the SAM tests with problems at U.S. sites:
     For the CE test, you can ignore the current failure results due to the lack of the "lcg-tags" command. The SAM test will be modified to accommodate
     the US / OSG site configuration.
    9)  Again this week very low level of production jobs in the U.S. cloud.
    Follow-ups from earlier reports:
    i)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: \'curl.config\'.  Update from Paul (1/6):
    For some reason, the file system did not allow the curl config file to be created so the open command threw an exception. I will improve the error handling in a later pilot version (the error occurred in the _Curl class responsible for the getJob operation used by the pilot).
    Update from Paul (1/28): Error handling added to curl class responsible for server communications (see pilot update above).
    ii)  1/11: From Paul, discussing possible changes to the pilot to deal with the issue of orphaned processes left behind following athena crashes (seen by John at BU and Horst at OU):
    Since the parent pid is set to 1 (a result of athena core dumping?), the pilot doesn't identify it as belonging to its chain of processes. I will see if the pilot can identify it by e.g. also looking at the pgid. [Added to the todo list].
     Update from Paul (1/28): Pilot now searches for orphan processes after a looping job is found and eliminates them. Requested by John Brunelle (see pilot update above).
    iii) Reminder: analysis jamboree at BNL 2/9 - 2/12. 
     • A major update to the pilot code from Paul last week cleared up some issues we'd been seeing.

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Queues are quite busy in the US
    • Discussion revisiting "home cloud" (DN based) panda brokering constraint to spread jobs into other clouds
    • Kaushik - do we need to increase resources in the US cloud? Michael - notes that many were coming from abroad anyway.
    • Will look at the numbers
    • Nurcan is examining input datasets at clouds as well
  • this meeting:
    • Recent progress on Panda job rebrokering:
      • removed the restriction of US physicists going to US queues (no DN brokering)
      • more information to users when a task is submitted
      • also pcache development (Charles, Tadashi) providing WN-level caching (uses local WN disk); it is agnostic about the file's source, which could e.g. be an NFS backend. Another goal is to integrate this into the pilot so as to remove site admin involvement (a hedged sketch of the caching idea follows this list).
    • BNL queues cannot keep up with the high pilot rate because of Condor-G problems; Xin is investigating.
    • DAST involvement in blacklisting problematic analysis sites. Starting this week DAST receives an email twice a day for sites failing GangaRobot jobs. A procedure is being set up to act on these failures.
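    • As a rough illustration of the WN-level caching idea noted above - a minimal sketch, not the actual pcache code; CACHE_DIR and copy_cmd are assumed names, and locking, eviction and checksums are omitted:
       import os, shutil, subprocess

       CACHE_DIR = "/scratch/pcache"   # assumed local worker-node cache area

       def get_input_file(lfn, copy_cmd, workdir):
           """Fetch an input file via a local WN-disk cache (sketch of the pcache idea)."""
           cached = os.path.join(CACHE_DIR, lfn.lstrip('/'))
           target = os.path.join(workdir, os.path.basename(lfn))
           if os.path.isfile(cached):
               os.link(cached, target)      # cache hit: hard-link into the job dir
               return target                # (assumes cache and workdir share a filesystem)
           # Cache miss: fetch with whatever copy tool the site uses (source-agnostic).
           subprocess.check_call(copy_cmd + [lfn, target])
           cache_dir = os.path.dirname(cached)
           if not os.path.isdir(cache_dir):
               os.makedirs(cache_dir)
           shutil.copy2(target, cached)     # populate the cache for later jobs
           return target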

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Postpone discussion of proposal on renaming sites for consistency
      I am wondering if we can agree on a consistent site naming convention
      for the various services in the ATLAS production system used in the US.  There
      seems to be confusion among people/shifters outside of the US when trying to
      identify the actual responsible site from the various names used in the US
      production services/queues.   In fact, some of them have openly
      commented on their frustration with this difficulty in the computing log. 
      Hence, I am wondering if we can/should put in the effort to use
      consistent naming conventions for the site names used in the various
      systems.    Below, I have identified some of the systems where
      consistent naming would help users. 
      1.  PANDA site name
      2.  DDM site name
      3.  BDII site name
      At least, since these three names come to the front of the major ATLAS
      computing monitoring systems, good consistent naming for each site in
      these three separate systems should help ease problems encountered by
      other people.   So, is it possible to change any of the names?  ( I
      know some of them are a pain to change.   If needed, I can make a table of
      the names used for each site in these three systems. )
    • FTS 2.2 coming soon - an update will be required 2 weeks after certification --> close to when we can consolidate site services
    • Be prepared ---> consolidation of DQ2 site services from the Tier 2s will take place in the week following the FTS upgrade
  • this meeting:
    • proddisk-cleanse change from Charles - working w/ Hiro
    • Tier 3s in the ToA will be subject to functional tests (FT) and blacklisting; under discussion

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • New frontier client released, included in new LCG release - 2.7.12;
    • Now in WLCG Tier 1 coordination effort
    • Is there a problem with BU? Douglas ran into problems with an HC test (needs follow-up offline)
    • Waiting for new squid packages for sites - a patch exists, but waiting for RPMs (have been waiting for a while)
  • this week
    • Fred - previous performance tests were invalid - found that squid was out of the loop; fixing this requires adding the client machines to the squid.conf file (see the hedged configuration sketch after this list). New tests show very good performance.
    • So should T3 sites maintain their own squid? Doug thinks every site should, and it's easy to do. CVMFS - the web file system gets better performance if you have a local squid, so it will speed up access for releases and conditions data.
    • There are two layers of security - source and destination; John: there are recommendations in the instructions.
    • There is a discussion about how feasible it is to install Squid at each Tier 3 - John worries about the load on the associated Tier 2s.
    • Can also use CVMFS. Testing at BU.
    • There was an issue of HC performance at BU relative to AGLT2 for conditions data access jobs. Fred will run his tests against BU.
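    • For reference, "adding client machines" on the squid side amounts to an ACL in squid.conf, and a local squid is also what CVMFS would use as its HTTP proxy. The ACL name, subnets, and proxy host below are placeholders, and /etc/cvmfs/default.local follows the usual CVMFS client convention - a sketch, not the recommended Frontier/CVMFS configuration:
       # squid.conf on the site squid -- allow the local client subnets (placeholders):
       acl atlas_clients src 10.0.0.0/8 192.168.0.0/16
       http_access allow atlas_clients
       http_access deny all

       # /etc/cvmfs/default.local on the clients -- route CVMFS through the same local squid (hypothetical host):
       CVMFS_HTTP_PROXY="http://squid.example.edu:3128"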

WLCG Availability discussion (Fred)

  • There was a serious discrepancy between SAM and RSV availability during a recent SLAC outage.
  • There is an on-going discussion within WLCG.

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • focus was on perfSONAR - a new release is due out this Friday; Jason: no show-stoppers. All sites should upgrade next week.
    • Fixes bugs identified at our sites - hope this release is resilient enough to recommend for T3s
    • Next meeting Feb 9
  • this week:
    • no meeting

Site news and issues (all sites)

  • T1:
    • last week(s): working on LAN updates, another Force10 added for redundancy. Ordering another 200 worker nodes; R410s most likely. Note Intel has announced 32 nm-based CPUs out in March/April, which will probably have a big effect on pricing. Working on a 2 PB procurement.
    • this week: On-going issue with Condor-G - incremental progress has been made, but new effects have been observed, including a slow-down in job throughput. Working with the Condor team; some fixes were applied (a new condor_q) which helped for a while; decided to add another submit host to the configuration. New HPSS data movers and network links.

  • AGLT2:
    • last week: Working on preparing for downtime, Feb 4. Migrating to new hardware, much of it live.
    • this week: Failover testing of WAN, all successful. Tuesday will be upgrading dcache hosts to SL5.

  • NET2:
    • last week(s): Upgrading the gatekeeper tomorrow; two racks of worker nodes arrived from Dell - operational soon. Seeing some problems with HC tests failing and pilots failing quickly; cause not known.
    • this week: Upgraded GK host to SL5 OSG 1.2.6 at BU; HU in progress. LFC upgraded to latest version.

  • MWT2:
    • last week(s): _IU and _UC in the process of being upgraded to SL 5.3. Gatekeeper updated to OSG 1.2.5.
    • this week: Completed upgrade of MWT2 (both sites) to SL 5.3 and new Puppet/Cobbler configuration build system. Both gatekeepers at OSG 1.2.5, to be upgraded to 1.2.6 at next downtime. Delivery of 28 MD1000 shelves.

  • SWT2 (UTA):
    • last week: OSG 1.2.5, all components to SL 5.4, 200 TB storage - probably a couple of days. xrootd very stable.
    • this week: Extended downtime Thursday/Friday to add new storage hardware plus the usual SW upgrades; hope to be done by the end of the week.

  • SWT2 (OU):
    • last week: Started getting more equipment from the storage order; continuing to wait for hardware.
    • this week: Still waiting for equipment delivery.

  • WT2:
    • last week(s): A problem with a storage node - will vacate its data and replace it. In the latest xrootd the namespace agent has problems under heavy load - losing events. SLAC is using the old namespace agent.
    • this week: ATLAS home and release NFS server failed; will be relocating to temporary hardware.

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has an lsm-get implemented and is just finishing implementing test cases [Pedro]

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL 5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compatibility libs are needed (a hedged example follows this list)
    • Consult UpgradeSL5
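    • As a hedged example of such compatibility libraries on SL5 x86_64 (the package list below is only a guess at the usual suspects; the validated set is documented in UpgradeSL5):
       yum install compat-libstdc++-33.i386 compat-gcc-34-c++.i386 compat-db.i386 libXpm.i386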
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Preliminary capacity report is now working:
      This is a report of pledged installed computing and storage capacity at sites.
      Report date:  2010-01-25
       #       | Site                   |      KSI2K |       HS06 |         TB |
       1.      | AGLT2                  |      1,570 |     10,400 |          0 |
       2.      | AGLT2_CE_2             |        100 |        640 |          0 |
       3.      | AGLT2_SE               |          0 |          0 |      1,060 |
       Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
               |                        |            |            |            |
       4.      | BU_ATLAS_Tier2         |      1,910 |          0 |        400 |
       Total:  | US-NET2                |      1,910 |          0 |        400 |
               |                        |            |            |            |
       5.      | BNL_ATLAS_1            |          0 |          0 |          0 |
       6.      | BNL_ATLAS_2            |          0 |          0 |          1 |
       7.      | BNL_ATLAS_SE           |          0 |          0 |          0 |
       Total:  | US-T1-BNL              |          0 |          0 |          1 |
               |                        |            |            |            |
       8.      | MWT2_IU                |      3,276 |          0 |          0 |
       9.      | MWT2_IU_SE             |          0 |          0 |        179 |
       10.     | MWT2_UC                |      3,276 |          0 |          0 |
       11.     | MWT2_UC_SE             |          0 |          0 |        200 |
       Total:  | US-MWT2                |      6,552 |          0 |        379 |
               |                        |            |            |            |
       12.     | OU_OCHEP_SWT2          |        464 |          0 |         16 |
       13.     | SWT2_CPB               |      1,383 |          0 |        235 |
       14.     | UTA_SWT2               |        493 |          0 |         15 |
       Total:  | US-SWT2                |      2,340 |          0 |        266 |
       Total:  | All US ATLAS           |     12,472 |     11,040 |      2,106 |
    • Debugging underway
  • this meeting


  • last week
  • this week
    • none

-- RobertGardner - 03 Feb 2010
