r3 - 02 Sep 2009 - 14:46:53 - RobertGardner



Minutes of the Facilities Integration Program meeting, Sep 2, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • EVO Coordinates:
    Title:		US ATLAS Computing & Integration
    Description:	Weekly US Facilities meeting
    Community:	ATLAS
    Meeting Access Information:
    - Meeting URL
    - Phone Bridge
    	ID: 1207682
    Central Daylight Time (-0500)
    	Start	2009-09-02  11:30
    	End  	2009-09-02  14:30
    Central European Summer Time (+0200)
    	Start	2009-09-02  18:30
    	End  	2009-09-02  21:30
    Eastern Daylight Time (-0400)
    	Start	2009-09-02  12:30
    	End  	2009-09-02  15:30
    Pacific Daylight Time (-0700)
    	Start	2009-09-02  09:30
    	End  	2009-09-02  12:30
    EVO Phone Bridge Telephone Numbers:
    - USA (Caltech, Pasadena, CA)
    	+1 626 395 2112
    - Switzerland (CERN, Geneva)
    	+41 22 76 71400
    - USA (BNL, Upton, NY)
    	+1 631 344 6100
    • Old phone number (not bridged), to be used if EVO fails: (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Karthik, Fred, Rob, Hiro, Charles, Armen, Wei, Tom, Sarah, Booker, Nurcan, Saul, Rich, John B, Bob, Mark, Doug, Shawn, Michael, Horst, Kaushik, Torre
  • Apologies: none

Integration program update (Rob, Michael)

  • Introducing: SiteCertificationP10 - FY09Q04
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Site certification table has a new column for the lcg-utils update, as well as curl. Will update as needed. Note: needed more critically by PandaMover host.
    • this week:
      • ATLAS software week: http://indico.cern.ch/conferenceDisplay.py?confId=50976
      • Important issue: DQ2 site services consolidation
        • Will run at BNL for all sites in the US cloud - predicated on FTS 2.2 for remote checksum; if this works, we'll consolidate. New FTS is being tested currently. Needs to work for all storage back ends, including Bestman. AI: need to follow-up w/ Simone, Hiro and Wei, to bring a test instance to BNL, test with Bestman sites.
      • Storage requirements: SpaceManagement
      • FabricUpgradeP10 - procurement discussion
      • Latest on lcg-utils and LFC:
         Begin forwarded message:
        For LFC: Just yesterday I got it building on one platform and hope to have it building on multiple platforms today. So it's in good shape.
        For lcg-utils: I upgraded (it was a painful set of changes in my build, but it's done) and Tanya did a test that showed everything working well. But about 30 minutes ago, Brian Bockelman told me that I might need to upgrade again to avoid a bug--I just contacted some folks in EGEE, and they confirmed that I should upgrade to a new version. *sigh* Hopefully I can get this done today as well.
        All of that said: I can almost certainly give you something for testing this week.
      • Specifically: GFAL: 1.11.8-1 (or maybe 1.11.9-1), lcg-utils: 1.7.6-1. These were certified recently, and 1.7.6-1 fixes a bug that was critical for CMS: https://savannah.cern.ch/bugs/index.php?52485 -alain

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Reprocessing - will run a set of validation tasks for the next five days - shifters should not file bug reports. Depending on results, will be another week before a decision is made.
    • Otherwise tasks are running fine. Have a month's worth of tasks defined.
    • May get more requests from Jet-ET miss groups (Eric Feng)
    • Central production - will get 10 TeV with release 15 rather than 7 TeV. Why? Detailed schedule is available.
    • Need to plan for replication of conditions data & squid access at Tier 2's.
  • this week:
    • mc09 tasks are arriving and running great (no need for manufactured queue fillers); 75K jobs finished

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: 
    See attached file.
    [ ESD reprocessing postponed until ~August 31(?) ]
    [Production generally running very smoothly this past week -- most tasks have low error rates.  Exception -- see 4) below.]
    1)  8/19: Job failures at HU_ATLAS_Tier2 with the error "Bad credentials" -- resolved -- from John:
    We had a problem with the disk that's hosting the TRUSTED_CA for the wn-client installation, causing jobs to fail with `Bad credentials' errors.  This should be fixed now. 
    2)  8/19: file transfer errors due to problem at WISC_MCDISK -- issue resolved.  RT 13841.
    3)  8/20: file transfer errors at AGLT2_MCDISK -- GGUS 51024 -- resolved, from Shawn:
    The two pools with the bulk of the MCDISK free space are being recovered. Once they are back online we will respond with further details. The AGLT2_MCDISK area is very full in general.
    4)  8/20:  High failure rate at most U.S. sites for jobs like valid1 csc_physVal_Mon*, merge tasks 79222-79524.  Tasks aborted.
    5)  Weekend of 8/22-23: Software upgrades (OSG, dCache) completed at AGLT2.  An issue with AFS access from the WN's affecting pilots was resolved -- test jobs succeeded -- site set back to 'online'. 
    6)  8/24: ~250 jobs failed at HU_ATLAS_Tier2 with the error "No space left on device."  GGUS 51087.  Apparently resolved -- I must have missed a follow-up?
    7)  8/25-26: Power outage over at SLAC -- test jobs completed successfully -- site is set back to 'online'.
    8)  Follow-ups from earlier reports: 
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    (ii)  SLC5 upgrades will most likely happen sometime during the month of September at most sites.
    • Quiet week
    • A few sites have been in downtime for s/w upgrades (aglt2), power outage (slac). Both back online.
    • Had some aborted tasks causing lots of failures across all sites.
  • this meeting:
    • Notes:
      Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: 
      [ ESD reprocessing -- if no issues are discovered during the current final testing phase will likely start later this week. More details in Yuri's weekly summary.  Run  91890 is being used for this final testing -- shifters were requested to ignore errors from these jobs.]
      [Production generally running very smoothly this past week -- most tasks have low error rates. ]
      1)  8/26: UTD-HEP set 'offline' while security patches were installed.  Production was restarted over this past weekend after submitting test jobs, but then yesterday problems were noticed with the site, this time related to a file server in the cluster.  Jobs are currently completing successfully, but no word on a resolution of the problem?
      2)  8/27: Request to delete BU-DDM entry from DDM listing was posted -- Hiro did the removal.  Savannah 54901.
      3)  8/28: Sites needed to update their lcg-vomscerts package, in preparation for the update of the host cert for voms.cern.ch on Monday, 8/31.  There were a large number of job failures on Monday with the error "Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2703, Could not secure the connection)" at those sites where the cert update had not yet been done.  By Monday evening all sites appeared to have completed the update.
      4)  9/1: New pilot version from Paul (v38a):
      The code has many internal changes (such as code optimizations as suggested by Charles Waldman) but of more general interest are the following new features and fixes:
      * Get and put functions in lcgcp/cr site movers are now using the proper timeout options depending on which LCG-Utils version is available locally. The current options are: --connect-timeout=300 --sendreceive-timeout=3600. Transfers on sites using older LCG-Utils versions have -t 3600. Problems were seen during the testing phase at Brunel where the new timeout options caused segmentation violations with lcg-cp. A site wide reinstallation of the command solved the problem. A similar problem is also seen at UNI-DORTMUND. Due to known difficulties with debugging jobs on this site (stdout logs are not available) Rod has set the site offline until we have solved the problem there.
      * Direct access mode is now available for analysis sites using dCache in combination with the following site movers: dCacheSiteMover, BNLdCacheSiteMover, lcgcpSiteMover, lcgcp2SiteMover. Direct access mode was previously only available for xrootd sites. (LocalSiteMover will be updated next).
      * Direct access can now be skipped for individual files (RAW files e.g.) via job attribute prodDBlockToken.
      * Correction for missing ./ in the trf name for Nordugrid analysis trfs.
      5)  Follow-ups from earlier reports: 
      (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
      (ii)  SLC5 upgrades will most likely happen sometime during the month of September at most sites.
  • things are running smoothly, low failure rates
  • Update to pilot code from Paul - see blog.
  • Sites needed to update the lcg-vomscerts package for the new voms.cern.ch host cert. Some sites didn't get to this quickly, which caused job failures on LFC access - people were contacted and the problem was corrected.
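
The version-dependent timeout handling described in the pilot notes above can be sketched roughly as follows. This is a minimal illustration in Python, not Paul's pilot code: the function names and the cutoff argument are hypothetical, and the minutes do not say which LCG-Utils version first accepts the new options, so the caller has to supply that value. Only the option strings themselves are taken from the notes.

```python
import re

# Option strings quoted in the pilot release notes.
NEW_STYLE_OPTIONS = ["--connect-timeout=300", "--sendreceive-timeout=3600"]
OLD_STYLE_OPTIONS = ["-t", "3600"]


def parse_version(text):
    """Turn a version string such as '1.7.6-1' into (1, 7, 6, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def timeout_options(local_version, first_version_with_new_options):
    """Pick lcg-cp timeout options for the locally installed LCG-Utils.

    first_version_with_new_options is an assumption supplied by the
    caller; the minutes do not state the actual cutoff version.
    """
    if parse_version(local_version) >= parse_version(first_version_with_new_options):
        return list(NEW_STYLE_OPTIONS)
    return list(OLD_STYLE_OPTIONS)
```

A site mover would append the returned list to its lcg-cp command line. Keeping the cutoff as an argument makes it easy to adjust if a site shows the kind of segmentation violation reported at Brunel.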

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • Working on getting analysis jobs using TAGs and database access into HammerCloud. There will be a discussion on database access for jobs next week during the S&C workshop. Fred and David Front invited. Sasha suggests using release 15 with two options:
      • Adapt for HammerCloud the lightweight database access testing scripts from Xavi. These Athena jobs do not require any input data files per se. The tests are most useful when a large set of jobs is submitted with varied input parameters for each job.
      • Also recommends using the job from Fred Luehring for the HammerCloud tests. Effort spent adapting this job for Ganga will be well invested, as this particular job can later be developed into a Frontier testing job as well as into the Conditions POOL files access job.
  • this meeting:
    • We're making progress on getting TAG selection jobs into HC.
    • db access jobs - Fred and David Front are in the loop.
    • Two new analysis shifters! Waiting to hear from 3 more people
    • What about the stress test w/ "large container" analysis?
    • 135M events in the container have been distributed to all T2's; can we get users to start using them?
    • The rest still need to be merged, then replicated. This is about 50 TB. Will take about a week.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • While looking at the transfer problem from BU to FZK (which is not part of the ATLAS computing model for file transfer), I noticed that some T2s (NET2, SWT2, WT2) are not publishing SRM information even to the OSG BDII. Also, some sites in the OSG BDII (port 2170) are not in the OSG BDII (port 2180). This is really confusing (for debugging). Are there any plans to publish all US T2 SRMs to the OSG BDII (both ports 2170 and 2180)? Although it is not part of the ATLAS computing model, this type of transfer from US T2 scratch space to other clouds will happen as more users become active with real data. Unless ATLAS T2s publish this info via BDII, DDM will fail because the FTSs at other clouds depend on this information. So, shouldn't we push to do this?
    • Publishing storage into BDII - SE's only, as certain transfers are failing (foreign Tier 1's to US Tier 2's)
    • Need to understand technical implications. Why are we breaking the hierarchical model? Need a discussion in a different forum.
    • Let's discuss this next week at CERN.
    • Throughput test bnl->aglt2 revealed a dcache glitch, Shawn investigating
  • this meeting:
    • One problem w/ BNL DQ2 - host rebooted - led to a downtime; unsure why.
    • What about checksumming through SRM-bestman? Can do checksum on the fly, better to distribute checksum. Can then calculate on the data server node. lcg-utils checksum has been tested by Wei and it works.
    • Action item: need to have meeting w/ Alex, Wei, Hiro

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • last week
  • this week
    • Moving forward - CMB, SPMB meeting expect squid-frontier approach endorsed. Still need to work on getting xml files written to site. New HOTDISK space token + cron is being discussed.
    • Who owns the problem? Need to find effort. This is an international ATLAS problem.

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • last week(s):
    • Yesterday had a briefing from DOE project Office of Science - there was a discussion of T3 throughput
    • We need to get T3 data transfers working, and documented. Want this consolidated in one place, for documentation purposes.
    • Doug: we need I2 tools in place, and we need to get the PandaMover transfers to T3
  • this week:
    USATLAS Throughput Meeting Notes
    September 1, 2009
    Attending:  Shawn, Rich, Saul, Jay, Karthik
    1)      perfSONAR discussion – RC3 is available now.  Feedback from currently testing sites (you know who you are!) is requested by the end of the week.  The hope is RC3 will become the real release assuming no major problems are found (30 bugs were addressed in RC2->RC3).
    2)      Tier-3/DQ2 testing  -- No report this week
    3)      Data-movement tests --  Missing NET2 results in the summary graphs on the main page.  Hiro, can you add them?   AGLT2 results have gotten very bad after the dCache upgrade from 1.9.2-5 to 1.9.4-2.  Considering a downgrade to see if it addresses the issue.  
    4)      Circuit issues/discussion.   UC testing status?   No report this week.
    5)      Site reports
    a.       BNL
    b.      AGLT2 -  dCache upgrade may be causing some issues.
    c.       MWT2 – No report but still need to continue circuit testing as soon as path is instrumented at UC/OmniPoP.
    d.      NET2 – New Myricom 10GE NICs have been in place for a while.  No issues but no real testing yet.
    e.      SWT2 – Karthik is planning to test the new RC3 of perfSONAR
    f.        WT2
    g.       Wisconsin
    6)      AOB --  Thanks to Rich and Jay for their extremely valuable participation in the working group.   Both are moving on to new things.  We will very much miss them on our weekly calls.
    Please send along any edits or updates to the list.   Plan is to meet again at the regular time next week.
  • From Jason: Just to clarify, we still have many issues to address before the real release. We expect an RC4 to release around 9/11 and the 'real' release the week of 9/21 if everything stays on schedule. Also I would like to again thank those that are helping in the beta test process - the only reason we are able to find and fix the issues as fast as we are is due to their diligence in debugging with us.
  • Want to get this deployed at Tier 2's as soon as release 4 is available. There have been a number of problems discovered at the beta sites with RC3.
  • Throughput tests showing probs w/ dCache 1.9.4-2, considering downgrading.
  • Automated plots
    • NET2 to be added to plot
    • BNL-BNL plot
    • SRM not working at Duke
  • Plan to redo load tests for UC tests.
  • What about transfers into Tier 3g's?
  • Action: gather information on network/host tunings as guide for Tier 3

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • Update to OSG 1.2.1 release this past week for pyopenssl patch
  • this week:

Site news and issues (all sites)

  • T1:
    • last week(s): We have completed the site name consolidation; now have a proper WLCG site name. This was nontrivial to implement.
    • this week: apart from glitch w/ site services; 120 worker nodes is out: 8 core nodes, 2.66 GHz Nehalems, 24 GB ram x-series, 5550. Fast memory. R410s. Can put up to 4 drives in these 1U servers. 1000 analysis cores hitting DDN storage. SSD 160 GB drives for $400.

  • AGLT2:
    • last week: Upgrade OSG 1.2 gatekeeper - went well; but there are dCache upgrade problems, throughput is very low. Adding space to mcdisk.
    • this week: Looking at next purchase round - Dell and Sun. SSD's versus extra spindles. Looking at R610. Need to test. DDN storage (~ $320/usable TB). May need downtime for dCache.

  • NET2:
    • last week(s): All okay; HU running below capacity.
    • this week: running smoothly at BU - investigating a problem at HU - not getting pilots.

  • MWT2:
    • last week(s): looking into data corruption - some jobs have failed as a result. Doing a full scan of dCache to calculate checksums and compare them to LFC. Found over 300 files with mismatched checksums (2M files in total). Otherwise running w/ low error rate.
    • this week: Aaron van Meerton coming on board at UC. Distributed xrootd layout studies, setting this up. Bestman on top of dcache (looking at performance issues). We did experience some partial drains; this is really because of the short jobs.
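
The comparison step of the checksum scan mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the two dictionaries stand in for a dump of checksums computed on storage and a dump of checksums recorded in the catalog, and the function names are hypothetical. It does not use any dCache or LFC API.

```python
def normalize(checksum):
    """Compare checksums case-insensitively and zero-padded to 8 hex digits."""
    return checksum.strip().lower().zfill(8)


def compare_checksums(storage, catalog):
    """Return (mismatched, missing) for files listed in the catalog.

    storage and catalog both map a file name to a hex checksum string.
    mismatched maps a file name to (catalog value, storage value);
    missing lists catalog entries that have no checksum from storage.
    """
    mismatched = {}
    missing = []
    for name, expected in sorted(catalog.items()):
        actual = storage.get(name)
        if actual is None:
            missing.append(name)
        elif normalize(actual) != normalize(expected):
            mismatched[name] = (expected, actual)
    return mismatched, missing
```

Normalizing before comparing avoids false mismatches when one source drops leading zeros or uses upper-case hex digits.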

  • Shawn: discussion of HEPSPEC is now 120 after changing default BIOS settings. This was verified at BNL. 20% better than harpertown! HS mismatch w/ ATLAS code! What about Dell joining as an ITB site?

  • SWT2 (UTA):
    • last week: no issues
    • this week: all is well. Looking at purchase options for next round. Need to think about what to do w/ the old SWT2_UTA cluster. Blocking off a few days for the SL5 and OSG upgrade.

  • SWT2 (OU):
    • last week: no issues
    • this week: all is well - will be looking into purchases.

  • WT2:
    • last week(s): completed power outage, site coming back online. Low efficiency of xrootd client - have an update from developer to test.
    • this week: hardware purchases are proceeding. The xrootd client is being improved by the developer; a new algorithm will be tested in a new HC test. Will likely have another power outage.

Tier 3 data transfers

  • There have been discussions this week at CERN regarding this topic. Expect a solution very soon, see Simone's. Next week.

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksum to xrootd. Alex developing something to calculate this on the fly. Expects to release this very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
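
As an illustration of calculating an Adler32 checksum on the fly, the sketch below updates the checksum while data chunks pass through, so the file never has to be read a second time. It is a generic Python example built on the standard zlib module, not Alex's xrootd implementation; the class name is hypothetical.

```python
import zlib


class Adler32Stream:
    """Pass data chunks through unchanged while keeping a running Adler32."""

    def __init__(self, chunks):
        self._chunks = chunks
        self._value = 1  # Adler32 starting value

    def __iter__(self):
        for chunk in self._chunks:
            # zlib.adler32 accepts the previous value as a running checksum.
            self._value = zlib.adler32(chunk, self._value)
            yield chunk

    def hexdigest(self):
        """Checksum of everything yielded so far, as 8 hex digits."""
        return format(self._value & 0xFFFFFFFF, "08x")
```

A transfer loop would iterate over the wrapper, write each chunk to its destination, and read hexdigest() once the stream is exhausted.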

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of AGLT2 infrastructure to SL 5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
  • this week

Getting OIM registrations correct for WLCG installed pledged capacity

  • last week
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for Posix-like filesystem) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not seen yet a draft report.
    • Double check the accounting name doesn't get erased. There was a bug in OIM - should be fixed, but check.
  • this week
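
One way to read the note above that a site with more than one CE takes the "OR" for availability is sketched below. This is an interpretation only, not the WLCG availability algorithm, which the minutes do not describe; the function name and the per-time-slot boolean samples are assumptions made for the example.

```python
def site_availability(ce_samples):
    """Fraction of time slots in which at least one CE was available.

    ce_samples maps a CE name to a list of booleans, one per time slot.
    Slots are combined with a logical OR across CEs; if the lists differ
    in length, only the slots covered by every CE are counted.
    """
    if not ce_samples:
        return 0.0
    slots = list(zip(*ce_samples.values()))
    if not slots:
        return 0.0
    available = sum(1 for slot in slots if any(slot))
    return available / len(slots)
```

With two CEs that are each up in half of the slots but at different times, the combined figure can be higher than either CE's own availability.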


  • last week
  • this week

-- RobertGardner - 02 Sep 2009



jpg SiteCertificationP10vSep2.jpg (84.6K) | RobertGardner, 02 Sep 2009 - 03:29 |