
MinutesSep9

Introduction

Minutes of the Facilities Integration Program meeting, Sep 9, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees:
  • Apologies: Rob

Integration program update (Rob, Michael)

  • Introducing: SiteCertificationP10 - FY09Q04
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Site certification table has a new column for the lcg-utils update, as well as curl. Will update as needed. Note: needed more critically by PandaMover host.
      • ATLAS software week: http://indico.cern.ch/conferenceDisplay.py?confId=50976
      • Important issue: DQ2 site services consolidation
        • Will run at BNL for all sites in the US cloud - predicated on FTS 2.2 for remote checksum; if this works, we'll consolidate. The new FTS is being tested currently. Needs to work for all storage back ends, including Bestman. Action item: need to follow up w/ Simone, Hiro and Wei to bring a test instance to BNL and test with Bestman sites. (A checksum-comparison sketch follows these notes.)
      • Storage requirements: SpaceManagement
      • FabricUpgradeP10 - procurement discussion
      • Latest on lcg-utils and LFC:
         Begin forwarded message:
        For LFC: Just yesterday I got it building on one platform and hope to have it building on multiple platforms today. So it's in good shape.
        For lcg-utils: I upgraded (it was a painful set of changes in my build, but it's done) and Tanya did a test that showed everything working well. But about 30 minutes ago, Brian Bockelman told me that I might need to upgrade again to avoid a bug--I just contacted some folks in EGEE, and they confirmed that I should upgrade to a new version. *sigh* Hopefully I can get this done today as well.
        All of that said: I can almost certainly give you something for testing this week.
        -alain
      • Specifically: GFAL: 1.11.8-1 (or maybe 1.11.9-1), lcg-utils: 1.7.6-1. These were certified recently, and 1.7.6-1 fixes a bug that was critical for CMS: https://savannah.cern.ch/bugs/index.php?52485 -alain
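      • A minimal sketch (Python) of the checksum comparison that remote-checksum validation implies, assuming both ends report Adler32 as hex strings; the example values are made up:

        def normalize_adler32(value):
            # Adler32 strings from different storage back ends can differ
            # in case and zero padding; normalize to 8 lower-case hex digits.
            return value.strip().lower().zfill(8)

        def transfer_is_valid(source_checksum, dest_checksum):
            # Accept the transfer only if both ends agree on the checksum.
            return normalize_adler32(source_checksum) == normalize_adler32(dest_checksum)

        # Illustrative values, as two different back ends might report them:
        assert transfer_is_valid("12ab34cd", "12AB34CD")
        assert not transfer_is_valid("12ab34cd", "deadbeef")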
    • this week:

Operations overview: Production (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: 
    http://www-hep.uta.edu/atlas/World-wide-Panda_ADCoS-report-%28Aug25-31-2009%29.html
    
    [ ESD reprocessing will likely start later this week if no issues are discovered during the current final testing phase. More details in Yuri's weekly summary. Run 91890 is being used for this final testing -- shifters were requested to ignore errors from these jobs.]
    [Production generally running very smoothly this past week -- most tasks have low error rates. ]
    
    1)  8/26: UTD-HEP set 'offline' while security patches were installed.  Production was restarted over this past weekend after submitting test jobs, but then yesterday problems were noticed with the site, this time related to a file server in the cluster.  Jobs are currently completing successfully, but there is no word yet on a resolution of the problem.
    2)  8/27: Request to delete BU-DDM entry from DDM listing was posted -- Hiro did the removal.  Savannah 54901.
    3)  8/28: Sites needed to update their lcg-vomscerts package, in preparation for the update of the host cert for voms.cern.ch on Monday, 8/31.  There were a large number of job failures on Monday with the error "Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2703, Could not secure the connection)" at those sites where the cert update had not yet been done.  By Monday evening all sites appeared to have completed the update.  (A quick certificate-expiry check is sketched after this list.)
    4)  9/1: New pilot version from Paul (v38a):
    The code has many internal changes (such as code optimizations as suggested by Charles Waldman) but of more general interest are the following new features and fixes:
    * Get and put functions in lcgcp/cr site movers are now using the proper timeout options depending on which LCG-Utils version is available locally. The current options are: --connect-timeout=300 --sendreceive-timeout=3600. Transfers on sites using older LCG-Utils versions have -t 3600. Problems were seen during the testing phase at Brunel where the new timeout options caused segmentation violations with lcg-cp. A site wide reinstallation of the command solved the problem. A similar problem is also seen at UNI-DORTMUND. Due to known difficulties with debugging jobs on this site (stdout logs are not available) Rod has set the site offline until we have solved the problem there.
    * Direct access mode is now available for analysis sites using dCache in combination with the following site movers: dCacheSiteMover, BNLdCacheSiteMover, lcgcpSiteMover, lcgcp2SiteMover. Direct access mode was previously only available for xrootd sites. (LocalSiteMover will be updated next).
    * Direct access can now be skipped for individual files (RAW files e.g.) via job attribute prodDBlockToken.
    * Correction for missing ./ in the trf name for Nordugrid analysis trfs.
    5)  Follow-ups from earlier reports: 
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    (ii)  SLC5 upgrades will most likely happen sometime during the month of September at most sites.
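  • For the lcg-vomscerts item (3) above, a hedged sketch of checking the expiry of the locally installed voms.cern.ch certificate. The vomsdir path is the conventional lcg-vomscerts location but is an assumption here; adjust per site:

        import subprocess

        # Print the expiry date of the local voms.cern.ch certificate.
        # The path below is an assumption; adjust for your installation.
        CERT = "/etc/grid-security/vomsdir/voms.cern.ch.pem"

        end_date = subprocess.check_output(
            ["openssl", "x509", "-noout", "-enddate", "-in", CERT]
        ).decode().strip()
        print(end_date)  # e.g. "notAfter=Aug 31 12:00:00 2009 GMT"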
  • things are running smoothly, low failure rates
  • Update to pilot code from Paul - see blog. (A sketch of the new timeout-option selection follows below.)
  • Sites needed to update the voms.cern.ch host cert (lcg-vomscerts package). Some sites didn't get to this quickly, causing job failures on LFC lookups; people were contacted and the updates were completed.
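  • To illustrate the pilot timeout change, a sketch of selecting lcg-cp flags by installed LCG-Utils version. This is not Paul's actual pilot code, and the 1.7 cutoff is an assumption (the real pilot inspects the installed command):

        def lcg_cp_timeout_options(lcg_utils_version):
            # Newer LCG-Utils supports separate connect and send/receive
            # timeouts; older versions only offer a single overall timeout.
            major, minor = (int(x) for x in lcg_utils_version.split(".")[:2])
            if (major, minor) >= (1, 7):  # assumed cutoff, see lead-in
                return ["--connect-timeout=300", "--sendreceive-timeout=3600"]
            return ["-t", "3600"]

        print(lcg_cp_timeout_options("1.7.6"))   # new-style options
        print(lcg_cp_timeout_options("1.6.10"))  # ['-t', '3600']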
  • this meeting:
    • Notes:

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • We're making progress on getting TAG selection jobs into HC.
    • db access jobs - Fred and David Front are in the loop.
    • Two new analysis shifters! Waiting to hear from 3 more people
    • What about the stress test w/ "large container" analysis.
    • 135M events in the container have been distributed to all T2's; can we get users to start using them?
    • The rest still need to be merged, then replicated. This is about 50 TB. Will take about a week.
  • this meeting:

DDM Operations (Hiro)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • last week(s):
                 USATLAS Throughput Meeting Notes
                                    September 1, 2009
                  ==============================
    Attending:  Shawn, Rich, Saul, Jay, Karthik
    1)      perfSONAR discussion – RC3 is available now.  Feedback from currently testing sites (you know who you are!) is requested by the end of the week.  The hope is RC3 will become the real release assuming no major problems are found (30 bugs were addressed in RC2->RC3).
    2)      Tier-3/DQ2 testing  -- No report this week
    3)      Data-movement tests --  Missing NET2 results in the summary graphs on the main page.  Hiro, can you add them?   AGLT2 results have gotten very bad after upgrade from 1.9.2-5 to 1.9.4-2.  Considering downgrade to see if it addresses the issue.  
    4)      Circuit issues/discussion.   UC testing status?   No report this week.
    5)      Site reports
    a.       BNL
    b.      AGLT2 -  dCache upgrade may be causing some issues.
    c.       MWT2 – No report but still need to continue circuit testing as soon as path is instrumented at UC/OmniPoP.
    d.      NET2 – New Myricom 10GE NICs have been in place for a while.  No issues but no real testing yet.
    e.      SWT2 – Karthik is planning to test the new RC3 of perfSONAR
    f.        WT2
    g.       Wisconsin
    6)      AOB --  Thanks to Rich and Jay for their extremely valuable participation in the working group.   Both are moving on to new things.  We will very much miss them on our weekly calls.
     
    Please send along any edits or updates to the list.   Plan is to meet again at the regular time next week.
     
    Shawn
    • From Jason: Just to clarify, we still have many issues to address before the real release. We expect an RC4 to release around 9/11 and the 'real' release the week of 9/21 if everything stays on schedule. Also, I would like to again thank those who are helping in the beta test process - the only reason we are able to find and fix issues as fast as we are is their diligence in debugging with us.
    • Want to get this deployed at Tier 2's as soon as release 4 is available. There have been a number of problems discovered at the beta sites with RC3.
    • Throughput tests showing probs w/ dCache 1.9.4-2, considering downgrading.
    • Automated plots
    • NET2 to be added to plot
    • BNL-BNL plot
    • SRM not working at Duke
    • Plan to redo load tests for UC tests.
    • What about transfers into Tier 3g's?
    • Action: gather information on network/host tunings as guide for Tier 3
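    • A possible starting point for that guide: a sketch that reads a few Linux TCP tuning knobs often checked when debugging WAN throughput. The "suggested" values are placeholders, not recommendations from this meeting:

        # Report current kernel TCP buffer settings against placeholder targets.
        TUNABLES = {
            "net/core/rmem_max": 16777216,   # placeholder target (bytes)
            "net/core/wmem_max": 16777216,   # placeholder target (bytes)
            "net/ipv4/tcp_rmem": None,       # triple: min default max
            "net/ipv4/tcp_wmem": None,       # triple: min default max
        }

        def read_sysctl(name):
            with open("/proc/sys/" + name) as f:
                return f.read().strip()

        for name, target in sorted(TUNABLES.items()):
            note = "" if target is None else " (placeholder target >= %d)" % target
            print("%-22s = %s%s" % (name, read_sysctl(name), note))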
  • this week:
                Notes for US ATLAS Throughput Meeting September 8, 2009
              ===============================================

Attending: Dave, Doug, Sarah, Shawn, Horst, Karthik, Hiro, Rich

1)	perfSONAR status – RC4 out ?  Some good feedback provided by current RC3 testers.    Need to verify status of target release date (September 21 ?).  Karthik reported RC3 worked pretty well but some services are disabled when they shouldn’t be.  Aaron will be following-up (in RC4).   Plan for the near future is to have all US ATLAS Tier-2’s deploy the “released” perfSONAR within 1 week of its release (by the end of September).   Then  we need to verify that it works as expected and gain enough experience with its configuration and use to be able to make a recommendation within a month (by the end of October).   The recommendation would be concerning whether or not Tier-3’s should deploy this version of perfSONAR (presumably 1 box devoted to bandwidth testing) OR await the next perfSONAR version (in case of too many issues still being present that prevent wider deployment).
2)	Updates on “Tier-3” related testing/throughput -  Lots of discussion which touched upon (US)ATLAS policy, DQ2 and related topics.  Our group is focused upon throughput and testing, and we need to determine what to do for Tier-3’s.   Draft idea is that each Tier-3 should be assigned a Tier-2 as a testing partner.   Data movement tests will be configured using Hiro’s data transfer testing.  Tier-3’s will have only 7-8 test files sent from their associated Tier-2 once per day.  This is compared to the Tier-1 to Tier-2 tests which send 20 files, twice per day, to two space-token areas.   This testing, along with a properly configured perfSONAR instance at each Tier-3, should provide sufficient initial monitoring/testing. (A sketch of the proposed test schedule follows these notes.) Also, Shawn is assembling pages on the Twiki Rob has set up which are the beginning of a “how-to/cookbook” for throughput debugging, tuning and testing.   **Please send along any URLs or info that would be appropriate to include**.  Also Doug pointed out that Tier-3’s need examples of existing storage systems and configurations already in place to help guide their selection of hardware purchases.     **We need to get feedback (also on the Twiki) from each of the Tier-2’s on their hardware and configurations which can be used by the Tier-3’s**.
3)	Automated data-movement testing status – Logging of errors is to be added.   In addition Hiro will update the software to allow the “3rd party mode” which will enable tests from any site to any site in the US.  Rob had some requests for updates to the plots that Sarah and Hiro will follow-up on.
4)	Virtual circuit testing status – Still waiting to hear that the UC networking folks have the path instrumented so we can repeat the tests.   **Shawn will send another email asking for a status update**.  If things are ready we will try to reschedule the test sometime in the next few days.
5)	Site reports – Skipped till next meeting
6)	AOB – Skipped till next meeting

Please send any corrections or additions to the list.   We plan to meet again next week at the regular time.

Shawn
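
As an illustration of item 2, a sketch encoding the draft test schedule; only the file counts come from the notes above, and the site pairings are hypothetical:

    # Draft plan: Tier-1 -> Tier-2 tests send 20 files twice per day (to two
    # space-token areas); Tier-2 -> Tier-3 tests send ~8 files once per day.
    TIER1_TO_TIER2 = {"files_per_run": 20, "runs_per_day": 2}
    TIER2_TO_TIER3 = {"files_per_run": 8, "runs_per_day": 1}

    PARTNERS = {  # hypothetical Tier-3 -> Tier-2 assignments
        "SOME_T3": "AGLT2",
        "ANOTHER_T3": "MWT2",
    }

    def files_per_day(plan):
        return plan["files_per_run"] * plan["runs_per_day"]

    for t3, t2 in sorted(PARTNERS.items()):
        print("%s -> %s: %d test files/day" % (t2, t3, files_per_day(TIER2_TO_TIER3)))
    print("Tier-1 -> Tier-2 baseline: %d files/day" % files_per_day(TIER1_TO_TIER2))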

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • Update to OSG 1.2.1 release this past week for pyopenssl patch
  • this week:

Site news and issues (all sites)

  • T1:
    • last week(s): apart from a glitch w/ site services; an order for 120 worker nodes is out: 8-core 1U R410s with 2.66 GHz Nehalem X5550 processors and 24 GB of fast RAM; can put up to 4 drives in these 1U servers. 1000 analysis cores hitting DDN storage. SSD 160 GB drives for $400.
    • this week:

  • AGLT2:
    • last week: Looking at next purchase round - Dell and Sun. SSDs versus extra spindles. Looking at the R610. Need to test. DDN storage (~$320/usable TB). May need downtime for dCache.
    • this week:

  • NET2:
    • last week(s): running smoothly at BU - investigating a problem at HU - not getting pilots.
    • this week:

  • MWT2:
    • last week(s): Aaron van Meerton coming on board at UC. Distributed xrootd layout studies - setting this up. Bestman on top of dCache (looking at performance issues). We did experience some partial drains, due to the really short jobs.
    • this week:

  • Shawn: HEPSPEC discussion - the score is now 120 after changing default BIOS settings; this was verified at BNL. 20% better than Harpertown! Note the HEPSPEC mismatch w/ ATLAS code. What about Dell joining as an ITB site?

  • SWT2 (UTA):
    • last week: all is well. Looking at purchase options for next round. Need to think about what to do w/ the old SWT2_UTA cluster. Blocking off a few days for the SL5 and OSG upgrade.
    • this week:

  • SWT2 (OU):
    • last week: no issues
    • this week:

  • WT2:
    • last week(s): hardware purchases are proceeding. The xrootd client is being improved by the developer; a new algorithm will be tested in a new HC test. Will likely have another power outage.
    • this week:

Tier 3 data transfers (Doug, Kaushik)

  • last week
    • There have been discussions this week at CERN regarding this topic. Expect a solution very soon - see Simone's. More next week.
  • this week

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server. (A minimal streaming sketch follows below.)
    • Need to communicate w/ CERN regarding how this will work with FTS.
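    • For illustration, a minimal streaming Adler32 in Python; zlib supports exactly this incremental use. A sketch only, not Alex's actual implementation:

        import io
        import zlib

        def streaming_adler32(stream, chunk_size=1 << 20):
            # Accumulate Adler32 incrementally as data streams through,
            # rather than re-reading the whole file afterwards.
            value = 1  # Adler32 seed
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xFFFFFFFF)

        print(streaming_adler32(io.BytesIO(b"example payload")))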
  • this week

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ACD ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
  • this week

Getting OIM registrations correct for WLCG installed pledged capacity

  • last week
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for Posix-like filesystems) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky. (A BDII query sketch follows this list.)
    • Have not yet seen a draft report.
    • Double-check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
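    • A hedged sketch of checking what sites actually publish, by querying a BDII for GLUE storage attributes with ldapsearch. The endpoint below is an assumption; substitute the information system your operations team uses:

        import subprocess

        # List published storage-element sizes, to spot sites reporting zero.
        # The BDII host is an assumption; replace it with your own endpoint.
        BDII = "ldap://is.grid.iu.edu:2170"

        out = subprocess.check_output([
            "ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
            "(objectClass=GlueSE)",
            "GlueSEUniqueID", "GlueSETotalOnlineSize",
        ]).decode()
        for line in out.splitlines():
            if line.startswith(("GlueSEUniqueID:", "GlueSETotalOnlineSize:")):
                print(line)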
  • this week

AOB

  • last week
  • this week


-- RobertGardner - 04 Sep 2009
