
MinutesOct7

Introduction

Minutes of the Facilities Integration Program meeting, Oct 7, 2009
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Sarah, Rob, John DeStefano, Doug, Michael, Bob, Tom, John, Saul, Wei, Fred, Charles, Aaron, Nurcan, Kaushik, Mark, Armen, Patrick
  • Apologies: Horst, Karthik

Integration program update (Rob, Michael)

  • Introducing: SiteCertificationP10 - FY09Q04
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • CapacitySummary
      • Quarterly reports due!
      • Storage requirements: SpaceManagement
      • FabricUpgradeP10 - procurement discussion
      • End user analysis test (UAT), Date: 21-23 Oct. (28-30 Oct. as a backup date), UserAnalysisTest
        • 5 large containers - 100M events - spread over Tier 2s, mostly US sites
        • Expert users will run ntuple-making jobs for the first few days (20 people; 4-5 people in Europe)
        • Larger group would copy ntuple datasets
        • Smaller datasets will run over raw and ESDs
        • Metrics - is information going into panda db correct?
        • 400M events high pt, 100M low pt. Merge jobs going on right now. Earlier 300M sample already done. 520M events total. 6 containers.
        • How much space? 80 TB for all 6 containers.
        • 420M events will go to 5 containers - small and large.
        • Plan is that some jobs will not be US specific.
        • Pre-testing should show what the output sizes are.
        • USERDISK - capacity could be as much as 8 TB.
        • At the end, could be many users attempting fetches of output ntuples via dq2-get.
        • Lots of questions of details on how this will work.
        • Will be discussed in the daily operations meeting; need to follow up here next week
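        • For the dq2-get fetches mentioned above, a minimal sketch (Python) of scripting bulk retrieval of output ntuple datasets; the dataset names are placeholders, and only the basic dq2-get <dataset> invocation is assumed - version-specific options are omitted.
          # Hypothetical helper for fetching several UAT output ntuple datasets
          # with dq2-get; the dataset names below are placeholders, not real containers.
          import subprocess

          datasets = [
              "user09.SomeUser.uat_ntuple.example_1/",   # placeholder names
              "user09.SomeUser.uat_ntuple.example_2/",
          ]

          for ds in datasets:
              # Basic invocation only: dq2-get <dataset>.  Extra options (threads,
              # file filters) vary by dq2 client version, so they are omitted here.
              rc = subprocess.call(["dq2-get", ds])
              if rc != 0:
                  print("dq2-get failed for %s (rc=%d)" % (ds, rc))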
    • this week
      • Quarterly reports due
      • Next phase - will focus on end-user analysis with facilities services.
      • On-going discussion w/ the RRB regarding aggressive ATLAS needs; we need to aggressively ramp up, for each of the 5 T2s, to 11 kHS06 fully devoted to central ATLAS production and 1100 TB of disk (5.7 PB aggregate for the US share among all T2s); there is some argument over the schedule. April 2010 is nominal, staggered over the calendar year: 2 PB/quarter, 7 PB for all T2s worldwide, 25 PB by the first quarter of 2011. For the US: completed by end of 2010. This is the 2010 pledge. These figures are for useable storage, in decimal units (see the conversion sketch below).
      • Michael will provide a table for deployment
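      • Since the pledge figures above are quoted in decimal units, a minimal sketch (Python) converting the 1100 TB per-T2 disk figure to the binary units most storage systems report; the per-T2 number is taken from the note above, the conversion itself is standard.
        # Convert the pledged decimal terabytes to binary tebibytes (TiB),
        # since storage systems typically report capacity in powers of two.
        def tb_to_tib(tb_decimal):
            """1 TB (decimal) = 10**12 bytes; 1 TiB = 2**40 bytes."""
            return tb_decimal * 10**12 / float(2**40)

        pledge_tb = 1100   # per-Tier-2 disk figure quoted in the note above (decimal TB)
        print("%d TB decimal ~= %.0f TiB" % (pledge_tb, tb_to_tib(pledge_tb)))
        # -> roughly 1000 TiB of reported capacity per Tier 2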

UAT program (Kaushik)

  • ADC daily operations notes
  • See https://twiki.cern.ch/twiki/bin/view/Atlas/UserAnalysisTest
  • Status: 6 containers have been defined in DQ2. 72 TB.
  • Date: October 21-23.
  • 2 (pre-test) containers have already been distributed and tested. uat09. Will delete step09 containers.
  • Kaushik will make up a table of containers at Tier 2s.
  • Jim C will put out a call for users to test the datasets.
  • Which space token? MCDISK
  • Subscription lists will be handled by Kaushik (consulting w/ Alexei and Simone)
  • Metrics: (Nurcan)
    • how many job slots per user; estimated 1 1/2 hours - 5 hours; 300-500 job slots per user to process complete containers.
    • will use the same Step09 metrics: #files, cpu/wall time, #events (parsed from Athena logfiles; this parsing arrives in Rel 15.6.0, which won't be available in time, so it will need to be scripted in the pilot - Paul); see the log-parsing sketch after this list
    • would like a dedicated monitoring page for these (Dan will provide this); Sergei and Torre working on the monitor in Panda
    • 20 users from US cloud; 4-5 elsewhere. Need to broker US-user jobs to sites outside the US.
    • Dan will summarize at tomorrow's ADC meeting.
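    • As referenced above, a minimal sketch (Python) of the kind of pilot-side scripting that could extract an event count from an Athena log file; the "events processed" pattern is an assumption about the log format, not a guaranteed Athena message, and should be adjusted to the actual release output.
      # Sketch: pull a processed-event count out of an Athena log file.
      import re
      import sys

      EVENT_RE = re.compile(r"(\d+)\s+events processed", re.IGNORECASE)

      def count_events(logfile):
          last = 0
          with open(logfile) as f:
              for line in f:
                  m = EVENT_RE.search(line)
                  if m:
                      last = int(m.group(1))   # keep the last (final) count seen
          return last

      if __name__ == "__main__":
          print(count_events(sys.argv[1]))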

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/materialDisplay.py?contribId=0&materialId=0&confId=69159
    
    [ ESD reprocessing is essentially done.]
    See: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCReproFromESDAug2009
    
    1)  9/24: Some files were lost in the MWT2 storage due to a dCache misconfiguration / cleanup operation.  Not a major issue -- jobs should simply fail and get rerun. eLog 5731.
    2)  9/25 ==> Large number of failed jobs in the US cloud from task 78741 -- error was "could not add files to dataset."  Remaining jobs were aborted.  Issue discussed in Savannah 56127, RT 14134.
    3)  9/25: Jobs failed at AGLT2 with the error "Put error: Error in copying the file from job workdir to localSE."  Issue was expired host certs on several machines -- resolved.
    4)  9/26: NET2 - problematic WN atlas-c01.bu.edu taken offline -- all pilots were failing on the machine with the error "Did not find a valid proxy, will now abort:"
    5)  9/29 p.m.- 9/30 a.m.: NET2 sites offline due to a problem with the gatekeeper.  Issue resolved, test jobs finished successfully, sites set back 'online'. eLog 5831.
    6)  9/30: Power outage at SLAC today -- from Wei:
    SLAC will take a power outage at 9/30 to work on urgently needed maintenance of two transformers that supply power to machine rooms. We will start
    setting things offline from 6pm 9/29 and eventually will shutdown all ATLAS services. The outage is scheduled to complete at 6pm of 9/30.
    7)  9/30: New pilot s/w from Paul, v39c:
     A problem with job recovery was discovered due to the usage of a wrong error code related to LFC registration. When lfc-mkdir encountered an error, 
    the wrong error code was set which led the pilot to believe that the job could be recovered on sites that support job recovery. 
    The current job recovery version can not handle these cases.
    8)  Grid certificate for special user 'sm' updated at BNL & UTA (thanks Nurcan).
    9)  Heads-up: ATLAS User Analysis Test (UAT) scheduled for the second half of October.
    
    Follow-ups from earlier reports:
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    (ii)  SLC5 upgrades are ongoing at the sites during the month of September.
    • Saul: jobs seen where Athena runs out of memory, causing nodes to lock up; Kaushik: they were reprocessing jobs - we expected ~2% job failures due to memory. Frustrating at the site level since host recovery is labor intensive. We could kill jobs with limits (see the sketch below), but that sometimes results in unnecessary job failures. Q: how much swap was available?
    • Kaushik notes that these jobs failed everywhere.
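    • On the memory-limit option mentioned above, a minimal sketch (Python) of a job wrapper that caps a payload's address space with the standard resource module; the 3 GB cap is purely illustrative and not a value recommended in the meeting.
      # Sketch of a wrapper that limits a payload's virtual memory so a runaway
      # job is killed by the kernel instead of locking up the worker node.
      import resource
      import subprocess
      import sys

      LIMIT_BYTES = 3 * 1024**3   # illustrative 3 GB cap, not a recommended value

      def set_limit():
          # Applied in the child process just before the payload starts.
          resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

      if __name__ == "__main__":
          # Usage: python limit_wrapper.py <command> [args...]
          sys.exit(subprocess.call(sys.argv[1:], preexec_fn=set_limit))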
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/Panda_ADCoS-status-report-9_29-10_5.html
    
    [ ESD-ESD reprocessing exercise is done.  Merging jobs completed on 10/2.  Proposed dates for a postmortem: 10/14 or 10/15. ]
    
    1)  9/30 p.m.: Power outage at SLAC ended.  This was followed by a test of some recently installed RHEL 5 nodes on 10/2-10/3.  Jobs finished successfully.
    2)  10/1 a.m. -- from Saul:
    We had an air conditioning incident this morning at BU and had to turn off some of the blades while the room cooled down.  
    A corresponding bunch of failed panda jobs will follow.  Site services were not interrupted.
    3)  10/2: From Tomasz, updates to nagios URL's:
    New nagios page:  https://nagios.racf.bnl.gov/
    New dashboard locations:
    https://nagios.racf.bnl.gov/nagios/sla_array.html  (BNL)
    https://nagios.racf.bnl.gov/nagios/tier2.html  (Tier2 services)
    4)  10/2: Issue with LFC db corruption at MWT2 resolved -- thanks Charles, Hiro.
    5)  10/4: AGLT2, from Shawn:
    ==> The certificates updating at AGLT2 is not working because the AFS volume Certificates has some issues.
    ==> This issue should be resolved now.  I fixed the AFS volume replication.  There was a second problem on gate01.aglt2.org:  the ‘rsync’  RPM was not installed.  
    Once I restored it and re-ran the synchronize script things started working.  This was done on gate01, gate02, head01 and head02.
    6)  10/5: MWT2_IU -- several hundred job failures with the error "Get error: lsm-get failed."  From Sarah:
    ==> One of our dCache pools is having memory issues. I've stopped our local job scheduler and dq2-site services until it is recovered.
    ==> The pool has recovered. I've restarted the local job scheduler & dq2-siteservices.
     7) 10/6-10/7: Maintenance downtime at the MWT2 sites to apply security patches and work on SL5.3 migration.  Test jobs submitted after the downtime completed successfully.
    8)  Various s/w upgrades announced for BNL:
    a)  ATLAS dCache upgrade -- 13 Oct 2009 08h00 - 13 Oct 2009 17h00
     b)  Deploy security patches and bug fixes in the OS and the underlying Oracle Cluster database software -- 10/12/09 10:00 EDT - 10/12/09 13:00 EDT
    c)  GUMS Database Reconfiguration -- Tuesday, October 13th, 2009 1100 - 1300 EST.
    d)  HPSS software upgrade -- Tuesday Oct 6 - Thursday Oct 8
    
    Follow-ups from earlier reports:
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    (ii)  SLC5 upgrades are ongoing at the sites during the month of September.
    (iii) ATLAS User Analysis Test (UAT) scheduled for the second half of October.

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • The CosmicsAnalysis job using DB access has been successfully tested at FZK using Frontier and at DESY-HH using Squid/Frontier (by Johannes). The job has been put into HammerCloud and is now being tested at DE_PANDA; no submission to US sites yet.
    • TAG selection job has been put into HammerCloud and is now being tested (in DE cloud).
    • We now have 3 new analysis shifters confirmed, still waiting to hear from one person. I'm planning a training for them in October.
    • Jim C. contacted us about the status of the large containers for the stress test. Kaushik reported that we have a total of ~500M events produced. Only the first bunch has been replicated to Tier 2s, as I had validated them (step09.00000011.jetStream_medcut.recon.AOD.a84/ with 97.69M events and step09.00000011.jetStream_lowcut.recon.AOD.a84/ with 27.49M events). Others are at BNL, waiting to be merged and put into new containers. Depending on the time scale of the stress test this can be done in a few days, as Kaushik reported.
  • this meeting:
    • Preparing for UAT

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • New DQ2 (now available) testing this week at BNL. Do not update SS at the Tier 2's.
    • New BestMan to be installed at SLAC this week; FTS 2.2 testing to be completed after that.
    • BNL DQ2 SS host has some problems - investigating. Host rebooting automatically. It is delaying some transfers in the US cloud.
    • Changing way LFC connects to the network via F5 switch to avoid firewall problems. Unscheduled.
    • Did manual cleanup of jobs which failed LFC registrations
  • this meeting:
    • Summary:
      DDM/throughput related issues:
      1.  The DDM dataset/throughput monitor service was moved to a
      different host (from my desktop), but it keeps the same link.
      2.  Testing of FTS 2.2 continues.  A bug was found and is still being
      fixed by the developers.
      3.  The mass deletion of many old datasets (probably in MCDISK) by ADC
      is ongoing.
      4.  Next Tuesday, using the scheduled downtime for dCache maintenance,
      BNL FTS and LFC as well as the Panda mover will be shut down.
      During that time there will be several operations: application of an
      Oracle patch, the change in network routing between the F5 and the
      LFC, and the relocation of the Panda mover hosts.
      5.  ANL_LOCALGROUPDISK has been added to the T3 throughput test.

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • last week
    • https://twiki.cern.ch/twiki/bin/view/Atlas/RemoteConditionsDataAccess
    • Fred - continuing to work on this topic; not sure it's happening fast enough; Jack Cranshaw working w/ Alessandro and Xin to get pool file catalogs generated and installed correctly.
    • On-going discussion about how to set the environment variable to effectively use local conditions files in HOTDISK (see the sketch at the end of this list).
    • New version of Squid - needs to be updated at all sites. Improved mechanism for cache consistency; this version is more efficient.
    • Instructions have been updated.
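    • As a rough illustration of the environment-variable discussion above, a sketch (Python) that assembles a Frontier client configuration string pointing at a local Squid and checks that the proxy port answers. The host names are placeholders, and whether a given Athena release consumes FRONTIER_SERVER directly is an assumption to verify against the RemoteConditionsDataAccess instructions.
      # Sketch: build a Frontier client configuration for a site Squid and do a
      # basic TCP check of the proxy port.  Host names are placeholders.
      import os
      import socket

      SQUID_HOST = "squid.mysite.example"                       # placeholder local Squid
      FRONTIER_URL = "http://frontier.example.org:8000/atlr"    # placeholder server

      os.environ["FRONTIER_SERVER"] = (
          "(serverurl=%s)(proxyurl=http://%s:3128)" % (FRONTIER_URL, SQUID_HOST)
      )

      def squid_port_open(host, port=3128, timeout=5):
          try:
              socket.create_connection((host, port), timeout).close()
              return True
          except socket.error:
              return False

      print("FRONTIER_SERVER = " + os.environ["FRONTIER_SERVER"])
      print("Squid reachable: %s" % squid_port_open(SQUID_HOST))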
  • this week
    • Announcement from Dario - ATLAS specific instructions:
      • Dear all, ATLAS has decided to move ahead with a rapid deployment of FroNTier servers and Squid web caches to address the shortcomings of direct remote access to Oracle databases holding conditions data that are needed by jobs running worldwide. See note ATL-SOFT-INT-2009-003 (http://cdsweb.cern.ch/record/1210179/files/ATL-SOFT-INT-2009-003.pdf) for more details. The report of the "Analysis Model for the First Year" (AMFY) Task Force presented at today's ATLAS plenary meeting supports this rapid deployment; see http://indico.cern.ch/getFile.py/access?contribId=13&sessionId=6&resId=1&materialId=slides&confId=47256 for more details. Thanks to the development and deployment work of the BNL and other US-ATLAS sites, and the testing done in the German cloud and at CERN with help also from CMS experts, we now have the tools for a quick general deployment. It is required that every Tier-1 and Tier-2 site install a Squid server asap. A number of Tier-1s have installed, or are going to install, a FroNTier server; we are having, or going to have soon, direct discussions with Tier-1 sites. A FroNTier server is also under test at CERN, but this server will be initially reserved for further load tests. Preliminary instructions can be found in https://twiki.cern.ch/twiki/bin/view/Atlas/T2SquidDeployment and the references therein. While more FroNTier servers are installed, the functionality of Squids can be tested immediately against the BNL and/or FZK servers. The same wiki page is used to keep track of installation and test progress. In order to speed up the deployment of the Frontier/Squid system for ATLAS, we appoint John DeStefano (BNL) and Rod Walker (LMU) as coordinators of this activity. Within this organization, Douglas Smith (SLAC) will be the reference person concerning Squid issues. For general questions and hopefully answers, sites should send messages to the database operations support list at hn-atlas-dbops@cern.ch. Dario Barberis & Kors Bos & Jim Shank
    • Q: do we have an inconsistency w/ the instructions developed here, SquidTier2?
      • Note from John: The two sets of instructions are not consistent, or at least they've grown out of sync. The page on the US ATLAS Admins TWiki is a bit out of date (based on pre-v15.4.0 Athena), and ATLAS is developing custom RPMs for Squid. We brought up the discrepancy of information, and the need for consistent, centralized documentation at today's Frontier meeting, but the proper area and responsibility for this information is not yet clear. Maybe we can talk a bit about how to proceed in tomorrow's meeting. Douglas Smith is coordinating the Squid aspect, and I'm sure he will have an idea of what to communicate to the US ATLAS community. Meanwhile, the ATLAS link for Squid deployment below actually points to our RACF installation instructions for guidance, so we have some control there. One concern is that most (if not all) of the US T2s have already deployed and configured Squid instances, and I'm not sure how receptive they'll be to re-installing via CERN's RPMs.
    • Meeting summary, Oct 06: https://lists.bnl.gov/pipermail/racf-frontier-l/2009-October/000548.html
    • All Tier 2s must have this deployed ASAP. (We already have this.)
    • New Squid version providing for cache consistency.
    • Fred will clean up the site certification table. John will add references to the ATLAS twiki (https://twiki.cern.ch/twiki/bin/view/Atlas/T2SquidDeployment)
    • Update both US integration site certification and the ATLAS cloud table.

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • last week(s):
     
    • perfSONAR due to be released this Friday (Sep 25 2009).
      • Goal is to have all USATLAS Tier-2 sites updated by October 1, 2009
      • By October 7 all sites should have configured full mesh-testing for BWCTL and OWAMP testing for all Tier-2 sites and BNL
    • Need to create a mapping for all Tier-3s to provide an associated Tier-2 for testing purposes
    • Hiro will update the automated file transfer testing to allow Tier-2 to Tier-3 transfers (SRM-to-SRM) using the existing load testing framework
    • Meeting Notes USATLAS Throughput Call – September 29, 2009
                      ================================================
       Attending:   Sarah,  Joe,  Shawn,  Hiro,  Jason,  Dave,  Horst,  Karthik, Doug
      Excused:  Saul, Neng
      1)      perfSONAR status.   Release v3.1 is out.  Already installed  at OU and AGLT2.    Possible issue with iptables: Shawn will send screen capture to Jason.  MWT2, NET2 and Wisconsin have all confirmed they should be updated this week.  Need to hear from WT2 and SWT2-UTA.
      2)      Update on Tier-3 testing.   The info for the KOI perfSONAR box is (total cost is for 2 of them; note there is an additional charge for rails ~$30):
      Qty 2: 1U Intel Pentium Dual-Core E2200 2.2GHz System - unit cost $598.00, total $1,196.00
      Breakdown per System:
      1 ASUS RS100-X5/P12 1U Chassis with 180W Single Power Supply. Intel
      945GC/ICH7 Chipset Main Board. Onboard 2 x marvel 8056 GbE LAN
      Controller, Intel Graphics Media Accelerator 950, 2 x SATA Ports.
      1 Intel BX80557E2200 Pentium DC E2200 2.2GHz 1MB 800MHz Processor
      2 Kingston KVR667D2N5/1G 1GB DDR2-5300 667MHz Non-ECC Unbuffered
      1 Seagate ST3160815AS 160GB SATA 16MB 7200RPM Hard Drive
      1 ASUS Slim DVD-ROM Drive
      1 Labor/Shipping
      1 Three Year Parts Repair/Replacement Warranty
      TOTAL: $1,196.00
      3)      No updates.   Still working on “3rd party” transfer capability for use in Tier-2 to Tier-3 testing.  Will need to prestage long-term source files at Tier-2s for this.   Tier-2s will need to set aside ~30GB of space for testing files. 
      4)      Site reports
      a.       BNL – Nothing to report
       b.      AGLT2 – Still debugging low throughput.  Issues with SRM hanging.
      c.       MWT2 – SL5 upgrade underway to fix TCP/Network issues.  
      d.      NET2 – Working on perfSONAR updates.
      e.      SWT2 – perfSONAR installed and running. 
      f.        WT2 – No report
      g.       Wisconsin -  perfSONAR boxes should be upgraded this week.
      5)      AOB  - Some review of perfSONAR milestones.  October 7th to have all USATLAS Tier-2/Tier-1 sites config’ed for mesh-testing.
      a.       Manual load-test to AGLT2 on Wednesday 9:30 AM Eastern
      b.      MWT2 will schedule a manual load-test sometime after their SL5 update
      c.       Analysis stress test coming up.   May have implications for our preparations…
       We plan to meet again next week at the usual time (3 PM Eastern on Tuesday).   Send along any corrections or additions to these notes via email.
      Thanks,
       Shawn
    • All sites should update perfSONAR installations by the end of the week.
    • Next step: ensure all sites are configured for automated tests - Oct 7 deadline (see the mesh-test sketch below).
    • ADC operations is planning a big throughput test, Oct 5, 5 days. Not sure if Tier 2's will be involved.
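    • A minimal sketch (Python) of driving the full-mesh BWCTL/OWAMP testing called for above from a single node; the peer host names are made up, and only the most basic command forms are assumed (bwctl -c <host> -t <seconds> for throughput, owping <host> for one-way latency).
      # Sketch: run simple throughput and latency tests from this node to peers.
      import subprocess

      PEERS = [
          "psonar.bnl.example",      # placeholders for the Tier-1/Tier-2 boxes
          "psonar.aglt2.example",
          "psonar.mwt2.example",
      ]

      for host in PEERS:
          print("=== %s ===" % host)
          subprocess.call(["bwctl", "-c", host, "-t", "20"])   # ~20 s throughput test
          subprocess.call(["owping", host])                    # one-way latency/loss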
  • this week:
    • No meeting

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • Validation of upgrade to lcg-utils in wn-client, as well as curl. OSG 1.2.3 has this update (relevant for wn-client, client)
    • Tested with UCITB_EDGE7 site. Validation complete DONE
  • this week:

Site news and issues (all sites)

  • T1:
    • last week(s): all is well, not much change from last week. Completing electrical work in new data center. 3 tape libraries have arrived - being installed - adding more than 24K cartridge slots to existing robot.
    • this week: HPSS upgrade in progress, including movers; dCache upgrade next Tuesday (server hardware replacements); 1000 cores to arrive October 16; electrical infrastructure - 1M flywheel UPS testing.

  • AGLT2:
    • last week: Trying to get purchase orders out - Dell still not providing quotes for 2 TB drives; OSG security challenge readiness. Getting ready for SL5 conversion. One compute node ran 8 jobs to completion.
    • this week: Received quote for 2 TB drives from Dell - requisitions have been placed at UM; OSG security challenge went well, asked for feedback; are there common scripts? ATLAS releases have been patched on all US sites, but not the CERN repos. Plan on transitioning to SL5 at MSU. Downtime next Monday to move equipment around in prep for new equipment.

  • NET2:
    • last week(s): SL5 migration - have one node deployed and testing; will proceed. Gatekeeper problem last night - not sure what happened. Meeting w/ Dell. Added 130 TB of storage (2 partitions).
    • this week: OSG security tests passed. AC incident last week. Squid installed - need to upgrade. Talking w/ Dell on some configurations. DATADISK being moved to a new filesystem. Perfsonar updated.

  • MWT2:
    • last week(s): Two phases - downtime next week for security patches on head nodes. Will update SL5.3 on the compute nodes. LFC site problem fixed. Both sites have perfSONAR updated. Using Puppet as the new config management tool. OSG security drill tomorrow.
    • this week: Downtime for upgrades yesterday. Storage nodes updated to SL 5.3 64-bit. Upgraded perfSONAR but have not yet configured it. LFC problem last week was a manual maintenance issue.

  • SWT2 (UTA):
    • last week: UTA cluster will be updated to SL5; CPB will follow. OSG security drill tomorrow. Space getting tight. Had a few nodes go down during reprocessing.
    • this week: All is well - need to do the Squid update. Perfsonar boxes updated - need to add mesh testing. Working on procurements for UTA_SWT2 cluster.

  • SWT2 (OU):
    • last week: OSG security drill today. 100 TB useable storage held up by Langston University's purchasing.
    • this week:

  • WT2:
    • last week(s): SL5 migration - has a queue of 6 machines running already. Will be able to migrate some machines before mid-October. Planning two steps (100-200 systems first; hope to completely switch before the UAT); ~350 nodes. Will migrate to OSG 1.2 in the next two weeks. New BestMan testing Friday or Monday. New testbed for xrootd - finding a number of small bugs.
    • this week: Panda validation jobs ran successfully on the RHEL5-64 test queue (23 jobs successful, 1 failed but seems not site related). First production Frontier Squid running. Target Oct 13 for OSG 1.2 migration. New BestMan with required checksum features in place. Target new (production) releases of xrootd and xrootdfs on Friday.

Tier 3 program report (Doug)

  • last week:
    • still working on interviews
    • Doug feels we'll need t2-t3 'affinities'
    • T3 usability should be a focus in the next phase of integration program
    • Wants to know when next integration phase starts
    • Interviews w/ sites nearly completed.
    • Some sites will need a site mover.
    • How are dataflows monitored in T2's and T3's - are Gratia probes needed?
  • this week:
    • Tier 3 meeting at ANL October 29-30
    • finished phone interviews with all US Atlas institutions
      • Have Tier 3 contacts for each institution
      • Will update Usatlas-compcontacts-l mailing list at BNL to reflect this information
      • Rik and Doug will write a report available to all
    • ANL SRM now part of throughput testing
    • Need to startup the Tier 3 dataset subscriptions to Tier 3g sites (initially Duke and ANL) with more to come
    • Starting to write the Tier 3 configuration documentation
      • Rik will be the guinea pig for the instructions initially (he will set up an SRM first)

Carryover issues (any updates?)

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Tier 3 data transfers

  • last week
    • no change
  • this week

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the GridFTP server (see the checksum sketch below).
    • Need to communicate w/ CERN regarding how this will work with FTS.
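    • As a point of reference for the on-the-fly checksum discussion above, a minimal sketch of a streaming Adler32 computation in Python (zlib); this only illustrates the algorithm and is not the implementation being prepared for xrootd/GridFTP.
      # Streaming Adler32 of a file, read in chunks so large files never have to
      # be held in memory; zlib.adler32 carries the running checksum between chunks.
      import sys
      import zlib

      def adler32_of_file(path, chunk_size=1024 * 1024):
          value = 1                      # Adler32 starts at 1 by definition
          with open(path, "rb") as f:
              while True:
                  chunk = f.read(chunk_size)
                  if not chunk:
                      break
                  value = zlib.adler32(chunk, value)
          return value & 0xFFFFFFFF      # force an unsigned 32-bit result

      if __name__ == "__main__":
          print("%08x" % adler32_of_file(sys.argv[1]))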
  • this week

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of AGLT2 infrastructure to SL 5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for POSIX-like filesystems) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not yet seen a draft report.
    • Double check the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
    • Reporting comes from two sources: OIM and the GIP from the sites
    • Here is a snapshot of the most recent report for ATLAS sites:
      --------------------------------------------------------------------------------------------------------
      This is a report of Installed computing and storage capacity at sites.
      For more details about installed capacity and its calculation refer to the installed capacity document at
      https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
      --------------------------------------------------------------------------------------------------------
      * Report date: Tue Sep 29 14:40:07
      * ICC: Calculated installed computing capacity in KSI2K
      * OSC: Calculated online storage capacity in GB
      * UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
      necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
      are correct.
      * %Diff: % Difference between the calculated values and the UL/LL
             -ve %Diff value: Calculated value < Lower limit
             +ve %Diff value: Calculated value > Upper limit
      ~ Indicates possible issues with numbers for a particular site
      -----------------------------------------------------------------------------------------------------------------------------
      #  | SITE                 | ICC        | LL          | UL          | %Diff      | OSC         | LL      | UL      | %Diff   |
      -----------------------------------------------------------------------------------------------------------------------------
                                                            ATLAS sites
      1  | AGLT2                |      5,150 |       4,677 |       4,677 |          9 |    645,022 | 542,000 | 542,000 |      15 |
      2  | ~ AGLT2_CE_2         |        165 |         136 |         136 |         17 |     10,999 |       0 |       0 |     100 |
      3  | ~ BNL_ATLAS_1        |      6,926 |           0 |           0 |        100 |  4,771,823 |       0 |       0 |     100 |
      4  | ~ BNL_ATLAS_2        |      6,926 |           0 |         500 |         92 |  4,771,823 |       0 |       0 |     100 |
      5  | ~ BU_ATLAS_Tier2     |      1,615 |       1,910 |       1,910 |        -18 |        511 | 400,000 | 400,000 | -78,177 |
      6  | ~ MWT2_IU            |        928 |       3,276 |       3,276 |       -252 |          0 | 179,000 | 179,000 |    -100 |
      7  | ~ MWT2_UC            |          0 |       3,276 |       3,276 |       -100 |          0 | 179,000 | 179,000 |    -100 |
      8  | ~ OU_OCHEP_SWT2      |        611 |         464 |         464 |         24 |     11,128 |  16,000 | 120,000 |     -43 |
      9  | ~ SWT2_CPB           |      1,389 |       1,383 |       1,383 |          0 |      5,953 | 235,000 | 235,000 |  -3,847 |
      10 | ~ UTA_SWT2           |        493 |         493 |         493 |          0 |     13,752 |  15,000 |  15,000 |      -9 |
      11 | ~ WT2                |      1,377 |         820 |       1,202 |         12 |          0 |       0 |       0 |       0 |
      -----------------------------------------------------------------------------------------------------------------------------
      
    • Karthik will clarify some issues with Brian
    • Will work site-by-site to get the numbers reporting correctly
    • What about storage information in config ini file?
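    • For reference, the %Diff column in the report above appears to compare the calculated value against the OIM limit, relative to the calculated value; a small sketch (Python) reproduces two rows under that inferred formula. Small offsets vs. the report come from rounding of the displayed values.
      # Inferred: %Diff = (calculated - limit) / calculated * 100, using the upper
      # limit when the calculation exceeds it and the lower limit when it falls short.
      def pct_diff(calculated, lower, upper):
          if calculated > upper:
              return (calculated - upper) / float(calculated) * 100
          if calculated < lower:
              return (calculated - lower) / float(calculated) * 100
          return 0.0

      # Two ICC rows (KSI2K) copied from the ATLAS table above.
      print("AGLT2          %+.0f%%" % pct_diff(5150, 4677, 4677))   # report: 9
      print("BU_ATLAS_Tier2 %+.0f%%" % pct_diff(1615, 1910, 1910))   # report: -18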
  • this meeting

AOB

  • last week
  • this week
    • None


-- RobertGardner - 07 Oct 2009
