
MinutesOct14

Introduction

Minutes of the Facilities Integration Program meeting, Oct 14, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Booker, Rik, Doug, Michael, Wei, Jim C, Sarah, Rob, Charles, Aaron, Hiro, John, Saul, Fred, John D, Tom, Wensheng, Patrick, UTA
  • Apologies: Karthik

Integration program update (Rob, Michael)

  • SiteCertificationP10 - FY09Q04
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Quarterly reports (over)due
      • Next phase - will focus on end-user analysis with facilities services.
      • Ongoing discussion w/ RRB regarding aggressive ATLAS needs; we need to ramp up aggressively: for each of the 5 Tier 2s, 11 kHS06 fully devoted to central ATLAS production plus 1100 TB of disk (5.7 PB aggregate for the US share among all Tier 2s). Some debate over the schedule: April 2010 nominal, staggered over the calendar year; 2 PB/quarter, 7 PB for all Tier 2s worldwide, 25 PB by the first quarter of 2011. For the US: completed by end of 2010. This is the 2010 pledge. These figures are for usable storage, in decimal units.
      • Michael will provide a table for deployment
    • this week

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week:
    • Tier 3 meeting at ANL October 29-30
    • Finished phone interviews with all US ATLAS institutions
      • Have Tier 3 contacts for each institution
      • Will update Usatlas-compcontacts-l mailing list at BNL to reflect this information
      • Rik and Doug will write a report available to all
    • ANL SRM now part of throughput testing
    • Need to start up the Tier 3 dataset subscriptions to Tier 3g sites (initially Duke and ANL), with more to come
    • Starting to write the Tier 3 configuration documentation
      • Rik will be the guinea pig on the instructions initially (he will set up an SRM first)
  • this week:
    • Will have a visit to Wisconsin in a couple of weeks
    • Test cluster will be set up at UW to examine the instructions
    • Tier 3 workshop at ANL Oct 29-30; Plan strategy for deploying Tier 3s. Input from experts.
    • Tier 3 commissioning meeting - will be a separate meeting, will summarize and report here.
    • Doug working with Hiro on the SE - focus of the integration
    • Hiro will begin integrating Duke and ANL into site services and FTS at BNL, and FTS controls throughput as appropriate
    • Kaushik: modification in Panda complete for destination option. Works only for Tier 3 sites.
    • Plan a Tier 3 data transfer stress test against the Tier 2s.
    • What about cross-cloud transfers? Right now it's only a US solution.

UAT program (Kaushik, Jim C)

  • last week:
    • ADC daily operations notes
    • See https://twiki.cern.ch/twiki/bin/view/Atlas/UserAnalysisTest
    • Status: 6 containers have been defined in DQ2. 72 TB.
    • Date: October 21-23.
    • Two (pre-test) containers (uat09) have already been distributed and tested. Will delete the step09 containers.
    • Kaushik will make up a table of containers at Tier 2s.
    • Jim C will put out a call for users to test the datasets.
    • Which space token? MCDISK
    • Subscription lists will be handled by Kaushik (consulting w/ Alexei and Simone)
    • Metrics: (Nurcan)
      • How many job slots per user: estimated 1.5-5 hours, with 300-500 job slots per user, to process complete containers.
      • Will use the same step09 metrics: #files, CPU/wall time, #events (from parsed Athena logfiles - in Rel 15.6.0, so this won't be available in time; will need to be scripted in the pilot - Paul). A parsing sketch follows this section.
      • would like a dedicated monitoring page for these (Dan will provide this); Sergei and Torre working on the monitor in Panda
      • 20 users from US cloud; 4-5 elsewhere. Need to broker US-user jobs to sites outside the US.
      • Dan will summarize at tomorrow's ADC meeting.
  • this week:
    • 6 containers are mostly replicated to all the Tier 1s. Stephane has a table.
    • Containers have been assigned to Tier 2s - there is a table. 1-4 containers out of the six at each Tier 2. BNL has all 6.
    • Jim: contacted 18 users from step09; 11 confirmed. Several other inquiries.
    • Follow the computing model - 50% of the resources for a Tier 2. There will probably be about 3000 slots, 2000 across the Tier 2s.
    • What about special users requiring special access?
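  • Regarding the event-count metric above: a minimal, hypothetical sketch of the kind of log parsing that would have to go into the pilot. The log-line patterns are assumptions for illustration only, not the actual Athena output format:
      #!/usr/bin/env python
      # Hypothetical sketch: pull an event count and CPU time out of an Athena log.
      # The regular expressions are illustrative assumptions; real pilot code would
      # have to match the actual Athena log format for the release in use.
      import re, sys

      def parse_athena_log(path):
          nevents, cputime = None, None
          for line in open(path):
              m = re.search(r'events processed\D*(\d+)', line, re.IGNORECASE)
              if m:
                  nevents = int(m.group(1))
              m = re.search(r'cpu\s*time\D*([\d.]+)', line, re.IGNORECASE)
              if m:
                  cputime = float(m.group(1))
          return nevents, cputime

      if __name__ == '__main__':
          n, c = parse_athena_log(sys.argv[1])
          print "events=%s cpu=%s" % (n, c)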

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/Panda_ADCoS-status-report-9_29-10_5.html
    
    [ ESD-ESD reprocessing exercise is done.  Merging jobs completed on 10/2.  Proposed dates for a postmortem: 10/14 or 10/15. ]
    
    1)  9/30 p.m.: Power outage at SLAC ended.  This was followed by a test of some recently installed RHEL 5 nodes on 10/2-10/3.  Jobs finished successfully.
    2)  10/1 a.m. -- from Saul:
    We had an air conditioning incident this morning at BU and had to turn off some of the blades while the room cooled down.  
    A corresponding bunch of failed panda jobs will follow.  Site services were not interrupted.
    3)  10/2: From Tomasz, updates to nagios URL's:
    New nagios page:  https://nagios.racf.bnl.gov/
    New dashboard locations:
    https://nagios.racf.bnl.gov/nagios/sla_array.html  (BNL)
    https://nagios.racf.bnl.gov/nagios/tier2.html  (Tier2 services)
    4)  10/2: Issue with LFC db corruption at MWT2 resolved -- thanks Charles, Hiro.
    5)  10/4: AGLT2, from Shawn:
    ==> The certificates updating at AGLT2 is not working because the AFS volume Certificates has some issues.
    ==> This issue should be resolved now.  I fixed the AFS volume replication.  There was a second problem on gate01.aglt2.org:  the ‘rsync’  RPM was not installed.  
    Once I restored it and re-ran the synchronize script things started working.  This was done on gate01, gate02, head01 and head02.
    6)  10/5: MWT2_IU -- several hundred job failures with the error "Get error: lsm-get failed."  From Sarah:
    ==> One of our dCache pools is having memory issues. I've stopped our local job scheduler and dq2-site services until it is recovered.
    ==> The pool has recovered. I've restarted the local job scheduler & dq2-siteservices.
    7) 10/6-10/7: Maintenance downtime at the MWT2 sites to apply security patches and work on SL5.3 migration.  Test jobs submitted after the downtime completed successfully
    8)  Various s/w upgrades announced for BNL:
    a)  ATLAS dCache upgrade -- 13 Oct 2009 08h00 - 13 Oct 2009 17h00
     b)  Deploy security patches and bug fixes in the OS and in the software underlying the Oracle Cluster database -- 10/12/09 10:00 EDT - 10/12/09 13:00 EDT
    c)  GUMS Database Reconfiguration -- Tuesday, October 13th, 2009 1100 - 1300 EST.
    d)  HPSS software upgrade -- Tuesday Oct 6 - Thursday Oct 8
    
    Follow-ups from earlier reports:
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    (ii)  SLC5 upgrades are ongoing at the sites during the month of September.
    (iii) ATLAS User Analysis Test (UAT) scheduled for the second half of October.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/Panda_ADCoS-status-report-10_6-10_12.html
    
    1)  10/6-10/8 -- Intermittent job failures at various US sites due to lack of tape access during the HPSS upgrade at BNL.
    2)  Problems with the panda servers at CERN following a move to new hosts.  One machine was blocked by a firewall.  Issues seem to be resolved now.  eLog 6092, 6098.
    3)  10/9: MWT2_IU -- issue with access to library file ( libpopt.so.0) from some SL5 worker nodes -- they were taken offline to fix the problem.  ggus 52293.
    4)  10/9: Armen and Alden completed migration of user analysis areas to USERDISK at BNL.
    5)  10/9: BU -- gatekeeper reboot -- resulted in ~50 "lsm-get failed" errors.
    6)  Over this past weekend (10/10 - 10/12) -- large number of failed jobs at BNL - issue was a misconfiguration in schedconfigdb -- resolved.  See ggus 52281.
    7)  10/12: BU -- ~500 failed jobs due to a GPFS partition filling up -- resolved.  ggus 52283.
    8)  10/12: AGLT2 -- Jobs failing due to lack of free space in AGLT2_PRODDISK -- resolved.  ggus 52274.
    9)  10/13: dCache upgrade at BNL -- some residual issues following re-start, but everything seems to be resolved now.
    10)  10/13: UTA_SWT2 set 'offline' to investigate problems with the ibrix storage.
    11)  10/13-14: SLAC outage for OSG upgrade -- initially some issues sending test jobs to the site, owing to stale entries on the BNL submit host -- cleaned up by Xin -- 
    test jobs eventually succeeded, site set back to 'online'.
    12)  Large increase recently in the number of nagios alerts -- from Tomasz:
     Nagios seems to flip flop on gatekeeper tests. The problem started a few days ago and we do not know the cause. It seems intermittent: I can run the bare test several times 
    by hand and it works and then suddenly it fails.  In addition to that we do see network interruptions which come and go.  Those two problems may or may not be related.  
    I will disable nagios e-mail alerts for gatekeeper tests in order to reduce noise.
    Later:
    Last few days nagios was going nuts about gatekeeper tests: the probes were flipping up and down continuously.  We had some sort of connectivity problem: nagios could not 
    reach various hosts. The connections would intermittently fail. To make matters harder to debug the connection failures appeared completely random.  In the end I had to 
    disable notifications from nagios gatekeeper probes until the underlying connectivity problem is resolved.  It seems that by now we have a partial understanding of what 
    caused the connectivity problems. The probes are back green and in a few moments I will re-enable nagios notifications.  I still have one issue which I need to discuss 
    with administrators of sites which run osg 1.2 - I will contact you off line.
    
    Follow-ups from earlier reports:
    (iii) ATLAS User Analysis Test (UAT) scheduled for October 21-23.
    

Analysis queues (Nurcan)

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Summary:
       DDM/throughput related issues:
       1.  The DDM dataset/throughput monitor service was moved to a different host (from my desktop), but it has the same link.
       2.  Testing of FTS 2.2 continues; bugs are still being found and fixed by the developers.
       3.  The mass deletion of many old datasets (probably in MCDISK) by ADC is ongoing.
       4.  Next Tuesday, during the scheduled downtime for dCache maintenance, BNL FTS and LFC as well as the PANDA mover will be shut down.  During that window several operations will take place: application of an Oracle patch, a change in the network routing between F5 and the LFC, and relocation of the PANDA mover hosts.
       5.  ANL_LOCALGROUPDISK has been added to the T3 throughput test.
  • this meeting:
    • FTS 2.2 still not shown to be working well everywhere; checksum support shown to work; improper implementation of shares; lots of things still not understood
    • Thus site services consolidation schedule still up in the air
    • Note: the logging service at BNL from SS continues to work okay
    • Hiro is running scans on LFCs - be on the lookout for problems

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • last week
    • Announcement from Dario - ATLAS specific instructions
    • Q: do we have an inconsistency with the instructions developed here: SquidTier2?
      • Note from John: The two sets of instructions are not consistent, or at least they've grown out of sync. The page on the US ATLAS Admins TWiki is a bit out of date (based on pre-v15.4.0 Athena), and ATLAS is developing custom RPMs for Squid. We brought up the discrepancy of information, and the need for consistent, centralized documentation at today's Frontier meeting, but the proper area and responsibility for this information is not yet clear. Maybe we can talk a bit about how to proceed in tomorrow's meeting. Douglas Smith is coordinating the Squid aspect, and I'm sure he will have an idea of what to communicate to the US ATLAS community. Meanwhile, the ATLAS link for Squid deployment below actually points to our RACF installation instructions for guidance, so we have some control there. One concern is that most (if not all) of the US T2s have already deployed and configured Squid instances, and I'm not sure how receptive they'll be to re-installing via CERN's RPMs.
    • Meeting summary, Oct 06: https://lists.bnl.gov/pipermail/racf-frontier-l/2009-October/000548.html
    • All Tier 2s must have this deployed ASAP. We already have this, though.
    • New Squid version providing for cache consistency.
    • Fred will clean up the site certification table. John will add references to the ATLAS twiki (https://twiki.cern.ch/twiki/bin/view/Atlas/T2SquidDeployment)
    • Update both US integration site certification and the ATLAS cloud table.
  • this week
    • Fred spent some time revising the SquidTier2 page
    • Checked squid version at all Tier 2 sites
    • Please send any documentation problems to John or Fred
    • Two environment variables need to be set at each site: one for the Squid/Frontier server and one for the XML file location of the pool conditions files. These two variables still need to be standardized; also caching of an over-subscribed conditions file (not sure how serious this is). This is similar to the issue with the dbRelease file (though that gets staged by the pilot); Athena uses remote I/O since the conditions file is never staged (thus pcache does not apply). A sketch of a site-side check follows this list.
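    • A minimal sketch of a worker-node check for the two variables. The names FRONTIER_SERVER and ATLAS_POOLCOND_PATH are placeholders, since the standardization mentioned above is still pending:
      #!/usr/bin/env python
      # Sketch of a per-node sanity check for the two conditions-access variables.
      # The variable names below are assumed placeholders, pending standardization.
      import os, sys

      required = ['FRONTIER_SERVER',      # assumed name: Squid/Frontier server setting
                  'ATLAS_POOLCOND_PATH']  # assumed name: location of the pool conditions XML/files

      missing = [v for v in required if not os.environ.get(v)]
      if missing:
          print "Missing conditions-access variables: %s" % ', '.join(missing)
          sys.exit(1)
      print "Conditions-access environment looks OK"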

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    
    
  • this week:
    • USATLAS Throughput  Meeting Notes – October 13, 2009
       
      Attending:  Shawn, David, Mike, Sarah,  Doug, Jeff,  Horst, Hiro
      Excused: Karthik, Jason
       
       Primary topic of discussion was last week's perfSONAR installation/configuration for USATLAS.  A survey during the call showed that OU's instance has been working fine since configuration; MWT2_IU had some issues with services stopping, but a reconfiguration and reboot fixed it.  AGLT2_UM had problems with the perfSONAR-BUOY services stopping, as well as PingER stopping.  The Wisconsin site is up but is not yet configured properly; Neng is looking into this.  The AGLT2_UM issues are being debugged by the Internet2 developers.  The AGLT2_MSU instances also seem to be running without issue so far.  Didn't get reports from the other sites.
       
      Jeff Boote mentioned syslog configuration specifically on the AGLT2 boxes.  UM needs to look at it to try for a more rational syslog configuration that also sends data to the central syslog host UM uses.   
       
      Jeff also mentioned if perfSONAR  software changes are needed another ISO could be produced.      We will have to see what debugging the problems to-date turns up.
       
       Sarah provided our first perfSONAR measurement question, for testing from IU to UTA.  Sarah is seeing a lot of packet loss to UTA SWT2 (70 of 600 packets, roughly 12%) during the OWAMP testing.  Even losing 1 OWAMP packet in 600 could be significant, so this is really a large loss that needs to be tracked down.  The relevant traceroutes are here (both directions):
       
      [knoppix@Knoppix ~]$ traceroute netmon1.atlas-swt2.org
      traceroute to netmon1.atlas-swt2.org (129.107.255.26), 30 hops max, 40 byte packets
       1  149.165.225.254 (149.165.225.254)  11.924 ms  0.301 ms  0.335 ms
       2  xe-0-2-0.2012.rtr.ictc.indiana.gigapop.net (149.165.254.249)  0.237 ms  0.265 ms  0.246 ms
 3  tge-0-1-0-0.2093.chic.layer3.nlr.net (149.165.254.226)  6.450 ms  5.447 ms  5.252 ms
 4  hous-chic-67.layer3.nlr.net (216.24.186.24)  31.837 ms  31.056 ms  30.961 ms
 5  hstn-hstn-nlr-ge-0-0-0-0-layer3.tx-learn.net (74.200.188.34)  30.759 ms  30.803 ms  30.726 ms
 6  dlls-hstn-nlr-ge-1-0-0-3002-layer3.tx-learn.net (74.200.188.38)  36.091 ms  36.169 ms  36.092 ms
       7  74.200.188.42 (74.200.188.42)  36.112 ms  36.139 ms  36.218 ms
       8  as16905_uta7206_m320_nlr.uta.edu (129.107.35.114)  37.548 ms  37.546 ms  37.494 ms
       9  netmon1.atlas-swt2.org (129.107.255.26)  37.664 ms  37.645 ms  37.669 ms
       
      Reverse traceroute to my laptop:
       
      Executing exec(traceroute, -m 30 -q 3 -f 3, 149.166.143.177, 140)
      traceroute to 149.166.143.177 (149.166.143.177), 30 hops max, 140 byte packets
       3  74.200.188.41 (74.200.188.41)  1.885 ms  1.772 ms  1.807 ms
 4  hstn-dlls-nlr-ge-3-0-0-3002-layer3.tx-learn.net (74.200.188.37)  7.169 ms  7.150 ms  7.099 ms
 5  hstn-hstn-nlr-layer3.tx-learn.net (74.200.188.33)  7.650 ms  7.598 ms  7.536 ms
 6  chic-hous-67.layer3.nlr.net (216.24.186.25)  33.475 ms  33.489 ms  33.393 ms
 7  xe-1-2-0.2093.rtr.ictc.indiana.gigapop.net (149.165.254.225)  37.669 ms  37.555 ms  37.657 ms
 8  tge-1-2.9.br.ul.net.uits.iu.edu (149.165.254.230)  37.695 ms  37.733 ms  37.751 ms
 9  tge-1-4.912.cr.ictc.net.uits.iu.edu (149.166.5.6)  38.877 ms  38.922 ms  40.268 ms
10  149-166-143-177.dhcp-in.iupui.edu (149.166.143.177)  37.809 ms  37.905 ms  37.944 ms
       
       Testing from Tier-2 to Tier-3 has been enabled in Hiro's load tests (NET2 - Duke and MWT2_UC - Argonne).  Moving 7 files from a dataset.  See Hiro's update page at:
      https://www.usatlas.bnl.gov/dq2/throughput
       
      Milestone for 1GB/sec for 1 hour was ALMOST completed from BNL to MWT2_UC.    Need to redo this during the next week.   Sites should contact Hiro to arrange a throughput test.  Need to get 1GB/sec for one hour from BNL -> (set of one or more Tier-2s).   Individual sites with 10GE should strive for 400MB/sec for > ½  hour.
       
      IU notices a slowdown via Hiro’s automated load-test starting between Sep 30 and October 1st 2009.   Sarah is looking into what changed.
       
      Future calls will regularly discuss perfSONAR measurement results once we start acquiring enough data from our testing configuration.
       
      Hiro will be contacting Jeff Boote (Internet2) to get information on the API for accessing perfSONAR measurement results for future integration into his plots.
       
      Please send along any corrections or additions to these minutes via email to the list.   We plan to meet again next week at the normal time.
       
      Shawn
    • Tier 3 is now added to Hiro's throughput test
    • Can now do any-to-any site testing.

Site news and issues (all sites)

  • T1:
    • last week(s): HPSS upgrade in progress, including movers; dCache upgrade next Tuesday (server hardware replacements); 1000 cores to arrive October 16; electrical infrastructure: 1 MW flywheel UPS testing.
    • this week: Saturday morning thousands of jobs were in holding - there was a schedd config problem; this issue needs attention. Upgrades yesterday: Xin upgraded to OSG 1.2.3; Oracle maintenance for the LFC backend; Chris made kernel updates on worker nodes for security; dCache upgrades. SRM load quite high, but understood. GridFTP doors sit between the public and internal networks, so traffic goes through firewalls; ticket submitted to the dCache developers. Computing room infrastructure developments: certification of the 1 MW flywheel UPS passed.

  • AGLT2:
    • last week: Received a quote for 2 TB drives from Dell - requisitions have been placed at UM. OSG security challenge went well; asked for feedback - are there common scripts? ATLAS releases have been patched at all US sites, but not the CERN repos. Plan on transitioning to SL5 at MSU. Downtime next Monday to move equipment around in prep for new equipment.
    • this week: Squid install resolved; next week a short outage at UM for hardware relocation; SL5 update, rolling update. Dell compute and storage expected to arrive at the end of the month.

  • NET2:
    • last week(s): OSG security tests passed. AC incident last week. Squid installed - need to upgrade. Talking w/ Dell on some configurations. DATADISK being moved to a new filesystem. Perfsonar updated.
    • this week: SE datadisk filled - recovered. Today there were SRM problems, fixed with a restart. Perfsonar, squid, SL5 updates.

  • MWT2:
    • last week(s): Downtime for upgrades yesterday. Storage nodes updated to SL 5.3 64-bit. Upgraded perfSONAR; configuration not yet complete. LFC problem last week - manual maintenance issue.
    • this week: Still working on SL5 upgrade of last week - test jobs are running on worker nodes, then will do a rolling update.

  • SWT2 (UTA):
    • last week: UTA cluster will be updated to SL5; CPB will follow. OSG security drill tomorrow. Space getting tight. Had a few nodes go down during reprocessing.
    • this week: Had an issue with the UTA_SWT2 storage system, hopefully it will last until the new Dell purchase arrives; packet loss issue.

  • SWT2 (OU):
    • last week: OSG security drill today. 100 TB usable storage held up by Langston University's purchasing.
    • this week: Finally in the process of getting a new quote from Dell and DDN. Will probably get more compute nodes.

  • WT2:
    • last week(s): Panda validation jobs ran successfully on the RHEL5-64 test queue (23 jobs successful, 1 failed but seems not site-related). First production Frontier Squid running. Target Oct 13 for OSG 1.2 migration. New BeStMan with the required checksum features in place. Target new (production) releases of xrootd and xrootdfs on Friday.
    • this week: RHEL5 migration continuing, ~100 systems complete. New xrootd release with expected new features; will deploy later. WestGrid certificates causing problems for BeStMan. Updated to OSG 1.2.3.

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
  • this week:
    • BNL updated

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Tier 3 data transfers

  • last week
    • no change
  • this week

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding an Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the GridFTP server. (A checksum sketch follows this list.)
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
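  • For reference, a minimal sketch of an Adler32 file checksum. This only illustrates the checksum being discussed; it is not Alex's xrootd implementation:
      #!/usr/bin/env python
      # Minimal Adler32 file checksum, shown only to illustrate the checksum
      # being added on the xrootd/GridFTP side; not the actual xrootd code.
      import zlib, sys

      def adler32(path, blocksize=1024*1024):
          value = 1                        # Adler32 starting value
          f = open(path, 'rb')
          while True:
              data = f.read(blocksize)
              if not data:
                  break
              value = zlib.adler32(data, value)
          f.close()
          return "%08x" % (value & 0xffffffff)   # keep it positive, 8 hex digits

      if __name__ == '__main__':
          print adler32(sys.argv[1])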

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed.
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for POSIX-like filesystems) - requires OSG 1.0.4 or later. Note - not important for WLCG, since it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not seen yet a draft report.
    • Double-check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
    • Reporting comes from two sources: OIM and the GIP at the sites
    • Here is a snapshot of the most recent report for ATLAS sites (a sketch of the %Diff calculation follows this section):
      --------------------------------------------------------------------------------------------------------
      This is a report of Installed computing and storage capacity at sites.
      For more details about installed capacity and its calculation refer to the installed capacity document at
      https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
      --------------------------------------------------------------------------------------------------------
      * Report date: Tue Sep 29 14:40:07
      * ICC: Calculated installed computing capacity in KSI2K
      * OSC: Calculated online storage capacity in GB
      * UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
      necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
      are correct.
      * %Diff: % Difference between the calculated values and the UL/LL
             -ve %Diff value: Calculated value < Lower limit
             +ve %Diff value: Calculated value > Upper limit
      ~ Indicates possible issues with numbers for a particular site
      -----------------------------------------------------------------------------------------------------------------------------
      #  | SITE                 | ICC        | LL          | UL          | %Diff      | OSC         | LL      | UL      | %Diff   |
      -----------------------------------------------------------------------------------------------------------------------------
                                                            ATLAS sites
      1  | AGLT2                |      5,150 |       4,677 |       4,677 |          9 |    645,022 | 542,000 | 542,000 |      15 |
      2  | ~ AGLT2_CE_2         |        165 |         136 |         136 |         17 |     10,999 |       0 |       0 |     100 |
      3  | ~ BNL_ATLAS_1        |      6,926 |           0 |           0 |        100 |  4,771,823 |       0 |       0 |     100 |
      4  | ~ BNL_ATLAS_2        |      6,926 |           0 |         500 |         92 |  4,771,823 |       0 |       0 |     100 |
      5  | ~ BU_ATLAS_Tier2     |      1,615 |       1,910 |       1,910 |        -18 |        511 | 400,000 | 400,000 | -78,177 |
      6  | ~ MWT2_IU            |        928 |       3,276 |       3,276 |       -252 |          0 | 179,000 | 179,000 |    -100 |
      7  | ~ MWT2_UC            |          0 |       3,276 |       3,276 |       -100 |          0 | 179,000 | 179,000 |    -100 |
      8  | ~ OU_OCHEP_SWT2      |        611 |         464 |         464 |         24 |     11,128 |  16,000 | 120,000 |     -43 |
      9  | ~ SWT2_CPB           |      1,389 |       1,383 |       1,383 |          0 |      5,953 | 235,000 | 235,000 |  -3,847 |
      10 | ~ UTA_SWT2           |        493 |         493 |         493 |          0 |     13,752 |  15,000 |  15,000 |      -9 |
      11 | ~ WT2                |      1,377 |         820 |       1,202 |         12 |          0 |       0 |       0 |       0 |
      -----------------------------------------------------------------------------------------------------------------------------
      
    • Karthik will clarify some issues with Brian
    • Will work site-by-site to get the numbers reporting correctly
    • What about storage information in the config.ini file?
  • this meeting
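  • The %Diff column in the snapshot above appears to follow (calculated - limit) / calculated x 100, using the lower limit when the calculated value falls below it and the upper limit when it exceeds it. A small sketch under that assumption (the report generator's exact formula is not given in the minutes):
      # Sketch of the %Diff calculation as inferred from the report snapshot above;
      # the exact formula used by the report generator is an assumption.
      def pct_diff(calculated, lower, upper):
          if calculated < lower:
              limit = lower          # negative %Diff: below the lower limit
          elif calculated > upper:
              limit = upper          # positive %Diff: above the upper limit
          else:
              return 0               # within the OIM limits (or all zero)
          if calculated == 0:
              return -100            # nothing published but a non-zero limit set
          return int(100.0 * (calculated - limit) / calculated)

      # Example: pct_diff(5150, 4677, 4677) -> 9, matching the AGLT2 ICC row.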

AOB

  • last week
    • None
  • this week


-- RobertGardner - 14 Oct 2009

Attachments


Facility-Capacities-09-2009.pdf (167.3 KB) - RobertGardner, 14 Oct 2009
 