Minutes of the Facilities Integration Program meeting, May 1, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode:
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”


  • Meeting attendees: Michael, Rob, Torre, Bob, Dave, Sarah, Shawn, Patrick, Saul, Mark, Kaushik, Wei, Fred, John Brunelle, Alden, Mark, Armen, Horst
  • Apologies: Jason
  • Guests: Scott Teige

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • CapacitySummary - please update v27 in google docs
      • SiteCertificationP24 - in particular FabricUpgradeP24
      • Discussion of HS06 values in the facilities spreadsheet. Bob reports that the HS06 values are inconsistent; some (e.g. BNL's) are unsupported, since not all benchmark results have been submitted to Bob.
        • Plan: go to CapacitySummary
        • Look at Bob's analysis
        • Will need to do a site-by-site re-certification
        • Each site should re-examine its HS06 values and the jobs/slot columns, and check them for consistency
    • this week
      • Program this quarter: SL6 migration; FY13 procurement; perfsonar update; Xrootd update; OSG and wn-client updates; FAX
      • WAN performance problems, and general strategy
      • Introducing FAX into real production, and deciding which use cases to support.

Supporting OASIS usability by OSG VOs (Scott Teige)

last meeting
  • See presentations at OSG All Hands
  • Still about a month away
  • Note this is CVMFS for OSG VOs
  • Fred will invite Scott to report in two weeks.

this meeting

  • OSG Application Software Installation Service
  • OASIS presentation: OASIS_Demo.pptx
  • Three machines; the OASIS login node is the value-added component
  • stratum-0 (NBD); stratum-1 (critical)
  • A mechanism for serving application software to multiple VOs; brings in asynchronous support.
  • An OASIS VO manager installs software via gsi-ssh logins. OASIS managers have a vested interest in making it work.
  • The publishing mechanism controls locks and updates (see the sketch below)
  • Working on making it easier (e.g. getting keys)
  • Cybersecurity trust comes via OSG trust relationships
  • Saul reports that HU has already done this and it works fine.
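
To make the workflow above concrete, here is a minimal sketch written in Python as a wrapper around gsi-ssh. The login host name, per-VO install path, and publish command are assumptions (based on OSG OASIS documentation of the time), not details confirmed in this meeting.

    # Hypothetical sketch of the OASIS VO-manager workflow described above.
    # The host name, install path, and publish command are assumptions.
    import subprocess

    OASIS_LOGIN = "oasis-login.opensciencegrid.org"   # assumed login node
    VO_INSTALL_DIR = "/stage/oasis/myvo"              # assumed per-VO staging area

    def run_on_oasis(command):
        """Run a command on the OASIS login node via gsi-ssh (requires a grid proxy)."""
        return subprocess.run(["gsissh", OASIS_LOGIN, command], check=True)

    # 1) Install or update the VO's software in its staging area.
    run_on_oasis("cd %s && tar xzf /tmp/myvo-app-1.0.tgz" % VO_INSTALL_DIR)

    # 2) Trigger the publishing mechanism, which takes the lock and pushes the
    #    update out to the stratum-1 replicas.
    run_on_oasis("osg-oasis-update")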

Reviewing facility accounting (Bob)

There were many questions last month concerning the WLCG accounting and apparently inconsistent reports. A few sites appeared to be significantly different from what was expected. It was our expectation that the HS06 machine ratings used in our capacity summary spreadsheet were correct.

However, an examination of the values in that sheet showed some significant disagreements between our sites. Further examination showed that our GIP-reported sub-clusters did not agree with the spreadsheet in many cases, and led us to wonder if we had a complete understanding of how the reporting is done.

A good explanation of that reporting process can be found here. There are repeated references to the lcg-reportableSites file used as the source for per-site HS06 values, but at the time I read this, there was no indication of where that file could be found, or what its content was (the Twiki may have been updated since I first found it). I therefore opened an OSG GOC ticket to determine that answer.

This is the answer I received:

The information used to be available here: http://gratiaweb.grid.iu.edu/gratia/wlcg_reporting but it has been broke for some time. See... https://ticket.grid.iu.edu/goc/12335

It can also be found here http://gr13x6.fnal.gov:8319/gratia-apel/ which at one time was accessible from the old Gratia reporting UI. ... click on the 2013-04 Summary html link. It will be the 2nd column. The far right column shows the resources within the reported resource group (WLCG site = OSG resource group) whose accounting data is forwarded. And OSG resource = Gratia site

Furthermore, this table is updated, for USATLAS T1 and T2, from our capacity summary spreadsheet content. It is our task to notify the OSG GOC, via ticket, when values in the spreadsheet change. They in turn will notify John Weigand, who will then update the flat file. Our HS06 ratings have not changed in the year's span covered by the data above. A comparison of those values to the set in the early April v.27 spreadsheet is attached.

Since then many of us have worked to get up to date measurements of the HS06 rating of our hardware. Values that were previously "Best Guess", or "Clock scaling", or "4.0 conversion factors" have been replaced, and the capacity summary is now largely consistent over our sites. Where sites ran their own measurements, those have been used as opposed to a measurement made at another site. All measurements made at all sites are now in the Twiki Summary maintained by USATLAS (or soon will be). I have reported to John Weigand that the v.27 table is now frozen and should be used for our April results.

The major unanswered question at this point concerns how to deal with opportunistic resources accessible at some of our sites. Some suggestions:

  • If the resource is small, don't bother to report it
  • Report only the fraction of the resource we expect to get as a fixed resource.
  • Make a quarterly update of the fraction based upon some understanding of what is used by ATLAS
  • Adjust the fraction monthly, prior to the end-of-month reporting, based upon some analysis of actual use during the month (see the worked sketch below).
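
As a worked illustration of the last option, the effective capacity to report could simply be the HS06-hours actually delivered to ATLAS divided by the hours in the month. The numbers below are invented:

    # Illustration of the monthly-adjustment option above; all numbers are invented.
    delivered_hs06_hours = 120000   # HS06-hours ATLAS actually used on the
                                    # opportunistic resource this month (from accounting)
    hours_in_month = 30 * 24        # wall-clock hours in the reporting month

    # Effective capacity to report for this month:
    effective_hs06 = delivered_hs06_hours / float(hours_in_month)
    print("Report %.0f HS06 for this month" % effective_hs06)   # about 167 HS06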

We need to have a common understanding of what we report in the GIP, and how that relates to the spreadsheet. Future adjustments to either the GIP or the spreadsheet should be propagated so that both agree as per this understanding.
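
A minimal sketch of the kind of GIP-versus-spreadsheet consistency check this implies; the site names, values, and tolerance are invented for illustration, and real values would come from the GIP-published sub-clusters and the v.27 capacity summary:

    # Sketch of a GIP-vs-spreadsheet HS06 consistency check (invented values).
    spreadsheet_hs06 = {"SITE_A": 52000, "SITE_B": 31000, "SITE_C": 18000}
    gip_hs06 = {"SITE_A": 52000, "SITE_B": 28500, "SITE_C": 18000}

    TOLERANCE = 0.05   # flag disagreements larger than 5%

    for site in sorted(set(spreadsheet_hs06) | set(gip_hs06)):
        s = spreadsheet_hs06.get(site)
        g = gip_hs06.get(site)
        if s is None or g is None:
            print("%s: present in only one source" % site)
        elif abs(s - g) / float(s) > TOLERANCE:
            print("%s: spreadsheet %d vs GIP %d -- needs re-certification" % (site, s, g))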


  • April figures can be corrected with WLCG.
  • WLCG reporting should involve dedicated resources; but we want to keep track of what is delivered.
  • OSG is using a site-average. A GOC ticket should be issued when resources are added/subtracted.
  • Make this a facility procedure: send a GOC ticket monthly with any changes to the facility capacity.
  • What about other (non-ATLAS) sites in OSG?
  • Good shape with HS values.

Facility storage deployment review

last meeting(s):
  • Tier 1: DONE
  • WT2: DONE
  • MWT2: 3.3 PB now online. 180 TB remaining to complete the 1 PB upgrade
  • NET2: DONE
  • SWT2_UTA: Equipment being installed. Network change has been postponed. Downtime to bring online being planned.
  • SWT2_OU: 120 TB installed. DONE
this meeting:
  • Tier 1: DONE
  • WT2: DONE
  • NET2: DONE
  • SWT2_OU: 120 TB installed. DONE
  • MWT2: 3.3 PB now online. 180 TB remaining to complete the 1 PB upgrade (5/6 DONE)
  • SWT2_UTA: Progressing; R620 motherboard NICs needed to be replaced. Working on configuration of the storage modules; 3660i options under study. Next questions are how to bring the storage into the cluster and whether a downtime is needed. There will be some additional network configuration to do (might be avoidable if an existing rack is re-used).
    • Michael: a large aggregation switch is needed in order to bring this storage into a usable state.
    • Rob: internal bottleneck if used with existing equipment?

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, MWT2: 2 of 3 sites DONE (MWT2_IU still needs action, see below).


  • NET2 - Holyoke to MANLAN now connected. Tier 2 subnets will need to be routed. The Holyoke move is postponed. Saul expects the June 1 milestone will be met -- either at Holyoke, or in place at BU. What about Harvard? Will bring this up with HU networking.
  • SWT2 - both sites are not on LHCONE presently. OU goes through ONENET (has a connection to AL2S?/the I2 system; rides layer 2 to MANLAN and peers there). For SWT2, LEARN - Kansas City may be the connection point. Dale Finkelson thought it could happen quickly. Patrick: have not discussed locally with UTA networking; concerned about how UTA traffic is handled. Shawn: I2 can bring up a peering in any of their PoPs, e.g. in Houston, if that is the best place to peer. Will talk with network managers.
  • SWT2_OU: still waiting to hear from I2 and ESnet. Zane Gray at OU is leading this.
  • Kaushik: meeting last week with campus networking to connect via LEARN; the process has started. Michael: who are the players on the LHCONE side - e.g. on the Internet2 side, who are the VRF providers? Notes this is a configuration issue to separate campus traffic from LHC traffic. Kaushik notes that his campus networking team is working the issue. Michael: need to connect UTA campus people to Mike O'Connor and Dale Finkelson.
  • Horst: OU - will have a meeting tomorrow with LHCONE operations to hook up OUONE.
  • Fred: still struggling with Brocade firmware issues causing checksum errors. Testing setup at Indianapolis; problem reproduced by Brocade. Fred will arrange a meeting with Brocade early next week.

this meeting

  • Updates?
  • OU: Horst believes it will be straightforward, will follow-up.
  • UTA: Patrick: waiting to hear back from UTA networking staff to see the status. Will check today.
  • NET2: waiting on Holyoke move.
  • IU: Brocade and IU engineers are still engaged in working on the problem.

Deprecating PandaMover in the US ATLAS Computing Facility (Kaushik)

last meeting
  • Kaushik: need a discussion with the DDM team since it will involve an increase in load, but will have an internal discussion first.
  • Michael - panda mover operations are opaque. We should join the mainstream. The current model works perfectly for all the other clouds. Little noise about these issues.
  • Kaushik: have not revisited its need in a long time.
  • Rob: suggests moving a Tier 2 site off pandamover, and watch the effect.
  • Kaushik - will get started.
  • Hiro wants to keep PandaMover for tape staging, as he feels it's more efficient. Michael - doesn't feel the benefit is worth keeping PandaMover. Notes tape staging happens rarely, only during specific campaigns.

this meeting

  • Michael and Kaushik discussion last week: expect significant Rucio activity shortly (~1 month). Concerns about adding DDM activity during this time.
  • Suggested to allow the situation to continue as is.
  • Run PandaMover until Rucio is fully deployed (~6-9 months). Revisit after this time.

The transition to SL6

last meeting
  • All sites - deploy by end of May
  • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
  • At MWT2 will use UIUC campus cluster nodes; will start on this tomorrow.
  • Need a twiki setup to capture details.
  • Doug - provide a link from the SIT page. Notes prun does compilation.
  • Expect all sites to participate and convert to SL6 as old clients will be disabled in June.

this meeting

  • This gets us most quickly to the latest dq2 client.
  • NET2: concerned about the timescale. Will have a problem meeting the deadline.
  • At BNL, the time required was about 4 hours.
  • WT2: have a new 3,000-node cluster coming up as RHEL6, which ATLAS may have access to. Timing issues for the shared resources. ATLAS-dedicated resources can be moved to SL6, though.
  • AGLT2: concern is ROCKS, which is glued to SL5. Simultaneous transition to Foreman and Puppet.
  • OU: transition from Platform to Puppet. OSCER cluster is RHEL6 (though there are job manager issues).
  • UTA: cannot do it before June 1. ROCKS is a constraint, and the team is heavily involved with storage and networking, limiting the available time.

Evolving the ATLAS worker node environment

last meeting

this meeting

Updates from the Tier 3 taskforce?

  • Michael: a survey has been circulated about the resources deployed, usage, LOCALGROUPDISK, etc. Published today.
  • Result of this will be a set of recommendations to the IB.

Transition from DOEGrids to DigiCert

last week
  • Michael: there have been issues with people applying for certs - taking too long. The process is being investigated - too many people involved, and probably not the right people, which takes too long. Decision to shift responsibility for processing requests from the ITD Help Desk to the Tier 1. Requests should take less than an hour.
  • Doug has volunteered US ATLAS analysis support to help facilitate the process

this week

  • Michael: the situation has significantly improved since we in-sourced the RA services (within US ATLAS).
  • Turnaround is vastly reduced.
  • Alden: went through the process; only got a fully working proxy last Monday. From the DAST coordinator's point of view, about half the users choose the wrong DigiCert options. Believes the documentation needs to be changed.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Running smooth
    • Pilot update from Paul
    • Large number of waiting jobs from the US - premature assignment?
    • Following up with PENN: email got no response. Paul Keener reported an auto-update to his storage; reverted back to the previous version (March 28). Transfers are now stable at the site; the ticket has been closed.
    • Discussion about site storage blacklisting. It's essentially an automatic blacklisting. Discussed using atlas-support-cloud-us@cern.ch. The problem is what to do with the Tier 3s. Doug will make sure the Tier 3 sites have correct email addresses in AGIS.
  • this meeting:

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Will need another USERDISK cleanup. Hiro will send an email.
    • PRODDISK needs to be cleaned at NET2 and MWT2.
    • The SRM values are reported incorrectly at SLAC and SWT2.
    • The SRM values for the SWT2 space tokens dropped, then came back, except for GROUPDISK. Notes this is transient behavior.
    • GROUPDISK loss at MWT2 - site report below.
    • NET2 deletion issue - it is still slow. There is also a reporting issue here. This is reported every day in Savannah. There is an active discussion. Saul believes it is functional after lowering the chunk size to 10 (other sites use 80-120). The lcg-del command used by Tomas sometimes drops connections - about 1 out of 10 from BU. Next step? Try duplicating the problem on a new machine. Dropped packets?
  • this meeting:

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  4/16: SLACXRD file transfer errors (SRM) - issue was quickly resolved, and https://ggus.eu/ws/ticket_info.php?ticket=93378 was closed early a.m. on 4/17.  eLog 43883.
    2)  4/18: SWT2_CPB - file transfers were failing with a security error ("credential with subject: /DC=org/DC=doegrids/OU=Services/CN=gk03.atlas-swt2.org has expired").  The expired host 
    certificates were updated early the next day.  https://ggus.eu/ws/ticket_info.php?ticket=93460 closed, eLog 43909.
    3)  4/18: NET2_DATADISK file transfer errors ("Error reading token data").  As of 4/23 no recent errors of this type, so closed https://ggus.eu/ws/ticket_info.php?ticket=93464.  Issue being 
    investigated, eLog 43962.
    4)  4/18: Pilot update from Paul (v57a).  Details here:
    Also, update v57b was released on 4/24 with an urgent fix for a problem that was preventing installation jobs from executing.
    5)  4/20: CERN - as of ~15:30 UTC widespread problem affecting many services.  Issue was due to a network problem in B513.  Problem reported fixed as of ~21:00 UTC.  More details in 
    https://ggus.eu/ws/ticket_info.php?ticket=93514, eLog 43925.  (https://ggus.eu/ws/ticket_info.php?ticket=93516 was incorrectly opened for MWT2 during this period.  Not a site issue, instead 
    related to the CERN outage. eLog 43927.)
    6) 4/23: BNL_ATLAS_RCF - https://savannah.cern.ch/bugs/index.php?101293 was opened due to jobs failing with "pilot: Job killed by signal 15: Signal handler has set job result to 
    FAILED, ec = 1201."  Known issue of opportunistic use of the cluster.  Ticket was closed, eLog 43955.
    Follow-ups from earlier reports:
    (i)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, eLog 43743.
    Update 4/24: site admin has requested a production role in order to be able to more effectively work on this issue.  

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting: not available this week
    1)  4/25: WISC - high rate of DDM deletion errors.  Site reported the issue was due to an expired host certificate on 4/29, and that a replacement had been requested.  New cert installed the next day, 
    https://ggus.eu/ws/ticket_info.php?ticket=93651 was closed.  eLog 44037.  (Duplicate ggus tickets 93652 & 93727 were also opened during this period.)
    2)  4/25: NET2 - file transfer failures with "Error reading token data: Connection closed."  As of 4/30 the errors went away, so https://ggus.eu/ws/ticket_info.php?ticket=93660 was closed.  eLog 44038.
    3)  4/25: NERSC_LOCALGROUPDISK transfer failures with checksum errors.  Update from Iwona on 5/1: Automatic yum update killed our "adler helpers" - daemons handling remote checksum 
    calculations. We improved monitoring to catch it faster and plan to patch ssh startup scripts. 
    Issue seems to be resolved - o.k. to close https://ggus.eu/ws/ticket_info.php?ticket=93666?  eLog 43997.
    4)  4/30: File transfers to SLACXRD (from TAIWAN-LCG2) were failing due to an expired host cert on the SLAC side.  https://ggus.eu/ws/ticket_info.php?ticket=93747 in progress, eLog 44043.
    5)  4/30: SMU file transfer failures ("[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]").  
    https://ggus.eu/ws/ticket_info.php?ticket=93748 in progress, eLog 44035.
    Follow-ups from earlier reports:
    (i)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, eLog 43743.
    Update 4/24: site admin has requested a production role in order to be able to more effectively work on this issue. 

DDM Operations (Hiro)

  • this meeting:

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Release by March, with 10G. Goal is to deploy across the facility by the end of March.
    • NET2 - CERN connectivity - has it been improved?
    • LHCONE connectivity for NET2 and SWT2 - timeline?
    • Prepare with discussions at NET2, even if the setup will come in with the move to Holyoke; get organized. Move won't happen before the end of March. The WAN networking at Holyoke is still not well defined. Start a conversation about bringing LHCONE.
    • rc2 for perfsonar; rc3 next week; sites should prepare to upgrade.
    • Network problems seen generally for westbound traffic.
  • this meeting:
    • UTA issue. Longstanding poor inbound performance; an ESnet ticket is open. Patrick and Sarah have been doing testing, observing checksum problems. Also seen at BNL - but those have cleared up (not correlated with a specific event). Hiro ran a load test - 60 MB/s. The inbound path seems to be taking the commodity path (Patrick is following up this afternoon). Outbound direct has been better, but there are now checksum problems. Hiro is updating the ESnet ticket with measurements.
    • Michael: we need to get our providers involved as soon as possible, and through official channels. Note - we are the customers of the service. Need to use the ticketing system.
    • Notes:
       Meeting Notes for NA Throughput Meeting --- April 30, 2013
      Attending: Shawn, Dave, Mark, Andy, Jason, Patrick, Philippe, Rob, Horst, Tom, John, Saul, Sarah
      Excused: Garhan
      1) Meeting agenda:  Changes, edits or additions?  None needed.
      2) Status of perfSONAR-PS Install in USATLAS
            i) Mesh configuration.  Sites not yet deployed with mesh configuration as of last call needing a status update (see notes from last call below):
                  a) NET2  -  Did Augustine complete this?  Status?  Saul isn't sure and will find out the status. 
                  b) WT2  - Please provide an update? Email Wei and Yee for status
                  c) BNL -  Was BNL updated to use the mesh?  (Using RC2 or RC3?)  Had to fall back to the previous version because of problems.  This is the only 1G host at BNL, so we will wait on retrying until the UTA issue is resolved or v3.3 is ready.
           ii) 10GE Bandwidth instance deployed?  (If not, when?  See last notes below for status at the end of November)
                  a) SWT2:  Both OU and UTA needed downtimes.  OU was scheduled April 8 - OU done?  UTA timeline?  SWT2_OU had its network upgraded.  Installed RC3 on the 10G host for testing.  Once v3.3 is out Horst will rename it as production.  SWT2_UTA is waiting on replacement 10G NICs, hopefully this week.
                  b) WT2:  Have 3 10G perfSONAR-PS hosts but not allowed to get results  of tests until security approves.  Any update on approval status?  Email Wei and Yee for status
      3) perfSONAR-PS issues AND testing v3.3 RC3
          i) New issues noted?  None reported. 
         ii) Toolkit update status: RC3 info (Andy).   Andy reported that a number of new issues were noted.  RC4 will likely happen soon.  Then 1-2 more weeks before a final version.  It is important that this release be as reliable and robust as possible.  Horst asked about his problems getting services to start.  Logs were sent to Aaron to determine if this is specific to the OU host or an RC3 issue.
         iii) Modular dashboard news and plans / GitHub status  (Tom, Ian, ?)   GUI updated with CRUD capabilities.   Tom will go back to datastore work.   Mesh-configuration reading or traceroute matrix creation.  Action item: Shawn will email Tom info on the mesh-configuration from Aaron.  
         iv) NTP clock issue at MWT2_UC? (from last meeting notes).  Dave found no problem, but LHCONE-related load may be causing issues.
         v) Status of asymmetric BWCTL tests at Nebraska? (Garhan) Garhan will update via email or at next meeting.
        vi) Test of "Update" from v3.2.2/SL5 to v3.3 RC3/SL6?  (Garhan).  Will be trying this during the next week 
      4) Throughput 
          i)  New issues to track? None.  
         ii)  UTA: Discussion on the current status of the inbound poor throughput problem.  Please see the perfSONAR-PS results on the old dashboard at http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=USATLAS   Also we have some 1G testing from http://psum06.aglt2.org/toolkit and http://psum05.aglt2.org/toolkit running since yesterday.  Discussion of the PDF file Shawn sent (attached).  Patrick mentioned checksum errors on "outbound" files to BNL (10%?).  No problems to AGLT2, OU, UC, NET2 (lcg-cp or srmcp was used; 200 tests in a row) but did see it on IU (gridftp destination); IU was very bad though.  Now testing with IU using wget/http, swapping a file back and forth.  Sarah reported results: UC had 0 errors and IU had 1 in 1000, which had a byte-swap problem.  Sarah asked if the first two bytes of the ADLER32 checksum had a problem while the last two are OK; Patrick: in general that was the case for the BNL transfer (see the ADLER32 sketch after these notes).  IU->UTA is slower than UTA->IU.  Testing is continuing.  Action item: Patrick will contact LEARN/UTA networking to determine why inbound traffic seems to be coming via the commodity/GPN connection.  Patrick will try to get inbound routing fixed and then repeat the tests.
         iii)  New developments?
         iv) Monitoring (Hiro?)    Hiro will report on this at the next meeting. (Ran out of time)
      5) Site Round-table and reports (any/all of USCMS, ATLAS Canada or US ATLAS)
         (Ran out of time)
      6) AOB and next meeting
          Two weeks from today is the ATLAS TIM in Tokyo.  We have enough to discuss that we will change the schedule to meet NEXT week, May 7th.  Shawn will be at the ANSE meeting at Caltech but will try to run the meeting from there.
      Please send along any additions or corrections via email.  Thanks,
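
    • Aside on the ADLER32 observation in the notes above: the low 16 bits of an ADLER32 checksum are a position-independent byte sum, which swapping two bytes does not change, while the high 16 bits weight each byte by its position. So "first two bytes of the checksum wrong, last two OK" (with the checksum written in the usual high-then-low hex order) is exactly the signature of swapped or reordered bytes. A minimal demonstration with Python's zlib follows; the data is arbitrary, not from the actual transfers:

      # Demonstrates why a byte swap corrupts only the high half of an ADLER32
      # checksum: the low 16 bits are a position-independent byte sum.
      import zlib

      good = bytearray(b"example payload for an ADLER32 byte-swap test")
      bad = bytearray(good)
      bad[10], bad[11] = bad[11], bad[10]   # swap two adjacent (unequal) bytes

      for label, data in (("good", good), ("swapped", bad)):
          cksum = zlib.adler32(bytes(data)) & 0xFFFFFFFF
          print("%-8s adler32 = %08x  (high = %04x, low = %04x)"
                % (label, cksum, cksum >> 16, cksum & 0xFFFF))
      # The 'low' halves match; only the 'high' halves differ, matching the
      # pattern reported for the UTA/IU transfers.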

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei - Gerri packed the VOMS security module into an RPM; it works at SLAC and with DPM sites. Once Gerri has a permanent location for it, he will create a set of detailed instructions.
  • Ilija has requested a central git repo at CERN for FAX. Can the WLCG task force provide a point of coordination?
  • Ilija - doing FDR testing. Also testing using the Slim Skim Service. Seeing nice results for file transfers to UC3 - now seeing 4 GB/s. How is the collector holding up?
this week

Site news and issues (all sites)

  • T1:
    • last meeting(s):
    • this meeting: looking at next-gen network switches; equipment for the eval - Cisco, Extreme, Arista; discovered issues with bandwidth per stream (e.g. not exceeding 10G). Will get a unit in the June/July timeframe. CPU procurement: Dell won the bid (over HP, Oracle, IBM). Two Sandy Bridge CPUs, 2.3 GHz, 64 GB memory, 8x500 GB = 4 TB ($5k). Tested extensively.

  • AGLT2:
    • last meeting(s): Working on moving dCache pool servers to SL6.3; most converted. Moving VMs. Shawn working w/ Patrick and Sarah on PRODDISK cleanup. Then to get the CCC mechanism working again.
    • this meeting: 40G connection between T3 and T2 - working well. Using LR 40 Gbps optics from Color Chip. S4810.

  • NET2:
    • last meeting(s): Had a spike in lost heartbeat jobs due to a bug in the gatekeeper SEG; OSG is working on a patch. Still have the slow deletion issue.
    • this week: Move begins Monday, May 6.

  • MWT2:
    • last meeting(s): Upgrades to UC network - reconfigured with bonded 40G ports between Cisco and Dell stacks. 2x10G bonded for some s-nodes. IU reconfiguration for jumbo frames. New compute nodes at UIUC - 14 R420s, getting built with SL5. Also adding more disk for DDN, but there are continued GPFS issues, working closely with campus cluster admins. GROUPDISK data loss - CCC was reporting a large amount of dark data incorrectly. Recovering what we can from other sites, and notifying users, and modifying procedures so it doesn't happen again.
    • this meeting: Chicago: all storage online except for one server; the H810 controller on one was replaced. IU networking. UIUC - 16 new Dell machines now up, C8220s with sleds; 512 additional job slots, EL6, osg-wn-client 3.1. BIOS settings fixed to performance mode; HS06 benchmarks in line with AGLT2.

  • SWT2 (UTA):
    • last meeting(s): A couple of storage issues possibly creating the dips in the accounting plots. New version of panda mover in the git repo that uses python 2.6. Use with caution. Will be busy getting new storage online.
    • this meeting:

  • SWT2 (OU):
    • last meeting(s): Downtime next week. 10g perfsonar node as well.
    • this meeting:

  • WT2:
    • last meeting(s): Dell switch issue - minor. Two channel-bonded 10G uplinks sometimes have trouble; possibly a problem with the environment.
    • this meeting: No burning issues; the future batch system will be LSF 9.1. Testing.



-- RobertGardner - 01 May 2013

  • HS06_Compare_Apr2013.pdf: Comparison of HS06 values in early April in the v.27 capacity summary, highlighting the inconsistencies

pptx OASIS_Demo.pptx (101.1K) | RobertGardner, 30 Apr 2013 - 21:18 |
pdf WLCG_vs_early_v27_HS06.pdf (52.6K) | RobertBall, 01 May 2013 - 11:36 | WLCG HS06 values vs early v.27 spreadsheet HS06 values for USATLAS T1 and T2 sites
pdf HS06_Compare_Apr2013.pdf (52.7K) | RobertBall, 01 May 2013 - 11:37 | Comparison of HS06 values in early April in the v.27 capacity summary, highlighting the inconsistencies