
MinutesDec122012

Introduction

Minutes of the Facilities Integration Program meeting, December 12, 2012
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode: 4519
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”

Attending

  • Meeting attendees: Wei, Rob, Bob, Saul, Mark, Armen, Patrick, James Koll (MSU), Sarah, John
  • Apologies: Jason, Kaushik
  • Guests:

Integration program update (Rob, Michael)

Disk procurement

last meeting:
  • MWT2 - have received all of the R720 servers, MD1200 arrays, and the 8024F switch. Beginning installation this week.
  • NET2 - awaiting 720 TB delivery to BU from Dell.
  • AGLT2 - ordered and received disk storage - new MD3260s; configuring dynamic disk pools (RAID6 equivalent); understanding the overheads.
  • SLAC - done.
  • SWT2_UTA - waiting on final quote from Dell. Ordering MD3660i (2 10G ports on each controller). ~ 1PB. (n.b. about 0.5 PB free now, so no crunch)
  • Tier 1 - 2.6 PB arrived; will be starting deployment. Mid-December, maybe earlier.
this meeting:
  • Updates:
  • AGLT2: Testing best capacity and performance of the MD3260. Decided not to use dynamic disk pools; they take too much of the storage. Choosing 20-disk RAID6 instead (see the capacity comparison after this list). Getting set to bring the storage online by the end of the month, at least some by next week. https://www.aglt2.org/wiki/bin/view/AGLT2/Md3260Benchmark
  • MWT2: 1 PB of storage racked and in the process of being powered on.
  • NET2: Funding didn't arrive until quite late; placed an order for 720 TB of MD3260s, now starting to arrive. ETA: first week of January. Got a very good price. Will be moving to Holyoke at the end of March.
  • SWT2_UTA: Sent in POs. No delivery date yet. 1.3 PB.
  • SWT2_OU:
  • Tier 1:
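
A rough capacity comparison behind the AGLT2 choice of 20-disk RAID6 over dynamic disk pools, as a short Python sketch. All numbers are illustrative assumptions, not AGLT2 measurements: 3 TB drives, and a pool that stripes 8 data + 2 parity pieces while reserving roughly two drives' worth of spare capacity.

    # Illustrative arithmetic only (assumed parameters, not measured AGLT2 numbers).
    drives, drive_tb = 20, 3.0                    # one 20-disk RAID group, 3 TB drives (assumed)

    raid6_usable = (drives - 2) * drive_tb        # RAID6 keeps all but 2 drives for data
    # Dynamic disk pool: assume 8 data + 2 parity striping plus ~2 drives of reserved spare space
    ddp_usable = drives * drive_tb * (8 / 10) - 2 * drive_tb

    total = drives * drive_tb
    print("20-disk RAID6 usable: %.0f TB (%.0f%%)" % (raid6_usable, 100 * raid6_usable / total))
    print("Dynamic pool usable : %.0f TB (%.0f%%)" % (ddp_usable, 100 * ddp_usable / total))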

Cloud SE endpoint (Doug)

last meeting:
  • We are setting up analysis clusters in the cloud, predominantly Panda queues.
  • Amazon EC2 - BNL 30K credit. FutureGrid. Sergey's Google CE project. Will use BNL storage elements.
  • D3PD production is the workflow - hampered by the platform.
  • Jose is working on cloud resource provisioning.
  • APF and SE support needed from BNL.
  • Need to get next-gen D3PDs transferred.

this meeting:

  • Updates, if any.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • reprocessing is still ongoing (started period H a couple of days ago), though mostly done.
    • sites have heavy IO load from merge jobs. Keep an eye on storage and networking.
    • PRODDISK may need more space for the heavy IO tasks.
    • sites have been mostly full of jobs for the past month (occasional drops on Mondays, as usual).
    • keep eye also on DATADISK.
  • this meeting:
    • Reprocessing is virtually done, a success, so expect smaller loads
    • Production will continue through the holidays
    • US facility performance has been very good, low number of issues
    • Mark suggests that when DDM emails from Hiro arrive, if the issue is more than transient, good idea to cc the shift list (atlas-project-adc-operations-shifts@cern.ch).

Multicore configuration

last meeting
  • Will at BNL is close to a solution for dynamically partitioning resources to provide multicore (MC) and high-memory slots in Condor
  • Hope to have a solution by end of December.

this meeting:

  • Updates, if any.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=220806
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-12_3_2012.html
    
    1)  11/30: AGLT2 - file transfer failures ("[GRIDFTP_ERROR] globus_ftp_client: the operation was aborted").  From Shawn: Two of our 5 dCache 
    doors had been having memory issues. Memory was added but the updated dCache door configuration wasn't put in place. This has been done 
    now and the problematic doors have been restarted.  https://ggus.eu/ws/ticket_info.php?ticket=89099 closed on 12/2.  eLog 41380.
    2)  12/4: Problematic dCache library created problems for analysis sites - from Torre: A problem has recurred in a corrupt dCache library being 
    disseminated by sw installation which results in ANALY jobs failing for all sites using dCache.  Experts and DAST list informed.  eLog 41465.
    3)  12/5 early a.m.: MWT2 - job failures due to stage-in (checksum) errors - https://ggus.eu/ws/ticket_info.php?ticket=89205 - eLog 41490.  Site 
    off-lined the WN's affected by the adler32 problem - issue under investigation.  eLog 41490.
    
    Follow-ups from earlier reports:
    
    (i)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/24: Continue to see deletion errors - ggus 85951 re-opened.  eLog 39571.
    Update 10/9: site admins working on a solution - coming soon.  Closed ggus 85951 - issue can be tracked in ggus 84189. 
    Update 10/17: ggus 87512 opened for this issue - linked to ggus 84189.
    Update 10/31: BeStMan upgrade may resolve the issue of deletion errors.  ggus 81489 closed.  Any remaining problem will be tracked in 
    https://ggus.eu/ws/ticket_info.php?ticket=87784.
    Update 11/29: number of deletion errors reduced over the past few days - decided to close ggus 87784.
    (ii)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken 
    the token off-line to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  
    https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-12_10_2012.html
    
    1)  12/6: AGLT2 - from Bob: From 9:05am to 9:50am EST today, the NFS server hosting all OSG home directories, etc, froze up.  Consequently 
    we have lost some 2000 jobs running here.
    2)  12/8: MWT2 - file transfer failures - Rob reported this was due to a networking problem at IU - experts investigating.  Update 12/10: issue possibly 
    resolved (routing to a particular dCache pool node was fixed).
    3)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    
    Follow-ups from earlier reports:
    
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token 
    off-line to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 
    (Savannah site exclusion ticket), eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/5 early a.m.: MWT2 - job failures due to stage-in (checksum) errors - https://ggus.eu/ws/ticket_info.php?ticket=89205 - eLog 41490.  Site off-lined 
    the WNs affected by the adler32 problem - issue under investigation (a short Python illustration follows this summary).  eLog 41490.
    Update 12/6: extensive checking / debugging performed - still some additional checks to be made (see details in the ggus ticket).  No recent errors, so 
    ggus 89205 was closed.
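
    A quick illustration in Python of why byte swapping shows up as an adler32 mismatch at stage-in (a sketch for illustration only; the 4-byte swap unit is an assumption, since the actual corruption pattern was still being narrowed down):

        import zlib

        def adler32_hex(data):
            """Return the adler32 checksum as 8 hex digits (unsigned)."""
            return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")

        def swap_words(data, width=4):
            """Reverse the byte order inside each 'width'-byte word (illustrative corruption)."""
            return b"".join(data[i:i + width][::-1] for i in range(0, len(data), width))

        original = bytes(range(256)) * 1024          # stand-in for a staged-in file
        corrupted = swap_words(original)

        print("original:", adler32_hex(original))
        print("swapped :", adler32_hex(corrupted))   # differs, hence the stage-in checksum error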
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • See meeting notes from yesterday's meeting.
    • A number of sites still have not put 10G bandwidth instances in place. Mesh configuration status:
      • Mesh - done: UM, MWT2; MSU & BNL waiting on the new version of the CD
      • NET2 - Mesh ? 10G:
      • SWT2 - Mesh ? 10G: OU waiting on external network configuration; waiting on install of 10G switch. Estimate. UTA - waiting on network update. (Mark - meeting with campus networking tomorrow morning. Will require fiber route, will happen in the next week or two; will also discuss joining LHCONE.)
      • WT2 - Mesh ?
    • Simone has been pushing, as part of WLCG operations, to get perfSONAR deployed everywhere; Shawn suggests using Hiro's load tests. There is an OSG OIM test instance that supports perfSONAR (which then goes on to WLCG).
    • Yesterday's notes:
      NOTES for November 27th NA Throughput Meeting
      =============================================
      
      Attending:  Shawn, John, Marek, Lucy, Rob, Dave, Ryan, Horst, Azher, Hiro,
      Excused: Tom, Jason, Andy, Philippe
      
      AGENDA:
      
      1) Agenda review and update.   None 
      
      2) Status of perfSONAR-PS Install in USATLAS
      
           i) "Mesh" configuration deployed?  (If  not, when?)
                     AGLT2:  UM done, MSU (this week)
                     MWT2:   All done (Thanks Dave)
                     NET2:    No report:  Can Saul or John provide an update?
                     SWT2:   OU done, UTA no report: Can Mark or Patrick provide an update?
                     WT2:     No report:  Can Wei or Yee provide an update?
                      BNL:    CD installs prevent this.  Netboot from USB isn't working (sda vs sdb issue?).   Once v3.3 is out BNL can utilize mesh configuration.
      
           ii) 10GE Bandwidth instance deployed?  (If not, when?)
                     AGLT2:  All done
                     MWT2:  All done
                     NET2:   Not done yet:  Can Saul or John provide an update?
                     SWT2:  Still waiting on final site changes for Lustre at OU (will need to coordinate 10GE PS change then).   UTA: Needs 10GE network ports: Can Mark or Patrick provide an update?
                     WT2:    No report:  Can Wei or Yee provide an update?
                     BNL:    All done.
      
      Rob has a question about the old Koi boxes.  Some boxes causing lots of warnings.  Shawn described the intent to use these boxes as a shadow test infrastructure at the same scale as the production instances.  However any site having a problem keeping such nodes running should feel free to take them out of service.   We hope to keep enough testing infrastructure in place to test new "beta" versions of software as they are released (like upcoming V3.3 of perfSONAR-PS).
      
      3) perfSONAR-PS Topics
      
    i) New issues noted?  Dave reported that the Illinois throughput box has been unable to get bi-directional testing to BNL's since about September 11 or 12th.  The LHCMON to MWT2_Illinois direction works but the other direction does not.  *ACTION ITEM*: John will check the IPs on the BNL systems and work with Dave on initial debugging.  May have to involve Jason or others to find the root cause.  Could be a corrupted configuration on LHCMON?
      
         ii) Toolkit update status:   Andy Lake provided an update via email:   
      
      "Hi Shawn, 
      I'm not sure much of this is new information, but we're hoping to have an early beta before the holiday. It likely won't have the full-set of features that will be in the final release, but should allow for people to start testing the CentOS 6 changes at a minimum. A few highlights we expect:
      - I'm not sure if we will have all combinations of NetInstall, LiveCD, LiveUSB, 32-bit, and 64-bit by the holiday but likely will have some subset of those. We are currently working on upgrade scripts for those as well, so hopefully the CentOS 5 -> 6 transition will be as painless as possible.
      - The plan is still to add an updated Lookup Service to the toolkit, but likely this won't be ready for the December beta. We want to make sure we have all the backward compatibility worked out and we have the best long-term path forward.
      - There will be a traceroute GUI, likely the version shared by the University of Wisconsin. 
      - Aaron's mesh-config agent will be included on the toolkit by default
      
      Those are the big items I can think of in terms of features. A more complete list of the bugs we are targeting is here: http://code.google.com/p/perfsonar-ps/issues/list?can=2&q=Milestone%3DRelease3.3
      
      Thanks,
      Andy"
      
         iii) Modular dashboard news:   Code will be moved to GitHub "soon".   Need to arrange with Tom Wlodek on how to best do this.  Andy,  Tom and Shawn will setup the project once the code is transferred to an OSG repository as an intermediate step.                     
      
      4) Throughput 
      
    i)  New issues to track?   SWT2_UTA has slow inbound transfers (FTS is backlogged).   Hiro mentioned that the FTS monitor shows some very slow transfers from many Tier-1s to US Tier-2s.  Hiro sent a link showing the issue.  Would be nice to identify the cause.  We will use Hiro's transfers and perfSONAR-PS to see what we can find.
      
         ii)  New developments:  No update
      
   iii) Monitoring:   WLCG operations summary from Shawn describing plans to get perfSONAR-PS instances deployed WLCG-wide and suitably registered in OIM or GOCDB.  Hiro mentioned issues with transfers from Tier-1s to MWT2.  Rob looked at active transfers; no current "smoking gun".  Also discussed the checksum issue previously seen inbound to MWT2, and its possible source.  The main hint seems to be that these files are all *large* (~> 8 GB?).  Could be related to the 'csm' policy setup on MWT2 pool nodes.  Need to check the /dcache*/pool/setup files to see what the 'csm' policy is and whether it is consistent across nodes (a small check sketch follows these notes).
      
      5) Site Round-table and reports
      
    i) USCMS:   Lucy and Marek reported on deployment: working on establishing a CMS European Tier-2 testing cloud - getting Tier-2s testing to the Tier-1 to verify configuration.   Marek is providing Simone with a list of CMS Tier-2 sites.   Lucy also reported on her GUI work: working on uploading configuration info (for example from the mesh-config).  Lucy will rewrite the code using Struts.  Lucy has an issue getting Tomcat to run the test dashboard code; it seems to be an environment issue.
      
         ii) Canada
      
         iii) Sites
      
      6) AOB and next meeting  -  Out of time.  The next meeting will *not* be on December 11 since that is both a CMS and an ATLAS meeting week.   Look for email announcing the next meeting; tentatively set for Tuesday, December 18th.
      
      Send along any additions or corrections to these notes to the mailing list.  Thanks,
      
      Shawn   
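
    • The pool setup check mentioned in item 4.iii above could be scripted along the following lines (a Python sketch only; the /dcache*/pool/setup glob and the "csm set" line prefix are assumptions to adapt to the local dCache layout). Run it on each pool node, for example over ssh, and compare the output across nodes:

      import glob

      def csm_settings(path):
          """Return the 'csm set ...' lines recorded in one pool setup file."""
          with open(path) as f:
              return tuple(line.strip() for line in f if line.strip().startswith("csm set"))

      pools = {p: csm_settings(p) for p in sorted(glob.glob("/dcache*/pool/setup"))}

      # Group pools by their settings so any inconsistent pool stands out immediately.
      groups = {}
      for path, settings in pools.items():
          groups.setdefault(settings, []).append(path)

      for settings, paths in groups.items():
          print("%d pool(s) with settings:" % len(paths))
          for line in settings or ("<no csm lines found>",):
              print("    " + line)
          for p in paths:
              print("        " + p)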

  • this meeting:

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • A beta version of the f-stream-enabled UCSD collector is available. It needed changes in the monitoring protocol, which are only in xrootd 3.3.0-rc2. Ilija will change the dCache monitor to adhere to the new protocol.
  • Today moving MWT2 to dCache 1.9.12-23.
  • Work ongoing on adopting AGIS for tests.
  • Hiro produced site-specific datasets for all the sites.
  • New dashboard monitor is ready.
  • Three Italian sites added (Rome1, Frascati, Napoli). Working on their monitoring issues.
  • Next week is the PAT tutorial. Will have to recheck that everything works. We should consider this a small full-dress rehearsal.
  • CERN and AGLT2 have a problem with the limited proxy access.
  • Doug will subscribe D3PDs to sites in the US.
this week

Site news and issues (all sites)

  • T1:
    • last meeting(s): 2.6 PB soon. Opportunistic resources at BNL - 2000 ATLAS jobs running on nuclear-physics resources now (with eviction). Amazon cloud resources will be used for scalability testing. WN benchmarking report sent out, on Sandy Bridge (production hardware from Dell); performance was much better than with the pre-production machines - 30% better than 2.8 GHz Westmere (working on pricing). Required to provide a facility security plan to NSF. Do sites have an institutional security plan? If so, share it with Michael. Hans and Borut to participate in turning the HLT farm into a cloud resource.
    • this meeting:

  • AGLT2:
    • last meeting(s): Have received all storage at UM and MSU; configuring. Online by the middle of December. January 3 will be an outage for MiLR switch upgrades to 100G. A second outage December 17 to test MSU systems with new personnel. Will start on SL6.
    • this meeting: Things have been working well; the focus has been the MD3260. Have gotten a lot of SRM watch errors, which might be correlated with a user asking for too many jobs. HammerCloud has the analysis queue back online. Had to suppress the number of running jobs.

  • NET2:
    • last meeting(s): BU is switching over to SGE - will be sending test jobs shortly. An issue with release validation at HU.
    • this meeting: Problems this morning - spikes in gatekeeper load and DDM errors. A large number of dq2-gets may be the cause; investigating. Panda queues on the BU side are nearly drained. New storage is starting to arrive. End-of-March move. HU running fine. The BU-HU network link may need maintenance.

  • MWT2:
    • last meeting(s): Investigations of poor internal network performance at IU continue: switch firmware updated today. Increased memory and Java heap size (doubling both) on the SRM door. Investigating DDM checksum failures.
    • this meeting: The big issue was a number of jobs failing at stage-in. The adler32 checksum mismatch indicates byte swapping (illustrated under Shift Operations above). Narrowing down the source: the LHCONE configuration at IU, or the UC-IU physical path - an unintended change in the path happened. Also finding poor performance on the IU internal network, and some checksum failures on DDM transfers.

  • SWT2 (UTA):
    • last meeting(s): Things are running fine, working on storage
    • this meeting: There was a problem with Maui on the CPB cluster; it was drained and fixed. Seeing asymmetric network throughput to/from BNL; getting network staff to track it down.

  • SWT2 (OU):
    • last meeting(s): Disk order has gone out. Horst is getting close to having a clean-up script.
    • this meeting:

  • WT2:
    • last meeting(s): Storage is online. SLAC Tier 3 - 20 R510's with 3 TB drives --> 500 TB
    • this meeting: all

AOB

last meeting
  • Alden reports that validated releases no longer publish into the BDII; Patrick will do a test of removing the grid3-locations.txt file, to see that nothing breaks. Alden will send a formal announcement to the usatlas-grid-l list when it is finalized.
this meeting
  • January 9, 2013.


-- RobertGardner - 11 Dec 2012
