
MinutesNov282012

Introduction

Minutes of the Facilities Integration Program meeting, November 28, 2012
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • USA Toll-Free: 888-273-3658
    • USA Caller Paid/International Toll: 213-270-2124
    • ACCESS CODE: 3444755
    • HOST PASSWORD: 6081

Attending

  • Meeting attendees: Armen, Mark, Alden, Patrick, Kaushik, Horst, Hiro, Doug, Ilija, Sarah, Michael, Dave, Rob
  • Apologies: Jason, Saul
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Facilities capacity spreadsheet updates - see the associated Google Docs link, shared via (our private) usatlas-t2-l@lists.
      • Santa Cruz: http://indico.cern.ch/conferenceDisplay.py?confId=201788
      • Coming changes to Panda queue configurations for sites via AGIS: https://indico.cern.ch/getFile.py/access?contribId=9&resId=0&materialId=slides&confId=213765. For now, proceed as before. Alden - there will be a new field to associate queues with CEs.
      • OSG and Pacman - request from Dan Fraser to consider a phase-out date for Pacman-based OSG CEs; the current timeframe under consideration is April-July 2013, in advance of the transition of OSG software to SHA-2 and the new DigiCert CA.
      • Disk round-up (see below).
      • The end of the favorable Dell program.
      • Configuring multicore and high-memory queues currently requires partitioning resources manually. Should we form a well-focused working group to implement this in a more dynamic fashion? Condor has some capabilities here, but the developers will need to correct a few things. Wei has also looked into this, including dynamic scheduling and possibly virtualization. Tom would like to get folks involved.
      • PandaMover ... will set up a committee - take to the data management meeting agenda.
    • this week
      • Tier 1/2/3 Jamboree December 10-11, 2012: agenda here
      • Analysis functional testing for the virtualized cloud queue at MWT2 (cf. Lincoln's presentation @ UCSC, here)
      • For FY13Q1, IntegrationPhase23, SiteCertificationP23, CapacitySummary
        • Storage expansion and networking activities are major areas.
      • First DASPOS (data preservation - NSF PIF) project meeting this Friday at Notre Dame.
      • Reprocessing campaign underway. 2B events have been reprocessed; overall it went quite smoothly. Will produce an AOD sample of ~0.5 PB. Proton running officially ends December 17.
      • Computing RRB scrutiny group has reviewed requirements.

Disk procurement

last meeting:
  • MWT2 - Received 36 MD1200s; waiting on R720. November timeframe likely.
  • AGLT2 - Started to receive equipment; will start racking and testing in the next two weeks. November 9 at MSU. 720 TB raw at each site.
  • NET2 - BU timeline
  • WT2 - In service 1.3 PB usable! Will update spreadsheet.
  • UTA - Still working on getting order in. SWT2 planning meeting this week - will hammer out details. November? End of year please! $ has not arrived.
  • T1 - 2.6 PB on order. Expect arrival in about 3 weeks.

this meeting:

  • MWT2 - have received all R720, MD1200 and 8024F switch. Beginning installation this week.
  • NET2 - awaiting 720 TB delivery to BU from Dell.
  • AGLT2 - ordered and received disk storage - new MD3260s - configuring dynamic disk pools (RAID-6 equivalent); understanding overheads.
  • SLAC - done.
  • SWT2_UTA - waiting on final quote from Dell. Ordering MD3660i (2 10G ports on each controller). ~ 1PB. (n.b. about 0.5 PB free now, so no crunch)
  • Tier 1 - 2.6 PB arrived; will be starting deployment. Mid-December, maybe earlier.

Cloud SE endpoint (Doug)

  • We are setting up analysis clusters in the cloud, predominantly Panda queues.
  • Amazon EC2 - BNL 30K credit. FutureGrid. Sergey's Google CE project. Will use BNL storage elements.
  • D3PD production is the workflow - hampered by the platform.
  • Jose is working on cloud resource provisioning.
  • APF and SE support needed from BNL.
  • Need to get the next-gen D3PD transferred.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
  • this meeting:
    • reprocessing is still ongoing (started period H a couple of days ago), though mostly done.
    • sites have heavy IO load from merge jobs. Keep an eye on storage and networking.
    • PRODDISK may need more space for the heavy IO tasks.
    • sites have been mostly full of jobs over the past month (occasional drops on Mondays, as usual).
    • keep an eye on DATADISK as well.

Multicore configuration

  • Will at BNL is close to a solution for dynamically partitioning resources so that multicore (MC) and high-memory slots can be requested in Condor.
  • Hope to have a solution by end of December.
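  • For reference, the kind of dynamic partitioning being discussed can be done with Condor partitionable slots. The fragment below is only an illustrative sketch; the specific values and the job-side requests are hypothetical, not the BNL configuration:

        # condor_config fragment on a worker node (illustrative sketch only)
        NUM_SLOTS = 1
        NUM_SLOTS_TYPE_1 = 1
        SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
        SLOT_TYPE_1_PARTITIONABLE = TRUE

        # A multicore or high-memory job then carves out what it needs in its
        # submit file, e.g. (hypothetical values):
        #   request_cpus   = 8
        #   request_memory = 16000

    With a configuration like this, the Condor negotiator splits the single partitionable slot into dynamic slots sized to each job's request, rather than relying on manually partitioned static slots.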

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • All is well
    • USERDISK clean-up soon (for Hiro)
    • Have started to clean up USERDISK at sites directly.
    • LOCALGROUPDISK clean-up by users; Kaushik will ask Mikhail to provide a link
  • this meeting:
    • PRODDISK additions have been required.
    • DATADISK at BNL
    • NET2 - Still have a high rate of deletion errors; Armen believes it is a hardware issue. John: the BU admins are upgrading the BeStMan hardware. (n.b. the rate is well below 1 Hz; by comparison, others are at ~4 Hz)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=218461
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-11_19_2012.html
    
    1)  11/16: MWT2 - jobs failing with stage-in errors (checksum mismatch).  https://savannah.cern.ch/bugs/index.php?98885, eLog 40957.
    2)  11/17 early a.m.: MWT2 - jobs were failing with a different stage-in error ("lsm-get failed (201): ERROR 201 Copy command failed").  Shifter checked some of 
    these files ~ten hours later, and all of them were visible with lcg-ls and could be copied.  So possibly this was just a transient issue.  Failed jobs should complete 
    when re-run.  eLog 40976.
    3)  11/19 - 11/21: AGLT2 - user reported problems with analysis jobs running at the site.  From Shawn: I believe we have resolved the dCache problem at AGLT2.  
    We had over 12000 dCache movers backed up at our site from the issue.   These have been fixed by a restart of the pools and doors and we are now seeing the 
    efficiency of the jobs rise again.  We will continue to watch things but hopefully AGLT2 should be providing efficient access for input files from dCache again.  
    Issue followed in a DAST e-mail thread.
    
    Follow-ups from earlier reports:
    
    (i)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/24: Continue to see deletion errors - ggus 85951 re-opened.  eLog 39571.
    Update 10/9: site admins working on a solution - coming soon.  Closed ggus 85951 - issue can be tracked in ggus 84189. 
    Update 10/17: ggus 87512 opened for this issue - linked to ggus 84189.
    Update 10/31: BeStMan upgrade may resolve the issue of deletion errors.  ggus 81489 closed.  Any remaining problem will be tracked in 
    https://ggus.eu/ws/ticket_info.php?ticket=87784.
    (ii)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line 
    to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 
    (Savannah site exclusion ticket), eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (iii)  11/8: WISC - file transfer failures ("TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an 
    error 500 500-Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914: 500-open() fail 500 End").  https://ggus.eu/ws/ticket_info.php?ticket=88302 
    in progress, eLog 40828.  https://savannah.cern.ch/support/index.php?133698 (site exclusion).
    Update 11/20: solved (but no explanation provided).  ggus 88302 closed, eLog 41080.  Savannah ticket left open pending results of tests to confirm fix.
    (iv)  11/14 early a.m.: SLACXRD file transfer failures with SRM errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  
    https://ggus.eu/ws/ticket_info.php?ticket=88473 - eLog 40900.
    Update 11/15: No recent errors of the type reported in the ticket - ggus 88473 closed.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=218463
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-11_26_2012.html
    
    1)  11/25: AFS problem at CERN - affected several services (including https://atlas.web.cern.ch/).  Issue resolved as of early a.m. 11/26. 
    https://ggus.eu/ws/ticket_info.php?ticket=88856 .
    2)  11/28 a.m.: BNL_ATLAS_RCF - https://ggus.eu/ws/ticket_info.php?ticket=88975 due to jobs failing with "lost heartbeat" errors.  Same issue as 
    https://ggus.eu/ws/ticket_info.php?ticket=87593 - this site provides opportunistic cycles, and occasionally jobs will be evicted.  ggus 88975 closed.  
    (Need to make this information known to shifters to avoid more tickets.)  eLog 41320.
    
    Follow-ups from earlier reports:
    (i)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/24: Continue to see deletion errors - ggus 85951 re-opened.  eLog 39571.
    Update 10/9: site admins working on a solution - coming soon.  Closed ggus 85951 - issue can be tracked in ggus 84189. 
    Update 10/17: ggus 87512 opened for this issue - linked to ggus 84189.
    Update 10/31: BeStMan upgrade may resolve the issue of deletion errors.  ggus 81489 closed.  Any remaining problem will be tracked in 
    https://ggus.eu/ws/ticket_info.php?ticket=87784.
    (ii)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token 
    off-line to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 
    (Savannah site exclusion ticket), eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (iii)  11/8: WISC - file transfer failures ("TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 
    500 500-Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914: 500-open() fail 500 End").  https://ggus.eu/ws/ticket_info.php?ticket=88302 
    in progress, eLog 40828.  https://savannah.cern.ch/support/index.php?133698 (site exclusion).
    Update 11/20: solved (but no explanation provided).  ggus 88302 closed, eLog 41080.  Savannah ticket left open pending results of tests to confirm fix.
    Update 11/23: no more errors over the past 24 hours - Savannah 133698 closed.  eLog 41160.
    (iv)  11/16: MWT2 - jobs failing with stage-in errors (checksum mismatch).  https://savannah.cern.ch/bugs/index.php?98885, eLog 40957.
    Update 11/28: by now this seems to be an old issue - will check whether it's o.k. to close the ticket.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
  • this meeting:
    • See the notes from yesterday's meeting, included below.
    • A number of sites still have not deployed 10G bandwidth instances. Mesh configuration status:
      • Mesh - done: UM, MWT2, MSU & BNL waiting on new version of CD
      • NET2 - Mesh ? 10G:
      • SWT2 - Mesh ? 10G: OU waiting on external network configuration; waiting on install of 10G switch. Estimate. UTA - waiting on network update. (Mark - meeting with campus networking tomorrow morning. Will require fiber route, will happen in the next week or two; will also discuss joining LHCONE.)
      • WT2 - Mesh ?
    • As part of WLCG operations, Simone has been pushing to get perfSONAR deployed everywhere; Shawn suggests also using Hiro's load tests. There is an OSG OIM test instance that supports perfSONAR registration (which then feeds into WLCG).
    • Yesterday's notes:
      NOTES for November 27th NA Throughput Meeting
             =====================================
      
      Attending:  Shawn, John, Marek, Lucy, Rob, Dave, Ryan, Horst, Azher, Hiro,
      Excused: Tom, Jason, Andy, Philippe
      
      AGENDA:
      
      1) Agenda review and update.   None 
      
      2) Status of perfSONAR-PS Install in USATLAS
      
           i) "Mesh" configuration deployed?  (If  not, when?)
                     AGLT2:  UM done, MSU (this week)
                     MWT2:   All done (Thanks Dave)
                     NET2:    No report:  Can Saul or John provide an update?
                     SWT2:   OU done, UTA no report: Can Mark or Patrick provide an update?
                     WT2:     No report:  Can Wei or Yee provide an update?
                      BNL:    CD installs prevent this.  Netboot from USB isn't working (sda vs sdb issue?).   Once v3.3 is out BNL can utilize mesh configuration.
      
           ii) 10GE Bandwidth instance deployed?  (If not, when?)
                     AGLT2:  All done
                     MWT2:  All done
                     NET2:   Not done yet:  Can Saul or John provide an update?
                     SWT2:  Still waiting on final site changes for Lustre at OU (will need to coordinate 10GE PS change then).   UTA: Needs 10GE network ports: Can Mark or Patrick provide an update?
                     WT2:    No report:  Can Wei or Yee provide an update?
                     BNL:    All done.
      
      Rob has a question about the old Koi boxes.  Some boxes causing lots of warnings.  Shawn described the intent to use these boxes as a shadow test infrastructure at the same scale as the production instances.  However any site having a problem keeping such nodes running should feel free to take them out of service.   We hope to keep enough testing infrastructure in place to test new "beta" versions of software as they are released (like upcoming V3.3 of perfSONAR-PS).
      
      3) perfSONAR-PS Topics
      
          i) New issues noted?  Dave reported that the Illinois throughput box has been unable to get bi-directional testing to BNL's since about September 11 or 12.  LHCMON-MWT2_Illinois works but the other direction doesn't.  *ACTION ITEM*: John will check the IPs on the BNL systems and work with Dave on initial debugging.   May have to involve Jason or others to find the root cause.  Could be a corrupted configuration on LHCMON?  
      
         ii) Toolkit update status:   Andy Lake provided an update via email:   
      
      "Hi Shawn, 
      I'm not sure much of this is new information, but we're hoping to have an early beta before the holiday. It likely won't have the full-set of features that will be in the final release, but should allow for people to start testing the CentOS 6 changes at a minimum. A few highlights we expect:
      - I'm not sure if we will have all combinations of NetInstall, LiveCD, LiveUSB, 32-bit, and 64-bit by the holiday but likely will have some subset of those. We are currently working on upgrade scripts for those as well, so hopefully the CentOS 5 -> 6 transition will be as painless as possible.
      - The plan is still to add an updated Lookup Service to the toolkit, but likely this won't be ready for the December beta. We want to make sure we have all the backward compatibility worked out and we have the best long-term path forward.
      - There will be a traceroute GUI, likely the version shared by the University of Wisconsin. 
      - Aaron's mesh-config agent will be included on the toolkit by default
      
      Those are the big items I can think of in terms of features. A more complete list of the bugs we are targeting is here: http://code.google.com/p/perfsonar-ps/issues/list?can=2&q=Milestone%3DRelease3.3
      
      Thanks,
      Andy"
      
         iii) Modular dashboard news:   Code will be moved to GitHub "soon".   Need to arrange with Tom Wlodek how best to do this.  Andy, Tom and Shawn will set up the project once the code is transferred to an OSG repository as an intermediate step.                     
      
      4) Throughput 
      
          i)  New issues to track?   SWT2_UTA has slow inbound transfers (FTS is backlogged).   Hiro mentioned the FTS monitor shows some really slow transfers from many Tier-1s to US Tier-2s.  Hiro sent a link showing the issue.  Would be nice to identify the cause.  We will use Hiro's transfers and perfSONAR-PS to see what we can find.  
      
         ii)  New developments:  No update
      
         iii) Monitoring:   WLCG operations summary from Shawn describing plans to get perfSONAR-PS instances deployed WLCG-wide and suitably registered in OIM or GOCDB.  Hiro mentioned issues with transfers from Tier-1s to MWT2.  Rob looked at active transfers; no current "smoking gun".   Also discussed the checksum issue previously seen inbound to MWT2.  Discussion about the possible source; the main hint seems to be that these files are all *large* (~> 8GB?).   Could be related to the 'csm' policy setup on MWT2 pool nodes.  Need to check the /dcache*/pool/setup files to see what the 'csm' policy is and whether it is consistent across nodes (see the sketch after these notes).
      
      5) Site Round-table and reports
      
          i) USCMS:   Lucy and Marek reported on deployment: working on establishing a CMS European Tier-2 testing cloud.  Get Tier-2s testing to a Tier-1 to verify configuration.   Marek is providing Simone with a list of CMS Tier-2 sites.    Lucy also reported on her GUI work.   Working on uploading configuration info (for example from the mesh-config).  Lucy will rewrite the code using Struts.   Lucy has an issue getting Tomcat to run the test dashboard code; seems to be some environment issue.
      
         ii) Canada
      
         iii) Sites
      
      6) AOB and next meeting  -  Out of time.  Next meeting will *not* be on December 11 since that is both a CMS and ATLAS meeting week.   Look for email announcing the next meeting; tentatively set for Tuesday, December 18th.
      
      Send along any additions or corrections to these notes to the mailing list.  Thanks,
      
      Shawn   
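
    • Regarding the checksum follow-up in item 4.iii of the notes above: a quick, hypothetical way to compare the 'csm' policy across the pool nodes - the host names below are placeholders and the exact policy-line syntax may vary by dCache version - would be something like:

        # Compare the dCache checksum ('csm') policy lines across pool nodes.
        # Hosts and paths are illustrative assumptions; adjust to the site layout.
        for host in pool-node1 pool-node2 pool-node3; do
            echo "== $host =="
            ssh "$host" 'grep -h "^csm set" /dcache*/pool/setup'
        done

      Identical output across nodes would indicate a consistent checksum policy; a node that differs would be a natural candidate for the large-file checksum mismatches.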

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei has tracked down an instability issue with the redirector; will coordinate fix with Andy
  • Ilija is working with DESY to incorporate federation capabilities directly in dCache
  • Certain problems with the UCSD collector; working with Matevz
  • Monitoring EOS and vector reads
  • Hiro has set up an unauthenticated LFC server; waiting for a host cert. Will ask US sites to convert to using it.
  • Need to put site-specific files into EOS.
this week
  • A beta version of the f-stream-enabled UCSD collector is available. It needed changes in the monitoring protocol, which are only in xrootd 3.3.0-rc2. Ilija will change the dCache monitor to adhere to the new protocol.
  • Today MWT2 is moving to dCache 1.9.12-23.
  • Work is ongoing on adopting AGIS for tests.
  • Hiro produced site specific datasets for all the sites
  • new dashboard monitor ready
  • Three Italian sites added (Rome1, Frascati, Napoli). Working on their monitoring issues.
  • Next week is the PAT tutorial. We will have to recheck that everything works; we should consider this a small full-dress rehearsal.
  • CERN and AGLT2 have a problem with the limited proxy access.
  • Doug will subscribe D3PDs to sites in the US.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Evaluating MapR as a storage management solution; Hiro is working with Doug on testing direct-access jobs. Hiro is scaling this up to a reasonable number of concurrent jobs. 2.6 PB of raw disk to show up in a couple of weeks. The DPM team has NFS 4.1 on top of Hadoop, similar to MapR. Shawn notes HEPiX this year has analyses of storage systems, including the NFS 4.1 client.
    • this meeting: 2.6 PB soon. Opportunistic resources at BNL - 2000 ATLAS jobs running on nuclear-physics resources now (subject to eviction). Amazon cloud resources will be used for scalability testing. WN benchmarking report sent out, on Sandy Bridge (production hardware from Dell); performance was much better than with the pre-production machines - 30% better than 2.8 GHz Westmere (working on pricing). Required to provide a facility security plan to NSF. Do sites have an institutional security plan? If so, share it with Michael. Hans and Borut to participate in turning the HLT farm into a cloud resource.

  • AGLT2:
    • last meeting(s): Next HEPiX will be at Ann Arbor, the last week of October 2013. Received the head node for the new storage purchase. Disks still have not been shipped (arrival November 23). SL 6.3. Bob: upgrading to CFEngine 3.
    • this meeting: All storage received at UM and MSU; configuring. Online by the middle of December. On January 3 the site will be offline for MiLR switch upgrades to 100G. A second outage on December 17 to test MSU systems with new personnel. Will start on SL6.

  • NET2:
    • last meeting(s): BU is switching over to SGE - will be sending test jobs shortly. An issue with release validation at HU.
    • this meeting:

  • MWT2:
    • last meeting(s): GPFS maintenance at UIUC campus cluster - those nodes offline.
    • this meeting: Investigations of poor internal network performance at IU continue; switch firmware was updated today. Increased memory and Java heap size (doubling both) on the SRM door. Investigating DDM checksum failures.

  • SWT2 (UTA):
    • last meeting(s): Working on proddisk-cleanse program. There is a new version in git that Shawn ran into. Looking to add information about what Panda is about to run - e.g. activated jobs.
    • this meeting: Things are running fine, working on storage

  • SWT2 (OU):
    • last meeting(s): Disk order has gone out. Horst is getting close to having a clean-up script.
    • this meeting:

  • WT2:
    • last meeting(s): Storage is online. SLAC Tier 3 - 20 R510's with 3 TB drives --> 500 TB
    • this meeting:

AOB

last meeting:
this meeting:
  • Alden reports that validated releases no longer publish into the BDII; Patrick will do a test of removing the grid3-locations.txt file, to see that nothing breaks. Alden will send a formal announcement to the usatlas-grid-l list when it is finalized.


-- RobertGardner - 27 Nov 2012
