
MinutesJan18

Introduction

Minutes of the Facilities Integration Program meeting, January 18, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); please announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Rob, Dave, Hari, Michael, Patrick, Sarah, Wei, Jason, Saul, Bob, John, Horst, Hiro, Mark, Armen, Kaushik, Alden, Wensheng
  • Apologies: Xin (on vacation)
  • Guests: Alain

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly, convened by Kaushik): Data management
    • Tuesday (2pm CDT, bi-weekly, convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly, convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
      • OSG All Hands meeting: March 19-23, 2012 (University of Nebraska at Lincoln). The program is being discussed. As last year, part of the meeting will include co-located US ATLAS sessions and a joint US CMS/OSG session.
  • For reference:
  • Program notes:
    • last week(s)
      • Integration program sketch for the quarter (FY12Q2, January 1 - March 31, 2012):
        • Complete CVMFS deployment (January)
        • Finish CA procurements
        • Readiness for LHC restart of operations (proton-proton April 1); 2012 pledges fully deployed
        • OSG CE rpm-based update
        • Hammer Cloud on OSG ITB
        • PerfSONAR-PS: 10G upgrade?
        • Tier2D
        • FAX milestones - security, functional modes, analysis performance, monitoring. - TBD
        • Opportunistic access milestone across US ATLAS sites (TBD)
        • Deployment and evaluation of APF at a T2 and a Tier 3 (local pilot)
        • Illinois integration with MWT2
        • OSG AH meeting in March - co-locate next facilities workshop
        • Other integration tasks foreseen?
          • data management? cloud?
        • ADC discussions on LFC consolidation: this is underway at CERN (the Dutch and Italian LFCs are now being run at CERN). Possible consolidation of T2 LFCs at the T1 (BNL). Create a short-lived study group to evaluate the pros and cons of a consolidation and determine the next step. Have a report/conclusion ready for the upcoming S&C week in March.
        • Move DQ2 site services at BNL to CERN - Hiro's proposal.
        • Need clear milestones - even if they extend beyond the quarter
      • Need to flesh out a US ATLAS cloud computing activity which meshes with ATLAS and OSG; Alden
    • this week
      • Integration program above is being put into a Jira issue tracking system - will be available by next meeting.
      • Alain, on behalf of OSG, has asked VOs to provide a list of software components we are interested in, and priorities. Rob, John Hover, and anyone else are invited to contribute. Will set up a twiki page with a list.
      • An LFC consolidation working group will evaluate this for the US ATLAS cloud. Need an informed decision by mid-March; could do a step-wise consolidation. Result by end of February at the latest.
      • Start planning regarding SL6 migration. There is a lot of work in ATLAS going on at the moment. Probably a version available by late March/April. What would be our plan to migrate? At some point WLCG + ATLAS will demand this.
      • Pledges: thanks to everyone for contributing to the capacity overview - there is a new spreadsheet at CapacitySummary. Need to have the pledged capacities available by early April (CPU in particular will not be an issue); for disk, two sites are about there and two are far away. Something to keep in mind. Michael will use these numbers in the presentation to the review committee during the operations review in early February, and later for the agency review in March.
      • Cloud activity - Torre has asked the Facilities to provide an overview talk for Feb 2 on cloud activities: Val (LBNL) for Tier 3, John H (T1/T2) for OpenStack, etc. Would like to include Alden's work on virtualized worker nodes, and the EC2 cloud work from Sergei - dynamically creating PROOF clusters, with measurements, etc.
      • Saul - regarding storage - two racks are on order from Dell, but drives are delayed until March 19. Believes they can reach 2.2 PB by April.

rpm-based OSG release (Alain Roy)

  • 2011-01-18-osg-release3-atlas.pdf
  • Page for keeping ATLAS-specific notes: GuideToOSG
  • Xrootd components are kept in sync with the repo at xrootd.org to every extent possible.
  • Tom R: a point about isolating repos from external repos - keep the production system stable. This is done at MW - BNL is experimenting with an integration environment. Alain notes a discussion with John Hover on this - perhaps providing a mirrored EPEL repo snapshot.
  • Saul - maybe make a recording of this presentation and make it available on the OSG website.
  • Horst - release 2? Alain - we now keep the VDT (was v2) and OSG (was v1.2) software together, removing the previous dividing line, making it release 3.
  • Bestman support - the project is continuing with less support; OSG will do any development as best we can to maintain it for OSG stakeholders. Some external changes - e.g. Java 7 - imply internal changes and some testing.
  • The goal is that "yum update" just works.

Progress on procurements

last meeting(s):
  • Interlagos machine - 128 cores - Shuwei's diverse set of tests shows poor performance. Not usable for us. There is an effort to look at an RHEL6 evaluation - which is highly recommended by AMD and Dell. Not likely to get a result in time.
  • Regarding memory requirements, discussion with Borut: baseline is still 2 GB/logical core, but expect there will be high mem queues needed at some point; try
  • AGLT2: POs for equipment at UM have been put in (8 blades to a Dell chassis). S4810 F10 switch. Buying a port at OmniPoP (shared switch, in coordination with MWT2 sites). MSU - meeting to discuss details.
  • MWT2: working on R410-based compute node purchase at IU and UC. Extending CC at Illinois. OmniPoP switch ports (2 UIUC, 1 UC, 1 IU plus the shared port costs).
  • SWT2: getting orders in for the remaining funds; UPS infrastructure, and a smaller compute node purchase. Purchase of two 10G gridftp doors, but in next phase (Feb, March). OU: three new head nodes, and new storage already purchased, and everything is at 10G.
  • WT2: deployed 68 R410s; will spend more on storage next year plus other smaller improvements. 2.1 PB currently. Will investigate SSDs for highly performant storage. 100 Gbps - will discuss with his networking group.

  • NET2: have about $20K left; ordered storage and replacement servers for HU and BU. Arriving now.
  • Dell pricing matrix has been updated.
  • C6145 eval has completed (Shuwei, Alex Undres) - ATLAS on SL6, code built with gcc 4.6.2. A single process goes well, but performance is dramatically worse when fully scheduled, in spite of the HS06 rating.

this meeting:

  • MWT2: UC and IU purchases are in to Dell; UIUC purchase of campus cluster nodes is imminent (today, or this week at the latest).
  • SWT2: pushing through final purchases. Expect to be at 2.4 PB of storage, and plenty of CPU. UPS upgrade work (2.5x) on-going.
  • AGLT2: at MSU - working out plans for adding a second VMWare cluster: run services at either site. Getting ready to send off POs. Also will purchase a small amount of storage for dCache (to 2.2 PB pledge). Adding another Juniper switch at MSU. All money will be obligated by the end of the month. At UM - all funds have been dedicated.
  • WT2 - close to storage pledge, but no immediate plans to purchase more. Actively looking at SSDs, to replace "front tier" storage.
  • NET2- as above.

Follow-up on CVMFS deployments & plans

last meeting:
  • OU - January 15
  • UTA - will focus on production cluster first. Will do a rolling upgrade. Expect completion by January 15 as well.
  • BNL - Michael notes that at BNL they have seen multiple mount points from the automounter (see the diagnostic sketch below). They seem to go away eventually. Under investigation by the CVMFS experts. A ticket has been filed. In the process of adding more compute nodes up to the full capacity.
  • WT2 - fully converted. DONE

  • BNL - moved another 2000 job slots into CVMFS. Now running 4000 jobs in the CVMFS queue. The last batch will be moved tomorrow or early next week, completing the deployment. Note: have deployed throughout.
  • HU - ran into a few problems with an unintended kernel update, but this will be fixed shortly. At BU this is a top priority - will be done by January 30.
  • Are we running into scheduling problems because of missing (unpublished) releases? Xin: did Alessandro's jobs not run? A newer release? Probably. There is a lack of transparency in the process - when validation jobs run, publication, etc. Site admins should be notified - sites individually and cloud-support. Bob claims Alessandro's web page can be set up to send notifications. Bring this up at the ADC meeting.
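  • For reference, a minimal diagnostic sketch (not from the meeting, and not part of the CVMFS tools) showing how a site could scan /proc/mounts for repositories mounted more than once, the symptom reported at BNL above. The only assumption about the layout is that repositories are mounted under /cvmfs/.

      from collections import defaultdict

      def cvmfs_mount_counts(mounts_file="/proc/mounts"):
          """Map each /cvmfs/... mount point to the list of devices mounted on it."""
          counts = defaultdict(list)
          with open(mounts_file) as f:
              for line in f:
                  fields = line.split()
                  if len(fields) < 2:
                      continue
                  device, mountpoint = fields[0], fields[1]
                  if mountpoint.startswith("/cvmfs/"):
                      counts[mountpoint].append(device)
          return counts

      if __name__ == "__main__":
          for mountpoint, devices in sorted(cvmfs_mount_counts().items()):
              flag = "DUPLICATE" if len(devices) > 1 else "ok"
              print("%-40s %d mount(s)  %s" % (mountpoint, len(devices), flag))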

this meeting:

  • OU - delayed by Lustre downtime, expert's availability. New date is Feb 10.
  • UTA - still working on the build package - in good shape. Will start the rolling upgrade, then Alessandro's validation. Question as to whether this can be done with or without a downtime.
  • AGLT2 - went first, was messy - ended up having Alessandro run the validation jobs locally.
  • BNL - now nearly fully converted - John has a new gatekeeper and is now using APF. Expect completion by early next week. Note Alessandro presented a detailed talk at the ADC dev meeting on Monday; he emphasizes the need to migrate completely.
  • NET2: converting right now. Tried to overlap downtimes. HU passed tests.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Sites were auto-offlined yesterday due to an expired proxy.
    • Deletion errors at OU - need an SRM update to bestman 2. It's a timeout error making deletion slow. UTD needs an update as well, and so does BU.
  • this meeting:
    • All is fine; ramped back up quickly.
    • There are on-going problems with the Panda monitor. Torre notes Valerie Fine is on vacation. Oracle 11g behavior for updates has changed, degrading performance. Internal cache updates load slowly, creating a backlog. The Panda central services team turned off the internal caches, but this brings back the slow page-loading times. The real solution to the caching is having squid working properly; it is on the task list, but the central services squid was never configured correctly. There is no quick solution.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=170427
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-1_11_2012.html
    
    1)  1/4/12: NET2 - Job failures with the message "Error details: pilot: Get error: 16010101/FSRelease-0.7.1.2.tar.gz'\'''] failed because it had non-empty stderr 
    [Permission denied], please try again."  Issue understood/resolved - from Saul & John: This was an ssh config problem that occurred for a short time while we 
    were borrowing some of our Tier 3 nodes for production on the BU side. The problem only lasted for a short time and we took the nodes out of production to fix 
    the issue.  ggus 77893 closed, eLog 32787.
    2)  1/6: OU_OCHEP_SWT2 - DDM deletion errors.  These errors will go away once the SRM service is upgraded later this month.  ggus 77926 / RT 21524 closed, 
    eLog 32859.
    3)  1/6: SWT2_CPB - DDM errors (" [DDM Site Services internal] Timelimit of 604800 seconds exceeded in FZK-LCG2_DATADISK->SWT2_CPB_PHYS-SUSY queue," etc.).  
    ggus 77945 / RT 21527 closed, as this is not a site issue, but rather an internal DDM timeout which shifters are requested to report to the DDM ops team.  eLog 32845.
    4)  1/7: UTD-PRODDISK DDM errors (" failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  A restart of bestman fixed the problem temporarily, 
    but the errors reappeared on 1/9.  ggus ticket 77957 closed on 1/14 since there were no recent errors observed.  eLog 32871/997.  Site was blacklisted for a couple of 
    days during this period: https://savannah.cern.ch/support/index.php?125532.
    5)  1/8: From Shawn at AGLT2: AGLT2 had one of two PC6248 switches in a stack encounter an error around 2:23 AM today. All systems connected to that unit lost network 
    connectivity, including UMFS11.AGLT2.ORG. Power-cycling the unit brought it back into the stack (around 10:10 AM Eastern). All dCache services are running and should 
    be OK now.
    6)  1/8: OU_OCHEP_SWT2 - jobs from task 645292 were failing at the site due to a checksum error on an input file.  Wensheng checked and found that the copy at OU was 
    corrupted, and hence removed.  Jobs from this task eventually completed successfully.  https://savannah.cern.ch/bugs/?90283, eLog 32897.
    
    Follow-ups from earlier reports:
    (i)  12/12: UTD-HEP - ggus 77382 opened due to DDM deletion errors at the site (~21 over a four hour period).  Ticket 'assigned' - eLog 32351.  (Duplicate ggus ticket 77440 
    was opened/closed on 12/14.)
    Update 12/24: ggus 77737 also opened for deletion errors at the site - eLog 32692.
    Update 1/9: ggus 77382 closed as an old/obsolete ticket.
    (ii)  12/13: NERSC - downtime Tuesday, Dec. 13th from 7AM-5PM Pacific time.  ggus 77417 was opened for file transfer failures during this time - shifter wasn't aware site was 
    off-line.  Outage didn't appear in the atlas downtime calendar, announcement only sent to US cloud support.  eLog 32373.
    Update 12/15: Still see SRM errors following the outage - ggus 77417 in-progress, eLog 32409.
    Update 1/9: ggus 77417 / RT 21376 closed as an old/obsolete ticket.
    (iii)  12/20: ANL_LOCALGROUPDISK - failed transfers ("failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]").  ggus 77630 in-progress, eLog 32526.
    Update 1/9: ggus 77630 closed as an old/obsolete ticket - no recent errors of this type.
    (iv)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    (v)  12/27:  File transfer failures from CERN-PROD_DATADISK => BNL-OSG2_PHYS-SM.  Hiro noted that the issue was incorrect registration of the files from the dataset in 
    question.  Therefore issue needs to be fixed on the CERN side.  ggus 77759 in-progress, eLog 32611.
    Update 1/9: No recent errors seen (has this issue been resolved?).  ggus 77759 closed.
    
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (presented this week by Torsten Harenberg):
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=173396
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-1_17_2012.pdf
    
    1)  1/13: UTD-HEP - dbRelease file transfer failing ("[GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory `/at/atlashotdisk/ddo/DBRelease/v170602/
    ddo.000001.frozen.showers.DBRelease.v170602': Permission deniedRef-u usatlas1 /bin/mkdir /at/atlashotdisk/ddo/DBRelease/v170602/ddo.000001.frozen.
    showers.DBRelease.v170602]").  ggus 78217, eLog 33005.  Site requested to be set off-line on 1/14 - eLog 33028, https://savannah.cern.ch/support/index.php?125621.
    2)  1/14: NERSC_SCRATCHDISK DDM errors ("[TRANSFER_TIMEOUT] globus_ftp_client_cksm (gridftp_client_wait): Connection timed out]" & "[CONNECTION_ERROR] 
    [srm2__srmPrepareToGet] failed: SOAP-ENV:Client - CGSI-gSOAP running on fts01.usatlas.bnl.gov reports Error reading token data header: Connection closed]").  
    ggus 78246, eLog 33010.
    3)  1/16: ggus 78298 was erroneously assigned to BNL, when the actual file transfer errors were at UPENN (" failed to contact on remote SRM 
    [httpg://srm.hep.upenn.edu:8443/srm/v2/server]").  This ticket was closed, and ggus 78299 was opened instead.  The issue at UPENN was resolved by restarting bestman.  
    ggus 78299 closed,  eLog 33055/56.
    Update 1/18: transfer errors reappeared, and UPENN_LOCALGROUPDISK was blacklisted.  Site reported that another bestman restart fixed the problem.
    http://savannah.cern.ch/support/?125678 (Savannah site exclusion).
    4)  1/17 early a.m.: power outage at SLAC - eLog 33071.
    5)  1/17: Major ADCR 11g database upgrade at CERN - affected most aspects of ATLAS distributed computing.  Outage over as of ~7:00 a.m. CST.  eLog 33073.
    6)  1/17: BNL - SE maintenance - coincided with CERN outage in 5).  Details here:
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/33067
    7)  1/17: AGLT2 maintenance outage.  Work completed as of ~8:00 p.m. CST.  Test jobs submitted to the production queue, but they failed with "Put error: lfc-mkdir threw an exception."  
    Additional test jobs submitted.  eLog 33097.
    8)  1/17: OU_OCHEP_SWT2 - ggus 78325 / RT 21558 opened due to DDM deletion errors (again...).  Tickets closed, as this problem should get addressed by 
    an SRM upgrade later this month.  eLog 33087.
    
    Follow-ups from earlier reports:
    (i)  12/12: UTD-HEP - ggus 77382 opened due to DDM deletion errors at the site (~21 over a four hour period).  Ticket 'assigned' - eLog 32351.  
    (Duplicate ggus ticket 77440 was opened/closed on 12/14.)
    Update 12/24: ggus 77737 also opened for deletion errors at the site - eLog 32692.
    Update 1/9: ggus 77382 closed as an old/obsolete ticket.
    Update 1/17: ggus 78326 opened - closed since this issue is being tracked in ggus 77737.
    (ii)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    Update 1/17: ggus tickets 78324/40 opened - closed since this issue is being tracked in ggus 77729.
    
     • Still would like to remove obsolete sites from the Panda monitor.
     • The main issue is DDM deletion errors - expect them to go away when SRM is updated at three sites. They are generating redundant tickets.
    • The sites requiring updates are: OU (Feb 10), NET2 (imminent, but correlated with GK overload; within a week), UTD ()
     • New coordinator for ADCoS, replacing Jarka: Alexei S.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting:
    • Last meeting was December 20. There are two R310s in the portal - the price will be reduced. They will support 10G NICs. Still need two boxes? Yes. Sites are requested to deploy 10G-capable boxes by the end of the quarter.
    • LHCONE meeting at Berkeley - Shawn attending.
  • this meeting:
    • Shawn and Michael will attend LHCONE/LHCOPN meeting at LBNL, 30-31 January. Will also discuss moving forward with US LHCNet.
    • LHCONE is making progress with a new architecture, moving away from the VLAN architecture. A couple of sites are ready, once the providers are ready (ESnet and I2). The difficult part was making it work between regions (e.g. routing loops resulting).

Federated Xrootd deployment in the US

last week(s) / this week:
  • See notes from last week, MinutesFedXrootdJan11
  • Waiting for a patched release so sites can move to a proxy cluster. Also, we've found the problem OU has seen and have a good workaround in place. We are subscribing datasets for the next round of testing. dCache-xrootd testing/evaluation. Also asking sites to do a little homework to use the Vidyo system - evaluate before the next meeting, potentially next week.

Tier 3 GS

last meeting:
  • UTD: Bestman update needed. Hari notices very long-running jobs. An infinite loop?
this meeting:
  • Michael notes that Armen wanted to contact T3s to retire DATADISK tokens. Action item to follow up.

Site news and issues (all sites)

  • T1:
    • last meeting(s):
      • Holidays were uneventful.
      • VOMS server became stuck
      • Completed deployment of 1PB disk; in hands of dCache group - space will show up soon. Expected delivery of R410s in February.
    • this meeting: Queues at BNL were moved to APF - autopilot turned off - a good step forward for moving the entire region to APF. Hiro has been developing a new lsm with nice extensions in terms of supporting alternate protocols; http and xrootd are installed. It's now comprehensive and currently in test, with failovers as appropriate. The goal is to eliminate failures due to missing input files. As a last resort it can even grab a file from a remote site, including the federation (a sketch of the failover idea is below).
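    • For illustration only: a minimal sketch of the failover ordering described above. This is not Hiro's actual lsm code; the hostnames, paths, and choice of copy commands are assumptions. It tries a local copy first, then alternate protocols, then the federation as a last resort.

        import subprocess

        def fetch_with_failover(lfn, sources):
            """Try each (label, command) in order; return the label of the first copy that succeeds."""
            for label, cmd in sources:
                try:
                    subprocess.check_call(cmd)
                    return label
                except (subprocess.CalledProcessError, OSError):
                    continue  # copy failed or tool not present; fall through to the next protocol
            raise RuntimeError("all copy attempts failed for %s" % lfn)

        # Example ordering (placeholder hostnames): local dCache, then http, then
        # xrootd, and a federated remote copy as the last resort.
        lfn = "user.input.root"
        dest = "/tmp/" + lfn
        sources = [
            ("local dccp",  ["dccp", "dcap://dcache.example.org/pnfs/%s" % lfn, dest]),
            ("local http",  ["curl", "-sf", "-o", dest, "http://se.example.org/%s" % lfn]),
            ("local xrdcp", ["xrdcp", "-s", "root://se.example.org//%s" % lfn, dest]),
            ("federation",  ["xrdcp", "-s", "root://federation.example.org//%s" % lfn, dest]),
        ]
        # fetch_with_failover(lfn, sources) would return the first label that worked.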

  • AGLT2:
    • last meeting(s):
    • this meeting: downtime yesterday - firmware updates to Dell switches; updated to dCache 1.9.12-15. VMWare cluster problems - a machine dropped out.

  • NET2:
    • last meeting(s):
    • this meeting: cvmfs testing completed successfully at HU. Had some slowdown of HU nodes - a known puppet bug - will make a note.

  • MWT2:
    • last meeting(s):
    • this meeting: Had problems with a KVM bridge dropping out - caused interruptions; the server is back up. Using CVMFS to export $APP, etc., to worker nodes. Procurement on track.

  • SWT2 (UTA):
    • last meeting(s):
    • this meeting: Going well over the last two weeks; a partial shutdown is expected Saturday morning for construction work, which will affect a subset of compute nodes. Progressing with procurements and CVMFS updates.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting:

  • WT2:
    • last meeting(s):
    • this meeting: Running smoothly - but lost power at SLAC, fully back now. Were down for about 16 hours.

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and Glow. Have problems with Engage - since they want gridftp from the worker node. Possible problems with Glow VO cert.
  • GEANT4 campaign, SupportingGEANT4
this week

AOB

last week this week
  • The next OSG All Hands meeting is coming up - see coordinates above. We are discussing a joint session at this meeting, and hope that many will join us; it will be our next face-to-face facilities meeting. Points of common interest with US CMS - TEG, federated Xrootd, etc. - in the joint sessions; also OSG software, operations, and hardware developments, and networking (e.g. LHCONE prospects). If you have other areas of interest please send them to the list.


-- RobertGardner - 17 Jan 2012

Attachments


  • 2011-01-18-osg-release3-atlas.pdf (88.7K) - attached by RobertGardner, 18 Jan 2012 - 07:56
 