


Minutes of the Facilities Integration Program meeting, Feb 1, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (6 to mute); announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755


  • Meeting attendees: Dave, Fred, Patrick, Rob, Mark, Saul, John B, Armen, Alden, Shawn, Sarah, Wei, Hiro, Bob, Tom, Wensheng, Horst, Doug
  • Apologies: Michael, Kaushik, Nate
  • Guests: Jason (I2)

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly - convened by Kaushik): Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
      • OSG All Hands meeting: March 19-23, 2012 (University of Nebraska at Lincoln). Program being discussed. As last year, part of the meeting will include a co-located US ATLAS session and a joint USCMS/OSG session.
  • For reference:
  • Program notes:
    • last week(s)
      • The integration program above is being put into a Jira issue-tracking system; it will be available by the next meeting.
      • Alain, on behalf of OSG, has asked VOs to provide a list of software components we are interested in, with priorities. Rob, John Hover, and anyone else are invited to contribute. Will set up a twiki page with the list.
      • An LFC consolidation working group will evaluate this for the US ATLAS cloud. An informed decision is needed by mid-March; a step-wise consolidation could be done. Result by end of February at the latest.
      • Start planning for the SL6 migration. There is a lot of work going on in ATLAS at the moment; a version will probably be available by late March/April. What would be our plan to migrate? At some point WLCG and ATLAS will demand this.
      • Pledges: thanks to everyone for contributing to the capacity overview - there is a new spreadsheet at CapacitySummary. Need to have these available by early April (CPU in particular will not be an issue); for disk, two sites are about there and two are far away. Something to keep in mind. Michael will use these numbers in the presentation to the review committee during the operations review in early February, and later for the agency review in March.
      • Cloud activity - Torre has asked the Facilities to provide an overview talk for Feb 2 on cloud activities: Val (LBNL) for Tier 3, John H for T1/T2 (OpenStack, etc.). Would like to include Alden's work on virtualized worker nodes, and EC2 cloud work from Sergei - dynamically creating PROOF clusters, with measurements, etc.
      • Saul - regarding storage - have two racks on order from Dell, but drives are delayed till March 19. Believes can reach 2.2 PB by April.
    • this week
      • ConsolidatingLFCStudyGroupUS
      • OSG All hands registration is open: Dear OSG Community Member, We would like to remind you that registration for the March 19-22nd OSG All Hands Meeting is now open at http://hcc.unl.edu/presentations/event.php?ideventof=5 . We look forward to seeing many of you there. We are also encouraging you to submit posters for the poster session on Tuesday, March 20. These should be no larger than 3ft X 4ft. There will also be tables available for laptop displays either with a poster or standalone. If you are interested in submitting a poster, please email an abstract to osg-ahm-program@opensciencegrid.org by March 1, 2012. The program committee will be reviewing the submissions (we can accommodate a maximum of about 20). Any work related to OSG will be considered. Priority will be given to new applications recently ported to the OSG or work directly involving students. We are also looking for suggestions for a few five-minute "lightning presentations" on the plenary day - Wednesday, March 21. We suggest one of two alternative topics: "The neatest, most reusable tool my site or community uses on OSG" or "What OSG should be for the user in 2016". Please send suggestions to osg-ahm-program@opensciencegrid.org. We will be posting more details of the agenda very shortly. Regards, David Swanson, Host of the OSG All Hands Meeting 402-472-5006
        • Additionally, a request from Brian Bockelman for site admin technologies.
      • GuideToJIRA
      • screenshot_02.png (attached)

Special topic: Generalized local site mover (Hiro)

  • Slides: LSM.pptx.pdf (attached)
  • Reduce file-not-found errors (a protocol-failover sketch follows this list).
  • Add resilience to dCache restarts. xrootd and http are running on all data servers; still need access to the backend database. A secondary Postgres is used as a backup (streaming replication); Shawn: have two hot standbys, one in a VM. Postgres 9.0 or higher is required (a standby-check sketch also follows the list).
  • aria2 client - from Google Code - allows block access
  • Needs testing in a production environment.
  • Will make available as rpms.
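
A minimal sketch of the protocol-failover idea referenced above, for illustration only: it is not Hiro's actual lsm code, and the copy commands, door host names, and paths are placeholder assumptions.

    #!/usr/bin/env python
    # Sketch of a generalized local-site-mover "get" with protocol failover.
    # Not the real lsm implementation; endpoints and commands are placeholders.
    import subprocess

    # Ordered list of (protocol, command template); %(src)s / %(dst)s filled in below.
    COPY_METHODS = [
        ("xrootd", ["xrdcp", "root://local-xrootd-door:1094/%(src)s", "%(dst)s"]),
        ("http",   ["curl", "-f", "-o", "%(dst)s", "http://local-http-door:8000%(src)s"]),
        ("dcap",   ["dccp", "dcap://local-dcache-door:22125%(src)s", "%(dst)s"]),
    ]

    def lsm_get(src, dst):
        """Try each protocol in turn; return the first that succeeds, else raise."""
        for name, template in COPY_METHODS:
            cmd = [arg % {"src": src, "dst": dst} for arg in template]
            try:
                subprocess.check_call(cmd)
                return name                 # first successful protocol wins
            except (subprocess.CalledProcessError, OSError):
                continue                    # fall through to the next protocol
        raise RuntimeError("all copy methods failed for %s" % src)

    if __name__ == "__main__":
        print(lsm_get("/pnfs/example.org/atlasdatadisk/some/file.root", "/tmp/file.root"))

The last-resort remote/federation fetch mentioned under the T1 site report below would simply be one more entry at the end of the method list.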
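On the Postgres point, a quick way to confirm that a 9.0+ streaming-replication hot standby is in recovery and keeping up is to query it directly. A minimal sketch, assuming psycopg2 and placeholder connection details:

    # Check a Postgres 9.0+ streaming-replication hot standby (e.g. the
    # dCache/Chimera backend).  Host, dbname and user are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="db-standby.example.org", dbname="chimera", user="monitor")
    cur = conn.cursor()

    cur.execute("SELECT pg_is_in_recovery()")
    print("standby in recovery:", cur.fetchone()[0])

    # Received vs. replayed WAL locations give a rough idea of replication lag (9.0/9.1 names).
    cur.execute("SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location()")
    print("received / replayed WAL:", cur.fetchone())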

Follow-up on CVMFS deployments & plans

last meeting:
  • OU - delayed by Lustre downtime, expert's availability. New date is Feb 10.
  • UTA - still working on build package - good shape. Start rolling releases. Then Alessandro's validation. Question as to whether this can be done with or without a downtime.
  • AGLT2 - went first, was messy - ended up having Alessandro run them locally.
  • BNL - now nearly fully converted - John has a new gatekeeper, and is now using APF. Expected completion by early next week. Note Alessandro presented a detailed talk at the ADC dev meeting on Monday; it emphasizes the need to migrate completely.
  • NET2: converting right now. Tried to overlap downtimes. HU passed tests.

this meeting:

  • UTA - has gatekeeper and CE installed; will be added into OIM, needed for Alessandro. Expect to go live next week with validation jobs.
  • OU - on track for Feb 10
  • NET2: now transitioned. One leftover problem - the analysis queue on the BU side is in brokeroff; Squid not being advertised correctly? A post-fix to the CVMFS install? (A quick proxy-check sketch follows this list.)
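
On the Squid question: a quick sanity check is to fetch a repository's .cvmfspublished file through the proxy the worker nodes are configured to use. A rough diagnostic sketch, not the official validation; the proxy URL and repository path are assumptions to adapt per site.

    # Verify that the site squid actually serves CVMFS repository metadata.
    # Proxy and stratum-1/repository URL below are placeholders.
    import urllib.request

    PROXY = "http://squid.example.edu:3128"                              # site squid (assumed)
    URL = "http://cvmfs-stratum-one.cern.ch/opt/atlas/.cvmfspublished"   # assumed repo path

    opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": PROXY}))
    try:
        data = opener.open(URL, timeout=10).read()
        print("OK: fetched %d bytes via %s" % (len(data), PROXY))
    except Exception as exc:
        print("FAILED via %s: %s" % (PROXY, exc))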

rpm-based OSG CE install

  • OU - CE and Bestman SE are both looking good. Not quite ready though to convert to production.
  • MWT2 - RSV installation, bug found.
  • BU - new gatekeeper, expect to be installing in about two weeks.
  • No other immediate scheduled installs.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • All is fine; ramped back up quickly.
    • There are ongoing problems with the Panda monitor. Torre notes Valerie Fine is on vacation. Oracle 11g behavior for updates has changed, degrading performance: internal cache updates are slow, creating a backlog. The Panda central services team turned off the internal caches, but this brings back the slow page-loading times. The real solution to the caching is having squid working properly; it is on the task list, but the central services squid was never configured correctly. There is no quick solution.
  • this meeting:

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (presented this week by Hiroshi Sakamoto):
    1)  1/19: NET2 - job failures with "lost heartbeat" errors.  Issue under investigation.  ggus 78396, eLog 33118.
    2)  1/22: OU_OSCER_ATLAS - job failures with segfault errors - long-standing issue, not understood (muon reconstruction s/w, somehow affects OSCER more than 
    other sites).  ggus 78466 / RT 21580 closed, eLog 33215.
    3)  1/22: OU_OCHEP_SWT2 - file transfer errors ("failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]").  From Horst: These errors 
    were again caused by heavy I/O activity on our storage system because of very I/O active jobs, and they have subsided again and I haven't seen any more errors in 
    the last hour or so, so I'm closing this ticket again.  ggus 78467 / RT 21581 closed, eLog 33222.  Also, on 1/23 file transfer failures to the site were observed - fixed by 
    a restart of the SRM service.
    4)  1/22: HU_ATLAS_Tier2 - site was not receiving production jobs (panda brokerage indicated needed atlas s/w releases weren't available).  Issue was traced by 
    Alessandro to a problem with release reporting, probably left over from an incident over the holidays where a machine running an installation agent crashed.  Apparently 
    this problem was affecting some other sites as well (MWT2).
    5)  1/23: Issue with panda schedconfig update caused many sites to be set off-line for a period of time.  Issue resolved as of ~1:00 p.m. CST.  eLog 33249.
    6)  1/25: New pilot release from Paul (SULU 50b).  Details here: http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU_50b.html
    Follow-ups from earlier reports:
    (i)  12/12: UTD-HEP - ggus 77382 opened due to DDM deletion errors at the site (~21 over a four hour period).  Ticket 'assigned' - eLog 32351.  (Duplicate ggus ticket 
    77440 was opened/closed on 12/14.)
    Update 12/24: ggus 77737 also opened for deletion errors at the site - eLog 32692.
    Update 1/9: ggus 77382 closed as an old/obsolete ticket.
    Update 1/17: ggus 78326 opened - closed since this issue is being tracked in ggus 77737.
    Update 1/21: ggus 77737 closed as 'unsolved' with the comment: There are no errors currently. When there are small amounts of these errors from time to time they are 
    transient, and don't create a real problem for the deletion operation - just may make it a bit slower.
    (ii)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    Update 1/17: ggus tickets 78324/40 opened - closed since this issue is being tracked in ggus 77729.
    (iii)  1/13: UTD-HEP - dbRelease file transfer failing ("[GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory `/at/atlashotdisk/ddo/DBRelease/v170602/ddo.
    000001.frozen.showers.DBRelease.v170602': Permission deniedRef-u usatlas1 /bin/mkdir /at/atlashotdisk/ddo/DBRelease/v170602/ddo.000001.frozen.showers.DBRelease.v170602]").  
    ggus 78217 / RT 21544, eLog 33005.  Site requested to be set off-line on 1/14 - eLog 33028, https://savannah.cern.ch/support/index.php?125621.
    (iv)  1/14: NERSC_SCRATCHDISK DDM errors ("[TRANSFER_TIMEOUT] globus_ftp_client_cksm (gridftp_client_wait): Connection timed out]" & "[CONNECTION_ERROR] 
    [srm2__srmPrepareToGet] failed: SOAP-ENV:Client - CGSI-gSOAP running on fts01.usatlas.bnl.gov reports Error reading token data header: Connection closed]").  
    ggus 78246, eLog 33010.
    Update 1/19 from a site admin: It looks as though the problem fixed itself; I will close this ticket. If the problem reoccurs please let us know and we can reopen this ticket 
    or create a new one.  ggus 78246 closed.
    (v)  1/16: ggus 78298 was erroneously assigned to BNL, when the actual file transfer errors were at UPENN (" failed to contact on remote SRM 
    [httpg://srm.hep.upenn.edu:8443/srm/v2/server]").  This ticket was closed, and ggus 78299 was opened instead.  The issue at UPENN was resolved by restarting bestman.  
    ggus 78299 closed,  eLog 33055/56.
    Update 1/18: transfer errors reappeared, and UPENN_LOCALGROUPDISK was blacklisted.  Site reported that another bestman restart fixed the problem.
    http://savannah.cern.ch/support/?125678 (Savannah site exclusion).
    Update 1/18: Issue appears to be resolved - site removed from blacklisting, Savannah 125678 updated.
    (vi)  1/17 early a.m.: power outage at SLAC - eLog 33071.
    Update 1/18: power restored.  Cause of the outage under investigation.
    (vii)  1/17: AGLT2 maintenance outage.  Work completed as of ~8:00 p.m. CST.  Test jobs submitted to the production queue, but they failed with 
    "Put error: lfc-mkdir threw an exception."  Additional test jobs submitted.  eLog 33097.
    Update 1/19: issue affecting earlier test jobs resolved - most recent jobs completed successfully.  Site set back on-line in panda - eLog 33120.
    • Still would like to remove obsolete sites in the Panda monitor.
    • Main issue is DDM deletion errors - expect it to go away when SRM is updated at three sites. Generating redundant tickets.
    • The sites requiring updates are: OU (Feb 10), NET2 (imminent, but correlated with GK overload; within a week), UTD ()
    • New coordinator for ADCoS - replacing Jarka - Alexei S.
  • this meeting: Operations summary:
    No summary from the ADCoS meeting available this week.
    1)  1/27: early a.m. network outage at CERN for ~20 minutes.  Resulted in large numbers of "lost heartbeat" job failures across all clouds.  eLog 33321.
    2)  1/28: ANL_LOCALGROUPDISK file transfer errors (" failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]").  ggus 78727 marked as 
    'solved' with no details.  eLog 33358.
    3)  1/28: From Bob at AGLT2: Network restart with back-ported driver seems to have hung something. Machine reboot appears to have repaired the issue, as lfc log 
    now seems normal.  Spike in LFC-related errors - latest at around 22:15 UTC.
    4)  1/28: SLACXRD - job failures with stage-out errors (example: "Error accessing path/file for root://atl-xrdr:1094//...").  Wei reported that the issue was due to running 
    out of disk space on the xrootd redirector.  Problem fixed - ggus 78735 closed, eLog 33373.
    5)  1/28: Threshold for DDM deletion error reporting increased to 100 over a four hour period.  Hopefully will reduce the level of ticketing for errors that often get resolved 
    on their own anyway.  eLog 33353/57.
    6)  1/29: User reported that jobs submitted to ANALY_LONG_BNL_ATLAS were staying in the queue for several days.  Large number of jobs waiting to run in the queue, 
    so not really a site problem.  CREM policy restricts the amount of resources Tier-1s may allocate to analysis, so not much to be done here.  
    See https://ggus.eu/ws/ticket_info.php?ticket=78736 for details. 
    7)  1/31: LCGR database downtime at CERN.  Details: http://itssb.web.cern.ch/planned-intervention/lcgr-database-migration-and-upgrade-oracle-11gr2/31-01-2012.  
    Most of the work was completed by around 2:00 p.m. CET, except for the DDM dashboards, which took longer to come back on-line.  eLog 33429.
    8)  1/31: During a downtime for the VOMS service at CERN the automatic rollover to the BNL service was failing.  Traced to known problems with some versions of the voms 
    client (newer versions are better) that do not properly fail over to a second voms server if the first attempted is down.  Issue resolved by ~8:30 a.m. CST.  Most US analysis 
    sites were auto-excluded during this time.  eLog 33407.
    9)  1/31: UTD-HEP - requested that the site be set off-line while a failed disk is being replaced.  eLog 33409,
    https://savannah.cern.ch/support/index.php?126004 (Savannah site exclusion).
    Follow-ups from earlier reports:
    (i)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    Update 1/17: ggus tickets 78324/40 opened - closed since this issue is being tracked in ggus 77729.
    (ii)  1/13: UTD-HEP - dbRelease file transfer failing ("[GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory `/at/atlashotdisk/ddo/DBRelease/v170602/ddo.000001.
    frozen.showers.DBRelease.v170602': Permission deniedRef-u usatlas1 /bin/mkdir /at/atlashotdisk/ddo/DBRelease/v170602/ddo.000001.frozen.showers.DBRelease.v170602]").  
    ggus 78217 / RT 21544, eLog 33005.  Site requested to be set off-line on 1/14 - eLog 33028, https://savannah.cern.ch/support/index.php?125621.
    Update 1/27: The missing DBrelease file was eventually copied to the site. Whatever issue caused the problem appears to be resolved.  Closed ggus 78217 / RT 21544.  
    Test jobs successful - site set back on-line in panda.  eLog 33298.
    (iii)  1/19: NET2 - job failures with "lost heartbeat" errors.  Issue under investigation.  ggus 78396, eLog 33118.
    Update 1/28: No recent job failures of the type described in the ticket - ggus 78396 closed, eLog 33347.
    • New pilot released last week - encourage sites to take a look.
    • Increased threshold for # DDM deletion errors
    • LCGR database upgrade yesterday - monitoring affected, especially the DDM dashboard.
    • VOMS outage at CERN - automatic rollover of some clients failed. Analysis queues were auto-excluded.
    • Old sites in Panda being removed
    • Do we expect a burst of activity for Moriond?

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting:
    • Shawn and Michael will attend LHCONE/LHCOPN meeting at LBNL, 30-31 January. Will also discuss moving forward with US LHCNet.
    • LHCONE is making progress with a new architecture, away from the VLAN architecture. A couple of sites are ready, once the providers (ESnet and I2) are ready. The difficult part was making it work between regions (e.g. routing loops resulted).
  • this meeting:
    • See throughput meeting notes from last week.
    • Note: 10G is still the goal. Get these ordered.
    • UC latency node has been cleaned up.
    • Check the traceroute matrix - all sites should check this. Many sites are listed as "unknown". (A simple traceroute sweep sketch follows below.)
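
For sites showing up as "unknown", a crude local check is to run traceroute from the latency host to each peer and see whether the path resolves at all. A rough sketch; the peer host names are placeholders, not the actual perfSONAR mesh.

    # Crude traceroute sweep to peer latency hosts; host names are placeholders.
    import subprocess

    PEERS = ["ps-latency.site1.example.org", "ps-latency.site2.example.org"]

    for host in PEERS:
        try:
            out = subprocess.check_output(["traceroute", "-m", "20", host],
                                          stderr=subprocess.STDOUT, timeout=60)
            print("== %s ==\n%s" % (host, out.decode(errors="replace")))
        except Exception as exc:
            print("== %s == traceroute failed: %s" % (host, exc))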

Federated Xrootd deployment in the US (Wei)

last week(s) / this week:
  • MinutesFedXrootdJan25
  • Hiro's tool will be in the next release of dq2-client
  • Waiting for Lukas to look at bug fixes
  • More information on N2N memory issue from users at SLAC

Tier 3 GS

last meeting:
  • Michael notes that Armen wanted to contact T3s to retire DATADISK tokens. Action item to follow up.
this meeting:
  • Doug notes there will be a security test against Tier 3's.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Queues at BNL were moved to APF - autopilot turned off - a good step forward for moving the entire region to APF. Hiro has been developing a new lsm with nice extensions for supporting alternate protocols; http and xrootd are installed. It's now comprehensive and currently in test, with failovers as appropriate. The goal is to eliminate failures due to missing input files. As a last resort it can even grab from a remote site, including the federation.
    • this meeting:

  • AGLT2:
    • last meeting(s): downtime yesterday - firmware updates to Dell switches; dCache 1.9.12-15 updated. VMWare cluster problems - a machine dropped out.
    • this meeting: Will integrate a Dell S4810 switch. Working on virtualizing servers (e.g. the Oracle calibration database; investigating performance). Would like to virtualize the dCache admin nodes. Turned on eight new Dell blade servers (24 cores each). A few PE1950s retired.

  • NET2:
    • last meeting(s): cvmfs testing completed at HU, successfully. Had some slowdown of HU nodes - a known puppet bug - will make a note.
    • this meeting: 500 new job slots at HU, 800 more at BU coming. New storage arrived, except without disks. New bandwidth node.

  • MWT2:
    • last meeting(s): Had problems with the KVM bridge dropping out - caused interruptions; the server is back up. Using CVMFS to export $APP, etc., to worker nodes. Procurement on track.
    • this meeting: Eight 12-core pilot nodes running from MWT2 via Condor flocking, using Sarah's cvmfs repo. Feb 7.

  • SWT2 (UTA):
    • last meeting(s): Going well over last two weeks; partial shutdown expected for Saturday morning, construction work, will affect a subset of compute nodes. Progressing with procurements and CVMFS updates.
    • this meeting: Focusing on CVMFS

  • SWT2 (OU):
    • last meeting(s):
    • this meeting: Preparing for downtime next week.

  • WT2:
    • last meeting(s): Running smoothly - but lost power at SLAC, fully back now. Were down for about 16 hours.
    • this meeting: All is fine.

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and GLOW. Have problems with Engage, since they want gridftp from the worker node. Possible problems with the GLOW VO cert.
  • GEANT4 campaign, SupportingGEANT4
this week


last week
  • The next OSG All Hands meeting is coming up - see coordinates above. We are discussing a joint session at this meeting; hope that many will join us there, our next f2f meeting. Points of common interest with USCMS - TEG, federated Xrootd, etc. - in the joint sessions. Also OSG software, operations, and hardware developments, and networking (e.g. LHCONE prospects). If you have other areas of interest please send them to the list.
this week

-- RobertGardner - 31 Jan 2012

Attachments:
  • LSM.pptx.pdf (67.2K) - RobertGardner, 01 Feb 2012 - 12:39
  • screenshot_02.png (189.5K) - RobertGardner, 01 Feb 2012 - 12:40