


Minutes of the Facilities Integration Program meeting, Feb 15, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line: (6 to mute) - announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755


  • Meeting attendees: Booker Bense, Michael, Rob, Saul, Wei, Sarah, Shawn, Patrick, Bob, Mark, Dave, Hiro, John, Tom, Fred, Horst, Armen, Kaushik, Alden, Nate
  • Apologies:
  • Guests: Jason (I2)

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
      • OSG All Hands meeting: March 19-23, 2012 (University of Nebraska at Lincoln). Program being discussed. As last year, part of the meeting will include co-located US ATLAS sessions and a joint US CMS/OSG session.
  • For reference:
  • Program notes:
    • last week(s)
      • ConsolidatingLFCStudyGroupUS
      • OSG All Hands registration is open: Dear OSG Community Member, We would like to remind you that registration for the March 19-22nd OSG All Hands Meeting is now open at http://hcc.unl.edu/presentations/event.php?ideventof=5 . We look forward to seeing many of you there. We are also encouraging you to submit posters for the poster session on Tuesday, March 20. These should be no larger than 3ft X 4ft. There will also be tables available for laptop displays either with a poster or standalone. If you are interested in submitting a poster, please email an abstract to osg-ahm-program@opensciencegrid.org by March 1, 2012. The program committee will be reviewing the submissions (we can accommodate a maximum of about 20). Any work related to OSG will be considered. Priority will be given to new applications recently ported to the OSG or work directly involving students. We are also looking for suggestions for a few five-minute "lightning presentations" on the plenary day - Wednesday, March 21. We suggest one of two alternative topics: "The neatest, most reusable tool my site or community uses on OSG" or "What OSG should be for the user in 2016". Please send suggestions to osg-ahm-program@opensciencegrid.org. We will be posting more details of the agenda very shortly. Regards, David Swanson, Host of the OSG All Hands Meeting, 402-472-5006
        • Additionally, a request from Brian Bockelman for site admin technologies.
      • GuideToJIRA
    • this week
      • ConsolidatingLFCStudyGroupUS
      • OSG All Hands meeting, https://indico.fnal.gov/conferenceTimeTable.py?confId=5109#20120319
        • Monday morning: US ATLAS Facilities
        • Monday afternoon: Federated Xrootd w/ CMS
        • Tuesday morning: plenary sessions on Campus grid and cloud
        • Tuesday afternoon: Joint CMS session on WLCG summaries
        • US ATLAS Facilities at OSG AH (tentative): https://indico.cern.ch/conferenceDisplay.py?confId=178216
      • GuideToJIRA - will be used to coordinate perfSONAR and OSG 3.0 deployments this quarter.
      • Funding agencies are requesting metrics from the facility. These are diverse: capacity deployment against the deadline, analysis performance, the validation matrix we used to use, and the site certification matrix for the phase, which must be kept up to date. The matrix: SiteCertificationP19.
      • Deployment of capacities - to be followed up.
      • 10G perfSONAR to be deployed by the end of the quarter.
      • OSG 3.0 deployment
      • LHCONE - LBNL working meeting; BNL is directly connected, and the ESnet VRF zone is implemented. Should all just work. Close to being able to use the infrastructure.
      • Planned interventions should all be complete by the end of March.

Follow-up on CVMFS deployments & plans

last meeting:
  • UTA - has gatekeeper and CE installed; will be added into OIM, needed for Alessandro. Expect to go live next week with validation jobs.
  • OU - on track for Feb 10
  • NET2: now transitioned. One leftover problem - the analysis queue on the BU side is in brokeroff; Squid not being advertised correctly? A post-fix to the CVMFS install?

this meeting:

  • OU - it's deployed; will be there when the site comes back up, today or tomorrow, and validation will start.
  • UTA - ran into an issue with the test queue; in place now. Need Xin and Alessandro to get the queue set up correctly on the test gatekeeper - looks good.
  • All other sites have already deployed. (A minimal worker-node validation sketch follows below.)
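For sites still validating, a minimal worker-node check can be scripted along the lines below. This is only an illustrative sketch: it assumes the standard /cvmfs/atlas.cern.ch mount point and the cvmfs_config utility from the CVMFS client, and is not part of the official deployment instructions.

    #!/usr/bin/env python
    # Sketch: verify that CVMFS is mounted and healthy on a worker node.
    # Assumes the standard ATLAS repository path and the cvmfs_config utility.
    import os
    import subprocess
    import sys

    REPO = "/cvmfs/atlas.cern.ch"   # standard ATLAS CVMFS repository path

    def main():
        # 1. Is the repository visible and non-empty? (autofs mounts it on access)
        if not os.path.isdir(REPO) or not os.listdir(REPO):
            print("FAIL: %s is not mounted or is empty" % REPO)
            return 1
        # 2. Ask the CVMFS client to probe all locally configured repositories.
        rc = subprocess.call(["cvmfs_config", "probe"])
        if rc != 0:
            print("FAIL: cvmfs_config probe returned %d" % rc)
            return rc
        print("OK: CVMFS looks usable on this node")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Running it once per worker node (e.g. via the batch system or pdsh) gives a quick pass/fail survey before turning validation jobs loose.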

rpm-based OSG 3.0 CE install

last meeting
  • OU - CE and Bestman SE are both looking good. Not quite ready to convert to production, though.
  • MWT2 - RSV installation, bug found.
  • BU - new gatekeeper, expect to be installing in about two weeks.
  • No other immediate scheduled installs.
this meeting
  • Mirror of the EPEL repo - needs follow-up (a mirroring sketch follows below).
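One possible shape for the EPEL mirror follow-up is a small cron-able wrapper around reposync and createrepo, as sketched below. The repo id and destination directory are local choices made up for illustration, not anything agreed at the meeting.

    #!/usr/bin/env python
    # Sketch: keep a local EPEL mirror for the rpm-based OSG 3.0 installs.
    # The repo id ("epel") and destination path are assumptions; adjust to the
    # site's yum configuration and web-server document root.
    import subprocess
    import sys

    REPO_ID = "epel"                 # yum repo id to mirror (assumed)
    DEST = "/var/www/html/mirror"    # hypothetical local mirror root

    def run(cmd):
        print("running: %s" % " ".join(cmd))
        rc = subprocess.call(cmd)
        if rc != 0:
            sys.exit(rc)

    if __name__ == "__main__":
        # Pull new/updated packages from the upstream EPEL repository.
        run(["reposync", "--repoid=" + REPO_ID, "--download_path=" + DEST])
        # Rebuild the repodata so clients can point yum at the local copy.
        run(["createrepo", "--update", DEST + "/" + REPO_ID])
        print("mirror updated under %s/%s" % (DEST, REPO_ID))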

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
  • this meeting:
    • Drop-off in production.
    • mc12 - energy to be raised to 8 TeV, but validation continues.
    • Sites should take downtimes ASAP.

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • Changes in Victor (central deletion) - the policy changed to 10% free on DATADISK.
    • SCRATCHDISK - previously 20%; now it's 50% free.
  • this meeting:
    • Storage status looks good. More deletion errors - LFC errors, but these were caught and fixed (AGLT2 and NET2).
    • Allocation of GROUPDISK at Tier 2s is ~400 TB.
    • Discussion about setting quotas - they are in ToA.
    • US ATLAS policy for data management - consistent guidance for sites. Use of the auto-adjuster - sites keep a minimum free level. Shawn's notes are in the Savannah ticket. Action item for the DDM group.

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  2/1: NET2 - DDM errors due to an expired host certificate on atlas.bu.edu.  The new certificate had already been installed, but an SRM restart was needed to pick 
    up the change.  Issue resolved.  eLog 33448.
    2)  2/1: SWT2_CPB_DATADISK: DDM errors ("has trouble with canonical path. cannot access it.").  Issue resolved by restarting the XrootdFS process on the SRM host.  
    eLog 33449.
    3)  2/1: SLACXRD: file transfer errors ("Can't mkdir: /xrootd/atlas/atlas...").  Wei reported the problem was fixed.  ggus 78846 closed, eLog 33514.
    4)  2/2: System for setting analysis queues to enable automatic HammerCloud testing changed.  Now using 'test' rather than 'brokeroff'.  Details in eLog 33458.
    5)  2/4: ANL_LOCALGROUPDISK - file transfer errors ("failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]").  ggus 78915 in-progress, 
    eLog 33543.  (Duplicate ggus ticket 78916 was also opened/closed.)
    6)  2/6: New pilot release from Paul (SULU 50c).  Details here: http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU_50c.html
    7)  2/6 p.m.: SWT2_CPB - file transfer errors ("[TRANSFER_TIMEOUT] gridftp_copy_wait: Connection timed out]").  A gridftp server had become unresponsive, so it was 
    necessary to reboot the machine.  Issue resolved.  ggus 78969 / RT 21657 closed, eLog 33616.
    8)  2/7: MWT2 - large number of "lost heartbeat" job failures.  From Sarah: The scope of the scheduled IU power maintenance yesterday was larger than expected, and 
    we lost power to all IU compute nodes and network equipment for some time. We expect once the lost heartbeat errors clear to return to a low error rate.  ggus 78999 
    in-progress (can probably be closed?) - eLog 33615/42.
    9)  2/8 early a.m.: DDM deletion errors were reported for OU_OCHEP_SWT2.  Old tickets ggus 78325 / RT 21558 re-opened/closed.  Site is in downtime this week for 
    upgrades.  eLog 33624.
    Follow-ups from earlier reports:
    (i)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    Update 1/17: ggus tickets 78324/40 opened - closed since this issue is being tracked in ggus 77729.
    (ii)  1/29: User reported that jobs submitted to ANALY_LONG_BNL_ATLAS were staying in the queue for several days.  Large number of jobs waiting to run in the queue, 
    so not really a site problem.  CREM policy restricts the amount of resources Tier-1s may allocate to analysis, so not much to be done here.  
    See https://ggus.eu/ws/ticket_info.php?ticket=78736 for details. 
    Update 2/4 from Torre: Queue of waiting (activated) analysis jobs at ANALY_LONG_BNL_ATLAS has come down to a reasonable level.  
    See http://gridinfo.triumf.ca/panglia/sites/month.php?SITE=ANALY_LONG_BNL_ATLAS&SIZE=large.  ggus 78736 closed, eLog
    (iii)  1/31: UTD-HEP - requested that the site be set off-line while a failed disk is being replaced.  eLog 33409,
    https://savannah.cern.ch/support/index.php?126004 (Savannah site exclusion).
    Update 2/4: Disk replacement completed - test jobs successful, site set back on-line.  eLog 33476.
    • New pilot released last week - encourage sites to take a look.
    • Increased threshold for the number of DDM deletion errors.
    • LCGCR database upgrade yesterday - monitoring affected, especially the DDM dashboard.
    • VOMS outage at CERN - automatic rollover of some clients failed. Analysis queues were auto-excluded.
    • Old sites in Panda being removed
    • Do we expect a burst of activity for Moriond?
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  2/8: MWT2 - DDM errors such as "SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] TQueued" and "SOURCE error during 
    TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM."  From Sarah: The MWT2 SRM door is experiencing memory issues. 
    We are allocating more ram to the system VM and to the SRM JVM.  Issue resolved - ggus 79027 closed, eLog 33688.
    2)  2/8: SLACXRD file transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  Wei reported the problem was fixed.  
    No further errors - ggus 79018 closed, eLog 33682.
    3)  2/9: MWT2 file transfer errors (" Failed : All Ready slots are taken and Ready Thread Queue is full. Failing request]").  ggus 79080 in-progress, eLog 33683.
    4)  2/10: NET2 - DDM errors with "failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]."  Thousands of errors generated in a relatively short 
    period of time.  Site admins were aware of the issue.  The errors occurred during a brief test for switching over to bestman2 SRM.  eLog 33727.
    Later the same day the SRM transfer errors re-appeared.  From Saul: We have reverted to our bestman1 installation for the
    weekend. There was a large number of errors, but from local tests, we should be OK as of about two hours ago.  ggus 79133 closed, eLog 33741/64.
    5)  2/10: Most US sites ran low on MC production jobs.  This was due to the fact that MC11 production is winding down.
    6)  2/10: DBRelease dataset ddo.000001.Atlas.Ideal.DBRelease.v170901 was delayed getting transferred to SWT2_CPB and SLACXRD.  Jobs failed at the sites with 
    the error "DBRelease file has not been transferred yet."  ANALY_SWT2_CPB auto-excluded for several hours.  Dataset eventually transferred.  eLog 33766, 
    7)  2/11 early a.m.: VO portion of the DDM robot certificate expired, causing transfer errors like "[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic 
    Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]."  Issue resolved as of early afternoon, although it took a while for the change to 
    propagate to all systems.  eLog 33751.
    8)  2/14: New pilot release from Paul (SULU 50d).  Details here:
    9)  2/15 a.m.: All US analysis sites (and also many in other clouds) were auto-excluded.  Issue was due to a cron configuration at CERN such that VOMS information 
    wasn't getting propagated correctly.  Problem fixed.  eLog 33818.
    10)  2/15: NET2 PHYS-TOP - file transfer errors ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]").  ggus 79252, eLog 33820.
    Follow-ups from earlier reports:
    (i)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    Update 1/17: ggus tickets 78324/40 opened - closed since this issue is being tracked in ggus 77729.
    (ii)  2/4: ANL_LOCALGROUPDISK - file transfer errors ("failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]").  ggus 78915 in-progress, 
    eLog 33543.  (Duplicate ggus ticket 78916 was also opened/closed.)
    Update 2/8: All failing transfers succeeded within a few hours after the site fixed the problem. No more errors over the past 24 hours. ggus 78915 closed - eLog 33660.
    (iii)  2/7: MWT2 - large number of "lost heartbeat" job failures.  From Sarah: The scope of the scheduled IU power maintenance yesterday was larger than expected, 
    and we lost power to all IU compute nodes and network equipment for some time. We expect once the lost heartbeat errors clear to return to a low error rate.  
    ggus 78999 in-progress (can probably be closed?) - eLog 33615/42.
    Update 2/10: no additional "lost heartbeat" errors occurring - issue resolved.  ggus 78999 closed.
    • Site state for Panda has changed - sites are put into "test" rather than "brokeroff".
    • Pilot releases from Paul
    • DDM robot certificate - the VOMS extension on the cert expired, causing errors everywhere.
    • Analysis queues were auto-offlined everywhere; the problem was a cron at CERN. Issue now resolved.
    • Alden: Schedconfig failure resulting from a race condition with subversion updates, causing folders to be deleted. This has been fixed.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting:
    • See throughput meeting notes from last week.
    • Note: 10G is still the goal. Get these ordered.
    • UC latency node has been cleaned up.
    • Check the traceroute matrix - all sites should check this. Many sites listed as "unknown".
  • this meeting:
    • Sites need to be getting 10G perfSONAR bandwidth nodes.
    • Traceroute tests - need to check configurations (a quick check sketch follows after this list).
    • LHCONE baseline measured
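A quick way to sanity-check the traceroute side from any site is a small script like the sketch below; the peer host names are placeholders only, and each site would substitute the actual perfSONAR latency/bandwidth nodes of its partners from the mesh configuration.

    #!/usr/bin/env python
    # Sketch: run traceroute toward peer perfSONAR hosts and report which
    # paths complete.  The host names below are placeholders, not the real
    # US ATLAS perfSONAR node names.
    import subprocess

    PEERS = [
        "ps-latency.site1.example.org",      # placeholder
        "ps-bandwidth.site2.example.org",    # placeholder
    ]

    def trace(host):
        # -n: numeric output, -w 2: 2 s per-hop wait, -m 20: at most 20 hops
        p = subprocess.Popen(["traceroute", "-n", "-w", "2", "-m", "20", host],
                             stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        out = p.communicate()[0]
        return p.returncode, out

    if __name__ == "__main__":
        for host in PEERS:
            rc, out = trace(host)
            lines = len(out.splitlines())
            status = "ok" if rc == 0 else "FAILED (rc=%d)" % rc
            print("%s: traceroute %s, %d output lines" % (host, status, lines))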

Federated Xrootd deployment in the US (Wei)

last week(s)
this week:
  • Subscribed datasets at sites
  • Examining a memory leak in the proxy cluster at SLAC; getting ready for a large-scale test
  • dCache xrootd door evaluated; found comparable to dcap
  • Monitoring tool from the Dubna group
  • Hiro has provided a git repo for xrootd configurations and FRM scripts
  • dq2 client 1.0.0 for the global file list (a federation access-check sketch follows below)
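A minimal spot-check that a file is actually reachable through the federation can look like the sketch below. Both the redirector host and the file path are placeholders, not the real US redirector endpoint; substitute a file that the global file list says should exist.

    #!/usr/bin/env python
    # Sketch: copy one known file through the federated xrootd redirector to
    # /dev/null and report success.  Redirector host and file path below are
    # placeholders; use the actual US federation redirector and a real file.
    import subprocess
    import sys

    REDIRECTOR = "xrootd-redirector.example.org"       # placeholder host
    TEST_FILE = "/atlas/dq2/user/example/test.root"    # placeholder path

    if __name__ == "__main__":
        url = "root://%s/%s" % (REDIRECTOR, TEST_FILE)
        # -f: overwrite the (null) destination; we only care whether the read works.
        rc = subprocess.call(["xrdcp", "-f", url, "/dev/null"])
        if rc == 0:
            print("read of %s succeeded" % url)
        else:
            print("read of %s failed (rc=%d)" % (url, rc))
        sys.exit(rc)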

Tier 3 GS

last meeting:
  • Doug notes there will be a security test against Tier 3s.
this meeting:
  • Removal of DATADISK at Tier 3 sites like Wisconsin. Kaushik and Armen wanted to draft something to this effect. Armen believes this is limited to Wisconsin.
  • UTD - do they need DATADISK?
  • Make this a coherent facility policy.

Site news and issues (all sites)

  • T1:
    • last meeting(s):
    • this meeting: Allowing more analysis jobs. Working on a new Condor version with the new group quota implementation (a configuration sketch follows at the end of this section). Whole-node scheduling with Condor. Post software week we will make plans for the facility; there are options (e.g., pilot).

  • AGLT2:
    • last meeting(s): Will integrate the Dell S4810 switch. Working on virtualizing servers (e.g., the Oracle calibration database; investigating performance). Would like to virtualize the dCache admin nodes. Turned on eight new Dell blade servers (24 cores each). A few PE1950s retired.
    • this meeting: Things running well - planning a downtime for network re-configuration and new equipment. Continuing virtualization - all services virtualized and with site redundancy; MSU has hardware on order for this. Bob: working on cfengine to quickly reconfigure worker nodes at both sites; more flexible and quicker to change. Will run in failover mode first, then in a disaster-recovery mode; resilient services are a longer-term goal. Hot failover will be down the road.

  • NET2:
    • last meeting(s): 500 new job slots at HU, 800 more at BU coming. New storage arrived, except without disks. New bandwidth node.
    • this meeting: New workers being installed at BU and HU (~ 1800 worker nodes). Will be trying out OSG 3.0.

  • MWT2:
    • last meeting(s): Eight 12-core pilot nodes running from MWT2 via Condor flocking, using Sarah's CVMFS repo. Feb 7.
    • this meeting: Progress continues on deploying new hardware at all sites. Storage completed at UC. Working on compute nodes at IU and UC. Campus cluster meeting at UIUC focused on the first compute deployment, networking, storage, and head nodes (Condor, Squid). UIUC nodes are flocked to from the main MWT2 queues and integrated in sysview, http://www.mwt2.org/sys/view/. Ran a significant amount of OSG opportunistic work during the drain.

  • SWT2 (UTA):
    • last meeting(s): Focusing on CVMFS
    • this meeting: Xin got the atlas-wn installed.

  • SWT2 (OU):
    • last meeting(s): Preparing for downtime next week.
    • this meeting: Upgrade from hell. Two Dell servers were faulty - iDRAC problems and frequent crashes. Replaced the motherboard, replaced memory, swapped CPUs. The Lustre upgrade went well.

  • WT2:
    • last meeting(s): All is fine
    • this meeting: Discussions with Hiro about a gridftp-only door, bypassing SRM. Experimenting with EOS as an xrootd site.
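Related to the Condor group-quota work mentioned under the T1 entry above: hierarchical group quotas are driven by a handful of configuration knobs, and a small generator like the sketch below can keep them consistent across reconfigurations. The group names and quota numbers are invented for illustration and are not the BNL production settings.

    #!/usr/bin/env python
    # Sketch: emit a Condor accounting-group quota config fragment.  Group
    # names and quota values are illustrative only.
    GROUPS = {
        "group_atlas.prod": 3000,       # production slots (example number)
        "group_atlas.analysis": 1500,   # analysis slots (example number)
    }

    def make_fragment(groups):
        lines = ["GROUP_NAMES = %s" % ", ".join(sorted(groups))]
        for name in sorted(groups):
            lines.append("GROUP_QUOTA_%s = %d" % (name, groups[name]))
            # Allow a group to borrow idle slots from the others when it has demand.
            lines.append("GROUP_ACCEPT_SURPLUS_%s = True" % name)
        return "\n".join(lines) + "\n"

    if __name__ == "__main__":
        # In practice this would be written to a file under /etc/condor/config.d/
        # and picked up with condor_reconfig.
        print(make_fragment(GROUPS))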

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and GLOW. Having problems with Engage, since they want gridftp from the worker node. Possible problems with the GLOW VO cert.
  • GEANT4 campaign, SupportingGEANT4
this week



-- RobertGardner - 14 Feb 2012
