
MinutesMar28

Introduction

Minutes of the Facilities Integration Program meeting, Mar 28, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Bob, Fred, Rob, Lincoln, Nate, Michael, Saul, Torre, Shawn, Wei, Mark, Kaushik, Patrick, Booker, Horst, Alden, Armen, Tom, Hiro
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:

  • For reference:
  • Program notes:
    • last week(s)
    • this week
      • Nearing the end of the quarter; SiteCertificationP20 will therefore need updating
      • Enabling multicore scheduling in the US Facility. More resources are needed to test out the environment for AthenaMP. Take this as an opportunity to step up and provide this service, dynamically adjusting resources according to need. We'll need to work up recipes for various sites and collaborate on this. Farm groups at BNL are starting on a recipe for Condor (a minimal sketch appears after this list).
      • VmemSitesSurvey - nearly filled in. Don't want to provide high mem queues initially.
      • LFC consolidation, report is in: ConsolidatingLFCStudyGroupUS, report document
        • Hiro - will need to look at it this week. Patrick.
        • Michael - the report is good; distributed leadership will help provide guidance to the facility
      • Capacity summary:
        • MWT2 - will ramp to 2.4 PB within April. Presently 1.8 PB usable.
        • AGLT2 - 2160 TB online. An additional 90 TB coming at MSU, for 2250 TB total.
        • NET2 - 1.2 PB; two additional racks will make 2.2 PB; under test. Will be ready well before end of April.
        • SWT2 - waiting for the UPS to be set up and for additional power. Racks are in place to reach the 2.2 PB pledge; equipment still to be installed. Awaiting final UPS hook-up, April 2. Meeting the end-of-April target will be tight; not sure.
        • WT2 - only 50 TB short (2150 TB usable available now); no immediate plans to purchase storage. Building up an area for low-density storage. Will be ordering 10 TB of SSDs.
        • BNL - both CPU and disk at pledge level
  • All hands meeting follow-on comments
    • Good meeting, US ATLAS-specific issues
    • Joint sessions with US CMS, and common all-hands meeting agenda - seemed to work well.
    • Redefinition of production operations (Kaushik and Michael)
    • Usage and accounting for resources beyond pledge - will need to follow this up, Kaushik will be leading this
    • Optimization of resources between Tier 1 and Tier 2 - as discussed at the last review. Raised at the ATLAS ICB - needs to be followed up. Will be brought up at CREM. There is a lack of accounting of the general use of resources, which will lead to new accounting requirements. There is also a monitoring group within ADC collecting requirements - it will be an ATLAS-wide activity, and it is underway.
    • Site performance metrics - we discussed this a few weeks ago; Michael and Rob are working out the details and will present a proposal soon.
    • Network monitoring services - should we widen our work beyond US ATLAS? Shawn, Jason, Lothar B, Miron discussed this at a side-meeting.
    • WLCG TEG groups are ramping up their activities - recommendations and draft reports are coming together. The data management and storage management TEG outlines a number of recommendations; it might be interesting to go over them. Michael will circulate the draft, with a short discussion at the next meeting.
    • LHCONE is making good progress - we have peering between LHCONE and ESnet at SLAC and BNL; initial measurements to Naples showed no improvement, but things are now coming along. Would like to see at least two other sites (e.g. the AGLT2 sites) join this activity soon.
    • LHC is getting close to providing collisions
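
As a starting point for the Condor recipe mentioned above, here is a minimal sketch of how a site could enable multicore scheduling with partitionable slots. It is illustrative only - the slot layout, memory figure, and 8-core request are assumptions, not a tested site configuration.

    # condor_config.local on the worker nodes (sketch; adjust per site)
    # A single partitionable slot owns all cores/memory; jobs carve off what they request.
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%,memory=100%,disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # Submit-file fragment for an 8-core AthenaMP-style job (illustrative values)
    # request_cpus   = 8
    # request_memory = 16000

With partitionable slots, single-core production and multicore jobs can coexist on the same hardware; sites preferring whole-node scheduling would instead dedicate one slot per node.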

64bit Python and LFC clients - recent problems at SLAC

last time
  • SLAC and UTA have direct reading for analysis jobs in a 32-bit worker-node environment, with the SFN converted to a URL. This works for most analysis jobs, but a certain number of prun jobs require 64-bit python; the pilot then runs under 64-bit python and therefore cannot load the 32-bit LFC client bindings.
  • Is it a problem with the version of the worker node client? Wei - running a relatively old version.
  • Marco describes what happens with the current install of wn-client.
this meeting:
  • Quick follow-up.
  • Resolved - upgraded to a later version of wn-client; Xin installed the latest ATLAS wn-client, which includes dq2 1.0 (disabled) and python 2.6. All is working now.
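
For reference, the mismatch behind this issue can be shown with a small diagnostic run on a worker node. This is a minimal sketch: the lfc module is the standard LFC python binding shipped with the wn-client; everything else here is illustrative.

    # check_lfc_bindings.py - minimal sketch, not part of the pilot
    import platform

    # Report the interpreter's word size, e.g. 64bit
    print("python build: %s" % platform.architecture()[0])

    try:
        import lfc          # LFC client bindings from the wn-client
        print("LFC bindings loaded OK")
    except ImportError as err:
        # Typical failure when a 64-bit python meets 32-bit-only bindings
        print("LFC bindings not loadable: %s" % err)

With the older 32-bit-only installation this fails under 64-bit python with an ImportError; with the updated wn-client (python 2.6 plus matching bindings) it loads cleanly.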

Follow-up on CVMFS deployments & plans

last meeting:
  • UTA - very close to converting the SWT2 cluster to CVMFS. All tests were passing successfully - got sidetracked by a network issue.

this meeting:

  • Done. No reason to keep the NFS-installed releases.
  • Complete

rpm-based OSG 3.0 CE install

last meeting
  • In production at BNL.
  • Horst claims there are two issues: RSV bug, and Condor not in a standard location.
  • NET2: Saul: have a new gatekeeper - will bring up with new OSG.
  • AGLT2: March 7 is a possibility - will be doing upgrade.
  • MWT2: on near term agenda.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated by the installation of a new UPS - expecting delivery, which will require a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.
this meeting
  • Any updates?
  • AGLT2
  • MWT2 - in production DONE
  • SWT2
  • NET2
  • WT2

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Running ~50%. Expect another week before mc12 starts, so still a very good time for downtimes.
    • The crunch will start after that.
  • this meeting:
    • Completely full, mc12 keeping us busy
    • Alden has implemented MOU shares in schedconfig; the US is showing 20%, which is a little low.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=183120
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-3_19_2012.html
    
    1)  3/14: User reported that his cert was not included in the BNL VOMS server (but was o.k. in the CERN VOMS server).  ggus 80264 in-progress.
    2)  3/17: MWT2 - low efficiency for production due to large number of "lost heartbeat" errors.  From Sarah: MWT2 gatekeeper failed Saturday morning, 
    and we took the opportunity to move to a new gatekeeper machine running OSG 3.0. In the process some jobs were lost. In addition the certs on our 
    GUMS server expired later on Saturday, and were replaced this morning.  No additional errors of this type as of 3/19 - ggus 80337 closed, 
    eLog 34438/45.
    3)  3/18 p.m.: NET2 queues set off-line in preparation for a storage upgrade the next day.  3/19: queues gradually coming back on-line as of ~7:00 p.m. CST.  
    Post-upgrade two ggus tickets were opened: 80389 for checksum errors at BU, and 80390 for job failures at HU due to non-access to /gpfs1/... storage 
    area.  Both issues resolved, and the ggus tickets were closed.  eLog 34446/47.
    4)  3/19: UTD-HEP - DDM errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  ggus 80364 in-progress, eLog 34434, 
    blacklisted in DDM: https://savannah.cern.ch/support/index.php?127205.  (This Savannah site exclusion ticket was unneeded, as the site issues already 
    covered in 126767 - so the ticket was closed.)
    5)  3/19: New pilot version from Paul (52a).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_52a.html
    6)  3/20: AGLT2 - file transfer errors ("File does not have expected length").  Bob reported that a dCache server had to be rebooted - this appeared to 
    fix the problem.  ggus 80393 closed, eLog 34465.
    7)  3/21: Xin requested that the panda site BNL_ATLAS_1 (the old primary production queue at BNL) be kept off-line until further notice.  eLog 34470.
    
    Follow-ups from earlier reports:
    
    (i)  2/29: UTD-HEP set off-line to replace a failed disk.  Savannah site exclusion: https://savannah.cern.ch/support/index.php?126767.  As of 3/7 site 
    reported the problem was fixed.  Test jobs are failing with the error "Put error: nt call last): File , line 10, in ? File /usr/lib/python2.4/site-packages/XrdPosix.py, 
    line 5, in ? import _XrdPosix ImportError: /usr/lib/python2.4/site-packages/_XrdPosixmodule.so: undefined symbol: XrdPosix_Truncate."
    Under investigation.  eLog 34259.
    Update 3/12: ggus 80175 was opened for the site due to test jobs failing with the error shown above.  Closed on 3/13 since this issue is being tracked 
    in the Savannah ticket.
    (ii)  3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period).  ggus 79827 in-progress, eLog 34150.
    Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214.  eLog 34336.
    (iii)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which hosts 
    gridftp & SRM services being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.  
    Tickets cross-referenced.  System marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  DDM blacklist ticket:
    https://savannah.cern.ch/support/index.php?127055
    
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=184081
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-3_26_2012.html
    
    1)  3/21: BNL_CVMFS_1 job failures ("pilot: cmtsite command was timed out: 1, [Errno 3] No such process (timed out)").  Issue understood and 
    fixed - from Michael: The problem is resolved. It was caused by temporary high load on an AFS volume. Because of increased response times 
    commands were timing out. We are in the process of replicating this (and other) volume(s) to spread the load.  ggus 80505 closed, eLog 34497.
    2)  3/24-3/26: MWT2 wasn't receiving jobs from panda brokering due to an issue with release 17.2.0.2 (i.e., didn't appear in the BDII).  Issue with 
    the ATLAS s/w release installation system being addressed (Alessandro involved).  More details in the e-mail thread. 
    3)  3/26: DDM dashboard upgrade / bug fixes.  See:
    http://www-hep.uta.edu/~sosebee/ADCoS/DDM-dash-v2M3_2.html
    4)  3/26: USATLAS HyperNews at https://www.usatlas.bnl.gov/HyperNews/racf/login.pl will be shutdown due to software security issues. It is not 
    known if/when HyperNews could be brought back on-line.   
    5)  3/26-3/27: AGLT2 - site was set off-line for a few hours to work on an issue with iSCSI access.  Once the problem was fixed HC test jobs for 
    the analysis queue were not getting picked up by pilots.  From Tadashi: The pilot didn't pick up HC jobs, which didn't have countryGroup=us, due 
    to the beyond-pledge stuff. I've disabled it for HC pilots.  Test jobs then completed - site back on-line.
    6)  3/27: SLACXRD-DATADISK was ticketed as being full in https://savannah.cern.ch/support/?127429.  Another monitoring page 
    (http://bourricot.cern.ch/dq2/media/fig/SLACXRD_DATADISK_30.png) did not confirm this.  Possibly a transient issue.  (Wei reported having upgraded 
    xrootd the same day, but the times didn't seem to coincide.)  Savannah ticket closed.  eLog 34665.  Early a.m. 3/28: Unrelated ggus ticket 80686 
    opened for well-known issue of slow transfers between US & FR clouds.  This ticket will be closed.
    7)  3/27: SWT2_CPB - file transfers were failing with the error "failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]."  
    A storage server had to be rebooted - issue resolved.  ggus 80677 & duplicate 80678, RT 21845/46 closed.  eLog 34651.
    8)  3/28: ggus 80702 was opened for "BNL-OSG2_DATADISK: transfers from/to CERN failing."  Hiro pointed out this is actually a problem on the 
    CERN side (transfers between CERN and other clouds are also failing with similar errors).  Ticket reassigned to CERN support.  eLog 34692.
    
    Follow-ups from earlier reports:
    (i)  2/29: UTD-HEP set off-line to replace a failed disk.  Savannah site exclusion: https://savannah.cern.ch/support/index.php?126767.  As of 3/7 site 
    reported the problem was fixed.  Test jobs are failing with the error "Put error: nt call last): File , line 10, in ? File /usr/lib/python2.4/site-packages/XrdPosix.py, line 5, 
    in ? import _XrdPosix ImportError: /usr/lib/python2.4/site-packages/_XrdPosixmodule.so: undefined symbol: XrdPosix_Truncate."
    Under investigation.  eLog 34259.
    Update 3/12: ggus 80175 was opened for the site due to test jobs failing with the error shown above.  Closed on 3/13 since this issue is being tracked 
    in the Savannah ticket.
    (ii)  3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period).  ggus 79827 in-progress, eLog 34150.
    Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214.  eLog 34336.
    (iii)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which hosts 
    gridftp & SRM services being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.  
    Tickets cross-referenced.  System marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  DDM blacklist ticket:
    https://savannah.cern.ch/support/index.php?127055
    (iv)  3/14: User reported that his cert was not included in the BNL VOMS server (but was o.k. in the CERN VOMS server).  ggus 80264 in-progress.
    Update 3/21 from John Hover at BNL: Should be resolved now. There have been issues with synchronization. The user now appears in BNL VOMS, 
    with the proper attributes.  ggus 80264 closed.
    (v)  3/19: UTD-HEP - DDM errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  ggus 80364 in-progress, eLog 34434, 
    blacklisted in DDM: https://savannah.cern.ch/support/index.php?127205.  (This Savannah site exclusion ticket was unneeded, as the site issues already covered 
    in 126767 - so the ticket was closed.)
    
    • Things have been smooth, few issues in the US cloud.
    • Version 2 of the DDM dashboard is out - sites are encouraged to try it out and provide feedback.
    • BNL is shutting down HyperNews due to software security issues.
    • AGLT2 went offline; HC jobs were sent but not picked up by the pilot - a change was needed in the Panda server, related to beyond-pledge resources. Seems resolved, but watch for this at other sites.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei)

last week(s) this week:
  • Working on bringing federation from R&D to production.
  • Inform the pilot to use the federation
  • Dashboard will use X509 - then will convert all sites; discussing with Marco
  • N2N into the GIT repo - Hiro
  • Reminder about April 11-12

Tier 3 GS

last meeting: this meeting:

Site news and issues (all sites)

  • T1:
    • last meeting(s): Intervention planned for next Tuesday - a border router change that would affect production in the US, primarily affecting OPN connectivity. Moving to the Nexus generation.
    • this meeting:

  • AGLT2:
    • last meeting(s): March 7 down day. Will move network switching around. Change back to single-rapid spanning tree. Reconfigured space tokens. Will be testing OSG wn rpm.
    • this meeting: Working on local network configuration - need to get switches talking correctly, may need a downtime. May need to re-stack switches.

  • NET2:
    • last meeting(s): Downtime soon - 900 TB racked and ready. New gatekeeper, file server. Seeing CVMFS failure every two or three days; annoying since you have to drain the nodes. New worker nodes up at HU and BU. Bestman2: brought it up, had trouble restarting so needs more work; on agenda
    • this meeting: Had a downtime to upgrade NFS server and mount points on BU side. Storage upgrade in progress - 2 racks, 3 TB drives.

  • MWT2:
    • last meeting(s): Continuing to bring new compute nodes online (60 R410), gatekeeper problems earlier this week, updating. Waiting on hardware to arrive at the Illinois CC. Using existing pilot hardware. CC folks did major reconfiguration to accommodate MWT2. High connectivity up to Chicago. UIUC integration proceeding: squid and condor head nodes setup. Adding CC nodes should be straightforward.
    • this meeting: Illinois CC - hardware has arrived, 16 nodes, being racked. 4 loaner nodes in the CC are integrated with MWT2, running analysis and production. Perfsonar nodes available. The CC reconfiguration to accommodate us needs one more piece for the networking, for higher capacity and for an eventual upgrade to 100 Gbps.

  • SWT2 (UTA):
    • last meeting(s): CVMFS on UTA_SWT2 is done - waiting on completion of Alessandro's validation tests. Convert UTA_CPB by end of week. Have had an issue transferring files back to BNL - surprisingly. Also slow transfers coming in. Discussions with Hiro and campus networking folks. Perfsonar history does show problems about three weeks ago, and deteriorated last week. Still don't have a good answer as to what changed. Continue investigations.
    • this meeting: CVMFS installed at CPB, back online Friday no problems. Data server issue caused issues with XrootdFS. Bringing online storage. R310's received; working on optics (LR or SR).

  • SWT2 (OU):
    • last meeting(s): Came out of downtime last week. All looks good, pleased with storage performance. 1100 MB/s for gridftp IO. 250-350 MB/s on average to BNL. Started filling up over the weekend. The Lustre servers work better as the cluster fills. Handling a large number of analysis jobs.
    • this meeting:

  • WT2:
    • last meeting(s):
    • this meeting: Ordering 10 Gbps network monitoring machines; operations are smooth. One incident over the weekend - a USERDISK area exceeded the 32K subdirectory limit and had to be purged (a simple check is sketched below).
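
The 32K figure matches the classic ext3 limit of roughly 32,000 subdirectories per directory (the link-count cap), assuming that is the filesystem in play. A minimal sketch of a periodic check that could flag directories approaching that limit follows; the path and threshold are illustrative assumptions, not the actual WT2 layout.

    # subdir_limit_check.py - minimal sketch; path and threshold are examples only
    import os

    TOP = "/atlas/userdisk"   # illustrative path
    LIMIT = 31000             # warn before the ~32K ext3 subdirectory cap

    for root, dirs, files in os.walk(TOP):
        if len(dirs) > LIMIT:
            print("%s has %d subdirectories (near the ext3 limit)" % (root, len(dirs)))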

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and Glow. Have problems with Engage - since they want gridftp from the worker node. Possible problems with Glow VO cert.
  • GEANT4 campaign, SupportingGEANT4
this week

AOB

last week this week


-- RobertGardner - 27 Mar 2012
