
MinutesJan4

Introduction

Minutes of the Facilities Integration Program meeting, January 4, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line: (6 to mute) ** announce yourself in a quiet moment after you connect **
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Michael, Bob, Rob, Nate, Patrick, Saul, Nate, Sarah, Hari, Jason, John, Armen, Mark S, Dave, Shawn, Wei, Fred, Mark N, Horst, Alden, Tom, Xin
  • Apologies:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Friday (1pm CDT, bi-weekly - convened by Rob): Federated Xrootd
  • Upcoming related meetings:
      • OSG All Hands meeting: March 19-23, 2012 (University of Nebraska at Lincoln). Program being discussed. As last year, part of the meeting will include a co-located US ATLAS session and a joint US CMS/OSG session.
  • For reference:
  • Program notes:
    • last week(s)
      • There is a new group working on IO performance, see: https://indico.cern.ch/conferenceDisplay.py?confId=166930. There will at some point be results and studies from this group that should benefit performance in the facility.
      • Capacity updates for end of 2011 will be needed.
      • Michael - we do need to work on the CVMFS deployment - see below.
    • this week
      • Integration program sketch for the quarter (FY12Q2, January 1 - March 31, 2012):
        • Complete CVMFS deployment (January)
        • Finish CA procurements
        • Readiness for LHC restart of operations (proton-proton April 1); 2012 pledges fully deployed
        • OSG CE rpm-based update
        • Hammer Cloud on OSG ITB
        • PerfSONAR-PS: 10G upgrade?
        • Tier2D
        • FAX milestones - security, functional modes, analysis performance, monitoring (TBD)
        • Opportunistic access milestone across US ATLAS sites (TBD)
        • Deployment and evaluation of APF at a T2 and a Tier 3 (local pilot)
        • Illinois integration with MWT2
        • OSG AH meeting in March - co-locate next facilities workshop
        • Other integration tasks foreseen?
          • data management? cloud?
        • ADC discussions on LFC consolidation: this is underway at CERN (the Dutch and Italian LFCs are now run at CERN). Possible consolidation of T2 LFCs at the T1 (BNL). Create a short-lived study group to evaluate the pros and cons of a consolidation to determine the next step. Have a report/conclusion for the upcoming S&C week in March.
        • Move DQ2 site services at BNL to CERN - Hiro's proposal.
        • Need clear milestones - even if they go beyond the quarter
      • Need to flesh out a US ATLAS cloud computing activity which meshes with ATLAS and OSG (Alden)

Progress on procurements

last meeting:
  • Interlagos machine - 128 cores - Shuwei's diverse set of tests shows poor performance. Not usable for us. There is an effort to evaluate RHEL6, which is highly recommended by AMD and Dell, but results are not likely in time.
  • Regarding memory requirements, discussion with Borut: the baseline is still 2 GB per logical core, but expect that high-memory queues will be needed at some point; try
  • AGLT2: POs for equipment at UM have been put in (8 blades per Dell chassis). S4810 Force10 switch. Buying a port at OmniPoP (shared switch, in coordination with the MWT2 sites). MSU - meeting to discuss details.
  • MWT2: working on R410-based compute node purchase at IU and UC. Extending CC at Illinois. OmniPoP switch ports (2 UIUC, 1 UC, 1 IU plus the shared port costs).
  • SWT2: getting orders in for the remaining funds; UPS infrastructure, and a smaller compute node purchase. Purchase of two 10G gridftp doors, but in next phase (Feb, March). OU: three new head nodes, and new storage already purchased, and everything is at 10G.
  • WT2: deployed 68 R410s; will spend more on storage next year plus other smaller improvements. 2.1 PB currently. Will investigate SSDs for highly performant storage. 100 Gbps - Wei will discuss with his networking group.

this meeting:

  • NET2: have about $20K left; ordered storage and replacement servers for HU and BU. Arriving now.
  • Dell pricing matrix has been updated.
  • C6145 eval has completed (Shuwei, Alex Undrus) - ATLAS on SL6, code built with gcc 4.6.2. A single process runs well, but performance is dramatically worse when the node is fully scheduled, in spite of the HS06 rating.

Follow-up on CVMFS deployments & plans

last meeting:
  • OU - January 15
  • UTA - will focus on production cluster first. Will do a rolling upgrade. Expect completion by January 15 as well.
  • BNL - Michael notes that at BNL they have seen multiple mount points from the automounter. They seem to go away eventually. Under investigation by the CVMFS experts; a ticket has been filed. In the process of adding more compute nodes up to the full capacity.
  • WT2 - fully converted. DONE

this meeting:

  • BNL - moved another 2000 job slots to CVMFS. Now running 4000 jobs in the CVMFS queue. The last batch will be moved tomorrow or early next week, completing the deployment. Note: have deployed throughout.
  • HU - ran into a few problems with an unintended kernel update, but this will be fixed shortly. At BU this is a top priority - will be done by January 30 (a simple mount-check sketch follows this list).
  • Are we running into scheduling problems because of missing (unpublished) releases? Xin: did Alessandro's jobs not run? A newer release? Probably. There is a lack of transparency in the process - when validation jobs run, publication, etc. Site admins should be notified - sites individually and cloud-support. Bob claims Alessandro's web page can be set up to send notifications. Bring this up at the ADC meeting.
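  • A minimal verification sketch related to the deployments above - assumptions: worker nodes mount the standard atlas.cern.ch repository, the cvmfs_config utility is installed, and the repository list below is illustrative:

      #!/usr/bin/env python
      # Hypothetical worker-node check: confirm the ATLAS CVMFS repositories
      # mount and respond before a node is handed back to the batch system.
      import os
      import subprocess

      REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch"]  # assumed repository list

      def check_repo(repo):
          mountpoint = os.path.join("/cvmfs", repo)
          # "cvmfs_config probe <repo>" exits non-zero if the repo cannot be mounted
          rc = subprocess.call(["cvmfs_config", "probe", repo])
          return rc == 0 and os.path.isdir(mountpoint)

      if __name__ == "__main__":
          bad = [r for r in REPOS if not check_repo(r)]
          if bad:
              raise SystemExit("CVMFS probe failed for: %s" % ", ".join(bad))
          print("All CVMFS repositories mounted and responding")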

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Difficulties with Panda monitoring - it's being addressed.
    • Production and analysis are chugging along.
    • Is there a problem with the BNL_CVMFS test site? It's taking a large number of jobs - causing problems for Panda. Condor issue at BNL. Need to follow up with Xin or someone at BNL. Most likely an on-going issue - requires manual re-assignments. Hiro will investigate the issue at BNL.
    • In terms of shift coverage, most are covered.
    • Weekend problems - affecting autopilot submission. Triggered a discussion about moving to AutoPyFactory by mid-January. Have discussed local installations of AutoPyFactory - can check out the code.
    • The brokerage issue of last week seems to have been resolved.
    • Perhaps we should clean up sites in the Panda monitor.
    • Send email to Alden
  • this meeting:
    • Sites were auto-offlined yesterday due to an expired proxy (see the proxy-check sketch at the end of this list)
    • Deletion errors at OU - the SRM needs to be updated to Bestman2. It's a timeout error making deletions slow. UTD needs the update as well, and so does BU.
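    • A small proxy-lifetime check of the kind that would have caught the auto-offlining earlier - a sketch only, assuming the standard voms-proxy-info client on the submit host; the 12-hour warning threshold is an arbitrary illustrative value:

        #!/usr/bin/env python
        # Hypothetical submit-host check: warn before the pilot-submission proxy
        # expires, rather than discovering it when queues get auto-offlined.
        import subprocess

        WARN_SECONDS = 12 * 3600  # illustrative warning threshold

        def proxy_timeleft():
            # "voms-proxy-info -timeleft" prints the remaining lifetime in seconds
            out = subprocess.check_output(["voms-proxy-info", "-timeleft"])
            return int(out.decode().strip())

        if __name__ == "__main__":
            left = proxy_timeleft()
            if left <= 0:
                raise SystemExit("Proxy has expired - renew it before pilots stop flowing")
            if left < WARN_SECONDS:
                print("Warning: proxy expires in %.1f hours" % (left / 3600.0))
            else:
                print("Proxy OK: %.1f hours left" % (left / 3600.0))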

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • Generally the storage is in good shape.
    • Deletion errors - more than 10 within a 4-hour period creates a GGUS ticket. Most think this is not worth a ticket?
      • Types of errors: LFC permission errors, usually associated with USERDISK at AGLT2 and NET2. May need help from Hiro and Shawn. Sometimes these seem to get resolved without intervention. Are there remnants left in the LFC? Armen will send a list to Shawn.
      • OU has problems, also due to Bestman1. Same as UTD.
      • The deletion service is not getting a callback from the SRM - getting timeouts instead. These were Bestman1 failures.
      • Wisconsin failures - because they deleted files locally, the service isn't finding them. They then got blacklisted.
    • Storage reporting not working at BU.
    • DATADISK reporting at SLAC is incorrect (too low).
  • this meeting:
    • Expect a number of issues when doing a large USERDISK cleanup campaign.
    • Data management meeting next Tuesday? Will decide next week as otherwise things look good.

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-12_26_2011.html
    
    1)  12/22: Panda monitor slow response - issue resolved.  eLog 32551.
    2)  12/22: AGLT2 - SRM issue related to versions of postgresql RPM's.  More details in message from Shawn to prodsys mailing list.  eLog 32557.
    3)  12/23: SWT2_CPB - file transfer errors like "[DDM Site Services internal] Timelimit of 604800 seconds exceeded."  ggus 77727 / RT 21449, eLog 32599.
    4)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587.
    5)  12/24: UTA_SWT2 - job failures due to transfer timeouts of output files.  ggus 77735 / RT    eLog 32588.
    6)  12/26: UTD-HEP - set off-line due to a power outage at the site.  eLog 32600, https://savannah.cern.ch/support/index.php?125433 (Savannah site 
    exclusion ticket).
    7)  12/27:  File transfer failures from CERN-PROD_DATADISK => BNL-OSG2_PHYS-SM.  Hiro noted that the issue was incorrect registration of the files from 
    the dataset in question.  Therefore the issue needs to be fixed on the CERN side.  ggus 77759 in-progress, eLog 32611.
    8)  12/28: From Bob at AGLT2 - possible power/connectivity issue on the campus, affected running jobs, resulting in several thousand "lost heartbeat" errors. 
    
    Follow-ups from earlier reports:
    
    (i)  12/6: Some users reported this error when trying to retrieve (with dq2-get) a file from WISC: "[SRM_INVALID_PATH] No such file or directory."  
    ggus 77088 in-progress.
    (ii)  12/11:  AGLT2 - ggus 77330 opened due to DDM deletion errors at the site (~8400 over a four hour period).  Ticket in-progress - eLog 32317.  ggus 77341 
    also opened for deletion errors at the site on 12/12 - in-progress.  eLog 32326.  Also ggus 77436/eLog 32383 on 12/14.
    Update 12/20 from Armen: Fixes have been made in LFC. The errors are gone.  ggus 77330/41/436 closed.
    (iii)  12/11: NET2 - ggus 77332 opened due to DDM deletion errors at the site (~1050 over a four hour period).  From Saul: Our adler checksumming was 
    getting backed up causing those errors. We added I/O resources and the errors should stop now.  Ticket was marked as solved the next day, and then additional 
    errors were reported ~eight hours later.  Final ticket status is 'unsolved'.  eLog 32318/50.  (Duplicate ggus ticket 77439 was opened/closed on 12/14.)
    Update 12/17: new deletion errors appeared at NET2 (~900 errors during a 4 hour period) - ggus 77332 status changed to 'in-progress'.  eLog 32464.
    Update 12/23: ggus ticket marked as 'resolved'.
    (iv)  12/12: ggus 77361 was opened due to an issue with the reporting of ATLAS s/w releases in BDII, originally for BNL.  Turned out to be a more widespread issue.  
    See the ggus ticket (still 'in-progress') for more details.  eLog 32338.
    (v)  12/12: UTD-HEP - ggus 77382 opened due to DDM deletion errors at the site (~21 over a four hour period).  Ticket 'assigned' - eLog 32351.  (Duplicate ggus 
    ticket 77440 was opened/closed on 12/14.)
    (vi)  12/13: NERSC - downtime Tuesday, Dec. 13th from 7AM-5PM Pacific time.  ggus 77417 was opened for file transfer failures during this time - shifter wasn't aware 
    site was off-line.  Outage didn't appear in the atlas downtime calendar, announcement only sent to US cloud support.  eLog 32373.
    Update 12/15: Still see SRM errors following the outage - ggus 77417 in-progress, eLog 32409.
    (vii)  12/16-12/18: Problem with Apache configuration on PanDA monitor machines - see details in: eLog 32501/03, also ggus 77551 (in-progress).
    Update 12/23 from Xin: Configuration change has been made on the submit hosts, closing ticket (ggus 77551).
    (viii)  12/20: ANL_LOCALGROUPDISK - failed transfers ("failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]").  ggus 77630 in-progress, 
    eLog 32526.
    
    • We are getting overwhelmed with tickets for deletion errors. This is partly because it has been added to the shift operations list.
    • Point 1 shifts ended December 9. Those issues have rolled into ADCOS shifts.
    • There was a discussion about ATLAS releases at BNL - turns out to be a BDII-Panda issue.
    • There have been failures at OU - to be confirmed by Horst.
    • T3 at NERSC: a downtime was announced, but the shifter still filed a ticket. Was there a propagation failure to the downtime calendar?
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    No report available this week.
    
    1)  12/29: NERSC-LOCALGROUPDISK file transfer errors ("Invalid SRM version [] for endpoint").  Issue apparently due to a user without NERSC access attempting 
    to copy data to/from the site.  ggus 77798 closed, eLog 32718.
    2)  12/30: From Bob at AGLT2 - ~2500 jobs will  show up with "lost heartbeat".  Incident started just prior to 8am EST.  I _think_ it was caused by another maintenance 
    window on the r-LSA router at UM.  Two events, two maintenance windows, too strong a coincidence to ignore.
    3)  1/2: OU_OCHEP_SWT2 - ggus 77835 / RT 21497 opened due to DDM deletion errors at the site.  Closed, as these errors are expected to go away once the SRM 
    version is updated at the site later in January.  eLog 32775.
    4)  1/2: Transfers to/from WISC were failing with the error "The host credential has expired."  Issue resolved - Wen installed a new certificate.  ggus 77841 closed, 
    eLog 32719/22.  https://savannah.cern.ch/support/index.php?125163 (Savannah site exclusion).
    5)  1/3: US panda analysis queues set off-line due to a proxy issue on the panda server.  Issue resolved - queues back on-line.  eLog 32759.
    
    Follow-ups from earlier reports:
    
    (i)  12/6: Some users reported this error when trying to retrieve (with dq2-get) a file from WISC: "[SRM_INVALID_PATH] No such file or directory."  
    ggus 77088 in-progress.
    Update 1/2: ggus 77088 marked as 'solved'.
    (ii)  12/12: ggus 77361 was opened due to an issue with the reporting of ATLAS s/w releases in BDII, originally for BNL.  Turned out to be a more widespread issue.  
    See the ggus ticket (still 'in-progress') for more details.  eLog 32338.
    Update 12/30 from Torre: The specific issue addressed by this ticket was solved with an ATLAS-internal workaround (reliance on a cached copy of the release data 
    rather than automatic updating from BDII); since the underlying BDII issues are addressed elsewhere I think this ticket can be closed.  ggus 77361 'solved'.
    (iii)  12/12: UTD-HEP - ggus 77382 opened due to DDM deletion errors at the site (~21 over a four hour period).  Ticket 'assigned' - eLog 32351.  
    (Duplicate ggus ticket 77440 was opened/closed on 12/14.)
    Update 12/24: ggus 77737 also opened for deletion errors at the site - eLog 32692.
    (iv)  12/13: NERSC - downtime Tuesday, Dec. 13th from 7AM-5PM Pacific time.  ggus 77417 was opened for file transfer failures during this time - shifter wasn't aware 
    site was off-line.  Outage didn't appear in the atlas downtime calendar, announcement only sent to US cloud support.  eLog 32373.
    Update 12/15: Still see SRM errors following the outage - ggus 77417 in-progress, eLog 32409.
    (v)  12/20: ANL_LOCALGROUPDISK - failed transfers ("failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]").  ggus 77630 in-progress, 
    eLog 32526.
    (vi)  12/23: SWT2_CPB - file transfer errors like "[DDM Site Services internal] Timelimit of 604800 seconds exceeded."  ggus 77727 / RT 21449, eLog 32599.
    Update 1/2: According to  monitoring experts, this is not a site issue, and should not be reported by shifters.  ggus 77727 / RT 21449 closed, eLog 32706.
    ggus 77819 / RT 21470 opened on 12/30 for the same error - closed with the same explanation.  eLog 32705.  Also ggus 77821 opened/closed at NET2 for 
    the same issue - eLog 32667.
    (vii)  12/23: NET2 - file deletion errors - ggus 77729, eLog 32587/739.  (Duplicate ggus ticket 77796 was opened/closed on 12/29.)
    (viii)  12/24: UTA_SWT2 - job failures due to transfer timeouts of output files.  ggus 77735 / RT 21454, eLog 32588.
    Update 1/2: ggus 77735 / RT 21454 closed due to a lack of additional information needed for troubleshooting.  Didn't really look like a site issue.  There was a 
    known issue with file transfers to the TW cloud during this time, so could have been related.
    (ix)  12/26: UTD-HEP - set off-line due to a power outage at the site.  eLog 32600, https://savannah.cern.ch/support/index.php?125433 (Savannah site exclusion ticket).
    Update 12/29: Power restored, test jobs successful - site set back on-line.  eLog 32639.
    (x)  12/27:  File transfer failures from CERN-PROD_DATADISK => BNL-OSG2_PHYS-SM.  Hiro noted that the issue was incorrect registration of the files from the 
    dataset in question.  Therefore issue needs to be fixed on the CERN side.  ggus 77759 in-progress, eLog 32611.
     

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting:
    • Meeting was last week. Work to get LHCONE early adopters tested. Making sure perfsonar is installed.
    • Tom working on modular dashboard; adding alerts for primitive services. At some point real contact email addresses will be used.
    • Lookup services at some sites have problems.
    • Ian Gable joined from Canadian cloud - most sites will be up. They will appear in the dashboard.
    • Using an R310 for perfSONAR on a 10G host. Have not done the final set of tests. (Dell does not support 10G on the R310; it does on the R610.) Shawn wants to do a 10G-to-10G test before making a recommendation.
    • Canadian sites will all be 10G (w/ X520 NICs) hosts. Will have 1G-10G issues.
    • 2 GB of memory and 4 cores is plenty.
    • How do we transition the US facility? This should be a good activity for Q2.
    • AGLT2-MWT2 regional issues.
  • this meeting:
    • Last meeting was December 20. There are two R310s in the portal, but the price will be reduced. They will support 10G NICs. Still need two boxes? Yes. Sites are requested to deploy 10G-capable perfSONAR hosts by the end of the quarter.
    • LHCONE meeting at Berkeley - Shawn is attending.

Federated Xrootd deployment in the US

last week(s):
this week:
  • Next meeting: 2pm Eastern, next Wednesday

Tier 3 GS

last meeting:
this meeting:
  • UTD: Bestman update needed. Hari notices very long-running jobs. Infinite loop?

Site news and issues (all sites)

  • T1:
    • last meeting(s):
    • this meeting:
      • Holidays were uneventful.
      • VOMS server became stuck
      • Completed deployment of 1 PB of disk; now in the hands of the dCache group - the space will show up soon. Expected delivery of R410s in February.

  • AGLT2:
    • last meeting(s):
    • this meeting: On December 28th and 30th there were possibly some network interventions on campus.

  • NET2:
    • last meeting(s):
    • this meeting: Kernel issue mentioned above. Occasional LFC deletion errors - correlated with Adler32 checksumming? (A checksum sketch follows.) Affects SRM deletions, recovers quickly. New equipment coming, lots of work to do.
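    • For reference, a minimal sketch of the streaming Adler32 file checksum involved here - chunk size is arbitrary, and the zero-padded hex output format is an assumption about what the catalog comparison expects:

        #!/usr/bin/env python
        # Sketch: compute an Adler32 checksum of a file without reading it all
        # into memory, as done during storage/consistency checking.
        import sys
        import zlib

        def adler32_of_file(path, chunk_size=1024 * 1024):
            value = 1  # Adler32 initial value
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xffffffff)

        if __name__ == "__main__":
            print(adler32_of_file(sys.argv[1]))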

  • MWT2:
    • last meeting(s):
    • this meeting: Completed deployment of 540 TB at IU;

  • SWT2 (UTA):
    • last meeting(s):
    • this meeting: working on CVMFS deployment and validation. Long-running jobs: has seen this as well (634638).

  • SWT2 (OU):
    • last meeting(s):
    • this meeting: will continue to work on storage and head node upgrade.

  • WT2:
    • last meeting(s):
    • this meeting: all is smooth.

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and Glow. Have problems with Engage - since they want gridftp from the worker node. Possible problems with Glow VO cert.
  • GEANT4 campaign, SupportingGEANT4
this week

CVMFS deployment discussion

See TestingCVMFS

previously:

  • Wei: instructions seem to ask us to use the squid at BNL. Hardware recommendation? Same as what you're already using.
  • Load has not been observed to be high at AGLT2. Squid is single-threaded, so multi-core is not an issue. Want to have a good amount of memory, to avoid hitting local disk.
  • At AGLT2 - recommend multiple squids; compute nodes are configured not to hit a remote proxy. Doug claims it will still fail over to the stratum 1 regardless (see the configuration-check sketch below).
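  • A small configuration check related to the point above - a sketch only, assuming the client setting lives in the standard /etc/cvmfs/default.local; the squid hostnames in the comment are placeholders:

      #!/usr/bin/env python
      # Sketch: report the CVMFS_HTTP_PROXY setting on a compute node, to confirm
      # it points at local squids rather than going direct (or to a remote proxy).
      # Expected form of the setting (hostnames are hypothetical):
      #   CVMFS_HTTP_PROXY="http://squid1.example.edu:3128|http://squid2.example.edu:3128;DIRECT"
      import re

      CONF = "/etc/cvmfs/default.local"

      def read_proxy_setting(path=CONF):
          with open(path) as f:
              for line in f:
                  m = re.match(r'\s*CVMFS_HTTP_PROXY\s*=\s*"?([^"\n]+)"?', line)
                  if m:
                      return m.group(1)
          return None

      if __name__ == "__main__":
          proxy = read_proxy_setting()
          if proxy is None:
              print("CVMFS_HTTP_PROXY is not set - clients would fall back to default behaviour")
          elif proxy.strip() == "DIRECT":
              print("Warning: no local squid configured, all traffic goes direct")
          else:
              print("Proxy chain: %s" % proxy)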
this week:
  • See above.

AOB

last week:
this week:


-- RobertGardner - 03 Jan 2012
