
MinutesOct32012

Introduction

Minutes of the Facilities Integration Program meeting, October 3, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • USA Toll-Free: 888-273-3658
    • USA Caller Paid/International Toll : 213-270-2124
    • ACCESS CODE: 3444755
    • HOST PASSWORD: 6081

Attending

  • Meeting attendees: Patrick, Dave, Rob, Michael, Saul, Joel (for Horst), Tom, Bob, Wei, Mark, John, Ilija, Armen, Alden, Shawn, Sarah, Fred, Chris Walker (for Horst), Hiro
  • Apologies: Jason, Kaushik, Horst
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • New Integration program for FY12Q4, IntegrationPhase22 and SiteCertificationP22
      • LFC consolidation is the highest priority.
      • Rob: review of high level milestones for the facility. Thanks to all for updating the site certification table.
      • Michael: Pledges for 2013 and 2014 are being declared this month. Because of the run extension, additional resources are required. There has been a concerted effort to keep the requests at a reasonable level. There are now solid numbers for the US facility, according to the 23% MOU share, to be discussed at the L2/L3 meeting tomorrow.
      • Michael: multicore slots are now going to be used for a validation campaign. Jose is working on getting these jobs going, so we expect the MCORE queues to be utilized. Hopefully this will lead to getting AthenaMP into production.
      • Michael: accounting statistics are being analyzed at the ICB to get valuable information about how resources are being used.
      • Michael: the future of PandaMover in the US. Historically it has been quite valuable (e.g. when DDM was not sufficiently available, and for staging data from tape). Kaushik: useful as a backup, especially as we transition to Rucio. Last time we tried, we ran into DQ2 load issues on SRM. Network load may go up to 30%, since files are re-used, which is not normal.
      • Michael: all of this should be re-visited. A factor of 2 or 3 should not be an argument. Note: PandaMover-related issues are hard to debug.
      • Kaushik: deletion service immediately deletes datasets after the jobs.
      • Rob: could you make the change for a single site? Kaushik: yes.
      • Hiro: does not like it.
      • Reviewing the use of PandaMover in our environment: we need to organize an effort to review its status and the outlook for the future, and arrive within a defined timeframe at a conclusion on which way we should go. Action item.
      • Add LHCONE peering to next quarter's program; we are already late. SWT2 is a concern. NET2 should be straightforward. SLAC should be an internal issue. We should review the sites in Europe already peered.
    • this week
      • Start of a new quarter - FY13Q1 this week
      • Reminder to register for the UC Santa Cruz meeting - https://indico.cern.ch/conferenceDisplay.py?confId=201788
      • New facilities spreadsheet coming - will be in Google Docs for convenience
      • screenshot_01.jpg (attached; see Attachments below)
      • Status of additional disk procurements at Tier 2.
      • Reprocessing campaign - will affect mostly the Tier 1 sites. 1.6B events, 1.5M files, 6 weeks; to end by Christmas. Additional data will come to all sites, so we will need the additional disk for the Winter conferences, and we expect an onslaught of user analysis. Would like to analyze the job profiles versus category type (analy, simul, pileup). Computing management and ADC have been sending more of these to the Tier 2s. (A rough estimate of the implied rates follows after this list.)
      • Mark - makes note of increased communication between ADC management and physics coordination
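      For rough scale, a back-of-the-envelope estimate of the sustained rates implied by the campaign figures quoted above (1.6B events, 1.5M files, 6 weeks); the inputs are taken from the notes, the calculation itself is illustrative only:

        # Rates implied by the reprocessing campaign figures quoted above
        # (1.6B events, 1.5M files, 6 weeks). Illustrative only.
        events = 1.6e9
        files = 1.5e6
        days = 6 * 7

        print("events/day : %.1fM" % (events / days / 1e6))      # ~38.1M events per day
        print("files/day  : %.1fk" % (files / days / 1e3))       # ~35.7k files per day
        print("events/file: %.0f (average)" % (events / files))  # ~1067 events per file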

Disk procurement

  • MWT2 - 1PB ordered for UC (expect by November); UIUC - staging with new instance of CC, "end of November". At IU, we may focus on networking.
  • AGLT2 - UM submitted an updated PO yesterday. Planning on MD3260 dense storage from Dell. Two 40G connections. MSU: 4x MD3260 with 2x R720; PO imminent, estimated within a month.
  • NET2 - MD1200 with 3TB drives; electrical work underway. Expect to issue the purchase order within a week. Two racks, 432 TB usable each.
  • WT2 - Have MD1200s here. Only two head nodes have arrived; four more to come. 1 PB (usable).
  • UTA - Have not sent a PO yet; evaluating technologies. Will have a conversation with Dell.
  • T1 - 2.6 PB of disk; part of this will be replacements. Nexsan technology (dual-controller front ends and extensions).

Multi-core deployment progress (Rob)

last meeting:
  • Will be a standing item until we have an MC queue at each site
  • BNL DONE
  • WT2 - SLACXRD_MP8 DONE
  • MWT2_MCORE available DONE
  • AGLT2_MCORE DONE
  • NET2 - still working on it. HU - not sure how to set it up with schedconfig or autopyfactory. Need Panda guidance - it would be great to add it here: AthenaMPFacilityConfiguration. Alden will send a note of clarification to the t2-l.
  • SWT2_OU - Horst will inquire with OSG about multicore scheduling with the LSF cluster.
  • SWT2_UTA - since the LFC migration is complete, hope to work on it this week.
this meeting, reviewing status:
  • NET2: still don't have the queues. Stuck on issues with the OSG 3.0 RPM install and with certificate updating. Want to set it up cleanly so that multicore runs smoothly. Estimate: by the end of next week.
  • SWT2_OU: no info.
  • SWT2_UTA: created the Panda queue; have asked to have it added to HC and will follow up. Structurally in place and receiving pilots; nearly ready to go online. Will do the same for the other cluster.

Update on LFC consolidation (Hiro, Patrick, Wei)

last week(s):
  • T3 LFC has been migrated, waiting on an AGIS update.
  • Only sites left are OU and MWT2.
  • Patrick - the CCC script can run with the dump file. PandaMover will place files at a site that DQ2 doesn't know about; this is the Tier2D case, where a foreign Tier 1 subscribes via DQ2. Hiro's solution is to use a different domain. (A rough sketch of this kind of catalog-versus-storage check appears at the end of this section.)
  • Modify the domain of PRODDISK as it exists in DQ2 now, in ToA.
  • Sarah - using an older version of the code that would clean up files that might still be used as input to jobs.
  • The code is in git. Patrick will look at Sarah's version.
  • MWT2 - on Monday. Sarah working with Hiro
  • OU after that.
this week:
  • Finished DONE
  • Hiro will report completion formally to ADC
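  For context on the CCC discussion above, a minimal sketch of what a catalog-versus-storage consistency check does: compare an LFC dump against a storage dump and report entries known to only one side. This is an illustrative outline under assumed inputs (plain-text dumps with one path per line), not the actual CCC script:

    # Minimal sketch of a catalog-vs-storage consistency check, in the spirit
    # of the CCC script discussed above. Not the actual tool; the input format
    # (one storage path per line in each dump file) is an assumption.
    def load_dump(path):
        """Read a dump file with one path per line into a set of paths."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    lfc_entries = load_dump("lfc_dump.txt")          # hypothetical LFC dump
    storage_entries = load_dump("storage_dump.txt")  # hypothetical SE dump

    dark_data = storage_entries - lfc_entries    # on disk but not in the catalog
    lost_files = lfc_entries - storage_entries   # cataloged but missing from disk

    print("dark data files: %d" % len(dark_data))
    print("lost files     : %d" % len(lost_files))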

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • 25-30% of MC12 had a bug - a parameter bug from the MC group.
    • Tomasz Schwindt's jobs -- massive production, high priority. Send Kaushik any comments, reports about these jobs.
    • Send any issues found to DAST help - "hn-atlas-dist-analysis-help (Distributed Analysis Help)" <hn-atlas-dist-analysis-help@cern.ch>
    • Jeff at MWT2 has a script that looks for low-efficiency jobs (a rough sketch of this kind of check appears after this list).
  • this meeting:
    • Mark reporting
    • Cloud-wide auto-exclusion by HC on September 22 due to a Panda proxy failure; notes there are discussions underway to improve HC.
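  As a rough illustration of the kind of low-efficiency check mentioned above, the sketch below flags jobs whose CPU-time to wall-time ratio falls below a threshold. The job records and the 25% cut are assumptions for illustration; the actual MWT2 script may work quite differently:

    # Rough sketch of a low-efficiency job filter. The records and the 25%
    # threshold are illustrative assumptions, not the actual MWT2 script.
    EFFICIENCY_THRESHOLD = 0.25

    jobs = [
        # (job id, cpu time [s], wall time [s]) -- made-up example values
        (1543001234, 18000, 20000),   # ~90% efficient
        (1543001235, 1200, 36000),    # ~3% efficient -> flagged
    ]

    for job_id, cpu, wall in jobs:
        eff = float(cpu) / wall if wall > 0 else 0.0
        if eff < EFFICIENCY_THRESHOLD:
            print("low-efficiency job %d: %.1f%% (cpu %ds / wall %ds)"
                  % (job_id, 100 * eff, cpu, wall))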

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • Hiro - will send user disk cleanup reminder
    • Armen - LOCALGROUPDISK issues - what about policies? Do we need one? Generally no deletion there. Recent issue with SLAC - a 100 TB request. Wei added this space there, about 500 TB of LOCALGROUPDISK (which reduces the pledge). How do we get users to clean up? The situation differs from site to site.
    • Hiro: why is the ToA number different from what Ueda quotes? Where does 1.2 PB come from? Michael: it comes from the pledge.
    • Armen - expect more flow into localgroupdisk, since DATADISK is undergoing some deletion, or moving into GROUPDISK token areas.
    • Notes a spike in DQ2.
    • Will restart in two weeks.
    • Can we do something to improve DDM subscriptions to LOCALGROUPDISK?
    • Kaushik notes we will have accounting.
    • Alden: send any issues to DAST
  • this meeting:
    • Yesterday's meeting MinutesDataManageOct2
    • NET2 - deletion rate is low and failure prone.
    • Reporting issue at SLAC - Wei is following-up
    • USERDISK cleanup - new round.
    • Production role - probably usatlas role is sufficient.

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=210179
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-9_24_2012.html
    
    1)  9/19: US T3 LFC was migrated to a consolidated US T2 LFC (ust2lfc.usatlas.bnl.gov).  Completed as of ~2:45 p.m. EST.  eLog 39462.
    2)  9/21: DDM errors across all clouds with a security error ("[SECURITY_ERROR] [SrmPing] failed: ... Error Chain:globus_gsi_gssapi: Error with GSI ... 
    Error with credential: The proxy credential: /tmp/x509up_ ... with subject: /DC=ch/DC=cern/OU=OrganicUnits/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: 
    ATLAS DataManagement/CN=proxy/CN=proxy/CN=proxyexpired 21 minutes ago.]").  Likely related to an FTS bug reported in ggus 81844.  Errors eventually 
    went away.
    3)  9/21: AGLT2 file transfer errors ("[FIRST_MARKER_TIMEOUT] First non-zero marker not received within 600 seconds").  Two storage nodes hung 
    up and had to be restarted.  Later a third server crashed and was power cycled.  Around this same time the analysis queue was getting hit very hard by 
    jobs performing skims.  It was determined that there were dCache queueing performance issues likely contributing to the problem, so a dCache upgrade 
    to v2.2 was performed on 9/24.  ggus 86268, eLog 39586.
    4)  9/21: BU_ATLAS_Tier2o was not receiving pilots, despite having activated jobs.  Issue understood and fixed - from Saul: Some old files in 
    ~usatlas1/.globus/job/atlas.bu.edu/ were slowing down the file system performance, possibly causing some kind of timeout.  After cleaning it up, we're ramping 
    back up.
    5)  9/21 p.m.: OU_OCHEP_SWT2 file transfer errors ("failed to contact on remote SRM").  From Horst: OU_OCHEP_SWT2's main head node, 
    tier2-01.ochep.ou.edu, crashed with what looks to be a hardware problem, possibly a CPU issue.  I have rebooted it, and it looks okay again now.
    6)  9/22 early a.m.: sites across all clouds were auto-excluded by HC testing.  Issue was eventually traced to an expired proxy on the panda server which was 
    creating errors like "ddmErrorDiag: Setupper._setupDestination() could not register : hc_test.gangarbt.hc20011060.INFN-LECCE.406, transExitCode: NULL."  
    Issue resolved, sites reset on-line.  eLog 39527.
    7)  9/22: ggus 86307 was opened for file transfer failures with checksum errors, and incorrectly assigned to UPENN, when the issue was really on the remote 
    side (IT/MILANO).  Ticket closed, eLog 39534/38.
    8)  9/24: From ADC ops - The TiersOfATLASCache http://atlas.web.cern.ch/Atlas/GROUPS/DATABASE/project/ddm/releases/TiersOfATLASCache.py now 
    points to the AGIS ToA.  elog 39572.
    9)  9/24: SLACXRD - short-lived/transient issue, file transfers were failing with SRM errors.  Errors went away after ~30 minutes - ggus 86331 closed, eLog 39574.
    10)  9/26: New pilot release from Paul (v54b).  Details here: http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_54b.html
    11)  9/26: LFC for OU_OCHEP_SWT2 migrated to BNL (ust2lfc).
    12)  9/26: From Michael at BNL: We observe currently a few transfer failures due to high load on a few storage pools. We are in the process of balancing the 
    load to spread it across more pools.  eLog 39639.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in 
    https://savannah.cern.ch/support/?129468.  See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/24: Continue to see deletion errors - ggus 85951 re-opened.  eLog 39571.
    (iii)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line 
    to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site 
    exclusion ticket), eLog 38795.
    (iv)  9/6: UPENN file transfer errors ("[GRIDFTP_ERROR] an end-of-file was reached globus_xio: An end of file occurred (possibly the destination disk is full)").  
    Admin reported: I believe this is caused by the gridftp server terminating due to a timeout on the disk side and misinterpreting it as an EOF.  ggus 85916 
    in-progress, eLog 39145.
    Update 9/12: http://savannah.cern.ch/support/?132023 was also opened for this issue.  eLog 39288.
    Update 9/20-22: no recent errors of the type originally reported in ggus 85916 - this ticket closed.  eLog 39538.
    Update 9/23: currently ~10 TB of free space in the token - closed Savannah 132023.
    (v)  9/12: Daily Atlas RSV vs WLCG report (currently sent out via e-mail) will be replaced on 31/Oct/2012 with a web only summary available here:
    http://rsv.opensciencegrid.org/daily-reports/2012/09/11/ATLAS_Replacement_Report-2012-09-11.html.  Please address any concerns or questions to 
    steige@iu.edu or open a ticket here: https://ticket.grid.iu.edu/goc/submit.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=211100
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-10_1_2012.html
    
    1)  9/28: UTD-HEP file transfer failures: SRM errors - ggus 86539.  Site blacklisted in DDM on 10/1: https://savannah.cern.ch/support/?132600.  eLog 39770.
    2)  9/30: File transfer errors between BNL & TAIWAN/ASGC.  Initial indications were a network path issue originating on the TW cloud side.  Eventually 
    understood (10/1) - due to BNL Cyber Security blocking the FTS agent host at ASGC.  See more details in eLog 39785.  Note: this issue also affects SLAC, 
    since it is part of the DoE complex.  ggus 86537 'assigned'.
    3)  10/1: Rob reported a brief network interruption at MWT2.  Problem fixed.  eLog 39775.
    4)  10/2: In reference to widespread HC auto-exclusion incident on 9/22 - see:
    https://indico.cern.ch/getFile.py/access?contribId=9&resId=0&materialId=slides&confId=209822
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in https://savannah.cern.ch/support/?129468.  
    See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/24: Continue to see deletion errors - ggus 85951 re-opened.  eLog 39571.
    (iii)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line 
    to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), 
    eLog 38795.
    (iv)  9/12: Daily Atlas RSV vs WLCG report (currently sent out via e-mail) will be replaced on 31/Oct/2012 with a web only summary available here:
    http://rsv.opensciencegrid.org/daily-reports/2012/09/11/ATLAS_Replacement_Report-2012-09-11.html.  Please address any concerns or questions to steige@iu.edu or
    open a ticket here: https://ticket.grid.iu.edu/goc/submit.
    (v)  9/21: AGLT2 file transfer errors ("[FIRST_MARKER_TIMEOUT] First non-zero marker not received within 600 seconds").  Two storage nodes hung up and had to be 
    restarted.  Later a third server crashed and was power cycled.  Around this same time the analysis queue was getting hit very hard by jobs performing skims.  It was 
    determined that there were dCache queueing performance issues likely contributing to the problem, so a dCache upgrade to v2.2 was performed on 9/24.  ggus 86268, 
    eLog 39586.
    Update 9/28: Recent file transfers successful, no more errors.  Closed ggus 86268.  eLog 39691.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Please try to get new equipment into production.
      • NET2 - has systems, not yet in production; plan was to do this at BU, not clear about HU but maybe.
      • UTA - waiting on a 10G port; working internally on which optics - SR, LR; then will buy cable.
      • SLAC - has machines, trying to get them supported in a standard way, as appliances
  • this meeting:
    • See notes from yesterday's call
    • Mesh configuration for perfsonar
    • Modular dashboard discussions
    • What is going on between SLAC and TRIUMF? Possibly related to the LHCONE transition at SLAC? There are likely problems beyond SLAC.
    • Will be adding LHCONE connectivity to next phase
    • NET2 - Michael has discussed possibility of having ESNet provide the LHCONE connectivity for BU - but may need to have I2 involvement - depends on institutional issues.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Need upgrades to Xrootd
  • Adding sites in Germany and UK
this week
  • We got a Russian site on board. They didn't need much help. They still haven't enabled security.
  • Working on bringing EOS into FAX. Not quite ready.
  • NET2 used to work, but no longer does, for unknown reasons.
  • Two German sites; they can't enable X509 due to an RHEL6 issue.
  • Prague Tier 2, federated through DE cloud, now working
  • Developing a new version of xrootd4j for dCache 2.2.4.
  • Andy produced the first version of the f-stream, which provides all the required monitoring.

US analysis queue performance (Ilija)

last two meetings
  • No meeting last week due to CERN and Lyon meetings.
this week:
  • In general, sites are showing good performance.
  • Working with a few sites to fix specific issues.
  • Will have one more report on the efficiency of each site - and thereafter it will be just follow-up.
  • Summary of issues
    • Software issues - startup and stage-out time
    • Setup time for ATLAS software (CMT). This could be fixed, but it would require ~0.5 FTE-year.
    • Will document these, summarize these findings. Should present at S&C meetings, and provide a document.

Site news and issues (all sites)

  • T1:
    • last meeting(s): FTS3 deployed for testing.
    • this meeting: Experienced the ASGC machine being blocked (see operations report); 2.3 PB of disk being procured. Evaluation of a Hadoop-based storage system: rather than the open-source Apache Hadoop, a MapR installation on 100 nodes is up and running. Performance tests conducted. It comes with a scalable NFS interface. Hiro is looking into measuring it.

  • AGLT2:
    • last meeting(s): Seem to be getting a lot of inefficient jobs - may be related to a dCache node that was taken offline. R720 ordered; MD3260 storage ordered with RBOD. A 10G NIC went offline; disabled flow control on all 10G switches.
    • this meeting: PO plans as above. Emergency upgrade to dCache 2.2.4; seems to be working well. Now more activity on the inter-site links. Applying a 5%/3% free-space rule to avoid XFS problems: there were issues with XFS crashes caused by memory pressure and activity, and the newer dCache seems to have helped with this. Caching within dCache - running effectively full seems to work better with 2.2.4; did have some hotspots before. May be having less overall re-usable space, implying more cache thrash, since there is not as much unpinned space. Is space being reclaimed too early? Condor issue - had implemented concurrency limits on analysis jobs to limit the number of analysis jobs that run. Accounting groups and concurrency limits may not play well together in Condor: seeing very long negotiator cycles (hours). (A hedged configuration sketch follows below.)
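      To make the Condor point concrete, below is a hedged sketch of how the two mechanisms mentioned (concurrency limits and hierarchical accounting groups) are typically expressed in HTCondor configuration; the group names, quota fractions, and limit value are illustrative assumptions, not AGLT2's actual settings:

        # condor_config fragment -- illustrative values only, not AGLT2's settings.
        # Concurrency limit: every job that requests the ANALY limit counts
        # against a single pool-wide cap enforced by the negotiator.
        ANALY_LIMIT = 800

        # Hierarchical accounting groups with dynamic quotas (fractions of the pool).
        GROUP_NAMES = group_atlas.prod, group_atlas.analy
        GROUP_QUOTA_DYNAMIC_group_atlas.prod  = 0.75
        GROUP_QUOTA_DYNAMIC_group_atlas.analy = 0.25
        GROUP_ACCEPT_SURPLUS = True

        # Submit-file fragment for an analysis job (hypothetical user):
        #   concurrency_limits = ANALY
        #   +AccountingGroup = "group_atlas.analy.someuser"

      With both in place, the negotiator has to honor the group quota and the shared ANALY cap for every analysis match, which adds bookkeeping per cycle and may be part of why long negotiator cycles appear when the two are combined.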

  • NET2:
    • last meeting(s): Preparing to purchase storage. Lots of other work on-going.
    • this meeting: Mysterious problem related to the new pilot - only at NET2, evidently. Once every couple of weeks we see a batch of jobs that consumes all the memory on some nodes.

  • MWT2:
    • last meeting(s): DDM issue - SRM problem. Attempting to get an SRM thread dump.
    • this meeting: Investigation of IU analysis performance continuing in detail. Upgrades for LHCONE at IU in progress; required a Juniper OS update, which had problems. Illinois: by October 12; hardware link in place and active. First s-node arrived at UIUC, being deployed by the Taub admins; Taub c-nodes updated for CVMFS fixes, and working with the core Taub admin on additional utilities for Nagios. Sarah investigating times for postgres queries and their relation to the billing database; need to move the billing database onto a separate server. Continued work on virtual machines - adapting the appliance from John Hover (OpenStack-based) to use the libvirt tools directly.

  • SWT2 (UTA):
    • last meeting(s): Multicore configuration nearly finished. Working the PandaMover issue related to LFC consolidation. Available disk at SWT2: ~1600 TB; updated today - it will be about 2100 TB.
    • this meeting: OSG 3 rpm - checking ROCKS appliances. Tues/Wed next week. (Dave notes there is a stress test template 459.)

  • SWT2 (OU):
    • last meeting(s): Ordering storage - quote in hand, placing this in the next two weeks; about 200 TB. Lustre issue was metadata server deadlock, fixed with reboot.
    • this meeting: Disk order has gone out. Horst is getting close to having a clean-up script.

  • WT2:
    • last meeting(s): 1 PB usable, R610 + MD1200 - starting.
    • this meeting: Moved to LHCONE. Working on new storage. Have not had a chance to work on the OSG 3.0 RPM install; the open item is the LSF jobmanager.

AOB

last meeting
this meeting


-- RobertGardner - 02 Oct 2012

Attachments


screenshot_01.jpg (31.8 KB) - RobertGardner, 03 Oct 2012 - 12:58
 