r2 - 15 Oct 2014 - 12:10:05 - MarkSosebee

MinutesOct152014

Introduction

Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:

Attending

  • Meeting attendees:
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

LOCALGROUPDISK space monitoring (Mayuko)

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • meeting (10/1/14):
    • Mark reporting. Again a note about MCORE developments.
    • Wisconsin - a long-standing ticket. Site is unresponsive. Transfers are still failing.
    • Would be useful to have the number of active slots displayed. This is a request for the bigpanda monitor.
  • meeting (10/15/14):

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    https://indico.cern.ch/event/344969/contribution/1/material/slides/0.pdf
    
    1)  10/2: as of today DB release datasets are subscribed to T1_DATADISK rather than T1_HOTDISK:
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/51318
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/51303
    2)  10/2: SWT2_CPB - site had to be taken off-line following widespread power outages due to storms in the area with high winds. Site was brought back on-line on 
    the evening of 10/4. Power was mostly restored to the campus by the evening of 10/3, but facilities personnel could not ensure the stability of the chilled water 
    system that provides cooling water for the machine room CRAC units until the next day. https://ggus.eu/index.php?mode=ticket_info&ticket_id=109017 was opened 
    for UTA_SWT2 since both sites share squid caches. Closed on 10/5. eLog 51394.
    3)  10/4: HU_ATLAS_Tier2 - squid service down. Restarting the HU gatekeeper fixed the problem. https://ggus.eu/index.php?mode=ticket_info&ticket_id=109048 was 
    closed, eLog 51359. (ggus ticket was re-opened later the same day, but this may have been a monitoring glitch, since the service had stayed up following the 
    gatekeeper restart.)
    4)  10/5: SLACXRD - source and destination file transfer errors. Wei reported that a storage host went down, and was restarted. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=109051 was closed, eLog 51371.
    5)  10/6: SMU_LOCALGROUPDISK - all transfers to the site failing with "Unable to connect to smuosg1.hpc.smu.edu:2811."
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=109158 in-progress, eLog 51391.
    6)  10/7: ADC Weekly meeting:
    http://indico.cern.ch/event/339991/
    7)  10/8 early a.m. - problems with the ADCR database caused large numbers of job failures, some sites were auto-excluded. Service restored after several hours. 
    See: https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&&n=OTG0014692, eLog 51404.
    
    Follow-ups from earlier reports:
    
    (i)  9/20: WISC - file transfers failing with "DESTINATION MAKE_PARENT srm-ifce err: Invalid argument, err: [SE][Mkdir][SRM_FAILURE]." 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=108665 in progress, eLog 51118.
    Update 10/7: ggus ticket closed with the message "The service has some problem because of a xrootd upgrade on bestman server." eLog 51399.
    

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    Not available this week.
    
    1)  10/9: Pilot update from Paul (v_PICARD_59a). Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_PICARD_59a.html
    2)  10/11: UTA_SWT2 - source and destination file transfer failures (various srm errors). Issue was a full disk partition on the srm host (BeStMan filled the logs area). 
    Space was cleared up, srm restarted, issue resolved. https://ggus.eu/index.php?mode=ticket_info&ticket_id=109273 was closed, eLog 51452. (In the meantime we've 
    moved the logging directory to a much larger partition to avoid this problem in the future.)
    3)  10/14: AGLT2_MCORE - jobs failing heavily, particularly on two WNs, with the error "Get error: Encountered an empty SURL-GUID dictionary." Seems like a site 
    problem since jobs from the same tasks are completing elsewhere. (Issue may be due to a missing environment variable, $USER, since there are file permission errors 
    in the job logs.) https://ggus.eu/index.php?mode=ticket_info&ticket_id=109324 in-progress, eLog 51469.
    4)  10/14: WISC - file transfer errors ("/bin/mkdir: cannot create directory `/atlas/xrootd': Permission denied"). 
    Re-opened ggus ticket https://ggus.eu/?mode=ticket_info&ticket_id=108665. eLog 51484.
    5)  10/14: ADC Weekly meeting:
    http://indico.cern.ch/event/339992/
    
    Follow-ups from earlier reports:
    
    (i)  10/6: SMU_LOCALGROUPDISK - all transfers to the site failing with "Unable to connect to smuosg1.hpc.smu.edu:2811."
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=109158 in-progress, eLog 51391. 
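
Regarding item 2 of this week's summary (the full log partition on the UTA_SWT2 srm host): a periodic usage check on the logging partition could flag this kind of problem before transfers start failing. The following is only a minimal sketch, not the site's actual monitoring. The function names, the 90% threshold, and the example path are all assumptions made for illustration.

```python
import shutil


def partition_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the partition holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Guard against pseudo file systems that report zero size.
        return 0.0
    return usage.used / usage.total


def is_nearly_full(path, threshold=0.90):
    """True when the partition holding `path` is at or above `threshold`.

    The 0.90 default is an assumed value, not a site policy.
    """
    return partition_usage_fraction(path) >= threshold


if __name__ == "__main__":
    # Hypothetical log location; replace with the real logging directory.
    log_dir = "/var/log"
    if is_nearly_full(log_dir):
        print(f"WARNING: partition holding {log_dir} is nearly full")
```

A check like this could be run from cron on the srm host, with the warning routed to whatever alerting the site already uses.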
    

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (10/1/14)
    • Regarding dark data, we've seen a reduction.
    • There are a number of unassociated files in the Rucio file catalog, but the Rucio team is working on it.
    • Would like a systematic dump of the storage for each site. Previously some sites were providing this.
    • Wei: it was the other way around - we took dumps from dq2 and compared to storage.
    • Discussion with Tomas: wants a file catalog dump from each site.
    • Armen will send an email message.
    • Is this for continuous testing/checking? Rucio provides a daily file catalog dump.
    • Is CCC up to date?
  • meeting (10/15/14)
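
The 10/1 discussion of dark data and per-site dumps comes down to comparing two file lists: what is physically on storage and what the file catalog knows about. The sketch below shows that comparison under simple assumptions: each dump is an iterable of one path per line, and paths in both dumps use the same naming convention. The function name and the "dark"/"missing" labels are illustrative and are not taken from Rucio or CCC.

```python
def compare_dumps(storage_paths, catalog_paths):
    """Compare a storage dump with a file catalog dump.

    Returns a dict with:
      "dark":    paths on storage but absent from the catalog
      "missing": paths in the catalog but absent from storage
    Blank lines and surrounding whitespace are ignored.
    """
    storage = {p.strip() for p in storage_paths if p.strip()}
    catalog = {p.strip() for p in catalog_paths if p.strip()}
    return {
        "dark": sorted(storage - catalog),
        "missing": sorted(catalog - storage),
    }


if __name__ == "__main__":
    # Tiny made-up example lists, for illustration only.
    storage_dump = ["/atlas/a.root\n", "/atlas/b.root\n", "/atlas/orphan.root\n"]
    catalog_dump = ["/atlas/a.root\n", "/atlas/b.root\n", "/atlas/lost.root\n"]
    result = compare_dumps(storage_dump, catalog_dump)
    print("dark:", result["dark"])
    print("missing:", result["missing"])
```

Since both arguments are plain iterables of lines, open file handles for the two dump files can be passed in directly. For very large sites a set held in memory may not be practical, and a sorted-merge comparison would be the usual alternative.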

DDM Operations (Hiro)

meeting (9/3/14)
  • No report.
meeting (10/1/14)
  • There is an FTS issue at CE
meeting (10/15/14)

Condor CE validation (Saul, Wei, Bob)

meeting (8/6/14)
  • See Bob's instructions, experience at https://www.aglt2.org/wiki/bin/view/AGLT2/CondorCE
  • Saul: Augustine working with Tim Cartwright and Brian
  • Wei: working with Brian to get it working with LSF. Got a job submitted successfully, but with memory allocation errors.
  • Overall conclusion is that the software is just not there yet.
  • Xin - working on Condor CE at BNL
meeting (9/3/14)
  • New page from Xin: HTCondorCE
  • WT2: Brian has got it running, in a test mode. Still need to understand which package to install. Then, will need to run HC jobs. Believes it's not ready for production. Also need to enable cgroups.
  • AGLT2: running HC on the test gatekeeper, running fine. Need rpms out of the OSG testing repo, to get best results. Debugging the job router is inadequate. Condor v8 has differences - there were simple changes that caused.
  • NET2: got to the stage of SGE job submissions running. There's a new version. Need to run at scale.
meeting (10/15/14)

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

meeting (9/3/14)

meeting (10/1/14)

  • EU privacy issue now fixed.
meeting (10/15/14)

ATLAS Connect (Dave, Rob)

meeting (8/6/2014)
meeting (9/3/2014)
  • Still working with TACC to get a CVMFS solution, of some nature, working on Stampede. They are setting up test nodes with FUSE enabled.
  • Replicate data on their Lustre file system - using ReplicateCVMFS
meeting (10/15/2014)

Site news and issues (all sites)

  • SWT2 (UTA):
    • last meeting(s): working on bringing 30 compute nodes online.
    • this meeting:

  • SWT2 (OU, OSCER):
    • last meeting(s): Ran 1800 job slots on OSCER - 3600 between LUCILLE, OCHEP, OSCER, really good.
    • this meeting:

  • SWT2 (LU):
    • last meeting(s): Everything running smoothly.
    • this meeting:

  • WT2:
    • last meeting(s): Short outage August 7 for internal network upgrades.
    • this meeting:

  • T1:
    • last meeting(s):
    • this meeting:

  • AGLT2:
    • last meeting(s): 14 R620's received, 3 M620s, waiting on storage and switch. MSU will delay purchases.
    • this meeting:

  • NET2:
    • last meeting(s): close to making a purchase of R630s (Haswell). There are benchmarks.
    • this meeting:

  • MWT2:
    • last meeting(s): Increase in MCORE. 14 new workers at UIUC - have benchmarks on Bob's twiki.
    • this meeting:

AOB

  • last meeting:
  • this meeting:


-- RobertGardner - 14 Oct 2014
