
MinutesMay142014

Introduction

Minutes of the bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00 pm Eastern):

Connection info:

Attending

  • Meeting attendees: Dave, Torre, Horst, Fred, Bob, Ilija, Saul, Armen, Mayuko, Mark, John Brunelle, Wei, Patrick
  • Apologies: Michael, Kaushik
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Armen): Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • A few integration program issues:
        • Rucio conversion: status and timeline. LUCILLE was successfully transitioned; PandaMover is no longer used there and Rucio is used exclusively. At AGLT2 the first preparatory steps (removal of PandaMover) were successful; the next few days will complete the transition. MWT2 will volunteer next. Other comments: the conversion seems to be working very well - entire clouds have been transitioned.
        • Panda-WAN testing: the PanDA team is ready. Schedconfig parameters need adjusting; Ilija is in contact with Tadashi. There were also other related changes in the pilot.
        • Continued work on ATLAS Connect
        • xAOD: ready to test? The ROOT I/O working group is in place to understand the read performance and to make sure we have the right monitoring systems, including user-job monitoring. Ilija is asking for example user code (a minimal read-timing sketch is included at the end of this section).
      • WAN performance, even on LHCONE, can sometimes lead to surprises. This is why the PET has pursued a solution which minimizes the number of network domains; it has been discussed this week. US LHC management is now in a position to let ESnet management know we can move forward, and the DOE Office of High Energy Physics is in agreement. The solution should be in place in about six months, pending ESnet approval, and will give much better network connectivity with our European counterparts.
      • Yesterday's ADC meeting - discussion of the ATLAS software environment setup; there is now a dedicated ADC development meeting.
    • this week
      • Pre-DC14 exercises: internal review of sites. Performance characteristics and stability under various loads.
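
      Since example user code was requested for the xAOD/ROOT I/O studies, here is a minimal, hypothetical PyROOT sketch of the kind of read-timing job that could be instrumented. The file URL is a placeholder, and a real analysis would normally go through the xAOD EDM tools rather than bare TTree access; this only illustrates measuring bytes read and effective read rate.

        # Hypothetical read-timing sketch (not an official example): open an xAOD file
        # (local path or FAX root:// URL - placeholder below), loop over CollectionTree,
        # and report bytes read and the effective read rate.
        import time
        import ROOT

        fname = "root://fax.example.org//atlas/rucio/some/dataset/xAOD.pool.root"  # placeholder
        f = ROOT.TFile.Open(fname)
        tree = f.Get("CollectionTree")        # main event tree in xAOD files

        start = time.time()
        nbytes = 0
        for i in range(int(tree.GetEntries())):
            nbytes += tree.GetEntry(i)        # GetEntry returns the bytes read for that event
        elapsed = time.time() - start

        print("events: %d  read: %.1f MB  rate: %.1f MB/s"
              % (tree.GetEntries(), nbytes / 1e6, nbytes / 1e6 / max(elapsed, 1e-6)))
        print("file-level: %d bytes in %d read calls" % (f.GetBytesRead(), f.GetReadCalls()))
        f.Close()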

ATLAS Connect (Rob)

last week
  • Testing by early adopters continues
  • Much of the team is up at Condor Week this week
  • Main technical roadblock for opportunistic production jobs (via Panda, e.g. on Stampede) is two-fold (but related):
    • Delivering ATLAS compatibility libraries (HEP_OSlibs_SL6)
    • A threading problem in Parrot that may or may not be related to using multiple CVMFS repositories, or something else. The CCTools team is currently stumped, but will follow up.
  • Classic Tier3-style analysis batch usage continues to help improve the platform definition (e.g. installing analysis helper tools such as git-svn)
  • Usage past week: see below.
  • We'll have a meeting next Monday
this week
  • Continued testing with Parrot 4.1.4rc5.
  • Focus remains getting a usable ATLAS software environment exported to the Stampede cluster.
  • The current strategy is to use an edge VM at TACC that mounts CVMFS (with a local Squid service) and then exports CVMFS to the worker nodes with Chirp (see the sketch below).
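
  A minimal sketch of that strategy, assuming the cctools of that era: chirp_server on the edge VM exports the locally mounted /cvmfs tree, and a wrapper on the Stampede worker node uses parrot_run to remap /cvmfs onto that export. The hostname, port and flags are assumptions to be checked against the installed cctools version; this is not the production setup.

      # Sketch only: export the edge VM's /cvmfs over Chirp and remap it with Parrot
      # on the worker node. Hostname and port are hypothetical.
      import subprocess

      EDGE_HOST = "edge-vm.tacc.example.org"   # hypothetical edge VM at TACC
      CHIRP_PORT = 9094                        # default Chirp port

      def start_chirp_export():
          """Run on the edge VM, which already has /cvmfs mounted behind a local squid."""
          return subprocess.Popen(["chirp_server", "-r", "/cvmfs", "-p", str(CHIRP_PORT)])

      def run_under_parrot(job_cmd):
          """Run on a worker node: remap /cvmfs to the Chirp export and run the job."""
          mount = "/cvmfs=/chirp/%s:%d/" % (EDGE_HOST, CHIRP_PORT)
          return subprocess.call(["parrot_run", "-M", mount] + job_cmd)

      if __name__ == "__main__":
          # e.g. list the exported repositories through Parrot
          run_under_parrot(["ls", "/cvmfs"])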

Operations overview: Production and Analysis (Mark)

  • Production reference:
  • last meeting(s):
  • this meeting (5/14/14):
  • We are light on jobs - see the ADC Weekly link below for a summary of upcoming new production. New production is probably a couple of weeks away.
  • PandaMover has been turned off in the US cloud. Need to verify that PRODDISK is being cleaned properly and check the cleanup policy; we should not see jobs having their input files removed before they start.
  • Have all the sites been fully converted to Rucio? Need to find out where we stand. Mark will follow up.
  • There may be some residual data left over from PandaMover (a minimal accounting check of the kind that could spot this is sketched below).
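
  As a sketch of how residual/dark data could be spotted, the Rucio client can compare the space Rucio accounts for on each PRODDISK endpoint with what the storage element reports, where sites publish both. This assumes a working Rucio client environment (grid proxy, RUCIO_ACCOUNT set); the usage source names vary by site and should be checked against the live system - this is not an official procedure.

    # Hypothetical check: compare Rucio-accounted usage with storage-reported usage
    # on PRODDISK endpoints to flag possible residual (dark) data.
    from rucio.client import Client

    client = Client()

    for rse in client.list_rses():                      # yields dicts with an 'rse' key
        name = rse["rse"]
        if not name.endswith("PRODDISK"):
            continue
        # usage records carry 'source' and 'used' fields; verify against your client version
        usage = {u["source"]: u["used"] for u in client.get_rse_usage(name)}
        rucio_used = usage.get("rucio")                 # bytes known to Rucio
        storage_used = usage.get("storage")             # bytes reported by the SE, if published
        if rucio_used is None or storage_used is None:
            continue
        print("%-25s rucio=%15d  storage=%15d  diff=%15d"
              % (name, rucio_used, storage_used, storage_used - rucio_used))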

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    https://indico.cern.ch/event/317364/contribution/0/material/1/0.txt
    http://indico.cern.ch/event/311432/contribution/1/material/slides/0.pdf
    
    1)  4/30: UTA_SWT2 - file transfers were failing with the error "Failed to get source file size: srm-ifce err: Permission denied." Most likely these errors were related to the digicert 
    issue in the US cloud around this period. Following the next CA update the problem went away.
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=105040 was closed on 5/1, eLog 49160.
    2)  5/1: WISC: file transfers failing with "DESTINATION SRM_PUT_TURL srm-ifce err: Invalid argument." Site reported there were some errors in a new SRM service declaration, 
    but the issue had been resolved. A short downtime was taken to allow the updated agis values to propagate through the system. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=105043 was closed, eLog 49157.
    3)  5/1: https://ggus.eu/index.php?mode=ticket_info&ticket_id=105064 was opened for transfer errors at BNL, but the issue was the digicerts problem, and not site-related. 
    Ticket was closed on 5/7. eLog 49271.
    4)  5/2: MWT2_UC - file transfers were failing with errors like "Communication error on send, err: [SE][srmRm][] httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2 ... Unexpected 
    Gatekeeper or Service Name GSS Minor Status Error Chain: globus_gsi_gssapi: Authorization denied: The expected name for the remote host (host@192.170.227.129) does not 
    match the authenticated name of the remote host (uct2-dc1.uchicago.edu)." Sarah reported there had been an issue with reverse DNS lookups that was resolved. Transfers resumed, 
    and the site was un-blacklisted. https://its.cern.ch/jira/browse/ADCSITEEXC-1677, https://ggus.eu/index.php?mode=ticket_info&ticket_id=105073 was closed, eLog 49188. 
    (A generic forward/reverse DNS consistency check is sketched at the end of this summary.)
    5)  5/2: SLACXRD - file transfer failures. Most likely related to digicert issue. On 5/3 some transfer failures appeared with the error "lsm-put failed," and later with "could not open 
    connection to osgserv04.slac.stanford.edu," but these went away as of 5/5, so https://ggus.eu/index.php?mode=ticket_info&ticket_id=105120 was closed. eLog 49230.
    6)  5/2: AGLT2 - file transfers failing with various errors ("Connection timed out," "SRM_FILE_UNAVAILABLE," etc.). A power event at the MSU site caused file system corruption on 
    several dCache pools. As of 5/4 still seeing "Connection timed out" errors. Bob reported that a routing issue on a pool node was found and fixed. As of 5/7 all issues apparently resolved.
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=105070 was closed, eLog 49269.
    7)  5/6: BNL network maintenance outage completed as of ~3:00 p.m. EST. eLog 49242.
    8)  5/6: ADC Weekly meeting:
    http://indico.cern.ch/event/311432/
    
    Follow-ups from earlier reports:
    
    (i) 4/30: SMU - continuing source / destination file transfer errors, mostly "Unable to connect to smuosg1.hpc.smu.edu." https://ggus.eu/index.php?mode=ticket_info&ticket_id=101975, 
    eLog 49138. Site blacklisted.
    5/2: Site admin resolved a LCMAPS issue. File transfers now succeeding. ggus 101975 was closed, eLog 49173.
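
    For reference, the reverse-DNS mismatch in item 4 is the kind of thing a simple forward/reverse lookup consistency check exposes. The host and IP below are the ones quoted in the ticket; everything else is a generic sketch, not part of any site tooling.

      # Generic DNS consistency check: the name returned by a reverse lookup of the IP
      # should resolve back to an address set containing that IP.
      import socket

      host = "uct2-dc1.uchicago.edu"
      ip = "192.170.227.129"

      _, _, forward_ips = socket.gethostbyname_ex(host)        # name -> addresses
      reverse_name, _, _ = socket.gethostbyaddr(ip)            # address -> name

      print("forward : %s -> %s" % (host, ", ".join(forward_ips)))
      print("reverse : %s -> %s" % (ip, reverse_name))
      if ip not in forward_ips or reverse_name.lower() != host.lower():
          print("WARNING: forward and reverse DNS records are inconsistent")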
    

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings:
    https://indico.cern.ch/event/317366/contribution/0/material/0/0.txt
    http://indico.cern.ch/event/311433/contribution/1/material/slides/0.pdf
    
    1)  5/7: SMU - file transfer failures with the error "Unable to connect to smuosg1.hpc.smu.edu." A service on the gatekeeper had crashed, and was restarted, solving the problem. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=105222 was closed, eLog 49267.
    2)  5/8: New pilot release from Paul (v58l). Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_58l.html
    3)  5/13: ADC Weekly meeting (includes talk describing MC production status/outlook):
    http://indico.cern.ch/event/311433/
    
    Follow-ups from earlier reports:
    
    None
    

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (3/19/14):
    • LOCALGROUPDISK cleanups: MWT2 and SLAC have the largest sets of users. Getting lists of files (35 DNs at MWT2).
      • Lots of effort, but the resulting cleanup is very good.
    • USERDISK - some sites are finished. SWT2 is still in progress; still submitting batches for MWT2 from the previous campaign. Going slowly, since some users had several hundred thousand files per dataset. Hiro sent out notifications for the next campaign (MWT2, BNL).
    • GROUPDISK issue: dataset subscriptions are not working for AGLT2 for the Higgs group. Contacted the Higgs group representative because the group is over quota.

  • meeting (5/14/14)
    • Non-US Cloud production inputs were being cleaned centrally.
    • Next week AGLT2 will be migrated from the LFC to the Rucio catalog. Armen will look into the specific status.
    • Stress testing in Rucio is ongoing.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference
meeting (4/16/14)
  • Added SFU. Changing the FAX topology in North America.
  • Understanding the EOS "too many DDM endpoints" issue.
  • Simplified direct dCache federation.
meeting (4/30/14)
  • A reconfigured redirector network for the US is being set up.
  • IN2P3 is now online.
  • New dCache redirectors; these will simplify deployments.
  • Also the ROOT I/O activity.
  • Analyzing cost matrix data; using Google Engine to store and analyze it. Roughly 15 MB/s for most links (a hypothetical single-link measurement is sketched at the end of this section).
this meeting (5/14/14)
  • A few more sites have installed FAX. PIC is now federating. Working with the Netherlands.
  • Over the last several days, up to 700 jobs have been running in parallel; trying to understand the behavior.
  • Overflow rebrokering tests are in progress. Submitted 20 tasks (744 jobs), all fully successful.
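
  The cost-matrix entries above are source-to-destination throughputs. Purely as an illustration (not the actual FAX/HammerCloud collection machinery), a single-link measurement could be approximated by timing an xrdcp of a test file from a FAX endpoint; the URL below is a placeholder.

    # Hypothetical single-link throughput probe: copy a remote test file with xrdcp
    # and report MB/s. The source URL is a placeholder, not a real dataset.
    import os
    import subprocess
    import tempfile
    import time

    SOURCE = "root://fax.example.org//atlas/rucio/user/test/testfile.root"  # placeholder

    dest = tempfile.NamedTemporaryFile(delete=False)
    dest.close()
    start = time.time()
    subprocess.check_call(["xrdcp", "-f", SOURCE, dest.name])   # -f: overwrite destination
    elapsed = time.time() - start
    size_mb = os.path.getsize(dest.name) / 1e6
    os.unlink(dest.name)

    print("copied %.1f MB in %.1f s -> %.1f MB/s" % (size_mb, elapsed, size_mb / elapsed))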

Site news and issues (all sites)

  • MWT2:
    • last meeting(s): Network issues; otherwise things are quiet. Did have a hardware failure on the management node.
    • this meeting: 100G testing to BNL.

  • SWT2 (UTA):
    • last meeting(s): Working on MCORE and tuning. Patrick has implemented a local site kill-job mechanism: when a machine is swapping it sometimes causes HammerCloud jobs to fail and the site to get offlined. The offending jobs are ATLAS user analysis jobs.
    • this meeting: Working on BigPanDA job testing. Setting up a network test with new equipment and working on scheduling a downtime for it. Have seen a great improvement in stability by removing jobs that use large amounts of memory (6 GB!). There are HammerCloud jobs consuming lots of memory - possibly related to xrootd? (An illustrative memory-watchdog sketch follows.)
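
    The kill-job mechanism itself is site-specific and was not described in detail; purely as an illustration of the idea, a minimal watchdog that terminates processes above a memory threshold might look like the following. The 6 GB limit is the figure from the discussion above; the account names and the use of psutil are assumptions, and this is not the actual SWT2 implementation.

      # Illustrative sketch only (not the SWT2 mechanism): kill processes owned by
      # pilot accounts whose resident memory exceeds a limit, before the node swaps.
      import psutil

      LIMIT_BYTES = 6 * 1024**3                     # 6 GB, the figure mentioned above
      WATCHED_USERS = {"atlasprd", "usatlas1"}      # hypothetical pilot account names

      for proc in psutil.process_iter(["pid", "username", "memory_info", "cmdline"]):
          info = proc.info
          if info["username"] not in WATCHED_USERS or info["memory_info"] is None:
              continue
          if info["memory_info"].rss > LIMIT_BYTES:
              try:
                  print("killing pid %d (rss=%.1f GB): %s"
                        % (info["pid"], info["memory_info"].rss / 1024.0**3,
                           " ".join(info["cmdline"] or [])))
                  proc.kill()
              except psutil.Error:
                  pass                              # process already gone or not ours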

  • SWT2 (OU, OSCER):
    • last meeting(s): The OSCER cluster is not providing much throughput at the moment, since it is being used for weather studies. Expect more soon once the opportunistic queue gets reassigned.
    • this meeting: LHCONE - working with John Bigrow.

  • SWT2 (LU):
    • last meeting(s): Fully functional, operational, and active.
    • this meeting:

  • WT2:
    • last meeting(s): Network reconfiguration for the GridFTP server. Observed a problem with the DigiCert CRL update, which caused problems with VOMS proxy init; there were also GUMS issues. The OSG Gratia service was down for more than one day - were job statistics lost?
    • this meeting:

  • T1:
    • last meeting(s): Maintenance next Tuesday. Will integrate the two Arista 7500s and replace obsolete network switches. Will then have full 100G inter-switch connectivity.
    • this meeting:

  • AGLT2:
    • last meeting(s): Downtime on Tuesday. Updated to version 2.6.22. Big cleanup of PRODDISK; removing dark data older than March 13 on USERDISK. Updated SL 3.2.5, with gatekeepers running 3.2.6 (a 3.2.7 emergency release is expected next week).
    • this meeting: Juniper equipment in place at both sites - working on getting the last bit of fiber in place.

  • NET2:
    • last meeting(s): DDM problem - a symptom of an underlying congestion problem (timeouts); upgrading the SRM host to SL6. A validation problem was tracked down to the MAXMEM setting in AGIS: it was 3000, which causes certain tasks to fail validation; it needs to be > 4000. This behavior is undocumented.
    • this week: The FTS3 poor-performance issue was addressed by deploying more GridFTP servers. Gearing up to purchase storage: 700 usable TB with 4 TB drives, about half a rack.

AOB

last meeting
this meeting


-- RobertGardner - 14 May 2014
