
MinutesSep032014

Introduction

Minutes of the Bi-weekly Facilities Integration Program meeting (Wednesdays, 1:00pm Eastern):

Connection info:

Attending

  • Meeting attendees: Dave, Ilija, Michael, Bob, Sarah, Wei, Patrick, Armen, Kaushik, Fred, Hiro, John, Saul, Shawn
  • Apologies: Mark, Jason
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, on-demand - convened by Armen): Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Ilija): ATLAS Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • US ATLAS Software & Computing / Physics Support meeting at LBL, August 20-22, 2014
      • Facility performance metrics meeting yesterday. See URL in email.
        • Google docs to describe plans and gather materials: http://bit.ly/V1T3Q8
        • Updates from Hiro's load testing
        • Need updates in all areas - okay to put URLs to external sources.
        • Michael: setting up an object store based on Ceph at BNL; getting the event service supported by the pilot is making good progress. Ben is working on the integration of the S3 protocol, which is timely since we have a large-scale pilot with ESnet and Amazon underway. With S3 support directly in the pilot, no tricks are required to access Amazon block devices, and the same works with Ceph. Starting to evaluate the performance. (A minimal S3 access sketch is included at the end of this section.)
        • Saul: see email summarizing the pilot environment. Common environment and configuration.
      • For the facilities part of the LBL workshop (https://indico.cern.ch/event/326831/) - about 3 hours. Introductory slides to kick off discussion.
        • Management of user data. How to access, manage, delete, where, allocation.
        • Dynamic configurations for multicore at sites: path forward; this is becoming a big issue, and things are now moving in this direction dynamically. (Mark has noticed Panda dynamically reassigns before availability.)
        • Impact of derivation framework, and Jedi
        • Impact of new analysis models, pROOT, etc
        • Federated access - do we have an understanding of what will hit the sites after Jedi is activated? Michael: it would be good to summarize the extent to which sites should be prepared for WAN accesses.
        • Consolidation of queues.
        • How to prioritize access for US physicists
        • Tier3 and ATLAS Connect, and the new analysis chain. Need a plan, and people willing to help.
      • this week:
        • Notes from Facilities session at Berkeley: http://bit.ly/1BbyoJY
        • Updates to SiteCertificationP30 for HTCondorCE and DC14 columns.
          • Gather list of "To-do" resulting from findings in http://bit.ly/V1T3Q8
          • For example, at MWT2 we see a noticeable recurring error with lsm-puts.
        • US ATLAS-MWT2_UC is hosting the ATLAS ADC TIM meeting in Chicago, October 27-29. These technical interchange meetings are usually focused on system development and planning, and are organized by ADC management. Website is http://tim2014.mwt2.org/ - will be soliciting help with local organization.
        • Michael
          • See follow-up notes in http://bit.ly/1BbyoJY
          • Ilija - working with Torre to create a priority boosting algorithm. Will check with Tadashi, will need to tune weights accordingly.
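
S3 access sketch (referenced from the Ceph/S3 notes above): a minimal illustration, using the classic boto library, of writing one object to an S3-compatible endpoint such as a Ceph RADOS gateway or Amazon S3. The gateway host, credentials, bucket, and key names are placeholders for illustration, not the actual BNL or pilot configuration.

    # Minimal sketch: write one object via the S3 API (works against Ceph RGW or AWS S3).
    # Host, credentials, and bucket/key names are illustrative placeholders.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='s3-gateway.example.org',     # e.g. a Ceph RADOS gateway; use the AWS endpoint for S3
        is_secure=False,                   # plain HTTP for a test gateway; True for production
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    bucket = conn.create_bucket('eventservice-test')     # returns the bucket if it already exists
    key = bucket.new_key('events/output-0001.root')
    key.set_contents_from_filename('output-0001.root')   # upload a local file as the object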

Supporting the OSG VO on US ATLAS Sites

last meeting(s)

  • Boston: reduced queue to 4 hours, OSG made an adjustment. All good.
  • UTA: have not gotten to it yet
  • No other updates.
  • I asked Bo for a follow-up from the OSG side of things. Here is his summary: Hi Rob,

The most striking difference we see is at NET2, where the policy towards the OSG VO shifted from a fixed quota to running fully opportunistically. While there are now periods of time where the OSG VO gets zero hours, we see much larger peaks during periods of lower ATLAS running and effectively get 8-10x more wall hours at the site:

http://gratiaweb.grid.iu.edu/gratia/bysite?starttime=time.time%28%29-6048000&endtime=time.time%28%29&span=86400&relativetime=2592000&facility=%5ENET2%24&exclude-facility=&vo=osg&exclude-vo=&user=&exclude-dn=

Saul can presumably produce a longer-term version of this plot which shows how effectively OSG VO is filling in the gaps of ATLAS production:

http://egg.bu.edu/net2/reporting%7Btype:egg.Hatch%7D/plot_sge_hourly%7Btype:egg.Hatch%7D/SGE_over_time.html

Thanks, Bo

  • At UTA, will get back to this once the network upgrade work is complete.
  • Saul: my point is that the queue time can be short; 4 hours was good.

this meeting

  • UTA - it's enabled. Still need to set up preemption.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • meeting (9/3/14):
    • Nothing much new... analysis is low. Has not looked at overflow jobs.
    • Otherwise all is well.
    • LOCALGROUPDISK - there is a new monitoring page, now visible from BNL, used for tracking usage policy. Set up by Myuko; the data is stored in Hiro's database. Will have a report in 4 weeks (October 1).

Shift Operations (Mark)

  • Reference
  • last week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Alexey Sedov):
    http://indico.cern.ch/event/332960/contribution/2/material/slides/0.pdf
    
    1)  8/20-8/21: NET2 - file transfer errors (source/destination) with the error (SRM) "could not open connection to atlas.bu.edu." As of 8/22 Saul reported the issue was resolved, 
    so closed https://ggus.eu/index.php?mode=ticket_info&ticket_id=107775. eLog 50713.
    2)  8/21: https://ggus.eu/?mode=ticket_info&ticket_id=107789 was opened for DDM staging errors at BNL, but this wasn't really a site issue. Instead there was a very high 
    number of queued requests related to the fact the tape endpoint was chosen as the replica from which to copy files rather than DATADISK, etc. Ticket was closed, eLog 50735. 
    Issue reassigned to DDM ops jira: https://its.cern.ch/jira/browse/ATLDDMOPS-4731, eLog 50740.
    3)  8/23: WT2/SLACXRD - power outage. https://ggus.eu/?mode=ticket_info&ticket_id=107837 was opened as a result of the associated file transfer errors. As of 8/25 systems 
    back on-line, so closed the ggus ticket. eLog 50815.
    4)  8/24: MWT2 - squid service was down. Dave reported that the squid stopped after the host ran out of disk space. Freed up some space, issue resolved. 
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=107841, eLog 50784.
    5)  8/26: ADC Weekly meeting:
    http://indico.cern.ch/event/332960/
    
    Follow-ups from earlier reports:
    
    None
    

  • this week: Operations summary:
    AMOD/ADCoS reports from the ADC Weekly and ADCoS meetings (Michal Svatos):
    http://indico.cern.ch/event/332961/contribution/2/material/slides/0.pdf
    https://indico.cern.ch/event/338376/contribution/1/material/1/0.txt
    
    1)  8/29: OUHEP_OSG - file transfers failing with the error "DESTINATION OVERWRITE srm-ifce err: Communication error on send." 
    Horst discovered that a lack of SHA-2 compliance was the underlying problem; upgrading the BeStMan software fixed it. 
    Issue resolved as of 9/3. https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/50873. 
    2)  9/2: ADC Weekly meeting:
    http://indico.cern.ch/event/332961/
    
    Follow-ups from earlier reports:
    
    None
    

Data Management and Storage Validation (Armen)

  • Reference
  • meeting (7/23/14)
    • Overall the US is deleting at ~20 Hz (estimate). Expect the backlog to be completed shortly.
    • There were a couple of issues with BNL.
    • Fred - so this is about 1.7M files per day (20 files/s x 86,400 s/day ≈ 1.7M)? How fast is dark data being deleted? How many files do we have to delete in the US?
    • Armen - there should be no dark data being generated, only "temporarily". 3M dark files due to Panda pilot issues (see Tomas' email).
    • Sarah: working with Tomas to remove files that are in the Rucio catalog but not in a dataset. Not sure if the file sizes are correct. MWT2, AGLT2, BNL definitely have these.
    • Fred: how are you confirming whether new dark data is being created? Armen: doesn't see anything in the bourricot plots. Suggests looking at Rucio dumps.
  • meeting (9/3/14)
    • Deletion: observing reasonable rates, 12-14 Hz. A week ago it was down, but only for the US.
    • Request to open up WebDAV at NET2, SLAC, and SWT2. Hiro: use Xrootd's WebDAV, or Apache with mod_gridsite. Patrick will investigate. (A configuration sketch follows this list.)
    • There are no significant amounts of dark data: 20-30 TB in USERDISK per Tier2; DATADISK varies by Tier2.
    • Emphasis on BNL to clean up the non-Rucio data. DDM_TEST is looking dark; sometimes this is several hundred TB (300 TB?). There is also some cleanup to do in DATADISK - Hiro.
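
WebDAV configuration sketch (referenced from the item above): an illustrative Apache virtual host exporting a storage path over HTTPS with mod_gridsite, one of the two options Hiro mentioned (the other being Xrootd's built-in WebDAV support). The hostname, paths, and directive choices are assumptions for illustration, not a validated site configuration.

    # Illustrative only - hostname and paths are placeholders.
    LoadModule gridsite_module modules/mod_gridsite.so

    <VirtualHost *:443>
        ServerName            se.example.edu
        DocumentRoot          /atlas/localgroupdisk
        SSLEngine             on
        SSLCertificateFile    /etc/grid-security/hostcert.pem
        SSLCertificateKeyFile /etc/grid-security/hostkey.pem
        SSLCACertificatePath  /etc/grid-security/certificates

        <Directory /atlas/localgroupdisk>
            # GridSite handles X.509-based authorization and directory listings
            GridSiteAuth    on
            GridSiteEnvs    on
            GridSiteIndexes on
            # HTTP methods to accept; writes remain subject to GACL authorization
            GridSiteMethods GET PUT DELETE MOVE
        </Directory>
    </VirtualHost>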

DDM Operations (Hiro)

meeting (7/9/14)
  • There is dark data at NET2 to be cleaned up.
  • Empty directories can be cleaned up.
  • At Tier1, found many datasets from non-Rucio directories that need to be moved.
  • Deletion service - the claim is that it's solved, but the rate is still low. Urgent cleanup tasks are done for some sites. Overall 5-6 Hz. Started a month ago - mostly USERDISK, 0.5M datasets. There is no plan to deal with this; we need a statement from ADC management.
  • Question: how much storage is dark or awaiting deletion?
meeting (7/23/14)
  • No Hiro.
  • Mark: last night most sites were auto-excluded by HC. Two named errors that were DDM-related (could not update datasets, etc.). No eLogs or email. Perhaps an Oracle update affected the site services.
  • Wei: a very large number of jobs in the transferring state. Mark: experts are aware. Saul: notes these are waiting for Taiwan. Mark will follow up.
meeting (9/3/14)
  • No report.

Condor CE validation (Saul, Wei, Bob)

meeting (8/6/14)
  • See Bob's instructions, experience at https://www.aglt2.org/wiki/bin/view/AGLT2/CondorCE
  • Saul: Augustine working with Tim Cartwright and Brian
  • Wei: working with Brian to get it working with LSF. Got a job submitted successfully, but with memory allocation errors.
  • Overall conclusion is that the software is just not there yet.
  • Xin - working on Condor CE at BNL
meeting (9/3/14)
  • New page from Xin: HTCondorCE
  • WT2: Brian has got it running in test mode. Still need to understand which package to install. Then will need to run HC jobs. Believes it's not ready for production. Also need to enable cgroups (see the configuration sketch after this list).
  • AGLT2: running HC on the test gatekeeper, running fine. Need RPMs out of the OSG testing repo to get the best results. Debugging of the job router is inadequate. Condor v8 has differences - there were simple changes that caused problems.
  • NET2: got to the stage of SGE job submissions running. There's a new version. Need to run at scale.
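
HTCondor cgroup sketch (referenced from the WT2 item above): an illustrative worker-node configuration fragment enabling cgroup-based job tracking and memory limits in HTCondor 8.x. The file path, cgroup name, and policy value are assumptions for illustration; actual knobs should follow the OSG/HTCondor-CE documentation for the site.

    # /etc/condor/config.d/99-cgroups.conf  (illustrative path)
    # Attach each job to a cgroup under "htcondor"; the cgroup must exist,
    # e.g. created by the cgconfig service on EL6.
    BASE_CGROUP = htcondor
    # Enforce memory limits softly; "hard" or "none" are the other options.
    CGROUP_MEMORY_LIMIT_POLICY = soft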

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference
meeting (8/6/14)
  • Quite stable - sites are mostly in green.
  • Problem at SLAC, and at MWT2.
  • New version of Xrootd (4.0.3) at SLAC and MWT2, which allows remote debugging.
  • Tests of bandwidth to various places. MWT2 and AGLT2 - up to 4 GB/s.
  • Submitting jobs via Jedi in overflow mode, to all US sites and Triumf. Rates and % successes.
  • Monitoring in overflow developments: added into the ADC dashboard: http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary
  • 4.0.3 at SLAC on RHEL 5/6 as a proxy: crashed.
meeting (9/3/14)

ATLAS Connect (Dave, Rob)

meeting (8/6/2014)
meeting (9/3/2014)
  • Still working with TACC to get a CVMFS solution, of some nature, working on Stampede. They are setting up test nodes with FUSE enabled. (A client configuration sketch follows this list.)
  • Replicate the data on their Lustre file system - using ReplicateCVMFS.
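
CVMFS client sketch (referenced from the item above): an illustrative client configuration for a FUSE-enabled test node mounting the ATLAS repositories. The squid proxy and cache size are placeholders; the eventual Stampede solution (FUSE mounts versus replication onto Lustre) was still being worked out at this point.

    # /etc/cvmfs/default.local  (illustrative values)
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
    CVMFS_HTTP_PROXY="http://squid.example.edu:3128"   # placeholder site squid
    CVMFS_QUOTA_LIMIT=20000                            # local cache size in MB

    # Quick check that a repository mounts and is healthy:
    #   cvmfs_config probe atlas.cern.ch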

Site news and issues (all sites)

  • SWT2 (UTA):
    • last meeting(s): Downtime next Monday for the big network changeover. The Xrootd redirector's RAID controller died; scrambled to move to a new machine.
    • this meeting: Network upgrade is done. Looking at increasing the number of analysis jobs. Received 30 compute nodes.

  • SWT2 (OU, OSCER):
    • last meeting(s): CVMFS upgraded to 2.1.19 on OCHEP cluster. OSCER still not upgraded.
    • this meeting: Ran 1800 job slots on OSCER - 3600 total across LUCILLE, OCHEP, and OSCER; really good.

  • SWT2 (LU):
    • last meeting(s): CVMFS already upgraded. Reverse DNS issue still not resolved; seems to be flip-flopping back and forth.
    • this meeting: Everything running smoothly.

  • WT2:
    • last meeting(s): Short outage August 7 for internal network upgrades.
    • this meeting:

  • T1:
    • last meeting(s): Preparing for a CPU upgrade, procurement underway. New storage replacements are being phased in. Older storage is being converted to an object store using Ceph. Interested in getting experience with other interfaces, such as S3; useful for the event service. Still working on completing the WAN setup. Have 100g into the lab; moving the 10g infrastructure circuit by circuit to 100g (LHCONE, e.g., has a 40g cap). On CVMFS - moving to the new client, 2.1.19.
    • this meeting: WAN connectivity: BNL is building a fully redundant SciDMZ, deploying new Juniper equipment provided by the Lab. Expect to have 200 Gbps. Looking into adding one more link. Currently conducting a large-scale test with Amazon; ESnet is helping tremendously, setting up a VPN between VA, CA, and OR to allow running ATLAS production jobs at very large scale (50-100k) with no networking bottlenecks. Hiro is setting up an SE in Amazon, and asked for a PB of storage. Discussion with FTS3 about adding the S3 protocol (no SRM needed; timescale: November). Hiro is addressing a critical item associated with FAX - when jobs are running at BNL, traffic goes through the firewall rather than the SciDMZ, which interferes with lab traffic; setting up a (client-side) proxy.

  • AGLT2:
    • last meeting(s): All is well. WebDAV door set up successfully; not sure if it's being used. 1.5 GB/s across the two sites.
    • this meeting: Some flakiness in one of the 40g LR interfaces; testing. Getting ready for purchases, with updated quotes from Dell. Analysis users are running multi-core jobs by mistake; the analysis is WW, checked into SVN. Implement cgroups? GLOW even sent some MC jobs. Alden: it was a one-off.

  • NET2:
    • last meeting(s): Downtime coming up, the MGHPCC annual downtime day. Issue recently with LSM for files at a remote site. Migrated files in LOCALGROUPDISK. AutoPyFactory issue: once a week we get a huge number of authz requests from the factory, which jams the gatekeeper, and one of the sites loses contact.
    • this meeting: Getting ready to make purchases and discussing options with Dell.

  • MWT2:
    • last meeting(s): Had problems with storage servers rebooting - investigating load incidents. Closer to being consistent in CCC - 600k files are now dark, down from 5.7M files. Illinois - 14 nodes on order, damaged; getting a priority shipment. perfSONAR box at Illinois strangely capping at 6 Gbps - related to the Myricom driver. UC - delivery and installation of two 32x10g Juniper line cards; testing with a storage server at 20g LACP this week.
    • this meeting:

AOB

last meeting
this meeting
  • No meeting in two weeks (Software workshop)


-- RobertGardner - 02 Sep 2014



Attachments


Tuning_overflow.pdf (1504.2 KB) - IlijaVukotic, 03 Sep 2014 - 12:25
 