
MinutesSep052012

Introduction

Minutes of the Facilities Integration Program meeting, Sep 5, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (6 to mute); please announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Torre, Ilija, Sarah, Bob, Horst, Hiro, John, Shawn, Michael, Jason, Dave, Fred, Wei, Tom
  • Apologies: Patrick
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • New Integration program for FY12Q4, IntegrationPhase22 and SiteCertificationP22
      • LFC consolidation is the highest priority.
      • Slow progress on multi-core queues: what is the schedule for all sites? (There are solutions for Condor and LSF; for PBS, Patrick has a mechanism, but it is not optimal when there is a mixed set of jobs in the system, which is the concern.)
      • Priorities - storage procurements, MCORE at all Tier2's, LFC consolidation
    • this week
      • Rob: review of high level milestones for the facility. Thanks to all for updating the site certification table.
      • Michael: Pledges for 2013, 2014 are being declared this month. Because of the run extension, additional resources are required. There has been a concerted effort to keep the requests at a reasonable level. There are now some solid numbers for the US facility, according to the 23% MOU share, to be discussed at the L2/L3 meeting tomorrow.
      • Michael: multicore slots are now going to be used for a validation campaign. Jose is working on getting these jobs going, so we expect the MCORE queues to be utilized. Hopefully this will lead to getting AthenaMP into production.
      • Michael: accounting statistics are being analyzed at the ICB to get valuable information about how resources are being used.
      • Michael: future of PandaMover in the US. Historically it has been quite valuable (e.g. when DDM was not available enough), for example for staging data from tape. Kaushik: useful as a backup, especially as we transition to Rucio. Last time we tried, we ran into DQ2 load issues on the SRM. Network load may go up to 30%, since files are re-used, which is not normal.
      • Michael: all of this should be re-visited. A factor of 2 or 3 should not be an argument. Note that PandaMover-related issues are hard to debug.
      • Kaushik: the deletion service immediately deletes datasets after the jobs finish.
      • Rob: could you make the change for a single site? Kaushik: yes.
      • Hiro: does not like it.

Multi-core deployment progress (Rob)

last meeting:
  • Will be a standing item until we have a MC queue at each site
  • At BNL, 100 machines
  • Follow the _MCORE naming convention for site queues
  • BNL DONE
  • WT2 - SLACXRD_MP8 DONE
  • MWT2_MCORE available DONE
  • NET2 - has a new gatekeeper set up with OSG 3 installed; working with SGE, where it should be much easier to set this up using the "resources" concept. There were some SGE-specific OSG fixes required.
  • SWT2 - just need to setup the Panda site for the CPB cluster.
this meeting, reviewing status:
  • AGLT2 - in preparation - next week sometime.
  • NET2 - will be starting work on this today - next week. There are questions about controlling command-line options for the scheduler. Will consult Alden & Paul.
  • SWT2 - will do this by end of the week. Close.

GlideinWMS testing status (Maxim)

last week:

  • Including a few sites in the configuration.
  • 12 days of continuous operation against AGLT2, MWT2, UTA
  • Load fluctuating - about 14% of production
  • No load on scheduler machine at BNL
  • Mixing pilots from APF and glideinWMS channels. Found glideinWMS slow to generate pilots at peak loads, as discovered at AGLT2. Configured to use two factories, one at the GOC.
  • Operation has been very stable
  • How fast can the combined system respond? So far have been restricted to production.
  • Response time depends on the polling period in the site APF itself and on CPU cycles on the glidein factory itself.
  • Integration with glexec: has a list of sites.
  • Ran thousands of jobs on the major sites.
this week
  • Would like to extend to use ANALY queues. Had started with AGLT2 last week, before the downtime. Finding much slower fill rate.
  • Maxim gave a detailed report at the ADC development meeting last Monday. Rate of submission is limited, indicating need for more pilot factories.
  • APF likely more powerful.

Update on LFC consolidation (Hiro, Patrick, Wei)

last week(s):
  • See Patrick's RFC for the clean up
  • Replica info into an SQLite database
  • Can sites just use Patrick's pandamover-cleanup script?
  • Will put this into Hiro's git repo
  • Hiro can create something on-demand. AGLT2 - does this every 12 hours.
  • Please update the ConsolidateLFC page with known issues.
  • Production role is needed
  • UTA SWT2 has been migrated! DONE
    • PandaMover cleanup scripts worked okay. Patrick's script might be a direct replacement.
    • Load on LFC/Oracle has been minimal; depends on how many replicas to delete
    • 10 evgen jobs ran fine. Will run test jobs requiring input.
  • Hiro sent info on creating a dump (a minimal load-into-SQLite sketch follows this list).
  • CCC will be more interesting; it may need something from central DDM - a list of GUIDs for a specific dataset - otherwise a run will take forever. CCC is the complete consistency checker, comparing the DQ2 central catalog, the LFC, and what is on disk. It does cache the queries.
  • SLAC is next.
  • Hiro will setup a git repo page for the script.
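As an illustration of the "replica info into an SQLite database" step above, the sketch below loads a plain-text replica dump into a local SQLite table that cleanup scripts could query. The dump format (whitespace-separated guid, lfn, surl per line), file names, and table name are assumptions for illustration; the actual dump Hiro distributes may differ.

    # Minimal sketch (assumed dump format): load an LFC replica dump into SQLite
    # so that local cleanup scripts can run queries against it.
    import sqlite3
    import sys

    def load_dump(dump_path, db_path="replicas.db"):
        conn = sqlite3.connect(db_path)
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS replicas (guid TEXT, lfn TEXT, surl TEXT)")
        with open(dump_path) as dump:
            for line in dump:
                fields = line.split()
                if len(fields) < 3:
                    continue  # skip blank or malformed lines
                cur.execute("INSERT INTO replicas (guid, lfn, surl) VALUES (?, ?, ?)",
                            fields[:3])
        conn.commit()
        # Example of the kind of query a pandamover-cleanup script might run:
        cur.execute("SELECT COUNT(*) FROM replicas WHERE lfn LIKE '%/panda/dis/%'")
        print("PandaMover-style replicas:", cur.fetchone()[0])
        conn.close()

    if __name__ == "__main__":
        load_dump(sys.argv[1])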

this week:

  • Three sites (UTA, SLAC, AGLT2) are now converted. UTA seems to have problems; consulting with Patrick.
  • Mark notes that PandaMover needs fixing: the replica check succeeds when it should not. Patrick has contacted Tadashi, and Mark has created a ticket. This only became apparent after SLAC consolidated. The fix is thought to be simple.
  • Pause on new sites.
  • Hiro: need to revisit CCC (a sketch of the three-way check follows this list). Shawn has not tried it at AGLT2 yet.
  • Shawn - is checking for production errors.
  • BU is scheduled next - perhaps next Monday. John in communication with Hiro.
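For reference, a minimal sketch of the three-way check CCC performs, assuming each catalogue can be reduced to a flat list of GUIDs. The real CCC queries the DQ2 central catalog, the LFC, and the site storage directly (and caches those queries); the file-based loaders here are placeholders.

    # Sketch of a CCC-style three-way consistency check over GUID sets.
    # The three input files stand in for dumps of the DQ2 central catalog,
    # the site LFC, and a walk of the storage namespace.
    import sys

    def load_guids(path):
        """Read one GUID per line, ignoring blank lines."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def compare(dq2_file, lfc_file, disk_file):
        dq2, lfc, disk = (load_guids(p) for p in (dq2_file, lfc_file, disk_file))
        report = {
            "in DQ2 but not in LFC (missing replicas)": dq2 - lfc,
            "in LFC but not on disk (ghost entries)": lfc - disk,
            "on disk but not in LFC (dark data)": disk - lfc,
        }
        for label, guids in report.items():
            print("%s: %d" % (label, len(guids)))
        return report

    if __name__ == "__main__":
        compare(*sys.argv[1:4])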

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Notes 150 TB transfer request to BNL DATADISK; will take up with RAC; on hold for now.
    • All is fine. 10 days ago a spike in user analysis was caused by a single user submitting production-like jobs (evgen), blocking user analysis. The user cancelled the request.
  • this meeting:
    • All is well.

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • No report.
  • this meeting:
    • Hiro - will send user disk cleanup reminder
    • Armen - localgroupdisk issues - what about policies? Do we need one? Generally no deletion there. Recent issue with SLAC - a 100 TB request. Wei added this space there, about 500 TB of LOCALGROUPDISK (reduces the pledge). How to get users to clean up? The situation is different in various places.
    • Hiro: why is the ToA number different from what Ueda quotes? Where does 1.2 PB come from? Michael: it comes from the pledge.
    • Armen - expect more flow into localgroupdisk, since DATADISK is undergoing some deletion, or moving into GROUPDISK token areas.
    • Notes a spike in DQ2.
    • Will restart in two weeks.
    • Can we do something to improve DDM subscriptions to LOCALGROUPDISK?
    • Kaushik notes we will have accounting.
    • Alden: send any issues to DAST

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=paper&confId=203430
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-8_27_2012.html
    
    1)  8/22: UTA_SWT2 migrated to the LFC at BNL.
    2)  8/24: AGLT2 - file transfer failures with SRM errors - from the site admin: There was a problem with our virtual infrastructure storage backend affecting 
    head01.aglt2.org and others. It is now corrected and SRM should be working correctly at our site.  Issue resolved - ggus 85480 closed, eLog 38822.
    3)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token 
    off-line to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 
    (Savannah site exclusion ticket), eLog 38795.
    4)  8/27: MWT2 - ggus 85054 was re-opened for job failures with the error "pilot: Get error: Failed to get LFC replicas."  Issue was due to a problem 
    distributing certificates to WN's - now fixed.  ggus ticket again closed - eLog 38878.
    5)  8/27: BNL - T0 exports to the site were failing with the error message "globus_gass_copy_register_url_to_url transfer timed out."  From Jane: 
    There were problems with two GFTP doors. The hosts were rebooted and the problem was gone.  ggus 85541 closed, eLog 38874.
    6)  8/28: AGLT2 networking issue - from Shawn: The UltraLight router in Chicago was decommissioned today and we don't seem to have all the routes 
    that were provided from that router. This is causing some problems for what should be general internet connected subnets.  This includes the gratia 
    reporting host and the pilot factory at BNL.  Issue reported to be resolved on 8/29 - see:
    http://www-hep.uta.edu/~sosebee/ADCoS/AGLT2-router-decommission-8_29_12.html.  eLog 38900.  (There were also file transfer/network problems at 
    MWT2 around this time - not clear if the issues were related.)
    7)  8/28: OU_OCHEP_SWT2 - Horst reported a problem with the Lustre storage system at the site.  Blacklisted while the issue is being worked on.  
    (Had to be done manually - automated system did not work.)  https://savannah.cern.ch/support/index.php?131626 (Savannah site exclusion), eLog 38909.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in https://savannah.cern.ch/support/?129468.  
    See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    (iii)  7/29: File transfers to UPENN_LOCALGROUPDISK failing with checksum mismatch.  https://savannah.cern.ch/bugs/index.php?96424 (Savannah DDM), 
    eLog 38037. 
    (iv)  8/20 p.m.: UPENN LOCAL GROUPDISK file transfer errors ("[INVALID_PATH] globus_ftp_client: the server responded with an error 500 500-Command 
    failed. : globus_l_gfs_file_open failed. 500-globus_xio: Unable to open file...").  From the site admin: My xrootd redirector was/is unhappy. I have put a temporary 
    fix in place. Hopefully this will solve the problem.  ggus 85369 in-progress, eLog 38602.
    (v)  8/22: Jobs waiting at BNL for rel.17.2.4.3 - "special brokerage for manyInput: AtlasProduction/17.2.4.3/i686-slc5-gcc43-opt not found at 
    BNL_CVMFS_1, BNL_PROD_MP8."  Alessandro is working on this problem with the release.  ggus 85405, eLog 38655.  (Related Savannah tickets: 
    https://savannah.cern.ch/support/?131355, https://savannah.cern.ch/support/?131412)
    Update 8/24: Issue resolved - ggus 85405 closed.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=paper&confId=203434
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-9_3_2012.html
    
    1)  8/29: SLACXRD migrated to LFC at BNL.
    2)  8/29: BNL - file transfer failures to OSG2_SCRATCHDISK with the error "file exists, overwrite is not allowed."  From Hiro: This is caused by a race 
    condition between the deletion service and subscription. DDM will eventually take care of it itself. This is a known issue and the ticket should not have been created. 
    https://savannah.cern.ch/support/index.php?131665 closed, eLog 38927.
    3)  8/29: UPENN_LOCALGROUPDISK file transfer errors ("possibly the destination disk is full").  From the site admin: This happens occasionally in times 
    of high load. The "possibly the destination disk is full" is a red-herring and has nothing to do with the real cause.  Reducing the load on the storage 
    system fixed the problem.  https://savannah.cern.ch/support/index.php?131656 closed, eLog 38935.
    4)  8/30: BNL reported a fiber cut on the ESnet links between New York and Washington.  Issue resolved as of ~7:30 UTC on 8/31.  eLog 38979.  See:
    http://www-hep.uta.edu/~sosebee/ADCoS/BNL-ESnet-fiber-cut-8_31-12.html
    5)  8/31: UPENN SRM transfer errors - appeared to again be due to an overload condition on the storage system (see #3 above).  Problem resolved itself after 
    a few hours.  Site admin asked whether the incoming connections could somehow be throttled to avoid a repeat of the problem (perhaps at the FTS channel 
    level?).  ggus 85699 closed, eLog 39002.
    6)  8/31: MWT2 file transfer errors ("failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]").  Rob closed ggus 85703 
    with a detailed explanation of the status at the site.  Will continue to investigate the issue.  eLog 38991.
    7)  9/1: AGLT2: file transfer failures with "locality is unavailable" errors.  From Bob: Currently we do not route to TRIUMF or several other locations, due 
    to the shutdown of the UltraLight router. We are working to establish alternate routing.  ggus 85712 in-progress, eLog 39010.
    8)  9/3: ggus 85728 was opened for "stuck" file transfers to BNL.  Not really "stuck," since all of the transfers eventually either finished or failed.  Also, the 
    ticket was mistakenly routed to MWT2.  Ticket closed.
    9)  9/3: From Rob: There was a power interruption at MWT2_UC last night affecting a number of worker nodes. These will result in lost heartbeat failures 
    in the MWT2 queue.  eLog 39047.
    10)  9/4: UTA_SWT2: Jobs failing with an error like "cp: cannot stat '/xrd/atlasproddisk/panda/dis/12/07/22/...': No such file or directory."  Problem is being 
    investigated.  ggus 85771 / RT 22442 in-progress, eLog 39114.
    11)  9/4: AGLT2 - LFC migration to BNL host underway.  http://savannah.cern.ch/support/?131669 (Savannah site exclusion).
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in https://savannah.cern.ch/support/?129468.  
    See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    (iii)  7/29: File transfers to UPENN_LOCALGROUPDISK failing with checksum mismatch.  https://savannah.cern.ch/bugs/index.php?96424 (Savannah DDM), eLog 38037. 
    (iv)  8/20 p.m.: UPENN LOCAL GROUPDISK file transfer errors ("[INVALID_PATH] globus_ftp_client: the server responded with an error 500 500-Command failed. : 
    globus_l_gfs_file_open failed. 500-globus_xio: Unable to open file...").  From the site admin: My xrootd redirector was/is unhappy. I have put a temporary fix in place. 
    Hopefully this will solve the problem.  ggus 85369 in-progress, eLog 38602.
    Update 8/30: recent transfers are succeeding, issue resolved.  ggus 85369 closed.
    (v)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line
    to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion 
    ticket), eLog 38795.
    (vi)  8/28: OU_OCHEP_SWT2 - Horst reported a problem with the Lustre storage system at the site.  Blacklisted while the issue is being worked on.  (Had to be done 
    manually - automated system did not work.)  https://savannah.cern.ch/support/index.php?131626 (Savannah site exclusion), eLog 38909.
    Update 8/29 p.m.: Horst reported the problem was resolved with DDN support.  As of 8/30 site unblacklisted in DDM, test jobs successful, all issues resolved.  eLog 38937.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Yesterday's meeting focused on ease of installation and reliability. There is a Twiki page tracking everything.
    • Work in OSG to possibly re-package the toolkit to make it easier - a meeting is to be organized (OSG, I2, ESnet) to discuss this and the modular dashboard.
    • Please try to get new equipment into production.
      • NET2 - has systems, not yet in production; plan was to do this at BU, not clear about HU but maybe.
      • UTA - waiting on a 10G port; working internally on which optics - SR, LR; then will buy cable.
      • SLAC - has machines, trying to get them supported in a standard way, as appliances
  • this meeting:
    • Asking again that all sites upgrade perfSONAR.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Finding issues with current version of Xrootd that will affect FAX infrastructure - just discovered.
  • Read-only LFC at CERN works as expected. A new version of the Java N2N was needed because the Java LFC API does not support unauthenticated access; tested and deployed at Wuppertal.
  • Monitoring - re-discussing parts of the monitoring info collection with Andy. Working on info-collection code for the dCache xrootd door.
  • HC tests of federation are running and providing load to FAX. The sites are online most of the time.
  • Monitoring information publication - running a patched version at SLAC. We still have issues in the UK.
  • Ilija - discussing with Andy publishing monitoring metrics; changes are on hold until Andy has consulted other customers of Xrootd. Collecting monitoring info from the dCache xrootd door. Will have a version next week.
this week
  • Preliminary format of new monitoring messages agreed upon. Estimated time to implement in xrootd is 2-3 months, dCache and the collector shortly afterwards. Have to investigate whether we can get some additional information from current monitoring.
  • Got the first real user(s); need to advertise.
  • Wei: new xrootd release had problems, expect another today or tomorrow.
  • Work starting on joining LRZ into the federation.

US analysis queue performance (Ilija)

last two meetings
  • New PanDA pilot with timed-code fixes went online on 16 Aug. Average stage-out times went from 120 seconds to ~10 seconds. At least 4 sites drastically improved stage-in times: Freiburg, MPPMU, OU_OCHEP_SWT2, SCINET.
  • An even more optimized version has been produced. It will be tested when Paul comes back from vacation.
  • Most sites at >80% efficiency.
  • Investigation ongoing into event-loop efficiency at two sites.
  • New pilot went online 16 August. Contained changes to timed code - stage-out times went from 120 sec to (1-10) seconds (see the timing sketch after this list). At four sites, stage-in times went down dramatically.
  • An even newer version will get back another 10 seconds. Most sites are 80% efficient.
  • Working on problematic sites.
  • Next week sites are asked to report on bottlenecks at their site using Ilija's test program. Won't be able to get this for UTA.
  • Now the really difficult job starts.
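To illustrate what the timed-code change amounts to (this is not the actual pilot implementation), the sketch below wraps a stage-out copy command in a timeout so a hung transfer fails after a bounded wait instead of blocking the job; the command and timeout value are placeholders.

    # Illustrative timeout wrapper around a stage-out copy command.
    # Not the PanDA pilot code; the command and limits are placeholders.
    import subprocess
    import time

    def timed_copy(cmd, timeout_seconds=300):
        """Run a copy command; return (returncode, elapsed). Kill it on timeout."""
        start = time.time()
        try:
            proc = subprocess.run(cmd, timeout=timeout_seconds,
                                  capture_output=True, text=True)
            return proc.returncode, time.time() - start
        except subprocess.TimeoutExpired:
            # Caller treats None as a stage-out failure and can retry or fall back.
            return None, time.time() - start

    if __name__ == "__main__":
        rc, elapsed = timed_copy(["cp", "output.root", "/scratch/output.root"])
        print("rc=%s elapsed=%.1fs" % (rc, elapsed))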
this week:
  • Produced a site-by-site comparison for direct and copy2scratch access modes. Results can be found here. To be able to advise a site on the optimal access mode for analysis jobs one would need to do a stress test with a large number of test jobs, or preferably switch the ANALY queue from its current mode for one or two days.
  • Most sites show reasonable efficiency. A few strange results will need to be addressed.
  • It is apparent that the event-loop CPU time is still not the biggest part of the total execution time (see the efficiency sketch after this list). Needs further investigation.
  • Two sites reported details on their investigations. No obvious weak spots found. Will continue investigating.
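As a rough illustration of the efficiency figure under discussion (CPU time over wall-clock time), the sketch below aggregates it per site from job records; the record layout and site names are hypothetical, not the actual PanDA schema.

    # Sketch: per-site CPU efficiency = sum(cpu_time) / sum(wall_time).
    # The job records are illustrative; real values would come from PanDA job data.
    from collections import defaultdict

    jobs = [
        {"site": "ANALY_EXAMPLE_A", "cpu_time": 3200.0, "wall_time": 3600.0},
        {"site": "ANALY_EXAMPLE_A", "cpu_time": 1500.0, "wall_time": 2400.0},
        {"site": "ANALY_EXAMPLE_B", "cpu_time": 2800.0, "wall_time": 4000.0},
    ]

    def efficiency_by_site(job_records):
        cpu, wall = defaultdict(float), defaultdict(float)
        for job in job_records:
            cpu[job["site"]] += job["cpu_time"]
            wall[job["site"]] += job["wall_time"]
        return {site: cpu[site] / wall[site] for site in wall if wall[site] > 0}

    if __name__ == "__main__":
        for site, eff in sorted(efficiency_by_site(jobs).items()):
            print("%s: %.0f%%" % (site, 100 * eff))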

Site news and issues (all sites)

  • T1:
    • last meeting(s): Working on storage technologies. Content-managed storage and object-based storage, also in the wide-area context. Also progress in cloud processing area.
    • this meeting: New version of FTS 3 is ready for testing though far from complete; Hiro is installing it. 2400 MCORE slots. Not quite happy with the way Condor handles multi-core reservations. On Friday, Dan Fraser and Todd Tannenbaum will discuss shortcomings and the path going forward. WLCG has discussed taking up the federated ID management capability again. There will be a pilot program, and BNL will be involved. Looking at modern storage management solutions, e.g. Hadoop-based installations. There are other companies that are addressing shortcomings in the native version; one is MapR (http://www.mapr.com): NFS interface, distributed metadata, quotas.

  • AGLT2:
    • last meeting(s): Things are smooth. Moving forward with storage procurement. Adding three systems at UM, three at MSU, each with 6 shelves, 96 GB RAM, 3 H800 controllers, and dual CPUs (a slower CPU model).
    • this meeting: Back online with both queues. About to purchase. Dell visit - new dense storage, available for ordering. MD3260 - front-end RAID node; MD3060e expansion; 60 disks in 4U. Dynamic disk pool RAID - much faster rebuild, dynamically sized; price was an issue. R720 head node for one of these. More storage behind a single head node than previously, so evaluating potential bottlenecks. 1/2 PB in 18U! Now have one domain per pool, rather than six per pool; more flexibility. Updated Condor 7.8.2 installed. CVMFS updated from 2.0.13 to 2.0.18; most recent wn-client.

  • NET2:
    • last meeting(s): Running smoothly; met with Dell yesterday and had interesting discussions. Recovered from losing the LFC; no further problems.
    • this meeting: Preparing to purchase storage. Lots of other work on-going.

  • MWT2:
    • last meeting(s): The 20G upgrade at the UC campus core went well. CVMFS-distributed wn-client and $APP being deployed across sites.
    • this meeting: LHCONE migration at UC. 800 slots in the MCORE queue. SRM failures on Friday - could have been an SRM thread issue. Perfsonar upgraded to 10G, and configured to test against LHCONE sites; configured with 10G/1G so we can test both properly. Converted the Condor config to use IP addresses; will be upgrading soon.

  • SWT2 (UTA):
    • last meeting(s): LFC consolidation and PandaMover cleanup have gone well. CPB cluster - new storage has been deployed. Patrick on vacation next week. The CPB issue was raised explicitly at yesterday's ADC weekly; Kaushik has been following up, SSB is green. A ticket was expected.
    • this meeting: Multicore configuration nearly finished. Working the PandaMover issue with LFC consolidation. Available disk at SWT2 is ~1600 TB; updated today, it will be about 2100 TB.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting: Ordering storage - quote in hand, placing the order in the next two weeks; about 200 TB. The Lustre issue was a metadata server deadlock, fixed with a reboot.

  • WT2:
    • last meeting(s): Smooth until this morning - a disk server error took it down for two hours. Storage has been ordered: 44 MD1200 units with 3 TB drives. Working on SSD cache.
    • this meeting: Migrated SRM and one of the CEs to virtual machines. Outage this weekend - Friday to Monday morning, extensive power work. PO for next storage went out; delivery will be 1 PB usable. Also ordering 20 R510s for SLAC's local Tier 3 use, a PROOF cluster expansion.

AOB

last week:
this week:
  • Shawn, Michael: not all ATLAS subnets are advertised from the PBR at BNL.


-- RobertGardner - 04 Sep 2012
