
MinutesAug222012

Introduction

Minutes of the Facilities Integration Program meeting, Aug 22, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (6 to mute); please announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Rob, Dave, Saul, Torre, Sarah, Tom, Patrick, Mark, Wei, John, Hiro, Alden, Kaushik
  • Apologies: Jason, Shawn, Michael, Armen
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • New Integration program for FY12Q4, IntegrationPhase22 and SiteCertificationP22
      • Pricing discussions with Dell and timelines
      • Multi-gatekeeper availability issue - probably not an issue, since availability is computed as an OR across gatekeepers, though Fred still isn't certain this applies in all cases.
      • LFC consolidation is the highest priority.
      • Slow progress on multi-core queues: what is the schedule for all sites? (There are solutions for Condor and LSF; for PBS, Patrick has a mechanism, but it is not optimal when there is a mixed set of jobs in the system, which is the concern.)
    • this week
      • Priorities: storage procurements, MCORE queues at all Tier2s, LFC consolidation

Multi-core deployment progress (Rob)

last meeting:
  • Will be a standing item until we have a multi-core queue at each site
  • At BNL, 100 machines
  • Follow the _MCORE naming convention for site queues
this meeting, reviewing status:
  • AGLT2 - no progress, Bob to report in two weeks.
  • MWT2_MCORE available DONE; backed by ten 24-slot R410s using a static configuration, one multi-core slot per node (so 10 machines providing 80 cores). There is an HC functional testing suite. (A sketch of what such a static configuration could look like follows this list.)
  • NET2 - a new gatekeeper is set up with OSG 3 installed; working with SGE, where this should be much easier to set up using the "resources" concept. Some SGE-specific OSG fixes were required.
  • SWT2 - just need to set up the Panda site for the CPB cluster.
  • WT2 - SLACXRD_MP8 DONE
  • Notes -
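For reference, a minimal sketch of a static HTCondor slot layout that could back a whole-node multi-core queue; the 8-core/16x1-core split below is an illustrative assumption, not necessarily MWT2's actual configuration:

    # Illustrative static split of one worker node: a single 8-core slot for
    # MCORE jobs plus 16 single-core slots for ordinary production/analysis jobs.
    SLOT_TYPE_1      = cpus=8, memory=auto, disk=auto
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_2      = cpus=1, memory=auto, disk=auto
    NUM_SLOTS_TYPE_2 = 16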

GlideinWMS testing status (Maxim)

last week:

  • Configured APF on the BNL node grid 04 (co-hosting schedd)
  • Found and corrected mistakes in the above (thanks John H)
  • The Front End at CERN had an outage, now fixed
  • Tested pilot generation via APF: when a test job is submitted to Panda, jobs get registered and pilots are generated
  • minor maintenance on APF instance at CERN
still to do:
  • Based on the above, activate OU and UTA_SWT2 (this week)

this week

  • Including a few sites in the configuration.
  • 12 days of continuous running against AGLT2, MWT2, and UTA.
  • Load fluctuating; 14% of production.
  • No load on the scheduler machine at BNL.
  • Mixing pilots from the APF and glideinWMS channels. Found glideinWMS slow to generate pilots at peak loads, discovered at AGLT2; configured to use two factories, one at the GOC.
  • Operation has been very stable.
  • How fast can the combined system respond? So far testing has been restricted to production.
  • Limiting factors are the polling period in the site APF itself and CPU cycles on the glidein factory.
  • Integration with glexec: has a list of sites.
  • Ran thousands of jobs on the major sites.

Update on LFC consolidation (Hiro, Patrick, Wei)

last week:
  • See Patrick's RFC for the cleanup
  • Replica info is dumped into an SQLite database
  • Can sites just use Patrick's pandamover-cleanup script?
  • Will put this into Hiro's git repo
  • Hiro can create the dump on demand; AGLT2 does this every 12 hours.
  • Please update the ConsolidateLFC page with known issues.
  • A production role is needed

this week:

  • UTA_SWT2 has been migrated! DONE
    • Panda mover cleanup scripts worked okay. Patrick's script might be a direct replacement.
    • Load on LFC/Oracle has been minimal; depends on how many replicas to delete
    • 10 evgen jobs ran fine. Will run test jobs requiring input.
  • Hiro sent info on creating a dump.
  • CCC will be more interesting; it may need something from central DDM, namely a list of GUIDs for a specific dataset, otherwise a run will take forever. CCC is the complete consistency checker, comparing the DQ2 central catalog, the LFC, and what's on disk. It does cache the queries. (A rough sketch of this style of check appears after this list.)
  • SLAC is next.
  • Hiro will set up a git repo page for the script.
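As a rough illustration of the consistency-check idea above, the sketch below loads an LFC replica dump into SQLite and reports replicas with no matching file on disk. The dump format (one "guid surl" pair per line) and the file names are assumptions for illustration only; the actual scripts are the ones Hiro and Patrick maintain.

    # Illustrative only: compare an LFC replica dump against a local disk listing.
    # Dump format (guid + SURL per line) and file names are hypothetical.
    import sqlite3

    def load_dump(db, dump_path):
        cur = db.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS replicas (guid TEXT, surl TEXT PRIMARY KEY)")
        with open(dump_path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    cur.execute("INSERT OR REPLACE INTO replicas VALUES (?, ?)", parts)
        db.commit()

    def missing_on_disk(db, disk_listing_path):
        # disk_listing_path: one SURL per line, produced by a local storage scan
        on_disk = set(l.strip() for l in open(disk_listing_path) if l.strip())
        cur = db.cursor()
        return [s for (s,) in cur.execute("SELECT surl FROM replicas") if s not in on_disk]

    if __name__ == "__main__":
        db = sqlite3.connect("replicas.db")
        load_dump(db, "lfc_dump.txt")                 # hypothetical LFC dump file
        for surl in missing_on_disk(db, "disk_files.txt"):
            print("missing on disk: " + surl)         # candidate lost files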

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Mark reporting - things have been very smooth. ATLAS software installation system notifications were discussed (Mark will send a note).
    • Conversation on job time limits: Michael notes that for production jobs there should be minimal limits, or even unlimited; 3 days is a reasonable limit (Mark notes this is even useful). For analysis jobs, 24 hours would be reasonable. We may need to use this to provide feedback. Bring this up with Alden and discuss with DAST.
    • We do have another unusual situation - lots of assigned jobs at MWT2.
  • this meeting:
    • Noted a 150 TB transfer request to BNL DATADISK; will take it up with the RAC; on hold for now.
    • All is fine. The spike in user analysis 10 days ago was caused by a single user submitting production-like jobs (evgen), which was blocking other user analysis. The user cancelled the request.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-8_13_2012.html
    
    1)  8/10: MWT2/IllinoisHEP - jobs failing with the error "pilot: Get error: Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2702, 
    Bad credentials)|Log put error: lsm-put failed."  Dave reported there were four WN's in total that were causing the errors, so these machines 
    were set off-line.  ggus 85054 in-progress, eLog 38395.
    2)  8/11: BNL file transfer errors ("GENERAL_FAILURE] FAILED:  at Sat Aug 11 14:07:20 EDT 2012 state Failed : ThePinCallbacks Timeout").  
    From Michael: The issue is understood and resolved.  The transfer failures were caused by an overloaded dCache pool node. The overload 
    was caused by internal migration from another storage server that is currently being drained in addition to production traffic. The traffic associated 
    with migration has been adjusted such that it doesn't interfere any more.  eLog 38369.
    3)  8/12: Saul noticed that pilots were failing at NET2 with the message "ERROR: proxy has expired - Error: Failed to submit job. Check schedd: 
    gridui12.usatlas.bnl.gov!"  Jose and John at BNL updated the proxy, and this fixed the problem.
    4)  8/13: UTA_SWT2 - site appeared to be not receiving pilots, but the underlying problem was due to a non-ATLAS VO submitting a huge number 
    of jobs to the site, and this caused major problems for the batch system.  The offending jobs were removed, and normal pilot submissions resumed.  
    We will discuss this issue with the VO in question to hopefully avoid a similar situation in the future.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in 
    https://savannah.cern.ch/support/?129468.  See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    (iii)  7/29: File transfers to UPENN_LOCALGROUPDISK failing with checksum mismatch.  https://savannah.cern.ch/bugs/index.php?96424 
    (Savannah DDM), eLog 38037. 
    (iv)  8/6: MWT2 - FT and other transfers failing with SRM errors.  Issue was actually at IllinoisHEP ("[CONNECTION_ERROR] failed to contact on 
    remote SRM [httpg://osgx1.hep.uiuc.edu:8443/srm/managerv2]").  Problem is being worked on.  ggus 84924 in-progress, eLog 38276/300.  (Around 
    this time Sarah reported: We temporarily had a bad configuration in place for the dCache poolmanager. That config has been reverted and transfers 
    are now successful.  ggus 84932 was opened for this issue - ticket is currently 'in-progress'.  eLog 38300.
    Update 8/9: ggus 84924 closed - no recent transfer errors - issue was due to a bad configuration, now fixed.  (ggus 84932 also closed.)
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-8_20_2012.html
    
    1)  8/15: UTA_SWT2 - file transfers failing at the site with SRM errors.  Appeared to be several concurrent issues (jobs doing a lot of staging, 
    hitting a heavily loaded storage server; high level of central deletions; checksum calculations adding to load on storage server).  Temporarily 
    reduced the number of running jobs in the cluster to alleviate the load.  ggus 85225 / RT 22374 were closed on 8/16, but re-opened later that 
    day for (unrelated) transfer errors.  In this case a failed drive in a RAID array took a different storage server off-line.  During this outage some 
    files were corrupted, which resulted in checksum errors on 8/19.  The files were declared lost to the deletion recovery service.  As of 8/20 all 
    issues resolved, transfers are completing successfully.  ggus 85225 / RT 22374 again closed.  eLog 38586, 
    https://savannah.cern.ch/support/index.php?131312 (Savannah site exclusion).
    2)  8/15: Wei announced a power outage at SLAC which required WT2 to be shutdown: 15-Aug-2012 3pm to 16-Aug-2012 5pm, Pacific 
    daylight saving time.  Outage ended according to schedule.
    3)  8/16: New pilot release from Paul (v54a).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_54a.html
    4)  8/17: SWT2_CPB - file transfer failures with SRM errors.  A storage server went off-line due to a problem with the NIC in the machine 
    (cooling fan).  Problem fixed, host back on-line.  ggus 85299 / RT 22382 closed, eLog 38633.
    5)  8/18:  SLACXRD file transfer errors (" [SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/
    CN=531497/CN=Robot: ATLAS Data Management]").  Wei reported the problem was fixed later that night - ggus 85306 closed, eLog 38546.
    6)  8/20 p.m.: UPENN_LOCALGROUPDISK file transfer errors ("[INVALID_PATH] globus_ftp_client: the server responded with an error 
    500 500-Command failed. : globus_l_gfs_file_open failed. 500-globus_xio: Unable to open file...").  From the site admin: My xrootd redirector 
    was/is unhappy. I have put a temporary fix in place. Hopefully this will solve the problem.  ggus 85369 in-progress, eLog 38602.
    7)  8/22: Jobs waiting at BNL for rel.17.2.4.3 - "special brokerage for manyInput: AtlasProduction/17.2.4.3/i686-slc5-gcc43-opt not found at 
    BNL_CVMFS_1, BNL_PROD_MP8."  Alessandro is working on this problem with the release.  ggus 85405, eLog 38655.  
    (Related Savannah tickets: https://savannah.cern.ch/support/?131355, https://savannah.cern.ch/support/?131412)
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in 
    https://savannah.cern.ch/support/?129468.  See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    (iii)  7/29: File transfers to UPENN_LOCALGROUPDISK failing with checksum mismatch.  https://savannah.cern.ch/bugs/index.php?96424 
    (Savannah DDM), eLog 38037. 
    (iv)  8/10: MWT2/IllinoisHEP - jobs failing with the error "pilot: Get error: Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2702, Bad 
    credentials) | Log put error: lsm-put failed."  Dave reported there were four WN's in total that were causing the errors, so these machines were 
    set off-line.  ggus 85054 in-progress, eLog 38395.
    Update 8/16: ggus 85054 was closed, but the error reappeared the next day.  From Dave: We seem to have a time base problem on a number of 
    our worker nodes, which shows up as authentication problems.  The problematic nodes were taken off-line and fixed - ticket again closed on 8/21.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Yesterday's meeting focused on ease of installation and reliability. There is a Twiki page tracking everything.
    • Work in OSG to possibly re-package the toolkit to make installation easier; a meeting is to be organized (OSG, I2, ESnet) to discuss this and the modular dashboard.
    • Please try to get new equipment into production.
      • NET2 - has systems, not yet in production; plan was to do this at BU, not clear about HU but maybe.
      • UTA - waiting on a 10G port; working internally on which optics - SR, LR; then will buy cable.
      • SLAC - has machines, trying to get them supported in a standard way, as appliances
  • this meeting:

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • The EU and BNL global redirectors are connected and peered.
  • A read-only LFC is being set up at CERN; it does not require a credential with the ATLAS role, which will solve connecting the EU sites.
  • Ilija: may need backup solutions; have selected sites with a credential, and also a WebDAV interface.
  • N2N is working, in both C and Java versions.
  • The dCache-xrootd door bug is fixed, so we can now have authenticated access.
  • Monitoring - working on getting similar information from xrootd and dCache; met with the dCache team last week. New versions of both xrootd and dCache will be needed, plus new plugins.
  • Ilija now supporting collector at SLAC.
  • Topology viewer / monitor.

this week

  • Finding issues with the current version of Xrootd that will affect the FAX infrastructure; just found last night.
  • Read-only LFC at CERN works as expected. A new version of the Java N2N was needed, since the Java LFC API does not support unauthenticated access; it has been tested and deployed at Wuppertal. (A generic sketch of a federated-site xrootd configuration follows this list.)
  • Monitoring - re-discussing parts of the monitoring information collection with Andy; his changes are on hold until he has consulted other Xrootd customers. A patched version is running at SLAC, though there are still issues in the UK. Work continues on code to collect monitoring information from the dCache xrootd door; a version is expected next week.
  • HC tests of the federation are running and providing load to FAX. The sites are online most of the time.
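For orientation, a minimal sketch of the kind of xrootd configuration a site redirector could use to subscribe to a federation; the redirector hostname, export path, and name-to-name plugin path below are placeholders, not the actual FAX values:

    # Illustrative site-redirector config (placeholders, not production FAX settings)
    all.role manager
    all.manager meta global-redirector.example.org:1094   # subscribe to a global (meta) redirector
    all.export /atlas
    # Map global logical file names onto local storage paths via an N2N plugin;
    # the library path here is hypothetical.
    oss.namelib /usr/lib64/XrdN2NLfc.so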

US analysis queue performance (Ilija)

last two meetings
  • Had a short meeting last week; the only development was the 120-second stage-out times. These were traced to timed code in the pilot, and fixes were sent back to Paul. We expect gains in stage-in as well, and perhaps other places in the pilot. This will appear in a future pilot release.
  • Preparing a comparison of stage-in to local disk versus direct access.
  • Met with Philippe Canal and Brian Bockelman about ROOT monitoring - to inspect what users are doing, in order to improve efficiency. Also about optimizing memory usage - CMS has found a 40% reduction in memory usage using Ilija's basket re-ordering scheme. (A generic read-caching illustration follows this list.)
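As a generic illustration of the kind of I/O tuning under discussion, the snippet below enables ROOT's standard TTree read cache (this is not Ilija's basket re-ordering scheme itself; the file URL and tree name are placeholders):

    # Illustrative use of ROOT's TTree read cache from PyROOT (placeholders throughout).
    import ROOT

    f = ROOT.TFile.Open("root://redirector.example.org//atlas/sample.root")  # placeholder URL
    tree = f.Get("physics")                       # placeholder tree name
    tree.SetCacheSize(30 * 1024 * 1024)           # 30 MB read cache
    tree.AddBranchToCache("*", True)              # cache all branches used in the loop
    for i in range(tree.GetEntries()):
        tree.GetEntry(i)                          # reads now served via bulk requests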

this week:

  • The new Panda pilot with the timed-code fixes went online on 16 August. Average stage-out times dropped from 120 seconds to roughly 1-10 seconds. At least four sites (Freiburg, MPPMU, OU_CHEP_SWT2, SCINET) saw drastically improved stage-in times as well.
  • An even more optimized version has been produced, expected to recover another ~10 seconds; it will be tested when Paul returns from vacation.
  • Most sites are now above 80% efficiency.
  • Investigation of event-loop efficiency is ongoing at two sites.
  • Next week sites are asked to report on the bottlenecks at their site using Ilija's test program; this won't be possible for UTA.
  • Now the really difficult work starts: sites are asked to identify their current bottleneck.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Working on storage technologies: content-managed storage and object-based storage, also in the wide-area context. Progress in the cloud processing area as well.
    • this meeting:

  • AGLT2:
    • last meeting(s): Downtime on July 30; back online, running well since then. A user attempting to kill jobs - several thousand; killing them locally as they come in.
    • this meeting: Things are smooth. Moving forward with storage procurement: adding three systems at UM and three at MSU, each with 6 shelves, 96 GB RAM, 3 H800 controllers, and dual CPUs (a slower CPU model).

  • NET2:
    • last meeting(s): LFC issues - had disk problems.
    • this meeting: Running smoothly; met with Dell yesterday and had interesting discussions. Recovered from losing the LFC; no further problems.

  • MWT2:
    • last meeting(s): 20G upgrade at the UC campus core went well. CVMFS-distributed wn-client and $APP being deployed across the sites.
    • this meeting:

  • SWT2 (UTA):
    • last meeting(s): 0.5 PB of storage coming online. OSG CE update problems - condor-G/APF and PBS job manager for globus.
    • this meeting: LFC consolidation and pandamover cleanup have gone well. CPB cluster: new storage has been deployed. Patrick is on vacation next week. The CPB issue was raised explicitly at yesterday's ADC weekly; Kaushik has been following up, and the SSB is green. A ticket was expected.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting:

  • WT2:
    • last meeting(s): Smooth operation; planning for the power outage August 16. Forming a committee to search for a replacement for LSF.
    • this meeting: Smooth until this morning: a disk server error, down for two hours. Storage has been ordered: 44 MD1200 units with 3 TB drives. Working on the SSD cache.

AOB

last week:

this week:


-- RobertGardner - 21 Aug 2012
