
MinutesAug82012

Introduction

Minutes of the Facilities Integration Program meeting, Aug 8, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); please announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Bob, Chris Walker (OU, for Horst), Patrick, Dave, Armen, Mark, John, Sarah, Michael, Shawn, Hiro, Ilija, Fred,
  • Apologies: Horst, Jason
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
    • this week
      • New Integration program for FY12Q4, IntegrationPhase22 and SiteCertificationP22
      • Pricing discussions with Dell and timelines
      • Multi-gatekeeper availability issue - not an issue (it's an OR)? Fred still isn't certain this applies in all cases.
      • LFC consolidation is a highest priority.
      • Slow progress on multi-core queues: what is the schedule for all sites? (There is a solution for Condor and LSF; for PBS, Patrick has a mechanism, but it is not optimal when there is a mixed set of jobs in the system; this is the concern.)

Multi-core progress

  • Will be a standing item until we have an MC queue at each site
  • At BNL, 100 machines
  • Follow the _MCORE naming convention for site queues (see the sketch below)
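
For illustration only, here is a minimal sketch in Python of submitting a multi-core test job to a local HTCondor pool - the kind of whole-node-style request a pilot landing on an _MCORE queue needs from the batch system. The 8-core count and the file names are assumptions, not anything agreed at the meeting; a PBS/Torque site would express the same request as roughly nodes=1:ppn=8, which is where the mixed-job scheduling concern noted above comes from.

    import subprocess
    import tempfile

    # Hedged sketch: queue one 8-core test job via HTCondor.  'request_cpus'
    # is standard HTCondor submit syntax; the executable, arguments, and log
    # file names are placeholders.
    SUBMIT = "\n".join([
        "universe     = vanilla",
        "executable   = /bin/sleep",
        "arguments    = 600",
        "request_cpus = 8",
        "log          = mcore_test.log",
        "output       = mcore_test.out",
        "error        = mcore_test.err",
        "queue",
    ]) + "\n"

    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT)
        subfile = f.name

    # condor_submit reads the submit description and queues one 8-core job.
    subprocess.run(["condor_submit", subfile], check=True)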

GlideinWMS testing plan (Maxim)

last week:

  • With one available US schedd, we can reasonably hope to handle fewer than 10**4 jobs
  • The target is 50k jobs
  • We won't be able to ramp up within a week or two, until more machines are commissioned and the software is installed -- thus no immediate large impact
  • More resources are needed at BNL

  • OU (Horst)
  • UTA_SWT2 (per Sasha Vaniashin)
  • TBA

  • Remember this is a scalability test, so failures are to be expected by definition
  • assuming roughly 4 schedd machines handling US submission, an outage will be noticeable
  • do we need to go full scale? Can we extrapolate from stressing one schedd with 10 to 20 thousand jobs?
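
As an aside on how one might watch that extrapolation in practice, here is a small hedged sketch in Python that polls a schedd's condor_q summary line once a minute. It assumes only that condor_q is on the PATH of the schedd host; it is illustrative and not part of the agreed test plan.

    import subprocess
    import time

    # Hedged sketch: print the condor_q summary line (e.g. "12345 jobs; ...")
    # once a minute, so the ramp-up of a single schedd toward the 10-20k job
    # range can be watched.  The exact summary format varies between Condor
    # versions, so we simply echo whatever line contains "jobs;".
    def poll_once():
        out = subprocess.run(["condor_q"], capture_output=True, text=True).stdout
        summary = [line for line in out.splitlines() if "jobs;" in line]
        print(time.strftime("%H:%M:%S"),
              summary[-1].strip() if summary else "no summary line found")

    if __name__ == "__main__":
        while True:
            poll_once()
            time.sleep(60)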

this week

  • Configured APF on the BNL node grid04 (co-hosting the schedd)
  • Found and corrected mistakes in the above (thanks John H)
  • The Front End at CERN had an outage, now fixed
  • Tested pilot generation via APF: when a test job is submitted to Panda, jobs get registered and pilots are generated
  • Minor maintenance on the APF instance at CERN

still to do:

  • based on the above, activate OU and UTA_SWT2 (this week)

Update on LFC consolidation (Hiro, Patrick, Wei)

last week:
  • Waiting for Patrick to finish the dump script; he is working on it.
  • Patrick also wants to look at running the pandamover cleanup script against BNL's LFC. He will perform a test and compare local versus remote performance.
  • If this test works, the site will be converted to the BNL LFC.

this week:

  • See Patrick's RFC for the cleanup
  • Replica info will be dumped into an SQLite database (see the sketch after this list)
  • Can sites just use Patrick's pandamover-cleanup script?
  • Will put this into Hiro's git repo
  • Hiro can create dumps on demand; AGLT2 does this every 12 hours.
  • Please update ConsolidateLFC with known issues.
  • A production role is needed
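
As a sketch of the "replica info into an SQLite database" idea above: the Python below loads a flat replica dump into a local SQLite file that a site cleanup or consistency script could query. The dump format (one whitespace-separated "<lfn> <pfn>" pair per line) and the table layout are assumptions, not Patrick's or Hiro's actual format.

    import sqlite3
    import sys

    # Hedged sketch: load a flat LFC replica dump into SQLite so site-side
    # cleanup/consistency scripts can query it locally rather than hitting
    # the consolidated LFC at BNL.  The assumed input format is one
    # whitespace-separated "<lfn> <pfn>" pair per line.
    def load_dump(dump_path, db_path="replicas.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS replicas (lfn TEXT, pfn TEXT)")
        rows = []
        with open(dump_path) as dump:
            for line in dump:
                parts = line.split()
                if len(parts) >= 2:
                    rows.append((parts[0], parts[1]))
        conn.executemany("INSERT INTO replicas VALUES (?, ?)", rows)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        load_dump(sys.argv[1])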

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
  • this meeting:
    • Mark reporting - things have been very smooth. ATLAS software installation system notifications: Mark will send a note.
    • Discussion of job time limits; Michael notes that for production jobs there should be minimal limits, or even none. Three days is a reasonable limit (Mark notes this is even useful). For analysis jobs, 24 hours would be reasonable. We may need to use this to provide feedback. Bring this up with Alden and discuss with DAST.
    • We do have another unusual situation - lots of assigned jobs at MWT2.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=201644
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-7_30_2012.html
    
    1)  7/26: US Cloud Frontier/Squid configuration updated in AGIS to improve the fail-over policy.  See:
    https://twiki.cern.ch/twiki/bin/viewauth/Atlas/FrontierAGISChangelog, http://www-hep.uta.edu/~sosebee/ADCoS/Frontier-fail-over-US-sites-7_26_12.html,  
    eLog 37959.
    2)  7/28: AGLT2 site setting queues to 'brokeroff' in preparation for a maintenance outage beginning the next day.
    Update 7/31 p.m. from Bob: We are now ending our downtime for all AGLT2 queues, now that our SE rsv probe is passing.  I have set the queues for "test" 
    with HC.Test.Me comment.
    3)  7/29: File transfers to UPENN_LOCALGROUPDISK failing with checksum mismatch.  https://savannah.cern.ch/bugs/index.php?96424 (Savannah DDM), 
    eLog 38037. 
    4)  7/30: UPENN_LOCALGROUPDISK file transfer failures ("[GRIDFTP_ERROR] an end-of-file was reached globus_xio: An end of file occurred (possibly 
    the destination disk is full)").  From the site admin: This is the result of xrootd issues under load; it will eventually resolve itself. I have been unable to get help 
    to find the underlying cause, even in figuring out how to start to debug.  As of 8/1 the transfer efficiency had improved to ~90%, so ggus 84659 was closed.  
    eLog 38137.
    5)  8/1: AGLT2 file transfer failures ("[NO_PROGRESS] No markers indicating progress received for more than 180 seconds]").  From Bob: We had troubles 
    on 3 dCache file servers overnight, plus some afs issues. afs and one file server are resolved. Working on the other 2 file servers.  Later: All file servers and 
    their pools are now back online.  ggus 84714 closed, eLog 38132.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in https://savannah.cern.ch/support/?129468.  
    See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    (iii)  7/22: NERSC_SCRATCHDISK file transfer errors ("System error in write: No space left on device").  Armen and Stephane pointed out there is a mismatch 
    between the token size/space between the SRM and dq2.  Also, the latest transfer failures are due to a checksum error.  NERSC admins notified.  
    http://savannah.cern.ch/support/?130509 (Savannah site exclusion), eLog 37840.  ggus 84466 / RT 22301 also opened on 7/23 for file transfer failures.  eLog 37882.
    (iv)  7/22: HU_ATLAS_Tier2 - job failures with the error "pilot: Put error: lfc-mkdir threw an exception: [Errno 3] No such process|Log put error: lfc-mkdir threw an exception."  
    From John: We are having disk hardware issues with our LFC.  We have all our daily backups and are formulating a plan forward.  Later John announced it wasn't 
    possible to save the old LFC host hardware, so a new instance has to be created, restored from a backup, and consistency checks performed.  
    ggus 84436 in-progress, eLog 37876.
    Update 8/1: LFC recovered and back on-line.  ggus 84436 closed, eLog 38173.
    (v)  7/23: Transfers to UTD_HOTDISK and UTD_LOCALGROUPDISK failing with SRM errors.  ggus 84467 in-progress, eLog 37883.
    Update: site downtime Tue, 24 July, 5am – Sun, 29 July, 6am.
    (vi) 7/24: UPENN_LOCALGROUPDISK file transfer errors ("failed to contact on remote SRM [httpg://srm.hep.upenn.edu:8443/srm/v2/server]").  ggus 84518 in-progress, 
    eLog 37910.  (Site admin reports the problem has been fixed - can the ticket be closed?)
    Update 7/26: transfers completed, no recent SRM errors, ggus 84518 closed.  eLog 37948.
    (vii)  7/25: UTA_SWT2_PRODDISK file transfer errors (" [INTERNAL_ERROR] no transfer found for the given ID. Details: error creating file for memmap ...).  
    Issue under investigation.  ggus 84548 / RT 22307 in-progress, eLog 37917.
    Update later on 7/25: not a site issue, but rather related to FTS.  From Hiro: For some reason, the old FTS box (shut down a few months ago) was still running 
    the star-uta channel. I just shut it off.  ggus 84548 closed.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=203431
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-8_6_2012.html
    
    1)  8/6: MWT2 - FT and other transfers failing with SRM errors.  Issue was actually at IllinoisHEP ("[CONNECTION_ERROR] failed to contact on remote 
    SRM [httpg://osgx1.hep.uiuc.edu:8443/srm/managerv2]").  Problem is being worked on.  ggus 84924 in-progress, eLog 38276.  (Around this time Sarah 
    reported: We temporarily had a bad configuration in place for the dCache poolmanager. That config has been reverted and transfers are now successful.)  
    ggus 84932 was opened for this issue - ticket is currently 'in-progress'.  eLog 38300.
    2)  8/7: UPENN_LOCALGROUPDISK - file transfer failures with SRM errors ("[CONNECTION_ERROR] failed to contact on remote SRM 
    [httpg://srm.hep.upenn.edu:8443/srm/v2/server]").  From the site admin: This was a hardware problem that brought our SRM down. It is fixed now. 
    As of early a.m. 8/8 there were no recent errors, so ggus 84957 was closed.  eLog 38305.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in https://savannah.cern.ch/support/?129468.  
    See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    (iii)  7/22: NERSC_SCRATCHDISK file transfer errors ("System error in write: No space left on device").  Armen and Stephane pointed out there is a 
    mismatch between the token size/space between the SRM and dq2.  Also, the latest transfer failures are due to a checksum error.  NERSC admins notified.  
    http://savannah.cern.ch/support/?130509 (Savannah site exclusion), eLog 37840.  ggus 84466 / RT 22301 also opened on 7/23 for file transfer failures.  eLog 37882.
    Update 8/6: An upgrade of BeStMan fixed a problem with checksum calculations.  No recent errors, so ggus 84466 / RT 22301 were closed. 
    (iv)  7/23: Transfers to UTD_HOTDISK and UTD_LOCALGROUPDISK failing with SRM errors.  ggus 84467 in-progress, eLog 37883.
    Update: site downtime Tue, 24 July, 5am – Sun, 29 July, 6am.
    Update 8/4: Since the conclusion of the downtime the earlier failed transfers have succeeded, so ggus 84467 was closed.  eLog 38219.
    (v)  7/29: File transfers to UPENN_LOCALGROUPDISK failing with checksum mismatch.  https://savannah.cern.ch/bugs/index.php?96424 (Savannah DDM), 
    eLog 38037. 
    

DDM Operations (Hiro)

  • this meeting:
    • BU LFC issue has been resolved.

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • 10G perfsonar sites?
      • NET2: up - to be configured.
      • WT2: have to check with networking people; equipment arrived.
      • SWT2: have hosts, need cables.
      • OU: have equipment.
      • BNL - equipment on hand
    • Michael: we are running behind schedule with connecting AGLT2 and MWT2 to LHCONE
    • See yesterday's throughput meeting minutes from Shawn
    • 10G perfsonar installation at sites - all sites have it, but not operational everywhere. BNL - this week; SWT2 - negotiating with local networking; SLAC - security issue, classify as appliance.
    • Found and fixed problem with AGLT2 path to non-ESNet sites due to machines in Chicago.
    • LHCONE: circuits set up at MWT2, AGLT2, and BNL should take preference as the primary path; the next path would be LHCONE.
    • Michael: dedicated 10G transatlantic for LHCONE traffic between Starlight and Europe (paid by GEANT-DANTE). So available bandwidth to the infrastructure is growing. Up to 60 Gbps (50 is shared). Situation has improved significantly. Demonstrated large bandwidth between small Tier 2 in Italy and BNL. 3X improvement DESY-BNL.
  • this meeting:
    • Yesterday's meeting focused on ease of installation and reliability. There is a Twiki page tracking everything.
    • Work in OSG to possibly re-package the toolkit to make installation easier; a meeting is being organized (OSG, I2, ESnet) to discuss this and the modular dashboard.
    • Please try to get new equipment into production.
      • NET2 - has systems, not yet in production; plan was to do this at BU, not clear about HU but maybe.
      • UTA - waiting on a 10G port; working internally on which optics - SR, LR; then will buy cable.
      • SLAC - has machines, trying to get them supported in a standard way, as appliances

Federated Xrootd deployment in the US (Wei, Ilija)

last week(s)

this week

  • The EU and BNL global redirectors are connected and peered (a hedged access-test sketch follows after this list).
  • A read-only LFC is being set up at CERN; it does not require a credential with the ATLAS role, which will solve connecting EU sites.
  • Ilija: may need backup solutions; have selected sites with a credential, and also a WebDAV interface.
  • N2N: working, in both the C and Java versions.
  • The dCache xrootd door bug is fixed, so we can now have authenticated access.
  • Monitoring - working on getting similar information from xrootd and dCache; met with the dCache team last week. New versions will be needed for both xrootd and dCache, plus new plugins.
  • Ilija now supporting collector at SLAC.
  • Topology viewer / monitor.
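
The access-test sketch referred to above: a hedged Python example, shelling out to xrdcp, of checking that a read through a global redirector works end to end. The redirector hostname and the global file name are placeholders, not the production endpoints.

    import subprocess
    import sys

    # Hedged sketch: copy one file through a global redirector to verify that
    # redirection (and the N2N name translation behind it) works end to end.
    # Both the redirector host and the global path below are placeholders.
    REDIRECTOR = "glrd.example.org"               # placeholder endpoint
    TEST_PATH = "/atlas/dq2/user/some/test/file"  # placeholder global name

    result = subprocess.run(
        ["xrdcp", "-f", "root://%s/%s" % (REDIRECTOR, TEST_PATH), "/tmp/fax_test_copy"],
        capture_output=True, text=True)

    if result.returncode == 0:
        print("federated read OK")
    else:
        print("federated read failed:", result.stderr.strip())
        sys.exit(1)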

US analysis queue performance (Ilija)

last two meetings
  • Had a meeting last week - some issues were solved, and some new ones have appeared.
  • Considering direct access versus stage-in by site.
  • We observe that when switching modes a 2-minute stage-out time seems to be added. This will need to be investigated. Ilija has been in discussion with Paul.
  • Went through all the sites to consider performance. NET2 - Saul will be in touch with Ilija.

this week:

  • Had a short meeting last week - the only development was the 120-second stage-out times. Traced to timed code in the pilot - fixes were sent back to Paul. Expect to gain from this in stage-in as well, and perhaps other places in the pilot. This will appear in a future pilot release.
  • Preparing a comparison of stage-in to local disk versus direct access
  • Met with Philippe Canal and Brian Bockelman about ROOT monitoring - to inspect what users are doing in order to improve efficiency. Also about optimizing memory usage - CMS has found a 40% reduction in memory usage using Ilija's basket re-ordering scheme.

Site news and issues (all sites)

  • T1:
    • last meeting(s): Progress on cloud computing - demonstrated that interfaces to EC2 are working transparently, accessing Amazon services. Both dedicated and Amazon resources run under a single queue name. Looking at storage technologies, speaking with vendors.
    • this meeting: Working on storage technologies. Content-managed storage and object-based storage, also in the wide-area context. Also progress in cloud processing area.

  • AGLT2:
    • last meeting(s): Downtime next Monday. Networking, OSG update, AFS update, remapping servers to use 2x10G connections; prep work underway; may need to extend, depending on Juniper stacking at MSU. The 20G pipe, saturated during HC testing, will be upgraded to 4 x 10G. QoS rules will be implemented to separate control from data channels. Simple config - prioritize the private network over the public.
    • this meeting: Downtime on July 30; back online, running well since then. User attempting to kill jobs - several thousand, killing locally as they come in.

  • NET2:
    • last meeting(s): Weekend incident - accumulated a large number of threads to sites with low bandwidth, eventually squeezing out slots for other clients; failed to get a dbrelease file. Ramping up planning for the move to Holyoke in 2013-Q2. Going to move from PBS to Open Grid Engine at BU (checked with Alain that it's supported). Direct reading tests from long ago showed GPFS.
    • this meeting: LFC issues - had disk problems.

  • MWT2:
    • last meeting(s): LHCONE peering attempt last week resulted in a production interruption. Working on git change management in front of Puppet, and intra-MWT2 distribution of the worker-node client, certificates, and $APP using a CVMFS repo. Tomorrow: scheduled upgrade of MWT2_UC's network to the campus core to 20G (two 10G links bonded with LACP). Updates to GPFS servers at the UIUC campus cluster. Added two new servers for metadata.
    • this meeting: 20G update at UC campus core went well. CVMFS-distributed wn-client and $APP being deployed across sites.

  • SWT2 (UTA):
    • last meeting(s): Updating switch stack to more current firmware version (Mark). Updated CVMFS and kernels on all compute nodes. Installed OSG 3.x at SWT2_CPB; note: pay attention to accounts to be created. One nagging issue - availability reporting showing as zero. Otherwise things are running well. Bringing 0.5 PB online.
    • this meeting: 0.5 PB of storage coming online. OSG CE update problems - condor-G/APF and PBS job manager for globus.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting:

  • WT2:
    • last meeting(s): Got a storage quote from Dell; prices are 12% higher than Dell's ATLAS LHC web page (for head nodes and MD1200s). Will ask for an explanation. WT2 operation is OK; the LFC stability problem is resolved. Still plan to migrate the LFC to BNL in mid-August since SLAC is planning a power outage around that time.
    • this meeting: Smooth operation - planning for a power outage on August 16. Forming a committee to search for a replacement for LSF.

Carryover issues (any updates?)

rpm-based OSG 3.0 CE install

last meeting(s)
  • In production at BNL
  • Horst claims there are two issues: RSV bug, and Condor not in a standard location.
  • NET2: Saul: have a new gatekeeper - will bring up with new OSG.
  • AGLT2: March 7 is a possibility - will be doing upgrade.
  • MWT2: done.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated by the install of a new UPS - expecting delivery, which will require a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.

this meeting

  • Any updates?
  • AGLT2
  • MWT2
  • SWT2
  • NET2
  • WT2

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

this week

AOB

last week

this week


-- RobertGardner - 07 Aug 2012
