
MinutesFeb62013

Introduction

Minutes of the Facilities Integration Program meeting, Feb 6, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode: 4519
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”

Attending

  • Meeting attendees: Patrick, Saul, Fred, Dave, Shawn, Wei, James, Ilija, Torre, Sarah, Jason, Mark, Alden, Bob, John Brunelle, Horst, Michael, Hiro, Kaushik, Doug
  • Apologies: Armen
  • Guests:

Integration program update (Rob, Michael)

  • Panda Mover migration
  • Note Fabiola's message about the run extension by a few days. Should have low impact until Feb 10, then one last pp run, up until Feb 14.

FAX FDR pilot issues

last meeting
  • Top-level wrapper to sites (bash) invokes a python wrapper, which in turn invokes Paul's pilot (a rough sketch of this chain appears at the end of this section)
  • Dependence on python and DQ2Client?
  • FAX worked fine, but then there were problems with standard analysis jobs
  • We have a work-around for the near term, but we need a longer-term strategy to deal with this on both grid production environments.
  • Modular wrapper developed by Jose is key
this week
  • New pilot in production
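  As referenced above, a rough sketch of the wrapper chain: an outer site wrapper checks its dependencies and then hands off to the pilot. This is illustrative only and written in Python for readability; the file name pilot.py, the DQ2 client check, and the argument pass-through are assumptions, not the production wrapper code.

    #!/usr/bin/env python
    # Illustrative stand-in for the intermediate wrapper layer: check the
    # dependencies noted above (python itself, the DQ2 client), then hand
    # off to the pilot.
    import os
    import subprocess
    import sys

    def have_dq2_client():
        """Loosely approximate the 'dependence on DQ2Client' check (assumption)."""
        try:
            import dq2  # hypothetical import; the real check may differ
            return True
        except ImportError:
            return False

    def main():
        if not have_dq2_client():
            sys.stderr.write("WARNING: DQ2 client not found; pilot may fail later\n")
        pilot = os.path.join(os.getcwd(), "pilot.py")  # assumed pilot location
        if not os.path.exists(pilot):
            sys.exit("pilot.py not found; nothing to run")
        # Hand off to the pilot, passing through any site-specific arguments.
        sys.exit(subprocess.call([sys.executable, pilot] + sys.argv[1:]))

    if __name__ == "__main__":
        main()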

Tier2D evaluation at ADC (Doug)

last meeting

this meeting - resolved?
  • SWT2_CPB - might be issues at BNL and AGLT2 rather than UTA itself; path specific? Could be that the test intervals are short. BNL may have issues with getting ready for the 100G connection. Jason is involved with troubleshooting the French and Taiwan Tier 1s. Kaushik notes a factor of 3 improvement in throughput. Shawn notes there has been lots of traffic internally at AGLT2, and other network testing in Chicago (MSU vs UM).
  • NET2 - had issues with DDM transfers. Saul believes there was an FTS issue; 2000 subscription backlog. Hiro believes transfers are slow from the Tier 1 to BU, as well as over the * channel to BU. Have analyzed the gridftp bottleneck - transfers to CERN sites are very slow.
    • Michael: Need to get NET2 onto LHCONE.
  • Saul notes they will have a 10G circuit to MANLAN with the move to Holyoke
  • Hiro would like to tune up DDM traffic overall

Storage deployment

last meeting:
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2:DONE
  • MWT2: Still working on storage deployment; 6 R720 + MD1200's installed, powered. Re-configuring local network (plot below).
  • NET2: Racked and stacked. Electrical work is happening today. Optimistically new storage will be online next week. 720 TB raw.
  • SWT2_UTA: Equipment on order; expecting delivery before end of month. Won't take a downtime until after Moriond.
  • SWT2_OU: Week of Feb 4 is the scheduled downtime, but will postpone.
this meeting:
  • Tier 1 DONE
  • AGLT2 DONE
  • WT2:DONE
  • MWT2: 500 TB
  • NET2: 576 TB usable, to be part of the GPFS. Expect a week from whenever the SGE migration is complete. Est. 2 weeks.
  • SWT2_UTA: Expect remainder of delivery this week or next week.
  • SWT2_OU: Storage is online, but waiting to incorporate it into Lustre.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • PRODDISK squeezed due to group production
    • PRODDISK versus DATADISK: merge or not - still not settled.
    • Retiring PandaMover - to discuss with Alexei
    • US Tier 2 starvation issue seems resolved. Lots of problems with Panda Mover getting the needed input files to Tier 2s. Last week emergency meeting - decision to increase GP share at the Tier 1.
    • Will be a shift from production to analysis; group production nearly finished.
    • At BNL we will move resources from production to analysis.
    • NET2 and SLAC seem to have enough jobs and/or pilots. Saul will open a thread on this.
  • this meeting:
    • Saul notes continued problems with jobs getting stuck in the transferring state. Kaushik notes the brokering limit on the transferring-to-running ratio (> 2) will be raised (see the sketch below). Also, why has the number of transferring jobs increased? There is also an AutoPyFactory job submission issue; Saul and John to discuss with Jose and John Hover.
    • Hiro notes transfers back to FZK might be slowing this.
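    For context, a minimal sketch of the kind of transferring-to-running guard being discussed. This is not PanDA's actual brokering code; the function name, the fallback cap, and the thresholds are illustrative assumptions.

      def site_accepts_more_jobs(n_transferring, n_running, max_ratio=2.0):
          """Illustrative brokering guard: skip a site once its backlog of
          'transferring' jobs exceeds max_ratio times its running jobs.
          Raising max_ratio lets a site with a transfer backlog keep
          receiving jobs for longer, as discussed above."""
          if n_running == 0:
              # With nothing running, fall back to a small absolute cap (assumption).
              return n_transferring < 50
          return (n_transferring / float(n_running)) <= max_ratio

      # Example: 1200 transferring vs. 500 running gives a ratio of 2.4, so the
      # site is skipped at max_ratio=2 but accepted if the limit is raised to 3.
      print(site_accepts_more_jobs(1200, 500))                 # False
      print(site_accepts_more_jobs(1200, 500, max_ratio=3.0))  # True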

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=232789
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-1_28_2013.html
    
    1)  1/25: BNL_CLOUD - job failures with "lost heartbeat" errors.  Same issue as reported in ggus 90591 (Amazon's EC2 service, spot pricing mechanism, etc.).  
    https://ggus.eu/ws/ticket_info.php?ticket=90813 closed, eLog 42417.
    2)  1/26: MWT2 - from Sarah: (i) I ran a re-index on some of the srmspacefile indexes this morning, which temporarily locked the table and caused transfer failures. 
    (ii) In doing the reindexing other issues with the database came up. We are doing some emergency maintenance now to get back to stable.  dCache@MWT2 will 
    be completely offline for ~30 minutes. (iii) We are doing emergency maintenance on dCache@MWT2. We expect to be back up shortly.
    3)  1/29: Backlog of transferring jobs in the U.S. cloud was high, as reported by Hiro.  A fix was applied to the BNL SS box to equally distribute (without overlap) the FTS jobs 
    to be checked across the FTS polling agent threads (a sketch of this partitioning idea appears at the end of this section).  The backlog cleared up following the application of the fix.
    
    Follow-ups from earlier reports:
    
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to 
    protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), 
    eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    (iii)  1/18: ggus 90548 / RT 22875 were opened for job failures at OU_OCHEP_SWT2 with "replica not found" pilot errors.  Not a site issue - such failures were seen 
    at most U.S. sites.  Seems to be an issue with PanDA/LFC.  Is the problem understood?  Both tickets closed, eLog 42183.
    (iv)  1/21: NERSC destination file transfer failures - https://ggus.eu/ws/ticket_info.php?ticket=90619 in-progress, eLog 42446.
    Update 1/26: https://ggus.eu/ws/ticket_info.php?ticket=90866 also opened for SRM errors at the site.  For some reason an outage created in OIM did not propagate to 
    GOCDB/AGIS, hence the token wasn't auto-blacklisted.  eLog 42484.
    (v)  1/21: SWT2_CPB DDM deletion errors - probably not a site issue, as some of the errors are related to datasets with malformed names, and others are deletion attempts 
    for very old datasets.  Working with DDM experts to resolve the issue.  https://ggus.eu/ws/ticket_info.php?ticket=90644 in-progress.
    Update 1/23: Opened https://savannah.cern.ch/support/?135310 - awaiting a response from DDM experts.
    
  • this week: Operations summary:
    
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=232803
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-2_4_2013.html
    
    1)  1/31: AGLT2 - file transfer failures with SRM errors.  Two issues - brief routing interruption while a WAN connection was reconfigured, and an issue with a CRL 
    updater.  Both resolved, https://ggus.eu/ws/ticket_info.php?ticket=91059 closed on 2/1.  eLog 42599.
    2)  1/31: SLAC - file transfer errors ("[SOURCE error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the operation was aborted]").  Issue with a 
    network interface - from Wei: we have a problem with the switch port of our gridftp servers. We switched to use another gridftp server for now.
    As of 2/5 no more errors, so https://ggus.eu/ws/ticket_info.php?ticket=91064 was closed (along with duplicate ggus tickets 91061, 91114).  eLog 42695.
    3)  2/1: MWT2 - file transfer failures with SRM errors.  Issue understood - during a period of database maintenance one of the commands maxed
    out the disks, causing SRM failures. The command was stopped and the errors went away.  https://ggus.eu/ws/ticket_info.php?ticket=91105 closed, eLog 42613.
    4)  2/2: From Dave at MWT2/Illinois: One of our new pool nodes seems to have developed a serious networking problem. We have marked the pools on this node 
    read only and are in the process of migrating the data off to other pools. Luckily it is a small amount of data and should not take long.
    5)  2/3: UPENN - file transfer failures with SRM errors.  https://ggus.eu/ws/ticket_info.php?ticket=91122 was closed after a BeStMan restart fixed the problem.  
    eLog 42671.
    6)  2/4: New pilot release from Paul (v.56b).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_56b.html
    7)  2/5: IllinoisHEP file transfer failures with SRM errors.  https://ggus.eu/ws/ticket_info.php?ticket=91189.  Turned out the errors were actually due to two problematic 
    remote sites, FR/R0-07 and NL/IL-TAU-HEP.  After these sites were blacklisted the errors stopped.  ggus 91189 closed, eLog 42709.
    
    Follow-ups from earlier reports:
    
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to 
    protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), 
    eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    (iii)  1/18: ggus 90548 / RT 22875 were opened for job failures at OU_OCHEP_SWT2 with "replica not found" pilot errors.  Not a site issue - such failures were seen 
    at most U.S. sites.  Seems to be an issue with PanDA/LFC.  Is the problem understood?  Both tickets closed, eLog 42183.
    Update 1/30: https://savannah.cern.ch/bugs/index.php?100175 opened for this issue.
    (iv)  1/21: NERSC destination file transfer failures - https://ggus.eu/ws/ticket_info.php?ticket=90619 in-progress, eLog 42446.
    Update 1/26: https://ggus.eu/ws/ticket_info.php?ticket=90866 also opened for SRM errors at the site.  For some reason an outage created in OIM did not propagate 
    to GOCDB/AGIS, hence the token wasn't auto-blacklisted.  eLog 42484.
    Update 2/3: Recent file transfers are succeeding at ~100%, so ggus 90619 was closed.  Similarly for ggus 90866 - closed on 2/5.  eLog 42633, 42694.
    (v)  1/21: SWT2_CPB DDM deletion errors - probably not a site issue, as some of the errors are related to datasets with malformed names, and others are deletion 
    attempts for very old datasets.  Working with DDM experts to resolve the issue.  https://ggus.eu/ws/ticket_info.php?ticket=90644 in-progress.
    Update 1/23: Opened https://savannah.cern.ch/support/?135310 - awaiting a response from DDM experts.
    
  • 4000 job failures from "get replica" errors. A Panda brokering issue? Mark noted that the input files were never at UTA. Mark will push on this.
  • Also notes deletion errors at UTA - clearly something wrong.
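  Referring back to item 3 in last week's summary: the SS fix amounts to giving each FTS polling agent thread a disjoint share of the FTS jobs to check. Below is a minimal sketch of that partitioning idea; the names and the hashing choice are assumptions for illustration, not the actual BNL site-services code.

    import hashlib

    def assign_thread(fts_job_id, n_threads):
        """Map each FTS job id to exactly one polling thread by hashing the id,
        so every job is checked by one and only one thread (roughly equal
        distribution, no overlap)."""
        digest = hashlib.md5(fts_job_id.encode()).hexdigest()
        return int(digest, 16) % n_threads

    def partition_jobs(fts_job_ids, n_threads):
        """Split a list of FTS job ids into disjoint per-thread work lists."""
        buckets = [[] for _ in range(n_threads)]
        for job_id in fts_job_ids:
            buckets[assign_thread(job_id, n_threads)].append(job_id)
        return buckets

    # Example: 4 polling threads, each getting a disjoint subset of the job ids.
    jobs = ["job-%04d" % i for i in range(10)]
    for thread_index, bucket in enumerate(partition_jobs(jobs, 4)):
        print(thread_index, bucket)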

DDM Operations (Hiro)

  • Site services have been slow. This is a DDM service-level issue, requiring development. Random selection of transfers has resulted in a number of slow transfers.
  • D3PD replication for Moriond?

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • See email for notes
    • New perfSONAR release candidate for v3.3; we have 5 sites that will participate in the testing: UC, IU, MSU, UNL, UM.
    • Consistently bad performance to/from RAL. How should we act in this regard? Hiro has added additional FTS channels for RAL to help.
    • Will discuss at next WLCG Operations meeting. Hiro notes that inter-T2 throughput is slow.
  • this meeting:
    • See notes in the email from yesterday.
    • New version of perfSONAR under test at AGLT2 and MWT2 (rc1). Release by March, with 10G support. Goal is to deploy across the facility by end of March.
    • Amazon connectivity to AGLT2 being worked on (the default route uses the commercial network, which is relatively slow). Hopefully soon.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • link to FDR twiki https://twiki.cern.ch/twiki/bin/viewauth/Atlas/JanuaryFDR
  • Three major issues - distributing input data to all sites; still getting authorization errors for some sites. (Not many connections are affected)
  • Realistic analysis jobs from HC, sorting through pilot issues.
this week

  • Wei is testing the voms module provided by Gerry. Still a few things to sort out.
  • Added a few more sites. Working with the Spanish cloud; expecting new French sites.
  • New monitoring collector needed at CERN? The collector at SLAC is not very stable; working with Matevz.
  • All FAX information is coming from AGIS now.
  • Found orphaned jobs from HC.
  • Progress on the skim-slim service at UC.
  • Connectivity between sites still not 100%.
  • Later, work on N2N in Rucio.

Site news and issues (all sites)

  • T1:
    • last meeting(s): John is currently writing a cost-aware cloud scheduler. Adds policies that are cost driven, to expand to "pay as you go" resources. The current demand-driven event is helping drive better understanding with Amazon policies for provisioning and cost modeling. No indication of bottlenecks into/out of storage.
    • this meeting: Cloud activities have ramped up significantly; aiming for 5000 instances, ran into an HTCondor CCB scaling issue (fixed). 100G network equipment coming to BNL.

  • AGLT2:
    • last meeting(s): Things are running well. Lots of jobs in the analysis queue. Attempting to get a second peering into MiLAR, which will provide direct routes to Omnipop and MWT2.
    • this meeting: Heavy usage - more than 50k analysis jobs / day. Have had some crashes on the new storage. Addressing the bottleneck between the two sites. 150 cores brought online.

  • NET2:
    • last meeting(s): Problems this morning - spikes in gatekeeper load and DDM errors. A large number of dq2-gets may be the cause; investigating. Panda queues on the BU side are nearly drained. New storage starting to arrive. End of March move. HU running fine. BU-HU network link may need maintenance.
    • this meeting: I thought it would be a good idea to put in all the details for a change... (see the numbered items below)

1) Subscription backlog/FTS/gridftp: We moved gridftp traffic to a new host as an attempt to help with the pre-Moriond "Tier 2 starvation issue". The performance of individual gridftp transfers is good, but there are long pauses even when there is plenty of networking and I/O capacity. It's acting as if there is an FTS bottleneck; however, changing #files and #streams in FTS doesn't seem to have any major effect. As a result of this, there is a subscription backlog (~2000 subscriptions currently). We'd like to have a consultation from DDM to help with this if possible.

2) We borrowed Tier 3 resources to boost production, re: pre-Moriond.

3) Our new SGE PanDA queues (BU_ATLAS_Tier2 and ANALY_BU_ATLAS_Tier2) are both working in PanDA with HC and real analysis jobs. The only remaining problem is getting Condor feedback of running jobs to work for APF. Thanks to Jose, John H. and the OSG folks for helping with this.

4) New storage is racked and stacked, and the electrical work is done. Getting this up is high on our to-do list. Currently there is plenty of free disk space.

5) We reported a problem with slow transferring in PanDA. This has drastically different effects on different sites, e.g. HU drained almost completely because of this while BU had plenty of activated jobs.

6) We see a problem similar to what Horst sees at OU with quite a few jobs using a few minutes of CPU time but >1 wall-day.

7) perfSONAR 10Gbps optics are installed on the new bandwidth node.

8) Our usual end-to-end WAN study found a weak link from MANLAN to CERN causing, e.g. outgoing traffic from NET2 to France to be 10x faster than NET2 to Switzerland.

9) Bestman deletion errors continue at about a 10% rate in spite of reducing the load on our SRM host. We'll deal with this as soon as we get SGE squared away.

10) We still have approximately 1/2 of the 2012 hardware funds to spend.

11) We saw a problem with monitoring jobs using ping at Harvard. ping is blocked at Harvard, and this had the effect of gradually using up all the LSF job slots until someone noticed and killed them. We think that this is resolved.

12) Since the gridftp move on Jan. 17, we have a problem with both BU and HU being reported in "Unknown" state to WLCG (thanks to Fred Luehring for pointing this out). We still don't know what's going on here.

13) Last week Ueda made some improvements to our AGIS entries, especially so that BU and HU belong to "US-NET2". This is generally an improvement, but if you are sensitive to AGIS changes, you might notice something.

14) Preparations for Holyoke are actively underway: planning for networking, a 10Gbps link dedicated for LHCONE, and late stages of negotiating with vendors to move the equipment. The move will occur no earlier than March 31, with ~1 month of slippage fairly likely.

15) We sometimes run into situations where it looks like we're not getting enough pilots but have no feedback as to why.

16) Paul Nilsson's ErrorDiagnosis.py

  • MWT2:
    • last meeting(s): Job slots going unused. Switched gridftp transfers to new hardware. Shorthanded at BU due to admins working on setting up Holyoke. Illinois progressing on getting 100G. Lincoln getting additional resources at UC3 hooked up, about 500 job slots.
    • this meeting: 100 TB added. Will be working on DQ2 + PY 2.6 client.

  • SWT2 (UTA):
    • last meeting(s): Deletion errors with Bestman for certain datasets. Have a ticket open. Network issue still being worked on by Jason; looks like it might be a busy link in Chicago. Continued work on storage. Finalizing purchase for additional compute nodes. Will circulate specifications to the tier 2 list.
    • this meeting:

  • SWT2 (OU):
    • last meeting(s): Disk order has gone out. Horst is getting close to having a clean-up script.
    • this meeting:

  • WT2:
    • last meeting(s): Unused job slots, lack of pilots.
    • this meeting:

AOB

last meeting

this meeting


-- RobertGardner - 05 Feb 2013
