
MinutesOct242012

Introduction

Minutes of the Facilities Integration Program meeting, October 24, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • USA Toll-Free: 888-273-3658
    • USA Caller Paid/International Toll : 213-270-2124
    • ACCESS CODE: 3444755
    • HOST PASSWORD: 6081

Attending

  • Meeting attendees: Fred, Rob, Michael, Saul, Patrick, Wei, John, Dave, Ilija, Armen, Mark, Kaushik, Alden, Hiro, Bob, Tom, Shawn
  • Apologies: Jason, Kaushik
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Start of a new quarter - FY13Q1 this week
      • Reminder to register for the UC Santa Cruz meeting - https://indico.cern.ch/conferenceDisplay.py?confId=201788
      • New facilities spreadsheet coming - will be in Google Docs for convenience
      • Status of additional disk procurements at Tier 2.
      • Reprocessing campaign - will affect mostly the Tier 1 sites. 1.6B events. 1.5M files. 6 weeks. End by Christmas. Additional data will come to all sites. We will need the additional disk - for the Winter conferences. We expect an onslaught of user analysis. Would like to analyze the job profiles - versus category type (analy, simul, pileup). Computing management and ADC have been sending more of these to the Tier 2s.
      • Mark notes increased communication between ADC management and physics coordination.
    • this week
      • Facilities capacity spreadsheet updates - see associated Google docs link, shared via (our private) usatlas-t2-l@lists.
      • Santa Cruz: http://indico.cern.ch/conferenceDisplay.py?confId=201788
      • Coming changes to Panda queue configurations for sites via AGIS: https://indico.cern.ch/getFile.py/access?contribId=9&resId=0&materialId=slides&confId=213765. For now, proceed as before. Alden - there will be a new field to associate queues with CEs.
      • OSG and Pacman - request from Dan Fraser to consider a phase-out date for Pacman-based OSG CEs; the current timeframe under consideration is April-July 2013, in advance of the transition of OSG software to SHA-2 and the new DigiCert CA.
      • Disk round-up (see below).
      • The end of the favorable Dell program.
      • Configuring multicore and high-memory queues currently demands partitioning resources manually. Should we form a well-focused working group to implement this in a more dynamic fashion? There are some capabilities in Condor, but the developers will need to correct a few things; Wei has also looked into this. Dynamic scheduling, possibly virtualized. Tom would like to get folks involved. (See the Condor sketch after this list.)
      • PandaMover ... will set up a committee - take to the data management meeting agenda
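
For context, a minimal sketch of the kind of Condor configuration such a working group would weigh, assuming HTCondor's partitionable-slot mechanism (the layout below is illustrative, not an agreed facility recipe): instead of statically dedicating whole nodes to multicore or high-memory queues, each worker node advertises one partitionable slot and the negotiator carves out dynamic slots to match each job's request.

    # Worker-node condor_config: one partitionable slot covering the whole machine
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

A multicore pilot would then state its needs in the submit description (e.g. request_cpus = 8, request_memory = 16000), leaving the remaining cores schedulable for single-core work. Whether this holds up under mixed analysis/production load, and what the Condor developers still need to correct, is what the proposed working group would have to sort out.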

Disk procurement

last meeting:
  • MWT2 - 1PB ordered for UC (expect by November); UIUC - staging with new instance of CC, "end of November". At IU, we may focus on networking.
  • AGLT2 - UM submitted an updated PO yesterday. Planning on MD3260 dense storage from Dell. Two 40G connections. MSU: 4x MD3260 w/ 2x R720; PO imminent. Estimate: about a month.
  • NET2 - MD1200, 3TB drives; electrical work. Expect to issue the purchase order within a week. Two racks, 432 TB usable each.
  • WT2 - Have the MD1200s here. Only two head nodes have arrived; four more to come. 1 PB (usable).
  • UTA - Have not sent a PO; evaluating technologies. Will have a conversation with Dell.
  • T1 - 2.6 PB of disk - part of this will be replacements. Nexsan technology (dual-controller front ends and extensions).

this meeting:

  • MWT2 - Received 36 MD1200s; waiting on R720. November timeframe likely.
  • AGLT2 - Started to receive equipment; will start racking and testing in the next two weeks. November 9 at MSU. 720 TB raw at each site.
  • NET2 - BU timeline
  • WT2 - In service 1.3 PB usable! Will update spreadsheet.
  • UTA - Still working on getting the order in. SWT2 planning meeting this week - will hammer out details. November? End of year, please! Funds have not arrived.
  • T1 - 2.6 PB on order. Expect arrival in about 3 weeks.

Multi-core deployment progress (Rob)

last meeting:
  • Will be a standing item until we have a MC queue at each site
  • BNL DONE
  • WT2 - SLACXRD_MP8 DONE
  • MWT2_MCORE available DONE
  • AGLT2_MCORE DONE
  • NET2: still don't have queues. Stuck on issues with the OSG 3.0 RPM install and with certificate updating. Want to set things up properly so multicore runs smoothly. Estimate: by end of next week.
  • SWT2_OU: no info.
  • SWT2_UTA: created the Panda queue; have asked to have it added to HC. Will follow up. Structurally in place, receiving pilots. Nearly ready to go online. Will do the same for the other cluster.
this meeting, reviewing status:
  • NET2: queues are in schedconfig and in the AutoPyFactory; at the point of asking for HC jobs. DONE
  • SWT2_OU:
  • SWT2_CPB: now running in production DONE
  • Andy Washbrook - ask him to submit a large set of jobs against the sites, to validate the capability.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Mark reporting
    • Cloud-wide auto-exclusion by HC due to the Panda proxy failure on September 22; notes there are discussions to improve HC.
  • this meeting:
    • We are running low on tasks at the moment. We might continue to see this for a while.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=209111
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-10_15_2012.html
    
    
    1)  10/12: SMU_HPC file transfer errors ("SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist] ").  
    https://ggus.eu/ws/ticket_info.php?ticket=87326 in-progress, eLog 40085.
    2)  10/12: FTS errors to SRM fester.utdallas.edu (UTD) - site was still in a maintenance downtime, outage extended.  ggus 87291 closed, eLog 40086.
    3)  10/12: Alexei announced the creation of task definition number 1M in the ATLAS production system.
    4)  10/13: UTA_SWT2 - destination file transfer failures with SRM errors.  Network connectivity to the cluster was lost for ~eight hours.  Suspect 
    the link from the main campus to the remote computing center.  No further issues once the link was restored.  ggus 87334 / RT 22602 closed, eLog 40107.
    5)  10/15: Express Stream reprocessing started.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in 
    https://savannah.cern.ch/support/?129468.  See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/24: Continue to see deletion errors - ggus 85951 re-opened.  eLog 39571.
    Update 10/9: site admins working on a solution - coming soon.  Closed ggus 85951 - issue can be tracked in ggus 84189. 
    (iii)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the 
    token off-line to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 
    (Savannah site exclusion ticket), eLog 38795.
    (iv)  9/12: Daily Atlas RSV vs WLCG report (currently sent out via e-mail) will be replaced on 31/Oct/2012 with a web only summary available here:
    http://rsv.opensciencegrid.org/daily-reports/2012/09/11/ATLAS_Replacement_Report-2012-09-11.html.  Please address any concerns or questions to 
    steige@iu.edu or open a ticket here: https://ticket.grid.iu.edu/goc/submit.
    (v)  9/30: File transfer errors between BNL & TAIWAN/ASGC.  Initial indications were a network path issue originating on the TW cloud side.  Eventually 
    understood (10/1) - due to BNL Cyber Security blocking the FTS agent host at ASGC.  See more details in eLog 39785.  Note: this issue also affects SLAC, 
    since it is part of the DoE complex.  ggus 86537 'assigned'.
    Update 10/5: The remaining issue is transfers between SLAC & TAIWAN - see https://ggus.eu/ws/ticket_info.php?ticket=86767.  eLog 39946.
    Update 10/16: Problem with the SLAC asymmetric route is now fixed. A policy base routing which is specific for SLAC is added on TRIUMF edge router.  
    ggus 86767 closed.
    (vi)  10/5: Shifter reported file transfer failures between SLACXRD & TRIUMF-LCG2 - https://ggus.eu/ws/ticket_info.php?ticket=86767 opened.  
    See: https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/39913 (known issue - Network experts at SLAC and in Canada are investigating).
    Update 10/16: resolved -- ggus 86767 closed.
    (vii)  10/7: AGLT2 SRM was reported as somewhat downgraded (87%) according to SAM monitoring.  From Shawn: The switch stack was restored about 
    4:45 PM EST but there are now problems with some of the VM on our VMware system which lost contact with their iSCSI storage. We are continuing to work 
    on recovering.  https://ggus.eu/ws/ticket_info.php?ticket=86897 in-progress, eLog 39967.
    Update 10/14: problem solved, ggus 86897 closed.
    (viii)  10/9 late p.m.: AGLT2 file transfer errors (SRM).  From Shawn: Our dCache billing DB filled its partition causing postgresql problems. We are cleaning the 
    DB records and will restore service ASAP.  https://ggus.eu/ws/ticket_info.php?ticket=87079 in-progress, eLog 40021.
    Update 10/11 from Shawn: Once the DB was compacted via backup/restore we had sufficient space to return to production.  ggus 87079 closed.
    (ix)  10/9-10/10: UTA_SWT2 - downtime to upgrade the OSG CE, perform other various system maintenance tasks. 
    Update 10/11 early a.m.: all work completed, HC tests successful, site back on-line.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=209111
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-10_22_2012.html
    
    1)  10/18: MWT2 - file transfer errors ("[AGENT error during ALLOCATION phase: [CONFIGURATION_ERROR] No site found for host uct2-dc1.uchicago.edu].")  
    Seemed to indicate a possible DNS glitch?  Or FTS?  Apparently a transient problem - errors stopped.  ggus 87574 closed on 10/23.  eLog 40363.
    2)  10/19: UPENN_LOCALGROUPDISK file transfer errors ("GRIDFTP_ERROR] an end -of-file was reached globus_xio: An end of file occurred (possibly 
    the destination disk is full)").  Site reported this is not a full disk condition, but rather temporary load fluctuations.  https://ggus.eu/ws/ticket_info.php?ticket=87584 & 
    https://savannah.cern.ch/support/?133124 (Savannah DDM ops) tickets closed on 10/22.
    3)  10/19: Shifter opened https://ggus.eu/ws/ticket_info.php?ticket=87593 for "lost heartbeat" job errors at BNL_ATLAS_RCF.  From Michael: This site, with 
    resources owned by a different community, provides compute cycles on an opportunistic basis. When the owner ramps up usage to the limit ATLAS jobs are 
    evicted, leading to failures because of "lost heartbeat". The jobs are automatically retried and will most likely succeed at the second attempt.  ggus ticket 
    closed, eLog 40263.
    4)  10/19: Saul reported that the NET2 sites were draining, despite having activated jobs.  Issue was tracked down to a problem at BU in the ~usatlas1 file system.  
    Site began ramping back up once this was resolved.  HU sites also draining, but for an unrelated issue (release reporting problem seen previously).  
    From Yuri, 10/21: The new validation/install jobs for AtlasPhysics-17.2.2.4.2, 17.2.4.6.1/2, 17.2.4.8.1 and TrigMC-17.2.4.4.1 have been submitted for HU.
    5)  10/22: Rob reported that MWT2 had been drained of production jobs for ~24 hours.  Panda brokerage was not assigning jobs to the site due to the large 
    number of jobs stuck in the 'transferring' state.  Tadashi modified the broker so it will send jobs in a situation like this one.  eLog 40337/38.
    6)  10/23 p.m.: Thousands of file transfer failures in the US cloud with errors like "AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR]
    error creating file for memmap /var/tmp/glite-url-copy-edguser/space-token-name: No space left on device]."  Hiro reported the problem was fixed as 
    of ~9:30 p.m. EST.  eLog 40369.
    7)  10/24 early a.m.: UPENN file transfer errors ("Unable to open file...System error in open: No such file or directory"). 
    https://ggus.eu/ws/ticket_info.php?ticket=87760, eLog 40377.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in 
    https://savannah.cern.ch/support/?129468.  See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/24: Continue to see deletion errors - ggus 85951 re-opened.  eLog 39571.
    Update 10/9: site admins working on a solution - coming soon.  Closed ggus 85951 - issue can be tracked in ggus 84189. 
    Update 10/17: ggus 87512 opened for this issue - linked to ggus 84189.
    (iii)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token 
    off-line to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 
    (Savannah site exclusion ticket), eLog 38795.
    (iv)  9/12: Daily Atlas RSV vs WLCG report (currently sent out via e-mail) will be replaced on 31/Oct/2012 with a web only summary available here:
    http://rsv.opensciencegrid.org/daily-reports/2012/09/11/ATLAS_Replacement_Report-2012-09-11.html.  Please address any concerns or questions to 
    steige@iu.edu or open a ticket here: https://ticket.grid.iu.edu/goc/submit.
    (v)  10/12: SMU_HPC file transfer errors ("SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist] ").  
    https://ggus.eu/ws/ticket_info.php?ticket=87326 in-progress, eLog 40085.
    

  • Doing express-stream reprocessing; full-scale reprocessing at the Tier 1 will follow in the next few weeks. We may see some of these jobs at the Tier 2s - this will imply heavy use of proddisk and networking.
  • MWT2 draining issue - Tadashi did put in a work-around for the case of a high number of transferring jobs.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Blocking issue for ASGC Taiwan - blocked at the BNL campus border.
    • According to the netflow monitor, the FTS agent host attempted to access 9 different hosts at BNL at a rate exceeding a blocking threshold, which fit the profile of suspicious activity. Cybersecurity has already applied exceptions.
    • BNL was put into the Taiwan cloud for the first time; this is what triggered it.
    • FTS 3.0 - still under testing and development, but it is functional. The main difference will be configuration: removal of the channel concept, configured instead by endpoints, with no more limiting of concurrency between pairs of sites. Wei: what about priorities? We should develop a testing activity around it. Multi-VO.
  • this meeting:
    • Concurrency at UC was too low for the number of jobs running.
    • FTS 3.0 - has auto-tuning for the concurrency. Will test it as a replacement starting next week, working with other sites; MWT2 and AGLT2 will volunteer.

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • See notes from yesterday's call
    • Mesh configuration for perfsonar
    • Modular dashboard discussions
    • What is going on at TRIUMF from SLAC? Related to the LHCONE transition at SLAC? There are likely problems beyond SLAC.
    • Will be adding LHCONE connectivity to next phase
    • NET2 - Michael has discussed the possibility of having ESnet provide the LHCONE connectivity for BU, but this may need I2 involvement - depends on institutional issues.
  • this meeting:

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • We got a Russian site on board. They didn't need much help. They still haven't enabled security.
  • Working on bringing EOS into FAX. Not quite ready.
  • NET2 used to work, but no longer does, for unknown reasons.
  • Two German sites; can't enable X509 due to a RHEL6 issue.
  • Prague Tier 2, federated through DE cloud, now working
  • Developing new version of xrootd4j for 2.2.4
  • Andy produced first version of f-stream, giving all the monitoring required.
this week
  • Wei has tracked down an instability issue with the redirector; will coordinate fix with Andy
  • Ilija is working with DESY to incorporate federation capabilities directly in dCache
  • Certain problems with the UCSD collector; working with Matevz
  • Monitoring EOS and vector reads
  • Hiro has set up an unauthenticated LFC server. Waiting for the host cert. Will ask US sites to convert to using it.
  • Need to put site-specific files into EOS.

US analysis queue performance (Ilija)

last two meetings
  • In general, sites are showing good performance.
  • Working with a few sites to fix specific issues.
  • Will have one more report on the efficiency of each site - and thereafter it will be just follow-up.
  • Summary of issues
    • Software issues - startup and stage-out time
    • Setup time for ATLAS software (CMT). This could be fixed, but it would require ~0.5 FTE-year.
    • Will document and summarize these findings; should present at S&C meetings and provide a document.
this week:
  • Will prepare a summary of findings to date, and on-going optimizations at sites

Site news and issues (all sites)

  • T1:
    • last meeting(s): Experience with the ASGC machine being blocked, and 2.3 PB being procured. Evaluation of a Hadoop-based storage system: apart from the open-source Apache Hadoop, a MapR installation on 100 nodes is up and running. Performance tests have been conducted. It comes with a scalable NFS interface; Hiro is looking into measuring it.
    • this meeting: Evaluating MapR as a storage management solution; Hiro is working with Doug on testing with direct-access jobs, and is scaling this up to a reasonable number of concurrent jobs. 2.6 PB of raw disk to show up in a couple of weeks. The DPM team has NFS 4.1 on top of Hadoop, similar to MapR. Shawn notes HEPiX this year has analyses of storage systems, including the NFS 4.1 client.

  • AGLT2:
    • last meeting(s): PO plans as above. Emergency upgrade to dCache 2.2.4 - seems to be working well. Now more activity on the intersite links. Using a 5%/3% free-space rule to avoid XFS problems; there were issues with XFS crashes caused by memory pressure and activity, and the newer dCache seems to have helped with this. Caching within dCache: running effectively full seems to work better with 2.2.4; did have some hotspots before. May be having less overall re-usable space, implying more cache thrash - not as much unpinned space. Is space being reclaimed too early? Condor issue - had implemented concurrency limits on analysis jobs to cap the number of analysis jobs running; accounting groups and concurrency limits may not play well together in Condor, giving very long negotiator cycles (hours). See the sketch after this site's entry.
    • this meeting: The next HEPiX will be at Ann Arbor, the last week of October 2013. Received the head node for the new storage purchase; disks have not yet shipped (arrival November 23). SL 6.3. Bob: upgrading to CFEngine 3.
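
For reference, a minimal sketch of how the two Condor mechanisms mentioned above are usually expressed (the group names, quota fractions, and limit value are illustrative assumptions, not AGLT2's actual settings); it is their combination in the negotiator that appears to drive the long cycles:

    # Central-manager / negotiator condor_config (illustrative values)
    # Concurrency limit: cap how many running jobs may hold the "ANALY" limit
    ANALY_LIMIT = 1000
    # Accounting groups with dynamic quotas for fair-share between activities
    GROUP_NAMES = group_prod, group_analy
    GROUP_QUOTA_DYNAMIC_group_prod = 0.75
    GROUP_QUOTA_DYNAMIC_group_analy = 0.25
    # Submit side: an analysis job opts into the limit with
    #   concurrency_limits = ANALY

Either mechanism alone caps the analysis load; the observation above is that using both together may produce hours-long negotiator cycles, so one may need to be dropped until the behavior is understood.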

  • NET2:
    • last meeting(s): Mysterious problem related to the new pilot - only at NET2, evidently. Once every couple of weeks we see a batch of jobs that consumes all the memory on some nodes.
    • this meeting: BU is switching over to SGE - will be sending test jobs shortly. An issue with release validation at HU.

  • MWT2:
    • last meeting(s): Investigation of IU analysis performance is continuing in detail. Upgrades for LHCONE at IU are in progress; they required a Juniper OS update which had problems. Illinois: by October 12; hardware link in place and active. The first s-node arrived at UIUC and is being deployed by the taub admins; taub c-nodes were updated for cvmfs fixes, and we are working with the core taub admin on additional utilities for Nagios. Sarah is investigating times for postgres queries and their relation to the billing database; need to move the billing database onto a separate server. Continued work on virtual machines - adapting the appliance from John Hover (OpenStack-based) to use the libvirt tools directly.
    • this meeting: GPFS maintenance at UIUC campus cluster - those nodes offline.

  • SWT2 (UTA):
    • last meeting(s): OSG 3 rpm - checking ROCKS appliances. Tues/Wed next week. (Dave notes there is a stress test template 459.)
    • this meeting: Working on proddisk-cleanse program. There is a new version in git that Shawn ran into. Looking to add information about what Panda is about to run - e.g. activated jobs.

  • SWT2 (OU):
    • last meeting(s): Disk order has gone out. Horst is getting close to having a clean-up script.
    • this meeting:

  • WT2:
    • last meeting(s): Moved to LHCONE. Working on new storage. Have not had a chance to work on the OSG 3.0 RPM install; the issue is the LSF job manager.
    • this meeting: Storage is online. SLAC Tier 3 - 20 R510's with 3 TB drives --> 500 TB

AOB

last meeting:
this meeting:
  • Alden - new multi-CE
  • Fred - March S&C week conflicts with OSG AH meeting.
  • Short meeting next week to assess status of re-processing tasks, and hot topics


-- RobertGardner - 23 Oct 2012
