
MinutesJun272012

Introduction

Minutes of the Facilities Integration Program meeting, June 27, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line: (6 to mute) - announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Rob, Ilija, Michael, Bob, Saul, Fred, Shawn, Armen, Mark, Patrick, Torre, Wei, Dave, Alden, Hiro, Horst, Tom, John
  • Apologies: Kaushik, Jason
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • IntegrationPhase21, SiteCertificationP21
      • Another check of storage capacities
        • MWT2 - will ramp to 2.4 PB within April. Presently 1.8 PB usable.
        • SWT2 - UPS online yesterday. Racked, mostly cabled. Will be at the pledge for SWT2: goal is by end of next week. Will have enough to alleviate any storage pressure.
      • LFC migration:
        • Hiro has started a program to replicate the LFC MySQL DB at BNL
        • Hiro to bring up LFC service
        • Goal: bring into production this week. Have this in place by SW week. (two weeks from now)
        • Hiro notes CERN is running 5 or 6 instances of LFC due to client limitations (100 at a time).
        • Anticipate running this for an extended period at BNL.
        • Hiro notes the clients are at the Tier 2's - this is different from those at CERN
      • Michael: metrics for the facilities - not much progress, but still on our list.
      • Agree on a set of Tier 2 usage statistics plots. What we find important, and for presentations. So we have an up-to-date picture. Make suggestions, discuss here at this meeting.
      • Networking - milestone for LHCONE connectivity in May - won't be able to reach it. BNL has a firm plan for LHCONE connectivity next week.
        • Shawn reports options for connecting in Chicago (Dale Finkelson - LHCONE contact, coordinating with Joe Mambretti - I2 VRF). What about FNAL's 6504?
        • Transatlantic networking - all aspects - including T2's. US ATLAS and US CMS mgt developing strategic plan for requirements and implementation, strategy for next several years.
      • Analysis performance optimization - Ilija has started to act on this, will report.
      • Alexei - ADC T1/2/3 jamboree, Nov/Dec - there is a doodle poll out. Oriented towards facilities operations. Will be at CERN.
      • Tier 2 guidance for this year - coming soon. We should agree on what to purchase, to maintain a balance of CPU and storage. We need to do this quickly - by the end of June.
    • this week
      • SiteCertificationP21
      • OmniPoP switching in Chicago; update from I2 today regarding LHCONE.
      • Tier 2 guidance - coming: the basis for discussion is that we have more CPU than pledged, so the CPU/disk ratio implies an adjustment is necessary. Michael has been working out a sensible basis; this will need discussion with ADC management. Discussion with Borut regarding projections - there is a comprehensive paper. Most of the Tier 2's would need to install 2-3 PB of disk to create balance, which would not be reasonable in the next fiscal year; planning will be done on a site-by-site basis. Need to reach a resolution on resource requirements for the coming year and the fiscal years beyond. Expect to have guidance sometime next week. Might expect more resident data for analysis.
      • 2013 and 2014 pledge requests from the LCG office: new figures are due by September 30, for the RRB meeting in October.
      • Torre: plans for SSD-based storage, for hierarchical storage? We have been looking into this at BNL and SLAC - there is a prospect for this. AGLT2 is looking at the LSI CacheCade solution (block-level SSD caching); potentially easy, but it requires the use of Dell SSDs, which are expensive; Shawn is working with Dell on this.
      • LFC consolidation - status?
        • Asking for the max thread count to the database to be increased - currently 100; will install a new one. (See the illustrative sketch at the end of this section.)
        • Hiro discussion with Cedric and others - merge script for Oracle.
        • Timeline: few days to merge the UTA_SWT2 site to Oracle, and then run in production. Patrick notes he runs proddisk often; this may complicate things.
      • Deploying federation infrastructure - widely. New bi-weekly meeting series.
      • Need to set a date for the next facilities f2f meeting (@ Santa Cruz). Mid-November time frame.
      • Accounting and statistics data: aggregation across computing elements, appearing as individual sites. ICB message to address accounting issues appropriately. Discussing with Jarka as to the specific aggregations that are occurring.
      • At the last GDB meeting in mid-June there was a discussion about migration from GLUE 1.3 to 2.0, a large change. There is a deadline of September 30(!!). If sites are not ready they'll be dropped. OSG apparently was not contacted, but they've taken this as an action item.
      • Next generation computing element. Two alternatives - a thin condor CE, and CREAM. Two teams are looking into these. BNL and Nebraska evaluating these; July 12 is next blueprint meeting in Madison.
      • Status of OSG 3.x deployment.
        • MWT2 - have OSG 3.x running, will update to latest
        • AGLT2 - will setup a test gatekeeper to try out the installation; will need to inform APF.
        • NET2 - installed on a new gatekeeper; will bring it in as a new Panda site. Working on BU - should be ready soon.
        • WT2 - we have two gatekeepers, will switch one of the two. Timeframe: 4-6 weeks. Not confident about stability of information system.
        • SWT2 - UTA: installed rpm-based Bestman at one of our clusters, that went fine. Using this as a basis for setting up builds for new devices. Is LFC available as an rpm? Yes. Will be taking a downtime anyway.
        • SWT2 - OU: working with OSCER admins to install OSG 3 on their new gatekeepers, to gain experience.
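      • Illustrative note on the LFC database connection limit: the "max thread count currently 100" figure above is consistent with a stock MySQL max_connections setting on the LFC backend. A minimal my.cnf sketch for raising it - values are illustrative, not the actual BNL configuration:
          # /etc/my.cnf on the LFC backend host -- illustrative values only, not BNL's settings
          [mysqld]
          max_connections = 500       # stock default (~100-151) was being exhausted by LFC client threads
          thread_cache_size = 64      # reuse server threads rather than spawning one per new connection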

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Suddenly there is a lull in production jobs in the US cloud; BNL takes all the activated jobs. This has happened before.
    • Expect an increase in reco and digi jobs for ICHEP (July). Q: 3.8 G jobs possible - what do we do to prepare? Mark will follow-up offline.
    • Wei notes that the Panda monitor is incorrectly reporting #analysis and #production. Wei will put in a ticket.
    • Michael notes last night there was a thread about multicore queues. NB: AthenaMPFacilityConfiguration; so we should plan on a short timescale to get this done. We need PBS and LSF configuration instructions (an illustrative PBS sketch appears at the end of this section). This is in the matrix.
  • this meeting:
    • High priority production for ICHEP is essentially done.
    • US Cloud has been performing well during this time.
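  • Illustrative PBS/Torque sketch for the multicore queue item above. The queue name and core count are placeholders; the actual prescription should follow AthenaMPFacilityConfiguration:
      # Torque/PBS qmgr commands for an 8-core AthenaMP-style queue -- illustrative only
      qmgr -c "create queue mp8 queue_type=execution"
      qmgr -c "set queue mp8 resources_default.nodes = 1:ppn=8"
      qmgr -c "set queue mp8 enabled = True"
      qmgr -c "set queue mp8 started = True"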

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Generally things are okay.
    • Sites have been sent lists of inconsistent files and ghosts
    • Deletion service causing transient deletion errors
    • USERDISK cleanup went well. Nearly all done. LFC transient errors seen by the deletion service. This is seen daily - for example at SLAC, where the backend database is shared with other SLAC users who sometimes create heavy load.
    • Shawn sees more LFC orphans being generated, correlated with central deletion (380K).
  • this meeting:
    • Some sites have run low on GROUPDISK
    • USERDISK cleanup has been scheduled - after July 9

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=196599
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-6_18_2012.html
    
    1)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in 
    https://savannah.cern.ch/support/?129468.  See more details in the discussion therein.
    2)  6/16: From Alexei: Many mc12, mc11_14TeV datasets will be subscribed to T1, T2 in the next 1-2 days.  It may explain peak on DDM dashboard 
    plots.  eLog 36920.
    3)  6/16: MWT2 - file transfer failures with errors like " [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 451 Operation failed: 
    Best pool  too high : Infinity]."  Issue was due to full write pools, fixed by draining some of them.  Also plan to re-enable automated pool 
    balancer.  ggus 83319 closed, eLog 36977.  https://savannah.cern.ch/support/index.php?129547 (Savannah site exclusion).
    4)  6/16: BNL_ATLAS_RCF job failures with "lost heartbeat" errors.  From John Hover: This queue is running on non-dedicated resources, so ATLAS 
    jobs are subject to eviction without warning when the resource owner submits a lot of jobs. We'd expect "lost heartbeat" errors when that happens. 
    We'll check to make sure that is what happened, and look into mitigating the impact of evictions.  ggus 83321 in-progress, eLog 36938.
    5)  6/17: AGLT2_DATADISK - Functional test failures with the error "file exists, overwrite is not allowed."  
    https://savannah.cern.ch/support/index.php?129560 (ADC Ops Support Savannah), eLog 36963.
    
    Follow-ups from earlier reports:
    
    (i)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    Update 6/14: The issue with NERSC not being in the BDII system should get fixed during an upcoming maintenance outage.
    (ii)  4/23: SWT2_CPB - User reported problems transferring files from the site using a certificate signed by the APACGrid CA.  (A similar problem 
    occurred last week at SLAC - see ggus 81351.)  Under investigation - details in ggus 81495 / RT 21947.
    Update 6/5: new version of BeStMan installed which should fix the problem with the APACGrid CA.  Awaiting confirmation from the user.
    Update 6/19: User reported he is now able to transfer files from SWT2_CPB following the BeStMan update. 
    (iii)  5/7: AGLT2 - file transfer failures ("[SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]").  
    Some issue between what is on disk at AGLT2 compared to DDM.  Experts involved.  ggus 81921 / RT 21997 in-progress, eLog 35820.
    Update 6/14: From Shawn: I think we can close this ticket. Some of the issue has been fixed via DDM cleanups. While there is still more to do I don't 
    think we need to keep the ticket on this issue open.  ggus & RT tickets closed.
    (iv)  5/14: ANL_LOCALGROUPDISK - file transfer errors due to the filesystem being read-only ("Error:/bin/mkdir: cannot create directory...").  
    ggus 82210 in-progress, eLog 36023.
    Update 6/14: No recent errors of the type reported in the original ticket - ggus 82210 closed.
    (v)  5/25: ATLAS MC production managers: expect an increase in digitization+reconstruction jobs at the tier-2's in preparation for ICHEP in July.  
    More details: http://www-hep.uta.edu/~sosebee/ADCoS/digi+reco-jobs-T2s-ICHEP-July2012.html
    (vi)  5/30:  BU_ATLAS_Tier2 - ggus 80214 was re-opened due to continuing DDM deletion errors at the site.  eLog 36488.
    Update 6/13 from Saul: Errors are gone. We'll watch for a similar burst in the future.  ggus 80214 again closed.  (Shifter reported >100 deletion 
    errors during a four hour period on 6/16, but no ticket was created - eLog 36915.)
    (vii)  6/4: UTD_LOCALGROUPDISK - file transfer errors with "source file doesn't exist."  ggus 82837 in-progress, eLog 36625.
    Update 6/14: ggus ticket was closed with no explanation/details.
    (viii)  6/10: MWT2 file transfer errors ("failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]").  Errors stopped later 
    the same day, but the problem is being investigated.  ggus 83107 in-progress, eLog 36798.
    Update 6/14 from Sarah: We did not find anything unusual in the log files during the time of the errors, and there were successful transfers during that 
    time period. Since there have been no more recurrences, this is all the troubleshooting we can do.  ggus 83107 closed.
    (ix)  6/10: SMU - DDM errors due to "[INTERNAL_ERROR] Checksum mismatch."  From Justin: There was major maintenance on the File system on Friday. 
    I have some follow up work to do first thing on Monday. I believe that will resolve the issues.  ggus 83108 in-progress, eLog 36799.
    Update 6/14:  Issue due to a lack of resources for the Lustre file system was resolved (checksum calculations were timing out).  ggus 83108 closed.
    (x)  6/11: NERSC file transfer failures with SRM errors.  ggus 83160 in-progress.  Iwona reported that a maintenance downtime had been set in OIM, but 
    it appears this information did not get propagated to AGIS?  See: https://savannah.cern.ch/bugs/?94341.  eLog 36822.
    Update 6/14: Earlier transfer errors have disappeared - current transfers are succeeding.  ggus 83160 closed.
    

  • this meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting (presented this week by Helmut Wolters):
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=197442
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-6_25_2012.html
    
    1)  6/21: BU_ATLAS_Tier2 - file deletion errors re-appeared, ggus 80214 re-opened.  As of 6/25 no recent errors, so the ticket was again closed.  
    eLog 37052.
    2)  6/21: The single job from merge task 868189 failed three times at SLAC with the error "Put error: lfc-mkdir failed."  Job eventually ran successfully 
    at BNL, so the task is done.  https://savannah.cern.ch/support/index.php?129735, eLog 37065.
    3)  6/23: HU_ATLAS_Tier2: jobs failing with "Read-only file system: '/tmp/..." on WN atlas5915.rc.fas.harvard.edu.  From John & Saul: These errors 
    were because of a node with a failing hard drive. We've taken it out of the queues.  ggus 83514 closed, eLog 37123.
    4)  6/26: UPENN_LOCALGROUPDISK - file transfer errors (SRM).  ggus 83602 in-progress, eLog 37168.
    
    Follow-ups from earlier reports:
    
    (i)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    Update 6/14: The issue with NERSC not being in the BDII system should get fixed during an upcoming maintenance outage.
    (ii)  4/23: SWT2_CPB - User reported problems transferring files from the site using a certificate signed by the APACGrid CA.  (A similar problem occurred 
    last week at SLAC - see ggus 81351.)  Under investigation - details in ggus 81495 / RT 21947.
    Update 6/5: new version of BeStMan installed which should fix the problem with the APACGrid CA.  Awaiting confirmation from the user.
    Update 6/19: User reported he is now able to transfer files from SWT2_CPB following the BeStMan update. 
    Update 6/25: Since the transfers are now successful, the tickets were closed.  (Patrick communicated with Wei at SLAC regarding the BeStMan update that solved 
    this issue, since similar ones have also been seen there.)
    (iii)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in 
    https://savannah.cern.ch/support/?129468.  See more details in the discussion therein.
    (iv)  6/16: BNL_ATLAS_RCF job failures with "lost heartbeat" errors.  From John Hover: This queue is running on non-dedicated resources, so ATLAS 
    jobs are subject to eviction without warning when the resource owner submits a lot of jobs. We'd expect "lost heartbeat" errors when that happens. 
    We'll check to make sure that is what happened, and look into mitigating the impact of evictions.  ggus 83321 in-progress, eLog 36938.
    Update 6/25: ggus 83552 was opened for the same issue.  eLog 37146.
    (v)  6/17: AGLT2_DATADISK - Functional test failures with the error "file exists, overwrite is not allowed."  
    https://savannah.cern.ch/support/index.php?129560 (ADC Ops Support Savannah), eLog 36963.
    

  • Notes:
    • At the conclusion of ICHEP processing, from Junji Tojo: ADC successfully delivered the datasets for H->gamma+gamma and H->4l for all the data taken, in time for the analysis for ICHEP. I would say this is really the big success. I also would like to thank you and the entire ADC folks for very hard & collaborative works to support the groups. That was my great experience.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting:
    • Will now just use the new list.
    • Useful to have both Canadian and CMS sites participating. E.g. load on the latency host was reduced with a syslog "-" option (reduces syncing; see the sketch at the end of this section).
    • A new page with tips will be set up.
    • Notes for this week on the list.
  • this meeting:
    • See notes sent yesterday.
    • There are some issues with the collector on the dashboard; Tom looking into getting an additional collector. This is thought to be responsible for empty measurements.
    • Throughput issues involving TRIUMF and sites in Germany. Hopefully getting sites onto LHCONE will avoid this.
    • 10G perfsonar sites?
      • NET2: up - to be configured.
      • WT2: have to check with networking people; equipment arrived.
      • SWT2: have hosts, need cables.
      • OU: have equipment.
      • BNL - equipment on hand
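  • Note on the syslog "-" option mentioned under last meeting: in classic syslogd/rsyslog, a leading "-" on the log file path skips the sync after every message, which is what reduces load on the latency host. A minimal sketch (the selector and path are the stock distribution line, shown only as an example):
      # /etc/rsyslog.conf (or syslog.conf) -- the leading '-' omits syncing after each write
      *.info;mail.none;authpriv.none;cron.none    -/var/log/messages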

Federated Xrootd deployment in the US (Wei, Ilija)

last week(s) this week:
  • New bi-weekly meeting series including participants from UK, DE and CERN; first meeting this week:
  • First version of the dCache 1.9.12.X xrootd N2N plugin deployed at MWT2 (uc & iu) and AGLT2. At DESY Hamburg a version for dCache 2.2 was successfully tested. Problem with dCache 2.0.3-1.
    • A stress test of 200 nodes each doing simultaneous access to one random file hits the limit of 100 simultaneous connections to the LFC (see the sketch at the end of this section).
    • two doors are used (one at uc and one at iu), no significant load.
    • As it is now, no authentication is performed. Problems were seen with authentication enabled on the 1.9.12.X versions of dCache.
    • will try to redo LFC authentication. There is a proposal to move to WebDAV instead.
  • DPM - this is in relatively good shape. There is an issue of providing multi-VO capability.
  • Monitoring - will need to publish information from the billing database. Would like to have a standard for this.
  • Redirectors being set up at CERN.
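  • A sketch of the kind of LFC-connection stress test described above, approximated here with 200 client processes from a single host rather than 200 worker nodes. The redirector host and file list are placeholders, not the actual MWT2 setup:
      #!/bin/bash
      # Launch 200 simultaneous xrdcp reads of randomly chosen files through the federation redirector.
      REDIRECTOR=xrootd-redirector.example.org     # placeholder hostname
      for i in $(seq 1 200); do
          FILE=$(shuf -n 1 filelist.txt)           # filelist.txt: one file path per line (assumed)
          xrdcp -f "root://${REDIRECTOR}//${FILE}" /dev/null &
      done
      wait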

US analysis queue performance (Ilija)

last two meetings
  • Twiki to document progress is here: AnalysisQueuePerformance
  • Organizational tools in place
  • Some tools to help monitor and inform about performance still in development
  • Base-line measurement is not yet established.
  • A lot of issues to investigate are cropping up.
  • Report1.pptx: AQP summary of last two meetings

Site news and issues (all sites)

  • T1:
    • last meeting(s): completed benchmarking of the R420 with the 2.4 GHz Sandy Bridge processor; Shuwei ran his jobs, including ROOT benchmarks, evgen, G4, digi, and reco, and compared to the current Westmere-based R410; no big advantage was seen for this particular machine over the Westmere-based 2.8 GHz - not faster, or only slightly faster. Suspicion is the IO subsystem, including the disk controller and memory access. Chris is looking at standard benchmarks like bonnie (see the sketch below) and found it slower. Pricing was much higher, so there is no advantage in performance over price. Interesting presentation at CHEP by Forrest from Dell; hope to follow up to determine what is going on. Invite a Dell expert to the upcoming ATLAS SW week. 6145 - will get one re-sent.
    • this meeting: pass - all is smooth
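    • An illustrative bonnie++ invocation of the kind referred to above (the directory and label are placeholders; by default bonnie++ uses a file set of roughly twice the RAM size to defeat caching):
        # run as root, dropping to an unprivileged user for the test files
        bonnie++ -d /scratch/bonnie-test -u nobody -m R420-sandybridge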

  • AGLT2:
    • last meeting(s): Testing the R420 as well. Not sure it will be a price-effective solution. Interested in changing the analy queue over to direct read; got the same event rate. Billing logs can be mined with Sarah's scripts. (Michael notes that Patrick strongly recommends decoupling the billing DB from other services and using the mirroring capabilities of Postgres 9; see the sketch below.)
    • this meeting: all is well
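    • A minimal sketch of the Postgres 9 mirroring mentioned above (streaming replication; host names and the user are placeholders, not the AGLT2 setup):
        # postgresql.conf on the primary (billing DB host) -- illustrative
        wal_level = hot_standby            # ship WAL suitable for a read-only standby
        max_wal_senders = 3
        # recovery.conf on the standby -- illustrative
        standby_mode = 'on'
        primary_conninfo = 'host=billing-db.example.org port=5432 user=replicator'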

  • NET2:
    • last meeting(s): Running smoothly - for the last 6 days a much lower number of production jobs. Lots of resources falling on the floor. Joined FAX federation. New gatekeeper and lsm nodes, bringing up a parallel Panda site - perhaps a better PBS or Condor. Also working on release reporting with Burt Holtzman.
    • this meeting: all is well

  • MWT2:
    • last meeting(s): Sarah - HC data migrated to IU storage, testing ROOTIO jobs by hand. Dave - campus cluster improvements to GPFS and updates preparing for 100 Gbps in the future. A number of worker nodes affected by network glitches.
    • this meeting: continued HC studies

  • SWT2 (UTA):
    • last meeting(s): Big news is UPS commissioned! Adding equipment - will have to take a downtime in June. Racking servers. Production cluster CPB has been draining off/on over the last week.
    • this meeting: UTA_SWT2 is tight on disk space, with lots of usage by the current set of tasks; will be tracking and running proddisk-cleanse. CPB - the NFS server is in bad shape: the hardware RAID on the system disk has failed. May need to take a downtime; need one anyway since the new UPS is in place. New equipment coming online. There is a Panda mover issue with CPB - data is only coming in slowly.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting: all is well

  • WT2:
    • last meeting(s): Storage node rebooted - all back to normal. Doesn't think there is a problem.
    • this meeting: LFC backend database issue, there was an upgrade and will do a migration. Will update Bestman.

Carryover issues (any updates?)

rpm-based OSG 3.0 CE install

last meeting(s)
  • In production at BNL.
  • Horst claims there are two issues: RSV bug, and Condor not in a standard location.
  • NET2: Saul: have a new gatekeeper - will bring up with new OSG.
  • AGLT2: March 7 is a possibility - will be doing upgrade.
  • MWT2: done.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated with install of new UPS - expect delivery, which will take a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.

this meeting

  • Any updates? This week: see above. (An illustrative install sketch follows this list.)
  • AGLT2
  • MWT2
  • SWT2
  • NET2
  • WT2
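  • For reference, a minimal sketch of the rpm-based install path. The repository URL and metapackage name below are written from memory and should be checked against the OSG 3 release documentation - treat them as assumptions, not the documented procedure:
      # Illustrative OSG 3.x CE install on an EL5 gatekeeper (verify names/URLs against the OSG release notes)
      rpm -Uvh http://repo.grid.iu.edu/osg-el5-release-latest.rpm    # adds the OSG yum repository (URL assumed)
      yum install osg-ce-condor                                      # CE metapackage for a Condor batch system (name assumed)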

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

this week

AOB

last week this week
  • Next generation Intel CPU: Sandy Bridge machines from Dell and IBM are under evaluation. The performance improvements do not seem to justify the current pricing. Will still be buying Westmere machines for RHIC. There is some indication the Sandy Bridge machines may not show up in the Dell matrix. Shawn: not for the near term; might be September.


-- RobertGardner - 26 Jun 2012

Attachments


pptx Report1.pptx (719.7K) | IlijaVukotic, 27 Jun 2012 - 11:17 | AQP summary of last two meetings
 