
MinutesMay30

Introduction

Minutes of the Facilities Integration Program meeting, May 30, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (6 to mute) - announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Michael, Rob, Tom, Dave, Patrick, Joel Snow (for Horst), Saul, Torre, Wei, Chris Walker (from OU), Hiro, John Brunelle, Armen, Mark, Shawn, Wensheng, Fred, Alden
  • Apologies: Kaushik, Horst
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
    • this week
      • SiteCertificationP21
      • Another check of storage capacities
        • MWT2 - will ramp to 2.4 PB within April. Presently 1.8 PB usable.
        • SWT2 - UPS online yesterday. Racked, mostly cabled. Goal is to be at the SWT2 pledge by the end of next week; will have enough to alleviate any storage pressure.
      • LFC migration:
        • Hiro has started a program to replicate the MySQL DB behind the LFC at BNL (a replication sketch follows at the end of this list)
        • Hiro to bring up LFC service
        • Goal: bring into production this week, and have this in place by SW week (two weeks from now).
        • Hiro notes CERN is running 5 or 6 LFC instances due to client limitations - 100 at a time.
        • Anticipate running this for an extended period at BNL.
        • Hiro notes the clients are at the Tier 2s - a different situation from that at CERN.
      • Michael: metrics for the facilities - not much progress, but still on our list.
      • Agree on a set of Tier 2 usage statistics plots. What we find important, and for presentations. So we have an up-to-date picture. Make suggestions, discuss here at this meeting.
      • Networking - milestone for LHCONE connectivity in May - won't be able to reach it. BNL has a firm plan for LHCONE connectivity next week.
        • Shawn reports options for connecting in Chicago (Dale Finkelson - LHCONE contact, coordinating with Joe Mambretti - I2 VRF). What about FNAL's 6504?
        • Transatlantic networking - all aspects - including T2's. US ATLAS and US CMS mgt developing strategic plan for requirements and implementation, strategy for next several years.
      • Analysis performance optimization - Ilija has started to act on this, will report.
      • Alexei - ADC T1/2/3 jamboree, Nov/Dec - there is a doodle poll out. Oriented towards facilities operations. Will be at CERN.
      • Tier 2 guidance for this year - coming soon. We should agree what to purchase, to maintain a balance of CPU and storage. We need to do this quickly - by end of June.
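      • For reference, a minimal sketch of the MySQL replication setup assumed for the LFC consolidation - hostnames, credentials, and the binlog settings below are placeholders, not the actual BNL or Tier 2 configuration:
        # /etc/my.cnf on the Tier 2 (master) LFC database host -- assumed settings
        [mysqld]
        server-id    = 1
        log-bin      = mysql-bin
        binlog-do-db = cns_db          # LFC catalog schema (assumed name)

        # /etc/my.cnf on the BNL (replica) host
        [mysqld]
        server-id    = 2

        -- on the replica, point at the master and start replication (MySQL 5.x)
        CHANGE MASTER TO
          MASTER_HOST='lfc-db.tier2.example.org',   -- placeholder hostname
          MASTER_USER='repl', MASTER_PASSWORD='********',
          MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=0;
        START SLAVE;
        SHOW SLAVE STATUS\G            -- check Seconds_Behind_Master converges to 0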

APF + Condor-G + OSG 3.0 discussion (John Hover)

last meeting
  • See plot below - Condor-G losing communication with the gatekeeper; it loses track of jobs, which leads to chaos.
  • On-going emails with Brian, Jamie.
  • Only involves GT5 gatekeepers.
  • Tuning Condor-G parameters or updating Condor has not fixed it.
  • The rate of updates is leading to the loss.
  • Sites should not update OSG for now, but a small site could update and provide some new information - e.g., whether latency is an issue.
  • Patrick might update SWT2_UTA, with that caveat.
  • MWT2's ratio of jobs per gatekeeper is the largest among the GT5 sites.
  • Other note: continuing to update sites to APF.
  • Kaushik: Can sites have a parameter to speed up ramping of pilots (e.g. nqueue in schedconfig) for APF?
  • Which sites have already been converted?

this meeting

  • Running mostly stable for the past week. A second gatekeeper will be added at MWT2 to alleviate the stress.
  • At SWT2 there was an issue - HC set the queue into test; there were activated jobs, but APF submitted no pilots. Jose implemented a mechanism to submit a minimum number of pilots.

Special topic US analysis queue performance (Ilija)

last meeting
  • Ilija will chair a regular meeting. 2pm Central Tuesday; first will be June 5
  • Twiki to document progress will be here: AnalysisQueuePerformance
this meeting
  • First meeting is next week.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • A large backlog - production and analysis
    • A couple of issues: HU, SWT2 - working on resolving these; for HU - is there a problem with SSB description of sites for Tier 2s?
    • 800K activated analysis, 40K running
    • There is a lot of re-brokering going on.
    • Need more analysis slots
  • this meeting:
    • Suddenly in the US cloud a lull in production jobs; BNL takes all the activated jobs. Has happened before.
    • Expect an increase in reco and digi jobs for ICHEP (July). Q: 3.8 G jobs possible - what do we do to prepare? Mark will follow up offline.
    • Wei notes that the Panda monitor is incorrectly reporting #analysis and #production. Wei will put in a ticket.
    • Michael notes last night there was a thread about multicore queues. NB: AthenaMPFacilityConfiguration; we should plan on a short timescale to get this done. We need PBS and LSF configuration instructions (a batch-directive sketch follows below). This is in the matrix.
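    • As a point of reference, a minimal sketch of what an 8-core (AthenaMP-style) request looks like under PBS/Torque and LSF - the queue names and limits below are placeholders; the authoritative instructions belong in AthenaMPFacilityConfiguration:
      # PBS/Torque: 8 cores on a single node for an AthenaMP-style job
      #PBS -q mcore                  # placeholder multicore queue name
      #PBS -l nodes=1:ppn=8
      #PBS -l walltime=24:00:00

      # LSF equivalent: 8 slots packed onto one host
      #BSUB -q mcore                 # placeholder multicore queue name
      #BSUB -n 8
      #BSUB -R "span[hosts=1]"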

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • Missing, corrupt file issue at several sites - caused by deletion in USERDISK.
    • Admins urged to check for LFC ghosts (a ghost-check sketch follows at the end of this list). See MinutesDataManageMay8
    • Double-running of same job at SWT2, causing checksum mismatches. Seen before, but still under discussion
    • Sites should give feedback to deletion service developers; update Savannah ticket. https://savannah.cern.ch/bugs/?94422
    • USERDISK emails went out, deletions coming.
  • this meeting:
    • Generally things are okay.
    • Sites have been sent lists of inconsistent files, ghosts
    • Deletion service causing transient deletion errors
    • USERDISK cleanup went well. Nearly all done. LFC transient errors seen by the deletion service. Sees this daily - for example at SLAC, where the backend database is shared with other SLAC users which sometimes create heavy load.
    • Shawn sees more LFC orphans being generated (380K), correlated with central deletion.
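    • A rough sketch of the kind of ghost check being asked of site admins - the LFC host, paths, and the storage-dump step are placeholders, and mapping catalog entries to storage paths is site/SURL-specific:
      #!/bin/bash
      # Compare an LFC namespace dump against a storage dump to flag "ghosts"
      # (catalog entries with no file behind them).  Requires a valid grid proxy.
      export LFC_HOST=lfc.usatlas.bnl.gov        # assumed host after the migration
      LFC_PATH=/grid/atlas/users                 # placeholder namespace to scan
      lfc-ls -R $LFC_PATH | sort > lfc_entries.txt   # -R assumed to recurse
      # Site-specific step: dump the corresponding storage namespace (dCache,
      # xrootd, GPFS, ...) into storage_entries.txt using the same path convention.
      sort storage_entries.txt > storage_sorted.txt
      # Present in the LFC but missing on storage => ghost candidates
      comm -23 lfc_entries.txt storage_sorted.txt > ghost_candidates.txt
      wc -l ghost_candidates.txt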

Shift Operations (Mark)

  • last week: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_21_2012.html
    
    1)  5/17: SLACXRD: problematic storage node produced job failures (stage-in errors) and file transfer failures ("Request timeout (internal error or too long 
    processing), request aborted])."  Wei reported the problem was fixed (bad drive, but a spare didn't come online correctly).  eLog 36096.
    2)  5/17: SWT2_CPB - file transfer errors ("failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server'").  Issue was due to a failed drive 
    in a RAID, but the spare had not been automatically swapped in.  Failed the drive manually to force the rebuild - issue resolved.
    3)  5/18: new pilot release from Paul (v53b).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_53b.html
    4)  5/18: Issues at some sites with the CERN CRLs expiring too close to the renewal time.  Depending on when an updater ran, a site could get caught by the CRLs 
    expiring in the meantime.  From CERN IT: Our CRL is republished daily with a 5-day expiration; there was an issue last week with the publication tool, which was fixed 
    last Friday.  ggus 82326 closed, eLog 36123.
    5)  5/19: MWT2 queues were draining due to a lack of activated jobs.  Issue eventually resolved, but it isn't known whether the root cause was determined.
    6)  5/19: UTD-HEP: file transfer failures with SRM errors.  There was a scheduled power outage for maintenance, but it somehow did not get reported correctly in OIM.  
    Once the outage was completed transfers resumed successfully, and the site was unblacklisted.  ggus 82345 closed, eLog 36210, 
    https://savannah.cern.ch/support/index.php?128771.
    7)  5/20: MWT2 job failures - from Rob: The gatekeeper condor process for MWT2 failed last night; this morning the cause was determined and the service was restarted. 
    Note to shift: a number of jobs will have failed as a result.  eLog 36160.  (ggus 82351 was opened during this period - since closed.  eLog 36161.)
    8)  5/21: OUHEP_OSG_HOTDISK file transfer errors ("failed to contact on remote SRM [httpg://ouhep2.nhn.ou.edu:8443/srm/v2/server].")  From Horst: Our bestman2 SRM 
    was hung up for some reason. Restarting it fixed the problem.  ggus 82387/88 & RT 22046/48 were closed (one set of tickets was a duplicate).  eLog 36197/98.
    9)  5/22: ANL_LOCALGROUPDISK DDM errors ("/bin/mkdir: cannot create directory ... Read-only file system").  ggus 82389 in-progress, eLog 36199.
    
    Follow-ups from earlier reports:
    
    (i)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which hosts gridftp & 
    SRM services being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.  Tickets 
    cross-referenced.  System marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  DDM blacklist ticket:
    https://savannah.cern.ch/support/index.php?127055
    Update 4/5: Downtime extended until the end of April.
    Update 5/1: Downtime again extended - may decide to remove the site.
    Update 5/18 from Doug: We agreed to remove the site from Tiers of Atlas, Hiro will do it at some point. We are going to use the gridftp only site as the alternative.  
    ggus 80126 closed, 80228 is still open.  (Savannah ticket was closed on 5/23.)
    (ii)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    (iii)  4/23: SWT2_CPB - User reported problems transferring files from the site using a certificate signed by the APACGrid CA.  (A similar problem occurred last 
    week at SLAC - see ggus 81351.)  Under investigation - details in ggus 81495 / RT 21947.
    (iv)  4/24: SMU_LOCALGROUPDISK file transfer errors ("source file doesn't exist").  Update from Justin: These files have been deleted and an LFC update has 
    been requested.  ggus 81526 in-progress, eLog 35463.
    (v)  4/25: Users reported problems accessing files from SLAC - ggus 81615 was opened.  Not obvious this was a site issue - activity around the time seemed 
    similar to that for other US sites.  Waiting on an update from the ticket owner. 
    (vi)  5/7: AGLT2 - file transfer failures ("[SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]").  Some issue 
    between what is on disk at AGLT2 compared to DDM.  Experts involved.  ggus 81921 / RT 21997 in-progress, eLog 35820.
    (vii)  5/11: File transfers to UPENN_LOCALGROUPDISK from BNL-OSG2_SCRATCHDISK were failing with "Checksum mismatch" errors.  ggus 82123 in-progress, 
    eLog 35912.
    Update 5/19 from the site admin: I believe that poor performance on my side was causing bad behavior in FTS. I solved my performance problems by rolling the 
    kernel version back to 2.6.18-194.11.4.el5. Version 2.6.18-308.4.1.el5 seems to cause many problems.  ggus 82123 closed, eLog 36135.
    (viii)  5/14: ANL_LOCALGROUPDISK - file transfer errors due to the filesystem being read-only ("Error:/bin/mkdir: cannot create directory...").  
    ggus 82210 in-progress, eLog 36023.
    

  • this meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=193307
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_28_2012.html
    
    1)  5/24: DDM errors at multiple clouds/sites due to expired ddmadmin robot proxy (e.g., "/CN=Robot: ATLAS Data Management/CN=proxy/CN=proxy/CN=proxyexpired 
    46 minutes ago").  From Ueda: "proxy expired" errors in transfers have been observed in three clouds recently. The error is caused by a backlog in transfers, 
    which may not be a site issue. Need to investigate the cause before submitting a ggus ticket. Put a savannah ticket in ddm-ops and consult with the experts.  
    See: https://savannah.cern.ch/bugs/index.php?94853.  ggus 82509/RT 22056 were opened for file transfer failures at SWT2_CPB with the expired proxy error. 
    Not a site problem - the tickets were closed - eLog 36312.
    2)  5/25: BNL - file transfer failures with SRM errors ("failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]").  Hardware problem with 
    an SRM database host was fixed - issue resolved.  ggus 82539 closed, eLog 36332.
    3)  5/25: SWT2_CPB - file transfer errors ("failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]").  From Patrick: We determined that a 
    back-end component of the storage system at SWT2_CPB was having issues with too many open file descriptors. We have restarted the service and the storage system 
    appears to be functioning correctly.  Successful transfers resumed - issue resolved.  ggus 82541/RT 22060 closed, eLog 36357.
    4)  5/25: ATLAS MC production managers: expect an increase in digitization+reconstruction jobs at the tier-2's in preparation for ICHEP in July.  More details:
    http://www-hep.uta.edu/~sosebee/ADCoS/digi+reco-jobs-T2s-ICHEP-July2012.html
    5)  5/29 early a.m.: many sites were auto-excluded with the HC job error "ddm: Setupper._setupDestination() could not register :
    hc_test.dvanders.hc20005896.LPC.76."  From Solveig:  A mistake was made in deploying a new release this morning, and the project tag validation was broken.  
    It is fixed now, and the error should go away.  eLog 36461.
    6)  5/30 early a.m.: SLACXRD file transfer failures (" [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 500 500-Command failed.: 
    globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914: 500-open() fail 500").  Wei reported the problem had been fixed.  Successful transfers resumed after 
    a few hours, so ggus tickets 82654/74 were closed.  eLog 36490.
    7)  5/30: UTA_SWT2 - file transfer and DDM deletion errors.  Issue is due to an expired host certificate.  ggus 82676 / RT 22070 in-progress, eLog 36491.
    
    Follow-ups from earlier reports:
    
    (i)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    (ii)  4/23: SWT2_CPB - User reported problems transferring files from the site using a certificate signed by the APACGrid CA.  (A similar problem occurred last week 
    at SLAC - see ggus 81351.)  Under investigation - details in ggus 81495 / RT 21947.
    (iii)  4/24: SMU_LOCALGROUPDISK file transfer errors ("source file doesn't exist").  Update from Justin: These files have been deleted and an LFC update has been 
    requested.  ggus 81526 in-progress, eLog 35463.
    (iv)  4/25: Users reported problems accessing files from SLAC - ggus 81615 was opened.  Not obvious this was a site issue - activity around the time seemed similar 
    to that for other US sites.  Waiting on an update from the ticket owner. 
    (v)  5/7: AGLT2 - file transfer failures ("[SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]").  Some issue between 
    what is on disk at AGLT2 compared to DDM.  Experts involved.  ggus 81921 / RT 21997 in-progress, eLog 35820.
    (vi)  5/14: ANL_LOCALGROUPDISK - file transfer errors due to the filesystem being read-only ("Error:/bin/mkdir: cannot create directory...").  ggus 82210 in-progress, 
    eLog 36023.
    (vii)  5/22: ANL_LOCALGROUPDISK DDM errors ("/bin/mkdir: cannot create directory ... Read-only file system").  ggus 82389 in-progress, eLog 36199.
    5/28: Update from Ueda - this destination is being decommissioned; no problem to have it as read-only.  Added related issue:
    https://savannah.cern.ch/bugs/index.php?94799.  Not sure why this ticket was assigned to VO support rather than to the site, but in any case this destination is being 
    decommissioned, and probably the site has set it to read-only.  No more transfers to the site should be made.  ggus 82389 closed.
    

  • Notes:

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting:
    • Now called North American Throughput meeting. CMS and others now joining. New list throughput-l@lists.bnl.gov.
    • Most sites have upgraded; WT2 has hardware.
  • this meeting:
    • Will now just use the new list.
    • Useful to have both Canadian and CMS sites participating. E.g., load on the latency host was reduced with the syslog "-" option, which reduces syncing (see the config sketch below).
    • A new page with tips will be set up.
    • Notes for this week are on the list.
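    • The syslog tweak mentioned above, roughly: in sysklogd/rsyslog a leading "-" on a file action disables the fsync after every message, which is what cut the load on the latency host (the facility and log path below are placeholders):
      # /etc/syslog.conf (or /etc/rsyslog.conf) -- placeholder facility and path
      # Without the leading "-", syslogd syncs the file after every message.
      local5.*    -/var/log/perfsonar/measurements.log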

Federated Xrootd deployment in the US (Wei, Ilija)

last week(s) this week:
  • Last meeting MinutesFedXrootdMay16
  • Discussion at CHEP - a native dCache xrootd door with N2N (?) is being developed.
  • Will need a monitoring solution for this case.
  • Munich site - will be brought up as a new site.
  • CERN will set up a European redirector as well.
  • Twiki access to CERN - readable
  • Wei discussed with Paul implementing the fallback solution

Site news and issues (all sites)

  • T1:
    • last meeting(s): A few issues with expired certificates - hopefully resolved by now. Looking forward to multicore jobs to see how the Condor job configuration is working; 8 threads chosen in the MC configuration. Using a separate Panda queue.
    • this meeting: Completed benchmarking of the R420 with 2.4 GHz Sandy Bridge processors. Shuwei ran his jobs - including ROOT benchmarks, evgen, G4, digi, and reco - and compared against the current Westmere-based R410; no big advantage was seen for this particular machine over the 2.8 GHz Westmere-based nodes - not faster, or only slightly faster. Suspicion falls on the I/O subsystem, including the disk controller and memory access. Chris looked at standard benchmarks like bonnie and found it slower. Pricing was much higher, so no advantage in performance per price. Interesting presentation at CHEP by Forrest from Dell; hope to follow up to determine what is going on. Invite a Dell expert to the upcoming ATLAS SW week. 6145 - will get one re-sent.

  • AGLT2:
    • last meeting(s): Had issues with the main Postgres database for dCache, hosted on SSDs - it filled. Tried RAID1, but found one or the other drive going offline. The OCZ SSDs without Dell firmware may not be talking well with the H800. Now in VMs - worried about IOPS, but it seems to be working well.
    • this meeting: Testing the R420 as well; not sure it will be a price-effective solution. Interested in changing the analysis queue over to direct read; got the same event rate. Billing logs can be mined with Sarah's scripts. (Michael notes that Patrick strongly recommends decoupling the billing DB from other services and using the mirroring capabilities of Postgres 9 - a replication sketch follows.)
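    • A minimal sketch of the Postgres 9 mirroring mentioned above (streaming replication of the billing DB to a standby); hostnames, network range, and the replication user are placeholders:
      # postgresql.conf on the primary (billing DB) -- Postgres 9.x
      wal_level = hot_standby
      max_wal_senders = 3
      wal_keep_segments = 128

      # pg_hba.conf on the primary: allow the standby to stream WAL
      host  replication  repluser  10.10.0.0/24  md5     # placeholder network/user

      # recovery.conf on the standby, after seeding it with pg_basebackup
      standby_mode = 'on'
      primary_conninfo = 'host=billing-db.aglt2.example user=repluser password=********'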

  • NET2:
    • last meeting(s): On-going problem with HU; the production queue drained, and the lcg-info-site command is not returning information. The HU analysis queue has nevertheless continued to run. Alden takes the result and populates a table needed for Panda brokerage. CVMFS is being used at HU; is it being tagged by Alessandro's framework? This suddenly happened two days ago. Will double-check these.
    • this meeting: Running smoothly - for the last 6 days a much lower number of production jobs. Lots of resources falling on the floor. Joined FAX federation. New gatekeeper and lsm nodes, bringing up a parallel Panda site - perhaps a better PBS or Condor. Also working on release reporting with Burt Holtzman.

  • MWT2:
    • last meeting(s): Checksum errors have abated - not related to packet loss or NIC errors as originally thought. Gatekeeper (OSG 3, GRAM5) and AutoPYFactory incidents and triage caused by scalability problems with Condor-G. These seem to have been mitigated. Issues with GPFS performance at UIUC.
    • this meeting: Sarah - HC data migrated to IU storage, testing ROOTIO jobs by hand. Dave - campus cluster improvements to GPFS and updates preparing for 100 Gbps in the future. A number of worker nodes affected by network glitches.

  • SWT2 (UTA):
    • last meeting(s): Starting to work on the LFC migration - later this week hope to switch to BNL. Energizing the UPS requires scheduling with the rest of the building's users - might be next week. Will begin racking soon thereafter.
    • this meeting: Big news - the UPS is commissioned! Adding equipment - will have to take a downtime in June. Racking servers. The production cluster CPB has been draining off and on over the last week.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting: All is well. Horst is on a cruise.

  • WT2:
    • last meeting(s): Twelve 960 GB OCZ SSDs in an R610 with 10G networking. Exceeding the SAS2 channel limit - more than 12 Gbps. Will put this behind the analysis queue.
    • this meeting: Storage node rebooted - all back to normal. Doesn't think there is a problem.

Carryover issues (any updates?)

rpm-based OSG 3.0 CE install

last meeting(s)
  • In production at BNL.
  • Horst claims there are two issues: RSV bug, and Condor not in a standard location.
  • NET2: Saul: have a new gatekeeper - will bring up with new OSG.
  • AGLT2: March 7 is a possibility - will be doing upgrade.
  • MWT2: done.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated by the install of the new UPS - expecting delivery, which will require a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.

this meeting

  • Any updates?
  • There is a new release, 3.1.0; Horst will take a look for problems on his ITB site (an install sketch follows this list).
  • AGLT2
  • MWT2 - 3.0.10 in production
  • SWT2
  • NET2
  • WT2
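  • For reference, the rough shape of the RPM-based CE install on an EL5 node - the repo URL is indicative and the exact package set should be taken from the OSG release notes, so treat this as a sketch rather than the authoritative recipe:
    # EPEL is also required; install the epel-release RPM for EL5 first.
    # Enable the OSG 3 yum repository (URL is indicative)
    rpm -Uvh http://repo.grid.iu.edu/osg-el5-release-latest.rpm
    # Pull in the Condor-backed CE meta-package and the RSV probes
    yum install osg-ce-condor rsv
    # Configure via the ini files in /etc/osg/config.d/ and apply:
    osg-configure -c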

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

this week

AOB

last week this week


-- RobertGardner - 29 May 2012
