
MinutesMay9

Introduction

Minutes of the Facilities Integration Program meeting, May 9, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute) - announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Dave, Fred, Rob, Bob, Patrick, Shawn, Joel Snow, Jason, Saul, Horst, Wei, Hiro, Chris Walker, Ilija, Mark, Kaushik, Armen, John B, John H, Tom, Alden, Sarah
  • Apologies: Michael
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • IntegrationPhase21, SiteCertificationP21
      • Capacity summary: updates
        • MWT2 - will ramp to 2.4 PB within April. Presently 1.8 PB usable.
        • AGLT2 - 2160 TB online; an additional 120 TB coming at MSU, for a total of about 2250 TB.
        • NET2 - 2.1 PB online and useable
        • SWT2 - just about done with the UPS work, which was more extensive than planned; wrap-up in the next few days and final commissioning next week, then installation of the new storage begins. Last week went ahead and added 200 TB: 1.35 PB + 350 TB = 1.7 PB usable now. Once the new storage is in, capacity will exceed these figures.
        • WT2 - Only 50 TB short (2150 TB usable is available now), don't have immediate plans to purchase storage. Building up area for low-density storage. Will be ordering 10 TB of SSDs.
        • BNL - both CPU and disk at pledge level
      • Michael: will be going into a new planning round based on the latest resource document; there will be a new table. Primary data will be kept at sites, which will lead to a different split between CPU and storage.
      • SupportingCMS - follow-up discussion from Dan after the meeting about the CMS policy for group accounts. The preference is to use glexec and pool accounts, but Brian might have a workaround. Under discussion.
      • Michael: LFC consolidation was discussed at the ADC weekly - how it will proceed. There was a question as to the number of instances at BNL (3, 2, or 1); the ultimate goal would be just one, consolidating gradually. We also need some development for dark-data cleanup.
      • Multicore queue configuration set up at BNL. Configurations to be posted at AthenaMPFacilityConfiguration; see also AthenaMPFacilityTests (WIP). A configuration sketch follows this list.
      • From the TIM meeting: analysis performance with TTreeCache. There are performance gaps of up to 20%. Perhaps ask Sergei to reproduce his timings; see Wahid's presentation. Make this a visible facility activity: form a working group with an organized plan (a TTreeCache sketch appears under the analysis queue performance topic below).
      • Cloud resources at BNL are now being provided. Sergei, Val and Doug have been contacted by John Hover about an OpenStack environment based on EC2 interfaces. More resources are being added to this virtual environment - 200 virtual machines available.
    • this week
      • Capacity spreadsheet - received updates from Saul, Horst, and Mark. Any others? Would like to circulate updated copy tomorrow.
      • SiteCertificationP21 available.
        screenshot_02.png
      • Added two special topics for this week - pilot issues and analysis queue performance
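
Related to the multicore queue item above: a minimal sketch (Python) of how a pilot wrapper might export the AthenaMP worker count from a queue's core count before launching the payload. The queue name, core-count source, and payload command are placeholders rather than the actual BNL setup; see AthenaMPFacilityConfiguration for the real configuration.

    import os
    import subprocess

    # Hypothetical queue description; in practice the core count comes from schedconfig/AGIS.
    queue = {"name": "BNL_MCORE_EXAMPLE", "corecount": 8}  # placeholder name; 8 threads as discussed

    # AthenaMP reads ATHENA_PROC_NUMBER to decide how many worker processes to fork.
    env = dict(os.environ)
    env["ATHENA_PROC_NUMBER"] = str(queue["corecount"])

    # Placeholder payload; the real transform and job options come from the job definition.
    subprocess.check_call(["athena.py", "MyJobOptions.py"], env=env)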

APF + Condor-G + OSG 3.0 discussion (John Hover)

  • See the plot below - Condor-G loses communication with the gatekeeper and loses track of jobs, which leads to chaos.
  • On-going emails with Brian and Jamie.
  • Only involves GT5 gatekeepers.
  • Tuning Condor-G parameters and updating Condor have not fixed it.
  • The rate of status updates appears to be leading to the loss of contact (see the monitoring sketch below).
  • Sites should not update OSG for now, but a small site could update and provide new information - e.g. whether latency is an issue.
  • Patrick might update SWT2_UTA, with that caveat.
  • MWT2 has the largest job-to-gatekeeper ratio among the GT5 sites.
  • Other note: continuing to convert sites to APF.
  • Kaushik: can sites have a parameter to speed up the ramp-up of pilots (e.g. nqueue in schedconfig) under APF?
  • Which sites have already been converted?
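
As referenced above, a minimal monitoring sketch (assuming standard condor_q options on the APF/Condor-G submit host) that tallies grid-universe jobs by gatekeeper and Condor-G state; a pile-up of jobs stuck in one state against a single GT5 gatekeeper is the kind of symptom under discussion. This is an illustration, not the diagnostics John or Brian are actually running.

    import subprocess
    from collections import Counter

    # Ask condor_q for grid-universe jobs (JobUniverse == 9) with their gatekeeper and Condor-G state.
    out = subprocess.check_output([
        "condor_q",
        "-constraint", "JobUniverse == 9",
        "-format", "%s|", "GridResource",
        "-format", r"%s\n", "GridJobStatus",
    ]).decode()

    counts = Counter()
    for line in out.splitlines():
        if "|" in line:
            resource, state = line.split("|", 1)
            counts[(resource.strip(), state.strip())] += 1

    # Many jobs stuck in one state against a single gatekeeper would stand out here.
    for (resource, state), n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("%6d  %-15s  %s" % (n, state, resource))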

Special topic US analysis queue performance (Ilija)

  • Slides: HC_US_IOmonitoring.pdf (attached)
  • Ilija will chair a regular meeting at 2pm Central on Tuesdays; the first will be June 5.
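
For reference on the TTreeCache point from the integration update above, a minimal PyROOT sketch of enabling TTreeCache for an analysis-style read - the kind of setting whose presence or absence is a candidate explanation for gaps of the size discussed. The file URL, tree name, and cache size are illustrative only.

    import ROOT

    # Illustrative input; any analysis ntuple read over the network would do.
    f = ROOT.TFile.Open("root://some.redirector.example//path/to/ntuple.root")  # placeholder URL
    tree = f.Get("physics")                                                     # placeholder tree name

    tree.SetCacheSize(30 * 1024 * 1024)   # 30 MB TTreeCache
    tree.AddBranchToCache("*", True)      # cache all branches read below
    ROOT.TTreeCache.SetLearnEntries(10)   # let the cache learn the access pattern

    for i in range(tree.GetEntries()):
        tree.GetEntry(i)                  # reads go through the cache, batching I/O requests

    f.Close()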

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Multicore queue now needed for certain tasks
    • TIM: DEFT is a dynamic job definition tool - Panda decides the length of jobs. GlideinWMS and Panda were discussed extensively; pilots will start being sent this way, with a scaling test as the first goal. We need transparency here; Maxim will provide a plan and a twiki to document it. Federated Xrootd will proceed step-by-step: the first use case is transparent access to missing files, eventually leading to a plan for using FAX for real data handling of all files. The TIM came up with a plan in which a federated testing service provides a cost function; Kaushik and Doug are working on the plan (see the access sketch at the end of this section).
    • There is a network problem currently at HU.
  • this meeting:
    • A large backlog - production and analysis
    • A couple of issues: HU and SWT2 - working on resolving these; for HU, is there a problem with the SSB description of sites for Tier 2s?
    • 800K activated analysis jobs, 40K running
    • There is a lot of re-brokering going on.
    • Need more analysis slots
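
On the FAX use case noted in the TIM summary above (transparent access to missing files), a minimal PyROOT sketch of the fallback idea: try the local replica first and, if it is missing, re-open the same file through a federation redirector. The local path and redirector hostname are placeholders, not actual FAX endpoints.

    import os
    import ROOT

    local_path = "/pnfs/site.example/atlas/some/dataset/file.root"               # placeholder local copy
    fax_url = "root://fax.redirector.example//atlas/some/dataset/file.root"      # placeholder redirector

    # Transparent-access idea: fall back to the federation if the local replica is missing.
    if os.path.exists(local_path):
        f = ROOT.TFile.Open(local_path)
    else:
        f = ROOT.TFile.Open(fax_url)

    if not f or f.IsZombie():
        raise RuntimeError("could not open file locally or via the federation")

    print("opened %s" % f.GetName())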

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_30_2012.html
    
    1)  4/25: File transfer errors between AGLT2_PRODDISK and TRIUMF-LCG2_MCTAPE - issue most likely on the TRIUMF end, so ownership of ggus 81628 was transferred 
    to that site.  eLog 35504.
    2)  4/25: Job failures at BNL ("Get error: Failed to get LFC replicas (lfc_getreplicas failed with: 2703, Could not secure the connection)").  Issue was an expired host certificate, 
    now updated.  eLog 35506.
    3)  4/25: Users reported problems accessing files from SLAC - ggus 81615 was opened.  Not obvious this was a site issue - activity around the time seemed similar to that for 
    other US sites.  Waiting on an update from the ticket owner. 
    4)  4/28: FTS problem at BNL - transfers to US tier-2's stopped for several hours.  Issue resolved as of ~5:00 p.m. EST.
    5)  4/29: File transfer errors at OUHEP_OSG_HOTDISK ("failed to contact on remote SRM [httpg://ouhep2.nhn.ou.edu:8443/srm/v2/server]").  From Horst: We rebooted the 
    OUHEP cluster into the newest RHEL 5.8 kernel Friday night, and apparently there was a race condition between the SRM server and the gatekeeper coming back up, since 
    the SRM server depends on the gatekeeper's CA area, so it didn't start. I just started it, and everything seems fine.  ggus 81731 / RT 21971 closed, eLog 35600.  
    https://savannah.cern.ch/support/index.php?128258 (site was blacklisted during this period).
    Update 4/30: problem with file transfers reappeared (in this case to OUHEP_OSG_DATADISK).  Horst resolved a problem with xfs - ggus 81763 / RT 21984 closed, eLog 35643.
    6)  4/29: AGLT2_PRODDISK file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  From Shawn: Both SSDs hosting the 
    dCache DB (RAID-1 config) failed around 4:40 AM Eastern time. We have recovered the disks and dCache should be back online.  Transfers succeeding as of early a.m. 
    4/30 - ggus 81733 closed, eLog 35601.  https://savannah.cern.ch/support/index.php?128259 (site was blacklisted during this period).
    Update 5/2 early a.m.: problem with SSD RAID-1 set re-appeared.  Switched over to backup VM setup.  ggus 81801 in-progress, eLog 35653.
    
    Follow-ups from earlier reports:
    
    (i)  3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period).  ggus 79827 in-progress, eLog 34150.
    Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214.  eLog 34336.
    (ii)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which hosts gridftp & SRM services 
    being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.  Tickets cross-referenced.  System 
    marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  DDM blacklist ticket:
    https://savannah.cern.ch/support/index.php?127055
    Update 4/5: Downtime extended until the end of April.
    Update 5/1: Downtime again extended - may decide to remove the site.
    (iii)  4/8: UTD-HEP - file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  ggus 81035, eLog 35046.
    Update 4/26: Most recent issue was a problem with the site SRM not accepting the DDM robot proxy (more details in https://savannah.cern.ch/support/?127808). Recent 
    troubleshooting may have fixed the problem, so the site was unblacklisted to test the status.  File transfers succeeding.  Closed ggus 81035 and Savannah 127808 - eLog 35546.
    (iv)  4/9: NERSC - file transfer errors to SCRATCHDISK ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").  ggus 81050 in-progress, eLog 81050. 
    (v)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    (vi)  4/12: UTD-HEP - site requested to be unblacklisted in DDM.  However, file transfers started failing heavily, so had to set site off again.  
    See: https://savannah.cern.ch/support/?127808, eLog 35123/24/203.
    See (iii) above - file transfer issues at the site resolved - tickets closed.
    (vii)  4/21: WISC_LOCALGROUPDISK file transfer failures with "source file doesn't exist" errors.  ggus 81474 in-progress, eLog  35501.
    (viii)  4/22: MWT2 - job failures with "Get error: lsm-get failed."  See details in ggus 81477 (in-progress) - eLog 35424.
    Update 4/27: No more recent errors of the type reported in ggus 81477 - ticket closed.
    (ix)  4/22: MWT2 - ggus 81487 opened due to jobs failing with the "lost heartbeat" error.  Ticket in-progress, eLog 35433.
    Update 4/27: A problem with a gatekeeper led to Condor dropping jobs, resulting in the lost h.b. errors.  Issue resolved, ggus 81487 closed.
    (x)  4/23: SWT2_CPB - User reported problems transferring files from the site using a certificate signed by the APACGrid CA.  (A similar problem occurred last week at SLAC - 
    see ggus 81351.)  Under investigation - details in ggus 81495 / RT 21947.
    (xi)  4/24: SMU_LOCALGROUPDISK file transfer errors ("source file doesn't exist").  Update from Justin: These files have been deleted and an LFC update has been requested.  
    ggus 81526 in-progress, eLog 35463.
    (xii)  4/24: John at NET2 reported that the HU_ATLAS site was draining for lack of production jobs.  Pilots are unable to download files from the panda servers, and 
    immediately exit with the message "curl: (52) Empty reply from server /usr/bin/python: can't open file 'atlasProdPilot.py': [Errno 2] No such file or directory."  Problem under 
    investigation - see details in e-mail thread.  eLog 35477.
    Update: This issue was resolved.  For some reason the IP addresses of the compute nodes (seen via a shared NAT egress IP) were now required to provide reverse DNS look-ups.
    Once this was implemented, jobs began working again.
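
A small sketch related to item (xii) above: checking whether the shared NAT egress IP has a working reverse DNS entry, which is the condition the HU pilots turned out to depend on. The address shown is a placeholder.

    import socket

    ip = "192.0.2.10"  # placeholder: the shared NAT egress IP of the compute nodes

    try:
        hostname, aliases, addresses = socket.gethostbyaddr(ip)
        print("reverse DNS for %s -> %s" % (ip, hostname))
    except socket.herror as err:
        # No PTR record: this is the condition that made the pilots fail.
        print("no reverse DNS for %s: %s" % (ip, err))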
     

  • this meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=190524
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_7_2012.html
    
    1)  5/2 p.m.: Hiro reported a problem with dCache at BNL - issue resolved after ~one hour.  eLog 35675.
    2)  5/3: Modifications to the DDM functional tests announced - details in: https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/35686.
    3)  5/4: SWT2_CPB: file transfer errors ("Source file/user checksum mismatch").  Patrick pointed out this is not a site issue, but rather something odd with Panda, where 
    some user analysis jobs wrote the same output file.  The result is inconsistency between LFC/disk checksums, hence the errors.  Panda experts notified.  
    ggus 81879 / RT 21992 in-progress, eLog 35727.
    4)  5/4: Jobs from some MadGraph evgen task were running on non-multicore sites in the US cloud.  Problem understood (a software fix did not get propagated 
    correctly) - tasks aborted.  https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/35737.
    5)  5/5 early a.m.: BNL - file transfers failing due to dCache issue (pnfsManager process crashed).  Problem quickly fixed - transfers resumed.  eLog 35759.
    6)  5/7: AGLT2 - file transfer failures ("[SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]").  Some issue between 
    what is on disk at AGLT2 compared to DDM.  Experts involved.  ggus 81921 / RT 21997 in-progress, eLog 35820.
    7)  5/8: ggus 82016 / RT 21999 were opened regarding transfer problems between several sites across different clouds (including SWT2_CPB) to/from the UK cloud.  
    There was a question whether any modifications were needed to some FTS channels, and if there were any known site issues.  Ticket owner reported the problem 
    could be mitigated by reverting to FTS globus-url-copy mode.  The ticket was marked as 'unsolved', but the workaround presumably will suffice.
    
    Follow-ups from earlier reports:
    
    (i)  3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period).  ggus 79827 in-progress, eLog 34150.
    Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214.  eLog 34336.
    Update 5/3: Deletion errors seem to have decreased.  Site is still monitoring the situation.  Closed ggus 80214.
    (ii)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which hosts gridftp & SRM 
    services being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.  Tickets cross-referenced.  
    System marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  DDM blacklist ticket:
    https://savannah.cern.ch/support/index.php?127055
    Update 4/5: Downtime extended until the end of April.
    Update 5/1: Downtime again extended - may decide to remove the site.
    (iii)  4/9: NERSC - file transfer errors to SCRATCHDISK ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").  ggus 81050 in-progress, 
    eLog 81050. 
    Update 5/5: Site was in a downtime to upgrade the firmware on a server.  Since that time no new errors, so ggus 81050 was closed - eLog 35762.
    (iv)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    (v)  4/21: WISC_LOCALGROUPDISK file transfer failures with "source file doesn't exist" errors.  ggus 81474 in-progress, eLog  35501.
    Update 5/4: ggus ticket 81474 closed - no details provided.
    (vi)  4/23: SWT2_CPB - User reported problems transferring files from the site using a certificate signed by the APACGrid CA.  (A similar problem occurred last week 
    at SLAC - see ggus 81351.)  Under investigation - details in ggus 81495 / RT 21947.
    (vii)  4/24: SMU_LOCALGROUPDISK file transfer errors ("source file doesn't exist").  Update from Justin: These files have been deleted and an LFC update has been 
    requested.  ggus 81526 in-progress, eLog 35463.
    (viii)  4/25: Users reported problems accessing files from SLAC - ggus 81615 was opened.  Not obvious this was a site issue - activity around the time seemed similar 
    to that for other US sites.  Waiting on an update from the ticket owner. 
    (ix)  4/29: AGLT2_PRODDISK file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  From Shawn: Both SSDs hosting 
    the dCache DB (RAID-1 config) failed around 4:40 AM Eastern time. We have recovered the disks and dCache should be back online.  Transfers succeeding as of 
    early a.m. 4/30 - ggus 81733 closed, eLog 35601.  https://savannah.cern.ch/support/index.php?128259 (site was blacklisted during this period).
    Update 5/2 early a.m.: problem with SSD RAID-1 set re-appeared.  Switched over to backup VM setup.  ggus 81801 in-progress, eLog 35653.
    Update 5/3: problem fixed - dCache back on-line.  Transfers succeeding, so closed ggus 81801.  eLog 35718.  http://savannah.cern.ch/support/?128362.
    

  • There may be issues with the Panda monitor
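
Tied to item 3) in this week's summary (checksum mismatches at SWT2_CPB), a minimal sketch of computing a file's adler32 in the 8-hex-digit form conventionally stored in the catalogs, so the on-disk value can be compared with the LFC entry. The file path is a placeholder; a mismatch between this value and the catalog entry is what surfaces as the "checksum mismatch" transfer error.

    import zlib

    def adler32(path, blocksize=1024 * 1024):
        """Return the adler32 of a file as the 8-hex-digit string used in the catalogs."""
        value = 1  # adler32 starting value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(blocksize)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    # Compare against the checksum recorded for the same replica in the LFC.
    print(adler32("/path/to/replica.root"))  # placeholder path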

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

last week(s)

this week

Site news and issues (all sites)

  • T1:
    • last meeting(s): a few issues with expired certificates - hopefully resolved by now. Looking forward to multicore jobs to see how the Condor job configuration is working; 8 threads chosen in the MC configuration. Using a separate Panda queue.
    • this meeting:

  • AGLT2:
    • last meeting(s): DNS issue, repaired. Working on improving networking at UM with a Dell Force10 4810 1U switch.
    • this meeting: Had issues with the main postgres database for dCache, hosted on SSDs, which filled. Tried RAID-1, but found one or the other drive going offline. These are OCZ SSDs without Dell firmware and may not be interacting well with the H800 controller. Now running in VMs - some worry about IOPS, but it seems to be working well.

  • NET2:
    • last meeting(s): Networking HU to CERN not working.
    • this meeting: Ongoing problem with HU; the production queue drained, and the lcg-info-site command is not returning information (see the check sketched below). The HU analysis queue has nonetheless continued to run. Alden takes the result and populates a table needed for Panda brokerage. CVMFS is being used at HU; is it being tagged by Alessandro's framework? This suddenly started two days ago. Will double-check these.
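
A small sketch of the check referenced in the NET2 item above: query the information system for ATLAS CEs and look for the HU gatekeeper in the output. It assumes the usual lcg-infosites --vo atlas ce form of the query; the gatekeeper hostname is a placeholder, and the exact command the site uses may differ.

    import subprocess

    gatekeeper = "gatekeeper.hu.example"  # placeholder: the HU gatekeeper hostname

    # List the computing elements the information system currently publishes for ATLAS.
    out = subprocess.check_output(["lcg-infosites", "--vo", "atlas", "ce"]).decode()

    matches = [line for line in out.splitlines() if gatekeeper in line]
    if matches:
        for line in matches:
            print(line)
    else:
        print("%s not found in the information system output" % gatekeeper)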

  • MWT2:
    • last meeting(s): Continuing to study networking issues; gatekeeper running modified 3.0.10. UIUC campus cluster nodes offline while GPFS system issues are addressed; expect to bring nodes back online later today. Will be working on dCache pool node solution.
    • this meeting: Checksum errors have abated - not related to packet loss or NIC errors as originally thought. Gatekeeper (OSG 3, GRAM5) and AutoPyFactory incidents, and the resulting triage, were caused by scalability problems with Condor-G; these seem to have been mitigated. Issues with GPFS performance at UIUC.

  • SWT2 (UTA):
    • last meeting(s): User with APAC grid CA having trouble downloading files.
    • this meeting: Starting to work on the LFC migration - hope to switch to BNL later this week. Energizing the UPS requires scheduling with the rest of the building's users; it might be next week. Will begin racking the new storage soon thereafter.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting: 40 rogue analysis jobs using 26 GB of RAM were removed.

  • WT2:
    • last meeting(s):
    • this meeting: Twelve 960 GB OCZ SSDs in an R610 with 10G networking; throughput exceeds 12 Gbps, hitting the SAS2 channel limit. Will put this behind the analysis queue.

Carryover issues (any updates?)

rpm-based OSG 3.0 CE install

last meeting(s)
  • In production at BNL.
  • Horst claims there are two issues: RSV bug, and Condor not in a standard location.
  • NET2: Saul: have a new gatekeeper - will bring up with new OSG.
  • AGLT2: March 7 is a possibility - will be doing upgrade.
  • MWT2: done.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated by the installation of the new UPS - expecting delivery, which will require a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.

this meeting

  • Any updates?
  • There is a new release 3.1.0; Horst will take a look for problems on his ITB site.
  • AGLT2
  • MWT2 - 3.0.10 in production
  • SWT2
  • NET2
  • WT2

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

this week

AOB

last week

this week


-- RobertGardner - 08 May 2012

  • Pilot draining issue at MWT2
    screenshot_03.png

Attachments


  • HC_US_IOmonitoring.pdf (pdf, 4519.6 KB) - RobertGardner, 09 May 2012 - 12:26
  • screenshot_02.png (png, 39.2 KB) - RobertGardner, 09 May 2012 - 12:46
  • screenshot_03.png (png, 67.3 KB) - RobertGardner, 09 May 2012 - 12:47
 