
MinutesApr25

Introduction

Minutes of the Facilities Integration Program meeting, April 25, 2012
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Ilija, Rob, Alden, Michael, Armen, Bob, Mark, Fred, Sarah, Wei, Saul, Dave, John B, Tom, Horst, Kaushik
  • Apologies: Shawn, Jason
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly - convened by Kaushik): Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • GuideToJIRA - will use to coordinate perfSONAR and OSG 3.0 deployments this quarter.
      • IntegrationPhase20, SiteCertificationP20
      • Enabling multicore scheduling in the US Facility. More resources are needed to test out the environment for AthenaMP. Take this as an opportunity to step up and provide this service, dynamically adjusting resources according to need. We'll need to work up recipes for various sites and collaborate on this. Farm groups at BNL are starting on a recipe for Condor (a submit-file sketch follows this list).
      • LFC consolidation, ConsolidatingLFCStudyGroupUS, report document
      • All hands meeting follow-on comments
      • FAX workshop in progress
      • Business about the Frontier setup misconfiguration - can it be monitored from the client? We need to investigate this (a client-side probe sketch follows this list).
      • We are back in data taking mode - the LHC is performing well - so our primary goal is stability and good performance. Make sure communication is prompt - tickets should be responded to within 12 hours.
    • this week
      • IntegrationPhase21, SiteCertificationP21
      • Capacity summary: updates
        • MWT2 - will ramp to 2.4 PB within April. Presently 1.8 PB usable.
        • AGLT2 - 2160 TB online. 120 TB additional coming at MSU. 2250 TB total.
        • NET2 - 2.1 PB online and usable.
        • SWT2 - just about done with the UPS work, which was more extensive than had been planned. Wrap-up in the next few days; final commissioning next week. Then installation of the new storage starts. Last week went ahead and added 200 TB: 1.35 PB + 350 TB = 1.7 PB usable now. Once the new storage is in, it will exceed the pledge.
        • WT2 - only 50 TB short of pledge (2150 TB usable is available now); no immediate plans to purchase storage. Building up an area for low-density storage. Will be ordering 10 TB of SSDs.
        • BNL - both CPU and disk at pledge level
      • Michael: we will be going into a new planning round based on the latest resource document (a new table). Keeping primary data at sites will lead to a different split between CPU and storage.
      • SupportingCMS - follow-up discussion with Dan after the meeting about CMS policy for group accounts. The preference is to use glexec and pool accounts, but Brian might have a workaround. Under discussion.
      • Michael: LFC consolidation was discussed at the ADC weekly - how it's going to go. There was a question as to the number of instances at BNL (3, 2 or 1)? The ultimate goal would be just 1, with gradual consolidation. We also need some development for dark-data cleanup.
      • Multicore queue configuration set up at BNL. Configurations will be posted at AthenaMPFacilityConfiguration; see also AthenaMPFacilityTests. (WIP)
      • From the TIM meeting: analysis performance with TTreeCache. There are performance gaps of up to 20%. Perhaps ask Sergei to reproduce his timings; see Wahid's presentation. Make this a visible facility activity: form a working group with an organized plan (a TTreeCache usage sketch follows this list).
      • Cloud resources at BNL - now being provided. Sergei, Val and Doug have been contacted by John Hover about an OpenStack environment based on EC2 interfaces. More resources are being added to this virtual environment - 200 virtual machines available (an EC2-client sketch follows this list).
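
The Condor recipe mentioned above was still in progress at BNL at the time of the meeting. As a minimal sketch of the job side, assuming a pool already configured with whole-node or partitionable slots, a multicore submission only needs a CPU request in the submit description; the executable name, memory request, and file names below are hypothetical placeholders, not the BNL recipe.

    #!/usr/bin/env python
    # Minimal sketch: submit an 8-core AthenaMP-style job to Condor.
    # Assumes the pool hands out multi-core slots; executable, memory
    # request, and log file names are hypothetical placeholders.
    import subprocess
    import tempfile

    SUBMIT_TEMPLATE = """\
    universe       = vanilla
    executable     = run_athenamp.sh
    request_cpus   = 8
    request_memory = 16000
    log    = athenamp.log
    output = athenamp.out
    error  = athenamp.err
    queue
    """

    def submit_multicore_job():
        # Write the submit description to a file and hand it to condor_submit.
        with tempfile.NamedTemporaryFile(mode="w", suffix=".sub", delete=False) as f:
            f.write(SUBMIT_TEMPLATE)
            path = f.name
        subprocess.check_call(["condor_submit", path])

    if __name__ == "__main__":
        submit_multicore_job()

On the pool side, Condor's partitionable slots let an 8-core slot be carved out of a whole node on demand, which matches the goal of adjusting resources dynamically; the actual recipe worked out by the BNL farm groups may differ.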
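
On the Frontier monitoring question above: Frontier requests travel over plain HTTP, so a first client-side check could simply time a request against the site's Frontier/squid endpoint and flag slow or failing responses. The URL and thresholds below are hypothetical placeholders; this is a sketch, not the agreed monitoring approach.

    #!/usr/bin/env python
    # Sketch of a client-side Frontier probe: time a plain HTTP request to a
    # Frontier/squid endpoint. The endpoint URL is a hypothetical placeholder.
    import time
    import urllib2

    FRONTIER_URL = "http://frontier.example.org:8000/atlr/Frontier"  # placeholder
    TIMEOUT_S = 10.0
    SLOW_S = 2.0

    def probe(url):
        start = time.time()
        try:
            resp = urllib2.urlopen(url, timeout=TIMEOUT_S)
            resp.read()
            elapsed = time.time() - start
            status = "SLOW" if elapsed > SLOW_S else "OK"
            print("%s %s %.2fs (HTTP %d)" % (status, url, elapsed, resp.getcode()))
        except Exception as exc:
            print("FAIL %s after %.2fs: %s" % (url, time.time() - start, exc))

    if __name__ == "__main__":
        probe(FRONTIER_URL)

A real check would issue an actual Frontier query and validate the payload, but even a timing probe like this would expose an unreachable or misconfigured proxy from the client side.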
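
For the TTreeCache item above, here is a minimal PyROOT sketch of what enabling the cache looks like in an analysis loop; the file URL and tree name are hypothetical placeholders, and the proposed working group would define the real measurement procedure.

    #!/usr/bin/env python
    # Minimal PyROOT sketch of enabling TTreeCache for an analysis read loop.
    # The file URL and tree name are hypothetical placeholders.
    import ROOT

    f = ROOT.TFile.Open("root://xrootd.example.org//atlas/user/sample.root")
    tree = f.Get("physics")

    # A 30 MB cache; the first 100 entries are used to learn which branches
    # the job reads, after which those branches are fetched in large
    # sequential reads instead of many small ones.
    tree.SetCacheSize(30 * 1024 * 1024)
    tree.SetCacheLearnEntries(100)

    for i in xrange(tree.GetEntries()):
        tree.GetEntry(i)
        # ... analysis on the event ...

    # Crude measure of cache effectiveness: total bytes read and read calls.
    print("read %d bytes in %d calls" % (f.GetBytesRead(), f.GetReadCalls()))

Comparing these counters with and without the cache enabled is one simple way to quantify the performance gaps discussed at the TIM.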
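
For the BNL cloud item above, a sketch of how a client might talk to an EC2-compatible OpenStack endpoint using the boto library; the endpoint host, port, path, credentials, and image id are hypothetical placeholders to be replaced by values from John Hover's team.

    #!/usr/bin/env python
    # Sketch of talking to an EC2-compatible (OpenStack) endpoint with boto.
    # Endpoint, credentials, and image id are hypothetical placeholders.
    import boto
    from boto.ec2.regioninfo import RegionInfo

    region = RegionInfo(name="bnl-cloud", endpoint="cloud.example.bnl.gov")
    conn = boto.connect_ec2(
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
        region=region,
        port=8773,
        path="/services/Cloud",
        is_secure=False,
    )

    # Launch a couple of worker-node VMs from a hypothetical image.
    reservation = conn.run_instances("ami-00000001", min_count=1, max_count=2,
                                     instance_type="m1.large")
    for inst in reservation.instances:
        print("started instance %s (%s)" % (inst.id, inst.state))

The port/path combination shown is the common Eucalyptus-style EC2 front end; an OpenStack deployment may expose a different one.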

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Completely full; mc12 is keeping us busy.
    • Alden has implemented MOU shares in schedconfig; the US is showing 20%, which is a little low.
  • this meeting:
    • Multicore queue now needed for certain tasks
    • TIM: DEFT is a dynamic job definition tool; Panda decides the length of the job. GlideinWMS and Panda were discussed extensively; pilots will start being sent this way. The first goal is to run a scaling test; we need transparency here. Maxim will provide a plan and a twiki to document this. Federated Xrootd will proceed step by step: the first use-case is transparent access to missing files, and eventually there will be a plan for using FAX for real data handling of all files - the idea is a federated testing service that provides a cost function. Kaushik and Doug are working on a plan.
    • There is currently a problem at HU.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=186845
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_16_2012.html
    
    1)  4/11: WISC - DDM errors ("failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]").  Problem reported to be fixed 
    (no details).  ggus 81162 closed, eLog 35113.
    2)  4/12: UTD-HEP - site requested to be unblacklisted in DDM.  However, file transfers started failing heavily, so had to set site off again.  
    See: https://savannah.cern.ch/support/?127808, eLog 35123/24/203.
    3)  4/12: Network issue at CERN created various problems for a period of several hours.  More details in eLog 35148/50.
    4)  4/12: ggus 81213 was opened for what appeared to be SRM errors at BNL.  Issue was actually the network link between TRIUMF and BNL.  
    See details in the ticket (now closed) & eLog 35182.
    5)  4/13 early a.m.: power outage at SLAC.  Power restored as of late afternoon 4/14.  eLog 35168.
    6)  4/13 early a.m.: AGLT2 - file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  Shawn reported 
    that the DB partition on the dCache headnode filled up.  Issue resolved - ggus 81228 closed, eLog 35171.
    7)  4/14: BNL - file transfer errors due to expired host certificate ("Credential with subject: /DC=org/DC=doegrids/OU=Services/CN=dcsrm.usatlas.bnl.gov 
    has expired").  Certificate quickly renewed - issue resolved.  ggus 81281 closed, eLog 35214.  (A certificate expiry-check sketch follows this summary.)
    8)  4/17: SLAC - user reported problems attempting to download files from the site when using certificates signed by APACGrid.  Wei reported 
    the certificate list was updated, and this fixed the problem (user transfers now succeeding).  ggus 81351 closed.
    
    Follow-ups from earlier reports:
    
    (i)  2/29: UTD-HEP set off-line to replace a failed disk.  Savannah site exclusion: https://savannah.cern.ch/support/index.php?126767.  As of 3/7 
    site reported the problem was fixed.  Test jobs are failing with the error "Put error: Traceback (most recent call last): File , line 10, in ? 
    File /usr/lib/python2.4/site-packages/XrdPosix.py, line 5, in ? import _XrdPosix ImportError: /usr/lib/python2.4/site-packages/_XrdPosixmodule.so: undefined symbol: XrdPosix_Truncate."
    Under investigation.  eLog 34259.
    Update 3/12: ggus 80175 was opened for the site due to test jobs failing with the error shown above.  Closed on 3/13 since this issue is being tracked 
    in the Savannah ticket.
    Update 4/17: Savannah 126767 closed, as latest site issues being tracked in https://savannah.cern.ch/support/?127808.
    (ii)  3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period).  ggus 79827 in-progress, eLog 34150.
    Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214.  eLog 34336.
    (iii)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which hosts 
    gridftp & SRM services being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.  
    Tickets cross-referenced.  System marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  DDM blacklist ticket:
    https://savannah.cern.ch/support/index.php?127055
    Update 4/5: Downtime extended until the end of April.
    (iv)  4/7: UTD-HEP - following being un-blacklisted in DDM (see (iv) below) site requested to be tested by HC/panda.  However, pilots were not able 
    to find/access the atlas s/w release areas.  RT 21898 opened.
    (v)  4/8: UTD-HEP - file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  ggus 81035, eLog 35046.
    (vi)  4/9: NERSC - file transfer errors to SCRATCHDISK ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").  
    ggus 81050 in-progress, eLog 81050. 
    (vii)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
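
    Given the expired-certificate incidents (item 7 above, and again at SWT2 in the following week's summary), a proactive client-side check is
    straightforward. Below is a sketch using the standard openssl CLI; the endpoint list is illustrative (the BNL SRM host from item 7). Although SRM
    endpoints speak httpg/GSI rather than plain SSL, s_client can usually still retrieve the server certificate during the handshake.

    #!/usr/bin/env python
    # Sketch of a proactive check for expiring service certificates.
    # Equivalent to: openssl s_client -connect host:port </dev/null |
    #                openssl x509 -noout -enddate
    import subprocess

    ENDPOINTS = [("dcsrm.usatlas.bnl.gov", 8443)]  # illustrative list

    def cert_end_date(host, port):
        # Fetch the server certificate with s_client; x509 then parses the
        # first PEM block in the output and prints its notAfter date.
        s_client = subprocess.Popen(
            ["openssl", "s_client", "-connect", "%s:%d" % (host, port)],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        pem, _ = s_client.communicate(input="")
        x509 = subprocess.Popen(
            ["openssl", "x509", "-noout", "-enddate"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, _ = x509.communicate(input=pem)
        return out.strip()  # e.g. "notAfter=Apr 25 12:00:00 2013 GMT"

    for host, port in ENDPOINTS:
        print("%s:%d %s" % (host, port, cert_end_date(host, port)))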
     

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=187835
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_23_2012.html
    
    1)  4/18: From Hiro - In case you have noticed that the BNL FTS has stopped working with gsiftp endpoints since the upgrade, I think I fixed it. Please let 
    me know if it is still not working.
    2)  4/18: MWT2 - job failures ("Unspecified error, consult log file") due to some test nodes picking up production jobs.  Nodes removed - problem solved.  
    eLog 35346.
    3)  4/18: SWT2_CPB - file transfer errors - the issue was an expired host certificate.  Updating the certificate solved the problem.  eLog 35347.
    4)  4/19: High number of panda analysis jobs in the 'holding' state.  Issue was slowness in the LFC+DQ2 registration step.  Issue seemed to clear up 
    after a few hours.  eLog 35399, http://savannah.cern.ch/bugs/?93869.
    5)  4/20: SWT2_CPB - file transfer errors ( "[CONNECTION_ERROR] failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]").  
    A dataserver was heavily loaded, which in turn impacted the SRM host.  The SRM host was rebooted, and to reduce the load on the overheated dataserver, 
    disk clean-ups were run in the background to free up some space on other hosts.  Also, an additional rack of storage is being added to help alleviate the space crunch.  
    ggus 81465 / RT 21941 closed, eLog 35397.
    6)  4/21: WISC_LOCALGROUPDISK file transfer failures with "source file doesn't exist" errors.  ggus 81474 in-progress, eLog  35501.
    7)  4/22: MWT2 - job failures with "Get error: lsm-get failed."  See details in ggus 81477 (in-progress) - eLog 35424.
    8)  4/22: MWT2 - ggus 81487 opened due to jobs failing with the "lost heartbeat" error.  Ticket in-progress, eLog 35433.
    9)  4/23: SWT2_CPB - User reported problems transferring files from the site using a certificate signed by the APACGrid CA.  (A similar problem occurred 
    last week at SLAC - see ggus 81351.)  Under investigation - details in ggus 81495 / RT 21947.
    10)  4/23: Attempts to create a proxy when accessing the BNL voms server (vo.racf.bnl.gov) were hanging.  From John Hover: Service hung at 4:00AM. 
    A service monitoring tool detected the problem and attempted a restart. The restart failed because of a full log partition.  Service is now restored, and we're 
    looking into why the partition status didn't generate any internal alerts.  ggus 81505 closed, eLog 35452.  (A proxy-creation watchdog sketch follows this summary.)
    11)  4/24: SMU_LOCALGROUPDISK file transfer errors ("source file doesn't exist").  Update from Justin: These files have been deleted and an LFC update 
    has been requested.  ggus 81526 in-progress, eLog 35463.
    12)  4/24: John at NET2 reported that the HU_ATLAS site was draining for lack of production jobs.  Pilots are unable to download files from the panda servers, 
    and immediately exit with the message "curl: (52) Empty reply from server /usr/bin/python: can't open file 'atlasProdPilot.py': [Errno 2] No such file or directory."  
    Problem under investigation - see details in e-mail thread.  eLog 35477.
    
    Follow-ups from earlier reports:
    
    (i)  3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period).  ggus 79827 in-progress, eLog 34150.
    Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214.  eLog 34336.
    (ii)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which hosts gridftp & 
    SRM services being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.  Tickets 
    cross-referenced.  System marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  DDM blacklist ticket:
    https://savannah.cern.ch/support/index.php?127055
    Update 4/5: Downtime extended until the end of April.
    (iii)  4/7: UTD-HEP - following being un-blacklisted in DDM (see (iv) below) site requested to be tested by HC/panda.  However, pilots were not able to find/access 
    the atlas s/w release areas.  RT 21898 opened.
    Update 4/24: RT ticket marked as 'solved' (no explanation).
    (iv)  4/8: UTD-HEP - file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  ggus 81035, eLog 35046.
    (v)  4/9: NERSC - file transfer errors to SCRATCHDISK ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").  
    ggus 81050 in-progress, eLog 81050. 
    (vi)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    (vii)  4/12: UTD-HEP - site requested to be unblacklisted in DDM.  However, file transfers started failing heavily, so had to set site off again.  
    See: https://savannah.cern.ch/support/?127808, eLog 35123/24/203.
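
    Regarding item 10 above (proxy creation hanging against the BNL VOMS server), a shift-side watchdog can turn a silent hang into an alert by
    running voms-proxy-init under a timeout. A minimal sketch follows; the timeout value is an arbitrary choice, and the VO name is from the ticket.

    #!/usr/bin/env python
    # Watchdog around proxy creation: run voms-proxy-init under an alarm so
    # a hung VOMS server is reported instead of blocking indefinitely.
    import signal
    import subprocess

    TIMEOUT_S = 60  # arbitrary choice

    class Timeout(Exception):
        pass

    def _alarm(signum, frame):
        raise Timeout()

    signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(TIMEOUT_S)
    try:
        rc = subprocess.call(["voms-proxy-init", "-voms", "atlas"])
        signal.alarm(0)
        print("voms-proxy-init exited with status %d" % rc)
    except Timeout:
        # Note: this sketch leaves the hung child process running; a real
        # probe would also kill it and raise an alarm to the shifters.
        print("voms-proxy-init hung for more than %d s - VOMS server problem?"
              % TIMEOUT_S)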
     

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei, Ilija)

last week(s):

this week:

Site news and issues (all sites)

  • T1:
    • last meeting(s): Intervention planned for next Tuesday - a border router change that would affect production in the US, primarily affecting OPN connectivity. Moving to the Nexus generation.
    • this meeting: A few issues with expired certificates - hopefully resolved by now. Looking forward to multicore jobs to see how the Condor job configuration is working; 8 threads chosen in the multicore configuration. Using a separate Panda queue.

  • AGLT2:
    • last meeting(s): Working on local network configuration - need to get switches talking correctly; may need a downtime and may need to re-stack switches. All storage is online and in production: 2160 TB. Equipment coming for MSU.
    • this meeting: DNS issue, repaired. Working on improving networking at UM with a Dell Force10 S4810 1U switch.

  • NET2:
    • last meeting(s): Storage is online, and being used - under some load. 2.1 PB now.
    • this meeting: Networking from HU to CERN is not working.

  • MWT2:
    • last meeting(s): Illinois CC - hardware has arrived, 16 nodes, being racked. 4 loaner nodes in the CC are integrated with MWT2, running analysis and production. Perfsonar nodes available. Reconfiguration of the CC to accommodate us; one more piece required for the networking, for higher capacity and for upgrading to 100 Gbps in the future.
    • this meeting: Continuing to study networking issues; gatekeeper running a modified OSG 3.0.10. UIUC campus cluster nodes are offline while GPFS system issues are addressed; expect to bring the nodes back online later today. Will be working on a dCache pool node solution.

  • SWT2 (UTA):
    • last meeting(s): A couple of incidents - an MD1000 rebuild was needed; checksums were run and 3 files might have been affected (a checksum-verification sketch follows below). The GUMS host certificate expired. Everything else is running smoothly. The UPS upgrade is progressing.
    • this meeting: A user with an APACGrid CA certificate is having trouble downloading files.
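
On the checksum verification mentioned for the MD1000 rebuild: ATLAS DDM catalogues adler32 checksums, so re-verification amounts to recomputing them in chunks and comparing. A minimal sketch, with a hypothetical file-to-checksum mapping standing in for the real catalogue dump:

    #!/usr/bin/env python
    # Sketch of re-verifying files after an array rebuild by recomputing
    # adler32 checksums. The file-to-checksum mapping is a placeholder.
    import zlib

    EXPECTED = {"/storage/atlasdatadisk/some/file.root": "7d5a2f31"}  # placeholder

    def adler32_of(path, chunk=1024 * 1024):
        value = 1  # adler32 seed
        f = open(path, "rb")
        try:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                value = zlib.adler32(block, value)
        finally:
            f.close()
        return "%08x" % (value & 0xffffffff)  # unsigned, zero-padded hex

    for path, expected in EXPECTED.items():
        got = adler32_of(path)
        print("%s %s (expected %s)" % ("OK " if got == expected else "BAD",
                                       path, expected))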

  • SWT2 (OU):
    • last meeting(s):
    • this meeting:

  • WT2:
    • last meeting(s):
    • this meeting: There's a ticket open for an increase in the number of failed jobs.

Carryover issues (any updates?)

rpm-based OSG 3.0 CE install

last meeting(s)
  • In production at BNL.
  • Horst notes two issues: an RSV bug, and Condor not being installed in a standard location.
  • NET2 (Saul): has a new gatekeeper - will bring it up with the new OSG.
  • AGLT2: March 7 is a possibility - will be doing the upgrade then.
  • MWT2: done.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated by the installation of the new UPS - expecting delivery, which will require a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.

this meeting

  • Any updates?
  • There is a new release, 3.1.0; Horst will look for problems on his ITB site.
  • AGLT2
  • MWT2 - 3.0.10 in production
  • SWT2
  • NET2
  • WT2

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

this week

AOB

last week

this week


-- RobertGardner - 25 Apr 2012

  • ddm-eff.png (attached image)
