MinutesOct26

Introduction

Minutes of the Facilities Integration Program meeting, Oct 26, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Bob, Patrick, Michael, Alden, Torre, Dave, Wei, Hari, Armen, Saul, Tom, Hiro, Xin, Kaushik, Mark
  • Apologies: Sarah, Fred,

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Friday (1pm CDT, bi-weekly - convened by Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
    • this week
      • CapacitySummary - no changes received. Here are the current capacities:
        (capacity summary table attached as screenshot_23.png)
      • Local CPU pool for scheduling jobs beyond pledged resources; Alden is looking into this.
      • Networking issues:
        • Slow transfers involving our Tier 2s and Tier 1s (transatlantic and Canada). We need to get into the habit of asking network experts for help, as well as relying on our own tools; ESnet can help.
        • We had problems getting CERN's attention for calibration data transfers to AGLT2.
        • perfSONAR was key to resolving this issue.
        • The Tier 1 service coordination meeting is a forum to raise such issues.
      • LHCONE
        • Nowhere near production quality
        • Testbed at the moment
        • See last week's talk
        • Agreement with ADC operations - step back. Ask a small number of sites to get onto the infrastructure in a well-managed way; present this list at an upcoming ADC meeting.
        • Will not accept an approach in which sites attach by themselves
      • News on participation in technical working groups
        • WLCG storage management technical working group -- 2015
        • Call for volunteers for participation.
      • Cloud
        • Cui Lin's discussion at SMU meeting - continued last week at SW week (John's presentation) - well received and aligned with ATLAS plans. More concrete activities in the future.
      • Federated Xrootd
        • Wei gave an excellent presentation on where we are
        • Dan van der Ster tried it out, found shortcomings, and made good progress
        • The potential is still great, for both Tier 3s and Tier 2s
        • We should continue to pursue this

Focus topic: local-site-mover stats and dCache locality (Tom Rockwell)

  • Defer to next meeting.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • RAC - we have a new request for e-gamma regional production. These are being more or less auto-approved, since we have capacity above pledge. Kaushik sees no need to change the system. Michael would like to come back to verifying that resources above pledge are being prioritized for US physicists; Alden is investigating the algorithm and the data from the Panda DB cloud.
    • The cause of the drop in production over the weekend is unclear; there was no mention of draining on any list or report.
    • A mystery, since Borut wanted more resources last week - though he wanted 30% set aside for the MC backlog, so perhaps not surprising.
    • No big production campaigns being discussed, mainly getting MC done - should be smooth.
    • Tier2D issue: ~2500 failed jobs last weekend due to a backlog in the FR cloud. There is also a preliminary discussion about going to a completely cloud-less model, opening things up so data can go from anywhere to anywhere. There will be a special meeting on Monday. Are we operating with the right kind of channels? Regarding liaison with other FTS admins, the first point of contact should be the cloud-support list. At the least, this needs to be brought up at software week.
  • this week:
    • Spoke with Borut today - making sure all clouds are full; there will be enough G4 to keep us going for many months. We should never see drains; if we do, raise an alarm - it's a technical issue.
    • Regional production may be coming to RAC soon.
    • All is well otherwise.
    • Doug: what about output of group-level activities into the US cloud? Not sure; there is a CREM meeting tomorrow. Doug is urged to make subscriptions. PD2P only looks at DATADISK - Kaushik suggests this as an alternative.

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • c.f. MinutesDataManageOct4
    • Overall storage status is good.
    • Asking for proddisk cleanup at SLAC and MWT2
    • BNL and NET2 are continuing USERDISK cleanup. Struggling with parallel DATADISK cleanups.
    • Central deletion rates of 4-5 Hz at Tier 2s, higher at the Tier 1 (see the rate sketch at the end of this section).
    • Hiro has prepared the next round of deletions
    • LFC ACL issues - no problems currently, but issue is not totally understood.
    • Tier 3 storage and deletion issues for Wisconsin - have sent email to them to take action.
  • this week:
    • Sites should clean up PRODDISK
    • NET2 issue - Saul to discuss.
    • USERDISK cleanup finished everywhere except SW and BNL
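    For scale, here is a minimal Python sketch of what the 4-5 Hz deletion rate quoted above works out to in files per day; the rates are the ones from the meeting, everything else is illustrative arithmetic.

        # Rough scale of the central deletion rates quoted above (4-5 Hz at Tier 2s).
        # Purely illustrative arithmetic.
        SECONDS_PER_DAY = 24 * 60 * 60

        def files_per_day(rate_hz):
            """Convert a steady deletion rate in Hz to files removed per day."""
            return int(rate_hz * SECONDS_PER_DAY)

        for rate in (4.0, 5.0):
            print("%.0f Hz -> %d files/day" % (rate, files_per_day(rate)))
        # 4 Hz -> 345600 files/day; 5 Hz -> 432000 files/day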

Shift Operations (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (presented this week by Kai Leffhalm):
    
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-10_17_2011.html
    
    1)  10/12: From Hiro - Several replica entries in US LFCs were done with the short form, which caused jobs to fail at some 
    sites (in the worst case, a site was set offline due to failures of HammerCloud jobs). Transfers of such files have been 
    stopped and should not happen any more.
    2)  10/14: AGLT2 file transfer errors ("[NO_PROGRESS] No markers indicating progress received for more than 180 seconds").  
    Probably not an issue on the AGLT2 side, but rather slowness on the remote end (in this case TRIUMF).  There was a parallel 
    issue causing jobs failures (NFS server hung up), which has since been resolved.  ggus 75302 closed, eLog 30415.
    Update 10/15: ggus 75348 was opened for the site, again initially related to job failures due to slow output file transfer timeouts.  
    What was probably a real site issue (jobs failing with "ERROR 201 Copy command exited with status 256 | Log put error: lsm-put 
    failed (201)") began appearing on 10/17.  Most likely related to a dCache service restart.  Ticket in-progress, eLog 30443, 
    https://savannah.cern.ch/support/?124073 (Savannah site exclusion).
    3)  10/15: SLAC - high failure rate for production jobs with "lost heartbeat" errors.  Issue was a power outage at the site which 
    took many machines off-line.  Problem resolved, ggus 75349 closed, eLog 30543.
    4)  10/16:  SMU_LOCALGROUPDISK file transfer errors were reported in ggus 75362.  However, details of the error indicated 
    this was actually due to an SRM timeout error at SWT2_CPB.  Most likely a transient problem, since there haven't been any 
    recent errors of this type.  ggus 75362 closed, eLog 30544.
    5)  10/18: NET2 - file transfer errors ("file has trouble with canonical path. cannot access it").  Saul reported that the issue was 
    possibly a transient GPFS problem causing several nodes to briefly lose their mounts.  ggus 75430 in-progress, eLog 30502.
    6)  10/18-10/19: ANALY queues at BNL were set off-line by the HammerCloud tests.  Issue was traced to the OSG client which 
    had been recently upgraded.  Rolled back to previous version for now.   
    
    Follow-ups from earlier reports:
    
    (i)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    (ii)  10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF.  
    This is the same type of issue that has been observed at several US tier-2d's when attempting to copy job output files to other 
    clouds.  Working to understand this problem and decide how best to handle these situations.  Discussion in eLog 30170.
    (iii)  10/10: SLAC - file transfer error (" failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  
    Blacklisted in DDM during this period.  ggus 75185, eLog 30182,  https://savannah.cern.ch/support/index.php?124031 (Savannah 
    site exclusion).
    Update 10/13: Issue resolved - ggus 75185 closed, eLog 30330.  (ggus 75276 was also opened around this time - 
    closed as a duplicate.)
     
    • Low level of problems in the past two weeks for US sites.
    • Three carryover issues above.
    • Certification of Bellarmine
    • Doug believes ICB approval is needed.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=156083
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-10_24_2011.html
    
    1)  10/19: BNL - job failures due to a conditions data access problem - errors like "Request to [>PnfsManager@local] timed out. 
    (errno 1).  Failed open file in the dCache."  From Jane: the transfer failures were caused by PnfsManager timing out. 
    PNFS load was quite high during 9am-2pm due to intensive client requests, which caused a fair number of transfer failures. The 
    load on PNFS is fine now.  ggus 75499 closed, eLog 30539.
    2)  10/19-10/20: File transfer errors at NET2.  Issue was due to checksum problem after some new storage was brought on-line.  
    See details in https://ggus.eu/ws/ticket_info.php?ticket=75430 (in-progress), https://savannah.cern.ch/bugs/?87984, eLog 30586.
    3)  10/20: SLAC - two issues: (i) file transfers errors ("failed to contact on remote SRM 
    [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]") - issue with a storage server resolved, ggus 75507 closed, eLog 30611; 
    (ii) file transfer timeouts from various remote sites => SLAC (e.g., Timelimit of 604800 seconds exceeded in 
    RAL-LCG2_DATADISK->SLACXRD_DATADISK queue).  Known issue with network slowness / corresponding dq2 timeouts.  
    Tweaks applied to FTS settings.  ggus 75552 closed, eLog 30606.
    4)  10/20: UTD-HEP - site had been set back on-line a few days earlier after issues with the lsm and cluster configuration had 
    been resolved.  Power to the data center was interrupted (breakers tripped), resulting in DDM & job errors, and the site was set 
    off-line by shifters.  Services restored as of 10/23 - test jobs successful, back on-line.  ggus 75536 closed, 
    https://savannah.cern.ch/support/index.php?124178 (Savannah site exclusion), eLog 30616/19/710.
    5)  10/21-10/22: AGLT2 - maintenance outage - test jobs successful, queues back on-line as of Saturday evening.
    6)  10/22: OUHEP_OSG - file transfer errors ("failed to contact on remote SRM [httpg://ouhep2.nhn.ou.edu:8443/srm/v2/server]").  
    Horst reported there is a hardware problem with the SRM host which will require some time to repair.  Site set off-line in DDM 
    (https://savannah.cern.ch/support/?124250), ggus 75608 / RT 21007 in-progress, eLog 30744.
    7)  10/23: WISC_LOCALGROUPDISK file transfer errors ("[TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] 
    globus_ftp_client: the server responded with an error500 500-Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_
    recv:914:500-open() fail500").  Wen fixed the problem - ggus 75613 closed, eLog 30743, https://savannah.cern.ch/support/index.php?124228 
    (site was blacklisted during this time).
    8)  10/24: UPENN_LOCALGROUPDISK - file transfer failures ("500-globus_xio_file_driver.c:globus_l_xio_file_open:381:500-System 
    error in open: Operation canceled500-globus_xio: A system call failed: Operation canceled500").  Problem fixed by the site admin 
    early next day.  ggus 75659 closed, eLog 30767.
    9)  10/25: MWT2_UC - job failures with the error "Transformation not installed in CE" (16.6.7.13).  Alessandro reported that the 
    release installation system indicates the software has been installed.  Under investigation - ggus 75692, eLog 30762.
    
    Follow-ups from earlier reports:
    
    (i)  Previous week: ongoing testing at the new US tier-3 site BELLARMINE-ATLAS-T3.  Test jobs have run successfully at the site.
    (ii)  10/9: Shifter submitted http://savannah.cern.ch/bugs/?87589 regarding job output transfer timeouts from MWT2_UC => TRIUMF.  
    This is the same type of issue that has been observed at several US tier-2d's when attempting to copy job output files to other clouds.  
    Working to understand this problem and decide how best to handle these situations.  Discussion in eLog 30170.
    (iii)  10/14: AGLT2 file transfer errors ("[NO_PROGRESS] No markers indicating progress received for more than 180 seconds").  
    Probably not an issue on the AGLT2 side, but rather slowness on the remote end (in this case TRIUMF).  There was a parallel issue 
    causing jobs failures (NFS server hung up), which has since been resolved.  ggus 75302 closed, eLog 30415.
    Update 10/15: ggus 75348 was opened for the site, again initially related to job failures due to slow output file transfer timeouts.  
    What was probably a real site issue (jobs failing with "ERROR 201 Copy command exited with status 256 | Log put error: lsm-put failed 
    (201)") began appearing on 10/17.  Most likely related to a dCache service restart.  Ticket in-progress, eLog 30443, 
    https://savannah.cern.ch/support/?124073 (Savannah site exclusion).
    (iv)  10/18: NET2 - file transfer errors ("file has trouble with canonical path. cannot access it").  Saul reported that the issue was 
    possibly a transient GPFS problem causing several nodes to briefly lose their mounts.  ggus 75430 in-progress, eLog 30502.
    
    • UTD-HEP finally got back online (thanks to Marco for helping with the local-site-mover).
    • Notes that the SSB daily summaries have been useful. Not sure whether "no activity" indicates a problem.
    • Hiro: BNL has had problems with being excluded over the past week.
      • Alessandro's script runs twice a day; it is 6000 lines of bash and Python.
      • It sometimes temporarily removes a local setup script, resulting in a mix-up of the 32-bit and 64-bit libraries that are available.
      • Has the job profile changed recently, requiring more conditions data to be read?

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Long meeting yesterday - next meeting will be in two weeks, another off-week meeting, to fit between other meetings.
    • perfSONAR node configuration will be available in the Dell matrix.
    • Transatlantic - the AGLT2 issue was resolved, using perfSONAR.
    • A new release of perfSONAR is coming a week from Friday; updates will be via yum.
    • New kind of test: scheduled traceroute. We would like every Tier 2 to schedule this; instructions will follow (an illustrative sketch appears at the end of this section).
    • Dashboard discussions
    • LHCONE likely to go slower than we anticipated. Will stay engaged as things evolve, providing our feedback.
    • Mesh testing decisions need to be made for cross-cloud testing. Goal is to have each Tier 1 tested by at least one Tier 2 in the US
    • New ESnet circuits to UC and SLAC; we would like to schedule load tests before and after, to all 5 Tier 2s. Hiro will set this up: Friday morning, 10 am Eastern, a half-hour test. We have not done this in a while.
  • this week:
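    Referenced from the scheduled-traceroute item under "last week" above: a minimal, illustrative Python sketch of periodically running traceroute toward peer sites and logging the output. The host names are placeholders, and real tests would be scheduled through perfSONAR rather than with a standalone script like this.

        # Illustrative only: periodic traceroutes toward (hypothetical) peer sites.
        import datetime
        import subprocess
        import time

        PEERS = ["t1-peer.example.org", "t2-peer.example.edu"]  # placeholder hosts
        INTERVAL = 4 * 3600                                     # run every 4 hours

        def run_traceroute(host):
            try:
                result = subprocess.run(["traceroute", "-n", host],
                                        capture_output=True, text=True, timeout=300)
                return result.stdout
            except subprocess.TimeoutExpired:
                return "traceroute to %s timed out\n" % host

        while True:
            stamp = datetime.datetime.utcnow().isoformat()
            with open("traceroute.log", "a") as log:
                for host in PEERS:
                    log.write("=== %s %s ===\n" % (stamp, host))
                    log.write(run_traceroute(host))
            time.sleep(INTERVAL)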

Federated Xrootd deployment in the US

last week(s) this week:
  • Global redirector issue following Dan's observation - the proxy server keeps crashing; N2N is crashing.

Tier 3 GS

last meeting:
  • UTD - needs to update Joe's DN
  • Mark will provide a list of important operations lists
  • AK wants to know what the next steps are.

this meeting:

  • Hari: UTD back online, all is well.

Site news and issues (all sites)

  • T1:
    • last week(s):
      • Hurricane Irene response last week, exercised emergency shutdown procedures. Everything powered off. Restart went smoothly.
      • Lost almost nothing on the ATLAS side - on the RHIC side, 40 worker nodes had issues
      • 12K disk drives
      • Increasing bandwidth to the Tier 2s from BNL - a new 10G circuit is available. UC is to be migrated to the new 10G; the other circuit is waiting for a new I2 switch to come online at MANLAN, ~2 weeks.
      • LHCONE proceeding in Europe. Hope to see a couple of sites in US participating.
    • this week:
      • Have a PB of disk on order (Nexsan, 3 TB drives); need to have this in place for the next reprocessing round. Will evaluate Hitachi.

  • AGLT2:
    • last week(s):
      • Connection to CERN fixed.
      • Preparing for generator test on Oct 22.
      • Will update OSG, OSG-wn, and wlcg-client
      • Have about 390 TB out, but not allocated.
    • this week:
      • Went down on Thursday for UPS testing of the backup generator. OSG 1.2.23: CE, wn-client, and wlcg-client all upgraded, plus Xin's new ATLAS-wn.
      • 4 am incident today - transfers failing; seemed to be related to the site-aware configuration, needed to run the sweeper to free up space. Caused SRM watch to go red. Consulting dCache.
      • Discussing transfer rates to TRIUMF with Ian Gable.
      • Tom: a dCache pool node failed and caused a network storm. Looking at rate limits on switches. Seeing lots of packets - 10M packets/sec; the crashed node had the highest input rate. 24-port 10G switch-blade network.

  • NET2:
    • last week(s):
      • Bringing up the newest storage rack - powered up, being tested, online within two weeks, 430 TB usable.
      • Local networking re-arrangements to convert the 6509 to a pure 10G network. Getting a Dell switch to move the local network off of it. Plugged directly into the NOX, so it must be a Cisco router.
      • Going to 2x10G to the wide area.
      • Running 500 analysis jobs at HU routinely; 650 is the limit.
    • this week:
      • Major problem on Oct 20 with new storage and the GPFS configuration; the next 24 hours resulted in corrupted files. Cannot reproduce the problem. Not sure if it's a GPFS problem; may upgrade. Following Hiro's recipe for declaring bad files (see the checksum-verification sketch at the end of this section).
      • Working on conversion to CVMFS.
      • New gatekeeper and some new server machines
      • Doug: will you add more to the groupdisk area? Guideline is to provide 100 TB (per group, within the groupdisk area)

  • MWT2:
    • last week:
      • Purchase request in for 720 TB of storage at UC; new headnodes at IU.
      • Dave: Illinois campus computing cluster pilot, Phase 1 - Condor flocking demonstrated between clusters.
      • Sarah: redoing tests with direct-access dcap vs xrootd, looking at the impact of NAT with dcap. Seeing less of the problem, but it still persists. Wants to look at a newer version of libdcap.
    • this week:
      • Took delivery of first part of 720 TB storage upgrade
      • Sarah and Fred busy working on rack re-arrangement at IU
      • https://integrationcloud.campfirenow.com/room/192194
      • Illinois - Dave is updating Phase 1 of the pilot; tying the Illinois site to the campus cluster has been successful - running Panda jobs. This will be documented for eventual incorporation into the MWT2 queues. Working on specifications for worker nodes for the second instance of the campus cluster. Otherwise things are fine at Illinois.

  • SWT2 (UTA):
    • last week:
      • XrootdFS fell over and was turned back on; otherwise things are fine.
    • this week:
      • HEPSPEC survey - will update table
      • Cleaned up PRODDISK; discovering that roughly a TB per day is being cycled through writes into storage.
      • Looking at bestman/xrootd install from VDT repos; had a problem with ATLAS certificates; there is a missing use-case in the install instructions.
      • Deletion service configuration changed; saw a load issue when the rate was high. Not sure of cause, since local proddisk-cleanse does not cause this kind of spike.

  • SWT2 (OU):
    • last week:
      • Had a storage reboot yesterday. Load issue on Lustre servers - 9900 issues after reboot. Load gone now though.
      • Asked IBM about adding 3 TB drives - can't do this without voiding warranty.
    • this week:

  • WT2:
    • last week(s):
      • Power outage last weekend. Power company outage.
      • Continuing to install the 68 R410s.
    • this week:
      • Working on installation of new R410
      • New storage online, 550 TB. Some continuing problems with the storage server from Dell.
      • Reconfiguring disk for performance; will lose 250 TB as a result. The net gain is ~300 TB of storage (will be up to 2.1-2.2 PB).
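  Related to the NET2 corrupted-file issue noted above: a minimal Python sketch of re-computing a file's adler32 checksum and comparing it to the value recorded in the catalog before declaring a replica bad. The file path and catalog value are hypothetical, and Hiro's actual recipe for declaring bad files is a separate procedure, not this script.

      # Illustrative only: recompute a file's adler32 and compare to the catalog value.
      import zlib

      def adler32_of(path, chunk=1024 * 1024):
          """Stream the file and return its adler32 as an 8-digit hex string."""
          value = 1  # adler32 seed
          with open(path, "rb") as f:
              for block in iter(lambda: f.read(chunk), b""):
                  value = zlib.adler32(block, value)
          return "%08x" % (value & 0xffffffff)

      catalog_checksum = "0a1b2c3d"                       # hypothetical catalog value
      local = adler32_of("/gpfs/atlas/some/file.root")    # hypothetical file path
      print("OK" if local == catalog_checksum else "MISMATCH - candidate bad file")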

Carryover issues (any updates?)

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and GLOW. Having problems with Engage, since they want GridFTP from the worker nodes. Possible problems with the GLOW VO cert.
this week

CVMFS deployment discussion

See TestingCVMFS

previously:

  • For the Tier 1, based on developments in ADC and ATLAS, a CVMFS-based Panda instance has been set up. This could be used for trigger reprocessing - there is a resource shortage at CERN, so this makes additional capacity available at the Tier 1. Xin and Chris Hollowell (sp) have this Puppetized. Releases have been tagged by Alessandro, so another site has been validated. Xin notes some tweaks were needed to publish to the BDII, and this used a separate gatekeeper. Pilots will come from AutoPyFactory, the new version.
  • Status at Illinois: running for quite a while - no problems with production or analysis. Have run in both modes (HOTDISK & CVMFS) with conditions data.
  • AGLT2: there are some issues with releases being re-written in the grid3-locations files. Some interference with Xin's install jobs? Bob believes you cannot do this live - can't switch from NFS-mounted releases to CVMFS releases.
  • MWT2 queue: running CVMFS and the new wn-client with python 2.6; test jobs ran fine, running successfully; still working on ANALY queue.
  • There has been significant progress in terms of support from CERN IT.
  • Sites in the next weeks: SWT2_CPB cluster
this week:
  • Wei: instructions seem to ask us to use the squid at BNL. Hardware recommendation? Same as what you're already using.
  • Load has not been observed to be high at AGLT2. Squid is single-threaded, so multi-core is not an issue; you want a good amount of memory, so as to avoid hitting local disk.
  • At AGLT2 - recommend multiple squids, with compute nodes configured not to hit a remote proxy. Doug claims clients will still fail over to a stratum 1 regardless (a minimal configuration sketch follows below).
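  A minimal Python sketch of the proxy setup being discussed: two hypothetical local squids in one load-balanced group, with a DIRECT fall-back, which is what lets a client reach a stratum 1 if the local proxies are unavailable. Host names are assumptions, not a recommendation.

      # Illustrative only: the CVMFS proxy string being discussed, with hypothetical
      # squid host names.  Proxies inside one load-balanced group are separated by "|";
      # ";DIRECT" adds a fall-back group with no proxy, which lets a client reach a
      # stratum 1 directly if the local squids are unavailable.
      local_squids = ["http://squid1.example.edu:3128", "http://squid2.example.edu:3128"]
      proxy = "|".join(local_squids) + ";DIRECT"

      # This is the line that would go into the CVMFS client configuration
      # (e.g. /etc/cvmfs/default.local); adjust per local site policy.
      print('CVMFS_HTTP_PROXY="%s"' % proxy)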

Python + LFC bindings, clients

last week(s):
  • NET2 - planning
  • AGLT2 - 3rd week of October
  • Now would be a good time to upgrade; future OSG releases will be rpm-based with new configuration methods.
this week:

AOB

last week
  • None.
this week
  • Doug: AutoPyFactory and the Tier 3 Panda work will require schedconfig work.


-- RobertGardner - 25 Oct 2011
