
MinutesJuly6

Introduction

Minutes of the Facilities Integration Program meeting, July 6, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); please announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Dave, Michael, Shawn, John D, Rob, Torre, Patrick, Sarah, Bob, Jason, Tom, Saul, Charles, Alden, Booker, Fred, Horst, Mark, Wensheng, Doug
  • Apologies:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
    • this week
      • SiteCertificationP17 - quarterly report deadline approaching
      • Need to discuss the issue of LFC bindings and the workernode-client / wlcg-client updates.
      • Michael: there is a WLCG activity in progress on a new FTS version, FTS 3. It will not rely on statically configured channels; instead it will determine transfer paths dynamically, using historical information about performance. This was discussed at a Tier 1 services coordination meeting. We would like to offer a testbed for the WLCG team, e.g. the ITB. Shawn: have they incorporated network awareness, perfSONAR data, LHCONE? Are the hooks in place? Michael: yes, this would be the central component. (An illustrative sketch of history-driven path selection follows this list.)
      • Discussion of the workshop on federated data storage for the LHC using xrootd: there is a workshop September 13-15 in Lyon covering federated access, monitoring, security and support. Andy is sending out a letter of invitation shortly.
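      • Illustrative only: the sketch below shows one way dynamic, history-driven source selection could look, using an exponentially weighted moving average of observed throughput. This is not FTS 3 code; the class, site names and rates are invented for the example.
        from collections import defaultdict

        # Purely illustrative sketch of history-driven source selection
        # (not FTS 3 code): keep an exponentially weighted moving average
        # (EWMA) of observed throughput per source and pick the best one.
        class ThroughputHistory:
            def __init__(self, alpha=0.3):
                self.alpha = alpha              # weight of the newest sample
                self.ewma = defaultdict(float)  # MB/s per source site

            def record(self, site, mb_per_s):
                """Fold a newly observed transfer rate into the history."""
                prev = self.ewma[site]
                self.ewma[site] = mb_per_s if prev == 0 else (
                    self.alpha * mb_per_s + (1 - self.alpha) * prev)

            def best_source(self, candidates):
                """Prefer the best historical rate; untried sites win by default."""
                return max(candidates, key=lambda s: self.ewma.get(s, float('inf')))

        # Example usage with invented numbers:
        hist = ThroughputHistory()
        hist.record("BNL-OSG2", 450.0)
        hist.record("MWT2", 300.0)
        hist.record("AGLT2", 120.0)
        print(hist.best_source(["BNL-OSG2", "MWT2", "AGLT2"]))  # -> BNL-OSG2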

Federated Xrootd deployment in the US

last week(s)
  • Update on progress towards milestones.
  • 3.0.4 rpm's available
  • Andy is working on integrating X509 code
  • CGW working on name translation module
  • Communication with Tier 3 sites - are they preparing and getting ready for deployment? Not sure.
  • More detail next week, please.
  • At BNL there is significant progress using the federated namespace and FRM: got it working; next step is to look at performance. Hiro: dq2-get now has a plugin that can work with either federated or native xrootd (an illustrative xrdcp sketch follows this section's notes). Heard that the newer xrootd door in dCache is quite good.
this week:
  • No report this week.
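  • To make the federated-access point under "last week(s)" concrete, here is a minimal Python sketch that copies one file through an xrootd redirector with xrdcp. The redirector hostname and file path are hypothetical placeholders, and this is not the dq2-get plugin itself.
    import subprocess

    # Illustrative only: fetch one file through a federated xrootd redirector.
    # The redirector host and path below are hypothetical placeholders, not
    # the actual US ATLAS federation endpoints.
    REDIRECTOR = "root://xrootd-redirector.example.org:1094"

    def xrdcp_fetch(remote_path, local_path):
        """Copy a single file via xrdcp; raises CalledProcessError on failure."""
        url = REDIRECTOR + "/" + remote_path   # yields root://host:1094//atlas/...
        subprocess.check_call(["xrdcp", url, local_path])

    # Example usage with a made-up path:
    xrdcp_fetch("/atlas/dq2/user/some.dataset/NTUP.example.root",
                "NTUP.example.root")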

OSG Opportunistic Access (Rob)

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • Will need to track down "unknown" category
  • No HCC issues
  • Engage - would like to start - do we have sites ready to enable Engage?
this week
  • facility_hours_bar_smry.png (attached)
  • Booker reports progress with SLAC security - no roadblocks - should get HCC jobs online soon.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • All is well, 0% failure rate
    • UK was out of jobs for some reason
    • RAC meeting had okay'd SUSY production, so we're full.
    • Saw up-tick in analysis activity - perhaps as result of ATLAS week talks?
    • The pattern in user analysis seems to be: skim data on the grid, then download the resulting n-tuples to laptops
    • Will analysis at T2s go up with more replication of n-tuples? Note: at ~100 GB it is almost not worth doing it on the grid.
    • Group analysis now being done as group production
    • Effects of the large amounts of data we have are still to be seen - going back to ...
    • PD2P - discussion
      • Not getting any data at T2s; why? Waiting times are going down and there are no user complaints, so what's the problem?
      • Note - RAW and ESD are not allowed; only AOD and highly skimmed ntuple
      • No evidence of physics-backlog, only a 'technical' backlog
      • 2 copies on first use; one based on brokerage, the second based on MOU share (a toy placement sketch follows this section's notes)
      • Increase pre-placed ntuples, "the old way"
      • Use closeness property of dq2
      • Grouping of Tier2's by performance, size and storage metrics
      • Will double amount of data to sites, at minimum
      • Q (Wensheng): what about datasets created by users? PD2P will not touch them, unless they are in DATADISK. Will think about this.
      • Torre: rebrokerage - should we reduce threshold? Alden will repeat study
      • Will improve monitoring and logging - to improve knowledge of why copy was made; weights are in the logs.
      • Torre - Jarka's plots.
  • this week:
    • Problems fetching data w/ dq2-get at SWT2. But it worked from lxplus, sent feedback to user.
    • ADC meeting on Monday - low level reports from users having problems fetching data. Larger than US issue.
    • Doug: Can tracer logs give transfer failures?
    • Michael: heard claim of 10% loss due to "dq2-get problems".
    • July 3 - several hours with communication problems getting to the Panda server. Jobs ran twice - being addressed by Tadashi.
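    • Toy illustration of the two-copy PD2P idea discussed under "last meeting(s)" above (copy 1 by brokerage, copy 2 by MOU share). The site names, queue depths, shares and selection logic are invented for the example and are not the actual PD2P code.
      import random

      # Toy PD2P-style placement: first copy to the Tier 2 with the shortest
      # analysis backlog (brokerage-like), second copy drawn in proportion to
      # MOU share. All numbers below are invented for illustration.
      QUEUED_ANALYSIS_JOBS = {"MWT2": 120, "AGLT2": 400, "NET2": 90, "SWT2": 250}
      MOU_SHARE = {"MWT2": 0.30, "AGLT2": 0.25, "NET2": 0.20, "SWT2": 0.25}

      def choose_replica_sites(dataset):
          """Return (primary, secondary) destinations for a first-use dataset."""
          primary = min(QUEUED_ANALYSIS_JOBS, key=QUEUED_ANALYSIS_JOBS.get)
          others = [s for s in MOU_SHARE if s != primary]
          # Weighted draw for the second copy, proportional to MOU share.
          pick = random.uniform(0, sum(MOU_SHARE[s] for s in others))
          for site in others:
              pick -= MOU_SHARE[site]
              if pick <= 0:
                  return primary, site
          return primary, others[-1]

      print(choose_replica_sites("data11_7TeV.physics_Muons.AOD.example"))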

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • No meeting this week, no major issues, following up from last week.
  • this week:
    • MinutesDataManageJul5
    • Main issues: LFC errors assumed to be recurring - ACL issues? Slow deletion rate - dq2 database bottleneck? The claim is that deletion volume has increased 5x.
    • We saw the deletion rate fall 10x in one day in May; over the last two months it has been much lower. The issue was tracked down to the migration of the dq2 DB to a new cluster, though it is not clear why; indexing was rebuilt and optimized, still not solved. (The migration was supposed to be an improvement.)
    • Discussing moving the deletion service locally - to the sites. This is a high priority for the DDM team to solve.
    • The deletion agent is part of site services - Hiro is attempting to do this at BNL. Not sure about prospects for Tier 2 sites. Testing with Vincent. Hiro does not believe it's a locality issue.

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
     Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=2&confId=143581
    
    1)  6/22: SLACXRD SRM errors (" failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  Later that day Wei 
    reported the problem had been fixed.  ggus 71834 closed, eLog 26697.
    2)  6/23 early a.m.: NET2 - low DDM transfer efficiency.  From Saul: we saw a big burst of adler32 checksumming of small USERDISK files overnight 
    (I suspect that this is part of an ATLAS-wide burst of user activity).  This caused our adler software to run out of I/O resources and eventually caused 
    bestman to stop.  We added more I/O resources and re-started bestman about 1.5 hours ago.  The adler backlog is down and we have been operating 
    normally since then.  ggus 71843 closed, eLog 26712.
    3)  6/23: (minor) pilot update from Paul (v47h): added debugging info in order to understand failures seen on dCache sites (TypeError: 'int' object is not 
    callable), related to Savannah ticket https://savannah.cern.ch/bugs/index.php?83380.
    4)  6/23: IllinoisHEP - job failures with the error "SyntaxError: invalid syntax."  ggus 71863, eLog 26723.  Production queue set off-line.
    Update 6/27-6/28: Dave reported that the issue was likely due to a problem with a squid server, which in turn impacted releases/cvmfs.  Machine was 
    taken off-line - test jobs completed successfully, site back => on-line.  (Following the re-start jobs were initially failing on one problematic WN, since removed.)  
    ggus 71863 closed, eLog 26886.
    5)  6/23: BNL - SE maintenance intervention.  Some file transfer / job errors, but went away once the work was completed.  eLog 26722.
    6)  6/24: Major issue with production across all clouds.  Issue was traced to an overloaded host (atlascomputing.web.cern.ch) which was being hit with large 
    numbers of 'wget' requests to download MC job options files.  (This system has been in place for several years, but over time the size of the job options .tgz 
    files has grown considerably.)
    Many tasks were either paused or aborted to relieve the load on the server.  Discussions underway about how to address this problem.  Some info in 
    eLog 26744, 52, 54-56, more in an e-mail thread.
    7)  6/25: ggus 71925 opened due to file transfer failures between IN2P3-CC & MWT2.  Incorrectly assigned to MWT2 - actually an issue in the IN2P3 side.  
    Awaiting a response from IN2P3 personnel.  ggus ticket closed, eLog 26781.  (Also see related ggus ticket 71933.)
    8)  6/25: BNL voms server was not accessible (a 'voms-proxy-init' against the server was hanging up).  From John at BNL: I checked the server and although 
    the process was running, voms-proxy-init was indeed failing. A service restart has restored the functionality.  ggus 71926 closed, eLog 26785.
    9)  6/25-6/26: NET2 - DDM errors ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]").  Issue was due to heavy SRM activity.  Saul 
    reported that changes were implemented to address the problem.  No additional errors as of early 6/26.  ggus 71923 closed, eLog 26778.
    10)  6/27: SWT2_CPB - a user reported that his jobs were failing with the error "No input file available - check availability of
    input dataset at site."  Issue understood and resolved - from Patrick: The problem was traced to how the input files were registered in our LFC.  The files 
    were registered in a compact form that causes problems for the run-athena transform because our system is configured to read ROOT files directly from 
    storage.  The problematic LFC registrations were isolated to a week-long period in May when BNL began to run a new DQ2 Site Services version.  
    ggus 71935 / RT 20296 closed.
    11)  6/28: Longstanding ggus ticket 69526 at NERSC closed (recent file transfer failures eventually succeeded on subsequent attempts).  eLog 26876.
    12)  6/28: AGLT2 - Bob reported that the site analysis queue was still set to 'brokeroff' after being auto-excluded by hammercloud testing on 6/25.  For some 
    reason the 'HC.Test.Me' comment wasn't set for the site.  This was corrected, but as of 6/29 a.m. ANALY_AGLT2 is still in the 'brokeroff' state? 
    
    Follow-ups from earlier reports:
    
    (i)  6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..."  Not a site issue, but rather related to the problem seen recently with 
    transfers between US tier-2's and European destinations (under investigation).  ggus 71177 closed, eLog 26032.
    Update 6/7: still see large numbers of these kinds of job failures.  ggus 71314, eLog 26202.
    See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
    Update 6/14: ggus 71314 is still 'in-progress', but no recent updates from FZK/DE cloud.
    (ii)  6/10: HU_ATLAS* queues set off-line in preparation for a weekend maintenance downtime.  Outage completed as of early a.m. 6/13.  However, jobs 
    are not running at the site (brokerage) due to missing information about atlas s/w releases (BDII forwarding to CERN?).  Issue being tracked here: 
    https://ticket.grid.iu.edu/goc/viewer?id=10566.
    (iii)  6/13: SLAC - production job failures with the error "pilot: Exception caught by pilot watchdog: [Errno 10] No child processes trans: Unspecified error, 
    consult log file."  Wei solved the problem by disabling the multi-job pilots.  Issue will be raised with panda / pilot developers.  ggus 71475 closed, eLog 26382.
    (iv)  6/19:  DDM transfer errors to SLACXRD_PERF-JETS from multiple sources (" [DDM Site Services internal] Timelimit of 172800 seconds exceeded").  
    ggus 71675 in-progress, eLog 26572.
    Update 6/27 from Wei: I will trace this one via GGUS ticket system. It is not a bug anywhere, and I made agreement with US ATLAS computing management 
    that this looks like a long term small project.  ggus 71675 closed.
    
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-7_4_2011.html
    
    1)  6/29: UTD-HEP - file transfer errors with "failed to contact on remote SRM."  Site admin fixed the problem, ggus 72079 closed, eLog 26914.
    2)  6/29: From Bob at AGLT2: I have stopped auto-pilots for the AGLT2 production queue in preparation for our scheduled down time beginning at 8am EDT 
    on Thursday, 6/30.  Outage completed as of early p.m. 6/30 - test jobs completed successfully, site set back on-line. 
    3)  7/1: IllinoisHEP - dCache storage errors such as "Can't open source file : Unable to connect to server System error: Connection refused Failed to create 
    a control line Failed open file in the dCache. Can't open source file : Unable to connect to server, System error."  Issue resolved by rebooting the chimera 
    server.  ggus 72168 closed, eLog 26977.
    4)  7/2: ggus 72176 was opened and assigned to AGLT2 for DDM transfer status messages like "NoneERROR MSG: [DDM Site Services internal] Timelimit 
    of 172800 seconds exceeded in IN2P3-CC_DATADISK->AGLT2_PHYS-SM queue."  Not a site problem - from Hiro: This is not really site issue, but rather it 
    is dq2 ss. And, this message is not really error. Shifters are advised to ignore them. Also, it is discussed elsewhere. I am closing this ticket.  eLog 26986.
    5)  7/2 - 7/4: SLAC - a couple of incidents with SRM host problems, power lost to a couple of data servers.  ggus 72222 was opened on 7/4 
    ("gridftp_copy_wait: Connection timed out "), and Wei reported the problem was fixed.  eLog 27016.
    6)  7/3 - 7/4: Communication with the panda server was very slow at times, in some cases timing out.  Reported in eLog 27005.  Probably the reason for the 
    larger number of "lost heartbeat" job failures around this period.  Also, see 12 below.
    7)  7/4: SWT2_CPB: user reported he was unable to retrieve some of his data files from the site via dq2-get.  We were able to successfully copy the dataset 
    on lxplus (7/5), so concluded this must have been a temporary glitch.  Suggested the user re-try the copy - awaiting feedback.  ggus 72206 / RT 20354.
    8)  7/4: ggus 72225 erroneously opened for supposed file transfer errors at BNL - actually related to checksum transfer errors at SWT2_CPB and SLAC 
    (see 12 below).  Ticket still 'assigned', will be closed, eLog 27019.
    9)  7/4: From Michael at BNL: We currently observe high load on the SE namespace db. This is causing some transfers to fail. We are in the
    process of investigating the issue. Later from Hiro: dCache service was started. The number of concurrent transfers seem to be a bit high at around 
    300~400.  After adjusting the concurrency in DDM/FTS, it seems to be fixed.  eLog 27018/22.
    10)  7/5: From Bob at AGLT2: We will replace a failed DIMM in the dCache admin node head01 between 2:30pm-3:00pm.  Auto-pilots are stopped. 
    OIM outages are in place, "outage" for the SE, "at risk" for the 2 CE.  A short time later: dCache is back now, actual downtime from 14:36-14:49. Auto-pilots 
    are restarted.
    11)  7/5:  Brief power interruption at MWT2 (UC).  eLog 27045.  Later in the evening, from Rob: We're still recovering from the power glitch. We've just fixed 
    a d-cache mis-config and restarted SRM.
    12)  7/5: SWT2_CPB & SLAC: file transfer errors with checksum mismatch errors.  Patrick tracked this down to an unusual scenario in which the same user 
    analysis job was run twice by panda, resulting in the output files being overwritten by the pilot on the second pass, but a second LFC registration could not 
    be done, hence the checksum discrepancies.  Tadashi noted that these jobs ran around the same time as the panda server problem on 7/3.  He will modify 
    the system to handle cases such as this in the future.  (See details in the prodsys e-mail thread.)  A minimal checksum-verification sketch follows these notes.
    13)  7/6: AGLT2 - file transfer failures ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  Issue understood - from Shawn: 
    The MSUFS02 node had its /var partition encounter an error which took it offline. A reboot seems to have fixed the problem.  FTS channels re-enabled, 
    ggus 72298 closed since the issue was already being tracked in RT 20360 (which can now probably be closed as well).
    
    Follow-ups from earlier reports:
    (i)  6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..."  Not a site issue, but rather related to the problem seen recently with 
    transfers between US tier-2's and European destinations (under investigation).  ggus 71177 closed, eLog 26032.
    Update 6/7: still see large numbers of these kinds of job failures.  ggus 71314, eLog 26202.
    See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
    Update 6/14: ggus 71314 is still 'in-progress', but no recent updates from FZK/DE cloud.
    (ii)  6/10: HU_ATLAS* queues set off-line in preparation for a weekend maintenance downtime.  Outage completed as of early a.m. 6/13.  However, 
    jobs are not running at the site (brokerage) due to missing information about atlas s/w releases (BDII forwarding to CERN?).  Issue being tracked here: 
    https://ticket.grid.iu.edu/goc/viewer?id=10566.
    (iii)  6/24: Major issue with production across all clouds.  Issue was traced to an overloaded host (atlascomputing.web.cern.ch) which was being hit with 
    large numbers of 'wget' requests to download MC job options files.  (This system has been in place for several years, but over time the size of the job 
    options .tgz files has grown considerably.)
    Many tasks were either paused or aborted to relieve the load on the server.  Discussions underway about how to address this problem.  Some info in 
    eLog 26744, 52, 54-56, more in an e-mail thread.
    (iv)  6/28: AGLT2 - Bob reported that the site analysis queue was still set to 'brokeroff' after being auto-excluded by hammercloud testing on 6/25.  
    For some reason the 'HC.Test.Me' comment wasn't set for the site.  This was corrected, but as of 6/29 a.m. ANALY_AGLT2 is still in the 'brokeroff' state? 
    Update 7/1: after AGLT2 was ready to come back on-line post their maintenance outage (see 2 above) the mechanism to transition the site analysis 
    queue automatically from 'brokeroff' to 'on-line' was still not working correctly.  Dan reported that there were some known problems with the system - 
    suggested setting the queue on-line manually, which was done.
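     Both the adler32 burst in item 2 of last meeting's summary and the checksum mismatches in item 12 above come down to verifying an adler32 checksum of a stored file against a catalog value. Here is a minimal sketch of that check, assuming local POSIX access to the file; the path and catalog value are hypothetical.
     import zlib

     # Minimal adler32 verification sketch: read the file in chunks so memory
     # stays flat, but the whole file must still be read, which is why bursts
     # of such checks are I/O heavy (cf. the NET2 bestman incident above).
     def adler32_of_file(path, chunk_size=1024 * 1024):
         checksum = 1                        # adler32 starts at 1, not 0
         with open(path, "rb") as f:
             while True:
                 chunk = f.read(chunk_size)
                 if not chunk:
                     break
                 checksum = zlib.adler32(chunk, checksum)
         return "%08x" % (checksum & 0xffffffff)

     # Hypothetical placeholders for the file path and catalog checksum:
     catalog_value = "3e2a19b7"
     computed = adler32_of_file("/data/userdisk/example.root")
     print("OK" if computed == catalog_value else "checksum mismatch")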
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Nothing this week; bi-weekly meeting next week.
  • this week:
    • Meeting this week - notes posted
    • For next quarter the focus is on inter-cloud problems. The Italian and Canadian clouds will deploy perfSONAR-PS, as well as LHCOPN.
    • How best to leverage the deployed infrastructure? Can't do a full mesh - select representative sites.
    • Related: enhance the infrastructure to allow tests on demand, e.g. T2-T2 transfers scheduled as on-demand tests. What is the network path? The aim is to isolate site problems from network problems.
    • Using Tomasz's infrastructure
    • Distribute tests over resources

CVMFS

See TestingCVMFS

last week:

  • Illinois: there was a squid problem creating corruption, which caused jobs to fail; fixed by flushing the cache and restarting (a recovery sketch follows this list)
  • MWT2 - passed all of Alessandro's validation tests, and test jobs
  • Switching back and forth - HOTDISK and CVMFS
  • Stratum 1 server mirrored at BNL - done last week
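  • A minimal sketch of the "flush cache and restart" recovery mentioned for Illinois above, driving the CVMFS client tools from Python. The cvmfs_config sub-commands and the autofs restart are assumptions about the client of that era; check them against the local installation before use.
    import subprocess

    # Sketch of the cache-flush recovery described above. The sub-commands
    # ('wipecache', 'probe') and the autofs service restart are assumed to
    # match the installed CVMFS client; verify locally before relying on this.
    STEPS = [
        ["service", "autofs", "stop"],
        ["cvmfs_config", "wipecache"],   # drop the (possibly corrupted) local cache
        ["service", "autofs", "start"],
        ["cvmfs_config", "probe"],       # re-mount and check each configured repository
    ]

    def flush_and_probe():
        for cmd in STEPS:
            print("running: " + " ".join(cmd))
            subprocess.check_call(cmd)

    flush_and_probe()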

this week:

  • Dave - updated docs on the CERN twiki. There is a nagging problem related to job failures; he thought it might be a squid issue, but he sees jobs failing with Python mismatch problems. Investigated the Python logs; it only happens on certain jobs, and it is happening at both MWT2 and IllinoisHEP.
  • Patrick - have asked to get a new resource approved for OSG for testing; awaiting WLCG BDII.
  • John: which sites are using the new namespace?

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • The link to ATLAS T3 working groups Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
  • US ATLAS Tier3 RT Tickets

last week(s): this week:

  • No report

Tier 3GS site reports (Doug Benjamin, Joe, AK, Taeksu)

last week:
  • AK - waiting for firewire by-pass
  • Hari - jobs running smoothly

this week:

  • No report

Site news and issues (all sites)

  • T1:
    • last week:
      • Hiro - working on the dCache PNFS-to-Chimera migration, testing and rehearsal. Target: August.
    • this week:
      • As mentioned earlier, empty directories issue. Otherwise very smooth.
      • Uptick in analysis jobs - at the Tier 1 and in general overall.
      • Chimera upgrade is taking shape.

  • AGLT2:
    • last week(s):
      • Tomorrow downtime for upgrades; re-do WAN (new AS number); current golden dCache release; reconsolidate OSG-NFS server to appropriate box, minor switch firmware updates.
      • Brokeroff issue: a network hiccup caused the analysis queue to be set to brokeroff. Put back online but still brokeroff; the 'HC.Test.Me' comment was supposed to be added.
      • CVMFS - upgrading worker node rpms
    • this week:
      • Major changes last week. New AS number, own routing entity for AGLT2 now. Added virtual routing config at UM site. Updated firmware on switches.
      • Current golden - dCache 1.9.12-6, there are some issues
      • Updated condor.

  • NET2:
    • last week(s):
      • Reached a stable plateau in the I/O upgrade project ~last weekend. Stable operations; can feed a steady 950 MB/s to HU workers via LSM, Tufts LSM working, Adler-spreader smoothing out the spiky adler load. Lots more to do.
      • About to place order for ~500TB from Dell
      • Will wait for new storage before getting the second 10 Gbps NoX link
      • Admins are setting up HCC
      • Tufts LSM upgraded
      • Smooth NET2/BU Tier 3/HU Tier 3 operations otherwise
    • this week:
      • Smooth running

  • MWT2:
    • last week:
      • Progress with unified MWT2 queue (Condor scheduler, running jobs at both sites, used for CVMFS testing)
      • Progress with new cobbler+puppet system -
      • Downtime next week, July 7
    • this week:
      • Downtime tomorrow - DHCP for Cobbler, firmware on s-nodes, omreport on s-nodes, osg-gk (add two slots), srm (add two slots), UPS work

  • SWT2 (UTA):
    • last week:
      • Creating a new OSG resource for CVMFS testing
      • User job that failed because of how data was registered in LFC - short versus long form - how will this be managed?
      • Engage - already running for quite a while.
    • this week:
      • Quiet - except for doubly-run jobs. Working on clean-up - follow Hiro's suggestion to declare files as bad.
      • CVMFS work
      • Tested fetching job options from the ATLAS web server

  • SWT2 (OU):
    • last week:
      • on vacation
    • this week:
      • Ran into a problem with load on OSCER_ATLAS cluster.

  • WT2:
    • last week(s):
      • Problem with LFC hardware yesterday, replaced.
      • DDM transfer failures from Germany and France - all logfiles. ROOT files are working fine. Is FTS failing to get these? Email sent to Hiro. NET2 is also finding performance problems.
      • Hiro - notes many of these are never started, they're in the queue too long.
      • Suspects these are group production channels.
      • T2D channels.
      • FZK to SLAC seems to be failing all the time. Official Tier 1 service contact for FZK?
    • this week:

Carryover issues (any updates?)

Python + LFC bindings, clients (Charles)

last week(s):
  • New package from VDT delivered, but still has some missing dependencies (testing provided by Marco)
this week:

WLCG accounting

last week: this week:

HTPC configuration for AthenaMP testing (Horst, Dave)

last week
  • Dave reports successful jobs submitted by Douglas last week
this week

AOB

last week
  • None.
this week
  • Doug: Is any site using the old CVMFS release repository? Okay to switch it off? Answer: no production site is using it; okay to switch off.


-- RobertGardner - 05 Jul 2011

  • osg_facility_hours.png (attached)
  • vo_hours_bar_smry.png (attached)
