
MinutesMay12

Introduction

Minutes of the Facilities Integration Program meeting, May 12, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Shawn, Jason, Bob, Fred, John, Sarah, Tom, Xin, Wei, Horst, Saul, Rik, Patrick, Nate, Charles, Karthik, Kaushik, Armen
  • Apologies: Michael, John D.

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (9:30am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Following reprocessing, a lot of data has been subscribed to the Tier 1 and Tier 2s, so lots of data replication is underway. We're hitting reality - users are waiting for data.
      • This is the first time we're exercising the entire chain under load; it's real data, so we're sensitive to real latencies and expectations. What are the performance optimizations, etc.?
      • We have lots of good and valuable discussions. We need to analyze and understand the limitations.
      • Note the machine still has significant problems - we had a good run over the weekend, still sorting out several issues.
    • this week

Feature talk: Xrootd local site mover (Charles Waldman)

  • Specification: LocalSiteMover
  • last week
    • LocalSiteMoverXrootd
    • Took the existing gen-0 site mover scripts from the dCache lsm and did a simple translation to the xrootd protocol.
    • Q from Wei: how does the lsm apply to direct reads for analysis? Could leverage the python-xrootd bindings library from Charles.
    • lsm and pcache: lsm-get is responsible for getting the input files for the job.
    • Question about bundling the two, or at least making them work together.
  • this week:
    • XrdPosix module: source code is here: http://repo.mwt2.org/viewvc/xrd-python
      • Code is largely auto-generated by SWIG
    • lsm-scripts using XrdPosix: http://repo.mwt2.org/viewvc/lsm/MWT2_X (a rough sketch of the copy-and-checksum flow appears after this list)
    • Testing on stand-alone xrd system and ANALY_MWT2_X
    • TODOs
      • Merge with Pedro's lsm-scripts (improved diagnostics)
      • Determine if there is a better way to handle checksums
    • Related topic - making Panda aware of contents of xroot pools.
      • Currently uses GUID - could this mechanism use LFN?
      • Otherwise, need good tools to handle GUIDs
    • Wei: possible better way to test checksum handling
    • Post ANALY_MWT2_X testing, try out with SLAC and UTA
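    • Illustration: below is a minimal sketch of the lsm-get copy-then-verify flow for an xrootd SE. This is not the actual MWT2_X script linked above, only a hedged example; it assumes the standard xrdcp client is on the PATH and that the caller supplies an adler32 checksum, per the LocalSiteMover convention. A pcache layer would sit in front of the copy step.

      #!/usr/bin/env python
      # Hedged sketch of an lsm-get-style wrapper for an xrootd SE.
      # NOT the actual MWT2 lsm scripts - just the pattern discussed above:
      # copy the input file with xrdcp, then verify an adler32 checksum.
      import subprocess
      import sys
      import zlib

      def adler32(path, blocksize=1024 * 1024):
          """Return the adler32 checksum of a local file as 8 hex digits."""
          value = 1  # adler32 starting value
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(blocksize), b""):
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xffffffff)

      def lsm_get(surl, dest, expected=None):
          """Copy one input file from the SE and optionally verify its checksum."""
          rc = subprocess.call(["xrdcp", surl, dest])
          if rc != 0:
              return rc          # copy failed; the pilot maps this to an error code
          if expected and adler32(dest) != expected.lower():
              return 1           # checksum mismatch
          return 0

      if __name__ == "__main__":
          # usage: lsm-get <source-surl> <local-destination> [adler32]
          chk = sys.argv[3] if len(sys.argv) > 3 else None
          sys.exit(lsm_get(sys.argv[1], sys.argv[2], chk))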

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • The draft users' guide to T3g is here

last week(s):

  • XrootdOSGAtlasMeetingMay4
  • Work is converging on a set of recommendations and instructions for a non-grid Tier 3
  • Planning Tier 3 workshop around June 7
  • Funding is quite uncertain, though some institutions have heard from their grant offices
  • Tier 3 Panda testing - job submissions being tested. Still trying to get HC working.
  • Data distribution: a dq2-client that uses FTS is to be released in two weeks (requires gridftp-only endpoints).
  • Doug has a request from Simone for a completely separate FTS instance for gridftp-only endpoints.
  • Hiro has checked that gridftp-only works fine.
  • Michael: if we need a separate instance, we will set this up.
  • Will need more sites testing.
  • Throughput testing will require more.
  • Doug: pcache on distributed xrootd (processors and storage on same nodes).
    • Probably not to be implemented in pcache itself, but in the tool that loads data into the Tier 3.
    • Rik: probably will take a longer time to implement.
  • NFSv4 client bug - ATLASLocalRootBase + CVMFS, etc. Found in testing wlcg-client-lite. Bug triggered in RHEL 5.4 and below. Waiting for beta release in SL5.
  • (note NFSv4 looks to be the default for RHEL6)
this week:

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Getting lots of data now!
    • Three major topics. First, space status: making sure we're not running out.
    • Second, slow transfers T0-->T1-->T2: being investigated; a hot topic with an on-going thread.
    • Third, discussion in the ADC phone meeting on the German cloud experience: a T2 can only get data from within its own cloud, yet users constantly request data from outside their clouds. DaTri handles this with two subscriptions via the Tier 1, and the German cloud ran out of space this way. Proposal: start testing direct channels between the largest T2s and the T1s, possibly going through a star channel.
    • Do we participate in these tests? Note: it would improve performance for users and would not waste scratch space at the Tier 1s.
    • Pedro: what about the load on people - figuring out bottlenecks, remote site issues, etc.? What is the operational load increase? Kaushik will make this point.
  • this week:
    • Data replication is much better understood now. MWT2 going well. AGLT2 going well, small backlog. SLAC in good shape. NE: GPFS configuration issue solved, much improved. SW: limited by a 1 Gbps network (otherwise no intrinsic issues).
    • Discussion with ADC on the issue of priorities: MC data is coming in at 2-3x the rate of real data (by size), with no way within DQ2 to adjust - everything arrives with the default share, and reprocessed data caused a bottleneck.
    • Simone and Hiro implemented a shares/priority solution.
    • Wensheng did some manual interventions - re-subscribing some datasets without specifying a source, which allowed replication between T2s.
    • Lot of work over past week to make sure we have enough space.

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=93559 
    1) Upcoming: 5/11, BNL - The Condor batch system will be upgraded on May 13 (Thursday) beginning at 8 am EDT. Duration: May 13, from 8 am - 12 noon Expected User Impact: No batch jobs can be scheduled or executed during the upgrade. 
    2) 4/28: DDM transfer errors between MWT2_UC_USERDISK ==> MWT2_IU_LOCALGROUPDISK. The STAR-IU FTS channel was temporarily stopped to prevent additional errors. From Charles: The offending LFC entries have been cleaned up. We are re-enabling the channel and watching for errors - if this condition returns we'll pause the channel. Issue seems to be resolved. 
    3) 4/28: From Pedro at BNL: Our dcache srm server stopped working. We've managed to recover the service. Please ignore the dashboard error messages. eLog 12020. 
    4) 4/29 - 4/30: Slowness in data transfers to the BNL SE. From Hiro: The network engineers have identified the faulty link and switch. The throughput to/from BNL has been restored by bypassing them with rerouting of the traffic. ggus 57801 (closed), eLog 12039. 
    5) 4/29: DDM errors at AGLT2 - FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries]. From Shawn: This was due to a connection limit being reached on our dCache headnode's postgresql max_connections parameter. 
    The fix was to double it from 300 to 600 and restart first postgresql and then all dCache services on this node. eLog 12055. 
    6) 4/29 - 5/1: Issues with atlas s/w releases installation at MWT2_IU. Now resolved. See long discussion threads in ggus 57820, RT 16151 (both closed), eLog 12065. 
    7) 5/3: BNL - from Michael: BCF Facility Services will be putting the APC UPS into bypass mode in order to repair a severe problem with its batteries. Although the equipment will continue to operate on line power there will be no UPS protection until the problem is corrected. No interruption in service is expected unless there is a power line glitch. No impact observed. 
    8) 5/4: Xin reported a problem with atlas s/w installation jobs failing (for example, see panda job i.d.'s 1068250967, 1068414227). From Tadashi: I've fixed test/installSW.py on SVN. Issue resolved. 
    9) 5/4: From John at NET2 / HU: We were going along fine at ~750 concurrent jobs for days, but when I lifted that limit today, our lsm and storage again ran into scaling issues. I'm going to get us back down to the 750 level, where things were working correctly. I will do this while keeping the site online in panda. 
    
    Follow-ups from earlier reports: 
    (i) 4/11: Failed jobs at AGLT2 with errors like: 11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist. Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera. Could there be some timing issue present? What does getdCacheChecksum() try to do? 
    I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured? 
    Paul added to the thread in case there is an issue on the pilot side. ggus 57186, RT 15953, eLog 11406. In progress. 
    Update, 4/16: Still see this error at a low level, intermittently. For example ~80 failed jobs on this date. More discussion posted in the ggus ticket (#57186). Update, 5/4: Additional information posted in the ggus ticket. Also, see comments from Paul. 
    (ii) 4/23: OU sites were set off-line in advance of major upgrades -- from Horst: We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning. So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week? I'll also schedule a maintenance in OSG OIM, 
    which I will keep updated when we know better how long Dell and DDN will take for the upgrade. eLog 11813. 
    (iii) 4/23: Spring reprocessing exercise is underway. Useful links: http://panda.cern.ch:25980/server/pandamon/query?mode=listtask http://gridinfo.triumf.ca/panglia/graph-generator/?SITE=DE&TASK=ALL&TIME=hour&SIZE=large http://atladcops.cern.ch:8000/j_info/repro_feb_10_RAW_ESD.html Update: this reprocessing exercise has now been completed.
    • Low number of US issues, smooth running.
    • Added several shifters in the US time zone. Helped greatly.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=94637
    
    1)  5/6 p.m.: AGLT2 - problem with the dCache headnode - from Shawn:
    Ok..just restarting dCache. The node has been up a while and I have been looking around for problems.  Not sure what happened to the node...some kind of hardware issue or OS lockup.  Anyway I have the power port mapping information so future occurrences should be quicker to deal with.  Tomorrow we will investigate alternative hardware. 
    It turns out the SSD we want to use won't work in this system as it is configured.  
    For now dCache should be operational again shortly.  eLog 12327/45.
    2)  5/6: Transfer errors at SLAC such as:
    2010-05-06 05:19:50 DESD_MET.131664._000195.pool.root.1 FAILED_TRANSFER
    DEST SURL: srm://osgserv04.slac.stanford.edu:8443/srm/v2/server?SFN=/xrootd/atlas/atlasdatadisk/data10_7TeV/DESD_MET/r1239_p134/data10_7TeV.00153030.physics_MinBias.merge.DESD_MET.r1239_p134_tid131664_00/DESD_MET.131664._000195.pool.root.1
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries]
    Issue resolved - from Wei:
    SLAC's LFC db seems to have gotten corrupted during an operation after a firmware upgrade. I restored the LFC from a backup
    and the LFC is now functioning. The DDM transfers should go back to normal. We expect to lose a few hours of data in LFC and some job failures due to this.  ggus 58000 (closed), eLog 12309.
    3)  5/7: Data transfer errors at SLAC:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries]
    From Wei:
    Bestman event log filled up /tmp. I restarted bestman without writing event log.
    4)  5/7: Jobs failing at MWT2_IU with "ddm: Adder._updateOutputs() could not add files to..." errors.  From Sarah:
    This was due to cleanup activity at our site. Please disregard.
    ggus 58040 (closed), eLog 12661.
    5)  5/7: Still seeing SE problem at AGLT2 - from Shawn:
    We are taking another OIM outage on AGLT2_SE.  The head01 node has become unresponsive in SRM again. We are trying to find the right "chassis" to host both the existing disk and the new SSD. As soon as we do we will bring up HEAD01 on that hardware. 
    Issue resolved - site re-activated for DDM transfers.  eLog 12413, https://savannah.cern.ch/support/index.php?114334.
    6)  5/8: DDM transfer errors at NET2:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]. Givin' up after 3 tries].  From Saul:
    Fixed by restarting bestman.  ggus 58073 (closed), eLog 12424.
    7)  5/9: Job failures at HU_ATLAS_Tier2 and MWT2_IU due to missing release BTagging/15.6.8.6.1.  Installed at both sites by Xin -- issue resolved.  ggus 58084 (closed), eLog 12663.
    8)  5/10: Transfer errors at MWT2_DATADISK:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase:
    [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries].  From Rob: A dCache storage issue - fixed now.  ggus 58092 (closed), eLog 12475.
    9)  5/11: From Wei at SLAC:
    We just had an unexpected outage with one of the storage boxes. Replacing the motherboard fixed the problem. Our power work is still on schedule and I will start turning services at SLAC down in two hours.  ggus 58163 (closed), eLog 12532.
    10)  5/12:  SLAC outage -- from Wei:
    SLAC has scheduled an outage on 5/12 from 4am to 5pm UTC to prepare for a 3-hour early morning power work. We will shut down all our services during that time. Depending on weather conditions, we might cancel and reschedule it at the last minute.  Update, 5/12 afternoon: Power outage at SLAC is over. I am turning services on.
    11)  5/12: DDM transfer errors at AGLT2 were initially reported as "no space left on device" errors.  From Shawn:
    Space on the pools was not the problem. The logging for postgresql filled the partition (log files). It was fixed and the new log directory is soft-linked to another partition.  Savannah 67337, ggus 58172 (both closed), eLog 12621.
    12)  5/12: From Bob at AGLT2:
    At 1pm today (EDT) we will begin the process of reconfiguring our dCache so that we no longer have distinct, physical pool disks assigned to one and only one space token, but will instead have all pool disks grouped and space tokens will become logical assignments.  This will greatly ease the troubles we've had for the past week getting space where and when it was needed.  
    We have thought this through pretty well.  
    We do not expect troubles, but that does not mean we will not have any.  
    This message is a warning that this process will begin, and that transient dCache difficulties _could_ potentially arise. We will notify everyone when we have completed this task.
    
    Follow-ups from earlier reports:
    
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  Paul added to the thread in case there is an issue on the pilot side.  
    ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    Update, 5/4: Additional information posted in the ggus ticket.  Also, see comments from Paul.
    Update, 5/10: Additional information posted in the ggus ticket.
    (ii)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, 
    which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    (iii)  Upcoming: 5/13, BNL -
    The Condor batch system will be upgraded on May 13 (Thursday) beginning at 8 am EDT.
    Duration:
    May 13, from 8 am - 12 noon
    Expected User Impact:
    No batch jobs can be scheduled or executed during the upgrade.
    (iv)  5/4: From John at NET2 / HU:
    We were going along fine at ~750 concurrent jobs for days, but when I lifted that limit today, our lsm and storage again ran into scaling issues.  I'm going to get us back down to the 750 level, where things were working correctly.  I will do this while keeping the site online in panda.
    Update from John, 5/6:
    Just a heads up that we're still trying out some things to improve performance.  This time we were able to run steady at 1500 jobs for over 24 hours, but we just ran into a snag.  A (hopefully very small) batch of failures will be showing up shortly, but we believe we've caught things in time so that we can keep the site online.
    
    

DDM Operations (Hiro)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases on sites, etc.
  • last meeting
    • ITB site info is now published, though it reported the gatekeeper OS rather than the worker-node OS. ATLAS releases were installed.
    • dq2-client package installation on sites. Working towards unique set of packages. Discussing this with Marco.
    • Next step: after details have settled, try out a Tier 2 site - OU good candidate.
  • this meeting:
    • We continue the discussion regarding the layout of supporting software for the ATLAS release, e.g. dq2-client; need conclusions from Alessandro.
    • Testing with UTD site. Sending test jobs.
    • Will test with OU

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week(s)
    • PFC corruption at HU - was affecting production jobs, which it should never do. This file is not used, but it needs to exist and be in proper XML format. Hiro reported a problem with its content. Alessandro was working on a validity check in the software - we thought this was done. Saul: it was not actually corrupted, but out of date. This is a general problem - the jobs which install this sometimes fail (dq2 failures), and this will affect running analysis jobs. Fred will discuss with Richard Hawkings next week at CERN and report back. We need a consistency checker for this (a rough sketch follows this list).
    • New version of squid - recommended for deployment. See message from Dario.
    • AGLT2 updated - but got a slightly older version of the rpms. Needs to update.
    • Advice: make sure you stop running processes; uninstall old version before installing the new release.
    • Caution: Customizations will be overwritten. ACLs for example.
    • John will update US facility instructions - will test at MWT2.
    • Fred is testing the latest release at BNL and CERN - working with the latest versions of Athena.
    • New Frontier client release has been delayed.
    • John is upgrading BNL servers; will work on Tier 2 instructions after that.
  • this week
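  • Illustration: a crude sketch of the kind of PoolFileCatalog consistency check discussed under "last week(s)" above (hypothetical, not Alessandro's validator). It only confirms that the file exists, parses as well-formed XML, and contains at least one File entry; it does not detect a stale catalog.

    #!/usr/bin/env python
    # Hedged sketch of a PoolFileCatalog.xml sanity check - not an official ATLAS tool.
    # Checks existence, XML well-formedness, and the presence of <File> entries,
    # i.e. the "missing or malformed" failure modes mentioned in the minutes;
    # an out-of-date (but valid) catalog would still pass.
    import sys
    import xml.etree.ElementTree as ET

    def check_pfc(path):
        try:
            tree = ET.parse(path)
        except (IOError, ET.ParseError) as err:
            return "BAD: %s" % err
        entries = tree.getroot().findall(".//File")
        if not entries:
            return "SUSPECT: parses as XML, but contains no <File> entries"
        return "OK: %d <File> entries" % len(entries)

    if __name__ == "__main__":
        # usage: check_pfc.py /path/to/PoolFileCatalog.xml
        print(check_pfc(sys.argv[1]))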

Throughput Initiative (Shawn)

Site news and issues (all sites)

  • T1:
    • last week(s): DDN array evaluation - converted the front end to Linux and XFS; expect the evaluation to last another two weeks. Issue with the UPS system (battery based): batteries exhibiting thermal runaway, so the UPS was switched into bypass mode. Measures are underway to solve the problem.
    • this week:

  • AGLT2:
    • last week: dCache server and headnode issues being addressed.
    • this week: Reconfiguring dCache right now; sent out instructions for comment. Going from hard-coded pools to one large pool group, which allows logical control rather than by-hand assignments that don't scale. Moved the dCache admin databases onto Intel SSDs - a huge difference in CCC run time. All databases.

  • NET2:
    • last week(s): Disk space is tight at NET2. The order for the first of three new racks of storage goes out this week - it will add a PB of raw storage. Network issue - turned out to be a firewall issue. John: still working on ramping HU back up to full capacity.
    • this week: http://atlas.bu.edu/~youssef/2010-05-12/. Regarding the lsm-get timeout - may need to ask Paul to increase this.

  • MWT2:
    • last week(s): Cisco switch stable since going to gridftp2.
    • this week: Smooth running most of last week; a dCache issue over the weekend - missed configuration on new storage pools (fixed); Squid updated at IU. Working with the Tier 3 team on xrootd testing. Kernel bug: soft lockup messages - we see jobs not using CPU but loading up the node; updated the kernel, but that didn't solve the problem.

  • SWT2 (UTA):
    • last week: All is well this week.
    • this week: Upgrading nodes on older UTA_SWT2; networking issues; Squid update later today.

  • SWT2 (OU):
    • last week: Dell on-site installing nodes; by end of next week expect to be online again.
    • this week: The cluster upgrade worked fine. The 10G NIC on the head node keeps locking up; will swap it with a (Dell-supported) NIC. Condor configuration, OSG installation, LFC, Squid still to come. Hope to be back online next week.

  • WT2:
    • last week(s): All is well. Storage evaluation - a Thumper-like system from a local vendor, Berkeley Communications.
    • this week: Yesterday had a 2-hour outage caused by a failed motherboard on one of the storage boxes. The scheduled outage is mostly finished. Adding another gridftp server - RHEL5. Running low on storage - deleting files from PRODDISK - have ~100 TB free. dq2-put into LOCALGROUPDISK? There are ACL settings in LFC and ToA; use DaTri to move data to LOCALGROUPDISK.

Carryover issues ( any updates?)

AOB

  • last week
  • this week


-- RobertGardner - 11 May 2010
