
MinutesJune2

Introduction

Minutes of the Facilities Integration Program meeting, June 2, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Dave, Fred, Jason (I2), John DeStefano, Sarah, Rik, Patrick, Shawn, Wei, Xin, Saul, Jim, John B, Armen, Mark, Kaushik
  • Apologies: Michael, Horst, Hiro

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • A draft users' guide to T3g is here

last week(s):

this week:
  • Working on the Tier 3 build instructions - some minor Xrootd configuration items are still being worked out.
  • Doug, Rik, Shawn - discussed perfSONAR deployment for Tier 3s; the conclusion for the time being is not to recommend it for the baseline T3.
  • Tuesday, Wednesday of next week

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • After reprocessing, production went empty.
    • Finally got restarted yesterday, after fixes to a transformation error; running fine.
    • User analysis activity has also been reduced; group-level DPD production is concluding for now.
    • Otherwise all is looking good.
    • May 31 - June 2 the LHC will be shut down, a good time to take downtime.
  • this week:
    • New version of Geant
    • We've been drained - are there other samples that could be generated?
    • Regional production requests from the RAC? Jim C will contact Kevin for W-prime samples
    • An Oracle database security patch caused scaling problems; a rollback this morning didn't go well.
    • Things are back to a functioning state.
    • User analysis has been going well... 38B events processed. Problems this morning - jobs piling up at BNL. From AGLT2 - lots activated, none running - "failing to get job scripts". No jobs running at other T2's either. Subversion move happened last week - is there a hard-coded dependency in the analysis pilot factory?

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • MinutesDataManageMay25
    • Storage crunch
    • Deleting old datasets
    • Discussions w/ Stephane, Simone, Alexei - about an automated system; discussion of threshold to trigger deletions
    • Will be doing more aggressive deletion
    • Quotas? Far away
    • Wensheng surveying USERDISK usage
    • Data popularity is available in DQ2; ADC looking into this.
    • 72-80 TB of DATADISK can be deleted (removing the 900 GeV data). Concern: will it take weeks to delete? Kaushik believes not; more like 4-5 days.
    • Kaushik believes the reprocessing caused the explosion in usage and that it was a one-off; going forward the rate will be reduced.
    • Charles - the product of per-file deletion latency and deletion time is important (see the back-of-envelope estimate after this list).
    • Hiro notes deletion runs at about 1 Hz.
    • Charles - could do this much more quickly at the site-level.
    • Michael: should we reduce # replicas?
    • Charles notes the LFC access time field is not being used presently - could be used to identify cold data
  • this week:
    • A very active week, lots of data deletion at sites; all sites in good shape except SLAC
    • Discussion of new data distribution model - based on analysis requests (changing from push to "pull")
    • Will go into Panda - will generate a generic DDM subscription
    • Give more flexibility for local deletion at sites - using Tier 2's as caches
    • Charles notes an automatic space-token adjuster for dCache sites.
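    • A back-of-envelope check of the deletion numbers above (a sketch only; the ~1 Hz rate and the 72-80 TB figure are from the discussion, while the average file size is an assumed illustrative value):

      # Rough estimate of how long central deletion of the 900 GeV data takes
      # at the ~1 Hz rate Hiro quoted.  The average file size is an assumption
      # chosen only for illustration.
      DATA_TO_DELETE_TB = 80       # upper end of the 72-80 TB quoted above
      AVG_FILE_SIZE_GB = 0.2       # assumed ~200 MB per file (hypothetical)
      DELETION_RATE_HZ = 1.0       # roughly 1 file deleted per second

      n_files = DATA_TO_DELETE_TB * 1024 / AVG_FILE_SIZE_GB
      days = n_files / DELETION_RATE_HZ / 86400.0
      print("~%.0f files, ~%.1f days at 1 Hz" % (n_files, days))
      # -> ~409600 files, ~4.7 days; consistent with Kaushik's 4-5 day estimate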

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=95901
    
    1)  5/19: AGLT2, from Bob:
    An AT&T fiber was cut around 6:30pm. This caused a partial disruption between the MSU and UM machines of AGLT2, with the ultimate effect that MSU workers have no afs access at all (the OSGWN setup is in afs). I don't know what will happen with jobs actually running at MSU at this time.
    Jobs running at UM will run and complete fine, and dCache file servers at MSU are fine. I have therefore initiated a peaceful condor idle of all MSU worker nodes. This means we will run at reduced capacity until the fiber problem can be resolved.
    2)  5/19-20: New pilot version from Paul (44a), and minor patch (44b).  Details are here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-update-May19-20-44a_b.html
    3)  5/20: second half of the May 2010 reprocessing exercise has begun.  Status:
    http://atladcops.cern.ch:8000/j_info/
    4)  5/21: From Hiro:
    There was a change in the alias for LFC within the BNL CE hosts, to solve a network issue that was causing some jobs to fail under certain heavy traffic.  However, although it worked in testing, this change caused clients/jobs to fail with authentication errors.
    As a result, the alias was changed back to the original setting.  In the meantime, you will notice some jobs failing with authentication errors.
    5)  5/21: SWT2_CPB - file transfer failures like:
    FTS Retries [1] Reason [TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error500 500-Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914:500-open() fail500 End.]  From Patrick:
    The data server that was trying to store the transfers was misconfigured. The server was reconfigured and xrootd was restarted.  ggus RT 58418, RT 17002 (both closed), eLog 13137.
    6)  5/21-22: SWT2_CPB - A/C water leak in the machine room forced a power shutdown.  Once power was restored and the services brought back on-line test jobs succeeded - the site is now back up.   eLog 13002.
    7)  5/22: AGLT2 - low efficiency for file transfers.  Issue was heavy load on an SRM server, now resolved.  eLog 12997.
    8)  5/23: MWT2_UC - file transfer failures:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [LOCALITY]
    Source file .... locality is UNAVAILABLE].  From Sarah:
    Two of the pools at MWT2 went offline this morning due to memory issues.  They're back online now, and these transfers should start to succeed. 
    9)  5/23: MWT2_DATADISK low on free space:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase:
    [NO_SPACE_LEFT] at Sat May 22 19:17:57 CDT 2010 state Failed : space with id=2310037 does not have enough space]
    As of 5/25 47 TB free space now available.  Savannah 67809 (closed), eLog 13009.
    10)  5/23-26: WISC_DATADISK - file transfer errors.  From site admin:
    We have a power problem and all data servers are currently unavailable.  I have already submitted an OIM unscheduled downtime. Sorry for the problem.  We will make the service available as soon as possible once the power problem is solved.  Later:
    The problem was solved.  We will have a scheduled downtime tomorrow evening in the university to upgrade the power.
    On 5/25: After the power upgrade in the whole CS room, some of our servers failed to get an IP address. We are working on it now.  ggus 58444 (in progress), eLog 13110,13.
    11)  5/24: From John at NET2:
    Since there's been so little demand for production grid jobs over the past few days (today we ramped down to zero) I'm going to set HU_ATLAS_Tier2 to brokeroff so that we can perform some i/o tests without grid jobs interfering or getting harmed.  
    This should only be for about a day or so.
    12)  5/25: From Wei at SLAC, regarding problems with the SE:
    A data server went down at midnight. I got it back.  I think we also had some intermittent DNS issues due to the partial power outage today.
    13)  5/25: From Bob at AGLT2:
    I have stopped auto-pilots to AGLT2 and to ANALY_AGLT2 while we update the OSGWN version at our site.  I will let the remaining jobs here (63 at last count) complete to a great extent, update the distribution, then re-enable the pilots.
    Later:
    OSGWN version upgraded to 1.2.9 and tested.  Restarted queues.  Jobs are running cleanly.
    
    Follow-ups from earlier reports:
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...
    are there follow-on attempts or is this site-db configured?  
    Paul added to the thread in case there is an issue on the pilot side.  ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    Update, 5/4: Additional information posted in the ggus ticket.  Also, see comments from Paul.
    Update, 5/10: Additional information posted in the ggus ticket.
    Update, 5/17: Additional information posted in the ggus ticket.
    Update, 5/21: Additional information posted in the ggus ticket.
    (ii)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission 
    (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=96677
    
    1)  5/26: SE problem at SLAC, now resolved.  From Wei:
    A large amount of group data was put on one data server that I had put in service as we were about to run out of space. Now we are paying the price for that. I will open FTS and SRM for now with a reduced transfer rate (# of parallel transfers in FTS) and will put a cap on the number of analysis jobs.
    2)  5/27: AGLT2 - autopilot failures with the error "Failed to download/unpack pilotcode.tar.gz."  Issue resolved - from Bob: A missing routing table entry within Ultralight was repaired around 6pm.  All seems fine now.
    3)  5/27-28: From Charles at MWT2_UC:
    There has been a sudden loss of connectivity to the MWT2_UC cluster, looks like either a power or network disruption. We are investigating currently.  Later:
    The network problem has been resolved and MWT2_UC is back online.  eLog 13195.
    4)  5/27-28: Transfer errors at AGLT2:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries].  Issue resolved - from Bob:
    We have just doubled the RAM in our dCache head node, and modified the vm.swappiness parameter.   Response and load look good right now.  We will continue to monitor the system.  It had been io bound (iostat maxed) since yesterday.  eLog 13199.
    5)  5/29: Transfer errors at MWT2_UC:
    DEST SURL: srm://uct2-dc1.uchicago.edu:8443/srm/managerv2?SFN=/pnfs/uchicago.edu/atlasdatadisk/...
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries].  From Rob:
    The problem has been corrected and transfers can resume.  ggus 58622 (closed), eLog 13252.
    6)  5/30: From Michael at BNL:
    DDM reports some files being "unavailable" at BNL. We are investigating. Later:
    Investigations unveiled that dCache took 2 pools offline because of conflicting information about available space. The pools are operational again and transfers out of these pools resume.  eLog 13269.
    7)  5/31: Pilot update from Paul:
    Two additional minor patches were released today related to user jobs and file stager. Using the optional switch --accessmode=filestager[/direct] now updates the copysetup field correctly. The pilot is now always setting the runAthena option --lfcHost (requested by Tadashi Maeno et al.).
    A left-over test code snippet in v 44c released a few hours ago 
    forced file stager to be used which caused problems for user jobs on sites not supporting file stager, corrected in v 44c2. Now running.
    8)  5/31-6/1:  Transfer errors at AGLT2:
    DEST SURL: srm://head01.aglt2.org:8443/srm/managerv2?SFN=/pnfs/aglt2.org/atlascalibdisk/...
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Mon May 31 19:24:36 EDT 2010 state Failed : Marking Space as Being Used failed =>ERROR: duplicate key value violates unique constraint "srmspacefile_pkey"].  Issue resolved - from Shawn:
    In trying to add some files to the srmspacefile table an inconsistency was introduced such that dCache kept trying to use an existing record id to insert a new srmspacefile record. The srmspacemanagernextid table needed to be updated
    and dCache restarted to fix the problem. SRM transfers are again working at AGLT2 and this problem should be resolved.  ggus 58671 (closed), eLog 13370.
    9)  5/31-6/2: LHC outage.  This period is available for site maintenance periods as needed.
    10)  6/1: Maintenance outage at BNL -
    Major facility maintenance at the US Atlas Tier 1 facility at BNL will result in all services hosted at BNL being unavailable for four hours. The maintenance involves multiple services at BNL. 
    Services restored as of ~1:30 EST.  eLog 13341.
    11)  6/1: Maintenance outage at AGLT2 completed.  Test jobs successful, site set back to on-line.  eLog 13346.
    12)  6/2: Maintenance outage at MWT2_UC - from Aaron:
    We are taking a downtime June 2nd in order to update our networking infrastructure as well as upgrade to a new kernel on our worker nodes. We will be down from 9AM - 5PM CST, will send an announcement when we're back on-line.  (In progress.)
    
    Follow-ups from earlier reports:
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  
    Paul added to the thread in case there is an issue on the pilot side.  ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    Update, 5/4: Additional information posted in the ggus ticket.  Also, see comments from Paul.
    Update, 5/10: Additional information posted in the ggus ticket.
    Update, 5/17: Additional information posted in the ggus ticket.
    Update, 5/21: Additional information posted in the ggus ticket.
    Update, 5/26: A combination of recent pilot changes + updated WN client s/w at AGLT2 solved this problem.  ggus ticket 57186 closed.
    (ii)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) 
    off starting this afternoon, until we're ready to start back up,  which will be at least a week?  I'll also schedule a maintenance in OSG OIM, 
    which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    (iii)  5/23-26: WISC_DATADISK - file transfer errors.  From site admin:
    We have a power problem and all data servers are currently unavailable.  I have already submitted an OIM unscheduled downtime. Sorry for the problem.  We will make the service available as soon as possible once the power problem is solved.  Later:
    The problem was solved.  We will have a scheduled downtime tomorrow evening in the university to upgrade the power.
    On 5/25: After the power upgrade in the whole CS room, some of our servers failed to get an IP address. We are working on it now.  ggus 58444 (in progress), eLog 13110,13.
    Update, 5/26 p.m. - From Wen at WISC: The network problem is fixed. The SRM service is available now.
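    A side note on item 4 above (the AGLT2 SRM head node fix): lowering vm.swappiness tells the kernel to favor keeping application memory resident rather than swapping it out in favor of page cache, which helps a memory- and I/O-bound SRM/database head node. A minimal sketch of checking and setting it (the value 10 is illustrative, not necessarily what AGLT2 used):

      import subprocess

      # Show the current value (same as: cat /proc/sys/vm/swappiness)
      print(open("/proc/sys/vm/swappiness").read().strip())

      # Lower it for the running system (requires root); 10 is an illustrative value.
      subprocess.check_call(["sysctl", "-w", "vm.swappiness=10"])

      # To make the change persistent across reboots, add "vm.swappiness = 10" to /etc/sysctl.conf.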
    

DDM Operations (Hiro)

Throughput Initiative (Shawn)

Site news and issues (all sites)

  • T1:
    • last week(s): Planning to upgrade Condor clients on worker nodes next week. Force 10 code update. Continue testing DDN.
    • this week:

  • AGLT2:
    • last week: Watching space tokens. 100 TB free. New storage nodes at MSU - 8 MD1200. Will be purchasing head nodes. Upgraded OSG wn-client to 1.2.9; upgraded wlcg-client. Finding lcg-cp hangs of up to 8 hours (see the timeout sketch below).
    • this week: Condor 7.4; updated dCache. Found a throughput drop-off last night - traffic was bouncing between bonded NICs; fixed and restarted the network service. Backing up dCache metadata.
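    • A minimal sketch of guarding against the lcg-cp hangs mentioned above by running the copy under an external timeout (the one-hour limit and the SURLs are placeholders; this is not the mechanism AGLT2 or the pilot actually uses):

      import os, signal, subprocess, time

      TIMEOUT_SECONDS = 3600  # illustrative one-hour cap on a single copy

      def run_with_timeout(cmd, timeout):
          """Run cmd, killing it if it runs longer than timeout seconds."""
          proc = subprocess.Popen(cmd)
          deadline = time.time() + timeout
          while proc.poll() is None:
              if time.time() > deadline:
                  os.kill(proc.pid, signal.SIGKILL)  # give up on the hung transfer
                  proc.wait()
                  return -1
              time.sleep(5)
          return proc.returncode

      # Placeholder source/destination, for illustration only:
      rc = run_with_timeout(["lcg-cp", "-v", "srm://some.site.example/path/file",
                             "file:///tmp/file"], TIMEOUT_SECONDS)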

  • NET2:
    • last week(s): Dell storage arriving; focusing on networking.
    • this week: all is well, 110 TB free space; new storage still arriving;

  • MWT2:
    • last week(s): Working on lsm failures - examining performance. Older nodes cannot write to disk as fast as we transfer over the network; this has become the rate-limiting factor (not enough disk I/O capacity). Will be changing the scheduling of jobs to compute nodes.
    • this week: Analy-x queue work continues (distributed xrootd backend). Build jobs working, fixing config for LFC lookup. Cisco firmware update - testing under load.

  • SWT2 (UTA):
    • last week: A water leak in the machine room dumped power to the lab. During the downtime, put in an xrootd update to report space-token usage (a simple usage-summing sketch follows this site's entries). Will be bringing 200 TB online shortly. Network update: met with the CIO last week; now have an upgrade path. Ordered a dedicated 10G port for the switch in Dallas (~2 months). Bid out for fiber to Dallas (~3 months). (NLR is in Dallas, and I2 will be bringing 10G there.) Dallas to Houston is on the LEARN network. The preference would be direct connectivity.
    • this week: Space is tight - 60 TB free. Perc6 card issue on the new storage. Will soon bring 200 TB of storage online, estimated tomorrow.
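    • A minimal sketch of the kind of space-token usage report mentioned above, approximating usage by summing file sizes under the token's directory tree (the path and token name are placeholders; this is not the actual mechanism SWT2 deployed):

      import os

      # Placeholder mapping of space token -> namespace path (hypothetical)
      TOKEN_PATHS = {"DATADISK": "/xrd/atlasdatadisk"}

      def used_bytes(path):
          """Walk the namespace and sum file sizes under path."""
          total = 0
          for dirpath, dirnames, filenames in os.walk(path):
              for name in filenames:
                  try:
                      total += os.path.getsize(os.path.join(dirpath, name))
                  except OSError:
                      pass  # file disappeared between listing and stat
          return total

      for token, path in TOKEN_PATHS.items():
          print("%s: %.1f TB used" % (token, used_bytes(path) / 1e12))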

  • SWT2 (OU):
    • last week: Ready to bring OU back online. Waiting to be added to ToA, then ready for testing.
    • this week: Almost there - few issues with ATLAS release location, in contact with Mark; suggests cc'ing Alden in reply.

  • WT2:
    • last week(s): Ordering next storage - consulting w/ Shawn very helpful. Decided to go with Dell even if low-density. 15 R610 (to save a little space), 45 MD1000. 48GB memory.
    • this week: Running out of space. Deletion is ongoing but slow (perhaps due to small files?).

Carryover issues ( any updates?)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week(s)
    • Instructions for installing Squid start here - SquidTier2
    • Updates to SiteCertificationP13 upon completion
    • Notification to Fred for testing
    • Notification to John for completion, and update of the https://twiki.cern.ch/twiki/bin/view/Atlas/T2SquidDeployment deployment page
    • Problems with sites not failing over to the secondary Squid launchpads?
    • Fred will test actual failover for a DNS round-robin server for Squid at AGLT2.
    • Tier 3's will need backup - associate with a Tier 2
    • Monitor the Squid: Cacti monitoring, Monit monitoring, Nagios plugin (general, or squid-frontier); is there anything for Ganglia? (See the probe sketch below.)
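    • A minimal liveness probe, as a complement to the monitoring options above: fetch a small URL through the proxy and check the response, which is easy to wrap as a Nagios plugin or a cron alarm. A sketch with placeholder proxy and test URLs (not a site's real configuration):

      import socket
      import sys
      import urllib2

      SQUID_PROXY = "http://squid.mysite.example.org:3128"   # placeholder proxy host:port
      TEST_URL = "http://frontier.example.org:8000/"          # placeholder URL served via the proxy

      def check_squid(proxy, url):
          """Return 0 (OK) if the proxy serves the URL, 2 (CRITICAL) otherwise."""
          socket.setdefaulttimeout(10)  # don't let the probe itself hang
          opener = urllib2.build_opener(urllib2.ProxyHandler({"http": proxy}))
          try:
              opener.open(url).read(1024)
              print("SQUID OK")
              return 0
          except Exception as err:
              print("SQUID CRITICAL: %s" % err)
              return 2

      sys.exit(check_squid(SQUID_PROXY, TEST_URL))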
  • this week
    • Fred is testing SLAC. AGLT2 tests worked.
    • John: the current fail-over policy is that if the local Squid fails, a job tries another T2's Squid; this is not working. A new policy is under discussion: fail over to the Frontier server at the cloud's T1, then to CERN, and if that also fails, fail the job. Silent fail-overs to the Oracle database are being seen.
    • It's not understood how the configuration determines the failover choice (a configuration sketch follows below).
    • Fred will investigate further.
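    • A sketch of how an intended failover order can be expressed in the Frontier client configuration, typically via the FRONTIER_SERVER setting, which the client reads as an ordered list of proxy and server entries. The host names below are placeholders, not the actual US cloud configuration:

      import os

      # The client works through the proxyurl and serverurl entries in the
      # order listed; all host names here are hypothetical.
      os.environ["FRONTIER_SERVER"] = (
          "(serverurl=http://frontier.t1.example.org:8000/frontierATLAS)"    # cloud T1 launchpad
          "(serverurl=http://frontier.backup.example.org:8000/frontierATLAS)" # last-resort server
          "(proxyurl=http://squid.mysite.example.org:3128)"                   # local site Squid
          "(proxyurl=http://squid.backup-t2.example.org:3128)"                # backup T2 Squid
      )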

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last meeting
    • BNL available for testing. OU will also be available soon. Will keep bugging Alessandro.
  • this meeting:
    • No update

AOB

  • last week
  • this week


-- RobertGardner - 01 Jun 2010
