
MinutesJul7

Introduction

Minutes of the Facilities Integration Program meeting, July 7, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Dave, Michael, Kaushik, Rob, Charles, Rik, Saul, Wei, Bob, John, Jim, Armen, Mark, Karthik
  • Apologies: Fred, Hiro, Horst

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • For the analysis queue benchmark we ran a round of HC stress tests over the sites
      • Release 15.6.6
      • Output DS: user.elmsheus.hc.10000267.*
      • Input DS Patterns: mc09*merge.AOD*.e*r12*
      • Athena User Area: /data/hammercloud/atlas/inputfiles/muon1566/MuonTriggerAnalysis_1566.tar.gz
      • Athena Option file: /data/hammercloud/atlas/inputfiles/muon1566/MuonTriggerAnalysis_1566.py
      • ANALY_AGLT2, SWT2_CPB, SLAC, NET2: http://hammercloud.cern.ch/atlas/10000220/test/
      • ANALY_MWT2, MWT2_X, SWT2_CPB, SLAC, NET2, BNL_ATLAS_1: http://hammercloud.cern.ch/atlas/10000241/test/
      • Finding average wall-time efficiencies between 40% and 62%
      • SLAC and SWT2_CPB efficiencies were especially good (high 50s, 50%)
    • this week
      • Beam commissioning work at the LHC has been successful - good runs, with instantaneous luminosity up to 10^30 cm^-2 s^-1; some stability issues, but good progress.
      • No news yet on the next re-processing campaign.
      • WLCG collaboration meeting at Imperial College London. Discussions were dominated by storage and data transfers. Fractional data access below the file level is being discussed; the over-arching theme is caching rather than large-scale pre-placement, and how to use the existing resources more efficiently. Many demonstrator projects have been proposed as follow-up to the Amsterdam brainstorming meeting, so the next year should be interesting. Xrootd is a big topic all over the place: a global redirector is being pushed by CMS, and with plugins other storage backends could be used. People are looking for industry standards; NFS v4.1 is promising, and a CERN-DESY partnership has formed. The idea is to optimize wide-area transfers as part of the data access/replication mechanisms available in NFS v4.1; the client and other missing pieces are becoming available. Wei is setting up a global redirector at SLAC. "Global dynamic inventory".
      • Quarterly reports are due. Part of this is to update the facilities spreadsheet (see CapacitySummary). Questions/discussion:
        • What is the required accuracy? From Bob: the AMD 275 values reported are 6.08 and 7.01 (the 7.01 from WLCG), a 14% variation. The E5335 has the second-biggest discrepancy in our sheets, with 5.60 and 6.28 in use, a 12% variation. (A small arithmetic check of these variations is sketched after this list.)
        • Presently we do not have all processor types (and server types) benchmarked. Should each T2 perform these benchmarks itself? (The spec.org educational site license is $200.)
        • Alternatively, use the measurements reported by the HS06 committee or the HEPiX group. Relevant links:
        • http://www.spec.org/benchmarks.html
        • http://w3.hepix.org/benchmarks/doku.php (table links further down the page)
        • cf. Bob's report on MinutesJune16
        • cpu2006-hepspec06-table.jpg
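        • For reference, a minimal arithmetic check of the quoted spreads (not from the meeting; the helper name and the two conventions below are illustrative only). The 14% figure matches the spread taken relative to the mean of the two values, while the 12% figure matches the spread relative to the lower value:

def variation_percent(a, b):
    # Spread between two HS06-per-core figures for the same CPU, expressed
    # two ways, since the quoted 14% and 12% numbers appear to use
    # different conventions.
    lo, hi = sorted((a, b))
    spread = hi - lo
    return {
        "vs_lower": 100.0 * spread / lo,                # relative to the lower value
        "vs_mean": 100.0 * spread / ((lo + hi) / 2.0),  # relative to the mean
    }

for cpu, values in {"AMD 275": (6.08, 7.01), "E5335": (5.60, 6.28)}.items():
    v = variation_percent(*values)
    print("%s: %.1f%% vs lower, %.1f%% vs mean" % (cpu, v["vs_lower"], v["vs_mean"]))
# AMD 275: 15.3% vs lower, 14.2% vs mean
# E5335: 12.1% vs lower, 11.4% vs mean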

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group TWikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Busy working on procedures for Tier 3
this week:
  • Have finalized a procedure for installing a Tier 3 from scratch - it takes only one 8-hour day.
  • Tier-3 Panda support to be added when Doug returns
  • Have been running functionality tests for Tier 3s (Condor job submission, grid submission, etc.)
  • Xrootd demonstrator project - Doug is setting up machines; this will happen next week
  • The manageTier3SW package will install all the ATLAS-related software, to provide a uniform look and feel

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=99883
    
    1)  SWT2_CPB: beginning on 6/22 issues with gatekeeper being heavily loaded, apparently due to activity from the condor grid monitor agents.  Ongoing discussions with Xin, Jamie to diagnose this problem.
    2)  6/23: SLAC - ~half-day outage to patch and restart a NFS server hosting ATLAS releases.  Completed.  eLog 13990.
    3)  6/24: SLAC - DDM errors such as:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_
    ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries]
     From Wei:
     We have blacklisted SLACXRD_GROUPDISK because it ran out of disk space. However, the effect will not be immediate. Our SRM will likely be overrun once in a while due to the large number of srmGetSpaceTokens and srmGetSpaceMetadata requests in a short period, because there are still a large number of transfer
     requests already in the DQ2 SS queues for SLACXRD_GROUPDISK.
    ggus 59384 (closed), eLog 13991.
    4)  6/25: Job failures at NET2 with errors like:
    25 Jun 14:24:09 | /atlasgrid/Grid3-app/atlas_app/atlas_rel/15.8.0/cmtsite/setup.sh: No such file or directory
    25 Jun 14:24:09 | runJob.py   | !!WARNING!!2999!! runJob setup failed: installPyJobTransforms failed
     From Saul:
     I suddenly realized that I probably caused this problem myself.  Around the time of the errors, I was making an egg demo that would have been hitting the release file system pretty hard ( http://egg.bu.edu/15.8.0-egg/index.html ).
     We'll confirm, but there was probably nothing else wrong.  No additional failed jobs observed.
    5)  6/25: BNL - file transfer errors such as:
     FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long].  From Hiro:
     BNL dCache does not allow logical file names longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of the file names (and should not put all of a file's metadata in the filename itself).
     I have contacted the DQ2 developers about limiting the length.
    Savannah 69217, eLog 14016.
    6)  6/26: BNL - jobs stuck in the "waiting" state due to missing file data10_7TeV.00155550.physics_MinBias.recon.ESD.f260._lb0156._0001.1 from dataset data10_7TeV.00155550.physics_MinBias.recon.ESD.f260.  Pavel subscribed BNL, and the files are now available.  Savannah 69223, eLog 14024.
    7)  6/28-29: OU_OCHEP_SWT2 - file transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3
    tries].  From Horst:
     [Our head node, tier2-01, had crashed with a kernel panic.  It's back up now, and a couple of SRM and LFC test commands succeeded.  Can you please try again, and close this ticket if it works now?]
    No recent errors of this type observed - ggus 59490 & RT 17324 (closed), eLog 14118.
    8)  6/28-29: NET2 - job failures with the error:
    pilot: installPyJobTransforms failed: sh: /atlasgrid/Grid3-app/atlas_app/atlas_rel/15.6.10/cmtsite/setup.sh: No such file or directory.  From Saul:
    One of our sysadmins mistakenly mounted an old copy of the releases area on some of our nodes, so it's not surprising that a recent production cache is missing. These nodes have been taken offline and the correct releases will be remounted. I'll keep the ticket open until these nodes are back with the right mounts.
    Issue resolved, ggus 59439 (closed), eLog 14078.
    9)  6/29: Proxies on all DDM VO boxes at CERN expired.  Proxies renewed as of ~8:30 a.m. CST.  eLog 14109.
     10)  6/29: MWT2_IU maintenance outage completed as of ~4:00 p.m. CST.  From Sarah:
    We will be applying software updates to the MWT2_IU gatekeeper, with the goal of performance and job scheduling improvements.
    11)  6/30: Apparently a recurrence of the "long filename" issue in 5) above.  From Michael:
    The problem is caused by an overload situation of the storage system's namespace component. The likely reason for the high load are user analysis jobs requesting to create files with very long file names. These operations fail and are retried at high rate.  Experts are looking into ways to fix the problem.  
    ggus 59567 (closed), eLog 14141/33.
    
    Follow-ups from earlier reports:
    (i)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    Update 6/10, from Horst:
    We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system.
    Update 6/17: test jobs submitted which uncovered an FTS issue (fixed by Hiro).  As of 6/22 test jobs are still failing with ddm registration errors - under investigation.
    (ii)  6/12: WISC - job failures with errors like:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir.  Update 6/15 from Wen:
     The CS room network was still disrupted by a power cut and the main servers are still not accessible. For now I have brought up some other machines to run these services; I hope to get the main server back as soon as possible.  Savannah 115123 (open), eLog 13790.
    (iii)  6/19 - 6/20: NET2 - "No space left on device" errors at NET2_DATADISK & MCDISK.  From John & Saul:
    There has been a big burst of data arriving at NET2 and our DATADISK and MCDISK space tokens have run out of space. Armen and Wensheng have been helping us with this today, but since we can write data very fast, our space tokens can fill up very quickly.  There is far more subscribed than free space, 
    so we need some DDM help.  ggus 59220 (in progress), eLog 13888/90.
    Update, 6/26: additional space made available, MCDISK and DATADISK now o.k.  ggus 59220 closed.
    
    • Not much production over the past week.
    • OU: need scheddb updates, getting closer.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=100604
    
     1)  7/1: SWT2_CPB - another issue with the internal cluster network switch stack.  The problem was traced to a defective switch, which was replaced, and the issue was apparently resolved.  Test jobs were successful; site set back on-line.  eLog 14198.
    2)  7/2 early a.m.: BNL - SRM service restarted - from Pedro:
    The dcache namespace got overloaded again.  Stopping the srm service decreased the load.  We're starting srm again and we'll be performing another investigation of why we're having problems again with our storage services.  
    Please expect a lot of failures to be showing up in dashboard for the next hour.  eLog 14192.
     3)  7/3, early a.m.: BNL - overnight shifter noticed ~120 jobs with stage-in/out errors.  Possibly related to the issue reported the previous morning?  Seemed to be a transient problem, as no further errors were observed beyond the original set.  eLog 14211.
    4)  7/4 - 7/6: MWT2_UC - DDM errors:
    Failed to contact on remote SRM,  MWT2_UC_PERF-JETS space token.  From Aaron (7/6 p.m.):
     This was due to a load condition on our storage systems (uct2-dc1 and uct2-dc2) which caused many transfers to time out, and a few to EOF.  These transfer errors ceased after the number of transfers decreased. We are currently doing another operation which may add some more load to these services, but this should be complete by the end of this evening and no further transfer errors are expected.  ggus 59690 (still open), Savannah 69500, eLog 14234/306.
    5)  7/6: BNL - DDM errors:
    Source file [srm://dcsrm.usatlas.bnl.gov/pnfs/usatlas.bnl.gov/BNLT0D1/data10_7TeV/ESD/...]: locality is UNAVAILABLE]
    ACTIVITY: Data Consolidation.  From Pedro:
     One of our storage servers was automatically rebooted since it was malfunctioning.  The dCache pools on that server were expected to be unavailable while they performed a self-check.  The server has now been online for quite some time and BNL efficiency on the dashboard is 99%,
     so I'm closing the ticket.
    ggus 59761 (closed), eLog 14332, Savannah 115530.
    6)  7/6: UTD-HEP - hardware maintenance complete, test jobs successful - site set back to on-line.  eLog 14304.
    7)  7/6: BNL - power outage - various updates from Michael:
     (i) The primary reason for transfer (and other) service failures is a partial loss of electrical power. One of the
     UPS systems tripped, indicating a current overdraw. BNL electricians are investigating the
     problem.
     (ii) Power was restored. We are in the process of completing the restoration of the services. The first transfers have resumed
     already, but expect failures with different patterns for the next ~hour until things have stabilized.
    (iii) Following the restoration of power all affected services were restarted and are operational again. Experts will keep
    monitoring the systems.  eLog 14313.
    8)  7/6 - 7/7: WISC, DDM errors:
    failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  From Wen:
     The problem is fixed.  It was caused by a vdt-update-certs-wrapper crash: it moved the certificates directory to "old" but failed to fetch the latest certificates version directory. As a result, BeStMan could not find any CA certificates directory with which to validate user certs.  ggus 59800 (closed), eLog 14328.
    9)  7/7: NET2 outage on Thursday, 7/8 - from Saul & John:
    ANALY_NET2, BU_ATLAS_Tier2o and HU_ATLAS_Tier2 will be down tomorrow to prepare our machine room for new storage racks.
    
    Follow-ups from earlier reports:
    (i)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    Update 6/10, from Horst:
    We're working on release installations now with Alessandro, and there seem to be some problems, possibly related to the fact that the OSG_APP area is on the Lustre file system.
    Update 6/17: test jobs submitted which uncovered an FTS issue (fixed by Hiro).  As of 6/22 test jobs are still failing with ddm registration errors - under investigation.
     Update 7/6: additional incorrect entries in schedconfigdb were discovered and fixed (remnants of the pre-upgrade settings).  Will try new test jobs.
    (ii)  6/12: WISC - job failures with errors like:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ...: Permission deniedRef-u xrootd /bin/mkdir.  Update 6/15 from Wen:
     The CS room network was still disrupted by a power cut and the main servers are still not accessible. For now I have brought up some other machines to run these services; I hope to get the main server back as soon as possible.  Savannah 115123 (open), eLog 13790.
    (iii)  SWT2_CPB: beginning on 6/22 issues with gatekeeper being heavily loaded, apparently due to activity from the condor grid monitor agents.  Ongoing discussions with Xin, Jamie to diagnose this problem.
     Update: we believe this issue has been resolved.  Several factors may have been involved, but restarting the pbs service in a clean state certainly helped: the system was probably slow in reporting the states of batch jobs to the Condor submitter at BNL,
     which led to an overload in the number of pilots coming into the site.  We'll continue to monitor the situation.
     (iv)  6/25 (item 5 from the last report): BNL - file transfer errors such as:
     FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Fri Jun 25 11:31:10 EDT 2010 state Failed : File name is too long].  From Hiro:
     BNL dCache does not allow logical file names longer than 199 characters. I have canceled these problematic transfers since they will never succeed. Users should reduce the length of the file names (and should not put all of a file's metadata in the filename itself).
     I have contacted the DQ2 developers about limiting the length.  Savannah 69217, eLog 14016.
    7/7: any updates on this issue?
    
     • Trouble getting pilots at SLAC - is there a known problem somewhere? 10K jobs activated, but only 100 running. Need to follow up with Xin.

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Network asymmetries - is ESnet involved? Dave (Illinois) is investigating; possible issue with a campus switch
    • May need to get path-providers involved
    • Hiro has FTS instrumented so we can get information on transfer times. Will track overhead. Will check concurrency.
    • Alerting - will be trying to get this to be part of the system. Check for deviations from expected behavior.
    • Notes from meeting:
      	USATLAS Throughput Meeting - June 29, 2010
      	==========================================
      
      Attending: Shawn, Andy, Dave, Sarah, Aaron, Saul, Karthik, Hiro, Philippe
      Excused: Jason
      0) Action items...reminder about the list from the agenda
      1) perfSONAR status:  The new Dell R410 is online with the current perfSONAR release and will be used for testing.  Work is ongoing on improving perfSONAR responsiveness: the next release has some SQL improvements optimizing data access.  That release (v3.2) is currently under "internal" testing; USATLAS can start testing it in a few weeks.
      Issues for discussion from perfSONAR results:  
      	a) Sarah noted UC has a longstanding issue with outbound traffic larger than inbound (same as OU).
      	b) Problem with services at OU stopping.  Possibly just a display issue?  That doesn't seem to be the case, since there are matching discontinuities in the data.  Log files have been sent; still awaiting resolution.
           c) Philippe noted that MSU-UC testing shows larger bandwidth "outbound" (900 MSU-UC vs. 600 UC-MSU) while for MSU-OU it is opposite (400 MSU-OU vs. 800 OU-MSU).  Checking the UM and MSU perfSONAR nodes shows:
      	from MSU and UM
      	    AGL to UChicago > UChi to AGL
      	    AGL to OU       < OU   to AGL
      	looking from BNL, same thing
      	    BNL to UChicago > UChi to BNL
      	    BNL to OU       < OU   to BNL
      	looking from IU, same thing
      	    IU  to UChicago > UChi to IU
      	    IU  to OU       < OU   to IU
           d) David noted that the campus network is slow (perfSONAR found this issue) and the campus was notified; the problem was then identified as a missed firmware upgrade, to be fixed tomorrow.  There is an ICCN vs. ESnet difference: the path over ESnet shows asymmetry.  Testing to ESnet (Chicago) is good from Illinois, but testing to the next "hop" (Cleveland) shows a big asymmetry (Ill->Cleveland 900 Mbps, but Cleveland->Ill is 200-400 Mbps).
      2) Transaction rate capability:  Hiro has created new plots of FTS details; follow "Show Plots" in the list below.  A sample test has been done: transfer 1K files of 1 MB each.  The first plot is the
      histogram of transfer time per file, while the second plot is the histogram of transfer time plus time in the queue.  The second is therefore indicative of SRM performance, although it is mostly controlled
      by the FTS channel parameter for the number of concurrent transfers.
      In any case, you can generally see how many files your site should be able to handle (if the files are small).
      AGLT2
      http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=8b747104-83a9-11df-9d63-f0c52a5177e3 
      UC
      http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=8c56cffa-83ab-11df-9d63-f0c52a5177e3 
      SLAC
      http://www.usatlas.bnl.gov/fts-monitor/ftsmon/showJob?jobid=f69bca7f-83ad-11df-9d63-f0c52a5177e3 
      Other sites will follow (being done right now).  Hiro will set up the same concurrency for this test, and will also add a plot of "overhead" time per transfer (an illustrative sketch of that estimate follows these notes).
      3) Alerting progress:   Shawn or Sarah?     No progress yet.  Andy will send documentation links for using clients to access perfSONAR data (Hiro inquired about how to best access the existing data).
      4) Site reports:  Open forum for reports on throughput-related items.  Hiro reports that PanDA will now be subscribing data and will potentially be getting datasets from sources outside the hierarchy.  This means that better monitoring and debugging is critical.  The default will be to rely on DQ2 to select the source for transfers.  Will need watching.
      Please send along any corrections or additions to the mailing list.   Next meeting in 2 weeks (July 13).
      Shawn
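
      A minimal, illustrative sketch of the "overhead per transfer" estimate mentioned in the notes (this is not Hiro's monitor code; the per-file durations, the assumed per-stream rate, and the helper name are hypothetical):

# Estimate per-file overhead from an FTS test of many small files, as in the
# 1K x 1 MB test described in the notes: the time not explained by raw data
# movement approximates the per-file SRM/FTS overhead.
# All values below are illustrative placeholders, not measurements.

FILE_SIZE_MB = 1.0         # size of each test file (MB)
ASSUMED_RATE_MBPS = 100.0  # assumed achievable per-stream rate (MB/s)

def overhead_seconds(duration_s, size_mb=FILE_SIZE_MB, rate=ASSUMED_RATE_MBPS):
    """Per-file time not spent actually moving bytes."""
    ideal = size_mb / rate
    return max(0.0, duration_s - ideal)

# Example per-file transfer durations (seconds), as one might read them off
# the FTS monitor page for a single test job.
durations = [2.3, 1.9, 4.1, 2.0, 2.6]
overheads = [overhead_seconds(d) for d in durations]
print("mean overhead per file: %.2f s" % (sum(overheads) / len(overheads)))

      Comparing these overhead distributions across sites, with the same FTS channel concurrency everywhere (as Hiro plans), helps separate SRM performance from channel settings.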

  • this week:

Site news and issues (all sites)

  • T1:
    • last week(s): New worker nodes have arrived and been installed: 247 machines, ~2000 cores. Lots of improvement in DDN over the past week; will go forward with this system. Tier 1 totals: ~5K jobs, 6 PB storage. The filename-length issue (the 199-character, PNFS-specific limit) caused job failures; Pedro has fixed this in the local-site-mover script, which now aborts such requests (an illustrative length check is sketched after this item).
    • this week: Network people are connecting the rest of the worker nodes. 20 Nexsan units are being racked, awaiting front-end servers (IBM servers that will run OpenSolaris and ZFS). DDN is doing well. Partial power failure yesterday: lost a flywheel UPS, losing 1 MW. On a positive note, people on holiday responded quickly and brought back 5K disks in less than an hour.
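
    A minimal, illustrative pre-check for the 199-character limit noted above (this is not Pedro's actual local-site-mover change; the function name and example file name are hypothetical):

# BNL dCache/PNFS rejects logical file names longer than 199 characters
# (per Hiro's report in the shifters section). Checking the name up front
# lets a bad transfer be refused immediately rather than failing repeatedly
# in TRANSFER_PREPARATION.

MAX_LFN_LENGTH = 199  # reported PNFS limit at BNL

def check_lfn_length(lfn):
    """Raise if a logical file name exceeds the reported PNFS limit."""
    if len(lfn) > MAX_LFN_LENGTH:
        raise ValueError(
            "logical file name is %d characters, exceeding the %d-character "
            "dCache/PNFS limit" % (len(lfn), MAX_LFN_LENGTH)
        )

# Hypothetical example: a user output name that packs tags and metadata into
# the file name itself; names over 199 characters would be caught here.
check_lfn_length("user.someuser.mc09_7TeV.merge.AOD.e510_s624_r1234.myAnalysis.root")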

  • AGLT2:
    • last week: All going okay for the past week. Debugging the path-too-long issue.
    • this week: Running smoothly over the past weeks; good, since we've been away. Alignment splitter work. A new baby boy (Zahary) at AGLT2 - congratulations to Tom!

  • NET2:
    • last week(s): A production issue last week with old releases getting mounted for a short while. Preparing the machine room for new storage; expect a one-day downtime next Thursday.
    • this week: Will be taking a downtime tomorrow for machine room work. SRM transfers were interrupted for a short while last week (a user cert issue in the ATLAS VO). A new server at the HU site will start up the analysis queue (2K job slots).

  • MWT2:
    • last week(s): Testing a static script for the OSG setup - seems to work okay. Maui adjustments for the new scheduling-order setup at IU; will do the same for UC (to minimize startup latency for analysis jobs, since multiple jobs stage data simultaneously on the same node). Also have been testing the ANALY_MWT2_X queue.
    • this week: Had a number of DDM transfers failing over the weekend; investigating - a large number of requests led to overloading of the dCache head nodes, and we also suspect the srmspacefile table is out of sync with PNFS. Brought another data server online (156 TB); another server was drained and its RAIDs rebuilt (still in progress - expect it online by week's end, adding another 156 TB; see the monitor at http://www.mwt2.org/sys/space). Problems with software RAID-0/disk issues on about 32 compute nodes led to filesystem errors and ANALY job failures - reverting to the previous configuration and studying the problem. Studies continue on xrootd using ANALY_MWT2_X. Several comparison HammerCloud tests have been run (see the attached charts).

  • SWT2 (UTA):
    • last week: Having an issue with the Grid Monitor not giving correct status, so Condor-G then runs into trouble. Looking at an auto-adjuster for nqueue when there are no jobs. Preparing to remove files that are not associated with the storage area.
    • this week: Found and replaced a switch in the stack (bad flash); operations have been stable since. Otherwise all is well. Received notice of a missing rpm on the compute nodes - tracked down.

  • SWT2 (OU):
    • last week: Still waiting for scheddb to be updated, then on to the next step - getting very close to being finished.
    • this week: Bringing the new OU cluster online - updating the LFC host name; the next step is to run more test jobs. Outage next Wednesday for UPS work.

  • WT2:
    • last week(s): Making progress on new disk acquisition. Wei on vacation.
    • this week: Want to make space in groupdisk - doing some deletions; automatic deletion has not started. Had an AFS server failure, fortunately with no impact (no transfers in flight at the time, no job failures). Setting up the global xrootd redirector for Tier 3 testing; there is no technical specification for this yet - in discussions with Andy. Setting up a PROOF cluster with Dell R510s (12 disks each). Sent email to Walker about 6-core chips - expect an update early next week.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issues of the validation process, completeness of releases at sites, etc.
  • last meeting
    • Email alerts for release installation problems (w/ Alessandro's system): https://atlas-install.roma1.infn.it/atlas_install/protected/subscribe.php
    • OU, Illinois-T3, BNL testing
    • OU - waiting for full panda validation
    • Illinois - problems with jobs sent via WMS - lag in status updates
    • BNL - Alessandro tested poolfilecatalog creation - there were problems with the environment; Xin provided a patch.
    • Waiting for feedback from Alessandro
  • this meeting:

dCache local site mover plans (Charles)

last week:
  • Meeting with Pedro and Shawn - to create a unified local site mover
  • Been on vacation

this week:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

AOB

  • last week
  • this week
    • No meeting this week due to Tier 2 cost/benefit discussion at CERN.

(HammerCloud comparison charts referenced above are attached: chart.png, chart-1.png, chart-2.png, chart-3.png, chart-4.png, chart-5.png)


-- RobertGardner - 06 Jul 2010
