
MinutesMay19

Introduction

Minutes of the Facilities Integration Program meeting, May 19, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Saul, Tom, Sarah, Torre, Fred, John DeStefano, Wei, Bob, Patrick, John Brunelle, Karthik, Michael, Doug, Rik, Nate, Charles, Fred, Hiro, Shawn, Xin, Wensheng
  • Apologies: Horst

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (9:30am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
    • this week
      • Updates to SiteCertificationP13 for fabric upgrade plans and Squid updates - due
      • WLCG data: WLCGTier2May2010
      • WLCG storage workshop: General registration for the Jamboree on the Evolution of WLCG Data & Storage Management is now open. The event page can be found at http://indico.cern.ch/conferenceDisplay.py?confId=92416. Please note the registration and payment details (EUR 120), as well as the event location (central Amsterdam). Cheers, Jamie
      • Last week's agency review by NSF/DOE - computing somewhat shorter than usual
        • A number of questions - why is OSG important, etc - should continue
        • Tier 2 performance and contributions - importance to the physics program
        • Full copy of ESD and AODs at the Tier 1
        • A lot of attention to M and O, detector upgrades; "computing is working"
      • New era in operations - lots of data subscriptions: underscores the need to get to the storage pledges at the Tier 2s (1100 TB). Important for distributing the analysis load across the facility
      • Next year - even more storage - and the ramp-up is steep at the Tier 2s: 1600 TB next year, 20 PB total T2 in '12
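        As a rough illustration of the ramp implied by these figures (1100 TB aggregate Tier 2 pledge now, 1600 TB next year), a minimal arithmetic sketch; the even quarterly split is an assumption for illustration only.

            # Sketch of the Tier-2 storage ramp implied by the pledge figures above.
            # The even quarterly split is an assumption, not a deployment plan.
            current_tb = 1100    # aggregate US Tier-2 pledge this year (TB)
            next_year_tb = 1600  # aggregate pledge next year (TB)

            growth = next_year_tb - current_tb
            print(f"Additional capacity needed: {growth} TB "
                  f"({100 * growth / current_tb:.0f}% increase)")
            print(f"Roughly {growth / 4:.0f} TB per quarter if deployed evenly")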

MCDISK pausing to speed up reprocessing distribution to Tier 2s (Hiro)

  • Proposal to pause MCDISK subscriptions; allow reprocessed datasets to transfer

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here

last week(s):

this week:
  • Continuing with the program - all services needed for a Tier 3g are ready, so people can start building their Tier 3s and doing analysis
  • Documentation
  • Final re-build of test cluster
  • June 8 - Phase 1 Tier 3 ready
    • manageTier3SW - ROOT, pacman, wlcg-client-lite, xrootd, dq2, etc. (a readiness-check sketch follows this list)
    • ATLAS code and conditions - CVMFS
    • Condor plus ARCOND wrapper, and Tier 3 Panda
    • Xrootd for data management - plus additional tools
    • dq2-client, Bestman SRM, dq2-FTS (Hiro)
      • waiting on a new version of dq2-client; estimate: next week
      • plugin will allow directory creation; can also use uberftp
      • may go beyond June 8
    • expect funding to arrive at grant offices in mid-July
    • Panda Tier 3 - Torre needs to clean up installation procedure; very soon
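      A minimal readiness-check sketch for the component list above; the command names probed here (root, xrdcp, dq2-get, condor_q, pacman) are illustrative assumptions, not the official Tier 3g validation procedure.

          #!/usr/bin/env python3
          """Rough readiness check for a Tier 3g node (illustration only)."""
          import shutil

          # Command names are assumptions; adjust to match what manageTier3SW installs.
          EXPECTED = {
              "root": "ROOT",
              "xrdcp": "xrootd client",
              "dq2-get": "dq2 client",
              "condor_q": "Condor",
              "pacman": "pacman",
          }

          missing = []
          for cmd, label in EXPECTED.items():
              path = shutil.which(cmd)
              print(f"{label:15s} ({cmd}): {path or 'MISSING'}")
              if not path:
                  missing.append(cmd)

          if missing:
              print("Not ready; missing:", ", ".join(missing))
          else:
              print("All listed components found on PATH")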

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Data replication much better understood now. MWT2 going well. AGLT2 - going well, little backlog. SLAC - good shape; NE - GPFS configuration issue solved, much improved; SW - limited 1 Gbps network (otherwise no intrinsic issues).
    • Discussion w/ ADC - issue of priorities. MC data coming in at 2-3x the rate of real data (by size). No way within DQ2 to adjust - everything coming in with the default share. Reprocessed data caused a bottleneck.
    • Simone and Hiro implemented a shares/priority solution.
    • Wensheng did some manual interventions - re-subscribing some datasets without specifying a source, which allowed replication between T2s.
    • Lot of work over past week to make sure we have enough space.
  • this week:
    • Charles: PRODDISK cleanup discussion on-going, to move under central control; Note - we're using PRODDISK-cleanse because of Pandamover datasets.
    • Waiting on final word from Kaushik
    • Shawn: has a situation with LFC entries missing as a result of mis-identification as ghosts. Re-subscribe? About 10 TB of data. Charles will follow up.
    • USERDISK - needs to be cleaned up - Hiro will do this for all the sites
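      A generic sketch of the kind of age-based sweep discussed above; this is not PRODDISK-cleanse itself, and the mount point and 30-day threshold are assumptions. It only lists candidates and deletes nothing.

          #!/usr/bin/env python3
          """List dataset directories older than a cutoff as cleanup candidates (sketch)."""
          import os
          import time

          STORAGE_ROOT = "/pnfs/example.edu/atlasproddisk"  # hypothetical path
          MAX_AGE_DAYS = 30                                 # assumed threshold

          cutoff = time.time() - MAX_AGE_DAYS * 86400
          for entry in sorted(os.listdir(STORAGE_ROOT)):
              path = os.path.join(STORAGE_ROOT, entry)
              # mtime of the dataset directory is used as a rough age proxy
              if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
                  print("cleanup candidate:", entry)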

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=94637
    
    1)  5/6 p.m.: AGLT2 - problem with the dCache headnode - from Shawn:
    Ok..just restarting dCache. The node has been up a while and I have been looking around for problems.  Not sure what happened to the node...some kind of hardware issue or OS lockup.  Anyway I have the power port mapping information so future occurrences should be quicker to deal with.  Tomorrow we will investigate alternative hardware. 
    It turns out the SSD we want to use won't work in this system as it is configured.  
    For now dCache should be operational again shortly.  eLog 12327/45.
    2)  5/6: Transfer errors at SLAC such as:
    2010-05-06 05:19:50 DESD_MET.131664._000195.pool.root.1 FAILED_TRANSFER
    DEST SURL: srm://osgserv04.slac.stanford.edu:8443/srm/v2/server?SFN=/xrootd/atlas/atlasdatadisk/data10_7TeV/DESD_MET/r1239_p134/data10_7TeV.00153030.physics_MinBias.merge.DESD_MET.r1239_p134_tid131664_00/DESD_MET.131664._000195.pool.root.1
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries]
    Issue resolved - from Wei:
     SLAC's LFC db seems to have gotten corrupted during an operation after a firmware upgrade. I restored the LFC from a backup
     and the LFC is now functioning. The DDM transfers should go back to normal. We expect to lose a few hours of data in LFC and some job failures due to this.  ggus 58000 (closed), eLog 12309.
    3)  5/7: Data transfer errors at SLAC:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries]
    From Wei:
    Bestman event log filled up /tmp. I restarted bestman without writing event log.
    4)  5/7: Jobs failing at MWT2_IU with "ddm: Adder._updateOutputs() could not add files to..." errors.  From Sarah:
    This was due to cleanup activity at our site. Please disregard.
    ggus 58040 (closed), eLog 12661.
    5)  5/7: Still seeing SE problem at AGLT2 - from Shawn:
    We are taking another OIM outage on AGLT2_SE.  The head01 node has become unresponsive in SRM again. We are trying to find the right "chassis" to host both the existing disk and the new SSD. As soon as we do we will bring up HEAD01 on that hardware. 
    Issue resolved - site re-activated for DDM transfers.  eLog 12413, https://savannah.cern.ch/support/index.php?114334.
    6)  5/8: DDM transfer errors at NET2:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]. Givin' up after 3 tries].  From Saul:
    Fixed by restarting bestman.  ggus 58073 (closed), eLog 12424.
    7)  5/9: Job failures at HU_ATLAS_Tier2 and MWT2_IU due to missing release BTagging/15.6.8.6.1.  Installed at both sites by Xin -- issue resolved.  ggus 58084 (closed), eLog 12663.
    8)  5/10: Transfer errors at MWT2_DATADISK:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase:
    [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries].  From Rob: A dcache storage issue - fixed now.  ggus 58092 (closed), eLOg 12475.
    9)  5/11: From Wei at SLAC:
     We just had an unexpected outage with one of the storage boxes. Replacing the motherboard fixed the problem. Our power work is still on schedule and I will start turning services at SLAC down in two hours.  ggus 58163 (closed), eLog 12532.
    10)  5/12:  SLAC outage -- from Wei:
     SLAC has scheduled an outage on 5/12 from 4am to 5pm UTC to prepare for a 3-hour early morning power work. We will shut down all our services during that time. Depending on weather conditions, we might cancel and reschedule it at the last minute.  Update, 5/12 afternoon: Power outage at SLAC is over. I am turning services on.
    11)  5/12: DDM transfer errors at AGLT2 were initially reported as "no space left on device" errors.  From Shawn:
    Space on the pools was not the problem. The logging for postgresql filled the partition (log files). It was fixed and the new log directory is soft-linked to another partition.  Savannah 67337, ggus 58172 (both closed), eLog 12621.
    12)  5/12: From Bob at AGLT2:
    At 1pm today (EDT) we will begin the process of reconfiguring our dCache so that we no longer have distinct, physical pool disks assigned to one and only one space token, but will instead have all pool disks grouped and space tokens will become logical assignments.  This will greatly ease the troubles we've had for the past week getting space where and when it was needed.  
    We have thought this through pretty well.  
    We do not expect troubles, but that does not mean we will not have any.  
     This message is a warning that this process will begin, and that transient dcache difficulties _could_ potentially arise. We will notify everyone when we have completed this task.
    
    Follow-ups from earlier reports:
    
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  Paul added to the thread in case there is an issue on the pilot side.  
    ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    Update, 5/4: Additional information posted in the ggus ticket.  Also, see comments from Paul.
    Update, 5/10: Additional information posted in the ggus ticket.
    (ii)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, 
    which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    (iii)  Upcoming: 5/13, BNL -
    The Condor batch system will be upgraded on May 13 (Thursday) beginning at 8 am EDT.
    Duration:
    May 13, from 8 am - 12 noon
    Expected User Impact:
    No batch jobs can be scheduled or executed during the upgrade.
    (iv)  5/4: From John at NET2 / HU:
     We were going along fine at ~750 concurrent jobs for days, but when I lifted that limit today, our lsm and storage again ran into scaling issues.  I'm going to get us back down to the 750 level, where things were working correctly.  I will do this while keeping the site online in panda.
    Update from John, 5/6:
    Just a heads up that we're still trying out some things to improve performance.  This time we were able to run steady at 1500 jobs for over 24 hours, but we just ran into a snag.  A (hopefully very small) batch of failures will be showing up shortly, but we believe we've caught things in time so that we can keep the site online.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=95201
    
    1)  May 2010 reprocessing exercise has begun.  Useful links:
    https://twiki.cern.ch/twiki/bin/view/Atlas/ADCDataReproSpring2010
    https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Reprocessing_jobs
    2)   5/13: Transfer errors at MWT2_UC:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries]
     From Sarah:
     UC SRM door crashed due to 'Too many open files'.  It seems the ulimit nofile setting in the dCache start-up file is not being applied.  I set it manually and restarted the door, and transfers are starting to succeed. 
     We are looking into a long-term solution, as well as monitoring SRM door health.  eLog 12614.  (A sketch for checking a daemon's effective open-file limit appears after this report.)
    3)  5/13-14: File transfers failing from ILLINOISHEP_PRODDISK to BNL-OSG2_MCDISK with source errors:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_
    ERROR] failed to contact on remote SRM [httpg://osgx1.hep.uiuc.edu:8443/srm/managerv2]. Givin' up after 3 tries].
     From Dave at Illinois:
    The srm node was somehow confused on srmping. Restarting dcache on the node seems to have fixed the problem.
    ggus 58225 (closed), eLog 12638.
    4)  5/14: Transfer errors at AGLT2_PRODDISK:
    FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries].
     From Shawn:
    Postgresql connection limit on our dCache/SRM headnode was reached at 3 AM. I increased it to 1000 and restarted Postgresql and dCache services at 7:40 AM. I see SRM transfers are again working.  ggus 58229 (closed), eLog 12654.
    5)   5/14: BNL, data transfer errors to ATLASDATADISK, ATLASDATATAPE and ATLASMCDISK:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries].  From Michael:
    The problem is solved, again caused by the namespace management component of dCache.  eLog 12668.
    6)  5/14: BNL - Network port scan testing postponed until further notice.
    7)  5/17:  AGLT2 - USERDISK token filled up.  5 TB additional space added.  Test jobs successful at ANALY_AGLT2 - site back to 'online'.  RT 16346 (closed), eLog 12772.
    8)  5/17-19: NET2 - jobs failing with NOLOCALSPACE and FILECOPYERROR messages.  From Saul:
    It looks like this is two independent problems:
    1) some BU_ATLAS_Tier2o jobs failed from running out of local scratch space
    2) some ANALY_NET2 jobs failed with "Put" errors copying to our SE (gpfs1)
    ....................
    1) is going to require more investigation, but there is no immediate problem - all the workers have scratch space and no more jobs are failing.
    2) is also going to require more investigation. The errors coincide with a time yesterday when one of our gpfs1 volumes was down to 1.5TB. We were, however, watching it carefully at the time and didn't let it go below about that so I
    don't actually understand how the errors happened. We'll have to get this from our local site mover logs. As with 1), there doesn't seem to be an immediate problem as gpfs1 has 6TB free at the moment and the errors ended approximately
    when we freed more space.  Issues resolved - ggus 58257 (closed), eLog 12749.
    9)  5/18: BNL - networking issue resolved.  From Michael:
    The network problem caused by a routing loop between core routers/switches was solved around 18:50 UTC. All services
    are back to normal.  eLog 12847.
    10)  5/19: BNL - out of space errors for MCDISK.  Additional space was added -- issue resolved.  ggus 58326 (closed), eLog 12862/63.
    
    Follow-ups from earlier reports:
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? 
    I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  
    Paul added to the thread in case there is an issue on the pilot side.  ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    Update, 5/4: Additional information posted in the ggus ticket.  Also, see comments from Paul.
    Update, 5/10: Additional information posted in the ggus ticket.
     Update, 5/17: Additional information posted in the ggus ticket.
    (ii)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- 
    and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
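     Regarding item 2 above (the UC SRM door crash from 'Too many open files'), a minimal sketch for inspecting the open-file limit a running daemon actually inherited, rather than the value the start-up script intended to set. It assumes a Linux /proc filesystem; filtering on 'java' processes is an assumption for illustration.

         #!/usr/bin/env python3
         """Print the effective 'Max open files' limit of running java processes (sketch)."""
         import glob

         for comm_path in glob.glob("/proc/[0-9]*/comm"):
             pid = comm_path.split("/")[2]
             try:
                 with open(comm_path) as f:
                     name = f.read().strip()
                 if "java" not in name:          # dCache SRM door runs in a JVM
                     continue
                 with open(f"/proc/{pid}/limits") as f:
                     for line in f:
                         if line.startswith("Max open files"):
                             soft, hard = line.split()[3:5]
                             print(f"pid {pid} ({name}): soft={soft} hard={hard}")
             except OSError:
                 continue  # process exited while we were looking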
    

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
  • this week:
    • Minutes from this week's meeting:
      	USATLAS Throughput Meeting Notes --- May 18, 2010
          ===================================================
      Attending:  Shawn, Dave, Jason, Sarah, Andy, Saul, Philippe, Karthik, Hiro, Nate, Charles
      Excused: Horst
      
      Review of existing action items (see meeting agenda below).   Topics will be covered mostly within the agenda.  Note updated spreadsheet attached with meeting announcement.   Sites should verify they are properly entered AND that they have fully setup scheduled tests as appropriate (see notes in spreadsheet).
      
      1) perfSONAR
       a) Network status and new issues?   BNL issue noted:  campus-wide outage being worked on. 
       b) Review issues with bandwidth asymmetry at OU (Karthik):  Status update: results inconclusive from NPAD/NDT tests.  Jason and Karthik will continue to look at this.
       c) Review UTA low bandwidth test results (Mark):  Status update:  FTS channel fully loaded causing low perfSONAR results. (**Resolved**)  
       d) Marking "milestone" complete ??   perfSONAR deployed at all the Tier-2s and the Tier-1? **Yes...closing this milestone**.
       e) BU and Harvard will decide if perfSONAR instance at Harvard (testing to BU) is useful.  
       f) Possible new hardware option from Dell for perfSONAR:  Shawn/Jason
      
      2) Transaction testing update - Hiro blocked right now by network issue at BNL.   Will produce a number of plots related to transaction testing results.   Submit times and finishing times plots.  Will eventually be available from FTS and/or throughput.  URL for "stack" plots for total bandwidth to Tier-2s will be sent by Hiro later (once it is ready for heavier use).  (A minimal plotting sketch follows these notes.)
      
      3) Alerting on perfSONAR:   Sarah did some initial poking.   Possibilities to verify testing configurations and site status.  Some work to do to make the user interface easier and more intuitive.   Jason confirmed that verifying the set of tests implemented at a site via the API should be doable.    Harder to get version info.    
      Sarah, Jason and Philippe discussed a perfSONAR "enhancement" which would allow a meaningful site label to be created which would group all test results involving that site under the label.   Then we wouldn't care that some measurements were done using the IP vs some done using the DNS name, etc.    
      
      4) Site reports (open)
      	Philippe brought up an issue:  lhcperfmon issue displaying AGLT2_UM and AGLT2_MSU.  Jason requested email with details to be shared with developers.   Sarah reported IU doesn't have a problem with the same graph.  Hiro reported BNL's result is also blank to AGLT2 (both sites) but has data to IU.   Maybe a Google API issue?   Will be looked at (**ACTION ITEM**)
      	Saul will check the installed perfSONAR instance at BU (currently 3.1.2? Should be 3.1.3)   
      
      Next meeting in 2 weeks (June 1, 2010).   Please send along corrections or additions to the list.   Thanks,
      Shawn 
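      For the "stack" plots of total bandwidth to the Tier-2s mentioned in item 2 of the notes above, a minimal plotting sketch with synthetic placeholder data; the site list and numbers are illustrative, not FTS output.

          #!/usr/bin/env python3
          """Stacked-area plot of per-site throughput (synthetic data, illustration only)."""
          import numpy as np
          import matplotlib.pyplot as plt

          sites = ["AGLT2", "MWT2", "NET2", "SWT2", "WT2"]      # placeholder site list
          hours = np.arange(24)
          rng = np.random.default_rng(0)
          throughput = rng.uniform(50, 300, size=(len(sites), len(hours)))  # MB/s, synthetic

          plt.stackplot(hours, throughput, labels=sites)
          plt.xlabel("hour of day")
          plt.ylabel("throughput (MB/s)")
          plt.title("Aggregate throughput to US Tier-2s (synthetic data)")
          plt.legend(loc="upper left")
          plt.tight_layout()
          plt.savefig("tier2_throughput_stack.png")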

Site news and issues (all sites)

  • T1:
    • last week(s): DDN array evaluation - converted the front end to Linux and XFS. Expect the evaluation to last another two weeks. Issue with the (battery-based) UPS system: batteries exhibiting thermal runaway; switched the UPS into bypass mode. Measures are underway to solve the problem.
    • this week: Increasing capacity of CPU and disk. ETA 1 June for 2000 cores from Dell. Evaluation of DDN continues. Continued problems with servers - considering going back to Sun servers. Network problem yesterday - router reboot created a routing loop (spanning tree path calculations), required manual intervention.

  • AGLT2:
    • last week: dCache server and headnode issues being addressed. Reconfiguring dCache right now; sent out instructions for comment. Going from hard-coded pools to one large pool group, which allows logical control of space tokens rather than by-hand assignment, which doesn't scale. Moved the dCache admin databases onto Intel SSDs - huge difference in CCC run time. All databases were migrated.
    • this week: Converting from many pool groups to one large pool group, so as to adjust space tokens on the fly. The SSD for the postgres database is working really well. CCC now runs fast enough that it overloads the local disk and causes SRM timeouts; need to migrate it to another server. Bringing another 100 TB into dCache in a few days; will be up to 1.1 PB soon. There is also new storage at MSU (6 shelves).
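      Given the SRM headnode postgres load described above (and the connection-limit incident in the shifters report), a minimal sketch for watching PostgreSQL connection usage; it assumes passwordless local psql access, which is an assumption about the site setup.

          #!/usr/bin/env python3
          """Report current vs. maximum PostgreSQL connections (sketch)."""
          import subprocess

          def psql_value(query):
              # -A unaligned, -t tuples only: print just the value
              out = subprocess.run(["psql", "-At", "-c", query],
                                   check=True, capture_output=True, text=True)
              return out.stdout.strip()

          current = int(psql_value("SELECT count(*) FROM pg_stat_activity;"))
          maximum = int(psql_value("SHOW max_connections;"))
          print(f"{current}/{maximum} connections in use ({100 * current / maximum:.0f}%)")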

  • NET2:
    • last week(s): Disk space is tight at NET2. The order for the first of three new racks of storage is going out this week - will add a PB of raw storage. Network issue turned out to be a firewall issue. John: still working on ramping HU back up to full capacity. http://atlas.bu.edu/~youssef/2010-05-12/. Regarding the lsm-get timeout - may need to ask Paul to increase this.
    • this week: Running relatively smoothly. Top priority besides ops is storage upgrade. Order for first of 3 racks has gone out. 336 TB raw (IBM servers and Dell MD1000, MD3000 servers). BU-HU network being improved, final stages. Local site mover improvements.

  • MWT2:
    • last week(s): Smooth running most of last week; a dCache issue over the weekend - missing configuration on new storage pools (fixed); Squid updated at IU; kernel bug issue. Working with the Tier 3 team on xrootd testing. Kernel bug: 'soft lockup' messages - we see jobs not using CPU but loading up the node. Updated the kernel, but that didn't solve the problem.
    • this week: http://twiki.mwt2.org/bin/view/Main/SiteReportMay19. Pilot was timing out transfers that were still in progress.
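      On the pilot timing out transfers that were still making progress, a generic sketch of a stall-based timeout (reset the clock while the destination file keeps growing); this is not the pilot's actual logic, and the command, paths and thresholds are assumptions.

          #!/usr/bin/env python3
          """Kill a copy only when the destination stops growing, not on a fixed wall clock (sketch)."""
          import os
          import subprocess
          import time

          STALL_SECONDS = 300   # give up only after 5 minutes without progress
          POLL_SECONDS = 10

          def copy_with_stall_timeout(cmd, dest_path):
              proc = subprocess.Popen(cmd)
              last_size, last_change = -1, time.time()
              while proc.poll() is None:
                  time.sleep(POLL_SECONDS)
                  size = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
                  if size != last_size:
                      last_size, last_change = size, time.time()
                  elif time.time() - last_change > STALL_SECONDS:
                      proc.kill()
                      raise RuntimeError("transfer stalled; killed")
              return proc.returncode

          # Hypothetical invocation; replace with the real copy tool and paths.
          rc = copy_with_stall_timeout(["cp", "/tmp/src.root", "/tmp/dst.root"], "/tmp/dst.root")
          print("exit code:", rc)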

  • SWT2 (UTA):
    • last week: Upgrading nodes on older UTA_SWT2; networking issues; Squid update later today.
    • this week: Installed 50 R410s in the production-only cluster (+300 cores). 400 TB on the floor, burning in (operational in 2-3 weeks); will swap out 40 TB of older storage; ordered 200 TB more (may have power issues). This will put us over 1 PB. The Xrootd version has an issue reporting space usage back to SRM when running with CNSd; a workaround is in place; will need to re-synch DQ2 and SRM. Squid upgrade still to do (question about failover). Network bandwidth meeting tomorrow with the administration.

  • SWT2 (OU):
    • last week: Cluster upgrade worked fine. The 10G NIC on the head node keeps locking up; will swap it for a (Dell-supported) NIC. Condor configuration, OSG installation, LFC, and Squid still to come. Hope to be back online next week.
    • this week: Troubles with Netrion-Dell 2950 compatibility; Intel card more stable. Still not getting full throughput. PCM cluster manager had many issues - had to work with Dell remotely to fix these. Condor configured. About to start on OSG. Shooting for Monday. 170 TB of useable storage.

  • WT2:
    • last week(s): Yesterday had a 2-hour outage caused by a failed motherboard on .. Scheduled outage mostly finished. Adding another gridftp server (RHEL5). Running low on storage - deleting files from PRODDISK - have ~100 TB free. dq2-put into LOCALGROUPDISK? There are ACL settings in LFC and ToA. Use DaTri to move data to LOCALGROUPDISK.
    • this week: Running low on storage - will be deleting data from SCRATCHDISK and GROUPDISK. Meeting this afternoon regarding storage purchase for end of next month. 500 TB useable, leading to 1.1 PB.

Carryover issues ( any updates?)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last meeting
    • We continue the discussion regarding the layout of supporting software for the ATLAS release, e.g. dq2-client; need conclusions from Alessandro
    • Testing with UTD site. Sending test jobs.
    • Will test with OU
  • this meeting:
    • Discussing the file structure across all ATLAS sites - will need to migrate the US sites carefully to minimize impact. Waiting for test results.

AOB

  • last week
  • this week
    • None


-- RobertGardner - 18 May 2010
