
MinutesAug25

Introduction

Minutes of the Facilities Integration Program meeting, Aug 25, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Kaushik, Mark, Michael, Charles, Nate, Aaron, Sarah, Dave, Jason, Saul, Shawn, John, Horst, Karthik, Patrick, Torre, Bob, Hiro, Rik, Wei, Doug, Wensheng
  • Apologies: Fred

Integration program update (Rob, Michael)

  • IntegrationPhase14 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Another f2f meeting in October 13-14 at SLAC.
      • LHC luminosity record over the weekend (~1 pb-1); discussions about increasing this further, possibly along with the center-of-mass energy; efficiency and reliability of the machine are improving, resulting in long runs.
      • T0 performance issues - prompt reconstruction events overwhelming the T0 resources, requiring interrupts.
      • Lots of analysis data
    • this week
      • OSG Storage Forum and US ATLAS Distributed Facility workshop, upcoming.
      • Benchmarks of new hardware - Bob will start collecting these on HepSpecBenchmarks.html. Email results to Bob.
      • LHC continues to make progress - a couple of good fills recently; integrated luminosity ~ 3 pb-1. Beam intensity is being increased by adding bunches (50 as of yesterday); the goal is 400 bunches by the end of the year (and an fb-1-scale dataset by the end of next year). Tier 0 still encounters problems with overall bandwidth - export bandwidth must sometimes be reduced, but they catch up quickly and data distribution then goes very quickly. Real physics ~ 36 hours over the week.
      • Anticipated program at SLAC - data distribution for analysis, PD2P and associated data handling issues; dynamic data deletion. Some of the sites are close to capacity limits.
      • Analysis performance as it pertains to data access. Tier 3 development and integration into the facility. Cross-cloud traffic across Tier 3's - now getting into the production picture.
      • Idle CPUs - the RAC has a scheme to backfill with additional production or other types of analysis work.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here
last week(s):
  • ARRA funds are beginning to materialize; there is a DOE site that tracks this information; funds are expected around mid-August
  • Phase 1 version of Tier 3 is ready, updating documentation so that sites can start ordering servers
  • Phase 2 needs - analysis benchmarks for Tier 3 (Jim C); local data management; data access via grid (dq2-FTS); CVMFS - will need a mirror at BNL; effort to use Puppet to streamline T3 install and hardware recovery
  • Xrootd federation demonstrator project
  • UTA has set up a Tier 3 in advance of the ARRA funding.
this week:
  • Meeting this morning w/ OSG Tier 3 support, plus Rob and Hiro. Discussion about registration of Tier 3's, but also support.
  • We expect a large number of sites not part of DDM, but using gridFTP and the BNL FTS.
  • Existing Tier 3's have SRM and are part of DDM. For new ones we need a well-defined set of steps to become part of the DDM system; Rob will start up a page for this.
  • For gridFTP-only sites, how sites should get registered was discussed.
  • Panda Tier 3 sites appear as ANALY queues - but we don't want DAST supporting them; how should support be handled?

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • New schedconfig modification techniques. Instructions can be found here: SchedConfig modification
    • expect production workloads to vary for the near term
    • production versus analysis: need to discuss min/max (floor/ceiling) for the analy queues in the RAC.
    • general discussion about getting more analysis jobs running, and response.
    • With PD2P, roughly 20% of data is being distributed to Tier 2s.
    • The current model is that dataset replications to Tier 2s are triggered, but jobs still run at the Tier 1 first. This is efficient for 50% of the jobs (50% of the replicated datasets are reused, so far).
    • Rebrokering is coming - users can use containers and send jobs to multiple sites. Then the long queue at BNL can be examined, and jobs sent to Tier 2's when the datasets have arrived.
    • Problem with user jobs - jobs submitted with prun are causing problems (for example pyROOT+dcap). Forward such issues to DAST.
    • Users can be blacklisted via GGUS
    • Sites are free to set their own wall time limits
  • this week:
    • US sites - all looks fine, with a lot of user analysis jobs. Production has dropped off. Some regional requests are coming, but they won't keep our sites busy. There is some group production and reconstruction processing - but it is at the Tier 1.
    • Why is group production only at the Tier 1? Not sure; clearly non-optimal.
    • More discussion about multi-job pilots - see notes above
    • All requests for LOCALGROUPDISK through the DaTri system have to be approved by Kaushik and Armen - labor intensive; being discussed in ADC; the policy needs to change. Use a threshold for automatic approval? Yes (see the sketch below).
    • Note that DQ2 will have a quota system.
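    • As an illustration of the threshold idea above (a sketch only - the limits and names here are hypothetical, not an agreed ADC/DaTri policy), an auto-approval rule could look like:

      # Hypothetical sketch of threshold-based auto-approval for LOCALGROUPDISK
      # replication requests; thresholds and quota numbers are invented placeholders.
      AUTO_APPROVE_LIMIT_GB = 500        # assumed per-request threshold
      USER_QUOTA_GB = 5000               # assumed per-user total on LOCALGROUPDISK

      def decide(request_size_gb, user_usage_gb):
          """Return 'approve', 'manual', or 'reject' for a replication request."""
          if user_usage_gb + request_size_gb > USER_QUOTA_GB:
              return "reject"            # would exceed the user's quota
          if request_size_gb <= AUTO_APPROVE_LIMIT_GB:
              return "approve"           # small requests go through automatically
          return "manual"                # large requests still need manual approval

      print(decide(200, 1000))    # -> approve
      print(decide(2000, 1000))   # -> manual
      print(decide(2000, 4500))   # -> reject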

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Problems with central deletion at SLAC, which has hard limits on space tokens - deletion starts too late and is not efficient enough; lower the deletion trigger threshold? Each space token's size must be fixed (see the sketch at the end of this section).
    • Michael: sites are going down more frequently; mostly SRM related. We see Bestman failures particularly. We may need a concentrated effort here to resolve Bestman reliability problems. Wei will drive the issue.
  • this week:
    • No meeting this week
    • Sites are full. Lots of deletions in progress.
    • Issue of sites w/o explicit space token boundaries has not been resolved - will need ADC intervention.
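    • As an illustration of the deletion-trigger discussion above (a sketch only - the watermark numbers and the cleanup policy are assumptions, not the actual central-deletion configuration), lowering the trigger threshold simply means cleanup starts while more free space remains in the fixed-size token:

      # Watermark-style deletion trigger for a space token of fixed size.
      def needs_cleanup(used_tb, capacity_tb, trigger_fraction=0.85):
          """Start deletion once usage passes the trigger watermark."""
          return used_tb >= trigger_fraction * capacity_tb

      def tb_to_free(used_tb, capacity_tb, target_fraction=0.75):
          """Delete enough to come back down to the target watermark."""
          return max(0.0, used_tb - target_fraction * capacity_tb)

      # Example: a 400 TB DATADISK token at 390 TB used.
      if needs_cleanup(390, 400):
          print("free up %.1f TB" % tb_to_free(390, 400))   # -> free up 90.0 TB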

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=104142
    
    1)  8/11:  Transfers to WISC were failing with the error "trouble with canonical path - cannot access it."  Issue resolved by site admin.  ggus 61060 (closed), eLog 15692.
    2)  8/12: Storage system issues at OU resolved (see follow-up section below) - test jobs successful, so OU_OCHEP_SWT2 set on-line. eLog 15704.
    3)  8/12: BNL - job failures with the error "!!WARNING!!3000!! Trf setup file does not exist at: /usatlas/OSG/atlas_app/atlas_rel/15.6.10/IBLProd/15.6.10.4.10/IBLProdRunTime/cmt/setup.sh."
    For some reason the jobs attempted to run prior to the completion of the release installation.  Once the release was in place issue resolved.  Savannah 71412, eLog 15745.
    4)  8/12-8/13: MWT2_UC - problem with one of the dCache data pools.  Data recovered, consistency checks performed, issues resolved.   eLog 15735.
    5)  8/13: MWT2_IU - transfer errors such as:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://iut2-dc1.iu.edu:8443/srm/managerv2]. Givin' up after 3 tries].  Issue resolved - from Sarah:
    Chimera was down this morning. The service has been restarted, and transfers should start to succeed.  ggus 61103 (closed), eLog 15761.
    6)  8/13-8/14: Transfer errors at MWT2_UC_PERF-JETS: "failed to contact on remote SRM."  Restart of the SRM service resolved the problem.  ggus 61126 (closed), eLog 15777/830.
    7)  8/14: SLACXRD file transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries].
    Issue resolved - from Booker at SLAC: I have restarted bestman on osgserv04. There were many java.io.EOF exceptions in the log.  ggus 61142 (closed), eLog 15812/25.
    8)  8/14: SWT2_CPB - file transfer errors due to problem with the SRM interface:
    DEST SURL: srm://gk03.atlas-swt2.org:8443/srm/v2/server?SFN=/xrd/datadisk/data10_7TeV/NTUP_JETMET/f282_p209/data10_7TeV.00160958.physics_JetTauEtmiss.merge.NTUP_JETMET.f282_p209_tid159252_00/NTUP_JETMET.159252._000393.root.1
    ERROR MSG: FTS State [Failed] Reason [TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error500 500-Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914:500-open() fail500 End.]  
    Issue resolved - from Patrick: A RAID card in one of the storage servers went into a bad state Saturday morning causing problems with two mounted filesystems. The server was restarted early Sunday morning and has been operating correctly ever since. 
    Transfer errors ceased after the system was rebooted.  
    ggus 61143, RT 17974 (closed), eLog 15833.
    9)  8/15: OU_OCHEP - transfer errors such as:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3 tries].  Issue resolved - from Horst:
    Our SRM was overloaded from too many BNL transfers. It has been restarted and should be working properly again. I'm closing this ticket, please activate the FTS channel again if it has been disabled.  ggus 61149, RT 17976 (closed), eLog 15832.
    10)  8/15-8/16: NERSC_HOTDISK request timeout transfer errors.  ggus 61148 (in progress), eLog 15906.
    11)  8/15-8/17: OU_OCHEP - various issues with Lustre timeouts.  FTS channels toggled on/off during periods of congestion.  "On" as of 8/17, 5:00 p.m. CST.
    12)  8/16-8/17: MWT2_UC maintenance outage - from Aaron:
    We are taking a downtime tomorrow, Tuesday August 17th all day in order to migrate our PNFS service to new hardware.
    Site back on-line as of ~4:30 p.m. CST 8/17.  Some issues with pilots at the site - under investigation.
    13)  8/17: From Wei at SLAC:
    Due to power work we will need to shutdown some of our batch nodes. This will likely result in reduced capacity at WT2 (8/18 - 8/20). We may also use the opportunity to reconfigure our storage (If that happens, we will send out an outage notice).
    
    Follow-ups from earlier reports:
    (i)  7/12-13: OU_OCHEP_SWT2: Bestman/SRM issues.  Restart fixed one issue, but there is still a lingering mapping/authentication problem.  Experts are investigating.  ggus 60005 & RT 17494 (both closed), currently being tracked in ggus 60047, RT 17509, eLog 14551.
    Update 7/14: issue still under investigation.  
    RT 17509, ggus 60047 closed.  Now tracked in RT 17568, ggus 60272.
    Update, 8/2: issues with checksums now appear to point to underlying storage or file system issues.   See:
    https://ticket.grid.iu.edu/goc/viewer?id=8961
    Update, 8/9 from Horst: progress with the slow adler32 checksum calculations - see details in RT 17568.
    Update, 8/12 from Horst: after reconfiguring our SE to store adler32 checksums on disk now, and several other file system tunings, transfers are succeeding now, so I'm closing this ticket (RT 17568). 
    
    • Still working on SRM issues at OU - Hiro and Horst to follow-up offline.
  • this meeting:
    Yuri is on vacation this week, so no summary from him for the weekly ADCoS meeting.  See the summary from Alessandra Forti here:
    http://www-hep.uta.edu/~sosebee/ADCoS/World-wide-Panda_ADCoS-report-%28Aug17-23-2010%29.txt
    
    1)  8/18: WISC - DDM transfer errors:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  
    Resolved as of 8/23, ggus 61283 closed, eLog 16167.  (Note: ggus 61352 may be a related ticket?)
    2)  8/18 - 8/19: SWT2_CPB - Source file/user checksum mismatch errors affecting DDM transfers.  Issue resolved - from Patrick: The RAID in question rebuilt because of a failed drive and the problem has disappeared.  ggus 61249, RT 1791 both closed, eLog 15965.
    3)  8/18 - 8/19: SWT2_CPB - fiber cut on a major AT&T segment in the D-FW area.  Reduced network capacity during this time, so Hiro reduced FTS transfers to single file to help alleviate the load.  Cut fiber repaired, system back to normal.
    4)  8/19: OU_OCHEP transfer errors:
    [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server].  From Horst: This must've been another SE overload because of the Lustre timeout bug. It has
    resolved itself with an auto-restart of Bestman, therefore I turned the channel back on and am resolving this ticket.  ggus 61289, RT 17997 both closed, eLog 16019.
    5)  8/19: from Charles at UC:
    Our PNFS server at MWT2_UC is temporarily down due to a power problem. It will be back up ASAP. In the meanwhile, FTS channels for UC are paused.
    Later:
    Server is back up and channels are being reopened.
    6)  8/19: SRM problem at BNL - issue resolved.  From Michael:
    Most likely caused by a faulty DNS mapping file that was created by BNL's central IT services. Forensic investigations are still ongoing. Transfers resumed at good efficiency.
    7)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w releases 
    to be installed at OU_OSCER_ATLAS.  eLog 16119.
    8)  8/19: SWT2_CPB off-line for ~6 hours to add storage to the cluster.  Work completed, system back up as of late Thursday evening.  eLog 16072.
    9)  8/20 - 8/25: BNL - job failures with the error "Get error: lsm-get failed: time out."  Issue with a dCache pool - from Pedro (8/25):
    We still have a dcache pool offline.  It should be back by tomorrow noon.  In the meantime we move the files which are needed by hand but, of course, the jobs will fail on the first attempt.  ggus 61338/355, eLog 16154/230.
    10)  8/20: Job failures at IllinoisHEP with the error "lcg_cp: Communication error on send."  Issue resolved - from Dave:
    I believe this was due to some networking issues on campus.  ggus 61339 closed, eLog 16084.
    11)  8/23: BNL - file transfer errors from BNL to several destinations:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries].  From Michael:
    The problem is solved. The dCache core server lost connection with all peripheral dCache components. Transfers resumed after restart of the server.  ggus 61359 closed, eLog 16174.
    12)  8/24 - 8/25: SLAC - DDM errors: 'failed to contact on remote SRM': SLACXRD_DATADISK and SLACXRD_MCDISK.  From Wei:
    Disk failure. Have turned off FTS channels to give priority to RAID resync. Will turn them back online when resync finish.  ggus 61537 in progress, eLog 16223/33.
    
    Follow-ups from earlier reports:
    (i)  8/15-8/16: NERSC_HOTDISK request timeout transfer errors.  ggus 61148 (in progress), eLog 15906.
    Update from Hiro, 8/23: NERSC has fixed the problem. The transfers are working normally.  ggus 61148 closed.
    (ii)  8/16-8/17: MWT2_UC maintenance outage - from Aaron:
    We are taking a downtime tomorrow, Tuesday August 17th all day in order to migrate our PNFS service to new hardware.
    Site back on-line as of ~4:30 p.m. CST 8/17.  Some issues with pilots at the site - under investigation.
    Update, 8/24: Xin added a second submit host for ANALY_MWT2, as it appeared that a single one could not keep up, especially in situations where the analysis jobs are very short (<5 minutes) and cycle through the system at a high rate.  
    Also, the schedconfig variable "timefloor" (previously None) was set to 60, i.e., pilots will run for at least 60 minutes if there are real jobs to pick up (see the sketch after these follow-ups).
    (iii)  8/17: From Wei at SLAC:
    Due to power work we will need to shutdown some of our batch nodes. This will likely result in reduced capacity at WT2 (8/18 - 8/20). We may also use the opportunity to reconfigure our storage (If that happens, we will send out an outage notice).
    8/25: Presumably this outage is over?
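    As a rough illustration of the "timefloor" behaviour described in follow-up (ii) above (a sketch only - the real logic lives in the PanDA pilot and differs in detail; get_job and run_job are placeholders, not real pilot functions):

      # Sketch of schedconfig "timefloor": a pilot keeps asking for work until
      # at least `timefloor_minutes` of wall-clock time has been used, and
      # exits early only if there is no real job to pick up.
      import time

      def pilot_loop(timefloor_minutes, get_job, run_job):
          start = time.time()
          while True:
              job = get_job()
              if job is None:
                  break                          # no real jobs to pick up: exit
              run_job(job)
              if time.time() - start >= timefloor_minutes * 60:
                  break                          # minimum running time reached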
    

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Problem of getting data from foreign Tier 1's? The 30 TB dataset finally came through to MWT2. It took five days - there was a bottleneck at CNAF (the networking issue is under investigation).
  • this meeting:
    • USERDISK getting cleaned up at all sites
    • BNL-Italy low bandwidth still being investigated; discussed at WLCG management meeting - there are organizational issues. Will be discussed at WLCG Service Coordination meeting.
    • Hiro notes that multi-cloud transfers are very manpower intensive.
    • DQ2 accounting - looking at DATADISK and MCDISK - notes that a few sites are very close to the limit, in particular NET2 and SLAC.
    • Armen: relying on central deletion is not working. During last several weeks doing manual cleanups. After USERDISK cleanup will investigate DATADISK.
    • NET2 - moving storage into new rack of servers, so situation should improve.

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Work at the OU site to determine the asymmetry
    • Illinois - waiting for a config change in mid-August
    • Also an asymmetry at UC, not as severe
    • Meeting next Tuesday.
    • DYNES - dynamic provisioning, funded by the NSF MRI program; 40 sites will be instrumented
  • this week:
    • Phone meeting notes:
      USATLAS Throughput Meeting Notes --- August 24, 2010
           ====================================================
      
      Attending:  Shawn, Jason, Dave, Andy, Sarah, Horst, Karthik, Philippe, Aaron, Tom, Hiro
      
      1) perfSONAR identified issue resolution status:  
      	a) OU: Karthik reporting on OU testing.  Relocated test node which improved things.  Now using Tier-2 nodes directly testing to BNL (BNL->OU is OK, OU->BNL is poor).   Lots of tunings on OneNet and at BNL and OU.   Results from today suggest some issue in the tracepath but perhaps should stay focused on the current tools/tests for now.   Horst and Karthik are enabling  BWCTL on their 10GE hosts.  Hiro tested with real file transfers (BNL->OU is OK (1 file 100MB/sec, 5 streams), OU->BNL is poor performance (20MB/sec; same options; same file)).
           b) Illinois -  Configuration change made to the 6500...no change in dropped packets from BNL to Illinois (problem is opposite direction from OU).   Net admins see some possible issues but haven't had time to track things down yet and have a new testing node to help isolate the issue.  Trying to enable jumbo frames (MTU=9000).   (Bit of a discussion on jumbo frame issues seen at MWT2 and OU).   Tracepath may be useful to debug UC-IU jumbo problems involving PBS.   Some discussion about how NATs deal with jumbo frames on outside/normal frames on inside.  
      
      2) perfSONAR --- RC3 out maybe next week.  New version (CentOS/driver) seems to have changed the performance of the system at BNL (say 900 Mbps -> 700 Mbps ). This is possibly a show-stopper.  Trying new driver at BNL to see if it resolves things. Some new patches out to extract some timing info from old and current perfSONAR instances from Jason.  Philippe reported on patching/fixing systems at AGLT2_UM.   After 'myisamchk -er' on DB things were much faster.  Some nightly checks not in place or perhaps not working?   No progress yet on "single perfSONAR instance" Dell R410 at UM.  Waiting for the release to test it.  Very soon will have focused  access on testing/deploying on the R410.     Some discussion about install options for next perfSONAR release.   Jason is working on documenting the "to disk" install variant and it should be available when the release is ready. Will use a repo and YUM to maintain on disk version.  
      
      3) Monitoring perfSONAR ---  Plugins with "liveness" and thresholds are being worked on.  Work in progress to convert them into RPMs.  They can be installed on a perfSONAR host and/or a Nagios server.  The plugins are perl scripts underneath.  Tom thought it would be straightforward to implement on the BNL Nagios server and could even provide a "perfSONAR"-focused page showing just the perfSONAR related tests.  Andy mentioned that documentation will be ready in advance of the plugins and he will let us know so we can start discussing options.  (A minimal sketch of such a threshold check appears after the DYNES item below.)
      
      4) Round-table, site reports, other topics --- Hiro - Question: will perfSONAR be extended to each ATLAS site?  Jason:  perfSONAR MDM exists at Tier-1, Tier-0. Hiro points out problems with trying to debug Tier-1 issues and perfSONAR would be helpful.  Also question about Nagios plugin working with MDM variant; Andy: may work but slight differences may cause problems.   MWT2 has asymmetry involving UC, IU is OK.  
      
      Heads up: once the perfSONAR release is ready sites should be prepared to upgrade ASAP (within 1-2 weeks).  
      
      Plan to have our next meeting in 2 weeks.   Please send along any additions or corrections to these notes via email.
      
      Thanks,
      
      Shawn
    • DYNES - the first task is to form an external review committee, for example to select the sites participating in the infrastructure.
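    • As an illustration of the threshold checks in item 3 of the notes above (a sketch only - the real plugins are perl scripts; this just follows the standard Nagios plugin convention of exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN, and fetch_latest_throughput is a placeholder, not the real perfSONAR client API):

      # Nagios-style threshold check for a perfSONAR throughput measurement.
      import sys

      OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

      def fetch_latest_throughput(host):
          """Placeholder: return the latest measured throughput to `host` in Mbps."""
          raise NotImplementedError("query the perfSONAR measurement archive here")

      def check(host, warn_mbps=500.0, crit_mbps=200.0):
          try:
              mbps = fetch_latest_throughput(host)
          except Exception as exc:
              print("PERFSONAR UNKNOWN - %s" % exc)
              return UNKNOWN
          if mbps < crit_mbps:
              print("PERFSONAR CRITICAL - %.0f Mbps to %s" % (mbps, host))
              return CRITICAL
          if mbps < warn_mbps:
              print("PERFSONAR WARNING - %.0f Mbps to %s" % (mbps, host))
              return WARNING
          print("PERFSONAR OK - %.0f Mbps to %s" % (mbps, host))
          return OK

      if __name__ == "__main__":
          sys.exit(check(sys.argv[1] if len(sys.argv) > 1 else "ps.example.edu"))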

Site news and issues (all sites)

  • T1:
    • last week(s): Reprocessing-from-tape exercise ongoing. Panda mover is used to stage data in for the reprocessing jobs; this is now done very efficiently with Pedro's changes. 30K completed reprocessing jobs per day with very good staging performance - 3 TB/hour coming off tape, distributed to servers at 1 GB/s.
    • this week: currently a problem with one of the storage pools - a large one - ~100K files/day to re-create the metadata. local-site-mover: Pedro is making progress on the handling of exceptions (have seen a failure rate of 2-5%). Want data access problems to become invisible.

  • AGLT2:
    • last week: One dCache pool was misconfigured; it was brought in manually but with an old setup file - now fixed. Site is doing very well overall. Implemented the time-floor at 60 minutes (the length of time a pilot stays in the system). Reached 1600 jobs. Have a network issue - the virtual circuit between BNL and UM was not routed correctly. Working on the next purchase; need to get HS06 numbers. Looking at power consumption. Have two new servers: X5650 processors, R710, 24 GB RAM; 12 cores (24 threads with hyper-threading) each.
    • this week: Putting in storage orders - R710 + MD1200 shelves. Also focused on network infrastructure upgrades and service nodes. Running with direct access for a while; have found some queuing of movers on pools. (Note: MWT2 is running dcap with 1000 movers; a newer dcap version is under test.)

  • NET2:
    • last week(s): Brief local networking problem yesterday, otherwise things have been smooth. Expect new storage to be available tomorrow - will be a new DATADISK.
    • this week: 250 TB rack is online, DATADISK is being migrated. Working on a second rack. Working on HU analysis queue.

  • MWT2:
    • last week(s): Continued problems with Chimera, in touch with dcache team. Down for two hours. We have a pool offline. New storage headnode is being built up this week. Remote access issues being addressed.
    • this week: Chimera issues resolved - due to a schema change; no crashes since. PNFS relocated to newer hardware.

  • SWT2 (UTA):
    • last week: Incident on Monday with Bestman - it used too many file descriptors; restarted fine. Wei has a potential change to the xrootd data server.
    • this week: 200 TB of storage added to the system - up and running without problems. Will be making modifications for multi-job pilots.

  • SWT2 (OU):
    • last week (progress report from Horst; see the sketch after his note):
       But we're making pretty good progress with our site, so I'll give you
      a quick progress report here.
      
      As I already emailed to the RT ticket yesterday, we got the adler32
      calculation and storing in an extended file attribute working, and we're
      now running an hourly cron job which does that on all new files in our SE,
      so that should already help quite a bit in terms of scaling.
      
      And I just worked out a way with Paul to have the pilot set the adler32
      value on our SE at the end of each job, which should make this cron job
      unnecessary eventually, and hopefully that will make it into the next
      pilot version next week, and might be useful for other sites as well,
      depending on how or if they want to do something similar.
      
      So at this point we're waiting for Hiro to run some more transfer tests
      -- ideally starting slowly with 10 parallel transfers, to see how it goes --
      and assuming that holds up, then set us to test again to see if everything
      holds up.
      
      We'll most likely have another 1-day down time some time next week for
      a Lustre version upgrade, which will hopefully improve the storage
      throughput even more.
      
      Thanks,
      
      	Horst
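      A minimal sketch of the approach Horst describes - compute an adler32 checksum once and cache it in an extended file attribute so the SE does not have to re-read the file later. The attribute name, the use of Linux xattrs via Python 3's os module, and the directory walk are assumptions for illustration, not necessarily what OU's cron job or Bestman actually do:

      # Compute adler32 for files and cache it in an extended attribute.
      # Attribute name "user.checksum.adler32" is an assumed convention.
      # (Python 3: os.getxattr/os.setxattr; in 2010 one would use the xattr module.)
      import os
      import zlib

      XATTR_NAME = "user.checksum.adler32"

      def adler32_of(path, blocksize=1024 * 1024):
          """Stream the file and return its adler32 as an 8-digit hex string."""
          value = 1                                     # adler32 seed
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(blocksize), b""):
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xffffffff)

      def tag_file(path):
          """Store the checksum in an xattr unless it is already present."""
          try:
              os.getxattr(path, XATTR_NAME)
              return                                    # already tagged
          except OSError:
              pass
          os.setxattr(path, XATTR_NAME, adler32_of(path).encode())

      def tag_tree(top):
          """What an hourly cron job might do: tag every file under `top`."""
          for dirpath, _dirs, files in os.walk(top):
              for name in files:
                  tag_file(os.path.join(dirpath, name))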
    • this week: The Lustre upgrade seems to have stabilized things. Pilot update for setting adler32 checksums. Getting OSCER back online. Found problems installing new releases using Alessandro's system.

  • WT2:
    • last week(s): Storage installation in progress. Have set 'timefloor' and lowered nqueue for the analysis site; seems to be working. SRM space reporting uses a cache.
    • this week: Disk failure yesterday - had to shut down FTS channels for a while; fixed. Installing additional storage next week.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases on sites, etc.
  • last meeting(s)
    • Has checked with Alessandro - no show stoppers - but wants to check on a Tier 3 site
    • Will try UWISC next
    • PoolFileCatalog creation - the US cloud uses Hiro's patch; for now a cron job will be run to update the PFC on the sites (see the sketch after this list).
    • Alessandro will prepare some documentation on the new system.
    • Will do BNL first - maybe next week.
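    • Purely as an illustration of the cron-driven PFC update mentioned above (assuming, hypothetically, that the catalogue is published at a central URL - the real mechanism is Hiro's patch and may differ), the refresh can be done atomically so running jobs never see a half-written catalogue:

      # Cron-driven PoolFileCatalog refresh: download to a temporary file,
      # sanity-check it, then atomically replace the live copy.
      # The URL and target path are hypothetical placeholders.
      import os
      import tempfile
      import urllib.request
      import xml.etree.ElementTree as ET

      PFC_URL = "https://example.bnl.gov/pfc/PoolFileCatalog.xml"   # placeholder
      PFC_PATH = "/opt/atlas/poolcond/PoolFileCatalog.xml"          # placeholder

      def refresh_pfc():
          fd, tmp = tempfile.mkstemp(dir=os.path.dirname(PFC_PATH), suffix=".tmp")
          try:
              with os.fdopen(fd, "wb") as out:
                  out.write(urllib.request.urlopen(PFC_URL, timeout=60).read())
              ET.parse(tmp)                 # refuse to install a non-XML download
              os.rename(tmp, PFC_PATH)      # atomic replace on the same filesystem
          except Exception:
              os.unlink(tmp)
              raise

      if __name__ == "__main__":
          refresh_pfc()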
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

AOB

  • last week
  • this week
    • LHC accelerator outage next week, Monday through Thursday.


-- RobertGardner - 24 Aug 2010
