
MinutesNov3

Introduction

Minutes of the Facilities Integration Program meeting, Nov 3, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Aaron, Nate, Charles, Rob, Karthik, John De Stefano, Dave, Michael, Sarah, Torre, Saul, Alden, Shawn, Patrick, Rik, Doug, Armen, Mark, Justin, Horst, Bob, Xin, Jim, Wei, Kaushik, Hiro
  • Apologies: Jason

Integration program update (Rob, Michael)

  • IntegrationPhase15 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • CVE-2010-3856: a patch is available - email Sarah if you would like a script.
      • rds module: a new kernel module is also available (see the check sketch at the end of this section).
      • Any last updates to facilities spreadsheet?
      • Michael - machine status update - still anticipate providing high intensity beams overnight. Tomorrow will mark the end of the p-p run. Next week it will switch to heavy ion.
      • Reprocessing has been delayed; it should start tomorrow at the latest. A second phase will cover the October data; ADC can proceed immediately.
      • Some discussion about ESD and merged ESDs as input for AOD production. Seeing ~40 TB arriving overnight - majority are ESDs. (First pass reconstruction from recent runs.)
      • MC reprocessing will run everywhere, data reprocessing at the Tier 1.
      • 500M events was the original goal for MC - about 50% complete
    • this week
      • Facility capacity summary as of the end of September: capacity-summary-fy10q4.key.pdf
      • It probably contains some errors (e.g. the Tier 1 core count is off); please let Rob know of any.
      • We also need to make sure capacities are updated as new resources are deployed. These are reported to WLCG.
      • This morning's news: the LHC will try again tonight to provide stable beams in p-p mode, one more round. The machine group is still preparing for the heavy-ion run (~50 TB; analysis will be at dedicated sites). There may be some T3's interested. Kaushik: MC production will run at Tier 3.
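    • Sketch referenced above (rds module check) - a minimal, hypothetical python example, not the script Sarah mentions; the file paths and the 'install'/'blacklist' patterns are assumptions:
      #!/usr/bin/env python
      """Illustrative sketch only: report whether the rds kernel module is
      currently loaded and whether it is disabled via a modprobe entry."""

      import glob

      def rds_loaded(proc_modules="/proc/modules"):
          """Return True if the rds module appears in /proc/modules."""
          with open(proc_modules) as modules:
              return any(line.split()[0] == "rds" for line in modules)

      def rds_disabled(conf_glob="/etc/modprobe.d/*"):
          """Return True if a modprobe config disables rds,
          e.g. 'install rds /bin/true' or 'blacklist rds'."""
          for path in glob.glob(conf_glob):
              try:
                  with open(path) as conf:
                      for line in conf:
                          tokens = line.split()
                          if (len(tokens) >= 2
                                  and tokens[0] in ("install", "blacklist")
                                  and tokens[1] == "rds"):
                              return True
              except IOError:
                  continue
          return False

      if __name__ == "__main__":
          print("rds loaded:   %s" % rds_loaded())
          print("rds disabled: %s" % rds_disabled())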

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group TWikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Sites are receiving funds and making orders to Dell
  • Brandeis is up - there are problems with Panda pilots; working out last bits of xrootd; Stony Brook is coming online.
  • T3's and the DYNES project - requests are due in about a month; are the T3's paying attention, or are they focused on purchases, etc.?
  • Work is proceeding on the federated xrootd
  • T3 documentation moved to twiki at CERN, public svn repo
  • Columbia T3?
  • Trip planned to work on xrootd and puppet configuration management
this week:
  • Six or seven sites are in the process of coming up. Stony Brook, Irvine, Iowa, Columbia, ...
  • Panda-T2 not quite working. Alden: there was a cert issue at Brandeis, which should now be fixed. There may be other bugs behind it; waiting on more information. A hello-world job will be submitted to test.
  • Started a discussion with OSG re: security. Felt the information from recent security incidents was too technical. Proposed new workflow: OSG security informs Doug, then a T3g-specific communication will be created and sent back to OSG security for dissemination. Rik: wants the message to be very targeted. Horst: the part-time physicist-admin needs to respond.
  • Xrootd - we have a configuration, as a first step. Now in second stage for on-site data management, as well as transfer of data from the outside.
  • Demonstrator project - want to make sure recommendations for Tier 3's will be compatible with a federated structure. Doug will visit Andy and Wei at SLAC.
  • Doug: need some documentation for dq2-FTS - installing the plugin.
  • The buffer space has a stand-alone data server. You can specify a directory path according to the namespace convention. A python script will be put into the SVN repository (see the path sketch at the end of this list).
  • (Hiro will change the plugin to follow the convention)
  • Hiro has set up the FTS monitoring
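  • Path sketch referenced above - a hypothetical illustration of how a dataset name and filename might map to a directory on the stand-alone buffer data server; the layout <base>/<project>/<dataset>/<filename> is an assumed convention, not necessarily the one Hiro's plugin will follow:
    #!/usr/bin/env python
    """Hypothetical sketch: build a destination path on a stand-alone buffer
    data server from a dataset name. The layout is an assumed convention."""

    import os

    def dataset_to_path(base, dataset, filename):
        """Map (dataset, filename) to a path under 'base'; the project is
        taken to be the first dot-separated field of the dataset name
        (an assumption for illustration)."""
        project = dataset.split(".")[0]
        return os.path.join(base, project, dataset, filename)

    if __name__ == "__main__":
        print(dataset_to_path("/data/buffer",
                              "mc10_7TeV.105000.example.AOD.e574_s934",
                              "AOD.012345._000001.pool.root.1"))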

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Some clean-up of sites, have about 30% available
    • Waiting on central ops cleanup before doing more; expect campaign to last a month
    • Proddisk clean-up issue - panda mover files are unregistered and sit at sites as dark data; can central ops handle this? Meanwhile Charles provided a script to recognize these and clean them up (see the sketch at the end of this section).
    • Hiro is running userdisk cleanup at sites, in progress
  • this week:
    • Lots of clean up by Armen, sites in good shape
    • Looking into BNL storage: 600 TB of deletions are waiting due to an issue with the deletion service. User deletions (many small files) are also creating a backlog.
    • At other T2's, deletion backlogs of roughly 40-60 TB.
    • Will keep an eye on these
    • Michael: observe long periods without activity; Armen and Hiro have been complaining.
    • Charles: central deletion works file-by-file and goes through central operations. He believes the administration should be handled centrally, but the actual deletion operation should run locally inside the SRM. Michael: this is understood.
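    • Sketch referenced above (dark data in PRODDISK) - an illustration only, not Charles's actual script; it flags files present on disk but absent from a plain-text dump of catalog-registered paths (the dump format is an assumption):
      #!/usr/bin/env python
      """Illustrative sketch only: list 'dark' files in a PRODDISK area, i.e.
      files on disk that are absent from a dump of catalog-registered paths
      (one full path per line - an assumed format)."""

      import os
      import sys

      def registered_paths(dump_file):
          """Load the set of registered file paths from a plain-text dump."""
          with open(dump_file) as dump:
              return set(line.strip() for line in dump if line.strip())

      def dark_files(proddisk_root, registered):
          """Yield files under proddisk_root not present in the registered set."""
          for dirpath, _dirnames, filenames in os.walk(proddisk_root):
              for name in filenames:
                  path = os.path.join(dirpath, name)
                  if path not in registered:
                      yield path

      if __name__ == "__main__":
          root, dump = sys.argv[1], sys.argv[2]
          for dark in dark_files(root, registered_paths(dump)):
              print(dark)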

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=111758
    
    1)  10/21: NERSC_HOTDISK file transfer errors - authentication issue with NERSC accepting the ATLAS production voms proxy.  Hiro set the site off-line in DDM until the problem is resolved.  ggus 63319 in-progress, eLog 18494.
    2)  10/21: HU_ATLAS_Tier2 - job failures due to missing/incomplete ATLAS release 16.0.2.  Missing s/w installed, issue resolved.  https://savannah.cern.ch/bugs/index.php?74275 closed, eLog 18551.
    3)  10/21 - 22: SLAC maintenance outage - completed, back on-line as of ~4:45 p.m. CST Friday.  ggus tickets 63369 & 63372 were opened during this period, both subsequently closed; eLog 18550.  From Wei:
    It took much longer than we expect. WT2 is now back online to the status before the outage (with one failed disk). But at least we produced more error/warning logs in the effect to satisfy Oracle's disk warranty requirement.
    4)  10/23- 10/25: SWT2_CPB went off-line on Saturday due to a problem with the building generator-backed power feed to the cluster UPS.  Power was restored, but it was decided to use this outage to make a 
    planned change to the xrootd system.  Back on-line as of 11:00 p.m. on Monday.  eLog 18640.
    5)  10/24: MWT2_DATADISK - file transfer errors with "source file doesn't exist."  Issue understood - from Wensheng:
    This is a kind of race condition that happened. The dataset replica removal at MWT2_DATADISK was triggered for space purpose. There are multiple replicas available elsewhere.  Savannah 74358 closed, eLog 18618.
    6)  10/27: HU_ATLAS_Tier2 - job failures with lsm errors:
    "27 Oct 04:18:14|Mover.py | !!FAILED!!3000!! Get error: lsm-get failed: time out after 5400 seconds."  ggus 63486 in-progress, eLog 18670.
    7)  10/27 early a.m.:  RT # 18441 was generated for SWT2_CPB_SE due to one or more RSV tests failing for a short period of time.  Issue understood - from Patrick: The addition of new storage to the SE required a restart of the SRM.   
    This seems to have occurred during the RSV tests, as later tests are passing.  Ticket closed.
    8)  10/27: Job failures at several U.S. sites due to missing atlas s/w release 16.0.2.  Issue understood - from Xin:
    SIT released a new version of the pacball for release 16.0.2, so I had to delete the existing 16.0.2 and re-install them. So far the base kit 16.0.2 has been re-installed, and 16.0.2.2 cache is also available at most sites, 
    I just start the re-installation of 16.0.2.1 cache, which should be done in a couple of hours.  ggus 63503 in-progress (but can probably be closed at this point), eLog 18678.
    
    Follow-ups from earlier reports:
    
    (i)  Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
    Update 10/14:  Trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS.
    Update 10/20: Still trying to understand the brokerage problem.
    (ii)  9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364."  ggus 62428 / RT 18249 in-progress, eLog 17474.
    Update from Patrick, 10/4: We are investigating the use of round-robin DNS services to create a coarse load balancing mechanism to distribute data access to multiple Frontier/Squid clients.
    Update from Patrick, 10/21: This issue has been resolved.  The POOLFileCatalog.xml file is now being generated correctly for the cluster and we have configured a squid instance to support Frontier access, when needed.  
    ggus & RT tickets closed.
    (iii)  9/30: HU_ATLAS_Tier2 - jobs from several tasks were failing with the error "TRF_UNKNOWN | 'poolToObject: caught error."  ggus 62642 in-progress, eLog 17662. 
    Update: as of 10/20 issue resolved, and the ggus ticket was closed.
    (iv)  10/4: ANL_LOCALGROUPDISK - all transfers to / from the token are failing.  ggus 62750 in-progress, eLog 17803.
    Update 10/22: ggus ticket closed by Doug B.  eLog 18539.
    
    
    • A quiet week in the US cloud
     • All carryover issues from last week resolved - good.
    • Large spike in job failures over the weekend - merge tasks failing badly
    • Two new shifters added in US timezone this week
  • this meeting:
    • Weekly report:
      Yuri's summary from the weekly ADCoS meeting:
      http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=112416
      
      1)  10/27: WISC_DATADISK - possibly a missing file.  ggus 63526 in-progress, eLog 18698.
      2)  10/27: NET2_USERDISK transfer errors - " [SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]."
      From Saul: This doesn't appear to be a site issue (the files are indeed no longer listed for our site in DQ2, in our LFC, or on disk), but rather some sort of race condition between scheduling to use us as a source and deletion.  ggus 63533 closed, eLog 18704.
      3)  10/27: SMU_LOCALGROUPDISK - file transfer errors due to an expired host cert.  New cert installed on 10/29, site un-blacklisted on 10/31.  ggus 63535 closed, eLog 18810.
      4)  10/27: SLACXRD - job failures with the error "FID "8E91164C-1E3C-DB11-8CAB-00132046AB63" is not existing in the catalog."  Xin fixed the PFC - issue resolved.  https://savannah.cern.ch/bugs/?74553, eLog 18760.
      5)  10/28: UPENN - file transfer errors due to an expired host cert.  New cert installed on 10/29, but continued to see transfer errors.  Hiro helped the site to debug the problem.  Issue seems to be resolved as of 11/2, so ggus 63574 closed, eLog 18951.
      6)  10/28: Job failures at OU_OCHEP & OSCER with an error like "pilot: Get error: No such file or directory."  Issue was an incorrect entry in schedconfig (seprodpath = storage/data/atlasproddisk).  Updated, issue resolved.
      7)  10/29: From Bob at AGLT2: 3 short, closely spaced power hits took down 7 WN, and the jobs that were running on them at the time.  Perhaps 80 jobs were lost.  WN are back up now.
      8)  10/29: Disk failure problem at SLAC - from Wei: We have many disk failures in a storage box. I am shutting down everything to minimum data loss.
      Data from the affected storage was moved elsewhere - issue resolved.
      9)  10/30:  BNL-OSG2_DATADISK - file transfer errors due to timeouts.  Issue resolved - from Michael: The load across pools was re-balanced.  eLog 18794.
      10)  10/30: AGLT2 - large number of failed jobs with "lost heartbeat" errors.  From Tom at MSU: "Some cluster network work yesterday had more impact than foreseen resulting in removal of all running Atlas jobs."  ggus 63629 / RT 18466 closed, eLog 18832.
      11) 10/30: Job failures at OU_OSCER_ATLAS due to missing release 16.0.2.  Alessandro fixed an issue with the s/w install system that was preventing a re-try after one or more earlier failed attempts.  16.0.2 now available at the site.  ggus 63635 / RT 18467 closed, eLog 18812.
      12)  10/30: ANL - file transfer tests failing with the error "failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]. Givin' up after 3 tries]."  ggus 63633 in-progress, eLog 18807.
      13)  10/30: New site SLACXRD_LMEM-lsf available, test jobs submitted.  Initially an issue with getting pilots to run at the site - now resolved.  Queue is currently set to 'brokeroff'.
      14)  10/31: Job failures at SLACXRD with the error "Required CMTCONFIG (i686-slc5-gcc43-opt) incompatible with that of local system."  From Xin: The installation at SLAC is corrupted, I am reinstalling there, will update the ticket after the re-install is done.  
      ggus 63639 in-progress, eLog 18845.
      15)  11/1: Maintenance outage at AGLT2.  Queues back on-line as of ~10:00 p.m. EST.
      16)  11/1: Power outage at BNL (Switching back to utility power failed following completion of work on a primarily electrical feed circuit).  Issue resolved, all services restored as of ~11:00 p.m. EST.  eLog 18896.
      17)  11/1: OU_OCHEP_SWT2 file transfer errors.  Issue understood - from Horst: SRM errors were caused by our Lustre servers crashing and rebooting.  DDN fixed the problem, and they are investigating what happened.  ggus 63644 & 63662 / RT 18473 & 18480 closed,  eLog 18851.
      18)  11/1: NET2 job failures understood - from John & Saul: Just to let you know that we're getting some LSM errors at our Harvard sites due to an overloaded gatekeeper at BU. We've taken some steps which should clear this up, but we're expecting a batch of failed jobs in the 
      next couple of hours.  eLog 18875.
      19)  11/1: HU_ATLAS_Tier2 - large number of job failures with the error "lsm-get failed: time out after 5400 seconds."  Issue understood - from Saul: This problem is gone now. It was caused by a sudden bunch of production jobs with huge 2.6 GB log files. 
      Paul Nilsson has submitted a ticket about that.  
      We've also made networking adjustments so that these kind of jobs wouldn't actually fail in the future.  ggus 63665 / RT 18545 closed, eLog 18891.  https://savannah.cern.ch/bugs/index.php?74720.
      20)  11/2: AGLT2 - all jobs failing with the errors indicating a possible file system problem.  From Bob: We have determined that the problem is a corrupted NFS file system hosting OSG/DATA and OSG/APP. That is the bad news. 
      The good news is that this is a copy to a new host from yesterday, so the original will be used to re-create it.  ggus 63684 in-progress, eLog 18913.  Queues set off-line.
      21)  11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  May need to run a job "by hand" to 
      get more detailed debugging information.  In-progress.
      
      Follow-ups from earlier reports:
      
      (i)  Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
      Update 10/14:  Trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS.
      Update 10/20: Still trying to understand the brokerage problem.
      Update 10/27: The field 'CMTCONFIG' in schedconfig for OSCER had an old value, so jobs now getting brokered to the site.
      (ii)  10/21: NERSC_HOTDISK file transfer errors - authentication issue with NERSC accepting the ATLAS production voms proxy.  Hiro set the site off-line in DDM until the problem is resolved.  ggus 63319 in-progress, eLog 18494.
      (iii)  10/27: HU_ATLAS_Tier2 - job failures with lsm errors:
      "27 Oct 04:18:14|Mover.py | !!FAILED!!3000!! Get error: lsm-get failed: time out after 5400 seconds."  ggus 63486 in-progress, eLog 18670.
      Update 10/28, from Saul: We had a networking problem last night between about midnight and 6 a.m. EST.  We don't yet understand exactly what happened, but during this period, our network throughput went way down and about 500 jobs failed due to LSM timeout.  ggus 63486 closed.
      (iv)  10/27: Job failures at several U.S. sites due to missing atlas s/w release 16.0.2.  Issue understood - from Xin:
      SIT released a new version of the pacball for release 16.0.2, so I had to delete the existing 16.0.2 and re-install them. So far the base kit 16.0.2 has been re-installed, and 16.0.2.2 cache is also available at most sites, I just start the re-installation of 16.0.2.1 cache, 
      which should be done in a couple of hours.  ggus 63503 in-progress (but can probably be closed at this point), eLog 18678.
      Update 11/2: No additional errors - ggus ticket closed.
      
      
    • Most issues above have been resolved.
    • Open issues - AGLT2 down at the moment.
    • Job failures at OU-OSCER - site specific issue, dealing with this on email.

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Meeting notes
      USATLAS Throughput Meeting - October 26, 2010
      	=============================================
      
      Attending:  Shawn, Dave, Andy, Philippe, Sarah, Karthik, Hiro, John, Horst, Tom, Doug 
      Excused: Jason
      
      1) No updates for retesting OU - BNL path.   Karthik reported that as of the USATLAS facility meeting there was still poor throughput.   John will re-run BNL tests to various ESnet locations.    Dave reported that not too much progress.  The perfSONAR box was moved and then broke.  Need to get back to.  Karthik reports on tests during the call: OU->Kansas, OU->ESnet(BNL) gets 3 Mbps.  Unable to run reverse direction. Fixed reverse direction (problem in config at OU) (More details later in notes)
      
      2) perfSONAR status.   BNL and OU have CDs burned. Plan to install/upgrade soon.  Illinois has success in using the LiveCD.  Attempt to upgrade using net-install option not completely successful.  The perfSONAR MA won't start in this case.  Followed Jason's instructions (twice)...need to work with Jason on debugging.  MSU updated both nodes to v3.2 using LiveCD method; latency is OK, bandwidth has a service not running.  Philippe has email out about this problem.
      
      3) Monitoring -  Nagios monitoring discussed. Tom gave overview of current situation and is willing to work with our group in defining further monitoring capabilities for the dashboard. Dashboard for perfSONAR seems very useful and should meet our needs for monitoring perfSONAR instances.  Currently have SLAC, BU and OU instances down.  Discussed possible extensions for Tom's Nagios dashboard.   Tom can also add additional email notifications.   If site's want to add additional responsible perfSONAR people they can send Tom the address(es).   
      
      Hiro is working on gathering the perfSONAR data and augmenting it with additional tracking of the traceroute (forward and reverse) between sites.    
      
      Further testing info.   John gets ~1 Gbps BNL-Chicago while only 16-200 Mbps BNL-Kansas City.  Traceroute shows the path to OU includes both Chicago and Kansas City.  Karthik's tests from bnl-pt1.es.net to Kansas City got 4.5 Gbps and 4.2 Gbps Kansas to bnl-pt1.es.net.  John's succeeding tests show BNL-Kansas City close to 1 Gbps.    Could be real congestion is complicating the testing.   Situation seems to be that there is a problem between OU and Kansas City but this could also be real traffic congesting the links.  Needs further work.
      
      Hiro mentioned Tier-2 to Tier-2 tests (worldwide) are underway.   Important to have network data to help support this work longer term.
      
      Doug mentioned alerting ATLAS sites to the DYNES process and the need to make sure we have a large number of ATLAS institutions participating.  Note: deadline for DYNES site submissions is the end of November!  USATLAS related sites should be strongly encouraged to participate.  See http://www.internet2.edu/ion/dynes.html for more information (and pass the word).
      
      We plan to meet again in 2 weeks at our regular time.  Please send along correction or additions to the list.
      
      Thanks,
      
      Shawn
    • John and BNL made measurements between OU and BNL to understand connectivity. Problems were found somewhere on the Chicago - Kansas City - OU path, but the precise location has not been determined.
    • 3.2 release of perfSONAR - want to use this version to collect infrastructure measurements; need to get sites updated.
    • Nagios monitoring at BNL - a nice way to keep track of which services are up, etc.
    • E.g. a UDP packet transmission test (10 out of a group of 600)
    • Hiro is gathering the data from instances, making it available for plots from his site
    • DYNES information for sites has been available, finalized by November 1. Encourage all sites on the call to be part of DYNES. Need all Tier 2's to be a part of it.
    • http://www.internet2.edu/ion/dynes.html
  • this week:
    • Please update to the perfSONAR 3.2 release at all sites
    • Request to preserve historical data
    • Details are available in Jason's FAQ
    • Karthik will send a link
    • Updates are due by the next throughput meeting

Global Xrootd

last week:

this week:

  • make sure we're on the same page regarding the global name convention (/atlas/dq2 vs /atlas) - see the name-mapping sketch after the meeting notes below
    • drop the leading /grid and the /dq2 after /atlas
    • provision for "local" files (/atlas/local)
  • LFC global namespace plugin now available - Charles: this is done, modulo /dq2 mangling
    • Wei has it working at SLAC
    • Brian's dcap libs for xrootd are available for use
  • xrootd at SLAC now visible from global redirector with this plugin
  • frm caching tests between SLAC, Duke and FNAL (xrdcp was used as a cheat to trigger the frm copy)
  • hooking in dcache at MWT2
  • AGLT2 (post UC testing)
  • hooking in xrootd at UTA
  • GPFS at NET2 - should be an easy case (Saul will talk to Wei and Charles) - should be the same as Lustre
  • next tests
  • summary of osg-xrootd meeting yesterday
    From: Tanya Levshina 
    Date: November 2, 2010 1:30:49 PM CDT
    To: Wei Yang , Marco Mambelli , Alain Roy , Doug Benjamin , Charles G Waldman , Tim Cartwright 
    Cc: Doug Benjamin , Rob Gardner , Brian Bockelman , Rik Yoshida , Fabrizio Furano , Wilko Kroeger , OSG-T3-LIASON@OPENSCIENCEGRID.ORG
    Subject: OSG-ATLAS-Xrootd meeting - November 2nd @11 am CST - minutes
     
    Attendees; Alain, Tim, Marco, Andy, Charles, Wei and Tanya
    
    Agenda:
    
    1. Meeting time. Tanya will set up a doodle poll to decide on the  time for the next meeting (ATLAS week at CERN during first week of December).
    
    2. VDT progress with RPMS (Tim Cartwright)
         a. xrootd rpm will be released within a couple of weeks
         b. yum repository is set up in VDT
         c. configuration will come later in a separate package, in ~ a month
         d. VDT will work on the related packages xrootd-gridftp, xrootdfs, bestman after this is done. Xrootd has a higher priority than these other packages.
    
    Tanya: Are these priorities coming from Atlas Tier-3?
    
    Marco: I've talked to Doug B. and it looks like the xrootd rpm will be used outside the US as well, so the xrootd rpm release has a higher priority than other components. 
    Also, it is ok if rpms have to be installed as "root", but an authorized user should be able to change the configuration. 
    
    Andy: a non-privileged authorized user should have access to configuration files and be able to start/stop services.
    
    Alain: this could be done with sudo.
    
    3. The xrootd release with all the patches provided by Brian should come within next week. New xrootdFS will be included into repository and can be built simultaneously with xrootd.
    
    Tim: Please let me know when the new release is ready. 
    Andy: you should subscribe to xrootd-l@slac.stanford.edu to get notification
    
    Tanya: Should we change anything in VDT configuration for this release?
    Andy: It is backward compatible and will remain so for the next two(?) years, but you can drop a couple of unnecessary directives if you want. 
    
    Tanya: Should we worry about adding configuration changes for the demonstrator projects? Do ATLAS Tier-3 sites need them right away? 
    Andy: I don't think that the regular Tier-3 site will use it now and the sites that are working with demonstrator projects know how to change the configuration. You should talk to Doug and Rik to understand their requirements.
    
    Tanya: Does this new xrootdFS have fixes that take care of file deletion by unauthorized users? 
    Wei: Yes, it is fixed, but xrootdFS now requires creating keys for xrootdfs and distributing them to all data servers.  
    Tanya: We need to understand this better. Could we talk about it in detail?
    
    
    Please feel free to add/modify my notes.
    Thanks,
    Tanya
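  • Name-mapping sketch (referenced at the top of this section) - an illustration only, not the LFC plugin itself; it strips the leading /grid and the /dq2 after /atlas from an LFC-style path:
    #!/usr/bin/env python
    """Illustrative sketch of the global name convention discussed above."""

    def to_global_name(lfc_path):
        """Map an LFC-style path to the proposed global xrootd namespace."""
        path = lfc_path
        if path.startswith("/grid/"):
            path = path[len("/grid"):]                    # drop leading /grid
        if path.startswith("/atlas/dq2/"):
            path = "/atlas/" + path[len("/atlas/dq2/"):]  # drop /dq2 after /atlas
        return path

    if __name__ == "__main__":
        # e.g. /grid/atlas/dq2/... -> /atlas/...
        print(to_global_name("/grid/atlas/dq2/mc10_7TeV/AOD/example/file.root"))
        # "local" (non-DQ2) files would live under /atlas/local per the note above
        print(to_global_name("/atlas/local/user/somefile.root"))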

Site news and issues (all sites)

  • T1:
    • last week(s): will be adding 1300 TB of disk; the installation is ready and will be handed over to the dCache group to integrate into the dCache system by next week. CREAM investigations are continuing. LCG made an announcement to the user community that the existing CE will be deprecated by the end of the year, urging sites to convert. Have discussed with OSG on behalf of US ATLAS and US CMS - Alain Roy is working on this and it will be ready soon. Submission and the Condor batch backend at sites will need to be tested. Preliminary results looked good for a single site, but CMS found problems with submission to multiple sites. The plan is to submit 10K jobs from the BNL submit host to 1000 slots at UW to validate readiness (Xin). Note: there is no end-of-support date for the current GT2-based OSG gatekeeper.
    • this week:

  • AGLT2:
    • last week: Have taken delivery of all of the disk trays, now under test. Will coordinate turning on shelves between the sites. Looking at XFS su & sw sizes as a tool to optimize MD1200 performance. At Michigan, two dcache headnodes, 3 each at a site. Expect a shutdown in December. Major network changes; deploying 8024F's. Performance issue with the H800 and a 3rd MD1200 shelf.
    • this week:

  • NET2:
    • last week(s): ANALY queue at HU - available to add more analysis jobs. Expect to stay up during the break.
    • this week:

  • MWT2:
    • last week(s): Security mitigation complete. One pool node has a bad DIMM and needs a BIOS update. Running stably.
    • this week:

  • SWT2 (UTA):
    • last week: Working to get conditions access set up correctly for the UTA_SWT2 cluster since it's being converted to analysis; a second squid server at UTA uses the same DNS name.
    • this week:

  • SWT2 (OU):
    • last week: Everything running smoothly. The only issue is getting production jobs to OSCER.
    • this week:

  • WT2:
    • last week(s): Deployed pcache, working fine. A 4-hour shutdown to update kernels(?). Two disks failed last week; need to produce more logging.
    • this week:

Carryover issues ( any updates?)

HEPSpec 2006 (Bob)

last week:

  • HepSpecBenchmarks
  • MWT2 results were run in 64 bit mode by mistake; Nate is re-running.
  • Assembling all results in a single table.
  • Please send any results to Bob - important for running the benchmark.
  • Duplicate results don't hurt.

this week:

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last report
    • Testing at BNL - 16.0.1 installed using Alessandro's system, into the production area. Next up is to test DDM and poolfilecatalog creation.
  • this meeting:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • A newline was causing a buffer overflow
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • The maximum number of active movers needs to be set high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion into the release
  • No new failure modes observed

this meeting:

AOB

  • last week
    • Please check out Bob's HS06 benchmark page and send him any contributions.
  • this week


-- RobertGardner - 02 Nov 2010

Attachments


pdf capacity-summary-fy10q4.key.pdf (902.1 KB) - RobertGardner, 03 Nov 2010 - 12:34
 