
MinutesNov10

Introduction

Minutes of the Facilities Integration Program meeting, Nov 10, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Rob, Aaron, Nate, Charles, Michael, Dave, Fred, Karthik, Booker, Rik, Sarah, Wei, Doug B, Shawn, Patrick, Xin, Tom, Saul, Armen, Kaushik, Mark, Horst, Wensheng, Torre, Alden, John B, Hiro
  • Apologies: Jason (@ SC), John De Stefano

Integration program update (Rob, Michael)

  • IntegrationPhase15 NEW
  • Special meetings
    • Tuesday (12 noon CDT): Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s)
      • Facility capacity summary as of the end of September; it may contain errors (e.g., the Tier 1 core count is off). Please let Rob know of any errors.
      • We also need to make sure capacities are updated as new resources are deployed. These are reported to WLCG.
      • As of this morning, the LHC will try again tonight to provide stable beams in p-p mode, one more round. The machine group is still preparing for the heavy-ion run (50 TB; analysis will be at dedicated sites). There may be some T3's interested. Kaushik: MC production will run at Tier 3.
    • this week
      • CapacitySummary - complete for the last phase, thanks all.
      • There may be some issues with installed capacity as reported to WLCG - http://gstat-wlcg.cern.ch/apps/capacities/comparision/
      • Heads up regarding site status monitoring and auto-exclusion changes coming next week (from Nurcan):
        • See talks by Alessandro Di Girolamo and Dan van der Ster at this week's ADC weekly meeting, http://indico.cern.ch/conferenceDisplay.py?confId=112808
        • Sites should make sure that Athena release 15.6.9 is always available at their sites (it is used by the HammerCloud analysis test jobs; a second test using release 16.0.2 will be added). A minimal availability-check sketch follows this list.
        • Mail from ADC shifters will be sent to the US cloud support mailing list, atlas-support-cloud-us@cern.ch. Make sure we have the relevant people subscribed to this list. Currently Nurcan, Alden, Mark and racf-wlcg-announce-l@lists.bnl.gov (who is on this list? J. Hover has been asked). Add Hiro, Wensheng, Xin? Subscription to the cloud support list is via this link.
        • Nurcan to give detailed report next week after the system has been tested
      • The reprocessing campaign is ongoing at the Tier 1s, and MC reprocessing will start soon.
      • It's almost a PB of data; the US has 380 TB to do at BNL. Well underway. All Tier 1's are participating.
      • Stress in the usage of DATADISK - at BNL it was filling rapidly, at 2 TB/hour, so management of this space is critical. 7 PB is in production but only a few hundred TB are left - at that fill rate only a few days of headroom - so it's an urgent matter now.
      • Heavy-ion collisions - it took only 4 days to convert from p-p. The machine will shut down on December 7.
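
      A minimal sketch (not an official tool) following up on the HammerCloud release note above: a site-side check that the expected Athena releases are present. The $VO_ATLAS_SW_DIR default and the software/<release> layout assumed below vary between sites and install methods, so treat the path construction as an assumption to adapt locally.

        import os

        # Releases used by the HammerCloud analysis tests (per the note above).
        REQUIRED_RELEASES = ['15.6.9', '16.0.2']

        def missing_releases(sw_dir=None):
            """Return the required releases not found under the site software area."""
            # VO_ATLAS_SW_DIR and the software/<release> layout are assumptions.
            sw_dir = sw_dir or os.environ.get('VO_ATLAS_SW_DIR', '/opt/atlas')
            missing = []
            for rel in REQUIRED_RELEASES:
                if not os.path.isdir(os.path.join(sw_dir, 'software', rel)):
                    missing.append(rel)
            return missing

        if __name__ == '__main__':
            gone = missing_releases()
            if gone:
                print('Missing Athena releases: %s' % ', '.join(gone))
            else:
                print('All required Athena releases appear to be installed.')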

Release installation, validation (Xin)

Issues of the validation process, completeness of releases at sites, etc.
  • last report
    • Testing at BNL - 16.0.1 was installed using Alessandro's system, into the production area. Next up is to test DDM and PoolFileCatalog creation.
  • this meeting:
    • See slides attached below
    • The goal is to unify installation methods with the other clouds
    • https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe, and get notified
    • Documentation isn't ready
    • Migrate Tier 2's, but when?
    • SLAC - planning to move ATLAS releases to a new server.
    • Start with 16.3.0; will also require a change in the ToA.
    • Xin will communicate the transition to the site admins and to Alessandro
    • Will start with all the Tier 2's.

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Twiki page setup at CERN: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasXrootdSystems
  • Meeting https://twiki.cern.ch/twiki/bin/viewauth/Atlas/XrdMeetingOct25
  • make sure we're on the same page regarding the global name convention (/atlas/dq2 vs /atlas)
    • drop leading /grid and ../dq2 after atlas
    • provision for "local" files (/atlas/local)
  • LFC global namespace plugin now available - Charles: this is done, modulo /dq2 mangling
    • Wei has it working at SLAC
    • Brian's dcap libs for xrootd are available and in use
  • xrootd at SLAC now visible from global redirector with this plugin
  • FRM caching tests between SLAC, Duke and FNAL (xrdcp was used as a cheat to trigger the FRM copy)
  • hooking in dcache at MWT2
  • AGLT2 (post UC testing)
  • hooking in xrootd at UTA
  • GPFS at NET2 - should be an easy case (Saul will talk to Wei and Charles) - should be the same as Lustre
  • next tests
  • summary of osg-xrootd meeting yesterday
    From: Tanya Levshina 
    Date: November 2, 2010 1:30:49 PM CDT
    To: Wei Yang , Marco Mambelli , Alain Roy , Doug Benjamin , Charles G Waldman , Tim Cartwright 
    Cc: Doug Benjamin , Rob Gardner , Brian Bockelman , Rik Yoshida , Fabrizio Furano , Wilko Kroeger , OSG-T3-LIASON@OPENSCIENCEGRID.ORG
    Subject: OSG-ATLAS-Xrootd meeting - November 2nd @11 am CST - minutes
     Attendees: Alain, Tim, Marco, Andy, Charles, Wei and Tanya
     Agenda:
     1. Meeting time. Tanya will set up a doodle poll to decide on the time for the next meeting (ATLAS week at CERN during the first week of December).
     2. VDT progress with RPMs (Tim Cartwright)
          a. the xrootd rpm will be released within a couple of weeks
          b. a yum repository is set up in the VDT
          c. the configuration will come later in a separate package, in about a month
          d. the VDT will work on the related packages (xrootd-gridftp, xrootdfs, bestman) after this is done. Xrootd has a higher priority than these other packages.
     Tanya: Are these priorities coming from Atlas Tier-3?
     Marco: I've talked to Doug B. and it looks like the xrootd rpm will be used outside the US as well, so the xrootd rpm release has higher priority than the other components.
     Also, it is ok if rpms have to be installed as "root", but an authorized user should be able to change the configuration.
     Andy: a non-privileged authorized user should have access to the configuration files and be able to start/stop services.
     Alain: this could be done with sudo.
     3. The xrootd release with all the patches provided by Brian should come within the next week. The new xrootdFS will be included in the repository and can be built simultaneously with xrootd.
     Tim: Please let me know when the new release is ready.
     Andy: you should subscribe to xrootd-l@slac.stanford.edu to get notifications.
     Tanya: Should we change anything in the VDT configuration for this release?
     Andy: It is backward compatible and will remain so for the next two(?) years, but you can drop a couple of unnecessary directives if you want.
     Tanya: Should we worry about adding configuration changes for the demonstrator projects? Do Atlas Tier-3 sites need them right away?
     Andy: I don't think that a regular Tier-3 site will use it now, and the sites that are working with demonstrator projects know how to change the configuration. You should talk to Doug and Rik to understand their requirements.
     Tanya: does this new xrootdFS have the fixes that take care of file deletion by unauthorized users?
     Wei: Yes, it is fixed, but xrootdFS now requires creating keys for xrootdfs and distributing them to all data servers.
     Tanya: We need to understand this better. Could we talk about it in detail?
    Please feel free to add/modify my notes.
    Thanks,
    Tanya
this week:
  • Doug: testing revised configuration files from Andy and Wei, and scripts for input copy. Will need a newer version of xrootd than what's in the VDT. Several sites in Europe (especially the UK with DPM, and Spain with Lustre) are expressing interest in participating. Also needs to work with Graham on a schedule. Will put this into the Tier 3 part of the project.
  • Charles: the namespace convention was implemented easily with symlinks in the LFC. The module for LFC lookup is working at SLAC and UC. xrd-dcap debugging - it will need to be repackaged. The UC-SLAC testbed is working. A file can be accessed using the global ATLAS namespace (see the sketch after this list).
  • Patrick, Shawn, Saul standing by ready to test.
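
  A minimal sketch of the path mangling implied by the naming convention discussed above (drop the leading /grid and the /dq2 component after /atlas, with /atlas/local reserved for local files). The function name and example paths are illustrative only, not the actual LFC plugin code.

      def to_global_name(lfc_path):
          """Map an LFC/DQ2-style path onto the global /atlas namespace."""
          parts = [p for p in lfc_path.split('/') if p]
          if parts and parts[0] == 'grid':                # drop the leading /grid
              parts = parts[1:]
          if len(parts) > 1 and parts[0] == 'atlas' and parts[1] == 'dq2':
              parts = [parts[0]] + parts[2:]              # drop /dq2 after /atlas
          return '/' + '/'.join(parts)

      if __name__ == '__main__':
          # Hypothetical example paths; /atlas/local entries are left untouched.
          for p in ('/grid/atlas/dq2/mc10_7TeV/AOD/example.file.root',
                    '/atlas/dq2/user/someuser/example.file.root',
                    '/atlas/local/scratch/example.file.root'):
              print('%s -> %s' % (p, to_global_name(p)))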

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

References
  • The links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Six or seven sites are in the process of coming up. Stony Brook, Irvine, Iowa, Columbia, ...
  • Panda-T2 is not quite working. Alden: there was a cert issue at Brandeis, which should be fixed. There may be other bugs behind it; waiting on more information. Submit a hello-world job to test.
  • Started a discussion with OSG regarding security. The feeling was that the information from recent security incidents was too technical. Discussion of a new workflow - OSG security informs Doug, then a T3g-specific communication will be created and sent back to OSG security for dissemination. Rik: wants the message to be very targeted. Horst: the part-time physicist-admin needs to be able to respond.
  • Xrootd - we have a configuration, as a first step. Now in second stage for on-site data management, as well as transfer of data from the outside.
  • Demonstrator project - want to make sure recommendations for Tier 3's will be compatible with a federated structure. Doug will visit Andy and Wei and SLAC.
  • Doug: need some documentation for dq2-FTS - installing the plugin.
  • The buffer space has a stand-alone data server. You can specify a directory path according to the namespace convention. A python script will be put into the SVN repository (an illustrative sketch follows this list).
  • (Hiro will change the plugin to follow the convention)
  • Hiro has set up the FTS monitoring
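
  Purely illustrative, since the actual script lives in SVN and is not reproduced in these minutes: a sketch of how a destination directory on the stand-alone data server might be built according to the /atlas namespace convention. The root path, the use of /atlas/local, and the dataset-based layout are all assumptions.

      import os

      XROOTD_DATA_ROOT = '/atlas'      # assumed namespace root on the data server
      LOCAL_AREA = 'local'             # "local" files live under /atlas/local per the convention

      def dataset_directory(dataset, local=False):
          """Return the directory where files of a dataset would be placed."""
          if local:
              return os.path.join(XROOTD_DATA_ROOT, LOCAL_AREA, dataset)
          return os.path.join(XROOTD_DATA_ROOT, dataset)

      if __name__ == '__main__':
          # Hypothetical dataset name, shown only to illustrate the layout.
          print(dataset_directory('user.someuser.test.ntuple', local=True))
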
this week:
  • Lining up examples for analysis - Nils is working at ANL for three days. Amir's n-tuple example is running on a Tier 3.
  • DESY has a large n-tuple benchmark package - adapting this as well.
  • Tier3-Panda - has an account at Argonne, working.
  • Doug at SLAC - working on the next T3 xrootd configuration; needs to be synchronized with the VDT rpm
  • Doug will work with Yu Shu to use Puppet.
  • All the scripts are in SVN at CERN, head node installation has been tested by Doug; worker-node installation has been tested.
  • The Twiki security policy is creating access problems
  • Yale is having problems with client tools - will look into gridftp-FTS
  • dq2-ls and dq2-get will go into the next release candidate, before December
  • CERN IT plus Dubna developers are starting a development effort for T3s.

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Lots of clean up by Armen, sites in good shape
    • Looking into BNL storage: 600 TB of deletions are waiting; there is an issue with the deletion service. Also, user deletions (many small files) are creating a backlog.
    • At the other T2's, some deletion pending, 40-60 TB.
    • Will keep an eye on these.
    • Michael: we observe long periods without deletion activity; Armen and Hiro have been complaining.
    • Charles - central deletion works file-by-file and goes through central operations. He believes the administration should be handled centrally, but the actual deletion operation should run locally inside the SRM. Michael - this is understood.
  • this week:
    • Facing storage shortfalls world-wide
    • BNL is getting full as well (already using far more than the pledge) - about 1 PB free (down from 1.2 PB a few days ago).
    • Can we afford secondary copies? May need to delete older ESD copies. US physicists may need to have jobs scheduled elsewhere (in other clouds) - it's an ATLAS-wide problem.
    • AGLT2 can add more to DATADISK if needed
    • Can the space-token auto-adjuster be used at BNL? Probably not - they are hard-tokens.

Shifters report (Mark)

  • Reference
  • last meeting:
    • Weekly report:
      Yuri's summary from the weekly ADCoS meeting:
      http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=112416
      
      1)  10/27: WISC_DATADISK - possibly a missing file.  ggus 63526 in-progress, eLog 18698.
      2)  10/27: NET2_USERDISK transfer errors - " [SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]."
      >From Saul: This doesn't appear to be a site issue (the files are indeed no longer listed for our site in DQ2, in our LFC, or on disk), but rather some sort of race condition between scheduling to use us as a source and deletion.  ggus 63533 closed, eLog 18704.
      3)  10/27: SMU_LOCALGROUPDISK - file transfer errors due to an expired host cert.  New cert installed on 10/29, site un-blacklisted on 10/31.  ggus 63535 closed, eLog 18810.
      4)  10/27: SLACXRD - job failures with the error "FID "8E91164C-1E3C-DB11-8CAB-00132046AB63" is not existing in the catalog."  Xin fixed the PFC - issue resolved.  https://savannah.cern.ch/bugs/?74553, eLog 18760.
      5)  10/28: UPENN - file transfer errors due to an expired host cert.  New cert installed on 10/29, but continued to see transfer errors.  Hiro helped the site to debug the problem.  Issue seems to be resolved as of 11/2, so ggus 63574 closed, eLog 18951.
      6)  10/28: Job failures at OU_OCHEP & OSCER with an error like "pilot: Get error: No such file or directory."  Issue was an incorrect entry in schedconfig (seprodpath = storage/data/atlasproddisk).  Updated, issue resolved.
      7)  10/29: From Bob at AGLT2: 3 short, closely spaced power hits took down 7 WN, and the jobs that were running on them at the time.  Perhaps 80 jobs were lost.  WN are back up now.
      8)  10/29: Disk failure problem at SLAC - from Wei: We have many disk failures in a storage box. I am shutting down everything to minimum data loss.
      Data from the affected storage was moved elsewhere - issue resolved.
      9)  10/30:  BNL-OSG2_DATADISK - file transfer errors due to timeouts.  Issue resolved - from Michael: The load across pools was re-balanced.  eLog 18794.
      10)  10/30: AGLT2 - large number of failed jobs with "lost heartbeat" errors.  From Tom at MSU: "Some cluster network work yesterday had more impact than foreseen resulting in removal of all running Atlas jobs."  ggus 63629 / RT 18466 closed, eLog 18832.
      11) 10/30: Job failures at OU_OSCER_ATLAS due to missing release 16.0.2.  Alessandro fixed an issue with the s/w install system that was preventing a re-try after one or more earlier failed attempts.  16.0.2 now available at the site.  ggus 63635 / RT 18467 closed, eLog 18812.
      12)  10/30: ANL - file transfer tests failing with the error "failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]. Givin' up after 3 tries]."  ggus 63633 in-progress, eLog 18807.
      13)  10/30: New site SLACXRD_LMEM-lsf available, test jobs submitted.  Initially an issue with getting pilots to run at the site - now resolved.  Queue is currently set to 'brokeroff'.
      14)  10/31: Job failures at SLACXRD with the error "Required CMTCONFIG (i686-slc5-gcc43-opt) incompatible with that of local system."  From Xin: The installation at SLAC is corrupted, I am reinstalling there, will update the ticket after the re-install is done.  
      ggus 63639 in-progress, eLog 18845.
      15)  11/1: Maintenance outage at AGLT2.  Queues back on-line as of ~10:00 p.m. EST.
      16)  11/1: Power outage at BNL (switching back to utility power failed following completion of work on a primary electrical feed circuit).  Issue resolved, all services restored as of ~11:00 p.m. EST.  eLog 18896.
      17)  11/1: OU_OCHEP_SWT2 file transfer errors.  Issue understood - from Horst: SRM errors were caused by our Lustre servers crashing and rebooting.  DDN fixed the problem, and they are investigating what happened.  ggus 63644 & 63662 / RT 18473 & 18480 closed,  eLog 18851.
      18)  11/1: NET2 job failures understood - from John & Saul: Just to let you know that we're getting some LSM errors at our Harvard sites due to an overloaded gatekeeper at BU. We've taken some steps which should clear this up, but we're expecting a batch of failed jobs in the 
      next couple of hours.  eLog 18875.
      19)  11/1: HU_ATLAS_Tier2 - large number of job failures with the error "sm-get failed: time out after 5400 seconds."  Issue understood - from Saul: This problem is gone now. It was caused by a sudden bunch of production jobs with huge 2.6 GB log files. 
      Paul Nilsson has submitted a ticket about that.  
      We've also made networking adjustments so that these kind of jobs wouldn't actually fail in the future.  ggus 63665 / RT 18545 closed, eLog 18891.  https://savannah.cern.ch/bugs/index.php?74720.
      20)  11/2: AGLT2 - all jobs failing with the errors indicating a possible file system problem.  From Bob: We have determined that the problem is a corrupted NFS file system hosting OSG/DATA and OSG/APP. That is the bad news. 
      The good news is that this is a copy to a new host from yesterday, so the original will be used to re-create it.  ggus 63684 in-progress, eLog 18913.  Queues set off-line.
      21)  11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  May need to run a job "by hand" to 
      get more detailed debugging information.  In-progress.
      
      Follow-ups from earlier reports:
      
      (i)  Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
      Update 10/14:  Trying to understand why production jobs aren't being brokered to OU_OSCER_ATLAS.
      Update 10/20: Still trying to understand the brokerage problem.
      Update 10/27: The field 'CMTCONFIG' in schedconfig for OSCER had an old value, so jobs now getting brokered to the site.
      (ii)  10/21: NERSC_HOTDISK file transfer errors - authentication issue with NERSC accepting the ATLAS production voms proxy.  Hiro set the site off-line in DDM until the problem is resolved.  ggus 63319 in-progress, eLog 18494.
      (iii)  10/27: HU_ATLAS_Tier2 - job failures with lsm errors:
      "27 Oct 04:18:14|Mover.py | !!FAILED!!3000!! Get error: lsm-get failed: time out after 5400 seconds."  ggus 63486 in-progress, eLog 18670.
      Update 10/28, from Saul: We had a networking problem last night between about midnight and 6 a.m. EST.  We don't yet understand exactly what happened, but during this period, our network throughput went way down and about 500 jobs failed due to LSM timeout.  ggus 63486 closed.
      (iv)  10/27: Job failures at several U.S. sites due to missing atlas s/w release 16.0.2.  Issue understood - from Xin:
      SIT released a new version of the pacball for release 16.0.2, so I had to delete the existing 16.0.2 and re-install them. So far the base kit 16.0.2 has been re-installed, and 16.0.2.2 cache is also available at most sites, I just start the re-installation of 16.0.2.1 cache, 
      which should be done in a couple of hours.  ggus 63503 in-progress (but can probably be closed at this point), eLog 18678.
      Update 11/2: No additional errors - ggus ticket closed.
      
      
    • Most issues above have been resolved.
    • Open issues - AGLT2 down at the moment.
    • Job failures at OU-OSCER - site specific issue, dealing with this on email.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=113057
    
    1)  11/5: New pilot version from Paul (SULU 45a) - details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU_45a.html
    2)  11/5: WT2 - short outage from ~2:30pm to 5:00pm PDT to replace a system disk on a storage box.
    Site back on-line as of ~4:45 PDT.
    3)  11/6 - 11/7: BNL dCache issues for ATLASDATADISK and ATLASMCDISK space tokens:
    "FTP Door: got response from '[>PoolManager@dCacheDomain:*@dCacheDomain:SrmSpaceManager@srm-dcsrm03Domain:*@srm-dcsrm03Domain:*@dCacheDomain]' with error Best pool  too high : NaN] ACTIVITY: Data Consolidation."  Resolved - from Iris / Michael: 
    It was a space issue which has been fixed (MCDISK filled up).  ggus 63996/99 closed, eLog 19105/125.
    4)  11/7 - 11/8: SWT2_CPB DDM errors - status from Patrick:
    Sunday there was an issue in the configuration of Bestman associated with the number of open file descriptors.  Restarting the SRM cleared the issue.  We had more problems today (Monday), but did not see an issue with the number of open files.  
    We are seeing a high load on one dataserver and may make some changes to the xrootd configuration on this node to see if it can improve things.  We have restarted Bestman and modified the number of the worker threads associated with the 
    XrootdFS component that bestman relies on.  Transfer errors have stopped.
    5)  11/9 early a.m.: SWT2_CPB file transfer errors.  Issue was a problematic network switch port.  Later additional transfer errors were observed, due to the fact that the xrootd storage server plugged into the bad switch port was inaccessible for a period of time.  
    All issues seem to now be resolved.  RT 18616 / ggus 64117 closed, eLog 19265. 
    6)  11/9: HU_ATLAS_Tier2 job failures with the error "Get error: lsm-get failed: time out after 5400 seconds ."  Issue resolved - ggus 64108 closed, eLog 19266.
    
    Follow-ups from earlier reports:
    
    (i)  10/21: NERSC_HOTDISK file transfer errors - authentication issue with NERSC accepting the ATLAS
    production voms proxy.  Hiro set the site off-line in DDM until the problem is resolved.  ggus 63319 in-progress, eLog 18494.
    Update: solved as of 11/5 - ggus ticket closed.
    (ii)  10/27: WISC_DATADISK - possibly a missing file.  ggus 63526 in-progress, eLog 18698.
    Update 11/6, from Wen at WISC: Now these 5 files are available. I think it's transferred from other sites by Function Test. So this ticket can be closed.  ggus 63526 was subsequently closed.
    (iii)  10/30: ANL - file transfer tests failing with the error "failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]. Givin' up after 3 tries]."  ggus 63633 in-progress, eLog 18807.
    Update, 11/1: Network device failures were solved by a reboot of the machine.  ggus 63633 closed.
    (iv)  10/31: Job failures at SLACXRD with the error "Required CMTCONFIG (i686-slc5-gcc43-opt) incompatible with that of local system."  From Xin: The installation at SLAC is corrupted, I am reinstalling there, will update the ticket after the re-install is done.  
    ggus 63639 in-progress, eLog 18845.
    Update, 11/5: ATLAS release 15.6.13 reinstalled, no additional errors of this type seen.  ggus 63639 closed, eLog 19066.
    (v)  11/2: AGLT2 - all jobs failing with the errors indicating a possible file system problem.  From Bob: We have determined that the problem is a corrupted NFS file system hosting OSG/DATA and OSG/APP. That is the bad news. 
    The good news is that this is a copy to a new host from yesterday, so the original will be used to re-create it.  
    ggus 63684 in-progress, eLog 18913.  Queues set off-line.
    Update, 11/4 from Bob at UM: NFS server for OSG directories was reloaded and resolved the issue. This server disk was originally built as XFS and mounted with inode64. It worked fine for all OS level commands, but failed in various python packages used in ATLAS kits. 
    The disk was emptied, and inode64 was removed before it was reloaded. 
    Test jobs successful, queues set back on-line.  ggus 63684 closed, eLog 18965.
    (vi)  11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  
    May need to run a job "by-hand" to get more detailed debugging information.  In-progress.
    
     
    • Another quiet week in the US cloud
    • Most carryover issues of the last week resolved
    • New pilot from Paul - significant
    • Alden: has removed queues

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Please apply the 3.2 release update at all sites
    • Request to preserve historical data
    • Details - available in Jason's FAQ
    • Karthik will send a link
    • Due by next throughput meeting
  • this week:
    • Meeting notes
    • See notes in email
    • The Illinois asymmetry was resolved - it could have been due to an update to a switch.
    • The goal was to get all perfSONAR instances updated to 3.2; good progress - issues at SLAC (which runs an alternative version for local security reasons).
    • NET2 had one box down for a while.
    • All other sites are updated.
    • Question about the version at BNL.
    • Want the Nagios plugins extended to show the perfSONAR version (a minimal check sketch follows this list).
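
    A minimal sketch of the kind of Nagios check mentioned above: report the perfSONAR version and flag hosts not yet on 3.2. How the version string is obtained from a perfSONAR host is site-specific and not specified in these minutes, so here it is simply passed in as a command-line argument; the script name is illustrative, and only the standard Nagios exit-code convention (0=OK, 1=WARNING, 3=UNKNOWN) is assumed.

        import sys

        REQUIRED = (3, 2)   # minimum perfSONAR release requested for this phase

        def major_minor(version):
            """Parse 'X.Y[.Z]' into an (X, Y) tuple."""
            return tuple(int(x) for x in version.split('.')[:2])

        def main():
            if len(sys.argv) != 2:
                print('UNKNOWN - usage: check_ps_version <version-string>')
                return 3
            version = sys.argv[1]
            try:
                parsed = major_minor(version)
            except ValueError:
                print('UNKNOWN - unparsable version %r' % version)
                return 3
            if parsed >= REQUIRED:
                print('OK - perfSONAR %s (>= 3.2)' % version)
                return 0
            print('WARNING - perfSONAR %s is older than 3.2' % version)
            return 1

        if __name__ == '__main__':
            sys.exit(main())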

Site news and issues (all sites)

  • T1:
    • last week(s): Will be adding 1300 TB of disk; the installation is ready and will be handed over to the dCache group to integrate into the dCache system by next week. CREAM investigations are continuing. LCG announced to the user community that the existing CE will be deprecated by the end of the year, urging sites to convert. This has been discussed with OSG on behalf of US ATLAS and US CMS - Alain Roy is working on it and it will be ready soon. Submission and Condor batch backends at sites will need to be tested. Preliminary results looked good for a single site, but CMS found problems with submission to multiple sites. The plan is to submit 10K jobs from a BNL submit host to 1000 slots at UW, to validate readiness (Xin). Note: there is no end-of-support date for the current GT2-based OSG gatekeeper.
    • this week: Reprocessing keeping us busy, especially due to the space crunch (dcache adjustments). Looking into purchasing more disk using FY11 funds.

  • AGLT2:
    • last week: Taken delivery of all of the disk trays, now under test. Will coordinate turning on the shelves between the sites. Looking at XFS su & sw sizes as a tool to optimize MD1200 performance. At Michigan, two dCache head nodes, 3 each at a site. Expect a shutdown in December. Major network changes, deploying 8024F's. Performance issue with the H800 and the 3rd MD1200 shelf.
    • this week: Lustre issue with the filesystem mounted on the worker nodes; a metadata server problem, Bob is working on it. New resources arriving - blade servers arriving. Tom: 64 R410s, all but 10 arrived and racked; setting them up. A little more than doubling the MSU HS06 capacity.

  • NET2:
    • last week(s): ANALY queue at HU - available to add more analysis jobs. Expect to stay up during the break.
    • this week: Running at full capacity, including the HU Westmere nodes. Problems keeping ANALY_HU full, a few ...

  • MWT2:
    • last week(s): Security mitigation complete. One pool node has a bad DIMM and needs a BIOS update. Running stably.
    • this week: gridftp server problems - no route to host - server disabled, investigating.

  • SWT2 (UTA):
    • last week: Working to get conditions data access set up correctly for the UTA_SWT2 cluster, since it's being converted to analysis; a second squid server at UTA, using the same DNS name.
    • this week: A couple of issues with the SRM over the weekend - the number of open files for Bestman. Getting started setting up the global xrootd system; it is in place, will need to turn things on. New 10G connection coming online, new switch in place, testing with a test cluster.

  • SWT2 (OU):
    • last week: Everything running smoothly. The only issue is getting OSCER production jobs.
    • this week: Turned on hyperthreading on the R410s - now running close to 800 jobs. All running smoothly. There is an "LFC problem" - investigating.

  • WT2:
    • last week(s): Deployed pcache, working fine. A 4-hour shutdown to update kernels(?). Two disks failed last week; need to produce more logging.
    • this week: On Friday, a problem with a motherboard in an older Thumper. Large numbers of xrootd connections leading to Solaris problems (?). Quite a few SRM issues... investigating, could be related to the Solaris problem. Received 10 Dell ...

Carryover issues ( any updates?)

HEPSpec 2006 (Bob)

last week:

  • HepSpecBenchmarks
  • MWT2 results were run in 64 bit mode by mistake; Nate is re-running.
  • Assembling all results in a single table.
  • Please send any results to Bob - it is important that sites run the benchmark.
  • Duplicate results don't hurt.

this week:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources... libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • A newline was causing a buffer overflow.
  • Chipping away at this on the job side.
  • dcap++ and the newer stock libdcap could be merged at some point.
  • The maximum number of active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion in to the release
  • No new failure modes observed

this meeting:

AOB

  • last week
    • Please check out Bob's HS06 benchmark page and send him any contributions.
  • this week
    • Reminder: all T2's should submit DYNES applications by the end of the month. A template will become available.


-- RobertGardner - 09 Nov 2010

Attachments


install-migration.pdf (50.6K) | XinZhao, 10 Nov 2010 - 11:35 | status update of Atlas Software Installation System Migration
 