
MinutesMar30

Introduction

Minutes of the Facilities Integration Program meeting, March 30, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Charles, Aaron, Shawn, Nate, Rob, Dave, AK, Michael, Karthik, Sarah, John de Stefano, Saul, Jason, Wei, Tom, Armen, Kaushik, Mark, Alden, Wensheng, Torre, Horst, Hiro, Bob
  • Apologies: Doug B, Joe Izen, Patrick

Integration program update (Rob, Michael)

  • IntegrationPhase16 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s)
      • WLCG reporting - still to be sorted out - Karthik reporting. See https://twiki.grid.iu.edu/bin/view/Accounting/OSGtoWLCGDataFlow. Problems using the KSI2K factor on the OSG side - is the conversion table incorrect? Also problems with HT on or off.
      • Capacity spreadsheet reporting - see updates with HT and number of jobs per node.
      • ATLAS ADC is in the process of checking capacities - a new web page reports the capacities provided to DDM via SRM; browsing the page shows that deployed capacities are under-reported for every site. We need to understand why, e.g. AGLT2 (1.9 PB deployed versus 1.4 PB reported). We need to look into this; Michael will provide the link. For SWT2 it may be related to the space token reporting.
      • Expected capacity to be delivered - may need to average across the T1 and T2s to meet the pledges.
      • LHC delivered first collisions on Sunday; stable beams are still rare - work ongoing on protections and loss maps.
      • http://bourricot.cern.ch/dq2/accounting/federation_reports/USASITES/
      • WLCG usage accounting reporting needs to be fixed, and the VO share provided to WLCG, in addition to the capacity reporting.
      • WLCG MB is looking at capacity provisioning in terms of the 2011 pledges - and we are about 1 PB short.
      • There is a list of technical R&D issues as discussed in Napoli; there is a Twiki list of activities. Summary from Torre: we will need to start organizing and meeting around them quickly, in the context of the ADC re-organization. Cloud, federated xrootd, and no-MySQL are among the topics. Some are proceeding towards production; others will require task forces. Alexei sent around a list, which ought to be finalized quickly. The next step will be to take it from ATLAS to CERN IT, and to extend it to CMS and other experiments - a first step in wider collaboration.
      • The machine is ramping up nicely; we anticipate analysis will follow, with challenges for pileup, etc. In view of this the capacities need to be up and stable.

    • this week
      • Quarterly reports due
      • WLCG accounting status. Karthik: there is a consensus to report HS06 per job slot, to account for hyperthreading - the shared per-slot HS06 is lower for HT slots. Sites publish cpus_per_node and cores_per_node, which are in use. It would be preferable if OSG had a lookup table for this information; the value would be an average across all the subclusters at a site. Not sure whether GIP has the lookup table, or the number of processor types. Bob has a link with all the measurements we know about. What about the HEPiX benchmark page? And what about contributed CPUs - opportunistic versus dedicated? Saul can compare with the egg.bu.edu numbers. Karthik has given feedback to Brian. We need to make sure there is convergence on this topic OSG-wide. Michael will bring this up with the ET. (See the arithmetic sketch after this list.)
      • There is a substantial amount of activity related to Tier 3 packaging. OSG has been working in this area for some time: https://twiki.grid.iu.edu/bin/view/SoftwareTeam/XrootdRPMPhase1Reqs - the focus is first on Tier 3. We will need to sign off on the requirements as soon as possible. There is a similar effort at CERN, and there will be OSG collaboration.
      • WLCG capacity shortfall - need to get the missing ~1 PB installed by July.
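      As an illustration of the HS06-per-job-slot discussion above, the following is a minimal sketch of the arithmetic; the node benchmark figure and slot counts are hypothetical, not measurements from any US ATLAS site.

        # Illustrative only: per-slot HS06 when hyperthreading changes the slot count.
        # The whole-node HS06 figure and slot counts below are hypothetical.

        def hs06_per_slot(node_hs06_total, job_slots):
            """Per-slot normalization: total benchmarked HS06 of a node / configured job slots."""
            return node_hs06_total / job_slots

        node_hs06 = 150.0                       # hypothetical whole-node HS06 benchmark
        print(hs06_per_slot(node_hs06, 8))      # HT off, 8 slots  -> 18.75 HS06 per slot
        print(hs06_per_slot(node_hs06, 16))     # HT on, 16 slots  ->  9.375 HS06 per slot

        # The reported site capacity is the sum over subclusters of (HS06/slot) * slots,
        # so the total is unchanged -- only the per-slot figure drops with HT on.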

CVMFS (John)

  • Setting up a replica server as a testbed - had some breakthroughs, and all is running as expected. It's working - CERN has been notified. Functional tests passed.
  • Sites currently connected to CERN could instead connect to BNL.
  • Sarah W: Version 2.6.1 working at MWT2_IU - a workaround was required; writing up feedback. Working on test nodes. Updating the puppet module to deploy it automatically across the 3000 job slots. (A minimal per-node check is sketched below.)
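  The following is a minimal sketch of the kind of per-node functional check implied above; the repository path is an assumption (the standard /cvmfs/atlas.cern.ch mount point), not the exact test CERN or the sites run.

    # Sketch: verify that a CVMFS repository is reachable on a worker node.
    # The repository path is an assumption for illustration.
    import os
    import sys

    REPO = "/cvmfs/atlas.cern.ch"

    def cvmfs_ok(repo=REPO):
        """Return True if the repository can be listed (first access triggers the autofs mount)."""
        try:
            return len(os.listdir(repo)) > 0
        except OSError:
            return False

    if __name__ == "__main__":
        ok = cvmfs_ok()
        print("%s %s" % (REPO, "OK" if ok else "NOT available"))
        sys.exit(0 if ok else 1)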

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • The links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

last week(s):

  • Doug travels to Arizona next week (Tues-Thursday) to help set up their Tier 3 site
  • Last week there was a meeting with VDT, during the OSG All-Hands meeting, about Xrootd RPM packaging. OSG/VDT promised a new RPM soon.
    • Next week - Wei reports a new release is imminent.
  • CVMFS meeting Wednesday 16-Mar 17:00 CET
    • Move to the final namespace in advance of the migration to CERN IT - not sure about the timescale
    • Nightlies and conditions data
    • AGLT2 discovered a problem with fresh installation - testing a different machine. Should be fixed so as to not damage the file system.
  • Write up on Xrootd federation given to Simone Campana and Torre Wenaus. They are collecting information on R&D tasks and task forces
  • wlcg-client - now supported by Charles; some python issues need to be resolved.
  • UTD report from Joe - a cmtconfig error while running CentOS, tracked down to a firmware updater needed from Red Hat. Lost heartbeat errors - also tracked down.
this week:
  • AK - working on firewall issues affecting transfer rate. Working with the IT director to resolve the issue.

Operations overview: Production and Analysis (Kaushik)

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • Storage monitoring problem fixed.
    • Storage reporting categories - unallocated (non-SRM) and unpowered (on floor, but not connected).
    • Deletion - userdisk on the way. Issues with central deletion (old issue) being followed up.
    • LFC ghost category rising (Charles)
    • Old groupdisk issue - all data is now cleaned up.
    • localgroupdisk cleaning and monitoring - accounting.
    • New proddisk cleanup utility sent by Charles
    • Hiro will work on a page of stacked plots for each site. Will need to work with ADC to reflect the augmented storage in the DQ2 accounting.
    • Discussion about localgroupdisk policy. Wei notes we don't have a balanced policy on how much to allocate there. Michael notes localgroupdisk does not count against the pledge. The tendency is to merge tokens.
  • this week:
    • MinutesDataManageMar29
    • Removing legacy tokens and data at BNL
    • Central deletion on-going. There have been timeouts, plan to discuss next week.
    • Localgroupdisk monitor from Hiro
    • Two non-official categories - unallocated and unpowered. Can they be incorporated in SRM?
    • Charles - bulk deletion methods were not being used for most US sites; the change was made yesterday. The goal is 10 Hz at Tier 2s for LFC+SRM deletion. We might be able to see a reduction in the gap by looking at the bourricot monitoring page, and should see backlogs drain as the number of deletions increases. (See the drain-time sketch below.)
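    To put the 10 Hz deletion target above in perspective, here is a minimal back-of-the-envelope sketch; the backlog size used is hypothetical.

      # Illustrative only: time to drain a deletion backlog at a sustained rate.
      # The backlog size is hypothetical; 10 Hz is the Tier-2 target quoted above.

      def drain_time_hours(backlog_files, rate_hz=10.0):
          """Hours needed to clear a backlog at rate_hz deletions per second."""
          return backlog_files / rate_hz / 3600.0

      print("%.1f hours" % drain_time_hours(500000))               # ~13.9 h at 10 Hz
      print("%.1f hours" % drain_time_hours(500000, rate_hz=1.0))  # ~138.9 h at a hypothetical 1 Hz per-file rate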

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=132241
    
    1)  3/16: AGLT2 - Issue with dCache file server resolved.  Files on the machine were inaccessible for a couple of hours while a firmware upgrade was performed.
    2)  3/18 - 3/21: SWT2_CPB - FT and SRM errors ("failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]").  Issue was a problem NIC 
    in one of the storage servers which took the machine off the network, resulting in the SRM errors.  Resolved as of 3/21.  RT 19593 / ggus 68782 closed, eLog 23381/481.
    3)  3/18: SWT2_CPB - unrelated issue to 2) above, although the tickets were getting mixed up, job failures with the error "transformation not installed in CE (16.0.3.4)."  
    Xin successfully re-ran the validation for this cache, so not clear what the issue is.  Closed ggus 68740 / RT 19587 as "unsolved," eLog 23488.
    4)  3/19: SLACXRD - large backlog of transferring jobs - issue understood, FTS channels had not been  re-opened after adjusting bestman.  ggus 68783 closed, eLog 23350.
    5)  3/19: Some inconsistencies in the panda monitor were reported (for example number of running jobs).  Resolved - https://savannah.cern.ch/bugs/index.php?79654, eLog 23370.
    6)  3/21: SLACXRD_DATADISK file transfer errors (" failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  Issue resolved, ggus 68804 
    closed, eLog 23447.
    7)  3/21: OU_OCHEP_SWT2 - maintenance outage in order to move the cluster.  Work completed as of ~6:00 p.m. CST.  Test jobs successful, site set back on-line.  eLog 23514.
    8)  3/22 - 3/23: Jobs from several heavy ion tasks were failing in the U.S. cloud (and others) with the error "No child processes."  Paul suspects this may be due to the fact 
    that the pilot has to send a large field containing output file info in the TCP message, and this overloads the TCP server on the WN used by the pilot.  If this is the case a fix will 
    be implemented.  See: https://savannah.cern.ch/bugs/?79915, eLog 23555.
    
    Follow-ups from earlier reports:
    
    (i)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (ii)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running 
    on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with Israel CA having problem accessing BNL and MWT2. This is actively investigated right now. Until this get completely resolved, users are suggested 
    to submit a DaTRI request to transfer datasets to some other sites (LOCALGROUPDISK area) for the downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    (iii)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN 
    state one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    (iv)  3/10: SLACXRD_LOCALGROUPDISK transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  From Wei: We are hit very 
    hard by analysis jobs. Unless that is over, I expect error like this to continue.  As of 3/14 issue probably resolved - we can close ggus 68498.  eLog 22978.
    Update 3/20: ggus 68498 closed,
    (v)  3/12: SLACXRD_LOCALGROUPDISK transfer errors with "[NO_SPACE_LEFT] No space found with at least .... bytes of unusedSize]."  https://savannah.cern.ch/bugs/index.php?79353 
    still open, eLog 23037.  Later the same day: SLACXRD_PERF-JETS transfer failures with "Source file/user checksum mismatch" errors.  https://savannah.cern.ch/bugs/index.php?79361.  
    Latest comment to the Savannah ticket suggests declaring the files lost to DQ2 if they are corrupted.  eLog 23048.
    Update 3/21: Savannah 79353 closed (free space is available).
    (vi)  3/13: OU_OSCER_ATLAS job failures due to a problem with release 15.6.3.10.  As of 3/14 Alessandro was reinstalling the s/w.  Can we close this ticket?  ggus 68611 / RT 19561, 
    eLog 23134, https://savannah.cern.ch/bugs/index.php?79368.
    (vii)  3/14: MWT2_UC file transfer errors ("[GENERAL_FAILURE] AsyncWait] Duration [0]").  From Aaron: This is due to a dcache pool which has been restarted multiple times this afternoon. 
    We are attempting to get this server more stable or drain it, and we expect to be running again without problems within an hour or two.  Can we close this ticket?  ggus 68617, eLog 23139.
    Update 3/16: ggus 68617 closed.
    (viii)  3/15: HU_ATLAS_Tier2 and ANALY_HU_ATLAS_Tier2 set off-line at Saul's request.  ggus 68660, https://savannah.cern.ch/support/index.php?119796, eLog 23194.
    Update 3/16: Some CRL's updated (jobs had been failing with "bad credentials" errors) - test jobs successful, queues set back on-line.  ggus 68660 closed.
    
    • HI reprocessing job failures at NET2 - "no child process" reported by pilot. Paul is involved - a large amount of info has to be sent by the pilot; developing a workaround.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=133197
    
    1)  3/23: Maintenance outage at SLAC.  Work completed as of ~5:30 p.m. CST.  Test jobs successful - queues set back on-line.  eLog 23613/14.  http://savannah.cern.ch/support/?119952
    2)  3/23: Issue with gratia reporting at SWT2_CPB resolved.  RT 19614 closed.
    3)  3/24: SWT2_CPB now using Alessandro's s/w installation system.
    4)  3/25: UTD-HEP maintenance outage originally scheduled for 3/23 had to be postponed.  eLog 23651, https://savannah.cern.ch/support/index.php?119962
    Update: this maintenance outage now set for 3/30.  See eLog 23794, https://savannah.cern.ch/support/?120085.
    5)  3/27: BNL-OSG2_DATATAPE file transfer errors like "[AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] cannot create archive repository: No space 
    left on device]."  Issue resolved, ggus 69083 closed.  See eLog 23706 for a detailed description of the problem from Michael.
    6)  3/27: Shawn reported that intermittent network issues were occurring between AGLT2 & BNL.  Ongoing ESnet work at StarLight.
    7)  3/28:  SWT2_CPB - user jobs were failing due to stage-in errors.  The problem was a defective drive in one of the RAID arrays.  It was generating enough errors to cause file 
    system errors, but as yet had not been swapped out of the array.  Forced the spare disk on-line, issue resolved.  User jobs finished successfully on a subsequent attempt.  
    ggus 69114 / RT 19710 closed.
    8)  3/29: Modification to the panda server load balanced DNS alias.  Details in eLog 23762.  Seemed to be some transient residual issues from this change, but the system was 
    mostly responsive as of ~2:00 p.m. CST.
    9)  3/29: From Aaron at MWT2 - We have completed the maintenance on our site here at MWT2, and we're back online for both MWT2_UC and ANALY_MWT2.  ~4:45 p.m.
    
    Follow-ups from earlier reports:
    
    (i)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    Update 3/29: Haven't noticed any recent occurrences of this error, will close this item for now.
    (ii)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP 
    running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro: There is a known issue for users with Israel CA having 
    problem accessing BNL and MWT2. This is actively investigated right now. Until this get completely resolved, users are suggested to request DaTRI request to transfer 
    datasets to some other sites (LOCALGROUPDISK area) for the downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    (iii)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an 
    UNKNOWN state one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    (iv)  3/12: SLACXRD_LOCALGROUPDISK transfer errors with "[NO_SPACE_LEFT] No space found with at least .... bytes of unusedSize]."  
    https://savannah.cern.ch/bugs/index.php?79353 still open, eLog 23037.  Later the same day: SLACXRD_PERF-JETS transfer failures with "Source file/user checksum mismatch" 
    errors.  https://savannah.cern.ch/bugs/index.php?79361.  Latest comment to the Savannah ticket suggests declaring the files lost to DQ2 if they are corrupted.  eLog 23048.
    Update 3/21: Savannah 79353 closed (free space is available).
    (v)  3/13: OU_OSCER_ATLAS job failures due to a problem with release 15.6.3.10.  As of 3/14 Alessandro was reinstalling the s/w.  Can we close this ticket?  
    ggus 68611 / RT 19561, eLog 23134, https://savannah.cern.ch/bugs/index.php?79368.
    Update 3/25: clean-up / re-installation of atlas release 15.6.3 (and the associated caches) succeeded.  ggus & RT tickets closed.
    (vi)  3/22 - 3/23: Jobs from several heavy ion tasks were failing in the U.S. cloud (and others) with the error "No child processes."  Paul suspects this may be due to the fact 
    that the pilot has to send a large field containing output file info in the TCP message, and this overloads the TCP server on the WN used by the pilot.  If this is the case a 
    fix will be implemented.  See: https://savannah.cern.ch/bugs/?79915, eLog 23555.
    Update 3/24: Paul updated the pilot to address this issue (SULU 46c).  More details here: http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU46c.html
    
    • The HI reprocessing issue related to the pilot is now resolved.
    • Alessandro was contacted for the release installation at SWT2_CPB; now complete.
    • DNS load balancing tweak at CERN - there were some errors, but now sorted out.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • See email of last night - the LFC ACLs set by Tier 2s have a problem, potentially creating problems for central deletion; the production role is missing in the group. Horst is requesting a script. Hiro will work with each site.
    • MWT2 - need one more fix.
    • LFC ghosts at MWT2 - understood; bulk methods exist since version 1.7.2, and we are running an older version?
    • Will provide a table of localgroupdisk breakdown.
  • this meeting:
    • No operational problems.
    • All LFC ACL problems have been resolved
    • New monitor announced: The breakdown of LOCALGROUPDISK usage in the US is visible at the same place (http://www.usatlas.bnl.gov/dq2/monitor). You can look at the US-wide information via the "LOCALGROUP" link. The page shows replica owners with more than 1% of total usage. To see a specific user, just search using the text area. From there, one can drill down into more detail by clicking the various links. Alternatively, one can get to the site-specific info on the "Site summary" page, which provides the breakdown for a particular site. If a different way to look at the usage is needed, please let me know. -Hiro

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Action item: all T2s to get another load test in. Sites should contact Hiro and monitor the results. An hour-long test. ASAP.
    • Throughput meeting:
      USATLAS Throughput Meeting Notes --- March 29, 2011
                      ==========================================
       
      Attending: Shawn, Dave, Jason, Sarah, Philippe, Karthik, Andy, Tom, Horst, Hiro, Aaron
       
      1)      Review of action items
      a.       Dell R410  - Nothing to report.  perfSONAR developers  temporarily redirected.  Still on the roadmap and will be worked on once manpower is again available. 
      b.      AGLT2-MWT2_IU issue – No update for now.  Sarah, Shawn and Jason haven’t had time to get into this.  No evidence that it is impacting real transfers between the sites but it should be investigated.
      c.       Load-test rescheduling status.   MWT2_UC and Illinois are done but need to be documented on the link off the site certification page.  *ACTION*: Other sites need to get this done!  Horst reported Hiro’s standardized DQ2 throughput tests are showing a decrease (will cover in section 4 below).
      2)      perfSONAR status items
      a.       Summary of issues:    Packet drops into AGLT2_UM.  Issue with clock at NET2_BU.   Seeing unusual divergence in OWAMP measurements between MSU perfSONAR node and BU node.  Only see this on the MSU-BU test from the MSU side while many OWAMP measurements at BU side show similar results.  Assume this is something about BU clock on the OWAMP perfSONAR node.  Jason checked configuration and it is fine.  Hardware issue?
      b.       Philippe noticed that the latency shifted between AGLT2_MSU and UTA from  31 down to 13 both directions on 21 March.   Throughput increased slightly at the same time.   Also from Philippe is a plot of the latency between UTA and BNL which show the decrease in two steps (attached as 20110321_bnl_uta_stepdown.png).
      c.       Configuration issues:  Discuss having a “maintenance window” in common between all sites.    Suggestion is 1-2AM Eastern.  Issues?   Are we sure services are impacted?    Philippe:  if alerts do go out we should develop a Twiki on how-to-respond.  Tom is seeing a flood of emails nightly.    Need to  figure out how to improve this.   Discussion that not all services are equally important/critical.  Decided on the following:
                i.      *ACTION*: *All sites* should make sure they schedule their perfSONAR maintenance tasks starting at 1AM Eastern
                ii.     *ACTION*: *Tom* will prevent any email alerts related to perfSONAR between 1-2AM Eastern
                iii.    Alerts will only be sent out if the service is down for an extended period.  For “critical” measurement services this should be 2 hours while for non-critical services this should be 24 hours.  We have seen that many issues are intermittent and self-healing and don’t want to alert till we are sure something needs attention.  *ACTION*: *Jason* will let Tom know which services have a 2 hour alert wait time and which are 24 hours.
      3)      Throughput monitoring:
      a.       Hiro throughput via dq2 is working.   Needs clean-up of old MCDISK entries.
      b.      Nothing new on merging perfSONAR and DQ2 throughput tests
      c.       Update on RSV’ifying existing perfSONAR probes.  Tom presented current status and plans.  Working with OSG and perfSONAR folks on getting a new system ready.  Issues with how the Gratia DB might support measurements involving two sites (not a currently supported use-case).  Tom mentioned future plans on developing a  configuration build GUI anticipating managing lots of future sites. 
      d.      Update on ktune.  Problem observed by Aaron at MWT2_UC related to one of the ktune scripts Shawn included not correctly parsing VLAN NIC aliases like eth0.29.  *ACTION* *Shawn* will provide an updated script in the ktune rpm which should work.  Interested in testers for this package.  Especially interested in a) before and after benchmarking and b) differences between local site tunings and what is in the ktune package.
      e.      Kernel  issues.   Nate at UC has updated the UltraLight kernel to use the 2.6.38 kernel.   Web page at http://twiki.mwt2.org/bin/view/ITB/UltraLightKernel  needs updating to discuss installation steps and issues.  OpenAFS is potentially a problem.  Aaron mentioned that the kernel should have built-in AFS already…needs to be verified and documented.   *Need testing volunteer sites*.  Goal is to improve storage node performance.  Want  before/after benchmarking to document changes (good or bad).   Should  be installable on SL5 hosts without problems.
      4)      Site reports:
      a.       OU seeing decreased throughput: to AGLT2 it was 800 MB/s but dropped to around 400 MB/s; to BNL 400-500 MB/s dropped to 200-250 MB/s (and other sites, UTA and MWT2, show drops).  No similar drop seen in the 1Gbps perfSONAR tests though.  Horst will explore possible local site issues.  May also just schedule a load-test to see if problems are real; a load test may more easily expose any existing problem.
      b.      Aaron reported on NAT issues at UC.  The Cisco 6509 shows “no route to host”  for things on that NAT.  Campus networking involved.  No resolution yet.  Aaron will keep us posted.
       
      No AOB.   We will plan to meet again in 2 weeks at the usual time.   Please send along additions or corrections to the list.
       
      Shawn

    • Reminder to complete load testing, and update site certification table SiteCertificationP16
    • Still working on ktune for tuning the kernel - there is a question of which parameters to use for server-kernel settings. Shawn to provide information and pointers. (See the interface-name parsing sketch after this list for the VLAN-alias issue.)
    • perfSONAR integrated into RSV
    • meeting next Tuesday
    • Reminder to sites to get load tests finished. MWT2, UIUC, and AGLT2 have completed this; document it in the site certification table.
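    Related to the ktune VLAN-alias issue noted above, the following is a small sketch of interface-name handling that tolerates aliases like eth0.29; it is illustrative only, not the actual ktune script.

      # Sketch: split an interface name into base device and optional VLAN id,
      # so tuning parameters can be applied to the base device (e.g. eth0 for eth0.29).
      import re

      _IFACE_RE = re.compile(r"^(?P<base>[a-z]+\d+)(?:\.(?P<vlan>\d+))?$")

      def parse_iface(name):
          """Return (base_device, vlan_id or None) for names like 'eth0' or 'eth0.29'."""
          m = _IFACE_RE.match(name)
          if m is None:
              raise ValueError("unrecognized interface name: %r" % name)
          vlan = m.group("vlan")
          return m.group("base"), int(vlan) if vlan is not None else None

      print(parse_iface("eth0"))     # ('eth0', None)
      print(parse_iface("eth0.29"))  # ('eth0', 29)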

  • this week:

Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Doug sent a document to Simone and Torre - to be part of an ATLAS task force, R & D project, may be discussed during SW week.
  • Charles - continuing to test - performance tests at 500 MB/s. Post-LFC model work - replacing the LFC-callout plugin (requires normalizing paths and getting rid of DQ2 suffixes - some settings changed; see the normalization sketch after this list).
  • Still working on getting the no-LFC mode functional - some progress on that.
  • Investigating client-side modifications for xrd equivalent to "libdcap++"
  • Performance tests continuing
  • Will standardize on xrootd rpm release
  • Version 3.0.3 is working fine at the moment.
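  As a rough illustration of the path normalization mentioned above (not the plugin's actual code), the sketch below strips DQ2-style "_tid"/"_sub" suffixes from directory components; the suffix patterns and example path are assumptions.

    # Sketch: normalize a global-namespace path by removing DQ2-style task/sub
    # suffixes from directory components.  Patterns here are illustrative only.
    import re

    _DQ2_SUFFIX = re.compile(r"_(?:tid|sub)\d+(?:_\d+)?$")

    def normalize_path(path):
        """Strip _tid<N>/_sub<N> style suffixes from each path component."""
        return "/".join(_DQ2_SUFFIX.sub("", part) for part in path.split("/"))

    # Hypothetical example path:
    print(normalize_path("/atlas/dq2/mc10/AOD/mc10.12345.AOD_tid234567_00/file.root"))
    # -> /atlas/dq2/mc10/AOD/mc10.12345.AOD/file.root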
this week:
  • No major news - working on wide-area client performance.

Site news and issues (all sites)

  • T1:
    • last week(s): BNL has its own PRODDISK area now. Deployed about 2 PB of disk, in production. Will need to remove some of the storage. ... SRM database hiccup, investigating. Procurement of 150 Westmere-based nodes, R410 with X5660, in an advanced state. Pledge plus 20%. Looking at a lower-level storage management solution (http://www.nexenta.org/, an alternative to Solaris/ZFS). Close to getting another 10G wide-area circuit through ESnet - will see how to use it; possibly to connect to the LHCONE open access point.
    • this week: Waiting for 150 nodes, will bring to 8500 slots. PNFS upgrade to Chimera, discussing timeline and plan. Earlier than previously anticipated - this summer rather than at year's end. Otherwise very stable.

  • AGLT2:
    • last week(s): All is working well. Have had some checksum failures - chasing this down. Users attempting to get files that were once here, but are no longer. Did a user job unknowingly remove files under the usatlas1 account? Looking at options to trap the remove command and log these. Want to get the lsm installed here, to instrument IO. Doing some work on the SE. Would like to get better insight into IO for jobs. Testing ktune. ... Below the April 1 target in space tokens - will bring two servers into production. Monitoring page updated, showing storage in different categories. Working on firmware updates. Working on ktune, checking settings. Nexsan evaluation of fibre channel storage with a SATABeast, connected to a Dell head node. Will spend a week on testing and integration.
    • this week: Working on ktune and kernel from Nate. OSG 1.2.19 is now available, will bring this online. Also brought on two new pool servers. Federated storage report shows AGLT2.

  • NET2:
    • last week(s): Work on BU storage - all underway to improve transfers to HU. Two GPFS filesystems will be combined (which will change reporting momentarily). New switch for HU connectivity. Production job failures at HU last night - an expired CRL; the CRL update had stopped running for some reason. ... Tier 3 hardware has arrived and is being tested and set up, along with a new 10G switch - which will also be used for analysis at Harvard. Changing the GPFS filesystem, as before. One or more 10G links to add to NOX. Planning the big move to Holyoke in 2012. Procurement of additional storage on the BU side.
    • this week: GPFS 2 & 3 joined; the space should be available by the end of the week. Going to multiple gridftp servers and multiple LSM hosts; using clustered NFS to export GPFS volumes to HU; gatekeeper upgrade; getting ready to purchase worker nodes and storage on the BU side. Tier 3 still being worked on. Started a collection of historical and current statistics & graphics at http://egg.bu.edu/atlas. Will defer the load test till after the new storage comes online.

  • MWT2:
    • last week(s): Working on a new MWT2 endpoint using Condor as the scheduler. The correct CPUs arrived from Dell - the old ones are to be replaced.
    • this week: Continued preparations for move of MWT2_UC to new server room. CPU replacement at IU completed. LFC updated at UC. Kernels on storage nodes updated, found better performance. NAT testing on Cisco. GUMS relocated to VM. New monitor from Charles, http://www.mwt2.org/sys/userjobs.html.

  • SWT2 (UTA):
    • last week: Lost power on campus Monday afternoon - problem in switch gear for cooling. ... Storage server failed over weekend, recovered on Monday. Working with Alessandro on using his installation method. Working with Armen to get USERDISK deletions working at the site.
    • this week: Mark - analysis jobs were failing because they couldn't access input files from storage. There was a faulty drive; now fixed.

  • SWT2 (OU):
    • last week: Waiting for final confirmation for compute node additions next week. Investigating Alessandro's install job hang. ... Moved cluster Monday afternoon, ready for Dell to install the nodes, scheduled for Monday.
    • this week: 844 slots - 18 HT nodes. Ready to run a load test - will follow up offline with Hiro.

  • WT2:
    • last week(s): Last week there was a problem with a Dell storage machine - replaced CPU and memory, though it was not stressed. Planning 3 major outages - each lasting a day or two: March, April, early May. Will set final dates soon. Getting a quote for a new switch. ... Channel bonding for Dell 8024F uplinks; need to update firmware. Will have a storage outage this afternoon. Shawn reports having done this successfully with the latest firmware.
    • this week: Set up a small batch queue with outbound network connections to allow the installation jobs to run. Updated Bestman to the latest version - to put the total space in a flat file. Next procurement - MD1000 arrays, looking to buy 24 of them.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report(s)
    • IU and BU have now migrated.
    • 3 sites left: WT2, SWT2-UTA, HU
    • Waiting on confirmation from Alessandro; have requested completion by March 1.
    • Focusing on WT2 - there is a proxy issue
    • No new jobs yet to: SWT2, HU - jobs are timing out, not running.
    • There is also Tufts. BDII publishing.
    • One of the problems at SLAC is the lack of outbound connections, and the new procedure will probably use gridftp. Discussing options with them.
    • WT2- waiting for a queue with outbound connections - Wei has submitted
    • HU - Saul will check (Harvard is working in the new system S.Y.)
  • this meeting:

AOB

  • last week
    • Joe Izen - UTD: a smooth week of running; will be taking a short outage. On the production side all is well. Tier 3 work in progress.
  • this week


-- RobertGardner - 29 Mar 2011
