
MinutesSep29

Introduction

Minutes of the Facilities Integration Program meeting, Sep 29, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Attending

  • Meeting attendees: Jason, John D, Patrick, Justin, Booker, Sarah, Dave, Michael, Aaron, Nate, Charles, Saul, Shawn, John B, Fred, Rik, Wei, Armen, Mark, Kaushik, Torre, Bob, Wensheng, Xin
  • Apologies: Horst

Integration program update (Rob, Michael)

  • IntegrationPhase14 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • CVMFS evaluation at the Tier 2s - OU, UC, and AGLT2 are interested in evaluating it. See TestingCVMFS for details.
      • WLCG is asking for US ATLAS pledges for 2011, with preliminary figures for 2012. ~12K HS06.
      • Open ATLAS EB meeting on Tuesday. The next reprocessing campaign is shaping up: 7 TeV runs will be used, about 1B events. October 20 deadline for building the dataset; Nov 26 reprocessing deadline, timed for the next major physics conferences (La Thuile, March 2011). 6-8 weeks for the simulation - mostly at the Tier 2s.
      • Mid-October face-to-face meeting at SLAC. Make your reservations at the guest house soon. Agenda is being discussed, slowly shaping up.
      • Data management meeting yesterday - note that production has ramped up, but we also have a very spiky analysis load. BNL has shifted 1,000 more cores into analysis. Please move more resources into analysis.
      • LHC - after completion of technical stop still in development mode. No beam expected before the weekend. Expect ramp up 50 to 300 bunches.
      • Rob and Kaushik met with Michael Barnett (ATLAS Outreach Coordinator) last Friday to discuss Tier 2 outreach - suggested creating brochure, perhaps similar to this circa-2008 computing brochure created by Dario, http://pdgusers.lbl.gov/~pschaffner/atlas_computing_brochure.html. Suggestions and contributions welcome.
      • OSG Storage Forum next week @ UC - will be circulating a US ATLAS T2 storage overview set of slides; talks from AGLT2 (Shawn), MWT2 (Aaron & Sarah), WT2 (Wei), OU (Horst), Tier 3 program (Rik), Illinois (Dave), SMU (Justin), data access performance (Charles); should also have time to discuss distributed xrootd services as a remote access infrastructure. Note: there is a proposal to cancel next week's usual facilities meeting.
      • Working on anticipated capacity over the next 5 years
      • Existing and new institutions wishing to create Tier 2s will want to know what the capacities are.
      • LHC technical stop - restarted with beam development, increasing the bunch trains.
      • Science policy meeting had ATLAS results with full statistics. Our analysis contributions in the facility helped make this happen. Getting a handle on the full range of Higgs studies - comparison with the Tevatron.
      • Discussion in progress - 2011 run may be extended to reach luminosity goals
      • Note - heavy ions will start on Nov 8, 3-4 weeks, leading into the Christmas shutdown (Dec 6).
    • this week
      • SLAC meeting: there will most likely be a phone connection for remote participants.
      • Reprocessing campaign preparations are underway - reprocessing of the majority of the 7 TeV data, 1B events. The Tier 1 will be a major contributor, and possibly the Tier 2s. Limit interventions.
      • The machine is making good progress toward the goal of 400 bunches. Integrated luminosity is increasing exponentially. It will be interesting to see how analysis patterns evolve once freshly reprocessed data is available.
      • Expect also major activity from Tier 3s in the coming months.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group TWikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Sites are starting to get money and acquire equipment
  • Meeting last week with physics conveners - how analysis will be done, getting analysis examples; validation
  • ADC monitoring meeting - looking at data monitoring in Tier 3. Meeting with ALICE to discuss re-using agents
  • Doug - tagged as ADC technical coordinator
  • Xrootd-VDT on native packaging. Global ATLAS namespace discussion/proposal.
  • Existing Tier 3's into functional tests. ADC will be bringing a new person to improve monitoring for Tier 3 sites.
  • UTD is not receiving functional tests for DDM. We want to avoid this - may need to write up a procedure for removing sites.
  • Alden: setting up a Tier 3, but needs more documentation. Doug: send errors to the RT system.
  • Setting up a CVMFS server for configuration files. RAL is doing 800-job tests.
  • Node affinity to be used from pcache - Charles will work with Doug on this.
  • dq2-FTS testing? None this past week.
this week:
  • Sites have received funds and are procuring equipment. Reviewing the setup with Brandeis step by step.
  • Fresno has ordered equipment, as have Columbia/Nevis and others. The bulk is still to come.
  • Instructions are being moved to CERN from Argonne's TWiki. Updates from the Brandeis installation are going in at the same time.
  • RPM distribution of xrootd from VDT is ongoing - important for standardization in ATLAS.
  • Data management at T3s - in the last few weeks there have been meetings at CERN (Neng, Gerry) and with Andy & Wei. A plan is in place. Use of pq2 tools from U Wisc.
  • FTS transfers to T3 using dq2-get
  • data affinity
  • T3 federation using xrootd

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • deletion rate issues, but overall everything looks healthy
    • all in good shape
    • Michael - at the Tier 1, ADC was notified of a missing library related to Oracle. This was a communication issue between the facility and ADC: an ATLAS dependency on underlying OS components changed without being documented. It would be good to have a page listing these.
    • There used to be a webpage, but it is outdated; there have also been multiple lists.
    • Saul: there are many required libraries, more than 200, not formally specified anywhere; one approach is to run ldd on everything in the release (see the sketch at the end of this section).
    • We would like an official page for this - we need to address it.
  • this week:
    • GROUPDISK crunch coming up; discussion about what to do. Nothing actionable now, but keep an eye on it at the sites.
    • Why are datasets aborted - are they documented? Usually they are associated with aborted tasks. At AGLT2 an analysis that had run okay before was re-run, and the missing datasets turned out to have been aborted. Look for the tag.
    • High memory job queue being tested at SLAC. (Pileup and heavy ion). 4 GB/core is about the limit. SMU has slots with 6 GB/core.
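
    A minimal sketch of the "ldd on everything in the release" check mentioned above - not an official tool, just an illustration in Python. It walks a release install area (the path below is a hypothetical placeholder), runs ldd on every shared object and executable, and prints the distinct libraries they depend on, which could seed the requested dependency page.

      import os
      import subprocess

      # Hypothetical install path - adjust to the site's release area.
      RELEASE_DIR = "/opt/atlas/software/16.0.1"

      def ldd_deps(path):
          """Return the library names reported by ldd for one file."""
          proc = subprocess.Popen(["ldd", path], stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE, universal_newlines=True)
          out, _ = proc.communicate()
          deps = set()
          for line in out.splitlines():
              fields = line.split()
              # Typical line: "libssl.so.6 => /lib64/libssl.so.6 (0x...)"
              if fields and fields[0].startswith("lib"):
                  deps.add(fields[0])
          return deps

      all_deps = set()
      for root, dirs, files in os.walk(RELEASE_DIR):
          for name in files:
              full = os.path.join(root, name)
              # Look at shared objects and executables only.
              if ".so" in name or os.access(full, os.X_OK):
                  all_deps.update(ldd_deps(full))

      for lib in sorted(all_deps):
          print(lib)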

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=107844
    
    1)  9/17: BNL - ddm errors:
    [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443 srm/managerv2]. Givin' up after 3 tries].  Issue resolved - ggus 62206 closed, eLog 17059.
    2)  9/17: AGLT2 - ddm errors caused by a lack of disk space needed on the server hosting the postgres db that is associated with the SE's namespace manager.  Issue resolved - eLog 17080.
    3)  9/18 - 9/20: UTA_SWT2 - job failures with stage-in errors like "Get error: /cluster/xrootd/xrootd/lib/libXrdPosixPreload.so from LD_PRELOAD cannot be preloaded: ignored."  Two worker nodes had a problem with access to the xrootd s/w area - 
    they were removed from service - issue resolved.  ggus 62240 (closed / re-opened / closed), RT 18202 closed, eLog 17173.
    4)  9/20: Transfers to UTD_LOCALGROUPDISK failed with the error "System error in open: No space left on device."  ggus 62248 in-progress, eLog 17144. 
    5)  9/21: It was necessary to roll back an upgrade of the DDM central catalog due to issues with the dq2 clients. 
    https://savannah.cern.ch/support/?116842
    6)  9/21: DDM errors at SLACXRD_PRODDISK(DATADISK) - for example:
    failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server.  Issue resolved.  ggus 62277 closed, eLog 17248.
    
    Follow-ups from earlier reports:
    (i)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w releases 
    to be installed at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.  Also, work underway to enable analysis queue at OU (Squid, schedconfig mods, etc.)
    As of 9/7: ongoing discussions with Alessandro DeSalvo regarding atlas s/w installations at the site.
    Update 9/13: Thanks to Rod Walker for necessary updates to ToA.  Analysis site should be almost ready for testing.
    Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
    (ii)  9/13: AGLT2 - large number of failed jobs from task 167289 with "lost heartbeat" errors.  From Bob:
    The osg home NFS volume was mis-behaving from approximately 14:30-19:00 EDT on Sunday, 9/12. A system power cycle at the latter time appears to have resolved this issue. As far as I can tell, all lost heartbeat failures from this 
    task were reported when the volume came back online and resumed normal behavior.  ggus 62035 in-progress.
    Update 9/20: jobs which failed with these "lost heartbeat" errors eventually succeeded - no recent errors of this type - ggus 62035 closed.
    
    
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=108734
    
    1)  Many sites taking outages to apply security patches.
    2)  9/22: Access to the panda monitor was blocked by two problematic sessions.  Killed by CERN IT support - access restored after ~one hour.
    3)  9/23: AGLT2 maintenance outage, 8:00 a.m. - 4:00 p.m. - from Bob:
    Work on our network switches and some dcache servers. a disk shelf will be moved to a different server during this time, resulting in a pnfs re-registration of files on that shelf. Some files here may show unavailable 
    until that re-registration completes, even after we come back online.  Most services back on-line as of ~4:15 p.m. EST.  (Some lingering DDM errors while two disk shelves were rebuilding - all done as of ~noon 9/24.)
    4)  9/24: NET2_HOTDISK was reporting "NO_SPACE_LEFT" - issue understood and resolved, from Saul:
    srm reporting was broken after the reboot for the security patch. It should be fixed now.  ggus 62414 (closed), eLog 17498.
    Problem recurred on 9/25, ggus ticket re-opened, issue again resolved, ticket closed.  http://savannah.cern.ch/support/?117015
    5)  9/25: Failed SUSY event generation jobs produced large number of failures across multiple clouds.  Tasks were aborted as of ~noon Saturday.  https://savannah.cern.ch/bugs/index.php?72087
    6)  9/25: DDM errors at BNL - from Michael:
    Due to high load on the namespace related MCDISK db there are some transfer failures. Experts are taking care of this and the failure rate will clear in the next ~1h.  eLog 17443.
    7)  9/25 - 9/26: Large numbers of failed tasks in the US cloud with "athena non-zero exit" and athena segfault errors.  Tasks aborted.  eLog 17453.
    8)  9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364."  ggus 62428 / RT 18249 in-progress, eLog 17474.
    9)  9/26: WISC_LOCALGROUPDISK DDM transfers failing:
     [INVALID_PATH] globus_ftp_client: the server responded with an error: 550 550-Command failed : globus_gridftp_server_posix.c:globus_l_gfs_posix_stat:358: 550-System error in stat: No such file or directory. 550-A system call failed: No such file or directory. 550 End.]  ggus 62427 in-progress, eLog 17463.
    10)  9/26: SLAC - job failures with the error "COOL exception caught: Connection on "oracle://ATLAS_COOLPROD/ATLAS_COOLOFL_RPC" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" from "CORAL/Services/ConnectionService" )."  
    ggus 62434 in-progress, eLog 17508.
    11)  9/27: DDM central services upgraded.  (Previous attempt from ~one week ago had to be rolled back.)  eLog 17514/519.
    12)  9/27: Job failures at NET2 with LFC errors.  From Saul at BU: 
    Our LFC was (essentially) offline starting mid-morning.  It's back now.  We'll be keeping an extra eye on things.  ggus 62470 closed, eLog 17615.
    13)  9/27: OU_OCHEP_SWT2: Job failures with the error:
    Get error: No such file or directory: storage/data/atlasproddisk/mc09_7TeV/EVNT/e534/...
    Not clear why the pilot was looking in badly formed directory paths for the data.  ggus 62474 / RT 18254 closed, eLog 17616.
    14)  9/28: From Aaron at MWT2 -
    We just had a brief problem with our PNFS server at UC. This was likely due to a service restart by me, and I've restarted our pnfsd and confirmed that files are accessible again.  eLog 17580.
    15)  9/28: Test jobs submitted to HU_ATLAS_Tier2 were successful (requested by John) - site set back to on-line. 
    
    Follow-ups from earlier reports:
    (i)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w releases 
    to be installed at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.  Also, work underway to enable analysis queue at OU (Squid, schedconfig mods, etc.)
    As of 9/7: ongoing discussions with Alessandro DeSalvo regarding atlas s/w installations at the site.
    Update 9/13: Thanks to Rod Walker for necessary updates to ToA.  Analysis site should be almost ready for testing.
    Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
     Update 9/21: Some modifications to the schedconfig db for the analysis queue were implemented.
    Update 9/29: On-going testing of the analysis cluster (Fred, Alessandro, et al).
    (ii)  9/20: Transfers to UTD_LOCALGROUPDISK failed with the error "System error in open: No space left on device."  ggus 62248 in-progress, eLog 17144. 
    Update 9/29: ggus 62248 still shown "in-progress."  Need to follow-up on this issue.
    
     • Over the weekend a large number of tasks were failing - e.g., SUSY evgen jobs.
     • DDM central catalog services were updated again; no indication of problems this time.

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Site services seem to be fine; a change was made for Wisconsin to fix the site name.
    • BNL disappearing from the LCG BDII - found the cause: duplicated entries in the BNL GUMS. There was a modified VO name in GUMS that did not meet the spelling requirements.
    • Tier 3s - make sure OIM entries are updated. Communicating these to Doug.
    • Is data deletion proceeding according to expectations? Armen: can it be sped up? Probably towards the end of the week - another 30K by then. Hiro: should be able to double the rate.
    • Hiro believes there are problems - the service is leaving files behind.
    • Working on AGLT2 list.
  • this meeting:
    • Warning messages now include an in-depth SAM report. Take note of the time difference between the test time and the report time.
    • In discussion with Simone on a global namespace so that the dq2 client can use it.
    • Need a site to try out dq2-FTS - Brandeis
    • Michael - the grid information system is provided by BDII; for OSG interoperability, the top-level BDII in Europe is now being used for FTS and availability. However, for the last few months resources have been dropping out, leading to reduced availability. There is a high-level discussion in the WLCG management board to revisit this infrastructure. There is a proposal to furnish a reliable BDII setup in Europe and the US; a counter-proposal would be to have OSG provide this service, with an appropriate SLA. The hope is to improve reliability. Xin will make sure we are well represented.

libdcap & direct access (Charles)

last meeting:
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • newline causing buffer overflow
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion into the release
  • No new failure modes observed

this meeting:

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • From Jason:
      Hi Rob;
      I will need to mail in an update today since I am currently traveling:
      - RC3 is being tested at UM/MSU/BNL.  Some minor hiccups unrelated to the performance issue were found (permissions on the directory that stores some owamp data).
      - This release will be looking at the driver for the NICs on the KOI boxes; there are two we can use (sky2 or sk98lin), and we are trying the latter right now.
      - Early results show that it doesn't solve the issue, but we will be checking to see what improvements it has over the sky2 driver.  It appears that for a long enough test (20+ seconds) we can get up to 940Mb/s, but the ramp up is a lot longer than it used to be.
      - We will also be looking at the kernel, and trying different versions.  Specifically we want to see why the handling of soft interrupts may have changed between the prior releases and the current.
      Thanks;
      -jason
    • Notes from this week's throughput meeting:
      USATLAS Throughput Meeting Notes --- September 14th, 2010
      	=========================================================
      
      Attending: Shawn, John, Dave, Karthik, Andy, Sarah, Philippe, Tom, Aaron, Hiro
      Excused: Jason
      
      1) Discussion about problem status at:
      	a) OU - Karthik reported on status. Lots of test information sent to Chris Tracy/ESnet who looked it over.  No host issue is apparent. Possible issues with autotuning or kernel/stack/driver issues.  Try UDP to help isolate the issues (but it is disruptive).  Indication of issue at BNL.  
           b) BNL - John reported dCache machine links all transferred to Force10 (no more Cisco in that path).  Planning to move a VLAN associated with the WAN and run it directly to the ATLAS area, bypassing the BNL core (next few weeks ?). Queue settings setup according to the ESnet recommendations now (as of a couple weeks ago).   Some drops were observed at the 40Gbps to 10Gbps system.  However drops not seen during testing.   Traceroute BNL to OU through ESnet cloud allowed John to determine test-points along the path.   One example was BNL-Cleveland which showed a factor of 10 asymmetry in/out directions, however testing to the next point (Chicago) showed a symmetric result!  Was repeatable.  The newmon 10GE testing box is directly connected to the border router 'mutt' (LHCnet interface) which interconnects to the other border router 'amon'.  Testing from US Tier-2's to newmon will traverse 'amon' and 'mutt' before getting to newmon.  
           c) Illinois - Campus perfSONAR box is working and being used to help diagnose the problem.  Box was setup at campus border and the asymmetry is not observed, indicating the campus may have the problem.  Next steps will involve moving the campus perfSONAR box to the Tier-3 location and retesting.  Progress is being made.
      
      2) perfSONAR release update and information on improving existing system performance.  
           a) Results from "beta" tests at sites  (pSPT v3.2rc3 at BNL, AGLT2 UM, AGLT2 MSU)
           b) Status update on Dell R410 box for use as throughput/latency perfSONAR node (Jason).  
           c) Current release schedule
      
      From Jason: " RC3 Up at BNL/UM/MSU.  Philippe just reported a problem with his, we will look into it.
      
      Early results from UM show that the newer sk98lin driver didn't solve the issue, we will need to examine to see what positive effects it did have.  In some basic testing I was able to get a 30 second test up to 941Mb after about 20 seconds of ramp up - the cards are capable but something else either in the kernel or on the machine may be limiting it from getting there sooner.
      
      Next steps:
       - Watch the RC3 hosts for a couple of days
       - Prepare a 'new' (2.6.27 or newer) kernel to use with both drivers to see if this sidesteps the issue
       - Looking into ksoftirqd some more, and why it is active on the new hosts.
      
      RC4 will be in 2 or so weeks pending the results of the testing."
      
      Philippe reported on the issue with 3.2rc3 on the latency box.  Apparently a protection issue which must be manually handled.  Fix doesn't survive a reboot though.
      
      3) Monitoring and using perfSONAR in USATLAS. 
           a) Status of "Nagios" plugin (Andy) --- Developer working on it has produced an rpm and documentation.  Some minor issues that need resolving...maybe another day or two.  Tom is ready to test it once it is available.  
      	b) Discussion of useful monitoring strategies --- Nothing yet to discuss.  Will wait for 3.2 release.
      
      4) Site reports --- Aaron mentioned a network expansion at MWT2_UC going to 2x10GE between the 2 primary switches via Cisco's etherchannel. Shawn described updates at AGLT2_UM which will add 2 of the Dell 8024F switches (24 port SFP+) as a primary 10GE "backbone" for the local storage nodes and network.  New storage nodes and switches will uplink to both switches (active-backup mode likely) for resiliency.   John mentioned that BNL is using 100Gbps interswitch trunks now (10x10GE).  Some discussion about how trunking works in practice.
      
      AOB:  None.   We will plan to meet again in 2 weeks.  Please send along any corrections or additions to these notes by sending them to the mailing list.
      
      Shawn 
    • Michael: we still have problems with BNL to CNAF and NDGF; the current arrangement is not good. Shawn: has the monitoring been done at the OPN points of presence?
    • We believe the service providers are not proactive - not communicating across service providers at the administrative level. The problem seems to be in the GEANT2 network and the regional providers.
    • At BNL there are two paths - OPN reserved for CERN traffic; at least three providers to get from one T1 to the other.
    • Just need to apply pressure - ATLAS to WLCG.

  • this week:
    • Notes from this week's throughput meeting:
      USATLAS Throughput Meeting - Sept 28, 2010
      	      ==========================================
      
      Attending: Shawn, Karthik, Dave, Aaron, Philippe, Sarah, Jason, Andy, Tom
      Excused: Horst
      
      1) Problem status
      	a) OU - Karthik, no update.  Looking at possible issues at BNL.  Not sure of the status at OU/OneNet/NLR.   OU to Kansas, testing looks OK.   The perfSONAR graphs are not conclusive currently to see if the OU asymmetry is still present. 
      	b) BNL - No report today.
      	c) Illinois - No updates yet.  Still waiting for permissions to relocate test box to appropriate location within campus.  Have jumbo-frames enabled on a private subnet.   Paths are now jumbo frame enabled on the network.
      
      2) perfSONAR status -  Jason: RC3 testing finished.   The problem seems to have been the change in TCP congestion algorithm (change to patched cubic).   RC4 now will have Reno NPAD, NDT and BWCTL will use HTCP.  A new driver for the KOI box will be in place.  The new txqueuelen will be set to 10000 instead of 1000.   Maybe 2 weeks of testing before the final perfSONAR 3.2 release.   The net-install will allow us to install to disk.  
      	In the near future, investigate the R410 box as a single-install, multi-purpose option.  
      
      3) Monitoring issues -  Andy, rpms for Nagios plugin ready.   Tom will start looking at it tomorrow.   Discussion about future monitoring interests.  Aaron suggested adding some kind of path MTU info.  Andy mentioned it may be in a future 'traceroute'->'tracepath' upgrade for perfSONAR. Lots of discussion about what we want for USATLAS.  Will need to play with v3.2 and "client" tools to see how easy it is to get and summarize data.  Possible "asymmetry" indicator might be helpful.  Rate sites on how asymmetric flows into/out-of site are.  Maybe simple "green/yellow/red" indicator?  Need to get some experience in what is interesting and useful.  
      
      4) Site reports - Roundtable - MWT2 - No issues, working well.  AGLT2_UM deploying new equipment.   Worried about longer term throughput because of available inter-site bandwidth.  Going to dual SFP+ 10GE connections on servers and headnodes.  
      
      We plan to meet again in two weeks.  Reminder to all sites...once the final v3.2 perfSONAR is ready we want quick deployment to all sites (within 2 weeks; preferably much sooner).  
      
      Send along corrections or additions to the list.  Thanks,
      
      Shawn
    • Expect 3.2 to be released in about three weeks. Heads up to sites; note there will be a net-install option available.
    • A Nagios page at BNL will monitor the perfSONAR instances. (A sketch of the green/yellow/red asymmetry indicator discussed in the notes is included at the end of this section.)
    • Michael - debugging efforts on an international scale for Tier 1s, e.g. the Italy-to-BNL link; the NDGF-BNL problem, however, was solved quickly. The burden should not rest just on the end sites - the problem owner cannot be one of the end sites. Now it is going to be the WLCG, which delegates to the LHC OPN people who are in contact with the providers along the path. The OPN quickly determined there was a problem with a router at or close to CERN.
    • Hiro - Throughput tests to be expanded to include European Tier 1's.
    • Redo of load tests at some point.
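
    A minimal sketch of the "green/yellow/red" asymmetry indicator discussed in the notes above, in Python. It assumes the in/out throughput numbers (e.g. from perfSONAR/BWCTL measurements) are already available; the 1.5x and 3x thresholds are hypothetical placeholders, not agreed values.

      def asymmetry_status(mbps_in, mbps_out, yellow=1.5, red=3.0):
          """Classify a site's in/out throughput asymmetry as green/yellow/red.

          Placeholder thresholds: ratios below 1.5x are 'green', up to 3x
          'yellow', and anything beyond (or a dead direction) 'red'."""
          if min(mbps_in, mbps_out) <= 0:
              return "red"  # one direction effectively dead
          ratio = max(mbps_in, mbps_out) / float(min(mbps_in, mbps_out))
          if ratio >= red:
              return "red"
          if ratio >= yellow:
              return "yellow"
          return "green"

      # Example: a ~10x asymmetry, like the BNL-Cleveland case noted last week, flags red.
      print(asymmetry_status(940.0, 94.0))  # -> red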

HEPSpec 2006 (Bob)

last week:

this week:

Site news and issues (all sites)

  • T1:
    • last week(s): Not much to report - provisioning more disk capacity; adding Nexan RAID (IBM in front, running Solaris+ZFS) - 1.3 PB of storage shortly. (This adds to the 1.6 PB of DDN from two months ago, which focuses on read applications; write applications are not as performant on the smaller servers, where there is checksumming on the fly.)
    • this week: Fully occupied with preparation for the upcoming reprocessing campaign. Pedro is working on optimization of the staging service for input data from tape and has completed a lengthy program of work. Implementing callbacks to speed up Panda mover, avoiding the name server. Getting data off tape at the rate the drives can sustain. Still working on the stage-in wrapper for resiliency. CREAM CE - deployment is advanced in Europe with large-scale testing; now being requested by ATLAS.

  • AGLT2:
    • last week: Dell visit yesterday - Walker and Roger; they have set up a test site and are most interested in having representative analysis jobs. The goal is to allow engineers to observe behavior and benchmark their systems. How to go forward getting jobs? Packaged, pre-packed analysis jobs. Equipment orders are arriving: UM - blades; MSU - 1U servers. Nearline SAS, 7200 rpm 2 TB drives.
    • this week: Deploying new equipment - decommissioned old dCache nodes. 376 TB deployed at UM ready for test. Headnodes for storage have not arrived; waiting for blades, chassis, and SFP networking components. Tom: 16 of the MD1200s to be delivered; 4 R710s. Network changes underway. 400 TB at MSU to come online. SFP+ cables are hard to get. Two six-core worker nodes running - R610s with 24 job slots each, 300W, X5660.

  • NET2:
    • last week(s): The 250 TB rack is online and DATADISK is being migrated. Working on a second rack. Working on the HU analysis queue. Running smoothly in the past week. The move to DATADISK is 95% complete. Wensheng and Armen have cleared up some space. 250 MB/s nominal copy rate. HU as an analysis queue - still working on this; need to put another server on the Harvard side. Another 250 TB has been delivered, awaiting electrical work. Future - a green computing space, multi-university; NET2 may move there.
    • this week: The GROUPDISK space token was overflowing; it is turned off for now and moving to a larger partition. Ongoing problem with Gratia at Harvard, a very large LSF site; as a result it is not getting into the WLCG statistics. The Gratia folks (Philippe Canal) are engaged, though it is still not resolved. HU is up now and running at full capacity again; the HU ANALY queue has also been set up. Will add 432 cores for the reprocessing campaign. Wei notes that SLAC had similar problems - hacked the script to filter out non-ATLAS jobs.

  • MWT2:
    • last week(s): Running very stably. Maui upgrade. Completed retirement of the older worker-node dCache storage.
    • this week: Moving the database backing the SRM server onto a new server; will migrate. Updated the kernel to 2.6.35.5.

  • SWT2 (UTA):
    • last week: Not much to report this week - a GGUS ticket from a black-hole node. Analysis has been going very well in the past week.
    • this week: Another quiet week.

  • SWT2 (OU):
    • last week: Production is running smoothly. No longer getting DDM errors since the last upgrade. The Frontier-Squid setup has been completed; waiting on scheddb and ToA updates. Fred: is OU getting PoolFileCatalog file updates? Xin will check with Alessandro. Alden: will work with Horst to get the scheddb changes up to date. Fred: believes the ToA is done.
    • this week: Still working on the security patch. Working on getting jobs running in the analysis queues - Patrick suspects there is a problem with the Panda configuration.

  • WT2:
    • last week(s): All is well. Replaced a number of disks and cables. One of the new storage servers is failing stress tests. One of the Thors seems to be dropping off intermittently, traced to a bug in Solaris.
    • this week: All is running smoothly. Will need to reconfigure the storage so as to replicate the DBRelease files in HOTDISK. Running stress tests on the new storage.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last meeting(s)
    • Alessandro is running the final tests at OU now.
    • Expect to start next week migrating BNL Tier 1. Next new release kit will then use Alessandro's system. Expect this to take a couple of weeks.
    • Have asked for documentation for site administrators - there are options. Understanding installation jobs sent via wlcg system.
  • this meeting:
    • Testing at BNL - 16.0.1 installed using Alessandro's system, into the production area. Next up is to test DDM and poolfilecatalog creation.

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

AOB

  • last week
    • LHC accelerator outage next week, Monday-Thursday.
  • this week
    • Fred sent a batch of 450 jobs to the BNL ANALY queues - they were killed and restarted with no message, so the jobs ran overnight rather than in 2 hours; he had not seen this behavior before.
    • Was it a memory issue? Did it hit a physical memory limit?


-- RobertGardner - 28 Sep 2010
