
MinutesOct6

Introduction

Minutes of the Facilities Integration Program meeting, Oct 6, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Attending

  • Meeting attendees: Aaron, Rob, Dave, Shawn, Saul, Torre, Booker, Michael, Mark, Armen, John, Bob, John B, Xin, Wei, Nate, Charles, Doug, Tom, Patrick, Hiro, Rik, Wensheng,
  • Apologies: Fred, Horst, Kaushik, Jason

Integration program update (Rob, Michael)

  • IntegrationPhase14 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • SLAC meeting: there will most likely be a phone connection for remote participants
      • Reprocessing campaign preparations are underway - repro of the majority of the 7 TeV data, 1B events. The Tier 1 will be a major contributor, and possibly the Tier 2s. Limit interventions.
      • Machine making good progress toward the goal of 400 bunches. Integrated luminosity is increasing exponentially. Will be interesting to see how analysis patterns evolve once freshly reprocessed data is available.
      • Expect also major activity from Tier 3's in the coming months.
    • this week
      • Finalizing SLAC meeting agenda
      • Facility capacity updates
      • Autumn reprocessing campaign for Winter conferences:
      • Machine has made lots of progress - successful stores are leading to large integrated luminosities.
      • Several presentations at this week's ATLAS week
      • Will Tier 2's be involved? Most likely not.
      • Oct 19 - a sign-off meeting before the bulk of prod jobs to be released on Oct 25.
      • Q: what about ntuple creation - part of reprocessing? Probably yes.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • The links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Sites have received funds and are procuring equipment. Reviewing setup with Brandeis step-by-step
  • Fresno has ordered equipment, Columbia/Nevis, etc. Bulk still to come.
  • Instructions are being moved to CERN from Argonne's twiki. Updates from Brandeis installation going in at the same time.
  • RPM distribution for xrootd from VDT is ongoing - important for standardization in ATLAS
  • data management at T3 - in the last few weeks have had meetings at CERN (Neng, Gerry) and w/ Andy & Wei. Plan in place. Use of pq2 tools from U Wisc.
  • FTS transfers to T3 using dq2-get
  • data affinity
  • T3 federation using xrootd
this week:
  • Brandeis installation progressing.
  • CVMFS support at CERN - this was in question; we now have an important statement from ATLAS supporting it.
  • The statement was to support R&D for CernVM, and CVMFS for Tier 3 and Tier 2.
  • CVMFS - infrastructure is in place for conditions data as well - close to being able to test.
  • Data management at Tier 3g - phone call coming up this afternoon w/ Wei and Andy.
  • More Tier 3's are coming online - Santa Cruz acquiring equipment.
  • Q: what about inconsistencies between the Tier 3 LFC and physical storage? Also, some T3s are acting as sources.
  • Hiro has a program for LFC-local storage consistency (see the sketch below).
  • dq2-get-FTS plugin - should be ready for next distribution.
  • dq2-client developer discussion - option into dq2-get to allow DDM convention for global namespace. Similar modification for dq2-ls.
  • Doug: can Hiro provide instructions to use plugin with current
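  • A minimal sketch (Python) of the kind of LFC vs. local-storage consistency check mentioned above - the dump file name and storage mount point are hypothetical placeholders, and Hiro's actual program may work quite differently:
      #!/usr/bin/env python
      # Compare a dump of PFNs registered in the LFC for this site against what
      # is physically on disk. Both inputs below are hypothetical.
      import os

      LFC_DUMP = "lfc_pfns.txt"        # one locally-resolvable path per line
      STORAGE_ROOT = "/xrootd/atlas"   # local storage mount point

      def load_lfc_paths(dump_file):
          """Paths the catalog believes exist at this site."""
          with open(dump_file) as f:
              return set(line.strip() for line in f if line.strip())

      def scan_storage(root):
          """Files actually present on disk under the storage root."""
          found = set()
          for dirpath, _dirs, filenames in os.walk(root):
              for name in filenames:
                  found.add(os.path.join(dirpath, name))
          return found

      if __name__ == "__main__":
          in_catalog = load_lfc_paths(LFC_DUMP)
          on_disk = scan_storage(STORAGE_ROOT)
          print("LFC entries with no physical file: %d" % len(in_catalog - on_disk))
          print("Physical files not in the LFC:     %d" % len(on_disk - in_catalog))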

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • All is well, thanks! We need to minimize disruptions; we have a huge backlog of G4 simulation tasks.
    • High memory job queue being tested at SLAC. (Pileup and heavy ion). 4 GB/core is about the limit. SMU has slots with 6 GB/core.
  • this week:
    • Not too many issues to report - most sites at or near full capacity
    • Issue yesterday with 15.6.9 - failures, CMT error. Problem with the release install - looks like it's resolved now. Xin: there were two problems. For 15.6.9 a special cache was installed at most of the US sites, which broke the requirements file and caused HC job failures; this was due to the install script mishandling the special cache. A separate problem was caused by Alessandro's system. Both have been fixed.
    • It seems that in the past week there has been more aggressive blacklisting of sites, not sure why - seems premature? Sometimes it is for Athena crashes rather than site problems. Potentially new shifters - not properly trained.
    • Sometimes shifters are too aggressive in creating tickets - do they get guidance on recognizing transient spikes?
    • Michael: please raise this issue again with shift crew.

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • GROUPDISK crunch coming up; discussion about what to do. Nothing actionable now, but keep an eye on it at the sites.
    • Why are datasets aborted - are they documented? Usually associated with aborted tasks. At AGLT2, an analysis that ran okay before was, on re-run, missing datasets that had been aborted. Look for the tag.
  • this week:
    • Armen: no major issues - following some cleanup issues.
    • See the table of group disk space requirements - next year it will double.
    • Content in the generic, legacy ATLASGROUPDISK will be retired in the next 6 months. There are physics subdirectories within the space token. Note the difference between DQ2 endpoints and the space tokens created.
    • The accounting is done in DQ2.

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=108734
    
    1)  Many sites taking outages to apply security patches.
    2)  9/22: Access to the panda monitor was blocked by two problematic sessions.  Killed by CERN IT support - access restored after ~one hour.
    3)  9/23: AGLT2 maintenance outage, 8:00 a.m. - 4:00 p.m. - from Bob:
    Work on our network switches and some dcache servers. a disk shelf will be moved to a different server during this time, resulting in a pnfs re-registration of files on that shelf. Some files here may show unavailable 
    until that re-registration completes, even after we come back online.  Most services back on-line as of ~4:15 p.m. EST.  (Some lingering DDM errors while two disk shelves were rebuilding - all done as of ~noon 9/24.)
    4)  9/24: NET2_HOTDISK was reporting "NO_SPACE_LEFT" - issue understood and resolved, from Saul:
    srm reporting was broken after the reboot for the security patch. It should be fixed now.  ggus 62414 (closed), eLog 17498.
    Problem recurred on 9/25, ggus ticket re-opened, issue again resolved, ticket closed.  http://savannah.cern.ch/support/?117015
    5)  9/25: Failed SUSY event generation jobs produced large number of failures across multiple clouds.  Tasks were aborted as of ~noon Saturday.  https://savannah.cern.ch/bugs/index.php?72087
    6)  9/25: DDM errors at BNL - from Michael:
    Due to high load on the namespace related MCDISK db there are some transfer failures. Experts are taking care of this and the failure rate will clear in the next ~1h.  eLog 17443.
    7)  9/25 - 9/26: Large numbers of failed tasks in the US cloud with "athena non-zero exit" and athena segfault errors.  Tasks aborted.  eLog 17453.
    8)  9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364."  ggus 62428 / RT 18249 in-progress, eLog 17474.
    9)  9/26: WISC_LOCALGROUPDISK DDM transfers failing:
    [INVALID_PATH] globus_ftp_client: the server responded with an error550 550-Command failed : globus_gridftp_server_posix.c:globus_l_g
    fs_posix_stat:358:550-System error in stat: No such file or directory550-A system call failed: No such file or directory550 End.]  ggus 62427 in-progress, eLog 17463.
    10)  9/26: SLAC - job failures with the error "COOL exception caught: Connection on "oracle://ATLAS_COOLPROD/ATLAS_COOLOFL_RPC" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" from "CORAL/Services/ConnectionService" )."  
    ggus 62434 in-progress, eLog 17508.
    11)  9/27: DDM central services upgraded.  (Previous attempt from ~one week ago had to be rolled back.)  eLog 17514/519.
    12)  9/27: Job failures at NET2 with LFC errors.  From Saul at BU: 
    Our LFC was (essentially) offline starting mid-morning.  It's back now.  We'll be keeping an extra eye on things.  ggus 62470 closed, eLog 17615.
    13)  9/27: OU_OCHEP_SWT2: Job failures with the error:
    Get error: No such file or directory: storage/data/atlasproddisk/mc09_7TeV/EVNT/e534/...
    Not clear why the pilot was looking in badly formed directory paths for the data.  ggus 62474 / RT 18254 closed, eLog 17616.
    14)  9/28: From Aaron at MWT2 -
    We just had a brief problem with our PNFS server at UC. This was likely due to a service restart by me, and I've restarted our pnfsd and confirmed that files are accessible again.  eLog 17580.
    15)  9/28: Test jobs submitted to HU_ATLAS_Tier2 were successful (requested by John) - site set back to on-line. 
    
    Follow-ups from earlier reports:
    (i)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w releases 
    to be installed at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.  Also, work underway to enable analysis queue at OU (Squid, schedconfig mods, etc.)
    As of 9/7: ongoing discussions with Alessandro DeSalvo regarding atlas s/w installations at the site.
    Update 9/13: Thanks to Rod Walker for necessary updates to ToA.  Analysis site should be almost ready for testing.
    Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
    Update 9/21: Some modifications to the schedconfigdb for the analysis queue were implemented.
    Update 9/29: On-going testing of the analysis cluster (Fred, Alessandro, et al).
    (ii)  9/20: Transfers to UTD_LOCALGROUPDISK failed with the error "System error in open: No space left on device."  ggus 62248 in-progress, eLog 17144. 
    Update 9/29: ggus 62248 still shown "in-progress."  Need to follow-up on this issue.
    
    • Over the weekend a large number of tasks were failing - eg. SUSY evgen jobs.
    • DDM central catalog services updated again, no indication of problems this time.
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=109499
    
    1)  9/29: BNL - file transfer errors, for example:
    SRC SURL: srm://dcsrm.usatlas.bnl.gov:8443/srm/managerv2?SFN=/pnfs/usatlas.bnl.gov/BNLT0D1/valid1/AOD/r1580/valid1.00142193.physics_MinBias.recon.AOD.r1580_tid171982_00
    FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] AsyncWait]
    Issue resolved - SRM service restarted.  ggus 62594 closed, elog 17618.
    2)  9/29: Test jobs at ANALY_HU_ATLAS_Tier2 completed successfully - site set 'on-line'.
    3)  9/29: AGLT2 - TRANSFER_PREPARATION phase DDM errors.  Issue resolved - from Shawn:
    Deleting and recreating the /pnfs/aglt2.org/atlasscratchdisk/user09 directory tree has fixed the problem.  ggus 62603 closed, eLog 17632.
    4)  9/30: BNL - file transfer errors, for example:
    [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Thu Sep 30 04:15:53 EDT 2010 state TQueued : put on the thread queue]; [DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] at Thu Sep 30 03:59:00 EDT 2010 
    state AsyncWait : calling Storage.prepareToPut()]; failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2].  
    Issue understood - from Iris: Due to high load and transfer errors, dcache core service was restarted around 4 am. The error you saw was due to the restart.  ggus 62613 closed, eLog 17646.
    5)  9/30: HU_ATLAS_Tier2 - jobs from several tasks were failing with the error "TRF_UNKNOWN | 'poolToObject: caught error."  ggus 62642 in-progress, eLog 17662.  (A similar issue was seen at IllinoisHEP - see eLog.)
    6)  9/30:  BNL - DDM errors like:
    [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries].  From Pedro: pnfs was restarted.  problems should have disappeared.  
    ggus 62654 closed, eLog 17664.
    7)  10/1: BU_ATLAS_Tier2o - failed jobs with stage-in errors ("lsm-get failed").  Issue resolved - from John:
    One of the BU worker nodes (atlas-h08) lost its storage mounts.  It should be back soon.  eLog 17699.
    8)  10/2: WISC_DATADISK transfer errors:
    failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v2/server]. Givin' up after 3 tries].  ggus 62694, eLog 17713.  Site was blacklisted.  (Comment from the site admin, 10/3: Most of the data servers are back. 
    But there are still some failing servers.  We are working on it.)
    9)  10/3: BNL - DDM errors understood - from Michael:
    This is to let shifters know that due to the vacuum process (reorganization of the db) currently running on the MCDISK related namespace db there is some access concurrency on this db leading to a transfer failure rate of ~5%. Note due to the size of the db this 
    may continue for the next couple of hours and does not affect other space token areas (i.e. datadisk, atlasuserdisk etc).
    Later: This intervention was completed. Transfer errors as reported by the dashboard will cease within the next ~1h.  eLog 17744.
    10)  10/3: SWT2_CPB_DATADISK - DDM errors.  Not a site issue, but rather the timing of dataset subscription / deletion.  Details from Hiro in https://savannah.cern.ch/bugs/index.php?73511.  eLog 17764/91.
    11) 10/4: Queue ANALY_HU_ATLAS_Tier2-lsf was set off-line by the shifter due to failed jobs and the status comment 'unused'.  Job failures were athena-related (didn't look like a site problem), and the 'unused' comment is no longer shown in the panda clouds view.  
    Set back on-line.  eLog 17832.  
    10/5: Alden fixed a schedconfigdb bug that was causing the comment field to revert to the old value after modification.
    12)  10/4: WISC_LOCALGROUPDISK - all transfers to / from the token are failing.  ggus 62748 in-progress, eLog 17802.
    13)  10/4: ANL_LOCALGROUPDISK - all transfers to / from the token are failing.  ggus 62750 in-progress, eLog 17803.
    14)  10/4: AGLT2_PHYS-SM - DDM transfer errors.  Issue understood - from Shawn:
    The source of the problem was not the host or its NIC but the switch-stack it connects to. We found three masters in a switch stack of 3 units. The stack has been reset and now seems to be working again.  ggus 62762 closed, eLog 17833.
    15)  10/4: ANALY_SWT2_CPB - jobs were failing with an error indicating a CMT problem in release 15.6.9.  Issue was a bug in the s/w install db (thanks for the info Xin), which has since been fixed.  ggus 62775 / RT 18299.  See 17) below.
    16)  10/4: New site ANALY_SLAC_LMEM is being tested (Wei's request).
    17)  10/4 - 10/5: Many errors in multiple clouds due to problem with installation of atlas s/w release 15.6.12.9.  Most sites have been patched.  See discussion in ggus 62785, https://savannah.cern.ch/bugs/?73626, eLog thread.
    
    Follow-ups from earlier reports:
    (i)  8/19 - 8/20: Maintenance outage at OU_OCHEP_SWT2 for an upgrade to the Lustre file system.  8/21: Test jobs successful following the Lustre upgrade, so OU_OCHEP_SWT2 set on-line.  Still waiting for a complete set of ATLAS s/w releases to be installed 
    at OU_OSCER_ATLAS.  eLog 16119.
    As of 8/31 no updates about atlas s/w installs on OU_OSCER.  Also, work underway to enable analysis queue at OU (Squid, schedconfig mods, etc.)
    As of 9/7: ongoing discussions with Alessandro DeSalvo regarding atlas s/w installations at the site.
    Update 9/13: Thanks to Rod Walker for necessary updates to ToA.  Analysis site should be almost ready for testing.
    Update 9/16: Full set of atlas releases installed at OU_OSCER_ATLAS. Test jobs successful - site set on-line.
    Update 9/21: Some modifications to the schedconfigdb for the analysis queue were implemented.
    Update 9/29: On-going testing of the analysis cluster (Fred, Alessandro, et al).
    (ii)  9/20: Transfers to UTD_LOCALGROUPDISK failed with the error "System error in open: No space left on device."  ggus 62248 in-progress, eLog 17144. 
    Update 9/29: ggus 62248 still shown "in-progress."  Need to follow-up on this issue.
    (iii)  9/26: UTA_SWT2: job failures with the error "CoolHistSvc ERROR No PFN found in catalogue for GUID 160AC608-4D6A-DF11-B386-0018FE6B6364."  ggus 62428 / RT 18249 in-progress, eLog 17474.
    Update from Patrick, 10/4: We are investigating the use of round-robin DNS services to create a coarse load balancing mechanism to distribute data access to multiple Frontier/Squid clients.
    (iv)  9/26: WISC_LOCALGROUPDISK DDM transfers failing:
    [INVALID_PATH] globus_ftp_client: the server responded with an error550 550-Command failed : globus_gridftp_server_posix.c:globus_l_g
    fs_posix_stat:358:550-System error in stat: No such file or directory550-A system call failed: No such file or directory550 End.]  ggus 62427 in-progress, eLog 17463.
    (v)  9/26: SLAC - job failures with the error "COOL exception caught: Connection on "oracle://ATLAS_COOLPROD/ATLAS_COOLOFL_RPC" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" from "CORAL/Services/ConnectionService" )."  
    ggus 62434 in-progress, eLog 17508.
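    • A minimal sketch (Python) of the round-robin DNS / coarse load-balancing idea from Patrick's update in follow-up (iii) above - the alias name and port are hypothetical, and the real site setup may differ:
        #!/usr/bin/env python
        # A single DNS alias resolves to several Squid hosts; each Frontier/Squid
        # client simply uses whichever address it is handed, spreading the load.
        import random
        import socket

        SQUID_ALIAS = "squid.atlas-swt2.example.org"   # hypothetical round-robin alias
        SQUID_PORT = 3128                              # typical Squid port (assumption)

        def pick_squid(alias=SQUID_ALIAS):
            """Resolve the alias and pick one of the A records it returns."""
            _name, _aliases, addresses = socket.gethostbyname_ex(alias)
            # DNS rotates the answer order; a random choice adds a little extra spread.
            return random.choice(addresses)

        if __name__ == "__main__":
            # A client would then be pointed at http://<ip>:<port> as its proxy.
            print("Using Squid proxy %s:%d" % (pick_squid(), SQUID_PORT))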
    
    

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Warning messages are now included in the in-depth SAM report. Note the time difference between the test time and the report time.
    • In discussion w/ Simone on global namespace so that the dq2 client can use it.
    • Need a site to try out dq2-FTS - Brandeis
    • Michael - grid information system as provided by BDII; with OSG interoperability, the top-level BDII in Europe is now being used for FTS and availability. However, for the last few months resources have been dropping out, leading to reduced availability. There is a high-level discussion in the WLCG management board to revisit this infrastructure. There is a proposal to furnish a reliable setup of BDII in Europe and the US. A counter proposal would be to have OSG provide this service, with an appropriate SLA. The hope is to improve the reliability. Xin will make sure we are well represented.
  • this meeting:
    • Central deletion bug - it was deleting LFC entries by mistake. Panda production needed these files, which leads to stuck jobs. The bug is finally fixed (he thinks). Will need to do some cleanup. Only affects the Tier 1. (These are sub-datasets.) Large list, will have to go through them.
    • The deletion rate is set too low at BNL to protect the storage - will need to meet with ADC tomorrow to discuss the issue. NB: how to do deletions efficiently - perhaps by bypassing SRM, which is heavy. Deploy a local agent and remove metadata in the most convenient way (see the sketch below). 30K requests / hour, increasing. A vast number of small log files is produced by user analysis jobs; discussed at the ADC meeting.
    • Debugging networks over long distances - still investigating the networking problem between CNAF and BNL.
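    • A minimal sketch (Python) of the local deletion agent idea noted above - the batch file name and the namespace hook are hypothetical placeholders; a real agent would talk to the site's own storage (dCache/xrootd) internals rather than plain filesystem calls:
        #!/usr/bin/env python
        # Instead of one heavyweight SRM call per file, take a batch of local paths,
        # remove the physical replicas directly, then drop the namespace entries.
        import os

        BATCH_FILE = "deletion_requests.txt"   # hypothetical: one local path per line

        def delete_from_namespace(path):
            """Placeholder: remove the entry from the storage namespace
            (e.g. pnfs/Chimera); the real call depends on the storage system."""
            pass

        def run_batch(batch_file):
            deleted, failed = 0, 0
            with open(batch_file) as f:
                for line in f:
                    path = line.strip()
                    if not path:
                        continue
                    try:
                        if os.path.exists(path):
                            os.remove(path)            # physical replica
                        delete_from_namespace(path)    # then the namespace entry
                        deleted += 1
                    except OSError:
                        failed += 1
            print("deleted=%d failed=%d" % (deleted, failed))

        if __name__ == "__main__":
            run_batch(BATCH_FILE)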

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Meeting notes:
      USATLAS Throughput Meeting - Sept 28, 2010
      	      ==========================================
      
      Attending: Shawn, Karthik, Dave, Aaron, Philippe, Sarah, Jason, Andy, Tom
      Excused: Horst
      
      1) Problem status
      	a) OU - Karthik, no update.  Looking at possible issues at BNL.  Not sure of the status at OU/OneNet/NLR.   OU to Kansas, testing looks OK.   The perfSONAR graphs are not conclusive currently to see if the OU asymmetry is still present. 
      	b) BNL - No report today.
      	c) Illinois - No updates yet.  Still waiting for permissions to relocate test box to appropriate location within campus.  Have jumbo-frames enabled on a private subnet.   Paths are now jumbo frame enabled on the network.
      
      2) perfSONAR status -  Jason: RC3 testing finished.   The problem seems to have been the change in TCP congestion algorithm (change to patched cubic).   In RC4, NPAD and NDT will use Reno, and BWCTL will use HTCP. A new driver for the KOI box will be in place.  The txqueuelen will be set to 10000 instead of 1000.   Maybe 2 weeks of testing before the final perfSONAR 3.2 release.   The net-install will allow us to install to disk.  
      	Near-future will be to investigate R410 box as single install; multiple purpose option.  
      
      3) Monitoring issues -  Andy, rpms for Nagios plugin ready.   Tom will start looking at it tomorrow.   Discussion about future monitoring interests.  Aaron suggested adding some kind of path MTU info.  Andy mentioned it may be in a future 'traceroute'->'tracepath' upgrade for perfSONAR. Lots of discussion about what we want for USATLAS.  Will need to play with v3.2 and "client" tools to see how easy it is to get and summarize data.  Possible "asymmetry" indicator might be helpful.  Rate sites on how asymmetric flows into/out-of site are.  Maybe simple "green/yellow/red" indicator?  Need to get some experience in what is interesting and useful.  
      
      4) Site reports - Roundtable - MWT2 - No issues, working well.  AGLT2_UM deploying new equipment.   Worried about longer term throughput because of available inter-site bandwidth.  Going to dual SFP+ 10GE connections on servers and headnodes.  
      
      We plan to meet again in two weeks.  Reminder to all sites...once the final v3.2 perfSONAR is ready we want quick deployment to all sites (within 2 weeks; preferably much sooner).  
      
      Send along corrections or additions to the list.  Thanks,
      
      Shawn
    • Expect 3.2 to be released in about three weeks. Heads up to sites; note there will be a net-install option available
    • Nagios page at BNL will monitor instances.
    • Michael - debugging efforts on an international scale for Tier 1's - the Italy to BNL link. The NDGF - BNL issue, however, was solved quickly. The burden should not be just on the end sites; the problem owner cannot be one of the end sites. Now it's going to be the WLCG, which delegates to the LHC OPN people who are in contact with the providers along the path. The OPN quickly determined there was a problem with a router at or close to CERN.
    • Hiro - Throughput tests to be expanded to include European Tier 1's.
    • Redo of load tests at some point.
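    • A minimal sketch (Python) of the green/yellow/red "asymmetry indicator" floated in the monitoring discussion in the notes above - the thresholds and sample numbers are hypothetical:
        #!/usr/bin/env python
        # Compare average throughput into and out of a site and flag the imbalance.

        def asymmetry_flag(mbps_in, mbps_out, yellow=1.5, red=3.0):
            """Return 'green', 'yellow' or 'red' from the ratio of the larger
            direction to the smaller one (1.0 means perfectly symmetric)."""
            lo, hi = sorted([mbps_in, mbps_out])
            if lo <= 0:
                return "red"                 # one direction effectively dead
            ratio = hi / float(lo)
            if ratio < yellow:
                return "green"
            return "yellow" if ratio < red else "red"

        if __name__ == "__main__":
            # Hypothetical weekly averages (Mbps) for two sites.
            samples = {"SITE_A": (940.0, 910.0), "SITE_B": (850.0, 260.0)}
            for site, (inbound, outbound) in sorted(samples.items()):
                print("%-8s in=%6.1f out=%6.1f -> %s"
                      % (site, inbound, outbound, asymmetry_flag(inbound, outbound)))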

  • this week:
    • from Jason:
      Hi Rob;
      
      I am traveling tomorrow so will be unable to attend.  The update for our end:
      
      - pSPT rc4 in the hands of USATLAS testers at UM/MSU/BNL.  Recently installed, so not enough data yet to verify.
      - Finishing up final details on 3.2 release, expect release in next 2 weeks (as long as testing looks ok).
      - NAGIOS plugins are in the hands of BNL, did not hear updates on how the testing is going, but no rush on that.
      
      Thanks;
      
      -jason
    • October 27 is the deployment date: all sites updated with the new release

HEPSpec 2006 (Bob)

last week:

this week:

  • MWT2 results were run in 64 bit mode by mistake; Nate is re-running.
  • Assembling all results in a single table.
  • Please send any results to Bob - important for running the benchmark.
  • Duplicate results don't hurt.

Site news and issues (all sites)

  • T1:
    • last week(s): fully occupied for preparation for the upcoming repro campaign. Pedro working on the optimization of the staging service, re: input data from tape, completed lengthy program of work. Implementing call backs to speed up Panda mover, avoiding name server. Getting data off the tape at a rate the drives can obtain. Still working on stage-in wrapper for resiliency. CREAM CE - deployment is advanced in Europe and with large scale testing, now being requested by ATLAS.
    • this week: will be adding 1300 TB of disk; the installation is ready and will be handed over to the dCache group to integrate into the dCache system by next week. CREAM investigations are continuing. LCG made an announcement to the user community that the existing CE will be deprecated by the end of the year, urging sites to convert. Have discussed with OSG on behalf of US ATLAS and US CMS - Alain Roy is working on this, will be ready soon. Submission and Condor batch backend sites will need to be tested. Preliminary results looked good for a single site, but CMS found problems with submission to multiple sites. The plan is to submit 10K jobs from a BNL submit host to 1000 slots at UW, to validate readiness (Xin) - see the sketch below. Note: there is no end-of-support date for the current GT2-based OSG gatekeeper.
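    • A minimal illustration (Python) of fanning test jobs at a CREAM CE via Condor-G, as flagged above - the CE URL, backend batch system, queue name and the exact grid_resource syntax are assumptions for illustration only:
        #!/usr/bin/env python
        # Write a grid-universe submit description for a batch of trivial test jobs
        # and hand it to condor_submit; scale N_JOBS up toward 10k once this works.
        import subprocess
        import tempfile

        CE_URL = "https://cream-ce.example.org:8443/ce-cream/services/CREAM2"  # hypothetical
        BATCH, QUEUE = "condor", "default"   # hypothetical backend batch system / queue
        N_JOBS = 100

        def submit_description(ce, batch, queue, njobs):
            """Build the Condor-G submit description as plain text."""
            return "\n".join([
                "universe      = grid",
                "grid_resource = cream %s %s %s" % (ce, batch, queue),
                "executable    = /bin/hostname",
                "output        = test_$(Cluster)_$(Process).out",
                "error         = test_$(Cluster)_$(Process).err",
                "log           = test_$(Cluster).log",
                "queue %d" % njobs,
            ]) + "\n"

        def submit_test_jobs():
            with tempfile.NamedTemporaryFile(mode="w", suffix=".sub", delete=False) as f:
                f.write(submit_description(CE_URL, BATCH, QUEUE, N_JOBS))
                subfile = f.name
            # condor_submit prints the cluster id on success; non-zero exit means trouble.
            rc = subprocess.call(["condor_submit", subfile])
            print("condor_submit exit code: %d" % rc)

        if __name__ == "__main__":
            submit_test_jobs()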

  • AGLT2:
    • last week: Deploying new equipment - decommissioned old dCache nodes. 376 TB deployed at UM ready for test. Headnodes for storage have not arrived. Waiting for blades, chassis, SFP networking components. Tom: 16 of the MD1200's to be delivered; 4 R710's. Network changes underway. 400 TB at MSU to come online. SFP+ cables are hard to get. Two six-core worker nodes running - R610, w/ 24 job slots each. 300W X5660.
    • this week: Delivery of disks at both sites. Awaiting headnodes at the UM site. Will be 1.9 PB when operational. One more PO to go out. SFP+ switches won't be delivered till Nov 1. Looking at locality of clients and pools between the two sites - are there dCache reconfigurations or locality options? A local site mover may be helpful in this context. Tom: all parts arrived at MSU. Outages coming when the new switches and hardware arrive (full-day scale).

  • NET2:
    • last week(s): GROUPDISK space token was overflowing; turned off for now; moving to larger partition. On-going problem with Gratia at Harvard, a very large lsf site. As a result not getting into WLCG statistics. Gratia folks (Philip Canal) are engaged, though still not resolved. HU is up now and running at full capacity again; also HU ANALY queue has been setup. Will add 432 cores for reprocessing campaign. Wei notes that SLAC had similar problems - hacked into the script to filter out non-ATLAS jobs.
    • this week: All running smoothly - full capacity, mainly production. Westmere nodes at HU have been in use (432 cores). John has the new HU ANALY queue up and working. There are some errors at the moment - Athena crashes or a release installation issue. Continue to move space tokens around into the new storage. Still migrating GROUPDISK - mostly done - expect large tracts of data to free up later today. John - the on-going Gratia problem (the cron job couldn't catch up): the "suppress-local" option was ineffective; Wei's patch was used instead and works well, so this is fixed now. Condor-G problems keeping the site filled; found a bug in the globus perl module interfacing with LSF; Jaimie provided a patch, fixed.

  • MWT2:
    • last week(s): Moved the database backing the SRM server onto a new server; will migrate the SRM service itself at some point. Updated the kernel to 2.6.35.5 everywhere.
    • this week: Quiet week, focusing on stability. Slowly migrating compute nodes to SL 5.5. Moving the database for the SRM door onto a higher-end server, reducing the failure rate. Emails sent to users regarding this.

  • SWT2 (UTA):
    • last week: Quiet week.
    • this week: Security updates completed on both clusters - all went fine. LFC version update announcement - we'll need to update the facility LFC's at some point.

  • SWT2 (OU):
    • last week: Still working on security patch. Working on getting jobs running in analysis queues - Patrick suspects there is a problem with the Panda configuration.
    • this week: The site name is available but the queue name isn't visible for the ANALY queue. Will consult Alden.

  • WT2:
    • last week(s): All is running smoothly. Will need to reconfigure the storage to replicate files in HOTDISK for DBRelease files. Running stress tests on the new storage.
    • this week: Not much to report - all is working fine. 1.4 PB usable is now available.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last report
    • Testing at BNL - 16.0.1 installed using Alessandro's system, into the production area. Next up is to test DDM and poolfilecatalog creation.
  • this meeting:
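  • A minimal sketch (Python) related to the PoolFileCatalog test mentioned under "last report" - the catalog location is hypothetical; remote (dcap://, root://) PFNs are only counted, since they cannot be checked with a local stat:
      #!/usr/bin/env python
      # Parse a PoolFileCatalog.xml and verify that every locally-referenced PFN exists.
      import os
      import xml.etree.ElementTree as ET

      CATALOG = "PoolFileCatalog.xml"   # hypothetical path to the generated catalog

      def check_catalog(catalog=CATALOG):
          tree = ET.parse(catalog)
          missing, remote, ok = [], 0, 0
          for pfn in tree.getroot().iter("pfn"):
              name = pfn.get("name", "")
              path = name[len("file:"):] if name.startswith("file:") else name
              if "://" in path:           # remote protocol - skip the local check
                  remote += 1
              elif os.path.exists(path):
                  ok += 1
              else:
                  missing.append(path)
          print("local PFNs ok=%d  remote=%d  missing=%d" % (ok, remote, len(missing)))
          for path in missing:
              print("  MISSING: " + path)

      if __name__ == "__main__":
          check_catalog()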

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • newline causing buffer overflow
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion in to the release
  • No new failure modes observed

this meeting:

AOB

  • last week
    • LHC accelerator outage next week, Monday-Thursday.
    • Fred sent a batch of 450 jobs to BNL ANALY queues - they were killed and restarted with no message, so the jobs ran overnight rather than 2 hours; this behavior had not been seen before.
    • Was it a memory issue? Did the jobs hit a physical memory limit?
  • this week
    • No meeting next week - SLAC f2f meeting.


-- RobertGardner - 05 Oct 2010
