
MinutesMay5

Introduction

Minutes of the Facilities Integration Program meeting, May 5, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Shawn, Pedro, Booker, Charles, Wei, Nate, Jason (I2), Patrick, Tom, Patrick, Saul, Michael, Xin, Sarah, John, Doug, Armen, Mark, Nurcan, Kaushik
  • Apologies: OU

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (9:30am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • As data taking ramps up we're seeing lots more analysis activity. The analysis load is more evenly spread; now that user-cancelled jobs are no longer counted as failures, the US efficiency is quite high - about 2% failures.
      • We had a good store over the weekend - BNL was receiving data at ~GB/s for over 36 hours. As the machine starts up again we'll see if this continues.
      • Reprocessing campaign has been completed quickly. All Tier 1's worked very well.
      • Data distribution to all Tier 1's and then to Tier 2's is going well. Not all Tier 2's have deployed their capacities, and PI's are urged to meet their pledges. There is a sharing plan under discussion in the RAC. Would be nice to fine tune the distribution formulas with usage patterns.
    • this week
      • Following reprocessing, a lot of data has been subscribed to the Tier 1 and Tier 2s, so lots of data replication is underway. We're hitting reality - users are waiting for data.
      • This is the first time we're exercising the entire chain under load; it's real data, so we're sensitive to real latencies and expectations. What are the performance optimizations, etc.?
      • We have lots of good and valuable discussions. We need to analyze and understand the limitations.
      • Note the machine still has significant problems - we had a good run over the weekend, still sorting out several issues.

Feature talk: BNL dCache local site mover (Pedro Salgado)

Discussion:

  • To set up at other sites, the web service used to access the pnfsid may need to be installed; the pnfsid translation should be optional (see the sketch after this list).
  • The web service providing this should hopefully converge on a single source.
  • Test cases are also valid for Xrootd
  • Pedro, Shawn and Charles to meet offline. Follow-up during first week of June.
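  • Below, a minimal sketch (not the BNL implementation) of the pnfsid lookup discussed above: it tries a web service first and falls back to the standard PNFS/Chimera ".(id)(<name>)" dot-file on a mounted namespace. The service URL, its "path" parameter, and the function names are illustrative assumptions.

      #!/usr/bin/env python
      # Hedged sketch: resolve a /pnfs path to its pnfsid either via a
      # (hypothetical) web service or via the PNFS/Chimera dot-file interface.
      # Not the BNL lsm code; the service URL and parameter are assumptions.
      import os, sys, urllib, urllib2

      PNFSID_SERVICE = "http://dcache-admin.example.org:8080/pnfsid"   # illustrative URL

      def pnfsid_from_service(pnfs_path):
          # Assumed interface: GET ?path=<pnfs path>, plain-text pnfsid in the body.
          query = urllib.urlencode({"path": pnfs_path})
          return urllib2.urlopen("%s?%s" % (PNFSID_SERVICE, query)).read().strip()

      def pnfsid_from_dotfile(pnfs_path):
          # Standard namespace trick: reading ".(id)(<name>)" returns the pnfsid.
          directory, name = os.path.split(pnfs_path)
          return open(os.path.join(directory, ".(id)(%s)" % name)).read().strip()

      def get_pnfsid(pnfs_path):
          try:
              return pnfsid_from_service(pnfs_path)
          except Exception:
              # Translation should be optional: degrade gracefully if the service is absent.
              return pnfsid_from_dotfile(pnfs_path)

      if __name__ == "__main__":
          print get_pnfsid(sys.argv[1])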

Local Site Mover for Xrootd (Charles)

  • Specification: LocalSiteMover
  • last week
    • python wrappers for xrootd are largely complete - not all functions are finished yet
    • working on the lsm script itself
    • have test xrootd instances set up for testing
    • hope to start testing at xrootd sites by this time next week.
  • this week:
    • LocalSiteMoverXrootd
    • Taken the existing gen-0 site mover scripts from the dCache lsm and doing a simple translation to the xrootd protocol (see the sketch after this list)
    • Q from Wei: how does the lsm apply to direct reads for analysis? Could leverage the python-xrootd bindings library from Charles.
    • lsm and pcache - lsm-get is responsible for getting input files for the job.
    • Question about bundling the two, or at least making them work together.
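    • A minimal sketch of the kind of translation described above, assuming the usual lsm-get convention (source URL, local destination; exit 0 on success, a numeric code plus a short message on failure). This is not Charles's script; xrdcp is the only external tool assumed.

        #!/usr/bin/env python
        # Hedged sketch of an lsm-get-style wrapper for an xrootd SE: copy a
        # root:// source to a local destination with xrdcp and report failures
        # the way the pilot expects.  Argument convention and exit codes are
        # illustrative, not the production script.
        import os, sys, subprocess

        def lsm_get(src, dest):
            # Make sure the destination directory exists before invoking xrdcp.
            dest_dir = os.path.dirname(dest)
            if dest_dir and not os.path.isdir(dest_dir):
                os.makedirs(dest_dir)
            # -np: no progress bar (keeps pilot logs clean); -f: overwrite partial files.
            rc = subprocess.call(["xrdcp", "-np", "-f", src, dest])
            if rc != 0:
                print "201 Copy command failed (xrdcp rc=%d)" % rc
                return 201
            if not os.path.exists(dest) or os.path.getsize(dest) == 0:
                print "202 Copied file is missing or empty"
                return 202
            return 0

        if __name__ == "__main__":
            if len(sys.argv) != 3:
                print "usage: lsm-get <source root:// URL> <local destination>"
                sys.exit(1)
            sys.exit(lsm_get(sys.argv[1], sys.argv[2]))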

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • The links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here

last week(s):

  • Tier 3-Panda is working. There were some residual issues to sort out (to discuss w/ Torre). Working with Dan van der Ster to get HC tests running on Tier 3.
  • Need to update Tier 3g build instructions - after the software week. Target mid-May
  • Tier 3 plans are gelling. CVMFS - implies T3's will require squids - therefore will need monitoring.
  • Request for second FTS at BNL for gridftp-only endpoints.
  • Work proceeding on Xrootd; SRM checksumming is required for T3; Doug is working on that.

this week:

  • XrootdOSGAtlasMeetingMay4
  • Work converging to provide a set of recommendations and instructions for a non-grid Tier 3
  • Planning Tier 3 workshop around June 7
  • Funding is quite uncertain, though some institutions have heard from their grant offices.
  • Tier 3 Panda testing - job submissions being tested. Still trying to get HC working.
  • Data distribution - a dq2-client that uses FTS is to be released in two weeks (for gridftp-only endpoints).
  • Doug: has a request from Simone for a completely separate FTS instance for gridftp-only endpoints.
  • Hiro has checked that gridftp-only works fine.
  • Michael: if we need a separate instance, we will set this up.
  • Will need more sites testing.
  • Throughput testing will require more.
  • Doug: pcache on distributed xrootd (processors and storage on same nodes).
    • Probably not to be implemented in pcache, but the tool that loads in data to the Tier 3.
    • Rik: probably will take a longer time to implement.
  • NFSv4 client bug - ATLASLocalRootBase + CVMFS, etc. Found in testing wlcg-client-lite. Bug triggered in RHEL 5.4 and below. Waiting for beta release in SL5.
  • (note NFSv4 looks to be the default for RHEL6)

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • MinutesDataManageApr13
    • Main issues have been keeping an eye on storage - need to do some dataset deletion.
    • Remnant of consolidating site services at BNL: short vs. long form physical file names. This wasn't an issue before since each site was internally consistent. SLAC and SWT2 had the long form, fixed by Hiro. Bottom line: a mess in the LFCs, with short and long forms at all sites. A task force will be formed to address this quickly and make the catalogs consistent (see the sketch at the end of this section).
    • Replication percentages to T2 sites now set to nominal.
  • this week:
    • Getting lots of data now!
    • Three major topics this week. First, space status: making sure we're not running out.
    • Second, slow T0 --> T1 --> T2 transfers: being investigated; a hot topic with an on-going thread.
    • Third, discussion in the ADC phone meeting of the German cloud experience: a T2 can only get data from within its own cloud, while users constantly request data from outside their clouds. DaTri handles this with two subscriptions via the Tier 1; in the German cloud this led to running out of space. Proposal: start testing direct channels between the largest T2s and the T1s, possibly going through a star channel.
    • Do we participate in these tests? Note: will improve performance for users. Won't waste scratch space at Tier 1s.
    • Pedro: what about the load on people - figuring out bottlenecks, remote site issues, etc. What is the operational load increase? Kaushik will make this point.
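    • To make the short-vs-long-form LFC problem noted above concrete, a hedged sketch that classifies replica SURLs by form and flags mixed sites. The regular expression and the input structure are assumptions; the task force would of course work against the LFCs themselves.

        # Hedged sketch: classify replica SURLs as "short" or "long" form to spot
        # the per-site inconsistencies described above.  Pure string handling; a
        # real check would iterate over the LFC replica catalogs.
        import re

        # Long form: SRM v2 web-service style, e.g. srm://host:8443/srm/managerv2?SFN=/path
        LONG_FORM = re.compile(r"^srm://[^/]+:\d+/srm/[^?]+\?SFN=/")

        def surl_form(surl):
            if LONG_FORM.match(surl):
                return "long"
            return "short"

        def site_consistency(surls_by_site):
            # surls_by_site maps a site name to a list of its replica SURLs.
            report = {}
            for site, surls in surls_by_site.items():
                forms = set([surl_form(s) for s in surls])
                if len(forms) == 1:
                    report[site] = "consistent (%s form)" % forms.pop()
                else:
                    report[site] = "MIXED short/long forms"
            return report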

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=92784
    
    1)  4/21: Filled system disk at HU_ATLAS_Tier2 -- details from John:
    I moved /tmp to the local scratch disk and fixed up the corrupted /etc/services file.  I will have to follow up about:
    1) gratia-lsf uses up 6 GB of /tmp every 10 minutes because it makes a line-reversed copy of our entire LSF history file using tac, which requires a lot of tmp space.
    2) the 75,000 files that were left at the top of /tmp over the past 24 hours.
    I see pilots already finishing successfully (noting that the site is offline).  I'm going to turn the site back online, with a local cap of 100 jobs just to make sure everything is really working again, and then I'll remove the cap.
    2)  4/21: From Bob at AGLT2:
    I have set us off-line so that we can drain, and we will then fix a Condor problem we have here.  This has already caused a bunch of "lost heartbeat", and eventually a bunch more will report in.  The problem has a known cause.  Jobs that complete normally are unaffected.  We expect to be back online late this afternoon.
    Following the Condor fix test jobs were successful, site set back to 'on-line'.  eLog 11748.
    3)  4/22: From Hiro:
    BNL dCache has been updated to fix the gridftp adapter problem.  Therefore, I will change all FTS channels to use GRIDFTP2 this afternoon (2 PM US East).  Sites which allow direct writing to storage disks/pools (e.g. dCache sites) should pay attention to their SEs.
    4)  4/22: MWT2_UC - jobs failing with the error "22 Apr 06:14:30|pilot.py | !!FAILED!!1999!! Too little space left on local disk to run job: 573571072 B (need > 2147483648 B)."  Issue resolved - from Charles at UC:
    Problem should be fixed now.  Background - we are using pcache
    [https://twiki.cern.ch/twiki/bin/view/Atlas/Pcache ] which uses a subset of scratch space for a file cache. The max size of this cache was set to 90%, which leaves ~40GB free. This job set filled up the available space quickly before a cache cleanup pass could free up space by deleting cached files. I've reduced the pcache max space limit from 
    90% to 80%, which should prevent recurrence of this problem.  ggus 57532 (closed), eLog 11810.
    5)  4/22:  Transfer failures at SWT2_CPB:
    SRC SURL: srm://gk03.atlas-swt2.org:8443/srm/v2/server?
    SFN=/xrd/atlasproddisk/mc09_7TeV.105001.pythia_minbias.merge.NTUP_MINBIAS.e517_s764_s767_r1229_p133_
    tid126581_00_sub06836636/NTUP_MINBIAS.126581._012481.root.1
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]
    Problem understood, issue resolved.  From Patrick:
    The system disk on one of our dataservers got filled and we think that this is the root cause of a problem with the portion of Xrootd used by the SRM.  The SRM was not able to 'stat' files and reported that the files did not exist.
    We sync'ed the dataserver contents to the CNSD's copy of the contents. For some reason, the CNSD is still maintaining some older files that have been deleted, but these should not cause operational issues.  We also took the time to perform some minor maintenance on the dataservers.  The service is seemingly running fine now.  eLog 11806/07.
    6)  4/23: Job failures at MWT2_IU with errors like:
    23 Apr 05:55:23|LocalSiteMov| !!WARNING!!2995!! lsm-put failed (51456): 201 Copy command failed
    23 Apr 05:55:24|Mover.py | !!WARNING!!2999!! Error in copying (attempt 1): 1137 - lsm-put failed (51456): 201 Copy command failed
    23 Apr 05:55:24|Mover.py | !!WARNING!!2999!! Failed to transfer NTUP_MINBIAS.126581._014054.root.1: 1137 (Put error: Error in copying the file from job workdir to localSE)
    From Sarah at IU:
    Thank you for reporting the issue! We found that certain worker nodes in the cluster had an older version of the lsm-put script, which caused certain put operations to fail. We've updated those nodes and continue to monitor.  ggus 57584, RT 16080, eLog 11872.  4/26: still see job failures with stage-out errors.  From Sarah:
    Proddisk had reached 99% usage, causing writing job outputs to fail. I have allocated more space.  ggus ticket closed.
    7)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    8)  4/23:  Spring reprocessing exercise is underway.  Useful links:
    http://panda.cern.ch:25980/server/pandamon/query?mode=listtask
    http://gridinfo.triumf.ca/panglia/graph-generator/?SITE=DE&TASK=ALL&TIME=hour&SIZE=large
    http://atladcops.cern.ch:8000/j_info/repro_feb_10_RAW_ESD.html
    9)  4/23: ANALY_SWT2_CPB: jobs were failing due to an issue with a specific transform (runGen) which does not default to the system python in the same way as other transforms.  To address this issue a 32-bit version of WN client was installed to ensure that the pilot picks up a workable python version and thus decouple the issue from the job transforms.
    10)  4/23: UTA_SWT2 - NFS server problems.  Evidence that the NIC in the machine was dropping some packets.  Not clear what was causing the problem.  A system reboot cleared up the problem for now.  4/24: One of the xrootd data servers crashed.  Used this opportunity to update xrootd on all the servers, along with some modifications to the XFS file system mounts.  
    System restarted with modified options to the NIC driver - this issue seemingly resolved.
    11)  4/23: HU_ATLAS_Tier2 - stage-in errors like:
    23 Apr 18:31:43|LocalSiteMov| !!WARNING!!2995!! lsm-get failed (28169):
    23 Apr 18:31:44|Mover.py | !!FAILED!!2999!! Error in copying (attempt 1): 1099 - lsm-get failed (28169):
    23 Apr 18:31:44|Mover.py | !!FAILED!!2999!! Failed to transfer DBRelease-10.3.1.tar.gz: 1099 (Get error: Staging input file failed)
    23 Apr 18:31:44|Mover.py | !!FAILED!!3000!! Get error: lsm-get failed (28169):
    From Saul:
    We're continuing to work on this issue. It's understood but not completely resolved. Let's close the ticket and open a new one if the errors reappear.  ggus 57615 (closed), eLog 11817.
    12)  4/24: FTS errors at SLAC -
    [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries].  From Wei:
    One of our data servers was down. It is back online.  ggus 57622, eLog 11826.
    13)  4/24: IllinoisHEP, file transfer problems with SRM errors:
    SRC SURL:
    srm://osgx1.hep.uiuc.edu:8443/srm/managerv2?SFN=/pnfs/hep.uiuc.edu/data4/atlas/proddisk/mc09_7TeV/ESD/e505_s765_s767_r1250/mc09_7TeV.105722.PythiaB_bbe7X.recon.ESD.e505_s765_s767_r1250_tid126992_00/ESD.126992._003134.pool.root.1
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:
    [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgx1.hep.uiuc.edu:8443/srm/managerv2]. Givin' up after 3 tries].  From Dave at Illinois:
    Restarted parts of dCache (I believe a pool node was confused) and all seems well at this point.  ggus 57634, eLog 11854.
    14)  4/26 - 4/28: Jobs failing at most U.S. sites due to missing release 15.6.9.  There was a problem with the install pilots which was preventing the s/w installation jobs from running.  This issue has been resolved.  eLog 11996, ggus 57681.
    
    Follow-ups from earlier reports:
    
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  Paul added to the thread in case there is an issue on the pilot side.  
    ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    (ii)  4/21: MWT2_UC, MWT2_IU, ANALY_MWT2 offline for kernel upgrades and network tests (in progress).  eLog 11724.
    Update, 4/21 p.m.: Maintenance completed, sites back 'on-line'.
    
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=93559
    
    1)  Upcoming: 5/11, BNL -
    The Condor batch system will be upgraded on May 13 (Thursday) beginning at 8 am EDT.
    Duration:
    May 13, from 8 am - 12 noon
    Expected User Impact:
    No batch jobs can be scheduled or executed during the upgrade.
    2)   4/28: DDM transfer errors between MWT2_UC_USERDISK ==> MWT2_IU_LOCALGROUPDISK.  The STAR-IU  FTS channel was temporarily stopped to prevent additional errors.  From Charles:
    The offending LFC entries have been cleaned up. We are re-enabling the channel and watching for errors - if this condition returns we'll pause the channel.  Issue seems to be resolved. 
    3)  4/28: From Pedro at BNL:
    Our dcache srm server stopped working.  We've managed to recover the service.  Please ignore the dashboard error messages.  eLog 12020.
    4)  4/29 - 4/30: Slowness in data transfers to the BNL SE.  From Hiro:
    The network engineers have identified the faulty link and switch. The throughput to/from BNL has been restored by rerouting traffic around them.  ggus 57801 (closed), eLog 12039.
    5)  4/29: DDM errors at AGLT2 -
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries].  From Shawn:
    This was due to a connection limit being reached on our dCache headnode's postgresql max_connections parameter.
    The fix was to double it from 300 to 600 and then restart first postgresql and then all dCache services on this node.  eLog 12055.
    6)  4/29 - 5/1: Issues with atlas s/w releases installation at MWT2_IU.  Now resolved.  See long discussion threads in ggus 57820, RT 16151 (both closed), eLog 12065.
    7)  5/3: BNL - from Michael:
    BCF Facility Services will be putting the APC UPS into bypass mode in order to repair a severe problem with its batteries. Although the equipment will continue to operate on line power there will be no UPS protection until the problem is corrected. 
    No interruption in service is expected unless there is a power line glitch. 
    No impact observed.
    8)  5/4: Xin reported a problem with atlas s/w installation jobs failing (for example, see panda job i.d.'s 1068250967,  1068414227).  From Tadashi:
    I've fixed test/installSW.py on SVN.
    Issue resolved.
    9)  5/4: From John at NET2 / HU:
    We were going along fine at ~750 concurrent jobs for days, but when I lifted that limit today, our lsm and storage again ran into scaling issues.  I'm going to get us back down to the 750 level, where things were working correctly.  I will do this while keeping the site online in panda.
    
    Follow-ups from earlier reports:
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  
    Paul added to the thread in case there is an issue on the pilot side.  ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    Update, 5/4: Additional information posted in the ggus ticket.  Also, see comments from Paul.
    (ii)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    (iii)  4/23:  Spring reprocessing exercise is underway.  Useful links:
    http://panda.cern.ch:25980/server/pandamon/query?mode=listtask
    http://gridinfo.triumf.ca/panglia/graph-generator/?SITE=DE&TASK=ALL&TIME=hour&SIZE=large
    http://atladcops.cern.ch:8000/j_info/repro_feb_10_RAW_ESD.html
    Update: this reprocessing exercise has now been completed.
     
    • Low number of US issues, smooth running.
    • Added several shifters in the US time zone. Helped greatly.

DDM Operations (Hiro)

Release installation, validation (Xin)

The issue of validating process, completeness of releases on sites, etc.
  • last meeting
    • Xin is helping Alessandro test his system on the ITB site. The WMS for the testbed at CERN is not working - i.e. the ITB is not reporting to the PPS BDII. Working on this with the GOC.
    • Need to publish the BNL ITB site into the production WLCG BDII in order to test. Reconfigured, and changed OIM.
    • Information has still not appeared on the WLCG side. Once available, then Alessandro can submit jobs.
  • this meeting:
    • ITB site info is now published, though it reported the gatekeeper OS rather than the worker-node OS. ATLAS releases were installed.
    • dq2-client package installation on sites: working towards a unique set of packages. Discussing this with Marco.
    • Next step: after details have settled, try out a Tier 2 site - OU good candidate.

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week(s)
    • PFC corruption at HU - it was affecting production jobs, which it should never do. This file is not used, but it needs to exist and be in proper XML format. Hiro reported a problem with its content. Alessandro was working on a validity check in the software - thought this was done. Saul: it was not actually corrupted, but out of date. This is a general problem - the jobs which install this sometimes fail (dq2 failures), and this will affect running analysis jobs. Fred will discuss with Richard Hawkings next week at CERN and report back. We need a consistency checker for this (see the sketch at the end of this section).
    • New version of squid - recommended for deployment. See message from Dario.
    • AGLT2 updated - but got a slightly older version of the rpms. Needs to update.
    • Advice: make sure you stop running processes; uninstall old version before installing the new release.
    • Caution: Customizations will be overwritten. ACLs for example.
    • John will update US facility instructions - will test at MWT2.
  • this week
    • Fred is testing the latest release at BNL and CERN - working with the latest versions of Athena.
    • New Frontier client release has been delayed.
    • John is upgrading BNL servers; will work on Tier 2 instructions after that.
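  • A minimal sketch of the kind of PoolFileCatalog sanity check discussed above (the file exists and is well-formed XML), assuming the conventional PoolFileCatalog.xml file name. This is not Alessandro's validator, and it does not detect the "valid but out of date" case Saul described.

      # Hedged sketch: verify that a PoolFileCatalog exists and parses as XML.
      # It does NOT detect the "valid but out of date" case described above.
      import os
      import xml.dom.minidom

      def check_pfc(path="PoolFileCatalog.xml"):
          if not os.path.exists(path):
              return False, "missing: %s" % path
          try:
              xml.dom.minidom.parse(path)
          except Exception, exc:          # expat parse errors, I/O errors, etc.
              return False, "not valid XML: %s" % exc
          return True, "ok"

      if __name__ == "__main__":
          ok, msg = check_pfc()
          print msg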

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • From Jason:
                - The pSPT 3.1.3 was released on April 23rd (on time)
                - Tier1 and Tier2 should be upgrading as we speak 
                - Report problems as always to the mailing list.
               Thanks;
               -jason
    • Notes from last meeting
      	USATLAS Throughput Meeting Notes --- April 20, 2010
           ===================================================
      
      Attending:  Shawn, Sarah, Horst, Karthik, Jason, Philippe, Hiro, Aaron, Andy, Dave
      Excused:
      
      1) perfSONAR Issues
      	a) Next release candidate testing status (Jason):   Have RC3 out and available.   Latency node upgrade for RC3.  Release still on target for Friday (April 23)
      	b) Status of issues identified by perfSONAR at OU (Karthik/Jason):  Karthik working with Jason on asymmetry.  Outbound bandwidth much higher than input.  Karthik wrote some script following advice from Jason which verified the issue.  Jason recommended redoing manually to verify each test is valid.  Next step is to go deeper with more advanced tools.  Suggestions forthcoming from Jason once he has more time to analyze the data.   One issue is that if we need UDP testing it needs to be enabled by sites.   Recommendation is to setup UDP testing ONLY for Tier-2 performance nodes.  This means Tier-2 sites need to enable UDP testing from all the other Tier-2 performance nodes.  **ACTION ITEM**: Jason will send info on this configuration option.  **ACTION ITEM**: Shawn will provide the list of relevant IPs that Tier-2 sites need.   
           c) Status of issues identified by perfSONAR at UTA (Mark): No update but current monitoring shows the problem seems to still be there.  **ACTION ITEM**: Need to schedule some detailed exploration to try to isolate the problem.
           d) New issues, interesting results, problem reports?    Karthik reported on latency node missing "admin" info.   Philippe asked about latency testing (1-way)...since the DB is so big asking for the page frequently times out.  Plots also time out.   Known issue.  Jason: optimizing is underway but not for this release.  Issue at MWT2_UC with storage partition filling up.   Aaron will check the verbosity settings on the logs.   Also will clean out the /var/logs area.  Next release may help with this type of issue. 
      
      Ongoing item about perfSONAR.   Need an updated recommendation for "current" perfSONAR hardware for sites buying now (Tier-3s for example).    Would like to have a well defined Dell box (given ATLAS pricing) that sites can just order (fixed configuration).   **ACTION ITEM**: Shawn will try to customize a possibility and share with Jason.   Shawn/Jason can work with Dell on tuning the setup.
      
      
      2) Transaction testing for USATLAS.  Summary of status? (Hiro):   Hiro made 1000 files of 1MB each.  GridFTP2 bug is being worked on (Michael talked to Patrick about prioritizing resolving this). For now we could try testing to GridFTP1/dCache and/or BestMan sites.  **ACTION ITEM**: Hiro will try test to a site soon (AGLT2?)  **ACTION ITEM**: Longer term Hiro will setup histogram of "times" where 0.0 time is when the first of the 1000 files started transferring on the FTS channel.  Histogram for each "test" will have 1 entry per file (1000 files sent) showing the time that file completed transferring.   Have to get some experience with this new test to see how best to use it.  
      
      3) Site reports - (Round table...open for reporting, problems and questions)
           Illinois - Dave reported perfSONAR operating well.  Had sent results last time via email that look consistent with AGLT2's results.
      	MWT2_UC - Aaron reported on network failures induced on the Cisco 6509 by multiple clients hitting one server during testing/benchmarking.  Good performance from the new Dell nodes (part of the reason the Cisco switch was swamped!): ~900 MB/sec from a Perc6/E with 1 MD1000 shelf (redundantly cabled).  Aaron will provide a link discussing the setup and test results.
      	AGLT2 - Bottleneck in NFS server used to host OSG home areas (especially usatlas1).   Same server hosts ATLAS software installs.   Looked at Lustre but not suitable to fix this problem.  Instead using Lustre to migrate away from NFS servers (Tier-3 storage).  Will be testing SSD(s) to replace home area storage for OSG.   Also exploring migrating ATLAS software into AFS.  SSD change will require site downtime to implement. 
      
      4) AOB - None
      
      Next meeting is in two weeks.  Look for email before the meeting.   Send along any agenda items you would like to see added.
      
      Corrections or additions can be sent to the email list.  Thanks,
      
      Shawn
    • Focus is on getting the new perfsonar release deployed.
    • Working on getting a new Dell perfsonar hardware platform defined.
    • Will spend time during meetings to track down perfsonar issues.
  • this week:
    • Most recent perfsonar is running on some sites.
    • Working on new hardware platform specification from Dell
    • Transaction testing - fits well within Hiro's infrastructure (a sketch of the planned completion-time histogram follows this list)
    • Using collected perfsonar data to trigger alerts or warnings for well-defined problems
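    • A hedged sketch of the completion-time histogram planned for the transaction test (1000 x 1 MB files): one entry per file, binned by when it finished relative to when the first file started on the FTS channel. The input format (one "start_epoch end_epoch" pair per line) is an assumption; real timestamps would be harvested from the FTS / DQ2 logs.

        # Hedged sketch of the transaction-test histogram: per-file completion
        # times relative to the first file's start.  Input format is assumed.
        import sys

        BIN = 10   # histogram bin width in seconds

        def completion_histogram(lines, bin_seconds=BIN):
            records = []
            for line in lines:
                parts = line.split()
                if len(parts) >= 2:
                    records.append((float(parts[0]), float(parts[1])))
            if not records:
                return {}
            t0 = min([start for start, end in records])   # first file's start time
            hist = {}
            for start, end in records:
                b = int((end - t0) / bin_seconds)          # completion time, binned
                hist[b] = hist.get(b, 0) + 1
            return hist

        if __name__ == "__main__":
            for b, n in sorted(completion_histogram(sys.stdin.readlines()).items()):
                print "%6d-%6d s : %4d files" % (b * BIN, (b + 1) * BIN, n)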

Site news and issues (all sites)

  • T1:
    • last week(s): had a good weekend - storage worked perfectly. reprocessing campaign went very well - completed our shares ahead of others. evaluating DDN storage system, FC to four front end servers, 1200 disk system, 4 GB/s writes, less for reads, all using dcache, 2PB useable disk behind the four servers; would make 6PB in total for the storage system; wn purchase underway, 2K cores; putting Pedro's lsm into production. Pedro's lsm uses gsidcap rather than srm to put data into the SE. Should we consider Pedro's lsm at the other dcache sites? Note he has added additional failure monitoring; will ask Pedro for a presentation at this meeting;
    • this week: DDN array evaluation - converted the front end to Linux and XFS. Expect the evaluation to last another two weeks. Issue with the (battery-based) UPS system: batteries exhibiting thermal runaway; the UPS was switched into bypass mode. Measures are underway to solve the problem.

  • AGLT2:
    • last week: Order for new storage at MSU; Tom: a 50K order; 6 Gb/s SAS; MD1200 shelves (12 x 3.5-inch drives); two servers, 8 shelves, 2 TB drives; nearline configuration (SATA disks with dual-port SAS frontend; Seagate). 27% more per usable TB in this configuration. (MD1000s are still the best price $/TB, 15 drives vs. 12.) Dell will be updating the portal with 6-core 5620s (Westmere) on Friday, 24 GB. Does this change any of the pricing for the previous-generation 5500s? Switching Tier 3 storage over to Lustre. Sun NAS running ZFS for VO home dirs gives much better performance.
    • this week: Looking at dcache server headnode load issues.

  • NET2:
    • last week(s): Filesystem problem turned out to be a local networking problem. HU nodes added - working on ramping up jobs. Top priority is acquiring more storage - will be Dell. DQ2 SS moved to BNL. Shawn helped tune up the perfsonar machines. Moving data around - ATLASDATADISK seems too large. Also want to start using pcache. Built a new NFS filesystem to improve performance. Installed pcache at HU - big benefit. Addressed issues with Condor-G from Panda. Ramped HU all the way up; a major milestone in that all systems are running at capacity. Gatekeepers are holding up - even with 500 MB/s incoming from DDM plus interactive users. The space situation is the top priority. About to do a storage upgrade - purchasing 360 TB raw per rack. Improving the network between HU and BNL (there was a 1G limit).
    • this week: Disk space is tight at NET2. The order for the first of three new racks of storage is going out this week - will add a PB of raw storage. The network issue turned out to be a firewall issue. John: still working on ramping HU back up to full capacity.

  • MWT2:
    • last week(s): 25 TB of dark data has appeared recently - most of it in proddisk. These are datasets known to dq2. Armen: this is happening in other clouds as well. However, is this a US-only issue (same space token)? Charles will follow up with the list.
    • this week: The Cisco has been running stably ever since going to gridftp2.

  • SWT2 (UTA):
    • last week: Problems with analysis transformations that run event generation; came down to the python version used, fixed by re-installing the 32-bit wn-client. NFS issue; new xrootd service. DDM transfers were failing - tracked down to an MD1000 being rebuilt. 200 TB and 52 8-core worker nodes ordered.
    • this week: all is well.

  • SWT2 (OU):
    • last week: Dell on-site installing nodes; by end of next week expect to be online again.
    • this week: installing new equipment.

  • WT2:
    • last week(s): All is well; found an issue where a lot of jobs were reading from a single data server; the security team has finally approved deploying perfsonar. Planning for the next storage purchase; considering a local vendor (based on a Supermicro motherboard). Setting up a proof cluster.
    • this week: All is well. Storage evaluation - a Thumper-like system from a local vendor, Berkeley Communications.

Carryover issues (any updates?)

AOB

  • last week
  • this week


-- RobertGardner - 04 May 2010



Attachments


pdf SM-LSM-001.pdf (44.0K) | PedroSalgado, 04 May 2010 - 15:18 | SM-LSM-001 procedure.
pdf SM-LSM-002.pdf (43.3K) | PedroSalgado, 04 May 2010 - 15:18 | SM-LSM-002 procedure.
pdf SM-LSM-003.pdf (49.3K) | PedroSalgado, 04 May 2010 - 15:18 | SM-LSM-003 procedure.
pdf SM-LSM-004.pdf (42.8K) | PedroSalgado, 04 May 2010 - 15:18 | SM-LSM-004 procedure.
pdf SM-LSM-005.pdf (44.3K) | PedroSalgado, 04 May 2010 - 15:19 | SM-LSM-005 procedure.
pdf SM-LSM-006.pdf (49.4K) | PedroSalgado, 04 May 2010 - 15:19 | SM-LSM-006 procedure.
pdf lsm.admin.pdf (56.6K) | PedroSalgado, 04 May 2010 - 15:25 | LSM administration guide.
pdf lsm.index.pdf (41.4K) | PedroSalgado, 04 May 2010 - 15:25 | LSM main page.
pdf lsm.user.pdf (68.1K) | PedroSalgado, 04 May 2010 - 15:25 | LSM user guide.
pdf 20100505_Local_Site_Mover.pdf (1288.3K) | PedroSalgado, 05 May 2010 - 09:22 | BNL local site mover implementation.
 