
MinutesMay26

Introduction

Minutes of the Facilities Integration Program meeting, May 26, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Rob, Charles, Patrick, Karthik, Nate, Aaron, Shawn, Sarah, Rik, Jason, Bob, Saul, Kaushik, Mark, Wei, Armen, Xin, Tom
  • Apologies: Fred, Horst, John DeStefano

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Updates to SiteCertificationP13 for fabric upgrade plans and Squid updates - due
      • WLCG data: WLCGTier2May2010
      • WLCG storage workshop: general registration for the Jamboree on the Evolution of WLCG Data & Storage Management is now open. The event page is at http://indico.cern.ch/conferenceDisplay.py?confId=92416. Note the registration and payment details (EUR 120) and the event location (central Amsterdam). (Announcement from Jamie.)
      • Last week's agency review by NSF/DOE - computing somewhat shorter than usual
        • A number of questions raised (e.g., why is OSG important?) - these discussions should continue
        • Tier 2 performance and contributions - importance to the physics program
        • Full copy of ESD and AODs at the Tier 1
        • A lot of attention to M&O (maintenance and operations) and detector upgrades; "computing is working"
      • New era in operations - lots of data subscriptions: underscores the need to reach the storage pledges at the Tier 2s (1100 TB). Important for distributing the analysis load across the facility
      • Next year - even more storage, and the ramp-up is steep at the Tier 2s: 1600 TB next year, 20 PB total at the Tier 2s in '12
    • this week
      • Meetings: WLCG Storage meeting (Amsterdam): June 16-18; discussion on goals, see Kors' email.
      • OSG Storage Forum in September 21-22; location U of Chicago
      • ATLAS Tier 3 meeting at Argonne June 8,9
      • LHC - intervention next week
      • Space crunch - already

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

References
  • Links to the ATLAS T3 working group TWikis are here
  • Draft users' guide to T3g is here

last week(s):

  • Tier 3 meeting coming up June 8, 9, http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=94434 at ANL. Primarily for Tier3g sites.
  • Xrootd Ganglia monitoring from Patrick (in development)
  • Continuing with the program - ready with all services needed for a Tier 3g, so people can start building and doing analysis
  • Documentation
  • Final re-build of test cluster
  • June 8 - Phase 1 Tier 3 ready
    • manageTier3SW - ROOT, pacman, wlcg-client-lite, xrootd, dq2, etc. (a component presence-check sketch follows this list)
    • ATLAS code and conditions - CVMFS
    • Condor plus ARCOND wrapper, and Tier 3 Panda
    • Xrootd for data management - plus additional tools
    • dq2-client, Bestman SRM, dq2-FTS (Hiro)
      • waiting on a new version of dq2-client; estimate: next week
      • plugin will allow directory creation; can also use uberftp
      • may go beyond June 8
    • expect funding to arrive at grant offices mid-July
    • Panda Tier 3 - Torre needs to clean up installation procedure; very soon
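  • As a concrete illustration of the component list above (see the manageTier3SW item), a minimal presence-check sketch; it is not part of the official Tier 3 tooling, and the CVMFS mount point and command names are assumptions to adjust to the local layout:
      # Rough presence check for Tier 3g components (illustrative sketch only, Python 3).
      import os
      import shutil

      CVMFS_ATLAS = "/cvmfs/atlas.cern.ch"                # assumed CVMFS mount for ATLAS software
      COMMANDS = ["root", "condor_q", "xrdcp", "dq2-ls"]  # ROOT, Condor, xrootd client, dq2-client

      def check():
          ok = True
          if os.path.isdir(CVMFS_ATLAS):
              print("CVMFS ATLAS repository visible at %s" % CVMFS_ATLAS)
          else:
              print("WARNING: CVMFS ATLAS repository not found at %s" % CVMFS_ATLAS)
              ok = False
          for cmd in COMMANDS:
              path = shutil.which(cmd)
              if path:
                  print("found %-10s -> %s" % (cmd, path))
              else:
                  print("WARNING: %s not found in PATH" % cmd)
                  ok = False
          return ok

      if __name__ == "__main__":
          raise SystemExit(0 if check() else 1)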
this week:
  • No news - see last week for Phase 1 deployment
  • Rebuilding model-Tier 3 at ANL
  • All working groups will be wrapping up in the next couple of weeks

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Charles: PRODDISK cleanup discussion ongoing, to move under central control; note - we're using PRODDISK-cleanse because of Pandamover datasets.
    • Waiting on final word from Kaushik
    • Shawn: has a situation with LFC entries missing as a result of mis-identification as ghosts. Re-subscribe? About 10 TB of data. Charles will follow up.
    • USERDISK - needs to be cleaned up - Hiro will do this for all the sites
  • this week:
    • MinutesDataManageMay25
    • Storage crunch
    • Deleting old datasets
    • Discussions w/ Stephane, Simone, Alexei - about an automated system; discussion of threshold to trigger deletions
    • Will be doing more aggressive deletion
    • Quotas? Far away
    • Wensheng surveying USERDISK usage
    • Data popularity is available in DQ2; ADC looking into this.
    • 72-80 TB of DATADISK can be deleted (removing the 900 GeV data). The concern is that it will take weeks to delete; Kaushik believes not - more like 4-5 days.
    • Kaushik believes the reprocessing caused the explosion and that it was a one-off; going forward the rate will be reduced.
    • Charles - the disk-latency x deletion-time product is important
    • Hiro notes central deletion proceeds at about 1 Hz (roughly one file per second); a back-of-the-envelope cleanup estimate is sketched after this list.
    • Charles - could do this much more quickly at the site-level.
    • Michael: should we reduce # replicas?
    • Charles notes the LFC access time field is not being used presently - could be used to identify cold data
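    • Back-of-the-envelope check on the 4-5 day estimate above, assuming the ~1 Hz central deletion rate quoted by Hiro and an average file size of roughly 200 MB (the file size is an illustrative assumption, not a measured number):
        # Estimate of DATADISK cleanup time (illustrative sketch).
        DATA_TO_DELETE_TB = 75        # midpoint of the 72-80 TB quoted above
        AVG_FILE_SIZE_MB = 200        # assumed average file size, not a measured number
        DELETION_RATE_HZ = 1.0        # ~1 file deleted per second (per Hiro)

        n_files = DATA_TO_DELETE_TB * 1e6 / AVG_FILE_SIZE_MB   # TB -> MB, then number of files
        days = n_files / DELETION_RATE_HZ / 86400.0
        print("~%.0f files, ~%.1f days at %.0f Hz" % (n_files, days, DELETION_RATE_HZ))
        # -> roughly 375,000 files and ~4.3 days, consistent with the 4-5 day estimate.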

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=95201
    
    1)  May 2010 reprocessing exercise has begun.  Useful links:
    https://twiki.cern.ch/twiki/bin/view/Atlas/ADCDataReproSpring2010
    https://twiki.cern.ch/twiki/bin/view/Atlas/ADCoS#Reprocessing_jobs
    2)   5/13: Transfer errors at MWT2_UC:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://uct2-dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries]
    From Sarah:
    UC SRM door crashed due to 'Too many open files'.  It seems the ulimit nofile setting in the dCache start-up file is not being applied.  I set it manually and restarted the door, and transfers are starting to succeed.
    We are looking into a long-term solution, as well as monitoring SRM door health.  eLog 12614.  (A generic file-descriptor check sketch appears after this report.)
    3)  5/13-14: File transfers failing from ILLINOISHEP_PRODDISK to BNL-OSG2_MCDISK with source errors:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgx1.hep.uiuc.edu:8443/srm/managerv2]. Givin' up after 3 tries].
    From Dave at Illinois:
    The srm node was somehow confused on srmping. Restarting dcache on the node seems to have fixed the problem.
    ggus 58225 (closed), eLog 12638.
    4)  5/14: Transfer errors at AGLT2_PRODDISK:
    FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries].
    From Shawn:
    Postgresql connection limit on our dCache/SRM headnode was reached at 3 AM. I increased it to 1000 and restarted Postgresql and dCache services at 7:40 AM. I see SRM transfers are again working.  ggus 58229 (closed), eLog 12654.
    5)   5/14: BNL, data transfer errors to ATLASDATADISK, ATLASDATATAPE and ATLASMCDISK:
    FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries].  From Michael:
    The problem is solved, again caused by the namespace management component of dCache.  eLog 12668.
    6)  5/14: BNL - Network port scan testing postponed until further notice.
    7)  5/17:  AGLT2 - USERDISK token filled up.  5 TB additional space added.  Test jobs successful at ANALY_AGLT2 - site back to 'online'.  RT 16346 (closed), eLog 12772.
    8)  5/17-19: NET2 - jobs failing with NOLOCALSPACE and FILECOPYERROR messages.  From Saul:
    It looks like this is two independent problems:
    1) some BU_ATLAS_Tier2o jobs failed from running out of local scratch space
    2) some ANALY_NET2 jobs failed with "Put" errors copying to our SE (gpfs1)
    ....................
    1) is going to require more investigation, but there is no immediate problem - all the workers have scratch space and no more jobs are failing.
    2) is also going to require more investigation. The errors coincide with a time yesterday when one of our gpfs1 volumes was down to 1.5TB. We were, however, watching it carefully at the time and didn't let it go below about that so I
    don't actually understand how the errors happened. We'll have to get this from our local site mover logs. As with 1), there doesn't seem to be an immediate problem as gpfs1 has 6TB free at the moment and the errors ended approximately
    when we freed more space.  Issues resolved - ggus 58257 (closed), eLog 12749.
    9)  5/18: BNL - networking issue resolved.  From Michael:
    The network problem caused by a routing loop between core routers/switches was solved around 18:50 UTC. All services
    are back to normal.  eLog 12847.
    10)  5/19: BNL - out of space errors for MCDISK.  Additional space was added -- issue resolved.  ggus 58326 (closed), eLog 12862/63.
    
    Follow-ups from earlier reports:
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...
    are there follow-on attempts or is this site-db configured?  
    Paul added to the thread in case there is an issue on the pilot side.  ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    Update, 5/4: Additional information posted in the ggus ticket.  Also, see comments from Paul.
    Update, 5/10: Additional information posted in the ggus ticket.
    Update, 5/17: Additional information posted in the ggus ticket.
    (ii)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission 
    (and everything else that might be accessing them) 
    off starting this afternoon, until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
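    A generic file-descriptor check, relevant to item 2 above (the SRM door crash from 'Too many open files'): one quick way to confirm whether the intended nofile limit actually applies to a running dCache door is to read its limits from /proc. The sketch below is generic Linux inspection, not a dCache tool; the process-matching pattern is an assumption.
        # Report the open-file limit and current fd usage of running processes
        # whose command line matches a pattern (e.g. the dCache SRM door).
        # Generic /proc inspection only -- not part of dCache itself.
        import glob
        import os

        PATTERN = "srm"   # assumed substring identifying the SRM door process

        for proc in glob.glob("/proc/[0-9]*"):
            try:
                with open(os.path.join(proc, "cmdline")) as f:
                    cmdline = f.read().replace("\0", " ")
                if PATTERN not in cmdline.lower():
                    continue
                with open(os.path.join(proc, "limits")) as f:
                    limits = [line for line in f if line.startswith("Max open files")]
                n_open = len(os.listdir(os.path.join(proc, "fd")))
                print("pid %s: %d open fds; %s" % (os.path.basename(proc), n_open,
                                                   limits[0].strip() if limits else "limit unknown"))
            except (IOError, OSError):
                continue   # process went away or permission denied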
    
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=95901
    
    1)  5/19: AGLT2, from Bob:
    An AT&T fiber was cut around 6:30pm. This caused a partial disruption between the MSU and UM machines of AGLT2, with the ultimate effect that MSU workers have no AFS access at all (the OSGWN setup is in AFS). I don't know what will happen with jobs actually running at MSU at this time.
    Jobs running at UM will run and complete fine, and the dCache file servers at MSU are fine. I have therefore initiated a peaceful condor idle of all MSU worker nodes. This means we will run at reduced capacity until the fiber problem can be resolved.
    2)  5/19-20: New pilot version from Paul (44a), and minor patch (44b).  Details are here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-update-May19-20-44a_b.html
    3)  5/20: second half of the May 2010 reprocessing exercise has begun.  Status:
    http://atladcops.cern.ch:8000/j_info/
    4)  5/21: From Hiro:
    There was a change in the LFC alias within the BNL CE hosts to solve a network issue that caused some jobs to fail under heavy traffic.  Although it worked in testing, this change made clients/jobs fail with authentication errors.
    As a result, the alias was changed back to the original setting.  In the meantime, you will notice that some jobs failed with authentication errors.
    5)  5/21: SWT2_CPB - file transfer failures like:
    FTS Retries [1] Reason [TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error500 500-Command failed. : globus_gridftp_server_posix.c:globus_l_gfs_posix_recv:914:500-open() fail500 End.]  From Patrick:
    The data server that was trying to store the transfers was misconfigured. The server was reconfigured and xrootd was restarted.  ggus RT 58418, RT 17002 (both closed), eLog 13137.
    6)  5/21-22: SWT2_CPB - A/C water leak in the machine room forced a power shutdown.  Once power was restored and the services brought back on-line test jobs succeeded - the site is now back up.   eLog 13002.
    7)  5/22: AGLT2 - low efficiency for file transfers.  Issue was heavy load on an SRM server, now resolved.  eLog 12997.
    8)  5/23: MWT2_UC - file transfer failures:
    FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [LOCALITY]
    Source file .... locality is UNAVAILABLE].  From Sarah:
    Two of the pools at MWT2 went offline this morning due to memory issues.  They're back online now, and these transfers should start to succeed. 
    9)  5/23: MWT2_DATADISK low on free space:
    FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase:
    [NO_SPACE_LEFT] at Sat May 22 19:17:57 CDT 2010 state Failed : space with id=2310037 does not have enough space]
    As of 5/25, 47 TB of free space is now available.  Savannah 67809 (closed), eLog 13009.
    10)  5/23-26: WISC_DATADISK - file transfer errors.  From the site admin:
    We have a power problem; all data servers are currently unavailable.  I have already submitted an OIM unscheduled downtime.  Sorry for the problem.  We will make the service available as soon as possible once the power problem is solved.  Later:
    The problem was solved.  We will have a scheduled downtime tomorrow evening in the university to upgrade the power.
    On 5/25: After the power upgrade in the whole CS room, some of our servers failed to get an IP address.  We are now working on it.  ggus 58444 (in progress), eLog 13110,13.
    11)  5/24: From John at NET2:
    Since there's been so little demand for production grid jobs over the past few days (today we ramped down to zero) I'm going to set HU_ATLAS_Tier2 to brokeroff so that we can perform some i/o tests without grid jobs interfering or getting harmed.  
    This should only be for about a day or so.
    12)  5/25: From Wei at SLAC, regarding problems with the SE:
    A data server went down at midnight. I got it back.  I think we also had an intermittent DNS issue due to a partial power outage today.
    13)  5/25: From Bob at AGLT2:
    I have stopped auto-pilots to AGLT2 and to ANALY_AGLT2 while we update the OSGWN version at our site.  I will let the remaining jobs here (63 at last count) complete to a great extent, update the distribution, then re-enable the pilots.
    Later:
    OSGWN version upgraded to 1.2.9 and tested.  Restarted queues.  Jobs are running cleanly.
    
    Follow-ups from earlier reports:
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...
    are there follow-on attempts or is this site-db configured?  
    Paul added to the thread in case there is an issue on the pilot side.  ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    Update, 5/4: Additional information posted in the ggus ticket.  Also, see comments from Paul.
    Update, 5/10: Additional information posted in the ggus ticket.
    Update, 5/17: Additional information posted in the ggus ticket.
    Update, 5/21: Additional information posted in the ggus ticket.
    (ii)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission 
    (and everything else that might be accessing them) off starting this afternoon, 
    until we're ready to start back up, which will be at least a week?  I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    Update, week of 5/19-26: Horst thinks the upgrade work is almost done, so the site may be back on-line soon.
    

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • NetworkMonitoring
    • Minutes from meeting:
      	USATLAS Throughput Meeting Notes --- May 18, 2010
          ===================================================
      Attending:  Shawn, Dave, Jason, Sarah, Andy, Saul, Philippe, Karthik, Hiro, Nate, Charles
      Excused: Horst
      
      Review of existing action items (see meeting agenda below).   Topics will be covered mostly within the agenda.  Note updated spreadsheet attached with meeting announcement.   Sites should verify they are properly entered AND that they have fully setup scheduled tests as appropriate (see notes in spreadsheet).
      
      1) perfSONAR
       a) Network status and new issues?   BNL issue noted:  campus-wide outage being worked on. 
       b) Review issues with bandwidth asymmetry at OU (Karthik):  Status update: results inconclusive from NPAD/NDT tests.  Jason and Karthik will continue to look at this.
       c) Review UTA low bandwidth test results (Mark):  Status update:  FTS channel fully loaded causing low perfSONAR results. (**Resolved**)  
       d) Marking "milestone" complete ??   perfSONAR deployed at all the Tier-2s and the Tier-1? **Yes...closing this milestone**.
       e) BU and Harvard will decide if perfSONAR instance at Harvard (testing to BU) is useful.  
       f) Possible new hardware option from Dell for perfSONAR:  Shawn/Jason
      
      2) Transaction testing update - Hiro blocked right now by network issue at BNL.   Will produce a number of plots related to transaction testing results.   Submit times and finishing times plots.  Will eventually be available from FTS and/or throughput.  URL for "stack" plots for total bandwidth to Tier-2s will be sent by Hiro later (once it is ready for heavier use).
      
      3) Alerting on perfSONAR:  Sarah did some initial poking.  Possibilities to verify testing configurations and site status.  Some work to do to make the user interface easier and more intuitive.  Jason confirmed that verifying the set of tests implemented at a site via the API should be doable.  Harder to get version info.  (A minimal host-reachability-check sketch follows these notes.)
      Sarah, Jason and Philippe discussed a perfSONAR "enhancement" which would allow a meaningful site label to be created which would group all test results involving that site under the label.   Then we wouldn't care that some measurements were done using the IP vs some done using the DNS name, etc.    
      
      4) Site reports (open)
      	Philippe brought up an issue: an lhcperfmon problem displaying AGLT2_UM and AGLT2_MSU.  Jason requested an email with details to be shared with the developers.  Sarah reported that IU doesn't have a problem with the same graph.  Hiro reported that BNL's result is also blank to AGLT2 (both sites) but has data to IU.  Maybe a Google API issue?  Will be looked at (**ACTION ITEM**)
      	Saul will check the installed perfSONAR instance at BU (currently 3.1.2? Should be 3.1.3)
      
      Next meeting in 2 weeks (June 1, 2010).   Please send along corrections or additions to the list.   Thanks,
      Shawn 
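    • Regarding the alerting item (3) in the notes above: as a first step that does not depend on the perfSONAR API, a simple cron-able reachability check of the site perfSONAR hosts can flag dead instances. A minimal sketch; the host names and port are hypothetical placeholders:
        # Minimal reachability check for perfSONAR hosts (illustrative only;
        # it does not use the perfSONAR API -- it just verifies a TCP port answers).
        import socket

        HOSTS = ["ps-latency.example-t2.org", "ps-bandwidth.example-t2.org"]  # hypothetical hosts
        PORT = 80            # assumed port of the local perfSONAR web service
        TIMEOUT_S = 5.0

        def reachable(host, port, timeout=TIMEOUT_S):
            try:
                sock = socket.create_connection((host, port), timeout)
                sock.close()
                return True
            except (socket.error, socket.timeout):
                return False

        for host in HOSTS:
            status = "OK" if reachable(host, PORT) else "UNREACHABLE -- raise an alert"
            print("%s:%d %s" % (host, PORT, status))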
  • this week:

Site news and issues (all sites)

  • T1:
    • last week(s): Increasing capacity of CPU and disk. ETA 1 June for 2000 cores from Dell. Evaluation of DDN continues. Continued problems with servers - considering going back to Sun servers. Network problem yesterday - router reboot created a routing loop (spanning tree path calculations), required manual intervention.
    • this week: Planning to upgrade Condor clients on worker nodes next week. Force 10 code update. Continue testing DDN.

  • AGLT2:
    • last week: Converting from many pool groups to one large pool group, so as to adjust space tokens on the fly. SSD on the postgres database working really well. CCC running fast, overloading the local disk and causing SRM timeouts; need to migrate to another server. Bringing another 100 TB into dCache in a few days. Will be up to 1.1 PB soon. There is also new storage at MSU - 6 shelves.
    • this week: Watching space tokens. 100 TB free. New storage nodes at MSU - 8 MD1200. Will be purchasing head nodes. Upgraded OSG wn-client to 1.2.9; upgraded wlcg-client. Finding lcg-cp hangs lasting up to 8 hours (a timeout-wrapper sketch follows this item).
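    • On the lcg-cp hangs: one pragmatic mitigation, independent of any lcg-cp internal timeout options, is to run the copy under an external wall-clock timeout so a hung transfer cannot block for hours. A minimal sketch (Python 3); the command line and URLs are illustrative only:
        # Run a transfer command with a hard wall-clock timeout so a hung copy
        # cannot block for hours. The command line below is illustrative only.
        import subprocess

        def copy_with_timeout(cmd, timeout_s=3600):
            """Return (returncode, timed_out); the child is killed on timeout."""
            try:
                proc = subprocess.run(cmd, timeout=timeout_s)
                return proc.returncode, False
            except subprocess.TimeoutExpired:
                return None, True

        if __name__ == "__main__":
            # Hypothetical source/destination URLs -- not real paths.
            cmd = ["lcg-cp", "srm://se.example.org/some/file", "file:///tmp/some_file"]
            rc, timed_out = copy_with_timeout(cmd, timeout_s=1800)
            print("timed out" if timed_out else "exit code %s" % rc)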

  • NET2:
    • last week(s): Running relatively smoothly. Top priority besides ops is storage upgrade. Order for first of 3 racks has gone out. 336 TB raw (IBM servers and Dell MD1000, MD3000 servers). BU-HU network being improved, final stages. Local site mover improvements.
    • this week: Dell storage arriving; focusing on networking.

  • MWT2:
    • last week(s): http://twiki.mwt2.org/bin/view/Main/SiteReportMay19. Pilot was timing out transfers that were still in progress.
    • this week: working on local site mover (lsm) failures - examining performance. Older nodes cannot write to disk as fast as we transfer over the network; disk I/O capacity, not the network, has become the rate-limiting factor. Will be changing the scheduling of jobs to compute nodes. (A simple write-throughput measurement sketch follows this item.)
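    • To quantify the disk I/O limitation above, a crude sequential write-throughput measurement on a candidate worker node can be compared against the per-node network transfer rate. A minimal sketch; the file path and size are placeholders:
        # Crude sequential write-throughput test (illustrative). Compare the result
        # with the per-node network transfer rate to see whether disk I/O is the bottleneck.
        import os
        import time

        PATH = "/tmp/io_test.bin"           # placeholder -- point at the storage area under test
        TOTAL_MB = 1024                     # write 1 GiB
        CHUNK = b"\0" * (4 * 1024 * 1024)   # 4 MiB blocks

        start = time.time()
        with open(PATH, "wb") as f:
            for _ in range(TOTAL_MB // 4):
                f.write(CHUNK)
            f.flush()
            os.fsync(f.fileno())            # make sure the data actually reaches the disk
        elapsed = time.time() - start
        os.remove(PATH)
        print("wrote %d MiB in %.1f s -> %.1f MiB/s" % (TOTAL_MB, elapsed, TOTAL_MB / elapsed))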

  • SWT2 (UTA):
    • last week: Installed 50 R410s in the production-only cluster (+300 cores). 400 TB on the floor for burn-in (operational in 2-3 weeks); will swap out 40 TB of older storage; ordered another 200 TB (may have power issues). This will put us over 1 PB. The Xrootd version has an issue reporting space usage back to SRM when running with CNSd; a workaround is in place; will need to re-synch DQ2 and SRM. Squid upgrade to-do; open question: failover. Network bandwidth meeting tomorrow with the administration.
    • this week: Water leak in the machine room - dumped power to the lab. During the downtime, put in an xrootd update to report space token usage. Will be bringing 200 TB online shortly. Network update - met with the CIO last week; now have an upgrade path. Ordered a dedicated 10G port for the switch in Dallas (~2 months). Bid for fiber to Dallas (~3 months). (NLR is in Dallas, and I2 will be bringing 10G there.) Dallas to Houston is the LEARN network. Preference would be direct connectivity.

  • SWT2 (OU):
    • last week: Troubles with Neterion-Dell 2950 compatibility; the Intel card is more stable. Still not getting full throughput. The PCM cluster manager had many issues - had to work with Dell remotely to fix these. Condor configured. About to start on OSG; shooting for Monday. 170 TB of usable storage.
    • this week: Ready to bring OU back online. Waiting to be added to ToA, then ready for testing.

  • WT2:
    • last week(s): Running low on storage - will be deleting data from SCRATCHDISK and GROUPDISK. Meeting this afternoon regarding the storage purchase for the end of next month. 500 TB usable, leading to 1.1 PB.
    • this week: Ordering the next storage - consulting with Shawn was very helpful. Decided to go with Dell even if lower density: 15 R610s (to save a little space), 45 MD1000s, 48 GB memory.

Carryover issues ( any updates?)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Release installation, validation (Xin)

The issue of validating process, completeness of releases on sites, etc.
  • last meeting
    • Discussing the file structure across all ATLAS sites - will need to migrate the US sites carefully to minimize impact. Waiting for test results.
  • this meeting:
    • BNL available for testing. OU will also be available soon. Will keep bugging Alessandro.

AOB

  • last week
  • this week


-- RobertGardner - 25 May 2010
