


Minutes of the Facilities Integration Program meeting, April 13, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg


  • Meeting attendees: Rob, Aaron, Jason, Nate, John DeStefano, Hiro, Patrick, Charles, Karthik, Saul, Shawn, Xin, Tom, Armen, Kaushik, Mark, Bob, Wei, Michael, Taeksu, Wensheng, Joe Izen
  • Apologies: Akhtar Mahmood <amahmood@bellarmine.edu>
  • Welcome Hampton University, Taeksu Shin

Integration program update (Rob, Michael)

  • IntegrationPhase16 final updates due
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s)
      • Quarterly reports due
      • WLCG accounting status. Karthik - there is consensus to report HS06 per job slot, to account for hyperthreading; the HS06 share per slot is lower for HT slots. Sites publish cpus_per_node and cores_per_node - in use. It would be preferable if OSG had a lookup table for this information. This will be an average across all the subclusters for the site. Not sure if GIP has the lookup table, or the number of processor types. Bob has a link with all the measurements we know about. What about the HEPiX benchmark page? And what about contributed CPUs - opportunistic versus dedicated? Saul can compare with the egg numbers. Karthik has given feedback to Brian. We need to make sure there is convergence on this topic OSG-wide. Michael will bring this up with the ET.
      • There is a substantial amount of activity related to Tier 3 packaging; OSG has been working in this area for some time. https://twiki.grid.iu.edu/bin/view/SoftwareTeam/XrootdRPMPhase1Reqs - the focus is first on Tier 3. There is a similar effort at CERN, and there will be OSG collaboration. We need to sign off on the requirements as soon as possible.
      • WLCG storage shortfall - need to get the missing 1 PB installed by July.
    • this week
      • Final capacity reporting in the facilities spreadsheet, see CapacitySummary
      • Final site certification check off: SiteCertificationP16
      • Tier2 plans for meeting pledged storage capacities (see http://bourricot.cern.ch/dq2/accounting/federation_reports/USASITES/ ; a pledge-gap sketch follows this list):
        • June/July at the latest for meeting the pledge - party.
        • AGLT2: NA
        • MWT2: working on LOCALGROUPDISK reduction; will procure 156 TB post server move.
        • SWT2: at ~1200 TB presently. Another 200 TB are un-powered, to be brought online at CPB. The power situation in the lab needs to be fixed; this will take time. Should have 1.4 PB of the 1.6 PB fairly quickly. Will see about the remaining 200 TB after this.
        • NET2: Working on a new purchase - two racks of MD1000s; 576 TB usable added, bringing the total to 930 TB. Have plenty of space/power.
        • WT2: currently have 1.75 PB in storage, though not all of it in a space token. Order for 24 MD1000s sent out two weeks ago.
      • Special integration program topics:
        • Python >= 2.5, LFC bindings, wlcg-client, wlcg-client-lite (for Tier3 support) (Charles, below)
        • dq2-client 0.1.35 to be installed on all US sites, ATLAS-worker-node-client update (Xin, cf below)
        • WLCG accounting status update (Karthik, cf below)
        • Xrootd 3.0.3 rpms available from VDT for testing
        • HCC opportunistic use coming next week. Adam Caprez and Derek Weitzel will join the meeting
          • Bob notes that glideins can be stressful on the gatekeeper - they have separated their gatekeeper. Agreement is to be able to kill these if ATLAS jobs become affected.
        • ATLAS software week last week - a lot of fruitful discussions: directly connected Tier2Ds, group channels for remote Tier1 and US T2s, etc. DDM development - future DQ2. Cf.
        • R&D activities are being launched - federated xrootd, sub-file caching, cloud computing, etc. It would be good to have a summary of these activities during this meeting.
        • 50 ns bunch spacing successful.
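As a rough way to track the pledge gaps quoted above, the deployed and pending capacities can be compared against the pledge. A minimal sketch in Python, using the SWT2 figures from the list above (the 1.6 PB value is read from the minutes as the target, and the table layout is illustrative, not an official accounting):

      # Compare deployed + pending storage against the pledge, per site.
      # Figures are the ones quoted above for SWT2; treat them as illustrative.
      sites = {
          # site: (online_tb, pending_tb, pledge_tb)
          "SWT2": (1200, 200, 1600),
      }

      for site, (online_tb, pending_tb, pledge_tb) in sorted(sites.items()):
          total = online_tb + pending_tb
          gap = pledge_tb - total
          print("%s: %d TB online + %d TB pending = %d TB; gap to pledge: %d TB"
                % (site, online_tb, pending_tb, total, gap))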

Python + LFC bindings, clients (Charles)

  • The new dq2 clients package requires at least Python 2.5; the recommendation is 2.6. The goal is to not distribute Python with the clients, and to make our install look like lxplus - /usr/bin/python26 installable from yum, plus setup files. This will be the platform prerequisite.
  • LFC python bindings problem - mixed 32/64-bit environment. The goal is to make sure wlcg-client has both 32-bit and 64-bit environments, with /lib and /lib64 directories. _lfc.so also pulls in the whole Globus stack. Hopefully the next update of wlcg-client will incorporate these changes. Charles will write this up; a sketch of the 32/64-bit selection follows.
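A minimal Python sketch of the 32/64-bit selection described above, assuming a wlcg-client layout with parallel lib/python and lib64/python directories holding _lfc.so; the WLCG_CLIENT_ROOT variable, directory names and default path are illustrative, not the official packaging:

      import os
      import platform
      import sys

      # Hypothetical wlcg-client install root; adjust to the site's actual path.
      WLCG_CLIENT_ROOT = os.environ.get("WLCG_CLIENT_ROOT", "/opt/wlcg-client")

      def lfc_bindings_dir():
          # Pick the python bindings directory matching the interpreter's word size.
          subdir = "lib64" if platform.architecture()[0] == "64bit" else "lib"
          return os.path.join(WLCG_CLIENT_ROOT, subdir, "python")

      bindings = lfc_bindings_dir()
      if os.path.isdir(bindings):
          sys.path.insert(0, bindings)

      try:
          import lfc  # LFC python bindings; _lfc.so must match the interpreter's word size
          print("LFC bindings loaded from %s" % bindings)
      except ImportError:
          print("LFC bindings not importable from %s" % bindings)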

dq2-client update on sites (Xin)

  • Required by Paul for pilot reporting to consistency service
  • Gets installed from ATLAS-wn.pacman, a wrapper package in $APP
  • dq2-client available but requires Python >=2.5
  • Updated all US ATLAS sites to 0.1.35
  • Still need patch to update pool file catalog. Hiro is making a new patch.
  • In the future, Alessandro will take care of this as well - for PFC.
  • Paul needs the new dq2 client to report to the consistency service
  • Charles - the previous patch had hardcoded values in ToA - need to make sure it makes sense.
  • There is a newer client that may not require the patch, but this comes with 0.1.36 (a minimal version-check sketch follows).
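Since the prerequisite above is Python >= 2.5 with 2.6 recommended, a trivial guard could sit in a site wrapper before the dq2 tools are invoked; a sketch only, with illustrative messages and exit behavior:

      import sys

      # dq2-client 0.1.35 needs Python >= 2.5; 2.6 is the recommended platform.
      REQUIRED = (2, 5)
      RECOMMENDED = (2, 6)

      if sys.version_info[:2] < REQUIRED:
          sys.stderr.write("Python %d.%d found; dq2-client needs >= %d.%d\n"
                           % (sys.version_info[0], sys.version_info[1],
                              REQUIRED[0], REQUIRED[1]))
          sys.exit(1)
      elif sys.version_info[:2] < RECOMMENDED:
          sys.stderr.write("Python %d.%d works, but 2.6 (e.g. /usr/bin/python26) is recommended\n"
                           % sys.version_info[:2])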

Release installation (Xin)

  • Migration to Alessandro's system is all done
  • PFC piece comes later, after the pilot.

WLCG accounting (Karthik)

  • There were active discussions about including HT in the normalization factor (see the per-slot averaging sketch below).
  • Feedback from Burt re: GIP, etc.
  • Brian is the point of contact. Dan has been contacted.
  • Wei has confirmed that SLAC figures are close to those from the March WLCG report
  • Saul's report shows agreement to within 10%, except for cases with foreign-cloud production
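A minimal sketch of the per-slot normalization under discussion, assuming each subcluster reports its node count, HS06 per node, and job slots per node; the numbers below are made up purely for illustration. With HT enabled the slots per node go up, so the HS06 share per slot goes down:

      # Site-average HS06 per job slot across subclusters.
      subclusters = [
          # (label, nodes, hs06_per_node, slots_per_node) -- illustrative numbers only
          ("non-HT nodes", 100, 70.0, 8),    # 8 slots share 70 HS06
          ("HT nodes",      50, 120.0, 16),  # 16 slots share 120 HS06
      ]

      total_hs06 = 0.0
      total_slots = 0
      for label, nodes, hs06_per_node, slots_per_node in subclusters:
          total_hs06 += nodes * hs06_per_node
          total_slots += nodes * slots_per_node

      print("site-average HS06 per job slot: %.2f" % (total_hs06 / total_slots))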

CVMFS (John DeStefano, sites)

last week:
  • Setting up a replica server as a testbed - had some breakthroughs, all is running as expected. It's working - CERN notified. Functional tests passed.
  • Sites currently connected to CERN could connect to BNL instead.
  • Sarah W: Version 2.6.1 working at MWT2_IU - there was a workaround required, writing up feedback. Working on test nodes. Updating puppet module to deploy automatically. 3000 job slots.

this week:

  • Mirror at BNL
  • For Tier2 sites: TestingCVMFS
    • Status:
      Regarding CVMFS testing and configuration to point to the CVMFS testbed at BNL, see below (thanks John).
              Note - Alessandro expects to have the final layout and release repository available on the production 
              CERN IT servers within the next two weeks, so stay tuned for final instructions.   We'll update our US 
              facility-specific instructions (http://www.usatlas.bnl.gov/twiki/bin/view/Admins/TestingCVMFS.html) 
              to point to general, official ATLAS instructions as they become available.  
               - Rob
      Begin forwarded message:
      From: John DeStefano 
      Date: April 7, 2011 6:56:02 AM CDT
      To: Rob Gardner , Doug Benjamin 
      Subject: Connection information for CVMFS test bed at BNL
      Here are our test servers:
      Replica server: gcecore04.racf.bnl.gov
      Cache server: gridopn02.usatlas.bnl.gov
      A few caveats:
      - This is purely a test environment (obviously)
      - The cache is set to accept incoming requests on port 80, but from outside BNL you _may_ need to instead specify an alternate port (35753) in order to bypass BNL campus proxy restrictions
      - If you have problems connecting to the replica server, please let me know.
      - Most importantly, please share your results and levels of success or failure so I know what's working or not?
      So your configs would need something like:
  • Encourage testing at the sites (a quick connectivity-check sketch follows this list)
  • There are test sites at AGLT2 and MWT2.
  • Patrick - may be able to join the testing.
  • OU - Horst will test.
  • The CVMFS cache comes in addition to the local storage required for jobs (20 GB/job). Varies from site to site; AGLT2 is 14 GB/job in scheddb. Kaushik - this impacts which jobs get run.
  • Review scheddb entries for this
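As a quick sanity check before touching any CVMFS configuration, a site can confirm it can reach the BNL testbed hosts from a worker node. A rough Python sketch; the host names and the ports 80/35753 for the cache server come from John's note above, the port for the replica server is assumed, and the check says nothing about whether CVMFS will actually mount:

      import socket

      # Hosts/ports from the BNL CVMFS testbed note above; 35753 is the alternate
      # port for sites blocked by BNL campus proxy restrictions.
      TARGETS = [
          ("gcecore04.racf.bnl.gov", 80),       # replica server (port assumed)
          ("gridopn02.usatlas.bnl.gov", 80),    # cache server, default port
          ("gridopn02.usatlas.bnl.gov", 35753), # cache server, alternate port
      ]

      def can_connect(host, port, timeout=5.0):
          # Return True if a TCP connection to host:port succeeds within timeout.
          try:
              sock = socket.create_connection((host, port), timeout)
              sock.close()
              return True
          except socket.error:
              return False

      for host, port in TARGETS:
          status = "OK" if can_connect(host, port) else "unreachable"
          print("%s:%d %s" % (host, port, status))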

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • Links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

last week(s):

  • AK - working on firewall issues affecting transfer rate. Working with the IT director to resolve the issue.
this week:
  • Note from Doug:
    From: Doug Benjamin 
    Date: April 11, 2011 4:20:07 AM CDT
    To: Michael Ernst , Rob Gardner 
    Cc: Doug Benjamin , Rik Yoshida 
    Subject: I will not be at the facilities meeting this week
    Dear Rob and Michael,
     I am returning from CERN on Wednesday so I will not be able to attend the
    facilities meeting.
      Notes for the meeting: 
    - new Tier 3 online at UC Santa Cruz (will update spreadsheet)
    - waiting to hear from Arizona on the status of their Tier 3 (did they do any additional work 
      after I left, there were a few small issues?)
    - New ATLAS wide Tier 3 document being written by me as an ATLAS com note.
    -  new version of dq2 client tools out but will not be deployed at US Tier 3 sites using
       ATLASLocalRootBase (default for new sites) due to a problem with wlcg-client-lite
      (see below for a message from Asoka DeSilva)
    -  Asoka has  requested that the US setup a Tier 3 testbed that does not use 
      ATLASLocalRootBase. Similar to the way lxplus is used as a test bed for gLite.
      Are the BNL interactive machines the proper place? Do they use wlcg-client or
      wlcg-client-lite to setup dq2?
    -  Andy H is working with Dean Schaumburg at Stony Brook to debug his problems 
      with his xrootd server side inventory
    - Kaushik has agreed to talk at the US ATLAS IB meeting on Panda for Tier 3.
      I will then use his talk to query all of the US ATLAS institutions to see if they 
      have any interest in setting up a Panda queue at their Tier 3.  If no-one is
      interested then we will drop the issue completely and spend the resources
      on other more important items.
    I was wondering if you have this new version available for deployment ?  
    dq2 0.1.36 has been released and for gLite, on lxplus and at Tier3s, it seems to work only with 32-bit python 2.5/2.6 - but that is good enough for me to deploy on Monday.
    US Tier3 sites will not be able to use this new version of DQ2 since the primary middleware is wlcg-client and only the 64-bit version is installed. (I tried with both 32/64 bit versions of python on the 64-bit OS)
    LFC exception [Python bindings missing]
    On this related matter, can someone in the US please setup a test bed that does NOT use ATLASLocalRootBase / manageTier3SW.  Similar to that on lxplus which serves as the testbed for gLite, I'd like to be able to look at the scripts in the US which use wlcg-client-lite to setup dq2  and use that as an example to encapsulate inside ATLASLocalRootBase.  
    Thanks !

Tier 3GS site reports

  • UTD (Joe): proddisk space filling up quickly. Kaushik notes that all jobs are recon, pileup, and merge jobs; need to clean regularly. Joe notes he will need to do this every day. There is a question about whether a production role is required to run the cleanup script.
  • Illinois - to report during MWT2

Operations overview: Production and Analysis (Kaushik)

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • MinutesDataManageMar29
    • Removing legacy tokens and data at BNL
    • Central deletion on-going. There have been timeouts, plan to discuss next week.
    • Localgroupdisk monitor from Hiro
    • Two non-official categories - unallocated, unpowered. Can they be incorporated in SRM?
    • Charles - bulk deletion methods were not being used for most US sites; the change was made yesterday. Goal is 10 Hz at Tier 2s for LFC+SRM deletion. Might be able to see a reduction in the gap by looking at the bourricot monitoring page. Should see backlogs clear as the number of deletions is increased (see the drain-time sketch after this list).
  • this week:
    • MinutesDataManageApr12
    • GROUPDISK issue is resolved (not to be confused with the groupdisk storage area) - these are out of toa
    • bnltape, bnldisk, bnlpanda - all data is on tape; concerning cleanup, central deletion is not working; ~0.5 PB of data there to delete.
      • mc08 - need to contact users
      • user08, user09 - will send users email
    • userdisk cleanup is done
    • localgroupdisk - 50 TB (all in compliance)
    • Storage reporting - unallocated and unpowered categories now available via SRM for dCache (BNL, AGLT2, Illinois). Hiro will provide a monitor. Wei will provide this for xrootd and GPFS.
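The 10 Hz LFC+SRM deletion target mentioned above translates directly into a drain time for any backlog; a trivial Python sketch, where the backlog size is a made-up number only there to show the arithmetic:

      # How long a deletion backlog takes to drain at a given LFC+SRM deletion rate.
      DELETION_RATE_HZ = 10.0    # target rate at the Tier 2s
      BACKLOG_FILES = 500000     # hypothetical backlog size, for illustration only

      seconds = BACKLOG_FILES / DELETION_RATE_HZ
      print("draining %d files at %.0f Hz takes about %.1f hours"
            % (BACKLOG_FILES, DELETION_RATE_HZ, seconds / 3600.0))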

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  3/31: SWT2_CPB - file transfer errors like "failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]."
    One of the xrootd data servers was heavily loaded for a period of several hours, creating the SRM timeouts.  Eventually the load came down, and all of the transfers 
    succeeded on subsequent attempts.  ggus 69238 / RT 19727 closed, eLog 23899.
    2)  3/31 - 4/1: Hiro reported that the SRM service at BNL was unstable, and a shifter opened ggus 69242 around this time.  This issue seemed to be some maliciously 
    constructed queries which created problems for the SRM database.  Performance returned to normal once the queries were out of the system, but there is a concern the 
    problem will recur if similar queries come into the system in the future.  eLog 23888.  As of 4/4 no additional issues like this - ggus 69242 closed.
    3)  4/4 - 4/5: DDM errors at several U.S. cloud sites with "file exists, overwrite is not allowed" errors for Sonar/Functional Test data.  From Hiro: we manually deleted all 
    of these files which were failed to be cleaned by the central services. After the deletion, the transfers were successful.  
    https://savannah.cern.ch/bugs/index.php?80461, eLog 24010.
    4)  4/5: NET2 - file transfer failures with the error " [TRANSFER error during TRANSFER phase: [NO_PROGRESS] No markers indicating progress received for more 
    than 60 seconds]."  ggus 69384 in-progress, eLog 24019.
    5)  4/5: BNL SRM service was restored after changing a PostgreSQL database parameter used by the service.  eLog 24029.
    6)  4/5: AGLT2 - OSG 1.2.19 upgrade completed as of ~12:30 CST.
    Follow-ups from earlier reports:
    (i)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP 
    running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with an Israel CA having problems accessing BNL and MWT2. This is actively being investigated right now. Until this gets completely resolved, users 
    are advised to submit a DaTRI request to transfer datasets to some other sites (LOCALGROUPDISK area) for downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    (ii)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an 
    UNKNOWN state one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    (iii)  3/12: SLACXRD_LOCALGROUPDISK transfer errors with "[NO_SPACE_LEFT] No space found with at least .... bytes of unusedSize]."  
    https://savannah.cern.ch/bugs/index.php?79353 still open, eLog 23037.
    Later the same day: SLACXRD_PERF-JETS transfer failures with "Source file/user checksum mismatch" errors.  https://savannah.cern.ch/bugs/index.php?79361.  Latest 
    comment to the Savannah ticket suggests declaring the files lost to DQ2 if they are corrupted.  eLog 23048.
    Update 3/21: Savannah 79353 closed (free space is available).
    Update 4/4: no further updates to Savannah 79361 - close this one for now.  (Looks like most recent transfer attempts were several weeks ago.)
    (iv)  3/25: UTD-HEP maintenance outage originally scheduled for 3/23 had to be postponed.  eLog 23651, https://savannah.cern.ch/support/index.php?119962
    Update: this maintenance outage now set for 3/30.  See eLog 23794, https://savannah.cern.ch/support/?120085.
    Update 4/1: maintenance completed, test jobs successful, site set back on-line.  eLog 23904.
    Update 4/3 - 4/4: following the restart some memory issues with jobs were observed.  Site believes the problem is understood and resolved.  
    Latest test jobs successful => on-line.  eLog 23977.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  4/6: MWT2_UC - Savannah 120259 was opened due to some "file exists" transfer error.  (Should have created a DDM Savannah, rather than this "site exclusion" 
    one, but that's another story.)  All files eventually transferred successfully.  eLog 24083.
    2)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    3)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  Discussed in: 
    https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    4)  4/8 - 4/11: SLAC maintenance outage (power work).  Completed as of 4/11 early a.m.
    5)  4/9: SWT2_CPB - file transfer errors ("failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]").  Issue was due to high load on a data 
    server, which lasted for 4-5 hours.  All transfers eventually completed.  ggus 69543 / RT 19755 closed, eLog 24144.
    6)  4/10: bulk muon reprocessing campaign, which began on 4/5, was declared complete.  See: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/Muon2011Reprocessing
    7)  4/10: BNL - job failures due to "PandaMover staging error: File is not cached errors at BNL."  Issue is understood (large number of staging requests issued by 
    production jobs requesting HITS files).  See discussion in eLog 24188.
    8)  4/11: IllinoisHEP - job failures in task 296070 due to missing input files.  Dave at Illinois reported that it appears the files were never transferred to the site?  
    ggus 69601 in-progress, eLog 24234, http://savannah.cern.ch/bugs/?80830.
    9)  4/11 - 4/12: BNL admins requested that the BNL_ATLAS_2 queue be set off-line.  It is expected that in the future this queue will be used only for special requests.  
    Initially when the queue was set off-line a few hundred jobs got stuck in the 'activated' state.  They were cleared out by setting the queue state to 'brokeroff' (i.e., existing 
    jobs run, but no new jobs will be brokered there).  https://savannah.cern.ch/support/index.php?120330, eLog 24283.
    10)  4/12: HU_ATLAS_Tier2 - file transfer failures with the error "lsm-get failed: time out after 5400 seconds."  Set off-line temporarily.  Issue was a filesystem 
    problem - now resolved.  Set back on-line.  ggus 69606 closed, https://savannah.cern.ch/support/?120336, eLog 24276.
    11)  4/12: UTD-HEP - job failures with errors like "Mover.py | !!FAILED!!3000!! Get error: Replica with guid 601B99EE-1E42-E011-BA72-001D0967D549 not found 
    at srm://fester.utdallas.edu."  Possibly due to concurrently running a disk clean-up script.  ggus 69641 in-progress, eLog 24284.
    Follow-ups from earlier reports:
    (i)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP 
    running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with an Israel CA having problems accessing BNL and MWT2. This is actively being investigated right now. Until this gets completely resolved, 
    users are advised to submit a DaTRI request to transfer datasets to some other sites (LOCALGROUPDISK area) for downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    (ii)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an 
    UNKNOWN state one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    Update 3/29: release 1.2.19 announced - ready for site installations.
    (iii)  4/5: NET2 - file transfer failures with the error " [TRANSFER error during TRANSFER phase: [NO_PROGRESS] No markers indicating progress received for more 
    than 60 seconds]."  ggus 69384 in-progress, eLog 24019.
    Update 4/8: No recent errors of this type observed - ggus 69384 closed, eLog 24114. 
    • Spike in OSCER failures

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Action item: all T2s to get another load test in. Sites to contact Hiro and monitor the results. An hour-long test. ASAP.
  • this week:
    • Throughput meeting:
      US ATLAS Throughput Meeting – April 12,  2011
      Attending:  Shawn, Jason, Karthik, Dave, Andy, Tom, Aaron, Horst
      Excused: Philippe
      1)      Review past items
      a.       Dell R410 (single node perfSONAR instance) status.   No updates yet.
      b.      Issue between AGLT2 and MWT2_IU?   Something changed March 21? Seems OK since then. Resolved perhaps by routing changes?
      c.       Load tests redone?  Need updates from MWT2_IU, NET2, SWT2_UTA and WT2 – No update since no-one from those sites on the call.  *All sites need to make sure the load-retest is done and documented*.
      2)      perfSONAR status:  The dashboard at BNL is all-green except for a possible throughput issue with WT2. 
      a.       Problems noted yesterday with OU and BNL (Philippe): Update.  Plot attached.  Reboot fixed it.  Hiro’s throughput tests showed poor performance over the weekend as well.   BNL system was strange.  Jason reported reboot/clean didn’t resolve.  Further DB maintenance seems to have fixed it.   Test results were being saved but not “seen”.  Will watch it.
      b.      New issues? WT2 throughput?   Yee restarted perfSONAR node…we will watch to see if it is fixed.  Possible issue between OU and IU (asymmetry).  OU to IU is slow compared to other direction for at least the last month.  Try a set of “divide and conquer” measurements to intermediate perfSONAR points.  *Jason will assist  Karthik/Horst in this testing*.  
      c.       Reconfiguration for nightly maintenance from 1-2 AM EASTERN.  Status at SWT2, WT2, Illinois?  OU needs to do it (Karthik reports it being done during the call).  Illinois was already done 2 weeks ago.  *UTA and SLAC need to be reconfigured*.  
      3)      Throughput monitoring.  Additional throughput issues?
      a.       MCDISK entries removed on Hiro’s throughput tests?     Still present and not yet removed…hopefully soon? *Action item for Hiro*
      b.      Merging perfSONAR throughput with FTS/DDM throughput in graphs?   No report.
      c.       Tom reported on Gratia/RSV perfSONAR dashboard status:  Added throughput between AGLT2_UM and BNL on the current test dashboard as an example.  Probes right now have “hidden” names, and this is inefficient.  Need multi-host probes before filling out the matrix.  Also need DB changes (summary table) to make things more efficient and faster.  Tom is working with the Gratia developers to address this.  Primitive probes could be added when available.  Some minor work on host reporting inside probes needs to happen; Andy/Jason will be working on this.  Notification list needs updating.  *Shawn will send the contact list for perfSONAR instances around for comment*. 
      d.      ‘ktune’ package status:   Updates for  VLAN interfaces/aliases not ready yet…soon.  The package is deployed at AGLT2_UM on all the  dCache storage nodes and on one machine at MWT2_UC.   Aaron mentioned file-system mount tunings may be added.  Will work with Shawn on feeding this type of info into the package.
      e.      UltraLight kernel status.   Being tested.  2.6.38-UL1 being tested at  AGLT2 on a dCache test node.   MWT2 is running 2.6.36-UL5 on all storage and worker nodes.   Running well and performs better than stock SL/CentOS kernel.
      4)      Site reports (around the table)  -- Aaron reported on preparing to use Dell 8024F switch at MWT2.  *Shawn will send configs in use at AGLT2 to Aaron for reference*.
      5)      AOB –none
      We  will plan to have our next meeting in two weeks at the regular time.   Send along corrections or additions to the list.
    • Throughput:
      • MWT2_IU, NET2 - defer until after the I/O upgrade; SWT2_UTA will try to run a test
    • perfSONAR instances - agreed to run their cron maintenance at 1-2 am Eastern.

Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Investigating client-side modifications for xrd equivalent to "libdcap++"
  • Performance tests continuing
  • Will standardize on xrootd rpm release
  • Version 3.0.3 is working fine at the moment.
this week:
  • xrootd-lfc bridge status at the sites
  • global redirector not running (Wei)
  • client tools, etc, should be available. Can Tier 3's join?
  • Need to look at load on the redirector (Wei)
  • No issues with regard to the global namespace - is this working with dq2-get?
  • Does dq2-get -g copy to the right

Site news and issues (all sites)

  • T1:
    • last week(s): Waiting for 150 nodes, which will bring the total to 8500 slots. PNFS upgrade to Chimera - discussing timeline and plan; earlier than previously anticipated - this summer rather than at year's end. Otherwise very stable.
    • this week: Storage consolidation - millions of files to be moved and deleted over ~a month. Hiro and his team are planning the Chimera migration. A new glibc security vulnerability patch is available; did local testing, with Athena as well. Replacing now, not requiring a downtime.

  • AGLT2:
    • last week(s): Working on ktune and kernel from Nate. OSG 1.2.19 is now available, will bring this online. Also brought on two new pool servers. Federated storage report shows AGLT2.
    • this week: Quiet, waiting for jobs. Rolling update and rebuild of all worker nodes to SL 5.5 in progress. Fabric update table for analysis jobs.

  • NET2:
    • last week(s): GPFS2&3 joined, space should be available by the end of the week. Going to multiple gridftp servers and multiple LSM hosts; using cluster NFS to export GPFS volumes to HU; gatekeeper upgrade; getting ready to purchase worker nodes and storage on the BU side. Tier 3 still being worked on. Started collection of historical and current statistics & graphics at http://egg.bu.edu/atlas . Will defer the load test until after the new storage comes online.
    • this week: Major IO upgrade on-going - rebalancing GPFS volumes. Going to multiple gridftp endpoints. Cluster NFS is running. Network BU-HU issues resolved.

  • MWT2:
    • last week(s): Continued preparations for move of MWT2_UC to new server room. CPU replacement at IU completed. LFC updated at UC. Kernels on storage nodes updated, found better performance. NAT testing on Cisco. GUMS relocated to VM. New monitor from Charles, http://www.mwt2.org/sys/userjobs.html.
    • this week: Major downtime pushed back till April 18 for UC server room move. LOCALGROUPDISK cleanup in progress. Site reports:
      • UC (Aaron): the move is our priority. SRM services moved into a VM. Upgrade of glibc. Throughput test to IU pushed to later this week.
      • IU (Sarah): all is well; on vacation this week.
      • UIUC (Dave): all working well - 250 jobs failed for pandamover reasons. LOCALGROUPDISK - 20 TB of AOD files downloaded; there must be an agreement with a group representative.

  • SWT2 (UTA):
    • last week: Mark - analysis jobs failing because jobs couldn't access input files from storage. There was a faulty drive; fixed.
    • this week: Taking down one of the clusters for the OSG upgrade and a security update. New OSG versions of the complete stack. This is the SWT2 (production-only) cluster. Probably Friday.

  • SWT2 (OU):
    • last week: 844 slots - 18 HT nodes. Ready to run a load test - will take this offline with Hiro.
    • this week: All is well. Pursuing segfault issue at OSCER - Ed Moyse is looking at it locally.

  • WT2:
    • last week(s): Set up a small batch queue with outbound network connections to allow the installation jobs to run. Updated Bestman to the latest version - to put the total space in a flat file. Next procurement - MD1000s, looking to buy 24 of them.
    • this week: Scheduled power outage over the weekend; came back okay. Need to bring back the global redirector. PO sent for storage.

Carryover issues (any updates?)


last week this week

-- RobertGardner - 12 Apr 2011
