
MinutesApr20

Introduction

Minutes of the Facilities Integration Program meeting, April 20, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Adam, Derek, Dave, John, Shawn, Michael, Charles, AJ, Karthik, Eric & Jason, Saul, Justin, Patrick, Sarah, Armen, Tom, Torre, Xin, Rob, Fred, Horst, Wei, Kaushik, Bob, Hiro, Taeksu, Mark
  • Guests: the HCC VO: Derek Weitzel <dweitzel@cse.unl.edu>, Adam Caprez <acaprez@cse.unl.edu>
  • Apologies: Doug, Nate, Aaron

Integration program update (Rob, Michael)

OSG Opportunistic Access

  • The HCC (Holland Computing Center at the University of Nebraska) VO
  • Contacts: Derek Weitzel <dweitzel@cse.unl.edu>, Adam Caprez <acaprez@cse.unl.edu>
  • Website, http://hcc.unl.edu/main/index.php
  • Presentation: HCC-opportunistic-Atlas.pdf
  • Don't use $APP, $DATA, or home directories.
  • Glideins will exit after 16 hours, or after 20 minutes if they are idle. Kaushik notes most of our sites do not do preemption.
  • Load on the gatekeeper caused by glideins? Noted that glidein-cms is 'nice' since only one instance is used; the loads are believed to be small.
    • The real issue is the load caused by sudden preemption.
  • Expanding usage
    • Expand usage at BNL.
    • Kaushik suggests putting a limit on the maximum number of slots, e.g. 10-20% of site capacity (a hedged site-side quota sketch follows this list).
    • Michael: we should try to get some statistics, with the aim of changing this limit dynamically.
    • Suggested a maximum glidein time of 8 hours; all agreed.
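
A hedged sketch of how a site running HTCondor could cap the opportunistic HCC share along the lines discussed above; the group name, quota fraction, and submit-side mapping are illustrative assumptions, not an agreed configuration:

      ## condor_config on the central manager -- illustrative only
      ## Cap an HCC accounting group at ~15% of the pool (the 10-20% range suggested above)
      GROUP_NAMES = group_hcc
      GROUP_QUOTA_DYNAMIC_group_hcc = 0.15
      ## Keep the group from being matched beyond its quota when the pool is idle (policy choice)
      GROUP_AUTOREGROUP_group_hcc = False
      ## HCC glidein jobs would then need to arrive tagged with AccountingGroup = "group_hcc.<user>"

The 8-hour maximum glidein lifetime agreed above would presumably be enforced on the glidein/factory side rather than in this site-side quota.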

LHCONE (Eric Boyd)

  • Meeting @ Internet2 offices, Washington, May 12.
  • http://www.internet2.edu/science/LHCONE.html
  • See email announcement last Friday
  • LHC computing report: moving towards a mesh model.
  • The Lyon meeting last year led to a call for a global distributed exchange.
  • LHCONE will make a distributed version of the MANLAN 'black box', to enable distributed peering.
  • A roadmap and prototype have been laid out.
  • A two-node distributed exchange in the US (Chicago and NY) that would later expand.
  • April 5 meeting in Europe covered changes and operations. An equivalent meeting is needed in North America.
  • The meeting will discuss and evaluate the goals and react to the prototype: is it responsive to the experiments' needs?
  • July timeframe for a prototype.
  • ESnet, CANARIE, and Internet2 are co-sponsoring.
  • A network development and design initiative was announced today at the I2 meeting: IU, Stanford, I2, based on OpenFlow, a nationwide layer-2 service. OSSE - Open Science Scholarship Educational services (?) - would extend the prototype across the country.
  • Sensitive to cost constraints.
  • ION service: auto-provisioning of short-lived circuits; an extension of this would provide 4 x 10 Gbps service. DYNES has all of this plus an overlay of the transfer service; DYNES is a way to connect to LHCONE.
  • Will have some remote video conference capabilities.

Operations overview: Production and Analysis (Kaushik)

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • MinutesDataManageApr12
    • The GROUPDISK issue is resolved (not to be confused with the groupdisk storage area); these are out of the ToA.
    • bnltape, bnldisk, bnlpanda: all data is on tape. Concerning cleanup, central deletion is not working; ~0.5 PB of data there to delete.
      • mc08 - need to contact users
      • user08, user09 - will send users email
    • userdisk cleanup is done
    • localgroupdisk - 50 TB (all sites in compliance or working towards it)
    • Storage reporting: unallocated and unpowered capacity is now available via SRM for dCache (BNL, AGLT2, Illinois). Hiro will provide a monitor. Wei will provide this for xrootd and GPFS.
  • this week:
    • No meeting this week, no urgencies

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_11_11.txt
    
    1)  4/6: MWT2_UC - Savannah 120259 was opened due to some "file exists" transfer error.  (Should have created a DDM Savannah, rather than this "site exclusion" 
    one, but that's another story.)  All files eventually transferred successfully.  eLog 24083.
    2)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    3)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  Discussed in: 
    https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    4)  4/8 - 4/11: SLAC maintenance outage (power work).  Completed as of 4/11 early a.m.
    5)  4/9: SWT2_CPB - file transfer errors ("failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]").  Issue was due to high load on a data 
    server, which lasted for 4-5 hours.  All transfers eventually completed.  ggus 69543 / RT 19755 closed, eLog 24144.
    6)  4/10: bulk muon reprocessing campaign, which began on 4/5, was declared complete.  See: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/Muon2011Reprocessing
    7)  4/10: BNL - job failures due to "PandaMover staging error: File is not cached errors at BNL."  Issue is understood (large number of staging requests issued by 
    production jobs requesting HITS files).  See discussion in eLog 24188.
    8)  4/11: IllinoisHEP - job failures in task 296070 due to missing input files.  Dave at Illinois reported that it appears the files were never transferred to the site?  
    ggus 69601 in-progress, eLog 24234, http://savannah.cern.ch/bugs/?80830.
    9)  4/11 - 4/12: BNL admins requested that the BNL_ATLAS_2 queue be set off-line.  It is expected that in the future this queue will be used only for special requests.  
    Initially when the queue was set off-line a few hundred jobs got stuck in the 'activated' state.  They were cleared out by setting the queue state to 'brokeroff' (i.e., existing 
    jobs run, but no new jobs will be brokered there).  https://savannah.cern.ch/support/index.php?120330, eLog 24283.
    10)  4/12: HU_ATLAS_Tier2 - file transfer failures with the error "lsm-get failed: time out after 5400 seconds."  Set off-line temporarily.  Issue was a filesystem 
    problem - now resolved.  Set back on-line.  ggus 69606 closed, https://savannah.cern.ch/support/?120336, eLog 24276.
    11)  4/12: UTD-HEP - job failures with errors like "Mover.py | !!FAILED!!3000!! Get error: Replica with guid 601B99EE-1E42-E011-BA72-001D0967D549 not found 
    at srm://fester.utdallas.edu."  Possibly due to concurrently running a disk clean-up script.  ggus 69641 in-progress, eLog 24284.
    
    Follow-ups from earlier reports:
    (i)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP 
    running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with Israel CA having problem accessing BNL and MWT2. This is actively investigated right now. Until this get completely resolved, 
    users are suggested to request DaTRI request to transfer datasets to some other sites (LOCAGROUPDISK area) for the downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    (ii)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an 
    UNKNOWN state one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    Update 3.29: release 1.2.19 announced - ready for site installations.
    (iii)  4/5: NET2 - file transfer failures with the error " [TRANSFER error during TRANSFER phase: [NO_PROGRESS] No markers indicating progress received for more 
    than 60 seconds]."  ggus 69384 in-progress, eLog 24019.
    Update 4/8: No recent errors of this type observed - ggus 69384 closed, eLog 24114. 
    
    • Spike in OSCER failures
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=1&confId=135036
    
    1)  4/13: SWT2_CPB - user reported a problem downloading some files from the site.  The error was related to a glitch in updating the CA certificates/CRL's on 
    the cluster (hence errors like "The certificate has expired: Credential with subject: /C=PL/O=GRID/CN=Polish Grid CA has expired").  Problem should be fixed 
    now - waiting for a confirmation from the user.  ggus 69674 / RT 19779.
    2)  4/14: OU_OCHEP_SWT2 - job failures with the error " Unable to verify signature! Server certificate possibly not installed."  Eventually it was determined the 
    issue was with release 16.6.3.  Alessandro re-installed this version, including the various caches, and this appears to have solved the problem.  
    ggus 69690 / RT 19786 closed, eLog 24332.
    3)  4/14: New pilot version from Paul (SULU 47a).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU-47a.html
    4)  4/14: UTD-HEP set off-line at the request of site admin (PRODDISK low on space).  eLog 24359.
    5)  4/14: IllinoisHEP file transfer errors ("[GENERAL_FAILURE] AsyncWait] Duration [0]").  From Dave: A restart of dCache on the pool node last night appears 
    to have fixed the problem. No new transfer problems have been seen since the restart.  ggus 69719 closed, eLog 24371.
    6)  4/19: UTA_SWT2 - maintenance outage to update s/w on the cluster (OSG, Bestman, xrootd, etc.)  Work completed as of early a.m. 4/20.  Test jobs 
    submitted to the site.
    7)  4/19: From Bob at AGLT2 - The UM Ann Arbor campus suffered a nearly complete power outage at 12:39 p.m. today.  Some 100 or so jobs that had been running 
    on worker nodes without UPS protection were lost.  Apparently, global network connectivity to AGLT2 was also impacted, as we have some reports 
    of job submission problems to ANALY_AGLT2.  The outage lasted for 3-5 minutes at the main UM AGLT2 sites.  It is unknown how long the network was down.
    8)  4/20 early a.m.: SWT2_CPB - initially file transfers were failing due to an expired host cert, which has been updated.  A bit later, transfer failures were 
    reported with the error "failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]" and ggus 69875 / RT 19808 were re-opened.  This 
    latter issue was possibly due to a couple of data servers being heavily loaded for several hours.  eLog 24558. 
    9)  4/20: Tadashi updated the panda server so that jobs go to 'waiting' instead of 'failed' when release/cache is missing in a cloud.
    
    Follow-ups from earlier reports:
    (i)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: 
    CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with Israel CA having problem accessing BNL and MWT2. This is actively investigated right now. Until this get completely 
    resolved, users are suggested to request DaTRI request to transfer datasets to some other sites (LOCAGROUPDISK area) for the downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    (ii)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    Update 4/19: some progress has been made on understanding the issue(s) - will close this ticket once it appears everything is working correctly.
    (iii)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  
    Discussed in: https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    (iv)  4/11: IllinoisHEP - job failures in task 296070 due to missing input files.  Dave at Illinois reported that it appears the files were never transferred to the site?  
    ggus 69601 in-progress, eLog 24234, http://savannah.cern.ch/bugs/?80830.
    Update 4/13: no more errors of this type after the initial group of errors.  Possibly a case where panda attempted to run the jobs before the input files had been 
    staged to the site.  ggus 69601 closed.
    (v)  4/12: UTD-HEP - job failures with errors like "Mover.py | !!FAILED!!3000!! Get error: Replica with guid 601B99EE-1E42-E011-BA72-001D0967D549 not found 
    at srm://fester.utdallas.edu."  Possibly due to concurrently running a disk clean-up script.  ggus 69641 in-progress, eLog 24284.
    Update 4/14 from Harisankar at UTD: We believe the error is caused due the cleaning process of dark data we performed. We are currently working on it (hence 
    closing this ticket.)  ggus 69641 closed.  Test jobs successful, queue set back on-line.  eLog 24331.
    
    • All is well. Typical site issues.
    • Pilot update from Paul - many new features.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Throughput meeting:
      US ATLAS Throughput Meeting – April 12,  2011
                      ====================================
      Attending:  Shawn, Jason, Karthik, Dave, Andy, Tom, Aaron, Horst
      Excused: Philippe
      1)      Review past items
      a.       Dell R410 (single node perfSONAR instance) status.   No updates yet.
      b.      Issue between AGLT2 and MWT2_IU?   Something changed March 21? Seems OK since then. Resolved perhaps by routing changes?
      c.       Load tests redone?  Need updates from MWT2_IU, NET2, SWT2_UTA and WT2 – No update since no-one from those sites on the call.  *All sites need to make sure the load-retest is done and documented*.
      2)      perfSONAR status:  The dashboard at BNL is all-green except for a possible throughput issue with WT2. 
      a.       Problems noted yesterday with OU and BNL (Philippe): Update.  Plot attached.  Reboot fixed it.  Hiro’s throughput tests showed poor performance over the weekend as well.   BNL system was strange.  Jason reported reboot/clean didn’t resolve.  Further DB maintenance seems to have fixed it.   Test results were being saved but not “seen”.  Will watch it.
      b.      New issues? WT2 throughput?   Yee restarted perfSONAR node…we will watch to see if it is fixed.  Possible issue between OU and IU (asymmetry).  OU to IU is slow compared to other direction for at least the last month.  Try a set of “divide and conquer” measurements to intermediate perfSONAR points.  *Jason will assist  Karthik/Horst in this testing*.  
      c.       Reconfiguration for nightly maintenance from 1-2AM EASTERN.  Status at SWT2, WT2, Illinois?  OU needs to do it (Karthik reports it being done during the call).  Illinois already done 2 weeks ago.  *UTA and SLAC need to be reconfigured*.  
      3)      Throughput monitoring.  Additional throughput issues?
      a.       MCDISK entries removed on Hiro’s throughput tests?     Still present and not yet removed…hopefully soon? *Action item for Hiro*
      b.      Merging perfSONAR throughput with FTS/DDM throughput in graphs?   No report.
      c.       Tom reported on Gratia/RSV perfSONAR dashboard status:   Added throughput between AGLT2_UM and BNL on current test dashboard as an example.   Probes right now have "hidden" names, which is inefficient.    Need multi-host probes before filling out the matrix.   Also need DB changes (summary table) to make things more efficient and faster.   Tom is working with Gratia developers to address this.    Primitive probes could be added when available.   Some minor work on host reporting inside probes needs to happen.  Andy/Jason will be working on this.   Notification list needs updating.  *Shawn will send contact list for perfSONAR instances around for comment*. 
      d.      ‘ktune’ package status:   Updates for  VLAN interfaces/aliases not ready yet…soon.  The package is deployed at AGLT2_UM on all the  dCache storage nodes and on one machine at MWT2_UC.   Aaron mentioned file-system mount tunings may be added.  Will work with Shawn on feeding this type of info into the package.
      e.      UltraLight kernel status.   Being tested.  2.6.38-UL1 being tested at  AGLT2 on a dCache test node.   MWT2 is running 2.6.36-UL5 on all storage and worker nodes.   Running well and performs better than stock SL/CentOS kernel.
      4)      Site reports (around the table)  -- Aaron reported on preparing to use Dell 8024F switch at MWT2.  *Shawn will send configs in use at AGLT2 to Aaron for reference*.
       
      5)      AOB - none
    • Throughput exercise plans:
      • MWT2_IU, NET2: defer until after the IO upgrade; SWT2_UTA will try to run a test.
    • perfSONAR instances: agreed to run crons at 1-2 am Eastern; sites must adjust accordingly (an illustrative crontab line follows this list).
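
A minimal illustration of the agreed 1-2 am Eastern maintenance window; the script name, path, and exact minute are hypothetical placeholders, since each site keeps its own perfSONAR maintenance job:

      # /etc/cron.d/ps-maintenance -- placeholder example; the actual maintenance script is site-specific
      # Run nightly perfSONAR maintenance at 01:30 Eastern, inside the agreed 1-2 am window
      30 1 * * * root /opt/perfsonar/bin/nightly_maintenance.sh >> /var/log/ps-maintenance.log 2>&1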

  • this week:
    • No meeting this week.
    • There are ongoing discussions with Alessandro De Salvo regarding perfSONAR monitoring for the IT cloud.
    • Longer term, the goal is to get this deployed ATLAS-wide.

HTPC configuration for AthenaMP testing (Horst)

  • OSG reference, https://twiki.grid.iu.edu/bin/view/Documentation/HighThroughputParallelComputing
  • Still waiting on others to make progress; Muonboy is causing segfaults (can't do reco jobs).
  • Douglas Smith had tried jobs that required a POOL file catalog, but this doesn't work at OSCER.
  • Not using the Tier 2 cluster since it would require a Condor upgrade. Is this possible?
  • Saul: caused by a Fortran library? Are other sites using these RPMs?
  • Justin at SMU volunteered to do this.
  • The only requirement is to configure the scheduler for whole-node scheduling (see the sketch after this list).
  • Saul will look into setting this up at NET2 (PBS, LSF). Will send Paolo a message.
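
A hedged sketch of the whole-node scheduling requirement for a Condor site; the custom attribute name is an assumption for illustration, and PBS/LSF sites would use their own exclusive-node job settings instead:

      ## condor_config on a worker node -- illustrative sketch of whole-node scheduling
      ## Advertise the machine as a single slot so one AthenaMP job owns all cores
      SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
      NUM_SLOTS_TYPE_1 = 1
      ## Optional custom attribute (name is an assumption) so whole-node jobs can be matched explicitly
      WholeNodeSlot = True
      STARTD_ATTRS = $(STARTD_ATTRS) WholeNodeSlot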

Python + LFC bindings, clients (Charles)

last week:
  • The new dq2 clients package requires at least Python 2.5; the recommendation is 2.6. The goal is not to distribute Python with the clients, but to make our install look like lxplus: /usr/bin/python26 installable from yum, plus setup files. This will be the platform prerequisite (a rough setup sketch follows this list).
  • LFC Python bindings problem: mixed 32/64-bit environment. The goal is to make sure wlcg-client has both 32-bit and 64-bit environments (/lib and /lib64 directories); _lfc.so also pulls in the whole Globus stack. Hopefully the next update of wlcg-client will incorporate these changes. Charles will write this up and circulate an email.
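
A rough sketch of the intended client platform setup; the python26 package matches the /usr/bin/python26 target mentioned above, while the wlcg-client install location and library paths are hypothetical placeholders pending Charles's write-up:

      # Illustrative only -- install the Python 2.6 prerequisite on an SL5-era node
      yum install python26

      # Hypothetical environment fragment: pick the LFC bindings matching the Python/arch in use
      export WLCG_CLIENT=/opt/wlcg-client                        # placeholder install location
      export PYTHONPATH=$WLCG_CLIENT/lib64/python:$PYTHONPATH    # or .../lib/python for 32-bit
      export LD_LIBRARY_PATH=$WLCG_CLIENT/lib64:$WLCG_CLIENT/lib:$LD_LIBRARY_PATH
      python26 -c "import lfc" && echo "LFC bindings OK"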

this week:

  • Working on it - in progress.

WLCG accounting (Karthik)

last week:
  • There were active discussions about including HT (hyperthreading) in the normalization factor (an illustrative example follows this list).
  • Feedback from Burt re: GIP, etc.
  • Brian is the point of contact. Dan has been contacted.
  • Wei has confirmed that the SLAC figures are close to those in the March WLCG report.
  • Saul's report shows agreement to within 10%, except for cases with foreign-cloud production.
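
An illustrative back-of-the-envelope example of why counting hyperthreaded slots changes the normalization factor; the numbers are invented for illustration and are not site measurements or the official WLCG prescription:

      normalization factor (HS06 per slot) = total farm HS06 / number of accounted job slots

      Example: a farm benchmarked at 8,000 HS06
        counted as 1,000 physical-core slots:   8,000 / 1,000 = 8.0 HS06/slot
        counted as 2,000 HT logical-core slots: 8,000 / 2,000 = 4.0 HS06/slot

      Reported work (wall time x slots x factor) is consistent either way only if the slot count and
      the per-slot factor use the same convention; mixing the two is what produces discrepancies.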

this week:

  • Called into OSG production meeting, not much feedback during the meeting.
  • Dan Fraser suggested a separate meeting with Brian, Burt, Karthik, and others. Karthik will set up a meeting for Monday.
  • Michael: the facts are on the table; the goal for the meeting should be that the owners assume responsibility for implementing a solution.

CVMFS (John DeStefano)

last week:
  • Mirror at BNL available for testing.
  • For Tier2 sites: TestingCVMFS
    • Status:
      Regarding CVMFS testing and configuration to point to the CVMFS testbed at BNL, see the below (thanks John).      
              Note - Alessandro expects to have the final layout and release repository available on the production 
              CERN IT servers within the next two weeks, so stay tuned for final instructions.   We'll update our US 
              facility-specific instructions (http://www.usatlas.bnl.gov/twiki/bin/view/Admins/TestingCVMFS.html) 
              to point to general, official ATLAS instructions as they become available.  
      
               - Rob
      
      Begin forwarded message:
      
      From: John DeStefano 
      Date: April 7, 2011 6:56:02 AM CDT
      To: Rob Gardner , Doug Benjamin 
      Subject: Connection information for CVMFS test bed at BNL
      
      Here are our test servers:
      
      Replica server: gcecore04.racf.bnl.gov
      Cache server: gridopn02.usatlas.bnl.gov
      
      A few caveats:
      - This is purely a test environment (obviously)
      - The cache is set to accept incoming requests on port 80, but from outside BNL you _may_ need to instead specify an alternate port (35753) in order to bypass BNL campus proxy restrictions
      - If you have problems connecting to the replica server, please let me know.
      - Most importantly, please share your results and levels of success or failure so I know what's working or not?
      
      So your configs would need something like:
      CVMFS_HTTP_PROXY="http://[your_proxy:port];http://gridopn02.usatlas.bnl.gov:80"
      ...or:
      CVMFS_HTTP_PROXY="http://[your_proxy:port];http://gridopn02.usatlas.bnl.gov:35753"
      CVMFS_SERVER_URL=http://gcecore04.racf.bnl.gov/opt/@org@,[cern-server]
      
      ~John
  • Encourage testing at the sites
    • There are test sites at AGLT2 and MWT2.
    • Patrick - may be able to join the testing.
    • OU - Horst will test.
  • This comes in addition to the local storage required for jobs (20 GB/job). It varies from site to site; AGLT2 is 14 GB/job in scheddb. Kaushik: this impacts which jobs get run.
  • Need to review scheddb entries for this and provide instructions (Horst). (A hedged client cache-configuration sketch follows this list.)
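
A hedged sketch of the client-side pieces implied above, combining the BNL test-bed settings from John's message with a local cache size; the repository list, cache path, and quota are example values a site would tune to its local disk budget:

      # /etc/cvmfs/default.local -- illustrative example for the BNL test bed described above
      CVMFS_REPOSITORIES=atlas.cern.ch
      CVMFS_CACHE_BASE=/var/cache/cvmfs2      # example path; size it within the node's local disk budget
      CVMFS_QUOTA_LIMIT=10000                 # cache quota in MB, example value
      CVMFS_HTTP_PROXY="http://[your_proxy:port];http://gridopn02.usatlas.bnl.gov:80"
      CVMFS_SERVER_URL=http://gcecore04.racf.bnl.gov/opt/@org@,[cern-server]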

this week:

  • Updates on site testing:
  • John is waiting on tests from sites; he sees no activity.
  • MWT2 - down, no tests
  • AGLT2 - deployed on one VM. Not an insignificant amount of work to set this up.
  • SWT2 UTA - plan to roll out on UTA_SWT2 cluster
  • SWT2 OU - Horst has installed on a few nodes - latest version, seems to be working okay.

Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Investigating client-side modifications for xrd equivalent to "libdcap++"
  • Performance tests continuing
  • Will standardize on xrootd rpm release
  • Version 3.0.3 is working fine at the moment.
  • xrootd-lfc bridge status at the sites
  • The global redirector is not running (Wei).
  • Client tools, etc., should be available. Can Tier 3s join?
  • Need to look at the load on the redirector (Wei).
  • No issues with regard to the global namespace; is this working with dq2-get?
  • Does dq2-get -g copy to the right place? (A connectivity-check sketch follows this list.)
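
A minimal check, for anyone joining the federation test, that a file is reachable through the global redirector; the redirector hostname and the file path are placeholders, not the actual federation endpoints:

      # Illustrative connectivity check against a (placeholder) global redirector
      xrdcp -d 1 root://global-redirector.example.org:1094//atlas/user/some.dataset/some.file.root /tmp/test.root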

this week:

  • Apologies; will send a note out to sites to check on the version running.
  • global redirector is back up.
  • Michael - would like to set up a global redirector at BNL using usatlas.org domain.

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • The link to ATLAS T3 working groups Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

last week(s):

  • AK - working on firewall issues affecting transfer rate. Working with the IT director to resolve the issue.
  • Note from Doug:
    - new Tier 3 online at UC Santa Cruz (will update spreadsheet)
    - waiting to hear from Arizona on the status of their Tier 3 (did they do any additional work 
      after I left, there were a few small issues?)
    - New ATLAS wide Tier 3 document being written by me as an ATLAS com note.
    -  new version of dq2 client tools out but will not be deployed at US Tier 3 sites using
       ATLASLocalRootBase (default for new sites) due to a problem with wlcg-client-lite
      (see below for a message from Asoka DeSilva)
    -  Asoka has  requested that the US setup a Tier 3 testbed that does not use 
      ATLASLocalRootBase. Similar to the way lxplus is used as a test bed for gLite.
      Are the BNL interactive machines the proper place? Do they use wlcg-client or
      wlcg-client-lite to setup dq2?
    -  Andy H is working with Dean Schaumburg at Stony Brook to debug his problems 
      with his xrootd server side inventory
    - Kaushik has agreed to talk at the US ATLAS IB meeting on Panda for Tier 3.
      I will then use his talk to query all of the US ATLAS institutions to see if they 
      have any interest in setting up a Panda queue at their Tier 3.  If no-one is
      interested then we will drop the issue completely and spend the resources
      on other more important items.
    
    Cheers,
    Doug
    
    I was wondering if you have this new version available for deployment ?  
    dq2 0.1.36 has been released and for gLite, on lxplus and at Tier3s, it seems to work only with 32-bit python 2.5/2.6 - but that is good enough for me to deploy on Monday.
    US Tier3 sites will not be able to use this new version of DQ2 since the primary middleware is wlcg-client and only the 64-bit version is installed. (I tried with both 32/64 bit versions of python on the 64-bit OS)
    LFC exception [Python bindings missing]
    On this related matter, can someone in the US please setup a test bed that does NOT use ATLASLocalRootBase / manageTier3SW.  Similar to that on lxplus which serves as the testbed for gLite, I'd like to be able to look at the scripts in the US which use wlcg-client-lite to setup dq2  and use that as an example to encapsulate inside ATLASLocalRootBase.  
    Thanks !
    regards,
    Asoka

this week:

Tier 3GS site reports (Doug, Joe, AK, Taeksu)

last week:
  • UTD (Joe): proddisk space filling up quickly. Kaushik notes that all jobs are recon, pileup, and merge jobs; need to clean regularly. Joe notes he will need to do this every day. There is a question about whether a production role is required to run the script.
  • Illinois - to report during MWT2 site report

this week:

  • UTD: not here.
  • BELLARMINE-OU: still working on firewall issue. Could consult with OSG Security.
  • Hampton: investigating getting a site up.

Site news and issues (all sites)

  • T1:
    • last week(s): Storage consolidation: millions of files to be moved and deleted over ~1 month. Hiro and his team are planning the Chimera migration. A new glibc security vulnerability patch is available; did local testing, including with Athena. Replacing now, not requiring a downtime.
    • this week: The Chimera migration is progressing; hardware specs are out and the PO has been issued. Expect a fair amount of SSD disk for the database. Will start learning about the migration needed to convert the 100M-file inventory. Also working on federated xrootd, CVMFS, and other things. Planning upgrades to the power infrastructure in the building addition (more panels and breakers); this will require a partial downtime. ESnet is working on getting additional circuits operational on the new fiber infrastructure; the light budget is not sufficient from BNL to Manhattan, requiring an optical amplifier halfway.

  • AGLT2:
    • last week(s): Quiet, waiting for jobs. A rolling update and rebuild of all worker nodes to SL 5.5 is in progress. Fabric update table for analysis jobs.
    • this week: A 10G NIC was flaky on a storage node; resolved. Bob: getting worker nodes updated to SL 5.5; jobs are running successfully. Will start a rolling update of all the machines, including the security patch.

  • NET2:
    • last week(s): Major IO upgrade on-going - rebalancing GPFS volumes. Going to multiple gridftp endpoints. Cluster NFS is running. Network BU-HU issues resolved.
    • this week: Still working on IO upgrade; John away on vacation; getting ready to buy

  • MWT2:
    • last week: Major downtime pushed back till April 18 for UC server room move. LOCALGROUPDISK cleanup in progress. Site reports:
      • UC (Aaron): the move is our priority. SRM services moved into a VM; glibc upgraded. Throughput testing to IU deferred to later this week.
      • IU (Sarah): all is well; on vacation this week.
      • UIUC (Dave): all working well; 250 jobs failed for pandamover reasons. LOCALGROUPDISK: 20 TB of AOD files downloaded. There must be an agreement with a group representative.
    • this week:
      • UC: moving server room: major downtime this week and next.
      • IU: took a short downtime to migrate GUMS from UC to IU, back online.
      • Illinois: all is well.

  • SWT2 (UTA):
    • last week: Taking down one of the clusters for an OSG upgrade and security update, moving to new OSG versions of the complete stack. This is the SWT2 (production-only) cluster. Probably Friday.
    • this week: OSG updated. Replaced xrootd and Bestman with the latest versions. Updated worker nodes. There might be an issue with RSV. Occasional problems with one analysis job type. Patrick: OSG is trailing too far behind the xrootd releases; would rather work directly with the source.

  • SWT2 (OU):
    • last week: All is well. Pursuing segfault issue at OSCER - Ed Moyse is looking at it locally.
    • this week:

  • WT2:
    • last week(s): Scheduled power outage over the weekend; came back okay. Need to bring back the global redirector. PO sent for storage.
    • this week: Short downtime this afternoon for network changes. The problem with the uplinks on the 8024F was due to bad fiber; now 2 x 20G uplinks to the Cisco.

Carryover issues (any updates?)

AOB

last week:

this week:


-- RobertGardner - 15 Apr 2011


Attachments


pdf HCC-opportunistic-Atlas.pdf (1524.3K) - RobertGardner, 19 Apr 2011 - 10:28
jpg screenshot_01.jpg (193.8K) - RobertGardner, 20 Apr 2011 - 12:50