
MinutesMar16

Introduction

Minutes of the Facilities Integration Program meeting, March 16, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Fred, Dave, Aaron, AK, Nate, John, Michael, Karthik, Saul, Shawn, Sarah, John B, Patrick, Alden, Horst, Rob, Charles, Bob, Tom, Armen, Kaushik, Mark, Wei, Doug
  • Apologies:

Integration program update (Rob, Michael)

  • IntegrationPhase16
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s):
      • Next week is the US ATLAS Computing Facilities Workshop, co-located with the OSG All Hands meeting; see the OSG Agenda page.
      • Discussion regarding normalization:
        • CPU is being reported accurately by OSG
        • GIP information collected about sub-clusters
        • Sent to WLCG as an SI2K value, unfolded from there.
        • OSG is adjusting the SI2K value - it should be HS06.
        • The request should be to get SI2K out of the picture.
        • Two other issues: hyperthreading - the count should be logical CPUs.
        • WLCG policy is to use normalized CPU time.
      • Resource requirements are being intensely discussed within ATLAS; changes are coming. E.g. Tier2-Ds are being actively pursued by ADC. Simone is setting up direct connections; most of our Tier 2s are already involved, and we expect new connections from beyond the US cloud. We will need to see how the network cooperates. Resources would be used differently - more dynamically, with less static data, more cache-like.
      • Close to data-taking again at LHC. Be prepared for new data.
      • Tier2-Ds will eventually be connected to every T1. Expect to see many more connections, x5 or x10, to the gridftp servers.
    • this week
      • WLCG reporting still needs to be sorted out (Karthik reporting). See https://twiki.grid.iu.edu/bin/view/Accounting/OSGtoWLCGDataFlow. There are problems using the KSI2K factor on the OSG side - is the table incorrect? Also problems with hyperthreading on or off. (A toy normalization sketch follows at the end of this section's notes.)
      • Capacity spreadsheet reporting - see updates with HT and number of jobs per node.
      • ATLAS ADC is in the process of checking capacities - a new web page shows the capacities available to DDM, via SRM; browsing the page shows that deployed capacities are under-reported for every site. We need to understand why, e.g. AGLT2 (1.9 PB deployed versus 1.4 PB reported). We need to take a look into this; Michael will provide the link. For SWT2 it may be related to space-token reporting.
      • Expected capacity to deliver - may need to be averaged out between the T1 and T2s to meet the pledges.
      • LHC first collisions on Sunday, stable beams are still rare - working on protections, loss maps.
      • http://bourricot.cern.ch/dq2/accounting/federation_reports/USASITES/
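      • For illustration only, a minimal Python sketch of the normalized-CPU-time bookkeeping discussed above. The kSI2K-to-HS06 factor and the per-core HS06 rating are assumed placeholder values, not measured site or WLCG numbers.
        # Sketch of WLCG-style CPU-time normalization (illustrative only).
        KSI2K_TO_HS06 = 4.0   # assumed conversion factor; check current WLCG guidance

        def hs06_from_ksi2k(ksi2k_per_core):
            """Convert a legacy kSI2K per-core rating to HS06 using the assumed factor."""
            return ksi2k_per_core * KSI2K_TO_HS06

        def normalized_cpu_hours(cpu_seconds, hs06_per_core):
            """Scale raw CPU seconds by the per-core HS06 power of the worker node."""
            return cpu_seconds / 3600.0 * hs06_per_core

        if __name__ == "__main__":
            # Hypothetical job: 10 hours of CPU on a core rated at 8 HS06 -> 80 HS06-hours.
            print(normalized_cpu_hours(10 * 3600, hs06_per_core=8.0))
            # With hyperthreading on, the per-core rating should be derived from logical
            # CPUs, which is one of the open reporting issues noted above.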

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • Links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

last week(s):

  • xrootd rpm under test, needs work.
  • Arizona coming online.
  • Rik is migrating out of Tier 3 management to analysis support, but will stay closely involved since T3 and analysis are closely related.
this week:
  • Doug travels to Arizona next week (Tues-Thursday) to help set up their Tier 3 site
  • Last week there was a meeting with VDT during the OSG All Hands meeting about Xrootd rpm packaging; OSG/VDT promised a new rpm soon.
    • Next week - Wei reports a new release is imminent.
  • CVMFS meeting Wednesday 16-Mar 17:00 CET
    • Move to the final namespace in advance of the migration to CERN IT - not sure about the timescale
    • Nightlies and conditions data
    • AGLT2 discovered a problem with a fresh installation - testing on a different machine. Should be fixed so as not to damage the file system.
  • Write up on Xrootd federation given to Simone Campana and Torre Wenaus. They are collecting information on R&D tasks and task forces
  • wlcg-client - now supported by Charles; some python issues need to be resolved.
  • UTD report from Joe: a cmtconfig error - the node was running CentOS; tracked down to the firmware updater from Red Hat. Lost heartbeat errors were also tracked down.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • US cloud has been quiet with few issues; getting low on jobs at many sites overnight, picking up now.
    • Significant new pilot release; one of the changes might have broken MWT2 - a fix is in the works.
    • Panda monitor issues over the weekend - understood.
    • Lack of defined jobs - taking a bit of time. Is panda-mover moving data quickly enough? Mark will check.
    • Saul: noticing a large number of looping analysis jobs; some legitimate jobs were getting removed. The pilot was looking for log output in the wrong place. People are getting annoyed - it will get switched back to the 12-hour time limit.
  • this week:
    • HI reprocessing going on
    • Sites are all doing fine
    • Analysis usage has been low for the past three weeks - about half of what it nominally has been.
    • We'll need to make sure all sites can sustain 75% of their capacity to support analysis, and to sort out the technical aspects. Let's do some HC tests.
    • Saul notes that analysis failures are a factor of 2 higher than previously. Need to do an error analysis on these.

Data Management and Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • No meeting this week.
    • Armen: all is well.
  • this week:
    • No meetings this week
    • Deletion from GROUPDISK - taking some time - all deletions will be submitted by tomorrow.
    • Discussions with developers of deletion tools, lots of mail exchange. There is a large backlog of deletions.
    • Otherwise the space is okay.
    • Discuss LOCALGROUPDISK at the Tuesday meeting - how to manage it (a US facility disk area that needs cleaning up). Need an accounting and deletion policy. (A toy accounting sketch follows this list.)
    • Shawn notes that some ACLs may need to be changed in the LFC.
    • Will do a test at SWT2 today.
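    • For illustration only, a toy Python sketch of the kind of per-user accounting and retention report implied by the LOCALGROUPDISK discussion above. The dataset records, sizes, and the 90-day window are invented placeholders; a real report would pull usage from DDM accounting rather than a hard-coded list.
      from collections import defaultdict
      from datetime import datetime, timedelta

      RETENTION = timedelta(days=90)   # assumed retention window, purely illustrative

      # (dataset, owner, size in TB, last access) - fabricated example records
      DATASETS = [
          ("user.alice.ntup.001", "alice", 1.2, datetime(2010, 11, 1)),
          ("user.bob.ntup.017", "bob", 0.4, datetime(2011, 3, 10)),
      ]

      def report(now):
          usage = defaultdict(float)          # per-owner totals
          stale = []                          # datasets past the retention window
          for name, owner, size_tb, last_access in DATASETS:
              usage[owner] += size_tb
              if now - last_access > RETENTION:
                  stale.append((name, owner, size_tb))
          for owner, total in sorted(usage.items()):
              print("%-12s %6.1f TB" % (owner, total))
          for name, owner, size_tb in stale:
              print("deletion candidate: %s (%s, %.1f TB)" % (name, owner, size_tb))

      if __name__ == "__main__":
          report(now=datetime(2011, 3, 16))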

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=130344
    
    1)  3/2: NERSC_HOTDISK - file transfer errors with "failed to contact on remote SRM."  As of 3/5 no further errors of this type were observed (maybe a certificates problem?), 
    and ggus 68198 closed.  eLog 22614.
    2)  3/3: Pilot update from Paul (SULU 46b): The default looping limit was increased to 12 hours for user jobs. The next pilot version has a rewritten looping job killer but has 
    not been tested yet.
    3)  3/4 - 3/9: Jobs failing in various clouds with the error "Transformation not installed in CE."  Missing installations eventually sorted out.  
    See: https://savannah.cern.ch/bugs/index.php?78998, eLog 22698.
    4)  3/4: SWT2_CPB - Performed necessary maintenance on the cluster gatekeeper (mainly gass cache clean-up, plus a reboot).  Also, Xin cleaned out some stale pilots on 
    the submit host(s) at BNL.  Site back on-line as of ~6:00 p.m. CST.  eLog 22717/10.
    5)  3/6: At various times during the day all VO boxes and central services were degraded and had to be restarted.  Eventually as of ~23:40 UTC the system seemed to 
    have stabilized.  See eLog 22806.
    6)  3/7: WISC_DATADISK file transfer errors like "failed to contact on remote SRM [httpg://atlas07.cs.wisc.edu:8443/srm/v/server]."  Problem fixed by the site admin.  
    ggus 68295 closed, eLog 22824.
    7)  3/7: MWT2_UC - job failures with the error "pilot: Get error: lsm-get failed (201): ERROR 201 Copy command failed."  From Aaron at MWT2: We have been experiencing 
    a lockup condition on our dcache pools which require them to be restarted occasionally. The errors you see here were most likely happening during this restart, or failures 
    attempting to access data during the pool lockup. This batch should have stopped failing an hour or so ago, as all our pools are up and working now.  ggus 68319 closed, eLog 22861.
    8)  3/8: HU_ATLAS_Tier2 job failures with the error "pilot: Get error: lashotdisk/ddo/DBRelease/v140101/.../DBRelease-14.1.1.tar.gz'\'''] failed because it had non-empty stderr 
    [ssh_exchange_identification: Connection closed by remote host] and it had non-zero exit status."  From Saul: There was transient incident at 9:25:48 UTC on our gatekeeper. This 
    dropped connections for our local site mover causing these errors. We'll investigate further, but the errors stopped within a few seconds.  ggus 68340 closed, eLog 22885.
    9)  3/8 - 3/9: SWT2_CPB - SRM service became unresponsive due to two heavily loaded data servers.  SRM was taken off-line for a period of time to allow the hosts to stabilize.  
    During this period a user was unable to retrieve some files from the site - ggus 68395 & RT 19546 were opened (and since closed).  eLog 23014, https://savannah.cern.ch/bugs/?79282.
    
    Follow-ups from earlier reports:
    (i)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (ii)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running 
    on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with Israel CA having problem accessing BNL and MWT2. This is actively investigated right now. Until this get completely resolved, users are suggested 
    to request DaTRI request to transfer datasets to some other sites (LOCAGROUPDISK area) for the downloading.
    (iii)  1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value."  Consolidated into a single goc ticket, 
    https://ticket.grid.iu.edu/goc/viewer?id=9871.  Will be resolved in a new OSG release currently being tested in the ITB.
    Update 3/8 from Kyle Gross: I'll close this as it has been fixed in OSG 1.2.17-1.2.18. This information will be published as resources update to the new software.
    (iv)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN state 
    one day after updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    (v)  2/24: MWT2_UC - job failures with " lsm-get failed: time out after 5400 seconds" errors.  From Aaron: We performed a dcache upgrade yesterday, 3/1 which has improved our stability 
    at the moment. This can probably be closed, as new tickets will be opened if new failures occur.  ggus 67887 in-progress (and will be closed), eLog 22425.
    Update 3/11: this issue cross-linked to the (closed) ggus ticket 68544.
    (vi)  2/25: UTD-HEP set off-line due to A/C compressor problem.  eLog 22454
    Update 3/5: A/C issue resolved, test jobs successful, site set back 'on-line'.  eLog 22747.
    (vii)  3/2 a.m.: MWT2_UC job failures with errors like:
    "Error details: pilot: Get error: Failed to get PoolFileCatalog|Log put error: Could not figure out destination path from dst_se (guid=3920b517-03bb-4ae6-8ddf-d7c298a79a96 lfn=log.261508._039922.job.log.tgz.20): 
    list index out of range."  Apparently a problem with the new pilot release (#5 above).  A fix is being prepared.  ggus 68156 in-progress, eLog 22594,
    Update 3/5: Error no longer exists, the new pilot code must have fixed things.  ggus 68156 closed.
    
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=131286
    
    1)  3/10: SLACXRD_LOCALGROUPDISK transfer errors ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  From Wei: We are hit very 
    hard by analysis jobs. Unless that is over, I expect error like this to continue.  As of 3/14 issue probably resolved - we can close ggus 68498.  eLog 22978.
    2)  3/12: SLACXRD_LOCALGROUPDISK transfer errors with "[NO_SPACE_LEFT] No space found with at least .... bytes of unusedSize]."  
    https://savannah.cern.ch/bugs/index.php?79353 still open, eLog 23037.  Later the same day: SLACXRD_PERF-JETS transfer failures with "Source file/user checksum mismatch" errors.  
    https://savannah.cern.ch/bugs/index.php?79361.  Latest comment to the Savannah ticket suggests declaring the files lost to DQ2 if they are corrupted.  eLog 23048.
    3)  3/12: UTD-HEP set on-line.  http://savannah.cern.ch/support/?119596, eLog 23057.  (The site had originally been set on-line back on 3/5, but ran into some cmtconfig issues.  
    These were resolved as of 3/12, and test jobs were successful.)
    4)  3/12: OU_OCHEP_SWT2_DATADISK - file transfer errors like "gridftp_copy_wait: Connection timed out."  From Horst: Since these timeouts only happened from two sites, while 
    we've been getting lots of successful transfers from everywhere else at the same time and still are, I'm going to assume the problem is on the other end(s) and am closing this ticket again.  
    ggus 68570 / RT 19558 closed, eLog 23059.
    5)  3/13: Shifter reported that some queries in the panda monitor requesting detailed job information were returning errors.  Valeri reported that a fix to the problem had been deployed.  
    https://savannah.cern.ch/bugs/index.php?79367, eLog 23064.
    6)  3/13: OU_OSCER_ATLAS job failures due to a problem with release 15.6.3.10.  As of 3/14 Alessandro was reinstalling the s/w.  Can we close this ticket?  ggus 68611 / RT 19561, 
    eLog 23134, https://savannah.cern.ch/bugs/index.php?79368.
    7)  3/14: SWT2_CPB - power outage in the building, the generator came on, but was not supplying power correctly to the A/C units in the machine room.  Entire cluster had to be powered off.  
    Power restored to the building by early evening - systems were gradually brought back on-line.  As of 3/15 afternoon test jobs completed successfully, panda queues back on-line.  eLog 23189.
    8)  3/14: MWT2_UC file transfer errors ("[GENERAL_FAILURE] AsyncWait] Duration [0]").  From Aaron: This is due to a dcache pool which has been restarted multiple times this afternoon. 
    We are attempting to get this server more stable or drain it, and we expect to be running again without problems within an hour or two.  Can we close this ticket?  ggus 68617, eLog 23139.
    9)  3/15: Development version of the panda monitor available for testing (http://pandadev.cern.ch/).  This version is being tested under SLC5.
    10)  3/15: HU_ATLAS_Tier2 and ANALY_HU_ATLAS_Tier2 set off-line at Saul's request.  ggus 68660, https://savannah.cern.ch/support/index.php?119796, eLog 23194.
    
    Follow-ups from earlier reports:
    
    (i)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (ii)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on t301.hep.tau.ac.il reports 
    Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with Israel CA having problem accessing BNL and MWT2. This is actively investigated right now. Until this get completely resolved, users are suggested to request 
    DaTRI request to transfer datasets to some other sites (LOCAGROUPDISK area) for the downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    (iii)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN state one day after updating.  
    Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    (iv)  2/24: MWT2_UC - job failures with " lsm-get failed: time out after 5400 seconds" errors.  From Aaron: We performed a dcache upgrade yesterday, 3/1 which has improved our stability at the moment. 
    This can probably be closed, as new tickets will be opened if new failures occur.  ggus 67887 in-progress (and will be closed), eLog 22425.
    Update 3/11: this issue cross-linked to the (closed) ggus ticket 68544.
    Update 3/14 from Aaron: No errors have occurred like this recently. Closing, please re-open or open a new ticket if the problem continues.  Both ggus tickets now closed/solved.  eLog 22984, 23017.
    
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Had a meeting yesterday - see email for notes.
    • Good news - perfsonar plots for throughput and latency nearly green.
    • OU, BNL, Illinois issues addressed
    • The path between MWT2_IU and AGLT2 has a unique component - it's slowing things down, and it's asymmetric. These are the only issues we've seen so far.
    • Action item: all T2s to get another load test in. Sites should contact Hiro and monitor the results. An hour-long test, ASAP.
    • More problems in the network likely with the new ATLAS computing model - could our monitoring system be more broadly adopted in ATLAS? Encourage new sites to adopt a perfsonar infrastructure.
    • Will ATLAS do something globally? Part of LHCONE, for example.
    • Timeframe for 10G monitoring. Testing with a server at UM - dual integrated 10G NICs. Probably with the next hardware purchase. Can a single box run both roles (throughput and latency)?
  • this week:
    • Throughput meeting yesterday:
         USATLAS Throughput Meeting Notes --- March 15, 2011
                      ===========================================
      Attending: Shawn, Andy, Dave, Philippe, Sarah, Jason, Aaron, Tom, John
       
      1)      Past Action Item status
      a.       Dell R410 (merged perfSONAR box):  No updates.
      b.      AGLT2 to MWT2_IU (low throughput noted, used NLR segment unlike most other MWT2_IU paths).   No updates on this issue.   Jason had done some NLR tests.  Will be looking at this more later this week.
      c.       Loadtest retesting.   Sites need to schedule tests.  Contact Hiro.  Only AGLT2 done so far.  Tier-2 sites should try to schedule loadtests before the next meeting in two weeks. (Avoid conflicts with LHC data by getting this done soon).
      2)      perfSONAR status -  Currently 3 CRIT values on Latency matrix for LHCPERFMON.  Andy will check the plugin to see if CRIT may mean “UNKNOWN”.   General discussion about current settings and email alerting.  Jason mentioned restarting services may cause problems with the low-level service monitoring.   Jason will check for possible fixes to handle restarts.  Tom can setup alerting windows which ignore known bad periods or can use different criteria (e.g., need to fail two tests 1 hour apart before alerting).  General consensus was to keep things as they are to get more experience and make sure things are stable.  Will revisit more aggressive alerting and threshold tuning in a future meeting.
      3)      Throughput monitoring
      a.       Hiro’s throughputs still have MCDISK…fix?  -  No update.
      b.      Adding perfSONAR to throughput test graphs –No update.
      c.       Tom described transition to remove dependency of perfSONAR dashboard on Nagios.  Jason is providing Andy’s plugins augmented with “RSV” mode.   Goal is to have RSV tests for perfSONAR feeding Gratia DB.  Tom’s dashboard (currently PHP) will migrate to Java and utilize the Gratia DB as its source.  Modular and portable for the future.
      d.      Shawn described ‘ktune’ package for USATLAS sites.  Started from ktune 0.2-6 and augmented with some tunings from AGLT2 and others.  Network recommendations taken from ESnet’s Fasterdata page at http://fasterdata.es.net/   Asking for “beta” testers to deploy and provide feedback.  RPMS available at:
    • http://linat05.grid.umich.edu/ktune-0.2-6_usatlas.src.rpm
    • http://linat05.grid.umich.edu/ktune-0.2-6_usatlas.noarch.rpm (This is the one you install to test…read the README)
      4)      Site Reports/Around-the-table:  Aaron noted MWT2_IU -> OU performance is bad.   On the list of things to check.   Will be looked at soon.   Shawn mentioned ‘ktune’ again…looking for sites longer term to help benchmark settings and augment package to provide a starting point for sites to use.
       
      No AOB.  Plan to keep in touch on on-going activities via email.   We will meet again in two weeks at the regular time.  Send along corrections or additions to the notes via email.  Thanks,
       
      Shawn 
    • Matrices are mostly green.
    • Will turn up the sensitivity - will watch for alerts.
    • Modularizing perfsonar for OSG RSV probes - store in a Gratia database.
    • ktune - come up with recommended kernel tunings appropriate for wide-area networking. Aaron and Dave are doing some testing. It tunes the TCP stack, may tune the NIC via the ethtool command, and does some memory settings (see the sketch at the end of this list).
    • The site certification table for this quarter includes load tests: 400 MB/s over an hour, or as close as the site can manage. Capture plots.
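    • For illustration only, a minimal Python sketch of the kind of check a site might run before and after applying ktune-style tuning. The sysctl keys and target values are illustrative placeholders in the spirit of ESnet's fasterdata recommendations, not the actual contents of the ktune-0.2-6_usatlas package.
      """Compare current kernel WAN/TCP settings against candidate values (sketch)."""
      import os

      # sysctl key -> illustrative target value (NOT the ktune package settings)
      CANDIDATE_SETTINGS = {
          "net.core.rmem_max": "16777216",
          "net.core.wmem_max": "16777216",
          "net.ipv4.tcp_rmem": "4096 87380 16777216",
          "net.ipv4.tcp_wmem": "4096 65536 16777216",
      }

      def read_sysctl(key):
          """Read a sysctl value via /proc/sys (dots in the key map to path components)."""
          path = os.path.join("/proc/sys", *key.split("."))
          with open(path) as f:
              return f.read().strip()

      def check():
          for key, wanted in sorted(CANDIDATE_SETTINGS.items()):
              try:
                  current = read_sysctl(key)
              except IOError as err:
                  print("%-22s unreadable: %s" % (key, err))
                  continue
              status = "OK" if current.split() == wanted.split() else "differs"
              print("%-22s current=%s  wanted=%s  [%s]" % (key, current, wanted, status))

      if __name__ == "__main__":
          check()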

Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Running tests with the current architecture. At MWT2, using xrootd as a front end to dCache now - more stable than dcap (we do see a performance hit over the wide area, requiring tuning of the xrootd client).
  • Working on re-architecting with LFC. Removal of DQ2 timestamps - working with Hiro and Simone. Will need to test this against storage.
  • Renormalization of paths at MWT2, for the global namespace.
  • Working with sites in Europe with DPM backend.
  • Will talk on this Tuesday morning.
this week:
  • Doug sent a document to Simone and Torre - to be part of an ATLAS task force, R & D project, may be discussed during SW week.
  • Charles - continuing to test; performance tests reach 500 MB/s. Post-LFC-model work: replacing the LFC-callout plugin (requires normalizing paths and getting rid of DQ2 suffixes; some settings changed). A path-mapping sketch follows.
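  • For illustration only, a minimal Python sketch of the kind of global-to-local name translation and DQ2-suffix stripping involved in renormalizing paths for the federation namespace. The prefixes, the example path, and the suffix pattern are hypothetical placeholders, not the actual MWT2 plugin logic.
    import re

    GLOBAL_PREFIX = "/atlas"                    # assumed federation (global) namespace root
    LOCAL_PREFIX = "/pnfs/example.edu/atlas"    # placeholder local storage root

    # Assumed form of a DQ2 transfer-timestamp suffix, e.g. "file.root__DQ2-1299876543"
    DQ2_SUFFIX = re.compile(r"__DQ2-\d+$")

    def strip_dq2_suffix(name):
        """Drop a trailing DQ2 timestamp suffix so lookups match the bare filename."""
        return DQ2_SUFFIX.sub("", name)

    def global_to_local(gpath):
        """Map a global-namespace path onto the local physical namespace (sketch)."""
        if not gpath.startswith(GLOBAL_PREFIX + "/"):
            raise ValueError("not a global-namespace path: %s" % gpath)
        return LOCAL_PREFIX + gpath[len(GLOBAL_PREFIX):]

    if __name__ == "__main__":
        p = "/atlas/mc10/NTUP.012345._000001.root__DQ2-1299876543"   # invented example
        print(global_to_local(strip_dq2_suffix(p)))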

Site news and issues (all sites)

  • T1:
    • last week(s): BNL has its own PRODDISK area now. Deployed about 2PB of disk, in production. Will need to remove some of the storage.
    • this week:

  • AGLT2:
    • last week: All is working well. Have had some checksum failures - chasing this down. Users are attempting to get files that were once here but are no longer present. Did a user job unknowingly remove files under the usatlas1 account? Looking at options to trap the remove command and log these. Want to get the lsm installed here, to instrument I/O.
    • this week: Doing some work on the SE. Would like to improve I/O behavior for jobs. Testing ktune.

  • NET2:
    • last week(s): Tier3 hardware is on the way; ordering a new rack of worker nodes (looking at R410). Working to get additional 10G links, maybe even a 40G link. DYNES application approved! Since someone was asking last time: there is still some debris from MCDISK.
    • this week: Work on BU storage is underway to improve transfers to HU. Two GPFS filesystems will be combined (reporting will change briefly). New switch for HU connectivity. Production job failures at HU last night - an expired CRL; the CRL update had stopped running for some reason.

  • MWT2:
    • last week(s): Downtime yesterday - dCache upgraded to 1.9.5-24. Evaluating CVMFS at MWT2_IU. Migrated monitoring services (Cacti, Ganglia, etc.) onto a new machine using KVM. Finishing last plans for the new server room at UC - adding an additional 30-ton CRAC unit; some construction already complete - raised floor, cooling infrastructure, new transformer and UPS delivered. At IU we'll have to take a downtime to re-arrange the server room; no exact date, but will announce. Hiro notes that there were some additional subscriptions made over the weekend - these could have caused the lockup.
    • this week: Working on a new MWT2 endpoint using Condor as a scheduler. Correct CPUs arrived from Dell - to be replaced.

  • SWT2 (UTA):
    • last week: The grid monitor has been getting lost, causing load issues - a cron job was not running correctly and not deleting gass-cache files. Maintenance yesterday at SWT2_UTA. Network connectivity into the analysis cluster is currently 1G links; working with networking folks to get a 10G switch. Will be looking to update OSG, hopefully once the new OSG release is out. Periodic failures in SAM testing; probably one of the storage nodes is getting too busy.
    • this week: Lost power on campus Monday afternoon - problem in switch gear for cooling.

  • SWT2 (OU):
    • last week: Hiro's throughput test showing 400 MB/s.
    • this week: Waiting for final confirmation for compute node additions next week. Investigating Alessandro's install job hang.

  • WT2:
    • last week(s): Last week there was a problem with a Dell storage machine - replaced CPU and memory, though it has not yet been stressed. Planning 3 major outages, each lasting a day or two: March, April, early May. Will be setting final dates soon.
    • this week: Getting quote for new switch.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report(s)
    • IU and BU have now migrated.
    • 3 sites left: WT2, SWT2-UTA, HU
    • Waiting on confirmation from Alessandro; have requested completion by March 1.
    • Focusing on WT2 - there is a proxy issue
    • No new jobs yet at SWT2 and HU - jobs are timing out, not running.
    • There is also Tufts (BDII publishing).
    • One of the problems at SLAC is the lack of outbound network links, and the new installation procedure will probably use gridftp. Discussing options with them.
  • this meeting:

AOB

  • last week
  • this week


-- RobertGardner - 15 Mar 2011
