
MinutesFeb16

Introduction

Minutes of the Facilities Integration Program meeting, Feb 16, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Karthik, Aaron, Nate, Charles, Rob, Dave, Fred, Saul, Horst, John DeStefano, AJ, John Brunelle, Michael, Sarah, Patrick, Bob, Alden, Armen, Mark, Kaushik, Tom, Xin, Hiro, Doug
  • Apologies: Wei, Joe, Jason-Internet2

Integration program update (Rob, Michael)

  • IntegrationPhase16 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s)
      • Reminder: next face-to-face facilities meeting is co-located with the OSG All Hands meeting (March 7-11, Harvard Medical School, Boston), http://ahm.sbgrid.org/. The US ATLAS agenda will be here.
      • Starting up CVMFS evaluation in production Tier 2 settings: SWT2_OU, MWT2 participating, AGLT2 and NET2 (possible). Instructions to be developed here: TestingCVMFS.
        • Doug: migration into CERN IT which will change the mount point. Rationalizing directory structure. Will use Alessandro DeSalvo's software installation tool.
      • Updates to SiteCertificationP16
      • A new OSG release will be coming out shortly, but it will be delayed.
      • Actionable tasks in the integration program - OSG 1.2.17 update, and TestingCVMFS (new installation notes provided by Nate Y - thanks)
      • Expect to run out of production jobs later in the week - good time for downtimes (email usatlas-t2-l@lists.bnl.gov).
      • News from the ADC retreat in Napoli; Kors' summary slides are at this week's ADC Weekly
        • Especially, look at the summary talks on Friday. We had a lot of discussions about how to manage the huge data volume expected in 2011 and 2012.
        • We are thinking about an extreme PD2P, where very little data will be pre-placed and kept on disk (one copy of RAW and two copies of final derived datasets, ATLAS-wide). The rest of the storage will be temporary cache space, centrally managed. Without this, we will run out of space a few months after the LHC starts (note that the Tier 1s are already completely full with 2010 data).
        • Important user analysis change: looping jobs will be killed after 3 hours (not 12 hours). Dan reported that ~20% CPU was being wasted on looping jobs.
        • LFC consolidation, PRODDISK at T1, data migrations and deletions... please look at the summary talks.
      • Looping jobs (evidently inactive jobs) may be mostly prun.
    • this week
      • Reminder: Next face-to-face facilities meeting co-located with OSG All Hands meeting (March 7-11, Harvard Medical School, Boston), http://ahm.sbgrid.org/. US ATLAS agenda will be here.
      • Hold off on the OSG update until 1.2.18, which has bug fixes for the RSV probes used for WLCG availability, and updates to GIP for WLCG attributes.
      • We need to huddle outside this meeting on CVMFS (TestingCVMFS), I think.
      • Facility capacity spreadsheet updated; see "*v18-v2*" in CapacitySummary (please send any discrepancies to Rob)
      • Note: current capacity is a bit larger - to be updated in v19
      • Some numbers (a per-slot cross-check sketch follows at the end of this list):
        Site group           Job slots      HS06
        US ATLAS Facility       16,748   165,982
        Tier 1 Center            4,996    61,810
        Tier 2 Centers          11,752   104,172
      • Some snapshots:
        • Job slots ending 2010:
          screenshot_03.jpg
        • HEPSPEC 2006 capacity:
          screenshot_01.jpg
      • Summary of US ATLAS operations management review comments w.r.t. Tier 2 centers: they have performed very well, planning is well-organized, and the cost-benefit analysis was viewed as a huge success. No negative comments or specific recommendations.
      • However, we will likely not get the anticipated funding - probability of a 5% cut, which we'll have to deal with (1.5M).
      • Some capacities are beyond pledge - there is a question of reserving these for US physicists; this is technically already implemented and activated at AGLT2. Kaushik: still some bugs to work out.
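      • For reference, a quick cross-check of the capacity numbers above - a minimal Python sketch; the per-slot averages are derived here from the figures quoted in the table, not taken from the CapacitySummary spreadsheet, which remains the authoritative source:
        # Per-slot HS06 averages derived from the v18-v2 capacity numbers quoted above.
        capacity = {
            "US ATLAS Facility": {"job_slots": 16748, "hs06": 165982},
            "Tier 1 Center":     {"job_slots": 4996,  "hs06": 61810},
            "Tier 2 Centers":    {"job_slots": 11752, "hs06": 104172},
        }

        for name, c in capacity.items():
            per_slot = c["hs06"] / float(c["job_slots"])   # average HS06 per job slot
            print("%-20s %6d slots  %7d HS06  %5.1f HS06/slot"
                  % (name, c["job_slots"], c["hs06"], per_slot))
        # Note: the Tier 1 and Tier 2 rows add up to the facility totals (16,748 slots, 165,982 HS06).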

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

Tier 3 References:
  • The link to ATLAS T3 working groups Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

General Tier 3 issues

last week(s):
  • Continuing to update the list of Tier 3s - web page with status
  • Doug was at Napoli
  • Xrootd-OSG-ATLAS meeting yesterday - progress from VDT on rpm packaging, gridftp-plugin, basic configuration
this week:
  • xrootd rpm under test, needs work.
  • Arizona coming online.
  • Rik is migrating out of Tier 3 management to analysis support, but will stay closely involved since T3s and analysis are closely related.
  • How many T3 Panda sites?

Tier 3 production site issues

  • Bellamine University (AK):
    • last week(s):
      • IT director is working with a consulting firm to find a solution; the hangs have been removed.
      • Horst - running transfer tests, no hangs. 5 MB/s each direction. Throttled slightly.
      • Hope to get to 20 MB/s in near term.
    • this week:
      • Packet shaper partitioned and working well now.
      • Simultaneous inbound and outbound transfers with the same host; Horst: all is okay.
      • SRM transfers to try next.
      • 50 Mbps in both directions simultaneously
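      • Since the throughput figures above mix MB/s and Mbps, a small conversion may help - a minimal Python sketch treating 1 MB/s as 8 Mbps; the 5 MB/s, 20 MB/s, and 50 Mbps values are the ones quoted above:
        def mbytes_to_mbits(mb_per_s):
            # Convert a MB/s rate to Mbps (1 byte = 8 bits).
            return mb_per_s * 8.0

        def mbits_to_mbytes(mbit_per_s):
            return mbit_per_s / 8.0

        print(mbytes_to_mbits(5.0))    # last week's test rate: 5 MB/s  = 40 Mbps
        print(mbytes_to_mbits(20.0))   # near-term goal:        20 MB/s = 160 Mbps
        print(mbits_to_mbytes(50.0))   # this week's figure:    50 Mbps = 6.25 MB/s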

  • UTD (Joe Izen)
    • last week(s):
      • No LFC errors this week, for the first time
      • In production most of the week
      • Caught 11 lost-heartbeat jobs overnight. At the moment offline due to the power blackout in Texas.
    • this week
      • From Joe: UTD is running smoothly, although we are not at full strength. We had a rash of hard drive failures on our older "fester" cluster. Disks are still under warranty, but we don't keep spares on hand - resources saved for our new c6100 cluster which isn't online yet. No outstanding tickets. Progress on the new cluster is slower than I'd like. Our sys-admin is splitting time with another group that just received new hardware.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • Still in downtime for database upgrade. Waiting to turn back on.
    • Distributed analysis up and down as usual, but production is stable.
    • ESD access may be restricted once the LHC restarts. Should discuss locally, and work through physics groups. Still in the information gathering stage.
  • this week:
    • Running out of jobs - down to US and Canada now. There is a new request for 80M simulation events; Borut will define these soon.
    • Cross-cloud production, promoted by Rod Walker. MWT2 and AGLT2 are set up. This smooths production of high-priority tasks.
    • Simultaneously, Simone is performing sonar tests. Need to go beyond the star channel.
    • LHC wide area connectivity becoming important.
    • Distributed analysis will be much more of a challenge; PD2P will be transferring the entire dataset. Q: could this be improved to transfer only the parts users request? Jobs could be sent to multiple Tier 2s if the dataset could be split; this is not favored - smaller datasets are better. (A toy cache sketch follows this list.)
    • Beyond-pledge issue: parameters in schedconfig not getting transferred into pilot, dependent on Condor-G.
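    • Not discussed in the meeting, but as a toy illustration of the "temporary, centrally managed cache space" idea behind extreme PD2P mentioned above - a minimal Python sketch in which least-recently-used datasets are evicted when a site's cache fills; the dataset names and sizes are hypothetical, and real placement decisions are made centrally by PanDA/DQ2:
      from collections import OrderedDict

      class DatasetCache:
          # Toy LRU cache: evict least-recently-used datasets when a new
          # replica would exceed the site's cache space (illustrative only).
          def __init__(self, capacity_tb):
              self.capacity_tb = capacity_tb
              self.datasets = OrderedDict()          # dataset name -> size in TB

          def used_tb(self):
              return sum(self.datasets.values())

          def access(self, name, size_tb):
              if name in self.datasets:
                  self.datasets.move_to_end(name)    # mark as recently used
                  return
              while self.datasets and self.used_tb() + size_tb > self.capacity_tb:
                  evicted, _ = self.datasets.popitem(last=False)
                  print("evicting %s" % evicted)
              self.datasets[name] = size_tb

      cache = DatasetCache(capacity_tb=100)
      cache.access("data11_7TeV.hypothetical.AOD", 40)   # cached
      cache.access("mc10_7TeV.hypothetical.ESD", 70)     # evicts the first dataset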

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=125577
    
    1)  2/3: AGLT2 - job failures (stage-out errors) & DDM transfer failures.  From Shawn: Last night I was working on getting the head02 setup as similar as possible to old head02.  
    I installed yum-autoupdate as part of the process.  This morning it upgraded postgres90 from 9.0.2-1 to 9.0.2-2.  The problem is the version on head02 is custom built.  This caused 
    postgresql to shutdown around 6:50 AM.  I reverted, put the exclude into /etc/yum.conf and got things running again.
    Also, there was a brief network outage which resulted in many "lost heartbeat" errors.  Everything resolved by ~noon CST.  eLog 21718.
    2)  2/3: MWT2_UC - job failures with lost heartbeat & stage-in errors.  From Nate at MWT2: We had a network outage at IU which caused those lost heartbeats. The nodes are still 
    down until someone there can replace the switch.  eLog 21729.
    3)  2/3: US sites HU_ATLAS_Tier2, UTA_SWT2, SWT2_CPB - job failures due to a problem with atlas release 16.6.0.1.  Xin reinstalled the s/w, issue resolved.  ggus 66992-94, 
    RT 19389-91 tickets closed, eLog 21732-34.
    4)  2/4-2/5: BNL-OSG2_DATADISK transfer errors such as "failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]. Givin' up after 3 tries]."  Issue was 
    due to excessive load on the dCache pnfs server - now resolved.  ggus 67005 closed, eLog 21745.
    5)  2/5: Number of running production jobs in the U.S. cloud temporarily decreased - from Michael: The reason for the reduced number of running jobs was a file system on one of 
    the Condor-G submit hosts filled up earlier today. An alarm was triggered and Xin started cleaning up the filesystem a couple of hours ago. You will see the US cloud at full 
    capacity shortly.  eLog 21797.
    6)  2/5-2/6: SWT2-CPB-MCDISK file transfer failures.  Issue understood and resolved - from Patrick: The SRM failed when the partition containing bestman filled up due to logging.  
    The logs were removed and the srm restarted.  ggus 67070 / RT 19394 closed, eLog 21902.
    7)  2/6: MWT2_UC - job failures with the error "Can't find [AtlasProduction_16_0_3_6_i686_slc5_gcc43_opt]."  Xin was eventually able to install this cache (initially had a problem 
    accessing the CE due to a load spike) - issue resolved.  ggus 67074 closed, eLog 21856.
    8)  2/7: IllinoisHEP lost heartbeat job failures.  From Dave at Illinois: These were caused by a problem on our NFS server early this morning.  The problem was fixed, but only 
    after the currently running jobs all failed.  ggus 67121 closed, eLog 21907.
    9)  2/8: NET2_DATADISK - failing functional tests with "failed to contact on remote SRM" errors.  Issue resolved - from Saul: Fixed (bestman needed a restart when we updated 
    our host certificate).  ggus  67145 closed, eLog 21912.
    10)  2/8: OU_OCHEP_SWT2_DATADISK failing functional tests with "failed to contact on remote SRM" errors.  Horst couldn't find an issue on the OU end, and subsequent 
    transfers were succeeding.  ggus 67146 closed, eLog 21913.
    11)  2/8: FTS errors for transfers to a couple of U.S. cloud sites.  The messages indicated a full disk on the FTS host: "ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [AGENT 
    error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] cannot create archive repository: No space left on device]."  Issue resolved by Hiro.  ggus 67132 closed, eLog 21905.
    
    Follow-ups from earlier reports:
    (i)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site 
    when the first jobs started up.  Once the transfer completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (ii)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (iii)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  
    Also https://savannah.cern.ch/bugs/index.php?77139.
    1/25: Update from Shawn:
    I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and 
    you can track the "repair" at http://bourricot.cern.ch/dq2/consistency/.  Let me know if there are further issues.
    Update 1/28: files were declared 'recovered' - Savannah 77036 closed.  (77139 dealt with the same issue.)  ggus 66150 in-progress.
    (iv)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running 
    on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro: There is a known issue for users with Israel CA having problem accessing 
    BNL and MWT2. This is actively investigated right now. Until this get completely resolved, users are suggested to request DaTRI request to transfer datasets to some other 
    sites (LOCAGROUPDISK area) for the downloading.
    (v)  1/21: File transfer errors from ALGT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during 
    TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]."  https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
    (vi)  1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value."  Consolidated into a single goc ticket, 
    https://ticket.grid.iu.edu/goc/viewer?id=9871.  Will be resolved in a new OSG release currently being tested in the ITB.
    (vii)  1/30: AGLT2 - job (stage-out: "Internal name space timeout lcg_cp: Invalid argument") & file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  
    From Shawn: This morning around 8 AM Eastern time our postgresql server for the dCache namespace (Chimera) filled its partition with logging info (over 10 GB in the last 24 hours). This was 
    traced to multiple attempts to re-register a few files over and over.  We have cleaned up space on the partition and modified the logging to be "terse" so this won't happen as easily in the future.  
    ggus 66794 in-progress, eLog 21616.
    Update 2/3: issue resolved by reducing the level of postgresql logging.  ggus 66794 closed, eLog 21717.
    (viii)  2/2: UTD-HEP set off-line at request of site admin.  Rolling blackouts in the D-FW area (unfortunately).  eLog 21702.
    Update 2/8: site recovered from power issues - test jobs completed successfully - set back on-line.  eLog 21901,
    https://savannah.cern.ch/support/index.php?119022.
    (ix)  2/2: WISC_DATADISK - failing functional tests with file transfer errors like " Can't mkdir: /atlas/xrootd/atlasdatadisk/step09]."  ggus 66897 in-progress, eLog 21695.
    Update 2/4: Site admin reported issue was resolved.  No more errors, ggus 66897 closed.
    
    • Automatic release installation is not yet deployed everywhere. It would be nice to get all sites consistently using the same system.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=127502
    
    1)  2/9: Problems accessing the panda monitor.  Issue was traced back to a faulty NAT configuration on the monitor machines - now solved.  https://savannah.cern.ch/bugs/?77992, eLog 21950.
    2)  2/9:  Bob at AGLT2: From 15:12 to 16:00 or so we had a network snafu at AGLT2 that took our primary DNS server off-line.  Some 70 jobs were lost, but all else seems to have recovered satisfactorily.  
    During part of this time I set auto-pilots off to keep load away from our gate-keeper.  These have now been turned back on.
    3)  2/9: Jobs were failing at several U.S. sites with "transfer timeout" errors.  Issue understood - from Yuri: Most of the jobs succeeded the 2-nd attempts. Output file transfer failed because these are 
    validation tasks (high priority), so they have very low transfer timeouts in Panda as such tasks could run at T1s. If they go to T2s and the storage or network is busy, then these jobs will fail due to timeout 
    (Another possibility is to increase these timeouts in Panda).  https://savannah.cern.ch/bugs/index.php?78000, eLog 21981.
    4)  2/10: File transfer errors between BNL & RAL - " [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]."  Issue was reported as solved (intermittent 
    problem on the wide area network link between RAL and BNL), but later recurred (high load on the  dCache core servers), and the ticket was re-opened.  ggus 67214 in-progress, eLog 21973.
    5)  2/10: HU_ATLAS_Tier2 - job failures with the error "Could not update Panda server, EC = 7168," and eventually "Pilot received a panda server signal to kill job."  Eventually the problem went away.  
    ggus 67237 closed, eLog 21978.  (Note: a similar issue was reported at FZK in the DE cloud - see: https://gus.fzk.de/ws/ticket_info.php?ticket=66783.)
    6)  2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN state one day after 
    updating.  Thus it is recommended that sites defer upgrading their OSG installations until a fix is released.  See: http://osggoc.blogspot.com/
    7)  2/12 - 2/14: Atlas s/w installation system off-line 12/02/2011 16:00 UTC to 14/02/2011 15:00 UTC, due to electrical power maintenance which affected all of the Tier2's in Roma.
    8)  2/13: WISC_DATADISK & _LOCALGROUPDISK file transfer errors.  ggus 67276 was initially (erroneously) opened and directed to BNL, but the problem was actually on the WISC end.  This ticket 
    was closed, and issue followed in ggus 66280.  Site admin (Wen) reported the issue was resolved - all tickets now closed.  eLog 22037.
    9)  2/13: BNL-OSG2_MCDISK to NDGF-T1_MCDISK file transfer errors (source) - "[INTERNAL_ERROR] Source file/user checksum mismatch]."  Not a BNL problem - Stephane pointed out that the file 
    in question was corrupt in all sites (i.e., file was corrupted at the time it was generated).  See details in eLog 22057.  ggus 67251 closed.
    10)  2/14: The new DN:
    /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management
    now being used for DDM Functional Tests.  eLog 22046.  (Also gradually being rolled out in various clouds for site services.)
    11)  2/14: MWT2 postponed maintenance originally scheduled for 2/15 in order to participate in tier-2 testing.
    12)  2/14: AGLT2 maintenance outage (network optimizations, firmware upgrades, OSG upgrades, other tasks).  Work completed in the evening - test jobs successful, production & analysis queues 
    set back on-line.  eLog 22100.
    13)  2/15: BNL-OSG2_DATADISK file transfer failures - "[AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] cannot create archive repository: No space left on device]."  Hiro 
    reported the issue was resolved - ggus 67492 closed, eLog 22105.
    14)  2/15: HU_ATLAS_Tier2 - John reported an overheating problem (glycol failure) that caused some filesystem hosts to shutdown.
    Issue resolved - systems back on-line.  eLog 22111, https://savannah.cern.ch/support/index.php?119248.
    15)  2/15: BU_ATLAS_Tier2o & OU_OCHEP_SWT2 - job failures with the error "SFN not set in LFC for guid (check LFC server version)."  Seems to have been a transient issue - disappeared.  
    ggus  67501 / RT 19464 closed, eLog 22133.
    16)  2/15: WISC_DATADISK - file transfer errors due to a certificate problem (could not map the new 'ddmadmin' DN).  Wen added the entry to the local mapfile - issue resolved.  
    ggus 67495 closed, eLog 22137.
    
    Follow-ups from earlier reports:
    
    (i)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (ii)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  
    Also https://savannah.cern.ch/bugs/index.php?77139.
    1/25: Update from Shawn:
    I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can 
    track the "repair" at http://bourricot.cern.ch/dq2/consistency/
    Let me know if there are further issues.
    Update 1/28: files were declared 'recovered' - Savannah 77036 closed.  (77139 dealt with the same issue.)  ggus 66150 in-progress.
    (iii)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on 
    t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro: There is a known issue for users with Israel CA having problem accessing BNL 
    and MWT2. This is actively investigated right now. Until this get completely resolved, users are suggested to request DaTRI request to transfer datasets to some other sites 
    (LOCAGROUPDISK area) for the downloading.
    (iv)  1/21: File transfer errors from ALGT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during 
    TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]."  https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
    (v)  1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value."  Consolidated into a single goc ticket, 
    https://ticket.grid.iu.edu/goc/viewer?id=9871.  Will be resolved in a new OSG release currently being tested in the ITB.
    (vi)  2/6: AGLT2_PRODDISK to BNL-OSG2_MCDISK file transfer errors (source) - " [GENERAL_FAILURE] RQueued]."  ggus 67081 in-progress, eLog 21935.
    
    • Transfer timeout issue with higher-priority tasks - there is a tighter constraint on the timeout definition for these; jobs succeed on the second or third attempt (see the retry sketch at the end of this list).
    • Maintenance in software distribution system over the weekend.
    • Some of the carryover issues above probably resolved.
    • Hiro: the Panda monitor is falsely reporting failures over a two-week window, when it used to be a day.
    • Hiro: high-priority tasks would be more likely to succeed if given a larger share in DDM. Can Panda request a higher priority with a different share? Kaushik will look into it.
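    • As a rough illustration of the timeout/retry behavior described above - a minimal Python sketch, with a hypothetical stand-in for the output transfer and made-up timeout and retry values (not Panda's actual settings):
      import random

      def attempt_transfer(timeout_s):
          # Hypothetical stand-in for an output-file transfer: it succeeds only if
          # the (random) time the transfer needs fits within the per-attempt timeout.
          needed_s = random.uniform(100, 400)
          return needed_s <= timeout_s

      def transfer_with_retries(timeout_s=180, max_attempts=3):
          for attempt in range(1, max_attempts + 1):
              if attempt_transfer(timeout_s):
                  return "succeeded on attempt %d" % attempt
          return "failed after %d attempts" % max_attempts

      print(transfer_with_retries())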

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • The new DDM proxy - a global proxy - is causing problems at some sites (SLAC and UTA) because the DN contains an email address. Will send out a status email later.
    • Whether the email address field appears in the DN seems to be tool-dependent, e.g. voms-proxy-info vs. openssl (a DN-inspection sketch follows this list).
    • Can manually map this as a work-around in GUMS
    • Could be solved by different certificate
    • Template at OSG does not have /atlas/role=production mapped to usatlas1
    • NERSC - needs a person, not a robotic certificate; Doug will put Hiro in contact with the responsible person at NERSC.
    • Interesting dq2 use case circulated
    • Shawn noticing users causing SRM failures (as opposed to production)
    • Simone doing commissioning tests between various sites - studies for "flatter, mesh-like topology"
  • this meeting:
    • Proxy issue: https://gus.fzk.de/ws/ticket_info.php?ticket=67144; resolved with new cert without the email address. Expect this to work, will test.
    • Decommissioning of MCDISK has been completed - removed from Tiers of ATLAS. So local cleanups can take place.
    • Hiro will send mail about including T3s in the sonar tests - especially the production sites.
    • There are a number of jobs running at MWT2 within the CERN cloud, with ~1700 in the transferring state. Michael: what about the accounting - will the US still get credit for these jobs?
    • CERN will become an 11th Tier 1.
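    • A quick way to check whether a certificate's subject DN carries an email address component - a minimal Python sketch assuming the openssl binary is on the PATH; the certificate path is hypothetical, and voms-proxy-info may render the same DN differently, which is the tool-dependence noted above:
      import subprocess

      def subject_dn(cert_path):
          # Ask openssl for the certificate subject as a one-line DN string.
          out = subprocess.check_output(
              ["openssl", "x509", "-in", cert_path, "-noout", "-subject"])
          return out.decode().strip()

      def has_email_component(cert_path):
          # openssl prints the component as 'emailAddress=' when it is present.
          return "emailAddress=" in subject_dn(cert_path)

      # Example (path is hypothetical):
      # print(has_email_component("/tmp/x509up_u500"))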

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Last week's meeting was skipped - the perfSONAR performance matrix was sent around. Sites are requested to please follow up.
    • LHCOPN meeting tomorrow in Lyon - a need for better monitoring; Jason will send summary notes.
    • DYNES - there will be a phased deployment: first the PI and co-PI sites, then 10 sites at a time, etc. There was an announcement at Joint Techs last week. The goal is to deploy all sites in the instrument by the end of the year. There may be a separate call for additional participants. Everyone who applied has been provisionally accepted.
  • this week:

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Development release in dq2 for the physical path
  • rpms from OSG - adler32 bug fixed; will work on testing re-installation
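  • For reference, computing the adler32 checksum that DDM compares against the catalog value - a minimal Python sketch using the standard zlib module; the file path is hypothetical and the zero-padded 8-hex-digit formatting follows the usual convention:
    import zlib

    def adler32_of_file(path, blocksize=1024 * 1024):
        # Stream the file through zlib.adler32 and return an 8-hex-digit string.
        value = 1                      # adler32 starting value
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                value = zlib.adler32(block, value)
        return "%08x" % (value & 0xffffffff)

    # Example (path is hypothetical):
    # print(adler32_of_file("/path/to/output.pool.root"))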
this week:

Site news and issues (all sites)

  • T1:
    • last week(s): Will be adding expansion chassis for the IBM storage servers tomorrow. Due to this maintenance work to add more capacity, some of the storage will go offline for a short period tomorrow (Feb 10th). Since only a small fraction of the storage servers are affected, there is no scheduled downtime associated with this activity; however, users (/production/DDM) are expected to experience sporadic connectivity problems, particularly for reading. The impact on writing should be minimal, if any. Hiro: these are NEXAN shelves. Hiro is busy testing the storage - finding some strange prefetching performance (ZFS).
    • this week: The new version of the OSG software stack has a faulty RSV probe, which affected availability. Storage expansion of ~2 PB is almost coming online. BNL now has 72 fibers with north and south shore routing into NYC. An additional 10G circuit will be coming online in March. 100G to BNL: equipment has been installed (optical switches) - mostly alpha equipment, so caution; expected to be part of the ESnet testbed maybe late next year, with late 2013/14 for production.

  • AGLT2:
    • last week: Downtime next Monday to finish up tasks in advance of data taking. New SAS SSDs to arrive. Networking on VMWare systems suspect - requires full shutdown rather than simple restart. Also finding some packet loss to/from Condor-VM job manager, perhaps due to incomplete spanning-tree configuration (to be fixed on Monday). Working with Condor team to build some robustness.
    • this week: Took downtime on Monday and took care of a number of issues. dCache 1.9.10-4 on the head nodes. Multiple spanning tree is running, cleanly talking with Cisco now. SSDs for the dCache heads are delayed. PERC card firmware updates complete. MSU - no spanning tree there, to eliminate a loop.

  • NET2:
    • last week(s): pcache issue on the BU side. Release problem at HU with 16.0.2, 16.0.3. Will purchase Tier-3 equipment from Dell (for ATLAS and CMS). Will ramp up analysis production at HU - will require a 10G NIC at BU. There were some fiber channel problems - investigating.
    • this week: HU Panda queues were down for a couple of hours due to an AC problem in the room which houses the fileservers. perfSONAR tuned up. Ordered a 10G switch in order to run full analysis jobs at HU; analysis capacity can't ramp up until the switch is installed (it has just been ordered).

  • MWT2:
    • last week(s): Working on connectivity/network issue with new R410s.
    • this week: Connectivity/network issue with new R410s at UC resolved (static routes), nodes working fine. Postponed downtime so as to participate in "Big Tier2" testing by Rod - jobs run from CA and DE clouds (but at low levels). Next downtime will involve a dCache update. Otherwise all is fine.

  • SWT2 (UTA):
    • last week: Iced-in last week. Will work on mapping issue as discussed above. Will take a downtime in the next week.
    • this week: Quiet in the past week. Replaced installations on perfsonar hosts, updated w/ latest patches. Production running smoothly. Will be getting back to federated xrootd.

  • SWT2 (OU):
    • last week: Shutdown last week, and today. Working on mapping issue.
    • this week: Waiting for Dell to install the extra nodes: 18 dual quad-core X5620s with 32 GB RAM.

  • WT2:
    • last week(s): All is well. Setting up a PROOF cluster; the hardware is set up: 7 nodes, each with 16 cores, 24 GB memory, and 12x2 TB disks. In April there will be power outages at SLAC. Considering a 6248 to provide gigabit; may use an 8024F for aggregation.
    • this week: All is running well with light load at SLAC. We handed over a PROOF cluster for Tier 3 users to run test jobs. We are discussing two power outages in April and one in May (and three more in April that are not supposed to affect us).

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report(s)
    • AGLT2 now running Alessandro's system - now in automatic installation mode. Will do other sites after the holiday.
    • MWT2_UC is using the new system, but only for the release 16 series. If it works well, the new system will be enabled more broadly.
    • Next site - BU - once BDII publication issue resolved, will return to this.
    • WT2, IU, SWT2 - depending on Alessandro's availability.
  • this meeting:
    • IU and BU have now migrated.
    • 3 sites left: WT2, SWT2-UTA, HU
    • Waiting on confirmation from Alessandro; have requested completion by March 1.

AOB

  • last week
  • this week


-- RobertGardner - 15 Feb 2011


Attachments


  • screenshot_03.jpg (76.2 K) - RobertGardner, 16 Feb 2011 - 11:19
  • screenshot_01.jpg (528.0 K) - RobertGardner, 16 Feb 2011 - 11:19