
MinutesDec15

Introduction

Minutes of the Facilities Integration Program meeting, Dec 15, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Dave, Fred, Michael, Patrick, Tom, Doug, Torre, Saul, Armen, Kaushik, Mark, Shawn, Sarah, Rob, Aaron, Nate, Charles, Wei, Bob, Horst, Alden
  • Apologies: Rik, Jason

Integration program update (Rob, Michael)

  • IntegrationPhase15
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s)
      • SiteCertificationP15 - some updates
      • Dynes template due
      • Lots of discussion last week about how computing will be organized in the next year, given a year of experience - changes, further adjustments for PD2P, next-gen computing element
    • this week
      • Nearing end of current fiscal quarter FY11Q1, IntegrationPhase15
      • Ask that sites fill in fabric deployments for this quarter: FabricUpgradeP15
      • New HC results for BNL and MWT2 - http://hammercloud.cern.ch/atlas/10002202/test/
        • More jobs scheduled at BNL - the time window might have been competing with production at MWT2; both highly efficient
        • CPU/walltime varies significantly (38% versus 67%); both are dCache sites (stage-in and direct access).
        • What is interesting is that the event rates (events/walltime) are very different. Note that the time to download input files also differs considerably: the average stage-in time at BNL is about 40 minutes. (See the sketch at the end of this section.)
        • Have results now for BNL, AGLT2, NET2, MWT2, SWT2; still need to add SLAC
      • SiteCertificationP15
      • Continue to use usatlas-t2-l@lists.bnl.gov for downtime announcements
      • Things have slowed a bit, a good time for taking downtimes. Want to get back to stable operations by end of January, early February.
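      • To make the two figures of merit above concrete, here is a minimal sketch (in Python) of how CPU/walltime efficiency, event rate, and average stage-in time can be computed from per-job records. The field names and numbers are illustrative placeholders, not values from the HammerCloud test itself.
          # Minimal sketch: per-site CPU/walltime efficiency and event rate from
          # hypothetical job records. Field names and values are illustrative only.
          from collections import defaultdict

          jobs = [
              # site, CPU seconds, wall seconds, events, stage-in seconds (all hypothetical)
              {"site": "BNL",  "cpu": 3100, "wall": 8200, "events": 5000, "stagein": 2400},
              {"site": "MWT2", "cpu": 5400, "wall": 8100, "events": 5000, "stagein": 300},
          ]

          totals = defaultdict(lambda: {"cpu": 0.0, "wall": 0.0, "events": 0, "stagein": 0.0, "njobs": 0})
          for j in jobs:
              t = totals[j["site"]]
              t["cpu"] += j["cpu"]
              t["wall"] += j["wall"]
              t["events"] += j["events"]
              t["stagein"] += j["stagein"]
              t["njobs"] += 1

          for site, t in totals.items():
              cpu_eff = t["cpu"] / t["wall"]            # CPU/walltime efficiency
              event_rate = t["events"] / t["wall"]      # events per wallclock second
              avg_stagein = t["stagein"] / t["njobs"]   # mean input stage-in time per job
              print(f"{site}: CPU/wall={cpu_eff:.0%}  events/s={event_rate:.2f}  "
                    f"avg stage-in={avg_stagein / 60:.0f} min")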

CVMFS mirror hosting (Doug)

  • BNL, CERN, LBL people attending
  • Few hundred gigabytes, with future growth
  • Public access required, two in the US
  • Some development for replication
  • Two or more squid proxy machines, web server, storage system
  • Predrag will provide recommended hardware
  • RAL will provide setup instructions and will set up replication (expected in January)
  • Start at BNL in January
  • LBL - will be a second site
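  • As a rough illustration of how replica freshness could be monitored once the mirror is running, the sketch below compares the published revision of a repository at the origin server and at a replica. It assumes the usual CVMFS convention that each repository serves a .cvmfspublished manifest over plain HTTP whose 'S' line carries the revision number; the host names and repository name are placeholders, not the actual BNL/LBL setup.
      # Minimal sketch: compare a CVMFS repository revision between an origin
      # server and a mirror. Host names and repository name are placeholders.
      import urllib.request

      ORIGIN = "http://cvmfs-origin.example.org"   # hypothetical origin server
      MIRROR = "http://cvmfs-mirror.example.org"   # hypothetical replica
      REPO = "atlas.example.org"                   # hypothetical repository name

      def published_revision(base_url):
          """Fetch .cvmfspublished and return the revision from its 'S' line."""
          url = f"{base_url}/cvmfs/{REPO}/.cvmfspublished"
          with urllib.request.urlopen(url, timeout=10) as resp:
              for line in resp.read().decode("ascii", "replace").splitlines():
                  if line.startswith("S"):          # assumed to tag the revision number
                      return int(line[1:])
          raise RuntimeError(f"no revision line found at {url}")

      origin_rev = published_revision(ORIGIN)
      mirror_rev = published_revision(MIRROR)
      state = "in sync" if mirror_rev >= origin_rev else "replica is behind"
      print(f"origin={origin_rev} mirror={mirror_rev}: {state}")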

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Twiki page setup at CERN: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasXrootdSystems
  • Some notes from VDT-ATLAS-OSG meeting
    • Attending: Tanya, Tim, Andy, Wei, Charles, Marco, Rob
    • progress on rpm packaging in VDT
    • symlinks are now set up correctly.
    • Still do not have a configuration package - do we need a specific ATLAS configuration set?
    • http://vdt.cs.wisc.edu/internal/native/vdt-rpm-checklist.html
    • Goal is to get the vanilla parts to the 'platinum' level, with the source going into the standard source repository (future)
    • New set of xrootd, xrootdFS rpms at the 'silver' level should be ready for testing; uses xrootd release 03151007.
    • Build process - uses configure-classic
    • Andy: autotools versus classic build an issue in the newest release; hope to resolve this week.
    • Andy: init.d scripts from Brian Bockelman - not yet incorporated
    • Andy: No man pages are available; putting all of this on the web. Charles may convert the HTML to man pages.
    • Tim: working on new configuration system for xrootd.
    • Andy: hope to have a new release soon.
    • ATLAS-Tier 3: any status updates? Tanya will contact Doug.
    • ATLAS demonstrator project - discussed heavily at last week's SW week. X509 authentication will need to be incorporated at some point for production; it is not needed for the demonstrator project.
    • Charles will setup a local redirector at UC
    • Next meeting: January 11, 2011
    • Tanya: will discuss with Marco documentation changes required for rpm.
  • Charles - working on testing and performance
  • Plugin for POSIX sites - needed for SMU, GPFS
  • Two sites in Spain are willing to participate
  • X509 authentication will be a requirement - but we have to figure out what this means
  • Centralized monitoring mpxstats-plugin, and central ATLAS monitoring dashboard
this week:
  • Valencia has requested to join - they want instructions.
  • Charles - working on testing - getting jobs through the analy_x queue (a simple smoke-test sketch follows this list)
  • Will work with SMU
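  • For basic local testing against a site redirector, something along the lines of the sketch below could be used: it shells out to xrdcp and times a single transfer through the redirector. The redirector host, port, and file path are hypothetical placeholders, and the xrootd client tools are assumed to be installed; this is only an illustration, not the actual test harness.
      # Minimal sketch: time a copy of one test file through a local xrootd
      # redirector using the xrdcp client. Host, port, and path are placeholders.
      import os
      import subprocess
      import tempfile
      import time

      REDIRECTOR = "root://xrd-redirector.example.org:1094"   # hypothetical redirector
      TEST_FILE = "/atlas/test/smoke-test-file.root"          # hypothetical file path

      def fetch_once():
          dest = tempfile.NamedTemporaryFile(delete=False).name
          start = time.time()
          result = subprocess.run(["xrdcp", "-f", f"{REDIRECTOR}/{TEST_FILE}", dest],
                                  capture_output=True, text=True)
          elapsed = time.time() - start
          size = os.path.getsize(dest) if result.returncode == 0 else 0
          os.unlink(dest)
          return result.returncode, elapsed, size

      rc, elapsed, size = fetch_once()
      if rc == 0:
          print(f"OK: {size / 1e6:.1f} MB in {elapsed:.1f}s ({size / 1e6 / elapsed:.1f} MB/s)")
      else:
          print(f"FAILED (exit code {rc}) after {elapsed:.1f}s")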

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

References
  • The links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
last week(s):
  • Many sites are getting their hardware and are setting things up
  • UCI - working to tweak their configuration, for example taking into account local switching infrastructure
  • Tier 3 policy - working to have minimal impact. There will be requirements for testing, depending on which services are used; all Tier 3 sites must register in AGIS. Hopefully Tier3g will have minimal requirements. New sites that want to be part of production, have a dependence on LFC, etc., will have a week of testing.
  • FTS-plugin to check into CVS (Hiro)
  • CVMFS - taking off; request to set up a mirrored repository outside of CERN (at BNL: Michael, John DeStefano, and John Hover). Meeting next week to work out what this will involve.
  • CVMFS for conditions data repository at CERN, under testing
this week:
  • Hampton and Bellamine want to do production; they will need to do production testing
  • T3g's will get a blanket approval
  • Suggest using CVMFS for releases
  • Bellamine - has a 100 Mbps link, but it is reliable
  • Doug wants both Hampton and Bellamine to submit a proposal and be thoroughly tested before being turned on for official production
  • T3's associated with T2's are grandfathered in.
  • Will ask these sites to join the meeting - and review production performance before admitting into production

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting: not available (reason unknown)
    1)  12/3: BNL-OSG2_DATATAPE - transfer errors like "Job failed with forced timeout after 43200 seconds."  Issue understood - from Iris: the timeout is because the file is not staged to dCache disk yet. The staging service was stopped overnight 
    for a performance issue. It has been started now.  ggus 64929 closed, eLog 20268.
    2)  12/3: MWT2_UC_USERDISK - transfer failures with the error "[HTTP_TIMEOUT] failed to contact on remote SRM [httpg://uct2dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries]."  From Sarah: Our SRM door ran out of memory and 
    went unresponsive. I have restarted it with increased memory allocations.  RSV probes are passing and we should see transfers start to succeed.  ggus 64940 closed, eLog 20274.
    3)  12/3: New pilot release (SULU 45b) from Paul.  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU_45b.html
    4)  12/3: OU_OCHEP_SWT2 file transfer errors: "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3 tries]."  Horst reported that a problem with Lustre was fixed by DDN.  
    ggus 64922 / RT 18801 closed, eLog 20306.
    12/5:  Lustre/network problems reappeared.  RT 18818 / ggus 65018 in-progress, eLog 20362.
    12/8 from Horst: the Lustre server upgrade is mostly complete, but the clients haven't been upgraded, since that will require a complete downtime, which I don't want to do while nobody is around locally, so we'll do that next week some time.
    5)  12/3 - 12/4: All US sites drained due to a lack of job input files.  The issue was a pilot problem for pandamover as a result of the update Paul released earlier in the day on 12/3 (see 3) above).  A new pilot release was made (SULU 45c) 
    which fixed the problem.  See details in the thread contained in eLog 20321.
    6)  12/4: SWT2_CPB - ggus ticket 65008 / RT 18811 were opened due to a low level of transfer failures ("Connection timed out" errors), which was a transient problem and went away fairly quickly.  Tickets were closed, eLog 20333.
    7)  12/4: BNL - jobs failing due to missing release 15.6.13.4.  Alessandro recommended that the user switch to 15.6.13.6, as 15.6.13.4 is a known buggy version of the s/w.  See details here: https://savannah.cern.ch/bugs/index.php?76018.  eLog 20377.
    8)  12/4: File transfer errors between ILLINOISHEP_USERDISK to WISC_LOCALGROUPDISK.  Issue on the WISC end was resolved.  ggus 65014 closed, eLog 20348.
    9)  12/6: Shifters are requested to use the Panda Monitor rather than the ProdSys dashboard.  (Features, which are available in ProdSys dashboard, and are missing from the Panda Monitor, will be introduced to Panda Monitor soon.)
    10)  12/6: BNL - jobs were failing with the error "file is not in DDN pools."  Not a site issue, but instead related to the pilot problem in 5) above.  More details in eLog 20402, ggus 65034 closed.
    11)  12/6: AGLT2 - jobs were failing due to insufficient space on PRODDISK.  Queue temporarily set off-line.  Space added, and the site was unblacklisted / set to online later that day.  eLog 20412, https://savannah.cern.ch/support/index.php?118207.
    12)  12/7: (Minor) pilot update from Paul (SULU 45d).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU_45d.html
    13)  12/7: From Bob at AGLT2: 130 jobs would not reconnect this morning after their workers were disconnected from the network for several hours overnight.  126 were production, 4 were analysis.  Some 300 or so job slots were potentially involved in 
    this problem.  It is likely the 130, and perhaps as many as 350, jobs will show up as lost heartbeat during the upcoming timeout period for these jobs.
    14)  12/8: Oracle maintenance at CERN required that the Panda server be shutdown for ~one hour beginning at around 9:00 UTC.  Should not have a major impact, but expect some job failures during this period.
    
    Follow-ups from earlier reports:
    
    (i)  11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  May need to run a 
    job "by-hand" to get more detailed debugging information.  In-progress.
    Update 11/17: any news on this?
    Update 12/8: This error seems to have gone away.  Archive this item for now.
    (ii)  11/29 - 11/30: transfers to UTD_HOTDISK were failing with the error "Can't mkdir: /net/yy/srmcache/atlashotdisk/ddo/DBRelease/v13010101."  Issue was one or more failed hard drives in the storage.  Issue resolved, and test jobs have been submitted 
    to the site (which completed successfully).
    Update 12/6: ggus 64727 was closed, eLog 20078, 20295.
    (iii)  11/29 - 12/1: HU_ATLAS_Tier2 job failures with "Exception caught in pilot: (empty error string)" error.  From John at HU:
    Our site was not experiencing any more of these errors when I looked. We've finished over 1500 jobs with a 0% failure rate for the past 12 hours. This ticket can be closed.  ggus 64712 / RT 18740 closed, eLog 20212.  (Site was set offline by shifters 
    during this period - probably a bit overly aggressive, plus the site was not notified.)
    Update, 12/7: job failures with the error "pilot.py | !!FAILED!!1999!! Exception caught in pilot: (empty error string), Traceback (most recent call last):
    File "/n/atlasgrid/home/usatlas1/gram_scratch_TU82Q0BNvq/pilot.py," which was the original subject of ggus 64712 (now re-opened).  From John at HU:
    We are working with condor developers to see what we can do about the slow status updates of jobs. The grid monitor is querying the status of many
    thousands of very old jobs. This is causing load sometimes over 1000 on the gatekeeper and general unresponsiveness. My suspicion is that a timeout is causing the "empty error string". In any case, the site is dysfunctional 
    until this Condor-G issue can be fixed.  However, I'm also in the middle of making some changes and I can't see their effects unless the site is loaded like this, which is why I haven't set it to brokeroff.
    (iv)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment please do not use that site for 
    any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    (v)  11/30 afternoon: MWT2_UC & ANALY_MWT2 set offline in preparation for a maintenance outage.
    Update 12/1 from Aaron: maintenance outage completed as of ~9:30 p.m. CST.
    (vi)  11/30: MWT2_UC - atlas s/w release installations were failing due to a problem with the GCC setup at the site.  Xin re-installed the GCC, issue resolved.  (Production & analysis queues were set offline during this period, ~ 2 hours.)
    Update from Xin, 12/2: The installation of 16.2.1 and its caches finished successfully at MWT2_UC.
    
    • Ongoing issues with storage at OU - Lustre upgrade, Horst reporting.
    • Sites draining last week, caused by pilot bug, Paul fixed quickly.
    • Most open issues cited above have been resolved.
    • Release installation issues: Xin - there was a problem with the requirements file; the missing 16.0.2.7 cache was caused by a failed re-install attempt. Not all sites have been fixed - one or two left. Migration to the new installation system: now doing AGLT2; Alessandro working on fixing a problem with the validation. Next will be BU - which has a problem running WMS jobs due to missing BDII info. BU published multiple sub-clusters under one site name. Need help from an OSG-GIP expert. (BNL, OU have successfully migrated.)
    • Kaushik: word from shifters that we've got more reports with release issues. Xin - believes we'll get good support from Alessandro, and the new system has desirable features.
    • Michael notes that due to group production, the number of releases needed has increased dramatically. Not limited to the US.
    • Request to Xin: keep an eye on sites so that new releases are installed correctly and in a timely fashion. We also need expertise with Alessandro's system.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=116757
    
    1)  12/8: SWT2_CPB - RT 18834 was created due to OSG "failing one or more critical metrics."  Issue was with the containercert.pem / containerkey.pem files needing to be updated, which was causing problems with the GUMS servers.  
    Since resolved, RT ticket closed.
    2)  12/8: file transfer failures - MWT2_UC_USERDISK to MWT2_UC_LOCALGROUPDISK with SOURCE "AsyncWait" errors.  From Sarah: One of our pools went unresponsive this morning and had to be rebooted. It is operational again 
    and we should see transfers start to succeed.  ggus 65122 closed, eLog 20611.
    3)  12/9: UTD-HEP - job failures with errors like "Can't mkdir: /net/yy/srmcache/atlaslocalgroupdisk/user/mahsan/201012021617/..."
    From the site admin: We're in the process of cleaning up a large accumulation of dark data. This issue should be resolved shortly.
    ggus 65137 closed, eLog 20535.
    4)  12/9: file transfers from NET2_DATADISK to AGLT2_DATADISK were failing with the error "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed 
    to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server].  From Saul: Rebooting our gatekeeper caused the problems. Fixed.  ggus 65188/191 closed, eLog 20545/72.
    5)  12/9: From Bob at AGLT2: There is a cooling issue in the MSU server room.  We are idling down 3 racks of worker nodes, but if the room temp begins to rise uncontrollably, we will throw the switch and crash the workers.  
    Will update when we know more.  Later: Temperature is under control, but 3 racks are stuck now in condor peaceful retirement.  As those complete, I'll re-enable them to accept jobs.
    6)  12/9: job failures in the US cloud with "SFN not set in LFC for guid" errors - for example:
    pilot: Get error: SFN not set in LFC for guid 882A6BDC-F399-DF11-8B93-A4BADB532C99 (check LFC server version)
    Possibly related to panda db glitch on 12/8.  ggus 65158 closed,  eLog 20610.
    7)  12/10: shifter messages to 'atlas-support-cloud-US@cern.ch' were being forwarded to 'usatlas-grid-l@lists.bnl.gov' as well, which resulted in an overly wide distribution list.  Fixed.
    8)  12/10: From Bob at AGLT2: We have decided we are not ready for a downtime on Dec 15, so will put this off until (likely) some time in January.
    9)  12/10: File transfer errors with AGLT2 as the source - "[SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist] Duration [0]."  https://savannah.cern.ch/bugs/index.php?76323, eLog 20627.  
    Problem of missing files under investigation.
    10)  12/11: SWT2_CPB_PERF-EGAMMA - DaTrI requests were in the subscribed state for several days.  No real problem, as the dataset in question was eventually transferred.  Transfer speeds should improve once the full migration 
    to the new 10 Gb/s link is complete.  https://savannah.cern.ch/bugs/index.php?76329, eLog 20575.
    11) 12/15: BNL dCache maintenance - 9:30:00 => 14:00:00 EST - no access (read and write) during this period.
    12)  12/15: OU sites maintenance outage - from Horst: beginning at ~10 am CST for a few hours to upgrade our Lustre file system.
    
    Follow-ups from earlier reports:
    
    (i)  11/29 - 12/1: HU_ATLAS_Tier2 job failures with "Exception caught in pilot: (empty error string)" error.  From John at HU:
    Our site was not experiencing any more of these errors when I looked. We've finished over 1500 jobs with a 0% failure rate for the past 12 hours. This ticket can be closed.  ggus 64712 / RT 18740 closed, eLog 20212.  
    (Site was set offline by shifters during this period - probably a bit overly aggressive, plus the site was not notified.)
    Update, 12/7: job failures with the error "pilot.py | !!FAILED!!1999!! Exception caught in pilot: (empty error string), Traceback (most recent call last):
    File "/n/atlasgrid/home/usatlas1/gram_scratch_TU82Q0BNvq/pilot.py," which was the original subject of ggus 64712 (now re-opened).  From John at HU:
    We are working with condor developers to see what we can do about the slow status updates of jobs. The grid monitor is querying the status of many thousands of very old jobs. This is causing load sometimes over 1000 on the 
    gatekeeper and general unresponsiveness. My suspicion is that a timeout is causing the "empty error string". In any case, the site is dysfunctional until this Condor-G issue can be fixed.  However, I'm also in the middle of making 
    some changes and I can't see their effects unless the site is loaded like this, which is why I haven't set it to brokeroff.
    (ii)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment please do 
    not use that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    (iii)  12/3: OU_OCHEP_SWT2 file transfer errors: "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3 tries]."  Horst reported that a problem with Lustre was fixed by DDN.  
    ggus 64922 / RT 18801 closed, eLog 20306.
    12/5:  Lustre/network problems reappeared.  RT 18818 / ggus 65018 in-progress, eLog 20362.
    12/8 from Horst: the Lustre server upgrade is mostly complete, but the clients haven't been upgraded, since that will require a complete downtime, which I don't want to do while nobody is around locally, so we'll do 
    that next week some time.
    Update 12/15: see 12) above - outage to upgrade Lustre.
    (iv)  12/8: Oracle maintenance at CERN required that the Panda server be shutdown for ~one hour beginning at around 9:00 UTC.  Should not have a major impact, but expect some job failures during this period.
    Update 12/9: Problem occurred during the maintenance which led to a large number of failed jobs.  Issue eventually resolved.  See eLog 20495.
    
     
    • us-cloud-support list now reduced
    • AGLT2 transfer failures - some missing files
    • BNL - down for dCache upgrade
    • OU - down for upgrade

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Meeting notes
      	===========================================
      
      Attending: Shawn, Dave, Jason, Philippe, Sarah, Saul, Andy, Tom, Rob, Hiro
      Excused: Horst, Karthik
      
      1) Update on problem status
      	a) OU - No updates
      	b) BNL - No known issues right now.  Have to see if any of the OU problems are related to BNL as was sometimes discussed.
      	c) Illinois - No changes.
      
      2) perfSONAR in USATLAS status.   All sites installed and updated.  Site admins were asked to check/verify how things are running for their installations:
      	a) AGLT2_UM - All services running as they should.  Data looks good.  SLAC was re-added to tests.  Found an issue at UM via the latency packet loss data.  Looking into how to resolve (switch interconnect issue).
      	b) MWT2_IU - Firewall settings changed and lost specific port settings so latency tests are 1 way.  Sarah will look into it and restore original settings to allow bi-directional tests.
      	c) NET2_BU - All services are running.   Need to verify test results but things look operational.
      	d) AGLT2_MSU - All services running and results look reasonable.  Some possible issues between MSU and UM in terms of test scheduling.  The latency tests from MSU to UM are sparse...we need to figure out why.
      	Dell R410 box is available for perfSONAR developers.  New Internet2 perfSONAR person will try to have a look at the box soon and see if OWAMP tests can be isolated from BWCTL tests on the box.  Use irqbinding, numactl and other techniques to minimize interference between testing roles.  Also have access to a development server at UM for this work. Will check status at our next meeting.
      
      3) Throughput, Measurement and Monitoring
      	a) Tom reported on new tests which will verify the proper services are responsive for both latency and throughput nodes. Discussed some possible future tests concerning measured data to start providing alarms if potential problems are present.  	
           b) No one from SLAC on the call.  Problem with Nagios monitoring the SLAC instances may be related to the particular URL in the tests?  Tom will send Shawn the current test being used for SLAC so we can investigate further.
      	c) New tests will be rolled out as we finish implementing the next set of service tests.   Details to be worked on as we go. Idea for future test (Jason): go to DB directly if needed.  Will require NRPE (Nagios) probe which could be added into future perfSONAR versions.  Also could be 'yum installed' now if there is a need.
      	d) Hiro reported that he still plans to grab perfSONAR data and add it to his throughput plots but hasn't had time to try it yet.
      
      4) Round-table.   Any site issues...?  Very quiet out there...guess that is good!
      
      Given the holidays we plan to meet next in the new year on January 4th, 2011.
      
      Please send along any additions or corrections to the list.  Happy Holidays.
      
      Shawn
    • Current version of perfSONAR working at all sites
    • Nagios not able to query SLAC properly. No outbound port 80 from BNL, but this can be changed.
    • Once available, this will be able to monitor hosts at the sites (see the probe sketch at the end of this section).
    • DYNES template should be available today at http://www.internet2.edu/dynes
  • this week:
    • There are some nice Nagios pages set up by Tom at BNL
    • DYNES application deadline is today
    • Have equipment for 40 sites - there will be a site selection committee
    • Will also fund 14 regional networks
    • Completed applications: MSU, UM, UTA, IU, BU/HU/Tufts, UC, UW
    • SLAC encouraged to submit
    • OU is not sure whether one was submitted for the campus
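    • A probe of the kind discussed in item 3 of the meeting notes could be structured roughly as below: a Nagios-style check that fetches a perfSONAR service URL and maps the outcome onto the standard plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The URL and threshold are hypothetical placeholders; this only illustrates the plugin convention, not the probe actually being developed.
        # Minimal sketch of a Nagios-style probe: fetch a perfSONAR service URL
        # and exit with the standard plugin codes. The URL is a placeholder.
        import sys
        import time
        import urllib.request

        URL = "http://ps-latency.example.org:8085/toolkit/"   # hypothetical endpoint
        WARN_SECONDS = 5.0                                    # slow-response threshold

        def main():
            start = time.time()
            try:
                with urllib.request.urlopen(URL, timeout=15) as resp:
                    status = resp.status
            except Exception as exc:
                print(f"CRITICAL: {URL} unreachable ({exc})")
                return 2
            elapsed = time.time() - start
            if status != 200:
                print(f"CRITICAL: {URL} returned HTTP {status}")
                return 2
            if elapsed > WARN_SECONDS:
                print(f"WARNING: {URL} responded in {elapsed:.1f}s")
                return 1
            print(f"OK: {URL} responded in {elapsed:.1f}s")
            return 0

        if __name__ == "__main__":
            sys.exit(main())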

Site news and issues (all sites)

  • T1:
    • last week(s): The pilot installed on Friday caused US resources to drain and caused problems with Panda mover; Paul fixed the issue quickly. (Kaushik notes that this came as a request for default direct access for prun jobs at all sites. Question as to why this wasn't discussed.) CPU resources have been shifted back to analysis.
    • this week: BlueArc storage appliance to be evaluated, to be put on top of the DDN 9900; also a call to work on improving the DDN 10K for random access, which is important for direct reading. Torre is in contact with Yushu at BNL to run Panda machines on the Magellan cloud; a virtual machine has been set up; will need to set up a Panda site.

  • AGLT2:
    • last week: there was a problem with the spanning-tree config between the Dell and Cisco switches; Dec 15 downtime: new servers, dCache update.
    • this week: Having a strange problem where new datasets have missing files in the sub-group disk. No evidence that this was done by central deletion. Did dCache remove these files? It happened on Dec 10, according to the billing database. The files are still in the LFC. Nothing in the SRM logs.

  • NET2:
    • last week(s): Will need to look into the BDII issue for NET2 reporting. perfSONAR is online and up to the latest release. A problem at HU related to LSF-GRAM; Wei notes this was solved by caching queries. Planning for the Holyoke move in 2012.
    • this week: Just submitted DYNES application. Changed GIP configuration for a single sub-cluster, now Alessandro's installation is working.

  • MWT2:
    • last week(s): Space crunch passed. Continue to experience issues resulting from the Chimera upgrade - slowness of certain queries, and intermittent Chimera process crashes. Orders placed with Dell for compute nodes.
    • this week: The CIC OmniPoP router in Starlight stopped advertising routes for the MWT2_UC <--> MWT2_IU direct path; as a result, traffic between the sites was routed via Internet2 (the fallback route), causing a 4x increase in latency. This caused jobs at IU to run inefficiently over the weekend. Fixed on Monday (a latency-check sketch follows the site reports). Chimera configuration investigation continues; the domain locks up with deletes; the number of threads was lowered (in comparison with AGLT2) - cautiously optimistic. Some parts arrived for new equipment at UC.

  • SWT2 (UTA):
    • last week: All is well apart from an expired host certificate; may need to take a short downtime to move hosts over to the 10G circuit. May start experimenting with Bestman2: queue setup and depths.
    • this week: All is smooth.

  • SWT2 (OU):
    • last week: All is well. Working with Dell on new quotes for R410s (the exact number is not yet certain); 19 new R410s expected.
    • this week: Lustre upgrade nearly complete. Expect to turn site back on shortly. Will put into test mode first.

  • WT2:
    • last week(s): Running smoothly. One of the data servers failed - it recovered okay, but this is unsettling since it is not clear what caused the failure.
    • this week:
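  • As an aside on the MWT2 routing item above: a regression like that can be spotted with a simple round-trip-time check against a peer host, as in the sketch below; the host name and baseline value are hypothetical placeholders, not the actual MWT2 monitoring.
      # Minimal sketch: measure round-trip time to a peer site with ping and warn
      # when it drifts well above an expected baseline (e.g. after a route change).
      # Host name and baseline are placeholders.
      import re
      import subprocess

      PEER_HOST = "gridftp.peer-site.example.org"   # hypothetical peer endpoint
      BASELINE_MS = 10.0                            # expected RTT on the direct path

      result = subprocess.run(["ping", "-c", "5", PEER_HOST],
                              capture_output=True, text=True)
      # Linux ping summary line: "rtt min/avg/max/mdev = a/b/c/d ms"
      match = re.search(r"= [\d.]+/([\d.]+)/", result.stdout)
      if result.returncode != 0 or match is None:
          print(f"ping to {PEER_HOST} failed")
      else:
          avg = float(match.group(1))
          if avg > 2 * BASELINE_MS:
              print(f"RTT {avg:.1f} ms is well above the {BASELINE_MS:.1f} ms baseline; "
                    "traffic may be taking a fallback route")
          else:
              print(f"RTT {avg:.1f} ms looks normal")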

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe to get notified of release installation & validation activity at their site.

  • last report
  • this meeting:

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap (see the sketch after this list).
  • A newline was causing a buffer overflow.
  • Chipping away at this on the job side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion into the release
  • No new failure modes observed
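  • As a rough illustration of the kind of client-side check that can help chase such stalls, the sketch below times repeated dccp reads from a dCache dcap door and flags reads that exceed a threshold. The door host, file path, and threshold are hypothetical placeholders; this is not the instrumentation actually being used.
      # Minimal sketch: time repeated dccp reads from a dCache dcap door and flag
      # reads that look like stalls. Door, path, and threshold are placeholders.
      import subprocess
      import time

      DOOR = "dcap://dcache-door.example.org:22125"                # hypothetical dcap door
      PNFS_PATH = "/pnfs/example.org/data/atlas/testfile.root"     # hypothetical file
      STALL_SECONDS = 300      # treat anything slower than this as a stall candidate
      TRIES = 5

      for attempt in range(1, TRIES + 1):
          start = time.time()
          result = subprocess.run(["dccp", f"{DOOR}{PNFS_PATH}", "/dev/null"],
                                  capture_output=True, text=True)
          elapsed = time.time() - start
          if result.returncode != 0:
              print(f"try {attempt}: FAILED after {elapsed:.0f}s: {result.stderr.strip()}")
          elif elapsed > STALL_SECONDS:
              print(f"try {attempt}: possible stall, read took {elapsed:.0f}s")
          else:
              print(f"try {attempt}: OK in {elapsed:.0f}s")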

this meeting:

AOB

  • last week
  • this week


-- RobertGardner - 14 Dec 2010
