
MinutesDec8

Introduction

Minutes of the Facilities Integration Program meeting, Dec 8, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Rob, Nate, Charles, Sarah, Dave, Jason (I2), Shawn, Saul, Rik, Bob, Mark, Armen, Kaushik, Patrick, Wei, Doug, Booker, Fred, Alden, John D, Hiro
  • Apologies: Horst, Aaron

Integration program update (Rob, Michael)

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Twiki page setup at CERN: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasXrootdSystems
  • Lots of discussion last week. xrd-lfc plugin code is finished, incorporating all suggestions to date. Testing right now.
  • Search logic is needed in the plugin code, since LFC naming conventions can't be changed
  • Angelos has written wiki documentation for dq2-ls and dq2-get with the new functionality; the tools should be released shortly
  • Testing - there appear to be problems with the global redirector not finding files
  • Will ask the Panda developers to store files according to DDM conventions
  • Sites are installing equipment
  • Yushu is working on configuration management
  • Nils working on "back end" - test kit. Example jobs.
  • CVMFS work continues for conditions database files; Doug will work on getting AFS synchronized with CVMFS
this week:
  • Some notes from VDT-ATLAS-OSG meeting
    • Attending: Tanya, Tim, Andy, Wei, Charles, Marco, Rob
    • progress on rpm packaging in VDT
    • symlinks are now set up correctly.
    • Still do not have a configuration package - do we need a specific ATLAS configuration set?
    • http://vdt.cs.wisc.edu/internal/native/vdt-rpm-checklist.html
    • Goal is to get the vanilla parts to the 'platinum' level, with the source going into the standard source repository (future)
    • New set of xrootd, xrootdFS rpms at the 'silver' level should be ready for testing; uses xrootd release 03151007.
    • Build process - uses configure-classic
    • Andy: autotools versus classic build an issue in the newest release; hope to resolve this week.
    • Andy: init.d scripts from Brian Bockelman - not yet incorporated
    • Andy: no man pages are available; putting all of this on the web. Charles may convert the HTML to man pages.
    • Tim: working on new configuration system for xrootd.
    • Andy: hope to have a new release soon.
    • ATLAS-Tier 3: any status updates? Tanya will contact Doug.
    • ATLAS demonstrator project - discussed heavily at last week's SW week. X509 authentication will need to be incorporated at some point for production, but is not needed for the demonstrator project.
    • Charles will set up a local redirector at UC
    • Next meeting: January 11, 2011
    • Tanya: will discuss with Marco documentation changes required for rpm.
  • Charles - working on testing and performance of the redirector (see the probe sketch at the end of this list)
  • Plugin for POSIX sites - needed for SMU, GPFS
  • Two sites in Spain are willing to participate
  • X509 authentication will be a requirement - but we have to figure out what this means
  • Centralized monitoring via the mpxstats plugin, and a central ATLAS monitoring dashboard
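  • As a concrete aid to the redirector testing noted above, here is a minimal Python sketch of a file-visibility probe run through the global redirector with xrdcp; copying to /dev/null exercises the full redirect-and-read path rather than just a stat. The redirector host, the LFN-to-global-namespace mapping, and the sample LFN are assumptions for illustration only, not the agreed convention.

      # Probe whether files are visible through a global xrootd redirector.
      # Sketch only: the redirector host and the LFN -> global-namespace
      # mapping below are assumptions, not the agreed convention.
      import subprocess

      REDIRECTOR = "glrd.example.org:1094"   # hypothetical global redirector

      def global_path(lfn):
          # Assumed mapping: the global namespace mirrors the DDM/LFC layout.
          return "/atlas" + lfn if lfn.startswith("/") else "/atlas/" + lfn

      def is_visible(lfn, timeout=60):
          """Return True if xrdcp can read the file via the redirector."""
          url = "root://%s/%s" % (REDIRECTOR, global_path(lfn))
          try:
              rc = subprocess.call(["xrdcp", "-f", url, "/dev/null"],
                                   timeout=timeout)
          except subprocess.TimeoutExpired:
              return False
          return rc == 0

      if __name__ == "__main__":
          # Placeholder LFN; replace with files known to be registered in DDM.
          for lfn in ["/grid/atlas/dq2/data10_7TeV/example.ESD.root"]:
              print(lfn, "visible" if is_visible(lfn) else "NOT FOUND")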

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
last week(s):
  • Monday more than 50% of all Panda pilots failed at Duke, ANL and Brandeis
  • Tuesday the number was smaller. Still a large number at ANL.
  • Tier 3 support personnel must have registration in OIM - RT queue, and contacts list.
  • Kaushik: there is a group attribute in Panda which allows only local users to run jobs - this has been done by Asoka in Canada
this week:
  • Many sites are getting their hardware and are setting things up
  • UCI - working to tweak their configuration, for example taking into account local switching infrastructure
  • Tier 3 policy - working to keep the impact minimal. There will be testing requirements depending on which services are used, and all Tier 3 sites must register in AGIS. Hopefully Tier3g sites will have minimal requirements. New sites that want to be part of production, have a dependence on the LFC, etc., will have a week of testing.
  • FTS plugin to be checked into CVS (Hiro)
  • CVMFS - taking off; request to set up a mirrored repository outside of CERN (at BNL: Michael, John DeStefano and John Hover). A meeting next week will work out what this involves.
  • CVMFS for the conditions data repository at CERN is under testing (a mount-check sketch follows this list)
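  • A minimal Python sketch of the kind of mount check a Tier 3 could run to confirm CVMFS is responsive; the repository path below is an assumption, since the conditions repository layout is still under testing.

      # Quick probe that a CVMFS repository is mounted and responsive.
      # Sketch only: the repository path is an assumption.
      import os
      import errno

      REPO = "/cvmfs/atlas-condb.cern.ch"   # assumed mount point for conditions data

      def cvmfs_ok(repo=REPO):
          """Return True if the autofs-mounted repo answers a simple listing."""
          try:
              entries = os.listdir(repo)    # first access triggers the automount
          except OSError as err:
              if err.errno in (errno.ENOENT, errno.EIO, errno.ESTALE):
                  return False
              raise
          return len(entries) > 0

      if __name__ == "__main__":
          print(REPO, "mounted" if cvmfs_ok() else "NOT available")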

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • Panda database problems resolved - Oracle admins + Tadashi
    • Expect G4 MC production for the next month
    • Analysis sites - large failure rates at all sites
    • Xin: there was a problem with the installation testing. Caused failures at all sites.
  • this week:
    • Production dropped off - scheduled maintenance for the Oracle server at CERN, restructuring of database, affecting Panda production
    • There are some residual issues, slow queries, dropped tables, ...
    • Not happy with Oracle performance in the past month - issues are not fully understood.
    • Aside: LFC migration - this will occur gradually, and scaling issues will be examined carefully. US may be the last region to be migrated.
    • Many discussions with Borut last week, there will be plenty of work to do. There should be no job shortage.
    • The major issue is storage. We'll need to have a new plan in place by January.

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting: not available due to software week at CERN.
    
    1)  11/24: ANALY_MWT2 was auto-excluded from Panda by the new system.  Not a site problem, but instead a glitch with the new ATLAS s/w installation - from Xin: We are testing the new installation system from Alessandro on US T2s.  
    The gcc was reinstalled, but on some sites like MWT2 the reinstall failed.  Queue set back to online.
    2)  11/25 - 11/26: UTA_SWT2 - transfer errors due to two issues: (i) SRM locked up and had to be restarted; (ii) disk space was getting tight.  Both items resolved, and transfers now completing successfully.  ggus 64628 / RT 18730 closed, eLog 19911.
    3)  11/28: Job failures at various sites (US cloud: MWT2_UC, AGLT2, other clouds) due to missing ATLAS release 16.2.1.  Alessandro reported that this was a known problem and a fix was being deployed to the sites.  eLog  20022/23.
    4)  11/28: From Charles at MWT2: We had a brief power interruption at MWT2_UC which caused a batch of jobs to fail - expect some "lost heartbeat" errors due to this.  (~780 "lost heartbeat" jobs were observed.)
    5)  11/29: Problem with the backend db for the CASTOR name server at CERN - issue resolved.  See details in eLog 20081.
    6)  11/29 - 11/30: File transfer issue from BNL-OSG2_DATADISK to AGLT2_DATADISK with SOURCE "source file doesn't exist" errors.  Not a site issue, but rather a DDM one - from Hiro: The dataset replica is removed from BNL 
    while the subscriptions to US T2s exist. I will remove the subscription, which should solve the current problem.  ggus 64705 closed, eLog 20062/146.
    7)  11/29 - 11/30: transfers to UTD_HOTDISK were failing with the error "Can't mkdir: /net/yy/srmcache/atlashotdisk/ddo/DBRelease/v13010101."  Issue was one or more failed hard drives in the storage.  Issue resolved, and test jobs 
    have been submitted to the site.
    8)  11/29 - 12/1: HU_ATLAS_Tier2 job failures with "Exception caught in pilot: (empty error string)" error.  From John at HU:
    Our site was not experiencing any more of these errors when I looked. We've finished over 1500 jobs with a 0% failure rate for the past 12 hours. This ticket can be closed.  ggus 64712 / RT 18740 closed, eLog 20212.  
    (Site was set offline by shifters during this period - probably a bit overly aggressive, plus the site was not notified.)
    9)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment please do not use 
    that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    10)  11/30: HPSS upgrade at BNL completed successfully as of ~12:45 EST.  (Expected impact was no access to tape-resident HPSS data during the upgrade.)
    11)  11/30: MWT2_UC - atlas s/w release installations were failing due to a problem with the GCC setup at the site.  Xin re-installed the GCC, issue resolved.  (Production & analysis queues were set offline during this period, ~ 2 hours.)
    12)  11/30 afternoon: MWT2_UC & ANALY_MWT2 set offline in preparation for a maintenance outage.
    13)  11/30 - 12/1: File transfer errors between SWT2_CPB_USERDISK => PIC_SCRATCHDISK with the error "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [INTERNAL_ERROR] Invalid SRM version [] for endpoint."  
    It appears this is an issue on the PIC side, so ggus 64802 was re-assigned to PIC (and subsequently solved).  From Arnau Bria: there was a bug in /usr/bin/glite-info-update-endpoints which made some OSG sites invisible.  Maarten provided a patch 
    (details in the ggus ticket).  RT 18787 resolved.  ggus 64848 / RT 18790 also opened regarding this problem - both now closed.  eLog 20169.
    14)  12/1: autumn 2010 reprocessing campaign declared to be officially done. 
    
    Follow-ups from earlier reports:
    
    (i)  11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  
    May need to run a job "by-hand" to get more detailed debugging information.  In-progress.
    Update 11/17: any news on this?
    (ii)  11/11 - 11/12: SLAC disk free space low - from Wei: "We run out of space in the front tier. I stopped the channels to let old data moving to back tier."  This led to job failures with stage-out errors.  Problem under investigation.  eLog 19324.
    Update 11/30: no recent errors of this type - assume issue is resolved.
    (iii)  11/11 - present: BNL storage issues due to low free space in DATADISK.  Intermittent errors with DDM transfers and job stage-in/out problems.
    More details in ggus 64154 (open) and 64218 (closed), eLog 19388 / 433 / 488.
    Update: ggus 64154 closed on 11/27 (transfers stable / good performance).
    (iv)  11/14: MWT2_UC_PRODDISK - DDM errors with "SOURCE error during TRANSFER_PREPARATION phase" message.  From Aaron: "This is due to a problem with the disks on one of our storage nodes. We are working to get this node back online, 
    but these errors will continue until this is complete."  ggus 64230 in-progress, eLog 19420.
    Update 11/30: Issue resolved, ggus 64230 closed.
    (v)  11/16: AGLT2 - network outage caused a loss of running jobs (clean recovery on the gatekeeper not possible).  Issue resolved.  ggus 64323 was opened during this period for related DDM errors.  Additional errors like 
    "gridftp_copy_wait: Connection timed out" seen overnight.  ggus ticket in-progress.
    Update from Bob, 11/19: I see one error in the last 4 hours, 20 errors in the last 24 hours (the latter out of 10700). We think there is some packet loss at a very low level in our network, and we are actively investigating to find the source. 
    We also have one very busy server that we will deal with today.
    Update 11/25: no additional transfer errors - ggus 64323 closed.
    (vi)  11/20: SWT2_CPB - DDM errors like "failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]."  ggus 64457 / RT 18679, eLog 19714/811.  From Patrick: There is an issue with the size of the files and the timeouts enabled 
    on the FTS channel for SWT2_CPB. The incoming files appear to be large (12+GB) and the transfers are not completing within the current timeout. We are playing with the timeouts available on the FTS channel to get these transfers to succeed.
    Update 11/26: The increased timeouts on the FTS channel are allowing the files to reach SWT2_CPB.  ggus 64457 / RT 18679 closed.
    
    
  • this meeting: Operations summary:
     Yuri's summary from the weekly ADCoS meeting: not available (reason unknown)
    
    1)  12/3: BNL-OSG2_DATATAPE - transfer errors like "Job failed with forced timeout after 43200 seconds."  Issue understood - from Iris: The timeout is because the file is not staged to dCache disk yet. The staging service was stopped overnight 
    due to a performance issue. It has been restarted now.  ggus 64929 closed, eLog 20268.
    2)  12/3: MWT2_UC_USERDISK - transfer failures with the error "[HTTP_TIMEOUT] failed to contact on remote SRM [httpg://uct2dc1.uchicago.edu:8443/srm/managerv2]. Givin' up after 3 tries]."  From Sarah: Our SRM door ran out of memory and 
    went unresponsive. I have restarted it with increased memory allocations.  RSV probes are passing and we should see transfers start to succeed.  ggus 64940 closed, eLog 20274.
    3)  12/3: New pilot release (SULU 45b) from Paul.  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU_45b.html
    4)  12/3: OU_OCHEP_SWT2 file transfer errors: "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3 tries]."  Horst reported that a problem with Lustre was fixed by DDN.  
    ggus 64922 / RT 18801 closed, eLog 20306.
    12/5:  Lustre/network problems reappeared.  RT 18818 / ggus 65018 in-progress, eLog 20362.
    12/8 from Horst: the Lustre server upgrade is mostly complete, but the clients haven't been upgraded, since that will require a complete downtime, which I don't want to do while nobody is around locally, so we'll do that next week some time.
    5)  12/3 - 12/4: All US sites drained due to a lack of job input files.  The issue was a pilot problem for pandamover as a result of the update Paul released earlier in the day on 12/3 (see 3) above).  A new pilot release was made (SULU 45c) 
    which fixed the problem.  See details in the thread contained in eLog 20321.
    6)  12/4: SWT2_CPB - ggus ticket 65008 / RT 18811 were opened due to a low level of transfer failures ("Connection timed out" errors), which was a transient problem and went away fairly quickly.  Tickets were closed, eLog 20333.
    7)  12/4: BNL - jobs failing due to missing release 15.6.13.4.  Alessandro recommended that users switch to 15.6.13.6, as 15.6.13.4 is a known buggy version of the s/w.  See details here: https://savannah.cern.ch/bugs/index.php?76018.  eLog 20377.
    8)  12/4: File transfer errors between ILLINOISHEP_USERDISK to WISC_LOCALGROUPDISK.  Issue on the WISC end was resolved.  ggus 65014 closed, eLog 20348.
    9)  12/6: Shifters are requested to use the Panda Monitor rather than the ProdSys dashboard.  (Features that are available in the ProdSys dashboard but missing from the Panda Monitor will be added to the Panda Monitor soon.)
    10)  12/6: BNL - jobs were failing with the error "file is not in DDN pools."  Not a site issue, but instead related to the pilot problem in 5) above.  More details in eLog 20402, ggus 65034 closed.
    11)  12/6: AGLT2 - jobs were failing due to insufficient space on PRODDISK.  Queue temporarily set off-line.  Space added, and the site was unblacklisted / set to online later that day.  eLog 20412, https://savannah.cern.ch/support/index.php?118207.
    12)  12/7: (Minor) pilot update from Paul (SULU 45d).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU_45d.html
    13)  12/7: From Bob at AGLT2: 130 jobs would not reconnect this morning after their workers were disconnected from the network for several hours overnight.  126 were production, 4 were analysis.  Some 300 or so job slots were potentially involved in 
    this problem.  It is likely the 130, and perhaps as many as 350, jobs will show up as lost heartbeat during the upcoming timeout period for these jobs.
    14)  12/8: Oracle maintenance at CERN required that the Panda server be shut down for ~one hour beginning at around 9:00 UTC.  Should not have a major impact, but expect some job failures during this period.
    
    Follow-ups from earlier reports:
    
    (i)  11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  May need to run a 
    job "by-hand" to get more detailed debugging information.  In-progress.
    Update 11/17: any news on this?
    Update 12/8: This error seems to have gone away.  Archive this item for now.
    (ii)  11/29 - 11/30: transfers to UTD_HOTDISK were failing with the error "Can't mkdir: /net/yy/srmcache/atlashotdisk/ddo/DBRelease/v13010101."  Issue was one or more failed hard drives in the storage.  Issue resolved, and test jobs have been submitted 
    to the site (which completed successfully).
    Update 12/6: ggus 64727 was closed, eLog 20078, 20295.
    (iii)  11/29 - 12/1: HU_ATLAS_Tier2 job failures with "Exception caught in pilot: (empty error string)" error.  From John at HU:
    Our site was not experiencing any more of these errors when I looked. We've finished over 1500 jobs with a 0% failure rate for the past 12 hours. This ticket can be closed.  ggus 64712 / RT 18740 closed, eLog 20212.  (Site was set offline by shifters 
    during this period - probably a bit overly aggressive, plus the site was not notified.)
    Update, 12/7: job failures with the error "pilot.py | !!FAILED!!1999!! Exception caught in pilot: (empty error string), Traceback (most recent call last):
    File "/n/atlasgrid/home/usatlas1/gram_scratch_TU82Q0BNvq/pilot.py," which was the original subject of ggus 64712 (now re-opened).  From John at HU:
    We are working with condor developers to see what we can do about the slow status updates of jobs. The grid monitor is querying the status of many
    thousands of very old jobs. This is causing load sometimes over 1000 on the gatekeeper and general unresponsiveness. My suspicion is that a timeout is causing the "empty error string". In any case, the site is dysfunctional 
    until this Condor-G issue can be fixed.  However, I'm also in the middle of making some changes and I can't see their effects unless the site is loaded like this, hence why I haven't set it to brokeroff.
    (iv)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment please do not use that site for 
    any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    (v)  11/30 afternoon: MWT2_UC & ANALY_MWT2 set offline in preparation for a maintenance outage.
    Update 12/1 from Aaron: maintenance outage completed as of ~9:30 p.m. CST.
    (vi)  11/30: MWT2_UC - atlas s/w release installations were failing due to a problem with the GCC setup at the site.  Xin re-installed the GCC, issue resolved.  (Production & analysis queues were set offline during this period, ~ 2 hours.)
    Update from Xin, 12/2: The installation of 16.2.1 and its caches finished successfully at MWT2_UC.
    
    
    • Ongoing issues with storage at OU - Lustre upgrade, Horst reporting.
    • Sites draining last week, caused by pilot bug, Paul fixed quickly.
    • Most open issues cited above have been resolved.
     • Release installation issues: Xin - there was a problem with the requirements file; the missing 16.0.2.7 cache was caused by a failed re-install attempt. Not all sites have been fixed - one or two left. Migration to the new installation system: now doing AGLT2; Alessandro is working on fixing a problem with the validation. Next will be BU, which has a problem running WMS jobs due to missing BDII info - BU publishes multiple sub-clusters under one site name, and help from an OSG-GIP expert is needed (see the BDII query sketch after this list). (BNL and OU have successfully migrated.)
    • Kaushik: word from shifters that we've got more reports with release issues. Xin - believes we'll get good support from Alessandro, and the new system has desirable features.
    • Michael notes that due to group production, the number of releases needed has increased dramatically. Not limited to the US.
     • Request to Xin: keep an eye on sites so that new releases are installed correctly and in a timely fashion. We also need expertise with Alessandro's system.
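     • On the BDII point above, a hedged Python sketch of how one might list the sub-clusters a site publishes, by querying a top-level BDII with ldapsearch; the BDII endpoint, base DN and site substring are assumptions for illustration.

         # List GlueSubCluster entries a site publishes in the BDII.
         # Sketch only: endpoint, base DN and the site filter are assumptions.
         import subprocess

         BDII = "ldap://is.grid.iu.edu:2170"    # assumed OSG top-level BDII
         BASE = "mds-vo-name=local,o=grid"      # conventional Glue base DN

         def subclusters(site_substring):
             """Return GlueSubClusterName values whose unique ID mentions the site."""
             filt = ("(&(objectClass=GlueSubCluster)"
                     "(GlueSubClusterUniqueID=*%s*))" % site_substring)
             out = subprocess.check_output(
                 ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", BASE,
                  filt, "GlueSubClusterName"],
                 universal_newlines=True)
             return [line.split(":", 1)[1].strip()
                     for line in out.splitlines()
                     if line.startswith("GlueSubClusterName:")]

         if __name__ == "__main__":
             for name in subclusters("BU_ATLAS"):   # hypothetical site substring
                 print(name)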

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • From Hiro:
      • DDM is operating without problems, and not many DDM errors related to T2/T3 sites were noticed. The user dataset deletion announcement should come this week (Hiro will send the size).
      • No BDII errors have appeared in the warning system, so it must be working well, since BDII started using a 120s timeout last week and the network issue at OSG/Indiana was resolved.
      • BNL dCache: space is still tight, but we are coping. We have stopped dedicating specific storage to specific space tokens, so quotas can be set more flexibly. There were a few incidents when a large number of requests went to a very limited number of read pools, which caused some errors in Panda and DDM; migrating the popular data to different pools solved the problem.
      • Staging from HPSS works fine, reaching 1 Gb/s consistently (for the big RAW files). BNL is CPU limited, not HPSS I/O limited, in terms of reprocessing.
      • Throughput/networking: the BNL-CNAF network problem in GEANT shows progress with the installation of new hardware, but switching the circuit back to the original route was not completed and will be attempted again on Nov 29th. It is now approaching four months to fix this network link between two T1s. How long should we expect problems between US T2s and foreign T2s to take to get resolved (if/when someone even notices)? (NOTE: there is an initiative in WLCG to look into this issue.)
  • this meeting:
    • LFC consolidation - the motivation was discussed
    • Can each site confirm they are making backups of the LFC? They are.
    • PoolFileCatalog creation - the DQ2 developers added a new option to get around the dq2-ls regex hack used in the US; ToA will change as a result. dcap doors should use an alias.
    • DDM - all is well
    • SRM deletion being slow - why? It has to do with ordering by space tokens. Hiro has several feature requests in to the developers.
    • LFC dumps are now automated on a weekly basis, delivered as an SQLite file. Charles will modify CCC to handle this (a comparison sketch follows this list).
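    • A minimal Python sketch of the LFC-dump comparison mentioned above, assuming a hypothetical table/column layout in the SQLite dump (a 'replicas' table with an 'sfn' column) and a plain-text listing of files actually on storage; the real dump schema and the CCC interface may differ.

        # Compare a weekly LFC dump (SQLite) against a storage file listing.
        # Sketch only: table/column names and the listing format are assumptions.
        import sqlite3

        def lfc_replicas(dump_path, site_prefix):
            """Set of replica SFNs registered in the LFC for this site."""
            conn = sqlite3.connect(dump_path)
            try:
                rows = conn.execute(
                    "SELECT sfn FROM replicas WHERE sfn LIKE ?",
                    (site_prefix + "%",))
                return set(sfn for (sfn,) in rows)
            finally:
                conn.close()

        def storage_files(listing_path):
            """Set of SFNs actually present on storage (one per line)."""
            with open(listing_path) as handle:
                return set(line.strip() for line in handle if line.strip())

        if __name__ == "__main__":
            # Hypothetical input files and SRM prefix.
            lfc = lfc_replicas("lfc_dump.db", "srm://uct2-dc1.uchicago.edu")
            disk = storage_files("storage_listing.txt")
            print("dark data (on disk, not in LFC):", len(disk - lfc))
            print("lost files (in LFC, not on disk):", len(lfc - disk))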

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
  • this week:
    • Meeting notes
      	===========================================
      
      Attending: Shawn, Dave, Jason, Philippe, Sarah, Saul, Andy, Tom, Rob, Hiro
      Excused: Horst, Karthik
      
      1) Update on problem status
      	a) OU - No updates
      	b) BNL - No known issues right now.  Have to see if any of the OU problems are related to BNL as was sometimes discussed.
      	c) Illinois - No changes.
      
      2) perfSONAR in USATLAS status.   All sites installed and updated.  Site admins were asked to check/verify how things are running for their installations:
      	a) AGLT2_UM - All services running as they should.  Data looks good.  SLAC was re-added to tests.  Found an issue at UM via the latency packet loss data.  Looking into how to resolve (switch interconnect issue).
      	b) MWT2_IU - Firewall settings changed and lost specific port settings so latency tests are 1 way.  Sarah will look into it and restore original settings to allow bi-directional tests.
      	c) NET2_BU - All services are running.   Need to verify test results but things look operational.
      	d) AGLT2_MSU - All services running and results look reasonable.  Some possible issues between MSU and UM in terms of test scheduling.  The latency tests from MSU to UM are sparse...we need to figure out why.
      	Dell R410 box is available for perfSONAR developers.  New Internet2 perfSONAR person will try to have a look at the box soon and see if OWAMP tests can be isolated from BWCTL tests on the box.  Use irqbinding, numactl and other techniques to minimize interference between testing roles.  Also have access to a development server at UM for this work. Will check status at our next meeting.
      
      3) Throughput, Measurement and Monitoring
      	a) Tom reported on new tests which will verify the proper services are responsive for both latency and throughput nodes. Discussed some possible future tests concerning measured data to start providing alarms if potential problems are present.  	
           b) No one from SLAC on the call.  Problem with Nagios monitoring the SLAC instances may be related to the particular URL in the tests?  Tom will send Shawn the current test being used for SLAC so we can investigate further.
      	c) New tests will be rolled out as we finish implementing the next set of service tests.   Details to be worked on as we go. Idea for future test (Jason): go to DB directly if needed.  Will require NRPE (Nagios) probe which could be added into future perfSONAR versions.  Also could be 'yum installed' now if there is a need.
      	d) Hiro reported that he still plans to grab perfSONAR data and add it to his throughput plots but hasn't had time to try it yet.
      
      4) Round-table.   Any site issues...?  Very quiet out there...guess that is good!
      
      Given the holidays we plan to meet next in the new year on January 4th, 2011.
      
      Please send along any additions or corrections to the list.  Happy Holidays.
      
      Shawn
    • The current version of perfSONAR is working at all sites
    • Nagios is not able to query SLAC properly. No outbound port 80 from BNL, but this can be changed.
    • Once available, this will be able to monitor the perfSONAR hosts at the sites (a minimal port-check sketch follows this list)
    • DYNES template should be available today at http://www.internet2.edu/dynes
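    • In the spirit of the service-responsiveness tests described in the notes above, a minimal Python sketch that checks whether perfSONAR hosts answer on their measurement control ports; the host names are placeholders, and 861 / 4823 are the conventional OWAMP / BWCTL control ports.

        # Check that perfSONAR hosts answer on their measurement control ports.
        # Sketch only: host names are placeholders.
        import socket

        CHECKS = [
            ("ps-latency.example.edu", 861),      # latency node: OWAMP control port
            ("ps-bandwidth.example.edu", 4823),   # bandwidth node: BWCTL control port
        ]

        def port_open(host, port, timeout=5.0):
            """Return True if a TCP connection to host:port succeeds."""
            try:
                sock = socket.create_connection((host, port), timeout)
            except (socket.error, socket.timeout):
                return False
            sock.close()
            return True

        if __name__ == "__main__":
            for host, port in CHECKS:
                print("%s:%d %s" % (host, port, "up" if port_open(host, port) else "DOWN"))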

Site news and issues (all sites)

  • T1:
    • last week(s): Extensive data replication to BNL has seen rates of > 2 TB/hour, filling up the site. This has been quite a stress on services and people; learning a lot. Regarding space, in the process of procuring another 2 PB of disk (estimates are coming up short); this will put BNL at 10 PB of disk by the end of the year. Pedro, leader of the storage management group, has left. Looking for a capable and well-plugged-in group leader; Hiro will now lead the group, which has been re-constituted with systems and database expertise (Carlos, Jason). Will be moving to Chimera on a timescale of a year. Reprocessing winding down; did more than 30%. Discussing effects of full pools.
    • this week: The pilot released on Friday caused US resources to drain and caused problems with PandaMover; Paul fixed the issue quickly. (Kaushik notes that this came as a request for default direct access for prun jobs at all sites. Question as to why this wasn't discussed.) CPU resources have been shifted back to analysis.

  • AGLT2:
    • last week: Discovered 1% data loss at UM; it completely disrupted the local network (spanning-tree topology lost). MSU has received all worker nodes (2 racks). Upgrading the LAN with larger 10G switches. 64 new compute nodes at MSU - a change in network configuration was caused by a firmware update; hope to have the nodes running next week. Juniper switches used for the WAN connection tested very well during SC.
    • this week: There was a problem with the spanning-tree config between the Dell and Cisco switches. Dec 15 downtime: new servers, dCache update.

  • NET2:
    • last week(s): Yesterday running short on DATADISK; have been doing deletions, 55 TB free. Electrical work completed, ready to plug in another rack. Networking incident at Harvard last week on Veterans Day disconnected a couple of Nehalem racks; they responded quickly. Came back online, only lost 1/3 of the jobs. Problems with Condor-G not updating quickly enough, causing a backlog - happened before, thought fixed by Jamie; now looks like it's out of sync again. Scaled up the analysis queue at HU, with adjustments to lsm to use a different server (~700 slots). All is well. Finding about 1/3 of analysis jobs failing, some due to local scratch filling. What about HU analysis jobs? (Apparently broker-off; John is reconfiguring the network between the lsm box and the GPFS volume server.)
    • this week: Will need to look into the BDII issue. perfSONAR is online and up to the latest release. Problem at HU related to LSF-GRAM; Wei notes this was solved by caching queries. Planning for the Holyoke move in 2012.

  • MWT2:
    • last week(s): A couple of problems over the week - an LFC database index, and a pnfs-manager load issue. Finalized purchase orders for worker nodes at IU and UC (42+56 R410 servers). We are seeing Java memory issues on dCache pool nodes, uncertain as to the cause, but several investigations are underway; keeping things stable in the meantime. Downtime scheduled for December 1.
    • this week: Space crunch passed. Continue to experience issues resulting from the Chimera upgrade - slowness of certain queries and intermittent Chimera process crashes. Orders placed with Dell for compute nodes.

  • SWT2 (UTA):
    • last week: SRM issue at UTA_SWT2 - needed more file descriptors (increased ulimit); unsure as to the cause - could it be due to the central deletion service? ANALY cluster at CPB - a problem with storage, restarted. Working on joining the federated xrootd cluster; Wei: need to follow up with the Berkeley team. Transfer failures timing out - very large files (12+ GB) taking too long; Patrick is working on FTS timeout settings. The reprocessing tasks were not configured well.
    • this week: all is well; an expired host certificate; may need to take a short downtime to move hosts over to 10G circuit. May start experimenting with Bestman2: queue setup and depths.

  • SWT2 (OU):
    • last week: All is well. Working with Dell on new quotes for R410s (not sure of the number yet); 19 new R410s.
    • this week:

  • WT2:
    • last week(s): Having problems with a data server going down over the past week for unknown reasons. Problems with an xrootd redirector that Andy will fix - the redirector was slowing down in certain situations; using a replacement from Andy. Everything is running fine. Asking the systems folks for the Dell nodes to be installed.
    • this week: Running smoothly. One of the data servers failed - recovered okay, but concerning since we are not sure what caused the failure.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of validating process, completeness of releases on sites, etc.
  • last report
  • this meeting:
    • Covered above

HEPSpec 2006 (Bob)

last week(s): this week:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap (see the probe sketch after this list).
  • A newline was causing a buffer overflow.
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion in to the release
  • No new failure modes observed
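  • As a crude job-side way to catch the dcap stalls noted above, a hedged Python sketch that wraps a dccp read in a hard timeout; the dcap door, file path and timeout value are placeholders, not our actual configuration.

      # Detect a stalled dcap read by wrapping dccp in a hard timeout.
      # Sketch only: the door, path and timeout are placeholders.
      import subprocess
      import time

      DOOR = "dcache-door.example.edu:22125"                   # hypothetical dcap door
      TEST_FILE = "/pnfs/example.edu/atlasproddisk/test.root"   # placeholder path

      def read_via_dcap(timeout=300):
          """Copy the test file to /dev/null via dccp; report a stall on timeout."""
          url = "dcap://%s%s" % (DOOR, TEST_FILE)
          start = time.time()
          try:
              rc = subprocess.call(["dccp", url, "/dev/null"], timeout=timeout)
          except subprocess.TimeoutExpired:
              return "STALLED after %ds" % timeout
          elapsed = time.time() - start
          return "ok in %.1fs" % elapsed if rc == 0 else "failed (rc=%d)" % rc

      if __name__ == "__main__":
          print(read_via_dcap())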

this meeting:

AOB

  • last week
    • Reminder: all T2's to submit DYNES applications by end of the month. Template to become available.
    • May not have meeting next week - SW dinner.
  • this week
    • None.


-- RobertGardner - 07 Dec 2010
