
MinutesJan5

Introduction

Minutes of the Facilities Integration Program meeting, Jan 5, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Doug, Charles, Rob, Dave, Jason, Patrick, Michael, Sarah, Fred, Torre, Akhtar, Alden, Horst, John, Wensheng, Mark, Kaushik, Armen, Wei, Saul, Bob, Shawn, Tom, Hiro
  • Apologies:

Integration program update (Rob, Michael)

Bellarmine U

  • "AK" is here
  • 385 cores, 200 TB RAID, 200 TB on worker nodes.
  • Horst has been measuring network throughput. The site has a 100 Mbps connection (roughly 10 MB/s observed when no students are on the network).
  • FTS transfer tests started
  • Totally dedicated to ATLAS production (not analysis)
  • Goal is to be a low-priority production site
  • There will be an extended testing period in the US cloud, at least a week
  • Need to look at the bandwidth limitation (see the transfer-time sketch after this list)
  • Horst has installed the OSG stack
  • There is 0.5 FTE of local technical support
  • Support tickets currently go to the Tier 3 queue and are then assigned to the responsible site; this is not the desired workflow.
  • Will need to set up a dedicated RT queue.
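
As a rough feel for the bandwidth limitation flagged above, a minimal Python sketch converting the quoted link figures into transfer times; the 100 Mbps link speed and ~10 MB/s observed rate are from the notes above, while the dataset sizes are purely illustrative:

    # Rough transfer-time estimate for a 100 Mbps site link.
    LINK_MBPS = 100            # nominal link capacity, megabits per second (from the notes)
    OBSERVED_MB_S = 10         # ~10 MB/s observed when the link is otherwise idle (from the notes)

    def transfer_hours(dataset_tb, rate_mb_s=OBSERVED_MB_S):
        """Hours needed to move dataset_tb terabytes at a sustained rate (decimal units)."""
        return dataset_tb * 1e6 / rate_mb_s / 3600

    if __name__ == "__main__":
        print("Theoretical ceiling: %.1f MB/s" % (LINK_MBPS / 8.0))   # 12.5 MB/s
        for size_tb in (1, 10):                                       # illustrative dataset sizes
            print("%d TB at %d MB/s ~ %.0f hours" % (size_tb, OBSERVED_MB_S, transfer_hours(size_tb)))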

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Xrootd demonstrator projects will be presented at the WLCG GDB meeting 12-January-2011 (Doug will present the Tier 3 Xrootd demonstrator talk)
  • NET2 will try out the installation instructions.
  • What is the time frame for testing?
  • Wei: there is a problem with checksums; he has discussed it with Andy. A method for calculating the checksum needs to be provided. There is a workaround, but it is somewhat expensive (see the checksum sketch at the end of this section).
  • AGLT2 will be next, in January.
  • Let Charles know when setting this up, and provide feedback.
this week:
  • Doug: attempting to get a proxy server running at Argonne, to handle the case where data servers are behind a firewall.
  • Will gather some information from Charles.
  • A new version of xrootd is available; testing is underway.
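
On the checksum point above: ATLAS data management compares adler32 checksums, so whatever method a site exposes has to produce the same value. A minimal sketch of computing it with Python's zlib; the file path is a placeholder:

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Return a file's adler32 checksum as an 8-hex-digit string."""
        value = 1  # standard adler32 seed
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    if __name__ == "__main__":
        print(adler32_of("/path/to/some/file.root"))   # placeholder path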

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

References
  • The links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
last week(s):
  • Univ. of Cal. Irvine, UC Santa Cruz, and SUNY Stony Brook are all setting up Tier 3 sites.
  • All Tier 3 sites (even those co-located with Tier 1 and Tier 2 sites) are requested to register with OSG in the OIM. This information will eventually be migrated into AGIS.
  • Shuwei Ye has agreed to help set up a test package that will allow Tier 3 sites to test their configuration (storage, batch system, etc.)
  • UIUC and UTD have been contacted and requested to attend this meeting, as they are production sites
  • Bellarmine and Hampton have also been contacted and requested to attend this meeting, as they would like to be production sites. The process has been explained to them
  • Some issues have been seen with dq2-get + fts: not all files in a large dataset were received (only 100 files out of ~5000). The transfer reported success and no errors were seen; a ticket has been created. (Hiro will have a look; see the completeness-check sketch at the end of this section.)
this week:
  • Oregon, UIC, Arizona, Duke, and Indiana are all setting up equipment.
  • Will be sending out a global email to get their status.
  • Discussion about the support list
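
Regarding the incomplete dq2-get + fts transfer noted under last week(s), a minimal sketch for spotting which expected files never arrived locally. It assumes an expected-file list (one filename per line) obtained from a dataset listing; the paths are placeholders:

    import os
    import sys

    def missing_files(expected_list_path, download_dir):
        """Return expected filenames (one per line in the listing) that are not present locally."""
        with open(expected_list_path) as f:
            expected = set(line.strip() for line in f if line.strip())
        present = set(os.listdir(download_dir))
        return sorted(expected - present)

    if __name__ == "__main__":
        # Usage: python check_dataset.py expected_files.txt ./downloaded_dataset
        for name in missing_files(sys.argv[1], sys.argv[2]):
            print("MISSING: " + name)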

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-12_27_10.html
    
    1)  12/22: Problem with the installation of atlas release 16.0.3.3 on the grid in most clouds.  Alessandro informed, ggus 65628, eLog 20818.
    2)  12/22: SLAC - jobs failing with the error "poolToObject: caught error: FID "8E91164C-1E3C-DB11-8CAB-00132046AB63" is not existing in the catalog."  As of 12/28 the task was marked as 'DONE', so the issue is apparently resolved.  
    https://savannah.cern.ch/bugs/index.php?76641, eLog 20830.
    3)  12/22: SLACXRD_SCRATCHDISK, SLACXRD_USERDISK file transfer errors.  Issue resolved as of 12/23.  From Wei: Our weekly "data server went down without a reason"... Back online.  ggus 65658/59 closed, eLog 20835/36/40.  
    (Ticket for ANL_LOCALGROUPDISK was due to the issue on the SLAC end.)
    4)  12/26: SWT2_CPB file transfer errors such as "Error:/bin/mkdir: cannot create directory `/xrd/datadisk/step09/AOD/closed/step09.20201052000083L.physics_A.recon.AOD.closed'. "  https://savannah.cern.ch/bugs/index.php?76672, 
    ggus 65707/34, RT 19062/65, eLog 20884/928.
    5)  12/26: UTD-HEP set off-line for site maintenance.  https://savannah.cern.ch/support/?118511, eLog 20891.
    6)  12/26 - 12/28: Power outage in the MSU (AGLT2) server room at approximately 8:30pm 12/26.  Site recovered from the outage as of 12/28 afternoon.  ggus 65730, eLog 20911.
    
    Follow-ups from earlier reports:
    (i)  11/29 - 12/1: HU_ATLAS_Tier2 job failures with "Exception caught in pilot: (empty error string)" error.  From John at HU:
    Our site was not experiencing any more of these errors when I looked. We've finished over 1500 jobs with a 0% failure rate for the past 12 hours. This ticket can be closed.  ggus 64712 / RT 18740 closed, eLog 20212.  
    (Site was set offline by shifters during this period - probably a bit overly aggressive, plus the site was not notified.)
    Update, 12/7: job failures with the error "pilot.py | !!FAILED!!1999!! Exception caught in pilot: (empty error string), Traceback (most recent call last):
    File "/n/atlasgrid/home/usatlas1/gram_scratch_TU82Q0BNvq/pilot.py," which was the original subject of ggus 64712 (now re-opened).  From John at HU:
    We are working with condor developers to see what we can do about the slow status updates of jobs. The grid monitor is querying the status of many thousands of very old jobs. This is causing load sometimes over 
    1000 on the gatekeeper and general unresponsiveness. My suspicion is that a timeout is causing the "empty error string". In any case, the site is dysfunctional until this Condor-G issue can be fixed.  However, I'm also 
    in the middle of making some changes and I can't see their effects unless the site is loaded like this, which is why I haven't set it to brokeroff.
    (ii)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment 
    please do not use that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    (iii)  12/10: From Bob at AGLT2: We have decided we are not ready for a downtime on Dec 15, so will put this off until (likely) some time in January.
    (iv)  12/16: Offline database infrastructure will be split into two separate Oracle RAC setups: (i) ADC applications (PanDA, Prodsys, DDM, AKTR) migrating to a new Oracle setup, called ADCR; (ii) Other applications 
    (PVSS, COOL, Tier0) will remain in the current ATLR setup.  See: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/OfflineDBSplit
    (v)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  
    Once the transfer completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (vi)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    (vii)  12/21: SWT2_CPB file transfer failures - " failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]."  Issue understood - from Patrick: There was a problem with the edit I made to a configuration 
    file related to Bestman that caused the initial problem.  After making a modification to the gridftp door, I restarted Bestman, and there was a new problem with SRM accepting DOEGrids certs.  A final restart of Bestman cleared this issue.  
    ggus 65627 / RT 19045 will be closed today once we confirm everything is working o.k.  eLog 20817.
    Update 12/22: Issue resolved, ggus & RT tickets closed.
    
    • OU is back online following the Lustre update; there was a problem with symlinks disappearing, which broke the releases, now straightened out.
    • Power cut at CERN over the weekend
    • Nagios alerts at BNL - memory usage on submit hosts. Xin notes this is a false alarm.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-1_3_11.html
    
    1)  12/28: Job failures at BU with the error "runJob.py | !!WARNING!!2999!! runJob setup failed: cmtsite command failed: 11."  From Xin: I don't know why the releases stopped working. Rerunning the validation 
    on caches also failed. I went ahead to reinstall 16.0.2 kit and caches, validation runs fine now. I will reinstall 15.6.13 as well.  Issue resolved, ggus 65765 closed as of 1/3.  eLog 20954.
    2)  12/31: File transfer errors between BNL-OSG2_DATADISK to SWT2_CPB_PERF-EGAMMA with the error "source file doesn't exist."  Issue understood - from Hiro: Files are missing because the files were deleted 
    by the central operation.  Since SWT2_CPB had storage problem during that time (see (vii) below), the subscription to SWT2_CPB has never completed before the deletion of this dataset at BNL. I have no idea why 
    it was deleted or why it can't check the subscription to SWT2 before deletion from BNL.   Anyway, I just subscribed it to BNL-OSG2_SCRATCHDISK from FZK. So, the subscription to SWT2_CPB will complete. 
    Then, jobs will run normally.  See https://savannah.cern.ch/bugs/index.php?76693, eLog 20966.
    3)  1/2: Job failures at MWT2_UC with the error "Trf setup file does not exist at: /osg/app/atlas_app/atlas_rel/15.6.13/AtlasProduction/15.6.13.11/AtlasProductionRunTime/cmt/setup.sh."  Later that day, from Xin: This cache 
    is installed at MWT2_UC now.  ggus 65791 closed, eLog 20990.
    4)  1/3: From Bob at AGLT2: Looks like all GUMS servers were offline early AM yesterday, and gate01 services never recovered.  Will likely blow the load of any running jobs here.  Auto-pilots are stopped.  
    Will give services a bit of time to stabilize this way, then begin recovery. Will wait for rsv to go green again before re-enabling auto-pilots. | Later: Queues have been set back on-line.  Running job load was lost/killed.  
    We'll be starting up from scratch.
    5)  1/3: From Wei at SLAC: We have a disk failure in one of the backend storage systems (Dell). I blacklisted SLACXRD_* and stopped the FTS channel.  We will continue to accept jobs since the backend downtime should cause little 
    error to jobs. | Later: the maintenance is over. Services are back online.
    6)  1/4: From Bob at AGLT2: gate01 rsv has been "peculiar" since the weekend problem with GUMS.  Services often time out.  I am therefore scheduling a reboot of gate01 at Noon today. I will stop new auto-pilots around 
    11:45am and not restart them until the machine is back up fully. | Later: Auto-pilots have been re-enabled to our queues at AGLT2.
    7)  1/5: UTD_HOTDISK file transfer errors - "failed to contact on remote SRM [httpg://fester.utdallas.edu:8446/srm/v2/server]."  Site is in a scheduled maintenance outage (see (viii) below), but maybe not 
    blacklisted in DDM?  ggus 65866, eLog 21047.
    8)  1/5: MWT2 - network maintenance outage.  eLog 21049.
    
    Follow-ups from earlier reports:
    (i)  11/29 - 12/1: HU_ATLAS_Tier2 job failures with "Exception caught in pilot: (empty error string)" error.  From John at HU:
    Our site was not experiencing any more of these errors when I looked. We've finished over 1500 jobs with a 0% failure rate for the past 12 hours. This ticket can be closed.  ggus 64712 / RT 18740 closed, eLog 20212.  
    (Site was set offline by shifters during this period - probably a bit overly aggressive, plus the site was not notified.)
    Update, 12/7: job failures with the error "pilot.py | !!FAILED!!1999!! Exception caught in pilot: (empty error string), Traceback (most recent call last):
    File "/n/atlasgrid/home/usatlas1/gram_scratch_TU82Q0BNvq/pilot.py," which was the original subject of ggus 64712 (now re-opened).  From John at HU:
    We are working with condor developers to see what we can do about the slow status updates of jobs. The grid monitor is querying the status of many thousands of very old jobs. This is causing load sometimes over 
    1000 on the gatekeeper and general unresponsiveness. My suspicion is that a timeout is causing the "empty error string". In any case, the site is dysfunctional until this Condor-G issue can be fixed.  However, I'm also 
    in the middle of making some changes and I can't see their effects unless the site is loaded like this, which is why I haven't set it to brokeroff.
    Update 1/3: Issue resolved, ggus 64712 closed.
    (ii)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment 
    please do not use that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    (iii)  12/16: Offline database infrastructure will be split into two separate Oracle RAC setups: (i) ADC applications (PanDA, Prodsys, DDM, AKTR) migrating to a new Oracle setup, called ADCR; (ii) Other applications 
    (PVSS, COOL, Tier0) will remain in the current ATLR setup.  See: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/OfflineDBSplit
    (iv)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  Once the 
    transfer completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (v)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    (vi)  12/22: Problem with the installation of atlas release 16.0.3.3 on the grid in most clouds.  Alessandro informed, ggus 65628, eLog 20818.
    (vii)  12/26: SWT2_CPB file transfer errors such as "Error:/bin/mkdir: cannot create directory `/xrd/datadisk/step09/AOD/closed/step09.20201052000083L.physics_A.recon.AOD.closed'. "  
    https://savannah.cern.ch/bugs/index.php?76672, ggus 65707/34, RT 19062/65, eLog 20884/928.
    Update 12/30 from Patrick: A storage server developed a problem that affected these transfers. The server was restored and the file system and SRM components were restarted. SAM tests and local tests are passing.  
    All ggus and RT tickets closed.
    (viii)  12/26: UTD-HEP set off-line for site maintenance.  https://savannah.cern.ch/support/?118511, eLog 20891.
    (ix)  12/26 - 12/28: Power outage in the MSU (AGLT2) server room at approximately 8:30pm 12/26.  Site recovered from the outage as of 12/28 afternoon.  ggus 65730, RT 19064, eLog 20911.
    Update from Bob, 12/30: dCache server issue resolved during the AM on 12/29.  Auto-pilots were re-enabled on UM workers only at 13:30 that day.  MSU worker nodes were brought on line over the next 4 hours or so, 
    and we are now at nearly our full capacity.  Will bring up the balance of worker nodes during normal hours next week.  ggus & RT tickets closed.
    

DDM Operations (Hiro)

Throughput Initiative (Shawn)

Site news and issues (all sites)

  • T1:
    • last week(s): Planning for a downtime in January for the dCache upgrade; have ordered 2 PB of disk (21 Nexsan expansion units, 120 TB each with dual controllers, added to the existing servers/setups). Network re-organization to increase backbone capacity.
    • this week: Will take advantage of the ATLAS downtime in 10 days to do a major dCache upgrade, but still using pnfs. Major network reconfiguration as well, starting that Saturday; may need to extend into Tuesday. Will also upgrade LFC to 1.8.0-1.

  • AGLT2:
    • last week: Having a strange problem where new datasets have missing files on sub-group disk. No evidence that this was done by central deletion. Did dCache remove these files? It happened on Dec 10, according to the billing database. The files are still in the LFC, and there is nothing in the SRM logs.
    • this week: Site downtime at the end of the month; will plan the dCache upgrade then. At MSU, on Dec 26 a smoke detector went off and dropped power via the EPO; restarted the following Monday. A week later, at UM, a GUMS server issue led to a Condor job manager communication problem; there was a problem with the schedd and Condor could not recover (had to restart from scratch). Waiting for a new release of the schedd.

  • NET2:
    • last week(s): Fixed condor-lsf job-manager sync at HU. More analysis jobs at NET2? John will change the limits on analysis.
    • this week: Ran smoothly over the holidays. Working on the xrootd federation setup. Working on getting analysis jobs running full-out at HU, and setting up HC tests. Looking at direct access from HU, mounting GPFS via NFS. Purchasing a Tier 3 rack of worker nodes. John will summarize the LSF issues in a memo.

  • MWT2:
    • last week(s): Analysis pilot starvation was causing a poor user experience at MWT2. (Xin has moved the analysis pilot submitter to a new host, and there have also been changes to the autopilot_adjuster and the timefloor setting in schedconfig.)
    • this week:

  • SWT2 (UTA):
    • last week: SRM incident - a change was made to the Bestman config: a dynamic list of gridftp servers is read by a plugin from a file, based on the file's modification time. The parsing does not ignore a blank line, which caused the problem (see the parsing sketch after this entry). All running fine now. The analysis queue was set to brokeroff because of HC and the distribution of dbrelease files; site testing expected the dbrelease file to be there arbitrarily soon. Is HC by-passing brokerage? Kaushik would have to check with Tadashi. The 10G network is in place and being commissioned; there seems to be a 1G bottleneck.
    • this week: A Perc6 card locked up and dropped two shelves, causing problems at CPB. Otherwise okay.
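
    Not the plugin's actual code, just a minimal sketch of the kind of blank-line- and comment-tolerant parsing that would have avoided the incident above; the file path is a placeholder:

        def load_gridftp_servers(path):
            """Read one gridftp endpoint per line, skipping blank lines and # comments."""
            servers = []
            with open(path) as f:
                for line in f:
                    entry = line.strip()
                    if not entry or entry.startswith("#"):
                        continue          # a stray blank line is exactly what tripped up the real plugin
                    servers.append(entry)
            return servers

        if __name__ == "__main__":
            # placeholder path; the real list lives wherever the Bestman plugin reads it from
            for host in load_gridftp_servers("/opt/bestman/conf/gridftp-servers.txt"):
                print(host)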

  • SWT2 (OU):
    • last week: Lustre upgrade nearly complete. Expect to turn site back on shortly. Will put into test mode first.
    • this week: all is well; the network issue was cleared up. The order for 19 R410s is out.

  • WT2:
    • last week(s): running smoothly. One of the data servers failed and recovered okay, but this is unsettling since it is not clear what caused the failure.
    • this week: Over the break all was okay. Planning to upgrade Solaris systems to a newer level.

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report
    • AGLT2 is now running Alessandro's system, in automatic installation mode. Will do other sites after the holiday.
    • There was a problem with 16.0.3.3; SIT released a new version, and Xin is installing it at all the sites.
    • Xin is still generating the PFC for the sites.
  • this meeting:

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems are with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs that lead to hangs in dcap (a minimal direct-access read sketch appears after this list).
  • A newline was causing a buffer overflow
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate yet on when libdcap++ will be included in the release
  • No new failure modes observed
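
A minimal direct-access smoke test via PyROOT, assuming the dcap plugin is available; the door host, port, PNFS path, and tree name are placeholders, not site values:

    import ROOT

    # Placeholder dcap URL; substitute the site's dcap door and PNFS path.
    url = "dcap://dcap-door.example.org:22125/pnfs/example.org/data/atlas/some_file.root"

    f = ROOT.TFile.Open(url)           # goes through libdcap when the dcap plugin is loaded
    if not f or f.IsZombie():
        raise SystemExit("could not open %s over dcap" % url)
    print("opened %s, size %d bytes" % (f.GetName(), f.GetSize()))

    tree = f.Get("CollectionTree")     # placeholder tree name
    if tree:
        print("entries: %d" % tree.GetEntries())
    f.Close()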

this meeting:

AOB

  • last week
  • this week
    • Providing the pledged capacities for 2011: this milestone is due by April 1. Processing capacity is pretty much okay, but there are shortfalls in disk.
    • DYNES proposals will be evaluated by the end of the month.
    • The LCG CERN certificate is expiring soon; a new voms package is available (Hiro will send mail; see the expiry-check sketch below).
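
A minimal sketch for checking when a certificate expires, assuming openssl is on the PATH; the certificate path is a placeholder for wherever the relevant VOMS certificate lives on a given host:

    import subprocess

    CERT = "/etc/grid-security/vomsdir/example-voms-cert.pem"   # placeholder path

    out = subprocess.check_output(["openssl", "x509", "-in", CERT, "-noout", "-enddate"])
    print(out.decode().strip())   # prints a line like "notAfter=<date>"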


-- RobertGardner - 04 Jan 2011
