
MinutesDec22

Introduction

Minutes of the Facilities Integration Program meeting, Dec 22, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Rob, Aaron, Nate, Charles, Sarah, Patrick, Michael, Rik, Torre, Kaushik, Armen, Mark, John, Xin, Wei, Hiro,
  • Apologies: Doug, Shawn, Bob, Fred, Horst, Karthik

Integration program update (Rob, Michael)

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
this week:
  • Xrootd demonstrator projects will be presented at the WLCG GDB meeting 12-January-2011 (Doug will present the Tier 3 Xrootd demonstrator talk)
  • NET2 will try out the installation instructions.
  • What is the time frame for testing?
  • Wei - there is a problem with checksums; he has discussed it with Andy. A method for calculating the checksum needs to be provided. There is a workaround, but it is somewhat expensive. (A checksum sketch follows this list.)
  • AGLT2 will set this up next, in January.
  • Let Charles know when setting this up, and provide feedback.
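
For reference on the checksum discussion above, the following is a minimal sketch of how an adler32 checksum could be computed for a file in a streaming fashion on the site side; the command-line interface and output format are illustrative assumptions, not the actual xrootd checksum hook or the workaround mentioned.

    #!/usr/bin/env python
    # Minimal sketch: compute an adler32 checksum for a file without loading it
    # entirely into memory.  The CLI and hex output format are illustrative
    # assumptions, not the actual xrootd checksum hook used at any site.
    import sys
    import zlib

    def adler32_of(path, blocksize=4 * 1024 * 1024):
        value = 1  # adler32 is defined to start at 1
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                value = zlib.adler32(block, value)
        return value & 0xffffffff  # force an unsigned 32-bit result

    if __name__ == "__main__":
        print("%08x" % adler32_of(sys.argv[1]))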

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

References
  • The links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
last week(s):
  • Hampton and Bellarmine want to do production; they will need to go through production testing
  • T3g's will get a blanket approval
  • Suggest using CVMFS for releases
  • Bellarmine - has only a 100 Mbps network link, but it is reliable
  • Doug wants both Hampton and Bellarmine to submit a proposal and be thoroughly tested before being turned on for official production
  • T3's associated with T2's are grandfathered in.
  • Will ask these sites to join the meeting and review their production performance before admitting them into production
this week:
  • UC Irvine, UC Santa Cruz, and SUNY Stony Brook are all setting up Tier 3 sites.
  • All Tier 3 sites (even those co-located with Tier 1 and Tier 2 sites) are requested to register with OSG in OIM. This information will eventually be migrated into AGIS.
  • Shuwei Ye has agreed to help set up a test package that will allow Tier 3 sites to test their configuration (storage, batch system, etc.).
  • UIUC and UTD have been contacted and requested to attend this meeting, as they are production sites.
  • Bellarmine and Hampton have also been contacted and requested to attend this meeting, as they would like to become production sites. The process has been explained to them.
  • Some issues have been seen with dq2-get + FTS: not all files in a large dataset were received (only 100 files out of ~5000). The transfer reported success and no errors were seen; a ticket has been created. (Hiro will have a look; a completeness-check sketch follows this list.)
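
As an illustration of the kind of completeness check relevant to the dq2-get + FTS issue above, here is a minimal sketch that compares a list of expected file names (e.g. saved from dq2-list-files into a text file) against the files that actually arrived in a local directory; the one-name-per-line input file and the flat destination directory are assumptions made for the example, not the actual tooling in use.

    #!/usr/bin/env python
    # Minimal sketch: report dataset files that were expected but never arrived.
    # "expected.txt" (one file name per line) and a flat destination directory
    # are illustrative assumptions, not the actual dq2/FTS tooling.
    import os
    import sys

    def missing_files(expected_list, dest_dir):
        with open(expected_list) as f:
            expected = set(line.strip() for line in f if line.strip())
        received = set(os.listdir(dest_dir))
        return sorted(expected - received)

    if __name__ == "__main__":
        missing = missing_files(sys.argv[1], sys.argv[2])
        print("%d expected files are missing" % len(missing))
        for name in missing:
            print("  " + name)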

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=116757
    
    1)  12/8: SWT2_CPB - RT 18834 was created due to OSG "failing one or more critical metrics."  Issue was with the containercert.pem / containerkey.pem files needing to be updated, which was causing problems with the gums servers.  
    Since resolved, RT ticket closed.
    2)  12/8: file transfer failures - MWT2_UC_USERDISK to MWT2_UC_LOCALGROUPDISK with SOURCE "AsyncWait" errors.  From Sarah: One of our pools went unresponsive this morning and had to be rebooted. It is operational again 
    and we should see transfers start to succeed.  ggus 65122 closed, eLog 20611.
    3)  12/9: UTD-HEP - job failures with errors like "Can't mkdir: /net/yy/srmcache/atlaslocalgroupdisk/user/mahsan/201012021617/..."
    From the site admin: We're in the process of cleaning up a large accumulation of dark data. This issue should be resolved shortly.
    ggus 65137 closed, eLog 20535.
    4)  12/9: file transfer from NET2_DATADISK to AGLT2_DATADISK were failing with the error "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed 
    to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server].  From Saul: Rebooting our gatekeeper caused the problems. Fixed.  ggus 65188/191 closed, eLog 20545/72.
    5)  12/9: From Bob at AGLT2: There is a cooling issue in the MSU server room.  We are idling down 3 racks of worker nodes, but if the room temp begins to rise uncontrollably, we will throw the switch and crash the workers.  
    Will update when we know more.  Later: Temperature is under control, but 3 racks are stuck now in condor peaceful retirement.  As those complete, I'll re-enable them to accept jobs.
    6)  12/9: job failures in the US cloud with "SFN not set in LFC for guid" errors - for example:
    pilot: Get error: SFN not set in LFC for guid 882A6BDC-F399-DF11-8B93-A4BADB532C99 (check LFC server version)
    Possibly related to panda db glitch on 12/8.  ggus 65158 closed,  eLog 20610.
    7)  12/10: shifter messages to 'atlas-support-cloud-US@cern.ch' were being forwarded to 'usatlas-grid-l@lists.bnl.gov' as well, which resulted in an overly wide distribution list.  Fixed.
    8)  12/10: From Bob at AGLT2: We have decided we are not ready for a downtime on Dec 15, so will put this off until (likely) some time in January.
    9)  12/10: File transfer errors with AGLT2 as the source - "[SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist] Duration [0]."  https://savannah.cern.ch/bugs/index.php?76323, eLog 20627.  
    Problem of missing files under investigation.
    10)  12/11: SWT2_CPB_PERF-EGAMMA - DaTrI requests were in the subscribed state for several days.  No real problem, as the dataset in question was eventually transferred.  Transfer speeds should improve once the full migration 
    to the new 10 Gb/s link is complete.  https://savannah.cern.ch/bugs/index.php?76329, eLog 20575.
    11) 12/15: BNL dCache maintenance - 9:30:00 => 14:00:00 EST - no access (read and write) during this period.
    12)  12/15: OU sites maintenance outage - from Horst: beginning at ~10 am CST for a few hours to upgrade our Lustre file system.
    
    Follow-ups from earlier reports:
    
    (i)  11/29 - 12/1: HU_ATLAS_Tier2 job failures with "Exception caught in pilot: (empty error string)" error.  From John at HU:
    Our site was not experiencing any more of these errors when I looked. We've finished over 1500 jobs with a 0% failure rate for the past 12 hours. This ticket can be closed.  ggus 64712 / RT 18740 closed, eLog 20212.  
    (Site was set offline by shifters during this period - probably a bit overly aggressive, plus the site was not notified.)
    Update, 12/7: job failures with the error "pilot.py | !!FAILED!!1999!! Exception caught in pilot: (empty error string), Traceback (most recent call last):
    File "/n/atlasgrid/home/usatlas1/gram_scratch_TU82Q0BNvq/pilot.py," which was the original subject of ggus 64712 (now re-opened).  From John at HU:
    We are working with condor developers to see what we can do about the slow status updates of jobs. The grid monitor is querying the status of many thousands of very old jobs. This is causing load sometimes over 1000 on the 
    gatekeeper and general unresponsiveness. My suspicion is that a timeout is causing the "empty error string". In any case, the site is dysfunctional until this Condor-G issue can be fixed.  However, I'm also in the middle of making 
    some changes and I can't see their effects unless the site is loaded like this, which is why I haven't set it to brokeroff.
    (ii)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment please do 
    not use that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    (iii)  12/3: OU_OCHEP_SWT2 file transfer errors: "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3 tries]."  Horst reported that a problem with Lustre was fixed by DDN.  
    ggus 64922 / RT 18801 closed, eLog 20306.
    12/5:  Lustre/network problems reappeared.  RT 18818 / ggus 65018 in-progress, eLog 20362.
    12/8 from Horst: the Lustre server upgrade is mostly complete, but the clients haven't been upgraded, since that will require a complete downtime, which I don't want to do while nobody is around locally, so we'll do 
    that next week some time.
    Update 12/15: see 12) above - outage to upgrade Lustre.
    (iv)  12/8: Oracle maintenance at CERN required that the Panda server be shutdown for ~one hour beginning at around 9:00 UTC.  Should not have a major impact, but expect some job failures during this period.
    Update 12/9: Problem occurred during the maintenance which led to a large number of failed jobs.  Issue eventually resolved.  See eLog 20495. 
    • us-cloud-support list distribution now reduced
    • AGLT2 transfer failures - some missing files
    • BNL - down for dCache upgrade
    • OU - down for Lustre upgrade
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-12_20_10.html
    
    1)  12/16: Offline database infrastructure will be split into two separate Oracle RAC setups: (i) ADC applications (PanDA, Prodsys, DDM, AKTR) migrating to a new Oracle setup, called ADCR; (ii) Other applications (PVSS, COOL, Tier0) 
    will remain in the current ATLR setup.  See: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/OfflineDBSplit
    2)  12/16: At the conclusion of the Lustre upgrade at OU, test jobs at OU_OCHEP_SWT2 were failing due to missing ATLAS s/w releases.  The issue was due to links pointing to the actual s/w areas having disappeared.  The links were recreated and the problem was solved.  
    All test jobs now completing successfully - queues back to 'on-line'.
    3)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  Once the transfer completed 
    the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    4)  12/18:  Power cut at CERN.  Most issues resolved by ~22:00 UTC, although there were a few lingering effects for several hours afterwards.  Very large number of "lost heartbeat" panda failures over the next 12 or so hours.  
    See for example eLog 20750/51.
    5)  12/19: WISC transfer errors - "Unable to connect to atlas03.cs.wisc.edu."  From Wen: This server has some problem. A backup server is setup to replace it now.  ggus 65545 closed, eLog 20797.  Site un-blacklisted.
    6)  12/19 - 12/20: Job failures at BNL with stage-in errors - for example: "lsm-get failed (201): [201] Copy command failed! dccp failed and since the file is not in DDN pools no retry using another tool is possible!"  
    From Iris: These errors are due to four storage pools being offline. We are investigating and restarting them now.  Issue resolved.  ggus 65549 / 65556 closed, eLog 20780 / 92.
    7)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    8)  12/21: SWT2_CPB file transfer failures - " failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]."  Issue understood - from Patrick: There was a problem with the edit I made to a configuration file related 
    to Bestman that caused the initial problem.  After making a modification to the gridftp door, I restarted Bestman, and there was a new problem with SRM accepting DOEGrids certs.  A final restart of Bestman cleared this issue.  
    ggus 65627 / RT 19045 will be closed today once we confirm everything is working o.k.  eLog 20817.
    9)  12/22: MWT2_UC file transfer errors.  Issue understood - from Sarah: I was adjusting settings on our SRM door this morning, and temporarily enabled a setting that caused all space token transfers to fail. I've backed out the change 
    and we should start to see transfers succeed.  eLog 20829.
    
    Follow-ups from earlier reports:
    
    (i)  11/29 - 12/1: HU_ATLAS_Tier2 job failures with "Exception caught in pilot: (empty error string)" error.  From John at HU:
    Our site was not experiencing any more of these errors when I looked. We've finished over 1500 jobs with a 0% failure rate for the past 12 hours. This ticket can be closed.  ggus 64712 / RT 18740 closed, eLog 20212.  
    (Site was set offline by shifters during this period - probably a bit overly aggressive, plus the site was not notified.)
    Update, 12/7: job failures with the error "pilot.py | !!FAILED!!1999!! Exception caught in pilot: (empty error string), Traceback (most recent call last):
    File "/n/atlasgrid/home/usatlas1/gram_scratch_TU82Q0BNvq/pilot.py," which was the original subject of ggus 64712 (now re-opened).  From John at HU:
    We are working with condor developers to see what we can do about the slow status updates of jobs. The grid monitor is querying the status of many thousands of very old jobs. This is causing load sometimes over 1000 
    on the gatekeeper and general unresponsiveness. My suspicion is that a timeout is causing the "empty error string". In any case, the site is dysfunctional until this Condor-G issue can be fixed.  However, I'm also in the middle 
    of making some changes and I can't see their effects unless the site is loaded like this, which is why I haven't set it to brokeroff.
    (ii)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment please 
    do not use that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    (iii)  12/3: OU_OCHEP_SWT2 file transfer errors: "failed to contact on remote SRM [httpg://tier2-02.ochep.ou.edu:8443/srm/v2/server]. Givin' up after 3 tries]."  Horst reported that a problem with Lustre was fixed by DDN.  
    ggus 64922 / RT 18801 closed, eLog 20306.
    12/5:  Lustre/network problems reappeared.  RT 18818 / ggus 65018 in-progress, eLog 20362.
    12/8 from Horst: the Lustre server upgrade is mostly complete, but the clients haven't been upgraded, since that will require a complete downtime, which I don't want to do while nobody is around locally, so we'll do that next week some time.
    Update 12/15: outage to upgrade Lustre.
    Update 12/16 from Horst: the Lustre upgrade is complete, and all looks well.  ggus 65018 / RT 18818 closed, eLog 20709.
    (iv)  12/10: From Bob at AGLT2: We have decided we are not ready for a downtime on Dec 15, so will put this off until (likely) some time in January.
    (v)  12/10: File transfer errors with AGLT2 as the source - "[SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist] Duration [0]."  https://savannah.cern.ch/bugs/index.php?76323, eLog 20627.  
    Problem of missing files under investigation.
    Update 12/17 (from Cedric Serfon): Files declared.  47 are being recovered.  415 are definitively lost and removed from DQ2.  Savannah ticket closed.
    (vi)  12/15: BNL dCache maintenance - 9:30:00 => 14:00:00 EST - no access (read and write) during this period.
    Update: maintenance completed as of ~7:30 p.m. EST.  eLog 20689.
    
    • OU back online after the Lustre upgrade; there was a problem with symlinks to the release areas disappearing, which broke the releases, but it has been straightened out. (A symlink-check sketch follows this list.)
    • Power cut at CERN over the weekend
    • Nagios alerts at BNL - memory usage on submit hosts. Xin notes this is a false alarm.
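
Related to the OU symlink problem noted above, here is a minimal sketch of a check a site could run over its software release area to flag dangling symlinks; the release-area path argument is an illustrative assumption.

    #!/usr/bin/env python
    # Minimal sketch: walk a software release area and report symlinks whose
    # targets no longer exist.  The path argument is an illustrative assumption.
    import os
    import sys

    def dangling_links(top):
        broken = []
        for dirpath, dirnames, filenames in os.walk(top):
            for name in dirnames + filenames:
                full = os.path.join(dirpath, name)
                if os.path.islink(full) and not os.path.exists(full):
                    broken.append(full)
        return broken

    if __name__ == "__main__":
        for link in dangling_links(sys.argv[1]):
            print("dangling symlink: " + link)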

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • LFC consolidation - the motivation was discussed
    • Can each site confirm they are making backups of the LFC? They are.
    • PoolFileCatalog creation - the DQ2 developer made a new option to get around the dq2-ls regex hack used in the US; the ToA will be changed as a result. dCap doors should use an alias.
    • DDM - all is well
    • SRM deletion is slow - why? It has to do with the ordering by space tokens. Hiro has several feature requests in with the developers.
    • LFC dumps are now automated on a weekly basis, as SQLite files. Charles will modify CCC to handle this format. (A consistency-check sketch follows this section's bullets.)
  • this meeting:
    • There was a problem with a backlog of production jobs over the weekend: output from merge jobs transferring back to BNL was competing with production transfers.
    • Question about LFC Tier 3 usage - we will have more sites requesting this. Is the path for getting new Tier 3's supported clear?
    • Hampton and Bellarmine - which want to be production sites - will need some support. Horst will provide guidance.
    • Hiro is sending a daily, via DDM. There is a script in the T3 SVN that does a consistency check.
    • FTS channels were closed for a bit to study some transfer contention issues at BNL.
    • Wei - for network testing to a specific host, use the BNL star channel: do a manual transfer, get the transfer id, then go to the monitor and view the log to determine the result.
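
As an illustration of the consistency checking discussed above (e.g. CCC against the weekly SQLite LFC dumps), here is a minimal sketch; the table and column names assumed for the dump ("files", "lfn") and the one-name-per-line storage listing are illustrative assumptions, not the real dump schema or the T3 SVN script.

    #!/usr/bin/env python
    # Minimal sketch: compare an LFC dump (SQLite) against a plain-text storage
    # listing.  The table/column names ("files", "lfn") and the one-name-per-line
    # storage listing are illustrative assumptions, not the real CCC schema.
    import sqlite3
    import sys

    def lfc_entries(dump_path):
        conn = sqlite3.connect(dump_path)
        try:
            return set(row[0] for row in conn.execute("SELECT lfn FROM files"))
        finally:
            conn.close()

    def storage_entries(listing_path):
        with open(listing_path) as f:
            return set(line.strip() for line in f if line.strip())

    if __name__ == "__main__":
        lfc = lfc_entries(sys.argv[1])
        storage = storage_entries(sys.argv[2])
        print("in LFC but not on storage (lost): %d" % len(lfc - storage))
        print("on storage but not in LFC (dark): %d" % len(storage - lfc))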

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • There are some nice Nagios pages set up by Tom at BNL
    • DYNES application deadline is today
    • There is equipment for 40 sites - a site selection committee will choose among the applications
    • Will also fund 14 regional networks
    • Completed applications: MSU, UM, UTA, IU, BU/HU/Tufts, UC, UW
    • SLAC encouraged to submit
    • OU is not sure if an application was submitted for its campus
  • this week:
    • Should SLAC participate? Wei has had some local discussions - there are some DOE-NSF obstacles; Wei will follow up.

Site news and issues (all sites)

  • T1:
    • last week(s): A BlueArc storage appliance is to be evaluated, to be put on top of the DDN 9900; there is also a call to work with DDN on improving the 10K for random access, which is important for direct reading. Torre is in contact with Yushu at BNL about running Panda machines on the Magellan cloud; a virtual machine has been set up; a Panda site will need to be set up.
    • this week: Planning for a downtime in January for the dCache upgrade; have ordered 2 PB of disk (21 Nexsan expansion units at 120 TB each with dual controllers, added to the existing server setups). Network re-organization to increase backbone capacity.

  • AGLT2:
    • last week: Having a strange problem where new datasets have missing files in sub-group disk. There is no evidence this was done by central deletion. Did dCache remove these files? It happened on Dec 10, according to the billing database. The files are still in the LFC. Nothing in the SRM logs.
    • this week:

  • NET2:
    • last week(s): Just submitted the DYNES application. Changed the GIP configuration for a single sub-cluster; now Alessandro's installation is working.
    • this week: Fixed the condor-lsf job-manager sync at HU. More analysis jobs at NET2? John will change the limits on analysis.

  • MWT2:
    • last week(s): The CIC OmniPoP router at Starlight stopped advertising routes for the MWT2_UC <--> MWT2_IU direct path; as a result, traffic between the sites was routed via Internet2 (the fallback route), causing a 4x increase in latency. This caused jobs at IU to run inefficiently over the weekend. Fixed on Monday. The Chimera configuration investigation continues: the domain locks up during deletes; the number of threads has been lowered (in comparison with AGLT2) - cautiously optimistic. Some parts arrived for new equipment at UC.
    • this week:
      • Analysis pilot starvation causing poor user experience at MWT2.

  • SWT2 (UTA):
    • last week: All is well
    • this week: SRM incident - a change was made to the Bestman configuration: a dynamic list of gridftp servers is read by a plugin from a file, based on the file's modification time. The parsing of this file does not ignore blank lines, which caused the problem (a parsing sketch follows this item). All is running fine now. The analysis queue was set to brokeroff because of HammerCloud and the distribution of DBRelease files: site testing expected the DBRelease file to be there arbitrarily soon. Is HC bypassing brokerage? Kaushik will have to check with Tadashi. The 10G network is in place and being commissioned now; there seems to be a 1G bottleneck.
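
For the configuration-file issue above, here is a minimal sketch of blank-line-tolerant parsing of a gridftp server list; the one-host-per-line format with '#' comments is an illustrative assumption, not Bestman's actual plugin interface.

    #!/usr/bin/env python
    # Minimal sketch: read a list of gridftp servers from a file, skipping blank
    # lines and comments so a stray empty line cannot produce an empty entry.
    # The one-host-per-line format with '#' comments is an illustrative assumption.

    def read_gridftp_servers(path):
        servers = []
        with open(path) as f:
            for line in f:
                entry = line.split("#", 1)[0].strip()  # drop comments and whitespace
                if entry:                              # skip blank lines
                    servers.append(entry)
        return servers

    if __name__ == "__main__":
        import sys
        for host in read_gridftp_servers(sys.argv[1]):
            print(host)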

  • SWT2 (OU):
    • last week: Lustre upgrade nearly complete. Expect to turn site back on shortly. Will put into test mode first.
    • this week:

  • WT2:
    • last week(s): Running smoothly. One of the data servers failed - it recovered okay, but this is concerning since it is not clear what caused the failure.
    • this week: all is well. HC.

Carryover issues (any updates?)

Release installation, validation (Xin)

The issues of the validation process, the completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report
  • this meeting:
    • AGLT2 is now running Alessandro's system, in automatic installation mode. Other sites will be done after the holiday.
    • There was a problem with 16.0.3.3; SIT released a new version, and Xin is installing it at all the sites.
    • Xin is still generating the PFC (PoolFileCatalog) for the sites. (A schematic sketch follows this list.)
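
For reference on the PFC generation mentioned above, here is a minimal sketch that writes a PoolFileCatalog-style XML file from a GUID-to-(PFN, LFN) mapping; the example entry, input mapping, and attribute details are simplified assumptions, not the production PFC-generation tooling.

    #!/usr/bin/env python
    # Minimal sketch: write a PoolFileCatalog-style XML file from a dictionary of
    # {guid: (pfn, lfn)}.  The example entry and attribute details are simplified
    # illustrative assumptions, not the production PFC-generation tooling.
    import xml.etree.ElementTree as ET

    def write_pfc(entries, out_path):
        catalog = ET.Element("POOLFILECATALOG")
        for guid, (pfn, lfn) in sorted(entries.items()):
            f = ET.SubElement(catalog, "File", ID=guid)
            physical = ET.SubElement(f, "physical")
            ET.SubElement(physical, "pfn", filetype="ROOT_All", name=pfn)
            logical = ET.SubElement(f, "logical")
            ET.SubElement(logical, "lfn", name=lfn)
        ET.ElementTree(catalog).write(out_path)

    if __name__ == "__main__":
        example = {
            "12345678-ABCD-EF01-2345-6789ABCDEF01":
                ("dcap://example.org:22125/pnfs/example/myfile.root", "myfile.root"),
        }
        write_pfc(example, "PoolFileCatalog.xml")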

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • A newline was causing a buffer overflow.
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion in to the release
  • No new failure modes observed

this meeting:

AOB

  • last week
  • this week
    • No meeting next week.


-- RobertGardner - 20 Dec 2010


Attachments


png prod_queue.png (30.4K) | AaronvanMeerten, 21 Dec 2010 - 16:56 | Production Queue for MWT2
png analy_queue.png (32.9K) | AaronvanMeerten, 21 Dec 2010 - 16:57 | Analysis Queue for MWT2
png running_jobs.png (31.7K) | AaronvanMeerten, 21 Dec 2010 - 16:58 | Running Jobs at MWT2
 