r86 - 03 Aug 2007 - 07:58:25 - AlexeiKlimentov

DQ2SiteServicesP1

Goal and schedule

This Phase 1 task covers the integration and deployment of the set of services for the DQ2 0.3 system.
  • Week of June 18: install and evaluate DQ2 0.3 from installation instructions below.
    • June 19: DQ2 0.2.12 shutdown.
    • June 20: continue initial subscription tests, and all other functional tests
    • June 20-22: install and evaluate client tools
    • Panda integration tests to new sites at UTA, UC, BU
    • Finish first install on all US ATLAS sites by Friday.
  • Week of June 25:
    • Continued debugging of initial installations
    • Continued tests of client tools
    • Redo large scale subscription tests
    • Get all new site services centrally logged for troubleshooting console
  • Week of July 2
    • Continued debugging of DQ2 0.3 at various points.
    • Update 7/5/07 - upgrade to site services.
    • Resume production

Site admins: please post installation tricks/tips below. Also, when your installation is complete and the first order validation is complete (as coordinated by Alexei), please update the SiteCertificationP1 table with a DONE flag.

Installation, configuration, and startup instructions

This should cover the basic instructions for deploying DQ2 0.3 site services (and client tools).

Client tools

  • How to install client tools on an OSG site? Copy from /afs/cern.ch/atlas/offline/external/GRID/ddm/pro03 to your local installation directory, and modify the path string in dq2.sh to reflect local setup.
  • Validation tests of the client tools. Try 'source dq2.sh; dq2-ping; dq2-list-dataset'.
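
The validation step above can be scripted so that every site runs the same checks. The following is only a sketch: the function name and the `run` parameter are illustrative and not part of DQ2, and it assumes that both commands return exit status 0 on success, which this page does not confirm.

```python
import subprocess

# Commands taken from the validation bullet above.
VALIDATION_COMMANDS = ["dq2-ping", "dq2-list-dataset"]


def validate_client_tools(setup_script="dq2.sh", run=subprocess.call):
    """Source the setup script, then run each validation command in bash.

    Returns a dict mapping command name to exit status.
    Assumption: exit status 0 means the command succeeded.
    """
    statuses = {}
    for command in VALIDATION_COMMANDS:
        # Each command gets a fresh shell, so the environment is set up
        # from scratch every time.
        shell_line = "source %s && %s" % (setup_script, command)
        statuses[command] = run(["bash", "-c", shell_line])
    return statuses
```

Calling `validate_client_tools("/path/to/local/dq2.sh")` after editing the path string in dq2.sh gives a per-command status that can be pasted into the site notes.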

Validation

  • Simple subscription test as described in the instructions should be performed.
  • Alexei's first-order subscription test.
  • Update your site in the SiteCertificationP1 after completing all the steps above.

Known issues uncovered during migration

Please record here problems/issues encountered during the DQ2 0.3 install, as feedback for the developers.

  • Allowing subscriber to specify share (needed to prioritize production datasets) working?
  • LFC client is generating spurious output to shell which started agents
  • DQ2 logs contain warning about proxy (fix provided by Hiro) done?
  • Location of dashboard lock files (scripts that clean /tmp may cause problems)
  • Dashb-agent-stop sometimes fails
  • Agents can take some time to stop
  • PFNs might be renamed; the leaf is no longer the LFN
  • Jun 19 13:00 EDT DQ2 client installed at BNL (WD)
  • Jun 19 BNLDISK, BNLPANDA instances are ready (HI)
  • Jun 19 UTA is ready for DQ2 0.3 tests (PM)
  • Jun 20 11:00 EDT AGLT2 is ready for DQ2 0.3 tests (SM)
  • Jun 20 11:00 EDT Central database migration at CERN is done (MB,PS)
  • Jun 20 12:20 EDT the whole panda (including Adder) is tested using an analysis job (TM)
  • Jun 20 12:20 EDT dq2_user tools tested (TM)
  • Jun 20 15:40 EDT test datasets (including 4GB files) are subscribed from CERNCAF to BNLDISK and BNLPANDA (AK)
  • Jun 20 16:30 EDT (?) "0.3 are not logging correctly." and "logging provided by 0.3 is a step backwards from 0.2." (PM -> MB)
  • Jun 20 16:50 EDT MC production datasets registration pre-testing (AK)
  • Jun 20 evening bulk subscription test for UTA (PM), (?) site services locked up overnight
  • Jun 20 evening SLACXRD is ready for DQ2 0.3 tests (YW,PM)
  • Jun 20 evening Panda tests in progress (1000 jobs submitted)
  • Jun 20 night Reached 1000 running jobs in Panda (BNL, UTA). Many jobs finishing successfully. UTA jobs waiting for transfer back to BNL.
  • Jun 21 morning Pedro/Miguel report row locking problems. Maximum number of processes in Panda reduced from 25 to 1. Problem continues. Consultation started with CERN Oracle DBA. Pre-testing missed this because the use case of many updates to the same dataset was not simulated.
  • Jun 21 morning Panda adder calls registerFilesInDataset twice. The second call positively verifies that registration succeeded. Originally, listFilesInDataset was used for verification - but performance was too slow. Reverting back to this method for now, to reduce row locking problems. In future may need new api: checkDatasets for verification.
  • Jun 20 18:40 EDT CERNCAF->BNL->AGLT2, CERNCAF->BNLDISK->UTA_SWT2 data transfer tests (AK)
  • Jun 21 09:25 EDT No transfer started from CERNCAF to BNLPANDA
  • Jun 21 11:30 EDT transfer from CERNCAF to BNLPANDA resumed (HI), "The problem with BNLPANDA was caused by the local path in dcache. By adding the custom path codes as it was in old version, it works. But, the custom code will cause the headache for apt/rpm update". Small files are transferred. None of the 4GB files has been transferred to BNLPANDA.
  • Jun 21 09:25 EDT BNL->SLACXRD data transfer test (YW,AK)
  • Jun 21 12:05 EDT The version with fixes is available. Upgrade is needed for all installed instances. DQ2 SERVICES ARE STOPPED. We will try with UTA_SWT2 and AGLT2 first. The upgrade should fix the problems reported to the developers yesterday/today.
  • Jun 21 13:45 EDT AGLT2 upgraded, data is moving to AGLT2 now (SM). UTA_SWT2 upgraded, much better performance with respect to finding replicas (PM); performance verification is in progress by replicating a large dataset (PM)
  • Jun 21 13:55 EDT PANDA monitor is working with DQ2 0.3 on gridui03 (TW)
  • Jun 21 15:15 EDT SLACXRD upgraded (YW), BU asked to start installation/upgrade
  • Jun 21 16:00 EDT Largish subscription test for DQ2 0.3 (UTA and SLACXRD), BU upgraded (WD)
  • Jun 21 17:20 EDT Realized that "0.2" datasets subscriptions to BNL are not restored. "need Pedro to do it safely" (MB)
  • Jun 21 evening CERNCAF->BNLDISK/BNLPANDA test transfers finished. BNL, AGLT2, SLACXRD have complete replicas. There is still a question about the FTS channel: it looks like there are no errors with the transfer of small files, but 4 GB files were transferred only after several attempts. Largish subscription test for DQ2 0.3 done for SLACXRD (1000 files, 135 mins, YW)
  • Jun 21 evening Out of 292 subscriptions to BNLPANDA none is seen with dq2-list-subscription-site BNLPANDA
  • Jun 22 morning dq2-list-subscription-site BNLPANDA returns list of datasets
  • Jun 22 afternoon PANDA server works in production mode (TW,PS,TM)
  • Jun 22 evening DQ2 0.2 subscriptions are not restored yet for all Tiers. (it is confirmed by Pedro, that all subscriptions to BNLPANDA are restored)
  • Jun 23 morning DQ2 0.3 test transfer request from French T2s to LYONDISK no transfer after 24h, ARDA status 'queued' (SJ)
  • Jun 23 morning slowdown in data transfer to BNLPANDA is observed (KD). Is it correlated with the restoring of DQ2 0.2 subscriptions?
  • Jun 23 morning problems with Fetchers on US sites (fixed by Miguel), the problem was related to 0.2/0.3 subscriptions compatibility
  • Jun 23 return code of DQ2Access.py functions (we need to distinguish catalog error from other errors) (PS,AK)
  • Jun 23 20:00 CEST registration of MC production datasets is running in interactive mode. All datasets for tasks with ID > 10640 are registered (AK)
  • Jun 24 17:51 EDT Charles reported problems with SubscriptionResolverAgent.py. "central catalogs are misbehaving with the load" (MB)
  • Jun 24 afternoon Kaushik reported problems with Fetcher at BNL. Action : "extra logging will be added to investigate the case" (MB)
  • Jun 25 morning Fetcher with extra logging is available (MB)
  • Jun 25 17:00 CEST more testing subscriptions T2->LYON (SJ)
  • Jun 26 morning central catalog problems over weekend understood and fixed (MB,PS), restoring lost subscriptions (PS)
  • Jun 26 afternoon new DQ2 version with fixes; BNL instances are upgraded (HI)
  • Jun 26 evening MWT2_UC and MWT2_IU ready for full production (CGW,HI,PM)
  • Jun 28 DQ2 lost all subscriptions with '0' files. Fixed (MB). DQ2 is reinstalled at BNL (HI)
  • Jun 28 It was agreed during the DDM operations session that DQ2 0.2 subscriptions will be restored on Monday, Jul 2nd. Catalog cleaning (deletion of aborted and obsolete) will be done before Monday.
  • Jul 1 Catalog cleaning cannot be done for the moment. 'dq2-delete-files' raises an exception if even one file with a given GUID does not exist, and the command then does not delete the other files.
  • Jul 1 - Aug 1 Short summary and known bugs from DDM operations and Production
    • Dataset location information is not propagating to a new dataset version.
    • Program occasionally crashes when trying to get the number of files (or list of files, or replicas) for an existing dataset (developers suspect a network glitch or an apache exception).
    • Jul 15th - Jul 21st restore AOD dataset subscriptions to Tier-1s. ~Jul 25th services are stopped on all LCG VO boxes, high load is observed. Developers are working on a new version to fix the problem. Jul 31st, a fix is available and is under test for CNAF, LYON and FZK. All services will be operational by Aug 6th.
    • Jul 24th AOD replication is started for US sites. AGLT2, BU, MWT2_IU - 100% of datasets (~1000 subscriptions per site), SLACXRD - 30% (~300 subscriptions), UTA_SWT2 - 10% (~100 subscriptions). Use different queues (default and production) for AOD replication and files produced by Panda. MC Production (Kaushik) and sites reported data transfer performance degradation.
    • Aug 2nd Subscriptions are stopped to BU_DDM and decreased to 10% for MWT2_IU. MC data transfer is back to normal for the above sites.
  • Aug 1st. dq2 client commands and dq2api calls occasionally crash with the error message: 'module' object has no attribute 'sites'. Fixed by Pedro. A corrupted copy of ToACache was probably downloaded while concurrent processes were running.
  • Aug 2nd. Simone Campana reported that NAPOLI is trying to get data directly from UTA, though NAPOLI is subscribed from CNAF. Explanation from Miguel: "NAPOLI tried to copy the file once from CNAF. It failed (for some reason). But... it figured out there was another source - UTA - because CNAF (whose site services instance is shared with NAPOLI) was trying to get it from there. Therefore, it tried, ONCE, to get the file from UTA to NAPOLI, because anyway the 1st time it failed from CNAF->NAPOLI. and it keeps going like this."
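
The Aug 1st entry above attributes the crash to a cache file being read while another process was still downloading it. The fix that was actually applied is not described on this page. One common way to avoid this class of problem is to write the new copy to a temporary file and rename it into place. A minimal sketch, assuming a modern Python and a hypothetical function name:

```python
import os
import tempfile


def write_cache_atomically(path, data):
    """Write bytes to `path` so that readers never see a half-written file.

    Hypothetical helper; not part of DQ2.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".cache-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file in one step: a concurrent reader sees
        # either the old content or the new content, never a mixture.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern a client started during a cache refresh still loads the previous complete copy.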

  • (?) dq2-list-files-in-dataset returns files in 'random' order (maybe ordered by GUIDs, not LFNs)
  • (?) dataset replicas info after the whole dataset is copied to BNLDISK is still 'INCOMPLETE' (fixed)
  • (?) the transfer from srm-durable-atlas.cern.ch to BNL is using BNL FTS. (fixed)
  • (?) ARDA reported file transfer 'done' (after several transfer attempts), but file name on destination disk has suffix '_DQ2_timestamp'. Explanation from Miguel: "if a previous transfer attempt fails and DQ2 cannot delete file, we leave it there. The new attempt will go with a new PFN and the new PFN will stay in the catalogue - because the reason we created is that we couldn't overwrite the original one!"
  • (?) dq2-list-dataset-by-creationdate --younger 28800 is unstable. Sometimes it works, sometimes not (SJ)
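
The Jul 1 entry above notes that one missing GUID stops a whole deletion, and the Jun 23 entry asks for catalog errors to be distinguishable from other errors. Until the tool handles this itself, catalog cleaning scripts could delete files one at a time and collect the missing ones. The sketch below is not the DQ2 API: `FileNotFoundInCatalog` and the `delete_one` callable are hypothetical stand-ins for whatever the real client exposes.

```python
class FileNotFoundInCatalog(Exception):
    """Hypothetical error: the GUID is unknown to the catalog."""


def delete_files_best_effort(guids, delete_one):
    """Delete each GUID independently and return (deleted, missing).

    `delete_one` is a caller-supplied function (an assumption, not a DQ2
    call) that deletes a single file or raises FileNotFoundInCatalog.
    Any other exception propagates, so catalog errors stay distinct from
    unexpected failures.
    """
    deleted, missing = [], []
    for guid in guids:
        try:
            delete_one(guid)
        except FileNotFoundInCatalog:
            missing.append(guid)
        else:
            deleted.append(guid)
    return deleted, missing
```

The `missing` list can then be reported to the developers instead of aborting the cleaning run.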

Site Services Updates

Instructions for updating site services on US ATLAS sites: when/how/etc.
  • 7/05/07: DQ2 0.3 SITE SERVICES UPDATE. All sites should update their site services to reflect DQ2 0.3 modifications through 7/4/07.

-- RobertGardner - 10 May 2007

Attachments


pdf DQ2_0_3v1.ppt.pdf (138.5K) | RobertGardner, 22 Jun 2007 - 09:14 |
 