
MinutesApr14

Introduction

Minutes of the Facilities Integration Program meeting, Apr 14, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Saul, Kaushik, John DeStefano, Booker, Rob, Aaron, Charles, Nate, Shawn, Karthik, Rik, Torre, Wei, John Brunelle, Patrick, Justin, Armen, Tom, Bob, Xin
  • Apologies: Michael, Mark, Jason, John

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • LHC collisions at 7 TeV formally began on March 30, starting the 18-24 month run (press release)
      • Two OSG documents for review:
      • Updated CapacitySummary
      • Hope for stable beam around March 30; however, the 3.5 TeV ramp at noon resulted in a cryo failure that will take about a day to recover from.
      • The quarter is about to end - quarterly reporting will have a 9-day deadline.
      • glexec heads-up: there was a WLCG management discussion about glexec yesterday - details to be spelled out - but it will be a requirement to have this installed, and it will require integration testing. A number of issues were raised previously, so there are many details to iron out. Will need to work with OSG on the glexec installation - may want to invite Jose to the meeting to describe the system; it shouldn't have much impact on users or sites. The basic requirement is traceability at the gatekeeper level.
    • this week

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • Links to the ATLAS T3 working group Twikis are here
    • Draft users' guide to T3g is here
  • this week:
    • ANL analysis jamboree: used the model Tier 3g at ANL. Four people transferred their analyses there, and other users worked through exercises.
    • Feedback was positive; the user interface was easy to use.
    • Some issues came up with the batch system - CVMFS was reconfigured. More tests will be run.
    • Next week there will be a Tier 3 session - all working groups will report at that meeting.
    • Close to coming up with a standard configuration for Tier 3g for ATLAS
    • Tier 3-Panda is working. There were some residual issues to sort out (to discuss w/ Torre). Working with Dan van der Ster to get HC tests running on Tier 3.
    • Need to update Tier 3g build instructions - after the software week. Target mid-May

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • Analysis reference:
  • last meeting(s):
    • Distributed analysis tests at AGLT2, preliminary conclusions
      • See MinutesMarch31?
      • NFS access to VO home directories and ATLAS kits is unchanged (saturated?) for all HC job counts
        • Likely need better service here than current PE1950/MD1000 combination, eg, SUNNAS or SSD
        • Should likely split kits and VO home directories
      • dCache performs well, but would be aided by better distribution over available server systems
      • Limiting number of Analysis jobs per worker node seems to be a good idea
        • Best number for a PE1950 TBD (testing this is ongoing today)
        • Using half the cores as a maximum seems to be a good rule of thumb
          • May want to decrease this even further, at cost of slots, to get cpu/walltime higher
      • Other things: pcache, increasing number of jobs on the server
    • Next week: Wei will discuss w/ Kaushik running tests at SLAC - but the number of production jobs will need to be reduced.
  • this week:
    • HC_Test_Performance_at_AGLT2.pdf - performance studies updated, including IOSTAT graphs (see the sampling sketch at the end of this list)
    • Running out of jobs
    • Reprocessing next week
    • User analysis is ramping up - bottlenecks: 80% of users were asking for ESDs, which were only at BNL; almost no one was interested in AODs. Brought to the attention of the RAC and CREM. For data, there is agreement to automatically replicate ESDs to Tier 2s. Already observing a better balance of analysis jobs at Tier 2s.
    • The backlog of user analysis jobs is quite high.
    • Reprocessing announcement: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCDataReproSpring2010
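    • A minimal sketch of how per-device utilization numbers like those in the IOSTAT graphs above could be collected on a worker node; the sampling interval and the reliance on standard sysstat iostat output are assumptions for illustration, not the actual AGLT2 measurement setup:
      #!/usr/bin/env python
      # Sketch: sample disk utilization with "iostat -x", in the spirit of the
      # IOSTAT graphs in the attached AGLT2 study. The interval and output
      # format assumed here are illustrative, not the AGLT2 configuration.
      import subprocess

      INTERVAL = 30   # seconds between samples (assumption)
      SAMPLES = 2     # the second report reflects activity since the first

      def sample_util():
          """Return {device: %util} parsed from the last iostat -x report."""
          out = subprocess.check_output(["iostat", "-x", str(INTERVAL), str(SAMPLES)])
          util, in_table = {}, False
          for line in out.decode().splitlines():
              if line.startswith("Device"):
                  util, in_table = {}, True      # start of a (new) device table
              elif not line.strip():
                  in_table = False               # blank line ends the table
              elif in_table:
                  fields = line.split()
                  util[fields[0]] = float(fields[-1])   # %util is the last column
          return util

      if __name__ == "__main__":
          for dev, pct in sorted(sample_util().items()):
              print("%-10s %%util=%5.1f" % (dev, pct))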

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Meeting last week: MinutesDataManageMar23
    • A little short at NET2
    • What about data that has been deleted centrally and at BNL, but not at Tier 2s due to the custodial bit - has this been sorted out? A ticket has been opened and there has been follow-up discussion; Kaushik thinks it's a central deletion problem. Were these deletion failures? Follow up.
  • this week:
    • MinutesDataManageApr13
    • The main issue has been keeping an eye on storage - some dataset deletion is needed.
    • A remnant of consolidating site services at BNL: short vs. long forms for physical file names. This wasn't an issue before since sites were internally consistent. SLAC and SWT2 had the long form, fixed by Hiro. Bottom line: the LFCs are a mess, with a mix of short and long forms at all sites. A task force will be formed to address this quickly and make the registrations consistent (see the check sketched at the end of this list).
    • Replication percentages to T2 sites now set to nominal.
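    • A minimal sketch of the kind of consistency check such a task force might run over a dump of registered SURLs, assuming "long form" means the SURL carries an explicit SRM service path and "?SFN=" while the short form omits them; the sample SURLs are made up:
      # Sketch: flag storage hosts whose registered SURLs mix short and long
      # forms. "Long" is assumed to mean an explicit /srm/ service path plus
      # "SFN="; the sample SURLs below are invented for illustration.
      from collections import defaultdict

      def surl_form(surl):
          """Classify a SURL as 'long' (service path + SFN=) or 'short'."""
          return "long" if "/srm/" in surl and "SFN=" in surl else "short"

      def mixed_form_hosts(surls):
          """Return the storage hosts that have both forms registered."""
          by_host = defaultdict(set)
          for s in surls:
              host = s.split("://", 1)[1].split("/", 1)[0].split(":")[0]
              by_host[host].add(surl_form(s))
          return [h for h, forms in by_host.items() if len(forms) > 1]

      if __name__ == "__main__":
          sample = [
              "srm://example-se.site.edu:8443/srm/managerv2?SFN=/atlasdatadisk/file1",
              "srm://example-se.site.edu/atlasdatadisk/file2",
          ]
          for host in mixed_form_hosts(sample):
              print("mixed short/long SURL forms registered for %s" % host)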

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=90211
    
    1)  4/1: MWT2_UC wasn't getting new assigned jobs.  Brokerage was omitting the site due to insufficient space in PRODDISK -- more space added, issue resolved.
    2)  4/1: No pilots flowing to MWT2_IU, AGLT2, also affected BNL analysis queues.  Condor was re-started on gridui07 at BNL, issue resolved.
    3)  4/3: Transfer errors at SLAC reported by point 1 shifter.  Issue resolved, transfers now succeeding.  eLog 11032, 11054.
    4)  4/3: IllinoisHEP - Problems with an NFS server (bad disk) generated failed jobs, transfer errors.  Problem resolved, but later in the day Hiro noticed a problem with one of the site gridftp doors, which was fixed by restarting dCache on the node. Next day (4/4) Dave reported this problem:
    I did the repairs to my NFS server which I hope has now fixed the problem.  Looks like a double fault in a RAID 5 set which held, among others, the home areas for the globus users. I rebuilt that raidset, restored the data, tested globus and dcache and all looks good.  
    I have turned the IllinoisHEP queues back on and production jobs are running again.
    5)  4/4: Transfer errors to AGLT2_PRODDISK and ALGT2-PERF_MUONS -- errors like:
    [TRANSFER error during TRANSFER phase: [SECURITY_ERROR] globus_ftp_client: the server responded with an error535 Authentication failed: GSSException: Defective credential detected [Caused by: [Caused by: Bad sequence size: 6]]]
     From Shawn:
     It has been our experience that these types of errors are correlated with some kind of dCache gridftp door problem. Restarting the door is the usual solution, and we have automated detecting this error (via 'monit') and automatically restarting the gridftp door. Looking at the last few hours I don't see any errors. 
     Closing this ticket.  eLog 11086, ggus 57023, RT 15881.
    6)  4/5: Again a lack of pilots at several sites (AGLT2, MWT2_IU).  Apparently a proxy problem on the submit host gridui07, though not obvious exactly why.  See details in RT 15886.  eLog 11141.
    7)  4/5: Transfer failures at BNL, with errors like:
    FTS State [Failed] FTS Retries [1] Reason [TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error451 Operation failed: FTP Door: got response from '[>PoolManager@dCacheDomain:*@dCacheDomain:SrmSpaceManager@srm-dcsrmDomain:
    *@srm-dcsrmDomain:*@dCacheDomain]' with error Best pool  too high : Infinity]
    Space was added to the MCDISK token, issue resolved.
    8)  4/5 - 4/6: SWT2_CPB: Analysis jobs were failing with errors like:
    Error details: trans: No input file available - check availability of input dataset at site.
    This was tracked down to an issue with how DQ2 site services was doing file registrations.  Specifically LFC registrations using a shortened form were causing the analysis jobs to fail when they attempted to read the files across the network via xrootd.  From Patrick:
    The DQ2 site services code has been updated to register long form SURL's in the LFC and we have repaired the registrations of previously delivered files.  RT 15820, ggus 56876.
    9)  4/6: BNL - phase 1 of the Condor upgrade:
    The first stage of the previously announced Condor upgrade has been completed. We now have separate and redundant Condor
    servers (running version 7.4.1) for both RHIC and ATLAS, while all the worker nodes still run the old client (version
    6.8.9). The second stage of the migration (clients moving to 7.4.x) will take place in May.
    10)  4/7 a.m.: BU_ATLAS_Tier2o - job failures during stage-out like:
    07 Apr 08:15:41|Mover.py | !!FAILED!!2999!! SFN not set in LFC for guid C8EA1BB4-E42F-DF11-BE76-001E4F18C178 (check LFC server version)
    07 Apr 08:15:41|Mover.py | !!FAILED!!3000!! Get error: SFN not set in LFC for guid C8EA1BB4-E42F-DF11-BE76-001E4F18C178 (check LFC server version)
    07 Apr 08:15:41|Mover.py | !!FAILED!!3000!! Get returned a non-zero exit code (1122), will now update pilot server
     From Saul:
    This is a problem caused a few days ago when we moved the releases from an old nearly full file system to GPFS. This turned out to be a mistake as the load from starting athena on all the worker nodes was causing GPFS to slow down too
    much. This eventually caused a bunch of jobs to fail leading to the 83% error rate below. To solve this problem, we've built an NFS volume and have been copying the releases from GPFS to the new volume. This is going slower than
    expected, but it is nearly finished now. As soon as it is finished, we will turn back on.  ggus 57094, RT 15912, eLog 11207.
    
    Follow-ups from earlier reports:
    (i)  New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here:
    https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
    (ii)  3/3: Consolidation of dq2 site services at BNL for the tier 2's by Hiro beginning.  Will take several days to complete all sites.  ==> Has this migration been completed?
    (iii) 3/24: Question regarding archival / custodial bit in dq2 deletions -- understood and/or resolved?
     (iv) Over the week of 3/24 - 3/31 there was an ongoing issue of HU_ATLAS_Tier2 not filling with jobs -- instead it seemed to be somehow capped at 50.  As of 3/30 p.m. John thought this issue was understood (grid-monitor jobs not using the proper CA certs location -- details in the mail thread).  
    Once the site began filling up overheating issues were discovered, 
    such that large numbers of jobs were failing with stage-in errors.  
    Site was set off-line while this problem was being worked on.  RT 15839, ggus 56899, eLog 10891.
    Update #1, from John at HU on 3/31: I updated our lsm to have it acquire an exclusive lock, per worker node, before contacting the BU gatekeeper.  This should guarantee the number of processes stays below the ulimit (1 ssh session (3 processes) per WN).
    Update #2, from John on 4/1: The latest problem was that the gratia-lsf cron job tries to write 3.6 GB to local disk every 10 minutes, and that was temporarily filling the filesystem on a regular basis.  It pipes our entire lsb.acct LSF history file through tac, 
    and that text reversal requires a temp file of equal size.  
    Yikes.  I disabled the gratia-lsf cron job.
    Update #3, 4/5 - 4/6: HU_ATLAS_Tier2 was set back to on-line (all outstanding issues resolved), but local site mover failures were observed as the site began to fill up.  From Saul: We're having a load problem on atlas.bu.edu most likely causing the Harvard lsm to fail.  
    While we investigate, could you set HU_ATLAS_Tier2 to offline?  This was done.  
    At this point problem is still under investigation.  ggus 57054, RT 15896, eLog 11169.
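     A minimal sketch of the per-worker-node exclusive lock described in Update #1 above (serializing calls so only one session per node contacts the gatekeeper at a time); the lock-file path and the wrapped command are illustrative assumptions, not the actual HU lsm code:
     # Sketch: take a node-wide exclusive lock before running a command, so at
     # most one instance per worker node talks to the gatekeeper at a time.
     # The lock-file path and wrapped command are assumptions for illustration.
     import fcntl, subprocess, sys

     LOCKFILE = "/var/tmp/lsm-gatekeeper.lock"   # assumed per-node location

     def run_with_node_lock(cmd):
         """Block until the node lock is free, then run cmd while holding it."""
         with open(LOCKFILE, "w") as lock:
             fcntl.flock(lock, fcntl.LOCK_EX)    # exclusive; other callers queue
             try:
                 return subprocess.call(cmd)
             finally:
                 fcntl.flock(lock, fcntl.LOCK_UN)

     if __name__ == "__main__":
         if len(sys.argv) < 2:
             sys.exit("usage: node_lock.py <command> [args...]")
         # e.g. run_with_node_lock(["lsm-get", src, dest]) -- arguments illustrative
         sys.exit(run_with_node_lock(sys.argv[1:]))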
    
    
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=91274
    
    1)  4/7: BNL MC data transfer errors were reported in eLog 11239/61, ggus 57121, RT 15922.  Tickets still open.
    2)  4/7: File transfer errors at MWT2_UC, such as:
    FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [LOCALITY] Source file [srm://uct2-dc1.uchicago.edu/pnfs/uchicago.edu/atlasproddisk/mc09_7TeV/log/e524_s765/mc09_7TeV.108
    495.PythiaB_bbmu4X.simul.log.e524_s765_tid120957_22/log.120957._226211.job.log.tgz.1]: locality is UNAVAILABLE]
     From Charles at UC:
    We had a failure on our Cisco 6509 router. Cisco diagnostics indicated that one of the 10G modules was not communicating with the backplane. Removing and reseating the module cleared the error. This is slightly unusual, since nobody was working anywhere near the router at the time the failures started.  Everything seems to be back to normal now at MWT2_UC.  eLog 11237.
    3)  4/7 - 4/8: Pilots were not flowing to ANALY_SLAC for several hours.  From Torre:
    The pilot scheduler for ANALY_SLAC on gridui11 just vanished, even though it has a restart mechanism in case it stops for any reason (other than being told to stop). Don't know why.  Restarted.
    4)  4/8: BNL - jobs failures with errors like:
    07 Apr 23:22:39|futil.py | !!WARNING!!5000!! Error message: Using grid catalog type: UNKNOWN Using grid catalog : lfc.usatlas.bnl.gov VO name: atlas Checksum type: None Destination SE type: SRMv2 [SE][StatusOfPutRequest][ETIMEDOUT] httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: User timeout over lcg_cp: Connection timed out.  (Panda error "DQ2 PUT FILECOPY ERROR ").  
     Issue resolved - from Pedro:
     Possibly there are only 1-2 best pools to copy data into our storage.  If there are a lot of transfers from the worker nodes, overall the transfers might be slow and time out.  We're checking an alternative to overcome this possibility.
     ==> We've changed our configuration slightly; this shouldn't happen again.  eLog 11252, ggus 57130, RT 15924.
    5)  4/9: IllinoisHEP - jobs were failing with LFC errors like:
    Error details: pilot: Get error: Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 1004, Timed out)|Log put error: lfc-mkdir get was timed out after 600 seconds.  Hiro reported the problem with the LFC at BNL was resolved - test jobs successful, site set back to 'on-line'.  eLog 11320, RT 15938.
    6)  4/9: AGLT2 - From Bob at AGLT2:
    We found a lot of dCache transfers sitting in the pending state. Looking at one transfer, this was seen in the gridftp log:
     09 Apr 2010 12:31:53 (GFTP-umfs13-Unknown-52288) [gridftp-umfs13Domain-1270830713036] CRL /etc/grid-security/certificates/09ff08b7.r0 failed to load.
    java.security.GeneralSecurityException: [JGLOBUS-16] CRL data not found.  I noticed a week or so ago that this cert had not updated since Mar 27.  The CRL file was zero length (possibly a failed transfer during an earlier update, etc.?)  Issue resolved by removing the bad file, so that the next scheduled update picked up the correct version.
    7)  4/9: New version of pcache (3.0) now available (thanks Charles).  See: https://twiki.cern.ch/twiki/bin/view/Atlas/Pcache
    8)  4/9 - 4/11: Many failed jobs at BNL (for example from tasks 125109, 125087-125117), due to a missing file in the conditions data area (PFC).  From Xin:
    The daily PFC update job failed because one PFC entry has a wrong SRM protocol value:
    http://voatlas20.cern.ch:25980/monitor/logs/81e21983-31cc-4d18-a9cc-b8d30a5f793d/tarball_PandaJob_1060542928_BNL_ATLAS_Install/install.log
     So the PFC file was not updated. I just ran another job and it still failed. I am trying to reproduce it by hand, checking which PFC entry has the wrong srm, and maybe try a manual fix for now.
     Later:
     I generated and corrected the PFC file manually at BNL, and double-checked that the missing file from the failed job is now in there.
    Savannah # 65555, 65566.
    9)  4/9 - 4/12: Analysis jobs were getting assigned to ANALY_MWT2_X, even though this site is not supposed to be in production.  From Sarah:
    I have also set ANALY_MWT2_X offline, and hopefully that will keep it from being automatically selected.  Shift is copied so that they know why the site is offline.  (Thanks for the info!)
     10)  4/9 => present: SWT2_CPB has been experiencing storage problems.  The main issue is that the majority of the storage servers in the cluster are very full, such that data IO ends up hitting just one or two of the less full servers.  The site is working on several longer-term solutions (more storage is about to come on-line, 
     redistribution of the data among the servers, xroot updates, etc.).  This issue is being tracked here: RT 15936, eLog 11301, ggus 57161.
    11)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  Paul added to the thread in case there is an issue on the pilot side.  
    ggus 57186, RT 15953, eLog 11406.  In progress.
     12)  4/11 - 4/12: Transfer errors at BNL-OSG2_DATADISK -- "no space left on device."  From Hiro:
     Here is the reason for the confusion.
     1. There was a time period during which FTS ran out of space for its log archive, which is what is shown in the error.
     2. Not having enough space for log archives is not a critical error in FTS; it just does not write a transfer log.
     So the PIC->BNL transfers were failing for some other reason during the period when FTS had run out of log-archive space - the log-space message was not the actual cause.  Anyway, the situation has been resolved.  eLog 11448, ggus 57183, RT 15952.
    13)  4/11 => present: Widespread issue with missing conditions/pool file catalog data at sites.  Some sites have been patched by hand, a permanent fix is under discussion.  Many tickets, mail threads - a sample:
    https://savannah.cern.ch/bugs/?65555
    https://savannah.cern.ch/bugs/?65566
    https://savannah.cern.ch/bugs/?65616
    https://savannah.cern.ch/bugs/?65736
    https://lists.bnl.gov/pipermail/usatlas-prodsys-l/2010-April/007544.html
    https://lists.bnl.gov/pipermail/usatlas-ddm-l/2010-April/007045.html
    14)  4/13: Transfer errors at AGLT2, such as:
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries].  From Shawn:
    Issue seems to be related to the Postgresql DB underlying our SRM on the headnode which started around 1:30 AM. Problem resolved itself around 4 AM local time and has been OK since then. I have restarted SRM services around 8 AM just to be sure that the SRM is in a good state.  ggus 57219, RT 15965 (these will need to be closed), eLog 11463.
    15)  4/13: Transfer errors at ILLINOISHEP_PRODDISK:failed to contact on remote SRM.  From Dave at IllinoisHEP:
    The campus has a serious network problem routing portions of the network off campus. Internally we are fine.  Our T3 is on one of the affected subnets.  This started around 23:15 CDT  on 4/12/2010.  Campus networking is currently working on the problem, but I have no estimate as to when the problem will be resolved.
    Later:
    The campus network problem has been resolved.  eLog 11492.
    16)  4/13: MWT2_UC was draining due to an issue with the file:
    /osg/app/atlas_app/atlas_rel/15.6.3/AtlasProduction/15.6.3.10/InstallArea/i686-slc5-gcc43-opt/lib/libMathCore.rootmap
    in the atlas releases area (incorrect file size and permissions).  From Sarah:
    I've copied the file and permissions from MWT2_IU to MWT2_UC, and will monitor for more failures.  Xin re-ran the validation against this cache and it completed successfully, so this issue is likely resolved.
    
    Follow-ups from earlier reports:
    (i)  3/3: Consolidation of dq2 site services at BNL for the tier 2's by Hiro beginning.  Will take several days to complete all sites.  ==> Has this migration been completed?
    (ii) 3/24: Question regarding archival / custodial bit in dq2 deletions -- understood and/or resolved?
     (iii) Over the week of 3/24 - 3/31 there was an ongoing issue of HU_ATLAS_Tier2 not filling with jobs -- instead it seemed to be somehow capped at 50.  As of 3/30 p.m. John thought this issue was understood (grid-monitor jobs not using the proper CA certs location -- details in the mail thread).  Once the site began filling up, overheating issues were discovered, 
    such that large numbers of jobs were failing with stage-in errors.  Site was set off-line while this problem was being worked on.  RT 15839, ggus 56899, eLog 10891.
    Update #1, from John at HU on 3/31: I updated our lsm to have it acquire an exclusive lock, per worker node, before contacting the BU gatekeeper.  This should guarantee the number of processes stays below the ulimit (1 ssh session (3 processes) per WN).
    Update #2, from John on 4/1: The latest problem was that the gratia-lsf cron job tries to write 3.6 GB to local disk every 10 minutes, and that was temporarily filling the filesystem on a regular basis.  It pipes our entire lsb.acct LSF history file through tac, and that text reversal requires a temp file of equal size.  Yikes.  I disabled the gratia-lsf cron job.
    Update #3, 4/5 - 4/6: HU_ATLAS_Tier2 was set back to on-line (all outstanding issues resolved), but local site mover failures were observed as the site began to fill up.  From Saul: We're having a load problem on atlas.bu.edu most likely causing the Harvard lsm to fail.  While we investigate, could you set HU_ATLAS_Tier2 to offline?  
    This was done.  At this point problem is still under investigation.  ggus 57054, RT 15896, eLog 11169.
    Follow-up, 4/12, from Saul & John:
    Coincidentally, the ATLAS kits at BU were moved to the SE storage just before Harvard ramped up. This was the underlying problem behind the stage-in failures -- running the kits out of GPFS storage caused a huge performance hit.
    The filesystem got backed up, and jobs could no longer stage-in/out files.
    We've migrated the kits at BU off of GPFS, to a dedicated NFS share, and performance has returned. I also added pcache to the Harvard lsm to reduce stage-in load.  All test jobs are now succeeding, and production jobs are running with no lsm issues. I'm closing the ticket.
     (iv)  4/7 a.m.: BU_ATLAS_Tier2o - job failures during stage-out like:
    07 Apr 08:15:41|Mover.py | !!FAILED!!2999!! SFN not set in LFC for guid C8EA1BB4-E42F-DF11-BE76-001E4F18C178 (check LFC server version)
    07 Apr 08:15:41|Mover.py | !!FAILED!!3000!! Get error: SFN not set in LFC for guid C8EA1BB4-E42F-DF11-BE76-001E4F18C178 (check LFC server version)
    07 Apr 08:15:41|Mover.py | !!FAILED!!3000!! Get returned a non-zero exit code (1122), will now update pilot server
     From Saul:
    This is a problem caused a few days ago when we moved the releases from an old nearly full file system to GPFS. This turned out to be a mistake as the load from starting athena on all the worker nodes was causing GPFS to slow down too
    much. This eventually caused a bunch of jobs to fail leading to the 83% error rate below. To solve this problem, we've built an NFS volume and have been copying the releases from GPFS to the new volume. This is going slower than
    expected, but it is nearly finished now. As soon as it is finished, we will turn back on.  ggus 57094, RT 15912, eLog 11207.
    Update this week (4/10): From Saul:
    The move of the releases at BU is complete and analysis jobs are running successfully.  Please start up BU_ATLAS_Tier2o with the usual procedure.  John will be in touch about starting HU_ATLAS_Tier2 also soon.
    Test jobs successful, BU_ATLAS_Tier2o set to 'on-line'.  eLog 11357. 
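     A minimal sketch of a CRL sanity check motivated by item 6) above: flag CRL files in the grid certificates directory that are zero length or have not been refreshed recently. The directory is the conventional location and the age threshold is an assumption, not a site recommendation:
     # Sketch: report CRL (*.r0) files that are empty or stale, as in the
     # AGLT2 incident above. The age threshold is an illustrative assumption.
     import glob, os, time

     CERT_DIR = "/etc/grid-security/certificates"
     MAX_AGE_DAYS = 7   # assumed threshold; CRL updates normally run daily

     def stale_or_empty_crls(cert_dir=CERT_DIR, max_age_days=MAX_AGE_DAYS):
         bad, now = [], time.time()
         for crl in glob.glob(os.path.join(cert_dir, "*.r0")):
             st = os.stat(crl)
             if st.st_size == 0 or (now - st.st_mtime) > max_age_days * 86400:
                 bad.append(crl)
         return bad

     if __name__ == "__main__":
         for crl in stale_or_empty_crls():
             print("suspect CRL (empty or stale): %s" % crl)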
    

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • DQ2 logging has a new feature - errors are now reported. Request: the ability to search at the error level.
    • Will be adding link for FTS log viewing.
    • FTS channel configuration change for the data-flow timeout. The new FTS has an option for terminating stalled transfers. The default timeout for the entire transfer is 30 minutes, which wastes the channel on a failed transfer. Now, if there is no progress in the first 3 minutes, the transfer is terminated. Active for all Tier 2 channels.
      • If there is no progress (bytes transferred) during a 180-second window, the transfer is cancelled. (A transfer marker is sent every 30 seconds.) A page with all the settings is being put together (a watchdog sketch appears at the end of this list).
      • Have observed some transfers being terminated.
      • BNL-IU problem - transfers fail for small files when writing directly into pools. All sites with direct transfers to pools are affected - it's a gridftp2 issue.
      • Log files and root files - files of a few hundred kilobytes.
      • In the meantime BNL-IU is not using gridftp2.
      • dCache developers are being consulted - may need a new dCache adapter.
    • DQ2 SS consolidation done except for BU - problem with checksum issues.
    • Need to update Tier 3 DQ2. Note: Illinois working.
    • BU site services now at BNL, so all sites are now running DQ2 SS at BNL. DONE
    • FTS log, site level SS logs both available
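    • A minimal sketch of the no-progress timeout logic described above (cancel a transfer if the byte count does not advance within a 180-second window, checking a marker every 30 seconds); this illustrates the logic only and is not the FTS implementation - the transfer object is a stand-in with assumed hooks:
      # Sketch of the no-progress timeout: poll a transfer's byte counter every
      # 30 s and cancel it if the count has not advanced within 180 s. The
      # transfer object is a toy stand-in with assumed methods, not FTS code.
      import time

      MARKER_INTERVAL = 30       # seconds between progress checks
      NO_PROGRESS_WINDOW = 180   # cancel if no new bytes within this window

      def watch_transfer(transfer):
          """Return True if the transfer completed, False if it was cancelled."""
          last_bytes, last_progress = transfer.bytes_done(), time.time()
          while not transfer.finished():
              time.sleep(MARKER_INTERVAL)
              done = transfer.bytes_done()
              if done > last_bytes:
                  last_bytes, last_progress = done, time.time()
              elif time.time() - last_progress > NO_PROGRESS_WINDOW:
                  transfer.cancel()          # stalled: free the channel
                  return False
          return True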
  • this meeting:
    • Pool file catalog issue - creation at a T2 could be problematic, depending on the data; dq2-ls does not handle both cases. Hiro's proposal is to change the ToA and dq2-ls code to accommodate the mixed situation in the US.
    • Small files were failing to transfer in SAM tests. Hiro made adjustments to make this work - turned gridftp2 off.
    • No movement on addressing the bug in dCache.
    • AGLT2 storage information disappeared from the BDII - may need to monitor this. Brian Bockelman has a beta version of an RSV probe to check this.

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week(s)
    • New release of Squid from ATLAS - do not install
    • Testing latest Frontier servlet.
    • Testing from Fred to follow when new Squid deployed
    • Do we need more resources for Frontier (effort) - no effort from Tier 0; follow-up offline
    • Fred - testing newest version of squid at ANL
    • DNS lookup issue at AGLT2 - the Frontier client sometimes floods DNS; was this coming from misconfigured analysis jobs? Let Fred know if you see anything like this.
  • this week
    • There is a new Squid rpm - hesitant to recommend installation.
    • The structure is being changed - a complete de-install will be needed.
    • If you're doing heavy squid customization, wait. If not, you can do the upgrade.
    • Fred has been testing the latest Frontier server; will be updating the launch pad.
    • Also testing new squid. All okay.
    • ATLAS does not have the resources to add new sites into the monitoring infrastructure. Relevant to Tier 3? CVMFS uses squid - already watched closely at the site level; therefore is there a need for central monitoring?
    • Follow-up on the DNS issue at AGLT2: it was associated with one user's job, not resolving to a local host. The ATLF variable string was being used as a host name. No new information on reproducing it.
    • PFC corruption at HU - it was affecting production jobs, which it should never do. This file is not used, but it needs to exist and be in proper XML format. Hiro reported a problem with its content. Alessandro was working on a validity check in the software - thought this was done. Saul: it was not actually corrupted, but out of date. This is a general problem - the jobs which install this file sometimes fail (dq2 failures), and that will affect running analysis jobs. Fred will discuss with Richard Hawkings next week at CERN and report back. We need a consistency checker for this (a minimal check is sketched below).
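    • A minimal sketch of the consistency check discussed above: verify that the pool file catalog exists, is non-empty, and parses as XML. The file name below is the conventional PoolFileCatalog name, used here as an assumption; real validation would go further:
      # Sketch: basic existence / well-formedness check for a pool file catalog.
      # The default path is an assumption for illustration.
      import os
      import xml.etree.ElementTree as ET

      def check_pfc(path="PoolFileCatalog.xml"):
          """Return (ok, message) for a minimal PFC sanity check."""
          if not os.path.isfile(path):
              return False, "missing: %s" % path
          if os.path.getsize(path) == 0:
              return False, "empty file: %s" % path
          try:
              ET.parse(path)                 # raises ParseError if not valid XML
          except ET.ParseError as err:
              return False, "not well-formed XML: %s" % err
          return True, "ok: %s" % path

      if __name__ == "__main__":
          print(check_pfc()[1])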

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Minutes:
      
      
    • perfsonar release schedule - about a month away - anticipate doing only bug fixes.
    • Transaction bottleneck tests - but there is a dCache bug for small files that must be solved first; use an xrootd site.
    • Look at data in perfsonar - all sites
    • BU site now configured. SLAC - still not deployed, still under discussion.
  • this week:
    • From Jason: I am currently traveling and will be unable to make the meeting tomorrow. As an update, the first RCs of the next perfSONAR release were made available to USATLAS testers on Monday. Pending any serious issues, we expect a full release on 4/23.
    • A few sites have installed this - AGLT2 and MWT2_IU
    • No meeting this week - probably not next week either. 2 weeks from Tuesday.

Site news and issues (all sites)

  • T1:
    • last week(s): Testing of new storage - dCache testing by Pedro. Will purchase 2000 cores - R410s rather than high-density units, in ~six weeks. Another Force10 coming online with 100 Gbps interconnect. Requested another 10G link out of BNL for the Tier 2s; hope ESnet will manage the bandwidth to sites well. Fast-track muon reconstruction has been running for the last couple of days, the majority at BNL (kudos). The lsm by Pedro now supports put operations - tested on ITB. CREAM CE discussion w/ OSG (Alain) - have encouraged him to go for this and make it available to US ATLAS as soon as possible.
    • this week:

  • AGLT2:
    • last week: Lustre in a VM is going well (v1.8.2). Now have a Lustre deployment here - looking to replace multiple NFS servers (home, releases, osg home, etc.). Getting experience; will start to transition to it and evaluate.
    • this week: A Dell switch stack is not talking properly to the central switch in the infrastructure; have had incidents of dropped network connectivity that may require a reboot. Tom: tuning the swappiness parameter on an SL5 dCache machine (16 GB RAM, 4 pools) - with the default kernel setting the machine wanted to swap, which seemed to affect performance, so it was turned down from 60 to 10; performance improved and the machine stopped swapping. Charles notes there is also an overcommit variable (see the sketch below).
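    • A minimal sketch of the tuning mentioned above: report vm.swappiness and an overcommit setting, optionally lowering swappiness to the value AGLT2 found helpful (60 to 10). Treating "the overcommit variable" as vm.overcommit_memory is an assumption, writing /proc/sys requires root, and the change does not persist across reboots:
      # Sketch: report (and optionally lower) vm.swappiness as discussed above.
      # vm.overcommit_memory is assumed to be the "overcommit variable" noted
      # by Charles. Writing requires root and is not persistent.
      SWAPPINESS = "/proc/sys/vm/swappiness"
      OVERCOMMIT = "/proc/sys/vm/overcommit_memory"
      TARGET = 10   # value AGLT2 reported as an improvement over the default 60

      def read_int(path):
          with open(path) as f:
              return int(f.read().strip())

      def tune(apply_change=False):
          swap, over = read_int(SWAPPINESS), read_int(OVERCOMMIT)
          print("vm.swappiness=%d vm.overcommit_memory=%d" % (swap, over))
          if apply_change and swap > TARGET:
              with open(SWAPPINESS, "w") as f:
                  f.write(str(TARGET))
              print("lowered vm.swappiness to %d" % TARGET)

      if __name__ == "__main__":
          tune(apply_change=False)   # report only by default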

  • NET2:
    • last week(s): Filesystem problem turned out to be a local networking problem. HU nodes added - working on ramping up jobs. Top priority is acquiring more storage - will be Dell. DQ2 SS moved to BNL. Shawn helped tune up perfsonar machines. Moving data around - ATLASDATADISK seems too large. Also want to start using pcache.
    • this week: Built new NFS filesystem to improve performance. Installed pcache at HU - big benefit. Addressed issues with Condor-G from Panda. Ramped HU all the way up; major milestone in that all systems running at capacity. Gatekeepers are holding up - even with 500 MB/s incoming from DDM; interactive users.

  • MWT2:
    • last week(s): Electrical work complete, putting new storage systems behind UPS. New storage coming online: SL5.3 installed via Cobbler and Puppet on seven R710 systems. RAID configured for the MD1000 shelves. 10G network to each system (6 into our core Tier 2 Cisco 6509, 1 into our Dell 6248 switch stack). dCache installed. Also working on WAN Xrootd testing (see the ATLAS Tier 3 working group meeting yesterday). Python bindings for the xrootd library - work continues - in advance of local site mover development for xrootd. Focus is on getting new storage online. Everything is installed and configured; running dCache test pools and load tests. xrootd testing continued.
    • this week: New storage online - 5 of 7 systems. Cisco backplane failures caused by intensive transfer testing; investigating w/ Cisco.

  • SWT2 (UTA):
    • last week: SL5.4 w/ Rocks 5.3 complete. SS transitioned to BNL. Issues with transfers failing to BNL; there may be an issue with how checksums are being handled. 400 TB of storage being racked and stacked. Looking into ordering more compute nodes. All running fine - putting together the 400 TB of storage and continuing to look into procuring new compute and storage. Ran into a problem with too many directories in PRODDISK - this is a problem when running the cns service; it may be avoided with XFS rather than ext3, and was cleared up with the proddisk cleanup script. Continued installation of storage.
    • this week: Problems with load on xrootd from analysis jobs - too few file descriptors. Upgraded xrootd on that data server; now stable and usable. Limiting analysis jobs for the time being. DDM is working fine at a reduced scale. Progressing: limiting the number of threads per Wei's suggestion. May roll xrootd out onto other servers. A hot data server is taking most of the new data. Also preparing another 4 data servers, 400 TB (a descriptor-limit check is sketched below).
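    • A minimal sketch of checking the per-process file-descriptor limit that was the bottleneck above, using Python's resource module; the warning threshold is illustrative, not a recommended xrootd setting:
      # Sketch: report the open-file (descriptor) limits and warn if the soft
      # limit looks low for a busy data server. Threshold is illustrative.
      import resource

      WARN_BELOW = 4096   # assumed sanity threshold, not a recommendation

      def check_fd_limit():
          soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
          print("open-file limits: soft=%d hard=%d" % (soft, hard))
          if soft < WARN_BELOW:
              print("soft limit below %d; consider raising it (ulimit -n, limits.conf)" % WARN_BELOW)

      if __name__ == "__main__":
          check_fd_limit()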

  • SWT2 (OU):
    • last week: 23 servers arrived. Call scheduled w/ Dell regarding installation. Equipment now in place - will probably take a big downtime the last week of April. Just now finding a problem with a low number of running jobs - none activated; maybe a job shortage. Mark will follow up.
    • this week: Agreed with Dell on an install date, April 26. Expect a two-week downtime. Have upgraded perfsonar to the latest release.

  • WT2:
    • last week(s): ATLAS home and release NFS server failed; will be relocating to temporary hardware. All is well. Storage configuration changed - no longer using the xrootd namespace (CNS service)
    • this week: Throughput in the new HC tests is much, much better - not sure it's to be believed.

Carryover issues (any updates?)

Release installation, validation (Xin, Kaushik)

The issue of the validation process, completeness of releases at sites, etc.
  • last meeting
    • Michael: John Hover will open a thread with Alessandro to begin deploying releases using his method, which is a WMS-based installation.
    • John's email will start the process today.
    • There will be questions - which certificate to use and which account to map to.
    • Charles makes the point that it would be good for admins to have tools to test releases. This will be done in the context of Alessandro's framework.
  • this meeting:
    • Xin is helping Alessandro test his system on the ITB site. The WMS for the testbed at CERN is not working - i.e., the ITB is not reporting to the PPS BDII. Working on this with the GOC.

VDT Bestman, Bestman-Xrootd

Local Site Mover

  • Specification: LocalSiteMover
  • code
    • lsm-get, lsm-put, lsm-df, lsm-rm have all been implemented [Pedro]
    • lsm tests have successfully run at ITB queue [Pedro]
  • this week if updates:
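  • A minimal sketch of how a caller might drive the lsm commands listed above and check their exit status; the argument order and usage shown are assumptions for illustration, not the LocalSiteMover specification:
    # Sketch: thin wrapper around the lsm-* commands listed above. Argument
    # order and error handling are assumptions, not the lsm specification.
    import subprocess

    def lsm(command, *args):
        """Run an lsm command; return (exit code, combined stdout/stderr)."""
        proc = subprocess.Popen([command] + list(args),
                                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
        return proc.returncode, out.decode()

    if __name__ == "__main__":
        # Hypothetical usage: copy a local file to the SE, then report free space.
        rc, out = lsm("lsm-put", "/tmp/local.file", "srm://example-se/path/file")
        if rc != 0:
            print("lsm-put failed (rc=%d): %s" % (rc, out))
        print(lsm("lsm-df")[1])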

AOB

  • last week
  • this week
    • None


-- RobertGardner - 13 Apr 2010

Attachments


  • HC_Test_Performance_at_AGLT2.pdf (703.8 KB) - RobertGardner, 14 Apr 2010 - 12:54
 