
MinutesFeb2

Introduction

Minutes of the Facilities Integration Program meeting, Feb 2, 2011
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Rob, AK, Fred, Booker, Alex Undrus, Sarah, Tom, Saul, Alden, Wei, Horst, Wensheng, Dave, Joe, Rik
  • Apologies: Bob, Jason, Patrick, Mark

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

Tier 3 References:
  • The links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

General Tier 3 issues

last week(s):
  • Have met with sys admins from UTD
  • Working with UC Irvine; Amherst
  • Sending out email to contacts, including co-located Tier 2s
  • T3 section for OSG all-hands meeting
  • Joint-session with USCMS Tier3
  • Note - OSG all hands registration is open
  • Automatic testing - data-handling functional tests; waiting for the site status board to do auto-blacklisting. Douglas Smith is going to handle production jobs, adopting the HC framework.
  • New Tier 3s coming online for production: UTD, Bellarmine, Hampton.
  • Analysis Tier 3gs: will report.
this week:
  • Rik reports 47 possible sites: 18 are functional, 6 are setting up, 7 have received hardware, and 1 is in the planning stage; the majority are using the wiki to set up their sites.

Tier 3 production site issues

  • Bellarmine University (AK):
    • last week(s):
      • "AK" - Akhtar Mahmood
      • 385 cores, 200TB RAID, 200 TB on nodes.
      • Horst has been measuring throughput: 100 Mbps connection (10 MB/s with no students)
      • FTS transfer tests started
      • Totally dedicated to ATLAS production (not analysis)
      • Goal is to be a low-priority production site
      • There will be a long testing period in the US, at least a week
      • Need to look at the bandwidth limitation
      • Horst has installed the OSG stack
      • There is 0.5 FTE of local technical support
      • Will need to set up an RT queue.
      • AK and Horst - more network testing; inbound bandwidth is limited
      • There is a packet shaper on campus (to limit p2p bandwidth, etc.); Shawn's recommendation is to bypass it completely.
    • this week:
      • Met with the IT director about the packet shaper; working to get a bypass (a throughput-check sketch follows below).
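      • As a rough illustration of the kind of inbound-throughput check being done while chasing the packet-shaper limit, the hedged Python sketch below times a plain HTTP download and reports the achieved rate. The test-file URL is a placeholder, and iperf or the FTS transfer tests remain the authoritative measurements.

        # Hedged sketch: time a large HTTP download and report the achieved rate.
        # The URL is hypothetical; substitute a real test file near the far end.
        import time
        import urllib.request

        TEST_URL = "http://testhost.example.net/1GB.bin"  # placeholder, not a real endpoint

        def measure_inbound_mbps(url, chunk_size=1 << 20):
            """Download url, discard the data, and return the observed rate in Mbps."""
            start = time.time()
            total = 0
            with urllib.request.urlopen(url) as resp:
                while True:
                    chunk = resp.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
            return (total * 8 / 1e6) / (time.time() - start)

        if __name__ == "__main__":
            print("observed inbound rate: %.1f Mbps" % measure_inbound_mbps(TEST_URL))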

  • UTD (Joe Izen)
    • last week(s):
      • Have been in production for years, with 152 cores. Have new hardware, 1/5 to go for Tier3g. Most will go into production - 180 cores. Doug is providing some recommendations.
      • Operationally, the biggest issue is lost-heartbeat failures. Will consult Charles.
      • Open ticket for failed jobs - LFC accuracy. Cleaned up, been stable since Monday; expect to close the ticket.
      • Got script from Charles to debug lost heartbeats
      • Deploying new fileservers for new hardware
    • this week
      • No LFC errors this week, for the first time
      • In production most of the week
      • Caught 11 lost-heartbeat jobs overnight. Currently offline due to the power blackouts in Texas.

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (this week presented by Tom Fifield):
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=124091
    
    Note: AGLT2 will take a maintenance outage on 2/1.  See more details in the message from Bob to the usatlas-t2-l mailing list.
    
    1)  1/19: UTD-HEP - job failures with missing input file errors - for example: "19 Jan 07:07:10|Mover.py | !!FAILED!!2999!! Failed to transfer HITS.170554._000123.pool.root.2: 1103 (No such file or directory)."  
    ggus 66284, eLog 21346.
    2)  1/19:  BNL - file transfer failures with the error "FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [FILE_EXISTS] at Tue Jan 18 23:56:18 EST 2011 
    state Failed : file exists]."  ggus 66280, eLog 21352.  Issue resolved as of 1/21 - from Hiro: since DDM will retry with a different physical name, this is not an issue. The dark data will be taken care of later.  ggus ticket closed.
    3)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on t301.hep.tau.ac.il reports Error 
    reading token data header: Connection closed."  ggus 66298.  From Hiro: There is a known issue for users with an Israel CA having problems accessing BNL and MWT2. This is actively being investigated right now. Until this gets 
    completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other sites (LOCALGROUPDISK area) for downloading.
    4)  1/21: SLACXRD file transfer errors - "failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]."  Issue was reported to be fixed by Wei, but the errors reappeared later the same day, 
    so the ticket (ggus 66346) was re-opened.  eLog 21409.
    5)  1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:[USER_ERROR] 
    source file doesn't exist]."  https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
    6)  1/21: HammerCloud auto-exclusion policy updated - see: https://twiki.cern.ch/twiki/bin/view/IT/HammerCloud#10_HammerCloud_Automatic_Site_Ex
    7)  1/21: SWT2_CPB - user reported that the command 'lcg-ls' was hanging when attempting to communicate via the SRM interface.  Possibly a temporary network glitch - subsequent tests are working correctly (an SRM-probe sketch appears at the end of this last-meeting summary).  
    RT 19309 / ggus 66386.  Update 1/26: user reports the command is now working, with no recent instances of this error.  RT & ggus tickets closed.
    8)  1/22: MWT2_IU_DATADISK timeout problems - for example " [TRANSFER error during TRANSFER phase: [TRANSFER_MARKERS_TIMEOUT] No transfer markers received for more than 180 seconds] ACTIVITY: 
    User Subscriptions."  Issue understood - from Aaron: We had a storage pool which was throwing errors and needed to be restarted. The behavior described here doesn't exactly match this failure, but it does match 
    the timeframe. Please let us know if this is still occurring at any frequency, otherwise we can consider the issue resolved.  ggus 66410 closed, eLog 21429.
    9)  1/22: MWT2_IU_PRODDISK => BNL-OSG2_MCDISK file transfer failures with source errors.  Resolved - from Aaron: This was tracked down to a failing pool, which was restarted and is now delivering data as expected. 
    We should see these transfers succeed, and this issue should now be cleared up.  ggus 66415 closed, eLog 21439.
    10)  1/22: BNL-OSG2_LOCALGROUPDISK file transfer errors (from NDGF-T1_PHYS-SUSY) like " [TRANSFER error during TRANSFER phase: [FIRST_MARKER_TIMEOUT] First non-zero marker not received within 
    180 seconds]."  From Michael: The issue is no longer observed. The ticket can be closed.  ggus 66416 closed, eLog 21441.
    11)  1/24: AGLT2 job & file transfer errors - "[SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up 
    after 3 tries]."  ggus 66450 in-progress, eLog 21488.  Update from Bob at AGLT2: Just restarted dCache services on head01; rsv srmcp-readwrite had been red. Hopefully that will clear the issue.  Since the queues at 
    the site (analy_, prod) had been set offline (ADC site exclusion ticket: https://savannah.cern.ch/support/?118828) test jobs were submitted, and they completed successfully (eLog 21497).  Are we ready to close this ticket?
    12)  1/25: SLACXRD_DATADISK file transfer errors - "[Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] No space found with at least 2895934054 bytes 
    of unusedSize]."  http://savannah.cern.ch/bugs/?77346.
    13)  1/25: SLACXRD job failures with stage-in errors - for example "!!FAILED!!2999!! Failed to transfer ... 1099 (Get error: Staging input file failed)."  From Wei: I think this is fixed. Sorry for the noise. There are a lot of 
    places to change when I change the config, so I may still be missing something... let me know if that is still the case.  ggus 66520 closed.
    
    Follow-ups from earlier reports:
    (i)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  
    Once the transfer completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (ii)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    (iii)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (iv)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  Also https://savannah.cern.ch/bugs/index.php?77139.
    1/25: Update from Shawn:
    I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can track the "repair" at 
    http://bourricot.cern.ch/dq2/consistency/
    Let me know if there are further issues.
    
    
    • ADCR databases were down, things coming back online now.
    • Note modifications to exclusion policy
    • 3 old sites have been deprecated.
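    • Regarding item 7 above (the hanging 'lcg-ls' at SWT2_CPB): a minimal Python sketch of the kind of SRM probe a shifter can run with an explicit timeout, so a hung endpoint is distinguished from a merely slow one. The SURL is a placeholder (not the real SWT2_CPB path) and a valid VOMS proxy is assumed to exist.

      # Hedged sketch: probe an SRM endpoint with lcg-ls, bounded by a timeout.
      # The SURL below is hypothetical; a valid grid/VOMS proxy must already exist.
      import subprocess

      SURL = "srm://srm.example.edu:8443/srm/v2/server?SFN=/xrd/atlasdatadisk/"  # placeholder

      def srm_listing_ok(surl, timeout_s=60):
          """Return True if lcg-ls answers within timeout_s, False if it hangs or fails."""
          cmd = ["lcg-ls", "-b", "-D", "srmv2", surl]
          try:
              result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
          except (subprocess.TimeoutExpired, FileNotFoundError):
              return False
          return result.returncode == 0

      if __name__ == "__main__":
          print("SRM responsive" if srm_listing_ok(SURL) else "SRM hung or unreachable")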
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-1_31_11.html
    
    1)  1/26: data transfer errors from SLACXRD_USERDISK to MWT2_UC_LOCALGROUPDISK ("source file doesn't exist").  From Wei: I think you can close this ticket. There are only a few missing files and they do not exist at WT2. I don't know why FTS was asked to transfer them (maybe they were there when the request was submitted?). Repeated transfer requests created lots of failures simply because the files don't exist.  ggus 66613 closed, eLog 21535.
    2)  1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value."  Consolidated into a single goc ticket, https://ticket.grid.iu.edu/goc/viewer?id=9871.  Will be resolved in a new OSG release currently being tested in the ITB.
    3)  1/27: from Bob at AGLT2 - At 1pm EST AGLT2 had a dCache issue.  The number of available postgres connections had been dropped from 1000 to 300 during a pgtune run a few days ago, and this was not noticed until the failure occurred.  Unfortunately, it caused a LOT of job failures during the last 3 hours.
    Later that evening / next morning:
    We had some sort of "event" on our gatekeeper around 11pm last night. Ultimately, condor was shot, and our load was lost. I have disabled auto-pilots this morning to both AGLT2 and ANALY_AGLT2 while we investigate the cause.  Indications of hitting an open-file limit on the system were found, and we need to understand the cause.  Queues were set off-line.  Later Friday afternoon, from Bob: We increased several sysctl parameters on gate01 dealing with the total number of available file handles.  Issues resolved, queues set back on-line (a file-handle check sketch appears after item 7 below).  eLog 21583.
    4)  1/30: AGLT2 - job (stage-out: "Internal name space timeout lcg_cp: Invalid argument") & file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  From Shawn: This morning around 8 AM Eastern time our postgresql server for the dCache namespace (Chimera) filled its partition with logging info (over 10 GB in the last 24 hours). This was traced to multiple attempts to re-register a few files over and over.  We have cleaned up space on the partition and modified the logging to be "terse" so this won't happen as easily in the future.  ggus 66794 in-progress, eLog 21616.
    5)  2/1: Maintenance outage at AGLT2 - from Bob: The outage will include all of Condor, as well as a dCache outage and upgrade.
    Update 2/1 late afternoon: outage extended in OIM to 10 p.m. EST.  Later, early a.m. 2/2: work completed, test jobs were successful, queues set back on-line.  eLog 21696.
    6)  2/2: UTD-HEP set off-line at request of site admin.  Rolling blackouts in the D-FW area (unfortunately).  eLog 21702.
    7)  2/2: WISC_DATADISK - failing functional tests with file transfer errors like " Can't mkdir: /atlas/xrootd/atlasdatadisk/step09]."  ggus 66897 in-progress, eLog 21695.
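    Regarding item 3 above (the gate01 open-file limit): a small, hedged Python sketch of a watchdog that warns before the system-wide limit (fs.file-max) is exhausted, rather than after condor dies. The 90% threshold is an arbitrary illustrative choice.

        # Hedged sketch: compare allocated file handles against fs.file-max via /proc.
        def file_handle_usage():
            """Return (allocated, maximum) system-wide file handles."""
            with open("/proc/sys/fs/file-nr") as fh:
                allocated, _unused, maximum = (int(x) for x in fh.read().split())
            return allocated, maximum

        if __name__ == "__main__":
            allocated, maximum = file_handle_usage()
            frac = allocated / float(maximum)
            print("file handles: %d / %d (%.0f%%)" % (allocated, maximum, 100 * frac))
            if frac > 0.9:
                print("WARNING: nearing fs.file-max; consider raising it with sysctl")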
    
    Follow-ups from earlier reports:
    (i)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  Once the transfer completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (ii)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    Update 1/30 from a shifter: No more problems seen - closing this ticket (ggus 65617).
    (iii)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (iv)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  Also https://savannah.cern.ch/bugs/index.php?77139.
    1/25: Update from Shawn:
    I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can track the "repair" at http://bourricot.cern.ch/dq2/consistency/
    Let me know if there are further issues.
    (v)  1/19: UTD-HEP - job failures with missing input file errors - for example: "19 Jan 07:07:10|Mover.py | !!FAILED!!2999!! Failed to transfer HITS.170554._000123.pool.root.2: 1103 (No such file or directory)."  ggus 66284, eLog 21346.
    Update 1/27: from the site admin: These errors seem to have been resolved by the LFC cleaning -- closing the ticket.  eLog 21612.
    (vi)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with an Israel CA having problems accessing BNL and MWT2. This is actively being investigated right now. Until this gets completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other sites (LOCALGROUPDISK area) for downloading.
    (vii)  1/21: SLACXRD file transfer errors - "failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]."  Issue was reported to be fixed by Wei, but the errors reappeared later the same day, so the ticket (ggus 66346) was re-opened.  eLog 21409.
    Update 1/30 from a shifter: No more errors in the last 12 hours, 400 successful transfers, maybe migration comes to an end.  ggus 66346 closed, eLog 21611.
    (viii)  1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]."  https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
    (ix)  1/24: AGLT2 job & file transfer errors - "[SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries]."  ggus 66450 in-progress, eLog 21488.  Update from Bob at AGLT2: 
    Just restarted dCache services on head01; rsv srmcp-readwrite had been red. Hopefully that will clear the issue.  Since the queues at the site 
    (analy_, prod) had been set offline (ADC site exclusion ticket: https://savannah.cern.ch/support/?118828) test jobs were submitted, and they completed successfully (eLog 21497).  Are we ready to close this ticket?
    Update 1/26: The site team restarted dcache services on head01 (rsv srmcp-readwrite had been red). Test jobs completed OK.  ggus 66450 closed, eLog 21526.
    (x)  1/25: SLACXRD_DATADISK file transfer errors - "[Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] No space found with at least 2895934054 bytes of unusedSize]."  http://savannah.cern.ch/bugs/?77346.
    Update 1/26 from Wei: this can be ignored. I was moving data among storage nodes and was filling the quota fast.
    
     

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Charles - working on HC tests running over federated xrootd. Also working on the xrd-lfc module, which requires a VOMS proxy; looking at lfc-dli, but its performance is poor and it is being deprecated. Can we use a server certificate?
  • Wei - discussed the checksum issue with Andy; it may require an architectural change.
  • HC tests are running against the UC local xrootd redirector (through ANALY_MWT2_X). A few job failures (9/200) are being tracked down. The event rate is not as good as with dcap; may need more tuning on the xrootd client. Hiro will set this up at BNL.
  • Local tests of dcap versus xrootd apparently show a factor of 2 improvement (see the timing sketch below).
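  • For illustration only (this is not the HammerCloud test itself), a hedged PyROOT sketch of the dcap-versus-xrootd comparison: open the same file through both doors and time a full read of one tree. The redirector host, dcap door, file path and tree name are all placeholders.

    # Hedged sketch: time reading one tree through the xrootd redirector vs. a dcap door.
    # All hostnames, paths and the tree name are hypothetical.
    import time
    import ROOT

    URLS = {
        "xrootd": "root://xrd-redirector.example.edu//atlas/datadisk/sample.AOD.root",
        "dcap":   "dcap://dcap-door.example.edu:22125/pnfs/example.edu/atlas/sample.AOD.root",
    }
    TREE_NAME = "CollectionTree"  # assumed tree name

    for label, url in URLS.items():
        start = time.time()
        f = ROOT.TFile.Open(url)
        if not f or f.IsZombie():
            print("%s: open failed" % label)
            continue
        tree = f.Get(TREE_NAME)
        if not tree:
            print("%s: tree %s not found" % (label, TREE_NAME))
            f.Close()
            continue
        nbytes = sum(tree.GetEntry(i) for i in range(int(tree.GetEntries())))
        print("%s: %d entries, %.1f MB read in %.1f s"
              % (label, tree.GetEntries(), nbytes / 1e6, time.time() - start))
        f.Close()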
this week:

Site news and issues (all sites)

  • T1:
    • last week(s): Comprehensive intervention at the Tier 1 from Saturday to Monday - various services were upgraded. Doubled the bandwidth between the storage farm and the worker nodes - up to 160 Gbps. Several grid components upgraded - move to SL5 on some nodes; CE upgraded. A number of failure modes discussed with Nexsan - new firmware for the disk arrays, to improve identification of faulty disks; will further improve availability and performance. Hiro - on storage management: dCache upgraded to 1.9.5-23 and pnfs to 3.1.18; postgres upgraded to 9.0.2, along with the backend disk area (hardware swap to get more spindles); hot standby configured for postgres (a hot-standby check sketch follows this site entry). All dCache changes went okay. LFC upgraded to 1.8.0-1, a significant upgrade; should discuss packaging this version with OSG.
    • this week:
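    • (Illustration) A minimal monitoring sketch, assuming psycopg2 and a read-only account on the new hot-standby postgres: a standby should report pg_is_in_recovery() = true and keep receiving WAL. The hostname and credentials below are placeholders, and pg_last_xlog_receive_location() is the 9.0-era function name.

      # Hedged sketch: sanity-check a postgres 9.0 hot standby.
      import psycopg2  # assumed available

      conn = psycopg2.connect(host="pg-standby.example.bnl.gov",  # placeholder host
                              dbname="postgres", user="monitor", password="***")
      cur = conn.cursor()
      cur.execute("SELECT pg_is_in_recovery()")
      in_recovery, = cur.fetchone()
      cur.execute("SELECT pg_last_xlog_receive_location()")  # 9.0.x name for the WAL receive position
      wal_pos, = cur.fetchone()
      print("standby mode: %s, last WAL received: %s" % (in_recovery, wal_pos))
      conn.close()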

  • AGLT2:
    • last week: Planned downtime for next week: dCache 1.9.10-4. Will be using NFSv4.1. Lots of internal changes (e.g. configuration) involved in going to this release. MSU getting network changes ready.
    • this week: Downtime yesterday - changes to our LAN at MSU, including the routers connecting to the regional network. Work ongoing: dCache upgrade, and Condor (more stable negotiator). Router work also ongoing at UM. Expect to be back later.

  • NET2:
    • last week(s): Had to kill a bunch of jobs in the HU analysis queue (resulting in get errors). DATADISK filled up. Adding a half-rack of storage (130 TB usable), increasing DATADISK to 450 TB. Ramping up the HU analysis queue, which is currently not getting enough I/O capacity. Another 10G port is needed - will buy another switch - will go to 2000 analysis jobs. Testing ClusterNFS to export GPFS (removing the need for GPFS clients).
    • this week: Meeting with Dell on Monday to finalize purchases, including Dell network equipment, and some new worker nodes. An install issue being worked on with Xin. Continue to work on detailed planning for big move to Holyoke.

  • MWT2:
    • last week(s): Continuing to install 88 worker nodes; will install new libdcap. Will install CVMFS.
    • this week:

  • SWT2 (UTA):
    • last week: Working on Tier 3; working on federated xrootd monitoring. All running fine otherwise.
    • this week:

  • SWT2 (OU):
    • last week: All is fine. Scheduling the installation of the R410s; may require electrical work. Bestman2 has been tested extensively over the past months.
    • this week: Still waiting for Dell regarding installation of new nodes.

  • WT2:
    • last week(s): Bestman2 running in gateway mode seems to be fine; it starts up very quickly. No immediate need to upgrade to it, but it will be the only supported version. Enabled the security module for the xrootd client in the internal storage cluster, which created some problems; these were discovered in XrootdFS - will notify Doug and Tanya. Trivial to update (Wei's scripts). Patrick has a package to dynamically add gridftp servers using a plug-in module (no need to restart Bestman).
    • this week: Working with Dell on new purchase, want low-end 5400 rpm drives for "tape"-disk.

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report(s)
    • AGLT2 now running Alessandro's system - now in automatic installation mode. Will do other sites after the holiday.
    • MWT2_UC is using the new system, but only for the 16-series releases. If it works well, will enable it more broadly.
    • Next site - BU - once BDII publication issue resolved, will return to this.
    • WT2, IU, SWT2 - depending on Alessandro's availability.
  • this meeting:

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • A newline was causing a buffer overflow.
  • Chipping away at this on the job side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • The maximum number of active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate yet on libdcap++ inclusion into the release
  • No new failure modes observed

this meeting:

AOB

  • last week
  • this week


-- RobertGardner - 31 Jan 2011
