
MinutesJan26

Introduction

Minutes of the Facilities Integration Program meeting, Jan 26, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Charles, Aaron, Dave, Rob, Jason, Torre, Shawn, Saul, AK, Patrick, Bob, Joe, Xin, Horst, Wei, Tom, Alden, Armen, Mark, Kaushik, Karthik, Hiro
  • Apologies: Michael

Integration program update (Rob, Michael)

  • IntegrationPhase16 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s)
      • Phase 16 Integration program forming - see above. Add CVMFS infrastructure.
      • Metrics discussion - a new CSV format is available. One issue is getting file size and number of files. Alden is providing an API as well. Will give Charles data to upload and modify.
      • Reports from Tier 2's on downtime activities, if any.
      • Facility capacity report
      • Request from OSG/VDT regarding xrootd packaging for US ATLAS: rpms available for 3.0.1; pacman needed?
    • this week
      • Congratulations, Green Bay Packers!
      • Next face-to-face facilities meeting co-located with OSG All Hands meeting (March 7-11, Harvard Medical School, Boston), http://ahm.sbgrid.org/. US ATLAS agenda will be here.
      • Starting up the CVMFS evaluation in production Tier 2 settings: SWT2_OU and MWT2 participating, AGLT2 and NET2 possible. Instructions to be developed here: TestingCVMFS (an illustrative client check is sketched after this list).
      • Updates to SiteCertificationP16
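    A minimal sketch of the kind of worker-node sanity check the CVMFS evaluation implies, assuming only the standard repository path /cvmfs/atlas.cern.ch; the TestingCVMFS instructions are still to be written, so everything else here is illustrative.

      #!/usr/bin/env python
      # Illustrative check that the ATLAS CVMFS repository is mounted and readable
      # on a worker node. Only the standard path /cvmfs/atlas.cern.ch is assumed;
      # this is not part of the (yet to be written) TestingCVMFS instructions.
      import os
      import sys

      REPO = "/cvmfs/atlas.cern.ch"

      def cvmfs_ok(repo=REPO):
          """Return True if a top-level listing of the repository succeeds
          (the listing also triggers the autofs mount when automount is used)."""
          try:
              return len(os.listdir(repo)) > 0
          except OSError:
              return False

      if __name__ == "__main__":
          if cvmfs_ok():
              print "CVMFS repository %s looks usable" % REPO
          else:
              print "CVMFS repository %s is missing or unreadable" % REPO
              sys.exit(1)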

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

Tier 3 References:
  • The links to the ATLAS T3 working group Twikis are here
  • The T3g setup guide is here
  • The users' guide to T3g is here

General Tier 3 issues

last week(s):
  • Have met with sys admins from UTD
  • Working with UC Irvine; Amherst
  • Sending out email to contacts, including co-located Tier 2s
  • T3 section for OSG all-hands meeting
  • Joint-session with USCMS Tier3
  • Note - OSG all hands registration is open
  • Automatic testing - data handling functional tests, waiting for the site status board to do auto-blacklisting. Douglas Smith is going to handle production jobs, adopting HC framework.
  • New Tier 3s coming online for production: UTD, Bellarmine, Hampton.
  • Analysis Tier 3gs: will report.
this week:

Tier 3 production site issues

  • Bellarmine University (AK):
    • last week(s):
      • "AK" - Akhtar Mahmood
      • 385 cores, 200TB RAID, 200 TB on nodes.
      • Horst has been measuring things. 100 Mbps connection (10 MB/s with no students)
      • FTS transfer tests started
      • Totally dedicated to ATLAS production (not analysis)
      • Goal is to be a low-priority production site
      • There will be a long testing period in the US, at least a week
      • Need to look at the bandwidth limitation
      • Horst has installed the OSG stack
      • There is .5 FTE local technical support
      • Will need to setup an RT queue.
      • AK and Horst - more networking testing - inbound bandwidth limited
    • this week:
      • There is a packet shaper on campus to limit p2p bandwidth, etc. Shawn's recommendation is to bypass it completely (a toy bandwidth-probe sketch follows this item).
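    The site's inbound bandwidth was measured with standard tools (the FTS transfer tests and the network testing noted above); the pair of functions below is only a toy illustration of such a probe, useful for a before/after comparison around the packet shaper. The port number, transfer duration and hostnames are placeholder assumptions.

      #!/usr/bin/env python
      # Toy iperf-style probe: run "serve" on the receiving host, then
      # "send <host>" on the remote end, and compare the reported rate with
      # and without the campus packet shaper in the path. Port and duration
      # are arbitrary placeholders.
      import socket
      import sys
      import time

      PORT = 5201          # placeholder port
      CHUNK = 64 * 1024    # 64 kB send buffer
      DURATION = 10        # seconds to transmit

      def serve(port=PORT):
          """Accept one connection, count received bytes, report the rate."""
          srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
          srv.bind(("", port))
          srv.listen(1)
          conn, addr = srv.accept()
          total, start = 0, time.time()
          while True:
              data = conn.recv(CHUNK)
              if not data:
                  break
              total += len(data)
          elapsed = time.time() - start
          print "received %.1f MB in %.1f s -> %.1f Mbps" % (
              total / 1e6, elapsed, 8 * total / 1e6 / elapsed)

      def send(host, port=PORT):
          """Send a constant stream to the receiver for DURATION seconds."""
          cli = socket.create_connection((host, port))
          payload = "x" * CHUNK
          end = time.time() + DURATION
          while time.time() < end:
              cli.sendall(payload)
          cli.close()

      if __name__ == "__main__":
          if sys.argv[1] == "serve":
              serve()
          else:
              send(sys.argv[1])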

  • UTD (Joe Izen)
    • last week(s):
      • Have been in production for years, with 152 cores. Have new hardware, mostly for Tier3g. Some will go into production - 180 cores. Doug is providing some recommendations.
      • Operationally - biggest issue is lost heartbeat failures. Will consult Charles.
    • this week
      • Open ticket for failed jobs - LFC accuracy. Cleaned up and stable since Monday; expect to close the ticket.
      • Got a script from Charles to debug lost heartbeats (an illustrative log-gap scan is sketched after this list).
      • Deploying new fileservers for the new hardware.
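    Charles' debugging script is not reproduced here; purely to illustrate the idea, a scan of a timestamped pilot or batch log for long silent gaps (one symptom behind "lost heartbeat" failures) might look like the sketch below. The timestamp format is taken from the shift-report excerpts later in these minutes (e.g. "19 Jan 07:07:10|Mover.py|..."); the 30-minute threshold and the log layout are assumptions.

      #!/usr/bin/env python
      # Illustrative scan of a timestamped log for gaps longer than a threshold,
      # as a first look at "lost heartbeat" job failures. Timestamp format and
      # threshold are assumptions; this is not the script provided by Charles.
      import sys
      import time

      THRESHOLD = 30 * 60  # seconds of silence considered suspicious
      YEAR = 2011          # the log lines carry no year

      def parse_ts(line):
          """Parse a leading 'dd Mon HH:MM:SS' stamp; return epoch seconds or None."""
          try:
              stamp = " ".join(line.split("|", 1)[0].split()[:3])
              return time.mktime(time.strptime("%d %s" % (YEAR, stamp),
                                               "%Y %d %b %H:%M:%S"))
          except (ValueError, IndexError):
              return None

      def find_gaps(path):
          last = None
          for line in open(path):
              ts = parse_ts(line)
              if ts is None:
                  continue
              if last is not None and ts - last > THRESHOLD:
                  print "gap of %.0f min before: %s" % ((ts - last) / 60.0, line.strip())
              last = ts

      if __name__ == "__main__":
          find_gaps(sys.argv[1])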

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=121879
    
    1)  1/14: UTD-HEP - file transfer errors between UTD_PRODDISK => BNL-OSG2_MCDISK - for example: failed to contact on remote SRM [httpg://fester.utdallas.edu:8446/srm/v2/server].  Issue resolved as 
    of 1/15 - from the site admin: Reboot of our gateway node seems to have resolved this problem. No errors since 2011-01-14, 14:00 GMT.  ggus 66146 closed, eLog 21215.
    2)  1/14: BNL - analysis job failures with the error "*pilot:* Get function can not be called for staging input files: [Errno 28] No space left on device *trans:*Exception caught by pilot."  Issue understood - from Hiro: 
    The message actually came from NFS mounted space and not the local disk.  /usatlas/grid (which includes usatlas1 used by pilot for proxy and others) has run out of space yesterday.  The space was increased yesterday.  
    3)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  Also https://savannah.cern.ch/bugs/index.php?77139.
    4)  1/14: US cloud set off-line in preparation for the BNL maintenance outage.  More details here, including other cloud statuses relative to database infrastructure split at CERN 1/17 - 1/19: https://savannah.cern.ch/support/?118697, 
    https://savannah.cern.ch/support/?118699.
    5)  1/18: SWT2_CPB file transfer errors - FTS channels temporarily turned off.  Issue with two problematic dataservers resolved - from Patrick: I think everything should be working now.  The dataservers are back and 
    PandaMover is able to stage data to us correctly.  There should be no reason that FTS will have a problem.  I will turn the channels back on momentarily.
    
    Follow-ups from earlier reports:
    (i)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment please 
    do not use that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    Update 12/21, from Alessandro: the situation is definitely better now in AGLT2, almost all the releases have been installed and are now properly working.  ggus ticket 64770 closed on 1/11.
    (ii)  12/16: Offline database infrastructure will be split into two separate Oracle RAC setups: (i) ADC applications (PanDA, Prodsys, DDM, AKTR) migrating to a new Oracle setup, called ADCR; (ii) Other applications (PVSS, COOL, 
    Tier0) will remain in the current ATLR setup.  See: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/OfflineDBSplit
    Update 1/18: work completed.  See eLog 21312 / 18 / 19.
    (iii)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  Once the transfer 
    completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (iv)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    (v)  12/22: Problem with the installation of atlas release 16.0.3.3 on the grid in most clouds.  Alessandro informed, ggus 65628, eLog 20818.
    Update 1/12: from I Ueda: The problem with a tarball seems to have been fixed.  ggus 65628 closed.
    (vi)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (vii)  1/7: Duke - file transfer errors - "failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu."  ggus 65933 in-progress, eLog 21103.
    Update 1/13 from Doug: I believe that I have fixed the problem by removing the certificates, replacing them, and restarting bestman. I am now able to fetch a dataset from Duke using dq2-get.  [This ticket will be closed.]
    (viii)  1/10: file transfer errors between AGLT2_USERDISK => TRIUMF-LCG2_PERF-TAU - "failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries."  Appears to be an issue 
    related to network connectivity between the sites.  ggus 65972 in-progress, eLog 21221.
    Update 1/19, from Shawn: Routing fix between NLR and USLHCnet fixed this issue. ggus 65972 closed.
    (ix) A multi-day service outage at the US Atlas Tier 1 facility is planned at BNL, starting on Saturday, January 15 at 8:00AM EST through Monday, January 17 at 5:00PM EST. This service outage will affect all services hosted 
    at the US Atlas Tier 1 facility at BNL.  eLog 21139.
    Update 1/17, from Michael: The intervention was completed. All services are operational; the batch queues were re-opened.  eLog 21316.
    • UWISC - remove from Panda since they are not participating in production.
  • this meeting: Operations summary:
     
    Yuri's summary from the weekly ADCoS meeting (this week presented by Tom Fifield):
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=124091
    
    Note: AGLT2 will take a maintenance outage on 2/1.  See more details in the message from Bob to the usatlas-t2-l mailing list.
    
    1)  1/19: UTD-HEP - job failures with missing input file errors - for example: "19 Jan 07:07:10|Mover.py | !!FAILED!!2999!! Failed to transfer HITS.170554._000123.pool.root.2: 1103 (No such file or directory)."  
    ggus 66284, eLog 21346.
    2)  1/19:  BNL - file transfer failures with the error "FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [FILE_EXISTS] at Tue Jan 18 23:56:18 EST 2011 
    state Failed : file exists]."  ggus 66280, eLog 21352.  Issue resolved as of 1/21 - from Hiro: since DDM will retry with a different physical name, this is not an issue. The dark data will be taken care of later.  ggus ticket closed.
    3)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on t301.hep.tau.ac.il reports Error 
    reading token data header: Connection closed."  ggus 66298.  From Hiro: There is a known issue for users with the Israel CA having problems accessing BNL and MWT2. This is being actively investigated right now. Until this gets 
    completely resolved, users are advised to submit a DaTRI request to transfer their datasets to some other sites (LOCALGROUPDISK area) for downloading.
    4)  1/21: SLACXRD file transfer errors - "failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]."  Issue was reported to be fixed by Wei, but the errors reappeared later the same day, 
    so the ticket (ggus 66346) was re-opened.  eLog 21409.
    5)  1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:[USER_ERROR] 
    source file doesn't exist]."  https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
    6)  1/21: HammerCloud auto-exclusion policy updated - see: https://twiki.cern.ch/twiki/bin/view/IT/HammerCloud#10_HammerCloud_Automatic_Site_Ex
    7)  1/21: SWT2_CPB - user reported that the command 'lcg-ls' was hanging when attempting to communicate via the SRM interface.  Possibly a temporary network glitch - subsequent tests are working correctly.  
    RT 19309 / ggus 66386.  Update 1/26: user reports command now working, no recent instances of this error.  RT & ggus tickets closed.
    8)  1/22: MWT2_IU_DATADISK timeout problems - for example " [TRANSFER error during TRANSFER phase: [TRANSFER_MARKERS_TIMEOUT] No transfer markers received for more than 180 seconds] ACTIVITY: 
    User Subscriptions."  Issue understood - from Aaron: We had a storage pool which was throwing errors and needed to be restarted. The behavior described here doesn't exactly match this failure, but it does match 
    the timeframe. Please let us know if this is still occurring at any frequency, otherwise we can consider the issue resolved.  ggus 66410 closed, eLog 21429.
    9)  1/22: MWT2_IU_PRODDISK => BNL-OSG2_MCDISK file transfer failures with source errors.  Resolved - from Aaron: This was tracked down to a failing pool, which was restarted and is now delivering data as expected. 
    We should now see these transfers succeed, and this issue should be cleared up.  ggus 66415 closed, eLog 21439.
    10)  1/22: BNL-OSG2_LOCALGROUPDISK file transfer errors (from NDGF-T1_PHYS-SUSY) like " [TRANSFER error during TRANSFER phase: [FIRST_MARKER_TIMEOUT] First non-zero marker not received within 
    180 seconds]."  From Michael: The issue is no longer observed. The ticket can be closed.  ggus 66416 closed, eLog 21441.
    11)  1/24: AGLT2 job & file transfer errors - "[SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up 
    after 3 tries]."  ggus 66450 in-progress, eLog 21488.  Update from Bob at AGLT2: Just restarted dcache services on head01. rsv srmcp-readwrite had been red. Hopefully that will clear the issue.  Since the queues at 
    the site (analy_, prod) had been set offline (ADC site exclusion ticket: https://savannah.cern.ch/support/?118828) test jobs were submitted, and they completed successfully (eLog 21497).  Are we ready to close this ticket?
    12)  1/25: SLACXRD_DATADISK file transfer errors - "[Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] No space found with at least 2895934054 bytes 
    of unusedSize]."  http://savannah.cern.ch/bugs/?77346.
    13)  1/25: SLACXRD job failures with stage-in errors - for example "!!FAILED!!2999!! Failed to transfer ... 1099 (Get error: Staging input file failed)."  From Wei: I think this is fixed. Sorry for the noise. There are a lot of 
    places to change when I change the config, so I may still be missing something... let me know if that is still the case.  ggus 66520 closed.
    
    Follow-ups from earlier reports:
    (i)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  
    Once the transfer completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (ii)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    (iii)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (iv)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  Also https://savannah.cern.ch/bugs/index.php?77139.
    1/25: Update from Shawn:
    I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you can track the "repair" at 
    http://bourricot.cern.ch/dq2/consistency/
    Let me know if there are further issues.
    
    
    • ADCR databases were down, things coming back online now.
    • Note modifications to exclusion policy
    • 3 old sites have been deprecated.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • See notes via email to usatlas-grid-l.
    • Illinois problems have improved
    • Perfsonar action item for sites to check disk on latency nodes (cf instructions from Philip and Jason)
    • New monitoring via the Nagios monitor at BNL - throughput matrix - https://nagios.racf.bnl.gov/nagios/cgi-bin/prod/perfSonar.php?page=115
    • Finding configuration problems at sites - the goal is to get the matrix fully green.
    • Will also have latency measurements.
    • Tom has added plotting abilities as well.
    • Hiro will add results onto his graphs (perfSONAR alongside his data transfer plots).
    • LHC OPN Tier 2 meeting at CERN - summary: a technical meeting last Thursday on how the LHC OPN can support Tier 2 networking; four whitepapers were presented. A small subset of the group will try to synthesize the results into a single document. Considering a distinct infrastructure separate from the LHC OPN and the Tier 1s, but funding is an issue. Starlight and MANLAN would be logical, federated facilities to create open exchange points. A challenging prospect - traversing different service providers. Michael: it would be interesting to see how Simone's "sonar" tests - the resulting matrix - reflect today's capabilities. This has been raised for the Napoli retreat in February.
    • DYNES - completing evaluation of proposals this week. PI meeting on Friday.
    • Discussion about policy for access to port 80 to/from SLAC, BNL.
  • this week:
    • Off week; Tom is adding the latency matrix monitoring (a toy matrix-plot sketch follows this list).
    • DYNES - the proposals all looked strong; will likely accept all of them, with clarifications. There will be a BOF at the Joint Techs meeting next week at Clemson covering milestones and schedule; it will be available via remote.
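    The real plots come from the BNL Nagios/perfSONAR pages linked above; the snippet below is only a toy rendering of a site-to-site throughput matrix with made-up site names and random rates, to illustrate the "fully green matrix" goal.

      #!/usr/bin/env python
      # Toy site-to-site throughput matrix plot. Site names and rates are
      # fabricated for illustration; the real matrices are produced by the
      # BNL Nagios/perfSONAR monitoring.
      import numpy as np
      import matplotlib
      matplotlib.use("Agg")          # render to file, no display needed
      import matplotlib.pyplot as plt

      sites = ["BNL", "AGLT2", "MWT2", "NET2", "SWT2", "WT2"]
      rates = np.random.uniform(100, 900, (len(sites), len(sites)))  # Mbps, fake
      np.fill_diagonal(rates, 0)

      fig, ax = plt.subplots(figsize=(6, 5))
      im = ax.imshow(rates, interpolation="nearest", cmap="RdYlGn", vmin=0, vmax=1000)
      ax.set_xticks(range(len(sites)))
      ax.set_xticklabels(sites, rotation=45)
      ax.set_yticks(range(len(sites)))
      ax.set_yticklabels(sites)
      ax.set_xlabel("destination")
      ax.set_ylabel("source")
      cbar = fig.colorbar(im)
      cbar.set_label("throughput (Mbps)")
      fig.savefig("throughput_matrix.png", bbox_inches="tight")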

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • WLCG Grid Deployment Board meeting last week; Doug presented the work for ATLAS. EOS-xrootd was discussed, as was the CMS work presented by Brian Bockelman. The GDB will follow these activities. NIKHEF will work with us on security issues.
  • Charles - working on HC tests running over federated xrootd. Also working on the xrd-lfc module - it requires a VOMS proxy; looking at lfc-dli, but performance is poor and it is being deprecated. Can we use a server certificate?
  • Wei - discussed with Andy regarding the checksum issue - may require architectural change.
this week:
  • HC tests are running against the UC local xrootd redirector (through ANALY_MWT2_X). A few job failures (9/200) are being tracked down. The event rate is not as good as dcap; may need more tuning of the xrootd client. Hiro will set this up at BNL.
  • Local tests of dcap versus xrootd - apparently a factor of 2 improvement (an illustrative read-rate comparison is sketched after this list).
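  The numbers above come from HammerCloud and the local MWT2 tests; purely to illustrate how a dcap-versus-xrootd read comparison can be made, a PyROOT timing sketch follows. The dcap door, redirector hostname, file path and tree name are placeholders, not actual site endpoints.

    #!/usr/bin/env python
    # Illustrative timing of reading the same file through dcap and through an
    # xrootd redirector with PyROOT. All URLs and the tree name are placeholders;
    # the comparison quoted above was done with HammerCloud and local tests.
    import time
    import ROOT

    URLS = {
        "dcap":   "dcap://dcap-door.example.edu:22125/pnfs/example.edu/data/test.root",
        "xrootd": "root://xrd-redirector.example.edu//atlas/data/test.root",
    }

    def read_rate(url, treename="CollectionTree"):
        """Open the file, loop over the tree, return the observed MB/s."""
        start = time.time()
        f = ROOT.TFile.Open(url)
        if not f or f.IsZombie():
            return None
        tree = f.Get(treename)
        if not tree:
            f.Close()
            return None
        nbytes = 0
        for i in xrange(tree.GetEntries()):
            nbytes += tree.GetEntry(i)      # bytes read for this entry
        elapsed = time.time() - start
        f.Close()
        return nbytes / 1e6 / elapsed

    if __name__ == "__main__":
        for proto, url in sorted(URLS.items()):
            rate = read_rate(url)
            print "%-6s %s" % (proto, "%.1f MB/s" % rate if rate else "open failed")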

Site news and issues (all sites)

  • T1:
    • last week(s): Comprehensive intervention at the Tier 1 - Saturday to Monday - various services were upgraded. Doubled bandwidth between the storage farm and worker nodes - up to 160 Gbps. Several grid components were upgraded - moved to SL5 on some nodes. CE upgraded. A number of failure modes were discussed with Nexsan - new firmware for the disk arrays, to improve identification of faulty disks; will further improve availability and performance. Hiro - on storage management, dCache was upgraded to 1.9.5-23 and pnfs to 3.1.18; postgres upgraded to 9.0.2, along with the backend disk area (hardware swap to get more spindles). Hot standby for postgres. All dCache changes went okay. LFC upgraded to 1.8.0-1, a significant upgrade; should discuss with OSG packaging this version.
    • this week:

  • AGLT2:
    • last week: dCache - changed postgres to 9.0.2 as well. There was a Chimera issue, now solved. Upgraded to OSG 1.2.18. MSU switch configuration preparations, testing spanning-tree issues. Testing new switches.
    • this week: Planned downtime next week: dCache 1.9.10-4, will be using NFSv4.1. Lots of internal changes (e.g. configuration) going into this release. MSU getting network changes ready.

  • NET2:
    • last week(s): Working with Dell on Tier 3 hardware, leaning toward the C6100. Getting ready to purchase a new rack of processors at BU. Working on the Harvard analysis queue. Networking improvements are being planned for 2011. Detailed planning for moving the Tier 2 to Holyoke in late 2012. Moving PRODDISK off old equipment.
    • this week: Had to kill a bunch of jobs in the HU analysis queue (resulting in get errors). DATADISK filled up; adding a half-rack of storage (130 TB usable) and increasing DATADISK to 450 TB. Ramping up the HU analysis queue, which is currently not getting enough IO capacity. Another 10G port is needed - will buy another switch - will go to 2000 analysis jobs. Testing ClusterNFS to export GPFS (removing the need for GPFS clients).

  • MWT2:
    • last week(s): During downtime installed 4 new 6248 Dell switches (added into existing switch stack). Retired usatlas[1-4] home server, migrated to new hardware.
    • this week: Continuing to install 88 worker nodes; will install new libdcap. Will install CVMFS.

  • SWT2 (UTA):
    • last week: Problems with data servers over the weekend, resolved. Working on monitoring the federated xrootd system.
    • this week: Working on Tier 3; working on federated xrootd monitoring. All running fine otherwise.

  • SWT2 (OU):
    • last week: 18 R410's have arrived, waiting for Dell to install them.
    • this week: All is fine. Scheduling installation of the R410s, which may require electrical work. Bestman2 has been tested extensively over the past months.

  • WT2:
    • last week(s): Still upgrading Solaris for storage; hope to finish by the end of next week. Installing Bestman2 on a test machine, including a module to dynamically change gridftp doors. Want to test with FTS before going into production (Hiro).
    • this week: Bestman2 running in gateway mode seems to be fine; it starts up very quickly. No immediate need to upgrade to it, but this will be the only supported version. Enabled the security module for the xrootd client in the internal storage cluster, which created some problems; the issue was found in XrootdFS - will notify Doug and Tanya. Trivial to update (Wei's scripts). Patrick has a package to dynamically add gridftp servers using a plug-in module (no need to restart Bestman).

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report
    • AGLT2 is now running Alessandro's system - now in automatic installation mode. Will do other sites after the holiday.
    • There was a problem with 16.0.3.3; SIT released a new version, and Xin is installing it on all the sites.
    • Xin is still generating the PFC for the sites (a schematic PoolFileCatalog example is sketched below, after this section's notes).
  • this meeting:
    • MWT2_UC is using the new system, for 16-series releases only. If it works well, the remaining releases will be enabled in the new system.
    • Next site - BU - once the BDII publication issue is resolved, will return to this.
    • WT2, IU, SWT2 - depending on Alessandro's availability.
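    A PoolFileCatalog (PFC) is just an XML map from file GUIDs to logical and physical file names. The real per-site catalogs are produced by Xin's installation machinery; the sketch below only shows the layout schematically, with a made-up GUID, LFN and storage path.

      #!/usr/bin/env python
      # Schematic generation of a PoolFileCatalog entry. The GUID, LFN and
      # storage path are fabricated for illustration; real catalogs are
      # produced by the release-installation machinery.
      import xml.etree.cElementTree as ET

      HEADER = ('<?xml version="1.0" encoding="UTF-8" standalone="no" ?>\n'
                '<!DOCTYPE POOLFILECATALOG SYSTEM "InMemory">\n')

      def build_pfc(entries):
          """entries: list of (guid, lfn, pfn) tuples -> PFC XML string."""
          root = ET.Element("POOLFILECATALOG")
          for guid, lfn, pfn in entries:
              f = ET.SubElement(root, "File", ID=guid)
              phys = ET.SubElement(f, "physical")
              ET.SubElement(phys, "pfn", filetype="ROOT_All", name=pfn)
              logical = ET.SubElement(f, "logical")
              ET.SubElement(logical, "lfn", name=lfn)
          return HEADER + ET.tostring(root)

      if __name__ == "__main__":
          example = [("ABCD1234-0000-0000-0000-000000000000",
                      "DBRelease-example.tar.gz",
                      "/pnfs/example.edu/atlasdatadisk/ddo/DBRelease-example.tar.gz")]
          print build_pfc(example)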

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • A newline was causing a buffer overflow.
  • Chipping away at this on the job side.
  • dcap++ and the newer stock libdcap could be merged at some point.
  • The maximum number of active movers needs to be high - 1000.
  • Submitted a patch to dcache.org for review.
  • No estimate on when libdcap++ will be included in the release.
  • No new failure modes observed.

this meeting:

AOB

  • last week
    • Doug - would like a report on the BNL CVMFS infrastructure. Michael - has ordered adequate hardware. Doug reports that cloning to Rutherford is underway.
  • this week


-- RobertGardner - 25 Jan 2011
