
MinutesJan19

Introduction

Minutes of the Facilities Integration Program meeting, Jan 19, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Rob, Michael, Fred, Charles, Dave, Karthik, Torre, Doug, Rik, AK, Patrick, Joe, Justin, Jason (I2), Saul, Shawn, John B, Tom, Bob, Alden, Armen, Mark, Wensheng
  • Apologies:

Integration program update (Rob, Michael)

  • IntegrationPhase16 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s)
      • Welcome Akhtar Mahmood ("AK", Bellarmine University, Tier 3 site) to the meeting.
      • Next face-to-face facilities meeting will be co-located with the OSG All Hands meeting, March 7-10, 2011, hosted by the Harvard Medical School; cf: http://ahm.sbgrid.org/
      • Autopilot factory to manage pilot submissions in the US. Looking into cloud computing from the facility side, and into advances in storage.
      • Quarterly report - reminder - to complete by mid-January.
      • Reminder - database intervention at CERN, Sunday January 16, to be completed on Monday, January 17, to restart Tuesday morning in Europe; again - a good time for interventions.
    • this week
      • Phase 16 Integration program forming - see above. Add CVMFS infrastructure.
      • Metrics discussion - new CSV format available. One issue is getting file size and number of files. Alden is providing an API as well. Will give Charles data to upload and modify (see the sketch after this list).
      • Reports from Tier 2's on downtime activities, if any.
      • Facility capacity report
      • Request from OSG/VDT regarding xrootd packaging for US ATLAS: rpms available for 3.0.1; pacman needed?
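
A minimal sketch of how the per-site file counts and sizes mentioned above might be tallied from the new CSV export. The column names ("site", "nfiles", "bytes") and the file name are assumptions for illustration only; the real schema comes from Alden's export/API.

    # Illustrative only: per-site totals from an assumed CSV schema.
    import csv
    from collections import defaultdict

    def summarize(csv_path):
        """Sum file counts and bytes per site from a metrics CSV."""
        totals = defaultdict(lambda: {"nfiles": 0, "bytes": 0})
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                totals[row["site"]]["nfiles"] += int(row["nfiles"])
                totals[row["site"]]["bytes"] += int(row["bytes"])
        return dict(totals)

    if __name__ == "__main__":
        # "capacity_metrics.csv" is a placeholder filename.
        for site, t in sorted(summarize("capacity_metrics.csv").items()):
            print("%-20s %10d files %15d bytes" % (site, t["nfiles"], t["bytes"]))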

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

Tier 3 References:
  • Links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

General Tier 3 issues

last week(s):
  • Oregon, UIC, Arizona, Duke, and Indiana are all setting up equipment.
  • Will be sending out a global email to get their status.
  • Discussion about the support list (no consensus)
this week:
  • Have met with sys admins from UTD
  • Working with UC Irvine; Amherst
  • Sending out email to contacts, including co-located Tier 2s
  • T3 section for OSG all-hands meeting
  • Joint-session with USCMS Tier3
  • Note - OSG all hands registration is open
  • Automatic testing - data handling functional tests, waiting for the site status board to do auto-blacklisting. Douglas Smith is going to handle production jobs, adopting HC framework.
  • New Tier 3s coming online for production: UTD, Bellarmine, Hampton.
  • Analysis Tier 3gs: will report.

Tier 3 production site issues

  • Bellarmine University (AK): last week:
      • "AK" - Akhtar Mahmood
      • 385 cores, 200TB RAID, 200 TB on nodes.
      • Horst has been measuring things. 100 Mbps connection (10 MB/s with no students)
      • FTS transfer tests started
      • Totally dedicated to ATLAS production (not analysis)
      • Goal is to be a low-priority production site
      • There will be a long testing period in the US, at least a week
      • Need to look at the bandwidth limitation
      • Horst has installed the OSG stack
      • There is 0.5 FTE of local technical support
      • Support tickets go to the Tier 3 queue - are they then assigned to the responsible site? No.
      • Will need to set up an RT queue.
    this week:
      • AK and Horst - more network testing of inbound bandwidth

  • UTD (Joe Izen)
      • Have been in production for years, with 152 cores. Have new hardware, mostly for Tier3g. Some will go into production - 180 cores. Doug is providing some recommendations.
      • Operationally - biggest issue is lost heartbeat failures. Will consult Charles.

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=119786
    
    Note: A multi-day service outage at the US Atlas Tier 1 facility is planned at BNL, starting on Saturday, January 15 at 8:00AM EST through Monday,
    January 17 at 5:00PM EST. This service outage will affect all services hosted at the US Atlas Tier 1 facility at BNL.  eLog 21139.
    
    1)  1/5: SWT2_CPB_PERF-EGAMMA - it was reported that "There are large number of DaTri requests stuck on "subscribed" for over 24 hours all to destinations."  Issue understood - some of the associated 
    tasks were still running when the ticket was created, combined with the time required to transfer the files to SWT2_CPB.  See details here: https://savannah.cern.ch/bugs/?76746 (& 76747), eLog 21045/48.
     2)  1/6: BNL - job failures with the error "Pilot has decided to kill looping job."  Issue understood - from Hiro: It was due to the recent change in the maximum number of connections to each dCache storage pool. 
     We recently lowered the maximum connections for all pools. Although this should actually work better for copying files in/out of storage via dccp/srmcp, it has a bad impact on direct access by 
     limiting the number of concurrently accessing jobs. I completely forgot that the conditions data is accessed directly through dcap doors. Anyway, I increased the maximum connections (/movers) for HOTDISK to a large value. 
     Therefore, those jobs should run normally now.  https://savannah.cern.ch/bugs/?76680.  (Ticket was originally opened on 12/28.)
    3)  1/7: MWT2_UC - job failures with the error "Get error: lsm-get failed (201): ERROR 201 Copy command failed."  Issue resolved as of 1/8 - ggus 65922 closed, eLog 21116.
    4)  1/7: Duke - file transfer errors - "failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu."  ggus 65933 in-progress, eLog 21103.
    5)  1/8: MWT2_UC - file transfer errors with srm://uct2-dc1.uchicago.edu as the source.  From Rob: There was an unhappy pool (uct2-s6) - restarted; also restarted SRM on uct2-dc1.  Issue resolved - eLog 21118.
    6)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
     7)  1/10: SLACXRD - job failures with stage-out errors - for example: "Put error: copying the file: 256, Error accessing path/file for root://..."  Issue understood - from Wei: fixed. I was making changes to the storage and 
     caused some failures. You see 100% failure (and a large number in the transferring state) because I stopped the FTS channels. I reopened the FTS channels so you should see successful jobs.  ggus 65950 closed, eLog 21142.
    8)  1/10: file transfer errors between AGLT2_USERDISK => TRIUMF-LCG2_PERF-TAU - "failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries."  Appears to be an issue 
    related to network connectivity between the sites.  ggus 65972 in-progress, eLog 21141.
    9)  1/10: File transfer errors from multiple sites => BNL with the error "[AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR]
    cannot create archive repository: No space left on device]."  Issue quickly resolved - ggus 65981 closed, eLog 21152/53.
     10)  1/11: From Stephane Jezequel: Following the decision of the last ADC weekly meeting, all datasets in T2_MCDISK older than 15 days and not accessed at all in the last 90 days were deleted (the lifetime was set to 1 day, 
     so recovery is possible). Only datasets with at least 2 primary replicas at T1s were deleted. (A sketch of these selection criteria appears after this list.)
    11)  1/12: UTA_SWT2 - file transfer failures with the error "failed to contact on remote SRM [httpg://gk05.swt2.uta.edu:8443/srm/v2/server]. Givin' up after 3 tries."  Issue understood and resolved - from Patrick: We had a 
    problem with a partition filling on our SRM host.  The partition was cleared and a related issue with CRLs was fixed.  The certs used by DDM are now accepted and transfers should be OK now.  I can see that SAM tests 
    are starting to pass.  ggus 66011 / RT 19140 closed, eLog 21169.
    12)  1/12: AGLT2 - From Bob: A power failure at MSU dumped an entire rack of worker nodes at 11pm yesterday. This, in turn, caused Condor to once again get lost. I have just re-enabled auto-pilots to us after wiping the 
    full load of jobs that we were running at the time. Panda reported this to be 4100 production and 276 analysis jobs this morning.  eLog 21181.
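
A minimal sketch of the selection criteria described in item 10 above (age over 15 days, no access in 90 days, at least 2 primary T1 replicas). The dataset record fields used here are illustrative; the actual selection is performed by the central ADC/DDM tooling, not by a script like this.

    from datetime import datetime, timedelta

    def eligible_for_cleanup(dataset, now=None):
        """True if a T2_MCDISK dataset record matches the stated deletion policy."""
        now = now or datetime.utcnow()
        older_than_15d = now - dataset["created"] > timedelta(days=15)
        idle_for_90d = now - dataset["last_accessed"] > timedelta(days=90)
        safely_replicated = dataset["n_primary_t1_replicas"] >= 2  # assumed field names
        return older_than_15d and idle_for_90d and safely_replicated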
    
    Follow-ups from earlier reports:
    (i)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment 
    please do not use that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    (ii)  12/16: Offline database infrastructure will be split into two separate Oracle RAC setups: (i) ADC applications (PanDA, Prodsys, DDM, AKTR) migrating to a new Oracle setup, called ADCR; (ii) Other applications (PVSS, 
    COOL, Tier0) will remain in the current ATLR setup.  See: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/OfflineDBSplit
    (iii)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  Once the 
    transfer completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (iv)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    (v)  12/22: Problem with the installation of atlas release 16.0.3.3 on the grid in most clouds.  Alessandro informed, ggus 65628, eLog 20818.
    (vi)  12/26: UTD-HEP set off-line for site maintenance.  https://savannah.cern.ch/support/?118511, eLog 20891.
    Update 1/8: test jobs completed successfully - site set back to 'on-line'.  eLog 21115.
     (vii)  1/5: UTD_HOTDISK file transfer errors - "failed to contact on remote SRM [httpg://fester.utdallas.edu:8446/srm/v2/server]."  Site is in a scheduled maintenance outage (see (vi) above), but maybe not blacklisted 
     in DDM?  ggus 65866, eLog 21047.
    Update 1/10: maintenance outage over, production was resumed at the site - ggus 65866 closed.
     (viii)  1/5: MWT2 - network maintenance outage.  eLog 21049.
    Update 1/6: Some lingering issues following the outage, but these appear to be resolved now.  https://savannah.cern.ch/support/index.php?118563, eLog 21076.
    
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=121879
    
     1)  1/14: UTD-HEP - file transfer errors between UTD_PRODDISK => BNL-OSG2_MCDISK - for example: failed to contact on remote SRM [httpg://fester.utdallas.edu:8446/srm/v2/server].  Issue resolved as 
    of 1/15 - from the site admin: Reboot of our gateway node seems to have resolved this problem. No errors since 2011-01-14, 14:00 GMT.  ggus 66146 closed, eLog 21215.
    2)  1/14: BNL - analysis job failures with the error "*pilot:* Get function can not be called for staging input files: [Errno 28] No space left on device *trans:*Exception caught by pilot."  Issue understood - from Hiro: 
     The message actually came from NFS mounted space and not the local disk.  /usatlas/grid (which includes usatlas1, used by the pilot for proxies and other files) ran out of space yesterday.  The space was increased yesterday.  
    3)  1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist."  ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.  Also https://savannah.cern.ch/bugs/index.php?77139.
    4)  1/14: US cloud set off-line in preparation for the BNL maintenance outage.  More details here, including other cloud statuses relative to database infrastructure split at CERN 1/17 - 1/19: https://savannah.cern.ch/support/?118697, 
    https://savannah.cern.ch/support/?118699.
    5)  1/18: SWT2_CPB file transfer errors - FTS channels temporarily turned off.  Issue with two problematic dataservers resolved - from Patrick: I think everything should be working now.  The dataservers are back and 
    PandaMover is able to stage data to us correctly.  There should be no reason that FTS will have a problem.  I will turn the channels back on momentarily.
    
    Follow-ups from earlier reports:
    (i)  11/30: AGLT2 - job failures with the error "Trf setup file does not exist for 16.2.1.2 release."  From Alessandro: there is a problem when setting up the releases in AGLT2, and it's already under study. For the moment please 
    do not use that site for any release that has not been validated and tagged (which should be the default, BTW). The needed release 16.2.1 in fact is not validated nor tagged in AGLT2.  ggus 64770 in-progress, eLog 20131.
    Update 12/21, from Alessandro: the situation is definitely better now in AGLT2, almost all the releases have been installed and are now properly working.  ggus ticket 64770 closed on 1/11.
    (ii)  12/16: Offline database infrastructure will be split into two separate Oracle RAC setups: (i) ADC applications (PanDA, Prodsys, DDM, AKTR) migrating to a new Oracle setup, called ADCR; (ii) Other applications (PVSS, COOL, 
    Tier0) will remain in the current ATLR setup.  See: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/OfflineDBSplit
    Update 1/18: work completed.  See eLog 21312 / 18 / 19.
    (iii)  12/17, 12/20:  ANALY_SWT2_CPB was auto-blacklisted twice.  Hammercloud test jobs were failing due to the fact that a required db release file was not yet transferred to the site when the first jobs started up.  Once the transfer 
    completed the test jobs began to complete successfully.  Discussion underway about how to address this issue.
    (iv)  12/21: NERSC file transfer errors - "failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]."  ggus 65617 in-progress, eLog 20810.
    (v)  12/22: Problem with the installation of atlas release 16.0.3.3 on the grid in most clouds.  Alessandro informed, ggus 65628, eLog 20818.
    Update 1/12: from I Ueda: The problem with a tarball seems to have been fixed.  ggus 65628 closed.
    (vi)  1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)."  Site is investigating.
    (vii)  1/7: Duke - file transfer errors - "failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu."  ggus 65933 in-progress, eLog 21103.
     Update 1/13 from Doug: I believe that I have fixed the problem by removing the certificates, replacing them, and restarting bestman. I am now able to fetch a dataset from Duke using dq2-get.  [This ticket will be closed.]
     (viii)  1/10: file transfer errors between AGLT2_USERDISK => TRIUMF-LCG2_PERF-TAU - "failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries."  Appears to be an issue 
    related to network connectivity between the sites.  ggus 65972 in-progress, eLog 21221.
    Update 1/19, from Shawn: Routing fix between NLR and USLHCnet fixed this issue. ggus 65972 closed.
    (ix) A multi-day service outage at the US Atlas Tier 1 facility is planned at BNL, starting on Saturday, January 15 at 8:00AM EST through Monday, January 17 at 5:00PM EST. This service outage will affect all services hosted 
    at the US Atlas Tier 1 facility at BNL.  eLog 21139.
    Update 1/17, from Michael: The intervention was completed. All services are operational; the batch queues were re-opened.  eLog 21316.
    
    • UWISC - remove from Panda since they are not participating in production.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
  • this week:
    • See notes via email to usatlas-grid-l.
    • Illinois problems have improved
    • perfSonar action item for sites to check disk performance on latency nodes (cf. instructions from Philip and Jason; an illustrative check is sketched after this list)
    • New monitoring via Nagios monitor at BNL -throughput matrix - https://nagios.racf.bnl.gov/nagios/cgi-bin/prod/perfSonar.php?page=115
    • Finding configuration problems at sites - the goal is to get the matrix fully green.
    • Will also have latency measurements
    • Tom has added plotting abilities as well.
    • Hiro will add perfSonar results onto his data transfer plots.
    • LHC OPN Tier 2 meeting at CERN - summary: a technical meeting last Thursday on how the LHC community can support Tier 2 networking; four whitepapers were presented. A small subset of the group will try to synthesize the results into a single document. Considering a distinct infrastructure separate from the LHC OPN and from the Tier 1s, but funding is an issue. Starlight and MANLAN would be logical, federated facilities to create open exchange points. A challenging prospect - traversing different service providers. Michael: it would be interesting to see how Simone's "sonar" tests - the resulting matrix - reflect today's capabilities. This has been raised for the Napoli retreat in February.
    • DYNES - completing evaluation of proposals this week. PI meeting on Friday.
    • Discussion about policy for access to port 80 to/from SLAC, BNL.
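
Regarding the latency-node disk check action item above, a generic write/read timing sketch follows. This is only illustrative and is not the procedure circulated by Philip and Jason; the file path and size are arbitrary, and the read pass may be served from the page cache, so treat the read figure as optimistic.

    import os, time

    def disk_throughput(path="/tmp/ps_disk_check.tmp", size_mb=256):
        """Write then re-read a test file; return (write_MBps, read_MBps)."""
        block = b"\0" * (1024 * 1024)
        t0 = time.time()
        with open(path, "wb") as f:
            for _ in range(size_mb):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())          # include flush-to-disk in the write timing
        write_rate = size_mb / (time.time() - t0)

        t0 = time.time()
        with open(path, "rb") as f:
            while f.read(1024 * 1024):    # likely cached; a strict read test would drop caches first
                pass
        read_rate = size_mb / (time.time() - t0)

        os.remove(path)
        return write_rate, read_rate

    if __name__ == "__main__":
        w, r = disk_throughput()
        print("write: %.1f MB/s, read: %.1f MB/s" % (w, r))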

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Doug: attempting to get a proxy server running at Argonne - to handle the case where data servers are behind the firewall.
  • Will gather some information from Charles.
  • A new version of xrootd is available - testing.
this week:
  • WLCG Grid Deployment Board meeting last week; Doug presented the work for ATLAS. EOS-xrootd was discussed, and Brian Bockelman presented for CMS. The GDB will follow these activities. NIKHEF will work with us on security issues.
  • Charles - working on HC tests running over the federated xrootd setup. Also working on the xrd-lfc module - it requires a VOMS proxy; looking at lfc-dli, but its performance is poor and it is being deprecated. Can we use a server certificate?
  • Wei - discussed the checksum issue with Andy; it may require an architectural change (see the checksum sketch after this list).
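
For context on the checksum discussion above, a chunked adler32 computation (the checksum type commonly recorded for file integrity in ATLAS data management) is sketched below. This shows only the local calculation; how checksums would be queried or verified through the federated xrootd layer is exactly the open architectural question.

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Return the 8-hex-digit adler32 of a file, read in chunks."""
        value = 1  # adler32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xFFFFFFFF)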

Site news and issues (all sites)

  • T1:
    • last week(s): will take advantage of the ATLAS downtime in 10 days - will do a major dCache upgrade, but still using pnfs. Major network reconfiguration as well, starting that Saturday; may need to extend into Tuesday. Will also upgrade LFC to 1.8.0-1.
    • this week: comprehensive intervention at the Tier 1 from Saturday to Monday - various services were upgraded. Doubled bandwidth between the storage farm and the worker nodes, up to 160 Gbps. Several grid components upgraded - moved to SL5 on some nodes; CE upgraded. A number of failure modes discussed with Nexsan - new firmware for the disk arrays, to improve identification of faulty disks; will further improve availability and performance. Hiro, on storage management: dCache upgraded to 1.9.5-23, pnfs to 3.1.18; postgres upgraded to 9.0.2, along with the backend disk area (hardware swap to get more spindles); hot standby for postgres. All dCache changes went okay. LFC upgraded to 1.8.0-1, a significant upgrade. Should discuss with OSG packaging this version.

  • AGLT2:
    • last week: Site downtime at the end of the month; will plan the dCache upgrade then. At MSU, on Dec 26 a smoke detector went off and dropped power via EPO; restarted the following Monday. A week later, at UM, a GUMS server issue led to a Condor job manager communication problem; because of a problem with the schedd, Condor could not recover (had to restart from scratch). Waiting for a new release of the schedd.
    • this week: dCache - changed postgres to 9.0.2 as well. There was a Chimera issue, now solved. Upgraded to OSG 1.2.18. MSU switch configuration preparations, testing for spanning-tree issues. Testing new switches.

  • NET2:
    • last week(s): Ran smoothly over the holidays. Working on the xrootd federation setup. Working on getting analysis jobs running full-out at HU, and setting up HC tests. Looking at direct access from HU, mounting GPFS via NFS. Purchasing a Tier 3 rack of worker nodes. John will summarize the LSF issues in a memo.
    • this week: Working with Dell on Tier 3 hardware, leaning toward the C6100. Getting ready to purchase a new rack of processors at BU. Working on the Harvard analysis queue. Networking improvements are being planned for 2011. Detailed planning for moving the Tier 2 to Holyoke in late 2012. Moving PRODDISK off old equipment.

  • MWT2:
    • last week(s): Analysis pilot starvation causing a poor user experience at MWT2. (Xin has moved the analysis pilot submitter to a new host, and there have also been changes to the autopilot_adjuster and the timefloor setting in schedconfig.)
    • this week: During downtime installed 4 new 6248 Dell switches (added into existing switch stack). Retired usatlas[1-4] home server, migrated to new hardware.

  • SWT2 (UTA):
    • last week: A Perc6 card locked up and dropped two shelves, causing problems at CPB. Otherwise okay.
    • this week: Problems with data servers over the weekend; resolved. Working on monitoring the federated xrootd systems.

  • SWT2 (OU):
    • last week: all is well; the network issue was cleared up. The order for 19 R410s has gone out.
    • this week: The 18 R410s have arrived.

  • WT2:
    • last week(s): Over the break all was okay. Planning to upgrade Solaris systems to a newer level.
    • this week: Still upgrading Solaris for storage; hope to finish by the end of next week. Installing Bestman2 on a test machine, including a module to dynamically change gridftp doors. Want to test with FTS before going into production (Hiro).

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc. Note: https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe and get notified of release installation & validation activity at their site.

  • last report
    • AGLT2 now running Alessandro's system - now in automatic installation mode. Will do other sites after the holiday.
    • There was a problem with 16.0.3.3; SIT released a new version, which Xin is installing on all the sites.
    • Xin is still generating PFCs for the sites.
  • this meeting:

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • A newline was causing a buffer overflow.
  • Chipping away at this on the job side.
  • dcap++ and the newer stock libdcap could be merged at some point.
  • The maximum active movers setting needs to be high - 1000.
  • Submitted a patch to dcache.org for review.
  • No estimate on libdcap++ inclusion into the release.
  • No new failure modes observed

this meeting:

AOB

  • last week
    • Providing the pledged capacities for 2011 - this milestone is due by April 1. Processing capacity is pretty much okay, but there are shortages in disk.
    • DYNES proposals will be evaluated by end of month.
    • The LCG CERN certificate is expiring soon - a new VOMS package is available (Hiro will send mail)
  • this week
    • Doug - would like a report on the BNL CVMFS infrastructure. Michael - adequate hardware has been ordered. Doug reports that cloning to Rutherford is underway.


-- RobertGardner - 18 Jan 2011
