
MinutesMar17

Introduction

Minutes of the Facilities Integration Program meeting, Mar 17, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Sarah, Charles, Aaron, Rob, John De Stefano, Marco, Horst, Karthik, Justin, Patrick, Rik
  • Apologies: Jason, Kaushik, Mark

Integration program update (Rob, Michael)

  • SiteCertificationP12 - FY10Q2
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
    • this week
      • Phase 12 of Integration Program winding down
      • lsm project for xrootd sites - Charles agreed to develop a Python module, xrdmodule.py, to avoid the LD_PRELOAD dependence (a hedged sketch of the idea appears at the end of this list)
      • Need to discuss ATLAS release installation plan (below, Xin)
      • Need to discuss Tier 3-OSG issues (below, Rik, John)
      • Thanks to everyone for attending the OSG All-Hands meeting at Fermilab last week - a good meeting.
      • LHC operations: aiming for high-intensity 450 GeV stable beams within 24 hours, which will lead to a ramp-up of activities and a requirement for stable operations at all sites.
      • Review season is about to start
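      • For illustration only, a minimal sketch of the xrdmodule.py idea: wrap the xrootd POSIX library with ctypes so files can be read via root:// URLs without LD_PRELOAD interception. The library and symbol names (libXrdPosix.so, XrdPosix_Open/Read/Close) are assumptions to be checked against the installed xrootd; this is not the actual module.
        # Illustrative sketch only -- not the actual xrdmodule.py. Assumes the
        # xrootd POSIX library exports XrdPosix_Open/Read/Close; verify the
        # symbol names against your xrootd installation before relying on this.
        import ctypes
        import os

        _xrd = ctypes.CDLL("libXrdPosix.so", mode=ctypes.RTLD_GLOBAL)

        def copy_from_xrootd(xrd_url, local_path, chunk=1024 * 1024):
            """Copy a root:// URL to local disk without LD_PRELOAD interception."""
            fd = _xrd.XrdPosix_Open(xrd_url.encode(), os.O_RDONLY)
            if fd < 0:
                raise IOError("XrdPosix_Open failed for %s" % xrd_url)
            buf = ctypes.create_string_buffer(chunk)
            try:
                with open(local_path, "wb") as out:
                    while True:
                        nread = _xrd.XrdPosix_Read(fd, buf, chunk)
                        if nread < 0:
                            raise IOError("read error on %s" % xrd_url)
                        if nread == 0:
                            break
                        out.write(buf.raw[:nread])
            finally:
                _xrd.XrdPosix_Close(fd)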

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • Hardware recommendations within 2 weeks.
    • Frontier-squid setup in an SL5 virtual machine at ANL ASC (push-button solution)
  • this week:
    • Presentations at last week's workshop here
    • Tier 3 xrootd meetings: https://twiki.cern.ch/twiki/bin/view/Atlas/XrootdTier3
    • Jamboree at the end of the month - cleaning up the "model" Tier 3 at ANL; those setting up sites are encouraged to attend.
    • Working groups ATLAS-wide are making progress
    • Rik is participating in the user support working group - lightly attended at the moment; others are encouraged to attend.
    • Tier 3 - OSG issues:
      Hi all,
      Following are some issues that need clarification. Michael asked me to summarize them. 
      Perhaps they could be discussed in the meeting this afternoon.
      
      1) Rob Quick would like to know how the OSG GOC should route tickets (GGUS, end-user) 
      that involve ATLAS Tier 3 OSG sites. There are two options:
       -- Send them as regular GOC tickets, via email, directly to the site admins as listed in OIM.
       -- Send them as forwarded tickets into BNL's Tier 3 RT queue.
      
      There are pros and cons to each. Doing everything directly means less ability to 
      track issues and notice unresponsive T3s. Doing everything via BNL RT means any 
      non-ATLAS-related issues get lumped in with ATLAS concerns. This only matters 
      if a Tier 3 site is doing work for other VOs--which is probably unlikely.
      
      My main concern is that whatever scheme is chosen, all tickets get handled similarly.
      
      2) The Tufts site apparently has the OSG software installed and it is publishing 
      via Gratia to OSG, but the site has not been created in OIM. The specific question 
      is if someone could urge them to create their entry? The larger issue is the need 
      to precisely clarify what the responsibilities of an ATLAS Tier 3 site are with 
      respect to OSG, and make sure that they perform all the necessary steps.
      
      3) In order for a site to subscribe to ATLAS data (via DQ2?), and possibly to
       transfer data with lcg-cp, it apparently is necessary/useful for the site to 
      publish its SE info into the WLCG BDII. (Marco Mambelli is involved in the lcg 
      tools on OSG and maybe can provide more exact requirements.)
      
      If so, this means that the site requires a CEMon/GIP installation. Currently, 
      these only get installed along with a CE, which some T3s may not need. So 
      we need to determine if a standalone CEMon/GIP setup is required, and if 
      so we need to request such a package be defined in the VDT/OSG stack. 
      The pieces exist--it is just a matter of configuration. Burt Holzman and 
      Brian Bockelman are willing to do it, but want confirmation that it is 
      required by our model before putting in the effort.
      Cheers,
      --john
    • follow-up next week.
    • Note: the BDII is a critical service for US ATLAS; there is an SLA. (A hedged sketch of querying the BDII for a site's SE record follows below.)
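    • For illustration, a minimal sketch of what publishing SE information into the WLCG BDII enables: an anonymous LDAP query for a storage element's Glue attributes. The python-ldap dependency, the top-level BDII endpoint shown, and the example SE hostname are assumptions/placeholders, not a prescription.
      # Minimal sketch: query a top-level BDII for a storage element's published
      # Glue attributes. Assumes python-ldap is installed; the BDII endpoint and
      # the SE hostname below are placeholders -- substitute real values.
      import ldap

      BDII_URI = "ldap://lcg-bdii.cern.ch:2170"  # assumed top-level BDII endpoint
      BASE_DN = "o=grid"                         # Glue schema base used by the BDII

      conn = ldap.initialize(BDII_URI)
      conn.simple_bind_s()  # the BDII allows anonymous reads

      # Hypothetical SE hostname, purely for illustration.
      se_filter = "(&(objectClass=GlueSE)(GlueSEUniqueID=se.example-tier3.org))"
      results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, se_filter,
                              ["GlueSEUniqueID", "GlueSEImplementationName",
                               "GlueSEStatus"])

      for dn, attrs in results:
          print(dn)
          for name, values in attrs.items():
              print("  %s = %s" % (name, values))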

Operations overview: Production and Analysis (Kaushik)

  • last meeting(s):
    • Borut's report this morning
    • US T1 already validated for reprocessing - will discuss whether to include US T2s.
    • New jobs are in the pipeline to keep us busy for a while.
    • Presently in US cloud ~4000 analysis jobs running, ~5000 production jobs
    • Will also want to backfill our queues with regional production
    • There are some Condor-G scaling issues in keeping up with large numbers of short jobs.
    • Sites that are up are running quite well.
    • Discussion of site blacklisting and monitoring of sites with regard to DDM functional tests (FT), SAM, Ganga robots, etc. At some point these will be automated.
    • Michael - discussions are under way with Alessandro and Hiro on some of the technical issues. We don't expect our sites to get blacklisted.
  • this week:
    • Note: starting with this meeting, analysis queue issues formerly covered by Nurcan will be addressed here.

Data Management & Storage Validation (Kaushik)

Release installation, validation (Xin, Kaushik)

The issue of the validation process, completeness of releases at sites, etc.
  • last meeting
  • this meeting:
    • It's been a one-person operation to deploy releases.
    • There will be an installation database which has a record of all the pieces necessary.
    • Integrate Panda-based installation w/ this database.
    • Control will be given to Alessandro.
    • Why not use Alessandro's system itself?
    • If we use the EGEE WMS system, some of the prerequisites are already in place; for example, CE information is published in the BDII.

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=87475
    
    1)  3/3: MWT2_UC -- failed jobs with errors like "CMTCONFIG is not available on the local system:  NotAvailable (required of task: i686-slc5-gcc43-opt," due to a corrupted atlas s/w release.   Xin re-installed the s/w, issue resolved.
    2)  3/3: Consolidation of dq2 site services at BNL for the tier 2's by Hiro beginning.  Will take several days to complete all sites.
    3)  3/4: Test jobs submitted to the new site BNL_ITB_ATLAS_TEST (a new one for testing) completed successfully.  Thus the site is now considered validated.
    4)  3/4: Problem with analysis jobs failing with the error "pilot: CMTCONFIG is not available on the local system: NotAvailable (required of task: i686-slc5-gcc43-opt)" apparently due to missing atlas s/w (15.6.5.3), and not site issues.
    5)  3/5: BNL -- transfer errors ("failed to contact on remote SRM[httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]").  Issue resolved -- from Pedro:
    Problem has been solved.  SRM has been enabled.  eLog 10063.
    6)  3/5: IllinoisHEP -- jobs were failing due to a missing dataset, which was subsequently restored.  RT 15627, eLog 10079.
    7)  3/6: BNL -- transfer errors between BNL-OSG2_MCDISK & BNL-OSG2_DATADISK.  Issue resolved.  ggus 56229, RT 15636, eLog 10124.
    8)  3/6: MWT2_IU -- jobs were failing with an error indicating a problem with an atlas s/w release.  Problem tracked down to a misconfigured worker node (iut2-c141), which was re-installed.  Issue resolved.  ggus 56230, RT 15637, eLog 10113.
    9)  3/6: BNL -- nagios alerts indicated a problem with some gatekeepers.  Issue tracked down by John Hover:
    Problem with the pool monitors on the F5 for the GUMS servers. They were configured to try a
       GET /gums/testMapGridIdentity.jsp
    and look for a
       OK
    in return.
    For some reason, that was failing and the F5 set both back-ends offline.
    Rather than dive deeper, I switched the monitors to just confirm the TCP connection on port 8443 (a less strict check). This got things going again. We can run in this mode indefinitely until we can look into what went wrong with the strict monitor probes. 
    Xin, Tomasz and I will take up that issue with a smaller CC list.
    10)  3/8: UTA_SWT2 -- jobs were not moving to 'activated' state due to problem with transferring input file:
    [SE][srmRm][SRM_FAILURE] Error:sudo: sorry, you must have a tty to run sudo
    Issue was a misconfiguration in the SRM interface.  Problem resolved.  ggus 56245, RT 15643, eLog 10148.
    11)  3/8: BNL -- transfer errors to / from BNL-OSG2_DATADISK, BNL-OSG2_DATATAPE, BNL-OSG2_MCDISK like:
    [FTS] FTS State [Failed] FTS Retries [1] Reason [TRANSFER error during TRANSFER phase: [SECURITY_ERROR] globus_ftp_client: the server responded with an error535 Authentication failed: GSSException: Defective credential
    detected [Caused by: [Caused by: Bad sequence size: 4]]]
    From Hiro:
    It has been fixed by restarting the problematic gridftp server.
    12)  3/9: IllinoisHEP -- jobs failing with the error "!!WARNING!!3000!! Trf setup file does not exist at:
    /home/osgstore/app/atlas_app/atlas_rel/15.6.3/AtlasProduction/15.6.3.10/AtlasProductionRunTime/cmt/setup.sh
    Xin installed the missing s/w -- issue resolved. 
    
    Follow-ups from earlier reports:
    (i)  New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here:
    https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
    
    
  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=88132
    
    1)  3/10: Transfer errors at MWT2_IU, "MWT2_IU_LOCALGROUPDISK failed to contact on remote SRM," was due to a dCache restart, therefore a transient problem.  eLog 10282, ggus 56360, RT 15687.
    2)  3/11 - 3/13: MWT2_UC -- problems with atlas s/w releases 15.6.3 & 15.6.6.  Jobs were failing with errors like:
    ImportError: /usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by
    /share/osg/app/atlas_app/atlas_rel/15.6.3/sw/lcg/app/releases/ROOT/5.22.00h/i686-slc5-gcc43-opt/root/lib/libPyROOT.so)
    Issue resolved -- Xin re-installed the releases.  eLog 10309, RT 15685, ggus 56346.
    3)  3/12: BNL -- transfer errors due to a bad credential on one of the grid FTP doors.  Issue resolved.  eLog 10363.
    4)  3/12: Site UTA_SWT2 was set off-line to upgrade SL to v5.4 and ROCKS.  dq2 site services migration to BNL also occurred during this outage.  3/15: test jobs completed successfully, back to on-line.  eLog 10447.
    5)  3/12: From Shawn at AGLT2:
    We had a routing incident with our Solaris node. File transfers from 16:42 until 17:10 were impacted.  Should be OK now.
    6)  3/13: Transfer errors at UTA_SWT2_HOTDISK -- problem understood, from Patrick:
    The mapping errors arose from an unstable NIC on the GUMS host. The "unable to connect" errors were due to rebooting the SRM and GUMS hosts to accommodate updated NIC driver configurations.  RT 15715, ggus 56427, eLog 10378, 79.
    7)  3/14: AGLT2 -- transfer errors from T0 ==> AGLT2_CALIBDISK.  Issue was a h/w problem on a dCache storage node (UMFS07.AGLT2.ORG).  Issue resolved with Dell tech support.  ggus 56434, RT 15717, eLog 10409, 25.
    8)  3/14:  Transfer errors between BNL-OSG2_USERDISK (src) and MWT2_IU_LOCALGROUPDISK (dest).  Issue understood -- from Michael:
    Experts in the US have investigated the issue and found that it is caused by modifications in the FTS timeout settings in conjunction with no data flowing while a transfer is in progress.  More details in eLog 10420.
    9)  3/15: New releases of DQ2 Central Catalogs.  Details here: https://savannah.cern.ch/support/?113291 (Note: looks like a permission problem with this web page?)
    10)  3/16: BNL -- US ATLAS conditions db maintenance completed.  No user impact.
    11)  3/16: From Wei at SLAC:
    Quite a few jobs failed to write to our storage. It was due to a bug in the particular version of xrootd we are using at SLAC. It is now fixed.
    12)  3/16: File transfer problems between BNL & NET2.  From Saul:
    One of our main file systems at NET2 is behaving badly right now and writing speeds are down to 30-60 MB/sec.  That's very likely why things are getting backed up and timing out.  We don't know what's going on yet, but are investigating.  
    RT 15735, eLog 10484.
    
    Follow-ups from earlier reports:
    (i)  New calendar showing site downtimes for all regions (EGEE, NDGF, OSG) is here:
    https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasGridDowntime
    (ii)  3/3: Consolidation of dq2 site services at BNL for the tier 2's by Hiro beginning.  Will take several days to complete all sites.  ==> Has this migration been completed?
    

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Site naming convention
    • FTS 2.2 update status
    • DQ2 SS consolidation status
    • Tier 3 issues
  • this meeting:
    • DQ2 logging has a new feature - errors are now reported. Request: to be able to search at the error level.
    • Will be adding link for FTS log viewing.
    • FTS channel configuration change for the data-flow timeout. The new FTS has an option for terminating stalled transfers: the default timeout for an entire transfer is 30 minutes, which wastes the channel on a failed transfer; with the new setting, a transfer with no progress in the first 3 minutes is terminated. Now active for all T2 channels (a sketch of the stall-detection logic follows this list).
      • If there is no progress (bytes transferred) during a 180-second window, the transfer is cancelled. (A transfer marker is sent every 30 seconds.) A page with all the settings is being prepared.
      • Have observed some transfers being terminated.
      • BNL-IU problem - fails for small files when writing directly into pools. All sites with direct transfers to pools are affected - it's a GridFTP2 issue.
      • Log files and ROOT files - files a few hundred kilobytes in size.
      • In the meantime, BNL-IU is not using GridFTP2.
      • dCache developers are being consulted - a new dCache adapter may be needed.
    • DQ2 SS consolidation done except for BU - a problem with checksums.
    • Need to update Tier 3 DQ2. Note: Illinois is working.
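    • For illustration, a minimal sketch of the no-progress timeout described above: the 30-second marker interval, 180-second stall window and 30-minute overall cap come from the discussion, but the code and the callable names are hypothetical, not FTS's implementation.
      # Illustrative sketch of the stall-detection logic described above.
      # Not FTS code; the callables are hypothetical hooks into the transfer.
      import time

      MARKER_INTERVAL = 30    # a progress marker (byte count) arrives every 30 s
      STALL_WINDOW = 180      # cancel if the byte count does not advance for 180 s
      OVERALL_TIMEOUT = 1800  # previous behaviour: 30-minute cap on the whole transfer

      def monitor_transfer(is_done, get_bytes_transferred, cancel):
          """Watch one transfer; cancel it if it stalls or exceeds the overall cap."""
          start = time.time()
          last_change = start
          last_bytes = 0
          while not is_done():
              time.sleep(MARKER_INTERVAL)
              now = time.time()
              moved = get_bytes_transferred()
              if moved > last_bytes:          # progress seen since the last marker
                  last_bytes, last_change = moved, now
              if now - last_change >= STALL_WINDOW:
                  cancel()
                  return "cancelled: no progress for %d s" % STALL_WINDOW
              if now - start >= OVERALL_TIMEOUT:
                  cancel()
                  return "cancelled: exceeded %d s overall" % OVERALL_TIMEOUT
          return "completed"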

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • Focus was on perfSONAR - a new release is due out this Friday; Jason: no show-stoppers. All sites should upgrade next week.
    • The release fixes bugs identified at our sites - hope it is resilient enough for the T3 recommendations.
    • Next meeting Feb 9
  • this week:
    • Minutes:
      
      
    • No meeting this week. Next meeting next week.
    • A couple of sites are having issues with the perfSONAR install - developers are investigating.

Site news and issues (all sites)

  • T1:
    • last week(s): Ongoing issue with Condor-G - incremental progress has been made, but new effects have been observed, including a slow-down in job throughput. Working with the Condor team, some fixes were applied (new condor-q) which helped for a while; decided to add another submit host to the configuration. New HPSS data movers and network links.
    • this week: Long-awaited DDN equipment has arrived: 2 PB (fully populated 9900, 1200 drives, 2 TB each). OpenSolaris and ZFS. Four head nodes. Dell R710 servers sit in front of this array. Had to add an FC switch. Pedro and the storage management group have a new lsm with callbacks for staging - to be integrated into production carefully. Xin has configured an ITB queue for Panda jobs; tested for get, close to completing tests for put operations. C6100 evaluation, 4 motherboards (R410); pricing doesn't look too encouraging (yet).

  • AGLT2:
    • last week: Failover testing of WAN, all successful. On Tuesday will be upgrading dCache hosts to SL5.
    • this week: Running well. Lustre to replace NFS.

  • NET2:
    • last week(s): Upgraded the gatekeeper host at BU to SL5 and OSG 1.2.6; HU in progress. LFC upgraded to the latest version.
    • this week: Problem with GPFS slowness, investigating. Production running on new Nehalem nodes; WLCG reporting work is needed.

  • MWT2:
    • last week(s): Completed upgrade of MWT2 (both sites) to SL 5.3 and new Puppet/Cobbler configuration build system. Both gatekeepers at OSG 1.2.5, to be upgraded to 1.2.6 at next downtime. Delivery of 42 MD1000 shelves.
    • this week: Work proceeds on deployment of 1 PB of storage. Systems racked, cabling started. Electrical work continues - will need to schedule a downtime to re-arrange UPS power between dCache nodes. Working on a distributed xrootd testbed between IU and UC (ANALY_MWT2_X). PNFS trash feature enabled - fixed PNFS orphans. Making Python bindings to the xrootd library - accessing xrootd functions directly (see the sketch in the Integration program update above). Will be generally usable.

  • SWT2 (UTA):
    • last week: Extended downtime Thursday/Friday to add new storage hardware plus the usual software upgrades; hope to be done by the end of the week.
    • this week: SL 5.4 with Rocks 5.3 complete. SS transitioned to BNL. Issues with transfers failing to BNL; there may be an issue with how checksums are being handled. 400 TB of storage being racked and stacked. Looking into ordering more compute nodes.

  • SWT2 (OU):
    • last week: Started receiving more equipment from the storage order; continue to wait for hardware.
    • this week: Waiting on an order of 23 compute nodes from Dell, which will provide 456 cores.

  • WT2:
    • last week(s): ATLAS home and release NFS server failed; will be relocating to temporary hardware.
    • this week: All is well. Storage configuration changed - no longer using the xrootd namespace (CNS service).

Carryover issues (any updates?)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
    • BNL has an lsm-get implemented and is just finishing the test cases [Pedro] (a hedged stub illustrating the calling convention follows this list)
  • this week if updates:
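  • For illustration, a minimal lsm-get stub, not BNL's implementation: it assumes the pilot invokes lsm-get <source> <destination> and treats exit code 0 as success, and uses dccp as a stand-in local copy tool; the real interface, error codes and checksum handling are defined in the LocalSiteMover specification.
    #!/usr/bin/env python
    # Minimal illustrative lsm-get stub -- not BNL's implementation. Assumes the
    # pilot calls "lsm-get <source> <destination>" and treats exit code 0 as
    # success; see the LocalSiteMover specification for the real interface.
    import subprocess
    import sys

    def lsm_get(source, destination):
        """Copy one input file from site storage to the worker node using a
        site-local tool (dccp here, as a dCache example)."""
        rc = subprocess.call(["dccp", source, destination])
        return 0 if rc == 0 else 1  # 0 = success for the calling pilot

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            sys.stderr.write("usage: lsm-get <source> <destination>\n")
            sys.exit(1)
        sys.exit(lsm_get(sys.argv[1], sys.argv[2]))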

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Preliminary capacity report was working.
  • this meeting
    • The report is complete - an email goes out every Tuesday.
    • AGLT2 is the only site that is compliant in terms of reporting HEP-SPEC06 (HS) correctly. OIM is likely out of date.
    • Once the sites have completed their updates, Karthik will check.
    • Karthik will send a reminder. (A sketch of the underlying capacity arithmetic follows below.)
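    • For illustration, the arithmetic behind the published capacity numbers: installed HEP-SPEC06 is the sum over node types of nodes x cores x HS06-per-core. The node counts and per-core benchmark values below are hypothetical placeholders, not any site's real figures.
      # Minimal sketch of the WLCG capacity arithmetic: installed HEP-SPEC06 =
      # sum over node types of (node count * cores per node * HS06 per core).
      # All numbers below are hypothetical placeholders.
      node_types = [
          # (description, node_count, cores_per_node, hs06_per_core)
          ("older dual quad-core nodes", 50, 8, 8.0),
          ("newer Nehalem nodes", 20, 8, 10.5),
      ]

      installed_hs06 = sum(n * cores * hs06 for _, n, cores, hs06 in node_types)
      print("Installed capacity to publish in OIM: %.0f HEP-SPEC06" % installed_hs06)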

AOB

  • last week
  • this week
    • none


-- RobertGardner - 16 Mar 2010
