


Minutes of the Facilities Integration Program meeting, May 29, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode:
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”


  • Meeting attendees: Bob, Michael, Fred, Saul, John, Sarah, Dave, Armen, Shawn, Wei, Ilija, Mark S, Horst, John Hover, Mark N, Doug
  • Apologies: Jason (travel), Kaushik
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • CapacitySummary - please update v27 in google docs
      • Program this quarter: SL6 migration; FY13 procurement; perfsonar update; Xrootd update; OSG and wn-client updates; FAX
      • WAN performance problems, and general strategy
      • Introducing FAX into real production, supporting which use-cases.
      • v28 of the spreadsheet can be communicated to John Weigand
    • this week
      • Phase 25 site certification matrix: SiteCertificationP25
      • SL6 validation and deployment (see below)
      • FY13 procurements - note we have been discussing with ATLAS computing management re: CPU/storage ratio. Guidance is to emphasize CPU in this round.

FY13 Procurements

  • Guidance is for CPU
  • Share benchmarks
  • Let's have one expert assigned by each Tier 2 to collect relevant information on benchmarks. Michael notes significant performance differences depending on configuration.
  • Rob will send a note to each of the Tier 2 contacts to join a group.

Facility storage deployment review

last meeting(s):
  • Tier 1 DONE
  • WT2: DONE
  • NET2: DONE
  • SWT2_OU: 120 TB installed. DONE
  • MWT2: 3.7 PB now online DONE
  • SWT2_UTA: One of the systems is built and deployable; good shape, a model for moving forward. Will need to take a downtime - but will have to consult with Kaushik. Should be ready for a downtime in two weeks. If SL6 is a hard requirement, will have to adjust priority.
this meeting:

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting(s)
  • June 1 is the milestone date to get all sites on.
  • BNL DONE, AGLT2 DONE, MWT2 2/3 DONE (MWT2_IU needs action, see below).


  • Updates?
  • OU - status unknown.
  • UTA - conversations with LEARN, UTA, I2 are happening. There has been a meeting. They are aware of the June 1 milestone.
  • NET2 - new 10g link is setup. 2 x 10 g to HU. Chuck is aware of the June 1 LHCONE milestone. Saul will follow-up shortly, expects no problem by June 1.
  • IU - plan is to decide Friday whether we need to bypass the Brocade and access the Juniper directly to peer with LHCONE. Fred is working closely with the engineers.
this meeting
  • Updates?
  • References

The transition to SL6

last meeting(s)
  • All sites - deploy by end of May
  • Shuwei validating on SL6.4; believes ready to go. BNL_PROD will be modified - in next few days 20 nodes will be converted. Then the full set of nodes.
  • At MWT2 will use UIUC campus cluster nodes; will start on this tomorrow.
  • Doug - provide a link from the SIT page. Notes prun does compilation.
  • NET2: concerned about the timescale. Will have a problem meeting the deadline.
  • At BNL, the time required was about 4 hours.
  • WT2: have a new 3,000 node cluster coming up as RHEL6 - ATLAS may have access to. Timing issues for the shared resources. ATLAS dedicated resources can be moved to SL6, though.
  • AGLT2: concern is ROCKS, glued to SL5. Simultaneous transition to Foreman and Puppet.
  • OU: transition from Platform to Puppet. OSCER cluster is RHEL6 (though job manager issues).
  • UTA: cannot do it before June 1. ROCKS, plus heavily involved with storage and networking, limiting the time.
  • Bob - discussed last week, no way to be ready with Puppet and Foreman. Decided to go back to ROCKS SL6 server. (Will transition to puppet later this summer, more smoothly)
  • UTA - no time in the past two weeks to look into it.
  • Issue is ROCKS doesn't recognize SL6.
  • NET2 - will try.

this meeting

Updates from the Tier 3 taskforce?

last meeting
  • Fred - Tier 3 institutes have been surveyed, about 1/2 have responded. In general, people are happy with local resources.
  • Report is due by July
  • Doing testing of Tier 3 scenarios using grid or cloud resources
  • Working with AGLT2 as a test queue.

this meeting

  • Managed to get surveys from every Tier 3 site. Writing assignments will be set up for the final report.
  • Half the community does not have resources on their campus.
  • Need to solve the data handling problem for local resources, i.e., a fully supported DDM endpoint; gridftp-only endpoints were never fully supported.
  • Survey report will be available in two weeks

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Running smoothly
    • Pilot update from Paul
    • Large number of waiting jobs from the US - premature assignment?
    • Following-up with PENN: email got no response. Paul Keener reported an auto-update to his storage; reverted back to the previous version (March 28). Transfers are now stable at the site, and the ticket has been closed.
    • Discussion about site storage blacklisting. It's essentially an automatic blacklisting. Discussed using atlas-support-cloud-us@cern.ch. The problem is what to do with the Tier 3s. Doug will make sure the Tier 3 sites have correct email addresses in AGIS.
  • this meeting:
    • There has been a lack of production jobs lately
    • From ADC meeting - new software install system; Alastaire's talk on production / squid.
    • SMU issues - there has been progress
    • ADC operations will be following up on any installation issues.

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  5/16: NET2 - jobs failing with "installPyJobTransforms failed" error.  Possibly related to recent AGIS change (see iv in follow-up section below).  Errors stopped, so closed 
    https://ggus.eu/ws/ticket_info.php?ticket=94129.  eLog 44232.  (Similar errors were seen in other clouds, but no updates to
    2)  5/18: BNL-OSG2_DATADISK: "source-file-does-not-exist" DDM transfer errors - not a site issue, as the dataset replica at BNL had previously been deleted.  
    https://ggus.eu/ws/ticket_info.php?ticket=94189 was closed, eLog 44260.
    Follow-ups from earlier reports:
    (i)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, eLog 43743.
    Update 4/24: site admin has requested a production role in order to be able to more effectively work on this issue. 
    (ii)  4/30: SMU file transfer failures ("[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]").  
    https://ggus.eu/ws/ticket_info.php?ticket=93748 in progress, eLog 44035.
    (iii)  5/10: SWT2_CPB - user reported a problem while attempting to transfer data from the site.  Likely related to the user's certificate / CA (similar problem has been seen in the 
    past for a few CA's).  Under investigation - https://ggus.eu/ws/ticket_info.php?ticket=93976.
    Update 5/22/13: Awaiting reply from the user.
    (iv)  5/14: BU_ATLAS_Tier2: frontier squid is down (see: http://dashb-atlas-ssb.cern.ch/dashboard/request.py/sitehistory?site=BU_ATLAS_Tier2#currentView=Frontier_Squid).  
    https://ggus.eu/ws/ticket_info.php?ticket=94054 in-progress, eLog 44197.
    Update 5/17: AGIS needed to be updated with the new server info (atlas.bu.edu => atlas-net2.bu.edu).  Issue resolved - closed ggus 94054. eLog 44237.

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    Not available this week
    1)  5/22 p.m.: BU_ATLAS_Tier2 - jobs failing heavily with errors like "Failed to connect to service geomDB/geomDB_sqlite." From Saul: Our usual globus cleanup crons were not 
    running post the move on the BU side (HU is not having a problem).  If I remember correctly, this can cause very non-obvious looking errors. I'm restarting all of that now. 
    This seemed to solve the problem. eLog 44320.
    2)  5/22 late p.m.: BNL - jobs were failing on the WN acas1542.usatlas.bnl.gov, mostly with the error "Too little space left on local disk to run job." From John Hover: I logged 
    in and deleted a bunch of defunct job sandboxes. We'll look into why the normal cleanup mechanisms didn't work.
    3)  5/22 late p.m.: SWT2_CPB - jobs were failing on several WN's with the error "No space left on local disk." Issue understood:  The errors were occurring primarily on some older 
    WN's which have small internal hard drives.  The production jobs which fail require relatively larger input datafiles, hence the errors.  The number of job slots on these WN's was 
    reduced by one.  This should give more local scratch space to the remaining jobs. eLog 44323.
    4)  5/23 early a.m.: SLACXRD - jobs failing on WN hequ009 with errors indicating a CVMFS problem.  Wei removed the node from production - issue resolved.
    5)  5/23: New pilot release from Paul (v57c).  More details:
    6)  5/23: MWT2 - Rob reported that jobs were accidentally deleted at the MWT2 and ANALY_MWT2 sites during a condor reconfiguration, resulting in a large number of "lost heartbeat" 
    job failures.  eLog 44325.
    7)  5/24:  SLACXRD_USERDISK DDM deletion errors ("Query string contains invalid character").  https://ggus.eu/ws/ticket_info.php?ticket=94327 was closed, since this is a known 
    issue with the deletion service.  (Site is requested to delete the files locally, and the deletion request was canceled.) eLog 44370.
    8)  5/24-5/27: HammerCloud alerts were incorrectly sent out to US cloud support (among others) with the subject "Missing datasets for PFT tests." Federica Legger reported that this 
    was a known problem with the HC system, and it would be fixed.
    9)  5/27: AGLT2 - a full database partition was causing file transfer errors (SRM).  Space was added, this fixed the problem. 
    https://ggus.eu/ws/ticket_info.php?ticket=94357 was closed, elog 44369.
    10)  Monday 1:00PM EDT May 27 - Wednesday 7:00PM EDT May 29: BNL maintenance - see eLog 44358.  (As part of this maintenance the US tier-2 LFC was off-line for ~three hours 
    on 5/28.  ANALY_ queues were manually set off-line during this period. eLog 44362, http://savannah.cern.ch/support/?137840.)
    Follow-ups from earlier reports:
    (i)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, eLog 43743.
    Update 4/24: site admin has requested a production role in order to be able to more effectively work on this issue. 
    (ii)  4/30: SMU file transfer failures ("[SECURITY_ERROR] not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management]").  
    https://ggus.eu/ws/ticket_info.php?ticket=93748 in progress, eLog 44035.
    (iii)  5/10: SWT2_CPB - user reported a problem while attempting to transfer data from the site.  Likely related to the user's certificate / CA (similar problem has been seen in the past for 
    a few CA's).  Under investigation - https://ggus.eu/ws/ticket_info.php?ticket=93976.
    Update 5/22/13: Awaiting reply from the user.

Data Management and Storage Validation (Armen)

  • Reference
  • last meetings(s):
    • No major issues
    • AGLT2 groupdisk needs space
    • USERDISK cleanup will happen this week
    • Wants to check with Saul about deletion rates, which are unchanged after the move. Will increase the chunk size parameter; 60-80 files is the default, while currently NET2 = 10. Armen believes there is a bottleneck.
    • Saul was able to reproduce the problem - dropped connections - which depends on location of client. Where does the deletion service client run? Armen: there are dedicated machines at CERN.
    • Dark data - mainly has been in USERDISK. Cleaned all datasets prior to 2013. Sites: please check sites for dark data.
    • Armen - will provide a dark data summary for next meeting.
  • this meeting:
    • NET2 - increased "chunk size" from 10 to 30; the deletion rate went up to 1.5-2 Hz. This is good.
    • New USERDISK cleanup tasks just submitted
    • NET2- needs a PRODDISK cleanup
    • GROUPDISK quotas: discussion with central ADC, Higgs went over 40 TB, etc. New quotas are published by Ueda; do we agree?
    • Get usage info for each area from central ATLAS.
    • Dark data inventory
    • Rob will send an email around
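The chunk-size discussion above can be illustrated with a toy throughput model (a sketch only: the per-request and per-file costs below are assumed numbers for illustration, not measurements of the actual deletion service). If each bulk deletion request from the client at CERN pays a fixed round-trip overhead plus a per-file cost, the effective deletion rate grows as the chunk size amortizes that overhead.

```python
# Toy model of bulk-deletion throughput. All constants are assumptions
# chosen for illustration; they are not measured service parameters.
def deletion_rate_hz(chunk_size, per_file_s=0.2, per_request_s=5.0):
    """Files deleted per second when files are removed in bulk requests
    of `chunk_size`, each request paying a fixed setup cost."""
    return chunk_size / (per_request_s + chunk_size * per_file_s)

small = deletion_rate_hz(10)  # chunk size before the NET2 change
large = deletion_rate_hz(30)  # chunk size after the NET2 change
# Larger chunks amortize the fixed per-request overhead, so large > small,
# consistent with the observed rate increase when NET2 went from 10 to 30.
```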

DDM Operations (Hiro)

  • this meeting:
    • LFC dump page is still stuck - Hiro has been busy.

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Sites should prepare for perfsonar rc3
    • 10g boxes up and running at UTA; will run at 3.2.2, which is stable.
  • this meeting:
    • See notes from yesterday
    • Final release of perfsonar 3.3 likely within a week (this is rc4). Will request all sites to update. Option of preserving data or a clean install. A few things to do with OSG service registration.
    • Work continues on the dashboard.
    • Few things being developed for alerts.
    • Transfers to OU and UTA in-bound are doing better; but there are problems on some paths coming in.
    • LHCONE - SWT2 problems during the switch-over. Asymmetric routes with CERN were created - had to back the changes out.
    • Michael: Some complaints about the dashboard. Distinguishing real problems with false "reds". Shawn: most of the red issues are known problems with version 3.2.
    • Info on client tools will be provided. Doug wants something specific for Tier 3s. Logical set is US ATLAS Tier 2s and Tier 1.

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei - Gerri packaged the VOMS security module into an rpm; works at SLAC and with DPM sites. Once Gerri has a permanent location for it, he will create a set of detailed instructions.
  • Ilija has requested a central git repo at CERN for FAX. Can the WLCG task force provide a point of coordination?
  • Ilija - doing FDR testing. Also testing using the Slim Skim Service. Seeing nice results for file transfers to UC3; now seeing 4 GB/s. How is the collector holding up?
this week
  • Wei: email sent to certain sites to upgrade GSI security. dCache sites already have it. Working with the Spanish Tier 1.
  • Ilija: enabled allow-fax true at MWT2 and AGLT2. No problems seen. Will look at statistics.
  • Xrootd upgrade issues - have xrootd.org and OSG and EPEL. Wei will discuss with Lucasz.

Site news and issues (all sites)

  • T1:
    • last meeting(s): looking at next gen network switches; equipment for the eval - Cisco, Extreme, Arista; discovered issues with bw/stream (not exceeding 10g, e.g.). Will get a unit in the June/July timeframe. CPU procurement: Dell won the bid (over HP, Oracle, IBM). Two Sandy Bridge, 2.3 GHz, 64 GB mem, 8x500 GB = 4 TB ($5k). Tested extensively.
    • this meeting: Final stage of completing the network upgrade. Then DB backend for LFC and FTS to Oracle 11g, and dCache upgraded to 2.2.10. Farm on SL6. DDM enabled. SL6 WNs running HC jobs - soon to open up for regular production and analysis. PandaMover being brought up - so as to resume production at the Tier 2s. Hopefully finished by 7 pm. Buying new compute nodes - delivery date of June 10, 90 worker nodes, to be in production a few days after (25-30 kHS06). In collaboration with ESnet on a 100g transatlantic demo; preparing a link between BNL and MANLAN in NYC. On the European end, following the TERENA conference, extend the Amsterdam to CERN link.

  • AGLT2:
    • last meeting(s): Running well. Working with ROCKS6, as noted above. Setting up T3 test queue this afternoon. 40 cores (5 PE1950s). There is a Twiki page.
    • this meeting: Working hard getting ready for SL6. Test production jobs run just fine. User analysis jobs are failing, however; the cause is unclear. VMWare systems at MSU and UM - new door machine at MSU configured and running - then will update pool servers to SL6.

  • NET2:
    • last meeting(s): Big move successful! DDM up, BU is up; the only major problem was HW for the HU gatekeeper - may need to switch. A few disks were damaged. T3 is working. HU will come back online very soon. Have a lot of funds for expansion; storage needs replacement; need to replace old IBM blades.
    • this week: Release issue as mentioned - unresolved problem. Unintentional update of CVMFS on HU nodes. Michael: since HU and BU are in the same data center, why not unify? The only reason was to minimize changes in the move; might do this in the future. HC stress test?

  • MWT2:
    • last meeting(s): uc2-s16 online, now at 3.7 PB capacity. Network activities at IU. UIUC - campus cluster down today for GPFS upgrade. Building SL6 nodes, puppet rules in place, nodes deployed at UIUC and UC.
    • this meeting: Regional networking meeting next week, AGLT2+MWT2. Illinois CC GPFS issues last week caused by faulty Infiniband component. New compute nodes online with SL6. Puppet rules setup for IU and UC nodes. Networking issues at IU: Sunday changed network to backup 10g link, re-enabled virtual router in the Brocade router for LHCONE. However, checksum errors returned. 191/600K transfers. People are trying to understand the source.

  • SWT2 (UTA):
    • last meeting(s): Been busy. Perfsonar 10g up and running, performing well; still need to get monitoring straightened out. Identified an issue in the campus network related to the gateway switch, dropping packets; seems to be responsible for throughput issues to UTA. Hiro's tests go up to 300 MB/s download speeds. Evaluating that switch and others to improve performance; looking at the F10 S4810. Testing today and tonight. Diagnosing issues for pulling data from IU - 1% checksum errors (not seen at UC): 0 errors out of 1200 at UC, 40 errors out of 1400 from IU. 3660 deployment coming along - moved back to the SL6.3 kernel (Dell problems with SL6.4).
    • this meeting: Loading SL6 and ROCKS - has a solution for this, isolating ISOs. Will bring up head node, and start bringing configurations forward. Malformed URI's from deletion service.

  • SWT2 (OU):
    • last meeting(s): All is fine. SL6 won't happen until Horst returns from Germany by July 1. Does expect to have SL6 jobs running on OSCER by June 1.
    • this meeting: Some problems with high-memory jobs - the result has been crashed compute nodes. Condor is configured to kill jobs over 3.8 GB. These are production jobs. No swap? Very little.
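A policy like the one described (killing jobs that exceed ~3.8 GB rather than letting them exhaust a nearly swap-less node) can be expressed as a condor_config fragment along these lines. This is a hedged sketch, not OU's actual configuration; the macro name and exact threshold are assumptions.

```
# Hypothetical condor_config fragment (illustrative only).
# ResidentSetSize is reported by HTCondor in KiB; remove any job whose
# resident memory exceeds ~3.8 GB so it cannot crash a low-swap node.
MEMORY_LIMIT_KB = 3800000
SYSTEM_PERIODIC_REMOVE = (ResidentSetSize > $(MEMORY_LIMIT_KB))
```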

  • WT2:
    • last meeting(s): No burning issues; future batch system will be LSF 9.1. Testing.
    • this meeting: Preparing for SL6 migration. Have CVMFS running on an SL6 machine; running test CE, and its working with LSF 9.1.


last meeting / this meeting
  • None.

-- RobertGardner - 28 May 2013
