


Minutes of the Facilities Integration Program meeting, April 17, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode:
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”


  • Meeting attendees: Horst, Bob, Saul, Mark, Alden, Ilija, John, Kaushik, Armen, Hiro, Wei, Sarah, Dave, Doug
  • Apologies: Jason, Shawn, Mark
  • Guests:

Integration program update (Rob, Michael)

Supporting OASIS usability by OSG VOs

last meeting
  • See presentations at OSG All Hands
  • Still about a month away
  • Note this is CVMFS for OSG VOs
  • Fred will invite Scott to report in two weeks.

this meeting

  • OASIS presentation - postponed

Facility storage deployment review

last meeting(s):
  • Tier 1 DONE
  • WT2: DONE
  • MWT2: Network updates in place. Bringing remaining 500 TB online this week and next.
  • NET2: DONE
  • SWT2_UTA: Equipment received, storage racked. Network changes are being planned and coming along. A downtime will be taken to make the network changes.
  • SWT2_OU: Still on for April 8. Tuesday network configuration. Will add 120 TB.
this meeting:
  • Tier 1 DONE
  • WT2: DONE
  • MWT2: 3.3 PB now online. 180 TB remaining to complete the 1 PB upgrade.
  • NET2: DONE
  • SWT2_UTA: Equipment being installed. The network change has been postponed. A downtime to bring the storage online is being planned.
  • SWT2_OU: 120 TB installed. DONE

Supporting opportunistic usage from OSG VOs (Rob)

last meetings
  • AccessOSG
  • In particular, I would like to request all US ATLAS sites support the UC3 VO (see SupportingUC3).
  • UC3 VO support: MWT2, AGLT2 (working on it, updating GUMS server), NET2 (working on it), SWT2_UTA, _OU (will upgrade GUMS)
  • CMS VO support: MWT2, WT2 (only for debugging, due to lack of outbound connectivity), AGLT2, SWT2_UTA, NET2 (will do)
  • OSG VO support: MWT2, AGLT2, SWT2_UTA, NET2 (will do)

  • UC3 VOMS issue seen at AGLT2 - getting an access error. Elisabeth from the GOC is attempting to reproduce it. Fred can also help.
  • CMS - NET2 attempted, but had an auth issue.
  • OSG - NET2 will try again

this week

  • UC3 fully supported at AGLT2
  • NET2 - final stages of preparing for move to Holyoke. Delayed.
  • At BNL - preparing to use InCommon to support users. This might be the path forward, rather than using a static certificate valid for only a year. This will be used by US Snowmass users.
  • BNL, OU will support UC3.

Integration program issues

Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

last meeting
  • June 1 is the milestone date to get all sites on.
  • BNL, AGLT2, and 2 sites from MWT2 are on (MWT2_IU needs action).
  • NET2 - Holyoke to MANLAN is now connected. Tier 2 subnets will need to be routed. The Holyoke move is postponed. Saul expects the June 1 milestone will be met -- either at Holyoke, or in place at BU. What about Harvard? Will bring this up with HU networking.
  • SWT2 - both sites are not on LHCONE presently. OU goes through ONENET (which may connect to AL2S, the I2 system, riding layer 2 to MANLAN and peering there). For SWT2, LEARN - Kansas City may be the connection point. Dale Finkleson thought it could happen quickly. Patrick: this has not been discussed locally with UTA networking; concerned about how UTA traffic is handled. Shawn: I2 can bring up a peering in any of their PoPs, e.g. in Houston, if that is the best place to peer. Will talk with the network managers.
  • SLAC is already on LHCONE.
  • Fred - still struggling with Brocade at IU. Problem has been reproduced, testing.
  • SWT2_OU: still waiting to hear from I2 and ESnet. Zane Gray at OU is leading this.

this meeting

  • Kaushik: met last week with campus networking to connect via LEARN; the process has started. Michael: who are the players on the LHCONE side - e.g., on the Internet2 side, who is the VRF provider? Notes this is a configuration issue, separating campus traffic from LHC traffic. Kaushik notes that his campus networking team is working the issue. Michael: need to connect UTA campus people to Mike O'Conner and Dale Finkleson.
  • Horst: OU - will have a meeting tomorrow with LHCONE operations to hook up OUONE.
  • Fred: still struggling with Brocade firmware issues causing checksum errors. Testing setup at Indianapolis; the problem has been reproduced by Brocade. Fred will arrange a meeting with them early next week.

Deprecating PandaMover in the US ATLAS Computing Facility (Kaushik)

last meeting
  • No update.

this meeting

  • Kaushik: this will increase the load on DDM, so we need an internal discussion first, then a discussion with the DDM team.
  • Michael - PandaMover operations are opaque. We should join the mainstream. The current model works perfectly for all the other clouds, with little noise about these issues.
  • Kaushik: its need has not been revisited in a long time.
  • Rob: suggests moving one Tier 2 site off PandaMover and watching the effect.
  • Kaushik - will get started.
  • Hiro wants to keep PandaMover for tape staging, as he feels it's more efficient. Michael - doesn't feel the benefit is worth keeping PandaMover; notes tape staging happens rarely, only during specific campaigns.

The transition to SL6

last meeting
  • Discussed above.

this meeting

  • All sites - deploy by end of May
  • Shuwei is validating on SL6.4; believes it is ready to go. BNL_PROD will be modified: in the next few days 20 nodes will be converted, then the full set of nodes.
  • MWT2 will use UIUC campus cluster nodes; work starts tomorrow.
  • Need a twiki set up to capture details.
  • Doug - will provide a link from the SIT page. Notes that prun does compilation.
  • Expect all sites to participate and convert to SL6 as old clients will be disabled in June.
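
As context for the conversion, a minimal sketch of the kind of check a site could run to confirm its worker nodes report SL6; the node names are hypothetical and the ssh-based approach is an assumption, not a prescribed procedure:

    #!/usr/bin/env python
    # Minimal sketch: confirm worker nodes report Scientific Linux 6.
    # Node names are hypothetical; a site would substitute its own list.
    import subprocess

    NODES = ["wn001.example.edu", "wn002.example.edu"]  # hypothetical

    for node in NODES:
        try:
            # /etc/redhat-release carries the distribution string on SL hosts
            out = subprocess.check_output(
                ["ssh", node, "cat", "/etc/redhat-release"]).decode().strip()
        except subprocess.CalledProcessError:
            print("%s: unreachable or no release file" % node)
            continue
        print("%s: %s [%s]" % (node, out,
                               "OK" if "release 6" in out else "NOT SL6"))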

Evolving the ATLAS worker node environment

last meeting
  • Nothing to report this meeting.

this meeting

Virtualizing Tier 2 resources to backend ageing Tier 3's (Lincoln)

  • No update
  • What about the Tier 3 taskforce? The activity has two working groups: one will work on technical aspects, the other on analysis. One group will circulate a questionnaire to provide guidance.
  • It is a US specific committee.

Transition from DOEGrids to DigiCerts

last week
  • As of March 26 there is a list of sponsors.

this week

  • Michael: there have been issues with people applying for certs - it is taking too long. The process is being investigated - too many people are involved, and probably not the right people. Decision: responsibility for processing requests will be transferred from the ITD Help Desk into the Tier 1. Requests should then take less than an hour.
  • Doug has volunteered US ATLAS analysis support to help facilitate the process.

R&D HPC work (Doug)

  • Working on getting alpgen and sherpa simulation running on the HPC computing facility at Argonne; eventually G4. There are two leadership-class facilities at Argonne.
  • Mark: sounds like this would be a good focus topic for the Thursday meeting.
  • Plan on presenting this at the TIM meeting.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Running smoothly
    • Pilot update from Paul
    • Large number of waiting jobs from the US - premature assignment?
    • Following up with PENN: email got no response. Paul Keener reported an auto-update to his storage; reverted to the previous version (March 28). Transfers are now stable at the site, and the ticket has been closed.
    • Discussion about site storage blacklisting. It's essentially an automatic blacklisting. Discussed using atlas-support-cloud-us@cern.ch. The problem is what to do with the Tier 3s. Doug will make sure the Tier 3 sites have correct email addresses in AGIS.
  • this meeting:

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • Will need another USERDISK cleanup. Hiro will send an email.
    • PRODDISK needs to be cleaned at NET2 and MWT2.
    • The SRM values are reported incorrectly at SLAC and SWT2.
    • The SRM values for the SWT2 space tokens dropped, then came back, except for GROUPDISK. This appears to be transient behavior.
    • GROUPDISK loss at MWT2 - site report below.
    • NET2 deletion issue - deletion is still slow, and there is a reporting issue here as well; it is reported every day in Savannah, and there is an active discussion. Saul believes it is functional after lowering the chunk size to 10 (other sites use 80-120). The lcg-del command used by Tomas sometimes drops connections - about 1 out of 10 from BU. Next step: try duplicating the problem on a new machine. Dropped packets? (A reproduction sketch follows this list.)
  • this meeting:
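
Regarding the NET2 deletion issue above, a minimal reproduction sketch; the SURL, local test file, and lcg-cp/lcg-del invocations are illustrative assumptions (depending on the lcg_util version, catalog-bypass options may also be needed):

    #!/usr/bin/env python
    # Minimal sketch: copy in and delete a small test file repeatedly to
    # estimate the dropped-connection rate (~1 in 10 was seen from BU).
    # The SURL and local path are placeholders, not real NET2 endpoints.
    import subprocess

    SURL = "srm://se.example.edu:8443/srm/v2/server?SFN=/testdir/probe"
    TRIES = 50
    failures = 0

    for i in range(TRIES):
        subprocess.call(["lcg-cp", "--vo", "atlas", "file:/tmp/probe", SURL])
        # a non-zero exit on the delete counts as a dropped connection
        if subprocess.call(["lcg-del", "--vo", "atlas", SURL]) != 0:
            failures += 1

    print("dropped %d of %d deletions" % (failures, TRIES))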

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  4/4: BNL_ATLAS_RCF - known issue with job eviction.  https://ggus.eu/ws/ticket_info.php?ticket=93078 closed, eLog 43681.
    2)  4/4 early a.m.: AGLT2 source file transfer errors (SRM).  Issue understood - from Bob: a pgsql partition filled on head01 late the previous evening. The partition was resized 
    and head01 rebooted.  https://ggus.eu/ws/ticket_info.php?ticket=93077 closed, eLog 43683.  (A free-space check sketch follows this summary.)
    3)  4/5: BU_ATLAS_Tier2 - jobs were heavily failing on WN abc-d09.  The site was using the node for some network testing, so it was removed from production.
    4)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, 
    eLog 43743.
    5)  4/8: BU_ATLAS_Tier2 job failures (stage-in/out errors).  Saul reported the problem was fixed (GPFS issue).  https://ggus.eu/ws/ticket_info.php?ticket=93202 closed, 
    eLog 43763.
    Follow-ups from earlier reports:
    (i)  3/29:  NERSC - https://ggus.eu/ws/ticket_info.php?ticket=92768 was re-opened after the file transfer errors reappeared.  eLog 43629.
    Update 4/7: https://ggus.eu/ws/ticket_info.php?ticket=93162 was opened for failing transfers with the error " [GRIDFTP_ERROR] an end-of-file was reached 
    globus_xio: An end of file occurred (possibly the destination disk is full)]."  Not a site issue, but rather an issue to be addressed by local users.  ggus ticket was closed, 
    eLog 43738.
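
Regarding item 2) above (the pgsql partition that filled on head01), a minimal sketch of the kind of free-space check that could catch this earlier; the mount point and threshold are assumptions, not AGLT2's actual setup:

    #!/usr/bin/env python
    # Minimal sketch: warn when a partition (e.g. the pgsql area on a dCache
    # head node) runs low on space. Mount point and threshold are assumptions.
    import os, sys

    MOUNT = "/var/lib/pgsql"   # hypothetical mount point
    THRESHOLD = 0.10           # warn below 10% free

    st = os.statvfs(MOUNT)
    free_frac = float(st.f_bavail) / st.f_blocks
    if free_frac < THRESHOLD:
        sys.stderr.write("WARNING: %s only %.1f%% free\n" % (MOUNT, 100 * free_frac))
        sys.exit(1)
    print("%s: %.1f%% free" % (MOUNT, 100 * free_frac))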

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  4/11: ADC switched over to a new set of Analysis Functional Tests (AFTs).  See:
    2)  4/11: SLACXRD - file transfers heavily failing with "cannonical path" and SRM errors.  Issue resolved later that day.  https://ggus.eu/ws/ticket_info.php?ticket=93293 
    closed - eLog 43840. 
    https://ggus.eu/ws/ticket_info.php?ticket=93378 was opened on 4/16 for SRM errors at the site.  Issue was quickly resolved.  eLog 43867.
    3)  4/14: BNL - from Michael: At BNL we observed massive job failures in production and analysis due to problems accessing geometry data.
    Issue possibly due to a central services glitch that impacted the software installation system.  More details here:
    4)  4/14: UPENN file transfer errors ("[DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact
    on remote SRM [httpg://srm.hep.upenn.edu:8443/srm/v2/server]").  Restarting BeStMan fixed the problem.  https://ggus.eu/ws/ticket_info.php?ticket=93326 closed, 
    eLog 43860.
    Follow-ups from earlier reports:
    (i)  3/29:  NERSC - https://ggus.eu/ws/ticket_info.php?ticket=92768 was re-opened after the file transfer errors reappeared.  eLog 43629.
    Update 4/7: https://ggus.eu/ws/ticket_info.php?ticket=93162 was opened for failing transfers with the error " [GRIDFTP_ERROR] an end-of-file was reached globus_xio: 
    An end of file occurred (possibly the destination disk is full)]."  Not a site issue, but rather an issue to be addressed by local users.  ggus ticket was closed, eLog 43738.  
    https://savannah.cern.ch/support/?136950 (Savannah site exclusion).
    Update 4/14: Recent file transfers to NERSC are succeeding, so ggus 92768 was closed (no details) - eLog 43850.
    (ii)  4/7: Transfer errors with Tier-3 site SMU as the source ("[USER_ERROR] source file doesn't exist]").  https://ggus.eu/ws/ticket_info.php?ticket=93166 in progress, 
    eLog 43743.

DDM Operations (Hiro)

  • this meeting:
    • There was a meeting this morning to discuss the slow transfers on April 3. The cause was the Panda server issuing many more subscriptions than normal. This is not a typical situation, and it indicates where development effort is needed.

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Release by March, with 10G support. Goal is to deploy across the facility by the end of March.
    • NET2 - CERN connectivity - has it been improved?
    • LHCONE connectivity for NET2 and SWT2 - timeline?
    • Prepare with discussions at NET2, even if the setup will come with the move to Holyoke; get organized. The move won't happen before the end of March. The WAN networking at Holyoke is still not well defined. Start a conversation about bringing in LHCONE.
    • rc2 for perfsonar; rc3 next week; sites should prepare to upgrade.
    • Network problems seen generally for westbound traffic.
  • this meeting:

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Wei - Gerri packed the VOMS security module into an rpm; it works at SLAC and with DPM sites. Once Gerri has a permanent location for the rpm, he will create a set of detailed instructions.
  • Ilija has requested a central git repo at CERN for FAX. Can the WLCG task force provide a point of coordination?
  • Ilija - doing FDR testing. Also testing using the Slim Skim Service. Seeing nice results for file transfers to UC3 - now seeing 4 GB/s. How is the collector holding up?
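
A minimal sketch of the kind of single-file federation read underlying such tests; the redirector host and federated path are hypothetical placeholders, not the actual FAX endpoints:

    #!/usr/bin/env python
    # Minimal sketch: time one file copy through a FAX redirector with xrdcp.
    # Redirector host and federated path are hypothetical placeholders.
    import subprocess, time

    REDIRECTOR = "redirector.example.org"   # placeholder FAX redirector
    PATH = "/atlas/dq2/testfile.root"       # hypothetical federated path

    start = time.time()
    rc = subprocess.call(["xrdcp", "-f",
                          "root://%s/%s" % (REDIRECTOR, PATH),
                          "/tmp/testfile.root"])
    print("xrdcp exit=%d in %.1f s" % (rc, time.time() - start))
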
this week

Site news and issues (all sites)

  • T1:
    • last meeting(s):
    • this meeting:

  • AGLT2:
    • last meeting(s): Working on moving dCache pool servers to SL6.3; most are converted. Moving VMs. Shawn working with Patrick and Sarah on PRODDISK cleanup, then to get the CCC mechanism working again.
    • this meeting:

  • NET2:
    • last meeting(s): Had a spike in lost heartbeat jobs due to a bug in the gatekeeper SEG; OSG is working on a patch. Still have the slow deletion issue.
    • this week:

  • MWT2:
    • last meeting(s): Upgrades to the UC network - reconfigured with bonded 40G ports between the Cisco and Dell stacks, and 2x10G bonded for some s-nodes. IU reconfigured for jumbo frames. New compute nodes at UIUC - 14 R420s, being built with SL5. Also adding more disk to the DDN, but there are continued GPFS issues; working closely with the campus cluster admins. GROUPDISK data loss - CCC was incorrectly reporting a large amount of dark data. Recovering what we can from other sites, notifying users, and modifying procedures so it doesn't happen again.
    • this meeting:

  • SWT2 (UTA):
    • last meeting(s): A couple of storage issues possibly created the dips in the accounting plots. A new version of PandaMover that uses Python 2.6 is in the git repo; use with caution. Will be busy getting the new storage online.
    • this meeting:

  • SWT2 (OU):
    • last meeting(s): Downtime next week. A 10G perfSONAR node will be added as well.
    • this meeting:

  • WT2:
    • last meeting(s): Dell switch issue - minor. Two channel-bonded 10G uplinks sometimes have trouble; possibly a problem with the environment.
    • this meeting: Noticing large transfer


last meeting
  • HPC activities at Argonne - Doug to report next meeting.

this meeting

-- RobertGardner - 14 Apr 2013
