


Minutes of the Facilities Integration Program meeting, March 6, 2013
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern

  • Your access code: 2913843
  • Your private passcode: 4519
  • Dial *8 to allow conference to continue without the host
    1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
    2. International participants dial: Toll Number: 303-248-0285. Or International Toll-Free Number: http://www.readytalk.com/intl
    3. Enter your 7-digit access code, followed by “#”
    4. Press “*” and enter your 4-digit Chairperson passcode, followed by “#”


  • Meeting attendees: Shawn, Rob, Ilija, Saul, Patrick, Dave, Sarah, Alden, Doug, Horst, John, Armen, Kaushik, Mark
  • Apologies: Michael, Bob, Fred, Mark
  • Guests:

Integration program update (Rob, Michael)

AGIS and site configuration (Alden)

Supporting opportunistic usage from OSG VOs (Rob)


Reviewing LHCONE connectivity for the US ATLAS Facility (Shawn)

Supporting OASIS usability by OSG VOs

Deprecating PandaMover in the US ATLAS Computing Facility (Kaushik)

Supporting opportunistic access from OSG by ATLAS

The transition to SL6

  • WLCG working group
  • Shawn is in this group already; Horst volunteers

Evolving the ATLAS worker node environment

Virtualizing Tier 2 resources to backend ageing Tier 3's (Lincoln)

Transition from DOEGrids to DigiCerts

last week

this week

  • AGLT2 has switched over all of its host certs. Note the 50 certificate request per day limit. (A quick issuer/expiry check is sketched below.)
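  A quick way to spot-check a converted host certificate is to print its issuer and expiration with openssl. A minimal sketch in Python, assuming openssl is installed and the cert is at the conventional /etc/grid-security/hostcert.pem location (adjust per host):

    #!/usr/bin/env python
    # Hedged sketch: print issuer, subject, and expiration of a host cert so the
    # DOEGrids -> DigiCert switch can be verified. The cert path is the usual
    # grid-security location and may differ per site.
    import subprocess

    CERT = "/etc/grid-security/hostcert.pem"

    out = subprocess.check_output(
        ["openssl", "x509", "-in", CERT, "-noout",
         "-issuer", "-subject", "-enddate"])
    print(out.decode())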

Facility storage deployment review

last meeting(s):
  • Tier 1 DONE
  • WT2: DONE
  • MWT2: 500 TB
  • NET2: up and running, being tested
  • SWT2_UTA: still waiting for equipment
  • SWT2_OU: Storage is online, but waiting to incorporate it into Lustre.
this meeting:
  • Tier 1 DONE
  • WT2: DONE
  • MWT2: no change. Downtime delayed for remaining 500 TB and required network upgrades, likely week of March 25.
  • NET2: 576 TB will be added within one or two days.
  • SWT2_UTA: waiting for delivery (est. tomorrow)
  • SWT2_OU: Need to request service from DDN team.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Saul notes continued problems with jobs getting stuck in the transferring state. Kaushik notes the brokering limit on transferring-to-running > 2 will be raised. Also, why has the number of transferring jobs increased? There is also an AutoPyFactory job submission issue; this will need to be discussed with John Hover. Saul and John to follow up with Jose and John Hover.
    • Hiro notes transfers back to FZK might be slowing this.
  • this meeting:
    • Have plenty of production to do; sites should remain full. There will be periodic bursts of important work.
    • A good time for downtimes, starting next week.
    • Try to coordinate downtimes.

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • A note sent yesterday. In contact with sites to adjust space tokens.
    • Hiro will send USERDISK cleanup, actual cleanup will be in two weeks.
    • Is DATADISK being used? Armen claims it is primary data. It is a question of popularity. We need to work with ADC to discuss policy for effective use by physicists.
    • Issue reported by Doug: the Top group uses SWT2 and NET2. About 25% is at NET2, which has the most space but has issues: many hundreds of queued datasets and lots of deletions on the books. Can't direct the output of D3PD production there for use by US physicists, nor via FAX. Datasets have been stalled for two weeks; 178 TB in total. Michael: may need to find an interim solution.
  • this meeting:
    • NERSC scratchdisk deletion issue. Reporting is not correct; lots of consistency checking.
    • Daily summary reporting errors - complaint given to the SSB team, but no response yet.
    • NET2 central deletion errors continuing. Focus has been on low transfer rates. Have tried everything except upgrading BeStMan2. (Error rate of 50%; the service sees "dropped connection"; ~400 errors/hour.) Have updated Java and increased the allowed threads; have not yet increased the Java heap size (Wei will send pointers; see the sketch after this list).
    • USERDISK cleanup at the end of the week.
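  On the Java heap question for BeStMan2: the JVM maximum heap is set with the -Xmx flag at startup; which startup/sysconfig file carries that flag is packaging-specific, so the "bestman" match string below is an assumption about how the service is launched. A minimal diagnostic sketch that reports the -Xmx setting of a running BeStMan Java process:

    #!/usr/bin/env python
    # Hedged sketch: scan /proc for Java processes that look like BeStMan and
    # report their -Xmx (max heap) flag, if any. Linux only; the "bestman"
    # match string is an assumption.
    import glob

    for cmdfile in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            argv = open(cmdfile, "rb").read().split(b"\0")
        except IOError:
            continue  # process exited while scanning
        cmdline = b" ".join(argv)
        if b"java" in cmdline and b"bestman" in cmdline.lower():
            heap = [a for a in argv if a.startswith(b"-Xmx")]
            pid = cmdfile.split("/")[2]
            print("pid %s  max heap: %s"
                  % (pid, heap[0].decode() if heap else "JVM default"))

  If the heap turns out to be at the JVM default, raising it via -Xmx in the BeStMan startup options would be the usual first step; exact values should wait for Wei's pointers.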

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  2/21: M4.4 upgrade for ATLAS DDM Dashboard 2.0.  More details here:
    2)  2/21: BNL_ATLAS_RCF - job failures with "lost heartbeat" & "cmtside command was timed out" errors.  The former are due to opportunistic use of the site 
    by ATLAS (jobs can be evicted).  The latter is due to occasional high load on an NFS server when a large number of jobs are starting up.  Will eventually 
    move to CVMFS.  Closed https://ggus.eu/ws/ticket_info.php?ticket=91695 - eLog 43035.
    3)  2/21: Jobs running at US sites were failing due to an error in the VOMS service configuration on voms305.cern.ch.  Issue resolved as of 2/22 a.m.  ggus 91704, 
    91710 were opened/closed for this issue.  eLog 43057.
    4)  2/23: SWT2_CPB - file transfer failures with "trouble with canonical path" errors.  A storage server went off-line (cooling fan on a NIC), and this in turn created 
    problems for the xrootdfs process on the SRM host.  Most of the errors went away the same day after the storage server was repaired, but one more xrootdfs restart 
    was needed to correct the small number of "/bin/mkdir: cannot create directory" errors that persisted after the first incident.  https://ggus.eu/ws/ticket_info.php?ticket=91741 
    was closed as of 2/25 p.m.  eLog 43134.  (Duplicate ticket https://ggus.eu/ws/ticket_info.php?ticket=91775 also opened/closed during this period.)  
    http://savannah.cern.ch/support/?136087 (Savannah site exclusion).
    Follow-ups from earlier reports:
    (i)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line to 
    protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site exclusion ticket), 
    eLog 38795.
    Update 11/2: See latest information in Savannah 131508 (eventually the space token will be decommissioned).
    Update 2/25: Site carried out a local cleanup, and the space token is now whitelisted.  Savannah 131508 closed.
    (ii)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    (iii)  1/21: SWT2_CPB DDM deletion errors - probably not a site issue, as some of the errors are related to datasets with malformed names, and others are deletion 
    attempts for very old datasets.  Working with DDM experts to resolve the issue.  https://ggus.eu/ws/ticket_info.php?ticket=90644 in-progress.
    Update 1/23: Opened https://savannah.cern.ch/support/?135310 - awaiting a response from DDM experts.
    Update 2/13: Duplicate ggus ticket 91451 was opened/closed - eLog 42890.  Still awaiting feedback from the deletions team.
    Update 2/25: Old remnants deleted at the site, and from the deletion service side those deletions were processed and pushed out of queue.  No more errors - ggus 90644 closed.  
    eLog 43085.
    (iv)  2/16: UPENN - file transfer failures (SRM connection issue).  https://ggus.eu/ws/ticket_info.php?ticket=91122 was re-opened, as this issue has been seen at the site a 
    couple of times over the past few weeks.  Restarting BeStMan fixes the problem for a few days.  Site admin requested support to try and implement a more permanent fix.  
    eLog 42963.
    (v)  2/19: NET2 - file transfer failures.  Initially the errors were SRM connection ones.  These may have coincided with admins at the site working on the central deletions issue.  
    Later there were new errors like "No markers indicating progress received for more than 180 seconds" & " source file doesn't exist."  Issue under investigation.  
    https://ggus.eu/ws/ticket_info.php?ticket=91641, eLog  43018.
    Update 2/22: Issue understood.  Setting for the number of streams and parallel files was too high, resulting in "not enough progress" kinds of timeouts.  
    Problem fixed - ggus 91641 closed.
  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  2/27: AGLT2 file transfer errors ("locality is UNAVAILABLE").  One or more storage servers experiencing heavy loads.  https://ggus.eu/ws/ticket_info.php?ticket=91835 in-progress, 
    eLog 43143.  https://savannah.cern.ch/support/index.php?136180 (Savannah site exclusion).  https://ggus.eu/ws/ticket_info.php?ticket=91896 was also opened on 3/3 for file 
    transfer problems.  Update 3/5: the SRM errors that day were a separate issue (ownership of host certs on some dCache servers), now resolved.  Also, a new kernel was deployed 
    on the storage nodes to rectify recent problems.
    2)  2/28: M4.5 upgrade for ATLAS DDM Dashboard 2.0.  Details here:
    3)  2/28: SWT2_CPB file transfer failures.  Early a.m. there was a hardware failure in the xrootd redirector host (the RAID card for the system drive mirror set died and had to be replaced).  
    A final restart of the xrootdfs process on the SRM host early in the afternoon cleared up the remaining errors.  eLog 43174, http://savannah.cern.ch/support/?136087 (Savannah site exclusion).
    4)  2/28: NET2 production job failures ("LFC entry erased or file not yet transferred").  Also, as of early a.m. 3/1, file transfer problems with "source file doesn't exist" errors.  Saul 
    reported there had been a major networking issue at the site the night before.  Working to understand what happened.  https://ggus.eu/ws/ticket_info.php?ticket=91874 in-progress, 
    eLog 43178.  Update 3/4: issues resolved, file transfers successful, https://ggus.eu/ws/ticket_info.php?ticket=91874 closed.
    5)  3/5: WISC-ATLAS LOCALGROUPDISK: functional test transfer failures (efficiency 0%) with DESTINATION errors ("has trouble with canonical path").  
    https://ggus.eu/ws/ticket_info.php?ticket=92164, eLog 43227.
    Follow-ups from earlier reports:
    (i)  12/8: NET2 - again seeing a spike in DDM deletion errors - https://ggus.eu/ws/ticket_info.php?ticket=89339 - eLog 41596.
    (ii)  2/16: UPENN - file transfer failures (SRM connection issue).  https://ggus.eu/ws/ticket_info.php?ticket=91122 was re-opened, as this issue has been seen at the site a couple 
    of times over the past few weeks.  Restarting BeStMan fixes the problem for a few days.  Site admin requested support to try and implement a more permanent fix.  eLog 42963.
    Update 3/2: still an ongoing issue.  BeStMan restarts are required every few days (4-5?).  eLog 43194.
  • Problem with UPENN transfers; is it a checksum error? The same problem Iwona has seen? (A quick local adler32 check is sketched below.)
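  If a checksum mismatch is suspected, a quick first check is to compute the local adler32 of a transferred file and compare it with the value recorded in DDM (adler32 checksums are stored as 8-digit zero-padded hex strings). A minimal sketch; the file path is a placeholder:

    #!/usr/bin/env python
    # Hedged sketch: compute a file's adler32 in the zero-padded hex form used
    # by DDM, for comparison with the recorded checksum. Path is a placeholder.
    import zlib

    def adler32(path, blocksize=1024 * 1024):
        value = 1  # standard adler32 seed
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                value = zlib.adler32(block, value)
        return "%08x" % (value & 0xffffffff)

    print(adler32("/path/to/suspect/file.root"))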

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Release expected by March, with 10G support. Goal is to deploy across the facility by the end of March.
    • NET2 - CERN connectivity - has it been improved?
    • LHCONE connectivity for NET2 and SWT2 - timeline?
    • Prepare with discussions at NET2 now, even if the setup will only come with the move to Holyoke; get organized. The move won't happen before the end of March. The WAN networking at Holyoke is still not well defined. Start a conversation about bringing in LHCONE.
  • this meeting:

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Release 3.3, required for the security module change, is out.
  • Ilija notes release 3.3 supports the f-stream, and we should switch to it from detailed monitoring.
  • Global name space change and Rucio - may need to address this with DDM.
  • BNL overwrite options need to be set correctly (probably using an old xrdcp client); see the note after this list.
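On the overwrite point: a copy onto an existing destination file fails unless the client is told to force it, which xrdcp does with -f. A minimal sketch (the source URL and destination path are placeholders, and using -f this way is an assumption about the intended fix):

    #!/usr/bin/env python
    # Hedged sketch: copy with xrdcp, forcing overwrite of an existing
    # destination file via -f. URLs/paths are placeholders.
    import subprocess

    subprocess.check_call([
        "xrdcp", "-f",
        "root://source.example.org//atlas/path/to/file.root",
        "/local/path/file.root",
    ])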
this week
  • US-FAX-status20130304.pptx: Status of US FAX sites
  • SSS_USATLASFACILITIES.pptx: SkimSlimService
  • BNL - there is a configuration error preventing Shuwei's jobs from running.
  • AGLT2 failover redirection not working. Will wait for 3.3.1 to be deployed.
  • MWT2 - two setups. The dCache-xrootd setup will have the same issue as AGLT2.
  • WT2 - running new security config; waiting for the rpm to become available.
  • OU is okay
  • SWT2 - failing. Is this caused by the HC overwrite of the local copy tool?
  • Topology: Three sites from GRIF; adding LAPP and Marseille. Italian Tier 1 monitoring information.

Site news and issues (all sites)

  • T1:
    • last meeting(s): John is currently writing a cost-aware cloud scheduler. Adds policies that are cost driven, to expand to "pay as you go" resources. The current demand-driven event is helping drive better understanding with Amazon policies for provisioning and cost modeling. No indication of bottlenecks into/out of storage.
    • this meeting:

  • AGLT2:
    • last meeting(s): Still working with new storage servers - unstable.
    • this meeting:

  • NET2:
    • last meeting(s): Running 100% analysis on the BU side. Michael: would like to include in Panglia.
    • this week: The DDM slowness problem has been solved. New storage is coming online. The Holyoke move is postponed, pending networking. Analysis jobs don't appear in the monitoring. HU had an RSV issue, now resolved.

  • MWT2:
    • last meeting(s): Preparing for downtime during week of March 18: UC network reconfiguration, add 500 TB, investigate network issues at IU.
    • this meeting: The Illinois campus cluster has been down since Friday for GPFS and network reconfiguration; most of the cluster is back up, and the ATLAS component is being brought back up. Networking issue at IU - will do a network reconfiguration with jumbo frames (see the sketch below). At UC, need to reconfigure the network to accommodate the additional storage. Downtime postponed until April.
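  For the jumbo-frame work, end-to-end MTU 9000 support can be verified by sending a don't-fragment ping with a payload just under the MTU (9000 minus 20 bytes IP and 8 bytes ICMP header = 8972). A minimal sketch; the peer hostname is a placeholder and Linux ping syntax is assumed:

    #!/usr/bin/env python
    # Hedged sketch: verify that 9000-byte frames pass end-to-end by pinging
    # with the don't-fragment bit and an 8972-byte payload. Peer is a placeholder.
    import subprocess

    PEER = "storage01.example.org"

    rc = subprocess.call(["ping", "-M", "do", "-s", "8972", "-c", "3", PEER])
    print("jumbo frames OK" if rc == 0 else "path does not pass 9000-byte frames")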

  • SWT2 (UTA):
    • last meeting(s): Still tracking an issue with the deletion service to clear up old deletions.
    • this meeting: Awaiting storage delivery.

  • SWT2 (OU):
    • last meeting(s): Disk order has gone out. Horst is getting close to having clean-up script.
    • this meeting:

  • WT2:
    • last meeting(s): Working on getting FAX jobs to run - unintentionally brought the site down. Will need to experiment with Bestman and a new version of Java. Is OSG aware of this?
    • this meeting:


last meeting

this meeting
  • Doug: Sergei got 4,000 cores for one month, CentOS6-based. These are behind DUKEGCE.

-- RobertGardner - 05 Mar 2013


pptx US-FAX-status20130304.pptx (64.0K) | WeiYang, 05 Mar 2013 - 18:31 | Status of US FAX sites
pptx SSS_USATLASFACILITIES.pptx (1280.8K) | IlijaVukotic, 06 Mar 2013 - 04:03 |
pdf rwg-monitoring.pptx.pdf (1247.1K) | RobertGardner, 06 Mar 2013 - 12:44 |