
MinutesNov5

Introduction

Minutes of the Facilities Integration Program meeting, Nov 5, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: John, Charles, Rob, Shawn, Saul, Patrick, Mark, Kaushik, Wei, Nurcan, Horst, Hiro, Xin, Fred
  • Apologies:
  • Guests: none

Integration program update (Rob, Michael)

  • IntegrationPhase7
  • High level goals in Integration Phase 7:
    • Pilot integration with space tokens
    • LFC deployed and commissioned: DDM, Panda-Mover, Panda fully integrated
    • Transition to /atlas/Role=Production proxy for production
    • Storage
      • Procurements - keep to schedule
      • Space management and replication
    • Network and Throughput
      • Monitoring infrastructure and new gridftp server deployed
      • Throughput targets reached
    • Analysis
      • New benchmarks for analysis jobs coming from Nurcan
      • Upcoming Jamborees
    • Probably will hold another US ATLAS Tier 2/Tier3 meeting
      • Winter/Early Spring
      • Location: Duke (TBC)
  • OSG site admins meeting coming up: https://twiki.grid.iu.edu/bin/view/SiteCoordination/SiteAdminsWorkshop2008

Facility downtimes

We need to make downtimes, and whether there are problems, transparent and well known to the collaboration, and we need to work on a structure to provide this information.
  • follow-up discussion from last week
    • OIM https://oim.grid.iu.edu/ to announce downtimes
    • Where to "announce"? shifters, prodsys, oim, ....
    • Curl method for offline/online sites in Panda - make this automatically inform OIM?
    • Would like to do this in one place, with one click (see the sketch after this list). Note that OIM requires a grid certificate.
    • Kaushik will provide a summary and send it around; he will try for next week.
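
A minimal sketch of the "one place, one click" idea above. The two endpoint URLs are hypothetical placeholders (neither is taken from these minutes); OIM access is assumed to go through the announcer's grid certificate.

  # One-click downtime announcement: take a site offline in Panda and
  # record the same window in OIM in a single step.
  import requests

  PANDA_SET_STATUS_URL = "https://panda.example.org/server/panda/setSiteStatus"  # hypothetical
  OIM_DOWNTIME_URL = "https://oim.grid.iu.edu/downtime"                          # path is a guess

  # OIM requires a grid certificate; point these at the announcer's cert/key.
  GRID_CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")

  def announce_downtime(site, start, end, reason):
      # 1. Take the site offline in Panda (the existing "curl method").
      r = requests.get(
          PANDA_SET_STATUS_URL,
          params={"site": site, "status": "offline", "comment": reason},
          cert=GRID_CERT,
          timeout=30,
      )
      r.raise_for_status()

      # 2. Register the same window in OIM so shifters and the rest of
      #    the collaboration can see it.
      r = requests.post(
          OIM_DOWNTIME_URL,
          data={"resource": site, "start": start, "end": end, "description": reason},
          cert=GRID_CERT,
          timeout=30,
      )
      r.raise_for_status()

  if __name__ == "__main__":
      announce_downtime("EXAMPLE_SITE", "2008-11-06 14:00", "2008-11-06 18:00",
                        "storage maintenance")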

Next procurements

  • Cover in site-reports

Operations overview: Production (Kaushik)

  • follow-up issues
    • US production initiative - keeping the queues filled for the next two to three weeks.
    • New round of production by Borut should keep the cloud filled for the next couple of weeks.
    • We also need to keep the communication lines open with the US community - RAC meetings starting back up.
  • Discussion of chronic job shortages. US ATLAS queue fillers not politically popular in international ATLAS - high level ATLAS management wants firm central control, even if it means a massive waste of resources.
  • User requests to Tier 3 - being approved based on dataset types.

Shifters report (Mark)

  • Reference Yuri's weekly summary in operations meetings.
  • LFC migration making progress at sites - e.g., SWT2_CPB and SLAC are working on it this week.
  • Not much production running.

PRODDISK migration (Yuri, Paul, others)

  • Follow-up w/ status at each site
  • SWT2_OU - waiting for new storage.
  • SLAC - completed.
  • BU - completed for BU site, working well. For Harvard, custom site mover tool will be needed.

Analysis queues, FDR analysis (Nurcan)

  • Background, https://twiki.cern.ch/twiki/bin/view/AtlasProtected/PathenaAnalysisQueuesUScloud
  • Waiting for LFC migration to be completed.
  • DA session at the SW workshop - long, with lots of updates for pathena and the Ganga robots. What about glexec? Follow up with Maxim; it is needed for SLAC.
  • DA shift team presented - users seem to be happy, but there are concerns about support scalability as more users arrive. On Friday there will be a session on improving the tools for shifters.

Operations: DDM (Hiro)

  • Problems at IU - resolved.
  • At UC - converting unreserved storage pools over to space tokens. Had to convert tools from LRC to use LFC.
  • Otherwise no problems.

Throughput initiative - status (Shawn)

  • last status
    • Focus on testing with new BNL doors: 700 MB/s easily. Now testing at sites.
    • perfSONAR boxes are supposed to be deployed and operational.
    • Hiro: can easily get to 200 MB/s to AGLT2, but not 400 MB/s for long periods. Checking configurations.
    • Will test at UC and Wisconsin today.
    • Will continue with other sites as available.
    • Rich: sites should let John Bigelow know when their Koi servers are running.
  • Standard meeting this week - see list.
  • Working through a number of site issues. At IU, lots of variability. SLAC close to 400 MB/s. Hiro will be working on 1 GB/s to multiple sites.
  • Reminder for all sites to put up perfSONAR boxes.

LFC migration

  • SubCommitteeLFC
  • Meeting today: LFCMeetNov5
  • Notes above.
  • Dataset deletion from user jobs in pathena. Nurcan has been discussing this at SW week. The DQ2 client page in the twiki discusses this. Discussion of replica deletion versus full deletion of the dataset in the catalog, etc. (a sketch of the two modes follows after this list). Will need to follow up.
  • Data deletion within pathena - only within the context of a workflow.
  • General tool will come from DDM.
  • SLAC - waiting for testing from Paul.
  • Full URLs - are these now supported?
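
To make the replica-versus-dataset distinction above concrete, here is a minimal sketch. The catalog and storage helpers are hypothetical stand-ins, not the real DQ2 or LFC client API; only the shape of the two operations is the point.

  # Illustration of the two deletion modes discussed above.
  # `catalog` and `storage` are hypothetical placeholder objects.

  def delete_replicas(dataset, site, catalog, storage):
      """Remove one site's copy: the physical files at that site plus the
      corresponding replica entries in the catalog. The dataset itself,
      and its replicas at other sites, are untouched."""
      for lfn, surl in catalog.list_replicas(dataset, site):
          storage.delete(surl)               # physical file at the site
          catalog.remove_replica(lfn, surl)  # replica entry only

  def erase_dataset(dataset, catalog, storage):
      """Full deletion: remove every replica everywhere, then drop the
      dataset definition from the central catalog."""
      for site in catalog.list_sites(dataset):
          delete_replicas(dataset, site, catalog, storage)
      catalog.remove_dataset(dataset)        # dataset no longer resolvable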

Site news and issues (all sites)

  • T1: none
  • AGLT2: busy bringing up new equipment. Blade server racked and cabled, now rocksifying. Storage reporting is now set up: USERDISK and PRODDISK are reported, but nothing for the other storage areas, and only free space can be reported, not used or total. The analysis queue is running well; very low failure rates yesterday.
  • NET2: all running well. storage arriving.
  • MWT2: BNL routing problem. PRODDISK expanded; UC sites back online. New machines being stacked. IU dCache - PNFS problem fixed.
  • SWT2 (UTA): LFC conversion went okay. Production ongoing without problems. Still negotiating with Dell; hope to have something by next week.
  • SWT2 (OU): still negotiating with Dell and Ibrix. Migrating to LFC; working out permissions issues.
  • WT2: in the process of migrating to LFC.

Carryover issues (any updates?)

Long URLs in the catalog (Marco)

  • follow-up:
    • Our convention in the US is to use full URLs in the catalog (see the illustration after this list).
    • There are a few changes implied for the pilot - Paul is aware.
    • What about dq2-put? Is the short URL used? Need to check with Mario.
  • Savannah ticket submitted.
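
For reference, the short-versus-full distinction looks roughly like the following. The hostname, port, and endpoint path are example values typical of SRM v2 endpoints, not taken from any specific US site.

  # Hedged illustration of short vs. full URL forms for a catalog entry.
  # Hostname, port, and endpoint path are example values only.

  lfn = "/atlas/dq2/mc08/AOD/mc08.12345.example.AOD.pool.root.1"

  # "Short" form: just the storage host and file path.
  short_surl = "srm://se.example.edu" + lfn

  # "Full" form: includes the SRM port and web-service endpoint,
  # which is the convention for catalog entries in the US cloud.
  full_surl = "srm://se.example.edu:8443/srm/managerv2?SFN=" + lfn

  print(short_surl)
  print(full_surl)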

Release installation via Pacballs + DMM (Xin, Fred)

  • status from last week
    • Successfully ran an install job at BNL. Will try more sites this week.
    • Migration to production - will do this site by site.
    • 14.2.24 being built right now. Will check the timing of the pacball migration to BNL.
  • Test jobs cannot contact the EGEE installation portal. Waiting on Torre to set up new sites in Panda for installation.
  • Automatic transfer of pacballs to Tier 1 - under discussion.

Squids and Frontier

  • last week:
    • Fred was running jobs on a Tier 3, getting conditions data from the BNL database; tuning to reduce download times. Carlos Gamboa at BNL offered suggestions for Oracle tuning at the host level. Wei notes that latency to SLAC is around 20 minutes, which brings the use of Frontier back into question. Michael: setting up the required infrastructure at BNL to support the distribution, with Squid caches at the Tier 2s (a sketch follows below). Timeframe: a few weeks.
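
A rough sketch of the intended layering, assuming the frontier_client convention of a single FRONTIER_SERVER configuration string; the hostnames and paths below are placeholders, not endpoints from this meeting.

  import os

  # Point conditions-data access through a local Tier-2 Squid cache sitting
  # in front of the central Frontier service at BNL. Placeholder hosts only.
  FRONTIER_LAUNCHPAD = "http://frontier.bnl.example:8000/FrontierProd"  # central server (placeholder)
  LOCAL_SQUID = "http://squid.tier2.example:3128"                       # site cache (placeholder)

  # The client reads its configuration from a single string of (key=value)
  # groups; queries go through the proxy, so repeated requests are served
  # from the Squid cache instead of the central database.
  os.environ["FRONTIER_SERVER"] = (
      "(serverurl=%s)(proxyurl=%s)" % (FRONTIER_LAUNCHPAD, LOCAL_SQUID)
  )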

AOB

  • Nurcan: glexec - Maxim has tested it only at BNL; for SLAC, the issue is contacting MySQL.


-- RobertGardner - 04 Nov 2008
