r5 - 02 Apr 2012 - 15:39:10 - RobertGardner

ConsolidatingLFCStudyGroupUS

Introduction

I would like to invite you to form a study group to examine the issues of LFC consolidation at CERN and its impact on US facility operations, and to devise in detail a migration plan that accounts for all operational dependencies. This activity will probably take a couple of conference calls and some email, and should culminate in a brief written report that would be the basis for recommendations to ADC at the March software week. The report would include, among other issues:

  1. A brief statement describing the current deployment in the US. This description would include a table listing the number of files by site, giving a sense of the scale of each LFC catalog. For reference, look up similar data from other clouds.
  2. Any other historical "structure" in the catalog that may need to be cleaned up before migration: permissions, ACLs, directories, etc.
  3. Examination of the utilities currently in use, e.g. CCC.py and proddiskcleanse.py, and any changes needed if the LFC is consolidated at CERN. Impact for sites based on SE type, if any.
  4. Any dependencies or implied changes in the compute node environment?
  5. What is the immediate benefit of consolidation to ADC operations? To sites? (Will this speed up central deletion or consistency checking?)
  6. Probably other issues I've not listed here.

Below is a doodle poll to find a time to meet:

http://www.doodle.com/3t5h84xfqbm6u2u2

I am looking for a scribe to take notes and make a first draft of the (brief!) report. Bob, would you be willing?

Thanks all,

- Rob

Meetings

  • Our Skype-capable conference line (6 to mute). Announce yourself in a quiet moment after you connect.
  • USA Toll-Free: (877)336-1839
  • USA Caller Paid/International Toll : (636)651-0008
  • ACCESS CODE: 3444755

Friday, Feb 3, 2012

Notes from today's meeting -- bob

Present: everyone involved.

Plans and intentions as stated in Twiki

The points on the meeting page are just items that came to mind, in no specific order.

1. Making a statement. How long does a ccc check take? Is that going to be an issue once consolidated? Other clouds have already done this? What issues did they encounter?

2. Pump Shawn for his memory of the old dq2-machine days.

3. List of utilities used that may need touching or modification as a result of the migration.

4. What needs to be changed to comply with the WN environment, e.g. SLAC firewall access, or maybe even NET2?

schedconfig mods are needed for all sites to make such a switch.

Remote LFC problems would now replace local LFC problems. But how many local problems do we have now? A central LFC leaves a single point of failure. Consistency was an early problem, but is much less of one now. There are probably some ghosts, etc., and the catalog should be scoured periodically, but that will be true regardless. Is ADC consistency checking sufficient and viable? Perhaps we should compare ADC results to local results. Hiro says ADC does not even run it, because DDM delivers files instead of Panda. Hiro says we'll never give up Panda-mover. DDM implies registration of files, but Panda does not do this (Hiro will double check this).

Need a broken-out section on consistency checking in our report, comparing to other clouds. Can the ADC consistency checker give us info on dark data or orphans, or ...? How would one clear out dark data/orphans with a central LFC?

ADC cannot clean files existing on sites that are not in the catalog. DDM/LFC consistency is checked. Sarah says ccc can use a db/lfc dump for checking for and deleting the dark data, but then CERN would have to provide such a dump.
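
As a rough illustration of the dump-based check discussed above, the following is a minimal sketch, not the actual ccc code. It assumes a catalog dump and a storage listing are each available as a plain list of file paths. The function name and the example paths are hypothetical, and "ghosts" is assumed here to mean catalog entries with no matching file on storage.

```python
def find_inconsistencies(catalog_paths, storage_paths):
    """Compare a catalog dump against a storage listing.

    Returns (dark, ghosts):
      dark   - files present on storage but absent from the catalog
      ghosts - catalog entries with no corresponding file on storage
               (assumed meaning of "ghosts" for this sketch)
    """
    catalog = set(catalog_paths)
    storage = set(storage_paths)
    dark = sorted(storage - catalog)
    ghosts = sorted(catalog - storage)
    return dark, ghosts


# Hypothetical example data, for illustration only.
catalog_dump = ["/atlas/a.root", "/atlas/b.root"]
storage_dump = ["/atlas/b.root", "/atlas/c.root"]

dark, ghosts = find_inconsistencies(catalog_dump, storage_dump)
print("dark data:", dark)
print("ghosts:", ghosts)
```

With a central LFC, the catalog side of this comparison would have to come from a dump provided by CERN, which is the dependency noted above.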

We do not need to provide solutions, but we do need to point out issues.

Original argument for never consolidating: How does the US cloud compare to the CA cloud? What happens to latency when we have to go to CERN for LFC access? How does that impact production and analysis? Are there SQL limits that would become a problem?

Why does the ADC want to do this? The probable answer is in the history. Long ago there were many LFC errors, with some complete cloud failures (e.g. the French cloud). Is this the issue? We need to report the number of LFC-related issues we've had in recent times.

US Tier-3s use the BNL LFC, so Hiro runs dumps and sends them to the sites for their own use. This is a usage beyond ccc and proddiskcleanse.

Possibility: consolidate one T2 at BNL and see what happens. Patrick/UTA volunteers to try this on one of their clusters, UT_SWT2.

We definitely need to make a count (AGL is 8 million, the big cluster at SW is 2.5M with the other 2 being smaller, MW is probably 6M or more, BNL is 100M or more). CERN is now hosting 4 clouds (Taiwan, CNAF, SARA, FZK), but we don't know its size. At this size, it seems that ccc could not be run; the scale is just too large. The UK is already complaining because they can't clean up fast enough, and that is just their own cloud, centrally cleaned.

Bob will write a broad outline as an MS Word doc and email it around. Rob will try to set up Google Docs for this, and send an invite. With the outline set, everyone will volunteer for a section. Meet again next week at this same time to see where to go.

One T2 will do back-and-forth testing, using their own LFC and the BNL LFC, to see how they compare. HC testing might be useful, looking at stage-in/out times on someone's Analy queues.

Friday, Feb 17, 2012

Discussion of overall document structure, and section assignments.

Rob: Intro
Rob: Current LFC text
Ask Wei to supply info for WT2 table entry (Rob)
Bob: ProdDiskCleanse
Hiro+Bob: Test plan
Bob+Sarah: Impact
Hiro: Brave New World
Patrick: ccc
The Google doc shows in full who is doing what

Are there early-days LFC relics in our catalogs? Query Shawn, for example. These may or may not come into play in the migration, depending on the method used.

Brave New World section may not be a good one to keep, but several of us expressed a desire to have this information in place.

Graham Stewart (?) may be the driving force behind this migration.

Are the AGLT2 and MWT2 versions of ccc_pnfs.py identical? MW: v 1.13, AGL: v 1.14. Ask Shawn if Chimera is the reason behind this. AGLT2 uses a modified v 1.14 that accounts for Solaris on thor01 (since retired) and the AGLT2 status as a Muon Calibration site.

Who is connecting to the LFCs, and how often?

What will be our recommendations? Possibilities:

  • Not consolidate at all
  • Cloud level at BNL
  • Consolidate it all at CERN

Suggested Document evolution path

  • Version 0: current
  • Version 0.5 deadline: Feb 24
  • Version 1.0 to give to Michael by March 5 latest

No meeting 2/24

Bob will email when the Twiki is updated

We will all email as we complete our sections


-- RobertGardner - 31 Jan 2012



Attachments


LFCConsolidationfortheUSCloud.pdf (330.1K), RobertGardner, 02 Apr 2012 - 15:38
 