
US ATLAS DDM and MC Production Workshop (BNL Sep 28-29, 2006)


Workshop Specifics

We are organizing a two-day workshop at BNL, September 28-29, 2006, to discuss issues related to US ATLAS distributed data management and MC production. The first day of the workshop will be dedicated to plenary talks; the second day will be devoted to training sessions and splinter meetings.

Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=6495

  • Sep 28
    • DQ2 software and DDM operations, data flow monitoring
    • Computing facilities and infrastructure (Tier-1/Tier-2s)
    • MC production in US, Panda and related issues
    • 3D and database replication

  • Sep 29
    • DDM/DQ2 training and Production training sessions
    • DDM splinter meeting | Panda splinter meeting

The agenda of each session will be prepared by the conveners; see the Indico link above for the full agenda. Contact one of the following people to discuss the agenda and/or if you want to give a talk. We are particularly interested in talks from people working at Tier-2s about their vision of data management and the support they need from the Tier-1.

Alexei, Torre : DQ2 and DDM operations issues

Razvan, Rob, Alexei : US ATLAS Tier-1/Tier-2s

Pavel, Kaushik : MC Production

Sasha : 3D and database(s) replication

Miguel, Tadashi, Alexei : DDM/DQ2 training session

The workshop is open to all; we expect (at least) one person per Tier-2 who is in charge of DDM operations and MC production.

Location

The meeting will take place in the BNL Physics Department, Building 510A.

Registration

Registration is required for the US ATLAS DDM and MC Production Workshop.

There is no registration fee, but please send an e-mail to Alexei Klimentov (aak@bnl.gov) and Torre Wenaus (wenaus@bnl.gov), with cc to Linda Feierabend (feierabe@bnl.gov) and Penka Novakova (penka@bnl.gov), notifying us of your attendance.

Attendees who already hold a valid BNL Guest Badge may come on site without notifying the main gate. Attendees who already have an approved guest appointment but not a Guest Badge will have to notify the BNL main gate of their arrival. All others must apply for a BNL guest appointment by completing Brookhaven's Guest Registration Form well in advance of the workshop (https://fsd84.bis.bnl.gov/guest/guest.asp; please give H. Gordon's name as host). You will be notified when the guest appointment is approved. When you arrive at BNL with an approved guest appointment for the first time, you must stop at the security office by the main entrance, where you will be issued a temporary pass, then proceed to the RHIC/AGS Users' Office (in Bldg. 335) to complete the check-in procedure.

Non-US citizens without a BNL guest appointment should complete the registration form as soon as possible, since the approval process takes additional time, especially for people from sensitive countries.

Participants

  • Miguel Branco / CERN
  • Kaushik De / UTA
  • Robert Gardner / UC
  • Hironori Ito / BNL
  • Alexei Klimentov / BNL
  • Tadashi Maeno / BNL
  • David Malon / ANL
  • Shawn McKee / UM
  • Patrick McGuigan / UTA
  • Pavel Nevski / BNL
  • Karthik Arunachalam / OU
  • Razvan Popescu / BNL
  • Tom Rockwell / MSU
  • Pedro Salgado / CERN
  • Dan Schrager / IU
  • Horst Severini / OU
  • Jim Shank / BU
  • Sasha Vanyashine / ANL
  • Torre Wenaus / BNL
  • Wei Yang / SLAC

US ATLAS DDM Workshop: Software Comments, Observations, and Proposals

  • DDM problems are severe and of the highest urgency
  • The DDM infrastructure works quite well (modulo facility problems) within the US cloud, but breaks down severely once one goes outside the cloud. This is true also for some well-managed clouds in LCG.
  • LCG DDM has severe problems. A substantial fraction of them (25%?) are believed to come from LFC. The recent improvement in LFC performance reported during the meeting (a 100-GUID lookup going from 59 s to 8.5 s) still does not inspire confidence that LFC is a credible solution.
    • A design/implementation requirement for DDM from the beginning -- a foundation of validated, production-quality tools -- has clearly not been met
      • This requirement was reaffirmed by Alexei from the DDM operations perspective
  • we should fix serious problems for which we actually have identifiable solutions
    • A clear example is LFC. We should quantify performance and compare with MySQL, but we already know the outcome
    • In any case, there can be no argument against the need for an LFC fallback
    • Miguel this morning: we will not ever fix the LRC problem with operations effort. Solution is in the software.
  • Proposal: we should pursue the extension of the OSG MySQL LRC to support LCG and ATLAS in general, and deploy it at CERN as a mirror for ATLAS production LFCs
    • will provide (if it succeeds) a demonstrated fallback for LFC
      • will address the deployment concerns of an in-house LRC by demonstrating (or not) that the mirror deployment at CERN is sustainable
    • we can then put the http interface in front of the mirror and make LCG data access from the US independent of LCG grid tools and LCG UI (using globus-url-copy for copying)
  • Alexei in this meeting: dq2_get is currently the most reliable means of replicating data
  • Proposal: extend dq2_get with option to make local SE the destination, and register destination replicas with LRC and dataset replica with DQ2 (for full-dataset replications)
    • Will allow dq2_get users to share the results of their copying through DQ2
  • Hiro in this meeting: if FTS breaks, DQ2 breaks; files cannot be replicated
    • This should not be the case. There should be automatic failover to simpler fallback(s); a transfer should fail only if the fallbacks also fail.
  • Proposal: priority attention in DQ2 development to a thorough implementation of auto fallback (see the fallback sketch after this list)
  • Miguel in this meeting: Don't use SRM if you don't have to.
    • Are we using it where we don't have to? At least, are there places we do not have auto fallback?
    • The new hooks that go to the native storage system layer for e.g. rm are very good. Are there other places to extend this approach?
  • Patrick reminded me of another item: we should eliminate the POOL dependency from the http LRC service
    • Next priority after grid authentication?
    • Already done by Miguel?
  • Which do we see as higher priority for DQ2, partitioning or hierarchical catalog?
    • Basic function of hierarchical catalog obtained via naming conventions on (sub)datasets, and cross-population of parent and child datasets
    • No such stand-in for partitioning
      • Partitioning can mean either partitioning of catalogs within a DQ2 instance, or multiple DQ2 instances, or both
    • Partitioning use cases mentioned in this meeting, in addition to basic scalability:
      • Regional partitioning: distinct Panda DQ2 instance for US production and analysis
        • scalability
        • Decouple from usage load on ATLAS service
        • remove region-specific traffic from ATLAS service
      • Separate partition for user datasets
        • Avoid letting less controlled usage by end users impact production instance
        • Different ownership/security requirements can make it a natural separation
      • Separate partition(s) for old archived datasets
        • Control the scale of the current, most active partition
    • Partitioning requires
      • Careful implementation in DQ2 and clients (user tools, Panda, monitors, ...) to not confuse or complicate life for users
      • A DNS-like layer that maps a dataset (based on metadata, these days encoded in the name) to a DQ2 or catalog instance (see the mapping sketch after this list)
      • Site service awareness of the different instances, and which to use
    • You guessed it: I vote for partitioning as the higher priority
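
To make the DNS-like layer idea above more concrete, here is a minimal sketch in Python, assuming partition selection can be driven purely by dataset naming conventions. The rule set, partition names, and catalog URLs are invented placeholders, not real ATLAS endpoints; a real implementation would read its rules from configuration and could consult dataset metadata rather than just the name.

<verbatim>
# Minimal sketch of a "DNS-like" layer mapping a dataset name to the
# DQ2/catalog instance responsible for it.  All rules, partition names and
# endpoint URLs are hypothetical placeholders, not real ATLAS services.

RULES = [
    # (predicate on the dataset name, partition, catalog endpoint)
    (lambda name: name.startswith("user."), "user",
     "http://dq2-user.example.org/catalog"),
    (lambda name: ".archive." in name, "archive",
     "http://dq2-archive.example.org/catalog"),
    (lambda name: ".usprod." in name, "us-production",
     "http://dq2-us.example.org/catalog"),
]
DEFAULT = ("central", "http://dq2-central.example.org/catalog")


def resolve_partition(dataset_name):
    """Return (partition, catalog_url) for a dataset, falling back to the
    central instance when no naming rule matches."""
    for predicate, partition, url in RULES:
        if predicate(dataset_name):
            return partition, url
    return DEFAULT


if __name__ == "__main__":
    for ds in ("user.jdoe.test.001",
               "mc.usprod.minbias.evgen.v1",
               "data.run1234.archive.raw"):
        print(ds, "->", resolve_partition(ds))
</verbatim>

Clients (dq2_get, Panda, the monitors) would call such a resolver before contacting any catalog, which is exactly the "site service awareness" point in the list above.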
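
The auto-fallback proposal further up the list ("if FTS breaks, DQ2 breaks") can be sketched as a chain of copy methods tried in order. In DQ2 the first choice would be an FTS-managed transfer; it is omitted here because its invocation is service-specific. The srmcp and globus-url-copy command lines are illustrative only, and a real implementation would live inside the DQ2 site services rather than in a standalone helper.

<verbatim>
# Sketch of automatic failover between transfer methods: try the preferred
# tool first, then progressively simpler fallbacks, and fail only when every
# method has failed.  Command invocations are illustrative, not exact.
import subprocess


def copy_with_fallback(src_url, dst_url, timeout=600):
    methods = [
        ("srmcp", ["srmcp", src_url, dst_url]),
        ("globus-url-copy", ["globus-url-copy", src_url, dst_url]),
    ]
    errors = []
    for name, cmd in methods:
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout)
            if result.returncode == 0:
                return name                      # report which method worked
            errors.append((name, result.returncode))
        except (OSError, subprocess.TimeoutExpired) as exc:
            errors.append((name, str(exc)))      # tool missing or hung: try next
    raise RuntimeError("all transfer methods failed: %r" % (errors,))
</verbatim>

The essential point is the shape: a transfer request should surface an error to DQ2 only after the whole chain has been exhausted.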

LFC comments heard at software week:

  • most of the LFCs are having problems a good part of the day. Every day.
  • LFC is not dataset-optimized at all; it is file-oriented
  • developers refuse to support read-only unauthenticated access
  • bulk operations not provided (see the lookup sketch below)
  • won't expand the API beyond the POSIX world
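
The "bulk operations not provided" point, together with the 100-GUID timing quoted earlier, is easy to illustrate: resolving GUIDs one query at a time pays one round trip per file, whereas an SQL-backed LRC can answer the whole list in a single statement. The sketch below uses sqlite3 only so that it runs standalone; the real OSG LRC is MySQL-backed and its schema is richer than the (guid, pfn) table assumed here.

<verbatim>
# Per-file vs. bulk catalog lookup.  sqlite3 stands in for the MySQL-backed
# LRC so the sketch runs anywhere; the (guid, pfn) schema is a simplification.
import sqlite3


def make_toy_lrc():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE replicas (guid TEXT PRIMARY KEY, pfn TEXT)")
    db.executemany("INSERT INTO replicas VALUES (?, ?)",
                   [("guid-%04d" % i, "srm://se.example.org/file%04d" % i)
                    for i in range(1000)])
    return db


def lookup_one_by_one(db, guids):
    # One query per GUID: one round trip per file, the pattern forced on
    # clients when no bulk call is available.
    return {g: db.execute("SELECT pfn FROM replicas WHERE guid = ?",
                          (g,)).fetchone()[0]
            for g in guids}


def lookup_bulk(db, guids):
    # A single IN (...) query resolves the whole list in one round trip.
    marks = ",".join("?" * len(guids))
    rows = db.execute("SELECT guid, pfn FROM replicas WHERE guid IN (%s)" % marks,
                      list(guids)).fetchall()
    return dict(rows)


if __name__ == "__main__":
    db = make_toy_lrc()
    guids = ["guid-%04d" % i for i in range(100)]
    assert lookup_one_by_one(db, guids) == lookup_bulk(db, guids)
</verbatim>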


Major updates:
-- TorreWenaus - 29 Sep 2006
