Virtual data email thread

From: Rob Gardner 
To: David Malon , David Adams ,
	Ed Frank , Mike Wilde ,
	Alexandre Vaniachine ,
	Torre Wenaus , "Jennifer M. Schopf" 
Subject: Virtual data discussion for Core-Grid meeting
Date: Wed, 01 May 2002 10:52:46 -0500

So we have had a number of discussions about virtual data
and the atlas core sw (adb, hes) over the last two months.
The meeting next week is an opportunity to make some 
progress on defining at least a high level picture of a
virtual data system for atlas, expressed in atlas terms
and components.  This is a difficult job, and there is not 
much time for this on the agenda. The question is how can 
we use our time effectively?  What do we want as an outcome?  
As David A has said much of this discussion needs to involve 
others from the DB group who will not be present (eg RD), 
but at BNL we will have a good group assembled with adb, hes, 
rtag-persis and griphyn expertise.  Perhaps an objective 
would be to begin work on an "initial thoughts" type document 
that could incorporate the essential ideas of adb, hes, and 
Mike's griphyn VDC architecture and language. Such a document 
could be discussed within the DB group at the upcoming SW week. 
I'd be willing to meet Monday evening to work on this
if there is not enough time during the day.  What do others
think?  Too broad, ambitious?  Should we focus on 
implementation or abstract views?  

Rob

ps,

I've been trying to assemble the relevant background documents 
in one place, see "Documents" on the atlas-griphyn webpage

http://www.usatlas.bnl.gov/computing/grid/griphyn/  

If there are others of interest let me know and I'll
post them there as well.

========================================================================

From: Mike Wilde 
To: Rob Gardner , David Malon ,
	David Adams , Ed Frank ,
	Mike Wilde ,
	Alexandre Vaniachine ,
	Torre Wenaus , "Jennifer M. Schopf" 
Subject: Re: Virtual data discussion for Core-Grid meeting
Date: Wed, 01 May 2002 11:55:30 -0500

Rob, Torre, and all - thanks for getting this meeting organized! I know 
this is a lot of work and it's hard to forge a consensus by email on how to 
proceed.

My (GriPhyN) goal is to work with ATLAS to understand its requirements, and 
to create GriPhyN deliverables to meet those requirements that fall into 
GriPhyN's focus areas. At this meeting I'd like to make good progress 
towards those ends, and my comments below (supporting Rob's proposal) are 
with this in mind. If this can be done while still achieving the other 
needs of the meeting participants, that would be great. If not, let's talk 
about how/if we can do this through some subsequent meeting in the near future.

At 10:52 AM 5/1/2002 -0500, Rob wrote:

>So we have had a number of discussions about virtual data
>and the atlas core sw (adb, hes) over the last two months.
>The meeting next week is an opportunity to make some
>progress on defining at least a high level picture of a
>virtual data system for atlas, expressed in atlas terms
>and components.

I would be glad to present an update on where we are at with virtual data 
(still shooting for June for a first release of tools), on where the 
architecture has evolved since we met in Feb, on how we plan to work with 
WP1, and answers to some of the questions that were posed in Feb (hiding 
files from  science users; the nature of the user interface to virtual 
data; etc).

>This is a difficult job, and there is not
>much time for this on the agenda. The question is how can
>we use our time effectively?  What do we want as an outcome?
>As David A has said much of this discussion needs to involve
>others from the DB group who will not be present (eg RD),
>but at BNL we will have a good group assembled with adb, hes,
>rtag-persis and griphyn expertise.  Perhaps an objective
>would be to begin work on an "initial thoughts" type document
>that could incorporate the essential ideas of adb, hes, and
>Mike's griphyn VDC architecture and language.

A design and planning document sounds like a great idea. Can we set the 
specific bounds of such a document before the meeting so that we could jump 
right in? For example, let's look at the project commitments we jointly have 
to meet by this October and next, and plan the systems that have to be 
built, modified, and integrated to meet those goals.

- ATLAS data challenges
- ATLAS project plan milestones
- ditto for GriPhyN, iVDGL, and other projects involved - a specific and 
finite set

>Such a document
>could be discussed within the DB group at the upcoming SW week.
>I'd be willing to meet Monday evening to work on this
>if there is not enough time during the day.

I would be willing to work on this as late into the night as necessary.

>  What do others
>think?  Too broad, ambitious?  Should we focus on
>implementation or abstract views?

I would propose a focus on implementation with the abstractions to guide 
us. We need to distill the abstractions down to the right set of 
guidelines. For example, what would a concrete implementation of the Atlas 
Database Model look like? Specifically how does it differ from HES?

Could we focus the meeting on getting some concrete pieces (Athena/Gaudi, 
MAGDA, HES, VDC, and Grid components) to work together in a phased plan, 
with milestones being the data challenges and specific demos? Where 
"getting" means creating plans and component-level designs?

I would like to propose that we plan:

- how to have a virtual-data enabled event production system running by 
fall 2002
- how to create a demo/mockup of a distributed analysis system comparable 
to what CMS is planning by fall 2002
- how to have a usable, evolvable analysis system deployed by fall 2003

- both systems should be backed by production quality file tracking and 
metadata databases
- both systems should have well-defined user interfaces that enhance 
productivity (both GUI and script-accessible)

I need to run now, but would like to continue the dialog by mail so that by 
Friday we are in sync.

Torre, Rob, and all - please don't hesitate to defer all this if this is not 
what you think this meeting needs to cover.

Regards,

Mike


>Rob
>
>ps,
>
>I've been trying to assemble the relevant background documents
>in one place, see "Documents" on the atlas-griphyn webpage
>
>http://www.usatlas.bnl.gov/computing/grid/griphyn/
>
>If there are others of interest let me know and I'll
>post them there as well.

========================================================================

From: Ed Frank 
To: Mike Wilde 
Cc: Rob Gardner , David Malon ,
	David Adams ,
	Alexandre Vaniachine ,
	Torre Wenaus , "Jennifer M. Schopf" 
Subject: Re: Virtual data discussion for Core-Grid meeting
Date: Wed, 1 May 2002 15:55:15 -0500


On Wed, May 01, 2002 at 11:55:30AM -0500, Mike Wilde wrote:

> A design and planning document sounds like a great idea. Can we set the 
> specific bounds of such a document before the meeting so that we could jump 
> right in?

I am trying to gather sketches resulting from the few meetings we have
had and put them into distributable format, elaborating upon them a
bit.  My desire in doing this is to elicit the interconnections
between the various components and thus allow us to ask what travels
on those links.  Remember, this grew from the task (last meeting) of
determining how to interconnect the ADB and the Grid.  Thus, the
diagrams try to understand better the interconnect between ADB,
replication managers, Virtual data, etc.  We have not succeeded in
getting response in the email list nor have we made satisfactory
progress in our own meetings.  Therefore these diagrams are intended
to be strawmen that people can correct.

	-Ed

-- 
Ed Frank                                             Office:  (773) 702-7475
http://hep.uchicago.edu/~efrank                      Fax   :  (773) 702-1914
Enrico Fermi Institute /    University of Chicago  / Chicago, IL USA

========================================================================

From: Rob Gardner 
To: David Malon , David Adams ,
	Ed Frank , Mike Wilde ,
	Alexandre Vaniachine ,
	Torre Wenaus , "Jennifer M. Schopf" 
Subject: [Re: Virtual data discussion for Core-Grid meeting]
Date: Thu, 02 May 2002 10:08:43 -0500

Mike,

This looks like a reasonable approach to me.  On yesterday's
atlas grid telecon several people mentioned that it would be
helpful to focus on some of the near term components of the VDC,
what they are and their schedule for delivery.  In addition, we should
try to provide feedback to you for what would be useful to ATLAS
in the Fall release of VDT.

Rob

========================================================================

From: David Adams 
To: Mike Wilde 
Cc: Rob Gardner , David Malon ,
	Ed Frank ,
	Alexandre Vaniachine ,
	Torre Wenaus , "Jennifer M. Schopf" 
Subject: Re: [Re: Virtual data discussion for Core-Grid meeting]
Date: Thu, 02 May 2002 12:39:56 -0400

Rob, Mike and all:

I agree with the goals of trying to understand the role of virtual data in 
ATLAS and how to integrate with the GriPhyN virtual data effort. I also 
agree that producing a document is the way to focus our thinking and to 
share our thoughts with the rest of the ATLAS collaboration. I am 
willing to devote some effort to these tasks.

However I am not sure we can make much progress on a document until we 
resolve some fundamental issues. These include the scope of the document 
and a picture of data processing and analysis in ATLAS. I hope we can 
resolve some of these issues during Monday afternoon's session. I am 
preparing a talk to try to frame the issues as I understand them. I will 
try to get far enough to post a version this evening. I would be glad to 
have input from any of you to include in the talk or as a separate 
presentation during the session.

There are a couple issues that stand out in my mind:

1. As already indicated, we need a better understanding of how ATLAS 
expects to view event data. The ADB and HES views are somewhat different 
and neither is likely to be the final word. David, Ed and I are planning 
to address this issue next week although we may have to defer much of 
this discussion until the day after the core/grid meeting.

2. My understanding of the GriPhyN virtual data system is that it only 
views data at the file level and describes transformations of files into 
other files. I believe ATLAS will want to address virtual data at the 
finer granularity of event data objects and the coarser granularity of 
datasets. We will also want to support replication (and regeneration) at
the finer level. I suspect that a file-based virtual data system will be 
too constraining for ATLAS. Of course many of the ideas of the VDS can 
be carried over.

I would like to collect any useful talks, papers or web pages that might 
aid in our discussion of ATLAS virtual data. I will post these at

   http://www.usatlas.bnl.gov/~dladams/vdata

Please send me whatever you have or know of.

da

========================================================================

From: Alexandre Vaniachine 
To: David Adams 
Cc: Mike Wilde , Rob Gardner ,
	David Malon , Ed Frank ,
	Alexandre Vaniachine ,
	Torre Wenaus ,
	"Jennifer M. Schopf" , Pavel Nevski 
Subject: Re: [Re: Virtual data discussion for Core-Grid meeting]
Date: Thu, 2 May 2002 13:02:29 -0500 (CDT)

Hi David,

On Thu, 2 May 2002, David Adams wrote:

> 2. My understanding of the GriPhyN virtual data system is that it only 
> views data at the file level and describes transformations of files into 
> other files. I believe ATLAS will want to address virtual data at the 
> finer granularity of event data objects and the coarser granularity of 
> datasets. We will also want to support replication (and regeneration) at
> the finer level. I suspect that a file-based virtual data system will be 
> too constraining for ATLAS. Of course many of the ideas of the VDS can 
> be carried over.

I do not think that transformations are limited to just physical files. 
These are, in fact, "logical files", which I believe can be collections of 
physical files (or rather ATLAS "event collections").

I consider GriPhyN's diamond DAG example to correspond well to the 
preferred ATLAS scatter-gather data processing architecture, in which, e.g., 
the single Event Generator file (or file collection) is processed by 
scattering data over many simulation jobs running in parallel and 
producing multiple small outputs, with those outputs then 
gathered into a collection of larger files during the next
QA/filtering transformation.
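
As a rough illustration of the scatter-gather (diamond) pattern described
above, here is a minimal sketch. It is not GriPhyN VDL or ATLAS code: the
function names (scatter, simulate, gather), the event representation, and
the job and file counts are all hypothetical and chosen only to show the
shape of the data flow.

```python
# Illustrative sketch only: all names here are hypothetical and do not
# correspond to GriPhyN VDL, Athena/Gaudi, or any other real interface.

def scatter(events, n_jobs):
    """Split one generator output into n_jobs partitions (round-robin)."""
    return [events[i::n_jobs] for i in range(n_jobs)]

def simulate(partition):
    """Stand-in for one simulation job: tags each event as simulated."""
    return [("sim", ev) for ev in partition]

def gather(outputs, events_per_file):
    """Merge many small outputs into fewer, larger files."""
    merged = sorted(ev for out in outputs for ev in out)
    return [merged[i:i + events_per_file]
            for i in range(0, len(merged), events_per_file)]

# Top of the diamond: a single event generator "file" of ten events.
generated = list(range(10))

# Middle of the diamond: four parallel simulation jobs, each with a small output.
small_outputs = [simulate(p) for p in scatter(generated, 4)]

# Bottom of the diamond: outputs gathered into larger files of five events each.
large_files = gather(small_outputs, 5)
```

In this toy version the gather step stands in for the QA/filtering
transformation; a real one would also drop or flag events.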

Regards,
Sasha

========================================================================

From: Mike Wilde 
To: David Adams , Mike Wilde 
Cc: Rob Gardner , David Malon ,
	Ed Frank ,
	Alexandre Vaniachine ,
	Torre Wenaus , "Jennifer M. Schopf" 
Subject: Re: [Re: Virtual data discussion for Core-Grid meeting]
Date: Thu, 02 May 2002 13:19:27 -0500

At 12:39 PM 5/2/2002 -0400, David Adams wrote:
>Rob, Mike and all:
>
>I agree with the goals of trying to understand the role of virtual data in 
>ATLAS and how to integrate with the GriPhyN virtual data effort. I also 
>agree that producing a document is the way to focus our thinking and to 
>share our thoughts with the rest of the ATLAS collaboration. I am willing 
>to devote some effort to these tasks.
>
>However I am not sure we can make much progress on a document until we 
>resolve some fundamental issues. These include the scope of the document 
>and a picture of data processing and analysis in ATLAS. I hope we can 
>resolve some of these issues during Monday afternoon's session. I am 
>preparing a talk to try to frame the issues as I understand them. I will 
>try to get far enough to post a version this evening. I would be glad to 
>have input from any of you to include in the talk or as a separate 
>presentation during the session.

Yes, I agree completely.


>There are a couple issues that stand out in my mind:
>
>1. As already indicated, we need a better understanding of how ATLAS 
>expects to view event data. The ADB and HES views are somewhat different 
>and neither is likely to be the final word. David, Ed and I are planning 
>to address this issue next week although we may have to defer much of this 
>discussion until the day after the core/grid meeting.

On the virtual data side we, too, have been trying to create a 
computer scientist's view of physics analysis - a data flow model which 
abstracts the processing in a manner that tries to be independent of the 
heavy science issues embedded in the data. I'll try to get that onto paper 
before Monday, to compare and discuss.

>2. My understanding of the GriPhyN virtual data system is that it only 
>views data at the file level and describes transformations of files into 
>other files. I believe ATLAS will want to address virtual data at the 
>finer granularity of event data objects and the coarser granularity of 
>datasets. We will also want to support replication (and regeneration) at
>the finer level. I suspect that a file-based virtual data system will be 
>too constraining for ATLAS. Of course many of the ideas of the VDS can be 
>carried over.

Files are just a starting point. We have discussed how to describe 
datasets; we need to understand better how to go in the other direction, 
towards finer granularity. In this mode, how would the EDO's be exchanged 
between programs? Would EDO's *within* a file get updated in place, as 
opposed to creating a new file?

We're eager to get both of these modes incorporated into the VDC in a 
timeframe that meets the needs of the experiments. If we can jointly figure 
out how to do this, we'll go off and implement it to your requirements.


>I would like to collect any useful talks, papers or web pages that might 
>aid in our discussion of ATLAS virtual data. I will post these at
>
>   http://www.usatlas.bnl.gov/~dladams/vdata
>
>Please send me whatever you have or know of.
>
>da

I will send you a recent paper on VDL experiments with SDSS to post.

Mike

========================================================================

From: David Adams 
To: Alexandre Vaniachine 
Cc: Mike Wilde , Rob Gardner ,
	David Malon , Ed Frank ,
	Torre Wenaus ,
	"Jennifer M. Schopf" , Pavel Nevski 
Subject: Re: [Re: Virtual data discussion for Core-Grid meeting]
Date: Thu, 02 May 2002 14:45:20 -0400

Sasha:

I agree the GriPhyN model is expressed in terms of logical files and I 
did not mean to imply otherwise. I understand logical file to reference 
a single file and any of its replicas. Even if we extend it to a 
collection of files, I find the model to be too restrictive. I expect 
that a dataset can be formed by taking part of the data from one file 
and combining with pieces from other files. In addition I envision 
replication at a level below that of a file so that there are different 
logical files that can be used to construct the same dataset.
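
To make the distinction concrete, here is a small sketch of a dataset built
from parts of several logical files, with replication below the file level.
It is only an illustration under assumed names: the logical file names, the
event ids, and the build_dataset helper are hypothetical and are not part of
ADB, HES, or the GriPhyN system.

```python
# Hypothetical model: each logical file maps event ids to event data objects.
# None of these names come from ADB, HES, or the GriPhyN virtual data catalog.
logical_files = {
    "lfn_a": {1: "edo1", 2: "edo2", 3: "edo3"},
    "lfn_b": {4: "edo4", 5: "edo5"},
    # A sub-file replica: copies of selected events from lfn_a and lfn_b.
    "lfn_c": {2: "edo2", 4: "edo4"},
}

def build_dataset(selection):
    """Build a dataset from (logical file name, event ids) pairs."""
    dataset = {}
    for lfn, event_ids in selection:
        for eid in event_ids:
            dataset[eid] = logical_files[lfn][eid]
    return dataset

# The same dataset constructed from two different sets of logical files.
from_originals = build_dataset([("lfn_a", [2]), ("lfn_b", [4])])
from_replica = build_dataset([("lfn_c", [2, 4])])
```

A purely file-based catalog would see lfn_a, lfn_b and lfn_c as three
unrelated files; the point of the sketch is that the dataset is defined by
its event data objects, not by the files that happen to hold them.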

I interpret your comment as restricting ATLAS to transforming file 
collections (more precisely collections of logical files). This is more 
restrictive than what I assume but may be adequate. If we are to make 
such a choice, we should make it consciously and not as a consequence of 
adopting the GriPhyN data system.

It is unfortunate you cannot be at next week's meeting.

da

========================================================================

From: David Adams 
To: Mike Wilde 
Cc: Rob Gardner , David Malon ,
	Ed Frank ,
	Alexandre Vaniachine ,
	Torre Wenaus , "Jennifer M. Schopf" 
Subject: Re: [Re: Virtual data discussion for Core-Grid meeting]
Date: Thu, 02 May 2002 15:03:51 -0400

Mike:

To respond to your second comment, I am glad to hear you are considering 
extending the VDS beyond the file transformation model. Data in a file 
is never modified. In my (possibly too extreme) view, it is EDO's that 
are being transformed and files are just the containers in which they 
are exchanged between jobs.

The file transformation model implies that all events in a file have 
been transformed in the same way. This might be acceptable. Again, I 
think we need to better spell out the ATLAS requirements before too much 
work is done on extending VDS.

I look forward to our discussions next week.

da

========================================================================


Last modified: Thu May 2 16:33:12 EDT 2002