Title: Towards Iterators in the Virtual Data Language
1. Towards Iterators in the Virtual Data Language
- Luc Moreau
- Electronics and Computer Science
- University of Southampton
- on sabbatical at UofC/ANL
- L.Moreau@ecs.soton.ac.uk
2. Contents
- Brief VDL Overview
- Types to specify data set structures
- Abstract data sets vs. physical data sets
- Physical representation of data sets
- Queries over data sets
- Towards iterators for VDL2
3. Virtual Data Scenario
- Manage workflow
- On-demand data generation
- Update workflow following changes
- Explain how to derive a result, e.g. for file8
[Figure: workflow of five program invocations]
- psearch -t 10 -i file3 file4 file5 -o file8
- summarize -t 10 -i file6 -o file7
- reformat -f fz -i file2 -o file3 file4 file5
- conv -l esd -o aod -i file2 -o file6
- simulate -t 10 -o file1 file2
4. Virtual Data Describes Analysis Workflow
[Figure: workflow graph; labels shown: simulate -t 10, reformat -f fz, conv -I esd -o aod, summarize -t 10, psearch -t 10, file1, file2, file6, file7, file8, "Requested dataset".]
- The recorded virtual data recipe here is:
- Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
- Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate
5. Virtual Data Describes Analysis Workflow
[Figure: workflow graph; labels shown: simulate -t 10, reformat -f fz, conv -I esd -o aod, summarize -t 10, psearch -t 10, file1, file2, file6, file7, file8, "Requested file".]
- To re-create file8, step 1:
- simulate > file1, file2
6. Virtual Data Describes Analysis Workflow
[Figure: workflow graph; labels shown: simulate -t 10, reformat -f fz, conv -I esd -o aod, summarize -t 10, psearch -t 10, file1, file2, file6, file7, file8, "Requested file".]
- To re-create file8, step 2:
- Files 3, 4, 5, 6 are derived from file 2
- reformat > file3, file4, file5
- conv > file6
7. Virtual Data Describes Analysis Workflow
[Figure: workflow graph; labels shown: simulate -t 10, reformat -f fz, conv -I esd -o aod, summarize -t 10, psearch -t 10, file1, file2, file6, file7, file8, "Requested file".]
- To re-create file8, step 3:
- File 7 depends on file 6
- summarize > file7
8. Virtual Data Describes Analysis Workflow
[Figure: workflow graph; labels shown: psearch -t 10, simulate -t 10, summarize -t 10, file7, file8, "Requested file".]
- To re-create file8, final step:
- File 8 depends on files 1, 3, 4, 5, 7
- psearch < file1, file3, file4, file5, file7 > file8
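The step-by-step re-creation of file8 on the preceding slides amounts to walking the recorded recipe in dependency order. The following is a minimal Python sketch of that idea, not the VDL implementation: the dictionaries simply transcribe the "Files" and "Programs" recipe, and the function name `derivation_plan` is a hypothetical choice for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# "Files" recipe: each file maps to the files it is derived from.
FILE_DEPS = {
    "file8": {"file1", "file3", "file4", "file5", "file7"},
    "file7": {"file6"},
    "file3": {"file2"},
    "file4": {"file2"},
    "file5": {"file2"},
    "file6": {"file2"},
}

# "Programs" recipe: each file maps to the program that produces it.
PRODUCER = {
    "file8": "psearch",
    "file7": "summarize",
    "file3": "reformat",
    "file4": "reformat",
    "file5": "reformat",
    "file6": "conv",
    "file1": "simulate",
    "file2": "simulate",
}


def derivation_plan(target):
    """Return the programs to run, in a valid order, to re-create `target`."""
    # Collect only the part of the recipe that the target depends on.
    needed = {}
    stack = [target]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed[name] = FILE_DEPS.get(name, set())
        stack.extend(needed[name])

    # Visit files with their inputs first; run each program once.
    plan = []
    for name in TopologicalSorter(needed).static_order():
        program = PRODUCER[name]
        if program not in plan:
            plan.append(program)
    return plan


plan_for_file8 = derivation_plan("file8")
plan_for_file7 = derivation_plan("file7")
```

For file8 the plan starts with simulate and ends with psearch; the relative order of reformat and conv is not fixed, since the slides note that both depend only on file 2.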
9. VDL in One Slide
- Transformation
- Abstract template of program invocation
- Similar to "function definition"
- Derivation
- Function call to a Transformation
- Invocation
- Record of a Derivation execution
- Storage of both past and future
- Derivation: a recipe of how data products can be generated
- Provenance: a record of how data products were generated
- These XML documents reside in a virtual data catalog (VDC), a relational database
10. VDL Describes Workflow via Data Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
[Figure: file1 -> x1 -> file2 -> x2 -> file3]
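The point of this slide is that the workflow is never stated explicitly: x2 must follow x1 only because x2 reads the file that x1 writes. A minimal Python sketch of that inference follows; the `Derivation` class and the `dependencies` function are hypothetical names for illustration, not part of VDL.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Derivation:
    """Illustrative record of a DV: a call to a transformation on named files."""
    name: str
    transformation: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def dependencies(derivations: List[Derivation]) -> Dict[str, Set[str]]:
    """Map each derivation to the derivations whose output files it reads."""
    # Which derivation produces each file?
    producer = {f: d.name for d in derivations for f in d.outputs}
    return {
        d.name: {producer[f] for f in d.inputs if f in producer}
        for d in derivations
    }


# The two derivations from the slide.
x1 = Derivation("x1", "tr1", inputs=("file1",), outputs=("file2",))
x2 = Derivation("x2", "tr2", inputs=("file2",), outputs=("file3",))
deps = dependencies([x1, x2])
```

Here file1 has no producer, so it is treated as an external input, and x2 is found to depend on x1 through file2.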
11. Workflow example
- Graph structure
- Fan-in
- Fan-out
- "left" and "right" can run in parallel
- Needs external input file
- Located via replica catalog
- Data file dependencies
- Form graph structure
[Figure: workflow graph with nodes preprocess, findrange, findrange, analyze]
12. VDL Shortcomings
- Currently, VDL has no iterator over data sets.
- Users have to go through an awkward process:
- Outside VDL, select a subset of the data set
- Generate the mapping between logical and physical files
- Generate the workflow DAX
- Run the workflow
- This prevents true compositionality of transformations and automated provenance tracking.
13. Virtual Data Sets?
- Can we extend the idea of Virtual Data to Virtual Data Sets?
- Can we separate the abstract description of a data set from its physical implementation?
- Can we define transformations in terms of abstract data set descriptions, and use such transformations on different physical representations?
- Can the system take care of casting between the different physical representations?
14. Types to Specify Data Sets
- Type declaration inspired by C type constructs
- Structs allow us to refer to elements by name
- Arrays allow us to refer to elements by index
- Typedefs allow us to name types
- e.g.
- Foo
- int a
- int b
- Bar c
- Hux d
-
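As a rough illustration of the three constructs above, the sketch below mirrors them in Python: a dataclass stands in for a struct (access by name), a list for an array (access by index), and a type alias for a typedef. The contents of `Bar` and `Hux` are not given on the slide, so their definitions here are assumptions made only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bar:
    """Assumed content; the slide does not define Bar."""
    label: str


# "typedef": give a name to an array type (assumed definition of Hux).
Hux = List[int]


@dataclass
class Foo:
    """Struct with the four fields listed on the slide."""
    a: int
    b: int
    c: Bar
    d: Hux


foo = Foo(a=1, b=2, c=Bar(label="x"), d=[10, 20, 30])
by_name = foo.c.label   # struct: refer to an element by name
by_index = foo.d[1]     # array: refer to an element by index
```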
15. QuarkNet Example
16. SDSS DR2
17. SDSS DR2
- How to deal with the encoding of file names?
- fpObjc-100000-3-0110.fit
- run: 100000
- column: 3
- field number: 0110
- Data sets may be given attributes and associated values, i.e. key-value pairs.
18. XML Schema
- XML Schemas express shared vocabularies and allow machines to carry out rules made by people.
- They provide a means for defining the structure, content and semantics of XML documents.
- http://www.w3.org/XML/Schema
Data Sets
19. XML Schema benefits
- W3C Standard
- Lots of existing tools (editors, validators, query languages)
- Adopted by Web and Grid services
- Does it mean VDL should have an XML syntax? No! (cf. VDLt vs. VDLx)
20. XML Schemas for Describing Data Sets
- XML Schemas provide a good mechanism to specify the structure of data sets.
- Uniform way of representing both data within files and sets outside files.
- Good because in the long run this will allow us to express workflows that operate both on data sets and their file contents.
- DR2 key-value pairs in fit file headers could be referred to using this mechanism.
- It is not a requirement to express the contents of a file (e.g. not desirable for binary formats), but it is a possibility that can be used when convenient.
21. Physical Representation of Data Sets
- The QuarkNet HEPSearch example comprises 4 different formats for a small logical data set (Excel and ASCII files for 2002 and 2003).
- Transformations (written as Perl programs) expect data sets in a specific format.
- DR2 is available from a local directory or an HTTP URL: http://das.sdss.org/DR2/data/
- Subsets were made available to us as tarballs.
23. How to express physical representation?
- As a first approximation, for each type, we need to provide:
- The kind of physical data container used to represent this element, e.g. directory, URL, file, etc.
- Two conversion functions:
- Read function: given the physical representation, how to define the name of the abstract object and its attributes.
- Write function: given the abstract representation, how to construct the name of the physical object.
- To be complete, we need to identify the element name and the complex type in which it appears (and possibly its context).
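To make the read/write pair concrete, here is a minimal Python sketch for the fpObjc file name quoted earlier (fpObjc-100000-3-0110.fit). The function names and the exact digit widths in the pattern are assumptions inferred from that single example, not the converters used in the actual system.

```python
import re

# Assumed layout, inferred from "fpObjc-100000-3-0110.fit":
# fpObjc-<run, 6 digits>-<column, 1 digit>-<field, 4 digits>.fit
_FPOBJC_NAME = re.compile(r"^fpObjc-(\d{6})-(\d)-(\d{4})\.fit$")


def read_fpobjc(physical_name):
    """Read function: physical file name -> (abstract name, attributes)."""
    match = _FPOBJC_NAME.match(physical_name)
    if match is None:
        raise ValueError(f"not an fpObjc file name: {physical_name!r}")
    run, column, field = match.groups()
    return "fpObjc", {"run": int(run), "column": int(column), "field": int(field)}


def write_fpobjc(attributes):
    """Write function: abstract attributes -> physical file name."""
    return "fpObjc-{run:06d}-{column}-{field:04d}.fit".format(**attributes)


name, attrs = read_fpobjc("fpObjc-100000-3-0110.fit")
round_trip = write_fpobjc(attrs)
```

Keeping the two functions inverse to each other is what would let a system cast a data set from one physical representation to another by reading with one converter and writing with a different one.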
24. How to express physical representation?
- Work still in progress. Current programmatic representation for reading DR2:
- Each entry gives, in order: context type, physical representation, element, function to convert the element name.
- null, "Dir", "imaging", "ConvertToSelf"
- "ReRun", "Dir", "objcs", "ConvertToSelf"
- "Imaging", "Dir", "run", "ConvertRun"
- "Run", "Dir", "rerun", "ConvertReRun"
- "Objcs", "Dir", "camcol", "ConvertCamcol"
- "Camcol", "File(BLOB)", "fpBIN", "ConvertFpBIN"
- "Camcol", "File(BLOB)", "fpM", "ConvertFpM"
- "Camcol", "File(BLOB)", "psField", "ConvertPsField"
- "Camcol", "File(BLOB)", "fpFieldStat", "ConvertFpFieldStat"
- "Camcol", "File(BLOB)", "fpAtlas", "ConvertFpAtlas"
- "Camcol", "File(BLOB)", "fpObjc", "ConvertFpObjc"
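One way to read the entries above is as a lookup table keyed by context type and element name. The Python sketch below transcribes the entries and adds a hypothetical `lookup` helper; it only returns the container kind and the converter's name as strings, since the converters themselves are not shown on the slide.

```python
# (context type, physical representation, element, converter name),
# transcribed from the slide; None stands for the "null" context.
REPRESENTATIONS = [
    (None, "Dir", "imaging", "ConvertToSelf"),
    ("ReRun", "Dir", "objcs", "ConvertToSelf"),
    ("Imaging", "Dir", "run", "ConvertRun"),
    ("Run", "Dir", "rerun", "ConvertReRun"),
    ("Objcs", "Dir", "camcol", "ConvertCamcol"),
    ("Camcol", "File(BLOB)", "fpBIN", "ConvertFpBIN"),
    ("Camcol", "File(BLOB)", "fpM", "ConvertFpM"),
    ("Camcol", "File(BLOB)", "psField", "ConvertPsField"),
    ("Camcol", "File(BLOB)", "fpFieldStat", "ConvertFpFieldStat"),
    ("Camcol", "File(BLOB)", "fpAtlas", "ConvertFpAtlas"),
    ("Camcol", "File(BLOB)", "fpObjc", "ConvertFpObjc"),
]


def lookup(context_type, element):
    """Return (physical representation, converter name) for an element
    appearing inside the given context type."""
    for ctx, container, name, converter in REPRESENTATIONS:
        if ctx == context_type and name == element:
            return container, converter
    raise KeyError((context_type, element))


fpobjc_entry = lookup("Camcol", "fpObjc")
root_entry = lookup(None, "imaging")
```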
25. Query examples
- Use of XPath 1.0 as the query language
- Simple directory-like navigation of data sets
- /DR2/imaging/run[1]/rerun[1]/objcs/camcol[3]/fpBIN[10]
- 1st run, 1st rerun, 3rd camcol, 10th fpBIN file
26. Query examples
- Short-cuts
- /DR2//fpBIN
- fpBIN files at any depth
27. Query examples
- Use of attributes
- /DR2/imaging/run[@number='1239']/rerun[@number='6']/objcs/camcol[1]/fpObjc[@field='110']
- Run 1239, rerun 6, fpObjc files with field 110
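The three kinds of query on these slides (positional navigation, the any-depth short-cut, and attribute predicates) can be tried on a toy data set with Python's standard `xml.etree.ElementTree`. This is a sketch under stated assumptions: the tree below is invented for illustration, ElementTree implements only a subset of XPath 1.0, and it does not accept absolute paths on an element, so each query is written relative to the DR2 root.

```python
import xml.etree.ElementTree as ET


def build_toy_dr2():
    """Build a small, made-up DR2 tree: one run, one rerun, three camcols,
    each holding fpBIN and fpObjc entries for fields 101..112."""
    dr2 = ET.Element("DR2")
    imaging = ET.SubElement(dr2, "imaging")
    run = ET.SubElement(imaging, "run", number="1239")
    rerun = ET.SubElement(run, "rerun", number="6")
    objcs = ET.SubElement(rerun, "objcs")
    for camcol_number in range(1, 4):
        camcol = ET.SubElement(objcs, "camcol", number=str(camcol_number))
        for field in range(101, 113):
            ET.SubElement(camcol, "fpBIN", field=str(field))
            ET.SubElement(camcol, "fpObjc", field=str(field))
    return dr2


dr2 = build_toy_dr2()

# Directory-like navigation: 1st run, 1st rerun, 3rd camcol, 10th fpBIN.
tenth_bin = dr2.findall("imaging/run[1]/rerun[1]/objcs/camcol[3]/fpBIN[10]")

# Short-cut: fpBIN entries at any depth.
all_bins = dr2.findall(".//fpBIN")

# Attributes: run 1239, rerun 6, first camcol, fpObjc with field 110.
by_attribute = dr2.findall(
    "imaging/run[@number='1239']/rerun[@number='6']"
    "/objcs/camcol[1]/fpObjc[@field='110']"
)
```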
28. VDC Queries
- In practice, queries over data sets should relate not only to the structure of the data set but also to metadata contained in the virtual data catalog.
- XPath queries support functions, and we have pre-defined a function to query the catalog.
29. VDC Queries (example)
- Get all fpAtlas files in run 1239 of DR2 such that the metadata attribute isUseful is set to yes in the VDC.
- /DR2//run[@number='1239']//fpAtlas[vdcmetadata('isUseful')='yes']
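The query above combines a structural selection with a catalog lookup. Python's `xml.etree.ElementTree` cannot register custom XPath functions, so the sketch below swaps in a different technique: the structural part runs as a path query, and the metadata test is applied as an ordinary Python filter. The catalog dictionary, the `name` attribute, the file names and the helper functions are all hypothetical stand-ins for the VDC.

```python
import xml.etree.ElementTree as ET

# Made-up data set: two runs, each with fpAtlas entries.
DR2_XML = """
<DR2><imaging>
  <run number="1239"><rerun number="6"><objcs><camcol number="1">
    <fpAtlas name="fpAtlas-001239-1-0110.fit"/>
    <fpAtlas name="fpAtlas-001239-1-0111.fit"/>
  </camcol></objcs></rerun></run>
  <run number="2000"><rerun number="6"><objcs><camcol number="1">
    <fpAtlas name="fpAtlas-002000-1-0110.fit"/>
  </camcol></objcs></rerun></run>
</imaging></DR2>
"""

# Hypothetical stand-in for the virtual data catalog: name -> metadata.
VDC_METADATA = {
    "fpAtlas-001239-1-0110.fit": {"isUseful": "yes"},
    "fpAtlas-001239-1-0111.fit": {"isUseful": "no"},
    "fpAtlas-002000-1-0110.fit": {"isUseful": "yes"},
}


def vdc_metadata(name, key):
    """Look up one metadata attribute in the stand-in catalog."""
    return VDC_METADATA.get(name, {}).get(key)


def useful_atlas_files(dr2, run_number):
    """Names of fpAtlas entries in the given run whose isUseful is 'yes'."""
    path = f".//run[@number='{run_number}']//fpAtlas"
    return [
        element.get("name")
        for element in dr2.findall(path)
        if vdc_metadata(element.get("name"), "isUseful") == "yes"
    ]


dr2 = ET.fromstring(DR2_XML)
useful = useful_atlas_files(dr2, "1239")
```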
30. Iterators in VDL2
- Selection of a subset of a data set
- define dataset1 with type, format
- dataset2 = select <<xpath_expr>> in dataset1
31. Iterators in VDL2
- Iterating over a subset of a data set
- dataset2 = forall x in <<xpath_expr>> of dataset1 call <<transformation>> x, params,
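The two proposed constructs can be mimicked with Python generators, which also gives a feel for lazy traversal of large data sets. This is only a sketch of the intended semantics as read from the slides, not VDL2 itself: `select`, `forall`, the toy data set and the `describe` transformation are all hypothetical.

```python
import xml.etree.ElementTree as ET


def select(xpath_expr, dataset):
    """Lazily yield the elements of `dataset` that match `xpath_expr`."""
    yield from dataset.iterfind(xpath_expr)


def forall(xpath_expr, dataset, transformation, *params):
    """Lazily call `transformation(x, *params)` for each selected element x."""
    for x in select(xpath_expr, dataset):
        yield transformation(x, *params)


def describe(element, prefix):
    """Hypothetical transformation: build a label for one element."""
    return f"{prefix}:{element.tag}:{element.get('field')}"


# Made-up data set with three fpBIN entries and one fpM entry.
dataset1 = ET.fromstring(
    "<camcol>"
    "<fpBIN field='101'/><fpBIN field='102'/><fpM field='101'/>"
    "<fpBIN field='103'/>"
    "</camcol>"
)

subset = list(select("fpBIN", dataset1))
dataset2 = list(forall("fpBIN", dataset1, describe, "out"))
```

Because both functions are generators, nothing is traversed until the result is consumed, so a caller could stop early or process elements one at a time.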
32. Conclusion
- Separating abstract data type from physical representation is powerful.
- In the spirit of a semantic description.
- Useful for casting data sets into the appropriate physical representation requested by a transformation.
- Key ideas presented here have been implemented.
33. Future Work
- Complete the language to specify physical encoding
- Support other physical representations (Stateful Grid Services, Databases, etc.)
- Specify VDL2 iterators
- Large data sets: lazy traversal of data sets, checkpointing of traversal state, recovery from failures.
34. Acknowledgements
- The GriPhyN team at UoC
- Ian Foster
- Mike Wilde
- Jens Voeckler
- Yong Zhao