Title: The Chimera Virtual Data System
1. The Chimera Virtual Data System
- www.griphyn.org/chimera
- Presented by Mike Wilde
- Workflow Workshop
- 3 December 2003
- e-Science Institute, Edinburgh
2. Acknowledgements
- GriPhyN, the Grid Physics Network, is supported by the National Science Foundation Information Technology Research program
- The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde, and Yong Zhao
- The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
- This talk was also delivered at the Data Provenance and Annotation Workshop, 1 Dec 2003
3. The Virtual Data Concept
- Enhance scientific productivity through
- Discovery and application of datasets and programs
- Enabling use of a worldwide data grid as a scientific workstation
- Virtual Data enables this approach by creating datasets from workflow recipes and recording their provenance.
(Figure: Provenance and Virtual Data)
4. Provenance System Goals
- Producing data from transformations with uniform, precise data interface descriptions enables:
- Discovery: finding and understanding datasets and transformations
- Workflow: a structured paradigm for organizing, locating, specifying, and producing scientific datasets
- Forming new workflows
- Building new workflows from existing patterns
- Managing change
- Planning: automated, to make the Grid transparent
- Audit: explanation and validation via provenance
5. Virtual Data Grid Vision
6. Usage Models and Cases
- Domains where it is valuable (and where it is not)? Cost/benefit ratios?
- Batch models
- Cluster finding: laboratory code and data changes, track results
- Interactive models
- Using provenance within interactive dialogs in graphical and textual tools
- Moving back and forth between interactive and batch modes
- Discovery
- Understand / review / audit
- Compose
- Passive provenance recording
- Active provenance declaration
7. Virtual Data Example: Galaxy Cluster Search
(Figure: DAG over Sloan data producing a galaxy cluster size distribution)
Jim Annis, Steve Kent, Vijay Sehkri (Fermilab); Michael Milligan, Yong Zhao (University of Chicago)
8. Virtual Data Application: High Energy Physics Data Analysis
(Figure: analysis chain with metadata such as mass 200, decay WW, stability 1, LowPt 20, HighPt 10000)
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
9. Provenance Scenario
- Manage workflow
- On-demand data generation
- Update workflow following changes
- Explain provenance, e.g. for file8:
  psearch -t 10 -i file1 file3 file4 file5 file7 -o file8
  simulate -t 10 -o file1 file2
  reformat -f fz -i file2 -o file3 file4 file5
  summarize -t 10 -i file6 -o file7
  conv -l esd -o aod -i file2 -o file6
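A minimal sketch (not Chimera code; the data structures are invented here, with the commands and file names taken from the scenario above) of how "explain provenance for file8" can be answered by walking derivation records from outputs back to inputs:

  # Derivation records from the scenario above: (command, inputs, outputs).
  DERIVATIONS = [
      ("simulate -t 10 -o file1 file2", [], ["file1", "file2"]),
      ("reformat -f fz -i file2 -o file3 file4 file5", ["file2"], ["file3", "file4", "file5"]),
      ("conv -l esd -o aod -i file2 -o file6", ["file2"], ["file6"]),
      ("summarize -t 10 -i file6 -o file7", ["file6"], ["file7"]),
      ("psearch -t 10 -i file1 file3 file4 file5 file7 -o file8",
       ["file1", "file3", "file4", "file5", "file7"], ["file8"]),
  ]

  def explain(lfn, depth=0):
      # Print the derivation that produced a logical file, then recurse on its inputs.
      for cmd, inputs, outputs in DERIVATIONS:
          if lfn in outputs:
              print("  " * depth + lfn + " <- " + cmd)
              for dep in inputs:
                  explain(dep, depth + 1)

  explain("file8")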
10. Fundamental Units
- Transformations
- Interface Declarations
- Action Declarations
- Call declaration
- Invocation
- Datasets
- Contents
- Representation
- Location
11. VDL (Virtual Data Language): Describes Data Transformations
- Transformation
- Abstract template of program invocation
- Similar to "function definition"
- Derivation
- Function call to a transformation
- Stores past and future
- A record of how data products were generated
- A recipe of how data products can be generated
- Invocation
- Record of a Derivation execution
12. Example Transformation
- TR t1( out a2, in a1, none pa="500", none env="100000" ) {
    argument = "-p "${pa};
    argument = "-f "${a1};
    argument = "-x -y";
    argument stdout = ${a2};
    profile env.MAXMEM = ${env};
  }
(Figure: a1 -> t1 -> a2)
13. Example Transformation Calls (Derivations)
- DV d1->t1( env="20000", pa="600", a2=@{out:run1.exp15.T1932.summary}, a1=@{in:run1.exp15.T1932.raw} );
- DV d2->t1( a1=@{in:run1.exp16.T1918.raw}, a2=@{out:run1.exp16.T1918.summary} );
14. Workflow from File Dependencies
- TR tr1( in a1, out a2 ) {
    argument stdin = ${a1};
    argument stdout = ${a2};
  }
- TR tr2( in a1, out a2 ) {
    argument stdin = ${a1};
    argument stdout = ${a2};
  }
- DV x1->tr1( a1=@{in:file1}, a2=@{out:file2} );
- DV x2->tr2( a1=@{in:file2}, a2=@{out:file3} );
(Figure: file1 -> x1 -> file2 -> x2 -> file3)
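A small sketch (in-memory structures assumed for illustration, not Chimera's catalog API) showing that the ordering of x1 and x2 follows purely from matching logical file names: x1 writes file2 and x2 reads it, so the edge x1 -> x2 appears in the derived DAG:

  # Each derivation lists the logical file names it reads and writes.
  derivations = {
      "x1": {"in": ["file1"], "out": ["file2"]},
      "x2": {"in": ["file2"], "out": ["file3"]},
  }

  # An edge exists wherever one derivation's output is another's input.
  edges = [(producer, consumer)
           for producer, p in derivations.items()
           for consumer, c in derivations.items()
           if producer != consumer and set(p["out"]) & set(c["in"])]

  print(edges)  # [('x1', 'x2')]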
15. Example Invocation
- Completion status and resource usage
- Attributes of the executable transformation
- Attributes of input and output files
16. Example Workflow
- Complex structure
- Fan-in
- Fan-out
- "left" and "right" can run in parallel
- Uses input file
- Register with RC
- Complex file dependencies
- Glues workflow
(Figure: diamond-shaped DAG: preprocess feeds two findrange steps, which feed analyze)
17. Workflow step "preprocess"
- TR preprocess turns f.a into f.b1 and f.b2
- TR preprocess( output b[], input a ) {
    argument = "-a top";
    argument = " -i "${input:a};
    argument = " -o " ${output:b};
  }
- Makes use of the "list" feature of VDL
- Generates 0..N output files; the number of output files depends on the caller
18. Workflow step "findrange"
- Turns two inputs into one output
- TR findrange( output b, input a1, input a2, none name="findrange", none p="0.0" ) {
    argument = "-a "${name};
    argument = " -i " ${a1} " " ${a2};
    argument = " -o " ${b};
    argument = " -p " ${p};
  }
- Uses the default argument feature
19. Can also use list parameters
- TR findrange( output b, input a[], none name="findrange", none p="0.0" ) {
    argument = "-a "${name};
    argument = " -i " ${" "|a};
    argument = " -o " ${b};
    argument = " -p " ${p};
  }
20. Workflow step "analyze"
- Combines intermediary results
- TR analyze( output b, input a[] ) {
    argument = "-a bottom";
    argument = " -i " ${a};
    argument = " -o " ${b};
  }
21. Complete VDL workflow
- Generate the appropriate derivations:
- DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
- DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
- DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
- DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
22. Compound Transformations
- Using compound TRs permits composition of complex TRs from basic ones
- Calls are independent unless linked through LFNs
- A call is effectively an anonymous derivation
- Late instantiation at workflow generation time
- Permits bundling of repetitive workflows
- Model: function calls nested within a function definition
23. Compound Transformations (cont)
- TR diamond encapsulates the diamond workflow
- TR diamond( out fd, io fc1, io fc2, io fb1, io fb2, in fa, p1, p2 ) {
    call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
    call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
    call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
    call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
  }
24. Compound Transformations (cont)
- Multiple DVs allow easy generator scripts:
- DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
- DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
- ...
- DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
25. Dataset Requirements
- File
- Set of files
- Object closure
- XML element
- Relational query or spreadsheet range
- Set of files with relational index
- New user-defined dataset type
26. Possible Dataset Type Model
- Types are used for:
- Managing dataset representation
- Determining argument conformance in invocations
- Discovery of datasets and transformations
- Two parallel type hierarchies separate representation and semantics (see the sketch below)
- Representational: organizes and specifies families of dataset representations
- Logical: organizes and specifies application-specific semantics of datasets
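A hedged sketch of how two parallel hierarchies could drive argument conformance. The class names are taken from the next slide, but the subtype relationships and the conformance check itself are assumptions made for illustration, not the Chimera type system:

  # Representational hierarchy: how a dataset is laid out.
  class FileDataset: pass
  class File(FileDataset): pass
  class FileSet(FileDataset): pass
  class TarFileSet(FileSet): pass

  # Logical hierarchy: what the dataset means to the application.
  class EventCollection: pass
  class RawEventSet(EventCollection): pass

  class Dataset:
      def __init__(self, rep_type, logical_type):
          self.rep_type = rep_type
          self.logical_type = logical_type

  def conforms(ds, want_rep, want_logical):
      # An actual argument conforms if both of its types are subtypes of
      # the types a transformation declares for that parameter.
      return (issubclass(ds.rep_type, want_rep)
              and issubclass(ds.logical_type, want_logical))

  raw = Dataset(TarFileSet, RawEventSet)
  print(conforms(raw, FileSet, EventCollection))  # True
  print(conforms(raw, File, EventCollection))     # False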
27. Example Dataset Types (Non-leaf Types are Superclasses)
(Figure: two parallel type trees. Representational: FileDataset, File, FileSet, MultiFileSet, TarFileSet. Logical: EventCollection, RawEventSet, SimulatedEventSet, MonteCarloSimulation, DiscreteEventSimulation.)
28. Dataset Representation Descriptor
- Defines a dataset's physical layout
- Permits transformations to access datasets
- Structure is defined by the dataset type; examples:
- File: <lfn>, e.g. <evt.02>
- MultiFileSet: <lfn>, e.g. <evt.03, evt.04, evt.05>
- TarFileSet: <lfn, taropts>, e.g. <evts.1998, "-b50 -z">
- Relation: <<odbc><select ...>>, e.g. <server name="db.mcs.anl.gov" db="hepdb" id="uchep"/><query request="select * from evt where eid>2897 and eid<3945"/>
- Stored in the dataset catalog (a sketch of possible catalog entries follows below)
- Format constrained by the dataset type definition
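A hedged sketch of what dataset-catalog entries for the descriptor examples above might look like. The field names and entry keys are invented for illustration and are not the actual Chimera catalog schema:

  catalog = {
      "evt.02":    {"type": "File",         "lfn": "evt.02"},
      "evtset.a":  {"type": "MultiFileSet", "lfn": ["evt.03", "evt.04", "evt.05"]},
      "evts.1998": {"type": "TarFileSet",   "lfn": "evts.1998", "taropts": "-b50 -z"},
      "hep.sel":   {"type": "Relation",
                    "server": {"name": "db.mcs.anl.gov", "db": "hepdb", "id": "uchep"},
                    "query": "select * from evt where eid>2897 and eid<3945"},
  }

  def well_formed(entry):
      # The descriptor format is constrained by the dataset type: a File entry
      # names one LFN, a TarFileSet also carries tar options, and so on.
      required = {"File": ["lfn"], "MultiFileSet": ["lfn"],
                  "TarFileSet": ["lfn", "taropts"], "Relation": ["server", "query"]}
      return all(field in entry for field in required[entry["type"]])

  print(all(well_formed(e) for e in catalog.values()))  # True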
29. Provenance Schema
(Figure: provenance schema diagram, with metadata describing its objects)
30. Observations
- A provenance approach based on interface definition and data-flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
- Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
31. Vision for Provenance in the Large
- Universal knowledge management and production systems
- Vendors integrate the provenance tracking protocol into data processing products
- Ability to run anywhere in the Grid
32. Virtual Data Grid Vision
33. System Requirements: Services and Interfaces
- Provenance databases, servers, virtual machines, workflow composers
- Provenance navigation portals and webs
- Embedded tracing systems, especially within interactive tools (SPSS, ROOT, Excel, etc.)
- Catalog integration: replica catalogs, metadata catalogs, transformation catalogs; integrity, coherence, interoperability
- Interaction between provenance systems and workflow systems
34. Provenance Servers
- OGSA-based Grid services
- Discovery, security, resource management
- Supports code and data discovery and workflow management
- Object names (TR, DS, TY, DV, IV) can be used as global cross-server links (sketched below)
- Derivations can reference remote transformations and datasets
- Structured object namespaces and object-level access control enable large VO collaboration
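A purely hypothetical illustration of the cross-server linking idea; the name syntax and server hosts below are invented, since the slide only states that object names can serve as global links between provenance servers:

  # Hypothetical fully qualified object names: server / namespace / kind / name.
  remote_tr = "provenance.example-a.org/hep/TR/findrange"
  remote_ds = "provenance.example-b.org/sdss/DS/f.b1"

  local_dv = {
      "name":  "provenance.example-c.org/demo/DV/left",
      "calls": remote_tr,      # the derivation references a remote transformation
      "reads": [remote_ds],    # and a remote dataset, purely by global name
  }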
35. Provenance Hyperlinks
36. Indexing Provenance Servers to Support Discovery
37. Challenges
- What is the unit of change? Dataset? File? Object?
- Relation to the worlds of HDF, CDF, FITS, and many others
- Does a dataset type have multiple dimensions?
- Dataset names/handles
- Unification of processing models: application, SQL, service
- Closure and reflection: are transformations and workflows datasets? Can we track provenance of annotations?
- Version management: mutability, timestamps
- Garbage collection, retention, pruning
- Distribution: what standards and naming protocols are needed? Catalogs, schemas?
- Theoretical models? Unification of fine-grained and coarse-grained models?