Title: The GriPhyN Virtual Data System
- When we know exactly how to produce or re-create a data object, it becomes virtual: the recipe for creating the object can in many contexts act as a virtual stand-in for the physical object. By tracking the recipes, or provenance, of a collaboration's data, the Chimera Virtual Data System provides powerful data management capabilities for the Grid:
  - the ability to audit data and know exactly how it was created
  - the ability to discover datasets of interest within massive data collections
  - the ability to specify a virtual workspace of data for selective instantiation at later times
  - the ability to re-create data that has been deleted, lost, or damaged
  - the ability to capture, exchange, and reason about the patterns of scientific and business workflows
How it works
- A user or a user interface prepares a document in VDL, the Virtual Data Language, which describes the interface (inputs, outputs, and environment) of data-transforming applications and the arguments to call them with to produce specific data objects.
- The VDL definitions are stored in a Virtual Data Catalog.
- Abstract workflows are produced by tools that traverse the VDL dependency graph and emit an abstract XML workflow.
- Executable workflows for grids running the GriPhyN Virtual Data Toolkit (VDT) are run through the DAGMan, Condor-G, and GRAM services.
This research is a collaboration of Ewa Deelman, Ian Foster, Carl Kesselman, Gaurang Mehta, Douglas Scheftner, Karan Vahi, Jens Voeckler, Mike Wilde, and Yong Zhao.
For code downloads and more info:
www.griphyn.org/vds
www.griphyn.org/vdt
www.globus.org
www.cs.wisc.edu/condor
Virtual Data Defines Workflow
Galaxy Cluster Analysis
Location-independent Workflow for OSG and TeraGrid
Manage workflow
On-demand data generation
Patch workflow following changes
Explain provenance, e.g. file8
psearch -t 10 -i file3 file4 file5 -o file8
summarize -t 10 -i file6 -o file7
reformat -f fz -i file2 -o file3 file4 file5
conv -l esd -o aod -i file2 -o file6
simulate -t 10 -o file1 file2
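The derivation chain above can be read as a dependency graph: each command consumes some files and produces others. As a minimal sketch (not the VDS implementation), the Python below records the five commands as plain tuples and walks the graph backwards from a requested file. The names `Step`, `STEPS`, and `steps_for` are hypothetical and chosen only for illustration; option flags such as `-t 10` are omitted.

```python
from typing import Iterable, List, NamedTuple


class Step(NamedTuple):
    """One recorded derivation: a program plus the files it reads and writes."""
    name: str
    inputs: List[str]
    outputs: List[str]


# The five derivations from the example, reduced to their file dependencies.
STEPS = [
    Step("simulate", [], ["file1", "file2"]),
    Step("conv", ["file2"], ["file6"]),
    Step("reformat", ["file2"], ["file3", "file4", "file5"]),
    Step("summarize", ["file6"], ["file7"]),
    Step("psearch", ["file3", "file4", "file5"], ["file8"]),
]


def steps_for(target: str, steps: List[Step],
              existing: Iterable[str] = ()) -> List[str]:
    """Return the step names needed to produce `target`, dependencies first.

    Files listed in `existing` are treated as already materialized, so the
    steps that would only re-create them are left out.
    """
    made_by = {out: step for step in steps for out in step.outputs}
    have = set(existing)
    ordered: List[str] = []
    seen = set()

    def visit(filename: str) -> None:
        step = made_by.get(filename)
        if filename in have or step is None or step.name in seen:
            return
        seen.add(step.name)
        for dep in step.inputs:
            visit(dep)
        ordered.append(step.name)  # appended only after all of its inputs

    visit(target)
    return ordered


if __name__ == "__main__":
    print(steps_for("file8", STEPS))                      # full provenance
    print(steps_for("file8", STEPS, existing={"file2"}))  # on-demand subset
```

With nothing materialized, the walk for file8 yields the full recipe; passing the files that already exist trims the plan to the steps still needed, which is the idea behind on-demand generation and re-creation of lost data.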
By Jim Annis, Steve Kent, Vijay Sehkri, and Neha Sharma (Fermilab); Michael Milligan and Yong Zhao (U of Chicago)
Virtual Data Language
- Transformation
  - Similar to a "function definition"
  - Specifies formal parameters
- Derivation
  - Similar to a "function call"
  - Specifies actual parameters
  - Records how data products were generated
  - Recipe for re-generation
- Invocation
  - Record of a derivation execution
TR tr1(in a1, out a2) {
  profile hints.exec-pfn = "/usr/bin/app1";
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  profile hints.exec-pfn = "/usr/bin/app2";
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
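To make the function-definition / function-call analogy concrete, here is a small Python sketch that models the two transformations and two derivations from the VDL sample. It is an illustration under stated assumptions, not the VDS data model: the class names, the `stdin`/`stdout` fields, and the shell-style command rendering are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transformation:
    """A TR: the 'function definition', with formal parameters."""
    name: str
    executable: str          # value of the hints.exec-pfn profile
    formals: Dict[str, str]  # formal name -> direction, "in" or "out"
    stdin: str               # formal whose file feeds standard input
    stdout: str              # formal whose file receives standard output


@dataclass
class Derivation:
    """A DV: the 'function call', binding actual files to the formals."""
    name: str
    transformation: Transformation
    actuals: Dict[str, str]  # formal name -> logical file name

    def __post_init__(self) -> None:
        # Every formal must be bound, and nothing else may be passed.
        if set(self.actuals) != set(self.transformation.formals):
            raise ValueError(f"{self.name}: actuals do not match formals")

    def files(self, direction: str) -> List[str]:
        """Logical files bound to formals of the given direction."""
        tr = self.transformation
        return [self.actuals[f] for f, d in tr.formals.items() if d == direction]

    def command_line(self) -> str:
        """Render the call as a shell-style command (illustrative only)."""
        tr = self.transformation
        return (f"{tr.executable} < {self.actuals[tr.stdin]}"
                f" > {self.actuals[tr.stdout]}")


tr1 = Transformation("tr1", "/usr/bin/app1", {"a1": "in", "a2": "out"},
                     stdin="a1", stdout="a2")
tr2 = Transformation("tr2", "/usr/bin/app2", {"a1": "in", "a2": "out"},
                     stdin="a1", stdout="a2")
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
x2 = Derivation("x2", tr2, {"a1": "file2", "a2": "file3"})

if __name__ == "__main__":
    print(x1.command_line())
    print(x2.command_line())
```

Because the output of x1 (file2) is the input of x2, the two derivations chain together; this shared file name is what lets a tool infer the dependency graph from the catalog.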
This research is supported by the National
Science Foundation under contract ITR-0086044
(GriPhyN). Data Grid research is supported by the
Mathematical, Information, and Computational
Sciences Division subprogram of the Office of
Advanced Scientific Computing Research, U.S.
Department of Energy, under Contract
W-31-109-Eng-38 (Data Grid Toolkit).