Title: Scientific Data Management: From Data Integration to Analytical Pipelines
1 Scientific Data Management: From Data Integration to Analytical Pipelines
Bertram Ludäscher ludaesch_at_sdsc.edu
- Data Knowledge Systems
- San Diego Supercomputer Center
- University of California, San Diego
2 Outline
- Motivation: Scientific Data Integration Problems
- Semantic (Model-based) Mediation
- Scientific Workflows and Analytical Pipelines
3 Acknowledgements
- National Science Foundation (NSF)
- www.nsf.gov
- GEOsciences Network (NSF)
- www.geongrid.org
- Biomedical Informatics Research Network (NIH)
- www.nbirn.net
- Science Environment for Ecological Knowledge (NSF)
- seek.ecoinformatics.org
- Scientific Data Management Center (DOE)
- sdm.lbl.gov/sdmcenter/
4 An Online Shopper's Information Integration Problem
El Cheapo: Where can I get the cheapest copy (including shipping cost) of Wittgenstein's Tractatus Logico-Philosophicus within a week?
One-World Scenario: XML-based mediator
Mediator (virtual DB) (vs. Datawarehouse)
5 A Home Buyer's Information Integration Problem
What houses for sale under 500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
Multiple-Worlds Mediation
6 Some BIRNing Data Integration Questions
Biomedical Informatics Research Network http://nbirn.net
- Data Integration Approaches
- Let's just share data, e.g., link everything from a web page!
- ... or better put everything into a relational or XML database
- ... and do remote access using the Grid
- ... or just use Web services!
- Nice try. But:
- Find the files where the amygdala was segmented.
- Which other structures were segmented in the same files?
- Did the volume of any of those structures differ much from normal?
- What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
7 A Neuroscientist's Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
Complex Multiple-Worlds Mediation
8 Information Integration Challenges
Heterogeneities S4...
- System Aspects
- platforms, devices, distribution, APIs, protocols, ...
- Syntaxes
- heterogeneous data formats (one for each tool ...)
- Structures
- heterogeneous schemas (one for each DB ...)
- heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, ...)
- Semantics
- unclear, hidden semantics, e.g., incoherent terminology, multiple taxonomies, implicit assumptions, ...
9 Information Integration Challenges
- System aspects: Grid Middleware
- distributed data and computing
- Web Services, WSDL/SOAP, OGSA, ...
- sources: functions, files, data sets, ...
- Syntax and Structure
- (XML-Based) Data Mediators
- wrapping, restructuring
- (XML) queries and views
- sources: (XML) databases
- Semantics
- Model-Based/Semantic Mediators
- conceptual models and declarative views
- Knowledge Representation: ontologies, description logics (RDF(S), OWL ...)
- sources: knowledge bases (DB+CMs+ICs)
10 Information Integration from a DB Perspective
- Information Integration Problem
- Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
- Find: the answers to Q1, ..., Qn
- The Database Perspective: source = database
- Si has a schema (relational, XML, OO, ...)
- Si can be queried
- define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
- questions become queries Qi against V(S1, ..., Sk)
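The database perspective above can be sketched with an in-memory SQLite database. This is only an illustration: the two source tables, their columns, and the view name v_offers are made up for the example, and a real mediator would keep the sources remote behind wrappers rather than in one database.

```python
import sqlite3

# Two hypothetical sources S1 and S2 with different schemas (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s1_books(title TEXT, price_usd REAL);
    CREATE TABLE s2_offers(book_name TEXT, cost REAL, shipping REAL);
    INSERT INTO s1_books VALUES ('Tractatus', 12.0), ('Investigations', 20.0);
    INSERT INTO s2_offers VALUES ('Tractatus', 9.5, 4.0);

    -- Integrated view V(S1, S2): one common schema over both sources.
    CREATE VIEW v_offers AS
        SELECT title AS title, price_usd AS total, 's1' AS source FROM s1_books
        UNION ALL
        SELECT book_name, cost + shipping, 's2' FROM s2_offers;
""")

def cheapest(title):
    """A user question posed as a query against the view, not the sources."""
    return conn.execute(
        "SELECT source, total FROM v_offers WHERE title = ? ORDER BY total LIMIT 1",
        (title,),
    ).fetchone()

print(cheapest("Tractatus"))
```

The question is written once against the view; which source answers it is decided by the view definition.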
11 Standard (XML-Based) Mediator Architecture
[Figure: sources S1, S2, ..., Sk, each with a Wrapper exporting an (XML) View; wrappers implemented as web services]
12 Scientific Data Integration ... Questions to Queries ...
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
Complex Multiple-Worlds Mediation
GeoPhysical (gravity contours)
Geologic Map (Virginia)
GeoChronologic (Concordia)
Foliation Map (structure DB)
GeoChemical
13 Towards Shared Conceptualizations: Data Contextualization via Concept Spaces
14 Rock Classification Ontology
Genesis
Fabric
Composition
Texture
15 Some enabling operations on ontology data
- Concept expansion
- what else to look for when asking for Mafic
Composition
16 Some enabling operations on ontology data
- Generalization
- finding data that is like X and Y
Composition
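The two ontology operations can be sketched over a small is-a hierarchy. The hierarchy below is a toy assumption for illustration, not the rock classification ontology from the slides; expansion collects everything below a concept, and generalization finds the most specific common ancestor.

```python
# Hypothetical is-a hierarchy: child -> parent (labels are illustrative only).
IS_A = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
    "Mafic": "Composition",
    "Felsic": "Composition",
}

def expand(concept):
    """Concept expansion: the concept plus everything below it."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in IS_A.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def ancestors(concept):
    """The concept followed by its chain of ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def generalize(x, y):
    """Generalization: the most specific concept that covers both x and y."""
    ys = set(ancestors(y))
    for c in ancestors(x):
        if c in ys:
            return c
    return None
```

A query for "Mafic" would then also look for data labeled with any concept in expand("Mafic").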
18 Example: Geologic Map Integration
19 GEON and Semantic Data Integration
20 Mediator Demo
21 Getting Formal: Source Contextualization and Ontology Refinement in Logic
22 Distributed Query Processing Challenges Part I: The Basics
- Scientific data (BIRN, GEON, ...): variant of the data integration problem studied by the database (CS) community
- Given
- user query against integrated view
- view to source mappings (GAV/LAV)
- sources with limited access patterns
- Compute a distributed query plan P s.t.
- P has a feasible execution order
- P is optimized wrt. time/space/networking complexity
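A feasible execution order under limited access patterns can be found greedily, because the set of bound variables only grows as sources are called. The sketch below assumes each source is described by the variables it needs as input and the variables it returns; the source and variable names in the usage note are hypothetical, and no cost-based optimization is attempted.

```python
def feasible_order(goals, bound):
    """Greedy ordering: run a goal once all of its input variables are bound.

    goals: list of (name, inputs, outputs), with inputs/outputs as sets of variables.
    bound: variables supplied by the user query.
    Returns a list of goal names, or None if no feasible order exists.
    """
    bound = set(bound)
    pending = list(goals)
    plan = []
    while pending:
        ready = next((g for g in pending if g[1] <= bound), None)
        if ready is None:
            return None  # every remaining source needs an unbound input
        pending.remove(ready)
        plan.append(ready[0])
        bound |= ready[2]
    return plan
```

For example, with goals [("protein_db", {"protein"}, {"distribution"}), ("homology", {"species"}, {"protein"})] and the query binding {"species"}, the homology source has to be called first.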
23 Real-time Observatories, Applications, and Data management Network
- Autonomous field sensors
- Seismic, oceanic, climate, ecological, ..., video, audio, ...
- RT Data Acquisition
- ANZA Seismic Network (1981-present): 13 Broadband Stations, 3 Borehole Strong Motion Arrays, 5 Infrasound Stations, 1 Bridge Monitoring System
- Kyrgyz Seismic Network (1991-present): 10 Broadband Stations
- IRIS PASSCAL Transportable Array (1997-present): 15 - 60 Broadband and Short Period Stations
- IDA Global Seismic Network (1990-present): 38 Broadband Stations
- High Performance Wireless Research Network (HPWREN)
- High performance backbone network: 45Mbps duplex point-to-point links, backbone nodes at quality locations, network performance monitors at backbone sites
- High speed access links: hard to reach areas, typically 45Mbps or 802.11 radios, point-to-point or point-to-multipoint
- Data Grid Technology (SRB)
- collaborative access to distributed heterogeneous data, single sign-on authentication and seamless authorization, data scaling to Petabytes and 100s of millions of files, data replication, etc.
24 A P2P Problem from ROADNet
- Networks of ORBs send each other various data streams
- Avoid actual loops in the presence of virtual loops
- A → B → C
- A -c1→ B
- B -c2→ C
- C -c3→ A
- ...
- Idea: L(c1) ∩ L(c2) ∩ L(c3)
- In the real system: unix regexps
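One reading of the idea above is that a virtual loop becomes an actual loop only if some stream name is accepted by every channel filter along the cycle. Deciding this for arbitrary regular expressions needs automata machinery, so the sketch below makes a simplifying assumption: a finite catalog of known stream names is available and each one is tested against all filters. The stream names and patterns are invented, and Python regular expressions stand in for the unix regexps of the real system.

```python
import re

def looping_streams(filters, stream_names):
    """Return the stream names accepted by every filter on a cycle.

    filters: regular expressions, one per channel (c1, c2, c3, ...).
    stream_names: a finite catalog of known stream names (an assumption here).
    """
    compiled = [re.compile(f) for f in filters]
    return [s for s in stream_names if all(p.fullmatch(s) for p in compiled)]

catalog = ["AZ_BHZ", "AZ_BHN", "KN_BHZ", "AZ_LHZ"]
cycle = [r"AZ_.*", r".*_BH.", r".*Z"]
print(looping_streams(cycle, catalog))
```

An empty result means no cataloged stream can travel all the way around the cycle, so the virtual loop is harmless for those streams.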
25 Scientific Workflows and Analytical Pipelines
26 Scientific Workflows/Analytical Pipelines over Brain Data
27 Example: Promoter Identification Workflow (PIW) (simplified)
- scientific data sets flow between the steps
- abstraction of tasks into higher conceptual levels
- branching/merging of tasks and looping
28 SEEK Vision Overview
- Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, ...
- Fundamental improvements for researchers:
- Global access to ecologically relevant data
- Rapidly locate and utilize distributed computation
- Capture, reproduce, extend analysis process
29 SEEK Components
- EcoGrid
- Seamless access to distributed, heterogeneous data: ecological, biodiversity, environmental data
- Semantically mediated and metadata driven
- Centralized search and management portal(s)
- Analysis and Modeling System
- Capture, reproduce, and extend analysis process
- Declarative means for documenting analysis
- Pipeline system for linking generic analysis steps
- Strong version control for analysis steps
- Easy-to-use interface between data and analysis
- Semantic Mediation System
- smart data discovery, type-correct pipeline construction and data binding
- determine whether/how to link analytic steps
- determine how data sets can be combined
- determine whether/how data sets are appropriate inputs for analysis steps
30 AMS Overview
- Objective
- Create a semi-automated system for analyzing data and executing models that provides documentation, archiving, and versioning of the analyses, models, and their outputs (visual programming language?)
- Scope
- Any type of analysis or model in ecology and biodiversity science
- Massively streamline the analysis and modeling process
- Archiving, rerunning analyses in SAS, Matlab, R, SysStat, C(++), ...
31 SMS Requirements from AMS
- ...assist users in determining the appropriateness of combining various analytical steps and data sources based on semantic mediation...
- Semantic mediation should occur in three areas:
- determine whether it is appropriate to link together particular analytic steps.
- mediate between multiple data sets to determine in what ways they can be combined.
- determine whether the selected data sources are appropriate inputs for the selected analysis.
32 Some functional requirements
- SMS should have the ability to ...
- FR1: recognize data types (XML Schema types!? EML types?) of registered EcoGrid data sets
- FR2: recognize semantic types (OWL and/or RDF(S)!?) of registered EcoGrid data sets
- FR3: recognize registered EcoGrid ontologies
- Note: semantic types reference those ontologies
- FR4: recognize data type signature (XML Schema? WSDL?) of analytical steps (ASs)
- FR5: recognize semantic type signature of analytical steps
- FR6: recognize semantic constraints (OWL? First-order? What syntax? KIF? Prolog?)
- Note: data schemas and signatures of analytical steps have those
33 ... some functional requirements
- Ability to ...
- FR8: check well-typedness (data and semantics) of a data set wrt. an analytical step
- FR9: check compatibility of two data sets wrt. "generalized operations" between those data sets (e.g., "semantic" join and union)
- FR10: check well-typedness (data and semantics) of chained analytical steps
- FR11: introduce data type conversions (e.g., int → float)
- FR12: perform and "explain" semantic type substitutions
- (e.g., if some AS works for Cs and D-isa-C, it also works for Ds)
- FR13 (optional): generate type-correct APs from a given schema of desired output and (optionally) input parameters
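Requirements such as checking well-typedness, introducing data type conversions, and explaining semantic type substitutions can be illustrated with a small checker. This is a minimal sketch under stated assumptions: the is-a table, the conversion registry, and every concept name are hypothetical and not taken from SEEK, and a port's type is reduced to a (data type, semantic type) pair.

```python
# Hypothetical registries; none of these names come from SEEK itself.
IS_A = {"StreamTemperature": "Temperature", "Temperature": "Measurement"}
CONVERSIONS = {("int", "float"): float}

def is_subconcept(d, c):
    """True if d equals c or reaches c by following is-a links (D-isa-C)."""
    while d is not None:
        if d == c:
            return True
        d = IS_A.get(d)
    return False

def check_link(produced, expected):
    """Check one data set (or step output) against one step input.

    Both arguments are (data_type, semantic_type) pairs. Returns a list of
    explanations, or raises TypeError if the link is not well-typed.
    """
    notes = []
    (dtype, sem), (want_dtype, want_sem) = produced, expected
    if not is_subconcept(sem, want_sem):
        raise TypeError(f"semantic mismatch: {sem} is not a {want_sem}")
    if sem != want_sem:
        notes.append(f"substituted {sem} for {want_sem} (is-a)")
    if dtype != want_dtype:
        if (dtype, want_dtype) not in CONVERSIONS:
            raise TypeError(f"no conversion {dtype} -> {want_dtype}")
        notes.append(f"insert conversion {dtype} -> {want_dtype}")
    return notes
```

The returned notes play the role of the "explanation": they record which substitution and which conversion made the link acceptable.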
34 Use Cases
- Clients of the SMS include the AMS, the EcoGrid, and "scientific workflow engineers".
- UC1: Client requests type signature (data and semantic types) of a registered EcoGrid data set (DS)
- UC2: Client requests "other semantic constraints" of a DS.
- UC3: Client requests type signature (data and semantic types) of an analytical step (AS)
- UC4: Client requests "other semantic constraints" of an AS.
- UC5: Client requests type signature of an AP.
- UC6: Client requests type checking of AP.
- UC7: Client requests registered data sets compatible with the inputs of an AS (e.g., if AS is scale sensitive, then all data sets must have the same scale; a flag is raised if data needs scaling).
- UC8: Client requests all registered ASs which can produce a given parameter (the latter is part of a registered ontology)
- UC9: Client requests candidate predecessor and successor steps for a given AS.
35 Planned Components
- SW1: Formal language(s) for representing/instantiating data types, semantic types, ontologies, and "other semantic constraints".
- SW2: System for data type checking and inference (includes introduction of data type conversion steps)
- SW3: System for semantic type checking and inference
- SW4 (optional): System for "planning" APs given some of output parameters, data sets, and input parameters
36 THE PROBLEM: Reconcile this
- Simple, intuitive graph/pipeline language,
- which is expressive enough to handle real-world flows (SciDAC PIW),
- and allows some static analysis
- while trying to leverage existing work
- e.g., Ptolemy-II directors: Process Networks (PN), Synchronous Dataflow (SDF), ...,
- or workflow standards and systems
37 (Analytical) Pipelines ... (Scientific) Workflows
- Spectrum of languages and formalisms
- Pipelines (a la Unix)
- Dataflow languages
- Kahn's process networks (PN)
- Synchronous dataflow networks (SDF)
- Web page-flow
- Active XML, WebML, ...
- Hesitating-weak-alternating-tree-automata-ML
- ...
- (Business) Workflows
- WfMC's XPDL, WSFL, BPEL4WS, ...
38 Kahn Process Networks (PN)
- Concurrent processes communicating through one-way FIFO channels with unbounded capacity
- A functional process F maps a set of input sequences into a set of output sequences (sounds like XSM!)
- increasing chain of sets of sequences → outputs may not increase!
- Consider increasing chains (wrt. prefix ordering ≤) of streams
- PN is continuous if lub(Xs) exists for all increasing chains Xs and
- F(lub(Xs)) ≤ lub(F(Xs))
- Continuous implies monotonic
- if Xs ≤ Ys then F(Xs) ≤ F(Ys)
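Monotonicity with respect to the prefix ordering can be checked on finite streams. The sketch below is illustrative only: running_sum and total_only are made-up processes, and finite Python lists stand in for (possibly infinite) streams. A process is monotonic if feeding it a longer input can only extend, never rewrite, the output it has already produced.

```python
def is_prefix(xs, ys):
    """Prefix ordering on finite streams: xs <= ys iff ys starts with xs."""
    return list(ys[:len(xs)]) == list(xs)

def running_sum(xs):
    """A monotonic process: each output token depends only on inputs seen so far."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

def total_only(xs):
    """Not monotonic: the single output token changes as more input arrives."""
    return [sum(xs)]

chain = [[], [1], [1, 2], [1, 2, 3]]  # an increasing chain of input prefixes

def monotonic_on(f, chain):
    """Check F(Xs) <= F(Ys) for consecutive elements Xs <= Ys of the chain."""
    return all(is_prefix(f(a), f(b)) for a, b in zip(chain, chain[1:]))
```

On this chain, running_sum passes the check while total_only fails it, which is why a process like total_only cannot emit its output before the input stream has ended.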
39 Process Networks (cont'd)
- PN in essence: simultaneous relations between sequences
- Network of functional processes can be described by a mapping
- X = F(X, I)
- X denotes all the sequences in the network (inputs I + outputs)
- X that forms a solution is a fixed point
- Continuity implies exactly one minimal fixed point
- minimal in the sense of prefix ordering, for any inputs I
- execution of the network: given I, find the minimal fixed point (works because of the monotonic property)
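Executing a network as a search for the minimal fixed point can be illustrated on a one-stream feedback loop. The network below is an invented example, not one from the slides: the stream x is defined as the token 0 followed by the elementwise sum of the input stream and x itself. Starting from the empty stream and applying F repeatedly yields an increasing chain of prefixes that stabilizes for finite input.

```python
def step(x, inputs):
    """One application of F: x = [0] followed by elementwise inputs + x."""
    return [0] + [i + v for i, v in zip(inputs, x)]

def least_fixed_point(inputs):
    """Iterate F from the empty stream until nothing changes."""
    x = []
    while True:
        nxt = step(x, inputs)
        if nxt == x:
            return x
        x = nxt

print(least_fixed_point([1, 2, 3]))
```

Each iterate is a prefix of the next one, which is the monotonic property at work; the initial token 0 is what lets the feedback loop start at all.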
40 Synchronous Data Flow Networks (SDF)
- Special case of PN
- Ptolemy-II SDF overview
- SDF supports efficient execution of dataflow graphs that lack control structures
- with control structures → Process Networks (PN)
- requires that the rates on the ports of all actors be known beforehand
- do not change during execution
- in systems with feedback, delays, which are represented by initial tokens on relations, must be explicitly noted → SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins.
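Because the port rates are fixed, the number of firings of each actor per schedule period can be computed before execution by solving the balance equations: for every edge, firings(producer) × tokens produced = firings(consumer) × tokens consumed. The sketch below assumes a connected graph and ignores delays (initial tokens); it is a generic illustration, not Ptolemy-II's scheduler, and the actor names in the usage note are placeholders.

```python
from fractions import Fraction
from math import lcm

def repetitions(actors, edges):
    """Solve the SDF balance equations for a connected graph.

    edges: list of (producer, tokens_produced, consumer, tokens_consumed).
    Returns the smallest positive integer firing counts, or None if the
    rates are inconsistent (no bounded-memory periodic schedule exists).
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
    # Every edge must balance: firings(src) * produced == firings(dst) * consumed.
    if any(rate[s] * p != rate[d] * c for s, p, d, c in edges):
        return None
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}
```

For actors A, B, C with edges ("A", 2, "B", 3) and ("B", 1, "C", 2), one period fires A three times, B twice, and C once. math.lcm requires Python 3.9 or later.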
41 Extended Kahn-MacQueen Process Networks
- A process is considered active from its creation until its termination
- An active process can block when trying to read from a channel (read-blocked), when trying to write to a channel (write-blocked) or when waiting for a queued topology change request to be processed (mutation-blocked)
- A deadlock is when all the active processes are blocked
- real deadlock: all the processes are blocked on a read
- artificial deadlock: all processes are blocked, at least one process is blocked on a write → increase the capacity of the receiver with the smallest capacity amongst all the receivers on which a process is blocked on a write. This breaks the deadlock.
- If the increase results in a capacity that exceeds the value of maximumQueueCapacity, then instead of breaking the deadlock, an exception is thrown. This can be used to detect erroneous models that require unbounded queues.
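The artificial-deadlock rule can be written as a small policy function. This is a sketch of the rule as stated on the slide, not Ptolemy-II code: the function and exception names are invented, receivers are plain dictionary keys, and doubling the capacity is an assumption, since the slide only says the capacity is increased.

```python
class UnboundedQueueError(RuntimeError):
    """Raised when resolving a deadlock would exceed the capacity limit."""

def resolve_artificial_deadlock(capacities, write_blocked, maximum_queue_capacity):
    """Grow the smallest receiver on which some process is write-blocked.

    capacities: dict receiver -> current capacity (mutated in place).
    write_blocked: receivers that currently block a writing process.
    Doubling is an assumption of this sketch; the slide only says "increase".
    """
    if not write_blocked:
        raise ValueError("real deadlock: every process is blocked on a read")
    target = min(write_blocked, key=lambda r: capacities[r])
    new_capacity = capacities[target] * 2
    if new_capacity > maximum_queue_capacity:
        raise UnboundedQueueError(f"{target} would need capacity {new_capacity}")
    capacities[target] = new_capacity
    return target
```

A model that keeps triggering this function until the exception is raised is a candidate for the "requires unbounded queues" diagnosis mentioned above.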
42 Analytical Pipelines: An Open Source Tool
43 A commercial tool for Analytical Pipelines
45 MAP Data Massaging a la Blue-Titan/Perl
46 Compiling Abstract Scientific Workflows into Web Service Workflows
47 The Problem
- Scientist would like to ...
- create a high-level abstract WF and
- not bother about web service URLs, parameter passing, low-level data transformations, ...
- How to go from ...
- a high-level Abstract Workflow (AWF) to
- an Executable (web service) Workflow (EWF)??
- Idea
- Using nested definitions, express AWF in terms of other AWFs and EWFs; unfold definitions at compile-time
- → Abstract-as-View approach
48 WF Language Constructs (AWF+EWF)
[Figure: workflow language constructs, including conditional (cond) branches]
49 Conceptual Workflow
Compute clusters (min. distance)
For each promoter
Select gene-set (cluster-level)
Compute Subsequence labels
For each gene
With all Promoter Models
Compute Joint Promoter Model
50 Abstract Workflow (AWF) (chain program over relations with i/o patterns)
AWF: piw(DB, Gene, TFBSModel) :-
    cDNASequence(Gene, GeneSeq),
    localAlignment(DB, CDNASeq, RankedPromoterList),
    firstRest(Promoter, RankedPromoters, RankedPromoters1),
    promoter_detail(Promoter, PromoterId, Start, End, Orientation),
    cDNASequence(PromoterId, GenomicSeq),
    trim_sequence(GenomicSeq, Start, End, Orientation, ShortSeq),
    convertSeq(Orientation, ShortSeq, PosSeq),
    transfac(PosSeq, TFBSModel).
51 AWF to EWF in graph form
[Figure: AWF, AAV, and EWF levels. Task labels: piw, promoters, tfbs_models, localAlignment, gene_seq, convertToAcc, genbank_service, embl_service, ddbj_service. Data labels: DB, Gene, GeneId, Promoters, TFBSModels, CDNASeq, GenbankId, EMBLId, DDBJId, cDNASeq]
52 AWF → EWF Translation
- Check whether AWF is well-formed and well-typed; if not, corresponding warnings are issued (a semantic type mismatch may not only be a workflow design error, but often indicates the incompleteness of the underlying ontology).
- Next the AWF is successively unfolded, using the AAV view definitions.
- (Compiling AWF into EWF using AAV is similar to rewriting a query against a global schema into queries against the sources.)
- The unfolded logic query plan then undergoes several rewriting steps until a certain normal form (DNF/UCQ?) is reached. If the join variables (the connection edges) are not of the same data type (but at least of compatible semantic types), then the insertion of conversion rules is attempted; if this fails, an error is reported.
- For each list of conjunctive goals, the system tries to find an executable goal order, i.e., one which satisfies all i/o restrictions imposed by the web service descriptions of executable tasks.
- Implementation: a set of Java and Prolog programs, rules, ontologies and repositories
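The unfolding step can be sketched as recursive substitution of view definitions. The definitions in the AAV table below are assumptions made for this example (the task names echo labels from the slides, but the actual definitions are not given there), and the sketch covers only purely conjunctive views, without alternatives, type checking, or goal reordering.

```python
# Hypothetical view definitions: abstract task -> list of sub-tasks.
# Names not listed here are treated as executable (web service) tasks.
AAV = {
    "piw": ["gene_seq", "localAlignment", "transfac"],
    "gene_seq": ["convertToAcc", "genbank_service"],
    "localAlignment": ["blast"],
}

def unfold(task, views, seen=()):
    """Replace abstract tasks by their definitions until only executable tasks remain."""
    if task not in views:
        return [task]
    if task in seen:
        raise ValueError(f"recursive definition of {task}")
    plan = []
    for sub in views[task]:
        plan.extend(unfold(sub, views, seen + (task,)))
    return plan

print(unfold("piw", AAV))
```

The result is a flat list of executable tasks, which is the input a later step would reorder to satisfy the i/o restrictions of the web services.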
53 AWF for Matt's Promoter Identification Workflow
[Figure labels: selectGeneSet, expression Array, geneList, updated Gene List, managegeneLoop / while geneList not EMPTY, LOOP1 for each gene, gene, geneList EMPTY, Loop1 Final]
54 EWF for Matt's Extended Promoter Identification Workflow (w/ loops and conditions)
[Figure labels: prepareClustalWInput, CW Sequence, manageClustalW Loop, geneList, updated GeneList, ClustalW Sequence, noMore Genes, geneListEmpty, geneListNOTEmpty, loop back, partialSeq, geneId, geneNo, orient > 0, orient < 0, complement Sequence, shortSeq, plusSeq, minusSeq, type, pwalignment, TRANSFACMatInspector, inspected TFBSs, sequence, ClustalW, Sequence List, multipleSeq Alignment]
55 Unfolded EWF
[Figure labels: manageClustalW Loop, noMore Genes, geneListEmpty, geneListNOTEmpty, prepareClustalWInput, geneList, updated GeneList, ClustalW Sequence, loop back, geneId, geneNo, format, program, db1, db2, cmd1, cmd2, dopt, partialSeq, BlastRID, RequestId, RId, Genbank1, Genbank2, cDNASeq, seq1, seq2, hitId, list_udis, BlastPromoter, full Genomic Sequence, promoters, promoter List, outputNext Promoter, updated Promoter List, trimSequence, start, from, end, to, orientation, orient, orient > 0, orient < 0, complement Sequence, shortSeq, plusSeq, minusSeq, type, pwalignment, ClustalW, Sequence List, multipleSeq Alignment, TRANSFACMatInspector, inspected TFBSs, sequence]
56 Generated EWF Plan (using BIRN Mediation Tool)
57 Abstract-As-View (AAV) Definitions: Control-Flow Issues
AAV:
cDNASequence(GeneId, CDNASeq) :-
    genbank(GeneId, CDNASeq)
  ; fail(genbank), embl(GeneId, CDNASeq)
  ; fail(genbank), fail(embl), ddbj(GeneId, CDNASeq).
localAlignment(DB, CDNASeq, RankedPromoterList) :-
    blast(CDNASeq, DB, xml, RankedPromoterList)
  ; fail(blast), fasta(CDNASeq, DB, RankedPromoterList)
  ; fail(blast), fail(fasta), blat(CDNASeq, querytype, sortcriteria, outputtype, RankedPromoterList).
convertSeq(Orientation, ShortSeq, PosSeq) :-
    negative(Orientation), complement(ShortSeq, PosSeq)
  ; equals(ShortSeq, PosSeq).
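The control-flow issue in these definitions is failover: an alternative service is tried only after the preceding ones have failed. A minimal sketch of that behavior follows; the three service functions are stand-ins that simulate a failure and two successes, not real Genbank, EMBL, or DDBJ clients, and the helper name first_success is invented.

```python
def first_success(services, *args):
    """Try each service in order; fall through to the next one when it fails."""
    errors = []
    for name, call in services:
        try:
            return name, call(*args)
        except Exception as exc:  # a failed service enables the next alternative
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all alternatives failed: " + "; ".join(errors))

# Stand-in services; real ones would be web service calls.
def genbank(gene_id):
    raise ConnectionError("service unavailable")

def embl(gene_id):
    return f"cDNA-for-{gene_id}"

def ddbj(gene_id):
    return f"cDNA-for-{gene_id}"

def cdna_sequence(gene_id):
    """Mirror of the cDNASequence view: genbank, else embl, else ddbj."""
    return first_success([("genbank", genbank), ("embl", embl), ("ddbj", ddbj)], gene_id)

print(cdna_sequence("X1"))
```

Returning the name of the service that answered keeps the provenance of the result, which matters when the alternatives are not perfectly equivalent.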
58 Abstract Task (AT) Registration Tool
59 Abstract Task (AT) View and Delete
60 Abstract Task (AT) Update
61 AWF Editor
62 Further Problems
- Reconcile
- Simple, intuitive graph/pipeline language,
- which is expressive enough to handle real-world flows (PIW),
- and allows some static analysis
- while trying to leverage existing work
- e.g., Ptolemy-II directors: Process Networks (PN), Synchronous Dataflow (SDF), ...,
- or workflow standards and systems
- Semi-automatic web service composition
- use of semantic and data types to define data transformations
- map prev_step.out → next_step.in
63 (Ptolemy II-Based Architecture)
[Figure: WF-Pilot architecture. Labels: Design (Ptolemy-II), Execution (Ptolemy-II), Execution monitoring (Ptolemy-II); Directors: PN, SDF, ..., XPDL/OFBiz Style Ptolemy-II Director; SciDAC Extensions to Ptolemy-II; Web Service plug-in; AWF, Valid-AWF, Validation Errors; query rewriting, semantic type checking, data type conversion, web service matching; web service invocation of ETs (Genbank, BLAST)]
- ET: Web service
- AT: mini workflow of ETs, or composition of ETs and ATs; may become a web service if deployed
- Abstract Task (AT) Repository
- Data Parameter Ontologies
- Datatype Conversion Repository
- Executable Task (ET) Repository
64 Designing PIW in Scientific Workflow Management System
- User-specified parameters
- The accession numbers, separated by commas,
- The number of promoters to investigate,
- The name of the file to hold the FASTA-format promoter regions.
65 Looking Inside an Abstract Task: Gene Sequence Processing
66 Running the PIW Model
67 The other end: Workflow Languages
68 The ZEN of Workflow Patterns
- Basic Control Patterns
- Sequence - execute activities in sequence
- Parallel Split - execute activities in parallel
- Synchronization - synchronize two parallel threads of execution
- Exclusive Choice - choose one execution path from many alternatives
- Simple Merge - merge two alternative execution paths
- Advanced Branching and Synchronization Patterns
- Multiple Choice - choose several execution paths from many alternatives
- Multiple Merge - merge many execution paths without synchronizing
- Discriminator - merge many execution paths without synchronizing. Execute the subsequent activity only once.
- N-out-of-M Join - merge many execution paths. Perform partial synchronization and execute subsequent activity only once.
- Synchronizing Join - merge many execution paths. Synchronize if many paths are taken. Simple merge if only one execution path is taken
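The Discriminator and N-out-of-M Join patterns can be illustrated with one small class, since a discriminator behaves like a 1-out-of-M join. The sketch is single-threaded and event-driven for clarity; the class name and the reset-after-all-arrivals behavior are assumptions of this example rather than something the slides specify.

```python
class NOutOfMJoin:
    """Fire the subsequent activity once, after n of m incoming paths arrive."""

    def __init__(self, n, m, activity):
        if not 1 <= n <= m:
            raise ValueError("need 1 <= n <= m")
        self.n, self.m, self.activity = n, m, activity
        self.arrived = []
        self.fired = False

    def arrive(self, path):
        """Record one incoming path; run the activity when the n-th arrives."""
        self.arrived.append(path)
        if not self.fired and len(self.arrived) == self.n:
            self.fired = True
            self.activity(list(self.arrived))
        if len(self.arrived) == self.m:
            # All paths are in: reset so the join can be used again.
            self.arrived, self.fired = [], False

log = []
join = NOutOfMJoin(2, 3, log.append)
for p in ["a", "b", "c"]:
    join.arrive(p)
print(log)
```

With n equal to m the same class acts as a full Synchronization; with n equal to 1 it acts as a Discriminator.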
69 The ZEN of Workflow Patterns
- Structural Patterns
- Arbitrary Cycles - execute workflow graph w/out any structural restriction on loops
- Implicit Termination - terminate if there is nothing to be done
- Patterns Involving Multiple Instances
- MI with a priori known design time knowledge - generate many instances of one activity when the number of instances is known at design time
- MI with a priori known runtime knowledge - generate many instances of one activity when the number of instances can be determined at some point during runtime (as in FOR loop)
- MI with no a priori runtime knowledge - generate many instances of one activity when the number of instances cannot be determined (as in WHILE loop)
- MI requiring synchronization - generate many instances of one activity and synchronize them afterwards
70 The ZEN of Workflow Patterns
- State-based patterns
- Deferred Choice - execute one of the two alternative threads. The choice of which thread is to be executed should be implicit.
- Interleaved Parallel Routing - execute two activities in random order, but not in parallel.
- Milestone - enable an activity until a milestone is reached
- Cancellation Patterns
- Cancel Activity - cancel (disable) an enabled activity
- Cancel Case - cancel (disable) the process
71 The ZOO of Workflow Standards and Systems
72 Summary (Scientific Workflows)
- Spectrum of dataflow/control-flow/workflow approaches
- SDF, PN, ..., AXML, WebML, XPDL
- Scientist user needs to visually program them
- System support needed
- Translation from simple, conceptual (declarative) WFs to executable Web/Grid service plans
- Static analysis to check
- dynamic properties (deadlocks, starvation, ...),
- feasibility wrt. given sources
- type compatibilities
- Macro/micro-level planning (overall control flow, local schema mappings)
73 Summary: Mediation Scenarios and Techniques
Common Schema | Mediated Schema | Common Glue Maps
SQL, rules | XML query languages | DOOD query languages
Schema Transformations | Syntax-Aware Mappings | Semantics-Aware Mappings
Syntactic Joins | Syntactic Joins | Semantic Joins via Glue Maps
DB expert | DB expert | KRDB domain experts
74 Combine Everything: Die eierlegende Wollmilchsau (the "egg-laying wool-milk-sow", i.e., a system that does everything)
- Database Federation/Mediation
- query rewriting under GAV/LAV
- w/ binding pattern constraints
- distributed query processing
- Semantic Mediation
- semantic integrity constraints, reasoning w/ plans, automated deduction
- deductive database/logic programming technology, AI stuff...
- Semantic Web technology
- Scientific Workflow Management
- more procedural than database mediation (often the scientist is the query planner)
- deployment using web services