Scientific Workflows: Declarative vs. Procedural Born Again

1
Scientific Workflows: Declarative vs. Procedural
Born Again!?
Bertram Ludäscher ludaesch_at_sdsc.edu
  • San Diego Supercomputer Center
  • Dept. of Computer Science & Engineering
  • University of California, San Diego

2
Overview
  • Disclaimer:
  • WORK-shop: pre-work-in-progress (Problems in
    Progress)
  • throwing problems & ideas from scientific data
    management projects at you
  • Invitation:
  • hopefully problems are amenable to DB approaches
    & trigger interest (HELP!!)
  • Outline:
  • Scientific Data Management: Examples & Problems
  • Data Integration, Knowledge Representation,
    Scientific Workflows

3
Example of a Scientific Workflow
4
Scientific Data Integration ... Questions to
Queries ...
What are the distribution and U/Pb zircon ages of
A-type plutons in VA? How about their 3-D
geometry? How does it relate to host rock
structures?
Complex Multiple-Worlds Mediation
GeoPhysical (gravity contours)
Geologic Map (Virginia)
GeoChronologic (Concordia)
Foliation Map (structure DB)
GeoChemical
5
Towards Shared Conceptualizations: Data
Contextualization via Concept Spaces
6
Rock Classification Ontology
Genesis
Fabric
Composition
Texture
7
Some enabling operations on ontology data
  • Concept expansion
  • what else to look for when asking for Mafic

Composition
8
Example Geologic Map Integration
9
Information Integration Challenges
  • System aspects: Grid Middleware
  • distributed data & computing
  • Web Services, WSDL/SOAP, OGSA, ...
  • sources: functions, files, data sets, ...
  • Syntax & Structure
  • (XML-Based) Data Mediators
  • wrapping, restructuring
  • (XML) queries and views
  • sources: (XML) databases
  • Semantics
  • Model-Based/Semantic Mediators
  • conceptual models and declarative views
  • Knowledge Representation: ontologies, description
    logics (RDF(S), OWL, ...)
  • sources: knowledge bases (DB + CMs + ICs)

10
Real-time Observatories, Applications, and
Data management Network
  • Autonomous field sensors
  • Seismic, oceanic, climate, ecological, ...,
    video, audio, ...
  • RT Data Acquisition
  • ANZA Seismic Network (1981-present): 13 Broadband
    Stations, 3 Borehole Strong Motion Arrays, 5
    Infrasound Stations, 1 Bridge Monitoring System
  • Kyrgyz Seismic Network (1991-present): 10
    Broadband Stations
  • IRIS PASSCAL Transportable Array (1997-present):
    15-60 Broadband and Short Period Stations
  • IDA Global Seismic Network (1990-present): 38
    Broadband Stations
  • High Performance Wireless Research Network
    (HPWREN)
  • High performance backbone network: 45 Mbps duplex
    point-to-point links, backbone nodes at quality
    locations, network performance monitors at
    backbone sites
  • High speed access links: hard to reach areas,
    typically 45 Mbps or 802.11 radios,
    point-to-point or point-to-multipoint
  • Data Grid Technology (SRB)
  • collaborative access to distributed heterogeneous
    data, single sign-on authentication and seamless
    authorization, data scaling to petabytes and 100s
    of millions of files, data replication, etc.

11
A P2P Problem from ROADNet
  • Networks of ORBs send each other various data
    streams
  • Avoid actual loops in the presence of virtual
    loops
  • A → B → C
  • A -[c1]→ B
  • B -[c2]→ C
  • C -[c3]→ A
  • ...
  • Idea: L(c1) ∩ L(c2) ∩ L(c3)
  • In the real system: unix regexps
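
One way to read the idea on this slide: if the filters c1, c2, c3 along a cycle are regular expressions over stream names, a packet can only circulate forever if its name is accepted by every filter on the cycle, i.e. it lies in the intersection of their languages. The Python sketch below tests concrete stream names against every hop of a cycle. It does not decide emptiness of the intersection in general, and the ORB names, filter patterns, and stream names are invented for illustration.

```python
import re

# Hypothetical forwarding rules: (sender, receiver) -> regex over stream names.
filters = {
    ("A", "B"): r"AZ_.*",
    ("B", "C"): r".*_BHZ",
    ("C", "A"): r"AZ_PFO_.*",
}

def loops_forever(stream_name, cycle, filters):
    """True if every hop on the cycle would forward this stream name."""
    hops = zip(cycle, cycle[1:] + cycle[:1])  # (A,B), (B,C), (C,A)
    return all(re.fullmatch(filters[hop], stream_name) for hop in hops)

cycle = ["A", "B", "C"]
candidates = ["AZ_PFO_BHZ", "AZ_PFO_LHZ", "KN_AAK_BHZ"]
looping = [s for s in candidates if loops_forever(s, cycle, filters)]
print(looping)
```

Only names accepted by all three filters are reported. A full solution would compute the intersection of the regular languages (for example via product automata) instead of sampling names.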

12
Scientific Workflows and Analytical Pipelines
13
A Scientific Workflow Promoter Identification
14
SDM Demo Architecture
Translation Approach: Abstract Workflow (AWF) →
Executable Workflow (EWF)
15
Biomedical Informatics Research
Network: http://nbirn.net
Scientific Workflows/Analytical Pipelines over
Brain Data
16
SEEK Vision Overview
  • Large collaborative NSF/ITR project: UNM, UCSB,
    UCSD, UKansas, ...
  • Fundamental improvements for researchers:
  • Global access to ecologically relevant data
  • Rapidly locate and utilize distributed
    computation
  • Capture, reproduce, extend analysis process

17
SEEK Components
  • EcoGrid
  • Seamless access to distributed, heterogeneous
    data: ecological, biodiversity, environmental
    data
  • Semantically mediated and metadata driven
  • Centralized search management portal(s)
  • Analysis and Modeling System
  • Capture, reproduce, and extend analysis process
  • Declarative means for documenting analysis
  • Pipeline system for linking generic analysis
    steps
  • Strong version control for analysis steps
  • Easy-to-use interface between data and analysis
  • Semantic Mediation System
  • smart data discovery, type-correct pipeline
    construction & data binding
  • determine whether/how to link analytic steps
  • determine how data sets can be combined
  • determine whether/how data sets are appropriate
    inputs for analysis steps

18
AMS Overview
  • Objective
  • Create a semi-automated system for analyzing data
    and executing models that provides documentation,
    archiving, and versioning of the analyses,
    models, and their outputs (visual programming
    language?)
  • Scope
  • Any type of analysis or model in ecology and
    biodiversity science
  • Massively streamline the analysis and modeling
    process
  • Archiving, rerunning analyses in SAS, Matlab, R,
    SysStat, C(++), ...

19
SMS Requirements from AMS
  • ...assist users in determining the
    appropriateness of combining various analytical
    steps and data sources based on semantic
    mediation...
  • Semantic mediation should occur in three areas
  • determine whether it is appropriate to link
    together particular analytic steps.
  • mediate between multiple data sets to determine
    in what ways they can be combined.
  • determine whether the selected data sources are
    appropriate inputs for the selected analysis.

20
Some functional requirements
  • SMS should have the ability to ...
  • FR1: recognize data types (XML Schema types!? EML
    types?) of registered EcoGrid data sets
  • FR2: recognize semantic types (OWL and/or RDF(S)
    !?) of registered EcoGrid data sets
  • FR3: recognize registered EcoGrid ontologies
  • Note: semantic types reference those ontologies
  • FR4: recognize data type signature (XML Schema?
    WSDL?) of analytical steps (ASs)
  • FR5: recognize semantic type signature of
    analytical steps
  • FR6: recognize semantic constraints (OWL?
    First-order? What syntax? KIF? Prolog?)
  • Note: data schemas and signatures of analytical
    steps have those

21
... some functional requirements
  • Ability to ...
  • FR8: check well-typedness (data and semantics) of
    a data set wrt. an analytical step
  • FR9: check compatibility of two data sets wrt.
    "generalized operations" between those data sets
    (e.g., "semantic" join and union)
  • FR10: check well-typedness (data and semantics)
    of chained analytical steps
  • FR11: introduce data type conversions (e.g., int
    → float)
  • FR12: perform and "explain" semantic type
    substitutions
  • (e.g., if some AS works for Cs and D-isa-C, it
    also works for Ds)
  • FR13 (optional): generate type-correct APs from a
    given schema of desired output and (optionally)
    input parameters
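
A minimal Python sketch of how FR8, FR11, and FR12 might fit together: a data set is accepted by an analytical step if its semantic type is a subconcept of the type the step expects, and a data type conversion is proposed when the data types differ. The is-a hierarchy, the conversion table, and all function names are hypothetical; a real SMS would presumably draw these from registered ontologies.

```python
# Hypothetical is-a hierarchy (child -> parent) and data type conversion table.
ISA = {"Basalt": "MaficRock", "MaficRock": "IgneousRock", "IgneousRock": "Rock"}
CONVERSIONS = {("int", "float"): float}

def is_a(concept, ancestor):
    """Walk the is-a chain upward; return the chain as an explanation, or None."""
    chain = [concept]
    while chain[-1] != ancestor:
        parent = ISA.get(chain[-1])
        if parent is None:
            return None
        chain.append(parent)
    return chain

def check_input(data_type, sem_type, step_data_type, step_sem_type):
    """Return (ok, notes) for feeding a data set into an analytical step."""
    chain = is_a(sem_type, step_sem_type)
    if chain is None:
        return False, [f"{sem_type} is not a {step_sem_type}"]
    notes = []
    if len(chain) > 1:  # FR12: explain the semantic substitution
        notes.append("semantic substitution: " + " isa ".join(chain))
    if data_type != step_data_type:  # FR11: propose a conversion step
        if (data_type, step_data_type) not in CONVERSIONS:
            return False, [f"no conversion {data_type} -> {step_data_type}"]
        notes.append(f"insert conversion {data_type} -> {step_data_type}")
    return True, notes

ok, notes = check_input("int", "Basalt", "float", "IgneousRock")
```

Here `notes` doubles as the "explanation" asked for in FR12: it records the is-a chain that justified the substitution and any conversion that would have to be inserted.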

22
Use Cases
  • Clients of the SMS include the AMS, the EcoGrid,
    and "scientific workflow engineers".
  • UC1: Client requests type signature (data and
    semantic types) of a registered EcoGrid data set
    (DS).
  • UC2: Client requests "other semantic constraints"
    of a DS.
  • UC3: Client requests type signature (data and
    semantic types) of an analytical step (AS).
  • UC4: Client requests "other semantic constraints"
    of an AS.
  • UC5: Client requests type signature of an AP.
  • UC6: Client requests type checking of an AP.
  • UC7: Client requests registered data sets
    compatible with the inputs of an AS (e.g., if AS
    is scale sensitive, then all data sets must have
    the same scale; a flag is raised if data needs
    scaling).
  • UC8: Client requests all registered ASs which can
    produce a given parameter (the latter is part
    of a registered ontology).
  • UC9: Client requests candidate predecessor and
    successor steps for a given AS.
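
UC8 and UC9 can be illustrated with a toy registry of analytical steps. The Python sketch below assumes each step is registered with a set of input and output type names and matches them by exact name; the step names and types are hypothetical, and a real SMS would presumably match on semantic types rather than strings.

```python
# Hypothetical registry: step name -> (input types, output types).
STEPS = {
    "fetch_sequence": ({"GeneId"}, {"CDNASeq"}),
    "align": ({"CDNASeq"}, {"RankedPromoterList"}),
    "summarize": ({"RankedPromoterList"}, {"Report"}),
}

def producers(parameter):
    """UC8: all registered steps that can produce the given parameter."""
    return sorted(name for name, (_, outs) in STEPS.items() if parameter in outs)

def successors(step):
    """UC9: steps that consume at least one output of `step`."""
    outs = STEPS[step][1]
    return sorted(n for n, (ins, _) in STEPS.items() if n != step and ins & outs)

def predecessors(step):
    """UC9: steps that produce at least one input of `step`."""
    ins = STEPS[step][0]
    return sorted(n for n, (_, outs) in STEPS.items() if n != step and ins & outs)
```

For example, `successors("align")` yields the steps that could be chained after the alignment step, and `predecessors("align")` those that could feed it.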

23
Planned Components
  • SW1: Formal language(s) for
    representing/instantiating data types, semantic
    types, ontologies, and "other semantic
    constraints".
  • SW2: System for data type checking and inference
    (includes introduction of data type conversion
    steps)
  • SW3: System for semantic type checking and
    inference
  • SW4 (optional): System for "planning" APs given
    some output parameters, data sets, and input
    parameters

24
THE PROBLEM: Reconcile this
  • Simple, intuitive graph/pipeline language,
  • which is expressive enough to handle real-world
    flows (SciDAC PIW),
  • and allows some static analysis
  • while trying to leverage existing work,
  • e.g., Ptolemy-II directors: Process Networks
    (PN), Synchronous Dataflow (SDF), ...,
  • or workflow standards and systems

25
(Analytical) Pipelines . (Scientific) Workflows
  • Spectrum of languages & formalisms
  • Pipelines (a la Unix)
  • Dataflow languages
  • Kahn's process networks (PN)
  • Synchronous dataflow networks (SDF)
  • Web page-flow
  • Active XML, WebML, ...
  • Hesitating-weak-alternating-tree-automata-ML
  • (Business) Workflows
  • WfMC's XPDL, WSFL, BPEL4WS, ...

26
Kahn Process Networks (PN)
  • Concurrent processes communicating through
    one-way FIFO channels with unbounded capacity
  • A functional process F maps a set of input
    sequences into a set of output sequences (sounds
    like XSM!)
  • increasing chain of sets of sequences → outputs
    may not increase!
  • Consider increasing chains (wrt. prefix ordering
    ≤) of streams
  • F is continuous if lub(Xs) exists for all
    increasing chains Xs and
  • F(lub(Xs)) = lub(F(Xs))
  • Continuous implies monotonic:
  • if Xs ≤ Ys then F(Xs) ≤ F(Ys)
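
The prefix ordering and the monotonicity condition can be made concrete on finite streams. The following Python sketch (all names are illustrative, not from the slides) checks monotonicity of two processes along one increasing chain of inputs: a running sum, which only ever extends its output, and a "total so far" process, which would have to retract output it already emitted.

```python
def is_prefix(xs, ys):
    """Prefix ordering on finite streams: xs is an initial segment of ys."""
    return len(xs) <= len(ys) and list(ys[:len(xs)]) == list(xs)

def running_sum(xs):
    """Emits one partial sum per input token; more input only appends output."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

def total_only(xs):
    """Emits a single token holding the sum of everything seen so far."""
    return [sum(xs)] if xs else []

def monotonic_on(f, chain):
    """Check F(a) <= F(b) for each consecutive pair a <= b of the chain."""
    return all(is_prefix(f(a), f(b)) for a, b in zip(chain, chain[1:]))

chain = [[], [1], [1, 2], [1, 2, 3]]  # an increasing chain of input streams
print(monotonic_on(running_sum, chain), monotonic_on(total_only, chain))
```

Passing the check on one chain does not prove monotonicity, but a single failing pair, as for `total_only`, is enough to refute it.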

27
Process Networks (contd)
  • PN in essence: simultaneous relations between
    sequences
  • Network of functional processes can be described
    by a mapping
  • X = F(X, I)
  • X denotes all the sequences in the network
    (inputs I + outputs)
  • X that forms a solution is a fixed point
  • Continuity implies exactly one minimal fixed
    point
  • minimal in the sense of prefix ordering, for any
    inputs I
  • execution of the network: given I, find
    the minimal fixed point (works because of the
    monotonic property)
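
Finding the minimal fixed point of X = F(X, I) can be sketched as plain iteration from empty channels. The Python example below uses a small, made-up feedback network (an adder whose output is fed back through a delay with one initial token 0) and finite inputs, so the iteration terminates; the network and all names are assumptions for illustration only.

```python
def apply_f(state, inputs):
    """One application of F: recompute every channel from the previous state."""
    c, d = state["c"], state["d"]
    return {
        "c": [0] + d,                              # delay: initial token 0, then d
        "d": [i + x for i, x in zip(inputs, c)],   # adder: inputs + c, elementwise
    }

def least_fixed_point(inputs):
    """Iterate F starting from all-empty channels until nothing changes."""
    state = {"c": [], "d": []}
    while True:
        nxt = apply_f(state, inputs)
        if nxt == state:
            return state
        state = nxt

result = least_fixed_point([1, 2, 3])
print(result)
```

Each iterate is a prefix of the next one, which is where monotonicity is used: the channels only grow until they stabilise at the minimal solution.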

28
Synchronous Data Flow Networks (SDF)
  • Special case of PN
  • Ptolemy-II SDF overview
  • SDF supports efficient execution of dataflow
    graphs that lack control structures
  • with control structures → Process Networks (PN)
  • requires that the rates on the ports of all
    actors be known beforehand
  • rates do not change during execution
  • in systems with feedback, delays (represented by
    initial tokens on relations) must be explicitly
    noted → SDF uses this rate and delay information
    to determine the execution sequence of the actors
    before execution begins.
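
The slide does not spell out how the schedule is derived from the rates. A standard technique from the SDF literature is to solve the balance equations q[src] * produced = q[dst] * consumed for the smallest positive integer firing counts q. The Python sketch below does this for a connected graph; the edge format and actor names are assumptions, and initial tokens (delays) are ignored here.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(edges):
    """edges: list of (src, produced_per_firing, dst, consumed_per_firing)."""
    adj = {}
    for src, p, dst, c in edges:
        adj.setdefault(src, []).append((dst, Fraction(p, c)))
        adj.setdefault(dst, []).append((src, Fraction(c, p)))
    start = edges[0][0]
    q = {start: Fraction(1)}
    stack = [start]
    while stack:                       # propagate relative firing rates
        a = stack.pop()
        for b, ratio in adj[a]:
            value = q[a] * ratio
            if b not in q:
                q[b] = value
                stack.append(b)
            elif q[b] != value:
                raise ValueError("inconsistent rates: no periodic schedule")
    # Scale the fractions to the smallest positive integers.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (f.denominator for f in q.values()), 1)
    ints = {a: int(f * scale) for a, f in q.items()}
    g = reduce(gcd, ints.values())
    return {a: n // g for a, n in ints.items()}

# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
firings = repetition_vector([("A", 2, "B", 3), ("B", 1, "C", 2)])
print(firings)
```

With these firing counts every channel returns to its initial fill level after one period, which is what lets a scheduler fix the execution sequence before the run starts.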

29
Extended Kahn-MacQueen Process Networks
  • A process is considered active from its creation
    until its termination
  • An active process can block when trying to read
    from a channel (read-blocked), when trying to
    write to a channel (write-blocked) or when
    waiting for a queued topology change request to
    be processed (mutation-blocked)
  • A deadlock is when all the active processes are
    blocked
  • real deadlock: all the processes are blocked on a
    read
  • artificial deadlock: all processes are blocked,
    at least one process is blocked on a write →
    increase the capacity of the receiver with the
    smallest capacity amongst all the receivers on
    which a process is blocked on a write. This
    breaks the deadlock.
  • If the increase results in a capacity that
    exceeds the value of maximumQueueCapacity, then
    instead of breaking the deadlock, an exception is
    thrown. This can be used to detect erroneous
    models that require unbounded queues.
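
The deadlock rules above can be sketched as a small decision procedure. In the Python example below, the data layout, the function name, and the choice to double the capacity are assumptions (the slide only says the capacity is increased); mutation-blocked processes are left out, and capacities are assumed to be at least 1.

```python
def resolve_deadlock(blocked, capacity, maximum_queue_capacity):
    """blocked: process -> ("read", channel), ("write", channel), or None if running.
    capacity: channel -> current receiver capacity (modified in place)."""
    states = list(blocked.values())
    if any(state is None for state in states):
        return "not deadlocked"
    write_channels = [ch for kind, ch in states if kind == "write"]
    if not write_channels:
        return "real deadlock"          # everyone is waiting on a read
    # Artificial deadlock: grow the smallest receiver someone is write-blocked on.
    smallest = min(write_channels, key=capacity.__getitem__)
    new_capacity = capacity[smallest] * 2   # growth policy is an assumption
    if new_capacity > maximum_queue_capacity:
        raise RuntimeError(f"queue {smallest} would exceed maximumQueueCapacity")
    capacity[smallest] = new_capacity
    return "artificial deadlock broken on " + smallest

capacity = {"ch1": 1, "ch2": 4}
blocked = {"p1": ("write", "ch1"), "p2": ("read", "ch2")}
outcome = resolve_deadlock(blocked, capacity, maximum_queue_capacity=8)
```

The exception path corresponds to the last bullet: a model that keeps hitting it probably needs unbounded queues.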

30
Analytical Pipelines An Open Source Tool
31
A commercial tool for Analytical Pipelines
32
Discovery Process Markup Language: Workflow
Representation
  • Workflow: Discovery Planning by Service
    Composition
  • Towards a Standard Workflow Representation for
    Discovery Informatics: Discovery Process Markup
    Language (DPML). Sorry for another standard, but
    it may be useful for:
  • Discovery Planning: Recording and managing a
    collaboratively-built discovery process.
  • Distributed Service Composition: Components
    organised by the workflow can be executing
    anywhere.
  • Discovery Plans as Collaborative Intellectual
    Property: Discovery Plans can be stored, reused,
    audited, refined and deployed in various forms.

D-Net Workflow for Genome Annotation: 16
services executing across Internet
33
(No Transcript)
34
MAP Data Massaging a la Blue-Titan/Perl
35
The other end: Workflow Languages
36
The ZEN of Workflow Patterns (from
http://tmitwww.tm.tue.nl/research/patterns/)
  • Basic Control Patterns
  • Sequence - execute activities in sequence
  • Parallel Split - execute activities in parallel
  • Synchronization - synchronize two parallel
    threads of execution
  • Exclusive Choice - choose one execution path
    from many alternatives
  • Simple Merge - merge two alternative execution
    paths
  • Advanced Branching and Synchronization Patterns
  • Multiple Choice - choose several execution paths
    from many alternatives
  • Multiple Merge - merge many execution paths
    without synchronizing
  • Discriminator - merge many execution paths
    without synchronizing. Execute the subsequent
    activity only once.
  • N-out-of-M Join - merge many execution paths.
    Perform partial synchronization and execute
    subsequent activity only once.
  • Synchronizing Join - merge many execution paths.
    Synchronize if many paths are taken. Simple merge
    if only one execution path is taken
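
A few of these control patterns map directly onto thread-pool primitives. The Python sketch below, using the standard `concurrent.futures` module, shows Parallel Split followed by Synchronization, and an N-out-of-M Join (the Discriminator is the special case n = 1). Function names and the toy activities are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_split_and_sync(activities):
    """Parallel Split + Synchronization: start all branches, wait for all."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(activity) for activity in activities]
        return [f.result() for f in futures]

def n_out_of_m_join(activities, n, then):
    """N-out-of-M Join: run `then` exactly once, on the first n finished branches."""
    if not 1 <= n <= len(activities):
        raise ValueError("n must be between 1 and the number of branches")
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(activity) for activity in activities]
        finished = []
        for f in as_completed(futures):
            finished.append(f.result())
            if len(finished) == n:
                # Later branches still run to completion (the pool waits for
                # them on exit), but their results are ignored.
                return then(finished)

branches = [lambda: "a", lambda: "b", lambda: "c"]
all_results = parallel_split_and_sync(branches)
joined = n_out_of_m_join(branches, 2, len)
first = n_out_of_m_join(branches, 1, lambda done: done[0])  # Discriminator
```

Which branch wins the Discriminator depends on thread timing, so only the set of possible results is predictable.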

37
The ZEN of Workflow Patterns
  • Structural Patterns
  • Arbitrary Cycles - execute workflow graph w/out
    any structural restriction on loops
  • Implicit Termination - terminate if there is
    nothing to be done
  • Patterns Involving Multiple Instances
  • MI with a priori known design time knowledge -
    generate many instances of one activity when a
    number of instances is known at the design time
  • MI with a priori known runtime knowledge -
    generate many instances of one activity when a
    number of instances can be determined at some
    point during the runtime (as in FOR loop)
  • MI with no a priori runtime knowledge - generate
    many instances of one activity when a number of
    instances cannot be determined (as in WHILE loop)
  • MI requiring synchronization - generate many
    instances of one activity and synchronize them
    afterwards
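
The difference between the runtime-known and the no-a-priori-knowledge variants is essentially the FOR versus WHILE distinction the slide mentions. A minimal Python sketch, with hypothetical function names and instances run one after another for simplicity:

```python
def mi_runtime_known(activity, items):
    """Instance count is fixed once `items` is known at runtime (FOR-loop style)."""
    return [activity(item) for item in items]

def mi_no_prior_knowledge(activity, next_item):
    """Instance count is unknown up front: keep creating instances until the
    source reports that nothing is left (WHILE-loop style)."""
    results = []
    item = next_item()
    while item is not None:
        results.append(activity(item))
        item = next_item()
    return results
```

Collecting the results into a list before returning plays the role of the final synchronization step.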

38
The ZEN of Workflow Patterns
  • State-based patterns
  • Deferred Choice - execute one of the two
    alternative threads. The choice of which thread
    is to be executed should be implicit.
  • Interleaved Parallel Routing - execute two
    activities in random order, but not in parallel.
  • Milestone - enable an activity until a milestone
    is reached
  • Cancellation Patterns
  • Cancel Activity - cancel (disable) an enabled
    activity
  • Cancel Case - cancel (disable) the process

39
The ZOO of Workflow Standards and Systems
(http://tmitwww.tm.tue.nl/research/patterns/)
40
From Abstract to Executable Workflows (or
Declarative vs. Procedural Born-Again)
  • Basic idea
  • Let scientist define her scientific workflow as
    an abstract, conceptual workflow
  • Let the system translate this to an executable
    web service flow (plan)
  • Many challenges
  • Defining AWF over executable services (EWF)
  • → Abstract-as-view definition
  • But: complex control flows! (branching, loops,
    nested loops, etc.)
  • Also: schema mappings WS1.out → WS2.in
  • Static analysis and planning is somewhere between
    impossible and very hard (unless an SDF-like
    model is sufficient)
  • Compiling Abstract Scientific Workflows into Web
    Service Workflows, B. Ludäscher, I. Altintas, and
    A. Gupta, 15th Intl. Conference on Scientific and
    Statistical Database Management (SSDBM), Boston,
    Massachusetts, 2003.

41
From Abstract Scientific Workflows to
Executable Web Service Flows (SSDBM03)
[Diagram labels: AWF; EWF; web service invocation;
ET; query rewriting; semantic type checking; data
type conversion; web service matching; Genbank;
BLAST; Abstract Task (AT) Repository; Data
Parameter Ontologies; Datatype Conversion
Repository; Executable Task (ET) Repository]
42
Conceptual Workflow
Compute clusters (min. distance)
For each promoter
Select gene-set (cluster-level)
Compute Subsequence labels
For each gene
With all Promoter Models
Compute Joint Promoter Model
43
piw
[Diagram labels: AWF; EWF; DB; Gene; GeneId;
promoters; tfbs_models; Promoters; TFBSModels;
promoters AAV; gene_seq; gene_seq AAV; CDNASeq;
cDNASeq; localAlignment; GenbankId;
genbank_service; EMBLId; embl_service; DDBJId;
ddbj_service; convertToAcc]
44
PIW as an Abstract Workflow (AWF) Composed of
Abstract Tasks (AT)
AWF:
piw(DB, Gene, TFBSModel) :-
    cDNASequence(Gene, GeneSeq),
    localAlignment(DB, CDNASeq, RankedPromoterList),
    firstRest(Promoter, RankedPromoters,
              RankedPromoters1),
    promoter_detail(Promoter, PromoterId, Start, End,
                    Orientation),
    cDNASequence(PromoterId, GenomicSeq),
    trim_sequence(GenomicSeq, Start, End,
                  Orientation, ShortSeq),
    convertSeq(Orientation, ShortSeq, PosSeq),
    transfac(PosSeq, TFBSModel).
45
Abstract-as-view (AAV) Definitions for the ATs
AAV:
cDNASequence(GeneId, CDNASeq) :-
    genbank(GeneId, CDNASeq) ;
    fail(genbank), embl(GeneId, CDNASeq) ;
    fail(genbank), fail(embl), ddbj(GeneId, CDNASeq).
localAlignment(DB, CDNASeq, RankedPromoterList) :-
    blast(CDNASeq, DB, xml, RankedPromoterList) ;
    fail(blast),
    fasta(CDNASeq, DB, RankedPromoterList) ;
    fail(blast), fail(fasta),
    blat(CDNASeq, querytype, sortcriteria,
         outputtype, RankedPromoterList).
convertSeq(Orientation, ShortSeq, PosSeq) :-
    negative(Orientation),
    complement(ShortSeq, PosSeq) ;
    equals(ShortSeq, PosSeq).
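
The AAV rule for cDNASequence reads as an ordered fallback: try genbank, on failure try embl, on failure try ddbj. A minimal Python sketch of that control flow follows. The exception type, the helper `first_success`, and the stand-in service functions are hypothetical; real services would be remote web service calls.

```python
class ServiceFailure(Exception):
    """Raised by a stand-in service to signal that it could not answer."""

def first_success(services, *args):
    """Try each service in order; move to the next one only on failure."""
    errors = []
    for service in services:
        try:
            return service(*args)
        except ServiceFailure as exc:
            errors.append(f"{service.__name__}: {exc}")
    raise ServiceFailure("; ".join(errors))

# Stand-in services for illustration only.
def genbank(gene_id):
    raise ServiceFailure("unavailable")

def embl(gene_id):
    return f"cDNA-for-{gene_id}"

def ddbj(gene_id):
    return f"ddbj-cDNA-for-{gene_id}"

def cdna_sequence(gene_id):
    """Mirrors the rule: genbank, else embl, else ddbj."""
    return first_success([genbank, embl, ddbj], gene_id)

sequence = cdna_sequence("G1")
```

In this run genbank fails, so the result comes from embl and ddbj is never called.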
46
AAV Definitions for the ATs (continued)
trim_sequence(GenomicSeq, Start, End,
              Orientation, ShortSeq) :-
    trim_sequence1(GenomicSeq, Start, End,
                   Orientation, ShortSeq) ;
    fail(trim_sequence1),
    trim_sequence2(GenomicSeq, Start, End,
                   Orientation, ShortSeq).
trim_sequence1(GenomicSeq, Start, End,
               Orientation, ShortSeq) :-
    negative(Orientation), add(Start, End, Sum),
    divide(Sum, 2, Mid),
    trim(GenomicSeq, Mid - 1000, Mid + 500, ShortSeq).
trim_sequence2(GenomicSeq, Start, End,
               Orientation, ShortSeq) :-
    add(Start, End, Sum), divide(Sum, 2, Mid),
    trim(GenomicSeq, Mid - 500, Mid + 1000,
         ShortSeq).
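
As a worked example of the trimming rules, the Python sketch below computes the window around the midpoint of a hit. It assumes the upper bounds are Mid + 500 and Mid + 1000 (the operator is missing in the transcript), integer division for the midpoint, and 0-based string slicing; all three are assumptions.

```python
def trim_window(start, end, negative_orientation):
    """Window bounds around the midpoint, depending on strand orientation."""
    mid = (start + end) // 2
    if negative_orientation:
        return mid - 1000, mid + 500     # as in trim_sequence1
    return mid - 500, mid + 1000         # as in trim_sequence2

def trim_sequence(genomic_seq, start, end, negative_orientation):
    """Cut the window out of the sequence, clamping at the sequence start."""
    low, high = trim_window(start, end, negative_orientation)
    return genomic_seq[max(low, 0):high]

window = trim_window(2000, 4000, negative_orientation=True)
short_seq = trim_sequence("A" * 5000, 2000, 4000, negative_orientation=False)
```

Either way the window is 1500 positions wide; the orientation only decides on which side of the midpoint the longer part lies.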
  • ... and now the same in graph form...

47
Figure 4. Allowed edge types in a well-formed EWF.
48
EWF for Matt's Extended Promoter Identification
Workflow (w/ loops & conditions)
[Diagram labels: prepareClustalWInput; CW Sequence;
manageClustalW Loop; geneList; updated GeneList;
ClustalW Sequence; noMore Genes; geneListEmpty;
geneListNOTEmpty; loop back; partialSeq; geneId;
geneNo; complement Sequence; shortSeq; plusSeq;
minusSeq; orient > 0; orient < 0; Figure1; type;
pwalignment; TRANSFACMatInspector; inspected TFBSs;
sequence; ClustalW; Sequence List; multipleSeq
Alignment]
49
Unfolded EWF
[Diagram labels: manageClustalW Loop; noMore Genes;
geneListEmpty; geneListNOTEmpty; loop back;
prepareClustalWInput; geneList; updated GeneList;
ClustalW Sequence; geneId; geneNo; format; program;
db1; db2; cmd1; cmd2; dopt; partialSeq; BlastRID;
Genbank1; Genbank2; RequestId; RId; cDNASeq; seq1;
seq2; hitId; list_udis; BlastPromoter; full Genomic
Sequence; promoters; promoter List; outputNext
Promoter; updated Promoter List; trimSequence;
start; end; from; to; orientation; orient;
orient > 0; orient < 0; complement Sequence;
shortSeq; plusSeq; minusSeq; type; pwalignment;
ClustalW; Sequence List; multipleSeq Alignment;
TRANSFACMatInspector; inspected TFBSs; sequence]
50
Summary
  • Spectrum of dataflow/control-flow/workflow
    approaches
  • SDF, PN, ..., AXML, WebML, XPDL, ...
  • Scientist user needs to visually program them
  • System support needed:
  • Translation from simple, conceptual
    (declarative) WFs to executable Web/Grid
    service plans
  • Static analysis to check
  • dynamic properties (deadlocks, starvation, ...),
  • feasibility wrt. given sources,
  • type compatibilities
  • Macro/micro-level planning (overall control flow,
    local schema mappings)