Title: Zen and the Art of SWF Maintenance
1Zen and the Art of SWF Maintenance
- Kinds of Scientific Workflows
- Why not just Python scripts?
- Business workflows born again?
- Zen and the art of workflow design
- and other research issues
2What is a Scientific Workflow (SWF)?
- Model the way scientists work with their data and tools
- Mentally coordinate data export, import, analysis via software systems
- Scientific workflows emphasize data flow (unlike business workflows)
- Metadata (incl. provenance info, semantic types, etc.) is crucial for automated data ingestion, data analysis, ...
- Goals
  - SWF automation
  - SWF and component reuse
  - SWF design documentation
  - => making scientists' data analysis and management easier!
3What we use SWF for
- Short answer: Everything
  - includes making coffee (tea ceremonies are harder)
- Kinds of workflows (not disjoint)
  - Plumbing: stage files, submit batch jobs, monitor progress, move files off the XT3 to the analysis and viz cluster, archive, steer computation, ...
    - Ex: Fusion simulation, Astrophysics (supernova simulation), your laptop backup???
  - Knowledge discovery workflows: automate repetitive data access, retrieval, custom analysis (e.g. BLAST), generic steps (PCA, cluster analysis, ..), ...
    - Do this in ways that are meaningful to the scientist
    - Ex: PIW, Motif analysis, NDDP, ...
  - Conceptual modeling workflows: what the heck is XYZ doing? Reverse engineering of processes and information flows at all levels; in order to optimize, we need to understand first
    - Ex: napkin drawing workflows to get an overview, refine design from abstract to executable (top-down), or generalize from the concrete/legacy to the abstract (bottom-up); data-driven, task-driven, ..
4Why not just a Python script?
- Users who might be able to define, reuse, modify, specialize WFs might not be able to do the same for Python scripts
- But wait, there's more
  - Modular reuse
  - Debugging and monitoring of WF execution
    - easy to tee ("man tee" for you Windows guys :-)
  - Automated provenance mgmt
  - Semantic types
  - From integrated WF modeling (ER, dataflow, co-registrations) to execution, optimization, archival
5Business workflows born-again?
- Yes, there are similarities
- And we can learn from BWF! E.g. transactions!
- But also big differences
  - SWF
    - data-flow oriented
    - streaming/pipelined execution
    - cf. signal processing (see also COM later)
    - popular MoC: PN
  - BWF
    - task- and control-flow oriented
    - popular MoC: Petri net? CSP?
6Sample BWFs
- Focus is on
- Tasks
- Control-flow
- Work items
- Useful stuff
- Transactions!
- How to handle complex control-flow
7Pop Quiz! BWF? SWF?
8And the answer is
9Click here for Oracle (or another one)
10Dataflow it is!
11The Dataflow Difference
12Data/Process/Provenance Central
13BUY ME!!
14A Signal Processing Pipeline
15Some Terminology (tentative)
- Workflow definition W (roughly, the WF graph we see)
  - partial specification of a workflow (cf. program)
  - parameters P need to be instantiated
  - data bindings D can be viewed as special parameters
- Model of Computation (MoC)
  - Looking at W, P, D, we still do not know how to execute W(P,D) to compute result R
  - A MoC is an algorithm telling us how to apply W on P and D to obtain R
  - Examples
    - MoC TM (Turing Machine)
      - given program P and input I, we know what to do
    - MoC PN (Process Network)
      - Network of independent processes, communicating through (infinite) unidirectional buffers (queues), with prefix-monotonic behavior: given a PN, an input stream, and prefix-monotonic, deterministic actors, the output stream is determined! (lots of flexibility for execution!)
    - MoC SDF (Synchronous Dataflow)
      - Similar to PN, but actors must statically declare their token production/consumption rates; solving for positive integer solutions of the balance equations (a system of linear equations) yields a static schedule guaranteeing fixed buffer size
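The SDF balance equations can be illustrated with a small sketch. For every edge, the tokens produced per schedule iteration must equal the tokens consumed: q[producer] * produced = q[consumer] * consumed. The function name `repetition_vector`, the edge tuple layout, and the three-actor example are assumptions made for illustration; this is not how Ptolemy II or Kepler implement their SDF scheduler.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations for a connected graph.

    edges: list of (producer, tokens_produced, consumer, tokens_consumed).
    Returns the smallest positive integer firing counts per actor,
    or None if the declared rates are inconsistent (no static schedule).
    """
    rate = {actors[0]: Fraction(1)}  # fix one actor, propagate ratios
    pending = list(edges)
    progress = True
    while pending and progress:
        progress = False
        for edge in list(pending):
            src, produced, dst, consumed = edge
            if src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    return None  # balance equation violated
            elif src in rate:
                rate[dst] = rate[src] * produced / consumed
            elif dst in rate:
                rate[src] = rate[dst] * consumed / produced
            else:
                continue  # neither end known yet, retry later
            pending.remove(edge)
            progress = True
    if pending or len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    return {a: int(rate[a] * scale) for a in actors}


# Hypothetical pipeline: A produces 2 tokens, B consumes 3 and produces 1,
# C consumes 2.
print(repetition_vector(["A", "B", "C"],
                        [("A", 2, "B", 3), ("B", 1, "C", 2)]))
```

Firing each actor the returned number of times brings every buffer back to its initial fill level, which is what makes a fixed buffer size possible.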
16Some Terminology (tentative)
- Model of Computation (MoC)
- WF Run: completed computation
- WF Execution: ongoing computation
- Computation graph: graph data structure keeping track of which token has been computed from which other one(s)
  - Simple examples: evaluating an arithmetic expression; running a job DAG
  - But keeping track of real dependencies can be tricky
    - Ex: output tuples of an SQL query have witness tuples in multiple relations; clear for positive existential queries, but what are witnesses for universal and negated queries? R = A \ B: witnesses, anybody?
    - Similar to the notion of proof tree in logic (and LP); negation-as-failure rears its ugly (beautiful?) head!
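For the simple case (evaluating an arithmetic expression), a computation graph can be sketched as tokens that remember their parents. The names `Token`, `derive` and `lineage` are hypothetical, and the sketch deliberately ignores the hard cases above (negation, universal queries).

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    name: str
    value: float
    parents: tuple = ()  # tokens this one was directly computed from


def derive(name, fn, *inputs):
    """Apply fn to the input values and record the dependency edges."""
    return Token(name, fn(*(t.value for t in inputs)), inputs)


def lineage(token):
    """Names of all tokens the given token was (transitively) computed from."""
    seen = set()
    stack = list(token.parents)
    while stack:
        t = stack.pop()
        if t.name not in seen:
            seen.add(t.name)
            stack.extend(t.parents)
    return seen


# (x + y) * z with x=2, y=3, z=4
x, y, z = Token("x", 2), Token("y", 3), Token("z", 4)
s = derive("s", operator.add, x, y)
r = derive("r", operator.mul, s, z)
print(r.value, sorted(lineage(r)))
```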
17Research Area Provenance
- (Abstract) Use Cases
  - Total Recall: capture everything the MoC can observe
    - and more: MoC-inherent plus additional observables
    - Example: time-stamp token-in, token-out events => benchmark actor exec time, data movement time, ...
  - The 7 Ws: Who, What, Where, Why, When, Which, (W)how (C. Goble)
  - Smart Re-run: after Pause or Stop, followed by parameter changes, rerun relevant parts
  - Fault tolerance, crash recovery (cf. checkpointing)
  - Result interpretation and post-mortem analysis
- Research Question
  - Given a use case (as a query U) and a provenance schema PS, can U be answered using PS? (related to query answering using views: a reasoning problem!)
  - Ultimately: design PS with U in mind! Also optimize/specialize PS if U is known/limited
  - Note: the MoC can make a difference! For example, some MoCs have an explicit notion of firing or might exploit actor declarations ("I'm a function! I have no state!"). This is relevant e.g. for checkpointing (Need to save state or not? When to save state? ..)
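The time-stamping example can be sketched as a recorder that wraps each actor firing. `ProvenanceRecorder` and its methods are hypothetical names, not part of Kepler; the sketch assumes stateless, single-input actors that fire one at a time.

```python
import time


class ProvenanceRecorder:
    """Records token-in / token-out events with timestamps."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.events = []  # (timestamp, actor, kind, token)

    def fire(self, actor_name, actor_fn, token):
        self.events.append((self.clock(), actor_name, "token-in", token))
        result = actor_fn(token)
        self.events.append((self.clock(), actor_name, "token-out", result))
        return result

    def exec_times(self):
        """Total time between token-in and token-out, per actor."""
        totals, started = {}, {}
        for ts, actor, kind, _ in self.events:
            if kind == "token-in":
                started[actor] = ts
            else:
                totals[actor] = totals.get(actor, 0.0) + ts - started.pop(actor)
        return totals


recorder = ProvenanceRecorder()
doubled = recorder.fire("double", lambda x: 2 * x, 21)
recorder.fire("inc", lambda x: x + 1, doubled)
print(recorder.exec_times())
```

Whether such an event log can answer a given use-case query U is exactly the research question above: here it supports benchmarking, but not, say, "which parameter settings were used".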
18Research Area WF/Dataflow Design
- Collection-Oriented Modeling (COM)
  - Assembly line metaphor: Signal Processing, XML
  - Streams are nested collections (cf. XML)
  - Stream data schema is registered to a WF data model (really need this)
  - Actor picks up only certain parts of the stream: scope
  - Actor declares how the data within the scope is changed: delta
  - Gives rise to new notions of type and new problems of type inference (using scope, delta, workflow structure, etc.)
- Advantages
  - Less messy WFs (more linear, less branching)
  - Add-only mode (inject new derived information): augmentation instead of transformation
  - Tagging data for downstream processing (instead of bombing, pass on dirty / faulty / strange data with a relevant tag)
  - Pipelined parallelism (can stream an array)
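A minimal sketch of the scope / add-only / tagging ideas, assuming stream items are plain dictionaries. The function `com_actor`, the `gc_content` derivation and the item fields are invented for illustration; real COM streams are nested collections with a registered schema.

```python
def com_actor(stream, in_scope, derive, tag_key="tags"):
    """Generator-style actor: touches only in-scope items, adds derived
    fields (add-only), and tags faulty items instead of failing."""
    for item in stream:
        if not in_scope(item):
            yield item  # outside the scope: passed through untouched
            continue
        out = dict(item)  # copy; existing fields are never dropped
        try:
            out.update(derive(item))
        except (ValueError, KeyError, TypeError) as err:
            out[tag_key] = out.get(tag_key, []) + [f"error: {err}"]
        yield out


def gc_content(item):
    """Hypothetical derivation: fraction of G/C bases in a sequence."""
    bases = item["bases"]
    if not bases:
        raise ValueError("empty sequence")
    return {"gc": (bases.count("G") + bases.count("C")) / len(bases)}


stream = [
    {"kind": "seq", "bases": "ACGT"},
    {"kind": "note", "text": "hi"},
    {"kind": "seq", "bases": ""},
]
for item in com_actor(stream, lambda it: it["kind"] == "seq", gc_content):
    print(item)
```

Because the actor is a generator, downstream actors can start consuming items before the upstream collection is complete, which is the pipelined parallelism mentioned above.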
19Research WF Design
- ER model primitives
- Entity (-type), attribute, relationship (-type)
- SWF model primitives??
- Actors, directors (MoC),
- Lots of new types
  - Conventional data type (Java style)
  - Polymorphic types w/ type variables (Haskell style)
  - Semantic type (formal annotations in logic relative to a controlled vocabulary or knowledge base)
  - Hybrids
- A theory of adapters!?
20(Untitled slide: annotated workflow figure; source: Altintas-et-al-PIW-SSDBM03)
- designed to fit
- hand-crafted control solution; also forces sequential execution!
- hand-crafted Web-service actor
- No data transformations available
- Complex backward control-flow
21A Scientific Workflow Problem More Solved (Computer Scientist's view)
- Solution based on declarative, functional dataflow process network
  - (also a data streaming model!)
- Higher-order constructs: map(f)
  - no control-flow spaghetti
  - data-intensive apps
  - free concurrent execution
  - free type checking
  - automatic support to go from piw(GeneId) to PIW: map(piw) over GeneId
Figure callouts:
- map(f)-style iterators
- Powerful type checking
- Generic, declarative programming constructs
- Generic data transformation actors
- Forward-only, abstractable sub-workflow piw(GeneId)
22A Scientific Workflow Problem Even More Solved (domain + CS coming together!)
map(GenbankWS)
Input: NM_001924, NM020375
Output: CAGTAATATGAC, GGGGACAA AGA
23Research Problem Optimization by Rewriting
- Example: PIW as a declarative, referentially transparent functional process
  - optimization via functional rewriting possible
  - e.g. map(f o g) = map(f) o map(g)
- Technical report: PIW specification in Haskell
Figure callouts:
- map(f o g) instead of map(f) o map(g)
- Combination of map and zip
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
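The rewrite rule map(f o g) = map(f) o map(g) can be checked on a toy example. The sketch below is in Python rather than the Haskell of the technical report; `compose`, `fmap`, `parse` and `square` are hypothetical stand-ins for workflow actors. The rule is only safe when f and g are referentially transparent (no side effects).

```python
def compose(f, g):
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))


def fmap(f):
    """Lift a function on elements to a function on lists."""
    return lambda xs: [f(x) for x in xs]


def parse(s):   # hypothetical actor g
    return int(s)


def square(n):  # hypothetical actor f
    return n * n


two_pass = compose(fmap(square), fmap(parse))  # builds an intermediate list
one_pass = fmap(compose(square, parse))        # fused: one traversal

data = ["1", "2", "3"]
assert two_pass(data) == one_pass(data) == [1, 4, 9]
```

The fused form avoids materializing the intermediate collection, which matters when each element is a large dataset or a remote service call.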
24Job Management (here NIMROD)
- Job management infrastructure in place
- Results database under development
- Goal: 1000s of GAMESS jobs (quantum mechanics)
25Kepler Coupling Components & Codes
- Types of Coupling
  - Loosely coupled (1st Phase)
    - Web Services (SPA, GEON, SEEK, ...)
    - ssh actors, ..
    - reusability (behavioral polymorphism)
    - scalability (components)
    - efficiency
  - Tight(er) coupling (2nd Phase)
    - Via CCA (SciRUN-2, Ccaffeine, ...) (Cipres uses CORBA)
    - HPC needs code-coupling as efficient and flexible as possible (e.g. Scott's challenges)
      - memory-to-memory (single node or shared memory)
      - MPI (multiple nodes)
      - optimizations for transfer of data and control (streaming, socket-based connections)
26Accord-CCA Ccaffeine w/ Self-Managed Behavior
cf. mobile models and reconfiguration in Ptolemy II: begging for a Kepler design and implementation
Source: Hua Liu and Manish Parashar
27Fault Tolerance Maintenance Challenges
28Workflow Templates and Patterns
New Ingredients
Proposed Layered Architecture
work w/ Anne Ngu, Shawn Bowers, Terence Critchlow
29Use Ideas from Fault Tolerant Shell
Good ideas in ftsh; some might be (semi-)low-hanging fruit for Kepler
Source: Douglas Thain, Miron Livny, "The Ethernet Approach to Grid Computing"
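One such idea is retrying a failing operation within a time budget, backing off between attempts as Ethernet does after a collision. The sketch below is a loose Python illustration, not ftsh syntax and not a Kepler API; `try_for` and all of its parameters are hypothetical names.

```python
import random
import time


def try_for(action, time_limit, base_delay=1.0, max_delay=60.0,
            sleep=time.sleep, clock=time.monotonic, rng=random.random):
    """Retry action() until it succeeds or the time budget runs out.

    Waits between attempts grow exponentially and are randomized so that
    many clients do not retry in lockstep.
    """
    deadline = clock() + time_limit
    delay = base_delay
    while True:
        try:
            return action()
        except Exception:
            wait = delay * (1 + rng())  # randomized backoff
            if clock() + wait > deadline:
                raise  # out of budget: propagate the last failure
            sleep(wait)
            delay = min(2 * delay, max_delay)


# Usage sketch: a flaky operation that succeeds on the third attempt.
attempts = {"n": 0}

def flaky_transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transfer failed")
    return "ok"

print(try_for(flaky_transfer, time_limit=5, base_delay=0.01))
```

A workflow-level "Red Box" could apply the same policy around any actor, so that individual actors need not implement their own retry logic.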
30Use of Semantics in SWF
- Smart Search
  - Concept-based, e.g., find all datasets containing biomass measurements
- Improved Linking, Merging, Integration
  - Establishing links between data through semantic annotations and ontologies
  - Combining heterogeneous sources based on annotations
  - Concatenate, Union (merge), Join, etc.
- Transforming
  - Construct mappings from schema S1 to S2 based on annotations
- Semantic Propagation
  - Pushing semantic annotations through transformations/queries
31Typing Workflow Components
The Semantic Type Editor allows the user to assign one or more semantic types to the component or to the component's input and output ports. In the simplest case, a semantic type is a class taken from an OWL-DL ontology. Multiple types define a conjoined concept expression. The above-right screenshot shows a user assigning semantic types to the dataset, and the above-left screenshot shows the user assigning an ontology class to the output port (dataset attribute) labeled Plot.
Because ontologies can get large and complicated, there is a built-in browser for navigating through and choosing the concept that fits the port.
A simple ontology browser is provided in Kepler for navigating a classified OWL-DL concept hierarchy and ontology properties. Classes can be searched for and selected. Selecting a class assigns it as the corresponding semantic type.
32More on Semantic Annotation
- Initial Version Supports
  - Actor-level and port-level annotations
  - Annotations are stored in the actor's MoML definition (as new semantic type properties)
  - Creation of composite ports (i.e., virtual ports grouping a set of underlying ports)
  - Regular and composite ports may have multiple annotations (conjunction)
  - Annotations can be drawn from multiple ontologies
Figure: An annotated composite port
33More on Semantic Annotation
- Currently Adding
  - Semantic Link Annotations for annotation of ports via ontology properties
    - E.g., hasLat(point1, lat1)
    - Supported in MoML, not yet in tool
  - Simple condition filters in port semantic annotations
    - E.g., if attribute height > 0 then biomass is annotated as AboveGroundBiomass
  - Incorporating instances/values in semantic links
    - E.g., hasUnit(biomass, celsius)
  - Suggesting additional annotations based on given ones
    - suggesting/guessing ways to fill in given annotations
    - E.g., possible semantic links
  - Templates and ontology views
    - To help specify common annotation patterns
Figure: Semantic Links
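The condition-filter example (height > 0 implies AboveGroundBiomass) can be sketched as rules evaluated per record. The rule tuple layout, the `annotate` function and the record fields are assumptions for illustration; Kepler stores such annotations in MoML, not in Python structures.

```python
def annotate(record, rules):
    """Return {attribute: set of ontology classes} for one data record.

    rules: list of (attribute, condition on the record, ontology class).
    """
    annotations = {}
    for attribute, condition, concept in rules:
        if attribute in record and condition(record):
            annotations.setdefault(attribute, set()).add(concept)
    return annotations


# Rule from the slide: if height > 0, biomass is AboveGroundBiomass.
RULES = [
    ("biomass", lambda r: r.get("height", 0) > 0, "AboveGroundBiomass"),
]

print(annotate({"height": 1.2, "biomass": 30.0}, RULES))
print(annotate({"height": -0.4, "biomass": 12.0}, RULES))
```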
34Checking Type Constraints
Kepler can statically perform semantic and structural type checking of connections. A type checker allows the user to see potentially mismatched port connections as well as known type conflicts before workflow execution.
The user can navigate the unsafe and potentially unsafe channels using the Kepler Type Checker dialog. When a channel is selected, (a) it is highlighted on the canvas, (b) the structural type and status are shown (here, the channel is structurally well typed), and (c) the semantic type and status are shown (here, the connection produces a semantic type error).
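A toy version of the semantic part of such a check, assuming a single-inheritance class hierarchy. The ontology contents, the function names and the rule used for "potentially unsafe" (the output type is more general than what the input port expects) are assumptions for illustration; Kepler's actual checker reasons over OWL-DL and also checks structural types.

```python
# Hypothetical ontology fragment: child class -> parent class.
SUBCLASS_OF = {
    "AboveGroundBiomass": "Biomass",
    "Biomass": "Measurement",
    "Temperature": "Measurement",
}


def ancestors(cls):
    """The class itself followed by all of its superclasses."""
    chain = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_subclass(sub, sup):
    return sup in ancestors(sub)


def check_semantic(output_type, input_type):
    """Classify a channel from an output port to an input port."""
    if is_subclass(output_type, input_type):
        return "safe"                # every produced value fits
    if is_subclass(input_type, output_type):
        return "potentially unsafe"  # some produced values may not fit
    return "unsafe"                  # known type conflict


print(check_semantic("AboveGroundBiomass", "Biomass"))
print(check_semantic("Measurement", "Biomass"))
print(check_semantic("Temperature", "Biomass"))
```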
35Kepler Actor-Library
- Ontology-based actor organization / browsing
- Customizable libraries based on ontologies
- Text search with concept-based expansion
Users can discover ImageJ using various search terms. Here, ImageJ shows up in multiple tree locations based on its given annotations. The library search permits text-based matching against the component's metadata (its given name and certain properties), expanded with concept matches.
36Semantic Searching
Kepler provides a more advanced ontology-based search mechanism. Users can start the Semantic Search dialog, where components can be searched for based on their semantic types.
The Semantic Search dialog allows a user to search components by any combination of actor, input, and output semantic types.
37 Structural Type (XML DTD) Annotations
structType(P2), structType(P3)
root cohortTable = (measurement)
elem measurement = (phase, obs)
elem phase = xsd:string
elem obs = xsd:integer
<population>
  <sample>
    <meas> <cnt>44,000</cnt> <acc>0.95</acc> </meas>
    <lsp>Eggs</lsp>
  </sample>
</population>
<cohortTable>
  <measurement> <phase>Eggs</phase> <obs>44,000</obs> </measurement>
</cohortTable>
Figure labels: ports P1, P2, P3, P4, P5; services S1 (life stage property), S2 (mortality rate for period)
Source: Bowers-Ludaescher, DILS04
38Ontology-Guided Data Transformation
[Figure labels: Ontologies (OWL); Compatible (?); SemanticType Ps, SemanticType Pt; Structural/Semantic Association (twice); StructuralType Ps, StructuralType Pt; Correspondence; Generate; Transformation ?(Ps); Source Service, Target Service; ports Ps, Pt; Desired Connection]
Source: Bowers-Ludaescher, DILS04
39WF-Design: Adapters for Semantic and Structural Incompatibility
- Adapters may
  - be abstract (no impl.)
  - be concrete
  - bridge a semantic gap
  - fix a structural mismatch
  - be generated automatically (e.g., Taverna's list mismatch)
  - be reused components (based on signatures)
[Figure: adapter diagrams; recoverable labels: ports C, D, C1, D1, C2, D2 (plus variants whose marks were lost in extraction), types S, T, functions f1, f2, map]
Source: Bowers-Ludaescher, ER05
40Additional Design Primitives for Semantic Types
Extended Transformations (figure columns: Starting Workflow, Resulting Workflow)
- t9 Actor Semantic Type Refinement (T? T)
- t10 Port Semantic Type Refinement (C? C, D? D)
- t11 Annotation Constraint Refinement (?? ? ?)
- t12 I/O Constraint Strengthening (? ? ?)
- t13 Data Connection Refinement
- t14 Adapter Insertion
- t15 Actor Replacement
- t16 Workflow Combination (Map)
Source: Bowers-Ludaescher, ER05
41Scientific Workflow Design
- Support SWF design reuse, via
  - Structural data types
  - Semantic types
  - Associations (constraints) between them
  - Type checking, inference, propagation
- => Separation of concerns
  - structure, semantics, WF orchestration, etc.
Source: Bowers-Ludaescher, ER05
42Semantic Annotation Propagation
43Forward and Backward Propagation Rules
44GEON Dataset Generation & Registration (and co-development in KEPLER)
Makefile > ant run
SQL database access (JDBC)
Matt et al. (SEEK)
Efrat (GEON)
Ilkay (SDM)
Yang (Ptolemy)
Xiaowen (SDM)
Edward et al.(Ptolemy)
45Web Services -> Actors (WS Harvester)
- => "Minute-made" (MM) WS-based application integration
- Similarly: MM workflow design and sharing w/o implemented components
46Some KEPLER Actors (out of 160 and counting)
47Different Directors for Different Concerns
- Example
  - Ptolemy Directors: factoring out the concern of workflow orchestration (MoC)
  - common aspects of overall execution not left to the actors
- Similarly
  - Black Box (flight recorder)
    - a kind of "recording central" to avoid wiring 100s of components to recording-actor(s)
  - Red Box (error handling, fault tolerance)
    - use ftsh ideas, templates
  - Yellow Box (type checking)
    - for workflow design
  - Blue Box (shipping-and-handling)
    - central handling of data transport (by value, by reference, by scp, SRB, GridFTP, ...)
  - CCA Boxes
    - Change behavior (e.g. algorithm) of a component
    - Change behavior (i.e., wiring) of a workflow in-flight
Figure labels: SDF/PN/DE/...; Provenance Recorder; On Error; Static Analysis; SHA @; Component Mgr; Composition Mgr
48Separation of Concerns: Port Types
- Token consumption (and production) type
  - a director's concern
- More generally: resource consumption type
  - other scheduling problems
- Token transport type
  - by value, reference (which one), protocol (SOAP, scp, GridFTP, SRB, ...)
  - a SHA concern
- Structural and semantic types
  - SAT (static analysis typing) concern
  - built after static unit type system
  - static unit type system as a special case!?
49Other Research Problems
- Making the system more X-aware
  - MoC-aware: ok (directors)
  - Provenance-aware
  - DS (data schema)-aware
  - Semantics-aware: upcoming (should be hybrid w/ DS)
  - Host-aware: allow distributed scheduling of actors
  - Data-transport-aware: choose suitable data transport protocol (scp, bbcp, http, (Grid-)ftp, SRB, SRM, ...)
- Think of new folks on the movie set
  - Actors, director
  - Cameraman (provenance recorder?)
  - Editor (FF/REW/Play/Pause/Stop provenance re-run)
  - Caterer/Stager (feeding actors with yummy tokens!)
  - Managers for Process Central and Data Central
  - Semantic/Hybrid Type Manager
50More Research Topics
- What if we know something about bandwidths, processor loads, data sizes?
  - => workflow optimization!
- What if we have more semantics for actors?
  - Black-box: token in/out
  - Grey-box: data types, semantic types
  - White-box: exact functional behavior is known!
    - Example: Actor implements a (stream-?) query!
    - => Query Process Network
    - New optimization opportunities!
51A User's Wish List
- Usability
- Closing the lid (cf. vnc)
- Dynamic plug-in of actors (cf. actor and data registries/repositories)
- Distributed WF execution
- Collection-based programming
- Grid awareness
- Semantics awareness
- WF Deployment (as a web site, as a web service, ...)
- Power apps (cf. SCIRun)