Zen and the Art of SWF Maintenance - PowerPoint PPT Presentation

Provided by: bertr68

1
Zen and the Art of SWF Maintenance
  • Kinds of Scientific Workflows
  • Why not just Python scripts?
  • Business workflows born again?
  • Zen and the art of workflow design
  • and other research issues

2
What is a Scientific Workflow (SWF)?
  • Model the way scientists work with their data and
    tools
  • Mentally coordinate data export, import, analysis
    via software systems
  • Scientific workflows emphasize data flow (unlike
    business workflows)
  • Metadata (incl. provenance info, semantic types
    etc.) is crucial for automated data ingestion,
    data analysis, ...
  • Goals
      • SWF automation
      • SWF and component reuse
      • SWF design documentation
      • → making scientists' data analysis and
        management easier!

3
What we use SWF for
  • Short answer: Everything
      • includes making coffee (tea ceremonies are
        harder)
  • Kinds of workflows (not disjoint)
      • Plumbing: stage files, submit batch jobs,
        monitor progress, move files off XT3 to
        analysis and viz cluster, archive, steer
        computation, ...
      • Ex: Fusion simulation, Astrophysics (supernova
        simulation), your laptop backup???
      • Knowledge discovery workflows: automate
        repetitive data access, retrieval, custom
        analysis (e.g. Blast), generic steps (PCA,
        cluster analysis, ...)
      • Do this in ways that are meaningful to the
        scientist
      • Ex: PIW, Motif analysis, NDDP, ...
      • Conceptual modeling workflows: "What the heck
        is XYZ doing?" Reverse engineering of processes
        and information flows at all levels; in order
        to optimize, we need to understand first
      • Ex: napkin-drawing workflows to get an
        overview; refine design from abstract to
        executable (top-down), or generalize from the
        concrete/legacy to the abstract (bottom-up);
        data-driven, task-driven, ...

4
Why not just a Python script?
  • Users who might be able to define, reuse, modify,
    specialize WFs might not be able to do the same
    for Python scripts
  • But wait, there's more
      • Modular reuse
      • Debugging and monitoring of WF execution
      • easy to tee (see "man tee", for you Windows
        guys :-)
      • Automated Provenance Mgmt
      • Semantic types
      • From integrated WF modeling (ER dataflow
        co-registrations) to execution, optimization,
        archival

5
Business workflows born-again?
  • Yes, there are similarities
  • And we can learn from BWF! E.g. transactions!
  • But also big differences
  • SWF
      • data-flow oriented
      • streaming/pipelined execution
      • cf. signal processing (see also COM later)
      • popular MoC: PN
  • BWF
      • task- and control-flow oriented
      • popular MoC: Petri net? CSP?

6
Sample BWFs
  • Focus is on
  • Tasks
  • Control-flow
  • Work items
  • Useful stuff
  • Transactions!
  • How to handle complex control-flow

7
Pop Quiz! BWF? SWF?
8
And the answer is
9
Click here for Oracle (or another one)
10
Dataflow it is!
11
The Dataflow Difference
12
Data/Process/Provenance Central
13
BUY ME!!
14
A Signal Processing Pipeline
15
Some Terminology (tentative)
  • Workflow definition W (roughly, the WF graph we
    see)
      • partial specification of a workflow (cf.
        program)
      • parameters P need to be instantiated
      • data-bindings D can be viewed as special
        parameters
  • Model of Computation (MoC)
      • Looking at W, P, D we still do not know how to
        execute W(P,D) to compute result R
      • A MoC is an algorithm telling us how to apply
        W to P and D to obtain R.
  • Examples
      • MoC TM (Turing Machine): given program P and
        input I, we know what to do
      • MoC PN (Process Network): network of
        independent processes communicating through
        (infinite) unidirectional buffers (queues).
        Prefix-monotonic behavior: given a PN, an input
        stream, and prefix-monotonic, deterministic
        actors, the output stream is determined! (lots
        of flexibility for execution!)
      • MoC SDF (Synchronous Dataflow): similar to PN,
        but actors must statically declare their token
        production/consumption rates; solving the
        balance equations (a linear system, LGS) for
        positive integer solutions yields a static
        schedule guaranteeing fixed buffer size
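The SDF balance equations can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the three-actor pipeline and its token rates are made up, and `repetition_vector` is a hypothetical helper, not part of Kepler or Ptolemy II. It fixes one actor's firing count, propagates ratios along the edges, and scales the result to the smallest positive integer solution.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(actors, edges):
    """Return the smallest positive integer firing counts q such that
    q[src] * produced == q[dst] * consumed holds on every edge.

    edges is a list of (src, produced, dst, consumed) tuples.
    Assumes the graph is connected; raises ValueError if the rates
    are inconsistent (no static schedule exists).
    """
    q = {actors[0]: Fraction(1)}          # fix one actor, propagate ratios
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    for src, prod, dst, cons in edges:    # every balance equation must hold
        if q[src] * prod != q[dst] * cons:
            raise ValueError("inconsistent rates: no static schedule")
    # Scale the rational solution to the smallest integer one.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in q.values()), 1)
    counts = {a: int(f * scale) for a, f in q.items()}
    common = reduce(gcd, counts.values())
    return {a: n // common for a, n in counts.items()}

# Hypothetical pipeline: A emits 2 tokens per firing, B consumes 3 and
# emits 1, C consumes 2.
schedule = repetition_vector(["A", "B", "C"],
                             [("A", 2, "B", 3), ("B", 1, "C", 2)])
print(schedule)
```

Firing each actor the returned number of times brings every buffer back to its initial fill level, which is why a fixed buffer size suffices.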

16
Some Terminology (tentative)
  • Model of Computation (MoC)
  • WF Run: completed computation
  • WF Execution: ongoing computation
  • Computation graph: graph data structure keeping
    track of which token has been computed from which
    other one(s)
  • Simple examples: evaluating an arithmetic
    expression; running a job DAG
  • But keeping track of real dependencies can be
    tricky
  • Ex: output tuples of an SQL query have witness
    tuples in multiple relations. Clear for positive
    existential queries, but what are witnesses for
    universal and negated queries? R = A \ B:
    witnesses, anybody?
  • Similar to the notion of proof tree in logic
    (and LP): negation-as-failure rears its ugly
    (beautiful?) head!
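For the simple case of an arithmetic expression, a computation graph can be sketched as follows. This is a minimal illustration, not Kepler's provenance model; `Token`, `derive` and `lineage` are hypothetical names introduced here. Each token remembers the tokens it was computed from, so the lineage of a result is a plain graph traversal.

```python
import operator
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    value: float
    label: str
    parents: tuple = ()      # tokens this one was computed from

def derive(label, fn, *inputs):
    """Apply fn to the input values and record the inputs as parents."""
    return Token(fn(*(t.value for t in inputs)), label, inputs)

def lineage(token):
    """All tokens the given token was (transitively) computed from."""
    seen, stack = [], list(token.parents)
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.append(t)
            stack.extend(t.parents)
    return seen

# Evaluate (a + b) * c while recording dependencies.
a, b, c = Token(2, "a"), Token(3, "b"), Token(4, "c")
total = derive("a+b", operator.add, a, b)
product = derive("(a+b)*c", operator.mul, total, c)
print(product.value, sorted(t.label for t in lineage(product)))
```

The tricky cases named on the slide (negation, universal queries) are exactly those where "the inputs that were read" is no longer an adequate notion of dependency, so this sketch does not cover them.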

17
Research Area: Provenance
  • (Abstract) Use Cases
      • Total Recall: capture everything the MoC can
        observe
      • and more: MoC-inherent plus additional
        observables
      • Example: time-stamp token-in, token-out events
        → benchmark actor exec time, data movement
        time, ...
      • The 7 Ws: Who, What, Where, Why, When, Which,
        (W)how (C. Goble)
      • Smart Re-run: after Pause or Stop, followed by
        parameter changes, rerun only the relevant
        parts
      • Fault tolerance, crash recovery (cf.
        checkpointing)
      • Result interpretation and post-mortem analysis
  • Research Question
      • Given a use case (as a query U) and a
        provenance schema PS, can U be answered using
        PS? (related to query answering using views: a
        reasoning problem!)
      • Ultimately: design PS with U in mind! Also
        optimize/specialize PS if U is known/limited
      • Note: the MoC can make a difference! For
        example, some MoCs have an explicit notion of
        firing or might exploit actor declarations
        ("I'm a function! I have no state!"). This is
        relevant e.g. for checkpointing (Need to save
        state or not? When to save state?)
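The "smart re-run" use case can be sketched in Python as a reachability question over recorded dependencies. The step names and the `deps` table below are hypothetical, and the sketch assumes stateless (function-like) steps, so that an unchanged step with unchanged inputs can safely reuse its earlier result.

```python
def steps_to_rerun(deps, changed):
    """deps maps each step to the steps it reads from; changed is the set
    of steps whose parameters were modified. Returns every step that must
    be executed again: the changed ones plus everything downstream."""
    dirty = set(changed)
    grew = True
    while grew:
        grew = False
        for step, inputs in deps.items():
            if step not in dirty and dirty.intersection(inputs):
                dirty.add(step)
                grew = True
    return dirty

# Hypothetical workflow: load -> filter -> pca -> plot, and load -> archive.
deps = {
    "load": [],
    "filter": ["load"],
    "pca": ["filter"],
    "plot": ["pca"],
    "archive": ["load"],
}
print(sorted(steps_to_rerun(deps, {"pca"})))
```

Steps outside the returned set can be served from the provenance store instead of being recomputed.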

18
Research Area: WF/Dataflow Design
  • Collection-Oriented Modeling (COM)
      • Assembly line metaphor: Signal Processing, XML
      • Streams are nested collections (cf. XML)
      • Stream data schema is registered to a WF data
        model (really need this)
      • Actor picks up only certain parts of the
        stream: its scope
      • Actor declares how data within the scope is
        changed: its delta
      • Gives rise to new notions of type and new
        problems of type inference (using scope, delta,
        workflow structure etc.)
  • Advantages
      • Less messy WFs (more linear, less branching)
      • Add-only mode (inject new derived information):
        augmentation instead of transformation
      • Tagging data for downstream processing (instead
        of bombing, pass on dirty / faulty / strange
        data with a relevant tag)
      • Pipelined parallelism (can stream an array)
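The scope/delta idea and the add-only, tag-instead-of-bomb behavior can be sketched with Python generators. This is an assumption-laden toy, not the COM implementation: stream items are plain dictionaries, and `com_actor`, `scope` and `delta` are hypothetical names chosen to mirror the slide's vocabulary.

```python
def com_actor(scope, delta):
    """Wrap delta so that it only sees items in scope and can only add keys."""
    def run(stream):
        for item in stream:
            if not scope(item):
                yield item                      # outside scope: pass through untouched
                continue
            try:
                added = delta(item)
            except Exception as err:            # do not bomb: tag and pass on
                added = {"tag": f"faulty: {err}"}
            yield {**added, **item}             # existing keys win: add-only
    return run

# Hypothetical stream mixing sequence records with other items.
stream = [
    {"kind": "seq", "data": "ACGT"},
    {"kind": "note", "data": "hello"},
    {"kind": "seq", "data": None},              # dirty record
]
measure = com_actor(scope=lambda item: item["kind"] == "seq",
                    delta=lambda item: {"length": len(item["data"])})
out = list(measure(stream))
print(out)
```

Because each actor is a generator over the stream, several such actors can be chained into a linear pipeline and items flow through one at a time.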

19
Research: WF Design
  • ER model primitives
  • Entity (-type), attribute, relationship (-type)
  • SWF model primitives??
  • Actors, directors (MoC),
  • Lots of new types
  • Conventional data type (Java style)
  • Polymorphic types w/ type variables (Haskell
    style)
  • Semantic type (formal annotations in logic
    relative to a controlled vocabulary or knowledge
    base)
  • Hybrids
  • A theory of adapters !?

20
[Figure: workflow from Altintas-et-al-PIW-SSDBM03,
with callouts]
  • Components "designed to fit"
  • Hand-crafted control solution also forces
    sequential execution!
  • Hand-crafted Web-service actor
  • No data transformations available
  • Complex backward control-flow
21
A Scientific Workflow Problem: More Solved
(Computer Scientist's view)
  • Solution based on declarative, functional
    dataflow process network
  • (also a data streaming model!)
  • Higher-order constructs: map(f)
  • no control-flow spaghetti
  • data-intensive apps
  • free concurrent execution
  • free type checking
  • automatic support to go from piw(GeneId) to
    PIW = map(piw) over GeneIds

map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
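Lifting a per-item step to a whole collection with map(f) can be sketched in Python. The `piw` function below is a hypothetical stand-in for the real promoter identification step, and `lift` is a name introduced here. Because the step is assumed to be stateless, the lifted version may run items concurrently without any hand-written control flow.

```python
from concurrent.futures import ThreadPoolExecutor

def lift(f, workers=4):
    """Turn a function on one item into a function on a list of items."""
    def mapped(items):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(f, items))     # results keep input order
    return mapped

def piw(gene_id):
    # Hypothetical stand-in for the per-gene workflow.
    return f"promoter-report({gene_id})"

PIW = lift(piw)                                  # PIW = map(piw)
reports = PIW(["NM_001924", "NM020375"])
print(reports)
```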
22
A Scientific Workflow Problem: Even More Solved
(domain + CS coming together!)
map(GenbankWS)
Input: NM_001924, NM020375
Output: CAGTAATATGAC, GGGGACAA AGA
23
Research Problem: Optimization by Rewriting
  • Example: PIW as a declarative, referentially
    transparent functional process
  • optimization via functional rewriting possible
  • e.g. map(f o g) = map(f) o map(g)
  • Technical report: PIW specification in Haskell

map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
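The rewrite rule map(f o g) = map(f) o map(g) can be checked on a toy example in Python. The functions `compose` and `fmap` and the sample data are illustrative assumptions; the rule holds because both steps are pure, which is the referential transparency the slide relies on.

```python
def compose(f, g):
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))

def fmap(f):
    """Lift f to lists: fmap(f)(xs) applies f to every element."""
    return lambda xs: [f(x) for x in xs]

strip, upper = str.strip, str.upper
ids = [" nm_001924 ", "nm020375"]

two_passes = compose(fmap(upper), fmap(strip))(ids)   # map(f) o map(g)
one_pass = fmap(compose(upper, strip))(ids)           # map(f o g), fused
print(one_pass)
```

The fused form traverses the collection once and avoids the intermediate list, which is the kind of optimization a rewriting engine could apply automatically.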
24
Job Management (here: NIMROD)
  • Job management infrastructure in place
  • Results database under development
  • Goal: 1000s of GAMESS jobs (quantum mechanics)

25
Kepler: Coupling Components & Codes
  • Types of Coupling
  • Loosely coupled (1st Phase)
      • Web Services (SPA, GEON, SEEK, ...),
      • ssh actors, ..
      • reusability (behavioral polymorphism)
      • scalability (# components)
      • efficiency
  • Tight(er) coupling (2nd Phase)
      • Via CCA (SciRUN-2, Ccaffeine, ...)
        (Cipres uses CORBA)
      • HPC needs code-coupling as efficient and
        flexible as possible (e.g. Scott's challenges)
      • memory-to-memory (single node or shared
        memory),
      • MPI (multiple nodes)
      • optimizations for transfer of data & control
        (streaming, socket-based connections)

26
Accord-CCA: Ccaffeine w/ Self-Managed Behavior
cf. w/ mobile models, reconfiguration in Ptolemy
II: begging for a Kepler design and
implementation
Source: Hua Liu and Manish Parashar
27
Fault Tolerance & Maintenance Challenges
28
Workflow Templates and Patterns
New Ingredients
Proposed Layered Architecture
work w/ Anne Ngu, Shawn Bowers, Terence Critchlow
29
Use Ideas from Fault Tolerant Shell
Good ideas in ftsh; some might be (semi-)low-hanging
fruit for Kepler
Source: Douglas Thain, Miron Livny, "The Ethernet
Approach to Grid Computing"
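One idea in this spirit is retrying a failing step with randomized, exponentially growing waits, much as Ethernet backs off after a collision. The Python sketch below is not ftsh syntax and not a Kepler API; `try_for`, its parameters, and the `flaky_transfer` step are assumptions made for illustration.

```python
import random
import time

def try_for(action, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run action until it succeeds or the attempts are used up.
    Waits between attempts are randomized and double each time."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise                                  # out of attempts: report failure
            sleep(delay * random.uniform(1.0, 2.0))    # randomized wait
            delay *= 2                                 # back off exponentially

# Hypothetical step that fails twice before succeeding.
calls = []
def flaky_transfer():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("server busy")
    return "ok"

waits = []                                             # record waits instead of sleeping
result = try_for(flaky_transfer, sleep=waits.append)
print(result, len(calls), len(waits))
```

Passing `sleep` as a parameter keeps the sketch testable; in real use the default `time.sleep` would apply.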
30
Use of Semantics in SWF
  • Smart Search
  • Concept-based, e.g., find all datasets
    containing biomass measurements
  • Improved Linking, Merging, Integration
  • Establishing links between data through semantic
    annotations & ontologies
  • Combining heterogeneous sources based on
    annotations
  • Concatenate, Union (merge), Join, etc.
  • Transforming
  • Construct mappings from schema S1 to S2 based on
    annotations
  • Semantic Propagation
  • Pushing semantic annotations through
    transformations/queries

31
Typing Workflow Components
The Semantic Type Editor allows the user to
assign one or more semantic types to the
component or to the component's input and output
ports. In the simplest case, a semantic type is a
class taken from an OWL-DL ontology. Multiple
types define a conjoined concept expression. The
above-right screenshot shows a user assigning
semantic types to the dataset and the above-left
screenshot shows the user assigning an ontology
class to the output port (dataset attribute)
labeled Plot.
Because ontologies can get large and complicated,
Kepler provides a simple built-in browser for
navigating a classified OWL-DL concept hierarchy
and ontology properties and for choosing the
concept that fits the port. Classes can be
searched for and selected. Selecting a class
assigns it as the corresponding semantic type.
32
More on Semantic Annotation
  • Initial Version Supports
  • Actor-level and port-level annotations
  • Annotations are stored in the actor's MoML
    definition (as new semantic type properties)
  • Creation of composite ports (i.e., virtual
    ports grouping a set of underlying ports)
  • Regular and composite ports may have multiple
    annotations (conjunction)
  • Annotations can be drawn from multiple ontologies

An annotated composite port
33
More on Semantic Annotation
  • Currently Adding
  • Semantic Link Annotations for annotation of
    ports via ontology properties
  • E.g., hasLat(point1, lat1)
  • Supported in MoML, not yet in tool
  • Simple condition filters in port semantic
    annotations
  • E.g., if attribute height > 0 then biomass is
    annotated as AboveGroundBiomass
  • Incorporating instances/values in semantic links
  • E.g., hasUnit(biomass, celsius)
  • Suggesting additional annotations based on given
    ones
  • suggesting/guessing ways to fill in given
    annotations
  • E.g., possible semantic links
  • Templates and ontology views
  • To help specify common annotation patterns

Semantic Links
34
Checking Type Constraints
Kepler can statically perform semantic and
structural type checking of connections. A type
checker allows the user to see potentially
mismatched port connections as well as known type
conflicts before workflow execution.
The user can navigate the unsafe and potentially
unsafe channels using the Kepler Type Checker
dialog. When a channel is selected, (a) it is
highlighted on the canvas, (b) the structural
type and status are shown (here, the channel is
structurally well typed), and (c) the semantic
type and status are shown (here, the connection
produces a semantic type error).
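The two independent checks on a channel can be mimicked with a toy Python sketch. It does not reproduce Kepler's type checker: the ontology fragment, the port records and the function names are all hypothetical, structural compatibility is reduced to equality, and semantic compatibility to "the output concept is a subclass of the input concept".

```python
# Toy ontology: child concept -> parent concept (hypothetical).
SUBCLASS_OF = {
    "AboveGroundBiomass": "Biomass",
    "Biomass": "Measurement",
    "Temperature": "Measurement",
}

def is_a(concept, ancestor):
    """True if concept equals ancestor or is one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

def check_channel(out_port, in_port):
    """Report structural and semantic status of one connection."""
    structural = out_port["struct"] == in_port["struct"]
    semantic = is_a(out_port["sem"], in_port["sem"])
    return {"structural": "ok" if structural else "error",
            "semantic": "ok" if semantic else "error"}

# Structurally well typed, but the concepts do not match.
status = check_channel({"struct": "double", "sem": "Temperature"},
                       {"struct": "double", "sem": "Biomass"})
print(status)
```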
35
Kepler Actor-Library
  • Ontology-based actor organization / browsing
  • Customizable libraries based on ontologies
  • Text search with concept-based expansion

Users can discover ImageJ using various search
terms. Here, ImageJ shows up in multiple tree
locations based on its given annotations. The
library search permits text-based matching
against the component's metadata (its given name
and certain properties), expanded with concept
matches.
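Text search with concept-based expansion can be sketched in Python as follows. The concept hierarchy, the actor annotations and the function names are invented for illustration and do not reflect Kepler's actual library or ontologies. A search term matches an actor either by name or because one of the term's narrower concepts annotates the actor.

```python
# Hypothetical concept hierarchy: concept -> narrower concepts.
NARROWER = {"image processing": ["image analysis", "segmentation"]}

# Hypothetical semantic annotations per actor.
ACTORS = {
    "ImageJ": {"image analysis", "visualization"},
    "Blast": {"sequence alignment"},
}

def expand(term):
    """The term itself plus all transitively narrower concepts."""
    terms, stack = set(), [term.lower()]
    while stack:
        t = stack.pop()
        if t not in terms:
            terms.add(t)
            stack.extend(NARROWER.get(t, []))
    return terms

def search(term):
    """Actors whose name contains the term or whose annotations
    intersect the expanded concept set."""
    terms = expand(term)
    return sorted(name for name, concepts in ACTORS.items()
                  if term.lower() in name.lower() or terms & concepts)

print(search("image processing"), search("imagej"), search("blast"))
```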
36
Semantic Searching
Kepler provides a more advanced ontology-based
search mechanism. Users can start the Semantic
Search dialog, where components can be searched
for based on their semantic types.
The Semantic Search dialog allows a user to
search components by any combination of actor,
input, and output semantic types.
37
Structural Type (XML DTD) Annotations
structType(P2), structType(P3)

DTD-style structural type:
    root cohortTable (measurement)
    elem measurement (phase, obs)
    elem phase xsd:string
    elem obs xsd:integer

Sample documents:
    <population>
      <sample>
        <meas> <cnt>44,000</cnt> <acc>0.95</acc> </meas>
        <lsp>Eggs</lsp>
      </sample>
    </population>

    <cohortTable>
      <measurement>
        <phase>Eggs</phase> <obs>44,000</obs>
      </measurement>
    </cohortTable>

[Figure labels: ports P1, P2, P3, P4, P5;
S1 (life stage property), S2 (mortality rate for
period)]
Source: Bowers-Ludaescher, DILS04
38
Ontology-Guided Data Transformation
[Figure: ontologies (OWL) are used to decide
whether SemanticType Ps and SemanticType Pt are
compatible; structural/semantic associations link
them to StructuralType Ps and StructuralType Pt;
the resulting correspondence is used to generate
a transformation of Ps for the desired connection
from the source service (port Ps) to the target
service (port Pt)]
Source: Bowers-Ludaescher, DILS04
39
WF-Design: Adapters for Semantic & Structural
Incompatibility
  • Adapters may
      • be abstract (no impl.)
      • be concrete
      • bridge a semantic gap
      • fix a structural mismatch
      • be generated automatically (e.g., Taverna's
        list mismatch)
      • be reused components (based on signatures)

[Figure: adapter diagrams with port labels C, D,
C1, C2, D1, D2, S, T and adapters f1, f2, map]
Source: Bowers-Ludaescher, ER05
40
Additional Design Primitives for Semantic Types
Extended Transformations (each shown as Starting
Workflow and Resulting Workflow)
  • t9: Actor Semantic Type Refinement
  • t10: Port Semantic Type Refinement
  • t11: Annotation Constraint Refinement
  • t12: I/O Constraint Strengthening
  • t13: Data Connection Refinement
  • t14: Adapter Insertion
  • t15: Actor Replacement
  • t16: Workflow Combination (Map)
Source: Bowers-Ludaescher, ER05
41
Scientific Workflow Design
  • Support SWF design reuse, via
  • Structural data types
  • Semantic types
  • Associations (constraints) between them
  • Type checking, inference, propagation
  • → Separation of concerns
  • structure, semantics, WF orchestration, etc.

Source: Bowers-Ludaescher, ER05
42
Semantic Annotation Propagation
43
Forward and Backward Propagation Rules
44
GEON Dataset Generation & Registration (and
co-development in KEPLER)
Makefile > ant run
SQL database access (JDBC)
Matt et al. (SEEK)
Efrat (GEON)
Ilkay (SDM)
Yang (Ptolemy)
Xiaowen (SDM)
Edward et al.(Ptolemy)
45
Web Services → Actors (WS Harvester)
  • → Minute-made (MM) WS-based application
    integration
  • Similarly: MM workflow design & sharing w/o
    implemented components

46
Some KEPLER Actors (out of 160 and counting)
47
Different Directors for Different Concerns
  • Example
      • Ptolemy Directors: factoring out the concern
        of workflow orchestration (MoC)
      • common aspects of overall execution not left
        to the actors
  • Similarly
      • Black Box (flight recorder)
          • a kind of "recording central" to avoid
            wiring 100s of components to
            recording-actor(s)
      • Red Box (error handling, fault tolerance)
          • use ftsh ideas & templates
      • Yellow Box (type checking)
          • for workflow design
      • Blue Box (shipping-and-handling)
          • central handling of data transport (by
            value, by reference, by scp, SRB,
            GridFTP, ...)
      • CCA Boxes
          • Change behavior (e.g. algorithm) of a
            component
          • Change behavior (i.e., wiring) of a
            workflow in-flight

[Figure labels: SDF/PN/DE/..., Provenance Recorder,
On Error, Static Analysis, SHA @, Component Mgr,
Composition Mgr]
48
Separation of Concerns: Port Types
  • Token consumption (& production) type
      • a director's concern
  • More generally: resource consumption type
      • other scheduling problems
  • Token transport type
      • by value, reference (which one), protocol
        (SOAP, scp, GridFTP, SRB, ...)
      • a SHA concern
  • Structural and semantic types
      • SAT (static analysis & typing) concern
      • built after static unit type system
      • static unit type system as a special case!?

49
Other Research Problems
  • Making the system more X-aware
      • MoC-aware: ok (directors)
      • Provenance-aware
      • DS (data schema)-aware
      • Semantics-aware: upcoming (should be hybrid w/
        DS)
      • Host-aware: allow distributed scheduling of
        actors
      • Data-transport-aware: choose suitable data
        transport protocol (scp, bbcp, http,
        (Grid-)ftp, SRB, SRM, ...)
  • Think of new folks on the movie set
      • Actors, director
      • Cameraman (provenance recorder?)
      • Editor (FF/REW/Play/Pause/Stop: provenance
        re-run)
      • Caterer/Stager (feeding actors with yummy
        tokens!)
      • Managers for Process Central and Data Central
      • Semantic/Hybrid Type Manager

50
More Research Topics
  • What if we know something about bandwidths,
    processor loads, data sizes?
      • → workflow optimization!
  • What if we have more semantics for actors?
      • Black box: token in/out
      • Grey box: data types, semantic types
      • White box: exact functional behavior is known!
      • Example: Actor implements a (stream-?) query!
      • → Query Process Network
      • New optimization opportunities!

51
A User's Wish List
  • Usability
  • Closing the lid (cf. vnc)
  • Dynamic plug-in of actors (cf. actor data
    registries/repositories)
  • Distributed WF execution
  • Collection-based programming
  • Grid awareness
  • Semantics awareness
  • WF Deployment (as a web site, as a web service,
    ...)
  • Power apps (cf. SCIRun)