Title: Scientific Workflows as Configurable, Change-Resilient Data Transducers
1 Scientific Workflows as Configurable, Change-Resilient Data Transducers
UC Davis Team: Dr. Shawn Bowers, Dr. Tim McPhillips, Dr. Norbert Podhorszki, Dr. Carlos Rueda, Manish Anand, Saumen Dey, Dave Thau, Daniel Zinn
- Bertram Ludäscher
- Dept. of Computer Science
- Genome Center
- University of California, Davis
- ludaesch_at_ucdavis.edu
2 SUMMARY
- Déjà Vu
  - Scientific workflows: the CI upper-ware of eScience
  - Scientific workflows: why?
  - Diversity of scientific workflows (one size fits all?)
- eScience collaborations using
  - d-WS, a-WS, w-WS!
  - Is this the W3C Gone Wild?? (W3C-GW)
- Scientific Workflows are Forever (YMMV)
  - Facilitating sharing, design through evolution
- The challenge
  - The Complex, the Brittle, the Unsustainable (CBU) wfs
  - GOTO still considered harmful
- A possible solution
  - Optimize human time (it's about us, isn't it?)
  - Change-resilience
  - Employ data coherence
  - VALs (Virtual Assembly Lines), COMAD
- Optimizing dataflow (CPU time)
- Kepler/CORE
3 Scientific Workflows: Cyberinfrastructure UPPER-WARE
4 Why Scientific Workflow?
- Capture how a scientist works with data and analytical tools
  - data access, transformation, analysis, visualization
  - possible worldview: dataflow-oriented (cf. signal processing)
- Scientific workflow (wf) benefits (compare with script-based approaches)
  - wf automation
  - wf component reuse
  - wf design, documentation
  - wf archival, sharing
  - built-in concurrency (task- and pipeline-parallelism)
  - built-in provenance support
  - distributed and parallel execution
  - Grid and cluster support
5 Kepler: Data Access via the EcoGrid
Data QuickSearch Tab
Metadata Keyword Search
Access Multiple EcoGrid Sources
Return Data Sets as Actors to Drag and Drop onto the Canvas
6 Kepler: Actor Semantic-Type Annotation
- Actor input/output port annotation
  - Each port can be annotated with multiple classes from multiple ontologies
  - Annotations are stored with the actor metadata (MoML)
- Actors can be discovered, validated, etc., via their semantic types
- Citations here
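The port-annotation idea can be illustrated with a small sketch. It is not Kepler's API: the toy ontology, the `Port` class, and the `compatible` check are all hypothetical names, assuming that an output may feed an input when every class required by the input is covered by some (sub)class annotated on the output.

```python
from dataclasses import dataclass, field

# Hypothetical toy ontology: child class -> parent class.
ONTOLOGY = {
    "BacterialAbundance": "Abundance",
    "Abundance": "Measurement",
    "Temperature": "Measurement",
}

def is_subclass(cls, ancestor):
    """True if cls equals ancestor or reaches it by following parent links."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ONTOLOGY.get(cls)
    return False

@dataclass
class Port:
    """A port annotated with a set of ontology classes (its semantic types)."""
    name: str
    semantic_types: set = field(default_factory=set)

def compatible(output, input_):
    """Every class the input requires must be satisfied by some class on the output."""
    return all(
        any(is_subclass(out_t, in_t) for out_t in output.semantic_types)
        for in_t in input_.semantic_types
    )

source = Port("counts", {"BacterialAbundance"})
sink = Port("values", {"Measurement"})
print(compatible(source, sink))  # the specific output satisfies the general input
print(compatible(sink, source))  # the general output does not satisfy the specific input
```

The same subclass test could drive discovery: filtering a library for actors whose input ports accept a given data set's annotations.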
7 Kepler: Actor Library
- Actor annotations for indexing and classification
  - New actors can be annotated and indexed into the component library (e.g., specializing generic actors)
  - Existing components can also be revised, annotated, and indexed (hiding previous versions)
- Quick search leverages metadata, including annotations and ontologies
8 Building a simple workflow in Kepler
- Select actors from the Kepler actor library
  - Local or remote actors
  - View actor metadata/documentation (not shown)
- Drag the desired actor to the canvas
- Connect actor ports
Other actor examples
9 Building a simple workflow in Kepler
- Select input data
  - Shown here is an EcoGrid data set for bacterial abundance
- Connect data actors to workflow inputs
Many ways to import data
10 Building a simple workflow in Kepler
- Using EcoGrid data sources
- Metadata (EML) can be displayed
- Data can be queried via SQL/QBE interface
- Data set here is a tab-delimited file
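Querying a tab-delimited data set with SQL can be imitated outside Kepler with the Python standard library. This is only a sketch of the idea, not the EcoGrid or Kepler query interface; the column names and values are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited data set (columns and values are made up).
TSV = "site\tdepth_m\tabundance\nA\t5\t1200\nA\t20\t800\nB\t5\t1500\n"

def load_tsv(text, table="dataset"):
    """Load tab-delimited text into an in-memory SQLite table named `table`."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)
    return conn

conn = load_tsv(TSV)
# Values are loaded as text, so cast before aggregating.
result = conn.execute(
    "SELECT site, SUM(CAST(abundance AS INTEGER)) "
    "FROM dataset GROUP BY site ORDER BY site"
).fetchall()
print(result)
```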
11 Building a simple workflow in Kepler
- Run the workflow
- Also: set parameters, select and configure the director, open the run window, etc.
12 Scientific workflows are CI upper-ware, i.e. the scientist's way to harness cyberinfrastructure
- Domain scientist's view
  - Q: When is CI (middle-ware, under-ware) good?
  - A: When I can't see it!
  - Q: When is a scientific workflow tool (CI upper-ware) good?
  - A: When I can get more, new, faster, better science done!
- Workflow engineer's view
  - How can I (help the scientist) design and implement the desired wfs?
  - How does wf make my life easier? Is there life beyond Perl and Python?
  - Choice of platforms, standards; reuse of existing tools, semantic extensions, scheduling on the Grid?
  - How do I make all of this robust, fault-tolerant, etc.?
- Computer scientist's view
  - Workflow modeling and design, static analysis, optimization, theoretical limits: what can and can't be done
  - The quest for the right models and languages: Workflow Thinking
  - The holy grail of eScience: Join the Quest!
13 Rough taxonomy of (overlapping) workflow types
- Desktop / discovery workflows
  - analysis/method-intensive: R, Matlab, custom algorithms
  - e.g. bioinformatics, genomics, phylogenetics
  - exploratory workflows, rapidly evolving
  - need data and workflow provenance
- Plumbing workflows
  - data-intensive, e.g. moving TBs from ORNL (compute) to LBL/NERSC (archive)
  - production workflows: reliable, fault-tolerant, high-throughput, runtime monitoring
- HPC workflows
  - CPU-intensive; need to utilize a local cluster or a distributed Grid, e.g. Ecological Niche Modeling, parameter studies
  - parallel/distributed workflows
- Streaming workflows
  - (near) real-time processing and data analysis
  - distributed setting
14 Simple Kepler workflow using R (reuse, don't reinvent)
15 Discovery Workflow: Ecological Niche Modeling
Slide: Matt Jones
16 Ex: SEEK Ecological Niche Modeling Pipeline
- Scientific workflow paradigm
  - Reusable components (actors): a scientist's verbs/actions
  - Top-level workflows: conceptual representation of the science process, sentences in the scientist's language
  - Sub-workflows: increasing levels of detail
- Separation of concerns
  - actors: what to do
  - parameters: configurable behavior
  - channels: dataflow, pipeline composition
  - directors: fix the execution model, scheduling
  - semantic types: smart discovery, linking
D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer.
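The separation of concerns listed on this slide can be sketched in a few lines. This is a toy model, not Kepler or Ptolemy code: `Actor`, `SequentialDirector`, and the two example actors are hypothetical names chosen for illustration. Actors say what to do, keyword parameters configure them, `connect` builds the channels, and the director alone decides the firing order.

```python
from collections import deque

class Actor:
    """What to do: wraps a function; keyword parameters configure its behavior."""
    def __init__(self, name, func, **params):
        self.name, self.func, self.params = name, func, params
        self.inbox = deque()   # incoming channel (a FIFO of tokens)
        self.downstream = []   # actors fed by this actor's output

    def connect(self, other):
        """Create a channel from this actor to `other`; returns `other` for chaining."""
        self.downstream.append(other)
        return other

    def fire(self):
        """Consume one token, produce one result, and forward it downstream."""
        token = self.inbox.popleft()
        result = self.func(token, **self.params)
        for actor in self.downstream:
            actor.inbox.append(result)
        return result

class SequentialDirector:
    """Execution model: fire any actor with pending input until all channels drain."""
    def __init__(self, actors):
        self.actors = actors

    def run(self, source, tokens):
        source.inbox.extend(tokens)
        outputs = []
        progress = True
        while progress:
            progress = False
            for actor in self.actors:
                while actor.inbox:
                    out = actor.fire()
                    progress = True
                    if not actor.downstream:   # sink actors yield workflow results
                        outputs.append(out)
        return outputs

scale = Actor("scale", lambda x, factor: x * factor, factor=10)
offset = Actor("offset", lambda x, delta: x + delta, delta=1)
scale.connect(offset)
results = SequentialDirector([scale, offset]).run(scale, [1, 2, 3])
print(results)
```

Swapping in a different director (for example, one thread per actor) would change scheduling without touching the actors or their wiring, which is the point of keeping these concerns apart.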
17 Inferring a phylogenetic tree from disparate data
[Figure: workflow diagram. Datasets: aligned DNA sequences, discrete morphological data, continuous characters. Actors: maximum likelihood tree (DNA), maximum parsimony tree, maximum likelihood tree (continuous characters), integrate. Result: consensus tree(s).]
18 Plumbing (1/2): Archive migration workflow
- Stage from NERSC HPSS to local disk, then transfer to ORNL disk, then store at ORNL HPSS
- Moved 10 TB of data from the NERSC archive to the ORNL archive in 11 days (network issues, bugs, and more)
Norbert Podhorszki (UC Davis), Scott Klasky (ORNL)
19 Under the hood: pipeline-parallel processing
Norbert Podhorszki (UC Davis)
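Pipeline parallelism for a stage, transfer, store sequence can be sketched with one thread per stage and queues as channels. This is a generic illustration, not the actual migration workflow: the three stage functions are placeholders that only relabel a file name, and `run_pipeline` is a hypothetical helper.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def stage_worker(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result on to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))

def run_pipeline(items, funcs):
    """Run funcs as concurrent pipeline stages; different stages work on different items at once."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage_worker, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

# Placeholder stages standing in for real staging, transfer, and archiving steps.
def stage(name):
    return f"staged:{name}"

def transfer(f):
    return f.replace("staged", "transferred")

def archive(f):
    return f.replace("transferred", "archived")

archived = run_pipeline(["f1", "f2", "f3"], [stage, transfer, archive])
print(archived)
```

While the last stage archives the first file, the middle stage can already transfer the second and the first stage can stage the third; with a single thread per stage and FIFO queues, output order matches input order.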
20 Plumbing (2/2): SDM/CPES (fusion simulation)
[Figure: diagram labels: ORNL, 40 GB/s, HPSS, Command and Control site]
Norbert Podhorszki (UC Davis), Scott Klasky (ORNL)
21 Plumbing workflow
- To accomplish all these tasks:
  - 50 composite actors (subworkflows)
  - 4 levels of hierarchy
  - 1000 atomic (Java) actors
Norbert Podhorszki (UC Davis, soon ORNL)
22 Enabling e-Science Collaboration
- Workflows: it's about sharing
  - standing on the shoulders of giants
  - of course, after you got your paper, grant, etc. in
- myExperiment ? Kepler repository
23 Déjà vu (MS eSW06): Workflow component repositories
- myExperiment, your library, our Workflow Repository!
- Taverna Repository, Kepler Repository
Need W-WS: Workflows Worth Sharing!
24 Enabling e-Science Collaboration
- d-WS, a-WS, w-WS (W3C gone wild??)
  - Data Worth Sharing
  - Actors (components) Worth Sharing
  - Workflows Worth Sharing
- Scientific Workflows are Forever (YMMV)
  - Facilitating sharing, design through evolution
- The challenge
  - The Complex, the Brittle, the Unsustainable (CBU) wfs
  - GOTO still considered harmful
25 Behold the Beauty of Scientific Workflows
Author: Kristian Stevens, UC Davis
26 the ugly truth inside (CBUs)
Author: Kristian Stevens, UC Davis
27 But how do we get from messy to neat, reusable designs?
28 Scientific Workflow Modeling and Design
And that's why our scientific workflows are much easier to develop, understand, reuse, and maintain!
29 The Joy of Exa-Scale Cyberinfrastructure
- Are we working at the right level of abstraction?
- Are we optimizing the right thing?
  - Optimize human cycles, not just CPU cycles!
  - cf. John McCarthy (of AI/LISP fame)
- => Make data and scientific workflows effectively (re-)usable for scientists
- Make workflows first-class, shareable knowledge artifacts
  - cf. myExperiment!
- Importance of user-oriented workflow design (and provenance)!
30 A Problem: Evolving Workflows
Daniel Zinn (UC Davis)
31 What we want: Simple Analysis Pipelines
Author: Tim McPhillips, UC Davis
32 Ford Assembly Line (2 x 2 - DX Pipe)
- Actors move along the buffet (data)
  - pick up data
  - may put data
  - task- and pipeline-parallel (if very hungry, data-parallel is also possible)
  - actors are configurable
  - passing the buck on irrelevant data
  - => Resilient to change!
- Equivalent view
  - Who moves?
  - line up the actors
  - roll the buffet (data) past the actors
  - => COMAD/VAL model
33 The Answer (YMMV)
- Virtual Assembly Lines paradigm (VALs)
  - Embrace the assembly line metaphor fully
  - => cf. Flow-based Programming (J. Morrison)
- Collection-Oriented Modeling and Design (COMAD)
  - Data: tagged nested collections
  - pipelined (XML) token streams
  - passing the buck on what's not in your scope
Timothy McPhillips (UC Davis)
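A minimal sketch of the scope-and-pass-the-buck idea, assuming a nested collection is flattened into a stream of `open`, `data`, and `close` tokens. The token layout, `scoped_actor`, and the collection labels are all hypothetical; this is not the COMAD implementation. The actor forwards every token untouched and only adds one derived token inside collections that match its scope.

```python
def scoped_actor(stream, scope, func, out_label):
    """Pass every token through unchanged. Inside each collection labeled `scope`,
    also gather its direct data values and emit one derived data token
    just before that collection closes."""
    depth_in_scope = 0
    buffer = []
    for token in stream:
        kind = token[0]
        if kind == "open" and token[1] == scope and depth_in_scope == 0:
            depth_in_scope = 1
            buffer = []
        elif kind == "open" and depth_in_scope:
            depth_in_scope += 1          # nested collection inside the scope
        elif kind == "data" and depth_in_scope == 1:
            buffer.append(token[2])      # data directly inside the scoped collection
        elif kind == "close" and depth_in_scope:
            depth_in_scope -= 1
            if depth_in_scope == 0:
                yield ("data", out_label, func(buffer))
        yield token                      # everything is forwarded, relevant or not

def total_length(values):
    return sum(len(v) for v in values)

stream = [
    ("open", "Project"),
    ("data", "Note", "not my business"),
    ("open", "Sequences"),
    ("data", "Seq", "ACGT"),
    ("data", "Seq", "AC"),
    ("close", "Sequences"),
    ("close", "Project"),
]
out = list(scoped_actor(stream, "Sequences", total_length, "TotalLength"))

# Wrapping the input in a new outer collection does not disturb the actor.
wrapped = [("open", "Study")] + stream + [("close", "Study")]
out_wrapped = list(scoped_actor(wrapped, "Sequences", total_length, "TotalLength"))
```

Because the actor never looks at data outside its scope, upstream changes that add or rearrange unrelated collections flow through without any adapter (shim) code.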
34 Virtual Assembly Lines (VAL/COMAD)
Daniel Zinn (UC Davis)
35 COMAD / VAL: hints at the secret sauce
- Scope your work and pass the buck!
- Let go! (often stateless actors)
- To maximize concurrency / minimize latency
  - futures and promises (holes as placeholders)
36 Conventional vs. Assembly Line Delta-XML Thinking
Daniel Zinn (UC Davis)
37 Conceptual Pipeline w/ Scopes and Types
Daniel Zinn (UC Davis)
38 What we got: Simple, Change-Resilient Pipelines
Author: Tim McPhillips, UC Davis
Look Ma: No Shims!
39 Result: Change-Resilience (Wf graph)
[Figure: workflow graph with nodes labeled X, A, B, C, S, R, W; panels: Original, Automatic Configuration; annotation: Infer Configuration X of X]
Daniel Zinn (UC Davis)
40 Input Change-Resilience (nested data types)
S. Bowers, Daniel Zinn (UC Davis)
41 Optimizing VAL/COMAD: User vs. System View
Daniel Zinn (UC Davis)
42 X-CSR (XML Scissor): Cut-Ship-Reassemble
Submitted for publication: Daniel Zinn, Shawn Bowers, Bertram Ludaescher (UC Davis)
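The slide gives only the name of the technique, so the following is one plausible reading of cut-ship-reassemble rather than the authors' algorithm: cut the subtrees an actor needs out of the XML document, leave numbered holes behind, process ("ship") only the fragments, and put the results back into the holes. The `hole` element, the function names, and the upper-casing step are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def cut(root, tag):
    """Replace each <tag> subtree with a <hole id="..."/> placeholder.
    Returns the cut fragments keyed by hole id."""
    fragments = {}

    def walk(parent):
        for index, child in enumerate(list(parent)):
            if child.tag == tag:
                hole_id = str(len(fragments))
                fragments[hole_id] = child
                hole = ET.Element("hole", {"id": hole_id})
                hole.tail, child.tail = child.tail, None
                parent[index] = hole
            else:
                walk(child)      # do not descend into fragments that were cut

    walk(root)
    return fragments

def ship_and_process(fragment):
    """Stand-in for remote work on a shipped fragment: upper-case its text."""
    result = ET.Element(fragment.tag, fragment.attrib)
    result.text = (fragment.text or "").upper()
    return result

def reassemble(root, processed):
    """Fill each hole with the processed fragment that carries the same id."""
    def walk(parent):
        for index, child in enumerate(list(parent)):
            if child.tag == "hole":
                piece = processed[child.get("id")]
                piece.tail = child.tail
                parent[index] = piece
            else:
                walk(child)

    walk(root)
    return root

doc = ET.fromstring("<run><meta>keep</meta><seq>acgt</seq><seq>ttga</seq></run>")
fragments = cut(doc, "seq")
skeleton = ET.tostring(doc, encoding="unicode")     # document with holes, stays local
processed = {k: ship_and_process(v) for k, v in fragments.items()}
final = ET.tostring(reassemble(doc, processed), encoding="unicode")
print(final)
```

Only the two `seq` fragments would need to travel to the processing site; the `meta` element and the rest of the skeleton never move.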
43 Language Abstractions: Modeling and Design
- Vanilla Process Network
- Functional Programming Dataflow Network
- XML Transformation Network
- Collection-Oriented Modeling and Design framework (COMAD)
"The limitations of my modeling language are the limitations of my design world." (BL)
44 Towards 2020 Science Report (MSR)
http://research.microsoft.com/towards2020science
- new development at the intersection of computer science and the sciences: a leap from the application of computing to support scientists to do science (i.e. computational science) to the integration of computer science concepts, tools and theorems into the very fabric of science. We believe this development represents the foundations of a new revolution in science
- we believe computer science is poised to become as fundamental to biology as mathematics has become to physics
- to understand cells and cellular systems requires viewing them as information processing systems, as evidenced by the fundamental similarity between molecular machines of the living cell and computational automata, and by the natural fit between computer process algebras and biological signalling and between computational logical circuits and regulatory systems in the cell
- We highlight that an immediate and important challenge is that of end-to-end scientific data management, from data acquisition and data integration, to data treatment, provenance and persistence.
- dramatic in its impact, will be the integration of new conceptual and technological tools from computer science into the sciences.
45 Consilience: The Unity of Knowledge (E. O. Wilson)
- "Literally a jumping together of knowledge by the linking of facts and fact-based theory across disciplines to create a common groundwork for explanation." (E. O. Wilson)
- eScience, Cyberinfrastructure, CS mechanisms
  - to make progress
- Scientific Workflows
  - crucial elements to get the most mileage out of CI to fuel eScience, accelerating knowledge discovery
- Need good workflow repositories!
  - Workflows Worth Sharing (Workflows are Forever)
  - Importance of Workflow Design, Reuse through Evolution, Change-Resilience
- Wir müssen wissen, wir werden wissen!
  - We must know, we will know! -- D. Hilbert
46 New NSF/SDCI Project: Kepler/CORE
Phylogenetics
Astronomy
Library Science
Ecology
Conservation Biology
Oceanography
Geosciences
Molecular Biology
Chemistry
Particle Physics
47 Thank You!
ludaesch_at_ucdavis.edu
- Invitation: Join the Workflow Community (e.g. become a Kepler member)
- New NSF/SDCI Kepler/CORE grant
  - Refactoring the software to make extensions, customization, deployment easy
  - Open process, joint ownership
  - Kepler users, developers (core vs. extensions), stakeholders,
48 Related References
- Bertram Ludäscher, Shawn Bowers, Timothy McPhillips, Norbert Podhorszki. Scientific Workflows: More e-Science Mileage from Cyberinfrastructure. Workshop on Scientific Workflows and Business Workflow Standards in e-Science at eScience'06, Amsterdam, December 2006.
- Timothy McPhillips, Shawn Bowers, Bertram Ludäscher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. 3rd International Workshop on Data Integration in the Life Sciences (DILS'06), European Bioinformatics Institute (EBI), Hinxton, UK, July 20-22, 2006.
- Shawn Bowers, Timothy McPhillips, Martin Wu, Bertram Ludäscher. Project Histories: Managing Data Provenance Across Collection-Oriented Scientific Workflow Runs. 4th Intl. Workshop on Data Integration in the Life Sciences (DILS'07), University of Pennsylvania, Philadelphia, June 27-29, 2007.
- Shawn Bowers, Bertram Ludäscher. Actor-Oriented Design of Scientific Workflows. 24th Intl. Conference on Conceptual Modeling (ER'05), Klagenfurt, Austria, LNCS, Springer, 2005.
- D. Zinn, S. Bowers, B. Ludäscher. Dataflow Optimization for Distributed XML Stream Processors. Submitted for publication.
- Shawn Bowers, Timothy McPhillips, Bertram Ludäscher. Provenance in Collection-Oriented Scientific Workflows. Concurrency and Computation: Practice & Experience, special issue on the First Provenance Challenge, 2007, in press.
- Norbert Podhorszki, Bertram Ludäscher, Scott Klasky. Workflow Automation for Processing Plasma Fusion Simulation Data. 2nd Workshop on Workflows in Support of Large-Scale Science (WORKS'07), Monterey Bay, California, June 25, 2007.
- B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, Y. Zhao. Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, 18(10), pp. 1039-1065, 2006. DOI
49 More References
- Semantic Type Annotation
  - S Bowers, B Ludaescher. A Calculus for Propagating Semantic Annotations through Scientific Workflow Queries. ICDE Workshop on Query Languages and Query Processing (QLQP), LNCS, 2006.
  - S Bowers, B Ludaescher. Towards Automatic Generation of Semantic Types in Scientific Workflows. International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), WISE 2005 Workshop Proceedings, LNCS, 2005.
  - C Berkley, S Bowers, M Jones, B Ludaescher, M Schildhauer, J Tao. Incorporating Semantics in Scientific Workflow Authoring. SSDBM, 2005.
  - B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B Brodaric, C Baru. Managing Scientific Data: From Data Integration to Scientific Workflows. GSA Today, Special Issue on Geoinformatics, 2006.
  - S Bowers, D Thau, R Williams, B Ludaescher. Data Procurement for Enabling Scientific Workflows: On Exploring Inter-Ant Parasitism. VLDB Workshop on Semantic Web and Databases (SWDB), 2004.
  - S Bowers, K Lin, B Ludaescher. On Integrating Scientific Resources through Semantic Registration. SSDBM, 2004.
  - S Bowers, B Ludaescher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. International Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2004.
  - S Bowers, B Ludaescher. Towards a Generic Framework for Semantic Registration of Scientific Data. International Semantic Web Conference Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
- Workflow Design and Modeling
  - T McPhillips, S Bowers, B Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2006.
  - S Bowers, T McPhillips, B Ludaescher, S Cohen, SB Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. International Provenance and Annotation Workshop (IPAW), LNCS, 2006.
  - S Bowers, B Ludaescher, AHH Ngu, T Critchlow. Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow. IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
  - S Bowers, B Ludaescher. Actor-Oriented Design of Scientific Workflows. International Conference on Conceptual Modeling (ER), LNCS, 2005.
  - T McPhillips, S Bowers. Pipelining Nested Data Collections in Scientific Workflows. SIGMOD Record, 2005.
- Kepler
  - D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer-Verlag, to appear.
  - W Michener, J Beach, S Bowers, L Downey, M Jones, B Ludaescher, D Pennington, A Rajasekar, S Romanello, M Schildhauer, D Vieglais, J Zhang. SEEK: Data Integration and Workflow Solutions for Ecology. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2005.
  - S Romanello, W Michener, J Beach, M Jones, B Ludaescher, A Rajasekar, M Schildhauer, S Bowers, D Pennington. Creating and Providing Data Management Services for the Biological and Ecological Sciences: Science Environment for Ecological Knowledge. SSDBM, 2005.