Title: Scientific Workflow Support for AToL Data Analysis
1. Scientific Workflow Support for AToL Data Analysis Management
- Bertram Ludaescher
- Shawn Bowers
- Timothy McPhillips
- Sean Riddle
- Manish Anand
- UC Davis Genome Center / Dept. of Computer Science
daks.ucdavis.edu/kepler-ppod
2. The pPOD and Kepler Projects
- pPOD (processing PhylOData)
- NSF/CISE-funded 3-year collaboration between U Penn, Univ. of Florida, UC Davis, and Yale
- Core database technologies for the AToL community
- Data access, integration, provenance, and scientific workflows
- Kepler/pPOD at UC Davis
- Collection-oriented scientific workflows for phylogenetic data analysis
- Data and workflow provenance management
3. Scientific Workflows
- Automating analytical tools and the management of data, services, and provenance (i.e., processing history)
- Why scientific workflows?
- Workflow design
- Workflow and component reuse
- Execution monitoring
- Built-in provenance
- Workflow sharing
- Simplified access to remote resources (data, services, HPC)
4. Example
5. Collection-Oriented Modeling and Design (COMAD)
- Operates on nested data collections (XML) sent as token streams
- Configurable actors declare their scope of work within the stream
- → simple, robust designs
- → provenance automatically included in the stream
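The COMAD idea above can be sketched in a few lines. This is an illustrative simplification, not Kepler's actual implementation: a nested collection is serialized as a stream of open/close/data tokens, and an actor declares a collection name as its "scope" so it transforms only the data inside matching collections, passing everything else through unchanged.

```python
# Minimal sketch of collection-oriented (COMAD-style) processing.
# Token, scoped_actor, and the tag names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Token:
    kind: str           # "open", "close", or "data"
    tag: str            # collection name or data type
    value: object = None

# A nested collection <Project><Alignment>...</Alignment></Project>
# flattened into a token stream:
stream = [
    Token("open", "Project"),
    Token("open", "Alignment"),
    Token("data", "Seq", "MKV-LT"),
    Token("data", "Seq", "MKVQLT"),
    Token("close", "Alignment"),
    Token("close", "Project"),
]

def scoped_actor(stream, scope, transform):
    """Apply `transform` only to data tokens inside collections named `scope`."""
    depth = 0
    for tok in stream:
        if tok.kind == "open" and tok.tag == scope:
            depth += 1
        elif tok.kind == "close" and tok.tag == scope:
            depth -= 1
        if tok.kind == "data" and depth > 0:
            yield Token("data", tok.tag, transform(tok.value))
        else:
            yield tok  # pass everything else through untouched

# An actor that strips gap characters from sequences inside alignments:
out = list(scoped_actor(stream, "Alignment", lambda s: s.replace("-", "")))
```

Because unscoped tokens flow through unmodified, actors compose into pipelines without knowing about one another, which is what makes the designs simple and robust.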
6. Aligning sequences using CIPRes services and local applications
- Distinct workflow components (actors) automate steps in an analysis
- Actors are provided for data loading, storing, and manipulation, as well as for scientific calculations
- Actors can access remote services and automate local applications
- Actors for applications and services using different data standards can be combined easily in a single workflow
A workflow for aligning protein sequences using
the CIPRes ClustalW service
- Parameters are saved to allow workflows to be executed again automatically or shared with others
- The system records how results were computed in trace files (data provenance)
A workflow for refining protein sequence
alignments using the local Gblocks application
7. A family of workflows for tree inference
Similar workflows for inferring trees using the CIPRes RAxML service...
- (In COMAD) actors are strung together in an intuitive order with minimal configuration
- Workflows can easily be modified to invoke different methods for a particular step
...the CIPRes MrBayes service...
- Resulting workflows are easy to understand
- The behavior of a workflow is easy to predict
...and the local Phylip protpars program
8. Workflows can be combined easily to form longer computational pipelines...
A single workflow combining local applications
and remote services
... and can include loops (e.g., for parameter
sweeps)
A subworkflow for running Phylip pars with a
series of seed values
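The seed-sweep subworkflow can be sketched conceptually as a loop that runs the same analysis step once per random-number seed and collects the results. The `run_pars` function below is a stand-in that does not invoke the real Phylip pars program; in Kepler the loop would drive the actual actor.

```python
# Hedged sketch of a parameter sweep over Phylip "jumble" seeds.
# run_pars is a placeholder, not a real invocation of Phylip pars.

def run_pars(data_file, seed):
    # Phylip's documentation asks for random-number seeds of the form 4n + 1.
    if seed % 4 != 1:
        raise ValueError("Phylip jumble seeds should be of the form 4n + 1")
    return {"input": data_file, "seed": seed, "tree": f"tree_from_seed_{seed}"}

def parameter_sweep(data_file, seeds):
    """Run the same analysis step once per seed and collect the results."""
    return [run_pars(data_file, s) for s in seeds]

results = parameter_sweep("infile.phy", seeds=[1, 5, 9, 13])
```

Each sweep iteration produces an independent result carrying the seed that generated it, so downstream actors (and the provenance trace) can tell the runs apart.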
9. Complexity is encapsulated in subworkflows
Inside the PhylipProtPars actor
- Complexities involved in integrating different kinds of services and applications are hidden by the system
- Subworkflows encapsulate this complexity and provide reusable components
- We employ a common data model at the highest level of the workflow
- For example, subworkflows create and parse the data structures and files specific to the programs they wrap
Inside the CIPResRAxML actor
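The encapsulation idea can be illustrated with a toy wrapper (assumed names, not Kepler code): a subworkflow-style component converts from the workflow's common data model (here, a plain dict of named sequences) into the file format a wrapped program expects, and parses results back, so the rest of the workflow never sees program-specific details.

```python
# Illustrative sketch of format conversion inside a program wrapper.

def to_fasta(sequences):
    """Common data model -> program-specific input format (FASTA)."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())

def from_fasta(text):
    """Program-specific output format (FASTA) -> common data model."""
    sequences, name = {}, None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences

common = {"taxonA": "MKVLT", "taxonB": "MKVQT"}
fasta = to_fasta(common)        # what the wrapped program would read
round_trip = from_fasta(fasta)  # what the wrapper hands back upstream
```

Because conversion happens only at the wrapper's boundary, swapping one wrapped program for another (e.g., RAxML for protpars) changes nothing upstream or downstream of the actor.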
10. Integrated Kepler/pPOD data provenance support
- Captures data dependencies during a workflow run
- Stores workflow results in trace files
- Provides tools to view data provenance and navigate the history of a workflow execution
- Enables outputs of one run to be used as inputs to another (provenance is maintained)
- Quick demo
- Disclaimer: the Kepler/pPOD preview release (3/8/2008) wraps real software and remote services
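What provenance capture buys you can be sketched as follows. This is an assumed in-memory model, not the actual Kepler/pPOD trace-file format: each value produced during a run is recorded with the actor that produced it and the inputs it was derived from, so the full lineage of any result can be reconstructed later.

```python
# Minimal sketch of data-dependency (provenance) recording and querying.
# The actor/file names mirror the workflows above but are illustrative.

trace = {}  # output id -> (actor name, tuple of input ids)

def record(output, actor, inputs):
    """Log one derivation event during a workflow run."""
    trace[output] = (actor, tuple(inputs))

# Events as they might be logged during an alignment + inference run:
record("aln.fasta", "ClustalW", ["seqs.fasta"])
record("aln.gb",    "Gblocks",  ["aln.fasta"])
record("tree.nwk",  "RAxML",    ["aln.gb"])

def lineage(item):
    """Walk dependencies backwards to find every artifact `item` depends on."""
    deps, stack = set(), [item]
    while stack:
        current = stack.pop()
        for src in trace.get(current, ("", ()))[1]:
            if src not in deps:
                deps.add(src)
                stack.append(src)
    return deps

# The inferred tree depends, transitively, on the raw input sequences:
ancestry = lineage("tree.nwk")
```

Because the trace survives the run, an output like `tree.nwk` can be fed into a later run while its lineage remains queryable, which is what "outputs of one run used as inputs of another" relies on.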
11. Kepler Roles: Desktop Application
Kepler GUI
Workflow Engine
Workflow Library
Provenance Store
Data Store
Actors, Services
- Traditional/current Kepler role
- GUI for designing, configuring, and executing workflows
- Local data and provenance storage/management
- Local actors and remote services (e.g., CIPRes)
- CLUSTAL, Gblocks, PAUP, RAxML, MrBayes, Phylip, ...
12. Kepler Roles: Embedded
Mesquite
Kepler Workflow Engine
Provenance
Data
Workflows
Actors
- Kepler engine embedded in other applications (e.g., Mesquite)
- Workflows shared via the Kepler repository
13. Kepler Roles: As a Web Service
Mesquite
Web Service
Kepler Workflow Engine
Tolkin
Provenance
Data
Workflows
Actors
- Applications use Kepler as an external (web) service
- Multiple applications can access/share the same Kepler engine
- The engine can be deployed on an application server or a parallel compute cluster
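A client application such as Mesquite or Tolkin might hand a workflow run to a shared Kepler engine roughly as sketched below. The endpoint URL and payload shape are hypothetical assumptions for illustration, not a real Kepler web-service API.

```python
# Hypothetical sketch of packaging a workflow-run request for a remote
# Kepler engine. Endpoint and payload fields are illustrative assumptions.
import json

def build_run_request(workflow_id, inputs,
                      engine_url="http://kepler.example.org/run"):
    """Package a workflow-execution request for a shared remote engine."""
    payload = {"workflow": workflow_id, "inputs": inputs}
    return engine_url, json.dumps(payload)

# Two different client applications can target the same engine URL,
# which is how one deployment serves multiple front ends:
url, body = build_run_request("align-clustalw", {"sequences": "seqs.fasta"})
```

The client only names a workflow and its inputs; where the engine actually runs (application server or compute cluster) is invisible behind the service boundary.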
14. Kepler Roles: Loosely Coupled
Mesquite
Kepler GUI
Workflow Engine
Workflow Library
Actors, Services
Shared Data Provenance
- Loosely coupled through shared data repositories
- Shared data model
15. Kepler Roles: Not So Loosely Coupled
Mesquite
Kepler GUI
Workflow Engine
Workflow Library
Actors, Services
Shared API
- More tightly coupled through a shared API (in addition to shared data repositories)
- Shared data model
16. Future Extensions to Kepler/pPOD
- Support for additional services, applications, and workflows
- Based on community input
- What tools would you use?
- What kinds of analyses would you use the system for?
- Data access and management
- Integration with the pPOD data system
- Integration with Mesquite and Tolkin
- Integrated access to data repositories
- What data repositories would you like access to?
- Provenance
- Ability to query and filter provenance
- Project histories for managing multiple runs and data sets used in projects
- Kepler/CORE
- Kepler/CIPRes, Kepler/pPOD, Kepler/Chip2 (micro-arrays), Kepler/REAP (streaming environmental sensor data), Kepler/SEEK (ecology), ...
- Kepler/CORE is about bringing order to this