Title: Scientific Workflow Support for AToL Data Analysis
1. Scientific Workflow Support for AToL Data Analysis Management
- Bertram Ludaescher
- Shawn Bowers
- Timothy McPhillips
- Sean Riddle
- Manish Anand
- UC Davis Genome Center / Dept. of Computer Science
daks.ucdavis.edu/kepler-ppod
2. The pPOD and Kepler Projects
- pPOD (processing PhylOData)
- NSF/CISE-funded 3-year collaboration between U Penn, Univ. of Florida, UC Davis, and Yale
- Core database technologies for the AToL community
- Data access, integration, provenance, and scientific workflows
- Kepler/pPOD at UC Davis
- Collection-oriented scientific workflows for phylogenetic data analysis
- Data and workflow provenance management
3. Scientific Workflows
- Automating analytical tools and the management of data, services, and provenance (i.e., processing history)
- Why scientific workflows?
- Workflow design
- Workflow and component reuse
- Execution monitoring
- Built-in provenance
- Workflow sharing
- Simplified access to remote resources (data, services, HPC)
4. Example
5. Collection-Oriented Modeling and Design (COMAD)
- Operates on nested data collections (XML) sent as token streams
- Configurable actors declare their scope of work within the stream
- → simple, robust designs
- → provenance automatically included in the stream
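The COMAD idea above can be sketched in a few lines. This is an illustrative simplification, not Kepler's actual implementation: a nested collection is serialized as a stream of open/close/data tokens, and an actor declares a collection name as its "scope" so it transforms only the data inside matching collections, passing everything else through unchanged.

```python
# Minimal sketch of collection-oriented (COMAD-style) processing.
# Token, scoped_actor, and the tag names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Token:
    kind: str           # "open", "close", or "data"
    tag: str            # collection name or data type
    value: object = None

# A nested collection <Project><Alignment>...</Alignment></Project>
# flattened into a token stream:
stream = [
    Token("open", "Project"),
    Token("open", "Alignment"),
    Token("data", "Seq", "MKV-LT"),
    Token("data", "Seq", "MKVQLT"),
    Token("close", "Alignment"),
    Token("close", "Project"),
]

def scoped_actor(stream, scope, transform):
    """Apply `transform` only to data tokens inside collections named `scope`."""
    depth = 0
    for tok in stream:
        if tok.kind == "open" and tok.tag == scope:
            depth += 1
        elif tok.kind == "close" and tok.tag == scope:
            depth -= 1
        if tok.kind == "data" and depth > 0:
            yield Token("data", tok.tag, transform(tok.value))
        else:
            yield tok  # pass everything else through untouched

# An actor that strips gap characters from sequences inside alignments:
out = list(scoped_actor(stream, "Alignment", lambda s: s.replace("-", "")))
```

Because unscoped tokens flow through unmodified, actors compose into pipelines without knowing about one another, which is what makes the designs simple and robust.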
6. Aligning sequences using CIPRes services and local applications
- Distinct workflow components (actors) automate steps in an analysis
- Actors are provided for data loading, storing, and manipulation, as well as for scientific calculations
- Actors can access remote services and automate local applications
- Actors for applications and services using different data standards can be combined easily in a single workflow
A workflow for aligning protein sequences using
the CIPRes ClustalW service
- Parameters are saved to allow workflows to be executed again automatically or shared with others
- The system records how results were computed in trace files (data provenance)
A workflow for refining protein sequence
alignments using the local Gblocks application
7. A family of workflows for tree inference
Similar workflows for inferring trees using the CIPRes RAxML service...
- (In COMAD) actors are strung together in an intuitive order with minimal configuration
- Workflows can easily be modified to invoke different methods for a particular step
...the CIPRes MrBayes service...
- Resulting workflows are easy to understand
- The behavior of a workflow is easy to predict
...and the local Phylip protpars program
8. Workflows can be combined easily to form longer computational pipelines...
A single workflow combining local applications
and remote services
... and can include loops (e.g., for parameter
sweeps)
A subworkflow for running Phylip pars with a
series of seed values
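The seed-sweep subworkflow can be sketched conceptually as a loop that runs the same analysis step once per random-number seed and collects the results. The `run_pars` function below is a stand-in that does not invoke the real Phylip pars program; in Kepler the loop would drive the actual actor.

```python
# Hedged sketch of a parameter sweep over Phylip "jumble" seeds.
# run_pars is a placeholder, not a real invocation of Phylip pars.

def run_pars(data_file, seed):
    # Phylip's documentation asks for random-number seeds of the form 4n + 1.
    if seed % 4 != 1:
        raise ValueError("Phylip jumble seeds should be of the form 4n + 1")
    return {"input": data_file, "seed": seed, "tree": f"tree_from_seed_{seed}"}

def parameter_sweep(data_file, seeds):
    """Run the same analysis step once per seed and collect the results."""
    return [run_pars(data_file, s) for s in seeds]

results = parameter_sweep("infile.phy", seeds=[1, 5, 9, 13])
```

Each sweep iteration produces an independent result carrying the seed that generated it, so downstream actors (and the provenance trace) can tell the runs apart.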
9. Complexity is encapsulated in subworkflows
Inside the PhylipProtPars actor
- Complexities involved in integrating different kinds of services and applications are hidden by the system
- Subworkflows encapsulate this complexity and provide reusable components
- We employ a common data model at the highest level of the workflow
- For example, subworkflows create and parse the data structures and files specific to the programs they wrap
Inside the CIPResRAxML actor
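The encapsulation idea can be illustrated with a toy wrapper (assumed names, not Kepler code): a subworkflow-style component converts from the workflow's common data model (here, a plain dict of named sequences) into the file format a wrapped program expects, and parses results back, so the rest of the workflow never sees program-specific details.

```python
# Illustrative sketch of format conversion inside a program wrapper.

def to_fasta(sequences):
    """Common data model -> program-specific input format (FASTA)."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())

def from_fasta(text):
    """Program-specific output format (FASTA) -> common data model."""
    sequences, name = {}, None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences

common = {"taxonA": "MKVLT", "taxonB": "MKVQT"}
fasta = to_fasta(common)        # what the wrapped program would read
round_trip = from_fasta(fasta)  # what the wrapper hands back upstream
```

Because conversion happens only at the wrapper's boundary, swapping one wrapped program for another (e.g., RAxML for protpars) changes nothing upstream or downstream of the actor.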
10. Integrated Kepler/pPOD data provenance support
- Captures data dependencies during a workflow run
- Stores workflow results in trace files
- Provides tools to view data provenance and navigate the history of a workflow execution
- Enables outputs of one run to be used as inputs to another (provenance is maintained)
- Quick demo
- Disclaimer: the Kepler/pPOD preview release (3/8/2008) wraps real software and remote services
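What provenance capture buys you can be sketched as follows. This is an assumed in-memory model, not the actual Kepler/pPOD trace-file format: each value produced during a run is recorded with the actor that produced it and the inputs it was derived from, so the full lineage of any result can be reconstructed later.

```python
# Minimal sketch of data-dependency (provenance) recording and querying.
# The actor/file names mirror the workflows above but are illustrative.

trace = {}  # output id -> (actor name, tuple of input ids)

def record(output, actor, inputs):
    """Log one derivation event during a workflow run."""
    trace[output] = (actor, tuple(inputs))

# Events as they might be logged during an alignment + inference run:
record("aln.fasta", "ClustalW", ["seqs.fasta"])
record("aln.gb",    "Gblocks",  ["aln.fasta"])
record("tree.nwk",  "RAxML",    ["aln.gb"])

def lineage(item):
    """Walk dependencies backwards to find every artifact `item` depends on."""
    deps, stack = set(), [item]
    while stack:
        current = stack.pop()
        for src in trace.get(current, ("", ()))[1]:
            if src not in deps:
                deps.add(src)
                stack.append(src)
    return deps

# The inferred tree depends, transitively, on the raw input sequences:
ancestry = lineage("tree.nwk")
```

Because the trace survives the run, an output like `tree.nwk` can be fed into a later run while its lineage remains queryable, which is what "outputs of one run used as inputs of another" relies on.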
11. Kepler Roles: Desktop Application
Kepler GUI
Workflow Engine
Workflow Library
Provenance Store
Data Store
Actors, Services
- Traditional/current Kepler role
- GUI for designing, configuring, and executing workflows
- Local data and provenance storage/management
- Local actors and remote services (e.g., CIPRes)
- CLUSTAL, Gblocks, PAUP, RAxML, MrBayes, Phylip, ...
12. Kepler Roles: Embedded
Mesquite
Kepler Workflow Engine
Provenance
Data
Workflows
Actors
- Kepler engine embedded in other applications (e.g., Mesquite)
- Workflows shared via the Kepler repository
13. Kepler Roles: As a Web Service
Mesquite
Web Service
Kepler Workflow Engine
Tolkin
Provenance
Data
Workflows
Actors
- Applications use Kepler as an external (web) service
- Multiple applications can access/share the same Kepler engine
- The engine can be deployed on an application server or a parallel compute cluster
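A client application such as Mesquite or Tolkin might hand a workflow run to a shared Kepler engine roughly as sketched below. The endpoint URL and payload shape are hypothetical assumptions for illustration, not a real Kepler web-service API.

```python
# Hypothetical sketch of packaging a workflow-run request for a remote
# Kepler engine. Endpoint and payload fields are illustrative assumptions.
import json

def build_run_request(workflow_id, inputs,
                      engine_url="http://kepler.example.org/run"):
    """Package a workflow-execution request for a shared remote engine."""
    payload = {"workflow": workflow_id, "inputs": inputs}
    return engine_url, json.dumps(payload)

# Two different client applications can target the same engine URL,
# which is how one deployment serves multiple front ends:
url, body = build_run_request("align-clustalw", {"sequences": "seqs.fasta"})
```

The client only names a workflow and its inputs; where the engine actually runs (application server or compute cluster) is invisible behind the service boundary.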
14. Kepler Roles: Loosely Coupled
Mesquite
Kepler GUI
Workflow Engine
Workflow Library
Actors, Services
Shared Data Provenance
- Loosely coupled through shared data repositories
- Shared data model
15. Kepler Roles: Not So Loosely Coupled
Mesquite
Kepler GUI
Workflow Engine
Workflow Library
Actors, Services
Shared API
- More tightly coupled through a shared API (in addition to shared data repositories)
- Shared data model
16. Future Extensions to Kepler/pPOD
- Support for additional services, applications, and workflows
- Based on community input
- What tools would you use?
- What kinds of analyses would you use the system for?
- Data access and management
- Integration with the pPOD data system
- Integration with Mesquite and Tolkin
- Integrated access to data repositories
- What data repositories would you like access to?
- Provenance
- Ability to query and filter provenance
- Project histories for managing multiple runs and data sets used in projects
- Kepler/CORE
- Kepler/CIPRes, Kepler/pPOD, Kepler/Chip2 (micro-arrays), Kepler/REAP (streaming environmental sensor data), Kepler/SEEK (ecology), ...
- Kepler/CORE is about bringing order to this