Scientific Workflow Support for AToL Data Analysis

1
Scientific Workflow Support for AToL Data
Analysis and Management
  • Bertram Ludaescher
  • Shawn Bowers
  • Timothy McPhillips
  • Sean Riddle
  • Manish Anand
  • UC Davis Genome Center / Dept. of Computer Science

daks.ucdavis.edu/kepler-ppod
2
The pPOD and Kepler Projects
  • pPOD (processing PhylOData)
  • NSF/CISE funded 3-year collaboration between U
    Penn, Univ. of Florida, UC Davis, and Yale
  • Core database technologies for the AToL community
  • Data access, integration, provenance, scientific
    workflows
  • Kepler/pPOD @ UC Davis
  • Data collection-oriented scientific workflows
    for phylogenetic data analysis
  • Data workflow provenance management

3
Scientific Workflows
  • Automating analytical tools and the management of
    data, services, and provenance (i.e., processing
    history)
  • Why scientific workflows?
  • Workflow design
  • Workflow and component reuse
  • Execution monitoring
  • Built-in provenance
  • Workflow sharing
  • Simplified access to remote resources (data,
    services, HPC)

4
Example
5
Collection-Oriented Modeling and Design (COMAD)
  • Operates on nested data collections (XML)
    sent as token streams
  • Configurable actors declare their scope of
    work over the stream
  • → simple, robust designs
  • → provenance automatically included in the stream
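The collection-oriented idea above can be sketched in a few lines. This is a hypothetical illustration, not the actual Kepler/pPOD API: nested collections are flattened into a token stream with open/close delimiters, and an actor declares a scope so it only operates on data inside matching collections, passing everything else through untouched.

```python
# Hypothetical sketch of COMAD-style processing (names are illustrative):
# nested collections travel as a flat token stream with open/close
# delimiters; an actor only touches tokens inside its declared scope.

OPEN, CLOSE, DATA = "open", "close", "data"

def tokens_from(collection, name="root"):
    """Flatten a nested collection into (kind, value) tokens."""
    yield (OPEN, name)
    for item in collection:
        if isinstance(item, tuple):       # (sub-collection name, contents)
            yield from tokens_from(item[1], item[0])
        else:
            yield (DATA, item)
    yield (CLOSE, name)

def actor(stream, scope, fn):
    """Apply fn to data tokens inside collections named `scope`;
    pass all other tokens through unchanged."""
    depth = 0
    for kind, value in stream:
        if kind == OPEN and value == scope:
            depth += 1
        elif kind == CLOSE and value == scope:
            depth -= 1
        if kind == DATA and depth > 0:
            yield (DATA, fn(value))
        else:
            yield (kind, value)

nested = [("sequences", ["acgt", "ttga"]), "metadata"]
out = list(actor(tokens_from(nested), "sequences", str.upper))
```

Because non-matching tokens flow through unchanged, actors with different scopes can be chained into a pipeline without any of them needing to understand the whole collection structure.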

6
Aligning sequences using CIPRes services and
local applications
  • Distinct workflow components (actors) automate
    steps in an analysis
  • Actors are provided for data load, store, and
    manipulation, as well as for scientific calculations
  • Actors can access remote services and automate
    local applications
  • Actors for applications and services using
    different data standards can be combined easily
    in a single workflow

A workflow for aligning protein sequences using
the CIPRes ClustalW service
  • Parameters are saved to allow workflows to be
    executed again automatically or shared with
    others
  • System records how results were computed in
    trace files (data provenance)

A workflow for refining protein sequence
alignments using the local Gblocks application
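The parameter-saving behavior described above can be sketched as follows. This is a hypothetical illustration, not Kepler's actual parameter-file format; the parameter names (`gap_open`, `gap_extend`) are examples, not the real service options.

```python
# Hypothetical sketch (not Kepler's actual format): persisting a workflow's
# parameters as JSON so a run can be repeated automatically or shared with
# others using the same configuration.
import json
import os
import tempfile

def save_parameters(path, params):
    with open(path, "w") as f:
        json.dump(params, f, indent=2, sort_keys=True)

def load_parameters(path):
    with open(path) as f:
        return json.load(f)

# Example parameter names are illustrative only.
params = {"service": "CIPRes ClustalW", "gap_open": 10.0, "gap_extend": 0.1}
path = os.path.join(tempfile.gettempdir(), "run_params.json")
save_parameters(path, params)
restored = load_parameters(path)
```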
7
A family of workflows for tree inference
Similar workflows for inferring trees using the
CIPRes RAxML service...
  • (In COMAD) actors are strung together in an
    intuitive order with minimal configuration
  • Workflows can easily be modified to invoke
    different methods for a particular step

...the CIPRes MrBayes service...
  • Resulting workflows are easy to understand
  • Behavior of a workflow is easy to predict

... and the local Phylip protpars program
8
Workflows can be combined easily to form longer
computational pipelines...
A single workflow combining local applications
and remote services
... and can include loops (e.g., for parameter
sweeps)
A subworkflow for running Phylip pars with a
series of seed values
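A parameter-sweep subworkflow like the one above can be sketched as a simple loop. This is an illustration only: `infer_tree` is a placeholder standing in for an actor that would wrap the real Phylip pars program, and each result is kept together with the seed that produced it.

```python
# Hypothetical sketch of a parameter-sweep subworkflow: run the same
# analysis step once per seed value and collect each result alongside
# the seed that produced it (its provenance).

def infer_tree(alignment, seed):
    # Placeholder for invoking the real Phylip pars program.
    return f"tree(seed={seed})"

def parameter_sweep(alignment, seeds):
    return [{"seed": s, "tree": infer_tree(alignment, s)} for s in seeds]

results = parameter_sweep("aligned.phy", seeds=[1, 3, 5, 7, 9])
```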
9
Complexity is encapsulated in subworkflows
Inside the PhylipProtPars actor
  • Complexities involved in integrating different
    kinds of services and applications are hidden by
    the system
  • Subworkflows encapsulate this complexity and
    provide reusable components
  • We employ a common data model at the highest
    level of the workflow
  • For example, subworkflows create and parse data
    structures and files specific to the programs
    they wrap

Inside the CIPResRAxML actor
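The format translation these wrapper subworkflows perform can be sketched as below. This is a simplified illustration, not the actual subworkflow code: a common data model (a list of name/sequence pairs) is rendered into two different program-specific input formats.

```python
# Hypothetical sketch of a wrapper subworkflow's job: translate the
# workflow's common data model (here, (name, sequence) pairs) into the
# program-specific input format each wrapped tool expects. The formats
# below are simplified illustrations of FASTA and Phylip-style input.

def to_fasta(seqs):
    return "".join(f">{name}\n{seq}\n" for name, seq in seqs)

def to_phylip(seqs):
    header = f"{len(seqs)} {len(seqs[0][1])}\n"
    return header + "".join(f"{name:<10}{seq}\n" for name, seq in seqs)

common = [("taxonA", "ACGTACGT"), ("taxonB", "ACGAACGT")]
fasta_in = to_fasta(common)    # for a FASTA-based tool such as ClustalW
phylip_in = to_phylip(common)  # for a Phylip program such as protpars
```

Because only the wrapper knows the tool-specific format, actors built on different data standards can still be combined freely at the common-model level.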
10
Integrated Kepler/pPOD data provenance support
  • Captures data dependencies during a workflow run
  • Stores workflow results in trace files
  • Provides tools to view data provenance and
    navigate history of the workflow execution
  • Enables outputs of one run to be used as inputs
    of another (provenance is maintained)
  • Quick demo
  • Disclaimer: the Kepler/pPOD preview release
    (3/8/2008) wraps real software and remote
    services
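The dependency capture described above can be sketched with a small trace structure. This is a hypothetical illustration, not the Kepler/pPOD trace-file format: each step records which inputs it consumed to produce each output, and the graph can then be walked to answer "how was this result computed?"

```python
# Hypothetical sketch of provenance capture (not the actual Kepler/pPOD
# trace format): record input -> actor -> output dependencies during a
# run, then walk the graph upstream to reconstruct a result's lineage.
from collections import defaultdict

class Trace:
    def __init__(self):
        self.derived_from = defaultdict(set)   # output -> set of inputs
        self.produced_by = {}                  # output -> actor name

    def record(self, actor, inputs, output):
        self.produced_by[output] = actor
        self.derived_from[output].update(inputs)

    def lineage(self, item):
        """All upstream data items the given item depends on."""
        seen = set()
        stack = [item]
        while stack:
            for parent in self.derived_from[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

trace = Trace()
trace.record("ClustalW", ["seqs.fasta"], "alignment.aln")
trace.record("Gblocks", ["alignment.aln"], "refined.aln")
```

Keeping the trace alongside the data is also what lets one run's outputs feed another run without losing their history: the new run simply appends to the same graph.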

11
Kepler Roles: Desktop Application
Kepler GUI
Workflow Engine
Workflow Library
Provenance Store
Data Store
Actors, Services
  • Traditional/current Kepler role
  • GUI for designing, configuring, executing
    workflows
  • Local data and provenance storage/management
  • Local actors and remote services (e.g., CIPRes)
  • CLUSTAL, Gblocks, PAUP, RAxML, MrBayes, Phylip, ...

12
Kepler Roles: Embedded
Mesquite
Kepler Workflow Engine
Provenance
Data
Workflows
Actors
  • Kepler engine embedded in other applications
    (e.g., Mesquite)
  • Workflows shared via the Kepler repository

13
Kepler Roles: As a Web Service
Mesquite
Web Service
Kepler Workflow Engine

Tolkin
Provenance
Data
Workflows
Actors
  • Applications use Kepler as an external (web)
    service
  • Multiple applications can access/share same
    Kepler engine
  • Engine can be deployed on application server or
    parallel compute cluster
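The shared-engine role above can be sketched as follows. This is a purely illustrative stand-in, not the actual Kepler web-service API: a single `WorkflowService` instance accepts runs from multiple client applications, as a remotely deployed engine would.

```python
# Hypothetical sketch: several client applications sharing one workflow
# engine deployed as a service. The WorkflowService class and its run()
# method are illustrative, not the real Kepler web-service interface.

class WorkflowService:
    """Stands in for a remotely deployed engine shared by many clients."""
    def __init__(self):
        self.runs = []

    def run(self, client, workflow, inputs):
        run_id = len(self.runs)
        self.runs.append(
            {"client": client, "workflow": workflow, "inputs": inputs}
        )
        return run_id

engine = WorkflowService()
r1 = engine.run("Mesquite", "align_clustalw", {"seqs": "data.fasta"})
r2 = engine.run("Tolkin", "infer_raxml", {"aln": "data.aln"})
```

Since all runs pass through one engine, their provenance records naturally land in one shared store, regardless of which application submitted them.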

14
Kepler Roles: Loosely Coupled
Mesquite
Kepler GUI
Workflow Engine
Workflow Library
Actors, Services
Shared Data Provenance
  • Loosely coupled through shared data repositories
  • Shared data model

15
Kepler Roles: Not So Loosely Coupled
Mesquite
Kepler GUI
Workflow Engine
Workflow Library
Actors, Services
Shared API
  • Coupled through a shared API as well as
    shared data repositories
  • Shared data model

16
Future Extensions to Kepler/pPOD
  • Support for additional services, applications,
    and workflows
  • Based on community input
  • What tools would you use?
  • What kinds of analyses would you use the system
    for?
  • Data access and management
  • Integration with pPOD data system
  • Integration with Mesquite and Tolkin
  • Integrated access to data repositories
  • What data repositories would you like access to?
  • Provenance
  • Ability to query and filter provenance
  • Project histories for managing multiple runs
    and data sets used in projects
  • Kepler/CORE
  • Kepler/CIPRes, Kepler/pPOD, Kepler/Chip2
    (micro-arrays), Kepler/REAP (streaming env.
    sensor data), Kepler/SEEK (ecology), ...
  • Kepler/CORE is about bringing order to this
    family of extensions