Title: Yolanda Gil, PhD
1 Scientific Reproducibility through Semantic Workflows and Shared Provenance Representations
- Yolanda Gil, PhD
- Information Sciences Institute and
- Department of Computer Science
- University of Southern California
- gil_at_isi.edu
- http://www.isi.edu/gil
2 NSF Workshop on Challenges of Scientific Workflows (Gil et al, IEEE Computer 2007)
- Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science
- Reproducibility, key to scientific method, is threatened
- Exponential growth in Compute, Sensors, Data storage, Network BUT growth of science is not same exponential
- What is missing
- Perceived importance of capturing and sharing process in accelerating pace of scientific advances
- Process (method/protocol) is increasingly complex and highly distributed
- Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself
- Workflows need to be first class citizens in science CyberInfrastructure
- Enable reproducibility
- Accelerate scientific progress by automating processes
- Interdisciplinary and intradisciplinary research challenges
- Report available at http://www.isi.edu/nsf-workflows06
3 Benefits of Workflow Systems (Taylor et al 07)
- Managing execution
- Remote job submission
- Dependencies among steps
- Failure recovery
- Managing distributed computation
- Move data when needed
- Managing large data sets
- Efficiency, reliability
- Security and access control
- Access to shared resources
- Provenance recording
- Low-cost high-fidelity reproducibility
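The execution-management benefits listed above (dependencies among steps, failure recovery) can be illustrated with a minimal Python sketch. It does not show how Pegasus or DAGMan work internally; the step names, the `run_workflow` helper, and the retry policy are hypothetical and chosen only to show the idea.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, dependencies, max_retries=2):
    """Run steps in dependency order, retrying a failed step a few times.

    steps: dict mapping step name -> zero-argument callable (hypothetical).
    dependencies: dict mapping step name -> set of steps it depends on.
    Returns the list of step names in the order they completed.
    """
    # Make sure steps without declared dependencies are still scheduled.
    graph = {name: set(dependencies.get(name, ())) for name in steps}
    completed = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(max_retries + 1):
            try:
                steps[name]()
                completed.append(name)
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise  # give up after the last retry
    return completed


# Hypothetical three-step pipeline; "simulate" fails once, then succeeds.
attempts = {"simulate": 0}


def flaky_simulate():
    attempts["simulate"] += 1
    if attempts["simulate"] == 1:
        raise RuntimeError("transient failure")


order = run_workflow(
    steps={
        "stage_in": lambda: None,
        "simulate": flaky_simulate,
        "stage_out": lambda: None,
    },
    dependencies={"simulate": {"stage_in"}, "stage_out": {"simulate"}},
)
```

Here the dependency graph fixes the order of the three steps, and the transient failure in `simulate` is absorbed by a retry rather than aborting the run.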
4 Capabilities Available Today: Wings/Pegasus Workflows for Seismic Hazard Analysis (Gil et al 07; see also Maechlin et al 05; Deelman et al 06)
- Input data: a site and an earthquake forecast model
- thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
- 110,000 rupture variations to be simulated for that site
- High-level template combines 11 application codes
- 8048 application nodes in the workflow instance generated by Wings
- Provenance records kept for 100,000 workflow data products
- Generated more than 2M triples of metadata
- 24,135 nodes in the executable workflow generated by Pegasus, including
- data stage-in jobs, data stage-out jobs, data registration jobs
- Executed in USC HPCC cluster (1820 nodes w/ dual processors) but only < 144 available
- Including MPI jobs, each runs on hundreds of processors for 25-33 hours
- Runtime was 1.9 CPU years
5 The Wings/Pegasus Workflow System (Gil et al 07; Deelman et al 03; Deelman et al 05; Kim et al 08; Gil et al forthcoming)
WINGS: Semantic workflow environment (wings.isi.edu)
- Knowledge-based reasoning on workflows and data (W3C's OWL)
- Semantic workflow catalogs
- Automation and assistance
- Execution-independent workflows
Pegasus: Automated workflow refinement and execution (pegasus.isi.edu)
- Optimize for performance, cost, reliability
- Assign execution resources
- Manage execution through DAGMan
- Daily operational use in many domains
Grid services: condor.uwisc.edu, www.globus.org
- Secure and controlled sharing of distributed services, computing, data
- Scalable service-oriented architecture
- Commercial quality, open source
6 Semantic Workflows in WINGS (Gil et al IEEE IS 2010; Gil et al JETAI 2010; Gil et al eScience 2009; Kim et al JCCPE 2008; Gil et al 2007)
- Semantic workflows
- More than a dataflow graph
- Workflow variables: each constituent (node, link, component, dataset) has a corresponding variable
- Semantic constraints on workflow variables, both within and across variables
- Semantic descriptions of collections of data and components are concisely represented
(TestData dcdom:isDiscrete false) (TrainingData dcdom:isDiscrete false)
modelerInput_not_equal_to_classifierInput:
(modelerInput wflow:hasDataBinding ?ds1)
(classifierInput wflow:hasDataBinding ?ds2)
equal(?ds1, ?ds2) (?t rdf:type wflow:WorkflowTemplate)
-> (?t wflow:isInvalid "true"^^xsd:boolean)
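The rule above marks a workflow template as invalid when the modeler input and the classifier input are bound to the same dataset. A minimal Python sketch of the same check follows. It uses a plain dictionary in place of RDF triples, and the `is_invalid` helper and the dataset names are hypothetical, not part of Wings.

```python
def is_invalid(bindings):
    """Return True when the modeler and classifier inputs share a dataset.

    bindings: dict mapping a workflow variable name to its bound dataset id.
    Mirrors the intent of the rule modelerInput_not_equal_to_classifierInput.
    """
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    # Only flag the template when both variables are bound and equal.
    return ds1 is not None and ds1 == ds2


# Hypothetical bindings: training and test data should differ.
same = {"modelerInput": "weather_2009", "classifierInput": "weather_2009"}
different = {"modelerInput": "weather_2009", "classifierInput": "weather_2010"}
```

Calling `is_invalid(same)` flags the template, while `is_invalid(different)` accepts it; an unbound template is left alone because there is nothing to compare yet.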
7 Workflow Portal for Genetic Studies of Mental Disorders (with E. Deelman and C. Mason)
- Existing repository of genotypic and phenotypic information
- Goal: develop workflows useful for data in the repository
8 Designing a Workflow Collection for Population Genomics
- Designed workflows for common analysis types
- Association tests
- CNV detection
- Variant discovery
- Family-based association analysis (TDT)
- Developed workflow components by encapsulating widely-used heterogeneous open software
- Plink (Purcell, Harvard)
- R (Chambers et al)
- PennCNV (Penn) -- Hidden Markov Models
- Gnosis (State, Yale) -- sliding windows
- Allegro (Decode, Iceland) -- Multiterminal Binary Decision Diagrams
- Structure (Pritchard, Chicago) -- structured association
- FastLink (Schaffer, NCBI)
- BWA, Burrows-Wheeler Aligner (Li & Durbin)
- SAMTools
9 Wings Workflows for Genetic Studies of Mental Disorders (Gil et al, forthcoming)
Transmission Disequilibrium Test (TDT)
Association Tests
CNV Detection
Variant Discovery from Resequencing
10 Major Features
- Workflow system manages set up and execution
- Wings - set up
- Pegasus - execution
- Initial collection of workflows captures common genomic analyses
- Users can upload their own datasets
- Including collections of datasets
- User data is secure
- Not accessible by others
11 Wings Replication of Crohn's Disease Association Study from Duerr et al, Science 06
12 Wings Replication of Early-Onset Parkinson's Disease Study from Bayrakli et al, Human Mutation 07
13 Observations about Reproducibility with Workflows (Gil et al, forthcoming)
- Effort involved in reproducing results is minor
- 30 seconds to set up a workflow
- A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
- Our workflows were independently developed and used as is
- Semantic representations abstract the analysis method from the software that implements it
- Our workflows used different analytic tools than the original studies
- Many implementations of same algorithm, some proprietary
- Semantic constraints can be added to workflows to avoid analysis errors
- E.g., in association analysis workflow, added constraint to remove duplicate individuals initially to avoid problems downstream
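The duplicate-removal constraint mentioned above can be sketched as a preprocessing step. This is an illustration only: the record layout (`individual_id`, `genotype`) and the `remove_duplicate_individuals` helper are assumptions, not the representation used in the Wings workflows.

```python
def remove_duplicate_individuals(records):
    """Keep the first record seen for each individual, preserving order.

    records: iterable of dicts with an "individual_id" key (assumed layout).
    """
    seen = set()
    unique = []
    for record in records:
        key = record["individual_id"]
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample sheet with one individual listed twice.
samples = [
    {"individual_id": "P001", "genotype": "AA"},
    {"individual_id": "P002", "genotype": "AG"},
    {"individual_id": "P001", "genotype": "AA"},
]
cleaned = remove_duplicate_individuals(samples)
```

Running the filter at the start of the workflow means later steps never see the repeated individual, which is the point of placing the constraint up front.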
14 Benefits of Semantic Workflows (Gil JSP-09)
- Execution management
- Automation of workflow execution
- Managing distributed computation
- Managing large data sets
- Security and access control
- Provenance recording
- Low-cost high-fidelity reproducibility
- Semantics and reasoning
- User assistance to correctly explore analysis design space
- Validation of analyses
- Automated generation of metadata
- Workflow retrieval and discovery
- Conceptual reproducibility
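Provenance recording and automated metadata generation, listed above, can be pictured as emitting triples while each step runs and querying them afterwards. The sketch below is a simplified assumption, not the Wings/Pegasus provenance format: the predicate strings are plain labels loosely modeled on OPM-style edges, and the helper and file names are hypothetical.

```python
def record_step(triples, step, component, inputs, outputs):
    """Append (subject, predicate, object) triples describing one executed step."""
    triples.append((step, "usedComponent", component))
    for dataset in inputs:
        triples.append((step, "used", dataset))
    for dataset in outputs:
        triples.append((dataset, "wasGeneratedBy", step))
    return triples


def lineage(triples, dataset):
    """Return every dataset that `dataset` was derived from, transitively."""
    ancestors = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        # Steps that generated the current dataset.
        steps = [o for s, p, o in triples if s == current and p == "wasGeneratedBy"]
        for step in steps:
            # Datasets consumed by those steps are ancestors.
            for s, p, o in triples:
                if s == step and p == "used" and o not in ancestors:
                    ancestors.add(o)
                    frontier.append(o)
    return ancestors


# Hypothetical two-step run: clean the data, then train a model.
triples = []
record_step(triples, "step1", "cleaner", ["raw.csv"], ["clean.csv"])
record_step(triples, "step2", "modeler", ["clean.csv"], ["model.bin"])
sources = lineage(triples, "model.bin")
```

Because the records are produced as a side effect of execution, tracing a result back to its inputs costs the user nothing extra.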
15 W3C Provenance Group (Y. Gil, chair): Goals
- Provide state-of-the-art understanding and develop a roadmap for development and possible standardization
- Articulate requirements for accessing and reasoning about provenance information
- Develop use cases
- Identify issues in provenance that are of direct concern to the Semantic Web
- Articulate relationships with other aspects of Web architecture
- Report on state-of-the-art work on provenance
- Report on a roadmap for provenance in the Semantic Web
- Identify starting points for provenance representations
- Identify elements of a provenance architecture that would benefit from standardization
16 W3C Provenance Group: Products of the Group to Date
- Group formed in September 2009, open to new members
- All information is public: http://www.w3.org/2005/Incubator/prov/wiki/
- Developed a set of key dimensions for provenance (11/09)
- Grouped into three major categories: content, management, use
- Developed use cases for provenance (12/09)
- More than 30 use cases, including 10 in science but others are relevant
- Developed requirements for provenance from use cases (1/10)
- User requirements: what is the purpose of the provenance information
- Technical requirements: derived from the user requirements
- Report on "Requirements for Provenance on the Web"
- Currently developing state-of-the-art report (expected 6/10)
- Started to develop recommendations (expected 9/10)
- Mappings across provenance vocabularies (e.g., DC, OPM, SWAN, ...)