Title: Workflow and data integration in e-bioscience
1Workflow and data integration in e-bioscience
2Integrative Bioinformatics Unit and collaborations
IBU: Márcia Inda, Oskar Bruning, Scott Marshall, Lennart Post, Tessa Pronk, Han Rauwerda, Marco Roos, Igor Serov, Robert Stad (FlexGen), Peter Sterk (EBI), Frans Verster, Asli Umur, Timo Breit
Collaborations
- Semantic modeling and its applications: Pieter Adriaans (UvA); Guus Schreiber, Machiel Jansen (VU); Werner Ceusters, Anand Kumar, Barry Smith (IFOMIS); Marijke Keet ()
- Information management: Ersin Kaletas, Bob Hertzberger (UvA); Joost Kok, Fons Verbeek (LIACS)
- Case studies: Lennart Post, Roel van Driel (UvA); Wendy Bruins, Jeroen Pennings, Annemieke de Vries (RIVM); Dutch system biology of Lactococcus lactis consortium
- Workflow: Adam Belloum (UvA); Taverna developers (EBI)
More information: www.micro-array.nl
3Outline
- Computational experiments in an e-science environment
- Requirements
- Annotation of both data and services
- Provenance
- Intersecting views of data
- Mapping experimental data to meaning
- What is an ontology? Essential concepts.
- Semantic annotation of experimental data
- A brief look at Taverna
- Conclusions
4Computational experiment
(Diagram: computational experiment in a workflow environment, connected to several databases)
5Issues raised by computational experimentation
- How will we find relevant data?
- How will we automatically integrate such data into our experiment?
- How will we find appropriate services?
- How will we integrate our results as usable data for a new (computational) experiment?
- -> annotation
6Computational Experiments: Anticipated needs of the data consumer
- Data integration
- combining different types of data
- Data annotation beyond formats
- Not only
- Data types (integer, string, etc.)
- But also
- Data semantics: What do the data represent?
- Determined by the experimental design
- Provenance: What has been done to the data?
- Description of the procedure(s) that produced/transformed the data
- Find and apply appropriate (web) services
- Reuse results from a computational experiment as data in another computational experiment
- derived data is tagged and put into the repository
7Anticipated needs of the data supplier (and consumer)
- Data in
- Simple submission/registration of data to e-science repository
- Semi-automatic annotation
- Data out
- Easy search and retrieval of previous datasets (my personal and my group's data)
- Easy search and retrieval of relevant datasets from public repository
- Combining data
- Different types and different sources
- Example: Intersecting views of data
- data mapped to physical or semantic space (Examples follow..)
8Intersecting views of data: mRNA levels mapped to embryo cross-section
9Why semantic annotation?
- We want annotation to be machine-readable
- Free text: arbitrary text tags generated by users won't always match up
- Simplest problem: Finding a named object
- Synonyms: Different names exist for the same object in different contexts and roles.
- Homonyms: The same name is used for different objects.
- Which name should I use?
- Standardized vocabulary list
- can only find literal matches
- Example: Using data types to search for services will find too many!
- Semantic tags
- allow searching for similar items
- Find items like this one.
- allow searching with a description
- Find items with these properties.
- semantic description of service (OWL-S) as well as data (OWL)
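The difference between literal matching on free-text tags and matching on semantic tags can be sketched in a few lines of Python. This is only an illustration: the concept identifiers (`disease:VHL`, `gene:VHL`), the dataset names, and the lookup table are hypothetical and stand in for a real ontology.

```python
from typing import Dict, List, Set

# Hypothetical vocabulary: free-text names mapped to the concepts they may denote.
# Several names for one concept are synonyms; one name with several concepts is ambiguous.
NAME_TO_CONCEPTS: Dict[str, Set[str]] = {
    "von hippel-lindau disease": {"disease:VHL"},
    "vhl syndrome": {"disease:VHL"},
    "vhl": {"disease:VHL", "gene:VHL"},
}

# Hypothetical datasets, each carrying one user-supplied free-text tag.
DATASETS: Dict[str, str] = {
    "dataset-1": "von Hippel-Lindau disease",
    "dataset-2": "VHL syndrome",
    "dataset-3": "VHL",
}


def literal_search(query: str) -> List[str]:
    """Return datasets whose free-text tag equals the query exactly."""
    return sorted(name for name, tag in DATASETS.items() if tag == query)


def semantic_search(concept: str) -> List[str]:
    """Return datasets whose tag may denote the given concept."""
    return sorted(
        name
        for name, tag in DATASETS.items()
        if concept in NAME_TO_CONCEPTS.get(tag.lower(), set())
    )


def is_ambiguous(name: str) -> bool:
    """A name is ambiguous when it maps to more than one concept."""
    return len(NAME_TO_CONCEPTS.get(name.lower(), set())) > 1


print(literal_search("VHL syndrome"))
print(semantic_search("disease:VHL"))
print(is_ambiguous("VHL"))
```

The literal search finds only the dataset whose tag matches character for character, while the concept-based search also returns the datasets tagged with other names for the same concept.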
10What is an ontology?
- Definitions
- A collection of things that are defined in terms of their properties and relations to other things.
- A specification of a conceptualization that is designed for reuse across multiple applications and implementations (Gruber 93, 95; Guarino 96; Guarino and Giaretta 95)
- General applications
- Searching for objects that are resources, documents, concepts, experimental data, or collections of these things.
- Knowledge capture
- Example: Biological model with hypothetical knowledge
- Common applications in bioinformatics
- Annotation of database entries (e.g. gene products)
11Inheritance in ontologies
(Diagram: class hierarchy with nodes Animal, Mammal, Bird, Robin, Heron, Penguin)
- Often represented as DAGs (Directed Acyclic Graphs) or hierarchies (trees)
- Power of inheritance
- Inclusion relations (ISA) apply transitivity to create inheritance of class and properties downward along chains in the hierarchy.
- Use an element as a metadata tag for semantic annotation (ontotag)
- An ontotag serves as a pointer into a semantic space
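A minimal Python sketch of ISA transitivity, using the class names from this slide. The edges between the classes, the property names, and the sample tags are assumptions made for illustration; the slide itself only lists the nodes.

```python
from typing import Dict, List, Set

# Direct ISA links (child -> parents). The exact edges are an assumption.
IS_A: Dict[str, List[str]] = {
    "Mammal": ["Animal"],
    "Bird": ["Animal"],
    "Robin": ["Bird"],
    "Heron": ["Bird"],
    "Penguin": ["Bird"],
}

# Properties asserted directly on a class (hypothetical).
PROPERTIES: Dict[str, Set[str]] = {
    "Animal": {"is_alive"},
    "Bird": {"has_feathers"},
}

# Hypothetical ontotags: data item -> class used as its metadata tag.
ONTOTAGS: Dict[str, str] = {
    "sample-1": "Robin",
    "sample-2": "Mammal",
    "sample-3": "Bird",
}


def ancestors(cls: str) -> Set[str]:
    """All classes reachable from cls by following ISA links transitively."""
    seen: Set[str] = set()
    stack = list(IS_A.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen


def inherited_properties(cls: str) -> Set[str]:
    """Properties of cls plus everything inherited down the ISA chain."""
    props = set(PROPERTIES.get(cls, set()))
    for ancestor in ancestors(cls):
        props |= PROPERTIES.get(ancestor, set())
    return props


def find_tagged(cls: str) -> List[str]:
    """Items whose ontotag is cls itself or any (transitive) subclass of cls."""
    return sorted(
        item for item, tag in ONTOTAGS.items()
        if tag == cls or cls in ancestors(tag)
    )


print(inherited_properties("Penguin"))
print(find_tagged("Bird"))
```

Because the tag points into the hierarchy, a query for `Bird` also returns the item tagged `Robin`, even though the name `Bird` never appears on that item.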
12Gene Ontology
Mouse p53: List of GO identifiers
Process: apoptosis, DNA damage response, signal transduction by p53 class mediator...
Component: cytoplasm, cytosol...
Function: DNA binding, protein binding...
- Cluster of genes X from microarray analysis
- Collection of lists of GO identifiers, one per gene in cluster
- Most prevalent GO identifiers
- Apoptosis, Cytosol, Protein Binding
- Significant relationships between GO classes (e.g. cell death and DNA damage response)
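Finding the most prevalent GO annotations in a cluster amounts to counting how many genes carry each term. The sketch below assumes a made-up cluster (`gene_a`, `gene_b`, `gene_c`) and uses term names instead of real GO identifiers. It only counts; it does not test whether a term is statistically over-represented.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical cluster: gene name -> GO term names annotated to that gene.
CLUSTER: Dict[str, List[str]] = {
    "gene_a": ["apoptosis", "cytosol", "protein binding"],
    "gene_b": ["apoptosis", "cytosol", "DNA binding"],
    "gene_c": ["apoptosis", "protein binding", "DNA damage response"],
}


def most_prevalent(cluster: Dict[str, List[str]], top_n: int) -> List[Tuple[str, int]]:
    """Return the top_n terms ranked by the number of genes annotated with them."""
    counts: Counter = Counter()
    for terms in cluster.values():
        counts.update(set(terms))  # count each term at most once per gene
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


for term, n_genes in most_prevalent(CLUSTER, 3):
    print(f"{term}: {n_genes} genes")
```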
14Intersecting views of data II: mRNA levels mapped to gene ontology
15Applications for search
- Finding an object when we don't know the name (for example, the ontology has changed!)
- It belongs to Class E5 and has these attributes (x, y, ..) and relations (a, b, ..).
- It's similar to Object A but plays a role in context G
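Searching by description rather than by name can be sketched as a filter over class, attributes, and relations, with a simple overlap score for "similar to Object A". Everything here is illustrative: the `Item` structure, the object names, and the use of Jaccard overlap as a similarity measure are assumptions, not something the slides prescribe.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Item:
    name: str
    cls: str
    attributes: Dict[str, str] = field(default_factory=dict)
    relations: Set[str] = field(default_factory=set)


# Hypothetical annotated objects.
ITEMS: List[Item] = [
    Item("obj-1", "E5", {"x": "1", "y": "2"}, {"a", "b"}),
    Item("obj-2", "E5", {"x": "1"}, {"a"}),
    Item("obj-3", "E7", {"x": "1", "y": "2"}, {"a", "b"}),
]


def find_by_description(
    items: List[Item], cls: str, attributes: Dict[str, str], relations: Set[str]
) -> List[str]:
    """Names of items in class cls that have all requested attributes and relations."""
    return [
        item.name
        for item in items
        if item.cls == cls
        and all(item.attributes.get(k) == v for k, v in attributes.items())
        and relations <= item.relations
    ]


def features(item: Item) -> Set[Tuple[str, ...]]:
    """Flatten an item's class, attributes, and relations into one feature set."""
    feats: Set[Tuple[str, ...]] = {("class", item.cls)}
    feats |= {("attribute", k, v) for k, v in item.attributes.items()}
    feats |= {("relation", r) for r in item.relations}
    return feats


def most_similar(target: Item, items: List[Item]) -> str:
    """Name of the other item with the highest Jaccard overlap with target."""
    def jaccard(other: Item) -> float:
        a, b = features(target), features(other)
        return len(a & b) / len(a | b)

    others = [item for item in items if item.name != target.name]
    return max(others, key=jaccard).name


print(find_by_description(ITEMS, "E5", {"x": "1", "y": "2"}, {"a", "b"}))
print(most_similar(ITEMS[0], ITEMS))
```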
16Ontological search for annotated data
(Diagram: annotated experimental data linked to a domain ontology; labels: Human, von Hippel-Lindau, Zebrafish, Polycystic Kidney Disease, von Hippel-Lindau)
17Ontological search for similar model: model extension
(Diagram: Another Knowledge Model and My Knowledge Model; node labels: Gene A, Gene B, Gene A)
18Semantic annotation - ontotags
(Diagram: semantic annotation with ontotags; labels: Evidence Ontology, Provenance, Author, Gene Ontology, Metadata)
19Computational experiment
(Diagram: computational experiment connected to several databases)
- Some provenance should be added by the module/service itself
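One way a module or service could add provenance itself is to wrap each processing step so that it appends a record of what it did to the data it returns. The Python sketch below is a hypothetical illustration of that idea; the record fields, the `with_provenance` decorator, and the two example services are assumptions and do not reflect how Taverna stores provenance.

```python
import datetime
import functools
from typing import Any, Callable, Dict, List


def with_provenance(service: Callable[..., Any]) -> Callable[..., Dict[str, Any]]:
    """Wrap a service so that its output carries a growing provenance trail."""

    @functools.wraps(service)
    def wrapper(data: Dict[str, Any], **params: Any) -> Dict[str, Any]:
        result = service(data["value"], **params)
        record = {
            "service": service.__name__,
            "parameters": dict(params),
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        # Build a new list so the input's provenance trail is left untouched.
        return {"value": result, "provenance": data.get("provenance", []) + [record]}

    return wrapper


@with_provenance
def normalize(values: List[float], scale: float = 1.0) -> List[float]:
    """Hypothetical service: rescale values so that they sum to `scale`."""
    total = sum(values)
    return [scale * v / total for v in values]


@with_provenance
def threshold(values: List[float], cutoff: float = 0.0) -> List[float]:
    """Hypothetical service: keep only values at or above the cutoff."""
    return [v for v in values if v >= cutoff]


raw = {"value": [1.0, 3.0], "provenance": []}
result = threshold(normalize(raw, scale=1.0), cutoff=0.5)

print(result["value"])
for step in result["provenance"]:
    print(step["service"], step["parameters"])
```

The derived data then answers "what has been done to the data?" without the user having to record each step by hand.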
20What is Taverna?
- Taverna (myGrid, SourceForge): Tom Oinn, Matthew Pocock, Martin Senger, Anil Wipat, Peter Li, Kevin Glover
- Institutes
- European Bioinformatics Institute (EBI)
- IT Innovation
- Rosalind Franklin Centre for Genomic Research (RFCGR)
- Newcastle Computer Science faculty
- Newcastle Centre for Life
- Manchester Computer Science faculty
- Nottingham University Mixed Reality Lab
- Release 1.0: January 24, 2005
- Motivation: Scufl (Simple conceptual unified flow language) was created because WSFL (Web Services Flow Language) and BPEL (Business Process Execution Language) do not have the levels of user abstraction necessary for most bioinformaticians and ...
21Taverna Highlights
- Language, Platform, and Domain independent
- Services available as remote and local components
- Visual interface
- Workflow graph
- Visualisers
- Access to computing clusters such as at European Bioinformatics Institute via services (no administrative overhead)
- Workflow exchange through XML (XScufl)
- Provenance
- Personalisation
22Taverna Workflow diagram
23Taverna Advanced model explorer
24GoViz workflow output
25Workflow wishlist - Visualization
(Diagram: analysis pipeline with steps Feature Extraction, Preprocessing/Normalization, Differential Expression, Clustering)
- Visualization
- Interactive Visualization
- especially linked brushing, where selections in one view become active in another view
26Taverna - Intermediate results
27Taverna - Provenance
28Conclusions
- Semantic annotation is essential for data integration.
- Ontological tags (ontotags) can be used for semantic annotation of both data and (web)services.
- Ontotags and provenance can be added by the (web)services themselves.
- Interaction will sometimes be needed from (web)services.
- Taverna provides a foundation for the further implementation of semantic annotation and provenance.
29The End
- Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house.
- Henri Poincaré, Science and Hypothesis, 1905
30VL-e wishlist applied to Taverna
- Present
- Absent
- Potential or Intention
31Functional Wishlist
- Language, Platform (not browser), and Domain independent
- Encapsulation of procedures for novice users and best practice
- Access to DBMS: a service on(/from) which a workflow entity can store(/retrieve) data
- Access to databases from workflow (storage/retrieval/querying) (ODBC)
- Integration of 3rd party software: the ability to integrate existing software packages in a workflow (R, Matlab, VTK, ITK, FSL, etc.)
- Discovery and invocation of existing web-services developed/maintained by others (e.g. EMBOSS)
- Typing mechanism for input/output data: connected entities in a workflow should only be allowed to exchange data if the type of the data produced by the outputting module is the same as the type consumed by the inputting module
- Fan-in (the input data of an entity can come from multiple entities) and fan-out (the output of an entity can be passed to multiple entities)
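The typing, fan-in, and fan-out wishes can be illustrated with a small Python sketch of a workflow graph that refuses mismatched connections. The `Module` and `Workflow` classes, the port names, and the type names (`ExpressionMatrix`, `GeneClusterList`) are hypothetical; this is not Taverna's or Scufl's model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Module:
    name: str
    inputs: Dict[str, str] = field(default_factory=dict)   # port name -> data type
    outputs: Dict[str, str] = field(default_factory=dict)  # port name -> data type


class Workflow:
    def __init__(self) -> None:
        self.modules: Dict[str, Module] = {}
        # Each link is (source module, output port, target module, input port).
        self.links: List[Tuple[str, str, str, str]] = []

    def add(self, module: Module) -> None:
        self.modules[module.name] = module

    def connect(self, src: str, out_port: str, dst: str, in_port: str) -> None:
        """Link two ports, but only if the produced and consumed types are equal."""
        out_type = self.modules[src].outputs[out_port]
        in_type = self.modules[dst].inputs[in_port]
        if out_type != in_type:
            raise TypeError(
                f"{src}.{out_port} produces {out_type}, "
                f"but {dst}.{in_port} consumes {in_type}"
            )
        self.links.append((src, out_port, dst, in_port))


wf = Workflow()
wf.add(Module("load", outputs={"table": "ExpressionMatrix"}))
wf.add(Module("normalize", {"table": "ExpressionMatrix"}, {"table": "ExpressionMatrix"}))
wf.add(Module("cluster", {"table": "ExpressionMatrix"}, {"clusters": "GeneClusterList"}))
wf.add(Module("plot", {"table": "ExpressionMatrix", "clusters": "GeneClusterList"}))

wf.connect("load", "table", "normalize", "table")
# Fan-out: the output of "normalize" feeds two different modules.
wf.connect("normalize", "table", "cluster", "table")
wf.connect("normalize", "table", "plot", "table")
# Fan-in: "plot" receives data from both "normalize" and "cluster".
wf.connect("cluster", "clusters", "plot", "clusters")

try:
    wf.connect("load", "table", "plot", "clusters")
except TypeError as err:
    print("rejected:", err)
```

Exact type equality is the strictest reading of the wish; a real system might instead accept subtypes, which is a design choice this sketch leaves open.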
32User interface and SW Engineering wishlist
- User-friendly (graphical, sensible defaults, wizards)
- Interactive graph editing of workflow diagram
- Encapsulation (the ability to create hierarchies of workflow), copy/paste (topologies are first-class objects: being able to load a topology as if it is a module)
- Capture workflow, provenance
- Based on well-established standards (i.e. Grid software, easy to install, maintain)
- Software engineering: maintainability of dependency on 3rd party software
- Open source
- Semantic annotation of web services as well as the data produced by a given module
- Visualization from a service component
- Interaction with (the visualization from) a service component, especially selections
33Run-time wishlist
- Execution of workflow, controlled (e.g. stepwise, useful in debugging)
- Distributed execution (e.g. across a Grid of systems)
- Interactive, dynamic execution of workflow; dynamic workflow (execution is not predetermined)
- Monitoring execution of workflow, gathering information on execution of workflow (metadata) (also from inside a workflow)
- Maintain history/log of executed workflow for later scrutiny; reproduction of experiment
- Checkpointing: both data (as a BLOB) and process checkpointing
- nohup execution (being able to execute a workflow in the background, without having to be logged in all the time)
- Control flow (while/for/if-then-else, parallel/sequential/recursion, execute the same workflow with multiple different inputs, parameter sweeping, gathering/collecting of results)
- Resource brokering: given the description of resources required by a workflow entity and the description of abilities provided by a resource, the (automatic) brokering of an entity onto a resource
- Quality-of-Service: fault tolerant, stable, high availability, dependable
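Resource brokering, as described in the wishlist above, can be sketched as matching a workflow entity's stated requirements against each resource's stated abilities. The resource names, the capability keys (`cpus`, `memory_gb`), and the "smallest sufficient resource" rule below are all assumptions chosen for illustration.

```python
from typing import Dict, Optional

# Hypothetical resources and the abilities they advertise.
RESOURCES: Dict[str, Dict[str, int]] = {
    "laptop": {"cpus": 2, "memory_gb": 8},
    "cluster-node": {"cpus": 16, "memory_gb": 64},
}


def broker(
    requirements: Dict[str, int], resources: Dict[str, Dict[str, int]]
) -> Optional[str]:
    """Return the smallest resource that meets every requirement, or None."""
    candidates = [
        name
        for name, abilities in resources.items()
        if all(abilities.get(key, 0) >= need for key, need in requirements.items())
    ]
    if not candidates:
        return None
    # Crude size measure (sum of all ability values), with the name as tie-breaker.
    return min(candidates, key=lambda name: (sum(resources[name].values()), name))


print(broker({"cpus": 1}, RESOURCES))
print(broker({"cpus": 4, "memory_gb": 16}, RESOURCES))
print(broker({"cpus": 64}, RESOURCES))
```

Picking the smallest sufficient resource keeps larger machines free for entities that need them; a real broker would also weigh load, data locality, and cost.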