Title: Taverna and myGrid
1Taverna and myGrid
- A solution for confusion intensive computing?
- Tom Oinn EMBL-EBI,
- tmo_at_ebi.ac.uk
http://mygrid.org.uk http://taverna.sf.net
2Who are we?
- myGrid
- An EPSRC funded eScience Pilot Project
- Based across multiple sites in the UK
- Taverna
- A tethered spin-off of the myGrid project
- Aimed at producing powerful tools to complement
the basic research work
EBI Hinxton Campus
3What is Taverna?
- Allows scientists to graphically construct complex processes in the form of workflows
- What is a workflow?
- Set of activities that make up a process
- Definitions about how data moves between these activities
- The user specifies what to do but not how to do it
- Insulates users from the complexity of distributed computing
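The definition above (a set of activities plus definitions of how data moves between them) can be illustrated with a minimal Python sketch. This is not Taverna's own workflow format or API; the activity functions, the `links` table, and the `run_workflow` helper are hypothetical names chosen for illustration, and the sequence data is a placeholder.

```python
from graphlib import TopologicalSorter

# Hypothetical activities: plain functions standing in for remote services.
def fetch_sequence(accession):
    return "ATGGCCTAA"  # placeholder data, not a real database lookup

def reverse_complement(sequence):
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(sequence))

def gc_content(sequence):
    return sum(base in "GC" for base in sequence) / len(sequence)

activities = {
    "fetch": fetch_sequence,
    "revcomp": reverse_complement,
    "gc": gc_content,
}

# Data links: each activity names the upstream activity that feeds it.
# The user states WHAT is connected; the enactor below decides HOW to run it.
links = {"fetch": None, "revcomp": "fetch", "gc": "revcomp"}

def run_workflow(activities, links, workflow_input):
    # Build a node -> predecessors graph and run activities in dependency order.
    graph = {name: ([src] if src else []) for name, src in links.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        source = links[name]
        argument = workflow_input if source is None else results[source]
        results[name] = activities[name](argument)
    return results

results = run_workflow(activities, links, "ACC0001")
```

The workflow author only edits `activities` and `links`; the ordering logic stays hidden inside `run_workflow`, which is the kind of insulation from execution detail the slide describes.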
4Looks a bit like this
5myGrid, Taverna and WBS
- One of several early adopters of Taverna
- Manchester based group working on Williams-Beuren Syndrome in the medical genetics department
- Workflows written by life scientists, not computer scientists
- Following slides stolen at the last minute from Hannah Tipney at Manchester!
6Williams-Beuren Syndrome (WBS)
- Contiguous sporadic gene deletion disorder
- 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
- Haploinsufficiency of the region results in the phenotype
- Multisystem phenotype: muscular, nervous, circulatory systems
- Characteristic facial features
- Unique cognitive profile
- Mental retardation (IQ 40-100, mean 60, normal mean 100)
- Outgoing personality, friendly nature, charming
7Williams-Beuren Syndrome Microdeletion
[Figure: map of the Williams-Beuren Syndrome microdeletion region]
- Block labels: C-cen, A-cen, B-cen, C-mid, B-mid, A-mid, B-tel, A-tel, C-tel
- Gene labels: POM121, NOLR1, FKBP6, FZD9, BAZ1B, BCL7B, TBL2, WBSCR14, WBSCR18, WBSCR22, STX1A, WBSCR21, CLDN3, CLDN4, ELN, LIMK1, WBSCR1/EIF4H, WBSCR5/LAB, RFC2, CYLN2, GTF2IRD1, GTF2I, NCF1, GTF2IRD2
Eichler E, Clark R, She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164
8Experiment
[Figure: flow diagram of the analysis steps, pairing each tool with the result it contributes]
- Inputs and records: GenBank Accession No, URL inc GB identifier, GenBank Entry
- Seqret: Nucleotide seq (Fasta)
- RepeatMasker: Repetitive elements
- BLAST searches (BLASTwrapper, BlastWrapper, ncbiBlastWrapper): Blastn Vs nr, est databases; tblastn Vs nr, est, est_mouse, est_human databases; Blastp Vs nr
- Sort for appropriate Sequences only
- GenScan: Coding sequence
- sixpack: 6 ORFs
- transeq: Amino Acid translation
- prettyseq: Translation/sequence file. Good for records and publications
- restrict: Restriction enzyme map
- cpgreport: CpG Island locations and
- Promoter Prediction, TF binding Prediction, Regulation Element Prediction: Identify regulatory elements in genomic sequence
- epestfind: Identifies PEST seq
- pscan: Identifies FingerPRINTS
- pepstats: MW, length, charge, pI, etc
- pepcoil: Predicts Coiled-coil regions
- SignalP, TargetP, PSORTII: Predicts cellular location
- InterPro: Identifies functional and structural domains/motifs
- Pepwindow? Octanol?: Hydrophobic regions
- ORFs
9Analysis via Cut and Paste
10Workflows
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence
11The Biological Results
Four workflow cycles totalling 10 hours. The gap was correctly closed and all known features identified.
[Figure labels: WBSCR14, ELN, CTA-315H11, CTB-51J22]
12And Now Pretty Pictures
The first thing users see
13Different service types, unified.
BioMoby (orange), Soaplab (wheat), Workflow
(red), SOAP Service (green), SeqHound (blue),
Local Java operation (purple), String constant
(pale blue)
14Launching a workflow
15Invocation progress
16Browsing the results
17Results in context
18Integration Epochs
1. Databases / Data warehouses: integration of data
2. Distributed Queries, Workflows: integration of process
3. Semantic Unification: integration of knowledge
- Current state of the art is somewhere around 2.5; what do we need to do next?
19Last Year's Problems
- Multiple data sources
- SOA approaches, distributed queries e.g. OGSA-DAI
- Heterogeneous computational resources
- SOA combined with workflow methods
- Toolkits widely used and deployed e.g. Soaplab, BioMoby et al.
- As a community we can provide data and compute services, and are doing so.
20Yesterday's Problems
- Usability
- Distributed computing and biologists go together like water and mains electricity
- Graphical workflow environments now exist e.g. Taverna, Triana, Discovery-Net, Ptolemy
- Can be improved upon but basically usable by the target audience of expert researchers.
21- Concept
- Workflows, SOA and friends are now accepted as a legitimate way of doing things
- Methods have moved from the "out there" research world to just inside the common scientific toolbox
22- Functionality
- Integration of BioMoby, EMBOSS, SOAP services, command line tools, SeqHound, Web CGIs and others on demand
- Fault tolerance and reporting
- Enactment of complex process flows
- Some service discovery (crude but surprisingly effective)
- Available and widely used (>2500 downloads of Taverna from http://taverna.sf.net)
23Current Work
- Service Discovery
- Doing it properly: semantic registry technology
- Ontologies for services, data etc.
- Annotating the corpus of services with metadata
- Data management
- Putting data in context within the scientific process
- Managing the new bursts of data from workflow systems
24So Where's This Confusion Then?
- At the moment, invoking a workflow gives results equivalent to a big set of files
- Files are data; what we want is knowledge
- Confusion is formed from data and banished by the conversion of that data into knowledge
- This is the problem for Today, Tomorrow and beyond!
- So, what are we going to do about it next?
25Some Types of knowledge in myGrid and Taverna
- Data to Context Knowledge
- Which operation produced the data?
- Which workflow defined the operation?
- When, Where and Who?
- Workflow design and enactment!
- Data to Data Knowledge
- Relate operation inputs and outputs
- Base "derived from" relation in RDF
- Can be specialized through templates
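The two kinds of knowledge listed above can be sketched as subject-predicate-object triples in the spirit of RDF. This is only an illustration under stated assumptions: the predicate names (`producedBy`, `definedIn`, `createdAt`, `createdBy`, `derivedFrom`), the identifiers, and both helper functions are hypothetical and are not taken from myGrid or Taverna.

```python
from datetime import datetime, timezone

def record_invocation(triples, operation, workflow, user, inputs, output):
    """Append provenance triples for one operation invocation."""
    # Data to Context: which operation and workflow produced the data, when, and who.
    triples.append((output, "producedBy", operation))
    triples.append((operation, "definedIn", workflow))
    triples.append((output, "createdAt", datetime.now(timezone.utc).isoformat()))
    triples.append((output, "createdBy", user))
    # Data to Data: the base "derived from" relation between output and inputs.
    for item in inputs:
        triples.append((output, "derivedFrom", item))

def lineage(triples, item):
    """Return every data item that `item` was directly or indirectly derived from."""
    parents = [o for s, p, o in triples if s == item and p == "derivedFrom"]
    found = []
    for parent in parents:
        found.append(parent)
        found.extend(lineage(triples, parent))
    return found

triples = []
record_invocation(triples, "transeq", "wbs_workflow", "alice", ["seq:1"], "protein:1")
record_invocation(triples, "pepstats", "wbs_workflow", "alice", ["protein:1"], "stats:1")
ancestors = lineage(triples, "stats:1")
```

Following `derivedFrom` links backwards answers "where did this result come from?", while the context triples answer which operation, workflow, and user were involved. A specialised relation (for example a hypothetical `translationOf`) could be recorded alongside the base relation in the same way.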
26- Context to Context Knowledge
- Common information model shared across components
- Encapsulates organizations, people, experiment designs, instances and results.
- Equivalent to an overall eScience file system
- In Silico eScience Materials and Methods
- Expressed in terms of workflow definitions within Taverna
27The eScience Knowledge Gap (one of them anyway!)
- Hypothesis is missing!
- Without some specification of the hypothesis which the experiment is designed to test we cannot do much more than the forms of knowledge stated previously.
- Hypothesis as part of the Process Model?
- Can we define the hypothesis as the population of a domain and experiment specific data model in combination with a set of statements about instances of this model?
- How would this fit in with the current workflow centric approach we're taking?
28But Domain Modeling is Hard
- Do we need to model the entire domain?
- Derive an experiment specific model by either creating from scratch or aggregating fine grained Atomic Domain Models
- Examples: Sequence Features, GO Term Graph, Metabolic Pathway, Protein Interaction Set
- For example, if the hypothesis is "proteins annotated with GO term xxx or children by InterPro scan are implicated in pathway zzz"
- Aggregate target domain model consists of the combination of these Atomic Domain Models.
- Hypothesis statement in the form of this model: a query over the model topology which returns the proportion of proteins in the model satisfying the hypothesis constraint.
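One possible reading of the example hypothesis, sketched in Python: combine a GO term graph, protein annotations, and pathway membership, then query for the proportion of annotated proteins that fall in the pathway. All data, identifiers, and function names here are invented placeholders, and the choice of denominator (proteins annotated with the term or its children) is an assumption rather than something the slide specifies.

```python
# Hypothetical GO term graph: child term -> parent term (None marks a root).
go_parents = {
    "GO:child1": "GO:xxx",
    "GO:child2": "GO:child1",
    "GO:other": None,
    "GO:xxx": None,
}

# Hypothetical annotations (e.g. from an InterPro scan) and pathway membership.
protein_terms = {
    "P1": {"GO:child2"},
    "P2": {"GO:xxx"},
    "P3": {"GO:other"},
    "P4": {"GO:child1"},
}
pathway_members = {"zzz": {"P1", "P2", "P3"}}

def is_term_or_descendant(term, ancestor):
    # Walk up the term graph until the ancestor is found or the root is passed.
    while term is not None:
        if term == ancestor:
            return True
        term = go_parents.get(term)
    return False

def hypothesis_support(go_term, pathway):
    """Proportion of proteins annotated with go_term (or a child) that lie in pathway."""
    selected = [
        protein
        for protein, terms in protein_terms.items()
        if any(is_term_or_descendant(term, go_term) for term in terms)
    ]
    if not selected:
        return 0.0
    in_pathway = sum(protein in pathway_members[pathway] for protein in selected)
    return in_pathway / len(selected)

support = hypothesis_support("GO:xxx", "zzz")
```

Each dictionary stands in for one Atomic Domain Model; the query only works because the three are aggregated into one structure that shares protein and term identifiers.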
29Populating the Target Domain Model
- Workflows are based on the composition of distributed services
- Can we derive services from the Target Domain Model? For example, the Sequence Features model would manifest a setFeature(start, end, sequence, feature) operation or similar.
- Allow the user to incorporate these operations into the workflow alongside the regular services, effectively annotating the workflow.
- Make use of existing "Data to Data Knowledge" and "Data to Context Knowledge" to link entities within the Target Domain Model with derivation information.
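A minimal sketch of what a model-derived operation might look like, assuming a Sequence Features model exposed as a Python class. Only the setFeature(start, end, sequence, feature) signature comes from the slide; the class name, the `derived_from` argument, the `features_on` helper, and the coordinates are hypothetical additions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SequenceFeaturesModel:
    """Hypothetical Atomic Domain Model holding features located on sequences."""
    features: list = field(default_factory=list)

    def set_feature(self, start, end, sequence, feature, derived_from=None):
        # Mirrors the setFeature(start, end, sequence, feature) operation;
        # derived_from is an assumed hook for attaching derivation information.
        if start > end:
            raise ValueError("start must not exceed end")
        self.features.append({
            "sequence": sequence,
            "start": start,
            "end": end,
            "feature": feature,
            "derived_from": derived_from,
        })

    def features_on(self, sequence):
        return [f for f in self.features if f["sequence"] == sequence]

model = SequenceFeaturesModel()
# In a workflow, the output of a regular service would be wired into this operation.
# The coordinates below are made-up placeholders, not real results.
model.set_feature(120, 480, "seq:1", "CpG island", derived_from="cpgreport")
model.set_feature(900, 1500, "seq:1", "repeat", derived_from="RepeatMasker")
found = model.features_on("seq:1")
```

Placing calls like `set_feature` in the workflow next to the analysis services means a run fills in the model as a side effect, rather than leaving only loose output files.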
30Data Transformed to Knowledge
- A workflow invocation would now result in a populated domain model as opposed to (or in addition to) a large set of discrete pieces of data.
- Explicit semantics in the Target Domain Model
- Drive hypothesis testing
- Drive visualization in a graphical UI
- Generate textual summary of the knowledge
31myGrid and WBS People!
- Core
- Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
- Users
- Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
- Hannah Tipney, May Tassabehji, Andy Brass, St Mary's Hospital, Manchester, UK
- Postgraduates
- Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
- Industrial
- Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
- Robin McEntire (GSK)
- Collaborators
- Keith Decker
32Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project, http://taverna.sf.net