Taverna and myGrid - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Taverna and myGrid

1
Taverna and myGrid

A solution for confusion intensive computing?
Tom Oinn EMBL-EBI,
tmo_at_ebi.ac.uk

http://mygrid.org.uk http://taverna.sf.net
2
Who are we?

myGrid
An EPSRC funded eScience Pilot Project
Based across multiple sites in the UK
Taverna
A tethered spin-off of the myGrid project
Aimed at producing powerful tools to complement
the basic research work

EBI Hinxton Campus
3
What is Taverna?

Allows scientists to graphically construct
complex processes in the form of workflows
What is a workflow?
Set of activities that make up a process
Definitions about how data moves between these
activities
The user specifies what to do but not how to do
it
Insulates users from the complexity of
distributed computing
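
A minimal Python sketch of the idea on this slide: activities plus data links say what to do, and a small engine works out how (the order) to do it. The activity names, the canned sequence and the run helper are illustrative assumptions; this is not Taverna's workflow format or its enactor.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Activities: plain functions standing in for remote services (hypothetical).
def fetch_sequence(accession):
    return "GGCCAATT"  # canned data, not a real database lookup

def gc_content(sequence):
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def report(accession, gc):
    return f"{accession}: GC fraction {gc:.2f}"

activities = {"fetch": fetch_sequence, "gc": gc_content, "report": report}

# Data links: (source, target activity, target parameter).
# A source named "input:<name>" is a workflow input, not an activity.
links = [
    ("input:accession", "fetch", "accession"),
    ("fetch", "gc", "sequence"),
    ("input:accession", "report", "accession"),
    ("gc", "report", "gc"),
]

def run(activities, links, inputs):
    """Run every activity once, in an order derived from the data links."""
    deps = {name: set() for name in activities}
    for source, target, _ in links:
        if not source.startswith("input:"):
            deps[target].add(source)
    results = {}
    for name in TopologicalSorter(deps).static_order():
        kwargs = {}
        for source, target, param in links:
            if target != name:
                continue
            if source.startswith("input:"):
                kwargs[param] = inputs[source[len("input:"):]]
            else:
                kwargs[param] = results[source]
        results[name] = activities[name](**kwargs)
    return results

if __name__ == "__main__":
    print(run(activities, links, {"accession": "DEMO0001"})["report"])
```

The user only edits the activities and links; the execution order is never written down, which is the "what, not how" split described above.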

4
Looks a bit like this
5
myGrid, Taverna and WBS

One of several early adopters of Taverna
Manchester based group working on Williams-Beuren
Syndrome in the medical genetics department
Workflows written by life scientists, not computer
scientists
Following slides stolen at the last minute from
Hannah Tipney at Manchester!

6
Williams-Beuren Syndrome (WBS)

Contiguous sporadic gene deletion disorder
1/20,000 live births, caused by unequal crossover
(homologous recombination) during meiosis
Haploinsufficiency of the region results in the
phenotype
Multisystem phenotype: muscular, nervous,
circulatory systems
Characteristic facial features
Unique cognitive profile
Mental retardation (IQ 40-100, mean 60, normal
mean 100)
Outgoing personality, friendly nature, charming

7
Williams-Beuren Syndrome Microdeletion
[Figure: map of the deleted region; labels in extraction order]
Chr 7 ~155 Mb, ~1.5 Mb, 7q11.23
POM121, C-cen, NOLR1, A-cen, FKBP6, B-cen, FZD9,
C-mid, BAZ1B, BCL7B, TBL2, WBSCR14, WBSCR18,
WBSCR22, STX1A, WBSCR21, CLDN3, CLDN4, ELN, LIMK1,
WBSCR1/EIF4H, WBSCR5/LAB, RFC2, B-mid, CYLN2,
A-mid, GTF2IRD1, B-tel, GTF2I, A-tel, NCF1, C-tel,
GTF2IRD2
Eichler E, Clark R, She X. An Assessment of the
Sequence Gaps: Unfinished Business in a Finished
Human Genome. Nature Reviews Genetics (2004)
5:345-354
Hillier L et al. The DNA Sequence of Human
Chromosome 7. Nature (2003) 424:157-164
8
Experiment
[Figure: workflow diagram; labels grouped by tool]
Input: GenBank Accession No, URL inc GB identifier,
GenBank Entry
Seqret: Nucleotide seq (Fasta)
RepeatMasker: Repetitive elements
ncbiBlastWrapper: Blastn Vs nr, est databases.
BlastWrapper: tblastn Vs nr, est, est_mouse,
est_human databases. Blastp Vs nr
Sort for appropriate Sequences only
Promoter Prediction, TF binding Prediction,
Regulation Element Prediction: Identify regulatory
elements in genomic sequence
GenScan: Coding sequence
restrict: Restriction enzyme map
cpgreport: CpG Island locations and
sixpack: 6 ORFs
transeq: Amino Acid translation
prettyseq: Translation/sequence file. Good for
records and publications
epestfind: Identifies PEST seq
pscan: Identifies FingerPRINTS
pepstats: MW, length, charge, pI, etc
pepcoil: Predicts Coiled-coil regions
SignalP TargetP PSORTII: Predicts cellular location
InterPro: Identifies functional and structural
domains/motifs
Pepwindow? Octanol?: Hydrophobic regions
ORFs
9
Analysis via Cut and Paste
10
Workflows
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence
11
The Biological Results
Four workflow cycles totalling 10 hours
The gap was correctly closed and all known
features identified
WBSCR14
ELN

CTA-315H11
CTB-51J22
12
And Now, Pretty Pictures
The first thing users see
13
Different service types, unified.
BioMoby (orange), Soaplab (wheat), Workflow
(red), SOAP Service (green), SeqHound (blue),
Local Java operation (purple), String constant
(pale blue)
14
Launching a workflow
15
Invocation progress
16
Browsing the results
17
Results in context
18
Integration Epochs

1. Databases / Data warehouses
Integration of data
2. Distributed Queries, Workflows
Integration of process
3. Semantic Unification
Integration of knowledge
Current state of the art is somewhere around 2.5;
what do we need to do next?

19
Last Year's Problems

Multiple data sources
SOA approaches, distributed queries e.g. OGSA-DAI
Heterogeneous computational resources
SOA combined with workflow methods
Toolkits widely used and deployed e.g. Soaplab,
BioMoby et al.
As a community we can provide data and compute
services, and are doing so.

20
Yesterday's Problems

Usability
Distributed computing and biologists go together
like water and mains electricity
Graphical workflow environments now exist e.g.
Taverna, Triana, Discovery-Net, Ptolemy
Can be improved upon but basically usable by the
target audience of expert researchers.

Concept
Workflows, SOA and friends are now accepted as a
legitimate way of doing things
Methods have moved from the 'out there' research
world to just inside the common scientific toolbox

Functionality
Integration of BioMoby, EMBOSS, SOAP services,
command line tools, SeqHound, Web CGIs and others
on demand
Fault tolerance and reporting
Enactment of complex process flows
Some service discovery (crude but surprisingly
effective)
Available and widely used (>2500 downloads of
Taverna from http://taverna.sf.net)

23
Current Work

Service Discovery
Doing it properly: semantic registry technology
Ontologies for services, data etc.
Annotating the corpus of services with metadata
Data management
Putting data in context within the scientific
process
Managing the new bursts of data from workflow
systems

24
So Where's This Confusion Then?

At the moment, invoking a workflow gives results
equivalent to a big set of files
Files are data, what we want is knowledge
Confusion is formed from data and banished by the
conversion of that data into knowledge
This is the problem for Today, Tomorrow and
beyond!
So, what are we going to do about it next?

25
Some Types of knowledge in myGrid and Taverna

Data to Context Knowledge
Which operation produced the data?
Which workflow defined the operation?
When, Where and Who?
Workflow design and enactment!
Data to Data Knowledge
Relate operation inputs and outputs
Base 'derived from' relation in RDF
Can be specialized through templates
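
A rough Python sketch of how these two kinds of knowledge could be recorded together. Plain tuples stand in for RDF triples, and the predicate names (producedBy, definedIn, derivedFrom and so on) and the ProvenanceLog class are assumptions made for illustration, not myGrid's actual vocabulary or API.

```python
from datetime import datetime, timezone

class ProvenanceLog:
    """Stores (subject, predicate, object) tuples as stand-ins for RDF triples."""

    def __init__(self):
        self.triples = set()

    def record(self, workflow, operation, inputs, outputs, who):
        when = datetime.now(timezone.utc).isoformat()
        # Data to Context: which operation, which workflow, when and who.
        self.triples.add((operation, "definedIn", workflow))
        for out in outputs:
            self.triples.add((out, "producedBy", operation))
            self.triples.add((out, "createdAt", when))
            self.triples.add((out, "createdBy", who))
            # Data to Data: relate each output to every input it came from.
            for inp in inputs:
                self.triples.add((out, "derivedFrom", inp))

    def ancestors(self, item):
        """Follow derivedFrom links transitively from one data item."""
        found, stack = set(), [item]
        while stack:
            current = stack.pop()
            for subject, predicate, obj in self.triples:
                if subject == current and predicate == "derivedFrom" and obj not in found:
                    found.add(obj)
                    stack.append(obj)
        return found

if __name__ == "__main__":
    log = ProvenanceLog()
    log.record("wf1", "blast", ["seq1"], ["hits1"], "alice")
    log.record("wf1", "filter", ["hits1"], ["hits2"], "alice")
    print(sorted(log.ancestors("hits2")))
```

A specialised relation (the "templates" idea) could be modelled by adding a more specific predicate next to derivedFrom for a given operation type.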

Context to Context Knowledge
Common information model shared across components
Encapsulates organizations, people, experiment
designs, instances and results.
Equivalent to an overall eScience file system
In Silico eScience Materials and Methods
Expressed in terms of workflow definitions within
Taverna

27
The eScience Knowledge Gap (one of them anyway!)

Hypothesis is missing!
Without some specification of the hypothesis
which the experiment is designed to test we
cannot do much more than the forms of knowledge
stated previously.
Hypothesis as part of the Process Model?
Can we define the hypothesis as the population of
a domain and experiment specific data model in
combination with a set of statements about
instances of this model?
How would this fit in with the current workflow
centric approach we're taking?

28
But Domain Modeling is Hard

Do we need to model the entire domain?
Derive an experiment specific model by either
creating from scratch or aggregating fine grained
Atomic Domain Models
Examples: Sequence Features, GO Term Graph,
Metabolic Pathway, Protein Interaction Set
For example, if the hypothesis is "proteins
annotated with GO term xxx or children by
InterPro scan are implicated in pathway zzz"
Aggregate target domain model consists of the
combination of these Atomic Domain Models.
Hypothesis statement takes the form of a query
over this model's topology which returns the
proportion of proteins in the model satisfying
the hypothesis constraint.
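
One possible reading of that query, sketched in Python. Three tiny dictionaries and sets stand in for the Atomic Domain Models (GO Term Graph, InterPro annotations, pathway membership); all identifiers and data are made up, and the choice of denominator (proteins annotated with the term or its children) is an assumption about what "proportion" means here.

```python
def descendants(go_children, term):
    """Return the term plus everything below it in the GO term graph."""
    seen, stack = {term}, [term]
    while stack:
        for child in go_children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def hypothesis_support(go_children, annotations, pathway_members, term):
    """Proportion of proteins annotated with `term` (or a child term)
    that are also members of the pathway. None if no protein is selected."""
    terms = descendants(go_children, term)
    selected = {p for p, protein_terms in annotations.items() if terms & protein_terms}
    if not selected:
        return None
    return len(selected & pathway_members) / len(selected)

if __name__ == "__main__":
    go_children = {"GO:A": ["GO:B"], "GO:B": ["GO:C"]}          # parent -> children
    annotations = {"P1": {"GO:A"}, "P2": {"GO:C"},              # protein -> GO terms
                   "P3": {"GO:X"}, "P4": {"GO:B"}}
    pathway_members = {"P1", "P2", "P3"}                         # proteins in pathway
    print(hypothesis_support(go_children, annotations, pathway_members, "GO:A"))
```

Each argument corresponds to one atomic model, so the aggregate model is just the combination passed to the query.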

29
Populating the Target Domain Model

Workflows are based on the composition of
distributed services
Can we derive services from the Target Domain
Model? For example, the Sequence Features model
would manifest a setFeature(start, end, sequence,
feature) operation or similar.
Allow the user to incorporate these operations
into the workflow alongside the regular services,
effectively annotating the workflow.
Make use of existing 'Data to Data Knowledge'
and 'Data to Context Knowledge' to link
entities within the Target Domain Model with
derivation information.
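
A small Python sketch of what a model-derived operation might look like. Only the setFeature(start, end, sequence, feature) signature comes from the slide; the class names, the validation rule and the features_on helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Feature:
    start: int
    end: int
    sequence: str   # identifier of the sequence the feature sits on
    feature: str    # feature label, e.g. "CpG island"

@dataclass
class SequenceFeaturesModel:
    """Stand-in for the 'Sequence Features' Atomic Domain Model."""
    features: list = field(default_factory=list)

    def set_feature(self, start, end, sequence, feature):
        # The operation a workflow step would call to populate the model.
        if start < 0 or end < start:
            raise ValueError("expected 0 <= start <= end")
        record = Feature(start, end, sequence, feature)
        self.features.append(record)
        return record

    def features_on(self, sequence):
        return [f for f in self.features if f.sequence == sequence]

if __name__ == "__main__":
    model = SequenceFeaturesModel()
    model.set_feature(10, 50, "seqA", "CpG island")
    print(model.features_on("seqA"))
```

In a workflow, a step like this would sit downstream of a regular analysis service and receive its coordinates as inputs, so running the workflow fills the model as a side effect.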

30
Data Transformed to Knowledge

A workflow invocation would now result in a
populated domain model as opposed to (or in
addition to) a large set of discrete pieces of
data.
Explicit semantics in the Target Domain Model
Drive hypothesis testing
Drive visualization in a graphical UI
Generate textual summary of the knowledge

31
myGrid and WBS People!

Core
Matthew Addis, Nedim Alpdemir, Tim Carver, Rich
Cawley, Neil Davis, Alvaro Fernandes, Justin
Ferris, Robert Gaizauskas, Kevin Glover, Carole
Goble, Chris Greenhalgh, Mark Greenwood, Yikun
Guo, Ananth Krishna, Peter Li, Phillip Lord,
Darren Marvin, Simon Miles, Luc Moreau, Arijit
Mukherjee, Tom Oinn, Juri Papay, Savas
Parastatidis, Norman Paton, Terry Payne, Matthew
Pocock, Milena Radenkovic, Stefan
Rennick-Egglestone, Peter Rice, Martin Senger,
Nick Sharman, Robert Stevens, Victor Tan, Anil
Wipat, Paul Watson and Chris Wroe.
Users
Simon Pearce and Claire Jennings, Institute of
Human Genetics, School of Clinical Medical
Sciences, University of Newcastle, UK
Hannah Tipney, May Tassabehji, Andy Brass, St
Mary's Hospital, Manchester, UK
Postgraduates
Martin Szomszor, Duncan Hull, Jun Zhao, Pinar
Alper, John Dickman, Keith Flanagan, Antoon
Goderis, Tracy Craddock, Alastair Hampshire
Industrial
Dennis Quan, Sean Martin, Michael Niemi, Syd
Chapman (IBM)
Robin McEntire (GSK)
Collaborators
Keith Decker

32
Acknowledgements
myGrid is an EPSRC funded UK eScience Program
Pilot Project
Particular thanks to the other members of the
Taverna project, http://taverna.sf.net
