Taverna and myGrid - PowerPoint PPT Presentation


Slides: 33
Provided by: tomo161

Transcript and Presenter's Notes

Title: Taverna and myGrid


1
Taverna and myGrid
  • A solution for confusion intensive computing?
  • Tom Oinn EMBL-EBI,
  • tmo_at_ebi.ac.uk

http://mygrid.org.uk http://taverna.sf.net
2
Who are we?
  • myGrid
  • An EPSRC funded eScience Pilot Project
  • Based across multiple sites in the UK
  • Taverna
  • A tethered spin-off of the myGrid project
  • Aimed at producing powerful tools to complement
    the basic research work

EBI Hinxton Campus
3
What is Taverna?
  • Allows scientists to graphically construct
    complex processes in the form of workflows
  • What is a workflow?
  • Set of activities that make up a process
  • Definitions about how data moves between these
    activities
  • The user specifies what to do but not how to do
    it
  • Insulates users from the complexity of
    distributed computing
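
As a rough illustration of the definition above (a set of activities plus definitions of how data moves between them), here is a minimal Python sketch. It is not Taverna's workflow format or API. The activity functions, the accession "X00000" and the canned sequence are hypothetical stand-ins for remote services.

```python
from graphlib import TopologicalSorter

# Hypothetical activities: plain functions standing in for remote services.
def fetch_sequence(accession):
    return "ATGGCCTAA"  # canned value, no network call is made

def reverse_complement(sequence):
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(sequence))

def gc_content(sequence):
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# The workflow: each activity names the activity whose output feeds it.
# None means the activity reads the workflow input directly.
workflow = {
    "fetch": (fetch_sequence, None),
    "revcomp": (reverse_complement, "fetch"),
    "gc": (gc_content, "fetch"),
}

def run(workflow, workflow_input):
    """Run each activity after the activity it depends on has produced data."""
    graph = {name: ({src} if src else set()) for name, (_, src) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, src = workflow[name]
        results[name] = func(results[src] if src else workflow_input)
    return results

results = run(workflow, "X00000")
```

The user only states what runs and which output feeds which input. The order of execution is worked out by the `run` function, which is the "what, not how" split described on this slide.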

4
Looks a bit like this
5
myGrid, Taverna and WBS
  • One of several early adopters of Taverna
  • Manchester based group working on Williams-Beuren
    Syndrome in the medical genetics department
  • Workflows written by life scientists, not computer
    scientists
  • Following slides stolen at the last minute from
    Hannah Tipney at Manchester!

6
Williams-Beuren Syndrome (WBS)
  • Contiguous sporadic gene deletion disorder
  • 1/20,000 live births, caused by unequal crossover
    (homologous recombination) during meiosis
  • Haploinsufficiency of the region results in the
    phenotype
  • Multisystem phenotype: muscular, nervous,
    circulatory systems
  • Characteristic facial features
  • Unique cognitive profile
  • Mental retardation (IQ 40-100, mean 60, normal
    mean 100)
  • Outgoing personality, friendly nature, charming

7
Williams-Beuren Syndrome Microdeletion
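[Figure: map of the Williams-Beuren Syndrome microdeletion region. Labels are listed in extraction order, not map order.]
Scale labels: Chr 7 ~155 Mb, ~1.5 Mb, 7q11.23
Block labels: C-cen, A-cen, B-cen, C-mid, B-mid, A-mid, B-tel, A-tel, C-tel
Gene labels: POM121, NOLR1, FKBP6, FZD9, BAZ1B, BCL7B, TBL2, WBSCR14, WBSCR18, WBSCR22, STX1A, WBSCR21, CLDN3, CLDN4, ELN, LIMK1, WBSCR1/EIF4H, WBSCR5/LAB, RFC2, CYLN2, GTF2IRD1, GTF2I, NCF1, GTF2IRD2
References:
Eichler E, Clark R, She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164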
8
Experiment
[Figure: flowchart of the manual analysis, starting from a GenBank accession number. The layout is lost; tool names are paired below with the descriptions that appear to belong to them.]
  • GenBank Accession No, URL inc GB identifier, GenBank Entry
  • Seqret: Nucleotide seq (Fasta)
  • RepeatMasker: Repetitive elements
  • ncbiBlastWrapper: Blastn Vs nr, est databases
  • BlastWrapper: tblastn Vs nr, est, est_mouse, est_human databases; Blastp Vs nr
  • Sort for appropriate sequences only
  • GenScan: Coding sequence
  • sixpack: 6 ORFs
  • transeq: Amino acid translation
  • prettyseq: Translation/sequence file. Good for records and publications
  • restrict: Restriction enzyme map
  • cpgreport: CpG Island locations and [label truncated]
  • Promoter Prediction, TF binding Prediction, Regulation Element Prediction: Identify regulatory elements in genomic sequence
  • epestfind: Identifies PEST seq
  • pscan: Identifies FingerPRINTS
  • pepstats: MW, length, charge, pI, etc
  • pepcoil: Predicts coiled-coil regions
  • Pepwindow? Octanol?: Hydrophobic regions
  • SignalP, TargetP, PSORTII: Predicts cellular location
  • InterPro: Identifies functional and structural domains/motifs
9
Analysis via Cut and Paste
10
Workflows
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence
11
The Biological Results
Four workflow cycles totalling 10 hours.
The gap was correctly closed and all known features
identified.
[Figure labels: WBSCR14, ELN, CTA-315H11, CTB-51J22]
12
And Now Pretty Pictures
The first thing users see
13
Different service types, unified.
BioMoby (orange), Soaplab (wheat), Workflow
(red), SOAP Service (green), SeqHound (blue),
Local Java operation (purple), String constant
(pale blue)
14
Launching a workflow
15
Invocation progress
16
Browsing the results
17
Results in context
18
Integration Epochs
  • 1. Databases / Data warehouses
  • Integration of data
  • 2. Distributed Queries, Workflows
  • Integration of process
  • 3. Semantic Unification
  • Integration of knowledge
  • Current state of the art is somewhere around 2.5;
    what do we need to do next?

19
Last Year's Problems
  • Multiple data sources
  • SOA approaches, distributed queries e.g. OGSA-DAI
  • Heterogeneous computational resources
  • SOA combined with workflow methods
  • Toolkits widely used and deployed e.g. Soaplab,
    BioMoby et al.
  • As a community we can provide data and compute
    services, and are doing so.

20
Yesterday's Problems
  • Usability
  • Distributed computing and biologists go together
    like water and mains electricity
  • Graphical workflow environments now exist e.g.
    Taverna, Triana, Discovery-Net, Ptolemy
  • Can be improved upon but basically usable by the
    target audience of expert researchers.

21
  • Concept
  • Workflows, SOA and friends are now accepted as a
    legitimate way of doing things
  • Methods have moved from the "out there" research
    world to just inside the common scientific toolbox

22
  • Functionality
  • Integration of BioMoby, EMBOSS, SOAP services,
    command line tools, SeqHound, Web CGIs and others
    on demand
  • Fault tolerance and reporting
  • Enactment of complex process flows
  • Some service discovery (crude but surprisingly
    effective)
  • Available and widely used (>2500 downloads of
    Taverna from http://taverna.sf.net)

23
Current Work
  • Service Discovery
  • Doing it properly: semantic registry technology
  • Ontologies for services, data etc.
  • Annotating the corpus of services with metadata
  • Data management
  • Putting data in context within the scientific
    process
  • Managing the new bursts of data from workflow
    systems

24
So Where's This Confusion Then?
  • At the moment, invoking a workflow gives results
    equivalent to a big set of files
  • Files are data, what we want is knowledge
  • Confusion is formed from data and banished by the
    conversion of that data into knowledge
  • This is the problem for Today, Tomorrow and
    beyond!
  • So, what are we going to do about it next?

25
Some Types of knowledge in myGrid and Taverna
  • Data to Context Knowledge
  • Which operation produced the data?
  • Which workflow defined the operation?
  • When, Where and Who?
  • Workflow design and enactment!
  • Data to Data Knowledge
  • Relate operation inputs and outputs
  • Base "derived from" relation in RDF
  • Can be specialized through templates
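
The two kinds of knowledge on this slide can be sketched as subject-predicate-object statements. The Python below is only a sketch and does not reproduce myGrid's actual RDF schema. The predicate names (producedBy, definedIn, createdBy, createdAt, derivedFrom), the identifiers and the user "alice" are all hypothetical.

```python
from datetime import datetime, timezone

triples = []  # (subject, predicate, object) statements

def record(subject, predicate, obj):
    triples.append((subject, predicate, obj))

def run_operation(name, func, input_id, data, workflow_id, user):
    """Run one operation and record where its output came from."""
    output = func(data[input_id])
    output_id = f"data:{len(data)}"
    data[output_id] = output
    # Data to context: which operation, which workflow, who and when.
    record(output_id, "producedBy", name)
    record(name, "definedIn", workflow_id)
    record(output_id, "createdBy", user)
    record(output_id, "createdAt", datetime.now(timezone.utc).isoformat())
    # Data to data: the generic base relation between input and output.
    record(output_id, "derivedFrom", input_id)
    return output_id

def lineage(data_id):
    """Follow derivedFrom links back to the original input."""
    chain = [data_id]
    while True:
        parents = [o for s, p, o in triples
                   if s == chain[-1] and p == "derivedFrom"]
        if not parents:
            return chain
        chain.append(parents[0])

data = {"data:0": "atggcc"}
upper_id = run_operation("toUpper", str.upper, "data:0", data,
                         "workflow:demo", "alice")
length_id = run_operation("length", len, upper_id, data,
                          "workflow:demo", "alice")
```

Calling `lineage(length_id)` walks from the final result back to the workflow input. A specialised relation for a particular operation could be recorded alongside the generic one in the same way.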

26
  • Context to Context Knowledge
  • Common information model shared across components
  • Encapsulates organizations, people, experiment
    designs, instances and results.
  • Equivalent to an overall eScience file system
  • In Silico eScience Materials and Methods
  • Expressed in terms of workflow definitions within
    Taverna

27
The eScience Knowledge Gap (one of them anyway!)
  • Hypothesis is missing!
  • Without some specification of the hypothesis
    which the experiment is designed to test we
    cannot do much more than the forms of knowledge
    stated previously.
  • Hypothesis as part of the Process Model?
  • Can we define the hypothesis as the population of
    a domain and experiment specific data model in
    combination with a set of statements about
    instances of this model?
  • How would this fit in with the current workflow
    centric approach we're taking?

28
But Domain Modeling is Hard
  • Do we need to model the entire domain?
  • Derive an experiment specific model by either
    creating from scratch or aggregating fine grained
    Atomic Domain Models
  • Examples: Sequence Features, GO Term Graph,
    Metabolic Pathway, Protein Interaction Set
  • For example, take the hypothesis "proteins
    annotated with GO term xxx or its children by
    InterPro scan are implicated in pathway zzz"
  • The aggregate target domain model then consists
    of the combination of these Atomic Domain Models.
  • The hypothesis statement takes the form of a
    query over the model topology which returns the
    proportion of proteins in the model satisfying
    the hypothesis constraint.
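
One possible reading of the GO term example, sketched in Python. Three small dictionaries stand in for the Atomic Domain Models, and the hypothesis is a function that returns a proportion. All identifiers and data are made up for illustration, and the choice of denominator (proteins annotated with the term or a child) is an assumption about what "satisfying the hypothesis constraint" means.

```python
# Atomic model 1: GO term graph as child -> parent links (hypothetical IDs).
go_parent = {
    "GO:xxx": None,
    "GO:child1": "GO:xxx",
    "GO:child2": "GO:child1",
    "GO:other": None,
}

# Atomic model 2: protein -> GO terms assigned by an InterPro scan.
annotations = {
    "P1": {"GO:xxx"},
    "P2": {"GO:child2"},
    "P3": {"GO:other"},
    "P4": {"GO:child1"},
}

# Atomic model 3: pathway -> member proteins.
pathways = {"zzz": {"P1", "P2", "P3"}}

def is_term_or_descendant(term, ancestor):
    """Walk up the GO graph from term, looking for ancestor."""
    while term is not None:
        if term == ancestor:
            return True
        term = go_parent.get(term)
    return False

def hypothesis_support(go_term, pathway):
    """Proportion of proteins annotated with go_term (or a child of it)
    that are also members of the pathway."""
    selected = {protein for protein, terms in annotations.items()
                if any(is_term_or_descendant(t, go_term) for t in terms)}
    if not selected:
        return 0.0
    return len(selected & pathways[pathway]) / len(selected)

support = hypothesis_support("GO:xxx", "zzz")
```

Here P1, P2 and P4 carry the term or one of its children, and two of those three are in pathway zzz, so the query returns two thirds.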

29
Populating the Target Domain Model
  • Workflows are based on the composition of
    distributed services
  • Can we derive services from the Target Domain
    Model? For example, the Sequence Features model
    would manifest a setFeature(start, end, sequence,
    feature) operation or similar.
  • Allow the user to incorporate these operations
    into the workflow alongside the regular services,
    effectively annotating the workflow.
  • Make use of existing Data to Data Knowledge
    and Data to Context Knowledge to link entities
    within the Target Domain Model with derivation
    information.
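
A sketch of what a model-derived operation might look like, assuming the setFeature(start, end, sequence, feature) signature suggested above. The class names, the validation rule and the canned CpG service are hypothetical. The slide does not say how such operations would actually be generated from the model.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Feature:
    start: int
    end: int
    sequence: str
    feature: str

@dataclass
class SequenceFeaturesModel:
    features: list = field(default_factory=list)

    def set_feature(self, start, end, sequence, feature):
        """Model-derived operation: store one feature on a sequence."""
        if not 0 <= start < end:
            raise ValueError("expected 0 <= start < end")
        self.features.append(Feature(start, end, sequence, feature))

    def features_on(self, sequence):
        return [f for f in self.features if f.sequence == sequence]

# Stand-in for a regular service that reports CpG islands.
# It returns canned coordinates instead of analysing anything.
def cpg_service(sequence_id):
    return [(10, 60), (200, 260)]

# In the workflow, the service output is wired into the model operation,
# so running the workflow populates the model.
model = SequenceFeaturesModel()
for start, end in cpg_service("seq1"):
    model.set_feature(start, end, "seq1", "CpG island")
```

Placing `set_feature` after a service in the workflow is what the slide calls annotating the workflow: the link states what the service output means in terms of the model.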

30
Data Transformed to Knowledge
  • A workflow invocation would now result in a
    populated domain model as opposed to (or in
    addition to) a large set of discrete pieces of
    data.
  • Explicit semantics in the Target Domain Model
  • Drive hypothesis testing
  • Drive visualization in a graphical UI
  • Generate textual summary of the knowledge

31
myGrid and WBS People!
  • Core
  • Matthew Addis, Nedim Alpdemir, Tim Carver, Rich
    Cawley, Neil Davis, Alvaro Fernandes, Justin
    Ferris, Robert Gaizauskas, Kevin Glover, Carole
    Goble, Chris Greenhalgh, Mark Greenwood, Yikun
    Guo, Ananth Krishna, Peter Li, Phillip Lord,
    Darren Marvin, Simon Miles, Luc Moreau, Arijit
    Mukherjee, Tom Oinn, Juri Papay, Savas
    Parastatidis, Norman Paton, Terry Payne, Matthew
    Pocock, Milena Radenkovic, Stefan
    Rennick-Egglestone, Peter Rice, Martin Senger,
    Nick Sharman, Robert Stevens, Victor Tan, Anil
    Wipat, Paul Watson and Chris Wroe.
  • Users
  • Simon Pearce and Claire Jennings, Institute of
    Human Genetics, School of Clinical Medical
    Sciences, University of Newcastle, UK
  • Hannah Tipney, May Tassabehji, Andy Brass, St
    Mary's Hospital, Manchester, UK
  • Postgraduates
  • Martin Szomszor, Duncan Hull, Jun Zhao, Pinar
    Alper, John Dickman, Keith Flanagan, Antoon
    Goderis, Tracy Craddock, Alastair Hampshire
  • Industrial
  • Dennis Quan, Sean Martin, Michael Niemi, Syd
    Chapman (IBM)
  • Robin McEntire (GSK)
  • Collaborators
  • Keith Decker

32
Acknowledgements
myGrid is an EPSRC funded UK eScience Program
Pilot Project
Particular thanks to the other members of the
Taverna project, http://taverna.sf.net