Title: 10'30 11'00 Introduction to myGrid
1Taverna training day in Sheffield
10.30 - 11.00 Introduction to myGrid 11.00
12.30 tutorial (part 1) 12.30 1.15 Break 1.15
2.30 Tutorial (part2) 2.30 2.40 break 2.40
3.00 sharing my workflows
2An Introduction to Taverna Workflows
- Franck Tanoh
- myGrid
- University of Manchester
3What is myGrid?
- myGrid is a suite components to support in silico
experiments in biology - Taverna workbench myGrid user interface
- Originally designed to support bioinformatics
- Expanded into new areas
- Chemoinformatics
- Health Informatics
- Medical Imaging
- Integrative Biology
- Open source and always will be
4History
EPSRC funded UK eScience Program Pilot Project
5OMII-UK
- University of Manchester (myGrid) joined with the
Universities of Edinburgh (OGSA-DAI) and
Southampton (OMII phase 1) in March 2006 - OMII-UK aims to provide software and support to
enable a sustained future for the UK e-Science
community and its international collaborators. - A guarantee of development and support
6The Life Science Community
- In silico Biology is an open Community
- Open access to data
- Open access to resources
- Open access to tools
- Open access to applications
- Global in silico biological research
7The Community Problems
- Everything is Distributed
- Data, Resources and Scientists
- Heterogeneous data
- Very few standards
- I/O formats, data representation, annotation
- Everything is a string!
- Integration of data and interoperability of
resources is difficult
8 Lots of Resources
NAR 2007 968 databases
9Traditional Bioinformatics
10Cutting and Pasting
- Advantages
- Low Technology on both server and client side
- Very Robust Hard to break.
- Data Integration happens along the way
- Disadvantages
- Time Consuming (and painful!)
- Can be repeated rarely
- Limited to small data sets.
- Error Prone
- Poor repeatability
- How do you do this for a genome/proteome/metabolom
e of information!
11Pipeline Programming
- Advantages
- Repeatable
- Allows automation
- Quick, reliable, efficient
- Disadvantages
- Requires programming skills
- Difficult to modify
- Requires local tool and database installation
- Requires tool and database maintenance!!!
12What we want as a solution
- A system that is
- Allows automation
- Allows easy repetition, verification and sharing
of experiments - Works on distributed resource
- Requires few programming skills
- Runs on a local desktop / laptop
13myGrid as a solution
- myGrid allows the automated orchestration of in
silico experiments over distributed resources
from the scientists desktop - Built on computer science technologies of
- Web services
- Workflows
- Semantic web technologies
14Web Services
- Web services support machine-to-machine
interaction over a network. - Web services are a
- technology and standard for exposing code /
databases with an API that can be consumed by a
third party remotely. - describes how to interact with it.
- They are
- Self-contained
- Self-describing
- Modular
- Platform independent
15Workflows
- General technique for describing and enacting a
process - Describes what you want to do, not how you want
to do it - High level description of the experiment
16Workflows
- Workflow language specifies how bioinformatics
processes fit together. - High level workflow diagram separated from any
lower level coding you dont have to be a coder
to build workflows. - Workflow is a kind of script or protocol that you
configure when you run it. - Easier to explain, share, relocate, reuse and
repurpose. - Workflow ltgt Model
- Workflow is the integrator of knowledge
- The METHODS section of a scientific publication
17Workflow Advantages
- Automation
- Capturing processes in an explicit manner
- Tedium! Computers dont get bored/distracted/hungr
y/impatient! - Saves repeated time and effort
- Modification, maintenance, substitution and
personalisation - Easy to share, explain, relocate, reuse and build
- Releases Scientists/Bioinformaticians to do other
work - Record
- Provenance what the data is like, where it came
from, its quality - Management of data (LSID - Life Science
Identifiers)
18Different Workflow Systems
- Kepler
- Triana
- DiscoveryNet
- Taverna
- Geodise
- Pegasus
- Pipeline Pilot
- Each has differences in action, language, access
restrictions, subject areas
19 Taverna Workflow Components
Scufl Simple Conceptual Unified Flow
Language Taverna Writing, running workflows
examining results SOAPLAB Makes applications
available
20 An Open World
- Open domain services and resources.
- Taverna accesses 3000 services
- Third party we dont own them we didnt build
them - All the major providers
- NCBI, DDBJ, EBI
- Enforce NO common data model.
- Quality Web Services considered desirable
21Services Landscape
SoapLab
BioMoby
Standard soap services
Taverna web services
API consumer
Biomart
Local Java
Beanshell
Nested workflow
22Shield the Scientist Bury the Complexity
Workflow enactor
23What can you do with myGrid?
- gt40000 downloads
- Users worldwide
- US, Singapore, UK, Europe, Australia
- Systems biology
- Proteomics
- Gene/protein annotation
- Microarray data analysis
- Medical image analysis
- Heart simulations
- High throughput screening
- Genotype/Phenotype studies
- Health Informatics
- Astronomy
- Chemoinformatics
- Data integration
24Trypanosomiasis in Africa
Steve Kemp
Andy Brass
Paul Fisher
http//www.genomics.liv.ac.uk/tryps/trypsindex.htm
l
25Trypanosomiasis Study
- A form of Sleeping sickness in cattle Known as
ngana - Caused by Trypanosoma brucei
- Can we breed cattle resistant to ngana
infection? - What are the causes of the differences between
resistant and susceptible strains?
26Trypanosomiasis Study
- Understanding Phenotype
- Comparing resistant vs susceptible strains
Microarrays - Understanding Genotype
- Mapping quantitative traits Classical genetics
QTL
Need to access microarray data, genomic sequence
information, pathway databases AND integrate the
results
27Key A Retrieve genes in QTL region B
Annotate genes with external database Ids C
Cross-reference Ids with KEGG gene ids D
Retrieve microarray data from MaxD database E
For each KEGG gene get the pathways its involved
in F For each pathway get a description of what
it does G For each KEGG gene get a description
of what it does
28Results
- Identified a pathway for which its correlating
gene (Daxx) is believed to play a role in
trypanosomiasis resistance. - Manual analysis on the microarray and QTL data
had failed to identify this gene as a candidate.
29Why was the Workflow Approach Successful?
- Workflow analysed each piece of data
systematically - Eliminated user bias and premature filtering of
datasets and results leading to single sided,
expert-driven hypotheses - The size of the QTL and amount of the microarray
data made a manual approach impractical - Workflows capture exactly where data came from
and how it was analysed - Workflow output produced a manageable amount of
data for the biologists to interpret and verify - make sense of this data -gt does this make
sense?
30Trichuris muris (mouse whipworm) infection
parasite model of the human parasite - Trichuris
trichuria)
- Identified the biological pathways involved in
sex dependence in the mouse model, previously
believed to be involved in the ability of mice to
expel the parasite. - Manual experimentation Two year study of
candidate genes, processes unidentified - Workflows trypanosomiasis cattle experiment was
reused without change. - Analysis of the resulting data by a biologist
found the processes in a couple of days.
Joanne Pennock, Richard Grencis University of
manchester
31Workflow Reuse Workflows are Scientific
Protocols Share them!
32A workflow marketplace
33A Practical Guide to Building and Managing in
silico Experiments
34Semantic Web Technologies
- myGrid built on Web Services, Workflows AND
semantic web technologies - Semantic web technologies are used to
- Find appropriate services during workflow design
- Find similar workflows for reuse and repurposing
- Record the process and outcome of an experiment,
in context - -gtgtgtgt the experimental provenance
35Finding Services
- There are over 3000 distributed services. How do
we find an appropriate one? - Find services by their function instead of their
name - We need to annotate services by their functions.
- The services might be distributed, but a registry
of service descriptions can be central and
queried -
36Feta Semantic Discovery
- Feta is the myGrid component that can query the
service annotations and find services - Questions we can ask
- Find me all the services that perform a multiple
sequence alignment And accept protein sequences
in FASTA format as input
37myGrid Ontology
Specialises
38Annotations
- Feta has been available for over a year
- Only just been included in the release
- Need critical mass of service annotations before
release - By demonstrating the use of service annotation,
we aim to encourage service providers to provide
the annotations in the future - Annotation experiments with users and domain
experts - Domain expert annotations much better
- We now have a domain expert for full-time service
annotation
39Data Management
- Workflows can generate vast amount of data - how
can we manage and track it? - We need to manage
- data AND
- metadata AND
- experiment provenance
- Workflow experiments may consist of many
workflows of the same, or different experiments. - Scientists need to check back over past results,
compare workflow runs and share workflow runs
with colleagues
40(No Transcript)
41Provenance the myGrid logbook
Smart Tea
- Who, What, Where, When, Why?, How?
- Context
- Interpretation
- Logging Debugging
- Reproducibility and repeatability
- Evidence Audit
- Non-repudiation
- Credit and Attribution
- Credibility
- Accurate reuse and interpretation
- Just good scientific practice
BioMOBY
42Conclusions
- Web services and workflows are powerful
technologies for in silico science - automation
- high throughput experiments
- systematic analysis
- Interoperability of distributed resources
43Contact Us
- Taverna development is user-driven
- Please tell us what you would like to see via the
mailing lists - Taverna-Users and Taverna-Hackers
- Download software and find out more at
- http//www.mygrid.org.uk
- http//taverna.sourceforge.net
44myGrid acknowledgements
- Carole Goble, Norman Paton, Robert Stevens, Anil
Wipat, David De Roure, Steve Pettifer - OMII-UK Tom Oinn, Katy Wolstencroft, Daniele
Turi, June Finch, Stuart Owen, David Withers,
Stian Soiland, Franck Tanoh, Matthew Gamble, Alan
Williams, Ian Dunlop - Research Martin Szomszor, Duncan Hull, Jun Zhao,
Pinar Alper, Antoon Goderis, Alastair Hampshire,
Qiuwei Yu, Wang Kaixuan. - Current contributors Matthew Pocock, James Marsh,
Khalid Belhajjame, PsyGrid project, Bergen
people, EMBRACE people. - User Advocates and their bosses Simon Pearce,
Claire Jennings, Hannah Tipney, May Tassabehji,
Andy Brass, Paul Fisher, Peter Li, Simon Hubbard,
Tracy Craddock, Doug Kell, Marco Roos, Matthew
Pocock, Mark Wilkinson - Past Contributors Matthew Addis, Nedim Alpdemir,
Tim Carver, Rich Cawley, Neil Davis, Alvaro
Fernandes, Justin Ferris, Robert Gaizaukaus,
Kevin Glover, Chris Greenhalgh, Mark Greenwood,
Yikun Guo, Ananth Krishna, Phillip Lord, Darren
Marvin, Simon Miles, Luc Moreau, Arijit
Mukherjee, Juri Papay, Savas Parastatidis, Milena
Radenkovic, Stefan Rennick-Egglestone, Peter
Rice, Martin Senger, Nick Sharman, Victor Tan,
Paul Watson, and Chris Wroe. - Industrial Dennis Quan, Sean Martin, Michael
Niemi (IBM), Chimatica. - Funding EPSRC, Wellcome Trust.