Title: Spreading Semantics Over Biology
1Spreading Semantics Over Biology
- Phillip Lord
- Newcastle University
2Overview
- Conclusions
- Data Integration in ComparaGRID
- Annotation in CARMEN and CISBAN
- Computing with Semantics
- The future
3Conclusions
- Thin Semantics is Good
- More Semantics is Better
- Shared Semantics is Wonderful
4Key Problems
- Scalability
- Both in technology and processes
- Usability
- Autonomy
5What is the most widely abused word in life
sciences?
FT   SIGNAL        1     22
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD    181    181       N-LINKED (GLCNAC...) (PROBABLE).
FT   DISULFID    179    214       BY SIMILARITY.
FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-Q.
FT   REPEAT       51     59       1.
FT   REPEAT       60     67       2.
FT   REPEAT       68     75       3.
FT   REPEAT       76     83       4.
FT   REPEAT       84     91       5.
FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
FT                                CODON 178 PATIENTS WITH MET DEVELOP FFI,
FT                                THOSE WITH VAL DEVELOP CJD).
FT                                /FTId=VAR_006467.
FT   VARIANT     171    171       N -> S (IN SCHIZOAFFECTIVE DISORDER).
FT                                /FTId=VAR_006468.
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT                                /FTId=VAR_006469.
FT   VARIANT     180    180       V -> I (IN CJD).
FT                                /FTId=VAR_006470.
FT   VARIANT     183    183       T -> A (IN FAMILIAL SPONGIFORM
FT                                ENCEPHALOPATHY).
FT                                /FTId=VAR_006471.
FT   VARIANT     187    187       H -> R (IN GSS).
FT                                /FTId=VAR_008746.
FT   VARIANT     188    188       T -> K (IN EOAD DEMENTIA ASSOCIATED TO
FT                                PRION DISEASES).
FT                                /FTId=VAR_008748.
FT   VARIANT     188    188       T -> R.
FT                                /FTId=VAR_008747.
FT   VARIANT     196    196       E -> K (IN CJD).
FT                                /FTId=VAR_008749.
FT                                /FTId=VAR_006472.
SQ   SEQUENCE   253 AA;  27661 MW;  43DB596BAAA66484 CRC64;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG
//
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE
CC       CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE
CC       IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN
CC       CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY
CC       (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE
CC       THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,
CC       EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
DR   EMBL; M13667; AAA19664.1; -.
DR   EMBL; M13899; AAA60182.1; -.
DR   EMBL; D00015; BAA00011.1; -.
DR   PIR; A05017; A05017.
DR   PIR; A24173; A24173.
DR   PIR; S14078; S14078.
DR   PDB; 1E1G; 20-JUL-00.
DR   PDB; 1E1J; 20-JUL-00.
DR   PDB; 1E1P; 20-JUL-00.
DR   PDB; 1E1S; 21-JUL-00.
DR   PDB; 1E1U; 20-JUL-00.
DR   PDB; 1E1W; 20-JUL-00.
DR   MIM; 176640; -.
DR   MIM; 123400; -.
DR   MIM; 137440; -.
DR   MIM; 245300; -.
DR   MIM; 600072; -.
DR   MIM; 604920; -.
DR   InterPro; IPR000817; Prion.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
DR   SMART; SM00157; PRP; 1.
DR   PROSITE; PS00291; PRION_1; 1.
DR   PROSITE; PS00706; PRION_2; 1.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; 3D-structure;
KW   Polymorphism; Disease mutation.
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (Rel. 03, Created)
DT   01-NOV-1986 (Rel. 03, Last sequence update)
DT   20-AUG-2001 (Rel. 40, Last annotation update)
DE   Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (ASCR).
GN   PRNP.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H.,
RA   Prusiner S.B., Dearmond S.J.;
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE=86261778; PubMed=3014653;
RA   Liao Y.-C.J., Lebo R.V., Clawson G.A., Smuckler E.A.;
RT   "Human prion protein cDNA: molecular cloning, chromosomal mapping,
RT   and biological implications.";
RL   Science 233:364-367(1986).
RN   [3]
RP   SEQUENCE OF 58-85 AND 111-150 (VARIANT AMYLOID GSS).
RX   MEDLINE=91160504; PubMed=1672107;
RA   Tagliavini F., Prelli F., Ghiso J., Bugiani O., Serban D.,
RA   Prusiner S.B., Farlow M.R., Ghetti B., Frangione B.;
RT   "Amyloid protein of Gerstmann-Straussler-Scheinker disease (Indiana
RT   kindred) is an 11 kd fragment of prion protein with an N-terminal
RT   glycine at codon 58.";
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   STRUCTURE BY NMR OF 118-221.
RX   MEDLINE=20359708; PubMed=10900000;
RA   Calzolai L., Lysek D.A., Guntert P., von Schroetter C., Riek R.,
RA   Zahn R., Wuethrich K.;
RT   "NMR structures of three single-residue variants of the human prion
RT   protein.";
RL   Proc. Natl. Acad. Sci. U.S.A. 97:8340-8345(2000).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- POLYMORPHISM: THE FIVE TANDEM OCTAPEPTIDE REPEATS REGION IS HIGHLY
CC       UNSTABLE. INSERTIONS OR DELETIONS OF OCTAPEPTIDE REPEAT UNITS ARE
CC       ASSOCIATED TO PRION DISEASE.
6Methods for Data Integration
- Combining data from multiple, autonomous data
sources.
- TAMBIS
- ontology driven mediation of querying
- EcoCyc
- ontology driven schema for warehousing
- BioPAX
- ontology defined interchange format.
- More recently, ComparaGRID
7The History of Ontologies
- Annotation
- Gene Ontology
- Function, Process, Component
- Systems Biology Ontology
- Describes terms in computational models.
8ComparaGRID
- 6 Investigators
- 5 Researchers
Roslin
Newcastle
Manchester
Cambridge
John Innes NCYC
9ComparaGRID's Problem Domain
10Many Model Organism Databases
11Data Models, Model Data
12Databases and Knowledge
domain ontology
database
Sequence
Molecule
Representation
DNA
SequenceRepresentation
SequenceRecord
seqString
length
id
S_hasID
S_hasSeqStr
S_hasLength
13The Fluxion Stack
Raw data
Aggregation
Semantics
Syntax
Raw data
JDBC
OWL
integrator
OWL
Pub service
Trans service
data
query
14The difficulties
- The Cost of Integration
- building ontologies is often hard
- The Cost of Managing Change
- biological knowledge tends to undergo a lot of
flux
- The Scalability of Expressive Ontologies.
15Getting the Semantics Upfront
- Instead of annotating heterogeneous data sources
after the event, why not do so upfront?
- Originators of the data are likely to understand
it best.
- Spreads the cost among those contributing.
16CARMEN: Code, Analysis, Repository and Modelling
for e-Neuroscience
www.carmen.org.uk
17Consortium Profile
- 4M over 4 years
- 20 Investigators
Stirling
St. Andrews
Newcastle
York
Manchester
Sheffield
Leicester
Cambridge
Warwick
Imperial
Plymouth
- Commenced 1st October 2006
18Virtual Laboratory for Neurophysiology
- Enabling sharing and collaborative exploitation
of data, analysis code and expertise that are not
physically collocated
19The need for clear metadata
- Most neuroscience data is relatively simple in
structure
- But often contextually complex
- Sometimes associated with behavioural features
20Requirements for CARMEN ontology, so far
- Subject description
- Experimental Process
- Experimental Data
- Statistical analysis
- Services
- Derived Data
21How do we represent
In silico Analysis
Derived data
Laboratory Experiments
22Functional Genomics Experiment
(FuGE)
- Model of common components in scientific
investigations, such as materials, data,
protocols, equipment and software.
- Provides a framework for capturing complete
laboratory workflows, enabling the integration of
pre-existing data formats.
23Re-use
Brain anatomy: BIRNLex, FMA
Taxonomy: NCBI Taxonomy
CARMEN
Sample preparation: sepCV
24What we need lab based
Age/stage development
Subject preparation
CARMEN
Subject training
Experiment process
Subject stimulus
Equipment
Subject task
25What we need In silico
Data structures
File formats
CARMEN
Algorithms
Statistics
Software
26Align with OBI
Ontology for Biomedical Investigations
- Aims to provide an ontology for the life sciences
- Consortium of 15 communities from crop science to
neuroscience
- CARMEN will align and contribute to OBI
27The Difficulties
- Even with a lot of pre-existing work there is a
lot to describe
- OBI has 15 communities involved in it
Bio-Imaging
- Jeff Grethe; Biomedical Informatics Research Network (BIRN) Coordinating Center; University of California, San Diego; Winter 2007
- Daniel Rubin; Radiological Society of North America (RSNA); National Center for Biomedical Ontology at Stanford Medical Informatics and the Department of Radiology, Stanford University; Winter 2007
- Bill Bug; Biomedical Informatics Research Network (BIRN); Laboratory of Bioimaging and Anatomical Informatics, in the Department of Neurobiology and Anatomy, Drexel University College of Medicine; Spring 2006
Cellular Assays
- Stefan Wiemann; DKFZ
Clinical Investigations
- Jennifer Fostel; Clinical Trial Ontology; NIEHS, National Institute for Environmental Health Sciences; Spring 2004
- Tina Hernandez-Boussard; Department of Genetics, Stanford Medical School; Fall 2007
Crop Sciences
- Richard Bruskiewich; Generation Challenge Programme; IRRI
Electrophysiology
- Frank Gibson; CARMEN; School of Computing Science, Newcastle University; Spring 2007
Environmental Omics
- Norman Morrison; NERC Environmental Bioinformatic Centre and School of Computer Science, The University of Manchester; Spring 2004
Flow Cytometry
- Ryan Brinkman; ISAC and FICCS; British Columbia Cancer Research Center and University of British Columbia in the Department of Medical Genetics, Vancouver, BC, Canada; Spring 2004
Genomics/Metagenomics
- Dawn Field; Genome Catalogue; NERC Centre for Ecology and Hydrology; Winter 2005
- Tanya Gray; Winter 2005
Immunology
- Richard Scheuermann; ImmPort, FICCS, BioHealthBase; University of Texas Southwestern Medical Center, in Department of Pathology and Division of Biomedical Informatics; Spring 2006
- Bjoern Peters; Immune Epitope Database and Analysis Resource; La Jolla Institute for Allergy and Immunology; Spring 2006
In Situ Hybridization and Immunohistochemistry
- Eric Deutsch; MISFISHIE
Metabolomics
- Susanna Sansone; MSI; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004
- Daniel Schober; Spring 2006
Neuroinformatics
- Bill Bug; Biomedical Informatics Research Network (BIRN); Laboratory of Bioimaging and Anatomical Informatics, in the Department of Neurobiology and Anatomy, Drexel University College of Medicine; Spring 2006
- Frank Gibson; CARMEN; School of Computing Science, Newcastle University; Spring 2007
Nutrigenomics
- Philippe Rocca-Serra; RSBI; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004
Polymorphism
- Tina Hernandez-Boussard; PharmGKB; Department of Genetics, Stanford Medical School; Winter 2006; Fall 2007
Proteomics
- Susanna Sansone; PSI; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004
- Daniel Schober; Spring 2006
- Luisa Montecchi; The European Bioinformatics Institute EBI-EMBL; Spring 2006
- Chris Taylor
- Trish Whetzel; Spring 2004
- Frank Gibson; School of Computing Science, Newcastle University; Spring 2007
Toxicogenomics
- Jennifer Fostel; Toxicogenomics; NIEHS, National Institute for Environmental Health Sciences; Spring 2004
- Susanna Sansone; RSBI; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004
Transcriptomics
- Susanna Sansone; MGED; The European Bioinformatics Institute EBI-EMBL, NET Project; Spring 2004
- Philippe Rocca-Serra; Spring 2004
- Trish Whetzel; Spring 2004
- Chris Stoeckert; Department of Genetics and Center for Bioinformatics, University of Pennsylvania; Spring 2004
- Gilberto Fragoso; NCI Center for Bioinformatics; Spring 2004
- Joe White
- Helen Parkinson; The European Bioinformatics Institute EBI-EMBL; Spring 2004
- Mervi Heiskanen
- Liju Fan; Ontology Workshop, LLC, Columbia, MD, USA; Spring 2004
- Helen Causton; Imperial College; Spring 2004
28Information Extraction
- More semantics is better?
- How do we extract the information?
http://en.wikipedia.org/wiki/Image:Brain_090407.jpg
29Centre for Integrated Systems Biology of Ageing
and Nutrition (CISBAN)
30Identification of novel interactions between
nutrition and damage using automated yeast
screening and analysis
Screen mutants for sensitivity to damage/nutrition
Robot
Robot
- Data curation.
- Functional analysis.
- Interactions with in silico
programme.
Reference set of 5,000 mutant strains
31CISBAN dataflow
32Data Entry with SYMBA
http://symba.sourceforge.net/
33Data Entry with SYMBA
34CARMEN and CISBAN
- We can provide more semantics upfront
- This should make data more explicit
- If we still need to integrate it should be
easier.
- Like much of biology, these projects are largely
using structurally simple, non-SW based
technologies.
- This is a lot of effort to go to; what do we hope
to gain?
35Yeast Hub
YeastHub: a semantic web use case for integrating
data in the life sciences domain. Kei-Hoi Cheung,
Kevin Y. Yip, Andrew Smith, Remko deKnikker, Andy
Masiar and Mark Gerstein.
doi:10.1093/bioinformatics/bti1026
36A rapturous reception
- So the general idea is take a bunch of data,
convert it to RDF, dump it into a RDF triple
store to discover interesting things?
- http://www.nodalpoint.org/user/greg
- Putting a lot of RDF in a bucket isn't
integration. Not unless the RDF is the same
schema and using the same concepts
- Carole Goble, University of Manchester
37A thin layer of semantics.
- Inverse Document Frequency is a method for
weighting words when classifying documents: rare
words carry more information than common ones.
- In this case, YeastHub has a common semantics
describing the type of document.
- "protein" or "sequence" occurs a lot in UniProt,
but less in the bulk corpus
- Rather than treating all documents equally, they
use IDF twice.
- Leveraging Biological Identifier Relationships
and Related Documents to Enhance Information
Retrieval for Proteomics -- Smith et al.,
Bioinformatics, doi:10.1093/bioinformatics/btm452
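The IDF idea on this slide can be sketched in a few lines. This is a minimal illustration, not the method of Smith et al.: the toy corpus, the document types and the function name `idf` are all hypothetical, and the reading of "use IDF twice" as "once over the bulk corpus, once within a document type" is an assumption.

```python
import math
from collections import Counter

def idf(term_docs):
    """Return log(N / df) per term, where df counts the documents containing it."""
    n_docs = len(term_docs)
    df = Counter()
    for doc in term_docs:
        df.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Hypothetical toy corpus: each document has a type and a bag of words.
corpus = [
    ("uniprot", ["protein", "sequence", "prion"]),
    ("uniprot", ["protein", "sequence", "phosphatase"]),
    ("abstract", ["protein", "yeast", "ageing"]),
    ("abstract", ["yeast", "screen", "mutant"]),
]

# IDF over the bulk corpus, ignoring the document type.
bulk_idf = idf([words for _, words in corpus])

# IDF computed again within one document type (assumed reading of "twice").
uniprot_idf = idf([words for doc_type, words in corpus if doc_type == "uniprot"])
```

Within the UniProt-typed documents "protein" carries no weight, because it appears in every record, while it keeps some weight in the bulk corpus. Knowing the document type is the thin layer of semantics that makes the second calculation possible.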
38Thin Semantics
- The semantics of YeastHub is not deep.
- But even a thin layer of semantics is useful.
- If we modify our technologies to use it.
- A large part of library sciences has been encoded
in 15 tags: Dublin Core
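To give a feel for how thin this layer is, the sketch below lists the 15 elements of the Dublin Core Metadata Element Set and checks a record against them. The record reuses the YeastHub citation from the earlier slide; the helper name `unknown_elements` and the dictionary layout are illustrative assumptions, not part of any Dublin Core tooling.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})

def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - DUBLIN_CORE_ELEMENTS)

# Illustrative record; every element is optional and may repeat.
record = {
    "title": "YeastHub: a semantic web use case for integrating "
             "data in the life sciences domain",
    "creator": ["Kei-Hoi Cheung", "Kevin Y. Yip", "Andrew Smith"],
    "identifier": "doi:10.1093/bioinformatics/bti1026",
}
```

A key such as "organism" would be reported by `unknown_elements`, which is the point: the vocabulary is small and shared, so generic tools can rely on it.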
39Using Ontology to Classify Members of a Protein
Family
- Katy Wolstencroft (Bioinformatics)
- Daniele Turi (Instance Store)
- Phil Lord (myGrid)
- Lydia Tabernero (Protein Scientist)
- Matt Horridge, Nick Drummond et al (Protégé OWL)
- Andy Brass and Robert Stevens (Bioinformatics)
40The Protein Phosphatases
- A large superfamily of proteins
- Motifs determine a protein's place within the
family
- Recognising that motifs imply class membership is
normally manual
- Can these be captured in an ontology?
41Phosphatase Functional Domains
Andersen et al (2001) Mol. Cell. Biol. 21 7117-36
42Definition of Tyrosine Phosphatase
- Class: TyrosineReceptorProteinPhosphatase
- EquivalentTo: Protein That
- (contains atLeast-1 ProteinTyrosinePhosphataseDomain and
- contains 1 TransmembraneDomain)
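The definition above can be read as a test on a protein's domain composition. The sketch below expresses that test as a plain Python predicate. It is not an OWL reasoner and it assumes a closed world (the domain list is taken to be complete), which OWL does not; the function name and the example domain lists are hypothetical.

```python
from collections import Counter

def is_tyrosine_receptor_protein_phosphatase(domains):
    """Apply the slide's definition to a list of domain names.

    Requires at least one ProteinTyrosinePhosphataseDomain and
    exactly one TransmembraneDomain.
    """
    counts = Counter(domains)
    return (counts["ProteinTyrosinePhosphataseDomain"] >= 1
            and counts["TransmembraneDomain"] == 1)

# Hypothetical domain compositions, for illustration only.
receptor_like = [
    "TransmembraneDomain",
    "ProteinTyrosinePhosphataseDomain",
    "ProteinTyrosinePhosphataseDomain",
]
cytoplasmic = ["ProteinTyrosinePhosphataseDomain"]
```

Here `receptor_like` satisfies the definition and `cytoplasmic` does not, because it has no transmembrane domain. With an ontology the same definition is stated once as data and a generic reasoner does the checking, rather than one hand-written function per class.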
43Classifying Proteins
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHV
SAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNP
GTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYI
AIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV..
InterPro
Codify
Translate
Instance Store
Reasoner
44Results
- Human phosphatases have been classified using the
system
- The ontology system refined classification
- - DUSC contains zinc finger domain
- characterised and conserved but not in
classification
- - DUSA contains a disintegrin domain
- previously uncharacterised, evolutionarily
conserved
- We have automated a part of the scientific
process
- We have defined our domain model in a
computational form
- We have collected some data
- We have let the reasoner test whether the model
fits the data
- The semantics here are deeper than with YeastHub,
which allows us to reason
45OGRE
- Like ComparaGRID, deals with comparing genomes
from different species.
- In this case concerned with the automated
pairwise comparisons of bacterial genomes.
- Genome change and evolution is often key to
pathogenesis and resistance in bacteria.
46OGRE
- Unlike phosphatase, the raw analysis results are
probabilistic.
- Classification in the same way is therefore not
possible.
47OGRE Analysis Architecture
48OGRE
- It is possible to combine statistical and logical
reasoning
- Doing so architecturally allows reasoning that is
hard with either alone.
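One way to picture the combination is a two-stage pipeline: a statistical step that turns probabilistic comparison results into accepted facts, followed by a logical step that reasons over those facts. The slides do not describe OGRE's internals, so everything below (the threshold, the score table, the gene names and both function names) is a hypothetical sketch rather than OGRE's design.

```python
def confident_facts(scores, threshold=0.95):
    """Statistical step: keep pairwise matches whose probability clears the threshold."""
    return {pair for pair, p in scores.items() if p >= threshold}

def genes_without_match(genes, facts):
    """Logical step: a gene has no counterpart if no accepted fact mentions it."""
    matched = {gene for gene, _ in facts}
    return sorted(set(genes) - matched)

# Hypothetical probabilities that a gene in genome 1 matches a gene in genome 2.
scores = {
    ("geneA", "geneA2"): 0.99,
    ("geneB", "geneB2"): 0.40,
}

facts = confident_facts(scores)
unmatched = genes_without_match(["geneA", "geneB", "geneC"], facts)
```

In this toy run `geneB` and `geneC` end up without a counterpart: one because its match was not statistically convincing, the other because no match was proposed at all. Neither step on its own would give that answer.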
49Summary
- Ontologies have been used in life sciences for
data integration
- Increasingly, they are being used to describe the
data early in the scientific process
- Even thin semantics can be exploited for
information retrieval
- Richer semantics allows more use of computational
inference
50Richer Expressivity
- There are applications of more expressive
semantics
- Can we move from specific software to generic
software with specific knowledge models?
- But scalability and usability remain the
bottleneck
51Industrialisation
- Semantics in the life sciences is moving from
small to large scale
- building ontologies has now become very committee
driven
- we don't understand ontology engineering as we do
software engineering
- Encapsulation, modularisation, continuous
integration.
52Future
- ComparaGRID has semantics describing schema which
means data integration can happen on-the-fly.
- Death to data warehouses!
- CARMEN and CISBAN are gathering semantically
enriched data in the first place. An End to
Integration!
- Semantics during dissemination
- Knowledge for All.
53The Future
- An End to Integration, Death to Warehouses
- Semantics during dissemination
- Knowledge for All
54Acknowledgements
- The ComparaGRID consortium is Madhuchhanda
Bhattacharjee, Richard Boys, Tony Burdett, Rob
Davey, Jo Dicks, David Marshall, Andy Law,
Phillip Lord, Trevor Paterson, Matthew Pocock,
Peter Rice, Ian Roberts, Robert Steven, Paul
Watson, Darren Wilkinson and Neil Wipat, Andy
Gibson
- CISBAN is Tom Kirkwood (PI), Thomas von Zglinicki
(PI), David Lydall (PI), Anil Wipat (PI), Stephen
Addinall (Research Associate), Suzanne Advani
(Technician), Kim Clugston (Research Associate),
Sharon Denley (PA to Professor Tom Kirkwood),
Amanda Greenall (Research Associate), Jennifer
Hallinan (Research Associate), Dominic Kurian
(Research Associate), Conor Lawless (Research
Associate), Guiyuan Lei (Research Associate),
Allyson Lister (Research Associate), Mandy
Maddick (Research Associate), Satomi Miwa
(Research Associate), Glyn Nelson (Research
Associate), Bob Nicholson (Superintendent),
Sharon Oljslagers (Technician), Joao Passos
(Research Associate), Carole Proctor (Research
Associate), Daryl Shanley (Research Associate),
Oliver Shaw (Research Associate), Donna Stark
(Research Secretary), Laura Steedman
(Technician), Joyce Wang (Technician), Darren
Wilkinson (Professor of Stochastic Modelling)
55CARMEN Acknowledgements
Professor Colin Ingram, Professor Jim Austin,
Professor Leslie Smith, Professor Paul Watson, Dr.
Stuart Baker, Professor Roman Borisyuk, Dr.
Stephen Eglen, Professor Jianfeng Feng, Dr. Kevin
Gurney, Dr. Tom Jackson, Dr. Marcus Kaiser, Dr.
Phillip Lord, Dr. Paul Overton, Dr. Stefano
Panzeri, Dr. Rodrigio Quian Quiroga, Dr. Simon
Schultz, Dr. Evelyne Sernagor, Dr. V. Anne Smith,
Dr. Tom Smulders, Professor Miles Whittington,
Christoph Echtermeyer, Martyn Fletcher, Frank
Gibson, Mark Jessop, Dr. Bojian Liang, Juan
Martinez-Gomez, Dr. Chris Mountford, Agah
Ogungboye, Georgios Pitsilis, Dr. Daniel Swan
University of St Andrews
56Holiday Pictures
57Questions?