Title: Interactions and Ontologies
1 Interactions and Ontologies
CBW Bioinformatics Workshop, February 24th 2005, Vancouver
Christopher Hogue, The Blueprint Initiative
2 The Blueprint Initiative
- Develop, curate and maintain the Biomolecular Interaction Network Database (BIND) and related tools
- Headquarters in Toronto
- Blueprint Asia Ltd. Pte. in Singapore
- Public good research program funded by the Canadian and Singaporean governments
- Data and software are freely available
- like GenBank
3 blueprint.org
4 About this talk
- The OECD Ministerial Declaration of 2004
- Interoperability, Standards and Systems - A Historic Perspective
- Understanding Biomolecular Function
- A BIND Interaction Record
- Interaction and Reaction Data Models
- Interaction Experiments
- Yeast Two Hybrid, Affinity Purification and False Positives
- Spoke and Matrix models for complexes of Unknown Topology
- Ontologies
5 Organisation for Economic Co-operation and Development (oecd.org)
- DECLARATION ON ACCESS TO RESEARCH DATA FROM PUBLIC FUNDING
- adopted on 30 January 2004 in Paris
The governments of Australia, Austria, Belgium,
Canada, China, the Czech Republic, Denmark,
Finland, France, Germany, Greece, Hungary,
Iceland, Ireland, Israel, Italy, Japan, Korea,
Luxembourg, Mexico, the Netherlands, New Zealand,
Norway, Poland, Portugal, the Russian Federation,
the Slovak Republic, the Republic of South
Africa, Spain, Sweden, Switzerland, Turkey, the
United Kingdom, and the United States
6 Declare their commitment to
- Work towards the establishment of access regimes for digital research data from public funding in accordance with the following objectives and principles:
- Openness
- Transparency
- Legal Conformity
- Formal Responsibility
- Professionalism
- IP Protection
- Interoperability
- Quality and Security
- Efficiency
- Accountability
7 Bioinformatics Research is Orthogonal to Biology Research
- All scientists are rewarded by publishing papers.
- 300 year old tradition
- Bioinformatics papers focus on algorithm development and improvement
- the last 2% improvement in sensitivity/specificity
- Discovery research needs implementations of algorithms that work together
- interoperability
- Interoperability is only now becoming publishable
8 Interaction Databases
- Aminoacyl-tRNA Synthetases Database
- ASEdb - Alanine Scanning Energetics database
- BBID - Biological Biochemical Image Database
- BIND - Biomolecular Interaction Network Database
- BindingDB - The Binding Database
- Biocarta
- Biocatalysis/Biodegradation Database
- BioPathways Consortium
- BRENDA
- BRITE - Biomolecular Relations in Information Transmission and Expression
- COMPEL (Composite Regulatory Elements)
- COPE - Cytokines Online Pathfinder Encyclopaedia
- CSNDB - Cell Signaling Networks Database / CSNDB Paper
- Curagen Pathcalling
- DIP - Database of Interacting Proteins
- DPInteract - DNA-protein interactions
- DRC - Database of Ribosomal Crosslinks
- Ecocyc and Metacyc
- Dynamic Signaling Maps
- JenPep Immunology MHC-peptide database
- KEGG - Kyoto Encyclopedia of Genes and Genomes
- Kohn Molecular Interaction Maps
- MDB - Metalloprotein Database and Browser
- MHCPEP - A database of MHC binding peptides
- MINT - a database of Molecular INTeractions
- MIPS Yeast Genome Database
- MMDB - Molecular Modeling Database
- NetBiochem Welcome Page
- ooTFD - (object-oriented Transcription Factors Database)
- ORDB - Olfactory Receptor Database
- PATIKA - Pathway Analysis Tool for Integration and Knowledge Acquisition
- PFBP - Protein Function and Biochemical Pathways Project
- PhosphoBase - A database of phosphorylation sites
- PIM (Protein Interaction Map)
- PIMdb - Drosophila Protein Interaction Map database
- PKR - Protein Kinase Resource
- ProChart Database (at AxCell Biosciences)
- ProNet Online - Protein Interactions on the Web (Myriad)
9 Over 50? Why So Many?
- Easy to build a simple Interaction Database.
- A Simple Abstraction. Many projects cutting their teeth in Bioinformatics
- Conceptually this list includes Biochemical Pathways (reactions, interactions)
- Also includes transcription factors, tRNA synthetases, etc., all of which can fall into a general biomolecular binding description.
- Many Niches to Fill
- Kinetics
- Organism centric
- Protein-protein centric
- Most are not funded for a large-scale service
10 How do we make things interoperate? What is in a Standard? A Historical Perspective
- Standards emerge from successful implementations of complete systems.
- Which one is the standard: the light bulb or the electrical grid?
- Lamps were the original killer app.
- (bye-bye candles, gas lamps, oil lamps)
- Other apps: motors, heaters, toasters
- Unexpected apps: radio, TV, transformers, computers, rechargeables
- Entire systems become standards via ad-hoc and popular use: the snowball effect.
11 Emergence and evolution of technological systems
- Systems emerge across broad frontiers
- Lots of small inventions are responsible for emerging technologies.
- Portions of the frontier that are held back become the focus of intense innovation
- Called a "reverse salient" by students of technology
- An inadequately functioning or inaccessible component in a complex system of components
- Opportunities for invention and replacement
12 Reverse Salient: AC/DC Example
- 1882: Edison's DC standard lit up Wall Street
- High-level buy-in for DC.
- AC was too complicated, could kill a person!
- Edison's DC system only worked over short range.
- This flaw is the reverse salient.
- Westinghouse/Stanley/Tesla saw the flaw in this standard
- AC technology raced to fill the gap.
- Light bulbs work with either AC or DC.
- Motors required re-invention
- E.S. Rogers batteryless radio, 1925
13 Reverse Salient: AC/DC Example
- Result: cars and battery-based devices emerged with DC.
- Result: the electrical grid emerged with AC.
NOT A WINNER-TAKE-ALL (zero-sum game) RESULT!
14 A few reverse salients in Bioinformatics
- Inadequately Functioning
- Integration of Structure and Sequence
- Integration of chemoinformatics with bioinformatics
- Mapping of microarray data to pathways
- Integration of interactions and pathways
- Inaccessible
- Carbohydrate representation and analysis tools
- Advanced, ad-hoc text mining tools
15 Reverse Salient Attitudes
- What holds us back?
- Oversights (didn't think of that!).
- Shortsightedness (won't ever need that!).
- Inability (can't do it!)
- Stubbornness (won't do it!)
- Prescriptivism (do it like this!)
- Nationalism, Continentalism, Colonialism
- (because that's the way we do it here!)
- 110 vs 220
16 Understanding Biomolecular Function
- "I yam what I yam and that's all that I yam."
- Popeye the sailor man, the world's first comic book superhero
17 Biomolecular function
E + S -> E + P
- This is a generalization of how a biochemist might represent the function of enzymes.
18 Biomolecular function
E + S -> E + P
kinase-ATP complex + inactive enzyme -> kinase + ADP + active enzyme
[Figure labels: K, P, ATP, ADP]
- Here is an example of the generalization represented two different ways.
19 Biomolecular function
[Figure labels: kinase-ATP complex, inactive enzyme, active enzyme, ADP]
- This is another representation.
20 Biomolecular function
A
B
C
D
E
F
- This is a generalization of the representation.
21 Biomolecular function
A
B
C
D
E
F
- A biomolecule's function can be defined by the things that it interacts with and the new (or altered) molecules that result from that interaction.
22 Biomolecular function
A
B
C
D
E
n
- This representation makes it easy to focus on the
interaction part.
23 Biomolecular function
A
B
C
D
E
n
- This also happens to represent the BIND data
model.
24 A simple BIND record
A
B
1. Short label for A
2. Short label for B
3. Molecule type for A
4. Molecule type for B
5. Database reference for A
6. Database reference for B
7. Where A comes from
8. Where B comes from
9. Publication reference
- The minimal BIND record has 9 pieces of information.
25 A curated BIND record
A
B
1. Short label for A
2. Short label for B
3. Molecule type for A
4. Molecule type for B
5. Database reference for A
6. Database reference for B
7. Where A comes from
8. Where B comes from
9. Publication reference
- The curated BIND record may have many more pieces of information.
26 An example BIND record
A
B
1. INAD
2. TRP
3. Protein
4. Protein
5. GenBank GI 3641615
6. GenBank GI 7301861
7. GenBank Taxonomy ID 7227
8. GenBank Taxonomy ID 7227
9. PubMed ID 8630257
- You can view this record in BIND
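The nine fields of a minimal record can be sketched as a small data structure. This is only an illustration of the idea on the slides, not BIND's actual schema: the class and attribute names (`Molecule`, `InteractionRecord`, `count_fields`) are assumptions made for this example, and the values are the INAD/TRP entries as read from the example slide.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Molecule:
    """One interacting partner (hypothetical names, not the BIND schema)."""
    label: str        # short label
    mol_type: str     # molecule type
    db_ref: str       # database reference
    taxonomy_id: int  # where the molecule comes from


@dataclass(frozen=True)
class InteractionRecord:
    """A minimal interaction record: two molecules plus a publication."""
    a: Molecule
    b: Molecule
    publication: str


def count_fields(record: InteractionRecord) -> int:
    """Count the pieces of information held by a minimal record."""
    return len(fields(record.a)) + len(fields(record.b)) + 1


# Values as read from the example slide (INAD and TRP, both Drosophila proteins).
record = InteractionRecord(
    a=Molecule("INAD", "Protein", "GenBank GI 3641615", 7227),
    b=Molecule("TRP", "Protein", "GenBank GI 7301861", 7227),
    publication="PubMed ID 8630257",
)
print(count_fields(record))  # 9
```

A curated record would extend this with optional fields (for example the experimental method), which is why the minimal form is kept deliberately small here.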
27 BIND stores molecular interaction data
29 http://bind.ca
- Enter 188 (the BIND record number) in the Identifier search box
31 BIND records are observations
A
B
1. Short label for A
2. Short label for B
3. Molecule type for A
4. Molecule type for B
5. Database reference for A
6. Database reference for B
7. Where A comes from
8. Where B comes from
9. Publication reference
- All BIND records will have a publication reference, and most will specifically list the method(s) used to demonstrate the interaction.
33 Methods used to detect interactions
- A great deal of interaction data in BIND originates from high-throughput experiments designed to detect interactions between proteins.
- The most common methods are:
- Two-hybrid assay
- Affinity purification
34 Interaction Experimental Evidence in BIND
[Chart: experimental evidence types in BIND; only the label "Remaining" survives]
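35 Two-hybrid assay
[Figure: two-hybrid assay, panels 1-4]
36 Two-hybrid assay
[Figure: two-hybrid assay, panels 1-4]
37 Two-hybrid assay
[Figure: two-hybrid assay, panels 1-4, with proteins A and B]
38 Two-hybrid assay
[Figure: two-hybrid assay, panels 1-4, with proteins A and B]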
39 Two-hybrid assay
[Figure: two-hybrid assay, panels 1-4. Labels: A, B, SNF1, SNF4, GAL4-DBD, transcription activation domain, UASG, GAL1]
Allows growth on galactose
Fields S, Song O. Nature. 1989 Jul 20;340(6230):245-6. PMID 2547163
40 Some Two-hybrid caveats
[Figure: two-hybrid assay, panels 1-4. Label: A]
Does the DBD-fusion have activity by itself?
41 Some Two-hybrid caveats
[Figure: two-hybrid assay, panels 1-4. Labels: A, B]
Is the interaction bi-directional?
42 Some Two-hybrid caveats
[Figure: two-hybrid assay, panels 1-4. Labels: A, B, C]
Is the interaction mediated by some other protein?
43 Some Two-hybrid questions
[Figure: two-hybrid assay, panels 1-4. Labels: A, B]
- Are the proteins expressed?
- Are they over-expressed?
- Are they in-frame?
- Are the interacting domains defined?
- Was the observation reproducible?
- Was the strength of interaction significant?
- Was another method used to back up the conclusion?
- Are the two proteins from the same compartment?
44 Two-hybrid assay
[Figure: two-hybrid assay, panels 1-4. Labels: A, B]
Negative results don't mean a lot.
45 Affinity purification
A
this molecule will bind the tag.
tag modification (e.g. HA/GST/His)
Protein of interest
46 Affinity purification
the cell
A
47 Affinity purification
lots of other untagged proteins
the cell
A
B
naturally binding protein
48 Affinity purification
Ruptured membranes
A
B
cell extract
49 Affinity purification
A
B
untagged proteins go through fastest (flow-through)
50 Affinity purification
A
B
tagged complexes are slower and come out later (eluate)
51 Some affinity purification questions
[Figure labels: A, B]
- Is the bait protein expressed and in frame?
- Is the bait protein observed?
- Is the bait protein over-expressed?
- Are the interacting domains defined?
- Was the observation reproducible?
- Was the interactor found in the background?
- Was the strength of interaction significant?
- Was the interaction saturable?
- Was the interactor stoichiometric with the bait protein?
- Was another method used to back up the conclusion?
- Was tandem-affinity purification (TAP) used?
- Was the interaction shown using an extract or a purified protein?
- Is the inverse interaction observable?
- Are the two proteins from the same compartment?
- Are the two proteins known to be involved in the same process?
- Is the interactor likely to be physiologically significant?
52 Some affinity purification caveats
First and most importantly, this is only a representation of the observation. You can only tell which proteins are in the eluate; you can't tell how they are connected to one another. If there is only one other protein present (B), then it's likely that A and B are directly interacting. But what if I told you that two other proteins (B and C) were present along with A?
[Figure: eluate containing A and B; eluate containing A, B and C]
53 Complexes with unknown topology
[Figure: three alternative arrangements of A, B and C]
Which of these models is correct? The complex described by this experimental result is said to have an Unknown Topology.
54 Complexes with unknown stoichiometry
[Figure: complex drawn with labels A, A, B, C]
Here's another possibility. The complex described by this experimental result is also said to have Unknown Stoichiometry.
55 High throughput data in BIND
- Affinity purification: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry (2002). PMID 11805837
- Two-hybrid: A protein interaction map of Drosophila melanogaster (2003). PMID 14605208
- Two-hybrid and affinity purification: A map of the interactome network of the metazoan C. elegans (2004). PMID 14704431
- Data from these examples can be retrieved from BIND using a PMID search.
56 How complex data are stored in BIND
[Figure: A, B and C, each with a link marked "?"]
Three interaction records.
57 How complex data are stored in BIND
[Figure: A, B and C, each with a link marked "?"]
A complex record in BIND is simply a collection of interaction records.
58 How complex data are stored in BIND
[Figure: A, B and C, each with a link marked "?"]
A complex record in BIND is simply a collection of interaction records.
59 Alternate representations
[Figure: A, B and C with links marked "?", redrawn as a graph of A, B and C]
The matrix model (a clique).
60 Alternate representations
[Figure: A, B and C with links marked "?", redrawn as a graph of A, B and C]
The spoke model. Which model to use?
61 Spoke and Matrix Models
- Vrp1 (bait), Las17, Rad51, Sla1, Tfp1, Ypt7
[Figure panels: Possible Actual Topology, Spoke, Matrix]
Matrix: theoretical max. number of interactions, but many FPs
Spoke: simple model. Intuitive, more accurate, but can misrepresent.
Bader & Hogue, Nature Biotech. 2002 Oct;20(10):991-7
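The difference between the two models is easy to see by expanding the Vrp1 purification above into pairwise interaction records. The sketch below is illustrative only; the function names `spoke_model` and `matrix_model` are assumptions for this example and are not part of BIND.

```python
from itertools import combinations


def spoke_model(bait, preys):
    """Spoke model: the bait is paired with every co-purified protein."""
    return [(bait, prey) for prey in preys]


def matrix_model(bait, preys):
    """Matrix model: every protein is paired with every other (a clique)."""
    return list(combinations([bait, *preys], 2))


bait = "Vrp1"
preys = ["Las17", "Rad51", "Sla1", "Tfp1", "Ypt7"]

spoke = spoke_model(bait, preys)
matrix = matrix_model(bait, preys)
print(len(spoke), len(matrix))  # 5 15
```

For a purification with n proteins, the spoke model yields n - 1 pairs and the matrix model n(n - 1)/2, which is why the matrix model gives the theoretical maximum number of interactions and also the most false positives.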
62 A view on real data: matrix model (seems hopeless)
6 redox enzymes
7 redox enzymes
Old yellow enzyme: function?
63 OYE has little small-molecule specificity, unlike all other redox enzymes
The crystal structure shows a large surface near its reactive site, unlike other similar proteins. So is its substrate a protein? Other redox enzymes?
64 A tea cup in a rainstorm
- 2000 elemental observations (facts) about molecular assembly published in the literature every month
- 2600 high-throughput interactions published every month, with high rates of false positives.
- 200,000 facts sitting in the literature on library shelves, not validated.
65 Ontology
- <philosophy> A systematic account of Existence.
- <artificial intelligence> (From philosophy) An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
- <information science> The hierarchical structuring of knowledge about things by subcategorising them according to their essential (or at least relevant and/or cognitive) qualities. This is an extension of the previous senses of "ontology" (above) which has become common in discussions about the difficulty of maintaining subject indices. The philosophy of indexing everything in existence?
66 Ontology redux
- An ontology is a choice of a system of data grammar together with specific controlled vocabularies and an organizational framework to contain data.
- Ontologies are used in practice to describe how to exchange data faithfully between computers, not how to compute with them!
- An ontology may be used to archive information or to make information available to applications (API).
67 Parsing - Summary
- Parsing flatfiles is instructive for understanding how biological data is stored and used.
- Most bioinformaticians in small academic groups write their own parsers and work with small batches of computations.
- Data grammars and automatically generated parsers are efficient and often error free.
- Most database organizations and software developers with large audiences use data grammar approaches.
- Semantic approaches (OWL) are beginning to emerge.
68 Matching and Finding Strings
- Biologists use language in a very expansive manner; consider the lowly calcium ion
- Calcium, Ca2+, Ca+2, Ca++, calcium (II), Ca(II)
- Given a database with descriptive text, how do you find each and every record that mentions calcium?
- Search for each form of the string: Ca AND calcium
- Use a regular expression? CA
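As a rough illustration of the regular-expression route, the sketch below matches the spellings of the calcium ion listed on this slide. The pattern, the `CALCIUM` name and the sample records are assumptions made for this example; the expression on the original slide did not survive transcription, so this is not a reconstruction of it.

```python
import re

# One pattern for the spellings listed on the slide.
CALCIUM = re.compile(
    r"""
    (?<![A-Za-z])                      # do not start in the middle of a word
    (?:
        (?i:calcium)(?:\s*\(II\))?     # calcium, Calcium, calcium (II)
      | Ca(?:2\+|\+2|\+\+|\(II\))?     # Ca, Ca2+, Ca+2, Ca++, Ca(II)
    )
    (?![A-Za-z])                       # do not stop in the middle of a word
    """,
    re.VERBOSE,
)

# Hypothetical free-text records keyed by record number.
records = {
    1: "Binds Ca2+ in the EF-hand",
    2: "Requires calcium (II) for activity",
    3: "Cadmium-sensitive catalytic domain",
    4: "Ca(II)-dependent protease",
}

hits = [rid for rid, text in records.items() if CALCIUM.search(text)]
print(hits)  # [1, 2, 4]
```

Even this small pattern needs look-arounds to avoid hitting words such as "Cadmium" or "catalytic", and it still misses compound names; that fragility is part of the motivation for controlled vocabularies.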
69 Controlled Vocabularies
- Fix the database so that only one form of the string is used.
- Control the use of vocabulary in descriptive text: only use "calcium", prohibit the other forms
- Called a synonym-constrained controlled vocabulary
- Requires that people doing data entry use selected words
- Requires that the person making a query know what form of the word is used in the database.
70 Controlled Vocabularies
- Add a field to the database and store the atomic number of any elements described
- Atomic number is a unique identifier for calcium.
- This enables searching the database by element.
- Periodic table defines the unique identifying number
- Called a numerically controlled vocabulary
- Requires that numbers representing words be added
- Allows searching by code number
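Both kinds of controlled vocabulary can be sketched with two lookup tables: one that maps every accepted synonym onto a single preferred term, and one that maps the preferred term onto its atomic number. The table contents and function names below are assumptions for illustration; a real database would enforce this at data entry.

```python
# Synonym-constrained vocabulary: every accepted spelling maps to one term.
SYNONYMS = {
    "calcium": "calcium",
    "ca": "calcium",
    "ca2+": "calcium",
    "ca+2": "calcium",
    "ca++": "calcium",
    "calcium (ii)": "calcium",
    "ca(ii)": "calcium",
}

# Numerically controlled vocabulary: preferred term -> atomic number.
ATOMIC_NUMBER = {"calcium": 20}


def preferred_term(text):
    """Return the single allowed form of a term, or None if it is unknown."""
    return SYNONYMS.get(text.strip().lower())


def element_code(text):
    """Return the atomic number used as the searchable code, or None."""
    term = preferred_term(text)
    return ATOMIC_NUMBER.get(term) if term else None


print(preferred_term("Ca(II)"))  # calcium
print(element_code("Ca++"))      # 20
```

Searching on the numeric code means a query no longer depends on which spelling the curator happened to use.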
71 Selected Unique Identifiers in Biological Data
73 Unique Identifier Use
- Use a single unique identifier to get a specific data entry from a database
- Use a list of unique identifiers to manage a collection of data from the database
- Use a list of identifiers to keep track of GO terms you are interested in.
- Search databases using Unique Identifiers!
74 List of Protein GIs for TrpRS protein hits
20178136 20178135 20178132 20178128 20178126
20178125 20139932 17367600 6226200 135188
20482401 14754335 21362967 17864462 417846 135189
18144292 18309615 15829216 6226201 1174553
1174552 21362962 20178139 20178138 20178133
20178131 20178130 20178129 20178127 20178124
20178123 20178122 17367824 16974813 16974812
16974811 7994694 135191 13878796 7994695 7994693
7994692 6226203 2501073 13431912 11134974 8039807
8039806 6226202 6094418 3915079 3122910 3122904
2501071 2501070 2501069 1711656 1351182 135187
14090160 7674347 2851538 2501074 1174551 417845
1754770 1754768
This describes a complete collection of sequences
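Because a list of identifiers fully describes a collection, managing the collection reduces to simple list and set operations. The sketch below uses a handful of the GIs from the slide; the helper name `parse_id_list` and the "already curated" set are hypothetical and exist only for this example.

```python
def parse_id_list(text):
    """Parse whitespace-separated identifiers, dropping repeats but keeping order."""
    return list(dict.fromkeys(int(token) for token in text.split()))


# A few of the TrpRS protein GIs from the slide (one repeated on purpose).
trprs_hits = parse_id_list("20178136 20178135 135188 417846 135189 135188")

# Hypothetical second collection, e.g. GIs that have already been curated.
already_curated = {135188, 417846}

to_curate = [gi for gi in trprs_hits if gi not in already_curated]
print(to_curate)  # [20178136, 20178135, 135189]
```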
75 BioPAX
- BioPAX: Biological PAthway eXchange
- A data exchange ontology and format for semantic integration, aggregation and inference of biological pathway data
- Open source community effort: the community agreed upon and built this!
- www.biopax.org
76 BioPAX Ontology Overview
Level 1 v1.0 (July 7th, 2004)
77 The domain: Biological pathways
Main categories
Metabolic Pathways
Molecular Interaction Networks
Signaling Pathways
78 Aggregation, Integration, Inference
- Multiple kinds of pathway databases
- metabolic
- molecular interactions
- signal transduction
- gene regulatory
- Constructs designed for integration
- DB References
- XRefs (Publication, Unification, Relationship)
- Synonyms
- Provenance (not yet implemented)
- OWL DL to enable reasoning
79 BioPAX uses other ontologies
- Conceptual framework based upon existing DB schemas
- aMAZE, BIND, EcoCyc, WIT, KEGG, Reactome, etc.
- Allows wide range of detail, multiple levels of abstraction
- Uses pointers to existing ontologies to provide supplemental annotation where appropriate
- Cellular location -> GO Component
- Cell type -> Cell.obo
- Organism -> NCBI taxon DB
- Incorporate other standards where appropriate
- Chemical structure -> SMILES, CML, INCHI
- Interoperate with existing standards (RDF/OWL, LSID, SBML, PSI, CellML Metadata Standard)
80 Case study: BioPAX in SBML facilitates SBML integration
- Addresses SBML's nasty data integration issues
- Different data types, same representation
- Same data, different representations
- External references
- Synonyms
- Provenance
81 BioPAX Ontology Overview
species
reaction
modifier
Level 1 v1.0 (July 7th, 2004)
82 Different data types, same representation
Protein-Protein Interaction
<reaction id="pyruvate_dehydrogenase_cplx">
  <listOfReactants>
    <speciesRef species="PdhA"/>
    <speciesRef species="PdhB"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="Pyruvate_dehydrogenase_E1"/>
  </listOfProducts>
</reaction>
Biochemical Reaction
<reaction id="pyruvate_dehydrogenase_rxn">
  <listOfReactants>
    <speciesRef species="NADP"/>
    <speciesRef species="CoA"/>
    <speciesRef species="pyruvate"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="NADPH"/>
    <speciesRef species="acetyl-CoA"/>
    <speciesRef species="CO2"/>
  </listOfProducts>
  <listOfModifiers>
    <modifierSpeciesRef species="pyruvate_dehydrogenase_E1"/>
  </listOfModifiers>
</reaction>
83 BioPAX solution: metadata
<sbml xmlns:bp="http://www.biopax.org/release1/biopax-release1.owl"
      xmlns:owl="http://www.w3.org/2002/07/owl#"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <listOfSpecies>
    <species id="PdhA" metaid="PdhA">
      <annotation>
        <bp:protein rdf:ID="PdhA"/>
      </annotation>
    </species>
    <species id="NADP" metaid="NADP">
      <annotation>
        <bp:smallMolecule rdf:ID="NADP"/>
      </annotation>
    </species>
  </listOfSpecies>
  <listOfReactions>
    <reaction id="pyruvate_dehydrogenase_cplx">
      <annotation>
        <bp:complexAssembly rdf:ID="pyruvate_dehydrogenase_cplx"/>
      </annotation>
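To show how such annotations let software tell the data types apart, the sketch below reads the BioPAX class attached to each species. It is a minimal illustration using Python's standard `xml.etree.ElementTree`; the embedded document is a trimmed, well-formed version of the slide's example, and the function name `biopax_types` is an assumption for this example.

```python
import xml.etree.ElementTree as ET

BP = "http://www.biopax.org/release1/biopax-release1.owl"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed, well-formed version of the annotated species list on the slide.
SBML_TEXT = f"""
<sbml xmlns:bp="{BP}" xmlns:rdf="{RDF}">
  <listOfSpecies>
    <species id="PdhA" metaid="PdhA">
      <annotation><bp:protein rdf:ID="PdhA"/></annotation>
    </species>
    <species id="NADP" metaid="NADP">
      <annotation><bp:smallMolecule rdf:ID="NADP"/></annotation>
    </species>
  </listOfSpecies>
</sbml>
""".strip()


def biopax_types(sbml_text):
    """Map each species id to the BioPAX class named in its annotation."""
    root = ET.fromstring(sbml_text)
    prefix = "{" + BP + "}"
    types = {}
    for species in root.iter("species"):
        annotation = species.find("annotation")
        if annotation is None:
            continue
        for child in annotation:
            # ElementTree spells namespaced tags as "{namespace}localname".
            if child.tag.startswith(prefix):
                types[species.get("id")] = child.tag[len(prefix):]
    return types


print(biopax_types(SBML_TEXT))  # {'PdhA': 'protein', 'NADP': 'smallMolecule'}
```

With the class available, a tool can distinguish a protein from a small molecule even though both are plain `species` elements in the SBML itself.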
84 BioPAX External References
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="http://biopax.org/release1/biopax-release1.owl">
    <bp:smallMolecule rdf:ID="pyruvate">
      <bp:Xref>
        <bp:unificationXref rdf:ID="unificationXref119">
          <bp:DB>LIGAND</bp:DB>
          <bp:ID>c00022</bp:ID>
        </bp:unificationXref>
      </bp:Xref>
    </bp:smallMolecule>
  </annotation>
</species>
85 BioPAX Synonyms
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="http://biopax.org/release1/biopax_release1.owl">
    <bp:smallMolecule rdf:ID="pyruvate">
      <bp:SYNONYMS>pyroracemic acid</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxo-propionic acid</bp:SYNONYMS>
      <bp:SYNONYMS>alpha-ketopropionic acid</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxopropanoate</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxopropanoic acid</bp:SYNONYMS>
      <bp:SYNONYMS>BTS</bp:SYNONYMS>
      <bp:SYNONYMS>pyruvic acid</bp:SYNONYMS>
    </bp:smallMolecule>
  </annotation>
</species>
86 A comprehensive list of BioPAX supporting Applications (Feb 2005)