Title: Interactions and Ontologies
1Interactions and Ontologies
CBW Bioinformatics Workshop, February 23rd, 2006, Toronto
Christopher Hogue, The Blueprint Initiative
2About this talk
- Interoperability, Standards and Systems - A Historic Perspective
- Understanding Biomolecular Function
- A BIND Interaction Record
- Interaction and Reaction Data Models
- Interaction Experiments
- Yeast Two Hybrid, Affinity Purification and False Positives
- Spoke and Matrix models for complexes of Unknown Topology
- Ontologies
3Interaction Databases
- Aminoacyl-tRNA Synthetases Database
- ASEdb - Alanine Scanning Energetics database
- BBID - Biological Biochemical Image Database
- BIND - Biomolecular Interaction Network Database
- BindingDB - The Binding Database
- Biocarta
- Biocatalysis/Biodegradation Database
- BioPathways Consortium
- BRENDA
- BRITE - Biomolecular Relations in Information Transmission and Expression
- COMPEL (Composite Regulatory Elements)
- COPE - Cytokines Online Pathfinder Encyclopaedia
- CSNDB - Cell Signaling Networks Database / CSNDB Paper
- Curagen Pathcalling
- DIP - Database of Interacting Proteins
- DPInteract - DNA-protein interactions
- DRC - Database of Ribosomal Crosslinks
- Ecocyc and Metacyc
- Dynamic Signaling Maps
- JenPep Immunology MHC-peptide database
- KEGG - Kyoto Encyclopedia of Genes and Genomes
- Kohn Molecular Interaction Maps
- MDB - Metalloprotein Database and Browser
- MHCPEP - A database of MHC binding peptides
- MINT - a database of Molecular INTeractions
- MIPS Yeast Genome Database
- MMDB - Molecular Modeling Database
- NetBiochem Welcome Page
- ooTFD - object-oriented Transcription Factors Database
- ORDB - Olfactory Receptor Database
- PATIKA - Pathway Analysis Tool for Integration and Knowledge Acquisition
- PFBP - Protein Function and Biochemical Pathways Project
- PhosphoBase - A database of phosphorylation sites
- PIM (Protein Interaction Map)
- PIMdb - Drosophila Protein Interaction Map database
- PKR - Protein Kinase Resource
- ProChart Database (at AxCell Biosciences)
- ProNet Online - Protein Interactions on the Web (Myriad)
4Over 50? Why So Many?
- Easy to build a simple Interaction Database.
- A Simple Abstraction. Many Projects cutting their teeth in Bioinformatics
- Conceptually this list includes Biochemical Pathways (reactions and interactions)
- Also includes transcription factors, tRNA synthetases, etc., all of which can fall into a general biomolecular binding description.
- Many Niches to Fill
- Kinetics
- Organism centric
- Protein-protein centric
- Most are not funded for a large-scale service
5How do we make things interoperate? What is in a Standard? A Historical Perspective
- Standards emerge from successful implementations of complete systems.
- Which one is the standard: the light bulb or the electrical grid?
- Lamps were the original killer app.
- (bye-bye candles, gas lamps, oil lamps)
- Other Apps: Motors, Heaters, Toasters
- Unexpected Apps: radio, TV, transformers, computers, rechargeables
- Entire systems become standards via ad-hoc and popular use: a snowball effect.
6Emergence and evolution of technological systems
- Systems emerge across broad frontiers
- Lots of small inventions are responsible for emerging technologies.
- Portions of the frontier that are held back become the focus of intense innovation
- Called a "reverse salient" by students of technology
- An inadequately functioning or accessible component in a complex system of components
- Opportunities for invention and replacement
7Reverse Salient: AC/DC Example
- 1882: Edison's DC standard lit up Wall Street
- High-level buy-in for DC.
- AC was too complicated, could kill a person!
- Edison's DC system only worked over short range.
- This flaw is the reverse salient.
- Westinghouse/Stanley/Tesla saw the flaw in this standard
- AC technology raced to fill the gap.
- Light bulbs work with either AC or DC.
- Motors required re-invention
- E.S. Rogers batteryless radio, 1925
8Reverse Salient: AC/DC Example
- Result: Cars, Battery based devices emerged with DC.
- Result: The electrical Grid emerged with AC.
NOT A WINNER-TAKE-ALL (zero-sum game) RESULT!
9A few reverse salients in Bioinformatics
- Inadequately Functioning
- Integration of Structure and Sequence
- Integration of chemoinformatics with bioinformatics
- Mapping of microarray data to pathways
- Integration of interactions and pathways
- Inaccessible
- Carbohydrate representation and analysis tools
- Advanced, ad-hoc text mining tools
10Reverse Salient Attitudes
- What holds us back?
- Oversights (didn't think of that!).
- Shortsightedness (won't ever need that!).
- Inability (can't do it!)
- Stubbornness (won't do it!)
- Prescriptivism (do it like this!)
- Nationalism, Continentalism, Colonialism
- (because that's the way we do it here!)
- 110 vs 220
11Understanding Biomolecular Function
- "I yam what I yam and that's all that I yam."
- Popeye the sailor man, the world's first comic book superhero
12Biomolecular function
E + S => E + P
- This is a generalization of how a biochemist
might represent the function of enzymes.
13Biomolecular function
E + S => E + P
kinase-ATP complex + inactive enzyme => kinase + ADP + active enzyme
[Diagram labels: K, P, ATP, ADP]
- Here is an example of the generalization
represented two different ways.
14Biomolecular function
[Diagram: Kinase-ATP complex, inactive enzyme, active enzyme, ADP]
- This is another representation.
15Biomolecular function
[Diagram: molecules A, B, C, D, E and F]
- This is a generalization of the representation.
16Biomolecular function
[Diagram: molecules A, B, C, D, E and F]
- A biomolecule's function can be defined by the things that it interacts with and the new (or altered) molecules that result from that interaction.
17Biomolecular function
[Diagram: labels A, B, C, D, E, n]
- This representation makes it easy to focus on the interaction part.
18Biomolecular function
[Diagram: labels A, B, C, D, E, n]
- This also happens to represent the BIND data model.
19A simple BIND record
[Diagram: molecule A interacting with molecule B]
1. Short label for A
2. Short label for B
3. Molecule type for A
4. Molecule type for B
5. Database reference for A
6. Database reference for B
7. Where A comes from
8. Where B comes from
9. Publication reference
- The minimal BIND record has 9 pieces of
information.
20A curated BIND record
[Diagram: molecule A interacting with molecule B]
1. Short label for A
2. Short label for B
3. Molecule type for A
4. Molecule type for B
5. Database reference for A
6. Database reference for B
7. Where A comes from
8. Where B comes from
9. Publication reference
- The curated BIND record may have many more pieces
of information.
21An example BIND record
[Diagram: molecule A interacting with molecule B]
1. INAD
2. TRP
3. Protein
4. Protein
5. GenBank GI 3641615
6. GenBank GI 7301861
7. GenBank Taxonomy ID 7227
8. GenBank Taxonomy ID 7227
9. PubMed ID 8630257
- You can view this record in BIND
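The nine-field record above can be sketched as a small data structure. This is only an illustration, assuming hypothetical class and field names (`Molecule`, `InteractionRecord`, `as_slide_fields`); it is not the actual BIND schema. The values are the INAD/TRP example from this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """One interaction partner (hypothetical field names, not the BIND schema)."""
    label: str      # short label, e.g. "INAD"
    mol_type: str   # molecule type, e.g. "protein"
    db_ref: str     # database reference, e.g. a GenBank GI
    taxon_id: int   # where the molecule comes from (taxonomy ID)


@dataclass(frozen=True)
class InteractionRecord:
    """A minimal interaction record: two molecules plus one publication."""
    a: Molecule
    b: Molecule
    pubmed_id: int


def as_slide_fields(rec):
    """Return the nine pieces of information in the order used on the slide."""
    return [
        rec.a.label, rec.b.label,
        rec.a.mol_type, rec.b.mol_type,
        rec.a.db_ref, rec.b.db_ref,
        rec.a.taxon_id, rec.b.taxon_id,
        rec.pubmed_id,
    ]


record = InteractionRecord(
    a=Molecule("INAD", "protein", "GI:3641615", 7227),
    b=Molecule("TRP", "protein", "GI:7301861", 7227),
    pubmed_id=8630257,
)

for number, value in enumerate(as_slide_fields(record), start=1):
    print(number, value)
```

A curated record would extend this with further optional fields, such as the experimental method.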
22BIND stores molecular interaction data
24http://bind.ca
- Enter 188 (the BIND record number) in the
Identifier search box
26BIND records are observations
[Diagram: molecule A interacting with molecule B]
1. Short label for A
2. Short label for B
3. Molecule type for A
4. Molecule type for B
5. Database reference for A
6. Database reference for B
7. Where A comes from
8. Where B comes from
9. Publication reference
- All BIND records will have a publication
reference and most will specifically list a
method(s) used to demonstrate the interaction.
28Methods used to detect interactions.
- A great deal of interaction data in BIND originates from high-throughput experiments designed to detect interactions between proteins.
- The most common methods are:
- Two-hybrid assay
- Affinity purification
29Interaction Experimental Evidence in BIND
[Chart: surviving legend label "Remaining 1"]
30Two-hybrid assay
[Diagram: panels 1-4]
31Two-hybrid assay
[Diagram: panels 1-4]
32Two-hybrid assay
[Diagram: panels 1-4, proteins A and B]
33Two-hybrid assay
[Diagram: panels 1-4, proteins A and B]
34Two-hybrid assay
[Diagram: panels 1-4; labels: A, B, SNF1, SNF4, GAL4-DBD, transcription activation domain, UASG, GAL1 (allows growth on galactose)]
Fields S, Song O. Nature. 1989 Jul 20;340(6230):245-6. PMID 2547163
35Some Two-hybrid caveats
[Diagram: panels 1-4, protein A]
Does the DBD-fusion have activity by itself?
36Some Two-hybrid caveats
[Diagram: panels 1-4, proteins A and B]
Is the interaction bi-directional?
37Some Two-hybrid caveats
[Diagram: panels 1-4, proteins A, B and C]
Is the interaction mediated by some other protein?
38Some Two-hybrid questions
[Diagram: panels 1-4, proteins A and B]
- Are the proteins expressed?
- Are they over-expressed?
- Are they in-frame?
- Are the interacting domains defined?
- Was the observation reproducible?
- Was the strength of interaction significant?
- Was another method used to back up the conclusion?
- Are the two proteins from the same compartment?
39Two-hybrid assay
[Diagram: panels 1-4, proteins A and B]
Negative results don't mean a lot.
40Affinity purification
[Diagram: protein of interest (A) with a tag modification (e.g. HA/GST/His); a second molecule will bind the tag.]
41Affinity purification
[Diagram: A inside the cell]
42Affinity purification
[Diagram: A inside the cell with a naturally binding protein (B) and lots of other untagged proteins]
43Affinity purification
[Diagram: ruptured membranes; A and B in the cell extract]
44Affinity purification
[Diagram: A and B; untagged proteins go through fastest (flow-through)]
45Affinity purification
[Diagram: A and B; tagged complexes are slower and come out later (eluate)]
46Some affinity purification questions
- Is the bait protein expressed and in frame?
- Is the bait protein observed?
- Is the bait protein over-expressed?
- Are the interacting domains defined?
- Was the observation reproducible?
- Was the interactor found in the background?
- Was the strength of interaction significant?
- Was the interaction saturable?
- Was the interactor stoichiometric with the bait protein?
- Was another method used to back up the conclusion?
- Was tandem-affinity purification (TAP) used?
- Was the interaction shown using an extract or a purified protein?
- Is the inverse interaction observable?
- Are the two proteins from the same compartment?
- Are the two proteins known to be involved in the same process?
- Is the interactor likely to be physiologically significant?
[Diagram: A and B]
47Some affinity purification caveats
First and most importantly, this is only a representation of the observation. You can only tell what proteins are in the eluate; you can't tell how they are connected to one another. If there is only one other protein present (B), then it's likely that A and B are directly interacting. But what if I told you that two other proteins (B and C) were present along with A?
[Diagram: A with B; A with B and C]
48Complexes with unknown topology
[Diagram: three alternative arrangements of A, B and C]
Which of these models is correct? The complex described by this experimental result is said to have an "Unknown Topology".
49Complexes with unknown stoichiometry
[Diagram: two copies of A with B and C]
Here's another possibility. The complex described by this experimental result is also said to have "Unknown Stoichiometry".
50High throughput data in BIND
- Affinity purification: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry (2002). PMID 11805837
- Two-hybrid: A protein interaction map of Drosophila melanogaster (2003). PMID 14605208
- Two-hybrid and Affinity purification: A map of the interactome network of the metazoan C. elegans (2004). PMID 14704431
- Data from these examples can be retrieved from BIND using a PMID search.
51How complex data are stored in BIND.
[Diagram: A, B and C with unresolved (?) connections]
Three interaction records.
52How complex data are stored in BIND.
[Diagram: A, B and C with unresolved (?) connections]
A complex record in BIND is simply a collection of interaction records.
53How complex data are stored in BIND.
[Diagram: A, B and C with unresolved (?) connections]
A complex record in BIND is simply a collection of interaction records.
54Alternate representations.
[Diagram: A, B and C]
The matrix model (a clique).
55Alternate representations.
[Diagram: A, B and C]
The spoke model. Which model to use?
56Spoke and Matrix Models
- Vrp1 (bait), Las17, Rad51, Sla1, Tfp1, Ypt7
[Diagram panels: Possible Actual Topology, Spoke, Matrix]
- Matrix: theoretical max. number of interactions, but many FPs
- Spoke: simple model. Intuitive, more accurate, but can misrepresent.
Bader & Hogue, Nature Biotech. 2002 Oct;20(10):991-7
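As a minimal sketch of the two models, the pairwise interactions implied by one purification can be enumerated as below. The function names are hypothetical. The spoke model links the bait to each co-purified protein; the matrix model links every pair.

```python
from itertools import combinations


def spoke_model(bait, preys):
    """Bait paired with each prey: one edge per co-purified protein."""
    return {frozenset((bait, prey)) for prey in preys}


def matrix_model(bait, preys):
    """Every pair among bait and preys: a clique over the whole complex."""
    members = [bait, *preys]
    return {frozenset(pair) for pair in combinations(members, 2)}


bait = "Vrp1"
preys = ["Las17", "Rad51", "Sla1", "Tfp1", "Ypt7"]

spoke = spoke_model(bait, preys)
matrix = matrix_model(bait, preys)

print(len(spoke))   # 5 edges
print(len(matrix))  # 15 edges: 6 proteins, 6 * 5 / 2 pairs
```

Every spoke edge is also a matrix edge, so the ten extra matrix edges are where the additional false positives can come from.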
57A view on real data: matrix model (seems hopeless)
6 redox enzymes
7 redox enzymes
Old yellow enzyme: Function?
58OYE has little small molecule specificity, unlike all other redox enzymes
The crystal structure shows a large surface near its reactive site, unlike other similar proteins. So is its substrate a protein? Other redox enzymes? Solution: Go do an experiment!
59Predicting Interaction Information
- Very often the best result of a Bioinformatics investigation is the suggestion of a specific experiment that wasn't previously considered.
- Often very hard to get a scientist to try an experiment.
- Negative results aren't publishable: a risk to the experimentalist that they are wasting their time/resources!
- Narrowing down the vast space of possible interactions is important
- Approx. 36,000,000 pairs of testable protein-protein interactions in yeast.
- Important to use all the information at hand and to demonstrate to the experimentalist that you have reduced (not increased or left unchanged) their risk of failure.
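The size of that search space can be checked with a little arithmetic. The sketch below assumes roughly 6,000 yeast proteins; that number is an assumption and does not appear on the slide. Under it, 36,000,000 corresponds to ordered (bait, prey) pairs, and counting each unordered pair once gives about half as many.

```python
from math import comb


def ordered_pairs(n):
    """Bait/prey pairs where direction matters, self-pairs included."""
    return n * n


def unordered_pairs(n):
    """Distinct pairs of different proteins, direction ignored."""
    return comb(n, 2)


n_proteins = 6000  # assumed proteome size, for illustration only

print(ordered_pairs(n_proteins))    # 36000000
print(unordered_pairs(n_proteins))  # 17997000
```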
60 1. How do we predict/validate interactions? 2. How do we locate specific binding sites?
- Functional annotation (imprecise for 2)
- Matching sequence features to patterns
- PSSMs
- Domain-small Molecule Interactions (SMID-BLAST)
- Domain-motif interactions
- 3D Docking
- slow
- need 3D models
- Energy scoring functions are imprecise
61Motif-Domain Interactions
- Protein interactions play a crucial role in driving many important cellular processes such as intra-cellular signaling, transcription regulation, cell cycle regulation, and metabolic activities.
- Many of the interactions are mediated by conserved domains binding to short sequence motifs that form peptide recognition modules.
- Only a small number of domains have known binding motifs.
62SH3 domain and Pro-rich Motif
63High-throughput protein complex identification
Ho et al., Nature 415, 180-183 (10 Jan 2002): HMS-PCI dataset
Gavin et al., Nature 415, 141-147 (10 Jan 2002): TAP dataset
64Rho family GTPase Interactions
Extract Motifs from 3D Structures. Criteria: Non-domain polypeptides
65Gibbs Sampling
- Gibbs sampling is a stochastic Markov Chain Monte Carlo algorithm
- Used for motif discovery in proteins
- Widely used for the identification of transcription factor binding sites (Lawrence et al., 1993; Neuwald et al., 1995).
- Gibbs sampling allows for the incorporation of prior knowledge about the motif composition.
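To make the idea concrete, here is a minimal site-sampler sketch in the spirit of Lawrence et al., not the implementation used in this work. It assumes one motif occurrence per sequence, a fixed motif width, and a uniform background. The function names and the pseudocount are illustrative assumptions; the pseudocount is the simplest place where prior knowledge about motif composition could enter.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(seqs, starts, width, skip, pseudo=1.0):
    """Per-column residue frequencies from every motif instance except `skip`."""
    counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1.0
    profile = []
    for column in counts:
        total = sum(column.values())
        profile.append({a: c / total for a, c in column.items()})
    return profile


def gibbs_motif_search(seqs, width, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    background = 1.0 / len(ALPHABET)
    # Start from random positions.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Hold sequence i out and model the motif from the others.
            profile = build_profile(seqs, starts, width, skip=i)
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= profile[j][seq[start + j]] / background
                weights.append(ratio)
            # Sample a new position in proportion to its likelihood ratio.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = ["ACDEQEDYKRLL", "QEDYARGHIKLM", "MNPQEDYSRTVW"]
starts = gibbs_motif_search(seqs, width=6, iterations=50, seed=1)
for seq, start in zip(seqs, starts):
    print(start, seq[start:start + 6])
```

Because the sampler is stochastic, different seeds can settle on different positions, which is one reason the size and composition of the input set matters.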
66Seed and Focus Procedure
- Gibbs sampling is sensitive to database size.
- On a sufficiently large database, almost any motif could be found.
- Most motifs found with this approach were found before databases got big from genomics
- SEED the Gibbs sampler with the 3D structure motif
- FOCUS the Gibbs sampler on groups of interacting sequences found in complexes with the domain (a smaller database)
- If the motif is real it should be enriched
- otherwise it should disappear
67Focused sequences from yeast complexes containing RhoGAP.
Input to Gibbs Sampler:
- Motifs from 3D structure (SEED)
- Database of all proteins from HTP complexes in yeast that have RhoGAP domains
68 4 Motif descriptions, 4 PSSMs
QEDYXR
YVPXVP
QEDYXRLXXL
YXPXXF
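As a rough illustration of how such motif descriptions can be matched against sequences, the sketch below treats X as "any residue" and scans with regular expressions. That reading of X is an assumption, and this is a deliberate simplification: a PSSM scores every residue at every position rather than making a yes/no match. The function names and the test sequence are made up for the example.

```python
import re

MOTIFS = ["QEDYXR", "YVPXVP", "QEDYXRLXXL", "YXPXXF"]


def motif_to_regex(motif):
    """Compile a motif description, treating X as a wildcard residue."""
    body = "".join("[A-Z]" if c == "X" else re.escape(c) for c in motif)
    # The lookahead lets overlapping occurrences be reported as well.
    return re.compile(f"(?=({body}))")


def scan(sequence, motifs=MOTIFS):
    """Return (motif, start, matched text) for every hit in the sequence."""
    hits = []
    for motif in motifs:
        for match in motif_to_regex(motif).finditer(sequence):
            hits.append((motif, match.start(), match.group(1)))
    return hits


# Hypothetical peptide, not taken from any dataset.
for hit in scan("AAQEDYKRLAAL"):
    print(hit)
```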
69Use PSSMs to Identify Motifs
- Constrain to the HTP complexes (next slide).
- Good enough to get the attention of an experimentalist!
- Try on all yeast genes
- 18,459 raw PSSM-based predictions (scores vary)
- No compartmentalization or other information considered
- Match 623 literature validated predictions.
- Probability of predicting by random chance is 1.6e-53.
70Predicted RhoGAP interactions
M. Tyers did the validation, using a standard FLAG pull-down, then a more sensitive myc double-tag pull-down. 11 validated interactions (colored to match the 4 motifs)
71High-throughput protein complex identification
Ho et al., Nature 415, 180-183 (10 Jan 2002): HMS-PCI dataset
Gavin et al., Nature 415, 141-147 (10 Jan 2002): TAP dataset
72Domain-Motif TAP network hits.
73Domain-motif HMS-PCI network hits.
Significantly more Domain-Motif hits than in the TAP dataset. Over-expressed proteins used in this approach may be more sensitive to transient or low-copy number domain-motif interactions. Or the baits selected contain more domain-motif interactions in their respective networks.
74A tea cup in a rainstorm
- 2000 elemental observations (facts) about molecular assembly published in the literature every month
- 2600 High Throughput Interactions published every month with high rates of false positives.
- 200,000 facts sitting in the literature on library shelves, not validated.
75Ontologies for Pathways Interactions and
Signaling
- An emerging consensus that may help you
(someday)
76The domain Biological pathways
Main categories
Metabolic Pathways
Molecular Interaction Networks
Signaling Pathways
77Ontology
- <philosophy> A systematic account of Existence.
- <artificial intelligence> (From philosophy) An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
- <information science> The hierarchical structuring of knowledge about things by subcategorising them according to their essential (or at least relevant and/or cognitive) qualities. This is an extension of the previous senses of "ontology" (above) which has become common in discussions about the difficulty of maintaining subject indices. The philosophy of indexing everything in existence?
78Ontology redux
- An ontology is a choice of a system of data grammar together with specific controlled vocabularies and an organizational framework to contain data.
- Ontologies are used in practice to describe how to exchange data faithfully between computers, not how to compute with them!
- An Ontology may be used to Archive information or to make information available to applications (API).
79Parsing - Summary
- Parsing flatfiles is instructive to understand how biological data is stored and used.
- Most bioinformaticians in small academic groups write their own parsers and work with small batches of computations.
- Data Grammars and automatically generated parsers are efficient and often error-free.
- Most database organizations and software developers with large audiences use data grammar approaches.
- Semantic approaches (OWL) are beginning to emerge.
80BioPAX
- BioPAX: Biological PAthway eXchange
- A data exchange ontology and format for semantic integration, aggregation and inference of biological pathway data
- Open source community effort: the community agreed upon and built this!
- www.biopax.org
81BioPAX Ontology Overview
Level 1 v1.0 (July 7th, 2004)
82The domain Biological pathways
Main categories
Metabolic Pathways
Molecular Interaction Networks
Signaling Pathways
83Aggregation, Integration, Inference
- Multiple kinds of pathway databases
- metabolic
- molecular interactions
- signal transduction
- gene regulatory
- Constructs designed for integration
- DB References
- XRefs (Publication, Unification, Relationship)
- Synonyms
- Provenance (not yet implemented)
- OWL DL to enable reasoning
84BioPAX uses other ontologies
- Conceptual framework based upon existing DB schemas
- aMAZE, BIND, EcoCyc, WIT, KEGG, Reactome, etc.
- Allows wide range of detail, multiple levels of abstraction
- Uses pointers to existing ontologies to provide supplemental annotation where appropriate
- Cellular location -> GO Component
- Cell type -> Cell.obo
- Organism -> NCBI taxon DB
- Incorporate other standards where appropriate
- Chemical structure -> SMILES, CML, INCHI
- Interoperate with existing standards (RDF/OWL, LSID, SBML, PSI, CellML Metadata Standard)
85Case study: BioPAX in SBML facilitates SBML integration
- Addresses SBML's nasty data integration issues
- Different data types, same representation
- Same data, different representations
- External references
- Synonyms
- Provenance
86BioPAX Ontology Overview
species
reaction
modifier
Level 1 v1.0 (July 7th, 2004)
87Different data types, same representation
Protein-Protein Interaction
<reaction id="pyruvate_dehydrogenase_cplx">
  <listOfReactants>
    <speciesRef species="PdhA"/>
    <speciesRef species="PdhB"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="Pyruvate_dehydrogenase_E1"/>
  </listOfProducts>
</reaction>
Biochemical Reaction
<reaction id="pyruvate_dehydrogenase_rxn">
  <listOfReactants>
    <speciesRef species="NADP"/>
    <speciesRef species="CoA"/>
    <speciesRef species="pyruvate"/>
  </listOfReactants>
  <listOfProducts>
    <speciesRef species="NADPH"/>
    <speciesRef species="acetyl-CoA"/>
    <speciesRef species="CO2"/>
  </listOfProducts>
  <listOfModifiers>
    <modifierSpeciesRef species="pyruvate_dehydrogenase_E1"/>
  </listOfModifiers>
</reaction>
88BioPAX solution metadata
<sbml xmlns:bp="http://www.biopax.org/release1/biopax-release1.owl"
      xmlns:owl="http://www.w3.org/2002/07/owl#"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <listOfSpecies>
    <species id="PdhA" metaid="PdhA">
      <annotation>
        <bp:protein rdf:ID="PdhA"/>
      </annotation>
    </species>
    <species id="NADP" metaid="NADP">
      <annotation>
        <bp:smallMolecule rdf:ID="NADP"/>
      </annotation>
    </species>
  </listOfSpecies>
  <listOfReactions>
    <reaction id="pyruvate_dehydrogenase_cplx">
      <annotation>
        <bp:complexAssembly rdf:ID="pyruvate_dehydrogenase_cplx"/>
      </annotation>
89BioPAX External References
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="http://biopax.org/release1/biopax-release1.owl">
    <bp:smallMolecule rdf:ID="pyruvate">
      <bp:Xref>
        <bp:unificationXref rdf:ID="unificationXref119">
          <bp:DB>LIGAND</bp:DB>
          <bp:ID>c00022</bp:ID>
        </bp:unificationXref>
      </bp:Xref>
    </bp:smallMolecule>
  </annotation>
</species>
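A consumer of such an annotated file can pull the external reference back out with a namespace-aware XML parser. The sketch below uses Python's standard library. It assumes the element names and the bp namespace URI exactly as shown on this slide, and it declares the rdf namespace on the annotation element so that the fragment parses on its own. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs: bp is taken from the slide; rdf is the standard RDF namespace.
BP = "http://biopax.org/release1/biopax-release1.owl"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SPECIES_XML = f"""
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="{BP}" xmlns:rdf="{RDF}">
    <bp:smallMolecule rdf:ID="pyruvate">
      <bp:Xref>
        <bp:unificationXref rdf:ID="unificationXref119">
          <bp:DB>LIGAND</bp:DB>
          <bp:ID>c00022</bp:ID>
        </bp:unificationXref>
      </bp:Xref>
    </bp:smallMolecule>
  </annotation>
</species>
"""


def unification_xrefs(species_xml):
    """Return the species id and its (database, accession) unification xrefs."""
    root = ET.fromstring(species_xml)
    ns = {"bp": BP}
    xrefs = []
    for xref in root.iter(f"{{{BP}}}unificationXref"):
        db = xref.findtext("bp:DB", namespaces=ns)
        accession = xref.findtext("bp:ID", namespaces=ns)
        xrefs.append((db, accession))
    return root.get("id"), xrefs


print(unification_xrefs(SPECIES_XML))
```

Two SBML models that both carry the same unification xref for a species can then be matched on that pair, even if their species ids differ.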
90BioPAX Synonyms
<species id="pyruvate" metaid="pyruvate">
  <annotation xmlns:bp="http://biopax.org/release1/biopax_release1.owl">
    <bp:smallMolecule rdf:ID="pyruvate">
      <bp:SYNONYMS>pyroracemic acid</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxo-propionic acid</bp:SYNONYMS>
      <bp:SYNONYMS>alpha-ketopropionic acid</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxopropanoate</bp:SYNONYMS>
      <bp:SYNONYMS>2-oxopropanoic acid</bp:SYNONYMS>
      <bp:SYNONYMS>BTS</bp:SYNONYMS>
      <bp:SYNONYMS>pyruvic acid</bp:SYNONYMS>
    </bp:smallMolecule>
  </annotation>
</species>