Title: Bio-Trac 40 (Protein Bioinformatics)
1Biomedical Ontologies
- Bio-Trac 40 (Protein Bioinformatics)
- October 9, 2008
- Zhang-Zhi Hu, M.D.
- Research Associate Professor
- Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology
- Georgetown University Medical Center
2Overview
- What is ontology?
- What is biomedical ontology?
- What is gene ontology?
- How is it generated?
- How is it used for annotation?
- What is protein ontology?
- Why is it necessary?
- How to use it?
3Tree of Porphyry with Aristotle's Categories
Aristotle, 384-322 BC
4Ontology
onto-: of being or existence; -logy: study.
Greek origin; Latin ontologia, 1606
- In philosophy, it seeks to describe basic categories and relationships of being or existence to define entities and types of entities within its framework
- What do you know? How do you know it?
- What is existence? What is a physical object?
- What constitutes the identity of an object?
- Central goal is to have a definitive and exhaustive classification of all entities.
"The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality" (Barry Smith, U Buffalo)
5In computer and information science
- Ontology is a data model that represents a set of
concepts within a domain and the relationships
between those concepts. It is used to reason
about the objects within that domain.
Most ontologies describe individuals (instances), classes (concepts), attributes, and relations.
[Diagram labels: Classes, Relations, Attributes]
- Classes (concepts): e.g. color, engine, door
- Individuals (instances): your Ford, my Ford, his Ford
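The four building blocks named on this slide can be sketched as a small data model. This is only an illustrative sketch: the type names `OntologyClass` and `Individual`, the relation name `has_part`, and the attribute values are assumptions made for the example, not part of any ontology standard.

```python
from dataclasses import dataclass, field

@dataclass
class OntologyClass:
    """A class (concept) with attribute names and relations to other classes."""
    name: str
    attributes: list = field(default_factory=list)   # attribute names
    relations: list = field(default_factory=list)    # (relation, target class name)

@dataclass
class Individual:
    """An individual (instance) of a class, carrying its own attribute values."""
    name: str
    instance_of: OntologyClass
    values: dict = field(default_factory=dict)

# Classes (concepts); "has_part" is an illustrative relation name.
engine = OntologyClass("engine")
door = OntologyClass("door")
ford = OntologyClass(
    "Ford",
    attributes=["color"],
    relations=[("has_part", "engine"), ("has_part", "door")],
)

# Individuals (instances) of the same class, differing in attribute values.
your_ford = Individual("your Ford", ford, {"color": "red"})
my_ford = Individual("my Ford", ford, {"color": "blue"})

def parts_of(individual):
    """Reason from the class level: every instance inherits its class's parts."""
    return [target for rel, target in individual.instance_of.relations
            if rel == "has_part"]
```

Calling `parts_of(my_ford)` uses knowledge stored once on the class to answer a question about an instance, which is the kind of reasoning "about the objects within that domain" the slide refers to.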
6What are ontologies useful for?
Ontology is a form of knowledge representation about the world or some part of it.
- Terminology management
- Integration, interoperability, and sharing of data
  - promote precise communication between scientists
  - enable information retrieval across multiple resources
- Knowledge reuse and decision support
  - extend the power of computational approaches to perform data exploration, inference, and mining
Biomedical Terminology vs. Biomedical Ontology
- UMLS (Unified Medical Language System)
- MeSH (Medical Subject Headings)
- NCI Thesaurus
- SNOMED / SNODENT
- Medical WordNet
7Ontology Enables Large-Scale Biomedical Science
Ontology is at the center of two major activities in current biomedical research:
- Structured representation of biomedicine
  - For different types of entities and relations to describe biomedicine (ontology content curation).
- Annotation using ontologies to summarize and describe biomedical experimental results to enable
  - Integration of their data with other researchers' results
  - Cross-species analyses
8Gene Ontology (GO)
What makes it so wildly successful?
9GO Consortium
http://www.geneontology.org/
- The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genomes of three model organisms:
  - Drosophila melanogaster (fruit fly) (FlyBase)
  - Mus musculus (mouse) (MGD)
  - Saccharomyces cerevisiae (yeast) (SGD)
- Many other model organism databases have joined the GO consortium, contributing
  - development of the ontologies
  - annotations for the genes of one or more organisms
10Need for annotation of genome sequences
- What is Gene Ontology? GO provides a controlled vocabulary to describe gene and gene product attributes in any organism: how gene products behave in a cellular context.
- Three key concepts (currently 25804 GO terms in total, Oct. 2008):
  - Biological process: a series of events accomplished by one or more ordered assemblies of molecular functions, e.g. signal transduction, pyrimidine metabolism, and alpha-glucoside transport. Total: 15161
  - Molecular function: describes activities, such as catalytic or binding activities, that occur at the molecular level. These are activities that can be performed by individual gene products, or by assembled complexes of gene products, e.g. catalytic activity, transporter activity. Total: 8425
  - Cellular component: a component of a cell that is part of some larger object, which may be an anatomical structure (e.g. ER or nucleus) or a gene product group (e.g. ribosome, or a protein dimer). Total: 2218
- GO annotation
  - Characterization of gene products using GO terms
  - Members submit their data, which are available at the GO website.
11GO Representation: Tree or Network?
GO is a network structure
Node, a concept or a term
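The difference between a tree and a network can be made concrete with a toy graph in which one node has two parents. The sketch below is only an illustration: the term names are placeholders (not real GO terms), and all link types are collapsed into a single "parent" link, which is a simplifying assumption.

```python
# Toy graph: each node (term) maps to the list of its parent terms.
# Term names are placeholders, not real GO terms.
PARENTS = {
    "root": [],
    "term_B": ["root"],
    "term_C": ["root"],
    "term_D": ["term_B", "term_C"],  # two parents: not possible in a strict tree
}

def is_tree(parents):
    """A strict tree allows at most one parent per node."""
    return all(len(p) <= 1 for p in parents.values())

def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen
```

Because `term_D` has two parents, `is_tree(PARENTS)` is false, and `ancestors("term_D", PARENTS)` reaches the root through both paths. A gene product annotated to `term_D` is therefore implicitly associated with every one of those ancestors.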
12http://www.geneontology.org/
13GO term (GO:0006366): mRNA transcription from RNA polymerase II promoter
GO search and display tool
14Human p53 GO annotation (UniProtKB:P04637)
GO:0006289 nucleotide-excision repair
PMID:7663514, evidence: IMP
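A single GO annotation ties together a gene product, a GO term, a literature reference and an evidence code. The record below is a minimal sketch of that idea using the p53 example from this slide; the field names and the `describe` helper are hypothetical and do not reproduce the GO Consortium's annotation file format. Identifiers are written in the usual `prefix:number` form.

```python
from typing import NamedTuple

# Expansions of two GO evidence codes (the ones that appear in this deck).
EVIDENCE_CODES = {
    "IMP": "Inferred from Mutant Phenotype",
    "IDA": "Inferred from Direct Assay",
}

class GOAnnotation(NamedTuple):
    """One annotation: who is annotated, with what term, on what evidence."""
    db_object: str   # gene product identifier
    go_id: str       # GO term identifier
    go_term: str     # GO term name
    reference: str   # supporting literature
    evidence: str    # evidence code

p53_ner = GOAnnotation(
    db_object="UniProtKB:P04637",
    go_id="GO:0006289",
    go_term="nucleotide-excision repair",
    reference="PMID:7663514",
    evidence="IMP",
)

def describe(annotation):
    """Render an annotation as one readable line."""
    meaning = EVIDENCE_CODES.get(annotation.evidence, "unknown evidence code")
    return (f"{annotation.db_object} | {annotation.go_id} "
            f"({annotation.go_term}) | {annotation.reference} | "
            f"{annotation.evidence}: {meaning}")
```

Keeping the reference and evidence code on every record is what lets a user trace an annotation back to the experiment that supports it.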
15GO annotation of gene products
- Science basis of the GO: trained experts use the experimental observations from literature to associate GO terms with gene products (to annotate the entities represented in the gene/protein databases)
- Enabling data integration across databases and making them available to semantic search
http://www.geneontology.org/GO.current.annotations.shtml
46
Human, mouse, plant, worm, yeast
16What GO is NOT
- Ontology of gene products: e.g. cytochrome c is not in GO, but attributes of cytochrome c are, e.g. oxidoreductase activity.
- Processes, functions and components unique to mutants or diseases: e.g. oncogenesis is not a valid GO term.
- Protein domains or structural features.
- Protein-protein interactions.
- Environment, evolution and expression.
- Anatomical or histological features above the level of cellular components, including cell types.
Nor is GO an ontology of genes! The name is a misnomer.
17Missing GO nodes
not deep enough; not broad enough
18Lack of connections among GOs
19GO: A Common Standard for Omics Data Analysis
what molecular function?
what biological process?
what cellular component?
20need more
- need to improve the quality of GO to support more rigorous logic-based reasoning across the data annotated in its terms
- need to extend the GO by engaging ever broader community support for addition of new terms and for correction of errors
- need to extend the methodology to other domains, including clinical domains, such as
  - disease ontology
  - immunology ontology
  - symptom (phenotype) ontology
  - clinical trial ontology
  - ...
21http://www.obofoundry.org/
- Establish common rules governing best practices for creating ontologies and for using these in annotations
- Apply these rules to create a complete suite of orthogonal interoperable biomedical reference ontologies
National Center for Biomedical Ontology (NCBO)
http://bioontology.org/
22http://www.obofoundry.org/index.cgi?sort=domain&show=ontologies
23The OBO Foundry
- A family of interoperable gold standard biomedical reference ontologies to serve annotation of
  - scientific literature
  - model organism databases
  - clinical trial data
OBO Foundry: a subset of OBO ontologies, whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development, designed to ensure
- tight connection to the biomedical basic sciences
- compatibility, interoperability, common relations
- support for logic-based reasoning
OBO Foundry Principles: http://www.obofoundry.org/crit.shtml
24Rationale of OBO Foundry coverage
 | CONTINUANT, INDEPENDENT | CONTINUANT, INDEPENDENT | CONTINUANT, DEPENDENT | CONTINUANT, DEPENDENT | OCCURRENT
ORGAN AND ORGANISM | Organism (NCBI Taxonomy) | Anatomical Entity (FMA, CARO) | Organ Function (FMP, CPRO) | Phenotypic Quality (PaTO) | Biological Process (GO)
CELL AND CELLULAR COMPONENT | Cell (CL) | Cellular Component (FMA, GO) | Cellular Function (GO) | Phenotypic Quality (PaTO) | Biological Process (GO)
MOLECULE | Molecule (ChEBI, SO, RNAO, PRO) | Molecule (ChEBI, SO, RNAO, PRO) | Molecular Function (GO) | Molecular Function (GO) | Molecular Process (GO)
25OBO Relation Ontology
- Foundational: is_a, part_of
- Spatial: located_in, contained_in, adjacent_to
- Temporal: transformation_of, derives_from, preceded_by
- Participation: has_participant, has_agent
e.g. A is_a B. Definition: every instance of A is an instance of B.
rose is_a plant => all instances of rose are instances of plant
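The instance-level definition of is_a can be checked mechanically. The following is a minimal sketch under stated assumptions: each class has at most one is_a parent, is_a is followed transitively, and the class `organism` and the individual `my_rose` are hypothetical additions for illustration.

```python
# Class-level is_a links (child -> parent). "organism" is a hypothetical
# extra level added so that a chain of links can be followed.
IS_A = {"rose": "plant", "plant": "organism"}

# Individual -> the class it was directly asserted to belong to.
INSTANCE_OF = {"my_rose": "rose"}

def is_a(child, ancestor, edges=IS_A):
    """True if child reaches ancestor through one or more is_a links."""
    current = child
    while current in edges:
        current = edges[current]
        if current == ancestor:
            return True
    return False

def instance_of(individual, cls):
    """Every instance of A is an instance of B whenever A is_a B."""
    direct = INSTANCE_OF[individual]
    return direct == cls or is_a(direct, cls)
```

Here `instance_of("my_rose", "plant")` holds even though only membership in `rose` was asserted, while `is_a("plant", "rose")` does not hold because the relation is directional.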
26What is Protein Ontology? Why?
PRO
http://pir.georgetown.edu/pro/
27The Need for Representation of Various Protein Forms
Glucocorticoid receptor (GR) and PTMs
28Sphingomyelin phosphodiesterase (SMPD1)
(ASM_HUMAN)
- Cleavage sites
- lysosomal: the enzyme is transported from the Golgi apparatus to the lysosome after addition of mannose-6-phosphate (M6P) moieties and binding to the M6P receptor.
- secreted: the shorter cleaved form is not modified with M6P and is targeted for secretion to the extracellular space, with different functions such as LDL binding and oxidized LDL catabolism.
29Alternative splicing
a single new contact between Phe32 (F32) of FGF8b
and a hydrophobic groove within Ig domain 3 of
FGFR2c
Olsen et al., Genes Dev. 2006
FGF8a and FGF8b differ in their ability to pattern the embryonic brain
- Only FGF8b can transform midbrain to cerebellum, whereas FGF8a causes an overgrowth of midbrain.
[Figure labels: FGF8a, FGF8b; FGF8_HUMAN alternative splicing]
30GOA for Transcription factor Ovo-like 2
- Form 1 (long): GO:0045892, IDA, negative regulation of transcription, DNA-dependent
- Form 2 (short): GO:0045893, IDA, positive regulation of transcription, DNA-dependent
- Gene. 2004; 336:47-58. PMID:15225875
274 aa
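The Ovo-like 2 example shows why annotation at the level of the gene product alone can be misleading: the long and short forms carry opposite GO terms. The sketch below groups the two annotations from this slide either per protein or per form; the tuple layout and the `group_terms` helper are assumptions made for illustration.

```python
from collections import defaultdict

# (protein, form, GO id, evidence code, GO term), as given on the slide.
ANNOTATIONS = [
    ("Ovo-like 2", "long", "GO:0045892", "IDA",
     "negative regulation of transcription, DNA-dependent"),
    ("Ovo-like 2", "short", "GO:0045893", "IDA",
     "positive regulation of transcription, DNA-dependent"),
]

def group_terms(annotations, by_form):
    """Collect GO ids per protein, or per (protein, form) when by_form is True."""
    grouped = defaultdict(set)
    for protein, form, go_id, _evidence, _term in annotations:
        key = (protein, form) if by_form else protein
        grouped[key].add(go_id)
    return dict(grouped)

per_protein = group_terms(ANNOTATIONS, by_form=False)
per_form = group_terms(ANNOTATIONS, by_form=True)
```

In `per_protein` a single entry carries both the negative and the positive regulation term, whereas `per_form` keeps each term attached to the form for which it was observed.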
31The Need for Protein Classes Representing
Protein Evolutionary Relationships
- Genes/proteins identified in model organisms, such as mouse, yeast, fly, may have important functional implications in human.
- Gene function in a model organism may not apply to human
- Animal models for human diseases, such as mouse models for diabetes, arthritis, and tumor.
- Essential genes may be redundant and nonessential in another species due to functional compensation, e.g.
  - mutation of Rb1 causes retinoblastoma in early childhood
  - Rb1 knock-out mouse did not develop retinoblastoma because of compensation from a functional homolog p107.
- Close examination of proteins in phylogenetic classes and their functional convergence and divergence in an ontological structure is important for application of disease models.
32Implications of Protein Evolution
[Figure: phylogenetic tree descending from a common ancestor; leaf labels: B. subtilis, Human, Mouse, Chimp, Yeast, Worm, E. coli, Rat, Fly]
- Conclusions from experiments performed on proteins from one organism are often applicable to the homologous protein from another organism.
- Information learned about existing proteins allows us to infer the properties of ancestral proteins.
33Protein Evolution
Sequence changes
Domain shuffling
With enough similarity, one can trace back to a
common origin
What about these?
34Functional convergence
- Protein classes of the same function derived from
different evolutionary origins, e.g. carbonate
dehydratase (or carbonic anhydrase EC 4.2.1.1),
which has three independent gene families with
functional convergence.
Animal and prokaryotic type
Plant and prokaryotic type
Archaea type
35Functional divergence
[Figure: gene tree. Gene duplication (TGM3/EPB42 split) followed by speciation (human/mouse split). TGM3 branch: TGM3 (Human), TGM3 (Mouse). EPB42 branch: EPB42 (Human), EPB42 (Mouse).]
- TGM3: Protein-glutamine gamma-glutamyltransferase (transglutaminase involved in protein modification)
- EPB42: Erythrocyte membrane protein band 4.2 (constituent of cytoskeleton involved in cell shape)
36The Need for Protein Ontology
- Data integration and knowledge management for -omics work.
- A gap exists in OBO for gene products.
- Protein Ontology (PRO) will contain two connected components (or subontologies):
  - ProEvo captures the protein classes represented by protein families at fold, domain and full-length levels that reflect evolutionary relationships
  - ProForm captures the specific protein objects of a specific gene resulting from alternative splicing, posttranslational modification, and genetic variations.
- ProEvo and ProForm are connected through the reference (canonical) protein sequence currently annotated in UniProtKB.
- PRO formalization of these detailed protein objects and classes will allow accurate and consistent proteomics experimental design and data analysis/integration.
37PRO Framework
- PRO is designed to be a formal and well-principled OBO Foundry ontology for protein entities.
- Attributes of objects will take the form of links to other ontologies, such as gene (GO), sequence (SO), modification (PSI-MOD) and disease (DO) ontologies.
- A PRO prototype for TGF-beta signaling proteins was built based on this framework.
- In this way, PRO aims at providing an ontological framework to define protein entities and evolutionary-related classes that the community can adopt for different purposes, e.g.
  - annotation of entity attributes,
  - mapping of objects in pathways, and
  - modeling of biological system dynamics and disease.
38Protein Ontology (PRO) http://pir.georgetown.edu/pro/
39Mothers against decapentaplegic homolog 2
Smad 2
GO annotation of SMAD2_HUMAN
- Cellular Component: nucleus
- Molecular Function: protein binding
- Biological Process: signal transduction; regulation of transcription, DNA-dependent
40TGF-b
[Figure: TGF-beta signaling diagram. Labels: TGF-beta receptor (I, II), Smad 2, Smad 4, CAMK2, ERK1, P (phosphate groups), Cytoplasm, Nucleus. Numbered steps: 1 phosphorylation; 2 complex formation; 3 nuclear translocation; 4 DNA binding and transcription regulation.]
41Smad2 gene products
Forms | Complex formation | Location | Transcription (Txn) | ID
normal | | Cytoplasmic | | PRO:00000011
TGF-b receptor phosphorylated | Forms complex | Nuclear | Txn upregulation | PRO:00000013
ERK1 phosphorylated | Forms complex | Nuclear | Txn upregulation | PRO:00000014
CAMK2 phosphorylated | Forms complex | Cytoplasmic | No Txn upregulation | PRO:00000015
alternatively spliced short form | | Cytoplasmic | | PRO:00000016
phosphorylated short form | | Nuclear | Txn upregulation | PRO:00000018
point mutation (causative agent large intestine carcinoma) | Doesn't form complex | Cytoplasmic | No Txn upregulation | PRO:00000019
[Figure labels: Smad 2, SMAD2_HUMAN]
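Giving each Smad2 form its own identifier makes it possible to ask questions about forms rather than about the gene as a whole. The sketch below encodes the rows of the Smad2 table as records and filters them. The type `ProteinForm`, its field names and the helper `forms_where` are hypothetical, the `PRO:` prefix on identifiers is an assumption about the original punctuation, and `None` marks cells for which the slide gives no value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ProteinForm:
    pro_id: str
    description: str
    location: str
    forms_complex: Optional[bool]      # None: not stated on the slide
    txn_upregulation: Optional[bool]   # None: not stated on the slide

SMAD2_FORMS = [
    ProteinForm("PRO:00000011", "normal", "Cytoplasmic", None, None),
    ProteinForm("PRO:00000013", "TGF-b receptor phosphorylated", "Nuclear", True, True),
    ProteinForm("PRO:00000014", "ERK1 phosphorylated", "Nuclear", True, True),
    ProteinForm("PRO:00000015", "CAMK2 phosphorylated", "Cytoplasmic", True, False),
    ProteinForm("PRO:00000016", "alternatively spliced short form", "Cytoplasmic", None, None),
    ProteinForm("PRO:00000018", "phosphorylated short form", "Nuclear", None, True),
    ProteinForm("PRO:00000019", "point mutation", "Cytoplasmic", False, False),
]

def forms_where(**criteria):
    """Return the IDs of all forms whose fields equal the given values."""
    return [form.pro_id for form in SMAD2_FORMS
            if all(getattr(form, key) == value for key, value in criteria.items())]

nuclear_active = forms_where(location="Nuclear", txn_upregulation=True)
no_upregulation = forms_where(txn_upregulation=False)
```

A single UniProtKB entry (SMAD2_HUMAN) cannot distinguish these cases, whereas a query such as `forms_where(location="Nuclear", txn_upregulation=True)` returns only the forms that are nuclear and upregulate transcription.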
42PRO allows proper representations of protein
forms in pathways
TGF-beta signaling pathway (REACT_6844)
Each step in the pathway is described by a
Reactome event ID. Bold PRO IDs indicate objects
that undergo some modification that is relevant
for function (the modified form is underlined).
From Arighi et al., SIG2008.
43PRO hierarchy in OBO-Edit
ProEvo: Representing evolutionary-related protein classes. In this example, children of TGF-beta-like cysteine-knot cytokine have a common architecture consisting of a signal peptide, a variable propeptide region and a transforming growth factor beta-like domain that is a cysteine-knot domain. Pfam:PF00019 "has_part Transforming growth factor beta like domain".
ProForm: Representing multiple protein products of a gene. Only forms with experimental data are included. When common protein forms exist in human and mouse, a single node is created (see details below).
OBO relations: is_a, derives_from
44Summary
- The vision of the biomedical ontology community is that all biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that they are semantically interoperable and useful for improving biomedical science and clinical care.
- The scope extends to all knowledge and data that are relevant to the understanding or improvement of human biology and health.
- Knowledge and data are semantically interoperable when they enable predictable, meaningful computation across knowledge sources developed independently to meet diverse needs.
- Principled ontologies are ones that follow NCBO-recommended formats and methodologies for ontology development, maintenance, and use.