Title: http://www.geneontology.org/index.shtml
1http://www.geneontology.org/index.shtml
- An Introduction to the Gene Ontology (GO)
The Gene Ontology project provides a controlled
vocabulary to describe gene and gene product
attributes in any organism.
2http://www.geneontology.org/index.shtml
- Search the Gene Ontology Database: search for
genes, proteins or GO terms using AmiGO, by gene
or protein name, or by GO term or ID.
- AmiGO is the official GO browser and search
engine. Browse the Gene Ontology with AmiGO.
3- What does the Gene Ontology Consortium do?
- Terms in the Gene Ontology
- Species-specific terms
- Obsolete terms
- The Ontologies
- Cellular component
- Biological process
- Molecular function
- Ontology structure
- What GO is NOT
- Annotation and tools
- Downloads
- Beyond GO
- Cross-products
- Mappings to other classification systems
- Contributing to GO
4What does the Gene Ontology Consortium do?
- Biologists currently waste a lot of time and
effort in searching for all of the available
information about each small area of research.
- This is hampered further by the wide variations
in terminology that may be common usage at any
given time, which inhibit effective searching by
both computers and people.
- For example, if you were searching for new
targets for antibiotics, you might want to find
all the gene products that are involved in
bacterial protein synthesis, and that have
significantly different sequences or structures
from those in humans. If one database describes
these molecules as being involved in
'translation', whereas another uses the phrase
'protein synthesis', it will be difficult for you,
and even harder for a computer, to find
functionally equivalent terms.
- The Gene Ontology (GO) project is a collaborative
effort to address the need for consistent
descriptions of gene products in different
databases.
- The project began as a collaboration between
three model organism databases, FlyBase
(Drosophila), the Saccharomyces Genome Database
(SGD) and the Mouse Genome Database (MGD), in
1998.
- Since then, the GO Consortium has grown to
include many databases, including several of the
world's major repositories for plant, animal and
microbial genomes. See the GO Consortium page for
a full list of member organizations.
5- The GO project has developed three structured
controlled vocabularies (ontologies) that
describe gene products in terms of their
associated biological processes, cellular
components and molecular functions in a
species-independent manner.
- There are three separate aspects to this effort:
first, the development and maintenance of the
ontologies themselves; second, the annotation of
gene products, which entails making associations
between the ontologies and the genes and gene
products in the collaborating databases; and
third, development of tools that facilitate the
creation, maintenance and use of ontologies.
- The use of GO terms by collaborating databases
facilitates uniform queries across them.
- The controlled vocabularies are structured so
that they can be queried at different levels: for
example, you can use GO to find all the gene
products in the mouse genome that are involved in
signal transduction, or you can zoom in on all the
receptor tyrosine kinases.
- This structure also allows annotators to assign
properties to genes or gene products at different
levels, depending on the depth of knowledge about
that entity.
6Terms in the Gene Ontology
- The building blocks of the Gene Ontology are the
terms, so what makes up a GO term?
- Each entry in GO has a unique numerical
identifier of the form GO:nnnnnnn, and a term
name, e.g. cell, fibroblast growth factor
receptor binding or signal transduction.
- Each term is also assigned to one of the three
ontologies: molecular function, cellular
component or biological process.
- The majority of terms have a textual definition,
with references stating the source of the
definition.
- If any clarification of the definition or remarks
about term usage are required, these are held in
a separate comments field.
- Many GO terms have synonyms. GO uses 'synonym' in
a loose sense, as the names within the synonyms
field may not mean exactly the same as the term
they are attached to. Instead, a GO synonym may
be broader or narrower than the term string; it
may be a related phrase; it may be alternative
wording, spelling or use a different system of
nomenclature; or it may be a true synonym. This
flexibility allows GO synonyms to serve as
valuable search aids, as well as being useful for
applications such as text mining and semantic
matching. The relationship of the synonym to the
term is recorded within the GO file.
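The fields listed above can be pictured as a small record type. The following is only a minimal sketch: the class name, field names and synonym scope labels are assumptions made for illustration, not the layout of any GO file format, and the example synonym is invented.

```python
import re
from dataclasses import dataclass, field

# Identifier shape described in the slide: "GO:" plus seven digits.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")

# Hypothetical labels for the three ontologies and the synonym relationships.
ONTOLOGIES = {"molecular_function", "cellular_component", "biological_process"}
SYNONYM_SCOPES = {"exact", "broad", "narrow", "related"}


@dataclass
class GOTerm:
    go_id: str
    name: str
    ontology: str
    definition: str = ""
    comment: str = ""
    synonyms: list = field(default_factory=list)  # (text, scope) pairs

    def __post_init__(self):
        # Reject records that do not match the constraints described above.
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"not a GO identifier: {self.go_id!r}")
        if self.ontology not in ONTOLOGIES:
            raise ValueError(f"unknown ontology: {self.ontology!r}")
        for text, scope in self.synonyms:
            if scope not in SYNONYM_SCOPES:
                raise ValueError(f"unknown synonym scope: {scope!r}")


# Illustrative record; the synonym text is made up for the example.
term = GOTerm(
    go_id="GO:0007165",
    name="signal transduction",
    ontology="biological_process",
    synonyms=[("signalling pathway", "related")],
)
```

Validating on construction keeps malformed identifiers out of any downstream search or text-mining step.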
7- The scope of the Gene Ontology overlaps with a
number of other databases, and in cases where a
GO term is identical in meaning to an object in
another database, a database cross reference is
added to the term. These cross references can
also be downloaded from the mappings to GO page.
- Species-specific terms: The Gene Ontology aims to
provide a controlled vocabulary that can be used
to describe any organism; nevertheless, many
functions, processes and components are not
common to all life forms. The convention is to
include any term that can apply to more than one
taxonomic class of organism. To specify the class
of organisms to which a term is applicable, GO
uses the designator sensu, 'in the sense of'; for
example, trichome differentiation (sensu
Magnoliophyta) represents the differentiation of
plant hair cells (trichomes).
- Obsolete terms: Occasionally, a term is found that
is outside the scope of GO, is misleadingly named
or defined, or describes a concept that would be
better represented in another way. Rather than
delete the term, it is deprecated or made
obsolete. The term and ID still exist in the GO
database, but the term is marked as obsolete, and
a comment is added, giving a reason for the
obsoletion and recommending alternative terms
where appropriate.
8The Ontologies
- The three organizing principles of GO are
cellular component, biological process and
molecular function. A gene product might be
associated with or located in one or more
cellular components; it is active in one or more
biological processes, during which it performs
one or more molecular functions. For example, the
gene product cytochrome c can be described by the
molecular function term oxidoreductase activity,
the biological process terms oxidative
phosphorylation and induction of cell death, and
the cellular component terms mitochondrial matrix
and mitochondrial inner membrane.
9Cellular component
- A cellular component is just that, a component of
a cell, but with the proviso that it is part of
some larger object; this may be an anatomical
structure (e.g. rough endoplasmic reticulum or
nucleus) or a gene product group (e.g. ribosome,
proteasome or a protein dimer). See the
documentation on the cellular component ontology
for more details.
10Biological process
- A biological process is a series of events
accomplished by one or more ordered assemblies of
molecular functions. Examples of broad biological
process terms are cellular physiological process
or signal transduction.
- Examples of more specific terms are pyrimidine
metabolism or alpha-glucoside transport.
- It can be difficult to distinguish between a
biological process and a molecular function, but
the general rule is that a process must have more
than one distinct step.
- A biological process is not equivalent to a
pathway; at present, GO does not try to represent
the dynamics or dependencies that would be
required to fully describe a pathway.
- Further information can be found in the process
ontology documentation.
11Molecular function
- Molecular function describes activities, such as
catalytic or binding activities, that occur at
the molecular level.
- GO molecular function terms represent activities
rather than the entities (molecules or complexes)
that perform the actions, and do not specify
where or when, or in what context, the action
takes place.
- Molecular functions generally correspond to
activities that can be performed by individual
gene products, but some activities are performed
by assembled complexes of gene products.
- Examples of broad functional terms are catalytic
activity, transporter activity, or binding;
examples of narrower functional terms are
adenylate cyclase activity or Toll receptor
binding.
- It is easy to confuse a gene product name with
its molecular function, and for that reason many
GO molecular functions are appended with the word
"activity". The documentation on gene products
explains this confusion in more depth. The
documentation on the function ontology explains
more about GO functions and the rules governing
them.
12Ontology structure
- The terms in an ontology are linked by two
relationships, is_a and part_of.
- is_a is a simple class-subclass relationship,
where A is_a B means that A is a subclass of B;
for example, nuclear chromosome is_a chromosome.
- part_of is slightly more complex: C part_of D
means that whenever C is present, it is always a
part of D, but C does not always have to be
present. An example would be nucleus part_of
cell: nuclei are always part of a cell, but not
all cells have nuclei.
- The ontologies are structured as directed acyclic
graphs, which are similar to hierarchies but
differ in that a child, or more specialized, term
can have many parents, or less specialized, terms.
- For example, the biological process term hexose
biosynthesis has two parents, hexose metabolism
and monosaccharide biosynthesis. This is because
biosynthesis is a subtype of metabolism, and a
hexose is a type of monosaccharide. When any gene
involved in hexose biosynthesis is annotated to
this term, it is automatically annotated to both
hexose metabolism and monosaccharide
biosynthesis, because every GO term must obey the
true path rule: if the child term describes the
gene product, then all its parent terms must also
apply to that gene product.
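The true path rule can be illustrated with a toy directed acyclic graph. This is a minimal sketch under stated assumptions: the parent map contains only the hexose example from the slide plus one extra ancestor ("monosaccharide metabolism") added for illustration, the gene name is hypothetical, and is_a and part_of edges are not distinguished.

```python
# Child term -> list of parent terms (a term may have several parents).
PARENTS = {
    "hexose biosynthesis": ["hexose metabolism", "monosaccharide biosynthesis"],
    "hexose metabolism": ["monosaccharide metabolism"],
    "monosaccharide biosynthesis": ["monosaccharide metabolism"],
    "monosaccharide metabolism": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def propagate(direct_annotations, parents):
    """Apply the true path rule: a gene annotated to a term is also
    annotated to all ancestors of that term."""
    full = {}
    for gene, terms in direct_annotations.items():
        implied = set(terms)
        for term in terms:
            implied |= ancestors(term, parents)
        full[gene] = implied
    return full


# "geneX" is a hypothetical gene annotated directly to one term.
annotations = propagate({"geneX": {"hexose biosynthesis"}}, PARENTS)
```

Because the graph is acyclic, the traversal always terminates; the `found` set also prevents a shared ancestor from being visited twice.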
13What GO is NOT
- It is important to clearly state the scope of GO,
and what it does and does not cover. The
ontologies section explains the domains covered
by GO; the following areas are outside the scope
of GO, and terms in these domains would not
appear in the ontologies.
- Gene products: e.g. cytochrome c is not in the
ontologies, but attributes of cytochrome c, such
as oxidoreductase activity, are.
- Processes, functions or components that are
unique to mutants or diseases: e.g. oncogenesis
is not a valid GO term because causing cancer is
not the normal function of any gene.
- Attributes of sequence such as intron/exon
parameters: these are not attributes of gene
products and will be described in a separate
sequence ontology (see the OBO website for more
information).
- Protein domains or structural features.
- Protein-protein interactions.
- Environment, evolution and expression.
- Anatomical or histological features above the
level of cellular components, including cell
types.
- GO is not a database of gene sequences, nor a
catalog of gene products. Rather, GO describes
how gene products behave in a cellular context.
- GO is not a dictated standard, mandating
nomenclature across databases. Groups participate
because of self-interest, and cooperate to arrive
at a consensus.
- GO is not a way to unify biological databases
(i.e. GO is not a 'federated solution'). Sharing
vocabulary is a step towards unification, but is
not, in itself, sufficient. Reasons for this
include the following: knowledge changes and
updates lag behind; individual curators evaluate
data differently. While we can agree to use the
word 'kinase', we must also agree to support this
by stating how and why we use 'kinase', and
consistently apply it. Only in this way can we
hope to compare gene products and determine
whether they are related.
- GO does not attempt to describe every aspect of
biology; its scope is limited to the domains
described above.
14Annotation and tools
- How do the terms in GO become associated with
their appropriate gene products? Collaborating
databases annotate their genes or gene products
with GO terms, providing references and
indicating what kind of evidence is available to
support the annotations. More information can be
found in the GO Annotation Guide. If you browse
any of the contributing databases, you'll find
that each gene or gene product has a list of
associated GO terms. Each database also publishes
downloadable files containing these associations;
these can be downloaded from the GO annotations
page. You can browse the ontologies using a range
of web-based browsers. A full list of these, and
other tools for analyzing gene function using GO,
is available in the GO Tools section. In addition,
the GO consortium has prepared GO slims, 'slimmed
down' versions of the ontologies that allow you
to annotate genomes or sets of gene products to
gain a high-level view of gene functions. Using
GO slims you can, for example, work out what
proportion of a genome is involved in signal
transduction, biosynthesis or reproduction. See
the GO Slim Guide for more information.
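The kind of high-level summary a GO slim gives can be sketched as follows. This is only an illustration under assumptions: the term names, parent links and gene names are toy data, and a gene is counted toward a slim term when any of its annotations is that term or one of its descendants.

```python
def ancestors_or_self(term, parents):
    """Return `term` together with every term reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def slim_proportions(annotations, parents, slim_terms):
    """Fraction of annotated genes that map to each slim term."""
    counts = {slim: 0 for slim in slim_terms}
    for terms in annotations.values():
        covered = set()
        for term in terms:
            covered |= ancestors_or_self(term, parents)
        for slim in slim_terms:
            if slim in covered:
                counts[slim] += 1
    total = len(annotations)
    if total == 0:
        return {slim: 0.0 for slim in slim_terms}
    return {slim: counts[slim] / total for slim in slim_terms}


# Toy ontology fragment and four hypothetical genes.
toy_parents = {
    "receptor tyrosine kinase signaling": ["signal transduction"],
    "hexose biosynthesis": ["biosynthesis"],
}
toy_annotations = {
    "geneA": {"receptor tyrosine kinase signaling"},
    "geneB": {"hexose biosynthesis"},
    "geneC": {"receptor tyrosine kinase signaling", "hexose biosynthesis"},
    "geneD": {"reproduction"},
}
slim = ["signal transduction", "biosynthesis", "reproduction"]
proportions = slim_proportions(toy_annotations, toy_parents, slim)
```

A gene with several annotations can count toward more than one slim term, so the proportions need not sum to one.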
15Beyond GO
- GO allows us to annotate genes and their products
with a limited set of attributes. For example, GO
does not allow us to describe genes in terms of
which cells or tissues they're expressed in,
which developmental stages they're expressed at,
or their involvement in disease. It is not
necessary for GO to do these things because other
ontologies are being developed for these
purposes. The GO consortium supports the
development of other ontologies and makes its
tools for editing and curating ontologies freely
available. A list of freely available ontologies
that are relevant to genomics and proteomics and
are structured similarly to GO can be found at
the Open Biomedical Ontologies (OBO) website. A larger
list, which includes the ontologies listed at OBO
and also other controlled vocabularies that do
not fulfill the OBO criteria, is available at the
Ontology Working Group section of the Microarray
Gene Expression Data (MGED) Network site.
16Download
- All data from the GO project is freely available.
You can download the ontology data in a number of
different formats, including XML and MySQL, from
the GO Downloads page. For more information on
the syntax of these formats, see the GO File
Format Guide.
- If you need lists of the genes or gene products
that have been associated with a particular GO
term, the Current Annotations table tracks the
number of annotations and provides links to the
gene association files for each of the
collaborating databases.
17GO term enrichment
- Hypergeometric
- SGD example using LA output
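The slide names the hypergeometric test without giving details, so the following is a sketch of the standard one-sided enrichment p-value rather than the presenter's own calculation. The function name, parameter names and the example numbers are assumptions chosen for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` genes annotated to a GO term
    when `sample` genes are drawn without replacement from `population`
    genes, of which `annotated` carry the term."""
    upper = min(sample, annotated)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    # Integer arithmetic is exact until this single final division.
    return favourable / comb(population, sample)


# Illustrative numbers only: 6000 genes, 100 annotated to the term,
# a gene list of 50 containing 5 annotated genes.
p = enrichment_p_value(population=6000, annotated=100, sample=50, hits=5)
```

A small p-value suggests the term is over-represented in the gene list; when many terms are tested, a multiple-testing correction would normally be applied on top of this.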
19Transitive functional annotation by shortest-path
analysis of gene expression data
- PNAS, October 1, 2002, vol. 99, no. 20,
12783-12788
- Xianghong Zhou, Ming-Chih J. Kao, and Wing Hung
Wong
- Fig. 1. (A) Application of the shortest-path
(SP) algorithm to gene expression data. Nine
genes are depicted in the graph. The distance
between two genes is a decreasing function of
their correlation. For example, there are
multiple expression dependence paths leading from
gene a to gene e. Among them, the shortest
dependence path is a-b-c-d-e, with genes b, c,
and d serving as the transitive genes. This is
the most parsimonious summary of the expression
relationship between the terminal genes a and e.
(B) Level 0 (L0) and level 1 (L1) matches of
genes on the SP a-b-c-d-e defined according to
their relationships in the Gene Ontology (GO)
classification tree. With respect to the terminal
genes a and e, the transitive gene b is a L0
match because it is annotated in the informative
node where a and e are annotated; the transitive
gene c is a L1 match because it shares the same
direct parent as the two terminal genes; the
transitive gene d is neither a L0 nor a L1 match.
20Current methods for the functional analysis of
microarray gene expression data make the implicit
assumption that genes with similar expression
profiles have similar functions in cells.
However, among genes involved in the same
biological pathway, not all gene pairs show high
expression similarity. Here, we propose that
transitive expression similarity among genes can
be used as an important attribute to link genes
of the same biological pathway. Based on
large-scale yeast microarray expression data, we
use the shortest-path analysis to identify
transitive genes between two given genes from the
same biological process. We find that not only
functionally related genes with correlated
expression profiles are identified but also those
without. In the latter case, we compare our
method to hierarchical clustering, and show that
our method can reveal functional relationships
among genes in a more precise manner. Finally, we
show that our method can be used to reliably
predict the function of unknown genes from known
genes lying on the same shortest path. We
assigned functions for 146 yeast genes that are
considered as unknown by the Saccharomyces Genome
Database and by the Yeast Proteome Database.
These genes constitute around 5% of the unknown
yeast ORFome.
21Data Processing
- We used Saccharomyces cerevisiae gene expression
profiles from the Rosetta Compendium (6), which
includes 300 deletion and drug treatment
experiments. Genes were annotated by using the
biological process ontology of Gene Ontology (GO)
(7) provided by the Saccharomyces Genome Database
(SGD) (8).
- After removing the genes without GO process
annotation and the 20 genes for which there are
fewer than 80 experimental measurements in the
Rosetta Compendium, we were left with
266 mitochondrial, 398 cytoplasmic, and
659 nuclear GO-annotated genes.
- For each of the three sets of genes, we
calculated the expression similarities of all
gene pairs (a, b) using $C_{a,b}$, the minimum of
the absolute value of leave-one-out Pearson
correlation coefficient estimates. This estimate
is a measurement robust against single experiment
outliers and sensitive to overall similarities in
expression patterns.
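The leave-one-out estimate can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' code: the function names are hypothetical, missing measurements are not handled, and a zero-variance profile is assigned a correlation of 0 by assumption.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile carries no correlation
    return sxy / sqrt(sxx * syy)


def loo_min_abs_pearson(x, y):
    """Minimum over experiments i of |Pearson(x without i, y without i)|."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two profiles of equal length >= 3")
    return min(
        abs(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
        for i in range(len(x))
    )


# Two toy profiles whose apparent similarity is driven by one experiment.
profile_a = [1, 2, 3, 4, 50]
profile_b = [2, 1, 4, 3, 50]
naive = pearson(profile_a, profile_b)
robust = loo_min_abs_pearson(profile_a, profile_b)
```

In the toy profiles the ordinary correlation is close to 1 only because of the last experiment; dropping it lowers the estimate, which is the robustness to single-experiment outliers that the slide describes.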
22Graph Construction and SP Computation.
- We constructed three graphs, one for each set of
the 266 mitochondrial genes, the 398 cytoplasmic
genes, and the 659 nuclear genes. In each graph,
two genes were assigned an edge if their absolute
expression correlation $C_{a,b}$ was higher than
0.6.
- This cut-off, while conservative, nonetheless
retains a sufficient number of connected gene
pairs in the graph. The edge length between
vertices a and b is
$d_{a,b} = f(C_{a,b}) = (1 - C_{a,b})^k$.
The powering factor k is used to enhance the
differences between low and high correlations.
Because the length of a path is the sum of the
individual edge lengths, exaggerating the
differences between edge lengths makes the SPs
more likely to cover more transitive genes. Thus
by increasing k we gain more power to reveal
transitive co-expression. We set $k = 6$ because
for $k \geq 6$, the number of transitive genes
stabilizes (detailed results at
www.biostat.harvard.edu/complab/SP/). To ensure
the quality of SPs, we consider only SPs with
total path lengths $< 0.008$.
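The graph construction and shortest-path step can be sketched with Dijkstra's algorithm. The slide does not say which SP algorithm was used, so Dijkstra is an assumption here, as are the function names and the toy correlations (chosen to mirror the a-b-c-d-e example in Fig. 1).

```python
import heapq


def build_graph(correlations, cutoff=0.6, k=6):
    """Keep gene pairs with |C| > cutoff; edge length is (1 - |C|) ** k."""
    graph = {}
    for (a, b), c in correlations.items():
        c = abs(c)
        if c > cutoff:
            length = (1.0 - c) ** k
            graph.setdefault(a, {})[b] = length
            graph.setdefault(b, {})[a] = length
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (path, total length) or (None, inf)."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]


# Toy correlations: a and e are only weakly correlated directly,
# but are linked through strongly correlated neighbours.
toy = {("a", "b"): 0.9, ("b", "c"): 0.9, ("c", "d"): 0.9,
       ("d", "e"): 0.9, ("a", "e"): 0.65, ("a", "f"): 0.3}
graph = build_graph(toy)
path, length = shortest_path(graph, "a", "e")
accepted = length < 0.008  # quality filter on total path length
```

With k = 6 the direct a-e edge has length 0.35^6, far longer than four edges of length 0.1^6, so the path through the transitive genes b, c and d is preferred.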
23Predicting the Functions of Unknown Genes.
- We use the SP method to classify previously
unannotated yeast genes by adding the 3,255 ORFs
unknown to SGD into the graphs of known genes in
the mitochondrial, cytoplasmic, and nuclear
compartments.
- As before, an edge is constructed between two
genes if their absolute expression correlation is
higher than 0.6.
- For all pairs of known genes, we determine the
SPs connecting them. For the purpose of
functional prediction, we would like to assign a
putative function that is as specific as possible
to the gene. Given all known genes on a SP, we
achieve this by tracing back their annotations
along the GO process tree and finding their
lowest common ancestor.
- If the lowest ancestral node is at least 4 levels
below the root of the GO tree, that is, it
defines a sufficiently specific gene function, we
then assign this function to the unknown genes on
the SP.
- Analogous to the L0 and L1 matches, here the L0
prediction then corresponds to the lowest common
ancestor, and the L1 prediction to its direct
parent. In this way, the function represented by
the lowest common ancestor can be more specific
than that defined by the informative nodes.
- For each predicted gene function, we provide
both the number of support SPs from which the
prediction was derived and the number of unique
known genes on those support SPs (support genes).
The more support genes there are, the more
confidence we have in the corresponding
prediction.
- Note that a gene can be assigned putative
functions in multiple graphs, because many genes
are known to function in multiple cellular
compartments.
- Under two circumstances an unknown gene may be
assigned multiple functions: (i) because known
genes on a SP may each have multiple functions,
they may share several lowest common ancestors in
the GO tree; (ii) an unknown gene may reside in
different SPs with different lowest common
ancestors.
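The lowest-common-ancestor step can be sketched on a toy tree. Several simplifications are assumed here: each term has a single parent (a tree, as the slide describes it), each known gene contributes a single annotation, and the term names and function names are placeholders rather than real GO terms or the authors' code.

```python
def path_to_root(term, parent):
    """List of terms from `term` up to the root, most specific first."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(terms, parent):
    """Most specific term shared by the root paths of all given terms."""
    paths = [path_to_root(t, parent) for t in terms]
    common = set(paths[0]).intersection(*paths[1:])
    for node in paths[0]:  # ordered from most specific to root
        if node in common:
            return node
    return None


def predict(known_terms, parent, min_depth=4):
    """Return L0/L1 predictions if the common ancestor is specific enough."""
    lca = lowest_common_ancestor(known_terms, parent)
    if lca is None:
        return None
    depth = len(path_to_root(lca, parent)) - 1  # root has depth 0
    if depth < min_depth:
        return None
    return {"L0": lca, "L1": parent.get(lca)}


# Toy tree: child -> parent, with placeholder term names.
toy_tree = {
    "level1": "root", "level2": "level1", "level3": "level2",
    "level4": "level3", "level5a": "level4", "level5b": "level4",
}
specific = predict(["level5a", "level5b"], toy_tree)
too_general = predict(["level2", "level5a"], toy_tree)
```

In the first call the known genes share an ancestor four levels below the root, so it becomes the L0 prediction and its parent the L1 prediction; in the second call the shared ancestor is too general and no function is assigned.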