http://www.geneontology.org/index.shtml - PowerPoint PPT Presentation

Slides: 24
Provided by: KerCh3
Learn more at: http://www.stat.ucla.edu
Transcript and Presenter's Notes

1
http://www.geneontology.org/index.shtml
  • An Introduction to the
  • Gene Ontology
  • (GO)

The Gene Ontology project provides a controlled
vocabulary to describe gene and gene product
attributes in any organism.
2
http://www.geneontology.org/index.shtml
  • Search the Gene Ontology Database: search for
    genes, proteins or GO terms using AmiGO, by gene
    or protein name or by GO term or ID.
  • AmiGO is the official GO browser and search
    engine. Browse the Gene Ontology with AmiGO.

3
  • What does the Gene Ontology Consortium do?
  • Terms in the Gene Ontology
  • Species-specific terms
  • Obsolete terms
  • The Ontologies
  • Cellular component
  • Biological process
  • Molecular function
  • Ontology structure
  • What GO is NOT
  • Annotation and tools
  • Downloads
  • Beyond GO
  • Cross-products
  • Mappings to other classification systems
  • Contributing to GO

4
What does the Gene Ontology Consortium do?
  • Biologists currently waste a lot of time and
    effort in searching for all of the available
    information about each small area of research.
  • This is hampered further by the wide variations
    in terminology that may be common usage at any
    given time, which inhibit effective searching by
    both computers and people.
  • For example, if you were searching for new
    targets for antibiotics, you might want to find
    all the gene products that are involved in
    bacterial protein synthesis, and that have
    significantly different sequences or structures
    from those in humans. If one database describes
    these molecules as being involved in
    'translation', whereas another uses the phrase
    'protein synthesis', it will be difficult for you
    - and even harder for a computer - to find
    functionally equivalent terms.
  • The Gene Ontology (GO) project is a collaborative
    effort to address the need for consistent
    descriptions of gene products in different
    databases.
  • The project began as a collaboration between
    three model organism databases, FlyBase
    (Drosophila), the Saccharomyces Genome Database
    (SGD) and the Mouse Genome Database (MGD), in
    1998.
  • Since then, the GO Consortium has grown to
    include many databases, including several of the
    world's major repositories for plant, animal and
    microbial genomes. See the GO Consortium page for
    a full list of member organizations.

5
  • The GO project has developed three structured
    controlled vocabularies (ontologies) that
    describe gene products in terms of their
    associated biological processes, cellular
    components and molecular functions in a
    species-independent manner.
  • There are three separate aspects to this effort:
  • first, the development and maintenance of the
    ontologies themselves;
  • second, the annotation of gene products, which
    entails making associations between the
    ontologies and the genes and gene products in the
    collaborating databases;
  • and third, the development of tools that
    facilitate the creation, maintenance and use of
    ontologies.
  • The use of GO terms by collaborating databases
    facilitates uniform queries across them.
  • The controlled vocabularies are structured so
    that they can be queried at different levels:
  • for example, you can use GO to find all the gene
    products in the mouse genome that are involved in
    signal transduction,
  • or you can zoom in on all the receptor tyrosine
    kinases.
  • This structure also allows annotators to assign
    properties to genes or gene products at different
    levels, depending on the depth of knowledge about
    that entity.

6
Terms in the Gene Ontology
  • The building blocks of the Gene Ontology are the
    terms, so what makes up a GO term?
  • Each entry in GO has a unique numerical
    identifier of the form GO:nnnnnnn, and a term
    name, e.g. cell, fibroblast growth factor
    receptor binding or signal transduction.
  • Each term is also assigned to one of the three
    ontologies, molecular function, cellular
    component or biological process.
  • The majority of terms have a textual definition,
    with references stating the source of the
    definition.
  • If any clarification of the definition or remarks
    about term usage are required, these are held in
    a separate comments field.
  • Many GO terms have synonyms. GO uses 'synonym' in
    a loose sense, as the names within the synonyms
    field may not mean exactly the same as the term
    they are attached to. Instead, a GO synonym may
    be broader or narrower than the term string; it
    may be a related phrase; it may be alternative
    wording or spelling, or use a different system of
    nomenclature; or it may be a true synonym. This
    flexibility allows GO synonyms to serve as
    valuable search aids, as well as being useful for
    applications such as text mining and semantic
    matching. The relationship of the synonym to the
    term is recorded within the GO file.
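
The parts of a GO term listed above can be pictured as a small record. The following is a minimal sketch, not the GO database schema: the class name GOTerm, its field names, and the four synonym scope labels (borrowed from the OBO file format) are assumptions made for illustration, and the example synonym is invented rather than taken from the GO file.

```python
import re
from dataclasses import dataclass, field

# GO identifiers have the form GO:nnnnnnn (seven digits).
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")
ASPECTS = {"molecular_function", "cellular_component", "biological_process"}
# Assumed scope labels for the synonym-to-term relationship.
SYNONYM_SCOPES = {"EXACT", "BROAD", "NARROW", "RELATED"}


@dataclass
class GOTerm:
    go_id: str
    name: str
    aspect: str                 # one of the three ontologies
    definition: str = ""        # most, but not all, terms have one
    comment: str = ""           # clarifications and usage remarks
    synonyms: list = field(default_factory=list)  # (text, scope) pairs

    def __post_init__(self):
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"malformed GO identifier: {self.go_id!r}")
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown ontology: {self.aspect!r}")

    def add_synonym(self, text, scope):
        if scope not in SYNONYM_SCOPES:
            raise ValueError(f"unknown synonym scope: {scope!r}")
        self.synonyms.append((text, scope))

    def matches(self, query):
        """Case-insensitive search over the term name and its synonyms."""
        q = query.strip().lower()
        names = [self.name] + [text for text, _ in self.synonyms]
        return any(q == n.lower() for n in names)


term = GOTerm("GO:0007165", "signal transduction", "biological_process")
term.add_synonym("signalling pathway", "RELATED")  # illustrative synonym only
print(term.matches("Signalling Pathway"))
```

Keeping the scope next to each synonym is what lets a search tool decide whether a loose match (broader, narrower, related) should count as a hit.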

7
  • The scope of the Gene Ontology overlaps with a
    number of other databases, and in cases where a
    GO term is identical in meaning to an object in
    another database, a database cross reference is
    added to the term. These cross references can
    also be downloaded from the mappings to GO page.
  • Species-specific terms: The Gene Ontology aims to
    provide a controlled vocabulary that can be used
    to describe any organism; nevertheless, many
    functions, processes and components are not
    common to all life forms. The convention is to
    include any term that can apply to more than one
    taxonomic class of organism. To specify the class
    of organisms to which a term is applicable, GO
    uses the designator sensu, 'in the sense of'; for
    example, trichome differentiation (sensu
    Magnoliophyta) represents the differentiation of
    plant hair cells (trichomes).
  • Obsolete terms: Occasionally, a term is found
    that is outside the scope of GO, is misleadingly
    named or defined, or describes a concept that
    would be better represented in another way.
    Rather than delete the term, it is deprecated or
    made obsolete. The term and ID still exist in the
    GO database, but the term is marked as obsolete,
    and a comment is added, giving a reason for the
    obsoletion and recommending alternative terms
    where appropriate.

8
The Ontologies
  • The three organizing principles of GO are
    cellular component, biological process and
    molecular function. A gene product might be
    associated with or located in one or more
    cellular components; it is active in one or more
    biological processes, during which it performs
    one or more molecular functions. For example, the
    gene product cytochrome c can be described by the
    molecular function term oxidoreductase activity,
    the biological process terms oxidative
    phosphorylation and induction of cell death, and
    the cellular component terms mitochondrial matrix
    and mitochondrial inner membrane.

9
Cellular component
  • A cellular component is just that, a component of
    a cell, but with the proviso that it is part of
    some larger object; this may be an anatomical
    structure (e.g. rough endoplasmic reticulum or
    nucleus) or a gene product group (e.g. ribosome,
    proteasome or a protein dimer). See the
    documentation on the cellular component ontology
    for more details.

10
Biological process
  • A biological process is a series of events
    accomplished by one or more ordered assemblies of
    molecular functions. Examples of broad biological
    process terms are cellular physiological process
    or signal transduction.
  • Examples of more specific terms are pyrimidine
    metabolism or alpha-glucoside transport.
  • It can be difficult to distinguish between a
    biological process and a molecular function, but
    the general rule is that a process must have more
    than one distinct step.
  • A biological process is not equivalent to a
    pathway. At present, GO does not try to represent
    the dynamics or dependencies that would be
    required to fully describe a pathway.
  • Further information can be found in the process
    ontology documentation.

11
Molecular function
  • Molecular function describes activities, such as
    catalytic or binding activities, that occur at
    the molecular level.
  • GO molecular function terms represent activities
    rather than the entities (molecules or complexes)
    that perform the actions, and do not specify
    where or when, or in what context, the action
    takes place.
  • Molecular functions generally correspond to
    activities that can be performed by individual
    gene products, but some activities are performed
    by assembled complexes of gene products.
  • Examples of broad functional terms are catalytic
    activity, transporter activity, or binding;
  • examples of narrower functional terms are
    adenylate cyclase activity or Toll receptor
    binding.
  • It is easy to confuse a gene product name with
    its molecular function, and for that reason many
    GO molecular functions are appended with the word
    "activity". The documentation on gene products
    explains this confusion in more depth. The
    documentation on the function ontology explains
    more about GO functions and the rules governing
    them.

12
Ontology structure
  • The terms in an ontology are linked by two
    relationships, is_a and part_of. is_a is a simple
    class-subclass relationship,
  • where A is_a B means that A is a subclass of B;
    for example, nuclear chromosome is_a chromosome.
  • part_of is slightly more complex: C part_of D
    means that whenever C is present, it is always a
    part of D, but C does not always have to be
    present. An example would be nucleus part_of
    cell: nuclei are always part of a cell, but not
    all cells have nuclei.
  • The ontologies are structured as directed acyclic
    graphs, which are similar to hierarchies but
    differ in that a child, or more specialized, term
    can have many parents, or less specialized,
    terms.
  • For example, the biological process term hexose
    biosynthesis has two parents, hexose metabolism
    and monosaccharide biosynthesis. This is because
    biosynthesis is a subtype of metabolism, and a
    hexose is a type of monosaccharide.
  • When any gene involved in hexose biosynthesis is
    annotated to this term, it is automatically
    annotated to both hexose metabolism and
    monosaccharide biosynthesis, because every GO
    term must obey the true path rule: if the child
    term describes the gene product, then all its
    parent terms must also apply to that gene
    product.
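
The true path rule can be illustrated with a short sketch that propagates a direct annotation up through all parents in the graph. The hexose biosynthesis example comes from the slide; the extra grandparent term "monosaccharide metabolism", the gene name "geneX", and the function names are assumptions added only to show propagation across more than one level.

```python
# Child term -> set of parent terms (a DAG: a child may have several parents).
PARENTS = {
    "hexose biosynthesis": {"hexose metabolism", "monosaccharide biosynthesis"},
    # Assumed for illustration: a shared grandparent term.
    "hexose metabolism": {"monosaccharide metabolism"},
    "monosaccharide biosynthesis": {"monosaccharide metabolism"},
    "monosaccharide metabolism": set(),
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents):
    """Apply the true path rule: a gene annotated to a term is also
    annotated to all of that term's ancestors."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


annotations = propagate({"geneX": {"hexose biosynthesis"}}, PARENTS)
print(sorted(annotations["geneX"]))
```

Tracking visited terms in a set matters in a DAG, because the same ancestor can be reached through more than one parent.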

13
What GO is NOT
  • It is important to clearly state the scope of GO,
    and what it does and does not cover. The
    ontologies section explains the domains covered
    by GO. The following areas are outside the scope
    of GO, and terms in these domains would not
    appear in the ontologies:
  • Gene products: e.g. cytochrome c is not in the
    ontologies, but attributes of cytochrome c, such
    as oxidoreductase activity, are.
  • Processes, functions or components that are
    unique to mutants or diseases: e.g. oncogenesis
    is not a valid GO term because causing cancer is
    not the normal function of any gene.
  • Attributes of sequence such as intron/exon
    parameters: these are not attributes of gene
    products and will be described in a separate
    sequence ontology (see the OBO website for more
    information).
  • Protein domains or structural features.
  • Protein-protein interactions.
  • Environment, evolution and expression.
  • Anatomical or histological features above the
    level of cellular components, including cell
    types.
  • GO is not a database of gene sequences, nor a
    catalog of gene products. Rather, GO describes
    how gene products behave in a cellular context.
  • GO is not a dictated standard, mandating
    nomenclature across databases. Groups participate
    because of self-interest, and cooperate to arrive
    at a consensus.
  • GO is not a way to unify biological databases
    (i.e. GO is not a 'federated solution'). Sharing
    vocabulary is a step towards unification, but is
    not, in itself, sufficient. Reasons for this
    include the following:
  • Knowledge changes and updates lag behind.
  • Individual curators evaluate data differently.
    While we can agree to use the word 'kinase', we
    must also agree to support this by stating how
    and why we use 'kinase', and consistently apply
    it. Only in this way can we hope to compare gene
    products and determine whether they are related.
  • GO does not attempt to describe every aspect of
    biology; its scope is limited to the domains
    described above.

14
Annotation and tools
  • How do the terms in GO become associated with
    their appropriate gene products? Collaborating
    databases annotate their genes or gene products
    with GO terms, providing references and
    indicating what kind of evidence is available to
    support the annotations. More information can be
    found in the GO Annotation Guide.
  • If you browse any of the contributing databases,
    you'll find that each gene or gene product has a
    list of associated GO terms. Each database also
    publishes downloadable files containing these
    associations; these can be downloaded from the GO
    annotations page.
  • You can browse the ontologies using a range of
    web-based browsers. A full list of these, and
    other tools for analyzing gene function using GO,
    is available in the GO Tools section.
  • In addition, the GO consortium has prepared GO
    slims, 'slimmed down' versions of the ontologies
    that allow you to annotate genomes or sets of
    gene products to gain a high-level view of gene
    functions. Using GO slims you can, for example,
    work out what proportion of a genome is involved
    in signal transduction, biosynthesis or
    reproduction. See the GO Slim Guide for more
    information.
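
The GO slim idea of a high-level summary can be sketched as follows. This is a toy illustration, not the consortium's slim-mapping tool: the three slim terms are the ones named above, while the child terms, gene names, and function names are hypothetical. Each gene is counted once for every slim term that is one of its annotated terms or an ancestor of one.

```python
def slim_terms_for(term, parents, slim):
    """Return the slim terms that are `term` itself or one of its ancestors."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, ()))
    return found


def slim_proportions(annotations, parents, slim):
    """Fraction of annotated genes that map to each slim term."""
    counts = {s: 0 for s in slim}
    for terms in annotations.values():
        hits = set()
        for term in terms:
            hits |= slim_terms_for(term, parents, slim)
        for s in hits:
            counts[s] += 1
    total = len(annotations)
    return {s: counts[s] / total for s in slim} if total else {}


# Hypothetical toy ontology and annotations.
SLIM = {"signal transduction", "biosynthesis", "reproduction"}
TOY_PARENTS = {
    "receptor tyrosine kinase signaling": {"signal transduction"},
    "hexose biosynthesis": {"biosynthesis"},
    "gamete generation": {"reproduction"},
}
TOY_ANNOTATIONS = {
    "g1": {"receptor tyrosine kinase signaling"},
    "g2": {"hexose biosynthesis"},
    "g3": {"receptor tyrosine kinase signaling", "hexose biosynthesis"},
    "g4": {"gamete generation"},
}
print(slim_proportions(TOY_ANNOTATIONS, TOY_PARENTS, SLIM))
```

Because a gene can map to several slim terms, the proportions are not expected to sum to one.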

15
Beyond GO
  • GO allows us to annotate genes and their products
    with a limited set of attributes. For example, GO
    does not allow us to describe genes in terms of
    which cells or tissues they're expressed in,
    which developmental stages they're expressed at,
    or their involvement in disease. It is not
    necessary for GO to do these things because other
    ontologies are being developed for these
    purposes. The GO consortium supports the
    development of other ontologies and makes its
    tools for editing and curating ontologies freely
    available. A list of freely available ontologies
    that are relevant to genomics and proteomics and
    are structured similarly to GO can be found at
    the Open Biomedical Ontologies website. A larger
    list, which includes the ontologies listed at OBO
    and also other controlled vocabularies that do
    not fulfill the OBO criteria, is available at the
    Ontology Working Group section of the Microarray
    Gene Expression Data (MGED) Network site.

16
Download
  • All data from the GO project is freely available.
    You can download the ontology data in a number of
    different formats, including XML and MySQL, from
    the GO Downloads page. For more information on
    the syntax of these formats, see the GO File
    Format Guide.
  • If you need lists of the genes or gene products
    that have been associated with a particular GO
    term, the Current Annotations table tracks the
    number of annotations and provides links to the
    gene association files for each of the
    collaborating databases.

17
GO term enrichment
  • Hypergeometric
  • SGD example using LA output

18
(No Transcript)
19
Transitive functional annotation by shortest-path
analysis of gene expression data
Xianghong Zhou, Ming-Chih J. Kao, and Wing Hung Wong
PNAS, October 1, 2002, vol. 99, no. 20, 12783-12788
  • Fig. 1.   (A) Application of the shortest-path
    (SP) algorithm to gene expression data. Nine
    genes are depicted in the graph. The distance
    between two genes is a decreasing function of
    their correlation. For example, there are
    multiple expression dependence paths leading from
    gene a to gene e. Among them, the shortest
    dependence path is a-b-c-d-e, with genes b, c,
    and d serving as the transitive genes. This is
    the most parsimonious summary of the expression
    relationship between the terminal genes a and e.
    (B) Level 0 (L0) and level 1 (L1) matches of
    genes on the SP a-b-c-d-e defined according to
    their relationships in the Gene Ontology (GO)
    classification tree. With respect to the terminal
    genes a and e, the transitive gene b is an L0
    match because it is annotated in the informative
    node where a and e are annotated; the transitive
    gene c is an L1 match because it shares the same
    direct parent as the two terminal genes; the
    transitive gene d is neither an L0 nor an L1
    match.

20
Current methods for the functional analysis of
microarray gene expression data make the implicit
assumption that genes with similar expression
profiles have similar functions in cells.
However, among genes involved in the same
biological pathway, not all gene pairs show high
expression similarity. Here, we propose that
transitive expression similarity among genes can
be used as an important attribute to link genes
of the same biological pathway. Based on
large-scale yeast microarray expression data, we
use the shortest-path analysis to identify
transitive genes between two given genes from the
same biological process. We find that not only
functionally related genes with correlated
expression profiles are identified but also those
without. In the latter case, we compare our
method to hierarchical clustering, and show that
our method can reveal functional relationships
among genes in a more precise manner. Finally, we
show that our method can be used to reliably
predict the function of unknown genes from known
genes lying on the same shortest path. We
assigned functions for 146 yeast genes that are
considered as unknown by the Saccharomyces Genome
Database and by the Yeast Proteome Database.
These genes constitute around 5% of the unknown
yeast ORFome.
21
Data Processing
  • Saccharomyces cerevisiae gene expression profiles
    from the Rosetta Compendium (6), which includes
    300 deletion and drug treatment experiments.
    Genes were annotated by using the biological
    process ontology of Gene Ontology (GO) (7)
    provided by the Saccharomyces Genome Database
    (SGD) (8).
  • After removing the genes without GO process
    annotation and the 20 genes for which there are
    less than 80 experimental measurements in the
    Rosetta Compendium, we were left with
    266 mitochondrial, 398 cytoplasmic, and
    659 nuclear GO-annotated genes.
  • For each of the three sets of genes, we
    calculated the expression similarities of all
    gene pairs (a, b) using C_{a,b}, the minimum of the
    absolute value of leave-one-out Pearson
    correlation coefficient estimates. This estimate
    is a measurement robust against single experiment
    outliers and sensitive to overall similarities in
    expression patterns.
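
One way to read the similarity measure described above: drop each experiment in turn, compute the Pearson correlation on the remaining ones, and keep the smallest absolute value. The sketch below follows that reading. The function names are assumptions, and it ignores missing measurements, which real expression data would require handling.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def loo_min_abs_correlation(x, y):
    """Minimum over i of |Pearson r| computed with experiment i left out."""
    x, y = list(x), list(y)
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two profiles of equal length >= 3")
    return min(
        abs(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
        for i in range(len(x))
    )


# A single extreme experiment dominates the ordinary correlation,
# but the leave-one-out minimum is not fooled by it.
gene_a = [1, -1, 1, -1, 50]
gene_b = [1, 1, -1, -1, 50]
print(pearson(gene_a, gene_b), loo_min_abs_correlation(gene_a, gene_b))
```

Taking the minimum means one outlying experiment cannot create a high similarity on its own, which is the robustness property the slide describes.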

22
Graph Construction and SP Computation.
  • We constructed three graphs, one for each set of
    the 266 mitochondrial genes, the 398 cytoplasmic
    genes, and the 659 nuclear genes. In each graph,
    two genes were assigned an edge if their absolute
    expression correlation C_{a,b} was higher than
    0.6.
  • This cut-off, while conservative, nonetheless
    retains a sufficient number of connected gene
    pairs in the graph. The edge length between
    vertices a and b is
    d_{a,b} = f(C_{a,b}) = (1 - C_{a,b})^k.
    The powering factor k is used to enhance the
    differences between low and high correlations.
    Because the length of a path is the sum of the
    individual edge lengths, by exaggerating the
    differences between edge lengths, the SPs will be
    more likely to cover more transitive genes. Thus
    by increasing k we gain more power to reveal
    transitive co-expression. We set k = 6 because
    for k ≥ 6, the number of transitive genes
    stabilizes (detailed results at
    www.biostat.harvard.edu/complab/SP/). To ensure
    the quality of SPs, we consider only SPs with
    total path lengths < 0.008.
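
The graph construction and shortest-path step can be sketched as below, using the values quoted on the slide (correlation cut-off 0.6, k = 6, maximum path length 0.008). Dijkstra's algorithm is used here as a standard shortest-path method; the slide does not say which algorithm the authors used. The function names and the toy correlations are assumptions.

```python
import heapq


def build_graph(correlations, threshold=0.6, k=6):
    """correlations maps a gene pair (a, b) to its absolute correlation.
    Pairs above the threshold get an edge of length (1 - C)^k."""
    graph = {}
    for (a, b), c in correlations.items():
        if c > threshold:
            length = (1.0 - c) ** k
            graph.setdefault(a, {})[b] = length
            graph.setdefault(b, {})[a] = length
    return graph


def shortest_path(graph, source, target, max_length=0.008):
    """Dijkstra's algorithm. Returns (path, length), or None when no path
    exists or the shortest one is not below max_length."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist or dist[target] >= max_length:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]


# Toy data: a and e are only moderately correlated directly, but are
# linked through a chain of strongly correlated transitive genes.
toy = {
    ("a", "b"): 0.9, ("b", "c"): 0.9, ("c", "d"): 0.9, ("d", "e"): 0.9,
    ("a", "e"): 0.65,
    ("a", "x"): 0.5,   # below the cut-off, so no edge is created
}
graph = build_graph(toy)
print(shortest_path(graph, "a", "e"))
```

With k = 6 the four strong edges sum to far less than the single moderate edge, so the shortest path runs through the transitive genes b, c and d, which is the effect the powering factor is meant to produce.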

23
Predicting the Functions of Unknown Genes.
  • We use the SP method to classify previously
    unannotated yeast genes by adding the 3,255 ORFs
    unknown to SGD into the graphs of known genes in
    the mitochondrial, cytoplasmic, and nuclear
    compartments.
  • As before, an edge is constructed between two
    genes if their absolute expression correlation is
    higher than 0.6. 
  • For all pairs of known genes, we determine the
    SPs connecting them. For the purpose of
    functional prediction, we would like to assign a
    putative function that is as specific as possible
    to the gene. Given all known genes on a SP, we
    achieve this by tracing back their annotations
    along the GO process tree and finding their
    lowest common ancestor.
  • If the lowest ancestral node is at least 4 levels
    below the root of the GO tree, that is, it
    defines a sufficiently specific gene function, we
    then assign this function to the unknown genes on
    the SP.
  • Analogous to the L0 and L1 matches, here the L0
    prediction then corresponds to the lowest common
    ancestor, and the L1 prediction to its direct
    parent. In this way, the function represented by
    the lowest common ancestor can be more specific
    than that defined by the informative nodes.
  • For each predicted gene function, we provide
    both the number of support SPs from which the
    prediction was derived and the number of unique
    known genes on those support SPs (support genes).
    The more support genes there are, the more
    confidence we have in the corresponding
    prediction.
  • Note that a gene can be assigned putative
    functions in multiple graphs, because many genes
    are known to function in multiple cellular
    compartments.
  • Under two circumstances an unknown gene may be
    assigned multiple functions: (i) because known
    genes on a SP may each have multiple functions,
    they may share several lowest common ancestors in
    the GO tree; (ii) an unknown gene may reside in
    different SPs with different lowest common
    ancestors.
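
The prediction rule can be sketched under two simplifying assumptions that are not in the source: the GO process ontology is treated as a strict tree (one parent per term), and each known gene carries a single annotation. The tree, term names (t1 to t5b), gene names, and function names are all hypothetical.

```python
def path_to_root(term, parent):
    """List of terms from `term` up to the root of the tree."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(terms, parent):
    """Deepest term that lies on every term's path to the root."""
    paths = [path_to_root(t, parent) for t in terms]
    common = set(paths[0])
    for p in paths[1:]:
        common &= set(p)
    for node in paths[0]:  # ordered from deepest to root
        if node in common:
            return node
    return None


def depth(term, parent):
    """Number of levels between `term` and the root."""
    return len(path_to_root(term, parent)) - 1


def predict(sp_genes, annotation, parent, min_depth=4):
    """Assign the lowest common ancestor of the known genes on a shortest
    path to the unknown genes on it, if it is at least min_depth levels
    below the root."""
    known = [annotation[g] for g in sp_genes if g in annotation]
    unknown = [g for g in sp_genes if g not in annotation]
    if not known or not unknown:
        return {}
    lca = lowest_common_ancestor(known, parent)
    if lca is None or depth(lca, parent) < min_depth:
        return {}
    return {g: lca for g in unknown}


# Hypothetical tree: root > t1 > t2 > t3 > t4 > {t5a, t5b}
PARENT = {"t1": "root", "t2": "t1", "t3": "t2", "t4": "t3",
          "t5a": "t4", "t5b": "t4"}
ANNOTATION = {"a": "t5a", "b": "t4", "e": "t5b"}  # gene "u" is unknown
print(predict(["a", "b", "u", "e"], ANNOTATION, PARENT))
```

Dropping the single-annotation assumption would give several candidate ancestors per path, which corresponds to case (i) above.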