Agenda - PowerPoint PPT Presentation

Slides: 75
Provided by: George951
Transcript and Presenter's Notes

Title: Agenda


1
Agenda
  1. Biological databases related to microarray
  2. Gene Ontology
  3. KEGG
  4. Biocarta
  5. Reactome
  6. MSigDB
  7. Pathway enrichment analysis
  8. GSEA
  9. GSA
  10. Ingenuity Pathway Analysis (IPA)
  11. Motif finding

2
1. Databases
Biological pathways and knowledge are very
complex.
  • Is it possible to establish a database?
  • To systematically structure and manage the
    knowledge?
  • To validate analysis results or be incorporated
    into analysis?

3
1.1 Gene Ontology
  • Ontologies: controlled vocabularies to describe
    functions of genes.
  • The database is structured as directed acyclic
    graphs (DAGs), which differ from hierarchical
    trees in that a 'child' (more specialized term)
    can have many 'parents' (less specialized terms).
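
The difference between a DAG and a tree can be made concrete with a small sketch. The term names and parent links below are a hypothetical toy fragment chosen for illustration, not copied from the GO database; the point is only that one term may list several parents and that all of its ancestors are found by following every parent link.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its list of parents.
# Names and links are illustrative assumptions, not real GO content.
PARENTS = {
    "biological_process": [],
    "cell proliferation": ["biological_process"],
    "cell cycle": ["cell proliferation"],
    "negative regulation of cell proliferation": ["cell proliferation"],
    # Two parents: this is what makes the structure a DAG, not a tree.
    "cell cycle arrest": [
        "cell cycle",
        "negative regulation of cell proliferation",
    ],
}


def ancestors(term, parents):
    """Return every less specialized term reachable from `term`."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


result = ancestors("cell cycle arrest", PARENTS)
print(sorted(result))
```

A gene annotated to a term is implicitly annotated to all of that term's ancestors, which is one reason tests on different GO terms are dependent.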

4
1.1 Gene Ontology
Three major categories in Gene Ontology:
  • Molecular Function Ontology: the tasks performed by
    individual gene products; examples are carbohydrate
    binding and ATPase activity.
  • Biological Process Ontology: broad biological goals,
    such as mitosis or purine metabolism, that are
    accomplished by ordered assemblies of molecular
    functions.
  • Cellular Component Ontology: subcellular structures,
    locations, and macromolecular complexes; examples
    include nucleus, telomere, and origin recognition
    complex.
Current term counts as of April 2, 2005 at 18:00
Pacific time: 17708 terms, 93.8% with definitions.
  • 9263 biological_process
  • 1496 cellular_component
  • 6949 molecular_function
5
1.1 Gene Ontology
Evidence code: How is the information collected?
  • IC: inferred by curator
  • IDA: inferred from direct assay
  • IEA: inferred from electronic annotation
  • IEP: inferred from expression pattern
  • IGI: inferred from genetic interaction
  • IMP: inferred from mutant phenotype
  • IPI: inferred from physical interaction
  • ISS: inferred from sequence or structural
    similarity
  • NAS: non-traceable author statement
  • ND: no biological data available
  • RCA: inferred from reviewed computational analysis
  • TAS: traceable author statement
  • NR: not recorded
  • There may be (a lot of) errors in the database!!
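
One way to act on the evidence codes is to drop annotations whose code carries little or no curated support before running an analysis. The sketch below assumes annotations are available as (gene, term, evidence code) tuples; the gene names, the annotations, and the choice of which codes to exclude are all hypothetical and should be adapted to the question at hand.

```python
# Codes excluded by default in this sketch: electronic annotation (IEA)
# and the two codes that carry no evidence (ND, NR). This split is an
# assumption for illustration, not a recommendation from the slides.
DEFAULT_EXCLUDED = frozenset({"IEA", "ND", "NR"})

# Hypothetical annotations: (gene, GO term, evidence code).
ANNOTATIONS = [
    ("geneA", "cell cycle", "IDA"),
    ("geneB", "cell cycle", "IEA"),
    ("geneC", "cell cycle", "TAS"),
    ("geneD", "cell cycle", "ND"),
]


def filter_annotations(annotations, excluded=DEFAULT_EXCLUDED):
    """Keep only annotations whose evidence code is not excluded."""
    return [a for a in annotations if a[2] not in excluded]


kept = filter_annotations(ANNOTATIONS)
print([gene for gene, _, _ in kept])
```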

6
1.1 Gene Ontology
  • Demo
  • Go to GO: http://www.geneontology.org
  • Go to "Tools" and click on "AmiGO".
  • Click "Browse". Click on the boxes with "+" to
    expand any category to look at its subcategories.
    Click on "-" to collapse again.
  • Type the term "cell cycle" in the "Search
    GO" field. Press "Submit". You will then see all
    GO categories containing this word.
  • Click on a GO term, say "cell cycle arrest".
    Genes belonging to this GO term can be shown.
    Further filter genes by "Data source" or
    "Species".
  • Type the name "cyclin" in AmiGO. Change to the
    "genes or proteins" selection button and press
    "Submit". You will then see a number of genes
    containing this name. Press some of the "Tree
    view" links.
  • Note that in some cases, the same term category
    can exist in different places in the tree. This
    ontology is thus not strictly hierarchical, but
    shows complex "many-to-many" relationships
    between gene products, ontology terms and
    branches in the ontology tree.

7
1.2 KEGG
http://www.genome.jp/kegg/pathway.html
8
1.2 KEGG: Kyoto Encyclopedia of Genes and Genomes
KEGG is a suite of databases and associated
software, integrating our current knowledge on
molecular interaction networks in biological
processes (PATHWAY database), the information
about the universe of genes and proteins
(GENES/SSDB/KO databases), and the information
about the universe of chemical compounds and
reactions (COMPOUND/GLYCAN/REACTION databases).
The current statistics of the KEGG databases are as
follows:
  • Number of pathways: 23,574 (PATHWAY database)
  • Number of reference pathways: 265 (PATHWAY database)
  • Number of ortholog tables: 87 (PATHWAY database)
  • Number of organisms: 272 (GENOME database)
  • Number of genes: 911,584 (GENES database)
  • Number of ortholog clusters: 35,456 (SSDB database)
  • Number of KO assignments: 6,221 (KO database)
  • Number of chemical compounds: 12,737 (COMPOUND database)
  • Number of glycans: 11,017 (GLYCAN database)
  • Number of chemical reactions: 6,399 (REACTION database)
  • Number of reactant pairs: 5,953 (RPAIR database)
9
1.2 KEGG
RNA polymerase
10
1.2 KEGG
Cell cycle
11
1.2 KEGG
Parkinson's disease
Alzheimer's disease, Huntington's disease, Prion
disease.
12
1.3 Biocarta
13
1.4 Reactome
  • A manually curated and peer-reviewed (authors,
    reviewers and editors) pathway database.
  • Now annotates 5849 proteins, 4555 complexes, 4827
    reactions and 1192 pathways in Homo sapiens
    (Version 39, 2/21/2012).

14
| Database | # of pathways (gene sets) | Accuracy (manually curated?) | Includes gene-gene interactions (network graphs)? | Note |
| Gene Ontology | 17708 gene sets (2005) | No (includes many computational predictions) | No | |
| KEGG | 415 pathways, 951 diseases | Yes | Yes | |
| Biocarta | 250 pathways, 4000 proteins, 800 complexes and 3000 interactions | Yes | Yes | Cancer focused |
| Reactome | 1192 pathways (human) | Yes | Yes | |
| NCI-Nature Pathway Interaction Database (PID) | 59 pathways | Yes | Yes | Curated by Nature editorial team |
15
1.5 MSigDB
A comprehensive pathway database (mainly gene
sets without a graphical interaction model). Useful
for conventional pathway (gene set) enrichment
analysis.
  • C1: Positional gene sets (326)
  • C2: Curated gene sets (3272)
    • Canonical pathways (880)
      • Biocarta (217)
      • KEGG (186)
      • Reactome (430)
  • C3: Motif gene sets (836)
    • miRNA targets gene sets (221)
    • TF targets gene sets (615)
  • C4: Computational gene sets (881)
  • C5: GO gene sets (1454)
16
2. Enrichment analysis
  • After
    • selecting DE genes, or
    • classification, or
    • clustering,
  • we are usually given a gene list for further
    investigation.

How do we validate information contained in the
gene list by available biological knowledge?
17
2. Enrichment analysis
Cell cycle data: cells are synchronized and
samples taken at various time points (covering 2
cell cycles). 6162 genes are included.
From Fourier analysis, 800 genes with a cyclic gene
expression pattern are selected for further
investigation. Are these 800 genes really
involved in the cell cycle?
18
2. Enrichment analysis
http://db.yeastgenome.org/cgi-bin/GO/goTermMapper
19
2. Enrichment analysis
| | Related to cell cycle | Annotated but not related to cell cycle | Not annotated | Total |
| All genes | 385 | 5703 | 74 | 6162 |
| Expression with cyclic pattern | 100 | 691 | 9 | 800 |
Is the selected set of genes enriched in the GO
term of cell cycle?
20
2. Enrichment analysis
| | Related to cell cycle | Annotated but not related to cell cycle | Total |
| Other genes | 285 | 5012 | 5297 |
| Expression with cyclic pattern | 100 | 691 | 791 |
| Total | 385 | 5703 | 6088 |
21
2. Enrichment analysis
| | Related to cell cycle | Annotated but not related to cell cycle | Total |
| Other genes | N11 | N12 | N1. |
| Expression with cyclic pattern | N21 | N22 | N2. |
| Total | N.1 | N.2 | N |
22
2. Enrichment analysis
23
2. Enrichment analysis
| | Related to cell cycle | Annotated but not related to cell cycle | Total |
| Other genes | 285 | 5012 | 5297 |
| Expression with cyclic pattern | 100 | 691 | 791 |
| Total | 385 | 5703 | 6088 |
R code for chi-square test without continuity
correction:
> chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct=F)
        Pearson's Chi-squared test
data:  matrix(c(285, 5012, 100, 691), 2, 2)
X-squared = 61.2644, df = 1, p-value = 4.99e-15
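
To see where the X-squared value comes from, the Pearson statistic can be recomputed by hand as the sum of (observed - expected)^2 / expected over the four cells, with each expected count equal to row total times column total divided by N. The following is a minimal sketch using only the Python standard library; the function name is an assumption made for this illustration. Note that R fills `matrix(...)` column by column, so the R call above tests the transpose of the table as laid out here, which leaves the statistic unchanged.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: other genes, cyclic genes. Columns: cell cycle, not cell cycle.
stat, p = chi_square_2x2([[285, 5012], [100, 691]])
print(round(stat, 4), p)
```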
24
2. Enrichment analysis
The chi-squared test is an approximate test and may
not perform well when the sample size is small. Fisher's
exact test is a better alternative.
Fisher's exact test: G genes in the genome
(G = 1663) are analyzed for a functional category F.
In a cluster of size C, h genes are found to be
in functional category F, which has m genes. The
p-value (i.e. the probability of observing h or
more annotated genes in the cluster) is calculated
as (Tavazoie et al. 1999)
  p = 1 - \sum_{i=0}^{h-1} \binom{m}{i} \binom{G-m}{C-i} / \binom{G}{C}
25
2. Enrichment analysis
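Fisher's exact test
| | Inside cluster | Outside cluster | Total |
| Inside pathway F | h | m-h | m |
| Outside pathway F | C-h | G-m-C+h | G-m |
| Total | C | G-C | G |
If genes are randomly assigned, the probability
of having h intersection genes is
  P(X = h) = \binom{m}{h} \binom{G-m}{C-h} / \binom{G}{C}
The p-value is the probability of observing h or
more intersection genes by chance:
  p = \sum_{i=h}^{\min(m, C)} \binom{m}{i} \binom{G-m}{C-i} / \binom{G}{C}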
26
2. Enrichment analysis
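Fisher's exact test
Observation:
| | Inside cluster | Outside cluster | Total |
| Inside pathway F | 39 | 1 | 40 |
| Outside pathway F | 161 | 1799 | 1960 |
| Total | 200 | 1800 | 2000 |
  • Only two tables are at least as extreme as the
    observation: the observed table itself (h = 39)
    and the table with h = 40.

h = 39:
| | Inside cluster | Outside cluster | Total |
| Inside pathway F | 39 | 1 | 40 |
| Outside pathway F | 161 | 1799 | 1960 |
| Total | 200 | 1800 | 2000 |
h = 40:
| | Inside cluster | Outside cluster | Total |
| Inside pathway F | 40 | 0 | 40 |
| Outside pathway F | 160 | 1800 | 1960 |
| Total | 200 | 1800 | 2000 |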
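
The one-sided p-value for this table is the probability of 39 or 40 pathway genes falling inside the cluster when genes are assigned at random, which is a hypergeometric tail. Below is a minimal sketch with exact integer binomial coefficients from the standard library; the function names are assumptions made for this illustration, and the symbols G, m, C and h follow the slides.

```python
from math import comb


def hypergeom_pmf(h, G, m, C):
    """P(exactly h of the m pathway genes fall in a cluster of size C)."""
    return comb(m, h) * comb(G - m, C - h) / comb(G, C)


def enrichment_p_value(h, G, m, C):
    """P(h or more pathway genes in the cluster) under random assignment."""
    upper = min(m, C)
    return sum(hypergeom_pmf(i, G, m, C) for i in range(h, upper + 1))


# Slide example: G = 2000 genes, pathway of m = 40, cluster of C = 200,
# and h = 39 pathway genes observed inside the cluster.
p = enrichment_p_value(h=39, G=2000, m=40, C=200)
print(p)
```

Only the terms for h = 39 and h = 40 contribute here, matching the two tables above. The integer arithmetic avoids rounding problems with the very large binomial coefficients involved.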
27
2. Enrichment analysis
Kolmogorov-Smirnov test (KS test)
  • A major issue of Fisher's exact test is that it
    requires an ad hoc threshold to generate the DE
    gene list.
  • The KS test is a better way to associate any gene
    ordering with pathway information.
Example: S1 = (1,2,3,5), S2 = (4,6,8,9,10);
D = max_x |F1(x) - F2(x)|
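
For the example above, D can be computed directly from the two empirical distribution functions. This is a minimal sketch, assuming S1 and S2 hold the ranks of pathway and non-pathway genes in the ordered gene list; it computes only the statistic, not a p-value.

```python
def ks_statistic(s1, s2):
    """Two-sample KS statistic: the largest gap between empirical CDFs."""

    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(v <= x for v in sample) / len(sample)

    # The gap can only change at observed values, so check those points.
    points = sorted(set(s1) | set(s2))
    return max(abs(ecdf(s1, x) - ecdf(s2, x)) for x in points)


d = ks_statistic([1, 2, 3, 5], [4, 6, 8, 9, 10])
print(d)
```

The maximum is reached at x = 5, where F1 = 1 and F2 = 1/5, so D = 0.8.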
28
2. Enrichment analysis
  • In practice, we need to search through thousands
    of GO terms to determine which GO terms are
    enriched in the selected gene set.
  • Multiple comparison problem!!
  • Difficulty: the tests are highly dependent.
  • Hierarchical structure of the GO
  • e.g. "Cell Proliferation" is a parent GO term of
    "Cell Cycle".
  • Each gene can belong to multiple GO terms.
  • e.g. the human HoxA7 gene belongs to four GO terms:
    "Development", "Nucleus", "DNA dependent
    regulation and transcription", "Transcription
    factor activity".

29
2. Enrichment analysis
  • Simple and naive way
  • Get p-values from Fisher's exact test for all
    pathways.
  • Correct by the Benjamini-Hochberg procedure to
    control FDR.
  • Problem
  • Fisher's test simplifies DE statistics into a
    biomarker list (0-1).
  • Does not consider gene dependence structure and
    pathway hierarchical dependence structure.
  • Improved methods
  • Use averaged t-statistics or Kolmogorov-Smirnov
    (KS) statistics as the pathway-specific
    enrichment score.
  • Apply a permutation test (either gene permutation
    or sample permutation) to perform FDR control.
  • Read the following papers if interested.
  • Goeman, J.J. and Bühlmann, P. (2007) Analyzing
    gene expression data in terms of gene sets:
    methodological issues, Bioinformatics, 23,
    980-987.
  • Tian, L., Greenberg, S.A., Kong, S.W.,
    Altschuler, J., Kohane, I.S. and Park, P.J.
    (2005) Discovering statistically significant
    pathways in expression profiling studies,
    Proceedings of the National Academy of Sciences
    of the United States of America, 102,
    13544-13549.
  • Efron, B. and Tibshirani, R. (2007) On testing
    the significance of sets of genes, Annals of
    Applied Statistics, 1, 107-129.
  • Subramanian, A., Tamayo, P., Mootha, V.K.,
    Mukherjee, S., Ebert, B.L., Gillette, M.A.,
    Paulovich, A., Pomeroy, S.L., Golub, T.R.,
    Lander, E.S. and Mesirov, J.P. (2005) Gene set
    enrichment analysis: A knowledge-based approach
    for interpreting genome-wide expression profiles,
    Proceedings of the National Academy of Sciences
    of the United States of America, 102, 15545-15550.
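
The simple approach at the top of this slide ends with a Benjamini-Hochberg correction of the per-pathway p-values. A minimal sketch of that step is shown here; the five input p-values are hypothetical, and the function returns adjusted p-values so that pathways with an adjusted value below a chosen FDR level are reported.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay
    # monotone in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical Fisher p-values for five pathways.
pvals = [0.001, 0.04, 0.03, 0.2, 0.5]
qvals = benjamini_hochberg(pvals)
print([round(q, 4) for q in qvals])
```

The procedure controls FDR under independence or positive dependence of the tests; the dependence among GO terms noted on the previous slide is why permutation-based alternatives are also listed.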

30
2. Enrichment analysis
  • Simple Fisher's exact test
  • Ingenuity Pathway
  • A commercial package with a good interface and
    human-curated annotation. Can generate network
    figures.
  • NIH DAVID
  • Free and web-based. Performs enrichment analysis
    (Fisher's exact test), adjusts for multiple
    comparison and generates a table of results. Uses
    multiple databases.
  • GOstats package in Bioconductor
  • Free R package. Performs enrichment analysis
    (Fisher's exact test) and generates a table of
    results. Uses only the GO database.
  • More sophisticated and systematic methods
  • Gene set enrichment analysis (GSEA; MIT,
    Mesirov's group)
  • http://www.broad.mit.edu/gsea/ (free)
  • Gene set analysis (GSA; Stanford, Tibshirani's
    group)
  • http://www-stat.stanford.edu/~tibs/GSA/
    (free)
  • Ingenuity Pathway Analysis (IPA)
  • http://www.ingenuity.com/ (commercial; Pitt has
    purchased licenses)
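
The more systematic methods listed above build on a common idea: score a gene set from per-gene statistics, then compare the score with a permutation null. The sketch below uses the average statistic as the score and gene permutation for the null. It is not the GSEA or GSA algorithm, only an illustration of that shared skeleton, and the gene names and per-gene statistics are hypothetical.

```python
import random


def gene_permutation_p_value(stats, gene_set, n_perm=2000, seed=0):
    """Score a gene set by its mean statistic and test by gene permutation.

    stats: dict mapping gene name to a per-gene statistic (e.g. a t-statistic).
    gene_set: set of gene names in the pathway.
    """
    rng = random.Random(seed)
    genes = list(stats)
    members = [g for g in genes if g in gene_set]
    k = len(members)
    observed = sum(stats[g] for g in members) / k
    hits = 0
    for _ in range(n_perm):
        # Null: a random set of k genes drawn from all measured genes.
        sample = rng.sample(genes, k)
        if sum(stats[g] for g in sample) / k >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: 100 genes; the 10 pathway genes have statistic 3.0,
# the others cycle through -2, -1, 0, 1, 2.
stats = {f"g{i}": 3.0 if i < 10 else float(i % 5 - 2) for i in range(100)}
pathway = {f"g{i}" for i in range(10)}
score, p = gene_permutation_p_value(stats, pathway)
print(score, p)
```

Gene permutation ignores correlation between genes; sample permutation (shuffling sample labels and recomputing the per-gene statistics) preserves it, at higher computational cost.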

31
2. Enrichment analysis
  • Things to note when using biological databases
  • Biological pathways and gene functions are
    complex and difficult to quantify.
  • Data may not be accurate. The analysis should
    take into account the strength of evidence.
  • May need to go to a specific database for a
    particular organism (e.g. SGD for yeast; FlyBase
    and BDGP for fly).
  • To systematically collect and manage massive
    biological knowledge from publications and
    experiments is an important and active research
    topic in bioinformatics.

32
3. Motif Finding
33
3. Motif Finding
http://web.indstate.edu/thcme/mwking/gene-regulation.html
34
3. Motif Finding
| Factor | Sequence Motif | Comments |
| c-Myc and Max | CACGTG | c-Myc first identified as retroviral oncogene; Max specifically associates with c-Myc in cells |
| c-Fos and c-Jun | TGAC/GTC/AA | both first identified as retroviral oncogenes; associate in cells, also known as the factor AP-1 |
| CREB | TGACGC/TC/AG/A | binds to the cAMP response element; family of at least 10 factors resulting from different genes or alternative splicing; can form dimers with c-Jun |
| c-ErbA; also TR (thyroid hormone receptor) | GTGTCAAAGGTCA | first identified as retroviral oncogene; member of the steroid/thyroid hormone receptor superfamily; binds thyroid hormone |
| c-Ets | G/CA/CGGAA/TGT/C | first identified as retroviral oncogene; predominates in B- and T-cells |
| GATA | T/AGATA | family of erythroid cell-specific factors, GATA-1 to -6 |
| c-Myb | T/CAACG/TG | first identified as retroviral oncogene; hematopoietic cell-specific factor |
| MyoD | CAACTGAC | controls muscle differentiation |
| NF-(kappa)B and c-Rel | GGGAA/CTNT/CCC(1) | both factors identified independently; c-Rel first identified as retroviral oncogene; predominate in B- and T-cells |
| RAR (retinoic acid receptor) | ACGTCATGACCT | binds to elements termed RAREs (retinoic acid response elements); also binds to c-Jun/c-Fos site |
| SRF (serum response factor) | GGATGTCCATATTAGGACATCT | exists in many genes that are inducible by the growth factors present in serum |
http://web.indstate.edu/thcme/mwking/gene-regulation.html
35
3. Motif Finding
  • Genes in a cluster have similar expression
    patterns.
  • They might share common regulatory motifs so they
    are expressed simultaneously.
  • It is of interest to find motifs from the gene
    clusters.
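
The idea of looking for a motif shared by the genes in a cluster can be illustrated with plain word counting: list every k-mer in each upstream sequence and keep those present in all of them. This is only a sketch of the intuition, not one of the motif-finding algorithms covered in the slides that follow, since real methods tolerate mismatches and compare against a background model. The three sequences are hypothetical, with the c-Myc/Max site CACGTG from the table in this section planted in each.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count how many sequences contain each k-mer at least once."""
    counts = Counter()
    for seq in sequences:
        # A set, so a k-mer repeated within one sequence counts once.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


# Hypothetical upstream sequences for three genes in one cluster.
cluster = [
    "TTGACACGTGAATC",
    "CACGTGTTTAGGCA",
    "GGATCCACGTGTAA",
]
counts = kmer_counts(cluster, 6)
shared = [kmer for kmer, n in counts.items() if n == len(cluster)]
print(shared)
```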

36
3. Motif Finding
The following materials are obtained from Shirley
Liu at Harvard.
37-74
3. Motif Finding (no text was transcribed for these slides)