Slides: 75

Provided by: George951

Learn more at: http://tsenglab.biostat.pitt.edu

1
Agenda

Biological databases related to microarray
  Gene Ontology
  KEGG
  Biocarta
  Reactome
  MSigDB
Pathway enrichment analysis
  GSEA
  GSA
  Ingenuity Pathway Analysis (IPA)
Motif finding

2
1. Databases
Biological pathways and knowledge are very
complex.

Is it possible to establish a database
to systematically structure and manage the
knowledge?
to validate analysis results or be incorporated
into the analysis?

3
1.1 Gene Ontology

Ontologies: controlled vocabularies to describe
functions of genes.
The database is structured as directed acyclic
graphs (DAGs), which differ from hierarchical
trees in that a 'child' (more specialized term)
can have many 'parents' (less specialized terms).
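
To make the DAG idea concrete, here is a minimal Python sketch. The term names (`child`, `parent_a`, `parent_b`, `root`) are hypothetical placeholders, not real GO terms; the point is only that one term may list several parents, which a tree would not allow.

```python
# Toy DAG: each term maps to the list of its parent terms.
# All term names are hypothetical placeholders, not real GO terms.
PARENTS = {
    "child": ["parent_a", "parent_b"],  # two parents: not a tree
    "parent_a": ["root"],
    "parent_b": ["root"],
    "root": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("child")))
```

A gene annotated to `child` is implicitly annotated to all of its ancestors, which is one reason enrichment tests on related terms are dependent.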

4
1.1 Gene Ontology
Three major categories in Gene Ontology:
Molecular Function Ontology: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
Biological Process Ontology: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
Cellular Component Ontology: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.
Current term counts as of April 2, 2005 at 18:00
Pacific time: 17708 terms, 93.8% with
definitions.
9263 biological_process
1496 cellular_component
6949 molecular_function
5
1.1 Gene Ontology
Evidence code: how is the information collected?

IC: inferred by curator
IDA: inferred from direct assay
IEA: inferred from electronic annotation
IEP: inferred from expression pattern
IGI: inferred from genetic interaction
IMP: inferred from mutant phenotype
IPI: inferred from physical interaction
ISS: inferred from sequence or structural similarity
NAS: non-traceable author statement
ND: no biological data available
RCA: inferred from reviewed computational analysis
TAS: traceable author statement
NR: not recorded

There may be (a lot of) errors in the database!!

6
1.1 Gene Ontology

Demo
Go to GO: http://www.geneontology.org
Go to "Tools" and click on "AmiGO".
Click "Browse". Click on the boxes with "+" to expand any category and look at its subcategories. Click on "-" to collapse again.
Type the term "cell cycle" in the "Search GO" field. Press "Submit". You will then see all GO categories containing this term.
Click on a GO term, say "cell cycle arrest". Genes belonging to this GO term can be shown. Further filter genes by "Data source" or "Species".
Type the name "cyclin" in AmiGO. Change to the "genes or proteins" selection button and press "Submit". You will then see a number of genes containing this name. Press some of the "Tree view" links.
Note that in some cases, the same term category
can exist in different places in the tree. This
ontology is thus not strictly hierarchical, but
shows complex "many-to-many" relationships
between gene products, ontology terms and
branches in the ontology tree.

7
1.2 KEGG
http://www.genome.jp/kegg/pathway.html
8
1.2 KEGG: Kyoto Encyclopedia of Genes and Genomes
KEGG is a suite of databases and associated
software, integrating our current knowledge on
molecular interaction networks in biological
processes (PATHWAY database), the information
about the universe of genes and proteins
(GENES/SSDB/KO databases), and the information
about the universe of chemical compounds and
reactions (COMPOUND/GLYCAN/REACTION databases).
The current statistics of the KEGG databases are as
follows:
Number of pathways: 23,574 (PATHWAY database)
Number of reference pathways: 265 (PATHWAY database)
Number of ortholog tables: 87 (PATHWAY database)
Number of organisms: 272 (GENOME database)
Number of genes: 911,584 (GENES database)
Number of ortholog clusters: 35,456 (SSDB database)
Number of KO assignments: 6,221 (KO database)
Number of chemical compounds: 12,737 (COMPOUND database)
Number of glycans: 11,017 (GLYCAN database)
Number of chemical reactions: 6,399 (REACTION database)
Number of reactant pairs: 5,953 (RPAIR database)
9
1.2 KEGG
RNA polymerase
10
1.2 KEGG
Cell cycle
11
1.2 KEGG
Parkinson's disease
Alzheimer's disease, Huntington's disease, prion
disease.
12
1.3 Biocarta
13
1.4 Reactome

A manually curated and peer-reviewed (authors,
reviewers and editors) pathway database.
Now annotates 5849 proteins, 4555 complexes, 4827
reactions and 1192 pathways in Homo sapiens
(Version 39, 2/21/2012).

14
Database | # of pathways (gene sets) | Accuracy (manually curated?) | Includes gene-gene interactions (network graphs)? | Note
Gene Ontology | 17708 gene sets (2005) | No (includes many computational predictions) | No |
KEGG | 415 pathways, 951 diseases | Yes | Yes |
Biocarta | 250 pathways, 4000 proteins, 800 complexes and 3000 interactions | Yes | Yes | Cancer focused
Reactome | 1192 pathways (human) | Yes | Yes |
NCI-Nature Pathway Interaction Database (PID) | 59 pathways | Yes | Yes | Curated by Nature editorial team
15
1.5 MSigDB
A comprehensive pathway database (mainly gene
sets without graphical interaction model). Useful
for conventional pathway (gene set) enrichment
analysis.
C1: Positional gene sets (326)
C2: Curated gene sets (3272)
  Canonical pathways (880)
    Biocarta (217)
    KEGG (186)
    Reactome (430)
C3: Motif gene sets (836)
  miRNA targets gene sets (221)
  TF targets gene sets (615)
C4: Computational gene sets (881)
C5: GO gene sets (1454)
16
2. Enrichment analysis

After
Selecting DE genes, or
Classification, or
Clustering
We are usually given a gene list for further
investigation.

How do we validate information contained in the
gene list by available biological knowledge?
17
2. Enrichment analysis
Cell cycle data: cells are synchronized and
samples are taken at various time points (covering 2
cell cycles). 6162 genes are included.
From Fourier analysis, 800 genes with a cyclic gene
expression pattern are selected for further
investigation. Are these 800 genes really
involved in the cell cycle?
18
2. Enrichment analysis
http://db.yeastgenome.org/cgi-bin/GO/goTermMapper
19
2. Enrichment analysis
 | Related to cell cycle | Annotated but not related to cell cycle | Not annotated | Total
All genes | 385 | 5703 | 74 | 6162
Expression with cyclic pattern | 100 | 691 | 9 | 800
Is the selected set of genes enriched in the GO
term of cell cycle?
20
2. Enrichment analysis
 | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088
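
The step from the three-column table to this 2x2 table is simple arithmetic: drop the unannotated genes, then subtract the cyclic genes from all genes. A small Python sketch of that bookkeeping follows. The dictionary keys and the "fold" summary are illustrative choices and are not taken from the slides.

```python
# Counts copied from the table that includes the "Not annotated" column.
all_genes = {"cell_cycle": 385, "other_annotated": 5703, "not_annotated": 74}
cyclic = {"cell_cycle": 100, "other_annotated": 691, "not_annotated": 9}


def drop_unannotated(counts):
    """Keep only genes that carry some GO annotation."""
    return {k: v for k, v in counts.items() if k != "not_annotated"}


all_annot = drop_unannotated(all_genes)
cyclic_annot = drop_unannotated(cyclic)
# "Other genes" row = all annotated genes minus the cyclic annotated genes.
other_annot = {k: all_annot[k] - cyclic_annot[k] for k in all_annot}

# Share of cell-cycle genes among cyclic genes versus among all annotated genes.
frac_cyclic = cyclic_annot["cell_cycle"] / sum(cyclic_annot.values())
frac_all = all_annot["cell_cycle"] / sum(all_annot.values())
fold = frac_cyclic / frac_all

print(other_annot, sum(other_annot.values()))
print(f"cyclic: {frac_cyclic:.3f}, all: {frac_all:.3f}, fold: {fold:.2f}")
```

The proportions only describe the size of the difference; whether it is larger than expected by chance is what the tests on the following slides address.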
21
2. Enrichment analysis
 | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | N11 | N12 | N1·
Expression with cyclic pattern | N21 | N22 | N2·
Total | N·1 | N·2 | N
22
2. Enrichment analysis
23
2. Enrichment analysis
 | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088
R code for chi-square test without continuity
correction:
    > chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct = FALSE)

            Pearson's Chi-squared test

    data:  matrix(c(285, 5012, 100, 691), 2, 2)
    X-squared = 61.2644, df = 1, p-value = 4.99e-15
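
As a cross-check, the same statistic can be computed by hand. The Python sketch below uses the shortcut formula for a 2x2 table, X^2 = N (ad - bc)^2 / (r1 r2 c1 c2), and the fact that with 1 degree of freedom the upper-tail probability equals erfc(sqrt(X^2 / 2)). The function name `chi_square_2x2` is an illustrative choice.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom: P(chi-square > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: other genes, cyclic genes. Columns: cell cycle, not cell cycle.
stat, p_value = chi_square_2x2(285, 5012, 100, 691)
print(f"X-squared = {stat:.4f}, df = 1, p-value = {p_value:.3g}")
```

The result should agree with the X-squared and p-value reported by `chisq.test`. Transposing the table, as the column-wise `matrix(...)` call in R does, leaves the statistic unchanged.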
24
2. Enrichment analysis
The chi-squared test is an approximate test and may
not perform well when the sample size is small. Fisher's
exact test is a better alternative.
Fisher's exact test: G genes in the genome
(G=1663) are analyzed. Functional category F.
If, in a cluster of size C, h genes are found to be
in a functional category F with m genes, then the
p-value (i.e. the probability of observing h or
more annotated genes in the cluster) is calculated
as (Tavazoie et al. 1999)
$p = \sum_{i=h}^{\min(m,C)} \binom{m}{i}\binom{G-m}{C-i} \Big/ \binom{G}{C}$
25
2. Enrichment analysis
Fisher's exact test
 | Inside cluster | Outside cluster | Total
Inside pathway F | h | m-h | m
Outside pathway F | C-h | G-m-C+h | G-m
Total | C | G-C | G
If genes are randomly assigned, the probability
of having h intersection genes is
$P(X = h) = \binom{m}{h}\binom{G-m}{C-h} \Big/ \binom{G}{C}$
The p-value is the probability of observing h or
more intersection genes by chance:
$p = \sum_{i=h}^{\min(m,C)} P(X = i)$
26
2. Enrichment analysis
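Fisher's exact test
Observation:
 | Inside cluster | Outside cluster | Total
Inside pathway F | 39 | 1 | 40
Outside pathway F | 161 | 1799 | 1960
Total | 200 | 1800 | 2000

Only two tables are at least as extreme as the
observation:

 | Inside cluster | Outside cluster | Total
Inside pathway F | 39 | 1 | 40
Outside pathway F | 161 | 1799 | 1960
Total | 200 | 1800 | 2000

 | Inside cluster | Outside cluster | Total
Inside pathway F | 40 | 0 | 40
Outside pathway F | 160 | 1800 | 1960
Total | 200 | 1800 | 2000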
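
For this example the p-value is the sum of the hypergeometric probabilities of the two tables above (h = 39 and h = 40). A minimal Python sketch using exact integer arithmetic follows. The function names are illustrative, and the symbols follow the slides: G = 2000 genes, m = 40 in pathway F, C = 200 in the cluster.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(i, G, m, C):
    """Probability that exactly i of the C cluster genes fall in a pathway of m genes."""
    return Fraction(comb(m, i) * comb(G - m, C - i), comb(G, C))


def enrichment_p_value(h, G, m, C):
    """One-sided p-value: probability of h or more pathway genes in the cluster."""
    return sum(hypergeom_pmf(i, G, m, C) for i in range(h, min(m, C) + 1))


G, m, C, h = 2000, 40, 200, 39
p39 = hypergeom_pmf(39, G, m, C)
p40 = hypergeom_pmf(40, G, m, C)
p = enrichment_p_value(h, G, m, C)

print(f"P(h=39) = {float(p39):.3e}")
print(f"P(h=40) = {float(p40):.3e}")
print(f"p-value = {float(p):.3e}")
```

Exact fractions avoid rounding problems with very large binomial coefficients; the conversion to floating point happens only when printing.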
27
2. Enrichment analysis
Kolmogorov-Smirnov test (KS test)
A major issue of Fisher's exact test is that it requires
an ad hoc threshold to generate the DE gene list.
The KS test is a better way to associate any gene
ordering with pathway information.
Example: S1 = (1,2,3,5), S2 = (4,6,8,9,10)
$D = \max_x |F_1(x) - F_2(x)|$
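
The KS statistic for the example can be computed directly from the two empirical distribution functions. This is a minimal Python sketch. Reading S1 as the ranks of pathway genes and S2 as the ranks of the remaining genes is an assumption about the example, not something stated on the slide.

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of sample values that are <= x."""
    return sum(v <= x for v in sample) / len(sample)


def ks_statistic(s1, s2):
    """Largest absolute gap between the two empirical CDFs."""
    # The gap can only change at observed values, so checking those is enough.
    points = sorted(set(s1) | set(s2))
    return max(abs(ecdf(s1, x) - ecdf(s2, x)) for x in points)


s1 = [1, 2, 3, 5]
s2 = [4, 6, 8, 9, 10]
d = ks_statistic(s1, s2)
print(f"D = {d:.2f}")
```

The maximum gap occurs at x = 5, where F1 = 4/4 and F2 = 1/5, giving D = 0.8.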
28
2. Enrichment analysis

In practice, we need to search through thousands
of GO terms to determine which GO terms are
enriched in the selected gene set.
Multiple comparison problem!!
Difficulties: tests are highly dependent.
Hierarchical structure of the GO,
e.g. "Cell Proliferation" is a parent GO term of
"Cell Cycle".
Each gene can belong to multiple GO terms,
e.g. the human HoxA7 gene belongs to four GO terms:
"Development", "Nucleus", "DNA-dependent
regulation and transcription", "Transcription
factor activity".

29
2. Enrichment analysis

Simple and naïve way:
Get p-values from Fisher's exact test for all
pathways.
Correct by the Benjamini-Hochberg procedure to
control FDR.
Problems:
Fisher's test simplifies DE statistics into a
binary (0-1) biomarker list.
It does not consider the gene dependence structure or
the hierarchical dependence structure among pathways.
Improved methods:
Use averaged t-statistics or Kolmogorov-Smirnov
(KS) statistics as the pathway-specific
enrichment score.
Apply a permutation test (either gene permutation
or sample permutation) to perform FDR control.
Read the following papers if interested.
Goeman, J.J. and Bühlmann, P. (2007) Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics, 23, 980-987.
Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S. and Park, P.J. (2005) Discovering statistically significant pathways in expression profiling studies, Proceedings of the National Academy of Sciences of the United States of America, 102, 13544-13549.
Efron, B. and Tibshirani, R. (2007) On testing the significance of sets of genes, Annals of Applied Statistics, 1, 107-129.
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 102, 15545-15550.
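
The correction step of the simple approach on this slide (one Fisher p-value per pathway, then Benjamini-Hochberg) can be sketched in Python as follows. The four p-values are made-up numbers for illustration, and the function name is an assumption. The adjusted value for the p-value of rank k among m tests is the minimum of p(j) * m / j over all ranks j >= k.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-pathway p-values (illustration only).
pathway_p = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pathway_p)
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted, significant)
```

Pathways whose adjusted value falls at or below the chosen FDR level (0.05 here, an arbitrary choice) are reported. This sketch does nothing about the dependence between tests noted on the slide.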

30
2. Enrichment analysis

Simple Fisher's exact test:
Ingenuity Pathway
A commercial package with a good interface and
human-curated annotation. Can generate network
figures.
NIH DAVID
Free and web-based. Performs enrichment analysis
(Fisher's exact test), adjusts for multiple
comparisons and generates a table of results. Uses
multiple databases.
GOstats package in Bioconductor
Free R package. Performs enrichment analysis
(Fisher's exact test) and generates a table of
results. Uses only the GO database.
More sophisticated and systematic methods:
Gene set enrichment analysis (GSEA; MIT,
Mesirov's group)
http://www.broad.mit.edu/gsea/ (free)
Gene set analysis (GSA; Stanford, Tibshirani's
group)
http://www-stat.stanford.edu/~tibs/GSA/
(free)
Ingenuity Pathway Analysis (IPA)
http://www.ingenuity.com/ (commercial; Pitt has
purchased licenses)

31
2. Enrichment analysis

Things to note when using biological databases:
Biological pathways and gene functions are
complex and difficult to quantify.
Data may not be accurate. The analysis should
take into account the strength of evidence.
May need to go to a specific database for a
particular organism (e.g. SGD for yeast; FlyBase
and BDGP for fly).
Systematically collecting and managing massive
biological knowledge from publications and
experiments is an important and active research
topic in bioinformatics.
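
One simple way to account for strength of evidence is to filter annotations by GO evidence code before building a gene set. The Python sketch below is only an illustration. The gene names, the term name and the choice of which codes to exclude are all hypothetical assumptions, not recommendations from the slides.

```python
# Hypothetical (gene, GO term, evidence code) annotations, made up for illustration.
ANNOTATIONS = [
    ("gene1", "term_A", "IDA"),  # direct assay
    ("gene2", "term_A", "IEA"),  # electronic annotation only
    ("gene3", "term_A", "TAS"),  # traceable author statement
    ("gene4", "term_A", "ND"),   # no biological data available
    ("gene5", "term_B", "IMP"),  # mutant phenotype
]

# Which codes to treat as weak is an analysis choice (an assumption here).
EXCLUDED_CODES = {"IEA", "ND", "NR"}


def gene_set(annotations, term, excluded=EXCLUDED_CODES):
    """Genes annotated to `term` by at least one non-excluded evidence code."""
    return {g for g, t, code in annotations if t == term and code not in excluded}


print(sorted(gene_set(ANNOTATIONS, "term_A")))
print(sorted(gene_set(ANNOTATIONS, "term_A", excluded=set())))
```

Running the enrichment analysis with and without the filter shows how much a result depends on computationally inferred annotations.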

32
3. Motif Finding
33
3. Motif Finding
http://web.indstate.edu/thcme/mwking/gene-regulation.html
34
3. Motif Finding
Factor | Sequence Motif | Comments
c-Myc and Max | CACGTG | c-Myc first identified as retroviral oncogene; Max specifically associates with c-Myc in cells
c-Fos and c-Jun | TGA(C/G)T(C/A)A | both first identified as retroviral oncogenes; associate in cells, also known as the factor AP-1
CREB | TGACG(C/T)(C/A)(G/A) | binds to the cAMP response element; family of at least 10 factors resulting from different genes or alternative splicing; can form dimers with c-Jun
c-ErbA, also TR (thyroid hormone receptor) | GTGTCAAAGGTCA | first identified as retroviral oncogene; member of the steroid/thyroid hormone receptor superfamily; binds thyroid hormone
c-Ets | (G/C)(A/C)GGA(A/T)G(T/C) | first identified as retroviral oncogene; predominates in B- and T-cells
GATA | (T/A)GATA | family of erythroid cell-specific factors, GATA-1 to -6
c-Myb | (T/C)AAC(G/T)G | first identified as retroviral oncogene; hematopoietic cell-specific factor
MyoD | CAACTGAC | controls muscle differentiation
NF-(kappa)B and c-Rel | GGGA(A/C)TN(T/C)CC(1) | both factors identified independently; c-Rel first identified as retroviral oncogene; predominate in B- and T-cells
RAR (retinoic acid receptor) | ACGTCATGACCT | binds to elements termed RAREs (retinoic acid response elements); also binds to c-Jun/c-Fos site
SRF (serum response factor) | GGATGTCCATATTAGGACATCT | exists in many genes that are inducible by the growth factors present in serum
http://web.indstate.edu/thcme/mwking/gene-regulation.html
35
3. Motif Finding

Genes in a cluster have similar expression
patterns.
They might share common regulatory motifs so they
are expressed simultaneously.
It is of interest to find motifs from the gene
clusters.
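
As a simple starting point, one can scan the upstream sequences of a cluster for known motifs such as those in the table (CACGTG for c-Myc/Max, or (T/A)GATA for the GATA family). The Python sketch below does only that. It is not a de novo motif finder, and the sequences and gene names are made up for illustration.

```python
import re

# Hypothetical upstream sequences for genes in one cluster (illustration only).
CLUSTER_SEQS = {
    "geneA": "TTCACGTGAAAGATAA",
    "geneB": "GGGTGATACC",
    "geneC": "ACACACAC",
}


def count_motif_hits(sequences, pattern):
    """Count (possibly overlapping) matches of a motif regex in each sequence."""
    # A lookahead matches at every start position without consuming characters.
    regex = re.compile(f"(?=({pattern}))")
    return {name: len(regex.findall(seq)) for name, seq in sequences.items()}


# Alternatives such as T/A are written as character classes: [TA].
print(count_motif_hits(CLUSTER_SEQS, "CACGTG"))
print(count_motif_hits(CLUSTER_SEQS, "[TA]GATA"))
```

A motif that occurs in many more cluster genes than in the rest of the genome is a candidate shared regulatory element; that comparison is again an enrichment question. Only the forward strand is scanned here.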

36
3. Motif Finding
The following materials are obtained from Shirley
Liu at Harvard.
37-74
3. Motif Finding
[Slides 37-74: slide title only; no further text in the transcript.]
