Title: Agenda
- Biological databases related to microarray
  - Gene Ontology
  - KEGG
  - Biocarta
  - Reactome
  - MSigDB
- Pathway enrichment analysis
  - GSEA
  - GSA
  - Ingenuity Pathway Analysis (IPA)
- Motif finding
1. Databases
Biological pathways and knowledge are very complex.
- Is it possible to establish a database
  - to systematically structure and manage the knowledge?
  - to validate analysis results, or to be incorporated into the analysis?
1.1 Gene Ontology
- Ontologies: controlled vocabularies to describe functions of genes.
- The database is structured as directed acyclic graphs (DAGs), which differ from hierarchical trees in that a 'child' (more specialized term) can have many 'parents' (less specialized terms).
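The difference between a DAG and a tree can be made concrete with a minimal sketch. The term names below are hypothetical placeholders, not real GO identifiers, and the dictionary layout is only one assumed way to store parent links.

```python
# Toy ontology: each term maps to its direct parents (hypothetical names).
PARENTS = {
    "root": [],
    "term_A": ["root"],
    "term_B": ["root"],
    "term_C": ["term_A", "term_B"],  # one child, two parents: a DAG, not a tree
    "term_D": ["term_C"],
}


def ancestors(term, parents=PARENTS):
    """Return every less specialized term reachable from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("term_D")))
```

A gene annotated to `term_D` is implicitly annotated to all of its ancestors, which is one reason enrichment tests on related GO terms are strongly dependent.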
1.1 Gene Ontology
Three major categories in Gene Ontology:
- Molecular Function Ontology: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
- Biological Process Ontology: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
- Cellular Component Ontology: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.
Current term counts as of April 2, 2005 at 18:00 Pacific time: 17708 terms, 93.8% with definitions.
- 9263 biological_process
- 1496 cellular_component
- 6949 molecular_function
1.1 Gene Ontology
Evidence code: how is the information collected?
- IC: inferred by curator
- IDA: inferred from direct assay
- IEA: inferred from electronic annotation
- IEP: inferred from expression pattern
- IGI: inferred from genetic interaction
- IMP: inferred from mutant phenotype
- IPI: inferred from physical interaction
- ISS: inferred from sequence or structural similarity
- NAS: non-traceable author statement
- ND: no biological data available
- RCA: inferred from reviewed computational analysis
- TAS: traceable author statement
- NR: not recorded
- There may be (a lot of) errors in the database!
1.1 Gene Ontology
- Demo
- Go to GO: http://www.geneontology.org
- Go to "Tools" and click on "AmiGO".
- Click "Browse". Click on the boxes with "+" to expand any category and look at its subcategories. Click on "-" to collapse again.
- Type the term "cell cycle" in the "Search GO" field. Press "Submit". You will then see all GO categories containing this term.
- Click on a GO term, say "cell cycle arrest". Genes belonging to this GO term can be shown. Further filter genes by "Data source" or "Species".
- Type the name "cyclin" in AmiGO. Change to the "genes or proteins" selection button and press "Submit". You will then see a number of genes containing this name. Press some of the "Tree view" links.
- Note that in some cases, the same term category can exist in different places in the tree. This ontology is thus not strictly hierarchical, but shows complex "many-to-many" relationships between gene products, ontology terms and branches in the ontology tree.
1.2 KEGG: Kyoto Encyclopedia of Genes and Genomes
http://www.genome.jp/kegg/pathway.html
KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks in biological processes (PATHWAY database), the information about the universe of genes and proteins (GENES/SSDB/KO databases), and the information about the universe of chemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases).
The current statistics of the KEGG databases are as follows:
- Number of pathways: 23,574 (PATHWAY database)
- Number of reference pathways: 265 (PATHWAY database)
- Number of ortholog tables: 87 (PATHWAY database)
- Number of organisms: 272 (GENOME database)
- Number of genes: 911,584 (GENES database)
- Number of ortholog clusters: 35,456 (SSDB database)
- Number of KO assignments: 6,221 (KO database)
- Number of chemical compounds: 12,737 (COMPOUND database)
- Number of glycans: 11,017 (GLYCAN database)
- Number of chemical reactions: 6,399 (REACTION database)
- Number of reactant pairs: 5,953 (RPAIR database)
1.2 KEGG
Example pathways: RNA polymerase; cell cycle; Parkinson's disease (also Alzheimer's disease, Huntington's disease and prion disease).
1.3 Biocarta
1.4 Reactome
- A manually curated and peer-reviewed (authors, reviewers and editors) pathway database.
- Now annotates 5849 proteins, 4555 complexes, 4827 reactions and 1192 pathways in Homo sapiens (Version 39, 2/21/2012).
Database | # of pathways (gene sets) | Accuracy (manually curated?) | Includes gene-gene interactions (network graphs)? | Note
Gene Ontology | 17708 gene sets (2005) | No (includes many computational predictions) | No |
KEGG | 415 pathways, 951 diseases | Yes | Yes |
Biocarta | 250 pathways, 4000 proteins, 800 complexes and 3000 interactions | Yes | Yes | Cancer focused
Reactome | 1192 pathways (human) | Yes | Yes |
NCI-Nature Pathway Interaction Database (PID) | 59 pathways | Yes | Yes | Curated by Nature editorial team
1.5 MSigDB
A comprehensive pathway database (mainly gene sets without a graphical interaction model). Useful for conventional pathway (gene set) enrichment analysis.
- C1: Positional gene sets (326)
- C2: Curated gene sets (3272)
  - Canonical pathways (880)
    - Biocarta (217)
    - KEGG (186)
    - Reactome (430)
- C3: Motif gene sets (836)
  - miRNA targets gene sets (221)
  - TF targets gene sets (615)
- C4: Computational gene sets (881)
- C5: GO gene sets (1454)
2. Enrichment analysis
- After
  - selecting DE genes, or
  - classification, or
  - clustering,
- we are usually given a gene list for further investigation.
How do we validate the information contained in the gene list against available biological knowledge?
2. Enrichment analysis
Cell cycle data: cells are synchronized and samples are taken at various time points (covering 2 cell cycles). 6162 genes are included.
From Fourier analysis, 800 genes with a cyclic expression pattern are selected for further investigation. Are these 800 genes really involved in the cell cycle?
2. Enrichment analysis
http://db.yeastgenome.org/cgi-bin/GO/goTermMapper
2. Enrichment analysis
Gene group | Related to cell cycle | Annotated but not related to cell cycle | Not annotated | Total
All genes | 385 | 5703 | 74 | 6162
Expression with cyclic pattern | 100 | 691 | 9 | 800
Is the selected set of genes enriched in the GO term "cell cycle"?
2. Enrichment analysis
Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088
2. Enrichment analysis
Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | N11 | N12 | N1·
Expression with cyclic pattern | N21 | N22 | N2·
Total | N·1 | N·2 | N
2. Enrichment analysis
Gene group | Related to cell cycle | Annotated but not related to cell cycle | Total
Other genes | 285 | 5012 | 5297
Expression with cyclic pattern | 100 | 691 | 791
Total | 385 | 5703 | 6088
R code for the chi-square test without continuity correction:
> chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct = FALSE)

        Pearson's Chi-squared test

data:  matrix(c(285, 5012, 100, 691), 2, 2)
X-squared = 61.2644, df = 1, p-value = 4.99e-15
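To show where the statistic comes from, here is a minimal sketch that recomputes the Pearson chi-square for the same 2x2 table from observed and expected counts, using only the Python standard library. The function name `chi_square_2x2` is a hypothetical helper, and the p-value uses the identity for one degree of freedom, P(X > s) = erfc(sqrt(s / 2)). It should agree closely with the R output above.

```python
import math


def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square (no continuity correction) and p-value for df = 1."""
    row = (n11 + n12, n21 + n22)   # row totals
    col = (n11 + n21, n12 + n22)   # column totals
    n = sum(row)                   # grand total
    observed = ((n11, n12), (n21, n22))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n   # count expected under independence
            stat += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = chi_square_2x2(285, 5012, 100, 691)
print(f"X-squared = {stat:.4f}, p-value = {p_value:.3g}")
```

The statistic does not depend on whether the table is entered by rows or by columns, which is why the column-wise `matrix(...)` call in R gives the same answer.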
2. Enrichment analysis
The chi-squared test is an approximate test and may not perform well when the sample size is small. Fisher's exact test is a better alternative.
Fisher's exact test: G genes in the genome (G = 1663) are analyzed against a functional category F. In a cluster of size C, h genes are found to be in a functional category F with m genes; the p-value (i.e. the probability of observing h or more annotated genes in the cluster) is then calculated as (Tavazoie et al. 1999)
$$P = 1 - \sum_{i=0}^{h-1} \frac{\binom{m}{i}\binom{G-m}{C-i}}{\binom{G}{C}}$$
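2. Enrichment analysis
Fisher's exact test
 | Inside cluster | Outside cluster | Total
Inside pathway F | h | m-h | m
Outside pathway F | C-h | G-m-C+h | G-m
Total | C | G-C | G
If genes are randomly assigned, the probability of having exactly h intersection genes is
$$P(X = h) = \frac{\binom{m}{h}\binom{G-m}{C-h}}{\binom{G}{C}}$$
The p-value is the probability of observing h or more intersection genes by chance:
$$p = \sum_{i=h}^{\min(m, C)} \frac{\binom{m}{i}\binom{G-m}{C-i}}{\binom{G}{C}}$$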
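2. Enrichment analysis
Fisher's exact test
Observation:
 | Inside cluster | Outside cluster | Total
Inside pathway F | 39 | 1 | 40
Outside pathway F | 161 | 1799 | 1960
Total | 200 | 1800 | 2000
- With the margins fixed, only two tables are at least as extreme as the observation: the observed table itself (h = 39) and the table with h = 40.
h = 39 (observed):
 | Inside cluster | Outside cluster | Total
Inside pathway F | 39 | 1 | 40
Outside pathway F | 161 | 1799 | 1960
Total | 200 | 1800 | 2000
h = 40:
 | Inside cluster | Outside cluster | Total
Inside pathway F | 40 | 0 | 40
Outside pathway F | 160 | 1800 | 1960
Total | 200 | 1800 | 2000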
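The one-sided p-value for this example is the sum of the hypergeometric probabilities of the two tables with h = 39 and h = 40. The following is a minimal sketch under that reading, with G = 2000, m = 40 and C = 200 taken from the table margins. The helper names are hypothetical, and exact integer binomials from the standard library are used so that no statistics package is assumed.

```python
from math import comb


def hypergeom_pmf(h, G, m, C):
    """P(exactly h of the C cluster genes lie in a pathway of m genes, out of G)."""
    return comb(m, h) * comb(G - m, C - h) / comb(G, C)


def enrichment_p_value(h, G, m, C):
    """One-sided Fisher p-value: probability of h or more intersection genes."""
    return sum(hypergeom_pmf(i, G, m, C) for i in range(h, min(m, C) + 1))


G, m, C, h = 2000, 40, 200, 39   # margins and observed overlap from the table
p = enrichment_p_value(h, G, m, C)
print(f"p-value = {p:.3e}")
```

Python integers have arbitrary precision, so the very large binomial coefficients are computed exactly before the final division. For routine use a library routine for Fisher's exact test would be the usual choice.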
2. Enrichment analysis
Kolmogorov-Smirnov test (KS test)
- A major issue of Fisher's exact test is that it requires an ad hoc threshold to generate the DE gene list.
- The KS test is a better way to associate any gene ordering with pathway information.
Example: S1 = (1, 2, 3, 5), S2 = (4, 6, 8, 9, 10)
$$D = \max_x |F_1(x) - F_2(x)|$$
where F1 and F2 are the empirical cumulative distribution functions of S1 and S2.
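The KS statistic for the two small samples in this example can be worked out directly from the empirical distribution functions. This is a minimal sketch with hypothetical helper names; it evaluates the difference only at the observed values, which is sufficient because step functions change only there.

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of the sample that is <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)


def ks_statistic(s1, s2):
    """Two-sample KS statistic D = max over x of |F1(x) - F2(x)|."""
    points = sorted(set(s1) | set(s2))
    return max(abs(ecdf(s1, x) - ecdf(s2, x)) for x in points)


S1 = (1, 2, 3, 5)
S2 = (4, 6, 8, 9, 10)
D = ks_statistic(S1, S2)
print(D)  # maximum gap is at x = 5, where F1 = 1.0 and F2 = 0.2
```

In the gene set setting, S1 would be the ranks of the pathway genes and S2 the ranks of all other genes in the ordered list, so a large D means the pathway genes concentrate at one end of the ranking.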
2. Enrichment analysis
- In practice, we need to search through thousands of GO terms to determine which GO term is enriched in the selected gene set.
- Multiple comparison problem!
- Difficulty: the tests are highly dependent.
  - Hierarchical structure of the GO
    - e.g. Cell Proliferation is a parent GO term of Cell Cycle.
  - Each gene can belong to multiple GO terms.
    - e.g. the human HoxA7 gene belongs to four GO terms: Development, Nucleus, DNA-dependent regulation of transcription, Transcription factor activity.
2. Enrichment analysis
- Simple and naïve way
  - Get p-values from Fisher's exact test for all pathways.
  - Correct by the Benjamini-Hochberg procedure to control FDR.
- Problems
  - Fisher's test simplifies the DE statistics into a binary (0-1) biomarker list.
  - It does not consider the dependence structure among genes or the hierarchical dependence structure among pathways.
- Improved methods
  - Use averaged t-statistics or Kolmogorov-Smirnov (KS) statistics as the pathway-specific enrichment score.
  - Apply a permutation test (either gene permutation or sample permutation) to perform FDR control.
- Read the following papers if interested.
  - Goeman, J.J. and Bühlmann, P. (2007) Analyzing gene expression data in terms of gene sets: methodological issues, Bioinformatics, 23, 980-987.
  - Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S. and Park, P.J. (2005) Discovering statistically significant pathways in expression profiling studies, Proceedings of the National Academy of Sciences of the United States of America, 102, 13544-13549.
  - Efron, B. and Tibshirani, R. (2007) On testing the significance of sets of genes, Annals of Applied Statistics, 1, 107-129.
  - Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 102, 15545-15550.
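The Benjamini-Hochberg step of the simple approach on this slide can be sketched in a few lines. The p-values below are hypothetical numbers chosen for illustration only, not results from any dataset, and the function name is an assumption. Note that this adjustment still carries the dependence caveats listed under the problems above.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical Fisher p-values for five pathways (illustration only).
pathway_p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(pathway_p)
significant = [i for i, value in enumerate(q) if value <= 0.05]
print(q)
print(significant)
```

With these toy inputs only the first two pathways pass an FDR threshold of 0.05, even though four of the raw p-values are below 0.05.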
2. Enrichment analysis
- Simple Fisher's exact test
  - Ingenuity Pathway
    - A commercial package with a good interface and human-curated annotation. Can generate network figures.
  - NIH DAVID
    - Free and web-based. Performs enrichment analysis (Fisher's exact test), adjusts for multiple comparisons and generates a table of results. Uses multiple databases.
  - GOstats package in Bioconductor
    - Free (R package). Performs enrichment analysis (Fisher's exact test) and generates a table of results. Uses only the GO database.
- More sophisticated and systematic methods
  - Gene set enrichment analysis (GSEA; MIT, Mesirov's group)
    - http://www.broad.mit.edu/gsea/ (free)
  - Gene set analysis (GSA; Stanford, Tibshirani's group)
    - http://www-stat.stanford.edu/~tibs/GSA/ (free)
  - Ingenuity Pathway Analysis (IPA)
    - http://www.ingenuity.com/ (commercial; Pitt has purchased licenses)
2. Enrichment analysis
- Things to note when using biological databases
  - Biological pathways and gene functions are complex and difficult to quantify.
  - Data may not be accurate. The analysis should take into account the strength of evidence.
  - You may need to go to a specific database for a particular organism (e.g. SGD for yeast; FlyBase and BDGP for fly).
  - Systematically collecting and managing massive biological knowledge from publications and experiments is an important and active research topic in bioinformatics.
3. Motif Finding
http://web.indstate.edu/thcme/mwking/gene-regulation.html
3. Motif Finding
Factor | Sequence Motif | Comments
c-Myc and Max | CACGTG | c-Myc first identified as retroviral oncogene; Max specifically associates with c-Myc in cells
c-Fos and c-Jun | TGAC/GTC/AA | both first identified as retroviral oncogenes; associate in cells, also known as the factor AP-1
CREB | TGACGC/TC/AG/A | binds to the cAMP response element; family of at least 10 factors resulting from different genes or alternative splicing; can form dimers with c-Jun
c-ErbA; also TR (thyroid hormone receptor) | GTGTCAAAGGTCA | first identified as retroviral oncogene; member of the steroid/thyroid hormone receptor superfamily; binds thyroid hormone
c-Ets | G/CA/CGGAA/TGT/C | first identified as retroviral oncogene; predominates in B- and T-cells
GATA | T/AGATA | family of erythroid cell-specific factors, GATA-1 to -6
c-Myb | T/CAACG/TG | first identified as retroviral oncogene; hematopoietic cell-specific factor
MyoD | CAACTGAC | controls muscle differentiation
NF-κB and c-Rel | GGGAA/CTNT/CCC(1) | both factors identified independently; c-Rel first identified as retroviral oncogene; predominate in B- and T-cells
RAR (retinoic acid receptor) | ACGTCATGACCT | binds to elements termed RAREs (retinoic acid response elements); also binds to c-Jun/c-Fos site
SRF (serum response factor) | GGATGTCCATATTAGGACATCT | exists in many genes that are inducible by the growth factors present in serum
Source: http://web.indstate.edu/thcme/mwking/gene-regulation.html
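As a small illustration of how a known motif from this table could be located in a sequence, here is a minimal sketch that turns the slash notation into a regular expression and scans the forward strand. It assumes that "X/Y" means either base is allowed at a single position and that "N" means any base. The helper names and the toy promoter sequence are hypothetical, and real motif finding (discovering unknown motifs from gene clusters) needs more than exact pattern matching.

```python
import re


def motif_to_regex(motif):
    """Translate 'T/AGATA'-style notation into a regular expression.

    Assumption: 'X/Y' allows either base at that one position; 'N' is any base.
    """
    parts = []
    i = 0
    while i < len(motif):
        choices = motif[i]
        i += 1
        # Collect alternatives joined by '/' for this single position.
        while i + 1 < len(motif) and motif[i] == "/":
            choices += motif[i + 1]
            i += 2
        if choices == "N":
            parts.append("[ACGT]")
        elif len(choices) > 1:
            parts.append("[" + choices + "]")
        else:
            parts.append(choices)
    return "".join(parts)


def find_motif(motif, sequence):
    """Return 0-based start positions of forward-strand matches (overlaps allowed)."""
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [match.start() for match in pattern.finditer(sequence)]


toy_promoter = "CCAGATAAGGCACGTGTT"   # made-up sequence for illustration
print(motif_to_regex("T/AGATA"))
print(find_motif("T/AGATA", toy_promoter))
print(find_motif("CACGTG", toy_promoter))
```

A fuller scan would also search the reverse complement, since transcription factors can bind either strand.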
3. Motif Finding
- Genes in a cluster have similar expression patterns.
- They might share common regulatory motifs, so they are expressed simultaneously.
- It is of interest to find motifs from the gene clusters.
3. Motif Finding
The following materials are obtained from Shirley Liu at Harvard.