Title: Agenda
- Biological databases related to microarray
- Gene Ontology
- KEGG
- Pathway enrichment analysis
- Motif finding
1. Databases
Biological pathways and knowledge are very complex.
- Is it possible to establish a database
  - to systematically structure and manage the knowledge?
  - to validate analysis results or be incorporated into the analysis?
1.1 Gene Ontology
- Ontologies: controlled vocabularies to describe the functions of genes.
- The database is structured as directed acyclic graphs (DAGs), which differ from hierarchical trees in that a 'child' (more specialized term) can have many 'parents' (less specialized terms).
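The DAG idea above can be illustrated with a minimal R sketch. The term names and edges below are a toy example chosen for illustration, not copied from the actual GO graph, and `ancestors` is a hypothetical helper. It shows a child term with two parents and how all less specialized terms can be collected by walking upwards.

```r
# Toy DAG: each term maps to its direct parents (illustrative edges only).
parents <- list(
  "root"               = character(0),
  "cell proliferation" = "root",
  "regulation"         = "root",
  "cell cycle"         = "cell proliferation",
  "cell cycle arrest"  = c("cell cycle", "regulation")  # two parents
)

# Collect every ancestor of a term by following parent links upwards.
ancestors <- function(term, parents) {
  found <- character(0)
  queue <- parents[[term]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      queue <- c(queue, parents[[current]])
    }
  }
  found
}

arrest_ancestors <- ancestors("cell cycle arrest", parents)
print(arrest_ancestors)
```

In a strict tree every term would have exactly one parent, so a gene annotated to "cell cycle arrest" would roll up along a single path. In a DAG it rolls up along every path.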
1.1 Gene Ontology
Three major categories in Gene Ontology
Current term counts as of April 2, 2005 at 18:00 Pacific time: 17708 terms, 93.8% with definitions.
- 9263 biological_process
- 1496 cellular_component
- 6949 molecular_function
1.1 Gene Ontology
Evidence code: how is the information collected?
- IC: inferred by curator
- IDA: inferred from direct assay
- IEA: inferred from electronic annotation
- IEP: inferred from expression pattern
- IGI: inferred from genetic interaction
- IMP: inferred from mutant phenotype
- IPI: inferred from physical interaction
- ISS: inferred from sequence or structural similarity
- NAS: non-traceable author statement
- ND: no biological data available
- RCA: inferred from reviewed computational analysis
- TAS: traceable author statement
- NR: not recorded
- There may be (a lot of) errors in the database!!
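Because annotations differ in how they were collected, one may want to restrict an analysis to annotations backed by selected evidence codes. The sketch below is a minimal example under stated assumptions: the annotation table and gene names are hypothetical, and the choice of which codes count as "trusted" is an assumption of this example, not a recommendation from GO.

```r
# Hypothetical annotation table: one row per gene-term annotation.
annotations <- data.frame(
  gene     = c("geneA", "geneB", "geneC", "geneD"),
  go_term  = rep("cell cycle", 4),
  evidence = c("IDA", "IEA", "TAS", "ND"),
  stringsAsFactors = FALSE
)

# Assumed set of codes to keep (experimental or curator-reviewed).
trusted_codes <- c("IC", "IDA", "IEP", "IGI", "IMP", "IPI", "TAS")

# Keep only annotations whose evidence code is in the trusted set.
trusted <- annotations[annotations$evidence %in% trusted_codes, ]
print(trusted)
```

Dropping electronically inferred (IEA) annotations reduces the number of annotated genes, so the trade-off between coverage and reliability depends on the organism and the question.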
1.1 Gene Ontology
Demo
- Go to GO: http://www.geneontology.org
- Go to "Tools" and click on "AmiGO".
- Click "Browse". Click on the boxes with "+" to expand any category and look at its subcategories. Click on "-" to collapse again.
- Type the term "cell cycle" in the "Search GO" field. Press "Submit". You will then see all GO categories containing this word.
- Click on a GO term, say "cell cycle arrest". Genes belonging to this GO term can be shown. Further filter genes by "Data source" or "Species".
- Type the name "cyclin" in AmiGO. Change to the "genes or proteins" selection button and press "Submit". You will then see a number of genes containing this name. Press some of the "Tree view" links.
- Note that in some cases, the same term can exist in different places in the tree. The ontology is thus not strictly hierarchical, but shows complex "many-to-many" relationships between gene products, ontology terms and branches in the ontology tree.
1.2 KEGG
http://www.genome.jp/kegg/pathway.html
1.2 KEGG: Kyoto Encyclopedia of Genes and Genomes
KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks in biological processes (PATHWAY database), the information about the universe of genes and proteins (GENES/SSDB/KO databases), and the information about the universe of chemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases).
The current statistics of the KEGG databases are as follows:
- Number of pathways: 23,574 (PATHWAY database)
- Number of reference pathways: 265 (PATHWAY database)
- Number of ortholog tables: 87 (PATHWAY database)
- Number of organisms: 272 (GENOME database)
- Number of genes: 911,584 (GENES database)
- Number of ortholog clusters: 35,456 (SSDB database)
- Number of KO assignments: 6,221 (KO database)
- Number of chemical compounds: 12,737 (COMPOUND database)
- Number of glycans: 11,017 (GLYCAN database)
- Number of chemical reactions: 6,399 (REACTION database)
- Number of reactant pairs: 5,953 (RPAIR database)
1.2 KEGG
RNA polymerase
1.2 KEGG
Cell cycle
1.2 KEGG
Parkinson's disease
Alzheimer's disease, Huntington's disease, prion disease.
2. Enrichment analysis
After one of the following steps, we are usually given a gene list for further investigation:
- selecting differentially expressed (DE) genes,
- classification, or
- clustering.
How do we validate the information contained in the gene list against available biological knowledge?
2. Enrichment analysis
Cell cycle data: cells are synchronized and samples are taken at various time points (covering 2 cell cycles). 6162 genes are included.
From Fourier analysis, 800 genes with a cyclic expression pattern are selected for further investigation. Are these 800 genes really involved in the cell cycle?
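The Fourier-based selection is not spelled out in these slides. As a rough illustration only, one way to score how cyclic an expression profile is would be to take the share of spectral power at the frequency matching the expected number of cell cycles. The function `cyclic_score`, the number of time points and both example profiles are assumptions of this sketch, which also assumes evenly spaced samples.

```r
# Share of non-constant spectral power at the frequency that corresponds
# to `n_cycles` full cycles over the sampled time course.
cyclic_score <- function(x, n_cycles) {
  x <- x - mean(x)                     # remove the constant component
  power <- Mod(fft(x))^2               # unnormalised periodogram
  half <- 2:(length(x) %/% 2 + 1)      # positive frequencies only
  total <- sum(power[half])
  if (total == 0) return(0)            # flat profile: no periodic signal
  power[n_cycles + 1] / total
}

n_time <- 16                           # hypothetical number of time points
t_idx <- 0:(n_time - 1)
cyclic_gene <- sin(2 * pi * 2 * t_idx / n_time)  # two full cycles
trend_gene  <- as.numeric(t_idx)                 # steady drift, not cyclic

score_cyclic <- cyclic_score(cyclic_gene, n_cycles = 2)
score_trend  <- cyclic_score(trend_gene,  n_cycles = 2)
print(c(cyclic = score_cyclic, trend = score_trend))
```

Genes could then be ranked by this score and the top ones kept. A real analysis would also need to handle noise, uneven sampling and phase, which this sketch ignores.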
2. Enrichment analysis
http://db.yeastgenome.org/cgi-bin/GO/goTermMapper
2. Enrichment analysis
Is the selected set of genes enriched in the GO term "cell cycle"?
2. Enrichment analysis
R code for the chi-square test without continuity correction:
> chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct = FALSE)

        Pearson's Chi-squared test

data:  matrix(c(285, 5012, 100, 691), 2, 2)
X-squared = 61.2644, df = 1, p-value = 4.99e-15
2. Enrichment analysis
The chi-squared test is an approximate test and may not perform well when the sample size is small. Fisher's exact test is a better alternative.
Fisher's exact test: G genes in the genome (G = 1663) are analyzed against a functional category F (six functional categories). If, in a cluster of size C, h genes are found to be in a functional category F with m genes, then the p-value (i.e. the probability of observing h or more annotated genes in the cluster) is calculated as (Tavazoie et al. 1999)
$$P = 1 - \sum_{i=0}^{h-1} \frac{\binom{m}{i}\binom{G-m}{C-i}}{\binom{G}{C}}$$
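The probability of observing h or more annotated genes in a cluster is the upper tail of a hypergeometric distribution. The sketch below computes it three ways in base R: an explicit sum, `phyper`, and a one-sided `fisher.test`. The counts are taken from the 2x2 table of the chi-square example, but reading that table as "100 of 791 selected genes fall in a category of 385 genes out of 6088" is an assumption made here for illustration.

```r
# Assumed reading of the chi-square example's table (see lead-in).
G <- 6088   # genes in total
m <- 385    # genes in the functional category
C <- 791    # genes in the cluster / selected set
h <- 100    # selected genes that fall in the category

# Explicit upper-tail sum, on the log scale to avoid overflow in choose().
i <- h:min(m, C)
p_sum <- sum(exp(lchoose(m, i) + lchoose(G - m, C - i) - lchoose(G, C)))

# Same tail from the hypergeometric distribution function: P(X >= h).
p_hyper <- phyper(h - 1, m, G - m, C, lower.tail = FALSE)

# Same tail from a one-sided Fisher's exact test on the 2x2 table.
tab <- matrix(c(h, m - h, C - h, G - m - C + h), 2, 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(sum = p_sum, phyper = p_hyper, fisher = p_fisher))
```

The three values should agree up to floating-point error, which makes `phyper` a convenient building block when the test has to be repeated over many categories.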
2. Enrichment analysis
- In practice, we need to search through thousands of GO terms to determine which GO terms are enriched in the selected gene set.
- Multiple comparison problem!!
- Difficulty: the tests are highly dependent.
  - Hierarchical structure of the GO
    - e.g. Cell Proliferation is a parent GO term of Cell Cycle.
  - Each gene can belong to multiple GO terms.
    - e.g. the human HoxA7 gene belongs to four GO terms: Development, Nucleus, DNA dependent regulation and transcription, Transcription factor activity.
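To make the multiple comparison problem concrete, the sketch below adjusts a handful of per-term p-values with base R's `p.adjust`. The p-values are hypothetical. Bonferroni holds under any dependence between tests but is conservative; the Benjamini-Hochberg procedure controls the false discovery rate under independence or certain forms of positive dependence, so with strongly dependent GO terms its guarantees should be treated with some caution.

```r
# Hypothetical raw p-values, one per GO term tested.
p_raw <- c(0.0001, 0.004, 0.02, 0.03, 0.2, 0.6)

# Family-wise error control (conservative).
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# False discovery rate control (Benjamini-Hochberg).
p_bh <- p.adjust(p_raw, method = "BH")

print(data.frame(p_raw, p_bonf, p_bh))
```

With thousands of GO terms instead of six, the gap between the two adjustments becomes much larger, which is one reason FDR-based corrections are often preferred in this setting.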
2. Enrichment analysis
- Simple Fisher's exact test
  - Ingenuity Pathway
    - A commercial package with a good interface and human-curated annotation. Can generate network figures.
  - NIH DAVID
    - Free and web-based. Performs enrichment analysis (Fisher's exact test), adjusts for multiple comparisons and generates a table of results. Uses multiple databases.
  - GOstats package in Bioconductor
    - Free R package. Performs enrichment analysis (Fisher's exact test) and generates a table of results. Uses only the GO database.
- More sophisticated and systematic methods
  - Gene set enrichment analysis (GSEA; MIT, Mesirov's group)
    - http://www.broad.mit.edu/gsea/
  - Gene set analysis (GSA; Stanford, Tibshirani's group)
    - http://www-stat.stanford.edu/~tibs/GSA/
2. Enrichment analysis
- Things to note when using biological databases
  - Biological pathways and gene functions are complex and difficult to quantify.
  - Data may not be accurate. The analysis should take into account the strength of evidence.
  - May need to go to a specific database for a particular organism (e.g. SGD for yeast; FlyBase and BDGP for fly).
- Systematically collecting and managing massive biological knowledge from publications and experiments is an important and active research topic in bioinformatics.
3. Motif Finding
http://web.indstate.edu/thcme/mwking/gene-regulation.html
3. Motif Finding
- Genes in a cluster have similar expression patterns.
- They might share common regulatory motifs so they are expressed simultaneously.
- It is of interest to find motifs from the gene clusters.
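As a deliberately naive illustration of looking for a shared motif, the sketch below counts, for every k-mer, how many sequences in a cluster contain it. This simple enumeration stands in for the more sophisticated motif-finding methods and is not taken from the slides. The sequences are made up, with the 6-mer ACGCGT planted in each, and `kmer_support` is a hypothetical helper.

```r
# For each k-mer, count the number of sequences that contain it at least once.
kmer_support <- function(seqs, k) {
  per_seq <- lapply(seqs, function(s) {
    starts <- seq_len(max(0, nchar(s) - k + 1))
    unique(substring(s, starts, starts + k - 1))
  })
  sort(table(unlist(per_seq)), decreasing = TRUE)
}

# Hypothetical upstream sequences for genes in one cluster.
upstream <- c(
  "TTGACGCGTAAC",
  "CCACGCGTTTGA",
  "GGTTACGCGTCA"
)

support <- kmer_support(upstream, k = 6)
print(head(support, 3))
```

A k-mer present in most sequences of a cluster is only a candidate. Whether it is over-represented relative to background sequences would still need a statistical test, such as the enrichment tests discussed earlier.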
3. Motif Finding
The following materials are obtained from Shirley Liu at Harvard.