Title: In silico Gene Function Prediction Using Ontology-based Pattern Identification
1In silico Gene Function Prediction Using
Ontology-based Pattern Identification
- Paper Review
- By Chunyan Meng
- February 13, 2006
2Paper Source
- Authors: Zhou Y (1), Young JA (2), Santrosyan A (1),
Chen K (1), Yan SF (1), Winzeler EA (1,2)
- (1) Genomics Institute of the Novartis Research
Foundation, San Diego, CA 92121, USA
- (2) Department of Cell Biology, The Scripps
Research Institute, La Jolla, CA 92037, USA
- Source: Bioinformatics. 2005 Apr 1; 21(7): 1237-45
3Related Words
- In silico: in or by means of a computer
simulation (http://www.worldwidewords.org/weirdwords/ww-ins1.htm)
- OPI: Ontology-based Pattern Identification, a
data-mining algorithm introduced in this paper
- GBA: guilt by association, the observation that
functionally related genes tend to share similar
mRNA expression profiles
- Malarial parasite Plasmodium falciparum: the most
deadly of the four human malaria parasites
4Microarray Data Analysis Open Questions
- Is data transformation beneficial? Which
transformation is better?
- How should trivial expression patterns be
filtered out?
- What is the best similarity metric for
clustering?
- Pearson correlation coefficient-based or
Euclidean distance-based (DChip manual)?
- Is the partition-based method better? What is the
right k for k-means clustering (Yeung et al.,
2001)?
- Is the hierarchical clustering method more
flexible? What is the similarity threshold to
identify the right sub-tree for functional
analysis (Allocco et al., 2004)?
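The similarity-metric question can be made concrete with a small sketch. The three expression profiles below are hypothetical values chosen for illustration, not data from the paper: gene_b has the same shape as gene_a at ten times the scale, while gene_c is close in magnitude but differs in shape.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical expression profiles (illustration only).
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [10.0, 20.0, 30.0, 20.0]   # same shape as gene_a, 10x the scale
gene_c = [1.5, 1.5, 2.5, 2.5]       # similar magnitude, different shape

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name, "pearson:", round(pearson(gene_a, other), 3),
          "euclidean:", round(euclidean(gene_a, other), 3))
```

With these values the two metrics disagree: Pearson correlation ranks gene_b as the closer match to gene_a, whereas Euclidean distance ranks gene_c closer. This is why the choice of metric is treated as an open question.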
5Motivation of This Research
- Traditional methods have had limited success in
producing gene clusters with a high degree of
functional similarity, so new data-mining algorithms
are required to fully exploit the potential of
high-throughput genomic approaches.
- This research tries to address the open questions
listed above.
6Key Contribution of This Study
- A tool based on the OPI algorithm that discovers the
optimal analysis pipeline and its corresponding
parameters, given some existing related biological
knowledge.
- The authors applied OPI to a publicly available
gene expression data set on the life cycle of the
malarial parasite Plasmodium falciparum and
systematically annotated genes for 320 functional
categories based on current Gene Ontology
annotations.
7OPI Method Foundation
- Minimize $P_D(D(M, X, G), G)$, $\forall M \in \mathcal{M}$
- M: a method, i.e. a data analysis pipeline
and its parameter set
- X: a data set
- G: a piece of existing knowledge
- D: a discovery obtained by applying method M to the
data set X with hints from knowledge G
- $P_D$: the discovery score; lower scores mean more
interesting discoveries
8OPI Method Implementation
- For gene expression-based functional annotation.
- X: an n x m matrix of gene expression profiles,
where n is the total number of genes and m is the
total number of experiments.
- P. falciparum life cycle gene expression profile:
n = 5159 genes, m = 16 expression profiles covering
different parasite life cycle stages.
9OPI Method Implementation
- G: the list of $N_G$ known genes among the total
number of $N_T$ genes that survive the
data-filtering process using the method M.
- The list of genes on each vertex of the GO (Gene
Ontology) representation graph, which represents a
group of genes with known functionality.
2096 of the 5159 genes have some annotation. Only
manually confirmed annotations are used.
10OPI Method Implementation
- Apply OPI on each functional group of GO
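- $\mathcal{M}$: the hyper-dimensional space of analysis
methods, formed by combining the following dimensions:
$\mathcal{M} = P_{\mathrm{ANOVA}} \times \{\text{Linear}, \text{Log}\} \times \{\text{Unweighted}, \text{Weighted}\} \times \{Q_{\text{Single}}, Q_{\text{Average}}\} \times S_0$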
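The discrete part of this method space can be enumerated as a Cartesian product. The sketch below assumes that the three two-valued choices (Linear/Log, Unweighted/Weighted, QSingle/QAverage) are the discrete dimensions and that PANOVA and S0 are tunable thresholds handled separately. The variable names are illustrative, not taken from the paper's implementation.

```python
from itertools import product

# Assumed discrete dimensions of the method space (labels from the slide).
transforms = ["Linear", "Log"]
weightings = ["Unweighted", "Weighted"]
queries = ["QSingle", "QAverage"]

# Every combination of the three two-valued choices.
discrete_methods = list(product(transforms, weightings, queries))

for method in discrete_methods:
    print(" x ".join(method))

print(len(discrete_methods), "discrete method combinations")
```

The count of 2 x 2 x 2 = 8 combinations matches the 8 methods compared later in the review.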
11Objective Function of the Problem
[Figure: diagram relating the gene counts $N_T$, $N_G$, $N_D$ and $N_C$]
For a gene functional group, D is the set of $N_D$ genes
that are predicted to have the same function according
to GBA.
The probability that at least $N_C$ genes are
correctly annotated in a random sample of $N_D$
genes drawn from the total collection of $N_T$ genes
follows a cumulative hypergeometric
distribution. The problem is to minimize the
probability of knowledge discovery by chance.
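The chance probability described above can be sketched as a hypergeometric tail sum. This is a minimal illustration, not the paper's code. The parameters n_total, n_known, n_drawn and n_correct stand for the total, known, predicted and correctly annotated gene counts, and the sketch assumes the correctly annotated genes are those found in both the known list and the predicted list. The numbers in the usage lines are a toy example.

```python
from math import comb

def hypergeom_tail(n_total, n_known, n_drawn, n_correct):
    """P(X >= n_correct) when n_drawn genes are sampled without replacement
    from n_total genes, n_known of which belong to the functional group."""
    upper = min(n_drawn, n_known)          # largest possible overlap
    denom = comb(n_total, n_drawn)         # all ways to draw n_drawn genes
    numer = sum(
        comb(n_known, i) * comb(n_total - n_known, n_drawn - i)
        for i in range(n_correct, upper + 1)
    )
    return numer / denom

# Toy example: 10 genes, 4 known members, 3 predicted, at least 2 correct.
p = hypergeom_tail(10, 4, 3, 2)
print(p)
```

A smaller value means the overlap between the predicted and known genes is less likely to have arisen by chance, which is the quantity the slide proposes to minimize.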
12Solving the Problem
- The implemented program exhaustively searches all
possible local minima in the local parameter
space.
Example: for the Cell-Cell-Adhesion group, the search
runs along the $S_0$ method dimension with the other 4
dimensions fixed.
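A one-dimensional search of this kind can be sketched as a scan over candidate thresholds. This is a simplified grid scan under stated assumptions, not the exhaustive local-minimum search the paper implements. The gene names, similarity scores, known-gene set and threshold grid are all hypothetical, and the score is taken to be the hypergeometric chance probability.

```python
from math import comb

def hypergeom_tail(n_total, n_known, n_drawn, n_correct):
    """P(X >= n_correct) for n_drawn genes sampled without replacement."""
    upper = min(n_drawn, n_known)
    numer = sum(
        comb(n_known, i) * comb(n_total - n_known, n_drawn - i)
        for i in range(n_correct, upper + 1)
    )
    return numer / comb(n_total, n_drawn)

def best_threshold(similarity, known, thresholds):
    """Return (score, s0) with the lowest chance probability over thresholds."""
    n_total = len(similarity)
    n_known = len(known)
    best = None
    for s0 in thresholds:
        # Genes whose similarity to the query pattern reaches the threshold.
        selected = {gene for gene, s in similarity.items() if s >= s0}
        if not selected:
            continue
        n_correct = len(selected & known)
        score = hypergeom_tail(n_total, n_known, len(selected), n_correct)
        if best is None or score < best[0]:
            best = (score, s0)
    return best

# Hypothetical similarity of each gene to a query pattern (illustration only).
similarity = {"g1": 0.95, "g2": 0.90, "g3": 0.85, "g4": 0.60,
              "g5": 0.55, "g6": 0.30, "g7": 0.20, "g8": 0.10}
known = {"g1", "g2", "g3", "g6"}          # hypothetical known group members
thresholds = [0.9, 0.8, 0.5, 0.25, 0.0]   # hypothetical grid of S0 values

score, s0 = best_threshold(similarity, known, thresholds)
print("best S0:", s0, "score:", score)
```

In this toy case the lowest score occurs at an intermediate threshold. A stricter cutoff selects too few genes to be convincing, and a looser one admits genes outside the known group.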
13Results
- GNG-TSRI Malaria e-Annotation Database
- Systematic view of coordination among functional
categories
- Validation of functional annotation for antigenic
variation group
- Pattern similarity and functional granularity
- Comparison of analysis methods
14GNG-TSRI Malaria e-Annotation Database
Generated 320 functional categories (clusters)
15Systematic View of Coordination Among Functional
Categories
16Systematic View of Coordination Among Functional
Categories
This is the first time gene expression data
were clustered at the functional-group level
rather than at the level of individual genes or samples
17Validation of Functional Annotation
- Uses the antigenic variation group as an example
- 67 genes were predicted in this group by OPI based on
the 2003 GO database. 50 of the 67 had no indication in
the GO database of belonging to this group; 12 of
those 50 are now confirmed.
- 246 genes were predicted in this group by OPI based on
the 2004 GO database, with a better score.
18Pattern Similarity
The size of each marker indicates the size of the
resultant gene list
- Finding: there is no universal quantitative formula
relating FDR to expression similarity. The
threshold $S_0$ must be set specifically for each
individual GO category.
19Comparison of Analysis Methods
- The number of times each of the 8 methods yielded
a group with FDR < 50%, resulting in a total of 104
groups
20Comments from the Negative Aspects
- The mathematics is complex and computationally
expensive, especially the search for the optimal
solution.
- Validation of gene function annotation is based
on one example and lacks systematic validation.
- The similarity coefficient threshold $S_0$ plays a very
important role in the clustering process, and it
depends on the quality of the annotated genes in
GO.
21Positive Aspects
- OPI gives a richer and more precise biological
interpretation of the same data than the k-means
approach. Most OPI clusters have smaller
p-values, which means higher statistical
confidence.
- Takes advantage of both supervised clustering (by
using GO categories) and unsupervised clustering
(by grouping based on gene expression
similarities).
- Compared to the k-means method, OPI allows more
function-specific clusters due to its
hierarchical clustering feature.
22References
- Allocco, D.J. et al. (2004) Quantifying the
relationship between co-expression, co-regulation
and gene function. BMC Bioinformatics, 5, 18.
- Yeung, K. et al. (2001) Validating clustering for
gene expression data. Bioinformatics, 17,
309-318.