Title: In silico Gene Function Prediction Using Ontology-based Pattern Identification

1
In silico Gene Function Prediction Using
Ontology-based Pattern Identification

Paper Review
By Chunyan Meng
February 13, 2006

2
Paper Source

Authors: Zhou Y (1), Young JA (2), Santrosyan A (1), Chen K (1), Yan SF (1), Winzeler EA (1,2)
(1) Genomics Institute of the Novartis Research Foundation, San Diego, CA 92121, USA
(2) Department of Cell Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
Source: Bioinformatics. 2005 Apr 1; 21(7): 1237-45

3
Related Words

In silico: in or by means of a computer simulation (http://www.worldwidewords.org/weirdwords/ww-ins1.htm)

OPI: Ontology-based Pattern Identification, a data-mining algorithm introduced in this paper

GBA: guilt by association, the observation that functionally related genes tend to share similar mRNA expression profiles

Malarial parasite Plasmodium falciparum: the most deadly of the four human malaria parasites

4
Microarray Data Analysis Open Questions

Is data transformation beneficial? If so, which transformation is better?
How should trivial expression patterns be filtered out?
What is the best similarity metric for clustering: Pearson correlation coefficient-based or Euclidean distance-based (DChip manual)?
Is the partition-based method better? What is the right k for k-means clustering (Yeung et al., 2001)?
Is the hierarchical clustering method more flexible? What similarity threshold identifies the right sub-tree for functional analysis (Allocco et al., 2004)?
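
To make the similarity-metric question concrete, here is a minimal sketch comparing the two metrics on two made-up expression profiles. The profiles `gene_a` and `gene_b` are hypothetical, not taken from the paper. They have the same shape but differ tenfold in magnitude, so Pearson correlation treats them as identical while Euclidean distance treats them as far apart.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical profiles: same pattern across 4 experiments, 10x different scale.
gene_a = [1.0, 2.0, 4.0, 8.0]
gene_b = [10.0, 20.0, 40.0, 80.0]

print("Pearson r:", pearson(gene_a, gene_b))
print("Euclidean distance:", euclidean(gene_a, gene_b))
```

Which behavior is preferable depends on whether absolute expression level should count as part of the pattern, which is why the choice is listed as an open question.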

5
Motivation of This Research

Because traditional methods have had limited success in producing clusters of genes with a high degree of functional similarity, new data-mining algorithms are required to fully exploit the potential of high-throughput genomic approaches.
This research tries to address the open questions listed above.

6
Key Contribution of This Study

A tool based on the OPI algorithm that discovers the optimal analysis pipeline and its corresponding parameters, given some existing related biological knowledge.
The research applied OPI to a publicly available gene expression data set covering the life cycle of the malarial parasite Plasmodium falciparum and systematically annotated genes for 320 functional categories based on current Gene Ontology annotations.

7
OPI Method Foundation

Minimize $P_D(D(M, X, G), G)$ over all $M \in \mathcal{M}$
M: a method, i.e. a data analysis pipeline and its parameter set
X: a data set
G: a piece of existing knowledge
D: a discovery made by applying method M to the data set X with hints from knowledge G
$P_D$: the discovery score; lower scores mean more interesting discoveries

8
OPI Method Implementation

For gene expression-based functional annotation:
X: an n x m matrix of gene expression profiles, where n is the total number of genes and m is the total number of experiments.
P. falciparum life cycle gene expression profile: n = 5159 genes, m = 16 expression profiles covering different parasite life cycle stages.

9
OPI Method Implementation

G: the list of NG known genes among the NT genes that survive the data-filtering process of method M.
In practice, G is the list of genes on a vertex of the GO (Gene Ontology) representation graph, which represents a group of genes with known functionality.
2096 of the 5159 genes have some annotation (manually confirmed annotations only).

10
OPI Method Implementation

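Apply OPI to each functional group of GO.
$\mathcal{M}$: the hyper-dimensional space of analysis methods, formed by combining the following dimensions:
$\mathcal{M} = P_{\mathrm{ANOVA}} \times \{\mathrm{Linear}, \mathrm{Log}\} \times \{\mathrm{Unweighted}, \mathrm{Weighted}\} \times \{Q_{\mathrm{Single}}, Q_{\mathrm{Average}}\} \times S_0$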
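
As a rough illustration of the method space, the three two-way choices named on this slide can be enumerated as a Cartesian product. The variable names are hypothetical. PANOVA and S0 are left out on the assumption that they are numeric thresholds tuned within each combination rather than categorical choices.

```python
from itertools import product

# The three categorical dimensions named on the slide.
transforms = ["Linear", "Log"]
weightings = ["Unweighted", "Weighted"]
queries = ["QSingle", "QAverage"]

# Every combination of one option per dimension.
methods = list(product(transforms, weightings, queries))

for method in methods:
    print(" x ".join(method))
print("Number of combinations:", len(methods))
```

The count matches the 8 methods compared later in the deck.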
11
Objective Function of the Problem
[Figure: diagram of the gene sets NT, NG, ND and NC]
For a gene functional group, D is the set of ND genes that are predicted to have the same function according to GBA. NC denotes how many of these ND genes are correctly annotated, i.e. already in the known list G.
The probability that at least NC genes are correctly annotated in a random sample of ND genes drawn from the total collection of NT genes follows a cumulative hypergeometric distribution. The problem is to minimize the probability that the discovery arises by chance.
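
The cumulative hypergeometric probability described above can be written out directly. This is a minimal sketch using only the standard library, not the authors' implementation; the function name and the toy numbers in the usage lines are assumptions made for illustration.

```python
from math import comb

def hypergeom_tail(n_total, n_known, n_drawn, n_correct):
    """P(at least n_correct known genes among n_drawn genes sampled
    without replacement from n_total genes, n_known of which are known).

    Corresponds to NT, NG, ND and NC on the slide.
    """
    upper = min(n_drawn, n_known)
    favorable = sum(
        comb(n_known, i) * comb(n_total - n_known, n_drawn - i)
        for i in range(n_correct, upper + 1)
    )
    return favorable / comb(n_total, n_drawn)

# Toy numbers (hypothetical): 10 genes, 5 known, 5 selected, all 5 correct.
print(hypergeom_tail(10, 5, 5, 5))  # one favorable draw out of C(10, 5) = 252
# Requiring zero correct genes is always satisfied.
print(hypergeom_tail(10, 5, 5, 0))
```

A smaller value means the overlap between the predicted list and the known list is less likely to have occurred by chance.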
12
Solving the Problem

The implemented program exhaustively searches the parameter space for all possible local minima.

Example: for the cell-cell adhesion group, the search runs along the S0 dimension with the other 4 dimensions fixed.
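
A search along the S0 dimension can be sketched as follows. This is an illustration under stated assumptions, not the paper's program: each gene is assumed to have a precomputed similarity to the group's query pattern, genes at or above the threshold form the predicted list, and the threshold with the lowest hypergeometric probability wins. The function names, gene IDs and similarity values are all hypothetical.

```python
from math import comb

def hypergeom_tail(n_total, n_known, n_drawn, n_correct):
    """P(at least n_correct known genes in a random draw of n_drawn genes)."""
    upper = min(n_drawn, n_known)
    favorable = sum(
        comb(n_known, i) * comb(n_total - n_known, n_drawn - i)
        for i in range(n_correct, upper + 1)
    )
    return favorable / comb(n_total, n_drawn)

def best_threshold(similarities, known, thresholds):
    """Scan candidate S0 values; return (s0, score, selected genes) with the lowest score."""
    n_total = len(similarities)
    n_known = len(known & similarities.keys())
    best = None
    for s0 in thresholds:
        selected = {g for g, s in similarities.items() if s >= s0}
        if not selected:
            continue  # nothing passes this threshold
        n_correct = len(selected & known)
        score = hypergeom_tail(n_total, n_known, len(selected), n_correct)
        if best is None or score < best[1]:
            best = (s0, score, selected)
    return best

# Hypothetical similarities of six genes to one group's query pattern.
similarities = {"g1": 0.95, "g2": 0.90, "g3": 0.85, "g4": 0.30, "g5": 0.20, "g6": 0.10}
known = {"g1", "g2", "g3"}  # genes already annotated to the group

s0, score, selected = best_threshold(similarities, known, [0.0, 0.8, 0.92])
print(s0, score, sorted(selected))
```

In this toy case the threshold 0.8 wins because it selects exactly the three known genes; a looser threshold dilutes the list and a stricter one drops known members.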
13
Results

GNF-TSRI Malaria e-Annotation Database
Systematic view of coordination among functional
categories
Validation of functional annotation for antigenic
variation group
Pattern similarity and functional granularity
Comparison of analysis methods

14
GNF-TSRI Malaria e-Annotation Database
Generated 320 functional categories (clusters)
15
Systematic View of Coordination Among Functional
Categories
16
Systematic View of Coordination Among Functional
Categories
This is the first time gene expression data were clustered at the functional-group level rather than at the level of individual genes or samples.
17
Validation of Functional Annotation

Example: the antigenic variation group
67 genes were predicted to belong to this group by OPI based on the 2003 GO database. 50 of the 67 have no indication in the GO database of belonging to this group; 12 of those 50 are now confirmed.
246 genes were predicted to belong to this group by OPI based on the 2004 GO database, with a better score.

18
Pattern Similarity
The size of each marker indicates the size of the resulting gene list.

Finding: there is no universal quantitative formula relating FDR to expression similarity. The threshold S0 must be set specifically for each individual GO category.

19
Comparison of Analysis Methods

The number of times each of the 8 methods yielded a group with FDR < 50%, resulting in a total of 104 groups

20
Comments from the Negative Aspects

The mathematics is complex and computationally expensive, especially the search for the optimal solution.
Validation of the gene function annotation is based on one example; systematic validation is lacking.
The similarity coefficient threshold S0 plays a very important role in the clustering process, and it depends on the quality of the GO-annotated genes.

21
Positive Aspects

OPI gives a richer and more precise biological interpretation of the same data than the k-means approach. Most OPI clusters have smaller p-values, which means higher statistical confidence.
It takes advantage of both supervised clustering (using GO categories) and unsupervised clustering (grouping by gene expression similarity).
Compared with the k-means method, OPI allows more function-specific clusters because of its hierarchical clustering feature.

22
References

Allocco, D.J. et al. (2004) Quantifying the
relationship between co-expression, co-regulation
and gene function. BMC Bioinformatics, 5, 18.
Yeung, K. et al. (2001) Validating clustering for
gene expression data. Bioinformatics, 17,
309-318.
