Missing values: impact on classification

1
Missing values impact on classification
Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering
Alexandre G. de Brevern
Equipe de Bioinformatique Génomique et Moléculaire (EBGM), INSERM U726 / Université Paris VII, 75251 Paris Cedex 05, France
4 March 2005
2
Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering
Alexandre G. de Brevern (1), Serge Hazout (1) and Alain Malpertuy (2)
(1) EBGM, (2) Atragene Bioinformatics
BMC Bioinformatics (2004) Aug 23; 5(1):114.
3
Missing values in microarray experiments
available data (www)
4
The question.
Hierarchical clustering
Microarray results
Microarray results with missing values (MVs)
Analysis: consequence?
5
1. Classical approach for MVs
2. kNN method: principle and analysis
3. Principle of the evaluation
4. Results
5. Last and future works
6
1. Classic approach
Diagram: microarray results with MVs -> fill-in (kNN) -> microarray results without MVs -> hierarchical clustering -> analysis
7
2. kNN k-Nearest Neighbors
Two kNN variants:
Simple: mean of the values of the k nearest neighbors
Weighted: mean weighted by the distance
The question: which k?
Troyanskaya OG, Cantor M, Sherlock G, Brown PO, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17:520-525.
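The two kNN variants named on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the study. The function names, the use of `None` for a missing value, the root-mean-square Euclidean distance over shared conditions, and the inverse-distance weights are all assumptions made for the example.

```python
import math

def distance(a, b):
    """Root-mean-square Euclidean distance over conditions observed in both profiles."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def knn_impute(data, k, weighted=False):
    """Return a copy of data (genes x conditions) with None entries estimated."""
    filled = [row[:] for row in data]
    for g, row in enumerate(data):
        for c, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes that are observed in condition c.
            neighbors = sorted(
                (distance(row, other), other[c])
                for h, other in enumerate(data)
                if h != g and other[c] is not None
            )[:k]
            neighbors = [(d, v) for d, v in neighbors if d != math.inf]
            if not neighbors:
                continue  # nothing to estimate from, leave the gap
            if weighted:
                # Assumption: weights are inverse distances (small constant avoids 1/0).
                weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
            else:
                weights = [1.0] * len(neighbors)
            total = sum(w * v for w, (_, v) in zip(weights, neighbors))
            filled[g][c] = total / sum(weights)
    return filled

# Toy matrix: gene 0 has one missing value in the last condition.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 100.0],
]
simple = knn_impute(toy, k=2)
weighted = knn_impute(toy, k=2, weighted=True)
print(simple[0][2], weighted[0][2])
```

On this toy matrix the simple variant averages the two closest genes, while the weighted variant is pulled towards the gene whose profile matches exactly.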
8
2. kNN k-Nearest Neighbors
The data. The data sets were chosen because they contain few MVs and because, after filtering, the number of genes remains large (ca. 6000).
The original Ogawa set (OS) contained 6013 genes, 230 of which had MVs. Eliminating the genes with MVs (i.e. 3.8% of the genes) leads to a set of 5783 genes.
For the Gasch set (GS), the number of MVs is larger, and some experimental conditions have more than 50% of MVs. We therefore reduced the number of selected experimental conditions from 178 to 42 (see section Methods); this conserves 5843 genes, i.e. only 310 genes are not analyzed, representing 5.0% of all the genes.
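The filtering described above can be sketched in two steps: drop the conditions with too many MVs, then drop the genes that still have an MV. This is a hedged illustration only. The exact selection criterion is given in the Methods section of the paper, not on the slide, so the 50% threshold, the function names and the toy matrix are assumptions.

```python
def drop_sparse_conditions(data, max_missing_rate=0.5):
    """Keep the conditions (columns) whose fraction of None is <= max_missing_rate."""
    n_genes = len(data)
    keep = [
        c for c in range(len(data[0]))
        if sum(row[c] is None for row in data) / n_genes <= max_missing_rate
    ]
    return [[row[c] for c in keep] for row in data]

def drop_incomplete_genes(data):
    """Keep only the genes (rows) without any missing value."""
    return [row for row in data if all(v is not None for v in row)]

# Toy matrix (genes x conditions); None marks a missing value.
raw = [
    [0.1, None, 0.3],
    [0.2, None, None],
    [0.5, 0.4, 0.6],
    [0.7, None, 0.8],
]
filtered = drop_sparse_conditions(raw)      # the second condition is 75% missing
complete = drop_incomplete_genes(filtered)  # the second gene still has a gap
print(len(raw) - len(complete), "gene(s) removed out of", len(raw))
```

Removing the sparse conditions first is what lets most genes survive the second step, which matches the slide's point that only a small share of genes is lost.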
9
2. kNN k-Nearest Neighbors
The data (2). We analyzed different subsets corresponding to 1/7, 1/6, 1/5, 1/4, 1/3 and 1/2 of the complete sets (GS and OS).
Moreover, we defined two smaller sets, GSH2O2 and GSHEAT, from GS, corresponding respectively to the H2O2 and heat shock experimental conditions.
10
2. kNN k-Nearest Neighbors
The results: evaluation of the optimal k
11
2. kNN k-Nearest Neighbors
The results: evaluation of the optimal k
Sometimes not so close to 15.
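One common way to evaluate k is to hide values that are actually known, re-estimate them with kNN, and compare the error across several k. The sketch below illustrates that idea under stated assumptions. The function names, the RMSE criterion, the masking scheme and the synthetic two-group data are hypothetical and are not taken from the presentation.

```python
import math
import random

def knn_estimate(data, g, c, k):
    """Simple kNN estimate of data[g][c], computed without using that value."""
    def dist(a, b):
        pairs = [(x, y) for j, (x, y) in enumerate(zip(a, b))
                 if j != c and x is not None and y is not None]
        if not pairs:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))
    candidates = sorted((dist(data[g], row), row[c])
                        for h, row in enumerate(data)
                        if h != g and row[c] is not None)
    values = [v for d, v in candidates[:k] if d != math.inf]
    return sum(values) / len(values) if values else None

def rmse_for_k(data, k, n_masked, seed=0):
    """Hide n_masked known entries one at a time, re-estimate them, return the RMSE."""
    rng = random.Random(seed)
    cells = [(g, c) for g, row in enumerate(data)
             for c, v in enumerate(row) if v is not None]
    squared = []
    for g, c in rng.sample(cells, n_masked):
        estimate = knn_estimate(data, g, c, k)
        if estimate is not None:
            squared.append((estimate - data[g][c]) ** 2)
    return math.sqrt(sum(squared) / len(squared)) if squared else math.inf

# Synthetic data: two groups of six genes with opposite expression profiles.
base_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
base_b = base_a[::-1]
toy = [[v + 0.1 * i for v in base] for base in (base_a, base_b) for i in range(6)]

ks = [1, 3, 5, 7, 9]
errors = {k: rmse_for_k(toy, k, n_masked=30) for k in ks}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

In this synthetic case each gene has only five similar genes, so any k above 5 pulls in genes from the opposite group and the error grows. This is one way to see why the best k depends on the data set.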
12
2. kNN k-Nearest Neighbors
The results: difference with the real values.
13
2. kNN k-Nearest Neighbors
The results: the extreme values
Values > 2.5
Values < 0.5
14
3. Principle of the method.
15
3. Principle of the method.
16
3. Principle of the method.
Microarray data without MVs
17

3. Principle of the method.
18
3. Principle of the method.
19
3. Principle of the method.
[Equation not preserved in this transcript.]
The term it refers to is the Kronecker symbol, i.e. it is equal to 1 when the genes i and i' in the two gene lists are identical, and 0 otherwise. G denotes the total number of genes. This index takes its maximal value, 1, when the clusterings RC and GR are identical.
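Since the equation itself is lost, the sketch below shows only one plausible reading of such an index and should not be taken as the authors' definition. For each cluster of the first clustering, it counts the genes shared with the best-matching cluster of the second one, and divides the total by G. The function name, the dictionary representation and the matching rule are assumptions. The sketch does reproduce the stated property that the index equals 1 for identical clusterings.

```python
from collections import Counter

def conserved_proportion(reference, generated):
    """Share of genes kept together with the best-matching generated cluster.

    reference, generated: dicts mapping gene name -> cluster label.
    Assumption: each reference cluster is matched to the generated cluster
    with which it shares the most genes.
    """
    genes = list(reference)
    conserved = 0
    for label in set(reference.values()):
        members = [g for g in genes if reference[g] == label]
        overlap = Counter(generated[g] for g in members)
        conserved += max(overlap.values())
    return conserved / len(genes)  # division by G, the total number of genes

rc = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
same = {"g1": "a", "g2": "a", "g3": "b", "g4": "b"}   # same groups, other labels
moved = {"g1": "a", "g2": "a", "g3": "a", "g4": "b"}  # g3 joined the first group

print(conserved_proportion(rc, same))   # 1.0
print(conserved_proportion(rc, moved))  # 0.75
```

Cluster labels do not need to agree between the two clusterings, only the gene groupings matter.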
20
3. Principle of the method.
Hierarchical clustering: highly distinct topologies.
Different aggregation methods can be used to construct the dendrogram; they generally lead to different tree topologies and, a fortiori, to different cluster definitions.
The single-linkage algorithm is based on the concept of joining the two closest objects (i.e. genes) of two clusters to create a new cluster. Thus the single-linkage clusters contain numerous members and are branched in high-dimensional space. The resulting clusters are affected by the chaining phenomenon (i.e. the observations are added to the tail of the biggest cluster).
In the complete-linkage algorithm, the distance between clusters is defined as the distance between the most distant pair of objects (i.e. genes). This method gives compact clusters.
The average-linkage algorithm is based on the mean similarity of the observations to all the members of the cluster.
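The three aggregation criteria described above differ only in how the distance between two clusters is derived from the pairwise distances between their members. A minimal sketch, assuming Euclidean distance between points (the function names and the toy clusters are illustrative):

```python
import math

def pairwise(cluster_a, cluster_b):
    """All Euclidean distances between a member of cluster_a and a member of cluster_b."""
    return [math.dist(a, b) for a in cluster_a for b in cluster_b]

def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest members."""
    return min(pairwise(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the two most distant members."""
    return max(pairwise(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean distance over all pairs of members."""
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)

left = [(0.0, 0.0), (0.0, 1.0)]
right = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(left, right))
print(average_linkage(left, right))
print(complete_linkage(left, right))
```

At each step an agglomerative algorithm merges the pair of clusters with the smallest such distance. Because the single-linkage value never exceeds the average one, which never exceeds the complete one, the same data can produce quite different trees.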
21
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
single
22
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
centroid
23
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
average
24
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
median
25
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
McQuitty
26
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
Ward
27
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
complete
28
4. Results.
Visualization of 1% of missing values, complete algorithm.
OS 1/6, 827 genes, MVs rate: 1 MV per gene, i.e. only 8 values!
29
4. Results.
Hierarchical clustering Highly distinct
topologies. Results.
30
4. Results.
Hierarchical clustering Highly distinct
topologies. Results. CPP (classic)
31
4. Results.
Hierarchical clustering Highly distinct
topologies. Results. CPPf (classic).
With f = 5. CPPf finds the genes associated with close clusters, i.e. here at most 5.
32
4. Results.
Hierarchical clustering Highly distinct
topologies. Results. CPP (extreme
values)
33
4. Results.
Hierarchical clustering Highly distinct
topologies. Results. CPPf (extreme values).
With f = 5.
34
4. Results.
Conclusion
It is not good to have missing values. However, the replacement of MVs is an obligation.
It is better to use kNN > zero > nothing.
The choice of k is critical (and not trivial).
The kind of algorithm is also important.
Future
35
5. Last and future works.
New methods. New approaches better than kNN are now available. A recent work proposes Bayesian principal component analysis to deal with MVs (Oba et al., 2003). In the same way, Zhou and co-workers (Zhou et al., 2003) have used Bayesian gene selection to estimate the MVs with linear and non-linear regression.
Oba S, Sato M-A, Takemasa I, Monden M, Matsubara K-I, Ishii S: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19:2088-2096. Java bytecode available.
Zhou X, Wang X, Dougherty ER: Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics 2003, 19:2302-2307. Not available.
36
5. Last and future works.
BPCA
Distribution of true, BPCA-predicted and kNN-predicted values for OS (1/7) with t = 6.125 (1 MV per gene).
The BPCA error rate equals 0.000679946, versus 0.001583531 for kNN (k = 13).
37
5. Last and future works.
BPCA
38
5. Last and future works.
Evaluation of LSimpute (Bo et al., NAR 2004, Java), LLSimpute (Kim, Golub & Park, Bioinformatics, 2005, MATLAB) and CMVE (Sehgal, Gondal & Dooley, Bioinformatics, 2005).
Evaluation of their interest for:
the different HC algorithms
extreme values
Self-Organizing Maps
K-means
If I've got time.
Thank you for your attention.