Title: Clustering, Classification and Validation via the L1 Data Depth
Slide 1: Clustering, Classification and Validation via the L1 Data Depth
- Rebecka Jornsten
- Department of Statistics
- Rutgers University
- http://www.stat.rutgers.edu/rebecka
- June 19, 2003
Slide 2: Outline
- Cluster validation via the Relative Data Depth (ReD)
- DDclust: improved clustering accuracy via ReD
- DDclass: classification based on the L1 Data Depth
- Conclusion
Slide 3: Clustering
- Group similar observations
- Find cluster representatives
- Multitude of different clustering algorithms:
  - Hierarchical
  - K-means, PAM
  - Model-based (e.g. mixture of MVN)
  - ...
Slide 4: Clustering
- In the first part of this talk we focus on K-median clustering:
  - Popular in applications
  - Robust, tight clusters
  - Fast approximation algorithms exist: PAM (Partitioning Around Medoids)
- K-median algorithm developed with Vardi and Zhang; cluster representatives are multivariate medians
Slide 5: Cluster Validation
- Select an appropriate number of clusters for the data set
- Identify outliers
- Often based on:
  - Within-cluster sum of squares
  - Distance to cluster representative
  - Cluster membership
Slide 6: The Silhouette width
- a_i is the average distance from observation i to all members of the cluster i has been allocated to
- b_i is the average distance from observation i to all members of the nearest competing cluster
- sil_i = (b_i - a_i) / max(b_i, a_i)
- Choose the number of clusters K with the maximum average sil_i
- sil_i can also identify outliers: look out for negative sil_i values.
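The definition above can be sketched directly; a minimal sketch assuming Euclidean distance (the function name and toy data are illustrative, not from the talk, and every cluster is assumed to have at least two members):

```python
import numpy as np

def silhouette_widths(X, labels):
    """Per-observation silhouette widths sil_i = (b_i - a_i) / max(b_i, a_i).

    a_i: average distance from observation i to the other members of its own
    cluster; b_i: average distance to the members of the nearest competing
    cluster. Assumes Euclidean distance and clusters with >= 2 members.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sil = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = dist[i, same].mean()                      # avg distance within own cluster
        b = min(dist[i, labels == k].mean()           # nearest competing cluster
                for k in np.unique(labels) if k != labels[i])
        sil[i] = (b - a) / max(b, a)
    return sil
```

Choosing K then amounts to maximizing the average of these values over candidate partitions; negative entries flag outlier candidates.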
Slide 7: Silhouette width plot
[Plot: silhouette widths by cluster; one observation flagged "outlier?"]
Slide 8: Problems with the silhouette width
- If the cluster scales (within-cluster variances) differ, the sil values can be misleading
- A loose cluster will look bad if the nearest competing cluster is tight
- Observations in a loose cluster may be mislabeled as outliers, and outliers in a tight cluster may be missed
- Can lead to under-fitting: selecting too few clusters for the data set
Slide 9: L1 Data Depth
- e(z, i) is the unit vector from observation i to z
- e(z) is the average of the unit vectors from all observations to z
- ||e(z)|| is close to 1 if z is close to the edge of the data, close to 0 if z is close to the center
- D(z) = 1 - ||e(z)|| is the L1 data depth of z
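A minimal sketch of this computation (the function name is mine; observations coinciding with z are taken to contribute a zero vector, a convention assumed here):

```python
import numpy as np

def l1_depth(z, X, eps=1e-12):
    """L1 data depth D(z) = 1 - ||e(z)||, where e(z) is the average of the
    unit vectors e(z, i) pointing from each observation i to z."""
    z = np.asarray(z, dtype=float)
    X = np.asarray(X, dtype=float)
    diffs = z - X                          # vectors from observations to z
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > eps                     # coincident observations add a zero vector
    units = diffs[keep] / norms[keep, None]
    e_bar = units.sum(axis=0) / len(X)     # average unit vector e(z)
    return 1.0 - np.linalg.norm(e_bar)
```

At the center of a symmetric cloud the unit vectors cancel and the depth is near 1; far outside the cloud they align and the depth is near 0.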
Slide 10: The Relative Data Depth
- D_i^w is the data depth of observation i with respect to the cluster i has been allocated to
- D_i^b is the data depth of observation i with respect to the nearest competing cluster
- ReD_i = D_i^w - D_i^b
- Choose the number of clusters K with the maximum average ReD_i
- ReD_i can also identify outliers: look out for negative ReD_i values.
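ReD can then be computed per observation from a clustered data set. In this sketch the "nearest competing cluster" is taken to be the competitor in which the observation is deepest, which is one reasonable reading of the definition; the function names are mine:

```python
import numpy as np

def l1_depth(z, X, eps=1e-12):
    """L1 data depth D(z) = 1 - ||average unit vector from observations to z||."""
    z, X = np.asarray(z, dtype=float), np.asarray(X, dtype=float)
    diffs = z - X
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > eps
    units = diffs[keep] / norms[keep, None]
    return 1.0 - np.linalg.norm(units.sum(axis=0) / len(X))

def relative_depths(X, labels):
    """ReD_i = D_i^w - D_i^b for each observation i."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    red = np.zeros(len(X))
    for i in range(len(X)):
        d_w = l1_depth(X[i], X[labels == labels[i]])   # depth in own cluster
        d_b = max(l1_depth(X[i], X[labels == k])       # depth in best competitor
                  for k in np.unique(labels) if k != labels[i])
        red[i] = d_w - d_b
    return red
```

Observations with negative ReD are deeper in a competing cluster than in their own, the outlier signal from the slide.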
Slide 11: Relative Data Depth plot
[Plot: ReD values by cluster; several observations flagged "outliers?" / "outlier?"]
Slide 12: Data Depth plot
Another example: gene expression data (drug experiment).
[Plots: within-cluster depths; between-cluster depths, colored by cluster]
Slide 13: Silhouette widths and Relative Data Depths
[Plots: silhouette widths and Relative Data Depths, with a loose cluster and a tight cluster marked]
Error rates: tight cluster 17, loose cluster 14.
Slide 14: ReD vs. gap and sil
- ReD is less sensitive than sil and gap to the inclusion of unrelated features, and more robust with respect to the noise level of the data
- Performs as well as or better than sil and gap in many simulated scenarios
- For more details, see the paper with Vardi and Zhang, and the slides at http://www.stat.rutgers.edu/rebecka
Slide 15: Improved clustering accuracy via ReD
Clustering and cluster validation: a two-step process. But if ReD (and sil) can identify outliers, why not include the validation criterion in the clustering objective function directly? This suggests the Data Depth Vector Quantizer (DDVQ): find the partition that minimizes the K-median objective function - λ·ReD.
Slide 16: Improved clustering accuracy via ReD
Constrained VQ is a tool often used in engineering; example: entropy-constrained VQ. Approach: search for the λ that satisfies the constraint. What makes the present problem different? We don't have a natural constraint, and so don't know what an appropriate value for λ might be. Furthermore, the optimal value for λ will vary from data set to data set.
Slide 17: DDclust
Our approach: use the clustering criterion C = (1 - λ)·sil + λ·ReD, where sil plays the role of the K-median cost (scale dependent) and ReD the role of the depth penalty (scale independent). Now λ is the trade-off between scale-dependent and scale-independent costs, and the optimal λ is unaffected by shifts and scale changes. We use simulated annealing to find the partition I(K) that maximizes C, for a given number of clusters K.
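A generic simulated-annealing sketch for this search. The move set (relabel one random observation), the cooling schedule, and all names are my assumptions, not details given in the talk; any criterion over label vectors can be plugged in, e.g. a λ-weighted combination of average sil and average ReD:

```python
import math
import random

def anneal_partition(n_obs, K, criterion, n_iter=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for a partition (label vector) maximizing criterion(labels).

    Proposal: reassign one random observation to a random cluster.
    Uphill moves are always accepted; downhill moves with prob. exp(delta / T).
    """
    rng = random.Random(seed)
    labels = [rng.randrange(K) for _ in range(n_obs)]
    c_cur = criterion(labels)
    best, c_best = list(labels), c_cur
    t = t0
    for _ in range(n_iter):
        i = rng.randrange(n_obs)
        old, new = labels[i], rng.randrange(K)
        if new == old:
            t *= cooling
            continue
        labels[i] = new
        delta = criterion(labels) - c_cur
        if delta >= 0 or rng.random() < math.exp(delta / t):
            c_cur += delta                       # accept the move
            if c_cur > c_best:
                best, c_best = list(labels), c_cur
        else:
            labels[i] = old                      # reject: undo the move
        t *= cooling
    return best, c_best
```

Tracking the best partition ever visited makes the result robust to late downhill wandering.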
Slide 18: Results
- Gene expression data:
  - Leukemia: PAM and DDclust agree (2 errors)
  - Colon: PAM 18/62 errors, DDclust 6 errors
  - Prostate: PAM 3/25 errors, DDclust 1 error
- Simulated data:
  - MVN data
  - Equal and unequal cluster scale scenarios
  - λ equal to 0, 0.25, 0.5, 0.9, 1
Slide 19: (no transcript)
Slide 20: (no transcript)
Slide 21: Unequal scale model
[Plot: test error decreases with increasing λ; PAM shown for comparison]
Slide 22: Unequal scale model
Slide 23: Unequal scale model
Clustering with sil can even increase the error rate when scales are unequal.
Slide 24: Equal scale model
Still see some improvements, but now for more moderate λ.
Slide 25: Equal scale model
Slide 26: Equal scale models
Slide 27: DDclass - Classification via the L1 Data Depth
- We expect that the L1 data depth of an observation is maximized with respect to the cluster corresponding to the correct class label.
- This suggests a very simple classification rule:
  1. Classify unlabeled observation x by the class in the training set with respect to which x is the most deep.
  2. Validate the classification by the Relative Data Depth.
  3. ReD(training x) < 0: a training error; ReD(test x) small: low classification confidence.
Slide 28: Leukemia data
72 observations; cross-validation TE and ReD.
[Plot annotations: high test error; low ReD value]
Slide 29: SILclass - Classification via avg. distance and sil
DDclass using the average distance in place of the depth. Classification rule:
1. Classify unlabeled observation x by the class in the training set with respect to which the average distance to x is minimized.
2. Validate the classification by sil.
3. sil(training x) < 0: a training error; sil(test x) small: low classification confidence.
Slide 30: Leukemia data
72 observations; cross-validation TE and sil.
[Plot annotations: high test error; low sil value]
Slide 31: DDclass-CV and DDclass-DD
Two tuning methods for removing noisy or mislabeled observations from the training set. The aim is two-fold:
- improve test error rate performance
- reduce the size of the training set (computation time)
DDclass-CV: remove any training observations that are misclassified via leave-one-out cross-validation.
DDclass-DD: remove any training observations with ReD value below a threshold (chosen to minimize CV error).
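A leave-one-out pruning sketch in the spirit of DDclass-CV (all names are mine; the classifier is the max-depth rule from the DDclass slide, and each training point is classified with itself held out):

```python
import numpy as np

def l1_depth(z, X, eps=1e-12):
    """L1 data depth D(z) = 1 - ||average unit vector from observations to z||."""
    z, X = np.asarray(z, dtype=float), np.asarray(X, dtype=float)
    diffs = z - X
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > eps
    units = diffs[keep] / norms[keep, None]
    return 1.0 - np.linalg.norm(units.sum(axis=0) / len(X))

def depth_classify(x, X, y):
    """Max-depth classification rule."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    return max(np.unique(y), key=lambda c: l1_depth(x, X[y == c]))

def prune_loo(X, y):
    """Indices of training observations correctly classified when held out."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    idx = np.arange(len(X))
    return [i for i in idx
            if depth_classify(X[i], X[idx != i], y[idx != i]) == y[i]]
```

Training on the kept indices both shrinks the training set and removes points likely to be mislabeled.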
Slide 32: Leukemia data
Observations that were frequently removed from the training set across 500 cross-validation sets.
[Plot: solid black = DDclass-CV, dashed red = DDclass-DD]
Slide 33: Leukemia - results on 500 10-fold CV data sets

Method                     Fivenum summary of CV error rates   % of CV sets w. best rate
DDclass                    (0, 0, 0, 12.5, 25)                 92.6 (1)
DDclass-CV / DDclass-DD                                        91.4 (2)
SILclass                   (0, 0, 0, 12.5, 37.5)               86.8
SILclass-CV / SILclass-DD
NN                         (0, 0, 0, 12.5, 25)                 88.8 (3)
DLDA                       (0, 0, 0, 12.5, 25)                 87.6
Centroid                   (0, 0, 0, 12.5, 37.5)               86.6
Median                                                         87.2
Slide 34: Colon - results on 500 10-fold CV data sets

Method                     Fivenum summary of CV error rates   % of CV sets w. best rate
DDclass                    (0, 0, 16.7, 16.7, 66.7)            85.8 (3)
DDclass-CV / DDclass-DD
SILclass                   (0, 0, 16.7, 16.7, 66.7)            81.2
SILclass-CV / SILclass-DD                                      79.2
NN                         (0, 0, 16.7, 16.7, 66.7)            76.2
DLDA                       (0, 0, 16.7, 16.7, 50)              92.8 (1)
Centroid                                                       92.6 (2)
Median
Slide 35: Simulated data - unequal scale
[Plot comparing kNN, DDclass, DA, Prototypes, SILclass]
Slide 36: Equal scale
[Plot comparing kNN, DDclass, DA, SILclass, Prototypes]
Slide 37: Conclusions
- ReD is a robust cluster validation tool
- DDclust can improve clustering accuracy over PAM, significantly so when cluster scales differ; ReD plots identify outliers
- DDclass is competitive with the best reported methods on gene expression data, and comparable with the Bayes rules on simulated data; ReD plots identify observations we classify with low confidence
- Paper, preprints and R code available at http://www.stat.rutgers.edu/rebecka/papers
- Current work: extensions to missing data scenarios
Slide 38: Acknowledgements
Yehuda Vardi and Cun-Hui Zhang (Dept. of Statistics, Rutgers); Ron Hart and Jonathan Zan (Dept. of Neuroscience and the William Keck Center, Rutgers)