Title: Clustering, Classification and Validation via the L1 Data Depth
Slide 1: Clustering, Classification and Validation via the L1 Data Depth
- Rebecka Jornsten
- Department of Statistics
- Rutgers University
- http://www.stat.rutgers.edu/rebecka
- June 19, 2003
Slide 2: Outline
- Cluster validation via the Relative Data Depth (ReD)
- DDclust: improved clustering accuracy via ReD
- DDclass: classification based on the L1 Data Depth
- Conclusion
Slide 3: Clustering
- Group similar observations
- Find cluster representatives
- Multitude of different clustering algorithms:
  - Hierarchical
  - K-means, PAM
  - Model-based (e.g. mixture of MVN)
  - ...
Slide 4: Clustering
- In the first part of this talk we focus on K-median clustering:
  - Popular in applications
  - Robust, tight clusters
  - Fast approximation algorithms exist: PAM (Partitioning Around Medoids)
- K-median algorithm developed with Vardi and Zhang; cluster representatives are multivariate medians
Slide 5: Cluster Validation
- Select an appropriate number of clusters for the data set
- Identify outliers
- Often based on:
  - Within-cluster sum of squares
  - Distance to cluster representative
  - Cluster membership
Slide 6: The Silhouette width
- a_i is the average distance from observation i to all members of the cluster i has been allocated to
- b_i is the average distance from observation i to all members of the nearest competing cluster
- sil_i = (b_i - a_i) / max(b_i, a_i)
- Choose the number of clusters K with the maximum average sil_i
- sil_i can also identify outliers: look out for negative sil_i values.
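The definition above can be sketched directly; a minimal sketch assuming Euclidean distance (the function name and toy data are illustrative, not from the talk, and every cluster is assumed to have at least two members):

```python
import numpy as np

def silhouette_widths(X, labels):
    """Per-observation silhouette widths sil_i = (b_i - a_i) / max(b_i, a_i).

    a_i: average distance from observation i to the other members of its own
    cluster; b_i: average distance to the members of the nearest competing
    cluster. Assumes Euclidean distance and clusters with >= 2 members.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sil = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = dist[i, same].mean()                      # avg distance within own cluster
        b = min(dist[i, labels == k].mean()           # nearest competing cluster
                for k in np.unique(labels) if k != labels[i])
        sil[i] = (b - a) / max(b, a)
    return sil
```

Choosing K then amounts to maximizing the average of these values over candidate partitions; negative entries flag outlier candidates.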
Slide 7: Silhouette width plot
[Plot: silhouette widths by cluster; one observation flagged "outlier?"]
Slide 8: Problems with the silhouette width
- If the cluster scales (within-cluster variances) differ, the sil values can be misleading
- A loose cluster will look bad if the nearest competing cluster is tight
- Observations in a loose cluster may be mislabeled as outliers, and outliers in a tight cluster may be missed
- Can lead to under-fitting: selecting too few clusters for the data set
Slide 9: L1 Data Depth
- e(z, i) is the unit vector from observation i to z
- e(z) is the average of the unit vectors from all observations to z
- ||e(z)|| is close to 1 if z is close to the edge of the data, close to 0 if z is close to the center
- D(z) = 1 - ||e(z)|| is the L1 data depth of z
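A minimal sketch of this computation (the function name is mine; observations coinciding with z are taken to contribute a zero vector, a convention assumed here):

```python
import numpy as np

def l1_depth(z, X, eps=1e-12):
    """L1 data depth D(z) = 1 - ||e(z)||, where e(z) is the average of the
    unit vectors e(z, i) pointing from each observation i to z."""
    z = np.asarray(z, dtype=float)
    X = np.asarray(X, dtype=float)
    diffs = z - X                          # vectors from observations to z
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > eps                     # coincident observations add a zero vector
    units = diffs[keep] / norms[keep, None]
    e_bar = units.sum(axis=0) / len(X)     # average unit vector e(z)
    return 1.0 - np.linalg.norm(e_bar)
```

At the center of a symmetric cloud the unit vectors cancel and the depth is near 1; far outside the cloud they align and the depth is near 0.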
Slide 10: The Relative Data Depth
- D_i^w is the data depth of observation i with respect to the cluster i has been allocated to
- D_i^b is the data depth of observation i with respect to the nearest competing cluster
- ReD_i = D_i^w - D_i^b
- Choose the number of clusters K with the maximum average ReD_i
- ReD_i can also identify outliers: look out for negative ReD_i values.
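ReD can then be computed per observation from a clustered data set. In this sketch the "nearest competing cluster" is taken to be the competitor in which the observation is deepest, which is one reasonable reading of the definition; the function names are mine:

```python
import numpy as np

def l1_depth(z, X, eps=1e-12):
    """L1 data depth D(z) = 1 - ||average unit vector from observations to z||."""
    z, X = np.asarray(z, dtype=float), np.asarray(X, dtype=float)
    diffs = z - X
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > eps
    units = diffs[keep] / norms[keep, None]
    return 1.0 - np.linalg.norm(units.sum(axis=0) / len(X))

def relative_depths(X, labels):
    """ReD_i = D_i^w - D_i^b for each observation i."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    red = np.zeros(len(X))
    for i in range(len(X)):
        d_w = l1_depth(X[i], X[labels == labels[i]])   # depth in own cluster
        d_b = max(l1_depth(X[i], X[labels == k])       # depth in best competitor
                  for k in np.unique(labels) if k != labels[i])
        red[i] = d_w - d_b
    return red
```

Observations with negative ReD are deeper in a competing cluster than in their own, the outlier signal from the slide.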
Slide 11: Relative Data Depth plot
[Plot: ReD values by cluster; several observations flagged "outliers?" / "outlier?"]
Slide 12: Data Depth plot
Another example: gene expression data (drug experiment).
[Plots: within-cluster depths; between-cluster depths, colored by cluster]
Slide 13: Silhouette widths and Relative Data Depths
[Plots: silhouette widths and Relative Data Depths, with a loose cluster and a tight cluster marked]
Error rates: tight cluster 17, loose cluster 14.
Slide 14: ReD vs. gap and sil
- ReD is less sensitive than sil and gap to the inclusion of unrelated features, and more robust with respect to the noise level of the data
- Performs as well as or better than sil and gap in many simulated scenarios
- For more details, see the paper with Vardi and Zhang, and the slides at http://www.stat.rutgers.edu/rebecka
Slide 15: Improved clustering accuracy via ReD
Clustering and cluster validation: a two-step process. But if ReD (and sil) can identify outliers, why not include the validation criterion in the clustering objective function directly? This suggests the Data Depth Vector Quantizer (DDVQ): find the partition that minimizes the K-median objective function - λ·ReD.
Slide 16: Improved clustering accuracy via ReD
Constrained VQ is a tool often used in engineering; example: entropy-constrained VQ. Approach: search for the λ that satisfies the constraint. What makes the present problem different? We don't have a natural constraint, and so don't know what an appropriate value for λ might be. Furthermore, the optimal value for λ will vary from data set to data set.
Slide 17: DDclust
Our approach: use the clustering criterion C = (1 - λ)·sil + λ·ReD, where sil plays the role of the K-median cost (scale dependent) and ReD the role of the depth penalty (scale independent). Now λ is the trade-off between scale-dependent and scale-independent costs, and the optimal λ is unaffected by shifts and scale changes. We use simulated annealing to find the partition I(K) that maximizes C, for a given number of clusters K.
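A generic simulated-annealing sketch for this search. The move set (relabel one random observation), the cooling schedule, and all names are my assumptions, not details given in the talk; any criterion over label vectors can be plugged in, e.g. a λ-weighted combination of average sil and average ReD:

```python
import math
import random

def anneal_partition(n_obs, K, criterion, n_iter=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for a partition (label vector) maximizing criterion(labels).

    Proposal: reassign one random observation to a random cluster.
    Uphill moves are always accepted; downhill moves with prob. exp(delta / T).
    """
    rng = random.Random(seed)
    labels = [rng.randrange(K) for _ in range(n_obs)]
    c_cur = criterion(labels)
    best, c_best = list(labels), c_cur
    t = t0
    for _ in range(n_iter):
        i = rng.randrange(n_obs)
        old, new = labels[i], rng.randrange(K)
        if new == old:
            t *= cooling
            continue
        labels[i] = new
        delta = criterion(labels) - c_cur
        if delta >= 0 or rng.random() < math.exp(delta / t):
            c_cur += delta                       # accept the move
            if c_cur > c_best:
                best, c_best = list(labels), c_cur
        else:
            labels[i] = old                      # reject: undo the move
        t *= cooling
    return best, c_best
```

Tracking the best partition ever visited makes the result robust to late downhill wandering.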
Slide 18: Results
- Gene expression data:
  - Leukemia: PAM and DDclust agree (2 errors)
  - Colon: PAM 18/62 errors, DDclust 6 errors
  - Prostate: PAM 3/25 errors, DDclust 1 error
- Simulated data:
  - MVN data
  - Equal and unequal cluster scale scenarios
  - λ equal to 0, 0.25, 0.5, 0.9, 1
Slide 19: (no transcript)
Slide 20: (no transcript)
Slide 21: Unequal scale model
[Plot: test error decreases with increasing λ; PAM shown for comparison]
Slide 22: Unequal scale model
Slide 23: Unequal scale model
Clustering with sil can even increase the error rate when scales are unequal.
Slide 24: Equal scale model
Still see some improvements, but now for more moderate λ.
Slide 25: Equal scale model
Slide 26: Equal scale models
Slide 27: DDclass - Classification via the L1 Data Depth
- We expect that the L1 data depth of an observation is maximized with respect to the cluster corresponding to the correct class label.
- This suggests a very simple classification rule:
  1. Classify unlabeled observation x by the class in the training set with respect to which x is the most deep.
  2. Validate the classification by the Relative Data Depth.
  3. ReD(training x) < 0: a training error; ReD(test x) small: low classification confidence.
Slide 28: Leukemia data
72 observations; cross-validation TE and ReD.
[Plot annotations: high test error; low ReD value]
Slide 29: SILclass - Classification via avg. distance and sil
DDclass using the average distance in place of the depth. Classification rule:
1. Classify unlabeled observation x by the class in the training set with respect to which the average distance to x is minimized.
2. Validate the classification by sil.
3. sil(training x) < 0: a training error; sil(test x) small: low classification confidence.
Slide 30: Leukemia data
72 observations; cross-validation TE and sil.
[Plot annotations: high test error; low sil value]
Slide 31: DDclass-CV and DDclass-DD
Two tuning methods for removing noisy or mislabeled observations from the training set. The aim is two-fold:
- improve test error rate performance
- reduce the size of the training set (computation time)
DDclass-CV: remove any training observations that are misclassified via leave-one-out cross-validation.
DDclass-DD: remove any training observations with ReD value below a threshold (chosen to minimize CV error).
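A leave-one-out pruning sketch in the spirit of DDclass-CV (all names are mine; the classifier is the max-depth rule from the DDclass slide, and each training point is classified with itself held out):

```python
import numpy as np

def l1_depth(z, X, eps=1e-12):
    """L1 data depth D(z) = 1 - ||average unit vector from observations to z||."""
    z, X = np.asarray(z, dtype=float), np.asarray(X, dtype=float)
    diffs = z - X
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > eps
    units = diffs[keep] / norms[keep, None]
    return 1.0 - np.linalg.norm(units.sum(axis=0) / len(X))

def depth_classify(x, X, y):
    """Max-depth classification rule."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    return max(np.unique(y), key=lambda c: l1_depth(x, X[y == c]))

def prune_loo(X, y):
    """Indices of training observations correctly classified when held out."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    idx = np.arange(len(X))
    return [i for i in idx
            if depth_classify(X[i], X[idx != i], y[idx != i]) == y[i]]
```

Training on the kept indices both shrinks the training set and removes points likely to be mislabeled.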
Slide 32: Leukemia data
Observations that were frequently removed from the training set across 500 cross-validation sets.
[Plot: solid black = DDclass-CV, dashed red = DDclass-DD]
Slide 33: Leukemia - results on 500 10-fold CV data sets

Method                     Fivenum summary of CV error rates   % of CV sets w. best rate
DDclass                    (0, 0, 0, 12.5, 25)                 92.6 (1)
DDclass-CV / DDclass-DD                                        91.4 (2)
SILclass                   (0, 0, 0, 12.5, 37.5)               86.8
SILclass-CV / SILclass-DD
NN                         (0, 0, 0, 12.5, 25)                 88.8 (3)
DLDA                       (0, 0, 0, 12.5, 25)                 87.6
Centroid                   (0, 0, 0, 12.5, 37.5)               86.6
Median                                                         87.2
Slide 34: Colon - results on 500 10-fold CV data sets

Method                     Fivenum summary of CV error rates   % of CV sets w. best rate
DDclass                    (0, 0, 16.7, 16.7, 66.7)            85.8 (3)
DDclass-CV / DDclass-DD
SILclass                   (0, 0, 16.7, 16.7, 66.7)            81.2
SILclass-CV / SILclass-DD                                      79.2
NN                         (0, 0, 16.7, 16.7, 66.7)            76.2
DLDA                       (0, 0, 16.7, 16.7, 50)              92.8 (1)
Centroid                                                       92.6 (2)
Median
Slide 35: Simulated data - unequal scale
[Plot comparing kNN, DDclass, DA, Prototypes, SILclass]
Slide 36: Equal scale
[Plot comparing kNN, DDclass, DA, SILclass, Prototypes]
Slide 37: Conclusions
- ReD is a robust cluster validation tool
- DDclust can improve clustering accuracy over PAM, significantly so when cluster scales differ; ReD plots identify outliers
- DDclass is competitive with the best reported methods on gene expression data, and comparable with the Bayes rules on simulated data; ReD plots identify observations we classify with low confidence
- Paper, preprints and R code available at http://www.stat.rutgers.edu/rebecka/papers
- Current work: extensions to missing data scenarios
Slide 38: Acknowledgements
Yehuda Vardi and Cun-Hui Zhang (Dept. of Statistics, Rutgers); Ron Hart and Jonathan Zan (Dept. of Neuroscience and the William Keck Center, Rutgers)