DATA CLUSTERING WITH KERNEL K-MEANS
Matt Strautmann, Dept. of Electrical and Computer Engineering
Dr. Donald C. Wunsch II, Dept. of Electrical and Computer Engineering
- PROJECT OBJECTIVES
- PROJECT GOAL
- Experimentally demonstrate the application of Kernel K-Means to non-linearly clusterable data sets
- ACADEMIC IMPORTANCE
- Expand the application of the Kernel K-Means clustering algorithm to non-traditional uses
- PROJECT DATASETS
- DISCUSSION
- Kernel K-Means was found to cluster the test datasets in a superior manner over Soft K-Means
- Kernel data-mapping was seen to solve the overlapping data sets (see the toy sketch after this list) by
  - Mapping the data before clustering to a higher-dimensional feature space using a nonlinear function
  - Partitioning the points with linear separators in the new space
- Soft K-Means could not successfully cluster the Lung Cancer Dataset; only one of the three clusters was successfully recovered
- Soft K-Means clustered the two-dimension, two-cluster Gaussian dataset with only one error out of the one thousand data points
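The kernel data-mapping idea can be illustrated with a toy example that is not part of the poster's experiments: two concentric rings cannot be split by a line in two dimensions, but the explicit nonlinear map (x1, x2) -> (x1, x2, x1^2 + x2^2) makes them separable by a plane, i.e. a linear separator, in the mapped space.

```python
# Toy illustration of the kernel data-mapping idea (not the poster's experiment):
# map overlapping/non-linearly-separable 2-D data into a 3-D feature space where
# a plane (linear separator) recovers the two groups.
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.normal(1.0, 0.1, 100),    # inner ring
                        rng.normal(3.0, 0.1, 100)])   # outer ring
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

Z = np.column_stack([X, (X ** 2).sum(axis=1)])        # nonlinear feature map
threshold = 4.0                                       # plane z = 4 in the feature space
labels = (Z[:, 2] > threshold).astype(int)            # linear separator recovers the rings
print(np.bincount(labels))                            # roughly 100 points per ring
```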
Iris Plant Dataset (image: eleves.ens.fr)
- BACKGROUND
- WHAT IS K-MEANS CLUSTERING?
- K-Means clustering aims to divide the dataset into clusters (groups) in which each data point belongs to the cluster with the nearest mean vector (a minimal sketch follows)
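A minimal sketch of the standard K-Means loop described above, assuming Euclidean distance and randomly chosen initial means; this is illustrative code, not the poster's implementation.

```python
# Minimal K-Means sketch: alternate between assigning points to the nearest
# mean and updating each mean to the centroid of its assigned points.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # random initial means
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean vector
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned points
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means
```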
- WHAT IS KERNEL K-MEANS?
- A sum-of-squares clustering algorithm
- A two-step process: data point assignment and mean update (a kernel-space sketch follows this list)
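A hedged sketch of the Kernel K-Means assignment/update idea, assuming an RBF kernel with an illustrative gamma value (not a parameter reported on the poster); the cluster means in feature space are never formed explicitly, and all distances are computed through the kernel matrix.

```python
# Kernel K-Means sketch: distances to each cluster mean in feature space are
# computed from the kernel matrix, ||phi(x_i) - m_c||^2 = K_ii - 2*mean_j K_ij + mean_jl K_jl.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def kernel_kmeans(X, k, n_iter=50, gamma=1.0, seed=0):
    K = rbf_kernel(X, gamma)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))            # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((len(X), k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if len(idx) == 0:
                dist[:, c] = np.inf                     # skip empty clusters
                continue
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)                # assignment step in kernel space
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```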
- WHAT IS THE PLUS PLUS INITIALIZATION SCHEME?
- The first mean vector is a randomly selected data point
- Each subsequent mean vector is created by evaluating randomly selected data points against a distance-weighted probability (a sketch follows this list)
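A sketch of a PLUS PLUS (k-means++) style initialization consistent with the description above; the distance-weighted sampling is the standard formulation, and the code is illustrative rather than the poster's own.

```python
# k-means++ style initialization: first mean is a random point; each later mean
# is drawn with probability proportional to its squared distance from the
# nearest already-chosen mean.
import numpy as np

def plus_plus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(len(X))]]                   # first mean: random data point
    for _ in range(1, k):
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(means)[None, :, :],
                                   axis=2) ** 2, axis=1)
        probs = d2 / d2.sum()                           # distance-weighted probability
        means.append(X[rng.choice(len(X), p=probs)])
    return np.array(means)
```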
2 Dimension, 2 Cluster Dataset (Gaussian 2D2K) (image: lans.ece.utexas.edu)
- CONCLUDING REMARKS
- The initialization was seen to be the most important factor in whether the algorithm converges
- The PLUS PLUS cluster mean initialization was seen to improve the results
- Kernel assignment works better than the maximum responsibility calculation of Soft K-Means (a responsibility sketch follows this list)
- Kernel K-Means handles both small- and large-dimension datasets well; the increase in dimensionality seemed to be advantageous for the Lung Cancer Dataset (56 dimensions) over the lower clustering accuracy of the Iris Plant Dataset (4 dimensions)
- Kernel K-Means produced superior results to Soft K-Means when clustering the Lung Cancer Dataset and demonstrated recognition of all three clusters
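For reference, the Soft K-Means maximum responsibility assignment mentioned above can be sketched as follows, assuming a stiffness parameter beta (an illustrative value, not one reported on the poster).

```python
# Soft K-Means responsibilities: each point gets a weight for every cluster,
# and the hard assignment used for scoring is the cluster of maximum responsibility.
import numpy as np

def soft_kmeans_responsibilities(X, means, beta=2.0):
    d2 = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2) ** 2
    w = np.exp(-beta * d2)
    return w / w.sum(axis=1, keepdims=True)   # responsibilities sum to 1 per point

# Hard assignment by maximum responsibility:
# labels = soft_kmeans_responsibilities(X, means).argmax(axis=1)
```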
- SOFT K-MEANS VS. KERNEL K-MEANS
| Soft K-Means | Clustering Accuracy Average (over ten runs) | Standard Deviation of Accuracy Calculation (over ten runs) | Variance of Accuracy Calculation (over ten runs) |
| --- | --- | --- | --- |
| Iris Plant Dataset | 28.00 | 8.218 | 2.867 |
| Lung Cancer Dataset | 43.75 | - | - |
| 2D2K Gaussian Dataset | 99.00 | - | - |
| 8D5K Gaussian Dataset | 58.50 | 2.082 | 0.043 |
| Kernel K-Means | Clustering Accuracy Average (over ten runs) | Standard Deviation of Accuracy Calculation (over ten runs) | Variance of Accuracy Calculation (over ten runs) |
| --- | --- | --- | --- |
| Iris Plant Dataset | 57.00 | 5.009 | 2.238 |
| Lung Cancer Dataset | 62.00 | 6.878 | 0.473 |
| 2D2K Gaussian Dataset | 96.50 | 1.677 | 0.028 |
| 8D5K Gaussian Dataset | 76.31 | 10.366 | 1.075 |
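The averages, standard deviations, and variances in the tables can be computed from the per-run accuracies with a short routine like the one below; the ten accuracy values shown are placeholders, not the poster's raw data.

```python
# Computing the per-dataset table statistics from ten run accuracies (placeholder values).
import numpy as np

accuracies = np.array([55.0, 60.0, 52.0, 58.0, 61.0, 54.0, 59.0, 57.0, 56.0, 58.0])
print("average:", accuracies.mean())
print("std dev:", accuracies.std(ddof=1))   # sample standard deviation over ten runs
print("variance:", accuracies.var(ddof=1))
```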
1.) Initial Mean Orientations
2.) Voronoi Diagram Generated by the Means (data points associated with nearest cluster mean)
- FUTURE WORK
- Further improvement of the mean vector initialization is believed possible over the PLUS PLUS initialization
- Other options for the mean-squared error calculation for data point evaluation are possible
- The time analysis of the algorithm must still be calculated
- RESULTS COMPARISON
- Kernel K-Means clustering accuracy was superior in all cases except the two-dimension, two-cluster dataset
- The clustering accuracy on the datasets changed by the following amounts (see the worked check below):
  - Iris Plant: +104%
  - Lung Cancer: +38%
  - 2D2K: -2.5%
  - 8D5K: +30%
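A worked check of these figures, assuming they are relative percentage changes of the Kernel K-Means accuracy over the Soft K-Means benchmark from the tables above.

```python
# Relative change of Kernel K-Means accuracy over the Soft K-Means benchmark,
# using the accuracy averages reported in the tables.
soft   = {"Iris": 28.00, "Lung": 43.75, "2D2K": 99.00, "8D5K": 58.50}
kernel = {"Iris": 57.00, "Lung": 62.00, "2D2K": 96.50, "8D5K": 76.31}

for name in soft:
    change = 100 * (kernel[name] - soft[name]) / soft[name]
    print(f"{name}: {change:+.1f}%")   # Iris is about +104%, 2D2K about -2.5%
```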
(Figure images: http://en.wikipedia.org/wiki/K-means_clustering)
3.) Cluster Centroid Becomes New Cluster Mean
4.) Steps 2 and 3 Repeated until Convergence
- APPROACH
- Evaluate standard (Soft) K-Means against the four datasets to form a benchmark
- Hybridize Soft K-Means with a kernel data-mapping to form Kernel K-Means
- Test Kernel K-Means on small-size, small-dimension Gaussian, large-dimension Gaussian, and large-size datasets
- ACKNOWLEDGEMENTS
- The author would like to acknowledge the expertise of Dr. Rui Xu in advising this project.