Title: Clustering Algorithms
1. Clustering Algorithms
2. K Nearest Neighbors (KNN)
3. K Nearest Neighbor
- Store all input/output pairs in the training set
- For each pattern in the test set:
  - Search for the K nearest patterns to the input pattern using a Euclidean distance measure
  - For classification, compute the confidence for each class as Ci/K, where Ci is the number of patterns among the K nearest patterns belonging to class i; the classification of the input pattern is the class with the highest confidence (see the sketch below)
  - For estimation, the output value is the average of the output values of the K nearest patterns
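The loop above is short enough to write out directly. Below is a minimal Python/NumPy sketch of the classification case; the function name knn_classify and the toy 2-D data are purely illustrative, not from the original slides. For estimation one would instead return the average of train_y over the K nearest indices.

import numpy as np

def knn_classify(train_X, train_y, x, k=3):
    # Euclidean distances from the test pattern x to every stored training pattern
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                      # indices of the K nearest patterns
    classes, counts = np.unique(train_y[nearest], return_counts=True)
    confidences = {int(c): n / k for c, n in zip(classes, counts)}  # Ci / K per class
    return int(classes[np.argmax(counts)]), confidences  # highest-confidence class

# Toy example: two 2-D classes stored as the "training set"
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])
print(knn_classify(train_X, train_y, np.array([0.8, 0.9]), k=3))   # -> class 1, confidences {0: 1/3, 1: 2/3}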
4. K Nearest Neighbor Settings
- Number of Nearest Neighbors (K)
  - should be based on cross validation over many K settings; generally K is on the order of sqrt(p), where p is the total number of training patterns (see the sketch below)
- Input Compression
  - used if storage/memory is an issue
  - affects the precision of the algorithm
- Distance Metric
  - examples: Euclidean, Manhattan, absolute dimension
- Combination of the K neighbors
  - weight them equally or use a distance-weighted average
- May use Principal Component Analysis to map higher-dimensional inputs into key meaningful dimensions for a feasible KNN problem
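As a concrete illustration of choosing K by cross validation, with PCA in front to compress the inputs, here is a hedged sketch using scikit-learn; the library, the Iris data, and the exact K range up to sqrt(p) are assumptions for the example, not something specified on the slide.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)        # placeholder data set
p = len(X)                               # total number of training patterns

# Cross validate over many K settings, trying values up to roughly sqrt(p);
# PCA in front maps the inputs into a few meaningful dimensions.
best_k, best_score = None, -np.inf
for k in range(1, int(np.sqrt(p)) + 1):
    model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=k))
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))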
5. Nearest Cluster
- A condensed version of KNN, generally used for classification
- Partitions the training set into a few clusters of neighbors
- Each cluster has a numerical value for the posterior probability of each possible class given the input attributes of the members of the cluster
- A new item is classified by finding its nearest cluster and using that cluster's posterior probability estimates to estimate the class for the new item
6. Nearest Cluster Training
- Perform K-means clustering on the training data
- For each cluster, generate a probability for each class according to

  Pjk = Njk / Nk

  where Pjk is the probability for class j within cluster k, Njk is the number of class-j patterns belonging to cluster k, and Nk is the number of patterns belonging to cluster k (see the sketch below)
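A small NumPy sketch of this counting step, assuming K-means has already produced a cluster index for every training pattern; the array names are illustrative.

import numpy as np

def cluster_class_probabilities(cluster_ids, labels, n_clusters, n_classes):
    # Pjk = Njk / Nk for every cluster k and class j
    P = np.zeros((n_clusters, n_classes))
    for k in range(n_clusters):
        members = labels[cluster_ids == k]        # class labels of the patterns in cluster k
        N_k = len(members)                        # Nk
        for j in range(n_classes):
            P[k, j] = np.count_nonzero(members == j) / N_k if N_k else 0.0
    return P

cluster_ids = np.array([0, 0, 1, 1, 1])           # cluster assignment of each training pattern
labels      = np.array([0, 1, 1, 1, 0])           # class label of each training pattern
print(cluster_class_probabilities(cluster_ids, labels, n_clusters=2, n_classes=2))
# cluster 0 -> [0.5, 0.5], cluster 1 -> [1/3, 2/3]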
7. Nearest Cluster Testing
- For each input pattern X, find the nearest cluster, Ck, using the Euclidean distance measure

  d(X, Y) = sqrt( (X1 - Y1)^2 + ... + (Xm - Ym)^2 )

  where Y is a cluster center and m is the number of dimensions in the input patterns
- Use the probabilities Pjk for all classes j stored with Ck, and classify pattern X into the class j with the highest probability (see the sketch below)
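Continuing the sketch, testing reduces to one distance computation and one table lookup; the centers and probability table below stand in for the hypothetical outputs of the training step above.

import numpy as np

def nearest_cluster_classify(x, centers, P):
    # Euclidean distance d(X, Y) from x to every cluster center Y
    dists = np.sqrt(((centers - x) ** 2).sum(axis=1))
    k = int(np.argmin(dists))                      # nearest cluster Ck
    return int(np.argmax(P[k])), k                 # class j with the highest Pjk

centers = np.array([[0.05, 0.1], [1.0, 1.0]])      # cluster centers from K-means
P = np.array([[0.5, 0.5], [1/3, 2/3]])             # Pjk from the training step
print(nearest_cluster_classify(np.array([0.9, 1.2]), centers, P))   # -> (class 1, cluster 1)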
8. K-means Clustering
- Initialize the cluster centers (the number of centers is selected by the user) by randomly selecting patterns from the training set
- Classify the entire training set: for each pattern Xi in the training set, find the nearest cluster center C and classify Xi as a member of C
- For each cluster, recompute its center as the mean of the cluster

  Mk = (1/Nk) * sum over j of Xjk,  j = 1, ..., Nk

  where Mk is the new mean, Nk is the number of training patterns in cluster k, and Xjk is the j-th pattern belonging to cluster k
- Repeat the classification and center re-computation steps until the cluster centers stop changing (see the sketch below)
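For completeness, a plain NumPy sketch of the K-means loop described above: random initial centers drawn from the training set, then alternating assignment and mean-update steps until the centers stop moving. The function name and toy data are illustrative.

import numpy as np

def kmeans(X, n_clusters, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers by randomly selecting patterns from the training set
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # Classification step: assign each pattern to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: Mk = (1/Nk) * sum of the patterns in cluster k
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(n_clusters)])
        if np.allclose(new_centers, centers):      # stop when the means no longer change
            break
        centers = new_centers
    return centers, assign

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
print(kmeans(X, n_clusters=2))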