Title: Cluster Analysis
1. Cluster Analysis
- Lecture Notes for Chapter 8
2. What is Clustering?
- Finding groups of objects such that the objects in the same group are highly similar (or related) to one another but different from (or unrelated to) the objects in other groups
3. Applications of Cluster Analysis
- Search engine results
  - Group related documents for browsing
  - Display them by category, labeled by the most common terms in each cluster
- Biology
  - Group genes or proteins that have similar profiles to infer their functions
  - Class discovery
4. Clustering Iris data
5. Clustering Images
6. What is not Cluster Analysis?
- Supervised learning (vs. unsupervised)
  - Class label information is available
- Simple segmentation
  - Dividing students into registration groups alphabetically by last name (or by gender, GPA, or any other field)
7. Notion of a Cluster can be Ambiguous
- Weka example: Iris
8. Other Distinctions Between Sets of Clusters
- Exclusive vs. non-exclusive
  - In non-exclusive clustering, points may belong to multiple clusters
  - The former (exclusive) is the norm
- Partial vs. complete
  - In some cases, we only want to cluster some of the data
  - The latter (complete) is the norm
- Heterogeneous vs. homogeneous
  - Clusters of widely different sizes, shapes, and densities
  - The former (heterogeneous) is much harder to achieve
9. Types of Clusters
- Well-separated clusters
- Center-based clusters
- Density-based clusters
10. Types of Clusters: Well-Separated
- Well-separated clusters
  - A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
(Figure: 3 well-separated clusters)
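The definition above can be turned into a direct check. A minimal plain-Python sketch, assuming Euclidean distance; the helper name and toy points are illustrative, not from the notes:

```python
import math

def is_well_separated(clusters):
    """Check the definition: every point is closer to all points in its own
    cluster than to any point in a different cluster."""
    for a, ca in enumerate(clusters):
        for p in ca:
            # Farthest same-cluster point vs. nearest other-cluster point.
            intra = max((math.dist(p, q) for q in ca if q != p), default=0.0)
            inter = min((math.dist(p, q)
                         for b, cb in enumerate(clusters) if b != a
                         for q in cb), default=math.inf)
            if intra >= inter:
                return False
    return True

print(is_well_separated([[(0, 0), (0, 1)], [(10, 10), (10, 11)]]))  # True
print(is_well_separated([[(0, 0), (5, 5)], [(5, 6), (10, 10)]]))    # False
```

In the second call, the point (5, 5) is closer to (5, 6) in the other cluster than to (0, 0) in its own, so the clustering fails the test.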
11. Types of Clusters: Center-Based
- Center-based clusters
  - A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its cluster than to the center of any other cluster
  - The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of the cluster (i.e., the point closest to the centroid)
(Figure: 4 center-based clusters)
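The centroid/medoid distinction can be made concrete with a small sketch (plain Python, Euclidean distance assumed; the toy cluster is illustrative):

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points (need not be an actual data point)."""
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))

def medoid(points):
    """The actual data point closest to the centroid."""
    c = centroid(points)
    return min(points, key=lambda p: math.dist(p, c))

cluster = [(1.0, 1.0), (2.0, 1.0), (9.0, 1.0)]
print(centroid(cluster))  # (4.0, 1.0) -- not one of the points
print(medoid(cluster))    # (2.0, 1.0) -- the point closest to the centroid
```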
12. Types of Clusters: Density-Based
- Density-based clusters
  - A cluster is a dense region of points, separated from other high-density regions by low-density regions
  - Used when the clusters are irregular in shape or intertwined, and when noise and outliers are present
(Figure: 6 density-based clusters)
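One classic way to find density-based clusters is a DBSCAN-style scan (not detailed in these notes): a point with at least `min_pts` neighbors within radius `eps` is a core point, and clusters grow outward from core points while sparse points are left as noise. A minimal plain-Python sketch, with illustrative parameters and toy data:

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1          # noise (may later be claimed as a border point)
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster # border point: joins cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = region_query(points, j, eps)
            if len(nb) >= min_pts:  # only core points expand the cluster
                seeds.extend(nb)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two dense 4-point blocks come out as clusters 0 and 1, and the isolated point (5, 5) is labeled -1 (noise), illustrating how low-density regions separate the clusters.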
13. Characteristics of the Input Data are Important
- Type of distance or density measure
- Attribute type
- Dimensionality
- Noise and Outliers
14. Types of Clusterings
- A clustering is a set of clusters
- Important distinction between hierarchical and partitional sets of clusters
- Partitional clustering
  - Divides data objects into non-overlapping clusters; each data object is in exactly one subset
- Hierarchical clustering
  - A set of nested clusters organized as a hierarchical tree
15. Partitional Clustering
(Figure: original points and a partitional clustering of them)
16. Hierarchical Clustering
(Figure: traditional hierarchical clustering and the corresponding dendrogram)
17. Clustering Algorithms
- K-means and its variants
- Hierarchical clustering
- Density-based clustering
18. K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified
- The centroid is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean, taken for each dimension separately, over all the points in the cluster
  - Example: the data set has three dimensions and the cluster has two points, X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z = (z1, z2, z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2, and z3 = (x3 + y3)/2.
19. K-means Clustering Details
- Initial centroids are often chosen randomly
  - Clusters produced vary from one run to another
- The centroid is (typically) the mean of the points in the cluster
- Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above
  - Most of the convergence happens in the first few iterations
  - Often the stopping condition is relaxed to "until relatively few points change clusters"
- Complexity is O(n * K * I * d)
  - n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
20. Two different K-means Clusterings
(Figure: the same original points under two different K-means clusterings)
21. Example