Title: Cluster Analysis
1. Cluster Analysis
- Lecture Notes for Chapter 8
2. What is Clustering?
- Finding groups of objects such that the objects in the same group are highly similar (or related) to one another but different from (or unrelated to) the objects in other groups
3. Applications of Cluster Analysis
- Search engine results
  - Group related documents for browsing
  - Display them by category, labeled by the most common terms in each cluster
- Biology
  - Group genes or proteins that have similar profiles to infer their functions
  - Class discovery
4. Clustering Iris data
5. Clustering Images
6. What is not Cluster Analysis?
- Supervised learning (vs. unsupervised)
  - Class label information is available
- Simple segmentation
  - Dividing students into registration groups alphabetically by last name (or by gender, GPA, or any other field)
7. Notion of a Cluster can be Ambiguous
- Weka example: Iris
8. Other Distinctions Between Sets of Clusters
- Exclusive vs. non-exclusive
  - In non-exclusive clustering, points may belong to multiple clusters
  - The former (exclusive) is the norm
- Partial vs. complete
  - In some cases, we only want to cluster some of the data
  - The latter (complete) is the norm
- Heterogeneous vs. homogeneous
  - Clusters of widely different sizes, shapes, and densities
  - The former (heterogeneous) is much harder to achieve
9. Types of Clusters
- Well-separated clusters
- Center-based clusters
- Density-based clusters
10. Types of Clusters: Well-Separated
- Well-separated clusters
  - A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
(Figure: 3 well-separated clusters)
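The definition above can be turned into a direct check. A minimal plain-Python sketch, assuming Euclidean distance; the helper name and toy points are illustrative, not from the notes:

```python
import math

def is_well_separated(clusters):
    """Check the definition: every point is closer to all points in its own
    cluster than to any point in a different cluster."""
    for a, ca in enumerate(clusters):
        for p in ca:
            # Farthest same-cluster point vs. nearest other-cluster point.
            intra = max((math.dist(p, q) for q in ca if q != p), default=0.0)
            inter = min((math.dist(p, q)
                         for b, cb in enumerate(clusters) if b != a
                         for q in cb), default=math.inf)
            if intra >= inter:
                return False
    return True

print(is_well_separated([[(0, 0), (0, 1)], [(10, 10), (10, 11)]]))  # True
print(is_well_separated([[(0, 0), (5, 5)], [(5, 6), (10, 10)]]))    # False
```

In the second call, the point (5, 5) is closer to (5, 6) in the other cluster than to (0, 0) in its own, so the clustering fails the test.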
11. Types of Clusters: Center-Based
- Center-based clusters
  - A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its cluster than to the center of any other cluster
  - The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of the cluster (i.e., the point closest to the centroid)
(Figure: 4 center-based clusters)
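The centroid/medoid distinction can be made concrete with a small sketch (plain Python, Euclidean distance assumed; the toy cluster is illustrative):

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points (need not be an actual data point)."""
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))

def medoid(points):
    """The actual data point closest to the centroid."""
    c = centroid(points)
    return min(points, key=lambda p: math.dist(p, c))

cluster = [(1.0, 1.0), (2.0, 1.0), (9.0, 1.0)]
print(centroid(cluster))  # (4.0, 1.0) -- not one of the points
print(medoid(cluster))    # (2.0, 1.0) -- the point closest to the centroid
```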
12. Types of Clusters: Density-Based
- Density-based clusters
  - A cluster is a dense region of points, separated from other high-density regions by low-density regions
  - Used when the clusters are irregular in shape or intertwined, and when noise and outliers are present
(Figure: 6 density-based clusters)
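One classic way to find density-based clusters is a DBSCAN-style scan (not detailed in these notes): a point with at least `min_pts` neighbors within radius `eps` is a core point, and clusters grow outward from core points while sparse points are left as noise. A minimal plain-Python sketch, with illustrative parameters and toy data:

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1          # noise (may later be claimed as a border point)
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster # border point: joins cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = region_query(points, j, eps)
            if len(nb) >= min_pts:  # only core points expand the cluster
                seeds.extend(nb)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two dense 4-point blocks come out as clusters 0 and 1, and the isolated point (5, 5) is labeled -1 (noise), illustrating how low-density regions separate the clusters.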
13. Characteristics of the Input Data are Important
- Type of distance or density measure
- Attribute type
- Dimensionality
- Noise and Outliers
14. Types of Clusterings
- A clustering is a set of clusters
- Important distinction between hierarchical and partitional sets of clusters
- Partitional clustering
  - Divides data objects into non-overlapping clusters; each data object is in exactly one subset
- Hierarchical clustering
  - A set of nested clusters organized as a hierarchical tree
15. Partitional Clustering
(Figure: original points and a partitional clustering of them)
16. Hierarchical Clustering
(Figure: traditional hierarchical clustering and the corresponding dendrogram)
17. Clustering Algorithms
- K-means and its variants
- Hierarchical clustering
- Density-based clustering
18. K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified
- The centroid is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean, taken for each dimension separately, over all the points in the cluster
  - Example: the data set has three dimensions and the cluster has two points, X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z = (z1, z2, z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2, and z3 = (x3 + y3)/2.
19. K-means Clustering Details
- Initial centroids are often chosen randomly
  - Clusters produced vary from one run to another
- The centroid is (typically) the mean of the points in the cluster
- Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above
  - Most of the convergence happens in the first few iterations
  - Often the stopping condition is relaxed to "until relatively few points change clusters"
- Complexity is O(n * K * I * d)
  - n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
20. Two different K-means Clusterings
(Figure: the same original points under two different K-means clusterings)
21. Example