Cluster Analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Cluster Analysis

Slides: 22
Provided by: COMPUTA5
Transcript and Presenter's Notes


1
Cluster Analysis
  • Lecture Notes for Chapter 8

2
What is Clustering?
  • Finding groups of objects such that the objects
    in the same group are highly similar (or related)
    to one another but different from (or unrelated
    to) the objects in other groups

3
Applications of Cluster Analysis
  • Search engine results
  • Group related documents for browsing
  • Display them by category, labeled with the most
    common terms in each cluster
  • Biology
  • Group genes or proteins that have similar
    profiles to infer their functions
  • Class discovery

4
Clustering IRIS data

5
Clustering Images

6
What is not Cluster Analysis?
  • Supervised (as opposed to unsupervised) learning
  • Class label information is available
  • Simple segmentation
  • Dividing students into different registration
    groups alphabetically, by last name (or by
    gender, or by GPA, or by any field)

7
Notion of a Cluster can be Ambiguous
Weka example: Iris data

8
Other Distinctions Between Sets of Clusters
  • Exclusive vs. non-exclusive
  • In non-exclusive clustering, points may belong to
    multiple clusters
  • Exclusive clustering is the norm
  • Partial vs. complete
  • In partial clustering, we cluster only some of
    the data
  • Complete clustering is the norm
  • Heterogeneous vs. homogeneous
  • Heterogeneous clusters have widely different
    sizes, shapes, and densities
  • Heterogeneous clusters are much harder to find

9
Types of Clusters
  • Well-separated clusters
  • Center-based clusters
  • Density-based clusters

10
Types of Clusters: Well-Separated
  • Well-Separated Clusters
  • A cluster is a set of points such that any point
    in a cluster is closer (or more similar) to every
    other point in the cluster than to any point not
    in the cluster
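The definition above can be checked directly on a small labeled data set. The following is a minimal sketch, not part of the original slides: the function name `is_well_separated` and the sample points are made up for illustration, and Euclidean distance is assumed as the closeness measure.

```python
from math import dist  # Euclidean distance, Python 3.8+

def is_well_separated(points, labels):
    """Return True if every point is closer to every other point of its
    own cluster than to any point outside that cluster."""
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        other = [dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        # The farthest same-cluster point must still be nearer than the
        # nearest point of any other cluster.
        if same and other and max(same) >= min(other):
            return False
    return True

# Hypothetical data: two tight groups that are far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 0, 1, 1, 1, 1]   # (1, 0) is put in the wrong group

print(is_well_separated(points, good_labels))
print(is_well_separated(points, bad_labels))
```

With the second labeling, the point (0, 0) is as close to a point of the other cluster as it is to its own neighbor, so the check fails.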

[Figure: 3 well-separated clusters]

11
Types of Clusters: Center-Based
  • Center-based
  • A cluster is a set of objects such that an object
    in a cluster is closer (more similar) to the
    center of its own cluster than to the center of
    any other cluster
  • The center of a cluster is often a centroid, the
    average of all the points in the cluster, or a
    medoid, the most representative point of the
    cluster (i.e., the point closest to the centroid)
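To make the two kinds of center concrete, here is a small sketch that computes both for one cluster. The helper names and the sample points are assumptions for illustration only. The medoid here follows the slide's description (the data point closest to the centroid); a medoid is more commonly defined as the point with the smallest total distance to the other points in the cluster, which may pick a different point.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Mean of the points, computed separately for each dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """Actual data point closest to the centroid (slide's description)."""
    c = centroid(points)
    return min(points, key=lambda p: dist(p, c))

# Hypothetical cluster of four 2-D points.
cluster = [(0.0, 0.0), (2.0, 2.0), (4.0, 1.0), (6.0, 5.0)]

center = centroid(cluster)
representative = medoid(cluster)
print(center)          # (3.0, 2.0), which is not one of the data points
print(representative)  # (2.0, 2.0), which is one of the data points
```

The centroid need not coincide with any object in the data, whereas the medoid is always an actual object, which matters when an average of the attributes is not meaningful.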

[Figure: 4 center-based clusters]

12
Types of Clusters: Density-Based
  • Density-based
  • A cluster is a dense region of points that is
    separated from other high-density regions by
    low-density regions
  • Used when the clusters are irregular (in shape)
    or intertwined, and when noise and outliers are
    present
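One simple way to make "dense region" measurable is to count how many points fall within a fixed radius of each point. The sketch below only illustrates that idea and is not a complete density-based clustering algorithm; the parameter names `eps` (radius) and `min_pts` (required neighbor count), the function name, and the sample data are all assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+

def dense_points(points, eps, min_pts):
    """Flag a point as dense if at least min_pts points (itself
    included) lie within distance eps of it."""
    flags = []
    for p in points:
        neighbours = sum(1 for q in points if dist(p, q) <= eps)
        flags.append(neighbours >= min_pts)
    return flags

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 5)]

flags = dense_points(points, eps=1.5, min_pts=3)
print(flags)
```

Every point in the two squares has four points within the radius and is flagged as dense, while the isolated point has only itself and would be treated as noise.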

[Figure: 6 density-based clusters]

13
Characteristics of the Input Data are Important
  • Type of distance or density measure
  • Attribute type
  • Dimensionality
  • Noise and Outliers

14
Types of Clusterings
  • A clustering is a set of clusters
  • Important distinction between hierarchical and
    partitional sets of clusters
  • Partitional clustering
  • Divides data objects into non-overlapping
    clusters
  • Each data object is in exactly one subset
  • Hierarchical clustering
  • A set of nested clusters organized as a
    hierarchical tree
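The nesting in a hierarchical clustering can be seen by repeatedly merging the two closest clusters and recording each merge. This is a minimal sketch under stated assumptions: it uses single-link distance (the distance between the closest pair of points), which the slides do not specify, and the function name and sample points are made up.

```python
from math import dist  # Euclidean distance, Python 3.8+

def single_link_merges(points):
    """Merge the two closest clusters until one remains.
    Returns a list of (cluster_a, cluster_b, distance) merge records,
    where clusters are sorted lists of point indices."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical data: two close pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 0.0)]

merges = single_link_merges(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Each merge record corresponds to one internal node of the tree. Stopping the merging early, for example once two clusters remain, yields a partitional clustering of the same data.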

15
Partitional Clustering
[Figure panel: Original Points]

16
Hierarchical Clustering
[Figure panels: Traditional Hierarchical Clustering, Traditional Dendrogram]

17
Clustering Algorithms
  • K-means and its variants
  • Hierarchical clustering
  • Density-based clustering

18
K-means Clustering
  • Partitional clustering approach
  • Each cluster is associated with a centroid
    (center point)
  • Each point is assigned to the cluster with the
    closest centroid
  • Number of clusters, K, must be specified

The centroid is the average of all the points in
the cluster; that is, its coordinates are the
arithmetic mean, for each dimension separately,
over all the points in the cluster. Example: the
data set has three dimensions and the cluster has
two points, X = (x1, x2, x3) and Y = (y1, y2,
y3). Then the centroid Z becomes Z = (z1, z2,
z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2,
and z3 = (x3 + y3)/2.
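The same calculation with concrete numbers, as a short sketch. The values chosen for X and Y are arbitrary, and the function name `centroid` is an assumption for illustration.

```python
def centroid(points):
    """Arithmetic mean of the points, one dimension at a time."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

X = (1.0, 2.0, 3.0)   # hypothetical point (x1, x2, x3)
Y = (3.0, 6.0, 9.0)   # hypothetical point (y1, y2, y3)

Z = centroid([X, Y])
print(Z)  # (2.0, 4.0, 6.0)
```

The function is not limited to two points: with more points in the list, each coordinate is still the sum over all points divided by their count.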
19
K-means Clustering: Details
  • Initial centroids are often chosen randomly
  • Clusters produced vary from one run to another
  • The centroid is (typically) the mean of the
    points in the cluster
  • Closeness is measured by Euclidean distance,
    cosine similarity, correlation, etc.
  • K-means will converge for common similarity
    measures mentioned above
  • Most of the convergence happens in the first few
    iterations
  • Often the stopping condition is relaxed to "until
    relatively few points change clusters"
  • Complexity is O(n * K * I * d)
  • n = number of points, K = number of clusters, I =
    number of iterations, d = number of attributes
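The points above can be put together in a short sketch of basic K-means. This is an illustrative version under stated assumptions, not the presenter's code: it uses Euclidean distance and random initial centroids drawn from the data, and the parameter names `max_iter`, `min_changes`, and `seed` are made up here. Setting `min_changes` above zero gives the relaxed stopping rule ("until relatively few points change clusters").

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Mean of the points, computed separately for each dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, min_changes=0, seed=None):
    """Basic K-means. Stops once at most min_changes points change
    cluster in an iteration, or after max_iter iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # random initial centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        changes = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        # Update step: each centroid becomes the mean of its points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                    # keep old centroid if empty
                centroids[c] = centroid(members)
        if changes <= min_changes:
            break
    return labels, centroids

# Hypothetical data: two compact groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

labels, centroids = kmeans(points, k=2, seed=0)
print(labels)
print(sorted(centroids))
```

Each iteration compares all n points with all K centroids across d attributes, which is where the O(n * K * I * d) cost comes from. On data less cleanly separated than this example, different seeds can lead to different final clusterings.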

20
Two different K-means Clusterings
[Figure panel: Original Points]

21
Example