Unsupervised learning - PowerPoint PPT Presentation

Unsupervised learning & Cluster Analysis: Basic Concepts and Algorithms. Assaf Gottlieb. Some of the slides are taken from Introduction to data mining, by Tan ...

Slides: 93
Provided by: KSU70

Transcript and Presenter's Notes

1
Unsupervised learning & Cluster Analysis: Basic
Concepts and Algorithms
Assaf Gottlieb
Some of the slides are taken from Introduction to
data mining, by Tan, Steinbach, and Kumar
2
What is unsupervised learning / Cluster Analysis?
  • "Learning without a priori knowledge about the
    classification of samples; learning without a
    teacher."
  • Kohonen (1995), Self-Organizing Maps
  • "Cluster analysis is a set of methods for
    constructing a (hopefully) sensible and
    informative classification of an initially
    unclassified set of data, using the variable
    values observed on each individual."
  • B. S. Everitt (1998), The Cambridge Dictionary
    of Statistics

3
What do we cluster?
Features/Variables
Samples/Instances
4
Applications of Cluster Analysis
  • Understanding: group related documents for
    browsing, group genes and proteins that have
    similar functionality, or group stocks with
    similar price fluctuations
  • Data Exploration
  • Get insight into data distribution
  • Understand patterns in the data
  • Summarization: reduce the size of large data
    sets; a preprocessing step

5
Objectives of Cluster Analysis
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

Competing objectives
Inter-cluster distances are maximized
Intra-cluster distances are minimized
6
Notion of a Cluster can be Ambiguous
Depends on resolution!
7
Prerequisites
  • Understand the nature of your problem, the type
    of features, etc.
  • The metric that you choose for similarity (for
    example, Euclidean distance or Pearson
    correlation) often impacts the clusters you
    recover.

8
Similarity/Distance measures
  • Euclidean Distance
  • Depends heavily on the scale of the features; may
    require normalization
  • City Block
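
To make the scale dependence concrete, here is a minimal Python sketch of the two distances named above. The sample values (height in metres, weight in grams) and the helper `zscore_columns` are hypothetical and only illustrate one common normalization; they are not taken from the slides.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    """City-block (Manhattan, L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def zscore_columns(rows):
    """Standardize every feature (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical samples: (height in metres, weight in grams).
samples = [[1.60, 60000.0], [1.95, 60500.0], [1.61, 64000.0]]

# On the raw data the weight column dominates both distances.
raw = euclidean(samples[0], samples[1])
# After standardization both features are on a comparable scale.
scaled = zscore_columns(samples)
normalized = euclidean(scaled[0], scaled[1])
```

On the raw values, the 35 cm height difference between the first two samples is invisible next to the 500 g weight difference, which is why normalization may be required before clustering.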

9
d_euc = 0.5846
d_euc = 1.1345
These examples of Euclidean distance match our
intuition of dissimilarity pretty well
d_euc = 2.6115
10
d_euc = 1.41
d_euc = 1.22
But what about these? What might be going on
with the expression profiles on the left? On the
right?
11
Similarity/Distance measures
  • Cosine
  • Pearson Correlation
  • Invariant to scaling (Pearson also to addition)
  • Spearman correlation for ranks
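
The invariance claims above can be checked numerically. The following is a small sketch in plain Python; the vectors are made up for illustration, and `ranks` assumes there are no ties (a simplification, not part of the slides).

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])

def ranks(values):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 4.0, 3.0]          # hypothetical expression profile
scaled = [10 * v for v in x]      # same shape, different scale
shifted = [v + 5 for v in x]      # same shape, constant offset added
cubed = [v ** 3 for v in x]       # monotone but non-linear transform

cos_scaled = cosine_similarity(x, scaled)    # unaffected by scaling
cos_shifted = cosine_similarity(x, shifted)  # changed by the offset
pearson_shifted = pearson(x, shifted)        # unaffected by the offset
spearman_cubed = spearman(x, cubed)          # depends only on the ordering
```

Cosine similarity ignores a multiplicative factor but not an added constant, Pearson correlation ignores both, and Spearman correlation only looks at the ordering of the values.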

12
Similarity/Distance measures
  • Jaccard similarity
  • When interested in intersection size

[Venn diagram: sets X and Y, their intersection X ∩ Y, and their union X ∪ Y]
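
A minimal sketch of Jaccard similarity, the size of the intersection divided by the size of the union, |X ∩ Y| / |X ∪ Y|. The gene names are hypothetical, and returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard(x, y):
    """Jaccard similarity of two collections, treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(x & y) / len(union)

# Hypothetical example: gene sets annotated to two pathways.
pathway_a = {"TP53", "BRCA1", "MYC"}
pathway_b = {"BRCA1", "MYC", "EGFR"}
similarity = jaccard(pathway_a, pathway_b)  # 2 shared genes out of 4 distinct
```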
13
Types of Clusterings
  • Important distinction between hierarchical and
    partitional sets of clusters
  • Partitional Clustering
  • A division of data objects into non-overlapping
    subsets (clusters) such that each data object is
    in exactly one subset
  • Hierarchical clustering
  • A set of nested clusters organized as a
    hierarchical tree

14
Partitional Clustering
Original Points
15
Hierarchical Clustering
Dendrogram 1
Dendrogram 2
16
Other Distinctions Between Sets of Clustering
methods
  • Exclusive versus non-exclusive
  • In non-exclusive clusterings, points may belong
    to multiple clusters.
  • Can represent multiple classes or border points
  • Fuzzy versus non-fuzzy
  • In fuzzy clustering, a point belongs to every
    cluster with some weight between 0 and 1
  • Weights must sum to 1
  • Partial versus complete
  • In some cases, we only want to cluster some of
    the data
  • Heterogeneous versus homogeneous
  • Clusters of widely different sizes, shapes, and
    densities

17
Clustering Algorithms
  • Hierarchical clustering
  • K-means
  • Bi-clustering

18
Hierarchical Clustering
  • Produces a set of nested clusters organized as a
    hierarchical tree
  • Can be visualized as a dendrogram
  • A tree-like diagram that records the sequences of
    merges or splits

19
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of
    clusters
  • Any desired number of clusters can be obtained by
    cutting the dendrogram at the proper level
  • They may correspond to meaningful taxonomies
  • Example in biological sciences (e.g., animal
    kingdom, phylogeny reconstruction)

20
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative (bottom up)
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters
    until only one cluster (or k clusters) is left
  • Divisive (top down)
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster
    contains a point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix
  • Merge or split one cluster at a time

21
Agglomerative Clustering Algorithm
  • The more popular hierarchical clustering technique
  • Basic algorithm is straightforward
  • 1. Compute the proximity matrix
  • 2. Let each data point be a cluster
  • 3. Repeat
  • 4.   Merge the two closest clusters
  • 5.   Update the proximity matrix
  • 6. Until only a single cluster remains
  • Key operation is the computation of the proximity
    of two clusters
  • Different approaches to defining the distance
    between clusters distinguish the different
    algorithms
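
The basic agglomerative loop can be sketched in a few lines of Python. This is an illustrative, unoptimized version: it recomputes cluster distances from the points at every step instead of updating a stored proximity matrix, and the names `agglomerative`, `LINKAGES` and the sample points are assumptions made for this example.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Three ways to turn point-to-point distances into a cluster-to-cluster distance.
LINKAGES = {
    "min": min,                               # single link
    "max": max,                               # complete link
    "average": lambda ds: sum(ds) / len(ds),  # group average
}

def cluster_distance(points, a, b, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    ds = [euclidean(points[i], points[j]) for i in a for j in b]
    return LINKAGES[linkage](ds)

def agglomerative(points, k=1, linkage="min"):
    """Merge the two closest clusters until only k clusters remain.

    Returns the final clusters and the merge history (the information
    a dendrogram would display).
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(points, clusters[p], clusters[q], linkage)
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((sorted(clusters[p]), sorted(clusters[q]), d))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters, merges

# Hypothetical 2-D points: two tight pairs and one far-away point.
points = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
clusters, merges = agglomerative(points, k=2, linkage="min")
```

Changing the `linkage` argument is the only difference between the single-link, complete-link and group-average variants, which is the sense in which the cluster-distance definition distinguishes the algorithms.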

22
Starting Situation
  • Start with clusters of individual points and a
    proximity matrix

Proximity Matrix
23
Intermediate Situation
  • After some merging steps, we have some clusters

[Figure: clusters C1 to C5 and their proximity matrix]
24
Intermediate Situation
  • We want to merge the two closest clusters (C2 and
    C5) and update the proximity matrix.

[Figure: clusters C1 to C5 and their proximity matrix]
25
After Merging
  • The question is How do we update the proximity
    matrix?

[Figure: clusters C1, C3, C4 and the merged cluster C2 U C5; the proximity-matrix row and column for C2 U C5 are marked "?"]
26
How to Define Inter-Cluster Similarity
Similarity?
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Ward's method (not discussed)

Proximity Matrix
27
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function
  • Ward's Method uses squared error

Proximity Matrix
30
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids

Proximity Matrix
31
Cluster Similarity: MIN or Single Link
  • Similarity of two clusters is based on the two
    most similar (closest) points in the different
    clusters
  • Determined by one pair of points, i.e., by one
    link in the proximity graph.

32
Hierarchical Clustering: MIN
Nested Clusters
Dendrogram
33
Strength of MIN
Original Points
  • Can handle non-elliptical shapes

34
Limitations of MIN
Original Points
  • Sensitive to noise and outliers

35
Cluster Similarity: MAX or Complete Linkage
  • Similarity of two clusters is based on the two
    least similar (most distant) points in the
    different clusters
  • Determined by all pairs of points in the two
    clusters

36
Hierarchical Clustering: MAX
Nested Clusters
Dendrogram
37
Strength of MAX
Original Points
  • Less susceptible to noise and outliers

38
Limitations of MAX
Original Points
  • Tends to break large clusters
  • Biased towards globular clusters

39
Cluster Similarity: Group Average
  • Proximity of two clusters is the average of
    pairwise proximity between points in the two
    clusters.
  • Need to use average connectivity for scalability
    since total proximity favors large clusters

40
Hierarchical Clustering: Group Average
Nested Clusters
Dendrogram
41
Hierarchical Clustering: Group Average
  • Compromise between Single and Complete Link
  • Strengths
  • Less susceptible to noise and outliers
  • Limitations
  • Biased towards globular clusters

42
Cluster Similarity: Ward's Method
  • Similarity of two clusters is based on the
    increase in squared error when two clusters are
    merged
  • Similar to group average if distance between
    points is distance squared
  • Less susceptible to noise and outliers
  • Biased towards globular clusters
  • Hierarchical analogue of K-means
  • Can be used to initialize K-means

43
Hierarchical Clustering: Comparison
MAX
MIN
Group Average
44
Hierarchical Clustering: Time and Space
requirements
  • O(N^2) space since it uses the proximity matrix.
  • N is the number of points.
  • O(N^3) time in many cases
  • There are N steps, and at each step the proximity
    matrix, of size N^2, must be updated and searched
  • Complexity can be reduced to O(N^2 log N) time
    for some approaches

45
Hierarchical Clustering: Problems and Limitations
  • Once a decision is made to combine two clusters,
    it cannot be undone
  • Different schemes have problems with one or more
    of the following
  • Sensitivity to noise and outliers
  • Difficulty handling different sized clusters and
    convex shapes
  • Breaking large clusters (divisive)
  • The dendrogram corresponding to a given
    hierarchical clustering is not unique, since for
    each merge one needs to specify which subtree
    should go on the left and which on the right
  • They impose structure on the data, instead of
    revealing structure in the data.
  • How many clusters? (some suggestions later)

46
K-means Clustering
  • Partitional clustering approach
  • Each cluster is associated with a centroid
    (center point)
  • Each point is assigned to the cluster with the
    closest centroid
  • Number of clusters, K, must be specified
  • The basic algorithm is very simple
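
As a rough sketch of the basic algorithm (choose K initial centroids, assign each point to the closest centroid, recompute the centroids, repeat until assignments stop changing), here is a plain-Python version. Random initialization from the data points, squared Euclidean distance, the fixed `seed`, and the sample data are assumptions of this example.

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initial centroids: k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return labels, centroids

# Hypothetical data: two well-separated groups of three points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
```

Because the starting centroids are random, different seeds can give different clusterings on less well-separated data.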

47
K-means Clustering: Details
  • Initial centroids are often chosen randomly.
  • Clusters produced vary from one run to another.
  • The centroid is (typically) the mean of the
    points in the cluster.
  • Closeness is measured mostly by Euclidean
    distance, cosine similarity, correlation, etc.
  • K-means will converge for common similarity
    measures mentioned above.
  • Most of the convergence happens in the first few
    iterations.
  • Often the stopping condition is changed to "until
    relatively few points change clusters"
  • Complexity is O(n * K * I * d)
  • n = number of points, K = number of clusters, I =
    number of iterations, d = number of attributes

Typical choice
48
Evaluating K-means Clusters
  • Most common measure is Sum of Squared Error (SSE)
  • For each point, the error is the distance to the
    nearest cluster centroid
  • To get SSE, we square these errors and sum them:
    $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2$
  • x is a data point in cluster Ci and mi is the
    representative point for cluster Ci
  • One can show that mi corresponds to the center
    (mean) of the cluster
  • Given two clusterings, we can choose the one with
    the smallest error
  • One easy way to reduce SSE is to increase K, the
    number of clusters
  • A good clustering with smaller K can have a
    lower SSE than a poor clustering with higher K
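
A short sketch of the SSE computation, assuming Euclidean distance and that `labels[i]` gives the index of the centroid assigned to `points[i]`. The one-dimensional data are hypothetical and chosen so that the effect of K can be checked by hand.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return total

# Hypothetical 1-D data: two natural groups, {0, 2} and {10, 12}.
points = [[0.0], [2.0], [10.0], [12.0]]

# A good clustering with K = 2: each centroid is the mean of its group.
good_k2 = sse(points, [0, 0, 1, 1], [[1.0], [11.0]])

# A poor clustering with K = 3 that splits one group and lumps three points.
poor_k3 = sse(points, [0, 1, 1, 2], [[0.0], [6.0], [12.0]])

# K = 4: every point is its own cluster, so the error vanishes.
trivial_k4 = sse(points, [0, 1, 2, 3], [[0.0], [2.0], [10.0], [12.0]])
```

Here the good K = 2 clustering has a lower SSE than the poor K = 3 one, while K = 4 drives the SSE to zero without telling us anything about the data.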

49
Issues and Limitations for K-means
  • How to choose initial centers?
  • How to choose K?
  • How to handle Outliers?
  • Clusters different in
  • Shape
  • Density
  • Size
  • Assumes clusters are spherical in vector space
  • Sensitive to coordinate changes

50
Two different K-means Clusterings
Original Points
51
Importance of Choosing Initial Centroids
52
Importance of Choosing Initial Centroids
53
Importance of Choosing Initial Centroids
54
Importance of Choosing Initial Centroids
55
Solutions to Initial Centroids Problem
  • Multiple runs
  • Sample and use hierarchical clustering to
    determine initial centroids
  • Select more than k initial centroids and then
    select among these initial centroids
  • Select most widely separated
  • Bisecting K-means
  • Not as susceptible to initialization issues

56
Bisecting K-means
  • Bisecting K-means algorithm
  • Variant of K-means that can produce a partitional
    or a hierarchical clustering

57
Bisecting K-means Example
58
Issues and Limitations for K-means
  • How to choose initial centers?
  • How to choose K?
  • Depends on the problem; some suggestions later
  • How to handle Outliers?
  • Preprocessing
  • Clusters different in
  • Shape
  • Density
  • Size

59
Issues and Limitations for K-means
  • How to choose initial centers?
  • How to choose K?
  • How to handle Outliers?
  • Clusters different in
  • Shape
  • Density
  • Size

60
Limitations of K-means: Differing Sizes
Original Points
K-means (3 Clusters)
61
Limitations of K-means: Differing Density
K-means (3 Clusters)
Original Points
62
Limitations of K-means: Non-globular Shapes
Original Points
K-means (2 Clusters)
63
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters. K-means then finds
parts of the natural clusters, which need to be put
together afterwards.
64
Overcoming K-means Limitations
Original Points K-means Clusters
65
Overcoming K-means Limitations
Original Points K-means Clusters
66
K-means
  • Pros
  • Simple
  • Fast for low dimensional data
  • It can find pure sub-clusters if a large number of
    clusters is specified
  • Cons
  • K-Means cannot handle non-globular data of
    different sizes and densities
  • K-Means will not identify outliers
  • K-Means is restricted to data which has the
    notion of a center (centroid)

67
Biclustering/Co-clustering
M conditions
  • Two genes can have similar expression patterns
    only under some conditions
  • Similarly, in two related conditions, some genes
    may exhibit different expression patterns

N genes
68
Biclustering
  • As a result, each cluster may involve only a
    subset of genes and a subset of conditions, which
    form a checkerboard structure

69
Biclustering
  • In general a hard task (NP-hard)
  • Heuristic algorithms described briefly
  • Cheng & Church: deletion of rows and columns.
    Biclusters are discovered one at a time
  • Order-Preserving Submatrices (OPSM), Ben-Dor et al.
  • Coupled Two-Way Clustering (Getz et al.)
  • Spectral Co-clustering

70
Cheng and Church
  • Objective function for heuristic methods (to
    minimize): the mean squared residue
    $H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$
  • Greedy method
  • Initialization: the bicluster contains all rows
    and columns.
  • Iteration:
  • Compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse.
  • Remove a row or column that gives the maximum
    decrease of H.
  • Termination: when no action will decrease H, or
    H < δ.
  • Mask this bicluster and continue
  • Problem: removing trivial biclusters
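
The greedy deletion loop can be sketched as follows. This is a simplified illustration, not the full Cheng and Church procedure: it only performs single row or column deletions (no multiple-node deletion, node addition, or masking), it recomputes H from scratch for every candidate, and the matrix and the threshold `delta` are hypothetical. Here a_iJ is the mean of row i, a_Ij the mean of column j, and a_IJ the mean of the whole submatrix.

```python
def mean_squared_residue(matrix, rows, cols):
    """H(I, J): mean of (a_ij - a_iJ - a_Ij + a_IJ)^2 over the submatrix."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]                            # a_iJ
    col_means = [sum(r[j] for r in sub) / n_rows for j in range(n_cols)]  # a_Ij
    overall = sum(row_means) / n_rows                                     # a_IJ
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows) for j in range(n_cols)
    )
    return total / (n_rows * n_cols)

def greedy_bicluster(matrix, delta):
    """Delete one row or column at a time until H <= delta or no gain."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    h = mean_squared_residue(matrix, rows, cols)
    while h > delta:
        candidates = []
        if len(rows) > 1:
            candidates += [([r for r in rows if r != i], cols) for i in rows]
        if len(cols) > 1:
            candidates += [(rows, [c for c in cols if c != j]) for j in cols]
        if not candidates:
            break
        # Pick the deletion that leaves the smallest residue.
        best_rows, best_cols = min(
            candidates, key=lambda rc: mean_squared_residue(matrix, rc[0], rc[1])
        )
        best_h = mean_squared_residue(matrix, best_rows, best_cols)
        if best_h >= h:
            break  # no deletion decreases H
        rows, cols, h = best_rows, best_cols, best_h
    return rows, cols, h

# Hypothetical expression matrix: the first three columns follow an
# additive pattern (row effect + column effect); the last column is noise.
matrix = [
    [1, 2, 3, 9],
    [4, 5, 6, 0],
    [7, 8, 9, 5],
]
rows, cols, h = greedy_bicluster(matrix, delta=0.0)
```

A perfectly additive submatrix has H = 0, so the loop stops as soon as the noisy column has been removed. A single cell or a constant block also has H = 0, which is one way to see why trivial biclusters are a problem for this objective.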

71
Ben-Dor et al. (OPSM)
  • Model
  • For a condition set T and a gene g, the
    conditions in T can be ordered in a way so that
    the expression values are sorted in ascending
    order (suppose the values are all unique).
  • Submatrix A is a bicluster if there is an
    ordering (permutation) of T such that the
    expression values of all genes in G are sorted in
    ascending order.
  • Idea of the algorithm: grow partial models until
    they become complete models.

      t1  t2  t3  t4  t5
g1     7  13  19   2  50
g2    19  23  39   6  42
g3     4   6   8   2  10
Induced permutation: 2 3 4 1 5
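
The order-preserving condition on the example table can be verified with a few lines of Python. This sketch only checks whether a given submatrix satisfies the model; it does not implement the search for such submatrices, and the function names are chosen here for illustration.

```python
def induced_permutation(values):
    """Rank (1 = smallest) of each condition's value for one gene.

    Assumes all values are unique, as in the model above.
    """
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0] * len(values)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks

def is_order_preserving(submatrix):
    """True if every gene (row) induces the same ordering of the conditions."""
    perms = [induced_permutation(row) for row in submatrix]
    return all(p == perms[0] for p in perms)

# The table from the slide: genes g1..g3 under conditions t1..t5.
genes = {
    "g1": [7, 13, 19, 2, 50],
    "g2": [19, 23, 39, 6, 42],
    "g3": [4, 6, 8, 2, 10],
}

perm = induced_permutation(genes["g1"])        # [2, 3, 4, 1, 5]
ok = is_order_preserving(list(genes.values())) # all three genes agree
```

All three genes rank the conditions identically even though their absolute expression levels differ, which is exactly what the order-preserving model captures.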
72
Ben-Dor et al. (OPSM)
73
Getz et al. (CTWC)
  • Idea: repeatedly perform one-way clustering on
    genes/conditions.
  • Stable clusters of genes are used as the
    attributes for condition clustering, and vice
    versa.

74
Spectral Co-clustering
  • Main idea
  • Normalize the two dimensions
  • Form a matrix of size m+n (using SVD)
  • Use k-means to cluster both types of data
  • http://adios.tau.ac.il/SpectralCoClustering/

75
Evaluating cluster quality
  • Use known classes (pairwise F-measure, best class
    F-measure)
  • Clusters can be evaluated with internal as well
    as external measures
  • Internal measures are related to the inter/intra
    cluster distance
  • External measures are related to how well the
    current clusters represent the true classes

76
Inter/Intra Cluster Distances
  • Intra-cluster distance
  • (Sum/Min/Max/Avg) the (absolute/squared) distance
    between
  • All pairs of points in the cluster OR
  • Between the centroid and all points in the
    cluster OR
  • Between the medoid and all points in the
    cluster
  • Inter-cluster distance
  • Sum the (squared) distance between all pairs of
    clusters
  • Where distance between two clusters is defined
    as
  • distance between their centroids/medoids
  • (Spherical clusters)
  • Distance between the closest pair of points
    belonging to the clusters
  • (Chain shaped clusters)

77
Davies-Bouldin index
  • A function of the ratio of the sum of
    within-cluster (i.e. intra-cluster) scatter to
    between cluster (i.e. inter-cluster) separation
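  • Let C = {C1, ..., Ck} be a clustering of a set of N
    objects:
    $DB = \frac{1}{k} \sum_{i=1}^{k} R_i$
  • with $R_i = \max_{j \ne i} R_{ij}$ and
    $R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\lVert c_i - c_j \rVert}$
  • where Ci is the ith cluster and ci is the
    centroid for cluster i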
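
A minimal sketch of the index for one-dimensional data, using the sample variance as the within-cluster scatter and the absolute difference of the means as the separation. The points below are hypothetical: they were chosen so that the two clusters have variances 12.33 and 2.33 and centroids 6.67 and 18.33, the values quoted in the two-cluster example of this deck.

```python
from statistics import mean, variance

def davies_bouldin_1d(clusters):
    """Davies-Bouldin index for clusters of one-dimensional points.

    R_ij = (var(C_i) + var(C_j)) / |c_i - c_j|, R_i = max over j != i,
    and DB is the mean of the R_i. Lower values mean tighter, better
    separated clusters. Singleton clusters get scatter 0.
    """
    centroids = [mean(c) for c in clusters]
    scatter = [variance(c) if len(c) > 1 else 0.0 for c in clusters]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / abs(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return total / k

# Hypothetical points reproducing the quoted variances and centroids.
two_clusters = [[3.0, 7.0, 10.0], [17.0, 18.0, 20.0]]
db = davies_bouldin_1d(two_clusters)  # about 1.26
```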

78
Davies-Bouldin index example
  • E.g., for the clusters shown
  • Compute
  • var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33
  • Centroid is simply the mean here, so c1 = 3,
    c2 = 8.5, c3 = 18.33
  • So, R12 = 1, R13 = 0.152, R23 = 0.797
  • Now, compute
  • R1 = 1 (max of R12 and R13); R2 = 1 (max of R21 and
    R23); R3 = 0.797 (max of R31 and R32)
  • Finally, compute
  • DB = 0.932

79
Davies-Bouldin index example (ctd)
  • E.g., for the clusters shown
  • Compute
  • Only 2 clusters here
  • var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while
    c2 = 18.33
  • R12 = 1.26
  • Now compute
  • Since we have only 2 clusters here, R1 = R12 = 1.26;
    R2 = R21 = 1.26
  • Finally, compute
  • DB = 1.26

80
Other criteria
  • Dunn method
  • δ(Xi, Xj): intercluster distance between clusters
    Xi and Xj; Δ(Xk): intracluster distance of cluster
    Xk
  • Silhouette method
  • Identifying outliers
  • C-index
  • Compare the sum of distances S over all pairs from
    the same cluster against the sums of the smallest
    and largest pairwise distances.
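
For the silhouette method, a common definition is s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other points in its own cluster and b is the smallest mean distance from i to the points of any other cluster. The sketch below follows that definition with Euclidean distance; scoring singleton clusters as 0 is a convention assumed here, and the data are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette_values(points, labels):
    """Silhouette value of every point; needs at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        values.append((b - a) / max(a, b))
    return values

def silhouette_score(points, labels):
    """Average silhouette value over all points."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)

# Hypothetical 1-D data with two obvious groups.
points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # matches the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # mixes the groups
```

Values close to 1 indicate points that sit well inside their cluster; points with low or negative values are candidates for outliers or misassigned points.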

81
Example dataset: AML/ALL dataset (Golub et al.)
  • Leukemia
  • 72 patients (samples)
  • 7129 genes
  • 4 groups
  • Two major types: ALL and AML
  • T and B cells in ALL
  • With/without treatment in AML

82
AML/ALL dataset
  • Davies-Bouldin index - C = 4
  • Dunn method - C = 2
  • Silhouette method - C = 2

83
Visual evaluation - coherency
84
Cluster quality example: do you see clusters?
[Two example data sets; silhouette value for each number of clusters C]
 C    Silhouette (example 1)   Silhouette (example 2)
 2    0.4922                   0.4863
 3    0.5739                   0.5762
 4    0.4773                   0.5957
 5    0.4991                   0.5351
 6    0.5404                   0.5701
 7    0.541                    0.5487
 8    0.5171                   0.5083
 9    0.5956                   0.5311
10    0.6446                   0.5229
85
Kleinberg's Axioms
  • Scale Invariance
  • F(αd) = F(d) for all d and all strictly
    positive α.
  • Consistency
  • If d' equals d, except for shrinking
    distances within clusters of F(d) or stretching
    between-cluster distances, then F(d) = F(d').
  • Richness
  • For any partition P of S, there exists a
    distance function d over S so that F(d) = P.

86
Quality estimation
  • Gamma is the best performing measure in
    Milligan's study of 30 internal criteria
    (Milligan, 1981).
  • Let d(+) denote the number of times that points
    which were clustered together in C had distance
    greater than two points which were not in the
    same cluster
  • Let d(-) denote the opposite result
  • Gamma satisfies scale-invariance, consistency,
    richness, and isomorphism invariance.

87
Dimensionality Reduction
  • Map points in high-dimensional space to a lower
    number of dimensions
  • Preserve structure: pairwise distances, etc.
  • Useful for further processing
  • Less computation, fewer parameters
  • Easier to understand, visualize

88
Dimensionality Reduction
  • Feature selection vs. Feature Extraction
  • Feature selection: select important features
  • Pros
  • meaningful features
  • Less work acquiring
  • Unsupervised
  • Variance, Fold
  • UFF

89
Dimensionality Reduction
  • Feature Extraction
  • Transforms the entire feature set to lower
    dimension.
  • Pros
  • Uses objective function to select the best
    projection
  • Sometimes single features are not good enough
  • Unsupervised
  • PCA, SVD

90
Principal Components Analysis (PCA)
  • Approximating a high-dimensional data set with a
    lower-dimensional linear subspace

Original axes
91
Singular Value Decomposition
92
Principal Components Analysis (PCA)
  • Rule of thumb for selecting number of components
  • Knee in scree plot
  • Cumulative percentage variance
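
The cumulative-percentage rule can be written down directly once the eigenvalues of the covariance matrix (the variance captured by each principal component) are known. The eigenvalues and the threshold below are hypothetical, and the function name is chosen for this sketch.

```python
def components_for_variance(eigenvalues, threshold=0.9):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for n, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return n
    return len(ordered)

# Hypothetical eigenvalues of a 6-feature covariance matrix.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.6, 0.4]

# Cumulative shares are 40%, 65%, 80%, 90%, 96%, 100%.
n_85 = components_for_variance(eigenvalues, threshold=0.85)
n_60 = components_for_variance(eigenvalues, threshold=0.60)
```

The threshold itself is a judgment call; the knee of the scree plot is the visual counterpart of the same idea.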

93
Tools for clustering
  • Matlab COMPACT
  • http://adios.tau.ac.il/compact/

95
Tools for clustering
  • Cluster TreeView (Eisen et al.)
  • http://rana.lbl.gov/eisen/?page_id=42

96
Summary
  • Clustering is ill-defined and considered an art
  • In fact, this means you need to
  • Understand your data beforehand
  • Know how to interpret clusters afterwards
  • The problem determines the best solution (which
    measure, which clustering algorithm); try to
    experiment with different options.