Title: Semi-Supervised Clustering
1 Clustering I
Data Mining, Soongsil University
2 What is clustering?
3 What is a natural grouping among these objects?
4 What is a natural grouping among these objects?
Clustering is subjective.
5 What is Similarity?
The quality or state of being similar; likeness, resemblance; as, a similarity of features.
Similarity is hard to define, but we know it when we see it.
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
6 Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
7 Unsupervised Learning: Clustering
(figure: black box)
8 2-dimensional clustering, showing three data clusters
9 What is Cluster Analysis?
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
10 What Is a Good Clustering?
- High intra-class similarity and low inter-class similarity, depending on the similarity measure
- The ability to discover some or all of the hidden patterns
11 Requirements of Clustering
- Scalability
- Ability to deal with various types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to
determine input parameters
12 Requirements of Clustering
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
13
- A technique demanded by many real-world tasks
- Biology: taxonomy of living things, such as kingdom, phylum, class, order, family, genus, and species
- Information retrieval: document/multimedia data clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- City planning: identify groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understand earth climate; find patterns in the atmosphere and ocean
- Social network mining: special interest group discovery
15 Data Matrix
- For memory-based clustering
- Also called object-by-variable structure
- Represents n objects with p variables (attributes, measures)
- A relational table
16 Dissimilarity Matrix
- For memory-based clustering
- Also called object-by-object structure
- Proximities of pairs of objects
- d(i,j): dissimilarity between objects i and j
- Nonnegative
- Close to 0: similar
(a code sketch follows)
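To make the structure concrete, here is a minimal Python sketch (illustrative code, not from the slides; function names are ours) that builds such an object-by-object matrix from an n x p data matrix using Euclidean distance:

```python
import numpy as np

def dissimilarity_matrix(X):
    """Pairwise Euclidean dissimilarities for an n x p data matrix X."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])  # nonnegative; 0 means identical
            D[i, j] = D[j, i] = d
    return D

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0]])
print(dissimilarity_matrix(X))
```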
17 How Good Is a Clustering?
- Dissimilarity/similarity depends on the distance function
- Different applications have different functions
- Judgment of clustering quality is typically highly subjective
18 Types of Attributes
- There are different types of attributes
- Nominal
- Examples: ID numbers, eye color, zip codes
- Ordinal
- Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
- Interval
- Examples: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio
- Examples: length, time, counts
19 Types of Data in Clustering
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
20 Similarity and Dissimilarity Between Objects
- Distances are the normally used measures
- Minkowski distance is a generalization: d(i,j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
- If q = 2, d is the Euclidean distance
- If q = 1, d is the Manhattan distance
- Weighted distance: multiply each attribute's term by a weight w_f (sketch below)
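A small sketch of the Minkowski family (illustrative code, not from the slides); q = 2 and q = 1 recover the Euclidean and Manhattan special cases, and w supplies optional attribute weights:

```python
import numpy as np

def minkowski(x, y, q=2, w=None):
    """Minkowski distance of order q; w gives optional per-attribute weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    return (w * np.abs(x - y) ** q).sum() ** (1.0 / q)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, q=2))  # Euclidean: 5.0
print(minkowski(a, b, q=1))  # Manhattan: 7.0
```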
21 Properties of Minkowski Distance
- Nonnegative: d(i,j) ≥ 0
- The distance of an object to itself is 0: d(i,i) = 0
- Symmetric: d(i,j) = d(j,i)
- Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j)
22 Categories of Clustering Approaches (1)
- Partitioning algorithms
- Partition the objects into k clusters
- Iteratively reallocate objects to improve the clustering
- Hierarchical algorithms
- Agglomerative: each object starts as its own cluster; merge clusters to form larger ones
- Divisive: all objects start in one cluster; split it up into smaller clusters
23 Partitional Clustering
(figure: original points vs. a partitional clustering)
24 Hierarchical Clustering
(figures: traditional hierarchical clustering with its dendrogram; non-traditional hierarchical clustering with its dendrogram)
25 Categories of Clustering Approaches (2)
- Density-based methods
- Based on connectivity and density functions
- Filter out noise; find clusters of arbitrary shape
- Grid-based methods
- Quantize the object space into a grid structure
- Model-based methods
- Use a model to find the best fit of the data
26 Partitioning Algorithms: Basic Concepts
- Partition n objects into k clusters
- Optimize the chosen partitioning criterion
- Global optimum: examine all partitions; their number grows like k^n / k! (the Stirling number S(n, k)), far too expensive! (See the computation below.)
- Heuristic methods: k-means and k-medoids
- K-means: a cluster is represented by its center
- K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster
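To see why exhaustive search is hopeless, the exact number of ways to split n objects into k nonempty clusters is the Stirling number of the second kind; a short illustrative computation (our code, not the slides'):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: partitions of n objects into k nonempty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0 or k > n:
        return 0
    # Either object n is its own cluster, or it joins one of the k clusters of the rest.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

print(stirling2(10, 3))  # 9330 partitions for just 10 objects and 3 clusters
print(stirling2(20, 3))  # 580606446: exhaustive enumeration is hopeless
```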
27 Overview of K-Means Clustering
- K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
- Algorithm: initialize K cluster centers randomly, then repeat until convergence:
- Cluster assignment step: assign each data point x to the cluster X_l whose center is nearest to x in L2 distance
- Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster
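A minimal sketch of the two alternating steps (illustrative code, assuming numeric data in a NumPy array; names are ours, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: random init, assign to nearest center (L2), re-estimate means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # initialize K centers randomly
    for _ in range(n_iter):
        # Cluster assignment step: nearest center by squared L2 distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # converged
            break
        centers = new
    return centers, labels
```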
28 K-Means Objective Function
- Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers (written out below)
- Initialization of K cluster centers:
- Totally random
- Random perturbation from the global mean
- Heuristic to ensure well-separated centers
Source: J. Ye, 2006
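Written out (standard notation; the slide's own formula image did not survive the transcript), the objective and the center update are:

```latex
J = \sum_{l=1}^{K} \sum_{x \in X_l} \lVert x - c_l \rVert^2,
\qquad
c_l = \frac{1}{|X_l|} \sum_{x \in X_l} x
```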
29 K-Means Example
30 K-Means Example: Randomly Initialize Means
(figure: two cluster centers marked x)
31 Semi-Supervised Clustering Example
(figure: scatter plot of data points)
32 Semi-Supervised Clustering Example
(figure: scatter plot of data points)
33 Second Semi-Supervised Clustering Example
(figure: scatter plot of data points)
34 Second Semi-Supervised Clustering Example
(figure: scatter plot of data points)
35 Pros and Cons of K-means
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
- Often terminates at a local optimum
- Applicable only when the mean is defined
- What about categorical data?
- Need to specify the number of clusters
- Unable to handle noisy data and outliers
- Unsuitable for discovering non-convex clusters
36 Variations of K-means
- Aspects of variation:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
- Handling categorical data: k-modes
- Use the mode instead of the mean
- Mode: the most frequent item(s)
- A mixture of categorical and numerical data: the k-prototype method
37 Categorical Values
- Handling categorical data: k-modes (Huang, 1998)
- Replace the means of clusters with modes
- Mode of an attribute: its most frequent value
- Mode of instances: for each attribute, the most frequent value
- K-modes then proceeds like K-means, using a frequency-based method to update the modes of clusters (sketch below)
- A mixture of categorical and numerical data: the k-prototype method
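A hedged sketch of the two k-modes ingredients described above, simple-matching dissimilarity and the frequency-based mode update (illustrative code; names are ours):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Simple-matching distance for categorical records: count of differing attributes."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Per-attribute most frequent value over the records in a cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_of(cluster))                                            # ('red', 'small')
print(matching_dissimilarity(("red", "large"), mode_of(cluster)))  # 1
```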
38 A Problem of K-means
- Sensitive to outliers
- Outlier: an object with extremely large values
- Outliers may substantially distort the distribution of the data
- K-medoids: use the most centrally located object in a cluster instead of the mean
39 PAM: A K-medoids Method
- PAM: Partitioning Around Medoids
- Arbitrarily choose k objects as the initial medoids
- Until no change, do:
- (Re)assign each object to the cluster of its nearest medoid
- Randomly select a non-medoid object o'; compute the total cost S of swapping a medoid m with o' (sketch below)
- If S < 0, then swap m with o' to form the new set of k medoids
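A minimal sketch of the swap-cost computation S (illustrative code; it uses 1-D points with absolute-difference distance to match the example on the next slides):

```python
def total_cost(points, medoids, dist=lambda a, b: abs(a - b)):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap_gain(points, medoids, out, new, dist=lambda a, b: abs(a - b)):
    """Change S in total cost if medoid `out` is swapped for non-medoid `new`.
    S < 0 means the swap improves the clustering."""
    trial = [new if m == out else m for m in medoids]
    return total_cost(points, trial, dist) - total_cost(points, medoids, dist)
```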
40 K-Medoids Example
- Break {1, 2, 6, 7, 8, 10, 15, 17, 20} into 3 clusters; initial medoids are 6, 7, 8:
- Cluster 6: 1, 2
- Cluster 7: (empty)
- Cluster 8: 10, 15, 17, 20
- Random non-medoid 15 replaces 7 (total cost = -13):
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 1 - 0 = 1)
- Cluster 8: 10 (cost 0)
- New cluster 15: 17 (cost 2 - 9 = -7), 20 (cost 5 - 12 = -7)
- Replace medoid 7 with the new medoid 15 and reassign:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
41 K-Medoids Example (continued)
- Random non-medoid 1 replaces 6 (total cost = 2):
- Cluster 8: 10 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 1: 7 (cost 6 - 1 = 5), 2 (cost 1 - 4 = -3)
- 2 replaces 6 (total cost = 1)
- Don't replace medoid 6:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
- Random non-medoid 7 replaces 6 (total cost = 2):
- Cluster 8: 10 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 7: 6 (cost 1 - 0 = 1), 2 (cost 5 - 4 = 1)
42 K-Medoids Example (continued)
- Don't replace medoid 6:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
- Random non-medoid 10 replaces 8 (total cost = 2): don't replace
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 10: 8 (cost 2 - 0 = 2)
- Random non-medoid 17 replaces 15 (total cost = 0): don't replace
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
- Cluster 8: 10 (cost 0)
- New cluster 17: 15 (cost 2 - 0 = 2), 20 (cost 3 - 5 = -2)
43 K-Medoids Example (continued)
- Random non-medoid 20 replaces 15 (total cost = 5 + 1 = 6): don't replace
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
- Cluster 8: 10 (cost 0)
- New cluster 20: 15 (cost 5 - 0 = 5), 17 (cost 3 - 2 = 1)
- All other possible changes (1 replaces 15, 2 replaces 15, 1 replaces 8, ...) also have positive costs
- No further changes; final clusters (replayed in the snippet below):
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
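The sketch after slide 39 can replay the example's configurations. Note that these full nearest-medoid totals count the new medoid's own saving (15's old cost of 7 drops to 0), which the slide's per-cluster S values exclude; that is exactly why the slide reports S = -13 where the full gain is -20:

```python
# Uses total_cost and pam_swap_gain from the sketch after slide 39.
points = [1, 2, 6, 7, 8, 10, 15, 17, 20]
print(total_cost(points, [6, 7, 8]))            # 39: the initial medoids
print(total_cost(points, [6, 8, 15]))           # 19: after swapping 7 for 15
print(pam_swap_gain(points, [6, 7, 8], 7, 15))  # -20: negative, so the swap is accepted
```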
44 Semi-Supervised Clustering
45 Outline
- Overview of clustering and classification
- What is semi-supervised learning?
- Semi-supervised clustering
- Semi-supervised classification
- Semi-supervised clustering
- What is semi-supervised clustering?
- Why semi-supervised clustering?
- Semi-supervised clustering algorithms
Source: J. Ye, 2006
46 Supervised Classification versus Unsupervised Clustering
- Unsupervised clustering: group similar objects together to find clusters
- Minimize intra-class distance
- Maximize inter-class distance
- Supervised classification: a class label for each training sample is given
- Build a model from the training data
- Predict class labels for unseen future data points
Source: J. Ye, 2006
47 What is clustering?
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Source: J. Ye, 2006
48 What is Classification?
Source: J. Ye, 2006
49 Clustering Algorithms
- K-Means
- Hierarchical clustering
- Graph-based clustering (spectral clustering)
- Bi-clustering
Source: J. Ye, 2006
50 Classification Algorithms
- K-Nearest-Neighbor classifiers
- Naïve Bayes classifier
- Linear Discriminant Analysis (LDA)
- Support Vector Machines (SVM)
- Logistic Regression
- Neural Networks
Source: J. Ye, 2006
51 Supervised Classification Example
(figure: scatter plot of data points)
52 Supervised Classification Example
(figure: scatter plot of data points)
53 Supervised Classification Example
(figure: scatter plot of data points)
54 Unsupervised Clustering Example
(figure: scatter plot of unlabeled data points)
55 Unsupervised Clustering Example
(figure: scatter plot of unlabeled data points)
56 Semi-Supervised Learning
- Combines labeled and unlabeled data during training to improve performance
- Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier
- Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data
57 Semi-Supervised Classification Example
(figure: labeled and unlabeled data points)
58 Semi-Supervised Classification Example
(figure: labeled and unlabeled data points)
59 Semi-Supervised Classification
- Algorithms:
- Semi-supervised EM (Ghahramani, NIPS'94; Nigam, ML'00)
- Co-training (Blum, COLT'98)
- Transductive SVMs (Vapnik, '98; Joachims, ICML'99)
- Graph-based algorithms
- Assumptions:
- A known, fixed set of categories is given in the labeled data
- The goal is to improve classification of examples into these known categories
60 Semi-Supervised Clustering: Problem Definition
- Input:
- A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
- A small amount of domain knowledge
- Output:
- A partitioning of the objects into k clusters (possibly with some discarded as outliers)
- Objective:
- Maximum intra-cluster similarity
- Minimum inter-cluster similarity
- High consistency between the partitioning and the domain knowledge
61 Why Semi-Supervised Clustering?
- Why not clustering?
- The clusters produced may not be the ones required
- Sometimes there are multiple possible groupings
- Why not classification?
- Sometimes there is insufficient labeled data
- Potential applications:
- Bioinformatics (gene and protein clustering)
- Document hierarchy construction
- News/email categorization
- Image categorization
62 Semi-Supervised Clustering
- Domain knowledge:
- Partial label information is given, or
- Constraints are applied (must-links and cannot-links)
- Approaches:
- Search-based semi-supervised clustering: alter the clustering algorithm using the constraints
- Similarity-based semi-supervised clustering: alter the similarity measure based on the constraints
- A combination of both
63 Search-Based Semi-Supervised Clustering
- Alter the clustering algorithm, which searches for a good partitioning, by:
- Modifying the objective function to give a reward for obeying labels on the supervised data (Demiriz, ANNIE'99)
- Enforcing constraints (must-link, cannot-link) on the labeled data during clustering (Wagstaff, ICML'00; ICML'01)
- Using the labeled data to initialize clusters in an iterative refinement algorithm (k-Means, ...) (Basu, ICML'02)
Source: J. Ye, 2006
66 K-Means Example: Assign Points to Clusters
(figure: cluster centers marked x)
67 K-Means Example: Re-estimate Means
(figure: cluster centers marked x)
68 K-Means Example: Re-assign Points to Clusters
(figure: cluster centers marked x)
69 K-Means Example: Re-estimate Means
(figure: cluster centers marked x)
70 K-Means Example: Re-assign Points to Clusters
(figure: cluster centers marked x)
71 K-Means Example: Re-estimate Means and Converge
(figure: cluster centers marked x)
72 Semi-Supervised K-Means
- Partial label information is given:
- Seeded K-Means
- Constrained K-Means
- Constraints (must-link, cannot-link) are given:
- COP K-Means
73 Semi-Supervised K-Means for Partially Labeled Data
- Seeded K-Means:
- Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i
- Seed points are used only for initialization, not in subsequent steps
- Constrained K-Means:
- Labeled data provided by the user are used to initialize the K-Means algorithm
- Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated
(A sketch of both variants follows.)
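A hedged sketch of the shared machinery (illustrative code; names are ours, NumPy arrays assumed): seeded_init builds the initial centers from the seeds, and the pinned argument of assign reproduces the Constrained K-Means behavior of keeping seed labels fixed, while pinned=None gives the Seeded K-Means behavior of letting seeds float free after initialization:

```python
import numpy as np

def seeded_init(X_seed, y_seed, k):
    """Initial center for cluster i is the mean of the seed points labeled i."""
    return np.array([X_seed[y_seed == i].mean(axis=0) for i in range(k)])

def assign(X, centers, pinned=None):
    """Assignment step. pinned=None: Seeded K-Means (seed labels may change);
    pinned={index: label}: Constrained K-Means (seed labels kept fixed)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    if pinned:
        for i, lab in pinned.items():
            labels[i] = lab
    return labels
```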
74 Seeded K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of the seed points may change.
Source: J. Ye, 2006
75 Seeded K-Means Example
76 Seeded K-Means Example: Initialize Means Using Labeled Data
(figure: cluster centers marked x)
77 Seeded K-Means Example: Assign Points to Clusters
(figure: cluster centers marked x)
78 Seeded K-Means Example: Re-estimate Means
(figure: cluster centers marked x)
79 Seeded K-Means Example: Assign Points to Clusters and Converge
(figure: cluster centers marked x; one seed point's label is changed)
80 Constrained K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of the seed points will not change.
Source: J. Ye, 2006
81 Constrained K-Means Example
82 Constrained K-Means Example: Initialize Means Using Labeled Data
(figure: cluster centers marked x)
83 Constrained K-Means Example: Assign Points to Clusters
(figure: cluster centers marked x)
84 Constrained K-Means Example: Re-estimate Means and Converge
85 COP K-Means
- COP K-Means (Wagstaff et al., ICML'01) is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
- Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
- Algorithm: during the cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort. (A sketch follows.)
Source: J. Ye, 2006
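A minimal sketch of the constrained assignment step described above (illustrative code; names are ours, and constraints are given as pairs of point indices):

```python
def violates(i, cluster, assigned, must, cannot):
    """Check whether putting point i into `cluster` breaks a constraint,
    given the partial assignment made so far."""
    for a, b in must:
        other = b if a == i else a if b == i else None
        if other is not None and other in assigned and assigned[other] != cluster:
            return True
    for a, b in cannot:
        other = b if a == i else a if b == i else None
        if other is not None and other in assigned and assigned[other] == cluster:
            return True
    return False

def cop_assign(order_by_distance, must, cannot):
    """Assign each point to its nearest feasible cluster; abort if none exists.
    order_by_distance[i] lists cluster indices for point i, nearest first."""
    assigned = {}
    for i, clusters in enumerate(order_by_distance):
        for c in clusters:
            if not violates(i, c, assigned, must, cannot):
                assigned[i] = c
                break
        else:
            raise RuntimeError("no feasible assignment: COP-K-Means fails")
    return assigned

must, cannot = [(0, 1)], [(1, 2)]
nearest = [[0, 1], [1, 0], [1, 0]]  # each point's clusters, nearest first
print(cop_assign(nearest, must, cannot))
# {0: 0, 1: 0, 2: 1}: the must-link pulls 1 into cluster 0; the cannot-link keeps 2 out of it
```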
86 COP K-Means Algorithm
(figure: algorithm pseudocode)
87 Illustration
Determine the new point's label. Must-link: assign it to the red class.
(figure: cluster centers marked x)
88 Illustration
Determine the new point's label. Cannot-link: assign it to the red class.
(figure: cluster centers marked x)
89 Illustration
Determine the new point's label. With both a must-link and a cannot-link in conflict, the clustering algorithm fails.
(figure: cluster centers marked x)
90 Summary
- Seeded and Constrained K-Means: partially labeled data
- COP K-Means: constraints (must-link and cannot-link)
- Constrained K-Means and COP K-Means require all the constraints to be satisfied
- May not be effective if the seeds contain noise
- Seeded K-Means uses the seeds only in the first step, to determine the initial centroids
- Less sensitive to noise in the seeds
- Experiments show that semi-supervised K-Means outperforms traditional K-Means
91 References
- Ye, Jieping: Introduction to Data Mining, Department of Computer Science and Engineering, Arizona State University, 2006
- Clifton, Chris: Introduction to Data Mining, Purdue University, 2006
- Zhu, Xingquan; Davidson, Ian: Knowledge Discovery and Data Mining, 2007