Title: Semi-Supervised Clustering
1. Semi-Supervised Clustering
Jieping Ye
Department of Computer Science and Engineering, Arizona State University
http://www.public.asu.edu/jye02
2. Outline
- Overview of clustering and classification
- What is semi-supervised learning?
  - Semi-supervised clustering
  - Semi-supervised classification
- Semi-supervised clustering
  - What is semi-supervised clustering?
  - Why semi-supervised clustering?
  - Semi-supervised clustering algorithms
3. Supervised Classification versus Unsupervised Clustering
- Unsupervised clustering: group similar objects together to find clusters
  - Minimize intra-cluster distance
  - Maximize inter-cluster distance
- Supervised classification: a class label for each training sample is given
  - Build a model from the training data
  - Predict class labels for unseen future data points
4. What is Clustering?
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
5. What is Classification?
6. Clustering Algorithms
- K-Means
- Hierarchical clustering
- Graph-based clustering (spectral clustering)
- Bi-clustering
7. Classification Algorithms
- K-Nearest-Neighbor classifiers
- Naïve Bayes classifier
- Linear Discriminant Analysis (LDA)
- Support Vector Machines (SVM)
- Logistic Regression
- Neural Networks
8. Supervised Classification Example
9. Supervised Classification Example
10. Supervised Classification Example
11. Unsupervised Clustering Example
12. Unsupervised Clustering Example
13. Semi-Supervised Learning
- Combines labeled and unlabeled data during training to improve performance
- Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier
- Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data
14. Semi-Supervised Classification Example
15. Semi-Supervised Classification Example
16. Semi-Supervised Classification
- Algorithms
  - Semi-supervised EM [Ghahramani, NIPS '94; Nigam et al., ML '00]
  - Co-training [Blum & Mitchell, COLT '98]
  - Transductive SVMs [Vapnik '98; Joachims, ICML '99]
  - Graph-based algorithms
- Assumptions
  - A known, fixed set of categories is given in the labeled data
  - The goal is to improve classification of examples into these known categories
17. Semi-Supervised Clustering Example
18. Semi-Supervised Clustering Example
19. Second Semi-Supervised Clustering Example
20. Second Semi-Supervised Clustering Example
21. Semi-Supervised Clustering: Problem Definition
- Input
  - A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
  - A small amount of domain knowledge
- Output
  - A partitioning of the objects into k clusters (possibly with some discarded as outliers)
- Objective
  - Maximum intra-cluster similarity
  - Minimum inter-cluster similarity
  - High consistency between the partitioning and the domain knowledge
22. Why Semi-Supervised Clustering?
- Why not clustering?
  - The clusters produced may not be the ones required
  - Sometimes there are multiple possible groupings
- Why not classification?
  - Sometimes there is insufficient labeled data
- Potential applications
  - Bioinformatics (gene and protein clustering)
  - Document hierarchy construction
  - News/email categorization
  - Image categorization
23. Semi-Supervised Clustering
- Domain knowledge
  - Partial label information is given, or
  - Constraints are given (must-links and cannot-links)
- Approaches
  - Search-based semi-supervised clustering: alter the clustering algorithm using the constraints
  - Similarity-based semi-supervised clustering: alter the similarity measure based on the constraints
  - A combination of both
24. Search-Based Semi-Supervised Clustering
- Alter the clustering algorithm, which searches for a good partitioning, by:
  - Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz et al., ANNIE '99]
  - Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff et al., ICML '00; Wagstaff et al., ICML '01]
  - Using the labeled data to initialize clusters in an iterative refinement algorithm such as K-Means [Basu et al., ICML '02]
25. Overview of K-Means Clustering
- K-Means is a partitional clustering algorithm, based on iterative relocation, that partitions a dataset into K clusters
- Algorithm (see the sketch after this list): initialize the K cluster centers randomly, then repeat until convergence:
  - Cluster assignment step: assign each data point x to the cluster X_l whose center has the minimum L2 distance to x
  - Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster
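As a concrete illustration of this loop, here is a minimal NumPy sketch (our own code, not from the slides); it assumes Euclidean data and that no cluster ever becomes empty:

```python
# Minimal K-Means sketch (illustrative; assumes no cluster becomes empty).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment step: each point goes to the nearest center (L2).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center becomes the mean of its points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```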
26. K-Means Objective Function
- Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers (written out after this list)
- Initialization of the K cluster centers:
  - Totally random
  - Random perturbation from the global mean
  - A heuristic to ensure well-separated centers
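Written out (a standard formulation consistent with the slide's description), the objective K-Means locally minimizes is

```latex
J = \sum_{l=1}^{K} \sum_{x \in X_l} \lVert x - \mu_l \rVert_2^2,
\qquad
\mu_l = \frac{1}{|X_l|} \sum_{x \in X_l} x,
```

where mu_l is the center of cluster X_l.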
27. K-Means Example
28. K-Means Example: Randomly Initialize Means
29. K-Means Example: Assign Points to Clusters
30. K-Means Example: Re-estimate Means
31. K-Means Example: Re-assign Points to Clusters
32. K-Means Example: Re-estimate Means
33. K-Means Example: Re-assign Points to Clusters
34. K-Means Example: Re-estimate Means and Converge
35. Semi-Supervised K-Means
- With partial label information:
  - Seeded K-Means
  - Constrained K-Means
- With constraints (must-link, cannot-link):
  - COP K-Means
36. Semi-Supervised K-Means for Partially Labeled Data
- Seeded K-Means (see the sketch after this list)
  - Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i
  - Seed points are used only for initialization, not in subsequent steps
- Constrained K-Means
  - Labeled data provided by the user are used to initialize the K-Means algorithm
  - Cluster labels of the seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated
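A minimal sketch of the difference (our own illustration; `seeds` is an assumed dict mapping a point's index to its given label):

```python
# Seeded vs. Constrained K-Means initialization and assignment (sketch).
import numpy as np

def seeded_init(X, seeds, k):
    # Initial center for cluster i = mean of the seed points with label i.
    return np.array([X[[p for p, c in seeds.items() if c == i]].mean(axis=0)
                     for i in range(k)])

def assign(X, centers, seeds, clamp_seeds):
    # Nearest-center assignment, exactly as in plain K-Means.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    if clamp_seeds:  # Constrained K-Means: seed labels never change.
        for p, c in seeds.items():
            labels[p] = c
    return labels
```

Both variants start from `seeded_init`; Seeded K-Means then runs the ordinary loop (`clamp_seeds=False`), while Constrained K-Means clamps the seed labels in every assignment step (`clamp_seeds=True`).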
37. Seeded K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of the seeded points may change.
38. Seeded K-Means Example
39. Seeded K-Means Example: Initialize Means Using Labeled Data
40. Seeded K-Means Example: Assign Points to Clusters
41. Seeded K-Means Example: Re-estimate Means
42. Seeded K-Means Example: Assign Points to Clusters and Converge
[Figure: in the final assignment, the label of one seeded point has changed]
43. Exercise
Compute the clustering using seeded K-Means.
44. Constrained K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of the seeded points will not change.
45. Constrained K-Means Example
46. Constrained K-Means Example: Initialize Means Using Labeled Data
47. Constrained K-Means Example: Assign Points to Clusters
48. Constrained K-Means Example: Re-estimate Means and Converge
49. Exercise
Compute the clustering using constrained K-Means.
50. COP K-Means
- COP K-Means [Wagstaff et al., ICML '01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points
- Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so the linked points cannot later be chosen as the center of another cluster)
- Algorithm: during the cluster assignment step of COP-K-Means, each point is assigned to its nearest cluster that violates none of its constraints; if no such assignment exists, the algorithm aborts
51. COP K-Means Algorithm
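The slide shows the algorithm as a figure; its constrained assignment step can be reconstructed roughly as follows (our own sketch with hypothetical helper names; `must` and `cannot` are lists of index pairs):

```python
# COP-K-Means constrained assignment step (sketch after Wagstaff et al.).
import numpy as np

def violates(p, c, labels, must, cannot):
    """Would assigning point p to cluster c break any constraint?"""
    for a, b in must:       # must-link pairs need the same cluster
        if p in (a, b):
            q = b if p == a else a
            if labels[q] is not None and labels[q] != c:
                return True
    for a, b in cannot:     # cannot-link pairs need different clusters
        if p in (a, b):
            q = b if p == a else a
            if labels[q] is not None and labels[q] == c:
                return True
    return False

def cop_assign(X, centers, must, cannot):
    labels = [None] * len(X)
    for p in range(len(X)):
        # Try clusters from nearest to farthest; take the first legal one.
        for c in np.argsort(np.linalg.norm(centers - X[p], axis=1)):
            if not violates(p, int(c), labels, must, cannot):
                labels[p] = int(c)
                break
        else:
            raise RuntimeError("No consistent assignment exists; abort.")
    return labels
```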
52. Illustration
[Figure: determining the label of a new point; a must-link to a point in the red class forces assignment to the red class]
53. Illustration
[Figure: determining the label of a new point; a cannot-link rules out the other class, so the point is assigned to the red class]
54. Illustration
[Figure: the new point has a must-link pulling it toward one cluster and a cannot-link forbidding that assignment; no consistent choice exists, so the clustering algorithm fails]
55. Similarity-Based Semi-Supervised Clustering
- Alter the similarity measure based on the constraints
- Paper: "From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering," D. Klein et al.
- Two types of constraints: must-links and cannot-links
- Clustering algorithm: hierarchical clustering
56. Constraints
57. Overview of Hierarchical Clustering Algorithms
- Agglomerative versus divisive
- Basic agglomerative clustering algorithm (a sketch follows this list):
  - Compute the distance matrix
  - Let each data point be a cluster
  - Repeat:
    - Merge the two closest clusters
    - Update the distance matrix
  - Until only a single cluster remains
- The key operation is the update of the distance between two clusters
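As one concrete instance of this loop, a short sketch using SciPy's standard hierarchical-clustering routines (our addition, not from the slides):

```python
# Agglomerative clustering on a full symmetric distance matrix D.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def agglomerative(D, k, method="complete"):
    condensed = squareform(D, checks=False)        # flatten the matrix
    Z = linkage(condensed, method=method)          # sequence of merges
    return fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k clusters
```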
58. How to Define Inter-Cluster Distance?
- MIN (single link)
- MAX (complete link)
- Group average
- Distance between centroids
(See the sketch below.)
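In code, for clusters given as index lists A and B over a pairwise distance matrix D (our own illustration):

```python
import numpy as np

def min_link(D, A, B):       # MIN (single link)
    return D[np.ix_(A, B)].min()

def max_link(D, A, B):       # MAX (complete link)
    return D[np.ix_(A, B)].max()

def group_average(D, A, B):  # mean of all cross-cluster distances
    return D[np.ix_(A, B)].mean()

def centroid_distance(X, A, B):  # needs the raw points X, not just D
    return np.linalg.norm(X[A].mean(axis=0) - X[B].mean(axis=0))
```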
59. Must-Link Constraints
- Set the distance between each must-link pair to zero
- Derive a new metric by running an all-pairs shortest-paths algorithm (a sketch follows)
  - The result is still a metric
  - It stays faithful to the original metric
- Computational complexity: O(N^2 C)
  - C: the number of points involved in must-link constraints
  - N: the total number of points
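A minimal sketch of this repair (our own code): shortest paths can only improve by passing through the C constrained points, which is what gives the O(N^2 C) bound:

```python
# Must-link metric repair: zero out must-link pairs, then relax shortest paths.
import numpy as np

def must_link_metric(D, must_links):
    D = D.astype(float).copy()
    for a, b in must_links:
        D[a, b] = D[b, a] = 0.0
    involved = {i for pair in must_links for i in pair}  # the C points
    for m in involved:  # Floyd-Warshall restricted to constrained midpoints
        D = np.minimum(D, D[:, [m]] + D[[m], :])
    return D
```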
60. New Distance Matrix Based on Must-Link Constraints
Hierarchical clustering can be carried out on the new distance matrix. What is missing? The cannot-link constraints have not been used yet.
61. Cannot-Link Constraints
- Run hierarchical clustering with complete link (MAX): the distance between two clusters is determined by the largest pairwise distance
- Set the distance between each cannot-link pair to a very large value (larger than any other distance; 0.9 in the example below)
- The new distance matrix no longer defines a metric, but this works very well in practice
62. Constrained Complete-Link Clustering Algorithm
Derive a new distance matrix based on both types of constraints, then run complete-link hierarchical clustering on it (a combined sketch follows).
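Putting the pieces together (a sketch reusing `must_link_metric` and `agglomerative` from above; `LARGE` is any value exceeding every legitimate distance):

```python
LARGE = 1e9  # stands in for "infinity" between cannot-link pairs

def constrained_complete_link(D, must_links, cannot_links, k):
    D = must_link_metric(D, must_links)    # step 1: must-link metric repair
    for a, b in cannot_links:              # step 2: push cannot-links far apart
        D[a, b] = D[b, a] = LARGE
    return agglomerative(D, k, method="complete")  # step 3: complete link (MAX)
```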
63. Illustration
Initial distance matrix over points 1-5:

        1     2     3     4     5
  1     0    0.2   0.5   0.1   0.8
  2           0    0.4   0.2   0.6
  3                 0    0.3   0.2
  4                       0    0.5
  5                             0
64New distance matrix
0.9
0 0.2 0.5 0.1 0.8
0 0.4 0.2 0.6
0 0.3 0.2
0 0.5
0
0 0 0.1 0.1 0.8
0 0.2 0.2 0.6
0 0 0.2
0 0.2
0
Must-links 12, 34 Cannot-links 2--3
65. Hierarchical Clustering
Points 1 and 2 form one cluster, and points 3 and 4 form another (their distances are zero). Under complete link, the cluster-level distance matrix is:

         {1,2}  {3,4}    5
  {1,2}    0     0.9    0.8
  {3,4}           0     0.2
    5                    0

[Figure: the resulting dendrogram over points 1-5; {3,4} merges with 5 at 0.2 before joining {1,2} at 0.9, so the cannot-link pair 2-3 ends up in different clusters when cutting at two clusters]
66. Summary
- Seeded and Constrained K-Means: use partially labeled data (seeds)
- COP K-Means: uses constraints (must-link and cannot-link)
- Constrained K-Means and COP K-Means require all the seeds/constraints to be satisfied
  - They may not be effective if the seeds contain noise
- Seeded K-Means uses the seeds only in the first step, to determine the initial centroids
  - It is less sensitive to noise in the seeds
- Semi-supervised hierarchical clustering: alter the similarity measure based on the constraints (Klein et al.)