Title: Semi-Supervised Clustering
1 Clustering I
Data Mining, Soongsil University
2 What is clustering?
3 What is a natural grouping among these objects?
4 What is a natural grouping among these objects?
Clustering is subjective.
5 What is Similarity?
The quality or state of being similar; likeness, resemblance; as, a similarity of features.
Similarity is hard to define, but we know it when we see it.
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
6 Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
7 Unsupervised Learning: Clustering
(figure: black box)
8 2-dimensional clustering, showing three data clusters
9 What is Cluster Analysis?
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
10 What Is a Good Clustering?
- High intra-class similarity and low inter-class similarity, depending on the similarity measure
- The ability to discover some or all of the hidden patterns
11 Requirements of Clustering
- Scalability
- Ability to deal with various types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to
determine input parameters
12 Requirements of Clustering
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
13
- A technique demanded by many real-world tasks
- Biology: taxonomy of living things, such as kingdom, phylum, class, order, family, genus, and species
- Information retrieval: document/multimedia data clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- City planning: identify groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understand earth climate; find patterns in the atmosphere and ocean
- Social network mining: special interest group discovery
15 Data Matrix
- For memory-based clustering
- Also called object-by-variable structure
- Represents n objects with p variables (attributes, measures)
- A relational table
16 Dissimilarity Matrix
- For memory-based clustering
- Also called object-by-object structure
- Proximities of pairs of objects
- d(i,j): dissimilarity between objects i and j
- Nonnegative
- Close to 0: similar
(a code sketch follows)
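To make the structure concrete, here is a minimal Python sketch (illustrative code, not from the slides; function names are ours) that builds such an object-by-object matrix from an n x p data matrix using Euclidean distance:

```python
import numpy as np

def dissimilarity_matrix(X):
    """Pairwise Euclidean dissimilarities for an n x p data matrix X."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])  # nonnegative; 0 means identical
            D[i, j] = D[j, i] = d
    return D

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0]])
print(dissimilarity_matrix(X))
```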
17 How Good Is a Clustering?
- Dissimilarity/similarity depends on the distance function
- Different applications have different functions
- Judgment of clustering quality is typically highly subjective
18 Types of Attributes
- There are different types of attributes
- Nominal
- Examples: ID numbers, eye color, zip codes
- Ordinal
- Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
- Interval
- Examples: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio
- Examples: length, time, counts
19 Types of Data in Clustering
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
20 Similarity and Dissimilarity Between Objects
- Distances are the normally used measures
- Minkowski distance is a generalization: d(i,j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
- If q = 2, d is the Euclidean distance
- If q = 1, d is the Manhattan distance
- Weighted distance: multiply each attribute's term by a weight w_f (sketch below)
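A small sketch of the Minkowski family (illustrative code, not from the slides); q = 2 and q = 1 recover the Euclidean and Manhattan special cases, and w supplies optional attribute weights:

```python
import numpy as np

def minkowski(x, y, q=2, w=None):
    """Minkowski distance of order q; w gives optional per-attribute weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    return (w * np.abs(x - y) ** q).sum() ** (1.0 / q)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, q=2))  # Euclidean: 5.0
print(minkowski(a, b, q=1))  # Manhattan: 7.0
```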
21 Properties of Minkowski Distance
- Nonnegative: d(i,j) ≥ 0
- The distance of an object to itself is 0: d(i,i) = 0
- Symmetric: d(i,j) = d(j,i)
- Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j)
22 Categories of Clustering Approaches (1)
- Partitioning algorithms
- Partition the objects into k clusters
- Iteratively reallocate objects to improve the clustering
- Hierarchical algorithms
- Agglomerative: each object starts as its own cluster; merge clusters to form larger ones
- Divisive: all objects start in one cluster; split it up into smaller clusters
23 Partitional Clustering
(figure: original points vs. a partitional clustering)
24 Hierarchical Clustering
(figures: traditional hierarchical clustering with its dendrogram; non-traditional hierarchical clustering with its dendrogram)
25 Categories of Clustering Approaches (2)
- Density-based methods
- Based on connectivity and density functions
- Filter out noise; find clusters of arbitrary shape
- Grid-based methods
- Quantize the object space into a grid structure
- Model-based methods
- Use a model to find the best fit of the data
26 Partitioning Algorithms: Basic Concepts
- Partition n objects into k clusters
- Optimize the chosen partitioning criterion
- Global optimum: examine all partitions; their number grows like k^n / k! (the Stirling number S(n, k)), far too expensive! (See the computation below.)
- Heuristic methods: k-means and k-medoids
- K-means: a cluster is represented by its center
- K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster
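To see why exhaustive search is hopeless, the exact number of ways to split n objects into k nonempty clusters is the Stirling number of the second kind; a short illustrative computation (our code, not the slides'):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: partitions of n objects into k nonempty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0 or k > n:
        return 0
    # Either object n is its own cluster, or it joins one of the k clusters of the rest.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

print(stirling2(10, 3))  # 9330 partitions for just 10 objects and 3 clusters
print(stirling2(20, 3))  # 580606446: exhaustive enumeration is hopeless
```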
27 Overview of K-Means Clustering
- K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
- Algorithm: initialize K cluster centers randomly, then repeat until convergence:
- Cluster assignment step: assign each data point x to the cluster X_l whose center is nearest to x in L2 distance
- Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster
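A minimal sketch of the two alternating steps (illustrative code, assuming numeric data in a NumPy array; names are ours, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: random init, assign to nearest center (L2), re-estimate means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # initialize K centers randomly
    for _ in range(n_iter):
        # Cluster assignment step: nearest center by squared L2 distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # converged
            break
        centers = new
    return centers, labels
```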
28 K-Means Objective Function
- Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers (written out below)
- Initialization of K cluster centers:
- Totally random
- Random perturbation from the global mean
- Heuristic to ensure well-separated centers
Source: J. Ye, 2006
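Written out (standard notation; the slide's own formula image did not survive the transcript), the objective and the center update are:

```latex
J = \sum_{l=1}^{K} \sum_{x \in X_l} \lVert x - c_l \rVert^2,
\qquad
c_l = \frac{1}{|X_l|} \sum_{x \in X_l} x
```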
29 K-Means Example
30 K-Means Example: Randomly Initialize Means
(figure: two cluster centers marked x)
31 Semi-Supervised Clustering Example
(figure: scatter plot of data points)
32 Semi-Supervised Clustering Example
(figure: scatter plot of data points)
33 Second Semi-Supervised Clustering Example
(figure: scatter plot of data points)
34 Second Semi-Supervised Clustering Example
(figure: scatter plot of data points)
35 Pros and Cons of K-means
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
- Often terminates at a local optimum
- Applicable only when the mean is defined
- What about categorical data?
- Need to specify the number of clusters
- Unable to handle noisy data and outliers
- Unsuitable for discovering non-convex clusters
36 Variations of K-means
- Aspects of variation:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
- Handling categorical data: k-modes
- Use the mode instead of the mean
- Mode: the most frequent item(s)
- A mixture of categorical and numerical data: the k-prototype method
37 Categorical Values
- Handling categorical data: k-modes (Huang, 1998)
- Replace the means of clusters with modes
- Mode of an attribute: its most frequent value
- Mode of instances: for each attribute, the most frequent value
- K-modes then proceeds like K-means, using a frequency-based method to update the modes of clusters (sketch below)
- A mixture of categorical and numerical data: the k-prototype method
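A hedged sketch of the two k-modes ingredients described above, simple-matching dissimilarity and the frequency-based mode update (illustrative code; names are ours):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Simple-matching distance for categorical records: count of differing attributes."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Per-attribute most frequent value over the records in a cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_of(cluster))                                            # ('red', 'small')
print(matching_dissimilarity(("red", "large"), mode_of(cluster)))  # 1
```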
38 A Problem of K-means
- Sensitive to outliers
- Outlier: an object with extremely large values
- Outliers may substantially distort the distribution of the data
- K-medoids: use the most centrally located object in a cluster instead of the mean
39 PAM: A K-medoids Method
- PAM: Partitioning Around Medoids
- Arbitrarily choose k objects as the initial medoids
- Until no change, do:
- (Re)assign each object to the cluster of its nearest medoid
- Randomly select a non-medoid object o'; compute the total cost S of swapping a medoid m with o' (sketch below)
- If S < 0, then swap m with o' to form the new set of k medoids
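A minimal sketch of the swap-cost computation S (illustrative code; it uses 1-D points with absolute-difference distance to match the example on the next slides):

```python
def total_cost(points, medoids, dist=lambda a, b: abs(a - b)):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap_gain(points, medoids, out, new, dist=lambda a, b: abs(a - b)):
    """Change S in total cost if medoid `out` is swapped for non-medoid `new`.
    S < 0 means the swap improves the clustering."""
    trial = [new if m == out else m for m in medoids]
    return total_cost(points, trial, dist) - total_cost(points, medoids, dist)
```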
40 K-Medoids Example
- Break {1, 2, 6, 7, 8, 10, 15, 17, 20} into 3 clusters; initial medoids are 6, 7, 8:
- Cluster 6: 1, 2
- Cluster 7: (empty)
- Cluster 8: 10, 15, 17, 20
- Random non-medoid 15 replaces 7 (total cost = -13):
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 1 - 0 = 1)
- Cluster 8: 10 (cost 0)
- New cluster 15: 17 (cost 2 - 9 = -7), 20 (cost 5 - 12 = -7)
- Replace medoid 7 with the new medoid 15 and reassign:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
41 K-Medoids Example (continued)
- Random non-medoid 1 replaces 6 (total cost = 2):
- Cluster 8: 10 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 1: 7 (cost 6 - 1 = 5), 2 (cost 1 - 4 = -3)
- 2 replaces 6 (total cost = 1)
- Don't replace medoid 6:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
- Random non-medoid 7 replaces 6 (total cost = 2):
- Cluster 8: 10 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 7: 6 (cost 1 - 0 = 1), 2 (cost 5 - 4 = 1)
42 K-Medoids Example (continued)
- Don't replace medoid 6:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
- Random non-medoid 10 replaces 8 (total cost = 2): don't replace
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 10: 8 (cost 2 - 0 = 2)
- Random non-medoid 17 replaces 15 (total cost = 0): don't replace
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
- Cluster 8: 10 (cost 0)
- New cluster 17: 15 (cost 2 - 0 = 2), 20 (cost 3 - 5 = -2)
43 K-Medoids Example (continued)
- Random non-medoid 20 replaces 15 (total cost = 5 + 1 = 6): don't replace
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
- Cluster 8: 10 (cost 0)
- New cluster 20: 15 (cost 5 - 0 = 5), 17 (cost 3 - 2 = 1)
- All other possible changes (1 replaces 15, 2 replaces 15, 1 replaces 8, ...) also have positive costs
- No further changes; final clusters (replayed in the snippet below):
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
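The sketch after slide 39 can replay the example's configurations. Note that these full nearest-medoid totals count the new medoid's own saving (15's old cost of 7 drops to 0), which the slide's per-cluster S values exclude; that is exactly why the slide reports S = -13 where the full gain is -20:

```python
# Uses total_cost and pam_swap_gain from the sketch after slide 39.
points = [1, 2, 6, 7, 8, 10, 15, 17, 20]
print(total_cost(points, [6, 7, 8]))            # 39: the initial medoids
print(total_cost(points, [6, 8, 15]))           # 19: after swapping 7 for 15
print(pam_swap_gain(points, [6, 7, 8], 7, 15))  # -20: negative, so the swap is accepted
```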
44 Semi-Supervised Clustering
45 Outline
- Overview of clustering and classification
- What is semi-supervised learning?
- Semi-supervised clustering
- Semi-supervised classification
- Semi-supervised clustering
- What is semi-supervised clustering?
- Why semi-supervised clustering?
- Semi-supervised clustering algorithms
Source: J. Ye, 2006
46 Supervised Classification versus Unsupervised Clustering
- Unsupervised clustering: group similar objects together to find clusters
- Minimize intra-class distance
- Maximize inter-class distance
- Supervised classification: a class label for each training sample is given
- Build a model from the training data
- Predict class labels for unseen future data points
Source: J. Ye, 2006
47 What is clustering?
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Source: J. Ye, 2006
48 What is Classification?
Source: J. Ye, 2006
49 Clustering Algorithms
- K-Means
- Hierarchical clustering
- Graph-based clustering (spectral clustering)
- Bi-clustering
Source: J. Ye, 2006
50 Classification Algorithms
- K-Nearest-Neighbor classifiers
- Naïve Bayes classifier
- Linear Discriminant Analysis (LDA)
- Support Vector Machines (SVM)
- Logistic Regression
- Neural Networks
Source: J. Ye, 2006
51 Supervised Classification Example
(figure: scatter plot of data points)
52 Supervised Classification Example
(figure: scatter plot of data points)
53 Supervised Classification Example
(figure: scatter plot of data points)
54 Unsupervised Clustering Example
(figure: scatter plot of unlabeled data points)
55 Unsupervised Clustering Example
(figure: scatter plot of unlabeled data points)
56 Semi-Supervised Learning
- Combines labeled and unlabeled data during training to improve performance
- Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier
- Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data
57 Semi-Supervised Classification Example
(figure: labeled and unlabeled data points)
58 Semi-Supervised Classification Example
(figure: labeled and unlabeled data points)
59 Semi-Supervised Classification
- Algorithms:
- Semi-supervised EM (Ghahramani, NIPS'94; Nigam, ML'00)
- Co-training (Blum, COLT'98)
- Transductive SVMs (Vapnik, '98; Joachims, ICML'99)
- Graph-based algorithms
- Assumptions:
- A known, fixed set of categories is given in the labeled data
- The goal is to improve classification of examples into these known categories
60 Semi-Supervised Clustering: Problem Definition
- Input:
- A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
- A small amount of domain knowledge
- Output:
- A partitioning of the objects into k clusters (possibly with some discarded as outliers)
- Objective:
- Maximum intra-cluster similarity
- Minimum inter-cluster similarity
- High consistency between the partitioning and the domain knowledge
61 Why Semi-Supervised Clustering?
- Why not clustering?
- The clusters produced may not be the ones required
- Sometimes there are multiple possible groupings
- Why not classification?
- Sometimes there is insufficient labeled data
- Potential applications:
- Bioinformatics (gene and protein clustering)
- Document hierarchy construction
- News/email categorization
- Image categorization
62 Semi-Supervised Clustering
- Domain knowledge:
- Partial label information is given, or
- Constraints are applied (must-links and cannot-links)
- Approaches:
- Search-based semi-supervised clustering: alter the clustering algorithm using the constraints
- Similarity-based semi-supervised clustering: alter the similarity measure based on the constraints
- A combination of both
63 Search-Based Semi-Supervised Clustering
- Alter the clustering algorithm, which searches for a good partitioning, by:
- Modifying the objective function to give a reward for obeying labels on the supervised data (Demiriz, ANNIE'99)
- Enforcing constraints (must-link, cannot-link) on the labeled data during clustering (Wagstaff, ICML'00; ICML'01)
- Using the labeled data to initialize clusters in an iterative refinement algorithm (k-Means, ...) (Basu, ICML'02)
Source: J. Ye, 2006
66 K-Means Example: Assign Points to Clusters
(figure: cluster centers marked x)
67 K-Means Example: Re-estimate Means
(figure: cluster centers marked x)
68 K-Means Example: Re-assign Points to Clusters
(figure: cluster centers marked x)
69 K-Means Example: Re-estimate Means
(figure: cluster centers marked x)
70 K-Means Example: Re-assign Points to Clusters
(figure: cluster centers marked x)
71 K-Means Example: Re-estimate Means and Converge
(figure: cluster centers marked x)
72 Semi-Supervised K-Means
- Partial label information is given:
- Seeded K-Means
- Constrained K-Means
- Constraints (must-link, cannot-link) are given:
- COP K-Means
73 Semi-Supervised K-Means for Partially Labeled Data
- Seeded K-Means:
- Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i
- Seed points are used only for initialization, not in subsequent steps
- Constrained K-Means:
- Labeled data provided by the user are used to initialize the K-Means algorithm
- Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated
(A sketch of both variants follows.)
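A hedged sketch of the shared machinery (illustrative code; names are ours, NumPy arrays assumed): seeded_init builds the initial centers from the seeds, and the pinned argument of assign reproduces the Constrained K-Means behavior of keeping seed labels fixed, while pinned=None gives the Seeded K-Means behavior of letting seeds float free after initialization:

```python
import numpy as np

def seeded_init(X_seed, y_seed, k):
    """Initial center for cluster i is the mean of the seed points labeled i."""
    return np.array([X_seed[y_seed == i].mean(axis=0) for i in range(k)])

def assign(X, centers, pinned=None):
    """Assignment step. pinned=None: Seeded K-Means (seed labels may change);
    pinned={index: label}: Constrained K-Means (seed labels kept fixed)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    if pinned:
        for i, lab in pinned.items():
            labels[i] = lab
    return labels
```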
74 Seeded K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of the seed points may change.
Source: J. Ye, 2006
75 Seeded K-Means Example
76 Seeded K-Means Example: Initialize Means Using Labeled Data
(figure: cluster centers marked x)
77 Seeded K-Means Example: Assign Points to Clusters
(figure: cluster centers marked x)
78 Seeded K-Means Example: Re-estimate Means
(figure: cluster centers marked x)
79 Seeded K-Means Example: Assign Points to Clusters and Converge
(figure: cluster centers marked x; one seed point's label is changed)
80 Constrained K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of the seed points will not change.
Source: J. Ye, 2006
81 Constrained K-Means Example
82 Constrained K-Means Example: Initialize Means Using Labeled Data
(figure: cluster centers marked x)
83 Constrained K-Means Example: Assign Points to Clusters
(figure: cluster centers marked x)
84 Constrained K-Means Example: Re-estimate Means and Converge
85 COP K-Means
- COP K-Means (Wagstaff et al., ICML'01) is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
- Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
- Algorithm: during the cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort. (A sketch follows.)
Source: J. Ye, 2006
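A minimal sketch of the constrained assignment step described above (illustrative code; names are ours, and constraints are given as pairs of point indices):

```python
def violates(i, cluster, assigned, must, cannot):
    """Check whether putting point i into `cluster` breaks a constraint,
    given the partial assignment made so far."""
    for a, b in must:
        other = b if a == i else a if b == i else None
        if other is not None and other in assigned and assigned[other] != cluster:
            return True
    for a, b in cannot:
        other = b if a == i else a if b == i else None
        if other is not None and other in assigned and assigned[other] == cluster:
            return True
    return False

def cop_assign(order_by_distance, must, cannot):
    """Assign each point to its nearest feasible cluster; abort if none exists.
    order_by_distance[i] lists cluster indices for point i, nearest first."""
    assigned = {}
    for i, clusters in enumerate(order_by_distance):
        for c in clusters:
            if not violates(i, c, assigned, must, cannot):
                assigned[i] = c
                break
        else:
            raise RuntimeError("no feasible assignment: COP-K-Means fails")
    return assigned

must, cannot = [(0, 1)], [(1, 2)]
nearest = [[0, 1], [1, 0], [1, 0]]  # each point's clusters, nearest first
print(cop_assign(nearest, must, cannot))
# {0: 0, 1: 0, 2: 1}: the must-link pulls 1 into cluster 0; the cannot-link keeps 2 out of it
```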
86 COP K-Means Algorithm
(figure: algorithm pseudocode)
87 Illustration
Determine the new point's label. Must-link: assign it to the red class.
(figure: cluster centers marked x)
88 Illustration
Determine the new point's label. Cannot-link: assign it to the red class.
(figure: cluster centers marked x)
89 Illustration
Determine the new point's label. With both a must-link and a cannot-link in conflict, the clustering algorithm fails.
(figure: cluster centers marked x)
90 Summary
- Seeded and Constrained K-Means: partially labeled data
- COP K-Means: constraints (must-link and cannot-link)
- Constrained K-Means and COP K-Means require all the constraints to be satisfied
- May not be effective if the seeds contain noise
- Seeded K-Means uses the seeds only in the first step, to determine the initial centroids
- Less sensitive to noise in the seeds
- Experiments show that semi-supervised K-Means outperforms traditional K-Means
91 References
- Ye, Jieping: Introduction to Data Mining, Department of Computer Science and Engineering, Arizona State University, 2006
- Clifton, Chris: Introduction to Data Mining, Purdue University, 2006
- Zhu, Xingquan; Davidson, Ian: Knowledge Discovery and Data Mining, 2007