Title: Clustering: Unsupervised Learning Methods (15-381)
1. Clustering: Unsupervised Learning Methods (15-381)
- Jaime Carbonell
- 8 April 2003
- OUTLINE
- What is unsupervised learning?
- Similarity computations
- Clustering Algorithms
- Other kinds of unsupervised learning
2. Unsupervised Learning
- Definition of Unsupervised Learning
- Learning useful structure without labeled
classes, optimization criterion, feedback signal,
or any other information beyond the raw data and
grouping principle(s).
3. Unsupervised Learning
- Examples
- Find natural groupings of Xs (X = human languages, stocks, gene sequences, animal species, ...) → a prelude to discovery of underlying properties
- Summarize the news for the past month → cluster first, then report centroids
- Sequence extrapolation: e.g., predict cancer incidence next decade; predict the rise in antibiotic-resistant bacteria
- Methods
- Clustering (n-link, k-means, GAC, ...)
- Taxonomy creation (hierarchical clustering)
- Novelty detection (meaningful outliers)
- Trend detection (extrapolation from multivariate partial derivatives)
4. Similarity Measures in Data Analysis
- General Assumptions
- Each data item is a tuple (vector)
- Values of tuples are nominal, ordinal, or numerical
- Similarity ≈ (Distance)^-1
- Pure Numerical Tuples
- Sim(d_i, d_j) = Σ_k d_{i,k} · d_{j,k}  (inner product)
- Sim(d_i, d_j) = cos(d_i, d_j)  (cosine similarity)
- and many more (see the slide after next)
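A minimal Python sketch of these two numerical measures (inner product and cosine); the vectors and function names are illustrative, not from the lecture:

```python
import math

def dot_similarity(di, dj):
    """Inner-product similarity: sum of coordinate-wise products."""
    return sum(x * y for x, y in zip(di, dj))

def cosine_similarity(di, dj):
    """Cosine of the angle between the tuples (length-normalized dot product)."""
    norm_i = math.sqrt(sum(x * x for x in di))
    norm_j = math.sqrt(sum(y * y for y in dj))
    return dot_similarity(di, dj) / (norm_i * norm_j)

# Example: two 4-dimensional numerical tuples
a = [1.0, 0.0, 2.0, 3.0]
b = [0.5, 1.0, 2.0, 1.0]
print(dot_similarity(a, b), cosine_similarity(a, b))
```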
5. Similarity Measures in Data Analysis
- For Ordinal Values
- E.g. "small," "medium," "large," "X-large"
- Convert to numerical values, assuming a constant step Δ on a normalized [0,1] scale, where max(v) = 1, min(v) = 0, and the other values interpolate
- E.g. "small" = 0, "medium" = 0.33, etc.
- Then, use numerical similarity measures
- Or, use a similarity matrix (see next slide)
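A minimal sketch of the ordinal-to-numeric conversion above, assuming equally spaced values on [0, 1]; the helper name is illustrative:

```python
def ordinal_to_numeric(scale):
    """Map an ordered list of ordinal labels onto [0, 1] with constant spacing."""
    n = len(scale)
    return {label: i / (n - 1) for i, label in enumerate(scale)}

sizes = ["small", "medium", "large", "X-large"]
print(ordinal_to_numeric(sizes))
# {'small': 0.0, 'medium': 0.333..., 'large': 0.666..., 'X-large': 1.0}
```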
6. Similarity Measures (cont.)
- For Nominal Values
- E.g. "Boston", "LA", "Pittsburgh"; or "male", "female"; or "diffuse", "globular", "spiral", "pinwheel"
- Binary rule: if d_{i,k} = d_{j,k}, then sim = 1, else 0
- Or use an underlying semantic property, e.g. Sim(Boston, LA) ∝ dist(Boston, LA)^-1, or Sim(Boston, LA) = 1 - |size(Boston) - size(LA)| / max(size(cities))
- Or, use a similarity matrix
7. Similarity Matrix

            tiny  little  small  medium  large  huge
  tiny       1.0     0.8    0.7     0.5    0.2   0.0
  little             1.0    0.9     0.7    0.3   0.1
  small                     1.0     0.7    0.3   0.2
  medium                            1.0    0.5   0.3
  large                                    1.0   0.8
  huge                                           1.0

- Diagonal must be 1.0
- Monotonicity property must hold
- No linearity (value interpolation) assumed
- Qualitative transitive property must hold
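A minimal sketch of pairwise lookup in such a hand-built matrix, storing only the upper triangle shown above; the dictionary layout is an assumption:

```python
# Upper-triangular similarity table for ordinal size words (values from the slide).
SIM = {
    ("tiny", "tiny"): 1.0, ("tiny", "little"): 0.8, ("tiny", "small"): 0.7,
    ("tiny", "medium"): 0.5, ("tiny", "large"): 0.2, ("tiny", "huge"): 0.0,
    ("little", "little"): 1.0, ("little", "small"): 0.9, ("little", "medium"): 0.7,
    ("little", "large"): 0.3, ("little", "huge"): 0.1,
    ("small", "small"): 1.0, ("small", "medium"): 0.7, ("small", "large"): 0.3,
    ("small", "huge"): 0.2,
    ("medium", "medium"): 1.0, ("medium", "large"): 0.5, ("medium", "huge"): 0.3,
    ("large", "large"): 1.0, ("large", "huge"): 0.8,
    ("huge", "huge"): 1.0,
}

def sim(a, b):
    """Symmetric lookup: try (a, b), then (b, a)."""
    return SIM.get((a, b), SIM.get((b, a)))

print(sim("small", "large"))   # 0.3
print(sim("huge", "little"))   # 0.1 (symmetric lookup)
```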
8. Document Clustering Techniques
- Similarity or Distance Measure: Alternative Choices
- Cosine similarity
- Euclidean distance
- Kernel functions
- Language modeling: P(y | model_x), where x and y are documents
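For contrast with cosine similarity, a minimal sketch of Euclidean distance plus one common kernel choice; the Gaussian (RBF) form and the gamma value are assumptions, since the slide does not name a specific kernel:

```python
import math

def euclidean_distance(x, y):
    """Euclidean (L2) distance between two equal-length document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: similarity decays with squared Euclidean distance."""
    return math.exp(-gamma * euclidean_distance(x, y) ** 2)

doc1 = [0.2, 0.0, 0.5, 0.3]   # toy term-weight vectors
doc2 = [0.1, 0.4, 0.4, 0.1]
print(euclidean_distance(doc1, doc2), rbf_kernel(doc1, doc2))
```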
9. Document Clustering Techniques
- Kullback-Leibler distance ("relative entropy")
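A minimal sketch of the Kullback-Leibler distance between two word distributions; the smoothing constant is an assumption to avoid division by zero:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)); asymmetric and >= 0."""
    return sum(pw * math.log((pw + eps) / (qw + eps)) for pw, qw in zip(p, q))

# Toy unigram distributions over a 4-word vocabulary
p = [0.5, 0.3, 0.1, 0.1]
q = [0.4, 0.4, 0.1, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # note the asymmetry
```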
10. Incremental Clustering Methods
- Given n data items D = {D1, D2, ..., Di, ..., Dn}
- And given a minimal similarity threshold Smin
- Cluster data incrementally as follows (a runnable sketch follows this slide)
- Procedure SingleLink(D), a.k.a. closest-link
- Let CLUSTERS = {{D1}}
- For i = 2 to n
- Let Dc = Argmax_{j < i} Sim(Di, Dj)
- If Sim(Di, Dc) > Smin, add Di to Dc's cluster
- Else Append(CLUSTERS, {Di}) as a new cluster
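A minimal runnable sketch of this single-pass, closest-link procedure, assuming numerical tuples and cosine similarity; the data and threshold are illustrative:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def single_link(data, s_min):
    """Closest-link incremental clustering: attach each item to the cluster
    containing its most similar already-seen item, if that similarity > s_min."""
    clusters = [[0]]                       # D1 starts in its own cluster
    for i in range(1, len(data)):
        # most similar previously seen item
        c = max(range(i), key=lambda j: cosine(data[i], data[j]))
        if cosine(data[i], data[c]) > s_min:
            next(cl for cl in clusters if c in cl).append(i)
        else:
            clusters.append([i])           # otherwise start a new cluster
    return clusters

points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.5, 0.5]]
print(single_link(points, s_min=0.95))    # [[0, 1], [2, 3], [4]]
```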
11. Incremental Clustering via Closest-Link Method
Attach each point to the cluster containing its closest point.
Danger: snake-like (chained) clusters
12. Incremental Clustering (cont.)
- Procedure AverageLink(D) (a sketch follows this slide)
- Let CLUSTERS = {{D1}}
- For i = 2 to n
- Let C* = Argmax_{C in CLUSTERS} Sim(Di, centroid(C))
- If Sim(Di, centroid(C*)) > Smin, add Di to cluster C*
- Else Append(CLUSTERS, {Di}) as a new cluster
- Observations
- Single pass over the data → easy to cluster new data incrementally
- Requires an arbitrary Smin threshold
- O(nC) time (C = number of clusters), O(n) space
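A minimal runnable sketch of the average-link variant, again assuming numerical tuples and cosine similarity; helper names and data are illustrative:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def average_link(data, s_min):
    """Single pass: compare each new item to cluster centroids rather than to
    individual points, and attach it where centroid similarity exceeds s_min."""
    def centroid(cluster):
        return [sum(data[i][d] for i in cluster) / len(cluster)
                for d in range(len(data[0]))]

    clusters = [[0]]                                   # D1 starts its own cluster
    for i in range(1, len(data)):
        best = max(clusters, key=lambda cl: cosine(data[i], centroid(cl)))
        if cosine(data[i], centroid(best)) > s_min:
            best.append(i)                             # join the closest cluster
        else:
            clusters.append([i])                       # otherwise open a new one
    return clusters

points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.5, 0.5]]
print(average_link(points, s_min=0.95))                # [[0, 1], [2, 3], [4]]
```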
13. K-Means Clustering
- 1. Select k seeds s.t. d(s_i, s_j) > d_min for all pairs
- 2. Assign points to clusters by minimum distance: Cluster(p_i) = Argmin_{s_j ∈ {s_1, ..., s_k}} d(p_i, s_j)
- 3. Compute new cluster centroids
- 4. Reassign points to clusters (as in step 2)
- 5. Iterate until no points change clusters (a runnable sketch of steps 1-5 follows below)
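A minimal runnable sketch of steps 1-5, assuming 2-D points and Euclidean distance; seeding simply takes the first k points rather than enforcing d_min, which is a simplification:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, max_iter=100):
    """Steps 1-5 from the slide: seed, assign by min distance, recompute
    centroids, reassign, and iterate until assignments stop changing."""
    centroids = [list(p) for p in points[:k]]          # step 1 (simplified seeding)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: dist(p, centroids[j]))
                          for p in points]             # steps 2 / 4
        if new_assignment == assignment:               # step 5: stability reached
            break
        assignment = new_assignment
        for j in range(k):                             # step 3: new centroids
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignment, centroids

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(k_means(pts, k=2))
```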
14. K-Means Clustering: Initial Data Points
Step 1: Select k random seeds s.t. d(s_i, s_j) > d_min
Initial seeds (if k = 3)
15. K-Means Clustering: First-Pass Clusters
Step 2: Assign points to clusters by minimum distance: Cluster(p_i) = Argmin_{s_j ∈ {s_1, ..., s_k}} d(p_i, s_j)
Initial seeds
16. K-Means Clustering: Seeds → Centroids
Step 3: Compute new cluster centroids
New centroids
17. K-Means Clustering: Second-Pass Clusters
Step 4: Recompute Cluster(p_i) = Argmin_{c_j ∈ {c_1, ..., c_k}} d(p_i, c_j)
Centroids
18. K-Means Clustering: Iterate Until Stability
New centroids
Steps 5 to N: Iterate steps 3 and 4 until no point changes cluster
19. Document Clustering Techniques
- Example: group documents based on similarity
- Similarity matrix
- Thresholding at a similarity value of 0.9 (see the sketch after this slide) yields:
- the complete graph C1 = {1, 4, 5}, namely Complete Linkage
- the connected graph C2 = {1, 4, 5, 6}, namely Single Linkage
- For clustering we need three things:
- A similarity measure for pairwise comparison between documents
- A clustering criterion (Complete-Link, Single-Link, ...)
- A clustering algorithm
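The slide's similarity matrix is not reproduced above, so this sketch uses an illustrative symmetric matrix chosen to give the same outcome: thresholding at 0.9 makes {1, 4, 5} a complete subgraph and {1, 4, 5, 6} a connected one. Document indices, values, and helper names are assumptions:

```python
from itertools import combinations

# Illustrative similarity values for documents 1..6 (upper triangle only).
docs = [1, 2, 3, 4, 5, 6]
sim = {
    (1, 4): 0.95, (1, 5): 0.92, (4, 5): 0.91,   # mutually similar: complete subgraph
    (5, 6): 0.90,                               # 6 attaches only through 5
    (1, 2): 0.40, (2, 3): 0.35, (3, 6): 0.20,   # everything else below threshold
}

def edges(threshold):
    """Document pairs whose similarity meets the threshold."""
    return {pair for pair, s in sim.items() if s >= threshold}

def connected_components(nodes, edge_set):
    """Single-link clusters = connected components of the thresholded graph."""
    comps = [{n} for n in nodes]
    for a, b in edge_set:
        ca = next(c for c in comps if a in c)
        cb = next(c for c in comps if b in c)
        if ca is not cb:
            ca |= cb
            comps.remove(cb)
    return comps

e = edges(0.9)
print(connected_components(docs, e))            # {1, 4, 5, 6} plus singletons
# Complete-link check: is {1, 4, 5} fully connected at this threshold?
print(all(tuple(sorted(p)) in e for p in combinations([1, 4, 5], 2)))   # True
```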
20. Document Clustering Techniques
- Clustering Criterion: Alternative Linkages
- Single-link ("nearest neighbor")
- Complete-link
- Average-link ("group average clustering", or GAC)
21. Hierarchical Agglomerative Clustering Methods
- Generic Agglomerative Procedure (Salton '89), which results in nested clusters via iteration (a sketch follows this slide):
- Compute all pairwise document-document similarity coefficients
- Place each of the n documents into a class of its own
- Merge the two most similar clusters into one:
  - replace the two clusters by the new cluster
  - recompute inter-cluster similarity scores w.r.t. the new cluster
- Repeat the above step until only k clusters are left (note that k could be 1)
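A minimal sketch of the generic agglomerative procedure, assuming numerical document vectors, cosine similarity, and average pairwise (GAC-style) inter-cluster scores; the slide leaves the linkage choice open, so these are assumptions:

```python
import math
from itertools import combinations

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def hac(docs, k):
    """Start with one cluster per document, then repeatedly merge the two most
    similar clusters (average pairwise similarity) until k clusters remain."""
    clusters = [[i] for i in range(len(docs))]      # each document in its own class

    def cluster_sim(a, b):
        pairs = [(i, j) for i in a for j in b]
        return sum(cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = max(combinations(clusters, 2), key=lambda ab: cluster_sim(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)                      # replace the two by their merge
    return clusters

docs = [[1, 0], [0.9, 0.2], [0, 1], [0.1, 0.8], [0.5, 0.5]]
print(hac(docs, k=2))
```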
22. Group Agglomerative Clustering
(Figure: step-by-step agglomeration over nine numbered data points; not reproduced.)
23. Hierarchical Agglomerative Clustering Methods (cont.)
- Heuristic Approaches to Speedy Clustering
- Reallocation methods with k selected seeds (O(kn) time)
  - k is the desired number of clusters; n is the number of documents
- Buckshot: random sampling (of √(kn) documents) plus global HAC
- Fractionation: divide and conquer
24. Creating Taxonomies
- Hierarchical Clustering
- GAC trace creates a binary hierarchy
- Incremental-link → hierarchical version:
- 1. Cluster data with a high Smin → 1st hierarchical level
- 2. Decrease Smin (stop at Smin = 0)
- 3. Treat cluster centroids as data tuples and recluster, creating the next level of the hierarchy, then repeat steps 2 and 3
- K-means → hierarchical k-means (a sketch follows this slide):
- 1. Cluster data with a large k
- 2. Decrease k (stop at k = 1)
- 3. Treat cluster centroids as data tuples and recluster, creating the next level of the hierarchy, then repeat steps 2 and 3
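A minimal sketch of the hierarchical k-means recipe: cluster with a large k, treat the resulting centroids as the data for the next level, and recluster with a smaller k; the k schedule, distance function, and data are illustrative assumptions:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means_centroids(points, k, max_iter=50):
    """Plain k-means that returns only the final centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist(p, centroids[j]))].append(p)
        new = [[sum(c) / len(g) for c in zip(*g)] if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids

def hierarchical_k_means(points, k_schedule):
    """Build taxonomy levels: recluster the previous level's centroids with a
    decreasing k (stopping at k = 1, as on the slide)."""
    levels, data = [], points
    for k in k_schedule:                      # e.g. 4, then 2, then 1
        data = k_means_centroids(data, k)
        levels.append(data)
    return levels

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 0), (9, 1), (10, 0)]
print(hierarchical_k_means(pts, k_schedule=[4, 2, 1]))
```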
25. Taxonomies (cont.)
- Postprocess Taxonomies
- Eliminate "no-op" levels
- Agglomerate "skinny" levels
- Label meaningful levels manually or with centroid
summary