Title: Clustering: Unsupervised Learning Methods (15-381)
1. Clustering: Unsupervised Learning Methods (15-381)
- Jaime Carbonell
- 8 April 2003
- OUTLINE
- What is unsupervised learning?
- Similarity computations
- Clustering Algorithms
- Other kinds of unsupervised learning
2. Unsupervised Learning
- Definition of Unsupervised Learning
- Learning useful structure without labeled
classes, optimization criterion, feedback signal,
or any other information beyond the raw data and
grouping principle(s).
3. Unsupervised Learning
- Examples
- Find natural groupings of Xs (X = human languages, stocks, gene sequences, animal species, ...) → a prelude to discovery of underlying properties
- Summarize the news for the past month → cluster first, then report centroids
- Sequence extrapolation: e.g., predict cancer incidence next decade; predict the rise in antibiotic-resistant bacteria
- Methods
- Clustering (n-link, k-means, GAC, ...)
- Taxonomy creation (hierarchical clustering)
- Novelty detection (meaningful outliers)
- Trend detection (extrapolation from multivariate partial derivatives)
4. Similarity Measures in Data Analysis
- General Assumptions
- Each data item is a tuple (vector)
- Values of tuples are nominal, ordinal, or numerical
- Similarity ≈ (Distance)^-1
- Pure Numerical Tuples
- Sim(d_i, d_j) = Σ_k d_{i,k} · d_{j,k}  (inner product)
- Sim(d_i, d_j) = cos(d_i, d_j)  (cosine similarity)
- and many more (see the slide after next)
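A minimal Python sketch of these two numerical measures (inner product and cosine); the vectors and function names are illustrative, not from the lecture:

```python
import math

def dot_similarity(di, dj):
    """Inner-product similarity: sum of coordinate-wise products."""
    return sum(x * y for x, y in zip(di, dj))

def cosine_similarity(di, dj):
    """Cosine of the angle between the tuples (length-normalized dot product)."""
    norm_i = math.sqrt(sum(x * x for x in di))
    norm_j = math.sqrt(sum(y * y for y in dj))
    return dot_similarity(di, dj) / (norm_i * norm_j)

# Example: two 4-dimensional numerical tuples
a = [1.0, 0.0, 2.0, 3.0]
b = [0.5, 1.0, 2.0, 1.0]
print(dot_similarity(a, b), cosine_similarity(a, b))
```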
5. Similarity Measures in Data Analysis
- For Ordinal Values
- E.g. "small," "medium," "large," "X-large"
- Convert to numerical values, assuming a constant step Δ on a normalized [0,1] scale, where max(v) = 1, min(v) = 0, and the other values interpolate
- E.g. "small" = 0, "medium" = 0.33, etc.
- Then, use numerical similarity measures
- Or, use a similarity matrix (see next slide)
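A minimal sketch of the ordinal-to-numeric conversion above, assuming equally spaced values on [0, 1]; the helper name is illustrative:

```python
def ordinal_to_numeric(scale):
    """Map an ordered list of ordinal labels onto [0, 1] with constant spacing."""
    n = len(scale)
    return {label: i / (n - 1) for i, label in enumerate(scale)}

sizes = ["small", "medium", "large", "X-large"]
print(ordinal_to_numeric(sizes))
# {'small': 0.0, 'medium': 0.333..., 'large': 0.666..., 'X-large': 1.0}
```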
6. Similarity Measures (cont.)
- For Nominal Values
- E.g. "Boston", "LA", "Pittsburgh"; or "male", "female"; or "diffuse", "globular", "spiral", "pinwheel"
- Binary rule: if d_{i,k} = d_{j,k}, then sim = 1, else 0
- Or use an underlying semantic property, e.g. Sim(Boston, LA) ∝ dist(Boston, LA)^-1, or Sim(Boston, LA) = 1 - |size(Boston) - size(LA)| / max(size(cities))
- Or, use a similarity matrix
7. Similarity Matrix

            tiny  little  small  medium  large  huge
  tiny       1.0     0.8    0.7     0.5    0.2   0.0
  little             1.0    0.9     0.7    0.3   0.1
  small                     1.0     0.7    0.3   0.2
  medium                            1.0    0.5   0.3
  large                                    1.0   0.8
  huge                                           1.0

- Diagonal must be 1.0
- Monotonicity property must hold
- No linearity (value interpolation) assumed
- Qualitative transitive property must hold
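A minimal sketch of pairwise lookup in such a hand-built matrix, storing only the upper triangle shown above; the dictionary layout is an assumption:

```python
# Upper-triangular similarity table for ordinal size words (values from the slide).
SIM = {
    ("tiny", "tiny"): 1.0, ("tiny", "little"): 0.8, ("tiny", "small"): 0.7,
    ("tiny", "medium"): 0.5, ("tiny", "large"): 0.2, ("tiny", "huge"): 0.0,
    ("little", "little"): 1.0, ("little", "small"): 0.9, ("little", "medium"): 0.7,
    ("little", "large"): 0.3, ("little", "huge"): 0.1,
    ("small", "small"): 1.0, ("small", "medium"): 0.7, ("small", "large"): 0.3,
    ("small", "huge"): 0.2,
    ("medium", "medium"): 1.0, ("medium", "large"): 0.5, ("medium", "huge"): 0.3,
    ("large", "large"): 1.0, ("large", "huge"): 0.8,
    ("huge", "huge"): 1.0,
}

def sim(a, b):
    """Symmetric lookup: try (a, b), then (b, a)."""
    return SIM.get((a, b), SIM.get((b, a)))

print(sim("small", "large"))   # 0.3
print(sim("huge", "little"))   # 0.1 (symmetric lookup)
```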
8. Document Clustering Techniques
- Similarity or Distance Measure: Alternative Choices
- Cosine similarity
- Euclidean distance
- Kernel functions
- Language modeling: P(y | model_x), where x and y are documents
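For contrast with cosine similarity, a minimal sketch of Euclidean distance plus one common kernel choice; the Gaussian (RBF) form and the gamma value are assumptions, since the slide does not name a specific kernel:

```python
import math

def euclidean_distance(x, y):
    """Euclidean (L2) distance between two equal-length document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: similarity decays with squared Euclidean distance."""
    return math.exp(-gamma * euclidean_distance(x, y) ** 2)

doc1 = [0.2, 0.0, 0.5, 0.3]   # toy term-weight vectors
doc2 = [0.1, 0.4, 0.4, 0.1]
print(euclidean_distance(doc1, doc2), rbf_kernel(doc1, doc2))
```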
9. Document Clustering Techniques
- Kullback-Leibler distance ("relative entropy")
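A minimal sketch of the Kullback-Leibler distance between two word distributions; the smoothing constant is an assumption to avoid division by zero:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)); asymmetric and >= 0."""
    return sum(pw * math.log((pw + eps) / (qw + eps)) for pw, qw in zip(p, q))

# Toy unigram distributions over a 4-word vocabulary
p = [0.5, 0.3, 0.1, 0.1]
q = [0.4, 0.4, 0.1, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # note the asymmetry
```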
10. Incremental Clustering Methods
- Given n data items D = {D1, D2, ..., Di, ..., Dn}
- And given a minimal similarity threshold Smin
- Cluster data incrementally as follows (a runnable sketch follows this slide)
- Procedure SingleLink(D), a.k.a. closest-link
- Let CLUSTERS = {{D1}}
- For i = 2 to n
- Let Dc = Argmax_{j < i} Sim(Di, Dj)
- If Sim(Di, Dc) > Smin, add Di to Dc's cluster
- Else Append(CLUSTERS, {Di}) as a new cluster
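A minimal runnable sketch of this single-pass, closest-link procedure, assuming numerical tuples and cosine similarity; the data and threshold are illustrative:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def single_link(data, s_min):
    """Closest-link incremental clustering: attach each item to the cluster
    containing its most similar already-seen item, if that similarity > s_min."""
    clusters = [[0]]                       # D1 starts in its own cluster
    for i in range(1, len(data)):
        # most similar previously seen item
        c = max(range(i), key=lambda j: cosine(data[i], data[j]))
        if cosine(data[i], data[c]) > s_min:
            next(cl for cl in clusters if c in cl).append(i)
        else:
            clusters.append([i])           # otherwise start a new cluster
    return clusters

points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.5, 0.5]]
print(single_link(points, s_min=0.95))    # [[0, 1], [2, 3], [4]]
```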
11. Incremental Clustering via Closest-Link Method
Attach each point to the cluster containing its closest point.
Danger: snake-like (chained) clusters
12. Incremental Clustering (cont.)
- Procedure AverageLink(D) (a sketch follows this slide)
- Let CLUSTERS = {{D1}}
- For i = 2 to n
- Let C* = Argmax_{C in CLUSTERS} Sim(Di, centroid(C))
- If Sim(Di, centroid(C*)) > Smin, add Di to cluster C*
- Else Append(CLUSTERS, {Di}) as a new cluster
- Observations
- Single pass over the data → easy to cluster new data incrementally
- Requires an arbitrary Smin threshold
- O(nC) time (C = number of clusters), O(n) space
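A minimal runnable sketch of the average-link variant, again assuming numerical tuples and cosine similarity; helper names and data are illustrative:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def average_link(data, s_min):
    """Single pass: compare each new item to cluster centroids rather than to
    individual points, and attach it where centroid similarity exceeds s_min."""
    def centroid(cluster):
        return [sum(data[i][d] for i in cluster) / len(cluster)
                for d in range(len(data[0]))]

    clusters = [[0]]                                   # D1 starts its own cluster
    for i in range(1, len(data)):
        best = max(clusters, key=lambda cl: cosine(data[i], centroid(cl)))
        if cosine(data[i], centroid(best)) > s_min:
            best.append(i)                             # join the closest cluster
        else:
            clusters.append([i])                       # otherwise open a new one
    return clusters

points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.5, 0.5]]
print(average_link(points, s_min=0.95))                # [[0, 1], [2, 3], [4]]
```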
13. K-Means Clustering
- 1. Select k seeds s.t. d(s_i, s_j) > d_min for all pairs
- 2. Assign points to clusters by minimum distance: Cluster(p_i) = Argmin_{s_j ∈ {s_1, ..., s_k}} d(p_i, s_j)
- 3. Compute new cluster centroids
- 4. Reassign points to clusters (as in step 2)
- 5. Iterate until no points change clusters (a runnable sketch of steps 1-5 follows below)
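A minimal runnable sketch of steps 1-5, assuming 2-D points and Euclidean distance; seeding simply takes the first k points rather than enforcing d_min, which is a simplification:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, max_iter=100):
    """Steps 1-5 from the slide: seed, assign by min distance, recompute
    centroids, reassign, and iterate until assignments stop changing."""
    centroids = [list(p) for p in points[:k]]          # step 1 (simplified seeding)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: dist(p, centroids[j]))
                          for p in points]             # steps 2 / 4
        if new_assignment == assignment:               # step 5: stability reached
            break
        assignment = new_assignment
        for j in range(k):                             # step 3: new centroids
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignment, centroids

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(k_means(pts, k=2))
```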
14. K-Means Clustering: Initial Data Points
Step 1: Select k random seeds s.t. d(s_i, s_j) > d_min
Initial seeds (if k = 3)
15. K-Means Clustering: First-Pass Clusters
Step 2: Assign points to clusters by minimum distance: Cluster(p_i) = Argmin_{s_j ∈ {s_1, ..., s_k}} d(p_i, s_j)
Initial seeds
16. K-Means Clustering: Seeds → Centroids
Step 3: Compute new cluster centroids
New centroids
17. K-Means Clustering: Second-Pass Clusters
Step 4: Recompute Cluster(p_i) = Argmin_{c_j ∈ {c_1, ..., c_k}} d(p_i, c_j)
Centroids
18. K-Means Clustering: Iterate Until Stability
New centroids
Steps 5 to N: Iterate steps 3 and 4 until no point changes cluster
19. Document Clustering Techniques
- Example: group documents based on similarity
- Similarity matrix
- Thresholding at a similarity value of 0.9 (see the sketch after this slide) yields:
- the complete graph C1 = {1, 4, 5}, namely Complete Linkage
- the connected graph C2 = {1, 4, 5, 6}, namely Single Linkage
- For clustering we need three things:
- A similarity measure for pairwise comparison between documents
- A clustering criterion (Complete-Link, Single-Link, ...)
- A clustering algorithm
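The slide's similarity matrix is not reproduced above, so this sketch uses an illustrative symmetric matrix chosen to give the same outcome: thresholding at 0.9 makes {1, 4, 5} a complete subgraph and {1, 4, 5, 6} a connected one. Document indices, values, and helper names are assumptions:

```python
from itertools import combinations

# Illustrative similarity values for documents 1..6 (upper triangle only).
docs = [1, 2, 3, 4, 5, 6]
sim = {
    (1, 4): 0.95, (1, 5): 0.92, (4, 5): 0.91,   # mutually similar: complete subgraph
    (5, 6): 0.90,                               # 6 attaches only through 5
    (1, 2): 0.40, (2, 3): 0.35, (3, 6): 0.20,   # everything else below threshold
}

def edges(threshold):
    """Document pairs whose similarity meets the threshold."""
    return {pair for pair, s in sim.items() if s >= threshold}

def connected_components(nodes, edge_set):
    """Single-link clusters = connected components of the thresholded graph."""
    comps = [{n} for n in nodes]
    for a, b in edge_set:
        ca = next(c for c in comps if a in c)
        cb = next(c for c in comps if b in c)
        if ca is not cb:
            ca |= cb
            comps.remove(cb)
    return comps

e = edges(0.9)
print(connected_components(docs, e))            # {1, 4, 5, 6} plus singletons
# Complete-link check: is {1, 4, 5} fully connected at this threshold?
print(all(tuple(sorted(p)) in e for p in combinations([1, 4, 5], 2)))   # True
```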
20. Document Clustering Techniques
- Clustering Criterion: Alternative Linkages
- Single-link ("nearest neighbor")
- Complete-link
- Average-link ("group average clustering", or GAC)
21. Hierarchical Agglomerative Clustering Methods
- Generic Agglomerative Procedure (Salton '89), which results in nested clusters via iteration (a sketch follows this slide):
- Compute all pairwise document-document similarity coefficients
- Place each of the n documents into a class of its own
- Merge the two most similar clusters into one:
  - replace the two clusters by the new cluster
  - recompute inter-cluster similarity scores w.r.t. the new cluster
- Repeat the above step until only k clusters are left (note that k could be 1)
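A minimal sketch of the generic agglomerative procedure, assuming numerical document vectors, cosine similarity, and average pairwise (GAC-style) inter-cluster scores; the slide leaves the linkage choice open, so these are assumptions:

```python
import math
from itertools import combinations

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def hac(docs, k):
    """Start with one cluster per document, then repeatedly merge the two most
    similar clusters (average pairwise similarity) until k clusters remain."""
    clusters = [[i] for i in range(len(docs))]      # each document in its own class

    def cluster_sim(a, b):
        pairs = [(i, j) for i in a for j in b]
        return sum(cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = max(combinations(clusters, 2), key=lambda ab: cluster_sim(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)                      # replace the two by their merge
    return clusters

docs = [[1, 0], [0.9, 0.2], [0, 1], [0.1, 0.8], [0.5, 0.5]]
print(hac(docs, k=2))
```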
22. Group Agglomerative Clustering
(Figure: step-by-step agglomeration over nine numbered data points; not reproduced.)
23. Hierarchical Agglomerative Clustering Methods (cont.)
- Heuristic Approaches to Speedy Clustering
- Reallocation methods with k selected seeds (O(kn) time)
  - k is the desired number of clusters; n is the number of documents
- Buckshot: random sampling (of √(kn) documents) plus global HAC
- Fractionation: divide and conquer
24. Creating Taxonomies
- Hierarchical Clustering
- GAC trace creates a binary hierarchy
- Incremental-link → hierarchical version:
- 1. Cluster data with a high Smin → 1st hierarchical level
- 2. Decrease Smin (stop at Smin = 0)
- 3. Treat cluster centroids as data tuples and recluster, creating the next level of the hierarchy, then repeat steps 2 and 3
- K-means → hierarchical k-means (a sketch follows this slide):
- 1. Cluster data with a large k
- 2. Decrease k (stop at k = 1)
- 3. Treat cluster centroids as data tuples and recluster, creating the next level of the hierarchy, then repeat steps 2 and 3
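A minimal sketch of the hierarchical k-means recipe: cluster with a large k, treat the resulting centroids as the data for the next level, and recluster with a smaller k; the k schedule, distance function, and data are illustrative assumptions:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means_centroids(points, k, max_iter=50):
    """Plain k-means that returns only the final centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist(p, centroids[j]))].append(p)
        new = [[sum(c) / len(g) for c in zip(*g)] if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids

def hierarchical_k_means(points, k_schedule):
    """Build taxonomy levels: recluster the previous level's centroids with a
    decreasing k (stopping at k = 1, as on the slide)."""
    levels, data = [], points
    for k in k_schedule:                      # e.g. 4, then 2, then 1
        data = k_means_centroids(data, k)
        levels.append(data)
    return levels

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 0), (9, 1), (10, 0)]
print(hierarchical_k_means(pts, k_schedule=[4, 2, 1]))
```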
25. Taxonomies (cont.)
- Postprocess Taxonomies
- Eliminate "no-op" levels
- Agglomerate "skinny" levels
- Label meaningful levels manually or with centroid
summary