1
Clustering: Unsupervised Learning Methods (15-381)
  • Jaime Carbonell
  • 8 April 2003
  • OUTLINE
  • What is unsupervised learning?
  • Similarity computations
  • Clustering Algorithms
  • Other kinds of unsupervised learning

2
Unsupervised Learning
  • Definition of Unsupervised Learning
  • Learning useful structure without labeled
    classes, optimization criterion, feedback signal,
    or any other information beyond the raw data and
    grouping principle(s).

3
Unsupervised Learning
  • Examples
  • Find natural groupings of Xs (X = human
    languages, stocks, gene sequences, animal
    species, ...) → a prelude to discovery of
    underlying properties
  • Summarize the news for the past month →
    cluster first, then report centroids
  • Sequence extrapolation, e.g., predict cancer
    incidence next decade; predict the rise in
    antibiotic-resistant bacteria
  • Methods
  • Clustering (n-link, k-means, GAC, ...)
  • Taxonomy creation (hierarchical clustering)
  • Novelty detection (meaningful outliers)
  • Trend detection (extrapolation from multivariate
    partial derivatives)

4
Similarity Measures in Data Analysis
  • General Assumptions
  • Each data item is a tuple (vector)
  • Values of tuples are nominal, ordinal or
    numerical
  • Similarity = (Distance)^-1
  • Pure Numerical Tuples
  • Sim(di, dj) = Σk di,k · dj,k (dot product)
  • Sim(di, dj) = cos(di, dj) (cosine; sketch after this list)
  • and many more (slide after next)
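
As a concrete illustration of the two measures above, here is a
minimal NumPy sketch (function and variable names are my own, not
from the slides):

  import numpy as np

  def dot_sim(di, dj):
      # Dot-product similarity: sum over k of di,k * dj,k
      return float(np.dot(di, dj))

  def cos_sim(di, dj):
      # Cosine similarity: dot product normalized by vector lengths
      return float(np.dot(di, dj) / (np.linalg.norm(di) * np.linalg.norm(dj)))

  d1, d2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
  print(dot_sim(d1, d2))  # 28.0
  print(cos_sim(d1, d2))  # 1.0 (parallel vectors)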

5
Similarity Measures in Data Analysis
  • For Ordinal Values
  • E.g. "small," "medium," "large," "X-large"
  • Convert to numerical values assuming a constant Δ
    on a normalized [0, 1] scale, where max(v) = 1,
    min(v) = 0, and the others interpolate (see the
    sketch after this list)
  • E.g., "small" = 0, "medium" = 0.33, etc.
  • Then, use numerical similarity measures
  • Or, use similarity matrix (see next slide)
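
A small sketch of this conversion, assuming evenly spaced ordinal
levels (the helper name is hypothetical):

  def ordinal_to_numeric(levels):
      # Map ordered labels onto [0, 1] with a constant step:
      # first level -> 0.0, last level -> 1.0, others interpolated
      step = 1.0 / (len(levels) - 1)
      return {v: round(i * step, 2) for i, v in enumerate(levels)}

  print(ordinal_to_numeric(["small", "medium", "large", "X-large"]))
  # {'small': 0.0, 'medium': 0.33, 'large': 0.67, 'X-large': 1.0}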

6
Similarity Measures (cont.)
  • For Nominal Values
  • E.g. "Boston", "LA", "Pittsburgh", or "male",
    "female", or "diffuse", "globular", "spiral",
    "pinwheel"
  • Binary rule: if di,k = dj,k, then sim = 1, else 0
  • Use an underlying semantic property, e.g.,
    Sim(Boston, LA) = dist(Boston, LA)^-1, or
    Sim(Boston, LA) = (size(Boston) - size(LA)) / max(size(cities))
  • Or, use a similarity matrix

7
Similarity Matrix
          tiny    little  small   medium  large   huge
  tiny    1.0     0.8     0.7     0.5     0.2     0.0
  little          1.0     0.9     0.7     0.3     0.1
  small                   1.0     0.7     0.3     0.2
  medium                          1.0     0.5     0.3
  large                                   1.0     0.8
  huge                                            1.0
  • Diagonal must be 1.0
  • Monotonicity property must hold
  • No linearity (value interpolation) assumed
  • Qualitative transitivity property must hold

8
Document Clustering Techniques
  • Similarity or Distance Measure: Alternative
    Choices
  • Cosine similarity
  • Euclidean distance
  • Kernel functions
  • Language modeling: P(y | model of x), where x and
    y are documents

9
Document Clustering Techniques
  • Kullback-Leibler distance ("relative entropy"):
    KL(p || q) = Σx p(x) log( p(x) / q(x) )
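
A minimal sketch of this distance between the word distributions of
two documents (the add-one smoothing is my own choice; the slides do
not specify one):

  import math
  from collections import Counter

  def kl_divergence(p_counts, q_counts, vocab):
      # KL(p || q) = sum over x of p(x) * log(p(x) / q(x)),
      # with add-one smoothing so no probability is zero
      p_total = sum(p_counts[w] + 1 for w in vocab)
      q_total = sum(q_counts[w] + 1 for w in vocab)
      kl = 0.0
      for w in vocab:
          p = (p_counts[w] + 1) / p_total
          q = (q_counts[w] + 1) / q_total
          kl += p * math.log(p / q)
      return kl

  x = Counter("the cat sat on the mat".split())
  y = Counter("the dog sat on the log".split())
  vocab = set(x) | set(y)
  print(kl_divergence(x, y, vocab))  # > 0; 0 only if distributions match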

10
Incremental Clustering Methods
  • Given n data items D = {D1, D2, ..., Di, ..., Dn}
  • And given a minimal similarity threshold Smin
  • Cluster the data incrementally as follows
  • Procedure Singlelink(D), a.k.a. closest-link
    (runnable sketch below):
  • Let CLUSTERS = {{D1}}
  • For i = 2 to n
  •   Let Dc = Argmax_{j < i} Sim(Di, Dj)
  •   If Sim(Di, Dc) > Smin, add Di to Dc's cluster
  •   Else Append(CLUSTERS, {Di}) as a new cluster
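
A runnable sketch of the Singlelink procedure above, with a toy 1-D
dataset and an inverse-distance similarity (both placeholders of my
own):

  def single_link(data, sim, s_min):
      # Incremental closest-link clustering: attach each new item to the
      # cluster of its most similar earlier item if that similarity
      # clears s_min; otherwise start a new cluster.
      clusters = [[data[0]]]
      for i in range(1, len(data)):
          d_i = data[i]
          d_c = max(data[:i], key=lambda d_j: sim(d_i, d_j))  # closest link
          if sim(d_i, d_c) > s_min:
              next(c for c in clusters if d_c in c).append(d_i)
          else:
              clusters.append([d_i])
      return clusters

  points = [0.0, 0.1, 0.2, 5.0, 5.1]
  sim = lambda a, b: 1.0 / (abs(a - b) + 1e-9)  # inverse distance
  print(single_link(points, sim, s_min=2.0))
  # [[0.0, 0.1, 0.2], [5.0, 5.1]]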

11
Incremental Clustering via Closest-link Method
Attach to cluster containing closest point.
Danger: snake-like clusters
12
Incremental Clustering (cont.)
  • Procedure Averagelink(D) (sketch below):
  • Let CLUSTERS = {{D1}}
  • For i = 2 to n
  •   Let C* = Argmax_{C in CLUSTERS} Sim(Di, centroid(C))
  •   If Sim(Di, centroid(C*)) > Smin, add Di to cluster C*
  •   Else Append(CLUSTERS, {Di}) as a new cluster
  • Observations
  • Single pass over the data → easy to cluster new
    data incrementally
  • Requires an arbitrary Smin threshold
  • O(nC) time (C = number of clusters), O(n) space
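
A matching sketch of Averagelink for numeric vectors, keeping a
running centroid per cluster (helper names are my own):

  import numpy as np

  def average_link(data, sim, s_min):
      # Like single-link, but compare each new item to cluster
      # centroids rather than to individual points.
      clusters = [[data[0]]]
      centroids = [np.array(data[0], dtype=float)]
      for d_i in data[1:]:
          best = max(range(len(clusters)), key=lambda c: sim(d_i, centroids[c]))
          if sim(d_i, centroids[best]) > s_min:
              clusters[best].append(d_i)
              centroids[best] = np.mean(clusters[best], axis=0)  # update centroid
          else:
              clusters.append([d_i])
              centroids.append(np.array(d_i, dtype=float))
      return clusters

  points = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
  sim = lambda a, b: 1.0 / (np.linalg.norm(np.array(a) - b) + 1e-9)
  print(average_link(points, sim, s_min=2.0))
  # [[[0.0, 0.0], [0.2, 0.0]], [[5.0, 5.0], [5.2, 5.0]]]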

13
K-Means Clustering
  • 1. Select k seeds s.t. d(si, sj) > dmin
  • 2. Assign points to clusters by min distance:
    Cluster(pi) = Argmin_{sj ∈ {s1,...,sk}} d(pi, sj)
  • 3. Compute new cluster centroids
  • 4. Reassign points to clusters (as in 2 above)
  • 5. Iterate until no points change clusters
    (runnable sketch after this list)
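
A compact NumPy sketch of these five steps. Random seeding from the
data stands in for the d > dmin check, and it assumes no cluster goes
empty:

  import numpy as np

  def k_means(points, k, rng=None):
      # Step 1: pick k distinct data points as seeds
      rng = rng or np.random.default_rng(0)
      centroids = points[rng.choice(len(points), size=k, replace=False)]
      labels = None
      while True:
          # Steps 2 and 4: assign each point to its nearest centroid
          dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
          new_labels = dists.argmin(axis=1)
          # Step 5: stop once no point changes cluster
          if labels is not None and np.array_equal(new_labels, labels):
              break
          labels = new_labels
          # Step 3: recompute each centroid as its cluster's mean
          # (assumes no cluster goes empty)
          centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
      return labels, centroids

  pts = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]], dtype=float)
  labels, cents = k_means(pts, k=2)
  print(labels)  # cluster index per point, e.g. [0 0 1 1 0] (seed-dependent)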

14
K-Means Clustering: Initial Data Points
Step 1: Select k random seeds s.t. d(si, sj) > dmin
Initial seeds (if k = 3)
15
K-Means Clustering: First-Pass Clusters
Step 2: Assign points to clusters by min distance:
Cluster(pi) = Argmin_{sj ∈ {s1,...,sk}} d(pi, sj)
Initial seeds
16
K-Means Clustering: Seeds → Centroids
Step 3: Compute new cluster centroids
New centroids
17
K-Means Clustering: Second-Pass Clusters
Step 4: Recompute Cluster(pi) = Argmin_{cj ∈ {c1,...,ck}} d(pi, cj)
Centroids
18
K-Means Clustering: Iterate Until Stability
New centroids
Steps 5 to N: Iterate steps 3 and 4 until no point
changes cluster
19
Document Clustering Techniques
  • Example: group documents based on similarity
  • Similarity matrix:
  • Thresholding at a similarity value of 0.9 yields
  • complete graph C1 = {1, 4, 5}, namely complete
    linkage
  • connected graph C2 = {1, 4, 5, 6}, namely single
    linkage (sketch after this list)
  • For clustering we need three things
  • A similarity measure for pairwise comparison
    between documents
  • A clustering criterion (complete-link,
    single-link, ...)
  • A clustering algorithm
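
A sketch of the single-linkage reading: threshold the similarity
matrix and take connected components. The matrix values below are
invented to reproduce the example, with {1, 4, 5} a complete subgraph
and document 6 attached only through 5:

  def connected_components(adj):
      # Single-linkage clusters = connected components of the
      # thresholded similarity graph (depth-first search)
      n = len(adj)
      seen, components = set(), []
      for start in range(n):
          if start in seen:
              continue
          stack, comp = [start], []
          while stack:
              u = stack.pop()
              if u in seen:
                  continue
              seen.add(u)
              comp.append(u)
              stack.extend(v for v in range(n) if adj[u][v] and v not in seen)
          components.append(sorted(comp))
      return components

  docs = [1, 2, 3, 4, 5, 6]
  S = {(1, 4): 0.95, (1, 5): 0.92, (4, 5): 0.91, (5, 6): 0.93, (2, 3): 0.85}
  adj = [[S.get((min(a, b), max(a, b)), 0.0) >= 0.9 for b in docs] for a in docs]
  print([[docs[i] for i in comp] for comp in connected_components(adj)])
  # [[1, 4, 5, 6], [2], [3]]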

20
Document Clustering Techniques
  • Clustering Criterion: Alternative Linkages
  • Single-link ("nearest neighbor")
  • Complete-link
  • Average-link ("group average clustering", or GAC)

21
Hierarchical Agglomerative Clustering Methods
  • Generic Agglomerative Procedure (Salton '89),
    which yields nested clusters via iteration
    (runnable sketch below):
  • 1. Compute all pairwise document-document
    similarity coefficients
  • 2. Place each of the n documents into a class of
    its own
  • 3. Merge the two most similar clusters into one
  •   - replace the two clusters by the new cluster
  •   - recompute inter-cluster similarity scores
      w.r.t. the new cluster
  • 4. Repeat step 3 until only k clusters remain
    (note k could be 1).
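
A small sketch of this generic procedure, using average-link as the
merge criterion (one of the linkages from the previous slide; names
and data are my own):

  import numpy as np
  from itertools import combinations

  def hac(points, k, sim):
      # Start with each point in its own cluster, then repeatedly
      # merge the two most similar clusters until only k remain.
      clusters = [[i] for i in range(len(points))]
      def cluster_sim(a, b):
          # Average-link: mean pairwise similarity between clusters
          return np.mean([sim(points[i], points[j]) for i in a for j in b])
      while len(clusters) > k:
          a, b = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
          clusters.remove(a)
          clusters.remove(b)
          clusters.append(a + b)  # replace the two clusters by their merge
      return clusters

  pts = [0.0, 0.1, 0.9, 1.0, 5.0]
  sim = lambda x, y: -abs(x - y)  # higher = more similar
  print(hac(pts, k=2, sim=sim))
  # [[4], [0, 1, 2, 3]]  (indices into pts)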

22
Group Agglomerative Clustering
[Figure: agglomerative grouping of nine numbered data points]
23
Hierarchical Agglomerative Clustering Methods
(cont.)
  • Heuristic Approaches to Speedy Clustering
  • Reallocation methods with k selected seeds
    (O(kn) time)
  •   - k is the desired number of clusters; n is the
      number of documents
  • Buckshot: random sampling (of √(kn) documents)
    plus global HAC
  • Fractionation: divide and conquer

24
Creating Taxonomies
  • Hierarchical Clustering
  • GAC trace creates a binary hierarchy
  • Incremental-link → hierarchical version:
  • 1. Cluster data with high Smin → 1st hierarchical
    level
  • 2. Decrease Smin (stop at Smin = 0)
  • 3. Treat cluster centroids as data tuples and
    recluster, creating the next level of hierarchy;
    then repeat steps 2 and 3.
  • K-means → hierarchical k-means (see the sketch
    after this list):
  • 1. Cluster data with large k
  • 2. Decrease k (stop at k = 1)
  • 3. Treat cluster centroids as data tuples and
    recluster, creating the next level of hierarchy;
    then repeat steps 2 and 3.
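
A minimal sketch of the hierarchical k-means recipe, reusing the
k_means sketch from slide 13 (the shrinking k schedule is my own
assumption for illustration):

  import numpy as np

  def hierarchical_k_means(points, k_schedule):
      # Bottom-up taxonomy: cluster with the largest k, then treat the
      # centroids as data tuples and recluster with a smaller k,
      # ending at k = 1 (the root). Assumes no cluster goes empty.
      levels, data = [], points
      for k in k_schedule:                      # e.g., [4, 2, 1]
          labels, centroids = k_means(data, k)  # k_means from slide 13
          levels.append((labels, centroids))
          data = centroids                      # centroids feed the next level
      return levels

  pts = np.random.default_rng(1).normal(size=(40, 2))
  taxonomy = hierarchical_k_means(pts, k_schedule=[4, 2, 1])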

25
Taxonomies (cont.)
  • Postprocess Taxonomies
  • Eliminate "no-op" levels
  • Agglomerate "skinny" levels
  • Label meaningful levels manually or with centroid
    summary