Clustering - PowerPoint PPT Presentation

Slides: 49
Provided by: JFH
Learn more at: http://mlrg.cs.brown.edu
Transcript and Presenter's Notes

1
Clustering
2
Example application: posterizing
  • Lots of pixels, many colors
  • Want to pick just a few colors
  • Solution: treat RGB triples as points in $R^3$ and
    cluster
  • Use center points of clusters as new colors

3
Posterization problems
  • The distance in RGB space,
    $\sqrt{(r_1 - r_2)^2 + (g_1 - g_2)^2 + (b_1 - b_2)^2}$, is not
    perceptually uniform
  • A distance of 0.2 in one area (black-to-grey
    distance, for example) may seem much larger than
    in another (yellow-to-yellow/green)
  • Approach ignores pixel adjacency in the image
  • One solution: cluster in $R^5$, on $(x, y, r, g, b)$
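
The RGB distance and the $R^5$ embedding above can be sketched in a few lines of Python. This is only an illustration: the function names and the `spatial_weight` parameter (how strongly pixel position counts relative to color) are assumptions, not part of the slides.

```python
import math

def rgb_distance(c1, c2):
    """Euclidean distance between two (r, g, b) triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def embed_pixels(image, spatial_weight=1.0):
    """Turn an image (list of rows of (r, g, b) triples) into points in R^5.

    spatial_weight is a hypothetical knob: larger values make nearby
    pixels more likely to land in the same cluster.
    """
    points = []
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            points.append((spatial_weight * x, spatial_weight * y, r, g, b))
    return points

# Black to grey and black to white, with channels in [0, 1].
print(rgb_distance((0, 0, 0), (0.5, 0.5, 0.5)))
print(rgb_distance((0, 0, 0), (1, 1, 1)))
```

The distance function treats all three channels alike, which is exactly why it is not perceptually uniform.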

4
Tissue Classification
5
Problems
  • Not really a clustering problem, although
    similar tissues tend to be clustered
  • Fundamentally a mixture model: a pixel contains
    both bone and soft tissue, for example.

6
Friendship nets: Facebook
  • Given Facebook data
  • Construct clusters of friends

7
Problems
  • No coordinates
  • Information about closeness is 0/1 (have you
    friended me yet???)

8
Netflix
9
Approach
  • $N$ = number of movies Netflix has
  • My coordinates:
  • $(1, 0, 0, 1, 1, 0, \dots)$, where $x_i = 1$ means I liked movie
    $i$
  • Now finding clusters lets Netflix make
    recommendations

10
Problems
  • My coordinates really look like
    this: $(?, ?, 0, ?, ?, ?, ?, ?, 1, 1, ?, ?, ?)$, with $?$
    meaning "never seen the movie, so don't know."
  • Even if we did know all my coordinates, the
    problem lies in $\{0,1\}^N$ rather than $R^N$: is our
    Euclidean intuition really appropriate?
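
One possible way to compare two viewers despite the unknown entries is to look only at movies both have rated. The sketch below is an assumption made for illustration, not something the slides propose: unknown entries are encoded as `None`, and the "distance" is the fraction of co-rated movies on which the two viewers disagree.

```python
def disagreement(a, b):
    """Fraction of co-rated movies on which two viewers disagree.

    a, b: equal-length lists holding 1 (liked), 0 (did not like),
    or None (never seen the movie). Returns None when the two
    viewers have no rated movie in common.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)

me = [None, None, 0, None, None, None, None, None, 1, 1, None, None, None]
you = [1, None, 0, None, 0, None, None, None, 1, 0, None, None, 1]
print(disagreement(me, you))  # three co-rated movies, one disagreement
```

With so few shared ratings the value is very noisy, which is another face of the problem described above.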

11
Document classification
  • Could represent a document by a vector $(0, 1, 0, \dots)$
    representing whether each English word (aardvark,
    and, anchovy, ...) occurs in the doc
  • Clusters represent similar topics

12
Problems
  • Non-isotropic distance: two documents having
    "the" in common are far less likely to be similar
    than two with "aardvark" in common
  • Really need a distance metric that compensates
    for this before applying clustering
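
One common way to compensate is to rescale each word's axis by how rare the word is before measuring distance. The sketch below uses inverse document frequency (IDF) weights; this choice, the function names, and the toy corpus are all assumptions for illustration, since the slides do not name a specific metric.

```python
import math

def idf_weights(docs):
    """Map each word to log(N / number of documents containing it)."""
    n = len(docs)
    vocab = set().union(*docs)
    return {w: math.log(n / sum(w in d for d in docs)) for w in vocab}

def weighted_distance(d1, d2, weights):
    """Euclidean distance between IDF-scaled word-occurrence vectors.

    The two vectors differ only on words that occur in exactly one
    of the documents, so only those words contribute.
    """
    return math.sqrt(sum(weights[w] ** 2 for w in d1 ^ d2))

a = {"the", "aardvark", "eats"}
b = {"aardvark", "eats"}          # differs from a only on "the"
c = {"the", "eats"}               # differs from a only on "aardvark"
d = {"the", "stock", "falls"}
w = idf_weights([a, b, c, d])
print(weighted_distance(a, b, w), weighted_distance(a, c, w))
```

Disagreeing about a common word such as "the" now costs less than disagreeing about a rare word such as "aardvark".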

13
Conclusion
  • Clustering seems to have a lot of interesting
    applications
  • But it's important, before starting, to have an
    embedding of your data in $R^N$ where
  • distance in $R^N$ is really related to distance
    between items (at least for small distances!)

14
A first clustering algorithm
  • Assuming data that's really pretty well
    clustered, how do you find the clusters?
  • Intro to K-means
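
As a rough illustration of K-means, here is a minimal sketch of the usual alternating scheme (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. The farthest-first initialization is an assumption chosen to keep the example deterministic; the slides do not say how centers are initialized.

```python
import math

def kmeans(points, k, iterations=100):
    """Return (labels, centers) for points given as equal-length tuples."""
    # Farthest-first initialization (an assumption, not from the slides):
    # start at the first point, then repeatedly add the point that is
    # farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Posterizing in miniature: three dark and three bright RGB triples.
colors = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (1, 1, 1), (0.9, 1, 1), (1, 0.9, 1)]
labels, palette = kmeans(colors, 2)
print(labels)
print(palette)
```

The returned centers play the role of the reduced palette in the posterizing example. On data that is not well clustered, the result depends heavily on the starting centers.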

15
Distance vs. Cluster distance
16
(No Transcript)
17
(No Transcript)
18
Conclusion
  • We might want to use a distance, D, to indicate
    how much two things are in the same cluster
  • Tempting to write $D(p_i, p_k)$
  • Really needs to be $D(p_1, p_2, \dots, p_i, p_k)$
  • One view of clustering is that we want to use
    Euclidean distance, $d$, to bootstrap discovery of
    cluster distance, $D$.
  • K-means works when d and D are very similar.

19
How are distance and cluster distance related?
  • "If you're a friend of my friend, you're my
    friend"
  • Suggests a graph-theory approach: find connected
    components in a graph
  • Edges in graph when two points are close enough

20
Problems
  • Edges in graph when two points are close enough
  • Very data-sensitive: a small perturbation of
    data can join two clusters
  • These points are much closer than
    these
  • "If you're friends with lots of my friends,
    you're my friend."

21
Leads to study of how connected are nodes in a
graph?
  • The travelling token problem was an intro to that
    question

22
Solution to travelling token
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
  • and so on
  • Make a VERY long movie
  • Play it VERY rapidly
  • How pink each node appears tells you what
    fraction of the time the token spends there.

33
Insight
34
Create a shorter movie!
  • Two tokens (possibly at same spot) in each frame

35
Doubled movie
36
(No Transcript)
37
(No Transcript)
38
(No Transcript)
39
Apply repeatedly
  • Limiting version of the movie
  • Has a huge number of moving tokens
  • every frame must look the same!

40
Every frame looks the same
  • If $u_i$ = number of tokens at node $i$
  • $d_i$ = degree of node $i$
  • $j$ is a neighbor of $i$
  • Then
  • $i$ sends $j$ a quantity $u_i / d_i$ of tokens
  • Every frame looks the same if $j$ sends that many
    back to $i$.
  • That happens exactly if $u_k / d_k$ is constant
  • Population of tokens at a node is proportional to
    the node's degree!
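
The token-passing rule can be checked numerically. The sketch below repeatedly lets every node split its tokens equally among its neighbors. The example graph (a triangle with one extra node attached) is an assumption chosen because it is connected and not bipartite; on a bipartite graph the plain rule oscillates instead of settling.

```python
def diffuse(adjacency, tokens, steps):
    """Apply the rule 'node i sends u_i / d_i tokens to each neighbor'
    for the given number of steps and return the token counts."""
    degrees = [len(nbrs) for nbrs in adjacency]
    for _ in range(steps):
        new_tokens = [0.0] * len(tokens)
        for i, nbrs in enumerate(adjacency):
            share = tokens[i] / degrees[i]
            for j in nbrs:
                new_tokens[j] += share
        tokens = new_tokens
    return tokens

# Nodes 0, 1, 2 form a triangle; node 3 hangs off node 2.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2]]
degrees = [len(nbrs) for nbrs in adjacency]   # [2, 2, 3, 1]
final = diffuse(adjacency, [8.0, 0.0, 0.0, 0.0], 100)
print(final)
```

Starting with all 8 tokens on node 0, the counts settle close to the degrees themselves (the degrees sum to 8), in line with the proportionality claim.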

41
Matrix form
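  • Let $A = (a_{ij})$, where $a_{ij}$ is
  • 1 if $i$ and $j$ are connected
  • 0 otherwise
  • Let $D = \mathrm{diag}(\deg \text{ of node } 1, \deg \text{ of node } 2, \dots)$
  • Let $M = D^{-1} A$
  • Then our solution $u$ satisfies
    $M^T u = u$, since node $j$ receives $\sum_i a_{ij} u_i / d_i$ tokens in each frame
  • Insight: things related to graph diffusion,
    neighboring, and clustering are related to
    eigenvalue problems.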

42
Some Graph Terminology and Notation
43
Ng, Jordan, Weiss
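  • $S = \{s_1, s_2, \dots, s_n\}$ in $R^p$. Want to cluster into $k$
    subsets
  • Form $n \times n$ matrix $A$ with
  • $a_{ij} = \exp(-\|s_i - s_j\|^2 / (2\sigma^2))$
  • except $a_{ii} = 0$
  • $D = \mathrm{diag}(\text{row sums of } A)$; $L = D^{-1/2} A D^{-1/2}$
  • Find the $k$ largest (column) eigenvectors of $L$;
    arrange them in an $n \times k$ matrix, $X$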

44
  • X contains eigenvectors as columns
  • Normalize each row of X to get Y.
  • Rows of Y are points on the unit sphere.
  • There's one row per original point
  • Cluster these points on the unit sphere using
    k-means; use these clusters on $S$.

45
Matlab
    function y = njw(s, sigma, k)
    % s: n-by-3 array of points; k: number of clusters.
    n = size(s, 1);
    [X, Y] = meshgrid(1:n, 1:n);
    diffs = s(X, :) - s(Y, :);
    dists = reshape(dot(diffs', diffs'), n, n);  % squared distances
    A = exp(-dists / (2 * sigma^2));
    for i = 1:n  % ICK
        A(i, i) = 0;
    end
    D = sum(A');  % ith entry is the sum of A's ith row
    L = diag(1 ./ (D .^ 0.5)) * A * diag(1 ./ (D .^ 0.5));
    [X, D] = eigs(L, k);  % k largest eigenvalues/eigenvectors of L
    Y = diag((1 ./ dot(X', X')) .^ 0.5) * X;  % normalize rows to unit length
    IDX = kmeans(Y, k, 'emptyaction', 'singleton');
    % IDX is a vector of cluster ids, one per point of S.
    y = IDX;  % return the cluster ids

46
Visualization part
    clf
    hold on
    for t = 1:k
        pts = s(IDX == t, :);  % points assigned to cluster t
        c = hsv2rgb([(t - 1) / k, 0.8, 0.8]);  % one hue per cluster
        plot3(pts(:, 1), pts(:, 2), pts(:, 3), 'o', 'Color', c);
    end
    hold off
    figure(gcf)

47
(No Transcript)
48
(No Transcript)