Clustering - PowerPoint PPT Presentation

Slides: 49

Provided by: JFH

Learn more at: http://mlrg.cs.brown.edu

Transcript and Presenter's Notes

1
Clustering
2
Example application: posterizing

Lots of pixels, many colors
Want to pick just a few colors
Solution: treat RGB triples as points in $\mathbb{R}^3$ and cluster
Use center points of clusters as new colors
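
The posterizing recipe above can be sketched in a few lines of plain Python. This is only an illustration: the toy pixel list, the function names (`dist2`, `kmeans`, `posterize`), the fixed iteration count, and the use of a basic Lloyd-style k-means are all assumptions, not something the slides specify.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two RGB triples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd iteration (an assumed choice); returns k center points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k of the data points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                     # assign each point to its nearest center
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            groups[nearest].append(p)
        for c, g in enumerate(groups):       # move each center to the mean of its group
            if g:                            # keep the old center if the group is empty
                centers[c] = tuple(sum(ch) / len(g) for ch in zip(*g))
    return centers

def posterize(pixels, k):
    """Replace every pixel by the nearest of k cluster centers."""
    centers = kmeans(pixels, k)
    return [min(centers, key=lambda c: dist2(p, c)) for p in pixels]

# Toy "image" (hypothetical data): 50 noisy reds and 50 noisy blues, RGB in [0, 1].
rng = random.Random(1)
pixels = (
    [(0.9 + rng.uniform(-0.05, 0.05), 0.1, 0.1) for _ in range(50)]
    + [(0.1, 0.1, 0.9 + rng.uniform(-0.05, 0.05)) for _ in range(50)]
)
poster = posterize(pixels, k=2)
print(sorted(set(poster)))                   # the two colors that remain
```

A real image would supply `pixels` from its pixel array, and `k` would be the number of poster colors wanted.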

3
Posterization problems

The distance in RGB space, $\sqrt{(r_1 - r_2)^2 + (g_1 - g_2)^2 + (b_1 - b_2)^2}$, is not perceptually uniform
A distance of 0.2 in one region (black to grey, for example) may seem much larger than the same distance in another (yellow to yellow-green)
Approach ignores pixel adjacency in the image
One solution: cluster in $\mathbb{R}^5$, using $(x, y, r, g, b)$

4
Tissue Classification
5
Problems

Not really a clustering problem, although similar tissues tend to be clustered
Fundamentally a mixture model: a single pixel can contain both bone and soft tissue, for example

6
Friendship nets: Facebook

Given Facebook data
Construct clusters of friends

7
Problems

No coordinates
Information about closeness is 0/1 (have you
friended me yet???)

8
Netflix
9
Approach

$N$ = number of movies Netflix has
My coordinates: $(1, 0, 0, 1, 1, 0, \ldots)$, where $x_i = 1$ means I liked movie $i$
Now finding clusters lets Netflix make recommendations

10
Problems

My coordinates really look like this: (?, ?, 0, ?, ?, ?, ?, ?, 1, 1, ?, ?, ...), with ? meaning "never seen the movie, so don't know"
Even if we did know all my coordinates, the problem lies in $\{0,1\}^N$ rather than $\mathbb{R}^N$: is our Euclidean intuition really appropriate?
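
One possible way to compare two viewers despite the unknown entries is to look only at movies both have rated and count disagreements. This is an assumed choice for illustration (a Hamming-style distance restricted to shared coordinates), not something the slides propose; `masked_hamming` and the sample vectors are hypothetical, with `None` standing for an unknown entry.

```python
def masked_hamming(a, b):
    """Fraction of disagreements over coordinates known for both users.

    Returns None when the two users have no rated movie in common.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)

# Hypothetical ratings: 1 = liked, 0 = did not like, None = never seen.
me   = [None, None, 0, None, 1, 1, None]
you  = [1,    None, 0, 0,    1, 0, None]
them = [None, 1, None, None, None, None, 0]

print(masked_hamming(me, you))    # compared on the three movies both rated
print(masked_hamming(me, them))   # no movie in common, so no distance at all
```

The second call shows the practical difficulty: with very sparse ratings, many pairs of viewers have no common coordinates to compare.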

11
Document classification

Could represent a document by a vector $(0, 1, 0, \ldots)$ recording whether each English word (aardvark, and, anchovy, ...) occurs in the doc
Clusters represent similar topics

12
Problems

Non-isotropic distance: two documents having "the" in common are far less likely to be similar than two with "aardvark" in common
Really need a distance metric that compensates for this before applying clustering
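
A common way to compensate, offered here as an assumption since the slides do not name a method, is to weight each word by its inverse document frequency (IDF), so that a word appearing in every document contributes nothing to the distance. The tiny corpus and the helper names below are hypothetical.

```python
import math

# Hypothetical corpus: each document is reduced to its set of words.
docs = {
    "zoo":    {"the", "aardvark", "eats", "ants"},
    "burrow": {"the", "aardvark", "digs"},
    "pizza":  {"the", "anchovy", "pizza"},
    "news":   {"the", "market", "fell"},
}

def idf_weights(docs):
    """Weight log(N / df) per word, where df = number of docs containing it."""
    n = len(docs)
    vocab = set().union(*docs.values())
    return {w: math.log(n / sum(w in d for d in docs.values())) for w in vocab}

def weighted_distance(a, b, idf):
    """Euclidean distance between the 0/1 word vectors after scaling
    each coordinate by its IDF weight; only differing words contribute."""
    return math.sqrt(sum(idf[w] ** 2 for w in a ^ b))

idf = idf_weights(docs)
print(idf["the"], idf["aardvark"])
print(weighted_distance(docs["zoo"], docs["burrow"], idf))
print(weighted_distance(docs["zoo"], docs["pizza"], idf))
```

Because "the" occurs in every document its weight is zero, while the rarer "aardvark" gets a positive weight, so the two aardvark documents end up closer to each other than either is to the pizza document.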

13
Conclusion

Clustering seems to have a lot of interesting applications
But it's important, before starting, to have an embedding of your data in $\mathbb{R}^N$ where distance in $\mathbb{R}^N$ is really related to distance between items (at least for small distances!)

14
A first clustering algorithm

Assuming data that's really pretty well clustered, how do you find the clusters?
Intro to K-means

15
Distance vs. Cluster distance
18
Conclusion

We might want to use a distance, $D$, to indicate how much two things are in the same cluster
Tempting to write $D(p_i, p_k)$
Really needs to be $D(p_1, p_2, \ldots, p_i, p_k)$, because whether two points share a cluster depends on all the other points
One view of clustering is that we want to use Euclidean distance, $d$, to bootstrap discovery of cluster distance, $D$
K-means works when $d$ and $D$ are very similar

19
How are distance and cluster distance related?

If you're a friend of my friend, you're my friend
Suggests a graph-theory approach: find connected components in a graph
Edges in graph when two points are close enough
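
The connected-components idea can be sketched as follows. The threshold `eps`, the function name `threshold_components`, and the sample points are assumptions made for illustration; the slides only say that edges join points that are "close enough".

```python
import math

def threshold_components(points, eps):
    """Label points by connected component of the graph that joins
    every pair at Euclidean distance <= eps."""
    n = len(points)
    label = [None] * n
    current = 0
    for start in range(n):
        if label[start] is not None:
            continue                      # already reached from an earlier point
        label[start] = current
        stack = [start]
        while stack:                      # depth-first search over "close enough" edges
            i = stack.pop()
            for j in range(n):
                if label[j] is None and math.dist(points[i], points[j]) <= eps:
                    label[j] = current
                    stack.append(j)
        current += 1
    return label

# Hypothetical data: two well-separated groups on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(threshold_components(points, eps=1.5))
```

Note that (0, 0) and (2, 0) land in the same cluster even though they are farther apart than `eps`: they are linked through (1, 0), which is the "friend of my friend" rule in action.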

20
Problems

Edges in graph when two points are close enough
Very data-sensitive: a small perturbation of data can join two clusters
These points are much closer than these
If you're friends with lots of my friends, you're my friend.

21
Leads to study of how connected are nodes in a
graph?

The travelling token problem was an intro to that
question

22
Solution to travelling token
32

and so on
Make a VERY long movie
Play it VERY rapidly
How pink each node appears tells you what
fraction of the time the token spends there.
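
The "very long movie" can be imitated with a short simulation. The sketch below assumes that in each frame the token moves to a uniformly random neighbor of its current node; the transcript does not state the rule, so treat that, the small example graph, and the name `token_time_fractions` as assumptions.

```python
import random
from collections import Counter

def token_time_fractions(adj, steps, seed=0):
    """Move one token to a uniformly random neighbor each frame and
    return the fraction of frames it spends at each node."""
    rng = random.Random(seed)
    node = next(iter(adj))               # start at an arbitrary node
    visits = Counter()
    for _ in range(steps):
        visits[node] += 1
        node = rng.choice(adj[node])
    return {v: visits[v] / steps for v in adj}

# Hypothetical graph: a triangle a-b-c with one extra node d attached to c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
fractions = token_time_fractions(adj, steps=200_000)

total_degree = sum(len(nbrs) for nbrs in adj.values())
for v in adj:
    # Printed side by side: observed time share and the node's share of total degree.
    print(v, round(fractions[v], 3), len(adj[v]) / total_degree)
```

The observed fractions play the role of "how pink each node appears" when the movie is played rapidly.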

33
Insight
34
Create a shorter movie!

Two tokens (possibly at same spot) in each frame

35
Doubled movie
39
Apply repeatedly

Limiting version of the movie
Has a huge number of moving tokens
every frame must look the same!

40
Every frame looks the same

If $u_i$ = number of tokens at node $i$
$d_i$ = degree of node $i$
$j$ is a neighbor of $i$
Then
$i$ sends $j$ a quantity $u_i / d_i$ of tokens
every frame looks the same if $j$ sends that many back to $i$
That happens exactly if $u_k / d_k$ = constant
Population of tokens at a node is proportional to the node's degree!
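
Written out, the balance argument runs as follows. The sketch assumes a connected graph, and the symbols $c$ (the common constant) and $|E|$ (the number of edges) are introduced here for illustration.

```latex
\begin{align*}
\text{flow } i \to j &= \frac{u_i}{d_i},
  \qquad \text{flow } j \to i = \frac{u_j}{d_j} \\[4pt]
\frac{u_i}{d_i} = \frac{u_j}{d_j} \ \text{ on every edge } \{i,j\}
  \;&\Longrightarrow\; \frac{u_k}{d_k} = c \ \text{ for all nodes } k \\[4pt]
u_k = c\, d_k
  \;&\Longrightarrow\; \frac{u_k}{\sum_m u_m} = \frac{d_k}{\sum_m d_m} = \frac{d_k}{2|E|}
\end{align*}
```

The last step uses the fact that the degrees of a graph sum to twice the number of edges.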

41
Matrix form

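Let $A = (a_{ij})$, where $a_{ij}$
is 1 if $i$ and $j$ are connected
is 0 otherwise
Let $D = \mathrm{diag}(\deg \text{ of node } 1, \deg \text{ of node } 2, \ldots)$
Let $M = D^{-1} A$
Then our solution $u$ satisfies
$M^{\mathsf{T}} u = u$ (equivalently $u^{\mathsf{T}} M = u^{\mathsf{T}}$), since node $i$ receives $\sum_j a_{ij} u_j / d_j$ tokens each frame
Insight: things related to graph diffusion, neighboring, clustering, are related to eigenvalue problems.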
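
A quick numerical check of the fixed-point claim, treating the token counts as a row vector acting on $M = D^{-1} A$ (that is, $u^{\mathsf{T}} M = u^{\mathsf{T}}$, equivalently $M^{\mathsf{T}} u = u$). The 4-node graph is a hypothetical example, and the code is a sketch in plain Python rather than a definitive implementation.

```python
# Hypothetical graph: triangle 0-1-2 with node 3 attached to node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
n = len(A)
deg = [sum(row) for row in A]                                   # diagonal of D
M = [[A[i][j] / deg[i] for j in range(n)] for i in range(n)]    # M = D^{-1} A

# Candidate solution: token population proportional to degree.
u = [float(d) for d in deg]

# One frame of the movie: node j receives u_i / d_i from each neighbor i,
# which is the j-th entry of the row vector u^T M.
after = [sum(u[i] * M[i][j] for i in range(n)) for j in range(n)]

print(u)
print(after)
```

If the two printed vectors agree, the degree vector is an eigenvector with eigenvalue 1, which is the link to eigenvalue problems that the slide points to.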

42
Some Graph Terminology and Notation
43
Ng, Jordan, Weiss