Title: Measurement of Similarity and Clustering
1 Measurement of Similarity and Clustering
Dr. Eamonn Keogh
Computer Science and Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn_at_cs.ucr.edu
2 Outline of Talk
- What is Similarity?
  - Some nomenclature
  - A useful tool (the dendrogram)
- Why Measure Similarity?
  - Classification
  - Clustering
  - Indexing
- Desirable Properties of Similarity Measures
  - Mathematical properties
  - Intuitiveness
  - Time and space complexity
- Two Approaches
  - Feature Projection
  - Transformation (Edit Distance)
- Hierarchical Clustering
3 What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." - Webster's Dictionary
Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
4 Some Nomenclature I
- We shall talk of measuring similarity; however, we are usually measuring dissimilarity.
- Similarity: The larger the number, the more alike two objects are.
- Dissimilarity: The larger the number, the less alike two objects are.
- Distance is a common synonym for dissimilarity, so we may speak of "distance measure" and "dissimilarity measure" interchangeably.
- However, a distance measure is not the same thing as a distance metric. We will see why later.
5 Some Nomenclature II
Similarity Queries are often expressed as Nearest
Neighbor Queries or Range Queries.
What is the nearest item to the green item?
What items are within R of the blue item?
[Figure: two scatter plots of points labeled a, b, c; the left illustrates a nearest neighbor query, the right a range query with radius R.]
R is given by the user
Can be generalized to the K nearest Neighbors
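To make the two query types concrete, here is a minimal brute-force sketch in Python (the points and query are invented; a real system would use an index rather than a linear scan):

# Brute-force nearest neighbor and range queries over 2-D points.
import math

def nearest_neighbor(query, points):
    # Return the point with the smallest distance to the query.
    return min(points, key=lambda p: math.dist(query, p))

def range_query(query, points, r):
    # Return every point within distance r of the query.
    return [p for p in points if math.dist(query, p) <= r]

points = [(1.0, 2.0), (3.5, 0.5), (4.0, 4.0), (0.2, 3.3)]
print(nearest_neighbor((1.0, 3.0), points))    # the nearest item
print(range_query((1.0, 3.0), points, r=2.0))  # all items within R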
6 A Useful Tool for Summarizing Similarity Measurements
In order to better appreciate and evaluate the
examples given in the early part of this talk, we
will now introduce the dendrogram. (We will have
much more to say about dendrograms later)
The similarity between two objects in a
dendrogram is represented as the height of the
lowest internal node they share.
7 (Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939)
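To see how such a tree becomes a picture, here is a minimal SciPy sketch. The one-dimensional coordinates are invented purely so that the merge order roughly mimics the slide's tree; they are not the real genetic distances:

# Build and draw a small dendrogram with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["Human", "Chimp", "Gorilla", "Orangutan", "Gibbon", "Bovine"]
X = np.array([[0.00], [0.12], [0.30], [0.55], [0.80], [1.60]])  # invented

Z = linkage(X, method="average")  # agglomerative clustering
dendrogram(Z, labels=labels)
plt.ylabel("height of lowest shared internal node")
plt.show()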
8 The dendrogram comes in many styles and under many names:
- Swoopogram
- Curvogram
- Eurogram
- Phenogram
- Cladogram
- Tree Diagram
9 Why Measure Similarity?
- Classification: Given an unlabeled item Q, assign it to one of two or more predefined classes. (We can do classification without measuring similarity, but similarity-based methods, e.g. nearest neighbor, are very competitive.)
- Clustering: Find natural groupings of items under some similarity measure.
- Indexing (Query by Content): Given a query object Q and some similarity measure, find the nearest matching item in the database without having to examine every item.
10 Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
[Figure: a distance measure drawn as a black box that takes two objects, Peter and Piotr, and returns a number such as 0.23, 3, or 342.7.]
11 [Figure: the black box again, with inputs Peter and Piotr and output 3.]
When we peek inside one of these black boxes, we see some function of two variables. These functions might be very simple or very complex. In either case it is natural to ask: what properties should these functions have?

d('', '') = 0
d(s, '') = d('', s) = |s|   -- i.e. the length of s
d(s1 + ch1, s2 + ch2) = min( d(s1, s2) + (if ch1 == ch2 then 0 else 1),
                             d(s1 + ch1, s2) + 1,
                             d(s1, s2 + ch2) + 1 )
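The recurrence translates directly into code; a minimal memoized Python version, where d(i, j) is the distance between the first i characters of s1 and the first j characters of s2:

from functools import lru_cache

def edit_distance(s1, s2):
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0: return j   # d('', s) = |s|
        if j == 0: return i   # d(s, '') = |s|
        return min(d(i - 1, j - 1) + (0 if s1[i - 1] == s2[j - 1] else 1),
                   d(i - 1, j) + 1,    # delete a character from s1
                   d(i, j - 1) + 1)    # insert a character into s1
    return d(len(s1), len(s2))

print(edit_distance("Peter", "Piotr"))  # 3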
12 Intuitions behind desirable distance measure properties
D(A,B) = D(B,A)   (Symmetry)
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0   (Constancy of Self-Similarity)
Otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B   (Positivity / Separation)
Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) <= D(A,C) + D(B,C)   (Triangular Inequality)
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
13 Why is the Triangular Inequality so Important?
Virtually all techniques to index data require the triangular inequality to hold.
Suppose I have a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precomputed a table of the distances between all the items in the database.
14 Why is the Triangular Inequality so Important?
Virtually all techniques to index data require the triangular inequality to hold.
Suppose I am looking for the closest point to Q in a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precomputed a table of the distances between all the items in the database. I find a and calculate that it is 2 units from Q; it becomes my best-so-far. I find b and calculate that it is 7.81 units away from Q. I don't have to calculate the distance from Q to c! I know:
D(Q,b) <= D(Q,c) + D(b,c)
D(Q,b) - D(b,c) <= D(Q,c)
7.81 - 2.30 <= D(Q,c)
5.51 <= D(Q,c)
So I know that c is at least 5.51 units away, but my best-so-far is only 2 units away.
[Figure: points Q, a, b, c in the plane.]
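In code, the same pruning might look like the sketch below. The coordinates are invented; only the logic follows the slide: lower-bound an unseen object's distance using the precomputed table, and skip the expensive call when the bound already exceeds the best-so-far.

# Nearest-neighbor search pruned by the triangular inequality.
import math

def search(query, objects, true_distance, table):
    best, best_dist = None, float("inf")
    seen = {}   # distances from the query we actually computed
    for obj in objects:
        # Triangular inequality: D(Q,x) <= D(Q,obj) + D(x,obj),
        # so D(Q,obj) >= D(Q,x) - D(x,obj) for every already-seen x.
        bound = max((d - table[(x, obj)] for x, d in seen.items()), default=0.0)
        if bound >= best_dist:
            continue            # provably no better; skip the expensive call
        dist = true_distance(query, obj)
        seen[obj] = dist
        if dist < best_dist:
            best, best_dist = obj, dist
    return best, best_dist

pts = {"a": (1.0, 1.0), "b": (8.0, 4.0), "c": (7.0, 2.0)}   # invented
table = {(x, y): math.dist(pts[x], pts[y]) for x in pts for y in pts}
print(search((0.0, 0.0), list(pts), lambda qq, o: math.dist(qq, pts[o]), table))
# c is never compared to the query: its lower bound exceeds the best-so-far.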
15 Thoughts on the Triangular Inequality I
Sometimes the triangular inequality requirement
maps nicely onto human intuitions. Consider the
similarity between the horse, the zebra and the
lion.
The horse and the zebra are very similar, and
both are very unlike the lion.
16 Thoughts on the Triangular Inequality II
Sometimes the triangular inequality requirement fails to map onto human intuition. Consider the similarity between the horse, a man, and the centaur.
The horse and the man are very different, but both share many features with the centaur. This relationship does not obey the triangular inequality.
The centaur example is due to Remco Veltkamp.
17 What other properties should we require of a distance measure?
- It should really measure similarity! (Whatever that means.)
- It should be fast to compute
  - Euclidean distance and Hamming distance are O(n); Dynamic Time Warping and string edit distance are O(n²)
- It should be space efficient
  - This is usually not as important as time efficiency
- It should allow indexing
  - If the measure is a metric, this is automatically true; otherwise it depends
- A fast lower bound measure is desirable (we will see why on the next slide)
  - For all A, B: lower_bound_distance(A, B) <= true_distance(A, B)
18 If not fast to compute, a fast lower bound measure is desirable
Assume that true_distance(A, B) is the correct distance function but is very expensive to compute, and that lower_bound_distance(A, B) is a cheap lower-bounding estimate of true_distance(A, B). Such a bound allows faster sequential searching, as in the sketch below.
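A minimal reconstruction of that algorithm (the function names follow the slide; the exact control flow is an assumption):

def best_match(query, database, true_distance, lower_bound_distance):
    # Sequential scan that only calls the expensive true_distance when
    # the cheap bound says an item could still beat the best-so-far.
    # Requires: lower_bound_distance(A, B) <= true_distance(A, B).
    best, best_dist = None, float("inf")
    for item in database:
        if lower_bound_distance(query, item) >= best_dist:
            continue              # cannot beat best-so-far; skip it
        dist = true_distance(query, item)
        if dist < best_dist:
            best, best_dist = item, dist
    return best, best_dist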
19 If we want to measure the similarity between items, we will have to measure some features
- Scalar
  - Binary: Only two possible states.
    - True/False, Jew/Gentile, Married/Unmarried
  - Nominal: Generalization of binary to 3 or more states
    - Jew/Catholic/Protestant, Married/Divorced/Widower
    - In basketball, jersey numbers are nominal
    - You cannot order, or do any mathematical operations on, nominal data
20 Scalar (continued)
- Ordinal: Same as nominal, but order matters. However, the distance between two values is not meaningful.
  - For example, we might have a coded survey: 0 = no high school, 1 = some high school, 2 = high school diploma, 3 = some college
  - While we can clearly rank these attributes, the distance between a 1 and a 2 is not the same as the distance between a 2 and a 3.
- Interval: Distance between attributes is meaningful. In this case we can measure intervals and take averages, but we cannot form ratios (i.e. we cannot say 10 is twice as large as 5).
  - For example, consider temperature in Fahrenheit or Celsius.
- Ratio: You can meaningfully form ratios.
  - For example, weight, height, number of children.
21 Scalar (continued)
- Note that both interval and ratio data can be either discrete or continuous.
  - For example, consider the following two examples of ratio data:
    - Number of children (for a given person)
    - Average number of children (for women in different countries)
- Some algorithms work better (or only work) for either discrete or continuous data.
- We can convert from continuous to discrete.
22 In addition to scalar values, much of the data we are interested in is nonscalar: vectors or matrices of binary/nominal/ordinal/interval/ratio values; bitmaps, time series, strings, trees, graphs.
23 Consider color. What kind of feature is this?
- Nominal Scalar: Blue, Red, Yellow, etc.
- Ordinal Discrete: Red, Orange, Yellow, Green, Blue, Indigo, Violet
- Ordinal Continuous: 780-622 nm, 622-597 nm, 597-577 nm
- Vector Continuous: (0.95, 0.01, 0.21) (Red/Green/Blue, or Hue/Saturation/Luminosity)
We sometimes have a choice of representation. Often making the right choice can be very important.
24 The similarity between two items depends on the features we measure (and on the distance measure itself)
[Figure: the same pair of people compared under Last Name Similarity and under Skin Color Similarity, giving distances of 0 and 115.]
25 Sometimes we are given the perfect features to measure similarity; sometimes we need to:
- Generate Features: Suppose we hope to find similar people with regard to their medical conditions. Knowing both their height and weight is not helpful; knowing their BMI is. (BMI = weight in kilograms / (height in meters)²; see the sketch below.)
- Clean Features: Our features may contain noise or outliers.
- Normalize Features: We may need to transform features.
- Reduce Features: We may have too many features to do efficient similarity measurement, so dimensionality reduction may be necessary.
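A tiny sketch of the feature-generation step, with made-up patient records:

# Derive a more useful feature (BMI) from raw ones (height, weight).
patients = [
    {"name": "A", "height_m": 1.80, "weight_kg": 90.0},   # invented data
    {"name": "B", "height_m": 1.60, "weight_kg": 55.0},
]
for p in patients:
    p["bmi"] = p["weight_kg"] / p["height_m"] ** 2
    print(p["name"], round(p["bmi"], 1))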
26 There is no single magic black box for measuring similarity
However, there are two useful and general tricks:
- Project the data into feature space; the distance in feature space (appropriately measured) becomes the similarity.
- Transform one object into the other; the cost of this transformation becomes the similarity.
Feature Projection
Edit Distance
27 Feature Projection Example I
Use the features to project the items into feature space. The distance between two objects in this space (appropriately measured) is the measure of similarity.
[Figure: scatter plot of birds in feature space. X-axis: Body Mass (1-10); Y-axis: Ratio of beak length over body length (0.1-1.0). From left to right: Bee Hummingbird, Costa's Hummingbird, Ruby Topaz Hummingbird, Kestrel, Gyrfalcon, Bald Eagle.]
28 Feature Projection Example II
R. A. Fisher's Iris dataset: 3 variations of the Iris flower, 50 of each.
29 A generic technique for measuring similarity
To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.
The distance between Patty and Selma:
Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty, Selma) = 3
The distance between Marge and Selma:
Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge, Selma) = 5
30 Edit Distance Example I
How similar are the names "Peter" and "Piotr"? Assume the following cost function: Substitution = 1 unit, Insertion = 1 unit, Deletion = 1 unit. Then D(Peter, Piotr) = 3.
It is possible to transform any string Q into string C using only substitution, insertion, and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.)
Peter -> Piter (substitution: i for e) -> Pioter (insertion: o) -> Piotr (deletion: e)
31 Edit Distance Example II
We can make two time series appear more similar by making one point in one map onto two (or more) points in the other. For example, suppose we have Q = 5, 6, 8, 8, 7 and C = 5, 6, 6, 8, 7.
A one-to-one measure would have to match an 8 in Q to a 6 in C. However, if we allow nonlinear alignments, every number can match with itself. Another way of looking at it is as an attempt to make the two sequences more similar by inserting values.
This is called Dynamic Time Warping.
32 Dynamic Time Warping
Fixed Time Axis: Sequences are aligned one to one.
Warped Time Axis: Nonlinear alignments are possible.
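A minimal dynamic-programming sketch of DTW (quadratic time and space; absolute difference as the point-to-point cost, which is one common choice):

# Dynamic Time Warping by dynamic programming.
def dtw(q, c):
    n, m = len(q), len(c)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - c[j - 1])
            # cheapest way to get here: match, or repeat a point of q or of c
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

Q = [5, 6, 8, 8, 7]
C = [5, 6, 6, 8, 7]
print(dtw(Q, C))   # 0.0: with warping, every point matches exactly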
33 The Minkowski Metric
So, we have projected our objects into feature space. How do we measure the distance between points?
Assume Q and C are vectors of features measured from the objects of interest:
D(Q, C) = (sum over i of |qi - ci|^p)^(1/p)
p = 1: Manhattan (Rectilinear, City Block)
p = 2: Euclidean
p = infinity: Max (Supremum, sup)
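The formula in code, with an optional weight vector anticipating the weighted version a few slides ahead (a sketch; assumes Q and C have the same length):

# Minkowski distance: D(Q, C) = (sum_i w_i * |q_i - c_i|^p)^(1/p)
def minkowski(q, c, p=2, w=None):
    w = w if w is not None else [1.0] * len(q)
    return sum(wi * abs(qi - ci) ** p for qi, ci, wi in zip(q, c, w)) ** (1 / p)

def chebyshev(q, c):
    # the p -> infinity limit: the largest coordinate difference
    return max(abs(qi - ci) for qi, ci in zip(q, c))

print(minkowski([5, 96.8], [1, 96.8], p=1))  # Manhattan: 4.0
print(minkowski([5, 96.8], [1, 96.8], p=2))  # Euclidean: 4.0
print(chebyshev([5, 96.8], [1, 96.8]))       # Max: 4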
34 The Minkowski Metric: a Weakness
Suppose we have a database of 3 items, with 2 features: number of children and temperature. We want to know who is most similar to Mr. Red under the Euclidean distance.
[Figure: the three items plotted twice. With temperature in Fahrenheit the points are (5, 96.8), (1, 96.8), (5, 102.2); with temperature in Celsius they are (5, 36), (1, 36), (5, 38). In one plot Green is closest to Red; in the other, Blue is closest to Red.]
The Minkowski metric is sensitive to the units used to measure features, a very undesirable property since the units are usually arbitrary. Two solutions suggest themselves: normalize the features, or use a weighted version of the Minkowski metric.
35 Normalizing Features
Let C be a database of items, with the ith feature denoted by ci. To normalize the database:

for each feature ci
    ci = (ci - mean(ci)) / std(ci)
end

After normalization each feature will have a mean of zero and a standard deviation of one.
Note that in both these images the axes are square (there is the same number of pixels per unit in both the X and Y directions).
Before normalization, the Y-axis dominates; after normalization, both axes are equally important.
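The same loop in NumPy (a sketch; X holds one row per item and one column per feature):

import numpy as np

def z_normalize(X):
    # Give every column (feature) mean 0 and standard deviation 1.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[5, 96.8], [1, 96.8], [5, 102.2]])   # the items from slide 34
print(z_normalize(X))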
36 The Weighted Minkowski Metric
Assume Q and C are vectors of features measured from the objects of interest. Further assume that W is a vector containing the relative importance of the features:
D(Q, C, W) = (sum over i of wi * |qi - ci|^p)^(1/p)
But how do we know the weights?
37 The Minkowski Metrics have Simple Geometric Interpretations
[Figure: the set of points equidistant from a center, drawn for the Euclidean, Weighted Euclidean, Manhattan, and Max distances.]
38 [Figure-only slide; no text.]
39 What is Clustering?
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
- Organizing data into classes such that there is
  - high intra-class similarity
  - low inter-class similarity
- Finding the class labels and the number of classes directly from the data (in contrast to classification).
- More informally, finding natural groupings among objects.
40 What is a natural grouping among these objects?
41 What is a natural grouping among these objects? Clustering is subjective!
[Figure: the same characters grouped four different ways: School Employees, Simpson's Family, Males, Females.]
42 Even if we know in advance the number of clusters we expect to see, the clustering obtained may be subjective.
43 Two Types of Clustering
- Partitional algorithms: Construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using some criterion.
[Figure: a partitional clustering shown beside a hierarchical one (a dendrogram).]
44 Desirable Properties of a Clustering Algorithm
- Scalability (in terms of both time and space)
- Ability to deal with different data types
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
45 Hierarchical Clustering
Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
The number of dendrograms with n leaves = (2n - 3)! / (2^(n-2) * (n - 2)!)   (see the check below)

Number of Leaves | Number of Possible Dendrograms
2   | 1
3   | 3
4   | 15
5   | 105
... | ...
10  | 34,459,425
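A quick check of the formula (it equals the double factorial (2n - 3)!!):

from math import factorial

def num_dendrograms(n):
    # (2n - 3)! / (2^(n-2) * (n - 2)!), the number of rooted binary
    # trees over n labeled leaves.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425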
46 We begin with a distance matrix which contains the distances between every pair of objects in our database.
[Figure: a pairwise distance matrix over the objects, with entries such as D(·,·) = 8 and D(·,·) = 1.]
47 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges
Choose the best
48 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
49 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
50 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
51 In the first iteration of agglomerative clustering we merged two objects, so we need to remove them from the matrix. We now need to add the new merged cluster to our new, smaller matrix.
But what values do we fill in? What is the distance from the merged cluster to each of the remaining objects?
52 We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is not obvious.
- Single linkage (nearest neighbor): The distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
- Complete linkage (furthest neighbor): The distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
- Group average: The distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
53 Using single linkage (nearest neighbor), each new entry is the minimum over the merged cluster's members:
D(merged cluster, x) = min( D(·, x), D(·, x) )
In the slide's example, the two new entries are min(...) = 4 and min(...) = 7.
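The three cluster-to-cluster distances from slide 52, written as code over an object-to-object distance function D (a sketch; A and B are collections of objects):

# Three ways to lift an object-to-object distance to clusters.
def single_linkage(A, B, D):     # nearest neighbors
    return min(D(a, b) for a in A for b in B)

def complete_linkage(A, B, D):   # furthest neighbors
    return max(D(a, b) for a in A for b in B)

def group_average(A, B, D):      # mean over all cross-cluster pairs
    return sum(D(a, b) for a in A for b in B) / (len(A) * len(B))

# e.g. single_linkage([1.0, 2.0], [6.0, 9.0], lambda a, b: abs(a - b)) -> 4.0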
54 Summary of Hierarchical Clustering Methods
- No need to specify the number of clusters in advance.
- The hierarchical nature maps nicely onto human intuition for some domains.
- They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
- Like any heuristic search algorithm, local optima are a problem.
- Interpretation of results is subjective.