Title: Measurement of Similarity and Clustering
1 Measurement of Similarity and Clustering
Dr. Eamonn Keogh
Computer Science and Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn_at_cs.ucr.edu
2 Outline of Talk
- What is Similarity?
  - Some nomenclature
  - A useful tool (the dendrogram)
- Why Measure Similarity?
  - Classification
  - Clustering
  - Indexing
- Desirable Properties of Similarity Measures
  - Mathematical properties
  - Intuitiveness
  - Time and space complexity
- Two Approaches
  - Feature Projection
  - Transformation (Edit Distance)
- Hierarchical Clustering
3 What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." - Webster's Dictionary
Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
4 Some Nomenclature I
- We shall talk of measuring similarity; however, we are usually measuring dissimilarity.
- Similarity: The larger the number, the more alike two objects are.
- Dissimilarity: The larger the number, the less alike two objects are.
- Distance is a common synonym for dissimilarity, so we may speak of "distance measure" and "dissimilarity measure" interchangeably.
- However, a distance measure is not the same thing as a distance metric. We will see why later.
5 Some Nomenclature II
Similarity Queries are often expressed as Nearest
Neighbor Queries or Range Queries.
What is the nearest item to the green item?
What items are within R of the blue item?
[Figure: two scatter plots of points labeled a, b, c; the left illustrates a nearest neighbor query, the right a range query with radius R.]
R is given by the user
Can be generalized to the K nearest Neighbors
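To make the two query types concrete, here is a minimal brute-force sketch in Python (the points and query are invented; a real system would use an index rather than a linear scan):

# Brute-force nearest neighbor and range queries over 2-D points.
import math

def nearest_neighbor(query, points):
    # Return the point with the smallest distance to the query.
    return min(points, key=lambda p: math.dist(query, p))

def range_query(query, points, r):
    # Return every point within distance r of the query.
    return [p for p in points if math.dist(query, p) <= r]

points = [(1.0, 2.0), (3.5, 0.5), (4.0, 4.0), (0.2, 3.3)]
print(nearest_neighbor((1.0, 3.0), points))    # the nearest item
print(range_query((1.0, 3.0), points, r=2.0))  # all items within R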
6 A Useful Tool for Summarizing Similarity Measurements
In order to better appreciate and evaluate the
examples given in the early part of this talk, we
will now introduce the dendrogram. (We will have
much more to say about dendrograms later)
The similarity between two objects in a
dendrogram is represented as the height of the
lowest internal node they share.
7 (Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939)
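To see how such a tree becomes a picture, here is a minimal SciPy sketch. The one-dimensional coordinates are invented purely so that the merge order roughly mimics the slide's tree; they are not the real genetic distances:

# Build and draw a small dendrogram with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["Human", "Chimp", "Gorilla", "Orangutan", "Gibbon", "Bovine"]
X = np.array([[0.00], [0.12], [0.30], [0.55], [0.80], [1.60]])  # invented

Z = linkage(X, method="average")  # agglomerative clustering
dendrogram(Z, labels=labels)
plt.ylabel("height of lowest shared internal node")
plt.show()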
8 The dendrogram comes in many styles and under many names:
- Swoopogram
- Curvogram
- Eurogram
- Phenogram
- Cladogram
- Tree Diagram
9 Why Measure Similarity?
- Classification: Given an unlabeled item Q, assign it to one of two or more predefined classes. (We can do classification without measuring similarity, but similarity-based methods, e.g. nearest neighbor, are very competitive.)
- Clustering: Find natural groupings of items under some similarity measure.
- Indexing (Query by Content): Given a query object Q and some similarity measure, find the nearest matching item in the database without having to examine every item.
10 Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
[Figure: a distance measure drawn as a black box that takes two objects, Peter and Piotr, and returns a number such as 0.23, 3, or 342.7.]
11 [Figure: the black box again, with inputs Peter and Piotr and output 3.]
When we peek inside one of these black boxes, we see some function of two variables. These functions might be very simple or very complex. In either case it is natural to ask: what properties should these functions have?

d('', '') = 0
d(s, '') = d('', s) = |s|   -- i.e. the length of s
d(s1 + ch1, s2 + ch2) = min( d(s1, s2) + (if ch1 == ch2 then 0 else 1),
                             d(s1 + ch1, s2) + 1,
                             d(s1, s2 + ch2) + 1 )
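The recurrence translates directly into code; a minimal memoized Python version, where d(i, j) is the distance between the first i characters of s1 and the first j characters of s2:

from functools import lru_cache

def edit_distance(s1, s2):
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0: return j   # d('', s) = |s|
        if j == 0: return i   # d(s, '') = |s|
        return min(d(i - 1, j - 1) + (0 if s1[i - 1] == s2[j - 1] else 1),
                   d(i - 1, j) + 1,    # delete a character from s1
                   d(i, j - 1) + 1)    # insert a character into s1
    return d(len(s1), len(s2))

print(edit_distance("Peter", "Piotr"))  # 3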
12 Intuitions behind desirable distance measure properties
D(A,B) = D(B,A)   (Symmetry)
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0   (Constancy of Self-Similarity)
Otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B   (Positivity / Separation)
Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) <= D(A,C) + D(B,C)   (Triangular Inequality)
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
13 Why is the Triangular Inequality so Important?
Virtually all techniques to index data require the triangular inequality to hold.
Suppose I have a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precomputed a table of the distances between all the items in the database.
14 Why is the Triangular Inequality so Important?
Virtually all techniques to index data require the triangular inequality to hold.
Suppose I am looking for the closest point to Q in a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precomputed a table of the distances between all the items in the database. I find a and calculate that it is 2 units from Q; it becomes my best-so-far. I find b and calculate that it is 7.81 units away from Q. I don't have to calculate the distance from Q to c! I know:
D(Q,b) <= D(Q,c) + D(b,c)
D(Q,b) - D(b,c) <= D(Q,c)
7.81 - 2.30 <= D(Q,c)
5.51 <= D(Q,c)
So I know that c is at least 5.51 units away, but my best-so-far is only 2 units away.
[Figure: points Q, a, b, c in the plane.]
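In code, the same pruning might look like the sketch below. The coordinates are invented; only the logic follows the slide: lower-bound an unseen object's distance using the precomputed table, and skip the expensive call when the bound already exceeds the best-so-far.

# Nearest-neighbor search pruned by the triangular inequality.
import math

def search(query, objects, true_distance, table):
    best, best_dist = None, float("inf")
    seen = {}   # distances from the query we actually computed
    for obj in objects:
        # Triangular inequality: D(Q,x) <= D(Q,obj) + D(x,obj),
        # so D(Q,obj) >= D(Q,x) - D(x,obj) for every already-seen x.
        bound = max((d - table[(x, obj)] for x, d in seen.items()), default=0.0)
        if bound >= best_dist:
            continue            # provably no better; skip the expensive call
        dist = true_distance(query, obj)
        seen[obj] = dist
        if dist < best_dist:
            best, best_dist = obj, dist
    return best, best_dist

pts = {"a": (1.0, 1.0), "b": (8.0, 4.0), "c": (7.0, 2.0)}   # invented
table = {(x, y): math.dist(pts[x], pts[y]) for x in pts for y in pts}
print(search((0.0, 0.0), list(pts), lambda qq, o: math.dist(qq, pts[o]), table))
# c is never compared to the query: its lower bound exceeds the best-so-far.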
15 Thoughts on the Triangular Inequality I
Sometimes the triangular inequality requirement
maps nicely onto human intuitions. Consider the
similarity between the horse, the zebra and the
lion.
The horse and the zebra are very similar, and
both are very unlike the lion.
16 Thoughts on the Triangular Inequality II
Sometimes the triangular inequality requirement fails to map onto human intuition. Consider the similarity between the horse, a man, and the centaur.
The horse and the man are very different, but both share many features with the centaur. This relationship does not obey the triangular inequality.
The centaur example is due to Remco Veltkamp.
17 What other properties should we require of a distance measure?
- It should really measure similarity! (Whatever that means.)
- It should be fast to compute
  - Euclidean distance and Hamming distance are O(n); Dynamic Time Warping and string edit distance are O(n²)
- It should be space efficient
  - This is usually not as important as time efficiency
- It should allow indexing
  - If the measure is a metric, this is automatically true; otherwise it depends
- A fast lower bound measure is desirable (we will see why on the next slide)
  - For all A, B: lower_bound_distance(A, B) <= true_distance(A, B)
18 If not fast to compute, a fast lower bound measure is desirable
Assume that true_distance(A, B) is the correct distance function but is very expensive to compute, and that lower_bound_distance(A, B) is a cheap lower-bounding estimate of true_distance(A, B). Such a bound allows faster sequential searching, as in the sketch below.
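A minimal reconstruction of that algorithm (the function names follow the slide; the exact control flow is an assumption):

def best_match(query, database, true_distance, lower_bound_distance):
    # Sequential scan that only calls the expensive true_distance when
    # the cheap bound says an item could still beat the best-so-far.
    # Requires: lower_bound_distance(A, B) <= true_distance(A, B).
    best, best_dist = None, float("inf")
    for item in database:
        if lower_bound_distance(query, item) >= best_dist:
            continue              # cannot beat best-so-far; skip it
        dist = true_distance(query, item)
        if dist < best_dist:
            best, best_dist = item, dist
    return best, best_dist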
19 If we want to measure the similarity between items, we will have to measure some features
- Scalar
  - Binary: Only two possible states.
    - True/False, Jew/Gentile, Married/Unmarried
  - Nominal: Generalization of binary to 3 or more states
    - Jew/Catholic/Protestant, Married/Divorced/Widower
    - In basketball, jersey numbers are nominal
    - You cannot order, or do any mathematical operations on, nominal data
20 Scalar (continued)
- Ordinal: Same as nominal, but order matters. However, the distance between two values is not meaningful.
  - For example, we might have a coded survey: 0 = no high school, 1 = some high school, 2 = high school diploma, 3 = some college
  - While we can clearly rank these attributes, the distance between a 1 and a 2 is not the same as the distance between a 2 and a 3.
- Interval: Distance between attributes is meaningful. In this case we can measure intervals and take averages, but we cannot form ratios (i.e. we cannot say 10 is twice as large as 5).
  - For example, consider temperature in Fahrenheit or Celsius.
- Ratio: You can meaningfully form ratios.
  - For example, weight, height, number of children.
21 Scalar (continued)
- Note that both interval and ratio data can be either discrete or continuous.
  - For example, consider the following two examples of ratio data:
    - Number of children (for a given person)
    - Average number of children (for women in different countries)
- Some algorithms work better (or only work) for either discrete or continuous data.
- We can convert from continuous to discrete.
22 In addition to scalar values, much of the data we are interested in is nonscalar: vectors or matrices of binary/nominal/ordinal/interval/ratio values; bitmaps, time series, strings, trees, graphs.
23 Consider color. What kind of feature is this?
- Nominal Scalar: Blue, Red, Yellow, etc.
- Ordinal Discrete: Red, Orange, Yellow, Green, Blue, Indigo, Violet
- Ordinal Continuous: 780-622 nm, 622-597 nm, 597-577 nm
- Vector Continuous: (0.95, 0.01, 0.21) (Red/Green/Blue, or Hue/Saturation/Luminosity)
We sometimes have a choice of representation. Often making the right choice can be very important.
24 The similarity between two items depends on the features we measure (and on the distance measure itself)
[Figure: the same pair of people compared under Last Name Similarity and under Skin Color Similarity, giving distances of 0 and 115.]
25 Sometimes we are given the perfect features to measure similarity; sometimes we need to:
- Generate Features: Suppose we hope to find similar people with regard to their medical conditions. Knowing both their height and weight is not helpful; knowing their BMI is. (BMI = weight in kilograms / (height in meters)²; see the sketch below.)
- Clean Features: Our features may contain noise or outliers.
- Normalize Features: We may need to transform features.
- Reduce Features: We may have too many features to do efficient similarity measurement, so dimensionality reduction may be necessary.
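A tiny sketch of the feature-generation step, with made-up patient records:

# Derive a more useful feature (BMI) from raw ones (height, weight).
patients = [
    {"name": "A", "height_m": 1.80, "weight_kg": 90.0},   # invented data
    {"name": "B", "height_m": 1.60, "weight_kg": 55.0},
]
for p in patients:
    p["bmi"] = p["weight_kg"] / p["height_m"] ** 2
    print(p["name"], round(p["bmi"], 1))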
26 There is no single magic black box for measuring similarity
However, there are two useful and general tricks:
- Project the data into feature space; the distance in feature space (appropriately measured) becomes the similarity.
- Transform one object into the other; the cost of this transformation becomes the similarity.
Feature Projection
Edit Distance
27 Feature Projection Example I
Use the features to project the items into feature space. The distance between two objects in this space (appropriately measured) is the measure of similarity.
[Figure: scatter plot of birds in feature space. X-axis: Body Mass (1-10); Y-axis: Ratio of beak length over body length (0.1-1.0). From left to right: Bee Hummingbird, Costa's Hummingbird, Ruby Topaz Hummingbird, Kestrel, Gyrfalcon, Bald Eagle.]
28 Feature Projection Example II
R. A. Fisher's Iris dataset: 3 variations of the Iris flower, 50 of each.
29 A generic technique for measuring similarity
To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.
The distance between Patty and Selma:
Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty, Selma) = 3
The distance between Marge and Selma:
Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge, Selma) = 5
30 Edit Distance Example I
How similar are the names "Peter" and "Piotr"? Assume the following cost function: Substitution = 1 unit, Insertion = 1 unit, Deletion = 1 unit. Then D(Peter, Piotr) = 3.
It is possible to transform any string Q into string C using only substitution, insertion, and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.)
Peter -> Piter (substitution: i for e) -> Pioter (insertion: o) -> Piotr (deletion: e)
31 Edit Distance Example II
We can make two time series appear more similar by making one point in one map onto two (or more) points in the other. For example, suppose we have Q = 5, 6, 8, 8, 7 and C = 5, 6, 6, 8, 7.
A one-to-one measure would have to match an 8 in Q to a 6 in C. However, if we allow nonlinear alignments, every number can match with itself. Another way of looking at it is as an attempt to make the two sequences more similar by inserting values.
This is called Dynamic Time Warping.
32 Dynamic Time Warping
Fixed Time Axis: Sequences are aligned one to one.
Warped Time Axis: Nonlinear alignments are possible.
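A minimal dynamic-programming sketch of DTW (quadratic time and space; absolute difference as the point-to-point cost, which is one common choice):

# Dynamic Time Warping by dynamic programming.
def dtw(q, c):
    n, m = len(q), len(c)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - c[j - 1])
            # cheapest way to get here: match, or repeat a point of q or of c
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

Q = [5, 6, 8, 8, 7]
C = [5, 6, 6, 8, 7]
print(dtw(Q, C))   # 0.0: with warping, every point matches exactly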
33 The Minkowski Metric
So, we have projected our objects into feature space. How do we measure the distance between points?
Assume Q and C are vectors of features measured from the objects of interest:
D(Q, C) = (sum over i of |qi - ci|^p)^(1/p)
p = 1: Manhattan (Rectilinear, City Block)
p = 2: Euclidean
p = infinity: Max (Supremum, sup)
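The formula in code, with an optional weight vector anticipating the weighted version a few slides ahead (a sketch; assumes Q and C have the same length):

# Minkowski distance: D(Q, C) = (sum_i w_i * |q_i - c_i|^p)^(1/p)
def minkowski(q, c, p=2, w=None):
    w = w if w is not None else [1.0] * len(q)
    return sum(wi * abs(qi - ci) ** p for qi, ci, wi in zip(q, c, w)) ** (1 / p)

def chebyshev(q, c):
    # the p -> infinity limit: the largest coordinate difference
    return max(abs(qi - ci) for qi, ci in zip(q, c))

print(minkowski([5, 96.8], [1, 96.8], p=1))  # Manhattan: 4.0
print(minkowski([5, 96.8], [1, 96.8], p=2))  # Euclidean: 4.0
print(chebyshev([5, 96.8], [1, 96.8]))       # Max: 4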
34 The Minkowski Metric: a Weakness
Suppose we have a database of 3 items, with 2 features: number of children and temperature. We want to know who is most similar to Mr. Red under the Euclidean distance.
[Figure: the three items plotted twice. With temperature in Fahrenheit the points are (5, 96.8), (1, 96.8), (5, 102.2); with temperature in Celsius they are (5, 36), (1, 36), (5, 38). In one plot Green is closest to Red; in the other, Blue is closest to Red.]
The Minkowski metric is sensitive to the units used to measure features, a very undesirable property since the units are usually arbitrary. Two solutions suggest themselves: normalize the features, or use a weighted version of the Minkowski metric.
35 Normalizing Features
Let C be a database of items, with the ith feature denoted by ci. To normalize the database:

for each feature ci
    ci = (ci - mean(ci)) / std(ci)
end

After normalization each feature will have a mean of zero and a standard deviation of one.
Note that in both these images the axes are square (there is the same number of pixels per unit in both the X and Y directions).
Before normalization, the Y-axis dominates; after normalization, both axes are equally important.
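The same loop in NumPy (a sketch; X holds one row per item and one column per feature):

import numpy as np

def z_normalize(X):
    # Give every column (feature) mean 0 and standard deviation 1.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[5, 96.8], [1, 96.8], [5, 102.2]])   # the items from slide 34
print(z_normalize(X))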
36 The Weighted Minkowski Metric
Assume Q and C are vectors of features measured from the objects of interest. Further assume that W is a vector containing the relative importance of the features:
D(Q, C, W) = (sum over i of wi * |qi - ci|^p)^(1/p)
But how do we know the weights?
37 The Minkowski Metrics have Simple Geometric Interpretations
[Figure: the set of points equidistant from a center, drawn for the Euclidean, Weighted Euclidean, Manhattan, and Max distances.]
38 [Figure-only slide; no text.]
39 What is Clustering?
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
- Organizing data into classes such that there is
  - high intra-class similarity
  - low inter-class similarity
- Finding the class labels and the number of classes directly from the data (in contrast to classification).
- More informally, finding natural groupings among objects.
40 What is a natural grouping among these objects?
41 What is a natural grouping among these objects? Clustering is subjective!
[Figure: the same characters grouped four different ways: School Employees, Simpson's Family, Males, Females.]
42 Even if we know in advance the number of clusters we expect to see, the clustering obtained may be subjective.
43 Two Types of Clustering
- Partitional algorithms: Construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using some criterion.
[Figure: a partitional clustering shown beside a hierarchical one (a dendrogram).]
44 Desirable Properties of a Clustering Algorithm
- Scalability (in terms of both time and space)
- Ability to deal with different data types
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
45 Hierarchical Clustering
Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
The number of dendrograms with n leaves = (2n - 3)! / (2^(n-2) * (n - 2)!)   (see the check below)

Number of Leaves | Number of Possible Dendrograms
2   | 1
3   | 3
4   | 15
5   | 105
... | ...
10  | 34,459,425
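A quick check of the formula (it equals the double factorial (2n - 3)!!):

from math import factorial

def num_dendrograms(n):
    # (2n - 3)! / (2^(n-2) * (n - 2)!), the number of rooted binary
    # trees over n labeled leaves.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425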
46 We begin with a distance matrix which contains the distances between every pair of objects in our database.
[Figure: a pairwise distance matrix over the objects, with entries such as D(·,·) = 8 and D(·,·) = 1.]
47 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges
Choose the best
48 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
49 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
50 Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
51 In the first iteration of agglomerative clustering we merged two objects, so we need to remove them from the matrix. We now need to add the new merged cluster to our new, smaller matrix.
But what values do we fill in? What is the distance from the merged cluster to each of the remaining objects?
52 We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is not obvious.
- Single linkage (nearest neighbor): The distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
- Complete linkage (furthest neighbor): The distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
- Group average: The distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
53 Using single linkage (nearest neighbor), each new entry is the minimum over the merged cluster's members:
D(merged cluster, x) = min( D(·, x), D(·, x) )
In the slide's example, the two new entries are min(...) = 4 and min(...) = 7.
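The three cluster-to-cluster distances from slide 52, written as code over an object-to-object distance function D (a sketch; A and B are collections of objects):

# Three ways to lift an object-to-object distance to clusters.
def single_linkage(A, B, D):     # nearest neighbors
    return min(D(a, b) for a in A for b in B)

def complete_linkage(A, B, D):   # furthest neighbors
    return max(D(a, b) for a in A for b in B)

def group_average(A, B, D):      # mean over all cross-cluster pairs
    return sum(D(a, b) for a in A for b in B) / (len(A) * len(B))

# e.g. single_linkage([1.0, 2.0], [6.0, 9.0], lambda a, b: abs(a - b)) -> 4.0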
54 Summary of Hierarchical Clustering Methods
- No need to specify the number of clusters in advance.
- The hierarchical nature maps nicely onto human intuition for some domains.
- They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
- Like any heuristic search algorithm, local optima are a problem.
- Interpretation of results is subjective.