Title: Data Warehousing FS 08
1Data Warehousing FS 08: Data Mining
Based on the VLDB 2006 tutorial slides by Eamonn Keogh (eamonn_at_cs.ucr.edu); Jens Dittrich
2Data Mining Definition
- Finding hidden information in a database
- Data Mining has been defined as
- The nontrivial extraction of implicit,
previously unknown, and potentially useful
information from data. - Similar terms
- Exploratory data analysis
- Data driven discovery
- Deductive learning
- Discovery Science
- Knowledge Discovery
G. Piatetsky-Shapiro and W. J. Frawley,
Knowledge Discovery in Databases, AAAI/MIT Press,
1991.
3Database vs. Data Mining
- Query
- Database: well defined, posed in a precise query language (SQL)
- Data mining: poorly defined, no precise query language
- Output
- Database: a subset of the database
- Data mining: not a subset of the database
- Field
- Database: mature; hard to publish a bad SIGMOD/VLDB paper
- Data mining: still in its infancy; easy to publish a bad SIGKDD paper!
4Query Examples
- Database
- Find all customers that live in Boa Vista
- Find all customers that use Mastercard
- Find all customers that missed one payment
- Data mining
- Find all customers that are likely to miss one payment (Classification)
- Group all customers with similar buying habits (Clustering)
- List all items that are frequently purchased with bicycles (Association rules)
- Find any unusual customers (Outlier detection, anomaly discovery)
5The Major Data Mining Tasks
- Classification
- Clustering
- Associations
Most of the other tasks (for example, outlier discovery or anomaly detection) make heavy use of one or more of the above. So in this tutorial we will focus most of our energy on the above, starting with...
6The Classification Problem (informal definition)
Katydids
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
Grasshoppers
Katydid or Grasshopper?
7For any domain of interest, we can measure
features
Color Green, Brown, Gray, Other
Has Wings?
Thorax Length
Abdomen Length
Antennae Length
Mandible Size
Spiracle Diameter
Leg Length
8My_Collection
We can store features in a database.
- The classification problem can now be expressed as:
- Given a training database (My_Collection), predict the class label of a previously unseen instance
9Grasshoppers
Katydids
Antenna Length
Abdomen Length
10Grasshoppers
Katydids
We will also use this larger dataset as a motivating example.
Antenna Length
- Each of these data objects is called
- exemplars
- (training) examples
- instances
- tuples
Abdomen Length
11We will return to the previous slide in two
minutes. In the meantime, we are going to play a
quick game. I am going to show you some
classification problems which were shown to
pigeons! Let us see if you are as smart as a
pigeon!
12Pigeon Problem 1
13Pigeon Problem 1
What class is this object?
8 1.5
What about this one, A or B?
4.5 7
14Pigeon Problem 1
This is a B!
8 1.5
Here is the rule. If the left bar is smaller than
the right bar, it is an A, otherwise it is a B.
15Pigeon Problem 2
Oh! This one's hard!
Examples of class A
Examples of class B
8 1.5
4 4
Even I know this one
5 5
6 6
7 7
3 3
16Pigeon Problem 2
Examples of class A
Examples of class B
The rule is as follows: if the two bars are of equal size, it is an A. Otherwise it is a B.
4 4
5 5
So this one is an A.
6 6
7 7
3 3
17Pigeon Problem 3
Examples of class A
Examples of class B
6 6
This one is really hard! What is this, A or B?
4 4
1 5
6 3
3 7
18Pigeon Problem 3
It is a B!
Examples of class A
Examples of class B
6 6
4 4
The rule is as follows, if the square of the sum
of the two bars is less than or equal to 100, it
is an A. Otherwise it is a B.
1 5
6 3
3 7
19Why did we spend so much time with this game?
Because we wanted to show that almost all
classification problems have a geometric
interpretation, check out the next 3 slides
20Pigeon Problem 1
Here is the rule again. If the left bar is
smaller than the right bar, it is an A, otherwise
it is a B.
21Pigeon Problem 2
Examples of class A
Examples of class B
4 4
5 5
Let me look it up... here it is: the rule is, if the two bars are of equal size, it is an A. Otherwise it is a B.
6 6
3 3
22Pigeon Problem 3
Examples of class A
Examples of class B
4 4
1 5
6 3
The rule again if the square of the sum of the
two bars is less than or equal to 100, it is an
A. Otherwise it is a B.
3 7
23Grasshoppers
Katydids
Antenna Length
Abdomen Length
24previously unseen instance
- We can project the previously unseen instance into the same space as the database.
- We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.
Antenna Length
Abdomen Length
25Simple Linear Classifier
R.A. Fisher 1890-1962
If the previously unseen instance is above the line, then its class is Katydid; else its class is Grasshopper.
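As a minimal illustration (not necessarily the exact method on the slide), the following Python sketch builds one possible simple linear classifier for the two-dimensional insect data: it uses the perpendicular bisector of the two class means as the separating line. The feature values and function names are invented for the example.

import numpy as np

def fit_linear_boundary(X, y):
    # A very simple linear classifier: the decision boundary is the
    # perpendicular bisector of the two class means.
    m0 = X[y == 0].mean(axis=0)          # mean of class 0 (Grasshopper)
    m1 = X[y == 1].mean(axis=0)          # mean of class 1 (Katydid)
    w = m1 - m0                          # normal vector of the boundary
    b = -np.dot(w, (m0 + m1) / 2.0)      # boundary passes through the midpoint
    return w, b

def classify(w, b, x):
    # Positive side of the line -> Katydid, otherwise Grasshopper.
    return "Katydid" if np.dot(w, x) + b > 0 else "Grasshopper"

# Hypothetical (antenna length, abdomen length) training instances.
X = np.array([[2.7, 5.5], [2.9, 6.1], [8.0, 9.1], [7.1, 8.8]])
y = np.array([0, 0, 1, 1])               # 0 = Grasshopper, 1 = Katydid
w, b = fit_linear_boundary(X, y)
print(classify(w, b, np.array([7.5, 9.0])))   # previously unseen instance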
26The simple linear classifier is defined for
higher dimensional spaces
27 we can visualize it as being an n-dimensional
hyperplane
28It is interesting to think about what would
happen in this example if we did not have the 3rd
dimension
29We can no longer get perfect accuracy with the simple linear classifier. We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier... However, as we will later see, this is probably a bad idea.
30Which of the Pigeon Problems can be solved by
the Simple Linear Classifier?
- Perfect
- Useless
- Pretty Good
Problems that can be solved by a linear classifier are called linearly separable.
31Virginica
- A Famous Problem
- R. A. Fisher's Iris Dataset.
- 3 classes
- 50 of each class
- The task is to classify Iris plants into one of 3
varieties using the Petal Length and Petal Width.
Setosa
Versicolor
32We can generalize the piecewise linear classifier
to N classes, by fitting N-1 lines. In this case
we first learned the line to (perfectly)
discriminate between Setosa and
Virginica/Versicolor, then we learned to
approximately discriminate between Virginica and
Versicolor.
If petal width > 3.272 - (0.325 × petal length) then class = Virginica
Elseif petal width ...
33We have now seen one classification algorithm,
and we are about to see more. How should we
compare them?
- Predictive accuracy
- Speed and scalability
- time to construct the model
- time to use the model
- efficiency in disk-resident databases
- Robustness
- handling noise, missing values, irrelevant features, and streaming data
- Interpretability
- understanding and insight provided by the model
34Predictive Accuracy I
- How do we estimate the accuracy of our classifier?
- We can use K-fold cross validation.
We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead.
Accuracy = (Number of correct classifications) / (Number of instances in our database)
K = 5
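As a sketch of the procedure just described (a generic illustration, not code from the tutorial), the function below runs K-fold cross validation for any classifier supplied as fit/predict callables; the names k_fold_accuracy, fit, and predict are ours, and the demo data is invented.

import numpy as np

def k_fold_accuracy(X, y, fit, predict, K=5, seed=0):
    # Split the data into K roughly equal sections, hold each one out in turn,
    # train on the rest, and count correct predictions on the held-out section.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    correct = 0
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = fit(X[train], y[train])
        correct += sum(predict(model, x) == label for x, label in zip(X[test], y[test]))
    # Accuracy = number of correct classifications / number of instances
    return correct / len(X)

# Example with a trivial "majority class" classifier (for illustration only).
fit = lambda X, y: np.bincount(y).argmax()
predict = lambda model, x: model
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])
print(k_fold_accuracy(X_demo, y_demo, fit, predict, K=5))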
35Predictive Accuracy II
- Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier.
- We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model.
- Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later).
Accuracy = 94%
Accuracy = 100%
Accuracy = 100%
36Predictive Accuracy III
Accuracy = (Number of correct classifications) / (Number of instances in our database)
Accuracy is a single number; we may be better off looking at a confusion matrix, which gives us additional useful information.
(Confusion matrix figure: rows give the true label, columns give what each instance was classified as.)
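A small illustrative sketch (our own, not from the slides) of how a confusion matrix is built and how the accuracy number can be read back off its diagonal; the labels are invented.

import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    # Rows = true class, columns = what the classifier said.
    M = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        M[t, p] += 1
    return M

true = [0, 0, 1, 1, 2, 2, 2]
pred = [0, 1, 1, 1, 2, 0, 2]
M = confusion_matrix(true, pred, n_classes=3)
print(M)                        # off-diagonal entries show which classes get confused
print(M.trace() / M.sum())      # the single accuracy number, recovered from the matrix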
37Speed and Scalability I
- We need to consider the time and space requirements for the two distinct phases of classification:
- Time to construct the classifier
- In the case of the simple linear classifier, this is the time taken to fit the line, which is linear in the number of instances.
- Time to use the model
- In the case of the simple linear classifier, this is the time taken to test which side of the line the unlabeled instance falls on. This can be done in constant time.
As we shall see, some classification algorithms
are very efficient in one aspect, and very poor
in the other.
38Speed and Scalability II
For learning with small datasets, this is the whole picture. However, for data mining with
massive datasets, it is not so much the (main
memory) time complexity that matters, rather it
is how many times we have to scan the database.
This is because for most data mining operations,
disk access times completely dominate the CPU
times. For data mining, researchers often report
the number of times you must scan the database.
39Robustness I
- We need to consider what happens when we have
- Noise
- For example, a person's age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This matters only when building the classifier; if the instance to be classified is noisy, we can do nothing about it.)
- Missing values
- For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis) and not the antennae length (Y-axis); can we still classify the instance?
40Robustness II
- We need to consider what happens when we have
- Irrelevant features
- For example, suppose we want to classify people
as either - Suitable_Grad_Student
- Unsuitable_Grad_Student
- And it happens that scoring more than 5 on a
particular test is a perfect indicator for this
problem
If we also use hair_length as a feature, how will this affect our classifier?
41Robustness III
- We need to consider what happens when we have
- Streaming data
For many real world problems, we don't have a single fixed dataset. Instead, the data continuously arrives, potentially forever (stock market, weather data, sensor data, etc.). Can our classifier handle streaming data?
42Interpretability
Some classifiers offer a bonus feature. The structure of the learned classifier tells us something about the domain.
As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny.
Weight
Height
43Nearest Neighbor Classifier
Evelyn Fix 1904-1965
Joe Hodges 1922-2000
Antenna Length
If the nearest instance to the previously unseen instance is a Katydid, then its class is Katydid; else its class is Grasshopper.
Abdomen Length
44We can visualize the nearest neighbor algorithm
in terms of a decision surface
Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions belonging to each instance.
This division of space is called a Dirichlet tessellation (or Voronoi diagram, or Thiessen regions).
45The nearest neighbor algorithm is sensitive to
outliers
The solution is to...
46We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number.
K = 1
K = 3
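A minimal sketch of the K-nearest neighbor rule just described, using Euclidean distance and a majority vote; the measurements below are invented for illustration.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    # Find the k training instances closest to x and let them vote.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (antenna length, abdomen length) training data.
X = np.array([[2.7, 5.5], [2.9, 6.1], [8.0, 9.1], [7.1, 8.8], [3.1, 5.9]])
y = np.array(["Grasshopper", "Grasshopper", "Katydid", "Katydid", "Grasshopper"])
print(knn_classify(X, y, np.array([7.5, 9.0]), k=3))   # votes: 2 Katydid, 1 Grasshopper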
47The nearest neighbor algorithm is sensitive to
irrelevant features
Training data
Suppose the following is true: if an insect's antenna is longer than 5.5, it is a Katydid, otherwise it is a Grasshopper. Using just the antenna length we get perfect classification!
Suppose, however, we add in an irrelevant feature, for example the insect's mass. Using both the antenna length and the insect's mass with the 1-NN algorithm, we get the wrong classification!
48How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?
- Use more training instances
- Ask an expert what features are relevant to the task
- Use statistical tests to try to determine which features are useful
- Search over feature subsets (on the next slide we will see why this is hard)
49Why searching over feature subsets is hard
Suppose you have the following classification
problem, with 100 features, where it happens that
Features 1 and 2 (the X and Y below) give perfect
classification, but all 98 of the other features
are irrelevant
Only Feature 2
Only Feature 1
Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 - 1 possible subsets of the features, only one really works.
50The nearest neighbor algorithm is sensitive to
the units of measurement
X axis measured in millimeters, Y axis measured in dollars. The nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure
numbers.
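One common way to normalize the units to pure numbers is z-score normalization; this sketch (ours, with invented values) rescales each feature to zero mean and unit variance before distances are computed.

import numpy as np

def zscore_normalize(X):
    # Rescale each column to zero mean and unit variance so that no feature
    # dominates the distance calculation just because of its units.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return (X - mu) / sigma

# Hypothetical data: column 0 in millimeters, column 1 in dollars.
X = np.array([[510.0, 2.3], [525.0, 1.9], [880.0, 2.1]])
print(zscore_normalize(X))            # both columns now on a comparable scale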
51We can speed up the nearest neighbor algorithm by throwing away some data. This is called data editing. Note that this can sometimes improve accuracy!
We can also speed up classification with indexing
One possible approach. Delete all instances that
are surrounded by members of their own class.
52Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case.
Max (p = ∞)
Manhattan (p = 1)
Weighted Euclidean
Mahalanobis
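As a sketch of the alternatives listed above (Mahalanobis is omitted for brevity), the Minkowski family covers Manhattan, Euclidean, and, in the limit, the Max distance; the vectors and weights are invented.

import numpy as np

def minkowski(a, b, p):
    # L_p distance: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def max_distance(a, b):
    # The p -> infinity limit of the Minkowski distance (Chebyshev).
    return np.max(np.abs(a - b))

def weighted_euclidean(a, b, w):
    # Per-feature weights let important features count for more.
    return np.sqrt(np.sum(w * (a - b) ** 2))

a, b = np.array([1.0, 5.0]), np.array([4.0, 1.0])
print(minkowski(a, b, 1), minkowski(a, b, 2), max_distance(a, b))
print(weighted_euclidean(a, b, w=np.array([1.0, 0.1])))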
53In fact, we can use the nearest neighbor
algorithm with any distance/similarity function
For example, is "Faloutsos" Greek or Irish? We could compare the name "Faloutsos" to a database of names using string edit distance:
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.
Specialized distance measures exist for DNA
strings, time series, images, graphs, videos,
sets, fingerprints etc
54Advantages/Disadvantages of Nearest Neighbor
- Advantages
- Simple to implement
- Handles correlated features (Arbitrary class
shapes) - Defined for any distance measure
- Handles streaming data trivially
- Disadvantages
- Very sensitive to irrelevant features.
- Slow classification time for large datasets
- Works best for real valued datasets
55Decision Tree Classifier
Ross Quinlan
Abdomen Length > 7.1?
Antenna Length
yes
no
Antenna Length > 6.0?
Katydid
yes
no
Katydid
Grasshopper
Abdomen Length
56Antennae shorter than body?
Yes
No
3 Tarsi?
Grasshopper
Yes
No
Foretibia has ears?
Yes
No
Cricket
Decision trees predate computers
Katydids
Camel Cricket
57Decision Tree Classification
- Decision tree
- A flow-chart-like tree structure
- Internal node denotes a test on an attribute
- Branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
- Tree construction
- At start, all the training examples are at the root
- Partition examples recursively based on selected attributes
- Tree pruning
- Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree
58How do we construct the decision tree?
- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they can be discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure, e.g., information gain (see the sketch after this list)
- Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
59We don't need to keep the data around, just the test conditions.
Weight <= 160?
yes
no
How would these people be classified?
Hair Length <= 2?
Male
yes
no
Male
Female
60It is trivial to convert Decision Trees to rules
Weight <= 160?
yes
no
Hair Length <= 2?
Male
no
yes
Male
Female
Rules to Classify Males/Females:
If Weight > 160, classify as Male
Elseif Hair Length <= 2, classify as Male
Else classify as Female
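Written directly as code, the rules above become a chain of if/elif/else tests; this is only an illustration of the conversion, with the thresholds taken from the slide.

def classify(weight, hair_length):
    # The decision tree from the slide, expressed as rules.
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(weight=180, hair_length=10))   # first rule fires -> Male
print(classify(weight=130, hair_length=1))    # second rule fires -> Male
print(classify(weight=130, hair_length=9))    # default -> Female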
61Once we have learned the decision tree, we don't even need a computer!
This decision tree is attached to a medical
machine, and is designed to help nurses make
decisions about what type of doctor to call.
Decision tree for a typical shared-care setting
applying the system for the diagnosis of
prostatic obstructions.
GP = general practitioner
62The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.
Yes
No
Wears green?
Male
Female
For example, the rule "Wears green?" perfectly classifies the data; so does "Mother's name is Jacqueline?"; so does "Has blue shoes?"
63Avoid Overfitting in Classification
- The generated tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy for unseen samples
- Two approaches to avoid overfitting
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
- Difficult to choose an appropriate threshold
- Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees
- Use a set of data different from the training data to decide which is the best pruned tree
64Which of the Pigeon Problems can be solved by a
Decision Tree?
- Deep Bushy Tree
- Useless
- Deep Bushy Tree
?
The Decision Tree has a hard time with correlated
attributes
65Advantages/Disadvantages of Decision Trees
- Advantages
- Easy to understand (Doctors love them!)
- Easy to generate rules
- Disadvantages
- May suffer from overfitting.
- Classifies by rectangular partitioning (so does not handle correlated features very well).
- Can be quite large; pruning is necessary.
- Does not handle streaming data easily
66Summary of Classification
- We have seen 3 major classification techniques
- Simple linear classifier, Nearest neighbor,
Decision tree. - There are other techniques
- Neural Networks, Support Vector Machines,
Genetic algorithms.. - In general, there is no one best classifier for
all problems. You have to consider what you hope to achieve, and the data itself.
Let us now move on to the other classic problem of data mining and machine learning: Clustering.
67What is Clustering?
Also called unsupervised learning, sometimes
called classification by statisticians and
sorting by psychologists and segmentation by
people in marketing
- Organizing data into classes such that there is
- high intra-class similarity
- low inter-class similarity
- Finding the class labels and the number of classes directly from the data (in contrast to classification).
- More informally, finding natural groupings among objects.
68What is a natural grouping among these objects?
69What is a natural grouping among these objects?
Clustering is subjective
Simpson's Family
Males
Females
School Employees
70What is Similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features. (Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
71Defining Distance Measures
Definition Let O1 and O2 be two objects from the
universe of possible objects. The distance
(dissimilarity) between O1 and O2 is a real
number denoted by D(O1,O2)
(Figure: three black-box distance functions applied to "Peter" and "Piotr" return 0.23, 3, and 342.7.)
72When we peek inside one of these black boxes, we see some function of two variables. These functions might be very simple or very complex. In either case, it is natural to ask: what properties should these functions have?
One of the boxes contains the string edit distance recurrence:
d('', '') = 0
d(s, '') = d('', s) = |s|   (the length of s)
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (0 if ch1 = ch2 else 1), d(s1+ch1, s2) + 1, d(s1, s2+ch2) + 1 )
Applied to "Peter" and "Piotr", this box returns a distance of 3.
- What properties should a distance measure have?
- D(A,B) = D(B,A)   (Symmetry)
- D(A,A) = 0   (Constancy of Self-Similarity)
- D(A,B) = 0 iff A = B   (Positivity / Separation)
- D(A,B) <= D(A,C) + D(B,C)   (Triangular Inequality)
73Intuitions behind desirable distance measure
properties
D(A,B) = D(B,A)   (Symmetry)
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0   (Constancy of Self-Similarity)
Otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B   (Positivity / Separation)
Otherwise there are objects in your world that are different, but you cannot tell them apart.
D(A,B) <= D(A,C) + D(B,C)   (Triangular Inequality)
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
74Two Types of Clustering
- Partitional algorithms Construct various
partitions and then evaluate them by some
criterion (we will see an example called BIRCH) - Hierarchical algorithms Create a hierarchical
decomposition of the set of objects using some
criterion
Partitional
Hierarchical
75Desirable Properties of a Clustering Algorithm
- Scalability (in terms of both time and space)
- Ability to deal with different data types
- Minimal requirements for domain knowledge to
determine input parameters - Able to deal with noise and outliers
- Insensitive to order of input records
- Incorporation of user-specified constraints
- Interpretability and usability
76A Useful Tool for Summarizing Similarity
Measurements
In order to better appreciate and evaluate the
examples given in the early part of this talk, we
will now introduce the dendrogram.
The similarity between two objects in a
dendrogram is represented as the height of the
lowest internal node they share.
77There is only one dataset that can be perfectly
clustered using a hierarchy
(Bovine:0.69395, (Spider Monkey:0.390, (Gibbon:0.36079, (Orang:0.33636, (Gorilla:0.17147, (Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939);
78Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo's hierarchy is manually created; we will focus on the automatic creation of hierarchies in data mining.
Business Economy
B2B Finance Shopping Jobs
Aerospace Agriculture Banking Bonds Animals
Apparel Career Workspace
79A Demonstration of Hierarchical Clustering using
String Edit Distance
Pedro (Portuguese) Petros (Greek), Peter
(English), Piotr (Polish), Peadar (Irish),
Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr
(Russian) Cristovao (Portuguese) Christoph
(German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer
(Scandinavian), Krystof (Czech), Christopher
(English) Miguel (Portuguese) Michalis (Greek),
Michael (English), Mick (Irish!)
Piotr
Peka
Mick
Piero
Peter
Pyotr
Pedro
Peder
Pietro
Pierre
Petros
Miguel
Peadar
Krystof
Michael
Michalis
Crisdean
Cristobal
Cristovao
Christoph
Kristoffer
Cristoforo
Christophe
Christopher
80Pedro (Portuguese/Spanish) Petros (Greek), Peter
(English), Piotr (Polish), Peadar (Irish),
Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr (Russian)
Piotr
Peka
Pedro
Peter
Piero
Pyotr
Peder
Pierre
Pietro
Petros
Peadar
81- Hierarchical clustering can sometimes show patterns that are meaningless or spurious.
- For example, in this clustering, the tight grouping of Australia, Anguilla, St. Helena, etc. is meaningful, since all these countries are former UK colonies.
- However, the tight grouping of Niger and India is completely spurious; there is no connection between the two.
82- The flag of Niger is orange over white over
green, with an orange disc on the central white
stripe, symbolizing the sun. The orange stands for the Sahara desert, which borders Niger to the north. Green stands for the grassy plains of the
south and west and for the River Niger which
sustains them. It also stands for fraternity and
hope. White generally symbolizes purity and hope.
- The Indian flag is a horizontal tricolor in
equal proportion of deep saffron on the top,
white in the middle and dark green at the bottom.
In the center of the white band, there is a wheel
in navy blue to indicate the Dharma Chakra, the
wheel of law in the Sarnath Lion Capital. This
center symbol or the 'CHAKRA' is a symbol dating
back to the 2nd century BC. The saffron stands for courage and sacrifice; the white, for purity and truth; the green, for growth and auspiciousness.
83We can look at the dendrogram to determine the
correct number of clusters. In this case, the
two highly separated subtrees are highly
suggestive of two clusters. (Things are rarely
this clear cut, unfortunately)
84One potential use of a dendrogram is to detect
outliers
The single isolated branch is suggestive of a
data point that is very different to all others
Outlier
85(How-to) Hierarchical Clustering
Since we cannot test all possible trees, we will have to do a heuristic search over the space of possible trees. We could do this...
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
- The number of dendrograms with n leaves = (2n - 3)! / (2^(n-2) (n-2)!)
- Number of Leaves    Number of Possible Dendrograms
- 2                   1
- 3                   3
- 4                   15
- 5                   105
- ...                 ...
- 10                  34,459,425
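The formula above is easy to check numerically; this small sketch (ours) reproduces the table, including the 34,459,425 dendrograms for 10 leaves.

from math import factorial

def num_dendrograms(n):
    # Number of distinct rooted binary dendrograms over n labeled leaves:
    # (2n - 3)! / (2^(n-2) * (n-2)!)
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in [2, 3, 4, 5, 10]:
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425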
86We begin with a distance matrix which contains
the distances between every pair of objects in
our database.
D( , ) = 8    D( , ) = 1
87Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
88Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
89Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
90Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
91We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or the distance between two clusters, is non-obvious.
- Single linkage (nearest neighbor): the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
- Complete linkage (furthest neighbor): the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
- Group average linkage: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
- Ward's linkage: we try to minimize the variance of the merged clusters.
(A sketch of the first three linkages follows below.)
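As a sketch of the first three linkage criteria in the list above (our own illustration, with made-up points), each function takes two clusters, given as lists of points, and returns the between-cluster distance.

import numpy as np

def single_linkage(A, B):
    # Distance of the two closest objects across the clusters.
    return min(np.linalg.norm(a - b) for a in A for b in B)

def complete_linkage(A, B):
    # Distance of the two furthest objects across the clusters.
    return max(np.linalg.norm(a - b) for a in A for b in B)

def average_linkage(A, B):
    # Average distance over all cross-cluster pairs.
    return float(np.mean([np.linalg.norm(a - b) for a in A for b in B]))

A = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
B = [np.array([4.0, 0.0]), np.array([9.0, 0.0])]
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))
# 3.0 9.0 6.0 -- the same pair of clusters looks much closer under single linkage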
92Single linkage
Average linkage
Ward's linkage
93- Summary of Hierarchical Clustering Methods
- No need to specify the number of clusters in advance.
- The hierarchical nature maps nicely onto human intuition for some domains.
- They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
- Like any heuristic search algorithm, local optima are a problem.
- Interpretation of results is (very) subjective.
94Up to this point we have simply assumed that we can measure similarity, but... how do we measure similarity?
(Figure: the black-box distance functions applied to "Peter" and "Piotr" again return 0.23, 3, and 342.7.)
95A generic technique for measuring similarity
To measure the similarity between two objects,
transform one of the objects into the other, and
measure how much effort it took. The measure of
effort becomes the distance measure.
The distance between Patty and Selma:
Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty, Selma) = 3
The distance between Marge and Selma:
Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge, Selma) = 5
This is called the edit distance or the
transformation distance
96Edit Distance Example
How similar are the names "Peter" and "Piotr"? Assume the following cost function:
Substitution = 1 unit
Insertion = 1 unit
Deletion = 1 unit
Then D(Peter, Piotr) is 3.
It is possible to transform any string Q into string C using only substitution, insertion, and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can then be defined as the cost of the cheapest transformation from Q to C. Note that for now we have ignored the issue of how we can find this cheapest transformation.
Peter -> Piter -> Pioter -> Piotr
Substitution (i for e), Insertion (o), Deletion (e)
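The cheapest transformation does not have to be found by trial and error; the dynamic-programming recurrence shown earlier (slide 72) computes it directly. This sketch is a generic Levenshtein implementation, not code from the tutorial.

def edit_distance(q, c):
    # Minimum number of substitutions, insertions, and deletions (cost 1 each)
    # needed to turn string q into string c.
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete everything from q
    for j in range(n + 1):
        d[0][j] = j                               # insert everything from c
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitute (or match)
                          d[i - 1][j] + 1,        # delete from q
                          d[i][j - 1] + 1)        # insert into q
    return d[m][n]

print(edit_distance("Peter", "Piotr"))        # 3, as on the slide
print(edit_distance("Faloutsos", "Keogh"))    # 8, as on slide 53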
97Partitional Clustering
- Nonhierarchical, each instance is placed in
exactly one of K nonoverlapping clusters. - Since only one set of clusters is output, the
user normally has to input the desired number of
clusters K.
98Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to 3.
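A compact sketch of the five steps above (Lloyd's algorithm); the function name and the synthetic two-blob data are ours.

import numpy as np

def k_means(X, k, seed=0, max_iter=100):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()   # step 2
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every object to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                    # step 5: nothing changed
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
centers, labels = k_means(X, k=2)
print(centers)        # should land near (0, 0) and (5, 5)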
99K-means Clustering Step 1
Algorithm: k-means, Distance Metric: Euclidean Distance
100K-means Clustering Step 2
Algorithm: k-means, Distance Metric: Euclidean Distance
101K-means Clustering Step 3
Algorithm: k-means, Distance Metric: Euclidean Distance
102K-means Clustering Step 4
Algorithm: k-means, Distance Metric: Euclidean Distance
103K-means Clustering Step 5
Algorithm: k-means, Distance Metric: Euclidean Distance
104Comments on the K-Means Method
- Strength
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weakness
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
105The K-Medoids Clustering Method
- Find representative objects, called medoids, in
clusters - PAM (Partitioning Around Medoids, 1987)
- starts from an initial set of medoids and
iteratively replaces one of the medoids by one of
the non-medoids if it improves the total distance
of the resulting clustering - PAM works effectively for small data sets, but
does not scale well for large data sets
106What happens if the data is streaming?
Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification)
- Items are iteratively merged into the existing cluster that is closest.
- Incremental
- A threshold, t, is used to determine whether items are added to existing clusters or a new cluster is created (see the sketch after this list).
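A sketch of the incremental scheme just listed (sometimes called leader clustering); every name here is ours, and the stream values are invented.

import numpy as np

def nn_clustering(stream, t):
    # Each arriving item joins the nearest existing cluster if it lies within
    # threshold t of that cluster's center; otherwise it starts a new cluster.
    centers, counts, labels = [], [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centers:
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
        if not centers or dists[j] > t:
            centers.append(x.copy())                     # create a new cluster
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            counts[j] += 1                               # add to the closest cluster
            centers[j] += (x - centers[j]) / counts[j]   # update its running mean
            labels.append(j)
    return centers, labels

stream = [[1, 1], [1.2, 0.9], [8, 8], [1.1, 1.0], [8.2, 7.9]]
print(nn_clustering(stream, t=2.0)[1])    # [0, 0, 1, 0, 1]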
107Threshold t
(Figure: two initial clusters, 1 and 2, each shown with threshold t.)
108A new data point arrives (point 3). It is within the threshold for cluster 1, so we add it to that cluster and update the cluster center.
109Another new data point arrives (point 4). It is not within the threshold for cluster 1, so we create a new cluster, and so on...
The algorithm is highly order dependent, and it is difficult to determine t in advance.
110How can we tell the right number of clusters? In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example.
For our example, we will use the familiar
katydid/grasshopper dataset. However, in this
case we are imagining that we do NOT know the
class labels. We are only clustering on the X and
Y axis values.
111Squared Error
Objective Function
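The objective function on this slide was shown as a figure; as a standard formulation (our rendering, not a reconstruction of the exact slide), the squared error that k-means minimizes can be written as:

\[
SSE \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^{2}
\]

where C_i is the i-th cluster and m_i is its center (mean).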
112 When k = 1, the objective function is 873.0.
113 When k = 2, the objective function is 173.1.
114 When k = 3, the objective function is 133.6.
115We can plot the objective function values for k = 1 to 6. The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding".
Note that the results are not always as clear cut
as in this toy example
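As an illustration of knee/elbow finding (this sketch assumes scikit-learn is available; the data is synthetic, not the dataset from the slides), we compute the k-means objective for k = 1 to 6 and look for where the curve stops dropping sharply.

import numpy as np
from sklearn.cluster import KMeans   # assumption: scikit-learn is installed

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(5.0, 0.5, (30, 2))])

# inertia_ is the within-cluster sum of squared errors (the objective above).
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # the "knee" should appear at k = 2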
116Conclusions
- We have learned about the 3 major data mining / machine learning algorithms.
- Almost all data mining research is in these 3 areas, or is a minor extension of one or more of them.
- For further study, I recommend:
- Proceedings of SIGKDD, IEEE ICDM, SIAM SDM
- Data Mining: Concepts and Techniques (Jiawei Han and Micheline Kamber)
- Data Mining: Introductory and Advanced Topics (Margaret Dunham)