Title: Data Warehousing FS 08
1Data Warehousing FS 08: Data Mining
Based on the VLDB 2006 tutorial slides by Eamonn Keogh (eamonn_at_cs.ucr.edu); Jens Dittrich
2Data Mining Definition
- Finding hidden information in a database
- Data Mining has been defined as
- The nontrivial extraction of implicit,
previously unknown, and potentially useful
information from data. - Similar terms
- Exploratory data analysis
- Data driven discovery
- Deductive learning
- Discovery Science
- Knowledge Discovery
G. Piatetsky-Shapiro and W. J. Frawley,
Knowledge Discovery in Databases, AAAI/MIT Press,
1991.
3Database vs. Data Mining
- Query
- Database: well defined, posed in a precise query language (SQL)
- Data mining: poorly defined, no precise query language
- Output
- Database: a subset of the database
- Data mining: not a subset of the database
- Field
- Database: mature; hard to publish a bad SIGMOD/VLDB paper
- Data mining: still in its infancy; easy to publish a bad SIGKDD paper!
4Query Examples
- Database
- Find all customers that live in Boa Vista
- Find all customers that use Mastercard
- Find all customers that missed one payment
- Data mining
- Find all customers that are likely to miss one payment (Classification)
- Group all customers with similar buying habits (Clustering)
- List all items that are frequently purchased with bicycles (Association rules)
- Find any unusual customers (Outlier detection, anomaly discovery)
5The Major Data Mining Tasks
- Classification
- Clustering
- Associations
Most of the other tasks (for example, outlier discovery or anomaly detection) make heavy use of one or more of the above. So in this tutorial we will focus most of our energy on the above, starting with...
6The Classification Problem (informal definition)
Katydids
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
Grasshoppers
Katydid or Grasshopper?
7For any domain of interest, we can measure
features
Color Green, Brown, Gray, Other
Has Wings?
Thorax Length
Abdomen Length
Antennae Length
Mandible Size
Spiracle Diameter
Leg Length
8My_Collection
We can store features in a database.
- The classification problem can now be expressed as:
- Given a training database (My_Collection), predict the class label of a previously unseen instance
9Grasshoppers
Katydids
Antenna Length
Abdomen Length
10Grasshoppers
Katydids
We will also use this larger dataset as a motivating example.
Antenna Length
- Each of these data objects is called
- exemplars
- (training) examples
- instances
- tuples
Abdomen Length
11We will return to the previous slide in two
minutes. In the meantime, we are going to play a
quick game. I am going to show you some
classification problems which were shown to
pigeons! Let us see if you are as smart as a
pigeon!
12Pigeon Problem 1
13Pigeon Problem 1
What class is this object?
8 1.5
What about this one, A or B?
4.5 7
14Pigeon Problem 1
This is a B!
8 1.5
Here is the rule. If the left bar is smaller than
the right bar, it is an A, otherwise it is a B.
15Pigeon Problem 2
Oh! This one's hard!
Examples of class A
Examples of class B
8 1.5
4 4
Even I know this one
5 5
6 6
7 7
3 3
16Pigeon Problem 2
Examples of class A
Examples of class B
The rule is as follows: if the two bars are of equal size, it is an A. Otherwise it is a B.
4 4
5 5
So this one is an A.
6 6
7 7
3 3
17Pigeon Problem 3
Examples of class A
Examples of class B
6 6
This one is really hard! What is this, A or B?
4 4
1 5
6 3
3 7
18Pigeon Problem 3
It is a B!
Examples of class A
Examples of class B
6 6
4 4
The rule is as follows, if the square of the sum
of the two bars is less than or equal to 100, it
is an A. Otherwise it is a B.
1 5
6 3
3 7
19Why did we spend so much time with this game?
Because we wanted to show that almost all
classification problems have a geometric
interpretation, check out the next 3 slides
20Pigeon Problem 1
Here is the rule again. If the left bar is
smaller than the right bar, it is an A, otherwise
it is a B.
21Pigeon Problem 2
Examples of class A
Examples of class B
4 4
5 5
Let me look it up... here it is: the rule is, if the two bars are of equal size, it is an A. Otherwise it is a B.
6 6
3 3
22Pigeon Problem 3
Examples of class A
Examples of class B
4 4
1 5
6 3
The rule again if the square of the sum of the
two bars is less than or equal to 100, it is an
A. Otherwise it is a B.
3 7
23Grasshoppers
Katydids
Antenna Length
Abdomen Length
24previously unseen instance
- We can project the previously unseen instance into the same space as the database.
- We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.
Antenna Length
Abdomen Length
25Simple Linear Classifier
R.A. Fisher 1890-1962
If the previously unseen instance is above the line, then its class is Katydid; else its class is Grasshopper.
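As a minimal illustration (not necessarily the exact method on the slide), the following Python sketch builds one possible simple linear classifier for the two-dimensional insect data: it uses the perpendicular bisector of the two class means as the separating line. The feature values and function names are invented for the example.

import numpy as np

def fit_linear_boundary(X, y):
    # A very simple linear classifier: the decision boundary is the
    # perpendicular bisector of the two class means.
    m0 = X[y == 0].mean(axis=0)          # mean of class 0 (Grasshopper)
    m1 = X[y == 1].mean(axis=0)          # mean of class 1 (Katydid)
    w = m1 - m0                          # normal vector of the boundary
    b = -np.dot(w, (m0 + m1) / 2.0)      # boundary passes through the midpoint
    return w, b

def classify(w, b, x):
    # Positive side of the line -> Katydid, otherwise Grasshopper.
    return "Katydid" if np.dot(w, x) + b > 0 else "Grasshopper"

# Hypothetical (antenna length, abdomen length) training instances.
X = np.array([[2.7, 5.5], [2.9, 6.1], [8.0, 9.1], [7.1, 8.8]])
y = np.array([0, 0, 1, 1])               # 0 = Grasshopper, 1 = Katydid
w, b = fit_linear_boundary(X, y)
print(classify(w, b, np.array([7.5, 9.0])))   # previously unseen instance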
26The simple linear classifier is defined for
higher dimensional spaces
27 we can visualize it as being an n-dimensional
hyperplane
28It is interesting to think about what would
happen in this example if we did not have the 3rd
dimension
29We can no longer get perfect accuracy with the simple linear classifier. We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier... However, as we will later see, this is probably a bad idea.
30Which of the Pigeon Problems can be solved by
the Simple Linear Classifier?
- Perfect
- Useless
- Pretty Good
Problems that can be solved by a linear classifier are called linearly separable.
31Virginica
- A Famous Problem
- R. A. Fisher's Iris Dataset.
- 3 classes
- 50 of each class
- The task is to classify Iris plants into one of 3
varieties using the Petal Length and Petal Width.
Setosa
Versicolor
32We can generalize the piecewise linear classifier
to N classes, by fitting N-1 lines. In this case
we first learned the line to (perfectly)
discriminate between Setosa and
Virginica/Versicolor, then we learned to
approximately discriminate between Virginica and
Versicolor.
If petal width > 3.272 - (0.325 × petal length) then class = Virginica
Elseif petal width ...
33We have now seen one classification algorithm,
and we are about to see more. How should we
compare them?
- Predictive accuracy
- Speed and scalability
- time to construct the model
- time to use the model
- efficiency in disk-resident databases
- Robustness
- handling noise, missing values, irrelevant features, and streaming data
- Interpretability
- understanding and insight provided by the model
34Predictive Accuracy I
- How do we estimate the accuracy of our classifier?
- We can use K-fold cross validation.
We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead.
Accuracy = (Number of correct classifications) / (Number of instances in our database)
K = 5
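As a sketch of the procedure just described (a generic illustration, not code from the tutorial), the function below runs K-fold cross validation for any classifier supplied as fit/predict callables; the names k_fold_accuracy, fit, and predict are ours, and the demo data is invented.

import numpy as np

def k_fold_accuracy(X, y, fit, predict, K=5, seed=0):
    # Split the data into K roughly equal sections, hold each one out in turn,
    # train on the rest, and count correct predictions on the held-out section.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    correct = 0
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = fit(X[train], y[train])
        correct += sum(predict(model, x) == label for x, label in zip(X[test], y[test]))
    # Accuracy = number of correct classifications / number of instances
    return correct / len(X)

# Example with a trivial "majority class" classifier (for illustration only).
fit = lambda X, y: np.bincount(y).argmax()
predict = lambda model, x: model
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])
print(k_fold_accuracy(X_demo, y_demo, fit, predict, K=5))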
35Predictive Accuracy II
- Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier.
- We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model.
- Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later).
Accuracy = 94%
Accuracy = 100%
Accuracy = 100%
36Predictive Accuracy III
Accuracy = (Number of correct classifications) / (Number of instances in our database)
Accuracy is a single number; we may be better off looking at a confusion matrix, which gives us additional useful information.
(Confusion matrix figure: rows give the true label, columns give what each instance was classified as.)
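A small illustrative sketch (our own, not from the slides) of how a confusion matrix is built and how the accuracy number can be read back off its diagonal; the labels are invented.

import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    # Rows = true class, columns = what the classifier said.
    M = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        M[t, p] += 1
    return M

true = [0, 0, 1, 1, 2, 2, 2]
pred = [0, 1, 1, 1, 2, 0, 2]
M = confusion_matrix(true, pred, n_classes=3)
print(M)                        # off-diagonal entries show which classes get confused
print(M.trace() / M.sum())      # the single accuracy number, recovered from the matrix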
37Speed and Scalability I
- We need to consider the time and space requirements for the two distinct phases of classification:
- Time to construct the classifier
- In the case of the simple linear classifier, this is the time taken to fit the line, which is linear in the number of instances.
- Time to use the model
- In the case of the simple linear classifier, this is the time taken to test which side of the line the unlabeled instance falls on. This can be done in constant time.
As we shall see, some classification algorithms
are very efficient in one aspect, and very poor
in the other.
38Speed and Scalability II
For learning with small datasets, this is the whole picture. However, for data mining with
massive datasets, it is not so much the (main
memory) time complexity that matters, rather it
is how many times we have to scan the database.
This is because for most data mining operations,
disk access times completely dominate the CPU
times. For data mining, researchers often report
the number of times you must scan the database.
39Robustness I
- We need to consider what happens when we have
- Noise
- For example, a person's age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This matters only when building the classifier; if the instance to be classified is noisy, we can do nothing about it.)
- Missing values
- For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis) and not the antennae length (Y-axis); can we still classify the instance?
40Robustness II
- We need to consider what happens when we have
- Irrelevant features
- For example, suppose we want to classify people
as either - Suitable_Grad_Student
- Unsuitable_Grad_Student
- And it happens that scoring more than 5 on a
particular test is a perfect indicator for this
problem
If we also use hair_length as a feature, how will this affect our classifier?
41Robustness III
- We need to consider what happens when we have
- Streaming data
For many real world problems, we don't have a single fixed dataset. Instead, the data continuously arrives, potentially forever (stock market, weather data, sensor data, etc.). Can our classifier handle streaming data?
42Interpretability
Some classifiers offer a bonus feature. The structure of the learned classifier tells us something about the domain.
As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny.
Weight
Height
43Nearest Neighbor Classifier
Evelyn Fix 1904-1965
Joe Hodges 1922-2000
Antenna Length
If the nearest instance to the previously unseen instance is a Katydid, then its class is Katydid; else its class is Grasshopper.
Abdomen Length
44We can visualize the nearest neighbor algorithm
in terms of a decision surface
Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions belonging to each instance.
This division of space is called a Dirichlet tessellation (or Voronoi diagram, or Thiessen regions).
45The nearest neighbor algorithm is sensitive to
outliers
The solution is to...
46We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number.
K = 1
K = 3
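A minimal sketch of the K-nearest neighbor rule just described, using Euclidean distance and a majority vote; the measurements below are invented for illustration.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    # Find the k training instances closest to x and let them vote.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (antenna length, abdomen length) training data.
X = np.array([[2.7, 5.5], [2.9, 6.1], [8.0, 9.1], [7.1, 8.8], [3.1, 5.9]])
y = np.array(["Grasshopper", "Grasshopper", "Katydid", "Katydid", "Grasshopper"])
print(knn_classify(X, y, np.array([7.5, 9.0]), k=3))   # votes: 2 Katydid, 1 Grasshopper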
47The nearest neighbor algorithm is sensitive to
irrelevant features
Training data
Suppose the following is true: if an insect's antenna is longer than 5.5, it is a Katydid, otherwise it is a Grasshopper. Using just the antenna length we get perfect classification!
Suppose, however, we add in an irrelevant feature, for example the insect's mass. Using both the antenna length and the insect's mass with the 1-NN algorithm, we get the wrong classification!
48How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?
- Use more training instances
- Ask an expert what features are relevant to the task
- Use statistical tests to try to determine which features are useful
- Search over feature subsets (on the next slide we will see why this is hard)
49Why searching over feature subsets is hard
Suppose you have the following classification
problem, with 100 features, where it happens that
Features 1 and 2 (the X and Y below) give perfect
classification, but all 98 of the other features
are irrelevant
Only Feature 2
Only Feature 1
Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 - 1 possible subsets of the features, only one really works.
50The nearest neighbor algorithm is sensitive to
the units of measurement
X axis measured in millimeters, Y axis measured in dollars. The nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure
numbers.
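One common way to normalize the units to pure numbers is z-score normalization; this sketch (ours, with invented values) rescales each feature to zero mean and unit variance before distances are computed.

import numpy as np

def zscore_normalize(X):
    # Rescale each column to zero mean and unit variance so that no feature
    # dominates the distance calculation just because of its units.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return (X - mu) / sigma

# Hypothetical data: column 0 in millimeters, column 1 in dollars.
X = np.array([[510.0, 2.3], [525.0, 1.9], [880.0, 2.1]])
print(zscore_normalize(X))            # both columns now on a comparable scale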
51We can speed up the nearest neighbor algorithm by throwing away some data. This is called data editing. Note that this can sometimes improve accuracy!
We can also speed up classification with indexing
One possible approach. Delete all instances that
are surrounded by members of their own class.
52Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case.
Max (p = ∞)
Manhattan (p = 1)
Weighted Euclidean
Mahalanobis
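As a sketch of the alternatives listed above (Mahalanobis is omitted for brevity), the Minkowski family covers Manhattan, Euclidean, and, in the limit, the Max distance; the vectors and weights are invented.

import numpy as np

def minkowski(a, b, p):
    # L_p distance: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def max_distance(a, b):
    # The p -> infinity limit of the Minkowski distance (Chebyshev).
    return np.max(np.abs(a - b))

def weighted_euclidean(a, b, w):
    # Per-feature weights let important features count for more.
    return np.sqrt(np.sum(w * (a - b) ** 2))

a, b = np.array([1.0, 5.0]), np.array([4.0, 1.0])
print(minkowski(a, b, 1), minkowski(a, b, 2), max_distance(a, b))
print(weighted_euclidean(a, b, w=np.array([1.0, 0.1])))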
53In fact, we can use the nearest neighbor
algorithm with any distance/similarity function
For example, is "Faloutsos" Greek or Irish? We could compare the name "Faloutsos" to a database of names using string edit distance:
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.
Specialized distance measures exist for DNA
strings, time series, images, graphs, videos,
sets, fingerprints etc
54Advantages/Disadvantages of Nearest Neighbor
- Advantages
- Simple to implement
- Handles correlated features (Arbitrary class
shapes) - Defined for any distance measure
- Handles streaming data trivially
- Disadvantages
- Very sensitive to irrelevant features.
- Slow classification time for large datasets
- Works best for real valued datasets
55Decision Tree Classifier
Ross Quinlan
Abdomen Length > 7.1?
Antenna Length
yes
no
Antenna Length > 6.0?
Katydid
yes
no
Katydid
Grasshopper
Abdomen Length
56Antennae shorter than body?
Yes
No
3 Tarsi?
Grasshopper
Yes
No
Foretibia has ears?
Yes
No
Cricket
Decision trees predate computers
Katydids
Camel Cricket
57Decision Tree Classification
- Decision tree
- A flow-chart-like tree structure
- Internal node denotes a test on an attribute
- Branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
- Tree construction
- At start, all the training examples are at the root
- Partition examples recursively based on selected attributes
- Tree pruning
- Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree
58How do we construct the decision tree?
- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they can be discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure, e.g., information gain (see the sketch after this list)
- Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
59We don't need to keep the data around, just the test conditions.
Weight <= 160?
yes
no
How would these people be classified?
Hair Length <= 2?
Male
yes
no
Male
Female
60It is trivial to convert Decision Trees to rules
Weight <= 160?
yes
no
Hair Length <= 2?
Male
no
yes
Male
Female
Rules to Classify Males/Females:
If Weight > 160, classify as Male
Elseif Hair Length <= 2, classify as Male
Else classify as Female
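Written directly as code, the rules above become a chain of if/elif/else tests; this is only an illustration of the conversion, with the thresholds taken from the slide.

def classify(weight, hair_length):
    # The decision tree from the slide, expressed as rules.
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(weight=180, hair_length=10))   # first rule fires -> Male
print(classify(weight=130, hair_length=1))    # second rule fires -> Male
print(classify(weight=130, hair_length=9))    # default -> Female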
61Once we have learned the decision tree, we don't even need a computer!
This decision tree is attached to a medical
machine, and is designed to help nurses make
decisions about what type of doctor to call.
Decision tree for a typical shared-care setting
applying the system for the diagnosis of
prostatic obstructions.
GP = general practitioner
62The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.
Yes
No
Wears green?
Male
Female
For example, the rule "Wears green?" perfectly classifies the data; so does "Mother's name is Jacqueline?"; so does "Has blue shoes?"
63Avoid Overfitting in Classification
- The generated tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy for unseen samples
- Two approaches to avoid overfitting
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
- Difficult to choose an appropriate threshold
- Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees
- Use a set of data different from the training data to decide which is the best pruned tree
64Which of the Pigeon Problems can be solved by a
Decision Tree?
- Deep Bushy Tree
- Useless
- Deep Bushy Tree
?
The Decision Tree has a hard time with correlated
attributes
65Advantages/Disadvantages of Decision Trees
- Advantages
- Easy to understand (Doctors love them!)
- Easy to generate rules
- Disadvantages
- May suffer from overfitting.
- Classifies by rectangular partitioning (so does not handle correlated features very well).
- Can be quite large; pruning is necessary.
- Does not handle streaming data easily
66Summary of Classification
- We have seen 3 major classification techniques
- Simple linear classifier, Nearest neighbor,
Decision tree. - There are other techniques
- Neural Networks, Support Vector Machines,
Genetic algorithms.. - In general, there is no one best classifier for
all problems. You have to consider what you hope to achieve, and the data itself.
Let us now move on to the other classic problem of data mining and machine learning: Clustering.
67What is Clustering?
Also called unsupervised learning, sometimes
called classification by statisticians and
sorting by psychologists and segmentation by
people in marketing
- Organizing data into classes such that there is
- high intra-class similarity
- low inter-class similarity
- Finding the class labels and the number of classes directly from the data (in contrast to classification).
- More informally, finding natural groupings among objects.
68What is a natural grouping among these objects?
69What is a natural grouping among these objects?
Clustering is subjective
Simpson's Family
Males
Females
School Employees
70What is Similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features. (Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
71Defining Distance Measures
Definition Let O1 and O2 be two objects from the
universe of possible objects. The distance
(dissimilarity) between O1 and O2 is a real
number denoted by D(O1,O2)
(Figure: three black-box distance functions applied to "Peter" and "Piotr" return 0.23, 3, and 342.7.)
72When we peek inside one of these black boxes, we see some function of two variables. These functions might be very simple or very complex. In either case, it is natural to ask: what properties should these functions have?
One of the boxes contains the string edit distance recurrence:
d('', '') = 0
d(s, '') = d('', s) = |s|   (the length of s)
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (0 if ch1 = ch2 else 1), d(s1+ch1, s2) + 1, d(s1, s2+ch2) + 1 )
Applied to "Peter" and "Piotr", this box returns a distance of 3.
- What properties should a distance measure have?
- D(A,B) = D(B,A)   (Symmetry)
- D(A,A) = 0   (Constancy of Self-Similarity)
- D(A,B) = 0 iff A = B   (Positivity / Separation)
- D(A,B) <= D(A,C) + D(B,C)   (Triangular Inequality)
73Intuitions behind desirable distance measure
properties
D(A,B) = D(B,A)   (Symmetry)
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0   (Constancy of Self-Similarity)
Otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B   (Positivity / Separation)
Otherwise there are objects in your world that are different, but you cannot tell them apart.
D(A,B) <= D(A,C) + D(B,C)   (Triangular Inequality)
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
74Two Types of Clustering
- Partitional algorithms Construct various
partitions and then evaluate them by some
criterion (we will see an example called BIRCH) - Hierarchical algorithms Create a hierarchical
decomposition of the set of objects using some
criterion
Partitional
Hierarchical
75Desirable Properties of a Clustering Algorithm
- Scalability (in terms of both time and space)
- Ability to deal with different data types
- Minimal requirements for domain knowledge to
determine input parameters - Able to deal with noise and outliers
- Insensitive to order of input records
- Incorporation of user-specified constraints
- Interpretability and usability
76A Useful Tool for Summarizing Similarity
Measurements
In order to better appreciate and evaluate the
examples given in the early part of this talk, we
will now introduce the dendrogram.
The similarity between two objects in a
dendrogram is represented as the height of the
lowest internal node they share.
77There is only one dataset that can be perfectly
clustered using a hierarchy
(Bovine:0.69395, (Spider Monkey:0.390, (Gibbon:0.36079, (Orang:0.33636, (Gorilla:0.17147, (Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939);
78Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo's hierarchy is manually created; we will focus on the automatic creation of hierarchies in data mining.
Business Economy
B2B Finance Shopping Jobs
Aerospace Agriculture Banking Bonds Animals
Apparel Career Workspace
79A Demonstration of Hierarchical Clustering using
String Edit Distance
Pedro (Portuguese) Petros (Greek), Peter
(English), Piotr (Polish), Peadar (Irish),
Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr
(Russian) Cristovao (Portuguese) Christoph
(German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer
(Scandinavian), Krystof (Czech), Christopher
(English) Miguel (Portuguese) Michalis (Greek),
Michael (English), Mick (Irish!)
Piotr
Peka
Mick
Piero
Peter
Pyotr
Pedro
Peder
Pietro
Pierre
Petros
Miguel
Peadar
Krystof
Michael
Michalis
Crisdean
Cristobal
Cristovao
Christoph
Kristoffer
Cristoforo
Christophe
Christopher
80Pedro (Portuguese/Spanish) Petros (Greek), Peter
(English), Piotr (Polish), Peadar (Irish),
Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr (Russian)
Piotr
Peka
Pedro
Peter
Piero
Pyotr
Peder
Pierre
Pietro
Petros
Peadar
81- Hierarchical clustering can sometimes show patterns that are meaningless or spurious.
- For example, in this clustering, the tight grouping of Australia, Anguilla, St. Helena, etc. is meaningful, since all these countries are former UK colonies.
- However, the tight grouping of Niger and India is completely spurious; there is no connection between the two.
82- The flag of Niger is orange over white over
green, with an orange disc on the central white
stripe, symbolizing the sun. The orange stands for the Sahara desert, which borders Niger to the north. Green stands for the grassy plains of the
south and west and for the River Niger which
sustains them. It also stands for fraternity and
hope. White generally symbolizes purity and hope.
- The Indian flag is a horizontal tricolor in
equal proportion of deep saffron on the top,
white in the middle and dark green at the bottom.
In the center of the white band, there is a wheel
in navy blue to indicate the Dharma Chakra, the
wheel of law in the Sarnath Lion Capital. This
center symbol or the 'CHAKRA' is a symbol dating
back to the 2nd century BC. The saffron stands for courage and sacrifice; the white, for purity and truth; the green, for growth and auspiciousness.
83We can look at the dendrogram to determine the
correct number of clusters. In this case, the
two highly separated subtrees are highly
suggestive of two clusters. (Things are rarely
this clear cut, unfortunately)
84One potential use of a dendrogram is to detect
outliers
The single isolated branch is suggestive of a
data point that is very different to all others
Outlier
85(How-to) Hierarchical Clustering
Since we cannot test all possible trees, we will have to do a heuristic search over the space of possible trees. We could do this...
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
- The number of dendrograms with n leaves = (2n - 3)! / (2^(n-2) (n-2)!)
- Number of Leaves    Number of Possible Dendrograms
- 2                   1
- 3                   3
- 4                   15
- 5                   105
- ...                 ...
- 10                  34,459,425
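The formula above is easy to check numerically; this small sketch (ours) reproduces the table, including the 34,459,425 dendrograms for 10 leaves.

from math import factorial

def num_dendrograms(n):
    # Number of distinct rooted binary dendrograms over n labeled leaves:
    # (2n - 3)! / (2^(n-2) * (n-2)!)
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in [2, 3, 4, 5, 10]:
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425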
86We begin with a distance matrix which contains
the distances between every pair of objects in
our database.
D( , ) = 8    D( , ) = 1
87Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
88Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
89Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
90Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
Consider all possible merges
Choose the best
91We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or the distance between two clusters, is non-obvious.
- Single linkage (nearest neighbor): the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
- Complete linkage (furthest neighbor): the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
- Group average linkage: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
- Ward's linkage: we try to minimize the variance of the merged clusters.
(A sketch of the first three linkages follows below.)
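As a sketch of the first three linkage criteria in the list above (our own illustration, with made-up points), each function takes two clusters, given as lists of points, and returns the between-cluster distance.

import numpy as np

def single_linkage(A, B):
    # Distance of the two closest objects across the clusters.
    return min(np.linalg.norm(a - b) for a in A for b in B)

def complete_linkage(A, B):
    # Distance of the two furthest objects across the clusters.
    return max(np.linalg.norm(a - b) for a in A for b in B)

def average_linkage(A, B):
    # Average distance over all cross-cluster pairs.
    return float(np.mean([np.linalg.norm(a - b) for a in A for b in B]))

A = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
B = [np.array([4.0, 0.0]), np.array([9.0, 0.0])]
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))
# 3.0 9.0 6.0 -- the same pair of clusters looks much closer under single linkage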
92Single linkage
Average linkage
Ward's linkage
93- Summary of Hierarchical Clustering Methods
- No need to specify the number of clusters in advance.
- The hierarchical nature maps nicely onto human intuition for some domains.
- They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
- Like any heuristic search algorithm, local optima are a problem.
- Interpretation of results is (very) subjective.
94Up to this point we have simply assumed that we can measure similarity, but... how do we measure similarity?
(Figure: the black-box distance functions applied to "Peter" and "Piotr" again return 0.23, 3, and 342.7.)
95A generic technique for measuring similarity
To measure the similarity between two objects,
transform one of the objects into the other, and
measure how much effort it took. The measure of
effort becomes the distance measure.
The distance between Patty and Selma:
Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty, Selma) = 3
The distance between Marge and Selma:
Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge, Selma) = 5
This is called the edit distance or the
transformation distance
96Edit Distance Example
How similar are the names "Peter" and "Piotr"? Assume the following cost function:
Substitution = 1 unit
Insertion = 1 unit
Deletion = 1 unit
Then D(Peter, Piotr) is 3.
It is possible to transform any string Q into string C using only substitution, insertion, and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can then be defined as the cost of the cheapest transformation from Q to C. Note that for now we have ignored the issue of how we can find this cheapest transformation.
Peter -> Piter -> Pioter -> Piotr
Substitution (i for e), Insertion (o), Deletion (e)
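The cheapest transformation does not have to be found by trial and error; the dynamic-programming recurrence shown earlier (slide 72) computes it directly. This sketch is a generic Levenshtein implementation, not code from the tutorial.

def edit_distance(q, c):
    # Minimum number of substitutions, insertions, and deletions (cost 1 each)
    # needed to turn string q into string c.
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete everything from q
    for j in range(n + 1):
        d[0][j] = j                               # insert everything from c
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitute (or match)
                          d[i - 1][j] + 1,        # delete from q
                          d[i][j - 1] + 1)        # insert into q
    return d[m][n]

print(edit_distance("Peter", "Piotr"))        # 3, as on the slide
print(edit_distance("Faloutsos", "Keogh"))    # 8, as on slide 53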
97Partitional Clustering
- Nonhierarchical, each instance is placed in
exactly one of K nonoverlapping clusters. - Since only one set of clusters is output, the
user normally has to input the desired number of
clusters K.
98Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to 3.
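A compact sketch of the five steps above (Lloyd's algorithm); the function name and the synthetic two-blob data are ours.

import numpy as np

def k_means(X, k, seed=0, max_iter=100):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()   # step 2
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every object to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                    # step 5: nothing changed
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
centers, labels = k_means(X, k=2)
print(centers)        # should land near (0, 0) and (5, 5)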
99K-means Clustering Step 1
Algorithm: k-means, Distance Metric: Euclidean Distance
100K-means Clustering Step 2
Algorithm: k-means, Distance Metric: Euclidean Distance
101K-means Clustering Step 3
Algorithm: k-means, Distance Metric: Euclidean Distance
102K-means Clustering Step 4
Algorithm: k-means, Distance Metric: Euclidean Distance
103K-means Clustering Step 5
Algorithm: k-means, Distance Metric: Euclidean Distance
104Comments on the K-Means Method
- Strength
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weakness
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
105The K-Medoids Clustering Method
- Find representative objects, called medoids, in
clusters - PAM (Partitioning Around Medoids, 1987)
- starts from an initial set of medoids and
iteratively replaces one of the medoids by one of
the non-medoids if it improves the total distance
of the resulting clustering - PAM works effectively for small data sets, but
does not scale well for large data sets
106What happens if the data is streaming?
Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification)
- Items are iteratively merged into the existing cluster that is closest.
- Incremental
- A threshold, t, is used to determine whether items are added to existing clusters or a new cluster is created (see the sketch after this list).
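A sketch of the incremental scheme just listed (sometimes called leader clustering); every name here is ours, and the stream values are invented.

import numpy as np

def nn_clustering(stream, t):
    # Each arriving item joins the nearest existing cluster if it lies within
    # threshold t of that cluster's center; otherwise it starts a new cluster.
    centers, counts, labels = [], [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centers:
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
        if not centers or dists[j] > t:
            centers.append(x.copy())                     # create a new cluster
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            counts[j] += 1                               # add to the closest cluster
            centers[j] += (x - centers[j]) / counts[j]   # update its running mean
            labels.append(j)
    return centers, labels

stream = [[1, 1], [1.2, 0.9], [8, 8], [1.1, 1.0], [8.2, 7.9]]
print(nn_clustering(stream, t=2.0)[1])    # [0, 0, 1, 0, 1]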
107Threshold t
(Figure: two initial clusters, 1 and 2, each shown with threshold t.)
108A new data point arrives (point 3). It is within the threshold for cluster 1, so we add it to that cluster and update the cluster center.
109Another new data point arrives (point 4). It is not within the threshold for cluster 1, so we create a new cluster, and so on...
The algorithm is highly order dependent, and it is difficult to determine t in advance.
110How can we tell the right number of clusters? In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example.
For our example, we will use the familiar
katydid/grasshopper dataset. However, in this
case we are imagining that we do NOT know the
class labels. We are only clustering on the X and
Y axis values.
111Squared Error
Objective Function
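The objective function on this slide was shown as a figure; as a standard formulation (our rendering, not a reconstruction of the exact slide), the squared error that k-means minimizes can be written as:

\[
SSE \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^{2}
\]

where C_i is the i-th cluster and m_i is its center (mean).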
112 When k = 1, the objective function is 873.0.
113 When k = 2, the objective function is 173.1.
114 When k = 3, the objective function is 133.6.
115We can plot the objective function values for k = 1 to 6. The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding".
Note that the results are not always as clear cut
as in this toy example
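As an illustration of knee/elbow finding (this sketch assumes scikit-learn is available; the data is synthetic, not the dataset from the slides), we compute the k-means objective for k = 1 to 6 and look for where the curve stops dropping sharply.

import numpy as np
from sklearn.cluster import KMeans   # assumption: scikit-learn is installed

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(5.0, 0.5, (30, 2))])

# inertia_ is the within-cluster sum of squared errors (the objective above).
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # the "knee" should appear at k = 2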
116Conclusions
- We have learned about the 3 major data mining / machine learning algorithms.
- Almost all data mining research is in these 3 areas, or is a minor extension of one or more of them.
- For further study, I recommend:
- Proceedings of SIGKDD, IEEE ICDM, SIAM SDM
- Data Mining: Concepts and Techniques (Jiawei Han and Micheline Kamber)
- Data Mining: Introductory and Advanced Topics (Margaret Dunham)