Transcript and Presenter's Notes

Title: Data Warehousing FS 08


1
Data Warehousing FS 08: Data Mining
Based on VLDB 2006 Tutorial Slides by Eamonn Keogh, eamonn@cs.ucr.edu
Jens Dittrich
2
Data Mining Definition
  • Finding hidden information in a database
  • Data Mining has been defined as
  • The nontrivial extraction of implicit,
    previously unknown, and potentially useful
    information from data.
  • Similar terms
  • Exploratory data analysis
  • Data driven discovery
  • Deductive learning
  • Discovery Science
  • Knowledge Discovery

G. Piatetsky-Shapiro and W. J. Frawley,
Knowledge Discovery in Databases, AAAI/MIT Press,
1991.
3
Database vs. Data Mining
  • Database
  • Query: well defined (SQL)
  • Output: a subset of the database
  • Field: mature; hard to publish a bad SIGMOD/VLDB paper
  • Data Mining
  • Query: poorly defined; no precise query language
  • Output: not a subset of the database
  • Field: still in its infancy; easy to publish a bad SIGKDD paper!

4
Query Examples
  • Database
  • Find all customers that live in Boa Vista
  • Find all customers that use Mastercard
  • Find all customers that missed one payment
  • Data mining
  • Find all customers that are likely to miss one
    payment (Classification)
  • Group all customers with similar buying habits
    (Clustering)
  • List all items that are frequently purchased
    with bicycles (Association rules)
  • Find any unusual customers (Outlier detection,
    anomaly discovery)

5
The Major Data Mining Tasks
  • Classification
  • Clustering
  • Associations

Most of the other tasks (for example, outlier discovery or anomaly detection) make heavy use of one or more of the above. So in this tutorial we will focus most of our energy on the above, starting with…
6
The Classification Problem (informal definition)
Katydids
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
Grasshoppers
Katydid or Grasshopper?
7
For any domain of interest, we can measure
features
Color Green, Brown, Gray, Other
Has Wings?
Thorax Length
Abdomen Length
Antennae Length
Mandible Size
Spiracle Diameter
Leg Length
8
My_Collection
We can store features in a database.
  • The classification problem can now be expressed
    as
  • Given a training database (My_Collection),
    predict the class label of a previously unseen
    instance

previously unseen instance
9
[Scatter plot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids]
10
[Scatter plot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids]
We will also use this larger dataset as a motivating example.
  • Each of these data objects is called
  • an exemplar
  • a (training) example
  • an instance
  • a tuple
11
We will return to the previous slide in two
minutes. In the meantime, we are going to play a
quick game. I am going to show you some
classification problems which were shown to
pigeons! Let us see if you are as smart as a
pigeon!
12
Pigeon Problem 1
13
Pigeon Problem 1
What class is this object?
8 1.5
What about this one, A or B?
4.5 7
14
Pigeon Problem 1
This is a B!
8 1.5
Here is the rule. If the left bar is smaller than
the right bar, it is an A, otherwise it is a B.
15
Pigeon Problem 2
Oh! This one's hard!
Examples of class A
Examples of class B
8 1.5
4 4
Even I know this one
5 5
6 6
7 7
3 3
16
Pigeon Problem 2
Examples of class A
Examples of class B
The rule is as follows: if the two bars are of equal size, it is an A. Otherwise it is a B.
4 4
5 5
So this one is an A.
6 6
7 7
3 3
17
Pigeon Problem 3
Examples of class A
Examples of class B
6 6
This one is really hard! What is this, A or B?
4 4
1 5
6 3
3 7
18
Pigeon Problem 3
It is a B!
Examples of class A
Examples of class B
6 6
4 4
The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
1 5
6 3
3 7
19
Why did we spend so much time with this game?
Because we wanted to show that almost all
classification problems have a geometric
interpretation, check out the next 3 slides
20
Pigeon Problem 1
Here is the rule again. If the left bar is
smaller than the right bar, it is an A, otherwise
it is a B.
21
Pigeon Problem 2
Examples of class A
Examples of class B
4 4
5 5
Let me look it up... here it is: the rule is, if the two bars are of equal size, it is an A. Otherwise it is a B.
6 6
3 3
22
Pigeon Problem 3
Examples of class A
Examples of class B
4 4
1 5
6 3
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
3 7
23
[Scatter plot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids]
24
previously unseen instance
  • We can project the previously unseen instance
    into the same space as the database.
  • We have now abstracted away the details of our
    particular problem. It will be much easier to
    talk about points in space.

25
Simple Linear Classifier
R.A. Fisher 1890-1962
If the previously unseen instance is above the line, then class is Katydid, else class is Grasshopper.
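As a minimal sketch of this decision rule (not code from the slides): assume we already have a separating line in the Abdomen Length / Antenna Length space, and classifying a new instance is just checking which side of the line it falls on. The line coefficients below are made up for illustration.

```python
# Sketch of the "which side of the line" rule; the coefficients are illustrative only.
def classify(antenna_length, abdomen_length, w=(1.0, 1.0), b=-10.0):
    """Return 'Katydid' if the point lies above the line w[0]*x + w[1]*y + b = 0."""
    score = w[0] * abdomen_length + w[1] * antenna_length + b
    return "Katydid" if score > 0 else "Grasshopper"

print(classify(antenna_length=8.0, abdomen_length=7.0))   # above the line -> Katydid
print(classify(antenna_length=2.0, abdomen_length=3.0))   # below the line -> Grasshopper
```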
26
The simple linear classifier is defined for
higher dimensional spaces
27
We can visualize it as an n-dimensional hyperplane.
28
It is interesting to think about what would
happen in this example if we did not have the 3rd
dimension
29
We can no longer get perfect accuracy with the simple linear classifier. We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier... However, as we will later see, this is probably a bad idea.
30
Which of the Pigeon Problems can be solved by
the Simple Linear Classifier?
  • Perfect
  • Useless
  • Pretty Good

Problems that can be solved by a linear classifier are called linearly separable.
31
Virginica
  • A Famous Problem
  • R. A. Fisher's Iris Dataset.
  • 3 classes
  • 50 of each class
  • The task is to classify Iris plants into one of 3
    varieties using the Petal Length and Petal Width.

Setosa
Versicolor
32
We can generalize the piecewise linear classifier
to N classes, by fitting N-1 lines. In this case
we first learned the line to (perfectly)
discriminate between Setosa and
Virginica/Versicolor, then we learned to
approximately discriminate between Virginica and
Versicolor.
If petal width > 3.272 - (0.325 × petal length) then class = Virginica; Elseif petal width…
33
We have now seen one classification algorithm,
and we are about to see more. How should we
compare them?
  • Predictive accuracy
  • Speed and scalability
  • time to construct the model
  • time to use the model
  • efficiency in disk-resident databases
  • Robustness
  • handling noise, missing values and irrelevant
    features, streaming data
  • Interpretability
  • understanding and insight provided by the model

34
Predictive Accuracy I
  • How do we estimate the accuracy of our
    classifier?
  • We can use K-fold cross validation

We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead.

Accuracy = Number of correct classifications / Number of instances in our database

K = 5
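A minimal sketch of the K-fold procedure just described, with K = 5 by default. The train and classify callables are placeholders for whatever classifier is being evaluated; they are assumptions of this sketch, not functions defined in the slides.

```python
def k_fold_accuracy(data, train, classify, k=5):
    """data: list of (features, label); train: builds a model from a list of
    (features, label); classify: maps (model, features) -> predicted label."""
    folds = [data[i::k] for i in range(k)]          # k roughly equal-sized sections
    correct = 0
    for i in range(k):
        held_out = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train(training)                     # build the classifier without fold i
        correct += sum(classify(model, x) == y for x, y in held_out)
    return correct / len(data)                      # accuracy = correct / total instances
```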
35
Predictive Accuracy II
  • Using K-fold cross validation is a good way to
    set any parameters we may need to adjust in (any)
    classifier.
  • We can do K-fold cross validation for each
    possible setting, and choose the model with the
    highest accuracy. Where there is a tie, we choose
    the simpler model.
  • Actually, we should probably penalize the more
    complex models, even if they are more accurate,
    since more complex models are more likely to
    overfit (discussed later).

Accuracy = 94%
Accuracy = 100%
Accuracy = 100%
[Three scatter plots on 1-10 axes, one per candidate model, with the accuracies above]
36
Predictive Accuracy III
Accuracy = Number of correct classifications / Number of instances in our database

Accuracy is a single number; we may be better off looking at a confusion matrix, which gives us additional useful information.
[Confusion matrix with rows/columns labeled "Classified as a..." and "True label is..."]
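A small sketch of how such a confusion matrix can be tallied from true and predicted labels (the insect labels below are just an example):

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each true label was classified as each predicted label."""
    counts = Counter(zip(true_labels, predicted_labels))
    labels = sorted(set(true_labels) | set(predicted_labels))
    return {t: {p: counts[(t, p)] for p in labels} for t in labels}

# Toy example: three insects, one of them misclassified.
print(confusion_matrix(["Katydid", "Grasshopper", "Katydid"],
                       ["Katydid", "Katydid", "Katydid"]))
```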
37
Speed and Scalability I
  • We need to consider the time and space
    requirements for the two distinct phases of
    classification
  • Time to construct the classifier
  • In the case of the simple linear classifier, the time taken to fit the line; this is linear in the number of instances.
  • Time to use the model
  • In the case of the simple linear classifier, the time taken to test which side of the line the unlabeled instance is on. This can be done in constant time.

As we shall see, some classification algorithms
are very efficient in one aspect, and very poor
in the other.
38
Speed and Scalability II
For learning with small datasets, this is the whole picture. However, for data mining with massive datasets, it is not so much the (main-memory) time complexity that matters; rather, it is how many times we have to scan the database. This is because for most data mining operations, disk access times completely dominate the CPU times. For data mining, researchers often report the number of times you must scan the database.
39
Robustness I
  • We need to consider what happens when we have
  • Noise
  • For example, a person's age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy, we can do nothing.)
  • Missing values

For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis) and not the antennae length (Y-axis). Can we still classify the instance?
[Two scatter plots on 1-10 axes illustrating the missing-value example]
40
Robustness II
  • We need to consider what happens when we have
  • Irrelevant features
  • For example, suppose we want to classify people
    as either
  • Suitable_Grad_Student
  • Unsuitable_Grad_Student
  • And it happens that scoring more than 5 on a
    particular test is a perfect indicator for this
    problem

If we also use hair_length as a feature, how will this affect our classifier?
41
Robustness III
  • We need to consider what happens when we have
  • Streaming data

For many real-world problems, we don't have a single fixed dataset. Instead, the data arrives continuously, potentially forever (stock market, weather data, sensor data, etc.). Can our classifier handle streaming data?
42
Interpretability
Some classifiers offer a bonus feature: the structure of the learned classifier tells us something about the domain.
As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny.
[Plot: Weight vs. Height]
43
Nearest Neighbor Classifier
Evelyn Fix 1904-1965
Joe Hodges 1922-2000
If the nearest instance to the previously unseen instance is a Katydid, then class is Katydid, else class is Grasshopper.
[Scatter plot: Antenna Length vs. Abdomen Length]
44
We can visualize the nearest neighbor algorithm
in terms of a decision surface
Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions belonging to each instance.
This division of space is called a Dirichlet tessellation (or Voronoi diagram, or Thiessen regions).
45
The nearest neighbor algorithm is sensitive to
outliers
The solution is to…
46
We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number.
K = 1
K = 3
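A minimal sketch of K-nearest-neighbor voting with Euclidean distance; the training tuples and K = 3 below are illustrative, not from the slides.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """training: list of (feature_vector, label); query: feature_vector."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort by distance to the query, take the k closest, and let them vote.
    neighbors = sorted(training, key=lambda row: euclidean(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1.0, 2.0), "Grasshopper"), ((1.5, 1.0), "Grasshopper"),
            ((7.0, 8.0), "Katydid"), ((8.0, 7.5), "Katydid"), ((6.5, 9.0), "Katydid")]
print(knn_classify(training, (7.5, 8.0)))   # -> Katydid
```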
47
The nearest neighbor algorithm is sensitive to
irrelevant features
Training data
Suppose the following is true: if an insect's antenna is longer than 5.5, it is a Katydid; otherwise it is a Grasshopper. Using just the antenna length, we get perfect classification!
Suppose, however, we add in an irrelevant feature, for example the insect's mass. Using both the antenna length and the insect's mass with the 1-NN algorithm, we get the wrong classification!
48
How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?
  • Use more training instances
  • Ask an expert what features are relevant to the
    task
  • Use statistical tests to try to determine which
    features are useful
  • Search over feature subsets (in the next slide
    we will see why this is hard)

49
Why searching over feature subsets is hard
Suppose you have the following classification
problem, with 100 features, where it happens that
Features 1 and 2 (the X and Y below) give perfect
classification, but all 98 of the other features
are irrelevant
[Plots: the data projected onto only Feature 2, and onto only Feature 1]
Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 - 1 possible subsets of the features, only one really works.
50
The nearest neighbor algorithm is sensitive to
the units of measurement
X-axis measured in millimeters, Y-axis measured in dollars. The nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure numbers.
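One common way to do this is to z-normalize each feature (subtract its mean and divide by its standard deviation), so that all features become dimensionless and comparable. A sketch with made-up values:

```python
import statistics

def z_normalize(column):
    """Rescale one feature so it has mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    stdev = statistics.pstdev(column)
    return [(x - mean) / stdev for x in column]

lengths_mm = [120.0, 125.0, 130.0, 500.0]   # millimeters
prices_usd = [3.0, 3.5, 2.8, 3.1]           # dollars
print(z_normalize(lengths_mm))
print(z_normalize(prices_usd))
```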
51
We can speed up the nearest neighbor algorithm by throwing away some data. This is called data editing. Note that this can sometimes improve accuracy!
We can also speed up classification with indexing.
One possible approach: delete all instances that are surrounded by members of their own class.
52
Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case.
Max (p = ∞)
Manhattan (p = 1)
Weighted Euclidean
Mahalanobis
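The first three of these belong to the Minkowski family, sketched below (Mahalanobis also needs the data's covariance matrix, so it is omitted here); the example points are illustrative.

```python
def minkowski(a, b, p):
    """Minkowski distance: p = 1 is Manhattan, p = 2 is Euclidean; as p grows
    it approaches the max (Chebyshev) distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """The p = infinity limit: the largest coordinate-wise difference."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (1.0, 5.0), (4.0, 1.0)
print(minkowski(a, b, 1), minkowski(a, b, 2), chebyshev(a, b))  # 7.0, 5.0, 4.0
```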
53
In fact, we can use the nearest neighbor
algorithm with any distance/similarity function
For example, is Faloutsos Greek or Irish? We
could compare the name Faloutsos to a database
of names using string edit distance
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.
Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints, etc.
54
Advantages/Disadvantages of Nearest Neighbor
  • Advantages
  • Simple to implement
  • Handles correlated features (Arbitrary class
    shapes)
  • Defined for any distance measure
  • Handles streaming data trivially
  • Disadvantages
  • Very sensitive to irrelevant features.
  • Slow classification time for large datasets
  • Works best for real valued datasets

55
Decision Tree Classifier
Ross Quinlan
Abdomen Length > 7.1?
  yes: Katydid
  no: Antenna Length > 6.0?
    yes: Katydid
    no: Grasshopper
[Scatter plot: Antenna Length vs. Abdomen Length, partitioned by these two splits]
56
[Decision tree: a taxonomic key with the tests "Antennae shorter than body?", "3 Tarsi?", and "Foretiba has ears?", whose leaves are Grasshopper, Cricket, Katydid, and Camel Cricket]
Decision trees predate computers.
57
Decision Tree Classification
  • Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class
    distribution
  • Decision tree generation consists of two phases
  • Tree construction
  • At start, all the training examples are at the
    root
  • Partition examples recursively based on selected
    attributes
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Use of decision tree Classifying an unknown
    sample
  • Test the attribute values of the sample against
    the decision tree

58
How do we construct the decision tree?
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they can be discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes.
  • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain; see the sketch after this list)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  • There are no samples left
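As a sketch of the information-gain heuristic referred to above: for a categorical attribute, we compare the entropy of the class labels before the split with the weighted entropy of the partitions the split produces. The attribute/label toy data below are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy reduction achieved by splitting on a categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy data: splitting on this attribute separates the classes fairly well.
attribute = ["long", "long", "short", "short", "short"]
labels    = ["Katydid", "Katydid", "Grasshopper", "Grasshopper", "Katydid"]
print(information_gain(attribute, labels))
```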

59
We don't need to keep the data around, just the test conditions.
Weight <= 160?
  yes: Hair Length <= 2?
    yes: Male
    no: Female
  no: Male
How would these people be classified?
60
It is trivial to convert Decision Trees to rules
Weight <= 160?
  yes: Hair Length <= 2?
    yes: Male
    no: Female
  no: Male
Rules to Classify Males/Females:
If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
61
Once we have learned the decision tree, we don't even need a computer!
This decision tree is attached to a medical
machine, and is designed to help nurses make
decisions about what type of doctor to call.
Decision tree for a typical shared-care setting
applying the system for the diagnosis of
prostatic obstructions.
GP = general practitioner
62
The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few datapoints, there are many possible splitting rules that perfectly classify the data, but will not generalize to future datasets.
[Toy decision tree: "Wears green?" splitting Male from Female]
For example, the rule "Wears green?" perfectly classifies the data, so does "Mother's name is Jacqueline?", so does "Has blue shoes"…
63
Avoid Overfitting in Classification
  • The generated tree may overfit the training data
  • Too many branches, some may reflect anomalies due to noise or outliers
  • Result is poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a fully grown tree; get a sequence of progressively pruned trees
  • Use a set of data different from the training data to decide which is the best pruned tree

64
Which of the Pigeon Problems can be solved by a
Decision Tree?
  • Deep Bushy Tree
  • Useless
  • Deep Bushy Tree

?
The Decision Tree has a hard time with correlated
attributes
65
Advantages/Disadvantages of Decision Trees
  • Advantages
  • Easy to understand (Doctors love them!)
  • Easy to generate rules
  • Disadvantages
  • May suffer from overfitting.
  • Classifies by rectangular partitioning (so does
    not handle correlated features very well).
  • Can be quite large; pruning is necessary.
  • Does not handle streaming data easily

66
Summary of Classification
  • We have seen 3 major classification techniques
  • Simple linear classifier, Nearest neighbor,
    Decision tree.
  • There are other techniques
  • Neural Networks, Support Vector Machines,
    Genetic algorithms..
  • In general, there is no one best classifier for
    all problems. You have to consider what you hope
    to achieve, and the data itself

Let us now move on to the other classic problem
of data mining and machine learning, Clustering
67
What is Clustering?
Also called unsupervised learning, sometimes
called classification by statisticians and
sorting by psychologists and segmentation by
people in marketing
  • Organizing data into classes such that there is
  • high intra-class similarity
  • low inter-class similarity
  • Finding the class labels and the number of
    classes directly from the data (in contrast to
    classification).
  • More informally, finding natural groupings among
    objects.

68
What is a natural grouping among these objects?
69
What is a natural grouping among these objects?
Clustering is subjective
Simpson's Family
Males
Females
School Employees
70
What is Similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features.
(Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it". The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
71
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
[Figure: "Peter" and "Piotr" fed into different black-box distance functions, returning 0.23, 3, and 342.7]
72
When we peek inside one of these black boxes (here, the one that returned 3 for Peter and Piotr), we see some function on two variables. These functions might be very simple or very complex. In either case it is natural to ask: what properties should these functions have?
d('', '') = 0
d(s, '') = d('', s) = |s|   -- i.e. the length of s
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1),
                         d(s1+ch1, s2) + 1,
                         d(s1, s2+ch2) + 1 )
  • What properties should a distance measure have?
  • D(A,B) = D(B,A)   Symmetry
  • D(A,A) = 0   Constancy of Self-Similarity
  • D(A,B) = 0 iff A = B   Positivity (Separation)
  • D(A,B) ≤ D(A,C) + D(B,C)   Triangular Inequality

73
Intuitions behind desirable distance measure
properties
D(A,B) = D(B,A)   (Symmetry)
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0   (Constancy of Self-Similarity)
Otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B   (Positivity / Separation)
Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C)   (Triangular Inequality)
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
74
Two Types of Clustering
  • Partitional algorithms Construct various
    partitions and then evaluate them by some
    criterion (we will see an example called BIRCH)
  • Hierarchical algorithms Create a hierarchical
    decomposition of the set of objects using some
    criterion

Partitional
Hierarchical
75
Desirable Properties of a Clustering Algorithm
  • Scalability (in terms of both time and space)
  • Ability to deal with different data types
  • Minimal requirements for domain knowledge to
    determine input parameters
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • Incorporation of user-specified constraints
  • Interpretability and usability

76
A Useful Tool for Summarizing Similarity
Measurements
In order to better appreciate and evaluate the
examples given in the early part of this talk, we
will now introduce the dendrogram.
The similarity between two objects in a
dendrogram is represented as the height of the
lowest internal node they share.
77
There is only one dataset that can be perfectly
clustered using a hierarchy
(Bovine:0.69395, (Spider Monkey:0.390, (Gibbon:0.36079, (Orang:0.33636, (Gorilla:0.17147, (Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939)
78
Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo's hierarchy is manually created; we will focus on automatic creation of hierarchies in data mining.
Business Economy
B2B Finance Shopping Jobs
Aerospace Agriculture Banking Bonds Animals
Apparel Career Workspace
79
A Demonstration of Hierarchical Clustering using
String Edit Distance
Pedro (Portuguese) Petros (Greek), Peter
(English), Piotr (Polish), Peadar (Irish),
Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr
(Russian) Cristovao (Portuguese) Christoph
(German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer
(Scandinavian), Krystof (Czech), Christopher
(English) Miguel (Portuguese) Michalis (Greek),
Michael (English), Mick (Irish!)
Piotr
Peka
Mick
Piero
Peter
Pyotr
Pedro
Peder
Pietro
Pierre
Petros
Miguel
Peadar
Krystof
Michael
Michalis
Crisdean
Cristobal
Cristovao
Christoph
Kristoffer
Cristoforo
Christophe
Christopher
80
Pedro (Portuguese/Spanish) Petros (Greek), Peter
(English), Piotr (Polish), Peadar (Irish),
Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr (Russian)
Piotr
Peka
Pedro
Peter
Piero
Pyotr
Peder
Pierre
Pietro
Petros
Peadar
81
  • Hierarchical clustering can sometimes show patterns that are meaningless or spurious.
  • For example, in this clustering, the tight grouping of Australia, Anguilla, St. Helena, etc. is meaningful, since all these countries are former UK colonies.
  • However, the tight grouping of Niger and India is completely spurious; there is no connection between the two.

82
  • The flag of Niger is orange over white over green, with an orange disc on the central white stripe, symbolizing the sun. The orange stands for the Sahara desert, which borders Niger to the north. Green stands for the grassy plains of the south and west and for the River Niger which sustains them. It also stands for fraternity and hope. White generally symbolizes purity and hope.
  • The Indian flag is a horizontal tricolor in equal proportion of deep saffron on the top, white in the middle, and dark green at the bottom. In the center of the white band there is a wheel in navy blue to indicate the Dharma Chakra, the wheel of law in the Sarnath Lion Capital. This center symbol, or the 'CHAKRA', is a symbol dating back to the 2nd century BC. The saffron stands for courage and sacrifice; the white for purity and truth; the green for growth and auspiciousness.

83
We can look at the dendrogram to determine the
correct number of clusters. In this case, the
two highly separated subtrees are highly
suggestive of two clusters. (Things are rarely
this clear cut, unfortunately)
84
One potential use of a dendrogram is to detect
outliers
The single isolated branch is suggestive of a
data point that is very different to all others
Outlier
85
(How-to) Hierarchical Clustering
Since we cannot test all possible trees, we will have to do a heuristic search over all possible trees. We could do this:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
  • The number of dendrograms with n leafs = (2n - 3)! / (2^(n - 2) (n - 2)!)
  • Number of Leafs / Number of Possible Dendrograms
  • 2: 1
  • 3: 3
  • 4: 15
  • 5: 105
  • ...
  • 10: 34,459,425
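A minimal sketch of the bottom-up (agglomerative) procedure described above, using single linkage between clusters of one-dimensional points; the points and the stopping count are illustrative, not from the slides.

```python
def single_linkage(c1, c2):
    """Distance between two clusters of 1-D points = distance of their closest pair."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, target_clusters=1):
    """Start with each item in its own cluster; repeatedly merge the best (closest) pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Consider all possible merges and choose the best one.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerate([1.0, 1.2, 1.4, 8.0, 8.3], target_clusters=2))
```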

86
We begin with a distance matrix which contains
the distances between every pair of objects in
our database.
D( , ) = 8    D( , ) = 1
87
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges…
Choose the best…
88
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges…
Choose the best…
Consider all possible merges…
Choose the best…
89
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges…
Choose the best…
Consider all possible merges…
Choose the best…
Consider all possible merges…
Choose the best…
90
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges…
Choose the best…
Consider all possible merges…
Choose the best…
Consider all possible merges…
Choose the best…
91
We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or defining the distance between two clusters, is non-obvious.
  • Single linkage (nearest neighbor): In this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
  • Complete linkage (furthest neighbor): In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
  • Group average linkage: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
  • Ward's linkage: In this method, we try to minimize the variance of the merged clusters.

92
Single linkage
Average linkage
Ward's linkage
93
  • Summary of Hierarchical Clustering Methods
  • No need to specify the number of clusters in advance.
  • The hierarchical nature maps nicely onto human intuition for some domains.
  • They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
  • Like any heuristic search algorithm, local optima are a problem.
  • Interpretation of results is (very) subjective.

94
Up to this point we have simply assumed that we can measure similarity, but how do we measure similarity?
[Figure: "Peter" and "Piotr" fed into different black-box distance functions, returning 0.23, 3, and 342.7]
95
A generic technique for measuring similarity
To measure the similarity between two objects,
transform one of the objects into the other, and
measure how much effort it took. The measure of
effort becomes the distance measure.
The distance between Patty and Selma:
  Change dress color, 1 point
  Change earring shape, 1 point
  Change hair part, 1 point
  D(Patty, Selma) = 3
The distance between Marge and Selma:
  Change dress color, 1 point
  Add earrings, 1 point
  Decrease height, 1 point
  Take up smoking, 1 point
  Lose weight, 1 point
  D(Marge, Selma) = 5
This is called the "edit distance" or the "transformation distance".
96
Edit Distance Example
How similar are the names "Peter" and "Piotr"? Assume the following cost function: Substitution = 1 unit, Insertion = 1 unit, Deletion = 1 unit. Then D(Peter, Piotr) is 3.
It is possible to transform any string Q into string C using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. Note that for now we have ignored the issue of how we can find this cheapest transformation.
Peter → Piter → Pioter → Piotr
Substitution (i for e), Insertion (o), Deletion (e)
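The cheapest transformation can be found with the standard dynamic-programming formulation of the recurrence shown earlier (this is the usual Levenshtein algorithm, sketched here; it is not code from the slides):

```python
def edit_distance(q, c):
    """Cost of the cheapest sequence of substitutions, insertions and deletions
    (each costing 1 unit) turning string q into string c."""
    rows, cols = len(q) + 1, len(c) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                    # delete the first i characters of q
    for j in range(cols):
        d[0][j] = j                    # insert the first j characters of c
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + substitution,   # substitute (or match)
                          d[i - 1][j] + 1,                  # delete
                          d[i][j - 1] + 1)                  # insert
    return d[-1][-1]

print(edit_distance("Peter", "Piotr"))   # -> 3
```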
97
Partitional Clustering
  • Nonhierarchical, each instance is placed in
    exactly one of K nonoverlapping clusters.
  • Since only one set of clusters is output, the
    user normally has to input the desired number of
    clusters K.

98
Algorithm k-means:
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
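A minimal sketch of steps 1-5 for two-dimensional points with Euclidean distance; the data and the initial centers are made up, and edge cases such as empty clusters are ignored.

```python
import math

def kmeans(points, centers):
    """Assign points to the nearest center, re-estimate centers, and repeat
    until no point changes membership (steps 3-5 of the algorithm above)."""
    membership = None
    while True:
        new_membership = [min(range(len(centers)),
                              key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_membership == membership:          # step 5: nothing changed -> done
            return centers, membership
        membership = new_membership
        for c in range(len(centers)):             # step 4: re-estimate each center
            assigned = [p for p, m in zip(points, membership) if m == c]
            if assigned:
                centers[c] = tuple(sum(coord) / len(assigned) for coord in zip(*assigned))

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
print(kmeans(data, centers=[(0.0, 0.0), (10.0, 10.0)]))
```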
99
K-means Clustering Step 1
Algorithm: k-means; Distance metric: Euclidean distance
[Scatter plot on a 0-5 grid illustrating this step of the algorithm]
100
K-means Clustering Step 2
Algorithm: k-means; Distance metric: Euclidean distance
[Scatter plot on a 0-5 grid illustrating this step of the algorithm]
101
K-means Clustering Step 3
Algorithm: k-means; Distance metric: Euclidean distance
[Scatter plot on a 0-5 grid illustrating this step of the algorithm]
102
K-means Clustering Step 4
Algorithm: k-means; Distance metric: Euclidean distance
[Scatter plot on a 0-5 grid illustrating this step of the algorithm]
103
K-means Clustering Step 5
Algorithm: k-means; Distance metric: Euclidean distance
[Scatter plot on a 0-5 grid illustrating this step of the algorithm]
104
Comments on the K-Means Method
  • Strength
  • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
  • Often terminates at a local optimum. The global
    optimum may be found using techniques such as
    deterministic annealing and genetic algorithms
  • Weakness
  • Applicable only when mean is defined, then what
    about categorical data?
  • Need to specify k, the number of clusters, in
    advance
  • Unable to handle noisy data and outliers
  • Not suitable to discover clusters with non-convex
    shapes

105
The K-Medoids Clustering Method
  • Find representative objects, called medoids, in
    clusters
  • PAM (Partitioning Around Medoids, 1987)
  • starts from an initial set of medoids and
    iteratively replaces one of the medoids by one of
    the non-medoids if it improves the total distance
    of the resulting clustering
  • PAM works effectively for small data sets, but
    does not scale well for large data sets

106
What happens if the data is streaming?
Nearest Neighbor Clustering (not to be confused with Nearest Neighbor Classification)
  • Items are iteratively merged into the existing
    clusters that are closest.
  • Incremental
  • Threshold, t, used to determine if items are
    added to existing clusters or a new cluster is
    created.
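A minimal sketch of this incremental, threshold-based scheme (essentially a leader-style algorithm) for 2-D points; the threshold value and the arrival order below are illustrative.

```python
import math

def nearest_neighbor_clustering(stream, t):
    """Assign each arriving point to the closest existing cluster if its center is
    within threshold t; otherwise start a new cluster. Centers are running means."""
    centers, members = [], []
    for point in stream:
        if centers:
            i = min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))
            if math.dist(point, centers[i]) <= t:
                members[i].append(point)
                centers[i] = tuple(sum(xs) / len(xs) for xs in zip(*members[i]))
                continue
        centers.append(point)          # start a new cluster
        members.append([point])
    return members

print(nearest_neighbor_clustering([(1, 1), (1.5, 1.2), (8, 8), (8.2, 7.9)], t=2.0))
```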

107
Threshold t
[Plot: two existing clusters (1 and 2) and the threshold t]
108
New data point arrives. It is within the threshold for cluster 1, so add it to the cluster, and update the cluster center.
[Plot: clusters 1 and 2, with the new point (3) absorbed into cluster 1]
109
New data point arrives. It is not within the threshold for cluster 1, so create a new cluster, and so on...
[Plot: clusters 1, 2, 3 and the new cluster 4]
The algorithm is highly order dependent. It is difficult to determine t in advance.
110
How can we tell the right number of clusters? In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example.
For our example, we will use the familiar
katydid/grasshopper dataset. However, in this
case we are imagining that we do NOT know the
class labels. We are only clustering on the X and
Y axis values.
111
Squared Error
Objective Function
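The formula itself appears only as an image in the original slides; the usual squared-error objective that k-means minimizes (the summed squared distance of every object to the center of its cluster) can be written as follows, where C_k is the k-th cluster and m_k its center:

```latex
\mathrm{SE} \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - m_k \rVert^{2}
```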
112
When k = 1, the objective function is 873.0.
[Scatter plot of the dataset clustered with k = 1]
113
When k = 2, the objective function is 173.1.
[Scatter plot of the dataset clustered with k = 2]
114
When k = 3, the objective function is 133.6.
[Scatter plot of the dataset clustered with k = 3]
115
We can plot the objective function values for k = 1 to 6. The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding".
[Plot: objective function value vs. k, for k = 1 to 6]
Note that the results are not always as clear cut as in this toy example.
116
Conclusions
  • We have learned about the 3 major data
    mining/machine learning algorithms.
  • Almost all data mining research is in these 3
    areas, or is a minor extension of one or more of
    them.
  • For further study, I recommend:
  • Proceedings of SIGKDD, IEEE ICDM, SIAM SDM
  • Data Mining: Concepts and Techniques (Jiawei Han and Micheline Kamber)
  • Data Mining: Introductory and Advanced Topics (Margaret Dunham)