Title: Decision Tree
1 Decision Tree
2 Outline
- What is a Decision Tree
- How to Construct a Decision Tree
- Entropy, Information Gain
- Problems with Decision Trees
- Summary
3 What is a Decision Tree?
- A decision tree is a flow-chart-like tree structure, where
  - the root node and each internal node denote a test on an attribute,
  - each branch represents an outcome (attribute value) of the test, and
  - the leaves represent class labels (classification results).
- An example is shown on the right.
(Figure: a decision tree with a root node, internal nodes, and leaf nodes, showing whether a person will buy a sports car or a mini-van depending on their age and marital status.)
4 Decision Tree with Probabilities
- A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard).
- The figures under the leaves show the probability of survival.
(Figure: the Titanic decision tree; leaves are labelled "survived" or "died", with probabilities such as 0.73, 0.17, 0.89, and 0.05.)
5 Decision Tree for PlayTennis
- A tree for the concept PlayTennis.
- The tree classifies days according to whether or not they are suitable for playing tennis.
- E.g., <Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong> would be sorted down the leftmost branch of the tree and classified as a negative instance.
6 What factors cause some people to get sunburned?
7 Sunburn Data Collected
- Instance set: 3 x 3 x 3 x 2 = 54 possible combinations of attribute values.
- The chance of an exact match for any randomly chosen instance is 8/54 ≈ 0.15.
8 Decision Tree 1
(Figure: a tree for is_sunburned rooted at Height (short / average / tall), with further tests on Hair colour and Weight; the leaves hold Sarah, Dana, Alex, Annie, Emily, Pete, John, and Katie.)
9 Sunburn sufferers are ...
- If height = average then
  - if weight = light then
    - return(true)    (Sarah)
  - elseif weight = heavy then
    - if hair_colour = red then
      - return(true)    (Emily)
- elseif height = short then
  - if hair_colour = blonde then
    - if weight = average then
      - return(true)    (Annie)
- else return(false)    (everyone else)
<Height> is IRRELEVANT for determining whether someone will suffer from sunburn.
10 Decision Tree 2
(Figure: a tree for is_sunburned rooted at "Lotion used" (yes / no), with a Hair colour test (blonde / red / brown) on each branch; the leaves hold Sarah and Annie, Dana and Katie, Emily, Pete and John, and Alex.)
- This tree doesn't involve any of the irrelevant attributes.
11 Decision Tree 3
(Figure: a tree for is_sunburned rooted at Hair colour; brown leads to Alex, Pete and John, red leads to Emily, and blonde leads to a "Lotion used" test with Sarah and Annie under "no" and Dana and Katie under "yes".)
12 Why do we prefer short hypotheses?
- Irrelevant attributes do not classify the data well.
- Using irrelevant attributes thus leads to larger decision trees; conversely, larger trees are more likely to involve irrelevant attributes.
- So, simpler trees are more likely to reflect the true nature of the data.
- Occam's razor (c. A.D. 1320): prefer the simplest hypothesis that fits the data.
- A computer could look for simpler decision trees. Q: How?
13 3.4.1 Which attribute is the best classifier?
- Q: Which attribute should be tested first in the tree (i.e., which is the best attribute for splitting up the data)?
- A: The one which is most informative for the classification.
- Q: What does "most informative" mean?
- A: The attribute which best reduces the uncertainty, disorder, or impurity of the data.
- Q: How can we measure something like that?
- A: Simple.
14 3.4.1.1 Measuring the disorder of examples
- We need a quantity to measure the disorder/impurity in a set of examples
  S = {s1, s2, s3, ..., sn}
  where s1 = Sarah, s2 = Dana, ...
- It will be measured according to the values of the target attribute of the data.
- Then we need a quantity to measure the amount of reduction of the disorder.
15 What should the measure be?
- If all the examples in S have the same class, then D(S) = 0 (purest).
- If half the examples in S are of one class and half are of the opposite class, then D(S) = 1 (most impure).
16 Examples
- D({Dana, Pete}) = 0
- D({Sarah, Annie, Emily}) = 0
- D({Sarah, Emily, Alex, John}) = 1
- D({Sarah, Emily, Alex}) = ?
17 Entropy
- D({Sarah, Emily, Alex}) = 0.918: the set is 2/3 ≈ 0.67 positive, and the entropy curve gives 0.918 at that point.
(Figure: the entropy curve as a function of the fraction of positive examples.)
18 Definition of Disorder
- The entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative. It is given by
  D(n+, n-) = -(n+/n) log2(n+/n) - (n-/n) log2(n-/n)
  or, equivalently,
  Entropy(S) = -p1 log2(p1) - p0 log2(p0)
  where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
- If p1 = p0 = 0.5, Entropy(S) = ? If p1 = 0 and p0 = 1, Entropy(S) = ? If p1 = 1 and p0 = 0, Entropy(S) = ?
- For multi-class problems with c categories, entropy generalizes to
  Entropy(S) = - sum_{i=1..c} p_i log2(p_i).
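To make the definition concrete, here is a minimal Python sketch of the disorder measure; the function name `disorder` and the list-of-labels input format are my own choices, and the printed values reproduce the examples from slide 16 (using "burned"/"none" as the class labels of the sunburn example).

```python
import math
from collections import Counter

def disorder(labels):
    """Entropy of a set of examples, given the list of their class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

# The sets from the "Examples" slide above:
print(disorder(["none", "none"]))                      # D({Dana, Pete}) = 0.0
print(disorder(["burned", "burned", "burned"]))        # D({Sarah, Annie, Emily}) = 0.0
print(disorder(["burned", "burned", "none", "none"]))  # D({Sarah, Emily, Alex, John}) = 1.0
print(disorder(["burned", "burned", "none"]))          # D({Sarah, Emily, Alex}) ~ 0.918
```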
19 Back to the beach (or the disorder of sunbathers)!
- D({Sarah, Dana, Alex, Annie, Emily, Pete, John, Katie}) = D(3+, 5-) ≈ 0.954, since three of the eight sunbathers (Sarah, Annie, Emily) are sunburned.
20 Some more useful properties of the Entropy
21 What's left?
- So: we can measure the disorder.
- What's left?
- We want to measure how much the disorder of a set would be reduced by knowing the value of a particular attribute.
22 3.4.1.2 Information gain measures the expected reduction in entropy
- The information gain measures the expected reduction in entropy due to splitting on an attribute A:
  Gain(S, A) = Entropy(S) - sum over the values v of A of (|S_v| / |S|) * Entropy(S_v)
  The second term, the average disorder, is just the weighted sum of the disorders in the branches (subsets) created by the values of A.
- We want
  - a large Gain,
  - which is the same as a small average disorder created.
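A matching Python sketch of the gain computation; examples are assumed to be dicts mapping attribute names to values, with the class label stored under a `target` key (these representation choices are mine, not the slides').

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target="class"):
    """Gain(S, A): entropy of S minus the weighted average entropy of the subsets S_v."""
    labels = [ex[target] for ex in examples]
    average_disorder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        average_disorder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - average_disorder
```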
23 Back to the beach: calculate the Average Disorder associated with Hair Colour
(Figure: splitting the sunbathers on Hair colour gives blonde = {Sarah, Annie, Dana, Katie}, red = {Emily}, and brown = {Alex, Pete, John}.)
24 Calculating the Disorder of the blondes
- The first term of the sum: D(S_blonde) = D({Sarah, Annie, Dana, Katie}) = D(2+, 2-) = 1.
- There are 8 sunbathers in total, 4 of them blonde, so this term gets weight 4/8.
25 Calculating the disorder of the others
- The second and third terms of the sum: S_red = {Emily} and S_brown = {Alex, Pete, John}.
- These are both 0 because, within each set, all the examples have the same class.
- So the average disorder created when splitting on hair colour is 0.5 + 0 + 0 = 0.5.
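Putting the three terms together, the weighted sum works out as follows (the weights 4/8, 1/8, and 3/8 are the fractions of sunbathers in each hair-colour subset, and the notation with a bar for the average disorder is mine):

```latex
\bar{D}(\text{hair colour})
  = \tfrac{4}{8}\,D(S_{\text{blonde}}) + \tfrac{1}{8}\,D(S_{\text{red}}) + \tfrac{3}{8}\,D(S_{\text{brown}})
  = \tfrac{4}{8}\cdot 1 + \tfrac{1}{8}\cdot 0 + \tfrac{3}{8}\cdot 0
  = 0.5
```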
26 Which decision variable minimises the disorder?
- Test: average disorder
  - Hair colour: 0.5 (this is what we just computed)
  - Height: 0.69
  - Weight: 0.94
  - Lotion: 0.61
  (these are the average disorders of the other attributes, computed in the same way; the sketch after this slide reproduces these numbers)
- Which decision variable maximises the information gain, then?
- Remember: it's the one which minimises the average disorder.
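The table above can be reproduced with a short script. Note that the attribute values below are a reconstruction of the sunburn data set (the table on slide 7 was not transcribed), inferred from the trees on the earlier slides; treat the individual rows as an assumption, although the resulting disorders match the slide.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# (name, hair, height, weight, lotion, sunburned) -- reconstructed, see the note above
DATA = [
    ("Sarah", "blonde", "average", "light",   "no",  True),
    ("Dana",  "blonde", "tall",    "average", "yes", False),
    ("Alex",  "brown",  "short",   "average", "yes", False),
    ("Annie", "blonde", "short",   "average", "no",  True),
    ("Emily", "red",    "average", "heavy",   "no",  True),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  False),
    ("John",  "brown",  "average", "heavy",   "no",  False),
    ("Katie", "blonde", "short",   "light",   "yes", False),
]

def average_disorder(column):
    """Weighted sum of the entropies of the subsets created by one attribute."""
    total = len(DATA)
    disorder = 0.0
    for value in {row[column] for row in DATA}:
        subset = [row[5] for row in DATA if row[column] == value]
        disorder += len(subset) / total * entropy(subset)
    return disorder

for name, column in [("hair", 1), ("height", 2), ("weight", 3), ("lotion", 4)]:
    print(f"{name:7s} {average_disorder(column):.2f}")
# prints: hair 0.50, height 0.69, weight 0.94, lotion 0.61
```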
27 So what is the best decision tree?
(Figure: a partial tree for is_sunburned rooted at Hair colour; brown leads to Alex, Pete and John, red leads to Emily, and the blonde branch, containing Sarah, Annie, Dana and Katie, is still marked "?".)
- Once we have finished with hair colour, we then need to calculate the remaining branches of the decision tree.
- The examples corresponding to that branch now become the total set, and one just applies the same procedure as before with JUST those examples (i.e. the blondes).
28 Decision Tree Induction Pseudocode
DTree(examples, features) returns a tree
  If all examples are in one category, return a leaf node with that category label.
  Else if the set of features is empty, return a leaf node with the category label that is the most common in examples.
  Else pick a feature F and create a node R for it:
    For each possible value vi of F:
      Let examples_i be the subset of examples that have value vi for F.
      Add an outgoing edge E to node R labeled with the value vi.
      If examples_i is empty
        then attach a leaf node to edge E labeled with the category that is the most common in examples,
        else call DTree(examples_i, features - {F}) and attach the resulting tree as the subtree under edge E.
    Return the subtree rooted at R.
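A runnable Python sketch of this pseudocode, in the ID3 style; the dict-based node layout ("attr", "majority", "branches") and the helper names are my own choices. Unlike the pseudocode, it only creates branches for attribute values that actually occur in the examples.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, target):
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def dtree(examples, features, target="class", default=None):
    """Return a class label (leaf) or a dict node {"attr", "majority", "branches"}."""
    if not examples:                     # empty branch: fall back to the parent's majority label
        return default
    labels = [ex[target] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:            # all examples are in one category
        return labels[0]
    if not features:                     # no features left to test
        return majority
    best = max(features, key=lambda f: information_gain(examples, f, target))
    node = {"attr": best, "majority": majority, "branches": {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        node["branches"][value] = dtree(
            subset, [f for f in features if f != best], target, majority)
    return node
```

On the (reconstructed) sunburn examples this yields a tree rooted at hair colour with a "lotion used" test under the blonde branch, i.e. Decision Tree 3.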
29 3.5 Hypothesis Space Search
- The hypothesis space is a complete space of finite discrete-valued functions, relative to the available attributes, because every finite discrete-valued function can be represented by some decision tree.
- ID3 maintains only a single current hypothesis as it searches the space of decision trees; it does not keep track of how many decision trees are consistent with the data.
30 3.5 Hypothesis Space Search
- ID3 performs no backtracking in its search. It performs hill-climbing (greedy search), so it may find only a locally optimal solution. It is guaranteed to find a tree consistent with any conflict-free training set, but not necessarily the simplest such tree.
- It performs batch learning, processing all training instances at once, rather than incremental learning that updates a hypothesis after each example.
31 3.6 Bias in Decision-Tree Induction
- Information gain gives a bias toward trees with minimal depth.
- Inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not.
- ID3 implements a search (preference) bias rather than a language (restriction) bias.
32 Is this all? Is it really that simple?
- Of course not:
  - Where do we stop growing the tree?
  - What if there is noisy (mislabelled) data in the data set?
33 Overfitting
- Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
- There may be noise in the training data that the tree is erroneously fitting.
- The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
34 Overfitting
- Consider the error of hypothesis h over
  - the training data: error_train(h)
  - the whole data set (new data as well): error_D(h)
- If there is another hypothesis h' such that error_train(h) < error_train(h') and error_D(h) > error_D(h'), then we say that hypothesis h overfits the training data.
35 Overfitting Example
- Testing Ohm's Law: V = IR (i.e., I = (1/R)V).
- Experimentally measure 10 points.
- Fit a curve to the resulting data.
- It looks like Ohm was wrong: we have found a more accurate function!
36 Overfitting Example
- Testing Ohm's Law: V = IR (I = (1/R)V).
- Better generalization is obtained with a linear function that fits the training data less accurately.
37 Overfitting: Noise in Decision Trees
- Category or feature noise can easily cause overfitting.
- Add the noisy instance <medium, blue, circle>: positive (but really negative).
38 Overfitting: Noise in Decision Trees
- Category or feature noise can easily cause overfitting.
- Add the noisy instance <medium, blue, circle>: positive (but really negative).
- <big, blue, circle>: ?
- <medium, blue, circle>: ?
39 How can we avoid overfitting?
- Split the data into a training set and a validation set.
- Train on the training set and stop growing the tree when further splitting deteriorates performance on the validation set.
- Or grow the full tree first and then post-prune it.
- What if data is limited?
40 Effect of Reduced-error pruning
41 Reduced Error Pruning
- A post-pruning, cross-validation approach.
- Partition the training data into "grow" and "validation" sets.
- Build a complete tree from the "grow" data.
- Until accuracy on the validation set decreases, do:
  - For each non-leaf node n in the tree:
    - Temporarily prune the subtree below n and replace it with a leaf labeled with the current majority class at that node.
    - Measure and record the accuracy of the pruned tree on the validation set.
  - Permanently prune the node that results in the greatest increase in accuracy on the validation set.
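A minimal sketch of this procedure, assuming the dict-based trees produced by the ID3 sketch earlier (leaves are plain labels; internal nodes carry "attr", "majority", and "branches"). The helper names are my own, and pruning of the root node itself is not handled.

```python
def classify(tree, example):
    """Walk the tree until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, examples, target="class"):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def internal_children(tree):
    """Yield (parent, branch value, child) for every internal (non-leaf) child node."""
    for value, child in tree["branches"].items():
        if isinstance(child, dict):
            yield tree, value, child
            yield from internal_children(child)

def reduced_error_prune(tree, validation, target="class"):
    """Greedily replace subtrees with majority-class leaves while validation accuracy does not drop."""
    if not isinstance(tree, dict):
        return tree
    while True:
        base = accuracy(tree, validation, target)
        best, best_gain = None, 0.0
        for parent, value, child in list(internal_children(tree)):
            parent["branches"][value] = child["majority"]      # temporarily prune
            gain = accuracy(tree, validation, target) - base
            parent["branches"][value] = child                  # undo
            if gain >= best_gain:
                best_gain, best = gain, (parent, value, child)
        if best is None:                                       # any remaining prune would hurt
            return tree
        parent, value, child = best
        parent["branches"][value] = child["majority"]          # prune permanently
```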
42 Issues with Reduced Error Pruning
- The problem with this approach is that it potentially wastes training data on the validation set.
- The severity of this problem depends on where we are on the learning curve (accuracy as a function of the number of training examples).
43 Cross-Validating without Losing Training Data
- First, run several trials of reduced-error pruning using different random splits of the "grow" and validation sets.
- Record the complexity of the pruned tree learned in each trial. Let C be the average pruned-tree complexity.
- Grow a final tree breadth-first from all the training data, but stop when the complexity reaches C.
- A similar cross-validation approach can be used to set arbitrary algorithm parameters in general.
44 Rule Post-Pruning
- Generate the decision tree that best fits the training data, allowing overfitting.
- Convert the tree to an equivalent set of rules.
- Prune each rule independently of the others.
- Sort the final rules into the desired sequence for further use.
45 Rule Post-Pruning
- Converting a tree to rules.
46 Example
- Consider the leftmost path in the figure:
  if (Outlook = Sunny) ∧ (Humidity = High) then PlayTennis = No
- Consider removing either precondition, (Outlook = Sunny) or (Humidity = High).
- Choose the operation which does not reduce the accuracy.
47 3.7.2 Incorporating Continuous-valued Attributes
- ID3 is restricted to attributes that take on a discrete set of values:
  - the target attribute is discrete-valued;
  - the attributes tested are also discrete-valued.
- The second restriction can easily be removed: for a continuous attribute A, we can dynamically create a new Boolean attribute that is true if A < c, for some threshold c.
48 3.7.2 Incorporating Continuous-valued Attributes
- Suppose that the training examples associated with a particular node in the decision tree have the following values for Temperature and the target attribute PlayTennis.
- What threshold would be picked? The one that produces the greatest information gain.
49 3.7.2 Incorporating Continuous-valued Attributes
- Sort the examples according to the continuous attribute.
- Then identify adjacent examples that differ in their target classification.
- Generate a set of candidate thresholds midway between the corresponding values.
- The value of the threshold that maximizes the information gain must always lie at such a boundary (Fayyad, 1991).
- In the example, there are two candidate thresholds, 54 and 85, giving the Boolean attributes Temperature > 54 and Temperature > 85.
- Which one is the best? The information gain can be computed for each of these candidate attributes.
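A small sketch of the threshold-generation step; the Temperature and PlayTennis values are those of Mitchell's example, which this slide appears to use, and the function name is my own.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent (sorted) values whose target classifications differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperature, play_tennis))   # [54.0, 85.0]
```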
50 3.7.3 Alternative Measures for Selecting Attributes
- Information gain favors attributes with many values over those with few values: if an attribute has many values, Gain will tend to select it.
- Imagine using Date as an attribute: it would have the highest information gain and be selected as the decision attribute for the root node.
- Yet it is not a useful predictor, despite the fact that it perfectly separates the training data.
51 3.7.3 Alternative Measures for Selecting Attributes
- Solution: use the gain ratio.
- Penalize such attributes by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data.
- The gain ratio discourages the selection of attributes with many uniformly distributed values. E.g., for Date splitting n examples into singletons, SplitInformation would be log2(n); for a Boolean attribute splitting n examples in half, SplitInformation is 1.
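For reference, the split information and gain ratio are standardly defined (following Mitchell) as:

```latex
\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}
```

where S_1 through S_c are the subsets of S created by the c values of attribute A.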
52 3.7.3 Alternative Measures for Selecting Attributes
- One problem with using the gain ratio is that the denominator can be zero or very small when |S_i| ≈ |S| for one of the S_i.
- This makes the gain ratio undefined, or very large, for attributes that happen to have nearly the same value for all members of S.
- Solution when using the gain ratio: first calculate the Gain of each attribute, then apply the GainRatio test only to those attributes with above-average Gain.
53 3.7.4 Handling Training Examples with Missing Attribute Values
- In certain cases, the values of some attributes may be missing for some training examples. It is common to estimate the missing attribute value based on other examples for which this attribute has a known value.
- E.g., in a medical domain, the lab test Blood-Test-Result may be available only for a subset of the patients.
- Consider the situation in which Gain(S, A) is to be calculated at node n to evaluate whether attribute A is the best attribute to test at this node of the decision tree. Suppose that <x, c(x)> is one of the training examples in S and that the value A(x) is unknown.
54 3.7.4 Handling Training Examples with Missing Attribute Values
- Strategies:
  - Assign A(x) the value that is most common among the training examples at node n.
  - Or assign it the most common value among the examples at node n that have the classification c(x).
  - Or use a more complex strategy: assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). These probabilities can be estimated from the observed frequencies of the various values of A among the examples at node n.
55 Training Examples with Missing Attribute Values
- S contains 6 examples with A = 1 and 4 examples with A = 0; the example x with unknown A is assigned to A = 1 with fraction 0.6 and to A = 0 with fraction 0.4.
- S then effectively contains 6.6 examples with A = 1 and 4.4 examples with A = 0.
56 Training Examples with Missing Attribute Values
- As before, x is split as <x, A = 1, 0.6> and <x, A = 0, 0.4>. Assume that among the six examples with A = 1, the number of positive examples is 1 and the number of negative examples is 5; among the 4 examples with A = 0, the number of positive examples is 3 and the number of negative examples is 1.
- If we know x is positive, then
  E(S_{A=1}) = -(1.6/6.6) log2(1.6/6.6) - (5/6.6) log2(5/6.6) = ...
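Carrying the fractional-count calculation through in Python (assuming, as stated above, 1 positive and 5 negative examples with A = 1, plus the 0.6 fraction of the positive example x):

```python
import math

positives = 1 + 0.6            # fractional positive count in the A = 1 branch
negatives = 5.0
total = positives + negatives  # 6.6 "examples"

entropy = sum(-(count / total) * math.log2(count / total)
              for count in (positives, negatives))
print(round(entropy, 3))       # 0.799
```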
57 3.7.5 Handling Attributes with Differing Costs
- The instance attributes may have associated costs. E.g., to classify diseases of patients, attributes such as Temperature, Pulse, and BloodTestResults may be used.
- We prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications.
- ID3 can be modified to take attribute costs into account by introducing a cost term into the attribute selection measure, as sketched below.
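Two cost-sensitive selection measures of this kind, discussed in Mitchell's text, are (with w in [0, 1] controlling the weight given to cost in the second):

```latex
\frac{\mathrm{Gain}^2(S, A)}{\mathrm{Cost}(A)} \quad \text{(Tan and Schlimmer, 1990)}
\qquad
\frac{2^{\mathrm{Gain}(S, A)} - 1}{(\mathrm{Cost}(A) + 1)^{w}} \quad \text{(Nunez, 1988)}
```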
59 Summary
- A practical method: ID3
- Greedy search, growing the tree from the root downward
- Search in a complete hypothesis space
- Preference for smaller trees
- Overfitting issues and methods of post-pruning
- Extensions to the basic ID3 algorithm
60
- C4.5 is a software extension of the basic ID3 algorithm, designed by Quinlan to address the following issues not dealt with by ID3:
  - Avoiding overfitting the data
  - Determining how deeply to grow a decision tree
  - Reduced-error pruning
  - Rule post-pruning
  - Handling continuous attributes (e.g., temperature)
  - Choosing an appropriate attribute selection measure
  - Handling training data with missing attribute values
  - Handling attributes with differing costs
  - Improving computational efficiency
61 Summary
- Decision Tree Representation
- Entropy, Information Gain
- ID3 Learning algorithm
- Overfitting and how to avoid it
62 Homework
- Exercises 3.1, 3.2, 3.3