Title: Decision Tree
1 Decision Tree
2 Outline
- What is a Decision Tree
- How to Construct a Decision Tree
- Entropy, Information Gain
- Problems with Decision Trees
- Summary
3 What is a Decision Tree?
- A decision tree is a flow-chart-like tree structure, where
  - the root node and each internal node denote a test on an attribute,
  - each branch represents an outcome (attribute value) of the test, and
  - the leaves represent class labels (classification results).
- An example is shown on the right.
(Figure: a decision tree with a root node, internal nodes, and leaf nodes, showing whether a person will buy a sports car or a mini-van depending on their age and marital status.)
4 Decision Tree with Probabilities
- A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard).
- The figures under the leaves show the probability of survival.
(Figure: the Titanic decision tree; leaves are labelled "survived" or "died", with probabilities such as 0.73, 0.17, 0.89, and 0.05.)
5 Decision Tree for PlayTennis
- A tree for the concept PlayTennis.
- The tree classifies days according to whether or not they are suitable for playing tennis.
- E.g., <Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong> would be sorted down the leftmost branch of the tree and classified as a negative instance.
6 What factors cause some people to get sunburned?
7 Sunburn Data Collected
- Instance set: 3 x 3 x 3 x 2 = 54 possible combinations of attribute values.
- The chance of an exact match for any randomly chosen instance is 8/54 ≈ 0.15.
8 Decision Tree 1
(Figure: a tree for is_sunburned rooted at Height (short / average / tall), with further tests on Hair colour and Weight; the leaves hold Sarah, Dana, Alex, Annie, Emily, Pete, John, and Katie.)
9 Sunburn sufferers are ...
- If height = average then
  - if weight = light then
    - return(true)    (Sarah)
  - elseif weight = heavy then
    - if hair_colour = red then
      - return(true)    (Emily)
- elseif height = short then
  - if hair_colour = blonde then
    - if weight = average then
      - return(true)    (Annie)
- else return(false)    (everyone else)
<Height> is IRRELEVANT for determining whether someone will suffer from sunburn.
10 Decision Tree 2
(Figure: a tree for is_sunburned rooted at "Lotion used" (yes / no), with a Hair colour test (blonde / red / brown) on each branch; the leaves hold Sarah and Annie, Dana and Katie, Emily, Pete and John, and Alex.)
- This tree doesn't involve any of the irrelevant attributes.
11 Decision Tree 3
(Figure: a tree for is_sunburned rooted at Hair colour; brown leads to Alex, Pete and John, red leads to Emily, and blonde leads to a "Lotion used" test with Sarah and Annie under "no" and Dana and Katie under "yes".)
12 Why do we prefer short hypotheses?
- Irrelevant attributes do not classify the data well.
- Using irrelevant attributes thus leads to larger decision trees; conversely, larger trees are more likely to involve irrelevant attributes.
- So, simpler trees are more likely to reflect the true nature of the data.
- Occam's razor (c. A.D. 1320): prefer the simplest hypothesis that fits the data.
- A computer could look for simpler decision trees. Q: How?
13 3.4.1 Which attribute is the best classifier?
- Q: Which attribute should be tested first in the tree (i.e., which is the best attribute for splitting up the data)?
- A: The one which is most informative for the classification.
- Q: What does "most informative" mean?
- A: The attribute which best reduces the uncertainty, disorder, or impurity of the data.
- Q: How can we measure something like that?
- A: Simple.
14 3.4.1.1 Measuring the disorder of examples
- We need a quantity to measure the disorder/impurity in a set of examples
  S = {s1, s2, s3, ..., sn}
  where s1 = Sarah, s2 = Dana, ...
- It will be measured according to the values of the target attribute of the data.
- Then we need a quantity to measure the amount of reduction of the disorder.
15 What should the measure be?
- If all the examples in S have the same class, then D(S) = 0 (purest).
- If half the examples in S are of one class and half are of the opposite class, then D(S) = 1 (most impure).
16 Examples
- D({Dana, Pete}) = 0
- D({Sarah, Annie, Emily}) = 0
- D({Sarah, Emily, Alex, John}) = 1
- D({Sarah, Emily, Alex}) = ?
17 Entropy
- D({Sarah, Emily, Alex}) = 0.918: the set is 2/3 ≈ 0.67 positive, and the entropy curve gives 0.918 at that point.
(Figure: the entropy curve as a function of the fraction of positive examples.)
18 Definition of Disorder
- The entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative. It is given by
  D(n+, n-) = -(n+/n) log2(n+/n) - (n-/n) log2(n-/n)
  or, equivalently,
  Entropy(S) = -p1 log2(p1) - p0 log2(p0)
  where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
- If p1 = p0 = 0.5, Entropy(S) = ? If p1 = 0 and p0 = 1, Entropy(S) = ? If p1 = 1 and p0 = 0, Entropy(S) = ?
- For multi-class problems with c categories, entropy generalizes to
  Entropy(S) = - sum_{i=1..c} p_i log2(p_i).
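To make the definition concrete, here is a minimal Python sketch of the disorder measure; the function name `disorder` and the list-of-labels input format are my own choices, and the printed values reproduce the examples from slide 16 (using "burned"/"none" as the class labels of the sunburn example).

```python
import math
from collections import Counter

def disorder(labels):
    """Entropy of a set of examples, given the list of their class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

# The sets from the "Examples" slide above:
print(disorder(["none", "none"]))                      # D({Dana, Pete}) = 0.0
print(disorder(["burned", "burned", "burned"]))        # D({Sarah, Annie, Emily}) = 0.0
print(disorder(["burned", "burned", "none", "none"]))  # D({Sarah, Emily, Alex, John}) = 1.0
print(disorder(["burned", "burned", "none"]))          # D({Sarah, Emily, Alex}) ~ 0.918
```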
19 Back to the beach (or the disorder of sunbathers)!
- D({Sarah, Dana, Alex, Annie, Emily, Pete, John, Katie}) = D(3+, 5-) ≈ 0.954, since three of the eight sunbathers (Sarah, Annie, Emily) are sunburned.
20 Some more useful properties of the Entropy
21 What's left?
- So: we can measure the disorder.
- What's left?
- We want to measure how much the disorder of a set would be reduced by knowing the value of a particular attribute.
22 3.4.1.2 Information gain measures the expected reduction in entropy
- The information gain measures the expected reduction in entropy due to splitting on an attribute A:
  Gain(S, A) = Entropy(S) - sum over the values v of A of (|S_v| / |S|) * Entropy(S_v)
  The second term, the average disorder, is just the weighted sum of the disorders in the branches (subsets) created by the values of A.
- We want
  - a large Gain,
  - which is the same as a small average disorder created.
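A matching Python sketch of the gain computation; examples are assumed to be dicts mapping attribute names to values, with the class label stored under a `target` key (these representation choices are mine, not the slides').

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target="class"):
    """Gain(S, A): entropy of S minus the weighted average entropy of the subsets S_v."""
    labels = [ex[target] for ex in examples]
    average_disorder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        average_disorder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - average_disorder
```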
23 Back to the beach: calculate the Average Disorder associated with Hair Colour
(Figure: splitting the sunbathers on Hair colour gives blonde = {Sarah, Annie, Dana, Katie}, red = {Emily}, and brown = {Alex, Pete, John}.)
24 Calculating the Disorder of the blondes
- The first term of the sum: D(S_blonde) = D({Sarah, Annie, Dana, Katie}) = D(2+, 2-) = 1.
- There are 8 sunbathers in total, 4 of them blonde, so this term gets weight 4/8.
25 Calculating the disorder of the others
- The second and third terms of the sum: S_red = {Emily} and S_brown = {Alex, Pete, John}.
- These are both 0 because, within each set, all the examples have the same class.
- So the average disorder created when splitting on hair colour is 0.5 + 0 + 0 = 0.5.
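Putting the three terms together, the weighted sum works out as follows (the weights 4/8, 1/8, and 3/8 are the fractions of sunbathers in each hair-colour subset, and the notation with a bar for the average disorder is mine):

```latex
\bar{D}(\text{hair colour})
  = \tfrac{4}{8}\,D(S_{\text{blonde}}) + \tfrac{1}{8}\,D(S_{\text{red}}) + \tfrac{3}{8}\,D(S_{\text{brown}})
  = \tfrac{4}{8}\cdot 1 + \tfrac{1}{8}\cdot 0 + \tfrac{3}{8}\cdot 0
  = 0.5
```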
26 Which decision variable minimises the disorder?
- Test: average disorder
  - Hair colour: 0.5 (this is what we just computed)
  - Height: 0.69
  - Weight: 0.94
  - Lotion: 0.61
  (these are the average disorders of the other attributes, computed in the same way; the sketch after this slide reproduces these numbers)
- Which decision variable maximises the information gain, then?
- Remember: it's the one which minimises the average disorder.
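The table above can be reproduced with a short script. Note that the attribute values below are a reconstruction of the sunburn data set (the table on slide 7 was not transcribed), inferred from the trees on the earlier slides; treat the individual rows as an assumption, although the resulting disorders match the slide.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# (name, hair, height, weight, lotion, sunburned) -- reconstructed, see the note above
DATA = [
    ("Sarah", "blonde", "average", "light",   "no",  True),
    ("Dana",  "blonde", "tall",    "average", "yes", False),
    ("Alex",  "brown",  "short",   "average", "yes", False),
    ("Annie", "blonde", "short",   "average", "no",  True),
    ("Emily", "red",    "average", "heavy",   "no",  True),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  False),
    ("John",  "brown",  "average", "heavy",   "no",  False),
    ("Katie", "blonde", "short",   "light",   "yes", False),
]

def average_disorder(column):
    """Weighted sum of the entropies of the subsets created by one attribute."""
    total = len(DATA)
    disorder = 0.0
    for value in {row[column] for row in DATA}:
        subset = [row[5] for row in DATA if row[column] == value]
        disorder += len(subset) / total * entropy(subset)
    return disorder

for name, column in [("hair", 1), ("height", 2), ("weight", 3), ("lotion", 4)]:
    print(f"{name:7s} {average_disorder(column):.2f}")
# prints: hair 0.50, height 0.69, weight 0.94, lotion 0.61
```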
27 So what is the best decision tree?
(Figure: a partial tree for is_sunburned rooted at Hair colour; brown leads to Alex, Pete and John, red leads to Emily, and the blonde branch, containing Sarah, Annie, Dana and Katie, is still marked "?".)
- Once we have finished with hair colour, we then need to calculate the remaining branches of the decision tree.
- The examples corresponding to that branch now become the total set, and one just applies the same procedure as before with JUST those examples (i.e. the blondes).
28 Decision Tree Induction Pseudocode
DTree(examples, features) returns a tree
  If all examples are in one category, return a leaf node with that category label.
  Else if the set of features is empty, return a leaf node with the category label that is the most common in examples.
  Else pick a feature F and create a node R for it:
    For each possible value vi of F:
      Let examples_i be the subset of examples that have value vi for F.
      Add an outgoing edge E to node R labeled with the value vi.
      If examples_i is empty
        then attach a leaf node to edge E labeled with the category that is the most common in examples,
        else call DTree(examples_i, features - {F}) and attach the resulting tree as the subtree under edge E.
    Return the subtree rooted at R.
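A runnable Python sketch of this pseudocode, in the ID3 style; the dict-based node layout ("attr", "majority", "branches") and the helper names are my own choices. Unlike the pseudocode, it only creates branches for attribute values that actually occur in the examples.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, target):
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def dtree(examples, features, target="class", default=None):
    """Return a class label (leaf) or a dict node {"attr", "majority", "branches"}."""
    if not examples:                     # empty branch: fall back to the parent's majority label
        return default
    labels = [ex[target] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:            # all examples are in one category
        return labels[0]
    if not features:                     # no features left to test
        return majority
    best = max(features, key=lambda f: information_gain(examples, f, target))
    node = {"attr": best, "majority": majority, "branches": {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        node["branches"][value] = dtree(
            subset, [f for f in features if f != best], target, majority)
    return node
```

On the (reconstructed) sunburn examples this yields a tree rooted at hair colour with a "lotion used" test under the blonde branch, i.e. Decision Tree 3.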
29 3.5 Hypothesis Space Search
- The hypothesis space is a complete space of finite discrete-valued functions, relative to the available attributes, because every finite discrete-valued function can be represented by some decision tree.
- ID3 maintains only a single current hypothesis as it searches the space of decision trees; it does not keep track of how many decision trees are consistent with the data.
30 3.5 Hypothesis Space Search
- ID3 performs no backtracking in its search. It performs hill-climbing (greedy search), so it may find only a locally optimal solution. It is guaranteed to find a tree consistent with any conflict-free training set, but not necessarily the simplest such tree.
- It performs batch learning, processing all training instances at once, rather than incremental learning that updates a hypothesis after each example.
31 3.6 Bias in Decision-Tree Induction
- Information gain gives a bias toward trees with minimal depth.
- Inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not.
- ID3 implements a search (preference) bias rather than a language (restriction) bias.
32 Is this all? Is it really that simple?
- Of course not:
  - Where do we stop growing the tree?
  - What if there is noisy (mislabelled) data in the data set?
33 Overfitting
- Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
- There may be noise in the training data that the tree is erroneously fitting.
- The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
34 Overfitting
- Consider the error of hypothesis h over
  - the training data: error_train(h)
  - the whole data set (new data as well): error_D(h)
- If there is another hypothesis h' such that error_train(h) < error_train(h') and error_D(h) > error_D(h'), then we say that hypothesis h overfits the training data.
35 Overfitting Example
- Testing Ohm's Law: V = IR (i.e., I = (1/R)V).
- Experimentally measure 10 points.
- Fit a curve to the resulting data.
- It looks like Ohm was wrong: we have found a more accurate function!
36 Overfitting Example
- Testing Ohm's Law: V = IR (I = (1/R)V).
- Better generalization is obtained with a linear function that fits the training data less accurately.
37 Overfitting: Noise in Decision Trees
- Category or feature noise can easily cause overfitting.
- Add the noisy instance <medium, blue, circle>: positive (but really negative).
38 Overfitting: Noise in Decision Trees
- Category or feature noise can easily cause overfitting.
- Add the noisy instance <medium, blue, circle>: positive (but really negative).
- <big, blue, circle>: ?
- <medium, blue, circle>: ?
39 How can we avoid overfitting?
- Split the data into a training set and a validation set.
- Train on the training set and stop growing the tree when further splitting deteriorates performance on the validation set.
- Or grow the full tree first and then post-prune it.
- What if data is limited?
40 Effect of Reduced-error pruning
41 Reduced Error Pruning
- A post-pruning, cross-validation approach.
- Partition the training data into "grow" and "validation" sets.
- Build a complete tree from the "grow" data.
- Until accuracy on the validation set decreases, do:
  - For each non-leaf node n in the tree:
    - Temporarily prune the subtree below n and replace it with a leaf labeled with the current majority class at that node.
    - Measure and record the accuracy of the pruned tree on the validation set.
  - Permanently prune the node that results in the greatest increase in accuracy on the validation set.
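A minimal sketch of this procedure, assuming the dict-based trees produced by the ID3 sketch earlier (leaves are plain labels; internal nodes carry "attr", "majority", and "branches"). The helper names are my own, and pruning of the root node itself is not handled.

```python
def classify(tree, example):
    """Walk the tree until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, examples, target="class"):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def internal_children(tree):
    """Yield (parent, branch value, child) for every internal (non-leaf) child node."""
    for value, child in tree["branches"].items():
        if isinstance(child, dict):
            yield tree, value, child
            yield from internal_children(child)

def reduced_error_prune(tree, validation, target="class"):
    """Greedily replace subtrees with majority-class leaves while validation accuracy does not drop."""
    if not isinstance(tree, dict):
        return tree
    while True:
        base = accuracy(tree, validation, target)
        best, best_gain = None, 0.0
        for parent, value, child in list(internal_children(tree)):
            parent["branches"][value] = child["majority"]      # temporarily prune
            gain = accuracy(tree, validation, target) - base
            parent["branches"][value] = child                  # undo
            if gain >= best_gain:
                best_gain, best = gain, (parent, value, child)
        if best is None:                                       # any remaining prune would hurt
            return tree
        parent, value, child = best
        parent["branches"][value] = child["majority"]          # prune permanently
```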
42 Issues with Reduced Error Pruning
- The problem with this approach is that it potentially wastes training data on the validation set.
- The severity of this problem depends on where we are on the learning curve (accuracy as a function of the number of training examples).
43 Cross-Validating without Losing Training Data
- First, run several trials of reduced-error pruning using different random splits of the "grow" and validation sets.
- Record the complexity of the pruned tree learned in each trial. Let C be the average pruned-tree complexity.
- Grow a final tree breadth-first from all the training data, but stop when the complexity reaches C.
- A similar cross-validation approach can be used to set arbitrary algorithm parameters in general.
44 Rule Post-Pruning
- Generate the decision tree that best fits the training data, allowing overfitting.
- Convert the tree to an equivalent set of rules.
- Prune each rule independently of the others.
- Sort the final rules into the desired sequence for further use.
45 Rule Post-Pruning
- Converting a tree to rules.
46 Example
- Consider the leftmost path in the figure:
  if (Outlook = Sunny) ∧ (Humidity = High) then PlayTennis = No
- Consider removing either precondition, (Outlook = Sunny) or (Humidity = High).
- Choose the operation which does not reduce the accuracy.
47 3.7.2 Incorporating Continuous-valued Attributes
- ID3 is restricted to attributes that take on a discrete set of values:
  - the target attribute is discrete-valued;
  - the attributes tested are also discrete-valued.
- The second restriction can easily be removed: for a continuous attribute A, we can dynamically create a new Boolean attribute that is true if A < c, for some threshold c.
48 3.7.2 Incorporating Continuous-valued Attributes
- Suppose that the training examples associated with a particular node in the decision tree have the following values for Temperature and the target attribute PlayTennis.
- What threshold would be picked? The one that produces the greatest information gain.
49 3.7.2 Incorporating Continuous-valued Attributes
- Sort the examples according to the continuous attribute.
- Then identify adjacent examples that differ in their target classification.
- Generate a set of candidate thresholds midway between the corresponding values.
- The value of the threshold that maximizes the information gain must always lie at such a boundary (Fayyad, 1991).
- In the example, there are two candidate thresholds, 54 and 85, giving the Boolean attributes Temperature > 54 and Temperature > 85.
- Which one is the best? The information gain can be computed for each of these candidate attributes.
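A small sketch of the threshold-generation step; the Temperature and PlayTennis values are those of Mitchell's example, which this slide appears to use, and the function name is my own.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent (sorted) values whose target classifications differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperature, play_tennis))   # [54.0, 85.0]
```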
50 3.7.3 Alternative Measures for Selecting Attributes
- Information gain favors attributes with many values over those with few values: if an attribute has many values, Gain will tend to select it.
- Imagine using Date as an attribute: it would have the highest information gain and be selected as the decision attribute for the root node.
- Yet it is not a useful predictor, despite the fact that it perfectly separates the training data.
51 3.7.3 Alternative Measures for Selecting Attributes
- Solution: use the gain ratio.
- Penalize such attributes by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data.
- The gain ratio discourages the selection of attributes with many uniformly distributed values. E.g., for Date splitting n examples into singletons, SplitInformation would be log2(n); for a Boolean attribute splitting n examples in half, SplitInformation is 1.
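For reference, the split information and gain ratio are standardly defined (following Mitchell) as:

```latex
\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}
```

where S_1 through S_c are the subsets of S created by the c values of attribute A.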
52 3.7.3 Alternative Measures for Selecting Attributes
- One problem with using the gain ratio is that the denominator can be zero or very small when |S_i| ≈ |S| for one of the S_i.
- This makes the gain ratio undefined, or very large, for attributes that happen to have nearly the same value for all members of S.
- Solution when using the gain ratio: first calculate the Gain of each attribute, then apply the GainRatio test only to those attributes with above-average Gain.
53 3.7.4 Handling Training Examples with Missing Attribute Values
- In certain cases, the values of some attributes may be missing for some training examples. It is common to estimate the missing attribute value based on other examples for which this attribute has a known value.
- E.g., in a medical domain, the lab test Blood-Test-Result may be available only for a subset of the patients.
- Consider the situation in which Gain(S, A) is to be calculated at node n to evaluate whether attribute A is the best attribute to test at this node of the decision tree. Suppose that <x, c(x)> is one of the training examples in S and that the value A(x) is unknown.
54 3.7.4 Handling Training Examples with Missing Attribute Values
- Strategies:
  - Assign A(x) the value that is most common among the training examples at node n.
  - Or assign it the most common value among the examples at node n that have the classification c(x).
  - Or use a more complex strategy: assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). These probabilities can be estimated from the observed frequencies of the various values of A among the examples at node n.
55 Training Examples with Missing Attribute Values
- S contains 6 examples with A = 1 and 4 examples with A = 0; the example x with unknown A is assigned to A = 1 with fraction 0.6 and to A = 0 with fraction 0.4.
- S then effectively contains 6.6 examples with A = 1 and 4.4 examples with A = 0.
56 Training Examples with Missing Attribute Values
- As before, x is split as <x, A = 1, 0.6> and <x, A = 0, 0.4>. Assume that among the six examples with A = 1, the number of positive examples is 1 and the number of negative examples is 5; among the 4 examples with A = 0, the number of positive examples is 3 and the number of negative examples is 1.
- If we know x is positive, then
  E(S_{A=1}) = -(1.6/6.6) log2(1.6/6.6) - (5/6.6) log2(5/6.6) = ...
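Carrying the fractional-count calculation through in Python (assuming, as stated above, 1 positive and 5 negative examples with A = 1, plus the 0.6 fraction of the positive example x):

```python
import math

positives = 1 + 0.6            # fractional positive count in the A = 1 branch
negatives = 5.0
total = positives + negatives  # 6.6 "examples"

entropy = sum(-(count / total) * math.log2(count / total)
              for count in (positives, negatives))
print(round(entropy, 3))       # 0.799
```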
57 3.7.5 Handling Attributes with Differing Costs
- The instance attributes may have associated costs. E.g., to classify diseases of patients, attributes such as Temperature, Pulse, and BloodTestResults may be used.
- We prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications.
- ID3 can be modified to take attribute costs into account by introducing a cost term into the attribute selection measure, as sketched below.
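Two cost-sensitive selection measures of this kind, discussed in Mitchell's text, are (with w in [0, 1] controlling the weight given to cost in the second):

```latex
\frac{\mathrm{Gain}^2(S, A)}{\mathrm{Cost}(A)} \quad \text{(Tan and Schlimmer, 1990)}
\qquad
\frac{2^{\mathrm{Gain}(S, A)} - 1}{(\mathrm{Cost}(A) + 1)^{w}} \quad \text{(Nunez, 1988)}
```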
59 Summary
- A practical method: ID3
- Greedy search, growing the tree from the root downward
- Search in a complete hypothesis space
- Preference for smaller trees
- Overfitting issues and methods of post-pruning
- Extensions to the basic ID3 algorithm
60
- C4.5 is a software extension of the basic ID3 algorithm, designed by Quinlan to address the following issues not dealt with by ID3:
  - Avoiding overfitting the data
  - Determining how deeply to grow a decision tree
  - Reduced-error pruning
  - Rule post-pruning
  - Handling continuous attributes (e.g., temperature)
  - Choosing an appropriate attribute selection measure
  - Handling training data with missing attribute values
  - Handling attributes with differing costs
  - Improving computational efficiency
61 Summary
- Decision Tree Representation
- Entropy, Information Gain
- ID3 Learning algorithm
- Overfitting and how to avoid it
62 Homework
- Exercises 3.1, 3.2, 3.3