Decision Tree - PowerPoint PPT Presentation

About This Presentation

Title: Decision Tree

Description: A root node is one that has no incoming edges. An internal
node is one that has exactly one incoming edge and two or more outgoing
edges.
Transcript and Presenter's Notes

Title: Decision Tree


1
Decision Tree
2
Outline
  • What is a Decision Tree
  • How to Construct a Decision Tree
  • Entropy, Information Gain
  • Problems with Decision Trees
  • Summary

3
What is a Decision Tree?
  • A decision tree is a flow-chart-like tree
    structure, where
  • the root node and each internal node denote a test
    on an attribute,
  • each branch represents an outcome (attribute
    value) of the test, and
  • leaves represent class labels (classification
    results).
  • An example is shown on the right.

(Figure: an example tree, with its root node, internal nodes and leaf
nodes labelled, showing whether a person will buy a sports car or a
mini-van depending on their age and marital status.)
4
Decision Tree with Probabilities
  • A tree showing survival of passengers on the
    Titanic ("sibsp" is the number of spouses or
    siblings aboard).
  • The figures under the leaves show the probability
    of survival.

(Figure: the Titanic tree; each leaf is labelled "survived" or "died"
with a survival probability such as 0.73, 0.89, 0.17 or 0.05.)
5
Decision Tree for Play Tennis?
  • A tree for the concept PlayTennis.
  • The tree classifies days according to whether or
    not they are suitable for playing tennis.

E.g., <Outlook=Sunny, Temperature=Hot,
Humidity=High, Wind=Strong> would be sorted down
the leftmost branch of the tree and is classified
as a negative instance.
6
What factors cause some people to get sunburned?
7
Sunburn Data Collected
Instance space: 3 x 3 x 3 x 2 = 54 possible
combinations of attributes.
Chance of an exact match for any randomly chosen
instance is 8/54 ≈ 0.15.
8
Decision Tree 1
(Figure: a tree for is_sunburned that tests Height at the root, then
Hair colour and Weight further down. Its leaves name the individuals:
Sarah, Emily and Annie are classified as sunburned; Dana, Pete, Alex,
John and Katie are not.)
9
Sunburn sufferers are ...
  • If height=average then
      if weight=light then
        return(true)               # Sarah
      elseif weight=heavy then
        if hair_colour=red then
          return(true)             # Emily
    elseif height=short then
      if hair_colour=blonde then
        if weight=average then
          return(true)             # Annie
    else return(false)             # everyone else

(A Python rendering of these rules is sketched below.)

<Height> is IRRELEVANT for determining whether
someone will suffer from sunburn.
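The rules above translate directly into a small Python function; a
minimal sketch (the function name and the string attribute values are
illustrative choices, not from the slides):

    def is_sunburned_tree1(height, weight, hair_colour):
        """Decision Tree 1 expressed as nested conditionals.
        Returns True if the person is predicted to get sunburned."""
        if height == "average":
            if weight == "light":
                return True                       # Sarah
            if weight == "heavy":
                return hair_colour == "red"       # Emily
            return False
        if height == "short":
            if hair_colour == "blonde":
                return weight == "average"        # Annie
            return False
        return False                              # everyone else (tall, etc.)

    # e.g. is_sunburned_tree1("average", "light", "blonde") -> True (Sarah's case)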
10
Decision Tree 2
(Figure: a tree for is_sunburned that tests Lotion used at the root,
then Hair colour on each branch. No lotion: blonde leads to Sarah and
Annie, red to Emily, brown to Pete and John. Lotion used: blonde leads
to Dana and Katie, brown to Alex.)
This tree doesn't involve any of the irrelevant
attributes.
11
Decision Tree 3
(Figure: a tree for is_sunburned that tests Hair colour at the root:
brown leads to Alex, Pete and John; red leads to Emily; blonde leads to
a further test on Lotion used, where "no" leads to Sarah and Annie and
"yes" leads to Dana and Katie.)
12
Why do we prefer short hypotheses?
  • Irrelevant attributes do not classify the data
    well.
  • Using irrelevant attributes thus leads to larger
    decision trees; conversely, larger trees are more
    likely to involve irrelevant attributes.
  • So simple trees are more likely to reflect the
    true structure of the data.
  • Occam's razor (c. A.D. 1320): prefer the simplest
    hypothesis that fits the data.
  • A computer could look for simpler decision trees.
  • Q: How?

13
3.4.1 Which attribute is the best classifier?
  • Q: Which attribute should be tested first in
    the tree? (Which is the best attribute for
    splitting up the data?)
  • A: The one which is most informative for the
    classification.
  • Q: What does "more informative" mean?
  • A: The attribute which best reduces the
    uncertainty, disorder, or impurity of the
    data.
  • Q: How can we measure something like that?
  • A: Simple.

14
3.4.1.1 Measuring the disorder of examples
  • We need a quantity to measure the
    disorder/impurity in a set of examples
  • S = {s1, s2, s3, ..., sn}
  • where s1 = Sarah, s2 = Dana, ...
  • It will be measured according to the value
    of the target attribute of the data.
  • Then we need a quantity to measure the amount of
    reduction of the disorder.

15
What should the measure be?
  • If all the examples in S have the same class,
    then D(S) = 0 ---- purest
  • If half the examples in S are of one class and
    half are of the opposite class, then D(S) = 1 ----
    most impure

16
Examples
  • D({Dana, Pete}) = 0
  • D({Sarah, Annie, Emily}) = 0
  • D({Sarah, Emily, Alex, John}) = 1
  • D({Sarah, Emily, Alex}) = ?

17
Entropy
(Figure: the entropy curve; at p1 = 2/3 ≈ 0.67 the entropy is ≈ 0.918,
which answers D({Sarah, Emily, Alex}) from the previous slide.)
18
Definition of Disorder
The Entropy measures the disorder of a set S
containing a total of n examples, of which n+ are
positive and n- are negative, and it is given by

  Entropy(S) = -(n+/n) log2(n+/n) - (n-/n) log2(n-/n)

or, equivalently,

  Entropy(S) = -p1 log2(p1) - p0 log2(p0)

where p1 is the fraction of positive examples in
S and p0 is the fraction of negatives.
If p1 = p0 = 0.5, Entropy(S) = ?  If p1 = 0, p0 = 1,
Entropy(S) = ?  If p1 = 1, p0 = 0, Entropy(S) = ?
For multi-class problems with c categories,
entropy generalizes to

  Entropy(S) = - sum_{i=1..c} p_i log2(p_i)
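As a quick sanity check, the two-class entropy can be computed directly;
a minimal Python sketch (the function name is an illustrative choice):

    import math

    def entropy(p_pos, p_neg):
        """Two-class entropy D(S), given the fractions of positive and
        negative examples; 0*log(0) is treated as 0."""
        return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

    print(entropy(0.5, 0.5))   # 1.0  (maximally impure)
    print(entropy(1.0, 0.0))   # 0.0  (pure)
    print(entropy(2/3, 1/3))   # ~0.918, the D({Sarah, Emily, Alex}) case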
19
Back to the beach (or the disorder of sunbathers)!

D({Sarah, Dana, Alex, Annie, Emily, Pete, John, Katie})
  = D(3+, 5-)
  = -(3/8) log2(3/8) - (5/8) log2(5/8)
  ≈ 0.954
20
Some more useful properties of the Entropy

21
What's left?
  • So: we can measure the disorder.
  • What's left?
  • We want to measure how much the disorder of a set
    would be reduced by knowing the value of a
    particular attribute.

22
3.4.1.2 Information gain measures the expected
reduction in entropy
  • The Information Gain measures the expected
    reduction in entropy due to splitting on an
    attribute A:

  Gain(S, A) = Entropy(S) - sum over values v of A of (|S_v|/|S|) Entropy(S_v)

The average disorder is just the weighted sum of
the disorders in the branches (subsets) created
by the values of A.
  • We want a large Gain,
  • which is the same as a small average disorder
    created (a Python sketch follows below).
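A minimal Python sketch of this definition; it assumes examples are
dicts mapping attribute names to values (that representation is my
choice, not the slide's):

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels (any number of classes)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
        n = len(labels)
        avg_disorder = 0.0
        for v in set(e[attribute] for e in examples):
            subset = [y for e, y in zip(examples, labels) if e[attribute] == v]
            avg_disorder += len(subset) / n * entropy(subset)
        return entropy(labels) - avg_disorder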

23
Back to the beach: calculate the average disorder
associated with Hair colour
(Figure: Hair colour splits the sunbathers into blonde = {Sarah, Annie,
Dana, Katie}, red = {Emily} and brown = {Alex, Pete, John}.)
24
Calculating the disorder of the blondes
The first term of the sum:
  D(S_blonde) = D({Sarah, Annie, Dana, Katie}) = D(2+, 2-) = 1
There are 8 sunbathers in total, 4 of them blondes,
so this term is weighted by 4/8 = 0.5.
25
Calculating the disorder of the others
The second and third terms of the sum:
  S_red = {Emily},  S_brown = {Alex, Pete, John}.
These are both 0 because within each set all the
examples have the same class. So the average
disorder created when splitting on Hair colour is
  0.5 + 0 + 0 = 0.5
26
Which decision variable minimises the disorder?
  • Test            Disorder
  • Hair colour     0.5    (this is what we just computed)
  • Height          0.69
  • Weight          0.94
  • Lotion          0.61

(These are the average disorders of the other
attributes, computed in the same way; a short
script that reproduces them follows below.)
  • Which decision variable maximises the Info Gain,
    then?
  • Remember: it's the one which minimises the average
    disorder.
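The numbers above can be reproduced with a few lines of Python. The
attribute values for the eight sunbathers are reconstructed here from
the trees and calculations on the earlier slides, so treat the data
table itself as an assumption:

    import math
    from collections import Counter

    # Columns: hair, height, weight, lotion, sunburned?  (reconstructed data)
    DATA = {
        "Sarah": ("blonde", "average", "light",   "no",  True),
        "Dana":  ("blonde", "tall",    "average", "yes", False),
        "Alex":  ("brown",  "short",   "average", "yes", False),
        "Annie": ("blonde", "short",   "average", "no",  True),
        "Emily": ("red",    "average", "heavy",   "no",  True),
        "Pete":  ("brown",  "tall",    "heavy",   "no",  False),
        "John":  ("brown",  "average", "heavy",   "no",  False),
        "Katie": ("blonde", "short",   "light",   "yes", False),
    }
    ATTRS = ["hair colour", "height", "weight", "lotion"]

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def avg_disorder(attr_index):
        """Weighted sum of the entropies of the subsets created by one attribute."""
        rows = list(DATA.values())
        total = 0.0
        for value in set(r[attr_index] for r in rows):
            subset = [r[-1] for r in rows if r[attr_index] == value]
            total += len(subset) / len(rows) * entropy(subset)
        return total

    for i, name in enumerate(ATTRS):
        print(f"{name:12s} {avg_disorder(i):.2f}")
    # hair colour 0.50, height 0.69, weight 0.94, lotion 0.61 (matches the slide)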

27
So what is the best decision tree?
(Figure: the partially built tree, with Hair colour at the root: brown
leads to Alex, Pete and John; red leads to Emily; blonde leads to a "?"
branch that still has to separate Sarah, Annie, Dana and Katie.)
  • Once we have finished with hair colour we then
    need to calculate the remaining branches of the
    decision tree.
  • The examples corresponding to that branch are now
    the total set. One just applies the same
    procedure as before with JUST those examples
    (i.e. the blondes).

28
Decision Tree Induction Pseudocode
  DTree(examples, features) returns a tree:
    If all examples are in one category,
      return a leaf node with that category label.
    Else if the set of features is empty,
      return a leaf node with the category label that
      is the most common in examples.
    Else pick a feature F and create a node R for it:
      For each possible value vi of F:
        Let examples_i be the subset of examples that have value vi for F.
        Add an outgoing edge E to node R labeled with the value vi.
        If examples_i is empty,
          then attach a leaf node to edge E labeled with the category
          that is the most common in examples,
        else call DTree(examples_i, features - {F}) and attach the
          resulting tree as the subtree under edge E.
      Return the subtree rooted at R.

(A runnable Python rendering of this procedure is sketched below.)
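A compact Python rendering of the pseudocode, under some assumptions of
my own: examples are dicts, a leaf is a plain label, and an internal
node is a tuple (feature, branches, majority label). The sketch picks
the first remaining feature rather than the highest-information-gain
one, which is what full ID3 would do:

    from collections import Counter

    def most_common(labels):
        return Counter(labels).most_common(1)[0][0]

    def dtree(examples, labels, features):
        """Returns either a class label (leaf) or a tuple
        (feature, {value: subtree}, majority label at this node)."""
        if len(set(labels)) == 1:            # all examples in one category
            return labels[0]
        if not features:                     # no features left to test
            return most_common(labels)
        f = features[0]                      # ID3 proper would pick by information gain
        branches = {}
        for v in set(e[f] for e in examples):
            pairs = [(e, y) for e, y in zip(examples, labels) if e[f] == v]
            branches[v] = dtree([e for e, _ in pairs],
                                [y for _, y in pairs],
                                [g for g in features if g != f])
        return (f, branches, most_common(labels))

    def classify(tree, example):
        while isinstance(tree, tuple):
            feature, branches, default = tree
            tree = branches.get(example[feature], default)  # unseen value -> majority class
        return tree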

29
3.5 Hypothesis Space Search
  • The hypothesis space is the complete space of
    finite discrete-valued functions, relative to the
    available attributes, because every finite
    discrete-valued function can be represented by
    some decision tree.
  • ID3 maintains only one current hypothesis as it
    searches the space of decision trees; it does not
    track how many decision trees are consistent with
    the data.

30
3.5 Hypothesis Space Search
  • ID3 performs no backtracking in its search; it
    performs hill-climbing (greedy search) and may
    only find a locally optimal solution. It is
    guaranteed to find a tree consistent with any
    conflict-free training set, but not necessarily
    the simplest such tree.
  • It performs batch learning, processing all
    training instances at once, rather than
    incremental learning, which updates a hypothesis
    after each example.

31
3.6 Bias in Decision-Tree Induction
  • Information gain gives a bias toward trees of
    minimal depth.
  • Inductive bias of ID3: shorter trees are
    preferred over larger trees, and trees that place
    high-information-gain attributes close to the
    root are preferred over those that do not.
  • ID3 implements a search (preference) bias rather
    than a language (restriction) bias.

32
Is this all? Is it really that simple?
  • Of course not:
  • where do we stop growing the tree?
  • what if there is noisy (mislabelled) data in the
    data set?

33
Overfitting
  • Learning a tree that classifies the training data
    perfectly may not lead to the tree with the best
    generalization to unseen data.
  • There may be noise in the training data that the
    tree is erroneously fitting.
  • The algorithm may be making poor decisions
    towards the leaves of the tree that are based on
    very little data and may not reflect reliable
    trends.

(Figure: accuracy plotted against hypothesis complexity.)

34
Overfitting
  • Consider the error of a hypothesis h over
  • the training data: error_train(h)
  • the whole data set (new data as well): error_D(h)
  • If there is another hypothesis h' such that
    error_train(h) < error_train(h') and
    error_D(h) > error_D(h'), then we say that
    hypothesis h overfits the training data.

(Figure: accuracy plotted against hypothesis complexity.)

35
Overfitting Example
  • Testing Ohm's Law: V = IR (i.e., I = (1/R)V)
  • Experimentally measure 10 points.
  • Fit a curve to the resulting data.

(Figure: current I plotted against voltage V, with a curve fitted
through every measured point.)
  • "The law was wrong: we have found a more accurate
    function!"

36
Overfitting Example
  • Testing Ohm's Law: V = IR (i.e., I = (1/R)V)

(Figure: the same current-voltage data with a straight line fitted.)
  • Better generalization is obtained with a linear
    function that fits the training data less
    accurately.
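For a concrete feel for this, here is a small illustrative sketch with
synthetic data (NumPy assumed available; the specific numbers and the
degree-9 fit are my own choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    R = 2.0
    voltage = np.linspace(0.0, 10.0, 10)                 # 10 measured points
    current = voltage / R + rng.normal(0.0, 0.2, 10)     # Ohm's law plus noise

    linear = np.polyfit(voltage, current, deg=1)         # the "true" model family
    wiggly = np.polyfit(voltage, current, deg=9)         # passes through every point

    v_new = np.array([4.3, 7.1])                         # unseen voltages
    print(np.polyval(linear, v_new))   # stays close to v / R
    print(np.polyval(wiggly, v_new))   # typically less reliable: it has fit the noise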

37
Overfitting: Noise in Decision Trees
  • Category or feature noise can easily cause
    overfitting.
  • Add noisy instance <medium, blue, circle>: pos
    (but really neg)

(Figure: a decision tree that splits on color (red / blue / green) and
then on shape (circle / triangle / square), with pos and neg leaves.)

38
Overfitting: Noise in Decision Trees
  • Category or feature noise can easily cause
    overfitting.
  • Add noisy instance <medium, blue, circle>: pos
    (but really neg)

(Figure: the same tree grown to fit the noise; the blue branch, which
now receives <big, blue, circle> and the noisy <medium, blue, circle>,
has to be split further instead of remaining a single neg leaf.)

39
How can we avoid overfitting?
  • Split the data into a training set and a
    validation set.
  • Train on the training set and stop growing the
    tree when a further split deteriorates
    performance on the validation set,
  • or grow the full tree first and then post-prune
    it.
  • What if data is limited?

40
Effect of Reduced-error pruning
41
Reduced Error Pruning
  • A post-pruning, cross-validation approach.
  • Partition the training data into "grow" and
    "validation" sets.
  • Build a complete tree from the "grow" data.
  • Until accuracy on the validation set decreases, do:
      For each non-leaf node n in the tree:
        Temporarily prune the subtree below n and
        replace it with a leaf labeled with the
        current majority class at that node.
        Measure and record the accuracy of the
        pruned tree on the validation set.
      Permanently prune the node that results in the
      greatest increase in accuracy on the validation
      set.

(A Python sketch of this loop follows below.)
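A sketch of this pruning loop, assuming the tuple-based tree format
from the earlier DTree sketch (leaf = label, internal node = (feature,
branches, majority label)); the function names are my own, and ties in
accuracy are resolved in favour of pruning:

    def classify(tree, example):
        while isinstance(tree, tuple):
            feature, branches, default = tree
            tree = branches.get(example[feature], default)
        return tree

    def accuracy(tree, examples, labels):
        return sum(classify(tree, e) == y for e, y in zip(examples, labels)) / len(labels)

    def pruned_variants(tree):
        """Yield every tree obtained by collapsing exactly one internal node
        of `tree` into a leaf labelled with that node's majority class."""
        if not isinstance(tree, tuple):
            return
        feature, branches, default = tree
        yield default                                   # prune this node itself
        for value, sub in branches.items():
            for pruned_sub in pruned_variants(sub):
                new_branches = dict(branches)
                new_branches[value] = pruned_sub
                yield (feature, new_branches, default)

    def reduced_error_prune(tree, val_examples, val_labels):
        """Greedily prune while validation accuracy does not decrease."""
        best = accuracy(tree, val_examples, val_labels)
        while True:
            candidates = [(accuracy(t, val_examples, val_labels), t)
                          for t in pruned_variants(tree)]
            if not candidates:
                return tree
            acc, best_tree = max(candidates, key=lambda pair: pair[0])
            if acc < best:
                return tree
            best, tree = acc, best_tree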

42
Issues with Reduced Error Pruning
  • The problem with this approach is that it
    potentially wastes training data on the
    validation set.
  • The severity of this problem depends on where we
    are on the learning curve.

(Figure: a learning curve of test accuracy against the number of
training examples.)

43
Cross-Validating without Losing Training Data
  • First, run several trials of reduced-error
    pruning using different random splits into grow
    and validation sets.
  • Record the complexity of the pruned tree learned
    in each trial. Let C be the average pruned-tree
    complexity.
  • Grow a final tree breadth-first from all the
    training data, but stop when the complexity
    reaches C.
  • A similar cross-validation approach can be used
    to set arbitrary algorithm parameters in general.

44
Rule Post-Pruning
  1. Generate the decision tree that best fits the
     training data, allowing overfitting.
  2. Convert the tree to an equivalent set of rules.
  3. Prune each rule independently of the others.
  4. Sort the final rules into the desired sequence
     for further use.

45
Rule Post-Pruning
Converting A Tree to Rules
46
  • Example
  • Consider the leftmost path in the figure:
    if (Outlook = Sunny) ∧ (Humidity = High) then
    PlayTennis = No
  • Consider removing each precondition in turn:
    (Outlook = Sunny), (Humidity = High)
  • Keep a removal only if it does not reduce the
    estimated accuracy of the rule (a small sketch
    follows below).
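A minimal sketch of the precondition-dropping step, under some
assumptions of my own: a rule is a list of (attribute, value)
preconditions plus a conclusion, examples are dicts, and accuracy is
estimated on whatever example set is passed in (e.g. a validation set):

    def rule_accuracy(preconditions, conclusion, examples, labels):
        """Accuracy of a single rule, measured only on the examples it covers."""
        covered = [(e, y) for e, y in zip(examples, labels)
                   if all(e.get(a) == v for a, v in preconditions)]
        if not covered:
            return 0.0
        return sum(y == conclusion for _, y in covered) / len(covered)

    def prune_rule(preconditions, conclusion, examples, labels):
        """Greedily drop preconditions whose removal does not reduce accuracy."""
        preconditions = list(preconditions)
        improved = True
        while improved:
            improved = False
            base = rule_accuracy(preconditions, conclusion, examples, labels)
            for i in range(len(preconditions)):
                trial = preconditions[:i] + preconditions[i + 1:]
                if rule_accuracy(trial, conclusion, examples, labels) >= base:
                    preconditions, improved = trial, True
                    break
        return preconditions

    # e.g. prune_rule([("Outlook", "Sunny"), ("Humidity", "High")], "No",
    #                 val_examples, val_labels)   # val_* are hypothetical names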

47
3.7.2 Incorporating Continuous-valued Attributes
  • ID3 is restricted to attributes that take on a
    discrete set of values:
  • the target attribute is discrete-valued, and
  • the attributes tested are also discrete-valued.
  • The second restriction can easily be removed, so
    that
  • for a continuous attribute A, we can dynamically
    create a new Boolean attribute that is true if
    A < c for some threshold c.

48
3.7.2 Incorporating Continuous-valued Attributes
Suppose that the training examples associated
with a particular node in the decision tree have
the following values for Temperature and the
target attribute PlayTennis:

  Temperature:  40   48   60   72   80   90
  PlayTennis:   No   No   Yes  Yes  Yes  No

  • What threshold would be picked?
  • The one that produces the greatest information
    gain.

49
3.7.2 Incorporating Continuous-valued Attributes
  • Sort the examples according to the continuous
    attribute.
  • Then identify adjacent examples that differ in
    their target classification.
  • Generate a set of candidate thresholds midway
    between the corresponding values.
  • The value of the threshold that maximizes the
    information gain must always lie at such a
    boundary (Fayyad, 1991).

In the example, there are two candidate
thresholds, 54 and 85, giving the Boolean
attributes
1. Temperature > 54,  2. Temperature > 85.
Which one is the best? The information gain can
be computed for each of these candidate
attributes (a sketch for finding the candidate
thresholds follows below).
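A minimal Python sketch of the candidate-threshold step; the
Temperature/PlayTennis values are the ones reconstructed above from the
stated thresholds, so treat them as an assumption:

    def candidate_thresholds(values, labels):
        """Midpoints between adjacent sorted values whose class labels differ."""
        pairs = sorted(zip(values, labels))
        thresholds = []
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
            if y1 != y2 and v1 != v2:
                thresholds.append((v1 + v2) / 2)
        return thresholds

    temperature = [40, 48, 60, 72, 80, 90]
    play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
    print(candidate_thresholds(temperature, play_tennis))   # [54.0, 85.0]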
50
3.7.3 Alternative Measure for Selecting Attributes
  • Information gain favors attributes with many
    values over those with few values.
  • If an attribute has many values, Gain will tend
    to select it.
  • Imagine using Date as an attribute: it would have
    the highest information gain and be selected as
    the decision attribute for the root node.
  • Yet it is not a useful predictor, despite the
    fact that it perfectly separates the training
    data.

51
3.7.3 Alternative Measure for Selecting Attributes
  • Solution: use the gain ratio.
  • Penalize attributes by incorporating a term,
    called split information, that is sensitive to
    how broadly and uniformly the attribute splits
    the data:

  SplitInformation(S, A) = - sum_{i=1..c} (|S_i|/|S|) log2(|S_i|/|S|)
  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

  • The gain ratio discourages the selection of
    attributes with many uniformly distributed
    values. E.g., for Date, SplitInformation would be
    log2 n, while a Boolean attribute splitting n
    examples in half has SplitInformation of 1.
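A short Python sketch of these two quantities (examples are again
assumed to be dicts; the zero-denominator guard anticipates the problem
discussed on the next slide):

    import math
    from collections import Counter

    def split_information(examples, attribute):
        """SplitInformation(S, A): the entropy of S with respect to the
        values of attribute A itself."""
        n = len(examples)
        counts = Counter(e[attribute] for e in examples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def gain_ratio(gain, split_info, eps=1e-9):
        """GainRatio = Gain / SplitInformation, guarding a zero denominator."""
        return gain / max(split_info, eps)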

52
3.7.3 Alternative Measure for Selecting Attributes
  • One problem with using the gain ratio is that the
    denominator can be zero or very small when
    |S_i| ≈ |S| for one of the S_i.
  • This makes the gain ratio undefined, or very
    large, for attributes that happen to have nearly
    the same value for all members of S.
  • Solution:
  • first calculate the Gain of each attribute, then
    apply the GainRatio test only to those attributes
    with above-average Gain.

53
3.7.4 Handling Training Examples with Missing
Attribute Values
  • In certain cases, the values of attributes for
    some training examples may be missing. It is
    common to estimate the missing attribute value
    based on other examples for which this attribute
    has a known value.
  • E.g., in a medical domain, the lab test
    Blood-Test-Result may be available only for a
    subset of the patients.
  • Consider the situation in which Gain(S, A) is to
    be calculated at node n to evaluate whether
    attribute A is the best attribute to test at this
    decision tree node. Suppose that <x, c(x)> is one
    of the training examples in S and that the value
    A(x) is unknown.

54
3.7.4 Handling Training Examples with Missing
Attribute Values
  • Strategies:
  • Assign it the value that is most common among the
    training examples at node n.
  • Or assign it the most common value among the
    examples at node n that have the classification
    c(x).
  • Or, a more complex strategy: assign a probability
    to each of the possible values of A rather than
    simply assigning the most common value to A(x).
    These probabilities can be estimated based on the
    observed frequencies of the various values of A
    among the examples at node n.

55
Training Examples with Missing Attribute Values
  • S contains <A=1: 6 examples> and <A=0: 4
    examples>, plus one example x whose value of A is
    unknown; x is distributed as the fractional
    examples <x, A=1, 0.6> and <x, A=0, 0.4>.
  • Splitting S on A then gives
  • S1 (A=1): 6.6 instances
  • S2 (A=0): 4.4 instances

56
Training Examples with Missing Attribute Values
  • <A=1: 6 examples>, <A=0: 4 examples>,
    <x, A=1, 0.6>, <x, A=0, 0.4>.
    Assume that among the six examples for which A=1,
    the number of positive examples is 1 and the
    number of negative examples is 5; among the 4
    examples for which A=0, the number of positive
    examples is 3 and the number of negative examples
    is 1.
  • If we know x is positive, then
  • E(S1) = -(1.6/6.6) log2(1.6/6.6)
           - (5/6.6) log2(5/6.6) ≈ 0.80

(Figure: the split on A, with S1 (A=1) containing 6.6 instances and
S2 (A=0) containing 4.4 instances.)
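A minimal sketch of entropy over fractionally weighted examples,
matching the calculation above (representing the set as (label, weight)
pairs is my own choice):

    import math

    def weighted_entropy(weighted_labels):
        """Entropy of a set whose examples carry fractional weights,
        given as (label, weight) pairs."""
        total = sum(w for _, w in weighted_labels)
        per_class = {}
        for label, w in weighted_labels:
            per_class[label] = per_class.get(label, 0.0) + w
        return -sum((w / total) * math.log2(w / total)
                    for w in per_class.values() if w > 0)

    # Branch A=1: one whole positive, five whole negatives, plus the
    # fractional positive x with weight 0.6  ->  6.6 instances in total.
    s1 = [("pos", 1.0)] + [("neg", 1.0)] * 5 + [("pos", 0.6)]
    print(weighted_entropy(s1))   # about 0.80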

57
3.7.5 Handling Attributes with Differing Costs
  • The instance attributes may have associated
    costs. E.g., to classify diseases of patients,
    Temperature, Pulse and BloodTestResults may be
    used.
  • Prefer decision trees that use low-cost
    attributes where possible, relying on high-cost
    attributes only when needed to produce reliable
    classifications.
  • ID3 can be modified to take attribute costs into
    account by introducing a cost term into the
    attribute selection measure (two such measures
    from the literature are sketched below).
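Two cost-sensitive selection measures from the literature (the Tan and
Schlimmer and the Nunez variants cited in Mitchell's section 3.7.5); a
minimal sketch, offered as examples rather than the slide's own choice:

    def tan_schlimmer_score(gain, cost):
        """Gain(S, A)^2 / Cost(A)."""
        return gain ** 2 / cost

    def nunez_score(gain, cost, w=1.0):
        """(2^Gain - 1) / (Cost + 1)^w, where w in [0, 1] weights the cost."""
        return (2 ** gain - 1) / (cost + 1) ** w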

58
(No Transcript)
59
Summary
  • A practical method: ID3
  • Greedy search
  • Growing the tree from the root downward
  • Search in a complete hypothesis space
  • Preference for smaller trees
  • Overfitting issues, methods of post-pruning
  • Extensions to the basic ID3 algorithm

60
  • C4.5 is a software extension of the basic ID3
    algorithm designed by Quinlan to address the
    following issues not dealt with by ID3
  • Avoiding overfitting the data
  • Determining how deeply to grow a decision tree.
  • Reduced error pruning.
  • Rule post-pruning.
  • Handling continuous attributes.
  • e.g., temperature
  • Choosing an appropriate attribute selection
    measure.
  • Handling training data with missing attribute
    values.
  • Handling attributes with differing costs.
  • Improving computational efficiency.

61
Summary
  • Decision Tree Representation
  • Entropy, Information Gain
  • ID3 Learning algorithm
  • Overfitting and how to avoid it

62
Homework
Exercises 3.1, 3.2, 3.3