Decision Tree Learning
1
Decision Tree Learning
  • Widely used, practical
  • Method of approximating discrete-valued functions
  • Robust to noisy data
  • Capable of learning disjunctive expressions
  • Typical bias: prefer smaller trees

2
Decision trees
  • Classify instances
  • by sorting them down the tree to a leaf node
    containing the class (value)
  • based on attributes of the instances
  • one branch for each value of the tested attribute
  • In general, a tree represents
  • a disjunction of conjunctions of constraints on
    attribute values of instances

3
When to use?
  • Instances presented as attribute-value pairs
  • Target function has discrete values
  • classification problems
  • Disjunctive descriptions may be required
  • Training data may contain
  • errors
  • missing attribute values

4
What follows?
  • Basic learning algorithm (ID3)
  • Hypothesis space
  • Inductive bias
  • Occam's razor in general
  • Overfitting problem and extensions
  • post-pruning
  • real values, missing values, attribute costs, ...

5
Basic DT Learning Alg.
  • Most better-performing methods are variations of
    this one
  • top-down greedy search in H
  • ID3, C4.5 (Quinlan 1986, 1993)
  • Top-down greedy construction
  • Which attribute should be tested?
  • chosen by a statistical test on the current data
  • repeat for descendants

6
Best attribute
  • Most useful in classification
  • how to measure its worth?
  • information gain
  • how well attr. separates examples according to
    their classification
  • Next
  • precise definition for gain
  • example

7
Entropy
  • Homogeneity measure for a set S
  • Entropy(S) =
  • -p(+) log2 p(+) - p(-) log2 p(-)
  • p(+) = proportion of positive examples
  • p(-) = proportion of negative examples
  • note: 0 log 0 is defined to be 0
  • 0 if all examples are in the same class
  • 1 if p(+) = p(-) = 0.5

8
Entropy...
  • Information-theoretic concept
  • expected minimal number of bits required to code
    the class of a randomly drawn member of S
  • the optimal code for a class occurring with
    probability p has length -log2 p
  • hence the code lengths -log2 p(+) and -log2 p(-)
  • generalizes to m-ary classes
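
A minimal sketch of this computation in Python (representing S as a list of
class labels is an illustrative assumption; the code already covers the
m-ary generalization):

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a collection of class labels (m-ary in general).
        # 0 log 0 never arises: Counter yields only positive counts.
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    # e.g. 9 positive and 5 negative examples -> entropy of about 0.940
    print(entropy(["+"] * 9 + ["-"] * 5))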

9
Information Gain
  • Expected reduction in entropy
  • Gain(S, A) =
  • Ent(S) - sum_v (|Sv|/|S|) Ent(Sv)
  • v ranges over the values of A
  • Sv = members of S with A = v
  • 2nd term = expected value of the entropy after
    partitioning with A
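
Continuing the sketch above, the gain formula transcribes directly
(representing examples as dicts from attribute name to value is again an
assumption for illustration; entropy() is the helper defined earlier):

    def information_gain(examples, labels, attribute):
        # Gain(S, A) = Ent(S) - sum over values v of (|Sv|/|S|) * Ent(Sv)
        n = len(examples)
        partitions = {}
        for x, y in zip(examples, labels):
            partitions.setdefault(x[attribute], []).append(y)
        expected = sum(len(sv) / n * entropy(sv)
                       for sv in partitions.values())
        return entropy(labels) - expected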

10
Interpretations of gain
  • Gain(S,A)
  • expected reduction in entropy caused by knowing
    the value of A
  • information provided about the target function
    value, given the value of A
  • number of bits saved when coding the class of a
    member of S, knowing the value of A
  • The measure used by the ID3 algorithm

11
Example
  • Gains for each attribute
  • Outlook 0.246, Humidity 0.151, Wind 0.048,
    Temperature 0.029
  • Node creation
  • Outlook is selected at the root node
  • 3 descendants are created
  • the examples in S are sorted down to the descendants
  • one descendant becomes a leaf node (entropy 0)

12
Example...
  • At inner nodes
  • same steps as earlier, but
  • only the examples sorted to the node are used in
    the Gain computations
  • Continues until
  • entropy = 0 (all examples have the same class), or
  • all attributes have been used
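
The whole construction described on the last two slides fits in a short
recursive sketch (the dict-or-leaf tree representation is an assumption;
entropy(), information_gain() and Counter come from the earlier sketches):

    def id3(examples, labels, attributes):
        # Leaf: all examples share one class (entropy = 0)
        if len(set(labels)) == 1:
            return labels[0]
        # Leaf: no attributes left -> most common class
        if not attributes:
            return Counter(labels).most_common(1)[0][0]
        # Greedy choice: the attribute with the highest information gain
        best = max(attributes,
                   key=lambda a: information_gain(examples, labels, a))
        # Sort the examples down to the descendants and recurse
        branches = {}
        for x, y in zip(examples, labels):
            branches.setdefault(x[best], ([], []))
            branches[x[best]][0].append(x)
            branches[x[best]][1].append(y)
        rest = [a for a in attributes if a != best]
        return {"attribute": best,
                "branches": {v: id3(ex, lab, rest)
                             for v, (ex, lab) in branches.items()}}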

13
Hypothesis space of ID3
  • Set of possible decision trees
  • simple-to-complex hill-climbing search
  • evaluation function: information gain
  • Complete!
  • contains all discrete functions based on
    available attributes
  • including the target function

14
Hypothesis space...
  • Maintains only one hypothesis
  • how many other DTs are consistent?
  • what queries to make?
  • No backtracking
  • local minima possible -> extensions
  • Statistics-based choices
  • uses all data at each step -> robustness
  • compare to incremental methods

15
Inductive Bias
  • Many DTs are usually consistent with the data
  • the bias is the basis by which ID3 chooses one
  • Roughly: prefer
  • shorter trees over longer ones
  • trees with high-gain attributes near the root
  • Difficult to characterize precisely
  • attribute selection heuristics
  • interact closely with the given data

16
Approx. bias of ID3
  • Shorter trees are better
  • breadth-first search in H would implement exactly
    this bias
  • ID3 is an efficient approximation of that BFS
  • Compare the bias to Candidate-Elimination (C-E)
  • ID3: complete space, incomplete search -> bias
    from the search strategy
  • C-E: incomplete space, complete search -> bias
    from the expressive power of H

17
Restriction vs. preference
  • ID3: preference bias (search bias)
  • C-E: restriction bias (language bias)
  • Which one is better?
  • a preference bias allows us to work with a
    complete hypothesis space
  • with a restriction bias, the target concept c may
    not be in H at all
  • combinations are possible (e.g. linear functions +
    LMS)

18
Why prefer short hypotheses?
  • William of Occam (ca. 1320)
  • Occam's razor
  • Prefer the simplest hypothesis that fits the data
  • Sound principle?
  • there are fewer short hypotheses than long ones
  • a short hypothesis fitting the data is less likely
    to be a coincidence

19
Difficulties
  • Many small sets of hypotheses
  • fit the previous argument equally well, e.g.
  • DTs with m nodes and n leaves
  • DTs with attribute A1 at the root, A2 at node 2, ...
  • few such trees -> small probability that one fits
    the data by coincidence
  • so why is the set of short trees the right one to
    prefer?

20
Difficulties...
  • Size of a hypothesis?
  • depends on the learner's internal representation
  • two learners with different representations may
    reach different conclusions
  • Example case
  • L1 as before
  • L2: a compound boolean attribute XYZ -> a single
    node

21
Reject altogether?
  • Natural internal representations?
  • (artificial) evolution of algorithms
  • more successful descendants obtained by modifying
    the internal representation
  • result: internal representations that work well
    with the learning algorithm's bias
  • if the algorithm uses Occam's razor, evolution
    creates internal representations suited to it
  • reason: it is easier to change the representation
    than the algorithm

22
Issues in DT learning
  • Facing the real world
  • how deeply to grow the DT
  • continuous attributes
  • attribute selection measures
  • missing attribute values
  • attributes with differing costs
  • computational efficiency
  • extending ID3 with these issues -> C4.5

23
Overfit
  • The basic algorithm overfits the training examples
  • Problems arise from
  • noise
  • small training sets
  • Informal definition
  • some less well-fitting h actually performs better
    over the whole instance space X

24
Overfit
  • h in H overfits D if
  • there exists h' in H such that
  • error(h, D) < error(h', D) but
  • error(h, X) > error(h', X)
  • Example figure: accuracy vs. tree size
  • on training data vs. test data

25
How can such happen?
  • One reason: noise
  • noisy data makes the learned tree h large
  • an h' not fitting the noise is likely to work
    better
  • Small samples
  • coincidences are possible
  • attributes unrelated to c may happen to partition
    the training data well

26
How to avoid overfit?
  • Several approaches
  • stop growing tree earlier
  • allow overfit but post-prune after construction
  • the latter has been found more successful in
    practice

27
How to decide tree size?
  • What criterion to use?
  • a separate test set to evaluate the utility of
    pruning
  • use all data, apply a statistical test to estimate
    whether expanding/pruning is likely to produce an
    improvement
  • use an explicit complexity measure (coding length
    of data + tree), stop growth when it is minimized

28
Training/validation sets
  • Available data is split into
  • training set: apply learning to this
  • validation set: evaluate the result
  • accuracy
  • impact of pruning
  • a safety check against overfit
  • common strategy: 2/3 for training, 1/3 for
    validation

29
Reduced error pruning
  • Pruning a node
  • make an inner node a leaf node
  • assign it the most common class of the examples at
    the node
  • Procedure (a sketch follows below)
  • a candidate is pruned only if the result performs
    no worse on the validation set
  • coincidental regularities are thus likely to be
    removed
  • choose the candidate giving the best accuracy
  • continue until pruning no longer helps
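
A sketch of one bottom-up variant, using the tree representation assumed
earlier. The slides describe a greedy loop over all candidate nodes; this
single pass simplifies that by comparing, at each node, the subtree against
a majority-class leaf on the validation examples that reach the node:

    def classify(node, x):
        # Sort an instance down the dict-or-leaf tree
        # (returns None if a branch value was never seen).
        while isinstance(node, dict):
            node = node["branches"].get(x[node["attribute"]])
        return node

    def reduced_error_prune(node, train_x, train_y, val_x, val_y):
        if not isinstance(node, dict):
            return node
        attr = node["attribute"]
        # Prune the children first, routing examples down the branches.
        for value in list(node["branches"]):
            tx = [x for x in train_x if x[attr] == value]
            ty = [y for x, y in zip(train_x, train_y) if x[attr] == value]
            vx = [x for x in val_x if x[attr] == value]
            vy = [y for x, y in zip(val_x, val_y) if x[attr] == value]
            node["branches"][value] = reduced_error_prune(
                node["branches"][value], tx, ty, vx, vy)
        # Candidate leaf: the most common training class at this node.
        leaf = Counter(train_y).most_common(1)[0][0]
        if val_y:
            subtree_ok = sum(classify(node, x) == y
                             for x, y in zip(val_x, val_y))
            leaf_ok = sum(y == leaf for y in val_y)
            if leaf_ok >= subtree_ok:   # performs no worse -> prune
                return leaf
        return node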

30
Reduced error pruning
  • If plenty of data is available
  • training set
  • validation set used for pruning
  • test set to measure accuracy
  • If not
  • alternative methods (will follow)
  • Additional techniques
  • multiple partitionings + averaging

31
Rule Post-Pruning
  • Procedure (C4.5 uses a variant; a sketch follows)
  • infer the DT as usual (allow overfit)
  • convert the tree to rules (one per path)
  • prune each rule independently
  • remove preconditions if the result is more accurate
  • sort the rules by estimated accuracy
  • apply the rules in this order in classification
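
A sketch of the convert-and-prune steps under the same tree representation;
for simplicity the accuracy here is estimated on a held-out set rather than
with the pessimistic training-set estimate discussed on the next slide:

    def tree_to_rules(node, tests=()):
        # One rule per root-to-leaf path:
        # (list of (attribute, value) preconditions, class)
        if not isinstance(node, dict):
            return [(list(tests), node)]
        rules = []
        for value, child in node["branches"].items():
            rules += tree_to_rules(child,
                                   tests + ((node["attribute"], value),))
        return rules

    def rule_accuracy(tests, cls, xs, ys):
        covered = [y for x, y in zip(xs, ys)
                   if all(x[a] == v for a, v in tests)]
        return (sum(y == cls for y in covered) / len(covered)
                if covered else 0.0)

    def prune_rule(tests, cls, xs, ys):
        # Greedily drop preconditions while the estimated accuracy
        # does not drop.
        improved = True
        while improved and tests:
            improved = False
            base = rule_accuracy(tests, cls, xs, ys)
            for i in range(len(tests)):
                shorter = tests[:i] + tests[i + 1:]
                if rule_accuracy(shorter, cls, xs, ys) >= base:
                    tests, improved = shorter, True
                    break
        return tests, cls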

32
Rule post-pruning
  • Estimating the accuracy
  • separate validation set, or
  • training data with pessimistic estimates
  • the data is too favorable for the rules
  • compute the accuracy and its standard deviation
  • take the lower bound of a given confidence
    interval (e.g. 95%) as the measure
  • very close to the observed accuracy for large sets
  • not statistically valid, but works in practice

33
Why convert to rules?
  • Distinguishes the different contexts in which a
    node is used
  • a separate pruning decision for each path
  • No difference between the root and inner nodes
  • no bookkeeping on how to reorganize the tree if
    the root node is pruned
  • Improves readability

34
Continuous values
  • Define a new discrete-valued attribute
  • partition the continuous range into a discrete
    set of intervals
  • Ac is true iff A < c
  • How to select the best threshold c? (max inf. gain)
  • Example case
  • sort the examples by the continuous value
  • identify borderlines where the class changes

35
Continuous values
  • Fact
  • the value maximizing inf. gain lies on such a
    boundary
  • Evaluation (a sketch follows below)
  • compute the gain for each boundary
  • Extensions
  • splits with multiple thresholds
  • LTUs (linear threshold units) based on several
    attributes
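
A sketch of the boundary search (entropy() is from the earlier sketch;
taking the midpoint between adjacent values as the candidate threshold is a
common convention, assumed here, and the sample values are made up for
illustration):

    def best_threshold(values, labels):
        # Candidate thresholds lie at class boundaries of the sorted
        # values; evaluate the gain of the boolean test A < c at each.
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        base = entropy([y for _, y in pairs])
        best_gain, best_c = -1.0, None
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
            if y1 != y2 and v1 != v2:      # a class boundary
                c = (v1 + v2) / 2          # midpoint as the candidate
                below = [y for v, y in pairs if v < c]
                above = [y for v, y in pairs if v >= c]
                gain = base - (len(below) / n * entropy(below)
                               + len(above) / n * entropy(above))
                if gain > best_gain:
                    best_gain, best_c = gain, c
        return best_c, best_gain

    # e.g. six sorted values with classes n n y y y n -> threshold 54.0
    print(best_threshold([40, 48, 60, 72, 80, 90], list("nnyyyn")))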

36
Alternative selection measures
  • The information gain measure favors attributes
    with many values
  • they separate the data into small subsets
  • high gain, poor prediction
  • Gain ratio measure
  • penalizes gain with split information
  • sensitive to how broadly and uniformly the
    attribute splits the data

37
Split information
  • Entropy of S with respect to the values of A
  • SI(S,A) = -sum_i (|Si|/|S|) log2 (|Si|/|S|)
  • earlier: entropy of S wrt the target values
  • GR(S,A) = Gain(S,A) / SI(S,A)
  • Discourages selection of attributes with
  • many uniformly distributed values
  • n uniform values: SI = log2 n; boolean: SI = 1
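
Both formulas in the style of the earlier sketches (the degenerate SI = 0
case is handled crudely here and deferred to the heuristics on the next
slide):

    def split_information(examples, attribute):
        # SI(S, A): entropy of S with respect to the values of A.
        n = len(examples)
        counts = Counter(x[attribute] for x in examples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def gain_ratio(examples, labels, attribute):
        si = split_information(examples, attribute)
        if si == 0.0:            # one value covers all of S -> GR undefined
            return float("inf")
        return information_gain(examples, labels, attribute) / si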

38
Practical issues on SI
  • One value of A may cover nearly all of S
  • some Si is then close to S
  • SI = 0 or very small
  • GR undefined or very large
  • Apply heuristics to select attributes
  • compute Gain first
  • compute GR only when Gain is large enough (above
    average)

39
Another alternative
  • Distance-based measure
  • define a metric between partitions of the data
  • evaluate an attribute by the distance between the
    partition it creates and the perfect partition
  • choose the attribute whose partition is closest
  • Shown
  • not biased towards attributes with large value
    sets

40
Missing values
  • Estimate the value
  • from other examples with a known value
  • When computing Gain(S,A) with A(x) unknown
  • assign the most common value of A in S
  • or the most common among examples with class c(x)
  • or assign a probability to each value and
    distribute fractional counts of x down the branches
  • Similar techniques are used in classification
    (a sketch of the simple strategies follows)
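
A sketch of the two simple replacement strategies (the in-place replacement
and the None sentinel for a missing value are illustrative assumptions; the
fractional-count strategy is omitted for brevity):

    def fill_missing(examples, labels, attribute):
        # Replace a missing value of A by the most common value among
        # examples of the same class, falling back to the most common
        # value overall.
        overall = Counter(x[attribute] for x in examples
                          if x[attribute] is not None).most_common(1)[0][0]
        by_class = {}
        for x, y in zip(examples, labels):
            if x[attribute] is not None:
                by_class.setdefault(y, Counter())[x[attribute]] += 1
        for x, y in zip(examples, labels):
            if x[attribute] is None:
                counts = by_class.get(y)
                x[attribute] = (counts.most_common(1)[0][0]
                                if counts else overall)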

41
Attributes with differing costs
  • Measuring an attribute may cost something
  • prefer cheap attributes if possible
  • use costly ones only when they give good gain
  • introduce a cost term into the selection measure
    (a sketch follows below)
  • no guarantee of finding an optimum, but a bias
    towards the cheapest attributes
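
One way such a cost term can look (the Gain^2 / Cost form is one proposal
from the literature, not taken from these slides; any monotone gain-vs-cost
trade-off would fit the same slot):

    def cost_adjusted_gain(examples, labels, attribute, costs):
        # Reward gain, penalize cost -> a bias towards cheap attributes.
        g = information_gain(examples, labels, attribute)
        return g * g / costs[attribute]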

42
Attributes with costs...
  • Example applications
  • robot sonar: time required to position the sensor
  • medical diagnosis: cost of a laboratory test

43
Summary
  • Practical learning method for
  • discrete-valued functions
  • ID3: greedy search
  • Complete hypothesis space
  • Preference bias
  • Overfitting -> pruning
  • methods using a preference bias
  • Extensions