1
CS 478 - Machine Learning
  • Decision Trees (II)

2
Entropy (I)
  • Let S be a set of examples from c classes. Its
    entropy is
      Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
  • where p_i is the proportion of examples of S
    belonging to class i (note: we define 0 \log 0 = 0)

3
Entropy (II)
  • Intuitively, the smaller the entropy, the purer
    the partition
  • Based on Shannon's information theory (here c = 2)
  • If p_1 = 1 (resp. p_2 = 1), then the receiver knows
    the example is positive (resp. negative). No message
    need be sent.
  • If p_1 = p_2 = 0.5, then the receiver needs to be
    told the class of the example. A 1-bit message must
    be sent.
  • If 0 < p_1 < 1, then the receiver needs fewer than
    1 bit on average to know the class of the example.
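
A minimal Python sketch of this computation (not from the slides; the function name entropy is my own):

import math

def entropy(class_counts):
    """Entropy of a set, given per-class example counts."""
    total = sum(class_counts)
    h = 0.0
    for count in class_counts:
        p = count / total
        if p > 0:                 # convention from the slide: 0 log 0 = 0
            h -= p * math.log2(p)
    return h

print(entropy([5, 5]))    # p_1 = p_2 = 0.5 -> 1.0 bit
print(entropy([10, 0]))   # p_1 = 1         -> 0.0 bits (pure)
print(entropy([8, 2]))    # 0 < p_1 < 1     -> ~0.72 bits (less than 1 on average)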

4
Information Gain
  • Let p be a property (attribute) with n outcomes
  • The information gained by partitioning a set S
    according to p is
      Gain(S, p) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} Entropy(S_i)
  • where S_i is the subset of S for which property p
    has its ith value
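
A matching sketch of the gain computation, reusing entropy from the previous sketch; the dict-of-rows representation and the name information_gain are my assumptions:

from collections import Counter

def information_gain(examples, attribute, label):
    """Gain(S, p): entropy minus the size-weighted entropy of the
    partition induced by `attribute`. `examples` is a list of dicts."""
    def H(rows):
        return entropy(list(Counter(r[label] for r in rows).values()))
    gain = H(examples)
    for value in set(r[attribute] for r in examples):
        subset = [r for r in examples if r[attribute] == value]
        gain -= len(subset) / len(examples) * H(subset)
    return gain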

5
Play Tennis
What is the ID3-induced tree? (A sketch follows.)
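
For concreteness, a compact ID3 sketch, assuming the standard 14-example Play Tennis data from Mitchell's textbook (the table itself is not reproduced here); it reuses entropy and information_gain from the sketches above:

ATTRS = ["Outlook", "Temp", "Humidity", "Wind"]
DATA = [dict(zip(ATTRS + ["Play"], row)) for row in [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No")]]

def id3(rows, attrs):
    labels = [r["Play"] for r in rows]
    if len(set(labels)) == 1 or not attrs:   # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a, "Play"))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best])
                   for v in set(r[best] for r in rows)}}

print(id3(DATA, ATTRS))  # root is Outlook (gain ~0.246 bits); Overcast -> Yes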
6
ID3's Splitting Criterion
  • The objective of ID3 at each split is to increase
    information gain (equivalently, to lower entropy)
    as much as possible
  • Pros: easy to do
  • Cons: may lead to overfitting

7
Overfitting
  • Given a hypothesis space H, a hypothesis h ∈ H is
    said to overfit the training data if there exists
    some alternative hypothesis h' ∈ H such that h
    has smaller error than h' over the training
    examples, but h' has smaller error than h over
    the entire distribution of instances
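
In symbols (a standard LaTeX restatement of the definition, writing error_train for training-set error and error_D for error over the full distribution D):

h \text{ overfits} \iff \exists\, h' \in H:\;
\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h')
\;\wedge\;
\mathrm{error}_{\mathcal{D}}(h') < \mathrm{error}_{\mathcal{D}}(h)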

8
Avoiding Overfitting
  • Two alternatives:
  • Stop growing the tree before it begins to
    overfit (e.g., when the data split is not
    statistically significant)
  • Grow the tree to full (overfitting) size and
    post-prune it
  • Either way, when do I stop? What is the correct
    final tree size?

9
Approaches
  • Use only the training data and a statistical test to
    estimate whether expanding/pruning is likely to
    produce an improvement beyond the training set
  • Use MDL to minimize size(tree) +
    size(misclassifications(tree))
  • Use a separate validation set to evaluate the
    utility of pruning
  • Use richer node conditions and accuracy
10
Reduced Error Pruning
  • Split the dataset into training and validation sets
  • Induce a full tree from the training set
  • While the accuracy on the validation set increases:
    • Evaluate the impact of pruning each subtree,
      replacing its root by a leaf labeled with the
      majority class for that subtree
    • Remove the subtree whose removal most increases
      validation set accuracy (greedy approach; see the
      sketch below)
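
A sketch of that greedy loop over the dict-shaped trees built by the id3() example above; classify, candidates, and the unseen-value fallback are my own choices:

import copy
from collections import Counter

def classify(tree, row):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row[attr], "Yes")   # crude assumed fallback
    return tree

def accuracy(tree, rows):
    return sum(classify(tree, r) == r["Play"] for r in rows) / len(rows)

def majority(rows):
    return Counter(r["Play"] for r in rows).most_common(1)[0][0]

def candidates(tree, rows):
    """Every tree obtained by collapsing exactly one internal node into
    a leaf labeled with the majority class of the rows reaching it."""
    if not isinstance(tree, dict) or not rows:
        return
    yield majority(rows)                          # collapse this node itself
    attr = next(iter(tree))
    for v, child in tree[attr].items():
        for c in candidates(child, [r for r in rows if r[attr] == v]):
            new = copy.deepcopy(tree)
            new[attr][v] = c
            yield new

def reduced_error_prune(tree, train_rows, val_rows):
    """Repeatedly apply the single prune that most improves validation accuracy."""
    while True:
        best_acc, best_tree = accuracy(tree, val_rows), None
        for cand in candidates(tree, train_rows):
            acc = accuracy(cand, val_rows)
            if acc > best_acc:
                best_acc, best_tree = acc, cand
        if best_tree is None:
            return tree
        tree = best_tree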

11
Rule Post-pruning
  • Split the dataset into training and validation sets
  • Induce a full tree from the training set
  • Convert the tree into an equivalent set of rules
    (one rule per root-to-leaf path)
  • For each rule:
    • Remove any precondition whose removal increases
      the rule's accuracy on the validation set (see the
      sketch below)
  • Sort the rules by estimated accuracy
  • Classify new examples using the new ordered set
    of rules
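
A per-rule sketch, with a rule represented as (preconditions, label) where preconditions is a list of (attribute, value) tests; the representation and all names are my assumptions:

def rule_matches(preconds, row):
    return all(row[a] == v for a, v in preconds)

def rule_accuracy(preconds, label, rows):
    covered = [r for r in rows if rule_matches(preconds, r)]
    return (sum(r["Play"] == label for r in covered) / len(covered)
            if covered else 0.0)

def postprune_rule(preconds, label, val_rows):
    """Greedily drop any precondition whose removal improves validation accuracy."""
    preconds = list(preconds)
    improved = True
    while improved and preconds:
        improved = False
        base = rule_accuracy(preconds, label, val_rows)
        for p in list(preconds):
            trial = [q for q in preconds if q != p]
            if rule_accuracy(trial, label, val_rows) > base:
                preconds, improved = trial, True
                break
    return preconds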

12
Discussion
  • Reduced-error pruning produces the smallest
    version of the most accurate subtree
  • Rule post-pruning is more fine-grained and is
    perhaps the most widely used method
  • In all cases, pruning based on a validation set
    is problematic when the amount of available data
    is limited

13
Accuracy vs Entropy
  • ID3 uses entropy to build the tree and accuracy
    to prune it
  • Why not use accuracy in the first place?
  • How?
  • How does it compare with entropy?
  • Is there a way to make it work?

14
Other Issues
  • The text briefly discusses the following aspects
    of decision tree learning:
  • Continuous-valued attributes
  • Alternative splitting criteria (e.g., for
    attributes with many values)
  • Accounting for costs

15
Unknown Attribute Values
  • Alternatives:
  • Remove examples with missing attribute values
  • Treat the missing value as a distinct, special value
    of the attribute
  • Replace the missing value with the most common value
    of the attribute (see the sketch below):
    • Overall
    • At node n
    • At node n, among examples with the same class label
  • Use probabilities
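
A sketch of the most-common-value strategies, assuming rows are dicts with None marking a missing value (the slides do not fix a representation):

from collections import Counter

def most_common_value(rows, attr, cls=None, label="Play"):
    """Most common non-missing value of attr, optionally restricted to
    rows whose class equals cls (the same-class-label variant)."""
    pool = [r[attr] for r in rows
            if r[attr] is not None and (cls is None or r[label] == cls)]
    return Counter(pool).most_common(1)[0][0]

def fill_missing(rows, attr):
    """'Overall' strategy; call it on the rows reaching a node for the
    at-node-n variant, or pass cls to restrict by class label."""
    for r in rows:
        if r[attr] is None:
            r[attr] = most_common_value(rows, attr)
    return rows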