Decision Tree Learning
1
Decision Tree Learning
  • Widely used, practical
  • Method of approximating discrete-valued functions
  • Robust to noisy data
  • Capable of learning disjunctive expressions
  • Typical bias: prefer smaller trees

2
Decision trees
  • Classify instances
  • by sorting them down the tree to a leaf node
    containing the class (value)
  • based on attributes of the instances
  • one branch for each value of the tested attribute
  • In general, a tree represents
  • a disjunction of conjunctions of constraints on
    attribute values of instances

3
When to use?
  • Instances presented as attribute-value pairs
  • Target function has discrete values
  • classification problems
  • Disjunctive descriptions may be required
  • Training data may contain
  • errors
  • missing attribute values

4
What follows?
  • Basic learning algorithm (ID3)
  • Hypothesis space
  • Inductive bias
  • Occam's razor in general
  • Overfitting problem and extensions
  • post-pruning
  • real values, missing values, attribute costs, ...

5
Basic DT Learning Alg.
  • Most better-performing methods are variations of
    this one
  • top-down greedy search in H
  • ID3, C4.5 (Quinlan 1986, 1993)
  • Top-down greedy construction
  • Which attribute should be tested?
  • chosen by a statistical test on the current data
  • repeat for descendants

6
Best attribute
  • Most useful in classification
  • how to measure its worth?
  • information gain
  • how well attr. separates examples according to
    their classification
  • Next
  • precise definition for gain
  • example

7
Entropy
  • Homogeneity measure for a set S
  • Entropy(S) =
  • -p(+) log2 p(+) - p(-) log2 p(-)
  • p(+) = proportion of positive examples
  • p(-) = proportion of negative examples
  • note: 0 log 0 is defined to be 0
  • 0 if all examples are in the same class
  • 1 if p(+) = p(-) = 0.5

8
Entropy...
  • Information-theoretic concept
  • expected minimal number of bits required to code
    the class of a randomly drawn member of S
  • the optimal code for a class occurring with
    probability p has length -log2 p
  • hence the code lengths -log2 p(+) and -log2 p(-)
  • generalizes to m-ary classes
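
A minimal sketch of this computation in Python (representing S as a list of
class labels is an illustrative assumption; the code already covers the
m-ary generalization):

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a collection of class labels (m-ary in general).
        # 0 log 0 never arises: Counter yields only positive counts.
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    # e.g. 9 positive and 5 negative examples -> entropy of about 0.940
    print(entropy(["+"] * 9 + ["-"] * 5))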

9
Information Gain
  • Expected reduction in entropy
  • Gain(S, A) =
  • Ent(S) - sum_v (|Sv|/|S|) Ent(Sv)
  • v ranges over the values of A
  • Sv = members of S with A = v
  • 2nd term = expected value of the entropy after
    partitioning with A
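
Continuing the sketch above, the gain formula transcribes directly
(representing examples as dicts from attribute name to value is again an
assumption for illustration; entropy() is the helper defined earlier):

    def information_gain(examples, labels, attribute):
        # Gain(S, A) = Ent(S) - sum over values v of (|Sv|/|S|) * Ent(Sv)
        n = len(examples)
        partitions = {}
        for x, y in zip(examples, labels):
            partitions.setdefault(x[attribute], []).append(y)
        expected = sum(len(sv) / n * entropy(sv)
                       for sv in partitions.values())
        return entropy(labels) - expected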

10
Interpretations of gain
  • Gain(S,A)
  • expected reduction in entropy caused by knowing
    the value of A
  • information provided about the target function
    value, given the value of A
  • number of bits saved when coding the class of a
    member of S, knowing the value of A
  • The measure used by the ID3 algorithm

11
Example
  • Gains for each attribute
  • Outlook 0.246, Humidity 0.151, Wind 0.048,
    Temperature 0.029
  • Node creation
  • Outlook is selected at the root node
  • 3 descendants are created
  • the examples in S are sorted down to the descendants
  • one descendant becomes a leaf node (entropy 0)

12
Example...
  • At inner nodes
  • same steps as earlier, but
  • only the examples sorted to the node are used in
    the Gain computations
  • Continues until
  • entropy = 0 (all examples have the same class), or
  • all attributes have been used
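
The whole construction described on the last two slides fits in a short
recursive sketch (the dict-or-leaf tree representation is an assumption;
entropy(), information_gain() and Counter come from the earlier sketches):

    def id3(examples, labels, attributes):
        # Leaf: all examples share one class (entropy = 0)
        if len(set(labels)) == 1:
            return labels[0]
        # Leaf: no attributes left -> most common class
        if not attributes:
            return Counter(labels).most_common(1)[0][0]
        # Greedy choice: the attribute with the highest information gain
        best = max(attributes,
                   key=lambda a: information_gain(examples, labels, a))
        # Sort the examples down to the descendants and recurse
        branches = {}
        for x, y in zip(examples, labels):
            branches.setdefault(x[best], ([], []))
            branches[x[best]][0].append(x)
            branches[x[best]][1].append(y)
        rest = [a for a in attributes if a != best]
        return {"attribute": best,
                "branches": {v: id3(ex, lab, rest)
                             for v, (ex, lab) in branches.items()}}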

13
Hypothesis space of ID3
  • Set of possible decision trees
  • simple-to-complex hill-climbing search
  • evaluation function: information gain
  • Complete!
  • contains all discrete functions based on
    available attributes
  • including the target function

14
Hypothesis space...
  • Maintains only one hypothesis
  • how many other DTs are consistent?
  • what queries to make?
  • No backtracking
  • local minima possible -> extensions
  • Statistics-based choices
  • uses all data at each step -> robustness
  • compare to incremental methods

15
Inductive Bias
  • Many DTs are usually consistent with the data
  • the bias is the basis by which ID3 chooses one
  • Roughly: prefer
  • shorter trees over longer ones
  • trees with high-gain attributes near the root
  • Difficult to characterize precisely
  • attribute selection heuristics
  • interact closely with the given data

16
Approx. bias of ID3
  • Shorter trees are better
  • breadth-first search in H would implement exactly
    this bias
  • ID3 is an efficient approximation of that BFS
  • Compare the bias to Candidate-Elimination (C-E)
  • ID3: complete space, incomplete search -> bias
    from the search strategy
  • C-E: incomplete space, complete search -> bias
    from the expressive power of H

17
Restriction vs. preference
  • ID3: preference bias (search bias)
  • C-E: restriction bias (language bias)
  • Which one is better?
  • a preference bias allows us to work with a
    complete hypothesis space
  • with a restriction bias, the target concept c may
    not be in H at all
  • combinations are possible (e.g. linear functions +
    LMS)

18
Why prefer short hypotheses?
  • William of Occam (ca. 1320)
  • Occam's razor
  • Prefer the simplest hypothesis that fits the data
  • Sound principle?
  • there are fewer short hypotheses than long ones
  • a short hypothesis fitting the data is less likely
    to be a coincidence

19
Difficulties
  • Many small sets of hypotheses
  • fit the previous argument equally well, e.g.
  • DTs with m nodes and n leaves
  • DTs with attribute A1 at the root, A2 at node 2, ...
  • few such trees -> small probability that one fits
    the data by coincidence
  • so why is the set of short trees the right one to
    prefer?

20
Difficulties...
  • Size of a hypothesis?
  • depends on the learner's internal representation
  • two learners with different representations may
    reach different conclusions
  • Example case
  • L1 as before
  • L2: a compound boolean attribute XYZ -> a single
    node

21
Reject altogether?
  • Natural internal representations?
  • (artificial) evolution of algorithms
  • more successful descendants obtained by modifying
    the internal representation
  • result: internal representations that work well
    with the learning algorithm's bias
  • if the algorithm uses Occam's razor, evolution
    creates internal representations suited to it
  • reason: it is easier to change the representation
    than the algorithm

22
Issues in DT learning
  • Facing the real world
  • how deeply to grow the DT
  • continuous attributes
  • attribute selection measures
  • missing attribute values
  • attributes with differing costs
  • computational efficiency
  • extending ID3 with these issues -> C4.5

23
Overfit
  • The basic algorithm overfits the training examples
  • Problems arise from
  • noise
  • small training sets
  • Informal definition
  • some less well-fitting h actually performs better
    over the whole instance space X

24
Overfit
  • h in H overfits D if
  • there exists h' in H such that
  • error(h, D) < error(h', D) but
  • error(h, X) > error(h', X)
  • Example figure: accuracy vs. tree size
  • on training data vs. test data

25
How can such happen?
  • One reason: noise
  • noisy data makes the learned tree h large
  • an h' not fitting the noise is likely to work
    better
  • Small samples
  • coincidences are possible
  • attributes unrelated to c may happen to partition
    the training data well

26
How to avoid overfit?
  • Several approaches
  • stop growing tree earlier
  • allow overfit but post-prune after construction
  • the latter has been found more successful in
    practice

27
How to decide tree size?
  • What criterion to use?
  • a separate test set to evaluate the utility of
    pruning
  • use all data, apply a statistical test to estimate
    whether expanding/pruning is likely to produce an
    improvement
  • use an explicit complexity measure (coding length
    of data + tree), stop growth when it is minimized

28
Training/validation sets
  • Available data is split into
  • training set: apply learning to this
  • validation set: evaluate the result
  • accuracy
  • impact of pruning
  • a safety check against overfit
  • common strategy: 2/3 for training, 1/3 for
    validation

29
Reduced error pruning
  • Pruning a node
  • make an inner node a leaf node
  • assign it the most common class of the examples at
    the node
  • Procedure (a sketch follows below)
  • a candidate is pruned only if the result performs
    no worse on the validation set
  • coincidental regularities are thus likely to be
    removed
  • choose the candidate giving the best accuracy
  • continue until pruning no longer helps
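
A sketch of one bottom-up variant, using the tree representation assumed
earlier. The slides describe a greedy loop over all candidate nodes; this
single pass simplifies that by comparing, at each node, the subtree against
a majority-class leaf on the validation examples that reach the node:

    def classify(node, x):
        # Sort an instance down the dict-or-leaf tree
        # (returns None if a branch value was never seen).
        while isinstance(node, dict):
            node = node["branches"].get(x[node["attribute"]])
        return node

    def reduced_error_prune(node, train_x, train_y, val_x, val_y):
        if not isinstance(node, dict):
            return node
        attr = node["attribute"]
        # Prune the children first, routing examples down the branches.
        for value in list(node["branches"]):
            tx = [x for x in train_x if x[attr] == value]
            ty = [y for x, y in zip(train_x, train_y) if x[attr] == value]
            vx = [x for x in val_x if x[attr] == value]
            vy = [y for x, y in zip(val_x, val_y) if x[attr] == value]
            node["branches"][value] = reduced_error_prune(
                node["branches"][value], tx, ty, vx, vy)
        # Candidate leaf: the most common training class at this node.
        leaf = Counter(train_y).most_common(1)[0][0]
        if val_y:
            subtree_ok = sum(classify(node, x) == y
                             for x, y in zip(val_x, val_y))
            leaf_ok = sum(y == leaf for y in val_y)
            if leaf_ok >= subtree_ok:   # performs no worse -> prune
                return leaf
        return node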

30
Reduced error pruning
  • If plenty of data is available
  • training set
  • validation set used for pruning
  • test set to measure accuracy
  • If not
  • alternative methods (will follow)
  • Additional techniques
  • multiple partitionings + averaging

31
Rule Post-Pruning
  • Procedure (C4.5 uses a variant; a sketch follows)
  • infer the DT as usual (allow overfit)
  • convert the tree to rules (one per path)
  • prune each rule independently
  • remove preconditions if the result is more accurate
  • sort the rules by estimated accuracy
  • apply the rules in this order in classification
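
A sketch of the convert-and-prune steps under the same tree representation;
for simplicity the accuracy here is estimated on a held-out set rather than
with the pessimistic training-set estimate discussed on the next slide:

    def tree_to_rules(node, tests=()):
        # One rule per root-to-leaf path:
        # (list of (attribute, value) preconditions, class)
        if not isinstance(node, dict):
            return [(list(tests), node)]
        rules = []
        for value, child in node["branches"].items():
            rules += tree_to_rules(child,
                                   tests + ((node["attribute"], value),))
        return rules

    def rule_accuracy(tests, cls, xs, ys):
        covered = [y for x, y in zip(xs, ys)
                   if all(x[a] == v for a, v in tests)]
        return (sum(y == cls for y in covered) / len(covered)
                if covered else 0.0)

    def prune_rule(tests, cls, xs, ys):
        # Greedily drop preconditions while the estimated accuracy
        # does not drop.
        improved = True
        while improved and tests:
            improved = False
            base = rule_accuracy(tests, cls, xs, ys)
            for i in range(len(tests)):
                shorter = tests[:i] + tests[i + 1:]
                if rule_accuracy(shorter, cls, xs, ys) >= base:
                    tests, improved = shorter, True
                    break
        return tests, cls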

32
Rule post-pruning
  • Estimating the accuracy
  • separate validation set, or
  • training data with pessimistic estimates
  • the data is too favorable for the rules
  • compute the accuracy and its standard deviation
  • take the lower bound of a given confidence
    interval (e.g. 95%) as the measure
  • very close to the observed accuracy for large sets
  • not statistically valid, but works in practice

33
Why convert to rules?
  • Distinguishes the different contexts in which a
    node is used
  • a separate pruning decision for each path
  • No difference between the root and inner nodes
  • no bookkeeping on how to reorganize the tree if
    the root node is pruned
  • Improves readability

34
Continuous values
  • Define a new discrete-valued attribute
  • partition the continuous range into a discrete
    set of intervals
  • Ac is true iff A < c
  • How to select the best threshold c? (max inf. gain)
  • Example case
  • sort the examples by the continuous value
  • identify borderlines where the class changes

35
Continuous values
  • Fact
  • the value maximizing inf. gain lies on such a
    boundary
  • Evaluation (a sketch follows below)
  • compute the gain for each boundary
  • Extensions
  • splits with multiple thresholds
  • LTUs (linear threshold units) based on several
    attributes
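
A sketch of the boundary search (entropy() is from the earlier sketch;
taking the midpoint between adjacent values as the candidate threshold is a
common convention, assumed here, and the sample values are made up for
illustration):

    def best_threshold(values, labels):
        # Candidate thresholds lie at class boundaries of the sorted
        # values; evaluate the gain of the boolean test A < c at each.
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        base = entropy([y for _, y in pairs])
        best_gain, best_c = -1.0, None
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
            if y1 != y2 and v1 != v2:      # a class boundary
                c = (v1 + v2) / 2          # midpoint as the candidate
                below = [y for v, y in pairs if v < c]
                above = [y for v, y in pairs if v >= c]
                gain = base - (len(below) / n * entropy(below)
                               + len(above) / n * entropy(above))
                if gain > best_gain:
                    best_gain, best_c = gain, c
        return best_c, best_gain

    # e.g. six sorted values with classes n n y y y n -> threshold 54.0
    print(best_threshold([40, 48, 60, 72, 80, 90], list("nnyyyn")))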

36
Alternative selection measures
  • The information gain measure favors attributes
    with many values
  • they separate the data into small subsets
  • high gain, poor prediction
  • Gain ratio measure
  • penalizes gain with split information
  • sensitive to how broadly and uniformly the
    attribute splits the data

37
Split information
  • Entropy of S with respect to the values of A
  • SI(S,A) = -sum_i (|Si|/|S|) log2 (|Si|/|S|)
  • earlier: entropy of S wrt the target values
  • GR(S,A) = Gain(S,A) / SI(S,A)
  • Discourages selection of attributes with
  • many uniformly distributed values
  • n uniform values: SI = log2 n; boolean: SI = 1
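
Both formulas in the style of the earlier sketches (the degenerate SI = 0
case is handled crudely here and deferred to the heuristics on the next
slide):

    def split_information(examples, attribute):
        # SI(S, A): entropy of S with respect to the values of A.
        n = len(examples)
        counts = Counter(x[attribute] for x in examples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def gain_ratio(examples, labels, attribute):
        si = split_information(examples, attribute)
        if si == 0.0:            # one value covers all of S -> GR undefined
            return float("inf")
        return information_gain(examples, labels, attribute) / si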

38
Practical issues on SI
  • One value of A may cover nearly all of S
  • some Si is then close to S
  • SI = 0 or very small
  • GR undefined or very large
  • Apply heuristics to select attributes
  • compute Gain first
  • compute GR only when Gain is large enough (above
    average)

39
Another alternative
  • Distance-based measure
  • define a metric between partitions of the data
  • evaluate an attribute by the distance between the
    partition it creates and the perfect partition
  • choose the attribute whose partition is closest
  • Shown
  • not biased towards attributes with large value
    sets

40
Missing values
  • Estimate the value
  • from other examples with a known value
  • When computing Gain(S,A) with A(x) unknown
  • assign the most common value of A in S
  • or the most common among examples with class c(x)
  • or assign a probability to each value and
    distribute fractional counts of x down the branches
  • Similar techniques are used in classification
    (a sketch of the simple strategies follows)
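
A sketch of the two simple replacement strategies (the in-place replacement
and the None sentinel for a missing value are illustrative assumptions; the
fractional-count strategy is omitted for brevity):

    def fill_missing(examples, labels, attribute):
        # Replace a missing value of A by the most common value among
        # examples of the same class, falling back to the most common
        # value overall.
        overall = Counter(x[attribute] for x in examples
                          if x[attribute] is not None).most_common(1)[0][0]
        by_class = {}
        for x, y in zip(examples, labels):
            if x[attribute] is not None:
                by_class.setdefault(y, Counter())[x[attribute]] += 1
        for x, y in zip(examples, labels):
            if x[attribute] is None:
                counts = by_class.get(y)
                x[attribute] = (counts.most_common(1)[0][0]
                                if counts else overall)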

41
Attributes with differing costs
  • Measuring an attribute may cost something
  • prefer cheap attributes if possible
  • use costly ones only when they give good gain
  • introduce a cost term into the selection measure
    (a sketch follows below)
  • no guarantee of finding an optimum, but a bias
    towards the cheapest attributes
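
One way such a cost term can look (the Gain^2 / Cost form is one proposal
from the literature, not taken from these slides; any monotone gain-vs-cost
trade-off would fit the same slot):

    def cost_adjusted_gain(examples, labels, attribute, costs):
        # Reward gain, penalize cost -> a bias towards cheap attributes.
        g = information_gain(examples, labels, attribute)
        return g * g / costs[attribute]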

42
Attributes with costs...
  • Example applications
  • robot sonar: time required to position the sensor
  • medical diagnosis: cost of a laboratory test

43
Summary
  • Practical learning method for
  • discrete-valued functions
  • ID3: greedy search
  • Complete hypothesis space
  • Preference bias
  • Overfitting -> pruning
  • methods using a preference bias
  • Extensions