1. Decision-Tree Induction and Decision-Rule Induction
Evgueni Smirnov
2. Overview
- Instances, Classes, Languages, Hypothesis Spaces
- Decision Trees
- Decision Rules
- Evaluation Techniques
- Intro to Weka
3. Instances and Classes
A class is a set of objects in a world that are unified by a reason. A reason may be a similar appearance, structure, or function.
(Figure: friendly robots)
Example: the set {children, photos, cat, diplomas} can be viewed as the class "Most important things to take out of your apartment when it catches fire."
4. Instances, Classes, Languages
Instance description: head = square, body = round, smiling = yes, holding = flag, color = yellow
Class: friendly robots
5. Instances, Classes, Hypothesis Spaces
Hypothesis: smiling = yes → friendly robots
Instance description: head = square, body = round, smiling = yes, holding = flag, color = yellow
Class: friendly robots
6. The Classification Task
7. Decision Trees for Classification
- Decision trees
- Appropriate problems for decision trees
- Entropy and Information Gain
- The ID3 algorithm
- Avoiding Overfitting via Pruning
- Handling Continuous-Valued Attributes
- Handling Missing Attribute Values
8. Decision Trees
- Definition: A decision tree is a tree such that:
  - each internal node tests an attribute,
  - each branch corresponds to an attribute value,
  - each leaf node assigns a classification.
9. Data Set for Playing Tennis
10. Decision Tree for Playing Tennis
Outlook
  Sunny    → Humidity
               High   → no
               Normal → yes
  Overcast → yes
  Rainy    → Windy
               False → yes
               True  → no
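The tree above can also be written directly as nested tests. A minimal Python sketch (the dict-based instance representation and the lowercase attribute names are assumptions for illustration, not part of the slides):

    def classify_play_tennis(instance):
        """Classify one instance with the Play-Tennis tree above.
        `instance` is assumed to be a dict, e.g.
        {"outlook": "sunny", "humidity": "high", "windy": False}."""
        if instance["outlook"] == "sunny":
            return "yes" if instance["humidity"] == "normal" else "no"
        if instance["outlook"] == "overcast":
            return "yes"
        # remaining case: rainy
        return "no" if instance["windy"] else "yes"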
11. When to Consider Decision Trees
- Instances are described by attributes with discrete values (e.g., outlook = sunny).
- The classification is over discrete values (e.g., yes/no).
- Disjunctive descriptions are acceptable: each path in the tree is a conjunction of attribute tests, and the tree as a whole is a disjunction of these paths. Any Boolean function can be represented!
- The training data may contain errors: decision trees are robust to classification errors in the training data.
- The training data may contain missing values: decision trees can be used even if instances have missing attributes.
12Rules in Decision Trees
If Outlook Sunny Humidity High then Play
no If Outlook Sunny Humidity Normal then
Play yes If Outlook Overcast then Play
yes If Outlook Rainy Windy False then Play
yes If Outlook Rainy Windy True then Play
no
13. Decision Tree Induction
- Basic Algorithm:
- 1. A ← the "best" decision attribute for a node N.
- 2. Assign A as the decision attribute for the node N.
- 3. For each value of A, create a new descendant of the node N.
- 4. Sort the training examples to the leaf nodes.
- 5. IF the training examples are perfectly classified, THEN STOP; ELSE iterate over the new leaf nodes.
14. Decision Tree Induction
Splitting on Outlook (branches Sunny, Overcast, Rain) partitions the training examples:

Outlook = Sunny
  Outlook  Temp  Hum     Wind    Play
  Sunny    Hot   High    Weak    no
  Sunny    Hot   High    Strong  no
  Sunny    Mild  High    Weak    no
  Sunny    Cool  Normal  Weak    yes
  Sunny    Mild  Normal  Strong  yes

Outlook = Overcast
  Outlook   Temp  Hum     Wind    Play
  Overcast  Hot   High    Weak    yes
  Overcast  Cool  Normal  Strong  yes

Outlook = Rain
  Outlook  Temp  Hum     Wind    Play
  Rain     Mild  High    Weak    yes
  Rain     Cool  Normal  Weak    yes
  Rain     Cool  Normal  Strong  no
  Rain     Mild  Normal  Weak    yes
  Rain     Mild  High    Strong  no
15. Entropy
- Let S be a sample of training examples,
- p+ be the proportion of positive examples in S, and
- p- be the proportion of negative examples in S.
- Then entropy measures the impurity of S:
  E(S) = - p+ log2 p+ - p- log2 p-
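A minimal sketch of this formula for the two-class case (with the usual convention that 0 · log2 0 is treated as 0):

    import math

    def entropy(pos, neg):
        """E(S) for a sample with `pos` positive and `neg` negative examples."""
        total = pos + neg
        e = 0.0
        for count in (pos, neg):
            if count > 0:           # 0 * log2(0) is taken to be 0
                p = count / total
                e -= p * math.log2(p)
        return e

For example, for a 14-instance sample with 9 positive and 5 negative examples (the usual split of the Play-Tennis data), entropy(9, 5) ≈ 0.940.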
16. Entropy Example from the Dataset
17. Information Gain
- Information Gain is the expected reduction in entropy caused by partitioning the instances according to a given attribute:
  Gain(S, A) = E(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) E(Sv)
  where Sv = {s ∈ S | A(s) = v}, e.g. Sv1 = {s ∈ S | A(s) = v1}, Sv2 = {s ∈ S | A(s) = v2}.
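A sketch of Gain(S, A) in Python; representing S as a list of dicts with a "play" label is an assumption about the data format, not something the slides prescribe:

    import math
    from collections import Counter

    def label_entropy(examples, target="play"):
        """Entropy of the class labels of `examples` (a list of dicts)."""
        counts = Counter(ex[target] for ex in examples)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attribute, target="play"):
        """Gain(S, A) = E(S) - sum over v of |Sv|/|S| * E(Sv)."""
        gain = label_entropy(examples, target)
        for v in set(ex[attribute] for ex in examples):
            subset = [ex for ex in examples if ex[attribute] == v]
            gain -= len(subset) / len(examples) * label_entropy(subset, target)
        return gain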
18. Example
Splitting on Outlook again partitions the training examples. Which attribute should be tested at the Sunny branch?

Outlook = Sunny
  Outlook  Temp  Hum     Windy  Play
  Sunny    Hot   High    False  No
  Sunny    Hot   High    True   No
  Sunny    Mild  High    False  No
  Sunny    Cool  Normal  False  Yes
  Sunny    Mild  Normal  True   Yes

Outlook = Overcast
  Outlook   Temp  Hum     Windy   Play
  Overcast  Hot   High    Weak    Yes
  Overcast  Cool  Normal  Strong  Yes

Outlook = Rain
  Outlook  Temp  Hum     Windy  Play
  Rain     Mild  High    False  Yes
  Rain     Cool  Normal  False  Yes
  Rain     Cool  Normal  True   No
  Rain     Mild  Normal  False  Yes
  Rain     Mild  High    True   No

Gain(Ssunny, Humidity)    = 0.970 - (3/5)·0.0 - (2/5)·0.0             = 0.970
Gain(Ssunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind)        = 0.970 - (2/5)·1.0 - (3/5)·0.918           = 0.019
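These numbers can be checked with the information_gain sketch from slide 17. The attribute spellings below are assumptions, and the small differences from the slide come only from rounding (the slide rounds the entropy 0.971 of Ssunny to 0.970):

    s_sunny = [
        {"temp": "hot",  "hum": "high",   "windy": False, "play": "no"},
        {"temp": "hot",  "hum": "high",   "windy": True,  "play": "no"},
        {"temp": "mild", "hum": "high",   "windy": False, "play": "no"},
        {"temp": "cool", "hum": "normal", "windy": False, "play": "yes"},
        {"temp": "mild", "hum": "normal", "windy": True,  "play": "yes"},
    ]
    for attr in ("hum", "temp", "windy"):
        print(attr, round(information_gain(s_sunny, attr), 3))
    # prints: hum 0.971, temp 0.571, windy 0.02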
19. ID3 Algorithm
- Informally:
- Determine the attribute with the highest information gain on the training set.
- Use this attribute as the root; create a branch for each of the values the attribute can have.
- For each branch, repeat the process with the subset of the training set that is classified by that branch (a recursive sketch follows below).
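A compact recursive sketch of this procedure, reusing information_gain from slide 17 and the dict-based examples assumed earlier; the nested-dict tree format is an assumption for illustration:

    from collections import Counter

    def id3(examples, attributes, target="play"):
        """Recursive ID3 sketch: returns a nested dict {attribute: {value: subtree}}
        or a class label at the leaves."""
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:        # all examples agree: make a leaf
            return labels[0]
        if not attributes:               # no attributes left: majority class
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        tree = {best: {}}
        for v in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == v]
            rest = [a for a in attributes if a != best]
            tree[best][v] = id3(subset, rest, target)
        return tree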
20. Hypothesis Space Search in ID3
- The hypothesis space is the set of all decision trees defined over the given set of attributes.
- ID3's hypothesis space is a complete space, i.e., the target description is there!
- ID3 performs a simple-to-complex, hill-climbing search through this space.
21. Hypothesis Space Search in ID3
- The evaluation function is the information gain.
- ID3 maintains only a single current decision tree.
- ID3 performs no backtracking in its search.
- ID3 uses all training instances at each step of the search.
22. Posterior Class Probabilities
Outlook
  Sunny    → leaf: 2 pos and 3 neg; P(pos) = 0.4, P(neg) = 0.6
  Overcast → leaf: 2 pos and 0 neg; P(pos) = 1.0, P(neg) = 0.0
  Rainy    → Windy
               False → leaf: 3 pos and 0 neg; P(pos) = 1.0, P(neg) = 0.0
               True  → leaf: 0 pos and 2 neg; P(pos) = 0.0, P(neg) = 1.0
23. Overfitting
- Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training instances, but h' has smaller error than h over the entire distribution of instances.
24. Reasons for Overfitting
(Figure: the unpruned Play-Tennis tree from slide 10 — Outlook = Sunny → Humidity (high → no, normal → yes), Outlook = Overcast → yes, Outlook = Rainy → Windy (false → yes, true → no).)
- Noisy training instances. Consider the noisy training example:
  Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = True, PlayTennis = No
- This instance conflicts with the training instances sorted to the same leaf:
  Outlook = Sunny, Temp = Cool, Humidity = Normal, Wind = False, PlayTennis = Yes
  Outlook = Sunny, Temp = Mild, Humidity = Normal, Wind = True, PlayTennis = Yes
25. Reasons for Overfitting
(Figure: the tree grown to fit the noisy example — under Outlook = Sunny, Humidity = Normal the tree now tests Windy and then Temp, adding branches that separate the noisy instance from the two correct ones; the Overcast and Rainy branches are unchanged.)
The instances sorted to the Sunny/Normal branch:
  Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = True, PlayTennis = No (the noisy one)
  Outlook = Sunny, Temp = Cool, Humidity = Normal, Wind = False, PlayTennis = Yes
  Outlook = Sunny, Temp = Mild, Humidity = Normal, Wind = True, PlayTennis = Yes
26. Reasons for Overfitting
- A small number of instances is associated with leaf nodes. In this case it is possible for coincidental regularities to occur that are unrelated to the actual target concept.
27. Approaches to Avoiding Overfitting
- Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.
- Post-pruning: allow the tree to overfit the data, and then post-prune the tree.
28. Pre-pruning
- It is difficult to decide when to stop growing the tree.
- A possible scenario is to stop when a leaf node would get fewer than m training instances. Here is an example for m = 5.
(Figure: the Outlook split with m = 5 — the Sunny branch (2 pos, 3 neg) becomes a "no" leaf, the Rainy branch (3 pos, 2 neg) becomes a "yes" leaf, and the Overcast branch, with only two instances, is left marked "?".)
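A sketch of this stopping rule as a small change to the id3 sketch from slide 19. The value m = 5 comes from this slide; using the majority class at an under-populated node is one common choice (the slide itself leaves such a node marked "?"):

    from collections import Counter

    def id3_prepruned(examples, attributes, target="play", m=5):
        """Like id3, but a node with fewer than m examples becomes a majority leaf."""
        labels = [ex[target] for ex in examples]
        if len(examples) < m or len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        return {best: {v: id3_prepruned([ex for ex in examples if ex[best] == v],
                                        [a for a in attributes if a != best],
                                        target, m)
                       for v in set(ex[best] for ex in examples)}}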
29. Validation Set
- A validation set is a set of instances used to evaluate the utility of nodes in decision trees. The validation set has to be chosen so that it is unlikely to suffer from the same errors or fluctuations as the training set.
- Usually, before pruning, the training data is split randomly into a growing set and a validation set.
30. Reduced-Error Pruning
- Split data into a growing set and a validation set.
- Pruning a decision node d consists of:
  - removing the subtree rooted at d,
  - making d a leaf node,
  - assigning d the most common classification of the training instances associated with d.
(Figure: the unpruned tree — Outlook = Sunny → Humidity (high → no: 3 instances, normal → yes: 2 instances), Outlook = Overcast → yes, Outlook = Rainy → Windy (false → yes, true → no).)
Accuracy of the tree on the validation set is 90%.
31. Reduced-Error Pruning
- Split data into a growing set and a validation set.
- Pruning a decision node d consists of:
  - removing the subtree rooted at d,
  - making d a leaf node,
  - assigning d the most common classification of the training instances associated with d.
(Figure: the tree after pruning the Humidity node — Outlook = Sunny → no, Outlook = Overcast → yes, Outlook = Rainy → Windy (false → yes, true → no).)
Accuracy of the tree on the validation set is 92.4%.
32. Reduced-Error Pruning
- Split data into a growing set and a validation set.
- Pruning a decision node d consists of:
  - removing the subtree rooted at d,
  - making d a leaf node,
  - assigning d the most common classification of the training instances associated with d.
- Do until further pruning is harmful:
  - Evaluate the impact on the validation set of pruning each possible node (plus those below it).
  - Greedily remove the one that most improves validation-set accuracy (a sketch of this loop follows below).
(Figure: the pruned tree — Outlook = Sunny → no, Outlook = Overcast → yes, Outlook = Rainy → Windy (false → yes, true → no).)
Accuracy of the tree on the validation set is 92.4%.
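A sketch of this greedy loop for the nested-dict trees built by the id3 sketch (slide 19). Only the overall procedure comes from the slides; the tree traversal and path bookkeeping below are illustration-only assumptions:

    import copy
    from collections import Counter

    def predict(tree, instance):
        """Walk a nested-dict tree (as built by the id3 sketch) to a class label.
        Attribute values unseen during training are not handled."""
        while isinstance(tree, dict):
            attr = next(iter(tree))
            tree = tree[attr][instance[attr]]
        return tree

    def accuracy(tree, examples, target="play"):
        return sum(predict(tree, ex) == ex[target] for ex in examples) / len(examples)

    def internal_nodes(tree, path=()):
        """Yield the (attribute, value, ...) path to every internal node."""
        if isinstance(tree, dict):
            yield path
            attr = next(iter(tree))
            for value, child in tree[attr].items():
                yield from internal_nodes(child, path + (attr, value))

    def pruned_copy(tree, path, leaf_label):
        """Copy of `tree` with the subtree at `path` replaced by `leaf_label`."""
        if not path:
            return leaf_label
        new = copy.deepcopy(tree)
        node = new
        for attr, value in zip(path[0::2], path[1::2]):
            parent, key = node[attr], value
            node = node[attr][value]
        parent[key] = leaf_label
        return new

    def reduced_error_prune(tree, growing_set, validation_set, target="play"):
        """Repeatedly prune the node that most improves validation-set accuracy."""
        while True:
            best_acc, best_tree = accuracy(tree, validation_set, target), None
            for path in internal_nodes(tree):
                # most common class among growing-set instances reaching this node
                reached = [ex for ex in growing_set
                           if all(ex[a] == v for a, v in zip(path[0::2], path[1::2]))]
                if not reached:
                    continue
                label = Counter(ex[target] for ex in reached).most_common(1)[0][0]
                candidate = pruned_copy(tree, path, label)
                acc = accuracy(candidate, validation_set, target)
                if acc > best_acc:
                    best_acc, best_tree = acc, candidate
            if best_tree is None:        # no pruning step improves accuracy: stop
                return tree
            tree = best_tree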
33. Reduced-Error Pruning Example
34. Rule Post-Pruning
- Convert the tree to an equivalent set of rules.
- Prune each rule independently of the others.
- Sort the final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
...
(Figure: the Play-Tennis tree from which the rules are read off.)
35. Continuous-Valued Attributes
- Create a set of discrete attributes that test the continuous attribute against thresholds.
- Apply information gain in order to choose the best attribute.
  Temperature:  40   48   60   72   80   90
  PlayTennis:   No   No   Yes  Yes  Yes  No
  Candidate tests: Temperature > 54, Temperature > 85
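A small sketch of how the candidate thresholds 54 and 85 arise: midpoints between adjacent sorted values whose class labels differ (the function name is an illustration, not a standard API):

    def candidate_thresholds(values, labels):
        """Midpoints between adjacent sorted values with different class labels."""
        pairs = sorted(zip(values, labels))
        return [(a + b) / 2
                for (a, la), (b, lb) in zip(pairs, pairs[1:])
                if la != lb]

    print(candidate_thresholds([40, 48, 60, 72, 80, 90],
                               ["No", "No", "Yes", "Yes", "Yes", "No"]))
    # [54.0, 85.0]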
36. Missing Attribute Values
- Strategies:
- Assign the most common value of A among the other instances belonging to the same class (sketched below).
- If node n tests the attribute A, assign the most common value of A among the other instances sorted to node n.
- If node n tests the attribute A, assign a probability to each of the possible values of A. These probabilities are estimated from the observed frequencies of the values of A and are then used in the information gain measure.
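A hedged sketch of the first strategy (fill a missing value with the most common value among instances of the same class); using None to mark a missing value is an assumption of the sketch:

    from collections import Counter

    def fill_missing(examples, attribute, target="play"):
        """Replace missing (None) values of `attribute` with the most common value
        among instances of the same class (assumes at least one value is present)."""
        filled = []
        for ex in examples:
            ex = dict(ex)                 # copy, so the input stays untouched
            if ex[attribute] is None:
                same_class = [e[attribute] for e in examples
                              if e[target] == ex[target] and e[attribute] is not None]
                ex[attribute] = Counter(same_class).most_common(1)[0][0]
            filled.append(ex)
        return filled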
37. Summary Points
- Decision tree learning provides a practical method for concept learning.
- ID3-like algorithms search a complete hypothesis space.
- The inductive bias of decision trees is a preference (search) bias.
- Overfitting the training data is an important issue in decision tree learning.
- A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc.
38. Learning Decision Rules
- Decision Rules
- Basic Sequential Covering Algorithm
- Learn-One-Rule Procedure
- Pruning
39Definition of Decision Rules
Definition Decision rules are rules with the
following form if ltconditionsgt then class
C.
Example If you run the Prism algorithm from Weka
on the weather data you will get the following
set of decision rules if outlook overcast
then PlayTennis yes if humidity normal and
windy FALSE then PlayTennis yes if
temperature mild and humidity normal then
PlayTennis yes if outlook rainy and windy
FALSE then PlayTennis yes if outlook sunny
and humidity high then PlayTennis no if
outlook rainy and windy TRUE then PlayTennis
no
40. Why Decision Rules?
- Decision rules are more compact.
- Decision rules are more understandable.
Example: Let X ∈ {0,1}, Y ∈ {0,1}, Z ∈ {0,1}, W ∈ {0,1}. The rules are:
if X = 1 and Y = 1 then 1
if Z = 1 and W = 1 then 1
otherwise 0
41. Why Decision Rules?
42. How to Learn Decision Rules?
- We can convert trees to rules
- We can use specific rule-learning methods
43Sequential Covering Algorithms
function LearnRuleSet(Target, Attrs, Examples,
Threshold) LearnedRules ? Rule
LearnOneRule(Target, Attrs, Examples) while
performance(Rule,Examples) gt Threshold, do
LearnedRules LearnedRules ? Rule
Examples Examples \ examples covered by
Rule Rule LearnOneRule(Target, Attrs,
Examples) sort LearnedRules according to
performance return LearnedRules
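The same loop as a Python sketch. Here learn_one_rule is sketched under slide 50, rules are assumed to be (conditions, class) pairs with conditions a dict, and performance is a user-supplied scoring function; none of these representation choices come from the slides:

    def covers(rule, example):
        """True if the example satisfies all attribute tests of the rule."""
        conditions, _ = rule
        return all(example[a] == v for a, v in conditions.items())

    def learn_rule_set(examples, attributes, target, threshold,
                       learn_one_rule, performance):
        """Sequential covering: learn a rule, drop the examples it covers, repeat."""
        all_examples = list(examples)
        learned_rules = []
        rule = learn_one_rule(examples, attributes, target)
        while examples and performance(rule, examples) > threshold:
            learned_rules.append(rule)
            examples = [ex for ex in examples if not covers(rule, ex)]
            if not examples:              # guard added for the sketch
                break
            rule = learn_one_rule(examples, attributes, target)
        learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
        return learned_rules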
44. Illustration
45. Illustration
46. Illustration
47. Illustration
IF A ∧ B THEN pos
48. Illustration
IF A ∧ B THEN pos
49. Illustration
IF A ∧ B THEN pos
IF true THEN pos
IF C THEN pos
IF C ∧ D THEN pos
50. Learning One Rule
- To learn one rule we use one of the strategies below:
- Top-down:
  - Start with a maximally general rule.
  - Add literals one by one (a greedy sketch of this strategy follows after this slide).
- Bottom-up:
  - Start with a maximally specific rule.
  - Remove literals one by one.
- Combination of top-down and bottom-up:
  - the candidate-elimination algorithm.
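A greedy top-down sketch of learning one rule: start from the maximally general rule (no conditions) and keep adding the single attribute test that most improves accuracy on the still-covered examples. The scoring by relative frequency anticipates slide 54; the positive_class name and the (conditions, class) rule format are assumptions used throughout these sketches:

    def learn_one_rule(examples, attributes, target, positive_class="yes"):
        """Greedy top-down search for one rule (conditions, positive_class)."""
        conditions = {}                   # maximally general: covers everything
        covered = list(examples)
        while covered and any(ex[target] != positive_class for ex in covered):
            best = None                   # (accuracy, attribute, value, subset)
            for a in attributes:
                if a in conditions:
                    continue
                for v in set(ex[a] for ex in covered):
                    subset = [ex for ex in covered if ex[a] == v]
                    acc = sum(ex[target] == positive_class
                              for ex in subset) / len(subset)
                    if best is None or acc > best[0]:
                        best = (acc, a, v, subset)
            if best is None:              # no attribute left to add
                break
            _, a, v, covered = best
            conditions[a] = v
        return (conditions, positive_class)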
51Bottom-up vs. Top-down
Bottom-up typically more specific rules
-
-
-
-
-
-
-
-
-
-
-
-
-
Top-down typically more general rules
52. Learning One Rule
- Bottom-up: example-driven (the AQ family).
- Top-down: generate-then-test (CN2).
53. Example of Learning One Rule
54. Heuristics for Learning One Rule
- When is a rule good?
  - High accuracy.
  - Less important: high coverage.
- Possible evaluation functions:
  - Relative frequency: nc / n, where nc is the number of correctly classified instances and n is the number of instances covered by the rule.
  - m-estimate of accuracy: (nc + m·p) / (n + m), where nc is the number of correctly classified instances, n is the number of instances covered by the rule, p is the prior probability of the class predicted by the rule, and m is the weight of p (see the sketch below).
  - Entropy.
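A small sketch of the first two evaluation functions (the default weight m = 2 is an arbitrary illustration value, not from the slides):

    def relative_frequency(n_correct, n_covered):
        """Rule accuracy nc / n."""
        return n_correct / n_covered

    def m_estimate(n_correct, n_covered, prior, m=2.0):
        """m-estimate of accuracy: (nc + m * p) / (n + m)."""
        return (n_correct + m * prior) / (n_covered + m)

    # e.g. a rule covering 3 instances, all correctly classified, for a class with
    # prior 9/14: relative_frequency(3, 3) = 1.0, m_estimate(3, 3, 9/14) ≈ 0.857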
55. How to Arrange the Rules
- The rules are ordered according to the order in which they have been learned. This order is used for instance classification.
- The rules are ordered according to their accuracy. This order is used for instance classification.
- The rules are not ordered, but there exists a strategy for applying them (e.g., an instance covered by conflicting rules gets the classification of the rule that correctly classifies more training instances; if an instance is not covered by any rule, it gets the classification of the majority class in the training data).
56. Approaches to Avoiding Overfitting
- Pre-pruning: stop learning the decision rules before they reach the point where they perfectly classify the training data.
- Post-pruning: allow the decision rules to overfit the training data, and then post-prune the rules.
57. Post-Pruning
- 1. Split instances into a Growing Set and a Pruning Set.
- 2. Learn a set SR of rules using the Growing Set.
- 3. Find the best simplification BSR of SR.
- 4. while Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set) do
  - 4.1 SR ← BSR
  - 4.2 Find the best simplification BSR of SR.
- 5. return BSR
58. Incremental Reduced Error Pruning
(Figure: comparison with post-pruning, showing the data partitioned into subsets D1, D2 (split into D21 and D22), and D3.)
59. Incremental Reduced Error Pruning
- 1. Split the Training Set into a Growing Set and a Validation Set.
- 2. Learn a rule R using the Growing Set.
- 3. Prune the rule R using the Validation Set.
- 4. if performance(R, Training Set) > Threshold:
  - 4.1 Add R to the Set of Learned Rules.
  - 4.2 Remove from the Training Set the instances covered by R.
  - 4.3 Go to 1.
- 5. else return the Set of Learned Rules.
60. Summary Points
- Decision rules are easier for human comprehension than decision trees.
- Decision rules have simpler decision boundaries than decision trees.
- Decision rules are learned by sequential covering of the training instances.
61. Model Evaluation Techniques
- Evaluation on the training set: too optimistic.
(Figure: the classifier is both trained and evaluated on the training set.)
62. Model Evaluation Techniques
- Hold-out method: depends on the make-up of the test set.
(Figure: the data is split into a training set used to build the classifier and a test set used to evaluate it.)
- To improve the precision of the hold-out method, it is repeated many times.
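A minimal sketch of the repeated hold-out method, reusing the id3 and accuracy sketches from earlier slides; the 2/3 training fraction and the ten repetitions are assumptions, not values from the slides:

    import random

    def repeated_holdout(examples, attributes, target="play",
                         train_fraction=2 / 3, repeats=10, seed=0):
        """Average test-set accuracy over several random train/test splits."""
        rng = random.Random(seed)
        scores = []
        for _ in range(repeats):
            shuffled = examples[:]
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * train_fraction)
            train, test = shuffled[:cut], shuffled[cut:]
            tree = id3(train, attributes, target)
            # note: the simple predict/accuracy sketch above does not handle
            # attribute values that never occur in the training split
            scores.append(accuracy(tree, test, target))
        return sum(scores) / len(scores)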
63. Model Evaluation Techniques
(Figure: a classifier evaluated against the data.)
64. Intro to Weka
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {TRUE, FALSE}
@data
sunny,hot,high,FALSE,FALSE
sunny,hot,high,TRUE,FALSE
overcast,hot,high,FALSE,TRUE
rainy,mild,high,FALSE,TRUE
rainy,cool,normal,FALSE,TRUE
rainy,cool,normal,TRUE,FALSE
overcast,cool,normal,TRUE,TRUE
...