1. Decision-Tree Induction and Decision-Rule Induction
Evgueni Smirnov
2. Overview
- Instances, Classes, Languages, Hypothesis Spaces
- Decision Trees
- Decision Rules
- Evaluation Techniques
- Intro to Weka
3. Instances and Classes
A class is a set of objects in a world that are unified by a reason. A reason may be a similar appearance, structure, or function.
(Figure: friendly robots)
Example: the set {children, photos, cat, diplomas} can be viewed as the class "Most important things to take out of your apartment when it catches fire."
4. Instances, Classes, Languages
Instance description: head = square, body = round, smiling = yes, holding = flag, color = yellow
Class: friendly robots
5. Instances, Classes, Hypothesis Spaces
Hypothesis: smiling = yes → friendly robots
Instance description: head = square, body = round, smiling = yes, holding = flag, color = yellow
Class: friendly robots
6. The Classification Task
7. Decision Trees for Classification
- Decision trees
- Appropriate problems for decision trees
- Entropy and Information Gain
- The ID3 algorithm
- Avoiding Overfitting via Pruning
- Handling Continuous-Valued Attributes
- Handling Missing Attribute Values
8. Decision Trees
- Definition: A decision tree is a tree such that:
  - each internal node tests an attribute,
  - each branch corresponds to an attribute value,
  - each leaf node assigns a classification.
9. Data Set for Playing Tennis
10. Decision Tree for Playing Tennis
Outlook
  Sunny    → Humidity
               High   → no
               Normal → yes
  Overcast → yes
  Rainy    → Windy
               False → yes
               True  → no
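The tree above can also be written directly as nested tests. A minimal Python sketch (the dict-based instance representation and the lowercase attribute names are assumptions for illustration, not part of the slides):

    def classify_play_tennis(instance):
        """Classify one instance with the Play-Tennis tree above.
        `instance` is assumed to be a dict, e.g.
        {"outlook": "sunny", "humidity": "high", "windy": False}."""
        if instance["outlook"] == "sunny":
            return "yes" if instance["humidity"] == "normal" else "no"
        if instance["outlook"] == "overcast":
            return "yes"
        # remaining case: rainy
        return "no" if instance["windy"] else "yes"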
11. When to Consider Decision Trees
- Instances are described by attributes with discrete values (e.g., outlook = sunny).
- The classification is over discrete values (e.g., yes/no).
- Disjunctive descriptions are acceptable: each path in the tree is a conjunction of attribute tests, and the tree as a whole is a disjunction of these paths. Any Boolean function can be represented!
- The training data may contain errors: decision trees are robust to classification errors in the training data.
- The training data may contain missing values: decision trees can be used even if instances have missing attributes.
12Rules in Decision Trees
If Outlook Sunny Humidity High then Play
no If Outlook Sunny Humidity Normal then
Play yes If Outlook Overcast then Play
yes If Outlook Rainy Windy False then Play
yes If Outlook Rainy Windy True then Play
no
13. Decision Tree Induction
- Basic Algorithm:
- 1. A ← the "best" decision attribute for a node N.
- 2. Assign A as the decision attribute for the node N.
- 3. For each value of A, create a new descendant of the node N.
- 4. Sort the training examples to the leaf nodes.
- 5. IF the training examples are perfectly classified, THEN STOP; ELSE iterate over the new leaf nodes.
14. Decision Tree Induction
Splitting on Outlook (branches Sunny, Overcast, Rain) partitions the training examples:

Outlook = Sunny
  Outlook  Temp  Hum     Wind    Play
  Sunny    Hot   High    Weak    no
  Sunny    Hot   High    Strong  no
  Sunny    Mild  High    Weak    no
  Sunny    Cool  Normal  Weak    yes
  Sunny    Mild  Normal  Strong  yes

Outlook = Overcast
  Outlook   Temp  Hum     Wind    Play
  Overcast  Hot   High    Weak    yes
  Overcast  Cool  Normal  Strong  yes

Outlook = Rain
  Outlook  Temp  Hum     Wind    Play
  Rain     Mild  High    Weak    yes
  Rain     Cool  Normal  Weak    yes
  Rain     Cool  Normal  Strong  no
  Rain     Mild  Normal  Weak    yes
  Rain     Mild  High    Strong  no
15. Entropy
- Let S be a sample of training examples,
- p+ be the proportion of positive examples in S, and
- p- be the proportion of negative examples in S.
- Then entropy measures the impurity of S:
  E(S) = - p+ log2 p+ - p- log2 p-
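A minimal sketch of this formula for the two-class case (with the usual convention that 0 · log2 0 is treated as 0):

    import math

    def entropy(pos, neg):
        """E(S) for a sample with `pos` positive and `neg` negative examples."""
        total = pos + neg
        e = 0.0
        for count in (pos, neg):
            if count > 0:           # 0 * log2(0) is taken to be 0
                p = count / total
                e -= p * math.log2(p)
        return e

For example, for a 14-instance sample with 9 positive and 5 negative examples (the usual split of the Play-Tennis data), entropy(9, 5) ≈ 0.940.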
16. Entropy Example from the Dataset
17. Information Gain
- Information Gain is the expected reduction in entropy caused by partitioning the instances according to a given attribute:
  Gain(S, A) = E(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) E(Sv)
  where Sv = {s ∈ S | A(s) = v}, e.g. Sv1 = {s ∈ S | A(s) = v1}, Sv2 = {s ∈ S | A(s) = v2}.
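A sketch of Gain(S, A) in Python; representing S as a list of dicts with a "play" label is an assumption about the data format, not something the slides prescribe:

    import math
    from collections import Counter

    def label_entropy(examples, target="play"):
        """Entropy of the class labels of `examples` (a list of dicts)."""
        counts = Counter(ex[target] for ex in examples)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attribute, target="play"):
        """Gain(S, A) = E(S) - sum over v of |Sv|/|S| * E(Sv)."""
        gain = label_entropy(examples, target)
        for v in set(ex[attribute] for ex in examples):
            subset = [ex for ex in examples if ex[attribute] == v]
            gain -= len(subset) / len(examples) * label_entropy(subset, target)
        return gain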
18. Example
Splitting on Outlook again partitions the training examples. Which attribute should be tested at the Sunny branch?

Outlook = Sunny
  Outlook  Temp  Hum     Windy  Play
  Sunny    Hot   High    False  No
  Sunny    Hot   High    True   No
  Sunny    Mild  High    False  No
  Sunny    Cool  Normal  False  Yes
  Sunny    Mild  Normal  True   Yes

Outlook = Overcast
  Outlook   Temp  Hum     Windy   Play
  Overcast  Hot   High    Weak    Yes
  Overcast  Cool  Normal  Strong  Yes

Outlook = Rain
  Outlook  Temp  Hum     Windy  Play
  Rain     Mild  High    False  Yes
  Rain     Cool  Normal  False  Yes
  Rain     Cool  Normal  True   No
  Rain     Mild  Normal  False  Yes
  Rain     Mild  High    True   No

Gain(Ssunny, Humidity)    = 0.970 - (3/5)·0.0 - (2/5)·0.0             = 0.970
Gain(Ssunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind)        = 0.970 - (2/5)·1.0 - (3/5)·0.918           = 0.019
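These numbers can be checked with the information_gain sketch from slide 17. The attribute spellings below are assumptions, and the small differences from the slide come only from rounding (the slide rounds the entropy 0.971 of Ssunny to 0.970):

    s_sunny = [
        {"temp": "hot",  "hum": "high",   "windy": False, "play": "no"},
        {"temp": "hot",  "hum": "high",   "windy": True,  "play": "no"},
        {"temp": "mild", "hum": "high",   "windy": False, "play": "no"},
        {"temp": "cool", "hum": "normal", "windy": False, "play": "yes"},
        {"temp": "mild", "hum": "normal", "windy": True,  "play": "yes"},
    ]
    for attr in ("hum", "temp", "windy"):
        print(attr, round(information_gain(s_sunny, attr), 3))
    # prints: hum 0.971, temp 0.571, windy 0.02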
19. ID3 Algorithm
- Informally:
- Determine the attribute with the highest information gain on the training set.
- Use this attribute as the root; create a branch for each of the values the attribute can have.
- For each branch, repeat the process with the subset of the training set that is classified by that branch (a recursive sketch follows below).
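A compact recursive sketch of this procedure, reusing information_gain from slide 17 and the dict-based examples assumed earlier; the nested-dict tree format is an assumption for illustration:

    from collections import Counter

    def id3(examples, attributes, target="play"):
        """Recursive ID3 sketch: returns a nested dict {attribute: {value: subtree}}
        or a class label at the leaves."""
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:        # all examples agree: make a leaf
            return labels[0]
        if not attributes:               # no attributes left: majority class
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        tree = {best: {}}
        for v in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == v]
            rest = [a for a in attributes if a != best]
            tree[best][v] = id3(subset, rest, target)
        return tree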
20. Hypothesis Space Search in ID3
- The hypothesis space is the set of all decision trees defined over the given set of attributes.
- ID3's hypothesis space is a complete space, i.e., the target description is there!
- ID3 performs a simple-to-complex, hill-climbing search through this space.
21. Hypothesis Space Search in ID3
- The evaluation function is the information gain.
- ID3 maintains only a single current decision tree.
- ID3 performs no backtracking in its search.
- ID3 uses all training instances at each step of the search.
22. Posterior Class Probabilities
Outlook
  Sunny    → leaf: 2 pos and 3 neg; P(pos) = 0.4, P(neg) = 0.6
  Overcast → leaf: 2 pos and 0 neg; P(pos) = 1.0, P(neg) = 0.0
  Rainy    → Windy
               False → leaf: 3 pos and 0 neg; P(pos) = 1.0, P(neg) = 0.0
               True  → leaf: 0 pos and 2 neg; P(pos) = 0.0, P(neg) = 1.0
23. Overfitting
- Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training instances, but h' has smaller error than h over the entire distribution of instances.
24. Reasons for Overfitting
(Figure: the unpruned Play-Tennis tree from slide 10 — Outlook = Sunny → Humidity (high → no, normal → yes), Outlook = Overcast → yes, Outlook = Rainy → Windy (false → yes, true → no).)
- Noisy training instances. Consider the noisy training example:
  Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = True, PlayTennis = No
- This instance conflicts with the training instances sorted to the same leaf:
  Outlook = Sunny, Temp = Cool, Humidity = Normal, Wind = False, PlayTennis = Yes
  Outlook = Sunny, Temp = Mild, Humidity = Normal, Wind = True, PlayTennis = Yes
25. Reasons for Overfitting
(Figure: the tree grown to fit the noisy example — under Outlook = Sunny, Humidity = Normal the tree now tests Windy and then Temp, adding branches that separate the noisy instance from the two correct ones; the Overcast and Rainy branches are unchanged.)
The instances sorted to the Sunny/Normal branch:
  Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = True, PlayTennis = No (the noisy one)
  Outlook = Sunny, Temp = Cool, Humidity = Normal, Wind = False, PlayTennis = Yes
  Outlook = Sunny, Temp = Mild, Humidity = Normal, Wind = True, PlayTennis = Yes
26. Reasons for Overfitting
- A small number of instances is associated with leaf nodes. In this case it is possible for coincidental regularities to occur that are unrelated to the actual target concept.
27. Approaches to Avoiding Overfitting
- Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.
- Post-pruning: allow the tree to overfit the data, and then post-prune the tree.
28. Pre-pruning
- It is difficult to decide when to stop growing the tree.
- A possible scenario is to stop when a leaf node would get fewer than m training instances. Here is an example for m = 5.
(Figure: the Outlook split with m = 5 — the Sunny branch (2 pos, 3 neg) becomes a "no" leaf, the Rainy branch (3 pos, 2 neg) becomes a "yes" leaf, and the Overcast branch, with only two instances, is left marked "?".)
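A sketch of this stopping rule as a small change to the id3 sketch from slide 19. The value m = 5 comes from this slide; using the majority class at an under-populated node is one common choice (the slide itself leaves such a node marked "?"):

    from collections import Counter

    def id3_prepruned(examples, attributes, target="play", m=5):
        """Like id3, but a node with fewer than m examples becomes a majority leaf."""
        labels = [ex[target] for ex in examples]
        if len(examples) < m or len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        return {best: {v: id3_prepruned([ex for ex in examples if ex[best] == v],
                                        [a for a in attributes if a != best],
                                        target, m)
                       for v in set(ex[best] for ex in examples)}}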
29. Validation Set
- A validation set is a set of instances used to evaluate the utility of nodes in decision trees. The validation set has to be chosen so that it is unlikely to suffer from the same errors or fluctuations as the training set.
- Usually, before pruning, the training data is split randomly into a growing set and a validation set.
30. Reduced-Error Pruning
- Split data into a growing set and a validation set.
- Pruning a decision node d consists of:
  - removing the subtree rooted at d,
  - making d a leaf node,
  - assigning d the most common classification of the training instances associated with d.
(Figure: the unpruned tree — Outlook = Sunny → Humidity (high → no: 3 instances, normal → yes: 2 instances), Outlook = Overcast → yes, Outlook = Rainy → Windy (false → yes, true → no).)
Accuracy of the tree on the validation set is 90%.
31. Reduced-Error Pruning
- Split data into a growing set and a validation set.
- Pruning a decision node d consists of:
  - removing the subtree rooted at d,
  - making d a leaf node,
  - assigning d the most common classification of the training instances associated with d.
(Figure: the tree after pruning the Humidity node — Outlook = Sunny → no, Outlook = Overcast → yes, Outlook = Rainy → Windy (false → yes, true → no).)
Accuracy of the tree on the validation set is 92.4%.
32. Reduced-Error Pruning
- Split data into a growing set and a validation set.
- Pruning a decision node d consists of:
  - removing the subtree rooted at d,
  - making d a leaf node,
  - assigning d the most common classification of the training instances associated with d.
- Do until further pruning is harmful:
  - Evaluate the impact on the validation set of pruning each possible node (plus those below it).
  - Greedily remove the one that most improves validation-set accuracy (a sketch of this loop follows below).
(Figure: the pruned tree — Outlook = Sunny → no, Outlook = Overcast → yes, Outlook = Rainy → Windy (false → yes, true → no).)
Accuracy of the tree on the validation set is 92.4%.
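A sketch of this greedy loop for the nested-dict trees built by the id3 sketch (slide 19). Only the overall procedure comes from the slides; the tree traversal and path bookkeeping below are illustration-only assumptions:

    import copy
    from collections import Counter

    def predict(tree, instance):
        """Walk a nested-dict tree (as built by the id3 sketch) to a class label.
        Attribute values unseen during training are not handled."""
        while isinstance(tree, dict):
            attr = next(iter(tree))
            tree = tree[attr][instance[attr]]
        return tree

    def accuracy(tree, examples, target="play"):
        return sum(predict(tree, ex) == ex[target] for ex in examples) / len(examples)

    def internal_nodes(tree, path=()):
        """Yield the (attribute, value, ...) path to every internal node."""
        if isinstance(tree, dict):
            yield path
            attr = next(iter(tree))
            for value, child in tree[attr].items():
                yield from internal_nodes(child, path + (attr, value))

    def pruned_copy(tree, path, leaf_label):
        """Copy of `tree` with the subtree at `path` replaced by `leaf_label`."""
        if not path:
            return leaf_label
        new = copy.deepcopy(tree)
        node = new
        for attr, value in zip(path[0::2], path[1::2]):
            parent, key = node[attr], value
            node = node[attr][value]
        parent[key] = leaf_label
        return new

    def reduced_error_prune(tree, growing_set, validation_set, target="play"):
        """Repeatedly prune the node that most improves validation-set accuracy."""
        while True:
            best_acc, best_tree = accuracy(tree, validation_set, target), None
            for path in internal_nodes(tree):
                # most common class among growing-set instances reaching this node
                reached = [ex for ex in growing_set
                           if all(ex[a] == v for a, v in zip(path[0::2], path[1::2]))]
                if not reached:
                    continue
                label = Counter(ex[target] for ex in reached).most_common(1)[0][0]
                candidate = pruned_copy(tree, path, label)
                acc = accuracy(candidate, validation_set, target)
                if acc > best_acc:
                    best_acc, best_tree = acc, candidate
            if best_tree is None:        # no pruning step improves accuracy: stop
                return tree
            tree = best_tree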
33. Reduced-Error Pruning Example
34. Rule Post-Pruning
- Convert the tree to an equivalent set of rules.
- Prune each rule independently of the others.
- Sort the final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
...
(Figure: the Play-Tennis tree from which the rules are read off.)
35. Continuous-Valued Attributes
- Create a set of discrete attributes that test the continuous attribute against thresholds.
- Apply information gain in order to choose the best attribute.
  Temperature:  40   48   60   72   80   90
  PlayTennis:   No   No   Yes  Yes  Yes  No
  Candidate tests: Temperature > 54, Temperature > 85
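A small sketch of how the candidate thresholds 54 and 85 arise: midpoints between adjacent sorted values whose class labels differ (the function name is an illustration, not a standard API):

    def candidate_thresholds(values, labels):
        """Midpoints between adjacent sorted values with different class labels."""
        pairs = sorted(zip(values, labels))
        return [(a + b) / 2
                for (a, la), (b, lb) in zip(pairs, pairs[1:])
                if la != lb]

    print(candidate_thresholds([40, 48, 60, 72, 80, 90],
                               ["No", "No", "Yes", "Yes", "Yes", "No"]))
    # [54.0, 85.0]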
36. Missing Attribute Values
- Strategies:
- Assign the most common value of A among the other instances belonging to the same class (sketched below).
- If node n tests the attribute A, assign the most common value of A among the other instances sorted to node n.
- If node n tests the attribute A, assign a probability to each of the possible values of A. These probabilities are estimated from the observed frequencies of the values of A and are then used in the information gain measure.
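A hedged sketch of the first strategy (fill a missing value with the most common value among instances of the same class); using None to mark a missing value is an assumption of the sketch:

    from collections import Counter

    def fill_missing(examples, attribute, target="play"):
        """Replace missing (None) values of `attribute` with the most common value
        among instances of the same class (assumes at least one value is present)."""
        filled = []
        for ex in examples:
            ex = dict(ex)                 # copy, so the input stays untouched
            if ex[attribute] is None:
                same_class = [e[attribute] for e in examples
                              if e[target] == ex[target] and e[attribute] is not None]
                ex[attribute] = Counter(same_class).most_common(1)[0][0]
            filled.append(ex)
        return filled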
37. Summary Points
- Decision tree learning provides a practical method for concept learning.
- ID3-like algorithms search a complete hypothesis space.
- The inductive bias of decision trees is a preference (search) bias.
- Overfitting the training data is an important issue in decision tree learning.
- A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc.
38. Learning Decision Rules
- Decision Rules
- Basic Sequential Covering Algorithm
- Learn-One-Rule Procedure
- Pruning
39Definition of Decision Rules
Definition Decision rules are rules with the
following form if ltconditionsgt then class
C.
Example If you run the Prism algorithm from Weka
on the weather data you will get the following
set of decision rules if outlook overcast
then PlayTennis yes if humidity normal and
windy FALSE then PlayTennis yes if
temperature mild and humidity normal then
PlayTennis yes if outlook rainy and windy
FALSE then PlayTennis yes if outlook sunny
and humidity high then PlayTennis no if
outlook rainy and windy TRUE then PlayTennis
no
40. Why Decision Rules?
- Decision rules are more compact.
- Decision rules are more understandable.
Example: Let X ∈ {0,1}, Y ∈ {0,1}, Z ∈ {0,1}, W ∈ {0,1}. The rules are:
if X = 1 and Y = 1 then 1
if Z = 1 and W = 1 then 1
otherwise 0
41. Why Decision Rules?
42. How to Learn Decision Rules?
- We can convert trees to rules
- We can use specific rule-learning methods
43Sequential Covering Algorithms
function LearnRuleSet(Target, Attrs, Examples,
Threshold) LearnedRules ? Rule
LearnOneRule(Target, Attrs, Examples) while
performance(Rule,Examples) gt Threshold, do
LearnedRules LearnedRules ? Rule
Examples Examples \ examples covered by
Rule Rule LearnOneRule(Target, Attrs,
Examples) sort LearnedRules according to
performance return LearnedRules
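The same loop as a Python sketch. Here learn_one_rule is sketched under slide 50, rules are assumed to be (conditions, class) pairs with conditions a dict, and performance is a user-supplied scoring function; none of these representation choices come from the slides:

    def covers(rule, example):
        """True if the example satisfies all attribute tests of the rule."""
        conditions, _ = rule
        return all(example[a] == v for a, v in conditions.items())

    def learn_rule_set(examples, attributes, target, threshold,
                       learn_one_rule, performance):
        """Sequential covering: learn a rule, drop the examples it covers, repeat."""
        all_examples = list(examples)
        learned_rules = []
        rule = learn_one_rule(examples, attributes, target)
        while examples and performance(rule, examples) > threshold:
            learned_rules.append(rule)
            examples = [ex for ex in examples if not covers(rule, ex)]
            if not examples:              # guard added for the sketch
                break
            rule = learn_one_rule(examples, attributes, target)
        learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
        return learned_rules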
44. Illustration
45. Illustration
46. Illustration
47. Illustration
IF A ∧ B THEN pos
48. Illustration
IF A ∧ B THEN pos
49. Illustration
IF A ∧ B THEN pos
IF true THEN pos
IF C THEN pos
IF C ∧ D THEN pos
50. Learning One Rule
- To learn one rule we use one of the strategies below:
- Top-down:
  - Start with a maximally general rule.
  - Add literals one by one (a greedy sketch of this strategy follows after this slide).
- Bottom-up:
  - Start with a maximally specific rule.
  - Remove literals one by one.
- Combination of top-down and bottom-up:
  - the candidate-elimination algorithm.
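A greedy top-down sketch of learning one rule: start from the maximally general rule (no conditions) and keep adding the single attribute test that most improves accuracy on the still-covered examples. The scoring by relative frequency anticipates slide 54; the positive_class name and the (conditions, class) rule format are assumptions used throughout these sketches:

    def learn_one_rule(examples, attributes, target, positive_class="yes"):
        """Greedy top-down search for one rule (conditions, positive_class)."""
        conditions = {}                   # maximally general: covers everything
        covered = list(examples)
        while covered and any(ex[target] != positive_class for ex in covered):
            best = None                   # (accuracy, attribute, value, subset)
            for a in attributes:
                if a in conditions:
                    continue
                for v in set(ex[a] for ex in covered):
                    subset = [ex for ex in covered if ex[a] == v]
                    acc = sum(ex[target] == positive_class
                              for ex in subset) / len(subset)
                    if best is None or acc > best[0]:
                        best = (acc, a, v, subset)
            if best is None:              # no attribute left to add
                break
            _, a, v, covered = best
            conditions[a] = v
        return (conditions, positive_class)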
51Bottom-up vs. Top-down
Bottom-up typically more specific rules
-
-
-
-
-
-
-
-
-
-
-
-
-
Top-down typically more general rules
52. Learning One Rule
- Bottom-up: example-driven (the AQ family).
- Top-down: generate-then-test (CN2).
53. Example of Learning One Rule
54. Heuristics for Learning One Rule
- When is a rule good?
  - High accuracy.
  - Less important: high coverage.
- Possible evaluation functions:
  - Relative frequency: nc / n, where nc is the number of correctly classified instances and n is the number of instances covered by the rule.
  - m-estimate of accuracy: (nc + m·p) / (n + m), where nc is the number of correctly classified instances, n is the number of instances covered by the rule, p is the prior probability of the class predicted by the rule, and m is the weight of p (see the sketch below).
  - Entropy.
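A small sketch of the first two evaluation functions (the default weight m = 2 is an arbitrary illustration value, not from the slides):

    def relative_frequency(n_correct, n_covered):
        """Rule accuracy nc / n."""
        return n_correct / n_covered

    def m_estimate(n_correct, n_covered, prior, m=2.0):
        """m-estimate of accuracy: (nc + m * p) / (n + m)."""
        return (n_correct + m * prior) / (n_covered + m)

    # e.g. a rule covering 3 instances, all correctly classified, for a class with
    # prior 9/14: relative_frequency(3, 3) = 1.0, m_estimate(3, 3, 9/14) ≈ 0.857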
55. How to Arrange the Rules
- The rules are ordered according to the order in which they have been learned. This order is used for instance classification.
- The rules are ordered according to their accuracy. This order is used for instance classification.
- The rules are not ordered, but there exists a strategy for applying them (e.g., an instance covered by conflicting rules gets the classification of the rule that correctly classifies more training instances; if an instance is not covered by any rule, it gets the classification of the majority class in the training data).
56. Approaches to Avoiding Overfitting
- Pre-pruning: stop learning the decision rules before they reach the point where they perfectly classify the training data.
- Post-pruning: allow the decision rules to overfit the training data, and then post-prune the rules.
57. Post-Pruning
- 1. Split instances into a Growing Set and a Pruning Set.
- 2. Learn a set SR of rules using the Growing Set.
- 3. Find the best simplification BSR of SR.
- 4. while Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set) do
  - 4.1 SR ← BSR
  - 4.2 Find the best simplification BSR of SR.
- 5. return BSR
58. Incremental Reduced Error Pruning
(Figure: comparison with post-pruning, showing the data partitioned into subsets D1, D2 (split into D21 and D22), and D3.)
59. Incremental Reduced Error Pruning
- 1. Split the Training Set into a Growing Set and a Validation Set.
- 2. Learn a rule R using the Growing Set.
- 3. Prune the rule R using the Validation Set.
- 4. if performance(R, Training Set) > Threshold:
  - 4.1 Add R to the Set of Learned Rules.
  - 4.2 Remove from the Training Set the instances covered by R.
  - 4.3 Go to 1.
- 5. else return the Set of Learned Rules.
60. Summary Points
- Decision rules are easier for human comprehension than decision trees.
- Decision rules have simpler decision boundaries than decision trees.
- Decision rules are learned by sequential covering of the training instances.
61. Model Evaluation Techniques
- Evaluation on the training set: too optimistic.
(Figure: the classifier is both trained and evaluated on the training set.)
62. Model Evaluation Techniques
- Hold-out method: depends on the make-up of the test set.
(Figure: the data is split into a training set used to build the classifier and a test set used to evaluate it.)
- To improve the precision of the hold-out method, it is repeated many times.
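A minimal sketch of the repeated hold-out method, reusing the id3 and accuracy sketches from earlier slides; the 2/3 training fraction and the ten repetitions are assumptions, not values from the slides:

    import random

    def repeated_holdout(examples, attributes, target="play",
                         train_fraction=2 / 3, repeats=10, seed=0):
        """Average test-set accuracy over several random train/test splits."""
        rng = random.Random(seed)
        scores = []
        for _ in range(repeats):
            shuffled = examples[:]
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * train_fraction)
            train, test = shuffled[:cut], shuffled[cut:]
            tree = id3(train, attributes, target)
            # note: the simple predict/accuracy sketch above does not handle
            # attribute values that never occur in the training split
            scores.append(accuracy(tree, test, target))
        return sum(scores) / len(scores)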
63. Model Evaluation Techniques
(Figure: a classifier evaluated against the data.)
64. Intro to Weka
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {TRUE, FALSE}
@data
sunny,hot,high,FALSE,FALSE
sunny,hot,high,TRUE,FALSE
overcast,hot,high,FALSE,TRUE
rainy,mild,high,FALSE,TRUE
rainy,cool,normal,FALSE,TRUE
rainy,cool,normal,TRUE,FALSE
overcast,cool,normal,TRUE,TRUE
...