1. CS 782 Machine Learning
4. Inductive Learning from Examples: Decision Tree Learning
Prof. Gheorghe Tecuci
Learning Agents Laboratory, Computer Science Department, George Mason University
2. Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
3. The decision tree learning problem

Given:
- a language of instances: feature-value vectors;
- a language of generalizations: decision trees;
- a set of positive examples (E1, ..., En) of a concept;
- a set of negative examples (C1, ..., Cm) of the same concept;
- a learning bias: preference for shorter decision trees.

Determine:
- a concept description in the form of a decision tree which is a generalization of the positive examples and does not cover any of the negative examples.
4. Illustration

Examples

Feature-vector representation of examples. That is, there is a fixed set of attributes, each attribute taking values from a specified set.

height   hair    eyes    class
short    blond   blue    +
tall     blond   brown   -
tall     red     blue    +
short    dark    blue    -
tall     dark    blue    -
tall     blond   blue    +
tall     dark    brown   -
short    blond   brown   -
5. What is the logical expression represented by the decision tree?

[Figure: a decision tree and the concept it represents]

What is the concept represented by this decision tree?
6. Feature-value representation

Is the feature-value representation adequate?
If the training set (i.e. the set of positive and
negative examples from which the tree is learned)
contains a positive example and a negative
example that have identical values for each
attribute, it is impossible to differentiate
between the instances with reference only to the
given attributes. In such a case the attributes
are inadequate for the training set and for the
induction task.
7. Feature-value representation (cont.)
When could a decision tree be built?
If the attributes are adequate, it is always
possible to construct a decision tree that
correctly classifies each instance in the
training set.
So what is the difficulty in learning a decision
tree?
The problem is that there are many such correct
decision trees and the task of induction is to
construct a decision tree that correctly
classifies not only instances from the training
set but other (unseen) instances as well.
8. Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
9. The basic ID3 learning algorithm

Let C be the set of training examples.
- If all the examples in C are positive, then create a node with label +.
- If all the examples in C are negative, then create a node with label -.
- If there is no attribute left, then create a node with the same label as the majority of the examples in C.
- Otherwise:
  - select the best attribute A and create a decision node, where v1, v2, ..., vk are the values of A;
  - partition the examples into the subsets C1, C2, ..., Ck according to the values of A;
  - apply the algorithm recursively to each of the sets Ci which is not empty;
  - for each Ci which is empty, create a node with the same label as the majority of the examples in C (the parent node).
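The procedure above can be written out as a short program. The following Python sketch is ours, not the original ID3 implementation; it assumes examples given as (feature-dictionary, label) pairs and a best_attribute function implementing the information-gain heuristic described on the next slides.

    from collections import Counter

    def majority_label(examples):
        """Most common class label among the examples."""
        return Counter(label for _, label in examples).most_common(1)[0][0]

    def id3(examples, attributes, best_attribute, parent_examples=None):
        """Minimal ID3 sketch.
        examples       : list of (feature_dict, label) pairs, label is '+' or '-'
        attributes     : dict mapping attribute name -> list of possible values
        best_attribute : function(examples, attributes) -> attribute name
        Returns a class label ('+' / '-') or a decision node (attribute, {value: subtree}).
        """
        if not examples:                            # empty subset: majority of the parent
            return majority_label(parent_examples)
        labels = {label for _, label in examples}
        if labels == {'+'} or labels == {'-'}:      # all examples in the same class
            return labels.pop()
        if not attributes:                          # no attribute left: majority label
            return majority_label(examples)
        a = best_attribute(examples, attributes)
        remaining = {name: vals for name, vals in attributes.items() if name != a}
        branches = {}
        for v in attributes[a]:                     # one branch per value of A
            subset = [(f, lab) for f, lab in examples if f[a] == v]
            branches[v] = id3(subset, remaining, best_attribute, examples)
        return (a, branches)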
10. Feature selection: information theory

Let us consider a set S containing objects from n classes S1, ..., Sn, so that the probability of an object belonging to class Si is pi. According to information theory, the amount of information needed to identify the class of one particular member of S is

    Ii = - log2 pi

Intuitively, Ii represents the number of questions required to identify the class Si of a given element of S. The average amount of information needed to identify the class of an element of S is

    - Σ pi log2 pi
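As a quick check of this formula, here is a small Python sketch (the function name is ours) that computes the average information of a class distribution:

    import math

    def entropy(probabilities):
        """Average information: - sum(pi * log2 pi) over the class distribution."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    # Eight equally likely classes (the eight letters on the next slide):
    print(entropy([1/8] * 8))    # 3.0 bits, i.e. three yes/no questions
    # A set with 3 positive and 5 negative examples (used later):
    print(entropy([3/8, 5/8]))   # about 0.954 bits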
11. Discussion

Consider the following letters: A B C D E F G H
Think of one of them (call it the secret letter). How many questions need to be asked in order to find the secret letter?
12. Feature selection: the best attribute

Let us suppose that the decision tree has to be built from a training set C consisting of p positive examples and n negative examples. The average amount of information needed to classify an instance from C is

    I(p, n) = - p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))

If attribute A with values v1, v2, ..., vk is used for the root of the decision tree, it will partition C into C1, C2, ..., Ck, where each Ci contains pi positive examples and ni negative examples. The expected information required to classify an instance in Ci is I(pi, ni). The expected amount of information required to classify an instance after the value of attribute A is known is therefore

    Ires(A) = Σi (pi + ni)/(p + n) I(pi, ni)

The information gained by branching on A is

    gain(A) = I(p, n) - Ires(A)
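These two formulas translate directly into code. A minimal Python sketch (the function names are ours):

    import math

    def i(p, n):
        """I(p, n): average information for a set with p positive and n negative examples."""
        total = p + n
        return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

    def residual_info(splits):
        """Ires(A), given the list of (pi, ni) counts for each value of A."""
        total = sum(p + n for p, n in splits)
        return sum((p + n) / total * i(p, n) for p, n in splits if p + n > 0)

    def gain(p, n, splits):
        """gain(A) = I(p, n) - Ires(A)."""
        return i(p, n) - residual_info(splits)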
13. Feature selection: the heuristic

The information gained by branching on A is

    gain(A) = I(p, n) - Ires(A)

What would be a good heuristic?

Choose the attribute which leads to the greatest information gain.

Why is this a heuristic and not a guaranteed method?

Hint: What kind of search method for the best attribute does ID3 use?
14. Feature selection: the heuristic

Why is this a heuristic and not a guaranteed method?

Hint: Think of a situation where a is the best individual attribute, but the combination of b and c would actually be better than the combination of a and b, or of a and c. That is, knowing b and c you can classify, but knowing only a and b (or only a and c) you cannot.

This shows that the attributes may not be independent. How could we deal with this?

Hint: Consider also combinations of attributes: not only a, b, c, but also ab, bc, ca.

What is a problem with this approach?
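To make the first hint concrete, here is a hypothetical data set (ours, not from the slides) in which the class is b XOR c, while a merely agrees with the class on 6 of the 8 instances. Each of b and c alone has zero information gain, so the greedy heuristic picks a first, even though b and c together classify perfectly:

    import math
    from collections import Counter

    def info_gain(rows, attr):
        """Information gain of attribute attr for rows given as (feature_dict, label) pairs."""
        def info(labels):
            counts = Counter(labels)
            total = sum(counts.values())
            return -sum(c / total * math.log2(c / total) for c in counts.values())
        res = 0.0
        for value in {f[attr] for f, _ in rows}:
            subset = [lab for f, lab in rows if f[attr] == value]
            res += len(subset) / len(rows) * info(subset)
        return info([lab for _, lab in rows]) - res

    # class = b XOR c; a agrees with the class on 6 of 8 rows.
    rows = [({'a': a, 'b': b, 'c': c}, lab) for a, b, c, lab in [
        (0, 0, 0, '-'), (1, 0, 0, '-'), (1, 0, 1, '+'), (1, 0, 1, '+'),
        (1, 1, 0, '+'), (0, 1, 0, '+'), (0, 1, 1, '-'), (0, 1, 1, '-')]]

    for attr in 'abc':
        print(attr, round(info_gain(rows, attr), 3))   # a: 0.189, b: 0.0, c: 0.0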
15. Feature selection

The built tree depends on the heuristic used to select the attribute to test at each node. ID3 selects the 'most informative' attribute first. This criterion is based on concepts from information theory.

Let us consider a set S containing objects from n classes S1, ..., Sn, so that the probability of an object belonging to class Si is pi. Then, according to information theory, the amount of information needed to identify the class of one particular member of S is

    Ii = - log2 pi

Intuitively, Ii represents the number of questions required to identify the class Si of a given element of S. Therefore, the average amount of information needed to identify the class of an element of S is

    - Σ pi log2 pi

Let us now consider the problem of determining the most informative attribute for building a decision tree. To classify an instance as being a positive or a negative example of a concept, a certain amount of information is needed. After we have learned the value of some attribute of the instance, we only need some remaining amount of information to classify the instance. This remaining amount is normally smaller than the initial amount, and is called the 'residual information'. The 'most informative' attribute is the one that minimizes the residual information.

Let us suppose that the decision tree has to be built from a training set C consisting of p positive examples and n negative examples. Then, according to the above formula, the average amount of information needed to classify an instance from C is

    I(p, n) = - p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))

If attribute A with values v1, v2, ..., vk is used for the root of the decision tree, it will partition C into C1, C2, ..., Ck, where Ci contains those examples in C that have value vi of A. Let Ci contain pi positive examples and ni negative examples. The expected information required to classify an instance in Ci is I(pi, ni). The expected amount of information required to classify an instance after the value of attribute A is known is therefore

    Ires(A) = Σi (pi + ni)/(p + n) I(pi, ni)

where the weight for the i-th branch is the proportion of the examples in C that belong to Ci. Therefore, the information gained by branching on A is

    gain(A) = I(p, n) - Ires(A)
16. A good heuristic is to choose the attribute to branch on which leads to the greatest information gain. Since I(p, n) is constant for all attributes, maximizing the gain is equivalent to minimizing Ires(A), which in turn is equivalent to minimizing the following expression:
    Σi [ - pi log2(pi/(pi+ni)) - ni log2(ni/(pi+ni)) ]

where pi is the number of positive examples in Ci and ni is the number of negative examples in Ci; if pi = 0 or ni = 0 then the corresponding term in the sum is 0.
ID3 examines all candidate attributes, chooses A to maximize gain(A) (or, equivalently, to minimize Ires(A)), forms the tree as above, and then uses the same process recursively to form decision trees for the residual subsets C1, C2, ..., Ck.

The intuition behind the information content criterion is the following. We are interested in small trees, therefore we need powerful attributes to discriminate between classes. Ideally, such a powerful attribute should divide the objects in C into subsets so that only one class is represented in each subset. Such a totally uniform subset (uniform with respect to the class) is said to be 'pure', and no additional information is needed to classify an instance from the subset. In a non-ideal situation the set is not completely pure, but we want it to be as pure as possible. Thus we prefer the attributes that minimize the impurity of the resulting subsets. I(p, n) is a measure of impurity.
17. Illustration of the method

Examples

1. Find the attribute that maximizes the information gain.

height   hair    eyes    class
short    blond   blue    +
tall     blond   brown   -
tall     red     blue    +
short    dark    blue    -
tall     dark    blue    -
tall     blond   blue    +
tall     dark    brown   -
short    blond   brown   -

I(3+, 5-) = -3/8 log2(3/8) - 5/8 log2(5/8) = 0.954434003

Height: short(1+, 2-), tall(2+, 3-)
Gain(height) = 0.954434003 - [3/8 I(1+,2-) + 5/8 I(2+,3-)]
             = 0.954434003 - [3/8 (-1/3 log2(1/3) - 2/3 log2(2/3)) + 5/8 (-2/5 log2(2/5) - 3/5 log2(3/5))]
             = 0.003228944

Hair: blond(2+, 2-), red(1+, 0-), dark(0+, 3-)
Gain(hair) = 0.954434003 - [4/8 (-2/4 log2(2/4) - 2/4 log2(2/4)) + 1/8 (-1/1 log2(1/1) - 0) + 3/8 (-0 - 3/3 log2(3/3))]
           = 0.954434003 - 0.5 = 0.454434003

Eyes: blue(3+, 2-), brown(0+, 3-)
Gain(eyes) = 0.954434003 - [5/8 (-3/5 log2(3/5) - 2/5 log2(2/5)) + 3/8 (-0 - 3/3 log2(3/3))]
           = 0.954434003 - 0.606844122 = 0.347589881

Hair is the best attribute.
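These gains can be verified with the gain function sketched after slide 12, passing the (pi, ni) counts for each attribute value read off the table above:

    print(gain(3, 5, [(1, 2), (2, 3)]))          # height: ~0.0032
    print(gain(3, 5, [(2, 2), (1, 0), (0, 3)]))  # hair:   ~0.4544
    print(gain(3, 5, [(3, 2), (0, 3)]))          # eyes:   ~0.3476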
18. Illustration of the method (cont.)

Examples

height   hair    eyes    class
short    blond   blue    +
tall     blond   brown   -
tall     red     blue    +
short    dark    blue    -
tall     dark    blue    -
tall     blond   blue    +
tall     dark    brown   -
short    blond   brown   -

2. Hair is the best attribute. Build the tree using it.
19. Illustration of the method (cont.)

3. Select the best attribute for the set of examples:

short, blond, blue    +
tall,  blond, brown   -
tall,  blond, blue    +
short, blond, brown   -

I(2+, 2-) = -2/4 log2(2/4) - 2/4 log2(2/4) = -log2(1/2) = 1

Height: short(1+, 1-), tall(1+, 1-)
Eyes: blue(2+, 0-), brown(0+, 2-)

Gain(height) = 1 - [2/4 I(1+,1-) + 2/4 I(1+,1-)] = 1 - I(1+,1-) = 1 - 1 = 0
Gain(eyes)   = 1 - [2/4 I(2+,0-) + 2/4 I(0+,2-)] = 1 - 0 - 0 = 1

Eyes is the best attribute.
20. Illustration of the method (cont.)

4. Eyes is the best attribute. Expand the tree using it.
21. Illustration of the method (cont.)
5. Build the decision tree
What induction hypothesis is made?
22. Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
23. How could we transform a tree into a set of rules?

Answer:
IF (hair = red) THEN positive example
IF (hair = blond) AND (eyes = blue) THEN positive example

Why should we make such a transformation?

Converting to rules improves understandability.
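A rule is just the conjunction of the attribute tests along one root-to-leaf path. A minimal Python sketch of the conversion, assuming the (attribute, {value: subtree}) tree representation from the sketch after slide 9 (the naming is ours):

    def tree_to_rules(tree, conditions=()):
        """Collect one (conditions, label) rule per leaf of the tree."""
        if not isinstance(tree, tuple):              # leaf node: '+' or '-'
            return [(list(conditions), tree)]
        attribute, branches = tree
        rules = []
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, conditions + ((attribute, value),))
        return rules

    # The tree learned above: test hair first, then eyes under the blond branch.
    tree = ('hair', {'red': '+', 'dark': '-',
                     'blond': ('eyes', {'blue': '+', 'brown': '-'})})
    for conds, label in tree_to_rules(tree):
        print('IF ' + ' AND '.join(f'({a} = {v})' for a, v in conds) + f' THEN {label}')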
24. Learning from noisy data

What errors could be found in an example (also called noise in the data)?

- errors in the values of attributes (due to measurements or subjective judgments);
- errors of classification of the instances (for instance, a negative example that was considered a positive example).
25. How to deal with noise?
What are the effects of noise?
Noise may cause the attributes to become
inadequate. Noise may lead to decision trees of
spurious complexity (overfitting).
How to change the ID3 algorithm to deal with
noise?
The algorithm must be able to work with
inadequate attributes, because noise can cause
even the most comprehensive set of attributes to
appear inadequate.
The algorithm must be able to decide that testing
further attributes will not improve the
predictive accuracy of the decision tree. For
instance, it should refrain from increasing the
complexity of the decision tree to accommodate a
single noise-generated special case.
26. How to deal with an inadequate attribute set? (inadequacy due to noise)

A collection C of instances may contain representatives of both classes, yet further testing of C may be ruled out, either because the attributes are inadequate and unable to distinguish among the instances in C, or because each attribute has been judged to be irrelevant to the class of the instances in C. In this situation it is necessary to produce a leaf labeled with class information, but the instances in C are not all of the same class.

What class to assign to a leaf node that contains both + and - examples?
27. What class to assign to a leaf node that contains both + and - examples?

Approaches:
1. The notion of class could be generalized from a binary value (0 for negative examples and 1 for positive examples) to a number in the interval [0, 1]. In such a case, a class of 0.8 would be interpreted as 'belonging to class P with probability 0.8'.
2. Opt for the more numerous class, i.e. assign the leaf to class P if p > n, to class N if p < n, and to either if p = n.

The first approach minimizes the sum of the squares of the errors over the objects in C. The second approach minimizes the sum of the absolute errors over the objects in C. If the aim is to minimize the expected error, the second approach might be anticipated to be superior.
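A brief justification of these two claims (our own sketch, not from the slides): let the leaf contain p positive and n negative examples and let c in [0, 1] be the class value assigned to the leaf.

    % Sum of squared errors over the leaf's examples (targets 1 for +, 0 for -):
    E_2(c) = p\,(1-c)^2 + n\,c^2, \qquad
    \frac{dE_2}{dc} = -2p(1-c) + 2nc = 0 \;\Rightarrow\; c = \frac{p}{p+n}
    % i.e. the class-probability assignment of approach 1.

    % Sum of absolute errors is linear in c, so it is minimized at an endpoint:
    E_1(c) = p\,(1-c) + n\,c = p + (n-p)\,c \;\Rightarrow\;
    c^* = 1 \text{ if } p > n, \qquad c^* = 0 \text{ if } p < n
    % i.e. the majority-class assignment of approach 2.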
28. How to avoid overfitting the data?
One says that a hypothesis overfits the training
examples if some other hypothesis that fits the
training examples less well actually performs
better over the entire distribution of instances.
How to avoid overfitting?
- Stop growing the tree before it overfits
- Allow the tree to overfit and then prune it.
How to determine the correct size of the tree?
Use a testing set of examples to compare the
likely errors of various trees.
29. Rule post-pruning to avoid overfitting the data

Rule post-pruning algorithm (a sketch follows below):
1. Infer a decision tree.
2. Convert the tree into a set of rules.
3. Prune (generalize) the rules by removing antecedents as long as this improves their accuracy.
4. Sort the rules by their accuracy and use this order in classification.
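A minimal sketch of steps 3 and 4, assuming rules in the (conditions, label) form produced by tree_to_rules above and accuracy estimated on a held-out set of (feature_dict, label) examples; the greedy loop is our own simplification of the pruning step:

    def rule_accuracy(rule, held_out):
        """Fraction of held-out examples matched by the rule that have the rule's label."""
        conditions, label = rule
        matched = [lab for f, lab in held_out if all(f[a] == v for a, v in conditions)]
        return sum(lab == label for lab in matched) / len(matched) if matched else 0.0

    def post_prune(rules, held_out):
        """Greedily drop antecedents while this improves estimated accuracy,
        then sort the rules by accuracy (most accurate first)."""
        pruned = []
        for conditions, label in rules:
            conditions = list(conditions)
            improved = True
            while improved and conditions:
                improved = False
                for cond in list(conditions):
                    shorter = [c for c in conditions if c != cond]
                    if rule_accuracy((shorter, label), held_out) > rule_accuracy((conditions, label), held_out):
                        conditions, improved = shorter, True
                        break
            pruned.append((conditions, label))
        return sorted(pruned, key=lambda r: rule_accuracy(r, held_out), reverse=True)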
Compare tree pruning with rule post-pruning.

Rule post-pruning is more general: we can remove an attribute test near the top of the tree without removing all the tests that follow it.
30. How to use continuous attributes?
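One common approach (used, for instance, in C4.5, and described in the Mitchell chapter cited at the end) is to turn a continuous attribute into boolean tests of the form attribute < t: candidate thresholds t are taken at midpoints between consecutive sorted values where the class changes, and each candidate test is scored by its information gain. A small Python sketch under that assumption (names and data are ours):

    def candidate_thresholds(examples, attribute):
        """Candidate split points for a continuous attribute: midpoints between
        consecutive sorted values whose class labels differ."""
        ordered = sorted(examples, key=lambda e: e[0][attribute])
        thresholds = []
        for (f1, lab1), (f2, lab2) in zip(ordered, ordered[1:]):
            if lab1 != lab2 and f1[attribute] != f2[attribute]:
                thresholds.append((f1[attribute] + f2[attribute]) / 2)
        return thresholds

    # Each threshold t defines a boolean test "attribute < t" whose information
    # gain is computed exactly as for a discrete attribute; the best t is kept.
    examples = [({'height': 1.60}, '+'), ({'height': 1.75}, '-'), ({'height': 1.82}, '-')]
    print(candidate_thresholds(examples, 'height'))   # [1.675]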
31. How to deal with missing attribute values?
Estimate the value from the values of the other
examples.
How?
Assign the value that is most common for the
training examples at that node.
Assign a probability to each of the values. How
does this affect the algorithm?
Consider fractional examples.
32. Comparison with the candidate elimination algorithm

Generalization language
- ID3: disjunctions of conjunctions
- CE: conjunctions

Bias
- ID3: preference bias (Occam's razor)
- CE: representation bias

Search strategy
- ID3: hill climbing (may not find the concept, but only an approximation)
- CE: exhaustive search

Use of examples
- ID3: all at the same time (can deal with noise and missing values)
- CE: one at a time (can determine the most informative example)
33. Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
34. What problems are appropriate for decision tree learning?

Problems for which:
- instances can be represented by attribute-value pairs;
- disjunctive descriptions may be required to represent the learned concept;
- the training data may contain errors;
- the training data may contain missing attribute values.
35. What practical applications could you envision?
- Classify
- Patients by their disease
- Equipment malfunctions by their cause
- Loan applicants by their likelihood to default
on payments.
36. What are the main features of decision tree learning?

- May employ a large number of examples.
- Discovers efficient classification trees that are theoretically justified.
- Learns disjunctive concepts.
- Is limited to attribute-value representations.
- Has a non-incremental nature (there are, however, incremental versions, which are less efficient).
- The tree representation is not very understandable.
- The method is limited to learning classification rules.
- The method was successfully applied to complex real-world problems.
37. Overview
The decision tree learning problem
The basic ID3 learning algorithm
Discussion and refinement of the ID3 method
Applicability of the decision tree learning
Exercises
Recommended reading
38. Exercise

Build two different decision trees corresponding to the examples and counterexamples from the following table. Indicate the concept represented by each decision tree.

food         medium       (?)         type        class
herbivore    land         harmless    mammal      +   deer (e1)
carnivore    land         harmful     mammal      -   lion (c1)
omnivorous   water        harmless    fish        +   goldfish (e2)
herbivore    amphibious   harmless    amphibian   -   frog (c2)
omnivorous   air          harmless    bird        -   parrot (c3)
carnivore    land         harmful     reptile     +   cobra (e3)
carnivore    land         harmless    reptile     -   lizard (c4)
omnivorous   land         moody       mammal      +   bear (e4)

Apply the ID3 algorithm to build the decision tree corresponding to the examples and counterexamples from the above table.
39. Exercise

Consider the following positive and negative examples of a concept:

shape   size    class
ball    large   +   e1
brick   small   -   c1
cube    large   -   c2
ball    small   +   e2

and the following background knowledge:
a) You will be required to learn this concept by applying two different learning methods: the Induction of Decision Trees method and the Version Space (candidate elimination) method. Do you expect to learn the same concept with each method, or different concepts? Explain your prediction in detail (you will need to consider various aspects, such as the instance space, the hypothesis space, and the method of learning).
b) Learn the concept represented by the above examples by applying:
- the Induction of Decision Trees method;
- the Version Space method.
c) Explain the results obtained in b) and compare them with your predictions.
d) What will be the results of learning with the above two methods if only the first three examples are available?
40. Exercise

Consider the following positive and negative examples of a concept:

workstation   software         printer       class
maclc         macwrite         laserwriter   +   e1
sun           frame-maker      laserwriter   +   e2
hp            accounting       laserjet      -   c1
sgi           spreadsheet      laserwriter   -   c2
macII         microsoft-word   proprinter    +   e3

and the following background knowledge:
a) Build two decision trees corresponding to the above examples. Indicate the concept represented by each decision tree. In principle, how many different decision trees could you build?
b) Learn the concept represented by the above examples by applying the Version Space method. What is the learned concept if only the first four examples are available?
c) Compare and justify the obtained results.
41. Exercise

True or false: If decision tree D2 is an elaboration of D1 (according to ID3), then D1 is more general than D2.
42. Recommended reading

Mitchell T.M., Machine Learning, Chapter 3: Decision Tree Learning, pp. 52-80, McGraw Hill, 1997.
Quinlan J.R., Induction of Decision Trees, Machine Learning 1: 81-106, 1986. Also in Shavlik J. and Dietterich T. (eds.), Readings in Machine Learning, Morgan Kaufmann, 1990.
Barr A., Cohen P., and Feigenbaum E. (eds.), The Handbook of Artificial Intelligence, vol. III, pp. 406-410, Morgan Kaufmann, 1982.
Edwards E., Information Transmission, Chapter 4: Uncertainty, pp. 28-39, Chapman and Hall, 1964.