Title: Data Mining using Decision Trees
1 Data Mining using Decision Trees
2 Decision Trees from Data Base

Ex Num   Size    Colour   Shape    Concept Satisfied
1        med     blue     brick    yes
2        small   red      wedge    no
3        small   red      sphere   yes
4        large   red      wedge    no
5        large   green    pillar   yes
6        large   red      pillar   no
7        large   green    sphere   yes

Choose target: Concept Satisfied. Use all attributes except Ex Num.
3 CLS - Concept Learning System - Hunt et al.

Tree structure: a parent node containing a mixture of +ve and -ve examples is split on an attribute V; each value v1, v2, v3 of V leads to a child node.
4 CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set.
2. If all the examples in T are +ve, create a YES node and HALT.
3. If all the examples in T are -ve, create a NO node and HALT.
4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values of F. Create branches with F as parent and T1, ..., Tn as child nodes.
5. Apply the procedure recursively to each child node.
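A minimal Python sketch of this procedure, assuming examples are stored as dictionaries keyed by attribute name with a "Satisfied" target (names are illustrative, not from the slides). The attribute chosen at step 4 is simply the first one remaining, since CLS does not specify an ordering, and a guard for running out of attributes is omitted because the example database is consistent.

```python
def cls(examples, attributes, target="Satisfied"):
    """CLS sketch: examples are dicts, e.g.
    {"Size": "med", "Colour": "blue", "Shape": "brick", "Satisfied": "yes"}."""
    labels = {e[target] for e in examples}
    if labels == {"yes"}:            # step 2: all +ve -> YES leaf
        return "YES"
    if labels == {"no"}:             # step 3: all -ve -> NO leaf
        return "NO"
    # Step 4: CLS does not say which attribute to pick; take the first remaining one.
    F, rest = attributes[0], attributes[1:]
    tree = {F: {}}
    # Partition T into T1..Tn by the values of F and recurse on each child (step 5).
    for value in {e[F] for e in examples}:
        subset = [e for e in examples if e[F] == value]
        tree[F][value] = cls(subset, rest, target)
    return tree
```

Calling cls(rows, ["Size", "Colour", "Shape"]) on the seven-example database builds a (generally non-minimal) tree, since the attribute order here is arbitrary; ID3's entropy-based ordering comes later.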
5 Data Base Example

Using attribute SIZE, the full set {1, 2, 3, 4, 5, 6, 7} is partitioned:
  med   -> {1}            all +ve: YES
  small -> {2, 3}         mixed: Expand
  large -> {4, 5, 6, 7}   mixed: Expand
6 Expanding

{1, 2, 3, 4, 5, 6, 7} split on SIZE:
  med   -> {1}: YES
  small -> {2, 3} split on SHAPE:
             wedge  -> {2}: NO
             sphere -> {3}: YES
  large -> {4, 5, 6, 7} split on SHAPE:
             wedge  -> {4}: NO
             sphere -> {7}: YES
             pillar -> {5, 6} split on COLOUR:
                        green -> {5}: YES
                        red   -> {6}: NO
7 Rules from Tree

IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
   OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR (SHAPE = sphere)))
   OR (SIZE = small AND SHAPE = sphere)
   OR (SIZE = medium)
THEN YES
8 Disjunctive Normal Form - DNF

IF   (SIZE = medium)
  OR (SIZE = small AND SHAPE = sphere)
  OR (SIZE = large AND SHAPE = sphere)
  OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT satisfied
ELSE CONCEPT not satisfied
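To make the DNF concrete, the sketch below (illustrative only) encodes the rule as a Python predicate and checks it against the seven training examples; the table's value "med" is used where the rule says "medium".

```python
# The DNF rule applied to one example (attribute values as in the table).
def concept_satisfied(size, colour, shape):
    return (size == "med"
            or (size == "small" and shape == "sphere")
            or (size == "large" and shape == "sphere")
            or (size == "large" and shape == "pillar" and colour == "green"))

# The seven training examples: (Size, Colour, Shape, Satisfied).
data = [("med", "blue", "brick", "yes"),     ("small", "red", "wedge", "no"),
        ("small", "red", "sphere", "yes"),   ("large", "red", "wedge", "no"),
        ("large", "green", "pillar", "yes"), ("large", "red", "pillar", "no"),
        ("large", "green", "sphere", "yes")]

for size, colour, shape, target in data:
    assert concept_satisfied(size, colour, shape) == (target == "yes")
print("rule reproduces all 7 examples")
```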
9 ID3 - Quinlan

Attributes are chosen in an arbitrary order by the CLS algorithm. This can result in large decision trees if the ordering is not optimal. An optimal ordering would result in the smallest decision tree, but no method is known to determine the optimal ordering. We therefore use a heuristic that provides an efficient ordering and results in a near-optimal tree.

ID3 = CLS + efficient ordering of attributes.
Entropy is used to order the attributes.
10 Entropy

For a random variable V which can take values v1, v2, ..., vn with Pr(vi) = pi for all i, the entropy of V is given by

  S(V) = - Σi pi log(pi)     (natural logarithms on this slide)

Entropy for a fair die: ln 6 = 1.7917
Entropy for a fair die known to show an even score: ln 3 = 1.0986
Difference between the entropies:
Information gain = 1.7917 - 1.0986 = 0.6931
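The dice figures can be reproduced with a few lines of Python (natural logarithms, as above):

```python
from math import log

def entropy(probs):
    # S(V) = - sum_i p_i ln(p_i); terms with p_i = 0 contribute nothing
    return -sum(p * log(p) for p in probs if p > 0)

fair_die = entropy([1/6] * 6)        # ln 6 = 1.7917
even_die = entropy([1/3] * 3)        # score known to be even (2, 4 or 6): ln 3 = 1.0986
print(fair_die - even_die)           # information gain = ln 2 = 0.6931
```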
11 Attribute Expansion

To expand attribute Ai at a node, start from the joint probability table Pr(A1, ..., Ai, ..., An, T) over the attributes and the target T (probabilities equally likely over the examples unless specified otherwise). For each value ai1, ..., aim of Ai, pass down the probabilities corresponding to that value from above and re-normalise, giving

  Pr(A1, ..., Ai-1, Ai+1, ..., An, T | Ai = aij)

If the original probabilities were equally likely, the re-normalised ones are equally likely again.
12 Expected Entropy for an Attribute

For attribute Ai and target T: for each value aij, pass down from above the probabilities corresponding to each target value tk and re-normalise, giving Pr(T | Ai = aij). Let S(aij) be the entropy of this distribution. The expected entropy for Ai is

  S(Ai) = Σj Pr(Ai = aij) S(aij)
13 How to choose attribute and Information gain

Determine the expected entropy for each attribute, i.e. S(Ai) for all i.
Choose s such that S(As) = min_i S(Ai), and expand attribute As.
By choosing attribute As the information gain is S - S(As), where S is the entropy of the target T at the current node.
Minimising the expected entropy is equivalent to maximising the information gain.
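A short sketch of this selection step on the example database (base-2 logarithms, matching the 0.99 and 0.13 figures on the following slides; names are illustrative):

```python
from math import log2
from collections import Counter

# The example database: (Size, Colour, Shape, Satisfied).
data = [("med", "blue", "brick", "yes"),     ("small", "red", "wedge", "no"),
        ("small", "red", "sphere", "yes"),   ("large", "red", "wedge", "no"),
        ("large", "green", "pillar", "yes"), ("large", "red", "pillar", "no"),
        ("large", "green", "sphere", "yes")]
attrs = {"Size": 0, "Colour": 1, "Shape": 2}

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in counts.values())

S = entropy([row[3] for row in data])              # 0.985 (the slides round to 0.99)

def expected_entropy(attr):
    i, total = attrs[attr], len(data)
    values = {row[i] for row in data}
    return sum(len(sub) / total * entropy([r[3] for r in sub])
               for sub in ([r for r in data if r[i] == v] for v in values))

for a in attrs:                                    # information gain S - S(A)
    print(a, round(S - expected_entropy(a), 2))    # Size 0.13, Colour 0.52, Shape 0.7
```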
14 Previous Example

Ex Num   Size    Colour   Shape    Concept Satisfied   Pr
1        med     blue     brick    yes                 1/7
2        small   red      wedge    no                  1/7
3        small   red      sphere   yes                 1/7
4        large   red      wedge    no                  1/7
5        large   green    pillar   yes                 1/7
6        large   red      pillar   no                  1/7
7        large   green    sphere   yes                 1/7

Pr(Concept Satisfied):  yes 4/7,  no 3/7

S = -(4/7) Log(4/7) - (3/7) Log(3/7) = 0.99     (logs to base 2)
15 Entropy for attribute Size

Size    Concept Satisfied   Pr
med     yes                 1/7
small   no                  1/7
small   yes                 1/7
large   no                  2/7
large   yes                 2/7

Pr(med) = 1/7,  Pr(small) = 2/7,  Pr(large) = 4/7

Conditional distributions of Concept Satisfied:
  med:    yes 1              ->  S(med)   = 0
  small:  no 1/2, yes 1/2    ->  S(small) = 1
  large:  no 1/2, yes 1/2    ->  S(large) = 1

S(Size) = (2/7)·1 + (1/7)·0 + (4/7)·1 = 6/7 = 0.86
Information Gain for Size = 0.99 - 0.86 = 0.13
16 First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.7

SHAPE has the maximum information gain, so choose SHAPE.

{1, 2, 3, 4, 5, 6, 7} split on SHAPE:
  brick  -> {1}: YES
  wedge  -> {2, 4}: NO
  sphere -> {3, 7}: YES
  pillar -> {5, 6}: Expand
17 Complete Decision Tree

{1, 2, 3, 4, 5, 6, 7} split on SHAPE:
  brick  -> {1}: YES
  wedge  -> {2, 4}: NO
  sphere -> {3, 7}: YES
  pillar -> {5, 6} split on COLOUR:
             green -> {5}: YES
             red   -> {6}: NO

Rule: IF (Shape is wedge) OR (Shape is pillar AND Colour is red) THEN NO ELSE YES
18 A new case

Size    Colour   Shape    Concept Satisfied
med     red      pillar   ?

Following the tree: SHAPE = pillar, then COLOUR = red, so the case is classified NO.
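One way to see the lookup is to encode the complete tree from the previous slide as nested dictionaries (an illustrative encoding, not part of the slides) and walk it for the new case:

```python
# The complete decision tree of the previous slide as nested dicts.
tree = {"Shape": {"brick": "YES",
                  "wedge": "NO",
                  "sphere": "YES",
                  "pillar": {"Colour": {"green": "YES", "red": "NO"}}}}

def classify(tree, case):
    # Follow attribute tests until a YES/NO leaf is reached.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][case[attr]]
    return tree

new_case = {"Size": "med", "Colour": "red", "Shape": "pillar"}
print(classify(tree, new_case))   # NO
```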
19 Post Pruning

Consider any node S with N examples. Let C be the class with the most examples at S (the majority class, one of YES, NO), with n cases of C among the N.

Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and we reach this node?

  E(S) = Pr(class of a new case ≠ C)
20 Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p. Let f(p) be a prior probability distribution for p on [0, 1]. We can update this prior using Bayes updating with the information at node S, which is "n cases of C in S":

  f(p | n C in S) = Pr(n C in S | p) f(p) / ∫0^1 Pr(n C in S | p) f(p) dp
21 Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

  f(p | n C in S) = p^n (1-p)^(N-n) / ∫0^1 x^n (1-x)^(N-n) dx

and the expected error if S becomes a leaf is

  E(S) = E(1 - p) = ∫0^1 (1-p) p^n (1-p)^(N-n) dp / ∫0^1 p^n (1-p)^(N-n) dp

The evaluation of the integrals using Beta functions gives

  ∫0^1 p^n (1-p)^(N-n+1) dp = n! (N-n+1)! / (N+2)!
  ∫0^1 p^n (1-p)^(N-n)   dp = n! (N-n)! / (N+1)!

so

  E(S) = (N - n + 1) / (N + 2)
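The closed form can be checked numerically. The sketch below (function names are illustrative) evaluates E(1 - p) under the posterior proportional to p^n (1-p)^(N-n) by simple summation and compares it with (N - n + 1)/(N + 2):

```python
def expected_error(N, n):
    # closed form from the Beta integrals: E(S) = (N - n + 1) / (N + 2)
    return (N - n + 1) / (N + 2)

def expected_error_numeric(N, n, steps=100_000):
    # E(1 - p) under the posterior proportional to p^n (1 - p)^(N - n)
    dp = 1.0 / steps
    ps = [(i + 0.5) * dp for i in range(steps)]
    w = [p**n * (1 - p)**(N - n) for p in ps]
    return sum((1 - p) * wi for p, wi in zip(ps, w)) / sum(w)

print(expected_error(10, 6), expected_error_numeric(10, 6))   # both ~ 5/12 = 0.417
```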
22 Post Pruning for Binary Case

For leaf nodes Si:  Error(Si) = E(Si)

For any node S which is not a leaf, with child nodes S1, S2, ..., Sm reached with probabilities

  Pi = (number of examples in Si) / (number of examples in S)

we can calculate the backed-up error

  BackUpError(S) = Σi Pi Error(Si)
  Error(S) = MIN( E(S), BackUpError(S) )

Decision: prune at S if BackUpError(S) ≥ E(S).
23 Example of Post Pruning

(x, y) means x YES cases and y NO cases. Each node shows E(S) followed by BackUpError(S); Error(S) is the smaller of the two (underlined on the original slide). PRUNE means cut the sub-tree below this point.

Before Pruning:

a (6, 4): E = 0.417, BackUpError = 0.378
  - b (4, 2): E = 0.375, BackUpError = 0.413  -> PRUNE
      - leaf (1, 0): E = 0.333
      - leaf (3, 2): E = 0.429
  - c (2, 2): E = 0.5, BackUpError = 0.383
      - leaf (1, 0): E = 0.333
      - d (1, 2): E = 0.4, BackUpError = 0.444  -> PRUNE
          - leaf (1, 1): E = 0.5
          - leaf (0, 1): E = 0.333
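The numbers on this slide can be reproduced bottom-up. In the sketch below (an illustrative encoding), a node is a (yes, no) pair for a leaf or (yes, no, children) otherwise, and a node is pruned when its backed-up error is no better than E(S):

```python
def expected_error(yes, no):
    # E(S) = (N - n + 1) / (N + 2), with n the majority-class count
    N, n = yes + no, max(yes, no)
    return (N - n + 1) / (N + 2)

def error(node):
    # Returns (Error(S), prune?) for a (yes, no) leaf or (yes, no, children) node.
    if len(node) == 2:
        return expected_error(*node), False
    yes, no, children = node
    N = yes + no
    child_errors = [error(c)[0] for c in children]
    backup = sum((cy + cn) / N * e
                 for (cy, cn, *_), e in zip(children, child_errors))
    e_here = expected_error(yes, no)
    return min(e_here, backup), backup >= e_here   # prune if backed-up error is no better

# The tree above: a(6,4) -> b(4,2), c(2,2); b -> (1,0),(3,2); c -> (1,0), d(1,2); d -> (1,1),(0,1)
d = (1, 2, [(1, 1), (0, 1)])
b = (4, 2, [(1, 0), (3, 2)])
c = (2, 2, [(1, 0), d])
a = (6, 4, [b, c])
print(error(a))   # (0.378..., False): keep a; error(b) and error(d) report prune = True
```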
24 Result of Pruning

After Pruning:

a (6, 4)
  - leaf (4, 2)        (b pruned)
  - c (2, 2)
      - leaf (1, 0)
      - leaf (1, 2)    (d pruned)
25 Generalisation

For the case in which we have k classes, the generalisation of E(S) is

  E(S) = (N - n + k - 1) / (N + k)

Otherwise, the pruning method is the same.
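In code the change is a single line (k = 2 recovers the binary formula; the example values are illustrative):

```python
def expected_error(N, n, k=2):
    # E(S) = (N - n + k - 1) / (N + k); with k = 2 this is (N - n + 1) / (N + 2)
    return (N - n + k - 1) / (N + k)

print(expected_error(6, 4), expected_error(6, 4, k=3))   # 0.375 and 0.444...
```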
26 Testing

Split the DataBase into a Training Set and a Test Set.
Learn the rules using the Training Set and prune. Test the rules on the Training Set and record the percentage classified correctly; then test the rules on the Test Set and record the percentage classified correctly.

The accuracy on the test set should be close to that on the training set; this indicates good generalisation. Over-fitting can occur if noisy data is used or if too-specific attributes are used. Pruning will overcome noise to some extent, but not completely. Too-specific attributes must be dropped.
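As a sketch only (the function, split ratio, and names are assumptions, not from the slides), a testing harness along these lines compares training-set and test-set accuracy for any learn/classify pair that returns labels in the same form as the stored target:

```python
import random

def evaluate(examples, learn, classify, target="Satisfied", test_fraction=0.3, seed=0):
    # Shuffle, split into training and test sets, learn (and prune) on the training set,
    # then report the fraction classified correctly on each set.
    rows = examples[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    train, test = rows[:cut], rows[cut:]
    model = learn(train)

    def accuracy(part):
        return sum(classify(model, r) == r[target] for r in part) / len(part)

    return accuracy(train), accuracy(test)   # similar values suggest good generalisation
```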