Huffman Trees and ID3 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Huffman Trees and ID3


1
Huffman Trees and ID3
CS157B Lecture 19
  • Prof. Sin-Min Lee
  • Department of Computer Science

2
Huffman coding is an algorithm for lossless data compression developed by David A. Huffman as a PhD student at MIT in 1952 and published in "A Method for the Construction of Minimum-Redundancy Codes."
Huffman codes are widely used in applications that involve the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television (HDTV).
Professor David A. Huffman (August 9, 1925 - October 7, 1999)
3
Motivation: The motivations for data compression are obvious:
  • reducing the space required to store files on disk or tape
  • reducing the time to transmit large files.

Image source: plus.maths.org/issue23/features/data/data.jpg
Huffman savings are between 20% and 90%.
4
Basic Idea: Huffman coding uses a variable-length code table for encoding a source symbol (such as a character in a file), where the code table is derived in a particular way from the frequency of occurrence of each possible value of the source symbol.
5
Example: Suppose you have a file with 100K characters. For simplicity assume that there are only 6 distinct characters in the file, from a through f, with frequencies as indicated below. We represent the file using a unique binary string for each character.

                             a     b     c     d     e     f
Frequency (in 1000s)        45    13    12    16     9     5
Fixed-length codeword      000   001   010   011   100   101

Space = (45×3 + 13×3 + 12×3 + 16×3 + 9×3 + 5×3) × 1000 = 300K bits
6
Can we do better??
YES!!
By using variable-length codes instead of fixed-length codes. Idea: give frequent characters short codewords, and infrequent characters long codewords, i.e. the length of the encoded character is inversely related to that character's frequency.

                             a     b     c     d     e     f
Frequency (in 1000s)        45    13    12    16     9     5
Fixed-length codeword      000   001   010   011   100   101
Variable-length codeword     0   101   100   111  1101  1100

Space = (45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) × 1000 = 224K bits (savings ≈ 25%)
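A minimal Python sketch (not part of the slides) that reproduces the two space figures above from the code tables:

freq  = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}   # frequencies in thousands
fixed = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100', 'f': '101'}
var   = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

def total_kbits(codes):
    # sum of (frequency in thousands) x (codeword length) = size in thousands of bits
    return sum(freq[c] * len(codes[c]) for c in freq)

print(total_kbits(fixed))  # 300 -> 300K bits
print(total_kbits(var))    # 224 -> 224K bits, roughly a 25% saving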
7
PREFIX CODES: codes in which no codeword is also a prefix of some other codeword. ("Prefix-free codes" would have been a more appropriate name.)

Variable-length codeword: 0 101 100 111 1101 1100

It is very easy to encode and decode using prefix codes. No ambiguity!! It is possible to show (although we won't do so here) that the optimal data compression achievable by a character code can always be achieved with a prefix code, so there is no loss of generality in restricting attention to prefix codes.
8
Benefits of using Prefix Codes. Example:

                             a     b     c     d     e     f
Variable-length codeword     0   101   100   111  1101  1100

"face" is encoded as 1100 0 100 1101, i.e. 110001001101.

To decode, we have to decide where each code begins and ends, since they are no longer all the same length. But this is easy, since no codes share a prefix. This means we need only scan the input string from left to right, and as soon as we recognize a code, we can print the corresponding character and start looking for the next code. In the above case, the only code that begins with "1100..." is "f", so we can print "f" and start decoding "0100...", get "a", etc.
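A small decoding sketch (an illustration, not from the slides), assuming the variable-length code table above:

code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}
decode_table = {v: k for k, v in code.items()}   # invert: codeword -> character

def decode(bits):
    out, buf = [], ''
    for b in bits:                  # scan the input left to right
        buf += b
        if buf in decode_table:     # a complete codeword recognized: emit its character
            out.append(decode_table[buf])
            buf = ''
    return ''.join(out)

print(decode('110001001101'))       # prints 'face'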
9
Benefits of using Prefix Codes. Example: To see why the no-common-prefix property is essential, suppose we encoded "e" with the shorter code 110:

                             a     b     c     d     e     f
Variable-length codeword     0   101   100   111  1101  1100
Variable-length codeword     0   101   100   111   110  1100

"face" would then be encoded as 11000100110. When we try to decode the leading "1100" we could not tell whether it is 1100 = "f" or 110 0 = "ea".
10
  • Representation
  • The Huffman code is represented as a binary tree:
  • each edge represents either 0 or 1
  • 0 means "go to the left child"
  • 1 means "go to the right child"
  • each leaf corresponds to the sequence of 0s and 1s traversed from the root to reach it, i.e. a particular code.
  • Since no prefix is shared, all legal codes are at the leaves, and decoding a string means following edges, according to the sequence of 0s and 1s in the string, until a leaf is reached.

11
                             a     b     c     d     e     f
Frequency (in 1000s)        45    13    12    16     9     5
Fixed-length codeword      000   001   010   011   100   101

                             a     b     c     d     e     f
Frequency (in 1000s)        45    13    12    16     9     5
Variable-length codeword     0   101   100   111  1101  1100

(Figure: the corresponding code trees, with each edge labeled 0 for "left child" or 1 for "right child")

Labeling:
leaf -> the character it represents and the frequency with which it appears in the text.
internal node -> the frequency with which all leaf nodes under it appear in the text (i.e. the sum of their frequencies).
12
Optimal Code

(Figure: the variable-length code tree; a full binary tree with every edge labeled 0 or 1)

An optimal code for a file is always represented by a full binary tree, in which every non-leaf node has two children. The fixed-length code in our example is not optimal, since its tree is not a full binary tree: there are codewords beginning 10..., but none beginning 11... Since we can now restrict our attention to full binary trees, we can say that if C is the alphabet from which the characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter of the alphabet, and exactly |C| - 1 internal nodes.
13
  • Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file.
  • For each character c in the alphabet C, let
  • f(c) denote the frequency of c in the file
  • d_T(c) denote the depth of c's leaf in the tree.
  • (d_T(c) is also the length of the codeword for character c.)
  • The number of bits required to encode a file is thus
  • B(T) = Σ_{c ∈ C} f(c) d_T(c)

which we define as the cost of the tree (a small code sketch of this cost follows).
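As an illustration (not from the slides), the cost B(T) can be computed by walking the tree; the nested-tuple tree layout below is an assumption made for the sketch:

def cost(tree, freq, depth=0):
    # B(T) = sum over characters c of f(c) * d_T(c)
    if isinstance(tree, str):          # a leaf holding character c at depth d_T(c)
        return freq[tree] * depth
    left, right = tree
    return cost(left, freq, depth + 1) + cost(right, freq, depth + 1)

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}   # in thousands
tree = ('a', (('c', 'b'), (('f', 'e'), 'd')))                 # the variable-length code tree from slide 6
print(cost(tree, freq))                                       # 224 -> 224K bits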
14
Constructing a Huffman code: Huffman invented a greedy algorithm that constructs an optimal prefix code, called a Huffman code. The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C| - 1 "merging" operations to create the final tree.
Greedy choice? The two smallest nodes are chosen at each step, and this local decision results in a globally optimal encoding tree. In general, greedy algorithms use small-grained, local minimal/maximal choices to arrive at a global minimum/maximum.
15
HUFFMAN(C)
1  n = |C|
2  Q = C
3  for i = 1 to n - 1
4      do ALLOCATE-NODE(z)
5         left[z] = x = EXTRACT-MIN(Q)
6         right[z] = y = EXTRACT-MIN(Q)
7         f[z] = f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT-MIN(Q)

C is a set of n characters, and each character c ∈ C is an object with a defined frequency f[c]. A min-priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged.
16
For our example, Huffman's algorithm proceeds as shown.

1  n = |C|
2  Q = C

Line 1 sets the initial queue size, n = 6 (letters in the alphabet). Line 2 initializes the priority queue Q with the characters in C (a through f).

3  for i = 1 to n - 1
4      do ALLOCATE-NODE(z)
5         left[z] = x = EXTRACT-MIN(Q)
6         right[z] = y = EXTRACT-MIN(Q)
7         f[z] = f[x] + f[y]
8         INSERT(Q, z)

The for loop performs n - 1 (= 6 - 1 = 5) merge steps to build the tree. It repeatedly extracts the two nodes x and y of lowest frequency from the queue and replaces them in the queue with a new node z representing their merger. The frequency of z is computed as the sum of the frequencies of x and y in line 7. The node z has x as its left child and y as its right child.

9  return EXTRACT-MIN(Q)

After the mergers, the one node left in the queue, the root, is returned in line 9. The final tree represents the optimal prefix code. The codeword for a letter is the sequence of edge labels on the path from the root to the letter.
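A Python sketch of HUFFMAN(C) using heapq as the min-priority queue Q (an illustration; the dict-based node layout and the tie-breaking counter are assumptions, not part of the slides):

import heapq
from itertools import count

def huffman(freqs):
    tiebreak = count()   # breaks ties on equal frequencies so the node dicts are never compared
    Q = [(f, next(tiebreak), {'char': c}) for c, f in freqs.items()]
    heapq.heapify(Q)                                     # line 2: Q = C
    for _ in range(len(freqs) - 1):                      # lines 3-8: n - 1 merge steps
        fx, _, x = heapq.heappop(Q)                      # line 5: x = EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(Q)                      # line 6: y = EXTRACT-MIN(Q)
        z = {'left': x, 'right': y}                      # line 4: the new internal node z
        heapq.heappush(Q, (fx + fy, next(tiebreak), z))  # lines 7-8: f[z] = f[x] + f[y]
    return heapq.heappop(Q)[2]                           # line 9: the root

def codewords(node, prefix=''):
    if 'char' in node:                                   # leaf: emit the accumulated edge labels
        return {node['char']: prefix or '0'}
    table = dict(codewords(node['left'], prefix + '0'))  # 0 = left edge
    table.update(codewords(node['right'], prefix + '1')) # 1 = right edge
    return table

print(codewords(huffman({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5})))

For these frequencies the sketch happens to reproduce the codewords on slide 6 (a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100); different tie-breaking could yield a different but equally optimal code.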
17
The steps of Huffman's algorithm
18
Running Time Analysis: the analysis of the running time of Huffman's algorithm assumes that Q is implemented as a binary min-heap.
  • For a set C of n characters, the initialization of Q in line 2 can be performed in O(n) time using the BUILD-MIN-HEAP procedure.
  • The for loop in lines 3-8 is executed exactly n - 1 times, and each heap operation requires time O(log n). The loop therefore contributes (n - 1) × O(log n) = O(n log n).

Thus, the total running time of HUFFMAN on a set of n characters is O(n) + O(n log n) = O(n log n).
19
Correctness of Huffman's algorithm To prove that
the greedy algorithm HUFFMAN is correct, we show
that the problem of determining an optimal prefix
code exhibits the greedy-choice and
optimal-substructure properties.
20
Lemma that shows that the greedy-choice property holds.
Lemma: Let C be an alphabet in which each character c ∈ C has frequency f[c]. Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.
Why? x and y must be at the bottom of the tree (least frequent); the tree is full, so they can be made siblings, and siblings differ only in the last bit.
Proof: The idea of the proof is to take the tree T representing an arbitrary optimal prefix code and modify it to make a tree representing another optimal prefix code such that the characters x and y appear as sibling leaves of maximum depth in the new tree. If we can do this, then their codewords will have the same length and differ only in the last bit.
21
Proof

(Figure: trees T and T', before and after exchanging a and x)

Let a and b be two characters that are sibling leaves of maximum depth in T. Without loss of generality, assume that f[a] ≤ f[b] and f[x] ≤ f[y]. Since f[x] and f[y] are the two lowest frequencies, in that order, and f[a] and f[b] are two arbitrary frequencies, in that order, we have f[x] ≤ f[a] and f[y] ≤ f[b].
Exchange the positions of a and x in T to produce T'.
The difference in cost between T and T' is
B(T) - B(T') = Σ_{c ∈ C} f(c) d_T(c) - Σ_{c ∈ C} f(c) d_T'(c)
             = f[x] d_T(x) + f[a] d_T(a) - f[x] d_T'(x) - f[a] d_T'(a)
             = f[x] d_T(x) + f[a] d_T(a) - f[x] d_T(a) - f[a] d_T(x)
             = (f[a] - f[x]) (d_T(a) - d_T(x))
             ≥ 0   (the cost is not increased)
22
Proof (continued)

(Figure: trees T, T', and T'')

Similarly, exchanging the positions of b and y in T' to produce T'' does not increase the cost, i.e. B(T') - B(T'') is non-negative. Therefore B(T'') ≤ B(T), and since T is optimal, B(T) ≤ B(T''), so B(T) = B(T''). Thus, T'' is an optimal tree in which x and y appear as sibling leaves of maximum depth, from which the lemma follows.
23
Lemma that shows that the optimal-substructure property holds.
Lemma: Let C be a given alphabet with frequency f[c] defined for each character c ∈ C. Let x and y be two characters in C with minimum frequency. Let C' be the alphabet C with the characters x, y removed and a (new) character z added, so that C' = (C - {x, y}) ∪ {z}; define f for C' as for C, except that f[z] = f[x] + f[y]. Let T' be any tree representing an optimal prefix code for the alphabet C'. Then the tree T, obtained from T' by replacing the leaf node for z with an internal node having x and y as children, represents an optimal prefix code for the alphabet C.

  • Proof
  • We first express B(T) in terms of B(T'):
  • For each c ∈ C - {x, y} we have d_T(c) = d_T'(c), and hence
  • f[c] d_T(c) = f[c] d_T'(c)
24
Since d_T(x) = d_T(y) = d_T'(z) + 1, we have
f[x] d_T(x) + f[y] d_T(y) = (f[x] + f[y]) (d_T'(z) + 1) = f[z] d_T'(z) + (f[x] + f[y])
from which we conclude that
B(T) = B(T') + f[x] + f[y], or equivalently B(T') = B(T) - f[x] - f[y].
Proof by contradiction: Suppose that T does not represent an optimal prefix code for C. Then there exists a tree T'' such that B(T'') < B(T). Without loss of generality (by the previous lemma), T'' has x and y as siblings. Let T''' be the tree T'' with the common parent of x and y replaced by a leaf z with frequency f[z] = f[x] + f[y]. Then
B(T''') = B(T'') - (f[x] + f[y]) < B(T) - (f[x] + f[y]) = B(T'),
yielding a contradiction to the assumption that T' represents an optimal prefix code for C'. Thus, T must represent an optimal prefix code for the alphabet C.
25
  • Drawbacks
  • The main disadvantage of Huffman's method is that it makes two passes over the data:
  • one pass to collect frequency counts of the letters in the message, followed by the construction of a Huffman tree and transmission of the tree to the receiver; and
  • a second pass to encode and transmit the letters themselves, based on the static tree structure.
  • This causes delay when used for network communication, and in file compression applications the extra disk accesses can slow down the algorithm.

We need one-pass methods, in which letters are encoded on the fly.
26
ID3 algorithm
  • To get the fastest decision-making procedure, one has to arrange attributes in a decision tree in a proper order: the most discriminating attributes first. This is done by the algorithm called ID3.
  • The most discriminating attribute can be defined in precise terms as the attribute for which fixing its value changes the entropy of possible decisions the most. Let w_j be the frequency of the j-th decision in a set of examples x. Then the entropy of the set is
  • E(x) = -Σ_j w_j log(w_j)
  • Let fix(x,a,v) denote the set of those elements of x whose value of attribute a is v. The average entropy that remains in x after the value of a has been fixed is
  • H(x,a) = Σ_v k_v E(fix(x,a,v)),
  • where k_v is the fraction of examples in x with attribute a having value v (both quantities are sketched in code below).
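A minimal Python sketch of these two quantities (an illustration; the record layout, each example as a dict with a 'decision' key plus one key per attribute, is an assumption, not from the slides):

from math import log2
from collections import Counter

def entropy(examples):
    # E(x) = -sum_j w_j log2(w_j), where w_j is the relative frequency of the j-th decision
    n = len(examples)
    return -sum((k / n) * log2(k / n)
                for k in Counter(e['decision'] for e in examples).values())

def fix(examples, a, v):
    # the subset of examples whose value of attribute a is v
    return [e for e in examples if e[a] == v]

def avg_entropy(examples, a):
    # H(x, a) = sum_v k_v * E(fix(x, a, v)), with k_v the fraction of examples where a = v
    n = len(examples)
    return sum(len(fix(examples, a, v)) / n * entropy(fix(examples, a, v))
               for v in {e[a] for e in examples})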

27
(No Transcript)
28
(No Transcript)
29
OK, now we want a quantitative way of seeing the effect of splitting the dataset on a particular attribute (which is part of the tree-building process). We can use a measure called Information Gain, which calculates the reduction in entropy (gain in information) that would result from splitting the data on an attribute A:

Gain(S, A) = Entropy(S) - Σ_v (|Sv| / |S|) Entropy(Sv)

where v is a value of A, Sv is the subset of instances of S where A takes the value v, and |S| is the number of instances.
30
Continuing with our example dataset (let's name it S just for convenience), let's work out the Information Gain that splitting on the attribute District would give over the entire dataset.
  • So by calculating this value for each attribute that remains, we can see which attribute splits the data most purely. Let's imagine we want to select an attribute for the root node; then performing the above calculation for all attributes gives
  • Gain(S, House Type) = 0.049 bits
  • Gain(S, Income) = 0.151 bits
  • Gain(S, Previous Customer) = 0.048 bits

31
We can clearly see that District results in the highest reduction in entropy, or the highest information gain. We would therefore choose it at the root node, splitting the data up into subsets corresponding to all the different values of the District attribute. With this node-evaluation technique we can proceed recursively through the subsets we create until leaf nodes have been reached throughout and all subsets are pure, with zero entropy. This is exactly how ID3 and other variants work.
32
Example 1: If S is a collection of 14 examples with 9 YES and 5 NO examples, then
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Notice that entropy is 0 if all members of S belong to the same class (the data is perfectly classified). The range of entropy is 0 ("perfectly classified") to 1 ("totally random").
Gain(S, A), the information gain of example set S on attribute A, is defined as
Gain(S, A) = Entropy(S) - Σ_v ((|Sv| / |S|) Entropy(Sv))
where the sum is over each value v of all possible values of attribute A, Sv = the subset of S for which attribute A has value v, |Sv| = the number of elements in Sv, and |S| = the number of elements in S.
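A quick check of Example 1's numbers (a sketch, not from the slides):

from math import log2

def entropy_from_counts(*counts):
    # Entropy = -sum_i p_i log2(p_i) over the class proportions p_i
    n = sum(counts)
    return -sum((k / n) * log2(k / n) for k in counts if k)

print(round(entropy_from_counts(9, 5), 3))   # 0.94  (9 YES, 5 NO)
print(entropy_from_counts(14, 0))            # 0.0 -> perfectly classified
print(entropy_from_counts(7, 7))             # 1.0 -> totally random (two classes)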
33
Example 2: Suppose S is a set of 14 examples in which one of the attributes is wind speed. The values of Wind can be Weak or Strong. The classification of these 14 examples is 9 YES and 5 NO. For attribute Wind, suppose there are 8 occurrences of Wind = Weak and 6 occurrences of Wind = Strong. For Wind = Weak, 6 of the examples are YES and 2 are NO. For Wind = Strong, 3 are YES and 3 are NO. Therefore
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_weak) - (6/14) Entropy(S_strong)
              = 0.940 - (8/14)(0.811) - (6/14)(1.00)
              = 0.048
Entropy(S_weak)   = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Entropy(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.00
For each attribute, the gain is calculated and the highest gain is used in the decision node.
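The same arithmetic in code (a sketch; the helper from the previous example is repeated so the snippet runs on its own):

from math import log2

def entropy_from_counts(*counts):
    n = sum(counts)
    return -sum((k / n) * log2(k / n) for k in counts if k)

e_weak   = entropy_from_counts(6, 2)   # 0.811
e_strong = entropy_from_counts(3, 3)   # 1.000
gain_wind = entropy_from_counts(9, 5) - (8/14) * e_weak - (6/14) * e_strong
print(round(gain_wind, 3))             # 0.048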
34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
  • Decision Tree Construction Algorithm
    (pseudo-code)
  • Input A data set, S Output A decision tree
  • If all the instances have the same value for the
    target attribute then return a decision tree that
    is simply this value (not really a tree - more of
    a stump).
  • Else
  • Compute Gain values (see above) for all
    attributes and select an attribute with the
    lowest value and create a node for that
    attribute.
  • Make a branch from this node for every value of
    the attribute
  • Assign all possible values of the attribute to
    branches.
  • Follow each branch by partitioning the dataset to
    be only instances whereby the value of the branch
    is present and then go back to 1.
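A compact ID3 sketch following this pseudo-code (an illustration; the dict-per-row data layout and the attribute names in the usage comment are assumptions):

from math import log2
from collections import Counter

def entropy(rows, target):
    n = len(rows)
    return -sum((k / n) * log2(k / n)
                for k in Counter(r[target] for r in rows).values())

def gain(rows, attr, target):
    # Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) Entropy(Sv)
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == v]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attrs, target):
    classes = Counter(r[target] for r in rows)
    if len(classes) == 1 or not attrs:          # pure subset (or nothing left to split on):
        return classes.most_common(1)[0][0]     # return a "stump", i.e. just the class value
    best = max(attrs, key=lambda a: gain(rows, a, target))   # attribute with the highest gain
    node = {best: {}}
    for v in {r[best] for r in rows}:           # one branch per observed value of the attribute
        subset = [r for r in rows if r[best] == v]
        node[best][v] = id3(subset, [a for a in attrs if a != best], target)
    return node

# Usage with the weather table on slide 42 (rows as a list of dicts):
# tree = id3(rows, ['Outlook', 'Temperature', 'Humidity', 'Windy'], 'Play?')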

42
Decision Tree Example
Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No
43
(Figure: the resulting decision tree. Root: Outlook. Outlook = overcast -> Yes. Outlook = sunny -> Humidity: high -> No, normal -> Yes. Outlook = rain -> Windy: false -> Yes, true -> No.)
44
Which Attributes to Select?
45
A Criterion for Attribute Selection
Which is the best attribute?
The one which will result in the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes.

(Figure: the Outlook attribute; its overcast branch is pure, containing only Yes.)
46
  • Information gain
  • (information before split) - (information after split)
  • Information gain for attributes from weather
    data

47
Continuing to Split
48
The Final Decision Tree
Splitting stops when the data can't be split any further.
49
Person   Hair Length   Weight   Age   Class
Homer         0          250     36     M
Marge        10          150     34     F
Bart          2           90     10     M
Lisa          6           78      8     F
Maggie        4           20      1     F
Abe           1          170     70     M
Selma         8          160     41     F
Otto         10          180     38     M
Krusty        6          200     45     M
Comic         8          290     38     ?
50
Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Hair Length.

(Figure: the split "Hair Length ≤ 5?" with yes and no branches)

Entropy(1F, 3M) = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.8113
Entropy(3F, 2M) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.9710

Gain(Hair Length ≤ 5) = 0.9911 - (4/9 × 0.8113 + 5/9 × 0.9710) = 0.0911
51
Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Weight.

(Figure: the split "Weight ≤ 160?" with yes and no branches)

Entropy(4F, 1M) = -(4/5) log2(4/5) - (1/5) log2(1/5) = 0.7219
Entropy(0F, 4M) = -(0/4) log2(0/4) - (4/4) log2(4/4) = 0

Gain(Weight ≤ 160) = 0.9911 - (5/9 × 0.7219 + 4/9 × 0) = 0.5900
52
Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Age.

(Figure: the split "Age ≤ 40?" with yes and no branches)

Entropy(3F, 3M) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Entropy(1F, 2M) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183

Gain(Age ≤ 40) = 0.9911 - (6/9 × 1 + 3/9 × 0.9183) = 0.0183
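A quick numeric check of the three gains (a sketch; the class counts per branch are read off the table on slide 49):

from math import log2

def H(*counts):
    # entropy of a class split, given raw counts
    n = sum(counts)
    return -sum((k / n) * log2(k / n) for k in counts if k)

base = H(4, 5)                                            # Entropy(4F, 5M) = 0.9911
gain_hair   = base - (4/9) * H(1, 3) - (5/9) * H(3, 2)    # 0.0911
gain_weight = base - (5/9) * H(4, 1) - (4/9) * H(0, 4)    # 0.5900
gain_age    = base - (6/9) * H(3, 3) - (3/9) * H(1, 2)    # 0.0183
print(round(gain_hair, 4), round(gain_weight, 4), round(gain_age, 4))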
53
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified. So we simply recurse!

(Figure: the root split "Weight ≤ 160?"; the yes branch is split again)

This time we find that we can split on Hair Length, and we are done!

(Figure: the second split "Hair Length ≤ 2?")
54
We don't need to keep the data around, just the test conditions.

(Figure: the final tree. Weight ≤ 160? no -> Male; yes -> Hair Length ≤ 2? yes -> Male, no -> Female)

How would these people be classified?
55
It is trivial to convert Decision Trees to rules.

(Figure: the same decision tree. Weight ≤ 160? no -> Male; yes -> Hair Length ≤ 2? yes -> Male, no -> Female)

Rules to Classify Males/Females:
If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
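The same rules written as straight-line code (a sketch; the classify signature is an assumption for illustration):

def classify(weight, hair_length):
    # the three rules above, applied in order
    if weight > 160:
        return 'Male'
    elif hair_length <= 2:
        return 'Male'
    else:
        return 'Female'

print(classify(290, 8))   # e.g. the unlabeled "Comic" row from slide 49 -> 'Male'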
56
Once we have learned the decision tree, we don't even need a computer!
This decision tree is attached to a medical
machine, and is designed to help nurses make
decisions about what type of doctor to call.
Decision tree for a typical shared-care setting
applying the system for the diagnosis of
prostatic obstructions.
57
The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data: when you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.

(Figure: a one-question tree, "Wears green?": Yes -> Male, No -> Female)

For example, the rule "Wears green?" perfectly classifies the data, but so does "Mother's name is Jacqueline?", and so does "Has blue shoes".