Title: PowerPoint Presentation
1 Binary Encodings
A binary encoding of an object M is an assignment of a string of zeros and ones to M. For our purposes, M will be a string of symbols from some alphabet (Greek, Arabic, Hebrew, ASCII, ...). One use of encoding is error detection and correction. The situation there is that the binary string is sent over a binary channel in which errors may occur, though infrequently. An error consists of a bit being reversed (0 to 1, or vice versa).
2 Binary Encodings
If we detect errors, we can request retransmission. For certain codes, we can determine the location of the errors and thus recover the original message without retransmission. Such codes are called error-correcting codes. Each such code is capable of correcting up to k bit errors, for some integer k. Codes of this kind are studied in the area known as Coding Theory and involve some very sophisticated mathematics.
3 Binary Codes
In the symbol-by-symbol encoding method, we first choose a string of 0s and 1s for each symbol in an alphabet S; such an assignment is called a binary code for the alphabet. The string assigned to a symbol s is called the codeword for s.
Given an alphabet S, we denote by S* the set of all finite strings of symbols from S. The string of length zero is denoted by λ.
Thus {0,1}* = {λ, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, ...}
Formally, a binary code for alphabet S is a function g : S → {0,1}*.
4 Symbol-by-Symbol Encoding
Given a code g : S → {0,1}* for alphabet S, we define g* : S* → {0,1}* by
g*(s1 s2 ... sn) = g(s1) g(s2) ... g(sn)
Example: Suppose S = {a, b} and g is defined by g(a) = 01, g(b) = 11. Then
g*(aba) = g(a) g(b) g(a) = 011101
and
g*(bab) = g(b) g(a) g(b) = 110111.
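As a sketch, g* is just codeword concatenation. The C fragment below (hypothetical helper names, hard-coded to the two-symbol code above) illustrates this:

```c
#include <string.h>

/* Codeword table for the example code: g(a) = 01, g(b) = 11. */
static const char *g(char s) {
    return (s == 'a') ? "01" : "11";
}

/* g*: encode a string over {a, b} by concatenating the codewords.
   out must be large enough to hold the result. */
void encodeStar(const char *msg, char *out) {
    out[0] = '\0';
    for (; *msg != '\0'; msg++)
        strcat(out, g(*msg));
}
```

For instance, encodeStar("aba", buf) leaves the string 011101 in buf, matching the example above.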
5 An important example of a binary code is the ASCII code for characters.
The characters on which the ASCII code is defined include all upper- and lower-case English letters and punctuation symbols as well as control characters. The original ASCII code covered 128 characters, and in this case each codeword consisted of 7 bits; an extra bit was usually added for error detection.
A 256-character set was introduced later for IBM PCs and was called the IBM extended ASCII character set. For this character set, 8-bit codewords are required.
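The error-detection bit mentioned above can be sketched as a parity computation. This assumes even parity (the eighth bit is chosen so every byte carries an even number of 1s), which the slide does not specify:

```c
/* Even-parity bit for a 7-bit ASCII codeword: returns 1 exactly when the
   7 data bits contain an odd number of 1s, so that all 8 bits together
   always hold an even number of 1s. A single flipped bit then makes the
   count odd, which the receiver can detect. */
int parityBit(unsigned char code7) {
    int ones = 0;
    for (int i = 0; i < 7; i++)
        ones += (code7 >> i) & 1;
    return ones % 2;
}
```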
6 Example: ASCII-coded string
Consider the string bye bye baseball.
From a table of ASCII codes we find the following code values: b = 1100010, y = 1111001, e = 1100101, space = 0100000, a = 1100001, s = 1110011, l = 1101100.
Thus the ASCII encoding for bye bye baseball is
1100010 1111001 1100101 0100000 1100010 1111001 1100101 0100000 1100010 1100001 1110011 1100101 1100010 1100001 1101100 1101100
(spaces shown between codewords only for readability).
7 Since the string bye bye baseball is 16 characters long and each character is represented by a 7-bit string of 0s and 1s, the number of bits in the ASCII encoding of bye bye baseball is 16 × 7 = 112.
The ASCII code is an example of a fixed-length code: all codewords have the same length. That length cannot be less than 7, since for any positive integer n there are 2^n different binary strings of length n, and 7 is the least integer n with 2^n ≥ 128.
If bye bye baseball were the only string we wanted to encode, we could use a shorter fixed-length code. There are only 7 distinct characters in our string, thus we could use a 3-bit code, since 2^3 = 8 ≥ 7.
8 A 3-bit Binary Code
Here is a possible 3-bit binary code for the alphabet {space, a, b, e, l, s, y}. (The code table itself was shown on the slide; reading it off the encoded string below gives space = 000, a = 001, b = 010, l = 011, e = 100, s = 101, y = 110.)
Using this code, bye bye baseball is represented by the binary string
010110100000010110100000010001101100010001011011
This string is just 16 × 3 = 48 bits long, a savings of 64 bits.
9 Data Compression
We have seen that the ASCII encoding of bye bye baseball produces a bit string of length 112, but that we could use a 3-bit fixed-length code to get an encoding of length 48.
This is an example of data compression. The natural question is whether we could find a code that would require fewer than 48 bits for the encoding of bye bye baseball.
If we are restricted to fixed-length codes, the answer is no.
Thus we will consider variable-length codes.
10 Great! We have reduced the number of bits from 48 to 27.
But there is a problem here: how do we decode the string?
For fixed-length codes, decoding is trivial: we just break the bit string into fixed-length substrings and use a table lookup.
We are in trouble from the start with our example variable-length code: is the first 0 the code for a 'b', or the beginning of the code for a space?
11 The previous example suggests that we require that no codeword be a prefix (initial substring) of any other codeword. Such a code is called a prefix code.
Using this code, the encoding of bye bye baseball is
0011010110000011010110000010110010100101111111
The number of bits in the above encoding of bye bye baseball is 46, which beats the fixed-length encoding by 2.
12 The code to the right is a prefix code. The table shows the frequency of each character in the string bye bye baseball. To compute the number of bits in the encoding of bye bye baseball, we need only take the product of the frequency of each character with the length of its codeword and then add these products:
2·3 + 2·3 + 4·2 + 3·3 + 2·3 + 1·3 + 2·3 = 44
This is the shortest encoding yet. We will see later that it is the best possible for a symbol-by-symbol encoding.
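The computation above is a simple weighted sum; a C sketch (function name is illustrative, values taken from the frequency-times-length products above):

```c
/* Number of bits in a symbol-by-symbol encoding:
   sum over all characters of frequency * codeword length. */
int encodedLength(const int freq[], const int len[], int n) {
    int bits = 0;
    for (int i = 0; i < n; i++)
        bits += freq[i] * len[i];
    return bits;
}
```

For the bye bye baseball frequencies and codeword lengths this sum is 44.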
13 We can think of the contents of a file as a string M of ASCII characters.
To compress the file, we try to find a prefix code that gives the least number of bits in the encoded form of the string M.
If µ is a string over some alphabet, let len(µ) denote the length of µ. Thus len(0110111) = 7 and len(bye bye baseball) = 16.
Definition. Let M be a string over alphabet S and let g be a binary code for S. Then g is said to be an optimal code for M if for every binary code h for S, len(g*(M)) ≤ len(h*(M)).
14 Suppose file F contains the string bye bye baseball. The characters in the file are stored using the 8-bit ASCII code, thus one character per byte, so the file has length 16 bytes.
The bit-encoded form of bye bye baseball using the above code (recoverable from the bit string: b = 10, s = 000, space = 001, l = 010, a = 011, y = 110, e = 111) is
10110111001101101110011001100011110011010010
which consists of 44 bits. But a file must consist of bytes, and thus its length must be a multiple of 8. What we must do is pad the encoded string with 4 zeros on the right to fill out the last byte:
101101110011011011100110011000111100110100100000
15 The encoded message, filled out to an integral number of bytes, is
101101110011011011100110011000111100110100100000
The byte structure could now look somewhat like this:
10110111 00110110 11100110 01100011 11001101 00100000
So the compressed file would consist of 6 bytes, as opposed to the 16 bytes in the uncompressed file.
Thus the compressed file is 6/16 the size of the original file, which is to say, the compressed file size is 37.5% of the original file size.
We will discuss an algorithm for constructing an optimal code for a given message M later; the resulting code is called a Huffman code, after David Huffman, who invented the algorithm.
16 The question now arises as to how we output the bits in such a way that they are packed into bytes as shown on the previous slide.
File operations in C operate at the byte level, not the bit level, so we must provide our own data types and functions for bit output. We do that by means of two ADTs called bitOStream and bitIStream.
While we defer the representation details until later, we should note that a bitIStream B has an associated FILE object B.inFile that contains the bits to be input via B. Similarly, a bitOStream B has an associated file B.outFile where bits that are output using B are stored.
The function headers are given on the next slides. The typedefs are postponed until we consider the implementation of the bitIO ADT.
17 Boolean openIStream(bitIStream *B, FILE *inF)
/* Pre:  inF has previously been opened for input.
   Post: If inF is a NULL pointer or the file is empty, the function
         returns FALSE. Otherwise, the stream is initialized to obtain
         bit input from the file inF, and the function returns TRUE. */
18 void closeIStream(bitIStream *B)
/* Pre:  B has previously been opened.
   Post: The file attached to B has been closed. */

Boolean getBit(bitIStream *B, char *theBit)
/* Pre:  B has been opened.
   Post: If no valid bits remain on the bitIStream, the function
         returns FALSE. Otherwise, *theBit is set to '0' if the next
         available bit is a zero, otherwise it is set to '1', and the
         function returns TRUE. */
19 Boolean openOStream(bitOStream *B, FILE *outF)
/* Pre:  outF has been opened for output.
   Post: If outF is a NULL pointer, the function immediately returns
         FALSE. Otherwise, the stream is initialized for bit output
         and the function returns TRUE. */

void closeOStream(bitOStream *B)
/* Pre:  B has previously been opened.
   Post: All valid bits have been written to B->outFile and
         B->outFile has been closed. */
20 void putBit(bitOStream *B, char theBit)
/* Pre:  B has been opened for output and theBit contains either the
         character '0' or the character '1'.
   Post: The bit corresponding to the character theBit has been
         inserted into the output bit stream. */
21 void putBitString(bitOStream *B, char *theBits)
/* Pre:  B has been opened for output and theBits contains a string of
         '0's and '1's.
   Post: The bits corresponding to the bit characters in theBits have
         been inserted into the output bit stream using putBit. */
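As a preview of the implementation (the actual typedefs come later; this struct layout is only an assumed sketch), putBit can buffer bits most-significant-first and emit a byte whenever eight have accumulated, with closeOStream padding the final byte with zeros as in the bye bye baseball example:

```c
#include <stdio.h>

/* Assumed representation: a destination file plus a partial byte. */
typedef struct {
    FILE *outFile;      /* completed bytes are written here */
    unsigned char buf;  /* bits collected so far, MSB-first  */
    int nbits;          /* how many bits of buf are valid    */
} bitOStream;

void putBit(bitOStream *B, char theBit) {
    B->buf = (unsigned char)((B->buf << 1) | (theBit == '1'));
    if (++B->nbits == 8) {          /* byte is full: flush it */
        fputc(B->buf, B->outFile);
        B->buf = 0;
        B->nbits = 0;
    }
}

void closeOStream(bitOStream *B) {
    while (B->nbits != 0)           /* pad the last byte with 0 bits */
        putBit(B, '0');
    fclose(B->outFile);
}
```

Writing the bit characters 10110111 through putBit, for example, produces the single byte 10110111.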
22 Prefix Codes and Binary Trees
We can represent a prefix code by means of a binary tree.
Each node v of a binary tree may be associated with a string of 0s and 1s. To do so, we think of each edge in the binary tree as being labelled with a 0 if it connects a node to its left child and with a 1 if it connects a node to its right child.
Then the binary string associated with a node v is the concatenation of the edge labels on the path from the root to v.
23 Note that if a binary string α is associated with a node of a binary tree by the method above, then every prefix of α is also associated with some node of the tree. Moreover, that node is an ancestor of the node associated with α.
24 Given a binary code, we consider the set of prefixes of all its codewords.
Example: (the example code and its prefix set were shown on the slide)
We can now define a tree with one node for each string in the set of prefixes, where if v is the node associated with prefix p, then its left child is the node associated with string p0 (if it is in the set) and its right child is the node associated with the string p1 (if it is in the set).
25 Binary tree defined by the prefix set
The tree representation for the code is now obtained by attaching each symbol to the vertex labeled by its codeword. Since we have a prefix code, these nodes will be exactly the set of leaves.
26 Binary tree representation of the code
27 Decoding Using the Binary Tree Representation
The binary tree representation of a prefix code may be used to decode a bit-encoded string.
We start at the root and process the bits in the string by moving left for a 0 and right for a 1 until we hit a leaf. At that point we output the symbol attached to the leaf and restart at the root.
If we run out of bits while we are at an internal node of the tree, we have an improperly formed encoding.
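The decoding procedure above can be sketched directly in C. The tree is built by following each codeword's bits from the root; the codewords used in the example below are the ones recoverable from the earlier 44-bit encoding of bye bye baseball (b = 10, s = 000, space = 001, l = 010, a = 011, y = 110, e = 111):

```c
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    struct Node *child[2];  /* child[0] for bit 0, child[1] for bit 1 */
    char symbol;            /* meaningful only at a leaf */
} Node;

static Node *newNode(void) { return calloc(1, sizeof(Node)); }

/* Attach a symbol at the node addressed by the codeword's bit path,
   creating internal nodes along the way as needed. */
void insertCode(Node *root, const char *codeword, char symbol) {
    for (; *codeword != '\0'; codeword++) {
        int bit = *codeword - '0';
        if (root->child[bit] == NULL)
            root->child[bit] = newNode();
        root = root->child[bit];
    }
    root->symbol = symbol;
}

/* Follow bits from the root; at each leaf, emit its symbol and restart. */
void decodeBits(const Node *root, const char *bits, char *out) {
    const Node *v = root;
    for (; *bits != '\0'; bits++) {
        v = v->child[*bits - '0'];
        if (v->child[0] == NULL && v->child[1] == NULL) {  /* leaf */
            *out++ = v->symbol;
            v = root;
        }
    }
    *out = '\0';
}
```

With that code installed in the tree, decodeBits applied to 1101110110011001110110 recovers the message yea baby, as worked out on the following slides.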
28-36 Decoding example: processing the encoded message 1101110110011001110110 bit by bit, the decoded message grows one symbol at a time across the slides:
y, ye, yea, yea (then a space), yea b, yea ba, yea bab, yea baby.
37 Huffman Coding
We'll use Huffman's algorithm to construct an optimal tree that is used for data compression.
Assume each character has an associated weight equal to the number of times the character occurs in a file.
Example: In "bye bye baseball", the character 'b' has weight 4, the character 'e' has weight 3, the 's' has weight 1, and the other characters have weight 2.
When compressing a file we'll need to calculate these weights; we'll ignore this step for now and assume that all character weights have been calculated.
38 Huffman's algorithm assumes that we're building a single tree from a group (or forest) of trees. Initially, all the trees have a single node containing a character and the character's weight.
Trees are combined by picking two trees and making a new tree from the two trees and a new node to serve as the root of the new tree. This decreases the number of trees by one at each step, since two trees are combined into one tree.
39 The algorithm is as follows:
- 1. Begin with a forest of one-node trees. Each tree's root is labeled by one of the characters and has weight equal to the weight of the character in the node.
- 2. Repeat this step until there is only one tree:
  - Choose two trees with the smallest weights; call these trees T1 and T2.
  - Create a new tree whose root has a weight equal to the sum of the weights of T1 and T2, whose left subtree is T1 and whose right subtree is T2.
- 3. The single tree left after the previous step is the encoding tree.
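The pick-two-smallest-and-merge loop can be sketched without building tree nodes at all: each merge adds the sum of the two smallest weights to the total encoded length (every leaf below the merge ends up one bit deeper), so tracking weights alone already yields the optimal bit count. Function and variable names here are illustrative:

```c
/* Cost (total encoded bits) of the Huffman tree for the given weights,
   computed by repeatedly merging the two smallest weights.
   The array w is modified in place; O(n^2) is fine for small alphabets. */
int huffmanCost(int w[], int n) {
    int total = 0;
    while (n > 1) {
        int a = 0, b = 1;                 /* indices of the two smallest */
        if (w[b] < w[a]) { int t = a; a = b; b = t; }
        for (int i = 2; i < n; i++) {
            if (w[i] < w[a])      { b = a; a = i; }
            else if (w[i] < w[b]) { b = i; }
        }
        int sum = w[a] + w[b];
        total += sum;       /* merge: all leaves below get one bit deeper */
        w[a] = sum;         /* the merged tree replaces the pair */
        w[b] = w[--n];
    }
    return total;
}
```

For the bye bye baseball weights (4, 3, 2, 2, 2, 2, 1) this yields 44, matching the encoded length computed earlier.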
41 We'll use the string "bye bye baseball" as an example.
48 The tree produced by the Huffman algorithm is the one we used previously to optimally encode the message bye bye baseball.
49 Why is Huffman Coding Greedy?
Huffman's algorithm is an example of a greedy algorithm. It's called greedy because the two smallest nodes are chosen at each step, and this local decision results in a globally optimal encoding tree. In general, greedy algorithms use small-grained, local minimal/maximal choices in an attempt to reach a global minimum/maximum.
50 Making change using U.S. money is another example of a greedy algorithm.
Problem: give change in U.S. coins for any amount (say under $1.00) using the minimal number of coins.
Solution (assuming coin denominations of $0.25, $0.10, $0.05, and $0.01, called quarters, dimes, nickels, and pennies, respectively): Use the highest-value coin that you can, and give as many of these as you can. Repeat the process until the correct change is given.
51 Example: make change for $0.91. Use 3 quarters (the highest coin we can use, and as many as we can use). This leaves $0.16. To make change for that, use a dime (leaving $0.06), a nickel (leaving $0.01), and a penny. The total change for $0.91 is three quarters, a dime, a nickel, and a penny. This is a total of six coins; it is not possible to make change for $0.91 using fewer coins.
The solution/algorithm is greedy because the largest-denomination coin is chosen at each step, and as many are used as possible. This locally optimal step leads to a globally optimal solution.
52 Note that the algorithm does not work for all sets of denominations. For example, if there are no nickels, the algorithm will make change for $0.31 using one quarter and six pennies, a total of seven coins. However, it's possible to use three dimes and one penny, a total of four coins.
This shows that greedy algorithms are not always optimal algorithms.
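A sketch of the greedy change-maker (amounts in cents, denominations listed largest first; names are illustrative):

```c
/* Greedy change-making: for each denomination, largest first, use as
   many coins as fit, then move on to the next denomination.
   denoms[] must be sorted in decreasing order; amount is in cents.
   Returns the total number of coins handed out. */
int greedyChange(int amount, const int denoms[], int n) {
    int coins = 0;
    for (int i = 0; i < n; i++) {
        coins += amount / denoms[i];  /* as many of this coin as possible */
        amount %= denoms[i];          /* what is left to change */
    }
    return coins;
}
```

With the U.S. denominations {25, 10, 5, 1}, change for 91 cents takes 6 coins; dropping the nickel to get {25, 10, 1} gives 7 coins for 31 cents, even though 4 coins (three dimes and a penny) suffice, as noted above.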
53 Homework
Apply the Huffman algorithm to construct an optimal code tree for a file with the following frequencies:
character:  z  t  s  a  e
frequency:  1  2  4  5  6