Title: INFORMATION THEORY
1. INFORMATION THEORY
Wayne Lawton, Department of Mathematics, National University of Singapore, S14-04-04
matwml@nus.edu.sg, http://math.nus.edu.sg/matwml
2. CHOICE CONCEPT
- We are confronted by choices every day: for instance, we may choose to purchase an apple, a banana, or a coconut, or to withdraw k < N dollars from our bank account.
Choice has two aspects. We may need to make the decision of what fruit to purchase or how many dollars to withdraw; we will ignore this aspect, which is the concern of Decision Theory.
We may also need to communicate this choice to
our food seller or our bank. This is the concern
of Information Theory.
3. INCREASING CHOICE?
- The number of choices increases if more elements are added to the set of existing choices. For example, a shopper who is to choose one fruit from a store that carries Apples, Bananas, and Coconuts may discover that the store has added Durians and Elderberries; the number of choices increases from 3 to 5 by the addition of 2 extra choices.
The number of choices also increases if two or more sets of choices are combined. A shopper may have the choice of choosing one fruit from {A, B, C} on Monday and one fruit from {D, E} on Thursday, giving 3 x 2 = 6 combined choices. Compare with the case above.
4. MEASURING CHOICE?
- For any set X let |X| denote the number of elements in X. Then |X x Y| = |X| |Y|.
We require that the information required to specify the choice of an element from a set be an additive function, I(X x Y) = I(X) + I(Y), therefore
Theorem 1. The logarithm functions measure information: I(X) = log_a |X|,
log to the base a, where a > 0; the units are called bits if a = 2 and nats if a = e.
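A quick numerical illustration of Theorem 1, a minimal sketch using the shopper example above:

info = @(n) log2(n);       % information, in bits, to specify one choice from a set of size n
info(3)                    % one fruit from {A,B,C}: about 1.585 bits
info(5)                    % after Durians and Elderberries are added: about 2.322 bits
% Additivity over combined choices (Monday fruit and Thursday fruit):
abs(info(3*2) - (info(3) + info(2)))   % essentially 0, since log(xy) = log(x) + log(y)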
5. FACTORIAL FUNCTION
- For any positive integer n define n factorial by n! = n (n-1) (n-2) ... 2 1.
Often the choices within different sets are mutually constrained. If a shopper is to purchase a different fruit from the set {A, B, C, D, E} on each of 5 days, then there are 5 choices on the first day but only 4 choices on the second day, etc., so the total number of choices equals 5 x 4 x 3 x 2 x 1 = 5! = 120.
6. STIRLING'S APPROXIMATION
Theorem 2 (Stirling's Approximation). n! is asymptotic to sqrt(2 pi n) (n/e)^n; the constant is sqrt(2 pi).
Proof: [K], pages 111-115.
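A short MATLAB check of Stirling's Approximation, in the spirit of Problem 2 below; the sample values of n are arbitrary:

% Compare n! with Stirling's approximation sqrt(2*pi*n)*(n/e)^n
for n = [5 10 20 50 100]
  exact  = factorial(n);
  approx = sqrt(2*pi*n) * (n/exp(1))^n;
  fprintf('n = %3d   ratio exact/approx = %.6f\n', n, exact/approx);
end
% The ratio tends to 1 as n grows, as Theorem 2 asserts.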
7. COMBINATORICS
(s + t)^n = sum over k from 0 to n of C(n,k) s^k t^(n-k),
where C(n,k), called 'n choose k', is the number of ways of choosing k elements from a set with n elements. Furthermore, n choose k equals
C(n,k) = n! / ( k! (n-k)! ).
Proof. Consider that (s + t)^n is the product of n factors and it equals the sum of 2^n terms; each term is obtained by specifying a choice of s or t from each factor, and the number of terms with s^k t^(n-k) is exactly the number of ways of choosing, out of the n factors, the k factors that contribute s.
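A small numeric check of the binomial coefficient, a sketch with arbitrary values n = 5, k = 2:

n = 5; k = 2;
nchoosek(n, k)                                 % built-in: 10 ways to choose 2 of 5 elements
factorial(n) / (factorial(k)*factorial(n-k))   % the formula n!/(k!(n-k)!): also 10
size(nchoosek(1:n, k), 1)                      % explicit list of all k-element subsets has 10 rows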
8. MULTINOMIAL COEFFICIENTS
- Theorem 3. The number of sequences of length N with n_i symbols of the i-th type, i = 1, ..., M, equals
N! / ( n_1! n_2! ... n_M! ),
where n_1 + n_2 + ... + n_M = N.
9. SHANNON'S FORMULA
- Theorem 4. For large N, the average information per symbol of a string of length N containing M symbols with probabilities p_1, ..., p_M is
H = - ( p_1 log2 p_1 + ... + p_M log2 p_M ).
Proof. The law of large numbers says that the i-th symbol will occur approximately N p_i times, so the result follows from Theorem 3 and Stirling's Approximation.
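A MATLAB check that (1/N) log2 of the multinomial coefficient approaches Shannon's H; the probabilities 1/2, 1/3, 1/6 and the length N are chosen only for illustration:

p = [1/2 1/3 1/6];  N = 600;
counts = [300 200 100];                         % approximately N*p_i occurrences of each symbol
% log2 of N!/(n1! n2! n3!) computed stably with gammaln (log of the gamma function)
logMultinomial = (gammaln(N+1) - sum(gammaln(counts+1))) / log(2);
logMultinomial / N                              % average information per symbol, about 1.44
-sum(p .* log2(p))                              % Shannon's formula H, about 1.459 bits
% They agree up to a correction of order log(N)/N, which vanishes for large N (Theorem 4).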
11. ENTROPY OF A RANDOM VARIABLE
Let X denote a random variable with values in a set A = {a_1, ..., a_M} with probabilities p_i = Prob(X = a_i). We define the entropy H(X) by
H(X) = - sum over i of p_i log2 p_i   bits.
Recall that for a large integer N, N H(X) approximately equals the log of the number of strings of length N from A whose relative frequencies of letters are these probabilities.
Hence the entropy of a random variable equals the average information required to describe the values that it takes. It takes 1000 bits to describe 1000 flips of a fair coin, but we can describe the loaded coin sequence HHHHHHTHHHHHHHHHTT by its run lengths 6H1T9H2T.
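A minimal entropy computation in MATLAB; the loaded-coin probability 0.9 is illustrative, not taken from the slide:

H = @(p) -sum(p .* log2(p));     % entropy in bits of a probability vector p
H([0.5 0.5])                     % fair coin: 1 bit per flip
H([0.9 0.1])                     % loaded coin: about 0.469 bits per flip
% 1000 flips of the loaded coin therefore need only about 469 bits on average,
% which is why run-length descriptions like 6H1T9H2T can be so short.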
12. JOINT DISTRIBUTION
Let X denote a random variable with values in a set A = {a_1, ..., a_m} with probabilities p_i = Prob(X = a_i), and let Y be a random variable with values in a set B = {b_1, ..., b_n} with probabilities q_j = Prob(Y = b_j).
Then (X, Y) is a random variable with values in A x B whose probabilities r_ij = Prob(X = a_i, Y = b_j) satisfy the marginal equations (of which m + n - 1 are independent)
sum over j of r_ij = p_i,    sum over i of r_ij = q_j.
13. MUTUAL INFORMATION
The joint entropy of X and Y, H(X,Y) = - sum over i,j of r_ij log2 r_ij, satisfies
H(X,Y) <= H(X) + H(Y).
Equality holds if and only if X and Y are independent; this means that r_ij = p_i q_j for all i, j.
The mutual information of X and Y, defined by I(X;Y) = H(X) + H(Y) - H(X,Y), satisfies
I(X;Y) >= 0,
and I(X;Y) = 0 if and only if X and Y are independent.
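A worked example in MATLAB; the joint table r is invented for illustration and is not from the slides:

r = [0.4 0.1; 0.1 0.4];                 % joint probabilities r_ij: rows = values of X, columns = values of Y
p = sum(r, 2);  q = sum(r, 1);          % marginal distributions of X and of Y
H  = @(v) -sum(v(:) .* log2(v(:)));     % entropy of any probability array
Hx = H(p), Hy = H(q), Hxy = H(r)        % H(X) = H(Y) = 1,  H(X,Y) about 1.722
I  = Hx + Hy - Hxy                      % mutual information, about 0.278 bits >= 0
% If instead r = p*q (independent), then Hxy = Hx + Hy and I = 0.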
14. CHANNELS AND THEIR CAPACITY
A channel is a relationship between the transmitted message X and the received message Y.
Typically, this relationship does not determine Y as a function of X but only determines the statistics of Y given the value of X; together with the distribution of X, this determines the joint distribution of X and Y.
The channel capacity is defined as C = max over input distributions p(X) of I(X;Y).
Example: a binary channel with a 10% bit error rate has joint probabilities r = [0.45 0.05; 0.05 0.45] when the inputs are equally likely, and
max I(X;Y) = 0.531 bits
is attained for Prob(X = 0) = Prob(X = 1) = 1/2.
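A MATLAB sketch that recovers the 0.531-bit figure by scanning input distributions for the 10% binary channel; the scanning grid is an arbitrary implementation choice:

e = 0.1;                                   % probability that a transmitted bit is flipped
h = @(p) -p.*log2(p) - (1-p).*log2(1-p);   % binary entropy function
q = 0.01:0.01:0.99;                        % candidate values of Prob(X = 1)
I = h(q*(1-e) + (1-q)*e) - h(e);           % I(X;Y) = H(Y) - H(Y|X) for each input distribution
[C, k] = max(I);
C, q(k)                                    % C is about 0.531 bits, attained near q = 0.5
1 - h(e)                                   % closed form for the capacity of this channel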
15. SHANNON'S THEOREM
If a channel has capacity C then it is possible to send information over that channel at a rate arbitrarily close to, but never more than, C with a probability of error arbitrarily small.
Shannon showed that this was possible by proving that there existed a sequence of codes whose rates approached C and whose probabilities of error approached zero.
His masterpiece, called the Channel Coding Theorem, never actually constructed any specific codes, and thus provided jobs for thousands of engineers, mathematicians and scientists.
16. LANGUAGE AS A CODE
- During my first visit to Indonesia I ate a curious small fruit.
- Back in Singapore I went to a store and asked for a small fruit with the skin of a dark brown snake and more bitter than any gourd. Now I ask for 'Salak', a far more efficient, if less descriptive, code to specify my choice of fruit.
When I specify the number of dollars that I want
to withdraw from my bank account I use positional
notation (in base 10), a code to specify
nonnegative integers that was invented in
Babylonia (now Iraq) about 4000 years ago (in
base 60).
Digital computers, in contrast to analog computers, represent numbers using positional notation in base 2 (shouldn't they be called binary computers?). Is this because they can't count further than 1? These lectures will explore this and other intriguing mysteries.
17. WHAT IS THE BEST BASE?
- A base-B code of length L uses an ordered sequence of L symbols from a set of B symbols to represent B x B x ... x B = B^L (read B to the power L) choices.
Physically, this is represented using L devices, each of which can exist in one of B states.
The cost is L times the cost of each device, and the cost of each device is proportional to B, since physical material is required to represent each of the B-1 inactive states for each of the L devices that correspond to each position.
The efficiency of base B is therefore the ratio of the information in a base-B sequence of length L divided by B L, therefore
efficiency(B) = log(B^L) / (B L) = (log B) / B.
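A quick MATLAB look at the efficiency (log B)/B; the range of bases tried here is arbitrary:

B = 2:10;
eff = log(B) ./ B;                    % efficiency per unit cost, in nats
[best, k] = max(eff);
B(k)                                  % B = 3 is the most efficient integer base
fzero(@(b) 1 - log(b), [2 4])         % the derivative of (log B)/B vanishes at B = e over the reals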
19. IS THE SKY BLUE?
- If I use base 2 positional notation to specify that I want to withdraw d < 8 dollars from my bank account, then d is coded by 3 bits: 0 = 000, 1 = 001, 2 = 010, 3 = 011, 4 = 100, 5 = 101, 6 = 110, 7 = 111.
Positional notation is great for computing, but if I decide to withdraw 2 rather than 1 (or 4 rather than 3) dollars I must change my code by 2 (or 3) bits. Consider the Gray code 000, 001, 011, 010, 110, 111, 101, 100.
What's different?
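A MATLAB sketch of the reflected Gray code; the bitxor construction is one common recipe, assumed here rather than taken from the slides:

d = (0:7)';                                         % the numbers 0,...,7
binary = dec2bin(d, 3)                              % ordinary base-2 positional codes
gray   = dec2bin(bitxor(d, bitshift(d, -1)), 3)     % reflected Gray codes
% Successive Gray codewords differ in exactly one bit:
flips = sum(gray(1:end-1,:) ~= gray(2:end,:), 2)'   % all ones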
20. GRAY CODE GEOMETRY
- How many binary Gray codes of length 3 are there? And how can we construct them? Cube geometry gives the answers.
The bits in a code are the Cartesian coordinates of the vertices of the cube. The d-th and (d+1)-th vertices share a common edge.
Answer the questions.
21. PROBLEMS?
- 1. Derive Theorem 1. Hint review properties of
logarithms.
2. Write and run a simple computer program to demonstrate Stirling's Approximation.
3. Derive the formula for n choose k by induction
and then try to find another derivation. Then use
the other derivation to derive the multinomial
formula.
4. Complete the details for the second half of the derivation of Shannon's Formula for Information.
5. How many binary Gray codes are there of length
3?
22. ERROR CORRECTING CODES
- How many bits of information can be sent reliably by sending 3 bits if one of those bits may be corrupted? Consider the 3-dimensional binary hypercube.
H = the set of binary sequences of length 3; H has 8 sequences.
A code C is a subset of H.
The Hamming distance d(x,y) between x and y in H is the number of bits in which they differ. Hence d(010, 111) = 2.
The minimal distance d(C) of a code C is min { d(x,y) : x, y in C, x distinct from y }.
A code C can correct 1 error bit if and only if d(C) >= 3.
So we can send 1 bit reliably with the code C = {(000), (111)}.
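A MATLAB sketch of nearest-codeword decoding for the repetition code {000, 111}; the received word is an arbitrary example:

C = [0 0 0; 1 1 1];                          % the two codewords, d(C) = 3
r = [0 1 0];                                 % received word with one corrupted bit
dist = sum(C ~= repmat(r, size(C,1), 1), 2); % Hamming distance from r to each codeword
[~, k] = min(dist);
C(k, :)                                      % decodes to 000, so the transmitted bit was 0
% Since d(C) = 3, a single bit error leaves the received word closer to the
% transmitted codeword than to the other one.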
23. PARITY CODES
- If we wanted to send 4 bits reliably (to correct up to 1 bit error) then we could send each of these bits three times; this code consists of a set C of 16 sequences having length 12, and the code rate is only 33% since 12 bits are used to communicate 4 bits.
However, it is possible to send 4 bits reliably using only 8 bits by arranging the four bits in a 2 x 2 square and assigning 4 parity bits, one for each row and each column.
To send a sequence abcd we also send the parity bits a+b, c+d, a+c, b+d (addition mod 2).
Note: a single bit error in a, b, c, d results in odd parity in its row and in its column, which locates the error.
Ref. See rectangular and triangle codes in [H].
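A MATLAB sketch of the 2 x 2 parity code, encoding and then correcting a single error in a data bit; the example data and the error position are arbitrary:

data = [1 0 1 1];                                  % a b c d
encode = @(u) [u, mod([u(1)+u(2), u(3)+u(4), u(1)+u(3), u(2)+u(4)], 2)];
x = encode(data);                                  % 8 transmitted bits
y = x;  y(3) = 1 - y(3);                           % corrupt data bit c
rowBad = mod([y(1)+y(2)+y(5), y(3)+y(4)+y(6)], 2); % recomputed row parities
colBad = mod([y(1)+y(3)+y(7), y(2)+y(4)+y(8)], 2); % recomputed column parities
i = find(rowBad);  j = find(colBad);               % the failing row and column locate the error
y(2*(i-1)+j) = 1 - y(2*(i-1)+j);                   % flip the data bit at row i, column j
isequal(y(1:4), data)                              % true: the data bits are recovered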
24. HAMMING CODES
- The following [7,4,3] Hamming Code can send 4 bits reliably using only 7 bits; it has d(C) = 3.
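A MATLAB sketch of a [7,4,3] Hamming code using one standard systematic generator matrix; this particular G is an assumption, not necessarily the table shown on the slide:

G = [1 0 0 0 0 1 1;
     0 1 0 0 1 0 1;
     0 0 1 0 1 1 0;
     0 0 0 1 1 1 1];                 % 4 data bits followed by 3 parity bits
u = dec2bin(0:15, 4) - '0';          % all 16 possible 4-bit messages
C = mod(u * G, 2);                   % the 16 codewords of length 7
% Verify that the minimum distance is 3, so any single bit error can be corrected:
dmin = 7;
for i = 1:15
  for j = i+1:16
    dmin = min(dmin, sum(C(i,:) ~= C(j,:)));
  end
end
dmin                                 % prints 3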
25. OTHER CODES
- Hamming Codes are examples of cyclic group codes (why?).
BCH (Bose-Chaudhuri-Hocquenghem) codes are another class of cyclic group codes, generated by the coefficient sequences of certain irreducible polynomials over a finite field.
Reed-Solomon Codes were the first class of BCH codes to be discovered. They were first used by NASA for space communications and are now used for error correction in CDs.
Other codes include Convolutional, Goethals,
Golay, Goppa, Hadamard, Julin, Justesen,
Nordstrom-Robinson, Pless double circulant,
Preparata, Quadratic Residue, Rao-Reddy,
Reed-Muller, t-designs and Steiner systems,
Sugiyama, Trellis, Turbo, and Weldon codes. There
are many waiting to be discovered and the number
of open problems is huge.
26. COUNTING STRINGS
Let n and m_1, m_2, ..., m_n be positive integers and let T_1 < T_2 < ... < T_n be positive real numbers.
Let A be an alphabet with
m_1 symbols of time duration T_1,
m_2 symbols of time duration T_2,
...
m_n symbols of time duration T_n.
Let CS(t) be the number of strings, made from the letters of A, whose time duration is < t.
27. MORSE CODE MESSAGES
A = {dot, dash}, so m_1 = m_2 = 1; a dot has time duration T_1 and a dash has time duration T_2.
CS(t) = the number of messages whose duration is < t.
Examples: if dots have duration T_1 = 1 and dashes have duration T_2 = 2, then CS(2) = 1 (dot), CS(3) = 3 (dot, dot-dot, dash), and CS(4) = 6.
28. PROTEINS
A = {the 20 amino acids: Glycine, Alanine, Valine, Phenylalanine, Proline, Methionine, Isoleucine, Leucine, Aspartic Acid, Glutamic Acid, Lysine, Arginine, Serine, Threonine, Tyrosine, Histidine, Cysteine, Asparagine, Glutamine, Tryptophan}.
Let W_i be the weight of the corresponding peptide unit of the i-th amino acid, arranged from lightest (i = 1) to heaviest (i = 20).
(Figures: a peptide unit with amino acid residue R, and a single-chain protein with three units.)
A single-chain protein is a string of peptide units, so the number of single-chain proteins with weight < W is CS(W), with the weights W_i playing the role of the time durations.
29. RECURSIVE ALGORITHM
30. MATLAB PROGRAM
function N = cs(t,m,T)
% Wayne Lawton 19/1/2003
% Inputs:  t  time
%          m  array of n positive integers
%          T  array of n increasing positive numbers
% Output:  N  number of strings composed out of these
%             m(i) symbols of duration T(i), i = 1,...,n,
%             and having duration < t
k = sum(T < t);
N = 0;
if k > 0
  for j = 1:k
    N = N + m(j)*(1 + cs(t - T(j), m, T));
  end
end
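A usage example, assuming the function cs above is saved as cs.m on the MATLAB path; the Morse-code parameters are those of Problem 9:

m = [1 1];  T = [1 2];        % one symbol (dot) of duration 1, one symbol (dash) of duration 2
cs(3, m, T)                   % 3 messages of duration < 3: dot, dot-dot, dash
cs(4, m, T)                   % 6 messages of duration < 4
% The same function counts single-chain proteins if T is replaced by the
% vector of peptide-unit weights and t by a molecular weight bound W.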
32. ASYMPTOTIC GROWTH
Theorem 5. For large t,
CS(t) is approximately c X^t,
where X is the unique real root of the equation
m_1 x^(-T_1) + m_2 x^(-T_2) + ... + m_n x^(-T_n) = 1
and c is some constant.
Example: for the Morse code alphabet with dot duration 1 and dash duration 2, the equation is x^(-1) + x^(-2) = 1, whose root is the golden ratio X = (1 + sqrt(5))/2, approximately 1.618.
Proof. For integer T's a proof based on linear algebra works, and X is the largest eigenvalue of a matrix that represents the recursion or difference equation for CS. Otherwise the Laplace Transform is required. We discovered a new proof based on Information Theory.
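A MATLAB check of the root condition for the Morse example; fzero and the bracketing interval are implementation choices, not from the slides:

m = [1 1];  T = [1 2];
X = fzero(@(x) sum(m .* x.^(-T)) - 1, [1.01 10])   % about 1.6180
(1 + sqrt(5))/2                                    % the golden ratio, for comparison
log2(X)                                            % information per unit time, about 0.694 bits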
33. INFORMATION THEORY PROOF
We choose probabilities p_1, ..., p_n for the symbols having time durations T_1, ..., T_n (p_i is the total probability of the m_i symbols of duration T_i, used equally often) to maximize
R = H / Tbar,
where
H = sum over i of p_i log2(m_i / p_i)
is the Shannon information (or entropy) per symbol and
Tbar = sum over i of p_i T_i
is the average duration per symbol.
Clearly R = H / Tbar is the average information per time.
34. INFORMATION THEORY PROOF
Since there is the constraint that the p_i sum to 1, the maximizing probabilities satisfy d(H/Tbar)/dp_i = constant for some Lagrange multiplier, hence
p_i = m_i X^(-T_i) / Z for some X > 1,
where the denominator Z = sum over j of m_j X^(-T_j), called the partition function, is the sum of the numerators (why? so that the p_i sum to 1).
Substituting these probabilities into R = H / Tbar gives
R = log2 X + (log2 Z) / Tbar,
which is maximized exactly when Z = 1, hence the optimal X satisfies the root condition in Theorem 5 and the maximal information per time is log2 X.
The proof is complete since the probabilities that maximize the information are the ones that occur (as relative frequencies) in the set of all sequences.
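A numerical sanity check that CS(t) grows like X^t, assuming cs.m from the MATLAB PROGRAM slide is on the path; the range of t is arbitrary:

m = [1 1];  T = [1 2];  X = (1 + sqrt(5))/2;
for t = 5:5:25
  fprintf('t = %2d   CS(t)/X^t = %.4f\n', t, cs(t, m, T)/X^t);
end
% The ratio approaches a constant c, as Theorem 5 asserts.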
35. MORE PROBLEMS?
- 6. Compute H(1/2, 1/3, 1/6).
7. Show that H(X,Y) is maximal when X and Y are independent.
8. Read [H] and explain what a triangular parity code is.
9. Compute all Morse Code sequences of duration < 5 if dots have duration 1 and dashes have duration 2.
10. Compute the smallest molecular weight W so that at least 100 single-strand proteins have weight < W.
36. REFERENCES
[BT] Carl Brandon and John Tooze, Introduction to Protein Structure, Garland Publishing, Inc., New York, 1991.
[BC] James Ward Brown and Ruel V. Churchill, Complex Variables and Applications, McGraw-Hill, New York, 1996.
[CS] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, Springer, New York, 1993.
[Ham] R. W. Hamming, Coding and Information Theory, Prentice-Hall, New Jersey, 1980.
[H] Sharon Heumann, Coding theory and its application to the study of sphere-packing, Course Notes, October 1998, http://www.mdstud.chalmers.se/md7sharo/coding/main/main.html
[K] Donald E. Knuth, The Art of Computer Programming, Volume 1: Fundamental Algorithms, Addison-Wesley, Reading, 1997.
[SW] Claude E. Shannon and Warren Weaver, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, 1949.
37. MATHEMATICAL APPENDIX
Let f(t) = CS(t), let M = m_1 + ... + m_n, and let [x] denote the largest integer <= x.
If t <= T_1 then f(t) = 0, and in general f(t) <= (M+1)^([t/T_1] + 1),
so its Laplace Transform
F(s) = integral from 0 to infinity of f(t) e^(-st) dt
exists for complex s if Re(s) > ln(M+1) / T_1.
The recursion for f implies that F(s) = g(s) (1/s + F(s)), where g(s) = sum over j of m_j e^(-s T_j).
38. MATHEMATICAL APPENDIX
This allows F to be defined as a meromorphic function
F(s) = g(s) / ( s (1 - g(s)) )
with all singularities on or to the left of the line Re(s) = s_0, where s_0 is the unique real root of g(s) = 1.
Therefore, f is given by a Bromwich integral that can be computed by a contour integral using the method of residues, see page 235 of [BC]; the point s = 0 and the roots of g(s) = 1 are the singularities of F.
The unique real singularity s_0 with g(s_0) = 1 satisfies e^(s_0) = X, its residue dominates, and for large t
f(t) is approximately c e^(s_0 t) = c X^t,
thus proving Theorem 5.