Title: Minimum Description Length Principle
1. Minimum Description Length Principle for Statistical Inference
Presented by Vikas C. Raykar, University of Maryland, College Park
2. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle and examples.
3. Statistical Modelling: The Traditional Approach
- Assumes that the data has been generated as a sample from a population (parametric or non-parametric).
- The unknown distribution is then estimated from the data by minimizing some mean loss function, e.g. maximum likelihood or least squares.
- Works well when we understand the physics of the problem, i.e. we know that there is some law generating the data plus instrument noise.
- If we do not understand the data-generating process, there is no way to determine whether the given data set is sampled from a given distribution.
- Examples: data mining, image processing, DNA modelling.
4. Getting Around the Curse of the False Assumption
- If we are estimating probability distributions from data, there is no rational way to compare different sets of distributions: the "best" model is simply the most complex one.
- Instead, first accept that no model can capture all the regular features in the data, and then look for a model within a collection of models that does its best.
- Akaike (1973), robust estimation, cross-validation.
- Bayesian view of modelling: P(theta | Y) = P(Y | theta) P(theta) / P(Y). (A small numerical sketch follows this list.)
- Only meaningful if one of the theta is true.
- Jeffreys' interpretation: probability as a degree of belief.
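A minimal numerical sketch of the Bayesian view above; the candidate parameter values, the prior, and the data below are illustrative assumptions, not taken from the slides.

```python
# Posterior over two candidate Bernoulli parameters via Bayes' rule:
# P(theta | Y) = P(Y | theta) P(theta) / P(Y).
from math import prod

thetas = [0.3, 0.7]                      # candidate parameter values (illustrative)
prior = {0.3: 0.5, 0.7: 0.5}             # uniform prior P(theta)
Y = [1, 1, 0, 1, 1, 0, 1]                # observed coin flips (illustrative)

def likelihood(theta, data):
    """P(Y | theta) for i.i.d. Bernoulli flips."""
    return prod(theta if y == 1 else 1 - theta for y in data)

evidence = sum(likelihood(t, Y) * prior[t] for t in thetas)             # P(Y)
posterior = {t: likelihood(t, Y) * prior[t] / evidence for t in thetas}
print(posterior)                         # degrees of belief over the candidate thetas
```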
5. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle.
- Simple examples.
6. Algorithmic Theory of Information
- Introduced by Solomonoff, Kolmogorov, and Chaitin.
- Data need not be regarded as a sample from any distribution or metaphysical "true" model.
- The idea of a model is a computer program that describes/encodes the data.
- Two notions:
- Complexity: how long is the program?
- Information: what properties can it express, as opposed to uninteresting information?
7. Regularity and Compression
- Any regularity in given data can be used to compress the data (see the sketch after this list).
- 01010101010101010101
- A simple rule ("repeat 01"); its description length is about log n bits.
- 00100010000000010000
- 17 zeros and 3 ones; there are N = 20!/(17! 3!) = 1140 such strings, so ceil(log2 N) = 11 bits suffice to pick one out.
- Bernoulli model: flip a coin 20 times; a typical outcome is maximally complex.
- Non-regular sequences cannot be compressed.
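A minimal sketch of the counting argument above; the enumerative two-stage code (send the number of ones, then the index among strings with that count) is standard, but the helper name is mine.

```python
# Codelength of a binary string under an enumerative code: first the count of ones
# (ceil(log2(n+1)) bits), then the index of the string among all strings with that count.
from math import comb, log2, ceil

def enumerative_codelength(bits: str) -> int:
    n, k = len(bits), bits.count("1")
    return ceil(log2(n + 1)) + ceil(log2(comb(n, k)))

print(enumerative_codelength("00100010000000010000"))  # 5 + 11 = 16 bits, less than 20
print(enumerative_codelength("01010101010101010101"))  # 23 bits: counting alone does not help here
```

For the alternating string, the regularity that compresses it is the "repeat 01" rule, not the symbol counts.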
8. Kolmogorov Complexity
- Let U be a computer, and consider programs that print the desired binary string as output and then halt.
- Once a program has halted it cannot restart by itself, so no program can be a prefix of a longer program.
- The Kolmogorov complexity of a string is the length of the shortest program, in the language of U, that generates the string and then halts.
- The shortest program can be considered an optimal model for the data.
- Occam's Razor (principle of parsimony): one should not increase, beyond what is necessary, the number of entities required to explain anything.
- Does it depend on U, the programming language? The invariance theorem: only up to an additive constant.
9. Kolmogorov Complexity Is Noncomputable
- Proof sketch:
- Suppose we have a program Q that can compute the Kolmogorov complexity.
- We can write a program P, using Q as a subroutine, that finds a string whose Kolmogorov complexity is greater than the length of P.
- Since P prints out such a string, the Kolmogorov complexity of that string must be less than or equal to the length of P.
- Contradiction. Q.E.D.
- So no algorithm can find "the" physical laws.
- The MDL principle scales down the idea of Kolmogorov complexity:
- Focus on a class of models M, because we can never find the true model.
- Encode the data based on a hypothesis H: first encode H, and then encode the data on the basis of H.
- Choose the H that minimizes the total codelength.
10. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle and examples.
11. Two-Part Codes: The MDL Principle
- Among a set of candidate hypotheses M, the best hypothesis to explain the data is the one that minimizes the sum of:
- the length, in bits, of the description of the hypothesis, and
- the length, in bits, of the description of the data when encoded using the hypothesis.
- This is a trade-off between model complexity and goodness of fit: the better a hypothesis fits the data, the more information it gives about the data, and the more information it gives, the fewer bits we need to encode the data. (A toy sketch follows this list.)
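A toy sketch of this two-part criterion for a hypothesis class of Bernoulli parameters on a fixed grid; the grid, the data string, and the function names below are my own illustrative assumptions.

```python
# Two-part MDL: pick the hypothesis H minimizing L(H) + L(D | H).
from math import log2

data = "00100010000000010000"           # illustrative binary sequence
grid = [i / 16 for i in range(1, 16)]   # candidate Bernoulli parameters (15 grid values)

def L_H(_theta):
    return 4.0                          # ceil(log2(15)) bits to point at one grid value

def L_D_given_H(theta, d):
    # Shannon codelength of the data under Bernoulli(theta): -log2 P(d | theta)
    k = d.count("1")
    return -(k * log2(theta) + (len(d) - k) * log2(1 - theta))

best = min(grid, key=lambda t: L_H(t) + L_D_given_H(t, data))
print(best, L_H(best) + L_D_given_H(best, data))
```

Here every hypothesis costs the same number of bits, so the trade-off is trivial; richer model classes (for example variable parameter precision, as in the Bernoulli example later) make the complexity term bite.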
12. Example: Underfitting and Overfitting
13. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle.
- Simple examples.
14. Prefix Codes and Kraft's Inequality
- Let A be the alphabet; a message is a sequence of symbols from the alphabet.
- A code C maps each symbol of A into the union of all the n-fold Cartesian products of {0, 1}, i.e. to a finite binary string.
- Prefix code: no codeword is the prefix of another; this gives unique decodability.
- An integer-valued function L() corresponds to the codelengths of a binary prefix code if and only if it satisfies Kraft's inequality: the sum over x in A of 2^(-L(x)) is at most 1.
- Proof: binary tree construction and induction.
- Given a prefix code C on A with length function L(), we can define a distribution Q on A.
- Conversely, for any distribution Q on A we can find a prefix code with length function L(x) = ceil(-log2 Q(x)). (A small check follows this list.)
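A small check of Kraft's inequality and of the code-to-distribution direction for a toy prefix code; the code table is an illustrative assumption.

```python
# Kraft's inequality for a toy prefix code, and the distribution Q it induces.
codes = {"a": "0", "b": "10", "c": "110", "d": "111"}    # illustrative prefix code

lengths = {x: len(w) for x, w in codes.items()}
kraft_sum = sum(2 ** -L for L in lengths.values())
assert kraft_sum <= 1                                    # Kraft's inequality holds
Q = {x: 2 ** -L for x, L in lengths.items()}             # Q(x) = 2^(-L(x))
print(kraft_sum, Q)                                      # 1.0 {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
```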
15. Code Design: Shannon's Source Coding Theorem
- Huffman's algorithm builds an optimal prefix code for a given distribution.
- Shannon's source coding theorem: the entropy is a lower bound on the mean codelength.
- The mean codelength can be used as a measure of complexity. (See the sketch below.)
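A short sketch, with my own helper names, comparing the entropy lower bound to the mean codelength of a Huffman code built with a standard heap-based construction.

```python
# Entropy lower bound vs. mean Huffman codelength for an illustrative distribution.
import heapq
from math import log2

P = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}    # illustrative source distribution

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code; heap entries are (prob, tie-break id, symbols)."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1                        # each merge adds one bit to these symbols
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

L = huffman_lengths(P)
entropy = -sum(p * log2(p) for p in P.values())
mean_len = sum(P[s] * L[s] for s in P)
print(entropy, mean_len)                           # entropy <= mean_len < entropy + 1
```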
16. Connecting Codelengths and Probability Distributions
- A short codelength corresponds to a high probability, and vice versa.
- Given a prefix code C on A with length function L(), we can define a distribution Q on A.
- Conversely, for any distribution Q on A we can find a prefix code with length function L(x) = ceil(-log2 Q(x)).
- This does not necessarily mean that we assume our data is drawn according to that probability distribution: the probability distribution is just a mathematical object. (See the short sketch below.)
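A minimal sketch of the converse direction: assign L(x) = ceil(-log2 Q(x)) for an arbitrary distribution and check that these lengths are realizable, i.e. satisfy Kraft's inequality. The distribution values are illustrative.

```python
# From an arbitrary distribution Q to prefix-code lengths L(x) = ceil(-log2 Q(x)).
from math import ceil, log2

Q = {"x1": 0.40, "x2": 0.35, "x3": 0.20, "x4": 0.05}   # illustrative distribution
L = {x: ceil(-log2(q)) for x, q in Q.items()}
print(L)                                               # {'x1': 2, 'x2': 2, 'x3': 3, 'x4': 5}
print(sum(2 ** -l for l in L.values()) <= 1)           # True: Kraft holds, so a prefix code exists
```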
17. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle and examples.
18. Formalizing the Two-Part Code
- M is the class of models or hypotheses.
- D is the given data sequence.
- H is a hypothesis belonging to M.
- For a probabilistic class of models, the probability of the data given the model is used.
- Two-part code: first code theta, and then code the data given theta.
- We know that there exists a code C with L(D | theta) = -log2 P(D | theta); use this code for the second part.
- Minimizing the total codelength reduces to the ML estimator if we neglect the complexity of theta.
19. Bernoulli Example
20. Bernoulli Example (continued)
- We have to truncate theta to finite precision.
- If we use a fixed precision of d bits, we need d bits to send one of the 2^d possible truncated parameter values.
- We want a precision that is not fixed, so we first send d itself, encoded with a prefix code.
- d can be any natural number; how do we prefix-code a natural number?
- Trivial code: 1 -> 1, 2 -> 01, 3 -> 001, 4 -> 0001, ...; it needs d bits.
- The standard binary form of d has length about ceil(log2 d), but it is not a prefix code on its own.
- So first encode the length of the binary standard form with the trivial code (ceil(log2 d) bits), then the binary form itself: about 2 ceil(log2 d) bits in total.
- Repeat the trick and encode the length of the length of d: ceil(log2 d) + 2 ceil(log2 ceil(log2 d)) ≈ log2 d + 2 log2 log2 d bits.
- Altogether L_C1(theta) ≈ d + log2 d + 2 log2 log2 d.
- It can be shown asymptotically that the optimal precision for encoding a sample of size n is d(n) = 0.5 log2 n + c.
- L_C2 (the data part) grows linearly in n, while L_C1 (the parameter part) grows only logarithmically. (A numerical sketch follows this list.)
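A numerical sketch of this trade-off, with my own helper names and a crude approximation of the universal code for d: sweep the precision d for a Bernoulli sample and watch the total two-part codelength.

```python
# Two-part codelength for a Bernoulli sample as a function of parameter precision d.
from math import log2

def total_codelength(data, d):
    """L_C1 (universal code for d plus d bits for theta) + L_C2 (data given truncated theta)."""
    n, k = len(data), sum(data)
    # truncate the ML estimate to a grid of 2^d values, staying away from 0 and 1
    theta = max(1, min(2 ** d - 1, round(k / n * 2 ** d))) / 2 ** d
    L_C1 = d + (log2(d) + 2 * log2(log2(d) + 1) if d > 1 else 1)   # crude d + log d + 2 log log d
    L_C2 = -(k * log2(theta) + (n - k) * log2(1 - theta))
    return L_C1 + L_C2

data = [1] * 30 + [0] * 70              # illustrative sample: n = 100, ML estimate 0.3
for d in range(1, 12):
    print(d, round(total_codelength(data, d), 2))
# L_C2 stops improving after a few bits of precision while L_C1 keeps growing,
# so the total is minimized at a small d (asymptotically about 0.5*log2(n) + c).
```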
21. Non-Probabilistic Models
- Consider the case of polynomials.
- Measure goodness of fit by the total squared error.
- Construct a probability distribution such that -log P = total squared error + constant: a Gaussian distribution of a specific variance does this.
- This does not mean the underlying model is Gaussian.
- Encode the polynomial coefficients in a similar way as before.
- For any model with k parameters and sample size n, -log P(model) = (k/2) log n + o(1).
- If the data are truly generated by some polynomial plus noise, then MDL will converge to the true one as the sample size increases.
- If the true degree is high and the number of samples is small, MDL will underfit. (See the sketch after this list.)
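A rough sketch of degree selection with the (k/2) log n penalty, using a Gaussian-noise codelength for the residuals; the data, constants, and function names are my own illustrative assumptions.

```python
# Pick a polynomial degree by minimizing (n/2)*log2(RSS/n) + (k/2)*log2(n):
# data codelength under a Gaussian noise model plus the parameter cost.
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = np.linspace(-1, 1, n)
y = 1.0 - 2.0 * x + 1.5 * x**3 + rng.normal(scale=0.2, size=n)   # cubic plus noise (illustrative)

def mdl_score(degree):
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    k = degree + 1                                   # number of parameters
    return (n / 2) * np.log2(rss / n) + (k / 2) * np.log2(n)

scores = {d: round(mdl_score(d), 1) for d in range(9)}
print(min(scores, key=scores.get), scores)           # a small degree wins; the largest never does
```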
22. MDL Is Looking for a Good Model, Not a True One
- If enough data is not available, MDL picks a model that is too simple.
- This does not mean that simple models are a priori more likely to be true, or that nature prefers simplicity, or anything like that.
- The rationale is that the data set is too small to identify a complex model with any reliability.
- Is it OK to use simple models for prediction?
- A simple model is safe: it will give a correct impression of the error, because the model itself tells us that it is not very accurate.
- In the previous Bernoulli example, what if the data were actually generated by a first-order Markov chain?
- This explains why modelling errors with a Gaussian distribution generally leads to good results even though the distribution of errors is not Gaussian.
- Probabilities are monotone transforms of codelengths, not frequencies.
23. References
- Peter Grünwald's thesis, "The Minimum Description Length Principle and Reasoning under Uncertainty".
- http://homepages.cwi.nl/pdg/thesispage.html
- The first two chapters give an introduction.
- Jorma Rissanen, lectures on statistical modelling theory.
- http://www.cs.tut.fi/rissanen/
24. Thank You! Questions?