Title: Minimum Description Length Principle
1. Minimum Description Length Principle for Statistical Inference
Presented by Vikas C. Raykar, University of Maryland, College Park
2. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle and examples.
3. Statistical Modelling: The Traditional Approach
- Assumes that the data has been generated as a sample from a population (parametric or non-parametric).
- The unknown distribution is then estimated from the data by minimizing some mean loss function, e.g. maximum likelihood or least squares.
- Works well when we understand the physics of the problem, i.e. we know that there is some law generating the data plus instrument noise.
- If we do not understand the data-generating process, there is no way to determine whether the given data set is sampled from a given distribution.
- Examples: data mining, image processing, DNA modelling.
4. Getting Around the Curse of the False Assumption
- If we are estimating probability distributions from data, there is no rational way to compare different sets of distributions: the "best" model is simply the most complex one.
- Instead, first accept that no model can capture all the regular features in the data, and then look for a model within a collection of models that does its best.
- Akaike (1973), robust estimation, cross-validation.
- Bayesian view of modelling: P(theta | Y) = P(Y | theta) P(theta) / P(Y). (A small numerical sketch follows this list.)
- Only meaningful if one of the theta is true.
- Jeffreys' interpretation: probability as a degree of belief.
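A minimal numerical sketch of the Bayesian view above; the candidate parameter values, the prior, and the data below are illustrative assumptions, not taken from the slides.

```python
# Posterior over two candidate Bernoulli parameters via Bayes' rule:
# P(theta | Y) = P(Y | theta) P(theta) / P(Y).
from math import prod

thetas = [0.3, 0.7]                      # candidate parameter values (illustrative)
prior = {0.3: 0.5, 0.7: 0.5}             # uniform prior P(theta)
Y = [1, 1, 0, 1, 1, 0, 1]                # observed coin flips (illustrative)

def likelihood(theta, data):
    """P(Y | theta) for i.i.d. Bernoulli flips."""
    return prod(theta if y == 1 else 1 - theta for y in data)

evidence = sum(likelihood(t, Y) * prior[t] for t in thetas)             # P(Y)
posterior = {t: likelihood(t, Y) * prior[t] / evidence for t in thetas}
print(posterior)                         # degrees of belief over the candidate thetas
```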
5. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle.
- Simple examples.
6. Algorithmic Theory of Information
- Introduced by Solomonoff, Kolmogorov, and Chaitin.
- Data need not be regarded as a sample from any distribution or metaphysical "true" model.
- The idea of a model is a computer program that describes/encodes the data.
- Two notions:
- Complexity: how long is the program?
- Information: what properties can it express, as opposed to uninteresting information?
7. Regularity and Compression
- Any regularity in given data can be used to compress the data (see the sketch after this list).
- 01010101010101010101
- A simple rule ("repeat 01"); its description length is about log n bits.
- 00100010000000010000
- 17 zeros and 3 ones; there are N = 20!/(17! 3!) = 1140 such strings, so ceil(log2 N) = 11 bits suffice to pick one out.
- Bernoulli model: flip a coin 20 times; a typical outcome is maximally complex.
- Non-regular sequences cannot be compressed.
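A minimal sketch of the counting argument above; the enumerative two-stage code (send the number of ones, then the index among strings with that count) is standard, but the helper name is mine.

```python
# Codelength of a binary string under an enumerative code: first the count of ones
# (ceil(log2(n+1)) bits), then the index of the string among all strings with that count.
from math import comb, log2, ceil

def enumerative_codelength(bits: str) -> int:
    n, k = len(bits), bits.count("1")
    return ceil(log2(n + 1)) + ceil(log2(comb(n, k)))

print(enumerative_codelength("00100010000000010000"))  # 5 + 11 = 16 bits, less than 20
print(enumerative_codelength("01010101010101010101"))  # 23 bits: counting alone does not help here
```

For the alternating string, the regularity that compresses it is the "repeat 01" rule, not the symbol counts.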
8. Kolmogorov Complexity
- Let U be a computer, and consider programs that print the desired binary string as output and then halt.
- Once a program has halted it cannot restart by itself, so no program can be a prefix of a longer program.
- The Kolmogorov complexity of a string is the length of the shortest program, in the language of U, that generates the string and then halts.
- The shortest program can be considered an optimal model for the data.
- Occam's Razor (principle of parsimony): one should not increase, beyond what is necessary, the number of entities required to explain anything.
- Does it depend on U, the programming language? The invariance theorem: only up to an additive constant.
9. Kolmogorov Complexity Is Noncomputable
- Proof sketch:
- Suppose we have a program Q that can compute the Kolmogorov complexity.
- We can write a program P, using Q as a subroutine, that finds a string whose Kolmogorov complexity is greater than the length of P.
- Since P prints out such a string, the Kolmogorov complexity of that string must be less than or equal to the length of P.
- Contradiction. Q.E.D.
- So no algorithm can find "the" physical laws.
- The MDL principle scales down the idea of Kolmogorov complexity:
- Focus on a class of models M, because we can never find the true model.
- Encode the data based on a hypothesis H: first encode H, and then encode the data on the basis of H.
- Choose the H that minimizes the total codelength.
10. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle and examples.
11. Two-Part Codes: The MDL Principle
- Among a set of candidate hypotheses M, the best hypothesis to explain the data is the one that minimizes the sum of:
- the length, in bits, of the description of the hypothesis, and
- the length, in bits, of the description of the data when encoded using the hypothesis.
- This is a trade-off between model complexity and goodness of fit: the better a hypothesis fits the data, the more information it gives about the data, and the more information it gives, the fewer bits we need to encode the data. (A toy sketch follows this list.)
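A toy sketch of this two-part criterion for a hypothesis class of Bernoulli parameters on a fixed grid; the grid, the data string, and the function names below are my own illustrative assumptions.

```python
# Two-part MDL: pick the hypothesis H minimizing L(H) + L(D | H).
from math import log2

data = "00100010000000010000"           # illustrative binary sequence
grid = [i / 16 for i in range(1, 16)]   # candidate Bernoulli parameters (15 grid values)

def L_H(_theta):
    return 4.0                          # ceil(log2(15)) bits to point at one grid value

def L_D_given_H(theta, d):
    # Shannon codelength of the data under Bernoulli(theta): -log2 P(d | theta)
    k = d.count("1")
    return -(k * log2(theta) + (len(d) - k) * log2(1 - theta))

best = min(grid, key=lambda t: L_H(t) + L_D_given_H(t, data))
print(best, L_H(best) + L_D_given_H(best, data))
```

Here every hypothesis costs the same number of bits, so the trade-off is trivial; richer model classes (for example variable parameter precision, as in the Bernoulli example later) make the complexity term bite.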
12. Example: Underfitting and Overfitting
13. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle.
- Simple examples.
14. Prefix Codes and Kraft's Inequality
- Let A be the alphabet; a message is a sequence of symbols from the alphabet.
- A code C maps each symbol of A into the union of all the n-fold Cartesian products of {0, 1}, i.e. to a finite binary string.
- Prefix code: no codeword is the prefix of another; this gives unique decodability.
- An integer-valued function L() corresponds to the codelengths of a binary prefix code if and only if it satisfies Kraft's inequality: the sum over x in A of 2^(-L(x)) is at most 1.
- Proof: binary tree construction and induction.
- Given a prefix code C on A with length function L(), we can define a distribution Q on A.
- Conversely, for any distribution Q on A we can find a prefix code with length function L(x) = ceil(-log2 Q(x)). (A small check follows this list.)
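A small check of Kraft's inequality and of the code-to-distribution direction for a toy prefix code; the code table is an illustrative assumption.

```python
# Kraft's inequality for a toy prefix code, and the distribution Q it induces.
codes = {"a": "0", "b": "10", "c": "110", "d": "111"}    # illustrative prefix code

lengths = {x: len(w) for x, w in codes.items()}
kraft_sum = sum(2 ** -L for L in lengths.values())
assert kraft_sum <= 1                                    # Kraft's inequality holds
Q = {x: 2 ** -L for x, L in lengths.items()}             # Q(x) = 2^(-L(x))
print(kraft_sum, Q)                                      # 1.0 {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
```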
15. Code Design: Shannon's Source Coding Theorem
- Huffman's algorithm builds an optimal prefix code for a given distribution.
- Shannon's source coding theorem: the entropy is a lower bound on the mean codelength.
- The mean codelength can be used as a measure of complexity. (See the sketch below.)
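A short sketch, with my own helper names, comparing the entropy lower bound to the mean codelength of a Huffman code built with a standard heap-based construction.

```python
# Entropy lower bound vs. mean Huffman codelength for an illustrative distribution.
import heapq
from math import log2

P = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}    # illustrative source distribution

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code; heap entries are (prob, tie-break id, symbols)."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1                        # each merge adds one bit to these symbols
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

L = huffman_lengths(P)
entropy = -sum(p * log2(p) for p in P.values())
mean_len = sum(P[s] * L[s] for s in P)
print(entropy, mean_len)                           # entropy <= mean_len < entropy + 1
```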
16. Connecting Codelengths and Probability Distributions
- A short codelength corresponds to a high probability, and vice versa.
- Given a prefix code C on A with length function L(), we can define a distribution Q on A.
- Conversely, for any distribution Q on A we can find a prefix code with length function L(x) = ceil(-log2 Q(x)).
- This does not necessarily mean that we assume our data is drawn according to that probability distribution: the probability distribution is just a mathematical object. (See the short sketch below.)
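A minimal sketch of the converse direction: assign L(x) = ceil(-log2 Q(x)) for an arbitrary distribution and check that these lengths are realizable, i.e. satisfy Kraft's inequality. The distribution values are illustrative.

```python
# From an arbitrary distribution Q to prefix-code lengths L(x) = ceil(-log2 Q(x)).
from math import ceil, log2

Q = {"x1": 0.40, "x2": 0.35, "x3": 0.20, "x4": 0.05}   # illustrative distribution
L = {x: ceil(-log2(q)) for x, q in Q.items()}
print(L)                                               # {'x1': 2, 'x2': 2, 'x3': 3, 'x4': 5}
print(sum(2 ** -l for l in L.values()) <= 1)           # True: Kraft holds, so a prefix code exists
```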
17. Contents
- Statistical modelling: the traditional approach.
- Algorithmic theory of complexity: Kolmogorov complexity.
- The MDL principle.
- Coding and information theory.
- Formalization of the MDL principle and examples.
18. Formalizing the Two-Part Code
- M is the class of models or hypotheses.
- D is the given data sequence.
- H is a hypothesis belonging to M.
- For a probabilistic class of models, the probability of the data given the model is used.
- Two-part code: first code theta, and then code the data given theta.
- We know that there exists a code C with L(D | theta) = -log2 P(D | theta); use this code for the second part.
- Minimizing the total codelength reduces to the ML estimator if we neglect the complexity of theta.
19. Bernoulli Example
20. Bernoulli Example (continued)
- We have to truncate theta to finite precision.
- If we use a fixed precision of d bits, we need d bits to send one of the 2^d possible truncated parameter values.
- We want a precision that is not fixed, so we first send d itself, encoded with a prefix code.
- d can be any natural number; how do we prefix-code a natural number?
- Trivial code: 1 -> 1, 2 -> 01, 3 -> 001, 4 -> 0001, ...; it needs d bits.
- The standard binary form of d has length about ceil(log2 d), but it is not a prefix code on its own.
- So first encode the length of the binary standard form with the trivial code (ceil(log2 d) bits), then the binary form itself: about 2 ceil(log2 d) bits in total.
- Repeat the trick and encode the length of the length of d: ceil(log2 d) + 2 ceil(log2 ceil(log2 d)) ≈ log2 d + 2 log2 log2 d bits.
- Altogether L_C1(theta) ≈ d + log2 d + 2 log2 log2 d.
- It can be shown asymptotically that the optimal precision for encoding a sample of size n is d(n) = 0.5 log2 n + c.
- L_C2 (the data part) grows linearly in n, while L_C1 (the parameter part) grows only logarithmically. (A numerical sketch follows this list.)
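A numerical sketch of this trade-off, with my own helper names and a crude approximation of the universal code for d: sweep the precision d for a Bernoulli sample and watch the total two-part codelength.

```python
# Two-part codelength for a Bernoulli sample as a function of parameter precision d.
from math import log2

def total_codelength(data, d):
    """L_C1 (universal code for d plus d bits for theta) + L_C2 (data given truncated theta)."""
    n, k = len(data), sum(data)
    # truncate the ML estimate to a grid of 2^d values, staying away from 0 and 1
    theta = max(1, min(2 ** d - 1, round(k / n * 2 ** d))) / 2 ** d
    L_C1 = d + (log2(d) + 2 * log2(log2(d) + 1) if d > 1 else 1)   # crude d + log d + 2 log log d
    L_C2 = -(k * log2(theta) + (n - k) * log2(1 - theta))
    return L_C1 + L_C2

data = [1] * 30 + [0] * 70              # illustrative sample: n = 100, ML estimate 0.3
for d in range(1, 12):
    print(d, round(total_codelength(data, d), 2))
# L_C2 stops improving after a few bits of precision while L_C1 keeps growing,
# so the total is minimized at a small d (asymptotically about 0.5*log2(n) + c).
```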
21. Non-Probabilistic Models
- Consider the case of polynomials.
- Measure goodness of fit by the total squared error.
- Construct a probability distribution such that -log P = total squared error + constant: a Gaussian distribution of a specific variance does this.
- This does not mean the underlying model is Gaussian.
- Encode the polynomial coefficients in a similar way as before.
- For any model with k parameters and sample size n, -log P(model) = (k/2) log n + o(1).
- If the data are truly generated by some polynomial plus noise, then MDL will converge to the true one as the sample size increases.
- If the true degree is high and the number of samples is small, MDL will underfit. (See the sketch after this list.)
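A rough sketch of degree selection with the (k/2) log n penalty, using a Gaussian-noise codelength for the residuals; the data, constants, and function names are my own illustrative assumptions.

```python
# Pick a polynomial degree by minimizing (n/2)*log2(RSS/n) + (k/2)*log2(n):
# data codelength under a Gaussian noise model plus the parameter cost.
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = np.linspace(-1, 1, n)
y = 1.0 - 2.0 * x + 1.5 * x**3 + rng.normal(scale=0.2, size=n)   # cubic plus noise (illustrative)

def mdl_score(degree):
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    k = degree + 1                                   # number of parameters
    return (n / 2) * np.log2(rss / n) + (k / 2) * np.log2(n)

scores = {d: round(mdl_score(d), 1) for d in range(9)}
print(min(scores, key=scores.get), scores)           # a small degree wins; the largest never does
```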
22. MDL Is Looking for a Good Model, Not a True One
- If enough data is not available, MDL picks a model that is too simple.
- This does not mean that simple models are a priori more likely to be true, or that nature prefers simplicity, or anything like that.
- The rationale is that the data set is too small to identify a complex model with any reliability.
- Is it OK to use simple models for prediction?
- A simple model is safe: it will give a correct impression of the error, because the model itself tells us that it is not very accurate.
- In the previous Bernoulli example, what if the data were actually generated by a first-order Markov chain?
- This explains why modelling errors with a Gaussian distribution generally leads to good results even though the distribution of errors is not Gaussian.
- Probabilities are monotone transforms of codelengths, not frequencies.
23. References
- Peter Grünwald's thesis, "The Minimum Description Length Principle and Reasoning under Uncertainty".
- http://homepages.cwi.nl/pdg/thesispage.html
- The first two chapters give an introduction.
- Jorma Rissanen, lectures on statistical modelling theory.
- http://www.cs.tut.fi/rissanen/
24. Thank You! Questions?