Title: Mathematical Foundations of Markov Chain Monte Carlo Algorithms
1. Mathematical Foundations of Markov Chain Monte Carlo Algorithms
- Based on lectures given by Alistair Sinclair, Computer Science Division, U.C. Berkeley
2. Overview
- Random Sampling
- The Markov Chain Monte Carlo Paradigm
- Mixing Time
- Techniques for Bounding the Mixing Time:
  - Coupling
  - Flow
  - Geometry
3. Random Sampling
- Ω - a very large sample set.
- π - a probability distribution over Ω.
- Goal: sample points x ∈ Ω at random from distribution π.
4. The Probability Distribution
- π(x) = w(x)/Z, where:
- w: Ω → ℝ is an easily-computed weight function,
- Z = Σ_{x∈Ω} w(x) is an unknown normalization factor.
5. Application 1: Card Shuffling
- Ω - all 52! permutations of a deck of cards.
- π - the uniform distribution: ∀x, w(x) = 1.
- Goal: pick a permutation uniformly at random.
6. Application 2: Counting
- How many ways can we tile a given pattern with dominos?
7. Application 2: Counting (cont.)
- Split the tilings into two classes by the position of one domino, so N = N₁ + N₂.
- Sample tilings uniformly at random.
- Let P₁ = proportion of the sample of type 1.
- Compute an estimate Ñ₁ of N₁ recursively.
- Output Ñ = Ñ₁ / P₁.
- Sample size O(n) per level, O(n) levels ⇒ O(n²) samples in total. (A toy sketch of this estimator follows.)
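The sketch below is a toy stand-in for the recursive estimator, not the tiling algorithm itself: it "counts" the binary strings of length n (true count 2ⁿ), a self-reducible set where exact uniform sampling is trivial, so that the Ñ = Ñ₁/P₁ recursion can be tested end to end. In the domino application the sampler would be an MCMC sampler over tilings; the function names are illustrative.

```python
import random

# Toy demonstration of the recursive counting estimator on a
# self-reducible set: binary strings of length n (true count 2**n).

def estimate_count(n, samples=1000):
    """Estimate |{0,1}^n| by recursively fixing one coordinate."""
    if n == 0:
        return 1  # only the empty string remains
    # Draw uniform samples from the current set and measure the
    # proportion P1 whose first bit is 0.
    hits = sum(1 for _ in range(samples) if random.randint(0, 1) == 0)
    p1 = hits / samples
    # N1 = count of strings starting with 0 = |{0,1}^(n-1)|,
    # estimated recursively; output N1 / P1.
    return estimate_count(n - 1, samples) / p1

if __name__ == "__main__":
    n = 10
    print(f"estimate: {estimate_count(n):.0f}, truth: {2**n}")
```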
8. Application 3: Volume / Integration [Dyer/Frieze/Kannan]
- Ω - a convex body in ℝᵈ (d large).
- Problem: estimate vol(Ω).
- Take a sequence of concentric balls B₀ ⊂ … ⊂ B_r with B₀ ⊆ Ω ⊆ B_r, so vol(Ω) = vol(B₀) · ∏ᵢ vol(Ω∩Bᵢ)/vol(Ω∩Bᵢ₋₁).
- Estimate each ratio by sampling uniformly from Ω∩Bᵢ.
- Generalization: integration of a log-concave function over a cube A ⊆ ℝᵈ.
9. Application 4: Statistical Physics
- Ω - set of configurations of a physical system.
- π - the Gibbs distribution:
- π(x) = Pr[system in config. x] = w(x)/Z,
- where w(x) = e^(−H(x)/kT), H(x) = energy of configuration x, T = temperature.
10. The Ising Model
- n atomic magnets.
- Configuration x ∈ {−,+}ⁿ.
- H(x) = −#(aligned neighbors).
- [Figure: a square grid of +/− spins.]
11. Why Sampling?
- Statistics of typical configurations.
- Mean energy E_π[H(x)], specific heat, …
- Estimate of the partition function Z = Z(T) = Σ_{x∈Ω} w(x).
12. Estimating the Partition Function
- Let β = e^(1/kT), so that Z = Z(β) = Σ_{x∈Ω} β^(−H(x)).
- Define 1 = β₀ < β₁ < … < β_r = β and write the telescoping product
  Z(β) = Z(β₀) · ∏_{i=1..r} Z(βᵢ)/Z(βᵢ₋₁), where Z(β₀) = |Ω|.
- Each ratio Z(βᵢ)/Z(βᵢ₋₁) can be estimated by random sampling from π_{βᵢ₋₁}.
- Choosing βᵢ = βᵢ₋₁(1 + 1/n) ensures small variance ⇒ O(n) samples suffice for each ratio.
- This gives r ≈ n ln β = O(n) levels (for fixed temperature), hence O(n²) samples in total. (A toy sketch follows.)
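A minimal check of the telescoping estimator, on a 1-D Ising chain small enough that Z can also be computed by brute force. The exact-by-enumeration sampler is a toy stand-in (in a real application an MCMC sampler replaces it), and n, β, and the sample counts are illustrative choices.

```python
import random
from itertools import product

n = 8  # number of spins; 2^8 states, small enough to enumerate

def H(x):
    """Energy: minus the number of aligned neighboring spins."""
    return -sum(1 for i in range(n - 1) if x[i] == x[i + 1])

states = list(product([-1, +1], repeat=n))

def Z(beta):
    return sum(beta ** (-H(x)) for x in states)

def sample(beta):
    """Exact sample from pi_beta by enumeration (toy stand-in for MCMC)."""
    weights = [beta ** (-H(x)) for x in states]
    return random.choices(states, weights=weights)[0]

beta, samples = 2.0, 200
betas = [1.0]
while betas[-1] < beta:                    # beta_i = beta_{i-1} * (1 + 1/n)
    betas.append(min(betas[-1] * (1 + 1 / n), beta))

est = float(len(states))                   # Z(beta_0) = |Omega| since beta_0 = 1
for b_prev, b in zip(betas, betas[1:]):
    # E_{pi_{b_prev}}[(b/b_prev)^{-H(x)}] = Z(b)/Z(b_prev)
    ratio = sum((b / b_prev) ** (-H(sample(b_prev))) for _ in range(samples)) / samples
    est *= ratio

print(f"estimate: {est:.1f}, exact: {Z(beta):.1f}")
```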
13. Application 5: Optimization
- Ω - set of feasible solutions to an optimization problem.
- f(x) - value of solution x.
- Goal: maximize f(x).
- Idea: sample solutions where w(x) = λ^f(x).
14. Application 5: Optimization (cont.)
- Idea: sample solutions where w(x) = λ^f(x).
- Large λ ⇒ concentration on good solutions (large values of f(x)).
- Small λ ⇒ greater mobility (local optima are less high).
- Simulated Annealing heuristic: slowly increase λ.
15. Application 6: Hypothesis Verification in Statistical Models
- Θ - set of hypotheses.
- X - observed data.
- Let w(θ) = P(θ) · P(X|θ), where P(θ) is the prior and the likelihood P(X|θ) is easy to compute.
16. Application 6: Hypothesis Verification in Statistical Models (cont.)
- Sampling from π(θ) = P(θ|X) gives:
  - a statistical estimate of the hypothesis θ,
  - prediction,
  - model comparison.
- The normalization factor is P(X), the probability that the model generated X.
17. Markov Chains
- Sample space Ω.
- Random variables (r.v.) X₁, X₂, …, X_t, … over Ω.
- Memoryless: ∀t > 0, ∀x₁, …, x_{t+1} ∈ Ω,
  Pr[X_{t+1} = x_{t+1} | X₁ = x₁, …, X_t = x_t] = Pr[X_{t+1} = x_{t+1} | X_t = x_t].
18. Sampling Algorithm
- Start at an arbitrary state X₀.
- Simulate the MC for sufficiently many steps t.
- Output X_t.
- Then, ∀x ∈ Ω: Pr[X_t = x] ≈ π(x). (A sketch follows.)
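A minimal sketch of this sampling algorithm, assuming an ergodic chain given as an explicit transition table. The chain, states, and step count below are illustrative choices, not from the lecture.

```python
import random

P = {  # a small ergodic chain on states {0, 1, 2}
    0: [(0, 0.5), (1, 0.5)],
    1: [(0, 0.25), (1, 0.5), (2, 0.25)],
    2: [(1, 0.5), (2, 0.5)],
}

def step(x):
    """One transition from state x according to P."""
    states, probs = zip(*P[x])
    return random.choices(states, weights=probs)[0]

def sample(x0, t):
    """Start at an arbitrary X0, run t steps, output X_t."""
    x = x0
    for _ in range(t):
        x = step(x)
    return x

# Empirically, the distribution of X_t approaches the stationary pi.
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample(0, 50)] += 1
print([c / 10000 for c in counts])
```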
19. Transition Matrix
- P(x,y) = Pr[X_{t+1} = y | X_t = x].
- P is non-negative.
- P is stochastic: ∀x, Σ_y P(x,y) = 1.
- Pr[X_t = y | X₀ = x] = Pᵗ(x,y), i.e. p_x^t = p_x^0 · Pᵗ.
- Definition: π is a stationary distribution if πP = π. (A numeric illustration follows.)
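A small numeric illustration (mine, not from the lecture) of p^t = p^0 Pᵗ and of the stationary condition πP = π, using the toy chain from the previous sketch.

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

p = np.array([1.0, 0.0, 0.0])   # p^0: start deterministically at state 0
for _ in range(50):             # p^t = p^0 P^t
    p = p @ P

print("p^50      :", p.round(4))

# The stationary pi is the left eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
print("pi        :", pi.round(4))
print("pi P == pi:", np.allclose(pi @ P, pi))
```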
20. Irreducibility
- Definition: P is irreducible if for all x, y ∈ Ω there exists t such that Pᵗ(x,y) > 0.
21. Aperiodicity
- Definition: P is aperiodic if for all x ∈ Ω, gcd{t : Pᵗ(x,x) > 0} = 1.
22. Note on Irreducibility and Aperiodicity
- If P is irreducible, we can always make it aperiodic by adding self-loops: P′ = ½(P + I).
- P′ has the same stationary distribution as P.
- Call P′ a lazy MC.
23. Fundamental Theorem
- Theorem: If P is irreducible and aperiodic, then it is ergodic, i.e.
  Pᵗ(x,y) → π(y) as t → ∞, for all x, y ∈ Ω,
  where π is the (unique) stationary distribution of P, i.e. πP = π.
24. Main Idea (The MCMC Paradigm)
- An ergodic MC provides an effective algorithm for sampling from π.
25. Examples
- Random Walks on Graphs
- Ehrenfest Urn
- Card Shuffling
- Colorings of a Graph
- The Ising Model
26. Example 1: Random Walk on Undirected Graphs
- At each node, choose a neighbor u.a.r and jump to it.
27. Random Walk on an Undirected Graph G = (V,E)
- Ω = V.
- From x, jump to each neighbor with probability 1/d(x), where d(x) is the degree of x.
- Irreducible ⟺ G is connected.
- Aperiodic ⟺ G is not bipartite.
28. Random Walk: The Stationary Distribution
- Claim: If G is connected and not bipartite, then the probability distribution induced by a random walk on it converges to π(x) = d(x)/Σ_y d(y) = d(x)/2|E|.
- (The non-bipartite condition is not essential: it can always be enforced by making the walk lazy.)
- Proof: (πP)(y) = Σ_{x~y} (d(x)/2|E|) · (1/d(x)) = d(y)/2|E| = π(y), so π is stationary. (A numerical check follows.)
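A quick empirical check of the claim (my example, not from the lecture): run the walk on a small connected, non-bipartite graph and compare visit frequencies to d(x)/2|E|.

```python
import random
from collections import Counter

adj = {                      # triangle with a pendant vertex: not bipartite
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2],
}
two_E = sum(len(v) for v in adj.values())   # 2|E|

x, counts, steps = 0, Counter(), 100000
for _ in range(steps):
    x = random.choice(adj[x])               # jump to a uniform neighbor
    counts[x] += 1

for v in sorted(adj):
    print(v, round(counts[v] / steps, 3), "vs", len(adj[v]) / two_E)
```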
29. Example 2: Ehrenfest Urn
- Two urns: j balls in the first, (n−j) balls in the second.
- Pick a ball u.a.r.
- Move the ball to the other urn.
30. Example 2: Ehrenfest Urn (cont.)
- X_t = number of balls in the first urn.
- The MC is a non-uniform random walk on {0, 1, …, n}: from j, move to j−1 w.p. j/n and to j+1 w.p. 1−j/n.
- Irreducible, but periodic (X_t alternates in parity).
- Stationary distribution: binomial, π(j) = C(n,j)/2ⁿ. (A simulation sketch follows.)
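An empirical check (illustrative values of n and the step count) that the stationary distribution is binomial. Since the urn chain is periodic, the sketch simulates its lazy version, which has the same π.

```python
import random
from collections import Counter
from math import comb

n, steps = 10, 200000
j, counts = 0, Counter()
for _ in range(steps):
    if random.random() < 0.5:               # lazy step: stay put w.p. 1/2
        pass
    elif random.random() < j / n:           # chosen ball was in urn 1
        j -= 1
    else:                                   # chosen ball was in urn 2
        j += 1
    counts[j] += 1

for k in range(n + 1):
    print(k, round(counts[k] / steps, 3), "vs", round(comb(n, k) / 2**n, 3))
```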
31. Example 3: Card Shuffling
- Top-in-at-random: take the top card and insert it at a u.a.r position in the deck.
- Irreducible.
- Aperiodic.
- P is doubly stochastic: ∀y, Σ_x P(x,y) = 1.
- ⇒ π is uniform: ∀x, π(x) = 1/n!.
32. Example 3: Card Shuffling (cont.)
- Random transpositions: swap two cards chosen u.a.r.
- Irreducible.
- Aperiodic.
- P is symmetric: ∀x,y, P(x,y) = P(y,x).
- ⇒ π is uniform.
33. Example 3: Card Shuffling (cont.)
- Riffle shuffle [Gilbert/Shannon/Reeds]: cut the deck into two piles and interleave them at random.
34. Example 3: Card Shuffling (cont.)
- Riffle shuffle [Gilbert/Shannon/Reeds]:
- Irreducible.
- Aperiodic.
- P is doubly stochastic.
- ⇒ π is uniform.
35. Example 4: Colorings of a Graph
- G = (V,E) connected, undirected.
- q = number of colors.
- Ω = set of proper q-colorings of G.
- π = uniform.
36. Colorings: Markov Chain
- Pick v ∈ V and c ∈ {1,…,q} u.a.r.
- Recolor v with c if possible (i.e., if no neighbor of v has color c).
- Let Δ = G's maximum degree.
- Irreducible if q ≥ Δ + 2.
- Aperiodic.
- P is symmetric.
- ⇒ π is uniform. (A sketch of this chain follows.)
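A minimal sketch of this colorings chain, run on an arbitrary small graph. The graph, q, the initial coloring, and the step count are illustrative choices.

```python
import random

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # max degree 3
q = 5                                                # q >= max_degree + 2

# Start from any proper coloring.
coloring = {0: 0, 1: 1, 2: 2, 3: 0}

def step(coloring):
    v = random.choice(list(adj))             # pick a vertex u.a.r.
    c = random.randrange(q)                  # pick a color u.a.r.
    if all(coloring[u] != c for u in adj[v]):
        coloring[v] = c                      # recolor v if still proper
    # otherwise do nothing (stay at the current coloring)

for _ in range(10000):
    step(coloring)
print(coloring)                              # approx. uniform proper coloring
```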
37. Example 5: The Ising Model
- n sites; configurations Ω = {−,+}ⁿ; w(x) = λ^(#aligned neighbors in x).
- Markov chain (heat bath):
  - Pick a site i u.a.r.
  - Replace the spin x(i) by a random spin x′(i) drawn from the conditional distribution of π given the spins of the neighbors of i.
- Irreducible, aperiodic, reversible w.r.t. π ⇒ converges to π. (A sketch follows.)
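A minimal sketch of the heat-bath chain on a small L×L grid with free boundary. L, λ, and the step count are illustrative choices; since w(x) = λ^(#aligned), the conditional law of a spin given its neighbors is Pr[+] ∝ λ^(a₊), Pr[−] ∝ λ^(a₋), where a₊, a₋ count neighbors of each sign.

```python
import random

L, lam = 8, 1.5
spin = [[random.choice([-1, +1]) for _ in range(L)] for _ in range(L)]

def neighbors(i, j):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < L and 0 <= j + dj < L:
            yield i + di, j + dj

def heat_bath_step():
    i, j = random.randrange(L), random.randrange(L)   # pick a site u.a.r.
    a_plus = sum(1 for u, v in neighbors(i, j) if spin[u][v] == +1)
    a_minus = sum(1 for u, v in neighbors(i, j) if spin[u][v] == -1)
    # Conditional law of x(i) given its neighbors: Pr[+] proportional to lam^{a+}.
    p_plus = lam**a_plus / (lam**a_plus + lam**a_minus)
    spin[i][j] = +1 if random.random() < p_plus else -1

for _ in range(100000):
    heat_bath_step()
print(sum(sum(row) for row in spin))   # net magnetization of the sample
```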
38. Designing Markov Chains
- What do we want?
- Given Ω and π, an MC over Ω which converges to π.
39. The Metropolis Rule
- Define any connected undirected graph on Ω (a neighborhood structure of (local) moves).
40. The Metropolis Rule (cont.)
- Transitions from state x ∈ Ω:
  - Pick a neighbor y of x w.p. κ(x,y).
  - Move to y w.p. min{w(y)/w(x), 1} (else stay at x).
- Here κ(x,y) = κ(y,x) and κ(x,x) = 1 − Σ_{y≠x} κ(x,y).
- Irreducible.
- Aperiodic (make lazy if necessary).
- Reversible w.r.t. w ⇒ converges to π. (A sketch follows.)
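A minimal sketch of the Metropolis rule for π ∝ w on the hypercube {0,1}ⁿ, with the neighborhood structure "flip one bit" and κ(x,y) = 1/n. The weight function w below is an arbitrary illustrative choice.

```python
import random

n = 10

def w(x):
    """Example weight: favors strings with many ones (w = 2^{#ones})."""
    return 2.0 ** sum(x)

x = [0] * n
for _ in range(10000):
    i = random.randrange(n)          # propose a neighbor: flip bit i
    y = x[:]
    y[i] ^= 1
    # Accept the move w.p. min{w(y)/w(x), 1}; else stay at x.
    if random.random() < min(w(y) / w(x), 1.0):
        x = y

print(x, sum(x))   # typically most coordinates are 1 under this w
```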
41. The Mixing Time
- Key question: How long until p_x^t looks like π?
- We will use the (total) variation distance:
  ||μ − ν|| = ½ Σ_{x∈Ω} |μ(x) − ν(x)| = max_{A⊆Ω} |μ(A) − ν(A)|.
42. The Mixing Time (cont.)
- Define:
  - Δ_x(t) = ||p_x^t − π||
  - Δ(t) = max_x Δ_x(t)
- The mixing time is τ_mix = min{t : Δ(t) ≤ 1/2e}. (A numeric sketch follows.)
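An illustrative computation of Δ(t) and τ_mix by exact matrix powers, reusing the small toy chain from the earlier sketches (not a chain from the lecture).

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])          # stationary distribution of P

def variation_distance(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

Pt = np.eye(len(pi))
for t in range(1, 100):
    Pt = Pt @ P                            # rows of P^t are the p_x^t
    delta = max(variation_distance(Pt[x], pi) for x in range(len(pi)))
    if delta <= 1 / (2 * np.e):
        print("tau_mix =", t, " Delta(t) =", round(delta, 4))
        break
```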
43. Toy Example: Top-in-at-random
- Let T = the time at which the card initially at the bottom reaches the top and is inserted into the deck.
- T is a strong stationary time, i.e. Pr[X_t = x | t ≥ T] = π(x).
- Claim: Δ(t) ≤ Pr[T > t].
- Thus, it remains to estimate T.
44. The Coupon Collector Problem
- Each pack contains one coupon.
- The goal is to collect the complete series.
- How many packs would we buy?
45. The Coupon Collector Problem (cont.)
- N = total number of different coupons.
- X_i = time to get the i-th new coupon; E[X_i] = N/(N−i+1).
- T = Σᵢ X_i, so E[T] = N(1/N + 1/(N−1) + … + 1) ≈ N ln N,
  and Pr[T > N ln N + cN] ≤ e^(−c).
46. Toy Example: Top-in-at-random (cont.)
- By the coupon collector analysis:
- The i-th coupon is a ticket to advance from the (n−i+1) level to the next one.
- Pr[T > n ln n + cn] ≤ e^(−c)
- ⇒ τ_mix ≤ n ln n + cn for a suitable constant c.
47. Example: Riffle Shuffle
- [Figure: one riffle shuffle of a deck.]
48. Example: Riffle Shuffle (cont.)
- Inverse shuffle (same mixing time):
  - Label each card with a 0/1 bit chosen u.a.r.
  - Move all cards labeled 0 to the top, otherwise preserving order (sorted stably).
- [Figure: a deck labeled 0/1 before and after one inverse shuffle.]
49. Inverse Shuffle
- After t steps, each card is labeled with t bits.
- Cards are sorted (stably) by their labels.
- Cards with different labels are in random order.
- Cards with the same label are in original order.
50. Riffle Shuffle (cont.)
- Let T = time until all cards have distinct labels.
- T is a strong stationary time.
- Again we need to estimate T. (A simulation sketch follows.)
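An illustrative simulation of T for the inverse shuffle. Only the cards' bit-label histories matter for T, and each card independently appends a fresh uniform bit per step, so the sketch tracks labels alone; n and the trial count are illustrative.

```python
import random

def time_to_distinct_labels(n):
    labels = [0] * n                 # integer encoding of each card's bits
    t = 0
    while len(set(labels)) < n:      # some two cards still share a label
        t += 1
        # Each card appends a fresh u.a.r. bit to its label.
        labels = [(lab << 1) | random.randint(0, 1) for lab in labels]
    return t

n, trials = 52, 200
avg = sum(time_to_distinct_labels(n) for _ in range(trials)) / trials
print(f"average T for n={n}: {avg:.1f} steps (Theta(log n))")
```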
51. Birthday Paradox
- k people in a room: with what probability do two of them share a birthday?
52. Birthday Paradox (cont.)
- k people, n days (n > k > 1).
- The probability that all birthdays are distinct is
  ∏_{i=1..k−1} (1 − i/n) ≈ e^(−(1+2+…+(k−1))/n) = e^(−k(k−1)/2n)
  (the exponent is an arithmetic sum).
53. Riffle Shuffle (cont.)
- By the birthday paradox:
  - Each card (1..n) picks a random label.
  - There are 2ᵗ possible labels.
  - We want all n labels to be distinct, which happens w.h.p. once 2ᵗ ≫ n², i.e. t ≥ 2 log₂ n + O(1).
- ⇒ τ_mix = O(log n).
54. General Techniques for the Mixing Time
- Probabilistic: Coupling
- Combinatorial: Flows
- Geometric: Conductance
55. Coupling
56. Mixing Time via Coupling
- Let P be an ergodic MC. A coupling for P is a pair process (X_t, Y_t) s.t.:
  - X_t and Y_t are each (marginally) copies of P;
  - X_t = Y_t ⇒ X_{t+1} = Y_{t+1}.
- Define T_xy = min{t : X_t = Y_t | X₀ = x, Y₀ = y}.
57. Coupling Theorem
- Theorem [Aldous et al.]:
  Δ(t) ≤ max_{x,y} Pr[T_{x,y} > t].
- Moral: design a coupling that brings X and Y together fast.
58. Example 1: Random Walk on the Cube
- Ω = {0,1}ⁿ, π is uniform.
- Markov chain:
  - Pick a coordinate i ∈_R {1,…,n}.
  - Pick a value b ∈_R {0,1}.
  - Set x(i) = b.
- (For n = 3: stay put w.p. 1/2, move to each of the 3 neighbors w.p. 1/6.)
59. Coupling for the Random Walk
- Pick the same (i, b) for both X and Y.
- T_xy ≤ time to hit all n coordinates.
- By coupon collecting, Pr[T_xy > n ln n + cn] < e^(−c)
- ⇒ τ_mix ≤ n ln n + cn. (A simulation sketch follows.)
- Example of one coupled step (i = 6, b = 1):
  X: (0,0,1,0,1,1) → (0,0,1,0,1,1)
  Y: (1,1,0,0,1,0) → (1,1,0,0,1,1) — X and Y now agree on coordinate 6.
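An illustrative simulation of this coupling: both walks use the same (i, b) at every step, and T_xy is the first time the two walks coincide. The values of n and the trial count are arbitrary choices.

```python
import random

def coupling_time(n):
    X = [random.randint(0, 1) for _ in range(n)]
    Y = [random.randint(0, 1) for _ in range(n)]
    t = 0
    while X != Y:
        t += 1
        i, b = random.randrange(n), random.randint(0, 1)
        X[i] = b                  # same move applied to both copies:
        Y[i] = b                  # once coordinate i is hit, it agrees forever
    return t

n, trials = 20, 200
avg = sum(coupling_time(n) for _ in range(trials)) / trials
print(f"average coupling time for n={n}: {avg:.1f} (n ln n is about 60)")
```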
60. Flow
- Capacity of an edge e = (z, z′): C(e) = π(z)P(z, z′).
- The flow along e is denoted f(e).
- A flow routes π(x)π(y) units from x to y, for every pair x, y.
- Congestion: ρ(f) = max_e f(e)/C(e).
- ℓ(f) = length of the longest flow-carrying path (at least the diameter of the graph).
61. Flow Theorem
- Theorem [Diaconis/Stroock, Jerrum/Sinclair]: For a lazy ergodic MC and any flow f,
  τ_x(ε) ≤ 2 ρ(f) ℓ(f) (ln π(x)⁻¹ + 2 ln ε⁻¹),
  where τ_x(ε) = min{t : Δ_x(t) ≤ ε}.
62. Example 1: Random Walk on the Cube
- Ω = {0,1}ⁿ, |Ω| = 2ⁿ = N, ∀x π(x) = 1/N.
- (Lazy walk: stay put w.p. 1/2, move to each neighbor w.p. 1/2n.)
- Flow f: route the (x,y) flow evenly along all shortest paths x → y.
- ⇒ τ_mix ≤ const · ρ(f) ℓ(f) log π_min⁻¹ = O(n³).
63. Conductance
- Geometric idea: measure the worst "bottleneck" of the chain.
64. Conductance (cont.)
- For S ⊆ Ω with π(S) ≤ 1/2, let Φ(S) = (Σ_{x∈S, y∈Ω−S} π(x)P(x,y)) / π(S).
- The conductance is Φ = min_{S : π(S) ≤ 1/2} Φ(S).
65. Conductance Theorem
- Theorem [Jerrum/Sinclair, Lawler/Sokal, Alon, Cheeger]: For a lazy reversible MC,
  τ_x(ε) ≤ (2/Φ²) (ln π(x)⁻¹ + ln ε⁻¹).
66. Example 1: Random Walk on the Cube
- The worst S is (essentially) a half-cube, e.g. S = {x : x(1) = 0}, with Φ(S) = 1/2n.
- ⇒ τ_mix = O(Φ⁻² log π_min⁻¹) = O(n³). (A numeric check follows.)
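A small numeric check (illustrative choice of n) of the conductance of the half-cube cut S = {x : x(1) = 0} for the lazy walk on {0,1}ⁿ.

```python
from itertools import product

n = 6
N = 2 ** n
pi = 1.0 / N                      # uniform stationary distribution

S = [x for x in product([0, 1], repeat=n) if x[0] == 0]
Q = 0.0                           # Q(S, S-bar) = sum of pi(x) P(x,y) over the cut
for x in S:
    for i in range(n):
        y = list(x)
        y[i] ^= 1
        if tuple(y)[0] == 1:      # edge leaves S only when coordinate 1 flips
            Q += pi * (1 / (2 * n))

pi_S = len(S) * pi                # pi(S) = 1/2
print("Phi(S) =", Q / pi_S, " vs 1/2n =", 1 / (2 * n))
```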