Title: Sampling Bayesian Networks
1. Sampling Bayesian Networks
2. Answering BN Queries
- Probability of evidence P(e): NP-hard
- Conditional probability P(xi | e): NP-hard
- MPE, x* = arg max P(x, e): NP-hard
- MAP, y* = arg max P(y, e), y ⊆ x: NP^PP-hard
- Approximating P(e) or P(xi | e) within ε: NP-hard
3. Approximation Algorithms
- Structural approximations
  - Eliminate some dependencies
  - Remove edges
  - Mini-Bucket approach
- Search
  - Approach for optimization tasks MPE and MAP
- Sampling
  - Generate random samples and compute values of interest from the samples, not from the original network
4. Algorithm Tree
5. Sampling Fundamentals
Given a set of variables X = {X1, X2, ..., Xn}, a joint probability distribution π(X), and some function g(X), we can compute the expected value of g(X):
  E[g(X)] = Σ_x g(x) π(x)
6. Sampling From π(X)
A sample S^t is an instantiation of all the variables:
  S^t = (x1^t, x2^t, ..., xn^t)
Given independent, identically distributed (iid) samples S^1, S^2, ..., S^T from π(X), it follows from the Strong Law of Large Numbers that
  ĝ = (1/T) Σ_{t=1}^{T} g(S^t) → E[g(X)] as T → ∞
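The SLLN estimate above can be sketched numerically. Here π is taken to be uniform on [0, 1] and g(x) = x^2, both purely illustrative choices, so E[g(X)] = 1/3:

```python
import random

# Monte Carlo estimate of E[g(X)]: average g over iid samples from pi(X).
# pi = Uniform(0, 1) and g(x) = x**2 are illustrative, so E[g(X)] = 1/3.
rng = random.Random(0)
T = 200_000
estimate = sum(rng.random() ** 2 for _ in range(T)) / T
print(abs(estimate - 1 / 3) < 0.005)  # the average converges to E[g(X)]
```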
7. Sampling Challenge
- It is hard to generate samples directly from π(X)
- Trade-offs:
  - Generate samples from Q(X)
    - Forward Sampling, Likelihood Weighting, Importance Sampling (IS)
    - Try to find a Q(X) close to π(X)
  - Generate dependent samples forming a Markov chain whose distribution P(X) converges to π(X)
    - Metropolis, Metropolis-Hastings, Gibbs
    - Try to reduce dependence between samples
8. Markov Chain
- A sequence of random values x^0, x^1, ..., defined on a finite state space D(X), is called a Markov chain if it satisfies the Markov property:
  P(x^{t+1} | x^t, x^{t-1}, ..., x^0) = P(x^{t+1} | x^t)
- If P(x^{t+1} = y | x^t) does not change with t (time-homogeneous), then it is often expressed as a transition function A(x, y)
Liu, Ch. 12, p. 245
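As a sketch, here is a time-homogeneous two-state chain with a made-up transition function A(x, y); the long-run visit frequencies approach the stationary distribution, which for this A can be computed from π A = π:

```python
import random

# A minimal time-homogeneous Markov chain on states {0, 1}; the transition
# function A(x, y) is an arbitrary illustrative choice.
A = {0: [0.8, 0.2],   # from state 0: stay w.p. 0.8, move to 1 w.p. 0.2
     1: [0.4, 0.6]}   # from state 1: move to 0 w.p. 0.4, stay w.p. 0.6

def step(x, rng):
    """Next state depends only on the current state x (Markov property)."""
    return int(rng.random() < A[x][1])

rng = random.Random(0)
x, visits = 0, [0, 0]
T = 100_000
for _ in range(T):
    x = step(x, rng)
    visits[x] += 1

# Long-run visit frequencies approach the stationary distribution
# pi = (2/3, 1/3), obtained by solving pi A = pi for this particular A.
print(abs(visits[0] / T - 2 / 3) < 0.01)
```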
9. Markov Chain Monte Carlo
- First, define a transition probability P(x^{t+1} = y | x^t)
- Pick an initial state x^0; the choice is usually not important because it gets forgotten
- Generate samples x^1, x^2, ..., sampling each next value from P(X | x^t):
  x^0 → x^1 → ... → x^t → x^{t+1} → ...
- If we choose a proper P(x^{t+1} | x^t), we can guarantee that the distribution represented by the samples x^0, x^1, ... converges to π(X)
10. Markov Chain Properties
- Irreducibility
- Periodicity
- Recurrence
- Reversibility
- Ergodicity
- Stationary distribution
11. Irreducible
- A state x is said to be irreducible if, under the transition rule, one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps.
- If one state is irreducible, then all the states must be irreducible.
Liu, Ch. 12, p. 249, Def. 12.1.1
12. Aperiodic
- A state x is aperiodic if the greatest common divisor of {n : A^n(x, x) > 0} is 1.
- If state x is aperiodic and the chain is irreducible, then every state must be aperiodic.
Liu, Ch. 12, pp. 249-250, Def. 12.1.1
13. Recurrence
- A state x is recurrent if the chain returns to x with probability 1
- State x is recurrent if and only if Σ_n A^n(x, x) = ∞
- Let M(x) be the expected number of steps to return to state x
- State x is positive recurrent if M(x) is finite
- The recurrent states in a finite-state chain are positive recurrent.
14. Ergodicity
- A state x is ergodic if it is aperiodic and positive recurrent.
- If all states in a Markov chain are ergodic, then the chain is ergodic.
15. Reversibility
- A Markov chain is reversible if there is a π satisfying the detailed balance condition:
  π(x) A(x, y) = π(y) A(y, x)
- For a reversible Markov chain, π is always a stationary distribution.
16. Stationary Distribution
- If the Markov chain is time-homogeneous, then the vector π(X) is a stationary distribution (aka invariant or equilibrium distribution, aka fixed point) if its entries sum to 1 and satisfy:
  π(y) = Σ_x π(x) A(x, y)
- An irreducible chain has a stationary distribution if and only if all of its states are positive recurrent. The distribution is unique.
17. Stationary Distribution in Finite State Space
- A stationary distribution always exists but may not be unique
- If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is guaranteed to be unique, and A^n = P(x^n = y | x^0) converges to a rank-one matrix in which each row is the stationary distribution π.
- Thus, the initial state x^0 is not important for convergence: it gets forgotten and we start sampling from the target distribution
- However, it is important how long it takes to forget it!
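The rank-one convergence can be checked directly for a small, arbitrary 3-state transition matrix (a sketch using NumPy; the matrix entries are made up):

```python
import numpy as np

# An arbitrary 3-state transition matrix (each row sums to 1).
A = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# For an irreducible, aperiodic chain, A^n converges to a rank-one matrix
# whose rows are all equal to the stationary distribution pi.
An = np.linalg.matrix_power(A, 64)
pi = An[0]

print(np.allclose(pi @ A, pi))               # pi is stationary: pi A = pi
print(np.allclose(An, np.tile(pi, (3, 1))))  # every row of A^64 equals pi
```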
18. Convergence Theorem
- Given a finite-state Markov chain whose transition function is irreducible and aperiodic, A^n(x^0, y) converges to its invariant distribution π(y) geometrically in variation distance; that is, there exist 0 < r < 1 and c > 0 such that:
  ||A^n(x^0, ·) - π||_var ≤ c r^n
19. Eigenvalue Condition
- Convergence to the stationary distribution is driven by the eigenvalues of the matrix A(x, y).
- The chain will converge to its unique invariant distribution if and only if matrix A's second-largest eigenvalue in modulus is strictly less than 1.
- Many proofs of convergence center on analyzing the second eigenvalue.
Liu, Ch. 12, p. 249
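A sketch of the eigenvalue check on a small, made-up stochastic matrix: the largest eigenvalue modulus is 1, and the second-largest being strictly below 1 is what yields geometric convergence:

```python
import numpy as np

# A small illustrative stochastic matrix; convergence rate is governed by
# the second-largest eigenvalue modulus.
A = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

mods = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
print(np.isclose(mods[0], 1.0))  # largest eigenvalue of a stochastic matrix is 1
print(mods[1] < 1.0)             # second modulus < 1: geometric convergence
```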
20. Convergence in Finite State Space
- Assume a finite-state Markov chain is irreducible and aperiodic
- The initial state x^0 is not important for convergence: it gets forgotten and we start sampling from the target distribution
- However, it is important how long it takes to forget it! This is known as the burn-in time
- Since the first k states are not drawn exactly from π, they are often thrown away. Open question: how big should k be?
21. Sampling in BN
- Same idea: generate a set of T samples
- Estimate P(Xi | E) from the samples
- Challenge: X is a vector and P(X) is a huge distribution represented by the BN
- Need to know:
  - How to generate a new sample?
  - How many samples T do we need?
  - How to estimate P(E = e) and P(Xi | e)?
22. Sampling
- Input: a Bayesian network with set of nodes X
- Sample: a tuple with assigned values
  s = (X1 = x1, X2 = x2, ..., Xk = xk)
- A tuple may include all variables (except evidence) or a subset
- Sampling schemas dictate how to generate samples (tuples)
- Ideally, samples are distributed according to P(X | E)
23. Sampling Algorithms
- Forward Sampling
- Gibbs Sampling (MCMC)
  - Blocking
  - Rao-Blackwellised
- Likelihood Weighting
- Importance Sampling
- Sequential Monte Carlo (Particle Filtering) in Dynamic Bayesian Networks
24. Gibbs Sampling
- Markov Chain Monte Carlo method
- (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)
- Transition probability equals the conditional distribution
- Example: for π(X, Y), use A(x^{t+1} | y^t) = P(x | y) and A(y^{t+1} | x^t) = P(y | x)
25. Gibbs Sampling for BN
- Samples are dependent and form a Markov chain
- Sample from a chain whose distribution converges to P(X | e)
- Guaranteed to converge when all P > 0
- Methods to improve convergence:
  - Blocking
  - Rao-Blackwellised
- Error bounds:
  - Lag-k autocovariance
  - Multiple chains, Chebyshev's inequality
26. Gibbs Sampling (Pearl, 1988)
- A sample t ∈ {1, 2, ...} is an instantiation of all variables in the network
- Sampling process:
  - Fix values of observed variables e
  - Instantiate node values in sample x^0 at random
  - Generate samples x^1, x^2, ..., x^T from P(x | e)
  - Compute posteriors from samples
27. Ordered Gibbs Sampler
- Generate sample x^{t+1} from x^t; in short, for i = 1 to N:
  xi^{t+1} ← P(Xi | x1^{t+1}, ..., x_{i-1}^{t+1}, x_{i+1}^t, ..., xN^t, e)
- Process all variables in some order
28. Gibbs Sampling (cont'd) (Pearl, 1988)
- The conditional of Xi given all other variables reduces to its Markov blanket (parents, children, and children's other parents):
  P(xi | x \ xi, e) = P(xi | markov_i) ∝ P(xi | pa_i) Π_{Xj ∈ ch_i} P(xj | pa_j)
29. Ordered Gibbs Sampling Algorithm
- Input: X, E
- Output: T samples {x^t}
- Fix evidence E
- Generate samples from P(X | E):
  - For t = 1 to T (compute samples)
    - For i = 1 to N (loop through variables)
      - Sample xi^t from P(Xi | markov_i^t \ Xi)
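The loop above can be sketched on a toy 3-node chain X1 → X2 → X3 with made-up CPTs and evidence X3 = 1; the Gibbs estimate of P(X1 = 1 | X3 = 1) is compared against exact enumeration:

```python
import random

# A toy binary chain X1 -> X2 -> X3; all CPT numbers are illustrative.
p_x1 = [0.6, 0.4]                      # P(X1)
p_x2_x1 = [[0.7, 0.3], [0.2, 0.8]]     # P(X2 | X1)
p_x3_x2 = [[0.9, 0.1], [0.3, 0.7]]     # P(X3 | X2)

def gibbs(T=20_000, burn_in=2_000, seed=0):
    """Ordered Gibbs estimate of P(X1 = 1 | X3 = 1)."""
    rng = random.Random(seed)
    x1, x2, x3 = rng.randint(0, 1), rng.randint(0, 1), 1  # evidence X3 = 1 fixed
    hits = 0
    for t in range(T):
        # Sample X1 from P(X1 | x2) ∝ P(X1) P(x2 | X1)  (Markov blanket of X1)
        w = [p_x1[v] * p_x2_x1[v][x2] for v in (0, 1)]
        x1 = int(rng.random() < w[1] / (w[0] + w[1]))
        # Sample X2 from P(X2 | x1, x3) ∝ P(X2 | x1) P(x3 | X2)
        w = [p_x2_x1[x1][v] * p_x3_x2[v][x3] for v in (0, 1)]
        x2 = int(rng.random() < w[1] / (w[0] + w[1]))
        if t >= burn_in:
            hits += x1
    return hits / (T - burn_in)

# Exact answer by enumeration, for comparison.
num = sum(p_x1[1] * p_x2_x1[1][b] * p_x3_x2[b][1] for b in (0, 1))
den = sum(p_x1[a] * p_x2_x1[a][b] * p_x3_x2[b][1] for a in (0, 1) for b in (0, 1))
print(round(num / den, 3), round(gibbs(), 3))  # Gibbs estimate tracks the exact value
```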
30. Answering Queries
- Query: P(xi | e) = ?
- Method 1: fraction of samples where Xi = xi (histogram estimator):
  P̂(xi | e) = (1/T) Σ_t 1{xi^t = xi}
- Method 2: average conditional probability (mixture estimator):
  P̂(xi | e) = (1/T) Σ_t P(xi | markov_i^t)
31. Gibbs Sampling Example - BN
(Figure: a Bayesian network over nodes X1, ..., X9)
32. Gibbs Sampling Example - BN
- Initialize: X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0
(Figure: the same network over X1, ..., X9)
33. Gibbs Sampling Example - BN
- X1 ← P(X1 | x2^0, ..., x8^0, x9)
- Evidence: X9 = x9
(Figure: the same network over X1, ..., X9)
34. Gibbs Sampling Example - BN
- X2 ← P(X2 | x1^1, x3^0, ..., x8^0, x9)
- Evidence: X9 = x9
(Figure: the same network over X1, ..., X9)
35. Gibbs Sampling Illustration
36. Gibbs Sampling Example: Init
- Initialize nodes with random values:
  X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0
- Initialize running sums:
  SUM1 = 0, SUM2 = 0, SUM3 = 0, SUM4 = 0, SUM5 = 0, SUM6 = 0, SUM7 = 0, SUM8 = 0
37. Gibbs Sampling Example: Step 1
- Generate Sample 1:
  - Compute SUM1 += P(x1 | x2^0, x3^0, x4^0, x5^0, x6^0, x7^0, x8^0, x9)
  - Select and assign new value X1 = x1^1
  - Compute SUM2 += P(x2 | x1^1, x3^0, x4^0, x5^0, x6^0, x7^0, x8^0, x9)
  - Select and assign new value X2 = x2^1
  - Compute SUM3 += P(x3 | x1^1, x2^1, x4^0, x5^0, x6^0, x7^0, x8^0, x9)
  - Select and assign new value X3 = x3^1
  - ...
- At the end, we have a new sample:
  S1 = (x1^1, x2^1, ..., x8^1, x9)
38. Gibbs Sampling Example: Step 2
- Generate Sample 2:
  - Compute P(x1 | x2^1, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9)
  - Select and assign new value X1 = x1^2
  - Update SUM1 += P(x1 | x2^1, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9)
  - Compute P(x2 | x1^2, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9)
  - Select and assign new value X2 = x2^2
  - Update SUM2 += P(x2 | x1^2, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9)
  - Compute P(x3 | x1^2, x2^2, x4^1, x5^1, x6^1, x7^1, x8^1, x9)
  - Select and assign new value X3 = x3^2
  - Update SUM3 += P(x3 | x1^2, x2^2, x4^1, x5^1, x6^1, x7^1, x8^1, x9)
  - ...
- New sample: S2 = (x1^2, x2^2, ..., x8^2, x9)
39. Gibbs Sampling Example: Answering Queries
- P(x1 | x9) ≈ SUM1 / 2
- P(x2 | x9) ≈ SUM2 / 2
- P(x3 | x9) ≈ SUM3 / 2
- P(x4 | x9) ≈ SUM4 / 2
- P(x5 | x9) ≈ SUM5 / 2
- P(x6 | x9) ≈ SUM6 / 2
- P(x7 | x9) ≈ SUM7 / 2
- P(x8 | x9) ≈ SUM8 / 2
40. Gibbs Convergence
- Stationary distribution = target sampling distribution
- MCMC converges to the stationary distribution if the network is ergodic
- The chain is ergodic if all probabilities are positive
- If there exist i, j such that p_ij = 0, then we may not be able to explore the full sampling space!
41. Gibbs Sampling: Burn-In
- We want to sample from P(X | E)
- But the starting point is random
- Solution: throw away the first K samples
- Known as burn-in
- What is K? Hard to tell. Use intuition.
- Alternative: draw the first sample values from an approximation of P(x | e) (for example, run IBP first)
42. Gibbs Sampling: Performance
- Advantage: guaranteed to converge to P(X | E)
- Disadvantage: convergence may be slow
- Problems:
  - Samples are dependent!
  - Statistical variance is too big in high-dimensional problems
43. Gibbs: Speeding Convergence
- Reduce dependence between samples (autocorrelation):
  - Skip samples
  - Randomize variable sampling order
- Reduce variance:
  - Blocking Gibbs sampling
  - Rao-Blackwellisation
44. Skipping Samples
- Pick only every k-th sample (Geyer, 1992)
- Can reduce dependence between samples!
- But increases variance! Wastes samples!
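A sketch of the effect: a strongly autocorrelated binary chain (a made-up stay/flip process, not a BN sampler) has much lower lag-1 autocorrelation after keeping only every 20th sample:

```python
import random

def flip_chain(T, stay=0.95, seed=1):
    """A binary chain that keeps its state w.p. `stay`: heavily autocorrelated."""
    rng = random.Random(seed)
    x, out = 0, []
    for _ in range(T):
        if rng.random() > stay:   # with prob 1 - stay, flip the state
            x = 1 - x
        out.append(x)
    return out

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((v - m) ** 2 for v in xs) / n
    cov = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1)) / (n - 1)
    return cov / var

chain = flip_chain(100_000)
thinned = chain[::20]             # skip samples: keep every 20th
print(lag1_autocorr(chain) > lag1_autocorr(thinned))  # thinning cuts autocorrelation
```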
45. Randomized Variable Order
- Random-scan Gibbs sampler
- Pick each next variable Xi for update at random with probability pi, where Σi pi = 1.
- (In the simplest case, the pi are uniform.)
- In some instances this reduces variance (MacEachern and Peruggia, 1999, "Subsampling the Gibbs Sampler: Variance Reduction")
46. Blocking
- Sample several variables together, as a block
- Example: given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:
  - x^{t+1} ← P(X | y^t, z^t) = P(X | w^t)
  - w^{t+1} = (y^{t+1}, z^{t+1}) ← P(W | x^{t+1})
- + Can improve convergence greatly when two variables are strongly correlated!
- - The domain of the block variable grows exponentially with the number of variables in the block!
47. Blocking Gibbs Sampling
- Jensen, Kong, Kjaerulff, 1993: "Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems"
- Select a set of subsets E1, E2, E3, ..., Ek such that:
  - Ei ⊆ X
  - ∪i Ei = X
  - Ai = X \ Ei
- Sample P(Ei | Ai)
48. Rao-Blackwellisation
- Do not sample all variables!
- Sample a subset!
- Example: given three variables X, Y, Z, sample only X and Y, summing out Z. Given sample (x^t, y^t), compute the next sample:
  - x^{t+1} ← P(x | y^t)
  - y^{t+1} ← P(y | x^{t+1})
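A sketch of this scheme on a made-up joint table P(X, Y, Z) over binary variables: we alternate x ~ P(x | y) and y ~ P(y | x), where each conditional marginalizes Z, and still recover the marginal P(X):

```python
import random
from itertools import product

# Illustrative joint P(X, Y, Z) over binary variables (entries sum to 1).
joint = {state: p for state, p in zip(
    product((0, 1), repeat=3),
    [0.10, 0.05, 0.15, 0.10, 0.05, 0.20, 0.05, 0.30])}

def conditional(var_idx, given_idx, given_val):
    """P(variable var_idx = 1 | variable given_idx = given_val), Z summed out."""
    w = [0.0, 0.0]
    for state, p in joint.items():
        if state[given_idx] == given_val:
            w[state[var_idx]] += p      # summing over the third variable
    return w[1] / (w[0] + w[1])

rng = random.Random(0)
x, y, count_x1 = 0, 0, 0
T = 50_000
for _ in range(T):
    x = int(rng.random() < conditional(0, 1, y))   # x ~ P(x | y)
    y = int(rng.random() < conditional(1, 0, x))   # y ~ P(y | x)
    count_x1 += x

exact_px1 = sum(p for (xv, yv, zv), p in joint.items() if xv == 1)
print(abs(count_x1 / T - exact_px1) < 0.02)  # estimate matches the marginal P(X = 1)
```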
49. Rao-Blackwell Theorem
Bottom line: reducing the number of variables in a sample reduces variance!
50. Blocking vs. Rao-Blackwellisation
- Standard Gibbs:
  P(x | y, z), P(y | x, z), P(z | x, y)  (1)
- Blocking:
  P(x | y, z), P(y, z | x)  (2)
- Rao-Blackwellised:
  P(x | y), P(y | x)  (3)
- Var(3) < Var(2) < Var(1)
- Liu, Wong, Kong, 1994: "Covariance structure of the Gibbs sampler"
(Figure: the three variables X, Y, Z)
51. Rao-Blackwellised Gibbs: Cutset Sampling
- Select C ⊆ X (possibly a cycle-cutset), |C| = m
- Fix evidence E
- Initialize nodes with random values: for i = 1 to m, set Ci = ci^0
- For t = 1 to n, generate samples:
  - For i = 1 to m:
    Ci = ci^{t+1} ← P(ci | c1^{t+1}, ..., c_{i-1}^{t+1}, c_{i+1}^t, ..., cm^t, e)
52. Cutset Sampling
- Select a subset C = {C1, ..., CK} ⊆ X
- A sample t ∈ {1, 2, ...} is an instantiation of C
- Sampling process:
  - Fix values of observed variables e
  - Generate sample c^0 at random
  - Generate samples c^1, c^2, ..., c^T from P(c | e)
  - Compute posteriors from samples
53. Cutset Sampling: Generating Samples
- Generate sample c^{t+1} from c^t; in short, for i = 1 to K:
  ci^{t+1} ← P(Ci | c1^{t+1}, ..., c_{i-1}^{t+1}, c_{i+1}^t, ..., cK^t, e)
54. Rao-Blackwellised Gibbs: Cutset Sampling
- How to compute P(ci | c^t \ ci, e)?
- Compute the joint P(ci, c^t \ ci, e) for each ci ∈ D(Ci)
- Then normalize:
  P(ci | c^t \ ci, e) ∝ P(ci, c^t \ ci, e)
- Computational efficiency depends on the choice of C
55. Rao-Blackwellised Gibbs: Cutset Sampling
- How to choose C?
- Special case: C is a cycle-cutset, O(N)
- General case: apply Bucket Tree Elimination (BTE), O(exp(w)), where w is the induced width of the network when the nodes in C are observed.
- Pick C wisely so as to minimize w: the notion of a w-cutset
56. w-cutset Sampling
- C = w-cutset of the network: a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w
- Complexity of exact inference is bounded by w!
- The cycle-cutset is a special case
57. Cutset Sampling: Answering Queries
- Query ∀ci ∈ C, P(ci | e): same as Gibbs, computed while generating sample t
- Query P(xi | e) for Xi ∉ C: special case of w-cutset, computed after generating sample t
58. Cutset Sampling Example
(Figure: a Bayesian network over nodes X1, ..., X9 with evidence E = {X9 = x9})
59. Cutset Sampling Example
Sample a new value for X2
(Figure: the same network over X1, ..., X9)
60. Cutset Sampling Example
Sample a new value for X5
(Figure: the same network over X1, ..., X9)
61. Cutset Sampling Example
Query P(x2 | e) for sampling node X2, averaged over Samples 1, 2, 3
(Figure: the same network over X1, ..., X9)
62. Cutset Sampling Example
Query P(x3 | e) for non-sampled node X3
(Figure: the same network over X1, ..., X9)
63. Gibbs: Error Bounds
- Objectives:
  - Estimate the needed number of samples T
  - Estimate the error
- Methodology:
  - 1 chain: use lag-k autocovariance to estimate T
  - M chains: use standard sampling variance to estimate the error
64. Gibbs: lag-k Autocovariance
Lag-k autocovariance of the chain of estimates P^1, ..., P^T:
  γ(k) = (1/T) Σ_{t=1}^{T-k} (P^t - P̄)(P^{t+k} - P̄)
65. Gibbs: lag-k Autocovariance
Estimate the Monte Carlo variance:
  Var(P̄) = (1/T) [ γ(0) + 2 Σ_{k=1}^{2δ-1} γ(k) ]
Here, δ is the smallest positive integer satisfying γ(2δ) + γ(2δ+1) < 0, i.e., the sum is truncated just before the paired autocovariances turn negative.
Effective chain size:
  T̂ = γ(0) / Var(P̄)
In the absence of autocovariance: T̂ = T.
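A sketch of the estimator, using truncation at the first negative paired autocovariance (a Geyer-style rule; the exact truncation used on the original slide is not preserved). An iid sequence and a strongly autocorrelated AR(1) sequence are compared:

```python
import random

def autocov(xs, k):
    """Sample lag-k autocovariance gamma(k)."""
    T = len(xs)
    m = sum(xs) / T
    return sum((xs[t] - m) * (xs[t + k] - m) for t in range(T - k)) / T

def mc_variance(xs):
    """Var of the chain mean: gamma(0) + 2*sum of gammas, truncated at the
    first negative paired sum gamma(2m) + gamma(2m+1)."""
    T = len(xs)
    total = -autocov(xs, 0)
    m = 0
    while 2 * m + 1 < T - 1:
        pair = autocov(xs, 2 * m) + autocov(xs, 2 * m + 1)
        if pair < 0:              # remaining pairs are noise: truncate
            break
        total += 2 * pair
        m += 1
    return total / T

def effective_size(xs):
    return autocov(xs, 0) / mc_variance(xs)

rng = random.Random(0)
iid = [rng.gauss(0, 1) for _ in range(5000)]
ar = [0.0]
for _ in range(4999):
    ar.append(0.9 * ar[-1] + rng.gauss(0, 1))  # strongly autocorrelated chain

print(effective_size(iid) > effective_size(ar))  # dependence shrinks the ESS
```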
66. Gibbs: Multiple Chains
- Generate M chains, each of size K
- Each chain m produces an independent estimate Pm:
  Pm(xi | e) = (1/K) Σ_t P(xi | markov_i^t)
- Estimate P(xi | e) as the average of the Pm(xi | e):
  P̂ = (1/M) Σ_m Pm
- Treat the Pm as independent random variables.
67. Gibbs: Multiple Chains
- The Pm are independent random variables; therefore:
  S² = (1/(M-1)) Σ_{m=1}^{M} (Pm - P̂)²,  Var(P̂) = S² / M
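A minimal sketch of the multiple-chains estimate, with the per-chain estimates Pm simulated as noisy values around a made-up true probability:

```python
import random

def combine(chain_estimates):
    """Average M independent chain estimates and return (P_hat, Var(P_hat))."""
    M = len(chain_estimates)
    p_hat = sum(chain_estimates) / M
    s2 = sum((p - p_hat) ** 2 for p in chain_estimates) / (M - 1)
    return p_hat, s2 / M

rng = random.Random(0)
true_p = 0.3
pms = [true_p + rng.gauss(0, 0.02) for _ in range(20)]  # 20 simulated chains
p_hat, var = combine(pms)
print(abs(p_hat - true_p) < 0.02, var > 0)  # averaging tightens the estimate
```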
68. Geman & Geman, 1984
- Geman, S. and Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741.
- Introduced Gibbs sampling
- Placed the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.
69. Tanner & Wong, 1987
- Tanner and Wong (1987)
- Data augmentation
- Convergence results
70. Pearl, 1988
- Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.
- In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable, and the joint distribution is defined by the factorization of the network.
71. Gelfand & Smith, 1990
- Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.
- Show variance reduction in using the mixture estimator for posterior marginals.
72. Neal, 1992
- Neal, R. M., 1992. Connectionist learning of belief networks. Artificial Intelligence, v. 56, pp. 71-118.
- Stochastic simulation in Noisy-OR networks.
73. CPCS54 Test Results
MSE vs. samples (left) and time (right)
Ergodic, |X| = 54, |D(Xi)| = 2, |C| = 15, |E| = 4
Exact time: 30 sec using Cutset Conditioning
74. CPCS179 Test Results
MSE vs. samples (left) and time (right)
Non-ergodic (1 deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ |D(Xi)| ≤ 4, |E| = 35
Exact time: 122 sec using Loop-Cutset Conditioning
75. CPCS360b Test Results
MSE vs. samples (left) and time (right)
Ergodic, |X| = 360, |D(Xi)| = 2, |C| = 21, |E| = 36
Exact time > 60 min using Cutset Conditioning
Exact values obtained via Bucket Elimination
76. Random Networks
MSE vs. samples (left) and time (right)
|X| = 100, |D(Xi)| = 2, |C| = 13, |E| = 15-20
Exact time: 30 sec using Cutset Conditioning
77. Coding Networks
(Figure: coding network with inputs x1-x4, hidden units u1-u4, parity checks p1-p4, outputs y1-y4)
MSE vs. time (right)
Non-ergodic, |X| = 100, |D(Xi)| = 2, |C| = 13-16, |E| = 50
Sample ergodic subspace U = {U1, U2, ..., Uk}
Exact time: 50 sec using Cutset Conditioning
78. Non-Ergodic Hailfinder
MSE vs. samples (left) and time (right)
Non-ergodic, |X| = 56, |C| = 5, 2 ≤ |D(Xi)| ≤ 11, |E| = 0
Exact time: 2 sec using Loop-Cutset Conditioning
79. Non-Ergodic CPCS360b - MSE
MSE vs. time
Non-ergodic, |X| = 360, |C| = 26, |D(Xi)| = 2
Exact time: 50 min using BTE
80. Non-Ergodic CPCS360b - MaxErr
81. Importance vs. Gibbs
- Importance sampling draws independent samples x^t from a proposal Q and reweights them with w^t = P(x^t, e) / Q(x^t); Gibbs draws dependent, unweighted samples from the chain.