Title: Inference III: Approximate Inference
1 Inference III: Approximate Inference
2 Global conditioning
Fixing the values of A and B
Fixing the values of some variables at the start of the summation can shrink the tables formed by variable elimination. In this way, space is traded for time.
Special case: choose to fix a set of nodes that breaks all loops. This method is called cutset conditioning. Alternatively, choose to fix some variables from the largest cliques in a clique tree.
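A sketch of the identity behind this trade (standard conditioning; the query notation P(y, e) is mine, not the slides'):

```latex
P(y, e) \;=\; \sum_{a} \sum_{b} P(y, e, A{=}a, B{=}b)
```

Each term on the right is computed by a separate variable-elimination run in which A and B are clamped, so every factor mentioning A or B loses those dimensions; the price is repeating the run |Val(A)|·|Val(B)| times.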
3 Approximation
- Until now, we examined exact computation
- In many applications, approximations are sufficient
- Example: P(X = x | e) = 0.3183098861838
- Maybe P(X = x | e) ≈ 0.3 is a good enough approximation
- e.g., we take action only if P(X = x | e) > 0.5
- Can we find good approximation algorithms?
4 Types of Approximations
- Absolute error
  - An estimate q of P(X = x | e) has absolute error ε if
    P(X = x | e) − ε ≤ q ≤ P(X = x | e) + ε
  - equivalently: q − ε ≤ P(X = x | e) ≤ q + ε
- Absolute error is not always what we want
  - If P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable
  - If P(X = x | e) = 0.3, then an absolute error of 0.001 is overly precise
[Figure: the [0, 1] probability scale with an interval of width 2ε centered at q]
5 Types of Approximations
- Relative error
  - An estimate q of P(X = x | e) has relative error ε if
    P(X = x | e)(1 − ε) ≤ q ≤ P(X = x | e)(1 + ε)
  - equivalently: q/(1 + ε) ≤ P(X = x | e) ≤ q/(1 − ε)
- Sensitivity of the approximation depends on the actual value of the desired result
[Figure: the [0, 1] probability scale with the interval [q/(1 + ε), q/(1 − ε)] around q]
6 Complexity
- Recall: exact inference is NP-hard
- Is approximate inference any easier?
- Construction used for exact inference:
  - Input: a 3-SAT problem φ
  - Output: a BN such that P(X = t) > 0 iff φ is satisfiable
7 Complexity: Relative Error
- Suppose that q is a relative error estimate of P(X = t)
- If φ is not satisfiable, then P(X = t) = 0. Hence
  0 = P(X = t)(1 − ε) ≤ q ≤ P(X = t)(1 + ε) = 0,
  namely q = 0. Thus, if q > 0, then φ is satisfiable
- An immediate consequence:
  Thm: Given ε, finding an ε-relative error approximation is NP-hard
8 Complexity: Absolute Error
- We can find absolute error approximations to P(X = x) with high probability (via sampling)
  - We will see such algorithms shortly
- However, once we have evidence, the problem is harder
- Thm:
  - If ε < 0.5, then finding an estimate of P(X = x | e) with absolute error ε is NP-hard
9 Proof
10 Proof (cont.)
- Suppose we can estimate with absolute error ε
- Let p1 ≈ P(Q1 = t | X = t) be such an estimate
  Assign q1 = t if p1 > 0.5, else q1 = f
- Let p2 ≈ P(Q2 = t | X = t, Q1 = q1)
  Assign q2 = t if p2 > 0.5, else q2 = f
- ...
- Let pn ≈ P(Qn = t | X = t, Q1 = q1, ..., Qn−1 = qn−1)
  Assign qn = t if pn > 0.5, else qn = f
11 Proof (cont.)
- Claim: if φ is satisfiable, then q1, ..., qn is a satisfying assignment
- Suppose φ is satisfiable
- By induction on i: there is a satisfying assignment with Q1 = q1, ..., Qi = qi
- Base case:
  - If Q1 = t in all satisfying assignments, then P(Q1 = t | X = t) = 1, so p1 ≥ 1 − ε > 0.5, and therefore q1 = t
  - If Q1 = f in all satisfying assignments, then, symmetrically, q1 = f
  - Otherwise, the statement holds for any choice of q1
12 Proof (cont.)
- Claim: if φ is satisfiable, then q1, ..., qn is a satisfying assignment
- Suppose φ is satisfiable
- By induction on i: there is a satisfying assignment with Q1 = q1, ..., Qi = qi
- Induction step:
  - If Qi+1 = t in all satisfying assignments with Q1 = q1, ..., Qi = qi, then P(Qi+1 = t | X = t, Q1 = q1, ..., Qi = qi) = 1, so pi+1 ≥ 1 − ε > 0.5, and therefore qi+1 = t
  - If Qi+1 = f in all satisfying assignments with Q1 = q1, ..., Qi = qi, then qi+1 = f
13 Proof (cont.)
- We can efficiently check whether q1, ..., qn is a satisfying assignment (linear time)
- If it is, then φ is satisfiable
- If it is not, then φ is not satisfiable
- Suppose we had an approximation procedure with absolute error ε < 0.5
  - ⇒ we could decide 3-SAT with n procedure calls
  - ⇒ such approximation is NP-hard
14 When can we hope to approximate?
- Two situations:
  - Peaked distributions
    - improbable values are ignored
  - Highly stochastic distributions
    - distant evidence is discarded
15 Peaked distributions
- If the distribution is peaked, then most of the mass is on a few instances
- If we can focus on these instances, we can ignore the rest
[Figure: probability mass concentrated on a few instances (x-axis: instances)]
16 Stochasticity & Approximations
- Consider a chain X1 → X2 → ... → Xn with
  P(Xi+1 = t | Xi = t) = 1 − ε and P(Xi+1 = f | Xi = f) = 1 − ε
- Computing the probability of Xn given X1, we get:
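A hedged reconstruction of the result (the standard closed form for this symmetric two-state chain, not copied from the slides):

```latex
P(X_n = t \mid X_1 = t) \;=\; \tfrac{1}{2} + \tfrac{1}{2}\,(1 - 2\varepsilon)^{\,n-1}
\;\longrightarrow\; \tfrac{1}{2} \quad \text{as } n \to \infty .
```

So unless ε is very close to 0 or 1, the influence of X1 on Xn decays geometrically with n.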
17 Plot of P(Xn = t | X1 = t)
[Plot: P(Xn = t | X1 = t) as a function of n, for several values of ε]
18 Stochastic Processes
- This behavior of a chain (a Markov process) is called mixing. We will return to this as a tool in approximation
- In general networks there is similar behavior
  - If probabilities are far from 0 and 1, then the effect of distant evidence vanishes (and so it can be discarded in approximations)
19 Bounded conditioning
Fixing the values of A and B
By examining only the probable assignments of A and B, we perform several simple computations instead of one complex computation
20 Bounded conditioning
- Choose A and B so that P(Y, e | a, b) can be computed easily, e.g., a cycle cutset.
- Search for highly probable assignments to A, B:
  - Option 1: select a, b with high P(a, b)
  - Option 2: select a, b with high P(a, b | e)
- We need to search for such high-mass values, and that can be hard.
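A sketch of the identity behind the bound (standard conditioning; the set H of examined assignments is my notation):

```latex
P(Y, e) \;=\; \sum_{a, b} P(Y, e \mid a, b)\, P(a, b)
\;\;\approx\;\; \sum_{(a, b) \in \mathcal{H}} P(Y, e \mid a, b)\, P(a, b),
\qquad
\text{error} \;\le\; \sum_{(a, b) \notin \mathcal{H}} P(a, b).
```

Because every omitted term is at most P(a, b), the unexamined prior mass directly bounds the error, which is what the error bars on the next slide rely on.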
21 Bounded Conditioning
- Advantages:
  - Combines exact inference within an approximation
  - Continuous: more time can be used to examine more cases
  - Bounds: the unexamined mass is used to compute error bars
- Possible problems:
  - P(a, b) is prior mass, not the posterior
  - If the posterior P(a, b | e) is significantly different, computation can be wasted on irrelevant assignments
22 Network Simplifications
- In these approaches, we try to replace the original network with a simpler one
  - the resulting network allows fast exact methods
23 Network Simplifications
- Typical simplifications:
  - Remove parts of the network
  - Remove edges
  - Reduce the number of values (value abstraction)
  - Replace a sub-network with a simpler one (model abstraction)
- These simplifications are often made w.r.t. the particular evidence and query
24 Stochastic Simulation
- Suppose we can sample instances ⟨x1, ..., xn⟩ according to P(X1, ..., Xn)
- What is the probability that a random sample ⟨x1, ..., xn⟩ satisfies e?
  - This is exactly P(e)
- We can view each sample as tossing a biased coin with probability P(e) of Heads
25 Stochastic Sampling
- Intuition: given a sufficient number of samples x[1], ..., x[N], we can estimate P(e) by the fraction of samples that satisfy e
- The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability
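A minimal Python sketch of this estimator; `sample_joint` and `satisfies_e` are hypothetical placeholders standing in for a sampler of P(X1, ..., Xn) and the evidence test:

```python
import random

def estimate_prob_of_evidence(sample_joint, satisfies_e, num_samples=10_000):
    """Monte Carlo estimate of P(e): the fraction of joint samples consistent with e."""
    hits = sum(satisfies_e(sample_joint()) for _ in range(num_samples))
    return hits / num_samples

# Toy usage: two independent fair coins, evidence e = "both heads"; true P(e) = 0.25.
sample_joint = lambda: (random.random() < 0.5, random.random() < 0.5)
satisfies_e = lambda x: x[0] and x[1]
print(estimate_prob_of_evidence(sample_joint, satisfies_e))
```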
26 Sampling a Bayesian Network
- If P(X1, ..., Xn) is represented by a Bayesian network, can we efficiently sample from it?
- Idea: sample according to the structure of the network
  - Write the distribution using the chain rule, and then sample each variable given its parents
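The factorization the slide refers to is the standard BN chain rule:

```latex
P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_i),
\qquad \text{so we sample } x_i \sim P(X_i \mid \mathrm{pa}_i) \text{ in topological order.}
```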
27–31 Logic sampling
[Figures: five slides stepping through logic sampling on a Burglary/Earthquake alarm network (variables B, E, A, C, R) with CPTs P(b) = 0.03, P(e) = 0.001, P(a | B, E), P(r | E), and P(c | A); successive slides sample one more variable in topological order, building up a single complete instance.]
32 Logic Sampling
- Let X1, ..., Xn be an order of the variables consistent with arc direction
- for i = 1, ..., n do
  - sample xi from P(Xi | pai)
  - (Note: since Pai ⊆ {X1, ..., Xi−1}, we have already assigned values to them)
- return x1, ..., xn
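A minimal Python sketch of this loop on an earthquake-style network matching the slides' figure. The CPT numbers are taken from that figure, but the mapping of the P(a | B, E) entries to parent configurations is my assumption:

```python
import random

# CPTs: each variable maps a tuple of parent values to P(variable = True).
# Structure assumed from the figure: B -> A <- E, E -> R, A -> C.
CPTS = {
    "B": {(): 0.03},
    "E": {(): 0.001},
    "A": {(True, True): 0.98, (True, False): 0.7,
          (False, True): 0.4, (False, False): 0.01},
    "R": {(True,): 0.3, (False,): 0.001},
    "C": {(True,): 0.8, (False,): 0.05},
}
PARENTS = {"B": (), "E": (), "A": ("B", "E"), "R": ("E",), "C": ("A",)}
ORDER = ["B", "E", "A", "R", "C"]  # topological: parents precede children

def logic_sample():
    """Sample one complete instance, each variable given its already-sampled parents."""
    sample = {}
    for var in ORDER:
        parent_vals = tuple(sample[p] for p in PARENTS[var])
        sample[var] = random.random() < CPTS[var][parent_vals]
    return sample

# Forward-sampling estimate of P(C = t).
N = 10_000
print(sum(logic_sample()["C"] for _ in range(N)) / N)
```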
33 Logic Sampling
- Sampling a complete instance is linear in the number of variables
  - Regardless of the structure of the network
- However, if P(e) is small, we need many samples to get a decent estimate
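To make "many samples" concrete (a standard Chernoff-bound consequence, not stated on the slide): achieving relative error ε with probability 1 − δ by simply counting samples that satisfy e requires on the order of

```latex
N \;=\; O\!\left( \frac{1}{P(e)\,\varepsilon^{2}} \,\log\frac{1}{\delta} \right)
```

samples, so the cost blows up as P(e) → 0.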
34 Can we sample from P(X1, ..., Xn | e)?
- If the evidence is at the roots of the network: easily
- If the evidence is at the leaves of the network, we have a problem
  - Our sampling method proceeds according to the order of nodes in the graph
- Note: we can use arc reversal to make the evidence nodes roots
  - In some networks, however, this will create exponentially large tables...
35 Likelihood Weighting
- Can we ensure that all of our samples satisfy e?
- One simple solution:
  - When we need to sample a variable that is assigned a value by e, use the specified value
- For example, we know Y = 1:
  - Sample X from P(X)
  - Then take Y = 1
- Is this a sample from P(X, Y | Y = 1)?
36 Likelihood Weighting
- Problem: these samples of X are from P(X)
- Solution:
  - Penalize samples in which P(Y = 1 | X) is small
- We now sample as follows:
  - Let x[i] be a sample from P(X)
  - Let w[i] = P(Y = 1 | X = x[i])
37 Likelihood Weighting
- Why does this make sense?
- When N is large, we expect to sample about N·P(X = x) samples with x[i] = x
- Thus,
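A hedged reconstruction of the step meant here (the standard likelihood-weighting argument, in the slide's notation):

```latex
\sum_{i:\, x^{[i]} = x} w^{[i]}
\;\approx\; N \, P(X = x)\, P(Y = 1 \mid X = x)
\;=\; N \, P(X = x, Y = 1),
\qquad\text{so}\qquad
\frac{\sum_{i:\, x^{[i]} = x} w^{[i]}}{\sum_{i} w^{[i]}}
\;\approx\; P(X = x \mid Y = 1).
```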
38–42 Likelihood Weighting
[Figures: five slides stepping through likelihood weighting on the same alarm network (variables B, E, A, C, R). The evidence nodes are clamped to their observed values instead of being sampled, the running weight is multiplied by the corresponding CPT entries (the figure shows weights 0.6 and 0.3 accumulating), and each completed weighted sample is added to a B, E, A, C, R table of samples.]
43 Likelihood Weighting
- Let X1, ..., Xn be an order of the variables consistent with arc direction
- w = 1
- for i = 1, ..., n do
  - if Xi = xi has been observed
    - w ← w · P(Xi = xi | pai)
  - else
    - sample xi from P(Xi | pai)
- return x1, ..., xn, and w
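A minimal Python sketch of this procedure, using the same illustrative network encoding as the logic-sampling sketch above (CPT numbers from the slides' figure; the parent-configuration mapping is my assumption):

```python
import random

CPTS = {
    "B": {(): 0.03},
    "E": {(): 0.001},
    "A": {(True, True): 0.98, (True, False): 0.7,
          (False, True): 0.4, (False, False): 0.01},
    "R": {(True,): 0.3, (False,): 0.001},
    "C": {(True,): 0.8, (False,): 0.05},
}
PARENTS = {"B": (), "E": (), "A": ("B", "E"), "R": ("E",), "C": ("A",)}
ORDER = ["B", "E", "A", "R", "C"]

def weighted_sample(evidence):
    """One likelihood-weighted sample: evidence variables are clamped, and the
    weight collects P(X_i = observed value | pa_i) for every clamped variable."""
    sample, weight = {}, 1.0
    for var in ORDER:
        p_true = CPTS[var][tuple(sample[p] for p in PARENTS[var])]
        if var in evidence:
            sample[var] = evidence[var]
            weight *= p_true if evidence[var] else 1.0 - p_true
        else:
            sample[var] = random.random() < p_true
    return sample, weight

# Estimate P(B = t | C = t) as a weighted average over samples.
num = den = 0.0
for _ in range(20_000):
    s, w = weighted_sample({"C": True})
    num += w * s["B"]
    den += w
print(num / den)
```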
44 Likelihood Weighting
- What can we say about the quality of the answer?
- Intuitively, the weights of the samples reflect their probability given the evidence. We need to collect a certain mass.
- Another factor is the extremeness of the CPDs.
- Thm:
  - If P(Xi | Pai) ∈ [l, u] for all CPDs, and ...
  - then, with probability 1 − δ, the estimate is an ε relative error approximation
45 END