Title: Bayesian Inference
1. Bayesian Inference
- Summer School on Causality, Uncertainty and Ignorance - Konstanz, 15-21 August 2004
- David Glass, University of Ulster, dh.glass@ulster.ac.uk
2. Lecture 1: Introduction to Bayesian Networks
- Bayes' theorem
- Large problems
- Bayesian networks - an overview
- Causality and Bayesian networks
  - the Causal Markov condition
- Inference in BNs
  - constructing the junction tree
  - two-phase propagation
3. Bayes' Theorem

  P(h | e) = P(e | h) P(h) / P(e)

  Posterior = Likelihood x Prior / Probability of Evidence

The probability of a hypothesis, h, can be updated when evidence, e, has been obtained. Note that it is usually not necessary to calculate P(e) directly as it can be obtained by normalizing the posterior probabilities, P(hi | e).
4. A Simple Example
Consider two related variables:
1. Drug (D) with values y or n
2. Test (T) with values +ve or -ve
And suppose we have the following probabilities:
  P(D = y) = 0.001
  P(T = +ve | D = y) = 0.8
  P(T = +ve | D = n) = 0.01
These probabilities are sufficient to define a joint probability distribution.
Suppose an athlete tests positive. What is the probability that he has taken the drug?
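As a quick arithmetic check, here is a minimal Python sketch of this calculation (the variable names are my own):

```python
# Bayes' theorem for the drug-test example:
# P(D=y | T=+ve) = P(T=+ve | D=y) P(D=y) / P(T=+ve)

p_d = 0.001            # P(D = y)
p_pos_given_d = 0.8    # P(T = +ve | D = y)
p_pos_given_nd = 0.01  # P(T = +ve | D = n)

# Probability of the evidence, obtained by summing over both values of D
p_pos = p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d)

posterior = p_pos_given_d * p_d / p_pos
print(f"P(D = y | T = +ve) = {posterior:.4f}")  # approximately 0.074
```

Despite the positive test, the posterior remains low because the prior P(D = y) is so small.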
5. A More Complex Case
Suppose now that there is a similar link between Lung Cancer (L) and a chest X-ray (X) and that we also have the following relationships: History of smoking (S) has a direct influence on bronchitis (B) and lung cancer (L), and L and B have a direct influence on fatigue (F).
What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? That is, we want P(b1 | s1, f1, x1), where, for example, the variable B takes on values b1 (has bronchitis) and b2 (does not have bronchitis).
R.E. Neapolitan, Learning Bayesian Networks (2004)
6. Problems with Large Instances
- The joint probability distribution, P(b,s,f,x,l)
  - For five binary variables there are 2^5 = 32 values in the joint distribution (for 100 variables there are over 10^30 values)
  - How are these values to be obtained?
- Inference
  - To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms, e.g. 2^2 in the calculation of

      P(b1 | s1, f1, x1) = Σ_l P(b1, s1, f1, x1, l) / Σ_{b,l} P(b, s1, f1, x1, l)

    which increases to 2^97 if there are 100 variables.
7. Bayesian Networks
- A Bayesian network consists of
  - A graph
    - nodes represent the random variables
    - directed edges (arrows) between pairs of nodes
    - it must be a Directed Acyclic Graph (DAG): no directed cycles
    - the graph represents independence relationships between variables
  - Conditional probability specifications
    - the conditional probability of each variable given its parents in the DAG
8. An Example Bayesian Network
  P(s1) = 0.2
  P(l1 | s1) = 0.003,   P(l1 | s2) = 0.00005
  P(b1 | s1) = 0.25,    P(b1 | s2) = 0.05
  P(f1 | b1, l1) = 0.75,  P(f1 | b1, l2) = 0.10,  P(f1 | b2, l1) = 0.5,  P(f1 | b2, l2) = 0.05
  P(x1 | l1) = 0.6,     P(x1 | l2) = 0.02
R.E. Neapolitan, Learning Bayesian Networks (2004)
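For later reference, here is one possible Python encoding of this network as parent lists plus conditional probability tables. The layout and the helper names (`parents`, `cpt`, `p`) are my own; the numbers are those given above, with value 1 standing for s1, l1, etc. and value 2 for the complementary state.

```python
# Each variable: list of parents (in a fixed order) and a CPT mapping
# (parent values) -> P(variable = 1 | parents); P(variable = 2 | ...) = 1 - that.
parents = {
    "S": [],
    "L": ["S"],
    "B": ["S"],
    "F": ["B", "L"],
    "X": ["L"],
}

cpt = {
    "S": {(): 0.2},
    "L": {(1,): 0.003, (2,): 0.00005},
    "B": {(1,): 0.25, (2,): 0.05},
    "F": {(1, 1): 0.75, (1, 2): 0.10, (2, 1): 0.5, (2, 2): 0.05},
    "X": {(1,): 0.6, (2,): 0.02},
}

def p(var, value, assignment):
    """P(var = value | parents of var as given in assignment); values are 1 or 2."""
    key = tuple(assignment[pa] for pa in parents[var])
    p1 = cpt[var][key]
    return p1 if value == 1 else 1.0 - p1

print(p("L", 1, {"S": 1}))  # P(l1 | s1) = 0.003
```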
9. The Markov Condition
A Bayesian network (G, P) satisfies the Markov condition according to which, for each variable X in G, X is conditionally independent of its nondescendants given its parents in G. Denoted by X ⊥ nd(X) | pa(X), or I_P(X, nd(X) | pa(X)).
E.g., in this network: F ⊥ {S, X} | {B, L} and L ⊥ B | S.
The Markov Condition is sometimes referred to as
the local directed Markov condition or the
parental Markov condition. See Cowell et al
(1999) or Whittaker (1990) for a detailed
discussion of Markov properties.
10. The Joint Probability Distribution
Note that our joint distribution with 5 variables can be represented (by the chain rule) as

  P(x, f, b, l, s) = P(x | f, b, l, s) P(f | b, l, s) P(b | l, s) P(l | s) P(s)

But due to the Markov condition we have, for example,

  P(x | f, b, l, s) = P(x | l),   P(f | b, l, s) = P(f | b, l),   P(b | l, s) = P(b | s)

Consequently the joint probability distribution can now be expressed as

  P(x, f, b, l, s) = P(x | l) P(f | b, l) P(b | s) P(l | s) P(s)

For example, the probability that someone has a smoking history, lung cancer but not bronchitis, suffers from fatigue and tests positive in an X-ray test is

  P(x1, f1, b2, l1, s1) = P(x1 | l1) P(f1 | b2, l1) P(b2 | s1) P(l1 | s1) P(s1)
                        = 0.6 x 0.5 x 0.75 x 0.003 x 0.2 = 0.000135
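Using the hypothetical encoding sketched earlier, the factorized joint probability can be computed directly:

```python
def joint(assignment):
    """P(assignment) as the product of P(x_i | pa(x_i)) over all variables."""
    result = 1.0
    for var, value in assignment.items():
        result *= p(var, value, assignment)
    return result

# P(s1, l1, b2, f1, x1) = 0.6 * 0.5 * 0.75 * 0.003 * 0.2
print(joint({"S": 1, "L": 1, "B": 2, "F": 1, "X": 1}))  # 0.000135 (up to rounding)
```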
11. Representing the Joint Distribution
In general, for a network with nodes X1, X2, ..., Xn,

  P(x1, x2, ..., xn) = Π_i P(xi | pa(xi))

An enormous saving can be made regarding the number of values required for the joint distribution. To determine the joint distribution directly for n binary variables, 2^n - 1 values are required. For a BN with n binary variables in which each node has at most k parents, fewer than 2^k n values are required.
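A rough illustration of the saving; the choice of n = 100 and k = 5 is mine, purely for illustration:

```python
n, k = 100, 5  # illustrative: 100 binary variables, at most 5 parents each

full_joint = 2 ** n - 1       # values needed to specify the joint distribution directly
bn_parameters = (2 ** k) * n  # upper bound on the values needed for the BN

print(f"explicit joint: {full_joint:.3e} values")  # about 1.27e30
print(f"BN with k = {k}: {bn_parameters} values")  # 3200
```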
12. Causality and Bayesian Networks
Clearly not every BN describes causal relationships between the variables. Consider the dependence between Lung Cancer (L) and the X-ray test (X). By focusing on just these variables we might be tempted to represent them by the following BN (L → X):

  P(l1) = 0.001
  P(x1 | l1) = 0.6,  P(x1 | l2) = 0.02

However, the following BN (X → L) represents the same distribution and independencies (i.e. none):

  P(x1) = 0.02058
  P(l1 | x1) = 0.02915,  P(l1 | x2) = 0.00041

Nevertheless, it is tempting to think that BNs can be constructed by creating a DAG whose edges represent direct causal relationships between the variables.
13. Common Causes
Consider the following DAG: Smoking → Bronchitis and Smoking → Lung Cancer.
Markov condition: I_P(B, L | S), i.e. P(b | l, s) = P(b | s).
If we know the causal relationships S → B and S → L, and we know that Joe is a smoker, then finding out that he has Bronchitis will not give us any more information about the probability of him having Lung Cancer. So the Markov condition would be satisfied.
14. Common Effects
Consider the following DAG: Burglary → Alarm ← Earthquake.
Markov condition: I_P(B, E), i.e. P(b | e) = P(b).
We would expect Burglary and Earthquake to be independent of each other, which is in agreement with the Markov condition. We would, however, expect them to be conditionally dependent given Alarm: if the alarm has gone off, news that there had been an earthquake would explain away the idea that a burglary had taken place. Again this is in agreement with the Markov condition.
15. The Causal Markov Condition
- The basic idea is that the Markov condition holds for a causal DAG.
- Certain other conditions must be met for the Causal Markov condition to hold:
  - there must be no hidden common causes
  - there must not be selection bias
  - there must be no feedback loops
- Even with these provisos there is a lot of controversy as to its validity.
- It seems to be false in quantum mechanical systems, which have been found to violate Bell's inequalities.
16. Hidden Common Causes
[Figure: a DAG over the variables X, Y and Z, with a hidden variable H as a common cause of X and Y]
If a DAG is created on the basis of causal
relationships between the variables under
consideration then X and Y would be marginally
independent according to the Markov
condition. But since they have a hidden common
cause, H, they will normally be dependent.
17. Inference in Bayesian Networks
- The main point of BNs is to enable probabilistic inference to be performed.
- There are two main types of inference to be carried out:
  - Belief updating: obtain the posterior probability of one or more variables given evidence concerning the values of other variables
  - Abductive inference (or belief revision): find the most probable configuration of a set of variables (hypothesis) given evidence
- Consider the BN discussed earlier: what is the probability that someone has bronchitis (B) given that they smoke (S), have fatigue (F) and have received a positive X-ray (X) result?
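As a baseline, this query can be answered by brute-force enumeration over the joint distribution (exactly the exponential summation the junction tree method below is designed to avoid). The sketch reuses the hypothetical `parents` and `joint` helpers from earlier:

```python
from itertools import product

def query(target, target_value, evidence):
    """P(target = target_value | evidence) by summing the joint over hidden variables."""
    variables = list(parents)  # ["S", "L", "B", "F", "X"]
    hidden = [v for v in variables if v != target and v not in evidence]

    def total(fixed):
        s = 0.0
        for values in product([1, 2], repeat=len(hidden)):
            assignment = dict(fixed, **dict(zip(hidden, values)))
            s += joint(assignment)
        return s

    numerator = total(dict(evidence, **{target: target_value}))
    denominator = sum(total(dict(evidence, **{target: v})) for v in [1, 2])
    return numerator / denominator

# P(b1 | s1, f1, x1) -- agrees with the value 0.37 obtained later via the junction tree
print(round(query("B", 1, {"S": 1, "F": 1, "X": 1}), 4))  # 0.3701
```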
18. Inference: an overview
- Trees and singly connected networks (only one path between any two nodes)
  - message passing (Pearl, 1988)
- Multiply connected networks
  - a range of exact algorithms including cut-set conditioning (Pearl, 1988), junction tree propagation (Lauritzen and Spiegelhalter, 1988) and bucket elimination (Dechter, 1996), to mention a few
  - a range of algorithms for approximate inference
Both exact and approximate inference are NP-hard in the worst case. Here the focus will be on junction tree propagation for discrete variables.
19. Junction Tree Propagation
- (Lauritzen and Spiegelhalter, 1988)
- The general idea is that the propagation of evidence through the network can be carried out more efficiently by representing the joint probability distribution on an undirected graph called the junction tree (or join tree).
- The junction tree has the following characteristics:
  - it is an undirected tree
  - its nodes are clusters of variables (i.e. from the original BN)
  - given two clusters, C1 and C2, every node on the path between them contains their intersection C1 ∩ C2
  - a separator, S, is associated with each edge and contains the variables in the intersection between neighbouring nodes
20. Constructing the Junction Tree (1)
Step 1. Form the moral graph from the DAG. Consider the Asia network.
[Figure: the Asia DAG and its moral graph - 'marry' the parents of each node and remove the arrows]
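A minimal sketch of moralization, assuming the standard Asia structure (A → T, S → L, S → B, T → E, L → E, E → X, and B, E → F), which is consistent with the cliques listed on the following slides:

```python
from itertools import combinations

# Asia network: child -> list of parents (assumed standard structure)
asia_parents = {
    "A": [], "S": [],
    "T": ["A"], "L": ["S"], "B": ["S"],
    "E": ["T", "L"],
    "X": ["E"], "F": ["B", "E"],
}

def moralize(parents):
    """Return the edge set of the moral graph: parent-child edges plus
    'marriage' edges between co-parents, with directions dropped."""
    edges = set()
    for child, pas in parents.items():
        for pa in pas:                      # undirected version of each arrow
            edges.add(frozenset((pa, child)))
        for u, v in combinations(pas, 2):   # marry parents of the same child
            edges.add(frozenset((u, v)))
    return edges

for edge in sorted(moralize(asia_parents), key=sorted):
    print(sorted(edge))
```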
21. Constructing the Junction Tree (2)
Step 2. Triangulate the moral graph. An undirected graph is triangulated if every cycle of length greater than 3 possesses a chord.
22. Constructing the Junction Tree (3)
Step 3. Identify the cliques. A clique is a subset of nodes which is complete (i.e. there is an edge between every pair of nodes) and maximal.
Cliques: {B,S,L}, {B,L,E}, {B,E,F}, {L,E,T}, {A,T}, {E,X}
23. Constructing the Junction Tree (4)
Step 4. Build the junction tree. The cliques should be ordered (C1, C2, ..., Ck) so that they possess the running intersection property: for all 1 < j ≤ k, there is an i < j such that Cj ∩ (C1 ∪ ... ∪ Cj-1) ⊆ Ci. To build the junction tree, choose one such i for each j and add an edge between Cj and Ci.
Junction tree cliques: {B,S,L}, {B,L,E}, {B,E,F}, {L,E,T}, {A,T}, {E,X}
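A sketch of this step under the stated assumptions: the cliques are supplied in an order satisfying the running intersection property, and each Cj is joined to an earlier clique containing the relevant intersection (the helper name is mine).

```python
# Cliques of the triangulated Asia moral graph, in an order satisfying
# the running intersection property (order chosen by hand for illustration).
cliques = [
    {"B", "S", "L"}, {"B", "L", "E"}, {"B", "E", "F"},
    {"L", "E", "T"}, {"A", "T"}, {"E", "X"},
]

def build_junction_tree(cliques):
    """For each Cj (j > 0), add an edge to some earlier clique Ci that contains
    Cj's intersection with the union of all earlier cliques."""
    edges = []
    for j in range(1, len(cliques)):
        earlier_union = set().union(*cliques[:j])
        separator = cliques[j] & earlier_union
        i = next(k for k in range(j) if separator <= cliques[k])
        edges.append((i, j, separator))
    return edges

for i, j, sep in build_junction_tree(cliques):
    print(f"C{i+1} -- C{j+1}   separator {sorted(sep)}")
```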
24. Potential Representation
The joint probability distribution can now be represented in terms of potential functions, φ, defined on each clique and each separator of the junction tree. The joint distribution is given by

  P(U) = Π_C φ_C / Π_S φ_S

The idea is to transform one representation of the joint distribution to another in which, for each clique C, the potential function gives the marginal distribution for the variables in C, i.e.

  φ_C(c) = P(c)

This will also apply for the separators, S.
25. Initialization
To initialize the potential functions:
1. set all potentials to unity
2. for each variable Xi, select one node in the junction tree (i.e. one clique) containing both that variable and its parents, pa(Xi), in the original DAG
3. multiply the potential by P(xi | pa(xi))
Example. Our original BN can be represented as the junction tree with cliques {B,S,L}, {B,L,F} and {L,X}.
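A sketch of this initialization for the three-clique junction tree above, representing each potential as a dictionary from value tuples to numbers and reusing the hypothetical `parents`/`cpt`/`p` helpers from the earlier sketch:

```python
from itertools import product

def make_potential(variables, bn_families):
    """Initialize a potential over `variables`: start at 1 and multiply in
    P(x | pa(x)) for each BN family assigned to this clique."""
    potential = {}
    for values in product([1, 2], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        value = 1.0
        for var in bn_families:
            value *= p(var, assignment[var], assignment)
        potential[values] = value
    return potential

# Assign each variable's CPT to one clique containing it and its parents.
phi_BSL = make_potential(("B", "S", "L"), ["S", "B", "L"])  # P(S) P(B|S) P(L|S)
phi_BLF = make_potential(("B", "L", "F"), ["F"])            # P(F|B,L)
phi_LX  = make_potential(("L", "X"), ["X"])                 # P(X|L)
phi_BL  = make_potential(("B", "L"), [])                    # separator, all ones
phi_L   = make_potential(("L",), [])                        # separator, all ones

print(phi_BSL[(1, 1, 1)])  # P(s1) P(b1|s1) P(l1|s1) = 0.00015 (up to rounding)
```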
26. Propagating Information
Passing information from one clique C1 to another clique C2, via the separator S0 in between them, requires two steps:
1. Obtain a new potential for S0 by marginalizing out the variables in C1 that are not in S0:

   φ*_S0 = Σ_{C1 \ S0} φ_C1

2. Obtain a new potential for C2:

   φ*_C2 = λ_S0 φ_C2,   where   λ_S0 = φ*_S0 / φ_S0
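A sketch of a single flow in this representation (the helper names and the 0/0 = 0 convention for the update factor are my own choices):

```python
def marginalize(potential, variables, keep):
    """Sum a potential over the variables not in `keep`."""
    result = {}
    keep_idx = [variables.index(v) for v in keep]
    for values, value in potential.items():
        key = tuple(values[i] for i in keep_idx)
        result[key] = result.get(key, 0.0) + value
    return result

def pass_flow(phi_c1, vars_c1, phi_sep, vars_sep, phi_c2, vars_c2):
    """One flow C1 -> C2 through the separator: returns updated (phi_sep, phi_c2)."""
    new_sep = marginalize(phi_c1, vars_c1, vars_sep)
    new_c2 = {}
    sep_idx = [vars_c2.index(v) for v in vars_sep]
    for values, value in phi_c2.items():
        key = tuple(values[i] for i in sep_idx)
        ratio = 0.0 if phi_sep[key] == 0 else new_sep[key] / phi_sep[key]
        new_c2[values] = value * ratio
    return new_sep, new_c2

# Flow from {B,S,L} to {B,L,F} through the separator {B,L}, as in the example below
sep_after, blf_after = pass_flow(phi_BSL, ("B", "S", "L"), phi_BL, ("B", "L"),
                                 phi_BLF, ("B", "L", "F"))
print(blf_after[(1, 1, 1)])  # about 0.000114 for (b1, l1, f1), matching the table below
```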
27. An Example
Consider a flow from the clique {B,S,L} to {B,L,F}.

Initial representation:

  φ_BSL = P(B|S) P(L|S) P(S)
                 l1         l2
    s1,b1    0.00015    0.04985
    s1,b2    0.00045    0.14955
    s2,b1    0.000002   0.039998
    s2,b2    0.000038   0.759962

  φ_BL = 1
             l1   l2
    b1        1    1
    b2        1    1

  φ_BLF = P(F|B,L)
              l1     l2
    f1,b1    0.75   0.1
    f1,b2    0.5    0.05
    f2,b1    0.25   0.9
    f2,b2    0.5    0.95

After flow:

  φ_BSL (unchanged)
                 l1         l2
    s1,b1    0.00015    0.04985
    s1,b2    0.00045    0.14955
    s2,b1    0.000002   0.039998
    s2,b2    0.000038   0.759962

  φ_BL
              l1          l2
    b1     0.000152    0.089848
    b2     0.000488    0.909512

  φ_BLF
                l1           l2
    f1,b1    0.000114    0.0089848
    f1,b2    0.000244    0.0454756
    f2,b1    0.000038    0.0808632
    f2,b2    0.000244    0.8640364
28. An Example with Evidence
Consider a flow from the clique {B,S,L} to {B,L,F}, but this time we include the information that Joe is a smoker, S = s1.

Incorporation of evidence:

  φ_BSL (configurations inconsistent with s1 set to zero)
                l1         l2
    s1,b1    0.00015    0.04985
    s1,b2    0.00045    0.14955
    s2,b1    0          0
    s2,b2    0          0

After flow:

  φ_BL
              l1         l2
    b1     0.00015    0.04985
    b2     0.00045    0.14955

  φ_BLF
                 l1           l2
    f1,b1    0.0001125    0.004985
    f1,b2    0.000225     0.0074775
    f2,b1    0.0000375    0.044865
    f2,b2    0.000225     0.1420725
29. The Full Propagation (1)
Two-phase propagation (Jensen et al, 1990):
1. Select an arbitrary clique, C0
2. Collection phase: flows passed from the periphery to C0
3. Distribution phase: flows passed from C0 to the periphery
[Figure: flows into C0 during the collection phase and out of C0 during the distribution phase]
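A sketch of the two phases on the three-clique tree used in the examples, assuming the chain layout {B,S,L} -- {B,L,F} -- {L,X} (consistent with the numbers in the examples above and below), taking C0 = {B,S,L} as the root and reusing the freshly initialized potentials and the hypothetical `pass_flow` helper from the earlier sketches:

```python
# Collection phase: flows from the periphery towards C0 = {B,S,L}
phi_L,  phi_BLF = pass_flow(phi_LX,  ("L", "X"),      phi_L,  ("L",),
                            phi_BLF, ("B", "L", "F"))
phi_BL, phi_BSL = pass_flow(phi_BLF, ("B", "L", "F"), phi_BL, ("B", "L"),
                            phi_BSL, ("B", "S", "L"))

# Distribution phase: flows from C0 back out to the periphery
phi_BL, phi_BLF = pass_flow(phi_BSL, ("B", "S", "L"), phi_BL, ("B", "L"),
                            phi_BLF, ("B", "L", "F"))
phi_L,  phi_LX  = pass_flow(phi_BLF, ("B", "L", "F"), phi_L,  ("L",),
                            phi_LX,  ("L", "X"))

# Each clique potential now holds the joint distribution of its own variables;
# e.g. summing phi_BSL over S and L gives P(B).
```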
30The Full Propagation (2)
After the two propagation phases have been
carried out the Junction tree will be in
equilibrium with each clique containing the joint
probability distribution for the variables it
contains. Marginal probabilities for individual
variables can then be obtained from the
cliques. Evidence, E, can be included before
propagation by selecting a clique for each
variable for which evidence is available. The
potential for the clique is then set to 0 for any
configuration which differs from the evidence.
After propagation the result will be
Normalizing gives
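A sketch of the evidence-entry step in the same toy representation, applied to the initialized potentials before propagation (the helper name is mine):

```python
def enter_evidence(potential, variables, var, observed_value):
    """Zero every configuration inconsistent with the observation var = observed_value."""
    idx = variables.index(var)
    return {values: (value if values[idx] == observed_value else 0.0)
            for values, value in potential.items()}

# Evidence S = s1, F = f1, X = x1, entered into cliques containing those variables
phi_BSL = enter_evidence(phi_BSL, ("B", "S", "L"), "S", 1)
phi_BLF = enter_evidence(phi_BLF, ("B", "L", "F"), "F", 1)
phi_LX  = enter_evidence(phi_LX,  ("L", "X"),      "X", 1)

# After propagation, phi_BSL(b, s, l) = P(b, s, l, E); normalizing over all
# entries of the potential gives P(b, s, l | E).
```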
31. A Final Example (1)
What is the probability that someone has bronchitis given that they smoke, have fatigue and have received a positive X-ray result? Recall that the BN can be represented by the junction tree with cliques {B,S,L}, {B,L,F} and {L,X}. On entering the evidence S = s1, F = f1 and X = x1, we obtain the potentials shown on the next slide.
32. A Final Example (2)
After the collection phase (collecting towards {B,S,L}) the potentials are:

  φ_LX (evidence X = x1 entered)
            l1     l2
    x1     0.6    0.02
    x2     0      0

  φ_L
     l1     l2
    0.6    0.02

  φ_BLF (evidence F = f1 entered, message on L absorbed)
              l1      l2
    f1,b1    0.45    0.002
    f1,b2    0.3     0.001
    f2,b1    0       0
    f2,b2    0       0

  φ_BL
             l1      l2
    b1     0.45    0.002
    b2     0.3     0.001

  φ_BSL (evidence S = s1 entered, message on {B,L} absorbed)
                 l1           l2
    s1,b1    0.0000675    0.0000997
    s1,b2    0.000135     0.00014955
    s2,b1    0            0
    s2,b2    0            0

After the collection phase φ_BSL is in its final state. To obtain P(b1, E), marginalize out L: 0.0000675 + 0.0000997 = 0.0001672. Normalizing gives P(b1 | E) = 0.37. If we also observe L = l1, then P(b1 | E, l1) = 0.33.
33. Summary
- Things to remember:
  - the Markov condition - a property of BNs
  - what it means for a distribution P to satisfy the Markov condition with respect to a DAG
  - the causal Markov condition - assumed for causal BNs
  - how to construct a junction tree
  - propagation in the junction tree
34. Abductive Inference in BNs (1)
So far we have considered inference problems where the goal is to obtain posterior probabilities for variables given evidence. In abductive inference the goal is to find the configuration of a set of variables (hypothesis) which will best explain the evidence.
What would count as the best explanation of fatigue (F = f1) and a positive X-ray (X = x1)? A configuration of all the other variables? A subset of them?
35. Abductive Inference in BNs (2)
- There are two types of abductive inference in BNs:
  - MPE (Most Probable Explanation) - the most probable configuration of all variables in the BN given evidence
  - MAP (Maximum A Posteriori) - the most probable configuration of a subset of variables in the BN given evidence
- Note 1: In general the MPE cannot be found by taking the most probable configuration of nodes individually!
- Note 2: And the MAP cannot be found by taking the projection of the MPE onto the explanation set!
36. The Most Probable Explanations
(Dawid, 1992 and Nilsson, 1998) The MPE can be obtained by adapting the propagation algorithm in the junction tree: when a message is passed from one clique to another, the potential function on the separator is obtained by

  φ*_S0 = max_{C1 \ S0} φ_C1

rather than

  φ*_S0 = Σ_{C1 \ S0} φ_C1

Nilsson (1998) discusses how to find the K MPEs and also outlines how to perform MAP inference: marginalize out the variables not in the explanation set and use the MPE approach on the remaining variables.
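A sketch of the corresponding max-marginalization, mirroring the hypothetical sum-based `marginalize` helper used earlier:

```python
def max_marginalize(potential, variables, keep):
    """Maximize a potential over the variables not in `keep` (for MPE propagation)."""
    result = {}
    keep_idx = [variables.index(v) for v in keep]
    for values, value in potential.items():
        key = tuple(values[i] for i in keep_idx)
        result[key] = max(result.get(key, 0.0), value)
    return result

# Max-flow from {B,S,L} to the separator {B,L}, as in the example on the next slide
print(max_marginalize(phi_BSL, ("B", "S", "L"), ("B", "L")))
# b1: 0.00015 / 0.04985, b2: 0.00045 / 0.759962 (up to rounding),
# matching φ_BL after the max-flow below
```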
37. An Example
Consider a max-flow from the clique {B,S,L} to {B,L,F}.

Initial representation:

  φ_BSL = P(B|S) P(L|S) P(S)
                 l1         l2
    s1,b1    0.00015    0.04985
    s1,b2    0.00045    0.14955
    s2,b1    0.000002   0.039998
    s2,b2    0.000038   0.759962

  φ_BL = 1
             l1   l2
    b1        1    1
    b2        1    1

  φ_BLF = P(F|B,L)
              l1     l2
    f1,b1    0.75   0.1
    f1,b2    0.5    0.05
    f2,b1    0.25   0.9
    f2,b2    0.5    0.95

After max-flow:

  φ_BSL (unchanged)
                 l1         l2
    s1,b1    0.00015    0.04985
    s1,b2    0.00045    0.14955
    s2,b1    0.000002   0.039998
    s2,b2    0.000038   0.759962

  φ_BL (maximizing over S)
              l1         l2
    b1     0.00015    0.04985
    b2     0.00045    0.759962

  φ_BLF
                l1           l2
    f1,b1    0.000113    0.004985
    f1,b2    0.000225    0.037998
    f2,b1    0.000038    0.044865
    f2,b2    0.000225    0.721964
38. What is the best explanation?
Two questions: What counts as an explanation? And which one is best?
E.g. what would count as an explanation of fatigue (F = f1) and a positive X-ray (X = x1)? Or of lung cancer (L = l1)?
Causality is often taken to play a crucial role in explanation, so if BNs can be interpreted causally they provide a good platform for obtaining explanations. A suitable restriction, then, is that explanatory variables causally precede the evidence.
39. What is the best explanation? (1)
Which explanation is best? How do we determine how good an explanation is? One approach is to say that the best explanation is the one which makes the evidence most probable: select explanation H1 over H2 if

  P(E | H1) > P(E | H2)

But even if P(E | H1) = 1, the posterior of H1 may be small due to a small prior. The MAP approach takes account of this by selecting H1 over H2 if

  P(H1 | E) > P(H2 | E)

But perhaps this overcompensates, since it could be the case that

  P(E | H1) < P(E)

and so H1 actually lowers the probability of the evidence.
40. What is the best explanation? (2)
In many cases the two approaches agree as to which is the better of the two explanations, i.e.

  P(E | H1) > P(E | H2)   and   P(H1 | E) > P(H2 | E)

A reasonable requirement for any approach that wants to do better is that it should agree with these approaches when they agree with each other.
Why consider an alternative to MPE / MAP?
1. It might be closer to human reasoning.
2. It could be used to test Inference to the Best Explanation (IBE) - does IBE make probable inferences?
3. Perhaps high probability is not what we want - there is a trade-off with information content.