Title: Bayesian Network
1 Bayesian Network
- CVPR Winter seminar
- Jaemin Kim
2 Outline
- Concepts in Probability
- Probability
- Random variables
- Basic properties (Bayes rule)
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
3 Probabilities
- Probability distribution P(X | ξ)
- X is a random variable
- Discrete
- Continuous
- ξ is the background state of information
4 Discrete Random Variables
- Finite set of possible outcomes
- (Example on the slide: X binary)
5 Continuous Random Variables
- Probability distribution (density function) over continuous values
6 More Probabilities
- Joint
- P(X = x, Y = y): probability that both X = x and Y = y
- Conditional
- P(X = x | Y = y): probability that X = x given that we know Y = y
7 Rules of Probability
- Product Rule: P(X, Y) = P(X | Y) P(Y)
- Marginalization: P(X) = Σ_y P(X, Y = y)
- (Example on the slide: X binary)
8 Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)
9 Graph Model
- Idea: each variable's probability distribution is specified in terms of its dependencies on other variables
- Definition
- A collection of variables (nodes) with a set of dependencies (edges) between the variables, and
- a set of probability distribution functions for each variable
- A Bayesian network is a special type of graph model whose graph is a directed acyclic graph (DAG)
10 Bayesian Networks
- A Graph
- nodes represent the random variables
- directed edges (arrows) between pairs of nodes
- it must be a Directed Acyclic Graph (DAG)
- the graph represents relationships between variables
- Conditional probability specifications
- the conditional probability distribution (CPD) of each variable, given its parents
- for a discrete variable, a conditional probability table (CPT)
11 Bayesian Networks (Belief Networks)
- A Graph
- directed edges (arrows) between pairs of nodes
- causality: A causes B
- AI and statistics communities
Markov Random Fields (MRF)
- A Graph
- undirected edges between pairs of nodes
- a simple definition of independence
- if all paths between the nodes in A and the nodes in B are separated by the nodes in a third set C
- then A and B are conditionally independent given C
- physics and vision communities
12 Bayesian Networks
13 Bayesian networks
- Basics
- Structured representation
- Conditional independence
- Naïve Bayes model
- Independence facts
14 Bayesian networks
(Two-node example: Smoking → Cancer, specified by P(S) and P(C | S).)
15 Product Rule
P(C = none, S = no) = P(C = none | S = no) P(S = no)
= 0.96 × 0.8 = 0.768
17 Marginalization
P(Smoking): P(S = no) = P(S = no, C = none) + P(S = no, C = benign) + P(S = no, C = malignant)
P(Cancer): P(C = malignant) = P(C = malignant, S = no) + P(C = malignant, S = light) + P(C = malignant, S = heavy)
18 Bayes Rule Revisited
P(S | C) = P(C | S) P(S) / P(C)
19 A Bayesian Network
Gender
Age
Exposure to Toxics
Smoking
Cancer
Serum Calcium
Lung Tumor
20 Problems with Large Instances
- The joint probability distribution, P(A, G, E, S, C, L, SC)
- For seven binary variables there are 2^7 = 128 values in the joint distribution (for 100 variables there are over 10^30 values)
- How are these values to be obtained?
- Inference
- To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms (e.g. 2^4 for this seven-variable network), which increases to 2^97 if there are 100 variables.
21 Independence
Age and Gender are independent (nodes Age and Gender, with no edge between them).
P(A, G) = P(A) P(G)
P(A | G) = P(A),  A ⊥ G
P(G | A) = P(G),  G ⊥ A
P(A, G) = P(G | A) P(A) = P(G) P(A)
P(A, G) = P(A | G) P(G) = P(A) P(G)
22 Conditional Independence
Cancer is independent of Age and Gender given Smoking.
(Nodes: Age and Gender are parents of Smoking; Smoking is a parent of Cancer.)
P(C | A, G, S) = P(C | S),  C ⊥ {A, G} | S
- Age and Gender influence whether Smoking = heavy
- Whether Smoking = heavy influences Cancer
- But given Smoking = heavy, Cancer no longer depends on Age and Gender
23 More Conditional Independence: Naïve Bayes
Serum Calcium and Lung Tumor are dependent, but they are conditionally independent given their common parent Cancer:
P(L | SC, C) = P(L | C)
24 More Conditional Independence: Explaining Away
Exposure to Toxics and Smoking are (marginally) independent: E ⊥ S.
(Both are parents of Cancer.)
Exposure to Toxics is dependent on Smoking, given Cancer.
25 More Conditional Independence: Explaining Away
(Nodes: Exposure to Toxics and Smoking, both parents of Cancer.)
Exposure to Toxics is dependent on Smoking, given Cancer.
Moralize the graph: connect the parents Exposure to Toxics and Smoking with an undirected edge.
26 Put it all together
27 General Product (Chain) Rule for Bayesian Networks
P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | Pa_i), where Pa_i = parents(Xi)
28 Conditional Independence
A variable (node) is conditionally independent of
its non-descendants given its parents.
(Figure: the cancer network with Exposure to Toxics and Smoking marked as Parents of Cancer, Age and Gender as Non-Descendants, and Serum Calcium and Lung Tumor as Descendants.)
Cancer is independent of Age and Gender given Exposure to Toxics and Smoking.
29 Another non-descendant
Cancer is independent of Diet given Exposure to Toxics and Smoking.
(Figure: the same network with an additional node Diet, a non-descendant of Cancer.)
30 Representing the Joint Distribution
In general, for a network with nodes X1, X2, ..., Xn:
P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi))
An enormous saving can be made in the number of values required for the joint distribution. To determine the joint distribution directly for n binary variables, 2^n − 1 values are required. For a BN with n binary variables in which each node has at most k parents, fewer than 2^k · n values are required.
31 An Example
P(s1) = 0.2
P(l1 | s1) = 0.003,  P(l1 | s2) = 0.00005
P(b1 | s1) = 0.25,  P(b1 | s2) = 0.05
P(f1 | b1, l1) = 0.75,  P(f1 | b1, l2) = 0.10,  P(f1 | b2, l1) = 0.5,  P(f1 | b2, l2) = 0.05
P(x1 | l1) = 0.6,  P(x1 | l2) = 0.02
(S = smoking history, L = lung cancer, B = bronchitis, F = fatigue, X = positive X-ray; subscript 1 = true, 2 = false.)
32 Solution
Note that our joint distribution with 5 variables can be represented as
P(S, L, B, F, X) = P(S) P(L | S) P(B | S) P(F | B, L) P(X | L)
Consequently the joint probability distribution can now be expressed using the tables on the previous slide. For example, the probability that someone has a smoking history, lung cancer but not bronchitis, suffers from fatigue and tests positive in an X-ray test is
P(s1, l1, b2, f1, x1) = P(s1) P(l1 | s1) P(b2 | s1) P(f1 | b2, l1) P(x1 | l1)
= 0.2 × 0.003 × 0.75 × 0.5 × 0.6 = 0.000135
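As a quick check of the arithmetic, a one-line Python sketch (mine, not from the slides) multiplies out the same factors:

```python
# P(s1) * P(l1|s1) * P(b2|s1) * P(f1|b2,l1) * P(x1|l1), values from slide 31
p = 0.2 * 0.003 * (1 - 0.25) * 0.5 * 0.6
print(p)   # 0.000135
```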
33 Independence and Graph Separation
- Given a set of observations, is one set of variables dependent on another set?
- Observing effects can induce dependencies.
- d-separation (Pearl 1988) allows us to check conditional independence graphically.
34 Bayesian networks
- Additional structure
- Nodes as functions
- Causal independence
- Context specific dependencies
- Continuous variables
- Hierarchy and model construction
35 Nodes as functions
- A BN node is a conditional distribution function
- its parent values are the inputs
- its output is a distribution over its values
(Figure: a node X with parents A and B; for one setting of A and B the CPT row gives a distribution such as 0.5, 0.3, 0.2 over the values of X.)
36 Nodes as functions
- Any type of function from Val(A, B) to distributions over Val(X) can be used.
37 Continuous variables
(Example: parent values A/C Setting = hi and Outdoor Temperature = 97°F determine a density over a continuous child variable.)
38 Gaussian (normal) distributions
N(μ, σ):  p(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))
39 Gaussian networks
Each variable is a linear function of its parents, with Gaussian noise:
X = a + Σ_i b_i · Pa_i + ε,  ε ~ N(0, σ²)
The joint probability density function over all the variables is then a multivariate Gaussian.
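As an illustrative sketch of such a node (the two-variable network, its coefficients, and the function name are my own toy assumptions, not from the slides), a linear Gaussian child can be sampled directly:

```python
import random

# Toy linear Gaussian network X -> Y (coefficients are illustrative):
#   X ~ N(1.0, 0.5^2)
#   Y = 2.0 + 0.8 * X + eps,  eps ~ N(0, 0.3^2)
def sample_xy():
    x = random.gauss(1.0, 0.5)
    y = 2.0 + 0.8 * x + random.gauss(0.0, 0.3)
    return x, y

samples = [sample_xy() for _ in range(100_000)]
mean_y = sum(y for _, y in samples) / len(samples)
# E[Y] = 2.0 + 0.8 * E[X] = 2.8; the empirical mean should be close to this
print(mean_y)
```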
40 Composing functions
- Recall a BN node is a function
- We can compose functions to get more complex functions.
- The result: a hierarchically structured BN.
- Since functions can be called more than once, we can reuse a BN model fragment in multiple contexts.
41 (Figure: a hierarchical car model with nodes Owner, Maintenance, Age, Original-value, Mileage, Brakes, Car, Fuel-efficiency, Braking-power.)
42 Bayesian Networks
- Knowledge acquisition
- Variables
- Structure
- Numbers
43 What is a variable?
- Collectively exhaustive, mutually exclusive values
- Example values: Error Occurred, No Error
44 Clarity Test: Knowable in Principle
- Weather: Sunny, Cloudy, Rain, Snow
- Gasoline: cents per gallon
- Temperature: ≥ 100°F, < 100°F
- User needs help on Excel Charting: Yes, No
- User's personality: dominant, submissive
45 Structuring
Network structure corresponding to causality is usually good.
Extending the conversation.
(Figure: building the cancer network out to Lung Tumor.)
46 Course Contents
- Concepts in Probability
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
47 Inference
- Patterns of reasoning
- Basic inference
- Exact inference
- Exploiting structure
- Approximate inference
48 Predictive Inference
How likely are elderly males to get malignant cancer?
P(C = malignant | Age > 60, Gender = male)
(Figure: the full cancer network: Age, Gender, Exposure to Toxics, Smoking, Cancer, Serum Calcium, Lung Tumor.)
49 Combined
How likely is an elderly male patient with high Serum Calcium to have malignant cancer?
P(C = malignant | Age > 60, Gender = male, Serum Calcium = high)
50 Explaining away
- If we see a lung tumor, the probability of heavy smoking and of exposure to toxics both go up.
51 Inference in Belief Networks
- Find P(Q = q | E = e)
- Q: the query variable
- E: the set of evidence variables
- X1, ..., Xn: the network variables other than Q and E
P(q | e) = P(q, e) / P(e), where
P(q, e) = Σ_{x1, ..., xn} P(q, e, x1, ..., xn)
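A minimal inference-by-enumeration sketch in Python (the CPT encoding, dictionary layout, and function names are my own illustration of this sum, not code from the slides), using the five-variable example from slide 31:

```python
from itertools import product

# Each entry: variable -> (parents, table of P(var = 1 | parent values)); values 1 = true, 2 = false.
NETWORK = {
    "S": ((),         {(): 0.2}),
    "L": (("S",),     {(1,): 0.003, (2,): 0.00005}),
    "B": (("S",),     {(1,): 0.25,  (2,): 0.05}),
    "F": (("B", "L"), {(1, 1): 0.75, (1, 2): 0.10, (2, 1): 0.5, (2, 2): 0.05}),
    "X": (("L",),     {(1,): 0.6,   (2,): 0.02}),
}

def joint(assign):
    """P(assignment) = product over variables of P(x_i | parents(x_i))."""
    p = 1.0
    for var, (parents, table) in NETWORK.items():
        p_true = table[tuple(assign[pa] for pa in parents)]
        p *= p_true if assign[var] == 1 else 1.0 - p_true
    return p

def query(q_var, evidence):
    """Return P(q_var = 1 | evidence) by summing the joint over the hidden variables."""
    hidden = [v for v in NETWORK if v != q_var and v not in evidence]
    p_qe = {}
    for q_val in (1, 2):
        total = 0.0
        for values in product((1, 2), repeat=len(hidden)):
            assign = {**evidence, q_var: q_val, **dict(zip(hidden, values))}
            total += joint(assign)
        p_qe[q_val] = total                       # P(q, e)
    return p_qe[1] / (p_qe[1] + p_qe[2])          # normalize by P(e)

# Probability of lung cancer given a positive X-ray and fatigue:
print(query("L", {"X": 1, "F": 1}))
```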
52 Basic Inference
Chain A → B:  P(b) = Σ_a P(b | a) P(a)
53 Inference in trees
(Figure: a node X with parents Y1 and Y2.)
P(x) = Σ_{y1, y2} P(x | y1, y2) P(y1, y2)
     = Σ_{y1, y2} P(x | y1, y2) P(y1) P(y2)   (Y1 and Y2 are independent in the tree)
54 Polytrees
- A network is singly connected (a polytree) if it contains no undirected loops.
Theorem: Inference in a singly connected network can be done in linear time in the network size, including table sizes.
Main idea: in variable elimination, we need only maintain distributions over single nodes.
55 The problem with loops
(Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Grass-wet, Rain → Grass-wet.)
P(c) = 0.5; P(s | c) and P(r | c) take the extreme values 0.01 and 0.99 for the two settings of Cloudy.
Grass-wet is a deterministic OR of Sprinkler and Rain: the grass is dry only if there is no rain and no sprinklers.
56 The problem with loops contd.
Computing P(g) as if the network were singly connected (treating Sprinkler and Rain as independent) gives the wrong answer; the undirected loop through Cloudy is the problem.
57 Variable elimination
Chain A → B → C:  P(c) = Σ_b P(c | b) Σ_a P(b | a) P(a)
58 Inference as variable elimination
- A factor over X is a function from Val(X) to numbers in [0, 1]
- a CPT is a factor
- a joint distribution is also a factor
- BN inference:
- factors are multiplied to give new ones
- variables in factors are summed out
- A variable can be summed out as soon as all factors mentioning it have been multiplied.
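To make the factor operations concrete, here is a small Python sketch (the factor representation and the example numbers are my own, not from the slides): a factor is a list of variables plus a table from value tuples to numbers, and the two operations are pointwise multiplication and summing a variable out.

```python
from itertools import product

# A factor is (variables, table), where table maps a tuple of values (one per
# variable, each in {1, 2}) to a number.
def multiply(f1, f2):
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for values in product((1, 2), repeat=len(out_vars)):
        assign = dict(zip(out_vars, values))
        v1 = t1[tuple(assign[v] for v in vars1)]
        v2 = t2[tuple(assign[v] for v in vars2)]
        table[values] = v1 * v2
    return out_vars, table

def sum_out(var, factor):
    vars_, table = factor
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    out = {}
    for values, p in table.items():
        key = values[:i] + values[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out_vars, out

# Eliminate A from P(A) P(B | A): the result is the factor P(B).
f_a = (["A"], {(1,): 0.3, (2,): 0.7})
f_b_given_a = (["A", "B"], {(1, 1): 0.9, (1, 2): 0.1, (2, 1): 0.2, (2, 2): 0.8})
print(sum_out("A", multiply(f_a, f_b_given_a)))   # P(B=1) = 0.3*0.9 + 0.7*0.2 = 0.41
```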
59 Variable Elimination with loops
(Figure: elimination on the full cancer network: Age, Gender, Exposure to Toxics, Smoking, Cancer, Serum Calcium, Lung Tumor.)
Complexity is exponential in the size of the factors.
60 Inference in BNs and the Junction Tree
- The main point of BNs is to enable probabilistic inference to be performed. Inference is the task of computing the probability of each value of a node in a BN when the values of other variables are known.
- The general idea is to do inference by representing the joint probability distribution on an undirected graph called the junction tree.
- The junction tree has the following characteristics:
- it is an undirected tree whose nodes are clusters of variables
- given two clusters, C1 and C2, every node on the path between them contains their intersection C1 ∩ C2
- a separator, S, is associated with each edge and contains the variables in the intersection between neighbouring nodes
61 Inference in BNs
- Moralize the Bayesian network
- Triangulate the moralized graph
- Let the cliques of the triangulated graph be the nodes of a tree, and construct the junction tree
- Do inference by belief propagation throughout the junction tree
62 Constructing the Junction Tree (1)
Step 1. Form the moral graph from the DAG. Consider the BN in our example.
Moral graph: marry the parents (add an edge between every pair of parents of a node) and remove the arrows.
(Figure: the DAG and the resulting moral graph.)
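A short Python sketch of Step 1 (the graph encoding is my own choice; the example DAG is the five-variable network from slide 31): for each node, connect all pairs of its parents, then drop the edge directions.

```python
from itertools import combinations

def moralize(parents):
    """parents: dict mapping each node to the list of its parents in the DAG.
    Returns the set of undirected edges of the moral graph (as frozensets)."""
    edges = set()
    for child, pas in parents.items():
        for pa in pas:                          # keep every original edge, undirected
            edges.add(frozenset((pa, child)))
        for pa1, pa2 in combinations(pas, 2):   # "marry" all pairs of parents
            edges.add(frozenset((pa1, pa2)))
    return edges

# The five-variable example from slide 31: S -> L, S -> B, B -> F, L -> F, L -> X
dag = {"S": [], "L": ["S"], "B": ["S"], "F": ["B", "L"], "X": ["L"]}
for e in sorted(moralize(dag), key=sorted):
    print(set(e))    # note the added B-L edge, since B and L are both parents of F
```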
63 Constructing the Junction Tree (2)
Step 2. Triangulate the moral graph. An undirected graph is triangulated if every cycle of length greater than 3 possesses a chord.
64 Constructing the Junction Tree (3)
Step 3. Identify the cliques. A clique is a subset of nodes which is complete (i.e. there is an edge between every pair of nodes) and maximal.
Cliques: {B, S, L}, {B, L, F}, {L, X}
65 Constructing the Junction Tree (4)
Step 4. Build the junction tree. The cliques should be ordered (C1, C2, ..., Ck) so that they possess the running intersection property: for all 1 < j ≤ k, there is an i < j such that Cj ∩ (C1 ∪ ... ∪ Cj−1) ⊆ Ci. To build the junction tree, choose one such i for each j and add an edge between Cj and Ci.
Junction tree for the cliques {B, S, L}, {B, L, F}, {L, X} (separators in square brackets):
{B, S, L} --[B, L]-- {B, L, F} --[L]-- {L, X}
66 Potentials: Initialization
To initialize the potential functions:
1. set all potentials to unity
2. for each variable Xi, select one node in the junction tree (i.e. one clique) containing both that variable and its parents pa(Xi) in the original DAG
3. multiply that clique's potential by P(xi | pa(xi))
(Junction tree as before: {B, S, L} --[B, L]-- {B, L, F} --[L]-- {L, X}.)
67 Potential Representation
The joint probability distribution can now be represented in terms of potential functions, ψ, defined on each clique and each separator of the junction tree. The joint distribution is given by
P(U) = ∏_C ψ_C(C) / ∏_S ψ_S(S)
The idea is to transform this representation of the joint distribution into another in which, for each clique C, the potential function gives the marginal distribution for the variables in C, i.e.
ψ_C(c) = P(c)
This will also apply to the separators, S.
68 Triangulation
- Given a numbered graph, proceed from node n down to node 1
- Determine the lower-numbered nodes that are adjacent to the current node, including those which may have been made adjacent to this node earlier in this algorithm
- Connect these nodes to each other.
69 Triangulation
- Numbering the nodes
- Arbitrarily number the nodes, or
- Maximum cardinality search
- Give any node the number 1
- For each subsequent number, pick a new unnumbered node that neighbours the most already-numbered nodes
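A compact Python sketch of maximum cardinality search (the adjacency data and the tie-breaking rule are my own assumptions, not from the slides): it numbers nodes by repeatedly picking an unnumbered node with the most already-numbered neighbours.

```python
def max_cardinality_search(adj, start):
    """adj: dict mapping each node to the set of its neighbours in the undirected graph.
    Returns the nodes in the order they are numbered 1..n."""
    order = [start]
    numbered = {start}
    while len(order) < len(adj):
        # pick an unnumbered node with the most already-numbered neighbours (ties by name)
        best = max(
            (v for v in adj if v not in numbered),
            key=lambda v: (len(adj[v] & numbered), v),
        )
        order.append(best)
        numbered.add(best)
    return order

# Moral graph of the five-variable example (slides 62-64)
adj = {
    "S": {"L", "B"},
    "L": {"S", "B", "F", "X"},
    "B": {"S", "L", "F"},
    "F": {"B", "L"},
    "X": {"L"},
}
print(max_cardinality_search(adj, "S"))   # e.g. ['S', 'L', 'B', 'F', 'X']
```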
70 Triangulation
(Figure: the BN and its moralized graph.)
71 Triangulation
(Figure: an arbitrary numbering of the nodes from 1 to 8.)
72 Triangulation
Maximum cardinality search
73 Course Contents
- Concepts in Probability
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
74 Decision making
- Decision: an irrevocable allocation of domain resources
- Decisions should be made so as to maximize expected utility.
- View decision making in terms of
- Beliefs/Uncertainties
- Alternatives/Decisions
- Objectives/Utilities
75 Course Contents
- Concepts in Probability
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
76 Learning networks from data
- The learning task
- Parameter learning
- Fully observable
- Partially observable
- Structure learning
- Hidden variables
77 The learning task
Input: training data (cases over the network variables B, E, A, C, N, ...)
- Input: fully or partially observable data cases?
- Output: parameters, or also structure?
78 Parameter learning: one variable
- Unfamiliar coin
- Let θ = bias of the coin (long-run fraction of heads)
- If θ is known (given), then
- P(X = heads | θ) = θ
- Different coin tosses are independent given θ
- P(X1, ..., Xn | θ) = θ^h (1 − θ)^t  (h heads, t tails)
79 Maximum likelihood
- Input: a set of previous coin tosses
- X1, ..., Xn = H, T, H, H, H, T, T, H, ..., H
- Goal: estimate θ
- The likelihood P(X1, ..., Xn | θ) = θ^h (1 − θ)^t
- The maximum likelihood solution is θ* = h / (h + t)
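A tiny Python sketch of this estimate (the toss sequence is an invented example, not from the slides):

```python
tosses = "HTHHHTTH"                 # example data
h = tosses.count("H")
t = tosses.count("T")
theta_ml = h / (h + t)              # maximum likelihood estimate of the bias
print(theta_ml)                     # 0.625
```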
80 Conditioning on data
P(θ | D) ∝ P(D | θ) P(θ) = θ^h (1 − θ)^t P(θ)
81 Conditioning on data
82 General parameter learning
- A multi-variable BN is composed of several independent parameters ("coins").
(Figure: a small network with three parameters.)
- We can use the same techniques as in the one-variable case to learn each one separately.
83 Partially observable data
(Figure: the Burglary / Earthquake network with nodes Burglary, Earthquake, Alarm, Newscast, Call, and a table of data cases over B, E, A, C, N in which some entries are missing, shown as '?'.)
- Fill in missing data with its expected value
- i.e. an expected distribution over the possible values
- use the current best-guess BN to estimate that distribution
84 Intuition
- In the partially observable case, the filled-in data I is unknown.
- The best estimate for I is its expectation given the observed data and θ.
- Problem: θ is unknown.
85 Expectation Maximization (EM)
- Expectation (E) step
- Use the current parameters θ to estimate the filled-in data.
- Maximization (M) step
- Use the filled-in data to do maximum likelihood estimation of new parameters.
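Below is a small, self-contained EM sketch for a two-node network X → Y in which X is sometimes unobserved (the data, starting parameters, and variable names are illustrative assumptions, not from the slides). The E step fills in a distribution over each missing X using the current parameters; the M step re-estimates the parameters from the expected counts.

```python
# Data: (x, y) pairs with x possibly missing (None); values are 0/1.
data = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0), (1, 0), (0, 1), (None, 1)]

# Initial guesses for P(X=1), P(Y=1|X=1), P(Y=1|X=0)
px, py1, py0 = 0.5, 0.5, 0.5

for _ in range(50):
    # E-step: expected counts, filling in missing X with P(X=1 | y) under current parameters
    n = len(data)
    exp_x1 = 0.0            # expected number of cases with X = 1
    exp_y1_x1 = 0.0         # expected number of cases with X = 1 and Y = 1
    exp_y1_x0 = 0.0         # expected number of cases with X = 0 and Y = 1
    for x, y in data:
        if x is None:
            w1 = px * (py1 if y == 1 else 1 - py1)
            w0 = (1 - px) * (py0 if y == 1 else 1 - py0)
            w1 = w1 / (w1 + w0)        # posterior P(X = 1 | y)
        else:
            w1 = float(x)
        exp_x1 += w1
        if y == 1:
            exp_y1_x1 += w1
            exp_y1_x0 += 1 - w1
    # M-step: maximum likelihood estimates from the expected counts
    px = exp_x1 / n
    py1 = exp_y1_x1 / exp_x1
    py0 = exp_y1_x0 / (n - exp_x1)

print(px, py1, py0)   # converged estimates of P(X=1), P(Y=1|X=1), P(Y=1|X=0)
```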
86 Structure learning
Goal: find a good BN structure (relative to the data).
Solution: do heuristic search over the space of network structures.
87 Search space
Space: network structures. Operators: add / reverse / delete edges.
88 Heuristic search
Use a scoring function to do heuristic search (any search algorithm). Greedy hill-climbing with randomness works pretty well.
(Figure: candidate structures evaluated by score.)
89 Scoring
- Fill in parameters using the previous techniques; score completed networks.
- One possibility for the score: the likelihood function, Score(B) = P(data | B)
Example: X, Y independent coin tosses; typical data: (27 h-h, 22 h-t, 25 t-h, 26 t-t)
The maximum likelihood network is typically fully connected.
This is not surprising: maximum likelihood always overfits.
90 Better scoring functions
- MDL formulation: balance fit to data and model complexity (number of parameters)
Score(B) = P(data | B) − model complexity
- Full Bayesian formulation
- prior on network structures and parameters
- more parameters ⇒ higher-dimensional space
- we get the balancing effect as a byproduct
With a Dirichlet parameter prior, MDL is an approximation to the full Bayesian score.
91 Hidden variables
- There may be interesting variables that we never get to observe:
- topic of a document in information retrieval
- user's current task in an online help system
- Our learning algorithm should
- hypothesize the existence of such variables
- learn an appropriate state space for them.
92 (Figure: data over E1, E2, E3 appearing randomly scattered.)
93 (Figure: the actual data over E1, E2, E3.)
94 Bayesian clustering (Autoclass)
(Naïve Bayes model: a Class variable with children E1, E2, ..., En.)
- the (hypothetical) class variable is never observed
- if we know that there are k classes, just run EM
- the learned classes are the clusters
- Bayesian analysis allows us to choose k, trading off fit to data against model complexity
95 (Figure: the resulting cluster distributions over E1, E2, E3.)
96 Detecting hidden variables
- Unexpected correlations ⇒ hidden variables.
97 Course Contents
- Concepts in Probability
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
98 Reasoning over time
- Dynamic Bayesian networks
- Hidden Markov models
- Decision-theoretic planning
- Markov decision problems
- Structured representation of actions
- The qualification problem and the frame problem
- Causality (and the frame problem revisited)
99 Dynamic environments
State(t)
- Markov property
- the past is independent of the future given the current state
- a conditional independence assumption
- implied by the fact that there are no arcs from time t to time t+2.
100 Dynamic Bayesian networks
- State described via random variables.
101 Hidden Markov model
- An HMM is a simple model for a partially
observable stochastic domain.
102 Hidden Markov model
Partially observable stochastic environment:
- Mobile robots
- states: location
- observations: sensor input
- Speech recognition
- states: phonemes
- observations: acoustic signal
- Biological sequencing
- states: protein structure
- observations: amino acids
(A minimal filtering sketch of such a model follows below.)
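Here is the promised filtering sketch (the two-state HMM, its transition and observation numbers, and the function name are all illustrative assumptions, not from the slides): the forward algorithm updates a distribution over the hidden state after each observation.

```python
# Illustrative 2-state HMM: hidden state in {0, 1}, observation in {0, 1}.
prior = [0.5, 0.5]                       # P(state_0)
trans = [[0.7, 0.3], [0.2, 0.8]]         # trans[i][j] = P(state_t = j | state_{t-1} = i)
obs = [[0.9, 0.1], [0.4, 0.6]]           # obs[i][o]   = P(observation = o | state = i)

def forward(observations):
    """Return P(state_t | observations up to t), the filtered state distribution."""
    belief = prior[:]
    for o in observations:
        # predict: push the belief through the transition model
        predicted = [sum(belief[i] * trans[i][j] for i in range(2)) for j in range(2)]
        # update: weight by the observation likelihood and renormalize
        unnorm = [predicted[j] * obs[j][o] for j in range(2)]
        z = sum(unnorm)
        belief = [u / z for u in unnorm]
    return belief

print(forward([1, 1, 0]))
```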
103 Acting under uncertainty
Markov Decision Problem (MDP)
- Overall utility = sum of momentary rewards.
- Allows a rich preference model, e.g. rewards corresponding to "get to the goal as soon as possible".
104 Partially observable MDPs
- The optimal action at time t depends on the entire history of previous observations.
- Instead, a distribution over State(t) (a belief state) suffices.
105 Structured representation
- Probabilistic action model
- allows for exceptions and qualifications
- persistence arcs: a solution to the frame problem.
106 Applications
- Medical expert systems
- Pathfinder
- Parenting MSN
- Fault diagnosis
- Ricoh FIXIT
- Decision-theoretic troubleshooting
- Vista
- Collaborative filtering