Title: Bayesian Network
1 Bayesian Network
- CVPR Winter seminar
- Jaemin Kim
2 Outline
- Concepts in Probability
- Probability
- Random variables
- Basic properties (Bayes rule)
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
3 Probabilities
- Probability distribution P(X | ξ)
- X is a random variable
- Discrete
- Continuous
- ξ is the background state of information
4 Discrete Random Variables
- Finite set of possible outcomes
- (Example on the slide: X binary)
5 Continuous Random Variables
- Probability distribution (density function) over continuous values
6 More Probabilities
- Joint
- P(X = x, Y = y): probability that both X = x and Y = y
- Conditional
- P(X = x | Y = y): probability that X = x given that we know Y = y
7 Rules of Probability
- Product Rule: P(X, Y) = P(X | Y) P(Y)
- Marginalization: P(X) = Σ_y P(X, Y = y)
- (Example on the slide: X binary)
8 Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)
9 Graph Model
- Idea: each variable's probability distribution is specified in terms of its dependencies on other variables
- Definition
- A collection of variables (nodes) with a set of dependencies (edges) between the variables, and
- a set of probability distribution functions for each variable
- A Bayesian network is a special type of graph model whose graph is a directed acyclic graph (DAG)
10 Bayesian Networks
- A Graph
- nodes represent the random variables
- directed edges (arrows) between pairs of nodes
- it must be a Directed Acyclic Graph (DAG)
- the graph represents relationships between variables
- Conditional probability specifications
- the conditional probability distribution (CPD) of each variable, given its parents
- for a discrete variable, a conditional probability table (CPT)
11 Bayesian Networks (Belief Networks)
- A Graph
- directed edges (arrows) between pairs of nodes
- causality: A causes B
- AI and statistics communities
Markov Random Fields (MRF)
- A Graph
- undirected edges between pairs of nodes
- a simple definition of independence
- if all paths between the nodes in A and the nodes in B are separated by the nodes in a third set C
- then A and B are conditionally independent given C
- physics and vision communities
12 Bayesian Networks
13 Bayesian networks
- Basics
- Structured representation
- Conditional independence
- Naïve Bayes model
- Independence facts
14 Bayesian networks
(Two-node example: Smoking → Cancer, specified by P(S) and P(C | S).)
15 Product Rule
P(C = none, S = no) = P(C = none | S = no) P(S = no)
= 0.96 × 0.8 = 0.768
17 Marginalization
P(Smoking): P(S = no) = P(S = no, C = none) + P(S = no, C = benign) + P(S = no, C = malignant)
P(Cancer): P(C = malignant) = P(C = malignant, S = no) + P(C = malignant, S = light) + P(C = malignant, S = heavy)
18 Bayes Rule Revisited
P(S | C) = P(C | S) P(S) / P(C)
19 A Bayesian Network
Gender
Age
Exposure to Toxics
Smoking
Cancer
Serum Calcium
Lung Tumor
20 Problems with Large Instances
- The joint probability distribution, P(A, G, E, S, C, L, SC)
- For seven binary variables there are 2^7 = 128 values in the joint distribution (for 100 variables there are over 10^30 values)
- How are these values to be obtained?
- Inference
- To obtain posterior distributions once some evidence is available requires summation over an exponential number of terms (e.g. 2^4 for this seven-variable network), which increases to 2^97 if there are 100 variables.
21 Independence
Age and Gender are independent (nodes Age and Gender, with no edge between them).
P(A, G) = P(A) P(G)
P(A | G) = P(A),  A ⊥ G
P(G | A) = P(G),  G ⊥ A
P(A, G) = P(G | A) P(A) = P(G) P(A)
P(A, G) = P(A | G) P(G) = P(A) P(G)
22 Conditional Independence
Cancer is independent of Age and Gender given Smoking.
(Nodes: Age and Gender are parents of Smoking; Smoking is a parent of Cancer.)
P(C | A, G, S) = P(C | S),  C ⊥ {A, G} | S
- Age and Gender influence whether Smoking = heavy
- Whether Smoking = heavy influences Cancer
- But given Smoking = heavy, Cancer no longer depends on Age and Gender
23 More Conditional Independence: Naïve Bayes
Serum Calcium and Lung Tumor are dependent, but they are conditionally independent given their common parent Cancer:
P(L | SC, C) = P(L | C)
24 More Conditional Independence: Explaining Away
Exposure to Toxics and Smoking are (marginally) independent: E ⊥ S.
(Both are parents of Cancer.)
Exposure to Toxics is dependent on Smoking, given Cancer.
25 More Conditional Independence: Explaining Away
(Nodes: Exposure to Toxics and Smoking, both parents of Cancer.)
Exposure to Toxics is dependent on Smoking, given Cancer.
Moralize the graph: connect the parents Exposure to Toxics and Smoking with an undirected edge.
26 Put it all together
27 General Product (Chain) Rule for Bayesian Networks
P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | Pa_i), where Pa_i = parents(Xi)
28 Conditional Independence
A variable (node) is conditionally independent of
its non-descendants given its parents.
(Figure: the cancer network with Exposure to Toxics and Smoking marked as Parents of Cancer, Age and Gender as Non-Descendants, and Serum Calcium and Lung Tumor as Descendants.)
Cancer is independent of Age and Gender given Exposure to Toxics and Smoking.
29 Another non-descendant
Cancer is independent of Diet given Exposure to Toxics and Smoking.
(Figure: the same network with an additional node Diet, a non-descendant of Cancer.)
30 Representing the Joint Distribution
In general, for a network with nodes X1, X2, ..., Xn:
P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi))
An enormous saving can be made in the number of values required for the joint distribution. To determine the joint distribution directly for n binary variables, 2^n − 1 values are required. For a BN with n binary variables in which each node has at most k parents, fewer than 2^k · n values are required.
31 An Example
P(s1) = 0.2
P(l1 | s1) = 0.003,  P(l1 | s2) = 0.00005
P(b1 | s1) = 0.25,  P(b1 | s2) = 0.05
P(f1 | b1, l1) = 0.75,  P(f1 | b1, l2) = 0.10,  P(f1 | b2, l1) = 0.5,  P(f1 | b2, l2) = 0.05
P(x1 | l1) = 0.6,  P(x1 | l2) = 0.02
(S = smoking history, L = lung cancer, B = bronchitis, F = fatigue, X = positive X-ray; subscript 1 = true, 2 = false.)
32 Solution
Note that our joint distribution with 5 variables can be represented as
P(S, L, B, F, X) = P(S) P(L | S) P(B | S) P(F | B, L) P(X | L)
Consequently the joint probability distribution can now be expressed using the tables on the previous slide. For example, the probability that someone has a smoking history, lung cancer but not bronchitis, suffers from fatigue and tests positive in an X-ray test is
P(s1, l1, b2, f1, x1) = P(s1) P(l1 | s1) P(b2 | s1) P(f1 | b2, l1) P(x1 | l1)
= 0.2 × 0.003 × 0.75 × 0.5 × 0.6 = 0.000135
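As a quick check of the arithmetic, a one-line Python sketch (mine, not from the slides) multiplies out the same factors:

```python
# P(s1) * P(l1|s1) * P(b2|s1) * P(f1|b2,l1) * P(x1|l1), values from slide 31
p = 0.2 * 0.003 * (1 - 0.25) * 0.5 * 0.6
print(p)   # 0.000135
```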
33 Independence and Graph Separation
- Given a set of observations, is one set of variables dependent on another set?
- Observing effects can induce dependencies.
- d-separation (Pearl 1988) allows us to check conditional independence graphically.
34 Bayesian networks
- Additional structure
- Nodes as functions
- Causal independence
- Context specific dependencies
- Continuous variables
- Hierarchy and model construction
35 Nodes as functions
- A BN node is a conditional distribution function
- its parent values are the inputs
- its output is a distribution over its values
(Figure: a node X with parents A and B; for one setting of A and B the CPT row gives a distribution such as 0.5, 0.3, 0.2 over the values of X.)
36 Nodes as functions
- Any type of function from Val(A, B) to distributions over Val(X) can be used.
37 Continuous variables
(Example: parent values A/C Setting = hi and Outdoor Temperature = 97°F determine a density over a continuous child variable.)
38 Gaussian (normal) distributions
N(μ, σ):  p(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))
39 Gaussian networks
Each variable is a linear function of its parents, with Gaussian noise:
X = a + Σ_i b_i · Pa_i + ε,  ε ~ N(0, σ²)
The joint probability density function over all the variables is then a multivariate Gaussian.
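As an illustrative sketch of such a node (the two-variable network, its coefficients, and the function name are my own toy assumptions, not from the slides), a linear Gaussian child can be sampled directly:

```python
import random

# Toy linear Gaussian network X -> Y (coefficients are illustrative):
#   X ~ N(1.0, 0.5^2)
#   Y = 2.0 + 0.8 * X + eps,  eps ~ N(0, 0.3^2)
def sample_xy():
    x = random.gauss(1.0, 0.5)
    y = 2.0 + 0.8 * x + random.gauss(0.0, 0.3)
    return x, y

samples = [sample_xy() for _ in range(100_000)]
mean_y = sum(y for _, y in samples) / len(samples)
# E[Y] = 2.0 + 0.8 * E[X] = 2.8; the empirical mean should be close to this
print(mean_y)
```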
40 Composing functions
- Recall a BN node is a function
- We can compose functions to get more complex functions.
- The result: a hierarchically structured BN.
- Since functions can be called more than once, we can reuse a BN model fragment in multiple contexts.
41 (Figure: a hierarchical car model with nodes Owner, Maintenance, Age, Original-value, Mileage, Brakes, Car, Fuel-efficiency, Braking-power.)
42 Bayesian Networks
- Knowledge acquisition
- Variables
- Structure
- Numbers
43 What is a variable?
- Collectively exhaustive, mutually exclusive values
- Example values: Error Occurred, No Error
44 Clarity Test: Knowable in Principle
- Weather: Sunny, Cloudy, Rain, Snow
- Gasoline: cents per gallon
- Temperature: ≥ 100°F, < 100°F
- User needs help on Excel Charting: Yes, No
- User's personality: dominant, submissive
45 Structuring
Network structure corresponding to causality is usually good.
Extending the conversation.
(Figure: building the cancer network out to Lung Tumor.)
46 Course Contents
- Concepts in Probability
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
47 Inference
- Patterns of reasoning
- Basic inference
- Exact inference
- Exploiting structure
- Approximate inference
48 Predictive Inference
How likely are elderly males to get malignant cancer?
P(C = malignant | Age > 60, Gender = male)
(Figure: the full cancer network: Age, Gender, Exposure to Toxics, Smoking, Cancer, Serum Calcium, Lung Tumor.)
49 Combined
How likely is an elderly male patient with high Serum Calcium to have malignant cancer?
P(C = malignant | Age > 60, Gender = male, Serum Calcium = high)
50 Explaining away
- If we see a lung tumor, the probability of heavy smoking and of exposure to toxics both go up.
51 Inference in Belief Networks
- Find P(Q = q | E = e)
- Q: the query variable
- E: the set of evidence variables
- X1, ..., Xn: the network variables other than Q and E
P(q | e) = P(q, e) / P(e), where
P(q, e) = Σ_{x1, ..., xn} P(q, e, x1, ..., xn)
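A minimal inference-by-enumeration sketch in Python (the CPT encoding, dictionary layout, and function names are my own illustration of this sum, not code from the slides), using the five-variable example from slide 31:

```python
from itertools import product

# Each entry: variable -> (parents, table of P(var = 1 | parent values)); values 1 = true, 2 = false.
NETWORK = {
    "S": ((),         {(): 0.2}),
    "L": (("S",),     {(1,): 0.003, (2,): 0.00005}),
    "B": (("S",),     {(1,): 0.25,  (2,): 0.05}),
    "F": (("B", "L"), {(1, 1): 0.75, (1, 2): 0.10, (2, 1): 0.5, (2, 2): 0.05}),
    "X": (("L",),     {(1,): 0.6,   (2,): 0.02}),
}

def joint(assign):
    """P(assignment) = product over variables of P(x_i | parents(x_i))."""
    p = 1.0
    for var, (parents, table) in NETWORK.items():
        p_true = table[tuple(assign[pa] for pa in parents)]
        p *= p_true if assign[var] == 1 else 1.0 - p_true
    return p

def query(q_var, evidence):
    """Return P(q_var = 1 | evidence) by summing the joint over the hidden variables."""
    hidden = [v for v in NETWORK if v != q_var and v not in evidence]
    p_qe = {}
    for q_val in (1, 2):
        total = 0.0
        for values in product((1, 2), repeat=len(hidden)):
            assign = {**evidence, q_var: q_val, **dict(zip(hidden, values))}
            total += joint(assign)
        p_qe[q_val] = total                       # P(q, e)
    return p_qe[1] / (p_qe[1] + p_qe[2])          # normalize by P(e)

# Probability of lung cancer given a positive X-ray and fatigue:
print(query("L", {"X": 1, "F": 1}))
```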
52 Basic Inference
Chain A → B:  P(b) = Σ_a P(b | a) P(a)
53 Inference in trees
(Figure: a node X with parents Y1 and Y2.)
P(x) = Σ_{y1, y2} P(x | y1, y2) P(y1, y2)
     = Σ_{y1, y2} P(x | y1, y2) P(y1) P(y2)   (Y1 and Y2 are independent in the tree)
54 Polytrees
- A network is singly connected (a polytree) if it contains no undirected loops.
Theorem: Inference in a singly connected network can be done in linear time in the network size, including table sizes.
Main idea: in variable elimination, we need only maintain distributions over single nodes.
55 The problem with loops
(Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Grass-wet, Rain → Grass-wet.)
P(c) = 0.5; P(s | c) and P(r | c) take the extreme values 0.01 and 0.99 for the two settings of Cloudy.
Grass-wet is a deterministic OR of Sprinkler and Rain: the grass is dry only if there is no rain and no sprinklers.
56 The problem with loops contd.
Computing P(g) as if the network were singly connected (treating Sprinkler and Rain as independent) gives the wrong answer; the undirected loop through Cloudy is the problem.
57 Variable elimination
Chain A → B → C:  P(c) = Σ_b P(c | b) Σ_a P(b | a) P(a)
58 Inference as variable elimination
- A factor over X is a function from Val(X) to numbers in [0, 1]
- a CPT is a factor
- a joint distribution is also a factor
- BN inference:
- factors are multiplied to give new ones
- variables in factors are summed out
- A variable can be summed out as soon as all factors mentioning it have been multiplied.
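To make the factor operations concrete, here is a small Python sketch (the factor representation and the example numbers are my own, not from the slides): a factor is a list of variables plus a table from value tuples to numbers, and the two operations are pointwise multiplication and summing a variable out.

```python
from itertools import product

# A factor is (variables, table), where table maps a tuple of values (one per
# variable, each in {1, 2}) to a number.
def multiply(f1, f2):
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for values in product((1, 2), repeat=len(out_vars)):
        assign = dict(zip(out_vars, values))
        v1 = t1[tuple(assign[v] for v in vars1)]
        v2 = t2[tuple(assign[v] for v in vars2)]
        table[values] = v1 * v2
    return out_vars, table

def sum_out(var, factor):
    vars_, table = factor
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    out = {}
    for values, p in table.items():
        key = values[:i] + values[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out_vars, out

# Eliminate A from P(A) P(B | A): the result is the factor P(B).
f_a = (["A"], {(1,): 0.3, (2,): 0.7})
f_b_given_a = (["A", "B"], {(1, 1): 0.9, (1, 2): 0.1, (2, 1): 0.2, (2, 2): 0.8})
print(sum_out("A", multiply(f_a, f_b_given_a)))   # P(B=1) = 0.3*0.9 + 0.7*0.2 = 0.41
```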
59 Variable Elimination with loops
(Figure: elimination on the full cancer network: Age, Gender, Exposure to Toxics, Smoking, Cancer, Serum Calcium, Lung Tumor.)
Complexity is exponential in the size of the factors.
60 Inference in BNs and the Junction Tree
- The main point of BNs is to enable probabilistic inference to be performed. Inference is the task of computing the probability of each value of a node in a BN when the values of other variables are known.
- The general idea is to do inference by representing the joint probability distribution on an undirected graph called the junction tree.
- The junction tree has the following characteristics:
- it is an undirected tree whose nodes are clusters of variables
- given two clusters, C1 and C2, every node on the path between them contains their intersection C1 ∩ C2
- a separator, S, is associated with each edge and contains the variables in the intersection between neighbouring nodes
61 Inference in BNs
- Moralize the Bayesian network
- Triangulate the moralized graph
- Let the cliques of the triangulated graph be the nodes of a tree, and construct the junction tree
- Do inference by belief propagation throughout the junction tree
62 Constructing the Junction Tree (1)
Step 1. Form the moral graph from the DAG. Consider the BN in our example.
Moral graph: marry the parents (add an edge between every pair of parents of a node) and remove the arrows.
(Figure: the DAG and the resulting moral graph.)
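A short Python sketch of Step 1 (the graph encoding is my own choice; the example DAG is the five-variable network from slide 31): for each node, connect all pairs of its parents, then drop the edge directions.

```python
from itertools import combinations

def moralize(parents):
    """parents: dict mapping each node to the list of its parents in the DAG.
    Returns the set of undirected edges of the moral graph (as frozensets)."""
    edges = set()
    for child, pas in parents.items():
        for pa in pas:                          # keep every original edge, undirected
            edges.add(frozenset((pa, child)))
        for pa1, pa2 in combinations(pas, 2):   # "marry" all pairs of parents
            edges.add(frozenset((pa1, pa2)))
    return edges

# The five-variable example from slide 31: S -> L, S -> B, B -> F, L -> F, L -> X
dag = {"S": [], "L": ["S"], "B": ["S"], "F": ["B", "L"], "X": ["L"]}
for e in sorted(moralize(dag), key=sorted):
    print(set(e))    # note the added B-L edge, since B and L are both parents of F
```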
63 Constructing the Junction Tree (2)
Step 2. Triangulate the moral graph. An undirected graph is triangulated if every cycle of length greater than 3 possesses a chord.
64 Constructing the Junction Tree (3)
Step 3. Identify the cliques. A clique is a subset of nodes which is complete (i.e. there is an edge between every pair of nodes) and maximal.
Cliques: {B, S, L}, {B, L, F}, {L, X}
65 Constructing the Junction Tree (4)
Step 4. Build the junction tree. The cliques should be ordered (C1, C2, ..., Ck) so that they possess the running intersection property: for all 1 < j ≤ k, there is an i < j such that Cj ∩ (C1 ∪ ... ∪ Cj−1) ⊆ Ci. To build the junction tree, choose one such i for each j and add an edge between Cj and Ci.
Junction tree for the cliques {B, S, L}, {B, L, F}, {L, X} (separators in square brackets):
{B, S, L} --[B, L]-- {B, L, F} --[L]-- {L, X}
66 Potentials: Initialization
To initialize the potential functions:
1. set all potentials to unity
2. for each variable Xi, select one node in the junction tree (i.e. one clique) containing both that variable and its parents pa(Xi) in the original DAG
3. multiply that clique's potential by P(xi | pa(xi))
(Junction tree as before: {B, S, L} --[B, L]-- {B, L, F} --[L]-- {L, X}.)
67 Potential Representation
The joint probability distribution can now be represented in terms of potential functions, ψ, defined on each clique and each separator of the junction tree. The joint distribution is given by
P(U) = ∏_C ψ_C(C) / ∏_S ψ_S(S)
The idea is to transform this representation of the joint distribution into another in which, for each clique C, the potential function gives the marginal distribution for the variables in C, i.e.
ψ_C(c) = P(c)
This will also apply to the separators, S.
68 Triangulation
- Given a numbered graph, proceed from node n down to node 1
- Determine the lower-numbered nodes that are adjacent to the current node, including those which may have been made adjacent to this node earlier in this algorithm
- Connect these nodes to each other.
69 Triangulation
- Numbering the nodes
- Arbitrarily number the nodes, or
- Maximum cardinality search
- Give any node the number 1
- For each subsequent number, pick a new unnumbered node that neighbours the most already-numbered nodes
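A compact Python sketch of maximum cardinality search (the adjacency data and the tie-breaking rule are my own assumptions, not from the slides): it numbers nodes by repeatedly picking an unnumbered node with the most already-numbered neighbours.

```python
def max_cardinality_search(adj, start):
    """adj: dict mapping each node to the set of its neighbours in the undirected graph.
    Returns the nodes in the order they are numbered 1..n."""
    order = [start]
    numbered = {start}
    while len(order) < len(adj):
        # pick an unnumbered node with the most already-numbered neighbours (ties by name)
        best = max(
            (v for v in adj if v not in numbered),
            key=lambda v: (len(adj[v] & numbered), v),
        )
        order.append(best)
        numbered.add(best)
    return order

# Moral graph of the five-variable example (slides 62-64)
adj = {
    "S": {"L", "B"},
    "L": {"S", "B", "F", "X"},
    "B": {"S", "L", "F"},
    "F": {"B", "L"},
    "X": {"L"},
}
print(max_cardinality_search(adj, "S"))   # e.g. ['S', 'L', 'B', 'F', 'X']
```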
70 Triangulation
(Figure: the BN and its moralized graph.)
71 Triangulation
(Figure: an arbitrary numbering of the nodes from 1 to 8.)
72 Triangulation
Maximum cardinality search
73 Course Contents
- Concepts in Probability
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
74 Decision making
- Decision: an irrevocable allocation of domain resources
- Decisions should be made so as to maximize expected utility.
- View decision making in terms of
- Beliefs/Uncertainties
- Alternatives/Decisions
- Objectives/Utilities
75 Course Contents
- Concepts in Probability
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
76 Learning networks from data
- The learning task
- Parameter learning
- Fully observable
- Partially observable
- Structure learning
- Hidden variables
77 The learning task
Input: training data (cases over the network variables B, E, A, C, N, ...)
- Input: fully or partially observable data cases?
- Output: parameters, or also structure?
78 Parameter learning: one variable
- Unfamiliar coin
- Let θ = bias of the coin (long-run fraction of heads)
- If θ is known (given), then
- P(X = heads | θ) = θ
- Different coin tosses are independent given θ
- P(X1, ..., Xn | θ) = θ^h (1 − θ)^t  (h heads, t tails)
79 Maximum likelihood
- Input: a set of previous coin tosses
- X1, ..., Xn = H, T, H, H, H, T, T, H, ..., H
- Goal: estimate θ
- The likelihood P(X1, ..., Xn | θ) = θ^h (1 − θ)^t
- The maximum likelihood solution is θ* = h / (h + t)
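A tiny Python sketch of this estimate (the toss sequence is an invented example, not from the slides):

```python
tosses = "HTHHHTTH"                 # example data
h = tosses.count("H")
t = tosses.count("T")
theta_ml = h / (h + t)              # maximum likelihood estimate of the bias
print(theta_ml)                     # 0.625
```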
80 Conditioning on data
P(θ | D) ∝ P(D | θ) P(θ) = θ^h (1 − θ)^t P(θ)
81 Conditioning on data
82 General parameter learning
- A multi-variable BN is composed of several independent parameters ("coins").
(Figure: a small network with three parameters.)
- We can use the same techniques as in the one-variable case to learn each one separately.
83 Partially observable data
(Figure: the Burglary / Earthquake network with nodes Burglary, Earthquake, Alarm, Newscast, Call, and a table of data cases over B, E, A, C, N in which some entries are missing, shown as '?'.)
- Fill in missing data with its expected value
- i.e. an expected distribution over the possible values
- use the current best-guess BN to estimate that distribution
84 Intuition
- In the partially observable case, the filled-in data I is unknown.
- The best estimate for I is its expectation given the observed data and θ.
- Problem: θ is unknown.
85 Expectation Maximization (EM)
- Expectation (E) step
- Use the current parameters θ to estimate the filled-in data.
- Maximization (M) step
- Use the filled-in data to do maximum likelihood estimation of new parameters.
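Below is a small, self-contained EM sketch for a two-node network X → Y in which X is sometimes unobserved (the data, starting parameters, and variable names are illustrative assumptions, not from the slides). The E step fills in a distribution over each missing X using the current parameters; the M step re-estimates the parameters from the expected counts.

```python
# Data: (x, y) pairs with x possibly missing (None); values are 0/1.
data = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0), (1, 0), (0, 1), (None, 1)]

# Initial guesses for P(X=1), P(Y=1|X=1), P(Y=1|X=0)
px, py1, py0 = 0.5, 0.5, 0.5

for _ in range(50):
    # E-step: expected counts, filling in missing X with P(X=1 | y) under current parameters
    n = len(data)
    exp_x1 = 0.0            # expected number of cases with X = 1
    exp_y1_x1 = 0.0         # expected number of cases with X = 1 and Y = 1
    exp_y1_x0 = 0.0         # expected number of cases with X = 0 and Y = 1
    for x, y in data:
        if x is None:
            w1 = px * (py1 if y == 1 else 1 - py1)
            w0 = (1 - px) * (py0 if y == 1 else 1 - py0)
            w1 = w1 / (w1 + w0)        # posterior P(X = 1 | y)
        else:
            w1 = float(x)
        exp_x1 += w1
        if y == 1:
            exp_y1_x1 += w1
            exp_y1_x0 += 1 - w1
    # M-step: maximum likelihood estimates from the expected counts
    px = exp_x1 / n
    py1 = exp_y1_x1 / exp_x1
    py0 = exp_y1_x0 / (n - exp_x1)

print(px, py1, py0)   # converged estimates of P(X=1), P(Y=1|X=1), P(Y=1|X=0)
```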
86 Structure learning
Goal: find a good BN structure (relative to the data).
Solution: do heuristic search over the space of network structures.
87 Search space
Space: network structures. Operators: add / reverse / delete edges.
88 Heuristic search
Use a scoring function to do heuristic search (any search algorithm). Greedy hill-climbing with randomness works pretty well.
(Figure: candidate structures evaluated by score.)
89 Scoring
- Fill in parameters using the previous techniques; score completed networks.
- One possibility for the score: the likelihood function, Score(B) = P(data | B)
Example: X, Y independent coin tosses; typical data: (27 h-h, 22 h-t, 25 t-h, 26 t-t)
The maximum likelihood network is typically fully connected.
This is not surprising: maximum likelihood always overfits.
90 Better scoring functions
- MDL formulation: balance fit to data and model complexity (number of parameters)
Score(B) = P(data | B) − model complexity
- Full Bayesian formulation
- prior on network structures and parameters
- more parameters ⇒ higher-dimensional space
- we get the balancing effect as a byproduct
With a Dirichlet parameter prior, MDL is an approximation to the full Bayesian score.
91 Hidden variables
- There may be interesting variables that we never get to observe:
- topic of a document in information retrieval
- user's current task in an online help system
- Our learning algorithm should
- hypothesize the existence of such variables
- learn an appropriate state space for them.
92 (Figure: data over E1, E2, E3 appearing randomly scattered.)
93 (Figure: the actual data over E1, E2, E3.)
94 Bayesian clustering (Autoclass)
(Naïve Bayes model: a Class variable with children E1, E2, ..., En.)
- the (hypothetical) class variable is never observed
- if we know that there are k classes, just run EM
- the learned classes are the clusters
- Bayesian analysis allows us to choose k, trading off fit to data against model complexity
95 (Figure: the resulting cluster distributions over E1, E2, E3.)
96 Detecting hidden variables
- Unexpected correlations ⇒ hidden variables.
97 Course Contents
- Concepts in Probability
- Bayesian Networks
- Inference
- Decision making
- Learning networks from data
- Reasoning over time
- Applications
98 Reasoning over time
- Dynamic Bayesian networks
- Hidden Markov models
- Decision-theoretic planning
- Markov decision problems
- Structured representation of actions
- The qualification problem and the frame problem
- Causality (and the frame problem revisited)
99 Dynamic environments
State(t)
- Markov property
- the past is independent of the future given the current state
- a conditional independence assumption
- implied by the fact that there are no arcs from time t to time t+2.
100 Dynamic Bayesian networks
- State described via random variables.
101 Hidden Markov model
- An HMM is a simple model for a partially
observable stochastic domain.
102 Hidden Markov model
Partially observable stochastic environment:
- Mobile robots
- states: location
- observations: sensor input
- Speech recognition
- states: phonemes
- observations: acoustic signal
- Biological sequencing
- states: protein structure
- observations: amino acids
(A minimal filtering sketch of such a model follows below.)
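Here is the promised filtering sketch (the two-state HMM, its transition and observation numbers, and the function name are all illustrative assumptions, not from the slides): the forward algorithm updates a distribution over the hidden state after each observation.

```python
# Illustrative 2-state HMM: hidden state in {0, 1}, observation in {0, 1}.
prior = [0.5, 0.5]                       # P(state_0)
trans = [[0.7, 0.3], [0.2, 0.8]]         # trans[i][j] = P(state_t = j | state_{t-1} = i)
obs = [[0.9, 0.1], [0.4, 0.6]]           # obs[i][o]   = P(observation = o | state = i)

def forward(observations):
    """Return P(state_t | observations up to t), the filtered state distribution."""
    belief = prior[:]
    for o in observations:
        # predict: push the belief through the transition model
        predicted = [sum(belief[i] * trans[i][j] for i in range(2)) for j in range(2)]
        # update: weight by the observation likelihood and renormalize
        unnorm = [predicted[j] * obs[j][o] for j in range(2)]
        z = sum(unnorm)
        belief = [u / z for u in unnorm]
    return belief

print(forward([1, 1, 0]))
```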
103 Acting under uncertainty
Markov Decision Problem (MDP)
- Overall utility = sum of momentary rewards.
- Allows a rich preference model, e.g. rewards corresponding to "get to the goal as soon as possible".
104 Partially observable MDPs
- The optimal action at time t depends on the entire history of previous observations.
- Instead, a distribution over State(t) (a belief state) suffices.
105 Structured representation
- Probabilistic action model
- allows for exceptions and qualifications
- persistence arcs: a solution to the frame problem.
106 Applications
- Medical expert systems
- Pathfinder
- Parenting MSN
- Fault diagnosis
- Ricoh FIXIT
- Decision-theoretic troubleshooting
- Vista
- Collaborative filtering