Introduction to Probabilistic Graphical Models

Transcript
1
Introduction to Probabilistic Graphical Models
  • Eran Segal
  • Weizmann Institute

2
Probabilistic Graphical Models
  • Tool for representing complex systems and
    performing sophisticated reasoning tasks
  • Fundamental notion: modularity
  • Complex systems are built by combining simpler
    parts
  • Why have a model?
  • Compact and modular representation of complex
    systems
  • Ability to execute complex reasoning patterns
  • Make predictions
  • Generalize from particular problems

3
Probabilistic Graphical Models
  • Increasingly important in Machine Learning
  • Many classical probabilistic problems in
    statistics, information theory, pattern
    recognition, and statistical mechanics are
    special cases of the formalism
  • Graphical models provide a common framework
  • Advantage: specialized techniques developed in
    one field can be transferred between research
    communities

4
Representation: Graphs
  • Intuitive data structure for modeling
    highly-interacting sets of variables
  • Explicit model for modularity
  • Data structure that allows for design of
    efficient general-purpose algorithms

5
Reasoning: Probability Theory
  • Well understood framework for modeling
    uncertainty
  • Partial knowledge of the state of the world
  • Noisy observations
  • Phenomenon not covered by our model
  • Inherent stochasticity
  • Clear semantics
  • Can be learned from data

6
A Simple Example
  • We want to model whether our neighbor will inform
    us of the alarm being set off
  • The alarm can go off if
  • There is a burglary
  • There is an earthquake
  • Whether our neighbor calls depends on whether the
    alarm is set off

7
A Simple Example
  • Variables
  • Earthquake (E), Burglary (B), Alarm (A),
    NeighborCalls (N)

E B A N Prob.
F F F F 0.01
F F F T 0.04
F F T F 0.05
F F T T 0.01
F T F F 0.02
F T F T 0.07
F T T F 0.2
F T T T 0.1
T F F F 0.01
T F F T 0.07
T F T F 0.13
T F T T 0.04
T T F F 0.06
T T F T 0.05
T T T F 0.1
T T T T 0.05
2^4 - 1 = 15 independent parameters (see the sketch below)
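To make the parameter blow-up concrete, here is a minimal Python sketch; the variable names match the slide and the counts are just arithmetic:

```python
from itertools import product

# Hypothetical full joint over 4 binary variables (E, B, A, N):
# one entry per assignment, 2**4 = 16 entries, 2**4 - 1 = 15 free parameters
# (the last entry is fixed because probabilities must sum to 1).
n_vars = 4
assignments = list(product([False, True], repeat=n_vars))
print(len(assignments))        # 16 table entries
print(len(assignments) - 1)    # 15 independent parameters

# With 37 variables (the ICU Alarm network on a later slide) the table would
# need 2**37 entries -- far too many to specify or estimate directly.
print(2 ** 37)
```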
8
A Simple Example
[Network: Earthquake → Alarm ← Burglary; Alarm → NeighborCalls]

P(E):
  E=F   E=T
  0.9   0.1

P(B):
  B=F   B=T
  0.7   0.3

P(A | E, B):
  E  B   A=F   A=T
  F  F   0.99  0.01
  F  T   0.10  0.90
  T  F   0.30  0.70
  T  T   0.01  0.99

P(N | A):
  A   N=F   N=T
  F   0.9   0.1
  T   0.2   0.8

1 + 1 + 4 + 2 = 8 independent parameters (see the sketch below)
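The same joint can be assembled from the local CPDs above; a minimal Python sketch using the values from this slide:

```python
# CPDs from the slide (probability of the variable being True).
p_e = 0.1                                           # P(E = T)
p_b = 0.3                                           # P(B = T)
p_a = {(False, False): 0.01, (False, True): 0.9,    # P(A = T | E, B)
       (True,  False): 0.7,  (True,  True): 0.99}
p_n = {False: 0.1, True: 0.8}                       # P(N = T | A)

def bernoulli(p_true, value):
    """Return P(value) for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(e, b, a, n):
    """P(E=e, B=b, A=a, N=n) = P(e) P(b) P(a | e, b) P(n | a)."""
    return (bernoulli(p_e, e) * bernoulli(p_b, b)
            * bernoulli(p_a[(e, b)], a) * bernoulli(p_n[a], n))

# 1 + 1 + 4 + 2 = 8 independent parameters instead of 2**4 - 1 = 15.
print(joint(False, True, True, True))   # P(no quake, burglary, alarm, call)
```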
9
Example Bayesian Network
  • The Alarm network for monitoring intensive care
    patients
  • 509 parameters (full joint: 2^37)
  • 37 variables

10
Application Clustering Users
  • Input: TV shows that each user watches
  • Output: TV show clusters
  • Assumption: shows watched by the same users are
    similar
  • Class 1
  • Power rangers
  • Animaniacs
  • X-men
  • Tazmania
  • Spider man
  • Class 2
  • Young and restless
  • Bold and the beautiful
  • As the world turns
  • Price is right
  • CBS eve news
  • Class 3
  • Tonight show
  • Conan O'Brien
  • NBC nightly news
  • Later with Kinnear
  • Seinfeld
  • Class 4
  • 60 minutes
  • NBC nightly news
  • CBS eve news
  • Murder she wrote
  • Matlock
  • Class 5
  • Seinfeld
  • Friends
  • Mad about you
  • ER
  • Frasier

11
App. Recommendation Systems
  • Given user preferences, suggest recommendations
  • Example: Amazon.com
  • Input: movie preferences of many users
  • Solution: model correlations between movie
    features
  • Users that like comedy often also like drama
  • Users that like action often do not like
    cartoons
  • Users that like Robert De Niro films often like Al
    Pacino films
  • Given user preferences, can predict the probability
    that new movies match preferences

12
Probability Theory
  • Probability distribution P over (Ω, S) is a
    mapping from events in S such that
  • P(α) ≥ 0 for all α ∈ S
  • P(Ω) = 1
  • If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)
  • Conditional probability: P(α | β) = P(α ∩ β) / P(β)
  • Chain rule: P(α ∩ β) = P(β | α) P(α)
  • Bayes rule: P(α | β) = P(β | α) P(α) / P(β)
  • Conditional independence: α is independent of β given γ
    if P(α | β ∩ γ) = P(α | γ) (see the sketch below)
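A small numeric sketch of the chain rule, total probability, and Bayes' rule; the disease/test numbers below are made up purely for illustration:

```python
# Hypothetical numbers: P(disease) and P(positive test | disease / no disease).
p_d = 0.01          # prior P(D)
p_pos_d = 0.95      # P(+ | D)
p_pos_nd = 0.05     # P(+ | not D)

# Chain rule: P(D, +) = P(D) * P(+ | D)
p_d_and_pos = p_d * p_pos_d

# Total probability: P(+) = P(+ | D) P(D) + P(+ | not D) P(not D)
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)

# Bayes' rule: P(D | +) = P(+ | D) P(D) / P(+)
print(p_d_and_pos / p_pos)   # ~0.161
```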

13
Random Variables Notation
  • Random variable: function from Ω to a value
  • Categorical / Ordinal / Continuous
  • Val(X): set of possible values of RV X
  • Upper case letters denote RVs (e.g., X, Y, Z)
  • Upper case bold letters denote sets of RVs (e.g.,
    X, Y)
  • Lower case letters denote RV values (e.g., x, y,
    z)
  • Lower case bold letters denote RV set values
    (e.g., x)
  • Values for a categorical RV with |Val(X)| = k:
    x1, x2, ..., xk
  • Marginal distribution over X: P(X)
  • Conditional independence: X is independent of Y
    given Z if P(X | Y, Z) = P(X | Z)

14
Expectation
  • Discrete RVs: E_P[X] = Σ_x x P(x)
  • Continuous RVs: E_P[X] = ∫ x p(x) dx
  • Linearity of expectation: E[aX + bY] = aE[X] + bE[Y]
  • Expectation of products (when X ⊥ Y in P):
    E[XY] = E[X] E[Y] (see the sketch below)
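A quick check of these identities on a made-up pair of independent discrete variables:

```python
from itertools import product

# Independent X and Y: marginals P(X=x) and P(Y=y), joint P(x, y) = P(x) P(y).
px = {0: 0.3, 1: 0.7}
py = {0: 0.6, 2: 0.4}

def E(f):
    """Expectation of f(x, y) under the product distribution."""
    return sum(px[x] * py[y] * f(x, y) for x, y in product(px, py))

ex, ey = E(lambda x, y: x), E(lambda x, y: y)
print(E(lambda x, y: 2 * x + 3 * y), 2 * ex + 3 * ey)   # linearity: equal
print(E(lambda x, y: x * y), ex * ey)                   # product rule: equal
```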

15
Variance
  • Variance of RV: Var[X] = E[(X - E[X])^2]
  • If X and Y are independent: Var[X + Y] = Var[X] + Var[Y]
  • Var[aX + b] = a^2 Var[X]

16
Information Theory
  • Entropy: H_P(X) = -Σ_x P(x) log P(x)
  • We use log base 2 to interpret entropy as bits of
    information
  • Entropy of X is a lower bound on the average number
    of bits needed to encode values of X
  • 0 ≤ H_P(X) ≤ log |Val(X)| for any distribution P(X)
  • Conditional entropy: H_P(X | Y) = -Σ_{x,y} P(x, y) log P(x | y)
  • Information only helps: H_P(X | Y) ≤ H_P(X)
  • Mutual information: I_P(X; Y) = H_P(X) - H_P(X | Y)
  • 0 ≤ I_P(X; Y) ≤ H_P(X)
  • Symmetry: I_P(X; Y) = I_P(Y; X)
  • I_P(X; Y) = 0 iff X and Y are independent
  • Chain rule of entropies: H_P(X, Y) = H_P(X) + H_P(Y | X)
    (see the sketch below)
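A small sketch computing entropy, conditional entropy, and mutual information (base-2 logs) for an assumed joint over two binary variables:

```python
from math import log2

# Hypothetical joint distribution over binary X, Y.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

def H(dist):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

h_x, h_y, h_xy = H(p_x), H(p_y), H(p_xy)
mi = h_x + h_y - h_xy                     # I(X; Y) = H(X) + H(Y) - H(X, Y)
print(h_x, h_xy - h_y, mi)                # H(X), H(X | Y), I(X; Y)
# 0 <= I(X; Y) <= H(X), and the chain rule gives H(X, Y) = H(Y) + H(X | Y).
```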

17
Representing Joint Distributions
  • Random variables X1, ..., Xn
  • P is a joint distribution over X1, ..., Xn
  • Can we represent P more compactly?
  • Key: exploit independence properties

18
Independent Random Variables
  • Two variables X and Y are independent if
  • P(X=x | Y=y) = P(X=x) for all values x, y
  • Equivalently, knowing Y does not change
    predictions of X
  • If X and Y are independent then
  • P(X, Y) = P(X | Y) P(Y) = P(X) P(Y)
  • If X1, ..., Xn are independent then
  • P(X1, ..., Xn) = P(X1) ··· P(Xn)
  • O(n) parameters
  • All 2^n probabilities are implicitly defined
  • Cannot represent many types of distributions

19
Conditional Independence
  • X and Y are conditionally independent given Z if
  • P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y,
    z
  • Equivalently, if we know Z, then knowing Y does
    not change predictions of X
  • Notation: Ind(X; Y | Z) or (X ⊥ Y | Z)

20
Conditional Parameterization
  • S: score on test, Val(S) = {s0, s1}
  • I: intelligence, Val(I) = {i0, i1}

Joint parameterization P(I, S) (3 parameters):
  I   S   P(I,S)
  i0  s0  0.665
  i0  s1  0.035
  i1  s0  0.06
  i1  s1  0.24

Conditional parameterization P(I), P(S | I) (3 parameters):
  P(I):
    i0   i1
    0.7  0.3
  P(S | I):
    I   s0    s1
    i0  0.95  0.05
    i1  0.2   0.8

Alternative parameterization: P(S) and P(I | S)
(a sketch recovering P(I, S) from P(I) and P(S | I) follows below)
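Multiplying the conditional parameterization back together recovers the joint table above; a minimal sketch:

```python
# From the slide: prior P(I) and conditional P(S | I).
p_i = {"i0": 0.7, "i1": 0.3}
p_s_given_i = {"i0": {"s0": 0.95, "s1": 0.05},
               "i1": {"s0": 0.20, "s1": 0.80}}

# Joint via the chain rule: P(i, s) = P(i) P(s | i).
joint = {(i, s): p_i[i] * p_s_given_i[i][s]
         for i in p_i for s in ("s0", "s1")}
print(joint)   # ('i0','s0'): 0.665, ('i0','s1'): 0.035,
               # ('i1','s0'): 0.06,  ('i1','s1'): 0.24
```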
21
Conditional Parameterization
  • S: score on test, Val(S) = {s0, s1}
  • I: intelligence, Val(I) = {i0, i1}
  • G: grade, Val(G) = {g0, g1, g2}
  • Assume that G and S are independent given I

22
Naïve Bayes Model
  • Class variable C, Val(C) = {c1, ..., ck}
  • Evidence variables X1, ..., Xn
  • Naïve Bayes assumption: evidence variables are
    conditionally independent given C, so
    P(C, X1, ..., Xn) = P(C) Π_i P(Xi | C)
  • Applications in medical diagnosis, text
    classification
  • Used as a classifier (see the sketch below)
  • Problem: double counting of correlated evidence
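A minimal sketch of a Naïve Bayes classifier built from this factorization; the class prior and per-feature CPD numbers below are made-up placeholders:

```python
def naive_bayes_posterior(prior, cpds, evidence):
    """P(C | x1..xn) ∝ P(C) * prod_i P(xi | C), under the Naive Bayes assumption.

    prior:    dict class -> P(class)
    cpds:     list of dicts, cpds[i][c][x] = P(X_i = x | C = c)
    evidence: observed values (x1, ..., xn)
    """
    scores = {c: prior[c] for c in prior}
    for c in scores:
        for cpd, x in zip(cpds, evidence):
            scores[c] *= cpd[c][x]
    z = sum(scores.values())                     # normalize
    return {c: s / z for c, s in scores.items()}

# Hypothetical binary example: 2 classes, 2 boolean evidence variables.
prior = {"c1": 0.6, "c2": 0.4}
cpds = [{"c1": {True: 0.8, False: 0.2}, "c2": {True: 0.3, False: 0.7}},
        {"c1": {True: 0.1, False: 0.9}, "c2": {True: 0.5, False: 0.5}}]
print(naive_bayes_posterior(prior, cpds, (True, False)))
```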

23
Bayesian Network (Informal)
  • Directed acyclic graph G
  • Nodes represent random variables
  • Edges represent direct influences between random
    variables
  • Local probability models

24
Bayesian Network (Informal)
  • Represents a joint distribution
  • Specifies the probability P(X=x)
  • Specifies the conditional probability P(X=x | E=e)
  • Allows for reasoning patterns
  • Prediction (e.g., intelligent → high scores)
  • Explanation (e.g., low score → not intelligent)
  • Explaining away (different causes for the same effect
    interact)

[Figure: example network over I, S, G]
25
Bayesian Network Structure
  • Directed acyclic graph G
  • Nodes X1, ..., Xn represent random variables
  • G encodes local Markov assumptions:
  • Xi is independent of its non-descendants given
    its parents
  • Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))

26
Independency Mappings (I-Maps)
  • Let P be a distribution over X
  • Let I(P) be the set of independencies (X ⊥ Y | Z) that
    hold in P
  • A Bayesian network structure G is an I-map
    (independency mapping) of P if I(G) ⊆ I(P)

Example: two distributions over {I, S} and two candidate graphs:

  I   S   P(I,S)            I   S   P(I,S)
  i0  s0  0.25              i0  s0  0.4
  i0  s1  0.25              i0  s1  0.3
  i1  s0  0.25              i1  s0  0.2
  i1  s1  0.25              i1  s1  0.1

  I(P) = {I ⊥ S}            I(P) = ∅
  Graph with no edge         Graph with an edge
  between I and S:           between I and S:
  I(G) = {I ⊥ S}             I(G) = ∅
27
Factorization Theorem
  • If G is an I-Map of P, then
    P(X1, ..., Xn) = Π_i P(Xi | Pa(Xi))
  • Proof:
  • wlog. X1, ..., Xn is an ordering consistent with G
  • By the chain rule: P(X1, ..., Xn) = Π_i P(Xi | X1, ..., Xi-1)
  • From the ordering: Pa(Xi) ⊆ {X1, ..., Xi-1} and
    {X1, ..., Xi-1} ⊆ NonDesc(Xi) ∪ Pa(Xi)
  • Since G is an I-Map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P),
    hence P(Xi | X1, ..., Xi-1) = P(Xi | Pa(Xi))

28
Factorization Implies I-Map
  • If P(X1, ..., Xn) = Π_i P(Xi | Pa(Xi)), then G is an
    I-Map of P
  • Proof:
  • Need to show (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), i.e.,
    that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
  • wlog. X1, ..., Xn is an ordering consistent with G

29
Bayesian Network Definition
  • A Bayesian network is a pair (G, P) where
  • P factorizes over G
  • P is specified as a set of CPDs associated with G's
    nodes
  • Parameters (binary variables):
  • Joint distribution: 2^n
  • Bayesian network (bounded in-degree k): n·2^k

30
Bayesian Network Design
  • Variable considerations
  • Clarity test: can an omniscient being determine
    its value?
  • Hidden variables?
  • Irrelevant variables
  • Structure considerations
  • Causal order of variables
  • Which independencies (approximately) hold?
  • Probability considerations
  • Zero probabilities
  • Orders of magnitude
  • Relative values

31
CPDs
  • Thus far we ignored the representation of CPDs
  • Now we will cover the range of CPD
    representations
  • Discrete
  • Continuous
  • Sparse
  • Deterministic
  • Linear

32
Table CPDs
  • Entry for each joint assignment of X and Pa(X)
  • For each pa_X: Σ_x P(X=x | pa_X) = 1
  • Most general representation
  • Represents every discrete CPD
  • Limitations:
  • Cannot model continuous RVs
  • Number of parameters exponential in |Pa(X)|
  • Cannot model large in-degree dependencies
  • Ignores structure within the CPD

[Network: I → S]

P(I):
  i0   i1
  0.7  0.3

P(S | I):
  I   s0    s1
  i0  0.95  0.05
  i1  0.2   0.8
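A table CPD is just a lookup keyed by the parent assignment, with each row a distribution over the child; a minimal sketch using the P(S | I) table above:

```python
# Table CPD for P(S | I): one row (a distribution over S) per parent assignment.
cpd_s_given_i = {
    ("i0",): {"s0": 0.95, "s1": 0.05},
    ("i1",): {"s0": 0.20, "s1": 0.80},
}

# Each row must be a distribution over Val(S).
for parents, row in cpd_s_given_i.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, parents

# Size grows exponentially with the number of parents:
# k binary parents -> 2**k rows, each with |Val(X)| - 1 free parameters.
print(len(cpd_s_given_i))   # 2 rows for a single binary parent
```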
33
Structured CPDs
  • Key idea: reduce parameters by modeling P(X | Pa(X))
    without explicitly modeling all entries of the
    conditional table
  • Lose expressive power (cannot represent every
    CPD)

34
Deterministic CPDs
  • There is a function f: Val(Pa(X)) → Val(X) such
    that P(X=x | pa_X) = 1 if x = f(pa_X), and 0 otherwise
  • Examples
  • OR, AND, NAND functions
  • Deterministic functions of continuous variables,
    e.g., Z = Y + X

35
Deterministic CPDs
  • Replace spurious dependencies with deterministic
    CPDs
  • Need to make sure that the deterministic CPD is
    compactly stored

[Networks: S with parents T1, T2 directly, vs. T1, T2 → T → S with T deterministic]

P(T | T1, T2) (deterministic):
  T1  T2  t0  t1
  t0  t0  1   0
  t0  t1  0   1
  t1  t0  0   1
  t1  t1  0   1

P(S | T1, T2):
  T1  T2  s0    s1
  t0  t0  0.95  0.05
  t0  t1  0.2   0.8
  t1  t0  0.2   0.8
  t1  t1  0.2   0.8

P(S | T):
  T   s0    s1
  t0  0.95  0.05
  t1  0.2   0.8
36
Deterministic CPDs
  • Induce additional conditional independencies
  • Example: T is any deterministic function of T1, T2

[Figure: network over T1, T2, T, S1, S2]
37
Deterministic CPDs
  • Induce additional conditional independencies
  • Example: C is an XOR deterministic function of
    A, B

[Figure: network over A, B, C, D, E]
38
Deterministic CPDs
  • Induce additional conditional independencies
  • Example: T is an OR deterministic function of
    T1, T2

[Figure: network over T1, T2, T, S1, S2]

Context-specific independencies
39
Tree CPDs
[Figure: network with A, B, C as parents of D]

P(D | A, B, C) as a full table:
  A   B   C   d0   d1
  a0  b0  c0  0.2  0.8
  a0  b0  c1  0.2  0.8
  a0  b1  c0  0.2  0.8
  a0  b1  c1  0.2  0.8
  a1  b0  c0  0.9  0.1
  a1  b0  c1  0.7  0.3
  a1  b1  c0  0.4  0.6
  a1  b1  c1  0.4  0.6

8 parameters as a full table; the tree CPD on the next slide needs
only 4 (see the sketch below)
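One tree that reproduces this table needs only one distribution per leaf; a minimal sketch, with the split order inferred from the table (rows with A = a0 ignore B and C, and rows with A = a1, B = b1 ignore C):

```python
def tree_cpd_d(a, b, c):
    """P(D | A=a, B=b, C=c) as (P(d0), P(d1)), using a tree with 4 leaves."""
    if a == "a0":
        return (0.2, 0.8)            # B and C are irrelevant when A = a0
    if b == "b1":
        return (0.4, 0.6)            # C is irrelevant when A = a1, B = b1
    return (0.9, 0.1) if c == "c0" else (0.7, 0.3)

# Reproduces every row of the full 8-row table, e.g.:
print(tree_cpd_d("a0", "b1", "c1"))   # (0.2, 0.8)
print(tree_cpd_d("a1", "b0", "c1"))   # (0.7, 0.3)
```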
40
Context Specific Independencies
[Figure: network with A, B, C as parents of D]

Tree CPD for P(D | A, B, C):
  A = a0:                  (0.2, 0.8)
  A = a1, B = b1:          (0.4, 0.6)
  A = a1, B = b0, C = c0:  (0.9, 0.1)
  A = a1, B = b0, C = c1:  (0.7, 0.3)

Reasoning by cases implies that Ind(B; C | A, D)
41
Continuous Variables
  • One solution: discretize
  • Often requires too many value states
  • Loses domain structure
  • Other solution: use a continuous function for
    P(X | Pa(X))
  • Can combine continuous and discrete variables,
    resulting in hybrid networks
  • Inference and learning may become more difficult

42
Gaussian Density Functions
  • Among the most common continuous representations
  • Univariate case:
    p(x) = (1 / (√(2π) σ)) exp(-(x - μ)^2 / (2σ^2))
    (see the sketch below)
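For reference, a minimal sketch of the univariate density:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(gaussian_pdf(0.0, 0.0, 1.0))   # ~0.3989, the standard normal at its mean
```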

43
Gaussian Density Functions
  • A multivariate Gaussian distribution over
    X1, ..., Xn has
  • Mean vector μ
  • n×n positive definite covariance matrix Σ
  • Joint density function:
    p(x) = (2π)^(-n/2) |Σ|^(-1/2) exp(-½ (x - μ)^T Σ^(-1) (x - μ))
  • μ_i = E[Xi]
  • Σ_ii = Var[Xi]
  • Σ_ij = Cov[Xi, Xj] = E[Xi Xj] - E[Xi] E[Xj]  (i ≠ j)

44
Hybrid Models
  • Models of continuous and discrete variables
  • Continuous variables with discrete parents
  • Discrete variables with continuous parents
  • Conditional Linear Gaussians (CLG):
  • Y: continuous variable
  • X = {X1, ..., Xn}: continuous parents
  • U = {U1, ..., Um}: discrete parents
  • P(Y | u, x) = N(a_{u,0} + Σ_i a_{u,i} x_i ; σ_u^2)
  • A conditional linear Gaussian Bayesian network is one
    where
  • Discrete variables have only discrete parents
  • Continuous variables have only CLG CPDs
    (see the sketch below)
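A minimal sketch of sampling from a CLG CPD: the discrete parent assignment u selects the intercept, regression coefficients, and variance, and the mean is linear in the continuous parents (the regimes and coefficients below are made up for illustration):

```python
import random

# Hypothetical CLG CPD P(Y | U, X1, X2): each discrete parent value u selects
# an intercept a0, linear coefficients, and a variance.
clg = {
    "u0": {"a0": 1.0,  "coeffs": [0.5, -0.2], "var": 0.25},
    "u1": {"a0": -2.0, "coeffs": [1.5, 0.0],  "var": 1.00},
}

def sample_y(u, x):
    """Sample Y ~ N(a0 + sum_i a_i x_i, var) for the regime chosen by u."""
    params = clg[u]
    mean = params["a0"] + sum(a * xi for a, xi in zip(params["coeffs"], x))
    return random.gauss(mean, params["var"] ** 0.5)

print(sample_y("u0", [2.0, 1.0]))   # one draw from N(1.8, 0.25)
```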

45
Hybrid Models
  • Continuous parents for discrete children
  • Threshold models
  • Linear sigmoid: P(y1 | x1, ..., xn) = σ(w0 + Σ_i wi xi),
    where σ(z) = 1 / (1 + e^(-z))

46
Undirected Graphical Models
  • Useful when edge directionality cannot be
    assigned
  • Simpler interpretation of structure
  • Simpler inference
  • Simpler independency structure
  • Harder to learn
  • We will also see models with combined directed
    and undirected edges
  • Some computations require restriction to discrete
    variables

47
Undirected Model (Informal)
  • Nodes correspond to random variables
  • Local factor models are attached to sets of nodes
  • Factor elements are positive
  • Do not have to sum to 1
  • Represent affinities

[Figure: network with edges A–B, A–C, C–D, B–D]

π1[A, C]:
  A   C   π1
  a0  c0  4
  a0  c1  12
  a1  c0  2
  a1  c1  9

π2[A, B]:
  A   B   π2
  a0  b0  30
  a0  b1  5
  a1  b0  1
  a1  b1  10

π3[C, D]:
  C   D   π3
  c0  d0  30
  c0  d1  5
  c1  d0  1
  c1  d1  10

π4[B, D]:
  B   D   π4
  b0  d0  100
  b0  d1  1
  b1  d0  1
  b1  d1  1000
48
Undirected Model (Informal)
  • Represents a joint distribution
  • Unnormalized factor product:
    P'(a, b, c, d) = π1[a, c] · π2[a, b] · π3[c, d] · π4[b, d]
  • Partition function: Z = Σ_{a,b,c,d} P'(a, b, c, d)
  • Probability: P(a, b, c, d) = P'(a, b, c, d) / Z
  • As Markov networks represent joint distributions,
    they can be used for answering queries (see the sketch below)

[Figure: network with edges A–B, A–C, C–D, B–D]
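A sketch that turns the four factors from the previous slide into an actual joint distribution: multiply them for the unnormalized measure, sum over all assignments for the partition function Z, then normalize:

```python
from itertools import product

# Factors from the slide, indexed by variable assignments 0/1.
f_ac = {(0, 0): 4,   (0, 1): 12, (1, 0): 2, (1, 1): 9}       # pi1[A, C]
f_ab = {(0, 0): 30,  (0, 1): 5,  (1, 0): 1, (1, 1): 10}      # pi2[A, B]
f_cd = {(0, 0): 30,  (0, 1): 5,  (1, 0): 1, (1, 1): 10}      # pi3[C, D]
f_bd = {(0, 0): 100, (0, 1): 1,  (1, 0): 1, (1, 1): 1000}    # pi4[B, D]

def unnormalized(a, b, c, d):
    return f_ac[a, c] * f_ab[a, b] * f_cd[c, d] * f_bd[b, d]

Z = sum(unnormalized(*x) for x in product((0, 1), repeat=4))  # partition function
prob = {x: unnormalized(*x) / Z for x in product((0, 1), repeat=4)}
print(Z, prob[(0, 0, 0, 0)])   # e.g. P(A=0, B=0, C=0, D=0)
```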
49
Markov Network Structure
  • Undirected graph H
  • Nodes X1, ..., Xn represent random variables
  • H encodes independence assumptions:
  • A path X1 – ... – Xk is active if none of the Xi
    variables along the path are observed
  • X and Y are separated in H given Z if there is no
    active path between any node x ∈ X and any node y ∈ Y
    given Z
  • Denoted sep_H(X; Y | Z)

Global Markov assumptions: I(H) = {(X ⊥ Y | Z) :
sep_H(X; Y | Z)}
50
Relationship with Bayesian Network
  • Can all independencies encoded by Markov networks
    be encoded by Bayesian networks?
  • No; e.g., Ind(A; B | C, D) and Ind(C; D | A, B) in the
    4-cycle example above
  • Can all independencies encoded by Bayesian
    networks be encoded by Markov networks?
  • No; e.g., immoral v-structures (explaining away)
  • Markov networks encode monotonic independencies:
  • If sep_H(X; Y | Z) and Z ⊆ Z' then sep_H(X; Y | Z')

51
Markov Network Factors
  • A factor is a function from value assignments of
    a set of random variables D to positive real
    numbers ℝ+
  • The set of variables D is the scope of the factor
  • Factors generalize the notion of CPDs
  • Every CPD is a factor (with additional
    constraints)

52
Markov Network Factors
  • Can we represent any joint distribution by using
    only factors that are defined on edges?
  • No!
  • Example: n binary variables
  • Joint distribution has 2^n - 1 independent
    parameters
  • Markov network with edge factors has at most
    4 · n(n-1)/2 = O(n^2) parameters

53
Markov Network Distribution
  • A distribution P factorizes over H if it has
  • A set of subsets D1, ..., Dm where each Di is a
    complete subgraph (clique) in H
  • Factors π1[D1], ..., πm[Dm] such that
    P(X1, ..., Xn) = (1/Z) · π1[D1] · ... · πm[Dm]
  • where Z = Σ_{X1,...,Xn} π1[D1] · ... · πm[Dm] is called
    the partition function
  • P is also called a Gibbs distribution over H
54
Relationship with Bayesian Network
  • Bayesian Networks
  • Semantics defined via local Markov assumptions
  • Global independencies induced by d-separation
  • Local and global independencies are equivalent
    (each implies the other)
  • Markov Networks
  • Semantics defined via global separation property
  • Can we define the induced local independencies?
  • We show two definitions
  • All three definitions (global and two local) are
    equivalent only for positive distributions P

55
Local Structure
  • Factor graphs still encode complete tables
  • Goal: as in Bayesian networks, represent
    context-specificity
  • A feature φ[D] on variables D is an indicator
    function: for some y ∈ Val(D), φ[D](d) = 1 if d = y
    and 0 otherwise
  • A distribution P is a log-linear model over H if
    it has
  • Features φ1[D1], ..., φk[Dk] where each Di is a
    subclique in H
  • A set of weights w1, ..., wk such that
    P(X1, ..., Xn) = (1/Z) exp(-Σ_i wi φi[Di])
    (see the sketch below)
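A minimal sketch of a log-linear model over three binary variables, using the sign convention above; the features and weights are made up:

```python
from itertools import product
from math import exp

# Hypothetical indicator features over binary variables (A, B, C), with weights.
features = [
    (lambda a, b, c: a == b, 2.0),     # weight 2.0 on "A agrees with B"
    (lambda a, b, c: b != c, 0.5),     # weight 0.5 on "B differs from C"
]

def unnormalized(a, b, c):
    # exp(-sum_i w_i * phi_i(assignment)); weights may be negative.
    return exp(-sum(w * f(a, b, c) for f, w in features))

assignments = list(product((0, 1), repeat=3))
Z = sum(unnormalized(*x) for x in assignments)   # partition function
print({x: unnormalized(*x) / Z for x in assignments})
```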

56
Domain Application Vision
  • The image segmentation problem
  • Task: partition an image into distinct parts of
    the scene
  • Example: separate water, sky, background

57
Markov Network for Segmentation
  • Grid-structured Markov network
  • Random variable Xi corresponds to pixel i
  • Domain is {1, ..., K}
  • Value represents the region assigned to pixel i
  • Neighboring pixels are connected in the network
  • Appearance distribution
  • w_ik: extent to which pixel i fits region k
    (e.g., difference from a typical pixel for region
    k)
  • Introduce node potential exp(-w_ik · 1{Xi = k})
  • Edge potentials
  • Encode a contiguity preference via the edge
    potential exp(λ · 1{Xi = Xj}) for λ > 0
    (see the sketch below)
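A toy sketch of these potentials on a 2×2 grid, with made-up appearance penalties; finding the most likely labeling on a real grid needs proper MRF inference, but on a toy grid brute force suffices:

```python
from itertools import product
from math import exp

K = 2                                          # number of regions
pixels = [(0, 0), (0, 1), (1, 0), (1, 1)]      # a toy 2x2 grid
edges = [((0, 0), (0, 1)), ((1, 0), (1, 1)),   # neighboring pixel pairs
         ((0, 0), (1, 0)), ((0, 1), (1, 1))]
lam = 1.0                                      # contiguity strength (lambda > 0)

# Hypothetical appearance penalties w[i][k]: difference of pixel i from region k.
w = {(0, 0): [0.1, 2.0], (0, 1): [0.2, 1.5],
     (1, 0): [1.8, 0.3], (1, 1): [2.2, 0.1]}

def score(assignment):
    """Unnormalized probability of a full labeling {pixel -> region}."""
    node = sum(-w[i][assignment[i]] for i in pixels)                 # node potentials
    edge = sum(lam * (assignment[i] == assignment[j]) for i, j in edges)
    return exp(node + edge)

# Brute-force MAP on the toy grid (real grids require MRF inference algorithms).
best = max(product(range(K), repeat=len(pixels)),
           key=lambda labels: score(dict(zip(pixels, labels))))
print(dict(zip(pixels, best)))
```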

58
Markov Network for Segmentation
  • Solution: inference
  • Find the most likely assignment to the Xi variables

[Figure: 3×4 grid Markov network over pixel variables X11, ..., X34,
with appearance (node) potentials and contiguity (edge) potentials]