Title: PROBABILISTIC GRAPHICAL MODELS
1PROBABILISTIC GRAPHICAL MODELS
- David Madigan
- Rutgers University
- madigan_at_stat.rutgers.edu
2Expert Systems
- Explosion of interest in Expert Systems in the
early 1980s
- Many companies (Teknowledge, IntelliCorp,
Inference, etc.), many IPOs, much media hype
- Ad-hoc uncertainty handling
3Uncertainty in Expert Systems
If A then C (p1); if B then C (p2)
What if both A and B are true?
Then C is true with CF p1 + (p2 × (1 − p1))
"Currently fashionable ad-hoc mumbo jumbo" (A.F.M. Smith)
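The combination rule on this slide can be sketched in a few lines. This is only an illustration of the formula p1 + p2 × (1 − p1); the function name `combine_cf` and the example values 0.6 and 0.5 are assumptions, not part of the original slides.

```python
def combine_cf(p1: float, p2: float) -> float:
    """Combine two certainty factors in [0, 1] that support the same conclusion."""
    return p1 + p2 * (1.0 - p1)


# Hypothetical rule strengths: "If A then C (0.6)" and "If B then C (0.5)".
cf_c = combine_cf(0.6, 0.5)
print(cf_c)
```

The rule is symmetric in p1 and p2 and is algebraically the same as 1 − (1 − p1)(1 − p2), which is what a probabilistic "noisy-OR" would give only under strong independence assumptions that the rule itself never states.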
4Eschewed Probabilistic Approach
- Computationally intractable
- Inscrutable
- Requires vast amounts of data/elicitation
e.g., for n dichotomous variables need $2^n - 1$
probabilities to fully specify the joint
distribution
5Conditional Independence
$X \perp\!\!\!\perp Y \mid Z$
6Conditional Independence
- Suppose A and B are marginally independent:
Pr(A), Pr(B), Pr(C | A, B) × 4 = 6 probabilities
- Suppose A and C are conditionally independent
given B: Pr(A), Pr(B | A) × 2, Pr(C | B) × 2 = 5 probabilities
- A chain with 50 variables requires 99 probabilities
versus $2^{50} - 1$
[Figure: graph on vertices A, B, C with $C \perp\!\!\!\perp A \mid B$]
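The parameter counts on this slide can be checked with a short sketch. It assumes every variable is binary, so a vertex with k parents needs 2^k free probabilities; the function names and the parent-list representation are illustrative choices, not something taken from the slides.

```python
def full_joint_params(n: int) -> int:
    """Free probabilities in an unrestricted joint over n binary variables."""
    return 2 ** n - 1


def adg_params(parents: dict) -> int:
    """Free probabilities when each binary vertex has one table per parent state."""
    return sum(2 ** len(pa) for pa in parents.values())


def chain(n: int) -> dict:
    """Parent lists for the chain X1 -> X2 -> ... -> Xn."""
    return {k: ([k - 1] if k > 1 else []) for k in range(1, n + 1)}


print(adg_params({"A": [], "B": [], "C": ["A", "B"]}))  # A, B marginally independent
print(adg_params(chain(3)))                             # A -> B -> C
print(adg_params(chain(50)), full_joint_params(50))
```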
7Properties of Conditional Independence (Dawid,
1980)
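For any probability measure P and random
variables A, B, and C:
CI 1: $A \perp\!\!\!\perp B \;[P] \Rightarrow B \perp\!\!\!\perp A \;[P]$
CI 2: $A \perp\!\!\!\perp B \cup C \;[P] \Rightarrow A \perp\!\!\!\perp B \;[P]$
CI 3: $A \perp\!\!\!\perp B \cup C \;[P] \Rightarrow A \perp\!\!\!\perp B \mid C \;[P]$
CI 4: $A \perp\!\!\!\perp B$ and $A \perp\!\!\!\perp C \mid B \;[P] \Rightarrow A \perp\!\!\!\perp B \cup C \;[P]$
Some probability measures also satisfy
CI 5: $A \perp\!\!\!\perp B \mid C$ and $A \perp\!\!\!\perp C \mid B \;[P] \Rightarrow A \perp\!\!\!\perp B \cup C \;[P]$
CI5 is satisfied whenever P has a positive joint
probability density with respect to some product
measure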
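A small numerical sketch shows why CI5 needs something like positivity. The example distribution (A = B = C, each value with probability 1/2) and the helper names are assumptions chosen for illustration; the check itself uses the standard identity p(a,b,c) p(c) = p(a,c) p(b,c).

```python
from itertools import product


def marginal(joint, idx):
    """Sum a joint (dict: state tuple -> prob) down to the coordinates in idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_indep(joint, a, b, c=(), tol=1e-12):
    """True if coordinates a are independent of coordinates b given coordinates c."""
    p_abc = marginal(joint, a + b + c)
    p_ac = marginal(joint, a + c)
    p_bc = marginal(joint, b + c)
    p_c = marginal(joint, c)
    values = sorted({v for x in joint for v in x})
    for xa in product(values, repeat=len(a)):
        for xb in product(values, repeat=len(b)):
            for xc in product(values, repeat=len(c)):
                lhs = p_abc.get(xa + xb + xc, 0.0) * p_c.get(xc, 0.0)
                rhs = p_ac.get(xa + xc, 0.0) * p_bc.get(xb + xc, 0.0)
                if abs(lhs - rhs) > tol:
                    return False
    return True


# Degenerate (non-positive) joint over (A, B, C): all three variables are equal.
joint = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
A, B, C = (0,), (1,), (2,)

print(cond_indep(joint, A, B, C))   # premise 1 of CI5
print(cond_indep(joint, A, C, B))   # premise 2 of CI5
print(cond_indep(joint, A, B + C))  # conclusion of CI5
```

Both premises hold for this distribution but the conclusion does not, so CI5 fails here; the joint puts zero mass on six of the eight states.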
8Markov Properties for Undirected Graphs
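(Global) S separates A from B $\Rightarrow A \perp\!\!\!\perp B \mid S$
(Local) $\alpha \perp\!\!\!\perp V \setminus \mathrm{cl}(\alpha) \mid \mathrm{bd}(\alpha)$
(Pairwise) $\alpha \perp\!\!\!\perp \beta \mid V \setminus \{\alpha, \beta\}$ for non-adjacent $\alpha$, $\beta$
(G) $\Rightarrow$ (L) $\Rightarrow$ (P)
[Figure: undirected graph on vertices A, B, C, D, E]
$B \perp\!\!\!\perp \{E, D\} \mid \{A, C\}$ (1)
$\Rightarrow$
$B \perp\!\!\!\perp D \mid \{A, C, E\}$ (2)
To go from (2) to (1) need $E \perp\!\!\!\perp B \mid \{A, C\}$? or CI5
Lauritzen, Dawid, Larsen & Leimer (1990)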
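Graph separation, the condition in the global Markov property, can be tested with a breadth-first search that is not allowed to enter S. The sketch below is a minimal version; the edge list is a hypothetical graph on A, B, C, D, E (the figure itself did not survive in this transcript), chosen so that the boundary of B is {A, C}.

```python
from collections import deque


def make_graph(edges):
    """Build an undirected adjacency map from a list of vertex pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def separates(adj, A, B, S):
    """True if every path from A to B in the undirected graph passes through S."""
    blocked = set(S)
    start = set(A) - blocked
    seen = set(start)
    queue = deque(start)
    while queue:
        v = queue.popleft()
        if v in B:
            return False
        for w in adj[v]:
            if w not in blocked and w not in seen:
                seen.add(w)
                queue.append(w)
    return True


# Hypothetical example graph, not the one from the slide.
g = make_graph([("A", "B"), ("B", "C"), ("A", "D"), ("C", "D"), ("C", "E"), ("D", "E")])

print(separates(g, {"B"}, {"D", "E"}, {"A", "C"}))  # boundary of B separates it from the rest
print(separates(g, {"B"}, {"D"}, {"A"}))            # the path B - C - D is still open
```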
9Factorizations
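A density f is said to factorize according to
G if
$f(x) = \prod_{C \in \mathcal{C}} \psi_C(x_C)$
where the $\psi_C$ are the clique potentials
- cliques are maximally complete subgraphs
Proposition: If f factorizes according to a UG
G, then it also obeys the global Markov
property.
Proof: Let S separate A from B in G
and assume $A \cup B \cup S = V$. Let $\mathcal{C}_A$ be the set of cliques with
non-empty intersection with A. Since S separates
A from B, we must have $B \cap C = \emptyset$ for
all C in $\mathcal{C}_A$. Then
$f(x) = \prod_{C \in \mathcal{C}_A} \psi_C(x_C) \prod_{C \notin \mathcal{C}_A} \psi_C(x_C) = f_1(x_{A \cup S}) \, f_2(x_{B \cup S})$,
so $A \perp\!\!\!\perp B \mid S$.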
10Markov Properties for Acyclic Directed
Graphs (Bayesian Networks)
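(Global) S separates A from B in $(G_{\mathrm{An}(A \cup B \cup S)})^m \Rightarrow A \perp\!\!\!\perp B \mid S$
(Local) $\alpha \perp\!\!\!\perp \mathrm{nd}(\alpha) \setminus \mathrm{pa}(\alpha) \mid \mathrm{pa}(\alpha)$
(G) $\Rightarrow$ (L)
[Figure: graphs showing the vertex sets A, B, and S]
Lauritzen, Dawid, Larsen & Leimer (1990)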
11Factorizations
A density f admits a recursive factorization
according to an ADG G if
$f(x) = \prod_{v \in V} f(x_v \mid x_{\mathrm{pa}(v)})$
ADG Global Markov Property $\iff f(x) = \prod_{v \in V} f(x_v \mid x_{\mathrm{pa}(v)})$
Lemma: If P admits a recursive factorization
according to an ADG G, then P factorizes
according to $G^m$ (and chordal supergraphs of $G^m$)
Lemma: If P admits a recursive factorization
according to an ADG G, and A is an ancestral set
in G, then $P_A$ admits a recursive factorization
according to the subgraph $G_A$
12Factorizations
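$p(A,B,C,D,E,F,G,H,S) = p(A)\,p(C \mid A)\,p(D \mid C)\,p(S \mid D,F)\,p(E \mid S)\,p(F \mid G)\,p(G \mid B)\,p(H \mid S,B)\,p(B)$
$\Rightarrow p(S \mid A,B,C,D,E,F,G,H) \propto p(S \mid D,F)\,p(E \mid S)\,p(H \mid S,B)$
[Figure: the corresponding ADG on vertices A, B, C, D, E, F, G, H, S]
{D, F, E, H, B} is the Markov blanket of S. It
contains the parents of S, the children of S, and
the other parents of the children of S.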
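The Markov blanket can be read off mechanically from the parent sets in the factorization above. The sketch below encodes those parent sets as a dictionary; the representation and the function name `markov_blanket` are illustrative assumptions.

```python
def markov_blanket(parents, v):
    """Parents of v, children of v, and the other parents of v's children."""
    children = [c for c, pa in parents.items() if v in pa]
    blanket = set(parents[v]) | set(children)
    for c in children:
        blanket |= set(parents[c])
    blanket.discard(v)
    return blanket


# Parent sets taken from the factorization on this slide.
parents = {
    "A": [], "B": [],
    "C": ["A"], "D": ["C"],
    "G": ["B"], "F": ["G"],
    "S": ["D", "F"],
    "E": ["S"], "H": ["S", "B"],
}

print(sorted(markov_blanket(parents, "S")))
```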
13Markov Properties for Acyclic Directed
Graphs (Bayesian Networks)
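(Global) S separates A from B in $(G_{\mathrm{An}(A \cup B \cup S)})^m \Rightarrow A \perp\!\!\!\perp B \mid S$
(Local) $\alpha \perp\!\!\!\perp \mathrm{nd}(\alpha) \setminus \mathrm{pa}(\alpha) \mid \mathrm{pa}(\alpha)$
(G) $\Rightarrow$ (L)
- $\{\alpha\} \cup \mathrm{nd}(\alpha)$ is an ancestral set; $\mathrm{pa}(\alpha)$ obviously
separates $\alpha$ from $\mathrm{nd}(\alpha) \setminus \mathrm{pa}(\alpha)$ in $(G_{\mathrm{An}(\{\alpha\} \cup \mathrm{nd}(\alpha))})^m$
(L) $\Rightarrow$ (factorization)
- induction on the number of vertices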
14d-separation
A chain $\pi$ from a to b in an acyclic directed
graph G is said to be blocked by S if it contains
a vertex $\gamma \in \pi$ such that either
- $\gamma \in S$ and arrows of $\pi$ do not meet head to head at $\gamma$, or
- $\gamma \notin S$ nor has $\gamma$ any descendants in S, and arrows
of $\pi$ do meet head to head at $\gamma$
Two subsets A and B are d-separated by S if all chains from A
to B are blocked by S
16d-separation and global markov property
Let A, B, and S be disjoint subsets of a
directed, acyclic graph, G. Then S d-separates A
from B if and only if S separates A from B in
$(G_{\mathrm{An}(A \cup B \cup S)})^m$
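The equivalence stated here suggests one way to test d-separation: restrict to the smallest ancestral set containing A, B and S, moralize (marry parents and drop directions), then check ordinary separation. The following is a minimal sketch under that reading; the parent-list representation, function names, and example graphs are assumptions for illustration.

```python
from collections import deque


def ancestral_set(parents, nodes):
    """Smallest ancestral set containing the given vertices."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen


def moral_graph(parents, keep):
    """Undirected moral graph of the ADG induced on the ancestral set `keep`."""
    adj = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for i, p in enumerate(pa):
            adj[v].add(p)
            adj[p].add(v)
            for q in pa[i + 1:]:  # marry co-parents
                adj[p].add(q)
                adj[q].add(p)
    return adj


def d_separated(parents, A, B, S):
    """True if S separates A from B in the moralized ancestral graph."""
    A, B, S = set(A), set(B), set(S)
    adj = moral_graph(parents, ancestral_set(parents, A | B | S))
    start = A - S
    seen, queue = set(start), deque(start)
    while queue:
        v = queue.popleft()
        if v in B:
            return False
        for w in adj[v]:
            if w not in S and w not in seen:
                seen.add(w)
                queue.append(w)
    return True


collider = {"A": [], "B": [], "C": ["A", "B"]}   # A -> C <- B
chain = {"A": [], "B": ["A"], "C": ["B"]}        # A -> B -> C

print(d_separated(collider, {"A"}, {"B"}, set()))   # head-to-head vertex not in S
print(d_separated(collider, {"A"}, {"B"}, {"C"}))   # conditioning on the collider
print(d_separated(chain, {"A"}, {"C"}, {"B"}))
```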
17UG ADG Intersection
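[Figure: example graphs on vertices A, B, C, D labelled with the independences they encode: $C \perp\!\!\!\perp A \mid B$; $A \perp\!\!\!\perp B \mid \{C, D\}$ and $C \perp\!\!\!\perp D \mid \{A, B\}$; $A \perp\!\!\!\perp C$]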
18UG ADG Intersection
[Figure: diagram of the UG and ADG model classes, with the decomposable models in their intersection]
- UG is decomposable if chordal
- ADG is decomposable if moral
- Decomposable models: closed-form log-linear models
No CI5
19Chordal Graphs and RIP
- Chordal graphs (and only chordal graphs) admit clique orderings
that have the Running Intersection Property
- V,T
- A,L,T
- L,A,B
- S,L,B
- A,B,D
- A,X
[Figure: graph on vertices V, T, L, A, S, X, D, B]
- The intersection of each set with those earlier
in the list is fully contained in a single previous set
- Can compute conditional probabilities (e.g. Pr(X | V)) by
message passing (Lauritzen & Spiegelhalter,
Dawid, Jensen)
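The running intersection property is easy to check directly for a given ordering. The sketch below uses the six cliques listed on this slide; the function name and the deliberately bad reordering are illustrative assumptions.

```python
def has_running_intersection(cliques):
    """True if each clique's overlap with all earlier cliques lies inside one earlier clique."""
    seen = set()
    for i, clique in enumerate(cliques):
        clique = set(clique)
        if i > 0:
            overlap = clique & seen
            if not any(overlap <= set(prev) for prev in cliques[:i]):
                return False
        seen |= clique
    return True


# Clique ordering from the slide.
slide_order = [
    {"V", "T"}, {"A", "L", "T"}, {"L", "A", "B"},
    {"S", "L", "B"}, {"A", "B", "D"}, {"A", "X"},
]
# Same cliques, with {A, X} moved to second place (hypothetical bad ordering).
bad_order = [slide_order[0], slide_order[5]] + slide_order[1:5]

print(has_running_intersection(slide_order))
print(has_running_intersection(bad_order))
```

Note that for the clique {A, B, D} the containing set is {L, A, B}, which is an earlier clique but not the immediately preceding one.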
20Probabilistic Expert System
- Computationally intractable
- Inscrutable
- Requires vast amounts of data/elicitation
- Chordal UG models facilitate fast inference
- ADG models better for expert system applications:
more natural to specify Pr(v | pa(v))
21Factorizations
UG Global Markov Property $\iff f(x) = \prod_{C \in \mathcal{C}} \psi_C(x_C)$
ADG Global Markov Property $\iff f(x) = \prod_{v \in V} f(x_v \mid x_{\mathrm{pa}(v)})$
22Lauritzen-Spiegelhalter Algorithm
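- $\psi(C,S,D) \leftarrow \Pr(S \mid C, D)$
- $\psi(A,E) \leftarrow \Pr(E \mid A)\Pr(A)$
- $\psi(C,E) \leftarrow \Pr(C \mid E)$
- $\psi(F,D,B) \leftarrow \Pr(D \mid F)\Pr(B \mid F)\Pr(F)$
- $\psi(D,B,S) \leftarrow 1$
- $\psi(B,S,G) \leftarrow \Pr(G \mid S,B)$
- $\psi(H,S) \leftarrow \Pr(H \mid S)$
[Figure: two graphs on vertices A, B, C, D, E, F, G, H, S]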
Algorithm is widely deployed in commercial
software
23LS Toy Example
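Pr(C | B) = 0.2, Pr(C | ¬B) = 0.6, Pr(B | A) = 0.5,
Pr(B | ¬A) = 0.1, Pr(A) = 0.7
[Figure: chain on vertices A, B, C, with cliques AB and BC and separator B]
- $\psi(A,B) \leftarrow \Pr(B \mid A)\Pr(A)$
- $\psi(B,C) \leftarrow \Pr(C \mid B)$

Initial potentials

ψ(A,B)    B       ¬B
A         0.35    0.35
¬A        0.03    0.27

ψ(B)      B       ¬B
          1       1

ψ(B,C)    C       ¬C
B         0.2     0.8
¬B        0.6     0.4

Query: Pr(A | C)
Message schedule: AB → BC, then BC → AB

After the message AB → BC

ψ(B)      B       ¬B
          0.38    0.62

ψ(B,C)    C       ¬C
B         0.076   0.304
¬B        0.372   0.248

After entering the evidence C

ψ(B,C)    C       ¬C
B         0.076   0
¬B        0.372   0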
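The toy example can be replayed numerically. The sketch below is one straightforward reading of the two-message schedule for the cliques AB and BC with separator B, using the probabilities given on the slide and assuming the query is Pr(A | C); the dictionary layout and variable names are assumptions, and the final value is computed by the code rather than taken from the slides.

```python
VALS = (True, False)

p_a = 0.7
p_b_given_a = {True: 0.5, False: 0.1}   # Pr(B | A), Pr(B | not A)
p_c_given_b = {True: 0.2, False: 0.6}   # Pr(C | B), Pr(C | not B)


def bern(p, x):
    """Probability of outcome x for a binary event with Pr(True) = p."""
    return p if x else 1.0 - p


# Initial potentials: psi(A,B) = Pr(B|A) Pr(A), psi(B,C) = Pr(C|B), psi(B) = 1.
psi_ab = {(a, b): bern(p_a, a) * bern(p_b_given_a[a], b) for a in VALS for b in VALS}
psi_bc = {(b, c): bern(p_c_given_b[b], c) for b in VALS for c in VALS}
psi_b = {b: 1.0 for b in VALS}

# Message AB -> BC: marginalize A out of psi(A,B), rescale psi(B,C).
new_b = {b: sum(psi_ab[(a, b)] for a in VALS) for b in VALS}
psi_bc = {(b, c): psi_bc[(b, c)] * new_b[b] / psi_b[b] for b in VALS for c in VALS}
psi_b = new_b

# Enter the evidence C = True by zeroing the entries with C = False.
psi_bc_ev = {(b, c): (v if c else 0.0) for (b, c), v in psi_bc.items()}

# Message BC -> AB: marginalize C out, rescale psi(A,B) by the separator ratio.
new_b_ev = {b: sum(psi_bc_ev[(b, c)] for c in VALS) for b in VALS}
psi_ab_ev = {(a, b): psi_ab[(a, b)] * new_b_ev[b] / psi_b[b] for a in VALS for b in VALS}

pr_c = sum(psi_ab_ev.values())                                  # normalizing constant Pr(C)
pr_a_given_c = sum(psi_ab_ev[(True, b)] for b in VALS) / pr_c

print(psi_b, psi_bc)
print(pr_c, pr_a_given_c)
```

After the first message the clique BC holds the joint Pr(B, C); after the second, the clique AB holds Pr(A, B, C = true), so normalizing it gives the posterior over A and B.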
24Other Theoretical Developments
Do the UG and ADG global Markov properties
identify all the conditional independences
implied by the corresponding factorizations? Yes.
Completeness for ADGs by Geiger and Pearl
(1988); for UGs by Frydenberg (1988)
Graphical characterization of collapsibility in
hierarchical log-linear models (Asmussen and
Edwards, 1983)
25Collapsibility
Clinic A
Care    Survival: No    Survival: Yes    % not surviving
Less    3               176              1.7
More    4               293              1.4

Clinic B
Care    Survival: No    Survival: Yes    % not surviving
Less    17              197              7.9
More    2               23               8.0

Pooled
Care    Survival: No    Survival: Yes    % not surviving
Less    20              373              5.1
More    6               316              1.9
26Collapsibility
[Figure: graph on vertices Surv., Clinic, Care]
Theorem: A graphical log-linear model L is
collapsible onto A iff the boundary of every connected component
of $A^c$ is complete.
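The clinic data from the preceding slide give a concrete picture of what fails when a table is not collapsible. The sketch below compares the care/survival odds ratio within each clinic with the odds ratio in the pooled table. The counts are those on the slide; which 2×2 table belongs to which clinic is an assumption (the labels were scrambled in this transcript), and the function name is illustrative.

```python
def odds_ratio(less_no, less_yes, more_no, more_yes):
    """Odds of not surviving under less care relative to more care."""
    return (less_no * more_yes) / (less_yes * more_no)


# (less care: no, yes; more care: no, yes); clinic labels are assumed.
clinic_a = (3, 176, 4, 293)
clinic_b = (17, 197, 2, 23)
pooled = tuple(x + y for x, y in zip(clinic_a, clinic_b))

print(pooled)
print(odds_ratio(*clinic_a), odds_ratio(*clinic_b), odds_ratio(*pooled))
```

Within each clinic the odds ratio is close to 1, while the pooled table shows an odds ratio near 2.8, so the care/survival association cannot be read from the table collapsed over clinic.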
27Bayesian Learning for Discrete ADGs
- Example three binary variables
- Five parameters
28Local and Global Independence
29Bayesian learning
Consider a particular state pa(v) of pa(v)
30Equivalence Classes and Chain Graphs
- ADG models for a fixed set of vertices decompose
into Markov equivalence classes
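$A \perp\!\!\!\perp C \mid B$
$A \perp\!\!\!\perp D \mid \{B, C\}$, $B \perp\!\!\!\perp C \mid A$
$A \perp\!\!\!\perp D \mid \{B, C\}$, $B \perp\!\!\!\perp C$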
31Why is this a problem?
- Repeating analyses for equivalent ADGs leads to
significant computational inefficiencies.
- Ensuring that equivalent ADGs have equal
posterior probabilities imposes severe
constraints on prior distributions (Geiger and
Heckerman, 1995).
- Bayesian model averaging procedures that average
across ADGs assign weights to statistical models
that are proportional to equivalence class sizes.
32Equivalence Class Characterization
Theorem (Verma & Pearl; Glymour et al.; Frydenberg; AMP94): Two ADGs are Markov
equivalent iff they have the same skeletons and
the same immoralities.
Definition: The essential graph $D^*$ associated with
D is the graph $D^* = \bigcup \{ D' : D' \sim D \}$, the union of all ADGs Markov equivalent to D.
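The skeleton-and-immoralities criterion translates directly into a test for Markov equivalence. The sketch below assumes both ADGs are given as parent lists over the same vertex set; the function names and the three-vertex examples are illustrative, not from the slides.

```python
from itertools import combinations


def skeleton(parents):
    """Undirected edge set of an ADG given as vertex -> list of parents."""
    return {frozenset((p, v)) for v, pa in parents.items() for p in pa}


def immoralities(parents):
    """Configurations p -> v <- q in which p and q are not adjacent."""
    skel = skeleton(parents)
    found = set()
    for v, pa in parents.items():
        for p, q in combinations(sorted(pa), 2):
            if frozenset((p, q)) not in skel:
                found.add((frozenset((p, q)), v))
    return found


def markov_equivalent(d1, d2):
    """Same skeleton and same immoralities (same vertex set assumed)."""
    return skeleton(d1) == skeleton(d2) and immoralities(d1) == immoralities(d2)


chain = {"A": [], "B": ["A"], "C": ["B"]}          # A -> B -> C
reverse = {"C": [], "B": ["C"], "A": ["B"]}        # A <- B <- C
fork = {"B": [], "A": ["B"], "C": ["B"]}           # A <- B -> C
collider = {"A": [], "C": [], "B": ["A", "C"]}     # A -> B <- C

print(markov_equivalent(chain, reverse), markov_equivalent(chain, fork))
print(markov_equivalent(chain, collider))
```

The first three graphs share the skeleton A - B - C and have no immoralities, so they form one equivalence class; the collider has the same skeleton but one immorality.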
33Essential Graphs, AMP (1995)
- Essential graphs are chain graphs
- $D^*$ is the unique smallest chain graph Markov
equivalent to D
- A graph G = (V, E) is equal to $D^*$ for some ADG D
if and only if G satisfies the following four
conditions:
(i) G is a chain graph
(ii) For every chain component $\tau$ of G, $G_\tau$ is chordal
(iii) The configuration $a \to b - c$ does not occur as an induced
subgraph of G
(iv) Every arrow $a \to b \in G$ is
strongly protected in G
also Meek (1995) and Chickering (1995)
34What's a Chain Graph?
Equivalence a b iff a
b
35Chain Graphs
[Figure: diagram relating the ADG, UG, CG, and decomposable model classes]
- Chain graph Markov property, Frydenberg (1990)
- Equivalence results (LWF, AMP, Meek, Studeny)
[Figure: graph on vertices A, B, C, D: $C \perp\!\!\!\perp D \mid \{A, B\}$ or $C \perp\!\!\!\perp D \mid A$?]
Cox & Wermuth (1996)