Title: Finding Optimal Bayesian Networks with Greedy Search
1. Finding Optimal Bayesian Networks with Greedy Search
2. Outline
- Bayesian-Network Definitions
- Learning
- Greedy Equivalence Search (GES)
- Optimality of GES
3. Bayesian Networks
Use B(S, θ) to represent p(X1, ..., Xn)
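The factorization behind B(S, θ) can be illustrated with a small chain network; the structure, variable names, and CPT values below are hypothetical:

```python
# A minimal sketch of B(S, theta): a DAG structure S plus conditional
# probability tables theta, representing p(X1, ..., Xn).
# The chain X -> Y -> Z and all CPT values are illustrative only.

# Structure S: each variable maps to its list of parents.
parents = {"X": [], "Y": ["X"], "Z": ["Y"]}

# Parameters theta: CPTs keyed by (value, parent values).
cpt = {
    "X": {(1, ()): 0.6, (0, ()): 0.4},
    "Y": {(1, (1,)): 0.9, (0, (1,)): 0.1, (1, (0,)): 0.2, (0, (0,)): 0.8},
    "Z": {(1, (1,)): 0.7, (0, (1,)): 0.3, (1, (0,)): 0.5, (0, (0,)): 0.5},
}

def joint(assignment):
    """p(x1, ..., xn) = product over i of p(xi | Par(xi))."""
    prob = 1.0
    for var, pars in parents.items():
        pa_vals = tuple(assignment[p] for p in pars)
        prob *= cpt[var][(assignment[var], pa_vals)]
    return prob

print(joint({"X": 1, "Y": 1, "Z": 1}))  # 0.6 * 0.9 * 0.7 = 0.378
```

The joint sums to 1 over all assignments by construction, since each CPT row is a proper conditional distribution.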
4. Markov Conditions
From the factorization: I(X, ND | Par(X)), i.e. each variable X is independent of its non-descendants ND given its parents Par(X).
[Diagram: node X with its parents Par(X), non-descendants ND, and descendants Desc]
Markov Conditions plus the Graphoid Axioms characterize all independencies.
5. Structure/Distribution Inclusion
[Diagram: the space of all distributions, with p inside the region defined by structure S over X, Y, Z]
- p is included in S if there exists θ such that B(S, θ) defines p
6. Structure/Structure Inclusion (T ≤ S)
[Diagram: the distributions included in T (over X, Y, Z) as a subset of those included in S]
- T is included in S if every p included in T is included in S (S is an I-map of T)
7. Structure/Structure Equivalence (T ≡ S)
[Diagram: T and S over X, Y, Z including exactly the same set of distributions]
Reflexive, Symmetric, Transitive
8. Equivalence
[Diagram: two DAGs over A, B, C, D with the same skeleton and the same v-structures]
Theorem (Verma and Pearl, 1990): S ≡ T if and only if S and T have the same skeletons and the same v-structures.
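The theorem gives a direct equivalence test: compare skeletons and v-structures. A minimal sketch, assuming DAGs represented as parent dicts (the graphs are illustrative):

```python
# Verma-Pearl criterion: two DAGs are equivalent iff they share the
# same skeleton and the same v-structures.

def skeleton(parents):
    """Undirected adjacencies of the DAG."""
    return {frozenset((x, p)) for x, ps in parents.items() for p in ps}

def v_structures(parents):
    """Triples a -> y <- b with a and b non-adjacent."""
    vs = set()
    skel = skeleton(parents)
    for y, ps in parents.items():
        for a in ps:
            for b in ps:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, y, b))
    return vs

def equivalent(g, h):
    return skeleton(g) == skeleton(h) and v_structures(g) == v_structures(h)

# A -> B -> C vs. A <- B <- C: same skeleton, no v-structures => equivalent.
g = {"A": [], "B": ["A"], "C": ["B"]}
h = {"A": ["B"], "B": ["C"], "C": []}
print(equivalent(g, h))  # True

# A -> B <- C has a v-structure, so it is not equivalent to the chain.
k = {"A": [], "B": ["A", "C"], "C": []}
print(equivalent(g, k))  # False
```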
9. Learning Bayesian Networks
[Diagram: generative distribution p over X, Y, Z; observed data (iid samples of X, Y, Z); learned model]
- Learn the structure
- Estimate the conditional distributions
10. Learning Structure
- Scoring criterion
  - F(D, S)
- Search procedure
  - Identify one or more structures with high values for the scoring function
11. Properties of Scoring Criteria
- Consistent
- Locally Consistent
- Score Equivalent
12. Consistent Criterion
The criterion favors (in the limit) the simplest model that includes the generative distribution p:
- S includes p, T does not include p ⇒ F(S,D) > F(T,D)
- Both include p, S has fewer parameters ⇒ F(S,D) > F(T,D)
13. Locally Consistent Criterion
S and T differ by one edge:
[Diagram: S and T over X, Y, differing by the single edge Y → X]
If I(X, Y | Par(X)) in p, then F(S,D) > F(T,D); otherwise F(S,D) < F(T,D).
14. Score-Equivalent Criterion
[Diagram: S = X → Y, T = X ← Y]
S ≡ T ⇒ F(S,D) = F(T,D)
15. Bayesian Criterion (consistent, locally consistent, and score equivalent)
- S^h: the hypothesis that the generative distribution p has the same independence constraints as S
- F_Bayes(S,D) = log p(S^h | D) = k + log p(D | S^h) + log p(S^h)
  - log p(S^h): structure prior (e.g. prefer simple structures)
  - log p(D | S^h): marginal likelihood (closed form with suitable assumptions)
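The marginal-likelihood term is commonly approximated by BIC (log-likelihood minus a parameter penalty), which decomposes over families. A minimal sketch for binary data; the data set and the two candidate structures are illustrative:

```python
import math
from collections import Counter

# BIC approximation to log p(D | S^h) for binary variables:
# log-likelihood minus (log N / 2) * #parameters, summed per family
# (variable, parent set), since the score decomposes.

def family_bic(data, var, parents):
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[var]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximum-likelihood log-likelihood of this family.
    ll = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    num_params = 2 ** len(parents)  # one free parameter per parent config
    return ll - 0.5 * math.log(n) * num_params

def bic(data, structure):
    return sum(family_bic(data, v, ps) for v, ps in structure.items())

# Toy data: Y copies X with 20% noise; X -> Y should beat the empty graph.
data = [{"X": 0, "Y": 0}] * 40 + [{"X": 1, "Y": 1}] * 40 + \
       [{"X": 0, "Y": 1}] * 10 + [{"X": 1, "Y": 0}] * 10
print(bic(data, {"X": [], "Y": ["X"]}) > bic(data, {"X": [], "Y": []}))  # True
```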
16. Search Procedure
- Set of states
- Representation for the states
- Operators to move between states
- Systematic Search Algorithm
17. Greedy Equivalence Search
- Set of states
  - Equivalence classes of DAGs
- Representation for the states
  - Essential graphs
- Operators to move between states
  - Forward and backward operators
- Systematic search algorithm
  - Two-phase greedy
18. Representation: Essential Graphs
[Diagram: a DAG over A, B, C, D, E, F and its essential graph; compelled edges are drawn directed, reversible edges undirected]
19. GES Operators
- Forward direction: single-edge additions
- Backward direction: single-edge deletions
20. Two-Phase Greedy Algorithm
- Phase 1: Forward Equivalence Search (FES)
  - Start with the all-independence model
  - Run greedy search using the forward operators
- Phase 2: Backward Equivalence Search (BES)
  - Start with the local max from FES
  - Run greedy search using the backward operators
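The two phases can be sketched as one generic greedy loop run with two different neighbor generators. The score function and the edge-set toy at the bottom are illustrative stand-ins, not the real essential-graph machinery:

```python
# Sketch of the two-phase greedy loop of GES. `score` and the neighbor
# generators are abstract hooks; the toy instantiation (edge sets scored
# against a hypothetical target skeleton) is illustrative only.

def greedy(state, neighbors, score):
    """Hill-climb: move to the best-scoring neighbor until no improvement."""
    while True:
        best = max(neighbors(state), key=score, default=None)
        if best is None or score(best) <= score(state):
            return state  # local maximum
        state = best

def ges(empty_state, forward_neighbors, backward_neighbors, score):
    # Phase 1 (FES): start from the all-independence model, greedily add.
    state = greedy(empty_state, forward_neighbors, score)
    # Phase 2 (BES): continue from the FES local max, greedily delete.
    return greedy(state, backward_neighbors, score)

# Toy instantiation: states are frozensets of edges.
pool = {("A", "B"), ("B", "C"), ("A", "C")}
target = frozenset({("A", "B"), ("B", "C")})
score = lambda s: -len(s ^ target)          # reward matching the target
forward = lambda s: [s | {e} for e in pool - s]
backward = lambda s: [s - {e} for e in s]

print(ges(frozenset(), forward, backward, score) == target)  # True
```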
21. Forward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge additions (acyclic)
- Take the union of the resulting equivalence classes
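At the DAG level, the forward operators rest on enumerating single-edge additions that preserve acyclicity. A minimal sketch, assuming DAGs as parent dicts (real GES generates neighbors of the equivalence class, not of one DAG):

```python
from itertools import permutations

# Enumerate all single-edge additions to a DAG that keep it acyclic.

def has_path(parents, src, dst):
    """True if dst is reachable from src via directed edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(c for c, ps in parents.items() if node in ps)
    return False

def single_edge_additions(parents):
    for x, y in permutations(parents, 2):
        # Skip existing adjacencies; adding x -> y is safe iff no path y ~> x.
        if x not in parents[y] and y not in parents[x] and not has_path(parents, y, x):
            new = {v: list(ps) for v, ps in parents.items()}
            new[y].append(x)  # add x -> y
            yield new

dag = {"A": [], "B": ["A"], "C": ["B"]}  # A -> B -> C
additions = list(single_edge_additions(dag))
print(len(additions))  # 1: only A -> C (C -> A would create a cycle)
```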
22. Forward-Operators Example
[Diagram: the current state, all DAGs it contains, all DAGs resulting from a single-edge addition, and the union of the corresponding essential graphs]
23. Forward-Operators Example (continued)
[Diagram only]
24. Backward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge deletions
- Take the union of the resulting equivalence classes
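The backward counterpart enumerates single-edge deletions; no acyclicity check is needed, since deleting an edge from a DAG cannot create a cycle. A sketch under the same illustrative parent-dict representation:

```python
# Enumerate all single-edge deletions at the DAG level.

def single_edge_deletions(parents):
    for y, ps in parents.items():
        for x in ps:
            new = {v: list(q) for v, q in parents.items()}
            new[y].remove(x)  # delete x -> y
            yield new

dag = {"A": [], "B": ["A"], "C": ["B"]}  # A -> B -> C
print(len(list(single_edge_deletions(dag))))  # 2: drop A -> B or B -> C
```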
25. Backward-Operators Example
[Diagram: the current state, all DAGs it contains (over A, B, C), all DAGs resulting from a single-edge deletion, and the union of the corresponding essential graphs]
26. Backward-Operators Example (continued)
[Diagram only]
27. DAG-Perfect
- DAG-perfect distribution p
  - There exists a DAG G such that I(X, Y | Z) in p ⟺ I(X, Y | Z) in G
Non-DAG-perfect distribution q:
[Diagram: candidate structures over A, B, C, D]
q satisfies I(A, D | B, C) and I(B, C | A, D), but no single DAG encodes exactly this pair of independencies.
28. DAG-Perfect Consequence: the Composition Axiom Holds in p
If ¬I(X, Y | Z) for a set of variables Y, then ¬I(X, Y' | Z) for some singleton Y' ∈ Y.
[Diagram: node X and the set Y = {A, B, C, D}]
29. Optimality of GES
If p is DAG-perfect with respect to some G:
[Diagram: G generates p; n iid samples of X, Y, Z are drawn; GES is run on the data and outputs an equivalence class S̃]
For large n, S̃ = S, where S is the equivalence class containing G.
30. Optimality of GES
[Diagram: all-independence model → FES → state includes S → BES → state equals S]
- Proof outline
  - After the first phase (FES), the current state includes S
  - After the second phase (BES), the current state equals S
31. FES Maximum Includes S
Assume the local max does NOT include S, and take any DAG G from S.
The Markov conditions characterize the independencies: in p, there exists a node X that is not independent of its non-descendants given its parents.
⇒ ¬I(X, {A, B, C, D} | E) in p
[Diagram: node X with parent E and non-descendants A, B, C, D]
p is DAG-perfect ⇒ the composition axiom holds
⇒ ¬I(X, C | E) in p
[Diagram: the same graph, singling out C]
By local consistency, adding the C → X edge improves the score, and the resulting equivalence class is a neighbor.
32. BES Identifies S
- The current state always includes S
  - By local consistency of the criterion
- The local maximum is S
  - By Meek's conjecture
33. Meek's Conjecture
For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of
- (1) covered edge reversals in G
- (2) single-edge additions to G
such that
- after each change, G ≤ H
- after all changes, G = H
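The covered-edge test used by the conjecture is simple: X → Y is covered in G exactly when Par(Y) = Par(X) ∪ {X}, which is why reversing it preserves equivalence. A minimal sketch with an illustrative chain DAG:

```python
# An edge x -> y is covered when Par(y) = Par(x) ∪ {x}.
# Reversing a covered edge yields an equivalent DAG, so Meek sequences
# may use covered reversals freely.

def is_covered(parents, x, y):
    return set(parents[y]) == set(parents[x]) | {x}

# In A -> B -> C: A -> B is covered (Par(B) = {A} = Par(A) ∪ {A}),
# but B -> C is not (Par(C) = {B} while Par(B) ∪ {B} = {A, B}).
dag = {"A": [], "B": ["A"], "C": ["B"]}
print(is_covered(dag, "A", "B"), is_covered(dag, "B", "C"))  # True False
```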
34. Meek's Conjecture: Example
[Diagram: H over A, B, C, D with I(A, B) and I(C, B | A, D); a sequence of DAGs G transformed into H by covered edge reversals and single-edge additions]
35. Meek's Conjecture and BES
Assume the local max S̃ is NOT S. Take any DAG H from S̃ and any DAG G from S.
[Diagram: a Meek sequence of additions and covered reversals transforming G into H]
36. Meek's Conjecture and BES (continued)
[Diagram: the same sequence run in reverse: deletions and covered reversals transforming H back toward G]
37. Meek's Conjecture and BES (continued)
[Diagram: the first deletion in the reversed sequence gives a state that still includes S and is a neighbor of S̃ in BES]
38. Discussion Points
- In practice, GES is as fast as DAG-based search
  - The neighborhood of essential graphs can be generated and scored very efficiently
- When the DAG-perfect assumption fails, we still get optimality guarantees
  - As long as composition holds in the generative distribution, the local maximum is inclusion-minimal
39. Thanks!
- My home page
  - http://research.microsoft.com/dmax
- Relevant papers
  - Optimal Structure Identification with Greedy Search
    - JMLR submission; contains detailed proofs of Meek's conjecture and the optimality of GES
  - Finding Optimal Bayesian Networks
    - UAI '02 paper with Chris Meek; contains an extension of the optimality results of GES when p is not DAG-perfect
41. Bayesian Criterion is Locally Consistent
- The Bayesian score approaches BIC plus a constant
- BIC is decomposable
- The difference in score is the same for any DAGs that differ by a Y → X edge, provided X has the same parents
[Diagram: X, Y with and without the edge; the complete network always includes p]
42. Bayesian Criterion is Consistent
- Assume the conditionals are
  - unconstrained multinomials, or
  - linear regressions
- Geiger, Heckerman, King and Meek (2001): these network structures are curved exponential models
- Haughton (1988): for curved exponential models, the Bayesian criterion is consistent
43. Bayesian Criterion is Score Equivalent
S ≡ T ⇒ F(S,D) = F(T,D)
[Diagram: S = X → Y, T = X ← Y]
- S^h: no independence constraints
- T^h: no independence constraints
- Hence S^h = T^h
44. Active Paths
- Z-active path between X and Y (non-standard)
  - Neither X nor Y is in Z
  - Every pair of colliding edges meets at a member of Z
  - No other pair of edges meets at a member of Z
[Diagram: a path from X to Y through a member of Z]
G ≤ H ⇒ if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H.
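On small graphs the definition above can be checked directly by enumerating simple paths in the skeleton. A minimal sketch, assuming DAGs as parent dicts; the collider-membership test follows the slide's non-standard definition (colliders must be members of Z, not merely have descendants in Z):

```python
# Check for a Z-active path between x and y, per the slide's definition:
# every collider on the path lies in z; every other internal node does not.

def z_active_path_exists(parents, x, y, z):
    if x in z or y in z:
        return False
    e = {(p, c) for c, ps in parents.items() for p in ps}  # directed edges
    adj = {v: set() for v in parents}
    for p, c in e:
        adj[p].add(c)
        adj[c].add(p)

    def extend(path):
        last = path[-1]
        if last == y:
            # Verify the collider condition at each internal node.
            for i in range(1, len(path) - 1):
                a, b, c = path[i - 1], path[i], path[i + 1]
                collider = (a, b) in e and (c, b) in e  # a -> b <- c
                if collider != (b in z):
                    return False
            return True
        return any(extend(path + [n]) for n in adj[last] if n not in path)

    return extend([x])

# X -> Z <- Y: active given {Z} (collider observed), blocked given {}.
dag = {"X": [], "Y": [], "Z": ["X", "Y"]}
print(z_active_path_exists(dag, "X", "Y", {"Z"}))  # True
print(z_active_path_exists(dag, "X", "Y", set()))  # False
```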
45. Active Paths
[Diagram: a Z-active path X ... A ... W ... B ... Y with A, B ∈ Z]
- X-Y: out of X and into Y
- X-W: out of both X and W
- Any sub-path between A, B ∈ Z is also active
- Given active paths A-B and B-C, at least one of which is out of B, there exists an active path between A and C
46. Simple Active Paths
If an active path between A and B contains the edge Y → X, then there exists an active path in which either
(1) the edge appears exactly once, or
(2) the edge appears exactly twice.
[Diagram: the two cases]
To simplify the discussion, assume (1) only; the proofs for (2) are almost identical.
47. Typical Argument: Combining Active Paths
[Diagram: G and H over X, Y, A, B, where Z is a sink node adjacent to X and Y]
Suppose there is an active path in G (with X not in the conditioning set) that has no corresponding active path in H. Then Z is not in the conditioning set.
48. Proof Sketch
- Two DAGs G, H with G < H
- Identify either
  - a covered edge X → Y in G that has the opposite orientation in H, or
  - a new edge X → Y to be added to G such that G remains included in H
49. The Transformation
Choose any node Y that is a sink in H.
- Case 1a: Y is a sink in G, and there is an X ∈ Par_H(Y) with X ∉ Par_G(Y)
- Case 1b: Y is a sink in G with the same parents in both graphs
- Case 2a: there exists X such that Y → X is covered
- Case 2b: there exists X such that Y → X and some W is a parent of Y but not of X
- Case 2c: for every Y → X, Par(Y) ⊆ Par(X)
[Diagram: the five cases over X, Y, W]
50. Preliminaries (G ≤ H)
- The adjacencies in G are a subset of the adjacencies in H
- If X → Y ← Z is a v-structure in G but not in H, then X and Z are adjacent in H
- Any new active path that results from adding X → Y to G includes X → Y
51. Proof Sketch: Case 1
Y is a sink in G.
Case 1a: X ∈ Par_H(Y), X ∉ Par_G(Y). Add X → Y to G.
[Diagram: G and H over X, Y]
Suppose there is some new active path between A and B that is not in H:
[Diagram: path A ... Z → Y ← X ... B]
- Y is a sink in G, so it must be in the conditioning set
- Neither X nor the next node Z is in the conditioning set
- In H there are active paths AP(A, Z) and AP(X, B), and Z → Y ← X
Case 1b: the parents are identical. Remove Y from both graphs; the proof is similar.
52. Proof Sketch: Case 2
Y is not a sink in G.
Case 2a: there is a covered edge Y → X. Reverse the edge.
Case 2b: there is a non-covered edge Y → X such that W is a parent of Y but not a parent of X.
[Diagram: G and H over W, X, Y]
Suppose there is some new active path between A and B that is not in H:
- Y must be in the conditioning set, else replace W → X by W → Y → X (not new)
- If X is not in the conditioning set, then in H there are active paths A-W and X-B, with W → Y ← X
[Diagram: the combined paths in G and H]
53. Case 2c: The Difficult Case
- All non-covered edges Y → Z have Par(Y) ⊆ Par(Z)
[Diagram: G and H over W1, W2, Y, Z1, Z2]
Adding W1 → Y: G is no longer ≤ H (there is a Z2-active path between W1 and W2). Adding W2 → Y: G ≤ H still holds.
54. Choosing Z
[Diagram: G and H, with the descendants of Y in G highlighted in both]
D is the maximal G-descendant of Y in H. Z is any maximal child of Y such that D is a descendant of Z in G.
55. Choosing Z
[Diagram: G and H]
Descendants of Y in G: Y, Z1, Z2. Maximal descendant in H: D = Z2. Maximal child of Y in G that has D = Z2 as a descendant: Z2.
Add W2 → Y.
56. Difficult Case: Proof Intuition
[Diagram: the active path in G and its counterpart in H, over Y, W, A, B, Z, D]
1. W is not in the conditioning set.
2. Y is not in the conditioning set, else the path is active in H.
3. In G, the subsequent edges must be directed away from Y until B or the conditioning set is reached.
4. In G, neither Z nor its descendants are in the conditioning set, else the path was active before the addition.
5. From (1), (2) and (4), there are active paths (A, D) and (B, D) in H.
6. By the choice of D, there is a directed path from D to B or to the conditioning set in H.
58. Optimality of GES
- Definition
  - p is DAG-perfect wrt G: the independence constraints in p are precisely those in G
- Assumption
  - The generative distribution p is perfect wrt some G defined over the observable variables
- S: the equivalence class containing G
- Under the DAG-perfect assumption, GES results in S
59. Important Definitions
- Bayesian Networks
- Markov Conditions
- Distribution/Structure Inclusion
- Structure/Structure Inclusion