Title: Finding Optimal Bayesian Networks with Greedy Search
1. Finding Optimal Bayesian Networks with Greedy Search
2. Outline
- Bayesian-Network Definitions
- Learning
- Greedy Equivalence Search (GES)
- Optimality of GES
3. Bayesian Networks
Use B(S, θ) to represent p(X1, ..., Xn)
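The factorization behind B(S, θ) can be illustrated with a small chain network; the structure, variable names, and CPT values below are hypothetical:

```python
# A minimal sketch of B(S, theta): a DAG structure S plus conditional
# probability tables theta, representing p(X1, ..., Xn).
# The chain X -> Y -> Z and all CPT values are illustrative only.

# Structure S: each variable maps to its list of parents.
parents = {"X": [], "Y": ["X"], "Z": ["Y"]}

# Parameters theta: CPTs keyed by (value, parent values).
cpt = {
    "X": {(1, ()): 0.6, (0, ()): 0.4},
    "Y": {(1, (1,)): 0.9, (0, (1,)): 0.1, (1, (0,)): 0.2, (0, (0,)): 0.8},
    "Z": {(1, (1,)): 0.7, (0, (1,)): 0.3, (1, (0,)): 0.5, (0, (0,)): 0.5},
}

def joint(assignment):
    """p(x1, ..., xn) = product over i of p(xi | Par(xi))."""
    prob = 1.0
    for var, pars in parents.items():
        pa_vals = tuple(assignment[p] for p in pars)
        prob *= cpt[var][(assignment[var], pa_vals)]
    return prob

print(joint({"X": 1, "Y": 1, "Z": 1}))  # 0.6 * 0.9 * 0.7 = 0.378
```

The joint sums to 1 over all assignments by construction, since each CPT row is a proper conditional distribution.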
4. Markov Conditions
From the factorization: I(X, ND | Par(X)), i.e. each variable X is independent of its non-descendants ND given its parents Par(X).
[Diagram: node X with its parents Par(X), non-descendants ND, and descendants Desc]
Markov Conditions plus the Graphoid Axioms characterize all independencies.
5. Structure/Distribution Inclusion
[Diagram: the space of all distributions, with p inside the region defined by structure S over X, Y, Z]
- p is included in S if there exists θ such that B(S, θ) defines p
6. Structure/Structure Inclusion (T ≤ S)
[Diagram: the distributions included in T (over X, Y, Z) as a subset of those included in S]
- T is included in S if every p included in T is included in S (S is an I-map of T)
7. Structure/Structure Equivalence (T ≡ S)
[Diagram: T and S over X, Y, Z including exactly the same set of distributions]
Reflexive, Symmetric, Transitive
8. Equivalence
[Diagram: two DAGs over A, B, C, D with the same skeleton and the same v-structures]
Theorem (Verma and Pearl, 1990): S ≡ T if and only if S and T have the same skeletons and the same v-structures.
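The theorem gives a direct equivalence test: compare skeletons and v-structures. A minimal sketch, assuming DAGs represented as parent dicts (the graphs are illustrative):

```python
# Verma-Pearl criterion: two DAGs are equivalent iff they share the
# same skeleton and the same v-structures.

def skeleton(parents):
    """Undirected adjacencies of the DAG."""
    return {frozenset((x, p)) for x, ps in parents.items() for p in ps}

def v_structures(parents):
    """Triples a -> y <- b with a and b non-adjacent."""
    vs = set()
    skel = skeleton(parents)
    for y, ps in parents.items():
        for a in ps:
            for b in ps:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, y, b))
    return vs

def equivalent(g, h):
    return skeleton(g) == skeleton(h) and v_structures(g) == v_structures(h)

# A -> B -> C vs. A <- B <- C: same skeleton, no v-structures => equivalent.
g = {"A": [], "B": ["A"], "C": ["B"]}
h = {"A": ["B"], "B": ["C"], "C": []}
print(equivalent(g, h))  # True

# A -> B <- C has a v-structure, so it is not equivalent to the chain.
k = {"A": [], "B": ["A", "C"], "C": []}
print(equivalent(g, k))  # False
```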
9. Learning Bayesian Networks
[Diagram: generative distribution p over X, Y, Z; observed data (iid samples of X, Y, Z); learned model]
- Learn the structure
- Estimate the conditional distributions
10. Learning Structure
- Scoring criterion
  - F(D, S)
- Search procedure
  - Identify one or more structures with high values for the scoring function
11. Properties of Scoring Criteria
- Consistent
- Locally Consistent
- Score Equivalent
12. Consistent Criterion
The criterion favors (in the limit) the simplest model that includes the generative distribution p:
- S includes p, T does not include p ⇒ F(S,D) > F(T,D)
- Both include p, S has fewer parameters ⇒ F(S,D) > F(T,D)
13. Locally Consistent Criterion
S and T differ by one edge:
[Diagram: S and T over X, Y, differing by the single edge Y → X]
If I(X, Y | Par(X)) in p, then F(S,D) > F(T,D); otherwise F(S,D) < F(T,D).
14. Score-Equivalent Criterion
[Diagram: S = X → Y, T = X ← Y]
S ≡ T ⇒ F(S,D) = F(T,D)
15. Bayesian Criterion (consistent, locally consistent, and score equivalent)
- S^h: the hypothesis that the generative distribution p has the same independence constraints as S
- F_Bayes(S,D) = log p(S^h | D) = k + log p(D | S^h) + log p(S^h)
  - log p(S^h): structure prior (e.g. prefer simple structures)
  - log p(D | S^h): marginal likelihood (closed form with suitable assumptions)
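The marginal-likelihood term is commonly approximated by BIC (log-likelihood minus a parameter penalty), which decomposes over families. A minimal sketch for binary data; the data set and the two candidate structures are illustrative:

```python
import math
from collections import Counter

# BIC approximation to log p(D | S^h) for binary variables:
# log-likelihood minus (log N / 2) * #parameters, summed per family
# (variable, parent set), since the score decomposes.

def family_bic(data, var, parents):
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[var]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximum-likelihood log-likelihood of this family.
    ll = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    num_params = 2 ** len(parents)  # one free parameter per parent config
    return ll - 0.5 * math.log(n) * num_params

def bic(data, structure):
    return sum(family_bic(data, v, ps) for v, ps in structure.items())

# Toy data: Y copies X with 20% noise; X -> Y should beat the empty graph.
data = [{"X": 0, "Y": 0}] * 40 + [{"X": 1, "Y": 1}] * 40 + \
       [{"X": 0, "Y": 1}] * 10 + [{"X": 1, "Y": 0}] * 10
print(bic(data, {"X": [], "Y": ["X"]}) > bic(data, {"X": [], "Y": []}))  # True
```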
16. Search Procedure
- Set of states
- Representation for the states
- Operators to move between states
- Systematic Search Algorithm
17. Greedy Equivalence Search
- Set of states
  - Equivalence classes of DAGs
- Representation for the states
  - Essential graphs
- Operators to move between states
  - Forward and backward operators
- Systematic search algorithm
  - Two-phase greedy
18. Representation: Essential Graphs
[Diagram: a DAG over A, B, C, D, E, F and its essential graph; compelled edges are drawn directed, reversible edges undirected]
19. GES Operators
- Forward direction: single-edge additions
- Backward direction: single-edge deletions
20. Two-Phase Greedy Algorithm
- Phase 1: Forward Equivalence Search (FES)
  - Start with the all-independence model
  - Run greedy search using the forward operators
- Phase 2: Backward Equivalence Search (BES)
  - Start with the local max from FES
  - Run greedy search using the backward operators
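The two phases can be sketched as one generic greedy loop run with two different neighbor generators. The score function and the edge-set toy at the bottom are illustrative stand-ins, not the real essential-graph machinery:

```python
# Sketch of the two-phase greedy loop of GES. `score` and the neighbor
# generators are abstract hooks; the toy instantiation (edge sets scored
# against a hypothetical target skeleton) is illustrative only.

def greedy(state, neighbors, score):
    """Hill-climb: move to the best-scoring neighbor until no improvement."""
    while True:
        best = max(neighbors(state), key=score, default=None)
        if best is None or score(best) <= score(state):
            return state  # local maximum
        state = best

def ges(empty_state, forward_neighbors, backward_neighbors, score):
    # Phase 1 (FES): start from the all-independence model, greedily add.
    state = greedy(empty_state, forward_neighbors, score)
    # Phase 2 (BES): continue from the FES local max, greedily delete.
    return greedy(state, backward_neighbors, score)

# Toy instantiation: states are frozensets of edges.
pool = {("A", "B"), ("B", "C"), ("A", "C")}
target = frozenset({("A", "B"), ("B", "C")})
score = lambda s: -len(s ^ target)          # reward matching the target
forward = lambda s: [s | {e} for e in pool - s]
backward = lambda s: [s - {e} for e in s]

print(ges(frozenset(), forward, backward, score) == target)  # True
```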
21. Forward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge additions (acyclic)
- Take the union of the resulting equivalence classes
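At the DAG level, the forward operators rest on enumerating single-edge additions that preserve acyclicity. A minimal sketch, assuming DAGs as parent dicts (real GES generates neighbors of the equivalence class, not of one DAG):

```python
from itertools import permutations

# Enumerate all single-edge additions to a DAG that keep it acyclic.

def has_path(parents, src, dst):
    """True if dst is reachable from src via directed edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(c for c, ps in parents.items() if node in ps)
    return False

def single_edge_additions(parents):
    for x, y in permutations(parents, 2):
        # Skip existing adjacencies; adding x -> y is safe iff no path y ~> x.
        if x not in parents[y] and y not in parents[x] and not has_path(parents, y, x):
            new = {v: list(ps) for v, ps in parents.items()}
            new[y].append(x)  # add x -> y
            yield new

dag = {"A": [], "B": ["A"], "C": ["B"]}  # A -> B -> C
additions = list(single_edge_additions(dag))
print(len(additions))  # 1: only A -> C (C -> A would create a cycle)
```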
22. Forward-Operators Example
[Diagram: the current state, all DAGs it contains, all DAGs resulting from a single-edge addition, and the union of the corresponding essential graphs]
23. Forward-Operators Example (continued)
[Diagram only]
24. Backward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge deletions
- Take the union of the resulting equivalence classes
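The backward counterpart enumerates single-edge deletions; no acyclicity check is needed, since deleting an edge from a DAG cannot create a cycle. A sketch under the same illustrative parent-dict representation:

```python
# Enumerate all single-edge deletions at the DAG level.

def single_edge_deletions(parents):
    for y, ps in parents.items():
        for x in ps:
            new = {v: list(q) for v, q in parents.items()}
            new[y].remove(x)  # delete x -> y
            yield new

dag = {"A": [], "B": ["A"], "C": ["B"]}  # A -> B -> C
print(len(list(single_edge_deletions(dag))))  # 2: drop A -> B or B -> C
```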
25. Backward-Operators Example
[Diagram: the current state, all DAGs it contains (over A, B, C), all DAGs resulting from a single-edge deletion, and the union of the corresponding essential graphs]
26. Backward-Operators Example (continued)
[Diagram only]
27. DAG-Perfect
- DAG-perfect distribution p
  - There exists a DAG G such that I(X, Y | Z) in p ⟺ I(X, Y | Z) in G
Non-DAG-perfect distribution q:
[Diagram: candidate structures over A, B, C, D]
q satisfies I(A, D | B, C) and I(B, C | A, D), but no single DAG encodes exactly this pair of independencies.
28. DAG-Perfect Consequence: the Composition Axiom Holds in p
If ¬I(X, Y | Z) for a set of variables Y, then ¬I(X, Y' | Z) for some singleton Y' ∈ Y.
[Diagram: node X and the set Y = {A, B, C, D}]
29. Optimality of GES
If p is DAG-perfect with respect to some G:
[Diagram: G generates p; n iid samples of X, Y, Z are drawn; GES is run on the data and outputs an equivalence class S̃]
For large n, S̃ = S, where S is the equivalence class containing G.
30. Optimality of GES
[Diagram: all-independence model → FES → state includes S → BES → state equals S]
- Proof outline
  - After the first phase (FES), the current state includes S
  - After the second phase (BES), the current state equals S
31. FES Maximum Includes S
Assume the local max does NOT include S, and take any DAG G from S.
The Markov conditions characterize the independencies: in p, there exists a node X that is not independent of its non-descendants given its parents.
⇒ ¬I(X, {A, B, C, D} | E) in p
[Diagram: node X with parent E and non-descendants A, B, C, D]
p is DAG-perfect ⇒ the composition axiom holds
⇒ ¬I(X, C | E) in p
[Diagram: the same graph, singling out C]
By local consistency, adding the C → X edge improves the score, and the resulting equivalence class is a neighbor.
32. BES Identifies S
- The current state always includes S
  - By local consistency of the criterion
- The local maximum is S
  - By Meek's conjecture
33. Meek's Conjecture
For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of
- (1) covered edge reversals in G
- (2) single-edge additions to G
such that
- after each change, G ≤ H
- after all changes, G = H
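The covered-edge test used by the conjecture is simple: X → Y is covered in G exactly when Par(Y) = Par(X) ∪ {X}, which is why reversing it preserves equivalence. A minimal sketch with an illustrative chain DAG:

```python
# An edge x -> y is covered when Par(y) = Par(x) ∪ {x}.
# Reversing a covered edge yields an equivalent DAG, so Meek sequences
# may use covered reversals freely.

def is_covered(parents, x, y):
    return set(parents[y]) == set(parents[x]) | {x}

# In A -> B -> C: A -> B is covered (Par(B) = {A} = Par(A) ∪ {A}),
# but B -> C is not (Par(C) = {B} while Par(B) ∪ {B} = {A, B}).
dag = {"A": [], "B": ["A"], "C": ["B"]}
print(is_covered(dag, "A", "B"), is_covered(dag, "B", "C"))  # True False
```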
34. Meek's Conjecture: Example
[Diagram: H over A, B, C, D with I(A, B) and I(C, B | A, D); a sequence of DAGs G transformed into H by covered edge reversals and single-edge additions]
35. Meek's Conjecture and BES
Assume the local max S̃ is NOT S. Take any DAG H from S̃ and any DAG G from S.
[Diagram: a Meek sequence of additions and covered reversals transforming G into H]
36. Meek's Conjecture and BES (continued)
[Diagram: the same sequence run in reverse: deletions and covered reversals transforming H back toward G]
37. Meek's Conjecture and BES (continued)
[Diagram: the first deletion in the reversed sequence gives a state that still includes S and is a neighbor of S̃ in BES]
38. Discussion Points
- In practice, GES is as fast as DAG-based search
  - The neighborhood of essential graphs can be generated and scored very efficiently
- When the DAG-perfect assumption fails, we still get optimality guarantees
  - As long as composition holds in the generative distribution, the local maximum is inclusion-minimal
39. Thanks!
- My home page
  - http://research.microsoft.com/dmax
- Relevant papers
  - Optimal Structure Identification with Greedy Search
    - JMLR submission; contains detailed proofs of Meek's conjecture and the optimality of GES
  - Finding Optimal Bayesian Networks
    - UAI '02 paper with Chris Meek; contains an extension of the optimality results of GES when p is not DAG-perfect
41. Bayesian Criterion is Locally Consistent
- The Bayesian score approaches BIC plus a constant
- BIC is decomposable
- The difference in score is the same for any DAGs that differ by a Y → X edge, provided X has the same parents
[Diagram: X, Y with and without the edge; the complete network always includes p]
42. Bayesian Criterion is Consistent
- Assume the conditionals are
  - unconstrained multinomials, or
  - linear regressions
- Geiger, Heckerman, King and Meek (2001): these network structures are curved exponential models
- Haughton (1988): for curved exponential models, the Bayesian criterion is consistent
43. Bayesian Criterion is Score Equivalent
S ≡ T ⇒ F(S,D) = F(T,D)
[Diagram: S = X → Y, T = X ← Y]
- S^h: no independence constraints
- T^h: no independence constraints
- Hence S^h = T^h
44. Active Paths
- Z-active path between X and Y (non-standard)
  - Neither X nor Y is in Z
  - Every pair of colliding edges meets at a member of Z
  - No other pair of edges meets at a member of Z
[Diagram: a path from X to Y through a member of Z]
G ≤ H ⇒ if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H.
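On small graphs the definition above can be checked directly by enumerating simple paths in the skeleton. A minimal sketch, assuming DAGs as parent dicts; the collider-membership test follows the slide's non-standard definition (colliders must be members of Z, not merely have descendants in Z):

```python
# Check for a Z-active path between x and y, per the slide's definition:
# every collider on the path lies in z; every other internal node does not.

def z_active_path_exists(parents, x, y, z):
    if x in z or y in z:
        return False
    e = {(p, c) for c, ps in parents.items() for p in ps}  # directed edges
    adj = {v: set() for v in parents}
    for p, c in e:
        adj[p].add(c)
        adj[c].add(p)

    def extend(path):
        last = path[-1]
        if last == y:
            # Verify the collider condition at each internal node.
            for i in range(1, len(path) - 1):
                a, b, c = path[i - 1], path[i], path[i + 1]
                collider = (a, b) in e and (c, b) in e  # a -> b <- c
                if collider != (b in z):
                    return False
            return True
        return any(extend(path + [n]) for n in adj[last] if n not in path)

    return extend([x])

# X -> Z <- Y: active given {Z} (collider observed), blocked given {}.
dag = {"X": [], "Y": [], "Z": ["X", "Y"]}
print(z_active_path_exists(dag, "X", "Y", {"Z"}))  # True
print(z_active_path_exists(dag, "X", "Y", set()))  # False
```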
45. Active Paths
[Diagram: a Z-active path X ... A ... W ... B ... Y with A, B ∈ Z]
- X-Y: out of X and into Y
- X-W: out of both X and W
- Any sub-path between A, B ∈ Z is also active
- Given active paths A-B and B-C, at least one of which is out of B, there exists an active path between A and C
46. Simple Active Paths
If an active path between A and B contains the edge Y → X, then there exists an active path in which either
(1) the edge appears exactly once, or
(2) the edge appears exactly twice.
[Diagram: the two cases]
To simplify the discussion, assume (1) only; the proofs for (2) are almost identical.
47. Typical Argument: Combining Active Paths
[Diagram: G and H over X, Y, A, B, where Z is a sink node adjacent to X and Y]
Suppose there is an active path in G (with X not in the conditioning set) that has no corresponding active path in H. Then Z is not in the conditioning set.
48. Proof Sketch
- Two DAGs G, H with G < H
- Identify either
  - a covered edge X → Y in G that has the opposite orientation in H, or
  - a new edge X → Y to be added to G such that G remains included in H
49. The Transformation
Choose any node Y that is a sink in H.
- Case 1a: Y is a sink in G, and there is an X ∈ Par_H(Y) with X ∉ Par_G(Y)
- Case 1b: Y is a sink in G with the same parents in both graphs
- Case 2a: there exists X such that Y → X is covered
- Case 2b: there exists X such that Y → X and some W is a parent of Y but not of X
- Case 2c: for every Y → X, Par(Y) ⊆ Par(X)
[Diagram: the five cases over X, Y, W]
50. Preliminaries (G ≤ H)
- The adjacencies in G are a subset of the adjacencies in H
- If X → Y ← Z is a v-structure in G but not in H, then X and Z are adjacent in H
- Any new active path that results from adding X → Y to G includes X → Y
51. Proof Sketch: Case 1
Y is a sink in G.
Case 1a: X ∈ Par_H(Y), X ∉ Par_G(Y). Add X → Y to G.
[Diagram: G and H over X, Y]
Suppose there is some new active path between A and B that is not in H:
[Diagram: path A ... Z → Y ← X ... B]
- Y is a sink in G, so it must be in the conditioning set
- Neither X nor the next node Z is in the conditioning set
- In H there are active paths AP(A, Z) and AP(X, B), and Z → Y ← X
Case 1b: the parents are identical. Remove Y from both graphs; the proof is similar.
52. Proof Sketch: Case 2
Y is not a sink in G.
Case 2a: there is a covered edge Y → X. Reverse the edge.
Case 2b: there is a non-covered edge Y → X such that W is a parent of Y but not a parent of X.
[Diagram: G and H over W, X, Y]
Suppose there is some new active path between A and B that is not in H:
- Y must be in the conditioning set, else replace W → X by W → Y → X (not new)
- If X is not in the conditioning set, then in H there are active paths A-W and X-B, with W → Y ← X
[Diagram: the combined paths in G and H]
53. Case 2c: The Difficult Case
- All non-covered edges Y → Z have Par(Y) ⊆ Par(Z)
[Diagram: G and H over W1, W2, Y, Z1, Z2]
Adding W1 → Y: G is no longer ≤ H (there is a Z2-active path between W1 and W2). Adding W2 → Y: G ≤ H still holds.
54. Choosing Z
[Diagram: G and H, with the descendants of Y in G highlighted in both]
D is the maximal G-descendant of Y in H. Z is any maximal child of Y such that D is a descendant of Z in G.
55. Choosing Z
[Diagram: G and H]
Descendants of Y in G: Y, Z1, Z2. Maximal descendant in H: D = Z2. Maximal child of Y in G that has D = Z2 as a descendant: Z2.
Add W2 → Y.
56. Difficult Case: Proof Intuition
[Diagram: the active path in G and its counterpart in H, over Y, W, A, B, Z, D]
1. W is not in the conditioning set.
2. Y is not in the conditioning set, else the path is active in H.
3. In G, the subsequent edges must be directed away from Y until B or the conditioning set is reached.
4. In G, neither Z nor its descendants are in the conditioning set, else the path was active before the addition.
5. From (1), (2) and (4), there are active paths (A, D) and (B, D) in H.
6. By the choice of D, there is a directed path from D to B or to the conditioning set in H.
58. Optimality of GES
- Definition
  - p is DAG-perfect wrt G: the independence constraints in p are precisely those in G
- Assumption
  - The generative distribution p is perfect wrt some G defined over the observable variables
- S: the equivalence class containing G
- Under the DAG-perfect assumption, GES results in S
59. Important Definitions
- Bayesian Networks
- Markov Conditions
- Distribution/Structure Inclusion
- Structure/Structure Inclusion