Title: Learning the Structure of Markov Logic Networks
1. Learning the Structure of Markov Logic Networks
Stanley Kok and Pedro Domingos
Dept. of Computer Science and Engineering, University of Washington
2. Overview
- Motivation
- Background
- Structure Learning Algorithm
- Experiments
- Future Work & Conclusion
3. Motivation
- Statistical Relational Learning (SRL) combines the benefits of:
- Statistical learning: uses probability to handle uncertainty in a robust and principled way
- Relational learning: models domains with multiple relations
4. Motivation
- Many SRL approaches combine a logical language and Bayesian networks
- e.g., Probabilistic Relational Models (Friedman et al., 1999)
- The need to avoid cycles in Bayesian networks causes many difficulties (Taskar et al., 2002)
- Researchers started using Markov networks instead
5-6. Motivation
- Relational Markov Networks (Taskar et al., 2002)
- Conjunctive database queries + Markov networks
- Require space exponential in the size of the cliques
- Markov Logic Networks (Richardson & Domingos, 2004)
- First-order logic + Markov networks
- Compactly represent large cliques
- Did not learn structure (used an external ILP system)
- This paper develops a fast algorithm that learns MLN structure
- Most powerful SRL learner to date
7. Overview
- Motivation
- Background
- Structure Learning Algorithm
- Experiments
- Future Work & Conclusion
8. Markov Logic Networks
- A first-order KB is a set of hard constraints
- If a world violates even one formula, it has zero probability
- MLNs soften the constraints
- It is OK for a world to violate formulas
- The fewer formulas a world violates, the more probable it is
- Each formula is given a weight that reflects how strong a constraint it is
9. MLN Definition
- A Markov Logic Network (MLN) is a set of pairs (F, w), where
- F is a formula in first-order logic
- w is a real number
- Together with a finite set of constants, it defines a Markov network with
- One node for each grounding of each predicate in the MLN
- One feature for each grounding of each formula F in the MLN, with the corresponding weight w
10. Ground Markov Network
Formula (weight 2.7): AdvisedBy(S,P) ⇒ Student(S) ∧ Professor(P)
Constants: STAN, PEDRO
Ground predicates (one node each): AdvisedBy(STAN,STAN), AdvisedBy(STAN,PEDRO), AdvisedBy(PEDRO,STAN), AdvisedBy(PEDRO,PEDRO), Student(STAN), Student(PEDRO), Professor(STAN), Professor(PEDRO)
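To make the construction concrete, here is a minimal Python sketch (not from the paper) that enumerates the nodes of the ground network for the predicates and constants above:

```python
# A minimal sketch of how an MLN's predicates plus a set of constants
# yield the nodes of the ground Markov network.
from itertools import product

constants = ["STAN", "PEDRO"]
# Predicate name -> arity, as on the slide.
predicates = {"AdvisedBy": 2, "Student": 1, "Professor": 1}

# One node per grounding of each predicate.
nodes = [
    f"{pred}({','.join(args)})"
    for pred, arity in predicates.items()
    for args in product(constants, repeat=arity)
]
print(nodes)
# ['AdvisedBy(STAN,STAN)', 'AdvisedBy(STAN,PEDRO)', ..., 'Professor(PEDRO)']
```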
11-15. MLN Model
$P(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i \, n_i(x)\right)$
- x: vector of value assignments to ground predicates
- Z: partition function; sums over all possible value assignments to ground predicates
- w_i: weight of ith formula
- n_i(x): # of true groundings of ith formula
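As a sanity check of the definition, the toy Python sketch below computes P(X = x) by brute force for a hypothetical world of three ground atoms and the single formula from the earlier slide; the feature count is a stand-in for illustration, not the paper's code:

```python
# Brute-force illustration of P(X=x) = (1/Z) exp(sum_i w_i n_i(x))
# on a hypothetical 3-atom world; only one grounding of the formula
# is checkable here, so n_1(x) is 0 or 1.
from itertools import product
from math import exp

atoms = ["Student(STAN)", "Professor(PEDRO)", "AdvisedBy(STAN,PEDRO)"]
weights = [2.7]  # one formula, weight from the earlier slide

def n_true_groundings(world):
    # AdvisedBy(STAN,PEDRO) => Student(STAN) ^ Professor(PEDRO):
    # the implication holds unless the antecedent is true and the
    # consequent is false.
    s, p, adv = world
    return 1 if (not adv) or (s and p) else 0

# Partition function Z: sum over all 2^3 truth assignments.
Z = sum(exp(weights[0] * n_true_groundings(w))
        for w in product([False, True], repeat=len(atoms)))

world = (True, True, True)
prob = exp(weights[0] * n_true_groundings(world)) / Z
print(f"P(x) = {prob:.4f}")
```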
16-18. MLN Weight Learning
- Likelihood is a concave function of the weights
- Quasi-Newton methods find the optimal weights
- e.g., L-BFGS (Liu & Nocedal, 1989)
- SLOW: evaluating the likelihood requires counting the # of true groundings of each formula (#P-complete)
- SLOW: computing its gradient requires inference over the ground network (also #P-complete)
19-20. MLN Weight Learning
- R&D instead optimized pseudo-likelihood (Besag, 1975):
$P^{\bullet}_w(X = x) = \prod_l P_w\big(X_l = x_l \mid MB_x(X_l)\big)$
- Each ground predicate X_l is conditioned only on its Markov blanket MB_x(X_l), so no inference over the full network is needed
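A minimal sketch of pseudo-likelihood optimization with L-BFGS, under assumed toy data: the precomputed per-atom feature deltas are hypothetical, and scipy's L-BFGS-B stands in for the L-BFGS of Liu & Nocedal:

```python
# Sketch: maximize pseudo-log-likelihood with L-BFGS (scipy minimizes,
# so we negate). For a log-linear model, P(X_l=1 | MB) = sigmoid(w . d_l),
# where d_l is the change in true-grounding counts when atom l flips.
import numpy as np
from scipy.optimize import minimize

# Hypothetical precomputed deltas: 3 ground atoms, 1 formula.
delta_n = np.array([[1.0], [2.0], [0.0]])
x = np.array([1, 1, 0])          # observed truth values of the atoms

def neg_pll(w):
    logits = delta_n @ w
    log_p_true = -np.logaddexp(0.0, -logits)   # log sigmoid(logits)
    log_p_false = -np.logaddexp(0.0, logits)   # log sigmoid(-logits)
    return -np.sum(np.where(x == 1, log_p_true, log_p_false))

result = minimize(neg_pll, x0=np.zeros(1), method="L-BFGS-B")
print("learned weight:", result.x)
```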
21. MLN Structure Learning
- R&D learned MLN structure in two disjoint steps:
- Learn first-order clauses with an off-the-shelf ILP system (CLAUDIEN; De Raedt & Dehaspe, 1997)
- Learn clause weights by optimizing pseudo-likelihood
- Unlikely to give the best results, because CLAUDIEN
- finds clauses that hold with some accuracy/frequency in the data
- doesn't find clauses that maximize the data's (pseudo-)likelihood
22. Overview
- Motivation
- Background
- Structure Learning Algorithm
- Experiments
- Future Work & Conclusion
23. MLN Structure Learning
- This paper develops an algorithm that
- Learns first-order clauses by directly optimizing pseudo-likelihood
- Is fast enough to be practical
- Performs better than R&D, pure ILP, purely KB, and purely probabilistic approaches
24. Structure Learning Algorithm
- High-level algorithm
- REPEAT
-   MLN ← MLN ∪ FindBestClauses(MLN)
- UNTIL FindBestClauses(MLN) returns NULL
- FindBestClauses(MLN)
-   Create candidate clauses
-   FOR EACH candidate clause c
-     Compute increase in evaluation measure of adding c to MLN
-   RETURN k clauses with greatest increase
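The same loop in Python, with candidate_clauses() and score() (e.g., WPLL after re-learning weights) left as assumed stubs; this is a sketch of the slide's pseudocode, not the authors' implementation:

```python
# Structural sketch of the high-level algorithm above.
def learn_structure(mln, data, k=1):
    """Greedily grow the MLN until no candidate clause improves the score."""
    while True:
        best = find_best_clauses(mln, data, k)
        if not best:                 # FindBestClauses returned NULL
            return mln
        mln = mln + best             # MLN <- MLN U FindBestClauses(MLN)

def find_best_clauses(mln, data, k):
    """Score each candidate; return the k clauses with the largest gains."""
    base = score(mln, data)          # assumed stub: evaluation measure
    gains = [(score(mln + [c], data) - base, c)
             for c in candidate_clauses(mln)]  # assumed stub
    gains.sort(key=lambda g: g[0], reverse=True)
    return [c for gain, c in gains[:k] if gain > 0]
```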
25. Structure Learning
- Evaluation measure
- Clause construction operators
- Search strategies
- Speedup techniques
26-30. Evaluation Measure
- R&D used pseudo-log-likelihood:
$\log P^{\bullet}_w(X = x) = \sum_l \log P_w\big(X_l = x_l \mid MB_x(X_l)\big)$
- This gives undue weight to predicates with a large # of groundings
- Weighted pseudo-log-likelihood (WPLL):
$\log P^{\bullet}_w(X = x) = \sum_{r \in R} c_r \sum_{k=1}^{g_r} \log P_w\big(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k})\big)$
- c_r: weight given to predicate r
- the inner sum is over the g_r groundings of predicate r
- each term is a CLL: conditional log-likelihood
- Plus a Gaussian weight prior and a structure prior
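A hedged sketch of the WPLL computation with a Gaussian weight prior; the per-grounding CLLs are assumed precomputed (the expensive part in practice), and defaulting c_r to 1/g_r, which equalizes each predicate's influence, is one natural choice:

```python
# Sketch: WPLL = sum_r c_r * sum_k CLL(r, k), plus the log of a
# zero-mean Gaussian prior on the clause weights.
import numpy as np

def wpll(cll_by_pred, weights, c_r=None, sigma=1.0):
    """cll_by_pred: {predicate: array of CLLs, one per grounding}.
    c_r: optional {predicate: weight}; defaults to 1 / (# groundings)."""
    total = 0.0
    for pred, clls in cll_by_pred.items():
        cr = (1.0 / len(clls)) if c_r is None else c_r[pred]
        total += cr * np.sum(clls)
    # Gaussian prior contributes -sum_i w_i^2 / (2 sigma^2) in log space.
    total -= np.sum(np.asarray(weights) ** 2) / (2 * sigma ** 2)
    return total
```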
31. Clause Construction Operators
- Add a literal (negative/positive)
- Remove a literal
- Flip signs of literals
- Limit the # of distinct variables to restrict the search space
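An illustrative sketch of the three operators on a clause represented as a list of (sign, (predicate, args)) literals; this representation is invented here for illustration:

```python
# The three clause-construction operators, each returning a new clause.
def add_literal(clause, literal, sign=True):
    return clause + [(sign, literal)]

def remove_literal(clause, i):
    return clause[:i] + clause[i + 1:]

def flip_sign(clause, i):
    sign, lit = clause[i]
    return clause[:i] + [(not sign, lit)] + clause[i + 1:]

def num_distinct_vars(clause):
    # Candidate clauses contain only variables; the search space is
    # restricted by capping this count.
    return len({v for _, (_, args) in clause for v in args})

# Clause: Student(S) v !AdvisedBy(S,P)
clause = [(True, ("Student", ("S",))), (False, ("AdvisedBy", ("S", "P")))]
print(num_distinct_vars(clause))  # 2
```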
32. Beam Search
- Same as that used in ILP rule induction
- Repeatedly find the single best clause
33. Shortest-First Search (SFS)
- Start from an empty or hand-coded MLN
- FOR L ← 1 TO MAX_LENGTH
-   Apply each literal addition and deletion to each clause to create clauses of length L
-   Repeatedly add the K best clauses of length L to the MLN until no clause of length L improves WPLL
- Similar to Della Pietra et al. (1997) and McCallum (2003)
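A compact sketch of SFS under the same assumed stubs as before (candidate generation and WPLL scoring); clause length is taken to be its number of literals:

```python
# Sketch of shortest-first search: grow the MLN length by length.
def shortest_first_search(mln, data, max_length, k):
    for length in range(1, max_length + 1):
        # Clauses of length L reachable by one literal addition/deletion.
        candidates = [c for c in candidate_clauses(mln) if len(c) == length]
        improved = True
        while improved:
            base = score(mln, data)      # assumed stub: WPLL
            gains = [(score(mln + [c], data) - base, i)
                     for i, c in enumerate(candidates)]
            gains.sort(key=lambda g: g[0], reverse=True)
            best = [candidates[i] for gain, i in gains[:k] if gain > 0]
            improved = bool(best)
            mln = mln + best             # add the K best length-L clauses
    return mln
```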
34-37. Speedup Techniques
- FindBestClauses(MLN)
-   Create candidate clauses (SLOW: many candidates)
-   FOR EACH candidate clause c
-     Compute increase in WPLL (using L-BFGS, which is not that fast) of adding c to MLN (SLOW: many CLLs, and each CLL involves a #P-complete counting problem)
-   RETURN k clauses with greatest increase
38-43. Speedup Techniques
- Clause Sampling
- Predicate Sampling
- Avoid Redundancy
- Loose Convergence Thresholds
- Ignore Unrelated Clauses
- Weight Thresholding
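As an illustration of the two sampling speedups, the sketch below (not from the paper's code) estimates a clause's true-grounding count from a uniform subsample and estimates WPLL from a subsample of each predicate's groundings:

```python
# Sketch of clause sampling and predicate sampling via uniform subsampling.
import random

def sampled_true_groundings(groundings, is_true, sample_size=1000):
    """Unbiased estimate of the # of true groundings from a subsample."""
    if len(groundings) <= sample_size:
        return sum(map(is_true, groundings))
    sample = random.sample(groundings, sample_size)
    return sum(map(is_true, sample)) * len(groundings) / sample_size

def sampled_wpll(groundings_by_pred, cll, sample_size=500):
    """Estimate WPLL (with c_r = 1/g_r) from at most sample_size
    groundings per predicate: the sample mean CLL estimates the
    population mean CLL."""
    total = 0.0
    for pred, gs in groundings_by_pred.items():
        sample = gs if len(gs) <= sample_size else random.sample(gs, sample_size)
        total += (1.0 / len(sample)) * sum(cll(pred, g) for g in sample)
    return total
```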
44. Overview
- Motivation
- Background
- Structure Learning Algorithm
- Experiments
- Future Work & Conclusion
45. Experiments
- UW-CSE domain
- 22 predicates, e.g., AdvisedBy(X,Y), Student(X), etc.
- 10 types, e.g., Person, Course, Quarter, etc.
- # of ground predicates ≈ 4 million
- # of true ground predicates ≈ 3000
- Hand-crafted KB with 94 formulas
- e.g., each student has at most one advisor; if a student is an author of a paper, so is her advisor
- Cora domain
- Computer science research papers
- Collective deduplication of author, venue, title
46. Systems
- MLN(SLB): structure learning with beam search
- MLN(SLS): structure learning with SFS
47-49. Systems
- ILP and KB systems: KB (hand-coded KB), CL (CLAUDIEN), FO (FOIL), AL (Aleph)
- Each combined with MLN weight learning: MLN(KB), MLN(CL), MLN(FO), MLN(AL)
- Purely probabilistic: NB (Naïve Bayes), BN (Bayesian networks)
50. Methodology
- UW-CSE domain
- DB divided into 5 areas: AI, Graphics, Languages, Systems, Theory
- Leave-one-out testing by area
- Measured
- average CLL of the ground predicates
- average area under the precision-recall curve of the ground predicates (AUC)
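For reference, a sketch of the two metrics computed from predicted probabilities over test ground predicates, assuming scikit-learn is available:

```python
# Average conditional log-likelihood and precision-recall AUC.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def avg_cll(y_true, p_pred, eps=1e-6):
    """Average conditional log-likelihood of the observed truth values."""
    p = np.clip(p_pred, eps, 1 - eps)
    return np.mean(np.where(y_true == 1, np.log(p), np.log(1 - p)))

def pr_auc(y_true, p_pred):
    """Area under the precision-recall curve."""
    precision, recall, _ = precision_recall_curve(y_true, p_pred)
    return auc(recall, precision)
```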
51-54. UW-CSE Results
[Bar charts of test-set CLL and AUC comparing MLN(SLB), MLN(SLS), MLN(CL), MLN(FO), MLN(AL), MLN(KB), CL, FO, AL, and KB]
55. UW-CSE Results
[Bar charts of test-set CLL and AUC comparing MLN(SLS), MLN(SLB), NB, and BN]
56. Timing
- MLN(SLS) on UW-CSE
- Cluster of 15 dual-CPU 2.8 GHz Pentium 4 machines
- Without speedups: did not finish in 24 hrs
- With speedups: 5.3 hrs
57. Lesion Study
- Disable one speedup technique at a time; SFS on UW-CSE (one fold)
[Bar chart of runtime in hours for: all speedups, no clause sampling, no weight thresholding, no predicate sampling, don't avoid redundancy, no loose convergence threshold]
58. Overview
- Motivation
- Background
- Structure Learning Algorithm
- Experiments
- Future Work & Conclusion
59. Future Work
- Speed up counting of the true groundings of a clause
- Probabilistically bound the loss in accuracy due to subsampling
- Probabilistic predicate discovery
60. Conclusion
- Markov logic networks: a powerful combination of first-order logic and probability
- Richardson & Domingos (2004) did not learn MLN structure
- We develop an algorithm that automatically learns both first-order clauses and their weights
- We develop speedup techniques to make our algorithm fast enough to be practical
- We show experimentally that our algorithm outperforms
- Richardson & Domingos
- Pure ILP
- Purely KB approaches
- Purely probabilistic approaches
- (For software, email koks@cs.washington.edu)