Title: Learning Tree Conditional Random Fields
1. Learning Tree Conditional Random Fields
- Joseph K. Bradley
- Carlos Guestrin
2. Reading people's minds
(Application from Palatucci et al., 2009)
X: fMRI voxels
Y: semantic features
- Metal?
- Manmade?
- Found in house?
- ...
We want to model conditional correlations
Predict independently? Yi | X, for all i
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
3. Conditional Random Fields (CRFs)
In fMRI, X = 500 to 10,000 voxels
Pro: Avoid modeling P(X)
4. Conditional Random Fields (CRFs)
[Figure: CRF over output variables Y1-Y4]
Pro: Avoid modeling P(X)
6. Conditional Random Fields (CRFs)
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
7. Conditional Random Fields (CRFs)
Exact inference intractable in general. Approximate inference expensive.
Use tree CRFs!
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
8. Conditional Random Fields (CRFs)
Use tree CRFs!
Pro: Fast, exact inference
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
9. CRF Structure Learning
Feature selection
Tree CRFs: Fast, exact inference. Avoid modeling P(X).
10. CRF Structure Learning
Feature selection (scalable)
Local inputs
Tree CRFs: Fast, exact inference. Avoid modeling P(X).
11. This work
Goals
- Structured conditional models P(Y|X)
- Scalable methods
- Tree structures
- Local inputs Xij
- Max spanning trees
Outline
- Gold standard
- Max spanning trees
- Generalized edge weights
- Heuristic weights
- Experiments: synthetic, fMRI
12. Related work

| Method | Feature selection? | Tractable models? |
|---|---|---|
| Torralba et al. (2004): Boosted Random Fields | Yes | No |
| Schmidt et al. (2008): Block-L1 regularized pseudolikelihood | No | No |
| Shahaf et al. (2009): Edge weight low-treewidth model | No | Yes |
- Vs. our work
- Choice of edge weights
- Local inputs
13. Chow-Liu
For generative models
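As background on the generative case (standard Chow-Liu, stated here for reference rather than quoted from the slides): the maximum-likelihood tree over Y is the maximum spanning tree under pairwise mutual-information edge weights.

```latex
T^{*} \;=\; \arg\max_{T \,\text{tree}} \sum_{(i,j)\in T} I(Y_i; Y_j),
\qquad
I(Y_i; Y_j) \;=\; \sum_{y_i,\,y_j} P(y_i, y_j)\,\log\frac{P(y_i, y_j)}{P(y_i)\,P(y_j)} .
```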
14. Chow-Liu for CRFs?
For CRFs with global inputs:
- Global CMI (Conditional Mutual Information), written out below
- Pro: Gold standard
- Con: I(Yi;Yj | X) intractable for big X
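Written out (a standard definition, matching the slide's description of the gold-standard edge weight), the Global CMI score conditions on the entire input X, which is what makes it intractable when X has hundreds or thousands of voxels:

```latex
\mathrm{Score}_{\mathrm{CMI}}(i,j)
\;=\; I(Y_i; Y_j \mid X)
\;=\; \mathbb{E}_{X}\!\left[\sum_{y_i,\,y_j} P(y_i, y_j \mid X)\,
      \log\frac{P(y_i, y_j \mid X)}{P(y_i \mid X)\,P(y_j \mid X)}\right].
```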
15. Where now?
- Global CMI (Conditional Mutual Information)
  - Pro: Gold standard
  - Con: I(Yi;Yj | X) intractable for big X
- Algorithmic framework (sketched below):
  - Given data (y(i), x(i))
  - Given input mapping Yi → Xi
  - Weight potential edge (Yi,Yj) with Score(i,j)
  - Choose max spanning tree
Local inputs!
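A minimal sketch of this framework in Python (not the authors' code): `score(i, j)` is a placeholder for whatever local edge score is chosen below (Local CMI, DCI, ...), and Kruskal's algorithm with a union-find gives the max spanning tree.

```python
from itertools import combinations

def max_spanning_tree(n_vars, score):
    """Return edges of a max spanning tree over variables 0..n_vars-1.

    score: callable (i, j) -> float giving the edge weight Score(i, j),
    e.g. an estimate of Local CMI or DCI computed from data.
    """
    # Score every candidate edge (Yi, Yj), highest weight first.
    edges = sorted(((score(i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)),
                   reverse=True)

    # Kruskal's algorithm with union-find: greedily keep the best-scoring
    # edge that does not create a cycle.
    parent = list(range(n_vars))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
            if len(tree) == n_vars - 1:
                break
    return tree
```

All of the modeling effort sits in the choice of `score`; the structure step itself is just O(n²) score evaluations plus one spanning-tree computation.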
16. Generalized edge scores
- Key step: Weight edge (Yi,Yj) with Score(i,j).
Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Yi, Yj, Xi, Xj
E.g., Local Conditional Mutual Information (written out below)
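One way to write the Local CMI example as a Local Linear Entropy Score, assuming the local input for edge (i,j) is Xij = (Xi, Xj) (a standard information-theoretic identity, not a formula quoted from the paper):

```latex
I(Y_i; Y_j \mid X_i, X_j)
\;=\; H(Y_i \mid X_i, X_j) + H(Y_j \mid X_i, X_j) - H(Y_i, Y_j \mid X_i, X_j)
\;=\; H(Y_i, X_i, X_j) + H(Y_j, X_i, X_j) - H(Y_i, Y_j, X_i, X_j) - H(X_i, X_j).
```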
17. Generalized edge scores
- Key step: Weight edge (Yi,Yj) with Score(i,j).
Local Linear Entropy Scores: Score(i,j) is a linear combination of entropies over Yi, Yj, Xi, Xj
- Theorem:
  - Assume the true P(Y|X) is a tree CRF (w/ non-trivial parameters).
  - Then no Local Linear Entropy Score can recover all such tree CRFs (even with exact entropies).
18. Heuristics
Outline
- Gold standard
- Max spanning trees
- Generalized edge weights
- Heuristic weights
- Experiments: synthetic, fMRI
Heuristic weights: Piecewise likelihood, Local CMI, DCI
19. Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007): PWL for parameter learning. Main idea: Bound Z(X).
For tree CRFs, optimal parameters give:
- Edge score w/ local inputs Xij
- Bounds log likelihood (bound sketched below)
- Fails on simple counterexample
- Does badly in practice
- Helps explain other edge scores
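A sketch of the bound behind PWL (our reading of Sutton and McCallum; the precise edge-score form used in this work is not reproduced here): since the factors ψa are nonnegative, the product of per-factor normalizers upper-bounds Z(x), so the piecewise objective lower-bounds the conditional log likelihood.

```latex
Z(x) \;=\; \sum_{y}\prod_{a}\psi_a(y_a, x)
\;\le\; \prod_{a}\sum_{y_a}\psi_a(y_a, x) \;=\; \prod_a Z_a(x)
\quad\Longrightarrow\quad
\ell_{\mathrm{PW}}(\theta) \;=\; \sum_a\Big[\log\psi_a(y_a, x;\theta) - \log Z_a(x;\theta)\Big]
\;\le\; \log P(y \mid x;\theta).
```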
20. Piecewise likelihood (PWL)
[Figure: counterexample; the true P(Y,X)]
21. Local Conditional Mutual Info
- Decomposable score w/ local inputs Xij
- Theorem: Local CMI bounds log likelihood gain
- Does pretty well in practice (plug-in estimator sketched below)
- Can fail with strong potentials
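A minimal plug-in estimator for the Local CMI score from discrete samples (a sketch, not the authors' implementation; it uses the entropy identity given earlier):

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Plug-in (empirical) entropy, in nats, of a sequence of hashable outcomes."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def local_cmi(yi, yj, xij):
    """Empirical Local CMI I(Yi; Yj | Xij) via the identity
    H(Yi,Xij) + H(Yj,Xij) - H(Yi,Yj,Xij) - H(Xij).

    yi, yj: length-N sequences of discrete labels; xij: length-N sequence of
    (hashable) local input values, e.g. tuples (xi, xj).
    """
    xij = list(xij)
    return (entropy(list(zip(yi, xij))) + entropy(list(zip(yj, xij)))
            - entropy(list(zip(yi, yj, xij))) - entropy(xij))
```

With data arrays Y (N×n) and X (N×n), this plugs into the earlier spanning-tree sketch as, e.g., `score = lambda i, j: local_cmi(Y[:, i], Y[:, j], [tuple(r) for r in X[:, [i, j]]])`.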
22. Local Conditional Mutual Info
[Figure: counterexample; true P(Y,X) over Y1, Y2, Y3 with a strong potential]
23. Decomposable Conditional Influence (DCI)
- Exact measure of gain for some edges
- Edge score w/ local inputs Xij
- Succeeds on counterexample
- Does best in practice
24. Experiments
Algorithmic details
- Given: Data (y(i), x(i)); input mapping Yi → Xi
- Compute edge scores
- Choose max spanning tree
- Parameter learning (objective written out below)
  - Conjugate gradient on L2-regularized log likelihood
  - 10-fold CV to choose regularization
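The parameter-learning step above, written out under the usual convention (the exact regularization scaling is an assumption; λ is the value picked by 10-fold CV):

```latex
\hat{\theta} \;=\; \arg\max_{\theta}\;
\sum_{n=1}^{N}\log P\big(y^{(n)} \mid x^{(n)};\theta\big)
\;-\; \lambda\,\lVert\theta\rVert_2^2 .
```

Conjugate gradient handles the optimization; the gradients are exact because inference in the tree CRF is exact.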
25. Synthetic experiments
[Figure: synthetic model, with P(Y|X) over the outputs Y and P(X) over inputs X1, X2, X3, ..., Xn]
- Experiments
  - Binary Y, X; tabular edge factors
  - Use natural input mapping Yi → Xi
26. Synthetic experiments
[Figure: synthetic models over Y1-Y5 with edge factors F(Yij, Xij); the joint P(Y,X) is tractable in one variant and intractable in the other]
- P(Y,X): tractable vs. intractable
27. Synthetic experiments
[Figure: synthetic model P(Y|X) over Y1, Y2, Y3, ..., Yn with inputs X1, X2, X3, ..., Xn, edge factors F(Yij, Xij), and cross factors]
- P(Y,X): tractable vs. intractable
- With vs. without cross-factors
- Associative (all-positive or alternating +/-) vs. random factors
28. Synthetic: vary train exs.
29. Synthetic: vary train exs.
[Plot: tree structure; intractable P(Y,X); associative F (alternating +/-); |Y| = 40; 1000 test examples]
34. Synthetic: vary train exs.
35. Synthetic: vary model size
Fixed: 50 train exs., 1000 test exs.
36. fMRI experiments
Predict Y (218 semantic features) from X (500 fMRI voxels):
- Metal?
- Manmade?
- Found in house?
- ...
Data, setup from Palatucci et al. (2009)
Zero-shot learning: Can predict objects not in training data (given decoding).
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
37. fMRI experiments
Predict Y (218 semantic features) from X (500 fMRI voxels)
Input mapping: Regressed Yi on (Y-i, X); chose top K inputs (sketch below)
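A sketch of one plausible reading of this input-mapping step, under explicit assumptions: plain least-squares regression of each Yi on (Y-i, X), keeping the K voxels with the largest-magnitude coefficients. The paper's actual regression and selection rule may differ.

```python
import numpy as np

def select_inputs(Y, X, K):
    """For each output Yi: regress Yi on (Y_{-i}, X) by least squares, then
    keep the K columns of X with the largest-magnitude coefficients.

    Y: (N, n_outputs) array; X: (N, n_inputs) array.
    Returns a dict {i: array of K input (voxel) indices}.
    """
    mapping = {}
    for i in range(Y.shape[1]):
        Y_rest = np.delete(Y, i, axis=1)            # Y_{-i}
        design = np.hstack([Y_rest, X])             # regress Yi ~ (Y_{-i}, X)
        coef, *_ = np.linalg.lstsq(design, Y[:, i], rcond=None)
        x_coef = coef[Y_rest.shape[1]:]             # coefficients on the X block
        mapping[i] = np.argsort(-np.abs(x_coef))[:K]
    return mapping
```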
38. fMRI experiments
Accuracy (for zero-shot learning): Hold out objects i, j. Predict Ŷ(i), Ŷ(j). If ||Ŷ(i) - Y(i)||² < ||Ŷ(j) - Y(i)||², then we got i right.
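The decision rule above as a small sketch, with `y_hat_i`, `y_hat_j` the predicted semantic vectors for held-out objects i and j and `y_true_i` the true code for i (names are illustrative):

```python
import numpy as np

def zero_shot_correct(y_hat_i, y_hat_j, y_true_i):
    """Slide's decision rule: object i counts as correct if the prediction
    for i is closer (in squared error) to the true code Y(i) than the
    prediction for j is."""
    return np.sum((y_hat_i - y_true_i) ** 2) < np.sum((y_hat_j - y_true_i) ** 2)
```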
39. fMRI experiments
Accuracy: CRFs a bit worse
40. fMRI experiments
Accuracy: CRFs a bit worse. Log likelihood: CRFs better.
41. fMRI experiments
Accuracy: CRFs a bit worse. Log likelihood: CRFs better. Squared error: CRFs better.
43. Conclusion
- Scalable learning of CRF structure
- Analyzed edge scores for spanning tree methods
- Local Linear Entropy Scores imperfect
- Heuristics
- Pleasing theoretical properties
- Empirical success; we recommend DCI
- Future work
- Templated CRFs
- Learning edge score
- Assumptions on model/factors which give learnability
44. Thank you!
- References
- M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
- J. D. Lafferty, A. McCallum, F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
- M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
- M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure learning in random fields for heart motion abnormality detection. CVPR 2008.
- D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
- C. Sutton, A. McCallum. Piecewise training of undirected models. UAI 2005.
- C. Sutton, A. McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. ICML 2007.
- A. Torralba, K. Murphy, W. Freeman. Contextual models for object detection using boosted random fields. NIPS 2004.
45. (extra slides)
46. B: Score Decay Assumption
47. B: Example complexity
48. Future work: Templated CRFs
- Learn template, e.g.
  - Score(i,j) = DCI(i,j)
  - Parametrization
- WebKB (Craven et al., 1998)
  - Given webpages (Yi = page type, Xi = content)
  - Use template to: choose tree over pages; instantiate parameters
  - → P(Y | X = x) = P(pages' types | pages' content)
- Requires local inputs
- Potentially very fast
49. Future work: Learn score
- Given training queries:
  - Data
  - Ground-truth model (e.g., from expensive structure learning method)
- Learn function Score(Yi,Yj) for MST algorithm.