Learning Tree Conditional Random Fields

Transcript and Presenter's Notes
1
Learning Tree Conditional Random Fields
  • Joseph K. Bradley
  • Carlos Guestrin

2
Reading people's minds
(Application from Palatucci et al., 2009)
X = fMRI voxels
Y = semantic features
  • Metal?
  • Manmade?
  • Found in house?
  • ...

We want to model conditional correlations
Predict independently? Yi | X, for all i
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
3
Conditional Random Fields (CRFs)
  • (Lafferty et al., 2001)

In fMRI, X = 500 to 10,000 voxels
Pro: Avoid modeling P(X)
4
Conditional Random Fields (CRFs)
Y4
Y3
Y2
Y1
Pro: Avoid modeling P(X)
5
Conditional Random Fields (CRFs)
Y4
Y3
Y2
Y1
Pro: Avoid modeling P(X)
6
Conditional Random Fields (CRFs)
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
7
Conditional Random Fields (CRFs)
Exact inference intractable in general. Approximate inference expensive.
Use tree CRFs!
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
8
Conditional Random Fields (CRFs)
Use tree CRFs!
Pro: Fast, exact inference
Con: Compute Z(x) for each inference
Pro: Avoid modeling P(X)
9
CRF Structure Learning
Feature selection
Tree CRFs: Fast, exact inference; avoid modeling P(X)
10
CRF Structure Learning
(scalable)
Local inputs
Tree CRFs: Fast, exact inference; avoid modeling P(X)
11
This work
Goals
  • Structured conditional models P(Y|X)
  • Scalable methods
  • Tree structures
  • Local inputs Xij
  • Max spanning trees
  • Outline
  • Gold standard
  • Max spanning trees
  • Generalized edge weights
  • Heuristic weights
  • Experiments: synthetic, fMRI

12
Related work
Method | Feature selection? | Tractable models?
Torralba et al. (2004), Boosted Random Fields | Yes | No
Schmidt et al. (2008), Block-L1 regularized pseudolikelihood | No | No
Shahaf et al. (2009), Edge weight + low-treewidth model | No | Yes
  • Vs. our work
  • Choice of edge weights
  • Local inputs

13
Chow-Liu
For generative models
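Not part of the original deck: a minimal sketch of the Chow-Liu idea for generative tree models, assuming binary variables and fully observed data. Pairwise mutual information is estimated from counts and a maximum spanning tree is taken over those weights; all names here are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(a, b):
    """Empirical mutual information (in nats) between two binary columns."""
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            p_uv = np.mean((a == u) & (b == v))
            p_u, p_v = np.mean(a == u), np.mean(b == v)
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (p_u * p_v))
    return mi

def chow_liu_tree(Y):
    """Y: (n_samples, d) binary array.  Returns edges (i, j) of the
    maximum-mutual-information spanning tree."""
    d = Y.shape[1]
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            W[i, j] = mutual_information(Y[:, i], Y[:, j])
    # Maximum spanning tree via a minimum spanning tree on negated weights.
    mst = minimum_spanning_tree(csr_matrix(-W))
    return list(zip(*mst.nonzero()))
```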
14
Chow-Liu for CRFs?
For CRFs with global inputs
  • Global CMI (Conditional Mutual Information)
  • Pro: Gold standard
  • Con: I(Yi;Yj | X) intractable for big X

15
Where now?
  • Global CMI (Conditional Mutual Information)
  • Pros: Gold standard
  • Cons: I(Yi;Yj | X) intractable for big X
  • Algorithmic framework (see sketch below)
  • Given data (y(i),x(i)).
  • Given input mapping Yi → Xi
  • Weight potential edge (Yi,Yj) with Score(i,j)
  • Choose max spanning tree

Local inputs!
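A sketch (not from the slides) of the framework above: any edge score over local variables can be plugged into the same max-spanning-tree step. `edge_score` and `input_map` are illustrative stand-ins for the scores and input mapping discussed in the talk.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def learn_tree_structure(Y, X, input_map, edge_score):
    """Y: (n, dY) label matrix; X: (n, dX) input matrix.
    input_map[i] is the list of input columns mapped to Y_i.
    edge_score(yi, yj, x_ij) -> float is any decomposable edge score
    (e.g. Local CMI, PWL, or DCI).  Returns the edges of the
    maximum-weight spanning tree over the Y variables."""
    d = Y.shape[1]
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            Xij = X[:, input_map[i] + input_map[j]]  # local inputs only
            W[i, j] = edge_score(Y[:, i], Y[:, j], Xij)
    # Max spanning tree = min spanning tree on negated weights;
    # edges whose score is exactly 0 are dropped by the sparse format.
    mst = minimum_spanning_tree(csr_matrix(-W))
    return list(zip(*mst.nonzero()))
```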
16
Generalized edge scores
  • Key step: Weight edge (Yi,Yj) with Score(i,j).

Local Linear Entropy Scores: Score(i,j) = linear combination of entropies over Yi, Yj, Xi, Xj
E.g., Local Conditional Mutual Information
17
Generalized edge scores
  • Key step: Weight edge (Yi,Yj) with Score(i,j).

Local Linear Entropy Scores: Score(i,j) = linear combination of entropies over Yi, Yj, Xi, Xj
  • Theorem
  • Assume true P(Y|X) is a tree CRF
  • (w/ non-trivial parameters).
  • ⇒ No Local Linear Entropy Score can recover all
    such tree CRFs
  • (even with exact entropies).

18
Heuristics
  • Outline
  • Gold standard
  • Max spanning trees
  • Generalized edge weights
  • Heuristic weights
  • Experiments: synthetic, fMRI

Piecewise likelihood, Local CMI, DCI
19
Piecewise likelihood (PWL)
Sutton and McCallum (2005, 2007): PWL for parameter learning. Main idea: bound Z(X).
For tree CRFs, optimal parameters give
  • Edge score w/ local inputs Xij
  • Bounds log likelihood
  • Fails on simple counterexample
  • Does badly in practice
  • Helps explain other edge scores

20
Piecewise likelihood (PWL)
True P(Y,X)
21
Local Conditional Mutual Info
  • Decomposable score w/ local inputs Xij (see sketch below)
  • Theorem: Local CMI bounds log likelihood gain
  • Does pretty well in practice
  • Can fail with strong potentials
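A rough sketch of one plausible reading of this score, assuming binary variables and local inputs Xij = (Xi, Xj): empirical conditional mutual information I(Yi; Yj | Xij), estimated by conditioning on each configuration of the local inputs. The exact form used in the paper may differ.

```python
import numpy as np
from itertools import product

def local_cmi(yi, yj, x_ij):
    """Empirical I(Yi; Yj | Xij) for binary yi, yj and a small block
    x_ij of binary local inputs (shape (n, k))."""
    k = x_ij.shape[1]
    cmi = 0.0
    for x in product((0, 1), repeat=k):          # each local-input configuration
        mask = np.all(x_ij == np.array(x), axis=1)
        p_x = mask.mean()
        if p_x == 0:
            continue
        a, b = yi[mask], yj[mask]
        for u, v in product((0, 1), repeat=2):
            p_uv = np.mean((a == u) & (b == v))
            p_u, p_v = np.mean(a == u), np.mean(b == v)
            if p_uv > 0:
                cmi += p_x * p_uv * np.log(p_uv / (p_u * p_v))
    return cmi
```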

22
Local Conditional Mutual Info
True P(Y,X)
Strong potential ?
Y3
Y2
Y1
23
Decomposable Conditional Influence (DCI)
  • Exact measure of gain for some edges
  • Edge score w/ local inputs Xij
  • Succeeds on counterexample
  • Does best in practice

24
Experiments
Algorithmic details
  • Given: data (y(i),x(i)); input mapping Yi → Xi
  • Compute edge scores
  • Choose max spanning tree
  • Parameter learning
  • Conjugate gradient on L2-regularized log
    likelihood
  • 10-fold CV to choose regularization (see sketch below)
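Not from the slides: an illustration of the cross-validation step only, with hypothetical helpers `fit_tree_crf(train, tree, lam)` (the conjugate-gradient fit of the L2-regularized log likelihood) and `loglik(model, data)` standing in for the actual CRF routines.

```python
import numpy as np
from sklearn.model_selection import KFold

def choose_regularization(data, tree, lams, fit_tree_crf, loglik, n_folds=10):
    """10-fold CV over candidate L2 strengths; returns the best lambda.
    fit_tree_crf and loglik are hypothetical stand-ins for the actual
    conjugate-gradient fit and held-out log-likelihood routines."""
    mean_scores = []
    for lam in lams:
        fold_scores = []
        for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=True).split(data):
            model = fit_tree_crf(data[train_idx], tree, lam)
            fold_scores.append(loglik(model, data[val_idx]))
        mean_scores.append(np.mean(fold_scores))
    return lams[int(np.argmax(mean_scores))]
```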

25
Synthetic experiments
P(Y|X)
P(X)
...
X1
X2
X3
Xn
  • Experiments
  • Binary Y, X; tabular edge factors
  • Use natural input mapping Yi → Xi

26
Synthetic experiments
P(Y|X)
P(X)
Y4
Y3
Y2
Y1
Y5
intractable P(Y,X)
tractable P(Y,X)
  • P(Y|X), P(X): chains & trees
  • P(Y,X): tractable & intractable

F(Yij,Xij)
27
Synthetic experiments
P(Y|X)
Y1
Y2
Y3
Yn
...
cross factors
X1
X2
X3
Xn
  • P(Y|X): chains & trees
  • P(Y,X): tractable & intractable

F(Yij,Xij)
  • With & without cross-factors
  • Associative (all positive or alternating +/-)
    and random factors

28
Synthetic: vary train exs.
29
Synthetic: vary train exs.
Tree; intractable P(Y,X); associative F (alternating +/-); |Y| = 40; 1,000 test examples
34
Synthetic: vary train exs.
35
Synthetic: vary model size
Fixed: 50 train exs., 1,000 test exs.
36
fMRI experiments
X (500 fMRI voxels)
Y (218 semantic features)
predict
  • Metal?
  • Manmade?
  • Found in house?
  • ...

Data, setup from Palatucci et al. (2009)
Zero-shot learning: can predict objects not in training data (given decoding).
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
37
fMRI experiments
X (500 fMRI voxels)
Y (218 semantic features)
predict
Input mapping: regressed Yi on (Y-i, X);
chose top K inputs
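The slide only states that each Yi was regressed on (Y-i, X) and the top K inputs kept; the sketch below assumes a ridge regression and that "top K" means the K voxels with largest absolute weight, neither of which is stated in the deck.

```python
import numpy as np
from sklearn.linear_model import Ridge

def input_mapping(Y, X, i, K):
    """Map Y_i to K columns of X: regress Y_i on (Y_{-i}, X) and keep
    the K input columns with the largest absolute weights (assumed rule)."""
    others = np.delete(Y, i, axis=1)
    Z = np.hstack([others, X])
    w = Ridge(alpha=1.0).fit(Z, Y[:, i]).coef_
    x_weights = np.abs(w[others.shape[1]:])   # weights on the X block only
    return list(np.argsort(-x_weights)[:K])   # indices into X's columns
```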
38
fMRI experiments
Accuracy (for zero-shot learning): Hold out objects i, j. Predict Ŷ(i), Ŷ(j).
If ||Ŷ(i) - Y(i)||² < ||Ŷ(i) - Y(j)||², then we got i right.
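The same rule written out in code (names illustrative): the prediction for held-out object i counts as correct if it is closer, in squared error, to i's true semantic vector than to j's.

```python
import numpy as np

def zero_shot_correct(y_pred_i, y_true_i, y_true_j):
    """True if the prediction for held-out object i is closer (squared
    error) to i's semantic features than to object j's."""
    return np.sum((y_pred_i - y_true_i) ** 2) < np.sum((y_pred_i - y_true_j) ** 2)
```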
39
fMRI experiments
Accuracy: CRFs a bit worse
40
fMRI experiments
Accuracy: CRFs a bit worse
Log likelihood: CRFs better
41
fMRI experiments
Accuracy: CRFs a bit worse
Log likelihood: CRFs better
Squared error: CRFs better
43
Conclusion
  • Scalable learning of CRF structure
  • Analyzed edge scores for spanning tree methods
  • Local Linear Entropy Scores imperfect
  • Heuristics
  • Pleasing theoretical properties
  • Empirical success; we recommend DCI
  • Future work
  • Templated CRFs
  • Learning edge score
  • Assumptions on model/factors which give
    learnability

Thank you!
44
Thank you!
  • References
  • M. Craven, D. DiPasquo, D. Freitag, A. McCallum,
    T. Mitchell, K. Nigam, S. Slattery. Learning to
    Extract Symbolic Knowledge from the World Wide
    Web. AAAI 1998.
  • J. Lafferty, A. McCallum, F. Pereira. Conditional
    Random Fields: Probabilistic Models for Segmenting
    and Labeling Sequence Data. ICML 2001.
  • M. Palatucci, D. Pomerleau, G. Hinton, T.
    Mitchell. Zero-Shot Learning with Semantic Output
    Codes. NIPS 2009.
  • M. Schmidt, K. Murphy, G. Fung, R. Rosales.
    Structure learning in random fields for heart
    motion abnormality detection. CVPR 2008.
  • D. Shahaf, A. Chechetka, C. Guestrin. Learning
    Thin Junction Trees via Graph Cuts. AI-STATS
    2009.
  • C. Sutton, A. McCallum. Piecewise training of
    undirected models. UAI 2005.
  • C. Sutton, A. McCallum. Piecewise
    pseudolikelihood for efficient training of
    conditional random fields. ICML, 2007.
  • A. Torralba, K. Murphy, W. Freeman. Contextual
    models for object detection using boosted random
    fields. NIPS 2004.

45
(extra slides)
46
B Score Decay Assumption
47
B Example complexity
48
Future work: Templated CRFs
  • Learn template, e.g.
  • Score(i,j) = DCI(i,j)
  • Parametrization
  • WebKB (Craven et al., 1998)
  • Given webpages (Yi = page type, Xi = content)
  • Use template to: choose tree over pages
  • Instantiate parameters
  • ⇒ P(Y|X=x) = P(pages' types | pages' content)
  • Requires local inputs
  • Potentially very fast

49
Future work: Learn score
  • Given training queries
  • Data
  • Ground-truth model (e.g., from an expensive
    structure learning method)
  • Learn function Score(Yi,Yj) for MST algorithm.