Title: Learning Structured Prediction Models: A Large Margin Approach
1. Learning Structured Prediction Models: A Large Margin Approach
- Ben Taskar
- U.C. Berkeley
- Vassil Chatalbashev, Michael Collins, Carlos Guestrin, Dan Klein, Daphne Koller, Chris Manning
2. "Don't worry, Howard. The big questions are multiple choice."
3. Handwriting recognition
x: image of a handwritten word → y: "brace"
Sequential structure
4. Object segmentation
x: 3D point cloud → y: object labels
Spatial structure
5. Natural language parsing
x: "The screen was a sea of red" → y: parse tree
Recursive structure
6. Disulfide connectivity prediction
x: RSCCPCYWGGCPWGQNCYPEGCSGPKV → y: bonding pattern over the cysteines
Combinatorial structure
7. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (Special MRFs)
  - Matchings
- Geometric View
  - Structured model polytopes
  - Linear programming inference
- Structured large margin estimation
  - Min-max formulation
  - Application: 3D object segmentation
  - Certificate formulation
  - Application: disulfide connectivity prediction
8. Structured models
Prediction: y* = argmax_{y ∈ Y(x)} w^T f(x, y), where w^T f(x, y) is the scoring function and Y(x) is the space of feasible outputs.
- Mild assumption: the score is a linear combination of features.
9. Chain Markov Net (aka CRF)
P(y|x) ∝ ∏_i φ(x_i, y_i) ∏_i φ(y_i, y_{i+1})
φ(x_i, y_i) = exp{ Σ_a w_a f_a(x_i, y_i) }
φ(y_i, y_{i+1}) = exp{ Σ_b w_b f_b(y_i, y_{i+1}) }
Example features: f_b(y_i, y_{i+1}) = I(y_i = 'z', y_{i+1} = 'a');  f_a(x_i, y_i) = I(x_i has pixel p on, y_i = 'z')
Lafferty et al. 01
10. Chain Markov Net (aka CRF)
P(y|x) ∝ ∏_i φ(x_i, y_i) ∏_i φ(y_i, y_{i+1})
w = [w_a, ..., w_b],  f(x, y) = [Σ_i f_a(x_i, y_i), ..., Σ_i f_b(y_i, y_{i+1})]
∏_i φ(x_i, y_i) = exp{ w_a Σ_i f_a(x_i, y_i) }
∏_i φ(y_i, y_{i+1}) = exp{ w_b Σ_i f_b(y_i, y_{i+1}) }
Aggregate features are counts: f_b(x, y) = #(y_i = 'z', y_{i+1} = 'a');  f_a(x, y) = #(pixel p on, y_i = 'z')
Lafferty et al. 01
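A minimal sketch (not from the talk; the weights, toy alphabet, and feature keys are hypothetical) of the linear scoring function w^T f(x, y) above: node and transition feature scores are summed along the chain, and the argmax is taken by brute force over a tiny alphabet.

    # Minimal sketch of the chain scoring function w^T f(x, y):
    # node (emission) scores plus transition scores, summed along the chain.
    # All weights and the tiny alphabet are hypothetical illustration values.
    from itertools import product

    LABELS = ["a", "b"]                              # toy label alphabet
    w_node = {("x1", "a"): 1.0, ("x1", "b"): 0.2,    # w_a: token/label weights
              ("x2", "a"): 0.1, ("x2", "b"): 0.9}
    w_edge = {("a", "a"): 0.3, ("a", "b"): 0.5,      # w_b: transition weights
              ("b", "a"): 0.0, ("b", "b"): 0.4}

    def score(x, y):
        """w^T f(x, y) = sum of node scores + sum of transition scores."""
        node = sum(w_node.get((xi, yi), 0.0) for xi, yi in zip(x, y))
        edge = sum(w_edge.get((y[i], y[i + 1]), 0.0) for i in range(len(y) - 1))
        return node + edge

    def argmax_brute_force(x):
        """Enumerate all |LABELS|^n label sequences; fine for toy chains only."""
        return max(product(LABELS, repeat=len(x)), key=lambda y: score(x, y))

    x = ["x1", "x2"]
    best = argmax_brute_force(x)
    print(best, score(x, best))

Brute force is only for illustration; the point of the following slides is that this argmax can instead be computed by dynamic programming or, more generally, by linear programming.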
11. Associative Markov Nets
Node potentials φ_i(y_i), edge potentials φ_ij(y_i, y_j)
- Point features: spin-images, point height
- Edge features: length of edge, edge orientation
- Associative restriction: edge potentials only reward agreement between y_i and y_j
12. PCFG
(NP → DT NN)  (PP → IN NP)  (NN → sea)
13. Disulfide bonds: non-bipartite matching
RSCCPCYWGGCPWGQNCYPEGCSGPKV — cysteines numbered 1-6; the output is a perfect matching on the cysteines (the figure shows the pairing 1-6, 2-5, 3-4).
Fariselli & Casadio 01, Baldi et al. 04
14. Scoring function
[Figure: two candidate bonding patterns over the six cysteines of RSCCPCYWGGCPWGQNCYPEGCSGPKV, compared by score]
The score of a candidate matching decomposes over the bonded pairs.
- String features: residues, physical properties
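A minimal sketch (details assumed, not from the talk) of such a pairwise scoring function: each bonded cysteine pair contributes w^T f(pair), with a hypothetical residue-window feature standing in for the window profiles used in the experiments later on.

    # Sketch: score of a candidate matching = sum of per-pair scores w^T f(pair).
    # The window feature and the weights below are hypothetical illustration values.
    SEQ = "RSCCPCYWGGCPWGQNCYPEGCSGPKV"
    CYS = [i for i, aa in enumerate(SEQ) if aa == "C"]   # cysteine positions

    def pair_features(i, j, window=2):
        """Toy features: residue counts in a small window around each bonded cysteine."""
        ctx = SEQ[max(0, i - window): i + window + 1] + SEQ[max(0, j - window): j + window + 1]
        return {("res", aa): ctx.count(aa) for aa in set(ctx)}

    def pair_score(i, j, w):
        return sum(w.get(k, 0.0) * v for k, v in pair_features(i, j).items())

    def matching_score(matching, w):
        """matching: list of (i, j) sequence positions; the score decomposes over pairs."""
        return sum(pair_score(i, j, w) for i, j in matching)

    w = {("res", "C"): 0.5, ("res", "Y"): 0.3}           # hypothetical learned weights
    candidate = [(CYS[0], CYS[5]), (CYS[1], CYS[4]), (CYS[2], CYS[3])]
    print(matching_score(candidate, w))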
15. Structured models
Prediction: y* = argmax_{y ∈ Y(x)} w^T f(x, y)
- Mild assumption: the score is a linear combination of features.
- Another mild assumption: the argmax can be computed by linear programming.
16. MAP inference ⇒ linear program
- LP inference for:
  - Chains
  - Trees
  - Associative Markov Nets
  - Bipartite Matchings
  - ...
17. Markov Net Inference LP
Has integral solutions y for chains and trees; gives an upper bound for general networks (the LP is written out below).
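The LP itself appeared as an equation image on the slide; a standard reconstruction of the local-consistency (pairwise marginal) relaxation it refers to, in LaTeX, with node scores \theta_i, edge scores \theta_{ij}, and relaxed indicator variables \mu (the symbol names are assumptions):

    \max_{\mu \ge 0} \;\; \sum_i \sum_{y_i} \mu_i(y_i)\,\theta_i(y_i)
                    + \sum_{ij} \sum_{y_i, y_j} \mu_{ij}(y_i, y_j)\,\theta_{ij}(y_i, y_j)
    \quad \text{s.t.} \quad \sum_{y_i} \mu_i(y_i) = 1 \;\; \forall i, \qquad
                            \sum_{y_j} \mu_{ij}(y_i, y_j) = \mu_i(y_i) \;\; \forall ij,\, y_i

Integral vertices of this polytope encode assignments y; for chains and trees every vertex is integral, which is the claim above.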
18. Associative MN Inference LP
(with the associative restriction)
- For K = 2 labels, solutions are always integral (optimal)
- For K > 2, within a factor of 2 of optimal
- Constraint matrix A is linear in the number of nodes and edges, regardless of the tree-width
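The AMN LP was also an equation image; a reconstruction consistent with the associative restriction (only agreement between neighbors is rewarded, so a single variable \mu_{ij}(k) per edge and label suffices; symbol names are assumptions):

    \max_{\mu \ge 0} \;\; \sum_i \sum_k \mu_i(k)\,\theta_i(k) + \sum_{ij} \sum_k \mu_{ij}(k)\,\theta_{ij}(k)
    \quad \text{s.t.} \quad \sum_k \mu_i(k) = 1 \;\; \forall i, \qquad
                            \mu_{ij}(k) \le \mu_i(k), \;\; \mu_{ij}(k) \le \mu_j(k) \;\; \forall ij,\, k

with \theta_{ij}(k) \ge 0; the K = 2 integrality and factor-of-2 guarantees above refer to this relaxation.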
19. Other Inference LPs
- Context-free parsing
- Dynamic programs
- Bipartite matching
- Network flow
- Many other combinatorial problems
20. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (Special MRFs)
  - Matchings
- Geometric View
  - Structured model polytopes
  - Linear programming inference
- Structured large margin estimation
  - Min-max formulation
  - Application: 3D object segmentation
  - Certificate formulation
  - Application: disulfide connectivity prediction
21. Learning w
- Training example (x, y)
- Probabilistic approach: maximize conditional likelihood
- Problem: computing Z_w(x) is #P-complete
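The objective on this slide was an equation image; the standard form of the conditional likelihood it refers to is, in LaTeX:

    \max_w \; \log P_w(y \mid x) \;=\; w^\top f(x, y) - \log Z_w(x),
    \qquad Z_w(x) \;=\; \sum_{y' \in \mathcal{Y}(x)} \exp\{ w^\top f(x, y') \}

The sum defining Z_w(x) ranges over the exponentially large output space, which is where the #P-completeness comes from.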
22. Geometric Example
Training data
Goal: learn w s.t. w^T f(x, y) points the right way
23. OCR Example
- We want:
  argmax_word w^T f(x, word) = "brace"
- Equivalently:
  w^T f(x, "brace") > w^T f(x, "aaaaa")
  w^T f(x, "brace") > w^T f(x, "aaaab")
  ...
  w^T f(x, "brace") > w^T f(x, "zzzzz")
  ... a lot of constraints!
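A tiny sketch (toy alphabet and a stand-in scorer, both hypothetical) of why the "equivalently" view blows up: brute-force enumeration materializes one margin constraint per competing word, i.e. |alphabet|^length of them.

    # Sketch: enumerate the constraints "score(true word) > score(other word)" explicitly.
    # With 26 letters and 5 positions that is 26**5 - 1 constraints; the toy
    # alphabet keeps the loop small, and score() is a stand-in for w^T f(x, word).
    from itertools import product

    ALPHABET = "ab"                     # toy alphabet (real OCR: 26 letters)
    TRUE_WORD = "ab"

    def score(word):                    # hypothetical stand-in for w^T f(x, word)
        return sum((-1.0) ** i * ord(c) for i, c in enumerate(word))

    others = [w for w in map("".join, product(ALPHABET, repeat=len(TRUE_WORD)))
              if w != TRUE_WORD]
    violated = [w for w in others if score(TRUE_WORD) <= score(w)]
    print(len(others), "constraints;", len(violated), "violated by the stand-in scorer")
    print("a 5-letter OCR word would need", 26 ** 5 - 1, "such constraints")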
24. Large margin estimation
- Given training example (x, y), we want:
  w^T f(x, y) > w^T f(x, y') for all y' ≠ y
- Maximize margin γ:
  w^T f(x, y) ≥ w^T f(x, y') + γ for all y'
- Mistake-weighted margin:
  w^T f(x, y) ≥ w^T f(x, y') + γ ℓ(y, y'), where ℓ(y, y') = # of mistakes in y'
Taskar et al. 03
25. Large margin estimation
- Brute force enumeration
- Min-max formulation
- Plug-in linear program for inference
26. Min-max formulation
Assume linear loss (Hamming): ℓ(y, y') = Σ_i I(y_i ≠ y'_i)
Inference: the worst violator is max_{y'} [w^T f(x, y') + ℓ(y, y')]
LP inference: this loss-augmented maximization is itself a linear program
27. Min-max formulation
By strong LP duality, the inner maximization equals its dual minimization.
Minimize jointly over w, z (the dual variables), as spelled out below.
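A reconstruction of this step in LaTeX (the LP description A, b, the feature map F, the loss vector c, and the dual variables z are assumed names):

    \min_w \; \tfrac{1}{2}\|w\|^2
    \quad \text{s.t.} \quad
    w^\top f(x, y) \;\ge\; \max_{\mu \ge 0,\; A\mu \le b} (F^\top w + c)^\top \mu
                   \;=\; \min_{z \ge 0,\; A^\top z \ge F^\top w + c} b^\top z

Replacing the inner max by its dual and dropping the inner min yields a single compact QP: minimize (1/2)||w||^2 jointly over (w, z) subject to w^T f(x, y) ≥ b^T z, A^T z ≥ F^T w + c, and z ≥ 0.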
28. Min-max formulation
- Formulation produces a compact QP for:
  - Low-treewidth Markov networks
  - Associative Markov networks
  - Context-free grammars
  - Bipartite matchings
  - Any problem with compact LP inference
29. 3D Mapping
Data provided by Michael Montemerlo & Sebastian Thrun
- Sensors: laser range finder, GPS, IMU
- Labels: ground, building, tree, shrub
- Training: 30 thousand points; Testing: 3 million points
30-33. [Image-only slides]
34. Segmentation results
Hand-labeled 180K test points
35. Fly-through
36. Certificate formulation
- Non-bipartite matchings:
  - O(n^3) combinatorial algorithm
  - No polynomial-size LP known
- Spanning trees:
  - No polynomial-size LP known
  - Simple certificate of optimality
- Intuition:
  - Verifying optimality is easier than optimizing
  - Compact optimality condition of y w.r.t. alternative outputs
37. Certificate for non-bipartite matching
- Alternating cycle: every other edge is in the matching
- Augmenting alternating cycle: score of the edges not in the matching is greater than that of the edges in the matching
- Negate the score of edges not in the matching → augmenting alternating cycle = negative-length alternating cycle
- Matching is optimal ⟺ no negative-length alternating cycles
Edmonds 65
38. Certificate for non-bipartite matching
- Pick any node r as root
- d_j = length of the shortest alternating path from r to j
- Triangle inequality: d_j ≤ d_k + length(k, j)
- Theorem: no negative-length cycle ⟺ a distance function d exists
- Can be expressed as linear constraints
- O(n) distance variables, O(n^2) constraints
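A reconstruction, in LaTeX, of the linear certificate constraints described above (the distance variables d_j and alternating-path edge lengths \ell_{kj} are assumed symbols):

    d_j \;\le\; d_k + \ell_{kj} \qquad \text{for every alternating-path edge } (k, j)

If a distance function d satisfying these triangle inequalities exists, there can be no negative-length alternating cycle, so the matching y is optimal; this uses the O(n) distance variables and O(n^2) constraints stated above.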
39. Certificate formulation
- Formulation produces a compact QP for:
  - Spanning trees
  - Non-bipartite matchings
  - Any problem with a compact optimality condition
40. Disulfide connectivity prediction
- Dataset:
  - Swiss-Prot protein database, release 39
  - Fariselli & Casadio 01, Baldi et al. 04
  - 446 sequences (4-50 cysteines)
  - Features: window profiles (size 9) around each pair
  - Two modes: bonded state known/unknown
- Comparison:
  - SVM-trained weights (ignoring constraints during learning)
  - DAG Recursive Neural Network (Baldi et al. 04)
- Our model:
  - Max-margin matching using an RBF kernel
  - Training: off-the-shelf LP/QP solver CPLEX (1 hour)
41. Known bonded state
[Chart: precision / accuracy, 4-fold cross-validation]
42. Unknown bonded state
[Chart: precision / recall / accuracy, 4-fold cross-validation]
43. Formulation summary
- Brute force enumeration
- Min-max formulation
  - Plug in a convex program for inference
- Certificate formulation
  - Directly guarantee optimality of y
44. Estimation
Local vs. global normalization; generative vs. discriminative vs. margin-based:
- Generative, P(x, y): HMMs, PCFGs (local); MRFs (global)
- Discriminative, P(y|x): MEMMs (local); CRFs (global)
- Margin-based: max-margin estimation (this talk)
Local: P(z) = ∏_i P(z_i | z_parents(i));  Global: P(z) = (1/Z) ∏_c φ(z_c)
45. Omissions
- Formulation details
  - Kernels
  - Multiple examples
  - Slacks for the non-separable case
- Approximate learning of intractable models
  - General MRFs
  - Learning to cluster
- Structured generalization bounds
- Scalable algorithms (no QP solver needed)
  - Structured SMO (works for chains, trees)
  - Structured EG (works for chains, trees)
  - Structured PG (works for chains, matchings, AMNs, ...)
46. Current Work
- Learning approximate energy functions
  - Protein folding
  - Physical processes
- Semi-supervised learning
  - Hidden variables
  - Mixing labeled and unlabeled data
- Discriminative structure learning
  - Using sparsifying priors
47. Conclusion
- Two general techniques for structured large-margin estimation
- Exact, compact, convex formulations
- Allow efficient use of kernels
- Tractable when other estimation methods are not
- Structured generalization bounds
- Efficient learning algorithms
- Empirical success on many domains
- Papers at http://www.cs.berkeley.edu/~taskar
48. [Image-only slide]
49. Duals and Kernels
- Kernel trick works!
- Scoring functions (log-potentials) can use kernels
- Same for the certificate formulation
50. Handwriting Recognition
- Length: 8 chars
- Letter: 16x8 pixels
- 10-fold train/test
  - 5000/50000 letters
  - 600/6000 words
- Models:
  - Multiclass SVMs
  - CRFs
  - M³ nets
[Bar chart: test error (average per-character, %) for CRFs, multiclass SVMs, and M³ nets; lower is better]
45% error reduction over linear CRFs; 33% error reduction over multiclass SVMs
Crammer & Singer 01
51. Hypertext Classification
- WebKB dataset
  - Four CS department websites: 1300 pages / 3500 links
  - Classify each page: faculty, course, student, project, other
  - Train on three universities, test on the fourth
[Bar chart: test error for SVMs, RMNs (loopy belief propagation), and M³ nets (relaxed dual); lower is better]
53% error reduction over SVMs; 38% error reduction over RMNs
Taskar et al. 02
52. Projected Gradient
[Figure: iterates y^k, y^{k+1}, y^{k+2}, ... projected back onto the feasible set; a toy sketch of the loop follows below]
- Projecting y onto constraints ⇒ min-cost convex flow for Markov nets, matchings
- Convergence: same as steepest gradient
- Conjugate gradient also possible (two-metric projection)
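A minimal sketch (not the authors' code; the quadratic objective and box projection are stand-ins for the structured objective and the min-cost-flow projection) of the projected gradient loop the slide describes:

    # Projected gradient ascent sketch: take a gradient step, then project back
    # onto the feasible set. The objective and projection here are simple
    # stand-ins; in the talk the projection is a min-cost convex flow.
    import numpy as np

    def project(y):
        """Stand-in projection onto the feasible set (here: the box [0, 1]^n)."""
        return np.clip(y, 0.0, 1.0)

    def objective_grad(y):
        """Stand-in gradient of a concave quadratic objective."""
        target = np.array([0.2, 0.9, 0.4])
        return -(y - target)

    def projected_gradient(y0, step=0.5, iters=100):
        y = project(y0)
        for _ in range(iters):
            y = project(y + step * objective_grad(y))   # gradient step, then project
        return y

    print(projected_gradient(np.zeros(3)))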
53. Min-Cost Flow for Markov Chains
[Figure: trellis with one node per (position, label a..z), plus source s and sink t]
- Capacities C
- Edge costs; for edges from node s and into node t, cost 0
(A toy decoding sketch follows below.)
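A sketch (toy scores, plain dynamic programming) of the construction: with unit capacities and one unit of flow from s to t, min-cost flow through the trellis reduces to a shortest s-t path, which the Viterbi-style recursion below computes using costs = negated scores.

    # One unit of min-cost flow through the trellis = shortest s -> t path.
    # Costs are negated scores; s -> (0, label) and (n-1, label) -> t edges cost 0.
    # The scores below are hypothetical toy values, and the label set is tiny.
    LABELS = ["a", "z"]
    node_score = [{"a": 1.0, "z": 0.2}, {"a": 0.1, "z": 0.8}, {"a": 0.4, "z": 0.3}]
    edge_score = {("a", "a"): 0.2, ("a", "z"): 0.6, ("z", "a"): 0.0, ("z", "z"): 0.5}

    def decode(node_score, edge_score):
        n = len(node_score)
        cost = [{y: -node_score[0][y] for y in LABELS}]   # cheapest cost from s to (0, y)
        back = []
        for i in range(1, n):
            cost.append({})
            back.append({})
            for y in LABELS:
                prev = min(LABELS, key=lambda yp: cost[i - 1][yp] - edge_score[(yp, y)])
                cost[i][y] = cost[i - 1][prev] - edge_score[(prev, y)] - node_score[i][y]
                back[i - 1][y] = prev
        y = min(LABELS, key=lambda yl: cost[n - 1][yl])   # edge into t costs 0
        path = [y]
        for i in range(n - 2, -1, -1):                    # trace the path back to s
            path.append(back[i][path[-1]])
        return path[::-1]

    print(decode(node_score, edge_score))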
54. Min-Cost Flow for Bipartite Matchings
[Figure: bipartite graph between the two node sets, with source s and sink t]
- Capacities C
- Edge costs; for edges from node s and into node t, cost 0
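As a concrete stand-in for solving this flow problem (not the solver used in the talk), a max-score bipartite matching can be computed with an off-the-shelf assignment solver on negated scores:

    # Max-score bipartite matching via the Hungarian-style assignment solver:
    # negate the scores so that minimizing cost maximizes total score.
    # The score matrix is a hypothetical toy example.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    scores = np.array([[0.9, 0.1, 0.3],
                       [0.2, 0.8, 0.4],
                       [0.5, 0.6, 0.7]])
    rows, cols = linear_sum_assignment(-scores)      # minimize negated score
    matching = list(zip(rows.tolist(), cols.tolist()))
    print(matching, "total score:", scores[rows, cols].sum())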
55. CFG Chart
- CNF tree = set of two types of parts:
  - Constituents (A, s, e)
  - CF-rules (A → B C, s, m, e)
56. CFG Inference LP
(inside and outside consistency)
Has integral solutions y for trees
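The LP itself was again an equation image; a hedged reconstruction of the inside/outside consistency constraints it most likely showed (the indicator variables \mu are an assumed symbol):

    \max_{\mu \ge 0} \;\; \sum_r \mu(r)\,\theta(r)
    \quad \text{s.t.} \quad
    \mu(A, s, e) = \sum_{B, C, m} \mu(A \to B\,C,\, s, m, e) \quad \text{(inside)}, \qquad
    \mu(A, s, e) = \sum_{\text{rules with child } (A, s, e)} \mu(\cdot) \quad \text{(outside, non-root constituents)}

with the root constituent fixed to 1; for trees this LP has integral solutions, matching the claim above.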