Title: Structured Prediction: A Large Margin Approach
1. Structured Prediction: A Large Margin Approach
- Ben Taskar
- University of Pennsylvania
- Joint work with V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan, D. Klein, D. Koller, S. Lacoste-Julien, C. Manning
2. "Don't worry, Howard. The big questions are multiple choice."
3. Handwriting Recognition
[Figure: input x = image of a handwritten word; output y = "brace"]
Sequential structure
4. Object Segmentation
[Figure: input x = scene; output y = segmentation labels]
Spatial structure
5. Natural Language Parsing
[Figure: input x = "The screen was a sea of red"; output y = parse tree]
Recursive structure
6. Bilingual Word Alignment
x: "En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?" (tokenized from "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?")
y: word alignment to "What is the anticipated cost of collecting fees under the new proposal ?"
Combinatorial structure
7. Protein Structure and Disulfide Bridges
AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK
Protein 1IMT
8. Local Prediction
[Figure: classify each letter of the handwritten word independently]
- Classify using local information
- Ignores correlations & constraints!
9. Local Prediction
10. Structured Prediction
[Figure: predict all letters jointly]
- Use local information
- Exploit correlations
11. Structured Prediction
12. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation
13. Structured Models
y* = argmax_{y ∈ Y(x)} w⊤f(x, y)
- w⊤f(x, y): scoring function; Y(x): space of feasible outputs
- Mild assumption: the score is a linear combination of features
14. Chain Markov Net (aka CRF)
[Figure: chain of labels y over observations x, with node and edge potentials]
Lafferty et al. 01
15. Chain Markov Net (aka CRF)
[Figure: the same chain, continued]
Lafferty et al. 01
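For concreteness, a minimal Viterbi (max-sum) decoder for the chain model; theta_node and theta_edge are hypothetical score arrays standing in for the learned w⊤f terms:

```python
import numpy as np

def viterbi_decode(theta_node, theta_edge):
    """Exact MAP for a chain Markov net / CRF:
    argmax_y sum_j theta_node[j, y_j] + sum_j theta_edge[j, y_j, y_{j+1}].
    theta_node: (n, K) node scores; theta_edge: (n-1, K, K) edge scores."""
    n, K = theta_node.shape
    score = theta_node[0].copy()         # best score of prefixes ending in each label
    back = np.zeros((n, K), dtype=int)   # backpointers
    for j in range(1, n):
        # cand[a, b]: best prefix ending in y_{j-1}=a extended with y_j=b
        cand = score[:, None] + theta_edge[j - 1] + theta_node[j][None, :]
        back[j] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    y = [int(score.argmax())]            # trace back from the best final label
    for j in range(n - 1, 0, -1):
        y.append(int(back[j, y[-1]]))
    return y[::-1]
```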
16. Associative Markov Nets
- Point features (spin-images, point height) → node potentials θ_j(y_j)
- Edge features (length of edge, edge orientation) → edge potentials θ_jk(y_j, y_k)
- Associative restriction: edge potentials only reward agreement (y_j = y_k)
17. CFG Parsing
(NP → DT NN), (PP → IN NP), (NN → sea)
18. Bilingual Word Alignment
"En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?" ↔ "What is the anticipated cost of collecting fees under the new proposal ?"
Features on each candidate edge (j, k):
- position
- orthography
- association
19. Disulfide Bonds: Non-bipartite Matching
RSCCPCYWGGCPWGQNCYPEGCSGPKV (cysteines numbered 1-6)
[Figure: candidate pairings of the six cysteines, e.g. {1-2, 3-4, 5-6}]
Fariselli & Casadio 01, Baldi et al. 04
20. Scoring Function
RSCCPCYWGGCPWGQNCYPEGCSGPKV (cysteines 1-6)
Score for each candidate C-C edge uses:
- amino acid identities
- phys/chem properties
21. Structured Models
y* = argmax_{y ∈ Y(x)} w⊤f(x, y)
- w⊤f(x, y): scoring function; Y(x): space of feasible outputs
- Mild assumptions:
  - linear combination of features
  - score is a sum of part scores
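As a sanity check on the definitions, a brute-force sketch of the prediction rule (w, labels, and feature_map are placeholders; real instances replace the enumeration with combinatorial optimization such as Viterbi, CKY, or matching):

```python
import itertools
import numpy as np

def predict_brute_force(w, x, labels, length, feature_map):
    """Structured prediction rule y* = argmax_{y in Y(x)} w . f(x, y).
    Enumerates all label sequences, so it is exponential in `length`
    and only usable on toy problems."""
    best_y, best_score = None, -np.inf
    for y in itertools.product(labels, repeat=length):
        score = float(w @ feature_map(x, y))   # linear scoring function
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```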
22. Supervised Structured Prediction
Data → Learning (estimate w) → Model → Prediction
- Prediction: weighted matching in the running example; generally, combinatorial optimization
- Learning objectives:
  - Local (ignores structure)
  - Likelihood (intractable)
  - Margin
23. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation
24. OCR Example
The correct output must outscore every alternative:
score("brace") > score("aaaaa"), score("brace") > score("aaaab"), ..., score("brace") > score("zzzzz")
That is a lot of constraints!
25. Parsing Example
[Figure: the correct parse of "It was red" must outscore every alternative parse — a lot of them!]
26. Alignment Example
[Figure: the correct alignment of "What is the" ↔ "Quel est le" (positions 1 2 3) must outscore every alternative alignment — a lot of them!]
27. Structured Loss
Hamming distance to the true word "brace":
  b c a r e → 2
  b r o r e → 2
  b r o c e → 1
  b r a c e → 0
The parsing ("It was red") and alignment ("What is the" ↔ "Quel est le") examples have analogous per-part losses (0, 1, 2, 2 and 0, 1, 2, 3 in the figure).
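A one-function sketch of the Hamming loss used above; the asserted values are the ones in the slide:

```python
def hamming_loss(y_true, y_pred):
    """Structured (Hamming) loss: number of parts where the prediction disagrees."""
    return sum(t != p for t, p in zip(y_true, y_pred))

assert hamming_loss("brace", "bcare") == 2
assert hamming_loss("brace", "broce") == 1
assert hamming_loss("brace", "brace") == 0
```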
28. Large Margin Estimation
- Given training examples (x^i, y^i), we want:
  w⊤f(x^i, y^i) > w⊤f(x^i, y) for all y ≠ y^i
- Maximize the margin between the correct output and all alternatives
- Mistake-weighted margin: scale the required margin by ℓ(y^i, y), the number of mistakes in y
Collins 02, Altun et al. 03, Taskar et al. 03
29. Large Margin Estimation
- Eliminate the margin variable: fix the functional margin and minimize ||w||²
- Add slacks ξ_i for the inseparable case
30. Large Margin Estimation
- Brute force enumeration
- Min-max formulation
  - Plug in a linear program for inference
31. Min-max Formulation
Fold the exponentially many constraints into one per example:
  w⊤f(x^i, y^i) ≥ max_y [ w⊤f(x^i, y) + ℓ(y^i, y) ]   (structured Hamming loss)
Key step: replace the inner max — inference, a discrete optimization — with an LP, a continuous optimization.
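For reference, the resulting margin-rescaled program written out (a reconstruction consistent with the formulation in Taskar et al. 04; C and the slacks ξ_i are the usual regularization ingredients):

```latex
\min_{w,\ \xi \ge 0} \quad \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\qquad \text{s.t.} \qquad
w^\top f(x^i, y^i) + \xi_i \ \ge\ \max_{y \in \mathcal{Y}(x^i)} \left[ w^\top f(x^i, y) + \ell(y^i, y) \right] \quad \forall i
```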
32. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation
33. y → z Map for Markov Nets
[Figure: each label y_j is encoded as node indicators z_j(a) ∈ {0,1} over the alphabet a..z, and each label pair as edge indicators z_jk(a, b); the 0/1 tables shown encode one particular labeling]
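A small sketch of the y → z encoding (alphabet and shapes are illustrative):

```python
import numpy as np

def y_to_z(y, alphabet):
    """Encode a label sequence y as node indicators z_j(a) and edge
    indicators z_jk(a, b): the variables of the inference LP below."""
    idx = {a: i for i, a in enumerate(alphabet)}
    n, K = len(y), len(alphabet)
    z_node = np.zeros((n, K))
    z_edge = np.zeros((n - 1, K, K))
    for j, a in enumerate(y):
        z_node[j, idx[a]] = 1.0                       # one-hot per position
    for j in range(n - 1):
        z_edge[j, idx[y[j]], idx[y[j + 1]]] = 1.0     # one-hot per edge
    return z_node, z_edge
```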
34. Markov Net Inference LP
maximize θ⊤z subject to:
- normalization: Σ_a z_j(a) = 1 for each node j
- agreement: Σ_b z_jk(a, b) = z_j(a) and Σ_a z_jk(a, b) = z_k(b) for each edge (j, k)
Has integral solutions z for chains, trees. Can be fractional for untriangulated networks.
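A runnable sketch of this LP for a chain using scipy; the constraints are exactly the normalization/agreement equalities above, and since the chain LP is integral, per-node argmax recovers the MAP labeling:

```python
import numpy as np
from scipy.optimize import linprog

def chain_map_lp(theta_node, theta_edge):
    """Chain Markov-net MAP as an LP over indicator variables z.
    theta_node: (n, K) node scores; theta_edge: (n-1, K, K) edge scores."""
    n, K = theta_node.shape
    n_node, n_edge = n * K, (n - 1) * K * K
    c = -np.concatenate([theta_node.ravel(), theta_edge.ravel()])  # linprog minimizes

    rows, rhs = [], []
    def new_row():
        return np.zeros(n_node + n_edge)

    for j in range(n):                      # normalization: sum_a z_j(a) = 1
        r = new_row(); r[j*K:(j+1)*K] = 1.0
        rows.append(r); rhs.append(1.0)
    for j in range(n - 1):                  # agreement with both endpoints
        base = n_node + j * K * K
        for a in range(K):                  # sum_b z_jk(a, b) = z_j(a)
            r = new_row(); r[base + a*K : base + (a+1)*K] = 1.0; r[j*K + a] = -1.0
            rows.append(r); rhs.append(0.0)
        for b in range(K):                  # sum_a z_jk(a, b) = z_k(b)
            r = new_row(); r[base + b : base + K*K : K] = 1.0; r[(j+1)*K + b] = -1.0
            rows.append(r); rhs.append(0.0)

    res = linprog(c, A_eq=np.array(rows), b_eq=np.array(rhs), bounds=(0, 1))
    z_node = res.x[:n_node].reshape(n, K)
    return z_node.argmax(axis=1)            # integral for chains/trees
```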
35. Associative MN Inference LP
Same LP with the associative restriction: edge indicators only for agreeing labels, z_jk(a, a).
- For K = 2 labels, solutions are always integral (optimal)
- For K > 2, within a factor of 2 of optimal
36. CFG Chart
- CNF tree = set of two types of parts:
  - Constituents (A, s, e)
  - CF rules (A → B C, s, m, e)
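A compact CKY sketch over exactly these two part types; the unary/binary score tables are hypothetical stand-ins for w⊤f of each part:

```python
from collections import defaultdict

def cky_best_parse(words, unary, binary):
    """Max-score CKY: chart[(A, s, e)] = best score of constituent (A, s, e).
    unary: dict word -> list of (A, score); binary: list of (A, B, C, score)
    for rules A -> B C."""
    n = len(words)
    chart = defaultdict(lambda: float("-inf"))
    back = {}
    for s, word in enumerate(words):
        for A, score in unary.get(word, []):
            chart[A, s, s + 1] = score
    for span in range(2, n + 1):
        for s in range(n - span + 1):
            e = s + span
            for m in range(s + 1, e):              # split point
                for A, B, C, score in binary:      # rule part (A -> B C, s, m, e)
                    cand = chart[B, s, m] + chart[C, m, e] + score
                    if cand > chart[A, s, e]:
                        chart[A, s, e] = cand
                        back[A, s, e] = (B, C, m)
    return chart, back
```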
37. CFG Inference LP
Variables z for constituents and rules; consistency constraints: root, inside, outside.
Has integral solutions z.
38. Matching Inference LP
"En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?" ↔ "What is the anticipated cost of collecting fees under the new proposal ?"
maximize Σ_jk θ_jk z_jk subject to degree constraints: Σ_k z_jk ≤ 1 for each j, Σ_j z_jk ≤ 1 for each k.
Has integral solutions z.
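Because this LP is integral, an ordinary assignment solver recovers the same answer; a toy check with scipy (the score matrix is made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical edge scores theta[j, k] for 3 x 3 candidate alignment edges.
theta = np.array([[2.0, 0.1, 0.3],
                  [0.2, 1.5, 0.4],
                  [0.1, 0.3, 1.8]])

# Maximize total score (negate because the solver minimizes cost).
rows, cols = linear_sum_assignment(-theta)
print(list(zip(rows, cols)))   # [(0, 0), (1, 1), (2, 2)]
```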
39. LP Duality
- Linear programming duality:
  - Variables ↔ constraints
  - Constraints ↔ variables
- Optimal values are the same (when both feasible regions are bounded)
40. Min-max Formulation
Apply LP duality to the inner inference LP: the inner max becomes a min, so the whole problem becomes a single joint convex program over w and the dual variables.
41. Min-max Formulation Summary
- Formulation produces a concise QP for:
  - Low-treewidth Markov networks
  - Associative MNs (K = 2)
  - Context-free grammars
  - Bipartite matchings
- Approximate for untriangulated MNs and AMNs with K > 2
Taskar et al. 04
42. Unfactored Primal/Dual
By QP duality: exponentially many constraints/variables.
43. Factored Primal/Dual
By QP duality, the dual inherits structure from the problem-specific inference LP: its variables correspond to a decomposition of the variables of the flat (unfactored) case.
44. The Connection
[Figure: dual weights over whole words — "bcare" .2, "brore" .15, "broce" .25, "brace" .4 — decompose into per-letter marginals: b 1 and e 1; r .8 vs c .2 (position 2); a .6 vs o .4 (position 3); c .65 vs r .35 (position 4)]
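The figure's numbers check out mechanically; a short sketch computing the per-letter marginals implied by the dual weights:

```python
from collections import defaultdict

# Dual weights over whole outputs (numbers from the slide).
alpha = {"bcare": .20, "brore": .15, "broce": .25, "brace": .40}

mu = defaultdict(float)        # factored dual: per-position letter marginals
for y, weight in alpha.items():
    for j, letter in enumerate(y):
        mu[j, letter] += weight

print(round(mu[1, "r"], 2))    # 0.8: "r" in position 2 of brore/broce/brace
print(round(mu[3, "c"], 2))    # 0.65: "c" in position 4 of broce/brace
```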
45. Duals and Kernels
- Kernel trick works
- Factored dual
- Local functions (log-potentials) can use kernels
46. Alternatives: Perceptron
- Simple iterative method
- Unstable for structured output: fewer instances, big updates
- May not converge if non-separable
- Noisy
- Voted / averaged perceptron (Freund & Schapire 99, Collins 02): regularize / reduce variance by aggregating over iterations
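A minimal sketch of the structured (Collins-style) perceptron; decode and feature_map are whatever inference routine and feature map the task provides:

```python
import numpy as np

def structured_perceptron(examples, feature_map, decode, n_features, epochs=10):
    """Mistake-driven updates: move weights toward the true output's
    features and away from the current argmax prediction's features."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y_true in examples:
            y_hat = decode(w, x)            # inference with current weights
            if y_hat != y_true:
                w += feature_map(x, y_true) - feature_map(x, y_hat)
    return w   # averaging the iterates gives the more stable averaged perceptron
```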
47. Alternatives: Constraint Generation
- Add the most violated constraint
  - Handles more general loss functions
  - Only a polynomial number of constraints needed
- Need to re-solve the QP many times
- Worst-case number of constraints larger than in the factored formulation
Collins 02; Altun et al. 03; Tsochantaridis et al. 04
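A sketch of the cutting-plane loop; solve_qp and loss_aug_decode are hypothetical hooks (a QP solver over the working constraint set, and loss-augmented inference):

```python
def constraint_generation(examples, loss, feature_map, loss_aug_decode,
                          solve_qp, eps=1e-3, max_iters=100):
    """Repeatedly add the most violated constraint per example, then
    re-solve the QP over the working set (Tsochantaridis et al. style)."""
    working_set = []
    w = solve_qp(working_set)
    for _ in range(max_iters):
        added = False
        for x, y_true in examples:
            y_hat = loss_aug_decode(w, x, y_true)   # argmax_y [w.f(x,y) + loss]
            violation = (loss(y_true, y_hat)
                         + w @ feature_map(x, y_hat)
                         - w @ feature_map(x, y_true))
            if violation > eps:                     # most violated constraint
                working_set.append((x, y_true, y_hat))
                added = True
        if not added:
            break                                   # all constraints satisfied
        w = solve_qp(working_set)                   # re-solve the QP
    return w
```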
48. Handwriting Recognition
- Length: 8 chars; letter: 16x8 pixels
- 10-fold train/test: 5,000/50,000 letters; 600/6,000 words
- Models: multiclass SVMs (Crammer & Singer 01), CRFs, M3 nets
[Chart: test error, average per-character; lower is better]
45% error reduction over linear CRFs; 33% error reduction over multiclass SVMs
49. Hypertext Classification
- WebKB dataset
- Four CS department websites: 1300 pages / 3500 links
- Classify each page: faculty, course, student, project, other
- Train on three universities, test on the fourth
[Chart: our relaxed dual vs. RMNs with loopy belief propagation; lower is better]
53% error reduction over SVMs; 38% error reduction over RMNs
Taskar et al. 02
50. 3D Mapping
Data provided by Michael Montemerlo & Sebastian Thrun
Sensors: laser range finder, GPS, IMU
Labels: ground, building, tree, shrub
Training: 30 thousand points; testing: 3 million points
51-54. (Figure-only slides)
55. Segmentation Results
Hand-labeled 180K test points
Model    Accuracy
SVM      68%
V-SVM    73%
M3N      93%
56. Fly-through
57. Word Alignment Results
Data: Hansards (Canadian Parliament)
Features induced on ~1 million unsupervised sentences; trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges)
Model                        Error
Local learning + matching    10.0
Our approach                  8.5
GIZA/IBM4 (Och & Ney 03)      6.5
Local learning + matching     5.4
Our approach                  4.9
Our approach + QAP            4.5
Error: weighted combination of precision/recall
Taskar et al. 05; Lacoste-Julien, Taskar et al. 06
58. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation
59. Certificate Formulation
- Non-bipartite matchings:
  - O(n³) combinatorial algorithm
  - No polynomial-size LP known
- Spanning trees:
  - No polynomial-size LP known
  - Simple certificate of optimality
- Intuition: verifying optimality is easier than optimizing
- Compact optimality condition of y^i with respect to w
60. Certificate for Non-bipartite Matching
- Alternating cycle: every other edge is in the matching
- Augmenting alternating cycle: score of edges not in the matching greater than edges in the matching
- Negate the score of edges not in the matching: an augmenting alternating cycle becomes a negative-length alternating cycle
- Matching is optimal ⟺ no negative-length alternating cycles
Edmonds 65
61. Certificate for Non-bipartite Matching
- Pick any node r as root
- d_j = length of the shortest alternating path from r to j
- Triangle inequality: d_j ≤ d_k + length(k, j)
- Theorem: no negative-length cycle ⟺ a distance function d exists
- Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints
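The theorem is the standard Bellman-Ford feasibility fact; a sketch of the check on an ordinary directed graph (the matching certificate runs the same idea on the alternating-path graph):

```python
def distance_function_exists(n, edges):
    """Returns True iff a d with d[j] <= d[k] + length(k, j) exists,
    i.e. iff there is no negative-length cycle (Bellman-Ford).
    edges: list of (k, j, length) tuples over nodes 0..n-1."""
    d = [0.0] * n                      # all-zero start suffices for detection
    for _ in range(n - 1):
        for k, j, length in edges:
            if d[k] + length < d[j]:
                d[j] = d[k] + length
    # any further improvement implies a negative cycle
    return all(d[k] + length >= d[j] for k, j, length in edges)
```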
62. Certificate Formulation
- Formulation produces a compact QP for:
  - Spanning trees
  - Non-bipartite matchings
  - Any problem with a compact optimality condition
Taskar et al. 05
63. Disulfide Bonding Prediction
- Data: Swiss-Prot 39
  - 450 sequences (4-10 cysteines)
- Features:
  - windows around each C-C pair
  - physical/chemical properties
AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK (C = cysteine)
Model                                    Accuracy
Local learning + matching                41%
Recursive Neural Net (Baldi et al. 04)   52%
Our approach (certificate)               55%
Accuracy: proteins with all bonds correct
Taskar et al. 05
64. Formulation Summary
- Brute force enumeration
- Min-max formulation
  - Plug in a convex program for inference
- Certificate formulation
  - Directly guarantee optimality of y^i
65. Omissions
- Kernels
  - Non-parametric models
- Structured generalization bounds
  - Bounds on Hamming loss
- Scalable algorithms (no QP solver needed)
  - Structured SMO (works for chains, trees): Taskar 04
  - Structured ExpGrad (works for chains, trees): Bartlett et al. 04
  - Structured ExtraGrad (works for matchings, AMNs): Taskar et al. 06
66. Open Questions
- Statistical consistency
  - Hinge loss is not consistent for non-binary output
  - See Tewari & Bartlett 05, McAllester 07
- Learning with approximate inference
  - Does constant-factor approximate inference guarantee anything about learning?
  - No: see Kulesza & Pereira 07
  - Perhaps other assumptions are needed
- Discriminative structure learning
  - Using sparsifying priors
67. Conclusion
- Two general techniques for structured large-margin estimation
- Exact, compact, convex formulations
  - Allow efficient use of kernels
  - Tractable when other estimation methods are not
- Efficient learning algorithms
- Empirical success on many domains
68. References
- Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. ICML 2003.
- M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP 2002.
- K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR 2001.
- J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML 2001.
- More papers at http://www.cis.upenn.edu/~taskar
69. (Figure-only slide)
70. Modeling First-Order Effects
Monotonicity, local inversion, local fertility
- QAP is NP-complete
- Sentences (≤30 words, ≈1k vars) → a few seconds (Mosek)
- Learning: use the LP relaxation
- Testing: using the LP, 83.5% of sentences and 99.85% of edges are integral
71. Segmentation Model → Min-Cut
Local evidence (node potentials for labels 0/1) plus spatial smoothness (edge potentials)
- Computing the MAP labeling is hard in general, but:
  - if edge potentials are attractive → min-cut algorithm
  - multiway cut for the multiclass case → use LP relaxation
Greig et al. 89, Boykov et al. 99, Kolmogorov & Zabih 02, Taskar et al. 04
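A toy binary instance of the reduction using networkx (the unary costs and smoothness weight are made up; source side = label 0, sink side = label 1):

```python
import networkx as nx

unary = {0: (0.2, 1.0), 1: (0.9, 0.4), 2: (0.8, 0.3)}  # j -> (cost of 0, cost of 1)
smooth = 0.5                                           # attractive pairwise penalty

G = nx.DiGraph()
for j, (c0, c1) in unary.items():
    G.add_edge("s", j, capacity=c1)   # cut if j lands on sink side (label 1)
    G.add_edge(j, "t", capacity=c0)   # cut if j lands on source side (label 0)
for j, k in [(0, 1), (1, 2)]:         # neighbors pay `smooth` if labels differ
    G.add_edge(j, k, capacity=smooth)
    G.add_edge(k, j, capacity=smooth)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
labels = {j: (0 if j in source_side else 1) for j in unary}
print(labels, cut_value)              # {0: 0, 1: 1, 2: 1} 1.4
```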
72. Scalable Algorithms
- Batch and online
- Linear in the size of the data
- Iterate until convergence:
  - For each example in the training sample:
    - Run inference using current parameters (varies by method)
    - Online: update parameters using the computed example values
  - Batch: update parameters using the computed sample values
Structured SMO (Taskar et al. 03; Taskar 04); Structured Exponentiated Gradient (Bartlett et al. 04); Structured Extragradient (Taskar et al. 05)
73. Experimental Setup
- Standard Penn treebank split (2-21/22/23)
- Generative baselines:
  - Klein & Manning 03 and Collins 99
- Discriminative:
  - Basic: max-margin version of KM 03
  - Lexical, Lexical + Aux:
    - Lexical features (on constituent parts only): predicted tags t_{s-1}, t_s, t_e, t_{e+1} and words x_{s-1}, x_s, x_e, x_{e+1}
  - Auxiliary features:
    - Flat classifier using the same features
    - Prediction of KM 03 on each span
74. Results for Sentences ≤ 40 Words
Model          LP      LR      F1
Generative     86.37   85.27   85.82
Lexical+Aux    87.56   86.85   87.20
Collins 99     85.33   85.94   85.73
Trained only on sentences ≤ 20 words
Taskar et al. 04
75. Example
- "The Egyptian president said he would visit Libya today to resume the talks."
- Generative model: "Libya today" is a base NP
- Lexical model: "today" is a one-word constituent