Title: Structured Prediction: A Large Margin Approach
1. Structured Prediction: A Large Margin Approach
- Ben Taskar
- University of Pennsylvania
- Joint work with V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan, D. Klein, D. Koller, S. Lacoste-Julien, C. Manning
2. "Don't worry, Howard. The big questions are multiple choice."
3. Handwriting Recognition
[Figure: input x = image of a handwritten word; output y = "brace"]
Sequential structure
4. Object Segmentation
[Figure: input x = scene; output y = segmentation labels]
Spatial structure
5. Natural Language Parsing
[Figure: input x = "The screen was a sea of red"; output y = parse tree]
Recursive structure
6. Bilingual Word Alignment
x: "En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?" (tokenized from "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?")
y: word alignment to "What is the anticipated cost of collecting fees under the new proposal ?"
Combinatorial structure
7. Protein Structure and Disulfide Bridges
AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK
Protein 1IMT
8. Local Prediction
[Figure: classify each letter of the handwritten word independently]
- Classify using local information
- Ignores correlations & constraints!
9. Local Prediction
10. Structured Prediction
[Figure: predict all letters jointly]
- Use local information
- Exploit correlations
11. Structured Prediction
12. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation
13. Structured Models
y* = argmax_{y ∈ Y(x)} w⊤f(x, y)
- w⊤f(x, y): scoring function; Y(x): space of feasible outputs
- Mild assumption: the score is a linear combination of features
14. Chain Markov Net (aka CRF)
[Figure: chain of labels y over observations x, with node and edge potentials]
Lafferty et al. 01
15. Chain Markov Net (aka CRF)
[Figure: the same chain, continued]
Lafferty et al. 01
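For concreteness, a minimal Viterbi (max-sum) decoder for the chain model; theta_node and theta_edge are hypothetical score arrays standing in for the learned w⊤f terms:

```python
import numpy as np

def viterbi_decode(theta_node, theta_edge):
    """Exact MAP for a chain Markov net / CRF:
    argmax_y sum_j theta_node[j, y_j] + sum_j theta_edge[j, y_j, y_{j+1}].
    theta_node: (n, K) node scores; theta_edge: (n-1, K, K) edge scores."""
    n, K = theta_node.shape
    score = theta_node[0].copy()         # best score of prefixes ending in each label
    back = np.zeros((n, K), dtype=int)   # backpointers
    for j in range(1, n):
        # cand[a, b]: best prefix ending in y_{j-1}=a extended with y_j=b
        cand = score[:, None] + theta_edge[j - 1] + theta_node[j][None, :]
        back[j] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    y = [int(score.argmax())]            # trace back from the best final label
    for j in range(n - 1, 0, -1):
        y.append(int(back[j, y[-1]]))
    return y[::-1]
```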
16. Associative Markov Nets
- Point features (spin-images, point height) → node potentials θ_j(y_j)
- Edge features (length of edge, edge orientation) → edge potentials θ_jk(y_j, y_k)
- Associative restriction: edge potentials only reward agreement (y_j = y_k)
17. CFG Parsing
(NP → DT NN), (PP → IN NP), (NN → sea)
18. Bilingual Word Alignment
"En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?" ↔ "What is the anticipated cost of collecting fees under the new proposal ?"
Features on each candidate edge (j, k):
- position
- orthography
- association
19. Disulfide Bonds: Non-bipartite Matching
RSCCPCYWGGCPWGQNCYPEGCSGPKV (cysteines numbered 1-6)
[Figure: candidate pairings of the six cysteines, e.g. {1-2, 3-4, 5-6}]
Fariselli & Casadio 01, Baldi et al. 04
20. Scoring Function
RSCCPCYWGGCPWGQNCYPEGCSGPKV (cysteines 1-6)
Score for each candidate C-C edge uses:
- amino acid identities
- phys/chem properties
21. Structured Models
y* = argmax_{y ∈ Y(x)} w⊤f(x, y)
- w⊤f(x, y): scoring function; Y(x): space of feasible outputs
- Mild assumptions:
  - linear combination of features
  - score is a sum of part scores
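As a sanity check on the definitions, a brute-force sketch of the prediction rule (w, labels, and feature_map are placeholders; real instances replace the enumeration with combinatorial optimization such as Viterbi, CKY, or matching):

```python
import itertools
import numpy as np

def predict_brute_force(w, x, labels, length, feature_map):
    """Structured prediction rule y* = argmax_{y in Y(x)} w . f(x, y).
    Enumerates all label sequences, so it is exponential in `length`
    and only usable on toy problems."""
    best_y, best_score = None, -np.inf
    for y in itertools.product(labels, repeat=length):
        score = float(w @ feature_map(x, y))   # linear scoring function
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```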
22. Supervised Structured Prediction
Data → Learning (estimate w) → Model → Prediction
- Prediction: weighted matching in the running example; generally, combinatorial optimization
- Learning objectives:
  - Local (ignores structure)
  - Likelihood (intractable)
  - Margin
23. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation
24. OCR Example
The correct output must outscore every alternative:
score("brace") > score("aaaaa"), score("brace") > score("aaaab"), ..., score("brace") > score("zzzzz")
That is a lot of constraints!
25. Parsing Example
[Figure: the correct parse of "It was red" must outscore every alternative parse — a lot of them!]
26. Alignment Example
[Figure: the correct alignment of "What is the" ↔ "Quel est le" (positions 1 2 3) must outscore every alternative alignment — a lot of them!]
27. Structured Loss
Hamming distance to the true word "brace":
  b c a r e → 2
  b r o r e → 2
  b r o c e → 1
  b r a c e → 0
The parsing ("It was red") and alignment ("What is the" ↔ "Quel est le") examples have analogous per-part losses (0, 1, 2, 2 and 0, 1, 2, 3 in the figure).
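A one-function sketch of the Hamming loss used above; the asserted values are the ones in the slide:

```python
def hamming_loss(y_true, y_pred):
    """Structured (Hamming) loss: number of parts where the prediction disagrees."""
    return sum(t != p for t, p in zip(y_true, y_pred))

assert hamming_loss("brace", "bcare") == 2
assert hamming_loss("brace", "broce") == 1
assert hamming_loss("brace", "brace") == 0
```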
28. Large Margin Estimation
- Given training examples (x^i, y^i), we want:
  w⊤f(x^i, y^i) > w⊤f(x^i, y) for all y ≠ y^i
- Maximize the margin between the correct output and all alternatives
- Mistake-weighted margin: scale the required margin by ℓ(y^i, y), the number of mistakes in y
Collins 02, Altun et al. 03, Taskar et al. 03
29. Large Margin Estimation
- Eliminate the margin variable: fix the functional margin and minimize ||w||²
- Add slacks ξ_i for the inseparable case
30. Large Margin Estimation
- Brute force enumeration
- Min-max formulation
  - Plug in a linear program for inference
31. Min-max Formulation
Fold the exponentially many constraints into one per example:
  w⊤f(x^i, y^i) ≥ max_y [ w⊤f(x^i, y) + ℓ(y^i, y) ]   (structured Hamming loss)
Key step: replace the inner max — inference, a discrete optimization — with an LP, a continuous optimization.
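For reference, the resulting margin-rescaled program written out (a reconstruction consistent with the formulation in Taskar et al. 04; C and the slacks ξ_i are the usual regularization ingredients):

```latex
\min_{w,\ \xi \ge 0} \quad \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\qquad \text{s.t.} \qquad
w^\top f(x^i, y^i) + \xi_i \ \ge\ \max_{y \in \mathcal{Y}(x^i)} \left[ w^\top f(x^i, y) + \ell(y^i, y) \right] \quad \forall i
```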
32. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation
33. y → z Map for Markov Nets
[Figure: each label y_j is encoded as node indicators z_j(a) ∈ {0,1} over the alphabet a..z, and each label pair as edge indicators z_jk(a, b); the 0/1 tables shown encode one particular labeling]
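A small sketch of the y → z encoding (alphabet and shapes are illustrative):

```python
import numpy as np

def y_to_z(y, alphabet):
    """Encode a label sequence y as node indicators z_j(a) and edge
    indicators z_jk(a, b): the variables of the inference LP below."""
    idx = {a: i for i, a in enumerate(alphabet)}
    n, K = len(y), len(alphabet)
    z_node = np.zeros((n, K))
    z_edge = np.zeros((n - 1, K, K))
    for j, a in enumerate(y):
        z_node[j, idx[a]] = 1.0                       # one-hot per position
    for j in range(n - 1):
        z_edge[j, idx[y[j]], idx[y[j + 1]]] = 1.0     # one-hot per edge
    return z_node, z_edge
```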
34. Markov Net Inference LP
maximize θ⊤z subject to:
- normalization: Σ_a z_j(a) = 1 for each node j
- agreement: Σ_b z_jk(a, b) = z_j(a) and Σ_a z_jk(a, b) = z_k(b) for each edge (j, k)
Has integral solutions z for chains, trees. Can be fractional for untriangulated networks.
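A runnable sketch of this LP for a chain using scipy; the constraints are exactly the normalization/agreement equalities above, and since the chain LP is integral, per-node argmax recovers the MAP labeling:

```python
import numpy as np
from scipy.optimize import linprog

def chain_map_lp(theta_node, theta_edge):
    """Chain Markov-net MAP as an LP over indicator variables z.
    theta_node: (n, K) node scores; theta_edge: (n-1, K, K) edge scores."""
    n, K = theta_node.shape
    n_node, n_edge = n * K, (n - 1) * K * K
    c = -np.concatenate([theta_node.ravel(), theta_edge.ravel()])  # linprog minimizes

    rows, rhs = [], []
    def new_row():
        return np.zeros(n_node + n_edge)

    for j in range(n):                      # normalization: sum_a z_j(a) = 1
        r = new_row(); r[j*K:(j+1)*K] = 1.0
        rows.append(r); rhs.append(1.0)
    for j in range(n - 1):                  # agreement with both endpoints
        base = n_node + j * K * K
        for a in range(K):                  # sum_b z_jk(a, b) = z_j(a)
            r = new_row(); r[base + a*K : base + (a+1)*K] = 1.0; r[j*K + a] = -1.0
            rows.append(r); rhs.append(0.0)
        for b in range(K):                  # sum_a z_jk(a, b) = z_k(b)
            r = new_row(); r[base + b : base + K*K : K] = 1.0; r[(j+1)*K + b] = -1.0
            rows.append(r); rhs.append(0.0)

    res = linprog(c, A_eq=np.array(rows), b_eq=np.array(rhs), bounds=(0, 1))
    z_node = res.x[:n_node].reshape(n, K)
    return z_node.argmax(axis=1)            # integral for chains/trees
```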
35. Associative MN Inference LP
Same LP with the associative restriction: edge indicators only for agreeing labels, z_jk(a, a).
- For K = 2 labels, solutions are always integral (optimal)
- For K > 2, within a factor of 2 of optimal
36. CFG Chart
- CNF tree = set of two types of parts:
  - Constituents (A, s, e)
  - CF rules (A → B C, s, m, e)
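A compact CKY sketch over exactly these two part types; the unary/binary score tables are hypothetical stand-ins for w⊤f of each part:

```python
from collections import defaultdict

def cky_best_parse(words, unary, binary):
    """Max-score CKY: chart[(A, s, e)] = best score of constituent (A, s, e).
    unary: dict word -> list of (A, score); binary: list of (A, B, C, score)
    for rules A -> B C."""
    n = len(words)
    chart = defaultdict(lambda: float("-inf"))
    back = {}
    for s, word in enumerate(words):
        for A, score in unary.get(word, []):
            chart[A, s, s + 1] = score
    for span in range(2, n + 1):
        for s in range(n - span + 1):
            e = s + span
            for m in range(s + 1, e):              # split point
                for A, B, C, score in binary:      # rule part (A -> B C, s, m, e)
                    cand = chart[B, s, m] + chart[C, m, e] + score
                    if cand > chart[A, s, e]:
                        chart[A, s, e] = cand
                        back[A, s, e] = (B, C, m)
    return chart, back
```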
37. CFG Inference LP
Variables z for constituents and rules; consistency constraints: root, inside, outside.
Has integral solutions z.
38. Matching Inference LP
"En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?" ↔ "What is the anticipated cost of collecting fees under the new proposal ?"
maximize Σ_jk θ_jk z_jk subject to degree constraints: Σ_k z_jk ≤ 1 for each j, Σ_j z_jk ≤ 1 for each k.
Has integral solutions z.
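Because this LP is integral, an ordinary assignment solver recovers the same answer; a toy check with scipy (the score matrix is made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical edge scores theta[j, k] for 3 x 3 candidate alignment edges.
theta = np.array([[2.0, 0.1, 0.3],
                  [0.2, 1.5, 0.4],
                  [0.1, 0.3, 1.8]])

# Maximize total score (negate because the solver minimizes cost).
rows, cols = linear_sum_assignment(-theta)
print(list(zip(rows, cols)))   # [(0, 0), (1, 1), (2, 2)]
```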
39. LP Duality
- Linear programming duality:
  - Variables ↔ constraints
  - Constraints ↔ variables
- Optimal values are the same (when both feasible regions are bounded)
40. Min-max Formulation
Apply LP duality to the inner inference LP: the inner max becomes a min, so the whole problem becomes a single joint convex program over w and the dual variables.
41. Min-max Formulation Summary
- Formulation produces a concise QP for:
  - Low-treewidth Markov networks
  - Associative MNs (K = 2)
  - Context-free grammars
  - Bipartite matchings
- Approximate for untriangulated MNs and AMNs with K > 2
Taskar et al. 04
42. Unfactored Primal/Dual
By QP duality: exponentially many constraints/variables.
43. Factored Primal/Dual
By QP duality, the dual inherits structure from the problem-specific inference LP: its variables correspond to a decomposition of the variables of the flat (unfactored) case.
44. The Connection
[Figure: dual weights over whole words — "bcare" .2, "brore" .15, "broce" .25, "brace" .4 — decompose into per-letter marginals: b 1 and e 1; r .8 vs c .2 (position 2); a .6 vs o .4 (position 3); c .65 vs r .35 (position 4)]
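The figure's numbers check out mechanically; a short sketch computing the per-letter marginals implied by the dual weights:

```python
from collections import defaultdict

# Dual weights over whole outputs (numbers from the slide).
alpha = {"bcare": .20, "brore": .15, "broce": .25, "brace": .40}

mu = defaultdict(float)        # factored dual: per-position letter marginals
for y, weight in alpha.items():
    for j, letter in enumerate(y):
        mu[j, letter] += weight

print(round(mu[1, "r"], 2))    # 0.8: "r" in position 2 of brore/broce/brace
print(round(mu[3, "c"], 2))    # 0.65: "c" in position 4 of broce/brace
```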
45. Duals and Kernels
- Kernel trick works
- Factored dual
- Local functions (log-potentials) can use kernels
46. Alternatives: Perceptron
- Simple iterative method
- Unstable for structured output: fewer instances, big updates
- May not converge if non-separable
- Noisy
- Voted / averaged perceptron (Freund & Schapire 99, Collins 02): regularize / reduce variance by aggregating over iterations
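A minimal sketch of the structured (Collins-style) perceptron; decode and feature_map are whatever inference routine and feature map the task provides:

```python
import numpy as np

def structured_perceptron(examples, feature_map, decode, n_features, epochs=10):
    """Mistake-driven updates: move weights toward the true output's
    features and away from the current argmax prediction's features."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y_true in examples:
            y_hat = decode(w, x)            # inference with current weights
            if y_hat != y_true:
                w += feature_map(x, y_true) - feature_map(x, y_hat)
    return w   # averaging the iterates gives the more stable averaged perceptron
```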
47. Alternatives: Constraint Generation
- Add the most violated constraint
  - Handles more general loss functions
  - Only a polynomial number of constraints needed
- Need to re-solve the QP many times
- Worst-case number of constraints larger than in the factored formulation
Collins 02; Altun et al. 03; Tsochantaridis et al. 04
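A sketch of the cutting-plane loop; solve_qp and loss_aug_decode are hypothetical hooks (a QP solver over the working constraint set, and loss-augmented inference):

```python
def constraint_generation(examples, loss, feature_map, loss_aug_decode,
                          solve_qp, eps=1e-3, max_iters=100):
    """Repeatedly add the most violated constraint per example, then
    re-solve the QP over the working set (Tsochantaridis et al. style)."""
    working_set = []
    w = solve_qp(working_set)
    for _ in range(max_iters):
        added = False
        for x, y_true in examples:
            y_hat = loss_aug_decode(w, x, y_true)   # argmax_y [w.f(x,y) + loss]
            violation = (loss(y_true, y_hat)
                         + w @ feature_map(x, y_hat)
                         - w @ feature_map(x, y_true))
            if violation > eps:                     # most violated constraint
                working_set.append((x, y_true, y_hat))
                added = True
        if not added:
            break                                   # all constraints satisfied
        w = solve_qp(working_set)                   # re-solve the QP
    return w
```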
48. Handwriting Recognition
- Length: 8 chars; letter: 16x8 pixels
- 10-fold train/test: 5,000/50,000 letters; 600/6,000 words
- Models: multiclass SVMs (Crammer & Singer 01), CRFs, M3 nets
[Chart: test error, average per-character; lower is better]
45% error reduction over linear CRFs; 33% error reduction over multiclass SVMs
49. Hypertext Classification
- WebKB dataset
- Four CS department websites: 1300 pages / 3500 links
- Classify each page: faculty, course, student, project, other
- Train on three universities, test on the fourth
[Chart: our relaxed dual vs. RMNs with loopy belief propagation; lower is better]
53% error reduction over SVMs; 38% error reduction over RMNs
Taskar et al. 02
50. 3D Mapping
Data provided by Michael Montemerlo & Sebastian Thrun
Sensors: laser range finder, GPS, IMU
Labels: ground, building, tree, shrub
Training: 30 thousand points; testing: 3 million points
51-54. (Figure-only slides)
55. Segmentation Results
Hand-labeled 180K test points
Model    Accuracy
SVM      68%
V-SVM    73%
M3N      93%
56. Fly-through
57. Word Alignment Results
Data: Hansards (Canadian Parliament)
Features induced on ~1 million unsupervised sentences; trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges)
Model                        Error
Local learning + matching    10.0
Our approach                  8.5
GIZA/IBM4 (Och & Ney 03)      6.5
Local learning + matching     5.4
Our approach                  4.9
Our approach + QAP            4.5
Error: weighted combination of precision/recall
Taskar et al. 05; Lacoste-Julien, Taskar et al. 06
58. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (special MRFs)
  - Matchings
- Structured large margin estimation
  - Margins and structure
  - Min-max formulation
  - Linear programming inference
  - Certificate formulation
59. Certificate Formulation
- Non-bipartite matchings:
  - O(n³) combinatorial algorithm
  - No polynomial-size LP known
- Spanning trees:
  - No polynomial-size LP known
  - Simple certificate of optimality
- Intuition: verifying optimality is easier than optimizing
- Compact optimality condition of y^i with respect to w
60. Certificate for Non-bipartite Matching
- Alternating cycle: every other edge is in the matching
- Augmenting alternating cycle: score of edges not in the matching greater than edges in the matching
- Negate the score of edges not in the matching: an augmenting alternating cycle becomes a negative-length alternating cycle
- Matching is optimal ⟺ no negative-length alternating cycles
Edmonds 65
61. Certificate for Non-bipartite Matching
- Pick any node r as root
- d_j = length of the shortest alternating path from r to j
- Triangle inequality: d_j ≤ d_k + length(k, j)
- Theorem: no negative-length cycle ⟺ a distance function d exists
- Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints
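The theorem is the standard Bellman-Ford feasibility fact; a sketch of the check on an ordinary directed graph (the matching certificate runs the same idea on the alternating-path graph):

```python
def distance_function_exists(n, edges):
    """Returns True iff a d with d[j] <= d[k] + length(k, j) exists,
    i.e. iff there is no negative-length cycle (Bellman-Ford).
    edges: list of (k, j, length) tuples over nodes 0..n-1."""
    d = [0.0] * n                      # all-zero start suffices for detection
    for _ in range(n - 1):
        for k, j, length in edges:
            if d[k] + length < d[j]:
                d[j] = d[k] + length
    # any further improvement implies a negative cycle
    return all(d[k] + length >= d[j] for k, j, length in edges)
```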
62. Certificate Formulation
- Formulation produces a compact QP for:
  - Spanning trees
  - Non-bipartite matchings
  - Any problem with a compact optimality condition
Taskar et al. 05
63. Disulfide Bonding Prediction
- Data: Swiss-Prot 39
  - 450 sequences (4-10 cysteines)
- Features:
  - windows around each C-C pair
  - physical/chemical properties
AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK (C = cysteine)
Model                                    Accuracy
Local learning + matching                41%
Recursive Neural Net (Baldi et al. 04)   52%
Our approach (certificate)               55%
Accuracy: proteins with all bonds correct
Taskar et al. 05
64. Formulation Summary
- Brute force enumeration
- Min-max formulation
  - Plug in a convex program for inference
- Certificate formulation
  - Directly guarantee optimality of y^i
65. Omissions
- Kernels
  - Non-parametric models
- Structured generalization bounds
  - Bounds on Hamming loss
- Scalable algorithms (no QP solver needed)
  - Structured SMO (works for chains, trees): Taskar 04
  - Structured ExpGrad (works for chains, trees): Bartlett et al. 04
  - Structured ExtraGrad (works for matchings, AMNs): Taskar et al. 06
66. Open Questions
- Statistical consistency
  - Hinge loss is not consistent for non-binary output
  - See Tewari & Bartlett 05, McAllester 07
- Learning with approximate inference
  - Does constant-factor approximate inference guarantee anything about learning?
  - No: see Kulesza & Pereira 07
  - Perhaps other assumptions are needed
- Discriminative structure learning
  - Using sparsifying priors
67. Conclusion
- Two general techniques for structured large-margin estimation
- Exact, compact, convex formulations
  - Allow efficient use of kernels
  - Tractable when other estimation methods are not
- Efficient learning algorithms
- Empirical success on many domains
68. References
- Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. ICML 2003.
- M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP 2002.
- K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR 2001.
- J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML 2001.
- More papers at http://www.cis.upenn.edu/~taskar
69. (Figure-only slide)
70. Modeling First-Order Effects
Monotonicity, local inversion, local fertility
- QAP is NP-complete
- Sentences (≤30 words, ≈1k vars) → a few seconds (Mosek)
- Learning: use the LP relaxation
- Testing: using the LP, 83.5% of sentences and 99.85% of edges are integral
71. Segmentation Model → Min-Cut
Local evidence (node potentials for labels 0/1) plus spatial smoothness (edge potentials)
- Computing the MAP labeling is hard in general, but:
  - if edge potentials are attractive → min-cut algorithm
  - multiway cut for the multiclass case → use LP relaxation
Greig et al. 89, Boykov et al. 99, Kolmogorov & Zabih 02, Taskar et al. 04
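A toy binary instance of the reduction using networkx (the unary costs and smoothness weight are made up; source side = label 0, sink side = label 1):

```python
import networkx as nx

unary = {0: (0.2, 1.0), 1: (0.9, 0.4), 2: (0.8, 0.3)}  # j -> (cost of 0, cost of 1)
smooth = 0.5                                           # attractive pairwise penalty

G = nx.DiGraph()
for j, (c0, c1) in unary.items():
    G.add_edge("s", j, capacity=c1)   # cut if j lands on sink side (label 1)
    G.add_edge(j, "t", capacity=c0)   # cut if j lands on source side (label 0)
for j, k in [(0, 1), (1, 2)]:         # neighbors pay `smooth` if labels differ
    G.add_edge(j, k, capacity=smooth)
    G.add_edge(k, j, capacity=smooth)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
labels = {j: (0 if j in source_side else 1) for j in unary}
print(labels, cut_value)              # {0: 0, 1: 1, 2: 1} 1.4
```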
72. Scalable Algorithms
- Batch and online
- Linear in the size of the data
- Iterate until convergence:
  - For each example in the training sample:
    - Run inference using current parameters (varies by method)
    - Online: update parameters using the computed example values
  - Batch: update parameters using the computed sample values
Structured SMO (Taskar et al. 03; Taskar 04); Structured Exponentiated Gradient (Bartlett et al. 04); Structured Extragradient (Taskar et al. 05)
73. Experimental Setup
- Standard Penn treebank split (2-21/22/23)
- Generative baselines:
  - Klein & Manning 03 and Collins 99
- Discriminative:
  - Basic: max-margin version of KM 03
  - Lexical, Lexical + Aux:
    - Lexical features (on constituent parts only): predicted tags t_{s-1}, t_s, t_e, t_{e+1} and words x_{s-1}, x_s, x_e, x_{e+1}
  - Auxiliary features:
    - Flat classifier using the same features
    - Prediction of KM 03 on each span
74. Results for Sentences ≤ 40 Words
Model          LP      LR      F1
Generative     86.37   85.27   85.82
Lexical+Aux    87.56   86.85   87.20
Collins 99     85.33   85.94   85.73
Trained only on sentences ≤ 20 words
Taskar et al. 04
75. Example
- "The Egyptian president said he would visit Libya today to resume the talks."
- Generative model: "Libya today" is a base NP
- Lexical model: "today" is a one-word constituent