Title: Learning Structured Prediction Models: A Large Margin Approach
1. Learning Structured Prediction Models: A Large Margin Approach
- Ben Taskar
- U.C. Berkeley
- Vassil Chatalbashev, Michael Collins, Carlos Guestrin, Dan Klein, Daphne Koller, Chris Manning
2. "Don't worry, Howard. The big questions are multiple choice."
3. Handwriting recognition
x: image of a handwritten word → y: "brace"
Sequential structure
4. Object segmentation
x: 3D point cloud → y: object labels
Spatial structure
5. Natural language parsing
x: "The screen was a sea of red" → y: parse tree
Recursive structure
6. Disulfide connectivity prediction
x: RSCCPCYWGGCPWGQNCYPEGCSGPKV → y: bonding pattern over the cysteines
Combinatorial structure
7. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (Special MRFs)
  - Matchings
- Geometric View
  - Structured model polytopes
  - Linear programming inference
- Structured large margin estimation
  - Min-max formulation
  - Application: 3D object segmentation
  - Certificate formulation
  - Application: disulfide connectivity prediction
8. Structured models
Prediction: y* = argmax_{y ∈ Y(x)} w^T f(x, y), where w^T f(x, y) is the scoring function and Y(x) is the space of feasible outputs.
- Mild assumption: the score is a linear combination of features.
9. Chain Markov Net (aka CRF)
P(y|x) ∝ ∏_i φ(x_i, y_i) ∏_i φ(y_i, y_{i+1})
φ(x_i, y_i) = exp{ Σ_a w_a f_a(x_i, y_i) }
φ(y_i, y_{i+1}) = exp{ Σ_b w_b f_b(y_i, y_{i+1}) }
Example features: f_b(y_i, y_{i+1}) = I(y_i = 'z', y_{i+1} = 'a');  f_a(x_i, y_i) = I(x_i has pixel p on, y_i = 'z')
Lafferty et al. 01
10. Chain Markov Net (aka CRF)
P(y|x) ∝ ∏_i φ(x_i, y_i) ∏_i φ(y_i, y_{i+1})
w = [w_a, ..., w_b],  f(x, y) = [Σ_i f_a(x_i, y_i), ..., Σ_i f_b(y_i, y_{i+1})]
∏_i φ(x_i, y_i) = exp{ w_a Σ_i f_a(x_i, y_i) }
∏_i φ(y_i, y_{i+1}) = exp{ w_b Σ_i f_b(y_i, y_{i+1}) }
Aggregate features are counts: f_b(x, y) = #(y_i = 'z', y_{i+1} = 'a');  f_a(x, y) = #(pixel p on, y_i = 'z')
Lafferty et al. 01
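A minimal sketch (not from the talk; the weights, toy alphabet, and feature keys are hypothetical) of the linear scoring function w^T f(x, y) above: node and transition feature scores are summed along the chain, and the argmax is taken by brute force over a tiny alphabet.

    # Minimal sketch of the chain scoring function w^T f(x, y):
    # node (emission) scores plus transition scores, summed along the chain.
    # All weights and the tiny alphabet are hypothetical illustration values.
    from itertools import product

    LABELS = ["a", "b"]                              # toy label alphabet
    w_node = {("x1", "a"): 1.0, ("x1", "b"): 0.2,    # w_a: token/label weights
              ("x2", "a"): 0.1, ("x2", "b"): 0.9}
    w_edge = {("a", "a"): 0.3, ("a", "b"): 0.5,      # w_b: transition weights
              ("b", "a"): 0.0, ("b", "b"): 0.4}

    def score(x, y):
        """w^T f(x, y) = sum of node scores + sum of transition scores."""
        node = sum(w_node.get((xi, yi), 0.0) for xi, yi in zip(x, y))
        edge = sum(w_edge.get((y[i], y[i + 1]), 0.0) for i in range(len(y) - 1))
        return node + edge

    def argmax_brute_force(x):
        """Enumerate all |LABELS|^n label sequences; fine for toy chains only."""
        return max(product(LABELS, repeat=len(x)), key=lambda y: score(x, y))

    x = ["x1", "x2"]
    best = argmax_brute_force(x)
    print(best, score(x, best))

Brute force is only for illustration; the point of the following slides is that this argmax can instead be computed by dynamic programming or, more generally, by linear programming.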
11. Associative Markov Nets
Node potentials φ_i(y_i), edge potentials φ_ij(y_i, y_j)
- Point features: spin-images, point height
- Edge features: length of edge, edge orientation
- Associative restriction: edge potentials only reward agreement between y_i and y_j
12. PCFG
(NP → DT NN)  (PP → IN NP)  (NN → sea)
13. Disulfide bonds: non-bipartite matching
RSCCPCYWGGCPWGQNCYPEGCSGPKV — cysteines numbered 1-6; the output is a perfect matching on the cysteines (the figure shows the pairing 1-6, 2-5, 3-4).
Fariselli & Casadio 01, Baldi et al. 04
14. Scoring function
[Figure: two candidate bonding patterns over the six cysteines of RSCCPCYWGGCPWGQNCYPEGCSGPKV, compared by score]
The score of a candidate matching decomposes over the bonded pairs.
- String features: residues, physical properties
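A minimal sketch (details assumed, not from the talk) of such a pairwise scoring function: each bonded cysteine pair contributes w^T f(pair), with a hypothetical residue-window feature standing in for the window profiles used in the experiments later on.

    # Sketch: score of a candidate matching = sum of per-pair scores w^T f(pair).
    # The window feature and the weights below are hypothetical illustration values.
    SEQ = "RSCCPCYWGGCPWGQNCYPEGCSGPKV"
    CYS = [i for i, aa in enumerate(SEQ) if aa == "C"]   # cysteine positions

    def pair_features(i, j, window=2):
        """Toy features: residue counts in a small window around each bonded cysteine."""
        ctx = SEQ[max(0, i - window): i + window + 1] + SEQ[max(0, j - window): j + window + 1]
        return {("res", aa): ctx.count(aa) for aa in set(ctx)}

    def pair_score(i, j, w):
        return sum(w.get(k, 0.0) * v for k, v in pair_features(i, j).items())

    def matching_score(matching, w):
        """matching: list of (i, j) sequence positions; the score decomposes over pairs."""
        return sum(pair_score(i, j, w) for i, j in matching)

    w = {("res", "C"): 0.5, ("res", "Y"): 0.3}           # hypothetical learned weights
    candidate = [(CYS[0], CYS[5]), (CYS[1], CYS[4]), (CYS[2], CYS[3])]
    print(matching_score(candidate, w))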
15. Structured models
Prediction: y* = argmax_{y ∈ Y(x)} w^T f(x, y)
- Mild assumption: the score is a linear combination of features.
- Another mild assumption: the argmax can be computed by linear programming.
16. MAP inference ⇒ linear program
- LP inference for:
  - Chains
  - Trees
  - Associative Markov Nets
  - Bipartite Matchings
  - ...
17. Markov Net Inference LP
Has integral solutions y for chains and trees; gives an upper bound for general networks (the LP is written out below).
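The LP itself appeared as an equation image on the slide; a standard reconstruction of the local-consistency (pairwise marginal) relaxation it refers to, in LaTeX, with node scores \theta_i, edge scores \theta_{ij}, and relaxed indicator variables \mu (the symbol names are assumptions):

    \max_{\mu \ge 0} \;\; \sum_i \sum_{y_i} \mu_i(y_i)\,\theta_i(y_i)
                    + \sum_{ij} \sum_{y_i, y_j} \mu_{ij}(y_i, y_j)\,\theta_{ij}(y_i, y_j)
    \quad \text{s.t.} \quad \sum_{y_i} \mu_i(y_i) = 1 \;\; \forall i, \qquad
                            \sum_{y_j} \mu_{ij}(y_i, y_j) = \mu_i(y_i) \;\; \forall ij,\, y_i

Integral vertices of this polytope encode assignments y; for chains and trees every vertex is integral, which is the claim above.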
18. Associative MN Inference LP
(with the associative restriction)
- For K = 2 labels, solutions are always integral (optimal)
- For K > 2, within a factor of 2 of optimal
- Constraint matrix A is linear in the number of nodes and edges, regardless of the tree-width
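The AMN LP was also an equation image; a reconstruction consistent with the associative restriction (only agreement between neighbors is rewarded, so a single variable \mu_{ij}(k) per edge and label suffices; symbol names are assumptions):

    \max_{\mu \ge 0} \;\; \sum_i \sum_k \mu_i(k)\,\theta_i(k) + \sum_{ij} \sum_k \mu_{ij}(k)\,\theta_{ij}(k)
    \quad \text{s.t.} \quad \sum_k \mu_i(k) = 1 \;\; \forall i, \qquad
                            \mu_{ij}(k) \le \mu_i(k), \;\; \mu_{ij}(k) \le \mu_j(k) \;\; \forall ij,\, k

with \theta_{ij}(k) \ge 0; the K = 2 integrality and factor-of-2 guarantees above refer to this relaxation.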
19. Other Inference LPs
- Context-free parsing
- Dynamic programs
- Bipartite matching
- Network flow
- Many other combinatorial problems
20. Outline
- Structured prediction models
  - Sequences (CRFs)
  - Trees (CFGs)
  - Associative Markov networks (Special MRFs)
  - Matchings
- Geometric View
  - Structured model polytopes
  - Linear programming inference
- Structured large margin estimation
  - Min-max formulation
  - Application: 3D object segmentation
  - Certificate formulation
  - Application: disulfide connectivity prediction
21. Learning w
- Training example (x, y)
- Probabilistic approach: maximize conditional likelihood
- Problem: computing Z_w(x) is #P-complete
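The objective on this slide was an equation image; the standard form of the conditional likelihood it refers to is, in LaTeX:

    \max_w \; \log P_w(y \mid x) \;=\; w^\top f(x, y) - \log Z_w(x),
    \qquad Z_w(x) \;=\; \sum_{y' \in \mathcal{Y}(x)} \exp\{ w^\top f(x, y') \}

The sum defining Z_w(x) ranges over the exponentially large output space, which is where the #P-completeness comes from.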
22. Geometric Example
Training data
Goal: learn w s.t. w^T f(x, y) points the right way
23. OCR Example
- We want:
  argmax_word w^T f(x, word) = "brace"
- Equivalently:
  w^T f(x, "brace") > w^T f(x, "aaaaa")
  w^T f(x, "brace") > w^T f(x, "aaaab")
  ...
  w^T f(x, "brace") > w^T f(x, "zzzzz")
  ... a lot of constraints!
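A tiny sketch (toy alphabet and a stand-in scorer, both hypothetical) of why the "equivalently" view blows up: brute-force enumeration materializes one margin constraint per competing word, i.e. |alphabet|^length of them.

    # Sketch: enumerate the constraints "score(true word) > score(other word)" explicitly.
    # With 26 letters and 5 positions that is 26**5 - 1 constraints; the toy
    # alphabet keeps the loop small, and score() is a stand-in for w^T f(x, word).
    from itertools import product

    ALPHABET = "ab"                     # toy alphabet (real OCR: 26 letters)
    TRUE_WORD = "ab"

    def score(word):                    # hypothetical stand-in for w^T f(x, word)
        return sum((-1.0) ** i * ord(c) for i, c in enumerate(word))

    others = [w for w in map("".join, product(ALPHABET, repeat=len(TRUE_WORD)))
              if w != TRUE_WORD]
    violated = [w for w in others if score(TRUE_WORD) <= score(w)]
    print(len(others), "constraints;", len(violated), "violated by the stand-in scorer")
    print("a 5-letter OCR word would need", 26 ** 5 - 1, "such constraints")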
24. Large margin estimation
- Given training example (x, y), we want:
  w^T f(x, y) > w^T f(x, y') for all y' ≠ y
- Maximize margin γ:
  w^T f(x, y) ≥ w^T f(x, y') + γ for all y'
- Mistake-weighted margin:
  w^T f(x, y) ≥ w^T f(x, y') + γ ℓ(y, y'), where ℓ(y, y') = # of mistakes in y'
Taskar et al. 03
25. Large margin estimation
- Brute force enumeration
- Min-max formulation
- Plug-in linear program for inference
26. Min-max formulation
Assume linear loss (Hamming): ℓ(y, y') = Σ_i I(y_i ≠ y'_i)
Inference: the worst violator is max_{y'} [w^T f(x, y') + ℓ(y, y')]
LP inference: this loss-augmented maximization is itself a linear program
27. Min-max formulation
By strong LP duality, the inner maximization equals its dual minimization.
Minimize jointly over w, z (the dual variables), as spelled out below.
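A reconstruction of this step in LaTeX (the LP description A, b, the feature map F, the loss vector c, and the dual variables z are assumed names):

    \min_w \; \tfrac{1}{2}\|w\|^2
    \quad \text{s.t.} \quad
    w^\top f(x, y) \;\ge\; \max_{\mu \ge 0,\; A\mu \le b} (F^\top w + c)^\top \mu
                   \;=\; \min_{z \ge 0,\; A^\top z \ge F^\top w + c} b^\top z

Replacing the inner max by its dual and dropping the inner min yields a single compact QP: minimize (1/2)||w||^2 jointly over (w, z) subject to w^T f(x, y) ≥ b^T z, A^T z ≥ F^T w + c, and z ≥ 0.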
28. Min-max formulation
- Formulation produces a compact QP for:
  - Low-treewidth Markov networks
  - Associative Markov networks
  - Context-free grammars
  - Bipartite matchings
  - Any problem with compact LP inference
29. 3D Mapping
Data provided by Michael Montemerlo & Sebastian Thrun
- Sensors: laser range finder, GPS, IMU
- Labels: ground, building, tree, shrub
- Training: 30 thousand points; Testing: 3 million points
30-33. [Image-only slides]
34. Segmentation results
Hand-labeled 180K test points
35. Fly-through
36. Certificate formulation
- Non-bipartite matchings:
  - O(n^3) combinatorial algorithm
  - No polynomial-size LP known
- Spanning trees:
  - No polynomial-size LP known
  - Simple certificate of optimality
- Intuition:
  - Verifying optimality is easier than optimizing
  - Compact optimality condition of y w.r.t. alternative outputs
37. Certificate for non-bipartite matching
- Alternating cycle: every other edge is in the matching
- Augmenting alternating cycle: score of the edges not in the matching is greater than that of the edges in the matching
- Negate the score of edges not in the matching → augmenting alternating cycle = negative-length alternating cycle
- Matching is optimal ⟺ no negative-length alternating cycles
Edmonds 65
38. Certificate for non-bipartite matching
- Pick any node r as root
- d_j = length of the shortest alternating path from r to j
- Triangle inequality: d_j ≤ d_k + length(k, j)
- Theorem: no negative-length cycle ⟺ a distance function d exists
- Can be expressed as linear constraints
- O(n) distance variables, O(n^2) constraints
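A reconstruction, in LaTeX, of the linear certificate constraints described above (the distance variables d_j and alternating-path edge lengths \ell_{kj} are assumed symbols):

    d_j \;\le\; d_k + \ell_{kj} \qquad \text{for every alternating-path edge } (k, j)

If a distance function d satisfying these triangle inequalities exists, there can be no negative-length alternating cycle, so the matching y is optimal; this uses the O(n) distance variables and O(n^2) constraints stated above.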
39. Certificate formulation
- Formulation produces a compact QP for:
  - Spanning trees
  - Non-bipartite matchings
  - Any problem with a compact optimality condition
40. Disulfide connectivity prediction
- Dataset:
  - Swiss-Prot protein database, release 39
  - Fariselli & Casadio 01, Baldi et al. 04
  - 446 sequences (4-50 cysteines)
  - Features: window profiles (size 9) around each pair
  - Two modes: bonded state known/unknown
- Comparison:
  - SVM-trained weights (ignoring constraints during learning)
  - DAG Recursive Neural Network (Baldi et al. 04)
- Our model:
  - Max-margin matching using an RBF kernel
  - Training: off-the-shelf LP/QP solver CPLEX (1 hour)
41. Known bonded state
[Chart: precision / accuracy, 4-fold cross-validation]
42. Unknown bonded state
[Chart: precision / recall / accuracy, 4-fold cross-validation]
43. Formulation summary
- Brute force enumeration
- Min-max formulation
  - Plug in a convex program for inference
- Certificate formulation
  - Directly guarantee optimality of y
44. Estimation
Local vs. global normalization; generative vs. discriminative vs. margin-based:
- Generative, P(x, y): HMMs, PCFGs (local); MRFs (global)
- Discriminative, P(y|x): MEMMs (local); CRFs (global)
- Margin-based: max-margin estimation (this talk)
Local: P(z) = ∏_i P(z_i | z_parents(i));  Global: P(z) = (1/Z) ∏_c φ(z_c)
45. Omissions
- Formulation details
  - Kernels
  - Multiple examples
  - Slacks for the non-separable case
- Approximate learning of intractable models
  - General MRFs
  - Learning to cluster
- Structured generalization bounds
- Scalable algorithms (no QP solver needed)
  - Structured SMO (works for chains, trees)
  - Structured EG (works for chains, trees)
  - Structured PG (works for chains, matchings, AMNs, ...)
46. Current Work
- Learning approximate energy functions
  - Protein folding
  - Physical processes
- Semi-supervised learning
  - Hidden variables
  - Mixing labeled and unlabeled data
- Discriminative structure learning
  - Using sparsifying priors
47. Conclusion
- Two general techniques for structured large-margin estimation
- Exact, compact, convex formulations
- Allow efficient use of kernels
- Tractable when other estimation methods are not
- Structured generalization bounds
- Efficient learning algorithms
- Empirical success on many domains
- Papers at http://www.cs.berkeley.edu/~taskar
48. [Image-only slide]
49. Duals and Kernels
- Kernel trick works!
- Scoring functions (log-potentials) can use kernels
- Same for the certificate formulation
50. Handwriting Recognition
- Length: 8 chars
- Letter: 16x8 pixels
- 10-fold train/test
  - 5000/50000 letters
  - 600/6000 words
- Models:
  - Multiclass SVMs
  - CRFs
  - M³ nets
[Bar chart: test error (average per-character, %) for CRFs, multiclass SVMs, and M³ nets; lower is better]
45% error reduction over linear CRFs; 33% error reduction over multiclass SVMs
Crammer & Singer 01
51. Hypertext Classification
- WebKB dataset
  - Four CS department websites: 1300 pages / 3500 links
  - Classify each page: faculty, course, student, project, other
  - Train on three universities, test on the fourth
[Bar chart: test error for SVMs, RMNs (loopy belief propagation), and M³ nets (relaxed dual); lower is better]
53% error reduction over SVMs; 38% error reduction over RMNs
Taskar et al. 02
52. Projected Gradient
[Figure: iterates y^k, y^{k+1}, y^{k+2}, ... projected back onto the feasible set; a toy sketch of the loop follows below]
- Projecting y onto constraints ⇒ min-cost convex flow for Markov nets, matchings
- Convergence: same as steepest gradient
- Conjugate gradient also possible (two-metric projection)
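A minimal sketch (not the authors' code; the quadratic objective and box projection are stand-ins for the structured objective and the min-cost-flow projection) of the projected gradient loop the slide describes:

    # Projected gradient ascent sketch: take a gradient step, then project back
    # onto the feasible set. The objective and projection here are simple
    # stand-ins; in the talk the projection is a min-cost convex flow.
    import numpy as np

    def project(y):
        """Stand-in projection onto the feasible set (here: the box [0, 1]^n)."""
        return np.clip(y, 0.0, 1.0)

    def objective_grad(y):
        """Stand-in gradient of a concave quadratic objective."""
        target = np.array([0.2, 0.9, 0.4])
        return -(y - target)

    def projected_gradient(y0, step=0.5, iters=100):
        y = project(y0)
        for _ in range(iters):
            y = project(y + step * objective_grad(y))   # gradient step, then project
        return y

    print(projected_gradient(np.zeros(3)))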
53. Min-Cost Flow for Markov Chains
[Figure: trellis with one node per (position, label a..z), plus source s and sink t]
- Capacities C
- Edge costs; for edges from node s and into node t, cost 0
(A toy decoding sketch follows below.)
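A sketch (toy scores, plain dynamic programming) of the construction: with unit capacities and one unit of flow from s to t, min-cost flow through the trellis reduces to a shortest s-t path, which the Viterbi-style recursion below computes using costs = negated scores.

    # One unit of min-cost flow through the trellis = shortest s -> t path.
    # Costs are negated scores; s -> (0, label) and (n-1, label) -> t edges cost 0.
    # The scores below are hypothetical toy values, and the label set is tiny.
    LABELS = ["a", "z"]
    node_score = [{"a": 1.0, "z": 0.2}, {"a": 0.1, "z": 0.8}, {"a": 0.4, "z": 0.3}]
    edge_score = {("a", "a"): 0.2, ("a", "z"): 0.6, ("z", "a"): 0.0, ("z", "z"): 0.5}

    def decode(node_score, edge_score):
        n = len(node_score)
        cost = [{y: -node_score[0][y] for y in LABELS}]   # cheapest cost from s to (0, y)
        back = []
        for i in range(1, n):
            cost.append({})
            back.append({})
            for y in LABELS:
                prev = min(LABELS, key=lambda yp: cost[i - 1][yp] - edge_score[(yp, y)])
                cost[i][y] = cost[i - 1][prev] - edge_score[(prev, y)] - node_score[i][y]
                back[i - 1][y] = prev
        y = min(LABELS, key=lambda yl: cost[n - 1][yl])   # edge into t costs 0
        path = [y]
        for i in range(n - 2, -1, -1):                    # trace the path back to s
            path.append(back[i][path[-1]])
        return path[::-1]

    print(decode(node_score, edge_score))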
54. Min-Cost Flow for Bipartite Matchings
[Figure: bipartite graph between the two node sets, with source s and sink t]
- Capacities C
- Edge costs; for edges from node s and into node t, cost 0
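As a concrete stand-in for solving this flow problem (not the solver used in the talk), a max-score bipartite matching can be computed with an off-the-shelf assignment solver on negated scores:

    # Max-score bipartite matching via the Hungarian-style assignment solver:
    # negate the scores so that minimizing cost maximizes total score.
    # The score matrix is a hypothetical toy example.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    scores = np.array([[0.9, 0.1, 0.3],
                       [0.2, 0.8, 0.4],
                       [0.5, 0.6, 0.7]])
    rows, cols = linear_sum_assignment(-scores)      # minimize negated score
    matching = list(zip(rows.tolist(), cols.tolist()))
    print(matching, "total score:", scores[rows, cols].sum())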
55. CFG Chart
- CNF tree = set of two types of parts:
  - Constituents (A, s, e)
  - CF-rules (A → B C, s, m, e)
56. CFG Inference LP
(inside and outside consistency)
Has integral solutions y for trees
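The LP itself was again an equation image; a hedged reconstruction of the inside/outside consistency constraints it most likely showed (the indicator variables \mu are an assumed symbol):

    \max_{\mu \ge 0} \;\; \sum_r \mu(r)\,\theta(r)
    \quad \text{s.t.} \quad
    \mu(A, s, e) = \sum_{B, C, m} \mu(A \to B\,C,\, s, m, e) \quad \text{(inside)}, \qquad
    \mu(A, s, e) = \sum_{\text{rules with child } (A, s, e)} \mu(\cdot) \quad \text{(outside, non-root constituents)}

with the root constituent fixed to 1; for trees this LP has integral solutions, matching the claim above.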