Title: Magic Moments: Moment-based Approaches to Structured Output Prediction
1. Magic Moments: Moment-based Approaches to Structured Output Prediction
The Analysis of Patterns
- Elisa Ricci
- joint work with Nobuhisa Ueda, Tijl De Bie, Nello Cristianini
Thursday, October 25th
2. Outline
- Learning in structured output spaces
- New algorithms based on Z-score
- Experimental results and computational issues
- Conclusions
3. Structured data everywhere!!!
- Many problems involve highly structured data, which can be represented by sequences, trees and graphs.
- Temporal, spatial and structural dependencies between objects are modeled.
- This is observed in several fields such as computational biology, computer vision, natural language processing and web data analysis.
4. Learning with structured data
- Machine learning and data mining algorithms must be able to analyze vast amounts of complex, structured data efficiently and automatically.
- The goal of structured learning algorithms is to predict complex structures, such as sequences, trees, or graphs.
- Using traditional algorithms on problems involving structured data often implies a loss of information about the structure.
5. Supervised learning
- Data are available in the form of examples and their associated correct answers.
[Figure: given a training set {(x_i, y_i)} and a hypothesis space H, learning finds h in H s.t. h(x_i) ≈ y_i; prediction applies h to new inputs.]
6. Classification
- A typical supervised learning task is classification.
Named entity recognition (NER): locate named entities in text. Entities of interest are person names, location names, organization names and miscellaneous entities (dates, times, ...).
Multiclass classification: x is the observed variable (a word in a sentence); y is the label (the entity tag).
[Figure: example sentence 'PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA Merida.' with one entity tag per word.]
7. Sequence labeling
- Can we consider the interactions between adjacent words?
- Goal: produce a joint labeling for all the words in the sentence.
Sequence labeling: given an input sequence x, reconstruct the associated label sequence y of equal length.
x = (x_1, ..., x_n): observed sequence, the words in a sentence.
y = (y_1, ..., y_n): label sequence, the entity tags.
8. Sequence alignment
Biological sequence alignment is used to determine the similarity between biological sequences.
ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGTG-ATCCA
Given two sequences S1, S2 ∈ Σ*, a global alignment is an assignment of gaps so as to line up each letter in one sequence with either a gap or a letter in the other sequence. Here Σ = {A, T, G, C}.
Example: S1 = ATGCTTTC, S2 = CTGTCGCC, aligned as
ATGCTTTC---
---CTGTCGCC
9. Sequence alignment
Sequence alignment: given a pair of sequences x, predict the correct sequence y of alignment operations (e.g. matches, mismatches, gaps). Alignments can be represented as paths from the upper-left to the lower-right corner of the alignment graph (see the sketch below).
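A minimal Needleman-Wunsch sketch of this DP, assuming the simple 3-parameter model (one score each for match, mismatch and gap; the numeric values are illustrative, not taken from the slides):

```python
# Global alignment by Needleman-Wunsch (3-parameter model).
# Score values are illustrative placeholders.
def needleman_wunsch(s1, s2, match=1.0, mismatch=-1.0, gap=-2.0):
    n, m = len(s1), len(s2)
    # D[i][j] = best score of aligning s1[:i] with s2[:j]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,   # diagonal: (mis)match
                          D[i - 1][j] + gap,       # gap in s2
                          D[i][j - 1] + gap)       # gap in s1
    # Traceback: recover one optimal path through the alignment graph.
    a1, a2, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (
                match if s1[i - 1] == s2[j - 1] else mismatch):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append('-'); i -= 1
        else:
            a1.append('-'); a2.append(s2[j - 1]); j -= 1
    return ''.join(reversed(a1)), ''.join(reversed(a2)), D[n][m]

print(needleman_wunsch("ATGCTTTC", "CTGTCGCC"))
```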
10. RNA secondary structure prediction
RNA secondary structure prediction: given an RNA sequence, predict the most likely secondary structure. The study of RNA structure is important in understanding its functions.
[Figure: the sequence AUGAGUAUAAGUUAAUGGUUAAAGUAAAUGUCUUCCACACAUUCCAUCUGAUUUCGAUUCUCACUACUCAU and its secondary structure.]
11. Sequence parsing
Sequence parsing: given an input sequence x, determine the associated parse tree y under a given context-free grammar.
Example: context-free grammar G = (V, A, R, S) with
V = {S}: set of non-terminal symbols,
A = {G, A, U, C}: set of terminal symbols,
R: S → SS | GSC | CSG | ASU | USA | ε.
[Figure: parse tree y of an example sequence.]
12. Generative models
Sequence labeling
- Traditionally, HMMs have been used for sequence labeling.
- Two main drawbacks:
  - The conditional independence assumptions are often too restrictive: HMMs cannot represent multiple interacting features or long-range dependencies between the observations.
  - They are typically trained by maximum likelihood (ML) estimation rather than by directly optimizing prediction accuracy.
13. Discriminative models
- Specify the probability of a possible output y given an observation x (consider the conditional probability P(y|x) rather than the joint probability P(y,x)).
- Do not require the strict independence assumptions of generative models.
- Arbitrary features of the observations can be considered.
- Conditional Random Fields (CRFs) [Lafferty et al., 01].
14. Learning in structured output spaces
- Several discriminative algorithms have emerged recently to predict complex structures, such as sequences, trees, or graphs.
- New discriminative approaches are proposed here.
- Problems analyzed:
  - Given a training set of correct pairs of sentences and their associated entity tags, learn to extract entities from a new sentence.
  - Given a training set of correct biological alignments, learn to align two unknown sequences.
  - Given a training set of correct RNA secondary structures associated to a set of sequences, learn to determine the secondary structure of a new sequence.
- This is not an exhaustive list of possible applications.
15. Learning in structured output spaces
- Multilabel supervised classification (output y = (y_1, ..., y_n)).
[Figure: as in the supervised setting, given a training set {(x_i, y_i)} with structured outputs and a hypothesis space H, learning finds h in H s.t. h(x_i) ≈ y_i; prediction applies h to new inputs.]
16. Learning in structured output spaces
- Three main phases:
  - Encoding: define a suitable feature map f(x,y).
  - Compression: characterize the output space in a synthetic and compact way.
  - Optimization: define a suitable objective function and use it for learning.
17. Learning in structured output spaces
- Encoding: define a suitable feature map f(x,y).
- Compression: characterize the output space in a synthetic and compact way.
- Optimization: define a suitable objective function and use it for learning.
18. Encoding
S1 = ATGCTTTC, S2 = CTGTCGCC
- Features must be defined so that prediction can be computed efficiently.
- The feature vector f(x,y) decomposes as a sum of elementary features over parts: f(x,y) = Σ_p f_p(x,y).
- Parts are typically edges or nodes in graphs.
19. Encoding
Sequence labeling
Example: CRF with HMM features, i.e. counts of label transitions (y_{i-1}, y_i) and of label-observation emissions (y_i, x_i).
In general, features can reflect long-range interactions (when labeling x_i, past and future observations are taken into account). Arbitrary features of the observations can be considered (e.g. spelling properties in NER). A sketch of such a feature map follows.
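As a concrete illustration, a sketch of this feature map for a chain model; the flat indexing of transition and emission counts is my own convention, not the paper's:

```python
import numpy as np

# Joint feature map f(x, y) for a chain with HMM-style features:
# counts of label transitions (y_{i-1}, y_i) followed by counts of
# emissions (y_i, x_i). Indexing convention is illustrative.
def hmm_features(x, y, n_labels, n_symbols):
    f = np.zeros(n_labels * n_labels + n_labels * n_symbols)
    for i in range(len(x)):
        if i > 0:
            f[y[i - 1] * n_labels + y[i]] += 1.0                 # transition count
        f[n_labels * n_labels + y[i] * n_symbols + x[i]] += 1.0  # emission count
    return f

# Example: 2 labels, 3 observation symbols.
print(hmm_features([0, 2, 1, 2], [0, 1, 1, 0], n_labels=2, n_symbols=3))
```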
20. Encoding
Sequence alignment
- 3-parameter model: one parameter each for matches, mismatches and gaps.
- In practice more complex models are used:
  - 4-parameter model: an affine function for gap penalties, i.e. different costs if the gap starts at a given position (gap opening penalty) or if it continues (gap extension penalty).
  - 211/212-parameter model: f(x,y) contains the statistics associated to the gap penalties and to all the possible pairs of amino acids.
21. Encoding
Sequence parsing
The feature vector contains the statistics associated to the occurrences of the grammar rules (see the sketch below).
[Figure: parse tree y.]
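A minimal sketch of this rule-counting feature map, with the parse tree encoded as nested tuples (lhs, child_1, ..., child_k); the tree encoding is my own assumption:

```python
from collections import Counter

# f(x, y) for parsing: counts of grammar-rule occurrences in the tree y.
# A rule is identified by (lhs, tuple of child symbols).
def rule_counts(tree):
    counts = Counter()
    def visit(node):
        if isinstance(node, tuple):          # internal node: (lhs, *children)
            lhs, children = node[0], node[1:]
            rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
            counts[(lhs, rhs)] += 1
            for c in children:
                visit(c)
    visit(tree)
    return counts

# Tree for the string AGCU using S -> ASU, S -> GSC, S -> epsilon:
y = ('S', 'A', ('S', 'G', ('S',), 'C'), 'U')
print(rule_counts(y))
```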
22. Encoding
- Having defined these features, predictions can be computed efficiently with dynamic programming (DP):
  - Sequence labeling: Viterbi algorithm (sketched below)
  - Sequence alignment: Needleman-Wunsch algorithm
  - Sequence parsing: Cocke-Younger-Kasami (CYK) algorithm
[Figure: DP table.]
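For labeling, the prediction argmax_y w^T f(x, y) decomposes over transitions and emissions, so Viterbi applies directly; a minimal sketch, with the weights reshaped into transition and emission tables (the layout is my assumption):

```python
import numpy as np

# Viterbi decoding of argmax_y w^T f(x, y) for a chain with HMM features.
# trans[p, q]: weight of the transition feature (p, q);
# emit[q, s]: weight of the emission feature (q, s).
def viterbi(x, trans, emit):
    n_labels, n = trans.shape[0], len(x)
    V = np.full((n, n_labels), -np.inf)     # best score ending in q at step i
    back = np.zeros((n, n_labels), dtype=int)
    V[0] = emit[:, x[0]]
    for i in range(1, n):
        for q in range(n_labels):
            scores = V[i - 1] + trans[:, q] + emit[q, x[i]]
            back[i, q] = int(np.argmax(scores))
            V[i, q] = scores[back[i, q]]
    y = [int(np.argmax(V[-1]))]             # traceback of the best sequence
    for i in range(n - 1, 0, -1):
        y.append(back[i, y[-1]])
    return list(reversed(y))

trans = np.array([[0.5, -0.2], [-0.3, 0.4]])
emit = np.array([[1.0, -1.0, 0.2], [-0.5, 0.8, 0.1]])
print(viterbi([0, 2, 1, 2], trans, emit))
```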
23. Learning in structured output spaces
- Encoding: define a suitable feature map f(x,y).
- Compression: characterize the output space in a synthetic and compact way.
- Optimization: define a suitable objective function and use it for learning.
24. Computing moments
- The number N of possible output vectors y_k given an observation x is typically huge.
- Each output receives a score w^T f(x, y_k).
- To characterize the distribution of the scores, its mean and its variance are considered: with m the mean and C the covariance matrix of the feature vectors f(x, y_k), the mean score is w^T m and the score variance is w^T C w.
- C and m can be computed efficiently with DP techniques.
25. Computing moments
Sequence labeling
- The number N of possible label sequences y_k given an observation sequence x is exponential in the length of the sequence.
- An algorithm similar to the forward algorithm is used to compute m and C through a recursive formula.
[Figure: recursion for the mean value associated to the feature representing the emission of a symbol q at state p.]
A brute-force version of the same computation is sketched below.
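To make the target of that recursion explicit, a brute-force sketch that enumerates every label sequence and takes the exact mean m and covariance C of the feature vectors (feasible only for tiny instances; it reuses the hypothetical hmm_features map from the encoding sketch):

```python
import itertools
import numpy as np

# Exact m and C by enumeration over all |labels|^n label sequences.
# The forward-like DP of the slides computes the same quantities
# without enumerating.
def brute_force_moments(x, n_labels, n_symbols):
    feats = [hmm_features(x, list(y), n_labels, n_symbols)
             for y in itertools.product(range(n_labels), repeat=len(x))]
    F = np.array(feats)                     # N x d matrix of feature vectors
    m = F.mean(axis=0)                      # mean feature vector
    C = (F - m).T @ (F - m) / F.shape[0]    # covariance E[f f^T] - m m^T
    return m, C

m, C = brute_force_moments([0, 2, 1, 2], n_labels=2, n_symbols=3)
print(m.shape, C.shape)                     # (10,) and (10, 10) here
```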
26. Computing moments
- Basic idea behind the recursive formulas:
  - Mean values are computed by accumulating the first-order statistics of the features.
  - Variances are computed by centering the second-order moments: C = E[f f^T] - m m^T.
27. Computing moments
- Problem: high computational cost for large feature spaces.
- 1st solution: exploit the structure and the sparseness of the covariance matrix C. In sequence labeling, for a CRF with HMM features the number of different values in C is linear in the size of the observation alphabet.
- 2nd solution: sampling strategy, i.e. estimate the moments from a random sample of outputs (sketched below).
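A minimal sketch of the sampling strategy: estimate m and C from n label sequences drawn at random (a uniform proposal is assumed here for illustration; again reusing the hypothetical hmm_features map):

```python
import numpy as np

# Sample-based estimates of m and C from n random outputs, instead of
# the exact DP computation.
def sampled_moments(x, n_labels, n_symbols, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_samples):
        y = rng.integers(0, n_labels, size=len(x))   # uniform random labeling
        feats.append(hmm_features(x, y, n_labels, n_symbols))
    F = np.array(feats)
    m = F.mean(axis=0)
    C = (F - m).T @ (F - m) / n_samples
    return m, C
```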
28. Learning in structured output spaces
- Encoding: define a suitable feature map f(x,y).
- Compression: characterize the output space in a synthetic and compact way.
- Optimization: define a suitable objective function and use it for learning.
29. Z-score
- New optimization criterion, particularly suited for non-separable cases.
- Minimize the number of output vectors with score higher than the score of the correct pairs.
- Maximize the Z-score: the number of standard deviations by which the score of the correct output exceeds the mean score (see the sketch below).
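In symbols (reconstructed from the b and C notation used later in the talk): with b = f(x, y) - m, the Z-score is Z(w) = w^T b / sqrt(w^T C w). A one-function sketch:

```python
import numpy as np

# Z-score of the correct output: the number of standard deviations by
# which its score w^T f(x, y) exceeds the mean score over all outputs.
# b = f(x, y) - m; the score variance is w^T C w.
def z_score(w, b, C):
    return float(w @ b) / np.sqrt(float(w @ C @ w))
```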
30. Z-score
- The Z-score can be expressed as a function of the parameters w: with b = f(x,y) - m, Z(w) = w^T b / sqrt(w^T C w).
- Two equivalent optimization problems: maximize (w^T b)^2 / (w^T C w), or minimize w^T C w subject to w^T b = 1.
31. Z-score
- Ranking loss: the fraction of output vectors ranked above the correct one.
- An upper bound on the ranking loss is minimized: the number of output vectors with score higher than the score of the correct pairs is minimized (an enumeration sketch follows).
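By enumeration, the ranking loss of a single training pair can be written down directly (illustrative only; real output spaces are far too large for this):

```python
# Ranking loss for one pair: the fraction of outputs whose score exceeds
# the score of the correct output. Arguments are numpy arrays / callables.
def ranking_loss(w, x, y_true, all_outputs, feature_map):
    s_true = w @ feature_map(x, y_true)
    worse = sum(1 for y in all_outputs if w @ feature_map(x, y) > s_true)
    return worse / len(all_outputs)
```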
32. Previous approaches
- Minimize the number of incorrect macrolabels y:
  - CRFs [Lafferty et al., 01], HM-SVM [Altun et al., 03], averaged perceptron [Collins, 02].
- Minimize the number of incorrect microlabels y_i:
  - M3Ns [Taskar et al., 03], SVMISO [Tsochantaridis et al., 04].
33. SODA
- Given a training set T, the empirical risk associated to the upper bound on the ranking loss is minimized.
- An equivalent formulation in terms of C and b is considered to solve it.
SODA (Structured Output Discriminant Analysis)
34. SODA
- Convex optimization.
- If C is not PSD, regularization can be introduced.
- Solution: a simple matrix inversion, w ∝ C^{-1} b (sketched below).
- Fast conjugate gradient methods are available.
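Maximizing (w^T b)^2 / (w^T C w) is solved in closed form by w ∝ C^{-1} b; a sketch with a ridge term for the non-PSD case and an optional conjugate-gradient solve (how b and C are aggregated over the training set is not reconstructed here):

```python
import numpy as np
from scipy.sparse.linalg import cg

# Closed-form maximizer of (w^T b)^2 / (w^T C w): w proportional to
# C^{-1} b. The ridge term lam regularizes a non-PSD C.
def soda_solve(C, b, lam=1e-3, use_cg=False):
    A = C + lam * np.eye(C.shape[0])
    if use_cg:
        w, info = cg(A, b)          # cheap for large, structured C
        assert info == 0, "CG did not converge"
        return w
    return np.linalg.solve(A, b)
```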
35. Rademacher bound
- The bound shows that learning based on the upper bound on the ranking loss is effectively achieved.
- The bound also holds in the case where b and C are estimated by sampling.
- Two directions of sampling:
  - For each training pair, only a limited number n of incorrect outputs is considered to estimate b and C.
  - Only a finite number l of input-output pairs is given in the training set.
- The empirical expectation of the estimated loss (with b and C computed by random sampling) is a good approximate upper bound for the expected loss.
- The latter is an upper bound for the ranking loss, so the Rademacher bound is also a bound on the expectation of the ranking loss.
36. Rademacher bound
- Theorem (Rademacher bound for SODA). With probability at least 1-δ over the joint draw of the random sample T and of the random samples from the output space (taken, for each training pair, to approximate the quantities b and C), the following bound holds for any w with squared norm smaller than c [bound formula not recovered], whereby M is a constant and the number of random samples for each training pair is assumed equal to n.
- The Rademacher complexity terms decrease with n and l respectively, so the bound becomes tight for increasing n and l, as long as n grows faster than log(l).
37. Z-score approach
- How to define the Z-score of a training set?
- Another possible approach treats the training pairs independently (independence assumption).
- The result is a convex optimization problem which can again be solved by a simple matrix inversion.
- By maximizing the Z-score, most of the linear constraints requiring the correct output to score above the incorrect ones are satisfied.
38. Iterative approach
- One may want to impose the violated constraints explicitly.
- This is again a convex optimization problem that can be solved with an iterative algorithm similar to previous approaches (HM-SVM [Altun et al., 03], averaged perceptron [Collins, 02]); a generic sketch follows.
- Constraints can eventually be relaxed (e.g. by adding slack variables for non-separable problems).
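A generic sketch of such a loop in the structured-perceptron style named above (my simplification, not the slides' exact algorithm): decode the current best output and, when it violates the constraint, update w toward the correct features:

```python
import numpy as np

# Iterative handling of violated constraints, perceptron style.
# decode(x, w) is any argmax_y w^T f(x, y) routine (e.g. Viterbi above).
def iterative_train(pairs, feature_map, decode, dim, epochs=10):
    w = np.zeros(dim)
    for _ in range(epochs):
        violations = 0
        for x, y in pairs:
            y_hat = decode(x, w)
            if list(y_hat) != list(y):       # constraint violated
                w += feature_map(x, y) - feature_map(x, y_hat)
                violations += 1
        if violations == 0:                  # all constraints satisfied
            break
    return w
```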
39. Iterative approach
40. Experimental results
Sequence labeling: artificial data.
- Chain CRF with HMM features.
- Sequence length: 50. Training set size: 20 pairs. Test set size: 100 pairs.
- Comparison with SVMISO [Tsochantaridis et al., 04], perceptron [Collins, 02], CRFs [Lafferty et al., 01].
- Average number of incorrect labels, varying the level of noise p.
41. Experimental results
Sequence labeling: artificial data.
- HMM features.
- Noise level p = 0.2.
- Average number of incorrect labels and computational time as a function of the training set size.
42. Experimental results
Sequence labeling: artificial data.
- Chain CRF with HMM features.
- Sequence length: 10. Training set size: 50 pairs. Test set size: 100 pairs. Level of noise: p = 0.2.
- Comparison with SVMISO [Tsochantaridis et al., 04].
- Labeling error on the test set and average training time as a function of the observation alphabet size.
43. Experimental results
Sequence labeling: artificial data.
- Chain CRF with HMM features.
- Adding constraints is not very useful when data are noisy and not linearly separable.
44. Experimental results
Sequence labeling: NER.
- Spanish news wire articles from the Special Session of CoNLL02: 300 sentences with an average length of 30 words.
- 9 labels: non-name, plus beginning and continuation of person, organization, location and miscellaneous names.
- Two sets of binary features: S1 (HMM features) and S2 (S1 plus HMM features for the previous and the next word).
- Labeling error on the test set (5-fold cross-validation).
45. Experimental results
Sequence alignment: artificial sequences.
- Test error (number of incorrectly aligned pairs) as a function of the training set size.
- Original and reconstructed substitution matrices.
46. Experimental results
- Sequence parsing
  - G6 grammar from [Dowell and Eddy, 2004].
  - RNA sequences of five families extracted from the Rfam database [Griffiths-Jones et al., 2003].
  - Prediction with five-fold cross-validation.
47. Conclusions
- New methods for learning in structured output spaces.
- Accuracy comparable with state-of-the-art techniques.
- Easy to implement (DP for matrix computations and a simple optimization problem).
- Fast for large training sets and a reasonable number of features.
- Mean and variance computations are parallelizable for large training sets.
- Conjugate gradient techniques are used in the optimization phase.
- Three applications analyzed: sequence labeling, sequence parsing and sequence alignment.
- Future work:
  - Test the scalability of this approach using approximate techniques.
  - Develop a dual version with kernels.
48. Thank you