Title: Sequential and Spatial Supervised Learning
Guohua Hao, Rongkun Shen, Dan Vega, Yaroslav Bulatov, and Thomas G. Dietterich
School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon 97331
Abstract
Traditional supervised learning assumes
independence between the training examples.
However, many statistical learning problems
involve sequential or spatial data that are not
independent. Furthermore, the sequential or
spatial relationships can be exploited to improve
the prediction accuracy of a classifier. We are
developing and testing new practical methods for
machine learning with sequential and spatial
data. This poster gives a snapshot of our
current methods and results.
Methods
- Sliding window / recurrent sliding window (a feature-construction sketch follows this list)
- Experiment results:
- Divide the training data into a sub-training set and a development (validation) set. Try window sizes from 1 to 21 and tree sizes of 10, 20, 30, 50, and 70.
- The best window size was 11 and the best tree size was 20. With this configuration, the best number of training iterations was 110, which gave 66.3% correct predictions on the development set.
- Train on the entire training set with this configuration and evaluate on the test set. The result was 67.1% correct.
- Neural network sliding windows give better performance than this, so we are currently designing experiments to understand why!
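To make the sliding window concrete, here is a minimal sketch (illustrative code, not the exact implementation used in these experiments) that stacks a window of w consecutive feature vectors around each sequence position, zero-padding the ends, so any standard base learner can be trained on the windowed rows:

import numpy as np

def window_features(X, w=11):
    # X: (n, d) array of per-position feature vectors; w: odd window size.
    # Returns an (n, w * d) array in which row t holds the w feature vectors
    # centered at position t, with zero padding beyond the sequence ends.
    n, d = X.shape
    half = w // 2
    padded = np.vstack([np.zeros((half, d)), X, np.zeros((half, d))])
    return np.hstack([padded[i:i + n] for i in range(w)])

# Example: a sequence of 20 positions with 5 features each.
Xw = window_features(np.random.randn(20, 5), w=11)   # shape (20, 55)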
- A classifier is trained and run on the 8 rotations and reflections of the test set; a majority vote decides the final class (see the voting sketch after this list).
- A sliding window is used to group the input pixels into squares of varying size, and the same is done for the output window. Thus the label of a pixel depends not only on the pixel intensity values in its neighborhood, but also on the labels placed on the pixels in that neighborhood.
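The voting over the 8 rotations and reflections can be sketched as follows; predict_pixels is a hypothetical stand-in for whichever per-pixel classifier is used, and is assumed to return an integer label image of the same shape as its input:

import numpy as np

def predict_with_symmetries(image, predict_pixels):
    # Classify all 8 rotations/reflections of the image, map each prediction
    # back to the original orientation, and take a per-pixel majority vote.
    preds = []
    for flip in (False, True):
        view = np.fliplr(image) if flip else image
        for k in range(4):
            p = predict_pixels(np.rot90(view, k))
            p = np.rot90(p, -k)                        # undo the rotation
            preds.append(np.fliplr(p) if flip else p)  # undo the reflection
    stacked = np.stack(preds)                          # shape (8, H, W), integer labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, stacked)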
Figure: Protein secondary structure prediction pipeline. A protein AA sequence (e.g., >1avhb-4-AS IPAYL AETLY YAMKG AGTDD HTLIR VMVSR SEIDL FNIRK EFRKN FATSL YSMIK GDTSG DYKKA LLLLC GEDD) is used to generate a raw profile with PSI-BLAST; the profile is fed into the CRF for training and testing, and majority voting over the output predictions gives the final classification result. Companion diagrams show the sliding window and recurrent sliding window models over observations x_{t-1}, x_t, x_{t+1} and labels y_{t-1}, y_t, y_{t+1}.
Figure: Pixel labels produced by Naïve Bayes with IC = 1, OC = 3 (IC = input context, OC = output context) on each of the 8 rotations and reflections of the test image.
- Hidden Markov Model: joint distribution P(X, Y)
- Experiment results:
- Different window sizes affect not only the computation time but also the accuracy of the classifier.
- The J48 (C4.5) and Naïve Bayes classifiers are the most extensively studied. The results show that Naïve Bayes achieves higher accuracy with smaller sliding windows, while J48 does better with larger window sizes.
- Generalization of Naïve Bayesian networks
- Transition probability P(y_t | y_{t-1})
- Observation probability P(x_t | y_t)
- Because of the conditional independence assumptions, it is impractical to represent overlapping features of the observations
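For reference, the HMM factorization above can be written out directly; this is a generic sketch with toy probability tables, not the matrices estimated in our experiments:

import numpy as np

# Toy HMM: 2 states, 3 discrete observation symbols.
pi = np.array([0.6, 0.4])                          # initial distribution P(y_1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])             # transitions P(y_t | y_{t-1})
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])   # observations P(x_t | y_t)

def log_joint(x, y):
    # log P(X, Y) = log P(y_1) + sum_t log P(y_t | y_{t-1}) + sum_t log P(x_t | y_t)
    logp = np.log(pi[y[0]]) + np.log(B[y[0], x[0]])
    for t in range(1, len(x)):
        logp += np.log(A[y[t - 1], y[t]]) + np.log(B[y[t], x[t]])
    return logp

print(log_joint(x=[0, 2, 1], y=[0, 1, 1]))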
Introduction
Figure 4: Chain-structured graphical model over labels y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}.
In classical supervised learning, we assume that the training examples are drawn independently and identically from some joint distribution P(x, y). However, many applications of machine learning involve predicting a sequence of labels for a sequence of observations. New learning methods are needed that can capture the possible interdependencies between labels. We can formulate this Sequential Supervised Learning problem as follows. Given a set of training examples of the form (X, Y), where
X = (x_1, x_2, ..., x_n) is a sequence of feature vectors and
Y = (y_1, y_2, ..., y_n) is the corresponding label sequence,
the goal is to find a classifier h that predicts a new X as Y = h(X).
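As a concrete illustration of this formulation (a minimal sketch; the array representation and the scikit-learn baseline are arbitrary illustrative choices), each example can be stored as a pair of arrays, and the simplest h ignores the label interdependencies entirely:

from typing import Callable, List, Tuple
import numpy as np
from sklearn.linear_model import LogisticRegression

# One example: X has shape (n, d) (feature vectors), Y has shape (n,) (labels).
Example = Tuple[np.ndarray, np.ndarray]

def fit_independent_h(examples: List[Example]) -> Callable[[np.ndarray], np.ndarray]:
    # Baseline h that treats every position independently: flatten all positions
    # of all sequences and fit a single per-position classifier.
    Xs = np.vstack([X for X, _ in examples])
    ys = np.concatenate([Y for _, Y in examples])
    clf = LogisticRegression(max_iter=1000).fit(Xs, ys)
    return lambda X: clf.predict(X)   # maps a feature-vector sequence to a label sequence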
- The official "shared task" of the CoNLL-2004 conference.
- For each verb in the sentence, find all of its arguments and label their semantic roles.
- Conditional Random Field: conditional distribution P(Y | X)
Figure: Semantic role labeling example. The sentence "He would n't accept anything of value from those he was writing about" is annotated with the roles V (verb), A0 (acceptor), A1 (thing accepted), A2 (accepted-from), AM-MOD (modal), and AM-NEG (negation).
- Extension of logistic regression to sequential data
- The label sequence Y forms a Markov random field globally conditioned on the observation X
- Removes the HMM independence assumption
- When compared to individual pixel classification (59%), it is easy to see that the recurrent sliding window allows a significant improvement in the accuracy of the classifier.
- Currently, the effect that bagging and boosting have on the accuracy is under investigation.
- Difficulty for machine learning:
- Humans use background knowledge to figure out semantic roles.
- There are 70 different semantic role tags, which makes the task computationally intensive.
- Experiment results:
- Two forms of feature induction in Conditional Random Fields: a regression tree approach and incremental field growing.
- There are 70 different semantic tags to learn. The training set contains 9,000 examples and the test set 2,000.
- Evaluated using the F-measure, the harmonic mean of precision and recall over the requested argument types (see the helper after this list).
- Both methods got similar performance, with an F-measure around 65. The best published performance was 71.72, using simple greedy left-to-right sequence labeling.
- Again, simpler non-relational approaches outperform the CRF on this task. Why?
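The F-measure used above is the standard harmonic mean of precision and recall; a small helper (ours, not the shared-task scorer) makes the definition explicit:

def f_measure(true_args, predicted_args):
    # true_args / predicted_args: sets of (role, start, end) argument tuples.
    true_args, predicted_args = set(true_args), set(predicted_args)
    correct = len(true_args & predicted_args)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_args)
    recall = correct / len(true_args)
    return 2 * precision * recall / (precision + recall)

print(f_measure({("A0", 1, 1), ("V", 4, 4)}, {("A0", 1, 1), ("A1", 5, 7)}))  # 0.5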
- Potential function of the random field
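Written out for a standard linear-chain CRF (the form we assume here), the potential functions and the conditional distribution they define are:

P(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{n} \Psi_t(y_{t-1}, y_t, X), \qquad
\Psi_t(y_{t-1}, y_t, X) = \exp\Big(\sum_k \lambda_k f_k(y_{t-1}, y_t, X, t)\Big), \qquad
Z(X) = \sum_{Y'} \prod_{t=1}^{n} \Psi_t(y'_{t-1}, y'_t, X)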
Conclusions and Future work
- In recent years, substantial progress has been made on sequential and spatial supervised learning problems. This poster has attempted to review some of the existing methods and to present our current methods and experimental results on several applications. Future work will include:
- Developing methods that can handle a large number of classes
- Discriminative methods using large-margin principles
- Understanding why structural learning methods, such as CRFs, do not outperform classical methods on some structural learning problems
- Maximize the log likelihood
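Spelled out, the training objective is the conditional log likelihood of the labeled sequences, and its gradient with respect to each weight is the usual difference between empirical and expected feature counts (standard CRF training, stated here for reference):

L(\lambda) = \sum_i \log P\big(Y^{(i)} \mid X^{(i)}; \lambda\big), \qquad
\frac{\partial L}{\partial \lambda_k}
  = \sum_i \sum_t f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, X^{(i)}, t\big)
  - \sum_i \sum_t \mathbb{E}_{P(Y \mid X^{(i)}; \lambda)}\big[f_k(y_{t-1}, y_t, X^{(i)}, t)\big]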
- Vertical relationships: as in normal supervised learning
- Horizontal relationships: interdependencies between the label variables, which can improve accuracy
- Parameter estimation:
- Iterative scaling and gradient descent: exponential number of parameters
- Gradient tree boosting: learns only the necessary interactions among features (sketched below)
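The functional-gradient idea behind gradient tree boosting can be sketched for plain binary classification with logistic loss (a generic illustration using scikit-learn regression trees with arbitrary settings, not the CRF-specific algorithm of Dietterich, Ashenfelter, and Bulatov, 2004):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, n_iter=100, n_leaves=8, lr=0.1):
    # Fit an additive model F(x) = sum_m lr * tree_m(x) for 0/1 labels y by
    # repeatedly fitting a small regression tree to the functional gradient
    # of the logistic loss, which is simply y - sigmoid(F).
    F = np.zeros(len(y))
    trees = []
    for _ in range(n_iter):
        residual = y - 1.0 / (1.0 + np.exp(-F))
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves).fit(X, residual)
        F += lr * tree.predict(X)
        trees.append(tree)
    return trees   # predict with sigmoid(sum of lr * tree.predict(x))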
- Classification of remotely sensed images
- Discriminative methods: score function f(X, Y)
Examples include part-of-speech tagging, protein secondary structure prediction, etc. Extending 1-D observation and label sequences to 2-D arrays, we obtain a similar formulation for the Spatial Supervised Learning problem, where both X and Y have 2-D structure and there are interdependencies between the labels.
- Assign the crop identification classes
(unknown, sugar beets, stubble, bare soil,
potatoes, carrots) to pixels in the remotely
sensed image
Acknowledgement
We thank the National Science Foundation for supporting this research under grant number IIS-0307592.
- Averaged perceptron (Collins, 2002; sketched below)
- Hidden Markov support vector machine (Altun et al., 2003)
- Maximum Margin Markov Network (Taskar et al., 2003)
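The update rule shared by these methods is easiest to see in the structured (averaged) perceptron; viterbi_decode and joint_features below are hypothetical helpers that a real implementation would supply:

import numpy as np

def averaged_structured_perceptron(train, joint_features, viterbi_decode, n_feats, epochs=5):
    # train: list of (X, Y) sequence pairs.
    # joint_features(X, Y) -> feature-count vector of length n_feats.
    # viterbi_decode(X, w) -> highest-scoring label sequence under weights w.
    w = np.zeros(n_feats)
    w_sum = np.zeros(n_feats)
    steps = 0
    for _ in range(epochs):
        for X, Y in train:
            Y_hat = viterbi_decode(X, w)
            if not np.array_equal(Y_hat, Y):
                w += joint_features(X, Y) - joint_features(X, Y_hat)
            w_sum += w
            steps += 1
    return w_sum / steps   # averaged weights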
- The training and test sets are created by dividing the image in half with a horizontal line. The top half is used as the training set and the bottom half as the test set.
- Training set expansion: rotations and reflections of the training set increase the training set 8-fold (a sketch follows below).
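The 8-fold expansion can be generated directly from the training image and its label array (a short sketch using the same NumPy transforms as the voting sketch earlier):

import numpy as np

def expand_eightfold(image, labels):
    # Return the 8 rotations/reflections of (image, labels), transformed jointly
    # so that every pixel keeps its label in each expanded copy.
    out = []
    for flip in (False, True):
        img = np.fliplr(image) if flip else image
        lab = np.fliplr(labels) if flip else labels
        for k in range(4):
            out.append((np.rot90(img, k), np.rot90(lab, k)))
    return out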
Figure 5: Image with the true class labels. The upper part is the training example and the lower part is the test example.
References
- Dietterich, T. G. (2002). Machine learning for sequential data: A review. Structural, Syntactic, and Statistical Pattern Recognition (pp. 15-30). New York: Springer Verlag.
- Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning (pp. 282-289). San Francisco, CA: Morgan Kaufmann.
- Dietterich, T. G., Ashenfelter, A., & Bulatov, Y. (2004). Training conditional random fields via gradient tree boosting. Proceedings of the 21st International Conference on Machine Learning (pp. 217-224). Banff, Canada.
- Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202.
- Cuff, J. A., & Barton, G. J. (2000). Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins: Structure, Function and Genetics, 40, 502-511.
- Carreras, X., & Màrquez, L. (2004). Introduction to the CoNLL-2004 shared task: Semantic role labeling. Proceedings of CoNLL-2004.
- Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 380-393.
Applications
- Protein secondary structure prediction
Structural Supervised Learning: Given a graph G = (V, E), where each vertex v carries an (x_v, y_v) pair and some vertices are missing the y label, the goal is to predict the missing y labels.
- Assign the secondary structure classes (α-helix, β-sheet, and coil) to proteins
- The amino acid (AA) sequence leads to the tertiary and/or quaternary structure, which corresponds to the protein's function
- Use Position-Specific Scoring Matrix (PSSM) profiles to improve the prediction accuracy
- Use the CB513 dataset, with sequences shorter than 30 AA residues excluded, in our experiment
Figure 2: 2-D spatial model with label nodes y_{i,j}, y_{i,j+1}, y_{i+1,j}, y_{i+1,j+1} and observation nodes x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1}.
Figure 6: Training set expansion (rotations and reflections of the training image).