Title: A Sparse Modeling Approach to Speech Recognition Using Kernel Machines
1. A Sparse Modeling Approach to Speech Recognition Using Kernel Machines
- Jon Hamaker
- hamaker_at_isip.msstate.edu
- Institute for Signal and Information Processing
- Mississippi State University
2. Abstract
- Statistical techniques based on Hidden Markov
models (HMMs) with Gaussian emission densities
have dominated the signal processing and pattern
recognition literature for the past 20 years.
However, HMMs suffer from an inability to learn
discriminative information and are prone to
over-fitting and over-parameterization. Recent
work in machine learning has focused on models,
such as the support vector machine (SVM), that
automatically control generalization and
parameterization as part of the overall
optimization process. SVMs have been shown to
provide significant improvements in performance
on small pattern recognition tasks compared to a
number of conventional approaches. SVMs, however,
require ad hoc (and unreliable) methods to couple
them to probabilistic learning machines.
Probabilistic Bayesian learning machines, such as
the relevance vector machine (RVM), are fairly
new approaches that attempt to overcome the
deficiencies of SVMs by explicitly accounting for
sparsity and statistics in their formulation.
- In this presentation, we describe both of these
modeling approaches in brief. We then describe
our work to integrate these as acoustic models in
large vocabulary speech recognition systems.
Particular attention is given to algorithms for
training these learning machines on large
corpora. In each case, we find that both SVM and
RVM-based systems perform better than Gaussian
mixture-based HMMs in open-loop recognition. We
further show that the RVM-based solution performs
on par with the SVM system using an order of
magnitude fewer parameters. We conclude with a
discussion of the remaining hurdles for providing
this technology in a form amenable to current
state-of-the-art recognizers.
3. Bio
- Jon Hamaker is a Ph.D. candidate in the
Department of Electrical and Computer Engineering
at Mississippi State University under the
supervision of Dr. Joe Picone. He has been a
senior member of the Institute for Signal and
Information Processing (ISIP) at MSU since 1996.
Mr. Hamaker's research work has revolved around
automatic structural analysis and optimization
methods for acoustic modeling in speech
recognition systems. His most recent work has
been in the application of kernel machines as
replacements for the underlying Gaussian
distribution in hidden Markov acoustic models.
His dissertation work compares the popular
support vector machine with the relatively new
relevance vector machine in the context of a
speech recognition system. Mr. Hamaker has
co-authored 4 journal papers (2 under review), 22
conference papers, and 3 invited presentations
during his graduate studies at MS State
(http://www.isip.msstate.edu/publications). He
also spent two summers as an intern at Microsoft
in the recognition engine group.
4. Outline
- The acoustic modeling problem for speech
- Current state-of-the-art
- Discriminative approaches
- Structural optimization and Occam's Razor
- Support vector classifiers
- Relevance vector classifiers
- Coupling vector machines to ASR systems
- Scaling relevance vector methods to real problems
- Extensions of this work
5. ASR Problem
- Front-end maintains information important for modeling in a reduced parameter set
- Language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams)
- Search engine uses knowledge sources and models to choose amongst competing hypotheses
6. Acoustic Confusability
Requires reasoning under uncertainty!
- Regions of overlap represent classification error
- Reduce overlap by introducing acoustic and
linguistic context
Figure: comparison of 'aa' (as in lOck) and 'iy' (as in bEAt) for SWB
7. Probabilistic Formulation
- To deal with the uncertainty, we typically formulate speech as a probabilistic problem
- Objective: minimize the word error rate by maximizing P(W|A)
- Approach: maximize P(A|W) during training
- Components (see the decomposition below)
- P(A|W): Acoustic Model
- P(W): Language Model
- P(A): Acoustic probability (ignored during maximization)
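The decomposition above follows from Bayes' rule; written out in standard notation (not reproduced from the original slide):

\[
P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)},
\qquad
\hat{W} = \arg\max_{W}\; P(A \mid W)\,P(W),
\]

since P(A) is constant across competing word sequences W and can be ignored during maximization.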
8. Acoustic Modeling - HMMs
- HMMs model temporal variation in the transition probabilities of the state machine
- GMM emission densities are used to account for variations in speaker, accent, and pronunciation
- Sharing model parameters is a common strategy to reduce complexity
9. Maximum Likelihood Training
- Data-driven modeling supervised only from a word-level transcription
- Approach: maximum likelihood estimation
- The EM algorithm is used to improve our estimates
- Guaranteed convergence to a local maximum
- No guard against overfitting!
- Computationally efficient training algorithms (Forward-Backward) have been crucial (see the sketch below)
- Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge
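As a concrete reference for the Forward-Backward remark above, here is a minimal sketch of the forward pass for a discrete-observation HMM. It is illustrative only: the variable names, the discrete emission table, and the toy values are assumptions rather than part of the system described in this talk, and a production system would work in log space with scaling.

    import numpy as np

    def forward(pi, A, B, obs):
        """Forward pass: returns P(observation sequence | HMM).

        pi  : (S,)   initial state probabilities
        A   : (S, S) transition probabilities, A[i, j] = P(next state j | state i)
        B   : (S, V) discrete emission probabilities, B[i, k] = P(symbol k | state i)
        obs : list of observed symbol indices
        """
        alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(o_1)
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]      # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        return alpha.sum()                     # total likelihood of the observation sequence

    # Toy usage: 2 states, 3 output symbols
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(forward(pi, A, B, [0, 1, 2]))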
10. Drawbacks of Current Approach
- ML convergence does not translate to optimal classification
- Error from incorrect modeling assumptions
- Finding the optimal decision boundary requires only one parameter!
11. Drawbacks of Current Approach
- Data not separable by a hyperplane: a nonlinear classifier is needed
- Gaussian MLE models tend toward the center of mass: overtraining leads to poor generalization
12. Acoustic Modeling
- Acoustic models must
- Model the temporal progression of the speech
- Model the characteristics of the sub-word units
- We would also like our models to
- Optimally trade off discrimination and representation
- Incorporate Bayesian statistics (priors)
- Make efficient use of parameters (sparsity)
- Produce confidence measures of their predictions
for higher-level decision processes
13. Paradigm Shift - Discriminative Modeling
- Discriminative Training (Maximum Mutual Information Estimation)
- Essential idea: maximize the posterior P(W|A), i.e. the ratio of P(A|W)P(W) to the sum of P(A|W')P(W') over competing word sequences W'
- Maximize the numerator (ML term), minimize the denominator (discriminative term)
- Discriminative Modeling (e.g. ANN hybrids: Bourlard and Morgan)
14. Research Focus
- Our research: replace the Gaussian likelihood computation with a machine that incorporates notions of
- Discrimination
- Bayesian statistics (prior information)
- Confidence
- Sparsity
- All while maintaining computational efficiency
15. ANN Hybrids
- Architecture
- ANN provides flexible, discriminative classifiers for emission probabilities that avoid HMM independence assumptions (can use wider acoustic context)
- Trained using Viterbi iterative training (hard decision rule) or can be trained to learn Baum-Welch targets (soft decision rule)
- Shortcomings
- Prone to overfitting: requires cross-validation to determine when to stop training. Need methods that automatically penalize overfitting
- No substantial recognition improvements over HMM/GMM
16. Structural Optimization
- Structural optimization is often guided by an Occam's Razor approach
- Trading goodness of fit against model complexity
- Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination
17. Structural Risk Minimization
- Expected Risk
- Not possible to estimate P(x,y)
- Empirical Risk
- Related by the VC dimension, h
- Approach: choose the machine that gives the least upper bound on the actual risk (see the bound sketched below)
Figure: bound on the expected risk plotted against VC dimension, showing the empirical risk, the VC confidence, and the optimum that minimizes the bound
- The VC dimension is a measure of the complexity of the learning machine
- A higher VC dimension gives a looser bound on the actual risk, thus penalizing a more complex model (Vapnik)
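For reference, the quantities named above can be written out in Vapnik's conventional form (a sketch in standard notation; the original slide's equations are not reproduced verbatim):

\[
R(\alpha) = \int \tfrac{1}{2}\,\lvert y - f(\mathbf{x},\alpha)\rvert \, dP(\mathbf{x},y)
\quad \text{(expected risk)},
\qquad
R_{\mathrm{emp}}(\alpha) = \frac{1}{2N}\sum_{i=1}^{N} \lvert y_i - f(\mathbf{x}_i,\alpha)\rvert
\quad \text{(empirical risk)}
\]

\[
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\left(\ln(2N/h)+1\right) - \ln(\eta/4)}{N}}
\quad \text{with probability } 1-\eta,
\]

where h is the VC dimension and the square-root term is the VC confidence.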
18. Support Vector Machines
- Optimization for separable data (equations sketched below)
- Hyperplane
- Constraints
- Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors
- Final classifier
- Hyperplanes C0-C2 achieve zero empirical risk; C0 generalizes optimally
- The data points that define the boundary are called support vectors
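The equations referenced by the bullets above, sketched in standard SVM notation (a reconstruction, not the slide's original typesetting):

\[
\text{Hyperplane:}\quad \mathbf{w}\cdot\mathbf{x} + b = 0,
\qquad
\text{Constraints:}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\ \ i = 1,\dots,N
\]

\[
\text{Dual:}\quad \max_{\boldsymbol\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j
\quad \text{s.t. } \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0
\]

\[
\text{Final classifier:}\quad f(\mathbf{x}) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\,\mathbf{x}_i\cdot\mathbf{x} + b\Big),
\]

with only the support vectors having nonzero \(\alpha_i\).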
19. SVMs as Nonlinear Classifiers
- Data for practical applications are typically not separable using a hyperplane in the original input feature space
- Transform the data to a higher dimension where a hyperplane classifier is sufficient to model the decision surface
- Kernels are used for this transformation
- Final classifier (sketched below)
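The kernel substitution and the kernelized classifier referenced above (standard forms; the specific kernels shown are common examples and may not match the slide's full list):

\[
K(\mathbf{x},\mathbf{z}) = \Phi(\mathbf{x})\cdot\Phi(\mathbf{z}),
\qquad
\text{e.g. RBF: } K(\mathbf{x},\mathbf{z}) = \exp\!\big(-\gamma\lVert\mathbf{x}-\mathbf{z}\rVert^2\big),
\quad
\text{polynomial: } K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z} + 1)^p
\]

\[
\text{Final classifier:}\quad f(\mathbf{x}) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\,K(\mathbf{x}_i,\mathbf{x}) + b\Big)
\]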
20. SVMs for Non-Separable Data
- No hyperplane could achieve zero empirical risk (in any dimension!)
- Recall the SRM principle: trade off empirical risk and model complexity
- Relax our optimization constraint to allow for errors on the training set (see the sketch below)
- A new parameter, C, must be estimated to optimally control the trade-off between training-set errors and model complexity
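A sketch of the relaxed (soft-margin) optimization described above, using slack variables \(\xi_i\) in the standard formulation:

\[
\min_{\mathbf{w},b,\boldsymbol\xi}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i
\qquad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1-\xi_i,\ \ \xi_i \ge 0
\]

Larger C penalizes training-set errors more heavily; smaller C favors a wider margin (lower complexity).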
21. SVM Drawbacks
- Uses a binary (yes/no) decision rule
- Generates a distance from the hyperplane, but this distance is often not a good measure of our confidence in the classification
- Can produce a probability as a function of the distance (e.g. using sigmoid fits, sketched below), but they are inadequate
- Number of support vectors grows linearly with the size of the data set
- Requires the estimation of the trade-off parameter, C, via held-out sets
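The sigmoid fit mentioned above is typically Platt's posterior estimate; a sketch of its form:

\[
P(y=1 \mid f(\mathbf{x})) \approx \frac{1}{1+\exp\!\big(A\,f(\mathbf{x}) + B\big)},
\]

where f(x) is the SVM output (distance from the hyperplane) and the parameters A and B are estimated on a held-out set.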
22. Evidence Maximization
- Build a fully specified probabilistic model: incorporate prior information/beliefs as well as a notion of confidence in predictions
- MacKay posed a special form of regularization in neural networks: sparsity
- Evidence maximization: evaluate candidate models based on their evidence, P(D|Hi)
- Structural optimization by maximizing the evidence across all candidate models!
- Steeped in Gaussian approximations
23. Evidence Framework
- Evidence approximation (sketched below)
- Likelihood of the data given the best-fit parameter set
- Penalty that measures how well our posterior model fits our prior assumptions
- We can set the prior in favor of sparse, smooth models!
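MacKay's evidence approximation referenced above, sketched in standard form (not reproduced verbatim from the slide):

\[
P(D \mid H_i) = \int P(D \mid \mathbf{w}, H_i)\,P(\mathbf{w} \mid H_i)\,d\mathbf{w}
\;\approx\;
\underbrace{P(D \mid \mathbf{w}_{\mathrm{MP}}, H_i)}_{\text{best-fit likelihood}}
\;\times\;
\underbrace{P(\mathbf{w}_{\mathrm{MP}} \mid H_i)\,\Delta\mathbf{w}}_{\text{Occam factor}}
\]

The Occam factor penalizes models whose posterior occupies only a small fraction of their prior volume.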
24. Relevance Vector Machines
- A kernel-based learning machine (functional form sketched below)
- Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay)
- A flat (non-informative) prior over the hyperparameters completes the Bayesian specification
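A sketch of the RVM functional form and the ARD prior referenced above, in Tipping's standard notation:

\[
y(\mathbf{x};\mathbf{w}) = \sigma\Big(\sum_{i=1}^{N} w_i\,K(\mathbf{x},\mathbf{x}_i) + w_0\Big),
\qquad
p(\mathbf{w} \mid \boldsymbol\alpha) = \prod_i \mathcal{N}\big(w_i \mid 0,\ \alpha_i^{-1}\big)
\]

Each weight has its own hyperparameter \(\alpha_i\); driving \(\alpha_i \to \infty\) prunes the corresponding kernel function, which is what produces sparsity.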
25. Relevance Vector Machines
- The goal in training becomes finding the hyperparameters (and weights) that maximize the marginal likelihood of the data
- Estimation of the sparsity parameters is inherent in the optimization: no need for a held-out set!
- A closed-form solution to this maximization problem is not available. Rather, we iteratively reestimate the weights and hyperparameters
26. Laplace's Method
- Fix the hyperparameters and estimate w (e.g. gradient descent), giving the most probable weights w_MP
- Use the Hessian to approximate the covariance of a Gaussian posterior of the weights centered at w_MP
- With w_MP and the inverse Hessian as the mean and covariance, respectively, of the Gaussian approximation, we find the hyperparameters by maximizing the approximate marginal likelihood (see the sketch below)
- Method is O(N²) in memory and O(N³) in time
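A minimal runnable sketch of the loop described above, assuming a precomputed design matrix of kernel values and 0/1 targets. This is illustrative only and not the ISIP implementation: the iteration counts are arbitrary, and pruning is simplified to capping the hyperparameters rather than removing columns.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rvm_train(Phi, t, n_outer=50, n_newton=25, alpha_cap=1e9):
        """Sketch of RVM classification training via Laplace's method.

        Phi : (N, M) design matrix of kernel evaluations
        t   : (N,)   binary targets in {0, 1}
        Returns the posterior mode of the weights (mu) and the hyperparameters (alpha).
        """
        N, M = Phi.shape
        alpha = np.full(M, 1e-2)      # one ARD hyperparameter per weight
        mu = np.zeros(M)              # most probable weights w_MP

        for _ in range(n_outer):
            # 1) Fix alpha; find w_MP by Newton's method (IRLS) on the log posterior
            for _ in range(n_newton):
                y = sigmoid(Phi @ mu)
                grad = Phi.T @ (t - y) - alpha * mu
                Bdiag = y * (1.0 - y)
                H = (Phi * Bdiag[:, None]).T @ Phi + np.diag(alpha)   # negative Hessian
                mu = mu + np.linalg.solve(H, grad)

            # 2) Laplace approximation: Gaussian posterior centered at w_MP,
            #    with covariance Sigma = H^{-1}
            y = sigmoid(Phi @ mu)
            Bdiag = y * (1.0 - y)
            H = (Phi * Bdiag[:, None]).T @ Phi + np.diag(alpha)
            Sigma = np.linalg.inv(H)

            # 3) Re-estimate hyperparameters: alpha_i = gamma_i / mu_i^2,
            #    where gamma_i = 1 - alpha_i * Sigma_ii
            gamma = 1.0 - alpha * np.diag(Sigma)
            alpha = np.minimum(gamma / (mu ** 2 + 1e-12), alpha_cap)
            # A full implementation prunes weights whose alpha diverges
            # (removing their columns), which is what yields the sparse model.

        return mu, alpha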
27. RVMs Compared to SVMs
- RVM
- Data: class labels (0,1)
- Goal: learn the posterior, P(t=1|x)
- Structural optimization: hyperprior distribution encourages sparsity
- Training: iterative, O(N³)
- SVM
- Data: class labels (-1,1)
- Goal: find the optimal decision surface under constraints
- Structural optimization: trade-off parameter that must be estimated
- Training: quadratic programming, O(N²)
28. Simple Example
29. ML Comparison
30. SVM Comparison
31. SVM With Sigmoid Posterior Comparison
32. RVM Comparison
33. Experimental Progression
- Proof of concept on speech classification data
- Coupling classifiers to ASR system
- Reduced-set tests on Alphadigits task
- Algorithms for scaling up RVM classifiers
- Further tests on Alphadigits task (still not the full training set though!)
- New work aiming at larger data sets and HMM decoupling
34. Vowel Classification
- Deterding Vowel Data: 11 vowels spoken in an h*d context; 10 log-area parameters; 528 training and 462 speaker-independent test vectors
35. Coupling to ASR
- Data size
- 30 million frames of data in training set
- Solution: segmental phone models
- Source for Segmental Data
- Solution: use the HMM system in a bootstrap procedure
- Could also build a segment-based decoder
- Probabilistic decoder coupling
- SVMs: sigmoid-fit posterior
- RVMs: naturally probabilistic
36. Coupling to ASR System
- Pipeline: Features (Mel-Cepstra) -> HMM RECOGNITION -> Segment Information and N-best List -> SEGMENTAL CONVERTER -> Segmental Features -> HYBRID DECODER -> Hypothesis
37. Alphadigit Recognition
- OGI Alphadigits: continuous, telephone-bandwidth letters and numbers (e.g. A19B4E)
- Reduced training set size for RVM comparison: 2000 training segments per phone model
- Could not, at this point, run larger sets efficiently
- 3329 utterances using 10-best lists generated by the HMM decoder
- SVM and RVM system architectures are nearly identical: RBF kernels with gamma = 0.5
- SVM requires the sigmoid posterior estimate to produce likelihoods; sigmoid parameters estimated from a large held-out set
38. SVM Alphadigit Recognition
- HMM system is cross-word state-tied triphones with 16-mixture Gaussian models
- SVM system has monophone models with segmental features
- A system combination experiment yields another 1% reduction in error
39. SVM/RVM Alphadigit Comparison
- RVMs yield a large reduction in the parameter count while attaining superior performance
- Computational cost for RVMs is mainly in training, but it is still prohibitive for larger sets
40. Scaling Up
- Central to RVM training is the inversion of an MxM Hessian matrix, an O(N³) operation initially
- Solutions
- Constructive approach: start with an empty model and iteratively add candidate parameters. M is typically much smaller than N
- Divide-and-conquer approach: divide the complete problem into a set of sub-problems; iteratively refine the candidate parameter set according to the sub-problem solutions. M is user-defined
41. Constructive Approach
- Tipping and Faul (MSR-Cambridge)
- Define a sparsity factor s_i and a quality factor q_i for each candidate basis function
- The marginal likelihood, considered as a function of a single hyperparameter alpha_i, has a unique maximum with respect to it: finite (basis function in the model) when q_i² > s_i, infinite (pruned) otherwise
- The results give a set of rules for adding vectors to the model, removing vectors from the model, or updating parameters in the model
42. Constructive Approach Algorithm
- Prune all parameters
- While not converged
    - For each parameter
        - If parameter is pruned: checkAddRule
        - Else: checkPruneRule, checkUpdateRule
    - End
    - Update Model
- End
- Begin with all weights set to zero and iteratively construct an optimal model without evaluating the full NxN inverse (see the sketch below)
- Formulated for RVM regression; can have oscillatory behavior for classification
- Rule subroutines require the full design matrix: an NxN storage requirement
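The add/prune/update rules above, sketched as runnable Python for the simpler regression case (classification interleaves a Laplace step). This is an illustrative reconstruction from Tipping and Faul's published rules, not the ISIP code: the noise variance, the cyclic candidate schedule, and the seeding choice are assumptions, and C is inverted directly for clarity even though the published algorithm maintains these statistics incrementally.

    import numpy as np

    def constructive_rvm_regression(Phi, t, sigma2=0.1, n_iter=500):
        """Sketch of the constructive (fast marginal likelihood) RVM rules.

        alpha[i] = inf means basis function i is pruned from the model.
        Each iteration visits one candidate and applies the add,
        re-estimate, or prune rule based on its sparsity (s) and quality (q).
        """
        N, M = Phi.shape
        alpha = np.full(M, np.inf)                      # start with an empty model
        # Seed with the best-aligned basis function (Tipping & Faul initialization)
        i0 = int(np.argmax((Phi.T @ t) ** 2 / np.sum(Phi ** 2, axis=0)))
        phi0 = Phi[:, i0]
        alpha[i0] = np.sum(phi0 ** 2) / ((phi0 @ t) ** 2 / np.sum(phi0 ** 2) - sigma2)

        for it in range(n_iter):
            active = np.isfinite(alpha)
            Phi_a = Phi[:, active]
            C = sigma2 * np.eye(N) + Phi_a @ np.diag(1.0 / alpha[active]) @ Phi_a.T
            Cinv = np.linalg.inv(C)

            i = it % M                                  # visit candidates cyclically
            phi = Phi[:, i]
            S, Q = phi @ Cinv @ phi, phi @ Cinv @ t
            if np.isfinite(alpha[i]):                   # exclude i's own contribution
                s, q = alpha[i] * S / (alpha[i] - S), alpha[i] * Q / (alpha[i] - S)
            else:
                s, q = S, Q

            theta = q ** 2 - s
            if theta > 0:
                alpha[i] = s ** 2 / theta               # add (if pruned) or re-estimate
            elif np.isfinite(alpha[i]) and active.sum() > 1:
                alpha[i] = np.inf                       # prune from the model
        return alpha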
43. Iterative Reduction Algorithm
Diagram: at iteration I the candidate pool is divided into subsets 0..J; the relevance vectors (RVs) surviving each subset's solution form the candidate pool for iteration I+1
- O(M³) in run time and O(MxN) in memory. M is a user-defined parameter (see the sketch below)
- Assumes that if P(w_k = 0 | w_{I,J}, D) is 1, then P(w_k = 0 | w, D) is also 1! Optimality?
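A hypothetical Python skeleton of the reduction loop sketched in the diagram above. The sub-problem solver train_rvm, the random partitioning, the subset size, and the stopping test are all assumptions; the slide does not specify these details, so this only conveys the shape of the divide-and-conquer procedure.

    import numpy as np

    def iterative_reduction(Phi, t, train_rvm, subset_size=500, n_rounds=10, seed=0):
        """Skeleton of the divide-and-conquer reduction of the candidate set.

        train_rvm(Phi_sub, t) is assumed to solve one sub-problem and return
        the indices (into the sub-problem) of its surviving relevance vectors.
        Survivors from all subsets form the next round's candidate pool.
        """
        rng = np.random.default_rng(seed)
        candidates = np.arange(Phi.shape[1])          # start from every basis function
        for _ in range(n_rounds):
            rng.shuffle(candidates)
            survivors = []
            for start in range(0, len(candidates), subset_size):
                subset = candidates[start:start + subset_size]
                kept = train_rvm(Phi[:, subset], t)   # solve the O(M^3) sub-problem
                survivors.extend(subset[kept])        # keep only its relevance vectors
            if len(survivors) == len(candidates):     # no further reduction: stop
                break
            candidates = np.array(survivors)
        return candidates                             # indices of the retained RVs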
44. Alphadigit Recognition
- Data increased to 10000 training vectors
- The reduction method has been trained on up to 100k vectors (on a toy task); not possible for the constructive method
45. Summary
- First to apply kernel machines as acoustic models
- Comparison of two machines that apply structural optimization to learning: SVM and RVM
- Performance exceeds that of the HMM, but with quite a bit of HMM interaction
- Algorithms for increased data sizes are key
46. Decoupling the HMM
- Still want to use segmental data (data size)
- Want the kernel machine acoustic model to determine an optimal segmentation, though
- Need a new decoder
- Hypothesize each phone for each possible segment
- Pruning is a huge issue
- Stack decoder is beneficial
- Status: in development
47. Improved Iterative Algorithm
- Same principle of operation
- One pass over the data: much faster!
- Status: equivalent performance on all benchmarks; running on Alphadigits now
48. Active Learning for RVMs
- Idea: given the current model, iteratively choose a subset of points from the full training set that will improve the system performance
- Problem 1: performance is typically defined as classifier error rate (e.g. boosting). What about the accuracy of the posterior estimate?
- Problem 2: for kernel machines, an added training point can
- Assist in bettering the model performance
- Become part of the model itself! How do we determine which points should be added?
- Look to work in Gaussian Processes (Lawrence, Seeger, Herbrich, 2003)
49. Extensions
- Not ready for prime time as an acoustic model
- How else might we use the same techniques for speech?
- Online Speech/Noise Classification?
- Requires adaptation methods
- Application of automatic relevance determination
to model selection for HMMs?
50. Acknowledgments
- Collaborators: Aravind Ganapathiraju and Joe Picone at Mississippi State
- Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (now at Cornell)