Title: A Sparse Modeling Approach to Speech Recognition Using Kernel Machines
1. A Sparse Modeling Approach to Speech Recognition Using Kernel Machines
- Jon Hamaker
- hamaker_at_isip.msstate.edu
- Institute for Signal and Information Processing
- Mississippi State University
2. Abstract
- Statistical techniques based on Hidden Markov
models (HMMs) with Gaussian emission densities
have dominated the signal processing and pattern
recognition literature for the past 20 years.
However, HMMs suffer from an inability to learn
discriminative information and are prone to
over-fitting and over-parameterization. Recent
work in machine learning has focused on models,
such as the support vector machine (SVM), that
automatically control generalization and
parameterization as part of the overall
optimization process. SVMs have been shown to
provide significant improvements in performance
on small pattern recognition tasks compared to a
number of conventional approaches. SVMs, however,
require ad hoc (and unreliable) methods to couple
them to probabilistic learning machines.
Probabilistic Bayesian learning machines, such as
the relevance vector machine (RVM), are fairly
new approaches that attempt to overcome the
deficiencies of SVMs by explicitly accounting for
sparsity and statistics in their formulation.
- In this presentation, we describe both of these
modeling approaches in brief. We then describe
our work to integrate these as acoustic models in
large vocabulary speech recognition systems.
Particular attention is given to algorithms for
training these learning machines on large
corpora. In each case, we find that both SVM and
RVM-based systems perform better than Gaussian
mixture-based HMMs in open-loop recognition. We
further show that the RVM-based solution performs
on par with the SVM system using an order of
magnitude fewer parameters. We conclude with a
discussion of the remaining hurdles for providing
this technology in a form amenable to current
state-of-the-art recognizers.
3. Bio
- Jon Hamaker is a Ph.D. candidate in the
Department of Electrical and Computer Engineering
at Mississippi State University under the
supervision of Dr. Joe Picone. He has been a
senior member of the Institute for Signal and
Information Processing (ISIP) at MSU since 1996.
Mr. Hamaker's research work has revolved around
automatic structural analysis and optimization
methods for acoustic modeling in speech
recognition systems. His most recent work has
been in the application of kernel machines as
replacements for the underlying Gaussian
distribution in hidden Markov acoustic models.
His dissertation work compares the popular
support vector machine with the relatively new
relevance vector machine in the context of a
speech recognition system. Mr. Hamaker has
co-authored 4 journal papers (2 under review), 22
conference papers, and 3 invited presentations
during his graduate studies at MS State
(http://www.isip.msstate.edu/publications). He
also spent two summers as an intern at Microsoft
in the recognition engine group.
4. Outline
- The acoustic modeling problem for speech
- Current state-of-the-art
- Discriminative approaches
- Structural optimization and Occam's Razor
- Support vector classifiers
- Relevance vector classifiers
- Coupling vector machines to ASR systems
- Scaling relevance vector methods to real problems
- Extensions of this work
5. ASR Problem
- Front-end maintains information important for modeling in a reduced parameter set
- Language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams)
- Search engine uses knowledge sources and models to choose amongst competing hypotheses
6. Acoustic Confusability
Requires reasoning under uncertainty!
- Regions of overlap represent classification error
- Reduce overlap by introducing acoustic and
linguistic context
Figure: comparison of 'aa' (as in lOck) and 'iy' (as in bEAt) for SWB
7. Probabilistic Formulation
- To deal with the uncertainty, we typically formulate speech as a probabilistic problem
- Objective: minimize the word error rate by maximizing P(W|A)
- Approach: maximize P(A|W) during training
- Components (see the decomposition below)
- P(A|W): Acoustic Model
- P(W): Language Model
- P(A): Acoustic probability (ignored during maximization)
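The decomposition above follows from Bayes' rule; written out in standard notation (not reproduced from the original slide):

\[
P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)},
\qquad
\hat{W} = \arg\max_{W}\; P(A \mid W)\,P(W),
\]

since P(A) is constant across competing word sequences W and can be ignored during maximization.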
8. Acoustic Modeling - HMMs
- HMMs model temporal variation in the transition probabilities of the state machine
- GMM emission densities are used to account for variations in speaker, accent, and pronunciation
- Sharing model parameters is a common strategy to reduce complexity
9. Maximum Likelihood Training
- Data-driven modeling supervised only from a word-level transcription
- Approach: maximum likelihood estimation
- The EM algorithm is used to improve our estimates
- Guaranteed convergence to a local maximum
- No guard against overfitting!
- Computationally efficient training algorithms (Forward-Backward) have been crucial (see the sketch below)
- Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge
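As a concrete reference for the Forward-Backward remark above, here is a minimal sketch of the forward pass for a discrete-observation HMM. It is illustrative only: the variable names, the discrete emission table, and the toy values are assumptions rather than part of the system described in this talk, and a production system would work in log space with scaling.

    import numpy as np

    def forward(pi, A, B, obs):
        """Forward pass: returns P(observation sequence | HMM).

        pi  : (S,)   initial state probabilities
        A   : (S, S) transition probabilities, A[i, j] = P(next state j | state i)
        B   : (S, V) discrete emission probabilities, B[i, k] = P(symbol k | state i)
        obs : list of observed symbol indices
        """
        alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(o_1)
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]      # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        return alpha.sum()                     # total likelihood of the observation sequence

    # Toy usage: 2 states, 3 output symbols
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(forward(pi, A, B, [0, 1, 2]))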
10. Drawbacks of Current Approach
- ML convergence does not translate to optimal classification
- Error from incorrect modeling assumptions
- Finding the optimal decision boundary requires only one parameter!
11. Drawbacks of Current Approach
- Data not separable by a hyperplane: a nonlinear classifier is needed
- Gaussian MLE models tend toward the center of mass: overtraining leads to poor generalization
12. Acoustic Modeling
- Acoustic models must
- Model the temporal progression of the speech
- Model the characteristics of the sub-word units
- We would also like our models to
- Optimally trade off discrimination and representation
- Incorporate Bayesian statistics (priors)
- Make efficient use of parameters (sparsity)
- Produce confidence measures of their predictions
for higher-level decision processes
13. Paradigm Shift - Discriminative Modeling
- Discriminative Training (Maximum Mutual Information Estimation)
- Essential idea: maximize the posterior P(W|A), i.e. the ratio of P(A|W)P(W) to the sum of P(A|W')P(W') over competing word sequences W'
- Maximize the numerator (ML term), minimize the denominator (discriminative term)
- Discriminative Modeling (e.g. ANN hybrids: Bourlard and Morgan)
14. Research Focus
- Our research: replace the Gaussian likelihood computation with a machine that incorporates notions of
- Discrimination
- Bayesian statistics (prior information)
- Confidence
- Sparsity
- All while maintaining computational efficiency
15. ANN Hybrids
- Architecture
- ANN provides flexible, discriminative classifiers for emission probabilities that avoid HMM independence assumptions (can use wider acoustic context)
- Trained using Viterbi iterative training (hard decision rule) or can be trained to learn Baum-Welch targets (soft decision rule)
- Shortcomings
- Prone to overfitting: requires cross-validation to determine when to stop training. Need methods that automatically penalize overfitting
- No substantial recognition improvements over HMM/GMM
16. Structural Optimization
- Structural optimization is often guided by an Occam's Razor approach
- Trading goodness of fit against model complexity
- Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination
17. Structural Risk Minimization
- Expected Risk
- Not possible to estimate P(x,y)
- Empirical Risk
- Related by the VC dimension, h
- Approach: choose the machine that gives the least upper bound on the actual risk (see the bound sketched below)
Figure: bound on the expected risk plotted against VC dimension, showing the empirical risk, the VC confidence, and the optimum that minimizes the bound
- The VC dimension is a measure of the complexity of the learning machine
- A higher VC dimension gives a looser bound on the actual risk, thus penalizing a more complex model (Vapnik)
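For reference, the quantities named above can be written out in Vapnik's conventional form (a sketch in standard notation; the original slide's equations are not reproduced verbatim):

\[
R(\alpha) = \int \tfrac{1}{2}\,\lvert y - f(\mathbf{x},\alpha)\rvert \, dP(\mathbf{x},y)
\quad \text{(expected risk)},
\qquad
R_{\mathrm{emp}}(\alpha) = \frac{1}{2N}\sum_{i=1}^{N} \lvert y_i - f(\mathbf{x}_i,\alpha)\rvert
\quad \text{(empirical risk)}
\]

\[
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\left(\ln(2N/h)+1\right) - \ln(\eta/4)}{N}}
\quad \text{with probability } 1-\eta,
\]

where h is the VC dimension and the square-root term is the VC confidence.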
18. Support Vector Machines
- Optimization for separable data (equations sketched below)
- Hyperplane
- Constraints
- Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors
- Final classifier
- Hyperplanes C0-C2 achieve zero empirical risk; C0 generalizes optimally
- The data points that define the boundary are called support vectors
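The equations referenced by the bullets above, sketched in standard SVM notation (a reconstruction, not the slide's original typesetting):

\[
\text{Hyperplane:}\quad \mathbf{w}\cdot\mathbf{x} + b = 0,
\qquad
\text{Constraints:}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\ \ i = 1,\dots,N
\]

\[
\text{Dual:}\quad \max_{\boldsymbol\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j
\quad \text{s.t. } \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0
\]

\[
\text{Final classifier:}\quad f(\mathbf{x}) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\,\mathbf{x}_i\cdot\mathbf{x} + b\Big),
\]

with only the support vectors having nonzero \(\alpha_i\).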
19. SVMs as Nonlinear Classifiers
- Data for practical applications are typically not separable using a hyperplane in the original input feature space
- Transform the data to a higher dimension where a hyperplane classifier is sufficient to model the decision surface
- Kernels are used for this transformation
- Final classifier (sketched below)
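The kernel substitution and the kernelized classifier referenced above (standard forms; the specific kernels shown are common examples and may not match the slide's full list):

\[
K(\mathbf{x},\mathbf{z}) = \Phi(\mathbf{x})\cdot\Phi(\mathbf{z}),
\qquad
\text{e.g. RBF: } K(\mathbf{x},\mathbf{z}) = \exp\!\big(-\gamma\lVert\mathbf{x}-\mathbf{z}\rVert^2\big),
\quad
\text{polynomial: } K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z} + 1)^p
\]

\[
\text{Final classifier:}\quad f(\mathbf{x}) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\,K(\mathbf{x}_i,\mathbf{x}) + b\Big)
\]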
20. SVMs for Non-Separable Data
- No hyperplane could achieve zero empirical risk (in any dimension!)
- Recall the SRM principle: trade off empirical risk and model complexity
- Relax our optimization constraint to allow for errors on the training set (see the sketch below)
- A new parameter, C, must be estimated to optimally control the trade-off between training-set errors and model complexity
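A sketch of the relaxed (soft-margin) optimization described above, using slack variables \(\xi_i\) in the standard formulation:

\[
\min_{\mathbf{w},b,\boldsymbol\xi}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{N}\xi_i
\qquad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1-\xi_i,\ \ \xi_i \ge 0
\]

Larger C penalizes training-set errors more heavily; smaller C favors a wider margin (lower complexity).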
21. SVM Drawbacks
- Uses a binary (yes/no) decision rule
- Generates a distance from the hyperplane, but this distance is often not a good measure of our confidence in the classification
- Can produce a probability as a function of the distance (e.g. using sigmoid fits, sketched below), but they are inadequate
- Number of support vectors grows linearly with the size of the data set
- Requires the estimation of the trade-off parameter, C, via held-out sets
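The sigmoid fit mentioned above is typically Platt's posterior estimate; a sketch of its form:

\[
P(y=1 \mid f(\mathbf{x})) \approx \frac{1}{1+\exp\!\big(A\,f(\mathbf{x}) + B\big)},
\]

where f(x) is the SVM output (distance from the hyperplane) and the parameters A and B are estimated on a held-out set.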
22. Evidence Maximization
- Build a fully specified probabilistic model: incorporate prior information/beliefs as well as a notion of confidence in predictions
- MacKay posed a special form of regularization in neural networks: sparsity
- Evidence maximization: evaluate candidate models based on their evidence, P(D|Hi)
- Structural optimization by maximizing the evidence across all candidate models!
- Steeped in Gaussian approximations
23. Evidence Framework
- Evidence approximation (sketched below)
- Likelihood of the data given the best-fit parameter set
- Penalty that measures how well our posterior model fits our prior assumptions
- We can set the prior in favor of sparse, smooth models!
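MacKay's evidence approximation referenced above, sketched in standard form (not reproduced verbatim from the slide):

\[
P(D \mid H_i) = \int P(D \mid \mathbf{w}, H_i)\,P(\mathbf{w} \mid H_i)\,d\mathbf{w}
\;\approx\;
\underbrace{P(D \mid \mathbf{w}_{\mathrm{MP}}, H_i)}_{\text{best-fit likelihood}}
\;\times\;
\underbrace{P(\mathbf{w}_{\mathrm{MP}} \mid H_i)\,\Delta\mathbf{w}}_{\text{Occam factor}}
\]

The Occam factor penalizes models whose posterior occupies only a small fraction of their prior volume.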
24. Relevance Vector Machines
- A kernel-based learning machine (functional form sketched below)
- Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay)
- A flat (non-informative) prior over the hyperparameters completes the Bayesian specification
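A sketch of the RVM functional form and the ARD prior referenced above, in Tipping's standard notation:

\[
y(\mathbf{x};\mathbf{w}) = \sigma\Big(\sum_{i=1}^{N} w_i\,K(\mathbf{x},\mathbf{x}_i) + w_0\Big),
\qquad
p(\mathbf{w} \mid \boldsymbol\alpha) = \prod_i \mathcal{N}\big(w_i \mid 0,\ \alpha_i^{-1}\big)
\]

Each weight has its own hyperparameter \(\alpha_i\); driving \(\alpha_i \to \infty\) prunes the corresponding kernel function, which is what produces sparsity.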
25. Relevance Vector Machines
- The goal in training becomes finding the hyperparameters (and weights) that maximize the marginal likelihood of the data
- Estimation of the sparsity parameters is inherent in the optimization: no need for a held-out set!
- A closed-form solution to this maximization problem is not available. Rather, we iteratively reestimate the weights and hyperparameters
26. Laplace's Method
- Fix the hyperparameters and estimate w (e.g. gradient descent), giving the most probable weights w_MP
- Use the Hessian to approximate the covariance of a Gaussian posterior of the weights centered at w_MP
- With w_MP and the inverse Hessian as the mean and covariance, respectively, of the Gaussian approximation, we find the hyperparameters by maximizing the approximate marginal likelihood (see the sketch below)
- Method is O(N²) in memory and O(N³) in time
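A minimal runnable sketch of the loop described above, assuming a precomputed design matrix of kernel values and 0/1 targets. This is illustrative only and not the ISIP implementation: the iteration counts are arbitrary, and pruning is simplified to capping the hyperparameters rather than removing columns.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rvm_train(Phi, t, n_outer=50, n_newton=25, alpha_cap=1e9):
        """Sketch of RVM classification training via Laplace's method.

        Phi : (N, M) design matrix of kernel evaluations
        t   : (N,)   binary targets in {0, 1}
        Returns the posterior mode of the weights (mu) and the hyperparameters (alpha).
        """
        N, M = Phi.shape
        alpha = np.full(M, 1e-2)      # one ARD hyperparameter per weight
        mu = np.zeros(M)              # most probable weights w_MP

        for _ in range(n_outer):
            # 1) Fix alpha; find w_MP by Newton's method (IRLS) on the log posterior
            for _ in range(n_newton):
                y = sigmoid(Phi @ mu)
                grad = Phi.T @ (t - y) - alpha * mu
                Bdiag = y * (1.0 - y)
                H = (Phi * Bdiag[:, None]).T @ Phi + np.diag(alpha)   # negative Hessian
                mu = mu + np.linalg.solve(H, grad)

            # 2) Laplace approximation: Gaussian posterior centered at w_MP,
            #    with covariance Sigma = H^{-1}
            y = sigmoid(Phi @ mu)
            Bdiag = y * (1.0 - y)
            H = (Phi * Bdiag[:, None]).T @ Phi + np.diag(alpha)
            Sigma = np.linalg.inv(H)

            # 3) Re-estimate hyperparameters: alpha_i = gamma_i / mu_i^2,
            #    where gamma_i = 1 - alpha_i * Sigma_ii
            gamma = 1.0 - alpha * np.diag(Sigma)
            alpha = np.minimum(gamma / (mu ** 2 + 1e-12), alpha_cap)
            # A full implementation prunes weights whose alpha diverges
            # (removing their columns), which is what yields the sparse model.

        return mu, alpha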
27. RVMs Compared to SVMs
- RVM
- Data: class labels (0,1)
- Goal: learn the posterior, P(t=1|x)
- Structural optimization: hyperprior distribution encourages sparsity
- Training: iterative, O(N³)
- SVM
- Data: class labels (-1,1)
- Goal: find the optimal decision surface under constraints
- Structural optimization: trade-off parameter that must be estimated
- Training: quadratic programming, O(N²)
28. Simple Example
29. ML Comparison
30. SVM Comparison
31. SVM With Sigmoid Posterior Comparison
32. RVM Comparison
33. Experimental Progression
- Proof of concept on speech classification data
- Coupling classifiers to ASR system
- Reduced-set tests on Alphadigits task
- Algorithms for scaling up RVM classifiers
- Further tests on Alphadigits task (still not the full training set though!)
- New work aiming at larger data sets and HMM decoupling
34. Vowel Classification
- Deterding Vowel Data: 11 vowels spoken in an h*d context; 10 log-area parameters; 528 training and 462 speaker-independent test vectors
35. Coupling to ASR
- Data size
- 30 million frames of data in training set
- Solution: segmental phone models
- Source for Segmental Data
- Solution: use the HMM system in a bootstrap procedure
- Could also build a segment-based decoder
- Probabilistic decoder coupling
- SVMs: sigmoid-fit posterior
- RVMs: naturally probabilistic
36. Coupling to ASR System
- Pipeline: Features (Mel-Cepstra) -> HMM RECOGNITION -> Segment Information and N-best List -> SEGMENTAL CONVERTER -> Segmental Features -> HYBRID DECODER -> Hypothesis
37. Alphadigit Recognition
- OGI Alphadigits: continuous, telephone-bandwidth letters and numbers (e.g. A19B4E)
- Reduced training set size for RVM comparison: 2000 training segments per phone model
- Could not, at this point, run larger sets efficiently
- 3329 utterances using 10-best lists generated by the HMM decoder
- SVM and RVM system architectures are nearly identical: RBF kernels with gamma = 0.5
- SVM requires the sigmoid posterior estimate to produce likelihoods; sigmoid parameters estimated from a large held-out set
38. SVM Alphadigit Recognition
- HMM system is cross-word state-tied triphones with 16-mixture Gaussian models
- SVM system has monophone models with segmental features
- A system combination experiment yields another 1% reduction in error
39. SVM/RVM Alphadigit Comparison
- RVMs yield a large reduction in the parameter count while attaining superior performance
- Computational cost for RVMs is mainly in training, but it is still prohibitive for larger sets
40. Scaling Up
- Central to RVM training is the inversion of an MxM Hessian matrix, an O(N³) operation initially
- Solutions
- Constructive approach: start with an empty model and iteratively add candidate parameters. M is typically much smaller than N
- Divide-and-conquer approach: divide the complete problem into a set of sub-problems; iteratively refine the candidate parameter set according to the sub-problem solutions. M is user-defined
41. Constructive Approach
- Tipping and Faul (MSR-Cambridge)
- Define a sparsity factor s_i and a quality factor q_i for each candidate basis function
- The marginal likelihood, considered as a function of a single hyperparameter alpha_i, has a unique maximum with respect to it: finite (basis function in the model) when q_i² > s_i, infinite (pruned) otherwise
- The results give a set of rules for adding vectors to the model, removing vectors from the model, or updating parameters in the model
42. Constructive Approach Algorithm
- Prune all parameters
- While not converged
    - For each parameter
        - If parameter is pruned: checkAddRule
        - Else: checkPruneRule, checkUpdateRule
    - End
    - Update Model
- End
- Begin with all weights set to zero and iteratively construct an optimal model without evaluating the full NxN inverse (see the sketch below)
- Formulated for RVM regression; can have oscillatory behavior for classification
- Rule subroutines require the full design matrix: an NxN storage requirement
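The add/prune/update rules above, sketched as runnable Python for the simpler regression case (classification interleaves a Laplace step). This is an illustrative reconstruction from Tipping and Faul's published rules, not the ISIP code: the noise variance, the cyclic candidate schedule, and the seeding choice are assumptions, and C is inverted directly for clarity even though the published algorithm maintains these statistics incrementally.

    import numpy as np

    def constructive_rvm_regression(Phi, t, sigma2=0.1, n_iter=500):
        """Sketch of the constructive (fast marginal likelihood) RVM rules.

        alpha[i] = inf means basis function i is pruned from the model.
        Each iteration visits one candidate and applies the add,
        re-estimate, or prune rule based on its sparsity (s) and quality (q).
        """
        N, M = Phi.shape
        alpha = np.full(M, np.inf)                      # start with an empty model
        # Seed with the best-aligned basis function (Tipping & Faul initialization)
        i0 = int(np.argmax((Phi.T @ t) ** 2 / np.sum(Phi ** 2, axis=0)))
        phi0 = Phi[:, i0]
        alpha[i0] = np.sum(phi0 ** 2) / ((phi0 @ t) ** 2 / np.sum(phi0 ** 2) - sigma2)

        for it in range(n_iter):
            active = np.isfinite(alpha)
            Phi_a = Phi[:, active]
            C = sigma2 * np.eye(N) + Phi_a @ np.diag(1.0 / alpha[active]) @ Phi_a.T
            Cinv = np.linalg.inv(C)

            i = it % M                                  # visit candidates cyclically
            phi = Phi[:, i]
            S, Q = phi @ Cinv @ phi, phi @ Cinv @ t
            if np.isfinite(alpha[i]):                   # exclude i's own contribution
                s, q = alpha[i] * S / (alpha[i] - S), alpha[i] * Q / (alpha[i] - S)
            else:
                s, q = S, Q

            theta = q ** 2 - s
            if theta > 0:
                alpha[i] = s ** 2 / theta               # add (if pruned) or re-estimate
            elif np.isfinite(alpha[i]) and active.sum() > 1:
                alpha[i] = np.inf                       # prune from the model
        return alpha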
43. Iterative Reduction Algorithm
Diagram: at iteration I the candidate pool is divided into subsets 0..J; the relevance vectors (RVs) surviving each subset's solution form the candidate pool for iteration I+1
- O(M³) in run time and O(MxN) in memory. M is a user-defined parameter (see the sketch below)
- Assumes that if P(w_k = 0 | w_{I,J}, D) is 1, then P(w_k = 0 | w, D) is also 1! Optimality?
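A hypothetical Python skeleton of the reduction loop sketched in the diagram above. The sub-problem solver train_rvm, the random partitioning, the subset size, and the stopping test are all assumptions; the slide does not specify these details, so this only conveys the shape of the divide-and-conquer procedure.

    import numpy as np

    def iterative_reduction(Phi, t, train_rvm, subset_size=500, n_rounds=10, seed=0):
        """Skeleton of the divide-and-conquer reduction of the candidate set.

        train_rvm(Phi_sub, t) is assumed to solve one sub-problem and return
        the indices (into the sub-problem) of its surviving relevance vectors.
        Survivors from all subsets form the next round's candidate pool.
        """
        rng = np.random.default_rng(seed)
        candidates = np.arange(Phi.shape[1])          # start from every basis function
        for _ in range(n_rounds):
            rng.shuffle(candidates)
            survivors = []
            for start in range(0, len(candidates), subset_size):
                subset = candidates[start:start + subset_size]
                kept = train_rvm(Phi[:, subset], t)   # solve the O(M^3) sub-problem
                survivors.extend(subset[kept])        # keep only its relevance vectors
            if len(survivors) == len(candidates):     # no further reduction: stop
                break
            candidates = np.array(survivors)
        return candidates                             # indices of the retained RVs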
44. Alphadigit Recognition
- Data increased to 10000 training vectors
- The reduction method has been trained on up to 100k vectors (on a toy task); not possible for the constructive method
45. Summary
- First to apply kernel machines as acoustic models
- Comparison of two machines that apply structural optimization to learning: SVM and RVM
- Performance exceeds that of the HMM, but with quite a bit of HMM interaction
- Algorithms for increased data sizes are key
46. Decoupling the HMM
- Still want to use segmental data (data size)
- Want the kernel machine acoustic model to determine an optimal segmentation, though
- Need a new decoder
- Hypothesize each phone for each possible segment
- Pruning is a huge issue
- Stack decoder is beneficial
- Status: in development
47. Improved Iterative Algorithm
- Same principle of operation
- One pass over the data: much faster!
- Status: equivalent performance on all benchmarks; running on Alphadigits now
48. Active Learning for RVMs
- Idea: given the current model, iteratively choose a subset of points from the full training set that will improve the system performance
- Problem 1: performance is typically defined as classifier error rate (e.g. boosting). What about the accuracy of the posterior estimate?
- Problem 2: for kernel machines, an added training point can
- Assist in bettering the model performance
- Become part of the model itself! How do we determine which points should be added?
- Look to work in Gaussian Processes (Lawrence, Seeger, Herbrich, 2003)
49. Extensions
- Not ready for prime time as an acoustic model
- How else might we use the same techniques for speech?
- Online Speech/Noise Classification?
- Requires adaptation methods
- Application of automatic relevance determination
to model selection for HMMs?
50. Acknowledgments
- Collaborators: Aravind Ganapathiraju and Joe Picone at Mississippi State
- Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (now at Cornell)