A Sparse Modeling Approach to Speech Recognition Using Kernel Machines

1
A Sparse Modeling Approach to Speech Recognition
Using Kernel Machines
  • Jon Hamaker
  • hamaker@isip.msstate.edu
  • Institute for Signal and Information Processing
  • Mississippi State University

2
Abstract
  • Statistical techniques based on Hidden Markov
    models (HMMs) with Gaussian emission densities
    have dominated the signal processing and pattern
    recognition literature for the past 20 years.
    However, HMMs suffer from an inability to learn
    discriminative information and are prone to
    over-fitting and over-parameterization. Recent
    work in machine learning has focused on models,
    such as the support vector machine (SVM), that
    automatically control generalization and
    parameterization as part of the overall
    optimization process. SVMs have been shown to
    provide significant improvements in performance
    on small pattern recognition tasks compared to a
    number of conventional approaches. SVMs, however,
    require ad hoc (and unreliable) methods to couple
    them to probabilistic learning machines.
    Probabilistic Bayesian learning machines, such as
    the relevance vector machine (RVM), are fairly
    new approaches that attempt to overcome the
    deficiencies of SVMs by explicitly accounting for
    sparsity and statistics in their formulation.
  • In this presentation, we describe both of these
    modeling approaches in brief. We then describe
    our work to integrate these as acoustic models in
    large vocabulary speech recognition systems.
    Particular attention is given to algorithms for
    training these learning machines on large
    corpora. In each case, we find that both SVM and
    RVM-based systems perform better than Gaussian
    mixture-based HMMs in open-loop recognition. We
    further show that the RVM-based solution performs
    on par with the SVM system using an order of
    magnitude fewer parameters. We conclude with a
    discussion of the remaining hurdles for providing
    this technology in a form amenable to current
    state-of-the-art recognizers.

3
Bio
  • Jon Hamaker is a Ph.D. candidate in the
    Department of Electrical and Computer Engineering
    at Mississippi State University under the
    supervision of Dr. Joe Picone. He has been a
    senior member of the Institute for Signal and
    Information Processing (ISIP) at MSU since 1996.
    Mr. Hamaker's research work has revolved around
    automatic structural analysis and optimization
    methods for acoustic modeling in speech
    recognition systems. His most recent work has
    been in the application of kernel machines as
    replacements for the underlying Gaussian
    distribution in hidden Markov acoustic models.
    His dissertation work compares the popular
    support vector machine with the relatively new
    relevance vector machine in the context of a
    speech recognition system. Mr. Hamaker has
    co-authored 4 journal papers (2 under review), 22
    conference papers, and 3 invited presentations
    during his graduate studies at MS State
    (http://www.isip.msstate.edu/publications). He
    also spent two summers as an intern at Microsoft
    in the recognition engine group.

4
Outline
  • The acoustic modeling problem for speech
  • Current state-of-the-art
  • Discriminative approaches
  • Structural optimization and Occam's Razor
  • Support vector classifiers
  • Relevance vector classifiers
  • Coupling vector machines to ASR systems
  • Scaling relevance vector methods to real
    problems
  • Extensions of this work

5
ASR Problem
  • The front end preserves the information important
    for modeling in a reduced parameter set
  • The language model typically predicts a small set
    of next words based on knowledge of a finite number
    of previous words (N-grams)
  • The search engine uses the knowledge sources and
    models to choose amongst competing hypotheses

6
Acoustic Confusability
Requires reasoning under uncertainty!
  • Regions of overlap represent classification error
  • Reduce overlap by introducing acoustic and
    linguistic context

Figure: comparison of /aa/ (as in "lock") and /iy/ (as in "beat") for SWB (Switchboard) data
7
Probabilistic Formulation
  • To deal with the uncertainty, we typically
    formulate speech as a probabilistic problem
  • Objective: minimize the word error rate by
    maximizing P(W|A)
  • Approach: maximize P(A|W) during training
  • Components (see the decomposition below):
  • P(A|W): acoustic model
  • P(W): language model
  • P(A): acoustic probability (ignored during
    maximization)
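
For reference, the decomposition implied by these components follows from Bayes' rule (a standard identity, stated here for completeness):

    \hat{W} = \arg\max_W P(W|A) = \arg\max_W \frac{P(A|W)\,P(W)}{P(A)} = \arg\max_W P(A|W)\,P(W)

The last step holds because P(A) does not depend on W, which is why the acoustic probability can be ignored during maximization.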

8
Acoustic Modeling - HMMs
  • HMMs model temporal variation through the
    transition probabilities of the state machine
  • GMM emission densities are used to account for
    variations in speaker, accent, and pronunciation
  • Sharing model parameters is a common strategy to
    reduce complexity

9
Maximum Likelihood Training
  • Data-driven modeling, supervised only by a
    word-level transcription
  • Approach: maximum likelihood estimation
  • The EM algorithm is used to improve our
    estimates
  • Guaranteed convergence to local maximum
  • No guard against overfitting!
  • Computationally efficient training algorithms
    (Forward-Backward) have been crucial (a sketch of
    the forward recursion is given below)
  • Decision trees are used to optimize parameter
    sharing, minimize system complexity and integrate
    additional linguistic knowledge
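
As an illustration of the efficiency point above, here is a minimal sketch of the forward recursion that underlies Forward-Backward training, written in the log domain; the function and variable names are illustrative and not taken from any particular toolkit.

    import numpy as np
    from scipy.special import logsumexp

    def forward_log_likelihood(log_A, log_pi, log_B):
        """Forward recursion in the log domain.

        log_A  : (S, S) log transition probabilities
        log_pi : (S,)   log initial state probabilities
        log_B  : (T, S) log emission likelihoods per frame and state
        Returns log P(observations | model) in O(T * S^2) time.
        """
        T, S = log_B.shape
        log_alpha = np.empty((T, S))
        log_alpha[0] = log_pi + log_B[0]
        for t in range(1, T):
            # alpha_t(j) = b_j(o_t) * sum_i alpha_{t-1}(i) * a_ij, in log space
            log_alpha[t] = log_B[t] + logsumexp(
                log_alpha[t - 1][:, None] + log_A, axis=0)
        return logsumexp(log_alpha[-1])

The same recursion, run backwards, supplies the beta terms needed for the EM (Baum-Welch) statistics.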

10
Drawbacks of Current Approach
  • ML Convergence does not translate to optimal
    classification
  • Error from incorrect modeling assumptions
  • Finding the optimal decision boundary requires
    only one parameter!

11
Drawbacks of Current Approach
  • Data not separable by a hyperplane: a nonlinear
    classifier is needed
  • Gaussian MLE models tend toward the center of
    mass; overtraining leads to poor generalization

12
Acoustic Modeling
  • Acoustic models must:
  • Model the temporal progression of the speech
  • Model the characteristics of the sub-word units
  • We would also like our models to:
  • Optimally trade off discrimination and
    representation
  • Incorporate Bayesian statistics (priors)
  • Make efficient use of parameters (sparsity)
  • Produce confidence measures of their predictions
    for higher-level decision processes

13
Paradigm Shift - Discriminative Modeling
  • Discriminative Training (Maximum Mutual
    Information Estimation)
  • Essential idea: maximize the ratio of the correct
    hypothesis likelihood to the sum over competing
    hypotheses (the MMIE objective is shown below)
  • Maximize numerator (ML term), minimize
    denominator (discriminative term)
  • Discriminative modeling (e.g., ANN hybrids;
    Bourlard and Morgan)
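
One common statement of the MMIE criterion (written without the acoustic scaling factors that practical systems usually add) is:

    F_{MMIE}(\lambda) = \sum_r \log \frac{P_\lambda(A_r | W_r)\, P(W_r)}{\sum_{W} P_\lambda(A_r | W)\, P(W)}

The numerator is the ML term for the correct transcription W_r; the denominator sums over competing hypotheses, which is what makes the criterion discriminative.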

14
Research Focus
  • Our research: replace the Gaussian likelihood
    computation with a machine that incorporates
    notions of
  • Discrimination
  • Bayesian statistics (prior information)
  • Confidence
  • Sparsity
  • All while maintaining computational efficiency

15
ANN Hybrids
  • Architecture
  • ANN provides flexible, discriminative classifiers
    for emission probabilities that avoid HMM
    independence assumptions (can use wider acoustic
    context)
  • Trained using Viterbi iterative training (hard
    decision rule) or can be trained to learn
    Baum-Welch targets (soft decision rule)
  • Shortcomings
  • Prone to overfitting: cross-validation is required
    to determine when to stop training. Methods are
    needed to automatically penalize overfitting
  • No substantial recognition improvements over
    HMM/GMM

16
Structural Optimization
  • Structural optimization is often guided by an
    Occam's Razor approach
  • Trading off goodness of fit against model complexity
  • Examples: MDL, BIC, AIC, structural risk
    minimization, automatic relevance determination

17
Structural Risk Minimization
  • Expected risk: the true risk under the joint
    density P(x,y), which is not possible to estimate
  • Empirical risk: the error measured on the training
    set
  • The two are related through the VC dimension, h
    (see the bound below)
  • Approach: choose the machine that gives the least
    upper bound on the actual risk

Figure: the bound on the expected risk is the sum of the empirical risk and the VC confidence, plotted against VC dimension; the optimum minimizes the bound.
  • The VC dimension is a measure of the complexity
    of the learning machine
  • Higher VC dimension gives a looser bound on the
    actual risk thus penalizing a more complex
    model (Vapnik)
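
The bound referred to above is Vapnik's VC bound: with probability at least 1 - \eta over a training set of size N,

    R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\,(\ln(2N/h) + 1) - \ln(\eta/4)}{N}}

where R is the expected risk, R_{emp} the empirical risk and h the VC dimension; the square-root term is the "VC confidence" shown in the figure.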

18
Support Vector Machines
  • Optimization for separable data (the standard
    formulation is given below)
  • Hyperplane: w · x + b = 0
  • Constraints: correct classification of every
    training point with margin at least one
  • Quadratic optimization of a Lagrange functional
    minimizes the risk criterion (maximizes the
    margin). Only a small portion of the training
    points become support vectors
  • Final classifier: a weighted sum over the support
    vectors (see below)
  • Hyperplanes C0-C2 achieve zero empirical risk. C0
    generalizes optimally
  • The data points that define the boundary are
    called support vectors
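
For completeness, the standard separable-case formulation summarized above is:

    \text{hyperplane:}\quad w \cdot x + b = 0
    \text{constraints:}\quad y_i\,(w \cdot x_i + b) \ge 1,\; i = 1,\dots,N
    \text{objective:}\quad \min_{w,b}\; \tfrac{1}{2}\,\|w\|^2
    \text{classifier:}\quad f(x) = \mathrm{sign}\Big(\sum_i \alpha_i\, y_i\,(x_i \cdot x) + b\Big)

Only the training points with non-zero Lagrange multipliers \alpha_i (the support vectors) appear in the final classifier.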

19
SVMs as Nonlinear Classifiers
  • Data for practical applications typically not
    separable using a hyperplane in the original
    input feature space
  • Transform data to a higher-dimensional space where
    a hyperplane classifier is sufficient to model the
    decision surface
  • Kernels are used for this transformation
  • Final classifier: inner products are replaced by
    kernel evaluations (see below)
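
With a kernel K replacing the inner product, the classifier above becomes (using the RBF kernel assumed in the later experiments):

    f(x) = \mathrm{sign}\Big(\sum_i \alpha_i\, y_i\, K(x_i, x) + b\Big), \qquad K(x, z) = \exp\big(-\gamma\,\|x - z\|^2\big)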

20
SVMs for Non-Separable Data
  • No hyperplane can achieve zero empirical risk (in
    a space of any dimension!)
  • Recall the SRM principle: trade off empirical risk
    and model complexity
  • Relax our optimization constraints to allow for
    errors on the training set (the soft-margin
    objective is shown below)
  • A new parameter, C, must be estimated to optimally
    control the trade-off between training-set errors
    and model complexity
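
The relaxed (soft-margin) objective is the standard one:

    \min_{w,b,\xi}\; \tfrac{1}{2}\,\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0

where the slack variables \xi_i absorb training-set errors and C controls the trade-off.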

21
SVM Drawbacks
  • Uses a binary (yes/no) decision rule
  • Generates a distance from the hyperplane, but
    this distance is often not a good measure of our
    confidence in the classification
  • Can produce a probability as a function of the
    distance (e.g., using the sigmoid fit shown below),
    but these estimates are often inadequate
  • Number of support vectors grows linearly with the
    size of the data set
  • Requires the estimation of the trade-off
    parameter, C, via held-out sets
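
The sigmoid fit mentioned above is typically Platt's posterior estimate, with A and B estimated on held-out data:

    P(y = 1 \mid f(x)) \approx \frac{1}{1 + \exp(A\, f(x) + B)}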

22
Evidence Maximization
  • Build a fully specified probabilistic model:
    incorporate prior information/beliefs as well as a
    notion of confidence in predictions
  • MacKay posed a special form of regularization in
    neural networks: sparsity
  • Evidence maximization: evaluate candidate models
    based on their evidence, P(D|Hi)
  • Structural optimization by maximizing the
    evidence across all candidate models!
  • Steeped in Gaussian approximations

23
Evidence Framework
  • Evidence approximation (see below): the evidence
    factors into
  • the likelihood of the data given the best-fit
    parameter set, and
  • a penalty (Occam factor) that measures how well our
    posterior model fits our prior assumptions
  • We can set the prior in favor of sparse, smooth
    models!
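
MacKay's evidence approximation referred to here has the standard form:

    P(D|H_i) = \int P(D|w, H_i)\, P(w|H_i)\, dw \;\approx\; P(D|w_{MP}, H_i) \times P(w_{MP}|H_i)\, \Delta w

The first factor is the best-fit likelihood; the second (the Occam factor, with \Delta w the posterior width) penalizes models that spread their prior mass over many parameter settings.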

24
Relevance Vector Machines
  • A kernel-based learning machine
  • Incorporates an automatic relevance determination
    (ARD) prior over each weight (MacKay), shown below
  • A flat (non-informative) hyperprior over the
    hyperparameters α completes the Bayesian
    specification
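
The ARD prior takes the standard zero-mean Gaussian form, with one hyperparameter per weight:

    p(w \mid \alpha) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})

As training drives an \alpha_i toward infinity, the corresponding weight is pinned to zero and its basis function is pruned; this is the source of sparsity.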

25
Relevance Vector Machines
  • The goal in training becomes finding the
    hyperparameters α that maximize the evidence
    (marginal likelihood) of the data
  • Estimation of the sparsity parameters is inherent
    in the optimization: no need for a held-out set!
  • A closed-form solution to this maximization problem
    is not available. Rather, we iteratively
    re-estimate the hyperparameters (see the update
    rule below)
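
The re-estimation uses Tipping's standard update, stated here for reference:

    \gamma_i = 1 - \alpha_i\, \Sigma_{ii}, \qquad \alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}

where \mu and \Sigma are the mean and covariance of the approximate weight posterior and \gamma_i measures how well-determined weight w_i is by the data.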

26
Laplace's Method
  • Fix α and estimate w (e.g., by gradient descent)
  • Use the Hessian to approximate the covariance of a
    Gaussian posterior of the weights centered at the
    most probable weights, w_MP
  • With μ and Σ as the mean and covariance,
    respectively, of the Gaussian approximation (given
    below), we find the new α by maximizing the
    approximated evidence
  • Method is O(N^2) in memory and O(N^3) in time
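
In Tipping's treatment of the classification case, the Gaussian approximation has

    \Sigma = (\Phi^T B\, \Phi + A)^{-1}, \qquad \mu = w_{MP}

with A = diag(\alpha), B = diag(\sigma(y_n)\,(1 - \sigma(y_n))) evaluated at the mode w_{MP} found by the inner optimization, and \Phi the design (kernel) matrix. Inverting this matrix is the O(N^3) cost quoted above, since all N basis functions are initially in the model.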

27
RVMs Compared to SVMs
  • RVM
  • Data: class labels (0,1)
  • Goal: learn the posterior, P(t=1|x)
  • Structural optimization: hyperprior distribution
    encourages sparsity
  • Training: iterative, O(N^3)
  • SVM
  • Data: class labels (-1,1)
  • Goal: find the optimal decision surface under
    constraints
  • Structural optimization: trade-off parameter that
    must be estimated
  • Training: quadratic programming, O(N^2)

28
Simple Example
29
ML Comparison
30
SVM Comparison
31
SVM With Sigmoid Posterior Comparison
32
RVM Comparison
33
Experimental Progression
  • Proof of concept on speech classification data
  • Coupling classifiers to ASR system
  • Reduced-set tests on Alphadigits task
  • Algorithms for scaling up RVM classifiers
  • Further tests on Alphadigits task (still not the
    full training set though!)
  • New work aiming at larger data sets and HMM
    decoupling

34
Vowel Classification
  • Deterding vowel data: 11 vowels spoken in an h*d
    context; 10 log-area parameters; 528 training and
    462 speaker-independent (SI) test tokens

35
Coupling to ASR
  • Data size:
  • 30 million frames of data in the training set
  • Solution: segmental phone models
  • Source of segmental data:
  • Solution: use the HMM system in a bootstrap
    procedure
  • Could also build a segment-based decoder
  • Probabilistic decoder coupling:
  • SVMs: sigmoid-fit posterior
  • RVMs: naturally probabilistic

36
Coupling to ASR System
Block diagram: features (mel-cepstra) -> HMM recognition -> segment information and N-best list -> segmental converter -> segmental features -> hybrid decoder -> hypothesis
37
Alphadigit Recognition
  • OGI Alphadigits: continuous, telephone-bandwidth
    letters and numbers (e.g., "A19B4E")
  • Reduced training set size for the RVM comparison:
    2000 training segments per phone model
  • Could not, at this point, run larger sets
    efficiently
  • 3329 utterances, using 10-best lists generated by
    the HMM decoder
  • The SVM and RVM system architectures are nearly
    identical: RBF kernels with gamma = 0.5
  • The SVM requires the sigmoid posterior estimate to
    produce likelihoods; sigmoid parameters are
    estimated from a large held-out set

38
SVM Alphadigit Recognition
  • The HMM system uses cross-word, state-tied
    triphones with 16-mixture Gaussian models
  • The SVM system uses monophone models with segmental
    features
  • A system combination experiment yields a further 1%
    reduction in error

39
SVM/RVM Alphadigit Comparison
  • RVMs yield a large reduction in the parameter
    count while attaining superior performance
  • Computational cost for RVMs is mainly in training,
    but it is still prohibitive for larger sets

40
Scaling Up
  • Central to RVM training is the inversion of an
    MxM Hessian matrix: an O(N^3) operation initially,
    since M = N at the start
  • Solutions:
  • Constructive approach: start with an empty model
    and iteratively add candidate parameters. M is
    typically much smaller than N
  • Divide-and-conquer approach: divide the complete
    problem into a set of sub-problems and iteratively
    refine the candidate parameter set according to the
    sub-problem solutions. M is user-defined

41
Constructive Approach
  • Tipping and Faul (MSR-Cambridge)
  • Define per-basis-function "sparsity" and "quality"
    factors, s_i and q_i, from the marginal likelihood
  • The marginal likelihood has a unique maximum with
    respect to each individual α_i, available in closed
    form
  • The results give a set of rules for adding
    vectors to the model, removing vectors from the
    model or updating parameters in the model

42
Constructive Approach Algorithm
    Prune all parameters
    While not converged
      For each parameter
        If parameter is pruned: checkAddRule
        Else: checkPruneRule, checkUpdateRule
      End
      Update model
    End
  • Begin with all weights set to zero and iteratively
    construct an optimal model without evaluating the
    full NxN inverse (a sketch of the decision rules is
    given below)
  • Formulated for RVM regression; can exhibit
    oscillatory behavior for classification
  • The rule subroutines require the full design
    matrix: an NxN storage requirement
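
A minimal Python sketch of the decision rules in this loop, assuming the per-basis-function sparsity and quality factors s_i and q_i have already been computed from the current posterior (their computation is the bulk of the real algorithm and is omitted here):

    import numpy as np

    def constructive_sweep(alpha, s, q):
        """One sweep of the add / re-estimate / prune rules (after Tipping and Faul).

        alpha : (N,) hyperparameters; np.inf marks a pruned basis function
        s, q  : (N,) sparsity and quality factors for the current model
        """
        for i in range(len(alpha)):
            theta = q[i] ** 2 - s[i]
            if theta > 0:
                # add rule (if currently pruned) or update rule (if in the
                # model): the marginal likelihood has a closed-form maximum
                alpha[i] = s[i] ** 2 / theta
            else:
                # prune rule: the basis function no longer improves the evidence
                alpha[i] = np.inf
        return alpha

After each sweep, the weight posterior (and hence s and q) is recomputed; that is the "Update model" step in the pseudocode above.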

43
Iterative Reduction Algorithm
Diagram: the candidate pool is divided into subsets 0..J; the relevance vectors (RVs) retained at iteration I are pooled with a new subset of candidates at iteration I+1
  • O(M^3) in run time and O(MxN) in memory, where M
    is a user-defined parameter
  • Assumes that if P(w_k = 0 | w_{I,J}, D) is 1, then
    P(w_k = 0 | w, D) is also 1! Optimality?

44
Alphadigit Recognition
  • Data increased to 10000 training vectors
  • The reduction method has been trained on up to
    100k vectors (on a toy task); this is not possible
    with the constructive method

45
Summary
  • First to apply kernel machines as acoustic models
  • Comparison of two machines that apply structural
    optimization to learning SVM and RVM
  • Performance exceeds that of the HMM, but with
    quite a bit of HMM interaction
  • Algorithms for increased data sizes are key

46
Decoupling the HMM
  • Still want to use segmental data (data size)
  • Want the kernel machine acoustic model to
    determine an optimal segmentation though
  • Need a new decoder
  • Hypothesize each phone for each possible segment
  • Pruning is a huge issue
  • A stack decoder is beneficial
  • Status: in development

47
Improved Iterative Algorithm
  • Same principle of operation
  • One pass over the data: much faster!
  • Status: equivalent performance on all benchmarks;
    running on Alphadigits now

48
Active Learning for RVMs
  • Idea: given the current model, iteratively choose
    a subset of points from the full training set that
    will improve system performance
  • Problem 1: performance is typically defined as
    classifier error rate (e.g., in boosting). What
    about the accuracy of the posterior estimate?
  • Problem 2: for kernel machines, an added training
    point can
  • Improve the model's performance, or
  • Become part of the model itself! How do we
    determine which points should be added?
  • Look to work in Gaussian Processes (Lawrence,
    Seeger, Herbrich, 2003)

49
Extensions
  • Not ready for prime time as an acoustic model
  • How else might we use the same techniques for
    speech?
  • Online Speech/Noise Classification?
  • Requires adaptation methods
  • Application of automatic relevance determination
    to model selection for HMMs?

50
Acknowledgments
  • Collaborators: Aravind Ganapathiraju and Joe
    Picone at Mississippi State
  • Consultants: Michael Tipping (MSR-Cambridge) and
    Thorsten Joachims (now at Cornell)