Hidden Conditional Random Fields - PowerPoint PPT Presentation

1 / 22
About This Presentation
Title:

Hidden Conditional Random Fields

Description:

Microsoft Research. reference ... We generalize this work and use CRFs with hidden state sequences for modeling speech ... Development set: 15334. Evaluation set: 7333 ... – PowerPoint PPT presentation

Number of Views:181
Avg rating:3.0/5.0
Slides: 23
Provided by: Ryan85
Category:

less

Transcript and Presenter's Notes

Title: Hidden Conditional Random Fields


1
Hidden Conditional Random Fields
  • Asela Gunawardana, Milind Mahajan, Alex Acero,
    John C. Platt
  • Microsoft Research

2
reference
  • INTERSPEECH 2005 Asela Gunawardana, Milind
    Mahajan, Alex Acero, John C. Platt, Hidden
    Conditional Random Fields for Phone
    Classification
  • ICASSP 2006 Milind Mahajan, Asela Gunawardana,
    Alex Acero, Training Algorithm for Hidden
    Conditional Random Fields

3
outline
  • Introduction
  • HCRFs as a generalization of HMMs
  • HCRF estimation
  • Experimental results
  • Conclusions

4
Random Fields
  • At its most basic a random field is a list of
    random numbers whose values are mapped onto a
    space (of n dimensions)
  • Values in random field are usually spatially
    correlated in one way or another, in its most
    basic form this might mean that adjacent values
    do not differ as much as values that are further
    apart
  • Several kinds of random fields exist, among them
    Markov random fields (MRF), Gibbs random fields
    (GRF), conditional random fields (CRF), and
    Gaussian random fields
  • In detail, please ref http//en.wikipedia.org/wik
    i/Random_field

5
Introduction (1/2)
  • There has been a resurgence of interest in
    discriminative methods for ASR due to the success
    of extended Baum-Welch based techniques such as
    MMI and MPE training in LVCSR
  • However, the methods are poorly understood as
    they are used in ways in which their convergence
    guarantees no longer hold, and their successful
    use is as much art as it is science
  • The rationale for the use of these EBW based
    techniques is that general unconstrained
    optimization algorithms are not well-suited to
    optimizing generative hidden Markov models (HMMs)
    under discriminative criteria such as the
    conditional likelihood

6
Introduction (2/2)
  • We present a class of models that in contrast to
    HMMs are discriminative rather than generative in
    nature, and are amenable to the use of general
    purpose unconstrained optimization algorithms
  • The HMM framework is difficult to incorporate
    long-range dependencies between the states and
    the observations
  • Maximum entropy Markov models (MEMMs) are direct
    (non-generative) models that instead of
    observations being generated at each state, the
    state sequence is generated conditioned on the
    observations

7
Generative Models
  • A generative model is a model for randomly
    generating observed data, typically given some
    hidden parameters
  • Generative models are used in machine learning
    for either modeling data directly (i.e., modeling
    observed draws from a probability density
    function), or as an intermediate step to forming
    a conditional probability density function
  • Examples of generative models include
  • Gaussian distribution
  • Gaussian mixture model
  • Multinomial distribution
  • Hidden Markov model
  • Generative grammar

Refhttp//en.wikipedia.org/wiki/Generative_model
8
Maximum Entropy Markov Models
  • The state at each time is chosen with a
    probability that depends on the previous state as
    well as the observations
  • The model does not assign probability to the
    observations, and the conditional state
    transition probabilities are exponential
    (maximum entropy) distributions that may depend
    on arbitrary features of the entire observation
    sequence

P(ss,o) that provides the probability of the
current state s given the previous state s and
the current observation o
9
Conditional Random Fields
  • CRFs are generalizations of MEMMs where the
    conditional probability of the entire state
    sequence given the observation sequence is
    modeled as an exponential distribution
  • While MEMMs use per-state exponential
    distributions to model the transition probability
    at each state, CRFs use a single exponential
    distribution to model the entire state sequence
    given the observation sequence
  • MEMMs and CRFs have been used successfully for
    tasks such as part-of-speech (POS) tagging and
    information extraction

10
Hidden CRFs
  • In previous approaches using MEMMs and CRFs for
    speech, an HMM system is used to reveal the
    correct training state sequence through Viterbi
    alignment
  • We generalize this work and use CRFs with hidden
    state sequences for modeling speech
  • HCRFs are able to use features which can be
    arbitrary functions of the observations without
    complicating the training

11
HCRFs
  • CRFs are typically trained using iterative
    scaling methods or quasi-Newton methods such as
    L-BFGS
  • Its possible to train HCRFs using Generalized EM
    (GEM) where the M-step is an iterative algorithm
    such as GIS or L-BFGS, rather than a closed form
    solution
  • We have successfully used direct optimization
    techniques such as L-BFGS and stochastic gradient
    descent to estimate HCRF parameters

12
HCRFs vs. HMMs
  • The key difference between HCRFs and HMMs
  • HCRFs model the state sequence as being
    conditionally Markov given the observation
    sequence
  • HMMs model the state sequence as being Markov,
    and each observation being independent of all
    others given the corresponding state

13
HCRFs as a generalization of HMMs (1/3)
  • The HCRF model gives the conditional probability
    of a segment (phonetic) label ? given the
    observation sequence o (o1, , oT )
  • ? is the parameter vector and f(w,s,o) is a
    vector of sufficient statistics referred to as
    the feature vector. And the partition function
    z(o ?) ensures that the model is a properly
    normalized probability and is given by
  • The choice of sufficient statistics determines
    the dependencies modeled by the HCRF

14
HCRFs as a generalization of HMMs (2/3)
  • We use the vector of sufficient statistics f with
    components
  • These sufficient statistics may be recognized as
    the ones that are commonly accumulated in order
    to estimate HMMs
  • Since all components of f are sums of terms that
    involve at most pairs of neighboring states

15
HCRFs as a generalization of HMMs (3/3)
  • It can be shown that setting the corresponding
    components of ?to
  • Gives the conditional p.d.f. induced by an
    HMM with transition probabilities ,
    emission means , emission covariance
    and unigram probability .

16
HCRF Estimation (1/4)
  • we have chosen to use direct optimization of the
    conditional log-likelihood of the training set
    rather than GEM
  • Need to find ? to maximize the conditional
    log-likelihood of the training set
  • L-BFGS is a batch training method which uses the
    statistics such as ?L(?) computed from the entire
    training set in order to make an update to the
    parameter vector ?
  • Stochastic gradient descent (SGD) updates the
    parameter vector after processing each single
    training sample using noisy estimates of the
    gradient ?L(?)

17
HCRF Estimation (2/4)
  • If (w(1), o(1)) . . . (w(N), o(N)) is the entire
    sequence of training samples processed by SGD,
    then
  • where ?(n) is the learning rate and U(n) is a
    conditioning matrix which can be used to speed up
    the convergence
  • We used a constant learning rate ?(n) ? and
    U(n) I
  • Both L-BFGS and SGD require the computation of
    the gradient of

numerator
denominator
18
HCRF Estimation (3/4)
  • The forward and backward recursions and the
    computation of occupancy probabilities are
    analogous to the case of HMM estimation, with the
    transition
  • probability ass replaced by a transition score
    and the observation probability
    replaced by an observation score

19
HCRF Estimation (4/4)
  • Thus, the gradient of the log conditional
    likelihood can be efficiently computed, just as
    with MMI estimation of HMMs
  • Note that the conditional log-likelihood is not
    convex in ?.Training methods will therefore in
    general find a local optimum rather than the
    global optimum.
  • We initialized the HCRF estimation from ML
    trained HMM parameters.

20
Generalizing to multi-component models
21
Experimental Results
Training set 142910 Development set
15334 Evaluation set 7333
It should be noted that while MMI estimation of
the HMMs and SGD estimation of the HCRFs
converged within ten iterations over the training
set, L-BFGS convergence was much slower, taking
up to fifty iterations
22
Conclusions
  • The advantage of HCRFs is that the model is a
    state sequence probability model, even when
    applied to the phone classification task, and can
    easily be extended to recognition tasks where the
    boundaries of phonetic segments are unknown
  • The HCRF framework is easily extensible to
    recognition since it is a state and label
    sequence modeling technique
  • HCRFs have the ability to handle complex features
    without any change in training procedure
Write a Comment
User Comments (0)
About PowerShow.com