Combining Phonetic Attributes Using Conditional Random Fields
Jeremy Morris and Eric Fosler-Lussier
Department of Computer Science and Engineering, OSU
A Conditional Random Field is a mathematical model for sequences that is similar in many ways to a Hidden Markov Model, but is discriminative rather than generative in nature. Here we explore the application of the CRF model to ASR processing by building a system that performs first-pass phonetic recognition using discriminatively trained phonetic attributes. This system achieves accuracy on a phone recognition task superior to that of a comparably trained HMM.
  • Phonetic Attributes
  • Phonetic attributes are defined via linguistic properties per the International Phonetic Association (IPA) phonetic chart
  • Consonants defined by their sonority, voicing,
    manner, and place of articulation
  • Vowels defined by their sonority, voicing,
    height, frontness, roundness and tenseness
  • Additional features for silence
  • Phonetic attributes extracted by multi-layer
    perceptron (MLP) neural net classifiers
  • Classifiers are trained on 12th order cepstral
    PLP and delta coefficients derived from the
    speech data
  • Speech data is broken up into frames of 25ms,
    with overlapping frames starting every 10ms
  • Input is a vector of PLP and delta coefficients
    for a nine frame window, centered on the current
    frame, with four frames of context on either side
  • Each classifier outputs a series of posterior
    probabilities representing the probability of the
    attribute given the data
  • One probability is output for each possible
    attribute for that classifier for any given frame
    of the data
  • These posteriors sum to one for a given
    classifier MLP
  • Classifiers were trained on a phonetically transcribed corpus
  • Phonetic attribute labels were derived from the attributes given by the IPA description of the transcribed phone (see Figure 1)
  • For our purposes, all phones are assumed to take their canonical attribute values for training, and attribute boundaries occur at phonetic boundaries (a sketch of this attribute-extraction setup appears below)
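
To make the attribute-extraction stage above concrete, here is a minimal sketch of the nine-frame input windowing and the per-attribute MLP posteriors, under illustrative assumptions: the array dimensions, hidden-layer size, and class lists below are placeholders, and the actual classifiers were trained with standard MLP tools (the references cite ICSI QuickNet), not with this code.

import numpy as np

# A few attribute groups and their values (subset of Table 1; illustrative).
ATTRIBUTE_CLASSES = {
    "SONORITY": ["vowel", "obstruent", "sonorant", "syllabic", "silence"],
    "VOICE": ["voiced", "unvoiced", "n/a"],
    "MANNER": ["fricative", "stop", "closure", "flap", "nasal",
               "approximant", "nasal flap", "n/a"],
}

def window_features(plp, i, context=4):
    """Stack PLP+delta vectors for a nine-frame window centered on frame i."""
    n_frames, dim = plp.shape
    idx = np.clip(np.arange(i - context, i + context + 1), 0, n_frames - 1)
    return plp[idx].reshape(-1)              # (9 * dim,) input vector

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class AttributeMLP:
    """One-hidden-layer MLP producing per-class posteriors for one attribute."""
    def __init__(self, in_dim, hidden, classes, rng):
        self.classes = classes
        self.W1 = rng.normal(0.0, 0.01, (hidden, in_dim))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.01, (len(classes), hidden))
        self.b2 = np.zeros(len(classes))

    def posteriors(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        return softmax(self.W2 @ h + self.b2)    # sums to one for this group

# Usage: posteriors for frame 100 of an utterance (13 PLP + 13 delta dims).
rng = np.random.default_rng(0)
plp = rng.normal(size=(300, 26))                 # placeholder acoustic features
nets = {name: AttributeMLP(9 * 26, 100, vals, rng)
        for name, vals in ATTRIBUTE_CLASSES.items()}
x = window_features(plp, 100)
frame_posteriors = {name: dict(zip(net.classes, net.posteriors(x)))
                    for name, net in nets.items()}
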
Conditional Random Fields
  • A discriminative model of a sequence that attempts to model the posterior probability of a label sequence given a set of observed data (Lafferty et al., 2001)
  • A CRF can be described by the following equation:

    P(y|x) = (1/Z(x)) · exp( Σᵢ [ Σⱼ λⱼ sⱼ(yᵢ, x, i) + Σₖ μₖ tₖ(yᵢ₋₁, yᵢ, x, i) ] )

  • where each s is a state feature function, each t is a transition feature function, λ and μ are their learned weights, and Z(x) normalizes over all possible label sequences
  • State feature functions associate observations in
    the data at a particular time segment with the
    label at that time segment
  • Described as s(y, x, i), where y is the label, x is the observed data, and i is the time frame
  • Takes a non-zero value when the current label at frame i is the same as y and some observation in x holds for frame i; otherwise the value is zero
  • Prior work using CRFs in speech recognition has used Gaussian attributes to build state feature functions (Gunawardana et al., 2005)
  • Transition feature functions associate
    observations in the data at a particular time
    segment with the transition from the previous
    label into the current label
  • Described as t(y', y, x, i), where y is the current label, y' is the previous label, x is the observed data, and i is the time frame
  • Takes a non-zero value when the current label at frame i is the same as y, the previous label is the same as y', and some observation in x holds for frame i
  • For our model, a state feature function is a
    single output from our MLP phonetic attribute
    classifiers associated with a single label
  • Example: sⱼ(y, x, i) = MLP_STOP(xᵢ) · δ(yᵢ = /t/)
  • The state feature function above takes the value of our MLP classifier's output for the STOP attribute if the label at time i is /t/; otherwise, it takes the value of zero
  • Currently, transition feature functions do not
    use the output of the MLP neural networks
  • The value of the function is 1 if the label pair
    matches the pair defined for the function, 0 if
    it does not.
  • Each feature function has an associated weight
    value
  • This weight is high when a non-zero value of the feature function is strongly associated with a particular label, contributing a large value to the computed probability of that label
  • Weights are trained by maximizing the log
    likelihood of the training set with respect to
    the model
  • The strength of the CRF model is in its ability
    to use arbitrary features as input
  • In traditional HMMs, dependencies among features can lead to computationally difficult models, so features are usually required to be independent
  • In a CRF, no independence assumption is made on the features; features can have arbitrary dependencies (a sketch of these feature functions appears below)
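
To make the state and transition feature functions above concrete, the following sketch computes the unnormalized log score of a label sequence in a linear-chain CRF, using MLP attribute posteriors as state features and binary indicator transition features. The label set, feature names, and weights are illustrative assumptions; the actual experiments used Sarawagi's Java CRF package rather than this code.

import math

def state_features(label, mlp_posteriors):
    """State feature values s_j(y, x, i): each MLP attribute posterior,
    paired with the current label (non-zero only for that label)."""
    return {(attr, label): p for attr, p in mlp_posteriors.items()}

def transition_features(prev_label, label):
    """Transition feature values t_k(y', y, x, i): 1 for the observed label
    pair, 0 for all other pairs (MLP outputs unused here, as in the poster)."""
    return {(prev_label, label): 1.0}

def sequence_score(labels, frames, lam, mu):
    """Unnormalized log score: the weighted sum of state and transition
    features over the sequence. `frames` holds per-frame dicts of MLP
    attribute posteriors; `lam` and `mu` are the trained weights."""
    score, prev = 0.0, None
    for y, posts in zip(labels, frames):
        for feat, val in state_features(y, posts).items():
            score += lam.get(feat, 0.0) * val
        if prev is not None:
            for feat, val in transition_features(prev, y).items():
                score += mu.get(feat, 0.0) * val
        prev = y
    return score

# P(y|x) = exp(score) / Z(x), where Z(x) sums exp(score) over all label
# sequences (computed with the forward algorithm during training).
frames = [{"VOWEL": 0.8, "STOP": 0.1}, {"VOWEL": 0.2, "STOP": 0.7}]
lam = {("VOWEL", "/ae/"): 2.0, ("STOP", "/t/"): 1.5}   # toy weights
mu = {("/ae/", "/t/"): 0.5}
print(math.exp(sequence_score(["/ae/", "/t/"], frames, lam, mu)))
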

Figure 1. Attribute labels derived from the transcribed phone sequence /dh/ /dh/ /ae/ /ae/ /dx/
TABLE 1: PHONETIC ATTRIBUTES
Attribute | Possible output values
SONORITY  | vowel, obstruent, sonorant, syllabic, silence
VOICE     | voiced, unvoiced, n/a
MANNER    | fricative, stop, closure, flap, nasal, approximant, nasal flap, n/a
PLACE     | labial, dental, alveolar, palatal, velar, glottal, lateral, rhotic, n/a
HEIGHT    | high, mid, low, low-high, mid-high, n/a
FRONT     | front, back, central, back-front, n/a
ROUND     | round, nonround, round-nonround, nonround-round, n/a
TENSE     | tense, lax, n/a
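
To illustrate the training assumption that each phone takes its canonical attribute values, here is a small, hypothetical phone-to-attribute mapping for the phones in Figure 1, expanded to frame-level labels. The specific value assignments are illustrative readings of the IPA chart, not the exact label tables used in the experiments.

# Hypothetical canonical attribute values for three phones, using the
# attribute inventory of Table 1 (illustrative only).
CANONICAL_ATTRIBUTES = {
    "/dh/": {"SONORITY": "obstruent", "VOICE": "voiced", "MANNER": "fricative",
             "PLACE": "dental", "HEIGHT": "n/a", "FRONT": "n/a",
             "ROUND": "n/a", "TENSE": "n/a"},
    "/ae/": {"SONORITY": "vowel", "VOICE": "voiced", "MANNER": "n/a",
             "PLACE": "n/a", "HEIGHT": "low", "FRONT": "front",
             "ROUND": "nonround", "TENSE": "lax"},
    "/dx/": {"SONORITY": "obstruent", "VOICE": "voiced", "MANNER": "flap",
             "PLACE": "alveolar", "HEIGHT": "n/a", "FRONT": "n/a",
             "ROUND": "n/a", "TENSE": "n/a"},
}

def frame_attribute_labels(frame_phones):
    """Expand a frame-level phone transcription into frame-level attribute
    labels, assuming every phone takes its canonical values (as in training)."""
    return [CANONICAL_ATTRIBUTES[p] for p in frame_phones]

labels = frame_attribute_labels(["/dh/", "/dh/", "/ae/", "/ae/", "/dx/"])
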
  • Discussion
  • The CRF system trained on monophones has accuracy results that fall between those of the monophone-trained and triphone-trained Tandem systems
  • The CRF system makes many fewer insertions (extra
    hypothesized phones) than the Tandem systems
  • The CRF system also makes many more deletions (phones that should have been hypothesized but were not) than the Tandem systems
  • The CRF system makes fewer hypotheses overall
    than either Tandem system
  • The precision measurement shows how often a
    hypothesis is a correct hypothesis
  • When the CRF system makes a hypothesis, it is
    correct more often than the Tandem systems
  • These results suggest some means to improve the
    performance of the CRF system
  • Addition of new extracted attributes (such as a
    boundary detector) to incorporate as transition
    features
  • Addition of a penalty factor on transition
    weights to generate more transitions
  • Addition of more contextual attributes into state
    features to gain some level of triphonic context
  • Results
  • Phone-level accuracies of the CRF system were compared to a baseline Tandem system (Hermansky et al., 2000)
  • A Tandem system uses the output of the neural
    networks as inputs to a Hidden Markov Model
    system
  • The Tandem system was trained with both triphone and monophone label contexts
  • Triphone labels give a single left and right
    context phone to the label, allowing a finer
    level of context to be used when labels are
    assigned
  • In other words, the context for the phone /ae/ in
    the string of phones /k ae t/ is different from
    that in the string /k ae p/ since the right
    context is different
  • Monophone labels are a single phonetic label
  • CRF system results are only for monophone labels
  • TABLE 2 (below) breaks down the results into three categories (a sketch of how these are computed from alignment counts follows this list):
  • Phone Correctness: Was the correct phone hypothesized?
  • Phone Accuracy: Correctness penalized for overgeneration
  • Phone Precision: When a phone is hypothesized, how often is it right?
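
The three measures in Table 2 follow from the standard alignment counts (hits, substitutions, deletions, insertions) produced by an HTK-style scoring tool. This is a sketch of the usual definitions, under the assumption that Phone Precision is hits divided by the total number of hypothesized phones.

def phone_metrics(hits, subs, dels, ins):
    """Phone recognition measures from alignment counts. hits/subs/dels are
    counted against the reference transcription; ins are extra hypothesized
    phones. Correctness and accuracy match HTK-style %Corr / %Acc."""
    n_ref = hits + subs + dels                 # reference phones
    n_hyp = hits + subs + ins                  # hypothesized phones
    correctness = 100.0 * hits / n_ref
    accuracy = 100.0 * (hits - ins) / n_ref    # correctness penalized for insertions
    precision = 100.0 * hits / n_hyp           # how often a hypothesis is right
    return correctness, accuracy, precision

# Illustrative counts only, not the counts from the actual experiments.
print(phone_metrics(hits=700, subs=200, dels=100, ins=50))
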
  • References
  • J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings of the 18th International Conference on Machine Learning, 2001.
  • H. Hermansky, D. Ellis, and S. Sharma, "Tandem connectionist feature stream extraction for conventional HMM systems," in Proceedings of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2000.
  • A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, "Hidden Conditional Random Fields for Phone Classification," in Proceedings of Interspeech, 2005.
  • J. Morris and E. Fosler-Lussier, "Discriminative Phonetic Recognition with Conditional Random Fields," in HLT-NAACL Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, 2006.
  • M. Rajamanohar and E. Fosler-Lussier, "An evaluation of hierarchical articulatory feature detectors," in IEEE Automatic Speech Recognition and Understanding Workshop, 2005.
  • S. Sarawagi, CRF package for Java, http://crf.sourceforge.net
  • D. Johnson et al., ICSI QuickNet software, http://www.icsi.berkeley.edu/Speech/qn.html
  • S. Young et al., HTK HMM software, http://htk.eng.cam.ac.uk/

TABLE 2: PHONE RECOGNITION COMPARISONS
Model         | Phone Accuracy | Phone Correct | Phone Precision | Parameters
Tandem (mono) | 61.48          | 63.50         | 73.66           | >28,000
Tandem (tri)  | 66.69          | 72.52         | 73.44           | >2 million
CRF (mono)    | 65.23          | 66.74         | 77.66           | 4500
  • The CRF uses far fewer parameters than either of
    the Tandem systems, yet achieves a comparable
    result.
  • Further work has shown that combining phonetic
    attribute posteriors with phonetic class
    posteriors yields a superior result for the CRF
    over the Tandem systems (Morris and
    Fosler-Lussier, 2006)

This work was supported by NSF ITR grant IIS-0427413; the opinions and conclusions expressed in this work are those of the authors and not of any funding agency.