Combining Phonetic Attributes Using Conditional Random Fields
Jeremy Morris and Eric Fosler-Lussier
Department of Computer Science and Engineering, OSU
A Conditional Random Field (CRF) is a mathematical model for sequences that is similar in many ways to a Hidden Markov Model, but is discriminative rather than generative in nature. Here we explore the application of the CRF model to ASR processing by building a system that performs first-pass phonetic recognition using discriminatively trained phonetic attributes. This system achieves an accuracy level on a phone recognition task superior to that of a comparably trained HMM.
Phonetic Attributes
  • Phonetic attributes are defined via linguistic properties per the International Phonetic Association (IPA) phonetic chart
  • Consonants defined by their sonority, voicing,
    manner, and place of articulation
  • Vowels defined by their sonority, voicing,
    height, frontness, roundness and tenseness
  • Additional features for silence
  • Phonetic attributes are extracted by multi-layer perceptron (MLP) neural net classifiers
  • Classifiers are trained on 12th-order PLP cepstral coefficients and their delta coefficients derived from the speech data
  • Speech data is broken into 25 ms frames, with overlapping frames starting every 10 ms
  • Input is a vector of PLP and delta coefficients for a nine-frame window centered on the current frame, with four frames of context on either side (see the sketch at the end of this section)
  • Each classifier outputs a series of posterior
    probabilities representing the probability of the
    attribute given the data
  • One probability is output for each possible
    attribute for that classifier for any given frame
    of the data
  • These posteriors sum to one for a given
    classifier MLP
  • Classifiers were trained on a phonetically transcribed corpus
  • Phonetic attribute labels were derived from the attributes given by the IPA description of the transcribed phone (see Figure 1; the attribute inventory is listed in Table 1, and a sketch of the mapping follows it)
  • For our purposes, all phones are assumed to have
    their canonical values for training, and
    attribute boundaries occur at phonetic boundaries
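
Below is a minimal sketch of the classifier input layout described above, assuming 13 PLP coefficients (12th order plus energy) and 13 deltas per 25 ms frame; the MLP object and its predict_proba interface are hypothetical stand-ins for illustration, not the authors' ICSI QuickNet setup.

```python
import numpy as np

FRAME_DIM = 26   # 13 PLP + 13 delta coefficients per frame (assumed)
CONTEXT = 4      # four frames of context on either side of the current frame

def context_window(feats, i, context=CONTEXT):
    """Stack a nine-frame window centered on frame i (edges are padded
    by repeating the first/last frame)."""
    n = len(feats)
    idx = [min(max(j, 0), n - 1) for j in range(i - context, i + context + 1)]
    return np.concatenate([feats[j] for j in idx])   # shape: (9 * FRAME_DIM,)

def attribute_posteriors(mlp, feats, i):
    """Posteriors over one attribute class (e.g. SONORITY) for frame i;
    a softmax MLP guarantees they sum to one, as noted above."""
    return mlp.predict_proba(context_window(feats, i).reshape(1, -1))
```

One such MLP would be trained per attribute class in Table 1, producing for every 10 ms frame a posterior distribution over that class's values.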
Conditional Random Fields
  • A discriminative model of a sequence that attempts to model the posterior probability of a label sequence given a set of observed data (Lafferty et al., 2001)
  • A CRF can be described by the following equation (the standard form from Lafferty et al., 2001):

    P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i \left[ \sum_j \lambda_j \, s_j(y_i, x, i) + \sum_k \mu_k \, t_k(y_{i-1}, y_i, x, i) \right] \right)

  • where each s is a state feature function, each t is a transition feature function, λ and μ are their associated weights, and Z(x) is a normalization term over all label sequences
  • State feature functions associate observations in
    the data at a particular time segment with the
    label at that time segment
  • Described as s(y, x, i), where y is the label, x
    is the observed data, and i is the time frame.
  • Takes a non-zero value when the current label at frame i is the same as y and some observation in x holds for frame i; otherwise the value is zero
  • Prior work using CRFs in speech recognition has used Gaussian attributes to build state feature functions (Gunawardana et al., 2005)
  • Transition feature functions associate
    observations in the data at a particular time
    segment with the transition from the previous
    label into the current label
  • Described as t(y′, y, x, i), where y′ is the previous label, y is the current label, x is the observed data, and i is the time frame
  • Takes a non-zero value when the current label at frame i is the same as y, the previous label is the same as y′, and some observation in x holds for frame i
  • For our model, a state feature function is a
    single output from our MLP phonetic attribute
    classifiers associated with a single label
  • Example: s_j(y, x, i) = MLP_STOP(x_i) · δ(y_i = /t/)
  • The state feature function above has the value of
    the output of our MLP classifier for the STOP
    attribute if the label at time i is /t/.
    Otherwise, it takes the value of zero.
  • Currently, transition feature functions do not
    use the output of the MLP neural networks
  • The value of the function is 1 if the label pair
    matches the pair defined for the function, 0 if
    it does not.
  • Each feature function has an associated weight
    value
  • This weight value is high when a non-zero feature function value is strongly associated with a particular label, contributing a large value to the computed probability of that label
  • Weights are trained by maximizing the log
    likelihood of the training set with respect to
    the model
  • The strength of the CRF model is in its ability
    to use arbitrary features as input
  • In traditional HMMs, dependencies among features can lead to computationally difficult models, so features are usually required to be independent
  • In a CRF, no independence assumption is made on the features; features can have arbitrary dependencies (see the scoring sketch after this list)
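
As a minimal sketch of the scoring defined by the equation above, the snippet below sums weighted state and transition feature functions for one candidate label sequence; the data structures (per-attribute posterior arrays and weight dictionaries) are assumptions for illustration, not the authors' implementation (which used the Java CRF package cited below).

```python
def unnormalized_log_score(labels, attr_posteriors, state_w, trans_w):
    """Sum of weighted feature functions for one label sequence.
    attr_posteriors[a] is a (T, K_a) array of MLP outputs for attribute
    class a; state_w[(a, k, y)] and trans_w[(y_prev, y)] are learned
    weights. Exponentiating and dividing by Z(x) gives P(y|x)."""
    score = 0.0
    for i, y in enumerate(labels):
        # State features: the MLP posterior for attribute value k is the
        # feature value whenever the label at frame i matches y.
        for a, post in attr_posteriors.items():
            for k in range(post.shape[1]):
                score += state_w.get((a, k, y), 0.0) * post[i, k]
        # Transition features: binary indicators on (previous, current)
        # label pairs, as described above.
        if i > 0:
            score += trans_w.get((labels[i - 1], y), 0.0)
    return score
```

Training maximizes the log likelihood of the reference label sequences (which requires computing Z(x) with the forward-backward algorithm), and decoding selects the highest-scoring sequence with Viterbi; neither step is shown here.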

Figure 1: Example frame-level phone labels (/dh/ /dh/ /ae/ /ae/ /dx/) from which canonical attribute labels are derived.
TABLE 1: Phonetic Attributes
Attribute   Possible output values
SONORITY    vowel, obstruent, sonorant, syllabic, silence
VOICE       voiced, unvoiced, n/a
MANNER      fricative, stop, closure, flap, nasal, approximant, nasal-flap, n/a
PLACE       labial, dental, alveolar, palatal, velar, glottal, lateral, rhotic, n/a
HEIGHT      high, mid, low, low-high, mid-high, n/a
FRONT       front, back, central, back-front, n/a
ROUND       round, nonround, round-nonround, nonround-round, n/a
TENSE       tense, lax, n/a
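
To make the mapping from phones to canonical attribute labels concrete, here is an illustrative sketch using the attribute values in Table 1; the two phone entries are assumptions based on standard IPA descriptions, not the authors' actual lookup table.

```python
# Hypothetical examples of canonical attribute values per phone (Table 1).
CANONICAL_ATTRIBUTES = {
    "/t/":  {"SONORITY": "obstruent", "VOICE": "unvoiced", "MANNER": "stop",
             "PLACE": "alveolar", "HEIGHT": "n/a", "FRONT": "n/a",
             "ROUND": "n/a", "TENSE": "n/a"},
    "/ae/": {"SONORITY": "vowel", "VOICE": "voiced", "MANNER": "n/a",
             "PLACE": "n/a", "HEIGHT": "low", "FRONT": "front",
             "ROUND": "nonround", "TENSE": "lax"},
}

def frame_attribute_labels(frame_phones):
    """Map frame-level phone labels (as in Figure 1) to per-frame attribute
    labels, so attribute boundaries fall at the phone boundaries."""
    return [CANONICAL_ATTRIBUTES[p] for p in frame_phones]
```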
  • Discussion
  • The CRF system trained on monophones has accuracy results that fall between those of the monophone-trained and triphone-trained Tandem systems
  • The CRF system makes many fewer insertions (extra
    hypothesized phones) than the Tandem systems
  • The CRF system also makes many more deletions (phones that should have been hypothesized but were missed) than the Tandem systems
  • The CRF system makes fewer hypotheses overall
    than either Tandem system
  • The precision measurement shows how often a hypothesis is correct
  • When the CRF system makes a hypothesis, it is
    correct more often than the Tandem systems
  • These results suggest some means to improve the
    performance of the CRF system
  • Addition of new extracted attributes (such as a
    boundary detector) to incorporate as transition
    features
  • Addition of a penalty factor on transition
    weights to generate more transitions
  • Addition of more contextual attributes into state
    features to gain some level of triphonic context
  • Results
  • Phone-level accuracies of the CRF system were compared to a baseline Tandem system (Hermansky et al., 2000)
  • A Tandem system uses the output of the neural
    networks as inputs to a Hidden Markov Model
    system
  • Tandem system was trained with both triphone
    label contexts and monophone label contexts
  • Triphone labels give a single left and right
    context phone to the label, allowing a finer
    level of context to be used when labels are
    assigned
  • In other words, the context for the phone /ae/ in
    the string of phones /k ae t/ is different from
    that in the string /k ae p/ since the right
    context is different
  • A monophone label is a single phonetic label with no surrounding context
  • CRF system results are only for monophone labels
  • TABLE 2 (below) breaks down the results into
    three categories
  • Phone Correctness: Was the correct phone hypothesized?
  • Phone Accuracy: Correctness penalized for overgeneration (insertions)
  • Phone Precision: When a phone is hypothesized, how often is it right? (A sketch of these measures follows this list.)
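
The sketch below spells out these three measures, assuming standard HTK-style definitions computed from an alignment of the hypothesis against the reference (H hits, S substitutions, D deletions, I insertions); this matches the descriptions above but is not spelled out on the poster.

```python
# H = hits, S = substitutions, D = deletions, I = insertions (per alignment).
def phone_correct(H, S, D):
    return 100.0 * H / (H + S + D)        # was the right phone hypothesized?

def phone_accuracy(H, S, D, I):
    return 100.0 * (H - I) / (H + S + D)  # correctness penalized for insertions

def phone_precision(H, S, I):
    return 100.0 * H / (H + S + I)        # how often a hypothesis is right
```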
  • References
  • J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings of the 18th International Conference on Machine Learning, 2001.
  • H. Hermansky, D. Ellis, and S. Sharma, "Tandem connectionist feature stream extraction for conventional HMM systems," in Proceedings of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2000.
  • A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, "Hidden Conditional Random Fields for Phone Classification," in Proceedings of Interspeech, 2005.
  • J. Morris and E. Fosler-Lussier, "Discriminative Phonetic Recognition with Conditional Random Fields," in HLT-NAACL Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, 2006.
  • M. Rajamanohar and E. Fosler-Lussier, "An evaluation of hierarchical articulatory feature detectors," in IEEE Automatic Speech Recognition and Understanding Workshop, 2005.
  • S. Sarawagi, CRF package for Java, http://crf.sourceforge.net
  • D. Johnson et al., ICSI QuickNet software, http://www.icsi.berkeley.edu/Speech/qn.html
  • S. Young et al., HTK HMM software, http://htk.eng.cam.ac.uk/

TABLE 2: Phone Recognition Comparisons
Model           Phone Accuracy (%)   Phone Correct (%)   Phone Precision (%)   Parameters
Tandem (mono)   61.48                63.50               73.66                 >28,000
Tandem (tri)    66.69                72.52               73.44                 >2 million
CRF (mono)      65.23                66.74               77.66                 4,500
  • The CRF uses far fewer parameters than either of
    the Tandem systems, yet achieves a comparable
    result.
  • Further work has shown that combining phonetic
    attribute posteriors with phonetic class
    posteriors yields a superior result for the CRF
    over the Tandem systems (Morris and
    Fosler-Lussier, 2006)
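
As a rough check of the parameter counts in Table 2, assuming one state weight per (attribute value, label) pair and one transition weight per label pair, with a TIMIT-style set of roughly 48 phone labels (an assumption, not stated here) and the 44 attribute values of Table 1:

```python
num_labels = 48                                       # assumed phone set size
num_attribute_values = 5 + 3 + 8 + 9 + 6 + 5 + 5 + 3  # values per Table 1 row
state_weights = num_labels * num_attribute_values     # 2112
transition_weights = num_labels * num_labels          # 2304
print(state_weights + transition_weights)             # 4416, near the ~4,500 reported
```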

This work was supported by NSF ITR grant IIS-0427413; the opinions and conclusions expressed in this work are those of the authors and not of any funding agency.