Title: Combining Phonetic Attributes Using Conditional Random Fields
1Combining Phonetic Attributes Using Conditional
Random Fields
Jeremy Morris and Eric Fosler-Lussier
Department of Computer Science and Engineering,
OSU
A Conditional Random Field is a mathematical
model for sequences that is similar in many ways
to a Hidden Markov Model, but is discriminative
rather than generative in nature. Here we
explore the application of the CRF model to ASR
processing by building a system that performs
first-pass phonetic recogintion using
discriminatively trained phonetic attributes.
This system achieves an accuracy level in a phone
recognition task superior to that of an HMM that
has been comparably trained.
Conditional Random Fields
- Phonetic Attributes
- Phonetic attributes are defined via linguistic
properties per the International Phonetics
Association (IPA) phonetic chart - Consonants defined by their sonority, voicing,
manner, and place of articulation - Vowels defined by their sonority, voicing,
height, frontness, roundness and tenseness - Additional features for silence
- Phonetic attributes extracted by multi-layer
perceptron (MLP) neural net classifiers - Classifiers are trained on 12th order cepstral
PLP and delta coefficients derived from the
speech data - Speech data is broken up into frames of 25ms,
with overlapping frames starting every 10ms - Input is a vector of PLP and delta coefficients
for a nine frame window, centered on the current
frame, with four frames of context on either side - Each classifier outputs a series of posterior
probabilities representing the probability of the
attribute given the data - One probability is output for each possible
attribute for that classifier for any given frame
of the data - These posteriors sum to one for a given
classifier MLP - Classifiers were trained using phonetically
transcribed corpus - Phonetic attribute labels were derived by using
the attributes provided by the IPA description of
the transcribed phone (See Figure 1) - For our purposes, all phones are assumed to have
their canonical values for training, and
attribute boundaries occur at phonetic boundaries
- A discriminative model of a sequence that
attempts to model the posterior probability of a
label sequence given a set of observed data
(Lafferty, et. al, 2001) - A CRF can be described by the following equation
- Where each s is a state feature function and each
t is a transition feature function - State feature functions associate observations in
the data at a particular time segment with the
label at that time segment - Described as s(y, x, i), where y is the label, x
is the observed data, and i is the time frame. - Takes a non-zero value when the current label at
frame i is the same as y and some observation in
x holds for the frame I, otherwise the value is
zero. - Prior work using CRFs in speech recognition have
used Gaussian attributes to build state feature
function (Gunawardana et. al, 2005) - Transition feature functions associate
observations in the data at a particular time
segment with the transition from the previous
label into the current label - Described as t(y,y,x,i), where y is the label,
y is the previous label, x is the observed data,
and i is the time frame - Takes a non-zero value when the current label at
frame i is the same as y, the previous label is
the same as y, and some observation in x holds
for the frame i
- For our model, a state feature function is a
single output from our MLP phonetic attribute
classifiers associated with a single label - Example sj(y,x,i) MLPstop(xi)d(yi /t/)
- The state feature function above has the value of
the output of our MLP classifier for the STOP
attribute if the label at time i is /t/.
Otherwise, it takes the value of zero. - Currently, transition feature functions do not
use the output of the MLP neural networks - The value of the function is 1 if the label pair
matches the pair defined for the function, 0 if
it does not. - Each feature function has an associated weight
value - This weight value is high when a non-zero feature
function value is strongly associated with a
particular label giving a high value to the
computation of the probability for that label - Weights are trained by maximizing the log
likelihood of the training set with respect to
the model - The strength of the CRF model is in its ability
to use arbitrary features as input - In traditional HMMs, dependencies among features
can lead to computationally difficult models
features are usually required to be independent - In a CRF, no independence assumption on the
features is made. Features can have arbitrary
dependencies.
/dh/
/dh/
/ae/
/ae/
/dx/
TABLE 1 PHONETIC ATTRIBUTES TABLE 1 PHONETIC ATTRIBUTES
Attribute Possible output values
SONORITY vowel, obstruent, sonorant, syllabic, silence
VOICE voiced, unvoiced, n/a
MANNER fricative,stop, closure flap, nasal, approximant, nasalflap, n/a
PLACE labial, dental, alveolar, palatal, velar, glottal, lateral, rhotic, n/a
HEIGHT high, mid, low, lowhigh, midhigh, n/a
FRONT front, back, central, backfront, n/a
ROUND round, nonround, roundnonround, nonroundround, n/a
TENSE tense, lax, n/a
- Discussion
- The CRF system trained on monophones has accuracy
results that fall between that of the monophone
trained Tandem and triphone trained Tandem
systems. - The CRF system makes many fewer insertions (extra
hypothesized phones) than the Tandem systems - The CRF system also makes many more deletions
(missed phones where ones should be hypothesized)
than the Tandem systems - The CRF system makes fewer hypotheses overall
than either Tandem system - The precision measurement shows how often a
hypothesis is a correct hypothesis - When the CRF system makes a hypothesis, it is
correct more often than the Tandem systems - These results suggest some means to improve the
performance of the CRF system - Addition of new extracted attributes (such as a
boundary detector) to incorporate as transition
features - Addition of a penalty factor on transition
weights to generate more transitions - Addition of more contextual attributes into state
features to gain some level of triphonic context
- Results
- Phone-level accuracies of the CRF system were
compared to a baseline Tandem system (Hermansky
et. al, 2000) - A Tandem system uses the output of the neural
networks as inputs to a Hidden Markov Model
system - Tandem system was trained with both triphone
label contexts and monophone label contexts - Triphone labels give a single left and right
context phone to the label, allowing a finer
level of context to be used when labels are
assigned - In other words, the context for the phone /ae/ in
the string of phones /k ae t/ is different from
that in the string /k ae p/ since the right
context is different - Monophone labels are a single phonetic label
- CRF system results are only for monophone labels
- TABLE 2 (below) breaks down the results into
three categories - Phone Correctness Was the correct phone
hypothesized? - Phone Accuracy Correctness penalized for
overgeneration - Phone Precision When a phone is hypothesized,
how often is it right?
- References
- J. Lafferty, A. McCallum, and F. Pereira,
Conditional Random Fields Probabilistic Models
for Segmenting and Labelling Sequence Data, in
Proceedings of the 18th International Conference
on Machine Learning, 2001. - H. Hermansky, D. Ellis, and S.Sharma, Tandem
connectionist feature stream extraction for
conventional HMM systems, in Proceedings of the
IEEE Intl. Conf. on Acoustics, Speech, and Signal
Processing, 2000. - A. Gunawardana, M. Mahajan, A. Acero and J.
Platt, Hidden Conditional Random Fields for
Phone Classification, in Proceedings of
Interspeech, 2005. - J. Morris and E. Fosler-Lussier, Discriminative
Phonetic Recognition with Conditional Random
Fields, HLT-NAACL Workshop on Computationally
Hard Problems and Joint Inference in Speech and
Language Processing, 2006 - M. Rajamanohar and E. Fosler-Lussier, An
evaluation of hierarchical articulatory feature
detectors, in IEEE Automatic Speech Recognition
and Understanding Workshop, 2005. - S. Sarawagi, CRF package for Java,
http//crf.sourceforge.net - D. Johnson et al. ICSI QuickNet software,
http//www.icsi.berkely.edu/Speech/qn.html - S. Young et al. HTK HMM software,
http//htk.eng.cam.ac.uk/
TABLE 2 Phone Recognition Comparisons TABLE 2 Phone Recognition Comparisons TABLE 2 Phone Recognition Comparisons TABLE 2 Phone Recognition Comparisons TABLE 2 Phone Recognition Comparisons
Model Phone Accuracy Phone Correct Phone Precision Parameters
Tandem (mono) 61.48 63.50 73.66 gt28,000
Tandem (tri) 66.69 72.52 73.44 gt2 million
CRF (mono) 65.23 66.74 77.66 4500
- The CRF uses far fewer parameters than either of
the Tandem systems, yet achieves a comparable
result. - Further work has shown that combining phonetic
attribute posteriors with phonetic class
posteriors yields a superior result for the CRF
over the Tandem systems (Morris and
Fosler-Lussier, 2006)
This work was supported by NSF ITR grant
IIS-0427413 the opinions and conclusions
expressed in this work are those of the authors
and not of any funding agency