Title: A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition
1. A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition
- Dr. Eric Fosler-Lussier
- Presentation for CIS 788
2. Overview
- Our task: moving from read speech recognition to recognizing spontaneous, conversational speech
- Two basic approaches for modeling pronunciation variation:
  - Encoding linguistic knowledge to pre-specify possible alternative pronunciations of words
  - Deriving alternatives directly from a pronunciation corpus
- Purposes of this tutorial:
  - Explain basic linguistic concepts in phonetics and phonology
  - Outline several pronunciation modeling strategies
  - Summarize promising recent research directions
3. Pronunciations and Pronunciation Modeling
4. Pronunciations and Pronunciation Modeling
- Why sub-word units?
  - Data sparseness at the word level
  - Intermediate level allows an extensible vocabulary
- Why phone(me)s?
  - Available dictionaries/orthographies assume this unit
  - Research suggests humans use this unit
  - Phone inventory more manageable than syllables, etc. (in, e.g., English)
5. Statistical Underpinnings for Pronunciation Modeling
- In the whole-word approach, we could find the most likely utterance (word string) M given the perceived signal X:
  - M* = argmax_M P(M | X)
6. Statistical Underpinnings for Pronunciation Modeling
- With independence assumptions, we can use the following approximation:
  - argmax_M P(M | X) ≈ argmax_M max_Q P_A(X | Q) · P_Q(Q | M) · P_L(M)
  - (Q ranges over phone-state sequences; the three factors are defined on the next slide)
7. Statistical Underpinnings for Pronunciation Modeling
- P_A(X | Q): the acoustic model
  - Maps continuous sound (vectors) to discrete phone (states)
  - Analogous to categorical perception in human hearing
- P_Q(Q | M): the pronunciation model
  - Probability of phone states given words
  - Also includes context-dependence and duration models
- P_L(M): the language model
  - The prior probability of word sequences
- (A toy sketch of combining the three scores follows below.)
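To make the factored model concrete, here is a minimal Python sketch (my own illustration, not from the slides) of combining the three log-probability scores and taking the argmax over word-string hypotheses; all numbers are invented placeholders.

```python
# Toy illustration (not from the slides): combining the three models'
# log-probabilities to score competing word-string hypotheses M.
# All numbers are invented placeholders standing in for real model scores.
hypotheses = {
    # M: (log P_A(X|Q), log P_Q(Q|M), log P_L(M)) for its best state sequence Q
    "the cat sat":   (-210.3, -4.2, -11.8),
    "the cats at":   (-209.7, -5.9, -14.6),
    "the cat's hat": (-214.1, -3.8, -13.0),
}

def total_score(log_pa, log_pq, log_pl):
    """Total log-probability under the factored model P_A * P_Q * P_L."""
    return log_pa + log_pq + log_pl

best = max(hypotheses, key=lambda m: total_score(*hypotheses[m]))
print(best)  # the argmax_M of the combined score ("the cat sat" here)
```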
8. Statistical Underpinnings for Pronunciation Modeling
The three models working in sequence
9. Linguistic Formalisms and Pronunciation Variation
- Phones and Phonemes
- (Articulatory) Features
- Phonological Rules
- Finite State Transducers
10. Linguistic Formalisms and Pronunciation Variation
- Phones and Phonemes
  - Phones: types of (uttered) segments
    - E.g., [p]: unaspirated voiceless labial stop, as in speak [spik]
    - vs. [pʰ]: aspirated voiceless labial stop, as in peak [pʰik]
  - Phonemes: mental abstractions of phones
    - /p/ in speak = /p/ in peak to naïve speakers
  - ARPABET: between phones and phonemes
  - SAMPA: closer to phones, but not perfect
11. SAMPA for American English
- Columns: SAMPA symbol, example word, SAMPA transcription, (ARPABET equivalent)
- Selected consonants:
  - tS   chin     tSIn     (ch)
  - dZ   gin      dZIn     (jh)
  - T    thin     TIn      (th)
  - D    this     DIs      (dh)
  - Z    measure  "mEZ@    (zh)
  - N    thing    TIN      (ng)
  - j    yacht    jAt      (y)
  - 4    butter   bV4@     (dx)
- Selected vowels:
  - {    pat      p{t      (ae)
  - A    pot      pAt      (aa)
  - V    cut      kVt      (ah) !
  - U    put      pUt      (uh) !
  - aI   rise     raIz     (ay)
  - 3    furs     f3z      (er)
  - @    allow    @laU     (ax)
  - @    corner   kOrn@    (axr)
12. Linguistic Formalisms and Pronunciation Variation
- (Articulatory) Features
  - Describe where (place) and how (manner) a sound is made, and whether it is voiced
  - Typical features (dimensions) for vowels include height, backness, roundness
- (Acoustic) Features
  - Vowel features actually correlate better with formants than with actual tongue position
13. (Figure) From Hume-O'Haire & Winters (2001)
14. Linguistic Formalisms and Pronunciation Variation
- Phonological Rules
  - Used to classify, explain, and predict phonetic alternations in related words: write ([t]) vs. writer ([dx])
  - May also be useful for capturing differences in speech mode (e.g., dialect, register, rate)
  - Example: flapping in American English (see the sketch below)
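To make the flapping example concrete, here is a rough Python sketch (my own, not the lecture's rule formalism; the vowel inventory and stress-digit conventions are assumptions) that rewrites t as dx between a stressed vowel and an unstressed vowel in ARPABET-style phone strings.

```python
# Rough sketch of American English flapping over ARPABET-style phones:
# t -> dx between a stressed vowel and an unstressed vowel,
# as in "writer" (r ay1 t er0 -> r ay1 dx er0) but not "write".
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw", "ax", "axr"}

def is_vowel(phone):
    return phone.rstrip("012") in VOWELS

def is_unstressed(phone):
    # Assumes CMUdict-style stress digits; ax/axr are treated as unstressed
    return phone.endswith("0") or phone.rstrip("012") in {"ax", "axr"}

def apply_flapping(phones):
    """Rewrite t -> dx / stressed vowel __ unstressed vowel; all else unchanged."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (phones[i] == "t"
                and is_vowel(phones[i - 1]) and not is_unstressed(phones[i - 1])
                and is_vowel(phones[i + 1]) and is_unstressed(phones[i + 1])):
            out[i] = "dx"
    return out

print(apply_flapping(["r", "ay1", "t", "er0"]))  # ['r', 'ay1', 'dx', 'er0']  (writer)
print(apply_flapping(["r", "ay1", "t"]))         # ['r', 'ay1', 't']          (write)
```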
15. Linguistic Formalisms and Pronunciation Variation
- Finite State Transducers
- (Same example transducer as on Tuesday)
16. Linguistic Formalisms and Pronunciation Variation
- Useful properties of FSTs (see the toy illustration below)
  - Invertible (thus usable in both production and recognition)
  - Learnable (Oncina, Garcia & Vidal 1993; Gildea & Jurafsky 1996)
  - Composable
  - Compatible with HMMs
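As a toy illustration of two of these properties, the sketch below uses plain Python sets of string pairs to stand in for transducers (an analogy only, not a real FST implementation): inversion swaps production and recognition, and composition chains a lexicon with a pronunciation rule. The word and pronunciation entries are my own examples.

```python
# Toy analogy: plain sets of string pairs stand in for transducers,
# just to show what invertibility and composition buy you.
# Real FSTs operate over arcs and states, not whole strings.
lexicon = {("water", "w ao t er"), ("writer", "r ay t er")}            # word -> canonical phones
flapping = {("w ao t er", "w ao dx er"), ("r ay t er", "r ay dx er")}  # canonical -> surface phones

def invert(relation):
    """Swap input and output sides (a production model becomes a recognition model)."""
    return {(out, inp) for inp, out in relation}

def compose(r1, r2):
    """Chain two relations: (a, c) whenever (a, b) is in r1 and (b, c) is in r2."""
    return {(a, c) for a, b in r1 for b2, c in r2 if b == b2}

word_to_surface = compose(lexicon, flapping)
print(word_to_surface)          # word -> flapped surface pronunciation
print(invert(word_to_surface))  # surface pronunciation -> word (for recognition)
```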
17. ASR Models: Predicting Variation in Pronunciations
- Knowledge-Based Approaches
  - Hand-Crafted Dictionaries
  - Letter-to-Sound Rules
  - Phonological Rules
- Data-Driven Approaches
  - Baseform Learning
  - Learning Pronunciation Rules
18. ASR Models: Predicting Variation in Pronunciations
- Hand-Crafted Dictionaries
  - E.g., CMUdict, Pronlex for American English
  - The most readily available starting point
  - Limitations:
    - Generally only one or two pronunciations per word
    - Does not reflect fast speech, multi-word context
    - May not contain, e.g., proper names, acronyms
    - Time-consuming to build for new languages
19. ASR Models: Predicting Variation in Pronunciations
- Letter-to-Sound Rules
  - In English, used to supplement dictionaries
  - In, e.g., Spanish, may be enough by themselves
  - Can be learned (e.g., by DTs, ANNs; see the sketch below)
  - Hard-to-catch exceptions:
    - Compound words, acronyms, etc.
    - Loan words, foreign words
    - Proper names (brands, people, places)
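As a sketch of the "learned by DTs" idea, here is a toy decision-tree letter-to-sound learner (assuming scikit-learn is available; the hand-aligned letter/phone examples and the one-letter context window are invented for illustration, not the lecture's setup).

```python
# Toy sketch: a decision tree maps each letter, with one letter of context
# on either side, to a phone.  The tiny hand-aligned examples below come
# from the words "cat" and "city"; "#" marks a word boundary.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

examples = [  # (previous letter, letter, next letter) -> phone
    ("#", "c", "a", "k"), ("c", "a", "t", "ae"), ("a", "t", "#", "t"),
    ("#", "c", "i", "s"), ("c", "i", "t", "ih"), ("i", "t", "y", "t"),
    ("t", "y", "#", "iy"),
]
X = [[prev, cur, nxt] for prev, cur, nxt, _ in examples]
y = [phone for *_, phone in examples]

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tree = DecisionTreeClassifier(random_state=0).fit(enc.fit_transform(X), y)

# "c" before "i" should come out as /s/, "c" before "a" as /k/
print(tree.predict(enc.transform([["#", "c", "i"], ["#", "c", "a"]])))
```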
20. ASR Models: Predicting Variation in Pronunciations
- Phonological Rules
  - Useful for modeling, e.g., fast speech and likely non-canonical pronunciations
  - Can provide a basis for speaker adaptation
  - Limitations:
    - Requires a labeled corpus to learn rule probabilities
    - May over-generalize, creating spurious homophones (pruning minimizes this)
21. Examples of Fast-Speech Rules
22. ASR Models: Predicting Variation in Pronunciations
- Automatic Baseform Learning
  - 1) Use ASR with a dummy dictionary to find the surface phone sequence of an utterance
  - 2) Find the canonical pronunciation of the utterance (e.g., by forced-Viterbi alignment)
  - 3) Align these two (with dynamic programming; see the sketch below)
  - 4) Record the surface pronunciations of words
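Step 3 above is essentially an edit-distance alignment; here is a bare-bones sketch (my own, with unit costs and the example phone strings as assumptions) that aligns a canonical phone string with a recognized surface string and returns the phone pairs.

```python
def align(canonical, surface, sub_cost=1, indel_cost=1):
    """Return (canonical_phone, surface_phone) pairs; None marks a deletion/insertion."""
    n, m = len(canonical), len(surface)
    # Fill the dynamic-programming table of edit costs
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel_cost
    for j in range(1, m + 1):
        cost[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if canonical[i - 1] == surface[j - 1] else sub_cost
            cost[i][j] = min(cost[i - 1][j - 1] + match,   # match / substitution
                             cost[i - 1][j] + indel_cost,  # deletion (phone not realized)
                             cost[i][j - 1] + indel_cost)  # insertion (extra surface phone)
    # Trace back through the table to recover the alignment itself
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        match = sub_cost
        if i > 0 and j > 0 and canonical[i - 1] == surface[j - 1]:
            match = 0
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + match:
            pairs.append((canonical[i - 1], surface[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + indel_cost:
            pairs.append((canonical[i - 1], None))
            i -= 1
        else:
            pairs.append((None, surface[j - 1]))
            j -= 1
    return list(reversed(pairs))

# Canonical "probably" vs. a heavily reduced surface realization
print(align("p r aa b ax b l iy".split(), "p r aa b iy".split()))
```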
23. ASR Models: Predicting Variation in Pronunciations
- Limitations of Baseform Learning
  - Limited to single-word learning
  - Ignores multi-word phrases and cross-word-boundary effects (e.g., "Did you?" → "didja")
  - Misses generalizations across words (e.g., learns flapping separately for each word)
24. ASR Models: Predicting Variation in Pronunciations
- Learning Pronunciation Rules
  - Each word has a canonical pronunciation c1 c2 ... cj ... cn
  - Each canonical phone cj in a word can be realized as some surface phone sj
  - Set of surface pronunciations S: Si = si1, ..., sin
  - Taking the canonical triphone and the last surface phone into account, the probability of a given Si can be estimated, roughly P(Si) ≈ Πj P(sij | cj-1, cj, cj+1, si(j-1)) (see the counting sketch below)
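A small sketch (my own illustration) of estimating these probabilities by relative-frequency counting over already-aligned canonical/surface phone pairs; the "#" boundary symbol, the absence of smoothing, and the toy data are assumptions.

```python
from collections import Counter, defaultdict

def estimate_rule_probs(aligned_utterances):
    """aligned_utterances: list of (canonical_phones, surface_phones) pairs,
    already aligned position-by-position (e.g., via the DP alignment above)."""
    counts = defaultdict(Counter)
    for canonical, surface in aligned_utterances:
        prev_surface = "#"  # word/utterance boundary marker
        for j, (c, s) in enumerate(zip(canonical, surface)):
            left = canonical[j - 1] if j > 0 else "#"
            right = canonical[j + 1] if j + 1 < len(canonical) else "#"
            # Condition on the canonical triphone and the previous surface phone
            counts[(left, c, right, prev_surface)][s] += 1
            prev_surface = s
    return {ctx: {s: n / sum(ctr.values()) for s, n in ctr.items()}
            for ctx, ctr in counts.items()}

# Toy aligned data: "butter" realized canonically once and with a flap once
data = [(["b", "ah", "t", "er"], ["b", "ah", "dx", "er"]),
        (["b", "ah", "t", "er"], ["b", "ah", "t", "er"])]
probs = estimate_rule_probs(data)
print(probs[("ah", "t", "er", "ah")])  # {'dx': 0.5, 't': 0.5}
```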
25. ASR Models: Predicting Variation in Pronunciations
- (Machine) Learning Pronunciation Rules
  - Typical ML techniques apply: CART, ANNs, etc.
  - Using features (pre-specified or learned) helps
  - Brill-type rules (e.g., Yang & Martens 2000):
    - A → B / C __ D with P(B | A, C, D): positive rule
    - A → not B / C __ D with 1 - P(B | A, C, D): negative rule
    - (Note: equivalent to two-level rule types 1 & 4)
26. ASR Models: Predicting Variation in Pronunciations
- Pruning Learned Rules and Pronunciations (see the sketch below)
  - Vary the number of allowed pronunciations by word frequency
    - E.g., f(count(w)) = k · log(count(w))
  - Use a probability threshold for candidate pronunciations
    - Absolute cutoff
    - Relmax (relative to maximum) cutoff
  - Use acoustic confidence C(pj, wi) as a measure
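A rough sketch of pruning along these lines (the constant k, both cutoffs, and the candidate probabilities below are placeholder assumptions, not values from the slides).

```python
import math

def prune(candidates, word_count, k=2.0, abs_cutoff=0.01, relmax_cutoff=0.1):
    """candidates: dict mapping pronunciation -> probability for one word."""
    max_p = max(candidates.values())
    # Drop candidates below the absolute and the relative-to-maximum thresholds
    kept = {p: pr for p, pr in candidates.items()
            if pr >= abs_cutoff and pr >= relmax_cutoff * max_p}
    # Frequent words are allowed more pronunciation variants than rare ones
    allowed = max(1, int(k * math.log(word_count + 1)))
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:allowed])

print(prune({"b ah t er": 0.55, "b ah dx er": 0.40, "b ah d er": 0.05},
            word_count=12))
```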
27. Online Transformation-Based Pronunciation Modeling
- In theory, a dynamic dictionary could halve error rates
  - Using an oracle dictionary for each utterance in Switchboard reduces error by 43%
- Using, e.g., multi-word context and hidden speaking-mode states may capture some of this
- Actual results are less dramatic, of course!
28. Online Transformation-Based Pronunciation Modeling
29. Five Problems Yet to Be Solved
- Confusability and Discriminability
- Hard Decisions
- Consistency
- Information Structure
- Moving Beyond Phones as Basic Units
30. Five Problems Yet to Be Solved
- Confusability and Discriminability
  - New pronunciations can create homophones not only with other words, but with parts of words
  - Few exact metrics exist to measure confusion
31. Five Problems Yet to Be Solved
- Hard Decisions
  - Forced-Viterbi throws away good, but second-best, representations
  - N-best would avoid this (Mokbel and Jouvet), but is problematic for large vocabularies
  - DTs also introduce hard decisions and data splitting
32. Five Problems Yet to Be Solved
- Consistency
  - Current ASR works word-by-word without picking up on long-term patterns (e.g., stretches of fast speech, consistent patterns like dialect or speaker)
  - A hidden speech-mode variable helps, but data is perhaps too sparse for dialect-dependent states
33. Five Problems Yet to Be Solved
- Information Structure
  - Language is about the message!
  - Hence, not all words are pronounced equal
  - Confounding variables:
    - Prosody and intonation (emphasis, de-accenting)
    - Position of word in utterance (beginning or end)
    - Given vs. new information; topic/focus, etc.
    - First-time use vs. repetitions of a word
34. Five Problems Yet to Be Solved
- Moving Beyond Phones as Basic Units
  - Other types of units:
    - Fenones
    - Hybrid phones xy for /x/ → /y/ rules
    - Detecting (changes in) distinctive features
      - E.g., ax → {voicing, nasality}, {voicing, nasality, back}, {voicing, back}, ...
    - (cf. autosegmental / non-linear phonology?)
35. Conclusions
- An ideal model would:
  - Be dynamic and adaptive in dictionary use
  - Integrate knowledge of previously heard pronunciation patterns from that speaker
  - Incorporate higher-level factors (e.g., speaking rate, semantics of the message) to predict changes from the canonical pronunciation
  - (Perhaps) operate on a sub-phonetic level, too