A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition

1 / 35
About This Presentation
Title:

A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition

Description:

Available dictionaries/orthographies assume this unit. Research suggests humans use this unit ... Be dynamic and adaptive in dictionary use ... –

Number of Views:288
Avg rating:3.0/5.0
Slides: 36
Provided by: cseOhi
Category:

less

Transcript and Presenter's Notes

Title: A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition


1
A Tutorial on Pronunciation Modeling for Large
Vocabulary Speech Recognition
  • Dr. Eric Fosler-Lussier
  • Presentation for CiS 788

2
Overview
  • Our task moving from read speech recognition
    to recognizing spontaneous conversational speech
  • Two basic approaches for modeling
    pronunciationvariation
  • Encoding linguistic knowledge to pre-specify
    possiblealternative pronunciations of words
  • Deriving alternatives directly from a
    pronunciation corpus.
  • Purposes of this tutorial
  • Explain basic linguistic concepts in phonetics
    and phonology
  • Outline several pronunciation modeling strategies
  • Summarize promising recent research directions.

3
Pronunciations Pronunciation Modeling
4
Pronunciations Pronunciation Modeling
  • Why sub-word units?
  • Data sparseness at word level
  • Intermediate level allows extensible vocabulary
  • Why phone(me)s?
  • Available dictionaries/orthographies assume this
    unit
  • Research suggests humans use this unit
  • Phone inventory more manageable than syllables,
    etc. (in e.g., English)

5
Statistical Underpinnings for Pronunciation
Modeling
  • In the whole-word approach, we could find the
    most likely utterance (word-string) M given the
    perceived signal
  • M

6
Statistical Underpinnings for Pronunciation
Modeling
  • With independence assumptions, we can use the
    following approximation
  • Argmax P(MX)

7
Statistical Underpinnings for Pronunciation
Modeling
  • PA(XQ) the acoustic model
  • continuous sound (vector)s to discrete phone
    (state)s
  • Analogous to categorical perception in human
    hearing
  • PQ(QM) the pronunciation model
  • Probability of phone states given words
  • Also includes context-dependence duration
    models
  • PL(M) the language model
  • The prior probability of word sequences

8
Statistical Underpinnings for Pronunciation
Modeling
The three models working in sequence
9
Linguistic Formalisms Pronunciation Variation
  • Phones Phonemes
  • (Articulatory) Features
  • Phonological Rules
  • Finite State Transducers

10
Linguistic Formalisms Pronunciation Variation
  • Phones Phonemes
  • Phones Types of (uttered) segments
  • E.g., p unaspirated voiceless labial stop
    spik
  • Vs. ph aspirated voiceless labial stop phik
  • Phonemes Mental abstractions of phones
  • /p/ in speak /p/ in peak to naïve speakers
  • ARPABET between phones phonemes
  • SAMPAbet closer to phones, but not perfect

11
SAMPA for American English
  • Selected Consonants (arpa)
  • tS chin tSIn (ch)
  • dZ gin dZIn (jh)
  • T thin TIn (th)
  • D this DIs (dh)
  • Z measure "mEZ_at_ (zh)
  • N thing TIN (ng)
  • j yacht jAt (y)
  • 4 butter bV4_at_ (dx)
  • Selected Vowels (arpa)
  • pat pt (ae)
  • A pot pAt (aa)
  • V cut kVt (uh) !
  • U put pUt (uh) !
  • aI rise raIz (ay)
  • 3 furs f3z (er)
  • _at_ allow _at_laU (ax)
  • _at_ corner kOrn_at_ (axr)

12
Linguistic Formalisms Pronunciation Variation
  • (Articulatory) Features
  • Describe where (place) and how (manner) a sound
    is made, and whether it is voiced.
  • Typical features (dimensions) for vowels include
    height, backness, roundness
  • (Acoustic) Features
  • Vowel features actually correlate better with
    formants than with actual tongue position

13
  • From Hume-OHaire Winters (2001)

14
Linguistic Formalisms Pronunciation Variation
  • Phonological Rules
  • Used to classify, explain, and predict phonetic
    alternations in related words write (t) vs.
    writer (dx)
  • May also be useful for capturing differences in
    speech mode (e.g., dialect, register, rate)
  • Example flapping in American English

15
Linguistic Formalisms Pronunciation Variation
  • Finite State Transducers
  • (Same example transducer as on Tuesday)

16
Linguistic Formalisms Pronunciation Variation
  • Useful properties of FSTs
  • Invertible
  • (thus usable in both production recognition)
  • Learnable (Oncina, Garcia, Vidal 1993, Gildea
    Jurafsky 1996)
  • Composable
  • Compatible with HMMs

17
ASR Models Predicting Variation in Pronunciations
  • Knowledge-Based Approaches
  • Hand-Crafted Dictionaries
  • Letter to Sound Rules
  • Phonological Rules
  • Data-Driven Approaches
  • Baseform Learning
  • Learning Pronunciation Rules

18
ASR Models Predicting Variation in Pronunciations
  • Hand-Crafted Dictionaries
  • E.g., CMUdict, Pronlex for American English
  • The most readily available starting point
  • Limitations
  • Generally only one or two pronunciations per word
  • Does not reflect fast speech, multi-word context
  • May not contain e.g., proper names, acronyms
  • Time-consuming to build for new languages

19
ASR Models Predicting Variation in Pronunciations
  • Letter to Sound Rules
  • In English, used to supplement dictionaries
  • In e.g., Spanish, may be enough by themselves
  • Can be learned (e.g. by DTs, ANNs)
  • Hard-to-catch Exceptions
  • Compound-words, acronyms, etc.
  • Loan words, foreign words
  • Proper names (Brands, people, places)

20
ASR Models Predicting Variation in Pronunciations
  • Phonological Rules
  • Useful for modeling e.g., fast speech, likely
    non-canonical pronunciations
  • Can provide basis for speaker-adaptation
  • Limitations
  • Requires labeled corpus to learn rule
    probabilities
  • May over-generalize, creating spurious homophones
  • (Pruning minimizes this)

21
Examples of Fast-Speech Rules
22
ASR Models Predicting Variation in Pronunciations
  • Automatic Baseform Learning
  • 1) Use ASR with dummy dictionary to find
    surface phone sequences of an utterance
  • 2) Find canonical pronunciation of utterance
    (e.g., by forced-Viterbi)
  • 3) Align these two (w/ dynamic programming)
  • 4) Record surface pronunciations of words

23
ASR Models Predicting Variation in Pronunciations
  • Limitations of Baseform Learning
  • Limited to single-word learning
  • Ignores multi-word phrases, cross word-boundary
    effects (e.g., Did you ? didja)
  • Misses generalizations across words (e.g., learns
    flapping separately for each word)

24
ASR Models Predicting Variation in Pronunciations
  • Learning Pronunciation Rules
  • Each word has a canonical pronunciation c1 c2
    cj cn.
  • Each phone cj in a word can be pronounced by some
    sj.
  • Set of surface pronunciations S Si si1, ,
    sin
  • Taking canonical tri-phone and last surface phone
    into account, the probability of a given Si can
    be estimated

25
ASR Models Predicting Variation in Pronunciations
  • (Machine) Learning Pronunciation Rules
  • Typical ML techniques apply CART, ANNs, etc.
  • Using features (pre-specified or learned) helps
  • Brill-type rules (e.g., Yang Martens 2000)
  • A ? B // C __ D with P(BA,C,D) positive
    rule
  • A ? not B // C __ D with 1 - P(BA,C,D)
    neg. rule
  • (Note equivalent to Two-level rule types 1 4)

26
ASR Models Predicting Variation in Pronunciations
  • Pruning Learned Rules Pronunciations
  • Vary of allowed pronunciations by
    word-frequency
  • E.g., f (count(w)) k log(count(w))
  • Use probability threshold for candidate
    pronunciations
  • Absolute cutoff
  • Relmax (relative to maximum) cutoff
  • Use acoustic confidence C(pj,wi) as measure

27
Online Transformation-Based Pronunciation Modeling
  • In theory, a dynamic dictionary could halve
    error-rates
  • Using an oracle dictionary for each utterance
    in switchboard reduces error by 43
  • Using e.g., multi-word context, hidden
    speaking-mode states may capture some of this.
  • Actual results less dramatic, of course!

28
Online Transformation-Based Pronunciation Modeling
29
Five Problems Yet to Be Solved
  • Confusability and Discriminability
  • Hard Decisions
  • Consistency
  • Information Structure
  • Moving Beyond Phones as Basic Units

30
Five Problems Yet to Be Solved
  • Confusability and Discriminability
  • New pronunciations can create homophones not only
    with other words, but with parts of words.
  • Few exact metrics exist to measure confusion

31
Five Problems Yet to Be Solved
  • Hard Decisions
  • Forced-Viterbi throws away good, but
    second-best representations.
  • N-best would avoid this (Mokbel and Jouvet), but
    problematic for large-vocabulary
  • DTs also introduce hard decisions and
    data-splitting

32
Five Problems Yet to Be Solved
  • Consistency
  • Current ASR works word-by-word w/o picking up on
    long-term patterns (e.g., stretches of fast
    speech, consistent patterns like dialect,
    speaker)
  • Hidden speech-mode variable helps, but data is
    perhaps too sparse for dialect-dependent states.

33
Five Problems Yet to Be Solved
  • Information Structure
  • Language is about the message!
  • Hence, not all words are pronounced equal
  • Confounding variables
  • Prosody intonation (emphasis, de-accenting)
  • Position of word in utterance (beginning or end)
  • Given vs. new information Topic/focus, etc.
  • First-time use vs. repetitions of a word

34
Five Problems Yet to Be Solved
  • Moving Beyond Phones as Basic Units
  • Other types of units
  • Fenones
  • Hybrid phones xy for //x//?/y/ rules
  • Detecting (changes in) distinctive features
  • E.g., ax ? voicing,nasality,
    voicing,nasality,back, voicing,back,
  • (cf. Autosegmental Non-linear phonology?)

35
Conclusions
  • An ideal model would
  • Be dynamic and adaptive in dictionary use
  • Integrate knowledge of previously heard
    pronunciation patterns from that speaker
  • Incorporate higher-level factors (e.g., speaking
    rate, semantics of the message) to predict
    changes from the canonical pronunciation
  • (Perhaps) operate on a sub-phonetic level, too.
Write a Comment
User Comments (0)
About PowerShow.com