Title: Speech Recognition (Part 2)
1. Speech Recognition (Part 2)
- T. J. Hazen
- MIT Computer Science and Artificial Intelligence Laboratory
2. Lecture Overview
- Probabilistic framework
- Pronunciation modeling
- Language modeling
- Finite state transducers
- Search
- System demonstrations (time permitting)
3. Probabilistic Framework
- Speech recognition is typically performed using a probabilistic modeling approach
- Goal is to find the most likely string of words, W, given the acoustic observations, A:
  W* = argmax_W P(W | A)
- The expression is rewritten using Bayes' rule (P(A) is constant over W and can be dropped):
  W* = argmax_W P(A | W) P(W) / P(A) = argmax_W P(A | W) P(W)
4. Probabilistic Framework
- Words are represented as sequences of phonetic units
- Using phonetic units, U, the expression expands to:
  W* = argmax_W Σ_U P(A | U) P(U | W) P(W) ≈ argmax_{W,U} P(A | U) P(U | W) P(W)
- Pronunciation and language models provide constraint
- Pronunciation and language models are encoded in a network
- Search must efficiently find the most likely U and W (a sketch of how the three scores combine follows below)
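To make the decomposition concrete, below is a minimal Python sketch of how a decoder might compare hypotheses by summing the three model scores in the log domain. All names and numbers here are invented for illustration; a real system gets these scores from trained acoustic, pronunciation, and language models.

import math

def hypothesis_score(log_p_acoustic, log_p_pronunciation, log_p_language):
    # log P(A|U) + log P(U|W) + log P(W): products of probabilities
    # become sums of log-probabilities
    return log_p_acoustic + log_p_pronunciation + log_p_language

# Toy hypotheses: (word, log P(A|U), log P(U|W), log P(W))
hypotheses = [
    ("boston", -120.5, math.log(0.9), math.log(0.010)),
    ("austin", -118.2, math.log(0.8), math.log(0.002)),
]

best = max(hypotheses, key=lambda h: hypothesis_score(h[1], h[2], h[3]))
print("best hypothesis:", best[0])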
5. Phonemes
- Phonemes are the basic linguistic units used to construct morphemes, words, and sentences
- Phonemes represent unique canonical acoustic sounds
- When constructing words, changing a single phoneme changes the word
- Example phonemic mappings:
  - pin → /p ih n/
  - thought → /th ao t/
  - saves → /s ey v z/
- English spelling is not (exactly) phonemic
  - Pronunciation cannot always be determined from spelling
  - Homophones have the same phonemes but different spellings
    - two vs. to vs. too, bear vs. bare, queue vs. cue, etc.
  - The same spelling can have different pronunciations
    - read, record, either, etc.
6. Phonemic Units and Classes

  Vowels:      aa (pot), er (bert), ae (bat), ey (bait), ah (but), ih (bit),
               ao (bought), iy (beat), aw (bout), ow (boat), ax (about),
               oy (boy), ay (buy), uh (book), eh (bet), uw (boot)
  Semivowels:  l (light), w (wet), r (right), y (yet)
  Nasals:      m (might), n (night), ng (sing)
  Stops:       p (pay), b (bay), t (tea), d (day), k (key), g (go)
  Fricatives:  s (sue), f (fee), z (zoo), v (vee), sh (shoe), th (thesis),
               zh (azure), dh (that), hh (hat)
  Affricates:  ch (chew), jh (Joe)
7. Phones
- Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes
- Examples:
  - Stops contain a closure and a release
    - /t/ → tcl t
    - /k/ → kcl k
  - The /t/ and /d/ phonemes can be flapped
    - utter → /ah t er/ → ah dx er
    - udder → /ah d er/ → ah dx er
  - Vowels can be fronted
    - Tuesday → /t uw z d ey/ → tcl t ux z d ey
8. Enhanced Phoneme Labels

  Stops:                      p (pay), b (bay), t (tea), d (day), k (key), g (go)
  Stops w/ optional release:  pd (tap), bd (tab), td (pat), dd (bad), kd (pack), gd (dog)
  Unaspirated stops:          p- (speed), t- (steep), k- (ski)
  Stops w/ optional flap:     tf (batter), df (badder)
  Special sequences:          nt (interview), tq en (Clinton)
  Retroflexed stops:          tr (tree), dr (drop)
9. Example Phonemic Baseform File

  <hangup>     _h1
  <noise>      _n1
  <uh>         ah_fp
  <um>         ah_fp m
  adder        ae df er
  atlanta      ( ae | ax ) td l ae nt ax
  either       ( iy | ay ) th er
  laptop       l ae pd t aa pd
  northwest    n ao r th w eh s td
  speech       s p- iy ch
  temperature  t eh m p ( r ax | er ax ) ch er
  trenton      tr r eh n tq en
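Entries like "( iy | ay ) th er" compactly encode alternate pronunciations. Below is a minimal Python sketch of expanding such a baseform into its full list of phoneme strings, assuming non-nested parenthesized groups with |-separated alternates (which covers the entries above):

def expand_baseform(tokens):
    # Expand "( a | b )" alternate groups into all phoneme strings,
    # e.g. "( iy | ay ) th er" -> ["iy th er", "ay th er"].
    results = [""]
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            j = tokens.index(")", i)          # no nesting assumed
            alts = " ".join(tokens[i + 1:j]).split("|")
            results = [(r + " " + a.strip()).strip()
                       for r in results for a in alts]
            i = j + 1
        else:
            results = [(r + " " + tokens[i]).strip() for r in results]
            i += 1
    return results

print(expand_baseform("( iy | ay ) th er".split()))
# ['iy th er', 'ay th er']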
10. Applying Phonological Rules
- Multiple phonetic realizations of phonemes can be generated by applying phonological rules
- Example:
  - butter → b ah tf er
  - This can be realized phonetically as bcl b ah tcl t er, or as bcl b ah dx er
- Phonological rewrite rules can be used to generate this:
  - butter → bcl b ah ( tcl t | dx ) er
11. Example Phonological Rules
- Example rule for /t/ deletion (as in "destination"):
  s t ax ix > tcl t
- Example rule for palatalization of /s/ (as in "miss you"):
  s y > s sh
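Reading a rewrite rule literally as "replace the left-hand token sequence with the right-hand sequence", a minimal Python sketch of applying the palatalization rule to a phoneme string might look like the following. The rule format and matching here are simplified assumptions; real systems compile such rules into transducers, as discussed later:

def apply_rule(phones, lhs, rhs):
    # Rewrite every occurrence of the token sequence lhs as rhs.
    out, i = [], 0
    while i < len(phones):
        if phones[i:i + len(lhs)] == lhs:
            out.extend(rhs)
            i += len(lhs)
        else:
            out.append(phones[i])
            i += 1
    return out

# Palatalization of /s/ before /y/ ("miss you"): s y > s sh
print(apply_rule("m ih s y uw".split(), ["s", "y"], ["s", "sh"]))
# ['m', 'ih', 's', 'sh', 'uw']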
12. Contractions and Reductions
- Examples of contractions:
  - what's → what is
  - isn't → is not
  - won't → will not
  - i'd → i would | i had
  - today's → today is | today's
- Examples of multi-word reductions:
  - gimme → give me
  - gonna → going to
  - ave → avenue
  - 'bout → about
  - d'ya've → do you have
- Contracted and reduced forms are entered in the lexical dictionary
13. Language Modeling
- A language model constrains hypothesized word sequences
- A finite state grammar (FSG) example (see the sketch after this list)
- Probabilities can be added to arcs for additional constraint
- FSGs work well when users stay within the grammar
- but FSGs can't cover everything that might be spoken
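The slide's FSG appears as a figure; below is a minimal, hypothetical stand-in in Python: a tiny grammar represented as a state-transition table. Word sequences inside the grammar are accepted and anything else is rejected, which illustrates both the constraint an FSG provides and its coverage limitation.

# Hypothetical flight-query grammar: (state, word) -> next state
fsg = {
    ("start", "show"): "s1",
    ("s1", "flights"): "s2",
    ("s2", "to"): "s3",
    ("s3", "boston"): "end",
    ("s3", "denver"): "end",
}
finals = {"end"}

def accepts(words):
    state = "start"
    for w in words:
        if (state, w) not in fsg:
            return False
        state = fsg[(state, w)]
    return state in finals

print(accepts("show flights to boston".split()))  # True: inside the grammar
print(accepts("i want a flight".split()))         # False: FSG cannot cover this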
14. N-gram Language Modeling
- An n-gram model is a statistical language model
- Predicts the current word based on the previous n-1 words
- Trigram model expression:
  P( w_n | w_{n-2}, w_{n-1} )
- Examples:
  P( boston | arriving in )
  P( seventeenth | tuesday march )
- An n-gram model allows any sequence of words
- but prefers sequences common in the training data (see the counting sketch below)
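A minimal sketch of training a maximum-likelihood trigram model by counting, using toy sentences (smoothing, discussed next, is omitted here):

from collections import defaultdict

def train_trigram(sentences):
    tri, bi = defaultdict(int), defaultdict(int)
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1  # C(w_{n-2}, w_{n-1}, w_n)
            bi[tuple(words[i - 2:i])] += 1       # C(w_{n-2}, w_{n-1})
    return tri, bi

def p_trigram(tri, bi, w, h2, h1):
    # Maximum-likelihood estimate P(w | h2 h1) = C(h2 h1 w) / C(h2 h1)
    return tri[(h2, h1, w)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0

tri, bi = train_trigram(["flights arriving in boston",
                         "arriving in denver today"])
print(p_trigram(tri, bi, "boston", "arriving", "in"))  # 0.5 on this toy data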
15. N-gram Model Smoothing
- For a bigram model, what if a word pair never occurs in the training data, i.e., C(w_{n-1}, w_n) = 0?
- To avoid sparse training data problems, we can use an interpolated bigram:
  P(w_n | w_{n-1}) = λ P_ML(w_n | w_{n-1}) + (1 - λ) P_ML(w_n)
- One method for determining the interpolation weight λ is to estimate it on held-out data (a sketch follows below)
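A minimal sketch of the interpolated bigram, with a fixed λ for illustration (in practice λ is estimated, e.g., on held-out data); the toy counts below are invented:

def p_interpolated(w, prev, bigrams, unigrams, total, lam=0.7):
    # lambda * P_ML(w | prev) + (1 - lambda) * P_ML(w)
    c_prev = unigrams.get(prev, 0)
    p_bi = bigrams.get((prev, w), 0) / c_prev if c_prev else 0.0
    p_uni = unigrams.get(w, 0) / total
    return lam * p_bi + (1 - lam) * p_uni

unigrams = {"arriving": 2, "in": 2, "boston": 1, "denver": 1}
bigrams = {("arriving", "in"): 2, ("in", "boston"): 1, ("in", "denver"): 1}
total = sum(unigrams.values())

# The unseen pair ("arriving", "boston") still gets nonzero probability:
print(p_interpolated("boston", "arriving", bigrams, unigrams, total))  # 0.05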
16. Class N-gram Language Modeling
- Class n-gram models can also help with sparse data problems
- Class trigram expression:
  P( class(w_n) | class(w_{n-2}), class(w_{n-1}) ) × P( w_n | class(w_n) )
- Example:
  P( seventeenth | tuesday march )
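A minimal sketch of the class trigram computation for the example above; the class names and probabilities are invented for illustration:

def p_class_trigram(w, h2, h1, word2class, p_class_tri, p_word_given_class):
    # P(w | h2 h1) ~= P(class(w) | class(h2) class(h1)) * P(w | class(w))
    c, c2, c1 = word2class[w], word2class[h2], word2class[h1]
    return p_class_tri.get((c2, c1, c), 0.0) * p_word_given_class.get((w, c), 0.0)

word2class = {"tuesday": "DAY", "march": "MONTH", "seventeenth": "ORDINAL"}
p_class_tri = {("DAY", "MONTH", "ORDINAL"): 0.4}         # toy value
p_word_given_class = {("seventeenth", "ORDINAL"): 0.03}  # toy value

print(p_class_trigram("seventeenth", "tuesday", "march",
                      word2class, p_class_tri, p_word_given_class))  # 0.012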
17. Multi-Word N-gram Units
- Common multi-word units can be treated as a single unit within an n-gram language model
- Common uses of compound units:
  - Common multi-word phrases
    - thank_you, good_bye, excuse_me
  - Multi-word sequences that act as a single semantic unit
    - new_york, labor_day, wind_speed
  - Letter sequences or initials
    - j_f_k, t_w_a, washington_d_c
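One simple way to use such units is to rewrite them in the training text before counting n-grams. A minimal sketch follows; the compound list here is a hypothetical hand-picked set, whereas real systems typically derive it from data:

import re

compounds = ["thank you", "new york", "labor day", "j f k"]

def merge_compounds(text, compounds):
    # Join each known multi-word unit with underscores so the n-gram
    # model treats it as a single token.
    for phrase in sorted(compounds, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b",
                      phrase.replace(" ", "_"), text)
    return text

print(merge_compounds("flights from new york on labor day", compounds))
# flights from new_york on labor_day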
18. Finite-State Transducer (FST) Motivation
- Most speech recognition constraints and results can be represented as finite-state automata:
  - Language models (e.g., n-grams and word networks)
  - Lexicons
  - Phonological rules
  - N-best lists
  - Word graphs
  - Recognition paths
- A common representation and common algorithms are desirable:
  - Consistency
  - Powerful algorithms can be employed throughout the system
  - Flexibility to combine or factor in unforeseen ways
19. What is an FST?
- One initial state
- One or more final states
- Transitions between states labeled input : output / weight
  - input requires an input symbol to match
  - output indicates the symbol to output when the transition is taken
  - epsilon (ε) consumes no input or produces no output
  - weight is the cost (e.g., -log probability) of taking the transition
- An FST defines a weighted relationship between regular languages
- A generalization of the classic finite-state acceptor (FSA)
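A minimal Python sketch of an FST as a transition table, with a tiny lexicon-style machine that transduces the phone sequence "k ey" into the (hypothetical) word "kay". The representation and example are illustrative, not the format of any particular toolkit:

# (state, input) -> list of (next_state, output, weight); "eps" is epsilon output.
fst = {
    ("q0", "k"):  [("q1", "eps", 0.0)],
    ("q1", "ey"): [("q2", "kay", 0.5)],
}
finals = {"q2"}

def transduce(state, symbols, output=(), weight=0.0):
    # Enumerate (output string, total weight) over all accepting paths.
    if not symbols:
        if state in finals:
            yield " ".join(o for o in output if o != "eps"), weight
        return
    for nxt, out, w in fst.get((state, symbols[0]), []):
        yield from transduce(nxt, symbols[1:], output + (out,), weight + w)

print(list(transduce("q0", ["k", "ey"])))  # [('kay', 0.5)]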
20. FST Example: Lexicon
- A lexicon maps /phonemes/ to words
- Words can share parts of their pronunciations
- Sharing at the beginning is beneficial to recognition speed, because pruning can prune many words at once
21. FST Composition
- Composition (∘) combines two FSTs to produce a single FST that performs both mappings in a single step
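A minimal sketch of epsilon-free composition: states of the result are pairs of states, and an arc is created whenever the first machine's output symbol matches the second machine's input symbol, with weights added. Real toolkits also handle epsilons and lazy expansion; this toy version shows only the core idea:

from collections import deque

def compose(t1, f1, s1, t2, f2, s2):
    # Each FST is {(state, input): [(next_state, output, weight)]}.
    start, trans, finals = (s1, s2), {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if p in f1 and q in f2:
            finals.add((p, q))
        for (state, a), arcs in t1.items():
            if state != p:
                continue
            for p2, b, w1 in arcs:
                for q2, c, w2 in t2.get((q, b), []):
                    trans.setdefault(((p, q), a), []).append(((p2, q2), c, w1 + w2))
                    if (p2, q2) not in seen:
                        seen.add((p2, q2))
                        queue.append((p2, q2))
    return trans, finals

# Toy machines: T1 maps "a" to "A"; T2 maps "A" to "<A>".
t1 = {("x0", "a"): [("x1", "A", 0.1)]}
t2 = {("y0", "A"): [("y1", "<A>", 0.2)]}
print(compose(t1, {"x1"}, "x0", t2, {"y1"}, "y0"))
# one composed arc: (('x0','y0'), 'a') -> (('x1','y1'), '<A>', ~0.3)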
22. FST Optimization Example
- Letter-to-word lexicon (shown as a figure)
23. FST Optimization Example: Determinization
- Determinization turns the lexicon into a tree
- Words share a common prefix
24. FST Optimization Example: Minimization
- Minimization enables sharing at the ends of words (see the sketch below)
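The effect of these two optimizations can be illustrated with a toy Python sketch over plain dictionaries: building a prefix tree corresponds to determinization (shared prefixes), and hash-consing identical subtrees corresponds to minimization (shared suffixes). The word list is invented for illustration:

def add_word(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node["#"] = {}  # end-of-word marker

def count_states(node):
    return 1 + sum(count_states(child) for child in node.values())

def minimize(node, registry):
    # Identical suffix subtrees get a single shared id (hash-consing),
    # mimicking FST minimization.
    key = tuple(sorted((ch, minimize(child, registry))
                       for ch, child in node.items()))
    return registry.setdefault(key, len(registry))

trie = {}
for w in ["batter", "better", "bitter"]:
    add_word(trie, w)

print("determinized (tree) states:", count_states(trie))  # 20
registry = {}
minimize(trie, registry)
print("minimized states:", len(registry))  # 8: the shared "-tter" suffix collapses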
25. A Cascaded FST Recognizer (shown as a figure)
26. A Cascaded FST Recognizer (continued; shown as a figure)
27. Search
- Once again, the probabilistic expression is:
  W* ≈ argmax_{W,U} P(A | U) P(U | W) P(W)
- Pronunciation and language models are encoded in an FST
- Search must efficiently find the most likely U and W
28. Viterbi Search
- Viterbi search is a time-synchronous, breadth-first search
29. Viterbi Search Pruning
- Search efficiency can be improved with pruning:
  - Score-based: don't extend low-scoring hypotheses
  - Count-based: extend only a fixed number of hypotheses (see the sketch after this list)
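A minimal Python sketch of time-synchronous Viterbi search with count-based pruning. The states, transition scores, and observation scores are toy values; a real recognizer scores frames of acoustic features against phonetic models:

import math

def viterbi_beam(obs_scores, trans, start, beam_count=3):
    # obs_scores: per frame, {state: log P(observation | state)}
    # trans: {(prev_state, state): log transition probability}
    frontier = {start: 0.0}  # best log-score per active hypothesis
    for frame in obs_scores:
        scores = {}
        for prev, s_prev in frontier.items():
            for (p, nxt), s_tr in trans.items():
                if p == prev and nxt in frame:
                    cand = s_prev + s_tr + frame[nxt]
                    if cand > scores.get(nxt, -math.inf):
                        scores[nxt] = cand
        # count-based pruning: keep only the beam_count best hypotheses
        frontier = dict(sorted(scores.items(), key=lambda kv: -kv[1])[:beam_count])
    return frontier

trans = {("<s>", "m"): 0.0,
         ("m", "m"): math.log(0.5), ("m", "aa"): math.log(0.5)}
obs = [{"m": -1.0}, {"m": -1.2, "aa": -0.8}]
print(viterbi_beam(obs, trans, "<s>"))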
30. Search Pruning Example
- Count-based pruning can effectively reduce search
- Example: fix the beam size (count) and vary the beam width (score)
31. N-best Computation with Backwards A* Search
- Backwards A* search can be used to find the N-best paths
- The Viterbi backtrace is used as the future estimate for path scores (see the sketch below)
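A minimal sketch on a toy word graph: a forward Viterbi pass computes the best score from the start to every node, and the backwards A* pass then uses those scores as exact future estimates, so complete paths pop off the priority queue in best-first order. The graph, words, and probabilities are invented for illustration, and arcs are assumed to be listed in topological order:

import heapq, math

def nbest(arcs, start, final, n=3):
    # arcs: (src, dst, word, log_prob); scores are log-probs (higher is better).
    # Forward Viterbi: best log-score from start to each node.
    best = {start: 0.0}
    for src, dst, _, w in arcs:  # arcs assumed in topological order
        if src in best:
            best[dst] = max(best.get(dst, -math.inf), best[src] + w)
    in_arcs = {}
    for a in arcs:
        in_arcs.setdefault(a[1], []).append(a)
    # Backwards A*: estimate = backward score so far + Viterbi future estimate.
    heap, results = [(-best[final], 0.0, final, [])], []
    while heap and len(results) < n:
        _, back, node, words = heapq.heappop(heap)
        if node == start:
            results.append((back, list(reversed(words))))
            continue
        for src, _, word, w in in_arcs.get(node, []):
            if src in best:
                heapq.heappush(heap,
                               (-(best[src] + w + back), back + w, src, words + [word]))
    return results

arcs = [("0", "1", "arriving", math.log(0.7)),
        ("0", "1", "deriving", math.log(0.3)),
        ("1", "2", "in", math.log(0.9)),
        ("1", "2", "at", math.log(0.1))]
for score, words in nbest(arcs, "0", "2", n=3):
    print(round(math.exp(score), 3), " ".join(words))
# 0.63 arriving in / 0.27 deriving in / 0.07 arriving at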
32. Street Address Recognition
- Street address recognition is difficult:
  - 6.2M unique (street, city, state) pairs in the US (283K unique words)
  - High confusion rate among similar street names
  - Very large search space for recognition
- Commercial solution → directed dialogue:
  - Breaks the problem into a set of smaller recognition tasks
  - Simple for first-time users, but tedious with repeated use

  C: Main menu. Please say one of the following: directions, restaurants, gas stations, or more options.
  H: Directions.
  C: Okay. Directions. What state are you going to?
  H: Massachusetts.
  C: Okay. Massachusetts. What city are you going to?
  H: Cambridge.
  C: Okay. Cambridge. What is the street address?
  H: 32 Vassar Street.
  C: Okay. 32 Vassar Street in Cambridge, Massachusetts.
  C: From your current location, continue straight on
33. Street Address Recognition
- Research goal → mixed-initiative dialogue:
  - More difficult to predict what users will say
  - Far more natural for repeat or expert users

  C: How can I help you?
  H: I need directions to 32 Vassar Street in Cambridge, Mass.
34. Dynamic Vocabulary Recognition