Title: Speech Recognition (Part 2)
1. Speech Recognition (Part 2)
- T. J. Hazen
- MIT Computer Science and Artificial Intelligence Laboratory
2. Lecture Overview
- Probabilistic framework
- Pronunciation modeling
- Language modeling
- Finite state transducers
- Search
- System demonstrations (time permitting)
3. Probabilistic Framework
- Speech recognition is typically performed using a probabilistic modeling approach
- Goal is to find the most likely string of words, W, given the acoustic observations, A:
  W* = argmax_W P(W | A)
- The expression is rewritten using Bayes' rule (P(A) is constant over W and can be dropped):
  W* = argmax_W P(A | W) P(W) / P(A) = argmax_W P(A | W) P(W)
4. Probabilistic Framework
- Words are represented as sequences of phonetic units
- Using phonetic units, U, the expression expands to:
  W* = argmax_W Σ_U P(A | U) P(U | W) P(W) ≈ argmax_{W,U} P(A | U) P(U | W) P(W)
- Pronunciation and language models provide constraint
- Pronunciation and language models are encoded in a network
- Search must efficiently find the most likely U and W (a sketch of how the three scores combine follows below)
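To make the decomposition concrete, below is a minimal Python sketch of how a decoder might compare hypotheses by summing the three model scores in the log domain. All names and numbers here are invented for illustration; a real system gets these scores from trained acoustic, pronunciation, and language models.

import math

def hypothesis_score(log_p_acoustic, log_p_pronunciation, log_p_language):
    # log P(A|U) + log P(U|W) + log P(W): products of probabilities
    # become sums of log-probabilities
    return log_p_acoustic + log_p_pronunciation + log_p_language

# Toy hypotheses: (word, log P(A|U), log P(U|W), log P(W))
hypotheses = [
    ("boston", -120.5, math.log(0.9), math.log(0.010)),
    ("austin", -118.2, math.log(0.8), math.log(0.002)),
]

best = max(hypotheses, key=lambda h: hypothesis_score(h[1], h[2], h[3]))
print("best hypothesis:", best[0])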
5. Phonemes
- Phonemes are the basic linguistic units used to construct morphemes, words, and sentences
- Phonemes represent unique canonical acoustic sounds
- When constructing words, changing a single phoneme changes the word
- Example phonemic mappings:
  - pin → /p ih n/
  - thought → /th ao t/
  - saves → /s ey v z/
- English spelling is not (exactly) phonemic
  - Pronunciation cannot always be determined from spelling
  - Homophones have the same phonemes but different spellings
    - two vs. to vs. too, bear vs. bare, queue vs. cue, etc.
  - The same spelling can have different pronunciations
    - read, record, either, etc.
6. Phonemic Units and Classes

  Vowels:      aa (pot), er (bert), ae (bat), ey (bait), ah (but), ih (bit),
               ao (bought), iy (beat), aw (bout), ow (boat), ax (about),
               oy (boy), ay (buy), uh (book), eh (bet), uw (boot)
  Semivowels:  l (light), w (wet), r (right), y (yet)
  Nasals:      m (might), n (night), ng (sing)
  Stops:       p (pay), b (bay), t (tea), d (day), k (key), g (go)
  Fricatives:  s (sue), f (fee), z (zoo), v (vee), sh (shoe), th (thesis),
               zh (azure), dh (that), hh (hat)
  Affricates:  ch (chew), jh (Joe)
7. Phones
- Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes
- Examples:
  - Stops contain a closure and a release
    - /t/ → tcl t
    - /k/ → kcl k
  - The /t/ and /d/ phonemes can be flapped
    - utter → /ah t er/ → ah dx er
    - udder → /ah d er/ → ah dx er
  - Vowels can be fronted
    - Tuesday → /t uw z d ey/ → tcl t ux z d ey
8. Enhanced Phoneme Labels

  Stops:                      p (pay), b (bay), t (tea), d (day), k (key), g (go)
  Stops w/ optional release:  pd (tap), bd (tab), td (pat), dd (bad), kd (pack), gd (dog)
  Unaspirated stops:          p- (speed), t- (steep), k- (ski)
  Stops w/ optional flap:     tf (batter), df (badder)
  Special sequences:          nt (interview), tq en (Clinton)
  Retroflexed stops:          tr (tree), dr (drop)
9. Example Phonemic Baseform File

  <hangup>     _h1
  <noise>      _n1
  <uh>         ah_fp
  <um>         ah_fp m
  adder        ae df er
  atlanta      ( ae | ax ) td l ae nt ax
  either       ( iy | ay ) th er
  laptop       l ae pd t aa pd
  northwest    n ao r th w eh s td
  speech       s p- iy ch
  temperature  t eh m p ( r ax | er ax ) ch er
  trenton      tr r eh n tq en
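Entries like "( iy | ay ) th er" compactly encode alternate pronunciations. Below is a minimal Python sketch of expanding such a baseform into its full list of phoneme strings, assuming non-nested parenthesized groups with |-separated alternates (which covers the entries above):

def expand_baseform(tokens):
    # Expand "( a | b )" alternate groups into all phoneme strings,
    # e.g. "( iy | ay ) th er" -> ["iy th er", "ay th er"].
    results = [""]
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            j = tokens.index(")", i)          # no nesting assumed
            alts = " ".join(tokens[i + 1:j]).split("|")
            results = [(r + " " + a.strip()).strip()
                       for r in results for a in alts]
            i = j + 1
        else:
            results = [(r + " " + tokens[i]).strip() for r in results]
            i += 1
    return results

print(expand_baseform("( iy | ay ) th er".split()))
# ['iy th er', 'ay th er']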
10. Applying Phonological Rules
- Multiple phonetic realizations of phonemes can be generated by applying phonological rules
- Example:
  - butter → b ah tf er
  - This can be realized phonetically as bcl b ah tcl t er, or as bcl b ah dx er
- Phonological rewrite rules can be used to generate this:
  - butter → bcl b ah ( tcl t | dx ) er
11. Example Phonological Rules
- Example rule for /t/ deletion (as in "destination"):
  s t ax ix > tcl t
- Example rule for palatalization of /s/ (as in "miss you"):
  s y > s sh
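Reading a rewrite rule literally as "replace the left-hand token sequence with the right-hand sequence", a minimal Python sketch of applying the palatalization rule to a phoneme string might look like the following. The rule format and matching here are simplified assumptions; real systems compile such rules into transducers, as discussed later:

def apply_rule(phones, lhs, rhs):
    # Rewrite every occurrence of the token sequence lhs as rhs.
    out, i = [], 0
    while i < len(phones):
        if phones[i:i + len(lhs)] == lhs:
            out.extend(rhs)
            i += len(lhs)
        else:
            out.append(phones[i])
            i += 1
    return out

# Palatalization of /s/ before /y/ ("miss you"): s y > s sh
print(apply_rule("m ih s y uw".split(), ["s", "y"], ["s", "sh"]))
# ['m', 'ih', 's', 'sh', 'uw']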
12. Contractions and Reductions
- Examples of contractions:
  - what's → what is
  - isn't → is not
  - won't → will not
  - i'd → i would | i had
  - today's → today is | today's
- Examples of multi-word reductions:
  - gimme → give me
  - gonna → going to
  - ave → avenue
  - 'bout → about
  - d'ya've → do you have
- Contracted and reduced forms are entered in the lexical dictionary
13. Language Modeling
- A language model constrains hypothesized word sequences
- A finite state grammar (FSG) example (see the sketch after this list)
- Probabilities can be added to arcs for additional constraint
- FSGs work well when users stay within the grammar
- but FSGs can't cover everything that might be spoken
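The slide's FSG appears as a figure; below is a minimal, hypothetical stand-in in Python: a tiny grammar represented as a state-transition table. Word sequences inside the grammar are accepted and anything else is rejected, which illustrates both the constraint an FSG provides and its coverage limitation.

# Hypothetical flight-query grammar: (state, word) -> next state
fsg = {
    ("start", "show"): "s1",
    ("s1", "flights"): "s2",
    ("s2", "to"): "s3",
    ("s3", "boston"): "end",
    ("s3", "denver"): "end",
}
finals = {"end"}

def accepts(words):
    state = "start"
    for w in words:
        if (state, w) not in fsg:
            return False
        state = fsg[(state, w)]
    return state in finals

print(accepts("show flights to boston".split()))  # True: inside the grammar
print(accepts("i want a flight".split()))         # False: FSG cannot cover this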
14. N-gram Language Modeling
- An n-gram model is a statistical language model
- Predicts the current word based on the previous n-1 words
- Trigram model expression:
  P( w_n | w_{n-2}, w_{n-1} )
- Examples:
  P( boston | arriving in )
  P( seventeenth | tuesday march )
- An n-gram model allows any sequence of words
- but prefers sequences common in the training data (see the counting sketch below)
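A minimal sketch of training a maximum-likelihood trigram model by counting, using toy sentences (smoothing, discussed next, is omitted here):

from collections import defaultdict

def train_trigram(sentences):
    tri, bi = defaultdict(int), defaultdict(int)
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1  # C(w_{n-2}, w_{n-1}, w_n)
            bi[tuple(words[i - 2:i])] += 1       # C(w_{n-2}, w_{n-1})
    return tri, bi

def p_trigram(tri, bi, w, h2, h1):
    # Maximum-likelihood estimate P(w | h2 h1) = C(h2 h1 w) / C(h2 h1)
    return tri[(h2, h1, w)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0

tri, bi = train_trigram(["flights arriving in boston",
                         "arriving in denver today"])
print(p_trigram(tri, bi, "boston", "arriving", "in"))  # 0.5 on this toy data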
15. N-gram Model Smoothing
- For a bigram model, what if a word pair never occurs in the training data, i.e., C(w_{n-1}, w_n) = 0?
- To avoid sparse training data problems, we can use an interpolated bigram:
  P(w_n | w_{n-1}) = λ P_ML(w_n | w_{n-1}) + (1 - λ) P_ML(w_n)
- One method for determining the interpolation weight λ is to estimate it on held-out data (a sketch follows below)
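A minimal sketch of the interpolated bigram, with a fixed λ for illustration (in practice λ is estimated, e.g., on held-out data); the toy counts below are invented:

def p_interpolated(w, prev, bigrams, unigrams, total, lam=0.7):
    # lambda * P_ML(w | prev) + (1 - lambda) * P_ML(w)
    c_prev = unigrams.get(prev, 0)
    p_bi = bigrams.get((prev, w), 0) / c_prev if c_prev else 0.0
    p_uni = unigrams.get(w, 0) / total
    return lam * p_bi + (1 - lam) * p_uni

unigrams = {"arriving": 2, "in": 2, "boston": 1, "denver": 1}
bigrams = {("arriving", "in"): 2, ("in", "boston"): 1, ("in", "denver"): 1}
total = sum(unigrams.values())

# The unseen pair ("arriving", "boston") still gets nonzero probability:
print(p_interpolated("boston", "arriving", bigrams, unigrams, total))  # 0.05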
16. Class N-gram Language Modeling
- Class n-gram models can also help with sparse data problems
- Class trigram expression:
  P( class(w_n) | class(w_{n-2}), class(w_{n-1}) ) × P( w_n | class(w_n) )
- Example:
  P( seventeenth | tuesday march )
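A minimal sketch of the class trigram computation for the example above; the class names and probabilities are invented for illustration:

def p_class_trigram(w, h2, h1, word2class, p_class_tri, p_word_given_class):
    # P(w | h2 h1) ~= P(class(w) | class(h2) class(h1)) * P(w | class(w))
    c, c2, c1 = word2class[w], word2class[h2], word2class[h1]
    return p_class_tri.get((c2, c1, c), 0.0) * p_word_given_class.get((w, c), 0.0)

word2class = {"tuesday": "DAY", "march": "MONTH", "seventeenth": "ORDINAL"}
p_class_tri = {("DAY", "MONTH", "ORDINAL"): 0.4}         # toy value
p_word_given_class = {("seventeenth", "ORDINAL"): 0.03}  # toy value

print(p_class_trigram("seventeenth", "tuesday", "march",
                      word2class, p_class_tri, p_word_given_class))  # 0.012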
17. Multi-Word N-gram Units
- Common multi-word units can be treated as a single unit within an n-gram language model
- Common uses of compound units:
  - Common multi-word phrases
    - thank_you, good_bye, excuse_me
  - Multi-word sequences that act as a single semantic unit
    - new_york, labor_day, wind_speed
  - Letter sequences or initials
    - j_f_k, t_w_a, washington_d_c
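One simple way to use such units is to rewrite them in the training text before counting n-grams. A minimal sketch follows; the compound list here is a hypothetical hand-picked set, whereas real systems typically derive it from data:

import re

compounds = ["thank you", "new york", "labor day", "j f k"]

def merge_compounds(text, compounds):
    # Join each known multi-word unit with underscores so the n-gram
    # model treats it as a single token.
    for phrase in sorted(compounds, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b",
                      phrase.replace(" ", "_"), text)
    return text

print(merge_compounds("flights from new york on labor day", compounds))
# flights from new_york on labor_day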
18. Finite-State Transducer (FST) Motivation
- Most speech recognition constraints and results can be represented as finite-state automata:
  - Language models (e.g., n-grams and word networks)
  - Lexicons
  - Phonological rules
  - N-best lists
  - Word graphs
  - Recognition paths
- A common representation and common algorithms are desirable:
  - Consistency
  - Powerful algorithms can be employed throughout the system
  - Flexibility to combine or factor in unforeseen ways
19. What is an FST?
- One initial state
- One or more final states
- Transitions between states labeled input : output / weight
  - input requires an input symbol to match
  - output indicates the symbol to output when the transition is taken
  - epsilon (ε) consumes no input or produces no output
  - weight is the cost (e.g., -log probability) of taking the transition
- An FST defines a weighted relationship between regular languages
- A generalization of the classic finite-state acceptor (FSA)
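A minimal Python sketch of an FST as a transition table, with a tiny lexicon-style machine that transduces the phone sequence "k ey" into the (hypothetical) word "kay". The representation and example are illustrative, not the format of any particular toolkit:

# (state, input) -> list of (next_state, output, weight); "eps" is epsilon output.
fst = {
    ("q0", "k"):  [("q1", "eps", 0.0)],
    ("q1", "ey"): [("q2", "kay", 0.5)],
}
finals = {"q2"}

def transduce(state, symbols, output=(), weight=0.0):
    # Enumerate (output string, total weight) over all accepting paths.
    if not symbols:
        if state in finals:
            yield " ".join(o for o in output if o != "eps"), weight
        return
    for nxt, out, w in fst.get((state, symbols[0]), []):
        yield from transduce(nxt, symbols[1:], output + (out,), weight + w)

print(list(transduce("q0", ["k", "ey"])))  # [('kay', 0.5)]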
20. FST Example: Lexicon
- A lexicon maps /phonemes/ to words
- Words can share parts of their pronunciations
- Sharing at the beginning is beneficial to recognition speed, because pruning can prune many words at once
21. FST Composition
- Composition (∘) combines two FSTs to produce a single FST that performs both mappings in a single step
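A minimal sketch of epsilon-free composition: states of the result are pairs of states, and an arc is created whenever the first machine's output symbol matches the second machine's input symbol, with weights added. Real toolkits also handle epsilons and lazy expansion; this toy version shows only the core idea:

from collections import deque

def compose(t1, f1, s1, t2, f2, s2):
    # Each FST is {(state, input): [(next_state, output, weight)]}.
    start, trans, finals = (s1, s2), {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if p in f1 and q in f2:
            finals.add((p, q))
        for (state, a), arcs in t1.items():
            if state != p:
                continue
            for p2, b, w1 in arcs:
                for q2, c, w2 in t2.get((q, b), []):
                    trans.setdefault(((p, q), a), []).append(((p2, q2), c, w1 + w2))
                    if (p2, q2) not in seen:
                        seen.add((p2, q2))
                        queue.append((p2, q2))
    return trans, finals

# Toy machines: T1 maps "a" to "A"; T2 maps "A" to "<A>".
t1 = {("x0", "a"): [("x1", "A", 0.1)]}
t2 = {("y0", "A"): [("y1", "<A>", 0.2)]}
print(compose(t1, {"x1"}, "x0", t2, {"y1"}, "y0"))
# one composed arc: (('x0','y0'), 'a') -> (('x1','y1'), '<A>', ~0.3)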
22. FST Optimization Example
- Letter-to-word lexicon (shown as a figure)
23. FST Optimization Example: Determinization
- Determinization turns the lexicon into a tree
- Words share a common prefix
24. FST Optimization Example: Minimization
- Minimization enables sharing at the ends of words (see the sketch below)
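The effect of these two optimizations can be illustrated with a toy Python sketch over plain dictionaries: building a prefix tree corresponds to determinization (shared prefixes), and hash-consing identical subtrees corresponds to minimization (shared suffixes). The word list is invented for illustration:

def add_word(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node["#"] = {}  # end-of-word marker

def count_states(node):
    return 1 + sum(count_states(child) for child in node.values())

def minimize(node, registry):
    # Identical suffix subtrees get a single shared id (hash-consing),
    # mimicking FST minimization.
    key = tuple(sorted((ch, minimize(child, registry))
                       for ch, child in node.items()))
    return registry.setdefault(key, len(registry))

trie = {}
for w in ["batter", "better", "bitter"]:
    add_word(trie, w)

print("determinized (tree) states:", count_states(trie))  # 20
registry = {}
minimize(trie, registry)
print("minimized states:", len(registry))  # 8: the shared "-tter" suffix collapses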
25. A Cascaded FST Recognizer (shown as a figure)
26. A Cascaded FST Recognizer (continued; shown as a figure)
27. Search
- Once again, the probabilistic expression is:
  W* ≈ argmax_{W,U} P(A | U) P(U | W) P(W)
- Pronunciation and language models are encoded in an FST
- Search must efficiently find the most likely U and W
28. Viterbi Search
- Viterbi search is a time-synchronous, breadth-first search
29. Viterbi Search Pruning
- Search efficiency can be improved with pruning:
  - Score-based: don't extend low-scoring hypotheses
  - Count-based: extend only a fixed number of hypotheses (see the sketch after this list)
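A minimal Python sketch of time-synchronous Viterbi search with count-based pruning. The states, transition scores, and observation scores are toy values; a real recognizer scores frames of acoustic features against phonetic models:

import math

def viterbi_beam(obs_scores, trans, start, beam_count=3):
    # obs_scores: per frame, {state: log P(observation | state)}
    # trans: {(prev_state, state): log transition probability}
    frontier = {start: 0.0}  # best log-score per active hypothesis
    for frame in obs_scores:
        scores = {}
        for prev, s_prev in frontier.items():
            for (p, nxt), s_tr in trans.items():
                if p == prev and nxt in frame:
                    cand = s_prev + s_tr + frame[nxt]
                    if cand > scores.get(nxt, -math.inf):
                        scores[nxt] = cand
        # count-based pruning: keep only the beam_count best hypotheses
        frontier = dict(sorted(scores.items(), key=lambda kv: -kv[1])[:beam_count])
    return frontier

trans = {("<s>", "m"): 0.0,
         ("m", "m"): math.log(0.5), ("m", "aa"): math.log(0.5)}
obs = [{"m": -1.0}, {"m": -1.2, "aa": -0.8}]
print(viterbi_beam(obs, trans, "<s>"))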
30. Search Pruning Example
- Count-based pruning can effectively reduce search
- Example: fix the beam size (count) and vary the beam width (score)
31. N-best Computation with Backwards A* Search
- Backwards A* search can be used to find the N-best paths
- The Viterbi backtrace is used as the future estimate for path scores (see the sketch below)
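A minimal sketch on a toy word graph: a forward Viterbi pass computes the best score from the start to every node, and the backwards A* pass then uses those scores as exact future estimates, so complete paths pop off the priority queue in best-first order. The graph, words, and probabilities are invented for illustration, and arcs are assumed to be listed in topological order:

import heapq, math

def nbest(arcs, start, final, n=3):
    # arcs: (src, dst, word, log_prob); scores are log-probs (higher is better).
    # Forward Viterbi: best log-score from start to each node.
    best = {start: 0.0}
    for src, dst, _, w in arcs:  # arcs assumed in topological order
        if src in best:
            best[dst] = max(best.get(dst, -math.inf), best[src] + w)
    in_arcs = {}
    for a in arcs:
        in_arcs.setdefault(a[1], []).append(a)
    # Backwards A*: estimate = backward score so far + Viterbi future estimate.
    heap, results = [(-best[final], 0.0, final, [])], []
    while heap and len(results) < n:
        _, back, node, words = heapq.heappop(heap)
        if node == start:
            results.append((back, list(reversed(words))))
            continue
        for src, _, word, w in in_arcs.get(node, []):
            if src in best:
                heapq.heappush(heap,
                               (-(best[src] + w + back), back + w, src, words + [word]))
    return results

arcs = [("0", "1", "arriving", math.log(0.7)),
        ("0", "1", "deriving", math.log(0.3)),
        ("1", "2", "in", math.log(0.9)),
        ("1", "2", "at", math.log(0.1))]
for score, words in nbest(arcs, "0", "2", n=3):
    print(round(math.exp(score), 3), " ".join(words))
# 0.63 arriving in / 0.27 deriving in / 0.07 arriving at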
32. Street Address Recognition
- Street address recognition is difficult:
  - 6.2M unique (street, city, state) pairs in the US (283K unique words)
  - High confusion rate among similar street names
  - Very large search space for recognition
- Commercial solution → directed dialogue:
  - Breaks the problem into a set of smaller recognition tasks
  - Simple for first-time users, but tedious with repeated use

  C: Main menu. Please say one of the following: directions, restaurants, gas stations, or more options.
  H: Directions.
  C: Okay. Directions. What state are you going to?
  H: Massachusetts.
  C: Okay. Massachusetts. What city are you going to?
  H: Cambridge.
  C: Okay. Cambridge. What is the street address?
  H: 32 Vassar Street.
  C: Okay. 32 Vassar Street in Cambridge, Massachusetts.
  C: From your current location, continue straight on
33. Street Address Recognition
- Research goal → mixed-initiative dialogue:
  - More difficult to predict what users will say
  - Far more natural for repeat or expert users

  C: How can I help you?
  H: I need directions to 32 Vassar Street in Cambridge, Mass.
34. Dynamic Vocabulary Recognition