1
Speech Recognition (Part 2)
  • T. J. Hazen
  • MIT Computer Science and Artificial Intelligence
    Laboratory

2
Lecture Overview
  • Probabilistic framework
  • Pronunciation modeling
  • Language modeling
  • Finite state transducers
  • Search
  • System demonstrations (time permitting)

3
Probabilistic Framework
  • Speech recognition is typically performed using a
    probabilistic modeling approach
  • Goal is to find the most likely string of words,
    W, given the acoustic observations, A
  • The expression is rewritten using Bayes Rule
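The slide's equation is not reproduced in this transcript; the standard formulation the bullets describe is

\[ W^\ast = \arg\max_W P(W \mid A) = \arg\max_W \frac{P(A \mid W)\, P(W)}{P(A)} \]

where the denominator P(A) does not depend on W and can be dropped from the maximization.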

4
Probabilistic Framework
  • Words are represented as sequences of phonetic
    units
  • Using phonetic units, U, the expression expands to
    the form shown below
  • Pronunciation and language models provide
    constraint
  • Pronunciation and language models encoded in
    network
  • Search must efficiently find most likely U and W
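The expanded expression is likewise not reproduced in the transcript; with the phonetic units U treated as a hidden layer between the acoustics and the words, the standard expansion is

\[ W^\ast = \arg\max_W \sum_U P(A \mid U)\, P(U \mid W)\, P(W) \approx \arg\max_{W,\,U} P(A \mid U)\, P(U \mid W)\, P(W) \]

where P(A | U) is the acoustic model, P(U | W) the pronunciation model, and P(W) the language model.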

5
Phonemes
  • Phonemes are the basic linguistic units used to
    construct morphemes, words and sentences.
  • Phonemes represent unique canonical acoustic
    sounds
  • When constructing words, changing a single
    phoneme changes the word.
  • Example phonemic mappings
  • pin → /p ih n/
  • thought → /th ao t/
  • saves → /s ey v z/
  • English spelling is not (exactly) phonemic
  • Pronunciation cannot always be determined from
    spelling
  • Homophones have same phonemes but different
    spellings
  • Two vs. to vs. too, bear vs. bare, queue vs.
    cue, etc.
  • Same spelling can have different pronunciations
  • read, record, either, etc.

6
Phonemic Units and Classes
Vowels: aa (pot), ae (bat), ah (but), ao (bought), aw (bout), ax (about), ay (buy), eh (bet), er (bert), ey (bait), ih (bit), iy (beat), ow (boat), oy (boy), uh (book), uw (boot)
Semivowels: l (light), r (right), w (wet), y (yet)
Nasals: m (might), n (night), ng (sing)
Stops: p (pay), b (bay), t (tea), d (day), k (key), g (go)
Fricatives: s (sue), f (fee), z (zoo), v (vee), sh (shoe), th (thesis), zh (azure), dh (that), hh (hat)
Affricates: ch (chew), jh (Joe)
7
Phones
  • Phones (or phonetic units) are used to represent
    the actual acoustic realization of phonemes.
  • Examples
  • Stops contain a closure and a release
  • /t/ → tcl t
  • /k/ → kcl k
  • The /t/ and /d/ phonemes can be flapped
  • utter → /ah t er/ → ah dx er
  • udder → /ah d er/ → ah dx er
  • Vowels can be fronted
  • Tuesday → /t uw z d ey/ → tcl t ux z d ey

8
Enhanced Phoneme Labels
Stops: p (pay), b (bay), t (tea), d (day), k (key), g (go)
Stops w/ optional release: pd (tap), bd (tab), td (pat), dd (bad), kd (pack), gd (dog)
Unaspirated stops: p- (speed), t- (steep), k- (ski)
Stops w/ optional flap: tf (batter), df (badder)
Special sequences: nt (interview), tq en (Clinton)
Retroflexed stops: tr (tree), dr (drop)
9
Example Phonemic Baseform File
<hangup>      _h1
<noise>       _n1
<uh>          ah_fp
<um>          ah_fp m
adder         ae df er
atlanta       ( ae | ax ) td l ae nt ax
either        ( iy | ay ) th er
laptop        l ae pd t aa pd
northwest     n ao r th w eh s td
speech        s p- iy ch
temperature   t eh m p ( r ? ax | er | ax ? ) ch er
trenton       tr r eh n tq en
10
Applying Phonological Rules
  • Multiple phonetic realizations of phonemes can be
    generated by applying phonological rules.
  • Example

butter → b ah tf er
This can be realized phonetically as bcl b ah tcl t er
or as bcl b ah dx er
  • Phonological rewrite rules can be used to generate
    the compact alternation

butter → bcl b ah ( tcl t | dx ) er
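As an illustration of how such a parenthesized alternation can be expanded into individual phone strings, here is a minimal Python sketch; the function name, the ( a | b ) notation handling, and the example call are assumptions for illustration, not code from the lecture.

import itertools
import re

def expand_baseform(baseform):
    """Expand a baseform containing ( a | b ) alternations into all
    possible phone strings, e.g. "bcl b ah ( tcl t | dx ) er" ->
    ["bcl b ah tcl t er", "bcl b ah dx er"]."""
    # Split into literal segments and parenthesized alternation groups.
    tokens = re.split(r'\(([^)]*)\)', baseform)
    choices = []
    for i, tok in enumerate(tokens):
        tok = tok.strip()
        if not tok:
            continue
        if i % 2 == 1:   # inside ( ... ): alternatives are separated by |
            choices.append([alt.strip() for alt in tok.split('|')])
        else:            # literal phone sequence
            choices.append([tok])
    return [' '.join(parts) for parts in itertools.product(*choices)]

print(expand_baseform("bcl b ah ( tcl t | dx ) er"))
# ['bcl b ah tcl t er', 'bcl b ah dx er']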
11
Example Phonological Rules
  • Example rule for /t/ deletion (destination)

s t ax ix > tcl t
  • Example rule for palatalization of /s/ (miss you)

s y > s sh
12
Contractions and Reductions
  • Examples of contractions
  • what's → what is
  • isn't → is not
  • won't → will not
  • i'd → i would / i had
  • today's → today is / today's
  • Examples of multi-word reductions
  • gimme → give me
  • gonna → going to
  • ave → avenue
  • 'bout → about
  • d'yave → do you have
  • Contracted and reduced forms are entered in the
    lexical dictionary

13
Language Modeling
  • A language model constrains hypothesized word
    sequences
  • A finite state grammar (FSG) example (a small
    illustrative grammar is sketched below)
  • Probabilities can be added to arcs for additional
    constraint
  • FSGs work well when users stay within grammar
  • but FSGs can't cover everything that might be
    spoken
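A minimal sketch of what such a grammar can look like as code; the slide's actual grammar figure is not reproduced here, so the states, vocabulary, and accepts() helper below are illustrative assumptions.

# Toy finite state grammar written as arcs: (state, word) -> next state.
# The grammar and vocabulary are illustrative, not the one on the slide.
ARCS = {
    (0, "show"): 1,
    (1, "flights"): 2,
    (2, "to"): 3,
    (3, "boston"): 4,
    (3, "denver"): 4,
}
FINAL_STATES = {4}

def accepts(words, arcs=ARCS, finals=FINAL_STATES, start=0):
    """Return True if the word sequence is licensed by the grammar."""
    state = start
    for w in words:
        if (state, w) not in arcs:
            return False              # out-of-grammar word sequence
        state = arcs[(state, w)]
    return state in finals

print(accepts("show flights to boston".split()))  # True
print(accepts("show me flights".split()))         # False: not covered by the FSG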

14
N-gram Language Modeling
  • An n-gram model is a statistical language model
  • Predicts current word based on previous n-1 words
  • Trigram model expression
  • Examples
  • An n-gram model allows any sequence of words
  • but prefers sequences common in training data.

P( wn | wn-2 , wn-1 )
P( boston | arriving in )
P( seventeenth | tuesday march )
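A minimal Python sketch of a maximum-likelihood trigram model matching the expression above; the function names and toy training sentences are illustrative assumptions.

from collections import Counter

def train_trigram(sentences):
    """Collect maximum-likelihood trigram and bigram history counts."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i - 2], words[i - 1], words[i])] += 1
            bi[(words[i - 2], words[i - 1])] += 1
    return tri, bi

def p_trigram(w, hist, tri, bi):
    """P( w | hist ), with hist = (w_{n-2}, w_{n-1}); 0.0 for unseen histories."""
    return tri[hist + (w,)] / bi[hist] if bi[hist] else 0.0

tri, bi = train_trigram(["flights arriving in boston",
                         "flights arriving in denver"])
print(p_trigram("boston", ("arriving", "in"), tri, bi))  # 0.5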
15
N-gram Model Smoothing
  • For a bigram model, what if a word pair never
    occurs in the training data?
  • To avoid sparse training data problems, we can
    use an interpolated bigram
  • One method for determining interpolation weight
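The equations on this slide are not reproduced in the transcript; a commonly used interpolated bigram has the form

\[ P_I(w_n \mid w_{n-1}) = \lambda\, P_{ML}(w_n \mid w_{n-1}) + (1 - \lambda)\, P_{ML}(w_n) \]

and one common way to set the interpolation weight (an assumption here, since the slide's specific method is not shown) is to let it grow with the history count, e.g. \( \lambda = \frac{c(w_{n-1})}{c(w_{n-1}) + K} \) for some tuning constant K.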

16
Class N-gram Language Modeling
  • Class n-gram models can also help sparse data
    problems
  • Class trigram expression
  • Example

P( class(wn) | class(wn-2) , class(wn-1) ) · P( wn | class(wn) )
P( seventeenth | tuesday march )
17
Multi-Word N-gram Units
  • Common multi-word units can be treated as a
    single unit within an N-gram language model
  • Common uses of compound units
  • Common multi-word phrases
  • thank_you , good_bye , excuse_me
  • Multi-word sequences that act as a single
    semantic unit
  • new_york , labor_day , wind_speed
  • Letter sequences or initials
  • j_f_k , t_w_a , washington_d_c

18
Finite-State Transducer (FST) Motivation
  • Most speech recognition constraints and results
    can be represented as finite-state automata
  • Language models (e.g., n-grams and word networks)
  • Lexicons
  • Phonological rules
  • N-best lists
  • Word graphs
  • Recognition paths
  • Common representation and algorithms desirable
  • Consistency
  • Powerful algorithms can be employed throughout
    system
  • Flexibility to combine or factor in unforeseen
    ways

19
What is an FST?
  • One initial state
  • One or more final states
  • Transitions between states: input : output / weight
  • input requires an input symbol to match
  • output indicates the symbol to output when the
    transition is taken
  • epsilon (ε) consumes no input or produces no
    output
  • weight is the cost (e.g., -log probability) of
    taking transition
  • An FST defines a weighted relationship between
    regular languages
  • A generalization of the classic finite-state
    acceptor (FSA)
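A minimal Python sketch of the arc structure described above (input : output / weight transitions, one start state, a set of final states); the class and field names are illustrative and not tied to any particular FST toolkit.

from dataclasses import dataclass, field

EPS = None  # epsilon: consumes no input / produces no output

@dataclass
class Arc:
    ilabel: object    # input symbol (EPS allowed)
    olabel: object    # output symbol (EPS allowed)
    weight: float     # cost, e.g. -log probability
    nextstate: int

@dataclass
class FST:
    start: int = 0
    finals: set = field(default_factory=set)
    arcs: dict = field(default_factory=dict)  # state -> list of outgoing arcs

    def add_arc(self, state, ilabel, olabel, weight, nextstate):
        self.arcs.setdefault(state, []).append(
            Arc(ilabel, olabel, weight, nextstate))

# A two-arc transducer mapping the phoneme string "t uw" to the word "two":
lex = FST(start=0, finals={2})
lex.add_arc(0, "t", "two", 0.0, 1)  # emit the word on the first phoneme
lex.add_arc(1, "uw", EPS, 0.0, 2)   # remaining phonemes output epsilon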

20
FST Example Lexicon
  • Lexicon maps /phonemes/ to words
  • Words can share parts of pronunciations
  • Sharing at the beginning is beneficial to recognition
    speed because pruning can discard many words at once

21
FST Composition
  • Composition (∘) combines two FSTs to produce a
    single FST that performs both mappings in a single
    step
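A sketch of composition for the FST structure shown after slide 19, restricted to epsilon-free machines for simplicity; real toolkits also handle epsilon transitions and weight semirings that this sketch omits.

def compose(A, B):
    """Compose two FSTs: the output labels of A must match the input labels
    of B, and the result maps A's inputs directly to B's outputs."""
    C = FST(start=0)
    state_id = {(A.start, B.start): 0}
    stack = [(A.start, B.start)]
    while stack:
        qa, qb = stack.pop()
        q = state_id[(qa, qb)]
        if qa in A.finals and qb in B.finals:
            C.finals.add(q)
        for a in A.arcs.get(qa, []):
            for b in B.arcs.get(qb, []):
                if a.olabel != b.ilabel:
                    continue                      # labels must match to pair up
                nxt = (a.nextstate, b.nextstate)
                if nxt not in state_id:
                    state_id[nxt] = len(state_id)
                    stack.append(nxt)
                C.add_arc(q, a.ilabel, b.olabel,
                          a.weight + b.weight, state_id[nxt])
    return C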

22
FST Optimization Example
letter to word lexicon
23
FST Optimization Example Determinization
  • Determinization turns the lexicon into a tree
  • Words share common prefixes

24
FST Optimization Example Minimization
  • Minimization enables sharing at the ends

25
A Cascaded FST Recognizer
26
A Cascaded FST Recognizer
27
Search
  • Once again, the probabilistic expression is the
    expanded form shown earlier
  • Search must efficiently find most likely U and W

28
Viterbi Search
  • Viterbi search is a time-synchronous breadth-first
    search
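A minimal time-synchronous Viterbi sketch; the data layout (per-frame acoustic log scores and a state transition table) is an illustrative assumption, not the lecture's implementation.

import math

def viterbi(obs_scores, transitions, start_state=0):
    """Time-synchronous Viterbi search.

    obs_scores : list over time frames of {state: log P(observation | state)}
    transitions: {state: [(next_state, log transition probability), ...]}
    Returns the best log score and the best state sequence (one state per frame).
    """
    scores = {start_state: 0.0}   # best log score of any path ending in a state
    backptrs = []                 # one backpointer dict per time frame
    for frame in obs_scores:
        new_scores, bp = {}, {}
        for state, score in scores.items():
            for nxt, logp in transitions.get(state, []):
                if nxt not in frame:
                    continue
                cand = score + logp + frame[nxt]
                if cand > new_scores.get(nxt, -math.inf):
                    new_scores[nxt] = cand
                    bp[nxt] = state
        scores = new_scores
        backptrs.append(bp)
    # Backtrace the single best path from the best-scoring final state.
    best = max(scores, key=scores.get)
    path = [best]
    for bp in reversed(backptrs[1:]):
        path.append(bp[path[-1]])
    return scores[best], list(reversed(path))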

29
Viterbi Search Pruning
  • Search efficiency can be improved with pruning
  • Score-based: Don't extend low-scoring hypotheses
  • Count-based: Extend only a fixed number of
    hypotheses
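Both strategies can be applied to the per-frame hypothesis scores (for example, the new_scores dictionary in the Viterbi sketch above); the beam width and count values here are illustrative tuning parameters.

def prune(hypotheses, beam_width=10.0, max_count=1000):
    """Keep hypotheses within beam_width of the best log score (score-based)
    and at most max_count of them (count-based)."""
    if not hypotheses:
        return hypotheses
    best = max(hypotheses.values())
    kept = {s: sc for s, sc in hypotheses.items() if sc >= best - beam_width}
    # Count-based pruning: keep only the top-scoring max_count hypotheses.
    top = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)[:max_count]
    return dict(top)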

30
Search Pruning Example
  • Count-based pruning can effectively reduce search
  • Example: Fix the beam size (count) and vary the beam
    width (score)

31
N-best Computation with Backwards A* Search
  • Backwards A* search can be used to find the N-best
    paths
  • The Viterbi backtrace is used as the future estimate
    for path scores
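In A* terms each partial path p, extended backwards from the end of the utterance, is ranked by

\[ f(p) = g(p) + h(p) \]

where g(p) is the exact score of the words already hypothesized and h(p) is the forward Viterbi score at the point where p currently ends, serving as the estimate of the best possible way to complete the path back to the start.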

32
Street Address Recognition
  • Street address recognition is difficult
  • 6.2M unique street, city, state pairs in US (283K
    unique words)
  • High confusion rate among similar street names
  • Very large search space for recognition
  • Commercial solution → directed dialogue
  • Breaks the problem into a set of smaller recognition
    tasks
  • Simple for first-time users, but tedious with
    repeated use

C: Main menu. Please say one of the following: directions, restaurants, gas stations, or more options.
H: Directions.
C: Okay. Directions. What state are you going to?
H: Massachusetts.
C: Okay. Massachusetts. What city are you going to?
H: Cambridge.
C: Okay. Cambridge. What is the street address?
H: 32 Vassar Street.
C: Okay. 32 Vassar Street in Cambridge, Massachusetts.
C: From your current location, continue straight on
33
Street Address Recognition
  • Research goal → mixed-initiative dialogue
  • More difficult to predict what users will say
  • Far more natural for repeat or expert users

C: How can I help you?
H: I need directions to 32 Vassar Street in Cambridge, Mass.
34
Dynamic Vocabulary Recognition