Title: Automatic Speech Recognition
Automatic Speech Recognition
- Opportunity to participate in a new user study for Newsblaster and get $25-30 for 2.5-3 hours of your time, respectively
- http://www1.cs.columbia.edu/delson/study.html
- More opportunities will be coming
What is speech recognition?
- Transcribing words?
- Understanding meaning?
- Today
- Overview of ASR issues
- Building an ASR system
- Using an ASR system
- Future research
It's hard to ... recognize speech / wreck a nice beach
- Speaker variability, within and across speakers
- Recording environments vary with respect to noise
- The transcription task must handle all of this and produce a transcript of what was said, from limited, noisy information in the speech signal
- Success: a low word error rate (WER)
- WER = (S + I + D) / N * 100, for S substitutions, I insertions, and D deletions against an N-word reference (see the sketch below)
- "Thesis test" vs. "This is a test.": 75% WER
- The understanding task must do more: go from words to meaning
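To make the WER formula concrete, here is a minimal Python sketch (the function name and structure are illustrative, not from the lecture) computing WER as word-level edit distance against the reference:

```python
# Minimal WER sketch: count substitutions, insertions, and deletions
# via Levenshtein alignment over words, then divide by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref) * 100

print(wer("this is a test", "thesis test"))  # 75.0, the slide's example
```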
- Measure the concept accuracy (CA) of a string in terms of the accuracy of recognition of the domain concepts mentioned in the string and their values (a scoring sketch follows this slide)
- "I want to go from Boston to Baltimore on September 29"
- Domain concepts and values:
- source city: Boston
- target city: Baltimore
- travel date: September 29
- Score the recognized string "Go from Boston to Washington on December 29": 1/3 correct = 33% CA
- "Go to Boston from Baltimore on September 29": again 1/3 = 33% CA, since only the travel date is recognized correctly
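A minimal sketch of CA scoring, assuming the concept-value pairs have already been extracted from the recognized string (that extraction step is the hard part and is not shown):

```python
# Score concept accuracy: fraction of reference concepts whose values
# were recognized correctly. Dicts stand in for extracted concepts.
def concept_accuracy(reference: dict, recognized: dict) -> float:
    correct = sum(1 for concept, value in reference.items()
                  if recognized.get(concept) == value)
    return correct / len(reference)

reference  = {"source city": "Boston",
              "target city": "Baltimore",
              "travel date": "September 29"}
recognized = {"source city": "Boston",       # correct
              "target city": "Washington",   # wrong
              "travel date": "December 29"}  # wrong
print(concept_accuracy(reference, recognized))  # 0.33..., i.e., 1/3 CA
```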
Again, the Noisy Channel Model
- Input to channel: a spoken sentence s
- Output from channel: an observation O
- Decoding task: find s' = argmax_s P(s|O)
- Using Bayes' Rule: P(s|O) = P(O|s) P(s) / P(O)
- And since P(O) doesn't change for any hypothetical s: s' = argmax_s P(O|s) P(s)
- P(O|s) is the observation likelihood, or Acoustic Model, and P(s) is the prior, or Language Model (a decoding sketch follows below)
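A toy sketch of the resulting decoding rule: score each sentence hypothesis by log P(O|s) + log P(s) and take the max. The hypotheses and scores below are made up; in a real recognizer they come from the acoustic and language models:

```python
# Noisy-channel decoding in log space: argmax_s [log P(O|s) + log P(s)].
# Log probabilities avoid numerical underflow from long products.
def decode(hypotheses, acoustic_logprob, lm_logprob):
    return max(hypotheses,
               key=lambda s: acoustic_logprob(s) + lm_logprob(s))

# Made-up log probabilities for two competing hypotheses
acoustic = {"this is a test": -10.2, "thesis test": -9.8}
lm       = {"this is a test": -4.1,  "thesis test": -8.7}

best = decode(acoustic, acoustic.get, lm.get)
print(best)  # "this is a test": the LM prior outweighs the acoustic deficit
```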
What do we need to build/use an ASR system?
- Corpora for training and testing of components
- Feature extraction component
- Pronunciation Model
- Acoustic Model
- Language Model
- Algorithms to search hypothesis space efficiently
Training and Test Corpora
- Collect corpora appropriate for the recognition task at hand
- Small speech corpus with phonetic transcription, to associate sounds with symbols (Acoustic Model)
- Large (> 60 hrs) speech corpus with orthographic transcription, to associate words with sounds (Acoustic Model)
- Very large text corpus to identify unigram and bigram probabilities (Language Model)
Representing the Signal
- What parameters (features) of the speech input:
- Can be extracted automatically
- Will preserve phonetic identity and distinguish one phone from another
- Will be independent of speaker variability and channel conditions
- Will not take up too much space
- Speech representations (for [ae] in "had"):
- Waveform: change in sound pressure over time
- LPC spectrum: component frequencies of a waveform
- Spectrogram: overall view of how frequencies change from phone to phone
- Speech captured by microphone and sampled (digitized) -- may not capture all vital information
- Signal divided into frames
- Power spectrum computed to represent energy in different bands of the signal: LPC spectrum, cepstra, PLP
- Each frame's spectral features represented by a small set of numbers
- Frames clustered into phone-like groups (phones in context) -- Gaussian or other models (a front-end sketch follows below)
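A rough sketch of this front end in Python with NumPy; the frame sizes, function names, and random "signal" are illustrative, and real systems continue on to cepstral features such as MFCCs:

```python
import numpy as np

# Slice a sampled signal into overlapping frames, then compute a
# log power spectrum per frame.
def frame_signal(signal, sample_rate, frame_ms=25, step_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // step
    return np.stack([signal[i * step : i * step + frame_len]
                     for i in range(n_frames)])

def log_power_spectrum(frames):
    windowed = frames * np.hamming(frames.shape[1])  # taper frame edges
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2    # energy per frequency band
    return np.log(spectrum + 1e-10)                  # avoid log(0)

# One second of a fake 16 kHz signal -> ~98 frames of spectral features
signal = np.random.randn(16000)
feats = log_power_spectrum(frame_signal(signal, 16000))
print(feats.shape)  # (98, 201)
```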
- Why does this work?
- Different phonemes have different spectral characteristics
- Why doesn't it always work?
- Phonemes can have different properties in different acoustic contexts, spoken by different people
- Nice white rice
Pronunciation Model
- Models the likelihood of a word given a network of candidate phone hypotheses (a weighted phone lattice)
- Allophones: "butter" vs. "but"
- Multiple pronunciations for each word
- Lexicon may be a weighted automaton or a simple dictionary (a dictionary sketch follows below)
- Words come from all corpora; pronunciations come from a pronouncing dictionary or TTS system
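A minimal sketch of a weighted pronunciation dictionary; the words, ARPAbet-style phone symbols, and probabilities are invented for illustration (a full system might use a weighted automaton instead):

```python
# Each word maps to one or more phone strings with probabilities.
LEXICON = {
    "tomato": [(["t", "ah", "m", "ey", "t", "ow"], 0.7),
               (["t", "ah", "m", "aa", "t", "ow"], 0.3)],
    "the":    [(["dh", "ah"], 0.8),
               (["dh", "iy"], 0.2)],
}

def pronunciation_prob(word, phones):
    """P(phones | word) under the lexicon, 0.0 if unlisted."""
    for pron, prob in LEXICON.get(word, []):
        if pron == phones:
            return prob
    return 0.0

print(pronunciation_prob("the", ["dh", "iy"]))  # 0.2
```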
Acoustic Models
- Model the likelihood of phones or subphones given spectral features and prior context
- Use pronunciation models
- Usually represented as an HMM (sketched after this slide):
- A set of states representing phones or other subword units
- Transition probabilities on states: how likely is it to see one phone after seeing another?
- Observation/output likelihoods: how likely is a spectral feature vector to be observed from phone state i, given phone state i-1?
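A toy discrete HMM in this spirit: three phone states for "but" and four made-up feature clusters. Real acoustic models use subphone states and Gaussian mixture densities over continuous feature vectors:

```python
import numpy as np

states = ["b", "ah", "t"]                   # phone states for "but"
trans = np.array([[0.6, 0.4, 0.0],          # P(next state | current state)
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
# P(observed feature cluster | state), over 4 discrete clusters
emit = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.6, 0.2, 0.1],
                 [0.1, 0.1, 0.2, 0.6]])

# Likelihood of one aligned path: products of transition and
# emission probabilities along the state sequence.
obs = [0, 0, 1, 1, 3]
path = [0, 0, 1, 1, 2]
p = emit[path[0], obs[0]]
for t in range(1, len(obs)):
    p *= trans[path[t - 1], path[t]] * emit[path[t], obs[t]]
print(p)
```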
- Initial estimates for:
- Transition probabilities between phone states
- Observation probabilities associating phone states with acoustic examples
- Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment)
- I.e., we tell the HMM the right answers -- which words to associate with which sequences of sounds
- Iteratively retrain the transition and observation probabilities by running the training data through the model and scoring the output, until there is no improvement (a count-based re-estimation sketch follows below)
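A count-based sketch of one re-estimation step from forced alignments. For simplicity this uses hard, Viterbi-style counts; full Baum-Welch re-estimation uses fractional expected counts instead:

```python
from collections import Counter

# Given forced alignments (state sequences paired with observation
# sequences), recompute transition and emission probabilities from counts.
def reestimate(alignments):
    trans, emit = Counter(), Counter()
    state_total, emit_total = Counter(), Counter()
    for states, observations in alignments:
        for prev, nxt in zip(states, states[1:]):
            trans[(prev, nxt)] += 1
            state_total[prev] += 1
        for state, obs in zip(states, observations):
            emit[(state, obs)] += 1
            emit_total[state] += 1
    trans_p = {k: v / state_total[k[0]] for k, v in trans.items()}
    emit_p = {k: v / emit_total[k[0]] for k, v in emit.items()}
    return trans_p, emit_p

# Toy "aligned corpus": phone states paired with discrete feature labels
alignments = [(["b", "ah", "ah", "t"], ["c1", "c2", "c2", "c3"]),
              (["b", "ah", "t"], ["c1", "c2", "c3"])]
print(reestimate(alignments)[0])  # e.g. P(ah | b) = 1.0
```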
Language Model
- Models the likelihood of a word given the prior words, and of the entire sentence
- N-gram models
- Build the LM by calculating bigram or trigram probabilities from a text training corpus (see the sketch after this slide)
- Smoothing issues are very important for real systems
- Grammars
- Finite state grammar, Context Free Grammar (CFG), or semantic grammar
- Out of Vocabulary (OOV) problem
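A tiny bigram LM sketch with add-one (Laplace) smoothing, one of the simplest smoothing schemes; the two-sentence "corpus" is invented:

```python
from collections import Counter

corpus = ["<s> i want to go to boston </s>",
          "<s> i want to leave today </s>"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
vocab = len(unigrams)

def bigram_prob(w1, w2):
    # Add-one smoothing: (count(w1 w2) + 1) / (count(w1) + V),
    # so unseen bigrams keep nonzero probability mass.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

print(bigram_prob("want", "to"))         # seen bigram: relatively high
print(bigram_prob("want", "baltimore"))  # unseen, but nonzero
```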
- Entropy H(X): the amount of information in a LM or grammar
- How many bits will it take, on average, to encode a choice or a piece of information?
- More likely things take fewer bits to encode
- Perplexity, 2^H: a measure of the weighted mean number of choice points in, e.g., a language model (computed in the sketch below)
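A small sketch relating the two quantities: per-word entropy H of a test sequence under a fixed, made-up word distribution, and perplexity 2^H:

```python
import math

probs = {"the": 0.5, "cat": 0.25, "sat": 0.25}

def perplexity(words):
    # H = -(1/N) * sum log2 P(w); perplexity = 2^H
    h = -sum(math.log2(probs[w]) for w in words) / len(words)
    return 2 ** h

print(perplexity(["the", "cat", "sat"]))  # ~3.17 choice points per word
```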
Search/Decoding
- Find the best hypothesis, argmax_s P(O|s) P(s), given:
- A lattice of subword units (Acoustic Model)
- Segmentation of all paths into possible words (Pronunciation Model)
- Probabilities of word sequences (Language Model)
- This produces a huge search space. How to reduce it?
- Lattice minimization and determinization
- Forward algorithm: the sum of all paths leading to a state
- Viterbi algorithm: the max of all paths leading to a state (both sketched below)
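A compact sketch contrasting the two on the toy HMM from earlier (the init/trans/emit arrays reuse the same illustrative numbers): forward sums over all paths, Viterbi keeps only the best:

```python
import numpy as np

def forward_and_viterbi(init, trans, emit, obs):
    fwd = init * emit[:, obs[0]]       # P(state, first observation)
    vit = fwd.copy()
    for o in obs[1:]:
        fwd = (fwd @ trans) * emit[:, o]                        # sum over predecessors
        vit = (vit[:, None] * trans).max(axis=0) * emit[:, o]   # best predecessor
    return fwd.sum(), vit.max()

init = np.array([1.0, 0.0, 0.0])
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
emit = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.6, 0.2, 0.1],
                 [0.1, 0.1, 0.2, 0.6]])
total_p, best_path_p = forward_and_viterbi(init, trans, emit, [0, 1, 3])
print(total_p, best_path_p)  # total probability >= best single-path probability
```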
- Forward-backward (Baum-Welch, Expectation-Maximization) algorithm: computes the probability of a sequence at any state in the search space
- Beam search: prune the lattice (see the sketch below)
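A minimal beam-pruning sketch: at each time step, keep only partial hypotheses whose log score lies within a fixed beam width of the current best, trading a small risk of dropping the true best path for a much smaller search space. The hypotheses and beam width are invented:

```python
def prune(hypotheses, beam=4.0):
    """hypotheses: list of (log_prob, state) partial paths."""
    best = max(lp for lp, _ in hypotheses)
    return [(lp, s) for lp, s in hypotheses if lp >= best - beam]

hyps = [(-2.0, "b"), (-3.5, "ah"), (-9.0, "t")]
print(prune(hyps))  # (-9.0, "t") falls outside the beam and is dropped
```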
Varieties of Speech Recognition
Challenges for Transcription
- Robustness to channel characteristics and noise
- Portability to new applications
- Adaptation to speakers, to environments
- LMs: simple n-grams need help
- Confidence measures
- OOV words
- New speaking styles/genres
- New applications
Challenges for Understanding
- Recognizing communicative problems
- ASR errors
- User corrections
- Disfluencies and self-repairs
- Possibilities
- Recognizing speaker emotion
- Identifying speech acts (e.g., "okay")
- Locating topic boundaries for topic tracking,
audio browsing, speech data mining
An Unsuccessful Dialogue
- S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
- U: I'd like to find a train from Ch- to Chicago from Baltimore.
- S: Sorry, I can't understand you. Please repeat your utterance.
- U: I'd like to find a train from Baltimore to Chicago.
- S: I heard you say "leave today around 5 thirty in the morning." Which city do you want to go to?
- U: I would like to leave Saturday at eight in the morning.
- S: You can say the arrival city name, such as "New York City."
- U: Baltimore to Chicago.
- S: You can say the arrival city name, such as "New York City."
- U: Baltimore to Chicago. (I hate it.)
- U: Train from Baltimore ...
- S: ... I heard you s...
- U: to Chicago.... Hello?
- S: You can say the name of your departure city, such as "New York City."
- U: Baltimore.
- U: Bal-ti-more....
- S: You can say...
- U: Baltimore.
- S: Do you want to continue and find a new set of trains?
- U: Yes.
- S: You can say "yes" or "I am done here" to find a new set of trains.
- U: Yes.
- S: Thank you for using AT&T Amtrak train time table system. See you next time.
- U: I nev-
Summary
- ASR technology relies upon a large number of phenomena and techniques we've already seen to convert sound into words
- Phonetic/phonological, morphological, and lexical events
- FSAs, N-grams, dynamic programming algorithms
- Better modeling of linguistic phenomena will be needed to improve performance on transcription, and especially on understanding
- For next class we'll start talking about larger structures in language, above the word (Ch 8)
Disfluencies and Self-Repairs
- Disfluencies abound in spontaneous speech
- every 4.6s in radio call-in (Blackmer & Mitton '91)
- hesitation: "Ch- change strategy."
- filled pause: "Um Baltimore."
- self-repair: "Ba- uh Chicago."
- These are hard to recognize:
- "Ch- change strategy." --> "to D C D C today ten fifteen."
- "Um Baltimore." --> "From Baltimore ten."
- "Ba- uh Chicago." --> "For Boston Chicago."