Title: Speech Recognition
1 Speech Recognition
- Mitch Marcus
- CIS 530
- Introduction to Natural Language Processing
2 A sample of speech recognition
- The general problem of automatic transcription of
speech by any speaker in any environment is still
far from solved. But recent years have seen ASR
technology matured (mature) to the point where
(it) is viable in certain limited domains. One
major application area is inhuman- (in human-)
computer interaction. While many tasks are
better solved with visual or pointing interfaces,
speech has the potential to be a better interface
than the keyboard for tasks were (where) full
natural language communication is useful, or for
which keyboards are not appropriate. This
includes hands-busy or eyes-busy applications,
such as where the user has objects to manipulate
or equipment to control.
- This was dictated one (on) April 16th, 2007,
using Dragon NaturallySpeaking 9.1. The text is
from Speech and Language Processing, draft of the
second edition, by giraffe ski (Jurafsky) and
Martin.
- 140 words, 6 errors
3 I. Why is Speech Recognition Hard?
4 A Speech Spectrogram
[Figure: spectrogram; vertical axis: frequency, horizontal axis: time]
- Represents the varying short-term amplitude spectra of the speech waveform
- Darkness represents amplitude at that time and frequency.
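A minimal sketch of how such short-term amplitude spectra could be computed: cut the waveform into overlapping frames, window each frame, and take the magnitude of its discrete Fourier transform. The frame length, hop size, Hann window, and the synthetic 1000 Hz test tone are all assumptions chosen for illustration; the slides do not specify an analysis procedure.

```python
import cmath
import math

def amplitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 for one frame (naive O(N^2) DFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_len=64, hop=32):
    """List of short-term amplitude spectra, one per overlapping frame."""
    # Hann window tapers the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        spectra.append(amplitude_spectrum(frame))
    return spectra

# Synthetic input (an assumption, not real speech): 0.1 s of a 1000 Hz tone.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(800)]

spec = spectrogram(tone)
# Each bin spans sample_rate / frame_len = 125 Hz.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` corresponds to one time slice and each column to one frequency bin. Plotting the rows side by side with larger values drawn darker would give the kind of picture described on this slide.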
5 A trained person can read a spectrogram
Therefore, the spectrogram contains all the
information a machine needs as well.
Prof. Victor Zue, MIT
6 Vowels are determined by their formants
[Figure: spectrograms of "bee", "baa", and "boo" with formants F1, F2, and F3 marked]
The frequencies of F1, F2, and F3,
the first three resonances of the vocal tract,
largely determine the perceived vowel
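As a rough illustration of formants determining the vowel, the sketch below picks the vowel whose reference (F1, F2) pair lies closest to a measured pair. The formant values are approximate, textbook-style numbers assumed for illustration (with "baa" treated as an open back vowel); they were not read off the slide's spectrograms, and `nearest_vowel` is a hypothetical helper.

```python
import math

# Approximate (F1, F2) values in Hz. Illustrative assumptions only,
# not measurements from the slide's spectrograms.
VOWEL_FORMANTS = {
    "bee": (270, 2290),
    "baa": (730, 1090),
    "boo": (300, 870),
}

def nearest_vowel(f1, f2):
    """Return the reference word whose (F1, F2) is closest in Hz."""
    return min(VOWEL_FORMANTS,
               key=lambda word: math.dist((f1, f2), VOWEL_FORMANTS[word]))

guess = nearest_vowel(320, 900)  # low F1 and low F2: closest to "boo"
```

A real system would not use raw Euclidean distance in Hz, and coarticulation shifts these values considerably, so this is only a picture of the idea.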
7 Consonants are determined by
- burst spectra
- length of silence
- formant motion
- ...
8 Coarticulation
- The same abstract phoneme can be realized very differently in different phonetic contexts (coarticulation)
- F2 in the vowel /u/, crucial to its identification, varies significantly due to surrounding consonants in the syllables
Moom
Kook
Toot
9 Speech Information is not local
- The identity of speech units, phones, cannot be determined independently of context.
- Sometimes two phones can best be distinguished by examining properties of neighboring phones
d o s d o z
10 Speech Information is not local
- /s/ and /z/ are often acoustically identical
- They are differentiated by the length of the
preceding vowel
d o s d o z
11 Words are constant, but utterances aren't
- Spectrograms of similar words pronounced by the same speaker may be more alike than spectrograms of the same word pronounced by different speakers.
wait MM (m) wait JH (f) wait
whispered(MM)
12 II. HMMs for Speech Recognition
- (Illustrations in II from draft Chapter 9, Jurafsky & Martin)
13 Speech Recognition Architecture
14 Schematic HMM for the word "six"
- Simple one-state-per-phone model
- Left-to-right topology with self loops and no skips
- Start and End states with no emissions
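The topology described above can be sketched as a transition matrix: a non-emitting start state, one state per phone with a self loop and a single forward arc, and a non-emitting end state. The phone labels `s ih k s` and the self-loop probability of 0.5 are assumptions for illustration; a trained model would estimate the probabilities from data.

```python
def left_to_right_hmm(phones, self_loop=0.5):
    """Build states and a transition matrix: start -> phones -> end.

    self_loop is an illustrative value, not a trained probability.
    """
    states = ["<start>"] + list(phones) + ["<end>"]
    n = len(states)
    trans = [[0.0] * n for _ in range(n)]
    trans[0][1] = 1.0                      # start emits nothing, always moves on
    for i in range(1, n - 1):
        trans[i][i] = self_loop            # self loop: stay in the same phone
        trans[i][i + 1] = 1.0 - self_loop  # advance one state, never skip
    return states, trans                   # the end state has no outgoing arcs

states, trans = left_to_right_hmm(["s", "ih", "k", "s"])
```

The self loops are what let a single state absorb however many acoustic frames a phone happens to last.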
15 Review: Phones have dynamic structure
- The name "Ike", pronounced [ay k]
- The formants of the diphthong [ay] move continually
- [k] consists of (a) a silence, (b) a burst
16 A 3-state HMM phone model
- Three emitting states
- Two non-emitting states
- Usually includes skip states
- The word "six" [s ih k s] using 3-state HMM phone models
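A small sketch of how a word model might be assembled from 3-state phone models: each phone contributes three emitting sub-states in sequence. The sub-state names `beg`, `mid`, `end` and the phone labels are assumptions for illustration, and the non-emitting states and skip arcs are left out for brevity.

```python
def expand_to_subphones(phones, parts=("beg", "mid", "end")):
    """Return the emitting sub-states of a word as (position, phone, part)."""
    return [(pos, phone, part)
            for pos, phone in enumerate(phones)
            for part in parts]

six_states = expand_to_subphones(["s", "ih", "k", "s"])
# Four phones with three emitting states each give twelve emitting states.
```

The position index keeps the first and last [s] apart in the state sequence, although in practice both would typically share the same phone model parameters.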
17 A simple full HMM for digit recognition
18 III. Speech Dialogue Understanding
19 Multiple knowledge sources provide redundancy
- Grammatical, semantic and pragmatic information can be used to make recognition robust.
- A first experiment: AT&T Bell Labs airline reservation system (1977)
20 Multiple knowledge sources provide redundancy
22 Speech Recognition Task Dimensions I
- Continuous speech vs. isolated word
- Speaker Dependent, Speaker Independent, Speaker Adaptive
- Speaker dependent: System trained for current speaker
- Speaker independent: No modification per speaker
- Speaker adaptive: Initially speaker independent, then adapts to speaker while functioning
- Vocabulary size
- Small: 10-50 words
- Large: 1,000-64,000 words
- Unlimited: System can handle out-of-vocabulary words
23 Speech Recognition Task Dimensions I
- Perplexity level
- Low perplexity: Average expected branching factor of grammar < 10-20
- High perplexity: Average expected branching factor of grammar > 100
- Read vs. dictation style vs. conversational speech
- Quiet conditions vs. various noise conditions
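To make "average expected branching factor" concrete, here is a minimal sketch that computes perplexity as 2 raised to the average negative log2 probability per word. It assumes a language model has already assigned a probability to each word of a test sequence; the probabilities below are made up for illustration.

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence given each word's model probability."""
    avg_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** (-avg_log2)

# If every word is one of 10 equally likely choices, the perplexity is 10,
# matching the intuition of a grammar with branching factor 10.
uniform_ten = perplexity([0.1] * 5)
```

A model that is less certain about the next word assigns lower probabilities, which raises the perplexity and makes the recognition task harder.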
24 Perplexity: Why it matters
- Experiment (1992): read speech, three tasks
- Mammography transcription (perplexity 60): "There are scattered calcifications with the right breast." "These too have increased very slightly."
- General radiology (perplexity 140): "This is somewhat diffuse in nature." "There is no evidence of esophageal or gastric perforation."
- Encyclopedia dictation (perplexity 430): "Czechoslovakia is known internationally in music and film." "Many large sulphur deposits are found at or near the earth's surface."
25 Real Speech is Difficult: The air travel domain
- Fragments
- show me flights from boston to new york
- to philadelphia
- Ungrammatical utterances
- what type of ground transportation from the airport to denver
- Restarts and self-corrections
- I'd like to see show me flights leaving before noon
- And finally...
- from uh sss from the philadelphia airport um at ooh the airline is united airlines and it is flight number one ninety four once that one lands I need ground transportation to uh broad street in philadelphia what can you arrange for that
26 Conversational Speech Transcription
- Automatically transcribe conversational speech, not necessarily intended for speech recognition
- Best results (3/06)
- English word error rate: 17.2%
- Arabic word error rate: 15.5%
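Word error rate is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The sketch below assumes whitespace tokenization and a non-empty reference. The example strings are fragments of the dictated sample from slide 2, and `word_error_rate` is a hypothetical helper, not the scoring tool behind the figures above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = minimum edits turning the reference prefix so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

wer = word_error_rate(
    "tasks where full natural language communication is useful",
    "tasks were full natural language communication is useful",
)
# One substitution in eight reference words.
```

By the same measure, and assuming each marked error counts as one word edit, the dictated sample on slide 2 (6 errors in 140 words) comes to roughly 4.3%.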