Title: Variability in the Speech Signal
1Variability in the Speech Signal
- Why perfect speech recognition is always ten years away.
- 11-752 Spring 2004
- Antoine Raux
2Outline
- What is the speech signal?
- What is in the speech signal?
- Linguistic variability
- Speaker variability
- Task variability
- Environmental variability
3What is the speech signal?
- A one-dimensional waveform analyzed in terms of
- Spectrum
- F0
- Power
- Duration (segments, pauses)
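As a rough illustration of the four features listed above, the sketch below computes power, a magnitude spectrum, an F0 estimate, and duration for one short frame. It is a minimal example under stated assumptions, not the method used in the course: the frame is a synthetic harmonic signal rather than real speech, the function names (frame_power, magnitude_spectrum, estimate_f0) are hypothetical, the spectrum is a naive DFT, and F0 is taken from a simple autocorrelation peak.

```python
import math

def frame_power(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def magnitude_spectrum(frame):
    """Magnitudes of a naive DFT for bins 0..N/2 (O(N^2), fine for short frames)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Pick the lag with the largest autocorrelation within a plausible pitch range."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic "voiced" frame (an assumption for illustration): a 125 Hz
# fundamental plus two weaker harmonics, 320 samples at 8 kHz.
sample_rate = 8000
true_f0 = 125.0
harmonics = ((1, 1.0), (2, 0.5), (3, 0.25))  # (harmonic number, amplitude)
frame = [
    sum(a * math.sin(2 * math.pi * h * true_f0 * i / sample_rate) for h, a in harmonics)
    for i in range(320)
]

power = frame_power(frame)
spectrum = magnitude_spectrum(frame)
peak_hz = spectrum.index(max(spectrum)) * sample_rate / len(frame)
f0 = estimate_f0(frame, sample_rate)
duration_ms = 1000 * len(frame) / sample_rate

print(f"power={power:.3f}  spectral peak={peak_hz:.1f} Hz  F0={f0:.1f} Hz  duration={duration_ms:.0f} ms")
```

On real speech these measurements are taken frame by frame, and the F0 estimate is only meaningful on voiced frames.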
4What information is in the Speech Signal?
- Linguistic content
- Phonemes, Words and Sentences
- Prosody
- Speaker characteristics
- Gender
- Dialect
- Individual Differences
5What is in the Speech Signal?
- Task/State characteristics
- Emotions
- Lombard speech in noisy environments
- Speaking style
- Environment characteristics
- Surrounding noise
- Microphone/channel
6Source-Channel Model of Speech Production
Linguistic Message -> Channel -> Speech Signal
7Source-Channel Model of Speech Production
Linguistic Message -> Channel -> Speech Signal
Speaker, Environment, Task
8Source-Channel Model
Assumed to be Invariant!!
Linguistic Message -> Channel -> Speech Signal
Speaker, Environment, Task
9Linguistic Variability
- Different phonemes have different spectral characteristics (of course)
- Coarticulation effect: spectral characteristics of phonemes change depending on the neighboring phonemes
- F0, duration, and power vary according to intonation and stress
10Environment Variability
- Non-speech events (usually not the focus of the task)
- Noise at the source
- Static noises (e.g. fan, engine)
- Transient noises (e.g. door slam, other speakers)
- Noise in the channel
- Microphone buzz
- Telephone/Cell Phone (limited bandwidth)
- (Speech Enhancement Assessment Resource (SpEAR) Database. http://ee.ogi.edu/NSEL/. Beta Release v1.0. CSLU, Oregon Graduate Institute of Science and Technology. E. Wan, A. Nelson, and Rick Peterson.)
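The two kinds of noise above can be imitated on a toy signal. The sketch below is a minimal example, assuming the common simplification that noise at the source is additive and that the channel acts as a linear filter; the slides do not state this model. The names add_noise and apply_channel are hypothetical, the "fan" is plain Gaussian noise, and a 5-tap moving average is only a crude stand-in for a band-limited telephone channel.

```python
import math
import random

def power(x):
    """Mean squared amplitude."""
    return sum(v * v for v in x) / len(x)

def add_noise(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def apply_channel(x, impulse_response):
    """Convolve x with a channel impulse response (output has the same length as x)."""
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if i - k >= 0:
                acc += h * x[i - k]
        out.append(acc)
    return out

rng = random.Random(0)  # fixed seed so the example is repeatable
sample_rate = 8000
clean = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(800)]
fan_noise = [rng.gauss(0.0, 1.0) for _ in clean]  # stand-in for a static noise

# Noise at the source: static noise mixed in at a 10 dB signal-to-noise ratio.
noisy = add_noise(clean, fan_noise, snr_db=10.0)

# Noise in the channel: a moving average attenuates high frequencies.
received = apply_channel(noisy, [0.2] * 5)

residual = [y - s for y, s in zip(noisy, clean)]
measured_snr_db = 10 * math.log10(power(clean) / power(residual))
print(f"measured SNR before the channel: {measured_snr_db:.2f} dB")
```

Transient noises such as a door slam would need a noise signal that is non-zero only over a short stretch of samples; the same add_noise call applies, but the SNR is then an average over the whole signal.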
11Speaker Variability: Gender
- Females usually have higher mean F0 than males
- Other formants are often also higher for females
- Some phonetic phenomena are more frequent in one gender than the other (e.g. in North American English, vocal creaks are more frequent for females than males)
12Speaker Variability: Gender
(A. Syrdal, Acoustic Variability in Spontaneous
Conversational Speech of American English
Talkers, ICSLP96)
13Speaker Variability: Dialect
- Different dialects use different phonemes for the same word
- e.g. British vs American "better"
- Different dialects use different allophones for the same phoneme (in a given context)
- e.g. Japanese-accented vs American L/R
- Differences in prosody
14Speaker Variability: Individual Differences
- Physical constitution (lungs, vocal tract)
- Level of education/social environment
- Personal history
- All of these yield differences between the speech of different individuals.
15Task/State Variability
- Emotions
- Irritation, frustration (e.g. dialogue systems)
- Tiredness (e.g. at the end of long recording sessions)
- Lombard speech (speech produced in noisy environments)
- Energy shifts towards higher frequencies
- Vowels get longer
- (J.C. Junqua, The Lombard reflex and its role on human listeners and automatic speech recognizers, J. Acoust. Soc. Am., 1993)
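One simple way to put a number on "energy shifts towards higher frequencies" is the spectral centroid, the power-weighted mean frequency of a frame. The sketch below is a toy example and not a measurement from the cited study: both frames are synthetic, the harmonic amplitudes are invented for illustration, and the names spectral_centroid and harmonic_frame are hypothetical. The choice of the centroid as the measure is also an assumption.

```python
import math

def power_spectrum(frame):
    """Squared magnitudes of a naive DFT for bins 0..N/2."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append(re * re + im * im)
    return spec

def spectral_centroid(frame, sample_rate):
    """Power-weighted mean frequency of the frame, in Hz."""
    spec = power_spectrum(frame)
    bin_hz = sample_rate / len(frame)
    return sum(k * bin_hz * p for k, p in enumerate(spec)) / sum(spec)

def harmonic_frame(amplitudes, f0=125.0, sample_rate=8000, length=320):
    """Sum of sine harmonics of f0; amplitudes[0] belongs to the fundamental."""
    return [
        sum(a * math.sin(2 * math.pi * (h + 1) * f0 * i / sample_rate)
            for h, a in enumerate(amplitudes))
        for i in range(length)
    ]

sample_rate = 8000
# Invented amplitude profiles: a steep roll-off versus a flatter one.
steep_tilt = [1.0 / (h * h) for h in range(1, 9)]
flat_tilt = [1.0 / h for h in range(1, 9)]

centroid_steep = spectral_centroid(harmonic_frame(steep_tilt), sample_rate)
centroid_flat = spectral_centroid(harmonic_frame(flat_tilt), sample_rate)
print(f"steep tilt: {centroid_steep:.0f} Hz, flatter tilt: {centroid_flat:.0f} Hz")
```

The frame with the flatter tilt carries relatively more energy in its upper harmonics, so its centroid is higher. Comparing centroids of normal and Lombard recordings of the same speaker and sentence would be one way to check the effect on real data.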
16Task/State Variability
- Speaking style
- The speech signal of the same speaker differs between reading a novel and having an informal conversation.
- Differences in formant positions can be significant (i.e. similar to inter-speaker differences)
- Same for prosodic features (F0, duration)
- (M. Abe, Speaking Styles: Statistical Analysis and Synthesis by a Text-to-Speech System, Progress in Speech Synthesis, 1997)
17Interaction Between Different Sources of Variability
- Examples
- Effect of dialect depends on linguistic context
- Effect of gender depends on dialect
- Effect of emotion depends on gender
- The components of the speech signal are HARD to separate
18The Good Thing about Variability
Many types of information combined in the speech
signal
Many things to learn (about the speaker,
environment) just from the speech signal (or
combined with visual cues)!
19Conclusion
- The speech signal is analyzed in terms of spectrum, F0, duration and power
- Language, Environment, Speaker and Task all affect one or more features
- The impact of each source of variability depends on the others
20Conclusion
- This is why speech recognition is really difficult!
- A speech processing system needs to either
- Separate the uninteresting sources of variability from the interesting one(s)
- OR
- Work in limited conditions. Example:
- speech recognition: fixed speaker, task, and environment
- speaker recognition: fixed linguistic content, task, and environment