Title: Challenges in Speech Processing for Man-Machine Communication
2. Nonspectral Features for Speech Processing
- B. Yegnanarayana
- Dept. of Computer Science and Engineering
- Indian Institute of Technology Madras
- Chennai-600036, India
- yegna@cs.iitm.ernet.in
- Talk at NOLISP 2005
- April 22, 2005
3. Images
- Digital image: a matrix of numbers
- Types of images: line sketches, binary, gray level and color
- Still images, video, multimedia
4. Objective: To examine whether spectral information alone is adequate, or whether there is information in speech that we are not using in many speech applications.
5. Outline of the Talk
- Illustration of nonspectral information: line sketch of an image
- What is the spectrum of a signal?
- Speech and language
- Information in speech and the nature of the speech signal
- Speech message: medium vs. message, system vs. source
- Progress in speech processing: need for nonspectral processing
- Some illustrations of nonspectral processing
- Need for new tools/approaches
6. Speech and Language
- Link between speech and language
- Language is a sequence of events at different levels
  - Sound units (spectral), pairs of units (spectral transition) and prosody (duration and intonation)
- Illustration of speech in different languages
7. Information in Speech Signal
- Message, speaker, language, health, environment, emotion, etc.
- Nature of speech in relation to text: signal, spectrogram, epochs, formant contours, pitch contours, source characteristics, glottal pulse, duration, modulation frequency
8. Features of Speech Signal: Waveform and Spectrogram
[Figure: speech signal, spectrogram and pitch contour]
9. Segmental Features: Short-Time Spectra
10. Medium vs. Message
- System vs. source
- We excite a system to convey information
- For a time-varying system, the system itself also carries a message
- But we focus mostly on the system, i.e., the spectrum
- The spectrum is related to the distribution of energy with frequency
- What is processed is the sequence of pulses in the excitation, not the frequency
- Then why do spectrum analysis? We get the message even when the spectrum is degraded by channel and noise
11. Progress in Speech Processing
- Humans do processing other than spectral processing as well
- Demo: listening to LP residuals of different sounds
- Information in speech: second-order (spectrum), higher-order (residual) and long-term (prosody) relations
- Leftovers in speech processing: phase, residual and suprasegmental features
- Why were these not addressed? Lack of tools
12. Perception-based Features
- Most of the features are volume (amplitude and spectrum) based
- Performance may be limited
- Significance of perceptual attributes
- Original signal
- LP residual
13. Some Illustrations of Sounds of Silence
[Figure: signal, residual and instants; more examples of signal, residual and instants]
14. Comment on Speech Recognition
Note that speech recognition is achieved by using language and other constraints, not by recognition of sound units by spectral means. Speech is perceived even by human beings as a sequence of acoustic hints.
F. S. Cooper
15. Nonspectral Processing of Speech (Some Illustrations)
- Perceptual listening: demo
- Analysis: epoch extraction, LP residual, Hilbert transform
- Periodic-aperiodic decomposition of the LP residual
- Prosody manipulation: epochs and LP residual
- Prosody (duration and intonation) modeling using ANN: language and speaker ID
- Speaker recognition: complementary information in the LP residual
- Speech enhancement: single channel, multiple channels, multispeaker
16. Perceptual Importance of LP Residual
[Figure: (a) speech signal, (b) random excitation, (c) LP residual excitation]
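The LP residual referred to throughout this talk is obtained by inverse filtering the speech with its own linear prediction filter. Below is a minimal sketch, assuming a single non-silent frame of 8 kHz speech and a 10th-order autocorrelation-method LP analysis (typical values, not taken from the slide).

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(frame, order=10):
    """LP residual of one frame: solve the Yule-Walker equations for the
    predictor coefficients, then inverse-filter the frame with A(z)."""
    # autocorrelation at lags 0..order
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # predictor coeffs
    inv_filter = np.concatenate(([1.0], -a))     # A(z) = 1 - sum_k a_k z^{-k}
    return lfilter(inv_filter, [1.0], frame)     # e[n]: prediction error signal
```

Listening to e[n] (the residual demo mentioned on slide 11) removes most of the spectral envelope information while retaining the excitation structure.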
17. Nonspectral Processing of Speech (Some Illustrations): outline recap (see slide 15)
18. Speech Production System
19. Nature of Excitation of Voiced Speech
(a) Glottal volume velocity, (b) Speech Waveform
20. Principle of Group Delay Processing
Consider a unit sample sequence delayed by t0, i.e. δ(t − t0). Its Fourier transform is e^(−jωt0), so the phase spectrum is φ(ω) = −ωt0 and the group delay is τ(ω) = −dφ(ω)/dω = t0, constant over frequency. As the window is moved to the right, τ(ω) increases linearly with time.
[Figure: the delayed impulse, its linear phase φ(ω), the constant group delay τ(ω), and τ(ω) as a function of the window position t]
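A minimal numerical check of the statement above, with illustrative sizes (N and t0 are not from the slide): the group delay computed from the unwrapped FFT phase of a delayed unit sample sequence is constant and equal to the delay.

```python
import numpy as np

N, t0 = 512, 40                         # frame length and delay, in samples
x = np.zeros(N)
x[t0] = 1.0                             # unit sample sequence delta(n - t0)

X = np.fft.rfft(x)
phase = np.unwrap(np.angle(X))          # phi(w) = -w * t0
omega = np.linspace(0.0, np.pi, len(X))
tau = -np.gradient(phase, omega)        # group delay tau(w) = -d phi / d w

print(np.allclose(tau, t0))             # True: flat group delay of t0 samples
```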
21. Principle of Group Delay (contd.)
Consider a damped sinusoid (resonant system) with resonance frequency ω0.
[Figure: waveform and group delay for low damping, for high damping, and for a shifted resonance frequency ω0]
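A companion sketch for the damped sinusoid, again with assumed illustrative values: the group delay, computed here with the identity τ(ω) = Re{FT(n·x[n]) / FT(x[n])}, peaks near the resonance frequency, and more sharply when the damping is low.

```python
import numpy as np

def group_delay(x, nfft=4096):
    """Group delay via tau(w) = Re{ FT(n*x[n]) / FT(x[n]) }."""
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(np.arange(len(x)) * x, nfft)
    return np.real(Y * np.conj(X)) / (np.abs(X) ** 2 + 1e-12)

n = np.arange(400)
w0 = 0.3 * np.pi                                   # resonance (rad/sample)
for r in (0.99, 0.90):                             # low vs. high damping
    tau = group_delay((r ** n) * np.cos(w0 * n))   # damped sinusoid
    w = np.linspace(0.0, np.pi, len(tau))
    print(f"r={r}: peak at {w[np.argmax(tau)] / np.pi:.2f}*pi, "
          f"height {tau.max():.1f} samples")
```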
23. Speech, Glottal Waveform and Instants of Excitation
(a) Segment of voiced speech signal, (b) linear prediction residual of (a), (c) derivative of the EGG signal, (d) instants of significant excitation from the proposed algorithm.
24. Instants of Significant Excitation for Male Speech
[Figure: utterance "ANY DICTIONARY"; panels show the speech signal, LP residual, phase slope function, zero-crossing instants and the gain plot]
25. Instants of Significant Excitation for Female Speech
[Figure: speech signal, LP residual, phase slope function, zero-crossing instants and the gain plot]
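The phase slope function and its zero-crossing instants shown in the two figures above can be sketched as follows. This is only an illustration of the idea, not the exact published algorithm: it assumes an 8 kHz LP residual and a roughly pitch-period-long frame, and it takes the positive-going zero crossings of the frame-averaged phase slope (measured relative to the frame centre) as the candidate instants.

```python
import numpy as np

def phase_slope_epochs(residual, frame=40, hop=1):
    """Candidate instants of significant excitation from the LP residual:
    the frame-averaged phase slope crosses zero (going positive) when a
    strong excitation impulse passes the centre of the sliding frame."""
    n = np.arange(frame)
    win = np.hanning(frame)
    centre = (frame - 1) / 2.0
    slope = []
    for start in range(0, len(residual) - frame, hop):
        seg = residual[start:start + frame] * win
        X = np.fft.rfft(seg)
        Y = np.fft.rfft(n * seg)
        gd = np.real(Y * np.conj(X)) / (np.abs(X) ** 2 + 1e-12)  # per-bin group delay
        slope.append(centre - gd.mean())       # phase slope relative to frame centre
    slope = np.array(slope)
    zc = np.where((slope[:-1] < 0) & (slope[1:] >= 0))[0]        # positive zero crossings
    return zc * hop + frame // 2               # back to sample indices in the residual
```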
26. Formant Extraction using Knowledge of Instants
27. Nonspectral Processing of Speech (Some Illustrations): outline recap (see slide 15)
30. Nonspectral Processing of Speech (Some Illustrations): outline recap (see slide 15)
31. Demonstration of Pitch Period Modification (Indian Male Speaker)
[Figure: speech waveforms and narrowband spectrograms for (a) the original utterance, (b) the pitch period increased by a factor of 1.33 and (c) the pitch period decreased by a factor of 0.66]
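A heavily simplified sketch of epoch-based pitch-period modification, offered only as an illustration of the idea behind the demo above (the actual procedure used for these examples is not described on the slide): each pitch period of the LP residual, delimited by successive epochs, is resampled to the new period length, and the modified residual then re-excites the time-varying LP synthesis filter.

```python
import numpy as np
from scipy.signal import resample

def modify_pitch_period(residual, epochs, factor):
    """Resample each epoch-to-epoch segment of the LP residual so that its
    length scales by `factor` (>1 lengthens the pitch period, lowering the
    pitch). The total duration changes too unless compensated separately."""
    segments = []
    for a, b in zip(epochs[:-1], epochs[1:]):
        period = residual[a:b]
        segments.append(resample(period, max(1, int(round(len(period) * factor)))))
    return np.concatenate(segments)
```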
32. Demonstration of Duration Modification (Indian Male Speaker)
[Figure: speech waveforms and narrowband spectrograms for (a) the original utterance, (b) the duration increased by a factor of 1.5 and (c) the duration decreased by a factor of 0.75]
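A matching sketch for duration modification, again only a simplified stand-in for the method behind the demo above: whole pitch periods of the LP residual are repeated or dropped so that the overall duration scales by the given factor, while each retained period is copied unchanged and the pitch is therefore roughly preserved.

```python
import numpy as np

def modify_duration(residual, epochs, factor):
    """Scale the number of pitch periods by `factor` via nearest-neighbour
    selection of epoch-to-epoch residual segments (repeat or drop periods)."""
    periods = [residual[a:b] for a, b in zip(epochs[:-1], epochs[1:])]
    n_out = max(1, int(round(len(periods) * factor)))
    src = np.minimum((np.arange(n_out) / factor).astype(int), len(periods) - 1)
    return np.concatenate([periods[i] for i in src])
```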
33. Nonspectral Processing of Speech (Some Illustrations): outline recap (see slide 15)
34. Prosody (duration and intonation) modeling using ANN
- Duration modeling using ANN (a minimal sketch follows this list)
- Language and speaker ID using duration
- Intonation modeling using ANN
- Language and speaker ID using intonation
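A minimal sketch of the kind of ANN duration model listed above, using entirely synthetic placeholder data; the real model's input features, targets and network size are not stated on the slide, so everything here is an assumption: a small feed-forward network maps per-syllable feature vectors to durations.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 25))                             # placeholder syllable features
y = 120.0 + 30.0 * X[:, 0] + 10.0 * rng.normal(size=1000)   # placeholder durations (ms)

model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
model.fit(X[:800], y[:800])                                 # train on 800 syllables

rmse = np.sqrt(np.mean((model.predict(X[800:]) - y[800:]) ** 2))
print(f"held-out duration RMSE: {rmse:.1f} ms")             # evaluate on the rest
```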
35. Performance of the duration model
36. Language and speaker identification using duration models
Language identification
Speaker identification
37. Performance of the intonation model
38. Language and speaker identification using intonation models
Language identification
Speaker identification
39. Nonspectral Processing of Speech (Some Illustrations): outline recap (see slide 15)
40. Speaker Recognition using LP Residual
- Linear prediction (LP) residual as a feature for characterizing speaker-specific information
- Learning speaker-specific characteristics using AANN models (a sketch of the AANN idea follows this list)
- Significance of the regions of the LP residual around the instants of glottal closure
- Relative significance of excitation sources corresponding to different sound units
- Complementary nature of excitation source features
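A hedged sketch of the AANN (autoassociative neural network) idea referred to in the list above, using an autoencoder-style MLP as a stand-in; the feature dimensionality, layer sizes and the exp(-error) confidence mapping are assumptions rather than values from the slide. The speaker model learns to reconstruct residual-derived feature frames from the enrolment data, and a test utterance is scored by its average reconstruction confidence.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_aann(frames, bottleneck=12):
    """Autoassociative model: train the network to reproduce its own input
    (frames has shape (num_frames, feature_dim), e.g. residual samples)."""
    aann = MLPRegressor(hidden_layer_sizes=(38, bottleneck, 38),
                        activation="tanh", max_iter=500, random_state=0)
    aann.fit(frames, frames)                    # target = input
    return aann

def aann_confidence(aann, frames):
    """Average per-frame confidence exp(-reconstruction error) for a claimant."""
    err = np.mean((aann.predict(frames) - frames) ** 2, axis=1)
    return float(np.mean(np.exp(-err)))
```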
41. Speaker Recognition using Spectral Features
- Features: linear prediction cepstral coefficients (LPCC)
- Cepstral mean subtraction for channel compensation (sketched after this list)
- AANN models for estimating the density of feature vectors
- Modeling the distribution of LPCC features using AANN models
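Cepstral mean subtraction, listed above for channel compensation, is simple enough to state directly (the frame layout is an assumption):

```python
import numpy as np

def cepstral_mean_subtraction(lpcc):
    """Subtract the utterance-level mean from every cepstral dimension.
    A fixed convolutional channel is additive in the cepstral domain, so
    removing the mean removes most of it. Shape: (num_frames, num_coeffs)."""
    return lpcc - lpcc.mean(axis=0, keepdims=True)
```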
42. Combining Evidence from Spectral and Source Features
- Database: NIST 2003
- Training data
  - 149 male and 191 female speakers
  - Duration: 2 minutes
- Verification data
  - 1343 male and 2257 female tests
  - Duration: 15-45 seconds
- Sampling rate: 8 kHz
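The slide above does not say how the spectral and source evidence are combined; a common and minimal choice, given here purely as an assumed sketch, is a convex weighted sum of the two verification scores, with the weight tuned on development data.

```python
def combine_scores(spectral_score, source_score, weight=0.6):
    """Weighted combination of the spectral (LPCC/AANN) and source
    (LP-residual/AANN) scores; `weight` is a hypothetical value that would
    normally be tuned on held-out data."""
    return weight * spectral_score + (1.0 - weight) * source_score
```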
43. Nonspectral Processing of Speech (Some Illustrations): outline recap (see slide 15)
44. Enhancement of Speech from Single Channel (Example 1)
DEGRADED SPEECH
ENHANCED SPEECH
45. Enhancement of Speech from Single Channel (Example 2)
DEGRADED SPEECH
ENHANCED SPEECH
46. Speech Signals of Different Microphones
(a), (d) and (g): waveforms at three microphone locations; (b), (e) and (h): extracted instants of significant excitation; (c), (f) and (i): short-time spectra for the marked regions.
47. Characteristics of Hilbert Envelope
(a), (b) and (c): LP residual, its Hilbert transform and its Hilbert envelope for the signal at mic-0; (d), (e) and (f): the same for the signal at mic-1.
48. Excitation Characteristics of Speech Signal
(a) Waveform, (b) LP residual and (c) Hilbert envelope for speech collected over a close-speaking microphone.
(a) Waveform, (b) LP residual and (c) Hilbert envelope for speech collected over a distant microphone.
49. Cross-Correlation of Hilbert Envelopes
50. Coherent and Incoherent Addition of Hilbert Envelopes
Hilbert envelope of the signal at (a) mic-1, (b) mic-2 and (c) mic-3; result of (d) incoherent addition and (e) coherent addition.
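A sketch of the processing behind the last few slides, under the assumption of integer sample delays: the Hilbert envelope of each channel's LP residual is computed, the channels are time-aligned using the cross-correlation peak of their envelopes, and the aligned envelopes are summed (coherent addition); summing without alignment gives the incoherent result.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(residual):
    """Hilbert envelope: magnitude of the analytic signal of the LP residual."""
    return np.abs(hilbert(residual))

def delay_from_xcorr(ref_env, env):
    """Integer delay of `env` relative to `ref_env`, from the cross-correlation peak."""
    c = np.correlate(env, ref_env, mode="full")
    return int(np.argmax(c)) - (len(ref_env) - 1)

def coherent_sum(envelopes, delays):
    """Shift each envelope by its (non-negative) delay, then add the aligned parts."""
    n = min(len(e) - d for e, d in zip(envelopes, delays))
    return sum(e[d:d + n] for e, d in zip(envelopes, delays))

# Usage sketch: estimate each channel's delay relative to mic-1, shift so the
# smallest delay becomes zero, then add the aligned envelopes.
# envs = [hilbert_envelope(r) for r in residuals]
# raw = [delay_from_xcorr(envs[0], e) for e in envs]
# enhanced = coherent_sum(envs, [d - min(raw) for d in raw])
```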
51. Enhancement of Speech from Multiple Channels (Example 1)
MICROPHONE-1
MICROPHONE-2
WAVEFORM ADDITION
ENHANCED SPEECH
52. Enhancement of Speech using Multiple Channels (Example 2)
MICROPHONE-1
MICROPHONE-2
WAVEFORM ADDITION
ENHANCED SPEECH
53. Enhancement of Speech in Multispeaker Environment: Results (Example 2)
mic-1
mic-2
Sp1g
Sp1p
Sp2g
Sp2p
54. Enhancement of Speech in Multispeaker Environment: Results (Example 1)
MICROPHONE-1
MICROPHONE-2
SPEAKER-1 ENHANCED
SPEAKER-2 ENHANCED
55. Enhancement of Speech in Multispeaker Environment: Results (Example 3)
MICROPHONE-1
MICROPHONE-2
SPEAKER-1 ENHANCED
SPEAKER-2 ENHANCED
56. Enhancement of Speech in Multispeaker Environment
MICROPHONE-1
MICROPHONE-2
ENHANCED SPEAKER-1
ENHANCED SPEAKER-2
57. Conclusions
- The speech signal contains significant nonspectral information
- There is a need to develop new tools
- Breakthroughs are difficult because of a heavy bias towards thinking the spectral way
- Finally, the motivation for this talk:
58. "Do not follow where the path may lead; go instead where there is no path and leave a trail."
Ralph Waldo Emerson
59. Thank you very much for your attention