Speech Processing - PowerPoint PPT Presentation

Slides: 71

Provided by: jsdm5

Transcript and Presenter's Notes

Title: Speech Processing

1
Speech Processing

John Mason
Engineering
University of Wales Swansea

2
Time Sequence Information
3
VT Shape Some Vowels - Ladefoged 62
4
VT Shape Some Vowels - Ladefoged 62
5
Speech Processing - Applications

Why?
Communications
Synthesis
Recognition
Speech & Speaker
How?
Frame-based
Systems approach

6
Some Books

Flanagan - Speech Analysis, Synthesis and Perception, Springer-Verlag - a classic!
Furui - several books on recognition
Parsons - Voice and Speech Processing, McGraw-Hill - one of the first textbooks on computer speech processing
O'Shaughnessy - Speech Communication: Human and Machine, Addison-Wesley
Rabiner & Juang - Fundamentals of Speech Recognition, Prentice Hall, 1993
Ramachandran & Mammone (eds) - Modern Methods of Speech Processing, Kluwer Academic, 1995

7
Speech Communications
Person-to-Person
Person-to-Machine: speech/speaker recognition
Machine-to-Person: speech synthesis
8
(Electronic) Speech Communications
perhaps separated by long distance (or in time)
9
Telephony & Broadcasting
[Figure: acoustic air path, then the electronic link (transmission path), then a second acoustic air path]
10
Speech Comms - Telephony
Microphone -> ADC -> Analysis -> Coding -> Transmitter
Receiver -> Decoding -> (Re-)Synthesis -> DAC -> Loudspeaker
11
Speech Bit Rates
[Figure: approximate bit rate in bps (tens, hundreds, thousands, tens of thousands) at different levels of the speech chain: Acoustic Space, Human Hearing Extraction, Message Realisation, Language decoding]
12
Criteria in Speech Comms.

Quality versus Bit-rate

4 Quality Measures: intelligibility, loudness, naturalness, ease-of-listening
13
Speech Processing

The three main application areas are
Speech Comms. (the electronic link)
Automatic Speech/Speaker recognition
Speech Synthesis
Much of the underlying analysis is common, eg linear predictive coding

14
What does speech look like?
15
What does speech look like?
Dynamic Range - for flexibility and robustness
Time-varying - to convey information
16
Frame-based Analysis

To capture time variations
20-30 ms frames - centi-second labeling
spectral analysis
FFT
Filter-bank
Linear Predictive Coding
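
The framing step can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the 8 kHz sampling rate, the 10 ms hop, the 1 kHz test tone and all function names are assumptions, and a direct DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math

def hamming(n_samples):
    """Hamming window of the given length."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_samples - 1))
            for i in range(n_samples)]

def split_into_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins from 0 to Fs/2 (direct DFT, fine for a demo)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum

fs = 8000                        # assumed sampling rate (Hz)
frame_len = fs * 20 // 1000      # 20 ms frame -> 160 samples
hop = fs * 10 // 1000            # assumed 10 ms hop -> 100 frames per second
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs // 10)]  # 100 ms at 1 kHz

frames = split_into_frames(tone, frame_len, hop)
window = hamming(frame_len)
first = [x * w for x, w in zip(frames[0], window)]   # shape the frame before analysis
spectrum = magnitude_spectrum(first)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * fs / frame_len                  # frequency of the strongest bin
```

Each frame yields one spectral estimate, so the sequence of frames gives the time-varying picture the slide describes.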

17
Speech Analysis/Coding

Two general cases
Waveform coders
Source (voice) coders (vo-coders)
Source coders eg linear predictive coding (LPC)
Model the source ie the vocal tract (VT)
Linear, time varying model of VT, plus excitation

18
Systems Approach
[Figure: source-filter model. The excitation, either voiced (at f0) or unvoiced, drives a vocal tract model with time-varying parameters to give speech]
19
LPC Analysis/Synthesis

Synthesis
Input Excitation
output Speech
Analysis
Input Speech
output Excitation

20
Perfect Analysis/Synthesis
Input sn and output sn are identical (within
arithmetic limits)
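
A small round-trip check makes the point concrete. This is a sketch under stated assumptions: the predictor is taken as a weighted sum of past samples with weights `a`, the two weights are borrowed from the 2nd-order example elsewhere in the deck, and the short test signal and function names are made up for illustration.

```python
def analyse(speech, a):
    """Inverse filter A(z): e[n] = s[n] - sum_i a[i] * s[n-i]."""
    excitation = []
    for n, s_n in enumerate(speech):
        prediction = sum(a_i * speech[n - i]
                         for i, a_i in enumerate(a, start=1) if n - i >= 0)
        excitation.append(s_n - prediction)
    return excitation

def synthesise(excitation, a):
    """Synthesis filter H(z) = 1/A(z): s[n] = e[n] + sum_i a[i] * s[n-i]."""
    speech = []
    for n, e_n in enumerate(excitation):
        prediction = sum(a_i * speech[n - i]
                         for i, a_i in enumerate(a, start=1) if n - i >= 0)
        speech.append(e_n + prediction)
    return speech

a = [1.5537, -0.8276]                                # 2nd-order weights used in the deck
s = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.3, -0.6]    # arbitrary test samples
e = analyse(s, a)
s_rebuilt = synthesise(e, a)
max_err = max(abs(x - y) for x, y in zip(s, s_rebuilt))   # rounding error only
```

As long as the exact excitation and the exact coefficients reach the synthesiser, the only difference between input and output is floating-point rounding.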
21
Practical Analysis/Synthesis
22
Practical Analysis/Synthesis

Parameters for Transmission
Input / Excitation en
Source model H(z)
Thus Analysis must derive these parameters,
and
Synthesis must use them to re-generate speech

23
Linear Predictive Coding - LPC

Principle of linear prediction
The next value (or sample) in a series, ie at
time n, is predicted or estimated by a weighted
sum of previous values, ie those at time n-1,
n-2, ...
Thus for a predictor of order p, we have
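
Written out, and assuming the weights are the coefficients $a_i$ named on the later derivation slides and $\hat{s}_n$ denotes the predicted sample, the order-$p$ predictor takes the familiar form:

```latex
\hat{s}_n = \sum_{i=1}^{p} a_i \, s_{n-i}
```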

24
Linear Prediction
Transforming to the z-domain gives
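
A sketch of the standard result, assuming the predictor $\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$ and the error $e_n = s_n - \hat{s}_n$:

```latex
\hat{S}(z) = \sum_{i=1}^{p} a_i z^{-i} \, S(z), \qquad
E(z) = S(z) - \hat{S}(z) = A(z)\,S(z), \qquad
A(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}, \qquad
H(z) = \frac{S(z)}{E(z)} = \frac{1}{A(z)}
```

Sign conventions vary between texts: some fold the minus sign into the coefficients and write $A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$. Both forms describe the same filter.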
25
LPC Error Terms
The error is simply the difference between the actual and predicted values: $e_n = s_n - \hat{s}_n$
[Figure: sn is passed through A(z) to give en]
26
Synthesis
[Figure: the excitation en drives the synthesis filter H(z), built around A(z), to give sn]
Parameters updated at frame rate
NB: the hat denoting approximation is omitted for simplicity
27
Analysis for Synthesis

The Analysis and Synthesis must match
What is needed for the Synthesis?
Answer: en, the excitation, and H(z), the system
Thus the Analysis must derive these terms (from sn)
The speech signal, sn, is analysed to give en and H(z), ie the A(z) parameters, for transmission.

28
Derivation of LPC Coefficients - A(z)
Recall the order-p predictor, where ai are the p prediction coefficients.
The principle behind LPC is to find a set of p coefficients, a1, a2, a3, ... ap, which in some sense minimises the error signal en over a frame of N speech samples. This leads to a set of p coefficients for each frame.
29
Derivation of A(z) (2)
Minimisation of En is achieved by setting the p partial derivatives (with respect to the ai) to zero
The matrix R is symmetric Toeplitz, offering numerically efficient inversion techniques - Durbin's recursion algorithm being one of the most popular.
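
Durbin's recursion can be sketched as follows. Assumptions to note: the autocorrelation form of the normal equations (the sum over i of a_i r(|j-i|) equals r(j), for j = 1..p), predictor weights defined by s_hat[n] = sum_i a[i] * s[n-i], and a synthetic test frame (the impulse response of a known 2nd-order filter) rather than real speech. Function names are illustrative.

```python
def autocorrelation(frame, max_lag):
    """r[lag] = sum_n frame[n] * frame[n + lag], for lag = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the predictor weights a[1..p].

    Returns (a, err): a holds the p weights, err the remaining error energy.
    """
    a = [0.0] * (order + 1)              # a[0] is unused padding
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err                    # reflection coefficient for stage m
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)             # error energy shrinks at every stage
    return a[1:], err

# Synthetic frame: impulse response of a known 2nd-order all-pole filter.
true_a = [1.5537, -0.8276]
frame = []
for n in range(200):
    x = 1.0 if n == 0 else 0.0
    x += sum(a_i * frame[n - i] for i, a_i in enumerate(true_a, start=1) if n - i >= 0)
    frame.append(x)

r = autocorrelation(frame, 2)
a_est, err = levinson_durbin(r, 2)       # a_est should be close to true_a
```

The recursion costs on the order of p squared operations per frame, against p cubed for a general matrix inversion, which is why the Toeplitz structure matters.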
30
Derivation of A(z) (3)

When N is very large, r is the set of autocorrelation coefficients of s
s comes from e convolved with h (excitation * vocal tract)
we are interested here in separating e and h
the predictor order, p, is small to reflect the short-term periodicities (formants)
with higher predictor orders we will get the longer-term periodicities (pitch)
2 practical problems with evaluating a:
matrix singularities in R^-1
unstable resultant H(z)
in practice both are solved by windowing - shaping the frame - eg Hamming

31
Speech Signal Characteristics

Duration
Dynamic Range
Periodicities
vocal tract
pitch
Frame-based Analysis
frame size: quasi-stationary, capture transitions - typically 20-30 ms
frame rate: task dependent; more means more bandwidth/computation - up to 100 frames/second

32
Harmonic Structures and Periodicities

Harmonic structures & periodicities give potential for data reduction
LPC is one way of gaining this compression
Speech has two obvious separate structures
vocal tract resonances
pitch

33
Harmonic Structures and Periodicities
[Figure: short-term prediction. The excitation en (voiced or unvoiced) drives the short-term vocal tract filter H(z) to give speech sn. Other labels: Tp, p]
34
Harmonic Structures and Periodicities
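[Figure: long-term prediction. The excitation epn (voiced/unvoiced) drives the pitch filter Hlt(z) to give en, which drives the vocal tract filter Hst(z) to give speech sn. Other labels: Tp, P]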
35
Harmonic Structures and Periodicities
Two structures: short-term (formants) and long-term - pitch (excitation)
eg 20 ms frame = 160 samples at 8 kHz
ai, eg p = 3
ai, eg p = 10
NB: representations of these parameters are transmitted
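
The long-term (pitch) structure can be illustrated with a one-tap predictor on the short-term residual. This is a simplified sketch, not the slides' method: it assumes a single tap e[n] ~ b * e[n-T], a lag search range of 20 to 147 samples, and an idealised pulse-train residual with a 57-sample period. All names are hypothetical.

```python
def long_term_predictor(residual, min_lag, max_lag):
    """Pick the lag T and gain b that best predict e[n] from b * e[n - T]."""
    best_lag, best_gain, best_score = None, 0.0, -1.0
    for lag in range(min_lag, max_lag + 1):
        num = sum(residual[n] * residual[n - lag] for n in range(lag, len(residual)))
        den = sum(residual[n - lag] ** 2 for n in range(lag, len(residual)))
        if den <= 0.0 or num <= 0.0:
            continue                      # no usable correlation at this lag
        score = num * num / den           # energy removed by the optimal gain
        if score > best_score:
            best_lag, best_gain, best_score = lag, num / den, score
    return best_lag, best_gain

def remove_pitch(residual, lag, gain):
    """Long-term inverse filter: ep[n] = e[n] - b * e[n - T]."""
    return [e_n - gain * (residual[n - lag] if n >= lag else 0.0)
            for n, e_n in enumerate(residual)]

period = 57                               # assumed pitch period in samples
e = [1.0 if n % period == 0 else 0.0 for n in range(320)]   # idealised voiced residual
lag, gain = long_term_predictor(e, 20, 147)
ep = remove_pitch(e, lag, gain)           # only the first pulse survives
```

After the pitch structure is removed, far less remains to be coded, which is the data reduction the slide refers to.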
36
Practical Coding Systems

Waveform & Source Coders (Vocoders)
2 periodicities/redundancies in source
short-term (formants)
long-term - pitch
Excitation en
[Figure: epn drives Hlt(z) to give en, which drives Hst(z) to give sn]
37
Perfect Analysis/Synthesis
Input sn and output sn are identical (within
arithmetic limits)
38
Practical System
[Figure: a transmitted data frame carries the parameters of E(z) and H(z), derived from S(z) at the encoder and used to rebuild sn at the decoder]
Input sn and output sn are similar
39
Analysis-by-Synthesis - LPAS

Integrated encoder: decoder at the encoder
[Figure: LPAS encoder. The output of a basic decoder is subtracted from sn, and the weighted error drives an adaptive encoder]
40
Log Spectral Estimates

Comparisons between frames are very important in many situations
Log spectral estimates are the most common (though in Comms an approximation is used to reduce computation)

In Comms, computation is expensive and parameter vector approximations to the distance D are used
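
One common way to compare two frames is the RMS difference between their log LPC spectra. The sketch below assumes that form for D (the slide's own definition is not in the transcript), assumes H(z) = 1/A(z) with A(z) = 1 - sum_i a[i] z^-i, and uses a second coefficient set that is purely hypothetical.

```python
import cmath
import math

def lpc_log_spectrum(a, n_points=128):
    """Log magnitude (dB) of H = 1/A at n_points frequencies in [0, pi)."""
    out = []
    for k in range(n_points):
        w = math.pi * k / n_points
        a_of_w = 1 - sum(a_i * cmath.exp(-1j * w * i)
                         for i, a_i in enumerate(a, start=1))
        out.append(-20 * math.log10(abs(a_of_w)))
    return out

def log_spectral_distance(a_x, a_y, n_points=128):
    """RMS difference in dB between the two log spectra."""
    s_x = lpc_log_spectrum(a_x, n_points)
    s_y = lpc_log_spectrum(a_y, n_points)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(s_x, s_y)) / n_points)

frame_1 = [1.5537, -0.8276]   # 2nd-order weights used elsewhere in the deck
frame_2 = [1.2, -0.6]         # hypothetical second frame
d_same = log_spectral_distance(frame_1, frame_1)
d_diff = log_spectral_distance(frame_1, frame_2)
```

Evaluating the spectrum at many frequencies for every comparison is the cost that motivates the cheaper parameter-vector approximations.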
41
Some Standards

Standard   Application         Coder      Bit rate (kb/s)
GSM        European Cellular   RPE-LTP    13
FS1016     Secure Voice        CELP       4.8
IS54       NA Cellular         VSELP      7.95
IS96                           QCELP      1-8
JDC-FR     Japanese Cellular   VSELP      6.7
JDC-HR                         PSI-CELP   3.67
G.728      (terrestrial)       LD-CELP    16

42
CELP eg
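Short-term coefficients (formants)
Long-term coefficients (pitch)
CB Index
Gain
[Figure: the codebook entry selected by the CB Index, scaled by the gain, gives en, which drives Hlt(z) and Hst(z) to give sn]
Excitation is represented by an address, ie the CB Index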
43
CELP eg
Short-term coefficients (formants)
Long-term coefficients (pitch)
CB Index
Gain
[Figure: as on the previous slide, with additional sn and en signal labels]
Excitation is represented by an address, ie the CB Index
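
The idea of representing the excitation by a codebook address can be sketched as an analysis-by-synthesis search. This is a deliberately stripped-down illustration: it assumes only a short-term filter (no long-term filter, no perceptual weighting), a tiny hand-made codebook, and a first-order filter weight chosen for the example.

```python
def synthesise(excitation, a):
    """All-pole synthesis: s[n] = e[n] + sum_i a[i] * s[n-i]."""
    speech = []
    for n, e_n in enumerate(excitation):
        past = sum(a_i * speech[n - i]
                   for i, a_i in enumerate(a, start=1) if n - i >= 0)
        speech.append(e_n + past)
    return speech

def search_codebook(target, codebook, a):
    """Try every entry through the decoder and keep the closest match."""
    best = None
    for index, code in enumerate(codebook):
        candidate = synthesise(code, a)
        energy = sum(v * v for v in candidate)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, candidate)) / energy   # optimal gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, candidate))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

a = [0.9]                                   # hypothetical first-order filter weight
codebook = [[1, 0, 0, 0, 0],
            [0, 1, 0, 0, -1],
            [1, -1, 0, 1, 0],
            [0, 0, 1, 1, 0]]                # hypothetical 2-bit codebook
target = [0.5 * v for v in synthesise(codebook[2], a)]
index, gain, error = search_codebook(target, codebook, a)
```

Only `index` and `gain` (plus the filter parameters) would be transmitted; the decoder holds the same codebook and regenerates the excitation from the address.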
44
Conversion of LPC Parameters

$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}$, and the $a_i$ are to be transmitted (Txd)
Line Spectral Frequencies (LSFs) present a clever way of representing the LPC coefficients, the $a_i$ of A(z)
The $a_i$ are floating-point numbers and their accuracy is important
Factorising A(z) tends to give complex roots in the z-domain
LSFs map these complex roots on to the unit circle

LSFs
Lead to efficient coding
Ensure a minimum phase filter
Bit errors are spectrum localised, minimising loss of speech quality

45
Line Spectral Frequencies

Consider
$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$
and
$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$
where n is the order of A(z); then P(z) and Q(z) lead to what are known as LSFs
Clearly if P(z) and Q(z) are known then A(z) can be found:
$A(z) = (P(z) + Q(z)) / 2$
Roots of P(z) and Q(z) lie on the unit circle in the z-domain
The locations give the LSFs, hence P(z) and Q(z), and whence A(z)

46
LSF Evaluation

Consider one pair of complex roots, $A_1(z)$
$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
The trivial roots at $z = -1$ (from $P_1$) and $z = +1$ (from $Q_1$) are discarded
It follows that the LSFs, $\theta_1$ & $\theta_2$, are given by
$\cos(\theta_1) = -(a_1 + a_2 - 1)/2$
and $\cos(\theta_2) = -(a_1 - a_2 + 1)/2$

Show
$a_1 = -(\cos(\theta_1) + \cos(\theta_2))$ and
$a_2 = \cos(\theta_2) - \cos(\theta_1) + 1$
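
The two directions of the second-order conversion can be checked numerically. The sketch below assumes the convention of this slide, A1(z) = 1 + a1 z^-1 + a2 z^-2, and uses theta1 and theta2 for the LSFs; the function names are illustrative.

```python
import math

def lpc2_to_lsf(a1, a2):
    """LSFs (radians) of A1(z) = 1 + a1 z^-1 + a2 z^-2."""
    theta1 = math.acos(-(a1 + a2 - 1) / 2)   # from the quadratic factor of P1(z)
    theta2 = math.acos(-(a1 - a2 + 1) / 2)   # from the quadratic factor of Q1(z)
    return theta1, theta2

def lsf_to_lpc2(theta1, theta2):
    """Inverse mapping: recover a1 and a2 from the two LSFs."""
    a1 = -(math.cos(theta1) + math.cos(theta2))
    a2 = math.cos(theta2) - math.cos(theta1) + 1
    return a1, a2

theta1, theta2 = lpc2_to_lsf(-1.8, 0.9)
a1_back, a2_back = lsf_to_lpc2(theta1, theta2)   # should return to (-1.8, 0.9)
```

The round trip confirms the two "Show" identities: adding the cosine expressions isolates a1, and subtracting them isolates a2.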

47
LSF Test Example

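$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} = (z^2 + a_1 z + a_2) z^{-2} = (z^2 + 2\cos(\phi)\,\omega_n z + \omega_n^2) z^{-2}$
where $\omega_n$ is the radius and $\phi$ is the angle measured from $\pi$. So radius $= \sqrt{a_2}$ and the root angle is $\pi - \phi$
Note: in P & Q all $\omega_n^2$ terms (of the multiple 2nd-order factors) are unity
EG 1: $a_2 = 1$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$
roots already on circle and do not move (unstable system, not practical)
EG 2: $a_1 = 0$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$
$\cos(\theta_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2$
so $\cos(\theta_2) = -\cos(\theta_1)$ and the LSFs are symmetric about $\pi/2$, ie a quarter of the sampling frequency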

48
LSF Review Example (1)
LSFs/LSPs are defined via
$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$ and
$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$
thus $A(z) = (P(z) + Q(z)) / 2$
49
LSF Review Example (2)
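For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
cf $(s^2 + 2\cos(\phi)\,\omega_n s + \omega_n^2)$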
50
LSF Review Example (3)
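For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q(z) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
cf $(s^2 + 2\cos(\phi)\,\omega_n s + \omega_n^2)$
Thus $(a_1 + a_2 - 1) = 2\cos(\phi_1) = -2\cos(\theta_1)$, with $\phi_1 = \pi - \theta_1$, and $(a_1 - a_2 + 1) = -2\cos(\theta_2)$
So, given
i) LPC coeffs., $a_1$ and $a_2$, then the LSFs $\theta_1$ & $\theta_2$ can be found
ii) LSFs, $\theta_1$ & $\theta_2$, then the LPC coeffs. $a_1$ and $a_2$ can be found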
51
LSF Review Example (4)
For a second order, with P(z) corresponding to the first root and Q(z) to the second root:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
For the second pair of $\theta_i$, 1.37 and 1.77:
$P(z) = (z^2 - 2\cos(1.37) z + 1)(z + 1) z^{-3} = (z^3 + (1 - 2\cos(1.37)) z^2 + (1 - 2\cos(1.37)) z + 1) z^{-3}$
Likewise
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
$= (z^2 - 2\cos(1.77) z + 1)(z - 1) z^{-3} = (z^3 + (-1 - 2\cos(1.77)) z^2 + (1 + 2\cos(1.77)) z - 1) z^{-3}$
Then
$A(z) = (P(z) + Q(z)) / 2 = (z^3 - (\cos(1.37) + \cos(1.77)) z^2 + (1 - \cos(1.37) + \cos(1.77)) z) z^{-3}$
52
LSF Examples
LPC coeffs.      LSFs
a1      a2       θ1            θ2
0       0.5      1.31812       1.82348
-1.8    0.9      0.31756       0.554811
1.8     0.9      π-0.554811    π-0.31756
                 2.2274        2.3743
53
Example Bit Allocation
54
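Codebooks - VQ
N = 2^L
[Figure: identical codebooks at encoder and decoder; each p-element vector in the time sequence is replaced by an index i (0 to N-1)]
Data reduction: (p x B) bits to L bits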
55
Codebook Compression

Principle
representative data sets
data vector is replaced / represented by the nearest vector, chosen from a codebook - a closed set of vectors
Examples
LPC parameter sets
Excitation as in CELP
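
The replace-by-nearest-vector principle can be sketched directly. The codebook and data below are toy values chosen for illustration, and squared Euclidean distance is an assumed choice of "nearest".

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to the vector (squared Euclidean distance)."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 4 = 2^2 entries
data = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

indices = [nearest_index(v, codebook) for v in data]   # what would be transmitted
decoded = [codebook[i] for i in indices]               # what the receiver reconstructs
```

Each vector costs only 2 bits here, whatever the precision of its elements; the price is the quantisation error between `data` and `decoded`.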

56
Codebook Compression - CELP
Codebook of time-domain samples
[Figure: a codebook of P entries, each addressed by its start point and y ms long, supplies the excitation en]
en are time-domain samples (integers)
R samples per second (eg 8000 Hz)
Frame rate governs vector size
P = 2^j
Bit rate = j/y bits/ms
57
Codebook Compression of H(z)
[Figure: A(z) at time t, a vector of M elements every x ms, is replaced by an index i into a codebook of N = 2^k vectors]
Vector with M elements, every x ms
Codebook with N = 2^k vectors
Bit rate = k/x bits per ms (not a function of M)
In practice A(z) is converted to LSFs.
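
A quick arithmetic check of the k/x relationship, using hypothetical values (a 10-bit index sent once per 20 ms frame) that are not taken from the slides:

```python
def codebook_bit_rate(bits_per_index, update_interval_ms):
    """Bits per millisecond when a k-bit index is sent every x ms."""
    return bits_per_index / update_interval_ms

k = 10                                           # hypothetical index width in bits
x_ms = 20                                        # hypothetical update interval in ms
codebook_size = 2 ** k                           # N = 2^k vectors
rate_bits_per_ms = codebook_bit_rate(k, x_ms)    # k / x
rate_bps = rate_bits_per_ms * 1000               # the same rate in bits per second
```

The vector length M never enters the calculation, which is the point of the "not a function of M" remark: a longer vector costs storage and search time, not transmission bits.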
58
Codebook Generation
1) Initialise: form a single centroid of all training data, N = 1
2) Repeat
     Split centroids, N -> 2N
     Repeat
       Cluster data to nearest centroid
     until convergence
   until N large enough
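
The splitting procedure can be sketched as below. The details are assumptions for illustration: centroids are split by a small additive perturbation, distance is squared Euclidean, convergence means the centroids stop changing, and the target size is a power of two. The training data is a toy two-cluster set.

```python
def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train_codebook(data, target_size, perturbation=0.01, max_iters=50):
    """Binary-splitting codebook training; target_size should be a power of two."""
    codebook = [centroid(data)]                      # step 1: single centroid, N = 1
    while len(codebook) < target_size:               # step 2: until N is large enough
        # Split every centroid into a slightly perturbed pair: N -> 2N.
        codebook = [[x + d for x in c]
                    for c in codebook for d in (perturbation, -perturbation)]
        for _ in range(max_iters):                   # cluster until convergence
            clusters = [[] for _ in codebook]
            for v in data:
                i = min(range(len(codebook)), key=lambda j: sq_dist(v, codebook[j]))
                clusters[i].append(v)
            # An empty cluster keeps its old centroid.
            updated = [centroid(c) if c else old
                       for c, old in zip(clusters, codebook)]
            if updated == codebook:
                break
            codebook = updated
    return codebook

training = [[0, 0], [0, 1], [1, 0], [1, 1],
            [10, 10], [10, 11], [11, 10], [11, 11]]   # toy data with two clear clusters
book = train_codebook(training, 2)
```

On this data the single starting centroid splits once and the two halves settle on the two cluster centres.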
59
VQ Performance on Unseen Data
Ramachandran & Mammone (eds) - Modern Methods of Speech Processing, Kluwer Academic, 1995
60
VQ Performance on Unseen Data
Ramachandran & Mammone (eds) - Modern Methods of Speech Processing, Kluwer Academic, 1995
61
LPC & FFT Spectra
LPC Roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i, 0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758, -0.4200
LSFs
θ2 of Q(z)   θ1 of P(z)
2.3743       2.2274
1.6540       1.5997
0.8261       0.6954
0.6106       0.3937
[Figure: magnitude (dB) against frequency (kHz), 0 to Fs/2]
62
LPC Spectra & LSFs
LPC Roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i, 0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758, -0.4200
LSFs
θ2 of Q(z)   θ1 of P(z)
2.3743       2.2274
1.6540       1.5997
0.8261       0.6954
0.6106       0.3937
[Figure: LPC spectrum with the LSFs marked, against frequency (kHz), 0 to Fs/2]
63
LPC & FFT Spectra - 2nd Order
A(z): 1.5537, -0.8276   Roots: 0.7769 ± 0.4733i
$H(0) = K / (1 - (1.5537 - 0.8276)) = K/0.274$
$H(\omega_s/2) = K / (1 - (-1.5537 - 0.8276)) = K/3.38$
Difference between H(0) and H(ωs/2): 21.8 dB
[Figure: time waveform, amplitude -1 to 1 against time (ms), 0 to 25.6]
[Figure: magnitude (dB) against frequency (kHz), 0 to Fs/2]
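
The numbers on this slide can be reproduced in a few lines. The sketch assumes the two quoted values are predictor weights, so that A(z) = 1 - a1 z^-1 - a2 z^-2 and H(z) = K/A(z); variable names are illustrative.

```python
import cmath
import math

a1, a2 = 1.5537, -0.8276          # predictor weights quoted on the slide

# Poles of H(z) = K / A(z) are the roots of z^2 - a1*z - a2.
disc = cmath.sqrt(a1 * a1 + 4 * a2)
poles = ((a1 + disc) / 2, (a1 - disc) / 2)

def a_at(w):
    """A(z) evaluated on the unit circle at angular frequency w (radians/sample)."""
    z_inv = cmath.exp(-1j * w)
    return 1 - a1 * z_inv - a2 * z_inv ** 2

dc = abs(a_at(0.0))               # denominator of H(0)
nyquist = abs(a_at(math.pi))      # denominator of H(ws/2)
ratio_db = 20 * math.log10(nyquist / dc)   # how far H(0) sits above H(ws/2)
```

The gain K cancels in the ratio, so the dB difference between the two ends of the spectrum depends only on the two coefficients.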
64
VT Shape Some Vowels - Ladefoged 62
65
VT Shape Some Vowels - Ladefoged 62
66
GSM

Groupe Special Mobile - EU
First digital cellular system in world
See Hodge 1990
Based on TDMA & FDMA at 900 MHz, and RPE-LPC (ie it is an LPAS system)
Now at 1800 MHz
Carriers at 200 kHz spacing
Supporting 8 TDMA time slots each
Time slots 577 µs - 156.25 bit periods
8 time slots form 1 GSM frame of 4.62 ms
Modulation: Gaussian minimum shift keying (GMSK)
26-bit training sequence in every time slot
Round-trip delay 80 ms
EU: GSM; US: D-AMPS

67
Other Related Topics
Spectral Lifting: $H(z) = 1 - a z^{-1}$
Codebook Training
Spectral Differences between 2 frames
Cepstra
Modeling Speech Space - HMMs
68
Pre-Emphasis Example
69
Pre-Emphasis Example
[Figure: z-plane plot of the pre-emphasis filter, with a zero at a]
$G(\omega_s/2) = 1 + a$, $G(0) = 1 - a$
For $G(\omega_s/2) > G(0)$, a must be $> 0$
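
A short check of the two gains, assuming the first-order filter G(z) = 1 - a z^-1 implied by G(0) = 1 - a and G(ws/2) = 1 + a. The value a = 0.95 is a hypothetical choice (values a little below 1 are typical), and the function names are illustrative.

```python
import cmath
import math

def pre_emphasis(signal, a):
    """y[n] = x[n] - a * x[n-1], ie G(z) = 1 - a z^-1."""
    return [x - a * (signal[n - 1] if n > 0 else 0.0)
            for n, x in enumerate(signal)]

def gain_at(w, a):
    """Magnitude of G on the unit circle at angular frequency w (radians/sample)."""
    return abs(1 - a * cmath.exp(-1j * w))

a = 0.95                          # hypothetical coefficient
g_dc = gain_at(0.0, a)            # 1 - a: low frequencies are attenuated
g_nyq = gain_at(math.pi, a)       # 1 + a: high frequencies are boosted
flat = pre_emphasis([1.0, 1.0, 1.0], a)   # a constant input is almost removed
```

With a positive a the filter tilts the spectrum upward, lifting the weaker high-frequency content of speech before analysis.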
70
Z-plane to Magnitude Spectrum
[Figure: z-plane plot (real part against imaginary part, both -1 to 1) and the corresponding magnitude spectrum (dB) against frequency (kHz), 0 to Fs/2]
