Title: Speech Processing
1Speech Processing
- John Mason
- Engineering
- University of Wales Swansea
2Time Sequence Information
3VT Shape Some Vowels - Ladefoged 62
4VT Shape Some Vowels - Ladefoged 62
5Speech Processing - Applications
- Why?
- Communications
- Synthesis
- Recognition
- Speech and Speaker
- How?
- Frame-based
- Systems approach
6Some Books
- Flanagan - Speech Analysis, Synthesis and Perception, Springer-Verlag - a classic!
- Furui - several books on recognition
- Parsons - Voice and Speech Processing, McGraw-Hill - one of the first text books on computer speech processing
- O'Shaughnessy - Speech Comms - human and machine, Addison-Wesley
- Rabiner & Juang - Fundamentals of Speech Recognition, Prentice Hall, 1993
- Ramachandran & Mammone (eds) - Modern Methods of Speech Processing, Kluwer Academic, 1995
7Speech Communications
Person-to-Person
Person-to-Machine speech/speaker recognition
Machine-to-Person speech synthesis
8(Electronic) Speech Communications
perhaps separated by long distance (or in time)
9Telephony Broadcasting
[Figure: Acoustic Air Path -> Electronic Link (Transmission Path) -> Acoustic Air Path]
10Speech Comms Telephony
Microphone ADC Analysis Coding Transmitter
Receiver Decoding (re-)Synthesis DAC Loudspeaker
11Speech Bit Rates
[Figure: approximate bit rate in bps (tens, hundreds, thousands, tens of thousands) at the stages Acoustic Space, Human Hearing Extraction, Language Decoding and Message Realisation]
12Criteria in Speech Comms.
4 Quality Measures: intelligibility, loudness, naturalness, ease-of-listening
13Speech Processing
- The three main application areas are
- Speech Comms. (the electronic link)
- Automatic Speech/Speaker recognition
- Speech Synthesis
Much of the underlying analysis is common, eg linear predictive coding
14What does speech look like?
15 What does speech look like?
Dynamic Range - for flexibility and robustness
Time-varying - to convey information
16 Frame-based Analysis
- To capture time variations
- 20-30 ms frames - centi-second labeling
- spectral analysis
- FFT
- Filter-bank
- Linear Predictive Coding
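A minimal sketch of the framing step, assuming 25 ms frames taken every 10 ms (centi-second steps) and a Hamming window. The function names, the 200 Hz test tone and the exact frame and step sizes are illustrative choices, not values from the slides.

```python
import math

def hamming(n_samples):
    """Hamming window of the given length."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def split_into_frames(signal, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Cut a signal into overlapping, Hamming-windowed frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    window = hamming(frame_len)
    frames = []
    # Only whole frames are kept; a short tail at the end is dropped.
    for start in range(0, len(signal) - frame_len + 1, step):
        chunk = signal[start:start + frame_len]
        frames.append([x * w for x, w in zip(chunk, window)])
    return frames

# One second of a 200 Hz tone sampled at 8 kHz (illustrative input only).
fs = 8000
tone = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(fs)]
frames = split_into_frames(tone, fs)
```

Each frame can then be passed to whichever spectral analysis is in use (FFT, filter-bank or LPC).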
17Speech Analysis/Coding
- Two general cases
- Waveform coders
- Source (voice) coders (vo-coders)
- Source coders eg linear predictive coding (LPC)
- Model the source ie the vocal tract (VT)
- Linear, time varying model of VT, plus excitation
18 Systems Approach
[Figure: Excitation (Voiced, with pitch f0, or Unvoiced) drives the Vocal Tract Model, which has Time Varying Parameters, to give Speech]
19LPC Analysis/Synthesis
- Synthesis
- Input Excitation
- output Speech
- Analysis
- Input Speech
- output Excitation
20Perfect Analysis/Synthesis
Input sn and output sn are identical (within
arithmetic limits)
21Practical Analysis/Synthesis
22Practical Analysis/Synthesis
- Parameters for Transmission
- Input / Excitation en
- Source model H(z)
- Thus Analysis must derive these parameters, and
- Synthesis must use them to re-generate speech
23Linear Predictive Coding - LPC
- Principle of linear prediction
- The next value (or sample) in a series, ie at time n, is predicted or estimated by a weighted sum of previous values, ie those at time n-1, n-2, ...
- Thus for a predictor of order p, we have
24Linear Prediction
Transforming to the z-domain gives
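Taking the weights in the "weighted sum of previous values" to be the coefficients $a_i$ (an assumption, since sign conventions for $a_i$ differ between texts), the order-p predictor and its z-domain form can be sketched as:

```latex
\hat{s}_n = \sum_{i=1}^{p} a_i \, s_{n-i}
\qquad\Longleftrightarrow\qquad
\hat{S}(z) = \Big( \sum_{i=1}^{p} a_i \, z^{-i} \Big) \, S(z)
```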
25LPC Error Terms
Error is simply difference between predicted and
actual values
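[Figure: the prediction ŝn, formed from sn, is subtracted from the actual sn to give the error en; the overall analysis filter is A(z)]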
26Synthesis
[Figure: excitation en drives the synthesis filter H(z), built from A(z), to give speech sn]
Parameters updated at frame rate
NB hat of approximation omitted for simplicity
27Analysis for Synthesis
- The Analysis and Synthesis must match
- What is needed for the Synthesis?
- Answer: en - the excitation, and H(z) - the system
- Thus the Analysis must derive these terms (from sn)
- The speech signal, sn, is analysed to give en and H(z), ie the A(z) parameters, for transmission.
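A small sketch of matched analysis and synthesis, assuming the predictor form in which en is sn minus a weighted sum of past samples. The function names and the short test signal are hypothetical; the two weights are the second-order values quoted later in the slides.

```python
def analysis(speech, weights):
    """e[n] = s[n] - sum_i weights[i-1] * s[n-i]  (speech in, excitation out)."""
    excitation = []
    for n in range(len(speech)):
        acc = speech[n]
        for i, w in enumerate(weights, start=1):
            if n - i >= 0:
                acc -= w * speech[n - i]
        excitation.append(acc)
    return excitation

def synthesis(excitation, weights):
    """s[n] = e[n] + sum_i weights[i-1] * s[n-i]  (excitation in, speech out)."""
    speech = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for i, w in enumerate(weights, start=1):
            if n - i >= 0:
                acc += w * speech[n - i]
        speech.append(acc)
    return speech

weights = [1.5537, -0.8276]                          # second-order example
s = [0.0, 1.0, 0.5, -0.25, -1.0, -0.5, 0.75, 1.0]    # arbitrary test samples
e = analysis(s, weights)
s_rebuilt = synthesis(e, weights)
max_error = max(abs(x - y) for x, y in zip(s, s_rebuilt))
```

With unquantised en and weights the output matches the input to within arithmetic limits; a practical coder sends only approximations of both.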
28Derivation of LPC Coefficients - A(z)
Recall
where ai are the p prediction coefficients.
The principle behind LPC is to find a set of p coefficients, a1, a2, a3, ... ap, which in some sense minimizes the error signal en over a frame of N speech samples. This leads to a set of p coefficients for each frame.
29Derivation of A(z) (2)
Minimisation of En is achieved by setting the p partial derivatives to zero.
The matrix R is symmetric Toeplitz, offering numerically efficient inversion techniques - Durbin's recursion algorithm being one of the most popular.
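A sketch of Durbin's recursion under the autocorrelation formulation, assuming the predictor convention in which the returned weights multiply past samples directly. The function names are hypothetical, and the test signal is the impulse response of the second-order model quoted later in the slides, so the recursion should recover its two weights.

```python
def autocorrelation(frame, max_lag):
    """r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for predictor weights a_1..a_p.

    Returns (weights, residual_energy) with
    s_hat[n] = sum_i weights[i-1] * s[n-i].
    """
    a = [0.0] * order
    error = r[0]
    for m in range(order):
        # Reflection coefficient for model order m + 1.
        acc = r[m + 1] - sum(a[i] * r[m - i] for i in range(m))
        k = acc / error
        new_a = a[:]
        new_a[m] = k
        for i in range(m):
            new_a[i] = a[i] - k * a[m - 1 - i]
        a = new_a
        error *= (1.0 - k * k)
    return a, error

# Impulse response of a known second-order all-pole model, long enough
# to have decayed to nothing (illustrative test data).
true_a = [1.5537, -0.8276]
h = []
for n in range(400):
    x = 1.0 if n == 0 else 0.0
    for i, w in enumerate(true_a, start=1):
        if n - i >= 0:
            x += w * h[n - i]
    h.append(x)

r = autocorrelation(h, 2)
a, err = levinson_durbin(r, 2)
```

The recursion needs on the order of p squared operations per frame, against p cubed for a general matrix inversion.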
30Derivation of A(z) (3)
- When N is very large, r gives the autocorrelation coefficients of s
- s comes from e convolved with h (excitation and vocal tract)
- we are interested here in separating e and h
- the predictor order, p, is small to reflect the short-term periodicities (formants)
- with higher predictor orders we will get the longer-term periodicities (pitch)
- 2 practical problems with evaluating a
- matrix singularities in R^-1
- unstable resultant H(z)
- in practice both are solved by windowing - shaping the frame - eg Hamming
31Speech Signal Characteristics
- Duration
- Dynamic Range
- Periodicities
- vocal tract
- pitch
- Frame-based Analysis
- frame size: quasi-stationary, capture transition - typically 20 - 30 ms
- frame rate: task dependent - more means more bandwidth/computation - up to 100 frames/second
32Harmonic Structures and Periodicities
- Harmonic Structures and Periodicities give potential for data reduction
- LPC is one way of gaining this compression
- Speech has two obvious separate structures
- vocal tract resonances
- pitch
33Harmonic Structures and Periodicities
[Figure: short term prediction. Voiced or unvoiced excitation en drives the vocal tract filter H(z) to give speech sn. Labels: Tp, p]
34Harmonic Structures and Periodicities
[Figure: long term prediction. Voiced/unvoiced excitation epn drives the pitch filter Hlt(z) to give en, which drives the vocal tract filter Hst(z) to give speech sn. Labels: Tp, P]
35Harmonic Structures and Periodicities
Two Structures: short-term (formants) and long-term - pitch (excitation)
eg 20 ms frame = 160 samples at 8 kHz
ai, eg p = 3
ai, eg p = 10
NB Representations of these parameters are transmitted
36Practical Coding Systems
- Waveform and Source Coders (Vocoders)
- 2 periodicities/redundancies in source
- short-term (formants)
- long-term - pitch
- Excitation en
[Figure: epn -> Hlt(z) -> en -> Hst(z) -> sn]
37Perfect Analysis/Synthesis
Input sn and output sn are identical (within
arithmetic limits)
38Practical System
[Figure: analysis of S(z) gives E(z) and H(z), which are sent as the Transmitted Data Frame; synthesis regenerates sn from en and H(z)]
Input sn and output sn are similar
39Analysis-by-Synthesis LPAS
- Integrated encoder/decoder at the encoder
[Figure: LPAS Encoder. Blocks: Basic decoder, Adaptive encoder, Weighted error; input sn]
40Log Spectral Estimates
- Comparisons between frames are very important in many situations
- log spectral estimates are the most common (though in Comms. an approximation is used to reduce computation)
In Comms, computation is expensive and parameter vector approximations to D are used
41Some Standards
- GSM: European Cellular, RPE-LTP, 13 kb/s
- FS1016: Secure Voice, CELP, 4.8 kb/s
- IS54: NA Cellular, VSELP, 7.95 kb/s
- IS96: QCELP, 1-8 kb/s
- JDC-FR: Japanese Cellular, VSELP, 6.7 kb/s
- JDC-HR: PSI-CELP, 3.67 kb/s
- G.728 (terrestrial): LD-CELP, 16 kb/s
42CELP eg
[Figure: CELP. Transmitted parameters: Short-term coefficients (formants), Long-term coefficients (pitch), CB Index and Gain; filters Hlt(z) and Hst(z); signals en and sn]
Excitation is represented by address, ie CB Index
43CELP eg
[Figure: CELP analysis and synthesis. Transmitted parameters: Short-term coefficients (formants), Long-term coefficients (pitch), CB Index and Gain; filters Hlt(z) and Hst(z); signals en and sn]
Excitation is represented by address, ie CB Index
44Conversion of LPC Parameters
- $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}$ and the $a_i$ are to be transmitted (Txd)
- Line Spectral Frequencies (LSF) present a clever way of representing the LPC coefficients, the $a_i$ of A(z)
- The $a_i$ are floating point numbers and their accuracy is important
- Factorising A(z) tends to give complex roots in the z-domain
- LSFs map these complex roots on to the unit circle
- LSFs
- Lead to efficient coding
- Ensure a minimum phase filter
- Bit errors are spectrum localised, minimising loss of speech quality
45Line Spectral Frequencies
- Consider
- $P(z) = A(z) + z^{-(n+1)} A(z^{-1})$
- and
- $Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$
- then P(z) and Q(z) lead to what is known as LSFs
- Clearly if P(z) and Q(z) are known then A(z) can be found
- $A(z) = (P(z) + Q(z)) / 2$
- Roots of P(z) and Q(z) lie on the unit circle in z-domain. The locations give
- the LSFs
- P(z) and Q(z), and whence A(z)
46LSF Evaluation
- Consider one pair of complex roots, $A_1(z)$
- $A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
- $P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
- $Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
- The roots at $z = -1$ (of $P_1$) and $z = 1$ (of $Q_1$) are discarded
- It follows that the LSFs, $\theta_1$ and $\theta_2$, are given by
- $\cos(\theta_1) = -(a_1 + a_2 - 1)/2$
- and $\cos(\theta_2) = -(a_1 - a_2 + 1)/2$
- Show
- $a_1 = -(\cos(\theta_1) + \cos(\theta_2))$ and
- $a_2 = \cos(\theta_2) - \cos(\theta_1) + 1$
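The second-order relations above can be checked numerically. This is a minimal sketch, assuming $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$ with its roots inside the unit circle so that both cosine arguments lie in [-1, 1]; the function names are hypothetical.

```python
import math

def lpc_to_lsf(a1, a2):
    """LSFs (radians) of A(z) = 1 + a1 z^-1 + a2 z^-2."""
    theta1 = math.acos(-(a1 + a2 - 1.0) / 2.0)   # from the quadratic in P(z)
    theta2 = math.acos(-(a1 - a2 + 1.0) / 2.0)   # from the quadratic in Q(z)
    return theta1, theta2

def lsf_to_lpc(theta1, theta2):
    """Inverse mapping: recover a1 and a2 from the two LSFs."""
    a1 = -(math.cos(theta1) + math.cos(theta2))
    a2 = math.cos(theta2) - math.cos(theta1) + 1.0
    return a1, a2

# First row of the LSF examples table: a1 = 0, a2 = 0.5.
theta1, theta2 = lpc_to_lsf(0.0, 0.5)
a1, a2 = lsf_to_lpc(theta1, theta2)
```

For higher orders there is no closed form, and the roots of P(z) and Q(z) are normally found numerically.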
47LSF Test Example
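- $A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
- $= (z^2 + a_1 z + a_2) z^{-2}$
- $= (z^2 + 2\cos(\theta)\,\omega_n z + \omega_n^2) z^{-2}$
- where $\omega_n$ is the radius and $\theta$ is the angle measured from $\pi$. So the radius is $\omega_n = \sqrt{a_2}$ and the root angle is $\pi - \theta$
- Note in P and Q all $\omega_n^2$ terms (of the multiple 2nd orders) are unity
- EG 1: $a_2 = 1$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$
- roots already on circle and do not move (unstable system, not practical)
- EG 2: $a_1 = 0$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$
- $\cos(\theta_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2$
- so the LSFs are symmetric about $\pi/2$, ie $\omega_s/4$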
48LSF Review Example (1)
LSFs/LSPs are defined as
$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$ and
$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$
thus $A(z) = (P(z) + Q(z)) / 2$
49LSF Review Example (2)
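For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
cf $(s^2 + (2\cos(\theta)\,\omega_n) s + \omega_n^2)$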
50LSF Review Example (3)
For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P(z) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q(z) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
cf $(s^2 + (2\cos(\theta)\,\omega_n) s + \omega_n^2)$
Thus $(a_1 + a_2 - 1) = -2\cos(\theta_1)$ and $(a_1 - a_2 + 1) = -2\cos(\theta_2)$
So, given
i) LPC coeffs., $a_1$ and $a_2$, then the LSFs $\theta_1$ and $\theta_2$ can be found
ii) LSFs, $\theta_1$ and $\theta_2$, then the LPC coeffs. $a_1$ and $a_2$ can be found
51LSF Review Example (4)
For a second order, and with P(z) corresponding to the first root and Q(z) to the second root:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
for the second pair of $\theta_i$, 1.37 and 1.77,
$P(z) = (z^2 - 2\cos(1.37) z + 1)(z + 1) z^{-3} = (z^3 + (1 - 2\cos(1.37)) z^2 + (1 - 2\cos(1.37)) z + 1) z^{-3}$
Likewise
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
$= (z^2 - 2\cos(1.77) z + 1)(z - 1) z^{-3} = (z^3 + (-1 - 2\cos(1.77)) z^2 + (1 + 2\cos(1.77)) z - 1) z^{-3}$
Then
$A(z) = (P(z) + Q(z)) / 2 = (z^3 - (\cos(1.37) + \cos(1.77)) z^2 + (1 - \cos(1.37) + \cos(1.77)) z) z^{-3}$
52LSF Examples
LPC coeffs.        LSFs
a1      a2         θ1            θ2
0       0.5        1.31812       1.82348
-1.8    0.9        0.31756       0.554811
1.8     0.9        π-0.554811    π-0.31756
                   2.2274        2.3743
53Example Bit Allocation
54Codebooks VQ
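[Figure: vector quantisation over time. Each p-element vector is replaced by the index i (0 to N-1) of an entry in a codebook of N = 2^L vectors; an identical book is held at the other end. Data reduction: (p x B) to L]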
55Codebook Compression
- Principle
- representative data sets
- data vector is replaced / represented by nearest vector, chosen from a codebook - a closed set of vectors
- Examples
- LPC parameter sets
- Excitation as in CELP
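The nearest-vector principle can be sketched as follows, assuming squared Euclidean distance as the measure of closeness. The function names and the tiny four-entry codebook are illustrative only.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def vq_encode(vector, codebook):
    """Return the index of the codebook entry nearest to the vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))

def vq_decode(index, codebook):
    """The receiver looks the index up in its identical copy of the codebook."""
    return codebook[index]

# N = 4 entries, so each vector is sent as a 2-bit index.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
index = vq_encode([0.9, 0.2], codebook)
rebuilt = vq_decode(index, codebook)
```

Only the index is transmitted, so the cost per vector depends on the codebook size and not on the vector length.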
56Codebook Compression - CELP
[Figure: codebook of time-domain samples; each entry, identified by its start point, represents y ms of the excitation en]
- en are time domain samples (integers)
- R samples per second (eg 8000 Hz)
- Frame rate governs vector size
- P = 2^j
- Bit rate = j/y bits/ms
57Codebook Compression of H(z)
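[Figure: A(z) at time t, a vector with M elements, is replaced every x ms by the index i of an entry in a codebook of N = 2^k vectors]
- Vector with M elements, every x ms
- Codebook with N = 2^k vectors
- Bit rate = k/x bits per ms (not a function of M)
- In practice A(z) is converted to LSFs.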
58Codebook Generation
1) Initialise: form a single centroid of all training data, N = 1
2) Repeat
     Split centroids: N -> 2N
     Repeat
       Cluster data to nearest centroid
     until convergence
   until N large enough
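A compact sketch of the split-and-cluster loop above. The perturbation used to split each centroid, the iteration cap and the six training vectors are arbitrary illustrative choices, and all function names are hypothetical.

```python
def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def centroid(vectors):
    """Mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest(vector, codebook):
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))

def train_codebook(data, n_bits, perturb=0.01, max_iters=50):
    codebook = [centroid(data)]                   # 1) single centroid, N = 1
    while len(codebook) < 2 ** n_bits:            # until N large enough
        # Split every centroid into a slightly perturbed pair: N -> 2N.
        codebook = [[c * (1.0 + sign * perturb) + sign * perturb for c in entry]
                    for entry in codebook for sign in (1.0, -1.0)]
        for _ in range(max_iters):                # cluster until convergence
            cells = [[] for _ in codebook]
            for v in data:
                cells[nearest(v, codebook)].append(v)
            # An empty cell keeps its old centroid.
            updated = [centroid(cell) if cell else old
                       for cell, old in zip(cells, codebook)]
            if updated == codebook:
                break
            codebook = updated
    return codebook

data = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
        [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]]
codebook = train_codebook(data, n_bits=1)
```

With this data the two centroids settle on the means of the two obvious clusters.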
59VQ Performance on Unseen Data
Ramachandran & Mammone (eds) - Modern Methods of Speech Processing, Kluwer Academic, 1995
60VQ Performance on Unseen Data
Ramachandran & Mammone (eds) - Modern Methods of Speech Processing, Kluwer Academic, 1995
61LPC FFT Spectra
LPC Roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i, 0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758, -0.4200
LSFs
θ2 of Q(z)    θ1 of P(z)
2.3743        2.2274
1.6540        1.5997
0.8261        0.6954
0.6106        0.3937
[Figure: LPC and FFT spectra. Magnitude (dB) against Frequency (kHz), 0 to Fs/2]
62LPC Spectra LSFs
LPC Roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i, 0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758, -0.4200
LSFs
θ2 of Q(z)    θ1 of P(z)
2.3743        2.2274
1.6540        1.5997
0.8261        0.6954
0.6106        0.3937
[Figure: LPC spectrum with the LSFs marked, against Frequency (kHz), 0 to Fs/2]
63LPC FFT Spectra - 2nd Order
A(z) coefficients: 1.5537, -0.8276
Roots: 0.7769 ± 0.4733i
$H(0) = K / (1 - (1.5537 - 0.8276)) = K/0.274$
$H(\omega_s/2) = K / (1 - (-1.5537 - 0.8276)) = K/3.38$
so H(0) is 21.8 dB above $H(\omega_s/2)$
[Figure: time waveform, amplitude (-1 to 1) against Time (ms), 0 to 25.6; LPC and FFT spectra, Magnitude (dB) against Frequency (kHz), 0 to Fs/2]
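The two gains on this slide can be reproduced with a short sketch, assuming $H(z) = K / (1 - (1.5537 z^{-1} - 0.8276 z^{-2}))$ and K = 1. The function name is hypothetical.

```python
import cmath
import math

def all_pole_gain(weights, omega, k=1.0):
    """|H(e^{j omega})| for H(z) = K / (1 - sum_i weights[i-1] z^-i)."""
    denom = 1.0 - sum(w * cmath.exp(-1j * omega * i)
                      for i, w in enumerate(weights, start=1))
    return abs(k / denom)

weights = [1.5537, -0.8276]
gain_dc = all_pole_gain(weights, 0.0)            # omega = 0
gain_nyquist = all_pole_gain(weights, math.pi)   # omega = ws/2
difference_db = 20.0 * math.log10(gain_dc / gain_nyquist)
```

Sweeping omega from 0 to pi with the same function traces out the LPC spectrum.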
64VT Shape Some Vowels - Ladefoged 62
65VT Shape Some Vowels - Ladefoged 62
66GSM
- Groupe Special Mobile - EU
- First digital cellular system in world
- See Hodge 1990
- Based on TDMA and FDMA at 900 MHz, and RPE-LPC (ie it is an LPAS system)
- Now at 1800 MHz
- Carriers at 200 kHz
- Supporting 8 TDMA time slots each
- Time slots 577 µs - 156.25 bit slots
- 8 time slots form 1 GSM frame of 4.62 ms
- Modulation: Gaussian minimum shift keying
- 26 bit training in every time slot
- Round-trip delay 80 ms
- EU: GSM; US: D-AMPS
67Other Related Topics
- Spectral Lifting: $H(z) = (1 - a z^{-1})$
- Codebook Training
- Spectral Differences between 2 frames
- Cepstra
- Modeling Speech Space - HMMs
68Pre-Emphasis Example
69Pre-Emphasis Example
[Figure: z-plane (imaginary axis jy) with the zero at a; distance 1 + a to the point $\omega_s/2$]
$G(\omega_s/2) = 1 + a$, $G(0) = 1 - a$
For $G(\omega_s/2) > G(0)$, a must be > 0
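A minimal sketch of the pre-emphasis filter, assuming the first-order form $G(z) = 1 - a z^{-1}$. The value a = 0.95 and the function name are illustrative assumptions, not taken from the slides.

```python
def pre_emphasise(signal, a=0.95):
    """y[n] = x[n] - a * x[n-1], ie G(z) = 1 - a z^-1."""
    out = []
    previous = 0.0
    for x in signal:
        out.append(x - a * previous)
        previous = x
    return out

a = 0.95
dc = pre_emphasise([1.0] * 8, a)               # lowest frequency, omega = 0
nyquist = pre_emphasise([1.0, -1.0] * 4, a)    # highest frequency, ws/2
```

After the first sample the constant input settles at 1 - a and the alternating input at magnitude 1 + a, matching G(0) and G(ws/2).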
70Z-plane to Magnitude Spectrum
[Figure: z-plane plot, Imaginary Part against Real Part (-1 to 1), and the corresponding magnitude spectrum, Magnitude (dB) against Frequency (kHz), 0 to Fs/2, with 1 + a marked at $\omega_s/2$]