Speech Processing - PowerPoint PPT Presentation

Slides: 71
Provided by: jsdm5

Title: Speech Processing


1
Speech Processing
  • John Mason
  • Engineering
  • University of Wales Swansea

2
Time Sequence Information
3
VT Shape Some Vowels - Ladefoged 62
4
VT Shape Some Vowels - Ladefoged 62
5
Speech Processing - Applications
  • Why?
  • Communications
  • Synthesis
  • Recognition
  • Speech / Speaker
  • How?
  • Frame-based
  • Systems approach

6
Some Books
  • Flanagan - Speech Analysis, Synthesis and
    Perception, Springer-Verlag - a classic!
  • Furui - several books on recognition
  • Parsons - Voice and Speech Processing, McGraw
    Hill - one of the first textbooks on computer
    speech processing
  • O'Shaughnessy - Speech Comms - human and
    machine, Addison-Wesley
  • Rabiner & Juang - Fundamentals of Speech
    Recognition, Prentice Hall, 1993
  • Ramachandran & Mammone (eds) - Modern Methods of
    Speech Processing, Kluwer Academic, 1995

7
Speech Communications
Person-to-Person
Person-to-Machine: speech/speaker recognition
Machine-to-Person: speech synthesis
8
(Electronic) Speech Communications
perhaps separated by long distance (or in time)
9
Telephony & Broadcasting
[Diagram: Acoustic Air Path, then Electronic Link (Transmission Path), then Acoustic Air Path]
10
Speech Comms Telephony
Microphone -> ADC -> Analysis -> Coding -> Transmitter
Receiver -> Decoding -> (re-)Synthesis -> DAC -> Loudspeaker
11
Speech Bit Rates
Approx. bit rate in bps: tens, hundreds, thousands, tens of thousands
[Diagram stages: Acoustic Space, Human Hearing Extraction, Language Decoding, Message Realisation]
12
Criteria in Speech Comms.
  • Quality versus Bit-rate

4 Quality Measures: intelligibility, loudness,
naturalness, ease-of-listening
13
Speech Processing
  • The three main application areas are
  • Speech Comms. (the electronic link)
  • Automatic Speech/Speaker recognition
  • Speech Synthesis
  • Much of the underlying analysis is common, eg
    linear predictive coding

14
What does speech look like?
15
What does speech look like?
Dynamic Range - for flexibility and robustness
Time-varying - to convey information
16
Frame-based Analysis
  • To capture time variations
  • 20-30 ms frames - centi-second labeling
  • spectral analysis
  • FFT
  • Filter-bank
  • Linear Predictive Coding
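As a minimal sketch of frame-based analysis, the function below cuts a signal into fixed-length frames. The 8 kHz sampling rate, the 20 ms frame length and the 100 frames/second rate are example values taken from elsewhere in these slides; the function name and its parameters are hypothetical.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20.0, frames_per_second=100.0):
    """Cut a signal into fixed-length, possibly overlapping, frames."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))  # samples per frame
    hop = int(round(sample_rate_hz / frames_per_second))        # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# One second of (silent) signal at an assumed 8 kHz sampling rate.
signal = [0.0] * 8000
frames = split_into_frames(signal, 8000)
print(len(frames[0]), "samples per frame,", len(frames), "frames")
```

Each frame would then go to the chosen spectral analysis (FFT, filter-bank or LPC).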

17
Speech Analysis/Coding
  • Two general cases
  • Waveform coders
  • Source (voice) coders (vo-coders)
  • Source coders eg linear predictive coding (LPC)
  • Model the source ie the vocal tract (VT)
  • Linear, time varying model of VT, plus excitation

18
Systems Approach
[Diagram: Excitation (Voiced, at f0, or Unvoiced) drives the Vocal Tract Model to give Speech; the model has Time Varying Parameters]
19
LPC Analysis/Synthesis
  • Synthesis
  • Input Excitation
  • output Speech
  • Analysis
  • Input Speech
  • output Excitation

20
Perfect Analysis/Synthesis
Input sn and output sn are identical (within
arithmetic limits)
21
Practical Analysis/Synthesis
22
Practical Analysis/Synthesis
  • Parameters for Transmission
  • Input / Excitation en
  • Source model H(z)
  • Thus Analysis must derive these parameters,
    and
  • Synthesis must use them to re-generate speech

23
Linear Predictive Coding - LPC
  • Principle of linear prediction
  • The next value (or sample) in a series, ie at
    time n, is predicted or estimated by a weighted
    sum of previous values, ie those at time n-1,
    n-2, ...
  • Thus for a predictor of order p, we have
    $\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$

24
Linear Prediction
Transforming to the z-domain gives
$\hat{S}(z) = S(z) \sum_{i=1}^{p} a_i z^{-i}$
25
LPC Error Terms
Error is simply the difference between predicted and
actual values: $e_n = s_n - \hat{s}_n$
[Diagram: sn is filtered by A(z) to give en]
26
Synthesis
[Diagram: en drives H(z) = 1/A(z) to give sn]
Parameters updated at frame rate
NB hat of approximation omitted for simplicity
27
Analysis for Synthesis
  • The Analysis and Synthesis must match
  • what is needed for the Synthesis?
  • Answer en - the excitation and H(z) - the
    system
  • Thus the Analysis must derive these terms (from
    sn )
  • The speech signal, sn is analysed to give en and
    H(z) ie A(z) parameters for transmission.

28
Derivation of LPC Coefficients - A(z)
Recall $\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$,
where $a_i$ are the p prediction coefficients. The
principle behind LPC is to find a set of p
coefficients, a1, a2, a3, ... ap, which in some
sense minimizes the error signal en over a
frame of N speech samples. This leads to a set of p
coefficients for each frame.
29
Derivation of A(z) (2)
Minimisation of En is achieved by setting the p
partial derivatives to zero, giving the p
simultaneous equations $R\,a = r$, ie $a = R^{-1} r$.
The matrix R is symmetric Toeplitz, offering
numerically efficient inversion techniques,
Durbin's recursion algorithm being one of the
most popular.
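A minimal sketch of Durbin's recursion follows. It assumes the predictor convention $\hat{s}_n = \sum_i a_i s_{n-i}$ and takes the autocorrelation values r[0..p] as input; the function name is hypothetical and nothing here is taken from the original slides beyond the name of the algorithm.

```python
def levinson_durbin(r, p):
    """Solve the symmetric Toeplitz normal equations R a = r.

    r holds autocorrelation values r[0..p]. Returns (a, err), where
    a[i-1] multiplies s[n-i] in the predictor and err is the
    remaining prediction-error energy.
    """
    a = []
    err = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient for order m.
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= (1.0 - k * k)
    return a, err


# Autocorrelation of an ideal 2nd-order process with known coefficients.
a1, a2 = 1.5537, -0.8276
r0 = 1.0
r1 = a1 * r0 / (1.0 - a2)
r2 = a1 * r1 + a2 * r0
coeffs, residual = levinson_durbin([r0, r1, r2], 2)
print(coeffs)
```

The recursion needs on the order of p squared operations, against p cubed for a general matrix inversion.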
30
Derivation of A(z) (3)
  • When N is very large, r holds the autocorrelation
    coefficients of s
  • s comes from e convolved with h (excitation &
    vocal tract)
  • we are interested here in separating e and h
  • the predictor order, p, is small to reflect the
    short-term periodicities (formants)
  • with higher predictor orders we will get the
    longer-term periodicities (pitch)
  • 2 practical problems with evaluating a
  • matrix singularities in $R^{-1}$
  • unstable resultant H(z)
  • in practice both are solved by windowing, ie
    shaping the frame, eg with a Hamming window
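The windowing step can be sketched as below: a Hamming window shapes the frame before the autocorrelation values r are computed. The 160-sample frame, the 200 Hz test tone and the maximum lag of 10 are illustrative assumptions, and the helper names are hypothetical.

```python
import math


def hamming(n_samples):
    """Hamming window: 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def autocorrelation(frame, max_lag):
    """r[lag] = sum over n of frame[n] * frame[n - lag], within the frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]


# A 20 ms frame of a 200 Hz tone at an assumed 8 kHz sampling rate.
frame = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(160)]
window = hamming(160)
windowed = [x * w for x, w in zip(frame, window)]
r = autocorrelation(windowed, 10)
print(r[0], r[1])
```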

31
Speech Signal Characteristics
  • Duration
  • Dynamic Range
  • Periodicities
  • vocal tract
  • pitch
  • Frame-based Analysis
  • frame size: quasi-stationary, yet able to capture
    transitions; typically 20 - 30 ms
  • frame rate: task dependent; more means more
    bandwidth/computation - up to 100
    frames/second

32
Harmonic Structures and Periodicities
  • Harmonic Structures & Periodicities give
    potential for data reduction
  • LPC is one way of gaining this compression
  • Speech has two obvious separate structures
  • vocal tract resonances
  • pitch

33
Harmonic Structures and Periodicities
[Diagram: short-term prediction; excitation en (voiced or unvoiced) and speech sn either side of H(z), the vocal tract (short-term) filter of order p; pitch period Tp]
34
Harmonic Structures and Periodicities
[Diagram: long-term prediction; blocks Hlt(z) (pitch, order P) and Hst(z) (vocal tract); signals epn (voiced / unvoiced excitation), en and sn (speech); pitch period Tp]
35
Harmonic Structures and Periodicities
Two Structures:
short-term (formants)
long-term - pitch (excitation)
eg 20 ms frame = 160 samples at 8 kHz
ai, eg p = 3
ai, eg p = 10
NB Representations of these parameters are
transmitted
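The long-term (pitch) periodicity can be illustrated by searching for the lag with the strongest normalised autocorrelation. This is only one common way of estimating a pitch lag and is not stated in the slides; the lag range, the 30 ms test frame and the function name are assumptions.

```python
import math


def estimate_pitch_lag(frame, min_lag, max_lag):
    """Return the lag (in samples) with the largest normalised autocorrelation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        num = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        energy_late = sum(x * x for x in frame[lag:])
        energy_early = sum(x * x for x in frame[:len(frame) - lag])
        den = math.sqrt(energy_late * energy_early)
        score = num / den if den > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


SAMPLE_RATE_HZ = 8000
# Impulse train with a period of 80 samples, ie 100 Hz at 8 kHz.
frame = [1.0 if n % 80 == 0 else 0.0 for n in range(240)]
lag = estimate_pitch_lag(frame, 20, 147)
f0_hz = SAMPLE_RATE_HZ / lag
print(lag, f0_hz)
```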
36
Practical Coding Systems
  • Waveform & Source Coders (Vocoders)
  • 2 periodicities/redundancies in source
  • short-term (formants)
  • long-term - pitch
  • Excitation en

[Diagram: signals en, epn and sn with blocks Hlt(z) and Hst(z)]
37
Perfect Analysis/Synthesis
Input sn and output sn are identical (within
arithmetic limits)
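A minimal sketch of this round trip, assuming the predictor convention $\hat{s}_n = \sum_i a_i s_{n-i}$ so that analysis is $e_n = s_n - \hat{s}_n$ and synthesis is its exact inverse. The coefficient values are borrowed from the 2nd-order example later in the slides; the function names and the test signal are hypothetical.

```python
import math


def analysis_filter(s, a):
    """Inverse filter A(z): e[n] = s[n] - sum_i a[i-1] * s[n-i]."""
    e = []
    for n in range(len(s)):
        pred = sum(a[i] * s[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        e.append(s[n] - pred)
    return e


def synthesis_filter(e, a):
    """All-pole filter H(z) = 1/A(z): s[n] = e[n] + sum_i a[i-1] * s[n-i]."""
    s = []
    for n in range(len(e)):
        pred = sum(a[i] * s[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        s.append(e[n] + pred)
    return s


a = [1.5537, -0.8276]                               # example predictor coefficients
speech = [math.sin(0.3 * n) for n in range(50)]     # stand-in for a speech frame
excitation = analysis_filter(speech, a)
rebuilt = synthesis_filter(excitation, a)
max_error = max(abs(x - y) for x, y in zip(speech, rebuilt))
print(max_error)
```

With the unquantised excitation and coefficients the output matches the input to within rounding; a practical coder transmits approximations of both, so the output is only similar.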
38
Practical System
[Diagram: S(z) is analysed into E(z) and H(z), which form the Transmitted Data Frame; sn is re-synthesised from en]
Input sn and output sn are similar
39
Analysis-by-Synthesis LPAS
  • Integrated encoder & decoder at the encoder

[Diagram: LPAS Encoder; the output of a Basic decoder is subtracted from sn, and the Weighted error drives the Adaptive encoder]
40
Log Spectral Estimates
  • Comparisons between frames are very important in
    many situations
  • log spectral estimates are the most common
    (though in Comms an approximation is used to
    reduce computation)

In Comms, computation is expensive and parameter
vector approximations to the spectral distance D are used
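The slide's definition of D did not survive extraction. As an illustration only, the sketch below uses one common choice, the RMS difference between two log-magnitude LPC spectra on a uniform frequency grid. The convention A(z) = 1 - sum a_i z^-i, the grid size and both function names are assumptions.

```python
import cmath
import math


def lpc_log_spectrum(a, n_points=128):
    """Log magnitude (dB) of H = 1/A on a grid from 0 to pi."""
    spectrum = []
    for k in range(n_points):
        w = math.pi * k / (n_points - 1)
        big_a = 1.0 - sum(ai * cmath.exp(-1j * w * (i + 1)) for i, ai in enumerate(a))
        spectrum.append(-20.0 * math.log10(abs(big_a)))
    return spectrum


def log_spectral_distance(a_x, a_y, n_points=128):
    """RMS difference (dB) between the log spectra of two LPC frames."""
    sx = lpc_log_spectrum(a_x, n_points)
    sy = lpc_log_spectrum(a_y, n_points)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sx, sy)) / n_points)


frame_1 = [1.5537, -0.8276]   # example predictor coefficients
frame_2 = [1.8, -0.9]
d_same = log_spectral_distance(frame_1, frame_1)
d_diff = log_spectral_distance(frame_1, frame_2)
print(d_same, d_diff)
```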
41
Some Standards
  • GSM: European Cellular, RPE-LTP, 13 kb/s
  • FS1016: Secure Voice, CELP, 4.8 kb/s
  • IS54: NA Cellular, VSELP, 7.95 kb/s
  • IS96: QCELP, 1-8 kb/s
  • JDC-FR: Japanese Cellular, VSELP, 6.7 kb/s
  • JDC-HR: PSI-CELP, 3.67 kb/s
  • G.728: (terrestrial), LD-CELP, 16 kb/s

42
CELP eg
Short-term coefficients (formants)
Long-term coefficients (pitch)
CB Index
Gain
[Diagram: excitation en, filters Hlt(z) and Hst(z), output sn]
Excitation is represented by an address, ie the CB
Index
43
CELP eg
Short-term coefficients (formants)
Long-term coefficients (pitch)
CB Index
Gain
[Diagram: encoder and decoder; signals sn and en, filters Hlt(z) and Hst(z)]
Excitation is represented by an address, ie the CB
Index
44
Conversion of LPC Parameters
  • $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}$,
    and the $a_i$ are to be transmitted (Txd)
  • Line Spectral Frequencies (LSF) present a clever
    way of representing the LPC coefficients, the
    $a_i$ of A(z)
  • The $a_i$ are floating point numbers and their
    accuracy is important
  • Factorising A(z) tends to give complex roots in
    the z-domain
  • LSFs map these complex roots on to the unit
    circle
  • LSFs
  • Lead to efficient coding
  • Ensure a minimum phase filter
  • Bit errors are spectrum localised minimising
    loss of speech quality

45
Line Spectral Frequencies
  • Consider
  • $P(z) = A(z) + z^{-(n+1)} A(z^{-1})$
  • and
  • $Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$
  • then P(z) and Q(z) lead to what are known as
    LSFs
  • Clearly if P(z) and Q(z) are known then A(z) can
    be found
  • $A(z) = [P(z) + Q(z)] / 2$
  • Roots of P(z) and Q(z) lie on the unit circle in
    the z-domain. The locations give
  • the LSFs
  • P(z) and Q(z), and whence A(z)

46
LSF Evaluation
  • Consider one pair of complex roots, A1(z)
  • $A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
  • $P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3} (1 + a_1 z + a_2 z^{2})$
  • $= (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
  • $Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3} (1 + a_1 z + a_2 z^{2})$
  • $= (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
  • The roots at frequencies 0 and $\pi$ (ie $z = 1$ and $z = -1$) are discarded
  • It follows that the LSFs, $\theta_1$ & $\theta_2$, are given
    by
  • $\cos(\theta_1) = -(a_1 + a_2 - 1)/2$
  • and $\cos(\theta_2) = -(a_1 - a_2 + 1)/2$
  • Show
  • $a_1 = -(\cos(\theta_1) + \cos(\theta_2))$ and
  • $a_2 = \cos(\theta_2) - \cos(\theta_1) + 1$
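The second-order relations can be checked numerically with the sketch below, which assumes $A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$ and a stable filter, so that both cosine arguments lie in [-1, 1]. The function names are hypothetical; the test values are the ones listed in the LSF Examples table.

```python
import math


def lpc2_to_lsf(a1, a2):
    """LSFs (radians) of a 2nd-order A(z) = 1 + a1 z^-1 + a2 z^-2."""
    theta1 = math.acos(-(a1 + a2 - 1.0) / 2.0)   # from P(z)
    theta2 = math.acos(-(a1 - a2 + 1.0) / 2.0)   # from Q(z)
    return theta1, theta2


def lsf_to_lpc2(theta1, theta2):
    """Inverse mapping: recover a1 and a2 from the two LSFs."""
    a1 = -(math.cos(theta1) + math.cos(theta2))
    a2 = math.cos(theta2) - math.cos(theta1) + 1.0
    return a1, a2


theta1, theta2 = lpc2_to_lsf(-1.8, 0.9)
a1_back, a2_back = lsf_to_lpc2(theta1, theta2)
print(theta1, theta2, a1_back, a2_back)
```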

47
LSF Test Example
  • $A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
  • $= (z^2 + a_1 z + a_2) z^{-2}$
  • $= (z^2 + 2\cos(\theta)\,\omega_n z + \omega_n^2) z^{-2}$
  • where $\omega_n$ is the radius and $\theta$ is the angle from $\pi$. So
    the radius is $\sqrt{a_2}$ and the angle from 0 is $\pi - \theta$
  • Note: in P & Q all $\omega_n^2$ terms (of the
    multiple 2nd-order factors) are unity
  • EG 1: $a_2 = 1$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$
  • roots already on circle and do not move (unstable
    system, not practical)
  • EG 2: $a_1 = 0$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$
  • $\cos(\theta_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2$
  • so the LSFs are symmetric about $\pi/2$ (ie Fs/4)

48
LSF Review Example (1)
LSFs/LSPs are defined via
$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$ and
$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$, thus
$A(z) = [P(z) + Q(z)] / 2$
49
LSF Review Example (2)
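For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2}) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2}) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
cf $(s^2 + 2\cos(\theta)\,\omega_n s + \omega_n^2)$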
50
LSF Review Example (3)
For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q(z) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
cf $(s^2 + 2\cos(\theta)\,\omega_n s + \omega_n^2)$
Thus $(a_1 + a_2 - 1) = -2\cos(\theta_1)$ and $(a_1 - a_2 + 1) = -2\cos(\theta_2)$
So, given
i) LPC coeffs. $a_1$ and $a_2$, then the LSFs $\theta_1$ & $\theta_2$ can be found
ii) LSFs $\theta_1$ & $\theta_2$, then the LPC coeffs. $a_1$ and $a_2$ can be found
51
LSF Review Example (4)
For a second order and with P(z) corresponding to
the first root, Q(z) to the second root, and so
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2}) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
for the second pair of $\theta_i$, 1.37 and 1.77:
$P(z) = (z^2 - 2\cos(1.37) z + 1)(z + 1) z^{-3} = (z^3 + (1 - 2\cos(1.37)) z^2 + (1 - 2\cos(1.37)) z + 1) z^{-3}$
Likewise
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2}) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
$= (z^2 - 2\cos(1.77) z + 1)(z - 1) z^{-3} = (z^3 + (-1 - 2\cos(1.77)) z^2 + (1 + 2\cos(1.77)) z - 1) z^{-3}$
Then
$A(z) = [P(z) + Q(z)] / 2 = (z^3 - (\cos(1.37) + \cos(1.77)) z^2 + (1 - \cos(1.37) + \cos(1.77)) z) z^{-3}$
52
LSF Examples
LPC coeffs.       LSFs
a1      a2        $\theta_1$        $\theta_2$
0       0.5       1.31812           1.82348
-1.8    0.9       0.31756           0.554811
1.8     0.9       $\pi$-0.554811    $\pi$-0.31756
                  2.2274            2.3743
53
Example Bit Allocation
54
Codebooks VQ
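[Diagram: each vector of p parameters (over time) is replaced by an index i (0 to N-1) into a codebook of $N = 2^L$ vectors; an identical book recovers a vector of p parameters]
Data reduction: (p x B) to L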
55
Codebook Compression
  • Principle
  • representative data sets
  • data vector is replaced / represented by
    nearest vector, chosen from a codebook - a
    closed set of vectors
  • Examples
  • LPC parameter sets
  • Excitation as in CELP
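The principle can be sketched as a nearest-neighbour search: the encoder sends only the index of the closest codebook vector, and the decoder looks that index up in an identical codebook. The squared-Euclidean distance, the four-entry codebook and the function names are illustrative assumptions.

```python
def vq_encode(vector, codebook):
    """Return the index of the codebook vector nearest to `vector`."""
    def squared_distance(index):
        return sum((v - c) ** 2 for v, c in zip(vector, codebook[index]))
    return min(range(len(codebook)), key=squared_distance)


def vq_decode(index, codebook):
    """Look the index up in the (identical) codebook at the receiver."""
    return codebook[index]


# A toy codebook of N = 4 = 2^2 two-element vectors: 2 bits per vector.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [0.0, 4.0]]
index = vq_encode([0.9, 1.2], codebook)
approximation = vq_decode(index, codebook)
print(index, approximation)
```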

56
Codebook Compression - CELP
Codebook of time-domain samples
[Diagram: codebook entries of y ms of en, each identified by its start point]
en are time-domain samples (integers)
R samples per second (eg 8000 Hz)
Frame rate governs vector size
$P = 2^j$
Bit rate = j/y bits/ms
57
Codebook Compression of H(z)
[Diagram: A(z) at time t, a vector of M elements every x ms, is replaced by an index i into a codebook of $N = 2^k$ vectors]
Vector with M elements, every x ms
Codebook with $N = 2^k$ vectors
Bit rate = k/x bits per ms (not a function of M)
In practice A(z) is converted to LSFs.
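The bit-rate arithmetic can be made concrete with a small sketch. The values k = 10 bits and x = 20 ms are hypothetical examples, not figures from the slides.

```python
def codebook_size(k_bits):
    """Number of codebook vectors addressable with a k-bit index."""
    return 2 ** k_bits


def bit_rate_bits_per_ms(k_bits, x_ms):
    """One k-bit index every x ms; the vector length M does not appear."""
    return k_bits / x_ms


k_bits, x_ms = 10, 20.0                      # hypothetical example values
n_vectors = codebook_size(k_bits)
rate_per_ms = bit_rate_bits_per_ms(k_bits, x_ms)
rate_per_second = rate_per_ms * 1000.0
print(n_vectors, rate_per_ms, rate_per_second)
```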
58
Codebook Generation
1) Initialise: form a single centroid of all
   training data, N = 1
2) Repeat
     Split centroids, N -> 2N
     Repeat
       Cluster data to nearest centroid
     until convergence
   until N large enough
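The splitting procedure can be sketched as below. The additive perturbation eps used to split each centroid, the iteration cap and the toy training data are assumptions; the slide does not say how the split is made.

```python
def nearest_index(vector, codebook):
    """Index of the centroid nearest to `vector` (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])))


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train_codebook(data, n_bits, eps=0.01, max_iters=50):
    codebook = [centroid(data)]                    # 1) single centroid, N = 1
    while len(codebook) < 2 ** n_bits:             # until N large enough
        codebook = [[c + d for c in cv]            # split centroids: N -> 2N
                    for cv in codebook for d in (eps, -eps)]
        for _ in range(max_iters):                 # cluster until convergence
            clusters = [[] for _ in codebook]
            for vector in data:
                clusters[nearest_index(vector, codebook)].append(vector)
            updated = [centroid(cl) if cl else cv  # keep a centroid with no data
                       for cl, cv in zip(clusters, codebook)]
            if updated == codebook:
                break
            codebook = updated
    return codebook


data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
codebook = train_codebook(data, n_bits=1)
print(sorted(codebook))
```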
59
VQ Performance on Unseen Data
Ramachandran & Mammone (eds), Modern Methods of
Speech Processing, Kluwer Academic, 1995
60
VQ Performance on Unseen Data
Ramachandran & Mammone (eds), Modern Methods of
Speech Processing, Kluwer Academic, 1995
61
LPC & FFT Spectra
LPC Roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i,
0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758,
-0.4200
LSFs
$\theta_2$ of Q(z)   $\theta_1$ of P(z)
2.3743          2.2274
1.6540          1.5997
0.8261          0.6954
0.6106          0.3937
[Figure: LPC and FFT spectra; Magnitude (dB), -40 to 40, against Frequency (kHz), 0 to Fs/2]
62
LPC Spectra & LSFs
[Figure: LPC spectra and LSFs, for the same LPC roots and LSFs as on the previous slide; Frequency (kHz), 0 to Fs/2]
63
LPC & FFT Spectra - 2nd Order
A(z): 1.5537, -0.8276; Roots 0.7769 ± 0.4733i
H(0) = K / (1 - (1.5537 - 0.8276)) = K/0.274, 21.8 dB
H($\omega_s$/2) = K / (1 - (-1.5537 - 0.8276)) = K/3.38
[Figure: waveform, amplitude -1 to 1, against Time (ms), 0 to 25.6]
[Figure: Magnitude (dB), -40 to 40, against Frequency (kHz), 0 to Fs/2]
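The two end-point values of this 2nd-order example can be reproduced with the sketch below. It assumes the listed numbers are predictor coefficients, so that A(z) = 1 - (1.5537 z^-1 - 0.8276 z^-2), and takes K = 1; the function name is hypothetical.

```python
import cmath
import math


def all_pole_gain(a, omega, gain_k=1.0):
    """|H| = K / |A| at frequency omega (radians), A(z) = 1 - sum a_i z^-i."""
    big_a = 1.0 - sum(ai * cmath.exp(-1j * omega * (i + 1)) for i, ai in enumerate(a))
    return gain_k / abs(big_a)


a = [1.5537, -0.8276]
gain_dc = all_pole_gain(a, 0.0)           # K / 0.2739
gain_nyquist = all_pole_gain(a, math.pi)  # K / 3.3813
print(1.0 / gain_dc, 1.0 / gain_nyquist)
```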
64
VT Shape Some Vowels - Ladefoged 62
65
VT Shape Some Vowels - Ladefoged 62
66
GSM
  • Groupe Special Mobile - EU
  • First digital cellular system in the world
  • See Hodge 1990
  • Based on TDMA & FDMA at 900 MHz, and RPE-LPC (ie
    it is an LPAS system)
  • Now at 1800 MHz
  • Carriers at 200 kHz
  • Supporting 8 TDMA time slots each
  • Time slots 577 µs - 156.25 bit slots
  • 8 time slots form 1 GSM frame of 4.62 ms
  • Modulation: Gaussian minimum shift keying
  • 26 bit training in every time slot
  • Round-trip delay 80 ms
  • EU: GSM; US: D-AMPS
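The frame timing can be checked with a little arithmetic, taking the slot length as 577 µs and 156.25 bit periods per slot. The gross bit rate per carrier is derived here and is not quoted on the slide; the constant names are hypothetical.

```python
SLOT_US = 577.0          # time-slot duration in microseconds (rounded)
BITS_PER_SLOT = 156.25   # bit periods per time slot
SLOTS_PER_FRAME = 8      # TDMA time slots per GSM frame

frame_ms = SLOTS_PER_FRAME * SLOT_US / 1000.0       # about 4.62 ms
bit_period_us = SLOT_US / BITS_PER_SLOT             # about 3.69 microseconds
gross_rate_kbps = BITS_PER_SLOT / SLOT_US * 1000.0  # about 271 kb/s per carrier

print(frame_ms, bit_period_us, gross_rate_kbps)
```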

67
Other Related Topics
  • Spectral Lifting: $H(z) = (1 - a z^{-1})$
  • Codebook Training
  • Spectral Differences between 2 frames
  • Cepstra
  • Modeling Speech Space - HMMs
68
Pre-Emphasis Example
69
Pre-Emphasis Example
[Figure: z-plane (imaginary axis jy) with a marked, and the response up to $\omega_s/2$]
$G(\omega_s/2) = 1 + a$, $G(0) = 1 - a$
For $G(\omega_s/2) > G(0)$, a must be > 0
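A minimal sketch of such a first-order filter, assuming the form G(z) = 1 - a z^-1, so that the gain is 1 - a at zero frequency and 1 + a at half the sampling frequency. The default a = 0.95 and the function name are assumptions for illustration.

```python
def pre_emphasis(x, a=0.95):
    """y[n] = x[n] - a * x[n-1]; the first sample is passed through unchanged."""
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


# A constant (zero-frequency) input settles at 1 - a; an alternating
# input (half the sampling frequency) settles at magnitude 1 + a.
dc_out = pre_emphasis([1.0, 1.0, 1.0, 1.0], a=0.5)
nyquist_out = pre_emphasis([1.0, -1.0, 1.0, -1.0], a=0.5)
print(dc_out, nyquist_out)
```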
70
Z-plane to Magnitude Spectrum
[Figure: z-plane plot, Imaginary Part against Real Part (-1 to 1), and the corresponding spectrum, Magnitude (dB) against Frequency (kHz), 0 to Fs/2]