Speech Compression

Slides: 55

Provided by: yjs8

Transcript and Presenter's Notes

1
Speech Compression
2
Quick Overview
Spch Comp

Simple coders
G.711 (A-law, μ-law)
Delta
ADPCM
CELP coders
LPC-10
RELP/GSM
CELP

Other methods
MBE
MELP
STC
Waveform Interpolation

3
Encoder Criteria

Encoders can be compared in many ways; the most important are:
Bit rate (Kbps)
Speech quality (MOS)
Delay (algorithmic: frame + lookahead, plus computational and propagation delay)
Computational complexity
Often less important:
Bit exactness (interoperability)
Transcoding robustness
Behavior on non-speech (babble noise, tones, music)
Bit error robustness

4
PSTN Quality Coders

Rate        ITU-T encoder
128 Kbps    16-bit linear sampling
64 Kbps     G.711 A-law/μ-law, 8-bit log sampling
32 Kbps     G.726 ADPCM
16 Kbps     G.728 LD-CELP
8 Kbps      G.729 CS-ACELP
4 Kbps      SG16Q21 ???
Toll-quality MOS rating, but higher delay

5
Digital Cellular Standards
6
Military / Satellite Standards
7

Simple coders

8
G.711

16-bit linear sampling at 8 KHz means 128 Kbps
Minimal toll-quality linear sampling is 12 bit (96 Kbps)
8-bit linear sampling (256 levels) is noticeably noisy
Due to the prevalence of low amplitudes and the logarithmic response of the ear, we can use logarithmic sampling
Different standards are used in different places

9
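G.711 - cont.

North America: μ-law with μ = 255
$y = \operatorname{sgn}(x)\,\frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}$
Rest of world: A-law with A = 87.56
$y = \operatorname{sgn}(x)\,\frac{A |x|}{1 + \ln A}$ for $|x| < 1/A$, and $y = \operatorname{sgn}(x)\,\frac{1 + \ln(A |x|)}{1 + \ln A}$ for $1/A \le |x| \le 1$
Although they look very different, the two curves are nearly identical
G.711 approximates these expressions by 16 staircase straight-line segments (8 negative and 8 positive)
μ-law has a horizontal segment through the origin, A-law a vertical segment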
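As a rough illustration of logarithmic sampling, here is a minimal Python sketch of the continuous μ-law curve with μ = 255. It is not the 16-segment approximation that G.711 actually specifies, and the function names and the uniform 8-bit quantizer (`encode8`, `decode8`) are assumptions made for illustration only.

```python
import math

MU = 255.0  # mu-law parameter quoted on the slide for North America


def mu_compress(x: float, mu: float = MU) -> float:
    """Continuous mu-law compression of a sample x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y: float, mu: float = MU) -> float:
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def encode8(x: float) -> int:
    """Compress, then quantize uniformly to a signed 8-bit code (illustrative)."""
    return round(mu_compress(x) * 127)


def decode8(code: int) -> float:
    """Undo encode8 up to quantization error."""
    return mu_expand(code / 127)
```

For a small sample such as 0.01, `decode8(encode8(0.01))` stays much closer to the input than a uniform 8-bit quantizer with the same number of levels would, which is the point of companding: low amplitudes get finer resolution.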
10
DPCM

Due to the low-pass character of speech, differences are usually smaller than signal values, and hence require fewer bits to quantize
Simplest Delta-PCM (DPCM): quantize the first difference signal $\Delta_n = s_n - s_{n-1}$
Delta-PCM: quantize the difference between the signal and a prediction
$\hat{s}_n = p(s_{n-1}, s_{n-2}, \ldots, s_{n-N}) = \sum_i p_i s_{n-i}$
If we predict using a linear combination (FIR filter), this is linear prediction
Delta modulation (DM): use only the sign of the difference (1-bit DPCM)
Sigma-delta (1 bit): oversample, DM, trade off rate for bits
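To make the "differences are smaller" claim concrete, the following Python sketch compares the peak of a low-frequency tone with the peak of its first difference. The 200 Hz tone, the 8 KHz sampling rate and the helper names are assumptions chosen for illustration.

```python
import math


def first_difference(signal):
    """Return d[n] = s[n] - s[n-1], taking s[-1] as 0."""
    previous = 0.0
    out = []
    for s in signal:
        out.append(s - previous)
        previous = s
    return out


def peak(values):
    """Largest absolute value in a sequence."""
    return max(abs(v) for v in values)


# Hypothetical test signal: a 200 Hz tone sampled at 8 KHz.
signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
diff = first_difference(signal)

# Each halving of the peak saves roughly one bit at equal step size.
bits_saved = math.log2(peak(signal) / peak(diff))
```

For this low-pass example the difference signal peaks at about 0.16 instead of 1, so a quantizer with the same step size needs between two and three fewer bits. A signal with strong high-frequency content would not show this gain.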
11
DPCM with prediction

If the linear prediction works well, then the prediction error
$e_n = s_n - \hat{s}_n$
will be lower in energy and whiter than $s_n$ itself!
Only the error is needed for reconstruction, since the predictable portion can be predicted:
$s_n = \hat{s}_n + e_n$
[Figure: block diagram in which $s_n$ feeds a prediction filter]
12
DPCM - post-filtering

Simplest case: if the signal is highly oversampled, then the previous sample $s_{n-1}$ predicts $s_n$ well, so we can use DM:
if $\operatorname{sgn}(e_n) < 0$ then send $-\Delta$, else $+\Delta$
For DM there is no way to encode zero prediction error, so the decoded signal oscillates wildly
The standard remedy is a post-filter that low-pass filters this noise
But there is a bigger problem!

13
Open-loop Prediction

The encoder's linear predictor is also present in the decoder, but there it runs as feedback
The decoder's predictions would be accurate given the precise error $e_n$,
but it receives the quantized error $\tilde{e}_n$, and the two models diverge!

14
Side Information

There are two ways to solve the problem ...
The first way is to send the prediction
coefficients
from the encoder to the decoder
and not to let the decoder derive them
The coefficients sent are called side-information
Using side-information means higher bit-rate
(since both en and coefficients must be sent)
The second way does not require increasing bit
rate

15
Closed-loop Prediction

To ensure that the encoder and decoder stay in sync, we put the decoder into the encoder
Thus the encoder's predictions are identical to the decoder's, and no model difference accumulates
[Figure: closed-loop encoder and decoder built from a quantizer (Q), inverse quantizers (IQ) and PF blocks, with signals $s_n$ and $e_n$]
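The difference between open-loop and closed-loop prediction can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any standard's algorithm: the predictor is simply the previous sample, the quantizer is uniform with a hypothetical step of 1.0, and all function names are invented for illustration.

```python
def quantize(e, step):
    """Uniform quantizer: nearest multiple of step."""
    return step * round(e / step)


def dpcm_open_loop(signal, step):
    """Encoder predicts from the clean previous sample; the decoder cannot."""
    recon, prev_clean, prev_recon = [], 0.0, 0.0
    for s in signal:
        eq = quantize(s - prev_clean, step)  # encoder: error vs clean history
        prev_clean = s
        prev_recon = prev_recon + eq         # decoder: only has its own output
        recon.append(prev_recon)
    return recon


def dpcm_closed_loop(signal, step):
    """Encoder predicts from the decoder's reconstruction (decoder in encoder)."""
    recon, prev_recon = [], 0.0
    for s in signal:
        eq = quantize(s - prev_recon, step)  # error vs what the decoder will have
        prev_recon = prev_recon + eq
        recon.append(prev_recon)
    return recon


def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical input: a slow ramp whose increments are below half a step.
ramp = [0.3 * n for n in range(50)]
open_err = max_error(ramp, dpcm_open_loop(ramp, 1.0))
closed_err = max_error(ramp, dpcm_closed_loop(ramp, 1.0))
```

On this ramp every open-loop difference quantizes to zero, so the decoder never moves and the error grows with the signal, while the closed-loop error stays within half a quantizer step.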
16
Two types of error

For DM there are two types of error (depending on step size)
[Figure: three panels showing Δ too small (slope overload), Δ OK, and Δ too large (granular noise)]
17
Adaptive Step Size

Speech signals are very nonstationary
We need to adapt the step size to match signal behavior:
Increase Δ when the signal changes rapidly
Decrease Δ when the signal is relatively constant
Simplest method (for DM only):
If the present bit is the same as the previous one, multiply Δ by K (K = 1.5)
If the present bit is different, divide Δ by K
Constrain Δ to a predefined range
More general method:
Collect N samples in a buffer (N = 128 ... 512)
Compute the standard deviation in the buffer
Set Δ to a fraction of the standard deviation
Send Δ to the decoder as side-information, or
use backward adaptation (closed-loop Δ computation)
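The simplest method above (multiply or divide the step by K depending on whether successive bits agree) can be sketched as follows. The initial step, the allowed range and the function names are assumptions for illustration; only K = 1.5 comes from the slide. Because the step is derived from the transmitted bits alone, the decoder repeats the same adaptation without side-information.

```python
def _adapt(step, bit, prev_bit, k, step_min, step_max):
    """Grow the step when two successive bits agree, shrink it otherwise."""
    if prev_bit is None:
        return step
    step = step * k if bit == prev_bit else step / k
    return min(step_max, max(step_min, step))


def adm_encode(signal, step0=0.1, k=1.5, step_min=0.01, step_max=1.0):
    """Adaptive delta modulation: returns the bit stream and the local decode."""
    bits, recon = [], []
    step, estimate, prev_bit = step0, 0.0, None
    for s in signal:
        bit = 1 if s >= estimate else 0
        step = _adapt(step, bit, prev_bit, k, step_min, step_max)
        estimate += step if bit else -step
        bits.append(bit)
        recon.append(estimate)
        prev_bit = bit
    return bits, recon


def adm_decode(bits, step0=0.1, k=1.5, step_min=0.01, step_max=1.0):
    """Decoder: same step adaptation, driven only by the received bits."""
    recon = []
    step, estimate, prev_bit = step0, 0.0, None
    for bit in bits:
        step = _adapt(step, bit, prev_bit, k, step_min, step_max)
        estimate += step if bit else -step
        recon.append(estimate)
        prev_bit = bit
    return recon
```

Feeding a step input such as `[0.0] * 5 + [1.0] * 30` shows the step growing while the signal jumps and shrinking again once the estimate starts to oscillate around the new level.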

18
ADPCM

G.726 has:
Adaptive predictor
Adaptive quantizer and inverse quantizer
Adaptation speed control
Tone and transition detector
Mechanism to prevent loss from tandeming
Computational complexity is relatively high (10 MIPS)
24 and 16 Kbps modes are defined, but are not toll quality
G.727 has the same rates but is embedded, for packetized networks
ADPCM uses only the general low-pass characteristic of speech
What is the next step?

19
Scalar Quantization

A standard A/D has preset, evenly distributed levels
G.711 has preset, non-evenly distributed levels
With a criterion we can make an adaptive quantizer
Simplest criterion: minimum squared quantization error
$e_n = s_n - \hat{s}_n$, $E = \langle e_n^2 \rangle$, where $\hat{s}_n$ is the quantized value
We need an algorithm to find the optimal placement of levels: EM-type algorithms

20
Vector Quantization

We can do the same thing in higher dimensions
Here we wish to match input data $x_i$, $i = 1 \ldots N$, to a codebook of codewords $C_j$, $j = 1 \ldots M$, with minimal mean squared error
$E = \sum_{i=1}^{N} \lVert x_i - C \rVert^2$
where $C$ is the codeword closest to $x_i$ in the codebook
[Figure: input vectors $x_i$]
21
LBG Algorithm for VQ

Input: $x_i$, $i = 1 \ldots N$ (clustering, unsupervised learning)
Randomly initialize codebook $C_j$, $j = 1 \ldots M$
Loop until convergence:
  Classification step:
    for $i = 1 \ldots N$
      for $j = 1 \ldots M$
        compute $D_{ij}^2 = \lVert x_i - C_j \rVert^2$
      classify $x_i$ to the $C_j$ with minimal $D_{ij}^2$
  Expectation step:
    for $j = 1 \ldots M$, correct the center: $C_j \leftarrow \frac{1}{N_j} \sum_{i \in C_j} x_i$
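The loop above can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the codebook is initialized by picking M random training vectors with a fixed seed, empty cells keep their old center, and the iteration cap and function names are hypothetical.

```python
import random


def dist2(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def lbg(data, m, iters=50, seed=0):
    """Train an m-entry codebook on data by alternating the two steps."""
    rng = random.Random(seed)
    codebook = [list(x) for x in rng.sample(data, m)]
    for _ in range(iters):
        # Classification step: assign every vector to its nearest codeword.
        cells = [[] for _ in range(m)]
        for x in data:
            nearest = min(range(m), key=lambda j: dist2(x, codebook[j]))
            cells[nearest].append(x)
        # Center correction: each codeword becomes the mean of its cell.
        updated = [
            [sum(col) / len(cell) for col in zip(*cell)] if cell else codebook[j]
            for j, cell in enumerate(cells)
        ]
        if updated == codebook:  # converged
            break
        codebook = updated
    return codebook
```

For example, training a two-entry codebook on the eight points `(0,0), (0,1), (1,0), (1,1), (10,10), (10,11), (11,10), (11,11)` yields the two cluster centers `[0.5, 0.5]` and `[10.5, 10.5]`.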
22
Speech Application of VQ

OK, I understand what to do with scalar quantization; what is VQ good for?
We could try to simply VQ frames of speech samples, but this doesn't work well!
We can VQ spectra or sub-band components
We often VQ parameter sets (e.g. LPC coefficients)
We also VQ model error signals

CELP coders

24
LPC-10

Based on 10th order LPC (obviously); Bishnu Atal
180-sample blocks are encoded into 54 bits:
Pitch + U/V (found using AMDF): 7 bits
Gain: 5 bits
10 reflection coefficients found by the covariance method, with the first two converted to log area ratios: 41 bits
  L1, L2, a3, a4: 5 bits each
  a5, a6, a7, a8: 4 bits each
  a9: 3 bits, a10: 2 bits
Sync: 1 bit
54 bits 44.44 times per second results in 2400 bps
By using VQ we could reduce the bit rate to under 1 Kbps!
LPC-10 speech is intelligible, but synthetic sounding, and much of the speaker identity is lost!
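The bit budget above is easy to check numerically. The short Python sketch below adds up the per-frame allocation and converts it to a bit rate; the dictionary keys are labels chosen here for illustration, while the numbers are the ones listed on the slide.

```python
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 180

# Bits per 180-sample frame, as listed on the slide.
ALLOCATION = {
    "pitch_and_voicing": 7,
    "gain": 5,
    "L1": 5, "L2": 5, "a3": 5, "a4": 5,
    "a5": 4, "a6": 4, "a7": 4, "a8": 4,
    "a9": 3, "a10": 2,
    "sync": 1,
}

bits_per_frame = sum(ALLOCATION.values())
frames_per_second = SAMPLE_RATE_HZ / FRAME_SAMPLES
bit_rate_bps = bits_per_frame * SAMPLE_RATE_HZ / FRAME_SAMPLES
```

This gives 54 bits per frame, about 44.44 frames per second, and 2400 bps.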

25
The Residual

Recover $s_n$ by adding back the residual error signal:
$s_n = \hat{s}_n + e_n$
So if we send $e_n$ as side-information we can recover $s_n$
$e_n$ is smaller than $s_n$, so it may require fewer bits!
But $e_n$ is whiter than $s_n$, so it may require many bits!
The question has now become: how can we compress the residual?

26
Encoding the Residual

RELP (6-9.6 Kbps)
  Low-pass filter and downsample residual to 1 KHz
  Encode using ADPCM
VQ-RELP (4.8 Kbps)
  VQ coding of residual
RELP (4.8 Kbps)
  Perform FFT on residual
  Baseband coding
RPE-LTP (GSM-FR at 13 Kbps)
  Regular Pulse Excitation - Long Term Prediction
  Perform Long Term Prediction (pitch recovery)
  Subtract to obtain new residual
  Decimate by 3, use phase with maximum energy
  Extract 6-bit overall gain
  Encode remainder with 3 bits/sample
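The "decimate by 3, use phase with maximum energy" step can be sketched as below. This follows the slide's simplified description only; the grid selection in the actual GSM full-rate codec may differ in detail, and the function name is an assumption.

```python
def best_decimation_phase(residual, factor=3):
    """Return (phase, samples) for the decimated sub-sequence with most energy."""
    best_phase, best_energy = 0, -1.0
    for phase in range(factor):
        candidate = residual[phase::factor]
        energy = sum(v * v for v in candidate)
        if energy > best_energy:
            best_phase, best_energy = phase, energy
    return best_phase, residual[best_phase::factor]
```

For the toy residual `[0, 5, 0, 0, -4, 0, 0, 3, 0]` the function selects phase 1 and keeps `[5, -4, 3]`, so only one third of the samples (plus the phase) need to be coded.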

27
Residual and Excitation

Synthesis filter (all-pole), driven by the excitation $e_n$: $s_n = e_n + \sum_m a_m s_{n-m}$
Analysis filter (all-zero), producing the residual $r_n$: $r_n = s_n - \sum_m a_m s_{n-m}$
So $r_n = e_n$!
Note: the all-zero filter is the inverse of the all-pole filter
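The inverse relationship between the two filters can be verified directly. The Python sketch below implements both difference equations with zero initial state; the function names and the example coefficients are assumptions chosen for illustration.

```python
def synthesize(excitation, a):
    """All-pole synthesis: s[n] = e[n] + sum_m a[m] * s[n-m]."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc += am * s[n - m]
        s.append(acc)
    return s


def analyze(signal, a):
    """All-zero analysis: r[n] = s[n] - sum_m a[m] * s[n-m]."""
    r = []
    for n, s in enumerate(signal):
        acc = s
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc -= am * signal[n - m]
        r.append(acc)
    return r
```

Running `analyze(synthesize(e, a), a)` with, say, `a = [0.5, -0.25]` returns the original excitation `e` up to rounding, which is the statement that the residual equals the excitation.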
28
CELP

Atal's idea: find a way to efficiently encode the excitation!
Questions:
How can we find the excitation? Theoretically, by algebra (invert the filter!)
How can we efficiently encode the residual? VQ - Code Excited Linear Prediction
How can we efficiently find the best codeword? Exhaustive search

29
CELP - cont.

Discoveries by Atal and friends (Schroeder, Remde, Singhal, etc.):
Even random codebooks work well (Gaussian, uniform)
We don't need large codebooks, e.g. 1024 codewords for 40 samples
We can center-clip with little loss
A codebook with constant amplitude is almost as good
So we can use codebooks with structure (and save storage/search/bits):
Multipulse (MP)
Constant Amplitude Pulse
Regular Pulse (RP)
30
Special Excitations

The shift technique reduces random CB operations from $O(N^2)$ to $O(N)$:
a b c d e f | c d e f g h | e f g h i j ...
Using a small number of ±1 amplitude pulses leads to MIPS reduction:
Since most values are zero, there are few operations
Since amplitudes are ±1, there are no true multiplications
In a CB containing both CW and -CW we can save half
Algebraic codebooks exploit algebraic structure
Example: choose pulses according to a Hadamard matrix
Using the FHT reduces computation
Conjugate structure codebooks
Excitation is the sum of codewords from two related CBs
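The shift technique stores one long sequence and reads overlapping windows from it as codewords. A minimal Python sketch, with a hypothetical function name and the slide's letter example as input:

```python
def overlapped_codebook(sequence, length, shift):
    """Cut codewords of the given length out of one sequence, `shift` apart."""
    last_start = len(sequence) - length
    return [sequence[start:start + length]
            for start in range(0, last_start + 1, shift)]


# The slide's example: codewords of length 6, each shifted by 2.
codewords = overlapped_codebook(list("abcdefghij"), length=6, shift=2)
```

Because consecutive codewords share all but `shift` samples, a filtered version of one codeword can be updated for the next instead of being recomputed from scratch, which is where the reduction in operations comes from.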

31
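Analysis by Synthesis

Finding the best codeword by exhaustive search
[Figure: each codeword is passed through the LPC synthesis filter and subtracted from $s_n$; the energy of the difference is computed and the minimum is found]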
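An exhaustive analysis-by-synthesis search can be sketched as follows. This is a simplified illustration, not a complete CELP encoder: there is no perceptual weighting or filter memory, the per-codeword least-squares gain is an assumption added here, and the function names are hypothetical.

```python
def synthesize(excitation, a):
    """All-pole LPC synthesis: s[n] = e[n] + sum_m a[m] * s[n-m]."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc += am * s[n - m]
        s.append(acc)
    return s


def search_codebook(target, codebook, a):
    """Try every codeword; return (index, gain, error energy) of the best one."""
    best = None
    for index, codeword in enumerate(codebook):
        y = synthesize(codeword, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero codeword cannot match anything
        # Least-squares gain for this codeword (an assumption of this sketch).
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

If the target is exactly twice the synthesized output of one codeword, the search returns that codeword's index with gain 2 and zero error. The cost is one filtering operation per codeword, which motivates the structured codebooks.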
32
Perceptual Weighting

The criterion for selecting the best codeword should be perceptual, not simply the energy of the difference signal!
We perceptually weight the signal and the synthesized signal
Since PW is a linear filter, we need apply it only once, to the difference signal
[Figure: analysis-by-synthesis loop with CB, LPC and PW blocks, input $s_n$ and a subtractor]
33
Perceptual Weighting - cont.

The most important PW effect is masking
Coding error energy near formants is not heard anyway,
so we allow higher error near formants but demand lower perceivable error energy
To do this we de-emphasize according to the LPC spectrum!
The simplest filter is $1 - \sum_i a_i z^{-i}$, where $a_i$ are the LPC coefficients
How do we take the critical bandwidth into account?
We perform bandwidth expansion, with denominator expansion > numerator expansion:
$W(z) = \frac{1 - \sum_i \gamma_1^i a_i z^{-i}}{1 - \sum_i \gamma_2^i a_i z^{-i}}$
$BW = -\ln(\gamma)\, F_s / \pi$
$1 > \gamma_1 > \gamma_2 > 0$
Typical values: $\gamma_1 = 0.9$, $\gamma_2 = 0.6$
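The bandwidth expansion and the bandwidth formula can be tried out numerically. The Python sketch below scales each LPC coefficient by a power of γ and evaluates BW = -ln(γ) Fs / π; the function names are assumptions, and the sampling rate of 8000 Hz is taken from the rest of the presentation.

```python
import math


def expand(a, gamma):
    """Bandwidth-expand LPC coefficients: a_i -> gamma**i * a_i (i from 1)."""
    return [gamma ** i * ai for i, ai in enumerate(a, start=1)]


def bandwidth_hz(gamma, fs=8000.0):
    """Bandwidth added by an expansion factor gamma: -ln(gamma) * fs / pi."""
    return -math.log(gamma) * fs / math.pi


def gamma_for_bandwidth(bw_hz, fs=8000.0):
    """Inverse of bandwidth_hz."""
    return math.exp(-math.pi * bw_hz / fs)


def weighting_filter(a, g1=0.9, g2=0.6):
    """Numerator and denominator polynomials (in z^-1) of the weighting filter."""
    numerator = [1.0] + [-c for c in expand(a, g1)]
    denominator = [1.0] + [-c for c in expand(a, g2)]
    return numerator, denominator
```

With the typical values, γ = 0.9 corresponds to roughly 268 Hz and γ = 0.6 to roughly 1300 Hz, so the denominator is indeed expanded more than the numerator. Going the other way, a 15 Hz expansion corresponds to γ of about 0.994.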
34
Post-filter

Not related to the subject, but since we are already here:
To increase the subjective quality of many coders, post-filters are often used to emphasize the formant structure
These have the same form as the perceptual weighting filter,
but $1 > \gamma_2 > \gamma_1 > 0$, with typical values $\gamma_1 = 0.5$, $\gamma_2 = 0.75$
Denominator expansion < numerator expansion!
The post-filter also reinforces tilt, which should then be compensated by an IIR filter
Since the spectral valleys are de-emphasized, we should change the PW filter parameters $\gamma_1$ and $\gamma_2$
Originally proposed for ADPCM!

35
Subframes

Coders with large frames (> 10 ms) need a long excitation signal, and hence a lot of bits to encode it
An alternative is to divide the frame into (2-4) subframes, each of which has its own codeword excitation
[Figure: consecutive frames n-1, n, n+1, each divided into subframes]
We really should recompute LPC per subframe, but we can get away with interpolating!
36
Lookahead

If we are already dividing up the frame, we can compute the LPC based on a shifted frame
This is called lookahead, and it adds processing delay!
To decrease delay we can use a backward-looking IIR filter, and then we needn't send/store the LPC coefficients at all!
[Figure: LPC analysis windows shifted relative to the sequence of subframe codewords (CW)]
37
What happened to the pitch?

Unlike LPC, the ABS CELP coder is excited by a codebook
Where does the pitch come from?
Random CB: minimization will prefer good excitation
Regular/Multi pulse: pulse spacing (not enough pulses for high pitch)
But this is usually not enough (the residual has pitch periodicity)
Two solutions:
Adaptive codebook (Kleijn et al.)
Long term prediction (Atal, Singhal)
Both of these reinforce the pitch component

38
Adaptive CB

The adaptive codebook consists of repetitions of previous excitations
The total excitation is a weighted sum of the stochastic CB (random, MP, RP, etc.) and the adaptive CB
[Figure: adaptive CB scaled by gain Ga and fixed CB scaled by gain Gs, summed and fed to the LPC filter]
39
Long Term Prediction

Using long-term (pitch predictor) and short-term (LPC) prediction
The long-term predictor may have only one delay, but then a non-integer one:
$\frac{1}{1 - b z^{-d}}$
[Figure: analysis-by-synthesis loop with codebook, gain, pitch predictor, LPC, perceptual weighting and error computation blocks, compared against $s_n$]
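A one-tap long-term (pitch) synthesis filter 1 / (1 - b z^-d) can be sketched as below. The sketch is restricted to integer delays d, so the non-integer case mentioned above is not covered, and the function name is an assumption.

```python
def pitch_synthesis(excitation, b, d):
    """One-tap long-term synthesis filter: y[n] = x[n] + b * y[n-d]."""
    y = []
    for n, x in enumerate(excitation):
        feedback = b * y[n - d] if n >= d else 0.0
        y.append(x + feedback)
    return y
```

A single pulse fed through the filter with b = 0.5 and d = 3 comes out as a decaying pulse train with period 3, which is how the predictor reinforces the pitch periodicity of the excitation.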
40
Federal Standard CELP

FS 1016 at 4.8 Kbps has MOS 3.2
Developed by AT&T Bell Labs for DOD
144 bits / 30 ms frame
10th order LPC on 30 ms Hamming window, no pre-emphasis, additional 15 Hz BW expansion (quality and LSP robustness)
Conversion to LSP and nonuniform scalar quantization to 34 bits
4 subframes (7.5 ms), LSP interpolation
512 entry fixed CB (static, values -1, 0, 1 from center-clipped Gaussian), with 5 bit nonuniform quantized gain: 56 bits
256 entry adaptive CB (8 bits), with 5 bit nonuniform quantized gain: 48 bits
optional noninteger delays, optional
Perceptual weighting
Postfilter, spectral tilt compensation, removable for noise or tandeming
FEC 4 bits, SYNC 1 bit, reserved 1 bit

41
G.728

16 Kbps with MOS similar to G.726 at 32 Kbps
Low delay: 5 samples (0.625 msec)
High computational complexity (about 30 MIPS)
CELP with backward LPC
LPC order 50 (why not? - we don't transmit side-information!)
Frame of 2.5 ms (20 samples)
4 subframes of 0.625 ms (5 samples)
Perceptual weighting
Only the 10 bit index to the fixed CB is transmitted
10 bits per 0.625 ms is 16 Kbps!

42
G.729

8 Kbps toll-quality coder for DSVD and VoIP
Computational complexity 20 MIPS, but G.729a is about 10 MIPS
frame 10 ms (80 samples), lookahead 5 ms (1 subframe)
LPC, LSP, VQ, LSP interpolation
CS-ACELP CB (interleaved single pulse permutation), 4 ±1 pulses / subframe
closed loop pitch prediction and adaptive CB (delay + gain)
2 (40 sample) subframes per frame
For each frame the encoder outputs 80 bits:
LSF coefficients    18 bits
pitch                8 bits
gain CB             14 bits
adaptive CB          5 bits
parity check         1 bit
pulse positions     26 bits
pulse signs          8 bits

43
G.729 annexes

A: Compatible reduced complexity encoder with minimal MOS reduction
B: VAD and CNG
C: Floating point implementation
D: 6.4 Kbps version, similar to G.729 but 64 output bits per frame, quality better than G.726 at 24 Kbps
   LSF coefficients 18b, pitch + adaptive CB 8+4b, gain CB 12b, fixed CB 22b
E: 11.8 Kbps coder for high quality and music

44
G.723.1

6.3 (MP-MLQ) and 5.3 (ACELP) Kbps rates
About 18 MIPS on DSP
frame 30 ms (240 samples), lookahead 7.5 ms
LPC on 30 ms (240 sample) frames, LSP and VQ
open-loop pitch computation on half-frames (120 samples)
excitation on 4 subframes (60 samples) per frame
perceptual weighting and harmonic noise weighting
fifth-order closed loop pitch predictor
MP-MLQ: 5 or 6 ±1 pulses / subframe, positions all even or all odd
ACELP: 4 ±1 pulses / subframe, positions differ by 8
Annex A: VAD-CNG; Annex B: floating point implementation

Other coders

46
MBE coder

LPC10 makes a hard U/V decision - no mixed voicing
Multi Band Excitation uses a different excitation:
harmonics of the pitch frequency
frequency-dependent binary U/V decision
large number of sub-bands (> 16)
Simultaneous ABS estimation of pitch and spectral envelope
Then the U/V decision is made based on spectral fit
Use of dynamic programming for pitch tracking

47
MBE coder - cont.

DVSI made various MBE, AMBE and IMBE coders for satellite (INMARSAT)
Bit rates 2.4 - 9.6 Kbps (toll quality at 3.6 Kbps)
Integral FEC for bit-error robustness
As an example, 128 bits for each 20 ms frame:
pitch                        8 bits
U/V decisions                K bits (K < 12)
spectral amplitudes (DCT)    75-K bits
FEC (Golay codes)            45 bits

48
MELP

DOD wanted a new 2.4 Kbps coder with MOS similar to FS1016
Main problems with LPC10:
voicing determination errors
no handling of partially voiced speech
Unlike MBE, MELP uses the standard LPC model
MELP excitation is a pulse train plus random noise
Soft decision in a small number (5) of sub-bands
Frame 22.5 ms (180 samples)
10th order LPC, 15 Hz BW expansion, LSF, interpolation, VQ
pitch refinement
5 sub-bands (0-500-1000-2000-3000-4000 Hz), pitch and noise excitation
FEC

49
Sinusoidal Transform Coder

McAulay and Quatieri model
Instead of LPC, use a sum of sine waves:
$s_n = \sum_{i=1}^{N} A_i \cos(\omega_i n + \phi_i)$
For each analysis frame (10 - 20 ms) we need to extract the N sets of $A_i$, $\omega_i$, $\phi_i$
Voiced speech: use pitch and important harmonics from a pitch-synchronized STFT
Unvoiced speech: use peaks of the STFT (points where the slope changes from + to -)
At high bit-rates keep magnitudes, frequencies and phases
At low bit-rates frequencies are constrained and phases modeled
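Both halves of the sinusoidal model can be sketched briefly in Python: synthesis of one frame from amplitude, frequency and phase triples, and a peak picker that finds points where the slope of a magnitude spectrum changes from + to -. The function names are assumptions, frequencies are in radians per sample, and no interpolation between frames is attempted.

```python
import math


def synthesize_frame(amps, freqs, phases, length):
    """s[n] = sum_i A_i * cos(w_i * n + phi_i) for n = 0 .. length-1."""
    return [
        sum(a * math.cos(w * n + p) for a, w, p in zip(amps, freqs, phases))
        for n in range(length)
    ]


def pick_peaks(magnitudes):
    """Indices where the slope changes from positive to non-positive."""
    return [
        k for k in range(1, len(magnitudes) - 1)
        if magnitudes[k - 1] < magnitudes[k] >= magnitudes[k + 1]
    ]
```

In a coder, `pick_peaks` would run on the STFT magnitudes of each frame, and the surviving peaks would supply the parameters passed to `synthesize_frame` at the decoder.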

50
STC - cont.

The sparse spectrum is updated at regularly spaced times
Amplitude is linearly interpolated between updates
Interpolated phase must obey 4 conditions ($\omega$ and $\phi$ at both ends of the interval)
[Figure: STC block diagram with $s_n$, overlapped windowing, FFT, peak picker, spectrum encoder and decoder (e.g. all-pole model), and a sum of sinusoids producing $s_n$]
51
STC - cont.

Tracking the sinusoidal components
[Figure: frequency tracks versus time, showing the birth and death of sinusoidal components]
52
Waveform Interpolation

Voiced speech is a sequence of pitch-cycle waveforms
The characteristic waveform usually changes slowly with time
Useful to think of the waveform in 2d: time, and phase in pitch period
This waveform can be the speech signal or the LPC residual
53
WI - cont.

Per frame, LPC and pitch are extracted
Represent CW by features (e.g. DFT coefficients)
Alignment by circular shift until maximum correlation
Separate treatment for voiced and unvoiced segments
[Figure: WI block diagram with $s_n$, LPC and pitch tracking, characteristic waveform extraction, 2d CW alignment, quantization, decoding, waveform interpolation, and conversion to 1d producing $s_n$]
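The alignment step ("circular shift until maximum correlation") can be sketched as a brute-force search over all shifts. This is a minimal illustration working directly on time-domain samples; the function name is an assumption, and both waveforms are assumed to have the same length.

```python
def best_circular_shift(reference, waveform):
    """Return (shift, aligned) maximizing correlation with the reference."""
    n = len(waveform)

    def correlation(shift):
        return sum(reference[k] * waveform[(k + shift) % n] for k in range(n))

    best = max(range(n), key=correlation)
    aligned = [waveform[(k + best) % n] for k in range(n)]
    return best, aligned
```

For a reference `[0, 1, 0, -1, 0, 0]` and the same cycle delayed by two samples, `[0, 0, 0, 1, 0, -1]`, the search returns shift 2 and an aligned waveform equal to the reference.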
54
We can go even lower