Speech Processing

About This Presentation

Title:

Speech Processing

Description:

Speech Processing: Analysis and Synthesis of Pole-Zero Speech Models

Slides: 77

Provided by: vkepuska

Transcript and Presenter's Notes

Title: Speech Processing

1
Speech Processing

Analysis and Synthesis of Pole-Zero Speech Models

2
Introduction

Deterministic: speech sounds with periodic or impulse sources.
Stochastic: speech sounds with noise sources.
The goal is to derive a vocal tract model for each class of sound source.
It will be shown that the solution equations for the two classes are similar in structure.
The solution approach is referred to as linear prediction analysis.
Linear prediction analysis leads to a method of speech synthesis based on the all-pole model.
Note that the all-pole model is intimately associated with the concatenated lossless tube model of the previous chapter (i.e., Chapter 4).

3
All-Pole Modeling of Deterministic Signals

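Consider the vocal tract transfer function during a voiced source.

[Figure: block diagram. An impulse train $u_g[n]$ with pitch period $T_{pitch}$, scaled by the gain $A$, drives the cascade of the Glottal Model $G(z)$, the Vocal Tract Model $V(z)$, and the Radiation Model $R(z)$, producing the speech signal $s[n]$.]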
4
All-Pole Modeling of Deterministic Signals

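What about the fact that $R(z)$ is a zero model?
A single-zero function can be expressed as an infinite set of poles. Note that
$1 - a z^{-1} = \dfrac{1}{\sum_{k=0}^{\infty} a^k z^{-k}}, \quad |z| > |a|$
From the above expression one can derive
$H(z) = \dfrac{A}{1 - \sum_{k=1}^{\infty} a_k z^{-k}}$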

5
All-Pole Modeling of Deterministic Signals

In practice the infinite number of poles is approximated with a finite set of poles, since $a_k \to 0$ as $k \to \infty$.
$H(z)$ can be considered an all-pole representation.
Representing a zero with a large number of poles is inefficient.
Estimating zeros directly is a more efficient approach (covered later in this chapter).

6
Model Estimation

Goal: estimate
the filter coefficients $a_1, a_2, \ldots, a_p$ for a particular order $p$, and
the gain $A$,
over a short time span of the speech signal (typically 20 ms) for which the signal is considered quasi-stationary.
Use the linear prediction method:
Each speech sample is approximated as a linear combination of past speech samples.
This leads to a set of analysis techniques for estimating the parameters of the all-pole model.

7
Model Estimation

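Consider the z-transform of the vocal tract model
$H(z) = \dfrac{S(z)}{U_g(z)} = \dfrac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}$
which can be transformed into
$S(z) = \sum_{k=1}^{p} a_k z^{-k} S(z) + A\,U_g(z)$
In the time domain it can be written as
$s[n] = \sum_{k=1}^{p} a_k s[n-k] + A\,u_g[n]$
This is referred to as an autoregressive (AR) model.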

8
Model Estimation

The method used to predict the current sample from a linear combination of past samples is called linear prediction analysis.
LPC: quantization of the linear prediction coefficients, or of a transformed version of these coefficients, is called linear prediction coding (Chapter 12).
For $u_g[n] = 0$:
$s[n] = \sum_{k=1}^{p} a_k s[n-k]$
This observation motivates the analysis technique of linear prediction.

9
Model Estimation Definitions

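A linear predictor of order $p$ is defined by
$\tilde{s}[n] = \sum_{k=1}^{p} \alpha_k s[n-k]$
where $\alpha_k$ is the estimate of $a_k$. In the z-domain the predictor is
$P(z) = \sum_{k=1}^{p} \alpha_k z^{-k}$
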
10
Model Estimation Definitions

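The prediction error sequence is given as the difference of the original sequence and its prediction:
$e[n] = s[n] - \tilde{s}[n] = s[n] - \sum_{k=1}^{p} \alpha_k s[n-k]$
The associated prediction error filter is defined as
$A(z) = 1 - \sum_{k=1}^{p} \alpha_k z^{-k} = 1 - P(z)$
If $\alpha_k = a_k$, then $e[n] = A\,u_g[n]$.
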
11
Model Estimation Definitions

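Note 1: Recovery of $s[n]$.
Passing the prediction error $e[n]$ through the inverse of the prediction error filter, $1/A(z)$, recovers $s[n]$.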

12
Model Estimation Definitions

Note 2: If
the vocal tract contains a finite number of poles and no zeros, and
the prediction order is correct,
then
$\alpha_k = a_k$, and
$e[n]$ is an impulse train for voiced speech; for impulsive speech $e[n]$ is just a single impulse.

13
Example 5.1

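Consider an exponentially decaying impulse response of the form $h[n] = a^n u[n]$, where $u[n]$ is the unit step. The response to the scaled unit sample $A\,\delta[n]$ is
$s[n] = A\,a^n u[n]$
Consider the prediction of $s[n]$ using a linear predictor of order $p = 1$:
$\tilde{s}[n] = \alpha_1 s[n-1]$
It is a good fit since $s[n] = a\,s[n-1]$ for $n \ge 1$.
The prediction error sequence with $\alpha_1 = a$ is
$e[n] = s[n] - a\,s[n-1] = A\,\delta[n]$
The prediction of the signal is exact except at the time origin.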
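
Example 5.1 can be checked numerically. The sketch below is only an illustration: the values of A, a, and the number of samples are arbitrary choices, not taken from the slides, and the helper name `prediction_error` is hypothetical.

```python
# Numerical check of Example 5.1 (illustrative values, not from the slides).
A, a, N = 2.0, 0.9, 8

# s[n] = A * a**n for n = 0..N-1; s[n] is zero for n < 0.
s = [A * a**n for n in range(N)]


def prediction_error(s, alpha1):
    """First-order prediction error e[n] = s[n] - alpha1 * s[n-1], with s[-1] = 0."""
    return [s[n] - alpha1 * (s[n - 1] if n > 0 else 0.0) for n in range(len(s))]


# With the true coefficient the error is A at n = 0 and (numerically) zero elsewhere.
e = prediction_error(s, alpha1=a)
print(e)
```

With any other value of `alpha1` the error is nonzero for all n, which is what the minimization on the following slides exploits.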

14
Error Minimization

An important question is how to derive an estimate of the prediction coefficients $\alpha_k$, for a particular order $p$, that would be optimal in some sense.
Optimality is measured against a criterion. An appropriate measure of optimality is the mean-squared error (MSE).
The goal is to minimize the mean-squared prediction error $E$, defined as
$E = \sum_{m=-\infty}^{\infty} \left( s[m] - \tilde{s}[m] \right)^2$
In reality, a model must be valid over some short-time interval, say $M$ samples on either side of $n$.

15
Error Minimization

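Thus in practice the MSE is time-dependent and is formed over a finite interval, as depicted in the previous figure:
$E_n = \sum_{m=n-M}^{n+M} \left( s[m] - \tilde{s}[m] \right)^2$
$[n-M, n+M]$ is the prediction error interval.
Alternatively,
$E_n = \sum_{m=-\infty}^{\infty} e_n^2[m]$
where
$e_n[m] = s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k]$ for $n-M \le m \le n+M$, and zero elsewhere.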

16
Error Minimization

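Determine the $\alpha_k$ for which $E_n$ is minimal:
$\dfrac{\partial E_n}{\partial \alpha_i} = 0, \quad i = 1, 2, \ldots, p$
which results in
$\sum_{m=n-M}^{n+M} 2 \left( s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k] \right) \left( -s_n[m-i] \right) = 0$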

17
Error Minimization

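The last equation can be rewritten by multiplying through:
$\sum_{k=1}^{p} \alpha_k \sum_{m=n-M}^{n+M} s_n[m-i]\,s_n[m-k] = \sum_{m=n-M}^{n+M} s_n[m-i]\,s_n[m], \quad 1 \le i \le p$
Define the function
$\Phi_n[i,k] = \sum_{m=n-M}^{n+M} s_n[m-i]\,s_n[m-k], \quad 1 \le i \le p, \; 0 \le k \le p$
which gives the following:
$\sum_{k=1}^{p} \alpha_k \Phi_n[i,k] = \Phi_n[i,0], \quad i = 1, 2, \ldots, p$
These are referred to as the normal equations, given in matrix form below:
$\Phi \alpha = b$, where $\Phi$ is the $p \times p$ matrix with elements $\Phi_n[i,k]$, $\alpha = [\alpha_1, \ldots, \alpha_p]^T$, and $b = [\Phi_n[1,0], \ldots, \Phi_n[p,0]]^T$.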

18
Error Minimization

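The minimum error for the optimal solution can be derived as follows:
$E_n = \sum_{m} s_n^2[m] - 2 \sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m]\,s_n[m-k] + \sum_{k=1}^{p} \alpha_k \sum_{l=1}^{p} \alpha_l \sum_{m} s_n[m-k]\,s_n[m-l]$
Using the normal equations, the last term in the equation above can be rewritten as
$\sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m]\,s_n[m-k]$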

19
Error Minimization

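Thus the error can be expressed as
$E_n = \sum_{m} s_n^2[m] - \sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m]\,s_n[m-k] = \Phi_n[0,0] - \sum_{k=1}^{p} \alpha_k \Phi_n[0,k]$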

20
Error Minimization

Remarks
The order $p$ of the actual underlying all-pole transfer function is not known.
The order can be estimated by observing that, once $p$ reaches the true order, a $p$th-order predictor in theory equals a $(p+1)$th-order predictor.
Also, predictor coefficients for $k > p$ equal zero (or in practice are close to zero and model only noise and random effects).
The prediction error $e_n[m]$ is nonzero only in the vicinity of the time $n$, i.e., in $[n-M, n+M]$.
In predicting values of the short-time sequence $s_n[m]$, $p$ values outside of the prediction error interval $[n-M, n+M]$ are required.
The covariance method uses values outside the interval to predict values inside the interval.
The autocorrelation method assumes that speech samples are zero outside the interval.

21
Error Minimization

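Matrix formulation:
$s_n = [s[n-M], \ldots, s[n+M]]^T$, and $S_n$ is the $(2M+1) \times p$ matrix whose $k$th column is $[s[n-M-k], \ldots, s[n+M-k]]^T$.
Projection Theorem:
The columns of $S_n$ are the basis vectors.
The error vector $e_n$ is orthogonal to each basis vector: $S_n^T e_n = 0$, where
$e_n = s_n - S_n \alpha$
Orthogonality leads to
$(S_n^T S_n)\,\alpha = S_n^T s_n$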
22
Autocorrelation Method

In the previous section we described a general method of linear prediction that uses samples outside the prediction error interval, referred to as the covariance method.
An alternative approach that does not consider samples outside the analysis interval, referred to as the autocorrelation method, is presented next.
This method is suboptimal; however, it leads to an efficient and stable solution to the normal equations.

23
Autocorrelation Method

Assumes that the samples outside the time interval $[n-M, n+M]$ are all zero, and
extends the prediction error interval, i.e., the range over which we minimize the mean-squared error, to $\pm\infty$.
Conventions:
The short-time interval is $[n, n+N_w-1]$, where $N_w = 2M+1$ (note that it is not centered around sample $n$ as in the previous derivation).
The segment is shifted to the left by $n$ samples so that the first nonzero sample falls at $m = 0$. This operation is equivalent to
shifting the speech sequence $s[m]$ by $n$ samples to the left, and
windowing by an $N_w$-point rectangular window.

24
Autocorrelation Method

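The windowed sequence can be expressed as
$s_n[m] = s[m+n]\,w[m]$
where $w[m]$ is the $N_w$-point window, nonzero only for $0 \le m \le N_w - 1$.
This operation is depicted in the figure presented on the right.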

25
Autocorrelation Method

Important observations that are a consequence of zeroing the signal outside of the interval:
The prediction error is nonzero only in the interval $[0, N_w+p-1]$, where
$N_w$ is the window length and
$p$ is the predictor order.
The prediction error is largest at the left and right ends of the segment. This is due to edge effects caused by the way the prediction is done:
at the left end, nonzero samples are predicted from the zeros to the left of the window;
at the right end, the zeros to the right of the window are predicted from nonzero samples.

26
Autocorrelation Method

To compensate for edge effects, a tapered window is typically used (e.g., Hamming).
This removes the possibility that the mean-squared error is dominated by end (edge) effects.
The data become distorted, however, which biases the estimates $\alpha_k$.
Let the mean-squared prediction error be given by
$E_n = \sum_{m=0}^{N_w+p-1} e_n^2[m]$
The limits of summation refer to the new time origin, and
the prediction error outside this interval is zero.

27
Autocorrelation Method

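The normal equations take the following form (Exercise 5.1):
$\sum_{k=1}^{p} \alpha_k \Phi_n[i,k] = \Phi_n[i,0], \quad i = 1, 2, \ldots, p$
where
$\Phi_n[i,k] = \sum_{m=0}^{N_w+p-1} s_n[m-i]\,s_n[m-k], \quad 1 \le i \le p, \; 0 \le k \le p$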

28
Autocorrelation Method

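Due to the summation limits depicted in the figure on the right, the function $\Phi_n[i,k]$ can be written as
$\Phi_n[i,k] = \sum_{m=i}^{k+N_w-1} s_n[m-i]\,s_n[m-k]$
recognizing that only samples in the interval $[i, k+N_w-1]$ contribute to the sum, and,
changing variable $m \to m-i$,
$\Phi_n[i,k] = \sum_{m=0}^{N_w-1-(i-k)} s_n[m]\,s_n[m+(i-k)], \quad 1 \le i \le p, \; 0 \le k \le p$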

29
Autocorrelation Method

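Since the above expression is a function only of the difference $i-k$, we denote it as
$r_n[i-k] = \Phi_n[i,k]$
Letting $\tau = i-k$, referred to as the correlation lag, leads to the short-time autocorrelation function
$r_n[\tau] = \sum_{m=0}^{N_w-1-\tau} s_n[m]\,s_n[m+\tau]$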

30
Autocorrelation Method

$r_n[\tau] = s_n[\tau] * s_n[-\tau]$
The autocorrelation method leads to computation of the short-time sequence $s_n[m]$ convolved with itself flipped in time.
The autocorrelation function is a measure of the self-similarity of the signal at different lags $\tau$.
When $r_n[\tau]$ is large, signal samples spaced by $\tau$ are said to be highly correlated.

31
Autocorrelation Method

Properties of $r_n[\tau]$:
For an $N$-point sequence, $r_n[\tau]$ is zero outside the interval $[-(N-1), N-1]$.
$r_n[\tau]$ is an even function of $\tau$.
$r_n[0] \ge |r_n[\tau]|$.
$r_n[0]$ is the energy of $s_n[m]$: $r_n[0] = \sum_m s_n^2[m]$.
If $s_n[m]$ is a segment of a periodic sequence, then $r_n[\tau]$ is periodic-like with the same period.
Because $s_n[m]$ is short-time, the overlapping data in the correlation decreases as $\tau$ increases, so the amplitude of $r_n[\tau]$ decreases as $\tau$ increases.
With a rectangular window the envelope of $r_n[\tau]$ decreases linearly.
If $s_n[m]$ is a random white noise sequence, then $r_n[\tau]$ is impulse-like, reflecting self-similarity only within a small neighborhood.
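
Several of these properties can be seen on a small example. The sketch below assumes a hypothetical test segment (a sinusoid with a period of 20 samples under a 100-point rectangular window); the function name `short_time_autocorr` is illustrative.

```python
import math


def short_time_autocorr(x, tau):
    """r[tau] = sum_{m=0}^{N-1-|tau|} x[m] * x[m+|tau|] for an N-point segment x."""
    tau = abs(tau)  # the function is even in tau
    return sum(x[m] * x[m + tau] for m in range(len(x) - tau))


# Hypothetical segment: sinusoid of period P under an N-point rectangular window.
P, N = 20, 100
seg = [math.sin(2 * math.pi * m / P) for m in range(N)]

# Lags up to and beyond the segment length; r is zero for tau >= N.
r = [short_time_autocorr(seg, tau) for tau in range(N + 5)]
print(r[0], r[P], r[2 * P], r[3 * P])
```

The peaks at multiples of the period fall off linearly (about 50, 40, 30, 20 here), which is the linear envelope produced by the rectangular window.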

32
Autocorrelation Method
33
Autocorrelation Method

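Letting $\Phi_n[i,k] = r_n[i-k]$, the normal equations take the form
$\sum_{k=1}^{p} \alpha_k r_n[i-k] = r_n[i], \quad 1 \le i \le p$
The expression represents $p$ linear equations with $p$ unknowns, $\alpha_k$ for $1 \le k \le p$.
Using the normal equation solution, it can be shown that the corresponding minimum mean-squared prediction error is given by
$E_n = r_n[0] - \sum_{k=1}^{p} \alpha_k r_n[k]$
Matrix form representation of the normal equations: $R_n \alpha = r_n$.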

34
Autocorrelation Method

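Expanded form ($R_n \alpha = r_n$):
$\begin{bmatrix} r_n[0] & r_n[1] & \cdots & r_n[p-1] \\ r_n[1] & r_n[0] & \cdots & r_n[p-2] \\ \vdots & \vdots & \ddots & \vdots \\ r_n[p-1] & r_n[p-2] & \cdots & r_n[0] \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix} = \begin{bmatrix} r_n[1] \\ r_n[2] \\ \vdots \\ r_n[p] \end{bmatrix}$
The $R_n$ matrix is Toeplitz:
Symmetric about the diagonal.
All elements of the diagonal are equal.
The matrix is invertible.
This implies an efficient solution.
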
35
Example 5.3

Consider a system with an exponentially decaying impulse response of the form $h[n] = a^n u[n]$, with $u[n]$ being the unit step function.
Estimate $a$ using the autocorrelation method of linear prediction.

36
Example 5.3

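Apply an $N$-point rectangular window $[0, N-1]$ at $n = 0$.
Compute $r_0[0]$ and $r_0[1]$:
$r_0[0] = \sum_{m=0}^{N-1} a^{2m} = \dfrac{1 - a^{2N}}{1 - a^2}$
$r_0[1] = \sum_{m=0}^{N-2} a^{m} a^{m+1} = a\,\dfrac{1 - a^{2N-2}}{1 - a^2}$
Using the normal equations:
$\alpha_1 = \dfrac{r_0[1]}{r_0[0]} = a\,\dfrac{1 - a^{2N-2}}{1 - a^{2N}}$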

37
Example 5.3

The minimum squared error (from slide 33) is thus (Exercise 5.5)
$E_0 = r_0[0] - \alpha_1 r_0[1]$
which tends to 1 as $N \to \infty$ for $|a| < 1$.
For a 1st-order predictor, as in this example, the prediction error sequence for the true predictor (i.e., $\alpha_1 = a$) is given by
$e[n] = s[n] - a\,s[n-1] = \delta[n]$ (see Example 5.1 presented earlier). Thus the prediction of the signal is exact except at the time origin.
This example illustrates that with enough data the autocorrelation method yields a solution close to the true single-pole model for an impulse input.
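
The dependence of the estimate on the window length can be tabulated directly. This is a sketch under the assumptions of the example (rectangular window starting at the origin, first-order predictor); the value a = 0.9 and the window lengths are arbitrary illustrative choices, and `first_order_autocorr_lpc` is a hypothetical helper name.

```python
def first_order_autocorr_lpc(a, N):
    """Autocorrelation-method estimate for h[n] = a**n over the window [0, N-1].

    Returns (alpha_1, minimum squared error).
    """
    h = [a**n for n in range(N)]
    r0 = sum(x * x for x in h)                       # r[0]
    r1 = sum(h[m] * h[m + 1] for m in range(N - 1))  # r[1]
    alpha1 = r1 / r0
    return alpha1, r0 - alpha1 * r1


a = 0.9  # illustrative pole location
results = {}
for N in (5, 20, 200):
    alpha1, err = first_order_autocorr_lpc(a, N)
    # Closed form: a * (1 - a^(2N-2)) / (1 - a^(2N))
    closed_form = a * (1 - a ** (2 * N - 2)) / (1 - a ** (2 * N))
    results[N] = (alpha1, closed_form, err)
    print(N, alpha1, closed_form, err)
```

For short windows the estimate is biased below the true value of a; as N grows it approaches a and the minimum error approaches 1.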

38
Limitations of the linear prediction model

When the underlying measured sequence is the impulse response of an arbitrary all-pole system, the autocorrelation method yields the correct result.
There are a number of speech sounds for which, even with an arbitrarily long data sequence, a true solution cannot be obtained.
Consider a periodic sequence simulating a steady voiced sound, formed by convolving a periodic impulse train $p[n]$ with an all-pole impulse response $h[n]$.
The z-transform of $h[n]$ is given by
$H(z) = \dfrac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$

39
Limitations of the linear prediction model

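Thus
$h[n] = \sum_{k=1}^{p} a_k h[n-k] + \delta[n]$
The normal equations of this system are given by (see Exercise 5.7)
$\sum_{k=1}^{p} a_k r_h[i-k] = r_h[i], \quad 1 \le i \le p$
where the autocorrelation of $h[n]$ is denoted by $r_h[\tau] = h[\tau] * h[-\tau]$.
Suppose now that the system is excited with an impulse train of period $P$:
$s[n] = \sum_{k=-\infty}^{\infty} h[n - kP]$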

40
Limitations of the linear prediction model

The normal equations associated with $s[n]$ (windowed over multiple pitch periods) for an order-$p$ predictor are given by
$\sum_{k=1}^{p} \alpha_k r_n[i-k] = r_n[i], \quad 1 \le i \le p$
It can be shown that $r_n[\tau]$ is equal to periodically repeated replicas of $r_h[\tau]$, but with decreasing amplitude due to the windowing (Exercise 5.7).

41
Limitations of the linear prediction model

The autocorrelation function $r_n[\tau]$ of the windowed signal $s[n]$ can be thought of as an aliased version of $r_h[\tau]$; the overlap introduces distortion.
When aliasing is minor the two solutions are approximately equal.
The accuracy of this approximation decreases as the pitch period decreases (e.g., high pitch) due to increased overlap of the autocorrelation replicas repeated every $P$ samples.

42
Limitations of the linear prediction model

Sources of error
Aliasing increases for high-pitched speakers (smaller pitch period $P$).
The signal is not truly periodic.
Speech is not always all-pole.
The autocorrelation method is a suboptimal solution.
The covariance method is capable of giving the optimal solution; however, it is not guaranteed to converge when the underlying signal does not follow an all-pole model.

43
The Levinson Recursion of the Autocorrelation
method

Direct inversion (Gaussian elimination) requires on the order of $p^3$ multiplies and additions.
Levinson Recursion (1947):
Requires on the order of $p^2$ multiplies and additions.
Links directly to the concatenated lossless tube model (Chapter 4) and thus provides a mechanism for estimating the vocal tract area function from an all-pole-model estimation.

44
The Levinson Recursion of the Autocorrelation
method

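Initialization: $\alpha_0^0 = 0$, $E^0 = r[0]$
for $i = 1, 2, \ldots, p$
Step 1: $k_i = \left( r[i] - \sum_{j=1}^{i-1} \alpha_j^{i-1}\,r[i-j] \right) / E^{i-1}$
Step 2: $\alpha_i^i = k_i$
Step 3: $\alpha_j^i = \alpha_j^{i-1} - k_i\,\alpha_{i-j}^{i-1}, \quad 1 \le j \le i-1$
Step 4: $E^i = (1 - k_i^2)\,E^{i-1}$
end
Final solution: $\alpha_j = \alpha_j^p, \quad 1 \le j \le p$

$k_i$: partial correlation coefficients (PARCOR)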
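
A compact sketch of the Levinson recursion in its standard textbook form follows. The function name `levinson`, its return layout, and the test autocorrelation (that of a single-pole system with an assumed pole at 0.8) are illustrative choices rather than anything specified on the slides.

```python
def levinson(r, p):
    """Solve sum_k alpha_k * r[|i-k|] = r[i], i = 1..p, by the Levinson recursion.

    r holds autocorrelation values r[0..p]. Returns (alpha, parcor, E), where
    alpha[j-1] is alpha_j, parcor[i-1] is k_i, and E is the final prediction error.
    """
    alpha = []   # coefficients of the previous-order predictor
    parcor = []
    E = r[0]     # zeroth-order error
    for i in range(1, p + 1):
        # Step 1: partial correlation coefficient k_i.
        k_i = (r[i] - sum(alpha[j - 1] * r[i - j] for j in range(1, i))) / E
        # Steps 2 and 3: new highest-order coefficient and updated lower ones.
        alpha = [alpha[j - 1] - k_i * alpha[i - j - 1] for j in range(1, i)] + [k_i]
        # Step 4: error of the order-i predictor.
        E *= 1.0 - k_i * k_i
        parcor.append(k_i)
    return alpha, parcor, E


# Autocorrelation of h[n] = a**n u[n] over all time: r[tau] = a**tau / (1 - a**2).
a = 0.8
r = [a**tau / (1 - a * a) for tau in range(4)]
alpha, parcor, E = levinson(r, p=3)
print(alpha, parcor, E)
```

For this single-pole input the third-order solution is approximately [0.8, 0, 0] with final error 1, so the extra coefficients beyond the true order vanish.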
45
The Levinson Recursion of the Autocorrelation
method

It can be shown that on each iteration the predictor coefficients $\alpha_k$ can be written solely as functions of the autocorrelation coefficients (Exercise 5.11).
The desired transfer function is given by
$H(z) = \dfrac{A}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}$
The gain $A$ has yet to be determined.

46
Properties of the Levinson Recursion of the
Autocorrelation method

The magnitude of the partial correlation coefficients is less than 1: $|k_i| < 1$ for all $i$.
This condition is sufficient for stability: if all $|k_i| < 1$, then all roots of $A(z)$ are inside the unit circle.
The autocorrelation method gives a minimum-phase solution even when the actual system is mixed-phase.

47
Example 5.4

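Consider the discrete-time model of the complete transfer function from the glottis to the lips derived in Chapter 4 (Equation 4.40), but without zero contributions from the radiation and vocal tract:
$H(z) = \dfrac{A}{(1 - \beta z)^2 \prod_{k=1}^{C_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$
Suppose we measure a single impulse response, denoted by $h[n]$, which is equal to the inverse z-transform of $H(z)$, and estimate the model with the autocorrelation method, setting the number of poles of $H(z)$ correctly, $p = 2 + 2C_i$, and with the prediction error defined over the entire duration of $h[n]$. This yields a solution
$\hat{H}(z) = \dfrac{A}{(1 - \beta z^{-1})^2 \prod_{k=1}^{C_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$
in which the two maximum-phase glottal poles are reflected inside the unit circle.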

48
Experimentation Results
49
Properties of the Levinson Recursion of the
Autocorrelation method

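Formal explanation:
Suppose $s[n]$ follows an all-pole model.
The prediction error function is defined over all time (i.e., no window truncation effects).
Write $S(\omega) = M_s(\omega)\,e^{j[\theta_{min}(\omega) + \theta_{max}(\omega)]}$, where $\theta_{min}(\omega)$ and $\theta_{max}(\omega)$ are the Fourier transform phase functions for the minimum- and maximum-phase contributions of $S(\omega)$, respectively.
The autocorrelation solution can be expressed as (Exercise 5.14)
$\hat{H}(\omega) = M_s(\omega)\,e^{j[\theta_{min}(\omega) - \theta_{max}(\omega)]}$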

50
Properties of the Levinson Recursion of the
Autocorrelation method

Exercise 5.14: Rationalization of the Result
$\theta_{min}(\omega)$ is the minimum-phase contribution due to the vocal tract poles inside the unit circle, and $\theta_{max}(\omega)$ is the maximum-phase contribution due to glottal poles outside the unit circle.
The resulting estimated frequency response can be expressed as
$\hat{H}(\omega) = M_s(\omega)\,e^{j[\theta_{min}(\omega) - \theta_{max}(\omega)]}$
The phase distortion of synthesized speech can have perceptual consequences, since a gradual onset of the glottal flow, and thus of the speech waveform during the open phase of the glottal cycle, is transformed to a sharp attack, consistent with the energy concentration property of minimum-phase sequences (Chapter 2).

51
Properties of the Levinson Recursion to
Autocorrelation method

Reverse Levinson recursion: how to obtain lower-order models from higher-order ones?
Autocorrelation matching: let $r_n[\tau]$ be the autocorrelation of the speech signal $s_n[m] = s[n+m]\,w[m]$ and $r_h[\tau]$ the autocorrelation of $h[n] = \mathcal{Z}^{-1}\{H(z)\}$; then
$r_n[\tau] = r_h[\tau]$ for $|\tau| \le p$

52
Autocorrelation Method

Gain Computation
$E_n$ is the average minimum prediction error for the $p$th-order predictor.
If the energy in the all-pole impulse response $h[m]$ equals the energy in the measurement $s_n[m]$, then the squared gain is equal to the minimum prediction error:
$A^2 = E_n = r_n[0] - \sum_{k=1}^{p} \alpha_k r_n[k]$

53
Autocorrelation Method

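Relationship to Lossless Tube Model
Recall the lossless concatenated tube model, with glottal impedance $Z_g(z) = \infty$ (open circuit), with the transfer function
$V(z) = \dfrac{A}{D(z)}, \quad D(z) = 1 - \sum_{k=1}^{N} a_k z^{-k}$
where $D(z)$ is recursively obtained from
$D_0(z) = 1$
$D_k(z) = D_{k-1}(z) + r_k z^{-k} D_{k-1}(z^{-1}), \quad k = 1, 2, \ldots, N$
$D(z) = D_N(z)$
$N$ is the number of tubes, and the reflection coefficient $r_k$ is a function of the cross-sectional areas of successive tubes, i.e.,
$r_k = \dfrac{A_{k+1} - A_k}{A_{k+1} + A_k}$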

54
Relationship to Lossless Tube Model

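The Levinson recursion can be written in the z-domain (see Appendix 5.B):
$A^i(z) = A^{i-1}(z) - k_i z^{-i} A^{i-1}(z^{-1})$
The starting condition is obtained by mapping $\alpha_0^0 = 0$ to
$A^0(z) = 1$
The two recursions are identical when $r_i = -k_i$, which then makes $D_i(z) = A^i(z)$.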
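
Given the identification $r_i = -k_i$, relative tube areas follow from the PARCOR coefficients by inverting $r_i = (A_{i+1} - A_i)/(A_{i+1} + A_i)$. The sketch below assumes an arbitrary first area of 1.0 (only area ratios are determined) and does not fix which end of the tract tube 1 corresponds to; that follows the Chapter 4 convention, which is not restated in this transcript. Both function names are hypothetical.

```python
def area_function(parcor, first_area=1.0):
    """Relative tube areas A_1..A_{p+1} from PARCOR coefficients k_1..k_p.

    Uses r_i = -k_i and r_i = (A_{i+1} - A_i) / (A_{i+1} + A_i), so that
    A_{i+1} = A_i * (1 + r_i) / (1 - r_i). Requires |k_i| < 1.
    """
    areas = [first_area]
    for k in parcor:
        r = -k
        areas.append(areas[-1] * (1.0 + r) / (1.0 - r))
    return areas


def reflection_coefficients(areas):
    """Reflection coefficients r_i from successive tube areas."""
    return [(a2 - a1) / (a2 + a1) for a1, a2 in zip(areas, areas[1:])]


# Illustrative PARCOR values (not from the slides).
parcor = [0.5, -0.2]
areas = area_function(parcor)
print(areas)
print(reflection_coefficients(areas))
```

Converting the areas back to reflection coefficients returns the negated PARCOR values, confirming that the two mappings are inverses of each other.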

55
Relationship to Lossless Tube Model

Since the boundary condition was not included in the lossless tube model, $V(z)$ represents the ratio between an ideal volume velocity at the glottis and at the lips.
The speech pressure measurement at the lips output, however, has embedded within it the glottal shape $G(z)$, as well as the radiation at the lips $R(z)$.
Recall that for the voiced case (with no vocal tract zeros)
$H(z) = A\,G(z)\,V(z)\,R(z)$
The presence of the glottal shape, i.e., $G(z)$, thus introduces poles that are not part of the vocal tract.
The net effect of the glottal shape is typically a 6 dB/octave fall-off (see slide 94 of the presentation Acoustic of Speech Production) added to the spectral tilt of $V(z)$.
The influence of the glottal flow shape and radiation load can be approximately removed with a pre-emphasis of 6 dB/octave spectral rise.
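
A common way to realize a 6 dB/octave spectral rise is a first-order difference filter. The slides do not specify the filter; the form $y[n] = x[n] - \mu x[n-1]$ and the coefficient value 0.95 used below are assumptions made for illustration.

```python
def pre_emphasize(x, mu=0.95):
    """First-order pre-emphasis y[n] = x[n] - mu * x[n-1], with x[-1] taken as 0.

    mu close to 1 approximates a differentiator (about +6 dB/octave);
    the default 0.95 is an assumed, commonly used value.
    """
    return [x[n] - mu * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


# A constant (zero-frequency) input is strongly attenuated,
# while an alternating (highest-frequency) input is boosted.
low = pre_emphasize([1.0] * 5)
high = pre_emphasize([1.0, -1.0, 1.0, -1.0, 1.0])
print(low)
print(high)
```

After the first sample the constant input is scaled by about 1 - mu and the alternating input by about 1 + mu, which is the high-frequency emphasis applied before linear prediction analysis.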

56
Example 5.5

The following figure gives two examples that show good matches to measured vocal tract area functions for the vowels /a/ and /i/, derived from estimates of the partial correlation coefficients.

57
Frequency Domain Interpretation

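Consider an all-pole model of speech production
$H(\omega) = \dfrac{A}{A(\omega)}$
where $A(\omega)$ is given by
$A(\omega) = 1 - \sum_{k=1}^{p} \alpha_k e^{-j\omega k}$
Define $Q(\omega)$ as the difference of the log-magnitude of the measured and modeled spectra:
$Q(\omega) = \log|S(\omega)|^2 - \log|H(\omega)|^2 = \log\left|\dfrac{S(\omega)}{H(\omega)}\right|^2$
Recall that
$E(\omega) = A(\omega)\,S(\omega)$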

58
Frequency Domain Interpretation

Thus we can write $Q(\omega)$ as
$Q(\omega) = \log\left|\dfrac{E(\omega)}{A}\right|^2$
Thus as $e[n]$ is minimized, $E(\omega)$ is minimized, which in turn minimizes $Q(\omega)$: the spectral difference between the actual measured speech spectrum and the modeled spectrum is minimized.

59
Linear Prediction Analysis of Stochastic Speech
Sounds

Linear Prediction analysis was motivated with
observation that for a single impulse or periodic
impulse train input to an all-pole vocal tract
model, the prediction error is zero most of the
time.
Such analysis appears not to be applicable to
speech sounds with fricative or aspirated sources
modeled as a stochastic (or random) process.
However, autocorrelation method of linear
prediction can be formulated for the stochastic
case where a white noise input takes on the role
of the single impulse.
The solution to a stochastic optimization problem
- analogous to the minimization of mean-squared
error function En, leads to normal equations
which are the stochastic counterparts to our
earlier solution.
Derivation and interpretation of this stochastic
optimization problem is left as an exercise.

60
Criterion of Goodness

How well does linear prediction describe the speech signal in time and in frequency?
Time Domain
Suppose that
the underlying speech model is an all-pole model of order $p$, and
the autocorrelation method is used in the estimation of the coefficients of the predictor polynomial $P(z)$.
If the predictor coefficients are estimated exactly, then the prediction error is
a perfect impulse train for voiced speech,
a single impulse for a plosive,
white noise for noisy (stochastic) speech.

61
Time Domain

The autocorrelation method of linear prediction analysis does not yield such idealized outputs when the measurement $s[n]$ is inverse filtered by the estimated system function $A(z)$ (a limitation of the method).
Even when the vocal tract response follows an all-pole model, the true solution cannot be obtained, since the obtained solution approaches the true solution only in the limit when an infinite amount of data is available.
In a typical waveform segment, the actual vocal tract impulse response is not all-pole for a variety of reasons:
Presence of zeros due to
the radiation load,
nasalization,
the back vocal cavity during frication and plosives.
The glottal flow shape, even when adequately modeled, is not minimum phase (see Example 5.6).

62
Prediction Error Residuals

Autocorrelation method of linear prediction of
order 14
Estimation performed over 20 ms Hamming windowed
speech segments.

63
Prediction Error Residuals

When the residuals are reconstructed from an entire utterance, what one typically hears in the prediction error is
not a noisy buzz, as expected from an idealized residual, but rather
roughly the speech itself.
This implies that some of the vocal tract spectrum is passing through the inverse filter.

64
Frequency Domain

The behavior of linear prediction analysis can alternatively be studied in the frequency domain:
How well does the spectrum derived from linear prediction analysis match the spectrum of a sequence that follows
an all-pole model, and
not an all-pole model?

65
Frequency Domain-Voiced Speech

Recall that voiced speech $s[n]$ is driven by the excitation $u_g[n]$ with Fourier transform $U_g(\omega)$, applied to a vocal tract impulse response with all-pole frequency response $H(\omega)$.
The Fourier transform of the windowed speech $s_n[m]$ is expressed in terms of:
$W(\omega)$, the window transform;
$\omega_o = 2\pi/P$, the fundamental frequency.

66
Frequency Domain-Unvoiced Speech

Recall for unvoiced speech (stochastic sounds)
Linear prediction analysis attempts to estimate $|H(\omega)|$, the spectral envelope of the harmonic spectrum $S(\omega)$.

67
Schematics of Spectra for Periodic and Stochastic
Speech Sounds
68
Properties

For large $p$, $|H(\omega)|$ matches the Fourier transform magnitude of the windowed signal, $|S(\omega)|$.

69
Properties

Spectral peaks are better matched than spectral valleys.

70
Properties
71
Synthesis Based on All-pole Modeling Properties

We are now able to synthesize the waveform from model parameters estimated using linear prediction analysis.
Synthesized signal:
$s[n] = \sum_{k=1}^{p} \alpha_k s[n-k] + A\,u[n]$
where $u[n]$ is the excitation.

72
Synthesis Based on All-pole Modeling

Important Parameters to Consider
Window Duration:
20-30 ms gives a satisfactory time-frequency tradeoff (Exercise 5.20).
The duration can be adaptively varied to account for different time-frequency resolution requirements based on pitch, voicing state, and phoneme class.
Frame Interval:
Analysis is typically performed every 10 ms.
Model Order:
There are three components to be considered.
Vocal tract: on average, a resonant density of one resonance per 1000 Hz. The number of system poles is 2 × the number of resonances (e.g., for a 5000 Hz bandwidth signal, 2 × 5 = 10 poles).
Glottal flow: 2-pole maximum-phase model.
Radiation at lips: 1 zero inside the unit circle, for which 4 poles provide an adequate representation.
Total of 16 poles.
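
The pole budget above reduces to a one-line rule. The helper below simply encodes the counts stated on the slide; the function name and its default arguments are illustrative.

```python
def model_order(bandwidth_hz, glottal_poles=2, radiation_poles=4):
    """Rule-of-thumb all-pole model order for a given signal bandwidth in Hz."""
    resonances = bandwidth_hz / 1000.0            # about one resonance per 1000 Hz
    vocal_tract_poles = int(round(2 * resonances))  # two poles per resonance
    return vocal_tract_poles + glottal_poles + radiation_poles


print(model_order(5000))  # 10 + 2 + 4 = 16
```

For a 4000 Hz bandwidth the same rule gives 8 + 2 + 4 = 14 poles.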

73
Synthesis Based on All-pole Modeling

Voiced/Unvoiced State and Pitch Estimation
Currently no discrimination is made between, for example, plosive and fricative unvoiced speech sound categories.
Pitch is estimated during voiced regions of speech only. However, pitch estimation algorithms typically estimate pitch as well as perform voiced/unvoiced classification.
A degree of voicing may be desired in more complex analysis and synthesis methods, since voicing and turbulence can occur simultaneously:
Voiced fricatives
Breathy vowels.

74
Synthesis Based on All-pole Modeling

Synthesis Structures
Determine the excitation for each frame.
Generate the excitation for each frame by concatenating:
An impulse train during voiced signal (spacing determined by the time-varying pitch contour).
White noise during unvoiced signal.
Compute the gain:
Directly by measuring frame energy, or
Using the autocorrelation method.
Voiced speech: the magnitude of the impulse is the square root of the signal energy.
Unvoiced speech: the noise variance equals the signal variance.
Update the filter values on each frame. Overlap and add the signal at consecutive frames.
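
The synthesis structure can be sketched as an excitation generator followed by an all-pole filter. This is a simplified illustration, not the presentation's implementation: the function names, the fixed pitch period, and the random seed are assumptions, and instead of overlap-adding consecutive frames the sketch carries the filter state from one frame to the next.

```python
import random


def excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Unit-amplitude impulse train (voiced) or unit-variance white noise (unvoiced)."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def all_pole_filter(u, alpha, gain, state=None):
    """Synthesize s[n] = sum_k alpha_k * s[n-k] + gain * u[n].

    `state` holds the last p outputs of the previous frame (most recent first),
    so the filter memory carries across frame boundaries. Returns (output, state).
    """
    p = len(alpha)
    past = list(state) if state is not None else [0.0] * p
    out = []
    for x in u:
        y = gain * x + sum(a * s for a, s in zip(alpha, past))
        out.append(y)
        past = ([y] + past)[:p]  # keep only the p most recent outputs
    return out, past


# One voiced frame through a single-pole filter (illustrative parameters).
u = excitation(5, voiced=True, pitch_period=100)
frame, state = all_pole_filter(u, alpha=[0.5], gain=2.0)
print(frame)  # [2.0, 1.0, 0.5, 0.25, 0.125]
```

In a frame-based synthesizer, `alpha` and `gain` would be updated on each frame and `state` passed from one call to the next.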