Title: Speech Processing
1Speech Processing
- Analysis and Synthesis of Pole-Zero Speech Models
2Introduction
- Deterministic
- Speech Sounds with periodic or impulse sources
- Stochastic
- Speech Sounds with noise sources
- Goal is to derive a vocal tract model for each class of sound source.
- It will be shown that the solution equations for the two classes are similar in structure.
- The solution approach is referred to as linear prediction analysis.
- Linear prediction analysis leads to a method of speech synthesis based on the all-pole model.
- Note that the all-pole model is intimately associated with the concatenated lossless tube model of the previous chapter (i.e., Chapter 4).
3All-Pole Modeling of Deterministic Signals
- Consider the transfer function from the source to the speech output for a voiced source.
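[Figure: block diagram of voiced speech production. An impulse train with period T_pitch, scaled by the gain A, passes through the Glottal Model G(z), the Vocal Tract Model V(z), and the Radiation Model R(z) to produce the speech signal s[n]; u_g[n] denotes the glottal source signal.]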
4All-Pole Modeling of Deterministic Signals
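- What about the fact that R(z) is a zero model?
- A single zero can be expressed as an infinite set of poles. Note that $\frac{1}{1 - a z^{-1}} = \sum_{k=0}^{\infty} a^k z^{-k}$ for $|z| > |a|$.
- From the above expression one can derive $1 - a z^{-1} = \frac{1}{\sum_{k=0}^{\infty} a^k z^{-k}}$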
5All-Pole Modeling of Deterministic Signals
- In practice, the infinite number of poles is approximated by a finite set of poles, since $a^k \to 0$ as $k \to \infty$.
- H(z) can therefore be considered an all-pole representation.
- Representing a zero with a large number of poles is inefficient.
- Estimating zeros directly is a more efficient approach (covered later in this chapter).
6Model Estimation
- Goal: estimate
- the filter coefficients $a_1, a_2, \ldots, a_p$ for a particular order $p$, and
- the gain $A$,
- over a short time span of the speech signal (typically 20 ms) for which the signal is considered quasi-stationary.
- Use the linear prediction method:
- Each speech sample is approximated as a linear combination of past speech samples. This leads to
- a set of analysis techniques for estimating the parameters of the all-pole model.
7Model Estimation
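- Consider the z-transform of the vocal tract model: $H(z) = \frac{S(z)}{U_g(z)} = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}$
- which can be transformed into $S(z) = \sum_{k=1}^{p} a_k z^{-k} S(z) + A\,U_g(z)$
- In the time domain it can be written as $s[n] = \sum_{k=1}^{p} a_k s[n-k] + A\,u_g[n]$
- This is referred to as an autoregressive (AR) model.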
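The AR difference equation can be run directly as a recursion. The following is a minimal sketch in plain Python; the function name `ar_synthesize` and the two-pole coefficients are illustrative assumptions, not values from the slides.

```python
def ar_synthesize(a, gain, source):
    """Run s[n] = sum_k a_k s[n-k] + gain * source[n] with zero initial state."""
    p = len(a)
    s = []
    for n, u in enumerate(source):
        # Linear combination of the p most recent output samples.
        past = sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        s.append(past + gain * u)
    return s

# Hypothetical stable two-pole model driven by a single unit impulse.
a = [1.2, -0.72]
impulse = [1.0] + [0.0] * 7
s = ar_synthesize(a, gain=2.0, source=impulse)
```

With an impulse input the output is the scaled impulse response of the all-pole model, which is the deterministic case treated in the following slides.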
8Model Estimation
- The method used to predict the current sample from a linear combination of past samples is called linear prediction analysis.
- LPC: quantization of the linear prediction coefficients, or of a transformed version of these coefficients, is called linear prediction coding (Chapter 12).
- For $u_g[n] = 0$: $s[n] = \sum_{k=1}^{p} a_k s[n-k]$
- This observation motivates the analysis technique of linear prediction.
9Model Estimation Definitions
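- A linear predictor of order $p$ is defined by $\tilde{s}[n] = \sum_{k=1}^{p} \alpha_k s[n-k]$, where $\alpha_k$ is the estimate of $a_k$.
- In the z-domain the predictor is $P(z) = \sum_{k=1}^{p} \alpha_k z^{-k}$.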
10Model Estimation Definitions
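- The prediction error sequence is given as the difference of the original sequence and its prediction: $e[n] = s[n] - \tilde{s}[n] = s[n] - \sum_{k=1}^{p} \alpha_k s[n-k]$
- The associated prediction error filter is defined as $A(z) = 1 - \sum_{k=1}^{p} \alpha_k z^{-k} = 1 - P(z)$
- If $\alpha_k = a_k$, then $e[n] = A\,u_g[n]$, and $A(z)$ is the inverse filter of the all-pole model (up to the gain $A$).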
11Model Estimation Definitions
12Model Estimation Definitions
- Note 2: If
- the vocal tract contains a finite number of poles and no zeros, and
- the prediction order is correct,
- then
- $\alpha_k = a_k$, and
- $e[n]$ is an impulse train for voiced speech; for impulsive speech, $e[n]$ is just a single impulse.
13Example 5.1
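- Consider an exponentially decaying impulse response of the form $h[n] = a^n u[n]$, where $u[n]$ is the unit step. The response to the scaled unit sample $A\delta[n]$ is $s[n] = A\delta[n] * h[n] = A a^n u[n]$
- Consider the prediction of $s[n]$ using a linear predictor of order $p = 1$: $\tilde{s}[n] = \alpha_1 s[n-1]$
- It is a good fit since $s[n] = a\,s[n-1]$ for $n \ge 1$.
- The prediction error sequence with $\alpha_1 = a$ is $e[n] = s[n] - a\,s[n-1] = A\left(a^n u[n] - a^n u[n-1]\right) = A\delta[n]$
- The prediction of the signal is exact except at the time origin.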
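Example 5.1 can be checked numerically. The sketch below assumes the illustrative values a = 0.9 and A = 2 (not given in the slides) and forms the first-order prediction error with the true coefficient.

```python
a, A, N = 0.9, 2.0, 10                       # assumed example values
s = [A * a**n for n in range(N)]             # s[n] = A a^n u[n]

# e[n] = s[n] - a * s[n-1], taking s[-1] = 0
e = [s[n] - (a * s[n - 1] if n > 0 else 0.0) for n in range(N)]
```

The error equals A at n = 0 and vanishes (to rounding) everywhere else, i.e., e[n] = A delta[n].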
14Error Minimization
- An important question is how to derive an estimate of the prediction coefficients $\alpha_k$, for a particular order $p$, that is optimal in some sense.
- Optimality is measured against a criterion. An appropriate measure of optimality is the mean-squared error (MSE).
- The goal is to minimize the mean-squared prediction error $E$, defined as $E = \sum_{m=-\infty}^{\infty} e^2[m]$
- In reality, a model must be valid over some short-time interval, say $M$ samples on either side of $n$.
15Error Minimization
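- Thus in practice the MSE is time-dependent and is formed over a finite interval, as depicted in the previous figure: $E_n = \sum_{m=n-M}^{n+M} e^2[m]$
- $[n-M, n+M]$ is the prediction error interval.
- Alternatively, $E_n = \sum_{m=-\infty}^{\infty} e_n^2[m]$
- where $e_n[m] = s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k]$ for $n-M \le m \le n+M$, and zero otherwise.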
16Error Minimization
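- Determine the $\alpha_k$ for which $E_n$ is minimal: $\frac{\partial E_n}{\partial \alpha_i} = 0$, $i = 1, 2, \ldots, p$
- which results in $\sum_{m=n-M}^{n+M} \left( s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k] \right) s_n[m-i] = 0$, $1 \le i \le p$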
17Error Minimization
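- The last equation can be rewritten by multiplying through: $\sum_{k=1}^{p} \alpha_k \sum_{m=n-M}^{n+M} s_n[m-i]\,s_n[m-k] = \sum_{m=n-M}^{n+M} s_n[m-i]\,s_n[m]$, $1 \le i \le p$
- Define the function $\Phi_n[i,k] = \sum_{m=n-M}^{n+M} s_n[m-i]\,s_n[m-k]$, $0 \le i \le p$, $0 \le k \le p$
- which gives $\sum_{k=1}^{p} \alpha_k \Phi_n[i,k] = \Phi_n[i,0]$, $i = 1, 2, \ldots, p$
- These are referred to as the normal equations, given in matrix form below: $\Phi\,\alpha = b$, where $\Phi$ has entries $\Phi_n[i,k]$ and $b$ has entries $\Phi_n[i,0]$, $1 \le i, k \le p$.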
18Error Minimization
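- The minimum error for the optimal solution can be derived as follows: $E_n = \sum_{m} \left( s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k] \right)^2 = \sum_{m} s_n^2[m] - 2\sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m]\,s_n[m-k] + \sum_{k=1}^{p} \alpha_k \sum_{l=1}^{p} \alpha_l \sum_{m} s_n[m-k]\,s_n[m-l]$
- Using the normal equations, the last term in the equation above can be rewritten as $\sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m]\,s_n[m-k]$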
19Error Minimization
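- Thus the error can be expressed as $E_n = \sum_{m} s_n^2[m] - \sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m]\,s_n[m-k] = \Phi_n[0,0] - \sum_{k=1}^{p} \alpha_k \Phi_n[0,k]$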
20Error Minimization
- Remarks
- The order $p$ of the actual underlying all-pole transfer function is not known.
- The order can be estimated by observing that, in theory, the prediction error of a $p$th-order predictor equals that of a $(p+1)$th-order predictor when $p$ is the true order.
- Also, predictor coefficients for $k > p$ equal zero (or in practice are close to zero and model only random noise effects).
- The prediction error $e_n[m]$ is nonzero only in the vicinity of time $n$: $[n-M, n+M]$.
- In predicting values of the short-time sequence $s_n[m]$, $p$ values outside the prediction error interval $[n-M, n+M]$ are required.
- The covariance method uses values outside the interval to predict values inside the interval.
- The autocorrelation method assumes that speech samples are zero outside the interval.
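The covariance method described above can be sketched as follows: build the correlation sums over the prediction error interval, using the p samples that precede it, and solve the resulting normal equations. This is a minimal plain-Python sketch; the helper names `solve` and `covariance_lpc`, the half-open interval convention, and the two-pole test signal are assumptions made for illustration.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][j] * x[j] for j in range(r + 1, n))) / aug[r][r]
    return x

def covariance_lpc(s, p, start, stop):
    """Predictor coefficients minimizing the error over start <= m < stop.

    Needs start >= p so that the p samples before the interval exist.
    """
    def phi(i, k):
        return sum(s[m - i] * s[m - k] for m in range(start, stop))
    Phi = [[phi(i, k) for k in range(1, p + 1)] for i in range(1, p + 1)]
    rhs = [phi(i, 0) for i in range(1, p + 1)]
    return solve(Phi, rhs)

# Impulse response of a hypothetical model s[n] = 1.2 s[n-1] - 0.72 s[n-2] + delta[n]
s = [1.0, 1.2]
for n in range(2, 40):
    s.append(1.2 * s[n - 1] - 0.72 * s[n - 2])

alpha = covariance_lpc(s, p=2, start=2, stop=40)
```

Because the error interval excludes the time origin and the samples before it are used rather than zeroed, the recovered coefficients match the true ones up to rounding for this all-pole test signal.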
21Error Minimization
- Matrix formulation
- Projection theorem
- The columns of $S_n$ are the basis vectors.
- The error vector $e_n$ is orthogonal to each basis vector: $S_n^T e_n = 0$, where $e_n = s_n - S_n \alpha$
- Orthogonality leads to $(S_n^T S_n)\,\alpha = S_n^T s_n$
22Autocorrelation Method
- In the previous section we described a general method of linear prediction that uses samples outside the prediction error interval, referred to as the covariance method.
- An alternative approach that does not consider samples outside the analysis interval, referred to as the autocorrelation method, is presented next.
- This method is
- suboptimal; however, it
- leads to an efficient and stable solution to the normal equations.
23Autocorrelation Method
- Assumes that the samples outside the time interval $[n-M, n+M]$ are all zero, and
- extends the prediction error interval, i.e., the range over which we minimize the mean-squared error, to $\pm\infty$.
- Conventions
- The short-time interval is $[n, n+N_w-1]$, where $N_w = 2M+1$ (note that it is not centered around sample $n$ as in the previous derivation).
- The segment is shifted to the left by $n$ samples so that the first nonzero sample falls at $m = 0$. This operation is equivalent to
- shifting the speech sequence $s[m]$ by $n$ samples to the left, and
- windowing by an $N_w$-point rectangular window.
24Autocorrelation Method
- The windowed sequence can be expressed as $s_n[m] = s[m+n]\,w[m]$
- This operation is depicted in the figure on the right.
25Autocorrelation Method
- Important observations that are a consequence of zeroing the signal outside the interval:
- The prediction error is nonzero only in the interval $[0, N_w+p-1]$, where
- $N_w$ is the window length, and
- $p$ is the predictor order.
- The prediction error is largest at the left and right ends of the segment. This is due to edge effects caused by the way the prediction is done:
- from zeros to the left of the window, and
- to zeros to the right of the window.
26Autocorrelation Method
- To compensate for edge effects, a tapered window is typically used (e.g., Hamming).
- This removes the possibility that the mean-squared error is dominated by end (edge) effects.
- The data become distorted, however, which biases the estimates $\alpha_k$.
- Let the mean-squared prediction error be given by $E_n = \sum_{m=0}^{N_w+p-1} e_n^2[m]$
- The limits of summation refer to the new time origin, and
- the prediction error outside this interval is zero.
27Autocorrelation Method
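- The normal equations take the following form (Exercise 5.1): $\sum_{k=1}^{p} \alpha_k \Phi_n[i,k] = \Phi_n[i,0]$, $i = 1, 2, \ldots, p$
- where $\Phi_n[i,k] = \sum_{m=0}^{N_w+p-1} s_n[m-i]\,s_n[m-k]$, $1 \le i \le p$, $0 \le k \le p$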
28Autocorrelation Method
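- Due to the summation limits depicted in the figure on the right, the function $\Phi_n[i,k]$ can be written as $\Phi_n[i,k] = \sum_{m=i}^{k+N_w-1} s_n[m-i]\,s_n[m-k]$
- recognizing that only samples in the interval $[i, k+N_w-1]$ contribute to the sum, and
- changing the variable $m \to m - i$: $\Phi_n[i,k] = \sum_{m=0}^{N_w-1-(i-k)} s_n[m]\,s_n[m+(i-k)]$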
29Autocorrelation Method
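- Since the above expression is a function only of the difference $i-k$, we denote it as $r_n[i-k] = \Phi_n[i,k]$
- Letting $\tau = i-k$, referred to as the correlation lag, leads to the short-time autocorrelation function $r_n[\tau] = \sum_{m=0}^{N_w-1-\tau} s_n[m]\,s_n[m+\tau]$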
30Autocorrelation Method
- $r_n[\tau] = s_n[\tau] * s_n[-\tau]$
- The autocorrelation method leads to computation of the short-time sequence $s_n[m]$ convolved with itself flipped in time.
- The autocorrelation function is a measure of the self-similarity of the signal at different lags $\tau$.
- When $r_n[\tau]$ is large, signal samples spaced by $\tau$ are said to be highly correlated.
31Autocorrelation Method
- Properties of $r_n[\tau]$
- For an $N$-point sequence, $r_n[\tau]$ is zero outside the interval $[-(N-1), N-1]$.
- $r_n[\tau]$ is an even function of $\tau$.
- $r_n[0] \ge |r_n[\tau]|$
- $r_n[0]$ is the energy of $s_n[m]$: $r_n[0] = \sum_{m} |s_n[m]|^2$
- If $s_n[m]$ is a segment of a periodic sequence, then $r_n[\tau]$ is periodic-like with the same period.
- Because $s_n[m]$ is short-time, the overlapping data in the correlation decrease as $\tau$ increases, so
- the amplitude of $r_n[\tau]$ decreases as $\tau$ increases.
- With a rectangular window, the envelope of $r_n[\tau]$ decreases linearly.
- If $s_n[m]$ is a random white noise sequence, then $r_n[\tau]$ is impulse-like, reflecting self-similarity only within a small neighborhood.
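Several of these properties can be observed on a small example. The sketch below computes the short-time autocorrelation of a rectangular-windowed segment of a periodic signal; the helper name `autocorr` and the cosine test signal with period 8 are assumptions chosen for illustration.

```python
import math

def autocorr(x, max_lag):
    """r[tau] = sum_{m=0}^{N-1-tau} x[m] x[m+tau] for tau = 0..max_lag."""
    N = len(x)
    return [sum(x[m] * x[m + tau] for m in range(N - tau))
            for tau in range(max_lag + 1)]

# 64-point rectangular-windowed segment of a cosine with period 8 samples.
x = [math.cos(2 * math.pi * m / 8) for m in range(64)]
r = autocorr(x, 63)

energy = sum(v * v for v in x)        # equals r[0]
peaks = [r[0], r[8], r[16]]           # lags at multiples of the period
```

For this signal the values at lags 0, 8, and 16 fall off in equal steps, which is the linear envelope produced by the rectangular window, and no lag exceeds the value at lag 0.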
32Autocorrelation Method
33Autocorrelation Method
- Letting $\Phi_n[i,k] = r_n[i-k]$, the normal equations take the form $\sum_{k=1}^{p} \alpha_k r_n[i-k] = r_n[i]$, $1 \le i \le p$
- The expression represents $p$ linear equations in $p$ unknowns, $\alpha_k$ for $1 \le k \le p$.
- Using the normal equation solution, it can be shown that the corresponding minimum mean-squared prediction error is given by $E_n = r_n[0] - \sum_{k=1}^{p} \alpha_k r_n[k]$
- Matrix form of the normal equations:
- $R_n \alpha = r_n$
34Autocorrelation Method
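- Expanded form, with the factors $R_n$, $\alpha$, and $r_n$ from left to right: $\begin{bmatrix} r_n[0] & r_n[1] & \cdots & r_n[p-1] \\ r_n[1] & r_n[0] & \cdots & r_n[p-2] \\ \vdots & \vdots & \ddots & \vdots \\ r_n[p-1] & r_n[p-2] & \cdots & r_n[0] \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix} = \begin{bmatrix} r_n[1] \\ r_n[2] \\ \vdots \\ r_n[p] \end{bmatrix}$
- The $R_n$ matrix is Toeplitz:
- It is symmetric about the diagonal.
- All elements along each diagonal are equal.
- The matrix is invertible.
- This implies an efficient solution.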
35Example 5.3
- Consider a system with an exponentially decaying impulse response of the form $h[n] = a^n u[n]$, with $u[n]$ being the unit step function. Its z-transform is $H(z) = \frac{1}{1 - a z^{-1}}$.
- Estimate $a$ using the autocorrelation method of linear prediction.
36Example 5.3
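- Apply an $N$-point rectangular window $[0, N-1]$ at $n = 0$.
- Compute $r_0[0]$ and $r_0[1]$: $r_0[0] = \sum_{m=0}^{N-1} a^{2m} = \frac{1 - a^{2N}}{1 - a^2}$ and $r_0[1] = \sum_{m=0}^{N-2} a^{m} a^{m+1} = a\,\frac{1 - a^{2N-2}}{1 - a^2}$
- Using the normal equations: $\alpha_1 = \frac{r_0[1]}{r_0[0]} = a\,\frac{1 - a^{2N-2}}{1 - a^{2N}}$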
37Example 5.3
- The minimum squared error (from slide 33) is thus (Exercise 5.5) $E_0 = r_0[0] - \alpha_1 r_0[1] = \frac{1 - a^{4N-2}}{1 - a^{2N}}$
- For a first-order predictor, as in this example, the prediction error sequence for the true predictor (i.e., $\alpha_1 = a$) is given by
- $e[n] = s[n] - a\,s[n-1] = \delta[n]$ (see Example 5.1 presented earlier). Thus the prediction of the signal is exact except at the time origin.
- This example illustrates that with enough data the autocorrelation method yields a solution close to the true single-pole model for an impulse input.
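The dependence of the estimate on the window length can be seen numerically. The sketch below applies the first-order autocorrelation solution to a rectangular-windowed exponential; the value a = 0.9 and the two window lengths are assumed for illustration, and `windowed_estimate` is a hypothetical helper name.

```python
def windowed_estimate(a, N):
    """First-order autocorrelation-method estimate from N samples of a^n u[n]."""
    h = [a**m for m in range(N)]                      # N-point rectangular window
    r0 = sum(v * v for v in h)
    r1 = sum(h[m] * h[m + 1] for m in range(N - 1))
    alpha1 = r1 / r0                                  # solution of the normal equation
    min_error = r0 - alpha1 * r1                      # minimum squared error
    return alpha1, min_error

a = 0.9
alpha_short, err_short = windowed_estimate(a, 5)      # short window: biased estimate
alpha_long, err_long = windowed_estimate(a, 200)      # long window: close to a
```

With the short window the estimate falls below a because of the truncation; with the long window the estimate approaches a and the minimum error approaches 1, the energy of the unit impulse input.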
38Limitations of the linear prediction model
- When the underlying measured sequence is the impulse response of an arbitrary all-pole system, the autocorrelation method yields the correct result.
- There are a number of speech sounds for which a true solution cannot be obtained even with an arbitrarily long data sequence.
- Consider a periodic sequence simulating a steady voiced sound, formed by convolving a periodic impulse train $p[n]$ with an all-pole impulse response $h[n]$.
- The z-transform of $h[n]$ is given by $H(z) = \frac{1}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}$
39Limitations of the linear prediction model
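- Thus $h[n] = \sum_{k=1}^{p} \alpha_k h[n-k] + \delta[n]$
- The normal equations of this system are given by (see Exercise 5.7) $\sum_{k=1}^{p} \alpha_k r_h[i-k] = r_h[i]$, $1 \le i \le p$
- where the autocorrelation of $h[n]$ is denoted by $r_h[\tau] = h[\tau] * h[-\tau]$.
- Suppose now that the system is excited with an impulse train of period $P$: $s[n] = \sum_{k=-\infty}^{\infty} h[n - kP]$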
40Limitations of the linear prediction model
- The normal equations associated with $s[n]$ (windowed over multiple pitch periods) for an order-$p$ predictor are given by $\sum_{k=1}^{p} \alpha_k r_n[i-k] = r_n[i]$, $1 \le i \le p$
- It can be shown that $r_n[\tau]$ is equal to periodically repeated replicas of $r_h[\tau]$, but with decreasing amplitude due to the windowing (Exercise 5.7).
41Limitations of the linear prediction model
- The autocorrelation function $r_n[\tau]$ of the windowed signal $s[n]$ can be thought of as an aliased version of $r_h[\tau]$; the overlap of the replicas introduces distortion.
- When aliasing is minor, the two solutions are approximately equal.
- The accuracy of this approximation decreases as the pitch period decreases (e.g., high pitch) due to the increased overlap of autocorrelation replicas repeated every $P$ samples.
42Limitations of the linear prediction model
- Sources of error
- Aliasing increases with high-pitched speakers (smaller pitch period $P$).
- The signal is not truly periodic.
- Speech is not always all-pole.
- The autocorrelation method is a suboptimal solution.
- The covariance method is capable of giving the optimal solution; however, it is not guaranteed to converge when the underlying signal does not follow an all-pole model.
43The Levinson Recursion of the Autocorrelation
method
- Direct inversion (Gaussian elimination) requires on the order of $p^3$ multiplies and additions.
- Levinson recursion (1947)
- Requires on the order of $p^2$ multiplies and additions.
- Links directly to the concatenated lossless tube model (Chapter 4) and thus provides a mechanism for estimating the vocal tract area function from an all-pole model estimate.
44The Levinson Recursion of the Autocorrelation
method
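- Step 1: initialize $\alpha_0^{(0)} = 0$, $E^{(0)} = r[0]$
- for $i = 1, 2, \ldots, p$
- Step 2: $k_i = \left( r[i] - \sum_{j=1}^{i-1} \alpha_j^{(i-1)} r[i-j] \right) / E^{(i-1)}$
- Step 3: $\alpha_i^{(i)} = k_i$ and $\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\,\alpha_{i-j}^{(i-1)}$, $1 \le j \le i-1$
- Step 4: $E^{(i)} = (1 - k_i^2)\,E^{(i-1)}$
- end
$k_i$: partial correlation coefficients (PARCOR)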
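The recursion can be sketched in a few lines. The version below assumes the standard form of the four steps (initialize the error to r[0]; compute the PARCOR coefficient; update the predictor coefficients; update the error) with the sign convention A(z) = 1 - sum_k alpha_k z^{-k}. The function name `levinson` and the autocorrelation values are illustrative assumptions.

```python
def levinson(r, p):
    """Solve sum_k alpha_k r[|i-k|] = r[i], i = 1..p, by the Levinson recursion.

    Returns (alpha, k, E): predictor coefficients alpha_1..alpha_p,
    PARCOR coefficients k_1..k_p, and the minimum prediction error.
    """
    alpha = []                 # the order-0 predictor has no coefficients
    k = []
    E = r[0]                   # Step 1: E^(0) = r[0]
    for i in range(1, p + 1):
        # Step 2: partial correlation coefficient
        ki = (r[i] - sum(alpha[j - 1] * r[i - j] for j in range(1, i))) / E
        # Step 3: order-i coefficients from the order-(i-1) coefficients
        alpha = [alpha[j - 1] - ki * alpha[i - j - 1] for j in range(1, i)] + [ki]
        # Step 4: update the minimum prediction error
        E = (1.0 - ki * ki) * E
        k.append(ki)
    return alpha, k, E

# Hypothetical autocorrelation values (a valid positive-definite sequence).
r = [1.0, 0.5, 0.2]
alpha, k, E = levinson(r, 2)
```

Each pass through the loop costs on the order of i operations, which gives the overall p^2 count; the result can be checked by substituting `alpha` back into the normal equations.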
45The Levinson Recursion of the Autocorrelation
method
- It can be shown that on each iteration the predictor coefficients $\alpha_k$ can be written solely as functions of the autocorrelation coefficients (Exercise 5.11).
- The desired transfer function is given by $H(z) = \frac{A}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}$
- The gain $A$ has yet to be determined.
46Properties of the Levinson Recursion of the
Autocorrelation method
- The magnitude of each partial correlation coefficient is less than 1: $|k_i| < 1$ for all $i$.
- This condition is sufficient for stability: if all $|k_i| < 1$, then all roots of $A(z)$ are inside the unit circle.
- The autocorrelation method gives a minimum-phase solution even when the actual system is mixed-phase.
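The stability property can be checked by brute force for a second-order model. The sketch below assumes the sign convention A(z) = 1 - alpha_1 z^{-1} - alpha_2 z^{-2}, for which the Levinson update gives alpha_1 = k_1 (1 - k_2) and alpha_2 = k_2; the helper names and the grid of test values are illustrative.

```python
import cmath

def step_up_order2(k1, k2):
    """Order-2 predictor coefficients from PARCOR coefficients k1, k2."""
    return k1 * (1.0 - k2), k2

def poles_order2(alpha1, alpha2):
    """Roots of z^2 - alpha1 z - alpha2, i.e., the poles of 1/A(z)."""
    d = cmath.sqrt(alpha1 * alpha1 + 4.0 * alpha2)
    return (alpha1 + d) / 2.0, (alpha1 - d) / 2.0

# Sweep PARCOR values strictly inside (-1, 1) and record the largest pole radius.
grid = [x / 10.0 for x in range(-9, 10)]
max_radius = max(abs(z)
                 for k1 in grid for k2 in grid
                 for z in poles_order2(*step_up_order2(k1, k2)))
```

For every pair on the grid the poles stay inside the unit circle, consistent with the sufficiency of |k_i| < 1 for stability.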
47Example 5.4
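- Consider the discrete-time model of the complete transfer function from the glottis to the lips derived in Chapter 4 (Equation 4.40), but without zero contributions from the radiation and vocal tract: $H(z) = \frac{A}{(1 - \beta z)^2 \prod_{k=1}^{C_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$
- Suppose we measure a single impulse response, denoted by $h[n]$, which is equal to the inverse z-transform of $H(z)$, and estimate the model with the autocorrelation method, setting the number of poles of $H(z)$ correctly ($p = 2 + 2C_i$) and with the prediction error defined over the entire duration of $h[n]$. This yields a solution $\hat{H}(z) = \frac{A}{(1 - \beta z^{-1})^2 \prod_{k=1}^{C_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$ in which the two maximum-phase glottal poles are reflected inside the unit circle.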
48Experimentation Results
49Properties of the Levinson Recursion of the
Autocorrelation method
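- Formal explanation
- Suppose $s[n]$ follows an all-pole model.
- The prediction error function is defined over all time (i.e., no window truncation effects).
- Write $S(\omega) = M_s(\omega)\,e^{j[\theta_{\min}(\omega) + \theta_{\max}(\omega)]}$, where $\theta_{\min}(\omega)$ and $\theta_{\max}(\omega)$ are the Fourier transform phase functions for the minimum- and maximum-phase contributions of $S(\omega)$, respectively.
- The autocorrelation solution can be expressed as (Exercise 5.14) $\hat{S}(\omega) = M_s(\omega)\,e^{j[\theta_{\min}(\omega) - \theta_{\max}(\omega)]}$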
50Properties of the Levinson Recursion of the
Autocorrelation method
- Exercise 5.14: rationalization of the result
- $\theta_{\min}(\omega)$ is the minimum-phase contribution due to the vocal tract poles inside the unit circle, and $\theta_{\max}(\omega)$ is the maximum-phase contribution due to the glottal poles outside the unit circle. The resulting estimated frequency response can be expressed as $\hat{H}(\omega) = |H(\omega)|\,e^{j[\theta_{\min}(\omega) - \theta_{\max}(\omega)]}$
- The phase distortion of synthesized speech can have perceptual consequences, since a gradual onset of the glottal flow, and thus of the speech waveform during the open phase of the glottal cycle, is transformed into a sharp attack, consistent with the energy concentration property of minimum-phase sequences (Chapter 2).
51Properties of the Levinson Recursion to
Autocorrelation method
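- Reverse Levinson recursion: how to obtain lower-order models from higher-order ones? With $\alpha_j^{(i)}$ denoting the $j$th coefficient of the order-$i$ predictor, $k_i = \alpha_i^{(i)}$ and $\alpha_j^{(i-1)} = \frac{\alpha_j^{(i)} + k_i\,\alpha_{i-j}^{(i)}}{1 - k_i^2}$, $1 \le j \le i-1$
- Autocorrelation matching: let $r_n[\tau]$ be the autocorrelation of the speech signal $s[n+m]\,w[m]$ and $r_h[\tau]$ the autocorrelation of $h[n] = Z^{-1}\{H(z)\}$; then
- $r_n[\tau] = r_h[\tau]$ for $|\tau| \le p$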
52Autocorrelation Method
- Gain computation
- $E_n$ is the average minimum prediction error for the $p$th-order predictor: $E_n = r_n[0] - \sum_{k=1}^{p} \alpha_k r_n[k]$
- If the energy in the all-pole impulse response $h[m]$ equals the energy in the measurement $s_n[m]$, then
- the squared gain equals the minimum prediction error: $A^2 = E_n$
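The gain rule can be checked on a signal whose gain is known. The sketch below assumes a first-order model with illustrative values A = 2 and a = 0.8 and a long analysis segment, so that truncation effects are negligible; `lpc_gain` is a hypothetical helper name.

```python
import math

def lpc_gain(r, alpha):
    """Gain A with A^2 = r[0] - sum_k alpha_k r[k], the minimum prediction error."""
    err = r[0] - sum(a_k * r[k] for k, a_k in enumerate(alpha, start=1))
    return math.sqrt(err)

# Long segment of h[n] = A a^n u[n] with assumed values A = 2, a = 0.8.
A_true, a = 2.0, 0.8
h = [A_true * a**n for n in range(400)]
r = [sum(h[m] * h[m + tau] for m in range(len(h) - tau)) for tau in range(2)]

alpha = [r[1] / r[0]]          # first-order autocorrelation-method solution
A_est = lpc_gain(r, alpha)
```

For this input the estimated gain recovers the scale of the driving impulse, as expected when the model order is correct and enough data are used.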
53Autocorrelation Method
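- Relationship to the lossless tube model
- Recall that for the lossless concatenated tube model with glottal impedance $Z_g(z) = \infty$ (open circuit), the transfer function is $V(z) = \frac{A z^{-N/2}}{D(z)}$
- where $D(z) = D_N(z)$ is obtained recursively from $D_0(z) = 1$, $D_k(z) = D_{k-1}(z) + r_k z^{-k} D_{k-1}(z^{-1})$, $k = 1, 2, \ldots, N$
- $N$ is the number of tubes, and the reflection coefficient $r_k$ is a function of the cross-sectional areas of successive tubes, i.e., $r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}$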
54Relationship to Lossless Tube Model
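- Levinson recursion
- can be written in the z-domain (see Appendix 5.B) as $A_i(z) = A_{i-1}(z) - k_i z^{-i} A_{i-1}(z^{-1})$
- The starting condition is obtained by mapping $\alpha_0^{(0)} = 0$ to $A_0(z) = 1$.
- The two recursions are identical when $r_i = -k_i$, which then makes $D_i(z) = A_i(z)$.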
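The identity r_i = -k_i gives a direct route from PARCOR coefficients to relative tube areas. The sketch below assumes the Chapter 4 relation r_i = (A_{i+1} - A_i) / (A_{i+1} + A_i), which rearranges to A_{i+1} / A_i = (1 + r_i) / (1 - r_i). The function name, the tube ordering, and the normalization of the first area to 1 are assumptions for illustration.

```python
def areas_from_parcor(k, first_area=1.0):
    """Relative cross-sectional areas of successive tubes from PARCOR coefficients.

    Uses r_i = -k_i and A_{i+1} = A_i * (1 + r_i) / (1 - r_i).
    Only area ratios are determined; first_area fixes the overall scale.
    """
    areas = [first_area]
    for ki in k:
        ri = -ki
        areas.append(areas[-1] * (1.0 + ri) / (1.0 - ri))
    return areas

# Hypothetical PARCOR coefficients for a three-section tube.
areas = areas_from_parcor([-0.5, 0.5])
```

Because |k_i| < 1, every ratio is positive and finite, so the recovered areas are always physically meaningful up to the unknown scale.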
55Relationship to Lossless Tube Model
- Since the boundary condition was not included in the lossless tube model, $V(z)$ represents the ratio between an ideal volume velocity at the glottis and at the lips.
- The speech pressure measurement at the lips, however, has embedded within it the glottal shape $G(z)$ as well as the radiation at the lips $R(z)$. Recall that for the voiced case (with no vocal tract zeros) $H(z) = A\,G(z)\,V(z)\,R(z)$
- The presence of the glottal shape, i.e., $G(z)$, thus introduces poles that are not part of the vocal tract.
- The net effect of the glottal shape is typically a 6 dB/octave fall-off added to the spectral tilt of $V(z)$ (see slide 94 of the presentation Acoustic of Speech Production).
- The influence of the glottal flow shape and radiation load can be approximately removed with a pre-emphasis of 6 dB/octave spectral rise.
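A common way to realize such a pre-emphasis is a first-order difference filter. The sketch below is one possible choice, not necessarily the one used in the course; the coefficient 0.95 is an assumed typical value and `pre_emphasize` is a hypothetical helper name.

```python
def pre_emphasize(x, beta=0.95):
    """First-order high-pass y[n] = x[n] - beta * x[n-1], with x[-1] taken as 0."""
    return [x[n] - (beta * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

# A constant (zero-frequency) input is strongly attenuated after the first sample.
y = pre_emphasize([1.0, 1.0, 1.0, 1.0])
```

The filter places a single zero near z = 1, which suppresses low frequencies relative to high frequencies and so approximates the desired spectral rise before linear prediction analysis.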
56Example 5.5
- The following figure shows two examples of good matches to measured vocal tract area functions for the vowels /a/ and /i/, derived from estimates of the partial correlation coefficients.
57Frequency Domain Interpretation
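- Consider an all-pole model of speech production: $H(\omega) = \frac{A}{A(\omega)}$
- where $A(\omega)$ is given by $A(\omega) = 1 - \sum_{k=1}^{p} \alpha_k e^{-j\omega k}$
- Define $Q(\omega)$ as the difference of the log-magnitude of the measured and modeled spectra: $Q(\omega) = \log|S(\omega)|^2 - \log|H(\omega)|^2$
- Recall $E(\omega) = A(\omega)\,S(\omega)$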
58Frequency Domain Interpretation
- Thus we can write $Q(\omega)$ as $Q(\omega) = \log\left|\frac{E(\omega)}{A}\right|^2$
- Thus as $e[n]$ is minimized, $E(\omega)$ is minimized, which in turn minimizes $Q(\omega)$: the spectral difference between the actual measured speech spectrum and the modeled spectrum is minimized.
59Linear Prediction Analysis of Stochastic Speech
Sounds
- Linear prediction analysis was motivated by the observation that, for a single impulse or periodic impulse train input to an all-pole vocal tract model, the prediction error is zero most of the time.
- Such analysis appears not to be applicable to speech sounds with fricative or aspirated sources modeled as a stochastic (or random) process.
- However, the autocorrelation method of linear prediction can be formulated for the stochastic case, where a white noise input takes on the role of the single impulse.
- The solution to a stochastic optimization problem, analogous to the minimization of the mean-squared error function $E_n$, leads to normal equations that are the stochastic counterparts to our earlier solution.
- Derivation and interpretation of this stochastic optimization problem is left as an exercise.
60Criterion of Goodness
- How well does linear prediction describe the speech signal in time and in frequency?
- Time domain
- Suppose
- the underlying speech model is an all-pole model of order $p$, and
- the autocorrelation method is used in the estimation of the coefficients of the predictor polynomial $P(z)$.
- If the predictor coefficients are estimated exactly, then the prediction error is
- a perfect impulse train for voiced speech,
- a single impulse for a plosive,
- white noise for noisy (stochastic) speech.
61Time Domain
- The autocorrelation method of linear prediction analysis does not yield such idealized outputs when the measurement $s[n]$ is inverse filtered by the estimated system function $A(z)$ (a limitation of the method).
- Even when the vocal tract response follows an all-pole model, the true solution cannot be obtained, since the obtained solution approaches the true solution only in the limit of an infinite amount of data.
- In a typical waveform segment, the actual vocal tract impulse response is not all-pole for a variety of reasons:
- Presence of zeros due to
- the radiation load,
- nasalization,
- the back vocal cavity during frication and plosives.
- The glottal flow shape, even when adequately modeled, is not minimum phase (see Example 5.6).
62Prediction Error Residuals
- Autocorrelation method of linear prediction of order 14.
- Estimation performed over 20 ms Hamming-windowed speech segments.
63Prediction Error Residuals
- When the residual is reconstructed from an entire utterance, one typically hears in the prediction error
- not a noisy buzz, as expected from an idealized residual, but rather
- roughly the speech itself.
- This implies that some of the vocal tract spectrum is passing through the inverse filter.
64Frequency Domain
- The behavior of linear prediction analysis can alternatively be studied in the frequency domain:
- How well does the spectrum derived from linear prediction analysis match the spectrum of a sequence that follows
- an all-pole model, and
- not an all-pole model?
65Frequency Domain-Voiced Speech
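- Recall that for voiced speech $s[n] = u_g[n] * h[n]$, where $u_g[n]$ is a periodic impulse train with period $P$ and Fourier transform $U_g(\omega)$.
- $h[n]$ is the vocal tract impulse response with all-pole frequency response $H(\omega)$. The windowed speech $s_n[n]$ is $s_n[n] = s[n]\,w[n]$
- The Fourier transform of the windowed speech $s_n[n]$ is $S_n(\omega) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(k\omega_o)\,W(\omega - k\omega_o)$
- where
- $W(\omega)$ is the window transform
- $\omega_o = 2\pi/P$ is the fundamental frequency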
66Frequency Domain-Unvoiced Speech
- Recall for unvoiced speech (stochastic sounds)
- Linear prediction analysis attempts to estimate $H(\omega)$, the spectral envelope of the harmonic spectrum $S(\omega)$.
67Schematics of Spectra for Periodic and Stochastic
Speech Sounds
68Properties
- For large $p$, $|H(\omega)|$ matches the Fourier transform magnitude of the windowed signal, $|S(\omega)|$.
69Properties
- Spectral peaks are better matched than spectral valleys.
70Properties
71Synthesis Based on All-pole Modeling Properties
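- We are now able to synthesize the waveform from model parameters estimated using linear prediction analysis.
- Synthesized signal: $s[n] = \sum_{k=1}^{p} \alpha_k s[n-k] + A\,u[n]$, where $u[n]$ is the excitation (impulse train or noise).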
72Synthesis Based on All-pole Modeling
- Important parameters to consider
- Window duration
- 20-30 ms gives a satisfactory time-frequency tradeoff (Exercise 5.20).
- The duration can be adaptively varied to account for different time-frequency resolution requirements based on
- pitch,
- voicing state,
- phoneme class.
- Frame interval
- A typical interval at which to perform the analysis is 10 ms.
- Model order
- There are three components to be considered:
- Vocal tract
- On average, a resonant density of one resonance per 1000 Hz.
- Order of the system: poles = 2 x resonances (e.g., for a 5000 Hz bandwidth signal, 2 x 5 = 10 poles).
- Glottal flow
- 2-pole maximum-phase model.
- Radiation at the lips
- 1 zero inside the unit circle, for which 4 poles provide an adequate representation.
- Total of 16 poles.
73Synthesis Based on All-pole Modeling
- Voiced/unvoiced state and pitch estimation
- Currently no discrimination is made between, for example, plosive and fricative unvoiced speech sound categories.
- Pitch is estimated during voiced regions of speech only. However, pitch estimation algorithms typically estimate pitch as well as perform voiced/unvoiced classification.
- A degree of voicing may be desired in more complex analysis and synthesis methods, where
- voicing and turbulence occur simultaneously:
- voiced fricatives,
- breathy vowels.
74Synthesis Based on All-pole Modeling
- Synthesis structures
- Determine the excitation for each frame.
- Generate the excitation for each frame by
- concatenating an impulse train during voiced signal (spacing determined by the time-varying pitch contour), or
- white noise during unvoiced signal.
- Compute the gain
- directly by measuring frame energy, or
- using the autocorrelation method:
- Voiced speech: the magnitude of the impulse is the square root of the signal energy.
- Unvoiced speech: noise variance = signal variance.
- Update the filter values on each frame. Overlap and add the signal at consecutive frames.
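A frame-based synthesizer along these lines can be sketched as follows. Note that this sketch swaps in a simpler structure than the overlap-add scheme named above: it runs a single all-pole recursion whose memory is carried across frame boundaries while the coefficients and gain change once per frame, and it restarts the impulse train at each frame. The function names, the frame dictionary layout, and all parameter values are assumptions for illustration.

```python
import random

def make_excitation(n_samples, voiced, pitch_period=None, rng=None):
    """Unit impulse train (voiced) or unit-variance Gaussian noise (unvoiced)."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = rng or random.Random(0)          # fixed seed keeps the sketch repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def synthesize(frames, frame_len):
    """frames: list of dicts with keys alpha, gain, voiced, and (if voiced) pitch_period."""
    out = []
    for f in frames:
        u = make_excitation(frame_len, f["voiced"], f.get("pitch_period"))
        for n in range(frame_len):
            # s[n] = sum_k alpha_k s[n-k] + gain * u[n], using samples already produced
            past = sum(a_k * out[-k]
                       for k, a_k in enumerate(f["alpha"], start=1)
                       if len(out) >= k)
            out.append(past + f["gain"] * u[n])
    return out

# Hypothetical parameters: one voiced frame followed by one unvoiced frame.
frames = [
    {"alpha": [1.2, -0.72], "gain": 1.0, "voiced": True, "pitch_period": 40},
    {"alpha": [0.5], "gain": 0.1, "voiced": False},
]
speech = synthesize(frames, frame_len=80)
```

In a fuller implementation the pitch phase would be tracked across frames, and the per-frame outputs could instead be windowed and overlap-added as the slide describes.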
75Synthesis structures
76Alternate Synthesis Structures