Speech Processing - PowerPoint PPT Presentation

Title:

Speech Processing

Description:

Speech Processing Analysis and Synthesis of Pole-Zero Speech Models Introduction Deterministic: Speech Sounds with periodic or impulse sources Stochastic: Speech ... – PowerPoint PPT presentation

Slides: 77
Provided by: vkep

Transcript and Presenter's Notes

Title: Speech Processing


1
Speech Processing
  • Analysis and Synthesis of Pole-Zero Speech Models

2
Introduction
  • Deterministic
  • Speech Sounds with periodic or impulse sources
  • Stochastic
  • Speech Sounds with noise sources
  • Goal is to derive vocal tract model of each class
    of sound source.
  • It will be shown that solution equations for the
    two classes are similar in structure.
  • Solution approach is referred to as linear
    prediction analysis.
  • Linear prediction analysis leads to a method of
    speech synthesis based on the all-pole model.
  • Note that all-pole model is intimately associated
    with the concatenated lossless tube model of
    previous chapter (i.e., Chapter 4).

3
All-Pole Modeling of Deterministic Signals
  • Consider a vocal tract transfer function during
    voiced source

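[Figure: block diagram of voiced speech production. An impulse train with pitch period T_pitch and gain A drives the glottal model G(z) to give u_g[n], which passes through the vocal tract model V(z) and the radiation model R(z) to produce the speech signal s[n].]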
4
All-Pole Modeling of Deterministic Signals
  • What about the fact that R(z) is a zero model?
  • A single zero can be expressed as an infinite
    set of poles. Note that, for |z| > |a|,
    1 - a z^-1 = 1 / (Σ_{k=0}^{∞} a^k z^-k).
  • From the above expression one can derive

5
All-Pole Modeling of Deterministic Signals
  • In practice, the infinite number of poles is
    approximated with a finite set of poles, since
    a^k → 0 as k → ∞ (for |a| < 1).
  • H(z) can then be considered an all-pole representation.
  • Representing a zero with a large number of poles
    is inefficient.
  • Estimating zeros directly is a more efficient
    approach (covered later in this chapter).
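
The truncation argument above can be checked numerically. The following is a minimal sketch, assuming the geometric-series identity 1 - a z^-1 = 1 / Σ a^k z^-k and keeping only the first `num_poles` terms of the sum; the function names are illustrative and not part of the original slides.

```python
def all_pole_impulse_response(denominator, length):
    """Impulse response of 1 / (d0 + d1 z^-1 + ... + dK z^-K)."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0  # unit-sample input
        for k in range(1, len(denominator)):
            if n - k >= 0:
                acc -= denominator[k] * h[n - k]
        h.append(acc / denominator[0])
    return h


def max_error_vs_single_zero(a, num_poles, length=64):
    """Largest deviation between the truncated all-pole filter and 1 - a z^-1."""
    # Truncated denominator: sum_{k=0}^{num_poles} a^k z^-k
    denominator = [a ** k for k in range(num_poles + 1)]
    approx = all_pole_impulse_response(denominator, length)
    target = [1.0, -a] + [0.0] * (length - 2)  # impulse response of the zero
    return max(abs(t - h) for t, h in zip(target, approx))
```

With a = 0.5 the mismatch shrinks quickly as poles are added, which illustrates why a finite set of poles is workable but also why many poles may be needed when |a| is close to 1.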

6
Model Estimation
  • Goal: estimate
  • the filter coefficients a_1, a_2, …, a_p for a
    particular order p, and
  • the gain A,
  • over a short time span of the speech signal
    (typically 20 ms) for which the signal is
    considered quasi-stationary.
  • Use the linear prediction method:
  • Each speech sample is approximated as a linear
    combination of past speech samples ⇒
  • A set of analysis techniques for estimating the
    parameters of the all-pole model.

7
Model Estimation
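  • Consider the z-transform of the vocal tract model:
    H(z) = S(z) / U_g(z) = A / (1 - Σ_{k=1}^{p} a_k z^-k)
  • Which can be transformed into
    S(z) = Σ_{k=1}^{p} a_k z^-k S(z) + A U_g(z)
  • In the time domain it can be written as
    s[n] = Σ_{k=1}^{p} a_k s[n-k] + A u_g[n]
  • This is referred to as an autoregressive (AR) model.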

8
Model Estimation
  • The method used to predict the current sample from a
    linear combination of past samples is called linear
    prediction analysis.
  • LPC: quantization of the linear prediction
    coefficients, or of a transformed version of these
    coefficients, is called linear prediction coding
    (Chapter 12).
  • For u_g[n] = 0, s[n] = Σ_{k=1}^{p} a_k s[n-k].
  • This observation motivates the analysis technique
    of linear prediction.

9
Model Estimation Definitions
  • A linear predictor of order p is defined by
    ŝ[n] = Σ_{k=1}^{p} α_k s[n-k], with system function
    P(z) = Σ_{k=1}^{p} α_k z^-k, where α_k is the
    estimate of a_k.
10
Model Estimation Definitions
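  • The prediction error sequence is given as the
    difference of the original sequence and its prediction:
    e[n] = s[n] - Σ_{k=1}^{p} α_k s[n-k]
  • The associated prediction error filter is defined as
    A(z) = 1 - Σ_{k=1}^{p} α_k z^-k
  • If α_k = a_k, then e[n] = A u_g[n].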
11
Model Estimation Definitions
  • Note 1:
  • Recovery of s[n]: pass e[n] through 1/A(z), the
    inverse of the prediction error filter.

12
Model Estimation Definitions
  • Note 2: If
  • the vocal tract contains a finite number of poles and
    no zeros, and
  • the prediction order is correct,
  • then
  • α_k = a_k, and
  • e[n] is an impulse train for voiced speech, and
    for impulsive speech e[n] is just a single impulse.
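
Note 2 can be illustrated with a small simulation. This is a sketch under stated assumptions: the AR model is taken in the standard form s[n] = Σ a_k s[n-k] + A u_g[n], the signal is zero before n = 0, and the function names and the example coefficients are illustrative rather than taken from the slides.

```python
def ar_synthesize(coeffs, excitation, gain=1.0):
    """Generate s[n] = sum_k a_k s[n-k] + gain * u[n], with zero initial state."""
    s = []
    for n, u in enumerate(excitation):
        acc = gain * u
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a_k * s[n - k]
        s.append(acc)
    return s


def prediction_error(alphas, s):
    """Inverse-filter s[n] with A(z) = 1 - sum_k alpha_k z^-k."""
    e = []
    for n in range(len(s)):
        predicted = sum(alpha * s[n - k]
                        for k, alpha in enumerate(alphas, start=1)
                        if n - k >= 0)
        e.append(s[n] - predicted)
    return e
```

When `alphas` equals the true coefficients, the error returned for an impulse-train excitation is the scaled impulse train itself; with mismatched coefficients the error is generally nonzero between pulses.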

13
Example 5.1
  • Consider an exponentially decaying impulse
    response of the form h[n] = a^n u[n], where u[n] is
    the unit step. The response to the scaled unit sample
    A δ[n] is s[n] = A a^n u[n].
  • Consider the prediction of s[n] using a linear
    predictor of order p = 1, i.e., α_1 s[n-1].
  • It is a good fit since s[n] = a s[n-1] for n ≥ 1.
  • The prediction error sequence with α_1 = a is
    e[n] = s[n] - a s[n-1] = A δ[n].
  • The prediction of the signal is exact except at
    the time origin.

14
Error Minimization
  • An important question is how to derive an estimate
    of the prediction coefficients α_k, for a
    particular order p, that is optimal in some
    sense.
  • Optimality is measured against a criterion. An
    appropriate measure of optimality is the mean-squared
    error (MSE).
  • The goal is to minimize the mean-squared prediction
    error E, defined as E = Σ_{m=-∞}^{∞} e²[m].
  • In reality, a model must be valid over some
    short-time interval, say M samples on either side
    of n.

15
Error Minimization
  • Thus in practice the MSE is time-dependent and is
    formed over a finite interval, as depicted in the
    previous figure:
    E_n = Σ_{m=n-M}^{n+M} e²[m]
  • [n-M, n+M]: prediction error interval.
  • Alternatively
  • where

16
Error Minimization
  • Determine the α_k for which E_n is minimal by setting
    ∂E_n/∂α_i = 0 for i = 1, 2, …, p.
  • Which results in

17
Error Minimization
  • Last equation can be rewritten by multiplying
    through
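  • Define the function
    φ_n[i,k] = Σ_{m=n-M}^{n+M} s_n[m-i] s_n[m-k]
  • Which gives the following:
    Σ_{k=1}^{p} α_k φ_n[i,k] = φ_n[i,0], for i = 1, 2, …, p
  • These are referred to as the normal equations, which
    can also be written in matrix form.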
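
The normal equations can be set up and solved directly. The sketch below assumes the standard covariance-method definition φ[i,k] = Σ s[m-i] s[m-k], summed over a prediction error interval given by the hypothetical arguments `start` and `stop`, and uses plain Gaussian elimination rather than any solver named in the slides.

```python
def covariance_lpc(s, start, stop, order):
    """Solve sum_k alpha_k phi[i,k] = phi[i,0], with phi summed over m = start..stop."""
    if start < order:
        raise ValueError("need `order` samples before the interval")

    def phi(i, k):
        return sum(s[m - i] * s[m - k] for m in range(start, stop + 1))

    # Augmented matrix [Phi | phi[i,0]] for i, k = 1..order.
    rows = [[phi(i, k) for k in range(1, order + 1)] + [phi(i, 0)]
            for i in range(1, order + 1)]

    # Gaussian elimination with partial pivoting.
    for col in range(order):
        pivot = max(range(col, order), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, order):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, order + 1):
                rows[r][c] -= factor * rows[col][c]

    # Back substitution.
    alphas = [0.0] * order
    for i in range(order - 1, -1, -1):
        tail = sum(rows[i][j] * alphas[j] for j in range(i + 1, order))
        alphas[i] = (rows[i][order] - tail) / rows[i][i]
    return alphas
```

Because samples before `start` are used as they are (not zeroed), an exact all-pole signal whose excitation falls outside the interval is recovered exactly, up to rounding.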

18
Error Minimization
  • The minimum error for the optimal solution can be
    derived as follows
  • Last term in the equation above can be rewritten
    as

19
Error Minimization
  • Thus the error can be expressed as
    E_n = φ_n[0,0] - Σ_{k=1}^{p} α_k φ_n[0,k]

20
Error Minimization
  • Remarks
  • The order p of the actual underlying all-pole
    transfer function is not known.
  • The order can be estimated by observing that the
    prediction error of a p-th order predictor in theory
    equals that of a (p+1)-th order predictor.
  • Also, predictor coefficients for k > p equal zero
    (or in practice are close to zero and model only
    random noise effects).
  • The prediction error e_n[m] is non-zero only in the
    vicinity of time n, i.e., on [n-M, n+M].
  • In predicting values of the short-time sequence
    s_n[m], p values outside of the prediction error
    interval [n-M, n+M] are required.
  • The covariance method uses values outside the
    interval to predict values inside the interval.
  • The autocorrelation method assumes that speech
    samples are zero outside the interval.

21
Error Minimization
  • Matrix formulation
  • Projection Theorem
  • Columns of S_n: basis vectors
  • The error vector e_n is orthogonal to each basis
    vector: S_n^T e_n = 0, where
  • Orthogonality leads to

22
Autocorrelation Method
  • In previous section we have described a general
    method of linear prediction that uses samples
    outside the prediction error interval referred to
    as covariance method.
  • Alternative approach that does not consider
    samples outside analysis interval, referred to as
    autocorrelation method, will be presented next.
  • This method is
  • Suboptimal, however it
  • Leads to an efficient and stable solution to
    normal equations.

23
Autocorrelation Method
  • Assumes that the samples outside the time
    interval [n-M, n+M] are all zero, and
  • Extends the prediction error interval, i.e., the
    range over which we minimize the mean-squared
    error, to ±∞.
  • Conventions
  • Short-time interval [n, n+N_w-1], where N_w = 2M+1
    (note that it is not centered around sample n as in
    the previous derivation).
  • The segment is shifted to the left by n samples so
    that the first nonzero sample falls at m = 0. This
    operation is equivalent to
  • shifting the speech sequence s[m] by n samples to
    the left, and
  • windowing by an N_w-point rectangular window.

24
Autocorrelation Method
  • Windowed sequence can be expressed as
    s_n[m] = s[m+n] w[m]
  • This operation can be depicted in the figure
    presented on the right.

25
Autocorrelation Method
  • Important observations that are consequence of
    zeroing the signal outside of interval
  • Prediction error is nonzero only in the interval
    [0, N_w+p-1]
  • N_w: window length
  • p: predictor order
  • The prediction error is largest at the left and
    right ends of the segment. This is due to edge
    effects caused by the way the prediction is done:
  • the first samples are predicted from zeros to the
    left of the window, and
  • the zeros to the right of the window are predicted
    from nonzero samples.

26
Autocorrelation Method
  • To compensate for edge effects, a tapered
    window is typically used (e.g., Hamming).
  • This removes the possibility that the mean-squared
    error is dominated by end (edge) effects.
  • The data become distorted, however, which biases
    the estimates α_k.
  • Let the mean-squared prediction error be given
    by E_n = Σ_{m=0}^{N_w+p-1} e_n²[m]
  • Limits of summation refer to the new time origin, and
  • Prediction error outside this interval is zero.

27
Autocorrelation Method
  • Normal equations take the following form
    (Exercise 5.1):
    Σ_{k=1}^{p} α_k φ_n[i,k] = φ_n[i,0], 1 ≤ i ≤ p
  • where
    φ_n[i,k] = Σ_{m=0}^{N_w+p-1} s_n[m-i] s_n[m-k]

28
Autocorrelation Method
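  • Due to the summation limits depicted in the figure on
    the right, the function φ_n[i,k] can be written as
    φ_n[i,k] = Σ_{m=i}^{k+N_w-1} s_n[m-i] s_n[m-k]
  • recognizing that only samples in the interval
    [i, k+N_w-1] contribute to the sum, and
  • changing variable m → m-i:
    φ_n[i,k] = Σ_{m=0}^{N_w-1-(i-k)} s_n[m] s_n[m+(i-k)]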

29
Autocorrelation Method
  • Since the above expression is a function only of the
    difference i-k, we denote it as r_n[i-k].
  • Letting τ = i-k, referred to as the correlation
    lag, leads to the short-time autocorrelation
    function
    r_n[τ] = Σ_{m=0}^{N_w-1-τ} s_n[m] s_n[m+τ]

30
Autocorrelation Method
  • r_n[τ] = s_n[τ] * s_n[-τ]
  • The autocorrelation method leads to computation of
    the short-time sequence s_n[m] convolved with
    itself flipped in time.
  • The autocorrelation function is a measure of the
    self-similarity of the signal at different lags
    τ.
  • When r_n[τ] is large, signal samples spaced by
    τ are said to be highly correlated.

31
Autocorrelation Method
  • Properties of r_n[τ]
  • For an N-point sequence, r_n[τ] is zero outside
    the interval [-(N-1), N-1].
  • r_n[τ] is an even function of τ.
  • r_n[0] ≥ |r_n[τ]|
  • r_n[0] = energy of s_n[m]
  • If s_n[m] is a segment of a periodic sequence,
    then r_n[τ] is periodic-like with the same period.
  • Because s_n[m] is short-time, the overlapping data
    in the correlation decreases as τ increases ⇒
  • the amplitude of r_n[τ] decreases as τ increases.
  • With a rectangular window the envelope of r_n[τ]
    decreases linearly.
  • If s_n[m] is a random white noise sequence, then
    r_n[τ] is impulse-like, reflecting self-similarity
    only within a small neighbourhood.
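
Several of these properties can be verified on a short segment. A minimal sketch, assuming the short-time autocorrelation is the sum of products s_n[m] s_n[m+τ] over the overlapping part of an already-windowed segment; the function name is illustrative.

```python
def short_time_autocorrelation(segment, lag):
    """r[lag] = sum over the overlap of segment[m] * segment[m + |lag|]."""
    lag = abs(lag)  # r is an even function of the lag
    # For lag >= len(segment) the range is empty and the sum is 0.
    return sum(segment[m] * segment[m + lag]
               for m in range(len(segment) - lag))
```

For a constant segment under a rectangular window, the overlap shrinks by one sample per lag, which gives the linearly decreasing envelope mentioned above.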

32
Autocorrelation Method
33
Autocorrelation Method
  • Letting φ_n[i,k] = r_n[i-k], the normal equations take
    the form
    Σ_{k=1}^{p} α_k r_n[i-k] = r_n[i], 1 ≤ i ≤ p
  • The expression represents p linear equations with
    p unknowns, α_k for 1 ≤ k ≤ p.
  • Using the normal equation solution, it can be
    shown that the corresponding minimum mean-squared
    prediction error is given by
    E_n = r_n[0] - Σ_{k=1}^{p} α_k r_n[k]
  • Matrix form representation of the normal equations:
  • R_n α = r_n.

34
Autocorrelation Method
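  • Expanded form: row i of R_n α = r_n reads
    r_n[i-1] α_1 + r_n[i-2] α_2 + … + r_n[i-p] α_p = r_n[i],
    for i = 1, …, p, with r_n[-τ] = r_n[τ].
  • The R_n matrix is Toeplitz:
  • Symmetric about the diagonal
  • All elements along each diagonal are equal.
  • Matrix is invertible
  • Implies an efficient solution.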
35
Example 5.3
  • Consider a system with an exponentially decaying
    impulse response of the form h[n] = a^n u[n], with
    u[n] being the unit step function.
  • Estimate a using the autocorrelation method of
    linear prediction.
36
Example 5.3
  • Apply an N-point rectangular window [0, N-1] at n = 0.
  • Compute r_0[0] and r_0[1].
  • Using the normal equations

37
Example 5.3
  • Minimum squared error (from slide 33) is thus
    (Exercise 5.5)
  • For a 1st-order predictor, as in this example,
    the prediction error sequence for the true predictor
    (i.e., α_1 = a) is given by
  • e[n] = s[n] - a s[n-1] = δ[n] (see Example 5.1
    presented earlier). Thus the prediction of the
    signal is exact except at the time origin.
  • This example illustrates that with enough data
    the autocorrelation method yields a solution
    close to the true single-pole model for an
    impulse input.
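
The convergence claim in this example can be checked numerically. The sketch below assumes the first-order normal equation reduces to α_1 = r_0[1] / r_0[0] for the N-point rectangular-windowed sequence a^n; summing the geometric series gives the closed form α_1 = a (1 - a^(2N-2)) / (1 - a^(2N)), which is derived here and not quoted from the slides.

```python
def first_order_autocorrelation_estimate(a, num_samples):
    """Estimate alpha_1 = r[1] / r[0] from the windowed sequence a^n, n = 0..N-1."""
    h = [a ** n for n in range(num_samples)]
    r0 = sum(x * x for x in h)
    r1 = sum(h[m] * h[m + 1] for m in range(num_samples - 1))
    return r1 / r0
```

For a short window the estimate is biased below a; as N grows it approaches a.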

38
Limitations of the linear prediction model
  • When the underlying measured sequence is the
    impulse response of an arbitrary all-pole
    system, the autocorrelation method yields the
    correct result.
  • There are a number of speech sounds for which, even
    with an arbitrarily long data sequence, a true
    solution cannot be obtained.
  • Consider a periodic sequence simulating a steady
    voiced sound, formed by convolving a periodic
    impulse train p[n] with an all-pole impulse
    response h[n].
  • The z-transform of h[n] is given by

39
Limitations of the linear prediction model
  • Thus
  • Normal equations of this system are given by (see
    Exercise 5.7)
  • where the autocorrelation of h[n] is denoted by
    r_h[τ] = h[τ] * h[-τ].
  • Suppose now that the system is excited with an
    impulse train of period P.

40
Limitations of the linear prediction model
  • Normal equations associated with s[n] (windowed
    over multiple pitch periods) for an order-p
    predictor are given by
  • It can be shown that r_n[τ] is equal to
    periodically repeated replicas of r_h[τ], but
    with decreasing amplitude due to the windowing
    (Exercise 5.7).

41
Limitations of the linear prediction model
  • The autocorrelation function r_n[τ] of the
    windowed signal s[n] can be thought of as an
    aliased version of r_h[τ], due to overlap which
    introduces distortion.
  • When aliasing is minor the two solutions are
    approximately equal.
  • Accuracy of this approximation decreases as the
    pitch period decreases (e.g., high pitch) due to
    increase in overlap of autocorrelation replicas
    repeated every P samples.

42
Limitations of the linear prediction model
  • Sources of error
  • Aliasing increases with high pitched speakers
    (smaller pitch period P).
  • Signal is not truly periodic.
  • Speech not always all-pole.
  • Autocorrelation is a suboptimal solution.
  • Covariance method capable of giving optimal
    solution, however, is not guaranteed to converge
    when underlying signal does not follow an
    all-pole model.

43
The Levinson Recursion of the Autocorrelation
method
  • Direct inversion method (Gaussian
    elimination) requires on the order of p³ multiplies
    and additions.
  • Levinson Recursion (1947)
  • Requires on the order of p² multiplies and additions
  • Links directly to the concatenated lossless tube
    model (Chapter 4) and thus a mechanism for
    estimating the vocal tract area function from an
    all-pole-model estimation.

44
The Levinson Recursion of the Autocorrelation
method
  • Step 1
  • for i = 1, 2, …, p
  • Step 2
  • Step 3
  • Step 4
  • end

k_i: partial correlation (PARCOR) coefficients
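
Since the step equations did not survive in this transcript, the following is a sketch of the commonly used form of the Levinson recursion, not necessarily the exact step numbering of the slide. It assumes E^0 = r[0], k_i = (r[i] - Σ_j α_j r[i-j]) / E^(i-1), the update α_j ← α_j - k_i α_(i-j), and E^i = (1 - k_i²) E^(i-1); the function name is illustrative.

```python
def levinson(r, order):
    """Return (alphas, parcor, error) from autocorrelation values r[0..order]."""
    alphas = []      # predictor coefficients from the previous iteration
    parcor = []      # partial correlation coefficients k_1..k_i
    error = r[0]     # E^0 = r[0]
    for i in range(1, order + 1):
        # Partial correlation coefficient for this order.
        acc = r[i] - sum(alphas[j - 1] * r[i - j] for j in range(1, i))
        k_i = acc / error
        # Update lower-order coefficients, then append alpha_i = k_i.
        new_alphas = [alphas[j - 1] - k_i * alphas[i - j - 1]
                      for j in range(1, i)]
        new_alphas.append(k_i)
        alphas = new_alphas
        parcor.append(k_i)
        # Minimum prediction error after this order.
        error *= (1.0 - k_i * k_i)
    return alphas, parcor, error
```

For the autocorrelation of a single-pole impulse response a^n (taken over all time), a second-order fit should return α_1 ≈ a and α_2 ≈ 0, since the extra coefficient has nothing left to model.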
45
The Levinson Recursion of the Autocorrelation
method
  • It can be shown that on each iteration the
    predictor coefficients α_k can be written
    solely as functions of the autocorrelation
    coefficients (Exercise 5.11).
  • Desired transfer function is given by
    H(z) = A / (1 - Σ_{k=1}^{p} α_k z^-k)
  • Gain A has yet to be determined.

46
Properties of the Levinson Recursion of the
Autocorrelation method
  1. The magnitude of each partial correlation coefficient
    is less than 1: |k_i| < 1 for all i.
  2. The condition in 1 is sufficient for stability: if
    all |k_i| < 1, then all roots of A(z) are inside the
    unit circle.
  3. The autocorrelation method gives a minimum-phase
    solution even when the actual system is
    mixed-phase.

47
Example 5.4
  • Consider the discrete-time model of the complete
    transfer function from the glottis to the lips
    derived in Chapter 4 (Equation 4.40), but without
    zero contributions from the radiation and vocal
    tract
  • Suppose we measure a single impulse response,
    denoted by h[n], which is equal to the inverse
    z-transform of H(z), and estimate the model with the
    autocorrelation method, setting the number of
    poles of H(z) correctly to p = 2 + 2C_i, and with the
    prediction error defined over the entire duration
    of h[n], which yields a solution

48
Experimentation Results
49
Properties of the Levinson Recursion of the
Autocorrelation method
  • Formal explanation
  • Suppose sn follows an all-pole model
  • Prediction error function is defined over all
    time (i.e., no window truncation effects).
  • and are the Fourier
    transform phase functions for the minimum- and
    maximum-phase contributions of S(ω),
    respectively.
  • Autocorrelation solution can be expressed as
    (Exercise 5.14)

50
Properties of the Levinson Recursion of the
Autocorrelation method
  • Exercise 5.14 Rationalization of the Result
  • is the minimum-phase contribution due
    to the vocal tract poles inside the unit circle,
    and is maximum-phase contribution due
    to glottal poles outside the unit circle.
    Resulting estimated frequency response can be
    expressed as
  • The phase distortion of synthesized speech can
    have perceptual consequence since a gradual onset
    of the glottal flow, and thus of the speech
    waveform during the open phase of the glottal
    cycle, is transformed to a sharp attack
    consistent with the energy concentration property
    of minimum-phase sequences (Chapter 2).

51
Properties of the Levinson Recursion to
Autocorrelation method
  • Reverse Levinson recursion: how to obtain lower-order
    models from higher-order ones?
  • Autocorrelation matching: let r_n[τ] be the
    autocorrelation of the speech signal s[n+m] w[m]
    and r_h[τ] the autocorrelation of h[n] = Z^-1{H(z)};
    then
  • r_n[τ] = r_h[τ] for |τ| ≤ p

52
Autocorrelation Method
  • Gain Computation
  • E_n is the average minimum prediction error for
    the p-th order predictor.
  • If the energy in the all-pole impulse response
    h[m] equals the energy in the measurement s_n[m] ⇒
  • the squared gain is equal to the minimum prediction
    error: A² = E_n.

53
Autocorrelation Method
  • Relationship to Lossless Tube Model
  • Recall that for the lossless concatenated tube
    model, with glottal impedance Z_g(z) = ∞ (open
    circuit), the transfer function is
  • Recursively obtained from
  • N is the number of tubes, and the reflection
    coefficient r_k is a function of the cross-sectional
    areas of successive tubes, i.e.,

54
Relationship to Lossless Tube Model
  • Levinson Recursion
  • Can be written in the Z domain (see Appendix 5.B)
  • Starting condition is obtained by mapping α_0^0 = 0
    to
  • The two recursions are identical when r_i = -k_i, which
    then makes D_i(z) = A_i(z).
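
The link r_i = -k_i suggests a simple way to turn PARCOR coefficients into relative tube areas. This is a sketch under explicit assumptions: the reflection coefficient is taken in the common form r_k = (A_(k+1) - A_k) / (A_(k+1) + A_k), the first area is an arbitrary normalization, and the tube indexing direction is assumed; none of these details are recoverable from this transcript.

```python
def areas_from_parcor(parcor, first_area=1.0):
    """Relative tube areas from PARCOR coefficients, assuming r_i = -k_i."""
    areas = [first_area]  # arbitrary normalization (assumption)
    for k_i in parcor:
        r_i = -k_i
        # Invert r = (A_next - A_prev) / (A_next + A_prev) for A_next.
        areas.append(areas[-1] * (1.0 + r_i) / (1.0 - r_i))
    return areas
```

Only area ratios are determined this way; an absolute scale needs outside information, and |k_i| < 1 keeps every area positive.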

55
Relationship to Lossless Tube Model
  • Since the boundary condition was not included in
    the lossless tube model, V(z) represents the
    ratio between an ideal volume velocity at the
    glottis and at the lips
  • Speech pressure measurement at the lips output,
    however, has embedded within it the glottal shape
    G(z), as well as radiation at the lips R(z).
    Recall that for the voiced case (with no vocal
    tract zeros)
  • The presence of glottal shape, i.e., G(z), thus
    introduces poles that are not part of vocal
    tract.
  • The net effect of the glottal shape is typically to
    add a 6 dB/octave fall-off (see slide 94 of the
    presentation Acoustic of Speech Production) to
    the spectral tilt of V(z).
  • The influence of the glottal flow shape and
    radiation load can be approximately removed with
    a pre-emphasis of 6 dB/octave spectral rise.
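
A first-order difference filter is one common way to approximate such a pre-emphasis. The sketch below assumes the form y[n] = x[n] - β x[n-1]; the coefficient β = 0.95 is an illustrative choice, not a value given in the slides.

```python
def pre_emphasize(x, beta=0.95):
    """Apply y[n] = x[n] - beta * x[n-1], treating x[-1] as zero."""
    return [x[n] - beta * x[n - 1] if n > 0 else x[0]
            for n in range(len(x))]
```

A constant (zero-frequency) input is strongly attenuated after the first sample, while rapidly alternating input is boosted, which is the intended spectral rise.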

56
Example 5.5
  • The following figure shows two examples of
    good matches to measured vocal tract area
    functions for the vowels /a/ and /i/, derived from
    estimates of the partial correlation coefficients.

57
Frequency Domain Interpretation
  • Consider an all-pole model of speech production
  • Where A(ω) is given by
  • Define Q(ω) as the difference of the
    log-magnitude of the measured and modeled spectra
  • Recall

58
Frequency Domain Interpretation
  • Thus we can write Q(ω) as
  • Thus as e[n] is minimized ⇒ E(ω) is minimized,
    which in turn ⇒ Q(ω) is minimized ⇒ the spectral
    difference between the actual measured speech and
    modeled spectrum is minimized.

59
Linear Prediction Analysis of Stochastic Speech
Sounds
  • Linear Prediction analysis was motivated with
    observation that for a single impulse or periodic
    impulse train input to an all-pole vocal tract
    model, the prediction error is zero most of the
    time.
  • Such analysis appears not to be applicable to
    speech sounds with fricative or aspirated sources
    modeled as a stochastic (or random) process.
  • However, autocorrelation method of linear
    prediction can be formulated for the stochastic
    case where a white noise input takes on the role
    of the single impulse.
  • The solution to a stochastic optimization problem
    - analogous to the minimization of mean-squared
    error function En, leads to normal equations
    which are the stochastic counterparts to our
    earlier solution.
  • Derivation and interpretation of this stochastic
    optimization problem is left as an exercise.

60
Criterion of Goodness
  • How well does linear prediction describe the
    speech signal in time and in frequency?
  • Time Domain
  • Suppose
  • Underlying speech model is all-pole model of
    order p, and
  • Autocorrelation method is used in the estimation
    of the coefficients of the predictor polynomial
    P(z).
  • If predictor coefficients are estimated exactly
    then the prediction error
  • Is perfect impulse train for voiced speech
  • A single impulse for a plosive
  • A white noise for noisy (stochastic) speech.

61
Time Domain
  • Autocorrelation method of linear prediction
    analysis does not yield such idealized outputs
    when the measurement sn is inverse filtered by
    the estimated system function A(z) (method
    limitation)
  • Even when the vocal tract response follows an
    all-pole model, the true solution cannot be
    obtained, since the obtained solution approaches
    the true solution only in the limit when an infinite
    amount of data is available.
  • In a typical waveform segment, the actual vocal
    tract impulse response is not all-pole for
    variety of reasons
  • Presence of zeros due to
  • The radiation load,
  • Nasalization,
  • Back vocal cavity during frication and plosives.
  • Glottal flow shape even when adequately
    modeled, is not minimum phase (see example 5.6).

62
Prediction Error Residuals
  • Autocorrelation method of linear prediction of
    order 14
  • Estimation performed over 20 ms Hamming windowed
    speech segments.

63
Prediction Error Residuals
  • When the residuals are reconstructed from an entire
    utterance, one typically hears in the prediction error
  • not a noisy buzz, as expected from an idealized
    residual, but rather
  • roughly the speech itself
  • ⇒ Some of the vocal tract spectrum is passing
    through the inverse filter.

64
Frequency Domain
  • Behavior of linear prediction analysis can be
    studied alternatively in frequency domain
  • How well the spectrum derived from linear
    prediction analysis matches the spectrum of a
    sequence that follows
  • An all-pole model, and
  • Not an all-pole model.

65
Frequency Domain-Voiced Speech
  • Recall the model for voiced speech s[n], with
    glottal source Fourier transform U_g(ω).
  • Vocal tract impulse response with all-pole
    frequency response H(ω). Windowed speech s_n[n]
    is
  • Fourier transform of windowed speech s_n[n] is
  • Where
  • W(ω) is the window transform
  • ω_o = 2π/P is the fundamental frequency

66
Frequency Domain-Unvoiced Speech
  • Recall for unvoiced speech (stochastic sounds)
  • Linear prediction analysis attempts to estimate
    H(ω), the spectral envelope of the harmonic
    spectrum S(ω).

67
Schematics of Spectra for Periodic and Stochastic
Speech Sounds
68
Properties
  1. For large p, |H(ω)| matches the Fourier transform
    magnitude of the windowed signal, |S(ω)|.

69
Properties
  1. Spectral peaks are better matched than spectral
    valleys

70
Properties
71
Synthesis Based on All-pole Modeling Properties
  • Now able to synthesize the waveform from model
    parameters estimated using linear prediction
    analysis
  • Synthesized signal

72
Synthesis Based on All-pole Modeling
  • Important Parameters to Consider
  • Window Duration
  • 20-30 ms to give a satisfactory time-frequency
    tradeoff (Exercise 5.20).
  • Duration can be adaptively varied to account for
    different time-frequency resolution requirement
    based on
  • Pitch
  • Voicing state
  • Phoneme class.
  • Frame Interval
  • Typical rate at which to perform analysis is 10
    ms.
  • Model Order
  • There are three components to be considered
  • Vocal tract
  • On average, a resonant density of one resonance
    per 1000 Hz.
  • Order of the system: poles = 2 × resonances
    (e.g., for a 5000 Hz bandwidth signal, 2 × 5 = 10 poles)
  • Glottal flow
  • 2-pole maximum-phase model
  • Radiation at lips
  • 1 zero inside the unit circle ⇒ 4 poles provide
    adequate representation.
  • Total of 16 poles
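
The pole count above is simple arithmetic, sketched here as a rule of thumb. The function name and default arguments are illustrative; the rule itself (two poles per resonance, one resonance per 1000 Hz, plus 2 glottal and 4 radiation poles) is the one stated on this slide.

```python
def lpc_model_order(bandwidth_hz, glottal_poles=2, radiation_poles=4):
    """Rule-of-thumb predictor order for a signal of the given bandwidth."""
    # Roughly one resonance per 1000 Hz, two poles per resonance.
    resonances = round(bandwidth_hz / 1000.0)
    vocal_tract_poles = 2 * resonances
    return vocal_tract_poles + glottal_poles + radiation_poles
```

For a 5000 Hz bandwidth this reproduces the total of 16 poles; other bandwidths scale only the vocal tract part.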

73
Synthesis Based on All-pole Modeling
  • Voiced/Unvoiced State and Pitch Estimation
  • Currently no discrimination is done between for
    example plosive and fricative unvoiced speech
    sound categories.
  • Pitch is estimated during voiced regions of
    speech only. However, Pitch estimation algorithms
    typically estimate pitch as well as perform
    voiced/unvoiced classification.
  • A degree of voicing may be desired in more
    complex analysis and synthesis methods
  • Voicing and turbulence occur simultaneously
  • Voiced fricatives
  • Breathy vowels.

74
Synthesis Based on All-pole Modeling
  • Synthesis Structures
  • Determine excitation for each frame
  • Generate excitation for each frame by
  • Concatenating an impulse train during voiced
    signal (spacing determined by the time-varying
    pitch contour)
  • White noise during unvoiced signal.
  • Compute Gain
  • Directly by measuring frame energy
  • Using Autocorrelation method
  • Voiced speech: magnitude of the impulse is the square
    root of the signal energy.
  • Unvoiced speech: noise variance = signal
    variance.
  • Update filter values on each frame. Overlap and
    add signal at consecutive frames

75
Synthesis structures
76
Alternate Synthesis Structures