Title: Wide/Narrow Band Spectrograms
1Wide/Narrow Band Spectrograms
- Wide band (left)
- Combines harmonics
- Voiced speech: vocal fold pulses (glottal air puffs) show as vertical lines
- Narrow band (right)
- Individual harmonics
- Narrow-band displays formants horizontally
- No vocal pulses shown
- Display parameters
- Generally log power: $\log(\text{amplitude}^2)$
- Frame shift 1 ms typical
Spectrogram for a vowel sound
Spectrogram for a vowel with varying pitch
2Frame Positioning
- Pitch-synchronous
- Centered around a pitch period
- Varied size frames
- Unvoiced sections assume fixed pitch period
- Challenge: determine exact pitch period locations
- Pitch-asynchronous
- Fixed frames and shifts
- Typically a 25-30 ms frame width with a 10 ms frame shift
- Tradeoffs
- Too large: the frame contains more than one phoneme
- Too small: F0 and the harmonics cannot be determined
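The pitch-asynchronous scheme above can be sketched in a few lines. This is only an illustration: the function name `frame_bounds` and the 25 ms / 10 ms defaults are assumptions chosen to match the typical values quoted on the slide.

```python
def frame_bounds(n_samples, sample_rate, width_ms=25.0, shift_ms=10.0):
    """Return (start, end) sample indices for fixed-width, fixed-shift frames."""
    width = int(round(sample_rate * width_ms / 1000.0))   # samples per frame
    shift = int(round(sample_rate * shift_ms / 1000.0))   # samples between frame starts
    bounds = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + width <= n_samples:
        bounds.append((start, start + width))
        start += shift
    return bounds

# One second of audio at 16 kHz: 400-sample frames every 160 samples.
frames = frame_bounds(n_samples=16000, sample_rate=16000)
print(len(frames), frames[0], frames[1])  # 98 (0, 400) (160, 560)
```

Consecutive frames overlap by 15 ms here, which is why a single phoneme boundary usually appears in more than one frame.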
3Source Filter Separation
- Source: F0, correlating to pitch and intonation
- Filter: the spectral envelope
- Three separation approaches: filter bank, cepstral analysis, and linear prediction
- Importance: spectrum and pitch need to be studied separately
4Filter Bank
- Time Domain
- Series of linear band pass filters
- Frequency Domain
- Window a frame (e.g., Hamming)
- Perform Fourier Transform
- Warp frequencies (e.g., Mel scale)
- Compute weighted sum of each bin
- Advantage
- Simple and robust for finding the spectral envelope
- Okay for ASR (unless language is tonal)
- Disadvantage
- Loses too much detail to find pitch
- Peaks can fall between harmonics, so it is not good for TTS
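The frequency-domain steps (window, transform, warp, weighted sum) might look roughly like the sketch below. Several choices are assumptions rather than anything stated on the slide: the common mel formula 2595·log10(1 + f/700), triangular weights, a naive O(N^2) DFT to stay dependency-free, and all function names.

```python
import cmath
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def hamming(length):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def power_spectrum(frame):
    """Naive DFT power for bins 0..N/2 (slow, but fine for a short demo frame)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def mel_filter_bank(frame, sample_rate, n_filters):
    windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]
    power = power_spectrum(windowed)
    bin_hz = sample_rate / len(frame)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale.
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    energies = []
    for m in range(1, n_filters + 1):
        low, center, high = edges[m - 1], edges[m], edges[m + 1]
        energy = 0.0
        for k, p in enumerate(power):
            f = k * bin_hz
            if low < f <= center:        # rising edge of the triangle
                energy += p * (f - low) / (center - low)
            elif center < f < high:      # falling edge of the triangle
                energy += p * (high - f) / (high - center)
        energies.append(energy)
    return energies

# 64 samples of a 1 kHz tone sampled at 8 kHz, reduced to 8 band energies.
frame = [math.sin(2.0 * math.pi * 1000.0 * t / 8000.0) for t in range(64)]
bank = mel_filter_bank(frame, sample_rate=8000, n_filters=8)
```

Each output value pools several FFT bins, which is what smooths away the individual harmonics and makes the result unsuitable for pitch estimation.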
5The Cepstrum
- Definition: $c_n = F^{-1}\{\log|F\{x_n\}|\}$
- Note: sometimes the cepstrum is taken on the square of the spectrum rather than on the log of the spectrum
- Treat the spectrum as a wave
- Formant structure (the envelope) varies slowly
- Glottal pulse harmonics vary quickly
- Cepstrum separates the two
- Cepstral terminology
- "Cepstrum" is "spectrum" with its first four letters reversed
- Quefrency instead of frequency
- Lifter instead of filter
6Separating Source from Vocal Filter
- Source
- Excites particular fundamental frequencies
- The glottis source sometimes is noisy
- Filter
- The source is filtered resulting in vocal tract resonances
- Goal: separate excitation frequencies from the filter
- Process
- Time domain convolves source with filter ($u_n * v_n$)
- Convolution multiplies in the frequency domain ($UV$)
- Log converts multiplication to a sum ($\log(UV) = \log(U) + \log(V)$)
- The V (filter) varies slowly; the U (excitation) varies quickly
- The inverse operation separates $u_n$ and $v_n$ into different quefrencies
- Observations
- There are no pitch excitations in unvoiced speech
- Cepstral analysis works well for speech recognition applications
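The key step in the process above is that convolution in time becomes addition in the cepstrum. The sketch below checks that numerically on two toy sequences. The names `dft`, `real_cepstrum`, and `circular_convolve`, the toy values, and the use of the log magnitude (real cepstrum) are assumptions made for illustration.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum."""
    n = len(x)
    log_mag = [math.log(abs(c)) for c in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def circular_convolve(u, v):
    n = len(u)
    return [sum(u[m] * v[(t - m) % n] for m in range(n)) for t in range(n)]

u = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # toy "excitation" u_n
v = [1.0, -0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]   # toy "filter" v_n
speech = circular_convolve(u, v)                # time domain: u_n * v_n

lhs = real_cepstrum(speech)
rhs = [a + b for a, b in zip(real_cepstrum(u), real_cepstrum(v))]
max_diff = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_diff < 1e-9)  # True: cepstrum(u * v) = cepstrum(u) + cepstrum(v)
```

With real speech the two summed terms occupy different quefrency ranges, so a lifter can keep one and discard the other.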
7Cepstrum Process Illustration
Spectral envelope on the left, F0 is one of the
excitations
8Cepstrum Samples
Note: band-pass filtering to remove frequencies below 100 Hz or above 900 Hz can help
9Cepstral Mean Normalization (CMN)
For Automatic Speech Recognition
- For each window we perform a cepstral analysis
- Mel-scaled quefrencies summed into 13 to 39 bins
- The bins of each window form a cepstral vector; the sequence of vectors is $X = x_0, x_1, \ldots, x_{T-1}$
- Compute the mean of each vector coefficient: $\mu_k = \frac{1}{T}\sum_{t=0}^{T-1} x_t[k]$, where k is a vector coefficient
- Subtract $\mu_k$ from coefficient k of each vector in X
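The mean-subtraction step can be sketched directly from the formula. The function name and the two-coefficient toy vectors are assumptions for illustration; real front ends would use 13 to 39 coefficients per window.

```python
def cepstral_mean_normalize(vectors):
    """vectors: list of T cepstral vectors, all of the same length."""
    n_frames = len(vectors)
    n_coeffs = len(vectors[0])
    # mu_k: average of coefficient k over all T windows.
    means = [sum(vec[k] for vec in vectors) / n_frames for k in range(n_coeffs)]
    # Subtract mu_k from coefficient k of every vector.
    normalized = [[vec[k] - means[k] for k in range(n_coeffs)] for vec in vectors]
    return normalized, means

frames = [[1.0, 2.0], [3.0, 6.0]]
normalized, means = cepstral_mean_normalize(frames)
print(means)       # [2.0, 4.0]
print(normalized)  # [[-1.0, -2.0], [1.0, 2.0]]
```

After normalization every coefficient has zero mean across the utterance, which removes any constant offset shared by all windows.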
10Cepstral Evaluation
- The cepstral process eliminates phase data. However, human perception largely, but not totally, ignores phase
- Use the lower quefrencies to study the vocal filter
- Use the peak to study pitch and glottis behavior
- Zeroing the pitch portion of the cepstrum and transforming back to the frequency domain is an approach for speech recognition
- Disadvantage of cepstra: they are difficult to interpret using a visual plot
11Time Domain Pitch Detection
- Recall the autocorrelation pitch detection algorithm
- Correlate a window of speech with a previous window
- Find the best match
- Problem: too many false peaks
- Peak and center clipping
- Algorithm to reduce false peaks
- Clip the top/bottom of a signal
- Center the remainder around 0
- Other alternatives
- Researchers propose many other pitch detection algorithms
- There is much debate as to which is the best
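A minimal sketch of autocorrelation pitch detection with center clipping follows. It assumes one common form of center clipping (zero everything within a threshold of 0 and shift the rest toward 0); the 30% threshold, the 60-400 Hz search range, and the function names are illustrative assumptions.

```python
import math

def center_clip(frame, ratio=0.3):
    """Zero samples within +/- ratio*max and shift the remainder toward 0."""
    limit = ratio * max(abs(s) for s in frame)
    clipped = []
    for s in frame:
        if s > limit:
            clipped.append(s - limit)
        elif s < -limit:
            clipped.append(s + limit)
        else:
            clipped.append(0.0)
    return clipped

def autocorrelation(frame, lag):
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def estimate_pitch_hz(frame, sample_rate, min_hz=60.0, max_hz=400.0):
    clipped = center_clip(frame)
    min_lag = int(sample_rate / max_hz)   # shortest period considered
    max_lag = int(sample_rate / min_hz)   # longest period considered
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(clipped, lag))
    return sample_rate / best_lag

rate = 8000
frame = [math.sin(2.0 * math.pi * 160.0 * n / rate) for n in range(400)]
pitch = estimate_pitch_hz(frame, rate)
print(pitch)  # 160.0
```

A pure tone is an easy case; on real speech the formant structure creates the false peaks mentioned above, which is what the clipping step is meant to suppress.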
12Epoch Detection
- Simply determining the pitch is not sufficient for synthesis
- Unit selection requires accurate anchors to be able to merge segments of speech
- Otherwise clicks and other artifacts will be heard
- Pitch-marking or epoch-detection attempts to accurately mark pitch points
- Mark peaks or troughs
- Mark the instant of glottal closure (large negative pulse)
- There are many algorithms proposed, but this remains an open research area
13Linear Prediction Coding (LPC)
- Originally developed to compress (code) speech
- Although coding pertains to compression, the term LPC has much broader implications in NLP
- LPC is equivalent to the vocal tract model (Week 6)
- LPC is another computational method to
- Compute vocal tract reflection coefficients
- Compute vocal tract filter coefficients
- LPC is useful for separating source (glottis) from filter (vocal tract)
14Linear Predictive Encoding (LPC)
One approach; there are many others with better compression
- Pseudo Code
- WHILE not EOF
  - READ sample n (sn)
  - x = prediction()
  - error = x - sn
  - IF error too large to fit in compressed size
    - WRITE special code
    - WRITE sn
  - ELSE
    - WRITE error
- Concept
- Guess at the next value using a set of previous values
- Instead of outputting the actual data, output the error from the guess
- Fewer bits should be needed if the guess is good
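The pseudo code can be turned into a small round-trip sketch. The predictor here (a straight line through the last two samples), the `max_error` limit, and the `("raw", ...)` / `("err", ...)` tokens standing in for the "special code" are all assumptions; the slide does not specify them.

```python
def predict(history):
    """Hypothetical predictor: extend a straight line through the last two samples."""
    if len(history) >= 2:
        return 2 * history[-1] - history[-2]
    return history[-1] if history else 0

def encode(samples, max_error=7):
    stream, history = [], []
    for s in samples:
        error = predict(history) - s          # error = x - sn
        if abs(error) > max_error:            # too large for the compressed size
            stream.append(("raw", s))         # special code followed by the sample
        else:
            stream.append(("err", error))
        history.append(s)
    return stream

def decode(stream):
    history = []
    for kind, value in stream:
        # The decoder runs the same predictor, so it can undo the error term.
        history.append(value if kind == "raw" else predict(history) - value)
    return history

samples = [0, 10, 20, 30, 41, 100]
stream = encode(samples)
print(stream)
# [('err', 0), ('raw', 10), ('err', 0), ('err', 0), ('err', -1), ('raw', 100)]
print(decode(stream) == samples)  # True
```

Only the two samples that break the linear trend need the escape path; the rest are stored as small errors.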
15Linear Algebra Background
- N equations and P unknowns
- If N < P, there are an infinite number of potential solutions
- x + y = 5
- Solutions are along the line y = 5 - x
- If N = P, there is at most one unique solution
- x + y = 5 and x - y = 3, solution x = 4, y = 1
- If N > P, there may not be even one solution
- No solutions for x + y = 4, x - y = 3, 2x + 7 = 7
- The best we can do is find the closest fit
16Least Squares minimize error
- First approach (linear algebra): find orthogonal projections of vectors onto the best fit
- Second approach (calculus): use the derivative with zero slope to find the best fit
17Solving n equations and n unknowns
- Gaussian Elimination
- Complexity O(n^3)
- Successive Iteration
- Complexity varies
- Cholesky Decomposition
- More efficient, still O(n^3)
- Levinson-Durbin
- Complexity O(n^2)
- Works for symmetric Toeplitz matrices
Definitions for any matrix A
- Transpose ($A^T$): replace $a_{ij}$ by $a_{ji}$ for all i and j
- Symmetric: $A^T = A$
- Positive definite: $x^T A x > 0$ for every nonzero vector x, so the factorization never needs the square root of a negative number (no complex values)
- Toeplitz: each diagonal running down to the right holds a single repeated value
- Lower/upper triangular: no nonzero values above/below the diagonal
18Symmetric Toeplitz Matrices
Example
- Flipping rows and columns produces the same matrix
- Every diagonal to the right contains the same value
19Levinson Durbin Algorithm
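Example: solve the symmetric Toeplitz system below, where $r_0, \ldots, r_4 = 1, 2, 3, 4, 5$
$$\begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{pmatrix} \begin{pmatrix} a_{41} \\ a_{42} \\ a_{43} \\ a_{44} \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \\ 4 \\ 5 \end{pmatrix}$$
Step 0: $E_0 = r_0 = 1$ (initial value)
Step 1: $k_1 = r_1/E_0 = 2$, $E_1 = (1 - k_1^2)E_0 = -3$
Step 2: $k_2 = (r_2 - a_{11} r_1)/E_1 = 1/3$, $E_2 = (1 - k_2^2)E_1 = -8/3$
Step 3: $k_3 = (r_3 - a_{21} r_2 - a_{22} r_1)/E_2 = 1/4$, $E_3 = (1 - k_3^2)E_2 = -5/2$
Step 4: $k_4 = (r_4 - a_{31} r_3 - a_{32} r_2 - a_{33} r_1)/E_3 = 1/5$, $E_4 = (1 - k_4^2)E_3 = -12/5$
Coefficients after each step:
$a_{11} = k_1 = 2$
$a_{21} = a_{11} - k_2 a_{11} = 4/3$, $a_{22} = k_2 = 1/3$
$a_{31} = a_{21} - k_3 a_{22} = 5/4$, $a_{32} = a_{22} - k_3 a_{21} = 0$, $a_{33} = k_3 = 1/4$
$a_{41} = a_{31} - k_4 a_{33} = 6/5$, $a_{42} = a_{32} - k_4 a_{32} = 0$, $a_{43} = a_{33} - k_4 a_{31} = 0$, $a_{44} = k_4 = 1/5$
- Verify results by plugging $a_{41}, a_{42}, a_{43}, a_{44}$ back into the equations
- 6/5(1) + 0(2) + 0(3) + 1/5(4) = 2, 6/5(2) + 0(1) + 0(2) + 1/5(3) = 3
- 6/5(3) + 0(2) + 0(1) + 1/5(2) = 4, 6/5(4) + 0(3) + 0(2) + 1/5(1) = 5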
20Levinson-Durbin Pseudo Code
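- E[0] = r[0]
- FOR step = 1 TO P
  - k[step] = r[step]
  - FOR i = 1 TO step-1 THEN k[step] = k[step] - a[step-1,i] * r[step-i]
  - k[step] = k[step] / E[step-1]
  - E[step] = (1 - k[step]^2) * E[step-1]
  - a[step,step] = k[step]
  - FOR i = 1 TO step-1 THEN a[step,i] = a[step-1,i] - k[step] * a[step-1,step-i]
Note: r[0] to r[P-1] are the row 1 matrix coefficients; r[1] to r[P] form the right-hand side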
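The recursion can be written as a short function and checked against the system with r = 1, 2, 3, 4, 5, whose solution is 6/5, 0, 0, 1/5. This is a sketch: the function name, the return layout, and the use of `Fraction` for exact arithmetic are assumptions for illustration.

```python
from fractions import Fraction

def levinson_durbin(r, order):
    """Solve sum_k a_k * r[|j-k|] = r[j] for j = 1..order.

    Returns (a, k, E): final coefficients, reflection terms, per-step errors.
    """
    a = []                        # coefficients from the previous step
    k_list = []
    errors = [Fraction(r[0])]     # E_0 = r_0
    for step in range(1, order + 1):
        k = Fraction(r[step])
        for i in range(1, step):
            k -= a[i - 1] * r[step - i]
        k /= errors[-1]
        # a[step][i] = a[step-1][i] - k * a[step-1][step-i], then a[step][step] = k
        a = [a[i - 1] - k * a[step - i - 1] for i in range(1, step)] + [k]
        k_list.append(k)
        errors.append((1 - k * k) * errors[-1])
    return a, k_list, errors

a, k, E = levinson_durbin([1, 2, 3, 4, 5], order=4)
print([str(c) for c in a])  # ['6/5', '0', '0', '1/5']
print([str(c) for c in k])  # ['2', '1/3', '1/4', '1/5']
print([str(e) for e in E])  # ['1', '-3', '-8/3', '-5/2', '-12/5']
```

The single pass over `step` with an inner loop of at most `step` terms is where the O(n^2) cost comes from.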
21Cholesky Decomposition
- Requirements
- Symmetric (same matrix if flip rows and columns)
- Positive definite matrix (no complex solutions)
- Solution
- Factor matrix A into $A = LL^T$ where L is lower triangular
- Perform forward substitution to solve $L(L^T a_k) = b_k$ for the intermediate vector $x_i = L^T a_k$
- Use the resulting vector, $x_i$, to perform a backward substitution to solve $L^T a_k = x_i$
- Complexity
- Factoring step O(n^3/3)
- Forward substitution O(n^2)
- Backward substitution O(n^2)
22Cholesky Factorization
Result
23Cholesky Factorization Pseudo Code
- FOR k = 1 TO n-1
  - l[k,k] = sqrt(a[k,k])
  - FOR j = k+1 TO n
    - l[j,k] = a[j,k] / l[k,k]
  - FOR j = k+1 TO n
    - FOR i = j TO n
      - a[i,j] = a[i,j] - l[i,k] * l[j,k]
- l[n,n] = sqrt(a[n,n])
- Column index: k
- Row index: j
- Elements of matrix A: a[i,j]
- Elements of matrix L: l[i,j]
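A direct translation of the factorization loop, followed by the forward and backward substitutions, might look like this. The 3x3 test matrix is a standard textbook example chosen here as an assumption (it is not from the slides), as are the function names.

```python
import math

def cholesky(a):
    """Return lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(a)
    a = [row[:] for row in a]                 # work on a copy of A
    l = [[0.0] * n for _ in range(n)]
    for k in range(n - 1):
        l[k][k] = math.sqrt(a[k][k])
        for j in range(k + 1, n):
            l[j][k] = a[j][k] / l[k][k]
        for j in range(k + 1, n):             # update the remaining lower triangle
            for i in range(j, n):
                a[i][j] -= l[i][k] * l[j][k]
    l[n - 1][n - 1] = math.sqrt(a[n - 1][n - 1])
    return l

def solve(a, b):
    l = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):                        # forward substitution: L y = b
        y[i] = (b[i] - sum(l[i][j] * y[j] for j in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):              # backward substitution: L^T x = y
        x[i] = (y[i] - sum(l[j][i] * x[j] for j in range(i + 1, n))) / l[i][i]
    return x

A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
L = cholesky(A)
print(L)                         # [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
x = solve(A, [0.0, 6.0, 39.0])
print(x)                         # [1.0, 1.0, 1.0]
```

If the matrix is not positive definite, `math.sqrt` raises a `ValueError` on a negative pivot, which is the "complex solution" case the requirements rule out.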
24Illustration Linear Prediction
- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
Goal: estimate $y_n$ using the three previous values: $y_n = a_1 y_{n-1} + a_2 y_{n-2} + a_3 y_{n-3}$
Three $a_k$ coefficients, frame size of 16
Thirteen equations and three unknowns
Note: the equation is an IIR filter
25LPC Basics
- Predict $y_n$ from $y_{n-1}, \ldots, y_{n-P}$
- $e_n = y_n - \sum_{k=1}^{P} a_k y_{n-k}$
- $e_n$ is the error between the projection and the actual value
- The goal is to find the coefficients that produce the smallest $e_n$ value
- Concept
- Square the error
- Take the partial derivative with respect to each $a_k$
- Find the solution with zero derivative (the minimum)
- Result: P equations with P unknowns
26Finding the Best LPC Estimate
- One linear prediction equation: $e_n = y_n - \sum_{k=1}^{P} a_k y_{n-k}$
- Over a whole frame we have N equations and P unknowns
- Sum $e_n$ over the entire frame: $E = \sum_{n=0}^{N-1} \left(y_n - \sum_{k=1}^{P} a_k y_{n-k}\right)$
- Square the total error: $E^2 = \sum_{n=0}^{N-1} \left(y_n - \sum_{k=1}^{P} a_k y_{n-k}\right)^2$
- Take the partial derivative with respect to each $a_j$; this generates P equations ($E_j$)
- Like a regular derivative, treating only $a_j$ as a variable: $E_j = -2 \sum_{n=0}^{N-1} \left(y_n - \sum_{k=1}^{P} a_k y_{n-k}\right) y_{n-j}$
- Calculus chain rule: if $y = y(u(x))$ then $dy/dx = (dy/du)(du/dx)$
- Set each $E_j$ to zero (zero derivative) to find the minimum error: for j = 1 to P, $0 = \sum_{n=0}^{N-1} \left(y_n - \sum_{k=1}^{P} a_k y_{n-k}\right) y_{n-j}$ (j indicates the equation)
- Rearrange terms: for each j of the P equations, $\sum_{n=0}^{N-1} y_n y_{n-j} = \sum_{n=0}^{N-1} \sum_{k=1}^{P} a_k y_{n-k} y_{n-j} = \sum_{k=1}^{P} a_k \sum_{n=0}^{N-1} y_{n-k} y_{n-j}$, that is, $f(j,0) = \sum_{k=1}^{P} a_k f(j,k)$
- Result: P equations and P unknowns, where $f(j,k) = \sum_{n=0}^{N-1} y_{n-k} y_{n-j}$
27Covariance Method
- Result from previous slide (equation j): $\sum_{n=0}^{N-1} y_n y_{n-j} = \sum_{k=1}^{P} \sum_{n=0}^{N-1} a_k y_{n-k} y_{n-j}$
- A more concise notation, with $f(j,k) = \sum_{n=0}^{N-1} y_{n-k} y_{n-j}$, is $f(j,0) = \sum_{k=1}^{P} a_k f(j,k)$
- Now we have P equations and P unknowns
- Because f(j,k) = f(k,j), the matrix is symmetric
- Solution requires O(n^3) operations (e.g., Cholesky decomposition)
- Why covariance? It is not probabilistic, but the matrix looks similar
28Covariance Example
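Recall $f(j,k) = \sum_{n=\text{start}}^{\text{start}+N-1} y_{n-k} y_{n-j}$
Where equation j is $f(j,0) = \sum_{k=1}^{P} a_k f(j,k)$
- Signal: ..., 3, 2, -1, -3, -5, -2, 0, 1, 2, 4, 3, 1, 0, -1, -2, -4, -1, 0, 3, 1, 0, ...
- Frame: -5, -2, 0, 1, 2, 4, 3, 1; number of coefficients: 3
- f(1,1) = (-3)(-3) + (-5)(-5) + (-2)(-2) + (0)(0) + (1)(1) + (2)(2) + (4)(4) + (3)(3) = 68
- f(2,1) = (-1)(-3) + (-3)(-5) + (-5)(-2) + (-2)(0) + (0)(1) + (1)(2) + (2)(4) + (4)(3) = 50
- f(3,1) = (2)(-3) + (-1)(-5) + (-3)(-2) + (-5)(0) + (-2)(1) + (0)(2) + (1)(4) + (2)(3) = 13
- f(1,2) = (-3)(-1) + (-5)(-3) + (-2)(-5) + (0)(-2) + (1)(0) + (2)(1) + (4)(2) + (3)(4) = 50
- f(2,2) = (-1)(-1) + (-3)(-3) + (-5)(-5) + (-2)(-2) + (0)(0) + (1)(1) + (2)(2) + (4)(4) = 60
- f(3,2) = (2)(-1) + (-1)(-3) + (-3)(-5) + (-5)(-2) + (-2)(0) + (0)(1) + (1)(2) + (2)(4) = 36
- f(1,3) = (-3)(2) + (-5)(-1) + (-2)(-3) + (0)(-5) + (1)(-2) + (2)(0) + (4)(1) + (3)(2) = 13
- f(2,3) = (-1)(2) + (-3)(-1) + (-5)(-3) + (-2)(-5) + (0)(-2) + (1)(0) + (2)(1) + (4)(2) = 36
- f(3,3) = (2)(2) + (-1)(-1) + (-3)(-3) + (-5)(-5) + (-2)(-2) + (0)(0) + (1)(1) + (2)(2) = 48
- f(1,0) = (-3)(-5) + (-5)(-2) + (-2)(0) + (0)(1) + (1)(2) + (2)(4) + (4)(3) + (3)(1) = 50
- f(2,0) = (-1)(-5) + (-3)(-2) + (-5)(0) + (-2)(1) + (0)(2) + (1)(4) + (2)(3) + (4)(1) = 23
- f(3,0) = (2)(-5) + (-1)(-2) + (-3)(0) + (-5)(1) + (-2)(2) + (0)(4) + (1)(3) + (2)(1) = -12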
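The table of f(j,k) values can be reproduced with a few lines. In this sketch the signal is stored in a list, so the frame starts at index 4 (the -5); the variable names are assumptions for illustration.

```python
signal = [3, 2, -1, -3, -5, -2, 0, 1, 2, 4, 3, 1, 0, -1, -2, -4, -1, 0, 3, 1, 0]
start, frame_len, order = 4, 8, 3     # frame = signal[4:12], three coefficients

def f(j, k):
    """Covariance-method term: sum over the frame of y[n-k] * y[n-j]."""
    return sum(signal[n - k] * signal[n - j]
               for n in range(start, start + frame_len))

# Row j of the system is f(j,0) = sum_k a_k f(j,k).
matrix = [[f(j, k) for k in range(1, order + 1)] for j in range(1, order + 1)]
rhs = [f(j, 0) for j in range(1, order + 1)]

print(matrix)  # [[68, 50, 13], [50, 60, 36], [13, 36, 48]]
print(rhs)     # [50, 23, -12]
```

Note that the sums reach back before the frame start (to the 2, -1, -3 samples), which is why the matrix is symmetric but not Toeplitz.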
29Auto-Correlation Method
- Assume all values of the signal outside of $0 \le n \le N-1$ are zero
- Correlate from $-\infty$ to $\infty$ (most values are 0)
- The LPC formula for f becomes $f(j,k) = \sum_{n=0}^{N-1-(j-k)} y_n y_{n+(j-k)} = R(j-k)$
- The matrix is now in Toeplitz format
- The Levinson-Durbin algorithm applies
- Implementation complexity O(n^2)
30Auto Correlation Example
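Recall $f(j,k) = \sum_{n=0}^{N-1-(j-k)} y_n y_{n+(j-k)} = R(j-k)$
Where equation j is $R(j) = \sum_{k=1}^{P} R(j-k) a_k$
Notation: j is the row, k is the column
- Signal: ..., 3, 2, -1, -3, -5, -2, 0, 1, 2, 4, 3, 1, 0, -1, -2, -4, -1, 0, 3, 1, 0, ...
- Frame: -5, -2, 0, 1, 2, 4, 3, 1; number of coefficients: 3
- R(0) = (-5)(-5) + (-2)(-2) + (0)(0) + (1)(1) + (2)(2) + (4)(4) + (3)(3) + (1)(1) = 60
- R(1) = (-5)(-2) + (-2)(0) + (0)(1) + (1)(2) + (2)(4) + (4)(3) + (3)(1) = 35
- R(2) = (-5)(0) + (-2)(1) + (0)(2) + (1)(4) + (2)(3) + (4)(1) = 12
- R(3) = (-5)(1) + (-2)(2) + (0)(4) + (1)(3) + (2)(1) = -4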
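The same frame can be run through a short sketch that computes R(0) to R(3) and assembles the Toeplitz system. The names `autocorr`, `toeplitz`, and `rhs` are assumptions for illustration.

```python
frame = [-5, -2, 0, 1, 2, 4, 3, 1]
order = 3

def autocorr(lag):
    """R(lag): samples outside the frame are treated as zero."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

r = [autocorr(lag) for lag in range(order + 1)]
# Row j, column k holds R(|j-k|); the right-hand side is R(1)..R(P).
toeplitz = [[r[abs(j - k)] for k in range(order)] for j in range(order)]
rhs = r[1:]

print(r)         # [60, 35, 12, -4]
print(toeplitz)  # [[60, 35, 12], [35, 60, 35], [12, 35, 60]]
print(rhs)       # [35, 12, -4]
```

Solving this system, for instance with the Levinson-Durbin recursion, gives a1 = 239/352, a2 = -13/110, a3 = -47/352 (about 0.679, -0.118, -0.134).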
31LPC Transfer Function
- Predict the value of the next sample
- $\hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k}$
- The error signal ($e_n$) is the LPC residual
- $e_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} a_k s_{n-k}$
- Perform a Z-transform of both sides
- $E(z) = S(z) - \sum_{k=1}^{p} a_k S(z) z^{-k}$
- Factor S(z): $E(z) = S(z)\left[1 - \sum_{k=1}^{p} a_k z^{-k}\right] = S(z)A(z)$
- Compute the transfer function: $S(z) = E(z)/A(z)$
- Conclusion: LPC provides us with an all-pole filter
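The two directions of the transfer function, A(z) for analysis and 1/A(z) for synthesis, can be sketched as a pair of time-domain loops. The function names and the single-coefficient example are assumptions for illustration; samples before the start of the signal are treated as zero.

```python
def lpc_residual(signal, coeffs):
    """Inverse filter A(z): e[n] = s[n] - sum_k a_k * s[n-k]."""
    out = []
    for n, s in enumerate(signal):
        prediction = sum(a * signal[n - k]
                         for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(s - prediction)
    return out

def lpc_synthesize(residual, coeffs):
    """All-pole filter 1/A(z): s[n] = e[n] + sum_k a_k * s[n-k]."""
    out = []
    for n, e in enumerate(residual):
        prediction = sum(a * out[n - k]
                         for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(e + prediction)
    return out

impulse = [1.0, 0.0, 0.0, 0.0]
response = lpc_synthesize(impulse, [0.5])
print(response)                        # [1.0, 0.5, 0.25, 0.125]
recovered = lpc_residual(response, [0.5])
print(recovered)                       # [1.0, 0.0, 0.0, 0.0]
```

The synthesis loop feeds its own output back in, which is what makes the filter all-pole (recursive); the analysis loop only looks at past input samples.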
32LPC Coding and Synthesis Models
Coding Model
Conclusion: the LPC all-pole model can code and synthesize speech
33The LPC Model
- The LPC estimate
- An all-pole IIR filter: $y_n = G x_n + \sum_{k=1}^{N} a_k y_{n-k}$
- The $G x_n$ residual attempts to model the glottal source
- LPC estimates the separation of source from filter
- Challenges (problems in synthesis)
- The residual does not accurately model the source (glottis)
- The filter does not model radiation from the lips
- The model does not account for nasal resonances
- Possible solutions
- Additional poles can increase the accuracy to a point
- 1 pole pair for each 1 kHz of sampling rate
- 2 more pairs can better estimate the source and lips
- Introduce zeroes into the model
- More robust analysis of the glottal source and lip radiation
34The LPC Spectrum
- Perform an LPC analysis
- Find the poles
- Plot the spectrum around the z-plane unit circle
- What do we find concerning the LPC spectrum?
- Adding poles better matches speech, up to about 22 poles for a 16 kHz sampling rate
- The peaks tend to be overly sharp ("spiky") because small radius changes greatly alter pole skirt widths
35PARCOR
- Definition: PARtial auto CORrelation coefficients
- LPC coefficients are $a_1, a_2, \ldots, a_P$
- PARCOR coefficients are $k_1, k_2, \ldots, k_P$
- It is easy to compute PARCOR from LPC and vice versa
- Review
- Rectangular tubes have reflection coefficients $r_k = (A_{k+1} - A_k)/(A_{k+1} + A_k)$
- With algebra, the ratio of areas between tubes is $A_k/A_{k+1} = (1 - r_k)/(1 + r_k)$
- Importance
- LPC is equivalent to the tube model of the vocal tract
- $\log(A_{k+1}/A_k) = \log((1 - k_i)/(1 + k_i))$
- We can adjust the LPC parameters based on PARCOR
36Relationship to Tube Model
- Given LPC, compute PARCOR
  - FOR j = 1 TO P THEN x[P,j] = a[j]
  - k[P] = x[P,P]
  - FOR i = P TO 2 STEP -1
    - FOR j = 1 TO i-1
      - x[i-1,j] = (x[i,j] + k[i] * x[i,i-j]) / (1 - k[i]^2)
    - k[i-1] = x[i-1,i-1]
- Given PARCOR, compute LPC
  - FOR i = 1 TO P
    - x[i,i] = k[i]
    - IF (i > 1) THEN FOR j = 1 TO i-1
      - x[i,j] = x[i-1,j] - k[i] * x[i-1,i-j]
  - FOR j = 1 TO P THEN a[j] = x[P,j]
Notes: k[i] are PARCOR coefficients, a[i] are LPC coefficients, x[i,j] is a temporary work array
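Both conversions are short enough to sketch and check against each other. The test values reuse the Levinson-Durbin example (k = 2, 1/3, 1/4, 1/5 and a = 6/5, 0, 0, 1/5); the function names and the use of `Fraction` for exact arithmetic are assumptions for illustration.

```python
from fractions import Fraction

def parcor_to_lpc(k):
    """Step-up recursion: build LPC coefficients one order at a time."""
    a = []
    for i, ki in enumerate(k, start=1):
        # x[i][j] = x[i-1][j] - k_i * x[i-1][i-j] for j < i, and x[i][i] = k_i
        a = [a[j - 1] - ki * a[i - j - 1] for j in range(1, i)] + [ki]
    return a

def lpc_to_parcor(a):
    """Step-down recursion: peel off the last coefficient at each order."""
    x = list(a)
    k = [None] * len(a)
    for i in range(len(a), 0, -1):
        ki = x[i - 1]
        k[i - 1] = ki
        # x[i-1][j] = (x[i][j] + k_i * x[i][i-j]) / (1 - k_i^2)
        x = [(x[j - 1] + ki * x[i - j - 1]) / (1 - ki * ki) for j in range(1, i)]
    return k

parcor = [Fraction(2), Fraction(1, 3), Fraction(1, 4), Fraction(1, 5)]
lpc = parcor_to_lpc(parcor)
print([str(c) for c in lpc])             # ['6/5', '0', '0', '1/5']
print(lpc_to_parcor(lpc) == parcor)      # True
```

The step-down direction divides by 1 - k_i^2, so it breaks down if any intermediate coefficient has magnitude exactly 1.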
37Line Spectrum Pairs
- Overview
- Filter with an additional coefficient
- Uses the equations on the right
- The New filter models
- A completely closed glottis
- Completely open lips
- Characteristics
- Spectrum shown as lines because of infinite amplitudes of formants
- Forces zeros and poles to be interspersed on the unit circle
- Advantages
- Easier to estimate formants
- Less sensitive to quantization errors
38LPC and the Source Signal
- Experiments show
- Glottis requires both zeros and poles
- It requires fewer poles than the vocal function
- LPC combines the glottal and vocal tract poles
- If U(z) = I(z)G(z)
- U(z): source function
- I(z): impulse sequence
- G(z): glottal filter
- Transfer function
- Goal: separate glottal poles from the LPC predictor
39Closed Phase Analysis
- Find the instant of glottal closure
- Epoch detection algorithm
- Divide signal
- Closed phase (glottis does not affect LPC predictors)
- Open phases (glottis has significant impact)
- Strategy
- Compute the G filter over a number of pitch periods
- Perform an inverse filter to obtain the glottal signal
40Open Phase Analysis
- Problem: it is not easy to find the instant of glottal closure
- Goal: add extra poles to the model
- Advantages
- Human hearing is more sensitive to peaks than to valleys
- The tube model and LPC are all-pole systems
- Disadvantages
- Relationships between the poles and the formants become obscure
- Extra poles can approximate a zero, but not perfectly
- How can extra poles approximate zeros?
- For example, if $x, y \neq -1$, then consider the following derivation
- $1 - x = 1/(1 + y)$
- $1 = (1 - x)(1 + y) = 1 + y - x - xy = 1 + y(1 - x) - x$
- Therefore $y = x/(1 - x)$
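The derivation can be carried one step further to show where the "extra poles" come from. The block below is a sketch under the added assumptions that $|x| < 1$ and that $x$ stands for a single zero term $a z^{-1}$; the truncation order $M$ is a hypothetical symbol introduced here.

```latex
% Expand y = x / (1 - x) as a geometric series (valid for |x| < 1):
y = \frac{x}{1 - x} = x + x^2 + x^3 + \cdots

% Substitute back into 1 - x = 1 / (1 + y):
1 - x = \frac{1}{1 + x + x^2 + x^3 + \cdots}

% With x = a z^{-1}, a single zero equals an infinite all-pole denominator;
% keeping only the first M terms gives an all-pole approximation:
1 - a z^{-1} \approx \frac{1}{1 + a z^{-1} + a^2 z^{-2} + \cdots + a^M z^{-M}}
```

The approximation improves as M grows but is never exact for finite M, which matches the "not perfectly" caveat above.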