Title: Spectral Features for Automatic Text-Independent Speaker Recognition
Slide 1: Spectral Features for Automatic Text-Independent Speaker Recognition
Tomi Kinnunen
Research seminar, 27.2.2004
Department of Computer Science, University of Joensuu
2Based on a True Story
- T. Kinnunen Spectral Features for Automatic
Text-Independent Speaker Recognition, Ph.Lic.
thesis, 144 pages, Department of Computer
Science, University of Joensuu, 2004. - Downloadable in PDF from
- http//cs.joensuu.fi/pages/tkinnu/research/index.h
tml
Slide 3: Introduction
Slide 4: Why Study Feature Extraction?
- As the first component in the recognition chain, feature extraction strongly determines the accuracy of the classification
Slide 5: Why Study Feature Extraction? (cont.)
- Typical feature extraction methods are borrowed directly from the speech recognition task
  ⇒ Quite contradictory, considering the opposite nature of the two tasks
- In general, it seems that currently we are at best guessing what might be individual in our speech!
- Because it is interesting and challenging!
Slide 6: Principle of Feature Extraction
Slide 7: Studied Features
- 1. FFT-implemented filterbanks (subband processing)
- 2. FFT-cepstrum
- 3. LPC-derived features
- 4. Dynamic spectral features (delta features)
Slide 8: Speech Material and Evaluation Protocol
- Each test file is split into segments of T = 350 vectors (about 3.5 seconds of speech)
- Each segment is classified by vector quantization (sketched below)
- Speaker models are constructed from the training data by the RLS clustering algorithm
- Performance measure: classification error rate (%)
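A minimal sketch of this VQ classification step, assuming the speaker codebooks have already been trained (the RLS clustering itself is not shown); the function and variable names are illustrative, not from the thesis:

```python
import numpy as np

def avg_quantization_distortion(segment, codebook):
    """Mean squared distance from each feature vector in the segment
    (shape T x d) to its nearest codeword in the codebook (K x d)."""
    dists = ((segment[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).mean()

def classify_segment(segment, codebooks):
    """Assign the segment to the speaker whose codebook gives the
    smallest average quantization distortion."""
    return int(np.argmin([avg_quantization_distortion(segment, cb)
                          for cb in codebooks]))
```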
Slide 9: 1. Subband Features
Slide 10: Computation of Subband Features
Processing chain:
  Windowed speech frame
  → Magnitude spectrum by FFT
  → Smoothing by a filterbank
  → Nonlinear mapping of the filter outputs
  → Compressed filter outputs f = (f1, f2, ..., fM)^T
Parameters of the filterbank:
- Number of subbands
- Filter shapes and bandwidths
- Type of frequency warping
- Filter output nonlinearity
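A minimal sketch of this chain, assuming a precomputed filterbank matrix (M filters by FFT bins) and using log-compression as the nonlinearity; names are illustrative:

```python
import numpy as np

def subband_features(frame, filterbank, nonlinearity=np.log):
    """Compressed subband features for one windowed frame.

    frame:      windowed speech frame, shape (N,)
    filterbank: filter magnitude responses, shape (M, N//2 + 1)
    returns:    feature vector f = (f1, ..., fM)^T
    """
    spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum by FFT
    outputs = filterbank @ spectrum         # smoothing by the filterbank
    return nonlinearity(outputs + 1e-10)    # nonlinear mapping (log here)
```

Other nonlinearities from the experiments (e.g. the cubic one) would simply replace np.log.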
Slide 11: Frequency Warping: What's That?!
- The real frequency axis (Hz) is stretched and compressed locally according to a (bijective) warping function
[Figure: a 24-channel Bark-warped filterbank on the Bark scale]
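For reference, two common warping functions (not necessarily the exact formulas used in the thesis) are the mel scale and Traunmüller's Bark approximation, mapping a frequency f in Hz as:

```latex
\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
z_{\mathrm{Bark}}(f) = \frac{26.81\, f}{1960 + f} - 0.53 .
```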
Slide 12: Discrimination of Individual Subbands (F-ratio)
(Fixed parameters: 30 linearly spaced triangular filters)
The low end (0-200 Hz) and the mid/high frequencies (about 2-4 kHz) are important; the region 200-2000 Hz is less important. (However, not consistently!)
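The F-ratio is the usual between-to-within speaker variance ratio; in a commonly used form (the thesis may normalize slightly differently), for speakers i = 1, ..., S with per-speaker feature means μ_i, variances σ_i², and global mean μ̄:

```latex
F = \frac{\tfrac{1}{S}\sum_{i=1}^{S} (\mu_i - \bar{\mu})^2}
         {\tfrac{1}{S}\sum_{i=1}^{S} \sigma_i^2}
```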
Slide 13: Subband Features: The Effect of the Filter Output Nonlinearity
Consistent ordering (!): cubic < log < linear
Slide 14: Subband Features: The Effect of the Filter Shape
Slide 15: Subband Features: The Number of Subbands (1)
Fixed parameters: linearly spaced, triangular-shaped filters, log-compression
Observation: error rates decrease monotonically with an increasing number of subbands (in most cases)
Slide 16: Subband Features: The Number of Subbands (2)
Fixed parameters: linearly spaced, triangular-shaped filters, log-compression
Helsinki: (almost) monotonic decrease in errors with an increasing number of subbands
TIMIT: the optimum number of bands is in the range 50-100
Differences between the corpora are (partly) explained by the discrimination curves
Slide 17: Discussion of the Subband Features
- The (typically used) log-compression should be replaced with cubic compression or some better nonlinearity
- The number of subbands should be relatively high (at least 50, based on these experiments)
- The shape of the filter does not seem to be important
- Discriminative information is not evenly distributed along the frequency axis
- The relative discriminatory power of the subbands depends on the selected speaker population/language/speech content
Slide 18: 2. FFT-Cepstral Features
Slide 19: Computation of FFT-Cepstrum
Processing chain:
  Windowed speech frame
  → Magnitude spectrum by FFT
  → Smoothing by a filterbank
  → Nonlinear mapping of the filter outputs   (steps common with the subband features)
  → Decorrelation by DCT
  → Coefficient selection
  → Cepstrum vector c = (c1, ..., cM)^T
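Continuing the subband sketch above, the cepstrum adds DCT decorrelation and coefficient selection; dropping c0 follows the slides, while the helper names and the use of scipy are assumptions:

```python
import numpy as np
from scipy.fft import dct

def fft_cepstrum(frame, filterbank, num_coeffs=15):
    """FFT-cepstrum of one windowed frame: log-compressed filterbank
    outputs, decorrelated by DCT, keeping the lowest num_coeffs
    coefficients and excluding c0."""
    spectrum = np.abs(np.fft.rfft(frame))              # magnitude spectrum
    log_outputs = np.log(filterbank @ spectrum + 1e-10)
    cepstrum = dct(log_outputs, type=2, norm='ortho')  # decorrelation by DCT
    return cepstrum[1:num_coeffs + 1]                  # selection, skip c0
```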
Slide 20: FFT-Cepstrum: Type of Frequency Warping
Fixed parameters: 30 triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c0
Helsinki: the mel-frequency warped cepstrum gives the best results on average
TIMIT: the linearly warped cepstrum gives the best results on average
Same explanation as before: the discrimination curves
Slide 21: FFT-Cepstrum: Number of Cepstral Coefficients
(Fixed parameters: mel-frequency warped triangular filters, log-compression, DCT-transformed filter outputs, codebook size 64)
The minimum number of coefficients is around 10, rather independent of the number of filters
Slide 22: Discussion About the FFT-Cepstrum
- Same performance as with the subband features, but with a smaller number of features
  ⇒ For computational and modeling reasons, the cepstrum is the preferred method of these two in automatic recognition
- The commonly used mel-warped filterbank is not the best choice in the general case!
  - There is no reason to assume that it would be, since the mel-cepstrum is based on modeling of human hearing and was originally meant for speech recognition purposes
- I prefer and recommend linear frequency warping, since:
  - It is easier to control the amount of resolution on the desired subbands (e.g. by linear weighting). In nonlinear warping, the relationship between the real and warped frequency axes is more complicated.
Slide 23: 3. LPC-Derived Features
Slide 24: What Is Linear Predictive Coding (LPC)?
- In the time domain, the current sample is approximated as a linear combination of the past p samples
- The objective is to determine the LPC coefficients ak, k = 1, ..., p, such that the squared prediction error is minimized
- In the frequency domain, the LPC coefficients define an all-pole IIR filter whose poles correspond to local maxima of the magnitude spectrum
[Figure: LPC spectrum with a pole aligned to a local spectral maximum]
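In standard notation (the slide's own equations did not survive extraction, so this is the textbook formulation):

```latex
\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k),
\qquad
E = \sum_{n} \bigl(s(n) - \hat{s}(n)\bigr)^2 \;\to\; \min,
\qquad
H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} .
```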
Slide 25: Computation of LPC and LPC-Based Features
Processing chain:
  Windowed speech frame
  → Autocorrelation computation
  → Solving the Yule-Walker AR equations (Levinson-Durbin algorithm)
  → LPC coefficients (LPC) and reflection coefficients (REFL)
Derived feature sets:
- Log area ratios (LAR): LAR conversion of the reflection coefficients
- Arcus sine coefficients (ARCSIN): asin(.) of the reflection coefficients
- Linear predictive cepstral coefficients (LPCC): Atal's recursion on the LPC coefficients
- Formants (FMT): LPC pole finding via a root-finding algorithm
- Line spectral frequencies (LSF): complex polynomial expansion
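A compact sketch of the core of this chain: the Levinson-Durbin recursion solving the Yule-Walker equations and returning both LPC and reflection coefficients, plus the LAR and ARCSIN conversions. This follows the textbook recursion under the prediction convention of the previous slide; sign conventions for LAR vary in the literature:

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Yule-Walker AR equations of order p from the
    autocorrelation sequence r[0..p].
    Returns (a, k): LPC coefficients a[1..p] and reflection coefficients."""
    a = np.zeros(p + 1)
    k = np.zeros(p)
    err = r[0]
    for i in range(1, p + 1):
        k[i - 1] = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k[i - 1]
        a_new[1:i] = a[1:i] - k[i - 1] * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k[i - 1] ** 2           # prediction error shrinks
    return a[1:], k

def log_area_ratios(k):
    """LAR conversion of the reflection coefficients (one sign convention)."""
    return np.log((1 - k) / (1 + k))

def arcsin_coefficients(k):
    """Arcus sine coefficients of the reflection coefficients."""
    return np.arcsin(k)
```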
Slide 26: Linear Prediction (LPC): Number of LPC Coefficients
- The minimum number is around 15 coefficients (not consistent, however)
- Error rates are surprisingly small in general!
- The LPC coefficients were used directly in a Euclidean-distance-based classifier; the literature usually carries a warning of the form "Do not ever use LPCs directly, at least with the Euclidean metric."
Slide 27: Comparison of the LPC-Derived Features
Fixed parameters: LPC predictor order p = 15
- Overall performance is very good
- Raw LPC coefficients give the worst performance on average
- Differences between the feature sets are rather small
  ⇒ Other factors to be considered:
  - Computational complexity
  - Ease of implementation
Slide 28: LPC-Derived Formants
Fixed parameters: codebook size 64
- Formants give comparable, and surprisingly good, results!
- Why surprisingly good?
  - 1. The analysis procedure was very simple (it produces spurious formants)
  - 2. Subband processing, LPC, cepstrum, etc. describe the spectrum continuously; formants, on the other hand, pick only a discrete number of maximum peak amplitudes from the spectrum (and a small number at that!)
Slide 29: Discussion About the LPC-Derived Features
- In general, the results are promising, even for the raw LPC coefficients
- The differences between the feature sets were small
- From the implementation and efficiency viewpoint, the most attractive are LPCC, LAR, and ARCSIN
- Formants also give (surprisingly) good results, which indicates indirectly that the regions of the spectrum with high amplitude might be important for speaker recognition
Slide 30: 4. Dynamic Features
Slide 31: Dynamic Spectral Features
- Dynamic feature: an estimate of the time derivative of a feature
- Can be applied to any feature
- Two widely used estimation methods are the differentiator and the linear regression method (sketched below)
- Typical phrase: "Don't use the differentiator, it emphasizes noise"
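A minimal sketch of the two estimators over a feature trajectory (frames by dimensions), with M neighboring frames on each side; the regression form is the usual least-squares slope from the literature, and the edge padding is an assumption:

```python
import numpy as np

def delta_differentiator(features, M=1):
    """Differentiator: delta(t) = f(t+M) - f(t-M)."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')
    return padded[2 * M:] - padded[:-2 * M]

def delta_regression(features, M=1):
    """Linear regression: least-squares slope over 2M+1 frames,
    delta(t) = sum_m m*(f(t+m) - f(t-m)) / (2 * sum_m m^2)."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')
    T = features.shape[0]
    num = sum(m * (padded[M + m:M + m + T] - padded[M - m:M - m + T])
              for m in range(1, M + 1))
    return num / (2 * sum(m ** 2 for m in range(1, M + 1)))
```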
Slide 32: Delta Features: Comparison of the Two Estimation Methods
Slide 33: Delta Features: Comparison with the Static Features

Discussion About the Delta Features
- The optimum order is small (in most cases M = 1 or 2 neighboring frames)
- The differentiator method is better in most cases (a surprising result, again!)
- Delta features are worse than static features, but might provide uncorrelated extra information (for multi-parameter recognition)
- The commonly used delta-cepstrum gives quite poor results!
Slide 34: Towards Concluding Remarks ...
Slide 35: FFT-Cepstrum Revisited. Question: Is Log-Compression / Mel-Cepstrum Best?
Please note: the segment length is now reduced to T = 100 vectors, which is why the absolute recognition rates are worse than before (ran out of time for the thesis).
Slide 36: FFT- vs. LPC-Cepstrum. Question: Is the FFT-cepstrum really more accurate?
[Figures: results on Helsinki and TIMIT]
Answer: NO! (TIMIT shows this quite clearly)
Slide 37: The Essential Difference Between the FFT- and LPC-Cepstra?
- The FFT-cepstrum approximates the spectrum by a linear combination of cosine functions (a non-parametric model)
- LPC makes a least-squares fit of an all-pole filter to the spectrum (a parametric model)
- The FFT-cepstrum first smoothes the original spectrum with a filterbank, whereas the LPC filter is fitted directly to the original spectrum
However, one might argue that we could drop the filterbank from the FFT-cepstrum ...
Slide 38: General Summary and Discussion
- The number of subbands should be high (30-50 for these corpora)
- The number of cepstral coefficients (LPC/FFT-based) should be high (≥ 15)
- In particular, the number of subbands, the number of coefficients, and the LPC order are clearly higher than is usual in speech recognition
- Formants give (surprisingly) good performance
- The number of formants should be high (≥ 8)
- In most cases, the differentiator method outperforms the regression method in delta-feature computation
Slide 39: Philosophical Discussion
- The current knowledge of speaker individuality is far from perfect
- Engineers concentrate on tuning complex feature compensation methods but don't (necessarily) understand what is individual in speech
- Phoneticians try to find the "individual code" in the speech signal, but they don't (necessarily) know how to apply the engineers' methods
- Why do we believe that speech would be any less individual than e.g. fingerprints?
  - Compare the histories of the "fingerprint" and the "voiceprint":
  - Fingerprints have been studied systematically since the 17th century (1684)
  - The spectrograph wasn't invented until 1946! How could we possibly claim that we know what speech is after less than 60 years of research?
- Why do we believe that human beings are optimal speaker discriminators? Our ear can already be fooled (e.g. by MP3 encoding).
Slide 40: That's All, Folks!