Title: Spectral Features for Automatic Text-Independent Speaker Recognition
Slide 1: Spectral Features for Automatic Text-Independent Speaker Recognition
Tomi Kinnunen
Research seminar, 27.2.2004
Department of Computer Science, University of Joensuu
2Based on a True Story
- T. Kinnunen Spectral Features for Automatic
Text-Independent Speaker Recognition, Ph.Lic.
thesis, 144 pages, Department of Computer
Science, University of Joensuu, 2004. - Downloadable in PDF from
- http//cs.joensuu.fi/pages/tkinnu/research/index.h
tml
Slide 3: Introduction
Slide 4: Why Study Feature Extraction?
- As the first component in the recognition chain, feature extraction strongly determines the accuracy of the classification
Slide 5: Why Study Feature Extraction? (cont.)
- Typical feature extraction methods are borrowed directly from the speech recognition task
  ⇒ Quite contradictory, considering the opposite nature of the two tasks
- In general, it seems that currently we are at best guessing what might be individual in our speech!
- Because it is interesting and challenging!
Slide 6: Principle of Feature Extraction
Slide 7: Studied Features
- 1. FFT-implemented filterbanks (subband processing)
- 2. FFT-cepstrum
- 3. LPC-derived features
- 4. Dynamic spectral features (delta features)
Slide 8: Speech Material and Evaluation Protocol
- Each test file is split into segments of T = 350 vectors (about 3.5 seconds of speech)
- Each segment is classified by vector quantization (sketched below)
- Speaker models are constructed from the training data by the RLS clustering algorithm
- Performance measure: classification error rate (%)
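A minimal sketch of this VQ classification step, assuming the speaker codebooks have already been trained (the RLS clustering itself is not shown); the function and variable names are illustrative, not from the thesis:

```python
import numpy as np

def avg_quantization_distortion(segment, codebook):
    """Mean squared distance from each feature vector in the segment
    (shape T x d) to its nearest codeword in the codebook (K x d)."""
    dists = ((segment[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).mean()

def classify_segment(segment, codebooks):
    """Assign the segment to the speaker whose codebook gives the
    smallest average quantization distortion."""
    return int(np.argmin([avg_quantization_distortion(segment, cb)
                          for cb in codebooks]))
```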
Slide 9: 1. Subband Features
Slide 10: Computation of Subband Features
Processing chain:
  Windowed speech frame
  → Magnitude spectrum by FFT
  → Smoothing by a filterbank
  → Nonlinear mapping of the filter outputs
  → Compressed filter outputs f = (f1, f2, ..., fM)^T
Parameters of the filterbank:
- Number of subbands
- Filter shapes and bandwidths
- Type of frequency warping
- Filter output nonlinearity
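A minimal sketch of this chain, assuming a precomputed filterbank matrix (M filters by FFT bins) and using log-compression as the nonlinearity; names are illustrative:

```python
import numpy as np

def subband_features(frame, filterbank, nonlinearity=np.log):
    """Compressed subband features for one windowed frame.

    frame:      windowed speech frame, shape (N,)
    filterbank: filter magnitude responses, shape (M, N//2 + 1)
    returns:    feature vector f = (f1, ..., fM)^T
    """
    spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum by FFT
    outputs = filterbank @ spectrum         # smoothing by the filterbank
    return nonlinearity(outputs + 1e-10)    # nonlinear mapping (log here)
```

Other nonlinearities from the experiments (e.g. the cubic one) would simply replace np.log.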
Slide 11: Frequency Warping: What's That?!
- The real frequency axis (Hz) is stretched and compressed locally according to a (bijective) warping function
[Figure: a 24-channel Bark-warped filterbank on the Bark scale]
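For reference, two common warping functions (not necessarily the exact formulas used in the thesis) are the mel scale and Traunmüller's Bark approximation, mapping a frequency f in Hz as:

```latex
\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
z_{\mathrm{Bark}}(f) = \frac{26.81\, f}{1960 + f} - 0.53 .
```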
Slide 12: Discrimination of Individual Subbands (F-ratio)
(Fixed parameters: 30 linearly spaced triangular filters)
The low end (0-200 Hz) and the mid/high frequencies (about 2-4 kHz) are important; the region 200-2000 Hz is less important. (However, not consistently!)
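The F-ratio is the usual between-to-within speaker variance ratio; in a commonly used form (the thesis may normalize slightly differently), for speakers i = 1, ..., S with per-speaker feature means μ_i, variances σ_i², and global mean μ̄:

```latex
F = \frac{\tfrac{1}{S}\sum_{i=1}^{S} (\mu_i - \bar{\mu})^2}
         {\tfrac{1}{S}\sum_{i=1}^{S} \sigma_i^2}
```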
Slide 13: Subband Features: The Effect of the Filter Output Nonlinearity
Consistent ordering (!): cubic < log < linear
Slide 14: Subband Features: The Effect of the Filter Shape
Slide 15: Subband Features: The Number of Subbands (1)
Fixed parameters: linearly spaced, triangular-shaped filters, log-compression
Observation: error rates decrease monotonically with an increasing number of subbands (in most cases)
Slide 16: Subband Features: The Number of Subbands (2)
Fixed parameters: linearly spaced, triangular-shaped filters, log-compression
Helsinki: (almost) monotonic decrease in errors with an increasing number of subbands
TIMIT: the optimum number of bands is in the range 50-100
Differences between the corpora are (partly) explained by the discrimination curves
Slide 17: Discussion of the Subband Features
- The (typically used) log-compression should be replaced with cubic compression or some better nonlinearity
- The number of subbands should be relatively high (at least 50, based on these experiments)
- The shape of the filter does not seem to be important
- Discriminative information is not evenly distributed along the frequency axis
- The relative discriminatory power of the subbands depends on the selected speaker population/language/speech content
Slide 18: 2. FFT-Cepstral Features
Slide 19: Computation of FFT-Cepstrum
Processing chain:
  Windowed speech frame
  → Magnitude spectrum by FFT
  → Smoothing by a filterbank
  → Nonlinear mapping of the filter outputs   (steps common with the subband features)
  → Decorrelation by DCT
  → Coefficient selection
  → Cepstrum vector c = (c1, ..., cM)^T
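Continuing the subband sketch above, the cepstrum adds DCT decorrelation and coefficient selection; dropping c0 follows the slides, while the helper names and the use of scipy are assumptions:

```python
import numpy as np
from scipy.fft import dct

def fft_cepstrum(frame, filterbank, num_coeffs=15):
    """FFT-cepstrum of one windowed frame: log-compressed filterbank
    outputs, decorrelated by DCT, keeping the lowest num_coeffs
    coefficients and excluding c0."""
    spectrum = np.abs(np.fft.rfft(frame))              # magnitude spectrum
    log_outputs = np.log(filterbank @ spectrum + 1e-10)
    cepstrum = dct(log_outputs, type=2, norm='ortho')  # decorrelation by DCT
    return cepstrum[1:num_coeffs + 1]                  # selection, skip c0
```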
Slide 20: FFT-Cepstrum: Type of Frequency Warping
Fixed parameters: 30 triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c0
Helsinki: the mel-frequency warped cepstrum gives the best results on average
TIMIT: the linearly warped cepstrum gives the best results on average
Same explanation as before: the discrimination curves
Slide 21: FFT-Cepstrum: Number of Cepstral Coefficients
(Fixed parameters: mel-frequency warped triangular filters, log-compression, DCT-transformed filter outputs, codebook size 64)
The minimum number of coefficients is around 10, rather independent of the number of filters
Slide 22: Discussion About the FFT-Cepstrum
- Same performance as with the subband features, but with a smaller number of features
  ⇒ For computational and modeling reasons, the cepstrum is the preferred method of these two in automatic recognition
- The commonly used mel-warped filterbank is not the best choice in the general case!
  - There is no reason to assume that it would be, since the mel-cepstrum is based on modeling of human hearing and was originally meant for speech recognition purposes
- I prefer and recommend linear frequency warping, since:
  - It is easier to control the amount of resolution on the desired subbands (e.g. by linear weighting). In nonlinear warping, the relationship between the real and warped frequency axes is more complicated.
Slide 23: 3. LPC-Derived Features
Slide 24: What Is Linear Predictive Coding (LPC)?
- In the time domain, the current sample is approximated as a linear combination of the past p samples
- The objective is to determine the LPC coefficients ak, k = 1, ..., p, such that the squared prediction error is minimized
- In the frequency domain, the LPC coefficients define an all-pole IIR filter whose poles correspond to local maxima of the magnitude spectrum
[Figure: LPC spectrum with a pole aligned to a local spectral maximum]
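In standard notation (the slide's own equations did not survive extraction, so this is the textbook formulation):

```latex
\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k),
\qquad
E = \sum_{n} \bigl(s(n) - \hat{s}(n)\bigr)^2 \;\to\; \min,
\qquad
H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} .
```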
Slide 25: Computation of LPC and LPC-Based Features
Processing chain:
  Windowed speech frame
  → Autocorrelation computation
  → Solving the Yule-Walker AR equations (Levinson-Durbin algorithm)
  → LPC coefficients (LPC) and reflection coefficients (REFL)
Derived feature sets:
- Log area ratios (LAR): LAR conversion of the reflection coefficients
- Arcus sine coefficients (ARCSIN): asin(.) of the reflection coefficients
- Linear predictive cepstral coefficients (LPCC): Atal's recursion on the LPC coefficients
- Formants (FMT): LPC pole finding via a root-finding algorithm
- Line spectral frequencies (LSF): complex polynomial expansion
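A compact sketch of the core of this chain: the Levinson-Durbin recursion solving the Yule-Walker equations and returning both LPC and reflection coefficients, plus the LAR and ARCSIN conversions. This follows the textbook recursion under the prediction convention of the previous slide; sign conventions for LAR vary in the literature:

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Yule-Walker AR equations of order p from the
    autocorrelation sequence r[0..p].
    Returns (a, k): LPC coefficients a[1..p] and reflection coefficients."""
    a = np.zeros(p + 1)
    k = np.zeros(p)
    err = r[0]
    for i in range(1, p + 1):
        k[i - 1] = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k[i - 1]
        a_new[1:i] = a[1:i] - k[i - 1] * a[i - 1:0:-1]
        a = a_new
        err *= 1.0 - k[i - 1] ** 2           # prediction error shrinks
    return a[1:], k

def log_area_ratios(k):
    """LAR conversion of the reflection coefficients (one sign convention)."""
    return np.log((1 - k) / (1 + k))

def arcsin_coefficients(k):
    """Arcus sine coefficients of the reflection coefficients."""
    return np.arcsin(k)
```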
Slide 26: Linear Prediction (LPC): Number of LPC Coefficients
- The minimum number is around 15 coefficients (not consistent, however)
- Error rates are surprisingly small in general!
- The LPC coefficients were used directly in a Euclidean-distance-based classifier; the literature usually carries a warning of the form "Do not ever use LPCs directly, at least with the Euclidean metric."
Slide 27: Comparison of the LPC-Derived Features
Fixed parameters: LPC predictor order p = 15
- Overall performance is very good
- Raw LPC coefficients give the worst performance on average
- Differences between the feature sets are rather small
  ⇒ Other factors to be considered:
  - Computational complexity
  - Ease of implementation
Slide 28: LPC-Derived Formants
Fixed parameters: codebook size 64
- Formants give comparable, and surprisingly good, results!
- Why surprisingly good?
  - 1. The analysis procedure was very simple (it produces spurious formants)
  - 2. Subband processing, LPC, cepstrum, etc. describe the spectrum continuously; formants, on the other hand, pick only a discrete number of maximum peak amplitudes from the spectrum (and a small number at that!)
Slide 29: Discussion About the LPC-Derived Features
- In general, the results are promising, even for the raw LPC coefficients
- The differences between the feature sets were small
- From the implementation and efficiency viewpoint, the most attractive are LPCC, LAR, and ARCSIN
- Formants also give (surprisingly) good results, which indicates indirectly that the regions of the spectrum with high amplitude might be important for speaker recognition
Slide 30: 4. Dynamic Features
Slide 31: Dynamic Spectral Features
- Dynamic feature: an estimate of the time derivative of a feature
- Can be applied to any feature
- Two widely used estimation methods are the differentiator and the linear regression method (sketched below)
- Typical phrase: "Don't use the differentiator, it emphasizes noise"
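A minimal sketch of the two estimators over a feature trajectory (frames by dimensions), with M neighboring frames on each side; the regression form is the usual least-squares slope from the literature, and the edge padding is an assumption:

```python
import numpy as np

def delta_differentiator(features, M=1):
    """Differentiator: delta(t) = f(t+M) - f(t-M)."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')
    return padded[2 * M:] - padded[:-2 * M]

def delta_regression(features, M=1):
    """Linear regression: least-squares slope over 2M+1 frames,
    delta(t) = sum_m m*(f(t+m) - f(t-m)) / (2 * sum_m m^2)."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')
    T = features.shape[0]
    num = sum(m * (padded[M + m:M + m + T] - padded[M - m:M - m + T])
              for m in range(1, M + 1))
    return num / (2 * sum(m ** 2 for m in range(1, M + 1)))
```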
Slide 32: Delta Features: Comparison of the Two Estimation Methods
Slide 33: Delta Features: Comparison with the Static Features

Discussion About the Delta Features
- The optimum order is small (in most cases M = 1 or 2 neighboring frames)
- The differentiator method is better in most cases (a surprising result, again!)
- Delta features are worse than static features, but might provide uncorrelated extra information (for multi-parameter recognition)
- The commonly used delta-cepstrum gives quite poor results!
Slide 34: Towards Concluding Remarks ...
Slide 35: FFT-Cepstrum Revisited. Question: Is Log-Compression / Mel-Cepstrum Best?
Please note: the segment length is now reduced to T = 100 vectors, which is why the absolute recognition rates are worse than before (ran out of time for the thesis).
Slide 36: FFT- vs. LPC-Cepstrum. Question: Is the FFT-cepstrum really more accurate?
[Figures: results on Helsinki and TIMIT]
Answer: NO! (TIMIT shows this quite clearly)
Slide 37: The Essential Difference Between the FFT- and LPC-Cepstra?
- The FFT-cepstrum approximates the spectrum by a linear combination of cosine functions (a non-parametric model)
- LPC makes a least-squares fit of an all-pole filter to the spectrum (a parametric model)
- The FFT-cepstrum first smoothes the original spectrum with a filterbank, whereas the LPC filter is fitted directly to the original spectrum
However, one might argue that we could drop the filterbank from the FFT-cepstrum ...
Slide 38: General Summary and Discussion
- The number of subbands should be high (30-50 for these corpora)
- The number of cepstral coefficients (LPC/FFT-based) should be high (≥ 15)
- In particular, the number of subbands, the number of coefficients, and the LPC order are clearly higher than is usual in speech recognition
- Formants give (surprisingly) good performance
- The number of formants should be high (≥ 8)
- In most cases, the differentiator method outperforms the regression method in delta-feature computation
Slide 39: Philosophical Discussion
- The current knowledge of speaker individuality is far from perfect
- Engineers concentrate on tuning complex feature compensation methods but don't (necessarily) understand what is individual in speech
- Phoneticians try to find the "individual code" in the speech signal, but they don't (necessarily) know how to apply the engineers' methods
- Why do we believe that speech would be any less individual than e.g. fingerprints?
  - Compare the histories of the "fingerprint" and the "voiceprint":
  - Fingerprints have been studied systematically since the 17th century (1684)
  - The spectrograph wasn't invented until 1946! How could we possibly claim that we know what speech is after less than 60 years of research?
- Why do we believe that human beings are optimal speaker discriminators? Our ear can already be fooled (e.g. by MP3 encoding).
Slide 40: That's All, Folks!