Spectral Features for Automatic Text-Independent Speaker Recognition - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Spectral Features for Automatic Text-Independent
Speaker Recognition
Tomi Kinnunen
Research seminar, 27.2.2004
Department of Computer Science, University of Joensuu
2
Based on a True Story
  • T. Kinnunen, Spectral Features for Automatic
    Text-Independent Speaker Recognition, Ph.Lic.
    thesis, 144 pages, Department of Computer
    Science, University of Joensuu, 2004.
  • Downloadable in PDF from
    http://cs.joensuu.fi/pages/tkinnu/research/index.html

3
Introduction
4
Why Study Feature Extraction?
  • As the first component in the recognition chain,
    feature extraction strongly determines the
    accuracy of the subsequent classification

5
Why Study Feature Extraction? (cont.)
  • Typical feature extraction methods are directly
    borrowed from the speech recognition task
  • ⇒ Quite contradictory, considering the
    opposite nature of the two tasks
  • In general, it seems that currently we are at
    best guessing what might be individual in our
    speech!
  • Because it is interesting and challenging!

6
Principle of Feature Extraction
7
Studied Features
  • 1. FFT-implemented filterbanks (subband
    processing)
  • 2. FFT-cepstrum
  • 3. LPC-derived features
  • 4. Dynamic spectral features (delta features)

8
Speech Material and Evaluation Protocol
  • Each test file is split into segments of
    T = 350 vectors (about 3.5 seconds of speech)
  • Each segment is classified by vector
    quantization
  • Speaker models are constructed from the training
    data by the RLS clustering algorithm
  • Performance measure: classification error rate
    (%)
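The vector-quantization classification step can be sketched as follows. This is a minimal illustration, not the thesis code: the speaker codebooks are assumed to be already trained (the thesis uses the RLS clustering algorithm; any VQ trainer would do for this sketch), and a segment is assigned to the speaker whose codebook gives the lowest average quantization distortion.

```python
import numpy as np

def classify_segment(segment, codebooks):
    """Classify a segment (T x d matrix of feature vectors) by vector
    quantization: each speaker's codebook quantizes the segment, and
    the speaker with the lowest average quantization distortion wins."""
    distortions = []
    for cb in codebooks:
        # Euclidean distance of every feature vector to every codevector
        d = np.linalg.norm(segment[:, None, :] - cb[None, :, :], axis=2)
        # distortion = mean distance to the nearest codevector
        distortions.append(d.min(axis=1).mean())
    return int(np.argmin(distortions))
```

The segment error rate is then simply the fraction of segments assigned to the wrong speaker.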

9
1. Subband Features
10
Computation of Subband Features
Windowed speech frame → Magnitude spectrum by FFT
→ Smoothing by a filterbank → Nonlinear mapping of
the filter outputs → Compressed filter outputs
f = (f1, f2, ..., fM)^T
  • Parameters of the filterbank:
  • Number of subbands
  • Filter shapes and bandwidths
  • Type of frequency warping
  • Filter output nonlinearity
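In code, the processing chain above might look like the following sketch (my own illustration, with assumed parameter choices: linearly spaced triangular filters and cubic-root compression, choices that the later slides find to work well):

```python
import numpy as np

def subband_features(frame, n_bands=30, nonlinearity=np.cbrt):
    """Windowed frame -> FFT magnitude spectrum -> triangular
    filterbank smoothing -> nonlinear compression of the outputs."""
    spectrum = np.abs(np.fft.rfft(frame))
    n_bins = len(spectrum)
    # linearly spaced triangular filters over the FFT bins
    centers = np.linspace(0, n_bins - 1, n_bands + 2)
    bins = np.arange(n_bins)
    outputs = np.empty(n_bands)
    for m in range(1, n_bands + 1):
        lo, c, hi = centers[m - 1], centers[m], centers[m + 1]
        # rising and falling edge of the m-th triangle, clipped to [0, 1]
        tri = np.minimum((bins - lo) / (c - lo), (hi - bins) / (hi - c))
        tri = np.clip(tri, 0.0, 1.0)
        outputs[m - 1] = tri @ spectrum
    return nonlinearity(outputs)
```

Swapping `nonlinearity` for `np.log1p` or the identity reproduces the log/linear variants compared on the following slides.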
11
Frequency Warping: What's That?!
  • The real frequency axis (Hz) is stretched and
    compressed locally according to a (bijective)
    warping function

A 24-channel bark-warped filterbank
Bark scale
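As a concrete example of such a warping function, here is one common analytical approximation of the Bark scale (Traunmüller's formula; the thesis does not necessarily use this exact variant), together with the uniform Bark-axis grid that a Bark-warped triangular filterbank is built on:

```python
import numpy as np

def hz_to_bark(f):
    """Hz -> Bark by Traunmuller's approximation (one common choice)."""
    f = np.asarray(f, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    """Inverse of the warping above, Bark -> Hz (bijective)."""
    z = np.asarray(z, dtype=float)
    return 1960.0 * (z + 0.53) / (26.28 - z)

def bark_band_edges(n_filters, f_max=4000.0):
    """Edge/center grid (n_filters + 2 points) for a triangular
    filterbank spaced uniformly on the Bark axis."""
    z = np.linspace(hz_to_bark(0.0), hz_to_bark(f_max), n_filters + 2)
    return bark_to_hz(z)
```

With `n_filters=24` and `f_max=4000.0` this yields the kind of 24-channel Bark-warped filterbank shown on the slide: narrow filters at low frequencies, progressively wider ones toward 4 kHz.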
12
Discrimination of Individual Subbands (F-ratio)
(Fixed parameters: 30 linearly spaced triangular
filters)
The low end (0-200 Hz) and the mid/high frequencies
(about 2-4 kHz) are important; the region 200-2000 Hz is
less important. (However, not consistently!)
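The F-ratio used here is the classical ratio of between-speaker variance to within-speaker variance, computed per subband; a small sketch (function and variable names are my own):

```python
import numpy as np

def f_ratio(features_by_speaker):
    """Per-dimension F-ratio: variance of the speaker means
    (between-speaker) divided by the average within-speaker variance.
    A high value means the dimension separates speakers well."""
    means = np.array([f.mean(axis=0) for f in features_by_speaker])
    between = means.var(axis=0)
    within = np.mean([f.var(axis=0) for f in features_by_speaker], axis=0)
    return between / within
```

Applied to the 30 filter outputs per speaker, this produces the discrimination curve plotted on the slide.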
13
Subband Features: The Effect of the Filter
Output Nonlinearity
Consistent ordering (!): cubic < log < linear
(in error rate)
14
Subband Features: The Effect of the Filter Shape
15
Subband Features: The Number of Subbands (1)
Fixed parameters: linearly spaced,
triangular-shaped filters, log-compression
Observation: error rates decrease monotonically
with an increasing number of subbands (in most
cases)
16
Subband Features: The Number of Subbands (2)
Fixed parameters: linearly spaced,
triangular-shaped filters, log-compression
Helsinki: (almost) monotonic decrease in errors
with an increasing number of subbands. TIMIT: the
optimum number of bands is in the range 50..100.
The differences between the corpora are (partly)
explained by the discrimination curves
17
Discussion of the Subband Features
  • The (typically used) log-compression should be
    replaced with cubic compression or some better
    nonlinearity
  • The number of subbands should be relatively high (at
    least 50, based on these experiments)
  • The shape of the filter does not seem to be important
  • Discriminative information is not evenly distributed
    along the frequency axis
  • The relative discriminatory powers of the subbands
    depend on the selected speaker
    population/language/speech content

18
2. FFT-Cepstral Features
19
Computation of FFT-Cepstrum
Windowed speech frame → Magnitude spectrum by FFT
→ Smoothing by a filterbank → Nonlinear mapping of
the filter outputs (steps in common with the
subband features) → Decorrelation by DCT
→ Coefficient selection → Cepstrum vector
c = (c1, ..., cM)^T
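The decorrelation step is a DCT of the compressed (e.g. log) filter outputs. A minimal sketch, with the DCT-II basis written out explicitly and c0 dropped as on the slide (my formulation, unnormalized basis assumed):

```python
import numpy as np

def fft_cepstrum(compressed_outputs, n_coeffs=15):
    """Decorrelate compressed filterbank outputs with a DCT-II and
    keep the n_coeffs lowest coefficients, excluding c0 (which
    carries only the overall energy)."""
    x = np.asarray(compressed_outputs, dtype=float)
    M = len(x)
    n = np.arange(M)
    k = np.arange(1, n_coeffs + 1)[:, None]     # start at k = 1: skip c0
    basis = np.cos(np.pi * k * (n + 0.5) / M)   # DCT-II basis rows
    return basis @ x
```

Because c0 is excluded, a constant filterbank output (flat spectrum) maps to the all-zero cepstrum vector.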
20
FFT-Cepstrum: Type of Frequency Warping
Fixed parameters: 30 triangular filters,
log-compression, DCT-transformed filter outputs,
15 lowest cepstral coefficients excluding c0
Helsinki: mel-frequency warped cepstrum gives the
best results on average. TIMIT: linearly warped
cepstrum gives the best results on average. Same
explanation as before: the discrimination curves
21
FFT-Cepstrum: Number of Cepstral Coefficients
(Fixed parameters: mel-frequency warped
triangular filters, log-compression,
DCT-transformed filter outputs, cepstral
coefficients excluding c0, codebook size 64)
The minimum sufficient number of coefficients is
around 10, rather independent of the number of
filters
22
Discussion About the FFT-Cepstrum
  • Same performance as with the subband features,
    but with a smaller number of features
  • ⇒ For computational and modeling reasons,
    cepstrum is the preferred method of these two in
    automatic recognition
  • The commonly used mel-warped filterbank is not
    the best choice in the general case!
  • There is no reason to assume that it would be,
    since the mel-cepstrum is based on modeling of human
    hearing and was originally meant for speech
    recognition purposes
  • I prefer/recommend linear frequency
    warping, since
  • It is easier to control the amount of resolution on
    desired subbands (e.g. by linear weighting). In
    nonlinear warping, the relationship between the
    real and warped frequency axes is more
    complicated

23
3. LPC-Derived Features
24
What Is Linear Predictive Coding (LPC)?
  • In the time domain, the current sample is approximated
    as a linear combination of the past p samples
  • The objective is to determine the LPC
    coefficients a_k, k = 1, ..., p, such that the squared
    prediction error is minimized
  • In the frequency domain, the LPCs define an
    all-pole IIR filter whose poles correspond to
    local maxima of the magnitude spectrum

An LPC pole
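One standard way to find the coefficients that minimize the squared prediction error is the autocorrelation method: solve the Yule-Walker equations with the Levinson-Durbin recursion. A runnable sketch (my own, textbook formulation):

```python
import numpy as np

def lpc(frame, p):
    """LPC by the autocorrelation method, solved with the
    Levinson-Durbin recursion.  Returns (a, refl):
    a[0] = 1 and A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p,
    plus the reflection coefficients obtained as a by-product."""
    x = np.asarray(frame, dtype=float)
    # autocorrelation lags r[0..p]
    r = np.array([x[: len(x) - k] @ x[k:] for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    refl = np.zeros(p)
    for i in range(1, p + 1):
        # reflection (partial correlation) coefficient of order i
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err
        refl[i - 1] = k
        # update the order-(i-1) predictor to order i
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= 1.0 - k * k
    return a, refl
```

The reflection coefficients returned here are exactly the REFL features of the next slide, from which LAR and ARCSIN follow by elementwise transforms.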
25
Computation of LPC and LPC-Based Features
Windowed speech frame → Autocorrelation computation
→ Solving of the Yule-Walker AR equations
(Levinson-Durbin algorithm) → LPC coefficients (LPC)
From the LPC analysis:
  • Reflection coefficients (REFL)
  • REFL + LAR conversion → Log area ratios (LAR)
  • REFL + asin(.) → Arcus sine coefficients (ARCSIN)
  • Atal's recursion → Linear predictive cepstral
    coefficients (LPCC)
  • Complex polynomial expansion + root-finding
    algorithm → Line spectral frequencies (LSF)
  • LPC pole finding → Formants (FMT)
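Of these conversions, Atal's recursion is compact enough to show in full; a sketch, assuming the convention A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p with a[0] = 1:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Atal's recursion: LPC coefficients -> cepstral coefficients
    without any FFT, via
        c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k},
    where a_n = 0 for n > p.  Returns c_1 ... c_{n_ceps}."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

As a sanity check, a single-pole model A(z) = 1 - r z^-1 has the known cepstrum c_n = r^n / n, which the recursion reproduces exactly.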
26
Linear Prediction (LPC): Number of LPC
Coefficients
  • The minimum number is around 15 coefficients (not
    consistent, however)
  • Error rates are surprisingly small in general!
  • The LPC coefficients were used directly in a
    Euclidean-distance-based classifier. In the
    literature there is usually a warning of the
    following form: "Do not ever use LPCs directly,
    at least with the Euclidean metric."

27
Comparison of the LPC-Derived Features
Fixed parameters: LPC predictor order p = 15
  • Overall performance is very good
  • Raw LPC coefficients give the worst performance on
    average
  • Differences between the feature sets are rather
    small
  • ⇒ Other factors to be considered:
  • Computational complexity
  • Ease of implementation

28
LPC-Derived Formants
Fixed parameters: codebook size 64
  • Formants give comparable, and surprisingly good,
    results!
  • Why surprisingly good?
  • 1. The analysis procedure was very simple (it produces
    spurious formants)
  • 2. Subband processing, LPC, cepstrum, etc.
    describe the spectrum continuously; formants, on
    the other hand, pick only a discrete number of
    maximum peak amplitudes from the spectrum (and
    a small number!)

29
Discussion About the LPC-Derived Features
  • In general, the results are promising, even for the
    raw LPC coefficients
  • The differences between the feature sets were small
  • From the implementation and efficiency viewpoint,
    the following are the most attractive: LPCC, LAR
    and ARCSIN
  • Formants also give (surprisingly) good results,
    which indirectly indicates that
  • The regions of the spectrum with high amplitude might
    be important for speaker recognition

30
4. Dynamic Features
31
Dynamic Spectral Features
  • Dynamic feature: an estimate of the time
    derivative of the feature
  • Can be applied to any feature
  • The two widely used estimation methods are the
    differentiator and the linear regression method
  • Typical phrase: "Don't use the differentiator, it
    emphasizes noise"
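The two estimation methods compared on the following slides can be sketched as follows (my formulation; replicating the edge frames is an assumed boundary choice):

```python
import numpy as np

def delta_differentiator(feats, M=1):
    """Differentiator estimate: d[t] = (f[t+M] - f[t-M]) / (2M),
    applied to a T x d matrix of feature vectors."""
    padded = np.pad(feats, ((M, M), (0, 0)), mode="edge")
    return (padded[2 * M:] - padded[:-2 * M]) / (2.0 * M)

def delta_regression(feats, M=2):
    """Linear-regression estimate: the least-squares slope over the
    2M+1 frames centered at t, d[t] = sum_m m*f[t+m] / sum_m m^2."""
    padded = np.pad(feats, ((M, M), (0, 0)), mode="edge")
    T = len(feats)
    num = np.zeros_like(feats, dtype=float)
    for m in range(-M, M + 1):
        num += m * padded[M + m : M + m + T]
    return num / (2.0 * sum(m * m for m in range(1, M + 1)))
```

For M = 1 the two estimates coincide; the difference the slides measure appears for larger neighborhoods, where the regression averages over more frames.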

32
Delta Features: Comparison of the Two Estimation
Methods
33
Delta Features: Comparison with the Static
Features
Discussion About the Delta Features
  • The optimum order is small (in most cases M = 1, 2
    neighboring frames)
  • The differentiator method is better in most
    cases (a surprising result, again!)
  • Delta features are worse than the static features
    but might provide uncorrelated extra information
    (for multiparameter recognition)
  • The commonly used delta-cepstrum gives quite
    poor results!

34
Towards Concluding Remarks ...
35
FFT-Cepstrum Revisited. Question: Is
Log-Compression / Mel-Cepstrum Best?
Please note: the segment length is now reduced
to T = 100 vectors; that's why the absolute
recognition rates are worse than before (ran out
of time for the thesis)
36
FFT- vs. LPC-Cepstrum. Question: Is it really
the case that the FFT-cepstrum is more accurate?
Helsinki
TIMIT
Answer: NO! (TIMIT shows this quite clearly)
37
The Essential Difference Between the FFT- and
LPC-Cepstra?
  • The FFT-cepstrum approximates the spectrum by a
    linear combination of cosine functions (a
    non-parametric model)
  • LPC makes a least-squares fit of the all-pole
    filter to the spectrum (a parametric model)
  • The FFT-cepstrum first smoothes the original
    spectrum with a filterbank, whereas the LPC filter
    is fitted directly to the original spectrum

However, one might argue that we could drop
the filterbank from the FFT-cepstrum ...
38
General Summary and Discussion
  • The number of subbands should be high (30-50 for
    these corpora)
  • The number of cepstral coefficients (LPC/FFT-based)
    should be high (≥ 15)
  • In particular, the number of subbands, coefficients,
    and the LPC order are clearly higher than in speech
    recognition generally
  • Formants give (surprisingly) good performance
  • The number of formants should be high (≥ 8)
  • In most cases, the differentiator method
    outperforms the regression method in
    delta-feature computation

39
Philosophical Discussion
  • The current knowledge of speaker individuality
    is far from perfect
  • Engineers concentrate on tuning complex feature
    compensation methods but don't (necessarily)
    understand what's individual in speech
  • Phoneticians try to find the "individual code"
    in the speech signal, but they don't
    (necessarily) know how to apply the engineers'
    methods
  • Why do we believe that speech would be any less
    individual than e.g. fingerprints?
  • Compare the histories of the "fingerprint" and
    the "voiceprint":
  • Fingerprints have been studied systematically
    since the 17th century (1684)
  • The spectrograph wasn't invented until 1946! How
    could we possibly claim that we know what speech
    is with less than 60 years of research?
  • Why do we believe that human beings are optimal
    speaker discriminators? Our ear can already be
    fooled (e.g. by MP3 encoding).

40
That's All, Folks!