Title: Presentation COST 277 Limerick
1. Non-Linear Speech Feature Extraction for Phoneme Classification and Speaker Recognition
M. Chetouani, M. Faúndez-Zanuy (*), B. Gas, J.L. Zarader
Laboratoire des Instruments et Systèmes d'Ile-De-France (LISIF), Université Pierre et Marie Curie, PARIS, FRANCE
(*) Escola Universitària Politècnica de Mataró, BARCELONA, SPAIN
2. Outline
- Feature extraction in the recognition process
- Needs for speech feature extraction
- A non-linear model: the Neural Predictive Coding
- Feature extraction for phoneme classification
- Feature extraction for speaker recognition
- Conclusions and future work
3. Speech Recognition Process
- Speech feature extraction process
- Feature extraction is the first step of the recognition process.
- Feature extraction is usually computed by temporal methods like Linear Predictive Coding (LPC), frequential methods like Mel Frequency Cepstral Coefficients (MFCC), or methods combining both like Perceptual Linear Prediction (PLP) (see the sketch after this list).
- Limits:
  - Linear methods.
  - No a priori class membership information (data-driven methods).
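For reference, a minimal sketch (in Python, with librosa) of the conventional extractors named above; the sampling rate, frame length, LPC order and number of cepstral coefficients are assumed values, not taken from the slides.

```python
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 120 * t)                      # stand-in for a speech signal

frame = y[:400]                                      # one 25 ms analysis frame
a = librosa.lpc(frame, order=12)                     # temporal method: LPC coefficients
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # frequential method: MFCC
print(a.shape, mfcc.shape)
```

Both computations are purely data-driven, which is the limitation pointed out above: nothing in them uses class membership information.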
4. Feature extraction principle
(Diagram: common part and specific parts of the feature extractor)
5. Needs for Speech Feature Extraction (1)
- First, a non-linear modelling of the speech production process.
- A solution: non-linear predictors (Volterra filters, neural networks).
- Our approach: an extension of Linear Predictive Coding (LPC) to the non-linear domain by means of neural networks (see the sketch below).
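A minimal sketch of the idea, assuming a small multilayer perceptron in PyTorch: like LPC, the model predicts x[n] from the p previous samples, but through a non-linear hidden layer; the order, hidden size and training settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

p, hidden = 12, 8                          # prediction order and hidden size (assumed)

predictor = nn.Sequential(
    nn.Linear(p, hidden),
    nn.Tanh(),                             # the non-linearity absent from classical LPC
    nn.Linear(hidden, 1),
)

x = torch.randn(1000)                      # stand-in for one speech signal
past = torch.stack([x[i:i + p] for i in range(len(x) - p)])
target = x[p:].unsqueeze(1)

opt = torch.optim.SGD(predictor.parameters(), lr=1e-2)
for _ in range(100):                       # minimize the prediction error, as in LPC
    opt.zero_grad()
    loss = nn.functional.mse_loss(predictor(past), target)
    loss.backward()
    opt.step()
```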
6. Needs for Speech Feature Extraction (2)
- Problem: generation of a large number of coefficients.
- Solution:
  - A first layer common to all the phonemes.
  - A second layer specific to each phoneme.
7. The Neural Predictive Coding (NPC) (2)
- Principle:
  - The first-layer weights w are common to all the phoneme classes.
  - Each output layer is associated with one phoneme class (see the sketch below).
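A minimal sketch of this layout, under the same assumed sizes as above: one hidden layer shared by every phoneme class and one small output layer per class; the class-specific second-layer weights estimated on a frame are what the coding phase produces.

```python
import torch
import torch.nn as nn

p, hidden, n_classes = 12, 8, 10           # assumed prediction order and sizes

shared = nn.Sequential(nn.Linear(p, hidden), nn.Tanh())      # weights w, common to all classes
heads = nn.ModuleList([nn.Linear(hidden, 1, bias=False)      # one output layer per phoneme class
                       for _ in range(n_classes)])

def predict(past_samples, phoneme_class):
    # Predict x[n] through the shared first layer and the class-specific output layer.
    return heads[phoneme_class](shared(past_samples))
```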
8. Learning phase
9. Phoneme classification
- The objective is to extract phonetic features from the speech signal independently of the speaker.
- The common parts of the speech production model are modelled by the first layer.
- The specific parts are modelled by the second layers.
10. Discriminative Feature Extraction based on the Minimum Classification Error (MCE) criterion
- The key idea is to impose discriminant constraints through the classifier, which provides the optimal constraints.
- Simultaneous training of both the feature extractor and the classifier (a loss sketch follows this list):
  - Feature extractor: Neural Predictive Coding.
  - Prototype-based classifier: Learning Vector Quantization.
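A minimal sketch of a smoothed MCE loss over LVQ-style class prototypes; one prototype per class, the sigmoid smoothing and the squared-distance measure are assumptions for illustration. Back-propagating such a loss is what allows the feature extractor and the classifier to be trained simultaneously.

```python
import torch

def mce_loss(feature, prototypes, label, xi=1.0):
    # Squared distances from the feature vector to every class prototype.
    d = ((feature.unsqueeze(0) - prototypes) ** 2).sum(dim=1)
    d_correct = d[label]
    d_others = torch.cat([d[:label], d[label + 1:]])
    # Misclassification measure: positive when some wrong class is closer.
    misclass = d_correct - d_others.min()
    # Smooth the 0/1 classification error with a sigmoid so it is differentiable.
    return torch.sigmoid(xi * misclass)

# Example with assumed shapes: a 12-dimensional feature and 3 classes.
loss = mce_loss(torch.randn(12), torch.randn(3, 12), label=1)
```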
11. Evaluation on phoneme classification
- Phonemes are extracted from the NTIMIT speech database:
  - DR1 and DR2 regions (without the SA sentences).
  - 114 speakers for training and 37 for testing.
- Classification of confusable phonemes:
  - Vowels: /ih/, /ey/, /eh/, /ae/
  - Voiced plosives: /b/, /d/, /g/
  - Unvoiced plosives: /p/, /t/, /k/
- Comparisons with traditional methods (LPC, MFCC, PLP) and with a non-linear model (NPC without an explicit discriminant criterion).
- Classification by GMMs (a frame-level sketch follows this list).
- Context-independent classification (frame-by-frame).
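A minimal sketch of the frame-level GMM decision rule, assuming scikit-learn, random stand-in features and an arbitrary number of mixture components: one GMM is fitted per phoneme class and each test frame is assigned to the class with the highest log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

n_classes, dim = 3, 12                            # e.g. /b/, /d/, /g/ with 12-dim features
train = [np.random.randn(500, dim) for _ in range(n_classes)]   # stand-in training features
test = np.random.randn(200, dim)                                # stand-in test frames

gmms = [GaussianMixture(n_components=8).fit(x) for x in train]  # 8 components assumed
scores = np.stack([g.score_samples(test) for g in gmms])        # (n_classes, n_frames)
predicted = scores.argmax(axis=0)                               # frame-by-frame decision
```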
12. Classification rates (frame-by-frame analysis)
13. Feature extraction for speaker recognition
- Objective:
  - Speaker-dependent feature extraction.
- Speaker recognition process:
  - Feature extraction is currently carried out in the same way for all the speakers.
- Our approach:
  - A speaker model composed of:
    - A feature extractor (NPC).
    - A reference model.
14. A new initialization method for the NPC coding phase
- Once the NPC model is parameterized, the coding phase consists of estimating the second-layer weights.
- As for any optimization process, the initialization is important:
  - Traditionally, one uses random initialization with different constraints.
- We use LPC analysis for a linear initialization of the non-linear model (see the sketch below):
  - A data-driven method for the linear initialization of neural networks.
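A minimal sketch of the linear-initialization idea: the weights estimated in the coding phase start from the frame's LPC solution rather than from a random draw. The sizes, the use of librosa for the LPC analysis, and the direct mapping of LPC coefficients onto a single linear layer are assumptions for illustration; in the actual NPC the second layer sits behind the non-linear first layer.

```python
import numpy as np
import librosa
import torch

p = 12
frame = np.random.randn(400)                       # stand-in speech frame
a = librosa.lpc(frame, order=p)                    # [1, a1, ..., ap]
lpc_predictor = -a[1:]                             # coefficients predicting x[n] from the past

second_layer = torch.nn.Linear(p, 1, bias=False)   # weights estimated in the coding phase
with torch.no_grad():                              # start from the linear (LPC) solution
    second_layer.weight.copy_(torch.tensor(lpc_predictor).unsqueeze(0))
```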
15. Speaker identification
- Modification of the traditional enrolment and test phases.
- Enrolment phase:
  - NPC parameterization phase: 12 seconds are used.
  - Computation of reference models by using the whole sentence.
- Test phase:
  - The speech input is coded by each NPC model.
  - The obtained features are compared with the associated reference models.
16. Evaluation on speaker identification
- 49 speakers from the Gaudi database:
  - Acquisition with a microphone connected to a PC.
- The vector dimension is set to 16.
- One minute of read text is used for training the reference models. From this minute, 12 seconds are used for the NPC parameterization.
- 5 sentences for testing (each sentence is about 2-3 seconds).
- Comparisons with traditional coding methods: LPC, MFCC, LPCC and PLP.
- Reference models are computed as covariance matrices, and the Arithmetic-Harmonic Sphericity (AHS) measure is used for the comparisons (see the sketch below).
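A minimal sketch of the standard Arithmetic-Harmonic Sphericity measure between a test covariance matrix and a reference covariance model, assumed here to match the formulation used in the evaluation: the measure is 0 when the two covariances are equal, and the identified speaker is the one whose reference gives the smallest value.

```python
import numpy as np

def ahs(cov_test, cov_ref):
    # Arithmetic-Harmonic Sphericity measure between two covariance matrices.
    d = cov_test.shape[0]
    arit = np.trace(cov_test @ np.linalg.inv(cov_ref))
    harm = np.trace(cov_ref @ np.linalg.inv(cov_test))
    return np.log(arit * harm / d ** 2)            # 0 when cov_test == cov_ref

features = np.random.randn(300, 16)                # stand-in coded test utterance (dim 16)
cov_test = np.cov(features, rowvar=False)          # test covariance, compared to each reference
```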
17. Results
18. Conclusions
- The phoneme classification rates are improved by non-linear methods, in comparison with traditional methods like LPC, MFCC and PLP.
- The speaker identification rates are improved by non-linear methods with linear initialization, in comparison with traditional methods like LPC, LPCC, MFCC and PLP.
19. Perspectives
- Phoneme classification
- Cooperation with different classifiers.
- Application to a large number of phonemes.
- Speaker recognition
- Explicit discriminative feature extraction.
- Different applications: identification, verification, tracking.
20. Thank you for your attention