Title: Dijana Petrovska-Delacr
ALISP-based improvement of GMMs for
Text-independent Speaker Verification
- Dijana Petrovska-Delacrétaz (1), Asmaa el Hannani (1)
- Gérard Chollet (2)
- (1) DIVA Group, University of Fribourg
- (2) GET-ENST, CNRS-LTCI, Paris
- 3-4 December 2003, Biometrics Tutorials, Uni. Fribourg
Overview
- 1. Why segmental speaker verification systems?
- 2. Speech segmentation problems
- 3. Proposed segmental system based on DTW
distance measure
- 4. Experimental setup
- 5. Results
- 6. Conclusions and perspectives
1 Why segmental speaker verification systems?
- Current reference speaker verification systems
are based on Gaussian Mixture Models (each speech
frame is treated independently)
- Speech is composed of different sounds
- Phonemes have different discriminant
characteristics for speaker verification (see
Eatock et al. 94, J. Olsen 97, Petrovska et al. 98,
2000)
- nasals and vowels convey more speaker
characteristics than other speech classes
- we would like to exploit this fact
- We need an automatic speech segmentation tool!
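To make "each speech frame is treated independently" concrete, here is a minimal sketch of frame-level GMM scoring. It assumes diagonal-covariance GMMs and an average log-likelihood ratio against a world model; the function names and the (weights, means, variances) tuple layout are illustrative choices, not taken from the slides.

```python
import numpy as np

def gmm_frame_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D), weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)   # (T, M)
    log_weighted = np.log(weights) + log_comp
    # log-sum-exp over the mixture components, computed stably
    peak = log_weighted.max(axis=1, keepdims=True)
    return (peak + np.log(np.exp(log_weighted - peak).sum(axis=1, keepdims=True)))[:, 0]

def utterance_score(frames, client, world):
    """Average log-likelihood ratio (client vs. world) over all frames.

    client / world: (weights, means, variances) tuples (assumed layout).
    The score is a plain average, so the order of the frames is ignored.
    """
    llr = gmm_frame_loglik(frames, *client) - gmm_frame_loglik(frames, *world)
    return float(np.mean(llr))
```

Because the score is an average over frames, shuffling the frames leaves it unchanged: temporal structure inside a sound is not used, which is the limitation a segmental system tries to address.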
1.1 Advantages and disadvantages of the speech
segmentation
- Problems
- Need of a speech segmentation tool
- Speaker modeling per speech classes => more data
needed
- More complicated systems
- Advantages
- Possibility to use it in combination with
dialogue-based systems, for which a speech
segmentation is already done
- Possibility to introduce text-prompted speaker
verification, designed to include a maximum
number of speaker-specific units
2 Speech Segmentation
- Large Vocabulary Continuous Speech Recognition
(LVCSR) System
- good results for a small set of languages
- needs a huge amount of annotated speech data
- language (and task) dependent
- we do not have such a system for American English
2.1 ALISP Speech Segmentation
- Data-driven speech segmentation
- not yet usable for speech recognition purposes
- no annotated databases needed
- language and task independent
- we could use it to segment the speech data for a
text-independent speaker verification task
- We will use the data-driven speech segmentation
method ALISP (Automatic Language Independent
Speech Processing)
2.2 ALISP principles
3 Proposed speaker verification system: ALISP
segments and DTW
3.1 Segmentation problem
- Segmentation of the speech data with N ALISP HMM
models
- N = 64 speech classes
- Need of (not transcribed) speech data, to train
the 64 ALISP HMM models
- With so many speech classes we should change the
speaker modeling method: not enough data for GMM
adaptation => Use of Dynamic Time Warping (DTW)
3.2 DTW distance measure for speaker verification
- Dynamic Time Warping (DTW) was already used for
speaker verification, in a text-dependent mode
(Rosenberg 76, Rabiner Schafer 76, Furui 81,
Pandit and Kittler 98)
- The DTW distance measure between two speech
segments conveys speaker-specific characteristics
- Originality: DTW is used in text-independent mode
- We first proceed to the segmentation of speech
data in ALISP classes
- Measure the distance between speaker and
non-speaker segments
- Speaker-specific information is extracted from
the ALISP-based speech segments => Client
Dictionary
- Non-speaker (world speakers) ALISP-based speech
segments => World Dictionary
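The DTW distance between two speech segments can be sketched as follows. This is a textbook formulation, assuming a Euclidean local cost between frames, the three basic step directions, and normalisation by the summed segment lengths; the slides do not specify which local cost, path constraints, or normalisation were actually used.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature sequences a (Ta, D) and b (Tb, D)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ta, tb = len(a), len(b)
    # Local cost: Euclidean distance between every pair of frames.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # acc[i, j] = cost of the best warping path aligning a[:i] with b[:j].
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            best_previous = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_previous
    # Assumed normalisation, so that long segments are not penalised.
    return float(acc[ta, tb] / (ta + tb))
```

With this formulation a segment and a time-stretched copy of it are at distance zero, which is why DTW suits comparing two realisations of the same ALISP class that differ in duration.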
3.3 Searching in the client and world speech
dictionaries for speaker verification purposes
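The search is shown as a figure on the slide and the exact scoring rule is not given in the text, so the following is only a sketch of one plausible reading: each test segment is compared with the dictionary entries of its own ALISP class, and the score is the average of (nearest world distance minus nearest client distance). The dictionary layout, the function names, and the scoring rule are assumptions; `mean_vector_distance` is a simple stand-in for the DTW distance used in the slides.

```python
import numpy as np

def mean_vector_distance(a, b):
    """Stand-in distance (Euclidean distance between mean frames); the slides use DTW."""
    return float(np.linalg.norm(np.mean(a, axis=0) - np.mean(b, axis=0)))

def nearest_distance(segment, candidates, distance):
    """Smallest distance from `segment` to any dictionary entry of the same class."""
    return min(distance(segment, candidate) for candidate in candidates)

def verification_score(test_segments, client_dict, world_dict,
                       distance=mean_vector_distance):
    """Average of (world distance - client distance) over the test segments.

    test_segments: list of (alisp_label, frames) pairs.
    client_dict / world_dict: {alisp_label: [frames, ...]} (assumed layout).
    Segments whose class is missing from either dictionary are skipped.
    A positive score means the segments are closer to the client.
    """
    diffs = []
    for label, frames in test_segments:
        client_entries = client_dict.get(label)
        world_entries = world_dict.get(label)
        if client_entries and world_entries:
            d_client = nearest_distance(frames, client_entries, distance)
            d_world = nearest_distance(frames, world_entries, distance)
            diffs.append(d_world - d_client)
    return float(np.mean(diffs)) if diffs else 0.0
```

Passing a DTW function as `distance` gives the segmental comparison described in section 3.2; the accept or reject decision would then compare the score with a threshold tuned on development data.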
4 Evaluation of the proposed system:
experimental setup
- Development data: one subset from NIST 2002
cellular data (American English)
- world speakers (60 female, 59 male)
- used to train the ALISP speech segmenter
- and to model the non-speakers (world speakers)
- Evaluated on
- another subset from NIST 2002 (111 79 male
speakers)
4.1 Speech segmentation example
- 2 other occurrences of the English phone ay
- the corresponding ALISP sequences: HX - Hf
and (HM) - Hf - Ha
- previous slide: (Hf) - Ha or (HM) - HZ - Ha
4.2 Results: GMM, ALISP-DTW systems and their
fusion
4.3 Results: EER comparison
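For reference, the Equal Error Rate (EER) is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be estimated from two lists of scores is given below; the function name and the simple threshold sweep are illustrative, and this is not the NIST scoring software.

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Estimate the EER from client (target) and impostor scores.

    Higher scores are assumed to mean "more likely the client".
    Every observed score is tried as a threshold; the EER is taken where
    the false acceptance and false rejection rates are closest.
    """
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for threshold in np.sort(np.concatenate([client, impostor])):
        far = np.mean(impostor >= threshold)  # impostors wrongly accepted
        frr = np.mean(client < threshold)     # clients wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```

The function returns a fraction between 0 and 1, so it has to be multiplied by 100 to be read as a percentage.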
4.4 Importance of fusion (33% improvement)
4.5 Using only GMM scores for the segments =>
segmental GMM system
5. Conclusions
- State-of-the-art NIST 2002 results for EER
(best 8% to worst 28%)
- Fusion of a classical system with a segmental
system: big improvements
- Why? Higher-level information present in the
segmental system usefully complements the
short-term frequency information present in the GMM
system