Title: Recent work on Language Identification
1 Recent work on Language Identification
Pietro Laface, POLITECNICO di TORINO
Brno, 28-06-2009
2 Team
POLITECNICO di TORINO
- Pietro Laface, Professor
- Fabio Castaldo, Post-doc
- Sandro Cumani, PhD student
- Ivano Dalmasso, Thesis student
LOQUENDO
- Claudio Vair, Senior Researcher
- Daniele Colibro, Researcher
- Emanuele Dalmasso, Post-doc
3 Outline
- Acoustic models
- Fast discriminative training of GMMs
- Language factors
- Phonetic models
- 1-best tokenizers
- lattice tokenizers
- LRE09
- Incremental acquisition of segments for the development sets
4 Our technology progress
- Inter-speaker compensation in feature space
- GLDS/SVM models (ICASSP 2007)
- GMMs
- SVM using GMM super-vectors (GMM-SVM)
- Introduced by MIT-LL for speaker recognition
- Fast discriminative training of GMMs
- Alternative to MMIE
- Exploiting the GMM-SVM separation hyperplanes
- MIT discriminative GMMs
- Language factors
5 Acoustic Language Identification
- Task similar to text-independent Speaker Recognition
- Gaussian Mixture Models (GMMs), MAP-adapted from a Universal Background Model (UBM)
[Figure: MAP adaptation of the language GMM from the UBM]
6 GMM super-vectors
Appending the mean values of all the Gaussians in a single stream, we get a super-vector.
We use GMM super-vectors.
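The slide gives no formula; in standard notation, with M Gaussians of mean μm, the supervector is the stack

```latex
s \;=\; \begin{bmatrix} \mu_1^{\top} & \mu_2^{\top} & \cdots & \mu_M^{\top} \end{bmatrix}^{\top}
```

so its dimension is M times the acoustic feature dimension.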
7 Using a UBM in LID
- The frame-based inter-speaker variation compensation approach estimates the inter-speaker compensation factors using the UBM
- In the GMM-SVM approach, all language GMMs share the weights and variances of the UBM
- The UBM is used for fast selection of Gaussians
8 Speaker/channel compensation in feature space
- U is a low-rank matrix (estimated offline) projecting the speaker/channel factor subspace into the supervector domain.
- x(i) is a low-dimensional vector, estimated using the UBM, holding the speaker/channel factors for the current utterance i.
- γm(t) is the occupation probability of the m-th Gaussian.
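The compensation formula did not survive extraction; a reconstruction consistent with the definitions above (Um denotes the rows of U corresponding to the m-th Gaussian, and ot is the feature vector at time t) is the usual frame-level subtraction

```latex
\hat{o}_t \;=\; o_t \;-\; \sum_{m=1}^{M} \gamma_m(t)\, U_m\, x(i)
```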
9 Estimating the U matrix
- Estimating the U matrix with a large set of differences between models generated using different utterances of the same speaker, we compensate the distortions due to inter-session variability → Speaker recognition
- Estimating the U matrix with a large set of differences between models generated using utterances of different speakers of the same language, we compensate the distortions due to inter-speaker/channel variability within the same language → Language recognition
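A minimal sketch of this estimation, assuming a hypothetical `supervector(utt)` that returns the MAP-adapted supervector of one utterance; plain PCA stands in for the EM-style estimation typically used in practice:

```python
import numpy as np

def estimate_U(groups, supervector, n_factors):
    """Estimate U from within-group supervector differences.

    `groups` is a list of utterance lists: one group per speaker for
    speaker recognition, one group per language for language recognition.
    """
    diffs = []
    for utts in groups:
        vecs = np.array([supervector(u) for u in utts])
        diffs.append(vecs - vecs.mean(axis=0))  # within-group differences
    D = np.vstack(diffs)
    # the principal directions of the differences span the nuisance subspace
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:n_factors].T  # low-rank U, one column per factor
```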
10 GMM-SVM
- A GMM model is trained for each utterance, both in training and in test
- Each GMM is represented by a normalized GMM super-vector
- The normalization is necessary to define a meaningful comparison between GMM supervectors
11 Kullback-Leibler divergence
- Two GMMs (i and j) can be compared using an approximation of the Kullback-Leibler divergence (see below).
- The interesting property of this measure is that it reduces to a Euclidean distance after a suitable normalization of the supervectors (next slide).
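The approximation itself is missing from the extracted slide; the standard form used with GMM-SVM systems (shared weights wm and covariances Σm, means from models i and j), up to a constant factor, is

```latex
d(i,j) \;\approx\; \sum_{m=1}^{M} w_m \,\bigl(\mu_m^{i}-\mu_m^{j}\bigr)^{\top} \Sigma_m^{-1} \bigl(\mu_m^{i}-\mu_m^{j}\bigr)
```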
12 Kullback-Leibler normalization
- Each supervector component is normalized according to the KL-induced scaling (see below)
- The normalized UBM supervector defines the origin of a new space
- The KL divergence becomes a Euclidean distance
- The SVM language models are created using a linear kernel in this KL space
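The normalization formula is missing from the extracted slide; the usual scaling of the m-th mean subvector that turns the divergence above into a Euclidean distance is

```latex
\hat{\mu}_m \;=\; \sqrt{w_m}\;\Sigma_m^{-1/2}\,\mu_m
```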
13 GMM-SVM weakness
- GMM-SVM models perform very well with rather long test utterances.
- It is difficult, however, to estimate a robust GMM from a short test utterance.
- Idea: exploit the discriminative information given by the GMM-SVM for fast estimation of discriminative GMMs.
14 SVM discriminative directions
w is the normal vector to the class-separation hyperplane
15 GMM discriminative training
[Figure: utterance GMMs and language GMMs mapped into the KL space]
- Shift each Gaussian of a language model along its discriminative direction, given by the vector normal to the class-separation hyperplane in the KL space
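In formulas (a reconstruction, not the slide's own notation): if the bold w(k) is the SVM normal for language k in the KL space, and wm(k) its subvector for Gaussian m, undoing the KL normalization on the shift gives the pushed mean

```latex
\hat{\mu}_m^{(k)} \;=\; \mu_m^{(k)} \;+\; \alpha_k\, \frac{1}{\sqrt{w_m}}\,\Sigma_m^{1/2}\, \mathbf{w}_m^{(k)}
```

with the shift size αk selected as discussed on the next slide.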
16 Rules for selection of αk
- A discriminative GMM moves away from its original, MAP-adapted model, which best matches the training (and test) data.
- A large value of αk (shift size) gives a more discriminative model, but a worse likelihood than less discriminative models.
- Use a development set for estimating αk.
17 Experiments with 2048 GMMs
Pooled EER (%) of discriminative 2048-Gaussian GMMs and GMM-SVM on the NIST LRE tasks. In parentheses, the average of the per-language EERs.

Year | Discriminative GMMs, 3s | Discriminative GMMs, 10s | GMM-SVM, 30s
1996 | 11.71 (13.71) | 3.62 (4.92) | 1.01 (1.37)
2003 | 13.56 (14.40) | 5.50 (6.02) | 1.42 (1.64)
2005 | 16.94 (17.85) | 9.73 (11.07) | 4.67 (5.81)

256-MMI (Brno University, 2006 IEEE Odyssey):
2005 | 17.1 | 8.6 | 4.6
18 Pushed GMMs (MIT-LL)
19 Language Factors
- Eigenvoice modeling, and the use of speaker factors as input features to SVMs, has recently been shown to give good results for speaker recognition compared with the standard GMM-SVM approach (Dehak et al., ICASSP 2009).
- Analogy: estimate an eigen-language space, and use the language factors as input features to SVM classifiers (Castaldo et al., submitted to Interspeech 2009).
20 Language Factors advantages
- Language factors are low-dimensional vectors
- Training and evaluating SVMs with different kernels is easy and fast: it requires only the dot product of normalized language factors (see the sketch below)
- Using a very large number of training examples is feasible
- Small models give good performance
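As a concrete illustration, a sketch assuming scikit-learn and hypothetical files holding the normalized language factors and their labels:

```python
import numpy as np
from sklearn.svm import SVC

# Language factors are low-dimensional, so one-vs-rest SVMs with a
# plain dot-product (linear) kernel train and evaluate quickly even
# on a very large number of examples.
X = np.load("language_factors.npy")  # (n_utterances, n_factors), normalized
y = np.load("labels.npy")            # one language label per utterance
clf = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
scores = clf.decision_function(X)    # per-language scores per utterance
```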
21 Toward an eigen-language space
- After compensating the nuisances of a GMM adapted from the UBM using a single utterance, residual information about the channel and the speaker remains.
- However, most of the undesired variation is removed, as demonstrated by the improvements obtained using this technique.
22 Speaker-compensated eigenvoices
- First approach
- Estimating the principal directions of the GMM supervectors of all the training segments before inter-speaker nuisance compensation would produce a set of language-independent, universal eigenvoices.
- After nuisance removal, however, the speaker contribution to the principal components is reduced, to the benefit of language discrimination.
23 Eigen-language space
- Second approach
- Computing the differences between the GMM supervectors obtained from utterances of a polyglot speaker would compensate the speaker characteristics and enhance the acoustic components of one language with respect to the others.
- We do not have labeled databases including polyglot speakers, however.
- Instead, we compute and collect the differences between GMM supervectors produced by utterances of speakers of two different languages, irrespective of speaker identity, already compensated in the feature domain.
24 Eigen-language space
- The number of these differences would grow with the square of the number of utterances in the training set.
- Instead, we perform Principal Component Analysis on the set of differences between the supervectors of a language and the average supervector of every other language (see the sketch below).
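A minimal sketch of this second approach, assuming `supervecs` maps each language to an array of feature-domain-compensated GMM supervectors (one row per utterance):

```python
import numpy as np

def eigenlanguage_basis(supervecs: dict, n_factors: int = 13):
    """PCA on the differences between each language's supervectors
    and the average supervector of every other language."""
    means = {lang: X.mean(axis=0) for lang, X in supervecs.items()}
    diffs = []
    for lang, X in supervecs.items():
        for other, mu in means.items():
            if other != lang:
                diffs.append(X - mu)  # differences to other-language means
    D = np.vstack(diffs)
    D = D - D.mean(axis=0)  # center before extracting principal directions
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:n_factors].T  # eigen-language basis, one column per factor
```

This keeps the number of differences linear, rather than quadratic, in the number of training utterances.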
25 Training corpora
- The same used for the LRE07 evaluation:
- All data of the 12 languages in the CallFriend corpus
- Half of the NIST LRE07 development corpus
- Half of the OHSU corpus provided by NIST for LRE05
- The Russian through Switched Telephone Network corpus
- Automatic segmentation
26 Eigenvalues of two language subspaces
The language subspace has higher eigenvalues, and
both curves show a sharp decrease for their first
13 eigenvalues, corresponding to the main
language discrimination directions.
27 LRE07 30s closed-set test
The language-factor minDCF is always better and more stable.
28 Pushed GMMs (MIT-LL)
29 Pushed eigen-language GMMs
The same approach is used to obtain discriminative GMMs from the language factors.
30 Min DCFs and (EER %)
Models | 30s | 10s | 3s
GMM-SVM (KL kernel) | 0.029 (3.43) | 0.085 (9.12) | 0.201 (21.3)
GMM-SVM (Identity kernel) | 0.031 (3.72) | 0.087 (9.51) | 0.200 (21.0)
LF-SVM (KL kernel) | 0.026 (3.13) | 0.083 (9.02) | 0.186 (20.4)
LF-SVM (Identity kernel) | 0.026 (3.11) | 0.083 (9.13) | 0.187 (20.4)
Discriminative GMMs | 0.021 (2.56) | 0.069 (7.49) | 0.174 (18.45)
LF-Discriminative GMMs (KL kernel) | 0.025 (2.97) | 0.084 (9.04) | 0.186 (19.9)
LF-Discriminative GMMs (Identity kernel) | 0.025 (3.05) | 0.084 (9.05) | 0.186 (20.0)
31 Loquendo-Polito LRE09 System
Model Training
32 Phonetic models
- Output layer
- 700-1000 states for the language-dependent phonetic units
- ASR Recognizer
- phone-loop grammar with diphone transition constraints
33 Phone transcribers
- ASR Recognizer
- phone-loop grammar with diphone transition constraints
- 12 phone transcribers for:
- French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English
- The statistics of the n-gram phone occurrences are collected from the best decoded string of each conversation segment
34 Phone transcribers
- ANN models
- Same phone-loop grammar, different engine
- 10 phone transcribers for:
- Catalan, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English
- The statistics of the n-gram phone occurrences are collected from the expected counts from a lattice of each conversation segment
35 Multigrams
- Two different TFLLR kernels:
- trigrams
- pruned multigrams
- Multigrams can provide useful information about the language by capturing word parts within the string sequences (see the sketch below)
36 Pruned Multigrams
For each phonetic transcriber, we discard all the n-grams appearing in the training set fewer than 0.05 times the average occurrence of the unigrams.

Total number of n-grams for the 12 language transcribers:
N-gram order | 1 | 2 | 3 | 4 | 5 | 6
Pruned | 461 | 11477 | 120200 | 114396 | 10738 | 443
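A sketch of this pruning rule, assuming Counter objects built over the training transcriptions of one transcriber:

```python
from collections import Counter

def prune_ngrams(ngram_counts: Counter, unigram_counts: Counter,
                 ratio: float = 0.05) -> set:
    """Keep n-grams whose count reaches `ratio` times the average
    unigram count; the rest are discarded."""
    avg_unigram = sum(unigram_counts.values()) / len(unigram_counts)
    return {g for g, c in ngram_counts.items() if c >= ratio * avg_unigram}
```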
37 Scoring
- The total number of models that we use for scoring an unknown segment is 34:
- 11 channel-dependent models (11 x 2)
- 12 single-channel models (2 telephone-only and 10 broadcast-only models)
- 23 x 2 for MMIE GMMs (channel-independent, but M/F)
38 Calibration and fusion
- Multi-class FoCal
- Max of the channel-dependent scores
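FoCal is Brummer's calibration/fusion toolkit; as a rough stand-in for its multi-class fusion, one can train a multiclass logistic regression on the concatenated subsystem scores (a sketch assuming scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(score_blocks, labels):
    """Fuse subsystems: each block is an (n_segments, n_languages)
    score matrix; channel-dependent scores are max-pooled upstream."""
    X = np.hstack(score_blocks)  # concatenate per-system score vectors
    return LogisticRegression(max_iter=1000).fit(X, labels)
```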
39 Language-pair recognition
- For the language-pair evaluation, only the back-ends have been re-trained, keeping the models of all the sub-systems unchanged.
40 Telephone development corpora
- CALLFRIEND: conversations split into slices of 150 s
- NIST 2003 and NIST 2005
- LRE07 development corpus
- Cantonese and Portuguese data in the 22-Language OGI corpus
- RuSTeN: the Russian through Switched Telephone Network corpus
41 Broadcast development corpora
- Incrementally created to include, as far as possible, the variability within a language due to channel, gender, and speaker differences
- The development data, further split into training, calibration, and test subsets, should cover the mentioned variability
42 Problems with LRE09 dev data
- Often same-speaker segments
- Scarcity of segments for some languages after filtering same-speaker segments
- Genders are not balanced
- Excluding French, the segments of all languages are either telephone or broadcast
- No audited data available for Hindi, Russian, Spanish, and Urdu on VOA3; only automatic segmentation was provided
- No segmentation was provided in the first release of the development data for Cantonese, Korean, Mandarin, and Vietnamese
- For these 8 missing languages, only the language hypotheses provided by BUT were available for the VOA2 data
43 Additional audited data
- For the 8 languages lacking broadcast data, segments have been generated by accessing the VOA site and looking for the original MP3 files
- Goal: collect 300 broadcast segments per language, processed to detect narrowband fragments
- The candidates were checked to eliminate segments including music, bad channel distortions, and fragments of other languages
44 Development data for bootstrap models
- The segments were distributed to these sets so that same-speaker segments were included in the same set.
- A set of acoustic (pushed-GMM) bootstrap models has been trained.
45 Additional non-audited data from VOA3
- Preliminary tests with the bootstrap models indicated the need for additional data
- Selected from VOA3 to include new speakers in the training, calibration, and test sets
- assuming that the file labels correctly identify the corresponding language
46 Speaker selection
- Performed by means of a speaker recognizer (see the sketch below)
- We process the audited segments before the others
- A new speaker model is added to the current set of speaker models whenever the best recognition score obtained by a segment is less than a threshold
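A minimal sketch of this selection loop; `score_fn` (speaker-verification score of a segment against a model) and `train_fn` (train a speaker model from a segment) are hypothetical placeholders:

```python
def select_new_speakers(segments, score_fn, train_fn, threshold):
    """Add a new speaker model whenever no existing model scores a
    segment above the threshold; audited segments are passed first."""
    models, selected = [], []
    for seg in segments:
        best = max((score_fn(m, seg) for m in models), default=float("-inf"))
        if best < threshold:          # segment matches no known speaker
            models.append(train_fn(seg))
            selected.append(seg)      # keep it as a new-speaker segment
    return selected
```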
47 Additional non-audited data from VOA2
- Enriching the training set
- Language recognition has been performed using a system combining the acoustic bootstrap models and a phonetic system
- A segment has been selected only if (see the sketch below):
- the 1-best language hypothesis of our system had a score greater than a given (rather high) threshold
- it matched the 1-best hypothesis provided by the BUT system
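The selection rule in code form (a sketch; the field names are hypothetical):

```python
def keep_segment(seg, threshold):
    """Keep a VOA2 segment only if our 1-best language hypothesis is
    confident and agrees with the 1-best hypothesis provided by BUT."""
    return seg.our_score > threshold and seg.our_best == seg.but_best
```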
48 Total number of segments for this evaluation
Set | voa3_A | voa2_A | ftp_C | voa3_S | voa2_S | ftp_S
Train | 529 | 116 | 316 | 1955 | 590 | 66
Extended train | 114 | 22 | 65 | 2483 | 574 | 151
Development | 396 | 85 | 329 | 1866 | 449 | 45
- Suffixes: A = audited, C = checked, S = automatic segmentation
- ftp: ftp://8475.ftp.storage.akadns.net/mp3/voa/
49 Hausa: Decision Cost Function
[Figure: DCF curves]
50 Hindi: Decision Cost Function
[Figure: DCF curves]
51 Results on the development set
Test on | Pushed GMMs | MMIE GMMs | 3-grams | Multi-grams | Lattice | Fusion
Broadcast + telephone | 1.48 | 1.70 | 1.09 | 1.12 | 1.06 | 0.86
Broadcast subset | 1.54 | 1.69 | 1.24 | 1.26 | 1.14 | 0.91
Telephone subset | 2.00 | 2.51 | 1.45 | 1.49 | 1.42 | 1.21
Average minDCF x 100 on 30s test segments
52 Korean: score cumulative distribution
[Figure: cumulative score distributions; legend: t-t, t-b, b-t]