Recent work on Language Identification - PowerPoint PPT Presentation

1
Recent work on Language Identification
Pietro Laface, POLITECNICO di TORINO
Brno, 28-06-2009
2
Team
  • POLITECNICO di TORINO: Pietro Laface (Professor), Fabio Castaldo (Post-doc), Sandro Cumani (PhD student), Ivano Dalmasso (Thesis student)
  • LOQUENDO: Claudio Vair (Senior Researcher), Daniele Colibro (Researcher), Emanuele Dalmasso (Post-doc)
3
Outline
  • Acoustic models
  • Fast discriminative training of GMMs
  • Language factors
  • Phonetic models
  • 1-best tokenizers
  • lattice tokenizers
  • LRE09
  • Incremental acquisition of segments for the
    development sets

4
Our technology progress
  • Inter-speaker compensation in feature space
    • GLDS/SVM models (ICASSP 2007) → GMMs
  • SVM using GMM super-vectors (GMM-SVM)
    • Introduced by MIT-LL for speaker recognition
  • Fast discriminative training of GMMs
    • Alternative to MMIE
    • Exploiting the GMM-SVM separation hyperplanes
    • MIT discriminative GMMs
  • Language factors

5
Acoustic Language Identification
  • Task similar to text-independent Speaker
    Recognition

Gaussian Mixture Models (GMMs), MAP-adapted from
a Universal Background Model (UBM)
[Figure: language GMMs obtained from the UBM via MAP adaptation]
6
GMM super-vectors
Appending the mean vectors of all the Gaussians into a
single stream, we get a super-vector.
We use GMM super-vectors.
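As a minimal sketch of this stacking (names are mine, not the authors'): for M Gaussians with D-dimensional means, the super-vector is just the M x D mean matrix flattened into one M*D vector.

```python
import numpy as np

def supervector(means):
    """Stack an M x D matrix of Gaussian mean vectors into a single M*D super-vector."""
    return np.asarray(means, dtype=float).reshape(-1)

# 3 Gaussians with 2-dimensional means -> a 6-dimensional super-vector
sv = supervector([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```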
7
Using a UBM in LID
  • The frame based inter-speaker variation
    compensation approach estimates the inter-speaker
    compensation factors using the UBM
  • In the GMM-SVM approach all language GMMs share
    the same weights and variances of the UBM
  • The UBM is used for fast selection of Gaussians

8
Speaker/channel compensation in feature space
  • U is a low rank matrix (estimated offline)
    projecting the speaker/channel factors subspace
    in the supervector domain.
  • x(i) is a low dimensional vector, estimated using
    the UBM, holding the speaker/channel factors for
    the current utterance i.
  • γm(t) is the occupation probability of the
    m-th Gaussian at time t
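The step above can be sketched as follows (a minimal rendering under my own naming assumptions — compensate, frames, gammas, U and x are hypothetical identifiers): each frame is shifted by the occupation-weighted, per-Gaussian slice of the nuisance term U·x(i).

```python
import numpy as np

def compensate(frames, gammas, U, x):
    """Feature-space nuisance compensation (sketch).

    frames: T x D observation vectors
    gammas: T x M UBM occupation probabilities
    U:      (M*D) x R low-rank subspace matrix
    x:      R-dim speaker/channel factors for the utterance
    """
    T, D = frames.shape
    M = gammas.shape[1]
    shift = (U @ x).reshape(M, D)   # per-Gaussian nuisance offset in feature space
    return frames - gammas @ shift  # each frame minus its occupation-weighted offset

# Toy example: 2 frames, 2 dims, 2 Gaussians, 1 nuisance factor
frames = np.array([[5.0, 5.0], [5.0, 5.0]])
gammas = np.array([[1.0, 0.0], [0.0, 1.0]])   # hard occupations, for clarity
U = np.array([[1.0], [1.0], [2.0], [2.0]])    # (M*D) x R
x = np.array([1.0])
comp = compensate(frames, gammas, U, x)
```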

9
Estimating the U matrix
  • Estimating the U matrix with a large set of
    differences between models generated using
    different utterances of the same speaker, we
    compensate the distortions due to
    inter-session variability → Speaker recognition
  • Estimating the U matrix with a large set of
    differences between models generated using
    different-speaker utterances of the same language,
    we compensate the distortions due to
    inter-speaker/channel variability within the same
    language → Language recognition

10
GMM-SVM
  • A GMM model is trained for each utterance, both
    in training and in test
  • Each GMM is represented by a normalized GMM
    super-vector
  • The normalization is necessary to define a
    meaningful comparison between GMM supervectors

11
Kullback-Leibler divergence
  • Two GMMs (i and j) can be compared using an
    approximation of the Kullback-Leibler divergence

The interesting property of this measure is that
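A standard form of this approximation, for two GMMs i and j sharing the UBM weights and covariances (the slide's own formula is assumed to match this form, common in the GMM-SVM literature):

```latex
d^2(i,j) \;=\; \sum_{m=1}^{M} w_m \,
\bigl(\mu_m^{(i)} - \mu_m^{(j)}\bigr)^{\top} \Sigma_m^{-1}
\bigl(\mu_m^{(i)} - \mu_m^{(j)}\bigr)
```

Here wm and Σm are the shared UBM weights and covariances, and μm are the adapted means; only the means differ between models, which is what makes the measure collapse to a distance between supervectors.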
12
Kullback-Leibler normalization
  • Each supervector component is normalized by the
    shared UBM weights and variances (the m-th mean is
    scaled by √wm / σm, per dimension)
  • The normalized UBM supervector defines the origin
    of a new space
  • The KL divergence becomes a Euclidean distance
  • The SVM language models are created using a
    linear kernel in this KL space
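A sketch of this normalization (my own minimal rendering; kl_normalize is a hypothetical name, and diagonal covariances are assumed): scaling each mean by √wm/σm makes the Euclidean distance between normalized supervectors equal the approximate KL divergence.

```python
import numpy as np

def kl_normalize(means, weights, variances):
    """Map GMM means into the KL space: scale the m-th mean by
    sqrt(w_m) / sigma_m (per dimension), using the shared UBM
    weights and diagonal variances."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = np.sqrt(np.asarray(weights, dtype=float))[:, None]
    return (w * means / np.sqrt(variances)).reshape(-1)

# Squared Euclidean distance in the KL space for two toy GMMs
weights = [0.4, 0.6]
variances = [[2.0, 0.5], [1.0, 4.0]]
mu_i = [[1.0, 2.0], [3.0, 4.0]]
mu_j = [[0.0, 1.0], [2.0, 2.0]]
d_euclid2 = float(np.sum((kl_normalize(mu_i, weights, variances)
                          - kl_normalize(mu_j, weights, variances)) ** 2))
```

By hand, the approximate KL divergence Σm wm (μi−μj)ᵀ Σm⁻¹ (μi−μj) gives 0.4·(0.5+2) + 0.6·(1+1) = 2.2, matching d_euclid2.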

13
GMM-SVM weakness
  • GMM-SVM models perform very well with rather long
    test utterances

It is difficult to estimate a robust GMM from a
short test utterance.
Solution: exploit the discriminative information
given by the GMM-SVM separation hyperplanes for fast
estimation of discriminative GMMs.
14
SVM discriminative directions
w: the normal vector to the class-separation
hyperplane
15
GMM discriminative training
[Figure: utterance GMMs and language GMM in the KL space]
  • Shift each Gaussian of a language model along its
    discriminative direction, given by the vector
    normal to the class-separation hyperplane in the
    KL space
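A sketch of this shift (my own minimal rendering; push_means, w_normal and alpha are hypothetical names): the hyperplane normal lives in the KL-normalized space, so it is mapped back to the mean domain by undoing the √wm/σm scaling before being added with step size ak (alpha here).

```python
import numpy as np

def push_means(means, w_normal, alpha, weights, variances):
    """Shift each Gaussian mean along the SVM discriminative direction.

    w_normal: M x D hyperplane normal in the KL-normalized space.
    The KL scaling sqrt(w_m)/sigma_m is undone so the shift is
    applied in the model (mean) domain, scaled by alpha.
    """
    scale = (np.sqrt(np.asarray(weights, dtype=float))[:, None]
             / np.sqrt(np.asarray(variances, dtype=float)))
    return np.asarray(means, dtype=float) + alpha * np.asarray(w_normal, dtype=float) / scale

# One Gaussian with unit weight/variance: the shift is just alpha * w_normal
pushed = push_means(means=[[0.0, 0.0]], w_normal=[[1.0, 2.0]],
                    alpha=0.5, weights=[1.0], variances=[[1.0, 1.0]])
```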

16
Rules for selection of ak
  • A discriminative GMM moves away from its original
    (MAP-adapted) model, which best matches the
    training (and test) data.
  • A large value of ak (shift size) →
    • a more discriminative model, but
    • a worse likelihood than less discriminative models
  • Use a development set for estimating ak

17
Experiments with 2048 GMMs
Pooled EER (%) of discriminative 2048 GMMs and
GMM-SVM on the NIST LRE tasks. In parentheses,
the average of the per-language EERs.

       Discriminative GMMs  Discriminative GMMs  GMM-SVM
Year   3s                   10s                  30s
1996   11.71 (13.71)        3.62 (4.92)          1.01 (1.37)
2003   13.56 (14.40)        5.50 (6.02)          1.42 (1.64)
2005   16.94 (17.85)        9.73 (11.07)         4.67 (5.81)

256-MMI (Brno University, 2006 IEEE Odyssey):
2005   17.1                 8.6                  4.6
18
Pushed GMMs (MIT-LL)
19
Language Factors
  • Eigenvoice modeling and the use of speaker
    factors as input features to SVMs have recently
    been demonstrated to give good results for
    speaker recognition compared with the standard
    GMM-SVM approach (Dehak et al., ICASSP 2009).
  • Analogy: estimate an eigen-language space, and
    use the language factors as input features to SVM
    classifiers (Castaldo et al., submitted to
    Interspeech 2009).

20
Language Factors advantages
  • Language factors are low-dimension vectors
  • Training and evaluating SVMs with different
    kernels is easy and fast: it requires only the dot
    product of normalized language factors
  • Using a very large number of training examples is
    feasible
  • Small models give good performance
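The dot-product scoring mentioned above can be sketched as follows (a minimal illustration, not the authors' implementation; score_language_factors is a hypothetical name): length-normalize the low-dimensional language factors, then score with a plain linear kernel.

```python
import numpy as np

def score_language_factors(lf_test, lf_models):
    """Length-normalize language factors and score a test vector
    against per-language model vectors with a dot-product kernel."""
    def unit(v):
        v = np.atleast_2d(np.asarray(v, dtype=float))
        return v / np.linalg.norm(v, axis=1, keepdims=True)
    return unit(lf_test) @ unit(lf_models).T

# A test vector aligned with the first of two language models
scores = score_language_factors([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```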

21
Toward an eigen-language space
  • After compensation of the nuisances of a GMM
    adapted from the UBM using a single utterance,
    residual information about the channel and the
    speaker remains.
  • However, most of the undesired variation is
    removed as demonstrated by the improvements
    obtained using this technique

22
Speaker compensated eigenvoices
  • First approach
  • Estimating the principal directions of the GMM
    supervectors of all the training segments before
    inter-speaker nuisance compensation would produce
    a set of language independent, universal
    eigenvoices.
  • After nuisance removal, however, the speaker
    contribution to the principal components is
    reduced to the benefit of language
    discrimination.

23
Eigen-language space
  • Second approach
  • Computing the differences between the GMM
    supervectors obtained from utterances of a
    polyglot speaker would compensate the speaker
    characteristics and would enhance the acoustic
    components of a language with respect to the
    others.
  • We do not have labeled databases including
    polyglot speakers, so we instead
  • compute and collect the differences between GMM
    supervectors produced by utterances of speakers
    of two different languages, irrespective of the
    speaker identity, already compensated in the
    feature domain

24
Eigen-language space
  • The number of these differences would grow with
    the square of the number of utterances in the
    training set.
  • Instead, we perform Principal Component Analysis
    on the set of differences between the supervectors
    of a language and the average supervector of
    every other language.
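The procedure above can be sketched like this (a minimal rendering under my own naming assumptions; eigen_language_directions is a hypothetical name, and PCA is done via SVD of the centered difference set):

```python
import numpy as np

def eigen_language_directions(supervectors, labels, n_dirs):
    """For each language pair (a, b), a != b, collect the differences
    between language a's supervectors and language b's mean supervector,
    then return the top principal directions of that difference set."""
    X = np.asarray(supervectors, dtype=float)
    labels = np.asarray(labels)
    diffs = []
    for a in np.unique(labels):
        for b in np.unique(labels):
            if a == b:
                continue
            diffs.append(X[labels == a] - X[labels == b].mean(axis=0))
    D = np.vstack(diffs)
    D = D - D.mean(axis=0)                      # center before PCA
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:n_dirs]                          # n_dirs x supervector_dim basis

# Two toy languages whose supervectors vary only in the first two dims
X = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
dirs = eigen_language_directions(X, ["a", "a", "b", "b"], n_dirs=1)
```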

25
Training corpora
  • The same corpora used for the LRE07 evaluation:
  • All data of the 12 languages in the CallFriend
    corpus
  • Half of the NIST LRE07 development corpus
  • Half of the OHSU corpus provided by NIST for
    LRE05
  • The Russian through Switched Telephone Network
    (RuSTeN) corpus
  • Automatic segmentation

26
Eigenvalues of two language subspaces
The language subspace has higher eigenvalues, and
both curves show a sharp decrease for their first
13 eigenvalues, corresponding to the main
language discrimination directions.
27
LRE07 30s closed set test
The language-factor minDCF is always better and
more stable
28
Pushed GMMs (MIT-LL)
29
Pushed eigen-language GMMs
The same approach is used to obtain discriminative
GMMs from the language factors
30
Min DCFs and (EER)
Models 30s 10s 3s
GMM-SVM (KL kernel) 0.029 (3.43) 0.085 (9.12) 0.201 (21.3)
GMM-SVM (Identity kernel) 0.031 (3.72) 0.087 (9.51) 0.200 (21.0)
LF-SVM (KL kernel) 0.026 (3.13) 0.083 (9.02) 0.186 (20.4)
LF-SVM (Identity kernel) 0.026 (3.11) 0.083 (9.13) 0.187 (20.4)
Discriminative GMMs 0.021 (2.56) 0.069 (7.49) 0.174 (18.45)
LF-Discriminative GMMs (KL kernel) 0.025 (2.97) 0.084 (9.04) 0.186 (19.9)
LF-Discriminative GMMs (Identity kernel) 0.025 (3.05) 0.084 (9.05) 0.186 (20.0)
31
Loquendo-Polito LRE09 System
Model Training
32
Phonetic models
  • Output layer
    • 700-1000 states for the language-dependent
      phonetic units
    • 23-47 stationary units
  • ASR recognizer
    • phone-loop grammar with diphone transition
      constraints

33
Phone transcribers
  • ASR recognizer
    • phone-loop grammar with diphone transition
      constraints
  • 12 phone transcribers, for
    • French, German, Greek, Italian, Polish,
      Portuguese, Russian, Spanish, Swedish, Turkish,
      UK and US English
  • The statistics of the n-gram phone occurrences
    are collected from the best decoded string of each
    conversation segment
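Collecting n-gram statistics from a 1-best phone string can be sketched as follows (a minimal illustration; ngram_counts is a hypothetical name):

```python
from collections import Counter

def ngram_counts(phones, max_n=3):
    """Count phone n-grams (n = 1..max_n) in a 1-best phone transcription."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    return counts

counts = ngram_counts(["a", "b", "a"], max_n=2)
```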

34
Phone transcribers
  • ANN models
    • same phone-loop grammar, different engine
  • 10 phone transcribers, for
    • Catalan, French, German, Greek, Italian, Polish,
      Portuguese, Russian, Spanish, Swedish, Turkish,
      UK and US English
  • The statistics of the n-gram phone occurrences
    are collected from the expected counts over the
    lattice of each conversation segment

35
Multigrams
  • Two different TFLLR kernels
    • trigrams
    • pruned multigrams
  • Multigrams can provide useful information about
    the language by capturing word parts within the
    phone string sequences

36
Pruned Multigrams
For each phonetic transcriber, we discard all the
n-grams appearing in the training set fewer than
0.05 times the average occurrence of the unigrams.

Total number of n-grams for the 12 language transcribers:
N-gram   1     2      3       4       5      6
Pruned   461   11477  120200  114396  10738  443
37
Scoring
  • The total number of models that we use for
    scoring an unknown segment is 34:
  • 11 channel-dependent models (11 x 2)
  • 12 single-channel models (2 telephone-only and 10
    broadcast-only models)
  • 23 x 2 MMIE GMMs (channel-independent, but
    gender-dependent M/F)

38
Calibration and fusion
Multi-class FoCal, taking the max of the
channel-dependent scores
39
Language pair recognition
  • For the language-pair evaluation only the
    back-ends have been re-trained, keeping the models
    of all the sub-systems unchanged.

40
Telephone development corpora
  • CALLFRIEND - Conversations split into slices of
    150s
  • NIST 2003 and NIST 2005
  • LRE07 development corpus
  • Cantonese and Portuguese data in the 22 Language
    OGI corpus
  • RuSTeN -The Russian through Switched Telephone
    Network corpus

41
Broadcast development corpora
  • Incrementally created to include, as far as
    possible, the variability within a language due to
    channel, gender and speaker differences
  • The development data, further split into training,
    calibration and test subsets, should cover the
    above variability

42
Problems with LRE09 dev data
  • Many segments share the same speaker
  • Scarcity of segments for some languages after
    filtering same-speaker segments
  • Genders are not balanced
  • Excluding French, the segments of each language
    are either telephone-only or broadcast-only.
  • No audited data available for Hindi, Russian,
    Spanish and Urdu on VOA3; only automatic
    segmentation was provided
  • No segmentation was provided in the first release
    of the development data for Cantonese, Korean,
    Mandarin, and Vietnamese
  • For these 8 missing languages only the language
    hypotheses provided by BUT were available for the
    VOA2 data.

43
Additional audited data
  • For the 8 languages lacking broadcast data,
    segments have been generated by accessing the VOA
    site and retrieving the original MP3 files
  • Goal: collect 300 broadcast segments per
    language, processed to detect narrowband
    fragments
  • The candidates were checked to eliminate segments
    including music, bad channel distortions, and
    fragments of other languages

44
Development data for bootstrap models
  • The segments were distributed to the training,
    calibration and test sets so that same-speaker
    segments were included in the same set.
  • A set of acoustic (pushed-GMM) bootstrap models
    has been trained

45
Additional not-audited data from VOA3
  • Preliminary tests with the bootstrap models
    indicated the need for additional data
  • Selected from VOA3 to include new speakers in the
    training, calibration and test sets
  • assuming that the file label correctly identifies
    the corresponding language

46
Speaker selection
  • Performed by means of a speaker recognizer
  • We process the audited segments before the others
  • A new speaker model is added to the current set
    of speaker models whenever the best recognition
    score obtained by a segment is less than a
    threshold
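The open-set selection described above can be sketched as a greedy loop (my own minimal rendering under the stated rule; select_speakers is a hypothetical name, and cosine scoring on speaker embeddings stands in for the actual speaker recognizer):

```python
import numpy as np

def select_speakers(embeddings, threshold):
    """Greedy open-set speaker selection: a segment spawns a new speaker
    model when its best cosine score against the current models falls
    below the threshold; otherwise it is assigned to the best model."""
    models, assignment = [], []
    for e in np.asarray(embeddings, dtype=float):
        e = e / np.linalg.norm(e)
        scores = [float(e @ m) for m in models]
        if not scores or max(scores) < threshold:
            models.append(e)                      # new speaker model
            assignment.append(len(models) - 1)
        else:
            assignment.append(int(np.argmax(scores)))
    return assignment, len(models)

# Two near-identical segments plus one orthogonal one -> 2 speakers
assignment, n_speakers = select_speakers(
    [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]], threshold=0.9)
```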

47
Additional not-audited data from VOA2
  • Enriching the training set
  • Language recognition has been performed using a
    system combining the acoustic bootstrap models
    and a phonetic system
  • A segment has been selected only if
  • the 1-best language hypothesis of our system had
    a score greater than a given (rather high)
    threshold, and
  • it matched the 1-best hypothesis provided by the
    BUT system
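The two-condition filter above amounts to a one-line rule (sketch only; select_voa2 and its argument names are mine):

```python
def select_voa2(own_best_lang, own_best_score, but_best_lang, threshold):
    """Keep a VOA2 segment only if our 1-best score clears the (high)
    threshold AND our 1-best hypothesis agrees with BUT's 1-best."""
    return own_best_score > threshold and own_best_lang == but_best_lang
```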

48
Total number of segments for this evaluation
Set             voa3_A  voa2_A  ftp_C  voa3_S  voa2_S  ftp_S
Train           529     116     316    1955    590     66
Extended train  114     22      65     2483    574     151
Development     396     85      329    1866    449     45
  • Suffixes: A = audited, C = checked, S = automatic
    segmentation
  • ftp: ftp://8475.ftp.storage.akadns.net/mp3/voa/

49
Hausa - Decision Cost Function (DCF)
50
Hindi - Decision Cost Function (DCF)
51
Results on the development set
Test on              Pushed GMMs  MMIE GMMs  3-grams  Multi-grams  Lattice  Fusion
Broadcast+telephone  1.48         1.70       1.09     1.12         1.06     0.86
Broadcast subset     1.54         1.69       1.24     1.26         1.14     0.91
Telephone subset     2.00         2.51       1.45     1.49         1.42     1.21
Average minDCF x 100 on 30s test segments
52
Korean - score cumulative distribution
[Figure: cumulative score distributions for the t-t, t-b and b-t channel conditions]