Title: Language Identification
1Language Identification
- Oldrich Plchot, Pavel Matejka
- Speech_at_FIT, Brno University of Technology, Czech Republic
- matejkap_at_fit.vutbr.cz
IKR Brno 2012
2Outline
- Why do we need LID?
- Evaluations
- Acoustic LID
- Phonotactic LID
- Fusion
- Conclusion
3Why do we need language identification?
- 1) Route phone calls to human operators.
  - Emergency (112, 155, 911)
  - Call centers
  - Fireguard (150)
  - Police (158)
4Why do we need language identification?
- 2) Pre-select suitable recognition system.
  - Figure: Language Identification connects the call to one of: KWS CHN, Translate SPA, Translate CZE, Translate VIE, Speech2Text ENG
5Why do we need language identification?
- 3) Security applications to narrow search space.
6Two main approaches to LID
- Acoustic: Gaussian Mixture Model
- Phonotactic: Phoneme Recognition followed by Language Model
7Acoustic approach
- good for short speech segments and dialect recognition
- relies on the sounds
8Spectral features - MFCC
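- Figure: the signal is cut into frames (labels 20ms and 10ms); each frame passes through
  - Short-time FFT
  - Mel filter bank
  - Log()
  - Discrete Cosine Transform
- Example output vectors:
  -12.8 -0.3 -5.7 -22.4 8.9 6.8
  -11.2 0.4 -4.7 -13.0 2.3 4.5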
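The chain on this slide (short-time FFT, mel filter bank, log, DCT) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' front end: the 20ms and 10ms labels are read as window length and frame shift, the 8 kHz sample rate, Hamming window, 20 mel filters and 7 output coefficients are assumed values, and all function names are hypothetical.

```python
import cmath
import math

def split_into_frames(signal, sample_rate, win_ms=20, shift_ms=10):
    """Cut the signal into overlapping frames (assumed: 20 ms window, 10 ms shift)."""
    win = sample_rate * win_ms // 1000
    shift = sample_rate * shift_ms // 1000
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, shift)]

def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 for bins 0..N/2 (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2)
    return spectrum

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, frame_len, sample_rate):
    """Triangular filters with edges equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(frame_len // 2 + 1):
            f = k * sample_rate / frame_len  # centre frequency of DFT bin k
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc_frame(frame, sample_rate, num_filters=20, num_ceps=7):
    """FFT -> mel filter bank -> log -> DCT-II for a single frame."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Floor the energies so that an empty filter cannot produce log(0).
    log_energies = [math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-10))
                    for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for c in range(num_ceps)]

# One second of a 440 Hz tone at an assumed 8 kHz sample rate.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
tone_frames = split_into_frames(tone, sample_rate)
first_vector = mfcc_frame(tone_frames[0], sample_rate)
```

The naive DFT keeps the sketch dependency-free; a real front end would use an FFT library and process all frames at once.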
9Shifted delta cepstra
- Shifted Delta Cepstra (SDC) capture how the speech evolves around the current frame (0.1sec)
- Size of the final feature vector: 7 MFCC + 7 × 7 SDC = 56
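The 56-dimensional vector can be illustrated with a small sketch. The slide only gives the sizes (7 MFCC plus 7 blocks of 7 deltas); the delta spread `d=1` and block shift `p=3` used here are assumptions, as are the function name and the choice to clamp indices at the utterance edges.

```python
def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Append k shifted delta blocks to every static cepstral vector.

    cepstra: list of per-frame coefficient lists (7 values each in the slide).
    d, p:    delta spread and block shift in frames (assumed values).
    k:       number of stacked delta blocks (7 in the slide).
    """
    num_frames = len(cepstra)

    def frame_at(t):
        # Clamp indices so that frames near the edges reuse the edge frame.
        return cepstra[min(max(t, 0), num_frames - 1)]

    features = []
    for t in range(num_frames):
        vector = list(cepstra[t])  # static coefficients first
        for i in range(k):
            ahead = frame_at(t + i * p + d)
            behind = frame_at(t + i * p - d)
            vector.extend(a - b for a, b in zip(ahead, behind))
        features.append(vector)
    return features

# Toy input: 40 frames of 7 coefficients that grow linearly with time.
ramp = [[float(t)] * 7 for t in range(40)]
sdc_features = shifted_delta_cepstra(ramp)
```

With 7 static coefficients and `k=7`, every output vector has 7 + 7 × 7 = 56 elements, matching the size stated on the slide.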
10Acoustic systems GMM based
- Maximum likelihood (generative)
  - Objective function to maximize is the likelihood of the training data given the transcription
- Maximum Mutual Information (discriminative)
  - Objective function to maximize is the posterior probability of all training utterances being correctly recognized
- Advantages of using discriminative training
  - Lower error rates
  - Fewer parameters
- Disadvantages of discriminative training
  - Overtraining
  - Sometimes computationally expensive
- Channel compensation: see the previous presentation
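The two objective functions on this slide can be written out as follows. The notation is a common textbook form rather than something taken from the slides: $O_r$ is the feature sequence of training utterance $r$, $l_r$ its correct language label, $\lambda$ the model parameters and $P(l)$ a language prior; all of these symbols are assumptions.

```latex
\begin{align}
F_{\mathrm{ML}}(\lambda)  &= \sum_{r=1}^{R} \log p_{\lambda}(O_r \mid l_r) \\
F_{\mathrm{MMI}}(\lambda) &= \sum_{r=1}^{R} \log
  \frac{p_{\lambda}(O_r \mid l_r)\, P(l_r)}
       {\sum_{l} p_{\lambda}(O_r \mid l)\, P(l)}
\end{align}
```

The MMI denominator sums over all competing languages, so raising the likelihood of the correct language only helps when it rises relative to the others; this is what makes the criterion discriminative.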
11Highly overlapped distributions
12Results on LRE 2007 (14 languages)
System / Equal Error Rate (%)            30sec   10sec    3sec
GMM2048 ML                                8.03   12.89   21.77
GMM256 ML                                16
GMM256 MMI (15 MMI iterations)            4.15    8.61   18.43
GMM256-MMI-chcf (3 MMI iterations)        3.73    9.81   20.98

System / Equal Error Rate (%)            30sec   10sec    3sec
GMM2048                                   8.03   12.89   21.77
GMM2048-eigchan                           2.76    7.38   17.14
GMM2048chcf                               2.94    7.40   17.93
GMM2048-MMI-chcf (3 MMI iterations)       2.41    7.02   16.90
- The best acoustic system combines
  - Many Gaussians
  - Eigen-channel compensation of features
  - MMI
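The results are reported as Equal Error Rate, the operating point at which the miss rate on target trials equals the false alarm rate on non-target trials. A minimal sketch of how such a number could be obtained from detection scores is shown below; the function name and the toy scores are hypothetical, and official evaluation tools may interpolate between thresholds differently.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Sweep the decision threshold and return the error rate (as a fraction)
    at the point where miss and false alarm rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for threshold in thresholds:
        # A trial is accepted when its score is >= threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = (sum(s >= threshold for s in nontarget_scores)
                       / len(nontarget_scores))
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer

# Toy scores: one target trial scores below one non-target trial.
toy_targets = [2.0, 1.5, 0.4, 3.0]
toy_nontargets = [0.1, 0.5, -1.0, -0.3]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as the tables.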
13Phonotactic approach
- Phoneme Recognition followed by Language Model
(PRLM)
- good for longer speech segments
- robust against dialects in one language
- eliminates speech characteristics of speaker's
native language
14Phone recognizer
- Investigation of different phone recognizers for LID => a better phone recognizer gives a better LID system
- 3 neural networks produce the phone posterior probabilities
- 310 ms long time trajectory around the current frame
15Phone recognition output
One best phone string
16Phonotactic modeling - example
N-gram    German   English   Test
u n d         25         1      5
a n d          3        32      0
t h e          0        13      1
. . .          .         .      .
- N-gram language models: discounting, backoff
- Support Vector Machines: vectors with counts
- PCA, LDA
- Neural Networks
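The count example on this slide can be turned into a tiny scoring sketch. It is deliberately simplified: each language is modeled by add-one smoothed relative frequencies of whole trigrams, which stands in for the discounting and backoff language models listed above and is not a conditional n-gram model. The function names are hypothetical; the counts are the ones from the slide.

```python
import math
from collections import Counter

def ngram_counts(phones, n=3):
    """Count phone n-grams in a one-best phone string (list of phone symbols)."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))

def log_likelihood(test_counts, train_counts, vocabulary):
    """Log-likelihood of the test counts under add-one smoothed relative
    frequencies estimated from one language's training counts."""
    total = sum(train_counts.get(g, 0) for g in vocabulary) + len(vocabulary)
    return sum(n * math.log((train_counts.get(g, 0) + 1) / total)
               for g, n in test_counts.items())

def identify(test_counts, models):
    """Return the best-scoring language and the score of every language."""
    vocabulary = set(test_counts)
    for counts in models.values():
        vocabulary |= set(counts)
    scores = {language: log_likelihood(test_counts, counts, vocabulary)
              for language, counts in models.items()}
    return max(scores, key=scores.get), scores

UND, AND, THE = ("u", "n", "d"), ("a", "n", "d"), ("t", "h", "e")
language_models = {
    "German": {UND: 25, AND: 3, THE: 0},
    "English": {UND: 1, AND: 32, THE: 13},
}
test_utterance = {UND: 5, AND: 0, THE: 1}
best_language, language_scores = identify(test_utterance, language_models)
```

On these counts the test utterance scores higher under the German table, because "u n d" dominates both. In practice `ngram_counts` would be run on the phone recognizer output to produce the count tables.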
17Phone recognition output
One best phone string
Phone lattice
18Results on LRE 2007 (14 languages)
System / Equal Error Rate (%)     30 sec   10 sec    3 sec
HU_LM string (4-gram)               6.35    13.86    27.12
HU_LM                               5.54    11.75    23.54
HU_SVM-3gram-counts                 5.41    13.26    26.92
- Conclusion
  - Build as good a phone recognizer as you can
  - Gather as much data for each language as you can
  - Different approaches to modeling counts do not seem to have a big influence on the results
19Fusion - LRE 2007 (14 languages)
System / Equal Error Rate (%)                          30 sec   10 sec    3 sec
Acoustic - GMM2048-MMI-chcf (3 MMI iterations)           2.41     7.02    16.90
Phonotactic - EN_TREE                                    3.54    10.68    22.66
Phonotactic - HU_TREE_A3E7M5S3G3_LFA                     4.52    10.35    23.66
Fusion - the best 3 systems                              1.28     4.63    13.53
- Note
  - Fusion weights have to be trained on a separate set of files that is as close as possible to the target data
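The slides do not say how the fusion weights are trained. One common choice, assumed here purely for illustration, is a linear combination of the per-system scores whose weights are fitted by logistic regression on a held-out development set. The function names, the toy scores and the training settings below are all hypothetical.

```python
import math

def train_fusion(dev_scores, dev_labels, epochs=500, learning_rate=0.1):
    """Fit fusion weights and a bias by gradient descent on the logistic loss.

    dev_scores: one list of per-system scores for every development trial.
    dev_labels: 1 for target trials, 0 for non-target trials.
    """
    dim = len(dev_scores[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(dev_scores, dev_labels):
            z = sum(w * v for w, v in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        weights = [w - learning_rate * g / len(dev_scores)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(dev_scores)
    return weights, bias

def fuse(scores, weights, bias):
    """Fused score: weighted sum of the individual system scores plus bias."""
    return sum(w * v for w, v in zip(weights, scores)) + bias

# Toy development set: three systems, two target and two non-target trials.
dev_scores = [[2.0, 1.0, 1.5], [1.0, 2.5, 0.5],
              [-1.0, -0.5, -2.0], [-2.0, 0.3, -1.0]]
dev_labels = [1, 1, 0, 0]
fusion_weights, fusion_bias = train_fusion(dev_scores, dev_labels)
fused_target = fuse([1.5, 1.0, 1.0], fusion_weights, fusion_bias)
fused_nontarget = fuse([-1.5, -0.2, -1.0], fusion_weights, fusion_bias)
```

Training on a separate development set, as the note asks, keeps the weights from being tuned to the same files the individual systems were trained on.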
20Thanks for your attention, and I hope you enjoyed it :)