Language Identification

Language Identification. Oldrich Plchot, Pavel Matejka, Speech_at_FIT, Brno University of Technology, Czech Republic, matejkap_at_fit.vutbr.cz. IKR Brno

Slides: 21

Provided by: JanCer8

1
Language Identification

Oldrich Plchot, Pavel Matejka Speech_at_FIT, Brno
University of Technology, Czech Republic
matejkap_at_fit.vutbr.cz

IKR Brno 2012
2
Outline

Why do we need LID?
Evaluations
Acoustic LID
Phonotactic LID
Fusion
Conclusion

3
Why do we need language identification?

1) Route phone calls to human operators.

Emergency (112, 155, 911)
Call centers
Fire brigade (150)
Police (158)
4
Why do we need language identification?

2) Pre-select suitable recognition system.

Language Identification selects the downstream system:
KWS (CHN)
Translate (SPA)
Translate (CZE)
Translate (VIE)
Speech2Text (ENG)
Connect
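
The pre-selection step on this slide can be sketched as a lookup from the identified language to a downstream system. The language codes and system names come from the slide; the `route` function and the use of "Connect" as the fallback (assumed here to mean connecting the call to a human operator) are hypothetical and only illustrate the idea.

```python
# Hypothetical routing table: language code -> downstream system (names from the slide).
SYSTEMS = {
    "CHN": "KWS",
    "SPA": "Translate",
    "CZE": "Translate",
    "VIE": "Translate",
    "ENG": "Speech2Text",
}


def route(language, default="Connect"):
    """Return the system to run for the language hypothesised by the LID front end.

    Languages without a dedicated system fall back to `default`
    (an assumption of this sketch, not something the slide states).
    """
    return SYSTEMS.get(language, default)


if __name__ == "__main__":
    for lang in ("ENG", "CZE", "XXX"):
        print(lang, "->", route(lang))
```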
5
Why do we need language identification?

3) Security applications to narrow search space.

6
Two main approaches to LID

Acoustic: Gaussian Mixture Model

Phonotactic: Phoneme Recognition followed by
Language Model

7
Acoustic approach

Gaussian Mixture Model

+ good for short speech segments and dialect
recognition
- relies on the sounds
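
A minimal sketch of how a GMM-based acoustic system could score an utterance: one GMM per language, frames scored independently, and the language with the highest average frame log-likelihood wins. Diagonal covariances, the function names, and the toy two-dimensional models are assumptions made for illustration, not the configuration used in the presentation.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def gmm_frame_loglik(x, gmm):
    """Log-likelihood of one frame; gmm is a list of (weight, mean, var)."""
    return log_sum_exp(
        [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    )


def utterance_score(frames, gmm):
    """Average per-frame log-likelihood (frames treated as independent)."""
    return sum(gmm_frame_loglik(x, gmm) for x in frames) / len(frames)


def identify(frames, models):
    """Pick the language whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda lang: utterance_score(frames, models[lang]))


# Toy models with two components each (made-up numbers).
MODELS = {
    "lang_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "lang_b": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}

if __name__ == "__main__":
    test_frames = [[4.2, 3.9], [5.1, 4.8], [4.6, 4.4]]
    print(identify(test_frames, MODELS))
```

Real systems of the kind listed later in the results use hundreds or thousands of Gaussians per language; the structure of the scoring stays the same.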

8
Spectral features - MFCC
Frame length 20 ms, frame shift 10 ms
Short-time FFT -> Mel filter bank -> Log() -> Discrete Cosine Transform
Example feature vectors:
-12.8 -0.3 -5.7 -22.4 8.9 6.8
-11.2 0.4 -4.7 -13.0 2.3 4.5
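
The chain on this slide can be sketched in plain Python as follows. The 8 kHz sample rate, the Hamming window, the 15 mel filters, and keeping 7 coefficients (including the zeroth) are assumptions of this sketch; the naive DFT is only there to keep the example free of dependencies and would be replaced by an FFT library in practice.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but fine for a demo)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [b * (sample_rate / 2.0) / (n_bins - 1) for b in range(n_bins)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(signal, sample_rate=8000, frame_ms=20, shift_ms=10, n_filters=15, n_ceps=7):
    """Return one list of n_ceps cepstral coefficients per 20 ms frame."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    window = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * t / (frame_len - 1))
        for t in range(frame_len)
    ]
    bank = mel_filterbank(n_filters, frame_len // 2 + 1, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        spec = power_spectrum(frame)                      # short-time spectrum
        log_energies = [
            math.log(sum(w * s for w, s in zip(row, spec)) + 1e-10)
            for row in bank                               # mel filter bank + log
        ]
        ceps = [
            sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for c in range(n_ceps)                        # DCT-II
        ]
        features.append(ceps)
    return features


if __name__ == "__main__":
    tone = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(800)]
    feats = mfcc(tone)
    print(len(feats), "frames of", len(feats[0]), "coefficients")
```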
9
Shifted delta cepstra

Shifted Delta Cepstra capture information about
the evolution of speech around the current
frame (0.1 sec)
Size of the final feature vector: 7 MFCC + 7 x 7
SDC = 56
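
A minimal sketch of how the 56-dimensional vector can be assembled. The slide gives only the 7 static coefficients and the 7 delta blocks; the delta spread `d=1`, the block shift `p=3`, the forward-shifted block layout, and the clamping at utterance edges are assumptions taken from the commonly used 7-1-3-7 configuration.

```python
def sdc(cepstra, d=1, p=3, k=7):
    """Append shifted delta cepstra to each frame.

    cepstra: list of frames, each a list of N static coefficients.
    Block i of frame t is c[t + i*p + d] - c[t + i*p - d]; indices that
    fall outside the utterance are clamped to the first or last frame.
    """
    last = len(cepstra) - 1

    def frame(idx):
        return cepstra[min(max(idx, 0), last)]

    out = []
    for t, static in enumerate(cepstra):
        vec = list(static)                      # 7 static MFCC
        for i in range(k):                      # 7 delta blocks of 7 values
            ahead = frame(t + i * p + d)
            behind = frame(t + i * p - d)
            vec.extend(a - b for a, b in zip(ahead, behind))
        out.append(vec)
    return out


if __name__ == "__main__":
    # 30 dummy frames of 7 coefficients each.
    frames = [[float(t + c) for c in range(7)] for t in range(30)]
    feats = sdc(frames)
    print(len(feats), len(feats[0]))  # 30 56
```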

10
Acoustic systems - GMM based

Maximum likelihood (generative)
  Objective function to maximize is the likelihood of the training data given the transcription
Maximum Mutual Information (discriminative)
  Objective function to maximize is the posterior probability of all training utterances being correctly recognized
Advantages of discriminative training
  Lower error rates
  Fewer parameters
Disadvantages of discriminative training
  Overtraining
  Sometimes computationally expensive
Channel compensation: see the previous presentation
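
The two objective functions can be written out as below. The notation is an assumption of this note and does not appear on the slides: O_r is the feature sequence of training utterance r, l_r is its language label, lambda stands for the model parameters, and P(l) is a language prior.

```latex
\begin{align}
F_{\mathrm{ML}}(\lambda)  &= \sum_{r=1}^{R} \log p_{\lambda}(O_r \mid l_r) \\
F_{\mathrm{MMI}}(\lambda) &= \sum_{r=1}^{R} \log
  \frac{p_{\lambda}(O_r \mid l_r)\, P(l_r)}
       {\sum_{l} p_{\lambda}(O_r \mid l)\, P(l)}
\end{align}
```

The MMI denominator sums over all competing languages, so raising the objective means both fitting the correct model and pushing the competing models away, which is where the discriminative behaviour comes from.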

11
Highly overlapped distributions
12
Results on LRE 2007 (14 languages)
System / Equal Error Rate [%]              30 sec   10 sec    3 sec
GMM2048 ML                                   8.03    12.89    21.77
GMM 256 ML                                  16
GMM256 MMI (15 MMI iterations)               4.15     8.61    18.43
GMM256-MMI-chcf (3 MMI iterations)           3.73     9.81    20.98

System / Equal Error Rate [%]              30 sec   10 sec    3 sec
GMM2048                                      8.03    12.89    21.77
GMM2048-eigchan                              2.76     7.38    17.14
GMM2048chcf                                  2.94     7.40    17.93
GMM2048-MMI-chcf (3 MMI iterations)          2.41     7.02    16.90

The best acoustic system combines
Many Gaussians
Eigen-channel compensation of features
MMI
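
The tables report Equal Error Rate, the operating point where the miss rate equals the false-alarm rate. A minimal sketch of one way to estimate it from detection scores is given here; the function name, the threshold sweep, and the toy scores are illustrative assumptions, and official evaluation tools interpolate more carefully.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER as a fraction by sweeping the decision threshold.

    A trial is accepted when its score is >= threshold. The EER is taken
    at the threshold where miss rate and false-alarm rate are closest,
    and reported as their average.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)   # threshold that rejects everything
    best_gap, best_eer = None, None
    for th in thresholds:
        miss = sum(s < th for s in target_scores) / len(target_scores)
        fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + fa) / 2.0
    return best_eer


if __name__ == "__main__":
    targets = [2.0, 1.5, 0.3, 1.1]        # scores of trials in the target language
    nontargets = [0.1, 0.4, -0.5, 1.2]    # scores of trials in other languages
    print(100.0 * equal_error_rate(targets, nontargets), "%")
```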

13
Phonotactic approach

Phoneme Recognition followed by Language Model
(PRLM)

good for longer speech segments
robust against dialects in one language
eliminates speech characteristics of speaker's
native language

14
Phone recognizer

Investigation of different phone recognizers for
LID: a better phone recognizer gives a better LID system

3 neural networks produce the phone posterior
probabilities
310 ms long temporal trajectory around the current
frame
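
A minimal sketch of how a 310 ms trajectory could be assembled as network input. It assumes the 10 ms frame shift from the feature extraction slide, so 310 ms corresponds to 31 frames; the function name and the edge padding are hypothetical, and the neural networks themselves are not shown.

```python
def stack_context(frames, context_ms=310, shift_ms=10):
    """For every frame, concatenate the frames in a window centred on it.

    With a 10 ms frame shift (assumed), 310 ms corresponds to 31 frames:
    15 before, the current one, and 15 after. Edges are padded by
    repeating the first or last frame.
    """
    n_ctx = context_ms // shift_ms    # 31 frames
    half = n_ctx // 2                 # 15 on each side
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-half, half + 1):
            window.extend(frames[min(max(t + offset, 0), last)])
        stacked.append(window)
    return stacked


if __name__ == "__main__":
    # 50 dummy frames with 3 features each.
    frames = [[float(t)] * 3 for t in range(50)]
    stacked = stack_context(frames)
    print(len(stacked), len(stacked[0]))  # 50 93
```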

15
Phone recognition output
One best phone string
16
Phonotactic modeling - example
Trigram    German   English   Test
u n d          25         1      5
a n d           3        32      0
t h e           0        13      1
. . .           .         .      .

N-gram language models: discounting, backoff
Support Vector Machines: vectors with counts
PCA, LDA
Neural Networks
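
The count example can be turned into a small scoring sketch. The trigram counts are the ones shown on the slide; everything else is a simplification made for illustration: add-one smoothing stands in for proper discounting and backoff, and the model is a plain distribution over the three listed trigrams rather than a conditional n-gram language model.

```python
import math

# Trigram counts from the slide (only the three trigrams shown there).
GERMAN = {"u n d": 25, "a n d": 3, "t h e": 0}
ENGLISH = {"u n d": 1, "a n d": 32, "t h e": 13}
TEST = {"u n d": 5, "a n d": 0, "t h e": 1}


def trigram_logprobs(counts, alpha=1.0):
    """Add-alpha smoothed log probabilities over the known trigrams."""
    total = sum(counts.values()) + alpha * len(counts)
    return {g: math.log((c + alpha) / total) for g, c in counts.items()}


def score(test_counts, model_counts):
    """Log-likelihood of the test trigram counts under one language model."""
    logp = trigram_logprobs(model_counts)
    return sum(n * logp[g] for g, n in test_counts.items())


if __name__ == "__main__":
    scores = {"German": score(TEST, GERMAN), "English": score(TEST, ENGLISH)}
    print(max(scores, key=scores.get))  # German
```

The test segment contains "u n d" five times, which is frequent in the German counts and rare in the English ones, so the German model gives the higher log-likelihood.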

17
Phone recognition output
One best phone string
Phone lattice
18
Results on LRE 2007 (14 languages)
System / Equal Error Rate [%]              30 sec   10 sec    3 sec
HU_LM string (4-gram)                        6.35    13.86    27.12
HU_LM                                        5.54    11.75    23.54
HU_SVM-3gram-counts                          5.41    13.26    26.92

Conclusion
Build as good a phone recognizer as you can
Gather as much data for each language as you can
Different approaches to modeling the counts do not
seem to have a big influence on the results

19
Fusion - LRE 2007 (14 languages)
System / Equal Error Rate [%]                      30 sec   10 sec    3 sec
Acoustic - GMM2048-MMI-chcf (3 MMI iterations)       2.41     7.02    16.90
Phonotactic - EN_TREE                                3.54    10.68    22.66
Phonotactic - HU_TREE_A3E7M5S3G3_LFA                 4.52    10.35    23.66
Fusion - the best 3 systems                          1.28     4.63    13.53

Note:
Fusion weights have to be trained on a separate set
of files that is as close as possible to the target
data
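
A minimal sketch of score-level fusion with weights trained on a held-out development set. The slides do not say which fusion method was used; linear fusion trained by logistic regression is a common choice and is an assumption here, as are the function names and the toy scores.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(scores, w, b):
    """Fused score for one trial: weighted sum of the system scores plus bias."""
    return sum(wi * si for wi, si in zip(w, scores)) + b


def train_fusion(dev_scores, dev_labels, epochs=2000, lr=0.1):
    """Fit weights w and bias b so that sigmoid(fuse(s)) predicts the label.

    dev_scores: per-trial score vectors, one score per system.
    dev_labels: 1 for target trials, 0 for non-target trials.
    Plain batch gradient descent on the logistic-regression loss.
    """
    n_sys = len(dev_scores[0])
    n = len(dev_scores)
    w, b = [0.0] * n_sys, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_sys, 0.0
        for s, y in zip(dev_scores, dev_labels):
            err = sigmoid(fuse(s, w, b)) - y
            grad_b += err
            for i, si in enumerate(s):
                grad_w[i] += err * si
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    # Toy development set: [acoustic score, phonotactic score] per trial.
    dev = [[2.0, 1.5], [1.0, 2.5], [1.8, 0.9],
           [-1.0, -0.5], [-2.0, 0.3], [-0.4, -1.6]]
    labels = [1, 1, 1, 0, 0, 0]
    w, b = train_fusion(dev, labels)
    print("weights:", w, "bias:", b)
    print("fused score of a new trial:", fuse([1.2, 0.7], w, b))
```

The development trials used for training the weights must not overlap with the evaluation data, in line with the note on this slide.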

20
Thanks for your attention and I hope you enjoyed
it :)
