Language Identification

1
Language Identification
  • Oldřich Plchot, Pavel Matějka, Speech@FIT, Brno
    University of Technology, Czech Republic
  • matejkap@fit.vutbr.cz

IKR Brno 2012
2
Outline
  • Why do we need LID?
  • Evaluations
  • Acoustic LID
  • Phonotactic LID
  • Fusion
  • Conclusion

3
Why do we need language identification?
  • 1) Route phone calls to human operators.

Emergency (112,155,911)
Call centers
Fire brigade (150)
Police (158)
4
Why do we need language identification?
  • 2) Pre-select suitable recognition system.

(Diagram: Language Identification routes the input to the matching system, e.g. KWS CHN, Translate SPA / CZE / VIE, Speech2Text ENG, or Connect)
5
Why do we need language identification?
  • 3) Security applications to narrow search space.

6
Two main approaches to LID
  • Acoustic: Gaussian Mixture Model (GMM)
  • Phonotactic: Phoneme Recognition followed by
    Language Model (PRLM)

7
Acoustic approach
  • Gaussian Mixture Model
  • (+) good for short speech segments and dialect
    recognition
  • (-) relies on the sounds

8
Spectral features - MFCC
  • 20 ms window, 10 ms frame shift
  • Processing chain: short-time FFT → Mel filter bank →
    Log() → Discrete Cosine Transform
  • Output: one vector of cepstral coefficients per frame
    (e.g. -12.8 -0.3 -5.7 -22.4 8.9 6.8)
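
A minimal Python sketch of the MFCC chain above (framing, short-time FFT, mel filter bank, log, DCT). The 20 ms window and 10 ms shift follow the slide; the sampling rate, FFT size and number of mel filters are illustrative assumptions, not values from the original system.

import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=8000, win_ms=20, shift_ms=10, n_mels=24, n_ceps=7):
    """Minimal MFCC extraction: framing -> FFT -> mel filter bank -> log -> DCT."""
    win, shift, n_fft = int(sr * win_ms / 1000), int(sr * shift_ms / 1000), 256
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Triangular mel filter bank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = []
    for start in range(0, len(signal) - win + 1, shift):
        frame = signal[start:start + win] * np.hamming(win)
        spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # short-time FFT
        logmel = np.log(fbank @ spec + 1e-10)                # mel filter bank + log
        feats.append(dct(logmel, norm='ortho')[:n_ceps])     # DCT -> cepstral coefficients
    return np.array(feats)                                   # shape: (n_frames, n_ceps)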
9
Shifted delta cepstra
  • Shifted Delta Cepstra capture information about the
    evolution of speech around the current frame
    (about 0.1 s); a minimal sketch follows
  • Size of the final feature vector: 7 MFCC + 7×7 SDC = 56
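
A minimal sketch of the Shifted Delta Cepstra computation. The common N-d-P-k = 7-1-3-7 configuration is assumed here because it yields the 7 + 7×7 = 56 dimensions mentioned above; the slide does not spell out the exact configuration.

import numpy as np

def sdc(ceps, N=7, d=1, P=3, k=7):
    """Shifted Delta Cepstra (N-d-P-k): for each frame t, stack the k delta
    vectors c[t + i*P + d] - c[t + i*P - d], i = 0..k-1, over the first N
    cepstral coefficients, and prepend the static MFCCs."""
    T = len(ceps)
    deltas = []
    for t in range(T):
        blocks = []
        for i in range(k):
            hi = min(t + i * P + d, T - 1)   # clamp indices at utterance edges
            lo = max(t + i * P - d, 0)
            blocks.append(ceps[hi, :N] - ceps[lo, :N])
        deltas.append(np.concatenate(blocks))
    return np.hstack([ceps[:, :N], np.array(deltas)])  # 7 MFCC + 7*7 SDC = 56 dims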

10
Acoustic systems - GMM based
  • Maximum likelihood (generative, sketched below)
    • objective function to maximize is the likelihood
      of the training data given the transcription
  • Maximum Mutual Information (discriminative)
    • objective function to maximize is the posterior
      probability of all training utterances being
      correctly recognized
  • Advantages of using discriminative training
    • lower error rates
    • fewer parameters
  • Disadvantages of discriminative training
    • overtraining
    • sometimes computationally expensive
  • Channel compensation (see previous presentation)
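
A minimal sketch of the maximum-likelihood baseline only: one GMM per language trained by EM on pooled feature vectors and scored by the average frame log-likelihood. MMI training and channel compensation are not shown; scikit-learn, the tiny model size and the random data are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ml_gmms(features_per_language, n_components=16):
    """Fit one diagonal-covariance GMM per language by maximum likelihood (EM)."""
    return {lang: GaussianMixture(n_components=n_components,
                                  covariance_type='diag').fit(np.vstack(feats))
            for lang, feats in features_per_language.items()}

def classify(gmms, utterance_feats):
    """Score an utterance by its average frame log-likelihood under each language GMM."""
    scores = {lang: gmm.score(utterance_feats) for lang, gmm in gmms.items()}
    return max(scores, key=scores.get), scores

# Toy usage with random 56-dimensional "SDC" vectors standing in for real features.
rng = np.random.default_rng(0)
train = {'CZE': [rng.normal(0.0, 1.0, (500, 56))],
         'ENG': [rng.normal(0.5, 1.0, (500, 56))]}
gmms = train_ml_gmms(train)
best_lang, scores = classify(gmms, rng.normal(0.5, 1.0, (300, 56)))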

11
Highly overlapped distributions
12
Results on LRE 2007 (14 languages)
System / Equal Error Rate [%]           30 sec   10 sec   3 sec
GMM2048 ML                                8.03    12.89   21.77
GMM256 ML                                16
GMM256 MMI (15 MMI iterations)            4.15     8.61   18.43
GMM256-MMI-chcf (3 MMI iterations)        3.73     9.81   20.98

System / Equal Error Rate [%]           30 sec   10 sec   3 sec
GMM2048                                   8.03    12.89   21.77
GMM2048-eigchan                           2.76     7.38   17.14
GMM2048-chcf                              2.94     7.40   17.93
GMM2048-MMI-chcf (3 MMI iterations)       2.41     7.02   16.90
  • The best acoustic system combines
  • Many Gaussians
  • Eigen-channel compensation of features
  • MMI

13
Phonotactic approach
  • Phoneme Recognition followed by Language Model
    (PRLM)
  • good for longer speech segments
  • robust against dialects in one language
  • eliminates speech characteristics of the speaker's
    native language

14
Phone recognizer
  • Investigation of different phone recognizers for
    LID → a better phone recognizer gives a better LID
    system
  • 3 neural networks produce the phone posterior
    probabilities
  • 310 ms long temporal trajectory around the current
    frame (a simplified sketch follows)
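
This is not the recognizer architecture used at BUT, only a sketch of the idea: stack roughly 310 ms of frames (31 frames at a 10 ms shift) around the current frame and map them through a neural network with a softmax over the phone set. The layer sizes, phone-set size and random weights are placeholders.

import numpy as np

N_PHONES = 45    # size of the phone set (illustrative)
CONTEXT = 15     # 15 frames left + current + 15 right ~ 310 ms at a 10 ms shift
FEAT_DIM = 24    # per-frame features, e.g. log mel filter bank outputs (illustrative)

def stack_context(frames, c=CONTEXT):
    """Stack +/- c neighbouring frames around each frame (edges are padded)."""
    padded = np.pad(frames, ((c, c), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * c + 1)])

def phone_posteriors(frames, W1, b1, W2, b2):
    """One hidden layer + softmax: per-frame posterior probability of each phone."""
    h = np.tanh(stack_context(frames) @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Random placeholder weights; a real recognizer is trained on phone-labelled speech.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, ((2 * CONTEXT + 1) * FEAT_DIM, 500)); b1 = np.zeros(500)
W2 = rng.normal(0, 0.01, (500, N_PHONES)); b2 = np.zeros(N_PHONES)
posteriors = phone_posteriors(rng.normal(0, 1, (200, FEAT_DIM)), W1, b1, W2, b2)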

15
Phone recognition output
One best phone string
16
Phonotactic modeling - example
Phone trigram counts in the German model, the English model, and a test utterance:

  Trigram     German   English   Test
  u n d         25        1        5
  a n d          3       32        0
  t h e          0       13        1
  ...           ...      ...      ...
  • N-gram language models (discounting, backoff)
  • Support Vector Machines (vectors with n-gram counts,
    toy sketch below)
  • PCA / LDA
  • Neural Networks
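
A toy sketch of the "vectors with counts" idea: treat each 1-best phone string as a document of phone tokens, build phone n-gram count vectors, and train a linear SVM on them. The phone strings, labels and scikit-learn pipeline below are illustrative assumptions, not data from the evaluation.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy 1-best phone strings (space-separated phones) with language labels.
phone_strings = ['u n d d a s u n d', 'd a s u n d d e r',
                 'a n d t h e a n d', 't h e c a t a n d']
labels = ['GER', 'GER', 'ENG', 'ENG']

# Phone 1- to 3-gram counts feeding a linear SVM.
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\S+', ngram_range=(1, 3))
model = make_pipeline(vectorizer, LinearSVC())
model.fit(phone_strings, labels)
print(model.predict(['d e r u n d d a s']))   # expected: ['GER'] on this toy data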

17
Phone recognition output
One best phone string
Phone lattice
18
Results on LRE 2007 (14 languages)
System / Equal Error Rate [%]      30 sec   10 sec   3 sec
HU_LM string (4-gram)                6.35    13.86   27.12
HU_LM                                5.54    11.75   23.54
HU_SVM-3gram-counts                  5.41    13.26   26.92
  • Conclusion
  • Build as good a phone recognizer as you can
  • Gather as much data for each language as you can
  • Different approaches to modeling the counts do not
    seem to have a big influence on the results

19
Fusion - LRE 2007 (14 languages)
System / Equal Error Rate [%]                         30 sec   10 sec   3 sec
Acoustic - GMM2048-MMI-chcf (3 MMI iterations)          2.41     7.02   16.90
Phonotactic - EN_TREE                                   3.54    10.68   22.66
Phonotactic - HU_TREE_A3E7M5S3G3_LFA                    4.52    10.35   23.66
Fusion of the best 3 systems                            1.28     4.63   13.53
  • Note
  • Fusion weights have to be trained on a separate set
    of files that is as close as possible to the target
    data (a minimal sketch follows)
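
A minimal sketch of score fusion: concatenate the per-language scores of the individual systems, train a multinomial logistic regression on a held-out development set as the note above requires, then apply it to the evaluation scores. Logistic regression is a stand-in here; the exact fusion/calibration toolkit used in the evaluation is not shown, and the data shapes are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse(dev_scores, dev_labels, eval_scores):
    """Train fusion weights on a development set and apply them to evaluation scores.
    dev_scores / eval_scores: one (n_utterances, n_languages) score matrix per system."""
    fuser = LogisticRegression(max_iter=1000).fit(np.hstack(dev_scores), dev_labels)
    return fuser.predict_proba(np.hstack(eval_scores)), fuser

# Toy usage: 2 systems (e.g. acoustic + phonotactic), 3 languages, random scores.
rng = np.random.default_rng(0)
dev = [rng.normal(0, 1, (100, 3)) for _ in range(2)]
dev_labels = rng.integers(0, 3, 100)
evl = [rng.normal(0, 1, (20, 3)) for _ in range(2)]
posteriors, fuser = fuse(dev, dev_labels, evl)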

20
Thanks for your attention and I hope you enjoyed it :-)