The ICSI Language Recognition Evaluation System

1
The ICSI Language Recognition Evaluation System
  • Christian Müller, cmueller_at_icsi.berkeley.edu

2
PPR-SVM
[Diagram: PPR-SVM pipeline. Labels: background model, training, preprocessing, normalization, svm learn, SVM, data, frontend, test, svm classify, score, backend]
3
Data Preprocessing
  • input: original conversations
  • noise reduction
  • split into conversation sides (2 per conversation) and
    then into individual speech acts
  • concatenate actual training conversations
4
Training Test Data
5
Training Test Data
6
Training Test Data
7
Phone Recognizer Frontends
  • recognizers: SRI EN, ICSI EN, ICSI ARA, ICSI MAN
  • HMM recognizer
  • 3-state HMM for open-loop phone recognizer
  • MFCCs of order 13 plus Δ and ΔΔ
  • trained on Switchboard I and II
  • MLP recognizers
  • fully connected MLP
  • PLP features plus Δ and ΔΔ
  • fast GMM-based estimate for VTL normalization
  • local context of 9 consecutive frames
  • a phone label is determined as the network's output unit
    with the maximal activation
  • MLP configurations
  • pitch features, 15000 hidden units, 870 h of 16 kHz
    broadcast news, 71 phones
  • 20800 hidden units, 2000 h of 8 kHz conversational
    speech, 46 phones
  • 10000 hidden units, 465 h of 16 kHz broadcast news,
    36 phones
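
The decoding rule for the MLP frontends (each frame gets the phone whose output unit has the maximal activation) can be sketched as below. This is only an illustration, not the ICSI code: the function names, the three-phone toy inventory, the activation values, and the step that merges runs of identical frame labels into one token are all assumptions.

```python
from itertools import groupby


def frame_labels(activations, phones):
    """For each frame, pick the phone whose output unit has the maximal activation."""
    labels = []
    for frame in activations:
        best = max(range(len(frame)), key=lambda i: frame[i])
        labels.append(phones[best])
    return labels


def collapse_repeats(labels):
    """Merge runs of identical frame labels into one token (an assumption, not from the slides)."""
    return [label for label, _ in groupby(labels)]


# Toy input: 3 hypothetical phones and 5 frames of made-up MLP outputs.
phones = ["a", "n", "s"]
activations = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.2, 0.7],
]
labels = frame_labels(activations, phones)
tokens = collapse_repeats(labels)
```

The resulting token sequence is the kind of input from which phone n-gram counts could then be collected.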
8
Rank-Normalization
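  • for each feature, create an ordered list of its values
    using background (bg) data
  • normalized value = rank position in list / number of values
  • no occurrence mapped to 0
  • result: uniform value distribution

[Figure: example feature vectors as (feature id, value) pairs, e.g. "0101 0.75 ... 0123 0.4 2317 0.2 ..."; feature 0101 appears with the values 0.75, 0.06, 0.13 and 0.29]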
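
The rank normalization bullets can be read as the following lookup, shown here as a minimal sketch. The function names are hypothetical, and two details are assumptions the slide does not settle: zeros are left out of the ordered list, and a value's rank is the number of background values less than or equal to it.

```python
from bisect import bisect_right


def build_rank_table(background_values):
    """Ordered list of the non-zero values one feature takes in the background data."""
    return sorted(v for v in background_values if v > 0)


def rank_normalize(value, table):
    """Map a raw feature value to (rank position in list) / (number of values)."""
    if value <= 0 or not table:
        return 0.0  # no occurrence is mapped to 0
    # Rank: how many background values are less than or equal to this value.
    return bisect_right(table, value) / len(table)


# Toy values; treating the first four as background data is an assumption.
table = build_rank_table([0.75, 0.06, 0.13, 0.0])
normalized = rank_normalize(0.29, table)
```

Here 0.29 falls above two of the three stored values, so it maps to 2/3. Because every feature is mapped through its own rank table, the normalized values are spread roughly uniformly over (0, 1] on the background data.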
9
Model Building
  • SVM with 2nd-order polynomial kernel (using SVMlight)
  • one against all
  • dimensionality reduction using 30 of 3-grams
  • 140 K features combining all 4 phone recognizers
  • j-factor is chosen according to the actual
    ratio of positive and negative training examples
  • T-norm is applied in GeneralLR and ChineseLR
  • decision threshold: the threshold that generates the EER
    (suboptimal test set)
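
Two of the bullets above involve simple arithmetic that a short sketch may make concrete: the j-factor and the EER threshold. Everything here is an assumption rather than the system's actual code. In particular, the j-factor is taken to be the number of negative examples divided by the number of positive examples, and the EER threshold is found by scanning the observed scores for the point where miss rate and false-alarm rate are closest.

```python
def cost_factor(labels):
    """Negative-to-positive ratio for one one-against-all target (assumed reading of 'j-factor')."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


def eer_threshold(target_scores, nontarget_scores):
    """Return (threshold, miss_rate, false_alarm_rate) where the two rates are closest."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if best is None or gap < best[0]:
            best = (gap, t, miss, fa)
    return best[1:]


# Made-up labels and scores for illustration only.
j = cost_factor([1, -1, -1, -1, -1, -1])
threshold, miss, fa = eer_threshold([1.2, 0.8, 0.3, -0.1], [0.5, -0.2, -0.6, -1.0])
```

If the SVM package is SVMlight, the corresponding `svm_learn` options would be `-t 1 -d 2` for the 2nd-order polynomial kernel and `-j` for the cost factor.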

10
Official Results
cost: 0.0786
12
Results with Optimal Threshold
0.0606
13
Polyfit Rank Normalization
  • use ranks to train a polynomial
  • apply polynomial instead of look-up tables
  • (hopefully) better interpolation
  • no need to load 200 M lookup-tables
  • experiments with more features possible
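
A minimal sketch of the polyfit idea: fit a polynomial that maps raw feature values to their ranks on the background data, then evaluate the polynomial instead of consulting a lookup table. The slides do not give the polynomial degree or the fitting method, so the degree of 3, the plain least-squares fit, the clamping to [0, 1], and all function names are assumptions.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (coefficients, lowest order first)."""
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the polynomial with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def fit_rank_polynomial(background_values, degree=3):
    """Fit value -> rank on the non-zero background values of one feature."""
    table = sorted(v for v in background_values if v > 0)
    ranks = [(i + 1) / len(table) for i in range(len(table))]
    return polyfit(table, ranks, degree)


def polyfit_rank_normalize(value, coeffs):
    """Apply the fitted polynomial; no occurrence still maps to 0."""
    if value <= 0:
        return 0.0
    return min(1.0, max(0.0, polyval(coeffs, value)))


# Toy background data: evenly spaced values, so rank equals value.
coeffs = fit_rank_polynomial([i / 10 for i in range(1, 11)])
normalized = polyfit_rank_normalize(0.55, coeffs)
```

Only the few coefficients per feature need to be stored, which is what removes the need to load the lookup tables.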

Results on LRE 07 dataset (with optimal threshold): 0.059
15
Prosodic Component
  • n-grams of binned values of energy and pitch on
    phone level and syllable level
  • different bin sizes for unigrams (30), bigrams (10),
    and trigrams (5)
  • using the same backend modeling as in the
    phonotactic component
  • best result obtained on the eval 07 dataset: 0.25
  • no improvement when combined with the PPR-SVM
    system
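
The binned n-gram features described above could be computed along these lines. This is a hedged sketch: the equal-width binning, the value range, the reading of "(10)" as the number of bins used for bigrams, the per-phone pitch values, and the function names are all assumptions.

```python
from collections import Counter


def bin_values(values, n_bins, lo, hi):
    """Quantize values into n_bins equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        index = int((v - lo) / width)
        bins.append(min(max(index, 0), n_bins - 1))  # clamp out-of-range values
    return bins


def ngram_counts(symbols, n):
    """Count all n-grams of consecutive bin indices."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))


# Made-up mean pitch per phone, in Hz (hypothetical data).
pitch = [110.0, 130.0, 180.0, 175.0, 120.0]
bigram_bins = bin_values(pitch, n_bins=10, lo=50.0, hi=250.0)
bigrams = ngram_counts(bigram_bins, 2)
```

The counts would then go through the same normalization and SVM backend as the phonotactic features.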

16
Conclusions
  • developing an evaluation system from scratch is
    hard work!
  • frame-by-frame phone classifiers can be used for
    language recognition in the PPR-SVM framework
  • polynomial rank normalization works as well as
    rank normalization
  • binned n-gram count prosodic systems don't really
    work for LRE (at least not ours).