Title: The ICSI Language Recognition Evaluation System
- Christian Müller, cmueller_at_icsi.berkeley.edu
PPR-SVM
[System diagram: background-model and training data pass through preprocessing, the frontend, and normalization into svm learn, producing the SVM models; test data takes the same path into svm classify, whose scores go to the backend.]
Data Preprocessing
Starting from the original conversations:
1. noise reduction
2. split into conversation sides (2 per conversation) and then into individual speech acts
3. concatenate the actual training conversations
Training / Test Data
Phone Recognizer Frontends
- SRI EN: 3-state HMM open-loop phone recognizer; MFCCs of order 13 plus Δ and ΔΔ; trained on Switchboard I and II
- ICSI EN / ICSI ARA / ICSI MAN: fully connected MLPs; PLP features plus Δ and ΔΔ; fast GMM-based estimate for VTL normalization; local context of 9 consecutive frames; a phone label is determined as the network's output unit with the maximal activation
  - ICSI EN: 20800 hidden units; 2000 h of 8 kHz conversational speech; 46 phones
  - ICSI ARA: 10000 hidden units; 465 h of 16 kHz broadcast news; 36 phones
  - ICSI MAN: additional pitch features; 15000 hidden units; 870 h of 16 kHz broadcast news; 71 phones
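Each frontend decodes an utterance into a phone sequence, which the phonotactic component turns into n-gram count features. A minimal sketch of that step (function names are illustrative, not from the actual system):

```python
from collections import Counter

def phone_trigram_counts(phones):
    """Count 3-grams over a decoded phone sequence."""
    return Counter(zip(phones, phones[1:], phones[2:]))

def to_feature_vector(counts, vocabulary):
    """Relative-frequency vector over a fixed trigram vocabulary."""
    total = sum(counts.values()) or 1
    return [counts.get(g, 0) / total for g in vocabulary]

# Toy decoded phone sequence for one utterance.
phones = ["sil", "dh", "ah", "k", "ae", "t", "sil"]
counts = phone_trigram_counts(phones)
print(counts[("dh", "ah", "k")])  # -> 1
```

Stacking such vectors from all four recognizers is what yields the large combined feature space described under Model Building.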
Rank Normalization
[Example: per-n-gram feature values (e.g. 0101 0.75, 0123 0.4, 2317 0.2) mapped to rank-normalized values (e.g. 0101 0.29)]
- create an ordered list of values using the background data
- rank = position in list / number of values
- no occurrence mapped to 0
- uniform value distribution
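The steps above can be sketched in Python (a minimal sketch; helper names are illustrative, and mapping values below all background values to 0 is one interpretation of "no occurrence mapped to 0"):

```python
from bisect import bisect_right

def build_rank_table(background_values):
    """Ordered list of feature values observed in the background data."""
    return sorted(background_values)

def rank_normalize(value, table):
    """Map a raw value to its rank: position in the ordered list / number of values."""
    if not table or value < table[0]:
        return 0.0  # value not observed in the background data
    pos = bisect_right(table, value)  # rank position in the ordered list
    return pos / len(table)

table = build_rank_table([0.2, 0.4, 0.75, 0.9])
print(rank_normalize(0.4, table))   # rank 2 of 4 -> 0.5
print(rank_normalize(0.1, table))   # unseen, below all values -> 0.0
```

Because ranks are positions in a sorted list, the output values are (approximately) uniformly distributed, as noted on the slide.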
Model Building
- SVM with 2nd-order polynomial kernel (using SVMlight), one against all
- dimensionality reduction using 30 of the 3-grams
- 140 K features combining all 4 phone recognizers
- j-factor is chosen according to the actual ratio of positive and negative training examples
- t-norm is applied in GeneralLR and ChineseLR
- decision threshold: the threshold that generates the EER (suboptimal on the test set)
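The one-against-all setup can be sketched as follows. This is not the system's actual SVMlight setup: scikit-learn's SVC stands in for SVMlight, with tiny random data in place of the rank-normalized n-gram features, and class_weight="balanced" playing the role SVMlight's j-factor plays in reweighting by the positive/negative ratio:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for rank-normalized n-gram feature vectors (the real
# system combines ~140 K dimensions from 4 phone recognizers).
rng = np.random.default_rng(0)
X = rng.random((40, 10))
y = rng.integers(0, 3, 40)          # 3 toy "languages"

# One-against-all: one 2nd-order polynomial-kernel SVM per target language.
models = {}
for lang in np.unique(y):
    labels = (y == lang).astype(int)
    # class_weight="balanced" reweights by the ratio of positive and
    # negative training examples, analogous to choosing the j-factor.
    clf = SVC(kernel="poly", degree=2, class_weight="balanced")
    clf.fit(X, labels)
    models[lang] = clf

# Score a test vector against every language model; highest margin wins.
scores = {lang: m.decision_function(X[:1])[0] for lang, m in models.items()}
print(max(scores, key=scores.get))
```

In the real system these per-language margins would then be t-normalized and thresholded rather than compared directly.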
Official Results
- cost: 0.0786
Results with Optimal Threshold
- cost: 0.0606
Polyfit Rank Normalization
- use ranks to train a polynomial
- apply polynomial instead of look-up tables
- (hopefully) better interpolation
- no need to load 200 M lookup-tables
- experiments with more features possible
- result on the LRE 07 dataset (with optimal threshold): 0.059
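The polyfit variant can be sketched with NumPy (a toy illustration under assumed data, not the system's actual fitting code): fit a low-order polynomial from background values to their ranks, then evaluate the polynomial instead of consulting a look-up table.

```python
import numpy as np

# Toy background feature values; in the real system these come from the
# background data and would otherwise fill large look-up tables.
bg = np.sort(np.random.default_rng(1).random(1000))
ranks = np.arange(1, len(bg) + 1) / len(bg)

# Fit a low-order polynomial mapping value -> rank.
coeffs = np.polyfit(bg, ranks, deg=5)

def polyfit_rank(value):
    """Approximate rank via the fitted polynomial, clipped to [0, 1]."""
    return float(np.clip(np.polyval(coeffs, value), 0.0, 1.0))

# The polynomial interpolates smoothly between observed background values.
print(polyfit_rank(0.5))
```

Storing a handful of coefficients per feature instead of a full table is what removes the need to load the large look-up tables.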
Prosodic Component
- n-grams of binned values of energy and pitch on
  - phone level
  - syllable level
- different bin sizes for
  - unigrams (30)
  - bigrams (10)
  - trigrams (5)
- using the same backend modeling as in the phonotactic component
- best result obtained on the eval 07 dataset: 0.25
- no improvement when combined with the PPR-SVM system
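The binned-n-gram idea can be sketched as follows (a toy illustration; the value range and per-unit pitch values are made up, and the real system bins both energy and pitch at phone and syllable level):

```python
from collections import Counter

def bin_values(values, n_bins, lo, hi):
    """Quantize pitch/energy values into n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def binned_ngrams(values, n, n_bins, lo, hi):
    """N-gram counts over the binned sequence."""
    bins = bin_values(values, n_bins, lo, hi)
    return Counter(tuple(bins[i:i + n]) for i in range(len(bins) - n + 1))

pitch = [110.0, 115.0, 180.0, 175.0, 120.0]   # toy per-phone pitch values
# bin sizes as on the slide: 30 for unigrams, 10 for bigrams, 5 for trigrams
uni = binned_ngrams(pitch, 1, 30, 80.0, 300.0)
bi = binned_ngrams(pitch, 2, 10, 80.0, 300.0)
print(sum(bi.values()))  # 4 bigrams from 5 values
```

The resulting counts feed the same SVM backend as the phonotactic n-grams; using coarser bins for longer n-grams keeps the feature space from exploding.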
Conclusions
- developing an evaluation system from scratch is hard work!
- frame-by-frame phone classifiers can be used for language recognition in the PPR-SVM framework
- polynomial rank normalization works as well as rank normalization
- binned n-gram count prosodic systems don't really work for LRE (at least not ours)