Title: The ICSI Language Recognition Evaluation System
- Christian Müller, cmueller_at_icsi.berkeley.edu
PPR-SVM
[System diagram: background-model and training data pass through preprocessing, the frontend, and normalization into svm learn, producing the SVM models; test data takes the same path into svm classify, whose scores go to the backend.]
Data Preprocessing
Starting from the original conversations:
1. noise reduction
2. split into conversation sides (2 per conversation) and then into individual speech acts
3. concatenate the actual training conversations
Training / Test Data
Phone Recognizer Frontends
- SRI EN: 3-state HMM open-loop phone recognizer; MFCCs of order 13 plus Δ and ΔΔ; trained on Switchboard I and II
- ICSI EN / ICSI ARA / ICSI MAN: fully connected MLPs; PLP features plus Δ and ΔΔ; fast GMM-based estimate for VTL normalization; local context of 9 consecutive frames; a phone label is determined as the network's output unit with the maximal activation
  - ICSI EN: 20800 hidden units; 2000 h of 8 kHz conversational speech; 46 phones
  - ICSI ARA: 10000 hidden units; 465 h of 16 kHz broadcast news; 36 phones
  - ICSI MAN: additional pitch features; 15000 hidden units; 870 h of 16 kHz broadcast news; 71 phones
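Each frontend decodes an utterance into a phone sequence, which the phonotactic component turns into n-gram count features. A minimal sketch of that step (function names are illustrative, not from the actual system):

```python
from collections import Counter

def phone_trigram_counts(phones):
    """Count 3-grams over a decoded phone sequence."""
    return Counter(zip(phones, phones[1:], phones[2:]))

def to_feature_vector(counts, vocabulary):
    """Relative-frequency vector over a fixed trigram vocabulary."""
    total = sum(counts.values()) or 1
    return [counts.get(g, 0) / total for g in vocabulary]

# Toy decoded phone sequence for one utterance.
phones = ["sil", "dh", "ah", "k", "ae", "t", "sil"]
counts = phone_trigram_counts(phones)
print(counts[("dh", "ah", "k")])  # -> 1
```

Stacking such vectors from all four recognizers is what yields the large combined feature space described under Model Building.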
Rank Normalization
[Example: per-n-gram feature values (e.g. 0101 0.75, 0123 0.4, 2317 0.2) mapped to rank-normalized values (e.g. 0101 0.29)]
- create an ordered list of values using the background data
- rank = position in list / number of values
- no occurrence mapped to 0
- uniform value distribution
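The steps above can be sketched in Python (a minimal sketch; helper names are illustrative, and mapping values below all background values to 0 is one interpretation of "no occurrence mapped to 0"):

```python
from bisect import bisect_right

def build_rank_table(background_values):
    """Ordered list of feature values observed in the background data."""
    return sorted(background_values)

def rank_normalize(value, table):
    """Map a raw value to its rank: position in the ordered list / number of values."""
    if not table or value < table[0]:
        return 0.0  # value not observed in the background data
    pos = bisect_right(table, value)  # rank position in the ordered list
    return pos / len(table)

table = build_rank_table([0.2, 0.4, 0.75, 0.9])
print(rank_normalize(0.4, table))   # rank 2 of 4 -> 0.5
print(rank_normalize(0.1, table))   # unseen, below all values -> 0.0
```

Because ranks are positions in a sorted list, the output values are (approximately) uniformly distributed, as noted on the slide.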
Model Building
- SVM with 2nd-order polynomial kernel (using SVMlight), one against all
- dimensionality reduction using 30 of the 3-grams
- 140 K features combining all 4 phone recognizers
- j-factor is chosen according to the actual ratio of positive and negative training examples
- t-norm is applied in GeneralLR and ChineseLR
- decision threshold: the threshold that generates the EER (suboptimal on the test set)
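The one-against-all setup can be sketched as follows. This is not the system's actual SVMlight setup: scikit-learn's SVC stands in for SVMlight, with tiny random data in place of the rank-normalized n-gram features, and class_weight="balanced" playing the role SVMlight's j-factor plays in reweighting by the positive/negative ratio:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for rank-normalized n-gram feature vectors (the real
# system combines ~140 K dimensions from 4 phone recognizers).
rng = np.random.default_rng(0)
X = rng.random((40, 10))
y = rng.integers(0, 3, 40)          # 3 toy "languages"

# One-against-all: one 2nd-order polynomial-kernel SVM per target language.
models = {}
for lang in np.unique(y):
    labels = (y == lang).astype(int)
    # class_weight="balanced" reweights by the ratio of positive and
    # negative training examples, analogous to choosing the j-factor.
    clf = SVC(kernel="poly", degree=2, class_weight="balanced")
    clf.fit(X, labels)
    models[lang] = clf

# Score a test vector against every language model; highest margin wins.
scores = {lang: m.decision_function(X[:1])[0] for lang, m in models.items()}
print(max(scores, key=scores.get))
```

In the real system these per-language margins would then be t-normalized and thresholded rather than compared directly.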
Official Results
- cost: 0.0786
Results with Optimal Threshold
- cost: 0.0606
Polyfit Rank Normalization
- use ranks to train a polynomial
- apply polynomial instead of look-up tables
- (hopefully) better interpolation
- no need to load 200 M lookup-tables
- experiments with more features possible
- result on the LRE 07 dataset (with optimal threshold): 0.059
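The polyfit variant can be sketched with NumPy (a toy illustration under assumed data, not the system's actual fitting code): fit a low-order polynomial from background values to their ranks, then evaluate the polynomial instead of consulting a look-up table.

```python
import numpy as np

# Toy background feature values; in the real system these come from the
# background data and would otherwise fill large look-up tables.
bg = np.sort(np.random.default_rng(1).random(1000))
ranks = np.arange(1, len(bg) + 1) / len(bg)

# Fit a low-order polynomial mapping value -> rank.
coeffs = np.polyfit(bg, ranks, deg=5)

def polyfit_rank(value):
    """Approximate rank via the fitted polynomial, clipped to [0, 1]."""
    return float(np.clip(np.polyval(coeffs, value), 0.0, 1.0))

# The polynomial interpolates smoothly between observed background values.
print(polyfit_rank(0.5))
```

Storing a handful of coefficients per feature instead of a full table is what removes the need to load the large look-up tables.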
Prosodic Component
- n-grams of binned values of energy and pitch on
  - phone level
  - syllable level
- different bin sizes for
  - unigrams (30)
  - bigrams (10)
  - trigrams (5)
- using the same backend modeling as in the phonotactic component
- best result obtained on the eval 07 dataset: 0.25
- no improvement when combined with the PPR-SVM system
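The binned-n-gram idea can be sketched as follows (a toy illustration; the value range and per-unit pitch values are made up, and the real system bins both energy and pitch at phone and syllable level):

```python
from collections import Counter

def bin_values(values, n_bins, lo, hi):
    """Quantize pitch/energy values into n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def binned_ngrams(values, n, n_bins, lo, hi):
    """N-gram counts over the binned sequence."""
    bins = bin_values(values, n_bins, lo, hi)
    return Counter(tuple(bins[i:i + n]) for i in range(len(bins) - n + 1))

pitch = [110.0, 115.0, 180.0, 175.0, 120.0]   # toy per-phone pitch values
# bin sizes as on the slide: 30 for unigrams, 10 for bigrams, 5 for trigrams
uni = binned_ngrams(pitch, 1, 30, 80.0, 300.0)
bi = binned_ngrams(pitch, 2, 10, 80.0, 300.0)
print(sum(bi.values()))  # 4 bigrams from 5 values
```

The resulting counts feed the same SVM backend as the phonotactic n-grams; using coarser bins for longer n-grams keeps the feature space from exploding.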
Conclusions
- developing an evaluation system from scratch is hard work!
- frame-by-frame phone classifiers can be used for language recognition in the PPR-SVM framework
- polynomial rank normalization works as well as rank normalization
- binned n-gram count prosodic systems don't really work for LRE (at least not ours)