Title: Speech recognition in MUMIS
1Speech recognition in MUMIS
- Judith Kessens, Mirjam Wester
- Helmer Strik
2Manual transcriptions
- Transcriptions made by SPEX
- orthographic transcriptions
- transcriptions on chunk level (2-3 sec.)
- Formats
- .TextGrid → Praat
- xml-derivatives
- .pri: no time information
- .skp: time information
3Manual transcriptions
- Total number of transcribed matches on the ftp-site (including the demo matches)
- Dutch: 6 matches
- German: 21 matches
- English: 3 matches
- Extensions
- Dutch (_N), German (_G), English (_E)
4Automatic speech recognition
- 1. Acoustic preprocessing
- Acoustic signal → features
- 2. Speech recognition
- Acoustic models
- Language models
- Lexicon
5Automatic transcriptions
- Problem of recorded data
- Commentaries and stadium noise are mixed
- Very high noise levels
- → Recognition of such extremely noisy data is very difficult
6Examples of data
- Yug-Ned match
- Dutch: op _t ogenblik wordt in dit stadion de opstelling voorgelezen
- English: and they wanna make the change before the corner
- German: und die beiden Tore die die Hollaender bekommen hat haben
7Examples of data
- Eng-Dld match
- Dutch: geeft nu een vrije trap in _t voordeel van Ince
- English: and phil neville had to really make about three yards to stop <dreisleru> pulling it down and playing it
- German: wurde von allen englischen Zeitungen aus der Mannschaft
8Evaluation of aut. transcriptions
$$\mathrm{WER}(\%) = \frac{\text{insertions} + \text{deletions} + \text{substitutions}}{\text{number of words}} \times 100$$
- → WER can be larger than 100%!
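The WER definition on this slide (insertions plus deletions plus substitutions, divided by the number of words) can be sketched in a few lines. This is a minimal illustration, not the scoring tool used in MUMIS: the function names `edit_distance` and `wer` are hypothetical, the word alignment is a plain Levenshtein alignment with equal costs, and "number of words" is assumed to mean the number of words in the reference transcription.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    needed to turn the word list `ref` into the word list `hyp`."""
    # prev[j] is the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Made-up sentences, for illustration only.
print(wer("the corner is taken", "the corner is taken"))  # 0.0
print(wer("goal", "what a goal"))                         # 200.0
```

The second call shows why the WER can exceed 100%: two inserted words against a one-word reference give two errors per reference word.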
9WERs (all words)
          Dutch   English   German
Yug-Ned   84.5    84.5      77.4
Eng-Dld   83.2    83.3      90.8
10WERs (player names)
          Dutch           English         German
          all    names    all    names    all    names
Yug-Ned   84.5   53.0     84.5   48.2     77.4   40.9
Eng-Dld   83.2   55.0     83.3   56.2     90.8   77.4
11WERs versus SNR
          Dutch         English       German
          WER    SNR    WER    SNR    WER    SNR
Yug-Ned   84.5   9      84.5   12     77.4   19
Eng-Dld   83.2   8      83.3   11     90.8   7
12Automatic transcriptions
- The language model (LM) and lexicon (lex) are adapted to a specific match
- Start with a general LM and lex
- Add player names of the specific match
- Expand the general LM and lex when more data is available
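The adaptation step on this slide (start from a general lexicon, then add the player names of the match) can be illustrated with a toy vocabulary. This is only a sketch under stated assumptions: the lexicon is reduced to a set of lower-cased words with no pronunciations, the language model is left out, and `adapt_vocabulary`, `oov_rate` and the sample data are hypothetical, not taken from the MUMIS resources.

```python
def adapt_vocabulary(general_vocab, player_names):
    """Return the general vocabulary extended with the match's player names."""
    return set(general_vocab) | {name.lower() for name in player_names}


def oov_rate(words, vocab):
    """Percentage of word tokens that are not in the vocabulary."""
    if not words:
        raise ValueError("need at least one word")
    missing = sum(1 for w in words if w.lower() not in vocab)
    return 100.0 * missing / len(words)


# Toy data (assumed for illustration, not the real MUMIS lexicon).
general_vocab = {"and", "had", "to", "stop", "a", "free", "kick", "for"}
player_names = ["Neville", "Ince"]
commentary = "and Neville had to stop a free kick for Ince".split()

before = oov_rate(commentary, general_vocab)
after = oov_rate(commentary, adapt_vocabulary(general_vocab, player_names))
print(f"OOV rate before: {before:.1f}%, after: {after:.1f}%")
# OOV rate before: 20.0%, after: 0.0%
```

In a real system each added name would also need a pronunciation in the lexicon and a probability in the language model; the sketch only shows the effect on vocabulary coverage.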
13WERs for various amounts of data
14Oracle experiments - ICSLP02
- Due to the limited amount of material we started off with oracle experiments
- Language models are trained on the target match
- Acoustic models are trained on part of the target match or on another match
- → Much lower WERs
15Summary of results
- Acoustic model training
- Leaving out non-speech chunks does not hurt recognition performance
- Using more training data is beneficial, but more important:
- The SNRs of the training and test data should be matched
16Summary of results
(tested on Yug-Ned match)
17Summary of results
Split words into categories, i.e. function words, content words and football players' names:
WER function words > WER content words > WER names
(tested on Yug-Ned match)
18Summary of results
- Noise reduction tool (FTNR) → small improvement
19Ongoing work
- Techniques to lower WERs
- Tuning of the generic language model
- Defining different classes
- Reduction of OOV words in the lexicon and in the language model (using more material)
- Speaker adaptation in HTK
- (note: all other experiments are being carried out using Phicos)
20Ongoing work
- Noise robustness
- Extension of the acoustic models by using double deltas
- Histogram normalization and FTNR
- SNR-dependent acoustic models
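As an illustration of the double deltas mentioned on this slide: delta features are time derivatives of the static features, and double deltas are the deltas of the deltas. The sketch below uses the common regression formula d_t = Σ θ (c_{t+θ} − c_{t−θ}) / (2 Σ θ²) with the first and last frame repeated at the edges. The window size of 2, the function names and the edge handling are assumptions; the slides do not say which settings were used.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients for a list of feature vectors.

    frames: list of equal-length lists of floats, one list per time frame.
    Frames beyond the edges are replaced by the first or last frame.
    """
    n = len(frames)
    denom = 2 * sum(theta * theta for theta in range(1, window + 1))
    result = []
    for t in range(n):
        vec = []
        for k in range(len(frames[t])):
            num = 0.0
            for theta in range(1, window + 1):
                ahead = frames[min(t + theta, n - 1)][k]
                behind = frames[max(t - theta, 0)][k]
                num += theta * (ahead - behind)
            vec.append(num / denom)
        result.append(vec)
    return result


def add_deltas_and_double_deltas(frames, window=2):
    """Append deltas and double deltas to every static feature vector."""
    d = deltas(frames, window)
    dd = deltas(d, window)  # double deltas: deltas of the deltas
    return [static + d1 + d2 for static, d1, d2 in zip(frames, d, dd)]


# One-dimensional toy feature that rises linearly over five frames.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
extended = add_deltas_and_double_deltas(ramp)
print(extended[2])  # [2.0, 1.0, 0.0]
```

For the middle frame of the ramp the delta is 1.0 (constant slope) and the double delta is 0.0 (no change in slope). Each feature vector becomes three times as long, which is what extending the acoustic models with double deltas refers to.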
21Recommendations
- Acoustic modeling
- Record commentaries and stadium noise separately
- Speaker adaptation
  - Transcribe characteristics of commentator
  - Collect more speech data of commentator
22Recommendations
- Lexicon and language modeling
- Collect orthographic transcriptions of spoken material, instead of written material
- Subtitles
- Closed captions