Title: Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition
1. Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition
MultimediaN / MESH / MediaCampaign
- Marijn Huijbregts
- Roeland Ordelman
- Franciska de Jong
2. research focus
- automatic annotation of linguistic content based on ASR technology can improve access to multimedia archives
- how to create speech-based annotations for real-life archives automatically, accurately and affordably
ASR: Automatic Speech Recognition
3. improve access
- video data containing spoken content
- TREC SDR track (1997-2000): "problem solved"
- TRECVID (2001-2007): beneficial to incorporate speech features in models
- successful projects, e.g., MALACH
- commercial systems: blinkx, SpeechFind
TREC: Text REtrieval Conference
SDR: Spoken Document Retrieval
MALACH: Multilingual Access to Large spoken ArCHives
4. manual
- controlled-vocabulary indexing by subject matter experts may still yield better retrieval effectiveness for certain collections compared to ASR transcripts
- infrequent NEs are often OOV and misrecognised
- dates are rarely spoken
NEs: Named Entities
OOV: Out-Of-Vocabulary
5. automatic
- semi-automatic
  - exploit collateral data and align (e.g., subtitles, minutes, notes); see the alignment sketch below
  - social tagging
- automatic
  - decoding speech into words
  - context adaptation (unseen topics, speakers, etc.)
  - performance monitoring
  - minimize parameter tuning
align: Forced Viterbi Alignment
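A minimal sketch of the forced Viterbi alignment idea behind aligning collateral text with audio: given per-frame log-likelihoods for the units (words/phones) of a known transcript, a left-to-right Viterbi pass assigns every frame to a unit. The function name and interface are illustrative, not the system's actual tooling.

```python
import numpy as np

def forced_viterbi_align(loglik):
    """Align frames to a known unit sequence (forced Viterbi).

    loglik: (T, N) array with loglik[t, n] = log P(frame t | unit n),
    units being the transcript's words/phones in order (requires T >= N).
    Returns a length-T array mapping each frame to a unit index, allowing
    only 'stay in unit' or 'advance one unit' transitions.
    """
    T, N = loglik.shape
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = loglik[0, 0]                 # must start in the first unit
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]
            step = score[t - 1, n - 1] if n > 0 else -np.inf
            if step > stay:
                score[t, n] = step + loglik[t, n]
                back[t, n] = n - 1
            else:
                score[t, n] = stay + loglik[t, n]
                back[t, n] = n
    # backtrace: the last frame must sit in the last unit
    path = np.empty(T, dtype=int)
    path[-1] = N - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```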
6. affordable
- set-up of transcription technology for a new collection can be expensive
- fixed/variable costs (see the break-even sketch below)
- tuning domain-specific language models, acoustic models, pronunciation dictionary
- context adaptation, performance monitoring
- workflow, e.g., when speed is an issue, ASR is only one piece of the machinery
- only cost-effective for collections larger than about 1,000 hours
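To make the fixed/variable cost argument concrete, a back-of-the-envelope break-even calculation with invented figures; the slide only states the ~1,000-hour result, not the underlying costs.

```python
# Hypothetical numbers, purely for illustration.
fixed_setup = 50_000     # one-off: tuning LMs/AMs, dictionary, workflow set-up
asr_per_hour = 5         # variable cost of automatic transcription per hour
manual_per_hour = 55     # cost of manual annotation per audio hour

# break-even collection size h: fixed_setup + asr_per_hour*h = manual_per_hour*h
break_even_hours = fixed_setup / (manual_per_hour - asr_per_hour)
print(break_even_hours)  # 1000.0 hours with these assumed figures
```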
7. accurate
- large investments in ASR R&D
- NIST benchmark evaluations
  - read speech ('90)
  - BN speech ('95)
  - conversational speech ('00)
  - meeting room speech ('03)
R&D: Research & Development
NIST: (US) National Institute of Standards and Technology
BN: Broadcast News
9. accurate (cont.)
- reasons for improvements
  - ASR technology improvements
  - rising amounts of available training data
  - known data
- performance drops
  - ASR technology fails, e.g., noisy data, cross-talk
  - less/no training data, e.g., other domains, other languages
  - unknown data (surprise data)
10. surprise data cases: MESH
- provide both professional and personal news users with means to organise, locate, access and present huge and heterogeneous amounts of news
  - professional journalist (raw footage)
  - non-professional creator of special-interest material (p/vodcasts)
- manual/semi-automatic annotation
- speech indexing: unknown data characteristics (especially topics)
11. surprise data cases: MediaCampaign
- detection & tracking of campaigns in TV commercials using logo detection, jingle recognition, OCR, speech transcripts
- ASR is extremely difficult
  - speech in music (technology)
  - huge variety in speech types (voice-over, spontaneous conversational, emotional speech, yelling and singing, etc.)
  - no training data, no prior information
- focus on straplines/slogans
TV: television
OCR: Optical Character Recognition
12. surprise data cases: MultimediaN-TRECVID
- open-domain ASR
- TRECVID: from BN to a real-life archive of Dutch television
  - news magazines, science news, news reports, documentaries, educational programs and archival video
- UT was asked to provide speech annotations (but was it used...?)
Sound and Vision: Netherlands Institute for Sound and Vision
UT: University of Twente
13. Sound & Vision data
- very heterogeneous
- historical data
  - low audio quality (AM mismatch)
  - old-fashioned speech (LM mismatch)
- multiple languages
- background noise, music
- discussions
- no speech
- descriptive metadata available
- performance indication needed for TRECVID participants
14. evaluation
- use baseline BN speech recognition framework
- can we improve results by adapting to the content?
- focus on automatic: only minimal manual tuning
- evaluation data
  - BN data from Spoken Dutch Corpus / TwNC
  - 13 × 5 minutes from Sound and Vision data
  - RT06/RT07 meeting data
TwNC: Twente News Corpus
15.
- SAD: speech, non-speech and silence
- language detection
- speaker segmentation and clustering (speaker diarization: who speaks when)
- multi-pass speech recognition
SAD: Speech Activity Detection
16. Speech Activity Detection
- GMM-based system
  - GMMs need to be trained on data that matches the evaluation data
  - it is impossible to train models for surprise data
- solution
  - do not use models trained on external data, but aim at unsupervised training on task data
  - use as few tunable parameters as possible
- select data (sketched below)
  - use regions in the audio that have low or high energy levels to train new speech and silence models
  - use output of BN SAD system
GMM: Gaussian Mixture Model
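A minimal sketch of the bootstrap idea, assuming scikit-learn; the selection fraction, model sizes and function names are illustrative, not the system's exact recipe.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_sad(frames, energy, frac=0.1):
    """Unsupervised SAD: train speech/silence GMMs on the task data itself.

    frames: (T, D) acoustic features; energy: (T,) per-frame log-energy.
    The 10% selection fraction and 4-component GMMs are assumptions.
    """
    order = np.argsort(energy)
    k = max(4, int(frac * len(energy)))
    gmm_sil = GaussianMixture(n_components=4).fit(frames[order[:k]])   # lowest energy
    gmm_sph = GaussianMixture(n_components=4).fit(frames[order[-k:]])  # highest energy
    # classify every frame with the freshly trained task-data models
    return gmm_sph.score_samples(frames) > gmm_sil.score_samples(frames)
```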
17. SAD results
RT: (NIST) Rich Transcription benchmark evaluation
18. Speaker Diarization
- GMM-based: each GMM represents a speaker
- identical research approach
  - no models trained on external data
  - as few tunable parameters as possible
- procedure (sketched below)
  - divide the data into a number of small segments and train a model for each segment
  - compare models; if two models are very similar, merge them
  - repeat until no two models can be found that are believed to be trained on speech of the same speaker
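A sketch of the merge loop under simplifying assumptions: single full-covariance Gaussians stand in for the per-segment GMMs, and a standard ΔBIC criterion decides whether two segments sound like the same speaker. None of this is claimed to match the system's exact models.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """ΔBIC merge score for two segments of features, (n, d) and (m, d).
    Positive values suggest both segments come from the same speaker."""
    n, d = x.shape
    m = len(y)
    def logdet(data):
        cov = np.cov(data, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]
    penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n + m)
    gain = 0.5 * ((n + m) * logdet(np.vstack([x, y]))
                  - n * logdet(x) - m * logdet(y))
    return penalty - gain

def cluster(segments):
    """Greedy agglomerative clustering: repeatedly merge the best-scoring
    pair until no pair looks like a single speaker (cf. the procedure above)."""
    segs = list(segments)
    while len(segs) > 1:
        best, i, j = max((delta_bic(a, b), i, j)
                         for i, a in enumerate(segs)
                         for j, b in enumerate(segs) if i < j)
        if best <= 0:
            break                       # no remaining pair should be merged
        segs[i] = np.vstack([segs[i], segs.pop(j)])
    return segs
```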
19. Speaker Diarization results
- participation in the NIST diarization benchmarks in collaboration with
  - TNO (RT06)
  - ICSI, Berkeley (RT07)
- RT07 result: 8.2% DER (best was 8.2%)
DER: Diarization Error Rate
20. ASR feature extraction
- Mel-scale Frequency Cepstrum Coefficients
- Cepstrum Mean Normalization
- Vocal Tract Length Normalization
- Histogram Normalization (work in progress)
- Speaker Adaptive Training (work in progress)
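The first two steps of this front-end in code, assuming librosa and a placeholder file name; VTLN and the work-in-progress items are not shown.

```python
import librosa

# Placeholder file name; 16 kHz mono is a common ASR front-end assumption.
y, sr = librosa.load("recording.wav", sr=16000)

# 13 Mel-scale cepstrum coefficients per frame, shape (13, T).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Cepstrum Mean Normalization: subtract the per-utterance mean of each
# coefficient to remove stationary channel effects.
cmn = mfcc - mfcc.mean(axis=1, keepdims=True)
```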
21. ASR decoder
- HMM triphone acoustic models
- up to 4-gram language models
- no limit on the number of words in the vocabulary
- multiple pronunciations per word possible
- token-based Viterbi decoder (a toy sketch follows below)
  - histogram and beam pruning
  - language model lookahead (4-grams)
  - N-best and lattice output
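A toy token-passing Viterbi sketch with beam pruning. The graph and likelihood interfaces are invented for illustration; histogram pruning, LM lookahead and N-best/lattice output are omitted.

```python
def token_passing(frames, arcs, loglik, beam=10.0):
    """Toy token-passing Viterbi decoder with beam pruning.

    Assumed interfaces (not the system's): `arcs` maps a state to a list of
    (next_state, word_or_None) transitions of a decoding graph with start
    state 0, and `loglik(frame, state)` returns an acoustic log-likelihood.
    """
    tokens = {0: (0.0, [])}                    # state -> (score, word history)
    for frame in frames:
        new = {}
        for state, (score, hist) in tokens.items():
            for nxt, word in arcs[state]:      # propagate each token
                s = score + loglik(frame, nxt)
                h = hist + [word] if word else hist
                if nxt not in new or s > new[nxt][0]:
                    new[nxt] = (s, h)          # keep the best token per state
        best = max(s for s, _ in new.values())
        # beam pruning: drop tokens scoring too far below the current best
        tokens = {st: t for st, t in new.items() if t[0] > best - beam}
    return max(tokens.values())                # best surviving (score, words)
```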
22. ASR AM
- feature-space normalization
  - normalize the audio signal: make the test data look more like the training data
  - VTLN: warp acoustic features to a generic speaker
- model-space normalization (sketched below)
  - transform model means per speaker/cluster to fit adaptation data from 1st-pass recognition
  - SMAPLR
VTLN: Vocal Tract Length Normalization
SMAPLR: Structural Maximum A Posteriori Linear Regression
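A heavily simplified model-space sketch: a single global MLLR-style mean transform estimated from first-pass occupation posteriors. SMAPLR additionally organises transforms in a regression-class tree with MAP priors, which is not shown here.

```python
import numpy as np

def adapt_means(means, posteriors, frames):
    """Global linear mean transform (MLLR-style, identity covariances).

    means: (G, D) Gaussian means; posteriors: (T, G) first-pass occupation
    probabilities; frames: (T, D) adaptation features. Solves for W that
    minimizes sum_{t,g} p[t,g] * ||x_t - W^T ext(mu_g)||^2.
    """
    T, G = posteriors.shape
    D = frames.shape[1]
    ext = np.hstack([means, np.ones((G, 1))])         # append bias term, (G, D+1)
    A = np.zeros((D + 1, D + 1))
    B = np.zeros((D + 1, D))
    for t in range(T):
        for g in range(G):
            p = posteriors[t, g]
            A += p * np.outer(ext[g], ext[g])
            B += p * np.outer(ext[g], frames[t])
    W = np.linalg.solve(A + 1e-6 * np.eye(D + 1), B)  # regularized normal equations
    return ext @ W                                    # adapted means, (G, D)
```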
23. ASR LM
- a priori vocabulary optimization not possible
- which LM fits best? usually estimated on sample data
- approach (sketched below)
  - use metadata (or 1st-pass transcripts) for creating topic-specific LMs
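A miniature of the selection idea: pick the topic LM whose training text best predicts an item's metadata, using smoothed unigram perplexity. The topic names and corpora are invented, and the real system uses 4-gram LMs rather than unigrams.

```python
import math
from collections import Counter

def perplexity(text, unigram, vocab_size, alpha=0.5):
    """Unigram perplexity with add-alpha smoothing."""
    words = text.split()
    total = sum(unigram.values())
    logp = sum(math.log((unigram.get(w, 0) + alpha) /
                        (total + alpha * vocab_size)) for w in words)
    return math.exp(-logp / len(words))

def pick_topic_lm(metadata, topic_corpora):
    """Choose the topic whose corpus best predicts the item's metadata
    (or first-pass transcript). `topic_corpora`: topic name -> training text."""
    vocab = {w for txt in topic_corpora.values() for w in txt.split()}
    lms = {t: Counter(txt.split()) for t, txt in topic_corpora.items()}
    return min(lms, key=lambda t: perplexity(metadata, lms[t], len(vocab)))
```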
25. Word Error Rates given a baseline LM (base) and a topic-specific LM (item-LM)
26. TRECVID overall results
Word Error Rates of the baseline system compared with those of the adapted system
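For reference, the Word Error Rate reported on these slides is the standard edit-distance measure; a minimal sketch:

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) / N,
    computed with the standard edit-distance recursion over words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```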
27. conclusions
- creating accurate automatic speech-based annotations for real-life archives is a challenge
- substantial/only minor (take your pick) performance gain from implementing well-known unsupervised adaptation strategies
- still more than half of the words wrong (above the magic 50% boundary), but also a lot of words right (hopefully the right ones)
- research investment required
  - open-domain ASR
  - ASR alternatives (social tagging?)
28. questions?