Title: Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition
1. Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition
MultimediaN / MESH / MediaCampaign
- Marijn Huijbregts
- Roeland Ordelman
- Franciska de Jong
2. research focus
- automatic annotation of linguistic content based on ASR technology can improve access to multimedia archives
- how to create speech-based annotations for real-life archives automatically, accurately and affordably
ASR: Automatic Speech Recognition
3. improve access
- video data containing spoken content
- TREC SDR track (1997-2000): "problem solved"
- TRECVID (2001-2007): beneficial to incorporate speech features in models
- successful projects, e.g., MALACH
- commercial systems: blinkx, SpeechFind
TREC: Text REtrieval Conference
SDR: Spoken Document Retrieval
MALACH: Multilingual Access to Large spoken ArCHives
4. manual
- controlled-vocabulary indexing by subject matter experts may still yield better retrieval effectiveness for certain collections compared to ASR transcripts
- infrequent NEs are often OOV and misrecognised
- dates are rarely spoken
NEs: Named Entities
OOV: Out-Of-Vocabulary
5. automatic
- semi-automatic
  - exploit collateral data and align (e.g., subtitles, minutes, notes); see the alignment sketch below
  - social tagging
- automatic
  - decoding speech into words
  - context adaptation (unseen topics, speakers, etc.)
  - performance monitoring
  - minimize parameter tuning
align: Forced Viterbi Alignment
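A minimal sketch of the forced Viterbi alignment idea behind aligning collateral text with audio: given per-frame log-likelihoods for the units (words/phones) of a known transcript, a left-to-right Viterbi pass assigns every frame to a unit. The function name and interface are illustrative, not the system's actual tooling.

```python
import numpy as np

def forced_viterbi_align(loglik):
    """Align frames to a known unit sequence (forced Viterbi).

    loglik: (T, N) array with loglik[t, n] = log P(frame t | unit n),
    units being the transcript's words/phones in order (requires T >= N).
    Returns a length-T array mapping each frame to a unit index, allowing
    only 'stay in unit' or 'advance one unit' transitions.
    """
    T, N = loglik.shape
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = loglik[0, 0]                 # must start in the first unit
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]
            step = score[t - 1, n - 1] if n > 0 else -np.inf
            if step > stay:
                score[t, n] = step + loglik[t, n]
                back[t, n] = n - 1
            else:
                score[t, n] = stay + loglik[t, n]
                back[t, n] = n
    # backtrace: the last frame must sit in the last unit
    path = np.empty(T, dtype=int)
    path[-1] = N - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```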
6. affordable
- set-up of transcription technology for a new collection can be expensive
- fixed/variable costs (see the break-even sketch below)
- tuning domain-specific language models, acoustic models, pronunciation dictionary
- context adaptation, performance monitoring
- workflow, e.g., when speed is an issue, ASR is only one piece of the machinery
- only cost-effective for collections larger than about 1,000 hours
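To make the fixed/variable cost argument concrete, a back-of-the-envelope break-even calculation with invented figures; the slide only states the ~1,000-hour result, not the underlying costs.

```python
# Hypothetical numbers, purely for illustration.
fixed_setup = 50_000     # one-off: tuning LMs/AMs, dictionary, workflow set-up
asr_per_hour = 5         # variable cost of automatic transcription per hour
manual_per_hour = 55     # cost of manual annotation per audio hour

# break-even collection size h: fixed_setup + asr_per_hour*h = manual_per_hour*h
break_even_hours = fixed_setup / (manual_per_hour - asr_per_hour)
print(break_even_hours)  # 1000.0 hours with these assumed figures
```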
7. accurate
- large investments in ASR R&D
- NIST benchmark evaluations
  - read speech ('90)
  - BN speech ('95)
  - conversational speech ('00)
  - meeting room speech ('03)
R&D: Research & Development
NIST: (US) National Institute of Standards and Technology
BN: Broadcast News
9. accurate (cont.)
- reasons for improvements
  - ASR technology improvements
  - rising amounts of available training data
  - known data
- performance drops
  - ASR technology fails, e.g., noisy data, cross-talk
  - less/no training data, e.g., other domains, other languages
  - unknown data (surprise data)
10. surprise data cases: MESH
- provide both professional and personal news users with means to organise, locate, access and present huge and heterogeneous amounts of news
  - professional journalist (raw footage)
  - non-professional creator of special-interest material (p/vodcasts)
- manual/semi-automatic annotation
- speech indexing: unknown data characteristics (especially topics)
11. surprise data cases: MediaCampaign
- detection & tracking of campaigns in TV commercials using logo detection, jingle recognition, OCR, speech transcripts
- ASR is extremely difficult
  - speech in music (technology)
  - huge variety in speech types (voice-over, spontaneous conversational, emotional speech, yelling and singing, etc.)
  - no training data, no prior information
- focus on straplines/slogans
TV: television
OCR: Optical Character Recognition
12. surprise data cases: MultimediaN-TRECVID
- open-domain ASR
- TRECVID: from BN to a real-life archive of Dutch television
  - news magazines, science news, news reports, documentaries, educational programs and archival video
- UT was asked to provide speech annotations (but was it used...?)
Sound and Vision: Netherlands Institute for Sound and Vision
UT: University of Twente
13. Sound & Vision data
- very heterogeneous
- historical data
  - low audio quality (AM mismatch)
  - old-fashioned speech (LM mismatch)
- multiple languages
- background noise, music
- discussions
- no speech
- descriptive metadata available
- performance indication needed for TRECVID participants
14. evaluation
- use baseline BN speech recognition framework
- can we improve results by adapting to the content?
- focus on automatic: only minimal manual tuning
- evaluation data
  - BN data from Spoken Dutch Corpus / TwNC
  - 13 × 5 minutes from Sound and Vision data
  - RT06/RT07 meeting data
TwNC: Twente News Corpus
15.
- SAD: speech, non-speech and silence
- language detection
- speaker segmentation and clustering (speaker diarization: who speaks when)
- multi-pass speech recognition
SAD: Speech Activity Detection
16. Speech Activity Detection
- GMM-based system
  - GMMs need to be trained on data that matches the evaluation data
  - it is impossible to train models for surprise data
- solution
  - do not use models trained on external data, but aim at unsupervised training on task data
  - use as few tunable parameters as possible
- select data (sketched below)
  - use regions in the audio that have low or high energy levels to train new speech and silence models
  - use output of BN SAD system
GMM: Gaussian Mixture Model
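A minimal sketch of the bootstrap idea, assuming scikit-learn; the selection fraction, model sizes and function names are illustrative, not the system's exact recipe.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_sad(frames, energy, frac=0.1):
    """Unsupervised SAD: train speech/silence GMMs on the task data itself.

    frames: (T, D) acoustic features; energy: (T,) per-frame log-energy.
    The 10% selection fraction and 4-component GMMs are assumptions.
    """
    order = np.argsort(energy)
    k = max(4, int(frac * len(energy)))
    gmm_sil = GaussianMixture(n_components=4).fit(frames[order[:k]])   # lowest energy
    gmm_sph = GaussianMixture(n_components=4).fit(frames[order[-k:]])  # highest energy
    # classify every frame with the freshly trained task-data models
    return gmm_sph.score_samples(frames) > gmm_sil.score_samples(frames)
```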
17. SAD results
RT: (NIST) Rich Transcription benchmark evaluation
18. Speaker Diarization
- GMM-based: each GMM represents a speaker
- identical research approach
  - no models trained on external data
  - as few tunable parameters as possible
- procedure (sketched below)
  - divide the data into a number of small segments and train a model for each segment
  - compare models; if two models are very similar, merge them
  - repeat until no two models can be found that are believed to be trained on speech of the same speaker
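A sketch of the merge loop under simplifying assumptions: single full-covariance Gaussians stand in for the per-segment GMMs, and a standard ΔBIC criterion decides whether two segments sound like the same speaker. None of this is claimed to match the system's exact models.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """ΔBIC merge score for two segments of features, (n, d) and (m, d).
    Positive values suggest both segments come from the same speaker."""
    n, d = x.shape
    m = len(y)
    def logdet(data):
        cov = np.cov(data, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]
    penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n + m)
    gain = 0.5 * ((n + m) * logdet(np.vstack([x, y]))
                  - n * logdet(x) - m * logdet(y))
    return penalty - gain

def cluster(segments):
    """Greedy agglomerative clustering: repeatedly merge the best-scoring
    pair until no pair looks like a single speaker (cf. the procedure above)."""
    segs = list(segments)
    while len(segs) > 1:
        best, i, j = max((delta_bic(a, b), i, j)
                         for i, a in enumerate(segs)
                         for j, b in enumerate(segs) if i < j)
        if best <= 0:
            break                       # no remaining pair should be merged
        segs[i] = np.vstack([segs[i], segs.pop(j)])
    return segs
```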
19. Speaker Diarization results
- participation in the NIST diarization benchmarks in collaboration with
  - TNO (RT06)
  - ICSI, Berkeley (RT07)
- RT07 result: 8.2% DER (best was 8.2%)
DER: Diarization Error Rate
20. ASR feature extraction
- Mel-scale Frequency Cepstrum Coefficients
- Cepstrum Mean Normalization
- Vocal Tract Length Normalization
- Histogram Normalization (work in progress)
- Speaker Adaptive Training (work in progress)
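The first two steps of this front-end in code, assuming librosa and a placeholder file name; VTLN and the work-in-progress items are not shown.

```python
import librosa

# Placeholder file name; 16 kHz mono is a common ASR front-end assumption.
y, sr = librosa.load("recording.wav", sr=16000)

# 13 Mel-scale cepstrum coefficients per frame, shape (13, T).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Cepstrum Mean Normalization: subtract the per-utterance mean of each
# coefficient to remove stationary channel effects.
cmn = mfcc - mfcc.mean(axis=1, keepdims=True)
```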
21. ASR decoder
- HMM triphone acoustic models
- up to 4-gram language models
- no limit on the number of words in the vocabulary
- multiple pronunciations per word possible
- token-based Viterbi decoder (a toy sketch follows below)
  - histogram and beam pruning
  - language model lookahead (4-grams)
  - N-best and lattice output
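A toy token-passing Viterbi sketch with beam pruning. The graph and likelihood interfaces are invented for illustration; histogram pruning, LM lookahead and N-best/lattice output are omitted.

```python
def token_passing(frames, arcs, loglik, beam=10.0):
    """Toy token-passing Viterbi decoder with beam pruning.

    Assumed interfaces (not the system's): `arcs` maps a state to a list of
    (next_state, word_or_None) transitions of a decoding graph with start
    state 0, and `loglik(frame, state)` returns an acoustic log-likelihood.
    """
    tokens = {0: (0.0, [])}                    # state -> (score, word history)
    for frame in frames:
        new = {}
        for state, (score, hist) in tokens.items():
            for nxt, word in arcs[state]:      # propagate each token
                s = score + loglik(frame, nxt)
                h = hist + [word] if word else hist
                if nxt not in new or s > new[nxt][0]:
                    new[nxt] = (s, h)          # keep the best token per state
        best = max(s for s, _ in new.values())
        # beam pruning: drop tokens scoring too far below the current best
        tokens = {st: t for st, t in new.items() if t[0] > best - beam}
    return max(tokens.values())                # best surviving (score, words)
```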
22. ASR AM
- feature-space normalization
  - normalize the audio signal: make the test data look more like the training data
  - VTLN: warp acoustic features to a generic speaker
- model-space normalization (sketched below)
  - transform model means per speaker/cluster to fit adaptation data from 1st-pass recognition
  - SMAPLR
VTLN: Vocal Tract Length Normalization
SMAPLR: Structural Maximum A Posteriori Linear Regression
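A heavily simplified model-space sketch: a single global MLLR-style mean transform estimated from first-pass occupation posteriors. SMAPLR additionally organises transforms in a regression-class tree with MAP priors, which is not shown here.

```python
import numpy as np

def adapt_means(means, posteriors, frames):
    """Global linear mean transform (MLLR-style, identity covariances).

    means: (G, D) Gaussian means; posteriors: (T, G) first-pass occupation
    probabilities; frames: (T, D) adaptation features. Solves for W that
    minimizes sum_{t,g} p[t,g] * ||x_t - W^T ext(mu_g)||^2.
    """
    T, G = posteriors.shape
    D = frames.shape[1]
    ext = np.hstack([means, np.ones((G, 1))])         # append bias term, (G, D+1)
    A = np.zeros((D + 1, D + 1))
    B = np.zeros((D + 1, D))
    for t in range(T):
        for g in range(G):
            p = posteriors[t, g]
            A += p * np.outer(ext[g], ext[g])
            B += p * np.outer(ext[g], frames[t])
    W = np.linalg.solve(A + 1e-6 * np.eye(D + 1), B)  # regularized normal equations
    return ext @ W                                    # adapted means, (G, D)
```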
23. ASR LM
- a priori vocabulary optimization not possible
- which LM fits best? usually estimated on sample data
- approach (sketched below)
  - use metadata (or 1st-pass transcripts) for creating topic-specific LMs
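A miniature of the selection idea: pick the topic LM whose training text best predicts an item's metadata, using smoothed unigram perplexity. The topic names and corpora are invented, and the real system uses 4-gram LMs rather than unigrams.

```python
import math
from collections import Counter

def perplexity(text, unigram, vocab_size, alpha=0.5):
    """Unigram perplexity with add-alpha smoothing."""
    words = text.split()
    total = sum(unigram.values())
    logp = sum(math.log((unigram.get(w, 0) + alpha) /
                        (total + alpha * vocab_size)) for w in words)
    return math.exp(-logp / len(words))

def pick_topic_lm(metadata, topic_corpora):
    """Choose the topic whose corpus best predicts the item's metadata
    (or first-pass transcript). `topic_corpora`: topic name -> training text."""
    vocab = {w for txt in topic_corpora.values() for w in txt.split()}
    lms = {t: Counter(txt.split()) for t, txt in topic_corpora.items()}
    return min(lms, key=lambda t: perplexity(metadata, lms[t], len(vocab)))
```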
25. Word Error Rates given a baseline LM (base) and a topic-specific LM (item-LM)
26. TRECVID overall results
Word Error Rates of the baseline system compared with those of the adapted system
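For reference, the Word Error Rate reported on these slides is the standard edit-distance measure; a minimal sketch:

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) / N,
    computed with the standard edit-distance recursion over words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```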
27. conclusions
- creating accurate automatic speech-based annotations for real-life archives is a challenge
- substantial/only minor (take your pick) performance gain from implementing well-known unsupervised adaptation strategies
- still more than half of the words wrong (above the magic 50% boundary), but also a lot of words right (hopefully the right ones)
- research investment required
  - open-domain ASR
  - ASR alternatives (social tagging?)
28. questions?