1
Annotation of Heterogeneous Multimedia Content
Using Automatic Speech Recognition
MultimediaN MESH MediaCampaign
  • Marijn Huijbregts
  • Roeland Ordelman
  • Franciska de Jong

2
research focus
  • automatic annotation of linguistic content based
    on ASR technology can improve access to
    multimedia archives
  • how can we create speech-based annotations for
    real-life archives automatically, accurately and
    affordably?

ASR = Automatic Speech Recognition
3
improve access
  • video data containing spoken content
  • TREC SDR track (1997-2000): problem solved
  • TRECVID (2001-2007): beneficial to incorporate
    speech features in retrieval models
  • successful projects, e.g., MALACH
  • commercial systems: blinkx, speechfind

TREC = Text REtrieval Conference
SDR = Spoken Document Retrieval
MALACH = Multilingual Access to Large spoken ArCHives
4
manual
  • controlled-vocabulary indexing by subject-matter
    experts may still yield better retrieval
    effectiveness for certain collections than
    ASR transcripts
  • infrequent NEs are often OOV and misrecognised
  • dates are rarely spoken

NE = Named Entity
OOV = Out-Of-Vocabulary
5
automatic
  • semi-automatic
    • exploit collateral data and align it to the audio
      (e.g., subtitles, minutes, notes)
    • social tagging
  • automatic
    • decoding speech into words
    • context adaptation (unseen topics, speakers, etc.)
    • performance monitoring
    • minimize parameter tuning
Forced Viterbi Alignment
6
affordable
  • set-up of transcription technology for a new
    collection can be expensive
  • fixed/variable costs
  • tuning domain-specific language models, acoustic
    models and the pronunciation dictionary
  • context adaptation, performance monitoring
  • work-flow, e.g., when speed is an issue; ASR is
    only one piece of the machinery
  • only cost-effective for collections larger than
    about 1,000 hours

7
accurate
  • large investments in ASR R&D
  • NIST benchmark evaluations
    • read speech ('90)
    • BN speech ('95)
    • conversational speech ('00)
    • meeting room speech ('03)

R&D = Research & Development
NIST = (US) National Institute of Standards and Technology
BN = Broadcast News
8
(No Transcript)
9
accurate (cont.)
  • reasons for improvements
    • ASR technology improvements
    • rising amounts of available training data
    • known data
  • performance drops
    • ASR technology fails
      (e.g., noisy data, cross-talk)
    • less/no training data
      (e.g., other domains, other languages)
    • unknown data (surprise data)

10
surprise data cases: MESH
  • provide both professional and personal news users
    with means to organise, locate, access and
    present huge and heterogeneous amounts of news
  • professional journalist (raw footage)
  • non-professional creator of special-interest
    material (p/vodcasts)
  • manual/semi-automatic annotation
  • speech indexing: unknown data characteristics
    (especially topics)

11
surprise data cases: MediaCampaign
  • detection and tracking of campaigns in TV
    commercials using
    • logo detection, jingle recognition, OCR, speech
      transcripts
  • ASR is extremely difficult
    • speech in music (technology)
    • huge variety in speech types (voice-over,
      spontaneous conversational, emotional speech,
      yelling and singing, etc.)
    • no training data, no prior information
  • focus on straplines/slogans

TV = television
OCR = Optical Character Recognition
12
surprise data cases: MultimediaN-TRECVID
  • open-domain ASR
  • TRECVID: from BN to a real-life archive of Dutch
    Sound and Vision material
  • news magazines, science news, news reports,
    documentaries, educational programs and archival
    video
  • UT was asked to provide speech annotations (but
    was it used...?)

Sound and Vision = Netherlands Institute for Sound and Vision
13
Sound & Vision data
  • very heterogeneous
  • historical data
  • low audio quality (acoustic model mismatch)
  • old-fashioned speech (language model mismatch)
  • multiple languages
  • background noise, music
  • discussions
  • no speech
  • descriptive metadata available
  • performance indication needed for TRECVID
    participants

14
evaluation
  • use the baseline BN speech recognition framework
  • can we improve results by adapting to the content?
  • focus on automatic methods; only minimal manual tuning
  • evaluation data
    • BN data from the Spoken Dutch Corpus / TwNC
    • 13 x 5 minutes of Sound & Vision data
    • RT06 / RT07 meeting data

TwNC = Twente News Corpus
15
  • SAD: speech, non-speech and silence
  • language detection
  • speaker segmentation and clustering (speaker
    diarization: who speaks when)
  • multi-pass speech recognition

SAD = Speech Activity Detection
16
Speech Activity Detection
  • GMM-based system
  • GMMs need to be trained on data that matches the
    evaluation data
  • It is impossible to train models for surprise
    data
  • Solution
    • do not use models trained on external data but
      aim at unsupervised training on the task data
    • use as few tunable parameters as possible
  • Select data
    • use regions in the audio that have low or high
      energy levels to train new speech and silence
      models (see the sketch below)
    • use the output of the BN SAD system

GMM = Gaussian Mixture Model
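A minimal sketch of the energy-based bootstrapping idea, assuming scikit-learn is available: the lowest- and highest-energy frames seed a silence and a speech GMM, which then classify all frames. The feature layout, mixture sizes and single-pass classification are illustrative simplifications, not the actual system.

# Illustrative sketch: unsupervised, energy-bootstrapped speech/silence models
# (the real system uses full acoustic features and iterative re-estimation).
import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_sad(frames, frac=0.1):
    """frames: (n_frames, n_dims) feature matrix whose first dimension is log energy."""
    energy = frames[:, 0]
    order = np.argsort(energy)
    k = max(1, int(frac * len(frames)))
    silence_init = frames[order[:k]]     # lowest-energy frames -> initial silence model
    speech_init = frames[order[-k:]]     # highest-energy frames -> initial speech model

    silence = GaussianMixture(n_components=4, covariance_type="diag").fit(silence_init)
    speech = GaussianMixture(n_components=4, covariance_type="diag").fit(speech_init)

    # classify every frame with the bootstrapped models (could be re-estimated iteratively)
    return speech.score_samples(frames) > silence.score_samples(frames)

# toy demo with synthetic "features": quiet frames around 0, louder frames around 5
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
                    rng.normal(5.0, 1.0, size=(200, 3))])
print(bootstrap_sad(frames).mean())      # fraction of frames labelled as speech (about 0.5 here)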
17
SAD results
(NIST) Rich Transcription benchmark evaluation
18
Speaker Diarization
  • GMM-based; each GMM represents a speaker
  • identical research approach
    • no models trained on external data
    • as few tunable parameters as possible
  • procedure (sketched below)
    • divide the data into a number of small segments and
      train a model for each segment
    • compare models; if two models are very similar,
      merge them
    • repeat until no two models can be found that are
      believed to be trained on speech of the same
      speaker
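As a rough illustration of this merge-until-convergence procedure, here is a toy Python sketch that clusters feature segments using single Gaussians and a simplified BIC merge test. The model type, stopping criterion and synthetic data are illustrative assumptions, not the actual system, which uses GMMs and iterative resegmentation.

# Toy agglomerative speaker clustering: start with one cluster per segment,
# repeatedly merge the most similar pair until no pair looks like one speaker.
import numpy as np

def log_det_cov(x):
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return np.linalg.slogdet(cov)[1]

def delta_bic(a, b, penalty=2.0):
    """Negative values suggest a and b are better modelled jointly (same speaker)."""
    ab = np.vstack([a, b])
    n, d = ab.shape
    merged = n * log_det_cov(ab)
    separate = len(a) * log_det_cov(a) + len(b) * log_det_cov(b)
    complexity = penalty * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return merged - separate - complexity

def cluster(segments):
    clusters = list(segments)                          # one cluster per segment
    while len(clusters) > 1:
        scores = [(delta_bic(clusters[i], clusters[j]), i, j)
                  for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        best, i, j = min(scores)
        if best >= 0:                                  # no pair looks like the same speaker
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters

# toy demo: four segments drawn from two synthetic "speakers"
rng = np.random.default_rng(1)
spk1 = lambda: rng.normal(0.0, 1.0, size=(100, 5))
spk2 = lambda: rng.normal(3.0, 1.0, size=(100, 5))
print(len(cluster([spk1(), spk1(), spk2(), spk2()])), "speakers found")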

19
Speaker Diarization Results
  • participation in the NIST diarization benchmark in
    collaboration with
    • TNO (RT06)
    • ICSI, Berkeley (RT07)
  • RT07 result: 8.2% DER (best was 8.2%)
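DER (Diarization Error Rate) sums missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The sketch below is a simplified frame-level version that assumes hypothesis speaker names are already mapped to reference names; the official NIST scoring additionally finds the optimal speaker mapping and applies a forgiveness collar around segment boundaries.

# Simplified frame-level DER: ref and hyp are per-frame speaker labels (None = non-speech).
def der(ref, hyp):
    missed = false_alarm = confusion = 0
    for r, h in zip(ref, hyp):
        if r is not None and h is None:
            missed += 1                      # reference speech not detected
        elif r is None and h is not None:
            false_alarm += 1                 # non-speech labelled as speech
        elif r is not None and h is not None and r != h:
            confusion += 1                   # speech attributed to the wrong speaker
    scored = sum(r is not None for r in ref)  # total reference speech frames
    return (missed + false_alarm + confusion) / scored

ref = ["A", "A", "A", None, "B", "B", "B", "B"]
hyp = ["A", "A", None, None, "B", "A", "B", "B"]
print(f"DER = {der(ref, hyp):.2%}")          # 1 miss + 1 confusion over 7 speech frames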

20
ASR feature extraction
  • Mel-Frequency Cepstral Coefficients
  • Cepstral Mean Normalization
  • Vocal Tract Length Normalization
  • Histogram Normalization (work in progress)
  • Speaker Adaptive Training (work in progress)
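A minimal sketch of the first two steps (MFCC extraction plus cepstral mean normalization), assuming the librosa library is available; the sample rate and number of coefficients are illustrative choices, and VTLN, histogram normalization and SAT are not shown.

# MFCC features followed by cepstral mean normalization (CMN).
import numpy as np
import librosa

def mfcc_cmn(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)                  # 16 kHz mono audio
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    feats = feats - feats.mean(axis=1, keepdims=True)          # CMN: remove per-coefficient mean
    return feats.T                                              # shape: (n_frames, n_mfcc)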

21
ASR decoder
  • HMM triphone acoustic models
  • up to 4-gram language models
  • no limit on the number of words in the vocabulary
  • multiple pronunciations per word possible
  • token-based Viterbi decoder
  • histogram and beam pruning
  • language model look-ahead (4-grams)
  • N-best and lattice output
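A toy illustration of beam-pruned Viterbi decoding over a small HMM: the states, probabilities and pruning threshold below are made-up examples, whereas the real decoder passes tokens through triphone HMMs with a pronunciation lexicon and 4-gram language model look-ahead.

# Toy beam-pruned Viterbi decoding over a two-state HMM (illustrative only).
import math

def viterbi_beam(obs, states, log_init, log_trans, log_emit, beam=6.0):
    # one "token" per surviving state: (log score, best state sequence so far)
    tokens = {s: (log_init[s] + log_emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new = {}
        for s, (score, path) in tokens.items():
            for t in states:
                cand = score + log_trans[s][t] + log_emit[t][o]
                if t not in new or cand > new[t][0]:
                    new[t] = (cand, path + [t])
        best = max(v[0] for v in new.values())
        tokens = {s: v for s, v in new.items() if v[0] >= best - beam}   # beam pruning
    return max(tokens.values(), key=lambda v: v[0])

# two states ("sil" and "sp") emitting "low"/"high" energy observations
states = ["sil", "sp"]
log_init = {"sil": math.log(0.6), "sp": math.log(0.4)}
log_trans = {"sil": {"sil": math.log(0.7), "sp": math.log(0.3)},
             "sp":  {"sil": math.log(0.3), "sp": math.log(0.7)}}
log_emit = {"sil": {"low": math.log(0.8), "high": math.log(0.2)},
            "sp":  {"low": math.log(0.2), "high": math.log(0.8)}}
score, path = viterbi_beam(["low", "low", "high", "high"], states, log_init, log_trans, log_emit)
print(path, score)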

22
ASR AM
  • feature-space normalization
    • normalize the audio signal to make the test data
      look more like the training data
    • VTLN: warp acoustic features towards a generic
      speaker
  • model-space normalization
    • transform the model means per speaker/cluster to fit
      adaptation data from the 1st-pass recognition
    • SMAPLR

VTLN = Vocal Tract Length Normalization
SMAPLR = Structural Maximum A Posteriori Linear Regression
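As a rough illustration of feature-space normalization, the sketch below applies a linear VTLN-style warp to the frequency axis of a spectrogram before computing MFCCs (assuming librosa); real systems typically warp the mel filterbank itself and choose the warp factor per speaker by maximum likelihood, so this is only a stand-in for the idea.

# Linear frequency-axis warp applied to a power spectrogram before mel/MFCC analysis.
import numpy as np
import librosa

def vtln_warp(spec, alpha):
    """Warp each frame: output bin f is read from input bin f/alpha (linear warp)."""
    n_bins, n_frames = spec.shape
    src = np.clip(np.arange(n_bins) / alpha, 0, n_bins - 1)
    warped = np.empty_like(spec)
    for t in range(n_frames):
        warped[:, t] = np.interp(src, np.arange(n_bins), spec[:, t])
    return warped

def warped_mfcc(y, sr, alpha, n_mfcc=13):
    spec = np.abs(librosa.stft(y)) ** 2                          # power spectrogram
    mel = librosa.feature.melspectrogram(S=vtln_warp(spec, alpha), sr=sr)
    return librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc)

# the warp factor alpha is usually chosen per speaker by maximising the acoustic
# likelihood over a small grid, e.g. values between roughly 0.88 and 1.12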
23
ASR LM
  • a priori vocabulary optimization is not possible
  • which LM fits best? usually estimated on sample
    data
  • approach (sketched below)
    • use metadata (or 1st-pass transcripts) for
      creating topic-specific LMs
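A toy sketch of the idea: estimate a small unigram model from an item's metadata (or 1st-pass transcript) and interpolate it with a background model so that topic words gain probability. The texts, weights and unigram restriction are made-up simplifications; real systems build full n-gram models with standard LM toolkits.

# Item-specific LM as an interpolation of a background model and a metadata-derived model.
from collections import Counter

def unigram_lm(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(background, topic, lam=0.3, floor=1e-7):
    """P(w) = (1 - lam) * P_background(w) + lam * P_topic(w)."""
    vocab = set(background) | set(topic)
    return {w: (1 - lam) * background.get(w, floor) + lam * topic.get(w, floor)
            for w in vocab}

background = unigram_lm("the news bulletin reported on politics and weather " * 50)
topic = unigram_lm("volcano eruption lava ash cloud evacuation")       # item metadata
item_lm = interpolate(background, topic)
print(item_lm["volcano"] > background.get("volcano", 0))               # topic words gain probability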

24
(No Transcript)
25
Word Error Rates given a baseline LM (base)
and a topic-specific LM (item-LM)
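For reference, Word Error Rate is the number of substitutions, deletions and insertions (found by edit distance against a reference transcript) divided by the number of reference words. The sketch below uses made-up example sentences and is unrelated to the numbers reported on these slides.

# Minimal WER computation via word-level edit distance.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(f"{wer('the minister opened the debate', 'the minster opened debate'):.0%}")   # 40%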
26
TRECVID overall results
Word Error Rates of the baseline system compared
with the adapted system
27
conclusions
  • creating accurate automatic speech-based
    annotations for real-life archives is a
    challenge
  • substantial/only minor (take your pick)
    performance gain by implementing well-known
    unsupervised adaptation strategies
  • still more than half of the words wrong (above
    the "magic" 50% boundary), but also a lot of words
    right (hopefully the right ones)
  • research investment required
    • open-domain ASR
    • ASR alternatives (social tagging?)

28
questions?