Title: National Institute of Standards and Technology
1. 2000 TREC-9 Spoken Document Retrieval Track
http://www.nist.gov/speech/sdr2000
John Garofolo, Jerome Lard, Ellen Voorhees
National Institute of Standards and Technology, Information Technology Laboratory
2. SDR 2000 - Overview
- SDR 2000 Track Overview, changes for TREC-9
- SDR Collection/Topics
- Technical Approaches
- Speech Recognition Metrics/Performance
- Retrieval Metrics/Performance
- Conclusions
- Future
3. Spoken Document Retrieval (SDR)
- Task
  - Given a text topic, retrieve ranked list of relevant excerpts from collection of recorded speech
- Requires 2 core technologies
  - Speech Recognition
  - Information Retrieval
- First step towards multimedia information access
- Focus is on effect of recognition accuracy on retrieval performance
- Domain: Radio and TV Broadcast News
4. SDR Evaluation Approach
- In the TREC tradition
  - Create real-ish but doable application task
  - Increase realism (and difficulty) each year
- NIST creates
  - infrastructure: test collection, queries, task definition, relevance judgements
  - task includes several different control conditions: recognizer, boundaries, etc.
- Sites submit
  - speech recognizer transcripts for benchmarking and sharing
  - rank-ordered retrieval lists for scoring
5. Past SDR Test Collections
6. Past SDR Evaluation Conditions
7. SDR 2000 - Changes from 1999
- 2000
  - evaluated on whole shows, including non-news segments
  - 50 ad-hoc topics in two forms: short description and keyword
  - 1 baseline recognizer transcript set (NIST/BBN B2 from 1999)
  - story boundaries unknown (SU) condition is required
  - recognition and use of non-lexical information
- 1999
  - evaluated on hand-segmented news excerpts only
  - 49 ad-hoc-style topics/metrics
  - 2 baseline recognizer transcript sets (NIST/BBN)
  - story boundaries known (SK) focus and exploratory unknown (SU) conditions
8. SDR 2000 - Test Collection
- Based on the LDC TDT-2 Corpus
  - 4 sources (TV: ABC, CNN; Radio: PRI, VOA)
  - February through June 1998 subset, 902 broadcasts
  - 557.5 hours, 21,754 stories, 6,755 filler and commercial segments (55 hours)
- Reference transcripts
  - Human-annotated story boundaries
  - Full broadcast word transcription
  - News segments hand-transcribed (same as in 99)
  - Commercials and non-news filler transcribed via NIST ROVER applied to 3 automatic recognizer transcript sets (see the voting sketch below)
  - Word times provided by LIMSI forced alignment
- Automatic recognition of non-lexical information (commercials, repeats, gender, bandwidth, non-speech, signal energy, and combinations) provided by CU
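For orientation, a much-simplified illustration of the voting stage behind a ROVER-style combination of the 3 recognizer transcripts. Real ROVER first aligns the hypotheses into a word transition network by dynamic programming and can weight votes by word confidence; here the transcripts are assumed to be pre-aligned (with "" marking a null slot), and the function name is illustrative.

```python
from collections import Counter

def rover_vote(aligned_hypotheses):
    """Majority vote per aligned word slot across already-aligned transcripts."""
    combined = []
    for slot in zip(*aligned_hypotheses):
        winner, _ = Counter(slot).most_common(1)[0]   # most frequent word in slot
        if winner:                                    # drop winning null slots
            combined.append(winner)
    return combined

hyps = [
    ["the", "white", "house", "",     "said"],
    ["a",   "white", "house", "",     "said"],
    ["the", "white", "house", "aide", "said"],
]
print(" ".join(rover_vote(hyps)))   # "the white house said"
```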
9. Test Variables
- Collection
  - Reference (R1) - transcripts created by LDC human annotators
  - Baseline (B1) - transcripts created by NIST/BBN time-adaptive automatic recognizer
  - Speech (S1/S2) - transcripts created by sites' own automatic recognizers
  - Cross-Recognizer (CR) - all contributed recognizers
- Boundaries
  - Known (K) - story boundaries provided by LDC annotators
  - Unknown (U) - story boundaries unknown
10. Test Variables (cont'd)
- Queries
  - Short (S) - 1- or 2-phrase description of information need
  - Terse (T) - keyword list
- Non-Lexical Information
  - Default - could make use of automatically-recognized features
  - None (N) - no non-lexical information (control)
- Recognition language models
  - Fixed (FLM) - fixed language model/vocabulary predating test epoch
  - Rolling (RLM) - time-adaptive language model/vocabulary using daily newswire texts
11. Test Conditions
- Primary Conditions (may use non-lexical side info, but must run contrast below; the condition codes are unpacked in the sketch after this list)
  - R1SU: Reference Retrieval, short topics, using human-generated "perfect" transcripts without known story boundaries
  - R1TU: Reference Retrieval, terse topics, using human-generated "perfect" transcripts without known story boundaries
  - B1SU: Baseline Retrieval, short topics, using provided recognizer transcripts without known story boundaries
  - B1TU: Baseline Retrieval, terse topics, using provided recognizer transcripts without known story boundaries
  - S1SU: Speech Retrieval, short topics, using own recognizer without known story boundaries
  - S1TU: Speech Retrieval, terse topics, using own recognizer without known story boundaries
- Optional Cross-Recognizer Condition (may use non-lexical side info, but must run contrast below)
  - CRSU-<SYS_NAME>: Cross-Recognizer Retrieval, short topics, using other participants' recognizer transcripts without known story boundaries
  - CRTU-<SYS_NAME>: Cross-Recognizer Retrieval, terse topics, using other participants' recognizer transcripts without known story boundaries
- Conditional No Non-Lexical Information Condition (required contrast if non-lexical information is used in other conditions)
  - R1SUN: Reference Retrieval, short topics, using human-generated "perfect" transcripts without known story boundaries, no non-lexical info
  - R1TUN: Reference Retrieval, terse topics, using human-generated "perfect" transcripts without known story boundaries, no non-lexical info
  - B1SUN: Baseline Retrieval, short topics, using provided recognizer transcripts without known story boundaries, no non-lexical info
  - B1TUN: Baseline Retrieval, terse topics, using provided recognizer transcripts without known story boundaries, no non-lexical info
  - S1SUN: Speech Retrieval, short topics, using own recognizer without known story boundaries, no non-lexical info
  - S1TUN: Speech Retrieval, terse topics, using own recognizer without known story boundaries, no non-lexical info
  - S2SUN: Speech Retrieval, short topics, using own second recognizer without known story boundaries, no non-lexical info
  - S2TUN: Speech Retrieval, terse topics, using own second recognizer without known story boundaries, no non-lexical info
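As a compact summary of the naming scheme above, the sketch below unpacks a condition code into the test variables from slides 9 and 10. The helper function and its output format are illustrative, not part of the track infrastructure.

```python
COLLECTIONS = {"R1": "reference transcripts", "B1": "baseline recognizer",
               "S1": "own recognizer", "S2": "own second recognizer",
               "CR": "cross-recognizer"}

def parse_condition(code):
    """E.g. 'R1SU', 'B1TUN', 'CRSU-<SYS_NAME>' -> component test variables."""
    code = code.split("-")[0]                       # drop any -<SYS_NAME> suffix
    return {
        "collection": COLLECTIONS[code[:2]],
        "topics": "short" if code[2] == "S" else "terse",
        "boundaries": "unknown" if code[3] == "U" else "known",
        "non-lexical info": "not used (control)" if code.endswith("N") else "allowed",
    }

print(parse_condition("B1TUN"))
# {'collection': 'baseline recognizer', 'topics': 'terse',
#  'boundaries': 'unknown', 'non-lexical info': 'not used (control)'}
```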
12. Test Topics
- 50 topics developed by NIST assessors using similar approach to TREC Ad-Hoc Task
- Short and terse forms of topics were generated
- Hard: Topic 125, 10 relevant stories
  - Short: Provide information pertaining to security violations within the U.S. intelligence community. (.024 average MAP)
  - Terse: U.S. intelligence violations (.019 average MAP)
- Medium: Topic 143, 8 relevant stories
  - Short: How many Americans file for bankruptcy each year? (.505 average MAP)
  - Terse: Americans bankruptcy debts (.472 average MAP)
- Easy: Topic 127, 11 relevant stories
  - Short: Name some countries which permit their citizens to commit suicide with medical assistance. (.887 average MAP)
  - Terse: assisted suicide (.938 average MAP)
13. Test Topic Relevance
14. Topic Difficulty
15. Participants
- Full SDR (recognition and retrieval): Cambridge University, UK; LIMSI, France; Sheffield University, UK
16. Approaches for 2000
- Automatic Speech Recognition
  - HMM, word-based - most
  - NN/HMM hybrid-based - Sheffield
- Retrieval
  - OKAPI probabilistic model (see the BM25 sketch below) - all
  - Blind relevance feedback and parallel-corpus BRF for query expansion - all
- Story boundary unknown retrieval
  - passage windowing, retrieval, and merging - all
- Use of automatically-recognized non-lexical features
  - repeat and commercial detection - CU
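As context for the Okapi bullet above, here is a minimal sketch of BM25-style term weighting, the usual form of the Okapi probabilistic model. The parameter values k1 and b are common defaults, not the settings any SDR site actually used, and the function name is illustrative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document (bag of words) against a bag-of-words query."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in doc_freq:
            continue
        # Inverse document frequency (Robertson/Sparck Jones style)
        idf = math.log((num_docs - doc_freq[term] + 0.5) /
                       (doc_freq[term] + 0.5) + 1.0)
        # Term frequency with document-length normalization
        tf_part = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * tf_part
    return score
```

Sites combined weights like these with blind relevance feedback: top-ranked documents from a first pass (on the test collection or a parallel newswire corpus) supply extra query terms for a second pass.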
17. ASR Metrics
- Traditional ASR metric
  - Word Error Rate (WER) and Mean Story Word Error Rate (SWER) using SCLITE and LDC reference transcripts
  - WER = (word insertions + word deletions + word substitutions) / total words in reference (see the sketch below)
- LDC created 2 Hub-4-compliant 10-hour subsets for ASR scoring and analyses (LDC-SDR-99 and LDC-SDR-2000)
- Note that there is a 10.3% WER in the collection's human (closed-caption) transcripts
- Note: SDR recognition is not directly comparable to Hub-4 benchmarks due to transcript quality, test set selection method, and word mapping method used in scoring
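To make the WER formula above concrete, here is a toy word error rate computation based on Levenshtein alignment of the word strings. The official scoring used SCLITE, which adds text normalization and word-mapping rules this sketch omits.

```python
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits (sub/ins/del) to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    # WER = (insertions + deletions + substitutions) / reference word count
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the white house said today", "a white house set today"))  # 0.4
```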
18. ASR Performance
19. IR Metrics
- Traditional TREC ad-hoc metric
  - Mean Average Precision (MAP) using TREC_EVAL (see the sketch below)
- Created assessment pools for each topic using top 100 of all retrieval runs
  - Mean pool size: 596 (2.1% of all segments)
  - Min pool size: 209
  - Max pool size: 1309
- NIST assessors created reference relevance assessments from topic pools
- Somewhat artificial for boundary-unknown conditions
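A minimal sketch of the MAP computation that TREC_EVAL reports: average precision per topic (precision at each relevant document retrieved, divided by the number of relevant documents), then the mean over topics. The function names and data layout are illustrative.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    relevant_ids = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank        # precision at this relevant hit
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs, qrels):
    """runs: {topic_id: ranked doc ids}; qrels: {topic_id: relevant doc ids}."""
    aps = [average_precision(docs, qrels.get(topic, [])) for topic, docs in runs.items()]
    return sum(aps) / len(aps) if aps else 0.0
```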
20. Story Boundaries Known Condition
- Retrieval using pre-segmented news stories
  - systems given index of story boundaries for recognition, with IDs for retrieval
  - excluded non-news segments
  - stories are treated as documents
  - systems produce rank-ordered list of story IDs (see the run-format sketch below)
- Document-based scoring
  - score as in other TREC ad hoc tests using TREC_EVAL
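For reference, a rank-ordered list of story IDs is submitted in the standard six-column TREC run format that TREC_EVAL reads (topic, iteration, document ID, rank, score, run tag). The topic number, story IDs, scores, and run tag below are made-up placeholders.

```python
# results: {topic_id: [(story_id, retrieval_score), ...]} sorted best-first;
# all identifiers and scores here are placeholders, not real SDR data.
results = {
    "126": [("VOA19980315.1800.0456", 14.2), ("CNN19980301.2130.0789", 12.7)],
}

with open("myrun.trec", "w") as out:
    for topic_id, ranked in results.items():
        for rank, (story_id, score) in enumerate(ranked, start=1):
            out.write(f"{topic_id} Q0 {story_id} {rank} {score:.4f} MYRUN\n")
```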
21. Story Boundaries Known Retrieval Condition
22. Unknown Story Boundary Condition
- Retrieval using continuous speech stream
  - systems process entire broadcasts for ASR and retrieval with no provided segmentation
  - systems output a single time marker for each relevant excerpt to indicate topical passages
  - this task does NOT attempt to determine topic boundaries
- Time-based scoring (see the mapping sketch below)
  - map each retrieved time marker to a story ID (dummy ID for retrieved non-stories and duplicates)
  - score as usual using TREC_EVAL
  - penalizes duplicate retrieved stories
  - story-based scoring somewhat artificial but expedient
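A sketch of the time-to-story mapping described above, under assumed data structures: a retrieved time marker that falls inside a reference story maps to that story's ID, while markers outside any story, or repeats of an already-retrieved story, get a dummy ID so they score as non-relevant. Names and formats are illustrative.

```python
def map_times_to_stories(time_markers, story_spans):
    """time_markers: ranked [(show_id, time_sec), ...];
    story_spans: {show_id: [(start_sec, end_sec, story_id), ...]}."""
    mapped, seen, dummies = [], set(), 0
    for show_id, t in time_markers:
        story_id = None
        for start, end, sid in story_spans.get(show_id, []):
            if start <= t < end:
                story_id = sid
                break
        if story_id is None or story_id in seen:
            dummies += 1
            mapped.append(f"DUMMY_{dummies}")   # non-story or duplicate retrieval
        else:
            seen.add(story_id)
            mapped.append(story_id)
    return mapped
```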
23. Story Boundaries Unknown Retrieval Condition
24. SDR-2000 Cross-Recognizer Results
25. Conclusions
- ad hoc retrieval in broadcast news domain appears to be a solved problem
  - systems perform well at finding relevant passages in transcripts produced by a variety of recognizers on full unsegmented news broadcasts
  - performance on own recognizer comparable to human reference
- just beginning to investigate use of non-lexical information
- Caveat emptor
  - ASR may still pose serious problems for Question Answering domain, where content errors are fatal
26. Future for Multi-Media Retrieval?
- SDR Track will be sunset
- Other opportunities
  - TREC
    - Question Answering Track
    - New Video Retrieval Track
  - CLEF
    - Cross-language SDR
  - TDT Project
27. TREC-9 SDR Results, Primary Conditions
28. TREC-9 SDR Results, Cross-Recognizer Conditions