TREC: Experiment and Evaluation in Information Retrieval

1
TREC: Experiment and Evaluation in Information Retrieval
  • Retrieving Noisy Text

Edith Beckett, March 28, 2006
2
Noisy Text defined
  • Digital text created from non-digital sources
  • Documents scanned and converted using optical
    character recognition (OCR)
  • Transcripts created from audio sources using
    automatic speech recognition (ASR) systems

3
Noisy text in TREC
  • Confusion track for scanned, OCR-processed texts
    • TREC-4 (1995)
    • TREC-5 (1996)

4
Noisy text in TREC
  • SDR (spoken document retrieval) track for
    transcribed audio and video
    • TREC-6 (1997)
    • TREC-7 (1998)
    • TREC-8 (1999)
    • TREC-9 (2000)

5
Noisy text in TREC
  • Goal was to explore methods of retrieval in
    documents containing systematic errors

6
TREC-4
  • Started with a baseline truth document set
  • NIST created two new document sets by randomly
    selecting characters to be deleted, substituted
    or added
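The corruption procedure described on this slide can be sketched as follows. This is a minimal illustration, not NIST's actual tool: the function name `corrupt`, the lowercase replacement alphabet, the equal weighting of the three operations, and the fixed seed are all assumptions made for the example.

```python
import random
import string


def corrupt(text, error_rate, seed=0):
    """Randomly delete, substitute, or insert characters.

    Each character is selected for corruption with probability
    `error_rate`. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)           # fixed seed keeps runs repeatable
    alphabet = string.ascii_lowercase   # assumed replacement alphabet
    out = []
    for ch in text:
        if rng.random() >= error_rate:
            out.append(ch)              # character left untouched
            continue
        op = rng.choice(("delete", "substitute", "insert"))
        if op == "delete":
            continue                    # drop the character
        if op == "substitute":
            out.append(rng.choice(alphabet))
        else:
            # insert: keep the original and add a random character after it
            out.append(ch)
            out.append(rng.choice(alphabet))
    return "".join(out)


if __name__ == "__main__":
    sample = "information retrieval from noisy text"
    print(corrupt(sample, 0.10))
    print(corrupt(sample, 0.20))
```

Because a substitution may happen to pick the original character, the realized error rate is only approximately the requested one.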

7
TREC-4
  • Result: one document set with a 10 percent error
    rate (10 percent of the characters were changed)
    and another with a 20 percent error rate
  • Strength: did not model any particular
    error-producing process
  • Weakness: in practice, errors created by OCR
    programs are systematic rather than random

8
TREC-5
  • Document set was approximately 55,600 documents
    from the Federal Register
  • Truth version: typesetting files provided by
    the GPO
  • Second version: obtained by scanning the hard
    copy (estimated error rate of 5 percent)
  • Third version: obtained by downsampling the
    original hard copy files and re-scanning
    (estimated error rate of 20 percent)
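One way a character error rate like the 5 and 20 percent figures could be estimated is to compare a degraded text against its truth version with edit distance. The slides do not say how the estimates were actually produced, so the sketch below is an assumption; `edit_distance` and `character_error_rate` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(truth, degraded):
    """Edit distance divided by the length of the truth text."""
    if not truth:
        raise ValueError("truth text must not be empty")
    return edit_distance(truth, degraded) / len(truth)


if __name__ == "__main__":
    print(character_error_rate("Federal Register", "Federa1 Reg1ster"))
```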

9
TREC-5
  • Performed known-item searches
  • Two approaches
    • Expand query to include or match corrupted forms
      in documents
    • Clarify corrupted texts
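The first approach, expanding the query to match corrupted forms, might look roughly like the sketch below. It is not a reconstruction of any participant's system: the function name `expand_query`, the similarity cutoff, and the toy vocabulary are assumptions, and the matching relies on the standard-library `difflib.get_close_matches`.

```python
import difflib


def expand_query(query_terms, collection_vocabulary, max_variants=5, cutoff=0.8):
    """Add vocabulary terms that look like corrupted forms of query terms.

    Hypothetical helper: `collection_vocabulary` is the list of distinct
    terms seen in the noisy collection.
    """
    expanded = []
    for term in query_terms:
        variants = difflib.get_close_matches(
            term, collection_vocabulary, n=max_variants, cutoff=cutoff
        )
        # Keep the original term first, then its look-alikes, without duplicates.
        for candidate in [term] + variants:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    vocabulary = ["retrieval", "retrieva1", "register", "federal"]
    print(expand_query(["retrieval"], vocabulary))
```

A lower cutoff pulls in more corrupted variants at the cost of more unrelated terms.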

10
TREC-5
  • In general, retrieval becomes less effective
    as the noise in documents increases
  • Redundancy in natural language text mitigates
    the negative effects of noisy text on
    information retrieval

11
TREC-5
  • In TREC-5, lack of redundancy in artificially
    created noisy documents may have contributed to
    the degradation of retrieval results

12
SDR Track Design
  • Three types of transcripts produced from audio
    files
  • Reference: a perfect human-produced
    transcript
  • Baseline: a common transcript produced by a
    single, selected ASR system
  • Speech: a transcript produced by an ASR system
    of the participant's choosing

13
SDR Track Design
  • Retrieval runs used the same topics and were
    required to use completely automatic processing
  • Providing common story boundaries allowed the
    use of the traditional document-based IR
    evaluation paradigm

14
SDR Track Design
  • Number of topics, length of recordings, and
    number of stories increased in successive tracks
  • TREC-6: known-item retrieval
  • TREC-7: ad hoc retrieval
  • TREC-8: ad hoc retrieval
  • TREC-9: ad hoc retrieval
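The two task types are usually scored differently. The slides do not name the measures used, so the sketch below assumes two common choices: reciprocal rank for known-item retrieval (one target document per topic) and average precision for ad hoc retrieval (a set of relevant documents per topic). The function names and document ids are illustrative.

```python
def reciprocal_rank(ranking, target):
    """1 / rank of the single known item, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id == target:
            return 1.0 / rank
    return 0.0


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank        # precision at this rank
    return total / len(relevant)        # unretrieved relevant items count as 0


if __name__ == "__main__":
    run = ["d3", "d1", "d2", "d4"]
    print(reciprocal_rank(run, "d1"))
    print(average_precision(run, {"d1", "d4"}))
```

Averaging either value over all topics gives a single score per run.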

15
TREC-6
  • Demonstrated that speech recognition and IR
    technologies were advanced enough to retrieve
    specific documents
  • Factors in transcribed speech other than
    recognition effects influence retrieval
    effectiveness
  • Missing vocabulary can also cause problems

16
TREC-7
  • Set of 23 topics and 2,866 stories was too small
    to allow confidence in conclusions

17
TREC-8
  • Same conditions as TREC-7, but the test
    collection increased to 49 topics and 21,754
    stories
  • Used a rolling language model
  • Added a story-boundaries-unknown condition

18
TREC-9
  • Confirmation of findings from TREC-8
  • Systems could find relevant passages in
    unsegmented news broadcasts
  • Comparable performance on reference, baseline,
    and speech transcripts

19
SDR Results
  • Successful systems applied good retrieval
    technology to one-best word transcripts

20
Conclusion
  • With appropriate compensation, retrieval from
    noisy texts can be as effective as retrieval from
    clean texts