Translingual Topic Tracking with PRISE

1
Translingual Topic Tracking with PRISE
  • Gina-Anne Levow and Douglas W. Oard
  • University of Maryland
  • February 28, 2000

2
Roadmap
  • The signal to noise perspective
  • Our topic tracking system
  • Boosting signal
  • Reducing noise
  • Future directions

3
Translingual Tracking Challenges
  • Segmentation of text adds noise
      • Unknown words
  • Transcription of speech adds noise
      • Unknown words
      • Easily confused words (e.g., homophones)
  • Translation adds noise
      • Vocabulary mismatch with ASR / segmentation
      • Incorrect translation selection

4
Improving the Signal to Noise Ratio
  • Translation coverage
      • Enrich the term list using large dictionaries
  • Translation selection
      • Statistical evidence from comparable corpora
  • Enriching indexing vocabulary
      • Add related terms from comparable corpora
  • Score normalization
      • Learn source dependence from dry-run collection

5
Preview
  • Focusing on noise alone is not enough
  • Signal boosting is a big win
  • Baseline: Systran
      • Goal: choose the best single translation
  • Two signal-boosting strategies beat Systran
      • Choose the best two translations
      • Add related terms for indexing
        (found in related documents)

6
Improvements Since TDT-2
  • Weight selection
      • PRISE bm25idf
  • Query representation
      • Vector of 180 most selective terms by χ² test
  • Two-pass normalization
      • Source-specific, 5 source classes
          • NYT, APW, Eng. Speech, Man. Text, Man. Speech
      • Topic-specific
          • Average of example story scores
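
The query representation on this slide (the most selective terms by χ² test) can be illustrated with a minimal sketch. The slide gives only the cutoff of 180 terms, so the 2x2 contingency table over on-topic and off-topic stories, the function names, and the restriction to terms positively associated with the topic are all assumptions made for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-squared statistic for a 2x2 table.

    a: on-topic stories containing the term
    b: off-topic stories containing the term
    c: on-topic stories lacking the term
    d: off-topic stories lacking the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(on_topic, off_topic, k=180):
    """Return up to k terms ranked by chi-squared (assumed procedure)."""
    on_df = Counter(t for doc in on_topic for t in set(doc))
    off_df = Counter(t for doc in off_topic for t in set(doc))
    n_on, n_off = len(on_topic), len(off_topic)
    scored = []
    for term, a in on_df.items():
        b = off_df.get(term, 0)
        # Assumption: keep only terms that occur relatively more often on-topic.
        if a * (n_off - b) > b * (n_on - a):
            scored.append((chi_square(a, b, n_on - a, n_off - b), term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]
```

Each story is passed as a list of tokens. A term that appears in every story, such as a common function word, fails the association check and never enters the query vector.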

7
[Figure: source-independent vs. source-dependent normalization; panels: Mandarin (All Sources), English (All Sources)]
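
The comparison on this slide contrasts source-independent with source-dependent normalization. The slides do not give the formulas, so the sketch below is only one plausible reading of the two-pass scheme: pass one standardizes a raw score with the mean and standard deviation learned per source class from dry-run scores (a z-score, which is an assumption), and pass two divides by the average score of the topic's example stories. All function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def fit_source_stats(dry_run_scores):
    """Learn (mean, std) per source class from dry-run scores.

    dry_run_scores maps a source class name to a list of raw scores.
    """
    stats = {}
    for source, scores in dry_run_scores.items():
        sd = pstdev(scores)
        # Guard against a zero spread so the division below is safe.
        stats[source] = (mean(scores), sd if sd > 0 else 1.0)
    return stats


def normalize(raw_score, source, source_stats, example_scores):
    """Two-pass normalization (assumed form, not taken from the slides)."""
    mu, sd = source_stats[source]
    z = (raw_score - mu) / sd            # pass 1: source-specific
    topic_ref = mean(example_scores)     # pass 2: topic-specific reference
    return z / topic_ref if topic_ref else z
```

Here `example_scores` stands for the already source-normalized scores of the topic's example stories against the topic query.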
8
Translingual Approaches
  • Indexing strategies (boosting signal)
      • Post-translation document expansion
      • n-best translation
  • Translation tweaks (reducing noise)
      • Enriched bilingual term list
      • Corpus-based translation selection
      • Pre-translation Mandarin stopword removal

9
Translingual Runs
(official run scored by NIST)
10
Document Expansion
[Diagram: document expansion pipeline. Labels: Mandarin and English sources (BN, NWT); ASR Transcript; NMSU Segmenter; Word-to-Word Translation; Single Document; PRISE; Comp. English Corpus; Top 5; Term Selection; Documents to Index; Query Vector; PRISE; Results]
11
Mandarin Newswire Text
12
Mandarin Broadcast News
13
Why Document Expansion Works
  • Story-length objects provide useful context
  • Ranked retrieval finds signal amid the noise
  • Selective terms discriminate among documents
  • Enrich index with high IDF terms from top documents
  • Similar strategies work well in other applications
      • TREC-7 SDR (Singhal et al., 1998)
      • CLIR query translation (Ballesteros & Croft, 1997)
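
As a rough illustration of post-translation document expansion, the sketch below uses a translated story as a query against a comparable English corpus, keeps the top-ranked documents, and appends their highest-IDF terms to the story before indexing. The ranking function here is a simple summed-IDF overlap standing in for PRISE, and the function names and the `new_terms` cutoff are assumptions. Only the use of the top 5 documents comes from the slides.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Inverse document frequency for every term in the corpus."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n / count) for t, count in df.items()}


def expand_document(doc, corpus, top_docs=5, new_terms=10):
    """Append high-IDF terms from the best-matching corpus documents."""
    idf = idf_table(corpus)
    doc_terms = set(doc)
    # Stand-in ranking: sum of IDF over terms shared with the story.
    ranked = sorted(
        corpus,
        key=lambda d: sum(idf[t] for t in doc_terms & set(d)),
        reverse=True,
    )[:top_docs]
    candidates = {t for d in ranked for t in d} - doc_terms
    # Most selective terms first; ties broken alphabetically.
    best = sorted(candidates, key=lambda t: (-idf[t], t))[:new_terms]
    return list(doc) + best
```

The expanded token list is what would be indexed in place of the single translated story.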

14
n-best Translation
  • We generally used 1-best translation
      • Highest unigram frequency in comparable corpus
  • Tried 2-best: the two highest-ranked translations
      • Duplicating unique translations where necessary
  • Should reduce miss rate
      • But at what cost in false alarms?
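
The selection rule above can be sketched as follows. The reading of "duplicating unique translations" used here is an assumption: when a source term has fewer than n distinct translations, the available ones are repeated so that every source term contributes n tokens. The function name and the frequency table are hypothetical.

```python
def n_best_translations(candidates, unigram_freq, n=2):
    """Pick the n candidates with the highest unigram frequency.

    candidates: candidate English translations for one source term
    unigram_freq: word -> count in the comparable English corpus
    """
    unique = list(dict.fromkeys(candidates))  # drop repeats, keep order
    ranked = sorted(unique, key=lambda w: (-unigram_freq.get(w, 0), w))
    best = ranked[:n]
    # Assumed padding rule: repeat translations when fewer than n exist.
    while best and len(best) < n:
        best.append(ranked[len(best) % len(ranked)])
    return best
```

With `n=1` this reduces to the 1-best strategy that the slide describes as the default.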

15
Mandarin Newswire Text
16
Mandarin Broadcast News
17
Comparison With Systran
  • Used baseline translations provided by LDC
      • Untranslated words not used
      • No document expansion
  • Systran produces 1-best translations
      • Natural comparison is with our 2-best run

18
Mandarin Newswire Text
19
Mandarin Broadcast News
20
Bilingual Term List Enrichment
  • Two sources of candidate translations
      • LDC Chinese-English term list (version 2)
      • CETA (Optilex) dictionary
          • >250K entries, hand-built from >250 sources
  • Merging strategy
      • Used only general-purpose sources in CETA
      • Filtered out definitions
      • Removed parenthetical clauses
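
A minimal sketch of the merging steps listed above, under stated assumptions: term lists are modeled as dictionaries from a Mandarin term to a list of English strings, and an entry is treated as a definition (and dropped) when it has more than `max_words` words. That word-count heuristic, the function names, and the sample entries in the usage note are illustrative only and are not taken from the LDC or CETA resources.

```python
import re

# Matches a parenthetical clause together with any whitespace before it.
PAREN = re.compile(r"\s*\([^)]*\)")


def clean_translation(text, max_words=3):
    """Strip parentheticals; return None for definition-like entries."""
    text = PAREN.sub("", text).strip()
    if not text or len(text.split()) > max_words:
        return None
    return text.lower()


def merge_term_lists(*term_lists, max_words=3):
    """Merge several term lists, keeping the first-seen translation order."""
    merged = {}
    for term_list in term_lists:
        for source_term, translations in term_list.items():
            kept = merged.setdefault(source_term, [])
            for raw in translations:
                cleaned = clean_translation(raw, max_words)
                if cleaned and cleaned not in kept:
                    kept.append(cleaned)
    # Drop source terms whose translations were all filtered out.
    return {s: t for s, t in merged.items() if t}
```

For example, merging `{"银行": ["bank"]}` with `{"银行": ["bank (financial institution)", "a place where money is kept and lent"]}` leaves the single translation `bank`.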

21
Term List Statistics
22
Broadcast News
Newswire Text
23
Translation Preference
  • Unigram statistics guided translation selection
      • Minimize effect of rare translations, misspellings, ...
  • Based on dry run stories and rolling update
      • Backoff to balanced corpus for unknown words
      • Brown corpus: variety of genres
  • Compared with use of balanced corpus alone
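
The preference scheme above might look roughly like the sketch below: unigram counts come from a topical English corpus that grows as stories arrive (the rolling update), and a word unseen there falls back to its relative frequency in a balanced corpus. The class name, the use of relative frequencies, and the direct comparison of topical and backoff estimates are simplifying assumptions, not details from the slides.

```python
from collections import Counter


class TranslationPreference:
    """Unigram-based translation chooser with backoff (illustrative)."""

    def __init__(self, balanced_counts):
        self.topical = Counter()
        self.balanced = Counter(balanced_counts)
        self.balanced_total = sum(self.balanced.values())

    def update(self, english_tokens):
        """Rolling update: fold newly seen English text into the counts."""
        self.topical.update(english_tokens)

    def frequency(self, word):
        """Relative frequency, backing off for words unseen in topical data."""
        if word in self.topical:
            return self.topical[word] / sum(self.topical.values())
        if self.balanced_total == 0:
            return 0.0
        return self.balanced.get(word, 0) / self.balanced_total

    def choose(self, candidates):
        """Return the candidate translation with the highest frequency."""
        return max(candidates, key=self.frequency)
```

Passing no topical text at all gives the "balanced corpus alone" condition that the slide uses for comparison.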

24
Mandarin Newswire Text
25
Pre-Translation Stopword Removal
  • Common words don't help retrieval much
      • But mistranslations might hurt
  • We built a Mandarin stopword list
      • Processed dictionary to identify function words
      • Added the top 300 words in LDC frequency list
      • Filtered by two speakers of Mandarin
  • Suppressed translation of stopwords
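
The steps above can be sketched as follows. The manual filtering by two Mandarin speakers is modeled as a `rejected` set of words they removed from the candidate list, which is an assumption about how that review was applied. The function names and the `translate` callback are hypothetical.

```python
def build_stopword_list(function_words, frequency_ranked, top_n=300, rejected=()):
    """Union of dictionary function words and the top_n most frequent words,
    minus anything the human reviewers rejected (assumed review step)."""
    candidates = set(function_words) | set(frequency_ranked[:top_n])
    return candidates - set(rejected)


def translate_without_stopwords(tokens, stopwords, translate):
    """Translate segmented Mandarin tokens, skipping stopwords entirely.

    translate: callable mapping one token to a list of English words.
    """
    english = []
    for token in tokens:
        if token in stopwords:
            continue  # suppress translation of stopwords
        english.extend(translate(token))
    return english
```

Stopwords are dropped before lookup, so a mistranslated function word cannot add noise to the English index.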

26
Mandarin Newswire Text
27
Summary
  • 3 techniques produced improvements
      • Source-dependent normalization
      • Post-translation document expansion
      • n-best translation
  • 3 techniques had little effect
      • Bilingual term list enrichment
      • Comparable-corpus-based translation preference
      • Pre-translation stopword removal

28
Future Directions
  • Statistical significance
      • Can this be added to the scoring software?
  • Pre-translation document expansion
      • An effective approach in CLIR query translation
  • Further experiments with n-best translation
      • Probably using a weighted strategy
      • Structured translation (Pirkola, 1998)
      • Some concern about efficiency, though
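
For readers unfamiliar with structured translation, the sketch below shows the general idea as it is commonly described: all translations of one source term are treated as a single synonym class, so term frequency is summed over the translations and document frequency counts documents containing any of them. This is a simplified illustration with a plain tf·idf weight, not the weighting the authors planned to use, and the function names are hypothetical.

```python
import math


def structured_term_stats(translations, docs):
    """Per-document tf and overall df for a synonym class of translations."""
    synonyms = set(translations)
    tfs = [sum(1 for token in doc if token in synonyms) for doc in docs]
    df = sum(1 for tf in tfs if tf > 0)
    return tfs, df


def structured_score(tf, df, n_docs):
    """Plain tf * idf weight computed on the class-level statistics."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)
```

Computing the document frequency of a synonym class requires combining the document sets of every translation, which may be one source of the efficiency concern noted on the slide.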

29
Where is the Perfect TDT System?
Run TDT-4 In Nova Scotia!
BBN
Penn
Maryland