1
Cross-Language Information Retrieval
  • Applied Natural Language Processing
  • October 29, 2009
  • Douglas W. Oard

2
What Do People Search For?
  • Searchers often don't clearly understand
      • The problem they are trying to solve
      • What information is needed to solve the problem
      • How to ask for that information
  • The query results from a clarification process
      • Dervin's sense-making

(Diagram labels: Need, Gap, Bridge)
3
Taylor's Model of Question Formation
  • Q1: Visceral Need
  • Q2: Conscious Need
  • Q3: Formalized Need
  • Q4: Compromised Need (Query)
(Diagram label: Intermediated Search)
4
Design Strategies
  • Foster human-machine synergy
      • Exploit complementary strengths
      • Accommodate shared weaknesses
  • Divide-and-conquer
      • Divide task into stages with well-defined interfaces
      • Continue dividing until problems are easily solved
  • Co-design related components
      • Iterative process of joint optimization

5
Human-Machine Synergy
  • Machines are good at
      • Doing simple things accurately and quickly
      • Scaling to larger collections in sublinear time
  • People are better at
      • Accurately recognizing what they are looking for
      • Evaluating intangibles such as quality
  • Both are pretty bad at
      • Mapping consistently between words and concepts

6
Process/System Co-Design
7
Supporting the Search Process
Source Selection
Choose
8
Supporting the Search Process
Source Selection
9
Search Component Model
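(Diagram components, read as flows:)
  • Information Need + Document → Human Judgment → Utility
  • Information Need → Query Formulation → Query → Query Processing (Representation Function) → Query Representation
  • Document → Document Processing (Representation Function) → Document Representation
  • Query Representation + Document Representation → Comparison Function → Retrieval Status Value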
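
A minimal Python sketch of the component model on this slide, under simplifying assumptions: the representation function is a plain bag of words, and the comparison function is a term-overlap dot product whose result plays the role of the retrieval status value. The function names (represent, compare, search) are illustrative and do not come from the slides.

```python
from collections import Counter

def represent(text):
    """Representation function: map raw text to a bag-of-words vector."""
    return Counter(text.lower().split())

def compare(query_rep, doc_rep):
    """Comparison function: return a retrieval status value (RSV).

    Assumption: the RSV is a simple dot product of term counts.
    """
    return sum(weight * doc_rep[term] for term, weight in query_rep.items())

def search(query, documents):
    """Rank documents by RSV, highest first."""
    query_rep = represent(query)
    scored = [(compare(query_rep, represent(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = ["ancient wonders of the world", "modern art"]
for rsv, doc in search("wonders of ancient world", docs):
    print(rsv, doc)
```

The same representation function is applied to both the query and the documents here; a cross-language system would need a translation step in one of the two paths.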
10
Relevance
  • Relevance relates a topic and a document
      • Duplicates are equally relevant, by definition
      • Constant over time and across users
  • Pertinence relates a task and a document
      • Accounts for quality, complexity, language, ...
  • Utility relates a user and a document
      • Accounts for prior knowledge

11
Okapi Term Weights
TF component
IDF component
12
A Ranking Function: Okapi BM25
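
The formula on this slide did not survive transcription. As a hedged illustration, the sketch below implements one commonly cited form of Okapi BM25, combining a saturating, length-normalized TF component with a Robertson-Sparck Jones style IDF component. The defaults k1=1.2 and b=0.75 and the name bm25_scores are assumptions, not values taken from the presentation, and the slide's exact variant may differ.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25.

    docs is a non-empty list of token lists. k1 and b are assumed defaults.
    """
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # IDF component (can be negative for very common terms).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # TF component with document-length normalization.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = [
    ["ancient", "wonders", "world"],
    ["modern", "world", "news"],
    ["sports", "news", "today"],
    ["weather", "report", "today"],
]
print(bm25_scores(["ancient", "wonders"], docs))
```

Only the first document contains the query terms, so it is the only one with a nonzero score in this toy collection.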
13
Estimating TF and DF for Query Terms
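(Worked example: query term e1 has four candidate translations, f1 to f4, with translation probabilities 0.4, 0.3, 0.2 and 0.1.)
TF(e1) = 0.4×20 + 0.3×5 + 0.2×2 + 0.1×50 = 14.9
DF(e1) = 0.4×50 + 0.3×40 + 0.2×30 + 0.1×200 = 58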
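
The totals on this slide (14.9 and 58) can be read as probability-weighted sums: the TF or DF of a query term is estimated by weighting the TF or DF of each candidate translation by its translation probability. A small Python sketch under that reading; the assignment of the individual counts to f1 through f4 is an assumption made for illustration, and expected_count is a hypothetical helper name.

```python
def expected_count(translation_probs, counts):
    """Probability-weighted sum of per-translation counts."""
    return sum(prob * counts[term] for term, prob in translation_probs.items())

# Translation probabilities for query term e1 (from the slide).
probs = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
# Counts paired with each probability as in the slide's sums;
# which count belongs to which f is an assumption.
tf = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
df = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}

print(round(expected_count(probs, tf), 2))  # estimated TF of e1
print(round(expected_count(probs, df), 2))  # estimated DF of e1
```

The estimated values can then be used wherever a ranking function expects the TF and DF of a query term.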
14
Learning to Translate
  • Lexicons
      • Phrase books, bilingual dictionaries, ...
  • Large text collections
      • Translations (parallel)
      • Similar topics (comparable)
  • Similarity
      • Similar pronunciation, similar users
  • People

15
(Image labels: Hieroglyphic, Demotic, Greek)
16
Statistical Machine Translation
Señora Presidenta , había pedido a la administración del Parlamento que garantizase
Madam President , I had asked the administration to ensure that
17
Bidirectional Translation
wonders of ancient world (CLEF Topic 151)
18
Experiment Setup
  • Test collections
  • Document processing
      • Stemming, accent-removal (CLEF French)
      • Word segmentation, encoding conversion (TREC Chinese)
      • Stopword removal (all collections)
  • Training statistical translation models (GIZA)

Source               CLEF01-03   TREC-5,6
Query language       English     English
Document language    French      Chinese
# of topics          151         54
# of documents       87,191      139,801
Avg # of rel docs    23          95

Languages            English-French          English-Chinese
Parallel corpus      Europarl                FBIS et al.
# of sentence pairs  672,247                 1,583,807
Models (iterations)  M1(10), HMM(5), M4(5)   M1(10)
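
Two of the document processing steps listed on this slide, accent removal and stopword removal, can be sketched with the Python standard library alone. The stopword list below is a tiny hypothetical sample, not the list used in the experiments, and stemming is omitted because it needs a language-specific stemmer.

```python
import unicodedata

# Hypothetical, deliberately tiny French stopword sample (an assumption).
FRENCH_STOPWORDS = {"le", "la", "les", "de", "du", "des", "et"}

def strip_accents(text):
    """Remove combining accent marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(text, stopwords=FRENCH_STOPWORDS):
    """Lowercase, strip accents, split on whitespace, drop stopwords."""
    tokens = strip_accents(text.lower()).split()
    return [tok for tok in tokens if tok not in stopwords]

print(preprocess("Les merveilles de la Grèce antique"))
# ['merveilles', 'grece', 'antique']
```

Applying the same normalization to queries and documents keeps accented and unaccented spellings of a word in the same index entry.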
19
(No Transcript)
20
Pruning Translations
Translations (probability): f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01)

Cumulative Probability Threshold   Translations
0.0                                f1
0.1                                f1
0.2                                f1
0.3                                f1
0.4                                f1 f2
0.5                                f1 f2
0.6                                f1 f2 f3
0.7                                f1 f2 f3 f4
0.8                                f1 f2 f3 f4 f5
0.9                                f1 f2 f3 f4 f5 f6 f7
1.0                                f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12
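
The pruning rule on this slide can be sketched as follows: rank the candidate translations by probability and keep adding them until their cumulative probability reaches the threshold, always keeping at least the top one. This is a minimal Python sketch under that reading; prune_translations and the eps tolerance are illustrative choices, not names or settings from the presentation.

```python
def prune_translations(candidates, threshold, eps=1e-9):
    """Keep the most probable translations until their cumulative
    probability reaches the threshold; at least one is always kept."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for term, prob in ranked:
        kept.append(term)
        cumulative += prob
        # eps guards against floating-point rounding at exact boundaries.
        if cumulative >= threshold - eps:
            break
    return kept

# Translation probabilities as listed on the slide.
candidates = {
    "f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09,
    "f5": 0.08, "f6": 0.05, "f7": 0.04, "f8": 0.03,
    "f9": 0.03, "f10": 0.02, "f11": 0.01, "f12": 0.01,
}
for threshold in (0.0, 0.5, 0.9, 1.0):
    print(threshold, prune_translations(candidates, threshold))
```

A low threshold keeps only the single most likely translation, while a threshold of 1.0 keeps every candidate, so the threshold trades translation coverage against noise.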
21
Unidirectional without Synonyms (PSQ)
(Plot panels: CLEF French; TREC-5,6 Chinese)
  • Statistical significance vs monolingual (Wilcoxon signed rank test)
      • CLEF French: worse at peak
      • TREC-5,6 Chinese: worse at peak

22
Bidirectional with Synonyms (DAMM)
(Plot panels: CLEF French; TREC-5,6 Chinese)
  • DAMM significantly outperformed PSQ
  • DAMM is statistically indistinguishable from
    monolingual at peak
  • IMM nearly as good as DAMM for French, but not
    for Chinese

23
Indexing Time

Dictionary-based vector translation, single Sun SPARC in 2001
24
The Problem Space
  • Retrospective search
      • Web search
      • Specialized services (medicine, law, patents)
      • Help desks
  • Real-time filtering
      • Email spam
      • Web parental control
      • News personalization
  • Real-time interaction
      • Instant messaging
      • Chat rooms
      • Teleconferences

25
Making a Market
  • Multitude of potential applications
      • Retrospective search, email, IM, chat, ...
      • Natural consequence of language diversity
  • Limiting factor is translation readability
      • Searchability is mostly a solved problem
  • Leveraging human translation has potential
      • Translation routing, volunteers, caching