Title: Cross-Language Information Retrieval
1Cross-Language Information Retrieval
- Applied Natural Language Processing
- October 29, 2009
- Douglas W. Oard
2What Do People Search For?
- Searchers often don't clearly understand
- The problem they are trying to solve
- What information is needed to solve the problem
- How to ask for that information
- The query results from a clarification process
- Dervin's sense-making
Need
Gap
Bridge
3Taylor's Model of Question Formation
Q1 Visceral Need
Q2 Conscious Need
Intermediated Search
Q3 Formalized Need
Q4 Compromised Need (Query)
4Design Strategies
- Foster human-machine synergy
- Exploit complementary strengths
- Accommodate shared weaknesses
- Divide-and-conquer
- Divide task into stages with well-defined interfaces
- Continue dividing until problems are easily solved
- Co-design related components
- Iterative process of joint optimization
5Human-Machine Synergy
- Machines are good at
- Doing simple things accurately and quickly
- Scaling to larger collections in sublinear time
- People are better at
- Accurately recognizing what they are looking for
- Evaluating intangibles such as quality
- Both are pretty bad at
- Mapping consistently between words and concepts
6Process/System Co-Design
7Supporting the Search Process
Source Selection
Choose
8Supporting the Search Process
Source Selection
9Search Component Model
Utility
Human Judgment
Information Need
Document
Query Formulation
Query
Document Processing
Query Processing
Representation Function
Representation Function
Query Representation
Document Representation
Comparison Function
Retrieval Status Value
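The component model above can be read as a small pipeline: a representation function is applied to the query and to each document, and a comparison function turns each pair of representations into a retrieval status value used for ranking. The following is a minimal sketch, assuming a bag-of-words representation and a simple term-overlap comparison. Neither choice comes from the slide, and the function names `represent`, `compare`, and `rank` are hypothetical.

```python
from collections import Counter


def represent(text):
    """Representation function (assumed): lowercase bag of words."""
    return Counter(text.lower().split())


def compare(query_rep, doc_rep):
    """Comparison function (assumed): weighted term overlap.

    Returns the retrieval status value for one query/document pair.
    """
    return sum(weight * doc_rep[term] for term, weight in query_rep.items())


def rank(query, documents):
    """Score every document against the query, best first."""
    query_rep = represent(query)
    scored = [(compare(query_rep, represent(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = ["ancient wonders of the world", "modern art"]
ranking = rank("wonders of ancient world", docs)
```

In this sketch the same representation function is used on both sides. The diagram allows the query and document representation functions to differ, which is where translation enters in the cross-language case.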
10Relevance
- Relevance relates a topic and a document
- Duplicates are equally relevant, by definition
- Constant over time and across users
- Pertinence relates a task and a document
- Accounts for quality, complexity, language,
- Utility relates a user and a document
- Accounts for prior knowledge
11Okapi Term Weights
TF component
IDF component
12A Ranking Function Okapi BM25
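The formulas on these two slides did not survive as text. As a rough illustration only, the sketch below uses a commonly cited form of the Okapi BM25 term weight, with a TF component normalized by document length and an IDF component based on document frequency. The parameter defaults `k1=1.2` and `b=0.75` and the function names are assumptions, not values taken from the slides, and the query-term-frequency factor is omitted for brevity.

```python
import math


def okapi_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Okapi-style weight of one term in one document (assumed form)."""
    # TF component: saturates as tf grows, normalized by document length.
    tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
    # IDF component: rarer terms (small df) get larger weights.
    idf_part = math.log((num_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part


def bm25_score(query_terms, doc_tf, df, doc_len, avg_doc_len, num_docs):
    """Sum the term weights over query terms that occur in the document."""
    return sum(
        okapi_weight(doc_tf[t], df[t], doc_len, avg_doc_len, num_docs)
        for t in query_terms
        if doc_tf.get(t, 0) > 0 and t in df
    )


# Hypothetical numbers, for illustration only.
score = bm25_score(
    ["ancient", "wonders"],
    doc_tf={"ancient": 3, "wonders": 1},
    df={"ancient": 10, "wonders": 5},
    doc_len=100,
    avg_doc_len=100,
    num_docs=1000,
)
```

Because `tf` and `df` enter only as numbers, the same function accepts the fractional TF and DF estimates that a cross-language system computes for query terms.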
13Estimating TF and DF for Query Terms
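Query term e1, with translation probabilities p(fi|e1):
Translation  p(fi|e1)  TF  DF
f1           0.4       20  50
f2           0.3       5   40
f3           0.2       2   30
f4           0.1       50  200
TF(e1) = 0.4×20 + 0.3×5 + 0.2×2 + 0.1×50 = 14.9
DF(e1) = 0.4×50 + 0.3×40 + 0.2×30 + 0.1×200 = 58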
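The arithmetic on this slide is a probability-weighted sum: the TF and DF of the query term e1 are estimated from the TF and DF of its translations f1 to f4, each weighted by its translation probability. A minimal sketch using the slide's own numbers follows; the function name `estimate_query_term_stats` is hypothetical.

```python
def estimate_query_term_stats(translation_probs, tf, df):
    """Estimate TF and DF of a query term from its translations.

    translation_probs maps each translation f to p(f | e);
    tf and df map each translation to its term and document frequency.
    """
    est_tf = sum(p * tf.get(f, 0) for f, p in translation_probs.items())
    est_df = sum(p * df.get(f, 0) for f, p in translation_probs.items())
    return est_tf, est_df


# Values from the slide's example for query term e1.
probs = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
tf = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
df = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}

est_tf, est_df = estimate_query_term_stats(probs, tf, df)
print(round(est_tf, 1), round(est_df, 1))
```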
14Learning to Translate
- Lexicons
- Phrase books, bilingual dictionaries,
- Large text collections
- Translations (parallel)
- Similar topics (comparable)
- Similarity
- Similar pronunciation, similar users
- People
15Hieroglyphic
Demotic
Greek
16Statistical Machine Translation
Señora Presidenta , había pedido a la administración del Parlamento que garantizase
Madam President , I had asked the administration to ensure that
17Bidirectional Translation
wonders of ancient world (CLEF Topic 151)
18Experiment Setup
- Test collections
- Document processing
- Stemming, accent-removal (CLEF French)
- Word segmentation, encoding conversion (TREC Chinese)
- Stopword removal (all collections)
- Training statistical translation models (GIZA++)
Source               CLEF01-03   TREC-5,6
Query language       English     English
Document language    French      Chinese
# of topics          151         54
# of documents       87,191      139,801
Avg # of rel docs    23          95
Languages            English-French           English-Chinese
Parallel corpus      Europarl                 FBIS et al.
# of sentence pairs  672,247                  1,583,807
Models (iterations)  M1(10), HMM(5), M4(5)    M1(10)
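Two of the document-processing steps listed on this slide, accent removal and stopword removal, can be sketched with the Python standard library. This is only an illustration: the tiny stopword list is hypothetical, whitespace tokenization is an assumption, and stemming, word segmentation, and encoding conversion are left out.

```python
import unicodedata

# Hypothetical, deliberately tiny French stopword list (accents already removed).
STOPWORDS = {"le", "la", "les", "de", "du", "des", "et", "un", "une"}


def remove_accents(text):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Lowercase, remove accents, split on whitespace, drop stopwords."""
    tokens = remove_accents(text.lower()).split()
    return [token for token in tokens if token not in STOPWORDS]


tokens = preprocess("Les merveilles du monde antique")
```

Whatever normalization is chosen has to be applied identically to the documents and to the translated query terms, otherwise the two sides will not match at comparison time.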
20Pruning Translations
Translation probabilities for one term:
f1 (0.32) f2 (0.21) f3 (0.11) f4 (0.09) f5 (0.08) f6 (0.05) f7 (0.04) f8 (0.03) f9 (0.03) f10 (0.02) f11 (0.01) f12 (0.01)
Cumulative Probability Threshold  Translations
0.0  f1
0.1  f1
0.2  f1
0.3  f1
0.4  f1 f2
0.5  f1 f2
0.6  f1 f2 f3
0.7  f1 f2 f3 f4
0.8  f1 f2 f3 f4 f5
0.9  f1 f2 f3 f4 f5 f6 f7
1.0  f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12
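The pruning shown on this slide keeps translations in decreasing order of probability until their cumulative probability reaches the threshold, always keeping at least the most likely one. A minimal sketch with the slide's twelve probabilities follows. The function name `prune_translations` and the small tolerance `eps` (added to absorb floating-point rounding at thresholds such as 0.9) are assumptions.

```python
def prune_translations(translations, threshold, eps=1e-9):
    """Keep the most probable translations up to a cumulative threshold.

    translations is a list of (term, probability) pairs.
    """
    kept = []
    cumulative = 0.0
    # Stable sort, so equally probable translations keep their given order.
    for term, prob in sorted(translations, key=lambda pair: pair[1], reverse=True):
        kept.append(term)
        cumulative += prob
        if cumulative >= threshold - eps:
            break
    return kept


# Probabilities from the slide.
translations = [
    ("f1", 0.32), ("f2", 0.21), ("f3", 0.11), ("f4", 0.09),
    ("f5", 0.08), ("f6", 0.05), ("f7", 0.04), ("f8", 0.03),
    ("f9", 0.03), ("f10", 0.02), ("f11", 0.01), ("f12", 0.01),
]

for step in range(11):
    threshold = step / 10
    print(threshold, " ".join(prune_translations(translations, threshold)))
```

A threshold of 0.0 amounts to one-best translation, and 1.0 keeps every translation the model proposes.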
21Unidirectional without Synonyms (PSQ)
CLEF French
TREC-5,6 Chinese
- Statistical significance vs monolingual (Wilcoxon signed rank test)
- CLEF French worse at peak
- TREC-5,6 Chinese worse at peak
22Bidirectional with Synonyms (DAMM)
CLEF French
TREC-5,6 Chinese
- DAMM significantly outperformed PSQ
- DAMM is statistically indistinguishable from monolingual at peak
- IMM nearly as good as DAMM for French, but not for Chinese
23Indexing Time
Dictionary-based vector translation, single Sun SPARC in 2001
24The Problem Space
- Retrospective search
- Web search
- Specialized services (medicine, law, patents)
- Help desks
- Real-time filtering
- Email spam
- Web parental control
- News personalization
- Real-time interaction
- Instant messaging
- Chat rooms
- Teleconferences
25Making a Market
- Multitude of potential applications
- Retrospective search, email, IM, chat,
- Natural consequence of language diversity
- Limiting factor is translation readability
- Searchability is mostly a solved problem
- Leveraging human translation has potential
- Translation routing, volunteers, caching