Information Retrieval and Web Search - PowerPoint PPT Presentation

Slides: 28

Provided by: cse54

Transcript and Presenter's Notes

Title: Information Retrieval and Web Search

1
Information Retrieval and Web Search

Cross Language Information Retrieval
Instructor: Rada Mihalcea
Some of the slides are from a course taught by Doug Oard at U. Maryland

2
The General Problem

Find documents written in any language
Using queries expressed in a single language

3
Why Do Cross-Language IR?

When users can read several languages
Eliminates multiple queries
Query in most fluent language
Monolingual users can also benefit
If translations can be provided
If it suffices to know that a document exists
If text captions are used to search for images

4
Top Ten Languages on the Web
Source: internetworldstats.com, March 2011
5
Demand Side: Top Spoken Languages
Source: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers
6
Search Technology
[Diagram components: Language Identification; Chinese Feature Assignment; English Feature Assignment; Chinese Query with Chinese Feature Assignment; Monolingual Chinese Matching (ranked output 1: 0.72, 2: 0.48); Cross-Language Matching (ranked output 3: 0.91, 4: 0.57, 5: 0.36)]
7
Language Identification

Can be specified using metadata
Included in HTTP and HTML
Can be determined using word-scale features
Which dictionary gets the most hits?
Can be determined using subword features
Letter n-grams, for example
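
The subword approach above can be sketched as follows. This is a minimal illustration, not the method from the lecture: it assumes letter trigrams, cosine similarity between n-gram count profiles, and two tiny hand-written training samples (SAMPLES is hypothetical and far too small for real use).

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count letter n-grams, with a space marking each word boundary."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical training text, one short sample per language (assumption).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat sleeps in the house",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso "
               "y el gato duerme en la casa",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify_language(text):
    """Return the language whose n-gram profile is closest to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))

print(identify_language("the dog sleeps in the house"))
print(identify_language("el perro duerme en la casa"))
```

The word-scale alternative ("which dictionary gets the most hits?") would replace the n-gram profiles with one word list per language and count matches.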

8
Design Decisions

What to index?
Free text or controlled vocabulary
What to translate?
Queries or documents
Where to get translation knowledge?

9
Query Vector Translation
[Diagram components: Chinese Query Features; Query (Vector) Translation; English Document Features; Monolingual English Matching (ranked output 3: 0.91, 4: 0.57, 5: 0.36)]
10
Document Vector Translation
[Diagram components: Chinese Query Features; English Document Features; Document (Vector) Translation; Monolingual Chinese Matching (ranked output 3: 0.91, 4: 0.57, 5: 0.36)]
11
Matching Interlingual Representations
[Diagram components: Chinese Query Features; Query Folding In; English Document Features; Document Folding In; Interlingual Matching (ranked output 3: 0.91, 4: 0.57, 5: 0.36)]
12
Query vs. Document Translation

Query translation
Very efficient for short queries
Not as big an advantage for relevance feedback
Hard to resolve ambiguous query terms
Document translation
May be needed by the selection interface
And supports adaptive filtering well
Slow, but only need to do it once per document
Poor scale-up to large numbers of languages

13
Cross-Language Text Retrieval
Query Translation
Document Translation
[Taxonomy diagram; node labels as grouped on the slide:]
Text Translation, Vector Translation
Controlled Vocabulary, Free Text
Knowledge-based, Corpus-based
Ontology-based, Dictionary-based, Thesaurus-based
Term-aligned, Sentence-aligned, Document-aligned, Unaligned
Parallel, Comparable
14
Translation Knowledge

A lexicon
e.g., extract term list from a bilingual dictionary
Corpora
Parallel or comparable, linked or unlinked
Algorithmic
e.g., transliteration rules, cognate matching
The user

15
Types of Lexicons

Ontology
Representation of concepts and relationships
Thesaurus
Ontology specialized for retrieval
Bilingual lexicon
Ontology specialized for machine translation
Bilingual dictionary
Ontology specialized for human translation

16
Multilingual Thesauri

Adapt the knowledge structure
Cultural differences influence indexing choices
Use language-independent descriptors
Matched to a unique term in each language
Three construction techniques
Build it from scratch
Translate an existing thesaurus
Merge monolingual thesauri

17
Machine Readable Dictionaries

Based on printed bilingual dictionaries
Becoming widely available
Used to produce bilingual term lists
Cross-language term mappings are accessible
Sometimes listed in order of most common usage
Some knowledge structure is also present
Hard to extract and represent automatically
The challenge is to pick the right translation

18
Unconstrained Query Translation

Replace each word with every translation
Typically 5-10 translations per word
About 50% of monolingual effectiveness
Ambiguity is a serious problem
Example: Fly (English)
8 word senses (e.g., to fly a flag)
13 Spanish translations (enarbolar, ondear, ...)
38 English retranslations (hoist, brandish, lift)
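
A minimal sketch of the "replace each word with every translation" step. TERM_LIST is a hypothetical bilingual term list with illustrative entries, and keeping untranslatable words unchanged is an assumption of this sketch rather than something the slide specifies.

```python
# Hypothetical English-to-Spanish term list (illustrative entries only).
TERM_LIST = {
    "fly": ["volar", "enarbolar", "ondear"],
    "flag": ["bandera"],
}

def translate_query(query, term_list):
    """Replace each query word with every translation in the term list.

    Words without an entry are passed through unchanged (assumption).
    """
    translated = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    return translated

print(translate_query("fly flag", TERM_LIST))
```

Because every sense of "fly" contributes its translations, the translated query grows quickly and mixes unrelated senses, which is the ambiguity problem the slide describes.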

19
Phrase Indexing

Improves retrieval effectiveness two ways
Phrases are less ambiguous than single words
Idiomatic phrases translate as a single concept
Three ways to identify phrases
Semantic (e.g., appears in a dictionary)
Syntactic (e.g., parse as a noun phrase)
Co-occurrence (words found together often)
Semantic phrase results are impressive
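
The co-occurrence option can be illustrated with a simple count of adjacent word pairs. This is a sketch under stated assumptions: phrases are limited to two adjacent words, and the threshold min_count and the sample documents are hypothetical.

```python
from collections import Counter

def frequent_bigrams(documents, min_count=2):
    """Return adjacent word pairs that occur at least min_count times."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        counts.update(zip(words, words[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical sample documents.
docs = [
    "machine translation of queries",
    "statistical machine translation",
    "translation of documents by machine",
]
print(frequent_bigrams(docs))
```

On this sample the function returns both ("machine", "translation") and ("translation", "of"). The second pair is frequent but not a useful phrase, which suggests why raw co-occurrence is usually combined with syntactic or dictionary evidence.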

20
Types of Bilingual Corpora

Parallel corpora: translation-equivalent pairs
Document pairs
Sentence pairs
Term pairs
Comparable corpora
Content-equivalent document pairs
E.g., newspaper articles in different languages, on the same day (for the same event)
Unaligned corpora
Content from the same domain

21
How to Use Bilingual Corpora?

Pseudo-relevance feedback
Enter query terms in Spanish
Find top Spanish documents in parallel corpus
Construct a query from English translations
Perform a monolingual free text search

[Diagram components: French Query Terms; French Text Retrieval System; Parallel Corpus; Top ranked French Documents; English Translations; Alta Vista; English Web Pages]
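
The pseudo-relevance feedback steps on this slide can be sketched as follows. Everything concrete here is an assumption for illustration: the three-pair PARALLEL corpus, ranking by raw term overlap, the small STOPWORDS set, and the choice of the most frequent English terms as the new query. The final monolingual search is left to whatever English retrieval system is available.

```python
from collections import Counter

# Hypothetical parallel corpus: (Spanish text, English translation) pairs.
PARALLEL = [
    ("el gobierno aprobó la ley de energía",
     "the government approved the energy law"),
    ("la energía solar reduce costos",
     "solar energy reduces costs"),
    ("el equipo ganó el partido de fútbol",
     "the team won the football match"),
]
STOPWORDS = {"the", "of", "a"}  # minimal illustrative stopword list

def overlap(query_terms, text):
    """Count how many times the query terms occur in the text."""
    words = text.split()
    return sum(words.count(term) for term in query_terms)

def cross_language_query(spanish_query, corpus, top_docs=2, top_terms=3):
    """Build an English query from the translations of the top Spanish hits."""
    terms = spanish_query.lower().split()
    # Step 2: rank the Spanish sides of the corpus against the query.
    ranked = sorted(corpus, key=lambda pair: overlap(terms, pair[0]),
                    reverse=True)
    # Step 3: collect terms from the English sides of the top documents.
    counts = Counter()
    for spanish, english in ranked[:top_docs]:
        if overlap(terms, spanish) == 0:
            continue  # skip documents that do not match the query at all
        counts.update(w for w in english.split() if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_terms)]

print(cross_language_query("energía solar", PARALLEL))
```

For the query "energía solar", the term "energy" appears in both top-ranked translations and so leads the constructed English query.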
22
How to Use Bilingual Corpora?

Count how often each term occurs in each pair
Treat each pair as a single document

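[Table: term counts per document pair. Columns: English Terms E1, E2, E3, E4, E5 and Spanish Terms S1, S2, S3, S4. Only the counts in each row survive in this transcript; their column positions do not.]
Doc 1: 4, 2, 2, 1
Doc 2: 8, 4, 4, 2
Doc 3: 2, 2, 1, 2
Doc 4: 2, 1, 2, 1
Doc 5: 4, 1, 2, 1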
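
The counting step on this slide can be sketched as below. The PAIRS data is hypothetical, and comparing term vectors with cosine similarity is an assumption of this sketch: the slide only says to count terms per pair, and cosine is one common way to turn those counts into candidate translations.

```python
from collections import Counter
import math

# Hypothetical aligned (English, Spanish) document pairs.
PAIRS = [
    ("the dog barks", "el perro ladra"),
    ("the dog sleeps", "el perro duerme"),
    ("the cat sleeps", "el gato duerme"),
]

def term_vectors(texts):
    """Map each term to its counts, one entry per document pair."""
    vectors = {}
    for i, text in enumerate(texts):
        for term, n in Counter(text.lower().split()).items():
            vectors.setdefault(term, [0] * len(texts))[i] = n
    return vectors

def cosine(u, v):
    """Cosine similarity between two count vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

# Both sides of a pair share one document index, so each pair acts
# as a single document with English and Spanish columns.
english = term_vectors([e for e, _ in PAIRS])
spanish = term_vectors([s for _, s in PAIRS])

def best_translation(term):
    """Return the Spanish term whose count vector best matches the term's."""
    return max(spanish, key=lambda s: cosine(english[term], spanish[s]))

for word in ("dog", "cat", "sleeps"):
    print(word, "->", best_translation(word))
```

Terms that occur in the same pairs with similar counts get similar vectors, so "dog" lines up with "perro" even though no dictionary was consulted.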
23
Similarity-Based Dictionaries