Information Retrieval and Web Search - PowerPoint PPT Presentation

Slides: 30

Provided by: litCs

Transcript and Presenter's Notes

1
Information Retrieval and Web Search

Cross Language Information Retrieval
Instructor: Rada Mihalcea
Class web page: http://lit.csci.unt.edu/classes/CSCE5200
Some of the slides are from a course taught by Doug Oard at U. Maryland

2
The General Problem

Find documents written in any language
Using queries expressed in a single language

3
Why Do Cross-Language IR?

When users can read several languages
Eliminates multiple queries
Query in most fluent language
Monolingual users can also benefit
If translations can be provided
If it suffices to know that a document exists
If text captions are used to search for images

4
Source: Michael Lesk, "How Much Information Is There in the World?"
5
Supply Side: Internet Hosts
Guess: What will be the most widely used language on the Web in 2010?
Source: Network Wizards, Jan 99 Internet Domain Survey
6
Demand Side: Number of Speakers
Source: http://www.g11n.com/faq.html
7
Search Technology
[Diagram: Language Identification feeds Chinese Feature Assignment and English Feature Assignment; the Chinese Query passes through Chinese Feature Assignment]
Monolingual Chinese Matching output: 1 (0.72), 2 (0.48)
Cross-Language Matching output: 3 (0.91), 4 (0.57), 5 (0.36)
8
Language Identification

Can be specified using metadata
Included in HTTP and HTML
Can be determined using word-scale features
Which dictionary gets the most hits?
Can be determined using subword features
Letter n-grams, for example
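
The subword approach above can be sketched with letter trigram profiles compared by cosine similarity. This is a minimal illustration, not the method used in the course: the function names, the two tiny training samples, and the choice of trigrams are all assumptions made for the example, and a usable identifier would need far more training text per language.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping letter n-grams, with one space added at each end of the text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def identify_language(text: str, profiles: dict[str, Counter]) -> str:
    """Return the language whose profile is most similar to the text."""
    query = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical training samples, far too small for real use.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y el gato",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

print(identify_language("the dog and the fox", PROFILES))
print(identify_language("el perro y el gato", PROFILES))
```

The word-scale alternative ("which dictionary gets the most hits?") has the same shape: replace the trigram profile with a set of known words per language and count matches.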

9
Design Decisions

What to index?
Free text or controlled vocabulary
What to translate?
Queries or documents
Where to get translation knowledge?

10
Query Vector Translation
[Diagram: Chinese Query Features pass through Query (Vector) Translation, then Monolingual English Matching against English Document Features]
Output: 3 (0.91), 4 (0.57), 5 (0.36)
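
As a rough sketch of query vector translation, the code below maps a weighted Chinese query vector into English term space and then scores English documents with a plain dot product. The term map, its translation weights, the query weights, and the document vectors are all hypothetical values chosen for illustration; the slides do not specify how translation weights are obtained.

```python
def translate_query_vector(query, term_map):
    """Spread each source term's weight over its target-language translations."""
    translated = {}
    for term, weight in query.items():
        # Terms missing from the term map contribute nothing.
        for target, prob in term_map.get(term, {}).items():
            translated[target] = translated.get(target, 0.0) + weight * prob
    return translated


def score(query_vec, doc_vec):
    """Dot product between a query vector and a document vector."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())


def rank(query_vec, docs):
    """Return (doc_id, score) pairs, best match first."""
    scored = [(doc_id, score(query_vec, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical bilingual term map with assumed translation weights.
TERM_MAP = {
    "飞机": {"airplane": 0.7, "aircraft": 0.3},
    "票": {"ticket": 1.0},
}
QUERY = {"飞机": 1.0, "票": 0.5}
DOCS = {
    "d1": {"airplane": 2.0, "ticket": 1.0},
    "d2": {"aircraft": 1.0},
    "d3": {"train": 3.0, "ticket": 1.0},
}

english_query = translate_query_vector(QUERY, TERM_MAP)
for doc_id, value in rank(english_query, DOCS):
    print(doc_id, round(value, 2))
```

Document vector translation would apply the same mapping in the other direction, to each document vector instead of the query.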
11
Document Vector Translation
[Diagram: English Document Features pass through Document (Vector) Translation, then Monolingual Chinese Matching against Chinese Query Features]
Output: 3 (0.91), 4 (0.57), 5 (0.36)
12
Matching Interlingual Representations
[Diagram: Chinese Query Features pass through Query Folding In, English Document Features pass through Document Folding In, and both meet in Interlingual Matching]
Output: 3 (0.91), 4 (0.57), 5 (0.36)
13
Query vs. Document Translation

Query translation
Very efficient for short queries
Not as big an advantage for relevance feedback
Hard to resolve ambiguous query terms
Document translation
May be needed by the selection interface
And supports adaptive filtering well
Slow, but only need to do it once per document
Poor scale-up to large numbers of languages

14
Cross-Language Text Retrieval
[Taxonomy diagram; node labels as extracted]
Query Translation, Document Translation
Text Translation, Vector Translation
Controlled Vocabulary, Free Text
Knowledge-based: Ontology-based, Thesaurus-based, Dictionary-based
Corpus-based: Term-aligned, Sentence-aligned, Document-aligned, Unaligned
Parallel, Comparable
15
Translation Knowledge

A lexicon
e.g., extract term list from a bilingual dictionary
Corpora
Parallel or comparable, linked or unlinked
Algorithmic
e.g., transliteration rules, cognate matching
The user

16
Types of Lexicons

Ontology
Representation of concepts and relationships
Thesaurus
Ontology specialized for retrieval
Bilingual lexicon
Ontology specialized for machine translation
Bilingual dictionary
Ontology specialized for human translation

17
Multilingual Thesauri

Adapt the knowledge structure
Cultural differences influence indexing choices
Use language-independent descriptors
Matched to a unique term in each language
Three construction techniques
Build it from scratch
Translate an existing thesaurus
Merge monolingual thesauri

18
Machine Readable Dictionaries

Based on printed bilingual dictionaries
Becoming widely available
Used to produce bilingual term lists
Cross-language term mappings are accessible
Sometimes listed in order of most common usage
Some knowledge structure is also present
Hard to extract and represent automatically
The challenge is to pick the right translation

19
Unconstrained Query Translation

Replace each word with every translation
Typically 5-10 translations per word
About 50% of monolingual effectiveness
Ambiguity is a serious problem
Example: fly (English)
8 word senses (e.g., to fly a flag)
13 Spanish translations (enarbolar, ondear, ...)
38 English retranslations (hoist, brandish, lift, ...)
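
A minimal sketch of unconstrained query translation follows: every query word is replaced by all of its listed translations. The two term lists are small, hypothetical stand-ins for a real bilingual dictionary (they are incomplete by design), and passing unknown words through unchanged is an assumed fallback, not something the slides prescribe. Translating the result back into English shows how quickly ambiguity multiplies.

```python
def translate_unconstrained(query_terms, term_list):
    """Replace each term with every listed translation, keeping first-seen order."""
    translated = []
    for term in query_terms:
        # Assumed fallback: a term with no entry is kept as-is.
        for candidate in term_list.get(term, [term]):
            if candidate not in translated:
                translated.append(candidate)
    return translated


# Hypothetical, deliberately incomplete term lists.
EN_ES = {
    "fly": ["volar", "enarbolar", "ondear"],
    "flag": ["bandera"],
}
ES_EN = {
    "volar": ["fly", "blow up"],
    "enarbolar": ["hoist", "brandish"],
    "ondear": ["wave", "flutter"],
    "bandera": ["flag", "banner"],
}

spanish_query = translate_unconstrained(["fly", "flag"], EN_ES)
retranslated = translate_unconstrained(spanish_query, ES_EN)
print(spanish_query)
print(retranslated)
```

Even with these tiny lists, a two-word English query becomes four Spanish terms and eight English retranslations, most of them unrelated to the intended sense.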

20
Exploiting Part-of-Speech Tags

Constrain translations by part of speech
Noun, verb, adjective, ...
Effective taggers are available
Works well when queries are full sentences
Short queries provide little basis for tagging
Constrained matching can hurt monolingual IR
Nouns in queries often match verbs in documents
This is why stemming usually improves performance

21
Phrase Indexing

Improves retrieval effectiveness two ways
Phrases are less ambiguous than single words
Idiomatic phrases translate as a single concept
Three ways to identify phrases
Semantic (e.g., appears in a dictionary)
Syntactic (e.g., parse as a noun phrase)
Cooccurrence (words found together often)
Semantic phrase results are impressive
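
Of the three ways to identify phrases, the co-occurrence approach is the simplest to sketch: count adjacent word pairs across a collection and keep the pairs seen at least some minimum number of times. The function name, the threshold, and the sample documents below are assumptions for illustration; practical systems usually apply a statistical association measure instead of a raw count.

```python
from collections import Counter


def cooccurring_phrases(documents, min_count=2):
    """Return adjacent word pairs that occur at least min_count times overall."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        # zip pairs each token with its right-hand neighbour.
        counts.update(zip(tokens, tokens[1:]))
    return {" ".join(pair) for pair, count in counts.items() if count >= min_count}


# Hypothetical sample collection.
DOCUMENTS = [
    "machine translation is hard",
    "statistical machine translation works",
    "translation by machine",
]

print(cooccurring_phrases(DOCUMENTS))
```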

22
Types of Bilingual Corpora

Parallel corpora: translation-equivalent pairs
Document pairs
Sentence pairs
Term pairs
Comparable corpora
Content-equivalent document pairs
E.g., newspaper articles in different languages on the same day (for the same event)
Unaligned corpora
Content from the same domain

23
Pseudo-Relevance Feedback

Enter query terms in French
Find top French documents in parallel corpus
Construct a query from English translations
Perform a monolingual free text search

[Diagram: French Query Terms go to a French Text Retrieval System over the Parallel Corpus; the Top ranked French Documents yield English Translations, which are used to search English Web Pages with Alta Vista]
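
The first three steps of this pipeline can be sketched as follows: rank the French sides of a parallel corpus against the French query, then build an English query from the most frequent terms in the aligned English sides. The three-pair corpus, the stopword list, the overlap-count ranking, and the parameter names `k` and `n_terms` are assumptions made for the example; the final monolingual search step is left to whatever English search engine is available.

```python
from collections import Counter

# Hypothetical parallel corpus of (French, English) document pairs.
PARALLEL = [
    ("le chat noir dort", "the black cat sleeps"),
    ("le chat mange le poisson", "the cat eats the fish"),
    ("la bourse monte aujourd'hui", "the stock market rises today"),
]
STOPWORDS = {"the"}


def tokenize(text):
    return text.lower().split()


def top_pairs(query_terms, corpus, k=2):
    """Rank pairs by how many French tokens match the query; keep the top k matches."""
    def overlap(pair):
        return sum(1 for token in tokenize(pair[0]) if token in query_terms)

    ranked = sorted(corpus, key=overlap, reverse=True)
    return [pair for pair in ranked[:k] if overlap(pair) > 0]


def english_query(query_terms, corpus, k=2, n_terms=3):
    """Build an English query from the English sides of the top French matches."""
    counts = Counter()
    for _, english in top_pairs(set(query_terms), corpus, k):
        counts.update(t for t in tokenize(english) if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(n_terms)]


print(english_query(["chat"], PARALLEL, n_terms=1))
```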
24
Learning From Document Pairs

Count how often each term occurs in each pair
Treat each pair as a single document

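[Table: term counts per document pair, with columns for English Terms E1-E5 and Spanish Terms S1-S4. Four counts survive for each row; their column positions are not recoverable from the transcript.]
Doc 1: 4, 2, 2, 1
Doc 2: 8, 4, 4, 2
Doc 3: 2, 2, 1, 2
Doc 4: 2, 1, 2, 1
Doc 5: 4, 1, 2, 1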
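
The counting step can be sketched by merging each document pair into one bag of terms, which gives every term a vector of counts across pairs. Comparing those vectors across languages, as done at the end with cosine similarity, is one possible use of the counts and is an assumption of this example. The three document pairs are hypothetical and are not the values from the slide's table.

```python
from collections import Counter
from math import sqrt

# Hypothetical (English, Spanish) document pairs.
PAIRS = [
    ("house house garden", "casa casa jardín"),
    ("house dog", "casa perro"),
    ("dog dog garden", "perro perro jardín"),
]


def pair_counts(pairs):
    """Treat each pair as a single document; prefix terms with their language."""
    rows = []
    for english, spanish in pairs:
        row = Counter("en:" + t for t in english.split())
        row.update("es:" + t for t in spanish.split())
        rows.append(row)
    return rows


def term_vector(term, rows):
    """Counts of one term across all document pairs."""
    return [row[term] for row in rows]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_translation(english_term, rows):
    """Spanish term whose count vector is most similar to the English term's."""
    candidates = sorted({t for row in rows for t in row if t.startswith("es:")})
    source = term_vector("en:" + english_term, rows)
    best = max(candidates, key=lambda t: cosine(source, term_vector(t, rows)))
    return best[len("es:"):]


ROWS = pair_counts(PAIRS)
for word in ("house", "dog", "garden"):
    print(word, "->", best_translation(word, ROWS))
```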
25
Similarity-Based Dictionaries