Information Retrieval

Slides: 91

Provided by: shyhka

Transcript and Presenter's Notes

1
Information Retrieval

Shyh-Kang Jeng
Department of Electrical Engineering/
Graduate Institute of Communication Engineering
National Taiwan University

2
Reference

R. Baeza-Yates and B. Ribeiro-Neto, Modern
Information Retrieval, Addison-Wesley, 1999.

3
Outline

Basic concepts
Information Retrieval Models
Text Property
Document Preprocessing
Indexing and Searching
Searching the Web

4
Information Retrieval Agents
5
Information Retrieval

Deals with information
Representation
Storage
Organization
Access
Provides the user with easy access to the
information in which he is interested

6
Example of Information Retrieval

Find all pages (documents) containing information
on college tennis teams which
are maintained by a university in the USA
participate in the NCAA tennis tournament
To be relevant, the page must include
National ranking of the team in the last three
years
Email of the team coach

7
Information vs. Data Retrieval

Information retrieval
Results might be inaccurate
Small errors are likely to go unnoticed
Deals with natural language text which is not
well structured and could be semantically
ambiguous
Data retrieval
Aims at retrieving all objects which satisfy
clearly defined conditions
A single erroneous object among a thousand
retrieved objects means total failure
Has a well defined structure and semantics

8
User Task

Retrieval
Searches for desired information directly
Browsing
Still a process of retrieving information
Main objectives are not clearly defined in the
beginning
The purpose might change during the interaction
with the system

9
Interaction with the System
[Figure: the user interacts with the document database through retrieval or browsing]
10
Keywords

Queries are often translated to a set of keywords
(or index terms) which summarizes the description
of the user's information need
Documents are also frequently represented through
a set of index terms or keywords

11
Logical View: From Full Text to Set of Index Terms
12
Retrieval Process
13
Ad hoc and Filtering Retrieval

Ad hoc retrieval
The documents in the collection remain relatively
static while new queries are submitted to the
system
Filtering retrieval
The queries remain relatively static while new
documents come into the system and leave

14
User Profile in Filtering Retrieval

User profile
Describes the user's preferences
Filtering
Profile is compared to the incoming documents to
determine those that might be of interest to the
user
Ranking
Rank the filtered documents and show the ranking
to the user

15
Constructing User Profile

User provides a set of keywords which describes
an initial profile of preference
As new documents arrive, the system uses this
profile to select documents and show them to the
user
The user indicates not only relevant documents,
but also non-relevant documents
The system uses this information to adjust the
user profile
Profile stabilizes after a while and no longer
changes drastically unless the user's interests
shift suddenly

16
Information Retrieval Model

Quadruple $[D, Q, F, R(q_i, d_j)]$
$D$: set composed of logical views for the
documents in the collection
$Q$: set composed of logical views for the user
information needs
$F$: framework for modeling document
representations, queries, and their relationships
$R(q_i, d_j)$: ranking function defining an ordering
among the documents with regard to the query $q_i$

17
Boolean Model
18
Vector Model

Generic index term $k_i$
Set of all index terms $K = \{k_1, \ldots, k_t\}$
Weight $w_{i,j} > 0$ is associated with index term $k_i$
of a document $d_j$
Document $d_j$ is associated with a vector
$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
Weight $w_{i,q} > 0$ is associated with the pair $(k_i, q)$
Query vector $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$

19
Similarity by Vector Model

Evaluated as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$
The correlation is quantified by the cosine of
the angle between two vectors
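
As a minimal sketch of the cosine measure, the function below divides the inner product of a document vector and a query vector by the product of their lengths. The function name, the sample weights, and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)

# Hypothetical weights over t = 3 index terms.
doc = [0.5, 0.0, 1.0]
query = [1.0, 0.0, 1.0]
print(round(cosine_similarity(doc, query), 3))
```

Because the weights are non-negative, the result lies between 0 (no shared terms) and 1 (vectors pointing in the same direction), which gives a natural ranking score.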

20
An Effective Term Weighting Scheme

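$N$: total number of documents
$n_i$: number of documents where $k_i$ appears
$freq_{i,j}$: raw frequency of term $k_i$ in $d_j$
Normalized frequency: $f_{i,j} = freq_{i,j} / \max_l freq_{l,j}$
Inverse document frequency: $idf_i = \log(N / n_i)$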

21
tf-idf Scheme and Salton-Buckley Query Weighting

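tf-idf scheme: $w_{i,j} = f_{i,j} \times \log(N / n_i)$
Salton-Buckley query weighting: $w_{i,q} = (0.5 + 0.5\, freq_{i,q} / \max_l freq_{l,q}) \times \log(N / n_i)$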
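
The tf-idf document weight can be sketched as follows: the normalized frequency of a term in a document (raw count divided by the largest count in that document) is multiplied by the logarithm of N over the number of documents containing the term. The function name, the toy documents, and the use of the natural logarithm are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, with w = f * log(N / n_i)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)  # n_i per term
    weights = []
    for c in counts:
        max_freq = max(c.values()) if c else 1
        weights.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        })
    return weights

# Hypothetical tokenized documents.
docs = [
    ["tennis", "team", "ranking", "tennis"],
    ["tennis", "coach", "email"],
    ["college", "team"],
]
for w in tfidf_weights(docs):
    print(w)
```

A term that appears in every document gets weight 0, which reflects the idea that it cannot discriminate between documents.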

22
Recall and Precision

Recall
Fraction of the relevant documents that have been
retrieved: Recall = $|R_a| / |R|$
Precision
Fraction of the retrieved documents that are
relevant: Precision = $|R_a| / |A|$

[Figure: the collection, containing the set of relevant documents $R$, the answer set $A$, and their intersection $R_a$ (relevant documents in the answer set)]
23
Example

Set containing relevant documents for query q
$R_q$ = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Ranking of the retrieved documents
1. d123    6. d9     11. d38
2. d84     7. d511   12. d48
3. d56     8. d129   13. d250
4. d6      9. d187   14. d113
5. d8     10. d25    15. d3
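
The recall and precision values for this example can be worked out with a short sketch that walks down the ranking and records both measures each time a relevant document is seen. The function name is an assumption; the document identifiers are the ones listed in the example.

```python
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

def recall_precision_points(ranking, relevant):
    """Return (rank, recall, precision) at each rank holding a relevant document."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((rank, hits / len(relevant), hits / rank))
    return points

for rank, recall, precision in recall_precision_points(ranking, relevant):
    print(f"rank {rank:2d}: recall {recall:.0%}, precision {precision:.1%}")
```

Relevant documents appear at ranks 1, 3, 6, 10, and 15, so precision is 100% at 10% recall, 66.7% at 20%, 50% at 30%, 40% at 40%, and 33.3% at 50%. Recall never goes beyond 50% because only five of the ten relevant documents are retrieved.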

24
Precision and Recall Figure
25
User Relevance Feedback

The user is presented with a list of the
retrieved documents
After examining them, the user marks those which
are relevant
In practice, only the top 10 (or 20) ranked
documents need to be examined
Select important terms attached to the documents
marked relevant
Enhance the importance of these terms in a new
query formulation
The new query will be moved towards the relevant
documents and away from the non-relevant ones

26
Term Reweighting

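Standard Rocchio: $\vec{q}_m = \alpha\,\vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$
$D_r$: set of relevant documents, as identified by
the user, among the retrieved documents
$D_n$: set of non-relevant documents, as
identified by the user, among the retrieved
documents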
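
A minimal sketch of standard Rocchio reweighting: the new query is the old query scaled by alpha, plus beta times the centroid of the relevant documents, minus gamma times the centroid of the non-relevant ones. The function name and the default values of alpha, beta, and gamma are illustrative assumptions, not values taken from the slides.

```python
def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector; all vectors are equal-length lists."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(query)
        return [sum(column) / len(docs) for column in zip(*docs)]

    pos = centroid(relevant_docs)
    neg = centroid(nonrelevant_docs)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, pos, neg)]

# Hypothetical 2-term vectors: the query moves towards the relevant documents.
print(rocchio([1.0, 0.0], [[0.0, 2.0], [0.0, 4.0]], [[2.0, 0.0]]))
```

Weights can become negative when a term occurs mostly in non-relevant documents; whether to keep or clip such weights is a design choice this sketch leaves open.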

27
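Modeling of Natural Language: Zipf's Law

In a text of $n$ words with a vocabulary of $V$
words, the $i$-th most frequent word appears
$n / (i^{\theta} H_V(\theta))$ times, where
$H_V(\theta) = \sum_{j=1}^{V} 1 / j^{\theta}$

[Figure: word frequency F plotted against words sorted by decreasing frequency]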
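
As a rough numerical illustration of Zipf's law, the sketch below computes the expected number of occurrences n / (i^theta * H_V(theta)) for each rank i, where H_V(theta) is the sum of 1 / j^theta for j from 1 to V. The function name and the sample values of n, V, and theta are assumptions.

```python
def zipf_expected_counts(n, vocab_size, theta=1.0):
    """Expected occurrences of the i-th most frequent word, i = 1..vocab_size."""
    # H_V(theta): the harmonic number of order theta of V.
    harmonic = sum(1.0 / j ** theta for j in range(1, vocab_size + 1))
    return [n / (i ** theta * harmonic) for i in range(1, vocab_size + 1)]

counts = zipf_expected_counts(n=10_000, vocab_size=1_000, theta=1.0)
print([round(c) for c in counts[:5]])
```

The expected counts add up to n, and with theta = 1 the most frequent word is expected twice as often as the second one, which is why a few words account for a large share of any text.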
28
Modeling of Natural Language: Heaps' Law

The vocabulary of a text of size $n$ words is of
size $V = K n^{\beta} = O(n^{\beta})$, where $0 < \beta < 1$

29
Lexical Analysis of the Text
30
Elimination of Stopwords

Stopwords
Words too frequent among the documents in the
collection
Not good discriminators
Articles, prepositions, conjunctions, and some
verbs, adverbs, and adjectives are natural
candidates for a list of stopwords
Elimination of stopwords
Reduces the size of the index structure
considerably (40% or more is typical)
Counterexample: "to be or not to be"
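
Stopword elimination, and the way it fails on a query made only of stopwords, can be sketched as below. The stopword list is a tiny illustrative assumption, and the function name is hypothetical.

```python
import re

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "be", "from", "has", "is",
             "not", "or", "the", "this", "to"}

def index_terms(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

print(index_terms("This is a text. A text has many words."))
print(index_terms("to be or not to be"))
```

The first call keeps `text`, `text`, `many`, and `words`; the second returns an empty list, so the phrase could never be found through the index.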

31
Stemming

Stem
Portion of a word which is left after the removal
of its affixes (i.e., prefixes and suffixes such as
plurals, gerund forms, and past tense suffixes)
Stemming
Substitutes the words by their respective stems
Useful for improving retrieval performance
Can reduce the size of the index structure
The literature disagrees about the benefits
The Porter algorithm is often used for suffix
stripping

32
Noun Groups

Most of the semantics is carried by the nouns
Selects nouns as index terms through systematic
elimination of verbs, adjectives, adverbs,
connectives, articles, and pronouns
Common to combine two or three nouns in a single
component (e.g., computer science)
Makes sense to cluster nouns which appear nearby
into a single indexing component
A noun group is a set of nouns with no more than 3
(or a predetermined threshold) words between any
two nouns

33
Thesauri

Refers to a treasury of words consisting of
A precompiled list of important words
For each word in the list, a set of related words
Complemented with a definition or an explanation
Purposes
Provides a standard vocabulary for indexing and
searching
Assists users with locating terms for proper
query formulation
Provides classified hierarchies that allow the
broadening and narrowing of the current query

34
Inverted Files

A word-oriented mechanism for indexing a text
collection in order to speed up the searching
task
Structure
Vocabulary
Occurrences
The space required for the vocabulary is rather
small, according to Heaps' law
The occurrences need extra space
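
A minimal sketch of the two parts of an inverted file, using the sample sentence that the following slides work with. The vocabulary is the set of dictionary keys and the occurrences are 1-based character positions. The stopword set and the function name are assumptions chosen for this example.

```python
import re
from collections import defaultdict

TEXT = "This is a text. A text has many words. Words are made from letters."
STOPWORDS = {"this", "is", "a", "has", "are", "from"}  # assumed for this example

def build_inverted_index(text, stopwords):
    """Map each non-stopword to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stopwords:
            index[word].append(match.start() + 1)
    return dict(sorted(index.items()))

for word, positions in build_inverted_index(TEXT, STOPWORDS).items():
    print(word, positions)
```

The result lists `letters` at 60, `made` at 50, `many` at 28, `text` at 11 and 19, and `words` at 33 and 40, matching the positions shown in the vocabulary trie slide.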

35
Example of an Inverted Index
[Figure: inverted index]
36
Inverted Index using Block Addressing
[Figure: the text "This is a text. A text has many words. Words are made from letters." divided into Block 1 to Block 4, with an inverted index whose occurrences point to blocks rather than exact positions]
37
Block Considerations

Blocks can be of fixed size
Or be defined using the natural division of the
text collection into files, documents, web pages,
etc.

38
Effect of Block Sizes
For each collection, the right column considers
that all words are indexed, while the left column
considers that stopwords are not indexed
39
Searching with Inverted Files

Vocabulary search
Better to have the vocabulary in a separate file
Vocabulary file fits in main memory in most cases
Retrieval of occurrences
Manipulation of occurrences
If block addressing is used, it may be necessary
to directly search the text to find the
information missing from the occurrences (e.g.,
exact word position)
Sublinear search time and sublinear space
requirements

40
Constructing a Vocabulary Trie
[Figure: vocabulary trie branching on the letters l, m (then a, then d or n), t, and w, with leaves]
letters: 60
made: 50
many: 28
text: 11, 19
words: 33, 40
41
Building an Inverted Index

Once the text is exhausted, the trie is written
to disk together with the lists of occurrences
Split the index into two files
First file: the lists of occurrences are stored
contiguously
Second file: the vocabulary is stored in
lexicographical order and, for each word, a
pointer to its list in the first file is also
included

42
Inverted Index for Large Texts

If the index does not fit in main memory, the
partial index $I_i$ obtained up to now is written to
disk and erased from main memory before
continuing with the rest of the text
Finally, a number of partial indices $I_i$ exist on
disk. These indices are then merged in a
hierarchical manner
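
The hierarchical merge can be sketched as repeated pairwise merging of neighbouring partial indices until a single index remains. In this sketch in-memory dictionaries stand in for the partial index files on disk, and both function names are hypothetical.

```python
def merge_two(left, right):
    """Merge two partial indices; 'right' covers later text than 'left'."""
    merged = {}
    for word in sorted(set(left) | set(right)):
        # Occurrence lists stay sorted because 'right' positions come later.
        merged[word] = left.get(word, []) + right.get(word, [])
    return merged

def merge_hierarchically(partials):
    """Merge partial indices pairwise, level by level, until one remains."""
    level = list(partials)
    while len(level) > 1:
        next_level = [merge_two(level[i], level[i + 1])
                      for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired index is carried up unchanged
            next_level.append(level[-1])
        level = next_level
    return level[0] if level else {}

partials = [
    {"text": [11]},
    {"many": [28], "text": [19]},
    {"words": [33, 40]},
    {"letters": [60], "made": [50]},
]
print(merge_hierarchically(partials))
```

Merging neighbours keeps each occurrence list in text order by simple concatenation, so no re-sorting of positions is needed at any level.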

43
Merging the Partial Indices
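[Figure: binary merge tree; the numbers give the order of the merges]
Level 1 (initial dumps): I-1, I-2, I-3, I-4, I-5, I-6, I-7, I-8
Level 2: I-1..2 (merge 1), I-3..4 (merge 2), I-5..6 (merge 3), I-7..8 (merge 4)
Level 3: I-1..4 (merge 5), I-5..8 (merge 6)
Level 4 (final index): I-1..8 (merge 7)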
44
Suffix Trees and Suffix Arrays

Queries such as phrases are expensive to solve
using inverted indices
Concept of word does not exist in some
applications such as genetic databases
Suffix trees and suffix arrays are suitable for a
wider spectrum of applications
For word-based applications, inverted files
perform better unless complex queries are an
important issue

45
Suffixes
Text
This is a text. A text has many words. Words are made from letters.
Suffixes
text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
words. Words are made from letters.
Words are made from letters.
made from letters.
letters.
46
Suffix Trie
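[Figure: suffix trie over the sample text, spelling out the suffixes character by character; its leaves point to text positions 60, 50, 28, 19, 11, 40, 33]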
47
Suffix Tree
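[Figure: suffix tree (the suffix trie with unary paths compressed; internal nodes carry the character position to inspect next); its leaves point to text positions 60, 50, 28, 19, 11, 40, 33]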
48
Suffix Array
Suffix array: 60 50 28 19 11 40 33
49
Supra-index over Suffix Array
Supra-index: lett, text, word
Suffix array: 60 50 28 19 11 40 33
50
Vocabulary Supra-index vs. Inverted List
Vocabulary supra-index: letters, made, many, text, words
Suffix array: 60 50 28 19 11 40 33
Inverted list: 60 50 28 11 19 33 40
51
Searching Using Suffix Arrays

The search pattern originates two limiting
patterns $P_1$ and $P_2$, so that we want any suffix
$S$ such that $P_1 \le S < P_2$
First, binary search both limiting patterns in the
suffix array
All the elements lying between both positions
point to exactly those suffixes that start like
the original pattern, i.e., to the pattern
positions in the text
A simple phrase can be searched as if it were a
simple pattern
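
The search can be sketched on the sample text from the earlier slides: build a word-level suffix array, then run two binary searches to find the run of suffixes that start with the pattern. The stopword set, the case-insensitive comparison, and the function names are assumptions of this sketch.

```python
import re

TEXT = "This is a text. A text has many words. Words are made from letters."
STOPWORDS = {"this", "is", "a", "has", "are", "from"}  # assumed index points

def build_suffix_array(text, stopwords):
    """Return 0-based starts of indexed words, sorted by the suffix they begin."""
    lowered = text.lower()
    starts = [m.start() for m in re.finditer(r"[a-z]+", lowered)
              if m.group() not in stopwords]
    return sorted(starts, key=lambda s: lowered[s:])

def search(text, suffix_array, pattern):
    """Binary-search both ends of the run of suffixes that start with pattern."""
    lowered, pattern = text.lower(), pattern.lower()
    m = len(pattern)

    def first_index(accept):
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            if accept(lowered[start:start + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    begin = first_index(lambda prefix: prefix >= pattern)
    end = first_index(lambda prefix: prefix > pattern)
    return [pos + 1 for pos in suffix_array[begin:end]]  # 1-based positions

sa = build_suffix_array(TEXT, STOPWORDS)
print([pos + 1 for pos in sa])
print(search(TEXT, sa, "text"))
print(search(TEXT, sa, "many words"))
```

The array comes out as 60 50 28 19 11 40 33, as in the suffix array slide. Searching for `text` returns positions 19 and 11, and the phrase `many words` is found at 28 with exactly the same procedure as a single word.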

52
Sequential Searching for Exact String Matching

Given a short pattern P of length m and a long
text T of length n
Find all the text positions where the pattern
occurs
With no data structure being built on the text
Assume that the text and the pattern are
sequences of characters drawn from an alphabet of
size s, whose first character is at position 1

53
Brute Force
[Figure: brute-force search for the pattern "abracadabra" in the text "abracabracadabra", trying the pattern at successive text positions]
Worst case O(mn), Average case O(n)
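
A minimal sketch of brute-force matching: every text position is tried as a candidate start and the pattern is compared character by character. The function name is an assumption; the sample strings are the pattern and text that the slide figures appear to use.

```python
def brute_force_search(text, pattern):
    """Return the 1-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    positions = []
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched
            positions.append(start + 1)
    return positions

print(brute_force_search("abracabracadabra", "abracadabra"))  # [6]
```

The inner loop usually stops after a character or two, which is where the O(n) average case comes from; the O(mn) worst case needs inputs such as long runs of the same character.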
54
Knuth-Morris-Pratt: the next Function
j:    1 2 3 4 5 6 7 8 9 10 11 12
P[j]: a b r a c a d a b r  a
next: 0 0 0 0 1 0 1 0 0 0  0  4
55
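Knuth-Morris-Pratt: Example
[Figure: Knuth-Morris-Pratt search for the pattern "abracadabra" in the text "abracabracadabra"; after a mismatch the pattern is shifted according to the next function, without re-reading text characters]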
Linear worst case behavior, but no faster
than brute force on average
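
A sketch of Knuth-Morris-Pratt matching is given below. Note the swap: it uses the plain textbook failure function (longest proper border of each prefix) rather than the refined next table of the slides, so its table values differ from the ones shown there, while the matches found are the same. The function names are assumptions.

```python
def failure_function(pattern):
    """fail[j] = length of the longest proper border of pattern[:j + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail

def kmp_search(text, pattern):
    """Return the 1-based positions where pattern occurs in text."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without moving backwards in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 2)
            k = fail[k - 1]
    return positions

print(failure_function("abracadabra"))
print(kmp_search("abracabracadabra", "abracadabra"))  # [6]
```

Each text character is read once and the fallback steps are bounded by the number of successful matches, which gives the linear worst case.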
56
Boyer-Moore Heuristics
[Figure: the pattern "abracadabra" compared right to left against a text window; after the mismatch the figure labels the two candidate shifts]
Match heuristic: 3
Occurrence heuristic: 5
57
Boyer-Moore Example
[Figure: Boyer-Moore search for the pattern "abracadabra" in the text "abracabracadabra"]
O(n log(m)/m) on average, worst case is O(mn)
Fastest in general
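
The sketch below implements a simplified Boyer-Moore search that uses only the occurrence heuristic; the match heuristic is left out to keep the example short, so this is not the full algorithm. The pattern is compared right to left and, on a mismatch, shifted so that the mismatched text character lines up with its rightmost occurrence in the pattern. The function name is an assumption.

```python
def boyer_moore_occurrence(text, pattern):
    """Return 1-based match positions using only the occurrence heuristic."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = {ch: idx for idx, ch in enumerate(pattern)}  # rightmost index per char
    positions = []
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[start + j]:
            j -= 1  # compare right to left
        if j < 0:
            positions.append(start + 1)
            start += 1
        else:
            # Align the mismatched text character with its rightmost
            # occurrence in the pattern, always moving at least one step.
            start += max(1, j - last.get(text[start + j], -1))
    return positions

print(boyer_moore_occurrence("abracabracadabra", "abracadabra"))  # [6]
```

On this input the first mismatch produces a shift of 5, after which the pattern matches at position 6, so most of the text is never compared. That skipping is what makes the average case sublinear.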
58
Approximate String Matching

Given a short pattern P of length m, a long text
T of length n, and a maximum allowed number of
errors k, find all the text positions where the
pattern occurs with at most k errors

59
Similarity

Similarity is measured by a distance function
Hamming distance
Number of positions that have different
characters
Should be symmetric and satisfy triangular
inequality

60
Levenshtein Distance (Edit Distance)

Minimum number of character insertions,
deletions, and replacements to make two strings
equal
Examples
distance(color, colour) = 1
distance(survey, surgery) = 2
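
The edit distance can be computed with a small dynamic programming table, sketched here while keeping only two rows in memory. The function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace or match
        previous = current
    return previous[-1]

print(levenshtein("color", "colour"))    # 1
print(levenshtein("survey", "surgery"))  # 2
```

Both results agree with the examples above: one insertion for color/colour, and one replacement plus one insertion for survey/surgery.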

61
Dynamic Programming for Approximate String
Matching

A matrix $C[0..m, 0..n]$ is filled column by
column, where $C[i,j]$ represents the minimum
number of errors needed to match $P[1..i]$ to a
suffix of $T[1..j]$
Computed as
$C[0,j] = 0$
$C[i,0] = i$
$C[i,j] = C[i-1,j-1]$ if $P_i = T_j$,
else $C[i,j] = 1 + \min(C[i-1,j],\ C[i,j-1],\ C[i-1,j-1])$
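
The recurrence can be sketched column by column, keeping only the previous column in memory and reporting every text position where the last row is at most k. The function name and the sample strings are assumptions.

```python
def approximate_search(text, pattern, k):
    """Return 1-based end positions j where pattern matches a suffix
    of text[:j] with at most k errors."""
    m = len(pattern)
    column = list(range(m + 1))  # C[i, 0] = i
    ends = []
    for j, tc in enumerate(text, start=1):
        new_column = [0]  # C[0, j] = 0: a match may start anywhere in the text
        for i in range(1, m + 1):
            if pattern[i - 1] == tc:
                new_column.append(column[i - 1])              # C[i-1, j-1]
            else:
                new_column.append(1 + min(column[i],          # C[i, j-1]
                                          new_column[i - 1],  # C[i-1, j]
                                          column[i - 1]))     # C[i-1, j-1]
        column = new_column
        if column[m] <= k:
            ends.append(j)
    return ends

print(approximate_search("surgery", "survey", 2))  # [5, 6, 7]
```

The first row is all zeros instead of 0..n, and that is the only change from the plain edit distance table: it lets an occurrence begin at any text position. The cost is O(mn) time and O(m) space.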

62
Dynamic Programming Example
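[Figure: dynamic programming matrix with the text T along the columns and the pattern P along the rows]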
63
Structured Text Retrieval

Queries combine the patterns with the
specification of structural components of the
document
Example
Same-page(near(atom holocaust, Figure(label(earth))))

64
Non-Overlapping Lists
[Figure: one list of non-overlapping text regions per structural level]
L0: Chapter
L1: Sections
L2: Subsections
L3: Subsubsections
65
Non-Overlapping Lists

A single inverted file is built in which each
structural component stands as an entry in the
index
Associated with each entry there is a list of
text regions as a list of occurrences
Such a list could be easily merged with the
traditional inverted file for the words in the
text

66
Proximal Nodes
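[Figure: hierarchical index with levels Chapter, Sections, Subsections, and Subsubsections, alongside the inverted list entry for "holocaust" with occurrences 10, 256, 48324]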
67
Proximal Nodes: Simple Query Processing Strategy

Traverse the inverted list for the term
For each entry in the list, search the
hierarchical index looking for chapter, sections,
subsections, and subsubsections containing that
occurrence of the term

68
Proximal Nodes: Sophisticated Strategy

For the first entry in the list, search the
hierarchical index as before, until no more
successful matches occur
Verify whether the innermost matching component
also matches the second entry in the list
Proceed then to the third entry in the list, and
so on

69
Text in Sequence

Written text is usually conceived to be read
sequentially
A sequenced organizational structure lies
underneath most written text
Sometimes we are looking for information not
easily captured through sequential reading
Example
A book about the history of war organized
chronologically
We want to know regional wars in Europe

70
Hypertext

A high level interactive navigational structure
which allows us to browse text non-sequentially
on a computer screen
Basically a directed graph structure
Basis for HTML and HTTP, which originated the
World Wide Web

71
World Wide Web

Can be seen as a very large, unstructured but
ubiquitous database
Triggers the need for efficient tools to manage,
retrieve, and filter information from it
Those tools are also important in large
intranets, to extract or infer new information to
support a decision process, a task called data
mining

72
Searching the Web

Forms
Use search engines
Use Web directories
Exploit hyperlink structure
Challenges
Distributed data
High percentage of volatile data
Large volume
Unstructured and redundant data
Quality of data
Heterogeneous data

73
Problems Regarding the User and Interaction

How to specify a query?
How to interpret the answer provided by the
retrieval system?
How do we handle a large answer?
How do we rank the documents?
How do we select the documents?
How do we browse efficiently in large documents?

74
Measuring the Web (1999)

There are more than 40 million computers in more
than 200 countries connected to the Internet
Estimated number of Web servers ranges from 2.4
million to over three million
Estimated number of Web pages ranges from 200 to
320 million, growing at a rate of 20 million
pages per month
Estimated that the 30,000 largest Web sites (about
1% of the Web) account for approximately 50% of all
Web pages

75
Measuring the Web (1999)

An average page has between 5 and 15 hyperlinks,
and most of them are local
Most Web pages are HTML pages
Assuming that the average HTML page is 5 KB and
that there are 300 million Web pages, we have at
least 1.5 terabytes of text
Total number of languages exceeds 100

76
Languages of the Web
77
Models of the Web

Heaps' and Zipf's laws are also valid in the Web
Probability of finding a document of size x bytes
93% of all the files have a size below 9.3 KB

78
Distribution of All File Size (1998)
79
Right Tail Distribution for Different File Types
(1996)
80
Search Engines

On the Web, all queries must be answered without
accessing the text. That is, only the indices
are available.
Otherwise, the search engine would have to either
Store a local copy of the Web pages (too
expensive)
Access remote pages through the network at query
time (too slow)
81
Search Engine: Centralized Architecture

Crawlers are programs (software agents) that
traverse the web sending new or updated pages to
a main server where they are indexed.
Crawlers are also called robots, spiders,
wanderers, walkers, and knowbots
A crawler does not actually move to and run on
remote machines
The index is used in a centralized fashion to
answer queries submitted from different places in
the Web

82
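Search Engine: Centralized Architecture
[Figure: users reach the query engine through the interface; the query engine consults the index; the crawler gathers pages from the Web and the indexer builds the index from them]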
83
Search Engine: Centralized Architecture

Main problems
Gathering of the data (highly dynamic)
Saturated communication links
High load at web servers
Volume of the data
May not be able to cope with Web growth in the
near future
Good load balancing, both internal (answering
queries and indexing) and external (crawling), is
important

84
Page Ranking

Most search engines use variations of the Boolean
or vector model
To be performed without accessing the text, just
the index
The vector model yields a better recall-precision
curve, with an average precision of 75% in one
study
Some new algorithms also use hyperlink
information and achieve even better results

85
Crawling the Web

Starts with a set of URLs and from there extracts
other URLs, which are followed recursively in a
breadth-first or depth-first fashion
Allows users to submit top Web sites that will be
added to the URL set
Or starts with a set of popular URLs
Difficult to coordinate several crawlers to avoid
visiting the same page more than once
Or partitions the Web using country codes or
Internet names
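
The breadth-first variant of crawling can be sketched with a queue of URLs to visit and a set of URLs already seen, so that no page is visited twice. To keep the sketch self-contained and runnable offline, a small hypothetical link graph stands in for the Web, and `get_links` is an assumed callback that a real crawler would implement by fetching and parsing each page.

```python
from collections import deque

def crawl_breadth_first(seeds, get_links, max_pages=100):
    """Visit pages breadth-first from the seed URLs, each page at most once."""
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would fetch and index the page here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph standing in for the Web.
WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}
print(crawl_breadth_first(["a.example"], lambda url: WEB.get(url, [])))
```

Replacing `popleft()` with `pop()` turns the queue into a stack and the traversal into a depth-first one. With several crawlers, the `seen` set is the shared state that is hard to coordinate.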

86
Indices

Dynamically generated pages and password-protected
pages cannot be indexed
Most indices use variants of the inverted file
Some use elimination of stopwords to reduce the
size of the index
Is complemented with a short description of each
Web page
A query is answered by doing a binary search on
the sorted list of words of the inverted file
Block addressing is used by some search engines

87
Web Directories

Serve as browsing tools; Yahoo! is an example
Also called catalogs, yellow pages, or subject
directories
In most cases, pages have to be submitted to the
Web directory, where they are reviewed and
classified
Classification is often done manually
Can afford to have a copy of all classified pages
Most also send the query to a search engine

88
Metasearchers

Web servers that send a given query to several
search engines, Web directories and other
databases, collect the answers and unify them
Examples include Metacrawler and SavvySearch
They differ in how ranking is performed in the
unified result
Metasearchers for specific topics can be
considered as software agents

89
Dynamic Search

Use an online search to discover relevant
information by following links
Slow, but might be used in small and dynamic
subsets of the web
Fish search
Exploit the intuition that relevant documents
often have neighbors that are relevant
At each step, the page with highest priority is
analyzed. If relevant, a heuristic decides to
follow or not to follow the links on that page

90
Software Agents