Title: Information Retrieval
1Information Retrieval
- Shyh-Kang Jeng
- Department of Electrical Engineering/
- Graduate Institute of Communication Engineering
- National Taiwan University
2Reference
- R. Baeza-Yates and B. Ribeiro-Neto, Modern
Information Retrieval, Addison-Wesley, 1999.
3Outline
- Basic concepts
- Information Retrieval Models
- Text Property
- Document Preprocessing
- Indexing and Searching
- Searching the Web
4Information Retrieval Agents
5Information Retrieval
- Deals with information
- Representation
- Storage
- Organization
- Access
- Provides the user with easy access to the
information in which he is interested
6Example of Information Retrieval
- Find all pages (documents) containing information
on college tennis teams which
- are maintained by a university in the USA
- participate in the NCAA tennis tournament
- To be relevant, the page must include
- National ranking of the team in the last three
years
- Email of the team coach
7Information vs. Data Retrieval
- Information retrieval
- Results might be inaccurate
- Small errors are likely to go unnoticed
- Deals with natural language text which is not
well structured and could be semantically
ambiguous
- Data retrieval
- Aims at retrieving all objects which satisfy
clearly defined conditions
- A single erroneous object among a thousand
retrieved objects means total failure
- Has a well defined structure and semantics
8User Task
- Retrieval
- Searches for desired information directly
- Browsing
- Still a process of retrieving information
- Main objectives are not clearly defined in the
beginning
- The purpose might change during the interaction
with the system
9Interaction with the System
[Figure: the user reaches the document database through retrieval or browsing]
10Keywords
- Queries are often translated to a set of keywords
(or index terms) which summarizes the description
of the user's information need
- Documents are also frequently represented through
a set of index terms or keywords
11Logical View From Full Text to Set of Index Terms
12Retrieval Process
13Ad hoc and Filtering Retrieval
- Ad hoc retrieval
- The documents in the collection remain relatively
static while new queries are submitted to the
system
- Filtering retrieval
- The queries remain relatively static while new
documents come into the system and leave
14User Profile in Filtering Retrieval
- User profile
- Describes the user's preferences
- Filtering
- Profile is compared to the incoming documents to
determine those that might be of interest to the
user
- Ranking
- Rank the filtered documents and show the ranking
to the user
15Constructing User Profile
- User provides a set of keywords which describes
an initial profile of preferences
- As new documents arrive, the system uses this
profile to select documents and show them to the
user
- The user indicates not only relevant documents,
but also non-relevant documents
- The system uses this information to adjust the
user profile
- Profile stabilizes after a while and no longer
changes drastically unless the user's interests
shift suddenly
16Information Retrieval Model
- Quadruple $[D, Q, F, R(q_i, d_j)]$
- $D$: set composed of logical views for the
documents in the collection
- $Q$: set composed of logical views for the user
information needs
- $F$: framework for modeling document
representations, queries, and their relationships
- $R(q_i, d_j)$: ranking function defining an ordering
among the documents with regard to the query $q_i$
17Boolean Model
18Vector Model
- Generic index term $k_i$
- Set of all index terms $K = \{k_1, \ldots, k_t\}$
- Weight $w_{i,j} > 0$ is associated with index term $k_i$
of a document $d_j$
- Document $d_j$ is associated with a vector
- $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
- Weight $w_{i,q} > 0$ is associated with the pair $(k_i, q)$
- Query vector $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
19Similarity by Vector Model
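- Evaluated as the correlation between the vectors
$\vec{d_j}$ and $\vec{q}$
- The correlation is quantified by the cosine of
the angle between the two vectors
- $sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|}$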
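The cosine measure can be sketched in a few lines. This is a minimal illustration, assuming both vectors are plain lists of term weights of equal length; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made here, not part of the slides.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(d * q for d, q in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(d * d for d in doc_weights))
    norm_q = math.sqrt(sum(q * q for q in query_weights))
    if norm_d == 0 or norm_q == 0:
        # Assumption: a vector with no weighted terms matches nothing.
        return 0.0
    return dot / (norm_d * norm_q)


if __name__ == "__main__":
    document = [0.5, 0.8, 0.3]
    query = [1.0, 0.0, 0.5]
    print(cosine_similarity(document, query))
```

Because the weights are non-negative, the value lies between 0 and 1, so documents can be ranked by decreasing similarity even when they match the query only partially.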
20An Effective Term Weighting Scheme
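- $N$: total number of documents
- $n_i$: number of documents where $k_i$ appears
- $freq_{i,j}$: raw frequency of term $k_i$ in $d_j$
- Normalized frequency: $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$
- Inverse document frequency: $idf_i = \log \frac{N}{n_i}$
21tf-idf Scheme and Salton-Buckley Query Weighting
- tf-idf scheme: $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
- Salton-Buckley query weighting: $w_{i,q} = \left(0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$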
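The two weighting schemes can be sketched as follows. The sketch assumes the usual forms: a document weight is the normalized term frequency times log(N/n_i), and a query weight is (0.5 + 0.5 * normalized query frequency) times log(N/n_i). Documents are assumed to be lists of already tokenized terms, the natural logarithm is an arbitrary choice, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return ({term: weight} per document, document frequency per term)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # n_i: number of documents in which each term appears at least once.
    doc_freq = Counter(term for count in counts for term in count)
    weights = []
    for count in counts:
        max_freq = max(count.values()) if count else 1
        weights.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in count.items()
        })
    return weights, doc_freq


def salton_buckley_query_weights(query_terms, doc_freq, n_docs):
    """Weight query terms; terms unseen in the collection are skipped."""
    count = Counter(query_terms)
    if not count:
        return {}
    max_freq = max(count.values())
    return {
        term: (0.5 + 0.5 * freq / max_freq) * math.log(n_docs / doc_freq[term])
        for term, freq in count.items()
        if term in doc_freq
    }


if __name__ == "__main__":
    docs = [["text", "text", "words"], ["words", "letters"]]
    weights, doc_freq = tf_idf_weights(docs)
    print(weights)
    print(salton_buckley_query_weights(["text"], doc_freq, len(docs)))
```

A term that appears in every document gets weight 0, which reflects the idea that such a term does not discriminate between documents.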
22Recall and Precision
- Recall
- Fraction of the relevant documents which has been
retrieved: Recall = |Ra| / |R|
- Precision
- Fraction of the retrieved documents which is
relevant: Precision = |Ra| / |A|
- R: relevant documents in the collection
- A: answer set
- Ra: relevant documents in the answer set
23Example
- Set containing relevant documents for query q
- Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89,
d123}
- Ranking of the retrieved documents
- 1. d123 6. d9 11. d38
- 2. d84 7. d511 12. d48
- 3. d56 8. d129 13. d250
- 4. d6 9. d187 14. d113
- 5. d8 10. d25 15. d3
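The recall and precision values for this ranking can be computed each time a relevant document is met. The sketch below uses the relevant set and the ranking from this example; the function name `precision_recall_points` is hypothetical.

```python
RELEVANT = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
RANKING = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]


def precision_recall_points(ranking, relevant):
    """Return (document, recall, precision) at each relevant document."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((doc, hits / len(relevant), hits / rank))
    return points


if __name__ == "__main__":
    for doc, recall, precision in precision_recall_points(RANKING, RELEVANT):
        print(f"{doc}: recall {recall:.0%}, precision {precision:.1%}")
```

Only five of the ten relevant documents appear in the ranking, so recall never goes above 50% here: precision is 100% at 10% recall (d123), about 66.7% at 20% (d56), 50% at 30% (d9), 40% at 40% (d25), and about 33.3% at 50% (d3).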
24Precision and Recall Figure
25User Relevance Feedback
- The user is presented with a list of the
retrieved documents
- After examining them, the user marks those which
are relevant
- In practice, only the top 10 (or 20) ranked
documents need to be examined
- Select important terms attached to the documents
marked relevant
- Enhance the importance of these terms in a new
query formulation
- The new query will be moved towards the relevant
documents and away from the non-relevant ones
26Term Reweighting
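- Standard Rocchio
- $\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d_j} \in D_r} \vec{d_j} - \frac{\gamma}{|D_n|} \sum_{\vec{d_j} \in D_n} \vec{d_j}$
- $D_r$: set of relevant documents, as identified by
the user, among the retrieved documents
- $D_n$: set of non-relevant documents, as
identified by the user, among the retrieved
documents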
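A minimal sketch of this kind of term reweighting, assuming the standard Rocchio form: the modified query is alpha times the original query, plus beta times the mean of the relevant document vectors, minus gamma times the mean of the non-relevant ones. The default values of `alpha`, `beta`, and `gamma` are placeholders chosen for illustration only, and all names are hypothetical.

```python
def centroid(vectors, size):
    """Component-wise mean of the vectors (all zeros if there are none)."""
    if not vectors:
        return [0.0] * size
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]


def rocchio(query, relevant_docs, non_relevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards relevant and away from non-relevant documents."""
    size = len(query)
    rel = centroid(relevant_docs, size)
    non = centroid(non_relevant_docs, size)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel, non)]


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0]
    relevant = [[0.8, 0.6, 0.0], [0.6, 0.4, 0.0]]
    non_relevant = [[0.0, 0.0, 1.0]]
    print(rocchio(query, relevant, non_relevant))
```

The sketch leaves negative weights in place; a real system might clip them to zero, which is a design decision outside what the slide states.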
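27Modeling of Natural Language Zipf's Law
- In a text of $n$ words with a vocabulary of $V$
words, the $i$-th most frequent word appears
$\frac{n}{i^{\theta} H_V(\theta)}$ times, where
$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$
[Figure: word frequency F plotted against the words sorted by frequency]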
28Modeling of Natural Language Heaps' Law
- The vocabulary of a text of size $n$ words is of
size $V = K n^{\beta} = O(n^{\beta})$
29Lexical Analysis of the Text
30Elimination of Stopwords
- Stopwords
- Words too frequent among the documents in the
collection
- Not good discriminators
- Articles, prepositions, conjunctions, and some
verbs, adverbs, and adjectives are natural
candidates for a list of stopwords
- Elimination of stopwords
- Reduces the size of the index structure
considerably (40% or more is typical)
- Counterexample: "to be or not to be"
31Stemming
- Stem
- Portion of a word which is left after the removal
of its affixes (i.e., prefixes and suffixes like
plurals, gerund forms, and past tense suffixes)
- Stemming
- Substitute the words by their respective stems
- Useful for improving retrieval performance
- Can reduce the size of the index structure
- The literature disagrees about the benefits
- The Porter algorithm is often used for suffix
stripping
32Noun Groups
- Most of the semantics is carried by the nouns
- Selects nouns as index terms through systematic
elimination of verbs, adjectives, adverbs,
connectives, articles, and pronouns
- Common to combine two or three nouns in a single
component (e.g., computer science)
- Makes sense to cluster nouns which appear nearby
into a single indexing component
- A noun group is a set of nouns with no more than 3
(or a predetermined threshold) words between any
two nouns
33Thesauri
- Refers to a treasury of words consisting of
- A precompiled list of important words
- For each word in the list, a set of related words
- Complemented with a definition or an explanation
- Purposes
- Provide a standard vocabulary for indexing and
searching
- Assist users with locating terms for proper
query formulation
- Provide classified hierarchies that allow the
broadening and narrowing of the current query
34Inverted Files
- A word-oriented mechanism for indexing a text
collection in order to speed up the searching
task
- Structure
- Vocabulary
- Occurrences
- The space required for the vocabulary is rather
small, according to Heaps' law
- The occurrences need extra space
35Example of an Inverted Index
Inverted Index
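An inverted index with a vocabulary and occurrence lists can be sketched as below, using the short sample text that appears on the following slides. The stopword list is an assumption chosen so that only the words letters, made, many, text, and words are indexed; positions are 1-based character offsets.

```python
import re

TEXT = "This is a text. A text has many words. Words are made from letters."
# Assumption: these words are treated as stopwords and left out of the index.
STOPWORDS = {"this", "is", "a", "has", "are", "from"}


def build_inverted_index(text, stopwords):
    """Map each indexed word to the 1-based positions where it starts."""
    index = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stopwords:
            index.setdefault(word, []).append(match.start() + 1)
    # Keep the vocabulary in lexicographical order.
    return dict(sorted(index.items()))


if __name__ == "__main__":
    for word, occurrences in build_inverted_index(TEXT, STOPWORDS).items():
        print(word, occurrences)
```

For the sample text this yields letters at 60, made at 50, many at 28, text at 11 and 19, and words at 33 and 40.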
36Inverted Index using Block Addressing
Text, divided into Block 1 to Block 4:
This is a text. A text has many words. Words are made from letters.
[Figure: inverted index whose occurrences point to blocks instead of exact positions]
37Block Considerations
- Blocks can be of fixed size
- Or be defined using the natural division of the
text collection into files, documents, web pages,
etc.
38Effect of Block Sizes
For each collection, the right column considers
that all words are indexed, while the left column
considers that stopwords are not indexed
39Searching with Inverted Files
- Vocabulary search
- Better to have vocabulary in a separate file
- Vocabulary file fits in main memory in most cases
- Retrieval of occurrences
- Manipulation of occurrences
- If block addressing is used, it may be necessary
to directly search the text to find the
information missing from the occurrences (e.g.,
exact word position)
- Sublinear search time and sublinear space
requirements
40Constructing a Vocabulary Trie
[Figure: vocabulary trie built from the text, with one leaf per word]
Word: occurrences
letters: 60
made: 50
many: 28
text: 11, 19
words: 33, 40
41Building an Inverted Index
- Once the text is exhausted, the trie is written
to disk together with the lists of occurrences
- Split the index into two files
- First file: lists of occurrences are stored
contiguously
- Second file: vocabulary is stored in
lexicographical order and, for each word, a
pointer to its list in the first file is also
included
42Inverted Index for Large Texts
- If the index does not fit in main memory, the
partial index Ii obtained up to now is written to
disk and erased from main memory before
continuing with the rest of the text
- Finally, a number of partial indices Ii exist on
disk. These indices are then merged in a
hierarchical manner
43Merging the Partial Indices
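[Figure: binary merge tree for eight partial indices; the numbers give the order of the merges]
Level 1 (initial dumps): I-1, I-2, I-3, I-4, I-5, I-6, I-7, I-8
Level 2: I-1..2 (merge 1), I-3..4 (merge 2), I-5..6 (merge 3), I-7..8 (merge 4)
Level 3: I-1..4 (merge 5), I-5..8 (merge 6)
Level 4 (final index): I-1..8 (merge 7)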
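The hierarchical merge can be sketched with in-memory dictionaries standing in for the partial indices on disk. This is an illustration under that simplifying assumption; the function names are hypothetical, and each partial index is assumed to cover a later part of the text than the one before it, so occurrence lists can simply be concatenated.

```python
def merge_two(left, right):
    """Merge two partial indices; right covers later text than left."""
    merged = {}
    for word in sorted(set(left) | set(right)):
        merged[word] = left.get(word, []) + right.get(word, [])
    return merged


def merge_hierarchically(partials):
    """Merge neighbouring partial indices level by level until one is left."""
    level = list(partials)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(merge_two(level[i], level[i + 1]))
            else:
                next_level.append(level[i])  # odd one out moves up unchanged
        level = next_level
    return level[0] if level else {}


if __name__ == "__main__":
    partials = [
        {"text": [11]},
        {"many": [28], "text": [19]},
        {"words": [33, 40]},
        {"letters": [60], "made": [50]},
    ]
    print(merge_hierarchically(partials))
```

With eight partial indices the loop performs four, then two, then one merge, which matches the shape of a binary merge tree.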
44Suffix Trees and Suffix Arrays
- Queries such as phrases are expensive to solve
using inverted indices
- The concept of a word does not exist in some
applications such as genetic databases
- Suffix trees and suffix arrays are suitable for a
wider spectrum of applications
- For word-based applications, inverted files
perform better unless complex queries are an
important issue
45Suffixes
Text:
This is a text. A text has many words. Words are made from letters.
Suffixes (one per index point, with its starting position):
11: text. A text has many words. Words are made from letters.
19: text has many words. Words are made from letters.
28: many words. Words are made from letters.
33: words. Words are made from letters.
40: Words are made from letters.
50: made from letters.
60: letters.
46Suffix Trie
[Figure: suffix trie for the index points 11, 19, 28, 33, 40, 50, 60 of the example text]
47Suffix Tree
[Figure: suffix tree for the index points 11, 19, 28, 33, 40, 50, 60 of the example text]
48Suffix Array
Suffix array: 60, 50, 28, 19, 11, 40, 33
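A suffix array over word beginnings can be sketched by sorting the index points by the text that starts at each of them. The sketch assumes the sample sentence used in this deck, a hypothetical stopword list that leaves seven index points, and a case-insensitive comparison.

```python
import re

TEXT = "This is a text. A text has many words. Words are made from letters."
# Assumption: stopwords do not become index points.
STOPWORDS = {"this", "is", "a", "has", "are", "from"}


def index_points(text, stopwords):
    """1-based start positions of the words that are indexed."""
    return [m.start() + 1
            for m in re.finditer(r"[A-Za-z]+", text)
            if m.group().lower() not in stopwords]


def build_suffix_array(text, points):
    """Sort the index points by the (case-folded) suffix starting there."""
    return sorted(points, key=lambda p: text[p - 1:].lower())


if __name__ == "__main__":
    points = index_points(TEXT, STOPWORDS)
    print(build_suffix_array(TEXT, points))
```

Sorting whole suffixes like this is simple but slow for long texts; it is meant only to show what the array contains. For the sample text the result is 60, 50, 28, 19, 11, 40, 33.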
49Supra-index over Suffix Array
Supra-index: lett, text, word
Suffix array: 60, 50, 28, 19, 11, 40, 33
50Vocabulary Supra-index vs. Inverted List
Vocabulary supra-index: letters, made, many, text, words
Suffix array: 60, 50, 28, 19, 11, 40, 33
Inverted list: 60, 50, 28, 11, 19, 33, 40
51Searching Using Suffix Arrays
- The search pattern $P$ originates two limiting
patterns $P_1$ and $P_2$, so that we want any suffix
$S$ such that $P_1 \le S < P_2$
- First binary search both limiting patterns in the
suffix array
- All the elements lying between both positions
point to exactly those suffixes that start like
the original pattern, i.e., to the pattern
positions in the text
- A simple phrase can be searched as if it were a
simple pattern
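The two binary searches can be sketched with the standard `bisect` module. The sketch assumes the sample sentence and suffix array used in this deck, a case-insensitive comparison, and an upper limiting pattern obtained by incrementing the last character of the search pattern. The suffixes are materialized as strings only to keep the example short.

```python
from bisect import bisect_left

TEXT = "This is a text. A text has many words. Words are made from letters."
SUFFIX_ARRAY = [60, 50, 28, 19, 11, 40, 33]  # 1-based index points


def search(text, suffix_array, pattern):
    """Return the sorted text positions of suffixes starting with pattern."""
    if not pattern:
        return []
    suffixes = [text[p - 1:].lower() for p in suffix_array]
    lower = pattern.lower()
    # Upper limiting pattern: smallest string greater than every
    # string that starts with the search pattern.
    upper = lower[:-1] + chr(ord(lower[-1]) + 1)
    first = bisect_left(suffixes, lower)
    last = bisect_left(suffixes, upper)
    return sorted(suffix_array[first:last])


if __name__ == "__main__":
    print(search(TEXT, SUFFIX_ARRAY, "text"))
    print(search(TEXT, SUFFIX_ARRAY, "many words"))
```

The phrase "many words" is handled exactly like the single word "text", since both are just prefixes of suffixes.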
52Sequential Searching for Exact String Matching
- Given a short pattern P of length m and a long
text T of length n
- Find all the text positions where the pattern
occurs
- With no data structure being built on the text
- Assume that the text and the pattern are
sequences of characters drawn from an alphabet of
size s, whose first character is at position 1
53Brute Force
[Figure: brute-force matching trace, sliding the pattern along the text one position at a time]
Worst case O(mn), average case O(n)
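A minimal brute-force matcher, written to report 1-based positions as assumed in this deck. The example strings are chosen here for illustration.

```python
def brute_force_search(text, pattern):
    """Try every text position and compare the pattern character by character."""
    n, m = len(text), len(pattern)
    positions = []
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            positions.append(start + 1)  # report 1-based positions
    return positions


if __name__ == "__main__":
    print(brute_force_search("abracabracadabra", "abracadabra"))
```

The inner loop usually stops after very few comparisons, which is why the average cost is close to linear even though the worst case is O(mn).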
54Knuth-Morris-Pratt: the next Function
j:       1  2  3  4  5  6  7  8  9  10 11 12
pattern: a  b  r  a  c  a  d  a  b  r  a
next:    0  0  0  0  1  0  1  0  0  0  0  4
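The next table for a pattern can be computed as sketched below. The sketch assumes the following definition: next[j] is the length of the longest proper prefix of P[1..j-1] that is also a suffix of it, with the extra condition that the characters following the prefix and the suffix differ, and 0 when no such prefix exists. The function names are hypothetical.

```python
def border_lengths(pattern):
    """b[i] = length of the longest proper border of pattern[:i]."""
    b = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = b[k]
        if pattern[i] == pattern[k]:
            k += 1
        b[i + 1] = k
    return b


def next_table(pattern):
    """Return next[1..m+1] as a Python list (element 0 is next[1])."""
    m = len(pattern)
    b = border_lengths(pattern)
    table = []
    for j in range(1, m + 2):
        k = b[j - 1]
        while True:
            # Accept border k if there is no character at j (j = m + 1)
            # or if the characters after prefix and suffix differ.
            if j > m or pattern[k] != pattern[j - 1]:
                table.append(k)
                break
            if k == 0:
                table.append(0)
                break
            k = b[k]
    return table


if __name__ == "__main__":
    print(next_table("abracadabra"))
```

For the pattern abracadabra this gives 0 0 0 0 1 0 1 0 0 0 0 4.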
55Knuth-Morris-Pratt: Example
[Figure: Knuth-Morris-Pratt matching trace]
Linear worst case behavior, but no faster
than brute force on average
56Boyer-Moore Heuristics
[Figure: Boyer-Moore heuristics applied to the pattern abracadabra]
Match heuristic: 3
Occurrence heuristic: 5
57Boyer-Moore Example
[Figure: Boyer-Moore matching trace]
O(n log(m)/m) on average, worst case is O(mn)
Fastest in general
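The idea of skipping text positions can be illustrated with a simplified variant. The sketch below is the Horspool simplification, which uses only an occurrence-style shift based on the text character aligned with the end of the pattern; it is not the full Boyer-Moore algorithm with both heuristics. Positions are reported 1-based.

```python
def horspool_search(text, pattern):
    """Simplified Boyer-Moore (Horspool): occurrence-based shifts only."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Distance from the last occurrence of each character (excluding the
    # final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    positions = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            positions.append(pos + 1)  # report 1-based positions
        # Characters that do not occur in the pattern allow a shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return positions


if __name__ == "__main__":
    print(horspool_search("abracabracadabra", "abracadabra"))
```

On the example, the window moves by 3 and then by 2 positions before the match is found, so several alignments are never examined.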
58Approximate String Matching
- Given a short pattern P of length m, a long text
T of length n, and a maximum allowed number of
errors k, find all the text positions where the
pattern occurs with at most k errors
59Similarity
- Similarity is measured by a distance function
- Hamming distance
- Number of positions that have different
characters
- Should be symmetric and satisfy the triangle
inequality
60Levenshtein Distance (Edit Distance)
- Minimum number of character insertions,
deletions, and replacements to make two strings
equal
- Examples
- distance(color, colour) = 1
- distance(survey, surgery) = 2
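The edit distance can be computed with a small dynamic programming routine. This is a standard textbook formulation that keeps only two rows of the table; the function name is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements."""
    previous = list(range(len(b) + 1))  # distance from "" to b[:j]
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("color", "colour"))
    print(edit_distance("survey", "surgery"))
```

The two calls reproduce the examples above: 1 for color and colour, 2 for survey and surgery.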
61Dynamic Programming for Approximate String
Matching
- A matrix $C[0..m, 0..n]$ is filled column by
column, where $C[i,j]$ represents the minimum
number of errors needed to match $P[1..i]$ to a
suffix of $T[1..j]$
- Computed as
- $C[0,j] = 0$
- $C[i,0] = i$
- $C[i,j] = C[i-1,j-1]$ if $P_i = T_j$
- $C[i,j] = 1 + \min(C[i-1,j], C[i,j-1], C[i-1,j-1])$ otherwise
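The recurrence can be sketched by keeping only the current column of the matrix. The function reports every 1-based text position j where the last row entry is at most k, that is, where an occurrence of the pattern with at most k errors ends. The function name and the example strings are assumptions made for illustration.

```python
def approximate_search(text, pattern, k):
    """Return 1-based end positions of matches with at most k errors."""
    m = len(pattern)
    column = list(range(m + 1))  # column 0: C[i,0] = i
    ends = []
    for j, text_char in enumerate(text, start=1):
        new_column = [0]  # C[0,j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            if pattern[i - 1] == text_char:
                new_column.append(column[i - 1])             # C[i-1,j-1]
            else:
                new_column.append(1 + min(new_column[i - 1],  # C[i-1,j]
                                          column[i],          # C[i,j-1]
                                          column[i - 1]))     # C[i-1,j-1]
        column = new_column
        if column[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approximate_search("xabcx", "abc", 0))
    print(approximate_search("xabcx", "abc", 1))
```

Setting the first row to 0 is what distinguishes text searching from plain edit distance: the pattern is allowed to start at any text position at no cost.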
62Dynamic Programming Example
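[Figure: dynamic programming matrix, with the text T along the columns and the pattern P along the rows]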
63Structured Text Retrieval
- Queries combine the patterns with the
specification of structural components of the
document
- Example
- same-page(near("atom holocaust", Figure(label("earth"))))
64Non-Overlapping Lists
[Figure: the structure of a chapter stored as four non-overlapping lists]
L0: Chapter
L1: Sections
L2: Subsections
L3: Subsubsections
65Non-Overlapping Lists
- A single inverted file is built in which each
structural component stands as an entry in the
index
- Associated with each entry there is a list of
text regions as a list of occurrences
- Such a list could be easily merged with the
traditional inverted file for the words in the
text
66Proximal Nodes
[Figure: hierarchical index of chapter, sections, subsections, and subsubsections, next to the inverted list for the word holocaust with entries 10, 256, 48324]
67Proximal Nodes Simple Query Processing Strategy
- Traverse the inverted list for the term
- For each entry in the list, search the
hierarchical index looking for chapter, sections,
subsections, and subsubsections containing that
occurrence of the term
68Proximal Nodes Sophisticated Strategy
- For the first entry in the list, search the
hierarchical index as before, until no more
successful matches occur
- Verify whether the innermost matching component
also matches the second entry in the list
- Proceed then to the third entry in the list, and
so on
69Text in Sequence
- Written text is usually conceived to be read
sequentially
- A sequenced organizational structure lies
underneath most written text
- Sometimes we are looking for information not
easily captured through sequential reading
- Example
- A book about the history of war organized
chronologically
- We want to know about regional wars in Europe
70Hypertext
- A high level interactive navigational structure
which allows us to browse text non-sequentially
on a computer screen
- Basically a directed graph structure
- Basis for HTML and HTTP, which originated the
World Wide Web
71World Wide Web
- Can be seen as a very large, unstructured but
ubiquitous database
- Triggers the need for efficient tools to manage,
retrieve, and filter information from it
- Those tools are also important in large
intranets, to extract or infer new information to
support a decision process, a task called data
mining
72Searching the Web
- Forms
- Use search engines
- Use Web directories
- Exploit hyperlink structure
- Challenges
- Distributed data
- High percentage of volatile data
- Large volume
- Unstructured and redundant data
- Quality of data
- Heterogeneous data
73Problems Regarding the User and Interaction
- How to specify a query?
- How to interpret the answer provided by the
retrieval system?
- How do we handle a large answer?
- How do we rank the documents?
- How do we select the documents?
- How do we browse efficiently in large documents?
74Measuring the Web (1999)
- There are more than 40 million computers in more
than 200 countries connected to the Internet
- Estimated number of Web servers ranges from 2.4
million to over three million
- Estimated number of Web pages ranges from 200 to
320 million, growing at a rate of 20 million
pages per month
- Estimated that the 30,000 largest Web sites (about
1% of the Web) account for approximately 50% of all
Web pages
75Measuring the Web (1999)
- An average page has between 5 and 15 hyperlinks,
and most of them are local
- Most Web pages are HTML pages
- Assuming that the average HTML page has 5 KB and
that there are 300 million Web pages, we have at
least 1.5 terabytes of text
- Total number of languages exceeds 100
76Languages of the Web
77Models of the Web
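- Heaps' and Zipf's laws are also valid in the Web
- Probability of finding a document of size $x$ bytes
follows a log-normal distribution:
$p(x) = \frac{1}{x \sigma \sqrt{2\pi}} e^{-(\ln x - \mu)^2 / (2\sigma^2)}$
- 93% of all the files have a size below 9.3 KB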
78Distribution of All File Size (1998)
79Right Tail Distribution for Different File Types
(1996)
80Search Engines
- On the Web, all queries must be answered without
accessing the text. That is, only the indices
are available.
- Otherwise, the engine would have to either
- Store a local copy of the Web pages (too
expensive)
- Access remote pages through the network at query
time (too slow)
81Search Engine Centralized Architecture
- Crawlers are programs (software agents) that
traverse the Web, sending new or updated pages to
a main server where they are indexed
- Crawlers are also called robots, spiders,
wanderers, walkers, and knowbots
- A crawler does not actually move to and run on
remote machines
- The index is used in a centralized fashion to
answer queries submitted from different places in
the Web
82Search Engine Centralized Architecture
[Figure: centralized architecture with the components Users, Interface, Query Engine, Index, Indexer, Crawler, and Web]
83Search Engine Centralized Architecture
- Main problems
- Gathering of the data (highly dynamic)
- Saturated communication links
- High load at web servers
- Volume of the data
- May not be able to cope with Web growth in the
near future
- Good load balancing, both internally (answering
queries and indexing) and externally (crawling), is
important
84Page Ranking
- Most search engines use variations of the Boolean
or vector model
- Ranking has to be performed without accessing the
text, just the index
- The vector model yields a better recall-precision
curve, with an average precision of 75% in one
study
- Some new algorithms also use hyperlink
information and achieve even better results
85Crawling the Web
- Starts with a set of URLs and from there extracts
other URLs, which are followed recursively in a
breadth-first or depth-first fashion
- Allows users to submit top Web sites that will be
added to the URL set
- Or starts with a set of popular URLs
- Difficult to coordinate several crawlers to avoid
visiting the same page more than once
- Or partitions the Web using country codes or
Internet names
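The breadth-first traversal can be sketched without any network access by passing in a function that returns the links of a page. Everything here is hypothetical: `get_links` stands in for fetching and parsing a page, and `TOY_WEB` is a made-up link graph used only to exercise the traversal.

```python
from collections import deque


def crawl_breadth_first(seed_urls, get_links, max_pages=100):
    """Visit pages level by level, never visiting the same URL twice."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for real pages.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

if __name__ == "__main__":
    print(crawl_breadth_first(["a"], lambda url: TOY_WEB.get(url, [])))
```

Replacing the queue with a stack would give the depth-first variant. The `seen` set is what prevents a single crawler from revisiting a page; sharing such a set among several crawlers is the coordination problem the slide mentions.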
86Indices
- Dynamically generated pages and password-protected
pages cannot be indexed
- Most indices use variants of the inverted file
- Some use elimination of stopwords to reduce the
size of the index
- The index is complemented with a short description
of each Web page
- A query is answered by doing a binary search on
the sorted list of words of the inverted file
- Block addressing is used by some search engines
87Web Directories
- Used as a browsing tool; Yahoo! is an example
- Also called catalogs, yellow pages, or subject
directories
- In most cases, pages have to be submitted to the
Web directory, where they are reviewed and
classified
- Classification is often done manually
- Can afford to have a copy of all classified pages
- Most also send the query to a search engine
88Metasearchers
- Web servers that send a given query to several
search engines, Web directories, and other
databases, collect the answers, and unify them
- Examples include Metacrawler and SavvySearch
- They differ in how ranking is performed in the
unified result
- Metasearchers for specific topics can be
considered software agents
89Dynamic Search
- Use an online search to discover relevant
information by following links
- Slow, but might be used in small and dynamic
subsets of the Web
- Fish search
- Exploits the intuition that relevant documents
often have neighbors that are relevant
- At each step, the page with the highest priority is
analyzed. If relevant, a heuristic decides to
follow or not to follow the links on that page
90Software Agents
- For searching specific information on the Web
- Deal with heterogeneous sources of information
which have to be combined
- Important issues
- How to determine relevant sources
- How to merge the results retrieved (the fusion
problem)