Title: CS 430 INFO 430 Information Retrieval
1. CS 430 / INFO 430 Information Retrieval
Lecture 4: Searching Full Text 4
2. Course Administration
Assignment 1 has been posted. It is a programming assignment and is due on Sunday, September 17 at 11 p.m. Follow the instructions carefully. Send questions to cs430-l_at_cs.cornell.edu.
This is a preliminary statement of the assignment. Watch the Web site for any minor changes.
3. Inverted File
Inverted file: An inverted file is a list of search terms that are organized for associative look-up, i.e., to answer the questions:
  In which documents does a specified search term appear?
  Where within each document does each term appear? (There may be several occurrences.)
In a free text search system, the word list and the postings file together provide an inverted file system. In addition, they contain the data needed to calculate weights and information that is used to display results.
4. Inverted File -- Basic Concept
Word      Document
abacus    3, 19, 22
actor     2, 19, 29
aspen     5
atoll     11, 34
The list of words is called an index file, a word list, or a vocabulary file. Stop words are removed before building the index.
5. Inverted List -- Definitions
Posting: Entry in an inverted file system that applies to a single instance of a term within a document, e.g., there are three postings for "abacus":
  abacus 3
  abacus 19
  abacus 22
Inverted List: A list of all the postings in an inverted file system that apply to a specific word, e.g.,
  abacus 3 19 22
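The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the toy documents, the stop list, and the names `DOCS`, `STOP_WORDS`, and `build_inverted_lists` are all assumptions made for the example, and it records one entry per document rather than one per occurrence.

```python
from collections import defaultdict

# Hypothetical toy collection: document ID -> text (not from the lecture).
DOCS = {
    3: "the abacus is old",
    19: "an actor used an abacus",
    22: "abacus",
}
STOP_WORDS = {"the", "is", "an"}  # assumed stop list


def build_inverted_lists(docs, stop_words):
    """Map each term to a sorted list of the document IDs that contain it."""
    inverted = defaultdict(list)
    # Visiting documents in ID order keeps every inverted list sorted.
    for doc_id in sorted(docs):
        terms = set(docs[doc_id].lower().split()) - stop_words
        for term in sorted(terms):
            inverted[term].append(doc_id)
    return dict(inverted)


INDEX = build_inverted_lists(DOCS, STOP_WORDS)
print(INDEX["abacus"])  # [3, 19, 22]
```

The keys of `INDEX` play the role of the word list and the values play the role of the inverted lists.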
6. Organization of Files for Full Text Searching
[Figure: three components. Word list (index file): a table with columns Term and Pointer to postings, holding the terms ant, bee, cat, dog, elk, fox, gnu, hog. Postings: the inverted lists. Documents store.]
7. Representation of Inverted Files
Document store: Stores the documents. Important for user interface design. Repositories for the storage of document collections are covered in CS 431.
Word list (vocabulary file): Stores the list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries (lexicographic index). May be held in memory.
Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially. Can be very large.
8. Document Store
The Documents Store holds the corpus that is being indexed. The corpus may be:
  primary documents, e.g., electronic journal articles or Web pages
  surrogates, e.g., catalog records or abstracts, which refer to the primary documents
9. Document Store
The storage of the document store may be:
  Central (monolithic) - all documents stored together on a single server (e.g., library catalog)
  Distributed database - all documents managed together but stored on several servers (e.g., Medline, Westlaw)
  Highly distributed - documents stored on independently managed servers (e.g., the Web)
Each requires a document ID, which is a unique identifier that can be used by the search system to refer to the document, and a location counter, which can be used to specify the location of words or characters within a document.
10. Documents Store for Web Search Systems
For Web search systems:
  A document is a Web page.
  The documents store is the Web.
  The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page on the Web for indexing. After indexing, the local copy of each page is discarded, unless stored in a cache.
(In addition to the usual word list and postings file, the indexing system stores contextual information, which will be discussed in a later lecture.)
11. Use of Inverted Files for Evaluating a Boolean Query
Example: abacus and actor
  Postings for abacus: 3, 19, 22
  Postings for actor: 2, 19, 29
Document 19 is the only document that contains both terms, "abacus" and "actor".
To evaluate the and operator, merge the two inverted lists with a logical AND operation.
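A minimal sketch of this merge, assuming both inverted lists are sorted by document ID. The function name `intersect` is an assumption for the example; the two lists are the postings for "abacus" and "actor" from the basic-concept slide.

```python
def intersect(list_a, list_b):
    """Return the document IDs present in both sorted inverted lists."""
    result = []
    i = j = 0
    # Advance through both lists in step; each posting is visited once.
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result


abacus = [3, 19, 22]
actor = [2, 19, 29]
print(intersect(abacus, actor))  # [19]
```

Because each list is read once from front to back, the same loop works when the lists are paged in sequentially from disk.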
12. Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and d_j a document, then q and d_j have no terms in common iff q · d_j = 0.
1. To calculate all the non-zero similarities, find R, the set of all the documents, d_j, that contain at least one term in the query.
2. Merge the inverted lists for each term t_i in the query, with a logical or, to establish the set, R.
3. For each d_j ∈ R, calculate Similarity(q, d_j), using appropriate weights.
4. Return the elements of R in ranked order.
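The four steps can be sketched as follows. This is an illustration under stated assumptions, not the lecture's method: the weights in `POSTINGS` are invented, the names `POSTINGS` and `rank` are hypothetical, and the similarity is taken to be the plain dot product q · d_j with no length normalization.

```python
from collections import defaultdict

# Hypothetical weighted postings: term -> list of (document ID, weight).
POSTINGS = {
    "abacus": [(3, 0.5), (19, 0.2), (22, 0.9)],
    "actor": [(2, 0.7), (19, 0.4), (29, 0.3)],
}


def rank(query_weights, postings):
    """Score every document in R by the dot product q . d_j, highest first."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Walking each inverted list performs the logical OR of step 2:
        # a document enters `scores` (the set R) once any query term matches.
        for doc_id, d_weight in postings.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Ties are broken by document ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


RANKED = rank({"abacus": 1.0, "actor": 1.0}, POSTINGS)
print([doc_id for doc_id, _ in RANKED])  # [22, 2, 19, 3, 29]
```

Documents that share no term with the query never appear in `scores`, which matches the observation that their similarity is zero.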
13. Enhancements to Inverted Files -- Concept
Location: Each posting holds information about the location of each term within the document.
Uses:
  user interface design -- highlight location of search term
  adjacency and near operators (in Boolean searching)
Frequency: Each inverted list includes the number of postings for each term.
Uses:
  term weighting
  query processing optimization
14. Inverted File -- Concept (Enhanced)
Word      Postings   Document   Location
abacus    4          3          94
                     19         7
                     19         212
                     22         56
actor     3          2          66
                     19         213
                     29         45
aspen     1          5          43
atoll     3          11         3
                     11         70
                     34         40
The three rows for actor form the inverted list for the term actor.
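As an illustration of how locations support an adjacency operator, the sketch below uses the abacus and actor postings from the table above. It assumes that a location is a word position, so that two terms are adjacent when their locations in the same document differ by one; the function name `adjacent` is hypothetical.

```python
# Postings with locations from the table above: (document, location).
POSTINGS = {
    "abacus": [(3, 94), (19, 7), (19, 212), (22, 56)],
    "actor": [(2, 66), (19, 213), (29, 45)],
}


def adjacent(first, second, postings):
    """Return (document, location) of `first` wherever `second` directly follows it."""
    following = set(postings.get(second, []))
    return [
        (doc, loc)
        for doc, loc in postings.get(first, [])
        if (doc, loc + 1) in following
    ]


MATCHES = adjacent("abacus", "actor", POSTINGS)
print(MATCHES)  # [(19, 212)]
```

A near operator could follow the same pattern by accepting any location within a chosen distance instead of exactly one.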
15. Data for Calculating Weights
The calculation of weights requires extra data to be held in the inverted file system.
For each term, t_j, and document, d_i: f_ij = number of occurrences of t_j in d_i
For each term, t_j: n_j = number of documents containing t_j
For each document, d_i: m_i = maximum frequency of any term in d_i
For the entire document file: n = total number of documents
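A small sketch of how these four quantities might be gathered. The documents in `DOCS` are invented, and the `weight` function uses one common tf.idf form, (f_ij / m_i) * log2(n / n_j), purely as an assumed example of "appropriate weights"; the slides do not specify a formula here.

```python
import math
from collections import Counter

# Hypothetical documents, already tokenized with stop words removed.
DOCS = {
    1: ["abacus", "actor", "abacus"],
    2: ["actor", "aspen"],
    3: ["abacus"],
}

n = len(DOCS)                                               # total documents
f = {i: Counter(terms) for i, terms in DOCS.items()}        # f[i][t] is f_ij
m = {i: max(counts.values()) for i, counts in f.items()}    # m_i per document
n_j = Counter(t for counts in f.values() for t in counts)   # n_j per term


def weight(term, doc_id):
    """Assumed tf.idf weight: (f_ij / m_i) * log2(n / n_j)."""
    return (f[doc_id][term] / m[doc_id]) * math.log2(n / n_j[term])


print(n, m[1], n_j["abacus"], f[1]["abacus"])  # 3 2 2 2
```

Only f_ij depends on both a term and a document, which is why it belongs in the postings, while n_j fits naturally in the word list record for each term.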
16. Word List: Individual Records for Each Term
The record for term j in the word list contains:
  term j
  pointer to inverted (postings) list for term j
  number of documents in which term j occurs (n_j)
17. Decisions in Building an Inverted File System: Lexicographic Order
It is important that the word list can be processed sequentially, i.e., in alphabetic order:
  To search with wild cards, e.g., comp*, which expands to every term beginning with the letters "comp".
  To list results for browsing lists of search terms.
This is a special case of the mathematical concept of lexicographic order.
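A minimal sketch of why lexicographic order helps with a trailing wild card: all terms sharing a prefix are contiguous in a sorted word list, so one binary search finds the first and a sequential scan finds the rest. The word list and the name `expand_prefix` are assumptions for the example.

```python
from bisect import bisect_left

# Hypothetical word list, kept in lexicographic (sorted) order.
WORD_LIST = ["abacus", "actor", "compare", "compute", "computer", "copper", "dog"]


def expand_prefix(word_list, prefix):
    """Return every term in the sorted word list that begins with `prefix`."""
    i = bisect_left(word_list, prefix)  # first term >= prefix
    matches = []
    # Matching terms are contiguous, so stop at the first non-match.
    while i < len(word_list) and word_list[i].startswith(prefix):
        matches.append(word_list[i])
        i += 1
    return matches


print(expand_prefix(WORD_LIST, "comp"))  # ['compare', 'compute', 'computer']
```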
18. Decisions in Building an Inverted File System: Query Languages
Some query options may require huge computation, e.g.,
Regular expressions: If inverted files are stored in lexicographic order,
  comp* can be processed efficiently
  *comp cannot be processed efficiently
Logical operators: If A and B are search terms,
  A or B can be processed by comparing two moderate sized lists
  (not A) or (not B) requires two very large lists
19. Decisions in Building an Inverted File System: Storage and Performance
Storage: Inverted file systems are big, typically 10% to 100% the size of the collection of documents.
Update performance: It must be possible, with a reasonable amount of computation, to:
  (a) Add a large batch of documents
  (b) Add a single document
Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources.
20. Postings File
The postings file stores the elements of a sparse matrix, the components of the term vector space, with weights. It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file.
Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document. Each list consists of one or many individual postings.
21. Postings File: A Linked List for Each Term
[Figure: linked lists of (document, location) postings. 1 abacus -> (3, 94) -> (19, 7) -> (19, 212) -> (22, 56); 3 aspen -> (5, 43).]
A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.
22. Length of Postings File
For a common term there may be a very large number of postings.
Example:
  1,000,000,000 documents
  1,000,000 distinct words
  average length 1,000 words per document
  10^12 postings
By Zipf's law, the 10th ranking word occurs approximately (10^12/10)/10 = 10^10 times.
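The arithmetic in this example can be checked directly. The sketch below reads the slide's expression as (total postings / constant) / rank with a constant of 10; that reading, and the names used, are assumptions.

```python
documents = 1_000_000_000
words_per_document = 1_000

# One posting per word occurrence: 10**9 documents * 10**3 words each.
postings = documents * words_per_document


def zipf_occurrences(rank, total, constant=10):
    """Approximate occurrences of the word at `rank`: (total / constant) / rank."""
    return total // constant // rank


print(postings == 10**12)                        # True
print(zipf_occurrences(10, postings) == 10**10)  # True
```

Under this estimate, a single inverted list for a frequent word can hold about one percent of all postings.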
23. Postings File
Merging inverted lists is the most
computationally intensive task in many
information retrieval systems. Since inverted
lists may be long, it is important to match
postings efficiently. Usually, the inverted lists
will be held on disk and paged into memory for
matching. Therefore algorithms for matching
postings process the lists sequentially. For
efficient matching, the inverted lists should all
be sorted in the same sequence. Inverted lists
are commonly cached to minimize disk accesses.
24. Word List
On disk: If a word list is held on disk, search time is dominated by the number of disk accesses.
In memory: Suppose that a word list has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, on average 100 characters. The size of the index is 100 megabytes, which can easily be held in the memory of a dedicated computer.
25. File Structures for Inverted Files: Linear Index
Advantages:
  Can be searched quickly, e.g., by binary search, O(log n)
  Good for lexicographic processing, e.g., comp*
  Convenient for batch updating
  Economical use of storage
Disadvantages:
  Index must be rebuilt if an extra term is added
26. File Structures for Inverted Files: Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog
              elk
            /     \
         bee       hog
        /   \     /
     ant     cat fox
               \    \
               dog   gnu
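The tree for this input can be reproduced with a plain, unbalanced binary search tree. The sketch below is illustrative only; the class and function names are assumptions.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None


def insert(root, term):
    """Insert `term` into an unbalanced binary search tree; return the root."""
    if root is None:
        return Node(term)
    if term < root.term:
        root.left = insert(root.left, term)
    elif term > root.term:
        root.right = insert(root.right, term)
    return root  # a duplicate term leaves the tree unchanged


def in_order(node):
    """Yield the terms in lexicographic order."""
    if node is not None:
        yield from in_order(node.left)
        yield node.term
        yield from in_order(node.right)


ROOT = None
for word in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    ROOT = insert(ROOT, word)

print(ROOT.term, ROOT.left.term, ROOT.right.term)  # elk bee hog
print(list(in_order(ROOT)))
```

The shape depends entirely on the order of insertion, which is why such a tree tends to become unbalanced.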
27. File Structures for Inverted Files: Binary Tree
Advantages:
  Can be searched quickly
  Convenient for batch updating
  Easy to add an extra term
  Economical use of storage
Disadvantages:
  Less good for lexicographic processing, e.g., comp*
  Tree tends to become unbalanced
  If the index is held on disk, important to optimize the number of disk accesses
28. File Structures for Inverted Files: Binary Tree
Calculation of maximum depth of tree. Illustrates importance of balanced trees.
Worst case: depth = n, O(n)
Ideal case: depth = log(n + 1)/log 2, O(log n)
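Both cases can be checked with a short experiment. This sketch assumes distinct keys and counts depth in levels (a single node has depth 1); `depth_after_inserts` and `balanced_order` are hypothetical helper names.

```python
import math


def depth_after_inserts(keys):
    """Depth in levels of an unbalanced binary search tree built from distinct keys."""
    tree = {}      # key -> [left child key, right child key]
    root = None
    deepest = 0
    for key in keys:
        if root is None:
            root, tree[key], deepest = key, [None, None], 1
            continue
        current, level = root, 1
        while True:
            side = 0 if key < current else 1
            child = tree[current][side]
            if child is None:
                tree[current][side] = key
                tree[key] = [None, None]
                deepest = max(deepest, level + 1)
                break
            current, level = child, level + 1
    return deepest


def balanced_order(keys):
    """Order sorted keys so that each median is inserted before its two halves."""
    if not keys:
        return []
    mid = len(keys) // 2
    return [keys[mid]] + balanced_order(keys[:mid]) + balanced_order(keys[mid + 1:])


n = 127
keys = list(range(1, n + 1))
print(depth_after_inserts(keys))                  # 127: sorted input, worst case
print(depth_after_inserts(balanced_order(keys)))  # 7: equals log(n + 1)/log 2
print(math.log2(n + 1))                           # 7.0
```

Inserting terms in alphabetic order is exactly the worst case, so a word list loaded from a sorted file degenerates into a linked list unless the tree is rebalanced.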
29. File Structures for Inverted Files: Right Threaded Binary Tree
Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.
Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e., the link to the successor, of each node is maintained.
Can be used for lexicographic processing.
A good data structure when held in memory.
Knuth vol 1, 2.3.1, page 325.
30. File Structures for Inverted Files: Right Threaded Binary Tree
[Figure: a right-threaded binary tree containing the terms ant, bee, cat, dog, elk, fox, gnu, hog, with one NULL link.]
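A minimal sketch of a right-threaded binary search tree, showing how the threads allow an in-order (lexicographic) scan with no stack or recursion. The insertion order is chosen for illustration and is not taken from the figure; the class, field, and function names are assumptions.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None            # right child, or thread to the successor
        self.right_is_thread = True  # a new node has no right child yet


def insert(root, term):
    """Insert `term`, maintaining right threads; return the root."""
    if root is None:
        return Node(term)            # thread is None: no successor yet
    current = root
    while True:
        if term < current.term:
            if current.left is None:
                node = Node(term)
                node.right = current         # successor of a left child is its parent
                current.left = node
                return root
            current = current.left
        elif term > current.term:
            if current.right_is_thread:
                node = Node(term)
                node.right = current.right   # inherit the parent's old successor
                current.right = node
                current.right_is_thread = False
                return root
            current = current.right
        else:
            return root                      # duplicate: nothing to do


def in_order(root):
    """Yield terms in lexicographic order by following children and threads."""
    current = root
    while current is not None and current.left is not None:
        current = current.left
    while current is not None:
        yield current.term
        if current.right_is_thread:
            current = current.right          # jump straight to the successor
        else:
            current = current.right          # real child: go to its leftmost node
            while current.left is not None:
                current = current.left


ROOT = None
for word in ["dog", "bee", "gnu", "ant", "cat", "elk", "hog", "fox"]:
    ROOT = insert(ROOT, word)

print(list(in_order(ROOT)))
```

The thread of the last term in order is `None`, which ends the scan.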
31. File Structures for Inverted Files: B-trees
B-tree of order m: A balanced, multiway search tree
  Each node stores many keys
  Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.
  If k_i is the ith key in a given internal node:
    -> all keys in the (i-1)th child are smaller than k_i
    -> all keys in the ith child are bigger than k_i
  All leaves are at the same depth
32. File Structures for Inverted Files: B-trees
B-tree example (order 2)
[Figure: B-tree of order 2. Root: 50 65. Second level: 10 19 35; 55 59; 70 90 98. Leaves shown: 1 5 8 9; 12 14 18; 21 24 28; 36 47; 66 68; 72 73; 91 95 97.]
Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.
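Searching such a tree can be sketched as follows. The subtree below uses the node 10 19 35 and the four leaves whose key ranges fit beneath it; that arrangement is a reading of the example, and the names `BTreeNode` and `search` are assumptions.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys held in this node
        self.children = children or []  # k + 1 children, or [] for a leaf


def search(node, key):
    """Return True if `key` is stored in the subtree rooted at `node`."""
    while True:
        i = bisect_left(node.keys, key)  # index of first key >= `key`
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                 # reached a leaf without a match
        node = node.children[i]          # child i holds keys below keys[i]


# Assumed subtree: node 10 19 35 with its four leaves.
SUBTREE = BTreeNode(
    [10, 19, 35],
    [
        BTreeNode([1, 5, 8, 9]),
        BTreeNode([12, 14, 18]),
        BTreeNode([21, 24, 28]),
        BTreeNode([36, 47]),
    ],
)

print(search(SUBTREE, 24))  # True
print(search(SUBTREE, 20))  # False
```

Each step of the loop reads one node, so when nodes correspond to disk blocks the number of disk accesses is bounded by the depth of the tree.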
33. File Structures for Inverted Files: B-tree
A B-tree is used as an index.
Data is stored in the leaves of the tree, known as buckets.
Example: B-tree of order 2, bucket size 4
[Figure: index nodes 50 65 (root); 10 25; 55 59; 70 81 90. Buckets at the leaves: ... D9; D51 ... D54; D66...; D81 ...]
(Implementation of B-trees is covered in CS 432.)