Title: CS 430 INFO 430 Information Retrieval
1CS 430 / INFO 430 Information Retrieval
Lecture 4 Searching Full Text 4
2Course Administration
Assignment 1 has been posted. It is a
programming assignment and is due on Saturday,
September 17 at midnight. Follow the submission
instructions carefully. Send questions to
cs430-l_at_cs.cornell.edu.
3Organization of Files for Full Text Searching
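[Figure: organization of files for full text searching]
Word list: each entry holds a term and a pointer to
its postings (ant, bee, cat, dog, elk, fox, gnu, hog)
Postings: the inverted lists
Documents store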
4Representation of Inverted Files
Document store: Stores the documents. Important
for user interface design. Repositories for the
storage of document collections are covered in CS 431.
Word list (vocabulary file): Stores the list
of terms (keywords). Designed for searching and
sequential processing, e.g., for range queries
(lexicographic index). Often held in memory.
Postings file: Stores an inverted list
(postings list) of postings for each term.
Designed for rapid merging of lists and
calculation of similarities. Each list is
usually stored sequentially.
5Document Store
The document store holds the corpus that is
being indexed. The corpus may be:
primary documents, e.g., electronic journal
articles or Web pages
surrogates, e.g., catalog records or abstracts,
which refer to the primary documents
6Document Store
The storage of the document store may be:
Central (monolithic): all documents stored
together on a single server (e.g., library catalog)
Distributed database: all documents managed
together but stored on several servers
(e.g., Medline, Westlaw, Dialog)
Highly distributed: documents stored on
independently managed servers (e.g., the Web)
Each requires a document ID, which is a unique
identifier that can be used by the search system
to refer to the document, and a location counter,
which can be used to specify the location of words
or characters within a document.
7Documents Store for Web Search Systems
For Web search systems:
A document is a Web page.
The documents store is the Web.
The document ID is the URL of the document.
Indexes are built using a web crawler,
which retrieves each page on the Web for
indexing. After indexing, the local copy of each
page is discarded, unless stored in a cache.
(In addition to the usual word list and postings
file, the indexing system stores contextual
information, which will be discussed in a later
lecture.)
8Inverted File
Inverted file: An inverted file is a list of
search terms that are organized for associative
look-up, i.e., to answer the questions:
In which documents does a specified search term
appear?
Where within each document does each term
appear? (There may be several occurrences.)
The word list and the postings file together
provide an inverted file system for free text
searching. In addition, they contain the data
needed to calculate weights and information that
is used to display results.
9Inverted File -- Basic Concept
Word      Documents
abacus    3, 19, 22
actor     2, 19, 29
aspen     5
atoll     11, 34
Stop words are removed before building the index.
10Inverted List -- Definitions
Inverted list: A list of all the entries in an
inverted file that apply to a specific word, e.g.,
abacus    3, 19, 22
Posting: An entry in an inverted list that applies
to a single instance of a term within a document,
e.g., there are three postings for "abacus":
abacus    3
abacus    19
abacus    22
11Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and dj a
document, then q and dj have no terms in common
iff q · dj = 0.
1. To calculate all the non-zero similarities,
find R, the set of all the documents, dj, that
contain at least one term in the query.
2. Merge the inverted lists for each term ti in
the query, with a logical or, to establish the
set, R.
3. For each dj ∈ R, calculate Similarity(q, dj),
using appropriate weights.
4. Return the elements of R in ranked order.
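The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the toy index, its weights, and the function name rank are all hypothetical, and the unnormalized inner product q · dj stands in for Similarity(q, dj).

```python
from collections import defaultdict

# Hypothetical toy index: term -> {document ID: weight of the term in that document}.
INDEX = {
    "abacus": {3: 1.0, 19: 2.0, 22: 1.0},
    "actor": {2: 1.0, 19: 1.0, 29: 1.0},
    "atoll": {11: 2.0, 34: 1.0},
}


def rank(query_weights, index):
    """Return (document ID, score) pairs for every document in R, best first."""
    scores = defaultdict(float)
    # Steps 1 and 2: walking the inverted list of each query term visits
    # exactly the documents in R (the logical OR of the lists).
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, {}).items():
            # Step 3: accumulate this term's contribution to q . dj.
            scores[doc_id] += q_weight * d_weight
    # Step 4: rank by descending score, breaking ties by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = rank({"abacus": 1.0, "actor": 1.0}, INDEX)
print(results)
```

Documents that share no term with the query never appear in scores, so no work is spent on similarities that would be zero.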
12Enhancements to Inverted Files -- Concept
Location: Each posting holds information about
the location of each term within the document.
Uses:
user interface design -- highlight location of
search term
adjacency and near operators (in Boolean searching)
Frequency: Each inverted list includes the number
of postings for each term.
Uses:
term weighting
query processing optimization
13Inverted File -- Concept (Enhanced)
Word      Postings   Document   Location
abacus    4          3          94
                     19         7
                     19         212
                     22         56
actor     3          2          66
                     19         213
                     29         45
aspen     1          5          43
atoll     3          11         3
                     11         70
                     34         40
The three rows for actor form the inverted list
for the term actor.
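A small sketch of how such an enhanced inverted file might be built. The stop list, the sample documents, and the function name build_index are hypothetical, and locations are counted here as word positions starting at 1, which is an assumption (the slide does not say whether locations are word or character offsets).

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of"}  # hypothetical stop list


def build_index(docs):
    """Map each term to its list of (document ID, location) postings."""
    index = defaultdict(list)
    # Visiting documents in ID order keeps every inverted list sorted.
    for doc_id in sorted(docs):
        words = docs[doc_id].lower().split()
        for location, word in enumerate(words, start=1):
            if word not in STOP_WORDS:
                index[word].append((doc_id, location))
    return dict(index)


docs = {
    2: "the actor bought a map",
    5: "map of the atoll",
}
index = build_index(docs)
for term in sorted(index):
    # Print the term, its number of postings, and the postings themselves.
    print(term, len(index[term]), index[term])
```

The count printed for each term plays the role of the Postings column in the table, and each tuple is one (Document, Location) row.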
14Lexicographic Order
It is important that the word list can be
processed sequentially, i.e., in alphabetic order:
To search with wild cards, e.g., comp*, which
expands to every term beginning with the letters
"comp".
To list results for browsing lists of search terms.
This is a special case of the mathematical
concept of lexicographic order.
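To illustrate why sequential order helps with wild cards, the sketch below expands a prefix against a sorted word list: a binary search finds the first candidate, and a sequential scan collects matches until the prefix no longer fits. The word list and the function name expand_prefix are hypothetical.

```python
from bisect import bisect_left


def expand_prefix(word_list, prefix):
    """Return every term in the sorted word_list that starts with prefix."""
    matches = []
    # All terms with this prefix sit in one contiguous run that begins
    # at the first term not less than the prefix.
    position = bisect_left(word_list, prefix)
    while position < len(word_list) and word_list[position].startswith(prefix):
        matches.append(word_list[position])
        position += 1
    return matches


word_list = sorted(
    ["company", "actor", "compute", "abacus", "computer", "cone", "comp"]
)
print(expand_prefix(word_list, "comp"))
```

A leading wild card has no such contiguous run in the sorted list, so it would require scanning every term.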
15Postings File
The postings file stores the elements of a sparse
matrix, the term assignment matrix, with
weights. It is stored as a separate inverted list
for each column, i.e., a list corresponding to
each term in the index file. Each element in an
inverted list is called a posting, i.e., the
occurrence of a term in a document. Each list
consists of one or more individual postings.
16Postings File -- A Linked List for Each Term
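[Figure: each word list entry points to a linked
list of (document, location) postings]
1 abacus -> (3, 94) -> (19, 7) -> (19, 212) -> (22, 56)
3 aspen -> (5, 43)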
A linked list for each term is convenient to
process sequentially, but slow to update when
the lists are long.
17Length of Postings File
A common term may have a very large number of
postings.
Example: 1,000,000,000 documents; 1,000,000
distinct words; average length 1,000 words per
document; 10^12 postings.
By Zipf's law, the 10th ranking word occurs
approximately (10^12/10)/10 = 10^10 times.
18Postings File
Merging inverted lists is the most
computationally intensive task in many
information retrieval systems. Since inverted
lists may be long, it is important to match
postings efficiently. Usually, the inverted lists
will be held on disk and paged into memory for
matching. Therefore algorithms for matching
postings process the lists sequentially. For
efficient matching, the inverted lists should all
be sorted in the same sequence. Inverted lists
are commonly cached to minimize disk accesses.
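A minimal sketch of sequential matching, assuming both inverted lists hold document IDs sorted in ascending order. It shows the logical and of two lists; the function name intersect is hypothetical.

```python
def intersect(list_a, list_b):
    """Return the document IDs that appear in both sorted postings lists."""
    i = j = 0
    matches = []
    # Each list is read once, front to back, so it can be streamed from disk.
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            matches.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # list_a is behind, advance it
        else:
            j += 1  # list_b is behind, advance it
    return matches


abacus = [3, 19, 22]
actor = [2, 19, 29]
print(intersect(abacus, actor))
```

The number of comparisons is at most the sum of the two list lengths, which only holds because both lists are sorted in the same sequence.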
19Data for Calculating Weights
The calculation of weights requires extra data to
be held in the inverted file system.
For each term, tj, and document, di:
fij = number of occurrences of tj in di
For each term, tj:
nj = number of documents containing tj
For each document, di:
mi = maximum frequency of any term in di
For the entire document file:
n = total number of documents
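The sketch below gathers these four quantities from a tiny corpus. The sample documents, the whitespace tokenization, and the function name collect_statistics are hypothetical; the variable names mirror the symbols fij, nj, mi and n.

```python
from collections import Counter


def collect_statistics(docs):
    """Return (f, n_j, m_i, n) for a mapping of document ID -> text."""
    f = {}           # f[(term, doc_id)]: occurrences of the term in the document
    n_j = Counter()  # n_j[term]: number of documents containing the term
    m_i = {}         # m_i[doc_id]: maximum frequency of any term in the document
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        for term, count in counts.items():
            f[(term, doc_id)] = count
            n_j[term] += 1
        m_i[doc_id] = max(counts.values(), default=0)
    n = len(docs)    # total number of documents
    return f, dict(n_j), m_i, n


docs = {1: "ant bee ant", 2: "bee cat", 3: "ant ant ant dog"}
f, n_j, m_i, n = collect_statistics(docs)
print(f[("ant", 1)], n_j["ant"], m_i[3], n)
```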
20Word List Individual Records for Each Term
The record for term j in the word list contains:
term j
pointer to inverted (postings) list for term j
number of documents in which term j occurs (nj)
21Decisions in Building an Inverted File
Efficiency and Query Languages
Some query options may require huge computation,
e.g.,
Regular expressions: If inverted files are
stored in lexicographic order,
comp* can be processed efficiently
*comp cannot be processed efficiently
Logical operators: If A and B are search terms,
A or B can be processed by comparing two
moderate sized lists
(not A) or (not B) requires two very large lists
22Efficiency Criteria
Storage: Inverted files are big, typically 10% to
100% of the size of the collection of documents.
Update performance: It must be possible, with a
reasonable amount of computation, to
(a) add a large batch of documents
(b) add a single document
Retrieval performance: Retrieval must be fast
enough to satisfy users and not use excessive
resources.
23Word List
On disk: If a word list is held on disk, search
time is dominated by the number of disk accesses.
In memory: Suppose that a word list has
1,000,000 distinct terms. Each index entry
consists of the term, some basic statistics and a
pointer to the inverted list, on average 100
characters. The size of the index is 100 megabytes,
which can easily be held in the memory of a
dedicated computer.
24File Structures for Inverted Files Linear Index
Advantages:
Can be searched quickly, e.g., by binary search,
O(log n)
Good for lexicographic processing, e.g., comp*
Convenient for batch updating
Economical use of storage
Disadvantages:
Index must be rebuilt if an extra term is added
25File Structures for Inverted Files Binary Tree
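Input: elk, hog, bee, fox, cat, gnu, ant, dog
[Figure: binary search tree built by inserting the
terms in the input order]
elk
  left: bee
    left: ant
    right: cat
      right: dog
  right: hog
    left: fox
      right: gnu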
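A minimal sketch of how inserting the terms in the input order above produces a binary search tree with elk at the root. The class and function names are hypothetical, and no balancing is attempted.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None


def insert(root, term):
    """Insert term into the tree rooted at root and return the root."""
    if root is None:
        return Node(term)
    node = root
    while True:
        if term == node.term:
            return root  # term already present, nothing to do
        side = "left" if term < node.term else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(term))
            return root
        node = child


def in_order(node):
    """Return the terms in lexicographic order."""
    if node is None:
        return []
    return in_order(node.left) + [node.term] + in_order(node.right)


root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)

print(root.term, root.left.term, root.right.term)
print(in_order(root))
```

An in-order traversal recovers the lexicographic order, but it has to walk the tree recursively, which is one reason a plain binary tree is less convenient than a linear index for sequential processing.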
26File Structures for Inverted Files Binary Tree
Advantages:
Can be searched quickly
Convenient for batch updating
Easy to add an extra term
Economical use of storage
Disadvantages:
Less good for lexicographic processing, e.g.,
comp*
Tree tends to become unbalanced
If the index is held on disk, it is important to
optimize the number of disk accesses
27File Structures for Inverted Files Binary Tree
Calculation of maximum depth of tree.
Illustrates importance of balanced trees.
Worst case: depth = n, O(n)
Ideal case: depth = log(n + 1)/log 2, O(log n)
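The effect of insertion order on depth can be checked with a short experiment. This sketch builds an unbalanced binary search tree (stored in a dictionary for brevity) and reports its depth, counted as the number of levels; the function name tree_depth is hypothetical.

```python
import math


def tree_depth(terms):
    """Depth (number of levels) of the unbalanced tree built by inserting terms in order."""
    children = {}  # term -> [left child, right child]
    root = None
    depth = 0
    for term in terms:
        if root is None:
            root = term
            children[term] = [None, None]
            depth = 1
            continue
        node, level = root, 1
        while term != node:
            side = 0 if term < node else 1
            level += 1
            if children[node][side] is None:
                # Attach the new term one level below the current node.
                children[node][side] = term
                children[term] = [None, None]
                depth = max(depth, level)
                break
            node = children[node][side]
    return depth


terms = ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]
print(tree_depth(sorted(terms)))                 # terms inserted in sorted order
print(tree_depth(terms))                         # terms inserted in the lecture's order
print(math.ceil(math.log2(len(terms) + 1)))      # ideal depth, rounded up
```

With the eight terms inserted in sorted order the tree degenerates into a chain of depth 8, while the lecture's input order gives depth 4, which matches the ideal value rounded up to a whole number of levels.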
28File Structures for Inverted Files Right
Threaded Binary Tree
Threaded tree: A binary search tree in which each
node uses an otherwise-empty left child link to
refer to the node's in-order predecessor and an
empty right child link to refer to its in-order
successor.
Right-threaded tree: A variant of a threaded tree
in which only the right thread, i.e., the link to
the successor, of each node is maintained.
Can be used for lexicographic processing.
A good data structure when held in memory.
Knuth vol 1, 2.3.1, page 325.
29File Structures for Inverted Files Right
Threaded Binary Tree
[Figure: right-threaded binary tree over the terms
ant, bee, cat, dog, elk, fox, gnu, hog, with one
NULL link]
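A sketch of a right-threaded tree, assuming each node carries a flag that says whether its right link is a real child or a thread to the in-order successor. The names Node, insert and in_order are hypothetical, and this is one common way to maintain the threads, not necessarily the formulation in Knuth.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None      # right child, or thread to the in-order successor
        self.is_thread = True  # True while right is a thread (None for the last node)


def insert(root, term):
    """Insert term, keeping right threads up to date, and return the root."""
    if root is None:
        return Node(term)
    node = root
    while True:
        if term == node.term:
            return root
        if term < node.term:
            if node.left is None:
                new = Node(term)
                new.right = node        # successor of a new left child is its parent
                node.left = new
                return root
            node = node.left
        else:
            if node.is_thread:
                new = Node(term)
                new.right = node.right  # inherit the parent's former successor
                node.right = new
                node.is_thread = False
                return root
            node = node.right


def in_order(root):
    """List the terms in lexicographic order without recursion or a stack."""
    terms = []
    node = root
    while node is not None and node.left is not None:
        node = node.left                # start at the smallest term
    while node is not None:
        terms.append(node.term)
        follow_thread = node.is_thread
        node = node.right
        if not follow_thread:
            while node.left is not None:
                node = node.left        # leftmost node of the right subtree
    return terms


root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
print(in_order(root))
```

Because every node can reach its successor directly, the word list can be scanned sequentially from any starting term, which is what lexicographic processing needs.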
30File Structures for Inverted Files B-trees
B-tree of order m: A balanced, multiway search
tree.
Each node stores many keys.
Root has between 2 and 2m keys.
All other internal nodes have between m and 2m
keys.
If ki is the ith key in a given internal node:
-> all keys in the (i-1)th child are smaller than ki
-> all keys in the ith child are bigger than ki
All leaves are at the same depth.
31File Structures for Inverted Files B-trees
B-tree example (order 2)
[Figure: B-tree of order 2]
Root: 50 65
Internal nodes: 10 19 35 | 55 59 | 70 90 98
Leaves shown: 1 5 8 9 | 12 14 18 | 21 24 28 |
36 47 | 66 68 | 72 73 | 91 95 97
Every arrow points to a node containing between 2
and 4 keys. A node with k keys has k + 1 pointers.
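Searching a B-tree can be sketched as a binary search within each node followed by a descent through one of the node's k + 1 pointers. The class BTreeNode, the function search, and the small hand-built tree (which reuses a few key values from the example) are hypothetical; insertion and node splitting are not shown.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys held in this node
        self.children = children or []  # k + 1 children, or empty for a leaf


def search(node, key):
    """Return True if key is stored somewhere in the tree rooted at node."""
    while node is not None:
        # Position of the first key that is not less than the search key.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                # reached a leaf without finding the key
        node = node.children[i]         # child i holds the keys below keys[i]
    return False


# Hypothetical two-level tree: one internal node with 3 keys and 4 leaves.
tree = BTreeNode(
    [10, 19, 35],
    [
        BTreeNode([1, 5, 8, 9]),
        BTreeNode([12, 14, 18]),
        BTreeNode([21, 24, 28]),
        BTreeNode([36, 47]),
    ],
)
print(search(tree, 21), search(tree, 20))
```

Each step of the descent touches one node, so when nodes are sized to match disk pages the number of disk accesses per look-up is bounded by the depth of the tree.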
32File Structures for Inverted Files B-tree
A B-tree is used as an index. Data is stored in
the leaves of the tree, known as buckets.
Example: B-tree of order 2, bucket size 4
[Figure: index nodes 50 65 (root); 10 25; 55 59;
70 81 90. Buckets shown: ... D9; D51 ... D54;
D66...; D81 ...]
(Implementation of B-trees is covered in CS 432.)