Title: CS 4300 INFO 4300 Information Retrieval
1. CS 4300 / INFO 4300 Information Retrieval
Lecture 7: Queries and Strings
2. Course Administration
3. Information Retrieval Family Trees
Cyril Cleverdon (Cranfield)
Karen Spärck Jones (Cambridge)
Gerard Salton (Cornell)
Keith van Rijsbergen (Glasgow)
Donna Harman (NIST)
Michael Lesk (Bell Labs, etc.)
Bruce Croft (University of Massachusetts, Amherst)
4. Query Languages: Wild Card
Query: Gibralt*r
Zero or more characters are indicated by * (or
sometimes ?).
Uses:
  Uncertainty about spelling (see above)
  Alternative spellings (e.g., colo*r)
  An alternative to stemming (e.g., comp*)
5. Trailing Wild Card
Trailing wild card, e.g., corn*
A lexicographic index of the word file, such as a
B-tree, allows easy extraction of the set of all
words that begin with the string corn.
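A minimal sketch of the trailing wild card lookup, assuming a sorted Python list stands in for the lexicographic index (a real system would use a B-tree or similar). The function name `prefix_matches` and the sample word list are illustrative, not from the slides.

```python
from bisect import bisect_left

def prefix_matches(sorted_words, prefix):
    """Return every word in a sorted list that begins with `prefix`."""
    if not prefix:
        return list(sorted_words)
    lo = bisect_left(sorted_words, prefix)
    # Smallest string that sorts after every word with this prefix:
    # the prefix with its last character incremented ("corn" -> "coro").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_words, upper)
    return sorted_words[lo:hi]

# Hypothetical word file, kept in lexicographic order.
words = sorted(["born", "cord", "core", "corn", "cornell", "corner", "cot"])
corn_words = prefix_matches(words, "corn")
print(corn_words)  # ['corn', 'cornell', 'corner']
```

Because all words sharing a prefix are adjacent in lexicographic order, two binary searches are enough to delimit the answer set.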
6. General Wild Card
Permuterm index: Add a special symbol, written $
here, to show the end of a term. Build an index
that shows all rotations of each term, e.g.,
  cornell$  ornell$c  rnell$co  nell$cor
  ell$corn  ll$corne  l$cornel  $cornell
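The rotations can be generated mechanically. The sketch below assumes the end-of-term symbol is `$` (the symbol itself did not survive in the slide text, so this is a conventional choice); `rotations` is an illustrative helper name.

```python
def rotations(term, end="$"):
    """All rotations of `term` followed by the end-of-term symbol."""
    s = term + end
    return [s[i:] + s[:i] for i in range(len(s))]

cornell_rotations = rotations("cornell")
for r in cornell_rotations:
    print(r)
```

A term of length k contributes k + 1 entries to the permuterm index, which is the main space cost of the technique.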
7. Single, Non-Trailing Wild Card
Example query: cor*ll
Rotate the query so that the wild card is at the
end: ll$cor* (where $ marks the end of the term)
Use the permuterm index to find all terms that
match, e.g., ll$corne
8. Several Wild Cards in Term
Example query: gib*lt*r
Ignore the characters between the first and last
wild cards: gib*r
Rotate the query so that the wild card is at the
end: r$gib*
Use the permuterm index to find all terms that
match, e.g., r$gibralta
Check each term to see if the term matches the
query.
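The steps for one or more wild cards can be sketched as follows. This is an illustration under stated assumptions: `*` is the wild card, `$` is the end-of-term symbol, the permuterm index is a sorted list of (rotation, term) pairs, and the final check uses `fnmatch.fnmatchcase` from the Python standard library. The term list and function names are hypothetical.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

END = "$"  # assumed end-of-term symbol

def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    entries = []
    for term in terms:
        s = term + END
        entries.extend((s[i:] + s[:i], term) for i in range(len(s)))
    entries.sort()
    return entries

def wildcard_lookup(index, query):
    """Terms matching a query that contains at least one '*'."""
    head = query[:query.index("*")]        # text before the first wild card
    tail = query[query.rindex("*") + 1:]   # text after the last wild card
    prefix = tail + END + head             # rotated query, wild card at end
    candidates = set()
    for rotation, term in index[bisect_left(index, (prefix,)):]:
        if not rotation.startswith(prefix):
            break
        candidates.add(term)
    # Final check: characters between the wild cards were ignored above.
    return sorted(t for t in candidates if fnmatchcase(t, query))

index = build_permuterm(
    ["gibraltar", "gibralter", "gibber", "gilbert", "cornell", "corbell"]
)
print(wildcard_lookup(index, "gib*lt*r"))  # ['gibraltar', 'gibralter']
print(wildcard_lookup(index, "cor*ll"))    # ['corbell', 'cornell']
```

In this example "gibber" is retrieved by the rotated prefix r$gib but is rejected by the final check, because it does not contain "lt".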
9. Query Languages: Regular Expressions
Regular expression: A pattern built up from simple
strings (which are matched as substrings) and
operators.
Union: If e1 and e2 are regular expressions, then
(e1 | e2) matches whatever matches e1 or e2.
Concatenation: If e1 and e2 are regular
expressions, the occurrences of (e1 e2) are formed
by the occurrences of e1 followed immediately by
e2.
Repetition: If e is a regular expression, then e*
matches a sequence of zero or more contiguous
occurrences of e.
10. Regular Expression Examples
(wild card) matches "wild card"
travell*ed matches "traveled" or "travelled", but
not "traveed"
192 (0 | 1 | 2 | 3 | 4 | 5) matches any string in
the range "1920" to "1925"
Techniques for processing regular expressions are
taught in CS 3810 and other theory courses.
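The second and third examples can be tried with Python's `re` module. This sketch assumes the patterns intended on the slide are `travell*ed` and `192(0|1|2|3|4|5)`; the operator symbols were lost in the slide text, so both are reconstructions. Note that `re` treats spaces as literal characters, so the patterns are written without them.

```python
import re

# Repetition: "l*" means zero or more occurrences of "l".
travel = re.compile(r"travell*ed")
# Union: one of the digits 0 to 5 after the fixed string "192".
years = re.compile(r"192(0|1|2|3|4|5)")

words = ["traveled", "travelled", "traveed"]
matched_words = [w for w in words if travel.fullmatch(w)]
print(matched_words)  # ['traveled', 'travelled']

candidates = ["1919", "1920", "1923", "1925", "1926"]
matched_years = [y for y in candidates if years.fullmatch(y)]
print(matched_years)  # ['1920', '1923', '1925']
```

`fullmatch` requires the whole string to match; `search` would instead find the pattern as a substring, which is closer to the "matched as substrings" behavior described for simple strings.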
11. Regular Expressions in Java
Package java.util.regex: Classes for matching
character sequences against patterns specified by
regular expressions. An instance of the Pattern
class represents a regular expression that is
specified in string form in a syntax similar to
that used by Perl. Instances of the Matcher class
are used to match character sequences against a
given pattern. Input is provided to matchers via
the CharSequence interface in order to support
matching against characters from a wide variety
of input sources.
12. String Searching
Goal: Given a pattern, find any substring of a
given corpus that matches the pattern.
Example (file search): In a file system, find all
file names that contain a given string.
Example (gene matching): DNA consists of a chain
made from four types of nucleotides: adenine,
cytosine, guanine, and thymine (encoded A, C, G,
T). Given a gene, encoded as a sequence of
nucleotides, perhaps with several variants, find
all matching DNA chains in a corpus.
13. String Searching: Naive Algorithm
Objective: Given a pattern, find any substring
of a given corpus that matches the pattern.
  p  pattern to be matched
  m  length of pattern p (characters)
  t  the text to be searched
  n  length of t (characters)
The naive algorithm examines the characters of t
in sequence.
  for j from 1 to n-m+1
    if character j of t matches the first
    character of p
      (compare following characters of t and p
      until a complete match or a difference is
      found)
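A runnable version of the naive algorithm, as a sketch. It uses 0-based positions (the slide counts from 1), so the loop runs over j = 0 .. n-m, and it returns every match rather than stopping at the first. The function name is illustrative.

```python
def naive_search(p, t):
    """Return the 0-based start positions of every occurrence of p in t."""
    m, n = len(p), len(t)
    matches = []
    for j in range(n - m + 1):
        k = 0
        # Compare characters of t and p until a complete match
        # or a difference is found.
        while k < m and t[j + k] == p[k]:
            k += 1
        if k == m:
            matches.append(j)
    return matches

print(naive_search("uni", "the uniform commercial code"))  # [4]
```

In the worst case every one of the n-m+1 positions requires up to m comparisons, so the running time is O(mn).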
14. String Searching: Knuth-Morris-Pratt Algorithm
Concept: The naive algorithm is modified, so
that whenever a partial match is found, it may be
possible to advance the character index, j, by
more than 1.
Example:
  p = "university"
  t = "the uniform commercial code ..."
  j = 5 after the partial match "uni"
  continue at j = 8 (the character "f")
To indicate how far to advance the character
pointer, p is preprocessed to create a table,
which lists how far to advance against a given
length of partial match. In the example, j is
advanced by the length of the partial match, 3.
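One common way to build the preprocessing table is the failure function: for each prefix of p, the length of the longest proper prefix that is also a suffix. The sketch below uses that formulation, which may differ in layout from the table on the slide; the advance after a partial match of length k is k minus the table entry for k. Positions are 0-based and the function names are illustrative.

```python
def failure_table(p):
    """fail[i] = length of the longest proper prefix of p[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(p, t):
    """Return the 0-based start positions of every occurrence of p in t."""
    if not p:
        return []
    fail = failure_table(p)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(t):
        while k > 0 and c != p[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if c == p[k]:
            k += 1
        if k == len(p):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(failure_table("university"))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(kmp_search("uni", "the uniform university"))  # [4, 12]
```

For "university" every table entry is 0, so after the partial match "uni" the pattern is advanced by the full length of the partial match, 3, as in the slide's example. The text pointer never moves backwards, giving O(m + n) running time.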
15. Search for Substring
In some information retrieval applications, any
substring can be a search term (e.g., DNA
matching). Tries, using suffix trees, provide
lexicographical indexes for all the substrings in
a document or set of documents.
16. Tries: Search for Substring
Basic concept: The text is divided into unique
semi-infinite strings, or sistrings. Each
sistring has a starting position in the text, and
continues to the right until it is unique. The
sistrings are stored as the leaves of a tree, the
suffix tree. Common parts are stored only once.
Each sistring is associated with a location
within a document where the sistring occurs.
Subtrees below a certain node represent all
occurrences of the substring represented by that
node. Suffix trees have a size of the same order
of magnitude as the input documents.
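The idea that all occurrences of a substring sit together in a lexicographic index of suffixes can be illustrated without building a tree. The sketch below swaps in a simpler structure, a sorted list of (suffix, position) pairs, which is not a suffix tree and stores each suffix in full (quadratic space), so it is only suitable for small examples. The DNA string and function names are hypothetical.

```python
from bisect import bisect_left

def build_suffix_index(text):
    """Sorted list of (suffix, start position) pairs."""
    return sorted((text[i:], i) for i in range(len(text)))

def find_substring(index, pattern):
    """0-based positions at which `pattern` occurs in the indexed text."""
    hits = []
    # Every suffix that starts with `pattern` is adjacent in sorted order.
    for suffix, pos in index[bisect_left(index, (pattern,)):]:
        if not suffix.startswith(pattern):
            break
        hits.append(pos)
    return sorted(hits)

index = build_suffix_index("GATTACAGATTA")
print(find_substring(index, "ATTA"))  # [1, 8]
```

A suffix tree gives the same grouping of occurrences under one node while storing common parts only once.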
17. Tries: Suffix Tree
Example: suffix tree for the following words:
  begin  beginning  between  bread  break

  b
    e
      gin
        (null)
        ning
      tween
    rea
      d
      k
18. Tries: Sistrings
A binary example
String:  01 100 100 010 111
Sistrings:
  1  01 100 100 010 111
  2  11 001 000 101 11
  3  10 010 001 011 1
  4  00 100 010 111
  5  01 000 101 11
  6  10 001 011 1
  7  00 010 111
  8  00 101 11
  etc.
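19. Tries: Lexical Ordering
  No.  Sistring              Unique string
  7    00 010 111            000
  4    00 100 010 111        00100
  8    00 101 11             00101
  5    01 000 101 11         010
  1    01 100 100 010 111    011
  6    10 001 011 1          1000
  3    10 010 001 011 1      1001
  2    11 001 000 101 11     11
The unique string (indicated in blue on the slide)
is the shortest prefix that distinguishes a
sistring from the other seven.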
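The ordering and the unique strings can be checked with a short sketch. It assumes uniqueness is judged only against the eight sistrings listed on the slides (later sistrings are ignored); the helper names are illustrative.

```python
def sistrings(bits, count):
    """Map sistring number (1-based start position) to the sistring."""
    return {i + 1: bits[i:] for i in range(count)}

def unique_prefix(s, others):
    """Shortest prefix of s that no string in `others` starts with."""
    for k in range(1, len(s) + 1):
        if not any(o.startswith(s[:k]) for o in others):
            return s[:k]
    return s

text = "01100100010111"
sis = sistrings(text, 8)

# Lexical ordering of the sistring numbers.
order = sorted(sis, key=sis.get)
print(order)  # [7, 4, 8, 5, 1, 6, 3, 2]

# Unique string for each sistring, relative to the other seven.
unique = {
    n: unique_prefix(s, [o for k, o in sis.items() if k != n])
    for n, s in sis.items()
}
for n in order:
    print(n, unique[n])
```

The unique strings are exactly the root-to-leaf paths of a binary trie over these sistrings.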
20. Trie: Basic Concept
[Figure: binary trie of the sistrings. Branches are
labeled 0 and 1; the leaves are sistrings 1 to 8.]
21. Patricia Trie
[Figure: Patricia trie with leaves for sistrings
1 to 8. Branches are labeled with bit strings such
as 0, 1, 00, and 10.]
Single-descendant nodes are eliminated.
22. Optional Material
The following slides introduce Signature Files,
an application of Bloom Filters to Information
Retrieval.
23. Signature Files: Search without Inverted File
Inexact filter: A quick test which discards many
of the non-qualifying items. Uses the concept of
a Bloom filter.
Advantages:
  Much faster than full text scanning (1 or 2
  orders of magnitude)
  Modest space overhead (10% to 15% of file)
  Insertion is straightforward
Disadvantages:
  Not good for very large files
  Some hits are false hits
24. Signature Files
Example:
  Word              Signature
  free              001 000 110 010
  text              000 010 101 001
  block signature   001 010 111 011
F = 12 bits in a signature, m = 4 bits per word,
D = 2 words per block
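The example can be checked by treating each signature as a 12-bit integer. This is a small sketch; `parse` and `show` are illustrative helpers, and the word signatures are copied from the slide rather than computed by a hash function.

```python
F = 12  # bits in a signature
M = 4   # bits set per word

def parse(sig):
    """Convert a spaced bit string such as '001 000 110 010' to an int."""
    return int(sig.replace(" ", ""), 2)

def show(value):
    """Format an int as F bits in groups of three."""
    bits = format(value, f"0{F}b")
    return " ".join(bits[i:i + 3] for i in range(0, F, 3))

free = parse("001 000 110 010")
text = parse("000 010 101 001")

# The block signature is the bitwise OR of the word signatures.
block = free | text
print(show(block))  # 001 010 111 011
```

Each word signature has exactly m = 4 bits set, while the block signature has 7, because one bit position is shared by both words.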
25. Signature Files
Signature size: Number of bits in a signature, F.
Word signature: A bit pattern of size F with
exactly m bits set to 1 and the others 0. The
word signature is calculated by a hash function.
Block: A sequence of text that contains D
distinct words.
Block signature: The logical OR of all the word
signatures in a block of text.
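A sketch of how word and block signatures might be computed. The slides do not say which hash function is used; here SHA-256 from Python's `hashlib` is an assumption chosen only because it is deterministic and readily available, and the scheme of rehashing with a counter until m distinct bits are set is likewise illustrative.

```python
import hashlib

def word_signature(word, F=12, m=4):
    """Bit pattern of size F with exactly m bits set, derived from a hash.
    Requires m <= F."""
    sig = 0
    counter = 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        position = int.from_bytes(digest[:4], "big") % F
        sig |= 1 << position   # no effect if the bit is already set
        counter += 1
    return sig

def block_signature(words, F=12, m=4):
    """Logical OR of the word signatures of all words in the block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, F, m)
    return sig

block = block_signature(["free", "text"])
print(format(word_signature("free"), "012b"))
print(format(block, "012b"))
```

The actual bit patterns depend on the hash function, so they will not reproduce the signatures shown in the slide example.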
26. Signature Files
A query term is processed by matching its
signature against the block signature.
(a) If the term is in the block, its word
signature will always match the block signature.
(b) A word signature may match the block
signature even though the word is not in the
block. This is a false hit.
The design challenge is to minimize the false
drop probability, Fd.
Frakes, Section 4.2, page 47, discusses how to
minimize Fd. The rest of the chapter discusses
enhancements to the basic algorithm.
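Matching a word signature against a block signature is a bitwise subset test. The sketch below reuses the block signature from the earlier example slide; the two extra word signatures are hypothetical, hand-picked to show a false hit and a definite miss, and are not produced by any real hash function.

```python
def parse(sig):
    """Convert a spaced bit string such as '001 000 110 010' to an int."""
    return int(sig.replace(" ", ""), 2)

def may_contain(block_sig, word_sig):
    """True if every bit set in the word signature is set in the block."""
    return (block_sig & word_sig) == word_sig

block = parse("001 010 111 011")      # block containing "free" and "text"

free = parse("001 000 110 010")       # word that is in the block
absent_a = parse("001 010 100 001")   # hypothetical word not in the block
absent_b = parse("100 000 110 010")   # hypothetical word not in the block

print(may_contain(block, free))       # True  (case a: always matches)
print(may_contain(block, absent_a))   # True  (case b: a false hit)
print(may_contain(block, absent_b))   # False (safely discarded)
```

A negative result is always correct, so the filter can discard blocks safely; a positive result must be confirmed by scanning the block text.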