Title: CS 4300 INFO 4300 Information Retrieval
1CS 4300 / INFO 4300 Information Retrieval
Lecture 2 Searching Full Text 2
2Course Administration
Web site: http://www.infosci.cornell.edu/courses/info4300/2008fa/
Notices: See the home page on the course Web site.
Programming assignments: Java, C, or Python only. Assignment 3 (and
possibly 4) will require Java.
3Course Administration
Please send all questions about the course
to cs4300-l_at_lists.cs.cornell.edu. The message
will be sent to: William Arms; Teaching
Assistants.
4Discussion classes
Discussion class: Wednesday, September 3, Phillips
Hall 203, 7:30 to 8:30 p.m.
Prepare for the class as instructed on the course Web site.
Participation in the discussion classes is
one third of the grade, but tomorrow's class
will not be included in the grade calculation.
5Discussion Classes
Format: Questions. Ask a member of the class to
answer. Provide opportunity for others to comment.
When answering: Stand up. Give your name. Make sure
that the TA hears it. Speak clearly so that all the
class can hear.
Suggestions: Do not be shy at presenting
partial answers. Differing viewpoints are
welcome.
6Discussion Class Preparation
You are given a problem to explore using three
search systems: What is the medical evidence
that red wine is good or bad for your health?
In preparing for the class, focus on the question:
What characteristics of the three search services
are helpful or lead to difficulties in addressing
this problem?
The aim of your preparation is to
explore the search services, not to solve the
problem.
Take care. Many of the documents that
you might find are written from a one-sided
viewpoint.
7Discussion Class Preparation
In preparing for the discussion classes, you may
find it useful to look at the slides from last
year's class on the old Web site:
http://www.infosci.cornell.edu/Courses/info430/2007fa/
8Definitions query
Query: A string describing the information that
the user is seeking. Each term in the query is
called a search term. A query can be a single
search term, a string of terms, a phrase in
natural language, or a stylized expression using
special symbols, e.g., a regular expression.
Full text searching: Methods that compare the
query with every word in the text, without
distinguishing the function of the various words.
Fielded searching: Methods that search on
specific bibliographic or structural fields, such
as author or title.
9Query
Parts of a query. Sample query for fielded searching:
author Joyce and title (artist or Steen Hero)
Labeled parts: search term, field name, Boolean operator, wild card
10 Sorting and Ranking Hits
When a user submits a query to a search system,
the system returns a set of hits. With a large
collection of documents, the set of hits may be
very large. The value to the user often depends
on the order in which the hits are presented.
Three main methods:
Sorting the hits, e.g., by date
Ranking the hits by similarity between query and document
Ranking the hits by the importance of the documents
11Word Frequency
Observation: Some words are more common than
others.
Statistics: Most large collections of
unstructured text documents have similar
statistical characteristics. These statistics
influence the effectiveness and efficiency of
data structures used to index documents, and
many retrieval models rely on them.
12Word Frequency
Example: The following example is taken from
Jamie Callan, Characteristics of Text, 1997.
Sample of 19 million words. The next slide shows
the 50 commonest words in rank order (r), with
their frequency (f).
13
Word  f          Word     f       Word    f
the   1,130,021  from     96,900  or      54,958
of    547,311    he       94,585  about   53,713
to    516,635    million  93,515  market  52,110
a     464,736    year     90,104  they    51,359
in    390,819    its      86,774  this    50,933
and   387,703    be       85,588  would   50,828
that  204,351    was      83,398  you     49,281
for   199,340    company  83,070  which   48,273
is    152,483    an       76,974  bank    47,940
said  148,302    has      74,405  stock   47,401
it    134,323    are      74,097  trade   47,310
on    121,173    have     73,132  his     47,116
by    118,863    but      71,887  more    46,244
as    109,135    will     71,494  who     42,142
at    101,779    say      66,807  one     41,635
mr    101,679    new      64,456  their   40,910
with  101,210    share    63,925
14Rank Frequency Distribution
For all the words in a collection of documents,
for each word w:
f is the frequency with which w appears.
r is the rank of w in order of frequency. (The most
commonly occurring word has rank 1, etc.)
[Figure: frequency f plotted against rank r; word w has rank r and frequency f.]
15Rank Frequency Example
The next slide shows the words in Callan's data
normalized. In this example:
r is the rank of word w in the sample.
f is the frequency of word w in the sample.
n is the total number of word occurrences in the sample.
16 (rf)/(n/1000)
Word  (rf)/(n/1000)  Word     (rf)/(n/1000)  Word    (rf)/(n/1000)
the   59             from     92             or      101
of    58             he       95             about   102
to    82             million  98             market  101
a     98             year     100            they    103
in    103            its      100            this    105
and   122            be       104            would   107
that  75             was      105            you     106
for   84             company  109            which   107
is    72             an       105            bank    109
said  78             has      106            stock   110
it    78             are      109            trade   112
on    77             have     112            his     114
by    81             but      114            more    114
as    80             will     117            who     106
at    80             say      113            one     107
mr    86             new      112            their   108
with  91             share    114
17Zipf's Law
If the words in a collection are ranked, r, by
their frequency, f, they roughly fit the relation:
r * (f/n) = c
where n is the number of word occurrences in the
collection, 19 million in the example.
Different collections have different constants c.
In English text, c tends to be about 0.1.
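The relation can be checked against a few rows of the Callan sample shown earlier. The following is a minimal sketch, not part of the original slides; the function name zipf_constant and the choice of rows are illustrative assumptions, and n is taken as the rounded figure of 19 million.

```python
# A few (rank, word, frequency) rows copied from the Callan sample.
# N_OCCURRENCES is the rounded sample size quoted in the lecture.
N_OCCURRENCES = 19_000_000

SAMPLE = [
    (1, "the", 1_130_021),
    (2, "of", 547_311),
    (3, "to", 516_635),
    (18, "from", 96_900),
    (35, "or", 54_958),
    (50, "their", 40_910),
]


def zipf_constant(rank, frequency, n=N_OCCURRENCES):
    """Return r * (f / n), which Zipf's law predicts is roughly constant."""
    return rank * frequency / n


for rank, word, freq in SAMPLE:
    print(f"{word:>6}  r={rank:<3} r*f/n = {zipf_constant(rank, freq):.3f}")
```

The values drift from about 0.06 for the very commonest words toward about 0.1 further down the ranking, which is the "roughly fit" behavior the slide describes.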
18Methods that Build on Zipf's Law
Stop lists: Ignore the most frequent words
(upper cut-off). Used by almost all systems.
Term weighting: Give differing weights
to terms based on their frequency, with the most
frequent words weighted less. Used by almost all
ranking methods.
Significant words: Ignore the
most frequent and least frequent words (upper and
lower cut-off). Rarely used.
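As an illustration of the upper cut-off idea, the sketch below derives a stop list from word counts and filters a token stream with it. This is an assumption-laden toy: the sample text and the helper names build_stop_list and remove_stop_words are hypothetical, and production systems more often use a fixed, hand-curated stop list.

```python
from collections import Counter


def build_stop_list(tokens, top_k):
    """Return the top_k most frequent tokens as a set (upper cut-off)."""
    return {word for word, _ in Counter(tokens).most_common(top_k)}


def remove_stop_words(tokens, stop_list):
    """Keep only the tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in stop_list]


# Hypothetical sample text, not taken from the lecture.
tokens = "the cat and the dog and the bird".split()
stop_list = build_stop_list(tokens, top_k=2)
print(sorted(stop_list))
print(remove_stop_words(tokens, stop_list))
```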
19Zipf's Law
For a weird but wonderful discussion of this and
many other examples of naturally occurring rank
frequency distributions, see: Zipf, G. K.,
Human Behavior and the Principle of Least
Effort. Addison-Wesley, 1949.
For a technical understanding of the processes
behind this law, take Info 2040, Networks.
20Exact Matching (Boolean Model)
[Diagram components: Documents; Query; Index database; Mechanism for determining whether a document matches a query; Set of hits]
21Boolean Queries
Boolean query: two or more search terms, related
by logical operators, e.g., and, or, not.
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor
Find all documents that contain the exact words
abacus and actor
22Adjacent and Near Operators
abacus adj actor: Terms abacus and actor are
adjacent to each other, e.g., "abacus actor".
abacus near 4 actor: Terms abacus and
actor are within 4 words of each other,
e.g., "the actor has an abacus".
Some systems support other operators, such as with (two terms
in the same sentence) or same (two terms in the
same paragraph).
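The adj and near operators can be sketched over a tokenized text as follows. This is an illustrative sketch only: the function names are hypothetical, "within 4 words" is assumed to mean a position difference of at most 4, and adj is assumed to be order-sensitive (the first term immediately followed by the second). Real systems differ on both points.

```python
def positions(tokens, term):
    """Return every index at which term occurs in tokens."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def near(tokens, a, b, k):
    """True if some occurrence of a lies within k positions of some b."""
    return any(
        abs(i - j) <= k
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


def adj(tokens, a, b):
    """True if some occurrence of a is immediately followed by b."""
    return any(
        j - i == 1
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


sentence = "the actor has an abacus".split()
print(near(sentence, "abacus", "actor", 4))
print(adj("abacus actor".split(), "abacus", "actor"))
```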
23Evaluation of Matching Recall and Precision
With matching methods, if information retrieval
were perfect ... every hit would be relevant to
the original query, and every relevant item in
the body of information would be found.
Precision: fraction (or percentage) of the hits
that are relevant, i.e., the extent to which
the set of hits retrieved by a query satisfies
the requirement that generated the query.
Recall: fraction (or percentage) of the relevant
items that are found by the query, i.e., the
extent to which the query found all the items
that satisfy the requirement.
24Recall and Precision with Exact Matching
Example: Corpus of 10,000 documents, 50 on a
specific topic.
Ideal search finds these 50 documents and rejects all others.
Actual search identifies 25 documents; 20 are relevant but 5
are on other topics.
Precision: 20/25 = 0.8 (80% of hits were relevant)
Recall: 20/50 = 0.4 (40% of relevant were found)
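The arithmetic in this example can be reproduced with sets of document identifiers. The identifiers below are hypothetical, chosen only so that 20 of the 25 hits fall among the 50 relevant documents; the function names are likewise illustrative.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)


# Hypothetical document ids: 50 relevant documents, 25 hits,
# of which 20 are relevant and 5 are on other topics.
relevant = set(range(50))
retrieved = set(range(20)) | set(range(100, 105))

print(precision(retrieved, relevant))
print(recall(retrieved, relevant))
```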
25Precision and Recall
Precision and recall measure the results of a
single query using a specific search system
applied to a specific set of documents.
26Inverted File
Inverted file: An inverted file is a list of
search terms that are organized for associative
look-up, i.e., to answer the questions:
In which documents does a specified search term appear?
Where within each document does each
term appear? (There may be several occurrences.)
In a text search system, the
inverted file system has two parts: the word list
and the postings file.
27Inverted File -- Definitions
Word: ant, bee, cat, dog, eel, fox, gnu, hog
The word list is a list of all the distinct terms
in the corpus after the removal of stop words and
stemming. This is sometimes called a vocabulary
file.
28Inverted File -- Definitions
Posting: Entry in an inverted file system that
applies to a single instance of a term within a
document, e.g., there might be three postings for
"abacus":
abacus 3 ("abacus" is in document 3)
abacus 19
abacus 22
Postings List: A list of all the postings in an
inverted file system that apply to a specific
word, e.g., abacus 3 19 22 ("abacus" is in
documents 3, 19, and 22)
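A word list and postings lists of this kind can be built from a small collection as sketched below. The document texts are hypothetical, chosen so that the resulting postings for "abacus" and "actor" match the document numbers used in these slides. As simplifying assumptions, the sketch records each document once per term and skips stop-word removal and stemming.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Return (word_list, postings) for a dict of doc_id -> text."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # One entry per (term, document); positions are not recorded here.
        for term in sorted(set(docs[doc_id].lower().split())):
            postings[term].append(doc_id)
    word_list = sorted(postings)
    return word_list, dict(postings)


# Hypothetical documents, keyed by document number.
DOCS = {
    2: "actor",
    3: "abacus bead",
    19: "abacus actor",
    22: "old abacus",
    29: "actor stage",
}

word_list, postings = build_inverted_file(DOCS)
print(word_list)
print(postings["abacus"], postings["actor"])
```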
29Use of Inverted Files for Evaluating a Boolean
Query
To evaluate the and operator, merge the two
inverted lists with a logical AND operation.
Example: abacus and actor
Postings for abacus: 3 19 22
Postings for actor: 2 19 29
Document 19 is the only document that contains
both terms, "abacus" and "actor".
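One common way to carry out this merge is a two-pointer walk over the two sorted postings lists. The sketch below assumes both lists are sorted by document number and contain no duplicates; the function name intersect is illustrative.

```python
def intersect(p1, p2):
    """Merge two sorted postings lists with a logical AND."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document appears in both lists, so it matches the AND.
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


print(intersect([3, 19, 22], [2, 19, 29]))
```

Each list is scanned at most once, so the cost is proportional to the combined length of the two lists.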
30Enhancements to Inverted Files -- Concept
Location: Each posting holds information about
the location of each term within the document.
Uses: user interface design (highlight location of
search term); adjacency and near operators (in
Boolean searching).
Frequency: Each inverted list includes the number of
postings for each term.
Uses: term weighting; query processing optimization.
31Inverted File -- Concept (Enhanced)
Word    Frequency  Document  Location
abacus  4          3         94
                   19        7
                   19        63
                   22        56
actor   3          2         66
                   19        64
                   29        45
aspen   1          5         43
atoll   3          11        3
                   11        70
                   34        40
The three rows for actor form the postings list for term actor.
32Evaluating an Adjacency Operation
Example: abacus adj actor
Postings for abacus (document, location within document): 3 94, 19 7, 19 63, 22 56
Postings for actor (document, location within document): 2 66, 19 64, 29 45
Document 19, locations 63 and 64, is the only occurrence of the terms
"abacus" and "actor" adjacent.
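The adjacency test on these postings can be sketched as below, using the (document, location) pairs from the example. It assumes locations are word positions and that adjacency means "actor" at the position immediately after "abacus". For brevity it uses a set lookup rather than a merge of the two lists; the function name adjacent is illustrative.

```python
# (document, location) postings copied from the example.
ABACUS = [(3, 94), (19, 7), (19, 63), (22, 56)]
ACTOR = [(2, 66), (19, 64), (29, 45)]


def adjacent(postings_a, postings_b):
    """Return (doc, loc_a, loc_b) triples where b directly follows a."""
    locations_b = set(postings_b)
    return [
        (doc, loc, loc + 1)
        for doc, loc in postings_a
        if (doc, loc + 1) in locations_b
    ]


print(adjacent(ABACUS, ACTOR))
```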
33Query Matching Boolean Methods
- Query: (abacus or asp*) and actor
- 1. From the index file (word list), find the
postings lists for:
- "abacus"
- every word that begins "asp"
- "actor"
- 2. Merge these postings lists. For each document
that occurs in any of the postings lists,
evaluate the Boolean expression to see if it is
true or false.
- Step 2 should be carried out in a single pass.
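The two steps can be sketched as follows for this one query. The index contents reuse the document numbers from the earlier slides; the query is hard-coded rather than parsed, and sets stand in for a true merge of sorted lists, so this is a simplified illustration under those assumptions and not a general query evaluator.

```python
# Document-level postings, taken from the document numbers in the slides.
INDEX = {
    "abacus": [3, 19, 22],
    "actor": [2, 19, 29],
    "aspen": [5],
    "atoll": [11, 34],
}


def evaluate(index):
    """Evaluate (abacus or <words beginning 'asp'>) and actor."""
    # Step 1: look up the postings lists, expanding the prefix "asp".
    abacus = set(index.get("abacus", []))
    asp = set()
    for word, docs in index.items():
        if word.startswith("asp"):
            asp.update(docs)
    actor = set(index.get("actor", []))

    # Step 2: one pass over every document in any of the lists,
    # evaluating the Boolean expression for each.
    candidates = sorted(abacus | asp | actor)
    return [d for d in candidates if (d in abacus or d in asp) and d in actor]


print(evaluate(INDEX))
```

Only documents that appear in at least one postings list are examined, so the rest of the collection is never touched.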
34Use of Postings File for Query Matching
- 1 abacus
-
- 3 94
- 19 7
-
- 19 63
-
- 22 56
3 aspen 5 43