Title: Information Retrieval using the Boolean Model
1. Information Retrieval using the Boolean Model
2. Query
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- Could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Other operations (e.g., find the phrase Romans and countrymen) not feasible
3. Term-document incidence
1 if play contains word, 0 otherwise
4. Incidence vectors
- So we have a 0/1 vector for each term.
- To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them (see the sketch below).
- 110100 AND 110111 AND 101111 = 100100.
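A minimal Python sketch of that computation, assuming the six-play incidence vectors shown on the slide (Calpurnia's uncomplemented vector, 010000, is inferred from its complement 101111):

```python
# A 0/1 incidence vector per term over six plays, as on the slide.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000   # its complement over six plays is 101111

mask = 0b111111        # six documents in total
result = brutus & caesar & (~calpurnia & mask)

print(format(result, "06b"))   # 100100: the 1st and 4th plays match
```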
5. Answers to query
- Antony and Cleopatra, Act III, Scene ii
  - Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, / When Antony found Julius Caesar dead, / He cried almost to roaring; and he wept / When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii
  - Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
6. Bigger document collections
- Consider N = 1 million documents, each with about 1K terms.
- Avg 6 bytes/term incl. spaces/punctuation
- 6GB of data in the documents.
- Say there are M = 500K distinct terms among these.
7. Can't build the matrix
- 500K x 1M matrix has half-a-trillion 0s and 1s.
- But it has no more than one billion 1s.
- matrix is extremely sparse.
- What's a better representation?
- We only record the 1 positions.
Why?
8. Inverted index
- For each term T, store a list of all documents that contain T.
- Do we use an array or a list for this?
[Figure: dictionary terms Brutus, Calpurnia and Caesar, each pointing to its postings list (docIDs 13 and 16 visible)]
What happens if the word Caesar is added to document 14?
9. Inverted index
- Linked lists generally preferred to arrays
- Dynamic space allocation
- Insertion of terms into documents easy
- Space overhead of pointers
[Figure: postings stored as linked lists, e.g. 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128; 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34; 1 -> 13 -> 16]
Sorted by docID (more later on why).
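A minimal sketch of an in-memory inverted index with each term's postings kept sorted by docID; which list belongs to which term is an illustrative assumption, not taken from the slide's figure:

```python
import bisect

# Each term's postings list is kept sorted by docID.
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings, keeping the list sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

add_posting(index, "Caesar", 14)   # the slide's question: Caesar added to document 14
print(index["Caesar"])             # [13, 14, 16]
```

With linked lists the insertion avoids shifting array elements, at the cost of pointer overhead, which is the trade-off the bullets above describe.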
10. Inverted index construction
[Figure: indexing pipeline, starting from the documents to be indexed, e.g. "Friends, Romans, countrymen."]
11. Indexer steps
- Sequence of (Modified token, Document ID) pairs.
- Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
- Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
12. Core indexing step: sort the (token, docID) pairs by term (see the sketch below).
13.
- Multiple term entries in a single document are merged.
- Frequency information is added.
  - Why frequency? Will discuss later.
14.
- The result is split into a Dictionary file and a Postings file.
15.
- Where do we pay in storage?
  - Terms
  - Pointers
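A simplified sketch of these indexer steps under assumptions of my own (whitespace tokenization, lowercasing, and only document frequency recorded in the dictionary):

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs (lowercased, punctuation stripped).
pairs = []
for doc_id, text in docs.items():
    for token in text.lower().replace(".", " ").replace("'", " ").split():
        pairs.append((token, doc_id))

# Step 2 (core indexing step): sort by term, then by docID.
pairs.sort()

# Step 3: merge multiple entries per document; keep document frequency in the dictionary.
postings = defaultdict(list)            # term -> sorted list of docIDs
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)
dictionary = {term: len(p) for term, p in postings.items()}

print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]
```

The storage cost shows up exactly where the slide says: the terms in the dictionary and the docID entries (pointers) in the postings.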
16. The index we just built
Today's focus:
- How do we process a Boolean query?
- Later: what kinds of queries can we process?
17. Query processing
- Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the Dictionary; retrieve its postings.
  - Locate Caesar in the Dictionary; retrieve its postings.
  - Merge the two postings.
[Figure: the postings lists of Brutus and Caesar, ending in docIDs 128 and 34]
18. The merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries.
[Figure: merging the Brutus and Caesar postings lists]
- If the list lengths are x and y, the merge takes O(x + y) operations.
- Crucial: postings sorted by docID.
19. Basic postings intersection
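The slide's intersection pseudocode is not reproduced in this transcript; below is a minimal Python sketch of the linear-time merge it describes (the example postings lists are illustrative):

```python
def intersect(p1, p2):
    """Merge two docID-sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))   # [2, 8]
```

The docID sort order is what lets each pointer advance monotonically, which is why it is described as crucial above.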
20. Boolean queries: Exact match
- Queries use AND, OR and NOT together with query terms
- View each document as a set of words
- Precise: a document matches the condition or it does not.
- Primary commercial retrieval tool for 3 decades.
- Professional searchers (e.g., lawyers) still like Boolean queries
  - You know exactly what you're getting.
21. Example: WestLaw (http://www.westlaw.com/)
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- About 7 terabytes of data; 700,000 users
- Majority of users still use Boolean queries
- Example query:
  - What is the statute of limitations in cases involving the federal tort claims act?
  - LIMIT! /3 STATUTE ACTION /s FEDERAL /2 TORT /3 CLAIM
- Long, precise queries; proximity operators; incrementally developed; not like web search
22. Query optimization
- What is the best order for query processing?
- Consider a query that is an AND of t terms.
- For each of the t terms, get its postings, then AND them together.
[Figure: the postings lists of Brutus, Calpurnia and Caesar]
Query: Brutus AND Calpurnia AND Caesar
23. Query optimization example
- Process terms in order of increasing freq: start with the smallest set, then keep cutting further (see the sketch below).
- This is why we kept freq in the dictionary.
- Execute the query as (Caesar AND Brutus) AND Calpurnia.
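A minimal sketch of this ordering heuristic, assuming the `dictionary` (term -> document frequency), `postings`, and `intersect` helpers sketched earlier:

```python
def process_and_query(terms, dictionary, postings):
    """AND together the terms' postings, processing the rarest terms first."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda t: dictionary.get(t, 0))   # increasing doc freq
    result = postings.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:                  # intermediate result already empty: stop early
            break
        result = intersect(result, postings.get(term, []))
    return result

# e.g. Brutus AND Calpurnia AND Caesar, starting from the least frequent term
# process_and_query(["Brutus", "Calpurnia", "Caesar"], dictionary, postings)
```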
24. Query optimization
25. More general optimization
- e.g., (madding OR crowd) AND (ignoble OR strife)
- Get freqs for all terms.
- Estimate the size of each OR by the sum of its freqs (conservative).
- Process in increasing order of OR sizes (see the sketch below).
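A minimal sketch of that estimate, again assuming a document-frequency `dictionary`; computing the actual ORs and ANDs is not shown:

```python
def order_or_groups(or_groups, dictionary):
    """Order the OR-groups of an AND-of-ORs query by their estimated result sizes.

    or_groups: e.g. [["madding", "crowd"], ["ignoble", "strife"]]
    dictionary: term -> document frequency
    """
    def estimated_size(group):
        # Conservative estimate: sum of the terms' document frequencies.
        return sum(dictionary.get(term, 0) for term in group)

    return sorted(or_groups, key=estimated_size)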
26. Exercise
- Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies)
AND (kaleidoscope OR eyes)
27. Beyond Boolean term search
- What about phrases?
- Proximity: find Gates NEAR Microsoft.
  - Need the index to capture position information in docs. More later.
- Zones in documents: find documents with (author = Ullman) AND (text contains automata).
28. Evidence accumulation
- 1 vs. 0 occurrence of a search term
- 2 vs. 1 occurrence
- 3 vs. 2 occurrences, etc.
- Need term frequency information in docs.
- Used to compute a score for each document
- Matching documents rank-ordered by this score (see the sketch below).
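A minimal sketch of this kind of scoring; the raw count-of-query-terms score and the tiny document set below are deliberately crude illustrations, not a formula from the lecture:

```python
from collections import Counter

def rank_by_term_frequency(query_terms, docs):
    """Score each document by how often the query terms occur in it, highest first."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:                       # only matching documents are ranked
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    1: "caesar was ambitious",
    2: "brutus killed caesar and brutus spoke",
}
print(rank_by_term_frequency(["caesar", "brutus"], docs))   # [(2, 3), (1, 1)]
```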
29. Evaluating search engines
30. Measures for a search engine
- How fast does it index
- Number of documents/hour
- (Average document size)
- How fast does it search
- Latency as a function of index size
- Expressiveness of query language
- Speed on complex queries
31. Measures for a search engine
- All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise
- The key measure: user happiness
  - What is this?
  - Speed of response / size of index are factors
  - But blindingly fast, useless answers won't make a user happy
- Need a way of quantifying user happiness
32. Measuring user happiness
- Issue: who is the user we are trying to make happy?
- Depends on the setting
- Web engine: user finds what they want and returns to the engine
  - Can measure rate of return users
- eCommerce site: user finds what they want and makes a purchase
  - Is it the end-user, or the eCommerce site, whose happiness we measure?
  - Measure time to purchase, or fraction of searchers who become buyers?
33. Measuring user happiness
- Enterprise (company/govt/academic): care about user productivity
  - How much time do my users save when looking for information?
- Many other criteria having to do with breadth of access, secure access; more later
34. Happiness: elusive to measure
- Most common proxy: relevance of search results
- But how do you measure relevance?
- Will detail a methodology here, then examine its issues
- Requires 3 elements:
  - A benchmark document collection
  - A benchmark suite of queries
  - A binary assessment of either Relevant or Irrelevant for each query-doc pair
35. Evaluating an IR system
- Note: the information need is translated into a query
- Relevance is assessed relative to the information need, not the query
- E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
- Query: wine red white heart attack effective
36. Standard relevance benchmarks
- TREC: the National Institute of Standards and Technology (NIST) has run a large IR benchmark for many years
- Reuters and other benchmark doc collections are used
- Retrieval tasks are specified
  - sometimes as queries
- Human experts mark, for each query and for each doc, Relevant or Irrelevant
  - or at least for the subset of docs that some system returned for that query
37. Precision and Recall
- Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
- Precision P = tp / (tp + fp)
- Recall R = tp / (tp + fn)

                 Relevant   Not Relevant
  Retrieved      tp         fp
  Not Retrieved  fn         tn
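A minimal sketch computing precision and recall from sets of retrieved and relevant docIDs (the sets and function name are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of retrieved and relevant docIDs."""
    tp = len(retrieved & relevant)     # retrieved and relevant
    fp = len(retrieved - relevant)     # retrieved but not relevant
    fn = len(relevant - retrieved)     # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 8}))   # (0.5, 0.666...)
```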
38. Accuracy: a different measure
- Given a query, an engine classifies each doc as Relevant or Irrelevant.
- Accuracy of an engine: the fraction of these classifications that is correct.
39. Why not just use accuracy?
- How to build a 99.9999% accurate search engine on a low budget.
- People doing information retrieval want to find something and have a certain tolerance for junk.
[Figure: mock "Snoogle.com" search page that always answers "0 matching results found."]
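A hedged illustration of this point, using accuracy = (tp + tn) / (tp + fp + fn + tn) from the contingency table above; the collection size and relevant-doc count are made up:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/irrelevant classifications that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical query over 1,000,000 docs of which only 100 are relevant;
# an engine that retrieves nothing at all still looks near-perfect.
print(accuracy(tp=0, fp=0, fn=100, tn=999_900))   # 0.9999
```

Because almost every document is irrelevant to any given query, accuracy rewards retrieving nothing, which is exactly why precision and recall are preferred.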
40. Precision/Recall
- Can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
- Precision usually decreases (in a good system)
41. Difficulties in using precision/recall
- Should average over large corpus/query ensembles
- Need human relevance assessments
  - People aren't reliable assessors
- Assessments have to be binary
  - Nuanced assessments?
- Heavily skewed by corpus/authorship
  - Results may not translate from one domain to another
42. Information Retrieval
Prabhakar Raghavan, Yahoo! Research
- Lecture 1
- From Chapters 1 and 8 of IIR