CS246 - PowerPoint PPT Presentation

Provided by: cho125

Transcript and Presenter's Notes

1
CS246

Basic Information Retrieval

2
Today's Topic

Basic Information Retrieval (IR)
Bag of words assumption
Boolean Model
Inverted index
Vector-space model
Document-term matrix
TF-IDF vector and cosine similarity
Phrase queries
Spell correction

3
Information-Retrieval System

Information source: existing text documents
Keyword-based/natural-language query
The system returns the best-matching documents given
the query
Challenge
Both queries and data are fuzzy
Unstructured text and natural-language queries
What documents are good matches for a query?
Computers do not understand the documents or
the queries
Developing a computerizable model is essential
to implement this approach

4
Bag of Words: Major Simplification

Consider each document as a bag of words
Bag vs. set
Ignore word ordering, but keep word count
Consider queries as bags of words as well
Great oversimplification, but works adequately in
many cases
"John loves only Jane" vs. "Only John loves Jane"
The limitation still shows up in current search
engines
Still, how do we match documents and queries?
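
To make the bag-vs-set distinction concrete, here is a minimal Python sketch. The tokenization (lowercasing and splitting on whitespace) and the helper name bag_of_words are assumptions for illustration, not part of the slides.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (order is discarded)."""
    return Counter(text.lower().split())

# The two sentences differ only in word order, so their bags are identical.
a = bag_of_words("John loves only Jane")
b = bag_of_words("Only John loves Jane")
same_bag = (a == b)

# A bag keeps word counts; a set would lose them.
bag = bag_of_words("to be or not to be")
as_set = set(bag)
```

Here same_bag is True even though the two sentences mean different things, which is exactly the information the model throws away.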

5
Boolean Model

Return all documents that contain the words in
the query
Simplest model for information retrieval
No notion of ranking
A document is either a match or non-match
Q: How do we find and return matching documents?
Basic algorithm?
Useful data structure?

6
Inverted Index

Allows quick lookup of document ids with a
particular word
Q: How can we use this to answer the query "UCLA Physics"?

Figure: lexicon/dictionary (DIC) entries pointing to their postings lists
Stanford -> PL(Stanford): 3 8 10 13 16 20
UCLA -> PL(UCLA): 1 2 3 9 16 18
MIT -> PL(MIT): 4 5 8 10 13 19 20 22
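
One way to answer a multi-word Boolean query is to intersect the sorted postings lists of the query words. The sketch below assumes the figure pairs each list of docids with the PL(...) label that follows it, and the function names intersect and boolean_and are hypothetical. "Physics" is not in the figure's dictionary, so the example also queries two words that are.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def boolean_and(words, index):
    """Return docids that contain every query word; unknown words match nothing."""
    lists = [index.get(w, []) for w in words]
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

# Postings lists as read from the figure (assumed pairing).
index = {
    "Stanford": [3, 8, 10, 13, 16, 20],
    "UCLA": [1, 2, 3, 9, 16, 18],
    "MIT": [4, 5, 8, 10, 13, 19, 20, 22],
}

ucla_stanford = boolean_and(["UCLA", "Stanford"], index)
ucla_physics = boolean_and(["UCLA", "Physics"], index)
```

Because both lists are sorted by docid, the intersection takes time linear in the combined list length and never scans the documents themselves.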

8
Size of Inverted Index (1)

100M docs, 10 KB/doc, 1000 unique words/doc,
10 B/word, 4 B/docid
Q: Document collection size?
Q: Inverted index size?
Heaps' law: vocabulary size = $k n^b$ with $30 < k < 100$
and $0.4 < b < 1$, where $n$ is the number of words in the collection
$k = 50$ and $b = 0.5$ are a good rule of thumb
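
The two questions above can be worked through as back-of-the-envelope arithmetic. This is one possible reading, with several assumptions: 1 KB is taken as 1000 bytes, the total word count n is estimated as document bytes divided by bytes per word, the vocabulary follows Heaps' law (k times n to the power b) with the rule-of-thumb values k = 50 and b = 0.5, and pointer or compression overhead is ignored.

```python
import math

num_docs = 100_000_000
bytes_per_doc = 10_000           # assumption: 1 KB = 1000 bytes
unique_words_per_doc = 1000
bytes_per_word = 10
bytes_per_docid = 4

# Document collection size: 10**12 bytes, roughly 1 TB.
collection_bytes = num_docs * bytes_per_doc

# Postings lists: one docid per (document, unique word) pair,
# 4 * 10**11 bytes, roughly 400 GB.
num_postings = num_docs * unique_words_per_doc
postings_bytes = num_postings * bytes_per_docid

# Dictionary: vocabulary size from Heaps' law with k = 50, b = 0.5.
words_per_doc = bytes_per_doc // bytes_per_word   # assumed estimate of total words per doc
n = num_docs * words_per_doc                      # 10**11 words in the collection
vocab_size = 50 * math.sqrt(n)                    # roughly 15.8 million distinct words
dictionary_bytes = vocab_size * bytes_per_word    # roughly 158 MB
```

Under these assumptions the postings lists are several orders of magnitude larger than the dictionary.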

9
Size of Inverted Index (2)

Q: Between the dictionary and the postings lists, which
one is larger?
Q: Lengths of postings lists?
Zipf's law: collection term frequency $\propto$
1/(frequency rank)
Q: How do we construct an inverted index?

10
Inverted Index Construction

C: set of all documents (corpus)
DIC: dictionary of inverted index
PL(w): postings list of word w
For each document d ∈ C
  Extract all words in content(d) into W
  For each w ∈ W
    If w ∉ DIC, then add w to DIC
    Append id(d) to PL(w)
Q: What if the index is larger than main memory?
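
The construction loop above can be sketched in Python for a corpus that fits in memory. The regex tokenizer, the lowercasing, and the three-document toy corpus are assumptions for illustration.

```python
import re
from collections import defaultdict

def build_inverted_index(corpus):
    """corpus: dict of doc id -> text. Returns word -> sorted list of doc ids."""
    index = defaultdict(list)            # plays the role of DIC and PL(w) together
    for doc_id in sorted(corpus):        # visiting ids in order keeps postings sorted
        words = set(re.findall(r"\w+", corpus[doc_id].lower()))
        for w in words:                  # each distinct word adds this doc once
            index[w].append(doc_id)
    return dict(index)

# Hypothetical toy corpus.
corpus = {
    1: "UCLA physics department",
    2: "Stanford physics",
    3: "UCLA and Stanford",
}
index = build_inverted_index(corpus)
```

Using a defaultdict folds the "if w is not in DIC, add it" step into the append.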

11
Inverted-Index Construction

For a large text corpus
Block sort-based construction
Partition and merge
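
A rough sketch of the partition-and-merge idea: split the corpus into blocks small enough to sort in memory, sort the (word, docid) pairs of each block, then merge the sorted runs. In this sketch the blocks stay in Python lists; a real system would write each sorted block to disk and stream it back during the merge. The function names, tokenizer, and toy corpus are assumptions.

```python
import heapq
import re
from itertools import groupby

def make_block(docs):
    """Return the sorted, de-duplicated (word, doc_id) pairs for one block."""
    pairs = set()
    for doc_id, text in docs:
        for w in re.findall(r"\w+", text.lower()):
            pairs.add((w, doc_id))
    return sorted(pairs)

def build_index_in_blocks(corpus, block_size):
    """Partition the corpus into blocks, sort each block, then merge the runs."""
    items = sorted(corpus.items())
    blocks = [make_block(items[i:i + block_size])
              for i in range(0, len(items), block_size)]
    merged = heapq.merge(*blocks)        # streams pairs in (word, doc_id) order
    return {w: [doc_id for _, doc_id in group]
            for w, group in groupby(merged, key=lambda pair: pair[0])}

# Hypothetical toy corpus, indexed two documents per block.
corpus = {
    1: "UCLA physics department",
    2: "Stanford physics",
    3: "UCLA and Stanford",
}
index = build_index_in_blocks(corpus, block_size=2)
```

heapq.merge only needs the head of each run at any time, which is what lets the merge step work on runs larger than main memory.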

12
Evaluation: Precision and Recall

Q: Are all matching documents what users want?
Basic idea: a model is good if it returns a
document if and only if it is relevant.
R: set of relevant documents
D: set of documents returned by a model
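
The formulas on this slide did not survive extraction. The sketch below uses the standard definitions, precision = |R ∩ D| / |D| and recall = |R ∩ D| / |R|. Returning 0.0 when a set is empty is an assumed convention, and the docid sets are made up for illustration.

```python
def precision_recall(relevant, returned):
    """relevant: set R of relevant doc ids; returned: set D returned by the model."""
    hits = len(relevant & returned)
    precision = hits / len(returned) if returned else 0.0   # assumed convention for empty D
    recall = hits / len(relevant) if relevant else 0.0      # assumed convention for empty R
    return precision, recall

# Hypothetical example: 4 relevant docs, 3 returned, 2 in common.
R = {1, 2, 3, 4}
D = {3, 4, 5}
p, r = precision_recall(R, D)
```

In this example two of the three returned documents are relevant (precision 2/3) and two of the four relevant documents are returned (recall 1/2).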

13
Vector-Space Model

Main problem of the Boolean model
Too many matching documents when the corpus is
large
Any way to rank documents?
Matrix interpretation of the Boolean model
Document-term matrix
Boolean 0 or 1 value for each entry
Basic idea
Assign a real-valued weight to each matrix entry
depending on the importance of the term
"the" vs. "UCLA"
Q: How should we assign the weights?

14
TF-IDF Vector

A term t is important for document d
If t appears many times in d, or
If t is a rare term
TF: term frequency
Number of occurrences of t in d
IDF: inverse document frequency
Computed from DF, the number of documents containing t
TF-IDF weighting
TF × log(N/DF), where N is the total number of documents
Q: How do we use it to compute query-document
relevance?
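
A minimal sketch of the weighting, reading the slide's formula as TF times log(N / number of documents containing t). The natural logarithm, the function name tf_idf_vectors, and the toy documents are assumptions; the slide does not fix a log base.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: dict of doc id -> list of words. Returns doc id -> {word: weight}."""
    n_docs = len(docs)
    df = Counter()                       # number of documents containing each word
    for words in docs.values():
        df.update(set(words))
    vectors = {}
    for doc_id, words in docs.items():
        tf = Counter(words)              # occurrences of each word in this document
        vectors[doc_id] = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return vectors

# Hypothetical toy documents: "the" appears everywhere, "ucla" in only one.
docs = {
    1: "the ucla campus the".split(),
    2: "the car".split(),
    3: "the racing car".split(),
}
vecs = tf_idf_vectors(docs)
```

A word that occurs in every document gets weight 0 regardless of its count, while a word that occurs in a single document gets the largest per-occurrence weight.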

15
Cosine Similarity

Represent both the query and the document as TF-IDF
vectors
Take the inner product of the two normalized
vectors to compute their similarity
Note: |Q| does not matter for document ranking.
Division by |D| penalizes longer documents.
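
The formula on this slide was lost in extraction. The standard cosine similarity, which matches the description above, is given here with assumed symbols $q_t$ and $d_t$ for the TF-IDF weights of term $t$ in the query and the document:

```latex
\cos(Q, D) = \frac{Q \cdot D}{|Q|\,|D|}
           = \frac{\sum_{t} q_t \, d_t}{\sqrt{\sum_{t} q_t^{2}} \; \sqrt{\sum_{t} d_t^{2}}}
```

For a fixed query the factor $1/|Q|$ is the same for every document, so it cannot change the ranking, while $1/|D|$ shrinks the score of documents with many heavily weighted terms.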

16
Cosine Similarity Example

idf(UCLA) = 10, idf(good) = 0.1, idf(university) =
idf(car) = idf(racing) = 1
Q = (UCLA, university), D = (car, racing)
Q = (UCLA, university), D = (UCLA, good)
Q = (UCLA, university), D = (university, good)
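
The three cases can be checked numerically. The sketch assumes every listed word occurs exactly once (TF = 1), so each weight equals the word's idf value; the helper names vector and cosine are hypothetical.

```python
import math

idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}

def vector(words):
    """Assumption: each word occurs once (TF = 1), so its weight is its idf."""
    return {w: idf[w] for w in words}

def cosine(q, d):
    """Inner product of the two vectors divided by the product of their lengths."""
    dot = sum(weight * d.get(term, 0.0) for term, weight in q.items())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d)

Q = vector(["UCLA", "university"])
documents = [["car", "racing"], ["UCLA", "good"], ["university", "good"]]
sims = [cosine(Q, vector(d)) for d in documents]
```

The similarities come out as 0 for (car, racing), about 0.995 for (UCLA, good), and about 0.099 for (university, good): sharing the rare term UCLA matters far more than sharing the common term university.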

17
Finding High Cosine-Similarity Documents

Q: Under the vector-space model, does
precision/recall make sense?
Q: How do we find the documents with the highest cosine
similarity in the corpus?
Q: Any way to avoid a complete scan of the corpus?

18
Inverted Index for TF-IDF

Q · d_i = 0 if d_i has no query words
Consider only the documents with query words
Inverted index: word → document
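
A sketch of how an inverted index avoids scanning the whole corpus: walk only the postings lists of the query words and accumulate partial inner products per document. It assumes each posting stores a (docid, TF-IDF weight) pair and that document vector lengths are precomputed; the function name top_k and the three-document index are hypothetical. The query length is left out because it does not affect the ranking.

```python
import math
from collections import defaultdict

def top_k(query, index, doc_norm, k):
    """query: {term: weight}; index: term -> [(doc_id, weight)]; doc_norm: doc_id -> |d|."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight     # partial inner product
    ranked = sorted(((s / doc_norm[d], d) for d, s in scores.items()), reverse=True)
    return [(d, s) for s, d in ranked[:k]]

# Hypothetical index over three documents:
# doc 1 = (car, racing), doc 2 = (UCLA, good), doc 3 = (university, good).
index = {
    "UCLA": [(2, 10)],
    "university": [(3, 1)],
    "good": [(2, 0.1), (3, 0.1)],
    "car": [(1, 1)],
    "racing": [(1, 1)],
}
doc_norm = {1: math.sqrt(2), 2: math.sqrt(100.01), 3: math.sqrt(1.01)}

results = top_k({"UCLA": 10, "university": 1}, index, doc_norm, k=2)
ranked_ids = [doc_id for doc_id, _ in results]
```

Document 1 shares no word with the query, so it is never touched and never scored.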

19
Phrase Queries

"Harvard University Boston" exactly as a phrase
Q: How can we support this query?
Two approaches
Biword index
Positional index
Q: Pros and cons of each approach?
Rule of thumb: 2x to 4x size increase for a
positional index compared to a docid-only index
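
A minimal sketch of the positional-index approach: store the word positions inside each posting, then require the phrase words to appear at consecutive positions. The whitespace tokenizer, the function names, and the toy corpus are assumptions for illustration.

```python
from collections import defaultdict

def build_positional_index(corpus):
    """corpus: dict of doc id -> text. Returns word -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for pos, w in enumerate(text.lower().split()):
            index[w].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(phrase, index):
    """Return ids of documents that contain the words of phrase consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])          # docs containing every word
    matches = []
    for doc_id in sorted(candidates):
        later = [set(index[w][doc_id]) for w in words[1:]]
        # The i-th later word must sit exactly i + 1 positions after the start.
        if any(all(start + i + 1 in later[i] for i in range(len(later)))
               for start in index[words[0]][doc_id]):
            matches.append(doc_id)
    return matches

# Hypothetical toy corpus.
corpus = {
    1: "Harvard University Boston campus",
    2: "Boston University near Harvard",
    3: "visit Harvard University in Boston",
}
pos_index = build_positional_index(corpus)
exact = phrase_query("Harvard University Boston", pos_index)
```

All three documents contain all three words, but only document 1 has them adjacent and in order. A biword index would instead store pairs such as "harvard university" as dictionary entries, trading a larger dictionary for simpler lookups.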

20
Spell correction

Q: What is the user's intention for the query
"Britnie Spears"? How can we find the correct
spelling?
Given a user-typed word w, find its correct
spelling c.
Probabilistic approach: find c with the highest
probability P(c|w).
Q: How to estimate it?
Bayes' rule: P(c|w) = P(w|c)P(c)/P(w)
Q: What are these probabilities and how can we
estimate them?
Rule of thumb: 75% of misspellings are within edit
distance 1; 98% are within edit distance 2.
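
One simple way to turn the Bayes formulation into code, under strong assumptions: P(c) is estimated from word counts, P(w|c) is treated as equal for all candidates at the same edit distance with closer candidates always preferred, and P(w) is ignored because it is the same for every candidate. The WORD_COUNTS table is made up for illustration, and transpositions are counted as a single edit.

```python
import string

# Hypothetical word counts standing in for P(c); not real data.
WORD_COUNTS = {"britney": 50, "brittany": 30, "spears": 40, "speaks": 20}

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the most frequent known word within edit distance 1, else distance 2."""
    word = word.lower()
    if word in WORD_COUNTS:
        return word
    dist1 = edits1(word)
    candidates = {c for c in dist1 if c in WORD_COUNTS}
    if not candidates:
        candidates = {c2 for c1 in dist1 for c2 in edits1(c1) if c2 in WORD_COUNTS}
    if not candidates:
        return word                       # no known word nearby; leave unchanged
    return max(candidates, key=lambda c: WORD_COUNTS[c])

fixed = correct("britnie")
```

"britnie" is two edits away from "britney", so it is only found in the distance-2 pass. A fuller model would estimate P(w|c) from observed typing errors instead of treating it as constant.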

21
Summary