The Vector Space Model - PowerPoint PPT Presentation

About This Presentation
Title:

The Vector Space Model

Description:

Title: Boolean Retrieval Model and Controlled Vocabulary Techniques Author: Douglas W. Oard Last modified by: jj Created Date: 6/17/1995 11:31:02 PM – PowerPoint PPT presentation

Slides: 44

Transcript and Presenter's Notes


1
The Vector Space Model
  • LBSC 796/CMSC828o
  • Session 3, February 9, 2004
  • Douglas W. Oard

2
Agenda
  • Thinking about search
  • Design strategies
  • Decomposing the search component
  • Boolean free text retrieval
  • The bag of terms representation
  • Proximity operators
  • Ranked retrieval
  • Vector space model
  • Passage retrieval

3
Supporting the Search Process
[Figure: search process diagram; only the "Source Selection" label is legible]

4
Design Strategies
  • Foster human-machine synergy
  • Exploit complementary strengths
  • Accommodate shared weaknesses
  • Divide-and-conquer
  • Divide task into stages with well-defined
    interfaces
  • Continue dividing until problems are easily
    solved
  • Co-design related components
  • Iterative process of joint optimization

5
Human-Machine Synergy
  • Machines are good at
  • Doing simple things accurately and quickly
  • Scaling to larger collections in sublinear time
  • People are better at
  • Accurately recognizing what they are looking for
  • Evaluating intangibles such as quality
  • Both are pretty bad at
  • Mapping consistently between words and concepts

6
Divide and Conquer
  • Strategy: use encapsulation to limit complexity
  • Approach
  • Define interfaces (input and output) for each
    component
  • Query interface: input terms, output
    representation
  • Define the functions performed by each component
  • Remove common words, weight rare terms higher, ...
  • Repeat the process within components as needed
  • Result: a hierarchical decomposition

7
Search Goal
  • Choose the same documents a human would
  • Without human intervention (less work)
  • Faster than a human could (less time)
  • As accurately as possible (less accuracy)
  • Humans start with an information need
  • Machines start with a query
  • Humans match documents to information needs
  • Machines match document and query representations

8
Search Component Model
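[Diagram: search component model]
  • Information Need → Query Formulation → Query →
    Query Processing (Representation Function) →
    Query Representation
  • Document → Document Processing (Representation
    Function) → Document Representation
  • Query Representation and Document Representation →
    Comparison Function → Retrieval Status Value
  • Information Need and Document → Human Judgment →
    Utility
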
9
Relevance
  • Relevance relates a topic and a document
  • Duplicates are equally relevant, by definition
  • Constant over time and across users
  • Pertinence relates a task and a document
  • Accounts for quality, complexity, language, ...
  • Utility relates a user and a document
  • Accounts for prior knowledge
  • We seek utility, but relevance is what we get!

10
Bag of Terms Representation
  • Bag: a set that can contain duplicates
  • "The quick brown fox jumped over the lazy dog's
    back" →
  • {back, brown, dog, fox, jump, lazy, over,
    quick, the, the}
  • Vector: values recorded in any consistent order
  • {back, brown, dog, fox, jump, lazy, over, quick,
    the, the} →
  • [1 1 1 1 1 1 1 1 2]
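
The bag and its vector can be sketched in a few lines of Python. This is only an illustration: the `STEMS` lookup is a hypothetical stand-in for a real stemmer, included so that "jumped" and "dog's" map to the slide's terms "jump" and "dog", and the function names are not from the lecture.

```python
from collections import Counter

# Hypothetical toy stemmer: just enough to map this sentence's words to the
# terms shown on the slide. A real system would use a proper stemmer.
STEMS = {"jumped": "jump", "dogs": "dog", "dog's": "dog"}


def bag_of_terms(text):
    """Return a Counter mapping each term to its number of occurrences."""
    words = text.lower().replace(".", "").split()
    return Counter(STEMS.get(word, word) for word in words)


def to_vector(bag, vocabulary):
    """Record the counts in one consistent term order."""
    return [bag[term] for term in vocabulary]


bag = bag_of_terms("The quick brown fox jumped over the lazy dog's back.")
vocabulary = sorted(bag)  # alphabetical order, as on the slide
vector = to_vector(bag, vocabulary)

print(vocabulary)
print(vector)
```

Any term order works as long as every document and query uses the same one; alphabetical order is simply the easiest to reproduce.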

11
Bag of Terms Example
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the
aid of their party.

Stopword List: for, is, of, the, to

Term     Document 1  Document 2
aid          0           1
all          0           1
back         1           0
brown        1           0
come         0           1
dog          1           0
fox          1           0
good         0           1
jump         1           0
lazy         1           0
men          0           1
now          0           1
over         1           0
party        0           1
quick        1           0
their        0           1
time         0           1

12
Boolean Free Text Retrieval
  • Limit the bag of words to absent and present
  • Boolean values, represented as 0 and 1
  • Represent terms as a bag of documents
  • Same representation, but rows rather than columns
  • Combine the rows using Boolean operators
  • AND, OR, NOT
  • Result set: every document with a 1 remaining

13
Boolean Operators
A OR B     B=0  B=1
A=0         0    1
A=1         1    1

A AND B    B=0  B=1
A=0         0    0
A=1         0    1

A NOT B    B=0  B=1     (= A AND NOT B)
A=0         0    0
A=1         1    0

NOT B      B=0  B=1
            1    0

14
Boolean Free Text Example
  • dog AND fox
  • Doc 3, Doc 5
  • dog NOT fox
  • Empty
  • fox NOT dog
  • Doc 7
  • dog OR fox
  • Doc 3, Doc 5, Doc 7
  • good AND party
  • Doc 6, Doc 8
  • good AND party NOT over
  • Doc 6

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid       0      0      0      1      0      0      0      1
all       0      1      0      1      0      1      0      0
back      1      0      1      0      0      0      1      0
brown     1      0      1      0      1      0      1      0
come      0      1      0      1      0      1      0      1
dog       0      0      1      0      1      0      0      0
fox       0      0      1      0      1      0      1      0
good      0      1      0      1      0      1      0      1
jump      0      0      1      0      0      0      0      0
lazy      1      0      1      0      1      0      1      0
men       0      1      0      1      0      0      0      1
now       0      1      0      0      0      1      0      1
over      1      0      1      0      1      0      1      1
party     0      0      0      0      0      1      0      1
quick     1      0      1      0      0      0      0      0
their     1      0      0      0      1      0      1      0
time      0      1      0      1      0      1      0      0

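
One way to sketch these queries in Python is to store, for each term, the set of documents whose row holds a 1, and to map AND, OR, and NOT onto set intersection, union, and difference. The postings below are copied from the term-document table for this example; the `POSTINGS` and `docs` names are illustrative, not part of the lecture.

```python
# Postings: for each term, the set of documents with a 1 in its table row.
POSTINGS = {
    "dog": {3, 5},
    "fox": {3, 5, 7},
    "good": {2, 4, 6, 8},
    "over": {1, 3, 5, 7, 8},
    "party": {6, 8},
}


def docs(term):
    """Documents containing the term (empty set for unknown terms)."""
    return POSTINGS.get(term, set())


results = {
    "dog AND fox": docs("dog") & docs("fox"),
    "dog NOT fox": docs("dog") - docs("fox"),
    "fox NOT dog": docs("fox") - docs("dog"),
    "dog OR fox": docs("dog") | docs("fox"),
    "good AND party": docs("good") & docs("party"),
    "good AND party NOT over": (docs("good") & docs("party")) - docs("over"),
}

for query, matched in results.items():
    print(query, "->", sorted(matched) or "Empty")
```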
15
Why Boolean Retrieval Works
  • Boolean operators approximate natural language
  • Find documents about a good party that is not
    over
  • AND can discover relationships between concepts
  • good party
  • OR can discover alternate terminology
  • excellent party
  • NOT can discover alternate meanings
  • Democratic party

16
The Perfect Query Paradox
  • Every information need has a perfect doc set
  • If not, there would be no sense doing retrieval
  • Almost every document set has a perfect query
  • AND every word to get a query for document 1
  • Repeat for each document in the set
  • OR every document query to get the set query
  • But users find Boolean query formulation hard
  • They get too much, too little, useless stuff, ...

17
Why Boolean Retrieval Fails
  • Natural language is way more complex
  • She saw the man on the hill with a telescope
  • AND discovers nonexistent relationships
  • Terms in different paragraphs, chapters, ...
  • Guessing terminology for OR is hard
  • good, nice, excellent, outstanding, awesome, ...
  • Guessing terms to exclude is even harder!
  • Democratic party, party to a lawsuit, ...

18
Proximity Operators
  • More precise versions of AND
  • NEAR n allows at most n-1 intervening terms
  • WITH requires terms to be adjacent and in order
  • Easy to implement, but less efficient
  • Store a list of positions for each word in each
    doc
  • Stopwords become very important!
  • Perform normal Boolean computations
  • Treat WITH and NEAR like AND with an extra
    constraint

19
Proximity Operator Example
  • time AND come
  • Doc 2
  • time (NEAR 2) come
  • Empty
  • quick (NEAR 2) fox
  • Doc 1
  • quick WITH fox
  • Empty

Doc 1 is the "quick brown fox" sentence and Doc 2 is the
"Now is the time" sentence; the number in parentheses is
the word position of the term in that document.

Term     Doc 1     Doc 2
aid      0         1 (13)
all      0         1 (6)
back     1 (10)    0
brown    1 (3)     0
come     0         1 (10)
dog      1 (9)     0
fox      1 (4)     0
good     0         1 (7)
jump     1 (5)     0
lazy     1 (8)     0
men      0         1 (8)
now      0         1 (1)
over     1 (6)     0
party    0         1 (16)
quick    1 (2)     0
their    0         1 (15)
time     0         1 (4)

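
The proximity operators can be sketched with a positional index: a list of word positions for each term in each document. The code below is a minimal illustration under a few assumptions: positions count every word (stopwords included) starting at 1, NEAR is treated as order-independent, and the function names `near` and `with_adjacent` are invented for this example.

```python
DOCS = {
    1: "the quick brown fox jumped over the lazy dog's back",
    2: "now is the time for all good men to come to the aid of their party",
}


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, counting positions from 1."""
    index = {}
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index


def boolean_and(index, a, b):
    """Documents that contain both terms, anywhere."""
    return set(index.get(a, {})) & set(index.get(b, {}))


def near(index, a, b, n):
    """AND plus a constraint: at most n-1 terms between a and b."""
    return {
        doc_id
        for doc_id in boolean_and(index, a, b)
        if any(0 < abs(pa - pb) <= n
               for pa in index[a][doc_id] for pb in index[b][doc_id])
    }


def with_adjacent(index, a, b):
    """AND plus a constraint: b immediately follows a."""
    return {
        doc_id
        for doc_id in boolean_and(index, a, b)
        if any(pb - pa == 1
               for pa in index[a][doc_id] for pb in index[b][doc_id])
    }


index = build_positional_index(DOCS)
print(boolean_and(index, "time", "come"))   # both terms occur in Doc 2
print(near(index, "time", "come", 2))       # too far apart
print(near(index, "quick", "fox", 2))       # one term ("brown") in between
print(with_adjacent(index, "quick", "fox")) # not adjacent
```

Because stopwords occupy positions too, dropping them before indexing would change which terms count as near each other.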
20
Strengths and Weaknesses
  • Strong points
  • Accurate, if you know the right strategies
  • Efficient for the computer
  • Weaknesses
  • Often results in too many documents, or none
  • Users must learn Boolean logic
  • Sometimes finds relationships that don't exist
  • Words can have many meanings
  • Choosing the right words is sometimes hard

21
Ranked Retrieval Paradigm
  • Exact match retrieval often gives useless sets
  • No documents at all, or way too many documents
  • Query reformulation is one solution
  • Manually add or delete query terms
  • Best-first ranking can be superior
  • Select every document within reason
  • Put them in order, with the best ones first
  • Display them one screen at a time

22
Advantages of Ranked Retrieval
  • Closer to the way people think
  • Some documents are better than others
  • Enriches browsing behavior
  • Decide how far down the list to go as you read it
  • Allows more flexible queries
  • Long and short queries can produce useful results

23
Ranked Retrieval Challenges
  • Best first is easy to say but hard to do!
  • The best we can hope for is to approximate it
  • Will the user understand the process?
  • It is hard to use a tool that you don't
    understand
  • Efficiency becomes a concern
  • Only a problem for long queries, though

24
Partial-Match Ranking
  • Form several result sets from one long query
  • Query for the first set is the AND of all the
    terms
  • Then all but the 1st term, all but the 2nd, ...
  • Then all but the first two terms, ...
  • And so on until each single term query is tried
  • Remove duplicates from subsequent sets
  • Display the sets in the order they were made
  • Document rank within a set is arbitrary
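
The procedure above can be sketched as follows. This is an illustrative reading of the slide, not a reference implementation: the function name is invented, the postings reuse the earlier Boolean free text example, and documents within a set are sorted by number only to make the output repeatable (the slide leaves that order arbitrary).

```python
from itertools import combinations


def partial_match_ranking(query_terms, postings):
    """Return (terms used, new documents) pairs in the order they were made."""
    n = len(query_terms)
    seen = set()
    ranked_sets = []
    # Drop no terms first, then each single term, then each pair, and so on,
    # stopping once only single-term queries remain.
    for dropped_count in range(n):
        for dropped in combinations(range(n), dropped_count):
            kept = [t for i, t in enumerate(query_terms) if i not in dropped]
            result = set.intersection(*(postings.get(t, set()) for t in kept))
            new_docs = result - seen  # remove duplicates from earlier sets
            if new_docs:
                ranked_sets.append((kept, sorted(new_docs)))
                seen |= new_docs
    return ranked_sets


POSTINGS = {
    "good": {2, 4, 6, 8},
    "party": {6, 8},
    "over": {1, 3, 5, 7, 8},
}

ranking = partial_match_ranking(["good", "party", "over"], POSTINGS)
for terms, documents in ranking:
    print(" AND ".join(terms), "->", documents)
```

The number of sub-queries grows as 2^n - 1 for n query terms, which is one reason this approach is practical only for short queries.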

25
Partial Match Example
information AND retrieval
  • Readings in Information Retrieval
  • Information Storage and Retrieval
  • Speech-Based Information Retrieval for Digital
    Libraries
  • Word Sense Disambiguation and Information
    Retrieval
information NOT retrieval
  • The State of the Art in Information Filtering
retrieval NOT information
  • Inference Networks for Document Retrieval
  • Content-Based Image Retrieval Systems
  • Video Parsing, Retrieval and Browsing
  • An Approach to Conceptual Text Retrieval Using the
    EuroWordNet
  • Cross-Language Retrieval English/Russian/French

26
Similarity-Based Queries
  • Treat the query as if it were a document
  • Create a query bag-of-words
  • Find the similarity of each document
  • Using the coordination measure, for example
  • Rank order the documents by similarity
  • Most similar to the query first
  • Surprisingly, this works pretty well!
  • Especially for very short queries

27
Document Similarity
  • How similar are two documents?
  • In particular, how similar is their bag of words?

1: Nuclear fallout contaminated Siberia.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Term            1   2   3
complicated     0   0   1
contaminated    1   0   0
fallout         1   0   0
information     0   1   1
interesting     0   1   0
nuclear         1   0   0
retrieval       0   1   1
siberia         1   0   0

28
The Coordination Measure
  • Count the number of terms in common
  • Based on Boolean bag-of-words
  • Documents 2 and 3 share two common terms
  • But documents 1 and 2 share no terms at all
  • Useful for "more like this" queries
  • "More like doc 2" would rank doc 3 ahead of doc 1
  • Where have you seen this before?

29
Coordination Measure Example
Term            1   2   3
complicated     0   0   1
contaminated    1   0   0
fallout         1   0   0
information     0   1   1
interesting     0   1   0
nuclear         1   0   0
retrieval       0   1   1
siberia         1   0   0

  • Query: complicated retrieval
  • Result: 3, 2
  • Query: interesting nuclear fallout
  • Result: 1, 2
  • Query: information retrieval
  • Result: 2, 3

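
A short Python sketch of the coordination measure on this three-document example. The term sets follow the table's vocabulary, documents that share no term with the query are left out of the result, and ties are broken by document number; both choices are assumptions made to match the results shown, and the function names are illustrative.

```python
# Boolean bag of words for each document, using the table's terms.
DOCS = {
    1: {"nuclear", "fallout", "contaminated", "siberia"},
    2: {"information", "retrieval", "interesting"},
    3: {"information", "retrieval", "complicated"},
}


def coordination(query_terms, doc_terms):
    """Number of distinct terms the query and the document share."""
    return len(set(query_terms) & doc_terms)


def rank(query, docs=DOCS):
    """Documents with at least one matching term, best first."""
    scores = {d: coordination(query.split(), terms) for d, terms in docs.items()}
    matching = [d for d in docs if scores[d] > 0]
    return sorted(matching, key=lambda d: (-scores[d], d))


print(rank("complicated retrieval"))
print(rank("interesting nuclear fallout"))
print(rank("information retrieval"))
```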
30
Counting Terms
  • Terms tell us about documents
  • If "rabbit" appears a lot, it may be about
    rabbits
  • Documents tell us about terms
  • "the" is in every document -- not discriminating
  • Documents are most likely described well by rare
    terms that occur in them frequently
  • Higher term frequency is stronger evidence
  • Low collection frequency makes it stronger still

31
The Document Length Effect
  • Humans look for documents with useful parts
  • But probabilities are computed for the whole
  • Document lengths vary in many collections
  • So probability calculations could be inconsistent
  • Two strategies
  • Adjust probability estimates for document length
  • Divide the documents into equal passages

32
Incorporating Term Frequency
  • High term frequency is evidence of meaning
  • And high IDF is evidence of term importance
  • Recompute the bag-of-words
  • Compute TF × IDF for every element
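
A minimal sketch of the recomputation in Python. It assumes IDF = log10(N / DF), where N is the number of documents and DF the number of documents containing the term; the slide does not state the formula, but this choice reproduces the IDF column of the lecture's TFIDF example (for instance 0.301 for a term found in 2 of 4 documents). The term counts are taken from that example as reconstructed from its flattened table, and the names are illustrative.

```python
import math

# Term frequencies: term -> {document number: count}.
TF = {
    "complicated": {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout": {1: 5, 3: 4, 4: 3},
    "information": {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting": {2: 1},
    "nuclear": {1: 3, 3: 7},
    "retrieval": {2: 6, 3: 1, 4: 4},
    "siberia": {1: 2},
}
N_DOCS = 4


def idf(term):
    """Assumed form: log10(N / DF). Rare terms get higher values."""
    document_frequency = len(TF[term])
    return math.log10(N_DOCS / document_frequency)


def tf_idf():
    """Replace every count in the bag of words with TF x IDF."""
    return {
        term: {doc: count * idf(term) for doc, count in counts.items()}
        for term, counts in TF.items()
    }


weights = tf_idf()
for term in TF:
    print(f"{term:13s} IDF={idf(term):.3f}", weights[term])
```

A term that appears in every document ("information" here) gets an IDF of zero, so it contributes nothing to any score no matter how often it occurs.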

33
Weighted Matching Schemes
  • Unweighted queries
  • Add up the weights for every matching term
  • User specified query term weights
  • For each term, multiply the query and doc weights
  • Then add up those values
  • Automatically computed query term weights
  • Most queries lack useful TF, but IDF may be
    useful
  • Used just like user-specified query term weights

34
TFIDF Example
                      TF                         TF × IDF
Term             1   2   3   4    IDF       1      2      3      4
complicated      0   0   5   2   0.301    0.00   0.00   1.51   0.60
contaminated     4   1   3   0   0.125    0.50   0.13   0.38   0.00
fallout          5   0   4   3   0.125    0.63   0.00   0.50   0.38
information      6   3   3   2   0.000    0.00   0.00   0.00   0.00
interesting      0   1   0   0   0.602    0.00   0.60   0.00   0.00
nuclear          3   0   7   0   0.301    0.90   0.00   2.11   0.00
retrieval        0   6   1   4   0.125    0.00   0.75   0.13   0.50
siberia          2   0   0   0   0.602    1.20   0.00   0.00   0.00

  • Unweighted query: contaminated retrieval
  • Result: 2, 3, 1, 4
  • Weighted query: contaminated(3) retrieval(1)
  • Result: 1, 3, 2, 4
  • IDF-weighted query: contaminated retrieval
  • Result: 2, 3, 1, 4

35
Document Length Normalization
  • Long documents have an unfair advantage
  • They use a lot of terms
  • So they get more matches than short documents
  • And they use the same words repeatedly
  • So they have much higher term frequencies
  • Normalization seeks to remove these effects
  • Related somehow to maximum term frequency
  • But also sensitive to the number of terms

36
Cosine Normalization
  • Compute the length of each document vector
  • Multiply each weight by itself
  • Add all the resulting values
  • Take the square root of that sum
  • Divide each weight by that length
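
The steps above can be sketched directly in Python. The example vector holds document 1's rounded TF × IDF weights from the lecture's example; the function names are illustrative.

```python
import math


def vector_length(weights):
    """Square each weight, add the squares, take the square root."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine_normalize(weights):
    """Divide each weight by the vector length (zero vectors are unchanged)."""
    length = vector_length(weights)
    if length == 0:
        return dict(weights)
    return {term: w / length for term, w in weights.items()}


doc1 = {"contaminated": 0.50, "fallout": 0.63, "nuclear": 0.90, "siberia": 1.20}
length = vector_length(doc1)
normalized = cosine_normalize(doc1)

print(round(length, 2))
print({term: round(w, 2) for term, w in normalized.items()})
```

After normalization every document vector has length 1, so a long document no longer outscores a short one simply by having larger weights.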

37
Cosine Normalization Example
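                      TF                         TF × IDF
Term             1   2   3   4    IDF       1      2      3      4
complicated      0   0   5   2   0.301    0.00   0.00   1.51   0.60
contaminated     4   1   3   0   0.125    0.50   0.13   0.38   0.00
fallout          5   0   4   3   0.125    0.63   0.00   0.50   0.38
information      6   3   3   2   0.000    0.00   0.00   0.00   0.00
interesting      0   1   0   0   0.602    0.00   0.60   0.00   0.00
nuclear          3   0   7   0   0.301    0.90   0.00   2.11   0.00
retrieval        0   6   1   4   0.125    0.00   0.75   0.13   0.50
siberia          2   0   0   0   0.602    1.20   0.00   0.00   0.00
Length                                    1.70   0.97   2.67   0.87

                  Normalized weight
Term              1      2      3      4
complicated     0.00   0.00   0.57   0.69
contaminated    0.29   0.13   0.14   0.00
fallout         0.37   0.00   0.19   0.44
information     0.00   0.00   0.00   0.00
interesting     0.00   0.62   0.00   0.00
nuclear         0.53   0.00   0.79   0.00
retrieval       0.00   0.77   0.05   0.57
siberia         0.71   0.00   0.00   0.00

  • Unweighted query: contaminated retrieval
  • Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4 without
    normalization)
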
38
Why Call It Cosine?
[Figure: document vectors d1 and d2 separated by the angle θ]

39
Interpreting the Cosine Measure
  • Think of a document as a vector from zero
  • Similarity is the angle between two vectors
  • Small angle: very similar
  • Large angle: little similarity
  • Passes some key sanity checks
  • Depends on pattern of word use but not on length
  • Every document is most similar to itself
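
The two sanity checks can be verified with a small Python sketch. It computes the cosine of the angle between two term-weight vectors as their inner product divided by the product of their lengths; the vectors and names below are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc = {"nuclear": 3.0, "fallout": 5.0}
doubled = {term: 2 * w for term, w in doc.items()}  # same pattern, twice as long
unrelated = {"retrieval": 1.0}

print(cosine(doc, doc))        # a document compared with itself
print(cosine(doc, doubled))    # same pattern of word use, different length
print(cosine(doc, unrelated))  # no terms in common
```

A cosine of 1 corresponds to an angle of zero (most similar), and a cosine of 0 to a right angle (nothing in common).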

40
Okapi Term Weights
[Equation: Okapi term weight, shown as the product of a TF
component and an IDF component]

41
Passage Retrieval
  • Another approach to long-document problem
  • Break it up into coherent units
  • Recognizing topic boundaries is hard
  • But overlapping 300-word passages work fine
  • Document rank is best passage rank
  • And passage information can help guide browsing
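
A minimal sketch of passage retrieval with overlapping fixed-size windows. The 300-word size comes from the slide; the step of 150 words (half a passage of overlap) is an assumption, as are the function names and the toy scoring function used in the example.

```python
def passages(words, size=300, step=150):
    """Split a word list into overlapping windows of `size` words."""
    if len(words) <= size:
        return [words]
    starts = list(range(0, len(words) - size + 1, step))
    if starts[-1] + size < len(words):
        starts.append(len(words) - size)  # final window covers the tail
    return [words[s:s + size] for s in starts]


def best_passage_score(words, score_passage, size=300, step=150):
    """Score a document by its best-scoring passage."""
    return max(score_passage(p) for p in passages(words, size, step))


# Toy example with tiny windows: the matching words sit at the very end.
doc_words = ["x"] * 8 + ["fox", "fox"]
windows = passages(doc_words, size=4, step=2)
best = best_passage_score(doc_words, lambda p: p.count("fox"), size=4, step=2)

print(len(windows), best)
```

Keeping track of which window produced the best score also gives the position to show the user first, which is how passage information can guide browsing.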

42
Summary
  • Goal: find documents most similar to the query
  • Compute normalized document term weights
  • Some combination of TF, DF, and Length
  • Optionally, get query term weights from the user
  • Estimate of term importance
  • Compute inner product of query and doc vectors
  • Multiply corresponding elements and then add

43
Before You Go!
  • On a sheet of paper, please briefly answer the
    following question (no names)
  • What was the muddiest point in today's lecture?