Title: The Vector Space Model of Information Retrieval
The Boolean Model of IR

[Figure: two document sets, A and B.]
Getting Beyond Boole
- The Boolean retrieval model imagines IR in set-theoretic terms.
- Aside from its putative unfriendliness, this approach doesn't provide a satisfying model on at least two counts:
  - It suffers from information-overload (or null-set) problems.
  - It fails to provide a ranking of documents (estimated relevance is binary, not a matter of degree).
Similarity as a Surrogate for Relevance
- In information retrieval we are interested in the abstraction of relevance.
- But relevance is a very slippery concept.
- The starting point of the vector space model (VSM) is the idea that we can use similarity as a proxy for relevance, i.e. we assume that documents that are similar to a query are likely to be relevant to the query, too.
- Note that similarity is abstract, too. But perhaps we can operationalize it more easily.
Similarity as a Surrogate for Relevance
- The basic goal of the VSM, then, involves figuring out an adequate way to measure the similarity between things (such as queries and documents).
- Towards this: what does it mean for two things to be similar?
- What does it mean if A and B are more similar than A and C?
- How can we measure similarity in a way that is principled and meaningful?
Proximity as a Surrogate for Similarity
- To estimate inter-document similarity, the vector space model uses a geometric metaphor.
- Documents and queries are represented as points (actually vectors) in an abstract information space.
- We assume that things that are close to each other in this space are more similar than things that are far apart.
[Figures: documents and a query plotted as points in this space; points closer to the query are taken to be more similar to it.]
Similarity and Representation
- The similarities we observe depend on at least two considerations:
  - How do we define similarity?
  - How do we represent the objects whose similarities we wish to measure?
- Our starting point, then: what features form the axes of our information space?
Information Space
- We begin with the assumption that the people who read and write the language that our system uses (say, English) share some cognitive similarities.
- That is, we assume that people recognize certain concepts, and that it is the relationships among those concepts that we wish to model.
- Let's call this abstract conceptual world the information space shared by users of the system.
Information Space

Here we have documents and queries shown in the space spanned by the concepts CAT and DOG. A document's location is the degree (for now undefined) to which it evinces each of these concepts.
Motivation for the VSM
- Given these definitions, we can understand the vector space model as an effort to rank documents by their similarity in information space with respect to the query.
- Of course, information space is imaginary, and we don't yet know how to measure distance.
From Information Space to Term Space
- In order to measure inter-document similarities, we need to define the salient features of our documents.
- In a perfect world, each dimension of information space would form a feature, and each document would be scored by its participation in each concept.
- Absent such a scenario, what features can stand in for our abstract concepts?
From Information Space to Term Space
- While we don't directly observe the concepts that are evinced in documents, we do observe the terms that occur in them.
- Most IR begins with the assumption that these terms provide evidence about the concepts in documents.
- Instead of representing documents in information space, then, we represent them in term space.
From Information Space to Term Space

Doc    dog    cat
 1     1.44   3.62
 2     3.31   3.89
 3     1.31   1.89
 4     1.76   1.97
 5     2.51   2.25
 6     3.40   1.84
 7     1.76   1.20
 8     2.28   1.56
 9     1.69   2.24
10     3.57   1.98

Our problem has now become relatively simple: rank documents by how close they are to the query in this term space.
From Information Space to Term Space

But what if there are more than two terms in our conceptual universe? No problem. We simply need to think of our term space as having more dimensions.

Doc    dog    cat    fish
 1     1.44   3.62   2.55
 2     3.31   3.89   0.96
 3     1.31   1.89   1.49
 4     1.76   1.97   2.84
 5     2.51   2.25   2.37
 6     3.40   1.84   3.06
 7     1.76   1.20   2.46
 8     2.28   1.56   2.46
 9     1.69   2.24   2.59
10     3.57   1.98   3.02

Our problem remains relatively simple: rank documents by how close they are to the query in this term space. In any realistic application, our term space is of extremely high dimensionality. This is hard to visualize, but the notion of distance generalizes to n-dimensional spaces.
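To make the ranking idea concrete, here is a minimal Python sketch (mine, not from the slides) that stores the table above and orders documents by their straight-line distance to an invented query vector. This straight-line (Euclidean) distance is defined formally in the slides that follow.

```python
import math

# Document scores on the terms (dog, cat, fish), copied from the table above.
docs = {
    1: (1.44, 3.62, 2.55), 2: (3.31, 3.89, 0.96), 3: (1.31, 1.89, 1.49),
    4: (1.76, 1.97, 2.84), 5: (2.51, 2.25, 2.37), 6: (3.40, 1.84, 3.06),
    7: (1.76, 1.20, 2.46), 8: (2.28, 1.56, 2.46), 9: (1.69, 2.24, 2.59),
    10: (3.57, 1.98, 3.02),
}

# A hypothetical query: heavy on "cat", light on "dog", no "fish".
query = (1.0, 3.0, 0.0)

def distance(u, v):
    """Straight-line (Euclidean) distance between two points in term space."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

# Rank documents by proximity to the query: the closest come first.
for doc_id in sorted(docs, key=lambda d: distance(query, docs[d])):
    print(doc_id, round(distance(query, docs[doc_id]), 2))
```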
Measuring Distance
- So we're now halfway done with our model.
- We have defined the features that span the space in which we will represent our documents.
- Now we need a way to calculate the distance (and, inversely, the proximity) between documents in this space.
- As in our term-weighting discussion, there is no single right way to do this. Instead, there is a palette of distance metrics from which we can choose.
Properties of a Distance Metric
- Its values are nonnegative, with dist(a, b) = 0 if and only if a = b.
- It is symmetric: dist(a, b) = dist(b, a).
- It satisfies the triangle inequality: dist(a, c) <= dist(a, b) + dist(b, c) for all points a, b, and c.
Euclidean Distance
- This is the most familiar to us. It is simply the length of a straight line joining two points u and v.
- To calculate the Euclidean distance, we rely on the Pythagorean theorem.

[Figure: points u = (0, 0) and v = (2, 1) in the x-y plane; the segment joining them is the hypotenuse c of a right triangle with horizontal leg a = 2 and vertical leg b = 1.]

In p dimensions, let the Euclidean distance between two points u and v be

dist(u, v) = sqrt( (u_1 - v_1)^2 + (u_2 - v_2)^2 + ... + (u_p - v_p)^2 )

This is just a p-dimensional generalization of our use of the Pythagorean theorem above. Make sure you understand this identity.
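Translated directly into code, the definition is a one-liner. A minimal Python sketch (mine, not the lecture's) that checks the two-dimensional example above:

```python
import math

def euclidean_distance(u, v):
    """p-dimensional Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

# The figure's example: u = (0, 0), v = (2, 1) -> sqrt(2^2 + 1^2) = sqrt(5).
print(euclidean_distance((0, 0), (2, 1)))  # ≈ 2.236
```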
Manhattan Distance
- The Manhattan distance (a.k.a. city-block distance) is the number of units on a rectangular grid it takes to travel from point u to point v.

[Figure: from u = (0, 0) to v = (2, 1): 2 units horizontally + 1 unit vertically = 3 units.]
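For comparison, an equally small sketch (again mine) of the Manhattan distance, checked against the 2-units-plus-1-unit example in the figure:

```python
def manhattan_distance(u, v):
    """City-block distance: the sum of absolute per-coordinate differences."""
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

# The figure's example: 2 units horizontally + 1 unit vertically = 3 units.
print(manhattan_distance((0, 0), (2, 1)))  # 3
```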
Cosine Similarity
- The traditional vector space model is based on a different notion of similarity: the cosine of the angle between two vectors.
- To start our consideration of cosine similarity, consider our document vectors not as points in term space, but as arrows that travel from the origin of the space to a particular address.

[Figures: document vectors drawn as arrows from the origin of term space.]

We estimate inter-document similarity by comparing the magnitude of the angles between document vectors. More specifically, we calculate the cosine of this angle.
Definition of Cosine

[Figures: a right triangle with the angle theta at vertex O, adjacent side of length A, and hypotenuse of length H; cos(theta) = A / H. For each triangle, ask: is cos(theta) large or small here?]

- When theta is small, the hypotenuse is barely longer than the adjacent side: as H approaches A, the value of A/H approaches 1.
- When theta is large, the hypotenuse dwarfs the adjacent side: as H approaches infinity, the value of A/H approaches 0.
Cosine Similarity

So how do we compute the cosine of the angle between two vectors x and y?

Calculating Cosine(x, y)

cos(x, y) = (x · y) / (||x|| ||y||) = ( Σ_i x_i y_i ) / ( sqrt(Σ_i x_i^2) · sqrt(Σ_i y_i^2) )

This is not as bad as it looks. I promise.
- The numerator of our function is just the dot product between the vectors (i.e. how many terms they have in common).
- The denominator simply eliminates the effect of long documents skewing our measurement by normalizing both vectors.
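As a sanity check on the formula, here is a direct Python transcription (a sketch, not the lecture's code). It also previews why the denominator matters: a document and a ten-times-longer copy of it receive the same score.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: dot product divided by the product of the norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

# Scaling a vector does not change its direction, so the cosine is unchanged.
print(cosine_similarity((1, 2, 0), (2, 3, 1)))     # ≈ 0.956
print(cosine_similarity((1, 2, 0), (20, 30, 10)))  # ≈ 0.956
```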
Vector Length and Normalization

We use the Pythagorean theorem (again) to find the length of a vector v.

[Figure: the vector v = (2, 1) in the x-y plane; its length is the hypotenuse c of a right triangle with legs a = 2 and b = 1.]

In general, let the length (also called the norm) of a p-dimensional vector v be defined as

||v|| = sqrt( v_1^2 + v_2^2 + ... + v_p^2 )

Based on this, we can normalize our vector v to unit length by dividing it by its norm:

v_unit = v / ||v||
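In code, the norm and the normalization step look like this (a minimal sketch under the definitions above):

```python
import math

def norm(v):
    """Length (Euclidean norm) of a p-dimensional vector."""
    return math.sqrt(sum(vi ** 2 for vi in v))

def normalize(v):
    """Scale v to unit length by dividing every component by the norm."""
    length = norm(v)
    return tuple(vi / length for vi in v)

v = (2, 1)
print(norm(v))             # sqrt(5) ≈ 2.236
print(norm(normalize(v)))  # ≈ 1.0 (unit length)
```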
Calculating Cosine(x, y)

Finally, then, we get this fairly simple equation:

cos(x, y) = (x · y) / (||x|| ||y||)

The denominator simply eliminates the effect of long documents skewing our measurement by normalizing both vectors.
Using the Cosine for IR
- From now on, unless otherwise specified, we can assume that all document vectors in our term-document matrix A have been normalized to unit length.
- Likewise, we always normalize our query to unit length.
- Given these assumptions, we have the classic vector space model for IR.
Comparing Distance Metrics

The query and three document vectors:
- q  = (1, 2, 0)
- d1 = (2, 3, 1)
- d2 = (20, 30, 10)
- d3 = (0, 1, 3)

Let's measure the distance between this query vector and each document vector using our three distance metrics. Let's start intuitively, though: what is going on in this term space? Which documents do you suspect will be most relevant to the query?
Comparing Distance Metrics: Euclidean

Euclidean distance from the query q = (1, 2, 0) to each document:
- dist(q, d1) = sqrt((1-2)^2 + (2-3)^2 + (0-1)^2)    = sqrt(3)    ≈ 1.73
- dist(q, d2) = sqrt((1-20)^2 + (2-30)^2 + (0-10)^2) = sqrt(1245) ≈ 35.28
- dist(q, d3) = sqrt((1-0)^2 + (2-1)^2 + (0-3)^2)    = sqrt(11)   ≈ 3.32

Ranking (closest first): d1, d3, d2.
Comparing Distance Metrics: Manhattan

Manhattan distance from the query q = (1, 2, 0) to each document:
- dist(q, d1) = |1-2| + |2-3| + |0-1|    = 3
- dist(q, d2) = |1-20| + |2-30| + |0-10| = 57
- dist(q, d3) = |1-0| + |2-1| + |0-3|    = 5

Ranking (closest first): d1, d3, d2, the same ordering the Euclidean metric gave us.
Comparing Distance Metrics: Cosine

We need to calculate these similarities in a few steps. First we find the vector norms. Then we normalize. Finally, we compute.

- ||q||  = sqrt(1 + 4 + 0)       = sqrt(5)    ≈ 2.24
- ||d1|| = sqrt(4 + 9 + 1)       = sqrt(14)   ≈ 3.74
- ||d2|| = sqrt(400 + 900 + 100) = sqrt(1400) ≈ 37.42
- ||d3|| = sqrt(0 + 1 + 9)       = sqrt(10)   ≈ 3.16

- cos(q, d1) = (1·2 + 2·3 + 0·1)   / (sqrt(5) · sqrt(14))   = 8 / 8.37   ≈ 0.96
- cos(q, d2) = (1·20 + 2·30 + 0·10) / (sqrt(5) · sqrt(1400)) = 80 / 83.67 ≈ 0.96
- cos(q, d3) = (1·0 + 2·1 + 0·3)   / (sqrt(5) · sqrt(10))   = 2 / 7.07   ≈ 0.28

Ranking (most similar first): d1 and d2 (tied), then d3. Note that d2 = 10 · d1, so the cosine, which ignores vector length, scores d1 and d2 identically.
Comparing Distance Metrics

Rank of each document under each metric (1 = best match to q):

Document              Euclidean   Manhattan   Cosine
d1 = (2, 3, 1)            1           1          1
d2 = (20, 30, 10)         3           3          2
d3 = (0, 1, 3)            2           2          3

The distance metrics punish d2 for its length; the cosine, which looks only at direction, treats d2 as just as good a match as d1.
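The arithmetic above is easy to script. Here is a small Python sketch (mine, not the lecture's) that reproduces the three sets of scores for q, d1, d2, and d3:

```python
import math

q = (1, 2, 0)
docs = {"d1": (2, 3, 1), "d2": (20, 30, 10), "d3": (0, 1, 3)}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Distances rank ascending (smaller = closer); cosine ranks descending (larger = more similar).
for name, d in docs.items():
    print(name, round(euclidean(q, d), 2), manhattan(q, d), round(cosine(q, d), 3))
# d1 1.73  3  0.956
# d2 35.28 57 0.956
# d3 3.32  5  0.283
```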
Using the Cosine for IR

For any query q and document d (both normalized to unit length), the retrieval score is sim(q, d) = cos(q, d) = q · d.

In other words, let the n-dimensional vector of query-document similarities be

s = A q

where A is our n x p term-document matrix (one normalized document vector per row), and q is a p-dimensional query vector.
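A minimal NumPy sketch of this matrix formulation, reusing the three example documents from the previous slides (the variable names are mine, not the lecture's):

```python
import numpy as np

# Term-document matrix A: one row per document, one column per term.
A = np.array([[ 2.0,  3.0,  1.0],
              [20.0, 30.0, 10.0],
              [ 0.0,  1.0,  3.0]])
q = np.array([1.0, 2.0, 0.0])

# Normalize each document vector (row) and the query to unit length;
# the vector of cosine scores is then just the matrix-vector product s = A q.
A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)
scores = A_unit @ q_unit
print(scores)               # ≈ [0.956, 0.956, 0.283]
print(np.argsort(-scores))  # row indices, most similar first (rows 0 and 1 effectively tie)
```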
Taking Stock

In the VSM we have abstractions layered on abstractions. The intuition, though, is clear enough: rank documents by their estimated similarity to the query. Estimate similarity by noting the words that documents share.

Layers of abstraction: relevance → similarity → proximity → information space → term space