Title: The Vector Space Model of Information Retrieval
The Boolean Model of IR

[Figure: two document sets, A and B.]
Getting Beyond Boole
- The Boolean retrieval model imagines IR in set-theoretic terms.
- Aside from its putative unfriendliness, this approach doesn't provide a satisfying model on at least two counts:
  - It suffers from information-overload (or null-set) problems.
  - It fails to provide a ranking of documents (estimated relevance is binary, not a matter of degree).
Similarity as a Surrogate for Relevance
- In information retrieval we are interested in the abstraction of relevance.
- But relevance is a very slippery concept.
- The starting point of the vector space model (VSM) is the idea that we can use similarity as a proxy for relevance, i.e. we assume that documents that are similar to a query are likely to be relevant to the query, too.
- Note that similarity is abstract, too. But perhaps we can operationalize it more easily.
Similarity as a Surrogate for Relevance
- The basic goal of the VSM, then, involves figuring out an adequate way to measure the similarity between things (such as queries and documents).
- Towards this: what does it mean for two things to be similar?
- What does it mean if A and B are more similar than A and C?
- How can we measure similarity in a way that is principled and meaningful?
Proximity as a Surrogate for Similarity
- To estimate inter-document similarity, the vector space model uses a geometric metaphor.
- Documents and queries are represented as points (actually vectors) in an abstract information space.
- We assume that things that are close to each other in this space are more similar than things that are far apart.
[Figures: documents and a query plotted as points in this space; points closer to the query are taken to be more similar to it.]
Similarity and Representation
- The similarities we observe depend on at least two considerations:
  - How do we define similarity?
  - How do we represent the objects whose similarities we wish to measure?
- Our starting point, then: what features form the axes of our information space?
Information Space
- We begin with the assumption that the people who read and write the language that our system uses (say, English) share some cognitive similarities.
- That is, we assume that people recognize certain concepts, and that it is the relationships among those concepts that we wish to model.
- Let's call this abstract conceptual world the information space shared by users of the system.
Information Space

Here we have documents and queries shown in the space spanned by the concepts CAT and DOG. A document's location is the degree (for now undefined) to which it evinces each of these concepts.
Motivation for the VSM
- Given these definitions, we can understand the vector space model as an effort to rank documents by their similarity in information space with respect to the query.
- Of course, information space is imaginary, and we don't yet know how to measure distance.
From Information Space to Term Space
- In order to measure inter-document similarities, we need to define the salient features of our documents.
- In a perfect world, each dimension of information space would form a feature, and each document would be scored by its participation in each concept.
- Absent such a scenario, what features can stand in for our abstract concepts?
From Information Space to Term Space
- While we don't directly observe the concepts that are evinced in documents, we do observe the terms that occur in them.
- Most IR begins with the assumption that these terms provide evidence about the concepts in documents.
- Instead of representing documents in information space, then, we represent them in term space.
From Information Space to Term Space

Doc    dog    cat
 1     1.44   3.62
 2     3.31   3.89
 3     1.31   1.89
 4     1.76   1.97
 5     2.51   2.25
 6     3.40   1.84
 7     1.76   1.20
 8     2.28   1.56
 9     1.69   2.24
10     3.57   1.98

Our problem has now become relatively simple: rank documents by how close they are to the query in this term space.
From Information Space to Term Space

But what if there are more than two terms in our conceptual universe? No problem. We simply need to think of our term space as having more dimensions.

Doc    dog    cat    fish
 1     1.44   3.62   2.55
 2     3.31   3.89   0.96
 3     1.31   1.89   1.49
 4     1.76   1.97   2.84
 5     2.51   2.25   2.37
 6     3.40   1.84   3.06
 7     1.76   1.20   2.46
 8     2.28   1.56   2.46
 9     1.69   2.24   2.59
10     3.57   1.98   3.02

Our problem remains relatively simple: rank documents by how close they are to the query in this term space. In any realistic application, our term space is of extremely high dimensionality. This is hard to visualize, but the notion of distance generalizes to n-dimensional spaces.
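To make the ranking idea concrete, here is a minimal Python sketch (mine, not from the slides) that stores the table above and orders documents by their straight-line distance to an invented query vector. This straight-line (Euclidean) distance is defined formally in the slides that follow.

```python
import math

# Document scores on the terms (dog, cat, fish), copied from the table above.
docs = {
    1: (1.44, 3.62, 2.55), 2: (3.31, 3.89, 0.96), 3: (1.31, 1.89, 1.49),
    4: (1.76, 1.97, 2.84), 5: (2.51, 2.25, 2.37), 6: (3.40, 1.84, 3.06),
    7: (1.76, 1.20, 2.46), 8: (2.28, 1.56, 2.46), 9: (1.69, 2.24, 2.59),
    10: (3.57, 1.98, 3.02),
}

# A hypothetical query: heavy on "cat", light on "dog", no "fish".
query = (1.0, 3.0, 0.0)

def distance(u, v):
    """Straight-line (Euclidean) distance between two points in term space."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

# Rank documents by proximity to the query: the closest come first.
for doc_id in sorted(docs, key=lambda d: distance(query, docs[d])):
    print(doc_id, round(distance(query, docs[doc_id]), 2))
```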
Measuring Distance
- So we're now halfway done with our model.
- We have defined the features that span the space in which we will represent our documents.
- Now we need a way to calculate the distance (and, inversely, the proximity) between documents in this space.
- As in our term-weighting discussion, there is no single right way to do this. Instead, there is a palette of distance metrics from which we can choose.
Properties of a Distance Metric
- Its values are nonnegative, with dist(a, b) = 0 if and only if a = b.
- It is symmetric: dist(a, b) = dist(b, a).
- It satisfies the triangle inequality: dist(a, c) <= dist(a, b) + dist(b, c) for all points a, b, and c.
Euclidean Distance
- This is the most familiar to us. It is simply the length of a straight line joining two points u and v.
- To calculate the Euclidean distance, we rely on the Pythagorean theorem.

[Figure: points u = (0, 0) and v = (2, 1) in the x-y plane; the segment joining them is the hypotenuse c of a right triangle with horizontal leg a = 2 and vertical leg b = 1.]

In p dimensions, let the Euclidean distance between two points u and v be

dist(u, v) = sqrt( (u_1 - v_1)^2 + (u_2 - v_2)^2 + ... + (u_p - v_p)^2 )

This is just a p-dimensional generalization of our use of the Pythagorean theorem above. Make sure you understand this identity.
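Translated directly into code, the definition is a one-liner. A minimal Python sketch (mine, not the lecture's) that checks the two-dimensional example above:

```python
import math

def euclidean_distance(u, v):
    """p-dimensional Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

# The figure's example: u = (0, 0), v = (2, 1) -> sqrt(2^2 + 1^2) = sqrt(5).
print(euclidean_distance((0, 0), (2, 1)))  # ≈ 2.236
```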
Manhattan Distance
- The Manhattan distance (a.k.a. city-block distance) is the number of units on a rectangular grid it takes to travel from point u to point v.

[Figure: from u = (0, 0) to v = (2, 1): 2 units horizontally + 1 unit vertically = 3 units.]
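For comparison, an equally small sketch (again mine) of the Manhattan distance, checked against the 2-units-plus-1-unit example in the figure:

```python
def manhattan_distance(u, v):
    """City-block distance: the sum of absolute per-coordinate differences."""
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

# The figure's example: 2 units horizontally + 1 unit vertically = 3 units.
print(manhattan_distance((0, 0), (2, 1)))  # 3
```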
Cosine Similarity
- The traditional vector space model is based on a different notion of similarity: the cosine of the angle between two vectors.
- To start our consideration of cosine similarity, consider our document vectors not as points in term space, but as arrows that travel from the origin of the space to a particular address.

[Figures: document vectors drawn as arrows from the origin of term space.]

We estimate inter-document similarity by comparing the magnitude of the angles between document vectors. More specifically, we calculate the cosine of this angle.
Definition of Cosine

[Figures: a right triangle with the angle theta at vertex O, adjacent side of length A, and hypotenuse of length H; cos(theta) = A / H. For each triangle, ask: is cos(theta) large or small here?]

- When theta is small, the hypotenuse is barely longer than the adjacent side: as H approaches A, the value of A/H approaches 1.
- When theta is large, the hypotenuse dwarfs the adjacent side: as H approaches infinity, the value of A/H approaches 0.
Cosine Similarity

So how do we compute the cosine of the angle between two vectors x and y?

Calculating Cosine(x, y)

cos(x, y) = (x · y) / (||x|| ||y||) = ( Σ_i x_i y_i ) / ( sqrt(Σ_i x_i^2) · sqrt(Σ_i y_i^2) )

This is not as bad as it looks. I promise.
- The numerator of our function is just the dot product between the vectors (i.e. how many terms they have in common).
- The denominator simply eliminates the effect of long documents skewing our measurement by normalizing both vectors.
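As a sanity check on the formula, here is a direct Python transcription (a sketch, not the lecture's code). It also previews why the denominator matters: a document and a ten-times-longer copy of it receive the same score.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: dot product divided by the product of the norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

# Scaling a vector does not change its direction, so the cosine is unchanged.
print(cosine_similarity((1, 2, 0), (2, 3, 1)))     # ≈ 0.956
print(cosine_similarity((1, 2, 0), (20, 30, 10)))  # ≈ 0.956
```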
Vector Length and Normalization

We use the Pythagorean theorem (again) to find the length of a vector v.

[Figure: the vector v = (2, 1) in the x-y plane; its length is the hypotenuse c of a right triangle with legs a = 2 and b = 1.]

In general, let the length (also called the norm) of a p-dimensional vector v be defined as

||v|| = sqrt( v_1^2 + v_2^2 + ... + v_p^2 )

Based on this, we can normalize our vector v to unit length by dividing it by its norm:

v_unit = v / ||v||
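In code, the norm and the normalization step look like this (a minimal sketch under the definitions above):

```python
import math

def norm(v):
    """Length (Euclidean norm) of a p-dimensional vector."""
    return math.sqrt(sum(vi ** 2 for vi in v))

def normalize(v):
    """Scale v to unit length by dividing every component by the norm."""
    length = norm(v)
    return tuple(vi / length for vi in v)

v = (2, 1)
print(norm(v))             # sqrt(5) ≈ 2.236
print(norm(normalize(v)))  # ≈ 1.0 (unit length)
```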
Calculating Cosine(x, y)

Finally, then, we get this fairly simple equation:

cos(x, y) = (x · y) / (||x|| ||y||)

The denominator simply eliminates the effect of long documents skewing our measurement by normalizing both vectors.
Using the Cosine for IR
- From now on, unless otherwise specified, we can assume that all document vectors in our term-document matrix A have been normalized to unit length.
- Likewise, we always normalize our query to unit length.
- Given these assumptions, we have the classic vector space model for IR.
Comparing Distance Metrics

The query and three document vectors:
- q  = (1, 2, 0)
- d1 = (2, 3, 1)
- d2 = (20, 30, 10)
- d3 = (0, 1, 3)

Let's measure the distance between this query vector and each document vector using our three distance metrics. Let's start intuitively, though: what is going on in this term space? Which documents do you suspect will be most relevant to the query?
Comparing Distance Metrics: Euclidean

Euclidean distance from the query q = (1, 2, 0) to each document:
- dist(q, d1) = sqrt((1-2)^2 + (2-3)^2 + (0-1)^2)    = sqrt(3)    ≈ 1.73
- dist(q, d2) = sqrt((1-20)^2 + (2-30)^2 + (0-10)^2) = sqrt(1245) ≈ 35.28
- dist(q, d3) = sqrt((1-0)^2 + (2-1)^2 + (0-3)^2)    = sqrt(11)   ≈ 3.32

Ranking (closest first): d1, d3, d2.
Comparing Distance Metrics: Manhattan

Manhattan distance from the query q = (1, 2, 0) to each document:
- dist(q, d1) = |1-2| + |2-3| + |0-1|    = 3
- dist(q, d2) = |1-20| + |2-30| + |0-10| = 57
- dist(q, d3) = |1-0| + |2-1| + |0-3|    = 5

Ranking (closest first): d1, d3, d2, the same ordering the Euclidean metric gave us.
Comparing Distance Metrics: Cosine

We need to calculate these similarities in a few steps. First we find the vector norms. Then we normalize. Finally, we compute.

- ||q||  = sqrt(1 + 4 + 0)       = sqrt(5)    ≈ 2.24
- ||d1|| = sqrt(4 + 9 + 1)       = sqrt(14)   ≈ 3.74
- ||d2|| = sqrt(400 + 900 + 100) = sqrt(1400) ≈ 37.42
- ||d3|| = sqrt(0 + 1 + 9)       = sqrt(10)   ≈ 3.16

- cos(q, d1) = (1·2 + 2·3 + 0·1)   / (sqrt(5) · sqrt(14))   = 8 / 8.37   ≈ 0.96
- cos(q, d2) = (1·20 + 2·30 + 0·10) / (sqrt(5) · sqrt(1400)) = 80 / 83.67 ≈ 0.96
- cos(q, d3) = (1·0 + 2·1 + 0·3)   / (sqrt(5) · sqrt(10))   = 2 / 7.07   ≈ 0.28

Ranking (most similar first): d1 and d2 (tied), then d3. Note that d2 = 10 · d1, so the cosine, which ignores vector length, scores d1 and d2 identically.
Comparing Distance Metrics

Rank of each document under each metric (1 = best match to q):

Document              Euclidean   Manhattan   Cosine
d1 = (2, 3, 1)            1           1          1
d2 = (20, 30, 10)         3           3          2
d3 = (0, 1, 3)            2           2          3

The distance metrics punish d2 for its length; the cosine, which looks only at direction, treats d2 as just as good a match as d1.
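The arithmetic above is easy to script. Here is a small Python sketch (mine, not the lecture's) that reproduces the three sets of scores for q, d1, d2, and d3:

```python
import math

q = (1, 2, 0)
docs = {"d1": (2, 3, 1), "d2": (20, 30, 10), "d3": (0, 1, 3)}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Distances rank ascending (smaller = closer); cosine ranks descending (larger = more similar).
for name, d in docs.items():
    print(name, round(euclidean(q, d), 2), manhattan(q, d), round(cosine(q, d), 3))
# d1 1.73  3  0.956
# d2 35.28 57 0.956
# d3 3.32  5  0.283
```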
Using the Cosine for IR

For any query q and document d (both normalized to unit length), the retrieval score is sim(q, d) = cos(q, d) = q · d.

In other words, let the n-dimensional vector of query-document similarities be

s = A q

where A is our n x p term-document matrix (one normalized document vector per row), and q is a p-dimensional query vector.
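A minimal NumPy sketch of this matrix formulation, reusing the three example documents from the previous slides (the variable names are mine, not the lecture's):

```python
import numpy as np

# Term-document matrix A: one row per document, one column per term.
A = np.array([[ 2.0,  3.0,  1.0],
              [20.0, 30.0, 10.0],
              [ 0.0,  1.0,  3.0]])
q = np.array([1.0, 2.0, 0.0])

# Normalize each document vector (row) and the query to unit length;
# the vector of cosine scores is then just the matrix-vector product s = A q.
A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)
scores = A_unit @ q_unit
print(scores)               # ≈ [0.956, 0.956, 0.283]
print(np.argsort(-scores))  # row indices, most similar first (rows 0 and 1 effectively tie)
```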
Taking Stock

In the VSM we have abstractions layered on abstractions. The intuition, though, is clear enough: rank documents by their estimated similarity to the query. Estimate similarity by noting the words that documents share.

Layers of abstraction: relevance → similarity → proximity → information space → term space