The Vector Space Model of Information Retrieval

Transcript and Presenter's Notes

Title: The Vector Space Model of Information Retrieval


1
The Vector Space Model of Information Retrieval
  • INF 384H
  • Miles Efron

2
The Boolean Model of IR
[Figure: Venn diagram of two overlapping sets A and B]
  • The set A AND B

3
(No Transcript)
4
Getting Beyond Boole
  • The Boolean retrieval model imagines IR in
    set-theoretic terms.
  • Aside from its putative "unfriendliness," this
    approach doesn't provide a satisfying model, on at
    least two counts:
  • It suffers from information-overload (or null-set)
    problems.
  • It fails to provide a ranking of documents
    (estimated relevance is binary, not a matter of
    degree).

5
Similarity as Surrogate for Relevance
  • In information retrieval we are interested in the
    abstraction "relevance."
  • But relevance is a very slippery concept.
  • The starting point of the vector space model
    (VSM) is the idea that we can use similarity as a
    proxy for relevance;
  • i.e., we assume that documents that are similar to
    a query are likely to be relevant to that query,
    too.
  • Note: similarity is abstract, too. But perhaps
    we can operationalize it more easily.

6
Similarity as Surrogate for Relevance
  • The basic goal of the VSM, then, involves
    finding an adequate way to measure the
    similarity between things (such as queries and
    documents).
  • Toward this end: what does it mean for two things
    to be similar?
  • What does it mean if A and B are more similar
    than A and C?
  • How can we measure similarity in a way that is
    principled and meaningful?

7
Proximity as a surrogate for Similarity
  • To estimate inter-document similarity, the vector
    space model uses a geometric metaphor.
  • Documents and queries are represented as points
    (actually vectors) in an abstract information
    space.
  • We assume that things that are close to each
    other in this space are more similar than things
    that are far apart.

8
Proximity as a surrogate for Similarity
[Figure: documents as points in a two-dimensional space]
9
Proximity as a surrogate for Similarity
[Figure: the same space, with the query shown as a point among the documents]
10
Similarity and Representation
  • The similarities we observe depend on at least
    two considerations:
  • How do we define similarity?
  • How do we represent the objects whose
    similarities we wish to measure?
  • Our starting point, then: what features form the
    axes of our information space?

11
Information Space
  • We begin with the assumption that the people who
    read and write the language that our system uses
    (say, English) share some cognitive similarities.
  • That is, we assume that people recognize certain
    concepts, and that it is the relationships among
    those concepts that we wish to model.
  • Let's call this abstract conceptual world the
    "information space" shared by users of the system.

12
Information Space
Here we have documents and queries shown in the
space spanned by the concepts CAT and DOG. A
document's location is the degree (for now
undefined) to which it evinces each of these
concepts.
13
Motivation for the VSM
  • Given these definitions, we can understand the
    vector space model as an effort to rank documents
    by their similarity, in information space, to the
    query.
  • Of course, information space is imaginary, and we
    still don't know how to measure distance in it.

14
From Information Space to Term Space
  • In order to measure inter-document similarities,
    we need to define the salient features of our
    documents.
  • In a perfect world, each dimension of information
    space would form a feature, and each document
    would be scored by its participation in each
    concept.
  • Lacking such a scenario, what features can stand
    in for our abstract concepts?

15
From Information Space to Term Space
  • While we don't directly observe the concepts that
    are evinced in documents, we do observe the terms
    that occur in them.
  • Most IR begins with the assumption that these
    terms provide evidence about the concepts in
    documents.
  • Instead of representing documents in information
    space, then, we represent them in term space.

16
From Information Space to Term Space
  doc    dog    cat
   1     1.44   3.62
   2     3.31   3.89
   3     1.31   1.89
   4     1.76   1.97
   5     2.51   2.25
   6     3.40   1.84
   7     1.76   1.20
   8     2.28   1.56
   9     1.69   2.24
  10     3.57   1.98

17
From Information Space to Term Space
(Same table as the previous slide.)

Our problem has now become relatively simple:
rank documents by how close they are to the query
in this term space.
18
From Information Space to Term Space
  doc    dog    cat    fish
   1     1.44   3.62   2.55
   2     3.31   3.89   0.96
   3     1.31   1.89   1.49
   4     1.76   1.97   2.84
   5     2.51   2.25   2.37
   6     3.40   1.84   3.06
   7     1.76   1.20   2.46
   8     2.28   1.56   2.46
   9     1.69   2.24   2.59
  10     3.57   1.98   3.02

But what if there are more than two terms in our
conceptual universe? No problem. We simply need
to think of our term space as having more
dimensions.
19
From Information Space to Term Space
(Same table as the previous slide.)

20
From Information Space to Term Space
(Same table as the previous slide.)

Our problem has now become relatively simple:
rank documents by how close they are to the query
in this term space. In any realistic application,
our term space has extremely high dimensionality.
This is hard to visualize, but the notion of
distance generalizes to n-dimensional spaces.
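(As a concrete illustration beyond the slides: a minimal Python sketch of
placing documents in term space by counting term occurrences. The toy
documents and three-term vocabulary are invented for the example.)

    # Represent each document as a point in term space: one coordinate
    # per vocabulary term, holding that term's frequency in the document.
    docs = [
        "the dog chased the cat",
        "the cat sat with the dog and the fish",
        "fish swim",
    ]
    vocab = ["dog", "cat", "fish"]

    matrix = [[doc.split().count(term) for term in vocab] for doc in docs]
    for row in matrix:
        print(row)   # [1, 1, 0], [1, 1, 1], [0, 0, 1]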
21
Measuring Distance
  • So we're now halfway done with our model.
  • We have defined the features that span the space
    in which we will represent our documents.
  • Now we need a way to calculate the distance (and,
    inversely, the proximity) between documents in
    this space.
  • As in our term-weighting discussion, there is no
    one right way to do this. Instead, there is a
    palette of distance metrics from which we can
    choose.

22
Properties of a Distance Metric
  • Its values are nonnegative, with dist(a, b) = 0
    iff a = b.
  • It is symmetric: dist(a, b) = dist(b, a).
  • It satisfies the triangle inequality: dist(a, c)
    ≤ dist(a, b) + dist(b, c) for all points a, b,
    and c.

23
Euclidean Distance
  • This is the most familiar to us. It is simply
    the length of the straight line joining two points
    u and v.
  • To calculate the Euclidean distance, we rely on
    the Pythagorean theorem:

24
Euclidean Distance
[Figure: points u = (0, 0) and v = (2, 1) plotted in the xy-plane]
25
Euclidean Distance
[Figure: the segment from u = (0, 0) to v = (2, 1) drawn as the
hypotenuse c of a right triangle with legs a = 2 and b = 1]
26
Euclidean Distance
In p dimensions, let the Euclidean distance between
two points u and v be

  $\mathrm{dist}(u, v) = \sqrt{\sum_{i=1}^{p} (u_i - v_i)^2}$

This is just a p-dimensional generalization of our
use of the Pythagorean theorem on the previous
slide. Make sure you understand this identity.
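(A short Python sketch of this formula, checked against the
u = (0, 0), v = (2, 1) example from the earlier figure.)

    import math

    def euclidean(u, v):
        """Straight-line distance between two p-dimensional points."""
        return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

    print(euclidean((0, 0), (2, 1)))   # sqrt(2^2 + 1^2) = sqrt(5) ≈ 2.236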
27
Manhattan Distance
  • The Manhattan distance (a.k.a. city-block
    distance) is the number of units on a rectangular
    grid it takes to travel from point u to point v.

28
Manhattan Distance
  • The Manhattan distance (a.k.a. city-block
    distance) is the number of units on a rectangular
    grid it takes to travel from point u to point v.

[Figure: the grid path from u = (0, 0) to v = (2, 1):
2 units horizontally + 1 unit vertically = 3 units]
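(In p dimensions the Manhattan distance is the sum of per-coordinate
differences, $\sum_{i=1}^{p} |u_i - v_i|$; a short Python sketch,
checked against the figure above.)

    def manhattan(u, v):
        """Grid distance: total units traveled along the coordinate axes."""
        return sum(abs(ui - vi) for ui, vi in zip(u, v))

    print(manhattan((0, 0), (2, 1)))   # 2 horizontal + 1 vertical = 3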
29
Cosine Similarity
  • The traditional vector space model is based on a
    different notion of similarity: the cosine of the
    angle between two vectors.
  • To start our consideration of cosine similarity,
    think of our document vectors not as points in
    term space, but as arrows that travel from the
    origin of the space to a particular address.

30
Cosine Similarity
[Figure: document vectors drawn as arrows from the origin of term space]
31
Cosine Similarity
[Figure: the angle between two document vectors]
32
Cosine Similarity
Estimate inter-document similarity by comparing
the angles between document vectors. More
specifically, we calculate the cosine of the
angle between each pair of vectors.
33
Definition of Cosine
[Figure: right triangle with hypotenuse H, side O opposite the angle θ,
and side A adjacent to it]
34
Definition of Cosine
[Figure: the same triangle; the cosine is defined as cos(θ) = A / H]
35
Definition of Cosine
[Figure: the triangle redrawn with a small angle θ]
Is cos(theta) large or small here?
36
Definition of Cosine
As H approaches A, the value of A/H approaches 1.
[Figure: the triangle with a small angle θ, so A is nearly as long as H]
Is cos(theta) large or small here?
37
Definition of Cosine
[Figure: the triangle redrawn with a large angle θ]
Is cos(theta) large or small here?
38
Definition of Cosine
As H approaches ∞, the value of A/H approaches 0.
[Figure: the triangle with a large angle θ, so A is small relative to H]
Is cos(theta) large or small here?
39
Cosine Similarity
So how do we compute the cosine of the angle
between two vectors?
[Figure: two vectors x and y with the angle θ between them]
40
Calculating Cosine(x, y)
  $\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
              = \frac{\sum_{i=1}^{p} x_i y_i}
                     {\sqrt{\sum_{i=1}^{p} x_i^2} \, \sqrt{\sum_{i=1}^{p} y_i^2}}$

This is not as bad as it looks. I promise.
41
Calculating Cosine(x, y)
The numerator is the dot product:

  $x \cdot y = \sum_{i=1}^{p} x_i y_i$

This is not as bad as it looks. I promise.
42
Calculating Cosine(x, y)
The numerator of our function is just the dot
product of the two vectors (i.e., roughly, how many
terms they have in common).
The denominator simply eliminates the effect of
long documents skewing our measurement, by
normalizing both vectors.
43
Vector Length and Normalization
We use the Pythagorean theorem (again) to find
the length of vector v.
[Figure: vector v = (2, 1) drawn from the origin]
44
Vector Length and Normalization
We use the Pythagorean theorem (again) to find
the length of vector v.
[Figure: v = (2, 1) as the hypotenuse c of a right triangle
with legs a = 2 and b = 1]
45
Vector Length and Normalization
In general, let the length (also called the norm)
of a p-dimensional vector v be defined as

  $\lVert v \rVert = \sqrt{\sum_{i=1}^{p} v_i^2}$
46
Vector Length and Normalization
Based on this, we can normalize our vector v to
unit length by

  $\hat{v} = \frac{v}{\lVert v \rVert}$
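(A small Python sketch of the norm and the normalization step.)

    import math

    def norm(v):
        """Vector length: the Pythagorean theorem in p dimensions."""
        return math.sqrt(sum(vi ** 2 for vi in v))

    def normalize(v):
        """Rescale v to unit length by dividing each component by the norm."""
        n = norm(v)
        return [vi / n for vi in v]

    print(norm((2, 1)))        # sqrt(5) ≈ 2.236
    print(normalize((2, 1)))   # [0.894..., 0.447...], a unit-length vector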
47
Calculating Cosine(x, y)
Finally, then, we get this fairly simple equation:

  $\cos(x, y) = \hat{x} \cdot \hat{y}$

The denominator simply eliminates the effect of
long documents skewing our measurement, by
normalizing both vectors.
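(A Python sketch of the full cosine computation; with pre-normalized
vectors the function reduces to the bare dot product in the numerator.)

    import math

    def cosine(x, y):
        """Cosine of the angle between vectors x and y."""
        dot = sum(xi * yi for xi, yi in zip(x, y))
        return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                      math.sqrt(sum(yi * yi for yi in y)))

    print(cosine((1, 2, 0), (2, 3, 1)))   # ≈ 0.956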
48
Using the Cosine for IR
  • From now on, unless otherwise specified, we can
    assume that all document vectors in our
    term-document matrix A have been normalized to
    unit length.
  • Likewise, we always normalize our query to unit
    length.
  • Given these assumptions, we have the classic
    vector space model for IR: rank documents by the
    dot product between the normalized document
    vectors and the normalized query, $\hat{d} \cdot \hat{q}$.

49
Comparing Distance Metrics
Four vectors (one query, three documents)

Let's measure the distance between this query
vector and each document vector using our three
distance metrics. Let's start intuitively,
though: what is going on in this term space?
Which documents do you suspect will be most
relevant to the query?
  • q  = (1, 2, 0)
  • d1 = (2, 3, 1)
  • d2 = (20, 30, 10)
  • d3 = (0, 1, 3)

50
Comparing Distance Metrics: Euclidean
Four vectors (one query, three documents)
  • q  = (1, 2, 0)
  • d1 = (2, 3, 1)
  • d2 = (20, 30, 10)
  • d3 = (0, 1, 3)

  $\mathrm{dist}(q, d_1) = \sqrt{1 + 1 + 1} = \sqrt{3} \approx 1.73$
  $\mathrm{dist}(q, d_2) = \sqrt{361 + 784 + 100} = \sqrt{1245} \approx 35.28$
  $\mathrm{dist}(q, d_3) = \sqrt{1 + 1 + 9} = \sqrt{11} \approx 3.32$
53
Comparing Distance Metrics: Euclidean
Resulting ranking, nearest first:
  1. d1 = (2, 3, 1)
  2. d3 = (0, 1, 3)
  3. d2 = (20, 30, 10)
54
Comparing Distance Metrics: Manhattan
Four vectors (one query, three documents)
  • q  = (1, 2, 0)
  • d1 = (2, 3, 1)
  • d2 = (20, 30, 10)
  • d3 = (0, 1, 3)

  $\mathrm{dist}(q, d_1) = 1 + 1 + 1 = 3$
  $\mathrm{dist}(q, d_2) = 19 + 28 + 10 = 57$
  $\mathrm{dist}(q, d_3) = 1 + 1 + 3 = 5$
56
Comparing Distance Metrics: Manhattan
Resulting ranking, nearest first:
  1. d1 = (2, 3, 1)
  2. d3 = (0, 1, 3)
  3. d2 = (20, 30, 10)
57
Comparing Distance Metrics: Cosine
Four vectors (one query, three documents)

We need to calculate these similarities in a few
steps. First we need to find the vector norms.
Then we normalize. Finally, we compute.
  • q  = (1, 2, 0)
  • d1 = (2, 3, 1)
  • d2 = (20, 30, 10)
  • d3 = (0, 1, 3)

58
Comparing Distance Metrics: Cosine
The norms:
  $\lVert q \rVert = \sqrt{5} \approx 2.24$
  $\lVert d_1 \rVert = \sqrt{14} \approx 3.74$
  $\lVert d_2 \rVert = \sqrt{1400} \approx 37.42$
  $\lVert d_3 \rVert = \sqrt{10} \approx 3.16$

59
Comparing Distance Metrics: Cosine
The cosine similarities:
  $\cos(q, d_1) = 8 / (\sqrt{5}\,\sqrt{14}) \approx 0.956$
  $\cos(q, d_2) = 80 / (\sqrt{5}\,\sqrt{1400}) \approx 0.956$
  $\cos(q, d_3) = 2 / (\sqrt{5}\,\sqrt{10}) \approx 0.283$
65
Comparing Distance Metrics: Cosine
Resulting ranking, most similar first:
  1. d1 = (2, 3, 1)     cos ≈ 0.956
  2. d2 = (20, 30, 10)  cos ≈ 0.956
  3. d3 = (0, 1, 3)     cos ≈ 0.283
Note that d2 = 10 · d1 points in exactly the same
direction as d1, so the two receive the same cosine.
66
Comparing Distance Metrics
Four vectors (one query, three documents)
  • q  = (1, 2, 0)
  • d1 = (2, 3, 1)
  • d2 = (20, 30, 10)
  • d3 = (0, 1, 3)

  Ranking     Euclidean   Manhattan   Cosine
  1 (best)    d1          d1          d1
  2           d3          d3          d2
  3           d2          d2          d3
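(A self-contained Python sketch reproducing the whole comparison; the
printed values match the rankings in the table above.)

    import math

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def manhattan(u, v):
        return sum(abs(a - b) for a, b in zip(u, v))

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (math.sqrt(sum(a * a for a in x)) *
                      math.sqrt(sum(b * b for b in y)))

    q = (1, 2, 0)
    docs = {"d1": (2, 3, 1), "d2": (20, 30, 10), "d3": (0, 1, 3)}
    for name, d in docs.items():
        print(name, round(euclidean(q, d), 2),
              manhattan(q, d), round(cosine(q, d), 3))
    # d1  1.73  3  0.956   closest by all three measures
    # d2 35.28 57  0.956   far away, but same direction as d1
    # d3  3.32  5  0.283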
67
Using the Cosine for IR
For any query q and document d (both normalized to
unit length), the retrieval score is the dot product
$\hat{q} \cdot \hat{d}$.
In other words, let the n-dimensional vector of
query-document similarities be

  $s = A q$

where A is our n × p term-document matrix (one
unit-length document vector per row) and q is a
p-dimensional unit-length query vector.
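(A minimal NumPy sketch of this matrix formulation, reusing the example
vectors from the preceding slides as the rows of A.)

    import numpy as np

    # Rows of A are the document vectors d1, d2, d3; q is the query.
    A = np.array([[2, 3, 1], [20, 30, 10], [0, 1, 3]], dtype=float)
    q = np.array([1, 2, 0], dtype=float)

    # Normalize each document vector (each row of A) and q to unit length.
    A_hat = A / np.linalg.norm(A, axis=1, keepdims=True)
    q_hat = q / np.linalg.norm(q)

    # With unit vectors, the vector of cosine similarities is just A q.
    scores = A_hat @ q_hat
    print(scores)   # approx [0.956 0.956 0.283]: d1 and d2 tie, d3 last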
68
Taking Stock
[Figure: a chain of abstractions: relevance → similarity →
proximity → information space → term space]

In the VSM we have abstractions layered on
abstractions. The intuition, though, is clear
enough: rank documents by their estimated
similarity to the query, and estimate similarity
by noting the words that documents share.