Title: Beyond Term Independence: Introduction to Some Advanced IR Models
1. Beyond Term Independence: Introduction to Some Advanced IR Models
- Miles Efron
- INF 384H
- UT-ISchool
2. Term Independence
- Up to this point, all of the models we have explored make the convenient assumption that the indexing terms of our corpus are statistically independent.
- This allows us to simplify the math we have to do.
- It's also not clear that this is a bad assumption, even if it clearly is unrealistic.
3. Term Independence: The Probabilistic Model
- Under the BIR model, we assume that the likelihood that a document was generated by the distribution responsible for relevant or non-relevant documents is expressible as a simple product. Thus we have the similarity model:
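In symbols (a standard statement of the BIR assumption and of the ranking function it yields; the slide's own equation is not reproduced in the text):

$$P(D \mid R) \;=\; \prod_{i=1}^{p} P(d_i \mid R), \qquad P(D \mid \bar R) \;=\; \prod_{i=1}^{p} P(d_i \mid \bar R)$$

$$\mathrm{sim}(Q, D) \;\propto\; \log \frac{P(D \mid R)}{P(D \mid \bar R)} \;=\; \sum_{i=1}^{p} \log \frac{P(d_i \mid R)}{P(d_i \mid \bar R)}$$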
4. Term Independence: The Probabilistic Model
- This is clearly untrue: knowing that a document contains the word "information" does tell us something about the likelihood that it also contains the word "retrieval."
5. Term Independence: The Probabilistic Model
- Thus our similarity function is unlikely to be based on accurate class-conditional probabilities. But this doesn't necessarily make it wrong.
6. Term Independence: The Vector Space Model
- Our assumption of term independence under the vector space model is less obvious, though no less stark.
- The assumption of independence lies in the fact that we assume the dimensions of term space are orthogonal.
- Geometrically, this means we assume they meet at right angles.
- Semantically, it means they convey separate information.
7. Term Independence: The Vector Space Model
- Recall the VSM similarity function (reconstructed below).
- What does this tell us to do?
  - For each of our p dimensions, multiply the pth value seen in the query and document.
  - Add all of these products together.
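In symbols, the inner product the bullets describe (the slide's own formula is not reproduced in the text) is:

$$\mathrm{sim}(\vec q, \vec d\,) \;=\; \sum_{i=1}^{p} q_i\, d_i$$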
8. Term Independence: The Vector Space Model
- If the query and document share no terms, they have a similarity of 0. Likewise, if they share many terms, that overlap is expressed only additively.
9. Term Independence: The Vector Space Model
- Recall the VSM similarity function.
- Consider the function at the collection level.
- We can re-write this:
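Stacking the n document vectors as the columns of a term-document matrix, the rewritten collection-level form is presumably

$$\vec{s} \;=\; \mathbf{D}^{\top}\vec{q} \;=\; \mathbf{D}^{\top}\mathbf{I}\,\vec{q},$$

where $\vec{s}$ collects the similarity scores for every document. The implicit identity matrix $\mathbf{I}$ is exactly the term-independence assumption; the GVSM and LSI slides below replace it with a term correlation matrix.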
10. Our Assumptions Haunt Us
- To make retrieval simpler, we have made numerous assumptions:
  - Term independence
  - Bag-of-words models
  - Internal evidence as the only source of document feature information
- What price do we pay for this simplicity?
11. Lexical Ambiguity
- Human language is rife with redundancy and lacunae. Both its repetitions and omissions can frustrate IR.
- Synonymy: two words can describe the same concept.
- Polysemy: words that look the same can have different meanings in different contexts.
12. Lexical Ambiguity: Polysemy
- Is the West Bank the same as the Bank of the West?
- "Stocks rose last week" has a different meaning than does the florist's stocks of roses.
- The seal of the state of Texas is quite different from the Dean's seal of approval, and both are different from a seal that eats fish.
13. Lexical Ambiguity: Synonymy
- The character strings "car" and "automobile" look different, though the concepts they reference are awfully similar.
- To make things even murkier, a polysemous word like "game" can have synonyms associated with its various senses. As a vegetarian, I eat neither game nor meat. But that doesn't mean I won't play a game or participate in some sort of contest or challenge.
14. The Problem of Underdetermination
- Many of the problems that plague the linguistic aspects of IR are subsumed under the rubric of underdetermination.
- Queries are underdetermined in the sense that users supply only a vague hint of the kinds of information they are interested in.
- Documents are underdetermined in that authors don't write so as to make their documents easily indexable by our algorithms.
15. Addressing Underdetermination
- Global approaches
  - Altering document representations
  - Applying thesauri to expand documents
  - Latent Semantic Indexing
- Local approaches
  - Query expansion
  - Relevance feedback
16. Elaborating the Vector Space Model: Latent Semantic Indexing
- The terms that occur in a document give us some evidence of that document's "aboutness." But that evidence is neither complete nor reliable. Instead of modeling similarity as a function in term space, we would really like to measure similarity in some kind of concept space. This is what LSI tries to do: map from term space to concept space.
17. Problems with Keyword-based IR
- Synonymy: many terms may refer to one concept (car, automobile, vehicle).
- Polysemy: one term may refer to many concepts (Bank of the West, the West Bank).
19. Problems with Keyword-based IR
- LSI constructs a statistical model of the conceptual relationships among terms and documents by means of dimensionality reduction.
20. Motivations for Dimensionality Reduction
Salton's Vector Space Model (VSM) assumes term independence:

$$\vec{s} \;=\; \mathbf{D}^{\top}\vec{q}$$

where $\vec{s}$ is the vector of similarity scores, $\vec{q}$ the query vector, and $\mathbf{D}$ the term-document matrix.
21. Motivations for Dimensionality Reduction
Generalized Vector Space Model (GVSM):

$$\vec{s} \;=\; \mathbf{D}^{\top}\mathbf{R}\,\vec{q}$$

where $\vec{s}$ is the vector of similarity scores, $\vec{q}$ the query vector, $\mathbf{D}$ the term-document matrix, and $\mathbf{R}$ the term correlation matrix.
22. Motivations for Dimensionality Reduction
Generalized Vector Space Model (GVSM), as above: use the observed term correlations $\mathbf{R}$ to inform query-document matching. Is this a good idea?
23. Motivations for Dimensionality Reduction
Latent Semantic Indexing (LSI):

$$\vec{s} \;=\; \mathbf{D}^{\top}\mathbf{R}_k\,\vec{q}$$

where $\vec{s}$ is the vector of similarity scores, $\vec{q}$ the query vector, $\mathbf{D}$ the term-document matrix, and $\mathbf{R}_k$ a reduced-rank model of the term covariance matrix.
24. Motivations for Dimensionality Reduction
LSI builds a statistical model of the population covariance matrix, based on the observed sample.
25. Motivations for Dimensionality Reduction
We fit the model by means of the eigenvalue/eigenvector decomposition of the observed matrix $\hat{\mathbf{R}}$:
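In symbols (a standard statement of this fit; the slide's own equations are not reproduced in the text): since $\hat{\mathbf{R}}$ is symmetric, it factors as

$$\hat{\mathbf{R}} \;=\; \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{\top} \;=\; \sum_{i=1}^{p} \lambda_i\, \vec{v}_i \vec{v}_i^{\top},$$

and the rank-k model retains only the k largest eigenvalues and their eigenvectors:

$$\mathbf{R}_k \;=\; \sum_{i=1}^{k} \lambda_i\, \vec{v}_i \vec{v}_i^{\top}.$$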
26. Motivation for Dimensionality Reduction
Averaging reduces noise: if $\mathrm{Var}(X) = s^2$, then $\mathrm{sd}(\bar{x}) = s/\sqrt{N}$.
27. Technical Memo Example
Nine (9) documents (i.e. article titles), five about human-computer interaction and four about graph theory. For purposes of indexing, retain only those terms that occur in at least two documents, for a total of 12 indexing terms.
28. Technical Memo Example
A term-document matrix for the data
29. Technical Memo Example
e.g. "Human machine interface for Lab ABC computer applications"
33. Technical Memo Example
Query: "human-computer interaction"

query = (1 0 1 0 0 0 0 0 0 0 0 0)
34. Technical Memo Example
- Documents retrieved for our query using the keyword-based similarity model.
- Documents not retrieved, i.e. documents that don't share any terms with the query. Problematic!
35. Technical Memo Example

terms       c1.txt  c2.txt  c3.txt  c4.txt  c5.txt  m1.txt  m2.txt  m3.txt  m4.txt
computer       1       1       0       0       0       0       0       0       0
human          1       0       0       1       0       0       0       0       0
interface      1       0       1       0       0       0       0       0       0
response       0       1       0       0       1       0       0       0       0
survey         0       1       0       0       0       0       0       0       1
system         0       1       1       2       0       0       0       0       0
time           0       1       0       0       1       0       0       0       0
user           0       1       1       0       1       0       0       0       0
eps            0       0       1       1       0       0       0       0       0
trees          0       0       0       0       0       1       1       1       0
graph          0       0       0       0       0       0       1       1       1
minors         0       0       0       0       0       0       0       1       1
36. Technical Memo Example

The observed term correlation matrix:

            computer  human  interface  response  survey  system  time  user   eps  trees  graph  minors
computer        1.0    0.4      0.4       0.4      0.4     0.0    0.4   0.2  -0.3   -0.4   -0.4   -0.3
human           0.4    1.0      0.4      -0.3     -0.3     0.4   -0.3  -0.4   0.4   -0.4   -0.4   -0.3
interface       0.4    0.4      1.0      -0.3     -0.3     0.0   -0.3   0.2   0.4   -0.4   -0.4   -0.3
response        0.4   -0.3     -0.3       1.0      0.4     0.0    1.0   0.8  -0.3   -0.4   -0.4   -0.3
survey          0.4   -0.3     -0.3       0.4      1.0     0.0    0.4   0.2  -0.3   -0.4    0.2    0.4
system          0.0    0.4      0.0       0.0      0.0     1.0    0.0   0.2   0.8   -0.5   -0.5   -0.3
time            0.4   -0.3     -0.3       1.0      0.4     0.0    1.0   0.8  -0.3   -0.4   -0.4   -0.3
user            0.2   -0.4      0.2       0.8      0.2     0.2    0.8   1.0   0.2   -0.5   -0.5   -0.4
eps            -0.3    0.4      0.4      -0.3     -0.3     0.8   -0.3   0.2   1.0   -0.4   -0.4   -0.3
trees          -0.4   -0.4     -0.4      -0.4     -0.4    -0.5   -0.4  -0.5  -0.4    1.0    0.5    0.2
graph          -0.4   -0.4     -0.4      -0.4      0.2    -0.5   -0.4  -0.5  -0.4    0.5    1.0    0.8
minors         -0.3   -0.3     -0.3      -0.3      0.4    -0.3   -0.3  -0.4  -0.3    0.2    0.8    1.0

These inferred relationships seem pretty good, no?
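These correlations can presumably be reproduced as ordinary Pearson correlations between the rows of the term-document matrix above; a minimal sketch (variable names are mine, not the slide's):

```python
import numpy as np

# Term-document matrix from the technical memo example (rows = terms, columns = c1..c5, m1..m4).
terms = ["computer", "human", "interface", "response", "survey", "system",
         "time", "user", "eps", "trees", "graph", "minors"]
docs = ["c1.txt", "c2.txt", "c3.txt", "c4.txt", "c5.txt",
        "m1.txt", "m2.txt", "m3.txt", "m4.txt"]
X = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # eps
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Pearson correlation between every pair of term rows.
R_hat = np.corrcoef(X)
print(np.round(R_hat, 1))  # e.g. computer-human works out to about 0.36, shown as 0.4 above
```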
37. Technical Memo Example

VSM vs. GVSM. The GVSM scores for the query:

c1.txt   3.4
c2.txt   1.9
c3.txt   1.1
c4.txt   2.4
c5.txt   0.0
m1.txt  -0.8
m2.txt  -1.5
m3.txt  -2.1
m4.txt  -1.3
38. Technical Memo Example
Documents ranked against the query in LSI's 2-space.
39. Motivations for Dimensionality Reduction
Documents in a correlated 2D feature space
40. Motivations for Dimensionality Reduction
Correlation(training, obedience) = 0.96. So if we know that a document has a high score on "obedience", do we really know nothing about its score on "training", as the VSM imagines? Hence the error, and our correction.
41. Motivations for Dimensionality Reduction
We find the two dimensions that will make our data have zero correlation and capture the most variance.
42. Motivations for Dimensionality Reduction
Rotate the data onto their first two principal components.
43. Motivations for Dimensionality Reduction
Since our first dimension gets 96% of the variance, the 2nd one captures only 4%. Maybe there's really just one concept here, and the 2nd dimension is just sampling error.
44. Motivations for Dimensionality Reduction
Discard Dimension 2 to derive the best single-rank representation.
45. Motivations for Dimensionality Reduction
Crucial point: under LSI, we assume that the amount of variance captured by a dimension is evidence of its significance. Luckily, we have easy access to this information: the eigenvalues.
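A minimal sketch of the rotate-then-discard procedure on data like the training/obedience example; the data here are synthetic stand-ins I generated, not the slide's actual points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the slide's data: two highly correlated features,
# "training" and "obedience" scores for 100 documents.
training = rng.normal(size=100)
obedience = training + 0.25 * rng.normal(size=100)
X = np.column_stack([training, obedience])
X = X - X.mean(axis=0)                        # center the data

# Principal components are the eigenvectors of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotate the data onto its principal components: the rotated dimensions are uncorrelated.
Z = X @ eigvecs
print("share of variance per dimension:", eigvals / eigvals.sum())

# Discard dimension 2 to keep the best rank-1 representation.
Z1 = Z[:, :1]
```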
46. The Idea of Document Projections
- LSI is an example of the more general method of projecting documents from our observed term space to some new, fabricated space.
47. Technical Memo Example
- Assume that despite observing 12 variables (terms), the underlying distribution that generated our data actually has only k (k < 12) dimensions.
- Observing 12 variables thus involves measurement error. We want to uncover a more accurate estimate (than the observed $\hat{\mathbf{R}}$) of the underlying covariance matrix.
48. Technical Memo Example
- Choose some value of k < 12 (for now, let's pick k = 3; why?).
- Project $\hat{\mathbf{R}}$ onto its first three eigenvectors to derive $\mathbf{R}_k$.
- Plug $\mathbf{R}_k$ into our GVSM model (see the sketch below).
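A minimal sketch of these three steps, reusing X, terms, docs, and R_hat from the earlier snippet; variable names are mine, and the exact scores on the following slides may differ depending on weighting and normalization details not shown here:

```python
import numpy as np

k = 3

# Eigen-decomposition of the (symmetric) observed correlation matrix R_hat.
eigvals, eigvecs = np.linalg.eigh(R_hat)       # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:k]            # indices of the k largest eigenvalues
V_k = eigvecs[:, top]
L_k = np.diag(eigvals[top])

# Rank-k model of the term covariance/correlation structure.
R_k = V_k @ L_k @ V_k.T

# Query "human-computer interaction": the index terms 'human' and 'computer'.
q = np.zeros(len(terms))
q[terms.index("human")] = 1.0
q[terms.index("computer")] = 1.0

# GVSM scoring with the reduced-rank matrix:  s = D^T R_k q
scores = X.T @ R_k @ q
for doc, s in zip(docs, scores):
    print(f"{doc}  {s: .1f}")
```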
49. Technical Memo Example
But also note how little we've learned. The only obvious win I see here is that the error on "user" is reduced by half.
The rank-3 model $\mathbf{R}_k$ of the term correlations:

            computer  human  interface  response  survey  system  time  user   eps  trees  graph  minors
computer        0.9    0.4      0.4       0.4      0.3    -0.1    0.4   0.2  -0.3   -0.5   -0.4   -0.2
human           0.4    0.8      0.6      -0.4     -0.3     0.3   -0.4  -0.2   0.4   -0.3   -0.4   -0.3
interface       0.4    0.6      0.6      -0.2     -0.2     0.3   -0.2  -0.1   0.3   -0.4   -0.4   -0.3
response        0.4   -0.4     -0.2       0.9      0.5     0.1    0.9   0.8  -0.2   -0.4   -0.4   -0.3
survey          0.3   -0.3     -0.2       0.5      0.4    -0.3    0.5   0.2  -0.5   -0.1    0.0    0.1
system         -0.1    0.3      0.3       0.1     -0.3     0.6    0.1   0.4   0.7   -0.4   -0.6   -0.5
time            0.4   -0.4     -0.2       0.9      0.5     0.1    0.9   0.8  -0.2   -0.4   -0.4   -0.3
user            0.2   -0.2     -0.1       0.8      0.2     0.4    0.8   0.8   0.2   -0.5   -0.5   -0.4
eps            -0.3    0.4      0.3      -0.2     -0.5     0.7   -0.2   0.2   0.9   -0.2   -0.5   -0.5
trees          -0.5   -0.3     -0.4      -0.4     -0.1    -0.4   -0.4  -0.5  -0.2    0.6    0.6    0.5
graph          -0.4   -0.4     -0.4      -0.4      0.0    -0.6   -0.4  -0.5  -0.5    0.6    0.8    0.6
minors         -0.2   -0.3     -0.3      -0.3      0.1    -0.5   -0.3  -0.4  -0.5    0.5    0.6    0.5
Notice how little information we've lost by throwing out over half the data. What does this tell us about language? About our ability to model it with this approach?
50. Technical Memo Example

VSM vs. GVSM vs. LSI(3):

          GVSM   LSI(3)
c1.txt     3.4     1.5
c2.txt     1.9     1.4
c3.txt     1.1     2.3
c4.txt     2.4     2.7
c5.txt     0.0     0.6
m1.txt    -0.8    -0.6
m2.txt    -1.5    -1.4
m3.txt    -2.1    -2.1
m4.txt    -1.3    -1.7
51. The Idea of Document Projections
- LSI is an example of the more general method of projecting documents from our observed term space to some new, fabricated space.
- The LSI dimensions capture the most variance among the documents (they are the least-squares fit, à la linear regression).
- Other projections optimize other criteria:
  - Independent component analysis
  - Probabilistic LSI
52. The Idea of Document Projections
Let's see how well these methods work.
53. My Own Research
- People have known for a long time that LSI seems to work. But exactly why throwing away large portions of your data should improve retrieval was poorly understood.
  - "It's a noise reduction procedure!" So what's the noise and what's the signal?
  - "It's like a linear regression!" So how is the model optimized?
- The weak link in the theory was the matter of building the LSI model, i.e. model selection.
- If we're going to reduce dimensionality, how aggressively should we do so? What is our rationale for choosing a particular value of k, the LSI model's dimensionality?
54. Research Question
Can a statistical analysis of the eigenvalues
associated with a corpus yield a robust estimate
of the optimal representational dimensionality
for IR?
55. Research Question
- Can a statistical analysis of the eigenvalues associated with a corpus yield a robust estimate of the optimal representational dimensionality for IR? If so:
  - What method of analysis yields the best estimate?
  - What are the theoretical implications of dimensionality estimation by such a method?
56. "Libraries": 10 Nearest Neighbors in k-space

k = 2:     levels, storage, papers, accomplished, technology, techniques, proved, found, thesauri, role
k = 10:    book, automation, mechanization, center, feasibility, journals, traditional, library, general, data
k = kmax:  combining, medical, main, expansion, book, increase, card, form, speed, publication