Title: Beyond Term Independence: Introduction to Some Advanced IR Models
1. Beyond Term Independence: Introduction to Some Advanced IR Models
- Miles Efron
- INF 384H
- UT-ISchool
2. Term Independence
- Up to this point, all of the models we have explored make the convenient assumption that the indexing terms of our corpus are statistically independent.
- This allows us to simplify the math we have to do.
- It's also not clear that this is a bad assumption, even if it clearly is unrealistic.
3. Term Independence: The Probabilistic Model
- Under the BIR model, we assume that the likelihood that a document was generated by the distribution responsible for relevant or non-relevant documents is expressible as a simple product. Thus we have the similarity model:
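In symbols (a standard statement of the BIR assumption and of the ranking function it yields; the slide's own equation is not reproduced in the text):

$$P(D \mid R) \;=\; \prod_{i=1}^{p} P(d_i \mid R), \qquad P(D \mid \bar R) \;=\; \prod_{i=1}^{p} P(d_i \mid \bar R)$$

$$\mathrm{sim}(Q, D) \;\propto\; \log \frac{P(D \mid R)}{P(D \mid \bar R)} \;=\; \sum_{i=1}^{p} \log \frac{P(d_i \mid R)}{P(d_i \mid \bar R)}$$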
4. Term Independence: The Probabilistic Model
- This is clearly untrue: knowing that a document contains the word "information" does tell us something about the likelihood that it also contains the word "retrieval."
5. Term Independence: The Probabilistic Model
- Thus our similarity function is unlikely to be based on accurate class-conditional probabilities. But this doesn't necessarily make it wrong.
6. Term Independence: The Vector Space Model
- Our assumption of term independence under the vector space model is less obvious, though no less stark.
- The assumption of independence lies in the fact that we assume the dimensions of term space are orthogonal.
- Geometrically, this means we assume they meet at right angles.
- Semantically, it means they convey separate information.
7. Term Independence: The Vector Space Model
- Recall the VSM similarity function (reconstructed below).
- What does this tell us to do?
  - For each of our p dimensions, multiply the pth value seen in the query and document.
  - Add all of these products together.
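In symbols, the inner product the bullets describe (the slide's own formula is not reproduced in the text) is:

$$\mathrm{sim}(\vec q, \vec d\,) \;=\; \sum_{i=1}^{p} q_i\, d_i$$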
8. Term Independence: The Vector Space Model
- If the query and document share no terms, they have a similarity of 0. Likewise, if they share many terms, that overlap is expressed only additively.
9. Term Independence: The Vector Space Model
- Recall the VSM similarity function.
- Consider the function at the collection level.
- We can re-write this:
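Stacking the n document vectors as the columns of a term-document matrix, the rewritten collection-level form is presumably

$$\vec{s} \;=\; \mathbf{D}^{\top}\vec{q} \;=\; \mathbf{D}^{\top}\mathbf{I}\,\vec{q},$$

where $\vec{s}$ collects the similarity scores for every document. The implicit identity matrix $\mathbf{I}$ is exactly the term-independence assumption; the GVSM and LSI slides below replace it with a term correlation matrix.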
10. Our Assumptions Haunt Us
- To make retrieval simpler, we have made numerous assumptions:
  - Term independence
  - Bag-of-words models
  - Internal evidence as the only source of document feature information
- What price do we pay for this simplicity?
11. Lexical Ambiguity
- Human language is rife with redundancy and lacunae. Both its repetitions and omissions can frustrate IR.
- Synonymy: two words can describe the same concept.
- Polysemy: words that look the same can have different meanings in different contexts.
12. Lexical Ambiguity: Polysemy
- Is the West Bank the same as the Bank of the West?
- "Stocks rose last week" has a different meaning than does the florist's stocks of roses.
- The seal of the state of Texas is quite different from the Dean's seal of approval, and both are different from a seal that eats fish.
13. Lexical Ambiguity: Synonymy
- The character strings "car" and "automobile" look different, though the concepts they reference are awfully similar.
- To make things even murkier, a polysemous word like "game" can have synonyms associated with its various senses. As a vegetarian, I eat neither game nor meat. But that doesn't mean I won't play a game or participate in some sort of contest or challenge.
14. The Problem of Underdetermination
- Many of the problems that plague the linguistic aspects of IR are subsumed under the rubric of underdetermination.
- Queries are underdetermined in the sense that users supply only a vague hint of the kinds of information they are interested in.
- Documents are underdetermined in that authors don't write so as to make their documents easily indexable by our algorithms.
15. Addressing Underdetermination
- Global approaches
  - Altering document representations
  - Applying thesauri to expand documents
  - Latent Semantic Indexing
- Local approaches
  - Query expansion
  - Relevance feedback
16. Elaborating the Vector Space Model: Latent Semantic Indexing
- The terms that occur in a document give us some evidence of that document's "aboutness." But that evidence is neither complete nor reliable. Instead of modeling similarity as a function in term space, we would really like to measure similarity in some kind of concept space. This is what LSI tries to do: map from term space to concept space.
17. Problems with Keyword-based IR
- Synonymy: many terms may refer to one concept (car, automobile, vehicle).
- Polysemy: one term may refer to many concepts (Bank of the West, the West Bank).
19. Problems with Keyword-based IR
- LSI constructs a statistical model of the conceptual relationships among terms and documents by means of dimensionality reduction.
20. Motivations for Dimensionality Reduction
Salton's Vector Space Model (VSM) assumes term independence:

$$\vec{s} \;=\; \mathbf{D}^{\top}\vec{q}$$

where $\vec{s}$ is the vector of similarity scores, $\vec{q}$ the query vector, and $\mathbf{D}$ the term-document matrix.
21. Motivations for Dimensionality Reduction
Generalized Vector Space Model (GVSM):

$$\vec{s} \;=\; \mathbf{D}^{\top}\mathbf{R}\,\vec{q}$$

where $\vec{s}$ is the vector of similarity scores, $\vec{q}$ the query vector, $\mathbf{D}$ the term-document matrix, and $\mathbf{R}$ the term correlation matrix.
22. Motivations for Dimensionality Reduction
Generalized Vector Space Model (GVSM), as above: use the observed term correlations $\mathbf{R}$ to inform query-document matching. Is this a good idea?
23. Motivations for Dimensionality Reduction
Latent Semantic Indexing (LSI):

$$\vec{s} \;=\; \mathbf{D}^{\top}\mathbf{R}_k\,\vec{q}$$

where $\vec{s}$ is the vector of similarity scores, $\vec{q}$ the query vector, $\mathbf{D}$ the term-document matrix, and $\mathbf{R}_k$ a reduced-rank model of the term covariance matrix.
24. Motivations for Dimensionality Reduction
LSI builds a statistical model of the population covariance matrix, based on the observed sample.
25. Motivations for Dimensionality Reduction
We fit the model by means of the eigenvalue/eigenvector decomposition of the observed matrix $\hat{\mathbf{R}}$:
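In symbols (a standard statement of this fit; the slide's own equations are not reproduced in the text): since $\hat{\mathbf{R}}$ is symmetric, it factors as

$$\hat{\mathbf{R}} \;=\; \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{\top} \;=\; \sum_{i=1}^{p} \lambda_i\, \vec{v}_i \vec{v}_i^{\top},$$

and the rank-k model retains only the k largest eigenvalues and their eigenvectors:

$$\mathbf{R}_k \;=\; \sum_{i=1}^{k} \lambda_i\, \vec{v}_i \vec{v}_i^{\top}.$$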
26. Motivation for Dimensionality Reduction
Averaging reduces noise: if $\mathrm{Var}(X) = s^2$, then $\mathrm{sd}(\bar{x}) = s/\sqrt{N}$.
27. Technical Memo Example
Nine (9) documents (i.e. article titles), five about human-computer interaction and four about graph theory. For purposes of indexing, retain only those terms that occur in at least two documents, for a total of 12 indexing terms.
28. Technical Memo Example
A term-document matrix for the data
29. Technical Memo Example
e.g. "Human machine interface for Lab ABC computer applications"
33. Technical Memo Example
Query: "human-computer interaction"

query = (1 0 1 0 0 0 0 0 0 0 0 0)
34. Technical Memo Example
- Documents retrieved for our query using the keyword-based similarity model.
- Documents not retrieved, i.e. documents that don't share any terms with the query. Problematic!
35. Technical Memo Example

terms       c1.txt  c2.txt  c3.txt  c4.txt  c5.txt  m1.txt  m2.txt  m3.txt  m4.txt
computer       1       1       0       0       0       0       0       0       0
human          1       0       0       1       0       0       0       0       0
interface      1       0       1       0       0       0       0       0       0
response       0       1       0       0       1       0       0       0       0
survey         0       1       0       0       0       0       0       0       1
system         0       1       1       2       0       0       0       0       0
time           0       1       0       0       1       0       0       0       0
user           0       1       1       0       1       0       0       0       0
eps            0       0       1       1       0       0       0       0       0
trees          0       0       0       0       0       1       1       1       0
graph          0       0       0       0       0       0       1       1       1
minors         0       0       0       0       0       0       0       1       1
36. Technical Memo Example

The observed term correlation matrix:

            computer  human  interface  response  survey  system  time  user   eps  trees  graph  minors
computer        1.0    0.4      0.4       0.4      0.4     0.0    0.4   0.2  -0.3   -0.4   -0.4   -0.3
human           0.4    1.0      0.4      -0.3     -0.3     0.4   -0.3  -0.4   0.4   -0.4   -0.4   -0.3
interface       0.4    0.4      1.0      -0.3     -0.3     0.0   -0.3   0.2   0.4   -0.4   -0.4   -0.3
response        0.4   -0.3     -0.3       1.0      0.4     0.0    1.0   0.8  -0.3   -0.4   -0.4   -0.3
survey          0.4   -0.3     -0.3       0.4      1.0     0.0    0.4   0.2  -0.3   -0.4    0.2    0.4
system          0.0    0.4      0.0       0.0      0.0     1.0    0.0   0.2   0.8   -0.5   -0.5   -0.3
time            0.4   -0.3     -0.3       1.0      0.4     0.0    1.0   0.8  -0.3   -0.4   -0.4   -0.3
user            0.2   -0.4      0.2       0.8      0.2     0.2    0.8   1.0   0.2   -0.5   -0.5   -0.4
eps            -0.3    0.4      0.4      -0.3     -0.3     0.8   -0.3   0.2   1.0   -0.4   -0.4   -0.3
trees          -0.4   -0.4     -0.4      -0.4     -0.4    -0.5   -0.4  -0.5  -0.4    1.0    0.5    0.2
graph          -0.4   -0.4     -0.4      -0.4      0.2    -0.5   -0.4  -0.5  -0.4    0.5    1.0    0.8
minors         -0.3   -0.3     -0.3      -0.3      0.4    -0.3   -0.3  -0.4  -0.3    0.2    0.8    1.0

These inferred relationships seem pretty good, no?
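These correlations can presumably be reproduced as ordinary Pearson correlations between the rows of the term-document matrix above; a minimal sketch (variable names are mine, not the slide's):

```python
import numpy as np

# Term-document matrix from the technical memo example (rows = terms, columns = c1..c5, m1..m4).
terms = ["computer", "human", "interface", "response", "survey", "system",
         "time", "user", "eps", "trees", "graph", "minors"]
docs = ["c1.txt", "c2.txt", "c3.txt", "c4.txt", "c5.txt",
        "m1.txt", "m2.txt", "m3.txt", "m4.txt"]
X = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # eps
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Pearson correlation between every pair of term rows.
R_hat = np.corrcoef(X)
print(np.round(R_hat, 1))  # e.g. computer-human works out to about 0.36, shown as 0.4 above
```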
37. Technical Memo Example

VSM vs. GVSM. The GVSM scores for the query:

c1.txt   3.4
c2.txt   1.9
c3.txt   1.1
c4.txt   2.4
c5.txt   0.0
m1.txt  -0.8
m2.txt  -1.5
m3.txt  -2.1
m4.txt  -1.3
38. Technical Memo Example
Documents ranked against the query in LSI's 2-space.
39. Motivations for Dimensionality Reduction
Documents in a correlated 2D feature space
40. Motivations for Dimensionality Reduction
Correlation(training, obedience) = 0.96. So if we know that a document has a high score on "obedience", do we really know nothing about its score on "training", as the VSM imagines? Hence the error, and our correction.
41. Motivations for Dimensionality Reduction
We find the two dimensions that will make our data have zero correlation and capture the most variance.
42. Motivations for Dimensionality Reduction
Rotate the data onto their first two principal components.
43. Motivations for Dimensionality Reduction
Since our first dimension gets 96% of the variance, the 2nd one captures only 4%. Maybe there's really just one concept here, and the 2nd dimension is just sampling error.
44. Motivations for Dimensionality Reduction
Discard Dimension 2 to derive the best single-rank representation.
45. Motivations for Dimensionality Reduction
Crucial point: under LSI, we assume that the amount of variance captured by a dimension is evidence of its significance. Luckily, we have easy access to this information: the eigenvalues.
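A minimal sketch of the rotate-then-discard procedure on data like the training/obedience example; the data here are synthetic stand-ins I generated, not the slide's actual points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the slide's data: two highly correlated features,
# "training" and "obedience" scores for 100 documents.
training = rng.normal(size=100)
obedience = training + 0.25 * rng.normal(size=100)
X = np.column_stack([training, obedience])
X = X - X.mean(axis=0)                        # center the data

# Principal components are the eigenvectors of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotate the data onto its principal components: the rotated dimensions are uncorrelated.
Z = X @ eigvecs
print("share of variance per dimension:", eigvals / eigvals.sum())

# Discard dimension 2 to keep the best rank-1 representation.
Z1 = Z[:, :1]
```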
46. The Idea of Document Projections
- LSI is an example of the more general method of projecting documents from our observed term space to some new, fabricated space.
47. Technical Memo Example
- Assume that despite observing 12 variables (terms), the underlying distribution that generated our data actually has only k (k < 12) dimensions.
- Observing 12 variables thus involves measurement error. We want to uncover a more accurate estimate (than the observed $\hat{\mathbf{R}}$) of the underlying covariance matrix.
48. Technical Memo Example
- Choose some value of k < 12 (for now, let's pick k = 3; why?).
- Project $\hat{\mathbf{R}}$ onto its first three eigenvectors to derive $\mathbf{R}_k$.
- Plug $\mathbf{R}_k$ into our GVSM model (see the sketch below).
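A minimal sketch of these three steps, reusing X, terms, docs, and R_hat from the earlier snippet; variable names are mine, and the exact scores on the following slides may differ depending on weighting and normalization details not shown here:

```python
import numpy as np

k = 3

# Eigen-decomposition of the (symmetric) observed correlation matrix R_hat.
eigvals, eigvecs = np.linalg.eigh(R_hat)       # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:k]            # indices of the k largest eigenvalues
V_k = eigvecs[:, top]
L_k = np.diag(eigvals[top])

# Rank-k model of the term covariance/correlation structure.
R_k = V_k @ L_k @ V_k.T

# Query "human-computer interaction": the index terms 'human' and 'computer'.
q = np.zeros(len(terms))
q[terms.index("human")] = 1.0
q[terms.index("computer")] = 1.0

# GVSM scoring with the reduced-rank matrix:  s = D^T R_k q
scores = X.T @ R_k @ q
for doc, s in zip(docs, scores):
    print(f"{doc}  {s: .1f}")
```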
49. Technical Memo Example
But also note how little we've learned. The only obvious win I see here is that the error on "user" is reduced by half.
The rank-3 model $\mathbf{R}_k$ of the term correlations:

            computer  human  interface  response  survey  system  time  user   eps  trees  graph  minors
computer        0.9    0.4      0.4       0.4      0.3    -0.1    0.4   0.2  -0.3   -0.5   -0.4   -0.2
human           0.4    0.8      0.6      -0.4     -0.3     0.3   -0.4  -0.2   0.4   -0.3   -0.4   -0.3
interface       0.4    0.6      0.6      -0.2     -0.2     0.3   -0.2  -0.1   0.3   -0.4   -0.4   -0.3
response        0.4   -0.4     -0.2       0.9      0.5     0.1    0.9   0.8  -0.2   -0.4   -0.4   -0.3
survey          0.3   -0.3     -0.2       0.5      0.4    -0.3    0.5   0.2  -0.5   -0.1    0.0    0.1
system         -0.1    0.3      0.3       0.1     -0.3     0.6    0.1   0.4   0.7   -0.4   -0.6   -0.5
time            0.4   -0.4     -0.2       0.9      0.5     0.1    0.9   0.8  -0.2   -0.4   -0.4   -0.3
user            0.2   -0.2     -0.1       0.8      0.2     0.4    0.8   0.8   0.2   -0.5   -0.5   -0.4
eps            -0.3    0.4      0.3      -0.2     -0.5     0.7   -0.2   0.2   0.9   -0.2   -0.5   -0.5
trees          -0.5   -0.3     -0.4      -0.4     -0.1    -0.4   -0.4  -0.5  -0.2    0.6    0.6    0.5
graph          -0.4   -0.4     -0.4      -0.4      0.0    -0.6   -0.4  -0.5  -0.5    0.6    0.8    0.6
minors         -0.2   -0.3     -0.3      -0.3      0.1    -0.5   -0.3  -0.4  -0.5    0.5    0.6    0.5
Notice how little information we've lost by throwing out over half the data. What does this tell us about language? About our ability to model it with this approach?
50. Technical Memo Example

VSM vs. GVSM vs. LSI(3):

          GVSM   LSI(3)
c1.txt     3.4     1.5
c2.txt     1.9     1.4
c3.txt     1.1     2.3
c4.txt     2.4     2.7
c5.txt     0.0     0.6
m1.txt    -0.8    -0.6
m2.txt    -1.5    -1.4
m3.txt    -2.1    -2.1
m4.txt    -1.3    -1.7
51. The Idea of Document Projections
- LSI is an example of the more general method of projecting documents from our observed term space to some new, fabricated space.
- The LSI dimensions capture the most variance among the documents (they are the least-squares fit, à la linear regression).
- Other projections optimize other criteria:
  - Independent component analysis
  - Probabilistic LSI
52. The Idea of Document Projections
Let's see how well these methods work.
53. My Own Research
- People have known for a long time that LSI seems to work. But exactly why throwing away large portions of your data should improve retrieval was poorly understood.
  - "It's a noise reduction procedure!" So what's the noise and what's the signal?
  - "It's like a linear regression!" So how is the model optimized?
- The weak link in the theory was the matter of building the LSI model, i.e. model selection.
- If we're going to reduce dimensionality, how aggressively should we do so? What is our rationale for choosing a particular value of k, the LSI model's dimensionality?
54. Research Question
Can a statistical analysis of the eigenvalues
associated with a corpus yield a robust estimate
of the optimal representational dimensionality
for IR?
55. Research Question
- Can a statistical analysis of the eigenvalues associated with a corpus yield a robust estimate of the optimal representational dimensionality for IR? If so:
  - What method of analysis yields the best estimate?
  - What are the theoretical implications of dimensionality estimation by such a method?
56. "Libraries": 10 Nearest Neighbors in k-space

k = 2:     levels, storage, papers, accomplished, technology, techniques, proved, found, thesauri, role
k = 10:    book, automation, mechanization, center, feasibility, journals, traditional, library, general, data
k = kmax:  combining, medical, main, expansion, book, increase, card, form, speed, publication