Transcript and Presenter's Notes

Title: Beyond Term Independence: Introduction to Some Advanced IR Models


1
Beyond Term Independence: Introduction to Some
Advanced IR Models
  • Miles Efron
  • INF 384H
  • UT-ISchool

2
Term Independence
  • Up to this point, all of the models we have
    explored make the convenient assumption that the
    indexing terms of our corpus are statistically
    independent.
  • This allows us to simplify the math we have to
    do.
  • It's also not clear that this is a bad
    assumption, even though it is clearly unrealistic.

3
Term Independence: The Probabilistic Model
  • Under the BIR model, we assume that the
    likelihood that a document was generated by the
    distribution responsible for relevant or
    non-relevant documents is expressible as a simple
    product. Thus we have the similarity model
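The formula itself appeared only as an image in the original slides; a
plausible reconstruction in the standard BIR form (notation mine) is

    $P(d \mid R) = \prod_{i=1}^{p} P(d_i \mid R)$

so that the similarity, read as the log-odds of relevance, becomes a sum over
independent terms:

    $\mathrm{sim}(d, q) = \log \frac{P(d \mid R)}{P(d \mid \bar{R})}
                        = \sum_{i=1}^{p} \log \frac{P(d_i \mid R)}{P(d_i \mid \bar{R})}$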

4
Term Independence: The Probabilistic Model
This is clearly untrue: knowing that a document
contains the word "information" does tell us
something about the likelihood that it also
contains the word "retrieval".
  • Under the BIR model, we assume that the
    likelihood that a document was generated by the
    distribution responsible for relevant or
    non-relevant documents is expressible as a simple
    product. Thus we have the similarity model

5
Term Independence: The Probabilistic Model
This is clearly untrue: knowing that a document
contains the word "information" does tell us
something about the likelihood that it also
contains the word "retrieval".
  • Under the BIR model, we assume that the
    likelihood that a document was generated by the
    distribution responsible for relevant or
    non-relevant documents is expressible as a simple
    product. Thus we have the similarity model

Thus our similarity function is unlikely to be
based on accurate class-conditional
probabilities. But this doesn't necessarily make
it wrong.
6
Term Independence: The Vector Space Model
  • Our assumption of term independence under the
    vector space model is less obvious, though no
    less stark.
  • The assumption of independence lies in the fact
    that we assume the dimensions of term space are
    orthogonal.
  • Geometrically, this means we assume they meet at
    right angles.
  • Semantically, it means they convey separate
    information.

7
Term Independence: The Vector Space Model
  • Recall the VSM similarity function
  • What does this tell us to do?
  • For each of our p dimensions
  • Multiply the corresponding values in the query
    and document
  • Add all of these products together
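In symbols (the equation was an image on the original slide; this is the
standard inner-product form, with q and d the p-dimensional query and document
weight vectors):

    $\mathrm{sim}(q, d) = \sum_{i=1}^{p} q_i \, d_i = q^{\top} d$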

8
Term Independence: The Vector Space Model
  • Recall the VSM similarity function
  • What does this tell us to do?
  • For each of our p dimensions
  • Multiply the corresponding values in the query
    and document
  • Add all of these products together

If the query and document share no terms, they
have a similarity of 0. Likewise, if they share
many terms, that result is expressed only
additively.
9
Term Independence: The Vector Space Model
  • Recall the VSM similarity function
  • Consider the function at the collection level
  • We can re-write this

10
Our Assumptions Haunt Us
  • To make retrieval simpler, we have devised
    numerous assumptions.
  • Term independence
  • Bag of words models
  • Internal evidence as the only source of document
    feature information
  • What prices do we pay for this simplicity?

11
Lexical Ambiguity
  • Human language is rife with redundancy and
    lacunae. Both its repetitions and omissions can
    frustrate IR.
  • Synonymy: two words can describe the same
    concept.
  • Polysemy: words that look the same can have
    different meanings in different contexts.

12
Lexical Ambiguity: Polysemy
  • Is the West Bank the same as the Bank of the
    West?
  • "Stocks rose last week" has a different meaning
    than does the florist's stocks of roses.
  • The seal of the state of Texas is quite different
    from the Dean's seal of approval, and both are
    different from a seal that eats fish.

13
Lexical Ambiguity: Synonymy
  • The character strings "car" and "automobile" look
    different, though the concepts they reference are
    awfully similar.
  • To make things even murkier, a polysemous word
    like "game" can have synonyms associated with
    various senses. As a vegetarian, I eat neither
    game nor meat. But that doesn't mean I won't
    play a game or participate in some sort of
    contest or challenge.

14
The Problem of Underdetermination
  • Many of the problems that plague the linguistic
    aspects of IR are subsumed under the rubric of
    underdetermination.
  • Queries are underdetermined in the sense that
    users only supply a vague hint of the kinds of
    information they are interested in.
  • Documents are underdetermined in that authors
    don't write so as to make their documents easily
    indexable by our algorithms.

15
Addressing Underdetermination
  • Global approaches
  • Altering document representations
  • Applying thesauri to expand documents
  • Latent Semantic Indexing
  • Local approaches
  • Query expansion
  • Relevance feedback

16
Elaborating the Vector Space Model: Latent
Semantic Indexing
  • The terms that occur in a document give us some
    evidence of that document's "aboutness." But
    that evidence is neither complete nor reliable.
    Instead of modeling similarity as a function in
    term space, we would really like to measure
    similarity in some kind of concept space. This
    is what LSI tries to do: map from term space
    to concept space.

17
Problems with Keyword-based IR
The terms that occur in a document give us some
evidence of that document's "aboutness." But
that evidence is neither complete nor reliable.
Instead of modeling similarity as a function in
term space, we would really like to measure
similarity in some kind of concept space. This
is what LSI tries to do: map from term space
to concept space.
  • Synonymy: many terms may refer to one concept
    {car, automobile, vehicle}
  • Polysemy: one term may refer to many concepts
    (the Bank of the West vs. the West Bank)

18
Problems with Keyword-based IR
The terms that occur in a document give us some
evidence of that document's "aboutness." But
that evidence is neither complete nor reliable.
Instead of modeling similarity as a function in
term space, we would really like to measure
similarity in some kind of concept space. This
is what LSI tries to do: map from term space
to concept space.
  • Synonymy: many terms may refer to one concept
    {car, automobile, vehicle}
  • Polysemy: one term may refer to many concepts
    (the Bank of the West vs. the West Bank)

19
Problems with Keyword-based IR
The terms that occur in a document give us some
evidence of that document's "aboutness." But
that evidence is neither complete nor reliable.
Instead of modeling similarity as a function in
term space, we would really like to measure
similarity in some kind of concept space. This
is what LSI tries to do: map from term space
to concept space.
  • LSI constructs a statistical model of the conceptual
    relationships among terms and documents by means
    of dimensionality reduction.

20
Motivations for Dimensionality Reduction
Salton's Vector Space Model (VSM) assumes term
independence
Vector of similarity scores
Query vector
Term-document matrix
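The labeled equation was an image in the original deck; read together, the
labels presumably stand for the matrix form of the VSM score (a reconstruction,
not the deck's exact notation):

    $s = A^{\top} q$

where A is the p x n term-document matrix, q the p-dimensional query vector,
and s the n-vector of similarity scores.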
21
Motivations for Dimensionality Reduction
Generalized Vector Space Model (GVSM)
Vector of similarity scores
Query vector
Term-document matrix
Term correlation matrix
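Again reconstructing the image-only equation from its labels, the GVSM score
presumably inserts the term correlation matrix R into the product:

    $s = A^{\top} R \, q$

so that a document can receive credit for terms that merely correlate with the
query's terms, not just terms it shares with the query.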
22
Motivations for Dimensionality Reduction
Generalized Vector Space Model (GVSM)
Vector of similarity scores
Query vector
Use the observed term correlations to inform
query-document matching. Is this a good idea?
Term-document matrix
Term correlation matrix
23
Motivations for Dimensionality Reduction
Latent Semantic Indexing (LSI)
Vector of similarity scores
Query vector
Term-document matrix
Term covariance matrix
24
Motivations for Dimensionality Reduction
LSI builds a statistical model of the population
covariance matrix, based on the observed sample.
Latent Semantic Indexing (LSI)
Vector of similarity scores
Query vector
Term-document matrix
Term covariance matrix
25
Motivations for Dimensionality Reduction
LSI builds a statistical model of the population
covariance matrix, based on the observed sample.
Latent Semantic Indexing (LSI)
Vector of similarity scores
Query vector
Term-document matrix
Term covariance matrix
We fit the model by means of the
eigenvalue/eigenvector decomposition of R.
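In symbols (a reconstruction of the image-only equations): with R̂ the observed
term correlation matrix,

    $\hat{R} = V \Lambda V^{\top}, \qquad R_k = V_k \Lambda_k V_k^{\top}$

where V holds the eigenvectors, Λ the eigenvalues, and the subscript k means we
keep only the k leading eigenvalue/eigenvector pairs.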
26
Motivation for dimensionality reduction
$\operatorname{Var}(X) = s^2 \quad\Longrightarrow\quad \operatorname{sd}(\bar{x}) = s/\sqrt{N}$
27
Technical Memo Example
Nine (9) documents (i.e. article titles), five
about human-computer interaction and four about
graph theory. For purposes of indexing, retain
only those terms that occur in at least two
documents, for a total of 12 indexing terms.
28
Technical Memo Example
Nine (9) documents (i.e. article titles), five
about human-computer interaction and four about
graph theory. For purposes of indexing, retain
only those terms that occur in at least two
documents, for a total of 12 indexing terms.
A term-document matrix for the data
29
Technical Memo Example
Nine (9) documents (i.e. article titles), five
about human-computer interaction and four about
graph theory. For purposes of indexing, retain
only those terms that occur in at least two
documents, for a total of 12 indexing terms.
A term-document matrix for the data
e.g.
Human machine interface for Lab ABC computer
applications
30
Technical Memo Example
31
Technical Memo Example
32
Technical Memo Example
33
Technical Memo Example
Query: human-computer interaction
query = (1 0 1 0 0 0 0 0 0 0 0 0)
34
Technical Memo Example
Documents retrieved for our query using the
keyword-based similarity model.
Documents not retrieved, i.e. documents that don't
share any terms with the query.
query
Problematic!
35
Technical Memo Example
terms \ docs  c1.txt c2.txt c3.txt c4.txt c5.txt m1.txt m2.txt m3.txt m4.txt
computer         1      1      0      0      0      0      0      0      0
human            1      0      0      1      0      0      0      0      0
interface        1      0      1      0      0      0      0      0      0
response         0      1      0      0      1      0      0      0      0
survey           0      1      0      0      0      0      0      0      1
system           0      1      1      2      0      0      0      0      0
time             0      1      0      0      1      0      0      0      0
user             0      1      1      0      1      0      0      0      0
eps              0      0      1      1      0      0      0      0      0
trees            0      0      0      0      0      1      1      1      0
graph            0      0      0      0      0      0      1      1      1
minors           0      0      0      0      0      0      0      1      1
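For concreteness, here is a small Python/NumPy sketch of the example so far.
The variable names (A, q, R_hat) and the use of np.corrcoef are my choices, not
the deck's, and the query vector below follows the row order of the table
above, which may differ from the ordering used on the earlier query slide.

    import numpy as np

    terms = ["computer", "human", "interface", "response", "survey", "system",
             "time", "user", "eps", "trees", "graph", "minors"]
    docs = ["c1.txt", "c2.txt", "c3.txt", "c4.txt", "c5.txt",
            "m1.txt", "m2.txt", "m3.txt", "m4.txt"]

    # Term-document matrix from the table above (rows = terms, columns = docs).
    A = np.array([
        [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
        [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
        [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
        [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
        [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
        [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
        [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
        [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
        [0, 0, 1, 1, 0, 0, 0, 0, 0],   # eps
        [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
        [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
        [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
    ], dtype=float)

    # Query "human-computer interaction": 1 for each indexed query term.
    q = np.zeros(len(terms))
    q[terms.index("human")] = 1.0
    q[terms.index("computer")] = 1.0

    # Keyword (VSM) scores: documents sharing no query term score exactly 0.
    vsm_scores = A.T @ q

    # Observed term-term correlation matrix (compare with the next slide; the
    # deck's exact values depend on how it standardized the rows).
    R_hat = np.corrcoef(A)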
36
Technical Memo Example
err ???
Observed term correlations (columns follow the same term order as the rows):

computer    1.0  0.4  0.4  0.4  0.4  0.0  0.4  0.2 -0.3 -0.4 -0.4 -0.3
human       0.4  1.0  0.4 -0.3 -0.3  0.4 -0.3 -0.4  0.4 -0.4 -0.4 -0.3
interface   0.4  0.4  1.0 -0.3 -0.3  0.0 -0.3  0.2  0.4 -0.4 -0.4 -0.3
response    0.4 -0.3 -0.3  1.0  0.4  0.0  1.0  0.8 -0.3 -0.4 -0.4 -0.3
survey      0.4 -0.3 -0.3  0.4  1.0  0.0  0.4  0.2 -0.3 -0.4  0.2  0.4
system      0.0  0.4  0.0  0.0  0.0  1.0  0.0  0.2  0.8 -0.5 -0.5 -0.3
time        0.4 -0.3 -0.3  1.0  0.4  0.0  1.0  0.8 -0.3 -0.4 -0.4 -0.3
user        0.2 -0.4  0.2  0.8  0.2  0.2  0.8  1.0  0.2 -0.5 -0.5 -0.4
eps        -0.3  0.4  0.4 -0.3 -0.3  0.8 -0.3  0.2  1.0 -0.4 -0.4 -0.3
trees      -0.4 -0.4 -0.4 -0.4 -0.4 -0.5 -0.4 -0.5 -0.4  1.0  0.5  0.2
graph      -0.4 -0.4 -0.4 -0.4  0.2 -0.5 -0.4 -0.5 -0.4  0.5  1.0  0.8
minors     -0.3 -0.3 -0.3 -0.3  0.4 -0.3 -0.3 -0.4 -0.3  0.2  0.8  1.0
These inferred relationships seem pretty good,
no?
37
Technical Memo Example
VSM
GVSM
c1.txt  3.4
c2.txt  1.9
c3.txt  1.1
c4.txt  2.4
c5.txt  0.0
m1.txt -0.8
m2.txt -1.5
m3.txt -2.1
m4.txt -1.3
38
Technical Memo Example
Documents ranked against the query in LSI's
2-space.
39
Motivations for Dimensionality Reduction
Documents in a correlated 2D feature space
40
Motivations for Dimensionality Reduction
Documents in a correlated 2D feature space
Correlation(training, obedience) = 0.96. So if
we know that a document has a high score on
"obedience", do we really know nothing about its
score on "training", as the VSM imagines? Hence
the error, and our correction.
41
Motivations for Dimensionality Reduction
Documents in a correlated 2D feature space
We find the two dimensions that will make our
data have zero correlation and capture the most
variance
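A minimal sketch of that rotation on made-up 2-D data (the feature names and
numbers here are hypothetical, chosen only to mimic the high correlation on the
previous slide):

    import numpy as np

    # Hypothetical correlated 2-D data: scores on "training" and "obedience".
    rng = np.random.default_rng(0)
    training = rng.normal(size=500)
    obedience = 0.96 * training + 0.28 * rng.normal(size=500)  # corr ~ 0.96

    X = np.column_stack([training, obedience])
    Xc = X - X.mean(axis=0)                        # center the data

    # Principal components = eigenvectors of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Rotate onto the principal components: the new coordinates are
    # uncorrelated, and the first one captures nearly all the variance.
    Z = Xc @ eigvecs
    var_explained = eigvals / eigvals.sum()

    # Discarding the second dimension gives the best one-dimensional
    # (least-squares) representation of the data.
    Z1 = Z[:, :1]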
42
Motivations for Dimensionality Reduction
Rotate the data onto their first two principal
components
43
Motivations for Dimensionality Reduction
Rotate the data onto their first two principal
components
Since our first dimension captures 96% of the
variance, the 2nd one only captures 4%. Maybe
there's really just one concept here, and the 2nd
dimension is just sampling error.
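The 96% / 4% split is the usual proportion-of-variance reading of the
eigenvalues (the formula is mine, but standard):

    $\text{proportion of variance for dimension } i = \frac{\lambda_i}{\sum_{j} \lambda_j}$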
44
Motivations for Dimensionality Reduction
Discard Dimension 2 to derive the best
rank-one representation
45
Motivations for Dimensionality Reduction
Discard Dimension 2 to derive the best
rank-one representation
Crucial point: under LSI, we assume that the
amount of variance captured by a dimension is
evidence of its significance. Luckily, we have
easy access to this information: the eigenvalues.
46
The Idea of Document Projections
  • LSI is an example of the more general method of
    projecting documents from our observed term space
    to some new, fabricated space.

47
Technical Memo Example
  • Assume that despite observing 12 variables
    (terms), the underlying distribution that
    generated our data actually has only k (k < 12)
    dimensions.
  • Observing 12 variables thus entails measurement
    error. We want to uncover a more accurate (than
    the observed R̂) estimate of the underlying
    covariance matrix.

48
Technical Memo Example
  • Choose some value of k < 12 (for now, let's pick
    k = 3; why?)
  • Project R̂ onto its first three eigenvectors to
    derive R_k.
  • Plug R_k into our GVSM model (sketched below)
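Continuing the NumPy sketch from the term-document slide (again, the variable
names and the exact scoring formula are my reading of the slides, not
guaranteed to match the deck's notation):

    # Eigendecomposition of the observed correlation matrix R_hat (symmetric).
    eigvals, eigvecs = np.linalg.eigh(R_hat)
    order = np.argsort(eigvals)[::-1]             # largest eigenvalues first
    k = 3
    V_k = eigvecs[:, order[:k]]
    L_k = np.diag(eigvals[order[:k]])

    # Rank-k model of the term correlation structure.
    R_k = V_k @ L_k @ V_k.T

    # Plug R_k into the GVSM-style score in place of R_hat.
    gvsm_scores = A.T @ R_hat @ q
    lsi_scores  = A.T @ R_k  @ q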

49
Technical Memo Example
But also note how little we've learned. The
only obvious win I see here is that the error on
"user" is reduced by half.
R_k (k = 3): columns follow the same term order as the rows.

computer    0.9  0.4  0.4  0.4  0.3 -0.1  0.4  0.2 -0.3 -0.5 -0.4 -0.2
human       0.4  0.8  0.6 -0.4 -0.3  0.3 -0.4 -0.2  0.4 -0.3 -0.4 -0.3
interface   0.4  0.6  0.6 -0.2 -0.2  0.3 -0.2 -0.1  0.3 -0.4 -0.4 -0.3
response    0.4 -0.4 -0.2  0.9  0.5  0.1  0.9  0.8 -0.2 -0.4 -0.4 -0.3
survey      0.3 -0.3 -0.2  0.5  0.4 -0.3  0.5  0.2 -0.5 -0.1  0.0  0.1
system     -0.1  0.3  0.3  0.1 -0.3  0.6  0.1  0.4  0.7 -0.4 -0.6 -0.5
time        0.4 -0.4 -0.2  0.9  0.5  0.1  0.9  0.8 -0.2 -0.4 -0.4 -0.3
user        0.2 -0.2 -0.1  0.8  0.2  0.4  0.8  0.8  0.2 -0.5 -0.5 -0.4
eps        -0.3  0.4  0.3 -0.2 -0.5  0.7 -0.2  0.2  0.9 -0.2 -0.5 -0.5
trees      -0.5 -0.3 -0.4 -0.4 -0.1 -0.4 -0.4 -0.5 -0.2  0.6  0.6  0.5
graph      -0.4 -0.4 -0.4 -0.4  0.0 -0.6 -0.4 -0.5 -0.5  0.6  0.8  0.6
minors     -0.2 -0.3 -0.3 -0.3  0.1 -0.5 -0.3 -0.4 -0.5  0.5  0.6  0.5
Notice how little information we've lost by
throwing out over half the data. What does this
tell us about language? About our ability to
model it with this approach?
50
Technical Memo Example
VSM
GVSM
LSI(3)
c1.txt  3.4      c1.txt  1.5
c2.txt  1.9      c2.txt  1.4
c3.txt  1.1      c3.txt  2.3
c4.txt  2.4      c4.txt  2.7
c5.txt  0.0      c5.txt  0.6
m1.txt -0.8      m1.txt -0.6
m2.txt -1.5      m2.txt -1.4
m3.txt -2.1      m3.txt -2.1
m4.txt -1.3      m4.txt -1.7
51
The Idea of Document Projections
  • LSI is an example of the more general method of
    projecting documents from our observed term space
    to some new, fabricated space.
  • The LSI dimensions capture the most variance
    among the documents (they are the least-squares
    fit a la linear regression)
  • Other projections optimize other criteria
  • Independent component analysis
  • Probabilistic LSI

52
The Idea of Document Projections
Let's see how well these methods work.
  • LSI is an example of the more general method of
    projecting documents from our observed term space
    to some new, fabricated space.
  • The LSI dimensions capture the most variance
    among the documents (they are the least-squares
    fit a la linear regression)
  • Other projections optimize other criteria
  • Independent component analysis
  • Probabilistic LSI

53
My Own Research
  • People have known for a long time that LSI seems
    to work. But exactly why throwing away large
    portions of your data should improve retrieval
    was poorly understood.
  • "It's a noise reduction procedure!" So what's the
    noise and what's the signal?
  • "It's like a linear regression!" So how is the
    model optimized?
  • The weak link in the theory was the matter of
    building the LSI model, i.e. model selection.
  • If we're going to reduce dimensionality, how
    aggressively should we do so? What is our
    rationale for choosing a particular value of k,
    the LSI model's dimensionality?

54
Research Question
Can a statistical analysis of the eigenvalues
associated with a corpus yield a robust estimate
of the optimal representational dimensionality
for IR?
55
Research Question
  • Can a statistical analysis of the eigenvalues
    associated with a corpus yield a robust estimate
    of the optimal representational dimensionality
    for IR? If so
  • What method of analysis yields the best
    estimate?
  • What are the theoretical implications of
    dimensionality estimation by such a method?

56
"Libraries": 10 nearest neighbors in k-space
k = 2:     levels, storage, papers, accomplished, technology, techniques,
           proved, found, thesauri, role
k = 10:    book, automation, mechanization, center, feasibility, journals,
           traditional, library, general, data
k = kmax:  combining, medical, main, expansion, book, increase, card, form,
           speed, publication