Information Retrieval Models - PowerPoint PPT Presentation

1
Information Retrieval Models
  • PengBo
  • Oct 22, 2007

2
Outline
  • Information Retrieval Models
  • Vector Space Model
  • Latent Semantic Model
  • Language Model

4
Vector Space Model
5
Term Vectors
  • Bag of words
  • Each is a vector in R^M
  • Here log-scaled tf.idf

6
Documents as vectors
  • Each doc j can now be viewed as a vector of
    wf×idf values, one component for each term
  • So we have a vector space
  • terms are axes
  • docs live in this space
  • even with stemming, may have 20,000 dimensions

7
Intuition
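[Figure: documents d1-d5 drawn as vectors in a space with term axes t1, t2, t3; the angles θ and φ between vectors are marked]
Postulate: Documents that are close together
in the vector space talk about the same things.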
8
The vector space model
  • Freetext query as vector
  • We regard freetext query as short document
  • We return the documents ranked by the closeness
    of their vectors to the query vector.

9
Cosine similarity
  • The closeness of vectors d1 and d2 is captured by
    the cosine of the angle x between them.
  • Note: this is a similarity, not a distance
  • There is no triangle inequality for similarity.

10
Cosine similarity
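  • Cosine of angle between two vectors:
    $$\cos(\vec{d_j}, \vec{d_k}) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{M} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{M} w_{i,j}^2}\ \sqrt{\sum_{i=1}^{M} w_{i,k}^2}}$$
  • The denominator involves the lengths of the
    vectors: this is the normalization.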
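
A minimal Python sketch of cosine similarity between two weight vectors. The vectors `d1`, `d2`, `d3` are hypothetical 3-term examples, and returning 0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical 3-term weight vectors
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, twice as long
d3 = [0.0, 0.0, 3.0]   # shares no terms with d1

sim_same = cosine_similarity(d1, d2)   # close to 1: length does not matter
sim_orth = cosine_similarity(d1, d3)   # 0: no terms in common
```

Because both vectors are divided by their lengths, a long document and a short one pointing in the same direction score the same.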
11
1. Cosine Similarity
  • Compute the vector space similarity between the
    query "digital cameras" and the document "digital
    cameras and video cameras" by filling out the
    empty columns in the following table.
  • Assume N = 10,000,000, logarithmic term weighting
    (wf columns) for query and document, idf
    weighting for the query only, and cosine
    normalization for the document only. Treat "and"
    as a stop word. Enter term counts in the tf
    columns. What is the final similarity score?
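
One way to work through this exercise in Python is sketched below. The exercise's table, including its document frequencies, is not preserved in this transcript, so the `df` values are assumptions made only for illustration. The weighting follows the instructions: wf = 1 + log10(tf), idf = log10(N/df) on the query side only, and cosine normalization on the document side only.

```python
import math

N = 10_000_000
# Assumed document frequencies; the exercise's table is not in the transcript.
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}
STOP_WORDS = {"and"}

def term_counts(text):
    """Raw term frequencies, skipping stop words."""
    counts = {}
    for tok in text.lower().split():
        if tok not in STOP_WORDS:
            counts[tok] = counts.get(tok, 0) + 1
    return counts

def wf(tf):
    """Logarithmic term weighting."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query, document):
    q_tf = term_counts(query)
    d_tf = term_counts(document)
    # Query side: wf * idf, no length normalization
    q_w = {t: wf(tf) * math.log10(N / df[t]) for t, tf in q_tf.items()}
    # Document side: wf only, then cosine normalization
    d_wf = {t: wf(tf) for t, tf in d_tf.items()}
    d_len = math.sqrt(sum(w * w for w in d_wf.values()))
    d_w = {t: w / d_len for t, w in d_wf.items()}
    return sum(w * d_w.get(t, 0.0) for t, w in q_w.items())

similarity = score("digital cameras", "digital cameras and video cameras")
print(round(similarity, 2))  # about 3.12 under the assumed df values
```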

12
2. Evaluation
  • For this exercise, we define the precision-recall
    graph of a result list as the set of
    (precision/recall) points, where one
    precision/recall point is computed for each
    additional returned document. We will initially
    define the breakeven point as the point where
    precision equals recall.
  • Can there be more than one breakeven point? If
    yes, give an example; if not, show why not.

13
3. Evaluation
  • Suppose a retrieval system ranks a set of 50
    documents and the 6 known relevant documents
    appear at the following ranks:
  • 1, 2, 5, 10, 22, 42
  • First plot an exact recall/precision graph
    (recall on the X axis) and then overlay it with a
    graph where the precision values are interpolated
    to the standard 11 points. Then calculate the
    following evaluation measures for that ranked
    list, or indicate that there is not sufficient
    information to calculate them:
  • Precision at rank 10
  • Precision when recall is 50%
  • Uninterpolated average precision
  • 11-point interpolated average precision
  • Precision when recall is 25%
  • Uninterpolated average F1
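
A Python sketch of some of these measures for the ranked list above. It assumes the usual definitions: one precision/recall point per returned document, and interpolated precision at recall level r taken as the maximum precision at any recall of at least r. The function names are illustrative only.

```python
def precision_recall_points(relevant_ranks, num_relevant, num_ranked):
    """(recall, precision) after each returned document, ranks 1..num_ranked."""
    relevant = set(relevant_ranks)
    points, hits = [], 0
    for rank in range(1, num_ranked + 1):
        if rank in relevant:
            hits += 1
        points.append((hits / num_relevant, hits / rank))
    return points

def precision_at(points, k):
    """Precision after the top k documents."""
    return points[k - 1][1]

def average_precision(relevant_ranks, num_relevant):
    """Uninterpolated AP: mean of precision at each relevant document's rank."""
    ranks = sorted(relevant_ranks)
    return sum((i + 1) / r for i, r in enumerate(ranks)) / num_relevant

def interpolated_precision(points, recall_level):
    """Maximum precision at any recall >= recall_level."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates) if candidates else 0.0

def eleven_point_average(points):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    return sum(interpolated_precision(points, i / 10) for i in range(11)) / 11

ranks = [1, 2, 5, 10, 22, 42]
points = precision_recall_points(ranks, num_relevant=6, num_ranked=50)
p_at_10 = precision_at(points, 10)       # 4 relevant in the top 10
ap = average_precision(ranks, 6)
ap_11pt = eleven_point_average(points)
```

Recall only takes the values 1/6, 2/6, ..., 6/6 here, so a recall of exactly 25% never occurs in the uninterpolated list; only an interpolated value exists at that level.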

14
Latent Semantic Model
15
Vector Space Model Pros
  • Automatic selection of index terms
  • Partial matching of queries and documents
    (dealing with the case where no document contains
    all search terms)
  • Ranking according to similarity score (dealing
    with large result sets)
  • Term weighting schemes (improves retrieval
    performance)
  • Various extensions
  • Document clustering
  • Relevance feedback (modifying query vector)
  • Geometric foundation

16
Problems with Lexical Semantics
  • Ambiguity and association in natural language
  • Polysemy: Words often have a multitude of
    meanings and different types of usage (more
    severe in very heterogeneous collections).
  • The vector space model is unable to discriminate
    between different meanings of the same word.

17
Polysemy and Context
  • Document similarity on single word level:
    polysemy and context

18
Problems with Lexical Semantics
  • Synonymy: Different terms may have an identical or
    a similar meaning (weaker: words indicating the
    same topic).
  • No associations between words are made in the
    vector space representation.

19
Issues in the VSM
  • Assumes terms are independent
  • Some terms are likely to appear together
  • synonyms, related words
  • spelling mistakes?
  • Terms can have different meanings depending on
    context
  • Term-document matrix has a very high
    dimensionality
  • are there really that many important features for
    each document and term?

20
Latent Semantic Indexing (LSI)
  • Perform a low-rank approximation of term-document
    matrix (typical rank 100-300)
  • General idea
  • Map documents (and terms) to a low-dimensional
    representation.
  • Design a mapping such that the low-dimensional
    space reflects semantic associations (latent
    semantic space).
  • Compute document similarity based on the inner
    product in this latent semantic space

21
Goals of LSI
  • Similar terms map to similar location in low
    dimensional space
  • Noise reduction by dimension reduction

22
Latent Semantic Indexing
$$W_{t \times d} = T_{t \times r}\ \Sigma_{r \times r}\ D^{T}_{r \times d}, \qquad W = (w_{td})$$
  • Compute singular value decomposition of a
    term-document matrix
  • D^T, a representation of documents in r dimensions
  • T, a matrix for transforming new documents
  • diagonal matrix Σ gives relative importance of
    dimensions

23
Low-rank Approximation
$$W_{t \times d} = T_{t \times r}\ \Sigma_{r \times r}\ D^{T}_{r \times d} \;\approx\; W'_{t \times d} = T_{t \times k}\ \Sigma_{k \times k}\ D^{T}_{k \times d}$$
24
What it is
  • From term-doc matrix A_r, we compute the
    approximation A_k.
  • There is a row for each term and a column for
    each doc in A_k
  • Thus docs live in a space of k ≪ r dimensions
  • These dimensions are not the original axes

25
LSI Term matrix T
  • T matrix
  • gives a vector for each term in LSI space
  • multiply by a new document vector to fold in
    new documents into LSI space
  • LSI is a rotation of the term-space
  • original matrix terms are d-dimensional
  • new space has lower dimensionality
  • dimensions are groups of terms that tend to
    co-occur in the same documents
  • synonyms, contextually-related words, variant
    endings

26
Singular Values
  • Σ gives an ordering to the dimensions
  • values drop off very quickly
  • singular values at the tail represent "noise"
  • cutting off low-value dimensions reduces noise
    and can improve performance

27
Document matrix D
  • DT matrix
  • coordinates of documents in LSI space
  • same dimensionality as T vectors
  • can compute the similarity between a query and a
    document

28
Improved Retrieval with LSI
  • New documents and queries are "folded in"
  • multiply vector by TΣ^-1 (see slide 56)
  • Compute similarity for ranking as in VSM
  • compare queries and documents by dot-product
  • Improvements come from
  • reduction of noise
  • no need to stem terms (variants will co-occur)
  • no need for stop list
  • stop words are used uniformly throughout
    collection, so they tend to appear in the first
    dimension
  • No speed or space gains, though

29
  • C
  • Tr
  • Σr

30
  • Dr
  • Σ2
  • D2

31
Example
  • Map into 2-dimensional space

32
Latent Semantic Analysis
  • Latent semantic space illustrating example

courtesy of Susan Dumais
33
Empirical evidence
  • Experiments on TREC 1/2/3 (Dumais)
  • Lanczos SVD code (available on netlib) due to
    Berry used in these experiments
  • Running times of one day on tens of thousands
    of docs
  • Dimensions: various values in the range 250-350
    reported
  • (Under 200 reported unsatisfactory)
  • Generally expect recall to improve; what about
    precision?

34
Empirical evidence
  • Precision at or above median TREC precision
  • Top scorer on almost 20% of TREC topics
  • Slightly better on average than straight vector
    spaces
  • Effect of dimensionality

35
Failure modes
  • Negated phrases
  • TREC topics sometimes negate certain query
    terms/phrases; automatic conversion of topics to ...
  • Boolean queries
  • As usual, freetext/vector space syntax of LSI
    queries precludes (say) "Find any doc having to
    do with the following 5 companies"
  • See Dumais for more.

36
LSI has many other applications
  • In many settings in pattern recognition and
    retrieval, we have a feature-object matrix.
  • For text, the terms are features and the docs are
    objects.
  • Could be opinions and users
  • This matrix may be redundant in dimensionality.
  • Can work with low-rank approximation.
  • If entries are missing (e.g., users opinions),
    can recover if dimensionality is low.
  • Powerful general analytical technique
  • Close, principled analog to clustering methods.

37
Matrix Low-rank Approximation for LSI
38
Eigenvalues & Eigenvectors
  • Eigenvectors (for a square m×m matrix S):
    $$S v = \lambda v$$
    where λ is an eigenvalue and v ≠ 0 is a (right) eigenvector
  • How many eigenvalues are there at most?
39
Matrix-vector multiplication
S [matrix not preserved in the transcript] has eigenvalues 3, 2, 0 with
corresponding eigenvectors v1, v2, v3.
Any vector (say x) can be viewed as a
combination of the eigenvectors:
x = 2v1 + 4v2 + 6v3
40
Matrix vector multiplication
  • Thus a matrix-vector multiplication such as Sx
    (S, x as in the previous slide) can be rewritten
    in terms of the eigenvalues/vectors
  • Even though x is an arbitrary vector, the action
    of S on x is determined by the eigenvalues/vectors.
  • Suggestion: the effect of small eigenvalues is
    small.
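
The rewriting itself is not preserved in the transcript. Assuming the decomposition x = 2v1 + 4v2 + 6v3 and the eigenvalues λ1 = 3, λ2 = 2, λ3 = 0 from the previous slide, it presumably runs along these lines:

$$Sx = S(2v_1 + 4v_2 + 6v_3) = 2Sv_1 + 4Sv_2 + 6Sv_3 = 2\lambda_1 v_1 + 4\lambda_2 v_2 + 6\lambda_3 v_3 = 6v_1 + 8v_2 + 0\,v_3$$

The component along v3, which has the smallest eigenvalue, contributes nothing to Sx even though it has the largest coefficient in x.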

41
Eigenvalues & Eigenvectors
42
Example
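  • Let $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ (real, symmetric).
  • Then $S - \lambda I = \begin{pmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{pmatrix}$, so $|S - \lambda I| = (2-\lambda)^2 - 1 = 0$.
  • The eigenvalues are 1 and 3 (nonnegative, real).
  • The eigenvectors are orthogonal (and real):
    plug in these values and solve for eigenvectors.
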
43
Eigen/diagonal Decomposition
  • Let $S \in \mathbb{R}^{m \times m}$ be a square matrix with m
    linearly independent eigenvectors
  • Theorem: There exists an eigen decomposition
    $S = U \Lambda U^{-1}$, where Λ is diagonal
  • Columns of U are eigenvectors of S
  • Diagonal elements of Λ are eigenvalues of S

44
Diagonal decomposition why/how
Thus $SU = U\Lambda$, or $U^{-1}SU = \Lambda$.
And $S = U\Lambda U^{-1}$.
45
Diagonal decomposition - example
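Recall $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, with $\lambda_1 = 1$, $\lambda_2 = 3$.
The eigenvectors $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ form $U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$.
Recall $UU^{-1} = I$.
Inverting, we have $U^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$.
Then, $S = U\Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$.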
46
Example continued
Let's divide U (and multiply $U^{-1}$) by $\sqrt{2}$.
Then, $S = Q \Lambda Q^{T}$, where $Q^{-1} = Q^{T}$.
Why? Stay tuned ...
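
A small pure-Python check of this factorization. The example matrix is not preserved in the transcript, so the sketch assumes S = [[2, 1], [1, 2]], which is consistent with the stated eigenvalues 1 and 3, and takes the normalized eigenvectors (1, -1)/√2 and (1, 1)/√2 as the columns of Q.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

s = 1 / math.sqrt(2)
# Assumed example: normalized eigenvectors as the columns of Q
Q = [[s, s],
     [-s, s]]
Lam = [[1.0, 0.0],
       [0.0, 3.0]]   # eigenvalues on the diagonal

S = matmul(matmul(Q, Lam), transpose(Q))   # Q Λ Q^T, expected [[2, 1], [1, 2]]
QtQ = matmul(transpose(Q), Q)              # expected identity, i.e. Q^-1 = Q^T
```

Using the transpose in place of an explicit inverse is the practical payoff of the symmetric case.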
47
Symmetric Eigen Decomposition
  • If $S \in \mathbb{R}^{m \times m}$ is a symmetric matrix:
  • Theorem: There exists a (unique) eigen decomposition
    $S = Q \Lambda Q^{T}$
  • where Q is orthogonal:
  • $Q^{-1} = Q^{T}$
  • Columns of Q are normalized eigenvectors
  • Columns are orthogonal.
  • (everything is real)

48
Time out!
  • What do these matrices have to do with text?
  • Recall m × n term-document matrices ...
  • But everything so far needs square matrices, so ...

49
Singular Value Decomposition
For an m × n matrix A of rank r there exists a
factorization (Singular Value Decomposition =
SVD) as follows:
$$A = U \Sigma V^{T}$$
where U is m × m, Σ is m × n, and V is n × n.
The columns of U are orthogonal eigenvectors of
$AA^{T}$.
The columns of V are orthogonal eigenvectors of
$A^{T}A$.
50
Singular Value Decomposition
  • Illustration of SVD dimensions and sparseness

51
SVD example
Let
Typically, the singular values are arranged in
decreasing order.
52
Low-rank Approximation
  • SVD can be used to compute optimal low-rank
    approximations.
  • Approximation problem: Find A_k of rank k such
    that
    $$A_k = \arg\min_{X:\ \mathrm{rank}(X) = k} \|A - X\|_F$$
  • A_k and X are both m × n matrices.
  • Typically, want k ≪ r.

53
Low-rank Approximation
  • Solution via SVD:
    $$A_k = U\ \mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\ V^{T}$$
    (set smallest r-k singular values to zero)
54
Approximation error
  • How good (bad) is this approximation?
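  • It's the best possible, measured by the Frobenius
    norm of the error:
    $$\min_{X:\ \mathrm{rank}(X) = k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$$
  • where the σ_i are ordered such that σ_i ≥ σ_{i+1}.
  • Suggests why Frobenius error drops as k is increased.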
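
The rank-1 case can be illustrated with a small pure-Python sketch. It uses power iteration on A^T A to find the largest singular value; that is a technique swapped in here for brevity, not the Lanczos code mentioned in the slides. The 3×2 matrix `A` is hypothetical. The Frobenius error of the rank-1 approximation is then compared with the square root of the discarded squared singular values.

```python
import math

def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def top_singular_triplet(a, iterations=200):
    """Power iteration on A^T A: returns (sigma1, u1, v1)."""
    at = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))
        n = norm(w)
        v = [x / n for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return sigma, u, v

# Hypothetical 3x2 term-document matrix
A = [[2.0, 0.0],
     [1.0, 1.0],
     [0.0, 1.0]]

sigma1, u1, v1 = top_singular_triplet(A)
A1 = [[sigma1 * ui * vj for vj in v1] for ui in u1]   # rank-1 approximation

error = math.sqrt(sum((A[i][j] - A1[i][j]) ** 2
                      for i in range(len(A)) for j in range(len(A[0]))))
frob_sq = sum(x * x for row in A for x in row)        # equals the sum of all sigma_i^2
expected = math.sqrt(frob_sq - sigma1 ** 2)           # sqrt of the discarded sigma_i^2
```

Here `error` and `expected` agree up to floating-point error, which is the k = 1 instance of the error formula.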

55
SVD Low-rank approximation
  • Whereas the term-doc matrix A may have m = 50,000,
    n = 10 million (and rank close to 50,000)
  • We can construct an approximation A_100 with rank
    100.
  • Of all rank 100 matrices, it would have the
    lowest Frobenius error.
  • Great ... but why would we?
  • Answer: Latent Semantic Indexing

C. Eckart, G. Young, The approximation of a
matrix by another of lower rank. Psychometrika,
1, 211-218, 1936.
56
Performing the maps
  • Each row and column of A gets mapped into the
    k-dimensional LSI space, by the SVD.
  • A query q is also mapped into this space, by
    $$q_k = q^{T} U_k \Sigma_k^{-1}$$

57
Language Models
58
IR based on Language Model (LM)
[Figure: an information need generates a query; each document d1, d2, ..., dn in the document collection is a candidate source for generating that query]
  • A common search heuristic is to use words that
    you expect to find in matching documents as your
    query
  • The LM approach directly exploits that idea!
59
Formal Language (Model)
  • Traditional generative model: generates strings
  • Finite state machines or regular grammars, etc.
  • Example:

[Figure: a finite state machine with states "I" and "wish", generating repetitions of (I wish)]
I wish
I wish I wish
I wish I wish I wish
I wish I wish I wish I wish
...
60
Stochastic Language Models
  • Models probability of generating strings in the
    language (commonly all strings over alphabet Σ)

Model M:
  0.2   the
  0.1   a
  0.01  man
  0.01  woman
  0.03  said
  0.02  likes

s:         the    man    likes   the    woman
P(w | M):  0.2    0.01   0.02    0.2    0.01

P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
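
The product above can be reproduced with a short Python sketch of a unigram model. The probabilities are those listed for Model M. Giving unseen words probability 0, and ignoring any stop probability, are simplifying assumptions made here.

```python
import math

# Unigram probabilities of Model M as listed on the slide
model_m = {"the": 0.2, "a": 0.1, "man": 0.01,
           "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_probability(tokens, model):
    """Product of per-word probabilities; unseen words get probability 0."""
    return math.prod(model.get(tok, 0.0) for tok in tokens)

p = string_probability("the man likes the woman".split(), model_m)
print(p)  # approximately 8e-08
```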
61
Stochastic Language Models
  • Model probability of generating any string

Model M1, Model M2 (two word distributions, in the order given in the transcript):
  0.2 the, 0.0001 class, 0.03 sayst, 0.02 pleaseth, 0.1 yon, 0.01 maiden, 0.0001 woman
  0.2 the, 0.01 class, 0.0001 sayst, 0.0001 pleaseth, 0.0001 yon, 0.0005 maiden, 0.01 woman

P(s | M2) > P(s | M1)
62
Stochastic Language Models
  • A statistical model for generating text
  • Probability distribution over strings in a given
    language

63
Unigram and higher-order models
  • Unigram Language Models
  • Bigram (generally, n-gram) Language Models
  • Other Language Models
  • Grammar-based models (PCFGs), etc.
  • Probably not the first thing to try in IR

Easy. Effective!
64
Using Language Models in IR
  • Treat each document as the basis for a model
    (e.g., unigram sufficient statistics)
  • Rank document d based on P(d | q)
  • P(d | q) = P(q | d) × P(d) / P(q)
  • P(q) is the same for all documents, so ignore
  • P(d) [the prior] is often treated as the same for
    all d
  • But we could use criteria like authority, length,
    genre
  • P(q | d) is the probability of q given d's model
  • Very general formal approach

65
The fundamental problem of LMs
  • Usually we don't know the model M
  • But have a sample of text representative of that
    model
  • Estimate a language model from a sample
  • Then compute the observation probability

66
Language Models for IR
  • Language Modeling Approaches
  • Attempt to model query generation process
  • Documents are ranked by the probability that a
    query would be observed as a random sample from
    the respective document model
  • Multinomial approach

67
Retrieval based on probabilistic LM
  • Treat the generation of queries as a random
    process.
  • Approach
  • Infer a language model for each document.
  • Estimate the probability of generating the query
    according to each of these models.
  • Rank the documents according to these
    probabilities.
  • Usually a unigram estimate of words is used
  • Some work on bigrams, paralleling van Rijsbergen

68
Query generation probability (1)
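  • Ranking formula: $P(q, d) = P(d)\,P(q \mid d) \approx P(d)\,P(q \mid M_d)$
  • The probability of producing the query given the
    language model of document d using MLE is
    $$\hat{P}(q \mid M_d) = \prod_{t \in q} \hat{P}_{mle}(t \mid M_d) = \prod_{t \in q} \frac{tf_{t,d}}{dl_d}$$
    where $tf_{t,d}$ is the raw count of term t in document d and $dl_d$ is the total number of tokens in d

Unigram assumption: Given a particular language
model, the query terms occur independently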
69
Insufficient data
  • Zero probability
  • May not wish to assign a probability of zero to a
    document that is missing one or more of the query
    terms (this gives conjunction semantics)
  • General approach
  • A non-occurring term is possible, but no more
    likely than would be expected by chance in the
    collection.
  • If $tf_{t,d} = 0$, $P(t \mid M_d) = \frac{cf_t}{cs}$

where $cf_t$ is the raw count of term t in the collection
and $cs$ is the raw collection size (total number of
tokens in the collection)
70
Insufficient data
  • Zero probabilities spell disaster
  • We need to smooth probabilities
  • Discount nonzero probabilities
  • Give some probability mass to unseen things
  • There's a wide space of approaches to smoothing
    probability distributions to deal with this
    problem, such as adding 1, ½ or ε to counts,
    Dirichlet priors, discounting, and interpolation
  • See FSNLP ch. 6 if you want more
  • A simple idea that works well in practice is to
    use a mixture between the document multinomial
    and the collection multinomial distribution

71
Mixture model
  • P(w | d) = λ Pmle(w | Md) + (1 − λ) Pmle(w | Mc)
  • Mixes the probability from the document with the
    general collection frequency of the word.
  • Correctly setting λ is very important
  • A high value of lambda makes the search
    "conjunctive-like": suitable for short queries
  • A low value is more suitable for long queries
  • Can tune λ to optimize performance
  • Perhaps make it dependent on document size (cf.
    Dirichlet prior or Witten-Bell smoothing)

72
Basic mixture model summary
  • General formulation of the LM for IR:
    $$P(Q, d) = P(d) \prod_{t \in Q} \big( (1 - \lambda)\,P(t) + \lambda\,P(t \mid M_d) \big)$$
    where P(t) is the general language model and P(t | M_d) is the individual-document model
  • The user has a document in mind, and generates
    the query from this document.
  • The equation represents the probability that the
    document that the user had in mind was in fact
    this one.

73
Example
  • Document collection (2 documents)
  • d1: Xerox reports a profit but revenue is down
  • d2: Lucent narrows quarter loss but revenue
    decreases further
  • Model: MLE unigram from documents; λ = ½
  • Query: revenue down
  • P(Q | d1)
  • = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2]
  • = 1/8 × 3/32 = 3/256
  • P(Q | d2)
  • = [(1/8 + 2/16)/2] × [(0 + 1/16)/2]
  • = 1/8 × 1/32 = 1/256
  • Ranking: d1 > d2
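
The arithmetic in this example can be checked with a short Python sketch using exact fractions. It assumes whitespace tokenization and lowercasing, treats the two documents as the whole collection, and uses λ = ½ as on the slide. The function names are illustrative.

```python
from fractions import Fraction

docs = {
    "d1": "Xerox reports a profit but revenue is down".lower().split(),
    "d2": "Lucent narrows quarter loss but revenue decreases further".lower().split(),
}
collection = [tok for toks in docs.values() for tok in toks]   # 16 tokens
lam = Fraction(1, 2)

def p_mle(term, tokens):
    """Maximum-likelihood unigram estimate: count / length."""
    return Fraction(tokens.count(term), len(tokens))

def query_likelihood(query, doc_tokens):
    """Mixture of document model and collection model, multiplied over query terms."""
    score = Fraction(1)
    for term in query.lower().split():
        score *= lam * p_mle(term, doc_tokens) + (1 - lam) * p_mle(term, collection)
    return score

scores = {name: query_likelihood("revenue down", toks) for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores["d1"], scores["d2"], ranking)  # 3/256 1/256 ['d1', 'd2']
```

Without the collection term, d2 would score exactly zero because it does not contain "down"; the mixture keeps it in the ranking with a small nonzero score.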

74
Alternative Models of Text Generation
[Figure: a Searcher produces a Query via a Query Model; a Writer produces a Doc via a Doc Model. Is this the same model?]
75
Retrieval Using Language Models
[Figure: Query Model, Query, Doc Model and Doc, connected by three numbered links (1, 2, 3)]
Retrieval: Query likelihood (1), Document
likelihood (2), Model comparison (3)
76
Query Likelihood
  • P(Q | Dm)
  • Major issue is estimating document model
  • i.e. smoothing techniques instead of tf.idf
    weights
  • Good retrieval results
  • e.g. UMass, BBN, Twente, CMU
  • Problems dealing with relevance feedback, query
    expansion, structured queries

77
Document Likelihood
  • Rank by likelihood ratio P(D | R) / P(D | NR)
  • treat as a generation problem
  • P(w | R) is estimated by P(w | Qm)
  • Qm is the query or relevance model
  • P(w | NR) is estimated by collection probabilities
    P(w)
  • Issue is estimation of query model
  • Treat query as generated by mixture of topic and
    background
  • Estimate relevance model from related documents
    (query expansion)
  • Relevance feedback is easily incorporated
  • Good retrieval results
  • e.g. UMass at SIGIR '01
  • inconsistent with heterogeneous document
    collections

78
Model Comparison
  • Estimate query and document models and compare
  • Suitable measure is KL divergence D(Qm || Dm)
  • equivalent to query-likelihood approach if simple
    empirical distribution used for query model
  • More general risk minimization framework has been
    proposed
  • Zhai and Lafferty 2001
  • Better results than query-likelihood or
    document-likelihood approaches
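
The equivalence with query likelihood can be sketched as follows, assuming the query model is the empirical distribution P(w | Qm) = c(w, Q) / |Q|, where c(w, Q) (notation introduced here) is the number of times w occurs in the query Q and H(Qm) is the entropy of the query model:

$$-D(Q_m \,\|\, D_m) = -\sum_{w} P(w \mid Q_m) \log \frac{P(w \mid Q_m)}{P(w \mid D_m)} = \sum_{w} P(w \mid Q_m) \log P(w \mid D_m) + H(Q_m)$$

$$\sum_{w} \frac{c(w, Q)}{|Q|} \log P(w \mid D_m) = \frac{1}{|Q|} \log \prod_{w} P(w \mid D_m)^{c(w, Q)} = \frac{1}{|Q|} \log P(Q \mid D_m)$$

Since H(Qm) and |Q| do not depend on the document, ranking documents by increasing KL divergence gives the same order as ranking by query likelihood under this assumption.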

79
Language models: pro & con
  • Novel way of looking at the problem of text
    retrieval based on probabilistic language
    modeling
  • Conceptually simple and explanatory
  • Formal mathematical model
  • Natural use of collection statistics, not
    heuristics (almost)
  • LMs provide effective retrieval and can be
    improved to the extent that the following
    conditions can be met
  • Our language models are accurate representations
    of the data.
  • Users have some sense of term distribution.

80
Comparison With Vector Space
  • There's some relation to traditional tf.idf
    models:
  • (unscaled) term frequency is directly in model
  • the probabilities do length normalization of term
    frequencies
  • the effect of doing a mixture with overall
    collection frequencies is a little like idf:
    terms rare in the general collection but common
    in some documents will have a greater influence
    on the ranking

81
Comparison With Vector Space
  • Similar in some ways
  • Term weights based on frequency
  • Terms often used as if they were independent
  • Inverse document/collection frequency used
  • Some form of length normalization used
  • Different in others
  • Based on probability rather than similarity
  • Intuitions are probabilistic rather than
    geometric
  • Details of use of document length and term,
    document, and collection frequency differ

82
Summary
  • Latent Semantic Indexing
  • singular value decomposition
  • Matrix Low-rank Approximation
  • Language Model
  • Generative model
  • smooth probabilities
  • Mixture model

83
References
  • [1] IIR Ch. 12, Ch. 18
  • [2] A. Moffat, J. Zobel, and D. Hawking,
    "Recommended reading for IR research students",
    SIGIR Forum, vol. 39, pp. 3-14, 2005.

84
Resources
  • The Template Numerical Toolkit (TNT):
    http://math.nist.gov/tnt/documentation.html
  • The Lemur Toolkit for Language Modeling and
    Information Retrieval:
    http://www-2.cs.cmu.edu/~lemur/
    CMU/UMass LM and IR system in C(++),
    currently actively developed.

85
Thank You!
  • Q&A

86
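2. Evaluation
  • Question a
  • Can there be more than one breakeven point?
  • Let R be the number of relevant documents, A the
    number of documents returned so far, and Ra the
    number of relevant documents among them, so that
    precision = Ra/A and recall = Ra/R. At a breakeven
    point precision = recall, so (when Ra > 0) R = A.
    Suppose that after k (k > 0) more documents are
    returned there is another breakeven point, with
    precision = Ra'/A' and recall = Ra'/R; then A' = R,
    but A' = A + k with k > 0, so A' ≠ R, a
    contradiction. Hence there is at most one
    breakeven point with nonzero precision.

[The remaining text of this answer is not recoverable from the transcript.]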