1
Term Weighting in Information Retrieval
Web Information Retrieval
  • Polettini Nicola
  • Monday, June 5, 2006

2
Contents
  • Introduction to Vector Space Model
  • Vocabulary Terms
  • Documents & Queries
  • Similarity measures
  • Term weighting
  • Binary weights
  • SMART Retrieval System
  • Salton Term Precision Model
  • Paper analysis
  • New weighting schemas
  • Web documents
  • Conclusions
  • References

3
The Vector Space Model
  • Vocabulary.
  • Terms.
  • Documents & Queries.
  • Vector representation.
  • Similarity measures.
  • Cosine Similarity.

4
Vocabulary
  • Documents are represented as vectors in term
    space (the space of all vocabulary terms).
  • Queries are represented the same way as documents.
  • Query and document weights are based on the length
    and direction of their vectors.
  • A vector distance measure between the query and
    documents is used to rank retrieved documents.

5
Terms
  • Documents are represented by binary or weighted
    vectors of terms.
  • Terms are usually stems.
  • Terms can also be n-grams.
  • "Computer Science" (a bigram).
  • "World Wide Web" (a trigram).

6
Documents & Queries Vectors
  • Documents and queries are represented as bags of
    words (BOW).
  • Represented as vectors
  • A vector is like an array of floating-point numbers.
  • It has direction and magnitude.
  • Each vector holds a place for every term in the
    collection.
  • Therefore, most vectors are sparse.

7
Vector representation
  • Documents and Queries are represented as vectors.
  • Vocabulary = n terms.
  • Position 1 corresponds to term 1, position 2 to
    term 2, ..., position n to term n.

8
Similarity Measures
  • Simple matching
  • Dice's Coefficient
  • Jaccard's Coefficient
  • Cosine Coefficient
  • Overlap Coefficient
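The coefficient formulas on this slide were shown as images; a minimal sketch of the usual set-based definitions is given below, assuming X and Y are the sets of terms in the two items being compared:

```python
# Minimal sketch: the usual set-based similarity coefficients, where
# X and Y are the sets of terms occurring in the two items compared.
def simple_matching(X, Y): return len(X & Y)
def dice(X, Y):            return 2 * len(X & Y) / (len(X) + len(Y))
def jaccard(X, Y):         return len(X & Y) / len(X | Y)
def cosine(X, Y):          return len(X & Y) / (len(X) * len(Y)) ** 0.5
def overlap(X, Y):         return len(X & Y) / min(len(X), len(Y))

X = {"nuclear", "fallout", "siberia"}
Y = {"nuclear", "retrieval"}
print(jaccard(X, Y), round(cosine(X, Y), 3))  # 0.25 0.408
```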
9
Cosine Similarity
  • The similarity of two vectors is the cosine of the
    angle between them: sim(d, q) = (d · q) / (|d| |q|).
  • This is called the cosine similarity.
  • The normalization can be done when weighting the
    terms; otherwise, normalization and similarity
    computation can be combined.
  • Cosine similarity ranks documents by degree of
    similarity.

10
Example: Computing Cosine Similarity
11
Example: Computing Cosine Similarity (2)
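The worked example from these two slides is not reproduced here; as a stand-in, a minimal sketch of the computation with illustrative (made-up) weighted vectors:

```python
import math

# Minimal sketch: cosine similarity between two weighted term vectors.
# The vectors below are illustrative, not taken from the original slides.
def cosine_similarity(d, q):
    dot = sum(d[t] * q[t] for t in d if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q)

doc   = {"nuclear": 0.90, "fallout": 0.63, "siberia": 1.20}
query = {"nuclear": 1.0, "retrieval": 1.0}
print(round(cosine_similarity(doc, query), 2))  # ~0.39
```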
12
Term Weighting
  • Binary weights
  • SMART Retrieval System
  • Local formulas
  • Global formulas
  • Normalization formulas
  • TFIDF

13
Binary Weights
  • Only the presence (1) or absence (0) of a term is
    included in the vector

14
Binary Weights Formula
The binary formula gives every word that appears in a
document equal relevance: the weight is 1 if the term
appears in the document and 0 otherwise. It can be
useful when frequency is not important.
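A minimal sketch of binary weighting over an illustrative vocabulary:

```python
# Minimal sketch: binary term weights over a fixed vocabulary.
vocabulary = ["contaminated", "fallout", "information",
              "nuclear", "retrieval", "siberia"]

def binary_vector(doc_terms, vocabulary):
    # 1 if the term is present in the document, 0 otherwise.
    return [1 if term in doc_terms else 0 for term in vocabulary]

doc = {"nuclear", "fallout", "siberia", "information"}
print(binary_vector(doc, vocabulary))  # [0, 1, 1, 1, 0, 1]
```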
15
Why use term weighting?
  • Binary weights are too limiting
  • (terms are either present or absent).
  • Non-binary weights allow partial matching to be
    modeled
  • (partial matching allows retrieval of docs that
    approximate the query).
  • Ranking of retrieved documents by best match
  • (term weighting improves the quality of the answer
    set).

16
SMART Retrieval System
  • SMART is an experimental IR system developed by
    Gerard Salton (and continued by Chris Buckley) at
    Cornell.
  • Designed for laboratory experiments in IR
  • Easy to mix and match different weighting methods.

Paper: Salton, The SMART Retrieval System:
Experiments in Automatic Document Processing,
1971.
17
SMART Retrieval System (2)
  • In SMART, a term weight is decomposed into three
    factors: a local (within-document) weight, a global
    (collection-wide) weight, and a document-length
    normalization (a sketch follows below).

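A minimal sketch of this decomposition, assuming the usual SMART convention of a local factor times a global factor, followed by a document-length normalization (the particular formulas plugged in here are only examples):

```python
import math

# Minimal sketch of the SMART-style decomposition:
#   weight = local(tf) * global(df, N), then divide by a normalization factor.
def local_tf(tf):              # local factor: raw term frequency
    return tf

def global_idf(df, N):         # global factor: inverse document frequency
    return math.log10(N / df)

def cosine_norm(weights):      # normalization factor: Euclidean length
    return math.sqrt(sum(w * w for w in weights))

tfs = {"nuclear": 3, "fallout": 5, "siberia": 2}   # term frequencies in one document
dfs = {"nuclear": 2, "fallout": 3, "siberia": 1}   # document frequencies in the collection
N = 4                                              # collection size

raw = {t: local_tf(tf) * global_idf(dfs[t], N) for t, tf in tfs.items()}
length = cosine_norm(raw.values())
weights = {t: round(w / length, 2) for t, w in raw.items()}
print(weights)
```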
18
Local term-weighting formulas
  • Binary
  • Frequency
  • Maxnorm
  • Augmented Normalized
  • Alternate Log
19
Term frequency
  • TF (term frequency) - Count of times term occurs
    in document.

20
Term frequency (2)
  • The more times a term t occurs in document d the
    more likely it is that t is relevant to the
    document.
  • Used alone, it favors common words and long
    documents.
  • It gives too much credit to words that appear more
    frequently.
  • Typically used for query weighting.

21
Augmented Normalized Term Frequency
  • This formula was proposed by Croft:
    ntf = K + (1 − K) · tf / max_tf.
  • Usually K = 0.5.
  • K < 0.5 for large documents.
  • K = 0.5 for shorter documents.
  • The output varies between 0.5 and 1 for terms that
    appear in the document. It's a weak form of
    normalization (see the sketch after the next slide).

22
Logarithmic Term Frequency
  • Logarithms are a way to de-emphasize the effect
    of frequency.
  • The logarithmic formula decreases the effect of
    large differences in term frequencies (a sketch of
    these local formulas follows below).

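A minimal sketch of the local variants described above: raw tf, Croft's augmented normalized tf with parameter K, and a logarithmic form (the exact log variant from the original slide is not shown, so 1 + log(tf) is assumed here):

```python
import math

# Minimal sketch of local term-weighting variants.
def raw_tf(tf):
    return tf

def augmented_ntf(tf, max_tf, K=0.5):
    # Croft's augmented normalized tf: varies between K and 1
    # for terms that appear in the document.
    return K + (1 - K) * tf / max_tf if tf > 0 else 0.0

def log_tf(tf):
    # Assumed logarithmic variant: damps large tf differences.
    return 1 + math.log(tf) if tf > 0 else 0.0

print(augmented_ntf(7, max_tf=7), round(augmented_ntf(1, max_tf=7), 2))  # 1.0 0.57
print(log_tf(1), round(log_tf(100), 2))                                  # 1 5.61
```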
23
Global term-weighting formulas
  • Inverse
  • Squared
  • Probabilistic
  • Frequency
Document Frequency
  • DF = document frequency.
  • The number of documents in the whole collection in
    which the term appears.
  • The less frequently a term appears in the whole
    collection, the more discriminating it is.

25
Inverse Document Frequency
  • Measures the rarity of the term in the collection.
  • Inverts the document frequency.
  • It's the most used global formula.
  • Higher if the term occurs in fewer documents.
  • Gives full weight to terms that occur in one
    document only.
  • Gives the lowest weight to terms that occur in all
    documents.

26
Inverse Document Frequency (2)
  • IDF provides high values for rare words and low
    values for common words.

Examples for a collection of 10000 documents
(N = 10000); a sketch with illustrative values follows
below.
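A minimal sketch of the IDF computation for the same collection size; the document frequencies below are illustrative, not the ones from the original table:

```python
import math

# Minimal sketch: idf = log10(N / df) for a collection of N = 10000 documents.
# The df values below are illustrative, not those from the original slide.
N = 10000
for df in (1, 10, 100, 1000, 10000):
    print(f"df = {df:>5}  idf = {math.log10(N / df):.2f}")
# A term in a single document gets idf = 4.00;
# a term in every document gets idf = 0.00.
```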
27
Other IDF Schemes
  • Squared IDF is rarely used; it is a variant of IDF.
  • Probabilistic IDF: log((N − df) / df).
  • It assigns weights ranging from −∞ for a term
    that appears in every document to log(N − 1) for a
    term that appears in only one document.
  • Negative weights for terms appearing in more than
    half of the documents.

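A minimal sketch of the probabilistic IDF consistent with the behaviour described above (base-10 logarithms are assumed here):

```python
import math

# Minimal sketch: probabilistic idf = log((N - df) / df).
# Positive for df < N/2, zero at df = N/2, negative for df > N/2,
# and it diverges to -infinity as df approaches N.
def prob_idf(df, N):
    return math.log10((N - df) / df) if df < N else float("-inf")

N = 10000
for df in (1, 100, 5000, 9000, 10000):
    print(df, round(prob_idf(df, N), 2))
```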
28
Normalization formulas
  • Sum of weights
  • Cosine
  • Fourth
  • Max
29
Document Normalization
  • Long documents have an unfair advantage
  • They use a lot of terms
  • So they get more matches than short documents
  • And they use the same words repeatedly
  • So they have much higher term frequencies
  • Normalization seeks to remove these effects
  • It is loosely related to the maximum term frequency.
  • But it is also sensitive to the number of terms.
  • If we don't normalize, short documents may not be
    recognized as relevant.

30
Cosine Normalization
  • It's the most widely used normalization.
  • Normalizes the term weights so that longer documents
    are not unfairly given more weight.
  • If the weights are normalized, the cosine similarity
    reduces to a simple dot product (see the sketch
    below).

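A minimal sketch of cosine normalization; the weights happen to be those of document 1 in the example that follows, so the dot product against the query {contaminated, retrieval} comes out to 0.29:

```python
import math

# Minimal sketch: cosine-normalize a weighted document vector, after which
# the cosine similarity against a query is just a dot product.
def cosine_normalize(weights):
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

def dot_product(doc, query):
    return sum(w * query[t] for t, w in doc.items() if t in query)

doc = cosine_normalize({"contaminated": 0.50, "fallout": 0.63,
                        "nuclear": 0.90, "siberia": 1.20})
print(round(dot_product(doc, {"contaminated": 1, "retrieval": 1}), 2))  # 0.29
```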
31
Other normalizations
  • Sum-of-weights and fourth normalization are rarely
    used variants of cosine normalization.
  • Max Weight Normalization: divide each weight by the
    largest weight in the document.
  • It assigns weights between 0 and 1, but it
    doesn't take into account the distribution of
    terms over documents.
  • It gives high importance to the most heavily
    weighted terms within a document (used in
    CiteSeer).

32
TFIDF Term-weighting
  • w(i,j) = tf(i,j) · idf(i) = tf(i,j) · log(N / df(i)).
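The formula itself was shown as an image; a minimal sketch of the standard form w(i,j) = tf(i,j) · log10(N / df(i)), which reproduces the numbers in the example on the next slide:

```python
import math

# Minimal sketch: tf-idf weight of term i in document j,
# with idf computed as log10(N / df) over a collection of N documents.
def tf_idf(tf, df, N):
    return tf * math.log10(N / df)

N = 4                                   # the example collection has 4 documents
print(round(tf_idf(5, df=2, N=N), 2))   # "complicated",  tf = 5, df = 2 -> 1.51
print(round(tf_idf(4, df=3, N=N), 2))   # "contaminated", tf = 4, df = 3 -> 0.50
```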
33
TFIDF Example
  • It's the most used term-weighting scheme.

tf, idf and the resulting weights Wi,j = tf · idf for a
collection of N = 4 documents (idf = log10(N / df)):

term          tf: d1  d2  d3  d4    idf     Wi,j: d1    d2    d3    d4
complicated        -   -   5   2    0.301         -     -    1.51  0.60
contaminated       4   1   3   -    0.125        0.50  0.13  0.38   -
fallout            5   -   4   3    0.125        0.63   -    0.50  0.38
information        6   3   3   2    0.000        0.00  0.00  0.00  0.00
interesting        -   1   -   -    0.602         -    0.60   -     -
nuclear            3   -   7   -    0.301        0.90   -    2.11   -
retrieval          -   6   1   4    0.125         -    0.75  0.13  0.50
siberia            2   -   -   -    0.602        1.20   -     -     -
34
Normalization example
Cosine-normalized weights W'i,j = Wi,j / Length, where
Length is the Euclidean norm of each document's weight
vector:

term          Wi,j: d1    d2    d3    d4    W'i,j: d1    d2    d3    d4    idf
complicated         -     -    1.51  0.60          -     -    0.57  0.69   0.301
contaminated       0.50  0.13  0.38   -           0.29  0.13  0.14   -     0.125
fallout            0.63   -    0.50  0.38         0.37   -    0.19  0.44   0.125
information        0.00  0.00  0.00  0.00         0.00  0.00  0.00  0.00   0.000
interesting         -    0.60   -     -            -    0.62   -     -     0.602
nuclear            0.90   -    2.11   -           0.53   -    0.79   -     0.301
retrieval           -    0.75  0.13  0.50          -    0.77  0.05  0.57   0.125
siberia            1.20   -     -     -           0.71   -     -     -     0.602
Length             1.70  0.97  2.67  0.87
35
Retrieval Example
Query: contaminated, retrieval (binary query weights).

term          query   W'i,j: d1    d2    d3    d4
complicated     -            -     -    0.57  0.69
contaminated    1           0.29  0.13  0.14   -
fallout         -           0.37   -    0.19  0.44
information     -            -     -     -     -
interesting     -            -    0.62   -     -
nuclear         -           0.53   -    0.79   -
retrieval       1            -    0.77  0.05  0.57
siberia         -           0.71   -     -     -

Cosine similarity score     0.29  0.90  0.19  0.57

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3.
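A minimal sketch that reproduces this ranking from the normalized weights of the previous slide, scoring the binary query against each document with a dot product:

```python
# Minimal sketch: score the query {contaminated, retrieval} against the
# cosine-normalized document vectors from the normalization example.
normalized = {
    "Doc 1": {"contaminated": 0.29, "fallout": 0.37, "nuclear": 0.53, "siberia": 0.71},
    "Doc 2": {"contaminated": 0.13, "interesting": 0.62, "retrieval": 0.77},
    "Doc 3": {"complicated": 0.57, "contaminated": 0.14, "fallout": 0.19,
              "nuclear": 0.79, "retrieval": 0.05},
    "Doc 4": {"complicated": 0.69, "fallout": 0.44, "retrieval": 0.57},
}
query = {"contaminated": 1, "retrieval": 1}

scores = {doc: sum(w for t, w in vec.items() if t in query)
          for doc, vec in normalized.items()}
for doc, score in sorted(scores.items(), key=lambda x: -x[1]):
    print(doc, round(score, 2))  # Doc 2 (0.90), Doc 4 (0.57), Doc 1 (0.29), Doc 3 (0.19)
```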
36
Gerard Salton's paper: The term precision model
  • Weighting Schema proposed.
  • Cosine similarity.
  • Density formula.
  • Discrimination Value formulas.
  • Term Precision formulas.
  • Conclusions.

37
Gerard Salton's paper: Weighting schema proposed
  • Use of tf·idf formulas.
  • Underlines the importance of term weighting.
  • Use of cosine similarity.

38
Gerard Salton's paper: Density formula
  • Density = the average pairwise cosine similarity
    between distinct document pairs.
  • N = total number of documents.

39
Gerard Salton's paper: Discrimination Value
formulas
  • DV = Discrimination Value.
  • It's the difference between the two average
    densities, where s_k is the density for document
    pairs from which term k has been removed.
  • If k is useful, DV is positive (a sketch follows
    below).

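A minimal sketch of the density and discrimination value just described, assuming density is the average cosine similarity over all distinct document pairs and DV(k) is the density without term k minus the density with it:

```python
import itertools, math

# Minimal sketch: space density = average pairwise cosine similarity over
# distinct document pairs; DV(k) = density without term k minus density with it.
def cos(d1, d2):
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def density(docs):
    pairs = list(itertools.combinations(docs, 2))
    return sum(cos(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(docs, k):
    without_k = [{t: w for t, w in d.items() if t != k} for d in docs]
    return density(without_k) - density(docs)  # positive if k is a good discriminator
```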
40
Gerard Salton's paper: Discrimination Value
formulas
  • Terms with a high document frequency increase the
    total density, and DV is negative.
  • Terms with a low document frequency leave the
    density unchanged, and DV is near zero.
  • Terms with medium document frequency decrease the
    total density, and DV is positive.

41
Gerard Salton's paper: Term Precision formulas
  • N = total number of documents.
  • R = relevant documents with respect to a query.
  • I = (N − R) non-relevant documents.
  • r = relevant documents to which the term is assigned.
  • s = non-relevant documents to which the term is
    assigned (df = r + s).
  • w increases over 0 < df < R and decreases over
    R < df < N.
  • The maximum value of w is reached at df = R.

42
Gerard Salton's paper: Conclusions
  • Precision weights are difficult to compute in
    practice because required relevance assessments
    of documents with respect to queries are not
    normally available in real retrieval situations.

43
New Weighting Schemas
  • Web problems
  • Document Structure
  • Hyperlinks
  • Different weighting schemas

44
New Weighting Schemas (2)
  • Weight tokens under particular HTML tags more
    heavily
  • <TITLE> tokens (Google seems to like title
    matches).
  • <H1>, <H2> tokens.
  • <META> keyword tokens.
  • Parse the page into conceptual sections (e.g.
    navigation links vs. page content) and weight
    tokens differently based on section (see the
    sketch below).

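A minimal sketch of the idea; the boost factors and the parsing are hypothetical, chosen only for illustration:

```python
# Minimal sketch: count term occurrences with per-tag boosts, so that tokens
# under <TITLE>, <H1>/<H2> and <META> count more than body text.
# The boost values are hypothetical, chosen only for illustration.
TAG_BOOST = {"title": 5.0, "h1": 3.0, "h2": 2.0, "meta": 2.0, "body": 1.0}

def weighted_tf(sections):
    """sections: list of (tag, text) pairs extracted from the page."""
    tf = {}
    for tag, text in sections:
        boost = TAG_BOOST.get(tag, 1.0)
        for token in text.lower().split():
            tf[token] = tf.get(token, 0.0) + boost
    return tf

page = [("title", "nuclear fallout"), ("h1", "fallout in siberia"),
        ("body", "information about nuclear fallout and retrieval")]
print(weighted_tf(page))
```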
45
References
  • Gerard Salton and Chris Buckley. Term weighting
    approaches in automatic text retrieval. Information
    Processing and Management, 24(5):513-523, 1988.
  • Gerard Salton and M. J. McGill. Introduction to
    Modern Information Retrieval. McGraw-Hill Book
    Co., New York, 1983.
  • Gerard Salton, A. Wong, and C. S. Yang. A vector
    space model for Information Retrieval. Journal of
    the American Society for Information Science,
    18(11):613-620, November 1975.
  • Gerard Salton. The SMART Retrieval System:
    Experiments in Automatic Document Processing.
    Prentice Hall, Englewood Cliffs, N.J., 1971.

46
References (2)
  • Erica Chisholm and Tamara G. Kolda. New Term
    Weighting Formulas for the Vector Space Method in
    Information Retrieval. Computer Science and
    Mathematics Division, Oak Ridge National
    Laboratory, 1999.
  • W. B. Croft. Experiments with representation in a
    document retrieval system. Information Technology:
    Research and Development, 2(1):1-21, 1983.
  • Ray Larson and Marc Davis. SIMS 202: Information
    Organization and Retrieval. UC Berkeley SIMS,
    Lecture 18: Vector Representation, 2002.
  • Kishore Papineni. Why Inverse Document Frequency?
    IBM T. J. Watson Research Center, Yorktown Heights,
    New York, USA, 2001.
  • Chris Buckley. The importance of proper weighting
    methods. In M. Bates, editor, Human Language
    Technology. Morgan Kaufmann, 1993.

47
Questions?