Title: Term Weighting in Information Retrieval
1. Term Weighting in Information Retrieval
Web Information Retrieval
- Polettini Nicola
- Monday, June 5, 2006
2. Contents
- Introduction to Vector Space Model
- Vocabulary Terms
- Documents and Queries
- Similarity measures
- Term weighting
- Binary weights
- SMART Retrieval System
- Salton Term Precision Model
- Paper analysis
- New weighting schemas
- Web documents
- Conclusions
- References
3. The Vector Space Model
- Vocabulary.
- Terms.
- Documents and Queries.
- Vector representation.
- Similarity measures.
- Cosine Similarity.
4. Vocabulary
- Documents are represented as vectors in term space (all terms in the vocabulary).
- Queries are represented in the same way as documents.
- Query and document weights are based on the length and direction of their vectors.
- A vector distance measure between the query and the documents is used to rank retrieved documents.
5. Terms
- Documents are represented by binary or weighted vectors of terms.
- Terms are usually stems.
- Terms can also be n-grams:
- "Computer Science" is a bigram.
- "World Wide Web" is a trigram.
6. Document and Query Vectors
- Documents and queries are represented as bags of words (BOW).
- They are represented as vectors.
- A vector is like an array of floating-point numbers.
- It has direction and magnitude.
- Each vector holds a place for every term in the collection.
- Therefore, most vectors are sparse.
7. Vector Representation
- Documents and queries are represented as vectors.
- The vocabulary contains n terms.
- Position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.
8. Similarity Measures
- Simple matching
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
9. Cosine Similarity
- The similarity of two documents is the cosine of the angle between their vectors (see the formula below).
- This is called the cosine similarity.
- The normalization can be done when weighting the terms; otherwise, normalization and similarity computation can be combined.
- Cosine similarity sorts documents according to their degree of similarity.
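The slide's formula image is not reproduced in this text version; the standard form of the cosine similarity between a document vector d_j and a query vector q is:
\[
\mathrm{sim}(d_j, q) = \cos\theta = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}
\]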
10. Example: Computing Cosine Similarity
11. Example: Computing Cosine Similarity (2)
12. Term Weighting
- Binary weights
- SMART Retrieval System
- Local formulas
- Global formulas
- Normalization formulas
- TFIDF
13. Binary Weights
- Only the presence (1) or absence (0) of a term is included in the vector.
14. Binary Weights Formula
The binary formula gives every word that appears in a document equal relevance. It can be useful when frequency is not important.
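A minimal way to write the binary weight described above (notation assumed, since the slide's formula image is not shown here):
\[
w_{i,j} = \begin{cases} 1 & \text{if term } i \text{ appears in document } j \\ 0 & \text{otherwise} \end{cases}
\]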
15. Why Use Term Weighting?
- Binary weights are too limiting (terms are either present or absent).
- Non-binary weights allow us to model partial matching (partial matching allows retrieval of documents that approximate the query).
- Retrieved documents can be ranked by best match (term weighting improves the quality of the answer set).
16. SMART Retrieval System
- SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.
- Designed for laboratory experiments in IR.
- Easy to mix and match different weighting methods.
Paper: Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, 1971.
17. SMART Retrieval System (2)
- In SMART, weights are decomposed into three factors: a local factor, a global factor, and a normalization factor (see below).
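The three factors are combined multiplicatively; a common way to write the decomposition (notation assumed here, since the slide's diagram is not reproduced) is:
\[
w_{i,j} = L_{i,j} \times G_i \times N_j
\]
where L is the local weight of term i in document j, G is the global weight of term i over the collection, and N is the normalization factor for document j.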
18. Local Term-Weighting Formulas
- Binary
- Frequency
- Maxnorm
- Augmented Normalized
- Alternate Log
19. Term Frequency
- TF (term frequency): the number of times a term occurs in a document.
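In symbols (notation assumed, as the slide's formula is not reproduced):
\[
tf_{i,j} = \text{number of occurrences of term } i \text{ in document } j
\]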
20. Term Frequency (2)
- The more times a term t occurs in document d, the more likely it is that t is relevant to the document.
- Used alone, it favors common words and long documents.
- It gives too much credit to words that appear more frequently.
- Typically used for query weighting.
21. Augmented Normalized Term Frequency
- This formula was proposed by Croft.
- Usually K = 0.5.
- K < 0.5 for large documents.
- K = 0.5 for shorter documents.
- The output varies between 0.5 and 1 for terms that appear in the document. It is a weak form of normalization.
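The formula image is not reproduced here; the augmented normalized term frequency is usually written as
\[
atf_{i,j} = K + (1 - K)\,\frac{tf_{i,j}}{\max_{k} tf_{k,j}}
\]
where the maximum is taken over all terms of document j. With K = 0.5 the output lies between 0.5 and 1 for terms that appear in the document, as stated above.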
22. Logarithmic Term Frequency
- Logarithms are a way to de-emphasize the effect of frequency.
- The logarithmic formula decreases the effect of large differences in term frequencies.
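The slide's exact formula is not shown in this text version; common logarithmic variants are
\[
l_{i,j} = 1 + \log(tf_{i,j}) \quad (tf_{i,j} > 0) \qquad \text{or} \qquad l_{i,j} = \log(1 + tf_{i,j}),
\]
both of which compress large differences in raw term frequency.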
23. Global Term-Weighting Formulas
- Inverse Document Frequency (IDF)
- Squared IDF
- Probabilistic IDF
24. Document Frequency
- DF: document frequency.
- Counts how often a term appears considering the whole collection of documents.
- The less frequently a term appears in the whole collection, the more discriminating it is.
25. Inverse Document Frequency
- Measures the rarity of the term in the collection.
- Inverts the document frequency.
- It is the most widely used global formula.
- Higher if the term occurs in fewer documents.
- Gives full weight to terms that occur in one document only.
- Gives the lowest weight to terms that occur in all documents.
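The standard inverse document frequency formula (consistent with the examples that follow) is
\[
idf_i = \log\frac{N}{df_i}
\]
where N is the number of documents in the collection and df_i is the number of documents containing term i.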
26. Inverse Document Frequency (2)
- IDF provides high values for rare words and low values for common words.
Examples for a collection of 10,000 documents (N = 10000):
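The original table of values is not reproduced here; assuming a base-10 logarithm, idf = log10(10000 / df) gives for example:
- df = 1: idf = 4
- df = 100: idf = 2
- df = 1000: idf = 1
- df = 10000: idf = 0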
27. Other IDF Schemes
- Squared IDF: rarely used, a variant of IDF.
- Probabilistic IDF:
- It assigns weights ranging from -∞ for a term that appears in every document to log(N-1) for a term that appears in only one document.
- It gives negative weights to terms appearing in more than half of the documents.
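A probabilistic IDF consistent with the behaviour described above (-∞ when the term appears in every document, log(N-1) when it appears in only one, negative when it appears in more than half) is
\[
idf^{prob}_i = \log\frac{N - df_i}{df_i},
\]
while the squared variant is simply the ordinary IDF squared, (log(N/df_i))^2.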
28. Normalization Formulas
- Sum of weights
- Cosine
- Fourth
- Max weight
29. Document Normalization
- Long documents have an unfair advantage
- They use a lot of terms
- So they get more matches than short documents
- And they use the same words repeatedly
- So they have much higher term frequencies
- Normalization seeks to remove these effects.
- It is related to the maximum term frequency.
- But it is also sensitive to the number of terms.
- If we don't normalize, short documents may not be recognized as relevant.
30. Cosine Normalization
- It is the most widely used and popular normalization.
- It normalizes the term weights (so longer documents are not unfairly given more weight).
- If the weights are normalized, the cosine similarity reduces to a simple inner product of the normalized vectors.
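In symbols (a standard form, since the slide's formula image is not reproduced), cosine normalization divides each weight by the length of the document vector:
\[
w'_{i,j} = \frac{w_{i,j}}{\sqrt{\sum_{k=1}^{n} w_{k,j}^{2}}}
\]
so that every document vector has unit length and the cosine similarity reduces to the inner product of the normalized vectors.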
31. Other Normalizations
- Sum-of-weights and fourth normalization are rarely used variants of cosine normalization.
- Max Weight Normalization:
- It assigns weights between 0 and 1, but it doesn't take into account the distribution of terms over documents.
- It gives high importance to the most heavily weighted terms within a document (used in CiteSeer).
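A standard way to write max weight normalization (notation assumed, as the slide's formula is not shown here):
\[
w'_{i,j} = \frac{w_{i,j}}{\max_{k} w_{k,j}}
\]
which scales the largest weight in document j to 1 and all other weights into [0, 1].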
32. TFIDF Term-Weighting
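The slide's formula image is not reproduced here; the TFIDF weight combines the local and global factors described above:
\[
w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \log\frac{N}{df_i}
\]
This is the weighting used in the worked example on the next slide.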
33. TFIDF Example
- It is the most widely used term weighting.

Term frequencies (tf), weights (Wi,j = tf x idf) and idf values for a collection of N = 4 documents:

Term         | tf: D1 D2 D3 D4 | Wi,j: D1    D2    D3    D4   | idf
complicated  |     -  -  5  2  |       -     -     1.51  0.60 | 0.301
contaminated |     4  1  3  -  |       0.50  0.13  0.38  -    | 0.125
fallout      |     5  -  4  3  |       0.63  -     0.50  0.38 | 0.125
information  |     6  3  3  2  |       0.00  0.00  0.00  0.00 | 0.000
interesting  |     -  1  -  -  |       -     0.60  -     -    | 0.602
nuclear      |     3  -  7  -  |       0.90  -     2.11  -    | 0.301
retrieval    |     -  6  1  4  |       -     0.75  0.13  0.50 | 0.125
siberia      |     2  -  -  -  |       1.20  -     -     -    | 0.602
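For example, "nuclear" appears in 2 of the N = 4 documents, so idf = log10(4/2) = 0.301, and its weight in document 3 (tf = 7) is
\[
w_{nuclear,3} = 7 \times 0.301 \approx 2.11
\]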
34. Normalization Example

Cosine-normalized weights W'i,j = Wi,j / document length:

Term         | Wi,j: D1    D2    D3    D4   | W'i,j: D1    D2    D3    D4   | idf
complicated  |       -     -     1.51  0.60 |        -     -     0.57  0.69 | 0.301
contaminated |       0.50  0.13  0.38  -    |        0.29  0.13  0.14  -    | 0.125
fallout      |       0.63  -     0.50  0.38 |        0.37  -     0.19  0.44 | 0.125
information  |       0.00  0.00  0.00  0.00 |        0.00  0.00  0.00  0.00 | 0.000
interesting  |       -     0.60  -     -    |        -     0.62  -     -    | 0.602
nuclear      |       0.90  -     2.11  -    |        0.53  -     0.79  -    | 0.301
retrieval    |       -     0.75  0.13  0.50 |        -     0.77  0.05  0.57 | 0.125
siberia      |       1.20  -     -     -    |        0.71  -     -     -    | 0.602
Length       |       1.70  0.97  2.67  0.87 |
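Each length is the Euclidean norm of the document's weight vector; for document 1,
\[
\|d_1\| = \sqrt{0.50^2 + 0.63^2 + 0.90^2 + 1.20^2} \approx 1.70,
\]
so, for instance, the normalized weight of "siberia" in document 1 is 1.20 / 1.70 ≈ 0.71.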
35. Retrieval Example

Query: contaminated retrieval
Query vector (binary weights): contaminated = 1, retrieval = 1, all other terms = 0.

Cosine similarity scores (inner product of the query vector with the normalized document weights W'i,j):
Doc 1 = 0.29, Doc 2 = 0.90, Doc 3 = 0.19, Doc 4 = 0.57

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
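As a cross-check, here is a minimal Python sketch (not part of the original slides) that reproduces this worked example end to end: base-10 TFIDF weights, cosine normalization, and binary query weights. The term frequencies are those of the TFIDF Example slide; small rounding differences from the slide values are expected.

import math

# Term frequencies per document, taken from the TFIDF Example slide (N = 4 documents).
tf = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}
N = 4
docs = [1, 2, 3, 4]

# Global weight: inverse document frequency, idf = log10(N / df).
idf = {t: math.log10(N / len(postings)) for t, postings in tf.items()}

# Local x global weight: w = tf * idf.
w = {d: {t: tf[t].get(d, 0) * idf[t] for t in tf} for d in docs}

# Cosine normalization: divide each weight by the document vector length.
length = {d: math.sqrt(sum(x * x for x in w[d].values())) for d in docs}
w_norm = {d: {t: w[d][t] / length[d] for t in tf} for d in docs}

# Binary query weights; the cosine similarity reduces to a dot product.
query = {"contaminated": 1, "retrieval": 1}
scores = {d: sum(q * w_norm[d][t] for t, q in query.items()) for d in docs}

# Prints Doc 2: 0.90, Doc 4: 0.58, Doc 1: 0.29, Doc 3: 0.19 (same ranking as the slide).
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"Doc {d}: {s:.2f}")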
36. Gerard Salton's Paper: The Term Precision Model
- Weighting Schema proposed.
- Cosine similarity.
- Density formula.
- Discrimination Value formulas.
- Term Precision formulas.
- Conclusions.
37. Gerard Salton's Paper: Proposed Weighting Schema
- Use of tf-idf formulas.
- Underlines the importance of term weighting.
- Use of cosine similarity.
38. Gerard Salton's Paper: Density Formula
- Density: the average pairwise cosine similarity between distinct document pairs.
- N: total number of documents.
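Written out (the slide's own formula image is not reproduced; this is the standard way to express the average pairwise similarity over distinct pairs):
\[
Q = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} \mathrm{sim}(d_i, d_j)
\]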
39. Gerard Salton's Paper: Discrimination Value Formulas
- DV: Discrimination Value.
- It is the difference between the two average densities, where s_k is the density computed over document pairs from which term k has been removed.
- If k is a useful discriminator, DV is positive.
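Following the notation above, with Q the density of the original collection and s_k the density after removing term k, the discrimination value of term k is
\[
DV_k = s_k - Q,
\]
which is positive when removing k makes the documents more similar to each other, i.e. when k was helping to tell them apart.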
40. Gerard Salton's Paper: Discrimination Value Formulas (2)
- Terms with a high document frequency increase the total density, so their DV is negative.
- Terms with a low document frequency leave the density almost unchanged, so their DV is near zero.
- Terms with a medium document frequency decrease the total density, so their DV is positive.
41. Gerard Salton's Paper: Term Precision Formulas
- N: total number of documents.
- R: relevant documents with respect to a query.
- I = (N - R): non-relevant documents.
- r: relevant documents to which the term is assigned.
- s: non-relevant documents to which the term is assigned (df = r + s).
- w increases for 0 < df < R and decreases for R < df < N.
- The maximum value of w is reached at df = R.
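The slides do not reproduce the precision weight itself. One simple weight that behaves exactly as described (increasing for 0 < df < R, maximal at df = R, then decreasing, assuming the term is assigned to relevant documents before non-relevant ones) is the difference between the fraction of relevant and of non-relevant documents that receive the term; this is an illustrative form, not necessarily the exact formula of the paper:
\[
w = \frac{r}{R} - \frac{s}{I}
\]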
42. Gerard Salton's Paper: Conclusions
- Precision weights are difficult to compute in
practice because required relevance assessments
of documents with respect to queries are not
normally available in real retrieval situations.
43. New Weighting Schemas
- Web problems
- Document Structure
- Hyperlinks
- Different weighting schemas
44. New Weighting Schemas (2)
- Weight tokens under particular HTML tags more heavily:
- <TITLE> tokens (Google seems to like title matches).
- <H1>, <H2> tokens.
- <META> keyword tokens.
- Parse the page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section (a sketch follows below).
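As an illustration (not part of the original slides), a minimal Python sketch of tag-based token weighting, assuming the BeautifulSoup library is available; the tag list and boost values are arbitrary examples, not recommended settings.

from collections import Counter
from bs4 import BeautifulSoup

# Assumed boost factors per HTML tag (illustrative values only).
TAG_BOOST = {"title": 5.0, "h1": 3.0, "h2": 2.0}

def weighted_token_counts(html: str) -> Counter:
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    # All page text gets a base weight of 1 per occurrence.
    for token in soup.get_text(" ").lower().split():
        counts[token] += 1.0
    # Tokens under boosted tags receive extra credit.
    for tag, boost in TAG_BOOST.items():
        for element in soup.find_all(tag):
            for token in element.get_text(" ").lower().split():
                counts[token] += boost
    # <META name="keywords"> tokens, if present (assumed boost of 2).
    for meta in soup.find_all("meta", attrs={"name": "keywords"}):
        for token in meta.get("content", "").lower().split(","):
            if token.strip():
                counts[token.strip()] += 2.0
    return counts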
45. References
- Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
- Gerard Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York, 1983.
- Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, November 1975.
- Gerard Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, N.J., 1971.
46. References (2)
- Erica Chisholm and Tamara G. Kolda. New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.
- W. B. Croft. Experiments with representation in a document retrieval system. Information Technology: Research and Development, 2:1-21, 1983.
- Ray Larson and Marc Davis. SIMS 202: Information Organization and Retrieval. UC Berkeley SIMS, Lecture 18: Vector Representation, 2002.
- Kishore Papineni. Why Inverse Document Frequency? IBM T.J. Watson Research Center, Yorktown Heights, New York, USA, 2001.
- Chris Buckley. The importance of proper weighting methods. In M. Bates, editor, Human Language Technology. Morgan Kaufmann, 1993.
47. Questions?