Title: Term Weighting in Information Retrieval
1. Term Weighting in Information Retrieval
Web Information Retrieval
- Polettini Nicola
- Monday, June 5, 2006
2. Contents
- Introduction to Vector Space Model
- Vocabulary Terms
- Documents and Queries
- Similarity measures
- Term weighting
- Binary weights
- SMART Retrieval System
- Salton Term Precision Model
- Paper analysis
- New weighting schemas
- Web documents
- Conclusions
- References
3. The Vector Space Model
- Vocabulary.
- Terms.
- Documents and Queries.
- Vector representation.
- Similarity measures.
- Cosine Similarity.
4. Vocabulary
- Documents are represented as vectors in term space (all terms in the vocabulary).
- Queries are represented in the same way as documents.
- Query and document weights are based on the length and direction of their vectors.
- A vector distance measure between the query and the documents is used to rank retrieved documents.
5. Terms
- Documents are represented by binary or weighted vectors of terms.
- Terms are usually stems.
- Terms can also be n-grams:
- "Computer Science" is a bigram.
- "World Wide Web" is a trigram.
6. Document and Query Vectors
- Documents and queries are represented as bags of words (BOW).
- They are represented as vectors.
- A vector is like an array of floating-point numbers.
- It has direction and magnitude.
- Each vector holds a place for every term in the collection.
- Therefore, most vectors are sparse.
7. Vector Representation
- Documents and queries are represented as vectors.
- The vocabulary contains n terms.
- Position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.
8. Similarity Measures
- Simple matching
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
9. Cosine Similarity
- The similarity of two documents is the cosine of the angle between their vectors (see the formula below).
- This is called the cosine similarity.
- The normalization can be done when weighting the terms; otherwise, normalization and similarity computation can be combined.
- Cosine similarity sorts documents according to their degree of similarity.
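The slide's formula image is not reproduced in this text version; the standard form of the cosine similarity between a document vector d_j and a query vector q is:
\[
\mathrm{sim}(d_j, q) = \cos\theta = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}
\]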
10. Example: Computing Cosine Similarity
11. Example: Computing Cosine Similarity (2)
12. Term Weighting
- Binary weights
- SMART Retrieval System
- Local formulas
- Global formulas
- Normalization formulas
- TFIDF
13. Binary Weights
- Only the presence (1) or absence (0) of a term is included in the vector.
14. Binary Weights Formula
The binary formula gives every word that appears in a document equal relevance. It can be useful when frequency is not important.
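A minimal way to write the binary weight described above (notation assumed, since the slide's formula image is not shown here):
\[
w_{i,j} = \begin{cases} 1 & \text{if term } i \text{ appears in document } j \\ 0 & \text{otherwise} \end{cases}
\]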
15. Why Use Term Weighting?
- Binary weights are too limiting (terms are either present or absent).
- Non-binary weights allow us to model partial matching (partial matching allows retrieval of documents that approximate the query).
- Retrieved documents can be ranked by best match (term weighting improves the quality of the answer set).
16. SMART Retrieval System
- SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.
- Designed for laboratory experiments in IR.
- Easy to mix and match different weighting methods.
Paper: Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, 1971.
17. SMART Retrieval System (2)
- In SMART, weights are decomposed into three factors: a local factor, a global factor, and a normalization factor (see below).
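The three factors are combined multiplicatively; a common way to write the decomposition (notation assumed here, since the slide's diagram is not reproduced) is:
\[
w_{i,j} = L_{i,j} \times G_i \times N_j
\]
where L is the local weight of term i in document j, G is the global weight of term i over the collection, and N is the normalization factor for document j.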
18. Local Term-Weighting Formulas
- Binary
- Frequency
- Maxnorm
- Augmented Normalized
- Alternate Log
19. Term Frequency
- TF (term frequency): the number of times a term occurs in a document.
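In symbols (notation assumed, as the slide's formula is not reproduced):
\[
tf_{i,j} = \text{number of occurrences of term } i \text{ in document } j
\]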
20. Term Frequency (2)
- The more times a term t occurs in document d, the more likely it is that t is relevant to the document.
- Used alone, it favors common words and long documents.
- It gives too much credit to words that appear more frequently.
- Typically used for query weighting.
21. Augmented Normalized Term Frequency
- This formula was proposed by Croft.
- Usually K = 0.5.
- K < 0.5 for large documents.
- K = 0.5 for shorter documents.
- The output varies between 0.5 and 1 for terms that appear in the document. It is a weak form of normalization.
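The formula image is not reproduced here; the augmented normalized term frequency is usually written as
\[
atf_{i,j} = K + (1 - K)\,\frac{tf_{i,j}}{\max_{k} tf_{k,j}}
\]
where the maximum is taken over all terms of document j. With K = 0.5 the output lies between 0.5 and 1 for terms that appear in the document, as stated above.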
22. Logarithmic Term Frequency
- Logarithms are a way to de-emphasize the effect of frequency.
- The logarithmic formula decreases the effect of large differences in term frequencies.
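The slide's exact formula is not shown in this text version; common logarithmic variants are
\[
l_{i,j} = 1 + \log(tf_{i,j}) \quad (tf_{i,j} > 0) \qquad \text{or} \qquad l_{i,j} = \log(1 + tf_{i,j}),
\]
both of which compress large differences in raw term frequency.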
23. Global Term-Weighting Formulas
- Inverse Document Frequency (IDF)
- Squared IDF
- Probabilistic IDF
24. Document Frequency
- DF: document frequency.
- Counts how often a term appears considering the whole collection of documents.
- The less frequently a term appears in the whole collection, the more discriminating it is.
25. Inverse Document Frequency
- Measures the rarity of the term in the collection.
- Inverts the document frequency.
- It is the most widely used global formula.
- Higher if the term occurs in fewer documents.
- Gives full weight to terms that occur in one document only.
- Gives the lowest weight to terms that occur in all documents.
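The standard inverse document frequency formula (consistent with the examples that follow) is
\[
idf_i = \log\frac{N}{df_i}
\]
where N is the number of documents in the collection and df_i is the number of documents containing term i.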
26. Inverse Document Frequency (2)
- IDF provides high values for rare words and low values for common words.
Examples for a collection of 10,000 documents (N = 10000):
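The original table of values is not reproduced here; assuming a base-10 logarithm, idf = log10(10000 / df) gives for example:
- df = 1: idf = 4
- df = 100: idf = 2
- df = 1000: idf = 1
- df = 10000: idf = 0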
27. Other IDF Schemes
- Squared IDF: rarely used, a variant of IDF.
- Probabilistic IDF:
- It assigns weights ranging from -∞ for a term that appears in every document to log(N-1) for a term that appears in only one document.
- It gives negative weights to terms appearing in more than half of the documents.
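A probabilistic IDF consistent with the behaviour described above (-∞ when the term appears in every document, log(N-1) when it appears in only one, negative when it appears in more than half) is
\[
idf^{prob}_i = \log\frac{N - df_i}{df_i},
\]
while the squared variant is simply the ordinary IDF squared, (log(N/df_i))^2.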
28. Normalization Formulas
- Sum of weights
- Cosine
- Fourth
- Max weight
29. Document Normalization
- Long documents have an unfair advantage
- They use a lot of terms
- So they get more matches than short documents
- And they use the same words repeatedly
- So they have much higher term frequencies
- Normalization seeks to remove these effects.
- It is related to the maximum term frequency.
- But it is also sensitive to the number of terms.
- If we don't normalize, short documents may not be recognized as relevant.
30. Cosine Normalization
- It is the most widely used and popular normalization.
- It normalizes the term weights (so longer documents are not unfairly given more weight).
- If the weights are normalized, the cosine similarity reduces to a simple inner product of the normalized vectors.
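In symbols (a standard form, since the slide's formula image is not reproduced), cosine normalization divides each weight by the length of the document vector:
\[
w'_{i,j} = \frac{w_{i,j}}{\sqrt{\sum_{k=1}^{n} w_{k,j}^{2}}}
\]
so that every document vector has unit length and the cosine similarity reduces to the inner product of the normalized vectors.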
31. Other Normalizations
- Sum-of-weights and fourth normalization are rarely used variants of cosine normalization.
- Max Weight Normalization:
- It assigns weights between 0 and 1, but it doesn't take into account the distribution of terms over documents.
- It gives high importance to the most heavily weighted terms within a document (used in CiteSeer).
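A standard way to write max weight normalization (notation assumed, as the slide's formula is not shown here):
\[
w'_{i,j} = \frac{w_{i,j}}{\max_{k} w_{k,j}}
\]
which scales the largest weight in document j to 1 and all other weights into [0, 1].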
32. TFIDF Term-Weighting
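The slide's formula image is not reproduced here; the TFIDF weight combines the local and global factors described above:
\[
w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \log\frac{N}{df_i}
\]
This is the weighting used in the worked example on the next slide.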
33. TFIDF Example
- It is the most widely used term weighting.

Term frequencies (tf), weights (Wi,j = tf x idf) and idf values for a collection of N = 4 documents:

Term         | tf: D1 D2 D3 D4 | Wi,j: D1    D2    D3    D4   | idf
complicated  |     -  -  5  2  |       -     -     1.51  0.60 | 0.301
contaminated |     4  1  3  -  |       0.50  0.13  0.38  -    | 0.125
fallout      |     5  -  4  3  |       0.63  -     0.50  0.38 | 0.125
information  |     6  3  3  2  |       0.00  0.00  0.00  0.00 | 0.000
interesting  |     -  1  -  -  |       -     0.60  -     -    | 0.602
nuclear      |     3  -  7  -  |       0.90  -     2.11  -    | 0.301
retrieval    |     -  6  1  4  |       -     0.75  0.13  0.50 | 0.125
siberia      |     2  -  -  -  |       1.20  -     -     -    | 0.602
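For example, "nuclear" appears in 2 of the N = 4 documents, so idf = log10(4/2) = 0.301, and its weight in document 3 (tf = 7) is
\[
w_{nuclear,3} = 7 \times 0.301 \approx 2.11
\]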
34. Normalization Example

Cosine-normalized weights W'i,j = Wi,j / document length:

Term         | Wi,j: D1    D2    D3    D4   | W'i,j: D1    D2    D3    D4   | idf
complicated  |       -     -     1.51  0.60 |        -     -     0.57  0.69 | 0.301
contaminated |       0.50  0.13  0.38  -    |        0.29  0.13  0.14  -    | 0.125
fallout      |       0.63  -     0.50  0.38 |        0.37  -     0.19  0.44 | 0.125
information  |       0.00  0.00  0.00  0.00 |        0.00  0.00  0.00  0.00 | 0.000
interesting  |       -     0.60  -     -    |        -     0.62  -     -    | 0.602
nuclear      |       0.90  -     2.11  -    |        0.53  -     0.79  -    | 0.301
retrieval    |       -     0.75  0.13  0.50 |        -     0.77  0.05  0.57 | 0.125
siberia      |       1.20  -     -     -    |        0.71  -     -     -    | 0.602
Length       |       1.70  0.97  2.67  0.87 |
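Each length is the Euclidean norm of the document's weight vector; for document 1,
\[
\|d_1\| = \sqrt{0.50^2 + 0.63^2 + 0.90^2 + 1.20^2} \approx 1.70,
\]
so, for instance, the normalized weight of "siberia" in document 1 is 1.20 / 1.70 ≈ 0.71.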
35. Retrieval Example

Query: contaminated retrieval
Query vector (binary weights): contaminated = 1, retrieval = 1, all other terms = 0.

Cosine similarity scores (inner product of the query vector with the normalized document weights W'i,j):
Doc 1 = 0.29, Doc 2 = 0.90, Doc 3 = 0.19, Doc 4 = 0.57

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
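As a cross-check, here is a minimal Python sketch (not part of the original slides) that reproduces this worked example end to end: base-10 TFIDF weights, cosine normalization, and binary query weights. The term frequencies are those of the TFIDF Example slide; small rounding differences from the slide values are expected.

import math

# Term frequencies per document, taken from the TFIDF Example slide (N = 4 documents).
tf = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}
N = 4
docs = [1, 2, 3, 4]

# Global weight: inverse document frequency, idf = log10(N / df).
idf = {t: math.log10(N / len(postings)) for t, postings in tf.items()}

# Local x global weight: w = tf * idf.
w = {d: {t: tf[t].get(d, 0) * idf[t] for t in tf} for d in docs}

# Cosine normalization: divide each weight by the document vector length.
length = {d: math.sqrt(sum(x * x for x in w[d].values())) for d in docs}
w_norm = {d: {t: w[d][t] / length[d] for t in tf} for d in docs}

# Binary query weights; the cosine similarity reduces to a dot product.
query = {"contaminated": 1, "retrieval": 1}
scores = {d: sum(q * w_norm[d][t] for t, q in query.items()) for d in docs}

# Prints Doc 2: 0.90, Doc 4: 0.58, Doc 1: 0.29, Doc 3: 0.19 (same ranking as the slide).
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"Doc {d}: {s:.2f}")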
36. Gerard Salton's Paper: The Term Precision Model
- Weighting Schema proposed.
- Cosine similarity.
- Density formula.
- Discrimination Value formulas.
- Term Precision formulas.
- Conclusions.
37. Gerard Salton's Paper: Proposed Weighting Schema
- Use of tf-idf formulas.
- Underlines the importance of term weighting.
- Use of cosine similarity.
38. Gerard Salton's Paper: Density Formula
- Density: the average pairwise cosine similarity between distinct document pairs.
- N: total number of documents.
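Written out (the slide's own formula image is not reproduced; this is the standard way to express the average pairwise similarity over distinct pairs):
\[
Q = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} \mathrm{sim}(d_i, d_j)
\]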
39. Gerard Salton's Paper: Discrimination Value Formulas
- DV: Discrimination Value.
- It is the difference between the two average densities, where s_k is the density computed over document pairs from which term k has been removed.
- If k is a useful discriminator, DV is positive.
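Following the notation above, with Q the density of the original collection and s_k the density after removing term k, the discrimination value of term k is
\[
DV_k = s_k - Q,
\]
which is positive when removing k makes the documents more similar to each other, i.e. when k was helping to tell them apart.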
40. Gerard Salton's Paper: Discrimination Value Formulas (2)
- Terms with a high document frequency increase the total density, so their DV is negative.
- Terms with a low document frequency leave the density almost unchanged, so their DV is near zero.
- Terms with a medium document frequency decrease the total density, so their DV is positive.
41. Gerard Salton's Paper: Term Precision Formulas
- N: total number of documents.
- R: relevant documents with respect to a query.
- I = (N - R): non-relevant documents.
- r: relevant documents to which the term is assigned.
- s: non-relevant documents to which the term is assigned (df = r + s).
- w increases for 0 < df < R and decreases for R < df < N.
- The maximum value of w is reached at df = R.
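The slides do not reproduce the precision weight itself. One simple weight that behaves exactly as described (increasing for 0 < df < R, maximal at df = R, then decreasing, assuming the term is assigned to relevant documents before non-relevant ones) is the difference between the fraction of relevant and of non-relevant documents that receive the term; this is an illustrative form, not necessarily the exact formula of the paper:
\[
w = \frac{r}{R} - \frac{s}{I}
\]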
42. Gerard Salton's Paper: Conclusions
- Precision weights are difficult to compute in
practice because required relevance assessments
of documents with respect to queries are not
normally available in real retrieval situations.
43. New Weighting Schemas
- Web problems
- Document Structure
- Hyperlinks
- Different weighting schemas
44. New Weighting Schemas (2)
- Weight tokens under particular HTML tags more heavily:
- <TITLE> tokens (Google seems to like title matches).
- <H1>, <H2> tokens.
- <META> keyword tokens.
- Parse the page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section (a sketch follows below).
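As an illustration (not part of the original slides), a minimal Python sketch of tag-based token weighting, assuming the BeautifulSoup library is available; the tag list and boost values are arbitrary examples, not recommended settings.

from collections import Counter
from bs4 import BeautifulSoup

# Assumed boost factors per HTML tag (illustrative values only).
TAG_BOOST = {"title": 5.0, "h1": 3.0, "h2": 2.0}

def weighted_token_counts(html: str) -> Counter:
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    # All page text gets a base weight of 1 per occurrence.
    for token in soup.get_text(" ").lower().split():
        counts[token] += 1.0
    # Tokens under boosted tags receive extra credit.
    for tag, boost in TAG_BOOST.items():
        for element in soup.find_all(tag):
            for token in element.get_text(" ").lower().split():
                counts[token] += boost
    # <META name="keywords"> tokens, if present (assumed boost of 2).
    for meta in soup.find_all("meta", attrs={"name": "keywords"}):
        for token in meta.get("content", "").lower().split(","):
            if token.strip():
                counts[token.strip()] += 2.0
    return counts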
45. References
- Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
- Gerard Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York, 1983.
- Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, November 1975.
- Gerard Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, N.J., 1971.
46. References (2)
- Erica Chisholm and Tamara G. Kolda. New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 1999.
- W. B. Croft. Experiments with representation in a document retrieval system. Information Technology: Research and Development, 2:1-21, 1983.
- Ray Larson and Marc Davis. SIMS 202: Information Organization and Retrieval. UC Berkeley SIMS, Lecture 18: Vector Representation, 2002.
- Kishore Papineni. Why Inverse Document Frequency? IBM T.J. Watson Research Center, Yorktown Heights, New York, USA, 2001.
- Chris Buckley. The importance of proper weighting methods. In M. Bates, editor, Human Language Technology. Morgan Kaufmann, 1993.
47. Questions?