Title: A Formal Study of Information Retrieval Heuristics
1. A Formal Study of Information Retrieval Heuristics
- Hui Fang, Tao Tao, ChengXiang Zhai
- University of Illinois at Urbana-Champaign, Urbana
- SIGIR 2004
Presented by CHU Huei-Ming 2004/01/17
2. Outline
- Formal Definitions of Heuristic Retrieval Constraints
- Analysis of Three Representative Retrieval Formulas
  - Pivoted Normalization Method
  - Okapi Method
  - Dirichlet Prior Method
- Experiments
- Conclusion and Future Work
3. Formal Definitions of Heuristic Retrieval Constraints
- Six intuitive and desirable constraints that any reasonable retrieval formula should satisfy
  - Term Frequency Constraints (TFC1, TFC2)
  - Term Discrimination Constraint (TDC)
  - Length Normalization Constraints (LNC1, LNC2)
  - TF-Length Constraint (TF-LNC)
4. Formal Definitions of Heuristic Retrieval Constraints
- Term Frequency Constraints (TFCs)
  - TFC1
    - Let q = {w} be a query with a single term w. Assume |d1| = |d2|. If c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
  - TFC2
    - Let q = {w}. Assume |d1| = |d2| = |d3| and c(w,d1) > 0.
    - If c(w,d2) - c(w,d1) = 1 and c(w,d3) - c(w,d2) = 1, then f(d2,q) - f(d1,q) > f(d3,q) - f(d2,q).
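The two term-frequency constraints can be checked numerically for any candidate TF transform. The sketch below is illustrative only: `tf_pivoted` is the sublinear transform 1 + ln(1 + ln c) associated with pivoted normalization, and the helper names and the `max_count` range are assumptions made for this example. Document lengths are taken as equal, so only the TF transform matters.

```python
import math

def tf_pivoted(count: int) -> float:
    """Sublinear TF transform 1 + ln(1 + ln(c)); zero for an absent term."""
    return 1.0 + math.log(1.0 + math.log(count)) if count > 0 else 0.0

def satisfies_tfc1(tf, max_count: int = 50) -> bool:
    """TFC1: with equal lengths, a higher count must give a higher score."""
    return all(tf(c + 1) > tf(c) for c in range(0, max_count))

def satisfies_tfc2(tf, max_count: int = 50) -> bool:
    """TFC2: the gain from each extra occurrence must shrink (c(w,d1) > 0)."""
    return all(tf(c + 1) - tf(c) > tf(c + 2) - tf(c + 1)
               for c in range(1, max_count))

# The sublinear transform passes both checks on the tested range.
print(satisfies_tfc1(tf_pivoted), satisfies_tfc2(tf_pivoted))
# A raw count (identity transform) passes TFC1 but not TFC2.
print(satisfies_tfc1(float), satisfies_tfc2(float))
```

A raw term count rewards every additional occurrence equally, which is why it fails the diminishing-returns requirement of TFC2.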
5. Formal Definitions of Heuristic Retrieval Constraints
- Term Discrimination Constraint (TDC)
  - Let q be a query, and let w1, w2 ∈ q be two query terms.
  - Assume |d1| = |d2| and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2).
  - If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q).
6. Formal Definitions of Heuristic Retrieval Constraints
- Length Normalization Constraints (LNCs)
  - LNC1
    - Let q be a query, and let d1 and d2 be two documents.
    - If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1, but for any query term w, c(w,d2) = c(w,d1), then f(d1,q) ≥ f(d2,q).
  - LNC2
    - Let q be a query, and let d1 and d2 be two documents. For all k > 1:
    - If |d1| = k · |d2| and for all terms w, c(w,d1) = k · c(w,d2), then f(d1,q) ≥ f(d2,q).
7. Formal Definitions of Heuristic Retrieval Constraints
- TF-Length Constraint (TF-LNC)
  - Let q = {w} be a query with a single term w, and let d1 and d2 be two documents.
  - If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2), then f(d1,q) > f(d2,q).
8. Formal Definitions of Heuristic Retrieval Constraints
9. Analysis of Three Representative Retrieval Formulas
- Pivoted Normalization Method
- Okapi Method
- Dirichlet Prior Method
10. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method
- Retrieval function
- Analysis
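The retrieval function itself did not survive in the slide text. As a hedged sketch, the code below implements the commonly cited form of the pivoted normalization formula: for each term shared by query and document, (1 + ln(1 + ln c(w,d))) / ((1 - s) + s·|d|/avdl) · c(w,q) · ln((N + 1)/df(w)). The function name, argument layout, and the default `s=0.2` are assumptions for illustration.

```python
import math
from collections import Counter

def pivoted_score(query, doc, df, n_docs, avdl, s=0.2):
    """Score one document with the pivoted normalization formula.

    query, doc: lists of tokens; df: term -> document frequency;
    n_docs: collection size N; avdl: average document length.
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    norm = (1.0 - s) + s * len(doc) / avdl       # pivoted length normalization
    score = 0.0
    for w, qtf in q_counts.items():
        c = d_counts.get(w, 0)
        if c == 0:
            continue                              # only terms in both q and d
        tf = 1.0 + math.log(1.0 + math.log(c))    # sublinear TF
        idf = math.log((n_docs + 1) / df[w])      # positive for df <= N
        score += tf / norm * qtf * idf
    return score

# Hypothetical toy data: one matching term in a document of average length.
print(pivoted_score(["a"], ["a", "b", "c", "d"], {"a": 1}, n_docs=9, avdl=4.0))
```

With a single occurrence in a document of average length, both the TF part and the normalizer equal 1, so the score reduces to the IDF value.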
11. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method
- Check the TF-LNC constraint: in the common case |d1| = avdl, it is equivalent to a constraint on the parameter s
- TF-LNC is satisfied only if s is below a certain upper bound
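The dependence of TF-LNC on s can be illustrated numerically. The sketch below assumes the standard pivoted TF part, a single-term query (so the IDF and query-count factors cancel), and |d1| = avdl; the function names and the example numbers are hypothetical.

```python
import math

def pivoted_tf_part(count, doc_len, avdl, s):
    """TF part of the pivoted formula divided by the length normalizer."""
    tf = 1.0 + math.log(1.0 + math.log(count))
    return tf / ((1.0 - s) + s * doc_len / avdl)

def pivoted_tf_lnc_holds(c1, c2, avdl, s):
    """d1 has length avdl; d2 is d1 with (c1 - c2) occurrences of w removed."""
    assert c1 > c2 > 0
    len_d2 = avdl - (c1 - c2)
    f_d1 = pivoted_tf_part(c1, avdl, avdl, s)
    f_d2 = pivoted_tf_part(c2, len_d2, avdl, s)
    return f_d1 > f_d2

# A document made up almost entirely of the query term (hypothetical numbers).
print(pivoted_tf_lnc_holds(50, 49, avdl=50.0, s=0.01))  # small s
print(pivoted_tf_lnc_holds(50, 49, avdl=50.0, s=0.5))   # large s
```

In this example the constraint holds for the small value of s and fails for the large one, because the length penalty for one extra word outweighs the small TF gain at high counts.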
12. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method
- Check the LNC2 constraint
13. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method
- Consider the common case when |d2| = avdl
- For large s the LNC2 constraint is violated, so retrieval performance can be poor
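A quick numerical check makes the effect of s on LNC2 concrete. This is a sketch under stated assumptions: the standard pivoted TF part, a single-term query (IDF and query count cancel), |d2| = avdl, and d1 formed by concatenating d2 with itself k times. The names and numbers are illustrative.

```python
import math

def pivoted_tf_norm(count, doc_len, avdl, s):
    """Pivoted TF part divided by the length normalizer."""
    tf = 1.0 + math.log(1.0 + math.log(count))
    return tf / ((1.0 - s) + s * doc_len / avdl)

def pivoted_lnc2_holds(count, k, avdl, s):
    """LNC2: k concatenated copies of d2 must not score below d2 itself."""
    f_d2 = pivoted_tf_norm(count, avdl, avdl, s)
    f_d1 = pivoted_tf_norm(k * count, k * avdl, avdl, s)
    return f_d1 >= f_d2

for s in (0.05, 0.4):
    print(s, pivoted_lnc2_holds(count=5, k=2, avdl=100.0, s=s))
```

For these numbers the constraint holds at s = 0.05 and fails at s = 0.4: doubling the document doubles its length penalty but raises the sublinear TF only slightly.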
14. Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method
- Check the TDC constraint
- It is equivalent to c(w2,d1) ≥ c(w1,d2), so TDC is satisfied only conditionally
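Whether TDC holds for a given pair of documents can be tested directly. The sketch below assumes the pivoted TF transform 1 + ln(1 + ln c) and equal document lengths, so the length normalizer cancels; the function names and the counts are hypothetical examples, not values from the slides.

```python
import math

def tf(count):
    """Sublinear TF transform; zero for an absent term."""
    return 1.0 + math.log(1.0 + math.log(count)) if count > 0 else 0.0

def pivoted_tdc_holds(c_w1_d1, c_w2_d1, c_w1_d2, c_w2_d2, idf1, idf2):
    """Return True if f(d1,q) >= f(d2,q) for a two-term query {w1, w2}."""
    # TDC precondition: both documents contain the same total of query terms.
    assert c_w1_d1 + c_w2_d1 == c_w1_d2 + c_w2_d2
    f_d1 = idf1 * tf(c_w1_d1) + idf2 * tf(c_w2_d1)
    f_d2 = idf1 * tf(c_w1_d2) + idf2 * tf(c_w2_d2)
    return f_d1 >= f_d2

# d1 has more of w1 than d2 in both cases, as TDC requires.
print(pivoted_tdc_holds(2, 2, 1, 3, idf1=1.0, idf2=1.0))  # c(w2,d1) >= c(w1,d2)
print(pivoted_tdc_holds(3, 1, 2, 2, idf1=2.0, idf2=1.0))  # c(w2,d1) <  c(w1,d2)
```

The second case fails even though w1 is the more discriminative term, because the concave TF transform rewards the balanced document d2 more than the IDF gap rewards d1.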
15. Analysis of Three Representative Retrieval Formulas: Okapi Method
- Retrieval function
- Parameters: k1 (between 1.0 and 2.0), b (usually 0.75), and k3 (between 0 and 1000)
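The Okapi formula is missing from the slide text. As a hedged sketch, the code below uses the widely cited Okapi BM25 form, with IDF ln((N - df + 0.5)/(df + 0.5)), a document TF part controlled by k1 and b, and a query TF part controlled by k3. The function name, argument layout, and default parameter values are assumptions chosen from the ranges listed above.

```python
import math
from collections import Counter

def okapi_score(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Score one document with the Okapi BM25 formula.

    query, doc: lists of tokens; df: term -> document frequency.
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    length_norm = k1 * ((1.0 - b) + b * len(doc) / avdl)
    score = 0.0
    for w, qtf in q_counts.items():
        c = d_counts.get(w, 0)
        if c == 0:
            continue
        # This IDF becomes negative once df(w) exceeds N/2.
        idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
        doc_part = (k1 + 1.0) * c / (length_norm + c)
        query_part = (k3 + 1.0) * qtf / (k3 + qtf)
        score += idf * doc_part * query_part
    return score

# Hypothetical toy data: a rare term and a very common term.
doc = ["a", "b", "c", "d"]
print(okapi_score(["a"], doc, {"a": 1}, n_docs=10, avdl=4.0))
print(okapi_score(["a"], doc, {"a": 8}, n_docs=10, avdl=4.0))
```

The second call uses a term that occurs in 8 of 10 documents, so a matching document receives a negative score.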
16. Analysis of Three Representative Retrieval Formulas: Okapi Method
- Analysis
  - When df(w) > N/2, the IDF part of the formula is negative
  - When the IDF part is positive (mostly true for keyword queries)
    - TFCs and LNCs are satisfied
    - TF-LNC: in the common case |d2| = avdl, the constraint is equivalent to b ≤ avdl / c(w,d2)
    - TDC is equivalent to c(w2,d1) ≥ c(w1,d2), the same condition as for the pivoted normalization method
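These points can be checked numerically. The sketch below assumes the standard Okapi BM25 IDF and document TF part; the function names and numbers are illustrative. Because c(w,d2) cannot exceed |d2| = avdl, the bound avdl / c(w,d2) is at least 1, so the last call uses b = 2.0, a value outside the usual range, purely to cross the bound.

```python
import math

def okapi_idf(df, n_docs):
    """Okapi IDF; negative once df exceeds n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def okapi_tf(count, doc_len, avdl, k1=1.2, b=0.75):
    """Document TF part of Okapi BM25."""
    return (k1 + 1.0) * count / (k1 * ((1.0 - b) + b * doc_len / avdl) + count)

def okapi_tf_lnc_holds(c2, extra, avdl, b, k1=1.2):
    """d2 has length avdl; d1 adds `extra` more occurrences of the query term.

    Assumes a positive IDF, so comparing TF parts is enough.
    """
    f_d1 = okapi_tf(c2 + extra, avdl + extra, avdl, k1, b)
    f_d2 = okapi_tf(c2, avdl, avdl, k1, b)
    return f_d1 > f_d2

print(okapi_idf(4, 10) > 0, okapi_idf(6, 10) < 0)       # sign flips past N/2
print(okapi_tf_lnc_holds(10, 1, avdl=100.0, b=0.75))    # b below avdl / c
print(okapi_tf_lnc_holds(100, 1, avdl=100.0, b=2.0))    # b above avdl / c
```

With a negative IDF the TF part is multiplied by a negative number, so a document with more occurrences of the term scores lower, which is how the term-frequency constraints break for very common query words.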
17. Analysis of Three Representative Retrieval Formulas: Okapi Method
- Modified Okapi method
  - Solves the problem of negative IDF
  - Replaces the original IDF in Okapi with the regular IDF from the pivoted normalization formula
  - Performs better on verbose queries
- Analysis result
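The IDF replacement described on this slide can be sketched as follows. The two functions assume the commonly cited forms, ln((N - df + 0.5)/(df + 0.5)) for Okapi and ln((N + 1)/df) for the pivoted normalization formula; the function names and the collection size are hypothetical.

```python
import math

def okapi_idf(df, n_docs):
    """Original Okapi IDF: negative when df > n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def pivoted_idf(df, n_docs):
    """IDF from the pivoted normalization formula: positive for df <= n_docs."""
    return math.log((n_docs + 1.0) / df)

n_docs = 10
for df in (1, 5, 9):
    print(df, round(okapi_idf(df, n_docs), 3), round(pivoted_idf(df, n_docs), 3))
```

Verbose queries tend to contain common words with df(w) > N/2, which is where the two IDF definitions disagree in sign.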
18. Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method
- Retrieval function
  - Uses the Dirichlet prior smoothing method to smooth a document language model
  - Ranks documents by the likelihood of the query under the estimated language model of each document
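The query-likelihood ranking with Dirichlet prior smoothing can be sketched as below. The code uses a rank-equivalent form of log p(q|d): the sum over matched terms of c(w,q) · ln(1 + c(w,d)/(μ · p(w|C))), plus |q| · ln(μ/(|d| + μ)). The function name, the `p_coll` dictionary of collection probabilities, and the default `mu=2000.0` are assumptions for illustration.

```python
import math
from collections import Counter

def dirichlet_score(query, doc, p_coll, mu=2000.0):
    """Rank-equivalent query log-likelihood under Dirichlet prior smoothing.

    query, doc: lists of tokens; p_coll: term -> collection probability p(w|C).
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    # Length-dependent part: longer documents rely less on the prior.
    score = len(query) * math.log(mu / (len(doc) + mu))
    for w, qtf in q_counts.items():
        c = d_counts.get(w, 0)
        if c > 0:
            score += qtf * math.log(1.0 + c / (mu * p_coll[w]))
    return score

# Hypothetical toy data.
p_coll = {"a": 0.01, "b": 0.02}
print(dirichlet_score(["a", "b"], ["a"] * 3 + ["x"] * 7, p_coll))
```

This form drops the document-independent term, the sum of c(w,q) · ln p(w|C), so score differences between documents match differences in the full log-likelihood.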
19. Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method
- Analysis
  - The LNC2 constraint is equivalent to c(w,d2) ≥ |d2| · p(w|C), which is usually satisfied for content-carrying words
  - The TDC constraint leads to a lower bound on the parameter μ
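The LNC2 condition can be verified numerically for a single-term query. The sketch below assumes the rank-equivalent Dirichlet scoring form ln(1 + c/(μ · p(w|C))) + ln(μ/(|d| + μ)); the function names, μ = 2000, and the example counts are hypothetical.

```python
import math

def dirichlet_single_term(count, doc_len, p_w, mu):
    """Dirichlet prior score for a one-term query (rank-equivalent form)."""
    return math.log(1.0 + count / (mu * p_w)) + math.log(mu / (doc_len + mu))

def dirichlet_lnc2_holds(count, doc_len, p_w, k, mu=2000.0):
    """LNC2: k concatenated copies of d2 must not score below d2 itself."""
    f_d2 = dirichlet_single_term(count, doc_len, p_w, mu)
    f_d1 = dirichlet_single_term(k * count, k * doc_len, p_w, mu)
    return f_d1 >= f_d2

# Content word: c(w,d2) = 5 is above |d2| * p(w|C) = 100 * 0.01 = 1.
print(dirichlet_lnc2_holds(count=5, doc_len=100, p_w=0.01, k=2))
# Common word: c(w,d2) = 1 is below |d2| * p(w|C) = 100 * 0.05 = 5.
print(dirichlet_lnc2_holds(count=1, doc_len=100, p_w=0.05, k=2))
```

The constraint holds when the term occurs in the document more often than the collection model predicts, and fails when it occurs less often.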
20. Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method
- Analysis
  - TDC: consider a common case for w2, p(w2|C) = 1/avdl
  - This means that for discriminative words with a high term frequency in a document, μ needs to be sufficiently large in order to balance TF and IDF appropriately
21. Experiments: Setup
- Document sets
  - AP: news articles
  - DOE: technical reports
  - FR: government documents
  - ADF: combination of AP, DOE, and FR
  - Web: web data used in TREC8
  - Trec7: ad hoc data used in TREC7
  - Trec8: ad hoc data used in TREC8
22. Experiments: Setup
- Query types
  - Short-keyword (SK, keyword title)
  - Short-verbose (SV, one-sentence description)
  - Long-keyword (LK, keyword list)
  - Long-verbose (LV, multiple sentences)
- Preprocessing
  - Only stemming, with the Porter stemmer
  - No stop words are removed
23. Experiments: Parameter Sensitivity
- Pivoted normalization method
  - The analysis of the LNC2 constraint for the pivoted normalization method suggests that s should be smaller than 0.4
24. Experiments: Parameter Sensitivity
- Okapi method: k1 = 1.2, k3 = 1000, b varies from 0.1 to 1.0
25. Experiments: Parameter Sensitivity
26. Experiments: Parameter Sensitivity
27. Experiments: Performance Comparison
28. Experiments: Performance Comparison
- For any query type, the performance of the Dirichlet prior method is comparable to that of the pivoted normalization method
- For keyword queries, the performance of Okapi is comparable to that of the other two retrieval formulas
- For verbose queries, the performance of Okapi may be worse than the others because of the possibly negative IDF part in the formula
29. Experiments: Performance Comparison
- Average precision comparison
30. Conclusion and Future Work
- Defined six basic constraints that any reasonable retrieval function should satisfy
- When a constraint is not satisfied, it often indicates non-optimality of the method