Title: Gravitation-Based Model for Information Retrieval
1Gravitation-Based Model for Information Retrieval
- Shuming Shi
- Ji-Rong Wen
- Qing Yu
- Ruihua Song
- Wei-Ying Ma
- Microsoft Research Asia
- shumings_at_microsoft.com
From http://www.awesomelibrary.org/images/solar-system-nasa.jpg
2Background
A core problem in Information Retrieval (IR):
Determine the relevance of a document to a query
Query
Bill Clinton
Document
Relevant? How relevant?
3Background
- IR Models and Perspectives
- IR models define the representation of documents, queries, and the relevance relationship between them
- The key behind each IR model is its primary perspective on information retrieval
Model                 Perspective
Boolean model         Set theory and Boolean algebra
Vector space model    Vector and linear algebra
Probabilistic model   Probabilistic
Language model        Probabilistic
4Background
- Hard questions
- What is the essence of information retrieval?
- What is the right perspective on it?
- So far, we have learned more about IR each time a new perspective was adopted
- It would also be helpful to view IR problems from more new perspectives
- We try to view IR from the perspective of physics
5Background
(1687 AD.)
From http://csep10.phys.utk.edu/astr161/lect/history/newtongrav.html
6Background
From http://www.enterprisemission.com/hyper2a.php
7Background
- We are living in a physical world which is dominated by fundamental physics laws.
- Can we get help from God in acquiring a deeper understanding of information retrieval?
- Simply start from Newton's Universal Law of Gravitation
8Preliminary Achievements
It is encouraging that we can really benefit from nature. With the new perspective, we get the following preliminary achievements:
- We build a new IR model, GBM, from which many effective ranking functions can be derived
- The BM25 formula can be derived from our model, so we give an intuitive physical interpretation of this powerful and robust function.
- A more reasonable approach for structured document retrieval can be obtained directly from the model. This approach is not only highly effective but also robust enough to be used under various conditions.
Background on BM25:
- First discovered by Robertson et al., inspired by the shape of a complex formula derived from a probabilistic model under the 2-Poisson assumption.
- Amati and Rijsbergen proposed a probabilistic framework with which the BM25 function with some special parameters (k1 = 1.2, b = 0.75 or k1 = 2, b = 0.75) can be approximated numerically
- A complete theoretical derivation of the BM25 formula is still lacking.
9Outline
- Background
- Gravitation-based Model
- Notations Basic Concepts
- Discrete GBM Model
- Continuous GBM Model
- Model analysis
- GBM Model for Structured Document Retrieval
- Summary
10GBM Initial Idea
IR concepts and notations:
|D|      Document length
df(t)    Document frequency of t
avdl     Average document length in a collection
N        Total number of documents
c(t,D)   Number of occurrences of t in D (also written as tf(t,D))
A mapping needs to be built from the concepts of information retrieval to those of physics:
Query: Bill Clinton
Document
Relevance score → Attractive force
Physics concepts: mass, distance
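For reference, the physical law this mapping borrows from is Newton's law of universal gravitation. In the standard form below, F is the attractive force between two point masses m_1 and m_2 separated by distance r, and G is the gravitational constant. Reading the query and the document as the two bodies is the analogy proposed in this talk, not part of the law itself.

```latex
F = G \, \frac{m_1 m_2}{r^2}
```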
11GBM Notations Basic Concepts
- Particle
- (atom) Basic element of any object
- A particle has two attributes: mass and type
- Type: determined by the term object it composes
12GBM Notations Basic Concepts
H(D): Hidden terms in document D
Two natural assumptions
A term object has 4 attributes: type, shape, mass, and diameter
13Notation List
14Outline
- Background
- Gravitation-based Model
- Notations Basic Concepts
- Discrete GBM Model
- Continuous GBM Model
- Model analysis
- GBM Model for Structured Document Retrieval
- Summary
15Discrete GBM Model
- Key Points
- 1. Under the attraction of query terms, the structure of each document would be adjusted to an optimized-term-placement state.
- 2. The relevance between a document and a query is defined by the attractive force between them when the document is in its optimized-term-placement state.
Optimized-term-placement state: A state where the aggregated force between the document and the query is maximized
16Term Weighting Formula
Unknown expressions: m(t,Q), m(t,D), and di(t,D). These require mass and diameter estimation.
The force between query term t and its i-th
nearest occurrence in D
The maximal (optimized) gravitational force
between t and D
The attractive force between D and Q
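The formulas on this slide did not survive in this transcript, so the following is only a minimal sketch of how the three quantities might fit together, not the authors' exact definitions. It assumes (hypothetically) that in the optimized-term-placement state the i-th nearest occurrence of t sits at distance i × diameter from the query term, that each pair attracts by Newton's inverse-square law, and that forces simply add up. All function and parameter names are illustrative.

```python
def occurrence_force(m_query, m_doc, i, diameter, G=1.0):
    """Force between query term t and its i-th nearest occurrence (i = 1, 2, ...).

    Assumption of this sketch: the i-th nearest occurrence lies at
    distance i * diameter from the query term.
    """
    distance = i * diameter
    return G * m_query * m_doc / distance ** 2


def term_force(m_query, m_doc, tf, diameter, G=1.0):
    """Aggregated force between query term t and all tf occurrences of t in D."""
    return sum(occurrence_force(m_query, m_doc, i, diameter, G)
               for i in range(1, tf + 1))


def document_force(query_masses, doc_masses, doc_tf, diameter, G=1.0):
    """Attractive force between D and Q: sum of term forces over query terms.

    query_masses: {term: m(t,Q)}
    doc_masses:   {term: m(t,D)}
    doc_tf:       {term: c(t,D)}
    """
    return sum(
        term_force(m_q, doc_masses[t], doc_tf[t], diameter, G)
        for t, m_q in query_masses.items()
        if doc_tf.get(t, 0) > 0
    )
```

Under these assumptions each extra occurrence contributes less than the previous one (1, 1/4, 1/9, ... in units of the first contribution), so the term force grows with the term frequency but stays bounded.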
17Mass and Diameter Estimation
For any two terms, their mass ratio in any
document is equal to the ratio of their average
masses in the whole collection.
Assume that all terms in the same document have
equal diameters
(Assumption-2)
(Assumption-1)
Define a document-independent mass for
each (type of) term. It denotes the average mass
of term t in the whole collection.
(Assumption-3)
(Assumption-4)
18Ultimate Discrete GBM Formula
- The mass of a document is a measure of its quality, which depends on how informative and important it is.
- Relationship with PageRank? <Future work>
The average (document-independent) mass of term t
in the collection
The ultimate term-weighting function
where and
19Ultimate Discrete GBM Formula
If m(D) = const, di(D) = const, and
Then a special case of the term-weighting
function
where
Two parameters
20Outline
- Background
- Gravitation-based Model
- Notations Basic Concepts
- Discrete GBM Model
- Continuous GBM Model
- Model analysis
- GBM Model for Structured Document Retrieval
- Summary
21Continuous GBM Model
Term shape: Ideal cylinder
Document D is now in its optimized-term-placement
state
22Term Weighting Formula
The force between query term t and its i-th
nearest occurrence in D
The maximal (optimized) gravitational force
between t and D
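The slide's formulas are missing from this transcript. As a hedged illustration only of why a continuous, cylinder-shaped term can yield a saturating term-frequency weight, suppose (hypothetically) that the occurrences of t form a uniform rod of length L and linear mass density ρ whose near end lies at distance a from a point-like query term of mass m_q. The symbols a, L, ρ, m_q and the inverse-square field are assumptions of this sketch, not the authors' notation.

```latex
F(t, D) = \int_{a}^{a+L} \frac{G \, m_q \, \rho}{x^{2}} \, dx
        = G \, m_q \, \rho \left( \frac{1}{a} - \frac{1}{a+L} \right)
        = \frac{G \, m_q \, \rho}{a} \cdot \frac{L}{L + a}
```

If L is taken to be proportional to the term frequency, the last factor has the form tf / (tf + k), the same kind of saturation that appears in BM25.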
23Ultimate Continuous GBM Formula
By doing mass and diameter estimation, we have
the ultimate term-weighting function
where and
If m(D) = const, di(D) = const, and
Then a special case of the above term-weighting
function
(Two parameters)
24Outline
- Background
- Gravitation-based Model
- Notations Basic Concepts
- Discrete GBM Model
- Continuous GBM Model
- Model analysis
- GBM Model for Structured Document Retrieval
- Summary
25Continuous GBM Formula vs. BM25
A special case of the continuous GBM
term-weighting function
where
BM25 term-weighting function
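For comparison, here is a minimal sketch of one commonly used form of the BM25 term weight, written with this deck's quantities (c(t,D), document length, avdl, df(t), N). The query-term-frequency factor is omitted, the IDF variant shown is one of several in use, and the defaults k1 = 1.2 and b = 0.75 are the values mentioned earlier for BM25. Function and argument names are illustrative.

```python
import math


def bm25_term_weight(tf, doc_len, avdl, df, n_docs, k1=1.2, b=0.75):
    """BM25 weight of a term occurring tf times in a document of length doc_len."""
    if tf <= 0:
        return 0.0
    # Robertson/Sparck Jones style IDF; can be negative when df > n_docs / 2.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: equals 1 for a document of average length.
    norm = 1.0 - b + b * doc_len / avdl
    # Saturating term-frequency component, bounded above by k1 + 1.
    tf_part = tf * (k1 + 1.0) / (tf + k1 * norm)
    return idf * tf_part
```

The term-frequency component rises quickly for the first few occurrences and then flattens, which is the behaviour the GBM special case is compared against.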
26Other Ranking Formulas Derived
Ranking formulas (highly simplified version)
derived from the continuous GBM model with
various gravitational-field-functions
27Check with Heuristic Constraints
- Fang et al., SIGIR'04: Some heuristic constraints related to TF, IDF, and document length that all reasonable ranking formulas should satisfy
- TFC1, TFC2
- TDC → M-TDC
- LNC1, LNC2
- TF-LNC
- All our derived term-weighting functions satisfy all the above constraints.
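As a rough illustration of what such constraints demand, the sketch below numerically spot-checks three of them on a BM25-style saturating weight. The constraint descriptions in the comments are informal paraphrases, the weight function is a stand-in rather than one of the functions derived in this talk, and a spot check on a few values is not a proof.

```python
def saturating_weight(tf, doc_len, avdl=100.0, k=1.2, b=0.75):
    """BM25-style term-frequency weight with length normalization (stand-in)."""
    norm = 1.0 - b + b * doc_len / avdl
    return tf / (tf + k * norm)


def check_constraints(weight, doc_len=100, max_tf=20):
    """Spot-check TFC1, TFC2 and LNC1 for tf = 1..max_tf at a fixed length."""
    results = {"TFC1": True, "TFC2": True, "LNC1": True}
    for tf in range(1, max_tf + 1):
        gain_now = weight(tf + 1, doc_len) - weight(tf, doc_len)
        gain_next = weight(tf + 2, doc_len) - weight(tf + 1, doc_len)
        # TFC1 (paraphrase): more occurrences of the query term score higher.
        results["TFC1"] &= gain_now > 0
        # TFC2 (paraphrase): each additional occurrence adds less than the last.
        results["TFC2"] &= gain_next < gain_now
        # LNC1 (paraphrase): one extra non-query word must not raise the score.
        results["LNC1"] &= weight(tf, doc_len + 1) <= weight(tf, doc_len)
    return results
```

Calling `check_constraints(saturating_weight)` reports whether each of the three checks held on the sampled values; a weight that grows linearly with tf or with document length would fail at least one of them.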
28Preliminary Experiments
Corpora characteristics
Query-sets used in the experiments
29Preliminary Experiments
Optimal performance comparison among some formulas over various corpora and tasks (measure: mean average precision)
30Outline
- Background
- Gravitation-based Model
- Notations Basic Concepts
- Discrete GBM Model
- Continuous GBM Model
- Model analysis
- GBM Model for Structured Document Retrieval skip
- Summary
31Structured Document Retrieval
- A document is said to be structured here when it contains multiple fields.
- Current approaches for structured document retrieval
- Score combination
- The most commonly used and well-studied approach
- Rank combination is a special case of score combination
- Term-frequency combination
- Robertson et al., CIKM'04: An extension of BM25
- Ogilvie et al., SIGIR'03: Linearly combining language models
- Each approach works moderately well, but
32Score Combination Issues
- For a multi-term query, a document matching a single query term over many fields could get an unreasonably higher score than another document which matches all the query terms in a few fields. (See discussions in Robertson et al., CIKM'04)
score(d1): s, s, s, s; total 8s
score(d2): 2s, 2s, 0, 0; total 4s
score(d1) > score(d2): Unreasonable
33TF Combination Issues
Consider a single-term query Q = {t}, and some documents with two fields (F1, F2). Assume w1 = weight(F1) = 5 and w2 = weight(F2) = 1.
Example-1 (assuming |d1| = |d2|)
tf(t,d1) = w1·1 + w2·0 = 5
tf(t,d2) = w1·0 + w2·6 = 6
score(d1) < score(d2): Reasonable
Example-2 (assuming |d3| = |d4|)
tf(t,d3) = w1·1 + w2·8 = 13
tf(t,d4) = w1·0 + w2·14 = 14
score(d3) < score(d4): Unreasonable
- Larger w1?
- Can't remove this issue
- Potential risk of making the case of Example-1 unreasonable
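The weighted term frequencies in these two examples can be recomputed with a few lines of code. This is a minimal sketch of term-frequency combination as a weighted sum of per-field counts, using the slide's weights w1 = 5 and w2 = 1. The function name is illustrative, and the final ranking score is assumed to increase with the combined tf.

```python
def combined_tf(field_counts, field_weights):
    """Weighted sum of per-field term frequencies (term-frequency combination)."""
    return sum(w * c for w, c in zip(field_weights, field_counts))


weights = (5, 1)  # (w1 for F1, w2 for F2), as on the slide

# Per-field counts (F1, F2) of the query term t in each document.
docs = {"d1": (1, 0), "d2": (0, 6), "d3": (1, 8), "d4": (0, 14)}
tf = {name: combined_tf(counts, weights) for name, counts in docs.items()}

# Example-1: d2 outranks d1. Example-2: d4 outranks d3.
assert tf["d1"] < tf["d2"]
assert tf["d3"] < tf["d4"]

# Raising w1 does not remove the Example-2 issue: for any w1, a document with
# more than w1 + 8 occurrences in F2 alone still overtakes d3.
for w1 in (5, 10, 100):
    d3 = combined_tf((1, 8), (w1, 1))
    rival = combined_tf((0, w1 + 9), (w1, 1))
    assert rival > d3
```

Raising w1 also flips Example-1: with w1 = 10, the combined tf of d1 becomes 10 and exceeds that of d2, which stays at 6.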
34Structured Document Retrieval by GBM
35Experimental Results
Performance comparison of different approaches
for the combination of body and title fields
36Outline
- Background
- Gravitation-based Model
- Notations Basic Concepts
- Discrete GBM Model
- Continuous GBM Model
- Model analysis
- GBM Model for Structured Document Retrieval
- Summary
37Summary
- Viewing IR from a different viewpoint is as important as going deeper from traditional perspectives.
- This paper may be a first step toward taking a physics viewpoint
- It is encouraging that we can really benefit from nature
- A family of effective ranking functions derived
- Give BM25 a physics interpretation
- A more reasonable approach for structured document retrieval obtained
38- Sorry, Sir Isaac Newton. Hope I am not abusing
your laws.
39The End
- Gravitation-Based Model for Information Retrieval
- Please send your comments to shumings_at_microsoft.com