Gravitation-Based Model for Information Retrieval - PowerPoint PPT Presentation

1
Gravitation-Based Model for Information Retrieval
  • Shuming Shi
  • Ji-Rong Wen
  • Qing Yu
  • Ruihua Song
  • Wei-Ying Ma
  • Microsoft Research Asia
  • shumings_at_microsoft.com

From http://www.awesomelibrary.org/images/solar-system-nasa.jpg
2
Background
A core problem in Information Retrieval (IR)
Determine the relevance of a document to a query
Query: Bill Clinton
Document: Relevant? How relevant?
3
Background
  • IR Models & Perspectives
  • IR models define the representation of documents
    and queries, and the relevance relationship
    between them
  • Behind every IR model lies a primary
    perspective on information retrieval

Model                 Perspective
Boolean model         Set theory and Boolean algebra
Vector space model    Vector and linear algebra
Probabilistic model   Probability theory
Language model        Probability theory

4
Background
  • Hard questions
  • What is the essence of information retrieval?
  • What is the right perspective of it?
  • So far, we have learned more about IR each time
    a new perspective was adopted
  • It would therefore be helpful to view IR problems
    from further new perspectives
  • We try to view IR from the perspective of physics

5
Background
(1687 AD.)
From http://csep10.phys.utk.edu/astr161/lect/history/newtongrav.html
6
Background
From http://www.enterprisemission.com/hyper2a.php
7
Background
  • We live in a physical world that is
    governed by fundamental physical laws.
  • Can we get help from God in acquiring a
    deeper understanding of information retrieval?
  • We simply start from Newton's Universal Law of
    Gravitation.
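
For reference, Newton's law of universal gravitation in its standard textbook form. The symbols are the conventional ones rather than notation taken from the slides: F is the attractive force between two point masses m_1 and m_2, r is the distance between them, and G is the gravitational constant.

```latex
F = G \, \frac{m_1 m_2}{r^2}
```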

8
Preliminary Achievements
  • BM25 was first discovered by Robertson et al.,
    inspired by the shape of a complex formula
    derived from a probabilistic model under the
    2-Poisson assumption.
  • Amati and van Rijsbergen proposed a probabilistic
    framework with which the BM25 function with some
    special parameters (k1 = 1.2, b = 0.75 or k1 = 2,
    b = 0.75) can be approximated numerically.
  • A complete theoretical derivation of the BM25
    formula is still lacking.

It is encouraging that we can really benefit from
nature. With the new perspective, we obtain the
following preliminary achievements:
  • We build a new IR model, GBM, from which many
    effective ranking functions can be derived.
  • The BM25 formula can be derived from our model,
    which gives an intuitive physical interpretation
    of this powerful and robust function.
  • A more reasonable approach to structured
    document retrieval can be obtained directly from
    the model. This approach is not only highly
    effective but also robust across various
    conditions.

9
Outline
  • Background
  • Gravitation-based Model
  • Notations & Basic Concepts
  • Discrete GBM Model
  • Continuous GBM Model
  • Model analysis
  • GBM Model for Structured Document Retrieval
  • Summary

10
GBM Initial Idea
IR concepts and notations:
  • |D|: Document length
  • df(t): Document frequency of t
  • avdl: Average document length in a collection
  • N: Total number of documents
  • c(t,D): Number of occurrences of t in D (also
    written as tf(t,D))

A mapping needs to be built from the concepts of
information retrieval to those of physics.
Query: Bill Clinton
Document
Relevance score → Attractive force
Physics concepts: mass, distance
11
GBM: Notations & Basic Concepts
  • Particle
  • (atom) Basic element of any object
  • A particle has two attributes: mass and type
  • Type: determined by the term object it composes

12
GBM: Notations & Basic Concepts
H(D): Hidden terms in document D
Two natural assumptions
A term object has 4 attributes: type, shape,
mass, and diameter
13
Notation List
14
Outline
  • Background
  • Gravitation-based Model
  • Notations & Basic Concepts
  • Discrete GBM Model
  • Continuous GBM Model
  • Model analysis
  • GBM Model for Structured Document Retrieval
  • Summary

15
Discrete GBM Model
  • Key Points
  • 1. Under the attraction of query terms, the
    structure of each document is adjusted to
    an optimized-term-placement state.
  • 2. The relevance between a document and a query
    is defined by the attractive force between them
    when the document is in its
    optimized-term-placement state.

Optimized-term-placement state: a state in which
the aggregated force between the document and the
query is maximized
16
Term Weighting Formula
Unknown expressions: m(t,Q), m(t,D), and
di(t,D). They require mass and diameter estimation.
The force between query term t and its i-th
nearest occurrence in D
The maximal (optimized) gravitational force
between t and D
The attractive force between D and Q
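
As a toy illustration of the three quantities named above (and not the model's actual term-weighting formula), the sketch below applies Newton's inverse-square law to term occurrences. Everything specific in it is an assumption made for illustration: the function names, the constant `g`, and in particular the placement rule that the i-th nearest occurrence of a term sits at distance i × diameter from the query term.

```python
def pairwise_force(m_query, m_occ, distance, g=1.0):
    """Newtonian attraction between a query term and one occurrence."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return g * m_query * m_occ / distance ** 2


def term_force(m_query, m_occ, diameter, count, g=1.0):
    """Force between query term t and document D.

    Assumption: the i-th nearest occurrence of t lies at distance
    i * diameter, so the total is a sum of inverse-square terms.
    """
    return sum(
        pairwise_force(m_query, m_occ, i * diameter, g)
        for i in range(1, count + 1)
    )


def document_force(query_masses, doc_masses, counts, diameter, g=1.0):
    """Attractive force between D and Q: sum over the query terms."""
    total = 0.0
    for term, m_query in query_masses.items():
        total += term_force(
            m_query, doc_masses.get(term, 0.0), diameter, counts.get(term, 0), g
        )
    return total


# Hypothetical unit masses and unit diameter for a two-term query.
masses = {"bill": 1.0, "clinton": 1.0}
force = document_force(masses, masses, {"bill": 2, "clinton": 1}, diameter=1.0)
```

Under this placement rule, each additional occurrence contributes less than the previous one, so the force grows with term frequency but stays bounded.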
17
Mass and Diameter Estimation
For any two terms, their mass ratio in any
document is equal to the ratio of their average
masses in the whole collection.
Assume that all terms in the same document have
equal diameters
(Assumption-2)
(Assumption-1)
Define a document-independent mass for
each (type of) term. It denotes the average mass
of term t in the whole collection.
(Assumption-3)
(Assumption-4)
18
Ultimate Discrete GBM Formula
  • The mass of a document is a measure of its
    quality, which depends on how informative and
    important it is.
  • Relationship with PageRank? <Future work>

The average (document-independent) mass of term t
in the collection
The ultimate term-weighting function
where and
19
Ultimate Discrete GBM Formula
If m(D) = const, di(D) = const, and
Then a special case of the term-weighting
function
where
Two parameters
20
Outline
  • Background
  • Gravitation-based Model
  • Notations & Basic Concepts
  • Discrete GBM Model
  • Continuous GBM Model
  • Model analysis
  • GBM Model for Structured Document Retrieval
  • Summary

21
Continuous GBM Model
Term shape: ideal cylinder
Document D is now in its optimized-term-placement
state
22
Term Weighting Formula
The force between query term t and its i-th
nearest occurrence in D
The maximal (optimized) gravitational force
between t and D
23
Ultimate Continuous GBM Formula
By doing mass and diameter estimation, we have
the ultimate term-weighting function
where and
If m(D) = const, di(D) = const, and
Then a special case of the above term-weighting
function
(Two parameters )
24
Outline
  • Background
  • Gravitation-based Model
  • Notations & Basic Concepts
  • Discrete GBM Model
  • Continuous GBM Model
  • Model analysis
  • GBM Model for Structured Document Retrieval
  • Summary

25
Continuous GBM Formula vs. BM25
A special case of the continuous GBM
term-weighting function
where
BM25 term-weighting function
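
For comparison, here is a minimal sketch of the BM25 term-weighting function in its commonly cited Okapi form, written with the notation introduced earlier (tf = c(t,D), document length |D|, avdl, df(t), N). The function names are hypothetical, and the exact variant shown on the slide may differ, for example in the IDF component.

```python
import math


def bm25_idf(df, n_docs):
    """Robertson/Sparck Jones style IDF, as used in Okapi BM25."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def bm25_term_weight(tf, doc_len, avdl, df, n_docs, k1=1.2, b=0.75):
    """Weight of one query term t in document D.

    tf      -- c(t,D), occurrences of t in D
    doc_len -- |D|, length of D
    avdl    -- average document length in the collection
    df      -- df(t), number of documents containing t
    n_docs  -- N, total number of documents
    """
    length_norm = k1 * (1.0 - b + b * doc_len / avdl)
    return bm25_idf(df, n_docs) * tf * (k1 + 1.0) / (tf + length_norm)


def bm25_score(query_terms, term_counts, doc_len, avdl, dfs, n_docs):
    """Document score: sum of the term weights over the query terms."""
    return sum(
        bm25_term_weight(term_counts.get(t, 0), doc_len, avdl, dfs[t], n_docs)
        for t in query_terms
    )
```

With k1 = 1.2 and b = 0.75 (the parameter values mentioned earlier), a document of average length containing a term once receives exactly that term's IDF as its weight.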
26
Other Ranking Formulas Derived
Ranking formulas (highly simplified version)
derived from the continuous GBM model with
various gravitational-field-functions
27
Check with Heuristic Constraints
  • Fang et al., SIGIR'04: some heuristic
    constraints related to TF, IDF, and document
    length that all reasonable ranking formulas
    should satisfy
  • TFC1, TFC2
  • TDC → M-TDC
  • LNC1, LNC2
  • TF-LNC
  • All our derived term-weighting functions satisfy
    all the above constraints.
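
To make three of these constraints concrete, the sketch below spot-checks them numerically. It paraphrases TFC1 (more occurrences give a higher score), TFC2 (each extra occurrence adds less than the previous one), and LNC1 (extra non-query text must not raise the score). The weighting function tested is the TF component of the commonly cited BM25 form, used here as a stand-in because it is well known; it is not one of the derived GBM functions. A grid check like this is only a sanity test, not a proof.

```python
def bm25_tf_component(tf, doc_len, avdl, k1=1.2, b=0.75):
    """TF part of BM25 (IDF factored out and assumed positive)."""
    return tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avdl))


def satisfies_tfc1(weight, doc_len, avdl, max_tf=50):
    """TFC1: at equal length, a higher tf gives a higher score."""
    return all(
        weight(tf + 1, doc_len, avdl) > weight(tf, doc_len, avdl)
        for tf in range(max_tf)
    )


def satisfies_tfc2(weight, doc_len, avdl, max_tf=50):
    """TFC2: the gain from one more occurrence shrinks as tf grows."""
    gains = [
        weight(tf + 1, doc_len, avdl) - weight(tf, doc_len, avdl)
        for tf in range(max_tf)
    ]
    return all(later < earlier for earlier, later in zip(gains, gains[1:]))


def satisfies_lnc1(weight, tf, avdl, max_len=500):
    """LNC1: adding one non-query term never increases the score."""
    return all(
        weight(tf, n + 1, avdl) <= weight(tf, n, avdl)
        for n in range(1, max_len)
    )


def raw_tf(tf, doc_len, avdl):
    """Unsaturated tf: passes TFC1 but fails TFC2."""
    return float(tf)
```

The `raw_tf` function is included as a contrast case: its gains are constant, so it violates the diminishing-returns requirement of TFC2.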

28
Preliminary Experiments
  • Experimental Setup

Corpora characteristics
Query-sets used in the experiments
29
Preliminary Experiments
  • Experimental Results

Optimal performance comparison among some
formulas over various corpora and tasks (measure:
mean average precision)
30
Outline
  • Background
  • Gravitation-based Model
  • Notations & Basic Concepts
  • Discrete GBM Model
  • Continuous GBM Model
  • Model analysis
  • GBM Model for Structured Document Retrieval skip
  • Summary

31
Structured Document Retrieval
  • A document is said to be structured here when it
    contains multiple fields.
  • Current approaches for structured document
    retrieval
  • Score combination
  • The most commonly used and well-studied approach
  • Rank combination is a special case of score
    combination
  • Term-frequency combination
  • Robertson et al., CIKM'04: an extension of BM25
  • Ogilvie et al., SIGIR'03: linearly combining
    language models
  • Each approach works moderately well, but ...

32
Score Combination Issues
  • For a multi-term query, a document matching a
    single query term over many fields could get
    an unreasonably higher score than another
    document that matches all the query terms in a
    few fields. (See discussions in Robertson et al.,
    CIKM'04)

score(d1) s s s s 8s
score(d2) 2s 2s 0 0 4s
score(d1) > score(d2): Unreasonable
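
The issue can be reproduced with a small sketch. The setup is an assumption chosen to match the totals above: eight fields per document, a two-term query, and a simplified field score of s for every distinct query term found in a field (real field scores would also depend on tf, IDF, and length). The function and variable names are hypothetical.

```python
S = 1.0  # the per-term, per-field match score "s", in arbitrary units


def field_score(query, field_terms, s=S):
    """Simplified field score: s for each distinct query term in the field."""
    return s * len(set(query) & set(field_terms))


def score_combination(query, fields, weights=None):
    """Score combination: weighted sum of independently computed field scores."""
    if weights is None:
        weights = [1.0] * len(fields)
    return sum(w * field_score(query, f) for w, f in zip(weights, fields))


query = ["bill", "clinton"]

# d1 matches a single query term in every one of its 8 fields.
d1 = [["bill"] for _ in range(8)]
# d2 matches both query terms, but only in 2 of its 8 fields.
d2 = [["bill", "clinton"], ["bill", "clinton"]] + [[] for _ in range(6)]

score_d1 = score_combination(query, d1)  # 8 * s
score_d2 = score_combination(query, d2)  # 2s + 2s = 4 * s
```

Because each field is scored independently and the results are simply added, d1 outranks d2 even though it never mentions the second query term.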
33
TF Combination Issues
Consider a single-term query Q = {t} and some
documents with two fields (F1, F2). Assume
w1 = weight(F1) = 5 and w2 = weight(F2) = 1.

Example-1 (assuming |d1| = |d2|)
tf(t,d1) = w1 × 1 + w2 × 0 = 5
tf(t,d2) = w1 × 0 + w2 × 6 = 6
score(d1) < score(d2): Reasonable

Example-2 (assuming |d3| = |d4|)
tf(t,d3) = w1 × 1 + w2 × 8 = 13
tf(t,d4) = w1 × 0 + w2 × 14 = 14
score(d3) < score(d4): Unreasonable

  • Larger w1?
  • Can't remove this issue
  • Potential risk of making the case of Example-1
    unreasonable
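
The effect of changing w1 can be checked with a short sketch. The per-field counts (1, 0), (0, 6), (1, 8), and (0, 14) and the weights w1 = 5, w2 = 1 come from the examples above; the alternative weight w1 = 7 and all function names are assumptions made for illustration. The comparison is done on the combined tf alone, which is enough as long as the score increases with combined tf and the compared documents have equal length.

```python
def combined_tf(field_tfs, weights):
    """TF combination: weighted sum of the per-field term frequencies."""
    return sum(w * tf for w, tf in zip(weights, field_tfs))


# (tf in F1, tf in F2) of the single query term t in each document.
DOCS = {"d1": (1, 0), "d2": (0, 6), "d3": (1, 8), "d4": (0, 14)}


def combined_all(w1, w2=1):
    """Combined tf of every example document for the given field weights."""
    return {name: combined_tf(tfs, (w1, w2)) for name, tfs in DOCS.items()}


base = combined_all(5)    # weights from the example: w1 = 5, w2 = 1
larger = combined_all(7)  # hypothetical larger weight for F1
```

With w1 = 7, d3 now outranks d4, but d1 also outranks d2, which reverses the ordering that was considered reasonable in Example-1. A document with 16 occurrences in F2 only would again outrank d3, so raising w1 moves the problem rather than removing it.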
34
Structured Document Retrieval by GBM
35
Experimental Results
Performance comparison of different approaches
for the combination of body and title fields
36
Outline
  • Background
  • Gravitation-based Model
  • Notations & Basic Concepts
  • Discrete GBM Model
  • Continuous GBM Model
  • Model analysis
  • GBM Model for Structured Document Retrieval
  • Summary

37
Summary
  • Viewing IR from a different viewpoint is as
    important as going deeper along traditional
    perspectives.
  • This paper may be a first step toward taking a
    physics viewpoint.
  • It is encouraging that we can really benefit from
    nature.
  • A family of effective ranking functions is derived
  • BM25 is given a physics interpretation
  • A more reasonable approach for structured
    document retrieval is obtained

38
  • Sorry, Sir Isaac Newton. Hope I am not abusing
    your laws.

39
The End
  • Gravitation-Based Model for Information Retrieval
  • Please send your comments to
    shumings_at_microsoft.com