Field-Weighted XML Retrieval Based on BM25

About This Presentation

Slides: 16

Provided by: Hai85

Transcript and Presenter's Notes

1
Field-Weighted XML Retrieval Based on BM25

W. Lu (Reed)
Center for Studies of Information Resources
Wuhan University, China
sa713_at_soi.city.ac.uk

S. E. Robertson
Microsoft Research Cambridge
ser_at_microsoft.com

A. Macfarlane
Centre for Interactive Systems Research
City University London
andym_at_soi.city.ac.uk
2
Outline

Basic work for INEX 2005
Our approach
Field-weighted model BM25F
Element-weighted model BM25E
Experiments
Results
Future work

3
Basic work for INEX (first year for us)

Developed a path indexing system
Revised Okapi's index structure to combine with
the path indexing system
Developed a query parser and BM25E-based
retrieval and output interfaces

4
Our approach

BM25F, proposed by Robertson [11]
It is a linear combination of field-weighted term
frequencies rather than a combination of
field-weighted scores

5
Our approach

BM25F

This is a field-weighted version of BM25
The difference is that
$tf'_j$ is the weighted tf
$dl'$ is the weighted document length
$avdl'$ is the weighted average dl
across the collection
$K_1'$ is the weighted free parameter:
$K_1' = K_1 \cdot avdl' / avdl$
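
The scoring formula itself is not preserved in this transcript, so the following is only a minimal sketch of how the weighted quantities listed above could plug into a BM25-style term weight. It assumes the commonly used BM25 saturation form, a length-normalisation parameter `b`, and the usual Robertson/Sparck Jones idf; the function name, argument names and default values are illustrative, not taken from the slides.

```python
import math


def bm25f_term_weight(tf_w, dl_w, avdl_w, avdl, n_docs, df, k1=1.2, b=0.75):
    """BM25-style weight of one query term from field-weighted statistics.

    tf_w   : weighted term frequency (tf')
    dl_w   : weighted document length (dl')
    avdl_w : weighted average document length (avdl')
    avdl   : unweighted average document length
    n_docs : number of documents in the collection (assumed idf input)
    df     : number of documents containing the term (assumed idf input)
    k1, b  : free parameters; the defaults are assumptions, not tuned values
    """
    # Rescale K1 by the ratio of weighted to unweighted average length.
    k1_w = k1 * avdl_w / avdl
    # Standard BM25 length normalisation, applied to the weighted lengths.
    norm = k1_w * ((1.0 - b) + b * dl_w / avdl_w)
    # Assumed idf component (Robertson/Sparck Jones form).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return (k1_w + 1.0) * tf_w / (norm + tf_w) * idf
```

When every field weight is 1, the weighted and unweighted lengths coincide, `k1_w` equals `k1`, and the function reduces to plain BM25.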

6
Our approach

BM25F

Suppose we have $n_F$ fields $f = 1, \ldots, n_F$. In a
given document $d$, term $t$ has frequency $tf_{d,t,f}$
in field $f$. Then, using the number of indexed
terms (tokens), the length of the field in this
document is
$$dl_{d,f} = \sum_{t \in V} tf_{d,t,f}$$
where $V$ is the vocabulary, i.e. all indexed terms.
7
Our approach

BM25F

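With no field weighting, the term frequency of $t$
in the whole document is
$$tf_{d,t} = \sum_{f=1}^{n_F} tf_{d,t,f}$$
and the document length is
$$dl_d = \sum_{t \in V} tf_{d,t}$$
Average document length is
$$avdl = \frac{1}{N} \sum_{d} dl_d$$
where $N$ is the number of documents in the collection.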
8
Our approach

BM25F

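With field weights $W_f$, these are modified as
follows
$$tf'_{d,t} = \sum_{f=1}^{n_F} W_f \cdot tf_{d,t,f}$$
$$dl'_d = \sum_{t \in V} tf'_{d,t}$$
$$avdl' = \frac{1}{N} \sum_{d} dl'_d$$
where $N$ is the number of documents in the collection.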
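
As a rough illustration of the field-weighted statistics, the sketch below computes the weighted term frequencies, weighted document length and weighted average length from per-field term counts. The data layout (a dict mapping field name to a dict of term counts), the function names, and the default weight of 1 for unlisted fields are all assumptions made for this example.

```python
from collections import Counter


def field_length(field_tf):
    """Length of one field: the number of indexed tokens it contains."""
    return sum(field_tf.values())


def weighted_doc_stats(doc, weights):
    """Return (weighted tf per term, weighted document length).

    doc     : {field_name: {term: tf}} (hypothetical layout)
    weights : {field_name: W_f}; unlisted fields get weight 1 (assumption)
    """
    tf_w = Counter()
    dl_w = 0.0
    for field, field_tf in doc.items():
        w = weights.get(field, 1.0)
        for term, tf in field_tf.items():
            tf_w[term] += w * tf          # tf' = sum over fields of W_f * tf
        dl_w += w * field_length(field_tf)  # dl' = sum over fields of W_f * length
    return dict(tf_w), dl_w


def weighted_avdl(docs, weights):
    """Weighted average document length across a non-empty collection."""
    return sum(weighted_doc_stats(d, weights)[1] for d in docs) / len(docs)
```

For example, a document with title terms `{"xml": 1, "retrieval": 1}` and body terms `{"xml": 3, "index": 2}`, with a title weight of 4 and a body weight of 1, gets a weighted tf of 7 for "xml" and a weighted length of 13.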
9
Our approach

BM25E (BM25F applied to element retrieval)

where
$tf'_j$ denotes the weighted term frequency of the $j$th term
$t$ in element $e$
$el'$ is the weighted element length
$avel'$ is the weighted average element length across
the collection
$K_1'$ is the weighted free parameter
10
Our approach

BM25E (BM25F applied to element retrieval)

Our basic view is that an element is to be
treated like a document, except that it may
inherit information from other elements (atl, abs,
st) in the document.
The key is to tune the parameter Wf for each
selected field (element) that contributes to the
specified elements.

11
Our Experiments

Assumption 1: Elements in one document have no
effect on elements in other documents. Within a
document, elements other than atl, abs and st have
no effect on elements that are not their ancestors.
Assumption 2: Elements atl and abs contribute to
the weight of elements bdy and bm and their child
elements. An st element contributes to the weight
of the section it belongs to, of that section's
child elements, and of the article element. All
st elements have the same Wf, regardless of the
level they belong to.
Assumption 3: Because computing the parameters
avel and K1 for every element is complex, we
assume the article-level values can be used
for all elements.
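
One way to read Assumption 2 is sketched below: an element's weighted term frequencies are its own counts plus weighted counts inherited from atl, abs and the enclosing section's st. This is only an illustration. The function name, the path representation, and the convention that `own_tf` excludes the text of the inherited fields (so that text is counted once, with its weight) are assumptions, not details from the slides.

```python
from collections import Counter


def element_weighted_tf(own_tf, path, atl_tf, abs_tf, st_tf, w_atl, w_abs, w_st):
    """Weighted tf of one element with inherited field information.

    own_tf : {term: tf} for the element's own text (assumed to exclude
             atl, abs and st text)
    path   : tuple of tag names from the article root to the element,
             e.g. ("article", "bdy", "sec", "p") (hypothetical layout)
    atl_tf, abs_tf : term counts of the article title and abstract
    st_tf  : term counts of the enclosing section's title, or None
    """
    tf_w = Counter(own_tf)
    # atl and abs contribute to bdy, bm and their child elements.
    if any(tag in ("bdy", "bm") for tag in path):
        for term, tf in atl_tf.items():
            tf_w[term] += w_atl * tf
        for term, tf in abs_tf.items():
            tf_w[term] += w_abs * tf
    # st contributes to its section and that section's child elements;
    # the caller passes the enclosing section's title, if any.
    if st_tf is not None:
        for term, tf in st_tf.items():
            tf_w[term] += w_st * tf
    return dict(tf_w)
```

The same Wf is used for every st regardless of nesting level, which matches the last sentence of Assumption 2.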

12
Our Experiments
Experiment procedure
(1) Select atl, abs and st as the tuned fields.
(2) Use the INEX 04 data set, CO topics (40) and
relevance assessments to tune wf at document level
for atl, abs and st. The peak value is at
wf(atl, abs, st) = (2356, 4, 22).
(Metric: average precision)
(3) Select 6 groups of tuned wf values for INEX 05
retrieval and submission:
(2356, 4, 22), (1000, 4, 22) and (15, 4, 8)
for CO.Thorough runs
(1000, 4, 22), (300, 4, 18) and (98, 4, 13)
for CO.FetchBrowse runs
Note: only article, abs, bdy, bm, bib, section
elements and paragraph elements are treated as
retrievable elements.
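
The tuning step can be pictured as a search over candidate weight triples, scored by average precision on the training topics. The sketch below is a simplified illustration of that idea, not the authors' procedure: `search` is a hypothetical callback that runs retrieval for one topic with a given (w_atl, w_abs, w_st) triple, and `qrels` maps each topic to its set of relevant ids.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at each relevant rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def tune_weights(candidates, search, topics, qrels):
    """Return the candidate weight triple with the highest mean average precision.

    candidates : iterable of (w_atl, w_abs, w_st) triples
    search     : search(topic, weights) -> ranked list of ids (hypothetical)
    topics     : non-empty list of topic ids
    qrels      : {topic: set of relevant ids}
    """
    def mean_ap(weights):
        scores = [average_precision(search(t, weights), qrels[t]) for t in topics]
        return sum(scores) / len(scores)

    return max(candidates, key=mean_ap)
```

An exhaustive search like this is only practical for a small candidate set; how the candidate triples were generated is not stated in the slides.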
13
Results and evaluation
(1) Our runs for CO.Thorough do well, especially
for nxCG(25, 50) or ep/gr with quantization
strict, overlap off. For quantization
generalized, our runs perform about average.
(2) Runs using wf (2356, 4, 22) and (1000, 4, 22)
do better than (15, 4, 8) for CO.Thorough runs.
(3) The results show that our method is worth
exploring further, and that tuning the selected
elements atl, abs and st is really beneficial.
14
Future work
(1) Tune wf at element level, not only at
document level.
(2) Investigate parameters such as avel and K1
at element level.
(3) Upgrade our system so that more runs can be
submitted and more tasks covered next year.
15
Thanks!
