Title: Field-Weighted XML Retrieval Based on BM25
1Field-Weighted XML Retrieval Based on BM25
- W. Lu (Reed), Center for Studies of Information Resources, Wuhan University, China, sa713_at_soi.city.ac.uk
- S. E. Robertson, Microsoft Research Cambridge, ser_at_microsoft.com
- A. Macfarlane, Centre for Interactive System Research, City University London, andym_at_soi.city.ac.uk
2Outline
- Basic work for INEX 2005
- Our approach
- Field-weighted model BM25F
- Element-weighted model BM25E
- Experiments
- Results
- Future work
3Basic work for INEX (first year for us)
- Developed a path indexing system
- Revised Okapi's index structure to combine with the path indexing system
- Developed a query parser and BM25E-based retrieval and output interfaces
4Our approach
- BM25F proposed by Robertson 11
- It is a linear combination of field-weighted term frequencies (tf), rather than a combination of field-weighted scores
5Our approach
- This is a field-weighted version of BM25
- The difference lies in that
- $tf_j'$ is the weighted tf
- $dl'$ is the weighted document length
- $avdl'$ is the weighted average dl across the collection
- $K_1'$ is the weighted free parameter
- $K_1' = K_1 \cdot avdl' / avdl$
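For reference, a sketch of the weighting function these bullets describe, assuming the standard Okapi BM25 term weight with primes marking the weighted quantities. The prime notation and the symbols $b$ (length-normalization parameter), $N$ (number of documents) and $df_j$ (document frequency of term $j$) are assumptions, not taken from the bullets:

```latex
w_j(d, C) = \frac{(K_1' + 1)\, tf_j'}
                 {K_1' \left( (1 - b) + b \, \frac{dl'}{avdl'} \right) + tf_j'}
            \cdot \log \frac{N - df_j + 0.5}{df_j + 0.5}
```

Setting every field weight to 1 makes the primed and unprimed quantities coincide, which recovers plain BM25.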
6Our approach
Suppose we have $n_F$ fields $f = 1, \ldots, n_F$. In a given document $d$, term $t$ has frequency $tf_{d,t,f}$ in field $f$. Then, counting indexed terms (tokens), the length of the field in this document is
$$dl_{d,f} = \sum_{t \in V} tf_{d,t,f}$$
where $V$ is the vocabulary, i.e. all indexed terms.
7Our approach
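With no field weighting, the term frequency of $t$ in the whole document is
$$tf_{d,t} = \sum_{f=1}^{n_F} tf_{d,t,f}$$
and the document length is
$$dl_d = \sum_{t \in V} tf_{d,t} = \sum_{f=1}^{n_F} \sum_{t \in V} tf_{d,t,f}$$
Average document length is
$$avdl = \frac{1}{N} \sum_{d} dl_d$$
where $N$ is the number of documents in the collection.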
8Our approach
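With field weights $W_f$, these are modified as follows
$$tf'_{d,t} = \sum_{f=1}^{n_F} W_f \, tf_{d,t,f}$$
$$dl'_d = \sum_{t \in V} tf'_{d,t} = \sum_{f=1}^{n_F} W_f \sum_{t \in V} tf_{d,t,f}$$
$$avdl' = \frac{1}{N} \sum_{d} dl'_d$$
(averaged over the $N$ documents of the collection)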
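As an illustration of the weighted statistics, here is a minimal Python sketch. It assumes each document is given as a mapping from field name to token list; the function names and the example weights are hypothetical and not taken from the authors' Okapi-based system.

```python
from collections import Counter


def weighted_stats(doc_fields, field_weights):
    """Return (weighted tf per term, weighted document length).

    doc_fields: {field name: list of tokens} for one document.
    field_weights: {field name: W_f}; fields not listed get weight 1.0.
    """
    tf = Counter()
    dl = 0.0
    for field, tokens in doc_fields.items():
        w = field_weights.get(field, 1.0)
        for token in tokens:
            tf[token] += w          # each occurrence counts W_f times
        dl += w * len(tokens)       # field length scaled by W_f
    return tf, dl


def weighted_avdl(collection, field_weights):
    """Average weighted document length over a non-empty collection."""
    lengths = [weighted_stats(doc, field_weights)[1] for doc in collection]
    return sum(lengths) / len(lengths)


# Hypothetical example: title tokens count four times as much as body tokens.
doc = {"atl": ["xml", "retrieval"], "bdy": ["xml", "index", "xml"]}
tf, dl = weighted_stats(doc, {"atl": 4.0, "bdy": 1.0})
```

Because both the term frequencies and the lengths are scaled by the same weights, the weighted document length still equals the sum of the weighted term frequencies.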
9Our approach
- BM25E (BM25F applied to element retrieval)
Where
$tf_j'$ denotes the weighted term frequency of the jth term t in element e
$el'$ is the weighted element length
$avel'$ is the weighted average element length across the collection.
$K_1'$ is the weighted free parameter.
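The element-level weight can be sketched in Python as follows, assuming it takes the standard Okapi BM25 form with the weighted quantities substituted in. The function and parameter names, the length-normalization parameter `b`, and the use of document counts for the idf part are assumptions made for illustration.

```python
import math


def bm25e_term_weight(tf_w, el_w, avel_w, k1_w, b, n_docs, df):
    """BM25-style weight of one term in one element.

    tf_w: weighted term frequency, el_w: weighted element length,
    avel_w: weighted average element length, k1_w: weighted K1.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = k1_w * ((1.0 - b) + b * el_w / avel_w)
    return (k1_w + 1.0) * tf_w / (norm + tf_w) * idf


def bm25e_score(query_terms, tf_weighted, el_w, avel_w, k1_w, b, n_docs, dfs):
    """Sum the term weights over query terms that have a known df."""
    return sum(
        bm25e_term_weight(tf_weighted.get(t, 0.0), el_w, avel_w, k1_w, b, n_docs, dfs[t])
        for t in query_terms
        if t in dfs
    )
```

A term that does not occur in the element contributes zero, and for a fixed element length the weight grows with the weighted tf but saturates at `(k1_w + 1) * idf`.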
10Our approach
- BM25E (BM25F applied to element retrieval)
- Our basic view is that an element is to be treated like a document, except that it may inherit information from other elements (atl, abs, st) in the document.
- The key is to tune the parameter Wf for each selected field (element) that contributes to the specified elements.
11Our Experiments
- Assumption 1: Elements in one document have no effect on elements in other documents. Within a document, elements other than atl, abs and st have no effect on elements that are not their ancestors.
- Assumption 2: The atl and abs elements contribute to the weight of the bdy and bm elements and their child elements. An st element contributes to the weight of the section it belongs to, and also to that section's child elements and to the article element. All st elements have the same Wf, regardless of the level they belong to.
- Assumption 3: Because the parameters avel and K1 are complex to compute for every element, we assume the article-level values can be used instead for all elements.
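The inheritance rules of Assumption 2 can be sketched in Python as below. The slash-separated path format, the tag name `sec` for sections, and both function names are assumptions for illustration; only the rules themselves come from the slide. The example uses (1000, 4, 22) for (atl, abs, st), one of the weight triples from the experiments.

```python
from collections import Counter


def inherited_fields(element_path):
    """Field types assumed to contribute to an element under Assumption 2.

    element_path is a hypothetical path such as "article/bdy/sec[1]/p[2]".
    """
    parts = [p.split("[")[0] for p in element_path.strip("/").split("/")]
    fields = set()
    if "bdy" in parts or "bm" in parts:
        fields.update({"atl", "abs"})   # title and abstract reach bdy, bm and below
    if "sec" in parts or parts == ["article"]:
        fields.add("st")                # section title reaches its section and the article
    return fields


def element_weighted_tf(own_tokens, field_tokens, field_weights, fields):
    """Weighted tf of an element: own tokens count once, inherited fields W_f times."""
    tf = Counter()
    for token in own_tokens:
        tf[token] += 1.0
    for field in fields:
        for token in field_tokens.get(field, []):
            tf[token] += field_weights[field]
    return tf


path = "article/bdy/sec[1]/p[2]"
tf = element_weighted_tf(
    own_tokens=["xml", "index"],
    field_tokens={"atl": ["xml", "retrieval"], "abs": ["xml"], "st": ["index"]},
    field_weights={"atl": 1000.0, "abs": 4.0, "st": 22.0},
    fields=inherited_fields(path),
)
```

In this sketch the caller is responsible for passing only the st tokens of the enclosing section; the path check decides which field types apply, not which section title.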
12Our Experiments
Experiment procedure
(1) Select atl, abs and st as the tuned fields.
(2) Use the INEX 04 data set, CO topics (40) and relevance assessments to tune Wf at document level for atl, abs and st. The peak value is at Wf(atl, abs, st) = (2356, 4, 22). (Metric: average precision)
(3) Select 6 groups of tuned Wf values for INEX 05 retrieval and submission: (2356, 4, 22), (1000, 4, 22) and (15, 4, 8) for the CO.Thorough runs; (1000, 4, 22), (300, 4, 18) and (98, 4, 13) for the CO.FetchBrowse runs.
Note: only article, abs, bdy, bm, bib, section elements and paragraph elements are treated as retrievable elements.
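The tuning step can be sketched as an exhaustive search over candidate weight triples, scored by mean average precision over the topics. The slides name the metric but not the search procedure, so the grid search, the `search` callback and the function names below are assumptions for illustration.

```python
from itertools import product


def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of one ranked list."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank    # precision at this relevant rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def tune_field_weights(grid, topics, qrels, search):
    """Try every (atl, abs, st) combination in grid and keep the best one.

    grid: three lists of candidate weights, one per field.
    qrels: {topic: set of relevant document ids}.
    search: hypothetical callback (topic, weights) -> ranked document ids.
    """
    best_weights, best_map = None, -1.0
    for weights in product(*grid):
        ap_values = [average_precision(search(t, weights), qrels[t]) for t in topics]
        mean_ap = sum(ap_values) / len(ap_values)
        if mean_ap > best_map:
            best_weights, best_map = weights, mean_ap
    return best_weights, best_map
```

A call such as `tune_field_weights(([1, 10, 100], [1, 4], [1, 22]), topics, qrels, search)` would evaluate twelve combinations; a finer grid or a coordinate-wise search would be needed to locate a peak as specific as (2356, 4, 22).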
13Results and evaluation
(1) Our CO.Thorough runs do well, especially on nxCG (25, 50) and ep/gr with strict quantization and overlap off. With generalized quantization, our runs are about average.
(2) Runs using Wf (2356, 4, 22) and (1000, 4, 22) do better than (15, 4, 8) for the CO.Thorough runs.
(3) The results suggest that our method is worth exploring further, and that tuning the selected elements atl, abs and st is beneficial.
14Future work
(1) Tune Wf at element level, not only at document level.
(2) Investigate parameters such as avel and K1 at element level.
(3) Upgrade our system so that we can submit more runs and take part in more tasks next year.
15Thanks!