Title: A Generative Retrieval Model for Structured Documents
1. A Generative Retrieval Model for Structured Documents
- Le Zhao, Jamie Callan
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University
- Oct 2008
2. Background
- Structured documents
  - Author-edited fields
    - Library systems: title, meta-data of books
    - Web documents: HTML, XML
  - Automatic annotations
    - Part of Speech, Named Entity, Semantic Role
- Structured query
  - Human
  - Automatic
3. Example Structured Retrieval Queries
- XML element retrieval
  - NEXI query (Wikipedia XML)
    - a) //article[about(., music)]
    - b) //article[about(.//section, music)]//section[about(., pop)]
- Question Answering
  - Indri query (ASSERT style SRL annotation)
    - #combine[sentence]( #combine[target]( love #combine[./arg0]( #any:person ) #combine[./arg1]( Mary ) ) )
[Figure: an example Wikipedia XML article with sections about music and pop, and an SRL-annotated sentence "John loves Mary" with arg0 = John and arg1 = Mary.]
4. Motivation
- Basis: Language Model + Inference Net (the Indri search engine / Lemur)
  - Already supports field retrieval: indexing and retrieving relations between annotations
  - Flexible query language: new query forms can be tested promptly
- Main problems
  - Approximate matching (structure + keyword)
  - Evidence combination
- Extension from the keyword retrieval model
  - Approximate structure + keyword matching
  - Combining evidence from inner fields
- Goal: outperform keyword retrieval in precision, through
  - A coherent structured retrieval model: better understanding, better smoothing, guidance for query formulation
  - Finer control via accurate, robust structured queries
5. Roadmap
- Brief overview of Indri field retrieval
- Existing problems
- The generative structured retrieval model
  - Term and field level smoothing
  - Evidence combination alternatives
- Experiments
- Conclusions
6. Indri Document Retrieval
- #combine(iraq war)
- Scoring scope is the document
- Returns a list of scores for documents
- A language model is built from the scoring scope and smoothed with the collection model (a rough sketch follows).
- Because of smoothing, partial matches can also be returned.
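- As a rough illustration of this scoring (my own sketch of query likelihood with Dirichlet smoothing against the collection model, not Indri's implementation; the function name and mu value are assumptions):

    import math
    from collections import Counter

    def score_document(query_terms, doc_terms, collection_counts, collection_len, mu=2500):
        """Log query likelihood of a document, Dirichlet-smoothed with the collection model."""
        doc_counts = Counter(doc_terms)
        doc_len = len(doc_terms)
        score = 0.0
        for q in query_terms:
            p_coll = collection_counts.get(q, 0) / collection_len   # collection model P(q|C)
            p = (doc_counts[q] + mu * p_coll) / (doc_len + mu)      # smoothed P(q|D)
            score += math.log(p) if p > 0 else math.log(1e-12)      # guard against unseen terms
        return score

    # Example: #combine(iraq war) over a toy document and collection
    doc = "the iraq war began in 2003".split()
    coll = Counter("iraq war war news report the of in and 2003 peace".split())
    print(score_document(["iraq", "war"], doc, coll, sum(coll.values())))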
7. Indri Field Retrieval
- #combine[title](iraq war)
- Scoring scope is the title field
- Returns a list of scores for titles
- A language model is built from the scoring scope (title), smoothed with the document and collection models.
- Results on the Wikipedia collection (score, document, start, end, content):
    -1.19104  199636.xml   0  2  Iraq War
    -1.58453  1595906.xml  0  3  Anglo-Iraqi War
    -1.58734  14889.xml    0  3  Iran-Iraq War
    -1.87811  184613.xml   0  4  2003 Iraq war timeline
    -2.07668  2316643.xml  0  5  Canada and the Iraq War
    -2.09957  202304.xml   0  5  Protests against the Iraq war
    -2.23997  2656581.xml  0  6  Saddam's Trial and Iran-Iraq War
    -2.35804  1617564.xml  0  7  List of Iraq War Victoria Cross
- Because of smoothing, partial matches can also be returned.
8. Existing Problems
9. Evidence Combination
- Topic: a document with multiple sections about the Iraq war, which discusses Bush's exit strategy.
- #combine( #combine[section](iraq war) bush #1(exit strategy) )
  - The sections could return scores (0.2, 0.2, 0.1, 0.002) for one document
- Some options (see the sketch after this list):
  - max (Bilotti et al. 2007): only considers one match
  - or: favors many matches, even if they are weak
  - and: biased against many matches, even if they are good
  - average: favors many good matches, but is hurt by weak matches
- What about documents that don't contain a section element, but do have a lot of matching terms?
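- A minimal sketch (my own illustration of the four options over per-section match probabilities, not Indri's code):

    import math

    def combine(section_probs, method="avg"):
        """Combine per-section match probabilities (e.g. [0.2, 0.2, 0.1, 0.002]) into one document score."""
        if method == "max":        # only the best-matching section counts
            return max(section_probs)
        if method == "or":         # probability that at least one section matches
            return 1.0 - math.prod(1.0 - p for p in section_probs)
        if method == "and":        # probability that every section matches
            return math.prod(section_probs)
        if method == "avg":        # expected match probability under a uniform prior over sections
            return sum(section_probs) / len(section_probs)
        raise ValueError(method)

    scores = [0.2, 0.2, 0.1, 0.002]
    print({m: round(combine(scores, m), 4) for m in ["max", "or", "and", "avg"]})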
10. Evidence Combination Methods (1)
    AVG      OR       MAX      Result sentences
    -1.545   -0.5075  -0.5085  1) It must take measures.
    -1.881   -0.8467  -0.8520  2) U.S. investments worldwide could be in jeopardy if other countries take up similar measures.
    -2.349   -0.2425  -0.5030  3) Chanting the slogan "take measures before they take our measurements," the Greenpeace activists set up a coffin outside the ministry to draw attention to the deadly combination of atmospheric pollution and rising temperatures in Athens, which are expected to reach 42 degrees centigrade at the weekend.
    -2.401   -0.4817  -0.5012  4) SINGAPORE, May 14 (Xinhua) -- The Singapore government will take measures to discourage speculation in the private and tighten credit, particularly for foreigners, Deputy Prime Minister Lee Hsien Loong announced here today.
11. Bias Toward Short Fields
- Topic: Who loves Mary?
  #combine( #combine[target]( loves #combine[./arg0]( #any:person ) #combine[./arg1]( Mary ) ) )
- P_MLE(q_i | E) = count(q_i, E) / |E|
  - Produces very skewed scores when |E| is small
  - E.g., if |E| = 1, P_MLE(q_i | E) is either 0 or 1
- Biases toward #combine[target](loves)
  - The target field usually has length 1, while arg0/arg1 are longer
  - With Jelinek-Mercer smoothing, the ratio between having and not having a target match is larger than that for arg0/arg1 (illustrated in the sketch below)
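- A small numeric illustration (my own sketch; the lambda and collection probability values are assumptions) of why the length-1 target field dominates under Jelinek-Mercer smoothing:

    def jm_smoothed(count, field_len, p_collection, lam=0.6):
        """Jelinek-Mercer: lam * P_MLE(q|field) + (1 - lam) * P(q|collection)."""
        p_mle = count / field_len if field_len else 0.0
        return lam * p_mle + (1 - lam) * p_collection

    p_c = 0.001   # assumed collection probability of the query term "loves"

    # Ratio of the smoothed probability with a match vs. without one
    ratio_target = jm_smoothed(1, 1, p_c) / jm_smoothed(0, 1, p_c)   # length-1 target field: ~1500x
    ratio_arg    = jm_smoothed(1, 5, p_c) / jm_smoothed(0, 5, p_c)   # length-5 arg field:    ~300x

    print(ratio_target, ratio_arg)   # matching the short target field changes the score far more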
12. The Generative Structured Retrieval Model
- A new framework for structured retrieval
- A new term-level smoothing method
13. A New Framework
- #combine( #combine[section](iraq war) bush #1(exit strategy) )
- Query
  - Traditional: merely the sampled terms
  - New: specifies a graphical model, a generation process
- Scoring scope is the document
  - For one document, calculate the probability of the model
- Sections are used as evidence of relevance for the document
  - A section is a hidden variable in the graphical model
  - In general, inner fields are hidden and are used to score outer fields
  - Hidden variables are summed over to produce the final score
  - This averages the scores from the section fields (uniform prior over sections), as in the sketch below
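- A minimal sketch of this scoring (my own illustration under the slide's uniform-prior assumption; the mu value and function name are assumptions, not Indri's code):

    def score_with_hidden_sections(query_terms, sections, collection_model, mu=100):
        """P(Q|D) = sum over sections s of P(s) * prod_q P(q|s): a uniform prior over sections,
        with each section Dirichlet-smoothed against the collection model."""
        if not sections:
            return 0.0   # field-level smoothing (prior/empty fields) would handle this case
        prior = 1.0 / len(sections)              # uniform prior over the hidden section variable
        total = 0.0
        for section in sections:                 # each section is a list of terms
            p_q = 1.0
            for q in query_terms:
                p_coll = collection_model.get(q, 1e-6)
                p_q *= (section.count(q) + mu * p_coll) / (len(section) + mu)
            total += prior * p_q                 # sum the hidden variable out of the model
        return total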
14. A New Framework: Field-Level Smoothing
- Term level smoothing (traditional)
  - No section contains "iraq" or "war"
  - Add prior terms to the section: a Dirichlet prior from the collection model
- Field level smoothing (new)
  - No section field in the document: add prior fields
[Diagram: the smoothing hierarchy — terms in each section S (length |S|) are smoothed with the collection model P(w|C) via a Dirichlet prior µ, giving P(w|section, D, C) for the sections in document D.]
15. A New Framework: Advantages
- Soft matching of sections
  - Matches documents even without section fields
  - Prior fields; (Bilotti et al. 2007) called these empty fields
- Aggregation of all matching fields
  - P-OR, Max, ...: heuristics
  - From our generative model: Probabilistic-Average
16. Reduction to Keyword Likelihood Model
- Assume a term tag around each term in the collection
- Assume no document level smoothing (µ_d → inf, λ_d → 0)
- Then, no matter how many empty smoothing fields there are, the AVG model degenerates to the keyword retrieval model, in the following way:
  #combine( #combine[term]( u ) #combine[term]( v ) ) = #combine( u v )
  (the same collection level smoothing, Dirichlet or Jelinek-Mercer, is preserved)
17. Term Level Smoothing Revisited
- Two-level Jelinek-Mercer (traditional)
- Equivalently (a more general parameterization),
- Two-level Dirichlet (new) (see the formulas sketched below)
- Corrects Jelinek-Mercer's bias toward shorter fields
  - The relative gain of matching is independent of field length
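- For reference, the standard forms of these two smoothing schemes (a reconstruction using the parameter names from the optimal-parameter table; the exact parameterization on the original slide may differ):
  - Two-level Jelinek-Mercer: P(w|F,D,C) = λ1 * P_MLE(w|F) + λ2 * P_MLE(w|D) + (1 − λ1 − λ2) * P(w|C)
  - Two-level Dirichlet: P(w|D,C) = (c(w;D) + µc * P(w|C)) / (|D| + µc), then P(w|F,D,C) = (c(w;F) + µd * P(w|D,C)) / (|F| + µd), where F is the field being scored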
18. Experiments
- Smoothing
- Evidence combination methods
19. Datasets
- XML retrieval
  - INEX 06, 07 (Wikipedia collection)
  - Goal: evaluate evidence combination (and smoothing)
  - Topics (modified): element retrieval → document retrieval, e.g. #combine( #combine[section](wifi security) )
  - Assessments (modified): any element relevant → document relevant
  - Smoothing parameters
    - Trained on INEX 06, 62 topics
    - Tested on INEX 07, 60 topics
- Question Answering
  - TREC 2002
  - AQUAINT Corpus
  - Topics
    - Training: 55 original topics -> 367 relevant sentences (new topics)
    - Test: 54 original topics -> 250 relevant sentences (new topics)
  - For example:
    Question: Who loves Mary?
    Relevant sentence: John says he loves Mary
    Query: #combine[target]( love #combine[./arg1](Mary) )
  - Relevance feedback setup, stronger than (Bilotti et al. 2007)
20Effects of 2-level Dirichlet smoothing
Table 3. A comparison of two-level
Jelinek-Mercer and two-level Dirichlet smoothing
on the INEX and QA datasets.
Structured query Structured query Structured query Keyword query Keyword query Keyword query
Collections Metric 2-level Jelinek-Mercer 2-level Dirichlet Improvement 2-level Jelinek-Mercer 2-level Dirichlet Improvement
INEX06 (training) MRR P_at_10 MAP 0.73020.46940.2900 0.78720.47100.2927 7.8060.34090.9310 0.70170.47900.2918 0.75600.50810.2956 7.7386.0751.302
INEX07 (test) MRR P_at_10 MAP 0.70610.45170.2838 0.73860.46330.2871 4.6032.5681.163 0.65150.44330.2839 0.77340.46670.2979 18.715.2794.931
QA (training) MRR P_at_10 MAP 0.63710.22530.1402 0.67290.23710.1460 5.6195.2374.137 0.52110.20980.1651 0.52270.23360.1755 0.307011.346.299
QA (test) MRR P_at_10 MAP 0.51280.16840.1634 0.51380.17640.1623 0.19504.751-0.6732 0.16490.08080.1197 0.17910.08080.1189 8.6110.0000-0.6683
significance level lt 0.04, significance
level lt 0.002, significance level lt 0.00001
21. Optimal Smoothing Parameters
    Datasets  Queries     Jelinek-Mercer     Dirichlet
                          λ1      λ2         µd      µc
    INEX      Keyword     0.6     0.2        any     800
    INEX      Structured  0.7     0.1        100     800
    QA        Keyword     0.6     0.2        10      1000
    QA        Structured  0.7     0.1        5       50
- Optimization with grid search (see the sketch below)
- The optimal values for Dirichlet are related to the average length of the fields being queried
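- A minimal sketch of such a parameter sweep (my own illustration; the function names and value ranges are assumptions, not those used in the experiments):

    import itertools

    def grid_search(score_fn, param_grid):
        """Exhaustive grid search; score_fn evaluates one parameter setting (e.g. MAP on training topics)."""
        best_params, best_score = None, float("-inf")
        for values in itertools.product(*param_grid.values()):
            params = dict(zip(param_grid.keys(), values))
            score = score_fn(**params)
            if score > best_score:
                best_params, best_score = params, score
        return best_params, best_score

    # e.g. for two-level Dirichlet:
    # grid_search(evaluate_map, {"mu_d": [5, 10, 100, 1000], "mu_c": [50, 800, 1000, 2500]})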
22Evidence Combination Methods
Structured query Metric AVG MAX OR Keyword query
INEX 06 (training) MRR P_at_10 MAP 0.75600.50810.2956 0.78180.50650.3094 0.80930.46610.2914 0.85380.52740.3538
INEX 07 (test) MRR P_at_10 MAP 0.77340.46670.2979 0.75520.45830.3006 0.73860.46330.2871 0.81290.47830.3463
QA (training) MRR P_at_10 MAP 0.42650.17250.1251 0.64520.23730.1501 0.07620.03130.0225 0.52270.23360.1755
QA (test) MRR P_at_10 MAP 0.39470.14280.1264 0.51380.17520.1617 0.07010.03120.0364 0.17910.08080.1189
- For QA, MAX is best
- For INEX
- Evaluation at document level does not discount
irrelevant text portions - Not clear which combination method performs best
23. Better Evaluation for INEX Datasets
- NDCG
  - Assumptions
    - The degree of relevance is somehow given
    - The user spends a similar amount of effort on each document, and effort decreases with log-rank
- With more informative element level judgments
  - Degree of relevance for a document: relevance density
    - The proportion of relevant text (in bytes) in the document
  - Discount lower ranked relevant documents
    - Not by the number of documents ranked ahead
    - But by the length (in bytes) of the text ranked ahead
  - This effectively discounts irrelevant text ranked ahead (see the sketch after this list)
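- A rough sketch of a gain/discount computed this way (my own illustration; the exact discount function and the 1000-byte effort unit are assumptions, not the formula from the slides):

    import math

    def byte_discounted_dcg(ranked_docs):
        """ranked_docs: list of (relevant_bytes, total_bytes) per retrieved document, in rank order.
        Gain = relevance density; the discount grows with the bytes of text ranked ahead."""
        dcg = 0.0
        bytes_ahead = 0
        for relevant_bytes, total_bytes in ranked_docs:
            density = relevant_bytes / total_bytes           # degree of relevance for the document
            discount = math.log2(2 + bytes_ahead / 1000.0)   # discount by text shown so far, not by rank
            dcg += density / discount
            bytes_ahead += total_bytes
        return dcg                                           # NDCG would divide by the ideal ranking's value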
24Measuring INEX topics with NDCG
Structured retrieval Structured retrieval AVG AVG MAX MAX OR OR Keyword query Keyword query
Smoothing method Smoothing method J-M1 Dir1 J-M Dir J-M Dir J-M Dir
INEX 06 (training) NDCG_at_10 NDCG_at_20 NDCG_at_30 0.48210.54900.5847 0.50170.56690.5889 0.43200.50120.5184 0.44780.51640.5388 0.40480.46710.4903 0.42380.48570.5091 0.45550.53760.5816 0.52810.59770.6514
INEX 06 (training) AvgLen_at_10 AvgLen_at_20 AvgLen_at_30 5818.65327.45140.0 6268.85621.55228.1 8296.67715.27654.7 7981.17708.27086.6 126351122011144 12361104599792.4 5138.95106.35063.1 9906.88717.48065.9
INEX 07 (test) NDCG_at_10 NDCG_at_20 NDCG_at_30 0.47550.54050.5788 0.52910.59740.6390 0.45550.51410.5574 0.47940.55090.5893 0.49840.55370.5895 0.48270.53620.5729 0.46520.55700.6212 0.50630.57970.6388
INEX 07 (test) AvgLen_at_10 AvgLen_at_20 AvgLen_at_30 3819.64135.33875.3 5218.74704.14493.9 6488.66233.75955.0 6943.86576.16066.1 107129759.28853.6 109009553.58668.3 3464.83367.13493.9 8736.77784.57228.8
- p lt 0.007 between AVG and MAX or AVG and OR
- No significant difference between AVG and keyword!
25. Error Analysis for INEX06 Queries and Correcting INEX07 Queries
- Two changes (looking only at the training set)
  - Semantic mismatch with the topic (mainly the keyword query) (22/70)
    - Lacking alternative fields: image → image, figure
    - Wrong AND/OR semantics: (astronaut AND cosmonaut) → (astronaut OR cosmonaut)
    - Misspellings: VanGogh → Van Gogh
    - Over-restricted query terms using phrases: #1(origin of universe) → #uw4(origin universe)
  - All article restrictions → whole document (34/70)
- Proofreading test (INEX07) queries
  - Retrieval results of the queries are not referenced in any way
  - Only looked at the keyword query and topic description
26. Performance after query correction
- df = 30; p < 0.006 for NDCG@10, p < 0.0004 for NDCG@20, p < 0.002 for NDCG@30
27. Conclusions
- A structured query specifies a generative model for P(Q|D); model parameters are estimated from D; rank D by P(Q|D)
- The best evidence combination strategy is task dependent
- Dirichlet smoothing corrects the bias toward short fields, and outperforms Jelinek-Mercer
- Guidance for structured query formulation
- Robust structured queries can outperform keyword queries
28. Acknowledgements
- Paul Ogilvie
- Matthew Bilotti
- Eric Nyberg
- Mark Hoy
29Thanks!