Title: Prof. Ray Larson
1. Lecture 21: XML Retrieval
Principles of Information Retrieval
- Prof. Ray Larson
- University of California, Berkeley
- School of Information
- Tuesday and Thursday, 10:30 am - 12:00 pm
- Spring 2007
- http://courses.ischool.berkeley.edu/i240/s07
2. Mini-TREC
- Proposed Schedule
- February 15: Database and previous Queries
- February 27: Report on system acquisition and setup
- March 8: New Queries for testing
- April 19: Results due (Next Thursday)
- April 24 or 26: Results and system rankings
- May 8: Group reports and discussion
3. Announcement
- No Class on Tuesday (April 17th)
4. Today
- Review
- Geographic Information Retrieval
- GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
- XML and Structured Element Retrieval
- INEX
- Approaches to XML retrieval
Credit for some of the slides in this lecture goes to Marti Hearst
5. Today
- Review
- Geographic Information Retrieval
- GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
- Web Crawling and Search Issues
- Web Crawling
- Web Search Engines and Algorithms
Credit for some of the slides in this lecture goes to Marti Hearst
6. Introduction
- What is Geographic Information Retrieval?
- GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research, with the addition of spatially and geographically oriented indexing and retrieval.
- It combines aspects of DBMS research, User Interface research, GIS research, and Information Retrieval research.
7. Example: Results display from CheshireGeo
http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html
8. Other convex, conservative Approximations
9. Our Research Questions
- Spatial Ranking
- How effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations for these regions?
- Geometric Approximations and Spatial Ranking
- How do different geometric approximations affect the rankings?
- MBRs: the most popular approximation
- Convex hulls: the highest quality convex approximation
10. Spatial Ranking: Methods for computing spatial similarity
11. Probabilistic Models: Logistic Regression attributes
- X1 = area of overlap(query region, candidate GIO) / area of query region
- X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
- X3 = 1 - abs(fraction of overlap region that is onshore - fraction of candidate GIO that is onshore)
- Where:
- Range for all variables is 0 (not similar) to 1 (same); a computation sketch follows below
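A minimal sketch (not from the original slides) of how these three attributes could be computed, assuming the query region, candidate GIO approximation, and an onshore land polygon are available as shapely geometries; shapely is an assumed dependency here, not something the slides specify.

from shapely.geometry import Polygon

def spatial_lr_attributes(query_region: Polygon, candidate_gio: Polygon,
                          onshore: Polygon) -> tuple:
    """Compute the X1, X2, X3 attributes used in the spatial LR model."""
    overlap = query_region.intersection(candidate_gio)
    x1 = overlap.area / query_region.area      # overlap relative to the query region
    x2 = overlap.area / candidate_gio.area     # overlap relative to the candidate GIO
    # Shorefactor: 1 minus the difference in onshore fractions (0 = dissimilar, 1 = same)
    overlap_onshore = (overlap.intersection(onshore).area / overlap.area
                       if overlap.area > 0 else 0.0)
    gio_onshore = candidate_gio.intersection(onshore).area / candidate_gio.area
    x3 = 1.0 - abs(overlap_onshore - gio_onshore)
    return (x1, x2, x3)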
12. CA Named Places in the Test Collection: complex polygons
13. CA Counties: Geometric Approximations
- MBRs
- Convex Hulls
- Average false area of approximation: MBRs 94.61, Convex Hulls 26.73
14. CA User Defined Areas (UDAs) in the Test Collection
15. Test Collection Query Regions: CA Counties
- 42 of 58 counties referenced in the test collection metadata
- 10 counties randomly selected as query regions to train the LR model
- 32 counties used as query regions to test the model
16. LR model
- X1 = area of overlap(query region, candidate GIO) / area of query region
- X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
- Where:
- Range for all variables is 0 (not similar) to 1 (same)
17. Some of our Results
- Mean Average Query Precision: the average of the precision values observed after each new relevant document in a ranked list (a computation sketch follows at the end of this slide)
For metadata indexed by CA named place regions
- These results suggest:
- Convex Hulls perform better than MBRs
- Expected result, given that the CH is a higher quality approximation
- A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on Convex Hulls
- Interesting
- Since any approximation other than the MBR requires great expense, this suggests that exploring new ranking methods based on the MBR is a good way to go.
For all metadata in the test collection
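A minimal sketch (not from the original slides) of the average precision computation described above, for a single query; mean average query precision is then the mean of this value over all query regions. Following the slide's wording, precision values are averaged over the relevant documents actually observed in the ranked list (a common variant divides by the total number of relevant documents instead).

def average_precision(ranked_ids, relevant_ids):
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision after each new relevant document
    return sum(precisions) / len(precisions) if precisions else 0.0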
18. Some of our Results
- Mean Average Query Precision: the average of the precision values observed after each new relevant document in a ranked list.
For metadata indexed by CA named place regions
BUT: the inclusion of UDA-indexed metadata reduces precision. This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa.
For all metadata in the test collection
19. Shorefactor Model
- X1 = area of overlap(query region, candidate GIO) / area of query region
- X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
- X3 = 1 - abs(fraction of query region approximation that is onshore - fraction of candidate GIO approximation that is onshore)
- Where: Range for all variables is 0 (not similar) to 1 (same)
20. Some of our Results, with Shorefactor
For all metadata in the test collection
Mean Average Query Precision: the average of the precision values observed after each new relevant document in a ranked list.
- These results suggest:
- Addition of the Shorefactor variable improves the model (LR 2), especially for MBRs
- The improvement is not as dramatic for convex hull approximations, because the problem that Shorefactor addresses is not as significant when areas are represented by convex hulls.
21. Results for All Data - MBRs
(Precision vs. recall graph)
22. Results for All Data - Convex Hull
(Precision vs. recall graph)
23. XML Retrieval
- The following slides are adapted from
presentations at INEX 2003-2005 and at the INEX
Element Retrieval Workshop in Glasgow 2005, with
some new additions for general context, etc.
24. INEX Organization
- Organized By:
- University of Duisburg-Essen, Germany
- Norbert Fuhr, Saadia Malik, and others
- Queen Mary University of London, UK
- Mounia Lalmas, Gabriella Kazai, and others
- Supported By:
- DELOS Network of Excellence in Digital Libraries (EU)
- IEEE Computer Society
- University of Duisburg-Essen
25. XML Retrieval Issues
- Using Structure?
- Specification of Queries
- How to evaluate?
26. Cheshire SGML/XML Support
- Underlying native format for all data is SGML or XML
- The DTD defines the database contents
- Full SGML/XML parsing
- SGML/XML Format Configuration Files define the database location and indexes
- Various format conversions and utilities available for Z39.50 support (MARC, GRS-1)
27. SGML/XML Support
- Configuration files for the Server are SGML/XML
- They include elements describing all of the data files and indexes for the database.
- They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
28. Indexing
- Any SGML/XML tagged field or attribute can be indexed
- B-Tree and Hash access via Berkeley DB (Sleepycat)
- Stemming, keyword, exact keys and special keys
- Mapping from any Z39.50 Attribute combination to a specific index
- Underlying postings information includes term frequency for probabilistic searching
- Component extraction with separate component indexes
29. XML Element Extraction
- A new search ElementSetName is XML_ELEMENT_
- Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
- The matching elements are extracted from the records matching the search and delivered in a simple format.
30. XML Extraction
zselect sherlock
372 Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection 372
zfind topic mathematics
OK Status 1 Hits 26 Received 0 Set Default RecordSyntax UNKNOWN
zset recsyntax XML
zset elementset XML_ELEMENT_Fld245
zdisplay
OK Status 0 Received 10 Position 1 Set Default NextPosition 11 RecordSyntax XML 1.2.840.10003.5.109.10
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245>
</ITEM>
</RESULT_DATA>
etc.
31. TREC3 Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained from the formula below, for the 6 X attribute measures shown on the next slide.
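The formula itself appears only as an image in the original slide; reconstructed here in its standard form (an assumption based on the surrounding description), the log odds of relevance given query Q and component C is a linear combination of the attribute measures, and the probability follows from the logistic transform:

\log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i

P(R \mid Q, C) = \frac{e^{\,\log O(R \mid Q, C)}}{1 + e^{\,\log O(R \mid Q, C)}}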
32. TREC3 Logistic Regression
The six attribute measures:
- Average Absolute Query Frequency
- Query Length
- Average Absolute Component Frequency
- Document Length
- Average Inverse Component Frequency
- Number of Terms in both query and Component
33. Okapi BM25
- Where (in the ranking formula given below):
- Q is a query containing terms T
- K is k1 * ((1 - b) + b * dl/avdl)
- k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000 respectively
- tf is the frequency of the term in a specific document
- qtf is the frequency of the term in the topic from which Q was derived
- dl and avdl are the document length and the average document length, measured in some convenient unit
- w(1) is the Robertson-Sparck Jones weight
34. Combining Boolean and Probabilistic Search Elements
- Two original approaches:
- Boolean Approach
- Non-probabilistic Fusion Search: the set-merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
35. INEX 04 Fusion Search
(Diagram: subqueries produce component query results, which are fused/merged into a final ranked list)
- Merge multiple ranked and Boolean index searches within each query, and multiple component search result sets
- Major components merged are Articles, Body, Sections, subsections, paragraphs
36. Merging and Ranking Operators
- Extends the capabilities of merging to include merger operations within queries, like Boolean operators (a merging sketch follows this list)
- Fuzzy Logic Operators (not used for INEX)
- !FUZZY_AND
- !FUZZY_OR
- !FUZZY_NOT
- Containment operators: restrict components to or from a particular parent
- !RESTRICT_FROM
- !RESTRICT_TO
- Merge Operators
- !MERGE_SUM
- !MERGE_MEAN
- !MERGE_NORM
- !MERGE_CMBZ
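A minimal sketch (not part of Cheshire itself) of what score-merging operators like these might compute over two result sets of id/score pairs; treating MERGE_CMBZ as a CombMNZ-style normalized sum is an assumption based on its name.

# Hypothetical illustration of score-merging operators over two result sets,
# each a dict mapping a record/component id to a retrieval score.
def merge(run_a: dict, run_b: dict, op: str = "MERGE_SUM") -> dict:
    def norm(run):  # min-max normalize scores into [0, 1]
        lo, hi = min(run.values()), max(run.values())
        return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in run.items()}
    if op in ("MERGE_NORM", "MERGE_CMBZ"):
        run_a, run_b = norm(run_a), norm(run_b)
    merged = {}
    for key in set(run_a) | set(run_b):
        scores = [r[key] for r in (run_a, run_b) if key in r]
        if op == "MERGE_MEAN":
            merged[key] = sum(scores) / len(scores)
        elif op == "MERGE_CMBZ":          # CombMNZ-style: boost items found by both runs
            merged[key] = sum(scores) * len(scores)
        else:                             # MERGE_SUM / MERGE_NORM: plain (normalized) sum
            merged[key] = sum(scores)
    return merged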
37. New LR Coefficients
Estimates using INEX 03 relevance assessments for:
- b1: Average Absolute Query Frequency
- b2: Query Length
- b3: Average Absolute Component Frequency
- b4: Document Length
- b5: Average Inverse Component Frequency
- b6: Number of Terms in common between query and Component
38. INEX CO Runs
- Three official runs and one later run - all Title-only
- Fusion - combines Okapi and LR using the MERGE_CMBZ operator
- NewParms (LR) - uses only LR with the new parameters
- Feedback - an attempt at blind relevance feedback
- PostFusion - fusion of the new LR coefficients and Okapi
39. Query Generation - CO
- Topic 162 TITLE: Text and Index Compression Algorithms
- QUERY: (topicshort @ Text and Index Compression Algorithms) !MERGE_CMBZ (alltitles @ Text and Index Compression Algorithms) !MERGE_CMBZ (topicshort @ Text and Index Compression Algorithms) !MERGE_CMBZ (alltitles @ Text and Index Compression Algorithms)
- One of the @ ranking operators is Okapi, the other is LR
- !MERGE_CMBZ is a normalized score summation and enhancement
40. INEX CO Runs
Avg Prec (Strict / Generalized):
- FUSION   0.0642 / 0.0923
- NEWPARMS 0.0582 / 0.0853
- FDBK     0.0415 / 0.0390
- POSTFUS  0.0690 / 0.0952
41. INEX VCAS Runs
- Two official runs:
- FUSVCAS - element fusion using LR and various operators for path restriction
- NEWVCAS - uses the new LR coefficients for each appropriate index and various operators for path restriction
42. Query Generation - VCAS
- Topic 66 TITLE: //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
- Submitted query: ((topic @ intelligent transport systems)) !RESTRICT_FROM ((sec_words @ on-board route planning navigation system for automobiles))
- Target elements: sec, ss1, ss2, ss3
43. VCAS Results
Avg Prec (Generalized / Strict):
- FUSVCAS 0.0321 / 0.0601
- NEWVCAS 0.0270 / 0.0569
44. Heterogeneous Track
- Approach uses Cheshire's Virtual Database options
- Primarily a version of distributed IR
- Each collection indexed separately
- Search via Z39.50 distributed queries
- Z39.50 attribute mapping used to map query indexes to appropriate elements in a given collection
- Only LR used, and collection results merged using the probability of relevance for each collection result
45. INEX 2005 Approach
- Used only Logistic Regression methods:
- TREC3 with Pivot
- TREC2 with Pivot
- TREC2 with Blind Feedback
- Used post-processing for specific tasks
46. Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained from the fitted logistic model, for some set of m statistical measures, Xi, derived from the collection and query.
47. TREC2 Algorithm
48. Blind Feedback
- Term selection from top-ranked documents is based on the classic Robertson/Sparck Jones probabilistic model
- For each term t, a relevance weight is computed as shown below
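The weight formula appears as an image in the original slide; the classic Robertson/Sparck Jones relevance weight, which is presumably what is shown there, is

w_t^{(1)} = \log \frac{(r_t + 0.5)\,(N - n_t - R + r_t + 0.5)}{(n_t - r_t + 0.5)\,(R - r_t + 0.5)}

where N is the number of documents in the collection, R the number of (assumed) relevant documents, n_t the number of documents containing t, and r_t the number of assumed relevant documents containing t.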
49. Blind Feedback
- Top x new terms taken from the top y documents
- For each term in the top y assumed-relevant set, a term weight (termwt) is computed
- Terms are ranked by termwt and the top x are selected for inclusion in the query (see the note below)
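The termwt formula is also shown only as an image in the original slide; a common choice for this selection value, and an assumption here, is the Robertson selection value (offer weight), which multiplies the relevance weight by the term's frequency in the assumed-relevant set:

\mathrm{termwt}_t = r_t \cdot w_t^{(1)}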
50. Pivot method
- Based on the pivot weighting used by IBM Haifa in INEX 2004 (Mass & Mandelbrod)
- Used 0.50 as the pivot for all cases
- For the TREC3 and TREC2 runs, all component results are weighted by the article-level result for the matching article (a sketch of this weighting follows)
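The weighting itself is not spelled out in this text version of the slides; in the Mass & Mandelbrod style of pivoted scoring, the component score is linearly interpolated with the score of its containing article, so with pivot = 0.50 the combined estimate would presumably be

P_{final}(c) = pivot \cdot P(R \mid Q, article(c)) + (1 - pivot) \cdot P(R \mid Q, c)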
51. Adhoc Component Fusion Search
(Diagram: subqueries produce component query results, which are fused/merged into a raw ranked list)
- Merge multiple ranked component types
- Major components merged are Article Body, Sections, paragraphs, figures
52. TREC3 Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained from the same logistic formula shown earlier (slide 31).
53. TREC3 Logistic Regression attributes
- Average Absolute Query Frequency
- Query Length
- Average Absolute Component Frequency
- Document Length
- Average Inverse Component Frequency
- Number of Terms in common between query and Component (logged)
54. TREC3 LR Coefficients
Estimates using INEX 03 relevance assessments for:
- b1: Average Absolute Query Frequency
- b2: Query Length
- b3: Average Absolute Component Frequency
- b4: Document Length
- b5: Average Inverse Component Frequency
- b6: Number of Terms in common between query and Component
55. CO.Focused
56. COS.Focused
57. CO.Thorough
58. COS.Thorough
59. CAS
60. Het. Element Retr. Overview
- The Problem
- Issues with Element Retrieval and Heterogeneous Retrieval
- Possible Approaches
- XPointer
- Generic Metadata systems
- E.g., Dublin Core
- Other Metadata Systems
61. The Problem
- The Adhoc track in INEX has dealt with a single DTD for one type of data (computer science journal articles)
- In real-world environments, XML retrieval must deal with different DTDs, different genres of data, and widely varying topical content
62. The Heterogeneous Track
- Research Questions (2004):
- For content-oriented queries, what methods are possible for determining which elements contain reasonable answers? Are pure statistical methods appropriate, or are ontology-based approaches also helpful?
- What methods can be used to map structural criteria onto other DTDs?
- Should mappings focus on element names only, or also deal with element content or semantics?
- What are appropriate evaluation criteria for heterogeneous collections?
63. INEX 2004 Het Collection Tags
64. Issues with Element Retrieval for Heterogeneous Retrieval
- Conceptual Issues (user's view)
- Actually specifying structural elements for retrieval requires that the user know the structure of the items to be retrieved
- As the number of DTDs or schemas increases, this task becomes more complex, both for specification and for understanding
- For real-world XML retrieval, specifying structure effectively requires omniscience on the part of the user
- The collection itself must be specified in some way (can the user know all of the collections?)
- Users of INEX can't produce correct specifications for even one DTD
65. Issues with Element Retrieval for Heterogeneous Retrieval
- Practical Issues (programmer's view)
- Most of the same problems as the user view
- As seen in earlier papers today, the system must provide an interface that the user can understand, but that maps to the complexities of the DTD(s)
- But, once again, as the number of DTDs or schemas increases, this task becomes increasingly complex for the specification of the mappings
- For real-world XML retrieval, specifying structure effectively requires omniscience on the part of the programmer to provide exhaustive mappings of the document elements to be retrieved
- As Roelof noted earlier today, this can rapidly become a system that has too many options for a user to understand or use
66. Postulate of Impotence
- In sum, we might suggest another "Postulate of Impotence" like those suggested by Swanson:
- You can have either heterogeneous retrieval or precise element specifications in queries, but you cannot have both simultaneously
67. Possible Approaches
- Generalized structure
- Parent/child, as in XPath/XPointer
- What about flat structures? (like most collections in the Het track)
- Abstract query elements
- Use semantic representations in queries rather than structural representations
- E.g., Title instead of //fm/tig/atl
- What semantic representations can/should be used?
68. XPointer
- Can specify collection-level identification
- Basically a URN attached to an XPath
- Can also specify various string-matching constraints on the XPath
- Might be useful in the INEX Het Track for specifying relevance judgements
- But it doesn't address (or even worsens) the larger problem of dealing with large numbers of heterogeneous structures
69. Abstract Data Elements
- The idea is to remove the requirement of precise and explicit specification of structural elements, replacing them with abstract and implied specifications
- Used in other heterogeneous retrieval systems
- Z39.50/SRW (attribute sets and element sets)
- Dublin Core (limited set of elements for search or retrieval)
70. Dublin Core
- Simple metadata for describing internet resources
- For Document-Like Objects
- 15 Elements (in base DC)
71. Dublin Core Elements
- Title
- Creator
- Subject
- Description
- Publisher
- Other Contributors
- Date
- Resource Type
- Format
- Resource Identifier
- Source
- Language
- Relation
- Coverage
- Rights Management
72. Issues in Dublin Core
- Lack of guidance on what to put into each element
- How to structure or organize at the element level?
- How to ensure consistency across descriptions for the same persons, places, things, etc.?