Title: Prof. Ray Larson
1. Lecture 21: XML Retrieval
Principles of Information Retrieval
- Prof. Ray Larson
- University of California, Berkeley
- School of Information
- Tuesday and Thursday, 10:30 am - 12:00 pm
- Spring 2007
- http://courses.ischool.berkeley.edu/i240/s07
2. Mini-TREC
- Proposed Schedule
- February 15: Database and previous Queries
- February 27: Report on system acquisition and setup
- March 8: New Queries for testing
- April 19: Results due (Next Thursday)
- April 24 or 26: Results and system rankings
- May 8: Group reports and discussion
3. Announcement
- No Class on Tuesday (April 17th)
4. Today
- Review
- Geographic Information Retrieval
- GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
- XML and Structured Element Retrieval
- INEX
- Approaches to XML retrieval
Credit for some of the slides in this lecture goes to Marti Hearst
5. Today
- Review
- Geographic Information Retrieval
- GIR algorithms and evaluation, based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
- Web Crawling and Search Issues
- Web Crawling
- Web Search Engines and Algorithms
Credit for some of the slides in this lecture goes to Marti Hearst
6. Introduction
- What is Geographic Information Retrieval?
- GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research, with the addition of spatially and geographically oriented indexing and retrieval.
- It combines aspects of DBMS research, User Interface research, GIS research, and Information Retrieval research.
7. Example: Results display from CheshireGeo
http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html
8. Other convex, conservative Approximations
9. Our Research Questions
- Spatial Ranking
- How effectively can the spatial similarity between a query region and a document region be evaluated and ranked based on the overlap of the geometric approximations for these regions?
- Geometric Approximations and Spatial Ranking
- How do different geometric approximations affect the rankings?
- MBRs: the most popular approximation
- Convex hulls: the highest quality convex approximation
10. Spatial Ranking: Methods for computing spatial similarity
11. Probabilistic Models: Logistic Regression attributes
- X1 = area of overlap(query region, candidate GIO) / area of query region
- X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
- X3 = 1 - abs(fraction of overlap region that is onshore - fraction of candidate GIO that is onshore)
- Where:
- Range for all variables is 0 (not similar) to 1 (same); a computation sketch follows below
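A minimal sketch (not from the original slides) of how these three attributes could be computed, assuming the query region, candidate GIO approximation, and an onshore land polygon are available as shapely geometries; shapely is an assumed dependency here, not something the slides specify.

from shapely.geometry import Polygon

def spatial_lr_attributes(query_region: Polygon, candidate_gio: Polygon,
                          onshore: Polygon) -> tuple:
    """Compute the X1, X2, X3 attributes used in the spatial LR model."""
    overlap = query_region.intersection(candidate_gio)
    x1 = overlap.area / query_region.area      # overlap relative to the query region
    x2 = overlap.area / candidate_gio.area     # overlap relative to the candidate GIO
    # Shorefactor: 1 minus the difference in onshore fractions (0 = dissimilar, 1 = same)
    overlap_onshore = (overlap.intersection(onshore).area / overlap.area
                       if overlap.area > 0 else 0.0)
    gio_onshore = candidate_gio.intersection(onshore).area / candidate_gio.area
    x3 = 1.0 - abs(overlap_onshore - gio_onshore)
    return (x1, x2, x3)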
12. CA Named Places in the Test Collection: complex polygons
13. CA Counties: Geometric Approximations
- MBRs
- Convex Hulls
- Average false area of approximation: MBRs 94.61, Convex Hulls 26.73
14. CA User Defined Areas (UDAs) in the Test Collection
15. Test Collection Query Regions: CA Counties
- 42 of 58 counties referenced in the test collection metadata
- 10 counties randomly selected as query regions to train the LR model
- 32 counties used as query regions to test the model
16. LR model
- X1 = area of overlap(query region, candidate GIO) / area of query region
- X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
- Where:
- Range for all variables is 0 (not similar) to 1 (same)
17. Some of our Results
- Mean Average Query Precision: the average of the precision values observed after each new relevant document in a ranked list (a computation sketch follows at the end of this slide)
For metadata indexed by CA named place regions
- These results suggest:
- Convex Hulls perform better than MBRs
- Expected result, given that the CH is a higher quality approximation
- A probabilistic ranking based on MBRs can perform as well as, if not better than, a non-probabilistic ranking method based on Convex Hulls
- Interesting
- Since any approximation other than the MBR requires great expense, this suggests that exploring new ranking methods based on the MBR is a good way to go.
For all metadata in the test collection
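A minimal sketch (not from the original slides) of the average precision computation described above, for a single query; mean average query precision is then the mean of this value over all query regions. Following the slide's wording, precision values are averaged over the relevant documents actually observed in the ranked list (a common variant divides by the total number of relevant documents instead).

def average_precision(ranked_ids, relevant_ids):
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision after each new relevant document
    return sum(precisions) / len(precisions) if precisions else 0.0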
18. Some of our Results
- Mean Average Query Precision: the average of the precision values observed after each new relevant document in a ranked list.
For metadata indexed by CA named place regions
BUT: the inclusion of UDA-indexed metadata reduces precision. This is because coarse approximations of onshore or coastal geographic regions will necessarily include much irrelevant offshore area, and vice versa.
For all metadata in the test collection
19. Shorefactor Model
- X1 = area of overlap(query region, candidate GIO) / area of query region
- X2 = area of overlap(query region, candidate GIO) / area of candidate GIO
- X3 = 1 - abs(fraction of query region approximation that is onshore - fraction of candidate GIO approximation that is onshore)
- Where: Range for all variables is 0 (not similar) to 1 (same)
20. Some of our Results, with Shorefactor
For all metadata in the test collection
Mean Average Query Precision: the average of the precision values observed after each new relevant document in a ranked list.
- These results suggest:
- Addition of the Shorefactor variable improves the model (LR 2), especially for MBRs
- The improvement is not as dramatic for convex hull approximations, because the problem that Shorefactor addresses is not as significant when areas are represented by convex hulls.
21. Results for All Data - MBRs
(Precision vs. recall graph)
22. Results for All Data - Convex Hull
(Precision vs. recall graph)
23. XML Retrieval
- The following slides are adapted from
presentations at INEX 2003-2005 and at the INEX
Element Retrieval Workshop in Glasgow 2005, with
some new additions for general context, etc.
24. INEX Organization
- Organized By:
- University of Duisburg-Essen, Germany
- Norbert Fuhr, Saadia Malik, and others
- Queen Mary University of London, UK
- Mounia Lalmas, Gabriella Kazai, and others
- Supported By:
- DELOS Network of Excellence in Digital Libraries (EU)
- IEEE Computer Society
- University of Duisburg-Essen
25. XML Retrieval Issues
- Using Structure?
- Specification of Queries
- How to evaluate?
26. Cheshire SGML/XML Support
- Underlying native format for all data is SGML or XML
- The DTD defines the database contents
- Full SGML/XML parsing
- SGML/XML Format Configuration Files define the database location and indexes
- Various format conversions and utilities available for Z39.50 support (MARC, GRS-1)
27. SGML/XML Support
- Configuration files for the Server are SGML/XML
- They include elements describing all of the data files and indexes for the database.
- They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
28. Indexing
- Any SGML/XML tagged field or attribute can be indexed
- B-Tree and Hash access via Berkeley DB (Sleepycat)
- Stemming, keyword, exact keys and special keys
- Mapping from any Z39.50 Attribute combination to a specific index
- Underlying postings information includes term frequency for probabilistic searching
- Component extraction with separate component indexes
29. XML Element Extraction
- A new search ElementSetName is XML_ELEMENT_
- Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
- The matching elements are extracted from the records matching the search and delivered in a simple format.
30. XML Extraction
zselect sherlock
372 Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection 372
zfind topic mathematics
OK Status 1 Hits 26 Received 0 Set Default RecordSyntax UNKNOWN
zset recsyntax XML
zset elementset XML_ELEMENT_Fld245
zdisplay
OK Status 0 Received 10 Position 1 Set Default NextPosition 11 RecordSyntax XML 1.2.840.10003.5.109.10
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245>
</ITEM>
</RESULT_DATA>
etc.
31. TREC3 Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained from the formula below, for the 6 X attribute measures shown on the next slide.
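The formula itself appears only as an image in the original slide; reconstructed here in its standard form (an assumption based on the surrounding description), the log odds of relevance given query Q and component C is a linear combination of the attribute measures, and the probability follows from the logistic transform:

\log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i

P(R \mid Q, C) = \frac{e^{\,\log O(R \mid Q, C)}}{1 + e^{\,\log O(R \mid Q, C)}}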
32. TREC3 Logistic Regression
The six attribute measures:
- Average Absolute Query Frequency
- Query Length
- Average Absolute Component Frequency
- Document Length
- Average Inverse Component Frequency
- Number of Terms in both query and Component
33. Okapi BM25
- Where (in the ranking formula given below):
- Q is a query containing terms T
- K is k1 * ((1 - b) + b * dl/avdl)
- k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000 respectively
- tf is the frequency of the term in a specific document
- qtf is the frequency of the term in the topic from which Q was derived
- dl and avdl are the document length and the average document length, measured in some convenient unit
- w(1) is the Robertson-Sparck Jones weight
34. Combining Boolean and Probabilistic Search Elements
- Two original approaches:
- Boolean Approach
- Non-probabilistic Fusion Search: the set-merger approach is a weighted merger of document scores from separate Boolean and Probabilistic queries
35. INEX 04 Fusion Search
(Diagram: subqueries produce component query results, which are fused/merged into a final ranked list)
- Merge multiple ranked and Boolean index searches within each query, and multiple component search result sets
- Major components merged are Articles, Body, Sections, subsections, paragraphs
36. Merging and Ranking Operators
- Extends the capabilities of merging to include merger operations within queries, like Boolean operators (a merging sketch follows this list)
- Fuzzy Logic Operators (not used for INEX)
- !FUZZY_AND
- !FUZZY_OR
- !FUZZY_NOT
- Containment operators: restrict components to or from a particular parent
- !RESTRICT_FROM
- !RESTRICT_TO
- Merge Operators
- !MERGE_SUM
- !MERGE_MEAN
- !MERGE_NORM
- !MERGE_CMBZ
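A minimal sketch (not part of Cheshire itself) of what score-merging operators like these might compute over two result sets of id/score pairs; treating MERGE_CMBZ as a CombMNZ-style normalized sum is an assumption based on its name.

# Hypothetical illustration of score-merging operators over two result sets,
# each a dict mapping a record/component id to a retrieval score.
def merge(run_a: dict, run_b: dict, op: str = "MERGE_SUM") -> dict:
    def norm(run):  # min-max normalize scores into [0, 1]
        lo, hi = min(run.values()), max(run.values())
        return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in run.items()}
    if op in ("MERGE_NORM", "MERGE_CMBZ"):
        run_a, run_b = norm(run_a), norm(run_b)
    merged = {}
    for key in set(run_a) | set(run_b):
        scores = [r[key] for r in (run_a, run_b) if key in r]
        if op == "MERGE_MEAN":
            merged[key] = sum(scores) / len(scores)
        elif op == "MERGE_CMBZ":          # CombMNZ-style: boost items found by both runs
            merged[key] = sum(scores) * len(scores)
        else:                             # MERGE_SUM / MERGE_NORM: plain (normalized) sum
            merged[key] = sum(scores)
    return merged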
37. New LR Coefficients
Estimates using INEX 03 relevance assessments for:
- b1: Average Absolute Query Frequency
- b2: Query Length
- b3: Average Absolute Component Frequency
- b4: Document Length
- b5: Average Inverse Component Frequency
- b6: Number of Terms in common between query and Component
38. INEX CO Runs
- Three official runs and one later run - all Title-only
- Fusion - combines Okapi and LR using the MERGE_CMBZ operator
- NewParms (LR) - uses only LR with the new parameters
- Feedback - an attempt at blind relevance feedback
- PostFusion - fusion of the new LR coefficients and Okapi
39. Query Generation - CO
- Topic 162 TITLE: Text and Index Compression Algorithms
- QUERY: (topicshort @ Text and Index Compression Algorithms) !MERGE_CMBZ (alltitles @ Text and Index Compression Algorithms) !MERGE_CMBZ (topicshort @ Text and Index Compression Algorithms) !MERGE_CMBZ (alltitles @ Text and Index Compression Algorithms)
- One of the @ ranking operators is Okapi, the other is LR
- !MERGE_CMBZ is a normalized score summation and enhancement
40. INEX CO Runs
Avg Prec (Strict / Generalized):
- FUSION   0.0642 / 0.0923
- NEWPARMS 0.0582 / 0.0853
- FDBK     0.0415 / 0.0390
- POSTFUS  0.0690 / 0.0952
41. INEX VCAS Runs
- Two official runs:
- FUSVCAS - element fusion using LR and various operators for path restriction
- NEWVCAS - uses the new LR coefficients for each appropriate index and various operators for path restriction
42. Query Generation - VCAS
- Topic 66 TITLE: //article[about(., intelligent transport systems)]//sec[about(., on-board route planning navigation system for automobiles)]
- Submitted query: ((topic @ intelligent transport systems)) !RESTRICT_FROM ((sec_words @ on-board route planning navigation system for automobiles))
- Target elements: sec, ss1, ss2, ss3
43. VCAS Results
Avg Prec (Generalized / Strict):
- FUSVCAS 0.0321 / 0.0601
- NEWVCAS 0.0270 / 0.0569
44. Heterogeneous Track
- Approach uses Cheshire's Virtual Database options
- Primarily a version of distributed IR
- Each collection indexed separately
- Search via Z39.50 distributed queries
- Z39.50 attribute mapping used to map query indexes to appropriate elements in a given collection
- Only LR used, and collection results merged using the probability of relevance for each collection result
45. INEX 2005 Approach
- Used only Logistic Regression methods:
- TREC3 with Pivot
- TREC2 with Pivot
- TREC2 with Blind Feedback
- Used post-processing for specific tasks
46. Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained from the fitted logistic model, for some set of m statistical measures, Xi, derived from the collection and query.
47. TREC2 Algorithm
48. Blind Feedback
- Term selection from top-ranked documents is based on the classic Robertson/Sparck Jones probabilistic model
- For each term t, a relevance weight is computed as shown below
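The weight formula appears as an image in the original slide; the classic Robertson/Sparck Jones relevance weight, which is presumably what is shown there, is

w_t^{(1)} = \log \frac{(r_t + 0.5)\,(N - n_t - R + r_t + 0.5)}{(n_t - r_t + 0.5)\,(R - r_t + 0.5)}

where N is the number of documents in the collection, R the number of (assumed) relevant documents, n_t the number of documents containing t, and r_t the number of assumed relevant documents containing t.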
49. Blind Feedback
- Top x new terms taken from the top y documents
- For each term in the top y assumed-relevant set, a term weight (termwt) is computed
- Terms are ranked by termwt and the top x are selected for inclusion in the query (see the note below)
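The termwt formula is also shown only as an image in the original slide; a common choice for this selection value, and an assumption here, is the Robertson selection value (offer weight), which multiplies the relevance weight by the term's frequency in the assumed-relevant set:

\mathrm{termwt}_t = r_t \cdot w_t^{(1)}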
50. Pivot method
- Based on the pivot weighting used by IBM Haifa in INEX 2004 (Mass & Mandelbrod)
- Used 0.50 as the pivot for all cases
- For the TREC3 and TREC2 runs, all component results are weighted by the article-level result for the matching article (a sketch of this weighting follows)
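The weighting itself is not spelled out in this text version of the slides; in the Mass & Mandelbrod style of pivoted scoring, the component score is linearly interpolated with the score of its containing article, so with pivot = 0.50 the combined estimate would presumably be

P_{final}(c) = pivot \cdot P(R \mid Q, article(c)) + (1 - pivot) \cdot P(R \mid Q, c)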
51. Adhoc Component Fusion Search
(Diagram: subqueries produce component query results, which are fused/merged into a raw ranked list)
- Merge multiple ranked component types
- Major components merged are Article Body, Sections, paragraphs, figures
52. TREC3 Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained from the same logistic formula shown earlier (slide 31).
53. TREC3 Logistic Regression attributes
- Average Absolute Query Frequency
- Query Length
- Average Absolute Component Frequency
- Document Length
- Average Inverse Component Frequency
- Number of Terms in common between query and Component (logged)
54. TREC3 LR Coefficients
Estimates using INEX 03 relevance assessments for:
- b1: Average Absolute Query Frequency
- b2: Query Length
- b3: Average Absolute Component Frequency
- b4: Document Length
- b5: Average Inverse Component Frequency
- b6: Number of Terms in common between query and Component
55. CO.Focused
56. COS.Focused
57. CO.Thorough
58. COS.Thorough
59. CAS
60. Het. Element Retr. Overview
- The Problem
- Issues with Element Retrieval and Heterogeneous Retrieval
- Possible Approaches
- XPointer
- Generic Metadata systems
- E.g., Dublin Core
- Other Metadata Systems
61. The Problem
- The Adhoc track in INEX has dealt with a single DTD for one type of data (computer science journal articles)
- In real-world environments, XML retrieval must deal with different DTDs, different genres of data, and widely varying topical content
62. The Heterogeneous Track
- Research Questions (2004):
- For content-oriented queries, what methods are possible for determining which elements contain reasonable answers? Are pure statistical methods appropriate, or are ontology-based approaches also helpful?
- What methods can be used to map structural criteria onto other DTDs?
- Should mappings focus on element names only, or also deal with element content or semantics?
- What are appropriate evaluation criteria for heterogeneous collections?
63. INEX 2004 Het Collection Tags
64. Issues with Element Retrieval for Heterogeneous Retrieval
- Conceptual Issues (user's view)
- Actually specifying structural elements for retrieval requires that the user know the structure of the items to be retrieved
- As the number of DTDs or schemas increases, this task becomes more complex, both for specification and for understanding
- For real-world XML retrieval, specifying structure effectively requires omniscience on the part of the user
- The collection itself must be specified in some way (can the user know all of the collections?)
- Users of INEX can't produce correct specifications for even one DTD
65. Issues with Element Retrieval for Heterogeneous Retrieval
- Practical Issues (programmer's view)
- Most of the same problems as the user view
- As seen in earlier papers today, the system must provide an interface that the user can understand, but that maps to the complexities of the DTD(s)
- But, once again, as the number of DTDs or schemas increases, this task becomes increasingly complex for the specification of the mappings
- For real-world XML retrieval, specifying structure effectively requires omniscience on the part of the programmer to provide exhaustive mappings of the document elements to be retrieved
- As Roelof noted earlier today, this can rapidly become a system that has too many options for a user to understand or use
66. Postulate of Impotence
- In sum, we might suggest another "Postulate of Impotence" like those suggested by Swanson:
- You can have either heterogeneous retrieval or precise element specifications in queries, but you cannot have both simultaneously
67. Possible Approaches
- Generalized structure
- Parent/child, as in XPath/XPointer
- What about flat structures? (like most collections in the Het track)
- Abstract query elements
- Use semantic representations in queries rather than structural representations
- E.g., Title instead of //fm/tig/atl
- What semantic representations can/should be used?
68. XPointer
- Can specify collection-level identification
- Basically a URN attached to an XPath
- Can also specify various string-matching constraints on the XPath
- Might be useful in the INEX Het Track for specifying relevance judgements
- But it doesn't address (or even worsens) the larger problem of dealing with large numbers of heterogeneous structures
69. Abstract Data Elements
- The idea is to remove the requirement of precise and explicit specification of structural elements, replacing them with abstract and implied specifications
- Used in other heterogeneous retrieval systems
- Z39.50/SRW (attribute sets and element sets)
- Dublin Core (limited set of elements for search or retrieval)
70. Dublin Core
- Simple metadata for describing internet resources
- For Document-Like Objects
- 15 Elements (in base DC)
71. Dublin Core Elements
- Title
- Creator
- Subject
- Description
- Publisher
- Other Contributors
- Date
- Resource Type
- Format
- Resource Identifier
- Source
- Language
- Relation
- Coverage
- Rights Management
72. Issues in Dublin Core
- Lack of guidance on what to put into each element
- How to structure or organize at the element level?
- How to ensure consistency across descriptions for the same persons, places, things, etc.?