XRANK: Ranked Keyword Search Over XML Documents

Provided by: jayavelsha

Learn more at: https://www.cs.cornell.edu

1
XRANK: Ranked Keyword Search Over XML Documents

Jayavel Shanmugasundaram
Cornell University
Joint work with
Lin Guo, Feng Shao, Chavdar Botev

2
Motivation

Ranked keyword search emerging as dominant
paradigm for information discovery
Simple
Results ranked in order of relevance
Hitherto mostly limited to unstructured data
HTML, unstructured text
Growing wave of semi-structured data
XML documents in Internet and Corporate Intranets
XRANK designed for keyword search over such data

3
Keyword Search over Unstructured Data
[Diagram: query keywords are evaluated over hyperlinked HTML documents, yielding ranked results]
4
Keyword Search over Semi-Structured Data
[Diagram: XRANK evaluates query keywords over a mix of hyperlinked XML and HTML documents, yielding ranked results]
5
XRANK System

Semi-structured XML documents
Predefined schema not necessary
Hyperlink-based ranking
Exploit hyperlinked nature of XML for ranking
Generalize unstructured keyword search
Can query a mix of XML and HTML documents
(Internet, corporate Intranets)
Google for XML (also Google for HTML!)

6
Possible Applications

Content management
Integrating structured and unstructured data
Scientific repositories
Marked up documents
Bibliography databases
Future Internet (?)

7
Outline

Design Principles
Indexing and Query Processing
Experimental Results
Related Work and Conclusion

8
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work">
          The XQL language …
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
10
Design Principles

Return most specific element containing the query
keywords

11
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work">
          The XQL language …
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    <paper id="2">
12
Design Principles

Return most specific element containing the query
keywords
Ranking has to be done at the granularity of
elements

14
Design Principles

Return most specific element containing the query
keywords
Ranking has to be done at the granularity of
elements
Two-dimensional keyword proximity
Height of result XML tree
Width of result XML tree

16
Design Principles

Return most specific element containing the query
keywords
Ranking has to be done at the granularity of
elements
Two-dimensional keyword proximity
Height of result XML tree
Width of result XML tree
Generalize HTML keyword search

17
Outline

Design Principles
Overview
Formalization
Indexing and Query Processing
Experimental Results
Related Work and Conclusion

18
Data Model
[Diagram: XML elements as nodes of a graph, connected by containment edges and hyperlink edges]
21
ElemRank

Objective importance of element
Analogous to Google's PageRank
But computed at granularity of elements
Exploit hyperlink edges and containment edges
Naturally generalizes Google's PageRank
Random walk interpretation

22
PageRank [Brin & Page 1998]
[Diagram: graph around a node w, showing hyperlink edges]
1-d: Probability of random jump
23
ElemRank
[Diagram: graph around a node w, showing hyperlink edges and containment edges]
1-d1-d2-d3: Probability of random jump
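
The random-walk reading of ElemRank can be sketched as a power iteration. This is a minimal illustration and not the authors' implementation: the update rule below (hyperlink mass split across out-links with weight d1, parent-to-child mass split across children with weight d2, child-to-parent mass passed on whole with weight d3, and the remaining 1-d1-d2-d3 spread uniformly as random jumps) is an assumed reading of the slides. The name `elem_rank`, the dictionary-based graph, and the uniform jump distribution are hypothetical; the default parameter values are the ones quoted later in the deck.

```python
def elem_rank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25,
              tol=0.00002, max_iter=1000):
    """Power iteration over a graph with containment and hyperlink edges.

    children:   dict element -> list of child elements (containment edges)
    hyperlinks: dict element -> list of hyperlink targets
    """
    nodes = set(children) | set(hyperlinks)
    for targets in list(children.values()) + list(hyperlinks.values()):
        nodes.update(targets)
    parent = {c: p for p, cs in children.items() for c in cs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Random-jump share (assumed uniform over all elements).
        new = {v: (1.0 - d1 - d2 - d3) / n for v in nodes}
        for u in nodes:
            for v in hyperlinks.get(u, []):       # follow a hyperlink edge
                new[v] += d1 * rank[u] / len(hyperlinks[u])
            for v in children.get(u, []):         # descend a containment edge
                new[v] += d2 * rank[u] / len(children[u])
            if u in parent:                       # ascend a containment edge
                new[parent[u]] += d3 * rank[u]
        if max(abs(new[v] - rank[v]) for v in nodes) < tol:
            return new
        rank = new
    return rank


# Toy graph (hypothetical): paper2 cites paper1, so paper1 should rank higher.
ranks = elem_rank(
    children={"workshop": ["paper1", "paper2"], "paper1": ["title1"]},
    hyperlinks={"paper2": ["paper1"]},
)
```

Elements without outgoing edges of some kind simply drop that share of their mass in this sketch, so the scores need not sum to 1; only their relative order is meaningful here.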
25
Two-Dimensional Proximity

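Consider query Q = k1, …, kn and a result
element E
ki is directly contained in wi

[Diagram: result element E with descendants w1, …, wn, where each wi directly contains ki]

Rank of E with respect to ki
Rank(E, ki) = ElemRank(wi) × decay^h
0 < decay < 1
h is the length of the path from E to wi

Overall rank
Rank(E) = (Σ Rank(E, ki)) × p(E, k1, …, kn)
where the sum runs over i = 1, …, n and p is a keyword proximity factor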
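
The two formulas on this slide can be written out as a short sketch. It assumes one occurrence per keyword and takes the proximity factor as a plain number supplied by the caller; the function names and the default decay of 0.75 are hypothetical choices for illustration, not values from the slides.

```python
def keyword_rank(elem_rank_w, h, decay=0.75):
    """Rank(E, ki) = ElemRank(wi) * decay**h, with 0 < decay < 1."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    return elem_rank_w * decay ** h


def overall_rank(occurrences, proximity, decay=0.75):
    """Sum the per-keyword ranks, then scale by the proximity factor.

    occurrences: one (ElemRank(wi), h_i) pair per query keyword, where h_i
                 is the path length from the result element E down to wi.
    proximity:   the proximity factor for E (assumed to be precomputed).
    """
    return sum(keyword_rank(r, h, decay) for r, h in occurrences) * proximity


# Two keywords: one found 1 level below E, the other 2 levels below.
score = overall_rank([(0.8, 1), (0.4, 2)], proximity=0.5, decay=0.5)
```

The decay term covers the height dimension (deeper occurrences count less), while the proximity factor covers the width dimension.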

27
System Architecture
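[Diagram: XML/HTML documents pass through ElemRank computation, producing XML elements with ElemRanks that are indexed in Hybrid Dewey Inverted Lists; the query evaluator takes a keyword query, reads the index through a data access layer, and returns ranked results]
Compute top-m query results as per definition of
ranking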
28
Outline

Design Principles
Indexing and Query Processing
Naïve
DIL
RDIL
HDIL
Experimental Results
Related Work and Conclusion

29
Naïve Approach

A main difference between document and XML
keyword search is result granularity
Treat each element as a document
Build regular inverted list index structures over
elements

30
Naïve Method

Naïve inverted lists
Ricardo: 1; 5; 6; 8
XQL: 1; 5; 6; 7

Element IDs
1 <workshop>
  2 date: 28 July …
  3 <title>: XML and …
  4 <editors>: David Carmel …
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and …
      8 <author>: Ricardo …
    <paper>

Problems
1. Space overhead
2. Spurious results
3. Inaccurate ranking (two-dimensional proximity)

Main issue: Decouples representation of ancestors
and descendants
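
The space overhead of the naïve method is easy to see in a small sketch: every keyword is indexed under the element that contains it and under all of that element's ancestors. The element IDs and parent links below follow the example tree on this slide; the dictionary layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict

# Element IDs and parent links from the slide's example tree (None marks the root).
PARENT = {1: None, 2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 6, 8: 6}
# Keywords directly contained in an element (only the two query terms are modelled).
DIRECT = {7: ["XQL"], 8: ["Ricardo"]}


def naive_inverted_lists(parent, direct):
    """Treat every element as a document: index each keyword at all ancestors too."""
    lists = defaultdict(list)
    for elem, words in direct.items():
        node = elem
        while node is not None:          # the element itself, then every ancestor
            for word in words:
                lists[word].append(node)
            node = parent[node]
    return {word: sorted(ids) for word, ids in lists.items()}


lists = naive_inverted_lists(PARENT, DIRECT)
```

Each keyword occurrence produces one entry per level of nesting, and a plain intersection of the two lists returns elements 1, 5, and 6 even though only element 6 is the most specific result.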
32
Dewey Encoding of IDs [1850s]
0 <workshop>
  0.0 date: 28 July …
  0.1 <title>: XML and …
  0.2 <editors>: David Carmel …
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and …
      0.3.0.1 <author>: Ricardo …
    0.3.1 <paper>
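
The Dewey numbering on this slide can be reproduced with a short recursive walk. This is a sketch under stated assumptions: attributes are numbered before child elements (which matches `date` receiving 0.0 on the slide), text nodes are not numbered, and the function name `dewey_ids` and the trimmed sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def dewey_ids(elem, prefix=(0,)):
    """Yield (dewey_id, label) pairs in document order."""
    yield ".".join(map(str, prefix)), elem.tag
    position = 0
    for name in elem.attrib:             # attributes are numbered before child elements
        yield ".".join(map(str, prefix + (position,))), "@" + name
        position += 1
    for child in elem:
        yield from dewey_ids(child, prefix + (position,))
        position += 1


doc = ET.fromstring(
    '<workshop date="28 July 2000"><title/><editors/>'
    '<proceedings><paper><title/><author/></paper><paper/></proceedings></workshop>'
)
ids = dict(dewey_ids(doc))
```

Because an element's ID is a prefix of every descendant's ID, ancestor relationships can be read off the IDs alone, which is the property the Dewey inverted lists rely on.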
33
Dewey Inverted List (DIL)
Keyword   Dewey Id    ElemRank   Position List
XQL       5.0.3.0.0   85         32
          8.0.3.8.3   38         89, 91
Ricardo   5.0.3.0.1   82         38
          8.2.1.4.2   99         52

Each keyword's list is sorted by Dewey Id
Store IDs of elements that directly contain
keyword - Avoids space overhead
34
DIL Challenges

Merging multiple inverted lists
Simple equality merge not sufficient
Need to infer most specific result
Suppress spurious results
Two-dimensional proximity
Algorithm that addresses above issues in a single
scan over inverted lists

35
DIL Query Processing

Merge query keyword inverted lists in Dewey ID
Order
Entries with common prefixes are processed
together
Compute Longest Common Prefix of Dewey IDs during
the merge
Longest common prefix ensures most specific
results
Also suppresses spurious results
Keep top-m results seen so far in output heap
Calculate rank using two-dimensional proximity
metric
Output contents of output heap after scanning
inverted lists
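
The role of the longest common prefix can be illustrated with a simplified sketch. It is not the single-scan algorithm described on the slide: it merges the lists in Dewey ID order, records which keywords occur beneath every prefix, and then keeps the deepest prefixes that cover all keywords. Ranking and the top-m output heap are left out, "most specific" is simplified to "no descendant also covers every keyword", and the function name is hypothetical. The sample IDs are the ones from the Dewey Inverted List slide.

```python
import heapq


def most_specific(lists):
    """lists: keyword -> Dewey IDs (tuples of ints), each list sorted in Dewey order."""
    merged = heapq.merge(*[[(dewey, kw) for dewey in ids] for kw, ids in lists.items()])
    seen = {}                                   # prefix -> keywords found beneath it
    for dewey, kw in merged:
        for n in range(1, len(dewey) + 1):
            seen.setdefault(dewey[:n], set()).add(kw)
    full = {p for p, kws in seen.items() if len(kws) == len(lists)}
    # Keep an element only if no proper descendant also covers every keyword.
    return sorted(p for p in full
                  if not any(q != p and q[:len(p)] == p for q in full))


results = most_specific({
    "XQL": [(5, 0, 3, 0, 0), (8, 0, 3, 8, 3)],
    "Ricardo": [(5, 0, 3, 0, 1), (8, 2, 1, 4, 2)],
})
# results == [(5, 0, 3, 0), (8,)]
```

In document 5 the two keywords meet at 5.0.3.0, so its ancestors 5.0.3, 5.0, and 5 are suppressed as spurious. In document 8 they only meet at the root, which is where the decay factor of the ranking function would push the score down.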

38
Potential Problem with DIL

Always requires a full scan of query keyword
inverted lists
Can be inefficient for unselective keywords/large
document collections
Solution: RDIL (Ranked Dewey Inverted List)
Order inverted lists by ElemRank instead of
DeweyID
Higher ranked results likely to appear first
Query processing can be terminated early

39
Ranked Dewey Inverted List (RDIL)
[Diagram: for each keyword (e.g., XQL, and likewise for other keywords), an inverted list sorted by ElemRank together with a B-tree on Dewey Id]
40
RDIL Challenges

An element may be ranked highly in one list and
low in another list
B-tree helps search for low ranked element
When to stop scanning inverted lists?
Based on Threshold Algorithm [Fagin et al.,
2002], which periodically calculates a threshold
Can stop if we have sufficient results above the
threshold
Extension to most specific results

41
RDIL Query Processing
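[Diagram: inverted lists for Ricardo and XQL, each sorted by ElemRank and each with a B-tree on Dewey Id, alongside a temp heap and an output heap]
Dewey IDs shown in the lists: 9.0.4.2.0 (entry P), 9.0.4.1.2, 8.2.1.4.2, 9.0.5.6, 10.8.3; R marks an entry in the other list
Rank(9.0.4) is computed; 9.0.4 is the longest common prefix of 9.0.4.2.0 and 9.0.4.1.2
threshold = ElemRank(P) + Max-ElemRank
threshold = ElemRank(P) + ElemRank(R)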
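
The early-stopping idea can be sketched with the classic threshold algorithm on which RDIL is based. This is a simplified illustration, not the RDIL algorithm itself: items must match exactly across lists (no extension to longest common prefixes), the aggregate score is a plain sum, an item missing from a list contributes 0, and a dictionary stands in for the B-tree lookup. All names are hypothetical.

```python
import heapq


def threshold_top_m(lists, m):
    """lists: one list per keyword of (score, item) pairs, sorted by score descending."""
    lookup = [{item: s for s, item in lst} for lst in lists]   # stands in for the B-tree
    best = {}                                                  # item -> aggregate score
    for depth in range(max(len(lst) for lst in lists)):
        last = []
        for lst in lists:
            if depth < len(lst):
                score, item = lst[depth]
                last.append(score)
                if item not in best:                           # random access to all lists
                    best[item] = sum(lk.get(item, 0.0) for lk in lookup)
            else:
                last.append(0.0)
        threshold = sum(last)          # upper bound for any item not yet seen
        top = heapq.nlargest(m, best.items(), key=lambda kv: kv[1])
        if len(top) == m and top[-1][1] >= threshold:
            return top                 # m results at or above the threshold: stop early
    return heapq.nlargest(m, best.items(), key=lambda kv: kv[1])


top = threshold_top_m(
    [[(0.9, "a"), (0.5, "b"), (0.1, "c")],
     [(0.8, "b"), (0.7, "a"), (0.2, "c")]],
    m=1,
)
```

In this example the scan stops after two entries per list: by then the best aggregate found (item "a") is at least as large as the sum of the last scores read, so no unseen item can beat it.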
43
Motivation for DIL/RDIL Hybrid

Correlation of query keywords: probability that
the query keywords occur in the same element
High correlation: RDIL likely to outperform DIL
by stopping early
Low correlation: DIL likely to outperform RDIL
because RDIL has to scan most (or all) of the
inverted list
Dilemma
Each of DIL and RDIL is likely to outperform the other on some queries
But they require inverted lists to be sorted in
different orders
Challenges
Get benefits of DIL and RDIL without doubling
space?
How can keyword correlation be determined?

44
Hybrid Dewey Inverted List (HDIL)
[Diagram: for each keyword (e.g., XQL), a full inverted list sorted by Dewey Id that also serves as the leaf level of a B-tree on Dewey Id, plus a short list sorted by ElemRank]

RDIL is better only when it scans little of the
inverted list
Short list sorted by ElemRank: saves space!
Can reuse the full inverted list as the leaf level of the B-tree: saves space!

45
HDIL Algorithm

Start with RDIL (to learn correlation)
Periodically calculate
Time spent so far: t
Number of results above threshold: r
Expected remaining time: (m - r) × t / r, where m is the
desired number of query results
If expected time for RDIL exceeds that for DIL,
switch to DIL, else stick to RDIL
Expected time for DIL can easily be calculated a
priori because DIL scans the entire inverted list
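
The switching test can be sketched as a small function. The estimate (m - r) × t / r is taken from the slide; the function name, the handling of r = 0 (treated here as "no progress, so switch"), and the idea of passing the DIL estimate in as a number are assumptions for illustration.

```python
def should_switch_to_dil(t, r, m, dil_time):
    """Return True when RDIL's projected remaining time exceeds DIL's expected time.

    t:        time spent in RDIL so far
    r:        number of results above the threshold so far
    m:        desired number of query results
    dil_time: a priori estimate for a full DIL scan of the inverted lists
    """
    if r >= m:
        return False                  # enough results already; no reason to switch
    if r == 0:
        return True                   # assumption: no progress yet counts as unbounded time
    remaining = (m - r) * t / r       # expected remaining time for RDIL
    return remaining > dil_time


# 1 of 10 results after 2 time units projects 18 more units, versus 5 for DIL.
switch = should_switch_to_dil(t=2.0, r=1, m=10, dil_time=5.0)
```

A query with highly correlated keywords accumulates results quickly, keeps the projected time low, and therefore stays with RDIL.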

46
Outline

Design Principles
Problem Definition and Ranking
Indexing and Query Processing
Experimental Results
Related Work and Conclusion

47
Experimental Setup

Data sets
DBLP (real data, 143MB, depth 4, many small
documents)
XMARK (synthetic data, 113MB, depth 10, one
large document)
Implementation
C, file system using memory-mapped I/O
Naïve-ID, Naïve-Rank, DIL, RDIL, HDIL
Hardware
2.8GHz P4 processor, 1GB RAM, 2 × 40GB hard disks

48
ElemRank Computation

Parameter settings
d1 = 0.35, d2 = d3 = 0.25
Convergence threshold: 0.00002
DBLP converged in < 10 minutes
XMark converged in < 5 minutes
Similar convergence time for other values of d1,
d2, and d3

49
Quality of Results

Anecdotal evidence
Query "gray" on DBLP
Author elements of highly referenced papers and
books by Jim Gray
Title elements of important papers on Gray
codes
Query "author gray" on DBLP
Ranks of Gray codes elements dropped due to
two-dimensional proximity metric
Full evaluation on real IEEE INEX collection
underway

50
Space Requirements
51
Query Performance

Parameters
Number of query keywords
Correlation between keywords
Desired number of query results (default 10)
Selectivity of keywords (default unselective)
Cold cache performance numbers
Simulate large, non memory-resident data set

52
DBLP High Correlation Keywords
53
DBLP Low Correlation Keywords
55
Related Work

Semi-structured ranked keyword search
XIRQL [Fuhr and Großjohann, 2001]
XXL [Theobald and Weikum, 2001]
Commercial search engines [Luk et al., 2000]
SGML documents [Myaeng et al., 2001]
Keyword search over databases
BANKS [Bhalotia et al., 2002]
DBXplorer [Agrawal et al., 2002]
DISCOVER [Hristidis et al., 2002]
LORE [Goldman et al., 1999]

56
XRANK Summary

Ranked keyword search over XML documents
Exploit hyperlinked and containment structure for
ranking
Two-dimensional proximity
Query a mix of XML and HTML documents
Efficient index structures and query processing
techniques
SIGMOD 2003 paper for more details

57
Extensions to XRANK

Other ranking functions (e.g., tf-idf)
Incremental updates of inverted lists
Normalized XML documents
Integration with structured query processing

58
Quark Project @ Cornell
[Diagram: a grid with data (structured, unstructured) on one axis and queries (ranked keyword search, complex and structured) on the other, positioning information retrieval systems and (relational) database systems]
59
Questions?
