Title: XRANK: Ranked Keyword Search Over XML Documents
1 XRANK: Ranked Keyword Search Over XML Documents
- Jayavel Shanmugasundaram
- Cornell University
- Joint work with Lin Guo, Feng Shao, and Chavdar Botev
2 Motivation
- Ranked keyword search is emerging as the dominant paradigm for information discovery
- Simple
- Results ranked in order of relevance
- Hitherto mostly limited to unstructured data
- HTML, unstructured text
- Growing wave of semi-structured data
- XML documents on the Internet and in corporate intranets
- XRANK is designed for keyword search over such data
3 Keyword Search over Unstructured Data
[Figure: query keywords are evaluated over hyperlinked HTML documents to produce ranked results]
4 Keyword Search over Semi-Structured Data
[Figure: query keywords are passed to XRANK, which searches a mix of hyperlinked XML and HTML documents and returns ranked results]
5 XRANK System
- Semi-structured XML documents
- Predefined schema not necessary
- Hyperlink-based ranking
- Exploit hyperlinked nature of XML for ranking
- Generalize unstructured keyword search
- Can query a mix of XML and HTML documents (Internet, corporate intranets)
- Google for XML (also Google for HTML!)
6 Possible Applications
- Content management
- Integrating structured and unstructured data
- Scientific repositories
- Marked up documents
- Bibliography databases
- Future Internet (?)
- ...
7 Outline
- Design Principles
- Indexing and Query Processing
- Experimental Results
- Related Work and Conclusion
8 XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
10 Design Principles
- Return most specific element containing the query keywords
11 XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
    <paper id="2">
12 Design Principles
- Return most specific element containing the query keywords
- Ranking has to be done at the granularity of elements
14 Design Principles
- Return most specific element containing the query keywords
- Ranking has to be done at the granularity of elements
- Two-dimensional keyword proximity
- Height of result XML tree
- Width of result XML tree
16 Design Principles
- Return most specific element containing the query keywords
- Ranking has to be done at the granularity of elements
- Two-dimensional keyword proximity
- Height of result XML tree
- Width of result XML tree
- Generalize HTML keyword search
17 Outline
- Design Principles
- Overview
- Formalization
- Indexing and Query Processing
- Experimental Results
- Related Work and Conclusion
18 Data Model
[Figure: graph of elements; legend: containment edge, hyperlink edge]
19 Design Principles
- Return most specific element containing the query keywords
- Ranking has to be done at the granularity of elements
- Two-dimensional keyword proximity
- Height of result XML tree
- Width of result XML tree
- Generalize HTML keyword search
21 ElemRank
- Objective importance of element
- Analogous to Google's PageRank
- But computed at granularity of elements
- Exploit hyperlink edges and containment edges
- Naturally generalizes Google's PageRank
- Random walk interpretation
22 PageRank [Brin and Page, 1998]
[Figure: pages connected by hyperlink edges; node w]
- 1 - d: probability of random jump
23 ElemRank
[Figure: elements connected by hyperlink edges and containment edges; node w]
- 1 - d1 - d2 - d3: probability of random jump
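The random-walk reading of ElemRank can be illustrated with a small power iteration. This is a minimal sketch and not the deck's actual formula: it assumes that d1 weights hyperlink edges, d2 weights parent-to-child containment edges, d3 weights child-to-parent containment edges, that the random jump is uniform over all elements, and that scores are renormalized after every step. The function name `elem_rank` and its arguments are hypothetical; the default values reuse the parameter settings reported later in the deck.

```python
def elem_rank(n, hyperlinks, parent, d1=0.35, d2=0.25, d3=0.25,
              tol=2e-5, max_iter=1000):
    """Power iteration over n elements numbered 0..n-1 (illustrative only).

    hyperlinks: list of (source, target) element pairs.
    parent: dict mapping a child element to its parent element.
    """
    children = {u: [] for u in range(n)}
    for child, par in parent.items():
        children[par].append(child)
    out_links = {u: [] for u in range(n)}
    for src, dst in hyperlinks:
        out_links[src].append(dst)

    jump = 1.0 - d1 - d2 - d3          # probability of a random jump
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new = [jump / n] * n           # assumption: uniform random jump
        for u in range(n):
            for v in out_links[u]:     # follow a hyperlink edge
                new[v] += d1 * rank[u] / len(out_links[u])
            for c in children[u]:      # follow a containment edge downward
                new[c] += d2 * rank[u] / len(children[u])
            if u in parent:            # follow a containment edge upward
                new[parent[u]] += d3 * rank[u]
        total = sum(new)
        new = [x / total for x in new]  # assumption: renormalize each step
        delta = max(abs(a - b) for a, b in zip(new, rank))
        rank = new
        if delta < tol:
            break
    return rank


# Element 0 contains elements 1 and 2; element 3 (another document) links to 1.
ranks = elem_rank(4, hyperlinks=[(3, 1)], parent={1: 0, 2: 0})
```

In this toy graph, elements 1 and 2 are siblings that differ only in the incoming hyperlink, so element 1 ends up with the higher score.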
24 Design Principles
- Return most specific element containing the query keywords
- Ranking has to be done at the granularity of elements
- Two-dimensional keyword proximity
- Height of result XML tree
- Width of result XML tree
- Generalize HTML keyword search
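25 Two-Dimensional Proximity
- Consider query $Q = \{k_1, \ldots, k_n\}$ and a result element $E$
- $k_i$ is directly contained in $w_i$
[Figure: result element E with descendants w1, ..., wn that directly contain k1, ..., kn]
- Rank of $E$ with respect to $k_i$
- $\mathrm{Rank}(E, k_i) = \mathrm{ElemRank}(w_i) \times \mathrm{decay}^{h}$
- $0 < \mathrm{decay} < 1$
- $h$ is the length of the path from $E$ to $w_i$
- Overall rank
- $\mathrm{Rank}(E) = \left( \sum_{i} \mathrm{Rank}(E, k_i) \right) \times p(E, k_1, \ldots, k_n)$, where $p$ is the keyword proximity function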
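The per-keyword rank and the overall rank can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the default decay of 0.75 is an assumed value (the slide only requires a decay strictly between 0 and 1), and the proximity factor is passed in as a plain number because the slide does not define how it is computed.

```python
def keyword_rank(elem_rank_w, h, decay=0.75):
    """Rank(E, k_i): ElemRank of w_i scaled down by decay for each level between E and w_i."""
    if not 0 < decay < 1:
        raise ValueError("decay must lie strictly between 0 and 1")
    return elem_rank_w * decay ** h


def overall_rank(keyword_ranks, proximity=1.0):
    """Rank(E): sum of the per-keyword ranks times a proximity factor (assumed given)."""
    return sum(keyword_ranks) * proximity


# Two keywords: one found 2 levels below E, one found directly in a child of E.
r1 = keyword_rank(0.8, h=2, decay=0.5)
r2 = keyword_rank(0.4, h=1, decay=0.5)
total = overall_rank([r1, r2], proximity=0.5)
```

The decay term captures the height dimension (deeper keywords contribute less), while the proximity factor stands in for the width dimension.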
26 Outline
- Design Principles
- Indexing and Query Processing
- Experimental Results
- Related Work and Conclusion
27 System Architecture
[Figure: XML/HTML documents feed ElemRank computation, which produces XML elements with ElemRanks; these are indexed in the Hybrid Dewey Inverted List; the query evaluator takes a keyword query, reads the index through data access, and returns ranked results]
- Compute top-m query results as per definition of ranking
28 Outline
- Design Principles
- Indexing and Query Processing
- Naïve
- DIL
- RDIL
- HDIL
- Experimental Results
- Related Work and Conclusion
29 Naïve Approach
- A main difference between document and XML keyword search is result granularity
- Treat each element as a document
- Build regular inverted list index structures over elements
30 Naïve Method
- Naïve inverted lists
- Ricardo: 1, 5, 6, 8
- XQL: 1, 5, 6, 7
[Figure: element tree with numeric IDs]
1 <workshop>
  2 date: 28 July
  3 <title>: XML and
  4 <editors>: David Carmel
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and
      8 <author>: Ricardo
    <paper>
- Problems: 1. Space overhead; 2. Spurious results; 3. Inaccurate ranking (two-dimensional proximity)
- Main issue: decouples representation of ancestors and descendants
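The space overhead of the naïve method follows from indexing every keyword under each ancestor of the element that holds it. The sketch below rebuilds the two inverted lists from the slide; the `PARENT` and `TEXT` dictionaries transcribe the figure's element IDs, and the function name is hypothetical.

```python
from collections import defaultdict

# Element tree from the slide: child ID -> parent ID (element 1 is the root).
PARENT = {2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 6, 8: 6}
# Text directly contained in each element, as shown in the figure.
TEXT = {2: "28 July", 3: "XML and", 4: "David Carmel", 7: "XQL and", 8: "Ricardo"}


def naive_inverted_lists(parent, text):
    """Index each keyword under its element and under every ancestor of that element."""
    lists = defaultdict(set)
    for elem, words in text.items():
        node = elem
        while node is not None:
            for word in words.split():
                lists[word].add(node)
            node = parent.get(node)  # None once we move past the root
    return {word: sorted(ids) for word, ids in lists.items()}


lists = naive_inverted_lists(PARENT, TEXT)
```

Each keyword that occurs once in the document appears in four list entries here, one per level of the tree, which is the overhead the slide points out.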
31 Outline
- Design Principles
- Indexing and Query Processing
- Naïve
- DIL
- RDIL
- HDIL
- Experimental Results
- Related Work and Conclusion
32 Dewey Encoding of IDs [1850s]
[Figure: element tree labeled with Dewey IDs]
0 <workshop>
  0.0 date: 28 July
  0.1 <title>: XML and
  0.2 <editors>: David Carmel
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and
      0.3.0.1 <author>: Ricardo
    0.3.1 <paper>
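A Dewey ID records the path from the root, so an element's ancestors are exactly the proper prefixes of its ID. The following sketch assigns IDs to the tree in the figure; the nested-tuple tree representation and the helper names are assumptions made for illustration.

```python
def assign_dewey(node, dewey=(0,)):
    """Yield (dewey_id, tag) pairs; a child's ID is its parent's ID plus its sibling position."""
    tag, children = node
    yield dewey, tag
    for position, child in enumerate(children):
        yield from assign_dewey(child, dewey + (position,))


def is_ancestor(a, b):
    """True if the element with Dewey ID a is a proper ancestor of the one with ID b."""
    return len(a) < len(b) and b[:len(a)] == a


def to_string(dewey):
    return ".".join(map(str, dewey))


# The tree from the slide, as (tag, children) tuples.
TREE = ("workshop", [
    ("date", []),
    ("title", []),
    ("editors", []),
    ("proceedings", [
        ("paper", [("title", []), ("author", [])]),
        ("paper", []),
    ]),
])

ids = {to_string(dewey): tag for dewey, tag in assign_dewey(TREE)}
```

Because ancestry reduces to a prefix test, an index only needs to store the ID of the element that directly contains a keyword.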
33 Dewey Inverted List (DIL)
[Figure: one inverted list per keyword, each sorted by Dewey ID]
Keyword   Dewey ID    ElemRank   Position List
XQL       5.0.3.0.0   85         32
XQL       8.0.3.8.3   38         89, 91
Ricardo   5.0.3.0.1   82         38
Ricardo   8.2.1.4.2   99         52
- Store IDs of elements that directly contain keyword
- Avoids space overhead
34 DIL Challenges
- Merging multiple inverted lists
- Simple equality merge not sufficient
- Need to infer most specific result
- Suppress spurious results
- Two-dimensional proximity
- Algorithm that addresses above issues in a single
scan over inverted lists
35 DIL Query Processing
- Merge query keyword inverted lists in Dewey ID order
- Entries with common prefixes are processed together
- Compute longest common prefix of Dewey IDs during the merge
- Longest common prefix ensures most specific results
- Also suppresses spurious results
- Keep top-m results seen so far in output heap
- Calculate rank using two-dimensional proximity metric
- Output contents of output heap after scanning inverted lists
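The role of the longest common prefix can be shown on the Dewey IDs from the DIL example. The sketch below is a simplified stand-in, not the single-scan merge the slides describe: it processes elements level by level from the deepest upward, reports an element as soon as its subtree covers every query keyword, and does not pass that element's keywords on to its ancestors (an assumed reading of "most specific" and "suppress spurious results"). Ranking and the top-m heap are omitted, and all names are hypothetical.

```python
from collections import defaultdict


def longest_common_prefix(a, b):
    """Dewey ID of the deepest common ancestor (or self) of two elements."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def most_specific_results(inverted_lists):
    """inverted_lists: keyword -> Dewey IDs (tuples) of elements directly containing it."""
    keywords = set(inverted_lists)
    found = defaultdict(set)  # Dewey ID -> keywords seen in its subtree so far
    for keyword, dewey_ids in inverted_lists.items():
        for dewey in dewey_ids:
            found[dewey].add(keyword)

    results = []
    # Deepest level first, so descendants are settled before their ancestors.
    for depth in range(max(map(len, found)), 0, -1):
        for dewey in [d for d in found if len(d) == depth]:
            if found[dewey] == keywords:
                results.append(dewey)                  # most specific match
            elif depth > 1:
                found[dewey[:-1]] |= found[dewey]      # hand keywords to the parent
    return sorted(results)


lists = {
    "XQL": [(5, 0, 3, 0, 0), (8, 0, 3, 8, 3)],
    "Ricardo": [(5, 0, 3, 0, 1), (8, 2, 1, 4, 2)],
}
results = most_specific_results(lists)
```

For these lists the first result is 5.0.3.0, the longest common prefix of 5.0.3.0.0 and 5.0.3.0.1; the second pair of entries only meets at element 8, a much taller result that the decay term would rank lower.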
36 Dewey Inverted List (DIL)
[Figure: one inverted list per keyword, each sorted by Dewey ID]
Keyword   Dewey ID    ElemRank   Position List
XQL       5.0.3.0.0   85         32
XQL       8.0.3.8.3   38         89, 91
Ricardo   5.0.3.0.1   82         38
Ricardo   8.2.1.4.2   99         52
37 Outline
- Design Principles
- Indexing and Query Processing
- Naïve
- DIL
- RDIL
- HDIL
- Experimental Results
- Related Work and Conclusion
38 Potential Problem with DIL
- Always requires a full scan of query keyword inverted lists
- Can be inefficient for unselective keywords or large document collections
- Solution: RDIL (Ranked Dewey Inverted List)
- Order inverted lists by ElemRank instead of Dewey ID
- Higher ranked results likely to appear first
- Query processing can be terminated early
39 Ranked Dewey Inverted List (RDIL)
[Figure: for each keyword (XQL shown, other keywords similar), an inverted list sorted by ElemRank together with a B-tree on Dewey ID]
40 RDIL Challenges
- An element may be ranked highly in one list and low in another list
- B-tree helps search for low ranked element
- When to stop scanning inverted lists?
- Based on Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold
- Can stop if we have sufficient results above the threshold
- Extension to most specific results
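The early-termination idea can be sketched with a plain version of the Threshold Algorithm. This is a simplification under stated assumptions: results are whole elements that appear in every keyword list (the extension to most specific results is left out), the combined score is a plain sum, and a dictionary lookup stands in for the B-tree on Dewey ID. The function name and the toy scores are hypothetical.

```python
import heapq


def threshold_top_m(ranked_lists, m):
    """ranked_lists: keyword -> [(element, score), ...] sorted by score, highest first."""
    # Random access by element; stands in for the B-tree on Dewey ID.
    lookup = {kw: dict(entries) for kw, entries in ranked_lists.items()}
    top = []      # min-heap holding the best m (total_score, element) pairs so far
    seen = set()
    # Once the shortest list is exhausted, every possible result has been seen.
    rounds = min(len(entries) for entries in ranked_lists.values())
    for i in range(rounds):
        for entries in ranked_lists.values():
            elem, _ = entries[i]
            if elem in seen:
                continue
            seen.add(elem)
            if all(elem in scores for scores in lookup.values()):
                total = sum(scores[elem] for scores in lookup.values())
                heapq.heappush(top, (total, elem))
                if len(top) > m:
                    heapq.heappop(top)
        # No unseen element can score higher than the sum of the current scan positions.
        threshold = sum(entries[i][1] for entries in ranked_lists.values())
        if len(top) == m and top[0][0] >= threshold:
            break  # enough results at or above the threshold: stop early
    return sorted(top, reverse=True)


ranked = {
    "XQL": [("a", 0.9), ("b", 0.8), ("c", 0.1)],
    "Ricardo": [("b", 0.7), ("a", 0.5), ("c", 0.2)],
}
best = threshold_top_m(ranked, m=1)
```

With these toy lists the scan stops after two rounds without reading the entries for element c, which is the saving RDIL aims for when keywords are highly correlated.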
41 RDIL Query Processing
[Figure: inverted lists for Ricardo and XQL, each sorted by ElemRank and paired with a B-tree on Dewey ID, feeding a temp heap and an output heap. Labeled entries: P = 9.0.4.2.0 and R; other Dewey IDs shown: 9.0.4.1.2, 8.2.1.4.2, 9.0.5.6, 10.8.3. Candidate result: Rank(9.0.4). Threshold annotations: threshold = ElemRank(P) + Max-ElemRank, then threshold = ElemRank(P) + ElemRank(R)]
42 Outline
- Design Principles
- Indexing and Query Processing
- Naïve
- DIL
- RDIL
- HDIL
- Experimental Results
- Related Work and Conclusion
43 Motivation for DIL/RDIL Hybrid
- Correlation of query keywords: probability that the query keywords occur in the same element
- High correlation: RDIL likely to outperform DIL by stopping early
- Low correlation: DIL likely to outperform RDIL because RDIL has to scan most of (or the entire) inverted list
- Dilemma
- DIL and RDIL are each likely to outperform the other, depending on keyword correlation
- But require inverted lists to be sorted in different orders
- Challenges
- Get benefits of DIL and RDIL without doubling space?
- How can keyword correlation be determined?
44 Hybrid Dewey Inverted List (HDIL)
[Figure: for each keyword (XQL shown), a full inverted list sorted by Dewey ID, a B-tree on Dewey ID, and a short list sorted by ElemRank]
- RDIL is better only when it scans little of the inverted list
- Short list sorted by ElemRank saves space!
- Can reuse full inverted list as leaf level of B-tree
- Saves space!
45 HDIL Algorithm
- Start with RDIL (to learn correlation)
- Periodically calculate
- Time spent so far: t
- Number of results above threshold: r
- Expected remaining time: (m - r) * t / r, where m is the desired number of query results
- If expected time for RDIL exceeds that for DIL, switch to DIL; otherwise stick with RDIL
- Expected time for DIL can easily be calculated a priori because DIL scans the entire inverted list
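The switching rule reduces to a small calculation. The sketch below is one possible reading: the function names are hypothetical, the case r = 0 is treated as an unbounded estimate (an assumption, since the formula is undefined there), and the DIL time is taken as a precomputed input.

```python
import math


def expected_remaining_rdil_time(t, r, m):
    """Estimate (m - r) * t / r: time to find the remaining m - r results at the pace so far."""
    if r >= m:
        return 0.0
    if r == 0:
        return math.inf  # assumption: no results yet means no usable estimate
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, dil_time):
    """Switch when continuing with RDIL is expected to take longer than a DIL scan."""
    return expected_remaining_rdil_time(t, r, m) > dil_time


# After 2.0 seconds RDIL has found 4 of the 10 desired results.
remaining = expected_remaining_rdil_time(t=2.0, r=4, m=10)
switch_fast_dil = should_switch_to_dil(2.0, 4, 10, dil_time=2.5)
switch_slow_dil = should_switch_to_dil(2.0, 4, 10, dil_time=5.0)
```

Here the estimate is 3.0 seconds, so the rule switches when a full DIL scan is expected to take 2.5 seconds and stays with RDIL when it is expected to take 5.0 seconds.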
46 Outline
- Design Principles
- Problem Definition and Ranking
- Indexing and Query Processing
- Experimental Results
- Related Work and Conclusion
47 Experimental Setup
- Data sets
- DBLP (real data, 143MB, depth 4, many small documents)
- XMark (synthetic data, 113MB, depth 10, one large document)
- Implementation
- C, file system using memory-mapped I/O
- Naïve-ID, Naïve-Rank, DIL, RDIL, HDIL
- Hardware
- 2.8GHz P4 processor, 1GB RAM, 2 x 40GB hard disks
48 ElemRank Computation
- Parameter settings
- d1 = 0.35, d2 = d3 = 0.25
- Convergence threshold: 0.00002
- DBLP converged in < 10 minutes
- XMark converged in < 5 minutes
- Similar convergence time for other values of d1, d2, and d3
49 Quality of Results
- Anecdotal evidence
- Query "gray" on DBLP
- Author elements of highly referenced papers and books by Jim Gray
- Title elements of important papers on Gray codes
- Query "author gray" on DBLP
- Ranks of Gray codes elements dropped due to two-dimensional proximity metric
- Full evaluation on real IEEE INEX collection underway
50 Space Requirements
51 Query Performance
- Parameters
- Number of query keywords
- Correlation between keywords
- Desired number of query results (default 10)
- Selectivity of keywords (default unselective)
- Cold cache performance numbers
- Simulate large, non memory-resident data set
52 DBLP: High Correlation Keywords
53 DBLP: Low Correlation Keywords
54 Outline
- Design Principles
- Problem Definition and Ranking
- Indexing and Query Processing
- Experimental Results
- Related Work and Conclusion
55 Related Work
- Semi-structured ranked keyword search
- XIRQL [Fuhr and Großjohann, 2001]
- XXL [Theobald and Weikum, 2001]
- Commercial search engines [Luk et al., 2000]
- SGML documents [Myaeng et al., 2001]
- Keyword search over databases
- BANKS [Bhalotia et al., 2002]
- DBXplorer [Agrawal et al., 2002]
- DISCOVER [Hristidis et al., 2002]
- LORE [Goldman et al., 1999]
56 XRANK Summary
- Ranked keyword search over XML documents
- Exploit hyperlinked and containment structure for ranking
- Two-dimensional proximity
- Query a mix of XML and HTML documents
- Efficient index structures and query processing techniques
- See the SIGMOD 2003 paper for more details
57 Extensions to XRANK
- Other ranking functions (e.g., tf-idf)
- Incremental updates of inverted lists
- Normalized XML documents
- Integration with structured query processing
58 Quark Project @ Cornell
[Figure: quadrant diagram. Data axis: structured to unstructured. Queries axis: ranked keyword search to complex and structured. Information retrieval systems and (relational) database systems are placed on the diagram]
59 Questions?