XRANK: Ranked Keyword Search Over XML Documents

1
XRANK: Ranked Keyword Search Over XML Documents
  • Jayavel Shanmugasundaram
  • Cornell University
  • Joint work with
  • Lin Guo, Feng Shao, Chavdar Botev

2
Motivation
  • Ranked keyword search emerging as dominant
    paradigm for information discovery
  • Simple
  • Results ranked in order of relevance
  • Hitherto mostly limited to unstructured data
  • HTML, unstructured text
  • Growing wave of semi-structured data
  • XML documents in Internet and Corporate Intranets
  • XRANK designed for keyword search over such data

3
Keyword Search over Unstructured Data
Ranked Results
Query Keywords
Hyperlinked HTML Documents
4
Keyword Search over Semi-Structured Data
XRank
Ranked Results
Query Keywords
Mix of Hyperlinked XML and HTML Documents
5
XRANK System
  • Semi-structured XML documents
  • Predefined schema not necessary
  • Hyperlink-based ranking
  • Exploit hyperlinked nature of XML for ranking
  • Generalize unstructured keyword search
  • Can query a mix of XML and HTML documents
    (Internet, corporate Intranets)
  • Google for XML (also Google for HTML!)

6
Possible Applications
  • Content management
  • Integrating structured and unstructured data
  • Scientific repositories
  • Marked up documents
  • Bibliography databases
  • Future Internet (?)
  • ...

7
Outline
  • Design Principles
  • Indexing and Query Processing
  • Experimental Results
  • Related Work and Conclusion

8
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
10
Design Principles
  • Return most specific element containing the query
    keywords

11
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
    <paper id="2">
12
Design Principles
  • Return most specific element containing the query
    keywords
  • Ranking has to be done at the granularity of
    elements

14
Design Principles
  • Return most specific element containing the query
    keywords
  • Ranking has to be done at the granularity of
    elements
  • Two-dimensional keyword proximity
  • Height of result XML tree
  • Width of result XML tree

16
Design Principles
  • Return most specific element containing the query
    keywords
  • Ranking has to be done at the granularity of
    elements
  • Two-dimensional keyword proximity
  • Height of result XML tree
  • Width of result XML tree
  • Generalize HTML keyword search

17
Outline
  • Design Principles
  • Overview
  • Formalization
  • Indexing and Query Processing
  • Experimental Results
  • Related Work and Conclusion

18
Data Model
Containment edge
Hyperlink edge
21
ElemRank
  • Objective importance of element
  • Analogous to Google's PageRank
  • But computed at granularity of elements
  • Exploit hyperlink edges and containment edges
  • Naturally generalizes Google's PageRank
  • Random walk interpretation

22
PageRank [Brin & Page, 1998]
[Figure: random walk over pages connected by hyperlink edges; 1-d is the probability of a random jump]
23
ElemRank
[Figure: random walk over elements connected by hyperlink edges and containment edges; 1-d1-d2-d3 is the probability of a random jump]
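The random-walk reading of ElemRank can be sketched as a simple fixed-point iteration. This is only an illustration under stated assumptions: the slides name three parameters d1, d2, d3 but do not say which edge type each one governs, so the sketch assumes d1 = follow a hyperlink, d2 = move to a child element, d3 = move to the parent element. The function name `elem_rank` and the dict-based graph encoding are hypothetical. The default parameter values and the convergence threshold are the ones quoted in the experimental section of this talk.

```python
def elem_rank(n, hyperlinks, children, d1=0.35, d2=0.25, d3=0.25, tol=0.00002):
    """Iterate ElemRank-style scores for elements numbered 0..n-1.

    hyperlinks: dict mapping an element to the elements it links to
    children:   dict mapping an element to its child elements
    Assumed roles (not stated on the slides): d1 = follow a hyperlink,
    d2 = move to a child, d3 = move to the parent.
    """
    parent = {c: p for p, kids in children.items() for c in kids}
    rank = [1.0 / n] * n
    while True:
        # every element receives the random-jump share 1 - d1 - d2 - d3
        new = [(1.0 - d1 - d2 - d3) / n] * n
        for u in range(n):
            for v in hyperlinks.get(u, []):      # hyperlink edges
                new[v] += d1 * rank[u] / len(hyperlinks[u])
            for v in children.get(u, []):        # containment, parent to child
                new[v] += d2 * rank[u] / len(children[u])
            if u in parent:                      # containment, child to parent
                new[parent[u]] += d3 * rank[u]
        if max(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new


# Element 0 contains elements 1 and 2; element 3 (another document) links to 1.
ranks = elem_rank(4, hyperlinks={3: [1]}, children={0: [1, 2]})
```

In this toy graph elements 1 and 2 are structurally identical siblings, so the only thing separating their scores is the incoming hyperlink to element 1. Elements without outgoing edges simply lose that share of probability in this sketch; a real implementation would need to decide how to redistribute it.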
25
Two-Dimensional Proximity
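  • Consider query $Q = (k_1, \ldots, k_n)$ and a result
    element $E$
  • $k_i$ is directly contained in $w_i$
[Figure: result element E with sub-elements w1, ..., wn, where each wi directly contains keyword ki]
  • Rank of $E$ with respect to $k_i$
  • $Rank(E, k_i) = ElemRank(w_i) \times decay^{h}$
  • $0 < decay < 1$
  • $h$ is length of path from $E$ to $w_i$
  • Overall rank
  • $Rank(E) = \left( \sum_{i} Rank(E, k_i) \right) \times p(E, k_1, \ldots, k_n)$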
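A small worked sketch of the ranking arithmetic may help: the rank of E for one keyword is the ElemRank of the element that directly contains the keyword, scaled down by the decay factor once per level between E and that element, and the overall rank combines the per-keyword ranks with a proximity term. The function names, the default decay of 0.75, and the treatment of the proximity term as a plain number supplied by the caller are all assumptions made for illustration; the slide does not say how the proximity term is computed.

```python
def keyword_rank(elem_rank_w, h, decay=0.75):
    """Rank(E, k_i) = ElemRank(w_i) * decay**h, with 0 < decay < 1.

    elem_rank_w: ElemRank of w_i, the element directly containing k_i
    h:           length of the path from E down to w_i
    decay:       hypothetical default; the slide only requires 0 < decay < 1
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    return elem_rank_w * decay ** h


def overall_rank(occurrences, proximity=1.0, decay=0.75):
    """Sum the per-keyword ranks and scale by a proximity factor.

    occurrences: one (ElemRank(w_i), h_i) pair per query keyword
    proximity:   stand-in for the proximity term of the overall rank,
                 passed in as a number because its definition is not given here
    """
    return proximity * sum(keyword_rank(r, h, decay) for r, h in occurrences)


# Keyword 1 sits two levels below E, keyword 2 one level below E.
example = overall_rank([(80, 2), (40, 1)], proximity=0.5, decay=0.5)
```

With decay 0.5 the two keywords contribute 80 × 0.25 and 40 × 0.5, so a deeper occurrence needs a much higher ElemRank to count as much as a shallow one. That is the "height" dimension of the proximity metric.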

27
System Architecture
[Figure: system architecture. XML/HTML documents -> ElemRank computation -> XML elements with ElemRanks -> Hybrid Dewey Inverted List -> (data access) -> query evaluator, which takes a keyword query and returns ranked results]
  • Compute top-m query results as per definition of
    ranking
28
Outline
  • Design Principles
  • Indexing and Query Processing
  • Naïve
  • DIL
  • RDIL
  • HDIL
  • Experimental Results
  • Related Work and Conclusion

29
Naïve Approach
  • A main difference between document and XML
    keyword search is result granularity
  • Treat each element as a document
  • Build regular inverted list index structures over
    elements

30
Naïve Method
  • Naïve inverted lists
  • Ricardo: 1; 5; 6; 8
  • XQL: 1; 5; 6; 7
[Figure: example document tree with element IDs]
1 <workshop>
  2 date: 28 July ...
  3 <title>: XML and ...
  4 <editors>: David Carmel ...
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and ...
      8 <author>: Ricardo ...
    <paper> ...
  • Problems
  • 1. Space overhead
  • 2. Spurious results
  • 3. Inaccurate ranking (two-dimensional proximity)
  • Main issue: decouples representation of ancestors
    and descendants
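The space overhead of the naïve method is easy to see in a small sketch. The tree below mirrors the element IDs and the (truncated) text shown on this slide; the tuple encoding and the function name `build_naive_index` are hypothetical choices for the illustration, and the second, unnumbered paper is left out.

```python
from collections import defaultdict

# (element id, label, directly contained text, children), IDs as on the slide
TREE = (1, "workshop", "", [
    (2, "date", "28 July", []),
    (3, "title", "XML and", []),
    (4, "editors", "David Carmel", []),
    (5, "proceedings", "", [
        (6, "paper", "", [
            (7, "title", "XQL and", []),
            (8, "author", "Ricardo", []),
        ]),
    ]),
])


def build_naive_index(node, ancestors=(), index=None):
    """Post each word under the element holding it and under every ancestor."""
    if index is None:
        index = defaultdict(list)
    elem_id, _label, text, children = node
    path = ancestors + (elem_id,)
    for word in text.split():
        for elem in path:            # treat every enclosing element as a document
            if elem not in index[word]:
                index[word].append(elem)
    for child in children:
        build_naive_index(child, path, index)
    return index


index = build_naive_index(TREE)
```

Here `index["Ricardo"]` comes out as `[1, 5, 6, 8]` and `index["XQL"]` as `[1, 5, 6, 7]`, matching the lists on the slide: one occurrence of a word produces one posting per ancestor. Nothing in those flat lists records that 1, 5, and 6 are ancestors of 7 and 8, which is the decoupling the slide points to.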
32
Dewey Encoding of IDs [1850s]
0 <workshop>
  0.0 date: 28 July ...
  0.1 <title>: XML and ...
  0.2 <editors>: David Carmel ...
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and ...
      0.3.0.1 <author>: Ricardo ...
    0.3.1 <paper> ...
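Assigning Dewey IDs amounts to a tree walk in which each node's ID is its parent's ID plus its position among its siblings. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `dewey_ids` and the choice to number attributes before child elements (so that `date` receives 0.0, as on the slide) are assumptions made for this illustration. The sample document is a shortened version of the workshop example without the `id` attributes.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<workshop date="28 July 2000">'
    "<title>XML and Information Retrieval</title>"
    "<editors>David Carmel</editors>"
    "<proceedings>"
    "<paper><title>XQL and Proximal Nodes</title>"
    "<author>Ricardo Baeza-Yates</author></paper>"
    "<paper/>"
    "</proceedings>"
    "</workshop>"
)


def dewey_ids(root):
    """Return (dewey_id, label) pairs in document order."""
    out = []

    def visit(elem, dewey):
        out.append((".".join(map(str, dewey)), elem.tag))
        position = 0
        for name in elem.attrib:                 # attributes are numbered first
            out.append((".".join(map(str, dewey + (position,))), name))
            position += 1
        for child in elem:                       # then child elements
            visit(child, dewey + (position,))
            position += 1

    visit(root, (0,))
    return out


ids = dewey_ids(ET.fromstring(SAMPLE))
```

Because an ID contains the IDs of all ancestors as prefixes, the ancestor-descendant relationship that the naïve element numbering loses can be read directly off the ID.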
33
Dewey Inverted List (DIL)
Keyword   Dewey Id    ElemRank   Position List
XQL       5.0.3.0.0   85         32
          8.0.3.8.3   38         89 91
Ricardo   5.0.3.0.1   82         38
          8.2.1.4.2   99         52
  • Each keyword's inverted list is sorted by Dewey Id
  • Store IDs of elements that directly contain
    keyword - avoids space overhead
34
DIL Challenges
  • Merging multiple inverted lists
  • Simple equality merge not sufficient
  • Need to infer most specific result
  • Suppress spurious results
  • Two-dimensional proximity
  • Algorithm that addresses above issues in a single
    scan over inverted lists

35
DIL Query Processing
  • Merge query keyword inverted lists in Dewey ID
    Order
  • Entries with common prefixes are processed
    together
  • Compute Longest Common Prefix of Dewey IDs during
    the merge
  • Longest common prefix ensures most specific
    results
  • Also suppresses spurious results
  • Keep top-m results seen so far in output heap
  • Calculate rank using two-dimensional proximity
    metric
  • Output contents of output heap after scanning
    inverted lists
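The role of the longest common prefix can be illustrated with a deliberately simplified sketch. It is not the single-scan merge described above: it tries every combination of one entry per keyword, which is fine for a toy example but not for real inverted lists, and it omits ranking and the output heap. It also assumes that a candidate is dropped whenever a more specific candidate lies below it, which is a simplification of how spurious results are suppressed. All function names are hypothetical.

```python
from itertools import product


def parse(dewey):
    """'5.0.3.0.0' -> (5, 0, 3, 0, 0)"""
    return tuple(int(part) for part in dewey.split("."))


def longest_common_prefix(ids):
    """Longest shared leading run of components among the given Dewey IDs."""
    prefix = []
    for parts in zip(*ids):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def most_specific_results(inverted_lists):
    """inverted_lists: one list of Dewey ID strings per query keyword."""
    parsed = [[parse(d) for d in lst] for lst in inverted_lists]
    candidates = set()
    for combo in product(*parsed):
        prefix = longest_common_prefix(combo)
        if prefix:                        # empty prefix: no common ancestor
            candidates.add(prefix)
    # simplification: keep a candidate only if no other candidate lies below it
    results = [c for c in candidates
               if not any(o != c and o[:len(c)] == c for o in candidates)]
    return sorted(".".join(map(str, r)) for r in results)


xql = ["5.0.3.0.0", "8.0.3.8.3"]
ricardo = ["5.0.3.0.1", "8.2.1.4.2"]
results = most_specific_results([xql, ricardo])
```

For the two lists shown on the DIL slide this yields `5.0.3.0` (the element directly above both `5.0.3.0.0` and `5.0.3.0.1`) and `8`, the only element that encloses both occurrences in the second group.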

38
Potential Problem with DIL
  • Always requires a full scan of query keyword
    inverted lists
  • Can be inefficient for unselective keywords/large
    document collections
  • Solution: RDIL (Ranked Dewey Inverted List)
  • Order inverted lists by ElemRank instead of
    DeweyID
  • Higher ranked results likely to appear first
  • Query processing can be terminated early

39
Ranked Dewey Inverted List (RDIL)
[Figure: for each keyword (e.g., XQL, and likewise for other keywords), an inverted list sorted by ElemRank, with a B-tree on Dewey Id]
40
RDIL Challenges
  • An element may be ranked highly in one list and
    low in another list
  • B-tree helps search for low ranked element
  • When to stop scanning inverted lists?
  • Based on Threshold Algorithm [Fagin et al.,
    2002], which periodically calculates a threshold
  • Can stop if we have sufficient results above the
    threshold
  • Extension to most specific results
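For readers unfamiliar with the Threshold Algorithm, here is a minimal sketch of its generic form, not of RDIL itself. Assumptions made for the illustration: the overall score of an item is the sum of its per-list scores, an item missing from a list scores 0 there, and random access is a dictionary lookup (RDIL uses the B-tree for that step). The extension to most specific results mentioned in the last bullet is not covered. The function name and return shape are hypothetical.

```python
import heapq


def threshold_algorithm(score_lists, m):
    """Generic top-m by summed score with early termination.

    score_lists: one dict {item: score} per keyword.
    Returns (top-m list of (item, total), number of sorted rows read).
    """
    sorted_lists = [sorted(s.items(), key=lambda kv: kv[1], reverse=True)
                    for s in score_lists]
    totals = {}
    top = []
    rows_read = 0
    for depth in range(max((len(s) for s in sorted_lists), default=0)):
        rows_read += 1
        threshold = 0
        for entries in sorted_lists:
            if depth < len(entries):
                item, score = entries[depth]     # sorted access
                threshold += score
                if item not in totals:           # random access to other lists
                    totals[item] = sum(s.get(item, 0) for s in score_lists)
        top = heapq.nlargest(m, totals.items(), key=lambda kv: kv[1])
        # no unseen item can beat the threshold, so stop once top-m reach it
        if len(top) == m and top[-1][1] >= threshold:
            break
    return top, rows_read


lists = [{"a": 90, "b": 80, "c": 10}, {"b": 70, "a": 30, "c": 20}]
best, rows = threshold_algorithm(lists, m=1)
```

In this example the scan stops after two of the three rows: item `b` already totals 150, while any item not yet seen can total at most 80 + 30 = 110.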

41
RDIL Query Processing
[Figure: RDIL query processing example for the keywords Ricardo and XQL. Shown: the two inverted lists, each with a B-tree on Dewey Id; a temp heap and an output heap; entry P = 9.0.4.2.0 matched with 9.0.4.1.2 via the B-tree, giving Rank(9.0.4); threshold = ElemRank(P) + Max-ElemRank, later threshold = ElemRank(P) + ElemRank(R)]
43
Motivation for DIL/RDIL Hybrid
  • Correlation of query keywords: probability that
    the query keywords occur in same element
  • High correlation: RDIL likely to outperform DIL
    by stopping early
  • Low correlation: DIL likely to outperform RDIL
    because RDIL has to scan most of (or the entire)
    inverted list
  • Dilemma
  • DIL and RDIL each likely to outperform the other,
    depending on keyword correlation
  • But require inverted lists to be sorted in
    different orders
  • Challenges
  • Get benefits of DIL and RDIL without doubling
    space?
  • How can keyword correlation be determined?

44
Hybrid Dewey Inverted List (HDIL)
[Figure: for each keyword (e.g., XQL), a full inverted list sorted by Dewey Id, a B-tree on Dewey Id, and a short list sorted by ElemRank]
  • RDIL is better only when it scans little of
    inverted list
  • Short list sorted by ElemRank - saves space!
  • Can reuse full inverted list as leaf of B-tree
  • Saves space!

45
HDIL Algorithm
  • Start with RDIL (to learn correlation)
  • Periodically calculate
  • time spent so far: t
  • number of results above threshold: r
  • expected remaining time: (m - r) × t / r, where m is
    desired number of query results
  • If expected time for RDIL exceeds that for DIL,
    switch to DIL, else stick to RDIL
  • Expected time for DIL can easily be calculated a
    priori because DIL scans the entire inverted list
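The switching rule can be written down in a few lines. This is a sketch of the arithmetic only: the function names are hypothetical, the DIL estimate is taken as a given number, and the handling of the case r = 0 (no results above the threshold yet, so the estimate is treated as unbounded) is an assumption, since the slide does not cover it.

```python
def expected_rdil_remaining(t, r, m):
    """Estimate (m - r) * t / r: time RDIL still needs to reach m results.

    t: time spent so far
    r: number of results already above the threshold
    m: desired number of query results
    """
    if r >= m:
        return 0.0
    if r == 0:
        return float("inf")   # assumption: no results yet, no usable estimate
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, expected_dil_time):
    """Switch when RDIL's estimate exceeds the a priori DIL scan time."""
    return expected_rdil_remaining(t, r, m) > expected_dil_time


# 4 of 10 results found after 2.0 time units: 6 * 2.0 / 4 = 3.0 units to go.
remaining = expected_rdil_remaining(2.0, 4, 10)
```

The estimate assumes results keep arriving at the rate observed so far (r results in time t), which is why the check is repeated periodically rather than made once.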

46
Outline
  • Design Principles
  • Problem Definition and Ranking
  • Indexing and Query Processing
  • Experimental Results
  • Related Work and Conclusion

47
Experimental Setup
  • Data sets
  • DBLP (real data, 143MB, depth 4, many small
    documents)
  • XMark (synthetic data, 113MB, depth 10, one
    large document)
  • Implementation
  • C++, file system using memory-mapped I/O
  • Naïve-ID, Naïve-Rank, DIL, RDIL, HDIL
  • Hardware
  • 2.8GHz P4 processor, 1GB RAM, two 40GB hard disks

48
ElemRank Computation
  • Parameter settings
  • d1 = 0.35, d2 = d3 = 0.25
  • Convergence threshold: 0.00002
  • DBLP converged in < 10 minutes
  • XMark converged in < 5 minutes
  • Similar convergence time for other values of d1,
    d2, and d3

49
Quality of Results
  • Anecdotal evidence
  • Query "gray" on DBLP
  • Author elements of highly referenced papers and
    books by Jim Gray
  • Title elements of important papers on Gray
    codes
  • Query "author gray" on DBLP
  • Ranks of Gray codes elements dropped due to
    two-dimensional proximity metric
  • Full evaluation on real IEEE INEX collection
    underway

50
Space Requirements
51
Query Performance
  • Parameters
  • Number of query keywords
  • Correlation between keywords
  • Desired number of query results (default 10)
  • Selectivity of keywords (default unselective)
  • Cold cache performance numbers
  • Simulate large, non memory-resident data set

52
DBLP High Correlation Keywords
53
DBLP Low Correlation Keywords
55
Related Work
  • Semi-structured ranked keyword search
  • XIRQL [Fuhr and Großjohann, 2001]
  • XXL [Theobald and Weikum, 2001]
  • Commercial search engines [Luk et al., 2000]
  • SGML documents [Myaeng et al., 2001]
  • Keyword search over databases
  • BANKS [Bhalotia et al., 2002]
  • DBXplorer [Agrawal et al., 2002]
  • DISCOVER [Hristidis et al., 2002]
  • LORE [Goldman et al., 1999]

56
XRANK Summary
  • Ranked keyword search over XML documents
  • Exploit hyperlinked and containment structure for
    ranking
  • Two-dimensional proximity
  • Query a mix of XML and HTML documents
  • Efficient index structures and query processing
    techniques
  • SIGMOD 2003 paper for more details

57
Extensions to XRANK
  • Other ranking functions (e.g., tf-idf)
  • Incremental updates of inverted lists
  • Normalized XML documents
  • Integration with structured query processing

58
Quark Project @ Cornell
[Figure: quadrant diagram with axes Data (unstructured to structured) and Queries (ranked keyword search to complex and structured); labeled regions: Information Retrieval Systems and (Relational) Database Systems]
59
Questions?