Title: Flexible and Efficient XML Search with Complex Full-Text Predicates
1 Flexible and Efficient XML Search with Complex Full-Text Predicates
Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research
Emiran Curtmola - University of California San Diego
Alin Deutsch - University of California San Diego
2 Introduction
- Need for complex full-text predicates beyond simple keyword search
- Library of Congress (LoC)
- Biomedical data
- ACM, IEEE publications
- INEX data collection
- Wikipedia XML data set
3 XML real fragment from LoC: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
[Figure: tree view of the bill document]
Element labels: bill, legis-session, congress-info, legis, legis-desc, nbr, sponsors, legis-body, action, action-desc, committee-name
Text content shown in the figure:
- Congress on education and workforce, comments to appropriate services.
- Jefferson and services
- HR2739
- House of Representatives Current chamber on workforce and services. Committees on education are headed by Jefferson
- Mr Column and co-sponsors Mrs Miller and Mrs Jones. Others include Jefferson
- 109th
- on May 2, 2004 Joe Jefferson
- introduced the following bill. The bill was reintroduced later and was referred to the committee
- on education and workforce sponsored by Joe Jefferson
4 Query with complex FT predicates
- Document fragments (nodes) that
- contain the keywords
- Jefferson and education
- and satisfy the predicates
- within a window of 10 words,
- with Jefferson ordered before education
5 Example LoC document
6 Example LoC document
- Return document fragments
- Naive solution: test the query at each node → redundant
- Need for efficient evaluation of full-text predicates
- use structural relationships between nodes
- avoid redundant computation
7 Existing languages
- Many XML full-text search languages
- expressive power, semantics, scores [BAS-06]
- XQFT-class
- W3C's XQuery Full-Text (XQFT), NEXI, XIRQL, JuruXML, XSearch, XRank, XKSearch, Schema-Free XQuery
- Efficient query evaluation limited to
- Conjunctive keyword search (no predicates)
- Full-text predicates in isolation
- Need for a universal optimization framework
- Guarantee the universality of the solution
8 Contributions
- Formal semantics for XQFT-class
- Unified framework
- Capture family of tf*idf scoring methods
- Structure-aware algorithms to efficiently evaluate XQFT-class languages
- XFT full-text algebra
- Enable new optimizations inspired by relational rewritings
9 Talk Outline
- Motivation & Contributions
- Formalization of XML full-text search
- Efficient evaluation
- Experiments
- Conclusion
10 Formalization: design goals
- Capture existing full-text languages
- Language semantics in terms of
- keyword patterns
- pattern matches
- predicates evaluated through matches
- Manipulate tuples
- enable relational query evaluation and rewritings
11 Formalization: patterns
- Pattern: tuple of simultaneously matching keywords
- Query expression
- Jefferson and education
- within a window of 10 words,
- with Jefferson ordered before education
Pattern: (Jefferson, education)
12 Formalization: patterns
- Formalization specifies
- patterns → conjunction of keywords
- set of patterns → disjunction of keywords
- exclusion patterns → negation of keywords
- No matches in the document
13 Formalization: matches
Jefferson, education
(22, 3)
14 Formalization: matches
Jefferson, education
(22, 3), (22, 45)
15 Formalization: matches
Jefferson, education
(22, 3), (22, 45), (22, 67)
16 Formalization: matches
Jefferson, education
(22, 3), (22, 45), (22, 67), (51, 3)
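The matches accumulated on the slides above are instances of the pattern (Jefferson, education). A minimal Python sketch of how such matches could be enumerated, assuming a match is one word position per keyword and using the root-level positions that appear in the matching tables later in the talk. The name `enumerate_matches` is illustrative and not taken from the talk.

```python
from itertools import product

def enumerate_matches(positions_per_keyword):
    """Return every match: one word position per keyword, in pattern order."""
    return list(product(*positions_per_keyword))

# Word positions at the document root, as listed in the talk's matching tables.
jefferson = [22, 28, 51, 54, 72]
education = [3, 45, 67]

matches = enumerate_matches([jefferson, education])
print(len(matches))   # 5 * 3 = 15 matches for the pattern (Jefferson, education)
print(matches[:3])    # [(22, 3), (22, 45), (22, 67)]
```

Under this assumption the four matches shown on the slides, including (51, 3), are among the fifteen produced.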
17 Formalization: matching tables
- Matching table represents
- Nested relation
- Each node in the document
- Each pattern in the query
- Set of matches
18 Formalization: matching tables
Node Pattern Matches
action Jefferson, education (28, 45), (51, 45)
19 XFT Algebra
- Similar to relational algebra
- Manipulate matching tables
- Leverage relational query evaluation and optimization techniques
- XFT operators
- construct matching table Rk for each keyword k
- get(k)
- manipulate matching tables
- R1 or R2
- R1 and R2
- R1 minus R2
- σ_times(R), σ_ordered(R), σ_window(R), σ_distance(R)
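A few of the operators listed above can be sketched in Python to make the "matching tables in, matching tables out" idea concrete. This is an illustration under stated assumptions, not the talk's implementation: a matching table is modeled as a dict from node id to a list of match tuples, `INDEX` is a hypothetical inverted index holding two nodes of the running example, and the window predicate is assumed to mean `max - min + 1 <= n`. The `or`, `minus`, times, and distance operators are omitted.

```python
from itertools import product

# Hypothetical inverted index: keyword -> {node id -> word positions in that node}.
# Only two nodes of the running example are included to keep the sketch short.
INDEX = {
    "Jefferson": {"1.2": [28, 51], "1.3": [54, 72]},
    "education": {"1.2": [45], "1.3": [67]},
}

def get(keyword):
    """get(k): matching table for one keyword, as node -> list of 1-tuples."""
    return {node: [(p,) for p in positions]
            for node, positions in INDEX.get(keyword, {}).items()}

def and_(r1, r2):
    """R1 and R2: equi-join on node; matches are concatenated pairwise."""
    return {node: [m1 + m2 for m1, m2 in product(r1[node], r2[node])]
            for node in r1 if node in r2}

def select(r, predicate):
    """Generic selection: keep matches satisfying the predicate, drop empty nodes."""
    out = {}
    for node, matches in r.items():
        kept = [m for m in matches if predicate(m)]
        if kept:
            out[node] = kept
    return out

def sigma_ordered(r):
    """Keep matches whose positions appear in pattern order."""
    return select(r, lambda m: all(a < b for a, b in zip(m, m[1:])))

def sigma_window(r, n):
    """Keep matches spanning at most n words (assumed: max - min + 1 <= n)."""
    return select(r, lambda m: max(m) - min(m) + 1 <= n)

joined = and_(get("Jefferson"), get("education"))
print(joined)                    # {'1.2': [(28, 45), (51, 45)], '1.3': [(54, 67), (72, 67)]}
print(sigma_ordered(joined))     # {'1.2': [(28, 45)], '1.3': [(54, 67)]}
print(sigma_window(joined, 10))  # {'1.2': [(51, 45)], '1.3': [(72, 67)]}
```

Because each selection filters matches independently, applying `sigma_ordered` and `sigma_window` in either order gives the same table in this sketch, which is the kind of equivalence a rewriting can exploit.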
20 XFT Algebra
- Query: Nodes that contain the keywords
- Jefferson and education
- within a window of 10 words,
- with Jefferson ordered before education
Benefit: equivalent query rewritings
21 Talk Outline
- Motivation & Contributions
- Formalization of XML full-text search
- Efficient evaluation
- Experiments
- Conclusion
22 Query evaluation: AllNodes
- Straightforward implementation of the XFT algebra
- Each node is considered separately
- Each tuple is self-contained
- Relational-style evaluation
- Joins → equi-joins
- Predicates → selections on set of matches
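The AllNodes strategy described above can be sketched in a few lines of Python, using the per-keyword matching tables that the talk shows for the example document. The dict representation and the function names `equi_join` and `select_ordered` are assumptions made for illustration.

```python
# Per-keyword matching tables as shown in the talk: node id -> word positions.
JEFFERSON = {
    "1": [22, 28, 51, 54, 72], "1.1": [22], "1.1.3": [22],
    "1.2": [28, 51], "1.2.2": [51], "1.2.2.2": [51],
    "1.3": [54, 72], "1.3.1": [54], "1.3.1.2": [54], "1.3.2": [72],
}
EDUCATION = {
    "1": [3, 45, 67], "1.1": [3], "1.1.1": [3],
    "1.2": [45], "1.2.2": [45], "1.2.2.2": [45],
    "1.3": [67], "1.3.2": [67],
}

def equi_join(left, right):
    """Join on node id; each output tuple carries all matches for its node."""
    return {node: [(a, b) for a in left[node] for b in right[node]]
            for node in left if node in right}

def select_ordered(table):
    """Selection applied to one tuple (node) at a time: keep matches with a < b."""
    result = {}
    for node, matches in table.items():
        kept = [(a, b) for a, b in matches if a < b]
        if kept:
            result[node] = kept
    return result

joined = equi_join(JEFFERSON, EDUCATION)
print(joined["1.2"])                   # [(28, 45), (51, 45)]
print(len(joined["1"]))                # 15: the root repeats matches of its descendants
print(select_ordered(joined)["1.3"])   # [(54, 67)]
```

Every tuple is self-contained, so the join needs no knowledge of the tree, at the price of storing the same match at a node and at all of its ancestors.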
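23 Example LoC document
[Figure: document tree with Dewey node identifiers 1; 1.1 with children 1.1.1, 1.1.2, 1.1.3; 1.2 with descendants 1.2.2 and 1.2.2.2; 1.3 with children 1.3.1 (child 1.3.1.2) and 1.3.2]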
24 Node Pattern Matches
1 Jefferson 22, 28, 51, 54, 72
1.1 Jefferson 22
1.1.3 Jefferson 22
1.2 Jefferson 28, 51
1.2.2 Jefferson 51
1.2.2.2 Jefferson 51
1.3 Jefferson 54, 72
1.3.1 Jefferson 54
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1 education 3, 45, 67
1.1 education 3
1.1.1 education 3
1.2 education 45
1.2.2 education 45
1.2.2.2 education 45
1.3 education 67
1.3.2 education 67
25 Node Pattern Matches
1 Jefferson, education (22,45), (72,67)
1.1 Jefferson, education (22, 3)
1.2 Jefferson, education (28, 45), (51, 45)
1.2.2 Jefferson, education (51, 45)
1.2.2.2 Jefferson, education (51, 45)
1.3 Jefferson, education (54, 67), (72, 67)
1.3.2 Jefferson, education (72, 67)
Node Pattern Matches
1 Jefferson 22, 28, 51, 54, 72
1.1 Jefferson 22
1.1.3 Jefferson 22
1.2 Jefferson 28, 51
1.2.2 Jefferson 51
1.2.2.2 Jefferson 51
1.3 Jefferson 54, 72
1.3.1 Jefferson 54
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1 education 3, 45, 67
1.1 education 3
1.1.1 education 3
1.2 education 45
1.2.2 education 45
1.2.2.2 education 45
1.3 education 67
1.3.2 education 67
26 Node Pattern Matches
1 Jefferson, education (22,45), (72,67)
1.1 Jefferson, education (22, 3)
1.2 Jefferson, education (28, 45), (51, 45)
1.2.2 Jefferson, education (51, 45)
1.2.2.2 Jefferson, education (51, 45)
1.3 Jefferson, education (54, 67), (72, 67)
1.3.2 Jefferson, education (72, 67)
Predicate operates on one tuple at a time
Node Pattern Matches
1 Jefferson 22, 28, 51, 54, 72
1.1 Jefferson 22
1.1.3 Jefferson 22
1.2 Jefferson 28, 51
1.2.2 Jefferson 51
1.2.2.2 Jefferson 51
1.3 Jefferson 54, 72
1.3.1 Jefferson 54
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1 education 3, 45, 67
1.1 education 3
1.1.1 education 3
1.2 education 45
1.2.2 education 45
1.2.2.2 education 45
1.3 education 67
1.3.2 education 67
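27 Example LoC document
[Figure: document tree with Dewey node identifiers 1; 1.1 with children 1.1.1, 1.1.2, 1.1.3; 1.2 with descendants 1.2.2 and 1.2.2.2; 1.3 with children 1.3.1 (child 1.3.1.2) and 1.3.2]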
28 Query evaluation: SCU
- AllNodes: straightforward algorithm
- Reduce size of intermediate results
- structural relationships between nodes
- avoid redundant match representation
- SCU: Smallest Containing Unit
29 Matching tables → SCU tables
Node Pattern Matches
1 Jefferson 22, 28, 51, 54, 72
1.1 Jefferson 22
1.1.3 Jefferson 22
1.2 Jefferson 28, 51
1.2.2 Jefferson 51
1.2.2.2 Jefferson 51
1.3 Jefferson 54, 72
1.3.1 Jefferson 54
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
→ captures the same information
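The reduction from a matching table to an SCU table shown above can be sketched as follows: each word position is kept only at the deepest node that contains it. The sketch assumes Dewey-style node ids (depth is the number of dots) and document-wide unique word positions. The function name `to_scu` is illustrative.

```python
def depth(node_id):
    """Depth of a Dewey-style node id such as '1.2.2.2'."""
    return node_id.count(".")

def to_scu(table):
    """Keep each position only at the deepest node that contains it."""
    deepest = {}  # position -> deepest containing node seen so far
    for node, positions in table.items():
        for pos in positions:
            if pos not in deepest or depth(node) > depth(deepest[pos]):
                deepest[pos] = node
    scu = {}
    for pos, node in sorted(deepest.items()):
        scu.setdefault(node, []).append(pos)
    return scu

# AllNodes matching table for "Jefferson", as shown in the talk.
JEFFERSON = {
    "1": [22, 28, 51, 54, 72], "1.1": [22], "1.1.3": [22],
    "1.2": [28, 51], "1.2.2": [51], "1.2.2.2": [51],
    "1.3": [54, 72], "1.3.1": [54], "1.3.1.2": [54], "1.3.2": [72],
}

print(to_scu(JEFFERSON))
# {'1.1.3': [22], '1.2': [28], '1.2.2.2': [51], '1.3.1.2': [54], '1.3.2': [72]}
```

The sixteen (node, position) entries of the AllNodes table shrink to five, and the dropped entries can be recovered because every ancestor of a listed node also contains the position.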
30 Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.1 education 3
1.2.2.2 education 45
1.3.2 education 67
31
- Equi-join does not work on SCU tables
- Need to compute the LCA (lowest common ancestor)
Node Pattern Matches
1.2.2.2 Jefferson, education (51, 45)
1.3.2 Jefferson, education (72, 67)
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.1 education 3
1.2.2.2 education 45
1.3.2 education 67
32 Node Pattern Matches
1.1 Jefferson, education (22, 3)
1.2.2.2 Jefferson, education (51, 45)
1.2 Jefferson, education (28, 45)
1.3.2 Jefferson, education (72, 67)
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)
1.1 is the LCA of 1.1.3 and 1.1.1
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.1 education 3
1.2.2.2 education 45
1.3.2 education 67
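The LCA-based join on this slide can be sketched in Python. With Dewey-style ids, the LCA of two nodes is the longest common prefix of their id steps, so 1.1.3 and 1.1.1 meet at 1.1. The sketch assumes both tables come from the same document (a shared root "1"). The names `lca` and `scu_join` are illustrative, and the nested loops are a naive formulation rather than the single-scan algorithm of the talk.

```python
def lca(a, b):
    """Lowest common ancestor of two Dewey ids: longest common prefix of steps."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)

def scu_join(left, right):
    """Pair every left match with every right match and file it under the LCA."""
    out = {}
    for n1, positions1 in left.items():
        for n2, positions2 in right.items():
            node = lca(n1, n2)
            for p1 in positions1:
                for p2 in positions2:
                    out.setdefault(node, []).append((p1, p2))
    return out

# SCU tables for the two keywords, as shown in the talk.
JEFFERSON_SCU = {"1.1.3": [22], "1.2.2.2": [51], "1.2": [28],
                 "1.3.1.2": [54], "1.3.2": [72]}
EDUCATION_SCU = {"1.1.1": [3], "1.2.2.2": [45], "1.3.2": [67]}

print(lca("1.1.3", "1.1.1"))   # 1.1
joined = scu_join(JEFFERSON_SCU, EDUCATION_SCU)
print(joined["1.1"])           # [(22, 3)]
print(joined["1.2"])           # [(28, 45)]
print(joined["1.3"])           # [(54, 67)]
```

The slide lists (22, 45) for node 1. This sketch produces that pair together with the other pairs whose two positions meet only at the root.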
33 Node Pattern Matches
EMPTY !!! EMPTY !!! EMPTY !!!
Node Pattern Matches
1.2 Jefferson, education (28, 45)
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)
Node Pattern Matches
1.1 Jefferson, education (22, 3)
1.2.2.2 Jefferson, education (51, 45)
1.2 Jefferson, education (28, 45)
1.3.2 Jefferson, education (72, 67)
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.1 education 3
1.2.2.2 education 45
1.3.2 education 67
34 Node Pattern Matches
1.1 Jefferson, education (22, 3)
1.2.2.2 Jefferson, education (51, 45)
1.2 Jefferson, education (28, 45)
1.3.2 Jefferson, education (72, 67)
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.1 education 3
1.2.2.2 education 45
1.3.2 education 67
35 Node Pattern Matches
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)
Node Pattern Matches
1.1 Jefferson, education (22, 3)
1.2.2.2 Jefferson, education (51, 45)
1.2 Jefferson, education (28, 45)
1.3.2 Jefferson, education (72, 67)
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.1 education 3
1.2.2.2 education 45
1.3.2 education 67
36 Node Pattern Matches
1.3 Jefferson, education (54, 67) (72, 67)
1 Jefferson, education (22, 45)
Node Pattern Matches
1.1 Jefferson, education (22, 3)
1.2.2.2 Jefferson, education (51, 45)
1.2 Jefferson, education (28, 45)
1.3.2 Jefferson, education (72, 67)
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)
- Postorder
- Stack supports single scan
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.1 education 3
1.2.2.2 education 45
1.3.2 education 67
37 SCU summary
- Equivalent to AllNodes
- Structure-awareness reduces size of intermediate results
- Increases computation cost
- Compute LCAs of nodes
- Match propagation
- Stack-based techniques
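To illustrate why SCU is equivalent to AllNodes, the following Python sketch propagates each match from its smallest containing node to all ancestors, which recovers the AllNodes rows. It is a plain ancestor expansion under the assumption of Dewey-style ids, not the stack-based single-scan technique of the talk. The input holds the non-root rows of the joined SCU table shown earlier, and the names `ancestors_or_self` and `propagate` are illustrative.

```python
def ancestors_or_self(node_id):
    """'1.3.2' -> ['1.3.2', '1.3', '1'] for a Dewey-style id."""
    steps = node_id.split(".")
    return [".".join(steps[:i]) for i in range(len(steps), 0, -1)]

def propagate(scu_table):
    """Copy every match from its smallest containing node to all ancestors."""
    expanded = {}
    for node, matches in scu_table.items():
        for target in ancestors_or_self(node):
            expanded.setdefault(target, []).extend(matches)
    return {node: sorted(matches) for node, matches in expanded.items()}

# Non-root rows of the joined SCU table shown in the talk.
SCU_JOINED = {
    "1.1": [(22, 3)], "1.2.2.2": [(51, 45)], "1.2": [(28, 45)],
    "1.3.2": [(72, 67)], "1.3": [(54, 67)],
}

expanded = propagate(SCU_JOINED)
print(expanded["1.3"])     # [(54, 67), (72, 67)]
print(expanded["1.2"])     # [(28, 45), (51, 45)]
print(expanded["1.2.2"])   # [(51, 45)]
```

The expanded rows for 1.2, 1.2.2, and 1.3 agree with the AllNodes join result shown earlier in the talk. The root row here is incomplete only because the root's own SCU matches were left out of the input.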
38 Related work on LCA for XML
- LCA for conjunctive keyword search
- XRank [GSBS-03]
- Schema-free XQuery [LYJ-04]
- XKSearch [XP-05]
- Shortcomings
- No postprocessing, not compositional
- Input in document order
- Output postorder traversal
- Support for complex predicates is not straightforward
39 Talk Outline
- Motivation & Contributions
- Formalization of XML full-text search
- Efficient evaluation
- Experiments
- Conclusion
40 Experimental goals
- AllNodes vs. SCU
- AllNodes: redundant representation
- SCU: smaller sizes, more computation
- SCU Overhead
- Stack
- Match propagation
- Benefit of Rewritings
- Relational-style rewritings
41 Experimental setup
- Centrino 1.8 GHz with 1 GB of RAM
- XMark generated datasets
- Size ranges from 50 MB to 300 MB
42 Experiments: AllNodes vs. SCU
- q1: get(See) and get(internationally) and get(description) and get(charges) and get(ship)
43 Experiments: SCU Overhead
- Queries
- q4: σ_window>1(See, internationally, description, charges, ship) (q1)
- q5: σ_window>90000000(See, internationally, description, charges, ship) (q1)
- Recall that
- q1: get(See) and get(internationally) and get(description) and get(charges) and get(ship)
44 Experiments: SCU Overhead
- q4 always true → no match propagation, just the stack overhead
- q5 always false → propagate all matches
45 Experiments: Benefit of Rewritings
- Queries
- q2: σ_orderedE(See, internationally, description, charges, ship) (q1)
- q3: push selections in q2
- Recall that
- q1: get(See) and get(internationally) and get(description) and get(charges) and get(ship)
46 Experiments: Benefit of Rewritings
- 40% improvement for relational-like query rewritings
47 Conclusion
- A unified logical framework for XML full-text search languages
- Algebra admits
- Efficient algorithms for operator evaluation
- Rewritings of queries into more efficient forms
- Facilitates joint optimization of XML queries on both structure and text search
- Future work
- Score-aware logical framework
48 Thank you!