Flexible and Efficient XML Search with Complex Full-Text Predicates

Learn more at: https://cseweb.ucsd.edu
1
Flexible and Efficient XML Search with Complex Full-Text Predicates
Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research
Emiran Curtmola - University of California San Diego
Alin Deutsch - University of California San Diego
2
Introduction
  • Need for complex full-text predicates beyond
    simple keyword search
  • Library of Congress (LoC)
  • Biomedical data
  • ACM, IEEE publications
  • INEX data collection
  • Wikipedia XML data set

3
XML real fragment from LoC: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
[Figure: tree of the bill document. Element names: bill, congress-info, legis-session, legis, nbr, legis-desc, sponsors, legis-body, action, action-desc, committee-name]
Text fragments shown in the tree:
  • 109th
  • HR2739
  • House of Representatives Current chamber on workforce and services. Committees on education are headed by Jefferson
  • Congress on education and workforce, comments to appropriate services.
  • Mr Column and co-sponsors Mrs Miller and Mrs Jones. Others include Jefferson
  • Jefferson and services
  • on May 2, 2004 Joe Jefferson
  • introduced the following bill. The bill was reintroduced later and was referred to the committee on education and workforce sponsored by Joe Jefferson
4
Query with complex FT predicates
  • Document fragments (nodes) that
  • contain the keywords
  • Jefferson and education
  • and satisfy the predicates
  • within a window of 10 words,
  • with Jefferson ordered before education
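A minimal sketch of how the two predicates of this query could be checked on a single match, where a match holds one word offset per keyword. The offsets below are illustrative values, and the inclusive window count and all function names are assumptions rather than definitions from the talk.

```python
def within_window(match, size):
    """True if all positions of the match fit inside a window of `size` words."""
    return max(match) - min(match) + 1 <= size


def ordered(match):
    """True if the positions follow the keyword order of the pattern."""
    return all(a < b for a, b in zip(match, match[1:]))


def select(matches, size):
    """Keep the matches that are both ordered and within the window."""
    return [m for m in matches if ordered(m) and within_window(m, size)]


# Matches for the pattern (Jefferson, education): (offset of Jefferson, offset of education).
matches = [(22, 3), (28, 45), (51, 45), (54, 67), (72, 67)]

print(select(matches, 10))  # no ordered match is tight enough for 10 words
print(select(matches, 20))  # a wider window lets the ordered matches through
```

With these offsets the 10-word query returns nothing, while a 20-word window keeps the two matches in which Jefferson precedes education.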

5
Example LoC document
6
Example LoC document
  • Return document fragments
  • Naive solution: test the query at each node → redundant
  • Need for efficient evaluation of full-text predicates
  • use structural relationship between nodes
  • avoid redundant computation
7
Existing languages
  • Many XML full-text search languages
  • expressive power, semantics, scores [BAS-06]
  • XQFT-class:
  • W3C's XQuery Full-Text (XQFT), NEXI, XIRQL,
    JuruXML, XSearch, XRank, XKSearch, Schema-Free
    XQuery
  • Efficient query evaluation limited to
  • Conjunctive keyword search (no predicates)
  • Full-text predicates in isolation
  • Need for a universal optimization framework
  • Guarantee the universality of the solution

8
Contributions
  • Formal semantics for XQFT-class
  • Unified framework
  • Capture the family of tf-idf scoring methods
  • Structure-aware algorithms to efficiently
    evaluate XQFT-class languages
  • XFT full-text algebra
  • Enable new optimizations inspired by relational
    rewritings

9
Talk Outline
  • Motivation & Contributions
  • Formalization of XML full-text search
  • Efficient evaluation
  • Experiments
  • Conclusion

10
Formalization design goals
  • Capture existing full-text languages
  • Language semantics in terms of
  • keyword patterns
  • pattern matches
  • predicates evaluated through matches
  • Manipulate tuples
  • enable relational query evaluation and rewritings

11
Formalization: patterns
  • Pattern: tuple of simultaneously matching
    keywords
  • Query expression:
  • Jefferson and education
  • within a window of 10 words,
  • with Jefferson ordered before education

Pattern: (Jefferson, education)
12
Formalization: patterns
  • Formalization specifies
  • patterns → conjunction of keywords
  • set of patterns → disjunction of keywords
  • exclusion patterns → negation of keywords
    (no matches in the document)

13
Formalization matches
Jefferson, education
(22, 3)

14
Formalization matches
Jefferson, education
(22, 3) (22, 45)

15
Formalization matches
Jefferson, education
(22, 3) (22, 45) (22, 67)

16
Formalization matches
Jefferson, education
(22, 3) (22, 45) (22, 67) (51, 3)
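The matches built up over the last four slides can be read as tuples that pick one occurrence of each keyword. A small sketch of that enumeration, assuming (for illustration only) that Jefferson occurs at offsets 22 and 51 and education at 3, 45 and 67; the function name is hypothetical.

```python
from itertools import product


def pattern_matches(positions, pattern):
    """All tuples holding one position per keyword, in the keyword order of the pattern."""
    return list(product(*(positions[keyword] for keyword in pattern)))


positions = {"Jefferson": [22, 51], "education": [3, 45, 67]}
found = pattern_matches(positions, ("Jefferson", "education"))
print(found)
```

The first four tuples are the ones listed on the slide; the enumeration then continues with the remaining pairs for offset 51.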

17
Formalization: matching tables
  • Matching table: a nested relation that records
  • for each node in the document
  • and each pattern in the query
  • the set of matches

18
Formalization matching tables
Node Pattern Matches
action Jefferson, education (28, 45) (51, 45)

19
XFT Algebra
  • Similar to relational algebra
  • Manipulate matching tables
  • Leverage relational query evaluation and
    optimization techniques
  • XFT operators
  • construct matching table Rk for each keyword k:
  • get(k)
  • manipulate matching tables:
  • R1 or R2
  • R1 and R2
  • R1 minus R2
  • σ_times(R), σ_ordered(R), σ_window(R),
    σ_distance(R)
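To make the operators concrete, here is a rough sketch of get, and, and an ordered selection over matching tables. It assumes a table can be modelled as a dict from node ID to a list of match tuples; the `INDEX` contents, the function names and this representation are illustrative choices, not the authors' implementation.

```python
from itertools import product

# Hypothetical inverted index: keyword -> {node ID: word offsets inside that node}.
INDEX = {
    "Jefferson": {"1.2": [28, 51], "1.3": [54, 72]},
    "education": {"1.2": [45], "1.3": [67]},
}


def get(keyword):
    """Matching table for one keyword: node -> list of 1-tuples."""
    return {node: [(p,) for p in offsets]
            for node, offsets in INDEX.get(keyword, {}).items()}


def and_(r1, r2):
    """Equi-join on the node; matches of both sides are concatenated pairwise."""
    return {node: [a + b for a, b in product(r1[node], r2[node])]
            for node in r1.keys() & r2.keys()}


def select(table, predicate):
    """Keep the matches that satisfy the predicate; drop nodes left without any."""
    kept = {node: [m for m in matches if predicate(m)]
            for node, matches in table.items()}
    return {node: matches for node, matches in kept.items() if matches}


def ordered(match):
    """Positions must follow the keyword order of the pattern."""
    return all(a < b for a, b in zip(match, match[1:]))


joined = and_(get("Jefferson"), get("education"))
result = select(joined, ordered)
print(result)
```

Because every operator takes tables and returns a table, expressions compose freely, which is what makes relational-style rewritings possible.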

20
XFT Algebra
  • Query: nodes that contain the keywords
  • Jefferson and education
  • within a window of 10 words,
  • with Jefferson ordered before education

Benefit: equivalent query rewritings
21
Talk Outline
  • Motivation & Contributions
  • Formalization of XML full-text search
  • Efficient evaluation
  • Experiments
  • Conclusion

22
Query evaluation: AllNodes
  • Straightforward implementation of the XFT algebra
  • Each node is considered separately
  • Each tuple is self-contained
  • Relational-style evaluation
  • Joins → equi-joins
  • Predicates → selections on set of matches

23
Example LoC document
[Figure: document tree annotated with hierarchical node IDs 1, 1.1, 1.1.1, 1.1.2, 1.1.3, 1.2, 1.2.2, 1.2.2.2, 1.3, 1.3.1, 1.3.1.2, 1.3.2]
24

Node Pattern Matches
1 Jefferson 22, 28, 51, 54, 72
1.1 Jefferson 22
1.1.3 Jefferson 22
1.2 Jefferson 28, 51
1.2.2 Jefferson 51
1.2.2.2 Jefferson 51
1.3 Jefferson 54, 72
1.3.1 Jefferson 54
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1 education 3, 45, 67
1.1 education 3
1.1.1 education 3
1.2 education 45
1.2.2 education 45
1.2.2.2 education 45
1.3 education 67
1.3.2 education 67
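The two tables above list every keyword position at each node that contains it. A short sketch of how such a table could be derived, assuming each occurrence is first recorded at the deepest node containing it and that node IDs are dotted paths from the root; the function names are hypothetical.

```python
from collections import defaultdict


def ancestors_or_self(node_id):
    """'1.2.2' -> ['1.2.2', '1.2', '1']."""
    parts = node_id.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def all_nodes_table(occurrences):
    """Copy each position to its node and to every ancestor of that node."""
    table = defaultdict(list)
    for node_id, position in occurrences:
        for node in ancestors_or_self(node_id):
            table[node].append(position)
    return {node: sorted(positions) for node, positions in table.items()}


# (deepest containing node, word offset), chosen to reproduce the tables above.
jefferson = [("1.1.3", 22), ("1.2", 28), ("1.2.2.2", 51), ("1.3.1.2", 54), ("1.3.2", 72)]
education = [("1.1.1", 3), ("1.2.2.2", 45), ("1.3.2", 67)]

jefferson_table = all_nodes_table(jefferson)
education_table = all_nodes_table(education)
for node in sorted(jefferson_table):
    print(node, "Jefferson", jefferson_table[node])
```

The copying to every ancestor is the redundancy that makes the AllNodes tables large: five occurrences of Jefferson become ten rows.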
25
Node Pattern Matches
1 Jefferson, education (22,45), (72,67)
1.1 Jefferson, education (22, 3)
1.2 Jefferson, education (28, 45), (51, 45)
1.2.2 Jefferson, education (51, 45)
1.2.2.2 Jefferson, education (51, 45)
1.3 Jefferson, education (54, 67), (72, 67)
1.3.2 Jefferson, education (72, 67)

26


Predicate operates one tuple at a time

27
Example LoC document
[Figure: document tree annotated with hierarchical node IDs 1, 1.1, 1.1.1, 1.1.2, 1.1.3, 1.2, 1.2.2, 1.2.2.2, 1.3, 1.3.1, 1.3.1.2, 1.3.2]
28
Query evaluation: SCU
  • AllNodes: straightforward algorithm
  • Reduce size of intermediate results
  • structural relationships between nodes
  • avoid redundant match representation
  • SCU: Smallest Containing Unit

29
Matching tables → SCU tables
Node Pattern Matches
1 Jefferson 22, 28, 51, 54, 72
1.1 Jefferson 22
1.1.3 Jefferson 22
1.2 Jefferson 28, 51
1.2.2 Jefferson 51
1.2.2.2 Jefferson 51
1.3 Jefferson 54, 72
1.3.1 Jefferson 54
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
→ captures same information
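The reduction from the first table to the second can be sketched as follows: a position stays at a node only if no descendant of that node also lists it. This assumes dotted node IDs and a dict-based table; the function name is hypothetical.

```python
def scu_table(matching_table):
    """Keep each position only at its smallest containing node."""
    scu = {}
    for node, positions in matching_table.items():
        prefix = node + "."  # the trailing dot keeps '1.1' from matching '1.10'
        below = {p
                 for other, other_positions in matching_table.items()
                 if other.startswith(prefix)
                 for p in other_positions}
        own = [p for p in positions if p not in below]
        if own:
            scu[node] = own
    return scu


jefferson_all_nodes = {
    "1": [22, 28, 51, 54, 72], "1.1": [22], "1.1.3": [22],
    "1.2": [28, 51], "1.2.2": [51], "1.2.2.2": [51],
    "1.3": [54, 72], "1.3.1": [54], "1.3.1.2": [54], "1.3.2": [72],
}
jefferson_scu = scu_table(jefferson_all_nodes)
print(jefferson_scu)
```

Node 1.2 keeps offset 28 because that occurrence sits directly in 1.2, while offset 51 moves down to 1.2.2.2.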
30
Node Pattern Matches
1.1.3 Jefferson 22
1.2.2.2 Jefferson 51
1.2 Jefferson 28
1.3.1.2 Jefferson 54
1.3.2 Jefferson 72
Node Pattern Matches
1.1.1 education 3
1.2.2.2 education 45
1.3.2 education 67
31
  • Equi-join does not work
  • Need to compute LCA

Node Pattern Matches
1.2.2.2 Jefferson, education (51, 45)
1.3.2 Jefferson, education (72, 67)

32
Node Pattern Matches
1.1 Jefferson, education (22, 3)
1.2.2.2 Jefferson, education (51, 45)
1.2 Jefferson, education (28, 45)
1.3.2 Jefferson, education (72, 67)
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)
1.1 is the LCA of 1.1.3 and 1.1.1
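A sketch of the LCA-based join on SCU tables, assuming dotted node IDs so that the LCA is the longest common prefix of ID components. The function names are hypothetical, and this version enumerates every pair of matches, so the root node collects all pairs whose keywords lie in different subtrees, (22, 45) among them.

```python
from collections import defaultdict


def lca(a, b):
    """Lowest common ancestor of two dotted node IDs."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)


def scu_join(r1, r2):
    """Pair every match of r1 with every match of r2 and store it at the LCA."""
    out = defaultdict(list)
    for n1, matches1 in r1.items():
        for n2, matches2 in r2.items():
            out[lca(n1, n2)].extend(a + b for a in matches1 for b in matches2)
    return dict(out)


jefferson = {"1.1.3": [(22,)], "1.2.2.2": [(51,)], "1.2": [(28,)],
             "1.3.1.2": [(54,)], "1.3.2": [(72,)]}
education = {"1.1.1": [(3,)], "1.2.2.2": [(45,)], "1.3.2": [(67,)]}

joined = scu_join(jefferson, education)
print(lca("1.1.3", "1.1.1"))
print(joined["1.2"], joined["1.3"])
```

An equi-join on node ID would only find the two nodes that hold both keywords directly (1.2.2.2 and 1.3.2); computing the LCA recovers the matches that span sibling subtrees.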

33
Result after the window selection (10 words):
Node Pattern Matches
EMPTY

Result after the ordered selection:
Node Pattern Matches
1.2 Jefferson, education (28, 45)
1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)

35
Node Pattern Matches

1.3 Jefferson, education (54, 67)
1 Jefferson, education (22, 45)

36
Node Pattern Matches

1.3 Jefferson, education (54, 67) (72, 67)
1 Jefferson, education (22, 45)

  • Postorder
  • Stack supports single scan

37
SCU summary
  • Equivalent to AllNodes
  • Structure-awareness reduces size of intermediate
    results
  • Increases computation cost
  • Compute LCAs of nodes
  • Match propagation
  • Stack-based techniques

38
Related work on LCA for XML
  • LCA for conjunctive keyword search
  • XRank [GSBS-03]
  • Schema-free XQuery [LYJ-04]
  • XKSearch [XP-05]
  • Shortcomings
  • No postprocessing, not compositional
  • Input in document order
  • Output postorder traversal
  • Support for complex predicates is not
    straightforward

39
Talk Outline
  • Motivation & Contributions
  • Formalization of XML full-text search
  • Efficient evaluation
  • Experiments
  • Conclusion

40
Experimental goals
  • AllNodes vs. SCU
  • AllNodes: redundant representation
  • SCU: smaller sizes, more computation
  • SCU Overhead
  • Stack
  • Match propagation
  • Benefit of Rewritings
  • Relational-style rewritings

41
Experimental setup
  • Centrino 1.8GHz with 1GB of RAM
  • XMark generated datasets
  • Size ranges from 50 MB to 300 MB

42
Experiments: AllNodes vs. SCU
  • q1 = get(See) and get(internationally) and
    get(description) and get(charges) and get(ship)

43
Experiments: SCU Overhead
  • Queries
  • q4 = σ_window>1(See, internationally,
    description, charges, ship) (q1)
  • q5 = σ_window>90000000(See, internationally,
    description, charges, ship) (q1)
  • Recall that
  • q1 = get(See) and get(internationally) and
    get(description) and get(charges) and get(ship)

44
Experiments: SCU Overhead
  • q4: always true → no match propagation, just the
    stack overhead
  • q5: always false → propagate all matches

45
Experiments Benefit of Rewritings
  • Queries
  • q2 sorderedE(See, internationally,
    description, charges, ship) (q1)
  • q3 push selections in q2
  • Recall that
  • q1 get(See) and get(internationally) and
  • get(description) and get(charges)
    and get(ship)
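The slides do not spell out how q3 pushes the selections, so the following is only one plausible reading: an ordered selection over a chain of joins can be applied after every join step, because a tuple is ordered exactly when each of its prefixes is ordered. The tables, node names and function names are made up for illustration.

```python
from itertools import product


def join(r1, r2):
    """Equi-join on the node; matches are concatenated pairwise."""
    return {node: [a + b for a, b in product(r1[node], r2[node])]
            for node in r1.keys() & r2.keys()}


def select_ordered(table):
    """Keep matches whose positions are strictly increasing; drop empty nodes."""
    kept = {node: [m for m in matches if all(x < y for x, y in zip(m, m[1:]))]
            for node, matches in table.items()}
    return {node: matches for node, matches in kept.items() if matches}


def select_last(tables):
    """q2-style plan: join everything, then apply the selection once."""
    acc = tables[0]
    for table in tables[1:]:
        acc = join(acc, table)
    return select_ordered(acc)


def select_pushed(tables):
    """q3-style plan (assumed): prune unordered prefixes after every join."""
    acc = tables[0]
    for table in tables[1:]:
        acc = select_ordered(join(acc, table))
    return acc


tables = [
    {"n1": [(1,), (5,)], "n2": [(9,)]},
    {"n1": [(3,), (7,)], "n2": [(2,)]},
    {"n1": [(4,), (8,)]},
]
late = select_last(tables)
early = select_pushed(tables)
print(late == early, early)
```

Both plans return the same table, but the pushed version discards unordered prefixes such as (5, 3) before they are multiplied by later joins.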

46
Experiments: Benefit of Rewritings
  • 40% improvement for relational-like query
    rewritings

47
Conclusion
  • A unified logical framework for XML full-text
    search languages
  • Algebra admits
  • Efficient algorithms for operator evaluation
  • Rewritings of queries into more efficient forms
  • Facilitate XML joint optimizations of queries on
    both structure and text search
  • Future work
  • Score-aware logical framework

48
Thank you!