XML Query processing

About This Presentation

Title:

XML Query processing

Description:

Though typically parent-child relationships. Often don't have good measure of 'fan-out' ... They claim ARIES needs some modifications why? Their changes: ... – PowerPoint PPT presentation

Number of Views:114

Avg rating:3.0/5.0

Slides: 81

Provided by: VV37

Category:

more less

Transcript and Presenter's Notes

Title: XML Query processing

1
XML Query processing

Vasilis Vassalos ATHENS UNIVERSITY OF ECONOMICS
AND BUSINESS
COMPUTER SCIENCE DEPARTMENT

2
eXtensible Markup Language
Markup indicates structure and semantics
ltbibliographygt ltpaper ID "XML-rewriting"gt
ltauthorsgt ltauthorgtY.Papakonstantinoult/autho
rgt ltauthorgtV. Vassaloslt/authorgt
lt/authorsgt ltfullPaper source"rewrite"/gt
lttitlegtQuery Rewriting for semistructured
datalt/titlegt ltbooktitlegtSIGMOD
99lt/booktitlegt lt/papergt lt/bibliographygt
Decoupled from presentation
3
Elements and their Content
Element Content
element name
ltbibliographygt ltpaper ID "XML-rewriting"gt
ltauthorsgt ltauthorgtY.Papakonstantinoult/auth
orgt ltauthorgtV. Vassaloslt/authorgt
lt/authorsgt ltfullPaper source"rewrite"/gt
lttitlegtQuery Rewriting for semistr.datalt/titlegt
ltbooktitlegtSIGMOD 99lt/booktitlegt
lt/papergt lt/bibliographygt
element
Empty Element
Character content
4
Path Expressions
In the second chapter of the document zoo.xml
find the figures with caption Tree Frogs
doc(zoo.xml)//chapter2//figurecaptionTree
Frogs
book
chapter
appendix
part
chapter
chapter
chapter
part
section
chapter
paragraph
paragraph
figure
figure
caption
caption
Tree Frogs
Just Frogs
5
More Path Expressions
Find the first immediate chapter subelements of
immediate part subelements of the document
zoo.xml and retrieve figures that have
doc(zoo.xml)/part/chapter1//figurecaptionTr
ee Frogs
book
chapter
appendix
part
chapter
chapter
chapter
part
section
chapter
paragraph
paragraph
figure
figure
caption
caption
Tree Frogs
Just Frogs
6
Element Construction
In the second chapter of the document zoo.xml
find the figures with caption Tree Frogs and
place them into an element called result
ltresultgt doc(zoo.xml)//chapter2//figurecapti
onTree Frogs lt/resultgt
result
figure
caption
Tree Frogs
7
Bibliography Example Data Set
ltbibgt ltbook ISBN12345gt ltauthorgt Aho
lt/authorgt ltauthorgt Hopcroft lt/authorgt
ltauthorgt Ullman lt/authorgt lttitlegt Automata
Theory lt/titlegt ltpublishergt Morgan Kaufmann
lt/publishergt ltyeargt 1998 gt/yeargt lt/bookgt
ltbookgt ltauthorgt Ullman lt/authorgt lttitlegt
Database Systems lt/titlegt ltpublishergt Morgan
Kaufmann lt/publishergt ltyeargt 1998 gt/yeargt
lt/bookgt ltbookgt ltauthorgt Berners-Lee lt/authorgt
ltauthorgt Robie lt/authorgt ltauthorgt Bosak
lt/authorgt lttitlegt Semantic Web lt/titlegt
ltpublishergt Prentice Hall lt/publishergt ltyeargt
1998 lt/yeargt lt/bookgt lt/bibgt
8
Reviews Example Data Set
ltreviewsgt ltreviewgt lttitlegt Automata Theory
lt/titlegt ltcommentgt Its the best in automata
theory lt/commentgt ltcommentgt A definitive
textbook lt/commentgt lt/reviewgt lt/reviewsgt
9
For-Let-Where-Return (FLWR)
List the titles of books published by Morgan
Kaufmann
FOR b in doc(bib.xml)//book WHERE b/publisher
Morgan Kaufmann RETURN b/title
bib
book
book
book
book
title
title
title
year
year
year
publisher
publisher
publisher
1998
1998
1998
Morgan Kaufmann
Morgan Kaufmann
Prentice Hall
10
(Tuples of) Variable Bindings
bib
book
book
book
title
title
title
year
year
year
publisher
publisher
publisher
1998
1998
1998
Morgan Kaufmann
Morgan Kaufmann
Prentice Hall
b book book book
FOR/LET
Ordered lists of tuples of variable bindings
WHERE
b book book
Tuples of that satisfy the conditions
RETURN
title
title
List of trees
year
11
FOR b in doc(bib.xml)//book WHERE b/year gt
1990 RETURN b/author
Return the list of authors who published after
1990
12
Tuples
List publishers who have published more than 1
book
FOR p in distinct(doc(bib.xml)//publisher) LET
b document(bib.xml)//bookpublisher
p WHERE count(b) gt 1 RETURN p
Tuples (p, b) are formulated
13
Boolean Expressions in WHERE
List the titles of books published by Morgan
Kaufmann in 1998
FOR b in doc(bib.xml)//book WHERE b/publisher
Morgan Kaufmann AND b/year 1998 RETURN
b/title
14
Joins
For every book with a matching review output a
book_with_review that contains all the
attributes and subelements of book and the
comment subelements of review
FOR b in doc(bib.xml)/book, r in
doc(review.xml)/review WHERE b/title
r/title RETURN ltbook_with_reviewgt b/_at_,
b/, r/comment lt/book_with_reviewgt
ltbook_with_review ISBN1234gt ltauthorgt Aho
lt/authorgt ltauthorgt Hopcroft lt/authorgt
ltauthorgt Ullman lt/authorgt lttitlegt Automata
Theory lt/titlegt ltpublishergt Morgan Kaufmann
lt/publishergt ltyeargt 1998 gt/yeargt ltcommentgt
Its the best in automata theory lt/commentgt
ltcommentgt A definitive textbook lt/commentgt
lt/book_with_reviewgt
15
Relax Order Conditions
List the titles of books published by Morgan
Kaufmann in 1998
FOR b in unordered(doc(bib.xml)//book) WHERE
b/publisher Morgan Kaufmann AND b/year
1998 RETURN b/title
Very important feature in dealing with relational
sources and other set-oriented sources.
SELECT title FROM bib WHERE publisher Morgan
Kaufmann AND year 1998
Depending on the indices and access methods used,
the SQL query processor may deliver the tuples in
different order
16
Nested queries
FOR a IN distinct(document(bib.xml)//author) RE
TURN ltauthorgt ltnamegt a/text() lt/namegt
FOR b IN document(bib.xml)//bookauthor
a RETURN b/title lt/authorgt
Invert the structure of the input document so
that there is a list of author elements
containing the name of the author and the list of
books he wrote
17
Conditionals
FOR b IN doc(bib.xml)/book RETURN ltshortgt
b/title ltauthorgt IF count(b/author) lt
3 b/author ELSE
b/author1, ltauthorgtand otherslt/authorgt
lt/shortgt
18
Existential and Universal Quantification
Return books where at least one of the authors is
Ullman
FOR b in doc(bib.xml)/book WHERE b/author
Ullman RETURN b
Return books where all authors are Ullman
FOR b in doc(bib.xml)/book WHERE EVERY author
IN b/author SATISFIES author
Ullman RETURN b
19
Functions
DEFINE FUNCTION depth(e) RETURNS xsdinteger
IF (empty(e/) THEN 1 ELSE max(depth(e/) 1
FOR b in doc(bib.xml)/book RETURN depth(b)
FOR a in doc(bib.xml)/book/author RETURN
ltallcapsnamegt allcaps(a) lt/allcapsnamegt
20
XML What Makes It Hard?

Its not normalized
It conceptually centers around some origin,
meaning that navigation becomes central
Contrast with E-R diagrams
How to store the hierarchy?
Complex navigation
Updates, locking
Optimization
Also, its ordered
May restrict order of evaluation (or at least
presentation)
Makes updates more complex
Many of these issues arent unique to XML
Semistructured databases, esp. with ordered
collections, were similar

21
XML Whats It Good For?

Integration of heterogeneous data
More on this later
Collections of text documents, e.g., the Web, doc
DBs
How would we want to query those?
IR/text queries, path queries, XQueries?
Interchanging data
SOAP messages, RSS, XML streams
Perhaps subsets of data from RDBMSs
Storing native, database-like XML data
Caching
Logging of XML messages
?

22
Lots of XML Research Out There

Integration
Specialized query processing techniques
Integration specification tools
Text
Hybrids of database and IR techniques for search
(e.g., Amer-Yahia Shanmugasundaram, Weikum
Ramakrishnan, )
Interchange
Web service verification
XML stream processing
XML databases
Natix, TIMBER,
Tamino, DB2 UDB, Oracle,

23
The Main Focal Points

XML Query processing
XML Query Algebra
XPath processing techniques
XQuery optimization
Streaming XML/XML for integration
RDBMS ? XML export
Partitioning of computation between source and
mediator
Streaming XPath engines
XML databases
XML to relational issues
Hierarchical storage locking (Natix, TIMBER,
BerkeleyDB, Tamino, )
Query optimization
XML with documents
Inverted indices
Integration of ranking into DBMS
Interaction between structure and content

24
XML as a Wire Format

RDBMS ? XML export
SilkRoute and Xperanto, outer unions
Interaction with RDBMS optimization techniques
Updates Tatarinov01
Cascading updates are already possible in RDBMSs
Updating XML views
Streaming XML
SAX-based XPath-matching engines
Ives01AltinelFranklin00Green02
DiaoFranklinChen
Push-down of XPath matching as early as possible
Query decomposition (still in need of a standard
means of pushing XQuery to a source)
Subsets of XQuery that are amenable to streaming

25
XML in a Database

Use a legacy RDBMS
Shredding Shanmugasundaram99 and many others
Path-based encodings Cooper01
Region-based encodings Bruno02Chen04
Translation to SQL
Order preservation in updates Tatarinov02,
Whats novel here? How does this relate to
materialized views and warehousing?
Native XML databases
Hierarchical storage (Natix, TIMBER, BerkeleyDB,
Tamino, )
Updates and locking
Query optimization

26
Query Processing for XML

Why is optimization harder?
Hierarchy means many more joins (conceptually)
traverse, tree-match, x-scan, unnest,
path, op
Though typically parent-child relationships
Often dont have good measure of fan-out
More ways of optimizing this
Order preservation limits processing in many ways
Nested content left outer join
Except that we need to cluster a collection with
the parent
Relationship with NF2 approach
Tags (dont really add much complexity except in
trying to encode efficiently)
Complex functions and recursion
Few real DB systems implement these fully
New operators (staircase join, stack-based joins)
Why is storage harder?
Thats the focus of Natix, really

27
The Natix System

In contrast to many pieces of work on XML,
focuses on the bottom layers
Physical layout
Indexing
Locking/concurrency control
Logging/recovery

28
Physical Layout

What are our options in storing XML trees?
At some level, its all smoke-and-mirrors
Need to map to flat byte sequences on disk
But several options
Shred completely, as in many RDBMS mappings
Each path may get its own contiguous set of pages
e.g., vectorized XML Buneman et al.
An element
With path index instead of foreign keys
Georgiadis
An element may get its 11 children
e.g., shared inlining Shanmugasundaram and
Chen
All content may be in one table
e.g., Florescu/Kossmann and most interval
encoded XML
We may embed a few items on the same page and
overflow the rest
How collections are often stored in ORDBMS
We may try to cluster XML trees on the same page,
as interpreted BLOBs
This is Natixs approach (and also IBMs DB2)
Pros and cons of these approaches?

29
Challenges of the Page-per-Tree Approach

How big of a tree?
What happens if the XML overflows the tree?
Natix claims an adaptive approach to choosing the
trees granularity
Primarily based on balancing the tree,
constraints on children that must appear with a
parent
Natix uses a B Tree-like scheme for achieving
balance and splitting a tree across pages

30
Example
Split point in parent page
Note proxy nodes
31
That Was Simple But What about Updates?

Clearly, insertions and deletions can affect
things
Deletion may ultimately require us to rebalance
Ditto with insertion
But insertion also may make us run out of space
what to do?
Their approach add another page ultimately may
need to split at multiple levels, as in B Tree
Others have studied this problem and used integer
encoding schemes (plus B Trees) for the order

32
Does this Help?

The Natix experiments in this paper were pretty
weak, so we wont look at them in more detail
The IBM guys say their approach, which is
similar, works significantly better than Oracles
shredded approach
Experiments by our group dont support this
thesis.

33
Theres More to Updates than the Pages

What about concurrency control and recovery?
We already have a notion of hierarchical locks,
but they claim
If we want to support IDREF traversal, and
indexing directly to nodes, we need more
Whats the idea behind SPP locking?

34
Logging

They claim ARIES needs some modifications why?
Their changes
Need to make subtree updates more efficient
dont want to write a log entry for each subtree
insertion
Use (a copy of) the page itself as a means of
tracking what was inserted, then batch-apply to
WAL
Annihilators if we undo a tree creation, then
we probably dont need to worry about undoing
later changes to that tree
A few minor tweaks to minimize undo/redo when
only one transaction touches a page

35
Annihilators
36
Assessment

Native XML storage isnt really all that
different from other means of storage
There are probably some good reasons to make a
few tweaks in locking
Optimization stays harder
Using smart techniques for XPath to SQL
translation (ie, support for XML view creation)
would probably make RDBMSs come close to
delivering the same performance, modulo locking

37
Text-Based XML

The fundamental questions
How should we model ranking in query processing?
Simply as another value (e.g., Amer-Yahia
Shanmugasundaram)
Using a probabilistic model or as an undefined
metric
e.g., Weikum and Ramakrishnan work-in-progress
How does structure affect ranking?
PageRank-style (e.g., Shanmugasundaram et al.)
Query relaxation (FleXPath)
Other?
How do we achieve efficient pruning?
A search Cohen 98
Fagins Threshold Algorithm
Custom logic?
How do we integrate keyword indexing with
structural indexing?
Multiple indices (e.g., Lore, Natix, )
Integrated indices (e.g., ViST)

38
Questions

Where are the main challenges of XML processing
at this point?
Are we working on the right problems? Whats XML
going to be used for, anyway?

39
XQuery Queries Views for XML
ltcustomersgt for cust in document(db)/custom
er return ltcustomergt cust/id,
for order in document(db)/order where
order/cid cust/id return ltordergt
order/id lt/ordergt lt/customergt
lt/customersgt
40
Multiple Plans are Possible

Retrieve customers
For each customer find matching orders

41
XML query processing

Native
Algebra-based
Holistic
Storage issues
Based on Relational storage and QP engines
XML to Relational mapping issues
Off the shelf Query processor vs additional
operators

42
An Algebra-Based Query Processor Architecture
Client
XQuery
Navigation Requests
Results
XQuery Views
Translation to Algebra
Algebra Plan
Source Schemas Types
Source Description
Rewriter/Optimizer
Physical Algebra Plan
Functions
Plan Execution Engine
Function Description
43
Query Processing on Tuple-Oriented Algebra
Enables

Well-known efficient physical implementations of
the operators
Join optimization
Nested queries by nested plans or group-by
Efficient iterator model

44
XQuery Queries Views for XML
ltcustomersgt for cust in document(db)/custom
er return ltcustomergt cust/id,
for order in document(db)/order where
order/cid cust/id return ltordergt
order/id lt/ordergt lt/customergt
lt/customersgt
45
Access and Navigation
getD cust, id ? cust_id
db
customer_table customer name
John id 56 customer
name George id 58
getD db1, customer ? cust
source db, db1
46
Simplification Using Schema Inference
Since cust_id ? cust and cust is useless
otherwise
db
customer_table customer name
John id 56 customer
name George id 58
getD db1, customer/id ? cust_id
i1
i2
source db, db1
47
Nested Plans
ct i2 o21
apply part, p ? orders
for part
48
Joins and Selections
cust_id
db1 cust_id db2 order cust_id2
order_id
?cust_id2?
db2 order cust_id2 order_id
getD order, id ? order_id
getD order, cid ? cust_id2
getD db2, order ? order
nestedSrc part
source db, db2
49
Constructors
listify oidE ? orders
o2
crEl order, oidL ? oidE
o1
crList orders ? oidL
order_id o1
o2
50
Algebra Example
51
Plan
52
Replacing Nested Joins with GroupBy/Outerjoin
Combinations
apply part, p ? R
apply part, p ? R
p3
p3
nestedSrc part
groupBy S(p1) ? part
p2
nestedSrc part
for part
p1
p1
p2
53
Multiple Possible Plans
54
Optimizations Performed during Rewriting

Operations pushed to sources
selections, projections, joins, sorts
Join order selection
Join type selection
sort/merge, parameter passing, nested loops
Nested plan elimination
nested plans converted to outer joins with
group-by
Research topic
Constant folding
constant expressions evaluated at plan
compilation time
Multi-threaded evaluation of web services
Unused element elimination in views
Condition push-down through views
Research topic

55
Indexing

In order to do efficient query processing,
indexes are used
Reminder An index is a structure that points
directly to nodes satisfying a given constraint
More indexes usually allow query processing to be
more efficient, but also take up more space
(time/space tradeoff)

56
Basic Indexes

At minimum, the following indexes are usually
stored
Value indexes for each value appearing in the
tree there is a list of nodes containing the
value
Element indexes for each element name appearing
in the tree, there is a list of nodes with the
corresponding element
Sometimes also structure indexes for certain
XPath expressions, there is a list of nodes that
satisfy the expression

57
Example Value Indexes
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
58
Example Element Indexes
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
59
Example Structure Indexes
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
60
Query Processing

Suppose that we only have value indexes and
element indexes
How should we process the query //buy//exch ?
Strategy 1 Find buy elements. Then traverse the
subtree of these elements to look for exch
elements
Strategy 2 Find exch elements. Then traverse the
ancestors of these elements to look for buy
elements
Which is a better strategy?

61
//buy//exch Strategy 1
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
62
//buy//exch Strategy 2
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
63
Improving the Execution

Instead of storing a running id for each element,
store triple (start, end, level)
Find buy elements
Find exch elements
Merge these two lists by finding exch elements
that are nested within buy elements
Level is used in case we are interested in
finding children, not descendents

64
//buy//exch Improved
Start
End
Level
buy
(4,10,2)
Merge the 2 lists by finding descendent elements
exch
(15,17,4)
(8,9,4)
65
Merging Lists

What is the complexity of merging the lists?
Is it enough to go through each list once?
Assuming the lists are sorted by start?
Example Suppose we want to find all pairs of a
and b such that b is a descendent of a

a
b
a
b
b
66
Merging Lists Example

Example Suppose we want to find all pairs of a
and b such that b is a descendent of a

a
(3,6,2)
(1,7,1)
(5,5,3)
b
(4,4,3)
(2,2,2)
67
Merging Lists Example

Example Suppose we want to find all pairs of a
and b such that b is a descendent of a

a
(3,6,2)
(1,7,1)
(5,5,3)
b
(4,4,3)
(2,2,2)
68
Merging Lists Example

We did extra work
Need a method to find the correct place to start
in the b list

a
(3,6,2)
(1,7,1)
(5,5,3)
b
(4,4,3)
(2,2,2)
69
Minimizing the Work

Several algorithms have been defined to minimize
the amount of work required, by identifying
exactly where to restart
See
Shu-Yao Chien, Zografoula Vagena, Donghui Zhang,
Vassilis J. Tsotras, Carlo Zaniolo, Efficient
Structural Joins on Indexed XML Documents
Proc.of VLDB 2002
Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas,
Jingesh M. Patel, Divesh Srivastava, Yuqing Wu,
Structural Joins A Primitive for Efficient XML
Query Pattern Matching, ICDE 2002
Nicolas Bruno, Nick Koudas, Divesh Srivastava,
Holistic Twig Joins Optimal XML Pattern
Matching, ACM SIGMOD 2002

70
Tree Pattern Can Be Computed From Structural
Relationships
Descendent edge
Child edge
book
title
author
jane
XML
71
Stack-Tree Algorithms Intuition

A depth-first traversal of a tree can be
performed in linear time, using a stack as large
as the height of the tree.
An ancestor-descendant structural relationship is
manifested as the ancestor appearing higher on
the stack than the descendant.
Unfortunately, a depth-first traversal requires
going over all the tree.

72
Stack-Tree Algorithms

We present 2 algorithms
Stack-Tree-Desc that returns the result ordered
by (desc-start, anc-start)
Stack-Tree-Anc that returns the result ordered by
(anc-start, desc-start)
Why is the ordering of the result of interest?

73
Stack-Tree-Desc

a Alist-gtfirst node d Dlist-gtfirst node
OutputList NULL
while (lists are not empty)
if (a.startPos lt d.startPos) then e a else e
d
while (e.startPos gt stack.Top().endPos)
stack.Pop()
if (e a)
stack.Push(a)
a a-gtnextNode
else
for each a in stack do append (a, d) to
OutputList
d d-gtnextNode

74
Stack-Tree-Desc //section//paragraph
article
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
75
Stack-Tree-Desc //section//paragraph
article
Alist
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
76
Stack-Tree-Desc //section//paragraph
article
Dlist
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
77
Stack-Tree-Desc //section//paragraph
article
d7
a1
paragraph
section
d1
a2
d6
paragraph
paragraph
section
a3
d2
d5
paragraph
paragraph
section
d4
d3
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
78
Stack-Tree-Desc //section//paragraph
Alist
a1
a2
a3
Dlist
d1
d4
d2
d5
d3
d6
article
d7
paragraph
a1
section
d1
d6
paragraph
paragraph
a2
section
d5
d2
paragraph
paragraph
a3
section
d4
d3
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
79
Stack-Tree-Desc
a1 d1 a2 d2 a3 d3 d4 d5 d6 d7
d7
a1
d1
d6
a2
d2
d5
a3
d3
d4
a3
a2
Stack
a1

(a1,d3),(a2,d3),(a3,d3)
(a1,d1)
(a1,d2),(a2,d2)
Output
(a1,d4),(a2,d4),(a3,d4)
(a1,d5),(a2,d5)
(a1,d6)
80
Analysis of Stack-Tree-Dec

O(Alist Dlist OutputList) for
ancestor-descendant structural relationships.
Each Alist element is pushed once and popped
once, so stack operations take O(Alist).
The inner for loop outputs a new pair each
time, so its total time is O(OutputList).

81
XML and XQuery for integration