Title: XML Query processing
1XML Query processing
- Vasilis Vassalos ATHENS UNIVERSITY OF ECONOMICS
AND BUSINESS - COMPUTER SCIENCE DEPARTMENT
2eXtensible Markup Language
Markup indicates structure and semantics
ltbibliographygt ltpaper ID "XML-rewriting"gt
ltauthorsgt ltauthorgtY.Papakonstantinoult/autho
rgt ltauthorgtV. Vassaloslt/authorgt
lt/authorsgt ltfullPaper source"rewrite"/gt
lttitlegtQuery Rewriting for semistructured
datalt/titlegt ltbooktitlegtSIGMOD
99lt/booktitlegt lt/papergt lt/bibliographygt
Decoupled from presentation
3Elements and their Content
Element Content
element name
ltbibliographygt ltpaper ID "XML-rewriting"gt
ltauthorsgt ltauthorgtY.Papakonstantinoult/auth
orgt ltauthorgtV. Vassaloslt/authorgt
lt/authorsgt ltfullPaper source"rewrite"/gt
lttitlegtQuery Rewriting for semistr.datalt/titlegt
ltbooktitlegtSIGMOD 99lt/booktitlegt
lt/papergt lt/bibliographygt
element
Empty Element
Character content
4Path Expressions
In the second chapter of the document zoo.xml
find the figures with caption Tree Frogs
doc(zoo.xml)//chapter2//figurecaptionTree
Frogs
book
chapter
appendix
part
chapter
chapter
chapter
part
section
chapter
paragraph
paragraph
figure
figure
caption
caption
Tree Frogs
Just Frogs
5More Path Expressions
Find the first immediate chapter subelements of
immediate part subelements of the document
zoo.xml and retrieve figures that have
doc(zoo.xml)/part/chapter1//figurecaptionTr
ee Frogs
book
chapter
appendix
part
chapter
chapter
chapter
part
section
chapter
paragraph
paragraph
figure
figure
caption
caption
Tree Frogs
Just Frogs
6Element Construction
In the second chapter of the document zoo.xml
find the figures with caption Tree Frogs and
place them into an element called result
ltresultgt doc(zoo.xml)//chapter2//figurecapti
onTree Frogs lt/resultgt
result
figure
caption
Tree Frogs
7Bibliography Example Data Set
ltbibgt ltbook ISBN12345gt ltauthorgt Aho
lt/authorgt ltauthorgt Hopcroft lt/authorgt
ltauthorgt Ullman lt/authorgt lttitlegt Automata
Theory lt/titlegt ltpublishergt Morgan Kaufmann
lt/publishergt ltyeargt 1998 gt/yeargt lt/bookgt
ltbookgt ltauthorgt Ullman lt/authorgt lttitlegt
Database Systems lt/titlegt ltpublishergt Morgan
Kaufmann lt/publishergt ltyeargt 1998 gt/yeargt
lt/bookgt ltbookgt ltauthorgt Berners-Lee lt/authorgt
ltauthorgt Robie lt/authorgt ltauthorgt Bosak
lt/authorgt lttitlegt Semantic Web lt/titlegt
ltpublishergt Prentice Hall lt/publishergt ltyeargt
1998 lt/yeargt lt/bookgt lt/bibgt
8Reviews Example Data Set
ltreviewsgt ltreviewgt lttitlegt Automata Theory
lt/titlegt ltcommentgt Its the best in automata
theory lt/commentgt ltcommentgt A definitive
textbook lt/commentgt lt/reviewgt lt/reviewsgt
9For-Let-Where-Return (FLWR)
List the titles of books published by Morgan
Kaufmann
FOR b in doc(bib.xml)//book WHERE b/publisher
Morgan Kaufmann RETURN b/title
bib
book
book
book
book
title
title
title
year
year
year
publisher
publisher
publisher
1998
1998
1998
Morgan Kaufmann
Morgan Kaufmann
Prentice Hall
10(Tuples of) Variable Bindings
bib
book
book
book
title
title
title
year
year
year
publisher
publisher
publisher
1998
1998
1998
Morgan Kaufmann
Morgan Kaufmann
Prentice Hall
b book book book
FOR/LET
Ordered lists of tuples of variable bindings
WHERE
b book book
Tuples of that satisfy the conditions
RETURN
title
title
List of trees
year
11 FOR b in doc(bib.xml)//book WHERE b/year gt
1990 RETURN b/author
Return the list of authors who published after
1990
12Tuples
List publishers who have published more than 1
book
FOR p in distinct(doc(bib.xml)//publisher) LET
b document(bib.xml)//bookpublisher
p WHERE count(b) gt 1 RETURN p
Tuples (p, b) are formulated
13Boolean Expressions in WHERE
List the titles of books published by Morgan
Kaufmann in 1998
FOR b in doc(bib.xml)//book WHERE b/publisher
Morgan Kaufmann AND b/year 1998 RETURN
b/title
14Joins
For every book with a matching review output a
book_with_review that contains all the
attributes and subelements of book and the
comment subelements of review
FOR b in doc(bib.xml)/book, r in
doc(review.xml)/review WHERE b/title
r/title RETURN ltbook_with_reviewgt b/_at_,
b/, r/comment lt/book_with_reviewgt
ltbook_with_review ISBN1234gt ltauthorgt Aho
lt/authorgt ltauthorgt Hopcroft lt/authorgt
ltauthorgt Ullman lt/authorgt lttitlegt Automata
Theory lt/titlegt ltpublishergt Morgan Kaufmann
lt/publishergt ltyeargt 1998 gt/yeargt ltcommentgt
Its the best in automata theory lt/commentgt
ltcommentgt A definitive textbook lt/commentgt
lt/book_with_reviewgt
15Relax Order Conditions
List the titles of books published by Morgan
Kaufmann in 1998
FOR b in unordered(doc(bib.xml)//book) WHERE
b/publisher Morgan Kaufmann AND b/year
1998 RETURN b/title
Very important feature in dealing with relational
sources and other set-oriented sources.
SELECT title FROM bib WHERE publisher Morgan
Kaufmann AND year 1998
Depending on the indices and access methods used,
the SQL query processor may deliver the tuples in
different order
16Nested queries
FOR a IN distinct(document(bib.xml)//author) RE
TURN ltauthorgt ltnamegt a/text() lt/namegt
FOR b IN document(bib.xml)//bookauthor
a RETURN b/title lt/authorgt
Invert the structure of the input document so
that there is a list of author elements
containing the name of the author and the list of
books he wrote
17Conditionals
FOR b IN doc(bib.xml)/book RETURN ltshortgt
b/title ltauthorgt IF count(b/author) lt
3 b/author ELSE
b/author1, ltauthorgtand otherslt/authorgt
lt/shortgt
18Existential and Universal Quantification
Return books where at least one of the authors is
Ullman
FOR b in doc(bib.xml)/book WHERE b/author
Ullman RETURN b
Return books where all authors are Ullman
FOR b in doc(bib.xml)/book WHERE EVERY author
IN b/author SATISFIES author
Ullman RETURN b
19Functions
DEFINE FUNCTION depth(e) RETURNS xsdinteger
IF (empty(e/) THEN 1 ELSE max(depth(e/) 1
FOR b in doc(bib.xml)/book RETURN depth(b)
FOR a in doc(bib.xml)/book/author RETURN
ltallcapsnamegt allcaps(a) lt/allcapsnamegt
20XML What Makes It Hard?
- Its not normalized
- It conceptually centers around some origin,
meaning that navigation becomes central - Contrast with E-R diagrams
- How to store the hierarchy?
- Complex navigation
- Updates, locking
- Optimization
- Also, its ordered
- May restrict order of evaluation (or at least
presentation) - Makes updates more complex
- Many of these issues arent unique to XML
- Semistructured databases, esp. with ordered
collections, were similar
21XML Whats It Good For?
- Integration of heterogeneous data
- More on this later
- Collections of text documents, e.g., the Web, doc
DBs - How would we want to query those?
- IR/text queries, path queries, XQueries?
- Interchanging data
- SOAP messages, RSS, XML streams
- Perhaps subsets of data from RDBMSs
- Storing native, database-like XML data
- Caching
- Logging of XML messages
- ?
22Lots of XML Research Out There
- Integration
- Specialized query processing techniques
- Integration specification tools
- Text
- Hybrids of database and IR techniques for search
- (e.g., Amer-Yahia Shanmugasundaram, Weikum
Ramakrishnan, ) - Interchange
- Web service verification
- XML stream processing
- XML databases
- Natix, TIMBER,
- Tamino, DB2 UDB, Oracle,
23The Main Focal Points
- XML Query processing
- XML Query Algebra
- XPath processing techniques
- XQuery optimization
- Streaming XML/XML for integration
- RDBMS ? XML export
- Partitioning of computation between source and
mediator - Streaming XPath engines
- XML databases
- XML to relational issues
- Hierarchical storage locking (Natix, TIMBER,
BerkeleyDB, Tamino, ) - Query optimization
- XML with documents
- Inverted indices
- Integration of ranking into DBMS
- Interaction between structure and content
24XML as a Wire Format
- RDBMS ? XML export
- SilkRoute and Xperanto, outer unions
- Interaction with RDBMS optimization techniques
- Updates Tatarinov01
- Cascading updates are already possible in RDBMSs
- Updating XML views
- Streaming XML
- SAX-based XPath-matching engines
Ives01AltinelFranklin00Green02
DiaoFranklinChen - Push-down of XPath matching as early as possible
- Query decomposition (still in need of a standard
means of pushing XQuery to a source) - Subsets of XQuery that are amenable to streaming
25XML in a Database
- Use a legacy RDBMS
- Shredding Shanmugasundaram99 and many others
- Path-based encodings Cooper01
- Region-based encodings Bruno02Chen04
- Translation to SQL
- Order preservation in updates Tatarinov02,
- Whats novel here? How does this relate to
materialized views and warehousing? - Native XML databases
- Hierarchical storage (Natix, TIMBER, BerkeleyDB,
Tamino, ) - Updates and locking
- Query optimization
26Query Processing for XML
- Why is optimization harder?
- Hierarchy means many more joins (conceptually)
- traverse, tree-match, x-scan, unnest,
path, op - Though typically parent-child relationships
- Often dont have good measure of fan-out
- More ways of optimizing this
- Order preservation limits processing in many ways
- Nested content left outer join
- Except that we need to cluster a collection with
the parent - Relationship with NF2 approach
- Tags (dont really add much complexity except in
trying to encode efficiently) - Complex functions and recursion
- Few real DB systems implement these fully
- New operators (staircase join, stack-based joins)
- Why is storage harder?
- Thats the focus of Natix, really
27The Natix System
- In contrast to many pieces of work on XML,
focuses on the bottom layers - Physical layout
- Indexing
- Locking/concurrency control
- Logging/recovery
28Physical Layout
- What are our options in storing XML trees?
- At some level, its all smoke-and-mirrors
- Need to map to flat byte sequences on disk
- But several options
- Shred completely, as in many RDBMS mappings
- Each path may get its own contiguous set of pages
- e.g., vectorized XML Buneman et al.
- An element
- With path index instead of foreign keys
Georgiadis - An element may get its 11 children
- e.g., shared inlining Shanmugasundaram and
Chen - All content may be in one table
- e.g., Florescu/Kossmann and most interval
encoded XML - We may embed a few items on the same page and
overflow the rest - How collections are often stored in ORDBMS
- We may try to cluster XML trees on the same page,
as interpreted BLOBs - This is Natixs approach (and also IBMs DB2)
- Pros and cons of these approaches?
29Challenges of the Page-per-Tree Approach
- How big of a tree?
- What happens if the XML overflows the tree?
- Natix claims an adaptive approach to choosing the
trees granularity - Primarily based on balancing the tree,
constraints on children that must appear with a
parent - Natix uses a B Tree-like scheme for achieving
balance and splitting a tree across pages
30Example
Split point in parent page
Note proxy nodes
31That Was Simple But What about Updates?
- Clearly, insertions and deletions can affect
things - Deletion may ultimately require us to rebalance
- Ditto with insertion
- But insertion also may make us run out of space
what to do? - Their approach add another page ultimately may
need to split at multiple levels, as in B Tree - Others have studied this problem and used integer
encoding schemes (plus B Trees) for the order
32Does this Help?
- The Natix experiments in this paper were pretty
weak, so we wont look at them in more detail - The IBM guys say their approach, which is
similar, works significantly better than Oracles
shredded approach - Experiments by our group dont support this
thesis.
33Theres More to Updates than the Pages
- What about concurrency control and recovery?
- We already have a notion of hierarchical locks,
but they claim - If we want to support IDREF traversal, and
indexing directly to nodes, we need more - Whats the idea behind SPP locking?
34Logging
- They claim ARIES needs some modifications why?
- Their changes
- Need to make subtree updates more efficient
dont want to write a log entry for each subtree
insertion - Use (a copy of) the page itself as a means of
tracking what was inserted, then batch-apply to
WAL - Annihilators if we undo a tree creation, then
we probably dont need to worry about undoing
later changes to that tree - A few minor tweaks to minimize undo/redo when
only one transaction touches a page
35Annihilators
36Assessment
- Native XML storage isnt really all that
different from other means of storage - There are probably some good reasons to make a
few tweaks in locking - Optimization stays harder
- Using smart techniques for XPath to SQL
translation (ie, support for XML view creation)
would probably make RDBMSs come close to
delivering the same performance, modulo locking
37Text-Based XML
- The fundamental questions
- How should we model ranking in query processing?
- Simply as another value (e.g., Amer-Yahia
Shanmugasundaram) - Using a probabilistic model or as an undefined
metric - e.g., Weikum and Ramakrishnan work-in-progress
- How does structure affect ranking?
- PageRank-style (e.g., Shanmugasundaram et al.)
- Query relaxation (FleXPath)
- Other?
- How do we achieve efficient pruning?
- A search Cohen 98
- Fagins Threshold Algorithm
- Custom logic?
- How do we integrate keyword indexing with
structural indexing? - Multiple indices (e.g., Lore, Natix, )
- Integrated indices (e.g., ViST)
38Questions
- Where are the main challenges of XML processing
at this point? - Are we working on the right problems? Whats XML
going to be used for, anyway?
39XQuery Queries Views for XML
ltcustomersgt for cust in document(db)/custom
er return ltcustomergt cust/id,
for order in document(db)/order where
order/cid cust/id return ltordergt
order/id lt/ordergt lt/customergt
lt/customersgt
40Multiple Plans are Possible
- Retrieve customers
- For each customer find matching orders
41XML query processing
- Native
- Algebra-based
- Holistic
- Storage issues
- Based on Relational storage and QP engines
- XML to Relational mapping issues
- Off the shelf Query processor vs additional
operators
42An Algebra-Based Query Processor Architecture
Client
XQuery
Navigation Requests
Results
XQuery Views
Translation to Algebra
Algebra Plan
Source Schemas Types
Source Description
Rewriter/Optimizer
Physical Algebra Plan
Functions
Plan Execution Engine
Function Description
43Query Processing on Tuple-Oriented Algebra
Enables
- Well-known efficient physical implementations of
the operators - Join optimization
- Nested queries by nested plans or group-by
- Efficient iterator model
44XQuery Queries Views for XML
ltcustomersgt for cust in document(db)/custom
er return ltcustomergt cust/id,
for order in document(db)/order where
order/cid cust/id return ltordergt
order/id lt/ordergt lt/customergt
lt/customersgt
45Access and Navigation
getD cust, id ? cust_id
db
customer_table customer name
John id 56 customer
name George id 58
getD db1, customer ? cust
source db, db1
46Simplification Using Schema Inference
Since cust_id ? cust and cust is useless
otherwise
db
customer_table customer name
John id 56 customer
name George id 58
getD db1, customer/id ? cust_id
i1
i2
source db, db1
47Nested Plans
ct i2 o21
apply part, p ? orders
for part
48Joins and Selections
cust_id
db1 cust_id db2 order cust_id2
order_id
?cust_id2?
db2 order cust_id2 order_id
getD order, id ? order_id
getD order, cid ? cust_id2
getD db2, order ? order
nestedSrc part
source db, db2
49Constructors
listify oidE ? orders
o2
crEl order, oidL ? oidE
o1
crList orders ? oidL
order_id o1
o2
50Algebra Example
51Plan
52Replacing Nested Joins with GroupBy/Outerjoin
Combinations
apply part, p ? R
apply part, p ? R
p3
p3
nestedSrc part
groupBy S(p1) ? part
p2
nestedSrc part
for part
p1
p1
p2
53Multiple Possible Plans
54Optimizations Performed during Rewriting
- Operations pushed to sources
- selections, projections, joins, sorts
- Join order selection
- Join type selection
- sort/merge, parameter passing, nested loops
- Nested plan elimination
- nested plans converted to outer joins with
group-by - Research topic
- Constant folding
- constant expressions evaluated at plan
compilation time - Multi-threaded evaluation of web services
- Unused element elimination in views
- Condition push-down through views
- Research topic
55Indexing
- In order to do efficient query processing,
indexes are used - Reminder An index is a structure that points
directly to nodes satisfying a given constraint - More indexes usually allow query processing to be
more efficient, but also take up more space
(time/space tradeoff)
56Basic Indexes
- At minimum, the following indexes are usually
stored - Value indexes for each value appearing in the
tree there is a list of nodes containing the
value - Element indexes for each element name appearing
in the tree, there is a list of nodes with the
corresponding element - Sometimes also structure indexes for certain
XPath expressions, there is a list of nodes that
satisfy the expression
57Example Value Indexes
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
58Example Element Indexes
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
59Example Structure Indexes
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
60Query Processing
- Suppose that we only have value indexes and
element indexes - How should we process the query //buy//exch ?
- Strategy 1 Find buy elements. Then traverse the
subtree of these elements to look for exch
elements - Strategy 2 Find exch elements. Then traverse the
ancestors of these elements to look for buy
elements - Which is a better strategy?
61//buy//exch Strategy 1
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
62//buy//exch Strategy 2
1
transaction
2
account
11
4
buy
sell
3
89-344
12
shares
5
shares
13
14
ticker
6
7
30
ticker
15
100
exch
8
exch
16
17
NYSE
GE
10
9
NYSE
WEBM
63Improving the Execution
- Instead of storing a running id for each element,
store triple (start, end, level) - Find buy elements
- Find exch elements
- Merge these two lists by finding exch elements
that are nested within buy elements - Level is used in case we are interested in
finding children, not descendents
64//buy//exch Improved
Start
End
Level
buy
(4,10,2)
Merge the 2 lists by finding descendent elements
exch
(15,17,4)
(8,9,4)
65Merging Lists
- What is the complexity of merging the lists?
- Is it enough to go through each list once?
- Assuming the lists are sorted by start?
- Example Suppose we want to find all pairs of a
and b such that b is a descendent of a
a
b
a
b
b
66Merging Lists Example
- Example Suppose we want to find all pairs of a
and b such that b is a descendent of a
a
(3,6,2)
(1,7,1)
(5,5,3)
b
(4,4,3)
(2,2,2)
67Merging Lists Example
- Example Suppose we want to find all pairs of a
and b such that b is a descendent of a
a
(3,6,2)
(1,7,1)
(5,5,3)
b
(4,4,3)
(2,2,2)
68Merging Lists Example
- We did extra work
- Need a method to find the correct place to start
in the b list
a
(3,6,2)
(1,7,1)
(5,5,3)
b
(4,4,3)
(2,2,2)
69Minimizing the Work
- Several algorithms have been defined to minimize
the amount of work required, by identifying
exactly where to restart - See
- Shu-Yao Chien, Zografoula Vagena, Donghui Zhang,
Vassilis J. Tsotras, Carlo Zaniolo, Efficient
Structural Joins on Indexed XML Documents
Proc.of VLDB 2002 - Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas,
Jingesh M. Patel, Divesh Srivastava, Yuqing Wu,
Structural Joins A Primitive for Efficient XML
Query Pattern Matching, ICDE 2002 - Nicolas Bruno, Nick Koudas, Divesh Srivastava,
Holistic Twig Joins Optimal XML Pattern
Matching, ACM SIGMOD 2002
70Tree Pattern Can Be Computed From Structural
Relationships
Descendent edge
Child edge
book
title
author
jane
XML
71Stack-Tree Algorithms Intuition
- A depth-first traversal of a tree can be
performed in linear time, using a stack as large
as the height of the tree. - An ancestor-descendant structural relationship is
manifested as the ancestor appearing higher on
the stack than the descendant. - Unfortunately, a depth-first traversal requires
going over all the tree.
72Stack-Tree Algorithms
- We present 2 algorithms
- Stack-Tree-Desc that returns the result ordered
by (desc-start, anc-start) - Stack-Tree-Anc that returns the result ordered by
(anc-start, desc-start) - Why is the ordering of the result of interest?
73Stack-Tree-Desc
- a Alist-gtfirst node d Dlist-gtfirst node
OutputList NULL - while (lists are not empty)
- if (a.startPos lt d.startPos) then e a else e
d - while (e.startPos gt stack.Top().endPos)
stack.Pop() - if (e a)
- stack.Push(a)
- a a-gtnextNode
- else
- for each a in stack do append (a, d) to
OutputList - d d-gtnextNode
-
74Stack-Tree-Desc //section//paragraph
article
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
75Stack-Tree-Desc //section//paragraph
article
Alist
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
76Stack-Tree-Desc //section//paragraph
article
Dlist
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
section
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
77Stack-Tree-Desc //section//paragraph
article
d7
a1
paragraph
section
d1
a2
d6
paragraph
paragraph
section
a3
d2
d5
paragraph
paragraph
section
d4
d3
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
78Stack-Tree-Desc //section//paragraph
Alist
a1
a2
a3
Dlist
d1
d4
d2
d5
d3
d6
article
d7
paragraph
a1
section
d1
d6
paragraph
paragraph
a2
section
d5
d2
paragraph
paragraph
a3
section
d4
d3
paragraph
paragraph
Bla,..Bla,..
Bla,..Bla,..
79Stack-Tree-Desc
a1 d1 a2 d2 a3 d3 d4 d5 d6 d7
d7
a1
d1
d6
a2
d2
d5
a3
d3
d4
a3
a2
Stack
a1
(a1,d3),(a2,d3),(a3,d3)
(a1,d1)
(a1,d2),(a2,d2)
Output
(a1,d4),(a2,d4),(a3,d4)
(a1,d5),(a2,d5)
(a1,d6)
80Analysis of Stack-Tree-Dec
- O(Alist Dlist OutputList) for
ancestor-descendant structural relationships. - Each Alist element is pushed once and popped
once, so stack operations take O(Alist). - The inner for loop outputs a new pair each
time, so its total time is O(OutputList).
81XML and XQuery for integration
- XML facilitates integration
- Richer data model
- Nesting, path queries, irregular structure
- Schema can be used for optimization
- As in relational
- XML is lingua franca for many heterogeneous
systems - Mediator query processor data model XML
- Underlying systems often relational
- Going back and forth XML to relational