Title: An XML Query Engine for Network-Bound Data
1. An XML Query Engine for Network-Bound Data
- VLDB J. 11(4): 380-402 (2002)
2. What is it about?
- Tukwila: an XQuery processing system
- Evaluates XQuery expressions over streaming XML documents
- It (claims that it) can be used as a data integration system: an XQuery may involve different XML documents from different sources
3. Storing and querying XML
- How to implement an XPath or XQuery processor?
- It depends on how you store XML documents
  - shred XML documents into relations
  - store XML documents in an appropriate data model (e.g. B-trees)
  - store XML documents as raw files
- The first two options allow faster query evaluation (indexing, storing auxiliary information)
- With raw files, use a parser to read and query XML documents
4. Store XML docs as raw files
- Use a DOM XML parser and then evaluate the query
[Figure: the XML file (<?xml version="1.0"?> ...) is read from secondary storage and parsed into an in-memory DOM tree (store; booklist1, booklist2; book and magazine nodes). The query /store//book is evaluated on the tree and returns the book nodes as results.]
- Why not?
  - Must first parse the whole document
  - Must keep the whole DOM tree in memory
  - Bad for web applications, where the first results must be fetched quickly
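The DOM approach can be sketched in a few lines of Python. This is only an illustration, not the slide author's code: the document content and the function name `query_books_dom` are assumptions, and `xml.dom.minidom` from the standard library stands in for "a DOM XML parser".

```python
from xml.dom import minidom

# Hypothetical document that mirrors the store/booklist example.
DOC = """<?xml version="1.0"?>
<store>
  <booklist1><book>A</book><magazine>M1</magazine></booklist1>
  <booklist2><book>B</book><book>C</book><magazine>M2</magazine></booklist2>
</store>"""

def query_books_dom(xml_text):
    """Evaluate /store//book by first building the complete DOM tree."""
    # The whole document is parsed and held in memory before any result exists.
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    if root.tagName != "store":
        return []
    # getElementsByTagName visits all descendants in document order.
    return [book.firstChild.data for book in root.getElementsByTagName("book")]

print(query_books_dom(DOC))
```

No result is available until `parseString` has consumed the entire input, which is the drawback listed above.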
5. Store XML docs as raw files
- Or use a SAX XML parser
[Figure: a SAX parser reads the XML file (<?xml version="1.0"?> ...) from secondary storage and hands it to the query /store//book one element at a time ("Next Element"): store, booklist1, book, magazine, booklist2, ...]
- Why?
  - No need to load the entire document into memory
  - Handles the XML document as a stream
- But
  - It is more difficult to evaluate a query over a stream of XML data
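For comparison, here is a minimal streaming sketch of the same query /store//book with Python's standard `xml.sax` module. The class name `BookHandler`, the helper `stream_books`, the 16-byte chunk size and the sample document are assumptions made for illustration. The sketch also assumes that book elements are not nested inside each other.

```python
import xml.sax

DOC = (b'<?xml version="1.0"?>'
       b"<store><booklist1><book>A</book><magazine>M1</magazine></booklist1>"
       b"<booklist2><book>B</book><book>C</book></booklist2></store>")

class BookHandler(xml.sax.ContentHandler):
    """Reports /store//book matches as events arrive; no tree is built."""

    def __init__(self, on_match):
        super().__init__()
        self.on_match = on_match
        self.path = []     # tags of the currently open elements
        self.text = None   # text buffer, only set while inside a matching <book>

    def startElement(self, name, attrs):
        self.path.append(name)
        # Match: the root is <store> and a <book> opens anywhere below it.
        if name == "book" and len(self.path) > 1 and self.path[0] == "store":
            self.text = []

    def characters(self, content):
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "book" and self.text is not None:
            self.on_match("".join(self.text))   # result is emitted immediately
            self.text = None
        self.path.pop()

def stream_books(chunks):
    """Feed the document piece by piece, as it might arrive from a network."""
    results = []
    parser = xml.sax.make_parser()
    parser.setContentHandler(BookHandler(results.append))
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return results

chunks = [DOC[i:i + 16] for i in range(0, len(DOC), 16)]
print(stream_books(chunks))
```

The handler has to track the open-element path itself, which hints at why query evaluation over a stream is harder than over a tree.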
6. Store XML docs as raw files
[Figure: an XML source sends the document (<?xml version="1.0"?> ...) as a data stream over the network to a SAX parser, which passes it element by element ("Next Element") to the query /store//book.]
Despite that, streaming evaluation could be efficient for web applications, data integration, etc.
That's what the Tukwila system is based on!
7. Tukwila System - Architecture
[Figure: Tukwila architecture. Labeled components and flows: XQuery, XQuery optimizer, simple XPath expressions, data stream, streams of tuples, streams of partly tagged tuples, XML producer, XML trees, XML Tree Manager.]
8. The query optimizer
- Input: an XQuery expression
- Output: a query plan involving the available operators
- XScan operators
  - At the leaves of the plan tree
  - Read a stream of an XML document
  - Evaluate one or more simple XPath expressions over the stream
  - Output a pipelined relation
- Web-join operators for joining two XML documents
- Other operators for projecting, joining, sorting, XML construction, etc.
9. The query optimizer
10. The XScan operator
- Input: an XML data stream and several XPath expressions
- Output: a pipelined relation (streaming tuples)
- For each reference to a different XML document, an XScan operator is invoked. It evaluates several simple XPath expressions, derived from the XQuery, and pipelines a denormalized relation that has a column for each binding variable
11. The XScan operator
12. The XScan operator
- For each XPath expression in the for clause of the XQuery statement
  - a separate deterministic finite automaton is created
  - a separate relation is created, with columns corresponding to the binding variables
- The final state of each automaton corresponds to a new binding for the respective variable
13. The XScan operator
[Figure: the deterministic automata, the XML Tree Manager, and the relations for the binding variables b, a and d. At first there is one relation for each binding variable.]
14. The XScan operator
- As the stream is parsed by the SAX parser, the automata change states
- Once an automaton reaches a final state
  - a new node is bound to the respective binding variable
  - a new tuple is inserted into the respective relation, with the value of the binding variable in the respective column
- If the node is an attribute or a simple element, the value stored is the actual simple value of the node
- If the node is a complex element, the value stored is a reference to the element in the XML tree, which the XML Tree Manager maintains in text form
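The behaviour described above can be approximated with a small Python sketch. It is not Tukwila's implementation. The names `PathAutomaton` and `XScanSketch`, the three paths, and the `<details>` content are assumptions. Only child steps are supported and attributes are ignored. Because an element is only known to be simple or complex when it closes, the sketch inserts the tuple at the end tag rather than at the moment the final state is reached. The `tree` dictionary is a bare stand-in for the XML Tree Manager.

```python
import xml.sax

class PathAutomaton:
    """DFA for a child-only path such as /db/book; state i means i steps matched."""

    def __init__(self, var, path):
        self.var = var
        self.steps = path.strip("/").split("/")
        self.stack = [0]   # one state per open element; -1 is the dead state

    def open(self, tag):
        state = self.stack[-1]
        if 0 <= state < len(self.steps) and self.steps[state] == tag:
            nxt = state + 1
        else:
            nxt = -1
        self.stack.append(nxt)
        return nxt == len(self.steps)   # True when the final state is reached

    def close(self):
        self.stack.pop()                # back to the state before this element

class XScanSketch(xml.sax.ContentHandler):
    def __init__(self, paths):
        super().__init__()
        self.automata = [PathAutomaton(v, p) for v, p in paths.items()]
        self.relations = {v: [] for v in paths}   # one relation per variable
        self.tree = {}                            # reference -> element name
        self.open_nodes = []   # per open element: [bound vars, text, has children]
        self.next_ref = 1

    def startElement(self, name, attrs):
        if self.open_nodes:
            self.open_nodes[-1][2] = True         # the parent is a complex element
        bound = [a.var for a in self.automata if a.open(name)]
        self.open_nodes.append([bound, [], False])

    def characters(self, content):
        if self.open_nodes:
            self.open_nodes[-1][1].append(content)

    def endElement(self, name):
        bound, parts, has_children = self.open_nodes.pop()
        for automaton in self.automata:
            automaton.close()
        if not bound:
            return
        if has_children:                          # complex element: store a reference
            value = self.next_ref
            self.next_ref += 1
            self.tree[value] = name
        else:                                     # simple element: store its value
            value = "".join(parts).strip()
        for var in bound:
            self.relations[var].append((value,))

# Hypothetical input; the <details> content is a placeholder.
DOC = (b"<db><book><title>XQuery from the Experts</title><authors>"
       b"<author>Don Chamberlin</author><author>Michael Kay</author>"
       b"<author>Denise Draper</author></authors>"
       b"<details><note>placeholder</note></details></book></db>")

handler = XScanSketch({"b": "/db/book",
                       "a": "/db/book/authors/author",
                       "d": "/db/book/details"})
xml.sax.parseString(DOC, handler)
print(handler.relations)
```

The per-automaton stack is what lets each DFA return to its earlier state when an element closes.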
15. The XScan operator
[Figure, animation step: the automata read element db. The relations for b, a and d are still empty. XML Tree Manager contents: db]
16. The XScan operator
[Figure, animation step: the automata read element book. Relation b now holds the reference 1. The relations for a and d are empty. XML Tree Manager contents: db, book]
17. The XScan operator
[Figure, animation step: the automata read element authors. Relation b holds the reference 1. XML Tree Manager contents: db, book, title "XQuery from the Experts", authors]
18. The XScan operator
[Figure, animation step: inside element authors. Relation b holds the reference 1. Relation a holds "Don Chamberlin". XML Tree Manager contents: db, book, title "XQuery from the Experts", authors, author "Don Chamberlin"]
19. The XScan operator
[Figure, animation step: inside element authors. Relation b holds the reference 1. Relation a holds "Don Chamberlin", "Michael Kay" and "Denise Draper". XML Tree Manager contents: db, book, title "XQuery from the Experts", authors, author "Don Chamberlin", author "Michael Kay", author "Denise Draper"]
20. The XScan operator
[Figure, animation step: the automata read element details. Relation b holds the reference 1. Relation a holds "Don Chamberlin", "Michael Kay" and "Denise Draper". Relation d holds the reference 2.]
21. The XScan operator
- The output of the XScan operator is the join of the relations corresponding to the automata
- The join takes place gradually
- Whenever a new tuple is to be inserted into the root relation
  - all the current relations are joined
  - the result relation is pipelined
  - all the tuples in the current relations are deleted
  - the new tuple is inserted into the root relation
- and so on
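The gradual join can be sketched as follows. This is a simplified illustration, not the paper's algorithm. The class name `GradualJoin` is hypothetical. The sketch assumes that bindings arrive in document order with the root binding first, that all non-root bindings belong to the current root tuple (so the join is a cross product), and that an empty relation is padded with `None`. That last behaviour is a guess, since the slides do not say what happens in that case.

```python
from itertools import product

class GradualJoin:
    def __init__(self, variables, root):
        self.variables = list(variables)
        self.root = root
        self.relations = {v: [] for v in self.variables}

    def flush(self):
        """Join the current relations, delete their tuples, return the result."""
        if not self.relations[self.root]:
            return []
        # Assumption: an empty relation contributes a single None value.
        columns = [self.relations[v] or [None] for v in self.variables]
        joined = list(product(*columns))
        for relation in self.relations.values():
            relation.clear()
        return joined

    def insert(self, var, value):
        """A new root binding first pipelines the join of the previous group."""
        output = self.flush() if var == self.root else []
        self.relations[var].append(value)
        return output

xscan = GradualJoin(["a", "b", "d"], root="b")
events = [("b", 1), ("a", "Don Chamberlin"), ("a", "Michael Kay"),
          ("a", "Denise Draper"), ("d", 2),
          ("b", 3)]   # a second, hypothetical root binding triggers the join
output = []
for var, value in events:
    output.extend(xscan.insert(var, value))
print(output)

remaining = xscan.flush()   # end of stream: emit what is left
print(remaining)
```

The first three output tuples appear as soon as the second root binding arrives, without waiting for the end of the stream.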
22. The XScan operator
Deterministic automata: at first one relation for each binding variable, which are later joined into a single relation.

a               b   d
Don Chamberlin  1   2
Michael Kay     1   2
Denise Draper   1   2
23. The XScan operator - problems
- Only simple XPath expressions are handled
  - Selection predicates: only simple predicates over values, e.g. a/b[@d=4], but not a/b[h/j/t=5]
  - Only forward paths
- The output relation may be huge due to denormalization
- The entire XML document is also kept as text (in a slightly different format)
24. The web-join operator
- In an XQuery expression, a join between different XML documents may occur
25. The web-join operator
- One way to do this is to use an XScan operator to read from each source and then join the result relations using a relational join algorithm
- What if the join with the second source is highly selective?
- What if a source requires input values before it returns an answer? (e.g. an online bookseller may require an author or title)
26. The web-join operator
- The web-join operator is inspired by the dependent join used in distributed relational query processing
- The main idea
  - read data from one source (with an XScan operator)
  - send the results to the other source (via HTTP POST or SOAP)
  - read the answer (again with an XScan operator)
  - join the two relations
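The four steps can be sketched as a dependent join in Python. Everything here is hypothetical: `web_join`, `fake_bookseller` and the `CATALOG` data are made up for illustration. No network call is made. The `fetch` callable merely stands in for "send an HTTP POST or SOAP request and read the reply with an XScan operator".

```python
def web_join(left_tuples, key, fetch):
    """Dependent join: each left tuple is sent to the second source."""
    for left in left_tuples:              # tuples streamed from the first source
        for right in fetch(left[key]):    # answer of the second source
            yield {**left, **right}       # joined tuple, pipelined to the consumer

# Hypothetical second source that needs a title before it answers.
CATALOG = {"XQuery from the Experts": [{"seller": "shop-1"}]}

def fake_bookseller(title):
    """Stand-in for the remote request plus the XScan over its reply."""
    return CATALOG.get(title, [])

left = [{"title": "XQuery from the Experts"}, {"title": "Unknown Title"}]
joined = list(web_join(left, "title", fake_bookseller))
print(joined)
```

Only bindings from the first source reach the second one, which suits sources that require input values and joins that are highly selective.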
27. The web-join operator
- The big question: how to send the results to a source?
- The paper doesn't explain!
- It claims that it uses HTTP POST or SOAP requests, but this requires
  - knowing the query capabilities of the sources (XPath? Simple sequences of values?)
  - knowing their schema
28. XML Construction
- In XQuery, the return clause builds a tree and inserts references to binding variables within this tree
- Special operators (output, element, result) are used, which add structural information to the final binding tuples and finally output the result in XML form
29. XML Construction
XQuery expression:
  For ... Return <book> <name> $lst $fst </name> <publisher> $p </publisher> </book>
Binding tuple: ($lst, $fst, $p)
[Figure: operator labels applied to the binding tuple, in the order shown: fst, lst, name/2, p, publisher/1, book/2]
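One possible reading of the labels fst, lst, name/2, p, publisher/1, book/2 is a postfix plan: a bare name outputs the value of that binding variable, and tag/N wraps the N most recently produced items in an element. That reading is an assumption about the slide's notation, not something the slides state. The function `construct` and the sample binding values are hypothetical.

```python
from xml.sax.saxutils import escape

def construct(plan, binding):
    """Evaluate a postfix construction plan over one binding tuple."""
    stack = []
    for op in plan:
        if "/" in op:                          # element operator "tag/N"
            tag, count = op.split("/")
            start = len(stack) - int(count)
            children = stack[start:]           # the N most recent items
            del stack[start:]
            stack.append(f"<{tag}>{' '.join(children)}</{tag}>")
        else:                                  # output operator: emit a variable
            stack.append(escape(str(binding[op])))
    return "".join(stack)

plan = ["fst", "lst", "name/2", "p", "publisher/1", "book/2"]
binding = {"fst": "Jane", "lst": "Doe", "p": "Example Press"}   # made-up values
xml_text = construct(plan, binding)
print(xml_text)
```

Under this reading the structural information travels with the tuple, so the final XML can be emitted without revisiting the source document.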
30. Performance
31. Conclusions
- Querying streaming data is useful in web applications
- Automata may be used for evaluating XML queries over streaming XML data
- A data integration system should forward queries to the sources and then combine the results
  - The Tukwila system processes and evaluates queries in the mediator
  - The sources send entire XML documents to the mediator
  - Web-join is quite vague and not sufficient