The Joy of SAX

1
The Joy of SAX

Leonidas Fegaras
University of Texas at Arlington
fegaras_at_cse.uta.edu
http://lambda.uta.edu/

2
Design Goals

Want to build an XQuery engine based entirely on SAX handlers
  all the way from the point where the input documents are read by the SAX parser up to the point where the query results are printed
This engine should consist of operators that
  naturally reflect the syntactic structures of XQuery and
  can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
The XQuery translation should be concise, clean, and completely compositional
Even though it cannot compete with transducers for simple XPaths, it should not sacrifice much performance in terms of memory and computational overhead
But ... it should be able to beat transducers for complex predicates and deeply nested queries

3
Pull-Based Approach

Based on iterators
  class Iterator {
    Tuple current ();   // current tuple from stream
    void open ();       // open the stream iterator
    Tuple next ();      // get the next tuple from stream
    boolean eos ();     // is this the end of stream?
  }
An iterator reads data from the input stream(s) and delivers data to the output stream
Connected through pipelines
  an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
  to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, and so on, until the end of stream
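
The pull protocol above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: stream elements are plain integers rather than tuples, and the names PullSketch, Iter, Source, EvenFilter, and drain are hypothetical, not taken from the slides. Only open/next/eos/current follow the Iterator interface listed on this slide.

```java
import java.util.ArrayList;
import java.util.List;

class PullSketch {
    // The slide's iterator interface, specialised to Integer elements (an assumption).
    interface Iter {
        void open();          // open the stream iterator
        Integer current();    // current element of the stream
        Integer next();       // advance and return the new current element
        boolean eos();        // is this the end of the stream?
    }

    // Hypothetical producer at the head of the pipeline: iterates over an array.
    static final class Source implements Iter {
        private final int[] data;
        private int pos;
        Source(int... data) { this.data = data; }
        public void open() { pos = 0; }
        public Integer current() { return eos() ? null : data[pos]; }
        public Integer next() { pos++; return current(); }
        public boolean eos() { return pos >= data.length; }
    }

    // Hypothetical operator: to deliver one element it pulls as many
    // elements from its input as it takes to find an even one.
    static final class EvenFilter implements Iter {
        private final Iter input;
        EvenFilter(Iter input) { this.input = input; }
        private void skip() {
            while (!input.eos() && input.current() % 2 != 0) input.next();
        }
        public void open() { input.open(); skip(); }
        public Integer current() { return input.current(); }
        public Integer next() { input.next(); skip(); return current(); }
        public boolean eos() { return input.eos(); }
    }

    // The consumer drives the whole pipeline by pulling.
    static List<Integer> drain(Iter it) {
        List<Integer> out = new ArrayList<>();
        for (it.open(); !it.eos(); it.next()) out.add(it.current());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new EvenFilter(new Source(1, 2, 3, 4, 5, 6))));
    }
}
```

Nothing happens in Source or EvenFilter until drain asks for an element, which is the producer/consumer role switch described above.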

4
What is a Tuple?

A vector of components
  one component for each scoped for-variable
  has a fixed size at each point in a pipeline (known at compile time)
  doesn't need to include the variable names
A tuple component is the unit of communication between iterators
Passing fully constructed XML elements through iterators is a bad idea for a compositional translation
  initially, we would have to pass the entire document as a tree!
The unit of communication should be
  a single event or
  a fragment (a reference to an XML element in a document)
    this requires a structural index for fragments
A proposal for a pull parser: XML Pull Parser 3 (www.xmlpull.org)
BEA/XQRL token stream: token iterators

5
Event-Oriented Approach

A tuple in an event-oriented approach consists of a sequence of events, ending with an End-Of-Tuple (EOT) event
Single-node event sequence
  depth-first unfolding of a single XML node

A tuple with 3 components (one line per component, then EOT)
  <start A> <start B> <text x> <end B> <start B> <text y> <end B> <end A>
  <text z>
  <start A> <start B> <text w> <end B> <end A>
  <EOT>
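
To make the component boundaries concrete, here is a small Java sketch that splits the event sequence of one tuple into its components by tracking nesting depth. The class and method names (EventTupleSketch, Event, components, slideTuple) are hypothetical, and the slides do not say the engine splits tuples this way; this only illustrates why the sequence on this slide has three components.

```java
import java.util.ArrayList;
import java.util.List;

class EventTupleSketch {
    enum Kind { START, END, TEXT, EOT }

    // One stream event; 'value' is a tag name or text content (null for EOT).
    static final class Event {
        final Kind kind;
        final String value;
        Event(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    static Event start(String tag) { return new Event(Kind.START, tag); }
    static Event end(String tag)   { return new Event(Kind.END, tag); }
    static Event text(String s)    { return new Event(Kind.TEXT, s); }
    static Event eot()             { return new Event(Kind.EOT, null); }

    // Splits the events of one tuple into components: a component is
    // complete whenever the nesting depth is back at zero.
    static List<List<Event>> components(List<Event> tuple) {
        List<List<Event>> result = new ArrayList<>();
        List<Event> component = new ArrayList<>();
        int depth = 0;
        for (Event e : tuple) {
            if (e.kind == Kind.EOT) break;           // end of this tuple
            component.add(e);
            if (e.kind == Kind.START) depth++;
            else if (e.kind == Kind.END) depth--;
            if (depth == 0) {                        // a top-level node is finished
                result.add(component);
                component = new ArrayList<>();
            }
        }
        return result;
    }

    // The event sequence shown on the slide.
    static List<Event> slideTuple() {
        return List.of(
            start("A"), start("B"), text("x"), end("B"),
            start("B"), text("y"), end("B"), end("A"),
            text("z"),
            start("A"), start("B"), text("w"), end("B"), end("A"),
            eot());
    }

    public static void main(String[] args) {
        for (List<Event> c : components(slideTuple())) {
            System.out.println(c.size() + " events");
        }
    }
}
```
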
6
Element vs Event Granularity
Stream unit is a single event

  abstract class Event {}
  class Start extends Event { String tag; }
  class End extends Event { String tag; }
  class Text extends Event { String text; }
  class EOT extends Event {}

  class Child extends Iterator {
    Iterator input;
    String tagname;
    boolean keep = false;
    int nest = 0;
    Event next () {
      while (!input.eos()) {
        current = input.current();
        if (current instanceof Start) {
          if (nest++ == 1)               // a child of the context node opens
            keep = ((Start) current).tag.equals(tagname);
        } else if (current instanceof End) {
          if (--nest == 1 && keep) {     // the matching child closes
            keep = false;
            input.next();
            return current;              // still deliver its End event
          }
        }
        input.next();
        if (keep)
          return current;
      }
      return null;                       // end of stream
    }
  }

Stream unit is a DOM-like element

  abstract class Element {}
  class Node extends Element { String tag; Element[] sequence; }
  class Text extends Element { String text; }
  class Tuple { Element[] components; }

  class Child extends Iterator {
    Iterator input;
    String tagname;
    int index = 0;                       // position among the children of the current node
    Tuple next () {
      while (!input.eos()) {
        if (input.current().get(0) instanceof Node) {
          Node ce = (Node) input.current().get(0);
          if (index < ce.sequence.length) {
            if (ce.sequence[index] instanceof Node
                && ((Node) ce.sequence[index]).tag.equals(tagname)) {
              current = new Tuple(ce.sequence[index++]);
              return current;
            } else index++;
          } else { index = 0; input.next(); }
        } else { index = 0; input.next(); }
      }
      return null;                       // end of stream
    }
  }
7
For-Loop using Iterators

Need a stepper for a for-loop

  class Step extends Iterator {
    boolean first;
    Tuple tuple;
    void open () { first = true; current = tuple; }
    Tuple next () { first = false; return current; }
    void set ( Tuple t ) { tuple = t; }
    boolean eos () { return !first; }
  }

  class Loop extends Iterator {
    Iterator left;
    Step right_step;
    Iterator right;
    Tuple next () {
      if (!left.eos()) {
        while (right.eos()) {
          left.next();
          right_step.set(left.current());
          right.open();
        }
        current = left.current().append(right.current());
        right.next();
        return current;
      }
      return null;   // left stream exhausted
    }
  }

[Figure: Loop reads from left and from the right pipeline; Loop calls set on the Step (right_step) that feeds the right pipeline]

Not a good idea if right reads a document!
8
Let-Bindings using Iterators

Let-bindings are harder to implement
  the let-value may be a sequence
  one producer, many consumers
  we do not want to materialize the let-value in memory

[Figure: a queue with head and tail; the elements between the fastest consumer and the slowest consumer form the backlog]

Some cases are hopeless: let $v := e return ($v,$v)
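
A rough Java sketch of the queue idea, assuming a single-threaded setting and a fixed number of consumers. All names here (LetQueueSketch, SharedQueue, put, take, backlogSize) are hypothetical and not from the slides. The point it shows is that the queue only has to hold the elements between the slowest and the fastest consumer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LetQueueSketch {
    // One producer, a fixed number of consumers, and a shared backlog.
    static final class SharedQueue<T> {
        private final List<T> backlog = new ArrayList<>(); // items some consumer still needs
        private int base = 0;        // stream position of backlog.get(0)
        private final int[] pos;     // next stream position for each consumer

        SharedQueue(int consumers) { pos = new int[consumers]; }

        void put(T item) { backlog.add(item); }

        boolean hasNext(int consumer) { return pos[consumer] < base + backlog.size(); }

        T take(int consumer) {
            T item = backlog.get(pos[consumer] - base);
            pos[consumer]++;
            trim();
            return item;
        }

        // Drop every item that even the slowest consumer has already read.
        private void trim() {
            int slowest = Integer.MAX_VALUE;
            for (int p : pos) slowest = Math.min(slowest, p);
            while (base < slowest) { backlog.remove(0); base++; }
        }

        int backlogSize() { return backlog.size(); }
    }

    // Returns the backlog size after each phase of a small two-consumer run.
    static int[] demo() {
        SharedQueue<String> q = new SharedQueue<>(2);
        for (String s : new String[] {"a", "b", "c", "d"}) q.put(s);
        int[] sizes = new int[3];
        sizes[0] = q.backlogSize();               // nothing consumed yet
        for (int i = 0; i < 3; i++) q.take(0);    // fast consumer reads a, b, c
        sizes[1] = q.backlogSize();               // slow consumer still needs all four
        for (int i = 0; i < 2; i++) q.take(1);    // slow consumer reads a, b
        sizes[2] = q.backlogSize();               // a and b can now be dropped
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

In the hopeless case above, the second use of $v starts reading only after the first has consumed the whole sequence, so the backlog grows to the full let-value.
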
9
Push-based Pipelines

Unit of communication between pipelines
  messages rather than events
Pipeline components are SAX-like event handlers
  they are instances of Operator subclasses

  abstract class Operator {
    void suspend ();
    void release ();
    void startDocument ( int node );
    void endDocument ( int node );
    Status endTuple ( int node );
    Status startElement ( int node, String tag );
    Status endElement ( int node, String tag );
    Status characters ( int node, String text );
  }

  ('node' identifies a for-variable)

10
The Child Operator

  class Child extends Operator {
    Operator next;
    String tagname;
    int nest = 0;
    boolean keep = false;
    Status startElement ( int node, String tag ) {
      if (nest++ == 1)
        keep = tagname.equals(tag);
      if (keep)
        return next.startElement(node,tag);
      else return invalid;
    }
  }

Example: document(...)/A/*//B

[Figure: pipeline Document, Child A, Any, Descendant B, Kick, Print]
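
The push-based Child can be tried out against the SAX parser that ships with the JDK. The sketch below is a simplification and not the slide's Operator class: it drops the 'node' argument and the Status return value, and the names PushChildSketch, Handler, Print, and run are hypothetical. It also adds endElement and characters handling, which the slide does not show, so that complete subtrees reach the end of the pipeline.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PushChildSketch {
    // Simplified stand-in for the slide's Operator: no 'node' argument, no Status.
    interface Handler {
        void startElement(String tag);
        void endElement(String tag);
        void characters(String text);
    }

    // Forwards only the subtrees of the outermost element's children whose tag matches.
    static final class Child implements Handler {
        private final Handler next;
        private final String tagname;
        private int nest = 0;
        private boolean keep = false;
        Child(String tagname, Handler next) { this.tagname = tagname; this.next = next; }
        public void startElement(String tag) {
            if (nest++ == 1) keep = tagname.equals(tag);
            if (keep) next.startElement(tag);
        }
        public void endElement(String tag) {
            nest--;
            if (keep) next.endElement(tag);
            if (nest == 1) keep = false;          // the child subtree is complete
        }
        public void characters(String text) { if (keep) next.characters(text); }
    }

    // End of the pipeline: serialises whatever reaches it.
    static final class Print implements Handler {
        final StringBuilder out = new StringBuilder();
        public void startElement(String tag) { out.append('<').append(tag).append('>'); }
        public void endElement(String tag) { out.append("</").append(tag).append('>'); }
        public void characters(String text) { out.append(text); }
    }

    // Parses xml with the JDK SAX parser and pushes every event into the pipeline.
    static String run(String xml, String tagname) {
        Print print = new Print();
        Handler pipeline = new Child(tagname, print);
        DefaultHandler adapter = new DefaultHandler() {
            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                pipeline.startElement(qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                pipeline.endElement(qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                pipeline.characters(new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), adapter);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return print.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("<r><B>x</B><C>y</C><B><D>w</D></B>t</r>", "B"));
    }
}
```

Here the parser is the only active component: each operator does its work inside the callback invoked by its predecessor, which is the reverse of the pull-based control flow.
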
11
For-Loops

One thread per document reader
Need to queue one tuple from the outer stream each time
  for $x in E1, $y in E2 return ...

  startElement, endElement, ...: if node == x, insert the event into Queue; else emit the event to the output (next)
  endTuple: if node == x, suspend the outer stream and send all events in Queue to E2; else emit all events in Queue to the output (next)
  endDocument: if node == y, clear Queue and release the outer stream

[Figure: outer stream E1 (For x) and inner stream E2 (For y) feed Loop x, which holds the Queue and emits to next]

Not a good idea if E2 reads a document
  the document is read as many times as there are tuples in E1
  but we can cache the output of E2 and push the cached data instead

12
Other Issues

Let-bindings can be easily done using splitters (repeaters)
  no caching is necessary
But ... binary concatenation needs to cache the second stream
  so, let $v := e return ($v,$v) is still hopeless
We don't need to cache path/FLWOR conditionals
  the returned status of the condition events determines the predicate outcome (existential semantics)
  initially, Predicate sends a suspend() event to the next stream and then the input events are propagated as is (to both pred and next)
  if and when the predicate becomes true, the output is released

[Figure: Predicate operator with its condition pipeline (pred), a Sink, and the next operator]
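
The suspend/release protocol can be sketched as follows. This is a simplified illustration, not the engine's Predicate operator: events are plain strings, the condition is tested per event instead of by a separate pred pipeline, and the names PredicateSketch, Sink, filterTuple, and discard are hypothetical. Dropping the held events when the condition never becomes true is an assumption; the slides only describe the case where the output is released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class PredicateSketch {
    // Hypothetical downstream operator: holds events back while suspended.
    static final class Sink {
        private final List<String> pending = new ArrayList<>();
        final List<String> output = new ArrayList<>();
        private boolean suspended = false;

        void suspend() { suspended = true; }
        void release() { suspended = false; output.addAll(pending); pending.clear(); }
        void discard() { suspended = false; pending.clear(); }   // assumption, see lead-in
        void event(String e) { if (suspended) pending.add(e); else output.add(e); }
    }

    // Pushes one tuple through the predicate: next is suspended first, every event
    // is propagated as is, and next is released once the condition holds.
    static void filterTuple(List<String> tuple, Predicate<String> condition, Sink next) {
        next.suspend();
        boolean satisfied = false;
        for (String e : tuple) {
            next.event(e);
            if (!satisfied && condition.test(e)) {   // existential semantics: one match suffices
                satisfied = true;
                next.release();
            }
        }
        if (!satisfied) next.discard();              // end of tuple, condition never held
    }

    static List<String> demo() {
        Sink sink = new Sink();
        Predicate<String> isX = e -> e.equals("x");
        filterTuple(List.of("<A>", "x", "</A>"), isX, sink);   // passes
        filterTuple(List.of("<A>", "y", "</A>"), isX, sink);   // filtered out
        return sink.output;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```
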
13
So, to Pull or to Push?

For event streams, it doesn't really make a difference in terms of efficiency/storage requirements
  a matter of programming style
  push-based is a bit more difficult to program and harder to debug (threads)
But ... if you want to use indexes, pulling is better
For indexing, fragments are a better alternative to events
  fragment: a reference to an element in a document
  a fragment corresponds to a tree node, and you need an index to access descendants
  need to guarantee that indexes deliver fragments sorted, so that all stream operators can be implemented using merge joins
  examples:
    structural indexes based on region encoding or on preorder/postorder ranks
    IR-style content-based inverted indexes
  see my recent work on XQuery processing with relevance ranking
    http://lambda.uta.edu/XQueryRank.pdf
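
To illustrate why sorted fragments matter, here is a sketch of an ancestor/descendant join over region-encoded fragments, done in one merge pass with a stack of open ancestors. It is a generic illustration of the technique, not code from the slides or from the paper linked above; the names RegionJoinSketch, Region, and join are hypothetical. It assumes both inputs are sorted by start position and come from the same well-formed document, so that regions either nest or are disjoint.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class RegionJoinSketch {
    // A fragment under region encoding: positions of its start and end tags.
    static final class Region {
        final int start, end;
        Region(int start, int end) { this.start = start; this.end = end; }
        @Override public String toString() { return start + "-" + end; }
    }

    // Ancestor/descendant join over two lists sorted by start position.
    static List<String> join(List<Region> ancestors, List<Region> descendants) {
        List<String> pairs = new ArrayList<>();
        Deque<Region> open = new ArrayDeque<>();   // ancestors enclosing the current position
        int i = 0;
        for (Region d : descendants) {
            // Bring in every candidate ancestor that starts before d.
            while (i < ancestors.size() && ancestors.get(i).start < d.start) {
                Region a = ancestors.get(i++);
                while (!open.isEmpty() && open.peekLast().end < a.start) open.removeLast();
                open.addLast(a);
            }
            // Discard ancestors that closed before d starts.
            while (!open.isEmpty() && open.peekLast().end < d.start) open.removeLast();
            // Everything still open contains d (outermost first).
            for (Region a : open) pairs.add(a + " contains " + d);
        }
        return pairs;
    }

    static List<String> demo() {
        // Positions for <A><B/><A><B/></A></A><B/>, with each <B/> taking two positions.
        List<Region> as = List.of(new Region(1, 8), new Region(4, 7));
        List<Region> bs = List.of(new Region(2, 3), new Region(5, 6), new Region(9, 10));
        return join(as, bs);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Each input list is scanned once and never rewound, which is only possible because the index delivers the fragments in sorted order.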