Title: The Joy of SAX
1 The Joy of SAX
- Leonidas Fegaras
- University of Texas at Arlington
- fegaras_at_cse.uta.edu
- http://lambda.uta.edu/
2 Design Goals
- Want to build an XQuery engine based entirely on SAX handlers
  - all the way from the point the input documents are read by the SAX parser up to the point the query results are printed
- This engine should consist of operators that
  - naturally reflect the syntactic structures of XQuery, and
  - can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
- The XQuery translation should be concise, clean, and completely compositional
- Even though it cannot compete with transducers for simple XPaths, it should not sacrifice much on performance in terms of memory and computational overhead
- But, ... it should be able to beat transducers for complex predicates and deeply nested queries
3 Pull-Based Approach
- Based on iterators

    class Iterator {
        Tuple current ();   // current tuple from stream
        void open ();       // open the stream iterator
        Tuple next ();      // get the next tuple from stream
        boolean eos ();     // is this the end of stream?
    }

- An iterator reads data from the input stream(s) and delivers data to the output stream
- Connected through pipelines
  - an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
  - to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, etc., until the end of stream
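The pull protocol on this slide can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the stream unit is a plain String instead of a Tuple, and the names PullIterator, ListSource, Filter, and PullDemo are hypothetical, not part of the engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical pull interface mirroring the slide's Iterator (String stands in for Tuple).
abstract class PullIterator {
    protected String current;
    String current() { return current; }   // current element of the stream
    abstract void open();                  // open the stream
    abstract String next();                // advance and return the new current element
    abstract boolean eos();                // is this the end of stream?
}

// Producer at the start of a pipeline: reads from an in-memory list.
class ListSource extends PullIterator {
    private final List<String> data;
    private int pos;
    ListSource(List<String> data) { this.data = data; }
    void open() { pos = 0; current = data.isEmpty() ? null : data.get(0); }
    String next() {
        pos++;
        current = pos < data.size() ? data.get(pos) : null;
        return current;
    }
    boolean eos() { return pos >= data.size(); }
}

// A consumer that is also a producer: it pulls from its input until an element passes the test.
class Filter extends PullIterator {
    private final PullIterator input;
    private final Predicate<String> test;
    Filter(PullIterator input, Predicate<String> test) { this.input = input; this.test = test; }
    void open() { input.open(); skip(); }
    String next() { input.next(); skip(); return current; }
    boolean eos() { return input.eos(); }
    // Request as many input elements as necessary to produce one output element.
    private void skip() {
        while (!input.eos() && !test.test(input.current())) input.next();
        current = input.eos() ? null : input.current();
    }
}

class PullDemo {
    // Drain a pipeline into a list, pulling one element at a time.
    static List<String> drain(PullIterator it) {
        List<String> out = new ArrayList<>();
        for (it.open(); !it.eos(); it.next()) out.add(it.current());
        return out;
    }
    public static void main(String[] args) {
        PullIterator p = new Filter(new ListSource(List.of("a", "bb", "c", "dd")),
                                    s -> s.length() == 2);
        System.out.println(drain(p));   // prints [bb, dd]
    }
}
```

Nothing moves through the pipeline until drain asks for an element, which is the producer/consumer role switch described in the bullets.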
4 What is a Tuple?
- A vector of components
  - one component for each scoped for-variable
  - has a fixed size at each point in a pipeline (known at compile time)
  - doesn't need to include the variable names
- A tuple component is the unit of communication between iterators
- Passing fully constructed XML elements through iterators is a bad idea for a compositional translation
  - initially, we would have to pass the entire document as a tree!
- The unit of communication should be
  - a single event, or
  - a fragment (a reference to an XML element in a document)
    - this requires a structural index for fragments
- A proposal for a pull parser: XML Pull Parser 3
  - www.xmlpull.org
- BEA/XQRL token stream: token iterators
5 Event-Oriented Approach
- A tuple in an event-oriented approach consists of a sequence of events, ending with an End-Of-Tuple (EOT) event
- Single-node event sequence
  - depth-first unfolding of a single XML node
- A tuple with 3 components:

    <start A>
      <start B>
        <text x>
      <end B>
      <start B>
        <text y>
      <end B>
    <end A>
    <text z>
    <start A>
      <start B>
        <text w>
      <end B>
    <end A>
    <EOT>
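The depth-first unfolding can be reproduced with a short Java sketch. The XNode and Unfold classes are hypothetical helpers introduced only for this illustration, and events are rendered as strings in the slide's notation rather than as event objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node: an element with children, or a text leaf when tag == null.
class XNode {
    final String tag, text;
    final List<XNode> children = new ArrayList<>();
    XNode(String tag, String text) { this.tag = tag; this.text = text; }
    static XNode elem(String tag, XNode... kids) {
        XNode n = new XNode(tag, null);
        n.children.addAll(List.of(kids));
        return n;
    }
    static XNode text(String s) { return new XNode(null, s); }
}

class Unfold {
    // Depth-first unfolding of one node into start/text/end events.
    static void unfold(XNode n, List<String> out) {
        if (n.tag == null) { out.add("<text " + n.text + ">"); return; }
        out.add("<start " + n.tag + ">");
        for (XNode c : n.children) unfold(c, out);
        out.add("<end " + n.tag + ">");
    }
    // A tuple is the concatenation of its components' events, closed by EOT.
    static List<String> tuple(XNode... components) {
        List<String> out = new ArrayList<>();
        for (XNode c : components) unfold(c, out);
        out.add("<EOT>");
        return out;
    }
    public static void main(String[] args) {
        List<String> events = tuple(
            XNode.elem("A", XNode.elem("B", XNode.text("x")), XNode.elem("B", XNode.text("y"))),
            XNode.text("z"),
            XNode.elem("A", XNode.elem("B", XNode.text("w"))));
        events.forEach(System.out::println);   // the 15 events of the slide's example
    }
}
```

Component boundaries are implicit in the stream: a component ends where the nesting depth returns to zero, which is why the tuple does not need to carry variable names.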
6 Element vs Event Granularity
Stream unit is a single event:

    abstract class Event {}
    class Start extends Event { String tag; }
    class End extends Event { String tag; }
    class Text extends Event { String text; }
    class EOT extends Event {}

    class Child extends Iterator {
        Iterator input;
        String tagname;
        boolean keep = false;   // inside a matching child?
        int nest = 0;           // current element depth
        Event next () {
            while (!input.eos()) {
                current = input.current();
                if (current instanceof Start) {
                    if (nest++ == 1)
                        keep = ((Start) current).tag.equals(tagname);
                } else if (current instanceof End)
                    if (nest-- == 1)
                        keep = false;
                input.next();
                if (keep) return current;
            }
            return null;   // end of stream
        }
    }
Stream unit is a DOM-like element:

    abstract class Element {}
    class Node extends Element { String tag; Element[] sequence; }
    class Text extends Element { String text; }
    class Tuple { Element[] components; }

    class Child extends Iterator {
        Iterator input;
        String tagname;
        int index = 0;   // position in the children of the current node
        Tuple next () {
            while (!input.eos())
                if (input.current().get(0) instanceof Node) {
                    Node ce = (Node) input.current().get(0);
                    if (index < ce.sequence.length) {
                        if (ce.sequence[index] instanceof Node
                            && ((Node) ce.sequence[index]).tag.equals(tagname)) {
                            current = new Tuple(ce.sequence[index++]);
                            return current;
                        } else index++;
                    } else { index = 0; input.next(); }
                } else { index = 0; input.next(); }
            return null;   // end of stream
        }
    }
7 For-Loop using Iterators
- Need a stepper for a for-loop

    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }

    Tuple Loop.next () {
        if (!left.eos()) {
            while (right.eos()) {
                left.next();
                right_step.set(left.current());
                right.open();
            }
            current = left.current().append(right.current());
            right.next();
            return current;
        }
    }
Not a good idea if right reads a document!
[Figure: diagram of Loop with inputs left and right; the right pipeline starts at a Step (right_step), on which Loop calls set.]
    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;
    }
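A self-contained sketch of the stepper idea follows. It is not the slide's code: tuples are lists of strings, the right pipeline (SplitItems) is a made-up example that depends on the outer tuple, and all class names are hypothetical. The loop also guards against running past the end of the left stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tuple iterator; a tuple is a list of string components.
abstract class TupleIter {
    protected List<String> current;
    List<String> current() { return current; }
    abstract void open();
    abstract List<String> next();
    abstract boolean eos();
}

// Outer stream: tuples held in memory.
class TupleSource extends TupleIter {
    private final List<List<String>> data;
    private int pos;
    TupleSource(List<List<String>> data) { this.data = data; }
    void open() { pos = 0; load(); }
    List<String> next() { pos++; load(); return current; }
    boolean eos() { return pos >= data.size(); }
    private void load() { current = eos() ? null : data.get(pos); }
}

// The stepper: a one-tuple stream that the loop re-seeds for every outer tuple.
class Stepper extends TupleIter {
    private boolean first;
    private List<String> tuple;
    void set(List<String> t) { tuple = t; }
    void open() { first = true; current = tuple; }
    List<String> next() { first = false; return current; }
    boolean eos() { return !first; }
}

// Example right pipeline: one output tuple per comma-separated item of the stepper's tuple.
class SplitItems extends TupleIter {
    private final TupleIter input;
    private String[] parts = new String[0];
    private int pos;
    SplitItems(TupleIter input) { this.input = input; }
    void open() {
        input.open();
        parts = input.eos() ? new String[0] : input.current().get(0).split(",");
        pos = 0;
        load();
    }
    List<String> next() { pos++; load(); return current; }
    boolean eos() { return pos >= parts.length; }
    private void load() { current = eos() ? null : List.of(parts[pos]); }
}

class NestedLoop extends TupleIter {
    private final TupleIter left, right;
    private final Stepper rightStep;
    NestedLoop(TupleIter left, Stepper rightStep, TupleIter right) {
        this.left = left; this.rightStep = rightStep; this.right = right;
    }
    void open() { left.open(); restartRight(); settle(); }
    List<String> next() { right.next(); settle(); return current; }
    boolean eos() { return left.eos(); }
    // Seed the stepper with the current outer tuple and reopen the right pipeline.
    private void restartRight() {
        if (!left.eos()) { rightStep.set(left.current()); right.open(); }
    }
    // Skip outer tuples whose right stream is exhausted, then build the combined tuple.
    private void settle() {
        while (!left.eos() && right.eos()) { left.next(); restartRight(); }
        if (left.eos()) { current = null; return; }
        current = new ArrayList<>(left.current());
        current.addAll(right.current());
    }
}

class LoopDemo {
    static List<List<String>> run(List<List<String>> outer) {
        Stepper step = new Stepper();
        TupleIter loop = new NestedLoop(new TupleSource(outer), step, new SplitItems(step));
        List<List<String>> out = new ArrayList<>();
        for (loop.open(); !loop.eos(); loop.next()) out.add(loop.current());
        return out;
    }
    public static void main(String[] args) {
        System.out.println(run(List.of(List.of("a,b"), List.of("c"))));
    }
}
```

Every outer tuple reopens the right pipeline, which is why this scheme becomes expensive when the right pipeline reads a document.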
8 Let-Bindings using Iterators
- Let-bindings are harder to implement
  - the let-value may be a sequence
  - one producer, many consumers
  - we do not want to materialize the let-value in memory
[Figure: a queue with head and tail; labels mark the fastest consumer, the slowest consumer, and the backlog.]
Some cases are hopeless: let $v := e return ($v,$v)
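The one-producer, many-consumers problem can be illustrated with a small shared buffer. This is a sketch under assumptions, not the engine's data structure: LetQueue and LetDemo are hypothetical names, and the producer is a plain java.util.Iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical shared buffer: one producer, several consumers, each with its own cursor.
class LetQueue<T> {
    private final Iterator<T> producer;
    private final List<T> buffer = new ArrayList<>(); // elements not yet read by every consumer
    private final int[] cursor;                        // absolute read position per consumer
    private int head = 0;                              // absolute position of buffer.get(0)

    LetQueue(Iterator<T> producer, int consumers) {
        this.producer = producer;
        this.cursor = new int[consumers];
    }

    // Next element for consumer c, or null at end of stream.
    T read(int c) {
        int tail = head + buffer.size();
        if (cursor[c] == tail) {                 // c is at the tail: pull from the producer
            if (!producer.hasNext()) return null;
            buffer.add(producer.next());
        }
        T value = buffer.get(cursor[c] - head);
        cursor[c]++;
        trim();
        return value;
    }

    // Drop elements that every consumer has already read.
    private void trim() {
        int slowest = Integer.MAX_VALUE;
        for (int p : cursor) slowest = Math.min(slowest, p);
        while (head < slowest) { buffer.remove(0); head++; }
    }

    // Number of buffered elements between the slowest consumer and the tail.
    int backlog() { return buffer.size(); }
}

class LetDemo {
    public static void main(String[] args) {
        LetQueue<Integer> q = new LetQueue<>(List.of(1, 2, 3, 4).iterator(), 2);
        for (int i = 0; i < 4; i++) q.read(0);   // consumer 0 reads the whole sequence first
        System.out.println("backlog after consumer 0: " + q.backlog());
        for (int i = 0; i < 4; i++) q.read(1);   // consumer 1 catches up
        System.out.println("backlog after consumer 1: " + q.backlog());
    }
}
```

When one consumer must finish before the other starts, as in a concatenation of the same variable with itself, the backlog grows to the whole let-value, which is the hopeless case named on the slide.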
9 Push-based Pipelines
- Unit of communication between pipelines
  - messages rather than events
- Pipeline components are SAX-like event handlers
  - they are instances of Operator subclasses

    abstract class Operator {
        void suspend ();
        void release ();
        void startDocument ( int node );
        void endDocument ( int node );
        Status endTuple ( int node );
        Status startElement ( int node, String tag );
        Status endElement ( int node, String tag );
        Status characters ( int node, String text );
    }

  ('node' identifies a for-variable)
10 The Child Operator

    class Child extends Operator {
        Operator next;
        String tagname;
        int nest = 0;
        boolean keep = false;
        Status startElement ( int node, String tag ) {
            if (nest++ == 1)
                keep = tagname.equals(tag);
            if (keep)
                return next.startElement(node,tag);
            else return invalid;
        }
    }

- Example: document(...)/A/*//B
[Figure: pipeline Document -> Child A -> Any -> Descendant B -> Kick -> Print]
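A runnable sketch of a push-based child step, driven by the JDK's SAX parser, is given below. Several things here are assumptions made for the illustration: the Status enum and its two values, the PushOp base class with only three callbacks, the Collect sink standing in for Print, and the handling of endElement and characters, which the slide does not show. The context element is assumed to be the document's root element (depth 0), so its children are at depth 1.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

enum Status { VALID, INVALID }   // hypothetical values; the slides only name the type

abstract class PushOp {
    abstract Status startElement(int node, String tag);
    abstract Status endElement(int node, String tag);
    abstract Status characters(int node, String text);
}

// Forwards only the events that belong to children named tagname.
class ChildOp extends PushOp {
    private final String tagname;
    private final PushOp next;
    private int nest = 0;          // depth of the next start tag (0 = the context element)
    private boolean keep = false;  // inside a matching child?
    ChildOp(String tagname, PushOp next) { this.tagname = tagname; this.next = next; }
    Status startElement(int node, String tag) {
        if (nest++ == 1) keep = tagname.equals(tag);
        return keep ? next.startElement(node, tag) : Status.INVALID;
    }
    Status endElement(int node, String tag) {
        nest--;                                   // nest is now the depth of the closed element
        Status s = keep ? next.endElement(node, tag) : Status.INVALID;
        if (nest == 1) keep = false;              // the child is complete
        return s;
    }
    Status characters(int node, String text) {
        return keep ? next.characters(node, text) : Status.INVALID;
    }
}

// End of the pipeline: serializes whatever reaches it.
class Collect extends PushOp {
    final StringBuilder out = new StringBuilder();
    Status startElement(int node, String tag) { out.append('<').append(tag).append('>'); return Status.VALID; }
    Status endElement(int node, String tag) { out.append("</").append(tag).append('>'); return Status.VALID; }
    Status characters(int node, String text) { out.append(text); return Status.VALID; }
}

class PushDemo {
    static String run(String xml, String tagname) {
        Collect sink = new Collect();
        PushOp pipeline = new ChildOp(tagname, sink);
        // The SAX handler plays the role of the Document operator: it pushes events downstream.
        DefaultHandler source = new DefaultHandler() {
            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                pipeline.startElement(0, qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                pipeline.endElement(0, qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                pipeline.characters(0, new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), source);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return sink.out.toString();
    }
    public static void main(String[] args) {
        System.out.println(run("<R><A>x<C>y</C></A><B>z</B>t</R>", "A"));   // prints <A>x<C>y</C></A>
    }
}
```

The producer drives the computation here: each parser callback runs through the whole pipeline before the parser reads further input.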
11 For-Loops
- One thread per document reader
- Need to queue one tuple from the outer stream each time
- for $x in E1, $y in E2 return ...

    startElement, endElement, ...:
        if node = x, insert the event into Queue
        else emit the event to the output (next)
    endTuple:
        if node = x, suspend the outer stream and send all events in Queue to E2
        else emit all events in Queue to the output (next)
    endDocument:
        if node = y, clear Queue and release the outer stream
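[Figure: pipeline diagram with nodes E1, E2, For x, For y, and Loop x (holding Queue); the streams are labeled outer and inner, and the output is labeled next.]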
- Not a good idea if E2 reads a document
  - the document is read as many times as there are tuples in E1
  - but we can cache the output of E2 and push the cached data instead
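The caching variant can be sketched in a single thread, assuming that E2 does not depend on the outer variable, so its output can be cached once and replayed for every outer tuple. The CachedLoop class, its method names, and the string rendering of events are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical loop operator that replays a cached inner stream for each outer tuple.
class CachedLoop {
    private final List<List<String>> innerCache;            // E2's tuples, cached once
    private final List<String> queue = new ArrayList<>();   // events of the current outer tuple
    private final Consumer<String> next;                    // downstream operator

    CachedLoop(List<List<String>> innerCache, Consumer<String> next) {
        this.innerCache = innerCache;
        this.next = next;
    }

    // startElement / endElement / characters on the outer stream: queue the event.
    void outerEvent(String event) { queue.add(event); }

    // endTuple on the outer stream: pair the queued tuple with every cached inner tuple.
    void outerEndTuple() {
        for (List<String> inner : innerCache) {
            queue.forEach(next);
            inner.forEach(next);
            next.accept("<EOT>");
        }
        queue.clear();
    }
}

class CachedLoopDemo {
    static List<String> run() {
        List<String> out = new ArrayList<>();
        CachedLoop loop = new CachedLoop(
            List.of(List.of("<text 1>"), List.of("<text 2>")), out::add);
        loop.outerEvent("<text a>");
        loop.outerEndTuple();
        loop.outerEvent("<text b>");
        loop.outerEndTuple();
        return out;
    }
    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The trade-off is memory for I/O: the document behind E2 is read once, but its output stays in memory for the duration of the loop.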
12 Other Issues
- Let-bindings can be easily done using splitters (repeaters)
  - no caching is necessary
- But, ... binary concatenation needs to cache the second stream
  - so, let $v := e return ($v,$v) is still hopeless
- We don't need to cache path/FLWOR conditionals
  - the returned status of the condition events determines the predicate outcome (existential semantics)
  - initially, Predicate sends a suspend() event to the next stream and then the input events are propagated as is (to both pred and next)
  - if and when the predicate becomes true, the output is released
[Figure: diagram with nodes Predicate, condition, and Sink, and streams labeled pred and next.]
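The suspend/release protocol for predicates can be sketched as follows. This is an illustration under assumptions: the condition is modeled as a test on single events, the suspended output is held in a list by a hypothetical BufferedSink, and a tuple whose condition never becomes true is discarded at endTuple (the slide does not say what happens in that case).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical output stream that can be suspended, released, or discarded.
class BufferedSink {
    private final List<String> pending = new ArrayList<>();
    final List<String> out = new ArrayList<>();
    private boolean suspended = false;
    void suspend() { suspended = true; }
    void release() { suspended = false; out.addAll(pending); pending.clear(); }
    void discard() { suspended = false; pending.clear(); }
    void event(String e) { if (suspended) pending.add(e); else out.add(e); }
}

// Hypothetical predicate operator with existential semantics over the events of a tuple.
class PredicateOp {
    private final Predicate<String> condition;
    private final BufferedSink next;
    private boolean started = false, satisfied = false;
    PredicateOp(Predicate<String> condition, BufferedSink next) {
        this.condition = condition;
        this.next = next;
    }
    void event(String e) {
        if (!started) { next.suspend(); started = true; }   // hold the output until we know
        next.event(e);                                      // propagate the event as is
        if (!satisfied && condition.test(e)) {              // the predicate became true
            satisfied = true;
            next.release();
        }
    }
    void endTuple() {
        if (satisfied) next.event("<EOT>"); else next.discard();
        started = false;
        satisfied = false;
    }
}

class PredicateDemo {
    static List<String> run() {
        BufferedSink sink = new BufferedSink();
        PredicateOp p = new PredicateOp(e -> e.equals("<text x>"), sink);
        for (String e : List.of("<start A>", "<text x>", "<end A>")) p.event(e);
        p.endTuple();
        for (String e : List.of("<start A>", "<text y>", "<end A>")) p.event(e);
        p.endTuple();
        return sink.out;
    }
    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Only the events seen before the condition is decided are buffered; once the output is released, later events of the same tuple pass straight through.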
13 So, to Pull or to Push?
- For event streams, it doesn't really make a difference in terms of efficiency/storage requirements
  - a matter of programming style
  - push-based is a bit more difficult to program and harder to debug (threads)
- But, ... if you want to use indexes, pulling is better
- For indexing, fragments are a better alternative to events
  - fragment: a reference to an element in a document
  - a fragment corresponds to a tree node, and you need an index to access descendants
  - need to guarantee that indexes deliver fragments sorted, so that all stream operators can be implemented using merge joins
  - examples
    - structural indexes based on region encoding or on preorder/postorder ranks
    - IR-style content-based inverted indexes
    - see my recent work on XQuery processing with relevance ranking
      - http://lambda.uta.edu/XQueryRank.pdf
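To make the merge-join remark concrete, here is a sketch of an ancestor/descendant join over region-encoded fragments. It uses a stack-based structural merge join in the style of the stack-tree joins from the literature, which is a technique chosen for this illustration and not necessarily the one used by the author. The Region and StructuralJoin names are hypothetical, and both inputs are assumed to come from one document and to be sorted by start position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fragment: an element identified by its region (start and end positions).
class Region {
    final int start, end;
    Region(int start, int end) { this.start = start; this.end = end; }
    @Override public String toString() { return "(" + start + "," + end + ")"; }
}

class StructuralJoin {
    // Both inputs must be sorted by start; the output is sorted by descendant.
    static List<String> join(List<Region> anc, List<Region> desc) {
        List<String> out = new ArrayList<>();
        List<Region> stack = new ArrayList<>();   // nested ancestors that are still open
        int a = 0, d = 0;
        while (d < desc.size() && (a < anc.size() || !stack.isEmpty())) {
            // Start position of the next fragment in document order.
            int nextStart = desc.get(d).start;
            if (a < anc.size()) nextStart = Math.min(nextStart, anc.get(a).start);
            // Close the ancestors that end before that position.
            while (!stack.isEmpty() && stack.get(stack.size() - 1).end < nextStart)
                stack.remove(stack.size() - 1);
            if (a < anc.size() && anc.get(a).start < desc.get(d).start) {
                stack.add(anc.get(a++));          // open a new ancestor
            } else {
                // Every open ancestor contains this descendant.
                for (Region s : stack) out.add(s + " contains " + desc.get(d));
                d++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Regions for <A><B/><A><B/></A></A><B/>, numbering start and end tags from 1.
        List<Region> as = List.of(new Region(1, 8), new Region(4, 7));
        List<Region> bs = List.of(new Region(2, 3), new Region(5, 6), new Region(9, 10));
        join(as, bs).forEach(System.out::println);
    }
}
```

Each input is scanned once and the stack never grows beyond the document depth, which is the property that makes sorted index output valuable.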
14 Related Work
- Joost: XSLT transformation based on SAX
- BEA/XQRL: pull-based XQuery processing
- Apache Cocoon: user-constructed pipelines made out of SAX handlers
- Many XQuery processors: Galax, Xalan, Qizx, Saxon, ...
- Lots of work on XPath/XQuery processing based on transducers