Title: Lazy Query Evaluation for Active XML
1Lazy Query Evaluation for Active XML
Abiteboul, Benjelloun, Cautis, Manolescu, Milo, Preda
INRIA Futurs
presented by Grigoris Karvounarakis
Univ. of Pennsylvania
CIS 650, October 14, 2004
2Active XML
function nodes
3Tree Pattern Queries
result nodes
4Tree Pattern Queries
- Similar to Pattern Trees from the TAX/TLC algebra
- variable nodes, used to bind variables to sub-trees
  - (variable nodes with the same name must be mapped to elements with the same tag name)
- result nodes
- Embedding (of a query q into a doc d): a match of q in d
- Result of an embedding: the bindings of the output variables on the witness tree
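To make the notion of an embedding concrete, here is a minimal sketch that only checks whether some embedding exists. It assumes documents and patterns are small in-memory trees; the classes `Doc` and `Pat`, the function `matches_at`, and the sample `area` element are illustrative names, not taken from the paper, and variable bindings and result nodes are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    label: str
    children: list = field(default_factory=list)

@dataclass
class Pat:
    label: str                 # tag name, or "*" for any label
    desc: bool = False         # True: reached from its parent via a descendant (//) edge
    children: list = field(default_factory=list)

def descendants(d):
    """Yield all proper descendants of document node d."""
    for c in d.children:
        yield c
        yield from descendants(c)

def matches_at(p, d):
    """True if pattern node p (with its whole subtree) can be mapped onto d."""
    if p.label != "*" and p.label != d.label:
        return False
    for pc in p.children:
        candidates = descendants(d) if pc.desc else d.children
        if not any(matches_at(pc, c) for c in candidates):
            return False
    return True

# Hypothetical sample data.
doc = Doc("nyHotels", [Doc("hotel", [
    Doc("name"),
    Doc("nearby", [Doc("area", [Doc("restaurant")])]),
])])
q_ok = Pat("nyHotels", children=[Pat("hotel", children=[
    Pat("nearby", children=[Pat("restaurant", desc=True)])])])
q_bad = Pat("nyHotels", children=[Pat("hotel", children=[Pat("rating")])])

print(matches_at(q_ok, doc), matches_at(q_bad, doc))
```

The second query finds no embedding because the sample hotel has no rating child; in an Active XML document a function call could still supply one, which is what the following slides address.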
5No embedding
6No embedding
but if we evaluate
7Embedding Example
8Embedding Example
9Embedding Example
10Relevant rewriting
- (getNearbyRestos) is a relevant function node
- In general, a function node is relevant if there exists some rewriting of the document in which some of the nodes it produces belong to a match
- Rewriting the document by invoking relevant function nodes produces relevant rewritings
- d1 →v1 d2 →v2 ... dn
- A document that contains no calls that are relevant to a query q is said to be complete for q
11Problem definition
- Given an Active XML document d and a query q, find an efficient way to evaluate the query over the document
- Naïve approach: interleave query evaluation with function calls
- Better: try to compute (a superset of) the relevant function calls for q and execute q over the rewriting of d (that results from executing these function calls)
12Problem definition
- Efficiency tradeoff
  - time to compute the approximation of the set of relevant functions (larger for a more accurate approximation)
  - time to execute the function calls (smaller for a more accurate approximation) and time to execute the query over the resulting rewriting of the document (a more accurate approximation yields a smaller document)
13Outline
- Definitions
- Finding relevant calls
- Sequencing relevant calls
- Improving accuracy
- Reducing detection time
- Conclusions
- Discussion
14Linear Path Queries
/()
/nyHotels/()
/nyHotels/hotel/()
/nyHotels/hotel/name/()
/nyHotels/hotel/rating/()
/nyHotels/hotel/nearby/()
/nyHotels/hotel/nearby//()
/nyHotels/hotel/nearby//restaurant/()
/nyHotels/hotel/nearby//restaurant/name/()
/nyHotels/hotel/nearby//restaurant/address/()
/nyHotels/hotel/nearby//restaurant/rating/()
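The linear path queries listed above can be generated mechanically from the query tree. The sketch below is one way to do it; the generation rule (one `/()` query per node, plus a `//()` query wherever a node has a descendant-edge child) is inferred from the list on this slide rather than quoted from the paper, and the tuple encoding and the name `lpqs` are illustrative.

```python
# A query node is (label, edge, children); edge is "/" (child) or "//" (descendant).

def lpqs(node, prefix=""):
    """Return the linear path queries for `node` and its subtree, in preorder."""
    label, edge, children = node
    path = prefix + edge + label
    out = [path + "/()"]                       # a function directly below this node
    if any(c[1] == "//" for c in children):
        out.append(path + "//()")              # a function anywhere below this node
    for c in children:
        out.extend(lpqs(c, path))
    return out

leaf = lambda name: (name, "/", [])
query = ("nyHotels", "/", [
    ("hotel", "/", [
        leaf("name"),
        leaf("rating"),
        ("nearby", "/", [
            ("restaurant", "//", [leaf("name"), leaf("address"), leaf("rating")]),
        ]),
    ]),
])

all_lpqs = ["/()"] + lpqs(query)               # "/()" covers a function at the root
for q in all_lpqs:
    print(q)
```

Each query keeps only the root-to-node path, which is why this approach is cheap to evaluate but ignores every filtering condition on other branches.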
15Linear Path Queries
- Correct, but usually inaccurate
- Ignores filtering conditions in the path from the
root or in other branches that could make some of
the functions irrelevant (e.g. there is no chance
that a getNearbyRestos() function node under a
hotel is relevant, if the hotel rating is not
)
16Node Focused Queries
- For each node in the query tree, replace it with an OR node (adding a branch () that matches any function, as with LPQs)
- Then, for every node v in the resulting query tree, create qv = q − (v and its subtree), with output node fv at the position of the () OR-sibling of v
- Each such query tree involves the path from the root to the node (as in LPQs) plus any parts of the tree that would have to be matched anyway for the whole query tree to match
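As a rough illustration of the construction above, the following sketch renders the NFQ for a chosen node as an XPath-like string. The notation is an assumption made for this example: `(x or ())` stands for an OR node whose second branch is the function wildcard `()`, and the names `cond` and `nfq` are hypothetical. It is a reading of the slide, not the paper's algorithm.

```python
# A query node is (label, edge, children); edge is "/" or "//".

def cond(node):
    """Render a side branch as a filter in which every node may instead be a function node."""
    label, edge, kids = node
    body = label + "".join(f"[{cond(k)}]" for k in kids)
    axis = ".//" if edge == "//" else ""
    return f"{axis}({body} or ())"

def nfq(root, target_path):
    """Render the NFQ whose output is the function node at the position of the
    query node reached by following the child indexes in target_path."""
    node, out = root, ""
    for i in target_path:
        label, edge, kids = node
        # Keep every other branch of this ancestor as a condition.
        others = "".join(f"[{cond(k)}]" for j, k in enumerate(kids) if j != i)
        out += edge + label + others
        node = kids[i]
    return out + node[1] + "()"                # the target itself is replaced by ()

query = ("nyHotels", "/", [
    ("hotel", "/", [("name", "/", []), ("rating", "/", []), ("nearby", "/", [])]),
])

print(nfq(query, [0, 2]))   # NFQ for the position of nearby
print(nfq(query, []))       # NFQ for the position of the root
```

Compared with the linear path query for the same position, the rendered NFQ keeps the name and rating branches of hotel as conditions, so a call under a hotel that cannot satisfy them is not retrieved.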
17NFQ Example
[Figure, built up over slides 17-21: query tree with nodes nyHotels, hotel, name, nearby, rating, "Best Western", restaurant, name, address, rating, and variables X and Y]
22Another NFQ Example
[Figure, built up over slides 22-25: only the node label "Best Western" survives]
26Node Focused Queries
- Assuming that functions can return data of
arbitrary type, the function nodes that are
relevant for a query q are precisely the ones
retrieved by the NFQs of q
27Outline
- Definitions
- Finding relevant calls
- Sequencing relevant calls
- Improving accuracy
- Reducing detection time
- Conclusions
- Discussion
28Sequencing relevant calls
- Naïve NFQA algorithm
  1. Evaluate all NFQs
  2. Pick one of the returned functions, say fv
  3. Evaluate the function and rewrite the document (d →fv d′)
  4. Repeat until all NFQs return empty results (i.e., there are no more relevant calls)
- After every loop, although the NFQs remain the same, their results can change (since evaluating functions at step 3 above can introduce new function nodes or make some results irrelevant)
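The loop described on this slide can be sketched as follows. This is a toy model under stated assumptions: the document is a flat list of strings, entries starting with `call:` stand for function nodes, `invoke` replaces a call by its (hypothetical) result, and each NFQ is a plain Python callable. None of these names come from the paper.

```python
def nfqa(doc, nfqs, invoke):
    """Naive NFQA: re-evaluate every NFQ after each function call."""
    calls_made = []
    while True:
        relevant = [f for q in nfqs for f in q(doc)]   # evaluate all NFQs
        if not relevant:                               # no relevant calls are left
            return doc, calls_made
        f = relevant[0]                                # pick one returned function
        doc = invoke(doc, f)                           # evaluate it, rewrite the document
        calls_made.append(f)

# Hypothetical service results: a call may itself return new calls.
RESULTS = {
    "call:getHotels": ["hotel", "call:getRating"],
    "call:getRating": ["rating"],
}

def invoke(doc, f):
    i = doc.index(f)
    return doc[:i] + RESULTS[f] + doc[i + 1:]

def any_call(doc):
    """Toy NFQ that treats every remaining call as relevant."""
    return [x for x in doc if x.startswith("call:")]

final_doc, calls = nfqa(["call:getHotels"], [any_call], invoke)
print(final_doc, calls)
```

The second call only becomes visible after the first one has been executed, which is why the NFQs have to be re-evaluated inside the loop. The sketch assumes the chain of calls is finite; nothing in it guards against services that keep returning new calls.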
29Improving NFQA
- Predict when NFQ results could not possibly have changed and avoid reevaluating them
- Identify dependences between NFQs and the effect of executing the functions they return
30Influence of NFQs
[Figure: two NFQs, NFQ1 and NFQ2, over the nyHotels query; surviving node label "Best Western"]
NFQ1 can influence NFQ2, but not vice versa
31Influence of NFQs
- NFQ1 may influence NFQ2 iff the output function node of NFQ1 is an ancestor (in the query tree) of the output node of NFQ2
- Two NFQs belong to the same layer if they may influence each other (directly or transitively)
- Inside every layer, we have to reevaluate every NFQ after every function call
- Multiple equivalent NFQs (i.e., in the same layer) can only exist under //: not knowing the output type, either node could appear as a descendant of the other. E.g., for //a and //b: in /a/b, //a matches /a and //b matches /a/b, while in /b/a, //b matches /b and //a matches /b/a
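Given a "may influence" relation, the layers are the groups of NFQs that can reach each other, and they can be ordered so that influencing layers come first. The sketch below assumes the relation is already available as a list of pairs; the NFQ names and the sample relation are hypothetical, and the function name `layers` is illustrative.

```python
def layers(nfqs, influences):
    """Group NFQs into layers (mutual influence) and order the layers so that
    a layer comes before every layer it may influence."""
    reach = {q: {q} for q in nfqs}
    for a, b in influences:
        reach[a].add(b)
    for k in nfqs:                               # Warshall-style transitive closure
        for i in nfqs:
            if k in reach[i]:
                reach[i] |= reach[k]
    seen, result = set(), []
    # An NFQ that influences another layer reaches strictly more NFQs,
    # so sorting by decreasing reach size yields a valid processing order.
    for q in sorted(nfqs, key=lambda q: -len(reach[q])):
        if q in seen:
            continue
        layer = [p for p in nfqs if p in reach[q] and q in reach[p]]
        seen.update(layer)
        result.append(layer)
    return result

nfqs = ["NFQ1", "NFQ2", "q_a", "q_b"]
influences = [("NFQ1", "NFQ2"), ("NFQ2", "q_a"), ("q_a", "q_b"), ("q_b", "q_a")]
order = layers(nfqs, influences)
print(order)
```

In the sample relation, `q_a` and `q_b` influence each other (as //a and //b do on the slide) and therefore share a layer, while NFQ1 and NFQ2 each form a layer of their own.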
32Influence of NFQs
- Layer L1 precedes layer L2 if some NFQ in L1 may influence (directly or transitively) some NFQ in L2
- We have to process L1 before L2 (without having to process L1 again afterwards)
- When processing of L1 has finished, the OR-nodes corresponding to the returned functions are redundant, and thus the NFQs in L2 can be simplified by removing them
33Parallelizing calls
- Let qlin be the linear path from the root to the output node of NFQ q, not inclusive (note: qlin is a regular expression)
- Two NFQs q, q' that belong to the same layer are independent iff there are no common words in the regular languages of qlin, q'lin
- E.g., //a and //b are independent
- But //a//c and //b//c are not (e.g., both match /a/b/c)
- If all NFQs in a layer are independent, we can call all functions returned by the same NFQ in a step of NFQA in parallel
- Other sufficient conditions could exist, too
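The independence test amounts to checking whether two path languages intersect. Below is a minimal sketch for paths built only from `/`, `//`, labels and `*`; it explores pairs of positions in the two paths, which is a simple product construction chosen here for illustration. The function names and the `#other` placeholder label are assumptions of this example.

```python
def parse(path):
    """'//a//c' -> [(True, 'a'), (True, 'c')]: one (is_descendant, label) pair per step."""
    steps, i = [], 0
    while i < len(path):
        desc = path.startswith("//", i)
        i += 2 if desc else 1
        j = i
        while j < len(path) and path[j] != "/":
            j += 1
        steps.append((desc, path[i:j]))
        i = j
    return steps

def moves(steps, i, x):
    """Positions reachable from position i (steps matched so far) on reading label x."""
    if i == len(steps):
        return set()                 # the path is fully matched; no further label allowed
    desc, label = steps[i]
    out = set()
    if label == x or label == "*":
        out.add(i + 1)               # x matches the current step
    if desc:
        out.add(i)                   # x is skipped by the // axis
    return out

def independent(p, q):
    """True iff no label sequence is matched by both linear paths."""
    a, b = parse(p), parse(q)
    alphabet = {l for _, l in a + b if l != "*"} | {"#other"}
    seen, todo = {(0, 0)}, [(0, 0)]
    while todo:
        i, j = todo.pop()
        if i == len(a) and j == len(b):
            return False             # a common word exists
        for x in alphabet:
            for s in moves(a, i, x):
                for t in moves(b, j, x):
                    if (s, t) not in seen:
                        seen.add((s, t))
                        todo.append((s, t))
    return True

print(independent("//a", "//b"), independent("//a//c", "//b//c"))
```

The two calls reproduce the slide's examples: //a and //b share no word because their last labels differ, whereas //a//c and //b//c both accept the label sequence a, b, c.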
34Outline
- Definitions
- Finding relevant calls
- Sequencing relevant calls
- Improving accuracy
- Reducing detection time
- Conclusions
- Discussion
35Using types
- Use the function return type to predict the shape of the data that a function call can return
- Similar to checking for the existence of a possible rewriting
- If this shape cannot match (the corresponding part of) the query pattern, the function call can be discarded
- In some cases, one can go further and restrict not only the output type but also the specific names of functions that could match
- Refined NFQs
  - Use the set of function names of appropriate return type instead of ()
  - Use F-guides (later) to make them even more refined
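A minimal sketch of the refinement step, assuming each function's return type is summarized as the set of element names it may produce. The function names appear elsewhere in these slides, but the return types in `SIGNATURES` are invented for illustration, as is the helper name `refine`.

```python
# Hypothetical signatures: function name -> element names its result may contain.
SIGNATURES = {
    "getHotels": {"hotel"},
    "getRating": {"rating"},
    "getNearbyRestos": {"restaurant"},
    "getNearbyMuseums": {"museum"},
}

def refine(expected_label, signatures=SIGNATURES):
    """Function names that could produce the element the query expects at this position."""
    return sorted(f for f, labels in signatures.items() if expected_label in labels)

print(refine("restaurant"))   # candidates to use instead of the generic ()
print(refine("address"))      # empty: no function can help at this position
```

Where the generic `()` test would retrieve all four functions, the refined test keeps only those whose declared output can match; an empty result means the corresponding NFQ needs no evaluation at all.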
36Refined NFQ example
[Figure, built up over slides 36-37: query tree with nodes nyHotels, hotel, nearby, name, rating and "Best Western"; slide 37 adds the function names getNearbyRestos and getRating]
38Pushing queries
- Similar to pushing selections onto scans in relational queries, or pushing queries to data sources in mediator systems
- Reduce the amount of (useless) data transferred (assuming functions correspond to remote (web) services) by filtering out irrelevant matches and projecting only on output variable nodes
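The effect of pushing a query to a service can be illustrated with a stand-in for a remote call. Everything in this sketch is hypothetical: the sample rows, the numeric ratings, and the `pushed` parameter (a predicate plus a list of fields to project) are assumptions made only to show the difference in transferred data.

```python
def get_nearby_restos(pushed=None):
    """Stand-in for a remote service call returning restaurant records."""
    rows = [
        {"name": "Chez Anna", "address": "12 Main St", "rating": 5},
        {"name": "Quick Bite", "address": "3 Side St", "rating": 2},
    ]
    if pushed is None:
        return rows                                  # everything is shipped to the caller
    keep, fields = pushed
    # Filter and project at the source, so only useful data is transferred.
    return [{f: r[f] for f in fields} for r in rows if keep(r)]

full = get_nearby_restos()
slim = get_nearby_restos(pushed=(lambda r: r["rating"] == 5, ["name"]))
print(len(full), slim)
```

Without the pushed query, the caller receives both complete records and discards most of the data locally; with it, only the matching name crosses the network.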
39Outline
- Definitions
- Finding relevant calls
- Sequencing relevant calls
- Improving accuracy
- Reducing detection time
- Conclusions
- Discussion
40Lenient rewriting
- Trade accuracy for efficiency
- Use XPath or LPQs instead of NFQs (faster processing)
- Use a lenient form of type checking (ignoring order and cardinality of elements)
41Function call guides
- Similar to dataguides, but for function calls
- One occurrence for each path that leads to some function node, plus pointers to the function nodes
- Paths that don't lead to functions are left out
[Figure, built up over slides 41-43: F-guide with pointers to getHotels calls, to getRating calls, and to getNearbyRestos and getNearbyMuseums calls]
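A function call guide of this kind can be sketched as a map from each path that leads to a function node to the list of function nodes found there. The tuple encoding of the document, the convention that function nodes have labels ending in `()`, and the name `build_fguide` are assumptions of this example; the document itself is hypothetical.

```python
# A document node is (label, children); function nodes have labels ending in "()".

def build_fguide(node, path="", guide=None):
    """Map each path leading to a function node to the function nodes under it."""
    guide = {} if guide is None else guide
    label, children = node
    if label.endswith("()"):
        guide.setdefault(path, []).append(node)      # keep a pointer to the call itself
    else:
        for c in children:
            build_fguide(c, path + "/" + label, guide)
    return guide

doc = ("nyHotels", [
    ("getHotels()", []),
    ("hotel", [
        ("name", []),
        ("rating", [("getRating()", [])]),
        ("nearby", [("getNearbyRestos()", []), ("getNearbyMuseums()", [])]),
    ]),
])

guide = build_fguide(doc)
for path, calls in guide.items():
    print(path, [c[0] for c in calls])
```

The path /nyHotels/hotel/name does not appear in the result because no function node sits below it, which keeps the guide much smaller than the document it summarizes.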
44Function call guides
- Use F-guides for
  - Generation of refined NFQs (use the return type within the appropriate F-guide part to get only function names that can indeed appear in the corresponding tree fragment)
  - Efficient approximation of relevant function nodes: evaluating queries (NFQs) on the F-guide roughly corresponds to evaluating queries on the original document using LPQs
  - Initial filtering: can get rid of NFQs for nodes that don't have any children in the F-guide
45Conclusions
- Active XML: interesting new area
  - Nothing fundamentally novel
  - Applies known tools (distributed processing, lazy evaluation) in a new context, giving new life to documents
  - Greatest challenge: formulate the right research questions well
  - Answers to these well-formulated questions are fairly easy
- Contributions of this paper
  - Formulates such an interesting question
  - Thorough understanding of different aspects of the problem (accuracy vs. performance and their effect on overall efficiency)
46Questions?