Title: Incremental Fusion of XML Fragments through Semantic Identifiers
1. Incremental Fusion of XML Fragments through Semantic Identifiers
Maged El-Sayed, Elke A. Rundensteiner, and Murali Mani
Database Systems Research Lab, Computer Science Department
Worcester Polytechnic Institute, Worcester, MA 01609-2280, USA
2. Motivation
- Views integrate data from different sources
- Source data may be available at different times
- View result to be computed incrementally
- Incremental results need to be fused
3. Applications Needing Fusion
- Incremental view maintenance
- Stream monitoring
- Data integration
- Data warehousing
- E-commerce
- . . .
4. Motivation: Fusion for XML
- Challenges in XML views:
  - Nesting via FLWR expressions
  - Complex result re-structuring
  - Order handling
  - ...
5. Running Example: XML Data
bib.xml:
  <bib>
    <book year="1994">
      <title>TCP/IP Illustrated</title>
      <author><last>Stevens</last><first>W.</first></author>
    </book>
    <book year="2000">
      <title>Data on the Web</title>
      <author><last>Abiteboul</last><first>Serge</first></author>
    </book>
  </bib>
prices.xml:
  <prices>
    <entry>
      <price>39.95</price>
      <b-title>Data on the Web</b-title>
    </entry>
    <entry>
      <price>65.95</price>
      <b-title>TCP/IP Illustrated</b-title>
    </entry>
    <entry>
      <price>69.99</price>
      <b-title>Advanced Programming in the Unix environment</b-title>
    </entry>
  </prices>
6. Example: XML View
  <result>
  {
    FOR $y in distinct-values(doc("bib.xml")/bib/book/@year)
    ORDER BY $y
    RETURN
      <yGroup Y="{$y}">
        <books>
        {
          FOR $b in doc("bib.xml")/bib/book,
              $e in doc("prices.xml")/prices/entry
          WHERE $y = $b/@year and $b/title = $e/b-title
          RETURN <entry>{$b/title} {$e/price}</entry>
        }
        </books>
      </yGroup>
  }
  </result>
[Figure: the view result tree, shown above the bib.xml and prices.xml source trees it is computed from]
  result
    yGroup Y=1994
      books
        entry: title "TCP/IP", price 65.95
    yGroup Y=2000
      books
        entry: title "Data on..", price 39.95
7. Example: Incremental Updates
- What?
- How?
8. Example: Alternatives for Fusion
[Figure: a new fragment to be fused into the result: result > yGroup > books > entry (title "Advanced...", price 69.99)]
- We need to decide for each node:
  - Where to add
  - What to merge
  - Which order to impose
9. Outline
- Semantic Identifier Generation
  - What are they?
- Fusion of XML Results through Semantic Ids
  - How to use them?
- Experimental Evaluation
- Related Work
- Conclusions
10. Overall Solution: Id-based Fusion
- Goal: decide how to merge processed fragments with the XML result with respect to location and order.
- Idea: assign semantic ids to nodes in XML results.
- Semantic ids must be reproducible:
  - when processing two source XML nodes that contribute to the same node in the XML result, the same id is generated,
  - even when the source nodes are not always equal in value or id, and
  - when processing at different times.
11. Overall Solution: Id-based Fusion
- Options
- Syntactic Approach
- Algebraic Approach
12. Background: XML Query Model
- XQuery -> XAT algebra tree [ZPR02]
- XAT Operators
  - XAT Relational Operators: Select, Join
  - XAT XML Operators: Navigate Unnest, Navigate Collection, Tagger, Combine
- XAT Data Model (XAT Table)
  - Order-sensitive table of tuples
  - Columns represent user-specified or internally generated variable bindings
  - Each cell in a tuple holds an XML node or a sequence of XML nodes
- Example operator: Navigate $b, @year/text() -> col1
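To make the XAT table model above concrete, here is a minimal Python sketch. It assumes a table can be modeled as an ordered list of dictionaries (one per tuple, keyed by column name) and that the two Navigate operators differ only in whether the reached nodes are unnested into separate tuples or kept as one sequence in a single cell. The function names `navigate_unnest` and `navigate_collection` and the reduced `BIB_XML` sample are illustrative assumptions, not the Rainbow engine's API.

```python
import xml.etree.ElementTree as ET

# Reduced version of bib.xml from the running example (authors omitted).
BIB_XML = """<bib>
  <book year="1994"><title>TCP/IP Illustrated</title></book>
  <book year="2000"><title>Data on the Web</title></book>
</bib>"""


def navigate_unnest(table, in_col, path, out_col):
    """Emit one output tuple per node reached from row[in_col] via path."""
    return [{**row, out_col: node}
            for row in table
            for node in row[in_col].findall(path)]


def navigate_collection(table, in_col, path, out_col):
    """Emit one output tuple per input tuple; the new cell holds a node sequence."""
    return [{**row, out_col: row[in_col].findall(path)} for row in table]


# A one-tuple table binding the source document to column S1.
source = [{"S1": ET.fromstring(BIB_XML)}]
books = navigate_unnest(source, "S1", "book", "$b")          # one tuple per book
titles = navigate_collection(books, "$b", "title", "col2")   # sequence per tuple

for row in titles:
    print(row["$b"].get("year"), [t.text for t in row["col2"]])
```

Because the table is a plain list, tuple order follows document order of the navigated nodes, which mirrors the "order-sensitive" property listed on the slide.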
13. Background: Base Node IDs
- Fast Lexicographical Key [DR03]
- Encodes:
  - Node hierarchy
  - Node order
[Figure: the bib.xml tree (with its update Δbib.xml) and the prices.xml tree, each node annotated with its base node id]
- bib.xml: bib = b; book = b.b, b.f, b.l; title = b.b.b, b.f.b, b.l.b; author = b.b.f, b.f.f, b.l.f; the last and first children of an author carry four-part keys such as b.b.f.b and b.b.f.f
- prices.xml: prices = e; entry = e.b, e.f, e.l; price = e.b.b, e.f.b, e.l.b; b-title = e.b.f, e.f.f, e.l.f
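The two properties named on this slide (hierarchy and order) can be illustrated with a small Python sketch. It assumes keys are dot-separated strings like those in the running example (b, b.b, b.f.b), that a node's key extends its parent's key by one segment, and that comparing keys segment by segment yields document order. The helper names are hypothetical and this is not the MASS implementation.

```python
def parent(key):
    """Return the parent's key, or None for a root key such as 'b'."""
    head, sep, _ = key.rpartition(".")
    return head if sep else None


def is_ancestor(ancestor, descendant):
    """A node is an ancestor when its key is a proper segment-wise prefix."""
    return descendant.startswith(ancestor + ".")


def document_order(keys):
    """Sort keys segment by segment; a parent sorts before its children."""
    return sorted(keys, key=lambda k: k.split("."))


keys = ["b.l", "b.f.b", "b", "b.b", "b.f", "b.b.b"]
print(document_order(keys))
print(parent("b.f.b"), is_ancestor("b.f", "b.f.f.b"))
```

Under these assumptions no access to the document is needed to decide ancestry or relative order of two nodes; the keys alone carry both.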
14. Background: XAT Algebra Tree
[Figure: XAT algebra tree (21 operators) for the example view of slide 6; operators listed from the root down to the sources]
- Expose col8
- Tagger <result>col7</result> -> col8
- Combine col7
- Tagger <yGroup Y=$y>col6</yGroup> -> col7
- OrderBy $y
- Tagger <books>col5</books> -> col6
- GroupBy $y (Combine col5)
- Tagger <entry>col4</entry> -> col5
- XML Union col2, col3 -> col4
- Navigate $e, price -> col3
- Navigate $b, title -> col2
- Join $b/title = $e/b-title
- Navigate S2, entry -> $e
- Left Outer Join (LOJ) $y = col1
- Distinct($y)
- Navigate $b, @year/text() -> col1
- Navigate S1, book/@year/text() -> $y
- Navigate S1, book -> $b
- Source bib.xml -> S1 (two occurrences)
- Source prices.xml -> S2
15. Semantic Identifier Generation
- Phase One: Compute Context Schema
  - What: rules define how to compute node lineage and order
  - When: computed at algebra tree generation time
  - Where: defined at the schema level of the algebra tree (XAT table columns)
  - Note: no access to actual data needed
- Phase Two: Generate Semantic ids
  - What: use the Context Schema to decide how to generate or manipulate semantic ids
  - When: performed at query execution time
  - Where: only some operators manipulate semantic ids of XML nodes (e.g., Tagger, XML Union, etc.)
16. Phase 1: Context Schema Computation
- Define one context schema for each XAT column coli
- Composed of two lists: an order list and a lineage list
  - e.g., coli.cnxtSchm = (ordCols)[lngCols]
- The order list can be:
  - Null: the column has no order
  - Empty: order is reflected by the lineage list
  - One or more column names, reflecting the columns that specify order
- The lineage list can be:
  - Empty: the lineage of the column depends only on itself
  - One or more column names, reflecting the columns that specify lineage
- The rules for computing the Context Schema are specific to each algebraic operator
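The cases listed above can be captured in a small data structure. The following Python sketch is only an illustration of the three states of the order list (null, empty, non-empty) and the two states of the lineage list; the class name `ContextSchema`, its fields, and the sample column names are assumptions, and the per-operator computation rules are not modeled because the slides do not spell them out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ContextSchema:
    # None  -> the column has no order
    # ()    -> order is reflected by the lineage list
    # (...) -> names of the columns that specify order
    order: Optional[Tuple[str, ...]]
    # ()    -> lineage depends only on the column itself
    # (...) -> names of the columns that specify lineage
    lineage: Tuple[str, ...] = ()

    def describe(self, column: str) -> str:
        if self.order is None:
            order = "no order"
        elif not self.order:
            order = "order reflected by lineage"
        else:
            order = "ordered by " + ", ".join(self.order)
        lineage = ", ".join(self.lineage) if self.lineage else column + " itself"
        return f"{column}: {order}; lineage from {lineage}"


# Hypothetical schemas, chosen only to exercise each case.
print(ContextSchema(order=None).describe("col1"))
print(ContextSchema(order=(), lineage=("$b", "$e")).describe("col5"))
print(ContextSchema(order=("$y",), lineage=("$y",)).describe("col7"))
```

Distinguishing `None` from the empty tuple keeps "no order" separate from "order follows lineage", which is the distinction the slide draws between a null and an empty order list.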
17. Context Schema Example
18. Context Schema Example
[Figure: lower part of the XAT algebra tree (from the Source operators up to the Tagger producing col5), annotated with context schemas]
19. Context Schema Example (cont.)
[Figure: upper part of the XAT algebra tree (from GroupBy up to Expose), annotated with context schemas]
20. Phase 2: Semantic Identifier Generation
- Based on the Context Schema, generate ids for nodes in the XML result.
- Format: <order prefix><body>
- Order prefix (optional)
  - What: reflects local order or no order ()
  - How: a composition of source node ids and/or values, or a constant
- Body
  - What: reflects lineage (and possibly order)
  - How: a composition of source node ids, values, and/or a constant
- Properties of semantic identifiers
  - Reproducible
  - Compact
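As a rough illustration of the `<order prefix><body>` format, the Python sketch below composes an id from a list of body components and an optional order prefix. The `..` separator and the parenthesised prefix are assumptions loosely modeled on the ids shown in the deck's figures (for example `(a)b.b.b`); the function name `semantic_id` is hypothetical and the actual encoding used by the authors may differ.

```python
from typing import Optional, Sequence

SEP = ".."  # assumed separator between body components


def semantic_id(body: Sequence[str], order_prefix: Optional[str] = None) -> str:
    """Compose an id from lineage components and an optional order prefix."""
    if not body:
        raise ValueError("body needs at least one component")
    core = SEP.join(body)
    return core if order_prefix is None else f"({order_prefix}){core}"


# entry node: lineage is the joined book (b.b) and price entry (e.f)
entry_id = semantic_id(["b.b", "e.f"])
# title and price inside an entry: a constant prefix fixes their local order
title_id = semantic_id(["b.b.b"], order_prefix="a")
price_id = semantic_id(["e.f.b"], order_prefix="b")
# yGroup node: built from the year value plus a constant
group_id = semantic_id(["1994", "c"])

print(entry_id, title_id, price_id, group_id)
```

The function depends only on its inputs, so the same source ids or values always yield the same id no matter when the fragment is processed. The yGroup id is built from the year value rather than from a node id, so any book published in 1994 maps to the same group.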
21. Semantic Identifiers Example
[Figure: lower part of the XAT algebra tree (from the Source operators up to the Tagger producing col5), annotated with the semantic ids generated at each operator]
22. Semantic Identifiers Example (cont.)
[Figure: upper part of the XAT algebra tree (from GroupBy up to Expose), annotated with the semantic ids generated at each operator]
23. Semantic Ids Example (cont.)
[Figure: query result annotated with semantic ids (ids in brackets)]
  result [c]
    yGroup Y=1994 [1994 c]
      books [1994c]
        entry [b.b..e.fc]
          title "TCP/IP" [(a)b.b.b]
          price "65.95" [(b)e.f.b]
    yGroup Y=2000 [2000 c]
      books [2000c]
        entry [b.f..e.bc]
          title "Data .." [(a)b.f.b]
          price "39.95" [(b)e.b.b]
24. Fusing XML Results Through Semantic Ids
- Deep Union operator
  - Unions two XML trees by matching their root nodes using semantic ids,
  - and recursively performs deep union on their respective lists of child nodes [BDT99]
- Our XML views become distributive over Deep Union:
  - $V(S \uplus \Delta S) = V(S) \uplus V(\Delta S)$, where $\uplus$ denotes Deep Union and $\Delta S$ an update to source $S$
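The Deep Union described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' operator: nodes are matched purely by semantic-id equality, siblings are kept in lexicographic order of their ids, and the id strings (`1994..c`, `b.b..e.f`, `(a)b.b.b`) are simplified stand-ins for the ids in the running example. `Node`, `deep_union`, `entry`, and `ygroup` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    sid: str                      # semantic id (illustrative format)
    tag: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def deep_union(a: Node, b: Node) -> Node:
    """Fuse two trees whose roots carry the same semantic id."""
    if a.sid != b.sid:
        raise ValueError("roots must share a semantic id")
    merged = {child.sid: child for child in a.children}
    for child in b.children:
        if child.sid in merged:           # same id: fuse recursively
            merged[child.sid] = deep_union(merged[child.sid], child)
        else:                             # new id: add as a new child
            merged[child.sid] = child
    ordered = [merged[sid] for sid in sorted(merged)]  # assumed: id order = sibling order
    text = a.text if a.text is not None else b.text
    return Node(a.sid, a.tag, text, ordered)


def entry(sid, title_sid, title, price_sid, price):
    return Node(sid, "entry", children=[
        Node("(a)" + title_sid, "title", title),
        Node("(b)" + price_sid, "price", price),
    ])


def ygroup(year, *entries):
    books = Node(f"{year}..c", "books", children=list(entries))
    return Node(f"{year}..c", "yGroup", children=[books])


tcp = ("b.b..e.f", "b.b.b", "TCP/IP", "e.f.b", "65.95")
old = Node("c", "result", children=[
    ygroup("1994", entry(*tcp)),
    ygroup("2000", entry("b.f..e.b", "b.f.b", "Data on the Web", "e.b.b", "39.95")),
])
delta1 = Node("c", "result", children=[ygroup("1994", entry(*tcp))])
delta2 = Node("c", "result", children=[
    ygroup("1994", entry("b.l..e.l", "b.l.b", "Advanced...", "e.l.b", "69.99")),
])

fused = deep_union(deep_union(old, delta1), delta2)
```

In this sketch the entry that appears in both `old` and `delta1` is fused into one node because its id is identical, while the new entry from `delta2` is added under the existing 1994 group. Sorting by id is a simplification standing in for the order information that the order prefix carries.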
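25. Fusing XML Results Through Semantic Ids
- For our running example:
  - $V(S_1 \uplus \Delta S_1,\ S_1 \uplus \Delta S_1,\ S_2) = V(S_1, S_1, S_2) \uplus V(\Delta S_1, S_1, S_2) \uplus V(S_1, \Delta S_1, S_2) \uplus V(\Delta S_1, \Delta S_1, S_2)$
- Note that $V(S_1, \Delta S_1, S_2) \uplus V(\Delta S_1, \Delta S_1, S_2) = V(S_1', \Delta S_1, S_2)$, where $S_1' = S_1 \uplus \Delta S_1$
- Here $\uplus$ denotes Deep Union, and $\Delta S_1$ is the update to source $S_1$ (bib.xml, which the view reads twice).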
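The four-term expansion can be obtained by applying distributivity to one argument at a time. The derivation below is a sketch that assumes the view distributes over Deep Union (written $\uplus$) in each source argument separately, with $\Delta S_1$ the update to $S_1$ and $S_1' = S_1 \uplus \Delta S_1$ the updated source.

```latex
\begin{align*}
V(S_1', S_1', S_2)
  &= V(S_1 \uplus \Delta S_1,\ S_1',\ S_2) \\
  &= V(S_1, S_1', S_2) \uplus V(\Delta S_1, S_1', S_2) \\
  &= V(S_1, S_1, S_2) \uplus V(S_1, \Delta S_1, S_2)
     \uplus V(\Delta S_1, S_1, S_2) \uplus V(\Delta S_1, \Delta S_1, S_2)
\end{align*}
```

Reading the same rule from right to left in the first argument groups two of the terms, $V(S_1, \Delta S_1, S_2) \uplus V(\Delta S_1, \Delta S_1, S_2) = V(S_1', \Delta S_1, S_2)$, so only $V(\Delta S_1, S_1, S_2)$ and $V(S_1', \Delta S_1, S_2)$ need to be computed and fused with the existing result $V(S_1, S_1, S_2)$.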
26. Example: Fusing Incremental Results
[Figure: three result trees annotated with semantic ids (in brackets), to be fused by Deep Union]
V(ΔS1, S1, S2):
  result [c]
    yGroup Y=1994 [1994 c]
      books [1994c]
        entry [b.b..e.fc]
          title "TCP/IP" [(a)b.b.b]
          price "65.95" [(b)e.f.b]
V(S1, ΔS1, S2):
  result [c]
    yGroup Y=1994 [1994 c]
      books [1994c]
        entry [b.l..e.lc]
          title "Advanced..." [(a)b.l.b]
          price "69.99" [(b)e.l.b]
V(S1, S1, S2):
  result [c]
    yGroup Y=1994 [1994 c]
      books [1994c]
        entry [b.b..e.fc]
          title "TCP/IP" [(a)b.b.b]
          price "65.95" [(b)e.f.b]
    yGroup Y=2000 [2000 c]
      books [2000c]
        entry [b.f..e.bc]
          title "Data .." [(a)b.f.b]
          price "39.95" [(b)e.b.b]
27. Experimental Evaluation
- Solution: implemented within the Rainbow Java XML query engine [Zetal03]
- Data: XMark benchmark [SEBC02]
- Queries: the queries vary in their id generation
28. Experimental Results: Query 1
Query 1:
  <result>
    <customers>
    {
      for $p in doc("site.xml")/people/person
      where $p/id/text() lt 63750
      return
        <customer>
          <location>{$p/address/city/text()}</location>
          {$p/name}
        </customer>
    }
    </customers>
    <open_bids>
    {
      for $oa in doc("site.xml")/open_auctions/open_auction
      where $oa/id/text() lt 30000
      return <bid>{$oa/reserve} {$oa/initial}</bid>
    }
    </open_bids>
  </result>
29. Experimental Results: Query 1
30. Experimental Results: Query 2
Query 2:
  <result>
  {
    for $p in doc("site.xml")/people/person
    where $p/id/text() lt 63750
    return $p/name
  }
  </result>
31. Related Work
- View Maintenance
  - Materialization of auxiliary data [AMR98, AFP03, ZG98]
    - (Auxiliary data requires maintenance, no order support)
  - Reproducible ids [LD2000]
    - (Complex ids with nested structure, places limitations on maintainable views, no order support)
  - Skolem functions [PAG96, BCF04]
    - (Does not support order)
- XML Stream Processing
  - Structure encoding [IHW02]
    - (Limited queries, no support for correlated nested queries)
  - Special ids [FLBC02]
    - (Predefined decomposition of the streamed document and id assignment)
32. Conclusions
- Phase One: Compute Context Schema
- Phase Two: Generate Semantic ids
33. Rainbow XQuery Engine
- Website: http://davis.wpi.edu/dsrg/rainbow/index.html
- Software download: http://davis.wpi.edu/dsrg/rainbow/RainbowCore/release.htm
- Maged can also be contacted at maged@cs.wpi.edu
34. References
- [ZPR02] X. Zhang, B. Pielech, and E. A. Rundensteiner. Honey, I Shrunk the XQuery! An XML Algebra Optimization Approach. In WIDM, pages 15-22, Nov. 2002.
- [DR03] K. Deschler and E. Rundensteiner. MASS: A multi-axis storage structure for large XML documents. In CIKM, pages 520-523, Nov. 2003.
- [BDT99] P. Buneman, A. Deutsch, and W. C. Tan. A deterministic model for semi-structured data. In Workshop on Query Processing for Semistructured Data and Non-Standard Data Formats, Jan. 1999.
- [AMR98] S. Abiteboul et al. Incremental Maintenance for Materialized Views over Semistructured Data. In VLDB, pages 38-49, 1998.
- [AFP03] M. A. Ali, A. Fernandes, and N. W. Paton. MOVIE: An incremental maintenance system for materialized object views. DKE Journal, 47(2):131-166, 2003.
- [ZG98] Y. Zhuge and H. Garcia-Molina. Graph Structured Views and Their Incremental Maintenance. In ICDE, pages 116-125, 1998.
- [LD2000] H. Liefke and S. B. Davidson. View maintenance for hierarchical semistructured data. In DWKD, pages 114-125, 2000.
- [PAG96] Y. Papakonstantinou et al. Object fusion in mediator systems. In VLDB, pages 413-424, 1996.
- [BCF04] P. Bohannon, B. Choi, and W. Fan. Incremental evaluation of schema-directed XML publishing. In SIGMOD, pages 503-514, 2004.
- [IHW02] Z. G. Ives et al. An XML query engine for network-bound data. The VLDB Journal, 11(4):402-402, December 2002.
- [FLBC02] L. Fegaras et al. Query processing of streamed XML data. In CIKM, pages 126-133, 2002.