Title: Part 4: Compressing XML Data
1 Managing XML and Semistructured Data
- Part 4: Compressing XML Data
2 In this section
- XML Compression
  - Motivation
  - The State-of-the-Art
    - Queriable compressors
    - Non-queriable compressors
- Resources
  - XMILL: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2000
  - Others: XGrind, XPress, XQueC, XMLzip
  - XCQ: from my publications
  - XQzip: from my publications
  - MQX: from my publications
3 Introduction
- More and more XML data is being created
  - Duplicate structures (tags, paths)
  - Data inflation: data in XML is much larger than the raw data
- Compression: for storage and data transfer
- General-purpose compressors (e.g. gzip)
  - Characteristics of XML data are not utilized
  - Unqueriable
4 Compression: The Problem
- XML for exchange (space or time)
- But XML is verbose and inflated due to
  - Duplicated tags and paths
- Users prefer application-specific formats
  - E.g. Web server logs
- Is XML doomed to fail?
- Solution: an XML-specific compressor
  - Non-queriable: XMill
  - Queriable: XQzip
5 XML-Specific Compressors
- Unqueriable compression (e.g. XMill)
  - Full-chunked: data commonalities eliminated
  - Very good compression ratio
- Queriable compression (e.g. XGrind, XPRESS)
  - Fine-grained: data commonalities ignored
  - Inadequate compression ratio and time
  - Supports simple path queries with atomic predicates
6 Issues in XML Compression
- Compression ratio, compression time, query coverage, memory usage (see my survey paper in WWWJ)
Comparison of existing technologies
7 An Example: Web Server Logs
ASCII file: 15.9 MB (gzipped: 1.6 MB)
202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized Apache web log inflates to 24.2 MB (gzipped: 2.1 MB)
<apache:entry>
  <apache:host> 202.239.238.16 </apache:host>
  <apache:requestLine> GET / HTTP/1.0 </apache:requestLine>
  <apache:contentType> text/html </apache:contentType>
  <apache:statusCode> 200 </apache:statusCode>
  <apache:date> 1997/10/01-00:00:02 </apache:date>
  <apache:byteCount> 4478 </apache:byteCount>
  <apache:referer> http://www.net.jp/ </apache:referer>
  <apache:userAgent> Mozilla/3.1[ja](I) </apache:userAgent>
</apache:entry>
8 XMill
- First specialized compressor for XML data
  - SAX parser for parsing XML data
  - Still uses gzip as its underlying compressor
  - Clever grouping of data into containers for compression
- Compresses XML via three basic techniques
  - Compress the structure separately from the data
  - Group the data values according to their types
  - Apply semantic (specialized) compressors
- Downloadable
  - www.cs.washington.edu/homes/suciu/XMILL
9 XMill Architecture
10 How XMill Works: Three Ideas
Compress the structure separately from the data
- Structure (gzip): <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
- Data (gzip): 202.239.238.16 GET / HTTP/1.0 text/html 200
- Result: 1.75 MB
11 How XMill Works: Three Ideas
Group the data values according to their types
- Structure (gzip): <apache:entry> . . . </apache:entry>
- Data1 (gzip): 202.23.23.16 224.42.24.55
- Data2 (gzip): GET / HTTP/1.0 GET / HTTP/1.1
- Result: 1.33 MB
12 How XMill Works: Three Ideas
Apply semantic (specialized) compressors
- Examples
  - 8-, 16-, 32-bit integer encoding (signed/unsigned)
  - Differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
  - Compress lists and records (e.g. 104.32.23.1 → 4 bytes)
- Needs user input to select the semantic compressor
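As a rough illustration of two of the semantic compressors listed above, the Python sketch below packs a dotted IPv4 address into 4 bytes and delta-encodes a list of years. The function names are made up for this example and are not XMill's API; XMill's actual encoders are not shown in these slides.

```python
def pack_ipv4(address):
    """Store a dotted-quad IPv4 address in 4 bytes instead of up to 15 characters."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers")
    return bytes(parts)  # bytes() rejects values outside 0..255


def unpack_ipv4(raw):
    """Inverse of pack_ipv4."""
    return ".".join(str(b) for b in raw)


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


years = [1999, 1995, 2001, 2000, 1995]
print(len(pack_ipv4("104.32.23.1")))  # 4
print(delta_encode(years))            # [1999, -4, 6, -1, -5]
```

The small differences produced by the delta step are what make a later fixed-width integer encoding or gzip pass more effective.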
13 Path Processor: Structure Container
<Book><Title lang="English">Data Compression</Title> <Author>Gray</Author> <Author>Reiter</Author> </Book>
Dictionary: one more entry for each new word
Less storage: 14 bytes!
- Replace each data value with its container number (negative integer)
- Replace each end tag with 0
- Replace tags/attributes with positive integers
<Book><Title lang=-1>-2</Title> <Author>-3</Author> <Author>-3</Author> </Book>
<Book><Title lang=-1 0>-2 0 <Author>-3 0 <Author>-3 0 0
Dictionary: Book = 1, Title = 2, @lang = 3, Author = 4
Structure container: 1 2 3 -1 0 -2 0 4 -3 0 4 -3 0 0
Repeated structure entries can be compressed effectively!
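The three replacement rules above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: containers are keyed by tag or attribute name (XMill itself lets the user choose container paths), text after a child element is ignored, and the function name `encode_structure` is invented for the example.

```python
import xml.etree.ElementTree as ET


def encode_structure(xml_text):
    """Return (dictionary, data containers, structure token list)."""
    tags = {}        # tag or "@attribute" name -> positive id
    containers = {}  # container key -> container number (1, 2, 3, ...)
    data = {}        # container number -> list of values
    stream = []      # the structure container

    def tag_id(name):
        # A new dictionary entry is created the first time a name is seen.
        return tags.setdefault(name, len(tags) + 1)

    def emit_value(key, value):
        number = containers.setdefault(key, len(containers) + 1)
        data.setdefault(number, []).append(value)
        stream.append(-number)  # data value -> negative container number

    def walk(elem):
        stream.append(tag_id(elem.tag))
        for name, value in elem.attrib.items():
            stream.append(tag_id("@" + name))
            emit_value("@" + name, value)
            stream.append(0)    # end of attribute
        text = (elem.text or "").strip()
        if text:
            emit_value(elem.tag, text)
        for child in elem:
            walk(child)
        stream.append(0)        # end tag -> 0

    walk(ET.fromstring(xml_text))
    return tags, data, stream


doc = ('<Book><Title lang="English">Data Compression</Title>'
       '<Author>Gray</Author><Author>Reiter</Author></Book>')
tags, data, stream = encode_structure(doc)
print(tags)    # {'Book': 1, 'Title': 2, '@lang': 3, 'Author': 4}
print(stream)  # [1, 2, 3, -1, 0, -2, 0, 4, -3, 0, 4, -3, 0, 0]
```

The 14 tokens match the structure container on the slide; each fits in one byte, and the data values themselves end up grouped per container, ready to be compressed separately.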
14 XML Compression
XMill: evaluation using XML datasets
15 Queriable Compressors
- XQzip: queriable XML compressor (our work, EDBT'04)
- Existing XML compressors (survey in WWWJ'05)
  - Unqueriable (e.g. XMill, SIGMOD'00): exploit data commonalities (better compression rate than gzip)
  - Queriable (e.g. XGrind ICDE'02, XPRESS SIGMOD'03, XQueC, XQzip EDBT'04, XCQ KAISJ'05): compress data individually (inadequate compression rate and time)
- Features of XQzip
  - Uses the SIT to aid query evaluation
  - Block compression allows data commonalities to be exploited, and the blocks are used as buffers to reduce decompression overhead
16 Structure Index Tree (SIT)
- Effective elimination of duplicate structures in the XML data
- Merging of nodes that have
  - the same incoming path
  - the same ordered set of paths of their descendants
- SIT construction
  - A linear scan of the XML document
  - Merging of the subtree under construction into its equivalent subtree in the base tree
17 SIT Construction
[Figure: step-by-step SIT construction on an example tree with root /, elements a, b, c, d, e and node ids 0 to 10; merged SIT nodes carry the ids of the nodes they absorb, e.g. "2,7", "5,6" and "8,10".]
18 XQzip Architecture
- Index Constructor: constructs the SIT
- Compressor
  - Groups semantically related items into blocks
  - Compresses each block with gzip
- Query Processor: evaluates queries
  - Parser
  - Executor: applies the SIT to evaluate the query
  - Buffer Manager (LRU)
19 SIT Construction Complexity
- N: total number of elements in the input XML document
- Time complexity
  - Worst case: O(N |SIT|)
  - Average case: O(N)
- Space complexity
  - Base tree and the subtree being merged: 2|SIT|
  - Space for storing the ids of eliminated nodes: O(N)
20 Data Compression
- A balance between full-chunked and fine-grained compression
- A distinct data container for each distinct element
- Each container is compressed (using gzip) into many smaller blocks
- Block size?
  - Too small: query time ? compression ratio ?
  - Too large: query time ? compression ratio ?
  - Can only be determined by an empirical study
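To make the trade-off concrete, here is a minimal Python sketch of a block-compressed container with an LRU buffer of decompressed blocks. It is an assumption-laden illustration, not XQzip's code: `zlib` stands in for gzip (both use DEFLATE), items are assumed to contain no newline characters, and the class name and parameters are invented for the example.

```python
import zlib
from functools import lru_cache


class BlockContainer:
    """Data container split into independently compressed blocks."""

    def __init__(self, items, block_size=1000, buffer_blocks=4):
        self.block_size = block_size
        self.blocks = [
            zlib.compress("\n".join(items[i:i + block_size]).encode("utf-8"))
            for i in range(0, len(items), block_size)
        ]
        self.decompressions = 0
        # LRU buffer: recently decompressed blocks are kept in memory.
        self._load = lru_cache(maxsize=buffer_blocks)(self._decompress)

    def _decompress(self, block_no):
        self.decompressions += 1
        raw = zlib.decompress(self.blocks[block_no])
        return raw.decode("utf-8").split("\n")

    def get(self, index):
        """Fetch one item, decompressing only the block that holds it."""
        block_no, offset = divmod(index, self.block_size)
        return self._load(block_no)[offset]


container = BlockContainer([f"item{i}" for i in range(2500)], block_size=1000)
print(len(container.blocks))     # 3
print(container.get(1500))       # item1500
print(container.get(1501))       # item1501, served from the buffer
print(container.decompressions)  # 1
```

Smaller blocks mean less data to decompress per lookup but less redundancy for the compressor to exploit, which is why the slides settle the block size empirically.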
21 Block Size
- Representative datasets and queries
- Datasets
  - Heavy text
  - Light text
  - A mix of heavy text and light text
- Queries
  - High selectivity
  - Medium selectivity
  - Low selectivity
22 Block Size
23 Structure of Compressed Data
- Block size?
  - Determined by an empirical study
- Querying time
  - Near-optimal range: 600-1000 data items/block (average optimum: 950)
- Compression ratio
  - Not improved much beyond 150 KB/block (which usually contains more than 1000 items)
- Hence: 1000 data items/block
24 Outline
- Introduction
- XQzip (EDBT 2004)
  - Indexing
  - Data Compression
  - Query Evaluation
  - Performance Evaluation
- Conclusion
25 XQzip Query Coverage
- All XPath axes except the sideways axes (e.g. preceding-sibling, following-sibling)
- Multiple and nested predicates
- and / or / not expressions
- Aggregations: sum, count, average, max, min
- Group queries, e.g. (L1 (L2 L3 L4))
  - L1: //a[b = "Crete"] (prefix); L2: c
  - L3: d[f/count() > 100]; L4: e//g
26 Query Evaluation
- Depth-first traversal of the index tree
- Buffer management (LRU)
  - Why buffering? Decompression time dominates
  - Decompression avoidance
28 Effectiveness of the SIT
Data Source    Node Reduction    Load Time    Node Selection Acceleration
XMark          1.64              0.67s        2.15
OMIM           0.24              0.07s        2.16
DBLP           0.04              1.62s        2.11
SwissProt      28.38             5.61s        1.92
Treebank       93.42             2.26s        1.76
PSD            10.85             9.97s        2.18
Shakespeare    1.96              0.07s        2.10
Lineitem       0.002             0.42s        1.78
29 Effectiveness of the SIT
- Index size: less than 1% of the original size
- Load time: a fraction of a second
- Node selection acceleration: twice as fast as the FB-Index
- Construction time: more than 3 times faster than the FB-Index
30 Compression Ratio
XQzip is comparable to XMill and gzip, 17% better than XGrind with the index size included, and 42% better than XGrind without the index.
31 Compression/Decompression Time
- XQzip (compression + index construction) is more than 5 times better than XGrind and 1.5 times worse than XMill
- XQzip (index loading + decompression) is more than 3 times better than XGrind and 1.4 times worse than XMill
32
All times are in seconds.
Data Source      Query   Node Selecting   Partial Decomp.   Result Processing   Querying Time   Querying Time   Querying Time
                         Time             Time              Time                (XQzip-)        (XQzip)         (XGrind)
XMark (111MB)    Q1      0.001            ---               0.911               0.913           0.122           22.774
                 Q2      0.001            0.920             0.012               0.934           0.295           23.067
                 Q3      0.001            3.395             0.014               3.411           0.349           35.012
                 Q4      0.003            ---               0.551               0.584           0.118           ---
                 Q5      0.831            4.534             0.010               5.376           1.544           ---
OMIM (24.5MB)    Q1      0.001            ---               0.030               0.032           0.005           3.513
                 Q2      0.001            0.021             0.011               0.034           0.014           4.690
                 Q3      0.001            0.036             0.057               0.095           0.067           6.134
                 Q4      0.005            ---               ---                 0.005           0.005           ---
                 Q5      0.012            0.020             0.580               0.613           0.034           ---
DBLP (148MB)     Q1      0.001            ---               0.370               0.381           0.034           19.582
                 Q2      0.001            0.330             0.013               0.345           0.029           26.108
                 Q3      0.033            0.391             8.997               9.541           1.543           50.344
                 Q4      0.001            ---               0.000               0.001           0.001           ---
                 Q5      0.087            1.122             0.260               1.481           0.642           ---
33 Query Performance
- Cold buffer-pool evaluation
  - 13 times better than XGrind
- Warm buffer-pool evaluation
  - 80 times better than XGrind
- Impressive buffer effect!
34 Lessons on XML Compression
- Good compression ratio and time
  - Comparable to that of XMill
  - Much better than that of XGrind (and XPRESS)
- Supports a very practical set of queries
  - A much wider range of queries than XGrind and XPRESS
- Very competitive querying time with buffering
  - 13 times better than XGrind with a cold buffer
  - 80 times better than XGrind with a warm buffer
- Limitations
  - Cost of building and maintaining complex indexes
  - No theoretical foundation for the block size
35 XCQ
- XCQ Framework
- Experimental Results
  - Compression Performance
  - Query Performance
- Lessons and Development
36 XCQ
- Objectives
  - Achieve a good compression ratio
    - Comparable to XMill
    - Better than XGrind
  - Achieve good query performance
    - More efficient than XGrind
    - Querying compressed documents with block-based partial decompression
- But addressing issues different from XQzip
  - Adopt minimal indexing
  - Establish a theory relating selectivity and block size
37 XCQ Strategy
- Based on four techniques
  - DTD Tree and SAX Event Stream Parsing (DSP)
  - Partition Path-Based Data Grouping (PPB) Format
  - Block-Statistic Signature (BSS) Indexing
  - Access Methods
[Figure: XCQ components: DSP, PPG format, BSS indexing, Access Methods]
38 Technique 1: DTD Tree and SAX Event Stream Parsing (DSP)
[Figure: XCQ architecture. The XCQ Compression Engine and the XCQ Querying Engine take a DTD, an XML document and XPath queries as inputs; components: DSP, PPG format, BSS indexing, Access Methods]
39 Technique 1: DTD Tree and SAX Event Stream Parsing (DSP)
- Purpose
  - To utilize the information in the DTD associated with the document
- Benefits
  - Only encodes the information that cannot be inferred from the DTD
  - Precise path-based grouping of data items
  - Runs in an automated manner
40 DSP: Input and Output
41 DSP Step 1: Creating a DTD Tree
<!ELEMENT library (entry)>
<!ELEMENT entry (author, title, year, publisher?, (paper|course_note|book), num_copy)>
<!ELEMENT author EMPTY>
<!ATTLIST author name CDATA>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT paper EMPTY>
<!ELEMENT course_note EMPTY>
<!ELEMENT book EMPTY>
<!ELEMENT num_copy (#PCDATA)>
43 DSP Step 2: Processing in the DSP Module
- How does the DSP module process the following XML document?
<library>
  <entry>
    <author name="Tom"/>
    <title>Introduction to &#34;OS&#34;</title>
    <year>2003</year>
    <course_note/>
    <num_copy>3</num_copy>
  </entry>
</library>
44 SAX event: start element library
[Figure: DTD tree with the nodes library, entry, author (name), title, year, publisher?, paper / course_note / book and num_copy, plus PCDATA leaves; the traversal path and the DTD tree node being processed are highlighted on every step]
45 SAX event: start element entry
- Match!
46 SAX events: start element author, attribute name = Tom; end element author
- Match!
- Structure stream: T, d0
- Data streams
  - d0: Tom
47 SAX events: start element title; PCDATA: Introduction to "OS"; end element title
- Structure stream: T, d0, d1
- Data streams
  - d0: Tom
  - d1: Introduction to "OS"
48 SAX events: start element year; PCDATA: 2003; end element year; start element course_note
- Structure stream: T, d0, d1, d2, F
- Data streams
  - d0: Tom
  - d1: Introduction to "OS"
  - d2: 2003
49 SAX events: start element course_note; end element course_note
- Match!
- Structure stream: T, d0, d1, d2, F, p1
- Keys: p0: paper, p1: course_note, p2: book
- Data streams
  - d0: Tom
  - d1: Introduction to "OS"
  - d2: 2003
50 SAX events: start element num_copy; PCDATA: 3; end element num_copy; end element entry
- Structure stream: T, d0, d1, d2, F, p1
- Data streams
  - d0: Tom
  - d1: Introduction to "OS"
  - d2: 2003
  - d4: 3
51 DSP Step 3: Generated Output
52 Technique 2: Partition Path-Based Data Grouping (PPB) Format
[Figure: XCQ architecture (XCQ Compression Engine, XCQ Querying Engine; inputs: DTD, XML document, XPath queries; components: DSP, PPG format, BSS indexing, Access Methods)]
53 Technique 2: Partition Path-Based Data Grouping (PPB) Format
- Purpose
  - To partition the data streams into blocks
  - Each block contains a number of data items
- Benefits
  - Each block can be compressed and decompressed as an individual unit
  - Supports partial decompression during query processing
54 Technique 2: Partition Path-Based Data Grouping (PPB) Format
55 Technique 2: Partition Path-Based Data Grouping (PPB) Format
- A cost model is developed for PPB
  - The relationship between block size, processing cost and selectivity can be derived
  - Further modelling is possible
56 Two layers
57 n layers
58 Technique 3: Block-Statistic Signature (BSS) Indexing
[Figure: XCQ architecture (XCQ Compression Engine, XCQ Querying Engine; inputs: DTD, XML document, XPath queries; components: DSP, PPG format, BSS indexing, Access Methods)]
59 Technique 3: Block-Statistic Signature (BSS) Indexing
- Purpose: to avoid accessing non-relevant data blocks during querying
  - I/O cost
  - Decompression overhead
  - Time to scan the data inside the block
- Details
  - Statistic summary (signature) for each block
  - Min, Max, Sum and Count
- Benefit: little processing time and storage space
- Research status: supports numerical data only
60 Technique 3: Block-Statistic Signature (BSS) Indexing
Compressed data blocks and their block statistic signatures:
- Block: 0, 1210, 100, 10000, 10. Signature: Min 0, Max 10000, Sum 11320, Count 5
- Block: 0, 10, 18, 27, 5. Signature: Min 0, Max 27, Sum 60, Count 5
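The two blocks above can be used to sketch how a signature lets the query engine skip blocks. The Python below is an illustrative assumption of how such an index might be used, not XCQ's implementation; the names `Signature`, `sign`, `blocks_to_scan` and `total` are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    minimum: int
    maximum: int
    total: int
    count: int


def sign(block):
    """Compute the statistic signature of one block of numerical data."""
    return Signature(min(block), max(block), sum(block), len(block))


def blocks_to_scan(signatures, low, high):
    """Indexes of blocks that may hold a value in [low, high]; the rest are skipped."""
    return [i for i, s in enumerate(signatures)
            if s.maximum >= low and s.minimum <= high]


def total(signatures):
    """Answer sum() over the whole stream from the signatures alone."""
    return sum(s.total for s in signatures)


blocks = [[0, 1210, 100, 10000, 10], [0, 10, 18, 27, 5]]
signatures = [sign(b) for b in blocks]
print(signatures[0])                            # minimum=0, maximum=10000, total=11320, count=5
print(blocks_to_scan(signatures, 1000, 20000))  # [0]
print(total(signatures))                        # 11380
```

A selection with the predicate "value >= 1000" only needs to decompress the first block, and an aggregation over the whole stream needs no decompression at all.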
61 Technique 3: Block-Statistic Signature (BSS) Indexing
62 Technique 4: Access Methods
[Figure: XCQ architecture (XCQ Compression Engine, XCQ Querying Engine; inputs: DTD, XML document, XPath queries; components: DSP, PPB format, BSS indexing, Access Methods)]
63 Technique 4: Access Methods
- Purpose
  - To realize partial decompression during query processing
- 4 types of queries
  - Selection queries
  - Structural queries
  - Structure-based aggregation queries
  - Path-based aggregation queries
64 Technique 4: Access Methods, Selection Queries
Query: //entry[author/@name = "Jess" and publisher/text() = "ABC"]
Structure Stream
Keys for path-based grouped Data Streams
- d0: /library/entry/author/@name
- d1: /library/entry/title/text()
- d2: /library/entry/year/text()
- d3: /library/entry/publisher/text()
- d4: /library/entry/num_copy/text()
65 Technique 4: Access Methods, Structural Queries
Query: /library/entry/author
Structure Stream and Data Streams d0 to d4, with the same keys as on slide 64
66 Technique 4: Access Methods, Structure-Based Aggregation Queries
Query: count(//entry)
Structure Stream and Data Streams d0 to d4, with the same keys as on slide 64
67 Technique 4: Access Methods, Path-Based Aggregation Queries
Query: sum(//num_copy/text()1)
Structure Stream and Data Streams d0 to d4, with the same keys as on slide 64
68 Experiment Context
- Compressors under study
  - gzip, XMill, XGrind, XCQ
- Datasets
Document       Size      Data-Centric / Document-Centric    Regularity (Relative Level)
Weblog         89 MB     Data-Centric                       5
SwissProt      32 MB     Data-Centric                       3
DBLP           41 MB     Data-Centric                       2
TPC-H          32 MB     Data-Centric                       6
XMark          104 MB    Data-Centric                       4
Shakespeare    8 MB      Document-Centric                   1
69 Experiment: Compression Performance
- Compression performance
- gzip, XMill, XCQ (no partition) and XGrind
- Scalability
- XCQ
- Partitioning
- BSS indexing overhead
Objective: comparable to XMill and better than XGrind
70 Compression Ratios
71 Compression Times
72 Decompression Times
73 Experiment: Compression Performance
- Compression performance
- gzip, XMill, XCQ and XGrind
- Scalability
- XCQ
- Partitioning
- BSS indexing overhead
Result: comparable to XMill
74 Scalability: Compressed Sizes
75 Experiment: Compression Performance
- Compression performance
- gzip, XMill, XCQ (no partition) and XGrind
- Scalability
- XCQ
- Partitioning
- BSS indexing
Result: the overheads introduced are low
76 Experiment Results: Partitioning Effect on XCQ Compression
77 Experiment Results: BSS Indexing Effect on XCQ Compression
78 Experiment: Compression Performance
- Query performance
- Different block sizes have an impact!
- XCQ vs XGrind
Result: choose a good block size
79 Experiment Results: Query Performance, Selection Queries
80 Experiment Results: Query Performance, Selection Queries
81 Experiment Results: Query Performance, Structural Query and Structure-Based Aggregation Query
82 Experiment Results: Query Performance, Path-Based Aggregation Query
83 Experiment: Compression Performance
- Query performance
- Different block sizes
- XCQ vs XGrind
Objective: how to choose a good block size? A few hundred elements
84 Experiment: Compression Performance
- Query performance
- Different block sizes
- XCQ vs XGrind
Objective: more efficient query performance
85 Experiment Results: XCQ vs XGrind (Data-Centric Documents)
86 Experiment Results: XCQ vs XGrind (Document-Centric Document)
87 Lessons and Development
- XCQ Framework
- Developed techniques
  - DSP
  - PPG document format
  - BSS indexing
  - Access methods
- Benefits of XCQ from the experimental results
  - Simple indexing, mathematical foundation
  - Compression performance
    - Comparable to XMill
  - Query performance
    - Better than XGrind for data-centric documents
    - Comparable to XGrind for the document-centric document
88 Multi-Query Evaluation of Compressed Data over a Network
- Widespread XML documents in remote locations
  - Large scale
  - XML verbosity
- Traditional XML query processing
  - One by one on a standalone system
  - Original result fragments or whole documents are forwarded
  - Heavy bandwidth costs over the Internet, or poor processing efficiency
- Motivations
  - Provide efficient query evaluation on compressed XML data
  - Reduce bandwidth consumption in result publication
89 Architecture
- Composed of a server and a group of clients
- On the server side
  - A large-scale XML document
  - Largest results directed to the nearest clients
  - Under compression
- Co-operative clients
  - Further dissemination of XML data to remote clients is possible
90 Preliminaries: XPress
- XPress
  - For tags
    - Reverse arithmetic encoding
    - Encoded into numerical intervals
  - For text
    - Dictionary Huffman encoder
- Compared with XGrind
  - Higher compression ratio
  - More efficient query evaluation
  - Less need for decompression
91-96 Preliminaries: Interval Encoding
- Reverse arithmetic encoding
  - Adopted to compress tags in XPress
Element              a             b             c
Probability          0.3           0.3           0.4
Original interval    [0.0, 0.3)    [0.3, 0.6)    [0.6, 1.0)
- The interval of /a/c is
  - [0.6 + 0.4 * 0.0, 0.6 + 0.4 * 0.3) = [0.6, 0.72)
  - Here 0.6 is the lower bound of the original interval of c, 0.4 is the probability of c, and [0.0, 0.3) is the original interval of a
- The interval of //c is [0.6, 1.0)
- //c is a suffix of /a/c
  - The interval of //c contains the interval of /a/c
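The interval computation can be reproduced with a short Python sketch. It follows the worked example for /a/c on the slides; the function names and the use of exact fractions (to avoid floating-point rounding) are choices made for this illustration, not part of XPress.

```python
from fractions import Fraction


def base_intervals(probabilities):
    """Assign each tag a sub-interval of [0, 1) whose width is its probability."""
    intervals, start = {}, Fraction(0)
    for tag, p in probabilities.items():
        intervals[tag] = (start, start + p)
        start += p
    return intervals


def path_interval(path, intervals):
    """Interval of a simple path given as a list of tags, root first."""
    low, high = intervals[path[0]]
    for tag in path[1:]:
        t_low, t_high = intervals[tag]
        width = t_high - t_low
        # Shrink the interval so far into the original interval of the next tag.
        low, high = t_low + width * low, t_low + width * high
    return low, high


def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]


iv = base_intervals({"a": Fraction(3, 10), "b": Fraction(3, 10), "c": Fraction(4, 10)})
a_c = path_interval(["a", "c"], iv)
print([float(x) for x in a_c])                      # [0.6, 0.72]
print(contains(path_interval(["c"], iv), a_c))      # True: //c covers /a/c
```

Because the interval of a path always lies inside the interval of each of its suffixes, a query such as //c can be answered by a range test on the encoded values without decompressing the tags.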
97-99 Preliminaries: XML Containment
- Query evaluation on the compressed document
  - XP/, //,
  - Queries QA and QB are submitted by clients CA and CB
- XPath containment
  - If QA's result is contained in QB's result for every XML document, then QB contains QA
- Application in our scenario
  - If QB contains QA, then the result of QA can be published by CB
  - Classify queries based on the containment relationship
100 Our approach
- Query-Index-Tree (QIT)
- QIT construction
- Multi-query evaluation
- Sub-index construction for clients
101 Query-Index-Tree (QIT)
- Built at the server side
- Each node corresponds to a query
- Explores the containment relationship
  - Among ancestors and descendants
- Marks all result locations as indices
- Target
  - Based on the hierarchical level of the QIT
  - Evaluate queries
  - Route result fragments
102-105 A QIT Example
106-110 QIT Construction
- Recursive classification
  - All submitted queries form the descendant set of the root
  - QA contains all other queries
  - Recursive classification within QA's descendant set
  - Each class has a query containing the others
  - Classification continues down to the leaves
111 Preprocessing for Multi-Query Evaluation
- On the server side, over the compressed document
  - How to evaluate queries using the QIT
  - How to support intermediate clients in locating results
- Tags are encoded into intervals
  - To avoid decompression in query processing
- Interval translation
  - Simple path → interval
  - Complex path → simple paths → intervals
- Examples
  - /a/b//c/d /a/b /c/d
  - /a/b//c/d /a/b, /c/d
112 Experiment: Overall Cost Savings
- Compared with linear query processing (without QIT)
- Saving ratio
113 Collaborative Processing
- A co-operative framework for multi-query processing over compressed XML data
- Keeps results under compression to save bandwidth
- Brings forward the QIT and its building algorithm
- Future work
  - QIT is not enough for handling complex XPath
  - Subscribed queries and non-subscribed queries
  - XPath queries and XPath FT queries
114 Papers: Compression
- H. Liefke and D. Suciu. XMILL: An Efficient Compressor for XML Data. SIGMOD 2000.
- P. M. Tolani and J. R. Haritsa. XGRIND: A Query-friendly XML Compressor. IEEE ICDE Conf., pp. 225-234, 2002.
- M. Girardot and N. Sundaresan. Millau: an encoding format for efficient representation and exchange of XML over the Web. WWW Conf., pp. 747-765, 2000.
- H. Ishikawa, S. Yokoyama, S. Isshiki and M. Ohta. Project Xanadu: XML- and Active-Database-Unified Approach to Distributed E-Commerce. Int. Workshop on DEXA, 2001.
- A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I. Manolescu and A. Pugliese. Efficient Query Evaluation over XML Compressed Data. EDBT 2004.
- Jun-Ki Min, Myung-Jae Park and Chin-Wan Chung. XPRESS: A Queriable Compression for XML Data. SIGMOD 2003.
115 Our publications for XML compression
- Xiaoling WANG, Aoying ZHOU, Juzhen HE and Wilfred NG. MQX: Multi-Query Processing Engine for Compressed XML Data. International Conference on Information Retrieval, ACM SIGIR 2007, Amsterdam, Holland (Demonstration Paper), pp. 897, (2007).
- Wilfred NG, Ho-Lam LAU and Aoying ZHOU. Divide, Compress and Conquer: Querying XML via Partitioned Path-Based Compressed Data Blocks. Accepted and to appear: World Wide Web Journal, (2006).
- Juzhen HE, Wilfred NG, Xiaoling WANG and Aoying ZHOU. An Efficient Co-operative Framework for Multi-Query Processing over Compressed XML Data. International Conference of Database Systems for Advanced Applications, DASFAA 2006, Lecture Notes in Computer Science Vol. 3882, Singapore, pp. 218-232, (2006).
- Wilfred NG, Wai-Yeung LAM, Peter WOOD and Mark LEVENE. XCQ: A Queriable XML Compression System. Accepted and to appear: An International Journal of Knowledge and Information Systems, (2005).
- Wilfred NG, Wai-Yeung LAM and James CHENG. Comparative Analysis of XML Compression Technologies. Accepted and to appear: World Wide Web Journal: Internet and Web Information Systems, (2005).
- James CHENG and Wilfred NG. XQzip: Querying Compressed XML Using Structural Indexing. International Conference on Extending Database Technology, EDBT 2004, Lecture Notes in Computer Science Vol. 2992, Heraklion, Crete, Greece, pp. 219-236, (2004).
- Wai-Yeung LAM, Wilfred NG, Peter WOOD and Mark LEVENE. XCQ: XML Compression and Querying System. Poster Proceedings of the World Wide Web Conference, WWW'2003, Budapest, (2003).