Part 4: Compressing XML Data

Provided by: ust51

1
Managing XML and Semistructured Data

Part 4 Compressing XML Data

2
In this section

XML Compression
Motivation
The State-of-the-Art
Queriable compressors
Non-queriable compressors
Resources
XMill: An Efficient Compressor for XML Data, by
Liefke and Suciu, in SIGMOD 2000
Others: XGrind, XPRESS, XQueC, XMLzip
XCQ: from my publications
XQzip: from my publications
MQX: from my publications

3
Introduction

More and more XML data is being created
Duplicate structures (tags, paths)
Data inflation: data in XML is much larger than
the raw data
Compression: storage and data transfer
General-purpose compressors (e.g. gzip)
Characteristics of XML data are not utilized
Compressed output is unqueriable

4
Compression: The Problem

XML for exchange (space or time)
But XML is verbose and inflated due to
duplicated tags and paths
Users prefer application-specific formats
E.g. Web server logs
Is XML doomed to fail?
Solution: XML-specific compressors
Non-queriable: XMill
Queriable: XQzip

5
XML-Specific Compressors

Unqueriable compression (e.g. XMill)
Full-chunked: data commonalities eliminated
Very good compression ratio
Queriable compression (e.g. XGrind, XPRESS)
Fine-grained: data commonalities ignored
Inadequate compression ratio and time
Supports simple path queries with atomic predicates

6
Issues in XML Compression

Compression ratio, compression time, query
coverage, memory usage (see my survey paper in
WWWJ)

Comparison of existing technologies
7
An Example: Web Server Logs
ASCII file: 15.9 MB (gzipped: 1.6 MB)
202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized Apache web log inflates to 24.2 MB
(gzipped: 2.1 MB)

<apache:entry>
  <apache:host>202.239.238.16</apache:host>
  <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
  <apache:contentType>text/html</apache:contentType>
  <apache:statusCode>200</apache:statusCode>
  <apache:date>1997/10/01-00:00:02</apache:date>
  <apache:byteCount>4478</apache:byteCount>
  <apache:referer>http://www.net.jp/</apache:referer>
  <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
</apache:entry>

8
XMill

First specialized compressor for XML data
SAX parser for parsing XML data
Still using gzip as its underlying compressor
Clever grouping of data into containers for
compression
Compress XML via three basic techniques
Compress the structure separately from the data
Group the data values according to their types
Apply semantic (specialized) compressors
Downloadable: www.cs.washington.edu/homes/suciu/XMILL

9
XMill Architecture
10
How XMill Works: Three Ideas
Compress the structure separately from the data
gzip(Structure): <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
gzip(Data): 202.239.238.16 GET / HTTP/1.0 text/html 200 . . .
1.75 MB

11
How XMill Works: Three Ideas
Group the data values according to their types
gzip(Structure): <apache:entry> . . . </apache:entry>
gzip(Data1): 202.23.23.16 224.42.24.55 . . .
gzip(Data2): GET / HTTP/1.0 GET / HTTP/1.1 . . .
1.33 MB

12
How XMill Works: Three Ideas
Apply semantic (specialized) compressors

Examples
8-, 16-, 32-bit integer encoding (signed/unsigned)
Differential compression (e.g. 1999, 1995, 2001,
2000, 1995, ...)
Compressing lists and records (e.g. 104.32.23.1 → 4
bytes)
Needs user input to select the semantic compressor
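The two semantic compressors named above can be sketched in a few lines. This is only an illustration of the ideas, not XMill's implementation: the function names `delta_encode`, `delta_decode` and `pack_ipv4` are hypothetical, and the IP packing relies on Python's standard `ipaddress` module.

```python
import ipaddress


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out


def pack_ipv4(text):
    """Store a dotted IPv4 address as its 4 raw bytes instead of up to 15 characters."""
    return ipaddress.IPv4Address(text).packed


years = [1999, 1995, 2001, 2000, 1995]
print(delta_encode(years))            # small differences after the first value
print(len(pack_ipv4("104.32.23.1")))  # 4
```

Small differences and fixed-width binary values give the back-end compressor more regular input than the original text does.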

13
Path Processor: structure container
<Book><Title lang="English">Data Compression</Title> <Author>Gray</Author> <Author>Reiter</Author> </Book>
Dictionary: one more entry for each new word
Less storage: 14 bytes!

Replace data values with container numbers
(negative integers)
Replace end tags with 0
Replace tags/attributes with positive integers

<Book><Title lang="English">Data Compression</Title> <Author>Gray</Author> <Author>Reiter</Author> </Book>
<Book><Title lang=-1>-2</Title> <Author>-3</Author> <Author>-3</Author> </Book>
<Book><Title lang=-1 0>-2 0 <Author>-3 0 <Author>-3 0 0
Book = 1, Title = 2, @lang = 3, Author = 4
1 2 3 -1 0 -2 0 4 -3 0 4 -3 0 0
Repeated structure entries can be compressed
effectively!
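The token scheme on this slide can be reproduced with a small SAX handler. This is a simplified sketch under stated assumptions, not XMill's actual file format: the class `StructureEncoder` and the helper `encode` are hypothetical names, containers are keyed by the path to the value, and text is trimmed of surrounding whitespace.

```python
import xml.sax


class StructureEncoder(xml.sax.ContentHandler):
    """Split a document into a structure token list and per-path data containers."""

    def __init__(self):
        super().__init__()
        self.tag_ids = {}        # tag or @attribute name -> positive integer
        self.container_ids = {}  # path -> container number
        self.containers = {}     # container number -> list of data values
        self.tokens = []         # the structure container
        self.path = []
        self.text = []

    def _tag(self, name):
        return self.tag_ids.setdefault(name, len(self.tag_ids) + 1)

    def _store(self, value):
        key = "/".join(self.path)
        cid = self.container_ids.setdefault(key, len(self.container_ids) + 1)
        self.containers.setdefault(cid, []).append(value)
        self.tokens.append(-cid)  # data value -> negative container number

    def _flush(self):
        value = "".join(self.text).strip()
        self.text = []
        if value:
            self._store(value)

    def startElement(self, name, attrs):
        self._flush()
        self.path.append(name)
        self.tokens.append(self._tag(name))  # tag -> positive integer
        for attr in attrs.getNames():
            self.path.append("@" + attr)
            self.tokens.append(self._tag("@" + attr))
            self._store(attrs.getValue(attr))
            self.tokens.append(0)  # end of attribute
            self.path.pop()

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self._flush()
        self.tokens.append(0)  # end tag -> 0
        self.path.pop()


def encode(xml_text):
    handler = StructureEncoder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


doc = ('<Book><Title lang="English">Data Compression</Title>'
       '<Author>Gray</Author><Author>Reiter</Author></Book>')
result = encode(doc)
print(result.tag_ids)
print(result.tokens)
print(result.containers)
```

Each container could then be handed to gzip separately, which is where grouping similar values pays off.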
14
XML Compression
XMill Evaluation using XML datasets
15
Queriable Compressors

XQzip: a queriable XML compressor (our work,
EDBT04)
Existing XML compressors (survey in WWWJ05)
Unqueriable (e.g. XMill SIGMOD00): exploit data
commonalities; better compression rate than
gzip
Queriable (e.g. XGrind ICDE02, XPRESS
SIGMOD03, XQueC, XQzip EDBT04, XCQ
KAISJ05): compress data individually;
inadequate compression rate and time
Features of XQzip
Uses the SIT to aid query evaluation
Block compression allows data commonalities to be
exploited, and blocks are used as buffers to reduce
decompression overhead

16
Structure Index Tree (SIT)

Effective elimination of duplicate structures in
the XML data
Merging of nodes that have
the same incoming path
the same ordered set of paths of their
descendants
SIT Construction
A linear scan of the XML document
Merging of the subtree that we are constructing
into its equivalent subtree in the base tree
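The merging rule can be illustrated with a short recursive sketch. It is an assumption-laden simplification rather than the XQzip algorithm: it groups elements that share an incoming path by tag and by their ordered set of descendant paths, it recomputes path sets instead of doing a single linear scan, and the names `path_set`, `build_sit` and `count_nodes` are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_set(elem):
    """Ordered, duplicate-free tuple of label paths found below elem."""
    seen = []
    for child in elem:
        for path in [child.tag] + [child.tag + "/" + p for p in path_set(child)]:
            if path not in seen:
                seen.append(path)
    return tuple(seen)


def build_sit(elems):
    """Merge elements that share an incoming path, a tag and a descendant path set."""
    groups = {}
    for e in elems:
        groups.setdefault((e.tag, path_set(e)), []).append(e)
    nodes = []
    for (tag, _), members in groups.items():
        children = [c for m in members for c in m]
        nodes.append({"tag": tag,
                      "merged": len(members),  # how many document nodes this SIT node stands for
                      "children": build_sit(children)})
    return nodes


def count_nodes(nodes):
    return sum(1 + count_nodes(n["children"]) for n in nodes)


root = ET.fromstring("<a><b><d/><e/></b><c><d/></c><b><d/><e/></b><c><d/><e/></c></a>")
sit = build_sit([root])
print(len(list(root.iter())), "document nodes ->", count_nodes(sit), "SIT nodes")
```

In this toy document the two `b` subtrees merge, while the two `c` subtrees stay separate because their descendant path sets differ.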

17
SIT Construction
[Figure: SIT construction example, showing a document tree (root /, element labels a, b, c, d, e, node ids 0-10) and the merged tree annotated with lists of merged node ids]
18
XQzip Architecture

Index Constructor construct the SIT
Compressor
Group semantically related items in blocks
Compress each block by gzip
Query Processor evaluate query
Parser
Executor apply the SIT to evaluate query
Buffer Manager (By LRU)

19
SIT Construction Complexity

N: total number of elements in the input XML
document
Time complexity
Worst case: O(N · |SIT|)
Average case: O(N)
Space complexity
Base tree and the subtree being merged: 2|SIT|
Space for storing ids of eliminated nodes: O(N)

20
Data Compression

A balance between full-chunked and fine-grained
compression
A distinct data container for each distinct
element
Each container compressed (using gzip) into many
smaller blocks
Block size?
Too small query time ?compression ratio?
Too large query time ?compression ratio?
Can only be determined by an empirical study

21
Block Size

Representative datasets and queries
Datasets
Heavy text
Light text
A mix of heavy text and light text
Queries
High Selectivity
Medium Selectivity
Low Selectivity

22
Block Size
23
Structure of Compressed-Data

Block size?
Determined by an empirical study
Querying time
Near-optimal range: 600-1000 data items/block
(average optimum: 950)
Compression ratio
Not much improved beyond 150 KB/block (which usually
contains more than 1000 items)
1000 data items/block
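Block-wise compression of one data container can be sketched as follows. This is a minimal illustration, not XQzip's storage format: `zlib` (the DEFLATE algorithm that gzip also uses) stands in for gzip, items are assumed to contain no newline characters, and the names `compress_container` and `read_item` are hypothetical.

```python
import zlib


def compress_container(items, block_size=1000):
    """Split a data container into blocks of block_size items and compress each block."""
    blocks = []
    for start in range(0, len(items), block_size):
        chunk = "\n".join(items[start:start + block_size])
        blocks.append(zlib.compress(chunk.encode("utf-8")))
    return blocks


def read_item(blocks, index, block_size=1000):
    """Fetch one item by decompressing only the block that holds it."""
    block = zlib.decompress(blocks[index // block_size]).decode("utf-8")
    return block.split("\n")[index % block_size]


items = [str(i) for i in range(2500)]
blocks = compress_container(items)
print(len(blocks), read_item(blocks, 2345))
```

The `block_size` parameter is the quantity the empirical study tunes: it sets how much data one query has to decompress and how much redundancy one block can exploit.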

24
Outline

Introduction
XQzip EDBT 2004
Indexing
Data Compression
Query Evaluation
Performance Evaluation
Conclusion

25
XQzip Query Coverage

All XPath axes except the sideways axes (e.g.
preceding-sibling, following-sibling)
Multiple and nested predicates
and / or / not expressions
Aggregations: sum, count, average, max, min
Group queries, e.g. (L1 (L2 L3 L4))
L1: //ab Crete (prefix), L2: c,
L3: df/count() > 100, L4: e//g

26
Query Evaluation

Depth-first traverse the index tree
Buffer Management (LRU)
Why buffering? Decompression Time Dominates
Decompression avoidance
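An LRU buffer of decompressed blocks can be sketched as below. It only illustrates why a warm buffer avoids repeated decompression and is not XQzip's buffer manager: the class name `BlockBuffer`, its `capacity` parameter and the `decompressions` counter are hypothetical, and `zlib` stands in for gzip.

```python
import zlib
from collections import OrderedDict


class BlockBuffer:
    """Keep the most recently used decompressed blocks in memory."""

    def __init__(self, blocks, capacity):
        self.blocks = blocks          # list of compressed blocks
        self.capacity = capacity      # maximum number of decompressed blocks kept
        self.cache = OrderedDict()    # block number -> decompressed bytes, oldest first
        self.decompressions = 0

    def get(self, block_no):
        if block_no in self.cache:
            self.cache.move_to_end(block_no)  # mark as most recently used
            return self.cache[block_no]
        data = zlib.decompress(self.blocks[block_no])
        self.decompressions += 1
        self.cache[block_no] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used block
        return data


blocks = [zlib.compress(b"a"), zlib.compress(b"b"), zlib.compress(b"c")]
buf = BlockBuffer(blocks, capacity=2)
for n in [0, 1, 0, 2, 0]:
    buf.get(n)
print(buf.decompressions)  # five accesses, three decompressions
```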

28
Effectiveness of the SIT
Data Source   Node Reduction   Load Time   Node Selection Acceleration
XMark         1.64             0.67s       2.15
OMIM          0.24             0.07s       2.16
DBLP          0.04             1.62s       2.11
SwissProt     28.38            5.61s       1.92
Treebank      93.42            2.26s       1.76
PSD           10.85            9.97s       2.18
Shakespeare   1.96             0.07s       2.10
Lineitem      0.002            0.42s       1.78
29
Effectiveness of the SIT

Index size: less than 1% of original size
Load time: a fraction of a second
Node selection acceleration: twice as fast as
FB-Index
Construction time: more than 3 times faster than
FB-Index

30
Compression Ratio
XQzip is comparable to XMill and gzip, 17%
better than XGrind with index size included, and 42%
better than XGrind without the index.
31
Compression/Decompression Time

XQzip (compression + index construction) is more
than 5 times better than XGrind, 1.5 times worse
than XMill
XQzip (index loading + decompression) is more
than 3 times better than XGrind, 1.4 times worse
than XMill

32
Data Source      Query  Node Selecting  Partial Decomp.  Result Processing  Querying Time  Querying Time  Querying Time
                        Time (sec)      Time (sec)       Time (sec)         (sec, XQzip-)  (sec, XQzip)   (sec, XGrind)
XMark (111MB)    Q1     0.001           ---              0.911              0.913          0.122          22.774
                 Q2     0.001           0.920            0.012              0.934          0.295          23.067
                 Q3     0.001           3.395            0.014              3.411          0.349          35.012
                 Q4     0.003           ---              0.551              0.584          0.118          ---
                 Q5     0.831           4.534            0.010              5.376          1.544          ---
OMIM (24.5MB)    Q1     0.001           ---              0.030              0.032          0.005          3.513
                 Q2     0.001           0.021            0.011              0.034          0.014          4.690
                 Q3     0.001           0.036            0.057              0.095          0.067          6.134
                 Q4     0.005           ---              ---                0.005          0.005          ---
                 Q5     0.012           0.020            0.580              0.613          0.034          ---
DBLP (148MB)     Q1     0.001           ---              0.370              0.381          0.034          19.582
                 Q2     0.001           0.330            0.013              0.345          0.029          26.108
                 Q3     0.033           0.391            8.997              9.541          1.543          50.344
                 Q4     0.001           ---              0.000              0.001          0.001          ---
                 Q5     0.087           1.122            0.260              1.481          0.642          ---
33
Query Performance

Cold Buffer-pool Evaluation
13 times better than XGrind
Warm buffer-pool Evaluation
80 times better than XGrind
Impressive Buffer Effect!

34
Lessons on XML Compression

Good compression ratio and time
Comparable to that of XMill
Much better than that of XGrind (and XPRESS)
Support a very practical set of queries
A much wider range of queries than XGrind and
XPRESS
Very competitive querying time with buffer
13 times better than XGrind with a cold buffer
80 times better than XGrind with a warm buffer
Limitations
Cost of building and maintenance of complex
Indexes
No theoretical foundation of block size

35
XCQ

XCQ Framework
Experimental Results
Compression Performance
Query Performance
Lessons and Development

36
XCQ

Objectives
Achieve Good Compression ratio
Comparable to XMill
Better than XGrind
Achieve Good Query performance
More efficient than XGrind
Querying compressed documents with block-based
partial decompression
But addressing issues different from XQzip
Adopt minimal indexing
Establish theory between selectivity and block
size

37
XCQ Strategy

Based on four techniques
DTD Tree and SAX Event Stream Parsing (DSP)
Partition Path-Based Data Grouping (PPB) Format
Block-Statistic Signature (BSS) Indexing
Access Methods

38
Technique 1: DTD Tree and SAX Event Stream
Parsing (DSP)
[Figure: XCQ architecture. Labels: XCQ Compression Engine, XCQ Querying Engine, DSP, PPG format, BSS indexing, Access Methods, DTD, XML Document, XPath Queries]
39
Technique 1: DTD Tree and SAX Event Stream
Parsing (DSP)

Purpose
To utilize information in the DTD associated with
the document
Benefits
Encodes only the information that cannot be
inferred from the DTD
Precise path-based grouping of data items
Runs in an automated manner

40
DSP Input and Output
41
DSP Step 1: Creating a DTD Tree
<!ELEMENT library (entry)>
<!ELEMENT entry (author, title, year, publisher?, (paper|course_note|book), num_copy)>
<!ELEMENT author EMPTY>
<!ATTLIST author name CDATA>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT paper EMPTY>
<!ELEMENT course_note EMPTY>
<!ELEMENT book EMPTY>
<!ELEMENT num_copy (#PCDATA)>
43
DSP Step 2: Processing in the DSP Module

How does the DSP module process the following XML
document?

<library> <entry> <author name="Tom"/>
<title>Introduction to &#34;OS&#34;</title>
<year>2003</year> <course_note/>
<num_copy>3</num_copy> </entry> </library>
44
[Figure, slides 44-50: the XML document above is processed one SAX event at a time against the DTD tree (nodes: library, entry, author (name), title, year, publisher?, paper, course_note, book, num_copy; legend: keys, traversal path, PCDATA, DTD tree node being processed). The text recovered from each step follows.]
SAX event: start element library
45
SAX event: start element entry
Match!
46
SAX events: start element author, attribute 0: name = Tom; end element author
Match!
Structure stream: T, d0
Data streams: d0: Tom
47
SAX events: start element title; PCDATA: Introduction to &#34;OS&#34;; end element title
Structure stream: T, d0, d1
Data streams: d0: Tom; d1: Introduction to &#34;OS&#34;
48
SAX events: start element year; PCDATA: 2003; end element year; start element course_note
Structure stream: T, d0, d1, d2, F
Data streams: d0: Tom; d1: Introduction to &#34;OS&#34;; d2: 2003
49
SAX events: start element course_note; end element course_note
Match!
Structure stream: T, d0, d1, d2, F, p1
Keys: p0: paper; p1: course_note; p2: book
Data streams: d0: Tom; d1: Introduction to &#34;OS&#34;; d2: 2003
50
SAX events: start element num_copy; PCDATA: 3; end element num_copy; end element entry
Structure stream: T, d0, d1, d2, F, p1
Data streams: d0: Tom; d1: Introduction to &#34;OS&#34;; d2: 2003; d4: 3
51
DSP Step 3 Generated Output
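The walkthrough can be approximated by a small encoder that consumes one `entry` at a time against a hand-written content model. This is a sketch under several assumptions and not the XCQ implementation: `ENTRY_MODEL` and `encode` are hypothetical names, the leading `T` is assumed to mark that another `entry` follows, a present `publisher` is assumed to be written as `T` followed by `d3`, and `d4` is assumed to be appended to the structure stream after `p1`. Only the `F` and `p1` conventions are taken directly from the slides.

```python
import xml.etree.ElementTree as ET

# Hand-written content model for <entry>, mirroring the DTD on the earlier slide.
ENTRY_MODEL = [
    ("attr", "author", "name"),                    # d0
    ("text", "title"),                             # d1
    ("text", "year"),                              # d2
    ("optional", "publisher"),                     # d3
    ("choice", ("paper", "course_note", "book")),  # p0, p1, p2
    ("text", "num_copy"),                          # d4
]


def encode(library):
    """Return (structure stream, data streams) for a <library> element."""
    structure, data = [], {}
    for entry in library.findall("entry"):
        structure.append("T")  # assumed marker: one more entry follows
        children = list(entry)
        pos = 0      # index of the next unread child
        d_index = 0  # index of the next data stream key
        for item in ENTRY_MODEL:
            kind = item[0]
            child = children[pos] if pos < len(children) else None
            if kind == "choice":
                structure.append("p%d" % item[1].index(child.tag))
                pos += 1
                continue
            key = "d%d" % d_index
            d_index += 1
            if kind == "optional":
                if child is None or child.tag != item[1]:
                    structure.append("F")  # optional element absent
                    continue
                structure.append("T")      # assumed marker: optional element present
            value = child.get(item[2]) if kind == "attr" else child.text
            data.setdefault(key, []).append(value)
            structure.append(key)
            pos += 1
    return structure, data


doc = ET.fromstring(
    '<library><entry><author name="Tom"/>'
    '<title>Introduction to &#34;OS&#34;</title>'
    '<year>2003</year><course_note/><num_copy>3</num_copy></entry></library>'
)
structure, data = encode(doc)
print(structure)
print(data)
```

Tag names never appear in the output: the DTD already fixes them, so only the choices the DTD leaves open and the data values are recorded.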
52
Technique 2: Partition Path-Based (PPB) Data
Grouping Format
[Figure: XCQ architecture]
53
Technique 2: Partition Path-Based Data Grouping
(PPB) Format

Purpose
To partition the data streams into blocks
Each block contains a number of data items
Benefits
Each block can be compressed and decompressed as an
individual unit
Supports partial decompression during query
processing

54
Technique 2: Partition Path-Based Data Grouping
(PPB) Format
55
Technique 2: Partition Path-Based Data Grouping
(PPB) Format

A cost model is developed for PPB
It relates block size, processing cost
and selectivity
Further modelling is possible

56
Two layers
57
n layers
58
Technique 3: Block-Statistic Signature (BSS)
Indexing
[Figure: XCQ architecture]
59
Technique 3: Block-Statistic Signature (BSS)
Indexing

Purpose: to avoid accessing irrelevant data
blocks during querying, saving
I/O cost
Decompression overhead
Time to scan the data inside the block
Details
Statistical summary (signature) for each block
Min, Max, Sum and Count
Benefit: needs little processing time and
storage space
Research status: supports numerical data only

60
Technique 3: Block-Statistic Signature (BSS)
Indexing
Compressed data blocks and their block statistic signatures
Block: 0, 1210, 100, 10000, 10
Signature: Min 0, Max 10000, Sum 11320, Count 5
Block: 0, 10, 18, 27, 5
Signature: Min 0, Max 27, Sum 60, Count 5
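The two example blocks can be used to sketch how signatures prune block accesses. This is an illustration of the idea rather than XCQ's index layout; the names `signature` and `blocks_to_scan` are hypothetical, and a range predicate `low <= value <= high` is assumed.

```python
def signature(block):
    """Statistical summary kept alongside a compressed block of numbers."""
    return {"min": min(block), "max": max(block), "sum": sum(block), "count": len(block)}


def blocks_to_scan(signatures, low, high):
    """Indices of blocks whose [min, max] range can overlap the range [low, high]."""
    return [i for i, s in enumerate(signatures)
            if s["max"] >= low and s["min"] <= high]


blocks = [[0, 1210, 100, 10000, 10], [0, 10, 18, 27, 5]]
signatures = [signature(b) for b in blocks]

# A predicate such as 1000 <= value <= 2000 can only match inside the first block.
print(blocks_to_scan(signatures, 1000, 2000))

# Aggregates over whole streams need no decompression at all.
print(sum(s["sum"] for s in signatures), sum(s["count"] for s in signatures))
```

Blocks that fail the overlap test are never read or decompressed, which is where the saving in I/O, decompression and scan time comes from.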
61
Technique 3: Block-Statistic Signature (BSS)
Indexing
62
Technique 4: Access Methods
[Figure: XCQ architecture]
63
Technique 4: Access Methods

Purpose
To realize partial decompression during query
processing
4 types of queries
Selection queries
Structural queries
Structure-based aggregation queries
Path-based aggregation queries

64
Technique 4: Access Methods, Selection Queries
//entry[author/@name="Jess" and
publisher/text()="ABC"]
Structure Stream
Keys for path-based grouped Data Streams
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()
65
Technique 4: Access Methods, Structural Queries
/library/entry/author
[Figure: structure stream and data streams d0-d4]
Keys for path-based grouped Data Streams
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()
66
Technique 4: Access Methods, Structure-Based
Aggregation Queries
count(//entry)
[Figure: structure stream and data streams d0-d4]
Keys for path-based grouped Data Streams
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()
67
Technique 4: Access Methods, Path-Based
Aggregation Queries
sum(//num_copy/text()1)
[Figure: structure stream and data streams d0-d4]
Keys for path-based grouped Data Streams
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()
68
Experiment Context

Compressors under study
gzip, XMill, XGrind, XCQ
Datasets

Document      Size     Data-Centric/Document-Centric   Regularity (Relative Level)
Weblog        89 MB    Data-Centric                    5
SwissProt     32 MB    Data-Centric                    3
DBLP          41 MB    Data-Centric                    2
TPC-H         32 MB    Data-Centric                    6
XMark         104 MB   Data-Centric                    4
Shakespeare   8 MB     Document-Centric                1
69
Experiment Compression Performance

Compression Performance
gzip, XMill, XCQ (No Partition) and XGrind
Scalability
XCQ
Partitioning
BSS Indexing overhead

Objective: comparable to XMill and better than
XGrind
70
Compression Ratios
71
Compression Times
72
Decompression Times
73
Experiment Compression Performance

Compression Performance
gzip, XMill, XCQ and XGrind
Scalability
XCQ
Partitioning
BSS Indexing overhead

Result: comparable to XMill
74
Scalability Compressed Sizes
75
Experiment Compression Performance

Compression Performance
gzip, XMill, XCQ (No Partition) and XGrind
Scalability
XCQ
Partitioning
BSS Indexing

Result: the overheads introduced are low
76
Experiment Results Partitioning Effect on XCQ
Compression
77
Experiment Results BSS Indexing Effect on XCQ
Compression
78
Experiment Compression Performance

Query Performance
Different block sizes have impact!
XCQ vs XGrind

Result: choose a good block size
79
Experiment Results Query performance
Selection queries
80
Experiment Results Query performance
Selection queries
81
Experiment Results Query performance
Structural Query and Structure-Based Aggregation
Query
82
Experiment Results Query performance
Path-Based Aggregation Query
83
Experiment Compression Performance

Query Performance
Different block sizes
XCQ vs XGrind

Objective: how to choose a good block size? A few
hundred elements
84
Experiment Compression Performance

Query Performance
Different block sizes
XCQ vs XGrind

Objective: more efficient query performance
85
Experiment Results XCQ vs XGrind (Data Centric
Documents)
86
Experiment Results XCQ vs XGrind (Document
Centric Document)
87
Lessons and Development

XCQ Framework
Developed techniques
DSP
PPG document format
BSS indexing
Access methods
Benefits of XCQ from experimental results
Simple Indexing, Mathematical Foundation
Compression performance
Comparable to XMill
Query performance
Better than XGrind for Data-Centric Documents
Comparable to XGrind for Document-Centric Document

88
Multi-query evaluation of Compressed Data over
network

Widespread XML documents in remote locations
Large scale
XML verbosity
Traditional XML query processing
Queries are evaluated one by one on a standalone system
Original result fragments or whole documents are
forwarded

Heavy bandwidth costs over the Internet, or poor
processing efficiency
Motivations
Provide efficient query evaluation on compressed
XML data
Reduce bandwidth consumption in result publication

89
Architecture

Composed of a server and a group of clients
On the server side
A large-scale XML document
Largest results are directed to the nearest clients
Under compression
Co-operative clients
Further dissemination of XML data to remote clients
is possible

90
Preliminaries- XPress

XPress
For tags
reverse arithmetic encoding
Encoded into numerical intervals
For text
Dictionary and Huffman encoders
Compared with XGrind
Higher compression ratio
More efficient query evaluation
Less decompression needed

91
Preliminaries-Interval Encoding

Reverse arithmetic encoding
Adopted to compress tags in XPress

Element             a            b            c
Probability         0.3          0.3          0.4
Original interval   [0.0, 0.3)   [0.3, 0.6)   [0.6, 1.0)

The interval of /a/c is
[0.6 + 0.4 × 0.0, 0.6 + 0.4 × 0.3) = [0.6, 0.72)
Here 0.6 is the start of the original interval of c, 0.4 is the
probability of c, and [0.0, 0.3) is the original interval of a
The interval of //c is [0.6, 1.0)
//c is a suffix of /a/c
The interval of //c contains the interval of
/a/c
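The /a/c computation can be reproduced with exact fractions. This is a sketch of the interval arithmetic shown on these slides, not of the XPress code base: the names `base_intervals`, `path_interval` and `contains` are hypothetical, and a path is passed as a list of tags from the root down.

```python
from fractions import Fraction

PROB = {"a": Fraction(3, 10), "b": Fraction(3, 10), "c": Fraction(4, 10)}


def base_intervals(prob):
    """Give each tag a slice of [0, 1) whose width is its probability."""
    intervals, low = {}, Fraction(0)
    for tag, p in prob.items():
        intervals[tag] = (low, low + p)
        low += p
    return intervals


def path_interval(path, intervals):
    """Interval of a tag path: the prefix's interval is scaled into the next tag's interval."""
    low, high = intervals[path[0]]
    for tag in path[1:]:
        t_low, t_high = intervals[tag]
        width = t_high - t_low
        low, high = t_low + width * low, t_low + width * high
    return low, high


def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]


iv = base_intervals(PROB)
a_c = path_interval(["a", "c"], iv)
print(float(a_c[0]), float(a_c[1]))                # 0.6 0.72
print(contains(path_interval(["c"], iv), a_c))     # True: //c is a suffix of /a/c
```

Because the last tag fixes the outermost interval, a suffix test on paths becomes an interval containment test, which can be evaluated without decompressing the tags.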
97
Preliminaries-XML Containment

Query evaluation on the compressed document
XP/, //,
Queries QA and QB are submitted by clients CA and CB

XPath containment
If QA's result is contained in QB's result for
every XML document, then QB contains QA.

Application in our scenario
If QB contains QA, then the result of QA can be
published by CB.
Classify queries based on the containment
relationship

100
Our approach

Query-Index-Tree (QIT)
QIT Construction
Multi-Query Evaluation
Sub-Index Construction for Clients

101
Query-Index-Tree (QIT)

Built at the server side
Each node corresponds to a query
Explore containment relationship
Among ancestors and descendants
Remark all result locations as indices
Target
based on the hierarchical level of the QIT
Evaluate queries
Route result fragments

102
A QIT Example
[Figure: QIT example, built up step by step over slides 102-105]
106
QIT Construction

Recursive classification
All submitted queries form the descendant set of the root
QA contains all other queries
Classification recurses within QA's descendant set
Each class has a query containing the others
Classification continues until the leaves
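Recursive classification can be sketched with a pluggable containment test. This is an illustration only, not the published QIT algorithm: `build_qit`, `contains` and `steps` are hypothetical names, the tree is returned as nested dictionaries with an implicit root, and the toy containment test handles only paths of child steps that start with either `/` or `//`, where `//s` contains any query whose steps end with `s`.

```python
def steps(query):
    return query.lstrip("/").split("/")


def contains(qb, qa):
    """Toy test: does qb contain qa? Valid only for the restricted path class above."""
    sa, sb = steps(qa), steps(qb)
    if qb.startswith("//"):
        return len(sb) <= len(sa) and sa[-len(sb):] == sb
    return qa == qb  # an absolute path contains only itself


def build_qit(queries, contains):
    """Classify queries recursively; each class is headed by a query containing the rest."""
    queries = list(dict.fromkeys(queries))  # drop duplicates, keep order
    heads = [q for q in queries
             if not any(p != q and contains(p, q) for p in queries)]
    classes = {h: [] for h in heads}
    for q in queries:
        if q not in classes:
            owner = next(h for h in heads if contains(h, q))
            classes[owner].append(q)
    return {h: build_qit(members, contains) for h, members in classes.items()}


qit = build_qit(["/a/c", "//c", "/a/b/c", "//b/c", "//d"], contains)
print(qit)
```

A query contained by several heads is placed under the first one found; a real system would need a policy for that case and a containment test for the full XPath fragment it supports.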

111
Preprocess for Multi-Query Evaluation

On server side, Over compressed document
How to evaluate queries using QIT
How to support intermediate clients to locate
results
Tags are encoded into intervals
To avoid decompression in query processing
Interval translation
Simple path → interval
Complex path → simple paths →
intervals
Example
/a/b//c/d → /a/b, /c/d

112
Experiment - Overall Cost Savings

Compare with linear query processing (without
QIT)
Saving Ratio

113
Collaborative Processing

A co-operative framework for multi-query
processing over compressed XML data
Keep results under compression to save bandwidth
Proposes the QIT and its construction algorithm
Future work
QIT is not enough for handling complex XPath
Subscribed queries and non-subscribed queries.
XPath queries and XPath FT queries

114
Papers Compression

Liefke and Suciu. XMill: An Efficient Compressor
for XML Data. SIGMOD 2000.
P. M. Tolani and J. R. Haritsa. XGRIND: A
Query-friendly XML Compressor. IEEE ICDE Conf.,
pp. 225-234, 2002.
M. Girardot and N. Sundaresan. Millau: an
encoding format for efficient representation and
exchange of XML over the Web. WWW Conf., pp.
747-765, 2000.
H. Ishikawa, S. Yokoyama, S. Isshiki and M. Ohta.
Project Xanadu: XML- and Active-Database-Unified
Approach to Distributed E-Commerce. Int. Workshop
on DEXA, 2001.
A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I.
Manolescu and A. Pugliese. Efficient Query
Evaluation over XML Compressed Data. EDBT 2004.
JunKi Min, MyungJae Park and ChinWan Chung. XPRESS:
A Queriable Compression for XML Data. SIGMOD 2003.

115
Our publications for XML compression

Xiaoling WANG, Aoying ZHOU, Juzhen HE and Wilfred
NG. MQX: Multi-Query Processing Engine for
Compressed XML Data. International Conference on
Information Retrieval, ACM SIGIR 2007, Amsterdam,
Holland (Demonstration Paper), pp. 897, (2007).
Wilfred NG, Ho-Lam LAU and Aoying ZHOU. Divide,
Compress and Conquer: Querying XML via
Partitioned Path-Based Compressed Data Blocks.
Accepted and to appear: World Wide Web Journal,
(2006).
Juzhen HE, Wilfred NG, Xiaoling WANG and Aoying
ZHOU. An Efficient Co-operative Framework for
Multi-Query Processing over Compressed XML Data.
International Conference of Database Systems for
Advanced Applications, DASFAA 2006, Lecture Notes
in Computer Science Vol. 3882, Singapore, pp.
218-232, (2006).
Wilfred NG, Wai-Yeung LAM, Peter WOOD and Mark
LEVENE. XCQ: A Queriable XML Compression System.
Accepted and to appear: An International Journal
of Knowledge and Information Systems, (2005).
Wilfred NG, Wai-Yeung LAM and James CHENG.
Comparative Analysis of XML Compression
Technologies. Accepted and to appear: World Wide
Web Journal: Internet and Web Information
Systems, (2005).
James CHENG and Wilfred NG. XQzip: Querying
Compressed XML Using Structural Indexing.
International Conference on Extending Database
Technology, EDBT 2004, Lecture Notes in Computer
Science Vol. 2992, Heraklion, Crete, Greece, pp.
219-236, (2004).
Wai-Yeung LAM, Wilfred NG, Peter WOOD and Mark
LEVENE. XCQ: XML Compression and Querying
System. Poster Proceedings of the World Wide Web
Conference, WWW 2003, Budapest, (2003).