Part 4: Compressing XML Data

1
Managing XML and Semistructured Data
  • Part 4: Compressing XML Data

2
In this section
  • XML Compression
  • Motivation
  • The State-of-the-Art
  • Queriable compressors
  • Non-queriable compressors
  • Resources
  • XMill: An Efficient Compressor for XML Data, by
    Liefke and Suciu, in SIGMOD 2000
  • Others: XGrind, XPress, XQueC, XMLzip, ...
  • XCQ: from my publications
  • XQzip: from my publications
  • MQX: from my publications

3
Introduction
  • More and more XML data is created
  • Duplicate structures (tags, paths, ...)
  • Data inflation: data in XML is much larger than
    raw data
  • Compression: storage and data transfer
  • General-purpose compressor (e.g. gzip)
  • Characteristics of XML data not utilized
  • Unqueriable

4
Compression: The Problem
  • XML for exchange (space or time)
  • But XML is verbose and inflated due to
  • Duplicated tags and paths
  • Users prefer application-specific formats
  • E.g. web server logs
  • Is XML doomed to fail?
  • Solution: XML-specific compressor
  • Non-queriable: XMill
  • Queriable: XQzip

5
XML-Specific Compressors
  • Unqueriable compression (e.g. XMill)
  • Full-chunked: data commonalities eliminated
  • Very good compression ratio
  • Queriable compression (e.g. XGrind, XPRESS)
  • Fine-grained: data commonalities ignored
  • Inadequate compression ratio and time
  • Support simple path queries with atomic predicates

6
Issues in XML Compression
  • Compression ratio, compression time, query
    coverage, memory usage (see my survey paper in
    WWWJ)

Comparison of existing technologies
7
An Example: Web Server Logs
ASCII file: 15.9 MB (gzipped: 1.6 MB)
202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized Apache web log inflates to 24.2 MB
(gzipped: 2.1 MB)
<apache:entry>
  <apache:host>202.239.238.16</apache:host>
  <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
  <apache:contentType>text/html</apache:contentType>
  <apache:statusCode>200</apache:statusCode>
  <apache:date>1997/10/01-00:00:02</apache:date>
  <apache:byteCount>4478</apache:byteCount>
  <apache:referer>http://www.net.jp/</apache:referer>
  <apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
</apache:entry>

8
XMill
  • First specialized compressor for XML data
  • SAX parser for parsing XML data
  • Still using gzip as its underlying compressor
  • Clever grouping of data into containers for
    compression
  • Compress XML via three basic techniques
  • Compress the structure separately from the data
  • Group the data values according to their types
  • Apply semantic (specialized) compressors
  • Downloadable
  • www.cs.washington.edu/homes/suciu/XMILL

9
XMill Architecture
10
How XMill Works: Three Ideas
Compress the structure separately from the data
gzip Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
gzip Data: 202.239.238.16 GET / HTTP/1.0 text/html 200
Result: 1.75MB

11
How XMill Works: Three Ideas
Group the data values according to their types
gzip Structure: <apache:entry> . . . </apache:entry>
gzip Data1: 202.23.23.16 224.42.24.55
gzip Data2: GET / HTTP/1.0 GET / HTTP/1.1
Result: 1.33MB
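
The first two ideas can be sketched in a few lines of Python. This is only an illustration, not XMill's implementation: it assumes one container per element name, uses zlib in place of gzip, and keeps just the tag order as the "structure". The function names and the tiny sample log are made up for the example.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def split_containers(xml_text):
    """Return (structure, containers): tag order and per-tag value lists."""
    root = ET.fromstring(xml_text)
    structure = []                      # tag names in document order
    containers = defaultdict(list)      # tag -> list of text values
    for elem in root.iter():
        structure.append(elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
    return structure, dict(containers)

def compress_containers(containers):
    """Compress each container on its own, so similar values sit together."""
    return {tag: zlib.compress("\n".join(values).encode("utf-8"))
            for tag, values in containers.items()}

# Hypothetical two-entry log, modelled on the slide's example values.
LOG = """<log>
  <entry><host>202.23.23.16</host><requestLine>GET / HTTP/1.0</requestLine></entry>
  <entry><host>224.42.24.55</host><requestLine>GET / HTTP/1.1</requestLine></entry>
</log>"""

structure, containers = split_containers(LOG)
blobs = compress_containers(containers)
```

Each entry of `blobs` can be decompressed on its own with `zlib.decompress`, which is the property the grouping relies on.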


12
How XMill Works: Three Ideas
Apply semantic (specialized) compressors
  • Examples
  • 8, 16, 32-bit integer encoding (signed/unsigned)
  • differential compression (e.g. 1999, 1995, 2001,
    2000, 1995, ...)
  • compress lists, records (e.g. 104.32.23.1 → 4
    bytes)
  • Need user input to select the semantic compressor
13
Path Processor: structure container
<Book><Title lang="English">Data Compression</Title> <Author>Gray</Author> <Author>Reiter</Author> </Book>
Dictionary: one more entry for each new word
Less storage: 14 bytes!
  • Replace data value with container number
    (negative integer)
  • Replace end tag with 0
  • Replace tags/attributes with positive integer

Step 1: <Book><Title lang=-1>-2</Title> <Author>-3</Author> <Author>-3</Author> </Book>
Step 2: <Book><Title lang=-1 0>-2 0 <Author>-3 0 <Author>-3 0 0
Step 3: Book = 1, Title = 2, @lang = 3, Author = 4
1 2 3 -1 0 -2 0 4 -3 0 4 -3 0 0
Repeated structure entries can be compressed
effectively!
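
The three replacement rules can be reproduced with a small SAX handler. This is a sketch, not XMill's path processor: it assumes one container per root-to-node path and numbers tags and containers in order of first appearance. The class name `StructureEncoder` is invented for the example.

```python
import xml.sax

class StructureEncoder(xml.sax.ContentHandler):
    """Turn a document into integer tokens plus data containers."""

    def __init__(self):
        super().__init__()
        self.tag_ids = {}        # tag or @attribute -> positive integer
        self.container_ids = {}  # path -> container number
        self.containers = {}     # container number -> data values
        self.tokens = []
        self.path = []
        self.text = []

    def _tag(self, name):
        return self.tag_ids.setdefault(name, len(self.tag_ids) + 1)

    def _store(self, value):
        key = "/".join(self.path)
        cid = self.container_ids.setdefault(key, len(self.container_ids) + 1)
        self.containers.setdefault(cid, []).append(value)
        self.tokens.append(-cid)          # data value -> negative container number

    def _flush_text(self):
        value = "".join(self.text).strip()
        self.text = []
        if value:
            self._store(value)

    def startElement(self, name, attrs):
        self._flush_text()
        self.path.append(name)
        self.tokens.append(self._tag(name))
        for attr in attrs.getNames():
            self.path.append("@" + attr)
            self.tokens.append(self._tag("@" + attr))
            self._store(attrs.getValue(attr))
            self.tokens.append(0)         # end of attribute
            self.path.pop()

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.tokens.append(0)             # end tag -> 0
        self.path.pop()

BOOK = ('<Book><Title lang="English">Data Compression</Title> '
        '<Author>Gray</Author> <Author>Reiter</Author> </Book>')
encoder = StructureEncoder()
xml.sax.parseString(BOOK.encode("utf-8"), encoder)
```

On the Book example, `encoder.tokens` is the sequence shown on the slide and the two author names end up in the same container.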
14
XML Compression
XMill Evaluation using XML datasets
15
Queriable Compressors
  • XQzip: a queriable XML compressor (our work,
    EDBT'04)
  • Existing XML compressors (survey in WWWJ'05)
  • Unqueriable (e.g. XMill, SIGMOD'00): exploit data
    commonalities (better compression rate than
    gzip)
  • Queriable (e.g. XGrind ICDE'02, XPRESS
    SIGMOD'03, XQueC, XQzip EDBT'04, XCQ
    KAISJ'05): compress data individually
    (inadequate compression rate and time)
  • Features of XQzip
  • Use the SIT to aid query evaluation
  • Block compression: allows data commonalities to be
    exploited, and blocks are used as buffers to reduce
    decompression overhead

16
Structure Index Tree (SIT)
  • Effective elimination of duplicate structures in
    the XML data
  • Merging of nodes that have
  • the same incoming path
  • the same ordered set of paths of their
    descendants
  • SIT Construction
  • A linear scan of the XML document
  • Merging of the subtree that we are constructing
    into its equivalent subtree in the base tree

17
SIT Construction
[Figure: an example XML tree (root /, elements a, b, c, d, e, node ids 0-10) and the SIT obtained from it, in which merged nodes carry the ids of the nodes they replace]
18
XQzip Architecture
  • Index Constructor: constructs the SIT
  • Compressor
  • Groups semantically related items in blocks
  • Compresses each block with gzip
  • Query Processor: evaluates queries
  • Parser
  • Executor: applies the SIT to evaluate queries
  • Buffer Manager (LRU)

19
SIT Construction Complexity
  • N: total number of elements in the input XML
    document
  • Time complexity
  • Worst case: O(N · |SIT|)
  • Average case: O(N)
  • Space complexity
  • Base tree and the subtree being merged: 2|SIT|
  • Space for storing ids of eliminated nodes: O(N)
20
Data Compression
  • A balance between full-chunked and fine-grained
    compression
  • A distinct data container for each distinct
    element
  • Each container compressed (using gzip) into many
    smaller blocks
  • Block size?
  • Too small query time ?compression ratio?
  • Too large query time ?compression ratio?
  • Can only be determined by an empirical study
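
The trade-off can be made concrete with a sketch of block-wise container compression. It assumes that values contain no newline (used as the separator here) and uses zlib in place of gzip. `build_blocks` and `fetch` are illustrative names, not XQzip functions.

```python
import zlib

def build_blocks(values, items_per_block):
    """Split one container into fixed-size blocks and compress each block."""
    blocks = []
    for start in range(0, len(values), items_per_block):
        chunk = values[start:start + items_per_block]
        blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))
    return blocks

def fetch(blocks, items_per_block, index):
    """Return item `index`, decompressing only the block that holds it."""
    block_no, offset = divmod(index, items_per_block)
    items = zlib.decompress(blocks[block_no]).decode("utf-8").split("\n")
    return items[offset]

values = ["title %d" % i for i in range(2500)]
blocks = build_blocks(values, 1000)
```

Changing `items_per_block` and timing `fetch` against the total size of `blocks` is one simple way to run the kind of empirical study the slide calls for.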

21
Block Size
  • Representative datasets and queries
  • Datasets
  • Heavy text
  • Light text
  • A mix of heavy text and light text
  • Queries
  • High Selectivity
  • Medium Selectivity
  • Low Selectivity

22
Block Size
23
Structure of Compressed-Data
  • Block size?
  • Determined by an empirical study
  • Querying Time
  • near-optimal range 600-1000 data items/block
    (average optimal 950)
  • Compression Ratio
  • Not improved much after 150 KB/block (usually
    contain more than 1000 items)
  • 1000 data items/block

24
Outline
  • Introduction
  • XQzip EDBT 2004
  • Indexing
  • Data Compression
  • Query Evaluation
  • Performance Evaluation
  • Conclusion

25
XQzip Query Coverage
  • All XPath axes except the sideways axes (e.g.
    preceding, following)-siblings
  • Multiple and nested predicates
  • and / or / not expressions
  • Aggregations sum, count, average, max, min
  • Group queries e.g. (L1 (L2 L3 L4))
  • L1 //ab Crete (prefix) L2 c
  • L3 df/count() > 100 L4 e//g

26
Query Evaluation
  • Depth-first traverse the index tree
  • Buffer Management (LRU)
  • Why buffering? Decompression Time Dominates
  • Decompression avoidance

27
Outline
  • Introduction
  • XQzip
  • Indexing
  • Data Compression
  • Query Evaluation
  • Performance Evaluation
  • Conclusion

28
Effectiveness of the SIT
Data Source   Node Reduction   Load Time   Node Selection Acceleration
XMark         1.64             0.67s       2.15
OMIM          0.24             0.07s       2.16
DBLP          0.04             1.62s       2.11
SwissProt     28.38            5.61s       1.92
Treebank      93.42            2.26s       1.76
PSD           10.85            9.97s       2.18
Shakespeare   1.96             0.07s       2.10
Lineitem      0.002            0.42s       1.78
29
Effectiveness of the SIT
  • Index size: less than 1% of original size
  • Load time: a fraction of a second
  • Node selection acceleration: about twice as fast as
    FB-Index
  • Construction time: more than 3 times faster than
    FB-Index

30
Compression Ratio
XQzip is comparable to XMill and gzip, 17%
better than XGrind with index size included, and 42%
better than XGrind without the index.
31
Compression/Decompression Time
  • XQzip (compression + index construction) is more
    than 5 times faster than XGrind and 1.5 times slower
    than XMill
  • XQzip (index loading + decompression) is more
    than 3 times faster than XGrind and 1.4 times slower
    than XMill

32
Data Source      Query  Node Selecting  Partial Decomp.  Result Processing  Querying Time (sec)
                        Time (sec)      Time (sec)       Time (sec)         XQzip-   XQzip   XGrind
XMark (111MB)    Q1     0.001           ---              0.911              0.913    0.122   22.774
                 Q2     0.001           0.920            0.012              0.934    0.295   23.067
                 Q3     0.001           3.395            0.014              3.411    0.349   35.012
                 Q4     0.003           ---              0.551              0.584    0.118   ---
                 Q5     0.831           4.534            0.010              5.376    1.544   ---
OMIM (24.5MB)    Q1     0.001           ---              0.030              0.032    0.005   3.513
                 Q2     0.001           0.021            0.011              0.034    0.014   4.690
                 Q3     0.001           0.036            0.057              0.095    0.067   6.134
                 Q4     0.005           ---              ---                0.005    0.005   ---
                 Q5     0.012           0.020            0.580              0.613    0.034   ---
DBLP (148MB)     Q1     0.001           ---              0.370              0.381    0.034   19.582
                 Q2     0.001           0.330            0.013              0.345    0.029   26.108
                 Q3     0.033           0.391            8.997              9.541    1.543   50.344
                 Q4     0.001           ---              0.000              0.001    0.001   ---
                 Q5     0.087           1.122            0.260              1.481    0.642   ---
33
Query Performance
  • Cold Buffer-pool Evaluation
  • 13 times better than XGrind
  • Warm buffer-pool Evaluation
  • 80 times better than XGrind
  • Impressive Buffer Effect!

34
Lessons on XML Compression
  • Good compression ratio and time
  • Comparable to that of XMill
  • Much better than that of XGrind (and XPRESS)
  • Support a very practical set of queries
  • A much wider range of queries than XGrind and
    XPRESS
  • Very Competitive Querying Time with Buffer
  • 13 times better than XGrind with cold buffer
  • 80 times better than XGrind with warm buffer
  • Limitations
  • Cost of building and maintenance of complex
    Indexes
  • No theoretical foundation of block size

35
XCQ
  • XCQ Framework
  • Experimental Results
  • Compression Performance
  • Query Performance
  • Lessons and Development

36
XCQ
  • Objectives
  • Achieve Good Compression ratio
  • Comparable to XMill
  • Better than XGrind
  • Achieve Good Query performance
  • More efficient than XGrind
  • Querying compressed documents with block-based
    partial decompression
  • But addressing issues different from XQzip
  • Adopt minimal indexing
  • Establish theory between selectivity and block
    size

37
XCQ Strategy
  • Based on four techniques
  • DTD Tree and SAX Event Stream Parsing (DSP)
  • Partition Path-Based Data Grouping (PPB) Format
  • Block-Statistic Signature (BSS) Indexing
  • Access Methods

[Figure: the four techniques as components: DSP, PPG format, BSS indexing, Access Methods]
38
Technique 1: DTD Tree and SAX Event Stream Parsing (DSP)
[Figure: XCQ architecture, with labels DTD, XML Document, XPath Queries, XCQ Compression Engine, XCQ Querying Engine, DSP, PPG format, BSS indexing, Access Methods]
39
Technique 1: DTD Tree and SAX Event Stream Parsing (DSP)
  • Purpose
  • To utilize information in the associated DTD of
    the document
  • Benefits
  • Only encode the information that cannot be
    inferred in the DTD
  • Precise path-based grouping of data items
  • Run in automated manner

40
DSP Input and Output
41
DSP Step 1: Creating a DTD Tree
<!ELEMENT library (entry)>
<!ELEMENT entry (author, title, year, publisher?, (paper|course_note|book), num_copy)>
<!ELEMENT author EMPTY>
<!ATTLIST author name CDATA>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT paper EMPTY>
<!ELEMENT course_note EMPTY>
<!ELEMENT book EMPTY>
<!ELEMENT num_copy (#PCDATA)>
43
DSP Step 2: Processing in the DSP Module
  • How does the DSP module process the following XML
    document?

<library> <entry> <author name="Tom"/>
<title>Introduction to "OS"</title>
<year>2003</year> <course_note/>
<num_copy>3</num_copy> </entry> </library>
44
SAX Event: Start element library
[Figure, repeated on slides 44-50: the XML document, the DTD tree (library; entry; author (name); title; year; publisher?; paper, course_note, book; num_copy) with the traversal path and the DTD tree node being processed, the structure stream, and the data streams with their keys]

45
SAX Event: Start element entry
Match!

46
SAX Event: Start element author, att0 name = Tom; End element author
Match!
Structure Stream: T, d0
Data Streams: d0: Tom

47
SAX Event: Start element title; PCDATA Introduction to "OS"; End element title
Structure Stream: T, d0, d1
Data Streams: d0: Tom; d1: Introduction to "OS"

48
SAX Events: Start element year; PCDATA 2003; End element year; Start element course_note
Structure Stream: T, d0, d1, d2, F
Data Streams: d0: Tom; d1: Introduction to "OS"; d2: 2003

49
SAX Event: Start element course_note; End element course_note
Match!
Structure Stream: T, d0, d1, d2, F, p1
Keys: p0: paper; p1: course_note; p2: book
Data Streams: d0: Tom; d1: Introduction to "OS"; d2: 2003

50
SAX Event: Start element num_copy; PCDATA 3; End element num_copy; End element entry
Structure Stream: T, d0, d1, d2, F, p1
Data Streams: d0: Tom; d1: Introduction to "OS"; d2: 2003; d4: 3
51
DSP Step 3 Generated Output
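
The walkthrough can be approximated in code. This is only a sketch under several assumptions: the content model of entry is hard-coded instead of being read from the DTD, ElementTree is used instead of a SAX stream, entry is treated as repeatable, and data-stream keys are written into the structure stream as in the slide trace (with d4 appended for num_copy). The function name `encode_library` is invented.

```python
import xml.etree.ElementTree as ET

CHOICES = ["paper", "course_note", "book"]      # order taken from the DTD

def encode_library(xml_text):
    """Return (structure_stream, data_streams) for a <library> document."""
    structure = []
    data = {"d0": [], "d1": [], "d2": [], "d3": [], "d4": []}

    def emit(key, value):
        data[key].append(value)
        structure.append(key)

    for entry in ET.fromstring(xml_text).findall("entry"):
        structure.append("T")                   # an entry is present
        emit("d0", entry.find("author").get("name"))
        emit("d1", entry.findtext("title"))
        emit("d2", entry.findtext("year"))
        publisher = entry.find("publisher")
        if publisher is None:
            structure.append("F")               # optional element absent
        else:
            structure.append("T")
            emit("d3", publisher.text)
        kind = next(c for c in CHOICES if entry.find(c) is not None)
        structure.append("p%d" % CHOICES.index(kind))   # which alternative
        emit("d4", entry.findtext("num_copy"))
    return structure, data

DOC = ('<library><entry><author name="Tom"/>'
       '<title>Introduction to "OS"</title><year>2003</year>'
       '<course_note/><num_copy>3</num_copy></entry></library>')
structure, data = encode_library(DOC)
```

Element names never appear in the output: they can be inferred from the DTD, so only the optional, choice, and repetition decisions are recorded alongside the data.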
52
Technique 2: Partition Path-Based (PPB) Data Grouping Format
[Figure: XCQ architecture]
53
Technique 2: Partition Path-Based Data Grouping (PPB) Format
  • Purpose
  • To partition the data streams
  • Each block contains a number of data items
  • Benefits
  • Can be compressed and decompressed as an
    individual unit
  • Support partial decompression during query
    processing

54
Technique 2: Partition Path-Based Data Grouping (PPB) Format
55
Technique 2: Partition Path-Based Data Grouping (PPB) Format
  • A cost model is developed for PPB
  • Relationship between block size, processing cost
    and selectivity can be known
  • Further modelling is possible

56
Two layers
57
n layers
58
Technique 3: Block-Statistic Signature (BSS) Indexing
[Figure: XCQ architecture]
59
Technique 3: Block-Statistic Signature (BSS) Indexing
  • Purpose: to avoid accessing non-relevant data
    blocks during querying
  • I/O cost
  • Decompression overhead
  • Time to scan the data inside the block
  • Details
  • Statistic summary (signature) for each block
  • Min, Max, Sum and Count
  • Benefit: little processing time and
    storage space
  • Research status: supports numerical data only

60
Technique 3: Block-Statistic Signature (BSS) Indexing
Compressed Data Blocks and their Block Statistic Signatures
Block: 0 1210 100 10000 10
Signature: Min 0, Max 10000, Sum 11320, Count 5
Block: 0 10 18 27 5
Signature: Min 0, Max 27, Sum 60, Count 5
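
A minimal sketch of how such signatures could be built and used, with the two blocks from the slide as input. The names `Signature`, `blocks_to_scan` and `total_sum` are made up for the example. The point is that a range predicate can skip blocks, and a sum can be answered without decompressing anything.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signature:
    minimum: float
    maximum: float
    total: float
    count: int

def make_signature(block):
    """Summarize one block of numeric values before it is compressed."""
    return Signature(min(block), max(block), sum(block), len(block))

def blocks_to_scan(signatures, low, high):
    """Indices of blocks that may contain a value in [low, high]."""
    return [i for i, s in enumerate(signatures)
            if s.maximum >= low and s.minimum <= high]

def total_sum(signatures):
    """Answer sum() over the whole stream from the signatures alone."""
    return sum(s.total for s in signatures)

BLOCKS = [[0, 1210, 100, 10000, 10], [0, 10, 18, 27, 5]]
SIGNATURES = [make_signature(b) for b in BLOCKS]
```

For a predicate such as "value between 1000 and 2000", only the first block has to be decompressed, because the second block's maximum is 27.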
61
Technique 3: Block-Statistic Signature (BSS) Indexing
62
Technique 4: Access Methods
[Figure: XCQ architecture]
63
Technique 4: Access Methods
  • Purpose
  • For realizing partial decompression during query
    processing
  • 4 types of queries
  • Selection queries
  • Structural queries
  • Structure-based aggregation queries
  • Path-based aggregation queries

64
Technique 4: Access Methods: Selection Queries
//entry[author/@name="Jess" and publisher/text()="ABC"]
Structure Stream
Keys for path-based grouped Data Streams
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()
65
Technique 4: Access Methods: Structural Queries
/library/entry/author
Structure Stream and Data Streams d0-d4
Keys for path-based grouped Data Streams
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()
66
Technique 4: Access Methods: Structure-Based Aggregation Queries
count(//entry)
Structure Stream and Data Streams d0-d4
Keys for path-based grouped Data Streams
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()
67
Technique 4: Access Methods: Path-Based Aggregation Queries
sum(//num_copy/text()1)
Structure Stream and Data Streams d0-d4
Keys for path-based grouped Data Streams
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()
68
Experiment Context
  • Compressors under study
  • gzip, XMill, XGrind, XCQ
  • Datasets

Document      Size     Data-Centric / Document-Centric   Regularity (Relative Level)
Weblog        89 MB    Data-Centric                      5
SwissProt     32 MB    Data-Centric                      3
DBLP          41 MB    Data-Centric                      2
TPC-H         32 MB    Data-Centric                      6
XMark         104 MB   Data-Centric                      4
Shakespeare   8 MB     Document-Centric                  1
69
Experiment Compression Performance
  • Compression Performance
  • gzip, XMill, XCQ (No Partition) and XGrind
  • Scalability
  • XCQ
  • Partitioning
  • BSS Indexing overhead

Objective: comparable to XMill and better than
XGrind
70
Compression Ratios
71
Compression Times
72
Decompression Times
73
Experiment Compression Performance
  • Compression Performance
  • gzip, XMill, XCQ and XGrind
  • Scalability
  • XCQ
  • Partitioning
  • BSS Indexing overhead

Result: comparable to XMill
74
Scalability Compressed Sizes
75
Experiment Compression Performance
  • Compression Performance
  • gzip, XMill, XCQ (No Partition) and XGrind
  • Scalability
  • XCQ
  • Partitioning
  • BSS Indexing

Result: the overheads introduced are low
76
Experiment Results Partitioning Effect on XCQ
Compression
77
Experiment Results BSS Indexing Effect on XCQ
Compression
78
Experiment Compression Performance
  • Query Performance
  • Different block sizes have impact!
  • XCQ vs XGrind

Result: choose a good block size
79
Experiment Results Query performance
Selection queries
80
Experiment Results Query performance
Selection queries
81
Experiment Results Query performance
Structural Query and Structure-Based Aggregation
Query
82
Experiment Results Query performance
Path-Based Aggregation Query
83
Experiment Compression Performance
  • Query Performance
  • Different block sizes
  • XCQ vs XGrind

Objective: how to choose a good block size? A few
hundred elements
84
Experiment Compression Performance
  • Query Performance
  • Different block sizes
  • XCQ vs XGrind

Objective: more efficient query performance
85
Experiment Results XCQ vs XGrind (Data Centric
Documents)
86
Experiment Results XCQ vs XGrind (Document
Centric Document)
87
Lessons and Development
  • XCQ Framework
  • Developed techniques
  • DSP
  • PPG document format
  • BSS indexing
  • Access methods
  • Benefits of XCQ from experimental results
  • Simple Indexing, Mathematical Foundation
  • Compression performance
  • Comparable to XMill
  • Query performance
  • Better than XGrind for Data-Centric Documents
  • Comparable to XGrind for Document-Centric Document

88
Multi-query evaluation of Compressed Data over
network
  • Widespread XML documents in remote locations
  • Large scale
  • XML verbosity
  • Traditional XML query processing
  • One by one on a standalone system
  • Original result fragments or whole documents are
    forwarded.
  • Heavy bandwidth costs for Internet or Poor
    processing efficiency
  • Motivations
  • Provide efficient query evaluation on compressed
    XML data
  • Reduce bandwidth consumption in result publication

89
Architecture
  • Composed of the server and a group of clients
  • On the server side
  • A large-scale XML document
  • Largest results directing to the nearest clients
  • Under compression
  • Co-operative clients
  • Further dissemination of XML data to remote clients
    is possible

90
Preliminaries-XPress
  • XPress
  • For tags
  • reverse arithmetic encoding
  • Encoded into numerical intervals
  • For text
  • dictionary Huffman encoder
  • Compared with XGrind
  • Higher compression ratio
  • More efficient query evaluation
  • Less decompression needed

91
Preliminaries-Interval Encoding
  • Reverse arithmetic encoding
  • Adopted to compress tags in XPress

Element            a            b            c
Probability        0.3          0.3          0.4
Original interval  [0.0, 0.3)   [0.3, 0.6)   [0.6, 1.0)
92
Preliminaries-Interval Encoding
  • Reverse arithmetic encoding
  • Adopted to compress tags in XPress
  • The interval of /a/c is
  • [0.6 + 0.4 × 0.0, 0.6 + 0.4 × 0.3) = [0.6, 0.72)
  • 0.6 is the lower bound of the original interval of c,
    0.4 is the probability of c, and [0.0, 0.3) is the
    original interval of a
  • The interval of //c is [0.6, 1.0)
  • //c is a suffix of /a/c
  • The interval of //c contains the interval of
    /a/c

Element            a            b            c
Probability        0.3          0.3          0.4
Original interval  [0.0, 0.3)   [0.3, 0.6)   [0.6, 1.0)
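
The interval computation for /a/c can be checked with a short script. This is a sketch, not XPress code: the extension to paths longer than two tags (narrowing repeatedly from the last tag to the first) is an assumption generalized from the slide's example, and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

# Original intervals from the table: a, b, c with probabilities 0.3, 0.3, 0.4.
INTERVALS = {
    "a": (Fraction(0), Fraction(3, 10)),
    "b": (Fraction(3, 10), Fraction(6, 10)),
    "c": (Fraction(6, 10), Fraction(1)),
}

def path_interval(tags):
    """Interval of a simple path given as a list of tags, e.g. ['a', 'c']."""
    low, width = Fraction(0), Fraction(1)
    for tag in reversed(tags):          # last tag first, hence "reverse"
        t_low, t_high = INTERVALS[tag]
        low = low + width * t_low
        width = width * (t_high - t_low)
    return low, low + width

def contains(outer, inner):
    """True if interval `inner` lies inside interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

a_c = path_interval(["a", "c"])         # the path /a/c
any_c = path_interval(["c"])            # the path //c
```

Because every path ending in c falls inside the interval of //c, a suffix query can be answered by an interval containment test on the encoded tags.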
99
Preliminaries-XML Containment
  • Query evaluation on compressed documents
  • XP/, //,
  • Queries QA and QB are submitted by clients CA and CB
  • XPath containment
  • If QA's result is always contained in QB's for
    every XML document, then QB contains QA.
  • Application in our scenario
  • If QB contains QA, then the result of QA can be
    published by CB.
  • Classify queries based on the containment
    relationship

100
Our approach
  • Query-Index-Tree (QIT)
  • QIT Construction
  • Multi-Query Evaluation
  • Sub-Index Construction for Clients

101
Query-Index-Tree (QIT)
  • Built at the server side
  • Each node corresponds to a query
  • Explore containment relationship
  • Among ancestors and descendants
  • Remark all result locations as indices
  • Target
  • based on the hierarchical level of QIT
  • Evaluate queries
  • Route result fragments

102
A QIT Example
106
QIT Construction
  • Recursive classification
  • All submitted queries form the descendant set of
    the root
  • QA contains all other queries
  • Recursive classification within QA's descendant set
  • Each class has a query containing the others
  • Classification continues until the leaves
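
The recursive classification can be sketched for a deliberately restricted query class. The assumption is that every query has the form //t1/t2/.../tn, so that QB contains QA exactly when QB's steps are a suffix of QA's. General XPath containment is much harder, and the names below are invented for the example.

```python
def steps(query):
    """'//a/c' -> ['a', 'c'] (only queries of the form //t1/.../tn)."""
    return query.lstrip("/").split("/")

def contains(qb, qa):
    """True if every result of qa is also a result of qb (suffix test)."""
    sb, sa = steps(qb), steps(qa)
    return len(sb) <= len(sa) and sa[len(sa) - len(sb):] == sb

class QITNode:
    def __init__(self, query=None):
        self.query = query              # None marks the virtual root
        self.children = []

def insert(node, query):
    """Place `query` under the deepest node whose query contains it."""
    for child in node.children:
        if contains(child.query, query):
            insert(child, query)
            return
    node.children.append(QITNode(query))

def build_qit(queries):
    root = QITNode()
    # Shorter queries are more general, so they are placed first.
    for q in sorted(set(queries), key=lambda q: (len(steps(q)), q)):
        insert(root, q)
    return root

qit = build_qit(["//a/c", "//c", "//b/c", "//d", "//x/a/c"])
top = [n.query for n in qit.children]
under_c = [n.query for n in qit.children[0].children]
under_a_c = [n.query for n in qit.children[0].children[0].children]
```

In this tree the result of //x/a/c is contained in that of //a/c, which is contained in that of //c, so results can be routed down the hierarchy.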

111
Preprocess for Multi-Query Evaluation
  • On server side, Over compressed document
  • How to evaluate queries using QIT
  • How to support intermediate clients to locate
    results
  • Tags are encoded into intervals
  • To avoid decompression in query processing
  • Interval translation
  • Simple path → interval
  • Complex path → simple paths →
    intervals
  • Example
  • /a/b//c/d → /a/b, /c/d

112
Experiment - Overall Cost Savings
  • Compare with linear query processing (without
    QIT)
  • Saving Ratio

113
Collaborative Processing
  • A co-operative framework for multi-query
    processing over compressed XML data
  • Keep results under compression to save bandwidth
  • Bring forward QIT and building algorithm
  • Future work
  • QIT is not enough for handling complex XPath
  • Subscribed queries and non-subscribed queries.
  • XPath queries and XPath FT queries

114
Papers: Compression
  • XMill: An Efficient Compressor for XML Data, by
    Liefke and Suciu, in SIGMOD 2000
  • P. M. Tolani and J. R. Haritsa. XGRIND: A
    Query-friendly XML Compressor. IEEE ICDE Conf.,
    pp. 225-234, 2002.
  • M. Girardot and N. Sundaresan. Millau: an
    encoding format for efficient representation and
    exchange of XML over the Web. WWW Conf., pp.
    747-765, 2000.
  • H. Ishikawa, S. Yokoyama, S. Isshiki and M. Ohta.
    Project Xanadu: XML- and Active-Database-Unified
    Approach to Distributed E-Commerce. Int. Workshop
    on DEXA, 2001.
  • A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I.
    Manolescu, A. Pugliese. Efficient Query
    Evaluation over XML Compressed Data. EDBT 2004.
  • Jun-Ki Min, Myung-Jae Park, Chin-Wan Chung. XPRESS:
    A Queriable Compression for XML Data. SIGMOD 2003.

115
Our publications on XML compression
  • Xiaoling WANG, Aoying ZHOU, Juzhen HE and Wilfred
    NG. MQX: Multi-Query Processing Engine for
    Compressed XML Data. International Conference on
    Information Retrieval, ACM SIGIR 2007, Amsterdam,
    Holland (Demonstration Paper), pp. 897, (2007).
  • Wilfred NG, Ho-Lam LAU and Aoying ZHOU. Divide,
    Compress and Conquer: Querying XML via
    Partitioned Path-Based Compressed Data Blocks.
    Accepted and to appear: World Wide Web Journal,
    (2006).
  • Juzhen HE, Wilfred NG, Xiaoling WANG and Aoying
    ZHOU. An Efficient Co-operative Framework for
    Multi-Query Processing over Compressed XML Data.
    International Conference of Database Systems for
    Advanced Applications, DASFAA 2006, Lecture Notes
    in Computer Science Vol. 3882, Singapore, pp.
    218-232, (2006).
  • Wilfred NG, Wai-Yeung LAM, Peter WOOD and Mark
    LEVENE. XCQ: A Queriable XML Compression System.
    Accepted and to appear: An International Journal
    of Knowledge and Information Systems, (2005).
  • Wilfred NG, Wai-Yeung LAM and James CHENG.
    Comparative Analysis of XML Compression
    Technologies. Accepted and to appear: World Wide
    Web Journal: Internet and Web Information
    Systems, (2005).
  • James CHENG and Wilfred NG. XQzip: Querying
    Compressed XML Using Structural Indexing.
    International Conference on Extending Database
    Technology, EDBT 2004, Lecture Notes in Computer
    Science Vol. 2992, Heraklion, Crete, Greece, pp.
    219-236, (2004).
  • Wai-Yeung LAM, Wilfred NG, Peter WOOD and Mark
    LEVENE. XCQ: XML Compression and Querying
    System. Poster Proceedings of the World Wide Web
    Conference, WWW 2003, Budapest, (2003).