Title: Constructing Optimal Wavelet Synopses
1 Constructing Optimal Wavelet Synopses
- Dimitris Sacharidis
- dsachar_at_dblab.ntua.gr
- Timos Sellis
- timos_at_dblab.ntua.gr
2 outline
- introduction
- background
- wavelet basics
- example
- wavelet synopses
- example
- error metrics
- optimal synopses
- interesting issues
- data streams
- models
- streaming wavelet synopses
- epilogue
3 introduction
- analyzing massive multi-dimensional datasets
- complex aggregate queries over large parts of the data
- exploratory nature
- promptness over accuracy, but with guarantees
- resort to approximate query processing over precomputed synopses (e.g., histograms, samples, wavelets)
- numerous data management applications need to continuously generate, process and analyze data on-line
- the data streaming paradigm
- summarize in real time, using small space and in one pass
- provide approximate query answers with quality guarantees
- provide useful data summarization
- need to measure inaccuracy, which is application dependent
5 wavelet basics
- wavelet decomposition is a mathematical tool for the hierarchical decomposition of functions
- applications in signal/image processing
- used extensively as a data reduction tool in db scenarios
- selectivity estimation for large aggregate queries
- fast approximate query answers
- general purpose streaming synopsis
- features
- efficient: performs in linear time and space (vs. $O(N^2)$ for histograms)
- high compression ratio, small-B property
- generalizes to multiple dimensions
6 example
assume a data vector d of 8 values
iteratively perform pair-wise averaging and semi-differencing
every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree
intermediate averages are not needed: the overall average and the detail coefficients suffice
wavelet tree (a.k.a. error tree)
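As a minimal sketch of the averaging and semi-differencing steps, the Python below decomposes a vector and rebuilds it. The slide's own data vector is not reproduced in this text, so the 8 values used here are a hypothetical example, and the function names `haar_decompose` and `haar_reconstruct` are illustrative only. Coefficients are left unnormalized and stored in error-tree order: overall average first, then detail coefficients from coarsest to finest.

```python
def haar_decompose(data):
    """Unnormalized Haar decomposition by pairwise averaging and semi-differencing."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(left - right) / 2 for left, right in pairs]
        averages = [(left + right) / 2 for left, right in pairs]
        # Coarser levels go in front, giving error-tree (heap-like) order.
        details = level_details + details
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: each detail adds to its left child, subtracts from its right."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        values = [v for avg, det in zip(values, level) for v in (avg + det, avg - det)]
        pos += len(level)
    return values


# Hypothetical data vector (not the one shown on the slide).
data = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_decompose(data)
```

For this vector the decomposition yields `[2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]`, and `haar_reconstruct(coeffs)` returns the original values.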
8 wavelet synopses
- any set of B coefficients constitutes a B-term wavelet synopsis
- stored as <index, value> pairs
- implicitly all non-stored coefficients are set to zero
- introduces reconstruction error per point
- estimate $\hat{d}$, point errors $e = \hat{d} - d$
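The sketch below illustrates how a point estimate can be read off a sparse synopsis by walking one root-to-leaf path of the error tree, treating every non-stored coefficient as zero. It rests on several assumptions that are not in the slides: the synopsis is a Python dict from coefficient index to unnormalized value, index 0 holds the overall average, detail node j has children 2j and 2j+1, and the data vector and retained coefficients are hypothetical.

```python
def point_estimate(synopsis, i, n):
    """Estimate d[i] from a sparse synopsis {coefficient index: value} over n points."""
    value = synopsis.get(0, 0.0)      # overall average (zero if not stored)
    node, lo, hi = 1, 0, n            # detail node `node` covers leaves [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        c = synopsis.get(node, 0.0)   # non-stored coefficients count as zero
        if i < mid:                   # left subtree: positive contribution
            value += c
            node, hi = 2 * node, mid
        else:                         # right subtree: negative contribution
            value -= c
            node, lo = 2 * node + 1, mid
    return value


# Hypothetical example: data vector and a 3-term synopsis of its Haar coefficients.
d = [2, 2, 0, 2, 3, 5, 4, 4]
synopsis = {0: 2.75, 1: -1.25, 2: 0.5}
d_hat = [point_estimate(synopsis, i, len(d)) for i in range(len(d))]
errors = [est - true for est, true in zip(d_hat, d)]
```

Here `d_hat` comes out as `[2.0, 2.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0]`, so the point errors are nonzero only where a dropped coefficient was nonzero.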
9 measuring accuracy
- use some norm to aggregate individual errors
- $L_2$ norm: $\sum_i e_i^2$ is the sum squared error (sse)
- sse = 224
- $L_\infty$ norm: $\max_i |e_i|$ is the maximum absolute error
- max-abs-error = 10
- generalized to any weighted $L_p$ norm: $\sum_i w_i |e_i|^p$
- e.g. max-rel-error = $\max_i (1/d_i)\,|e_i|$ = 10/4 = 250%
vector of point errors: e
vector of data values: d
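The metrics above can be sketched in a few lines of Python. The vectors behind the slide's figures (224, 10 and 250%) are not available in this text, so the data and estimates below are hypothetical, and the helper names are illustrative. The weighted aggregate follows the slide's form $\sum_i w_i |e_i|^p$, without taking a p-th root.

```python
def weighted_lp(errors, weights, p):
    """Weighted L_p aggregate: sum_i w_i * |e_i|**p (no p-th root, as on the slide)."""
    return sum(w * abs(e) ** p for w, e in zip(weights, errors))


def max_abs_error(errors):
    """L_infinity norm of the point errors."""
    return max(abs(e) for e in errors)


def max_rel_error(errors, data):
    """Maximum relative error with weights 1/|d_i|; assumes no data value is zero."""
    return max(abs(e) / abs(x) for e, x in zip(errors, data))


# Hypothetical data and estimates (not the vectors shown on the slide).
d = [4, 8, 10, 6]
d_hat = [7, 7, 7, 7]
e = [est - true for est, true in zip(d_hat, d)]   # [3, -1, -3, 1]

sse = weighted_lp(e, [1] * len(e), 2)
mae = max_abs_error(e)
mre = max_rel_error(e, d)
```

For these vectors the sketch gives an sse of 20, a max-abs-error of 3 and a max-rel-error of 0.75 (75%). Note how the relative metric singles out the smallest data value, which is why the choice of metric is application dependent.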
10 optimal synopses
- a B-term wavelet synopsis can be optimized for any error metric
- sse optimal synopses are straightforward
- wavelet transformation is orthonormal (after normalization) ⇒ by Parseval's theorem the $L_2$ norm is preserved
- choose the coefficients with the highest absolute (normalized) value
- other (weighted or non-weighted) $L_p$ norm optimal synopses require superlinear (quadratic) time in N
- dynamic programming over the wavelet tree
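A minimal sketch of the sse-optimal selection follows. It assumes unnormalized Haar coefficients in error-tree order (index 0 is the overall average, detail node j has children 2j and 2j+1), in which case scaling a coefficient by the square root of its support size gives its normalized magnitude. The coefficient values and function names are hypothetical, chosen only for illustration.

```python
import math


def support(j, n):
    """Number of data points covered by coefficient j in an n-point error tree."""
    return n if j == 0 else n >> (j.bit_length() - 1)


def sse_optimal_synopsis(coeffs, budget):
    """Keep the `budget` coefficients with the largest normalized magnitude."""
    n = len(coeffs)
    ranked = sorted(
        range(n),
        key=lambda j: abs(coeffs[j]) * math.sqrt(support(j, n)),
        reverse=True,
    )
    return {j: coeffs[j] for j in sorted(ranked[:budget])}


def dropped_energy(coeffs, synopsis):
    """By Parseval, the sse equals the normalized energy of the dropped coefficients."""
    n = len(coeffs)
    return sum(c * c * support(j, n) for j, c in enumerate(coeffs) if j not in synopsis)


# Hypothetical unnormalized Haar coefficients of the vector [2, 2, 0, 2, 3, 5, 4, 4].
coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
synopsis = sse_optimal_synopsis(coeffs, 4)
sse = dropped_energy(coeffs, synopsis)
```

With a budget of 4 the sketch keeps indices 0, 1, 5 and 6 and reports an sse of 1.0, contributed by the single dropped nonzero coefficient. Only a sort is needed here; for the other error metrics no such shortcut is available, which is why the slide points to dynamic programming instead.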
11 interesting issues
- I/O efficiency issues when dealing with massive multi-dimensional datasets [M. Jahangiri, D. Sacharidis, C. Shahabi 05]
- during transformation try to minimize I/Os
- efficient maintenance as new data are appended (requires more than just some updating)
- how about optimizing for workloads of range-sum queries?
- no known results (without using the prefix-sum array)
- ranges overlap arbitrarily ⇒ no easy dynamic programming formulation exists
13 working over data streams
- main challenges when data are streaming
- stream items are only seen once
- require small working space
- process stream items quickly
- provide an answer quickly with quality guarantees
- two models, depending on how a data vector a is rendered
- turnstile model: stream elements are updates of type (i, u), which implies $a_i \leftarrow a_i + u$; further, they do not appear ordered in i
- time series model: stream elements are vector values of type (i, $a_i$) and appear ordered in i (e.g., time)
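To make the difference between the two models concrete, the sketch below replays a small stream under each one. It materializes the full vector purely for illustration, which a real streaming algorithm could not afford; the assumption that the turnstile vector starts at zero, the function names and the sample streams are all hypothetical.

```python
def replay_turnstile(n, updates):
    """Turnstile model: each item (i, u) means a[i] <- a[i] + u, in any order of i."""
    a = [0] * n                      # assumption: the vector starts at all zeros
    for i, u in updates:
        a[i] += u
    return a


def replay_time_series(items):
    """Time series model: each item (i, a_i) is a final value, arriving ordered in i."""
    a = []
    for i, value in items:
        if i != len(a):
            raise ValueError("time series items must arrive ordered in i")
        a.append(value)
    return a


# Hypothetical streams.
turnstile_vector = replay_turnstile(4, [(2, 5), (0, 1), (2, -3)])
time_series_vector = replay_time_series([(0, 3), (1, 7), (2, 1)])
```

In the turnstile replay, index 2 is updated twice and ends at 2, so no entry can be considered final while the stream is running. In the time series replay, every value is final as soon as it arrives.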
14 streaming wavelet synopses
- time series model
- at most log N coefficients are affected by each new item
- a large number of coefficients have finalized values
- can perform bottom-up dynamic programming (space required is prohibitive)
- greedy techniques should be deployed instead
- turnstile model
- even optimizing for the sse is hard [G. Cormode, M. Garofalakis, D. Sacharidis 06]
- other error metrics have not been studied
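The locality behind the log N bound can be sketched as follows: changing a single data point by u touches only the coefficients on that point's root-to-leaf path in the error tree. The code assumes unnormalized Haar coefficients in error-tree order (index 0 is the overall average, detail node j has children 2j and 2j+1); the function name and values are hypothetical. Counting the overall average, the sketch touches log2(N) + 1 entries.

```python
def apply_point_update(coeffs, i, u):
    """Apply d[i] <- d[i] + u directly on the coefficients; return touched indices."""
    n = len(coeffs)
    coeffs[0] += u / n                # the overall average shifts by u / n
    touched = [0]
    node, lo, hi = 1, 0, n            # detail node `node` covers leaves [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        delta = u / (hi - lo)         # a semi-difference over s leaves changes by u / s
        if i < mid:                   # point lies in the left subtree: add
            coeffs[node] += delta
            touched.append(node)
            node, hi = 2 * node, mid
        else:                         # point lies in the right subtree: subtract
            coeffs[node] -= delta
            touched.append(node)
            node, lo = 2 * node + 1, mid
    return touched


# Hypothetical coefficients of [2, 2, 0, 2, 3, 5, 4, 4]; then add 4 to position 2.
coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
touched = apply_point_update(coeffs, 2, 4)
```

After the update, `coeffs` matches a fresh decomposition of `[2, 2, 4, 2, 3, 5, 4, 4]`, and only indices 0, 1, 2 and 5 were modified. Maintaining the coefficients is therefore cheap; the hard part noted on the slide is deciding which B of them to retain in small space.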
16 epilogue
- wavelet synopses are a highly successful data summarization technique
- yet, several problems remain open
- optimize for range query workloads
- greedy (time-series) streaming algorithms
- other metrics for general (turnstile) streaming data
17 thank you!
http://www.dblab.ntua.gr/
18 unrestricted wavelet synopses
- the retained coefficients can assume any value, not restricted to their decomposed value (even harder optimization problem!)
- quick example: optimize for max-abs-error, d = [2, 10, 12, 8] and B = 1
- restricted synopsis: keep the overall average, 8 ⇒ m.a.e. = 6
- unrestricted synopsis: keep the overall average but change its value to 7 ⇒ m.a.e. = 5
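The quick example can be checked with a short sketch. When the only retained coefficient is the overall average, every point is estimated by the same constant, and the constant that minimizes the maximum absolute error is the midpoint of the data range. The helper name is illustrative only.

```python
def max_abs_error(data, constant):
    """Max-abs-error when every point is estimated by the same constant."""
    return max(abs(x - constant) for x in data)


d = [2, 10, 12, 8]

# Restricted: the retained coefficient keeps its decomposed value (the mean).
restricted_value = sum(d) / len(d)
restricted_mae = max_abs_error(d, restricted_value)

# Unrestricted: the retained coefficient may take any value; for a single
# constant, the midpoint of the range minimizes the maximum absolute error.
unrestricted_value = (min(d) + max(d)) / 2
unrestricted_mae = max_abs_error(d, unrestricted_value)
```

This reproduces the slide's numbers: the restricted value 8 gives a max-abs-error of 6, while the unrestricted value 7 gives 5. The midpoint shortcut applies only to this one-coefficient case; with more retained coefficients the choice of values interacts across the tree.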