Title: Efficient Incremental Maintenance of Data Cubes
1Efficient Incremental Maintenance of Data Cubes
- 2006. 9. 15.
- Ki Yong Lee
- Software Laboratories
- Samsung Electronics Co., Ltd.
- Myoung Ho Kim
- Division of Computer Science
- Korea Advanced Institute of Science and Technology
2Outline
- Introduction
- Related work
- Incremental maintenance of aggregate views
- Incremental maintenance of data cubes
- Our approach
- Key idea
- Problem formulation
- Heuristic algorithm
- Performance evaluation
- Conclusion
3Data Cube
- A generalized group-by operator GBP96
- Computes group-bys for all possible combinations
of a given set of attributes
SELECT a, b, SUM(m) FROM F GROUP BY a, b
SELECT a, , SUM(m) FROM F GROUP BY a
SELECT a, b, SUM(m) FROM F CUBE BY a, b
2n
SELECT , b, SUM(m) FROM F GROUP BY b
Dimension attributes
SELECT , , SUM(m) FROM F (GROUP BY ?)
4Cube Lattice
- We represent a data cube as a lattice diagram
HRU96 - Each node represents a group-by, which is called
a cuboid - Each edge (qi, qj) represents that qj can be
computed from qi
Cuboid (group-by)
Aggregation
5Maintenance of Data Cubes
- A data cube is typically stored as a materialized
view - How can we update a data cube efficiently when
the source relations change?
SELECT a, b, c, d, SUM(m) FROM F CUBE BY a, b, c,
d
?
SELECT a, b, c, d, SUM(m) FROM F CUBE BY a, b,
c, d
6Related Work (1/2)
- Incremental maintenance of an aggregate view
ASELECT a, b, c, SUM(m) FROM F GROUP BY a,
b, c
F
?ASELECT a, b, c, SUM(m) FROM ?F GROUP BY
a, b, c
A
?F
7Related Work (2/2)
- Incremental maintenance of a data cube
- Propagate stage computes the delta cube
- Refresh stage refreshes the data cube by the
delta cube
?
??
?
b
a
c
b
a
c
?b
?a
?c
bc
ac
bc
ab
ac
ab
?ac
?bc
?ab
abc
abc
?abc
Delta cuboid
F
?F
Original cube
Updated cube
Delta cube
8Motivation
- To incrementally maintain a data cube with 2n
cuboids, existing methods compute 2n delta
cuboids - As n increases, the maintenance cost increases
significantly
Original cube
Delta cube
9Motivation (contd)
- Each cuboid is refreshed separately in existing
methods
2n delta cuboids are used
10Key Idea
- Refresh more than one cuboid by a delta cuboid
11Key Idea (contd)
- Benefit
- The number of delta cuboids that need to be
computed is reduced
??, ?a,?b,?c,?d, ?ab,?ac,?ad,?bc,?bd,?cd, ?abc,?ab
d,?acd,?bcd, ?abcd
?bc,?cd, ?abd,?acd,?bcd, ?abcd
2n 16 delta cuboids need to be computed
Only 6 delta cuboids need to be computed
12Key Idea (contd)
- Refreshing more than one cuboid by a delta cuboid
?ab
ab
? 13
4
?abc
abc
4
? 8
13Key Idea (contd)
- However, this method requires more access to ab
?ab
ab
?ab
?abc
?abc
abc
?abc
14Key Idea (contd)
- But, if ?abc is sorted by a, b and c, ab can be
refreshed by ?abc without more access to ab
?ab
ab
? 28
19
?abc
abc
4
? 8
15Key Idea (contd)
- A delta cuboid can be easily sorted during its
computation with little or no additional cost - In most existing commercial relational database
systems, aggregation algorithms are based on
sorting G93 - If a group-by is computed by sorting algorithms,
sorted results on the grouping attributes can be
easily obtained - We assume that a delta cuboid is computed by
sorting based algorithms - Thus, the above method can be applied to any
delta cuboid with no additional sorting cost
16Generalization of the Idea
- ?d1d2dk
- A delta cuboid sorted in the order of attributes
d1, d2, , dk. - The following set of delta cuboids can be
refreshed by ?d1d2dk - d1d2dk, d1d2dk-1, , d1, ?
- ?d1d2dk ? q1, q2, , qi
- Cuboids q1, q2, , qi are refreshed by ?d1d2dk
- Example
- ?abcd ? abcd, abc, ab, a, ?
- Cuboids abcd, abc, ab, a, ? are refreshed by
?abcd - ?acdb ? acdb, acd, ac, a, ?
- Cuboids acdb, acd, ac, a, ? are refreshed by
?acdb
17Our Approach
- We propose an incremental maintenance method that
can maintain a data cube with 2n cuboids using
only a subset of 2n delta cuboids
?abc ? abc ?ab ? ab ?ca ? ca ?bc ?
bc ?a ? a ?b ? b ?c ? c ?? ? ?
?acb ? acb, ac ?ba ? ba, b ?cb ? cb,
c ?a ? a, ?
?abc ? abc, ab, a, ? ?ca ? ca, c ?bc ?
bc, b
23 8 delta cuboids
4 delta cuboids
3 delta cuboids
18Cost of Computing Delta Cuboids
- Different sets of delta cuboids incur different
computation cost - We represent the cost of computing delta cuboids
by the cost of a delta cuboid computation plan
?abc, ?ab, ?ca, ?bc, ?a, ?b, ?c, ??
?acb, ?ba, ?cb, ?a
?abc, ?ca, ?bc
3
7
6
8
8
15
15
16
14
16
14
14
19Problem Formulation
- Delta cube
- ?Q ?q1, ?q2, , ?qm, where ?qi is a delta
cuboid - Refresh chain
- A sequence of elements lt?q1, ?q2, , ?qigt in ?Q
such that q1 ? q2 ? ? qi - lt?q1, ?q2, , ?qigt implies ?q1 ? q1, q2, ,
qi - Example
- lt?abc, ?ab, ?agt implies ?abc ? abc, ab, a
- lt?cba, ?cb, ?cgt implies ?cba ? cba, cb, c
20Problem Formulation (contd)
- Refresh partition
- A partition of the elements of ?Q into disjoint
refresh chains - Example
lt?acb, ?acgt, lt?ab, ?bgt, lt?bc, ?cgt, lt?a, ??gt
?acb ? acb, ac ?ba ? ba, b ?cb ? cb,
c ?a ? a, ?
implies
lt?abc, ?ab, ?a, ??gt, lt?ac, ?cgt, lt?bc, ?bgt
?abc ? abc, ab, a, ? ?ca ? ca, c ?bc ?
bc, b
implies
21Problem Formulation (contd)
- Delta cuboid computation plan
- A subtree of the delta lattice including at least
all of the first elements of refresh chains in a
given refresh partition - Example
lt?abc, ?ab, ?a, ??gt, lt?ca, ?cgt, lt?bc, ?bgt
Refresh partition
Delta cuboid computation plan
22Problem Statement
- Given a delta cube and its delta lattice, find a
refresh partition that minimizes the cost of a
delta cuboid computation plan
Delta cube ?abc, ?ab, ?ac, ?bc, ?a, ?b, ?a, ??
find out
Refresh partition lt?abc, ?ab, ?a, ??gt, lt?ca,
?cgt, lt?bc, ?bgt
Delta cuboid computation plan
23NP-Hardness of the Problem
- For a given refresh partition, finding the
minimum cost delta cuboid computation plan is
NP-complete - (proved in the paper)
- Our problem is NP-hard
- Our problem is at least as hard as finding the
minimum cost delta cuboid computation plan - Moreover, there can be many refresh partitions
for a given delta cube - Hence, we resort to heuristic approaches
24Idea behind Our Heuristic
- As the number of delta cuboids to be computed
increases, the cost of a delta cuboid computation
plan increases - Hence, we minimize the number of refresh chains
in a refresh partition as possible - The minimum number of refresh chains in a refresh
partition for a delta cube with 2n delta cuboids
(proved in the paper)
?ab ?a ?b ??
?ab, ?a ?b ??
?ab, ?a ?b, ??
n ?n/2?
25Heuristic Algorithm
- Starts from the refresh partition with 2n refresh
chains - Each refresh chain consists of only one delta
cuboid - Repeatedly merge refresh chains until there are
exactly aaa refresh chains in the refresh
partition - Whenever refresh chains are merged, a new delta
cuboid computation plan with less cost is
produced
n ?n/2?
?b, ?? ?a, ?ab, ?c, ?ca, ?bc ?abc
?? ?a, ?b, ?c ?ab, ?ca, ?bc ?abc
?ca, ?c, ?bc, ?b, ?? ?abc, ?ab, ?a
2n
n ?n/2?
26Example of the Algorithm
Lv(0)
??
??
3
Lv(1)
?b
?a
?c
?b
?a
?c
?b,??
?a
?c
7
7
8
6
8
6
Lv(2)
?ca
?bc
?ab
?ca
?bc
?ab
?ca
?bc
?ab
15
15
16
14
16
14
Lv(3)
?abc
?abc
?abc
(3) Step 2
(1) Input
(2) Step 1
?ca, ?c, ?bc, ?b, ??, ?abc, ?ab, ?a
?bc,?b,??
?ca,?c
?ca,?c
?bc,?b,??
?ab,?a
15
15
16
14
14
?abc,?ab,?a
?abc
(5) Step 4
(4) Step 3
(6) Output
27Analysis of the Algorithm
- Lemma 1 Given a delta cube with 2n delta
cuboids, the proposed heuristic algorithm
produces a refresh partition with exactly
refresh chains - Thus, we need to compute only delta
cuboids to refresh a data cube with 2n cuboids
n ?n/2?
n ?n/2?
n ?n/2?
28Analysis of the Algorithm (contd)
- Lemma 2 Let TC be a delta cuboid computation
plan found by the proposed heuristic. Then the
following is true. - G a delta lattice with 2n delta cuboids
- G/2 a subgraph of G such that Level(0),
Level(1), , Level(?n/2?) are removed from G - TG the minimum spanning tree of G
- TG/2 the minimum spanning tree of G/2
- Thus, the cost of TC is bounded by the cost of
TG/2
Cost(TC) lt Cost(TG/2) lt Cost(TG)
29Performance Evaluation (1/3)
- Data warehouse environment
- Oracle9i database
- Sun Blade 1000 with UltraSparc III CPU and 512MB
RAM - TPC-H benchmark schema and data
- Data cubes used in the experiments
- Defined over lineitem table in the TPC-H schema
30Performance Evaluation (2/3)
- By varying the size of changes
(a) Q1
(b) Q2
(c) Q3
31Performance Evaluation (3/3)
- The number of tuples generated in the experiment
(a) Q1
(b) Q2
(c) Q3
32Summary
- We proposed an efficient incremental maintenance
method for data cubes - The proposed method can refresh a data cube with
2n delta cuboids using only delta cuboids - The cost of computing delta cuboids can be
substantially reduced - We formulated the problem and developed a
heuristic algorithm for this problem - We showed the efficiency of the proposed method
through performance evaluation
n ?n/2?
33References
- GBP96 J. Gray, A. Bosworth, A. Layman, and H.
Pirahesh. Data Cube A Relational Aggregation
Operator Generalizing Group-By, Cross-Tab, and
Sub-Totals. In Proceedings of the ICDE
Conference, p. 152-159, 1996 - HRU96 V. Harinarayan, A. Rajaraman, and J. D.
Ullman. Implementing Data Cubes Efficiently. In
Proceedings of the ACM SIGMOD Conference, p.
205-216, 1996. - G93 Goetz Graefe, Query Evaluation Techniques
for Large Databases, ACM Computing Surveys, Vol.
25, Issue 2, p. 73-169, 1993. - FG82 L. R. Foulds and R. L. Graham. The Steiner
Problem in Phylogeny is NP-Complete. Advances in
Applied Mathematics, 3 43-49, 1982.