Efficient Incremental Maintenance of Data Cubes - PowerPoint PPT Presentation

1 / 33
About This Presentation
Title:

Efficient Incremental Maintenance of Data Cubes

Description:

Division of Computer Science. Korea Advanced Institute of Science and Technology. 2 ... if ?abc is sorted by a, b and c, ab can be refreshed by ?abc without ... – PowerPoint PPT presentation

Number of Views:57
Avg rating:3.0/5.0
Slides: 34
Provided by: csU70
Category:

less

Transcript and Presenter's Notes

Title: Efficient Incremental Maintenance of Data Cubes


1
Efficient Incremental Maintenance of Data Cubes
  • 2006. 9. 15.
  • Ki Yong Lee
  • Software Laboratories
  • Samsung Electronics Co., Ltd.
  • Myoung Ho Kim
  • Division of Computer Science
  • Korea Advanced Institute of Science and Technology

2
Outline
  • Introduction
  • Related work
  • Incremental maintenance of aggregate views
  • Incremental maintenance of data cubes
  • Our approach
  • Key idea
  • Problem formulation
  • Heuristic algorithm
  • Performance evaluation
  • Conclusion

3
Data Cube
  • A generalized group-by operator GBP96
  • Computes group-bys for all possible combinations
    of a given set of attributes

SELECT a, b, SUM(m) FROM F GROUP BY a, b
SELECT a, , SUM(m) FROM F GROUP BY a
SELECT a, b, SUM(m) FROM F CUBE BY a, b
2n
SELECT , b, SUM(m) FROM F GROUP BY b
Dimension attributes
SELECT , , SUM(m) FROM F (GROUP BY ?)
4
Cube Lattice
  • We represent a data cube as a lattice diagram
    HRU96
  • Each node represents a group-by, which is called
    a cuboid
  • Each edge (qi, qj) represents that qj can be
    computed from qi

Cuboid (group-by)
Aggregation
5
Maintenance of Data Cubes
  • A data cube is typically stored as a materialized
    view
  • How can we update a data cube efficiently when
    the source relations change?

SELECT a, b, c, d, SUM(m) FROM F CUBE BY a, b, c,
d
?
SELECT a, b, c, d, SUM(m) FROM F CUBE BY a, b,
c, d
6
Related Work (1/2)
  • Incremental maintenance of an aggregate view

ASELECT a, b, c, SUM(m) FROM F GROUP BY a,
b, c
F
?ASELECT a, b, c, SUM(m) FROM ?F GROUP BY
a, b, c
A
?F
7
Related Work (2/2)
  • Incremental maintenance of a data cube
  • Propagate stage computes the delta cube
  • Refresh stage refreshes the data cube by the
    delta cube

?
??
?
b
a
c
b
a
c
?b
?a
?c

bc
ac
bc
ab
ac
ab
?ac
?bc
?ab
abc
abc
?abc
Delta cuboid
F
?F
Original cube
Updated cube
Delta cube
8
Motivation
  • To incrementally maintain a data cube with 2n
    cuboids, existing methods compute 2n delta
    cuboids
  • As n increases, the maintenance cost increases
    significantly

Original cube
Delta cube
9
Motivation (contd)
  • Each cuboid is refreshed separately in existing
    methods

2n delta cuboids are used
10
Key Idea
  • Refresh more than one cuboid by a delta cuboid

11
Key Idea (contd)
  • Benefit
  • The number of delta cuboids that need to be
    computed is reduced

??, ?a,?b,?c,?d, ?ab,?ac,?ad,?bc,?bd,?cd, ?abc,?ab
d,?acd,?bcd, ?abcd
?bc,?cd, ?abd,?acd,?bcd, ?abcd
2n 16 delta cuboids need to be computed
Only 6 delta cuboids need to be computed
12
Key Idea (contd)
  • Refreshing more than one cuboid by a delta cuboid

?ab
ab
? 13
4
?abc
abc
4
? 8
13
Key Idea (contd)
  • However, this method requires more access to ab

?ab
ab
?ab
?abc
?abc
abc
?abc
14
Key Idea (contd)
  • But, if ?abc is sorted by a, b and c, ab can be
    refreshed by ?abc without more access to ab

?ab
ab
? 28
19
?abc
abc
4
? 8
15
Key Idea (contd)
  • A delta cuboid can be easily sorted during its
    computation with little or no additional cost
  • In most existing commercial relational database
    systems, aggregation algorithms are based on
    sorting G93
  • If a group-by is computed by sorting algorithms,
    sorted results on the grouping attributes can be
    easily obtained
  • We assume that a delta cuboid is computed by
    sorting based algorithms
  • Thus, the above method can be applied to any
    delta cuboid with no additional sorting cost

16
Generalization of the Idea
  • ?d1d2dk
  • A delta cuboid sorted in the order of attributes
    d1, d2, , dk.
  • The following set of delta cuboids can be
    refreshed by ?d1d2dk
  • d1d2dk, d1d2dk-1, , d1, ?
  • ?d1d2dk ? q1, q2, , qi
  • Cuboids q1, q2, , qi are refreshed by ?d1d2dk
  • Example
  • ?abcd ? abcd, abc, ab, a, ?
  • Cuboids abcd, abc, ab, a, ? are refreshed by
    ?abcd
  • ?acdb ? acdb, acd, ac, a, ?
  • Cuboids acdb, acd, ac, a, ? are refreshed by
    ?acdb

17
Our Approach
  • We propose an incremental maintenance method that
    can maintain a data cube with 2n cuboids using
    only a subset of 2n delta cuboids

?abc ? abc ?ab ? ab ?ca ? ca ?bc ?
bc ?a ? a ?b ? b ?c ? c ?? ? ?
?acb ? acb, ac ?ba ? ba, b ?cb ? cb,
c ?a ? a, ?
?abc ? abc, ab, a, ? ?ca ? ca, c ?bc ?
bc, b
23 8 delta cuboids
4 delta cuboids
3 delta cuboids
18
Cost of Computing Delta Cuboids
  • Different sets of delta cuboids incur different
    computation cost
  • We represent the cost of computing delta cuboids
    by the cost of a delta cuboid computation plan

?abc, ?ab, ?ca, ?bc, ?a, ?b, ?c, ??
?acb, ?ba, ?cb, ?a
?abc, ?ca, ?bc
3
7
6
8
8
15
15
16
14
16
14
14
19
Problem Formulation
  • Delta cube
  • ?Q ?q1, ?q2, , ?qm, where ?qi is a delta
    cuboid
  • Refresh chain
  • A sequence of elements lt?q1, ?q2, , ?qigt in ?Q
    such that q1 ? q2 ? ? qi
  • lt?q1, ?q2, , ?qigt implies ?q1 ? q1, q2, ,
    qi
  • Example
  • lt?abc, ?ab, ?agt implies ?abc ? abc, ab, a
  • lt?cba, ?cb, ?cgt implies ?cba ? cba, cb, c

20
Problem Formulation (contd)
  • Refresh partition
  • A partition of the elements of ?Q into disjoint
    refresh chains
  • Example

lt?acb, ?acgt, lt?ab, ?bgt, lt?bc, ?cgt, lt?a, ??gt
?acb ? acb, ac ?ba ? ba, b ?cb ? cb,
c ?a ? a, ?
implies
lt?abc, ?ab, ?a, ??gt, lt?ac, ?cgt, lt?bc, ?bgt
?abc ? abc, ab, a, ? ?ca ? ca, c ?bc ?
bc, b
implies
21
Problem Formulation (contd)
  • Delta cuboid computation plan
  • A subtree of the delta lattice including at least
    all of the first elements of refresh chains in a
    given refresh partition
  • Example

lt?abc, ?ab, ?a, ??gt, lt?ca, ?cgt, lt?bc, ?bgt
Refresh partition
Delta cuboid computation plan
22
Problem Statement
  • Given a delta cube and its delta lattice, find a
    refresh partition that minimizes the cost of a
    delta cuboid computation plan

Delta cube ?abc, ?ab, ?ac, ?bc, ?a, ?b, ?a, ??
find out
Refresh partition lt?abc, ?ab, ?a, ??gt, lt?ca,
?cgt, lt?bc, ?bgt
Delta cuboid computation plan
23
NP-Hardness of the Problem
  • For a given refresh partition, finding the
    minimum cost delta cuboid computation plan is
    NP-complete
  • (proved in the paper)
  • Our problem is NP-hard
  • Our problem is at least as hard as finding the
    minimum cost delta cuboid computation plan
  • Moreover, there can be many refresh partitions
    for a given delta cube
  • Hence, we resort to heuristic approaches

24
Idea behind Our Heuristic
  • As the number of delta cuboids to be computed
    increases, the cost of a delta cuboid computation
    plan increases
  • Hence, we minimize the number of refresh chains
    in a refresh partition as possible
  • The minimum number of refresh chains in a refresh
    partition for a delta cube with 2n delta cuboids
    (proved in the paper)

?ab ?a ?b ??
?ab, ?a ?b ??
?ab, ?a ?b, ??
n ?n/2?
25
Heuristic Algorithm
  • Starts from the refresh partition with 2n refresh
    chains
  • Each refresh chain consists of only one delta
    cuboid
  • Repeatedly merge refresh chains until there are
    exactly aaa refresh chains in the refresh
    partition
  • Whenever refresh chains are merged, a new delta
    cuboid computation plan with less cost is
    produced

n ?n/2?
?b, ?? ?a, ?ab, ?c, ?ca, ?bc ?abc
?? ?a, ?b, ?c ?ab, ?ca, ?bc ?abc
?ca, ?c, ?bc, ?b, ?? ?abc, ?ab, ?a
2n
n ?n/2?
26
Example of the Algorithm
Lv(0)
??
??
3
Lv(1)
?b
?a
?c
?b
?a
?c
?b,??
?a
?c
7
7
8
6
8
6
Lv(2)
?ca
?bc
?ab
?ca
?bc
?ab
?ca
?bc
?ab
15
15
16
14
16
14
Lv(3)
?abc
?abc
?abc
(3) Step 2
(1) Input
(2) Step 1
?ca, ?c, ?bc, ?b, ??, ?abc, ?ab, ?a
?bc,?b,??
?ca,?c
?ca,?c
?bc,?b,??
?ab,?a
15
15
16
14
14
?abc,?ab,?a
?abc
(5) Step 4
(4) Step 3
(6) Output
27
Analysis of the Algorithm
  • Lemma 1 Given a delta cube with 2n delta
    cuboids, the proposed heuristic algorithm
    produces a refresh partition with exactly
    refresh chains
  • Thus, we need to compute only delta
    cuboids to refresh a data cube with 2n cuboids

n ?n/2?
n ?n/2?
n ?n/2?
28
Analysis of the Algorithm (contd)
  • Lemma 2 Let TC be a delta cuboid computation
    plan found by the proposed heuristic. Then the
    following is true.
  • G a delta lattice with 2n delta cuboids
  • G/2 a subgraph of G such that Level(0),
    Level(1), , Level(?n/2?) are removed from G
  • TG the minimum spanning tree of G
  • TG/2 the minimum spanning tree of G/2
  • Thus, the cost of TC is bounded by the cost of
    TG/2

Cost(TC) lt Cost(TG/2) lt Cost(TG)
29
Performance Evaluation (1/3)
  • Data warehouse environment
  • Oracle9i database
  • Sun Blade 1000 with UltraSparc III CPU and 512MB
    RAM
  • TPC-H benchmark schema and data
  • Data cubes used in the experiments
  • Defined over lineitem table in the TPC-H schema

30
Performance Evaluation (2/3)
  • By varying the size of changes

(a) Q1
(b) Q2
(c) Q3
31
Performance Evaluation (3/3)
  • The number of tuples generated in the experiment

(a) Q1
(b) Q2
(c) Q3
32
Summary
  • We proposed an efficient incremental maintenance
    method for data cubes
  • The proposed method can refresh a data cube with
    2n delta cuboids using only delta cuboids
  • The cost of computing delta cuboids can be
    substantially reduced
  • We formulated the problem and developed a
    heuristic algorithm for this problem
  • We showed the efficiency of the proposed method
    through performance evaluation

n ?n/2?
33
References
  • GBP96 J. Gray, A. Bosworth, A. Layman, and H.
    Pirahesh. Data Cube A Relational Aggregation
    Operator Generalizing Group-By, Cross-Tab, and
    Sub-Totals. In Proceedings of the ICDE
    Conference, p. 152-159, 1996
  • HRU96 V. Harinarayan, A. Rajaraman, and J. D.
    Ullman. Implementing Data Cubes Efficiently. In
    Proceedings of the ACM SIGMOD Conference, p.
    205-216, 1996.
  • G93 Goetz Graefe, Query Evaluation Techniques
    for Large Databases, ACM Computing Surveys, Vol.
    25, Issue 2, p. 73-169, 1993.
  • FG82 L. R. Foulds and R. L. Graham. The Steiner
    Problem in Phylogeny is NP-Complete. Advances in
    Applied Mathematics, 3 43-49, 1982.
Write a Comment
User Comments (0)
About PowerShow.com