Title: Incremental Maintenance for Non-Distributive Aggregate Functions
1Incremental Maintenance for Non-Distributive
Aggregate Functions
Themis Palpanas (U of Toronto) Richard
Sidle Bobbie Cochrane Hamid Pirahesh
- work done at IBM Almaden Research Center
2Motivation
- large amounts of data stored in databases
- often times data warehouses are used
- consolidate data from many sources
- offer more general and descriptive view of data
- queried by business intelligence tools and
decision support systems - produce expensive OLAP queries
- these OLAP queries have nice properties
- based on same set of tables
- perform similar aggregations
3Motivation (contd)
- can efficiently support such queries with
Automatic Summary Tables (ASTs) - materialized queries defined over a set of base
tables - precomputed once, used many times
- answer complex queries fast
- must maintain ASTs when base tables change
- inserts, updates, deletes
4Motivation (contd)
AST
AST definition
5Aggregate Functions
- characterization of functions wrt insertion and
deletion operations - updates are series of deletions and insertions
- distributive aggregate functions
- new value computed based on old value and value
of operation - SUM()
- non-distributive aggregate functions
- above property does not hold
- STDDEV()
- MIN() (because of deletions)
6Problem Statement
- given ASTs with aggregate functions
- distributive
- SUM, COUNT
- non-distributive
- STDDEV, CORRELATION, REGRESSION, MIN/MAX, XMLAGG,
- when base tables change
- incrementally maintain affected ASTs
- efficient maintenance of ASTs with
- non-distributive aggregate functions
7Outline
- Current Approach
- Our Solution
- Experimental Evaluation
- Related Work
- Conclusions
8Current Approach
AST
delta
AST definition
9Current Approach (contd)
- works for distributive
- SUM, COUNT
- does not work for non-distributive
- STDDEV, CORRELATION, REGRESSION
- MIN/MAX
- XMLAGG
- need new way to deal with these functions
10Our Solution
- selective recomputation
- no longer enough to compute delta
- must recompute some aggregation groups
- minimize work to be done
- choose which groups to recompute
- optimize query plan
11Our Solution (contd)
combine old and new values
recompute affected groups
AST
delta
AST definition
12Our Solution (contd)
- the 5 steps
- compute new aggregate values
- change column derivation
- recompute only affected groups
- eliminate unnecessary operations
- optimize for special cases
13Initial Query Plan
141. Compute New Aggregate Values
- compute delta for distributive functions
- recompute non-distributive functions
- get those values only for affected groups
- duplicate computation for distributive functions!
152. Change Column Derivation
UDI
- change column derivation
- rewrite phase projects out unused columns
- entire AST gets recomputed!
LOJ
AST
LOJ
non-distributive only
prop
AST
distributive only
162. Change Column Derivation
- example AST
- SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salar
y) - FROM employees
- GROUP BY dept_id
- result of COUNT() computed from old propagate
phase - results of MAX() and STDDEV() from AST definition
173. Recompute Affected Groups
UDI
- push join predicate down in AST
- only affected groups are recomputed
- special rules for super-aggregates
- GROUPING SETS
- ROLLUP
- CUBE
non-distributive only
LOJ
AST
LOJ
distributive only
AST
J
J
prop
T1
Tk
183. Recompute Affected Groups
- special treatment for ASTs with super-aggregates
- predicates not pushdownable
- caution not to compute totals of totals
- build special join predicate
- ensure correct aggregations
- change rewrite rules
- allow predicate pushdown through super aggregates
- applicable only for special join predicate
194. Remove Unnecessary Operations
- outerjoin not always needed
- when changes are only inserts
- reroute columns from propagate phase through AST
- remove outerjoin operator
- same for updates not referencing AST grouping
columns and predicates
UDI
LOJ
all columns
AST
distributive only
AST
J
J
prop
T1
Tk
204. Remove Unnecessary Operations
- example AST
- SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salar
y) - FROM employees
- GROUP BY dept_id
- modification on base tables
- UPDATE employees SET salary10 WHERE agegt40
- outerjoin operation will not be built
- update does not refer to grouping column
(dept_id), and no predicate in AST refers to
updated column (salary) - certain that no tuples in AST will be deleted
- only STDDEV() will be recomputed
- the rest are not affected by changes
215. Optimize for Special Cases
- recomputation step not needed when
- only insertions and only MIN/MAX functions
- build predicate in apply phase
- check if new min/max should replace old values
- only deletions referring only to grouping columns
of AST - can only cause entire groups to be deleted
- handled in apply phase
225. Optimize for Special Cases
- example AST
- SELECT dept_id,COUNT(emp_id),MAX(age),STDDEV(salar
y) - FROM employees
- GROUP BY dept_id
- modification on base tables
- DELETE FROM employees WHERE dept_idgt40
- selective recomputation step not needed
- deletion refers only to grouping column (dept_id)
- certain that entire groups will be deleted from
AST - no other groups will be affected
23Experimental Evaluation
- prototype implementation in IBM DB2 UDB
- star schema database
- sales of products over 5 year time period
- fact table 10 million tuples
- AST with non-distributive aggregate function
- 240,000 tuples
- workload simulates nightly updates
- add/delete data for first day of month
- add/delete data for second day of month
- add/delete data for full month
24Experimental Evaluation (contd)
workload 1 workload 2 workload 3
incremental 286 294 420
full refresh 699 702 692
- deletions require 40-60 of full refresh time
workload 1 workload 2 workload 3
incremental 3 n/a 31
full refresh 699 702 692
- optimized deletions require 1-4 of full refresh
time
25Experimental Evaluation (contd)
workload 1 workload 2 workload 3
incremental 151 158 180
full refresh 702 702 721
- insertions/updates require 20-25 of full refresh
time
26Related Work
- incremental view maintenance
- differential refresh algorithms
- Lindsay et al. 1986, Blakeley et al. 1986, Qian
and Wiederhold 1991, Ceri and Widom 1991 - deferred incremental maintenance
- Colby et al. 1996, Salem et al. 2000
- views with aggregation
- Quass 1996, Mumick et al. 1997
27Conclusions
- incremental maintenance for ASTs with
non-distributive aggregate functions - support MIN/MAX, STDDEV, CORRELATION, REGRESSION,
XMLAGG, - efficient selective recomputation
- recompute only affected groups
- optimize query plan
- customize for special cases
- significant performance improvements
28Future Work
- examine use of work areas
- temporary storage space
- store intermediate values
- maintenance without recomputation
- STDDEV, MIN/MAX(?),
- very helpful for ASTs defined with
super-aggregates - ASTs with HAVING clauses
- do not know when groups will enter/leave AST