CS 345: Topics in Data Warehousing

1 / 30

About This Presentation

Title:

CS 345: Topics in Data Warehousing

Description:

CS 345: Topics in Data Warehousing Tuesday, November 2, 2004 Review of Thursday s Class Join Indexes Projection Indexes Horizontal vs. Vertical decomposition Bit ... – PowerPoint PPT presentation

Number of Views:7

Avg rating:3.0/5.0

Slides: 31

Provided by: BrianB105

Learn more at: http://www.stanford.edu

more less

Transcript and Presenter's Notes

Title: CS 345: Topics in Data Warehousing

1
CS 345Topics in Data Warehousing

Tuesday, November 2, 2004

2
Review of Thursdays Class

Join Indexes
Projection Indexes
Horizontal vs. Vertical decomposition
Bit-Sliced Indexes
Fast bitmap counts and sums
Range queries
Bit Vector Filtering

3
Outline of Todays Class

Pre-computed aggregates
Materialized views
Aggregate navigation
Dimension and fact aggregates
Selection of aggregates
Manual selection
Greedy algorithm
Limitations of greedy approach

4
Physical Database Design

Logical database design
What are the facts and dimensions?
Goals Simplicity, Expressiveness
Make the database easy to understand
Make queries easy to ask
Physical database design
How should the data be arranged on disk?
Goal Performance
Manageability is an important secondary concern
Make queries run fast

5
Load vs. Query Trade-off

Trade-off between query performance and load
performance
To make queries run fast
Precompute as much as possible
Build lots of data structures
Indexes
Materialized views
But
Data structures require disk space to store
Building/updating data structures takes time
More data structures ? longer load time

6
Typical Storage Allocations

Base data
Fact tables and dimension tables
Fact table space gtgt Dimension table space
Indexes
100-200 of base data
Aggregates / Materialized Views
100 of base data
Extra data structures 2-3 times size of base data

7
Materialized Views

Many DBMSs support materialized views
Precomputed result of a particular query
Goal faster response for related queries
Example
View definitionSELECT State, SUM(Quantity)
FROM Sales GROUP BY State
Query SELECT SUM(Quantity) FROM SalesWHERE
State 'CA'
Scan view rather than Sales table
View matching problem
When can a query be re-written using a
materialized view?
Difficult to solve in its full generality
DBMSs handle common cases via limited set of
re-write rules

8
Aggregate Tables

Also known as summary tables
Common form of precomputation in data warehouses
Reduce dimensionality of fact by aggregating
across some dimensions
Result is smaller version of fact table
Can be used to answer queries that refer only to
dimensions that are retained
Similar to the idea of a covering index

9
Aggregate Table Example

Sales Fact table
Dimensions (Date, Product, Store, Promotion,
Transaction ID)
Measurements DollarAmt, Quantity
Create aggregate table with (Date, Store)
SELECT DateKey, StoreKey, SUM(DollarAmt),
SUM(Quantity)FROM SalesGROUP BY DateKey,
StoreKey
Store the result in Sales2 table
Queries that only reference Date and Store
attributes can use the aggregate table instead
SELECT Store.District, SUM(Quantity)FROM Sales,
Store, DateWHERE Sales.Date_key
Date.Date_keyAND Sales.Store_key
Store.Store_keyAND Date.Month 'September
2004'GROUP BY Store.District
Replace Sales by Sales2 ? Same query result!

10
Aggregate Tables vs. Indexes

Idea behind covering fact index
Thinner version of fact table
Index takes up less space than fact table
Fewer I/Os required to scan it
Idea behind aggregate table
Thinner and shorter version of fact table
Aggregate table takes up much less space than
fact table
Fewer I/Os required to scan it
Aggregate table has fewer rows
Index has 1 index entry per fact table row
Regardless of how many columns are in the index
Aggregate table has 1 row per unique combination
of dimensions
Often many fewer rows compared to the fact table!
Index supports efficient lookup on leading terms
Useful when filters are selective
Avoid scanning rows that will be filtered out
Can build indexes on aggregate tables, too!

11
Aggregate Navigation

Two techniques to manage aggregates
Let the database management system do it
Do it yourself
Let the database do it
Use materialized view capabilities built in to
the DBMS
Query re-writing happens automatically
Do it yourself
Create and populate additional tables
Perform explicit query re-write (aggregate
navigation)
Pros and Cons of Do-It-Yourself
Pros
More flexibility re-writing power
Better load performance (possibly)
Cons
More tables to manage
Load becomes more complex
Need to write aggregate navigation code
Clients need to use aggregate navigator

12
Aggregate Navigation

Clients are unaware of aggregates
Client queries reference base-level fact
and dimension tables

WarehouseClient
SQLQueries
DataWarehouse
RewrittenQueries
WarehouseClient
AggregateNavigator
WarehouseClient
What aggregates exist?How large is each one?
AggregateMetadata
13
Dimension Aggregates

Methods to define aggregates
1. Include or leave out entire dimensions
2. Include some columns of each dimension, but
not others
Second approach is more common / more useful
Example Several versions of Date dimension
Base Date dimension
1 row per day
Includes all Date attributes
Monthly aggregate dimension
1 row per month
Includes Date attributes at month level or higher
Yearly aggregate dimension
1 row per year
Includes only year-level Date attributes
Each dimension aggregate has its own set of
surrogate keys
Each aggregated fact joins to 1 version of the
Date dimension
Or else the Date dimension is omitted entirely

14
Choosing Dimension Aggregates

Dimension aggregates often roll up along
hierarchies
Day month year
SKU brand category department
Store city county state country
Any subset of dimension attributes can be used
Promotion dimension includes attributes for
coupon, ad, discount, and end-of-aisle display
Promotion aggregate might include only ad-related
attributes
Customer aggregate might include only a few
frequently-queried columns (age, gender, income,
marital status)
Goal reduced number of distinct combinations
Results in fewer rows in aggregated fact
Customer aggregate that included SSN would be
pointless

15
Aggregate Examples

Sales fact table
Date, Product, Store, Promotion, Transaction ID
Date aggregates
Month
Year
Product aggregates
Brand
Manufacturer
Promotion aggregates
Ad
Discount
Coupon
In-Store Display
Store aggregates
District
State

16
Choosing Fact Aggregates

Aggregated fact table includes
Foreign keys to dimension aggregate tables
Aggregated measurement columns
For example, Quantity column holds SUM(Quantity)
Need to choose which version of each dimension to
use
Transaction ID will never be included
Aggregates with degenerate dimensions are rare
Number of possible fact aggregates 4 4 6 4
Dimension with n aggregates ? n2 possibilities
Include base dimension
Omit dimension entirely
n dimension aggregates to choose from
Constructing fact aggregates for all combinations
would be impractical
How to decide which combinations to construct?

17
Aggregate Selection

Two approaches
Manual selection
Data warehouse designer chooses aggregates
Could be time-consuming and error-prone
Automatic selection
Use algorithms to optimize choice of aggregates
Good in principle however, problem is hard
Heuristics for manual selection
Include a mixture of breadth and depth
A few broad aggregates
Lots of dimensions at fairly fine-grained level
of detail
Achieve moderate speedup on a wide range of
queries
Lots of highly targeted aggregates with only a
few rows each
Dimensions are highly rolled up or omitted
entirely
Each aggregate achieves large speedup on a small
class of queries
Roughly equal allocation of space to each type
Consider the query workload
What attributes are queried most frequently?
What sets of attributes are often queried
together?

18
Constructing Aggregates

Constructing dimension aggregates
Determine attributes in aggregate
Generate unique combinations of those attributes
by SELECT DISTINCT from dimension table
Assign surrogate keys to aggregate table rows
Constructing fact aggregates
Build mapping table that maps dimension keys to
dimension aggregate keys, for each dimension
Join fact table to mapping tables and group by
aggregate keys
Constructing aggregates from other aggregates
Fact aggregate on (Product, Year) can be built
from (Product, Month) aggregate
Faster than using base table

19
Data Cube Lattice
State, Month, Color
State, Month
State,Color
Month,Color
RollUp
DrillDown
State
Month
Color
Total
20
Sparsity Revisited

Fact tables are usually sparse
Not all possible combinations of dimension values
actually occur
E.g. not all products sell in all stores on all
days
Aggregate tables are not as sparse
Coarser grain ? lesser sparsity
All products sell in SOME store in all MONTHS
Thus space savings from aggregation can be less
than one might think
Example Date dimension vs. Year dimension
aggregate
Year table has 1/365 as many rows as Date table
Fact aggregate with (Product, Store, Year) has
fewer rows than fact aggregate with (Product,
Store, Date)
But more than 1/365 of the rows!
Some potential space savings are lost due to
reduced sparsity

21
Automatic Selection of Aggregates

Problem Definition
Inputs
Query workload
Set of candidate aggregates
Query cost model
Maximum space to use for aggregates
Output
Set of aggregates to construct
Objective
Minimize cost of executing query workload
Problem is NP-Complete
Approximation is required
Well discuss a Greedy algorithm for the problem
Due to Harinarayan, Rajaraman, and Ullman (1996)

22
Problem Inputs

Set of candidate aggregates
Well consider the data cube lattice
Query workload
Each OLAP query maps to a node in the lattice
Union of grouping and filtering attributes
Workload weight assigned to each lattice node
Weight fraction of queries in workload that
correspond to this node
Query cost model
Well use simple linear cost model
Cost proportional to size of fact aggregate used
to answer query
Justification
Dominant cost is I/O to retrieve tables
Dimension tables small relevant to fact
Oversimplification of reality
Makes the problem easier to analyze
Maximum space to use for aggregates
Well fix a maximum number of aggregates,
regardless of their size
Another simplification for purposes of analysis

23
Configurations
A configuration with 3 aggregates
Node cost assignments
A
A
B
A
B
A
C
C
A
B
C

A configuration consists of a set of aggregates
To compute the cost of a configuration
For each lattice node, find its smallest ancestor
in the configuration
Cost of that node size of smallest ancestor
Cost of configuration weighted sum of node costs

24
Greedy Algorithm

Greedy aggregate selection algorithm
Add aggregates one at a time until space budget
is exhausted
Always add the aggregate whose addition will most
decrease the cost of the current configuration
Performance guarantee for Greedy
Benefit of configuration C (cost of no-aggregate
configuration) - (cost of C)
BG,k Benefit of k-aggregate configuration
chosen by greedy algorithm
BOPT,k Benefit of best possible k-aggregate
configuration
Theorem BG,k gt BOPT,k 0.63
Greedy always achieves at least 63 of optimal
benefit
For proof, see Implementing Data Cubes
Efficiently, by Harinarayan, Rajaraman, and
Ullman, 1996

25
Example of Greedy Algorithm
Fact table cost 500

Empty configuration has cost 2500
5 queries 500 cost
Which to add first?
A cost 1000
5200
B cost 1700
2100 3500
C cost 1698
299 3500
D cost 2100
1100 4500
Add aggregate A first

Weight1
200
A
100
100
99
B
C
90
90
90
90
D
Weight1
Weight1
Weight1
Weight1
Number of allowed aggregates 3
26
Example of Greedy Algorithm
Fact table cost 500

Which to add next?
AB cost 800
2100 3200
AC cost 798
299 3200
AD cost 890
190 4200
Add C next
Final 3-aggregate configuration ACD
Cost 688
190 299 2200

Weight1
200
A
100
100
99
B
C
90
90
90
90
D
Weight1
Weight1
Weight1
Weight1
Number of allowed aggregates 3
27
Example of Greedy Algorithm
Greedy Configuration
Optimal Configuration
Weight1
Weight1
200
200
100
100
99
100
100
99
90
90
90
90
90
90
90
90
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Total cost 600
Total cost 688
28
Practical Limitations

Sizes of aggregates
Greedy algorithm assumed sizes of aggregates
known
In reality, computing the size of an aggregate
can be expensive
Essentially, need to construct the aggregate
table
Estimation techniques
Based on sampling or hashing
Number of candidates
n total dimension attributes ? 2n possible fact
aggregates
Considering all possible aggregates would take
too long even for moderate n
Need to prune the search space before applying
greedy selection
Number of aggregates vs. space consumed
Space consumption is more appropriate limit than
maximum number
Modified greedy algorithm At each step, add
aggregate that has best ratio of (cost
improvement) / (size of aggregate)
Impact of indexes
Selection of indexes also affects query
performance
Should not be done independently of aggregate
selection
Thursday A more practical technique