Title: CS 345: Topics in Data Warehousing
1CS 345Topics in Data Warehousing
- Tuesday, November 2, 2004
2Review of Thursdays Class
- Join Indexes
- Projection Indexes
- Horizontal vs. Vertical decomposition
- Bit-Sliced Indexes
- Fast bitmap counts and sums
- Range queries
- Bit Vector Filtering
3Outline of Todays Class
- Pre-computed aggregates
- Materialized views
- Aggregate navigation
- Dimension and fact aggregates
- Selection of aggregates
- Manual selection
- Greedy algorithm
- Limitations of greedy approach
4Physical Database Design
- Logical database design
- What are the facts and dimensions?
- Goals Simplicity, Expressiveness
- Make the database easy to understand
- Make queries easy to ask
- Physical database design
- How should the data be arranged on disk?
- Goal Performance
- Manageability is an important secondary concern
- Make queries run fast
5Load vs. Query Trade-off
- Trade-off between query performance and load
performance - To make queries run fast
- Precompute as much as possible
- Build lots of data structures
- Indexes
- Materialized views
- But
- Data structures require disk space to store
- Building/updating data structures takes time
- More data structures ? longer load time
6Typical Storage Allocations
- Base data
- Fact tables and dimension tables
- Fact table space gtgt Dimension table space
- Indexes
- 100-200 of base data
- Aggregates / Materialized Views
- 100 of base data
- Extra data structures 2-3 times size of base data
7Materialized Views
- Many DBMSs support materialized views
- Precomputed result of a particular query
- Goal faster response for related queries
- Example
- View definitionSELECT State, SUM(Quantity)
FROM Sales GROUP BY State - Query SELECT SUM(Quantity) FROM SalesWHERE
State 'CA' - Scan view rather than Sales table
- View matching problem
- When can a query be re-written using a
materialized view? - Difficult to solve in its full generality
- DBMSs handle common cases via limited set of
re-write rules
8Aggregate Tables
- Also known as summary tables
- Common form of precomputation in data warehouses
- Reduce dimensionality of fact by aggregating
across some dimensions - Result is smaller version of fact table
- Can be used to answer queries that refer only to
dimensions that are retained - Similar to the idea of a covering index
9Aggregate Table Example
- Sales Fact table
- Dimensions (Date, Product, Store, Promotion,
Transaction ID) - Measurements DollarAmt, Quantity
- Create aggregate table with (Date, Store)
- SELECT DateKey, StoreKey, SUM(DollarAmt),
SUM(Quantity)FROM SalesGROUP BY DateKey,
StoreKey - Store the result in Sales2 table
- Queries that only reference Date and Store
attributes can use the aggregate table instead - SELECT Store.District, SUM(Quantity)FROM Sales,
Store, DateWHERE Sales.Date_key
Date.Date_keyAND Sales.Store_key
Store.Store_keyAND Date.Month 'September
2004'GROUP BY Store.District - Replace Sales by Sales2 ? Same query result!
10Aggregate Tables vs. Indexes
- Idea behind covering fact index
- Thinner version of fact table
- Index takes up less space than fact table
- Fewer I/Os required to scan it
- Idea behind aggregate table
- Thinner and shorter version of fact table
- Aggregate table takes up much less space than
fact table - Fewer I/Os required to scan it
- Aggregate table has fewer rows
- Index has 1 index entry per fact table row
- Regardless of how many columns are in the index
- Aggregate table has 1 row per unique combination
of dimensions - Often many fewer rows compared to the fact table!
- Index supports efficient lookup on leading terms
- Useful when filters are selective
- Avoid scanning rows that will be filtered out
- Can build indexes on aggregate tables, too!
11Aggregate Navigation
- Two techniques to manage aggregates
- Let the database management system do it
- Do it yourself
- Let the database do it
- Use materialized view capabilities built in to
the DBMS - Query re-writing happens automatically
- Do it yourself
- Create and populate additional tables
- Perform explicit query re-write (aggregate
navigation) - Pros and Cons of Do-It-Yourself
- Pros
- More flexibility re-writing power
- Better load performance (possibly)
- Cons
- More tables to manage
- Load becomes more complex
- Need to write aggregate navigation code
- Clients need to use aggregate navigator
12Aggregate Navigation
- Clients are unaware of aggregates
- Client queries reference base-level fact
and dimension tables
WarehouseClient
SQLQueries
DataWarehouse
RewrittenQueries
WarehouseClient
AggregateNavigator
WarehouseClient
What aggregates exist?How large is each one?
AggregateMetadata
13Dimension Aggregates
- Methods to define aggregates
- 1. Include or leave out entire dimensions
- 2. Include some columns of each dimension, but
not others - Second approach is more common / more useful
- Example Several versions of Date dimension
- Base Date dimension
- 1 row per day
- Includes all Date attributes
- Monthly aggregate dimension
- 1 row per month
- Includes Date attributes at month level or higher
- Yearly aggregate dimension
- 1 row per year
- Includes only year-level Date attributes
- Each dimension aggregate has its own set of
surrogate keys - Each aggregated fact joins to 1 version of the
Date dimension - Or else the Date dimension is omitted entirely
14Choosing Dimension Aggregates
- Dimension aggregates often roll up along
hierarchies - Day month year
- SKU brand category department
- Store city county state country
- Any subset of dimension attributes can be used
- Promotion dimension includes attributes for
coupon, ad, discount, and end-of-aisle display - Promotion aggregate might include only ad-related
attributes - Customer aggregate might include only a few
frequently-queried columns (age, gender, income,
marital status) - Goal reduced number of distinct combinations
- Results in fewer rows in aggregated fact
- Customer aggregate that included SSN would be
pointless
15Aggregate Examples
- Sales fact table
- Date, Product, Store, Promotion, Transaction ID
- Date aggregates
- Month
- Year
- Product aggregates
- Brand
- Manufacturer
- Promotion aggregates
- Ad
- Discount
- Coupon
- In-Store Display
- Store aggregates
- District
- State
16Choosing Fact Aggregates
- Aggregated fact table includes
- Foreign keys to dimension aggregate tables
- Aggregated measurement columns
- For example, Quantity column holds SUM(Quantity)
- Need to choose which version of each dimension to
use - Transaction ID will never be included
- Aggregates with degenerate dimensions are rare
- Number of possible fact aggregates 4 4 6 4
- Dimension with n aggregates ? n2 possibilities
- Include base dimension
- Omit dimension entirely
- n dimension aggregates to choose from
- Constructing fact aggregates for all combinations
would be impractical - How to decide which combinations to construct?
17Aggregate Selection
- Two approaches
- Manual selection
- Data warehouse designer chooses aggregates
- Could be time-consuming and error-prone
- Automatic selection
- Use algorithms to optimize choice of aggregates
- Good in principle however, problem is hard
- Heuristics for manual selection
- Include a mixture of breadth and depth
- A few broad aggregates
- Lots of dimensions at fairly fine-grained level
of detail - Achieve moderate speedup on a wide range of
queries - Lots of highly targeted aggregates with only a
few rows each - Dimensions are highly rolled up or omitted
entirely - Each aggregate achieves large speedup on a small
class of queries - Roughly equal allocation of space to each type
- Consider the query workload
- What attributes are queried most frequently?
- What sets of attributes are often queried
together?
18Constructing Aggregates
- Constructing dimension aggregates
- Determine attributes in aggregate
- Generate unique combinations of those attributes
by SELECT DISTINCT from dimension table - Assign surrogate keys to aggregate table rows
- Constructing fact aggregates
- Build mapping table that maps dimension keys to
dimension aggregate keys, for each dimension - Join fact table to mapping tables and group by
aggregate keys - Constructing aggregates from other aggregates
- Fact aggregate on (Product, Year) can be built
from (Product, Month) aggregate - Faster than using base table
19Data Cube Lattice
State, Month, Color
State, Month
State,Color
Month,Color
RollUp
DrillDown
State
Month
Color
Total
20Sparsity Revisited
- Fact tables are usually sparse
- Not all possible combinations of dimension values
actually occur - E.g. not all products sell in all stores on all
days - Aggregate tables are not as sparse
- Coarser grain ? lesser sparsity
- All products sell in SOME store in all MONTHS
- Thus space savings from aggregation can be less
than one might think - Example Date dimension vs. Year dimension
aggregate - Year table has 1/365 as many rows as Date table
- Fact aggregate with (Product, Store, Year) has
fewer rows than fact aggregate with (Product,
Store, Date) - But more than 1/365 of the rows!
- Some potential space savings are lost due to
reduced sparsity
21Automatic Selection of Aggregates
- Problem Definition
- Inputs
- Query workload
- Set of candidate aggregates
- Query cost model
- Maximum space to use for aggregates
- Output
- Set of aggregates to construct
- Objective
- Minimize cost of executing query workload
- Problem is NP-Complete
- Approximation is required
- Well discuss a Greedy algorithm for the problem
- Due to Harinarayan, Rajaraman, and Ullman (1996)
22Problem Inputs
- Set of candidate aggregates
- Well consider the data cube lattice
- Query workload
- Each OLAP query maps to a node in the lattice
- Union of grouping and filtering attributes
- Workload weight assigned to each lattice node
- Weight fraction of queries in workload that
correspond to this node - Query cost model
- Well use simple linear cost model
- Cost proportional to size of fact aggregate used
to answer query - Justification
- Dominant cost is I/O to retrieve tables
- Dimension tables small relevant to fact
- Oversimplification of reality
- Makes the problem easier to analyze
- Maximum space to use for aggregates
- Well fix a maximum number of aggregates,
regardless of their size - Another simplification for purposes of analysis
23Configurations
A configuration with 3 aggregates
Node cost assignments
A
A
B
A
B
A
C
C
A
B
C
- A configuration consists of a set of aggregates
- To compute the cost of a configuration
- For each lattice node, find its smallest ancestor
in the configuration - Cost of that node size of smallest ancestor
- Cost of configuration weighted sum of node costs
24Greedy Algorithm
- Greedy aggregate selection algorithm
- Add aggregates one at a time until space budget
is exhausted - Always add the aggregate whose addition will most
decrease the cost of the current configuration - Performance guarantee for Greedy
- Benefit of configuration C (cost of no-aggregate
configuration) - (cost of C) - BG,k Benefit of k-aggregate configuration
chosen by greedy algorithm - BOPT,k Benefit of best possible k-aggregate
configuration - Theorem BG,k gt BOPT,k 0.63
- Greedy always achieves at least 63 of optimal
benefit - For proof, see Implementing Data Cubes
Efficiently, by Harinarayan, Rajaraman, and
Ullman, 1996
25Example of Greedy Algorithm
Fact table cost 500
- Empty configuration has cost 2500
- 5 queries 500 cost
- Which to add first?
- A cost 1000
- 5200
- B cost 1700
- 2100 3500
- C cost 1698
- 299 3500
- D cost 2100
- 1100 4500
- Add aggregate A first
Weight1
200
A
100
100
99
B
C
90
90
90
90
D
Weight1
Weight1
Weight1
Weight1
Number of allowed aggregates 3
26Example of Greedy Algorithm
Fact table cost 500
- Which to add next?
- AB cost 800
- 2100 3200
- AC cost 798
- 299 3200
- AD cost 890
- 190 4200
- Add C next
- Final 3-aggregate configuration ACD
- Cost 688
- 190 299 2200
Weight1
200
A
100
100
99
B
C
90
90
90
90
D
Weight1
Weight1
Weight1
Weight1
Number of allowed aggregates 3
27Example of Greedy Algorithm
Greedy Configuration
Optimal Configuration
Weight1
Weight1
200
200
100
100
99
100
100
99
90
90
90
90
90
90
90
90
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Total cost 600
Total cost 688
28Practical Limitations
- Sizes of aggregates
- Greedy algorithm assumed sizes of aggregates
known - In reality, computing the size of an aggregate
can be expensive - Essentially, need to construct the aggregate
table - Estimation techniques
- Based on sampling or hashing
- Number of candidates
- n total dimension attributes ? 2n possible fact
aggregates - Considering all possible aggregates would take
too long even for moderate n - Need to prune the search space before applying
greedy selection - Number of aggregates vs. space consumed
- Space consumption is more appropriate limit than
maximum number - Modified greedy algorithm At each step, add
aggregate that has best ratio of (cost
improvement) / (size of aggregate) - Impact of indexes
- Selection of indexes also affects query
performance - Should not be done independently of aggregate
selection - Thursday A more practical technique
29Course Project
- Project is deliberately open-ended
- Some possibilities include
- Survey of research literature
- Read several related research papers write a
report summarizing them - Research project
- Compare alternate approaches to the same problem
- Devise and test a brand new approach to a
difficult problem - Programming project
- Build a tool for some aspect of designing /
querying / loading data warehouses - Implement one of the data structures or
algorithms discussed in class - Other project related to your research /
interests - Should involve concepts from this course in some
way
30Course Project
- Project timeline
- By Tuesday, Nov. 9
- Send the instructor e-mail with the general topic
youre considering - Its OK if you dont yet know exactly what you
want to do - Receive feedback to help narrow in on a specific
project - By Tuesday, Nov. 16
- Submit project description (1 page or less)
- Describe your plans in some detail
- Tuesday, Nov. 30 and Tuesday, Dec. 2
- 5-10 minute in-class presentations
- Brief overview of your project and your results
- Doing a demo is great, if appropriate
- By Wednesday, Dec. 8
- Submit final project write-up