CS 345: Topics in Data Warehousing

1 / 30
About This Presentation
Title:

CS 345: Topics in Data Warehousing

Description:

CS 345: Topics in Data Warehousing Tuesday, November 2, 2004 Review of Thursday s Class Join Indexes Projection Indexes Horizontal vs. Vertical decomposition Bit ... – PowerPoint PPT presentation

Number of Views:7
Avg rating:3.0/5.0
Slides: 31
Provided by: BrianB105
Learn more at: http://www.stanford.edu

less

Transcript and Presenter's Notes

Title: CS 345: Topics in Data Warehousing


1
CS 345Topics in Data Warehousing
  • Tuesday, November 2, 2004

2
Review of Thursdays Class
  • Join Indexes
  • Projection Indexes
  • Horizontal vs. Vertical decomposition
  • Bit-Sliced Indexes
  • Fast bitmap counts and sums
  • Range queries
  • Bit Vector Filtering

3
Outline of Todays Class
  • Pre-computed aggregates
  • Materialized views
  • Aggregate navigation
  • Dimension and fact aggregates
  • Selection of aggregates
  • Manual selection
  • Greedy algorithm
  • Limitations of greedy approach

4
Physical Database Design
  • Logical database design
  • What are the facts and dimensions?
  • Goals Simplicity, Expressiveness
  • Make the database easy to understand
  • Make queries easy to ask
  • Physical database design
  • How should the data be arranged on disk?
  • Goal Performance
  • Manageability is an important secondary concern
  • Make queries run fast

5
Load vs. Query Trade-off
  • Trade-off between query performance and load
    performance
  • To make queries run fast
  • Precompute as much as possible
  • Build lots of data structures
  • Indexes
  • Materialized views
  • But
  • Data structures require disk space to store
  • Building/updating data structures takes time
  • More data structures ? longer load time

6
Typical Storage Allocations
  • Base data
  • Fact tables and dimension tables
  • Fact table space gtgt Dimension table space
  • Indexes
  • 100-200 of base data
  • Aggregates / Materialized Views
  • 100 of base data
  • Extra data structures 2-3 times size of base data

7
Materialized Views
  • Many DBMSs support materialized views
  • Precomputed result of a particular query
  • Goal faster response for related queries
  • Example
  • View definitionSELECT State, SUM(Quantity)
    FROM Sales GROUP BY State
  • Query SELECT SUM(Quantity) FROM SalesWHERE
    State 'CA'
  • Scan view rather than Sales table
  • View matching problem
  • When can a query be re-written using a
    materialized view?
  • Difficult to solve in its full generality
  • DBMSs handle common cases via limited set of
    re-write rules

8
Aggregate Tables
  • Also known as summary tables
  • Common form of precomputation in data warehouses
  • Reduce dimensionality of fact by aggregating
    across some dimensions
  • Result is smaller version of fact table
  • Can be used to answer queries that refer only to
    dimensions that are retained
  • Similar to the idea of a covering index

9
Aggregate Table Example
  • Sales Fact table
  • Dimensions (Date, Product, Store, Promotion,
    Transaction ID)
  • Measurements DollarAmt, Quantity
  • Create aggregate table with (Date, Store)
  • SELECT DateKey, StoreKey, SUM(DollarAmt),
    SUM(Quantity)FROM SalesGROUP BY DateKey,
    StoreKey
  • Store the result in Sales2 table
  • Queries that only reference Date and Store
    attributes can use the aggregate table instead
  • SELECT Store.District, SUM(Quantity)FROM Sales,
    Store, DateWHERE Sales.Date_key
    Date.Date_keyAND Sales.Store_key
    Store.Store_keyAND Date.Month 'September
    2004'GROUP BY Store.District
  • Replace Sales by Sales2 ? Same query result!

10
Aggregate Tables vs. Indexes
  • Idea behind covering fact index
  • Thinner version of fact table
  • Index takes up less space than fact table
  • Fewer I/Os required to scan it
  • Idea behind aggregate table
  • Thinner and shorter version of fact table
  • Aggregate table takes up much less space than
    fact table
  • Fewer I/Os required to scan it
  • Aggregate table has fewer rows
  • Index has 1 index entry per fact table row
  • Regardless of how many columns are in the index
  • Aggregate table has 1 row per unique combination
    of dimensions
  • Often many fewer rows compared to the fact table!
  • Index supports efficient lookup on leading terms
  • Useful when filters are selective
  • Avoid scanning rows that will be filtered out
  • Can build indexes on aggregate tables, too!

11
Aggregate Navigation
  • Two techniques to manage aggregates
  • Let the database management system do it
  • Do it yourself
  • Let the database do it
  • Use materialized view capabilities built in to
    the DBMS
  • Query re-writing happens automatically
  • Do it yourself
  • Create and populate additional tables
  • Perform explicit query re-write (aggregate
    navigation)
  • Pros and Cons of Do-It-Yourself
  • Pros
  • More flexibility re-writing power
  • Better load performance (possibly)
  • Cons
  • More tables to manage
  • Load becomes more complex
  • Need to write aggregate navigation code
  • Clients need to use aggregate navigator

12
Aggregate Navigation
  • Clients are unaware of aggregates
  • Client queries reference base-level fact
    and dimension tables

WarehouseClient
SQLQueries
DataWarehouse
RewrittenQueries
WarehouseClient
AggregateNavigator
WarehouseClient
What aggregates exist?How large is each one?
AggregateMetadata
13
Dimension Aggregates
  • Methods to define aggregates
  • 1. Include or leave out entire dimensions
  • 2. Include some columns of each dimension, but
    not others
  • Second approach is more common / more useful
  • Example Several versions of Date dimension
  • Base Date dimension
  • 1 row per day
  • Includes all Date attributes
  • Monthly aggregate dimension
  • 1 row per month
  • Includes Date attributes at month level or higher
  • Yearly aggregate dimension
  • 1 row per year
  • Includes only year-level Date attributes
  • Each dimension aggregate has its own set of
    surrogate keys
  • Each aggregated fact joins to 1 version of the
    Date dimension
  • Or else the Date dimension is omitted entirely

14
Choosing Dimension Aggregates
  • Dimension aggregates often roll up along
    hierarchies
  • Day month year
  • SKU brand category department
  • Store city county state country
  • Any subset of dimension attributes can be used
  • Promotion dimension includes attributes for
    coupon, ad, discount, and end-of-aisle display
  • Promotion aggregate might include only ad-related
    attributes
  • Customer aggregate might include only a few
    frequently-queried columns (age, gender, income,
    marital status)
  • Goal reduced number of distinct combinations
  • Results in fewer rows in aggregated fact
  • Customer aggregate that included SSN would be
    pointless

15
Aggregate Examples
  • Sales fact table
  • Date, Product, Store, Promotion, Transaction ID
  • Date aggregates
  • Month
  • Year
  • Product aggregates
  • Brand
  • Manufacturer
  • Promotion aggregates
  • Ad
  • Discount
  • Coupon
  • In-Store Display
  • Store aggregates
  • District
  • State

16
Choosing Fact Aggregates
  • Aggregated fact table includes
  • Foreign keys to dimension aggregate tables
  • Aggregated measurement columns
  • For example, Quantity column holds SUM(Quantity)
  • Need to choose which version of each dimension to
    use
  • Transaction ID will never be included
  • Aggregates with degenerate dimensions are rare
  • Number of possible fact aggregates 4 4 6 4
  • Dimension with n aggregates ? n2 possibilities
  • Include base dimension
  • Omit dimension entirely
  • n dimension aggregates to choose from
  • Constructing fact aggregates for all combinations
    would be impractical
  • How to decide which combinations to construct?

17
Aggregate Selection
  • Two approaches
  • Manual selection
  • Data warehouse designer chooses aggregates
  • Could be time-consuming and error-prone
  • Automatic selection
  • Use algorithms to optimize choice of aggregates
  • Good in principle however, problem is hard
  • Heuristics for manual selection
  • Include a mixture of breadth and depth
  • A few broad aggregates
  • Lots of dimensions at fairly fine-grained level
    of detail
  • Achieve moderate speedup on a wide range of
    queries
  • Lots of highly targeted aggregates with only a
    few rows each
  • Dimensions are highly rolled up or omitted
    entirely
  • Each aggregate achieves large speedup on a small
    class of queries
  • Roughly equal allocation of space to each type
  • Consider the query workload
  • What attributes are queried most frequently?
  • What sets of attributes are often queried
    together?

18
Constructing Aggregates
  • Constructing dimension aggregates
  • Determine attributes in aggregate
  • Generate unique combinations of those attributes
    by SELECT DISTINCT from dimension table
  • Assign surrogate keys to aggregate table rows
  • Constructing fact aggregates
  • Build mapping table that maps dimension keys to
    dimension aggregate keys, for each dimension
  • Join fact table to mapping tables and group by
    aggregate keys
  • Constructing aggregates from other aggregates
  • Fact aggregate on (Product, Year) can be built
    from (Product, Month) aggregate
  • Faster than using base table

19
Data Cube Lattice
State, Month, Color
State, Month
State,Color
Month,Color
RollUp
DrillDown
State
Month
Color
Total
20
Sparsity Revisited
  • Fact tables are usually sparse
  • Not all possible combinations of dimension values
    actually occur
  • E.g. not all products sell in all stores on all
    days
  • Aggregate tables are not as sparse
  • Coarser grain ? lesser sparsity
  • All products sell in SOME store in all MONTHS
  • Thus space savings from aggregation can be less
    than one might think
  • Example Date dimension vs. Year dimension
    aggregate
  • Year table has 1/365 as many rows as Date table
  • Fact aggregate with (Product, Store, Year) has
    fewer rows than fact aggregate with (Product,
    Store, Date)
  • But more than 1/365 of the rows!
  • Some potential space savings are lost due to
    reduced sparsity

21
Automatic Selection of Aggregates
  • Problem Definition
  • Inputs
  • Query workload
  • Set of candidate aggregates
  • Query cost model
  • Maximum space to use for aggregates
  • Output
  • Set of aggregates to construct
  • Objective
  • Minimize cost of executing query workload
  • Problem is NP-Complete
  • Approximation is required
  • Well discuss a Greedy algorithm for the problem
  • Due to Harinarayan, Rajaraman, and Ullman (1996)

22
Problem Inputs
  • Set of candidate aggregates
  • Well consider the data cube lattice
  • Query workload
  • Each OLAP query maps to a node in the lattice
  • Union of grouping and filtering attributes
  • Workload weight assigned to each lattice node
  • Weight fraction of queries in workload that
    correspond to this node
  • Query cost model
  • Well use simple linear cost model
  • Cost proportional to size of fact aggregate used
    to answer query
  • Justification
  • Dominant cost is I/O to retrieve tables
  • Dimension tables small relevant to fact
  • Oversimplification of reality
  • Makes the problem easier to analyze
  • Maximum space to use for aggregates
  • Well fix a maximum number of aggregates,
    regardless of their size
  • Another simplification for purposes of analysis

23
Configurations
A configuration with 3 aggregates
Node cost assignments
A
A
B
A
B
A
C
C
A
B
C
  • A configuration consists of a set of aggregates
  • To compute the cost of a configuration
  • For each lattice node, find its smallest ancestor
    in the configuration
  • Cost of that node size of smallest ancestor
  • Cost of configuration weighted sum of node costs

24
Greedy Algorithm
  • Greedy aggregate selection algorithm
  • Add aggregates one at a time until space budget
    is exhausted
  • Always add the aggregate whose addition will most
    decrease the cost of the current configuration
  • Performance guarantee for Greedy
  • Benefit of configuration C (cost of no-aggregate
    configuration) - (cost of C)
  • BG,k Benefit of k-aggregate configuration
    chosen by greedy algorithm
  • BOPT,k Benefit of best possible k-aggregate
    configuration
  • Theorem BG,k gt BOPT,k 0.63
  • Greedy always achieves at least 63 of optimal
    benefit
  • For proof, see Implementing Data Cubes
    Efficiently, by Harinarayan, Rajaraman, and
    Ullman, 1996

25
Example of Greedy Algorithm
Fact table cost 500
  • Empty configuration has cost 2500
  • 5 queries 500 cost
  • Which to add first?
  • A cost 1000
  • 5200
  • B cost 1700
  • 2100 3500
  • C cost 1698
  • 299 3500
  • D cost 2100
  • 1100 4500
  • Add aggregate A first

Weight1
200
A
100
100
99
B
C
90
90
90
90
D
Weight1
Weight1
Weight1
Weight1
Number of allowed aggregates 3
26
Example of Greedy Algorithm
Fact table cost 500
  • Which to add next?
  • AB cost 800
  • 2100 3200
  • AC cost 798
  • 299 3200
  • AD cost 890
  • 190 4200
  • Add C next
  • Final 3-aggregate configuration ACD
  • Cost 688
  • 190 299 2200

Weight1
200
A
100
100
99
B
C
90
90
90
90
D
Weight1
Weight1
Weight1
Weight1
Number of allowed aggregates 3
27
Example of Greedy Algorithm
Greedy Configuration
Optimal Configuration
Weight1
Weight1
200
200
100
100
99
100
100
99
90
90
90
90
90
90
90
90
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Weight1
Total cost 600
Total cost 688
28
Practical Limitations
  • Sizes of aggregates
  • Greedy algorithm assumed sizes of aggregates
    known
  • In reality, computing the size of an aggregate
    can be expensive
  • Essentially, need to construct the aggregate
    table
  • Estimation techniques
  • Based on sampling or hashing
  • Number of candidates
  • n total dimension attributes ? 2n possible fact
    aggregates
  • Considering all possible aggregates would take
    too long even for moderate n
  • Need to prune the search space before applying
    greedy selection
  • Number of aggregates vs. space consumed
  • Space consumption is more appropriate limit than
    maximum number
  • Modified greedy algorithm At each step, add
    aggregate that has best ratio of (cost
    improvement) / (size of aggregate)
  • Impact of indexes
  • Selection of indexes also affects query
    performance
  • Should not be done independently of aggregate
    selection
  • Thursday A more practical technique

29
Course Project
  • Project is deliberately open-ended
  • Some possibilities include
  • Survey of research literature
  • Read several related research papers write a
    report summarizing them
  • Research project
  • Compare alternate approaches to the same problem
  • Devise and test a brand new approach to a
    difficult problem
  • Programming project
  • Build a tool for some aspect of designing /
    querying / loading data warehouses
  • Implement one of the data structures or
    algorithms discussed in class
  • Other project related to your research /
    interests
  • Should involve concepts from this course in some
    way

30
Course Project
  • Project timeline
  • By Tuesday, Nov. 9
  • Send the instructor e-mail with the general topic
    youre considering
  • Its OK if you dont yet know exactly what you
    want to do
  • Receive feedback to help narrow in on a specific
    project
  • By Tuesday, Nov. 16
  • Submit project description (1 page or less)
  • Describe your plans in some detail
  • Tuesday, Nov. 30 and Tuesday, Dec. 2
  • 5-10 minute in-class presentations
  • Brief overview of your project and your results
  • Doing a demo is great, if appropriate
  • By Wednesday, Dec. 8
  • Submit final project write-up
Write a Comment
User Comments (0)