Title: CS 345: Topics in Data Warehousing
1CS 345Topics in Data Warehousing
- Tuesday, October 12, 2004
2Review of Thursdays Class
- Facts
- Semi-additive facts
- Factless fact tables
- Slowly Changing Dimensions
- Overwrite history
- Preserve history
- Hybrid schemes
- More dimension topics
- Dimension roles
- Junk dimension
- More fact topics
- Multiple currencies
- Master/Detail facts and fact allocation
- Accumulating Snapshot fact tables
3Outline of Todays Class
- Customer Relationship Management (CRM)
- Dimension-focused queries
- Drill-across
- Conformed dimensions
- Customer dimension
- Behavioral attributes
- Auxiliary tables
- Techniques for very large dimensions
- Outriggers
- Mini-dimensions
- Hierarchies
- Bridge tables
4Customer Relationship Management (CRM)
- Currently a hot topic in business data analysis
- Idea Gain better understanding of customer
behavior by integrating data from various sources - Multiple interaction types
- Orders
- Returns
- Customer support
- Billing
- Service / repairs
- Multiple interaction channels
- Retail store
- E-mail
- Call center (Inbound / Outbound)
- Web site
5CRM questions
- Customer profitability
- Identify most / least profitable customers
- 80/20 rule
- Customer retention
- Which customers are most likely to defect to a
competitor? - Which retention measures work best?
- Customer acquisition
- Which prospects are most promising?
- What offers will entice them to become customers?
- Up-sell / Cross-sell
- Gain additional business from existing customers
- Provide targeted offers during inbound
communications
6Dimension-focused Queries
- Standard OLAP queries are fact-focused
- Query touches one fact table and its associated
dimensions - Some types of analysis are dimension-focused
- Bring together data from different fact tables
that have a dimension in common - Common dimension used to coordinate facts
- Sometimes referred to as drilling across
7Drill-Across Example
- Example scenario
- Sales fact with dimensions (Date, Customer,
Product, Store) - CustomerSupport fact with dimensions (Date,
Customer, Product, ServiceRep) - Question How does frequency of support calls by
California customers affect their purchases of
Product X? - Step 1 Query CustomerSupport fact
- Group by Customer SSN
- Filter on State California
- Compute COUNT
- Query result has schema (Customer SSN,
SupportCallCount) - Step 2 Query Sales fact
- Group by Customer SSN
- Filter on State California, Product Name
Product X - Compute SUM(TotalSalesAmt)
- Query result has schema (Customer SSN,
TotalSalesAmt) - Step 3 Combine query results
- Join Result 1 and Result 2 based on Customer SSN
- Group by SupportCallCount
- Compute COUNT, AVG(TotalSalesAmt)
8A Problem with the Example
- What if some customers dont make any support
calls? - No rows for these customers in CustomerSupport
fact - No rows for these customers in result of Step 1
- No data for these customers in result of Step 3
- Solution use outer join in Step 3
- Customers who are in Step 2 but not Step 1 will
be included in result of Step 3 - Attributes from Step 1 result table will be NULL
for these customers - Convert these NULLs to an appropriate value
before presenting results - Using SQL NVL() function
9Conformed Dimensions
- Bottom-up data warehousing approach builds one
data mart at a time - Drill-across between data marts requires common
dimension tables - Common dimensions and attributes should be
standardized across data marts - Create master copy of each common dimension table
- Three types of conformed dimensions
- Dimension table identical to master copy
- Dimension table has subset of rows from the
master copy - Can improve performance when many dimension rows
are not relevant to a particular process - Dimension table has subset of attributes from
master copy - Allows for roll-up dimensions at different grains
10Conformed Dimension Example
- Monthly sales forecasts
- Predicted sales for each brand in each district
in each month - POS Sales fact recorded at finer-grained detail
- Product SKU vs. Brand
- Date vs. Month
- Store vs. District
- Use roll-up dimensions
- Brand dimension is rolled-up version of master
Product dimension - One row per brand
- Only include attributes relevant at brand level
or higher - Month dimension is rolled-up Date
- District dimension is rolled-up Store
- Schema
- Sales (Date, Product, Store, Promotion,
Transaction ID) - Forecast (Month, Brand, District)
11Drill-Across Example
- Question How did actual sales diverge from
forecasted sales in Sept. 04? - Drill-across between Forecast and Sales
- Step 1 Query Forecast fact
- Group by Brand Name, District Name
- Filter on MonthAndYear Sept 04
- Calculate SUM(ForecastAmt)
- Query result has schema (Brand Name, District
Name, ForecastAmt) - Step 2 Query Sales fact
- Group by Brand Name, District Name
- Filter on MonthAndYear Sept 04
- Calculate SUM(TotalSalesAmt)
- Query result has schema (Brand Name, District
Name, TotalSalesAmt) - Step 3 Combine query results
- Join Result 1 and Result 2 on Brand Name and
District Name - Result has schema (Brand Name, District Name,
ForecastAmt, TotalSalesAmt) - Outer join unnecessary assuming
- Forecast exists for every brand, district, and
month - Every brand has some sales in every district
during every month
12The Customer Dimension
- Customer dimensions can be very wide
- Often dozens or even hundreds of attributes
- Contact information (name, address, phone,
e-mail) - Demographics (age, ethnicity, gender, education,
profession, income, household size, etc.) - Psychographics (interests, values, beliefs,
attitudes) - Dates (birthday, first purchase, last purchase,
online reg. date) - Behavioral scores (RFM, churn propensity, etc.)
- Data available from many sources
- Information provided directly by customers
- Prospect lists acquired from partners or vendors
- Syndicated data
- Market research
- Customs data
- Data derived from warehouse analysis
13Behavioral Attributes
- Customers can be segmented based on past behavior
- Aggregated fact data converted to dimensional
attributes - Examples
- RFM scoring
- Recency of last purchase
- Frequency of purchases
- Monetary value of purchases
- Scores based on predictive models
- Propensity to churn
- Probability of default
- Segmentation based on clustering algorithms
- Raw aggregated data
- Total dollar sales in past year
14Behavioral Attributes
- Two techniques for handling behavioral attributes
- Dimension attributes generated during ETL process
- Stored in dimension table
- Good query performance
- Limited flexibility
- Preserving history possible (but may be
expensive) - Virtual attributes created on demand via user
queries - Stored in auxiliary tables
- Very flexible and customizable
- Increased management complexity
- Increased query complexity
- Query performance may suffer
- Cant easily preserve history
15Auxiliary Tables for User-Defined Attributes
User-CreatedAuxiliary Table
Customer Dimension
Cust_id Name Zip
1 Brian 94403
2 Rajeev 94303
Name Score
Brian 3
Rajeev 5
Natural keyof customerdimension
User-definedattribute fromquery result
Name is natural key
- Join dimension and auxiliary table using natural
key - Join result looks like expanded customer
dimension
16Another Use of Auxiliary Tables
- Track a set of customers over time
- For example, a focus group or pre-selected sample
- Set of customers may be defined based on a query
- Query results may change over time as customer
attributes slowly change - How to preserve the initial set?
- Create single-column auxiliary table containing
natural key of customers in the set - Join to the auxiliary table to filter based on
the initial customer set
17Continuous vs. Discrete Values
- Some attributes have large number of possible
values on a continuous scale - Income
- Age
- Most simple behavioral attributes are of this
type - TotalSalesOfProductXIn2003
- Disadvantages of continuous attributes
- Grouping by continuous attribute produces huge,
meaningless report - Number of unique attribute combinations explodes
- Greater number of rows in dimension
- More frequent changes to value of dimension
attributes - Group continuous attributes into discrete bands
- Like a histogram
- Instead of Salary 47540, use Salary 40K-50K
- Avoids above disadvantages
- Downside loss of information
18Very Large Dimensions
- Customer dimensions can be very wide
- Dozens or hundreds of attributes
- Customer dimensions can be very large
- Tens of millions of rows in some warehouses
- Sometimes includes prospects as well as actual
customers - Size can lead to performance challenges
- One case when performance concerns can trump
simplicity - Can we reduce width of dimension table?
- Can we reduce number of rows caused by preserving
history for slowly changing dimension?
19Outrigger Tables
- Limited normalization of large dimension table to
save space - Identify attribute sets with these properties
- Highly correlated
- Low in cardinality (compared to of customers)
- Change in unison
- Example
- External data provider computes demographic data
for each county - 100 demographic attributes are provided
- Updates are supplied every six months
- Follow these steps for each attribute set
- Create a separate outrigger dimension for each
attribute set - Remove the attributes from the customer dimension
- Replace with a foreign key to the outrigger table
- No foreign key from fact row to outrigger
- Outrigger attributes indirectly associated with
facts via customer dim.
20Outrigger Example
Customer Dimension
Cust_id FName LName Zip Demo_id
1 Brian Babcock 94403 34
2 Rajeev Motwani 94303 12
3 Leland Stanford 94305 12
Demo_id County AvgInc HHoldSize
12 Santa Clara 78000 2.3
34 San Mateo 67000 2.5
County Demographics Outrigger
21Outrigger Tables
- Advantages
- Space savings
- Customer dimension table becomes narrower
- Outrigger table has relatively few rows
- One copy per county vs. one copy per customer
- Disadvantages
- Additional tables introduced
- Accessing outrigger attributes requires an extra
join - Users must remember which attributes are in
outrigger vs. main customer dimension - Creating a view can solve this problem
22Mini-Dimensions
- Some attributes change relatively frequently
- Behavior-based scores
- Certain demographic attributes
- Age, Income, Marital Status, of children
- How to preserve history without row explosion?
- Some attributes are queried relatively frequently
- Queries using huge customer dimension are slowed
- How to improve query performance?
- Create a mini-dimension
- Remove frequently-changing or frequently-queried
attributes from the customer dimension - Add them to a separate mini-dimension table
instead - Discretize mini-dimension attributes to reduce
cardinality - Group continuously-valued attributes into buckets
or bands - Example Age lt 20, Age 20-29, Age 30-39, Age
40-49, Age 50 - Include foreign keys to both customer dimension
mini-dimension in fact table
23Mini-Dimensions
- Advantages
- History preserved without space blow-up
- Fact table captures historical record of
attribute values - Mini-dimension has small number of rows
- of unique combinations of mini-dimension
attributes is small - Consequence of discretization
- Limit number of attributes in a single
mini-dimension! - Improved performance for queries that use
mini-dimension - At least for those queries that can avoid the
main customer dimension - Disadvantages
- Fact table width increases
- Due to increased number of dimension foreign keys
- Information lost due to discretization
- Less detail is available
- Impractical to change bucket / band boundaries
- Additional tables introduced
- Users must remember which attributes are in
mini-dimension vs. main customer dimension
24Outrigger vs. Mini-Dimension
CustomerDimension
CustomerDimension
Fact
Fact
Outrigger
Mini-dimension
25Outrigger vs. Mini-Dimension
- Mini-dimension approach
- Explicit link between fact table and
mini-dimension - No explicit link between customer dimension and
mini-dimension - Difficult to express queries that group past
customer behavior based on current demographic
values - Implicit association via fact table
- Outrigger approach
- No explicit link between fact table and outrigger
- Explicit link between customer dimension and
outrigger - Associating facts with outrigger requires join
through customer dimension - Preserving history for rapidly-changing
attributes leads to customer dimension blow-up - Hybrid approach
- Separate some attributes into their own
mini-dimension - Add foreign key to mini-dimension to both fact
table and customer dimension - Customer dimension foreign key updated using
overwrite history semantics - Queries based on historically accurate attribute
values use fact table foreign key - Queries based on latest attribute values use
customer dimension foreign key - Greater expressive power, and greater risk of
confusing users!
26More Outriggers / Mini-Dims
- Lots of information about some customers, little
info about others - A common scenario
- Example web site browsing behavior
- Web User dimension ( Customer dimension)
- Unregistered users
- User identity tracked over time via cookies
- Limited information available
- First active date, Latest active date, Behavioral
attributes - Possibly ZIP code through IP lookup
- Registered users
- Lots of data provided by user during registration
- Many more unregistered users than registered
users - Most attribute values are unknown for
unregistered users - Split registered user attributes into a separate
table - Either an outrigger or a mini-dimension
- For unregistered users, point to special
Unregistered row
27Handling Hierarchies
- Hierarchical relationships among dimension
attributes are common - There are various ways to handle hierarchies
- Store all levels of hierarchy in denormalized
dimension table - The preferred solution in almost all cases!
- Create snowflake schema with hierarchy captured
in separate outrigger table - Only recommended for huge dimension tables
- Storage savings have negligible impact in most
cases - What about variable-depth hierarchies?
- Examples
- Corporate organization chart
- Parts composed of subparts
- Previous two solutions assumed fixed-depth
- Creating recursive foreign key to parent row is a
possibility - Employee dimension has boss attribute which is
FK to Employee - The CEO has NULL value for boss
- This approach is not recommended
- Cannot be queried effectively using SQL
- Alternative approach bridge table
28Bridge Tables
Customer 1
- Customer dimension has one row for each customer
entity at any level of the hierarchy - Separate bridge table has schema
- Parent customer key
- Subsidiary customer key
- Depth of subsidiary
- Bottom flag
- Top flag
- One row in bridge table for every (ancestor,
descendant) pair - Customer counts as its own Depth-0 ancestor
- 16 rows for the hierarchy at right
- Fact table can join
- Directly to customer dimension
- Through bridge table to customer dimension
Customer 2
Customer 3
Customer 4
Customer 5
Customer 6
Customer 7
29Using Bridge Tables in Queries
- Two join directions
- Navigate up the hierarchy
- Fact joins to subsidiary customer key
- Dimension joins to parent customer key
- Navigate down the hierarchy
- Fact joins to parent customer key
- Dimension joins to subsidiary customer key
- Safe uses of the bridge table
- Filter on customer dimension restricts query to a
single customer - Use bridge table to combine data about that
customers subsidiaries or parents - Filter on bridge table restricts query to a
single level - Require Top Flag Y
- Require Depth 1
- For immediate parent / child organizations
- Require (Depth 1 OR Top Flag Y)
- Generalizes the previous example to properly
treat top-level customers - Other uses of the bridge table risk over-counting
- Bridge table is many-to-many between fact and
dimension