Title: TechnoLogica
1DW Concepts Dimension Modeling Techniques
Milena Gerova Project Manager
TechnoLogica Ltd. 3, Sofiisko Pole Str. tel (
3592) 91 91 2 (ten lines) e-mail
office_at_technologica.com, http//
www.technologica.com
2TechnoLogica DW Projects
- Business Management System National Health
Insurance Fund (10.2004 current) - Customer Data Integration Allianz Bulgaria
Holding (10.2004 current) - Regulatory Reporting System BULBANK (2002 -
2003) - Information System Monetary StatisticsBulgarian
National Bank (April 2003 August 2004) - Management Information System BULBANK (January
2001 - June 2002)
3Agenda
- DW Terminology Overview
- Dimensional Modeling
- Dimension Types
- History and Dimensions
- Hierarchy in Dimensions
4The data warehouse must
- Make an organizations information easily
accessible. - Present the organizations information
consistently. - Be adaptive and resilient to change
- Be a secure bastion that protects our information
assets. - Serve as the foundation for improved decision
making - The business community must accept the data
warehouse if it is to be deemed successful.
5Components of a Data Warehouse
6Dimensional Modeling
- Dimensional modeling is a new name for an old
technique for making databases simple and
understandable - Dimensional modeling is quite different from
third-normal-form (3NF) modeling
- ERM -gtThe Transaction Processing Model
- One table per entity
- Minimize data redundancy
- Optimize update
- DM -gt The data warehousing model
- One fact table for a process in the organization
- Maximize understandability
- Optimized for retrieval
- Resilient to change
7Star Dimensional Modeling
8Four-Step Dimensional Design Process
- 1. Select the business process to model.
- 2. Declare the grain of the business process.
- 3. Choose the dimensions that apply to each fact
table row. - 4. Identify the numeric facts that will populate
each fact table row.
9Dimensions
- Determine these by the ways you want to slice and
dice the data - Small number of rows compared to facts
- Usually 5-10 dimensions surrounding a fact table
- Time is almost always a dimension used by every
fact - Track history
- Uses Surrogate Keys
- Hierarchies are usually built into them if
possible
10Date Dimension
- The date dimension is the one dimension nearly
guaranteed to be in every data mart - Date Dimension Time Dimension before
- We can build the date dimension table in advance
(5-10 years -gt only 3,650 rows)
11Date Dimension
12Date Dimension
13Date Dimension
- Data warehouses always need an explicit date
dimension table. There are many date attributes
not supported by the SQL date function, including
fiscal periods, seasons, holidays, and weekends.
Rather than attempting to determine these
nonstandard calendar calculations in a query, we
should look them up in a date dimension table. - select sum(f.amount_sold)from DATE_DIM d, FACT
fwhere d.Calendar_Month January and d.id
f.date_dim_id
14Dimension Normalization(Denormalized dimension)
15Dimension Normalization(Denormalized dimension)
16Dimension Normalization(Snowflaking)
17Dimension Normalization(Snowflaking)
- The dimension tables should remain as flat tables
physically. - Normalized, snowflaked dimension tables penalize
cross-attribute browsing and prohibit the use of
bit-mapped indexes. - Disk space savings gained by normalizing the
dimension tables typically are less than 1
percent of the total disk space needed for the
overall schema
18Too Many Dimensions
19Too Many Dimensions
- A very large number of dimensions typically is a
sign that several dimensions are not completely
independent and should be combined into a single
dimension. - If our design has 25 or more dimensions, we
should look for ways to combine correlated
dimensions into a single dimension - It is a dimensional modeling mistake to represent
elements of a hierarchy as separate dimensions in
the fact table.
20Surrogate Keys
- Every join between dimension and fact tables in
the data warehouse should be based on meaningless
integer surrogate keys. - You should avoid using the natural operational
production codes. None of the data warehouse keys
should be smart, where you can tell something
about the row just by looking at the key.
21Surrogate Keys
- Surrogate keys are like an immunization for the
data warehouse - Buffer the data warehouse environment from
operational changes - Performance advantages The smaller surrogate key
translates into smaller fact tables, smaller fact
table indices, and more fact table rows per block
input-output operation - Surrogate keys are used to record dimension
conditions that may not have an operational
codeNo Promotion in Effect, Date Not
Applicable.
22Surrogate Keys
- The date dimension is the one dimension where
surrogate keys should be assigned in a
meaningful, sequential order - Surrogate keys are needed to support one of the
primary techniques for handling changes to
dimension table attributes - Dont use concatenated or compound keys for
dimension tables
23Data Warehouse Bus Architecture
24Data Warehouse Bus Matrix
25Conformed Dimensions
- Most dimensions are defined naturally at the most
granular level possible - Conformed dimensions are either identical or
strict mathematical subsets of the most granular,
detailed dimension - They have consistent dimension keys, consistent
attribute column names, consistent attribute
definitions, and consistent attribute values - The conformed dimension may be the same physical
table within the database or may be duplicated
synchronously in each data mart
26Conformed Dimensions
- Roll-up dimensions conform to the base-level
atomic dimension if they are a strict subset of
that atomic dimension.
27Conformed Dimensions
- They should be built once in the staging area
- They must be published prior to staging of the
fact data - The dimension authority has responsibility for
defining, maintaining, and publishing a
particular dimension or its subsets to all the
data mart clients who need it
28Tracking History in Dimensions
- Unchanging Dimensions
- Changing, but Original Values are Irrelevant A
phone number in a customer record - Slowly Changing Dimensions (SCD) A customer
address, manager - Rapidly Changing Dimensions Income range of a
customer - Continuously Changing Dimensions Customer age
29Type 1 Overwrite the Value
- The type 1 response is easy to implement, but
- it does not maintain any history of prior
attribute values - any preexisting aggregations based on the
department value will need to be rebuilt
30Type 2 Add a Dimension Row
- The type 2 response is the primary technique for
accurately tracking slowly changing dimension
attributes. It is extremely powerful because the
new dimension row automatically partitions
history in the fact table. - Its not suitable for dimension tables that
already exceed a million rows
31Type 2 Add a Dimension Row
32Type 3 Add a Dimension Column
- The type 3 slowly changing dimension technique
allows us to see new and historical fact data by
either the new or prior attribute values.
33Hybrid SCD TechniquesSeries of Type 3 Attributes
- Predictable Changes with Multiple Version
Overlays - Report each years sales using the district map
for that year. - Report each years sales using a district map
from an arbitrary different year. - Report an arbitrary span of years sales using a
single district map from any chosen year. The
most common version of this requirement would be
to report the complete span of fact data using
the current district map.
34Hybrid SCD TechniquesType 2 with "Current"
Overwrite
- Unpredictable Changes with Single-Version Overlay
preserves historical accuracy while supporting
the ability to report historical data according
to the current values
35Dimension Table Staging
36Dimension Table Staging
37Degenerate Dimension
- Dimension keys without corresponding dimension
tables - Operational control numbers such as order
numbers, invoice numbers, and bill-oflading
numbers usually give rise to empty dimensions - Degenerate dimensions are stored in the fact
tables where the grain of the table is the
document itself or a line item in the document
38Junk Dimensions
- What to do with flags and indicators
- Leave the flags and indicators unchanged in the
fact table row. - Make each flag and indicator into its own
separate dimension - Strip out all the flags and indicators from the
design. - A junk dimension is a convenient grouping of
typically low-cardinality flags and indicators
39Junk Dimensions
- Whether to use junk dimension
- 5 indicators, each has 3 values -gt 243 (35) rows
- 5 indicators, each has 100 values -gt 100 million
(1005) rows - When to insert rows in the dimension
40Multiple Currencies
41Customer Dimension
- Critical element for effective CRM
- The most challenging dimension for any data
warehouse - extremely deep (with millions of rows)
- extremely wide (with dozens or even hundreds of
attributes) - sometimes subject to rather rapid change
42Customer Dimension Name and Address Parsing
43Customer Dimension Other Common Customer
Attributes
- Gender
- Ethnicity
- Age or other life-stage classifications
- Income or other lifestyle classifications
- Status (for example, new, active, inactive,
closed) - Referring source
- Business-specific market segment
- Scores characterizing the customer, such as
purchase behavior, payment behavior, product
preferences
44Customer Dimension Aggregated Facts as Attributes
- These attributes are to be used for constraining
and labeling they are not to be used in numeric
calculations - Focus on those which will be used frequently
- Minimize the frequency with which these
attributes need to be updated - Replace metrics with more meaningful descriptive
values, such as High Spender
45Dimension Outriggers for aLow-Cardinality
Attribute Set
46Rapidly Changing CustomerDimensions
- Challenges
- It generally takes too long to constrain or
browse among the relationships in such a big
table - It is difficult to use previously described
techniques for tracking changes in these large
dimensions - One solution is to break off frequently analyzed
or frequently changing attributes into a separate
dimension, referred to as a minidimension
47Rapidly Changing CustomerDimensions
- The Mini Dimension with "Current" Overwrite
48Rapidly Changing CustomerDimensions
- The minidimension terminology refers to when the
demographics key is part of the fact table
composite key - If the demographics key is a foreign key in the
customer dimension, we refer to it as an outrigger
49Rapidly Changing CustomerDimensions
- Type 2 with Natural Keys in Fact Table
50Implications of Type 2 CustomerDimension Changes
- Be careful to avoid overcounting because we may
have multiple rows in the customer dimension for
the same individual - COUNT DISTINCT
- A most recent row indicator
- The comparison operators depend on the business
rules used to set our effective/expiration dates.
51Customer Behavior Study Groups
- Capture the keys of the customers or products
whose behavior you are tracking
52Commercial Customer Hierarchies
53Commercial Customer Hierarchies
54Commercial Customer Hierarchies
55Commercial Customer Hierarchies
- Be aware of risk of double counting
- SELECT 'San Francisco', SUM(F.REVENUE)FROM FACT
F, DATE DWHERE F.CUSTOMER_KEY IN (SELECT
B.SUBSIDIARY_KEY FROM CUSTOMER C, BRIDGE
B WHERE C.CUSTOMER_KEY B.PARENT_KEY AND
C.CUSTOMER_CITY 'San Francisco') //to sum all
SF parentsAND F.DATE_KEY D.DATE_KEYAND
D.MONTH 'January 2002GROUP BY 'San Francisco'
56Heterogeneous Product Schemas
57Heterogeneous Product Schemas
58Common Dimensional Modeling Mistakes to Avoid
- Mistake 10 Place text attributes used for
constraining and grouping in a fact table - Mistake 9 Limit verbose descriptive attributes
in dimensions to save space - Mistake 8 Split hierarchies and hierarchy levels
into multiple dimensions - Mistake 7 Ignore the need to track dimension
attribute changes - Mistake 6 Solve all query performance problems
by adding more hardware
59Common Dimensional Modeling Mistakes to Avoid
- Mistake 5 Use operational or smart keys to join
dimension tables to a fact table - Mistake 4 Neglect to declare and then comply
with the fact tables grain - Mistake 3 Design the dimensional model based on
a specific report - Mistake 2 Expect users to query the lowest-level
atomic data in a normalized forma - Mistake 1 Fail to conform facts and dimensions
across separate fact tables
60Questions
and
Answers