Title: Dimensional Modeling Techniques
1Dimensional Modeling Techniques
2Dimensional Models
- A E-R model for an organization represents all
the business processes in the enterprise - Sales calls, order entry, shipment, billing, etc.
- First step in developing dimensional model
architecture is to separate out individual
business processes - Some approaches call these subject areas
3Info Cubes
- Basic idea of dimensional modeling is that every
type of business data can be represented as a
kind of cube (hypercube) of data - Cells of the cube contain measured values (facts)
- Edges of the cube define the natural dimensions
of the data
4(No Transcript)
5Keys
- All keys should be anonymous (surrogate) data
warehouse keys - Do not use the keys used in the production
systems - Production keys can change because data is
temporary whereas data in the data warehouse is
long term - Surrogate keys take up less space usually
6Facts
- Facts are
- Something that is not known in advance
- An observation in the marketplace
- Usually numeric
- Any numeric field (especially floating point
values) is likely to be a fact especially if it
is subject to change - Dimensions usually dont change or change very
slowly - Fact tables are usually very narrow but have many
(millions, billions) of records
7Example
- Quantity sold is not floating point but
- It is numeric
- It is unique for each transaction
- It is a measurable quantity
- Standard price is a value that doesnt change
often but - It is floating point value
- It does change occasionally
- Choice of putting it in the fact table probably
depends on how often it changes
8Attributes
- Attributes are usually text fields and usually
describe a characteristic of a tangible thing - For example, product flavor is an attribute that
describes a product - It is not a measured quantity
- It is known in advance
9Dimensions
- Attributes that describe things are organized
into dimensions - Dimensions are collections of attributes that are
highly correlated with one another - Attributes in a single dimension often come from
multiple tables in the production database - Dimensions usually have relatively few records
but are very wide
10Retail Dimensions
- Product
- prod id, name, description, etc.
- Customer
- Customer id, name, address, email, etc.
- Promotion
- Prom id, name, terms, dates
- Store
- Store ID, location, contact, contact phone, etc.
- Time
- Day, week, month, year, etc.
11Granularity
- Granularity refers to the level of aggregation of
the data - Individual sales quantities vs. average
- Individual product units vs. cases
- Dimensions are almost always at their most
granular (atomic) level - Each record is an individual customer
- Facts should also be atomic since any other level
of aggregation can be derived from atomic data
and the table can be gracefully extended - Can use a technique called aggregates to improve
performance
12Combining Entities in Dimensions
- Often there are more than one way to define
dimensions - Product dimension and store dimension
- Product-Store dimension
- Assume there are 1000 products and 100 stores
- If all (or most) products appear in all stores
then the dimension will have up to 100,000
records - If each store only carries a few products then
the single Product-Store dimension may not only
make sense it may be interesting
13Combining Entities in Dimensions
- In general, entities should can be combined if
- There is a one-to-one or one-to-many relationship
between attributes - The attributes are highly correlated
- Entities shouldnt be combined if
- There is a many-to-many relationship between them
- The entities participate independently in
separate business process fact tables
14Many-to-Many Attributes
There may be a many-to-manyrelationship between
ShipTo andBillTo. Since there must be a
recordfor every combination of ShipTo andBillTo
the table would become verylarge.
15Dimensions Drilling Down
- Dimension attributes take on the role of
- Query constraints
- Column headers in reports
- Attributes can be added to SQL SELECT list and be
used in SQL GROUP BY or Order BY clauses
16District Brand Total Dollars Total
Cost Gross Profict Atherton Clean Fast
1,233 1,058 175 Atherton More
Power 2,239 2,200 39
Every attribute can be used as a column header.
17Dimensions Drilling Down
- Drilling down involves adding another attribute
to the SQL SELECT list - Drilling Down can take advantage of hierarchical
relationships among attributes in dimensions but
can use any attributes whether they are related
or not
18Finance and Marketing Hierarchy in Product
Dimension
The marketinghierarchy
The finance hierarchy
Other attributesunrelated to anyhierarchy
SELECT product_category,
Drill Down
SELECT product_category, product_brand
19Time Dimension Hierarchy
Multiple hierarchies canco-exist within the
samedimension. Notice that the dimensionsare
not normalized.
20Partial Customer Dimension
This partial customerdimension contains
threehierarchies ShipTo Address, Corporate
hierarchy, and our sales force hierarchy.
Customer Hierarchy
21Snowflaking
- Snowflaking occurs when the low cardinality
fields in dimension tables are removed into
separate tables - Normalizing the dimension tables
- Snowflaking is tempting to database types who
like to normalize but it is usually a bad idea - Decreases performance
- Decreases clarity of database
- Prevents the use of bitmap indexes which increase
performance dramatically
22Snowflaked Product Dimension
23Permissible Snowflaking
- Creating subdimensions is permissible under
specific circumstances - If some attributes in a dimension are at a
different granularity than the others - If we save a significant amount of storage space
(very large dimension) - If we often want to browse among the attributes
in the subdimension
24Demographic Subdimension of Customer Dimension
Aggregated at County Level
Individual Customers
25Verbose Attributes
- Dimensions should contain readable fields that
describe the members of the dimension - Not good C001AX247
- Not good GRN PAPR TOWLS 32OZ RVS
- Remember these appear in reports so should be
descriptive - Attributes can be translated in the extraction
process
26Good Attributes
- Verbose
- Descriptive
- Complete (no missing values)
- Quality assured
- Indexed
- Equally available
- Documented
27Time Dimension
Subdimension created in this casebecause we want
to use the sameTime Dimension in
different countries
Time dimension is critical in every data
warehouse. This dimension has adaily
granularity and has enough detail that
applications dont have to haveany hard coded
date logic.
28Slowly Changing Dimensions
- Dimensions are usually fairly static but
sometimes can change slowly - A well designed data warehouse will specify a
change strategy for every attribute of every
dimension - Three strategies for handling changing dimensions
29Type 1 Overwrite the value
- The old attribute value is overwritten so the
dimension always reflects the most recent value
Product
SKU Number Product Key
Description Department (Natural
Key) 12345 IntelliKidz 1.0
Education ABC922-Z
Assume IntelliKidz 1.0 is reclassified as
Strategy. The Type 1 responsesimply updates the
existing record
Product
SKU Number Product Key
Description Department (Natural
Key) 12345 IntelliKidz 1.0
Strategy ABC922-Z
30Type 1 Overwrite the value
- Advantage is that its quick and easy
- Disadvantage is that there is no historical
record - So if the sales if IntelliKidz 1.0 takes off
because of the reclassification we have no
historical record of the reason - Correct strategy if were simply correcting a bad
value
31Type 2 Add a Dimension Row
- A new record is created and the old record is kept
- Product
SKU Number - Product Key Description Department
(Natural Key) - IntelliKidz 1.0 Education
ABC922-Z - 67890 IntelliKidz 1.0 Strategy
ABC922-Z
May include a effective date field but it isnt
necessary
32Type 2 Add a Dimension Row
- An example of why we cant use the natural key as
our primary key - Type 2 changes are said to perfectly partition
history because fact table records that are added
after the change will reference record 67890 - We can look at the entire history using the SKU
or can look at before and after using the Product
Key - Most common approach
33Type 3 Add a Dimension Column
- Adding a column is useful if we want the ability
to see fact data as if the change never occurred - Sales district boundaries are altered but users
want to still see sales in terms of the old
boundaries
Product
Prior SKU
Number Product Key Description
Department Department (Natural Key) 12345
IntelliKidz 1.0 Strategy
Education ABC922-Z
34Type 3 Add a Dimension Column
- Useful when we want to maintain multiple
alternate realities - We can look at all the fact data by either value
of the attribute - Often occurs with an attribute which is human
applied and has limited hard effects - Rarely used
35Rapidly Changing Dimensions
- If changes to dimensions are rapid then
- If the dimension is small use the Type 2 approach
- If the dimension is large then create a
subdimension with the changing attributes
36Rapidly Changing Dimensions
Demographics are convertedinto banded values
(e.g.0-12,000, 12,000-20,000,etc.) and then
we have a recordfor every combination of
demographic attributes. If we have five
attributes with 10 levels each then we have 105
100,000 records. If this gets too large
createmultiple demographic dimensions
(i.e.demographics and purchase-demogs) Whenever
a change occurs we adda record to the fact table
and sinceeach fact table record includes
thecustomer and demog keys, wellalways be able
to classify customers.
37Degenerate Dimensions
- Many dimensional designs revolve around a line
item oriented document like a sales order - Line item is the natural granularity for the fact
table - The problem is what to do with the document
number (e.g. order number) - It isnt really a fact and doesnt belong to a
dimension - The answer is to add it to the fact table as a
degenerate dimension
38Junk Dimensions
- Sometimes when all the descriptive attributes are
assigned to dimensions there are data elements
left over - e.g. miscellaneous flags like order type
- These can be gathered into a junk dimension
39Additive Facts
- Whenever possible facts should be perfectly
additive - If a fact can be added across all dimensions it
is perfectly additive - Usually measures of activity are perfectly
additive - Measures of intensity are not perfectly additive
- Account balances, inventory levels
- These are snapshots
- Some can be added across time dimension unless
you sum and divide by the time units - Some cant be added across any dimension (e.g.
temperature)
40Textual Facts
- Textual attributes should almost always be in
dimension tables - There are some specific instances when a textual
attribute can be a fact - A description of weather on an accident report
- This value is unique for each record so it would
be difficult to add to a dimension since every
accident would then require a unique dimension
record
41Design Process
- Single-source or multi-source data marts
- Declare the fact table grain
- Very precisely define what a record in the fact
table is (Sales order line item) - Choosing the dimensions
- Choosing the facts
42Fact Table Grain
- Individual transactions
- Each sales transaction is a fact record
- Each insurance claim transaction is a fact record
- Each ATM transaction is a fact record
- Very simple structure and can vary wildly in
number
43Fact Table Grain
- Snapshots
- Each daily product sales total in each store is a
fact record - Each monthly account snapshot is a fact record
- Can be complex with many attributes including
many non-additive attributes - In many cases snapshots can be constructed from
transaction facts so we would only create a
snapshot fact table for performance reasons - In some cases (like the second above) the
snapshot is not a simple aggregation
44Fact Table Grain
- Line items
- Each line item on each order is a fact record
- Each line item on each shipments invoice is a
fact record - Each coverage in each individual insurance policy
is a fact record - Represent a line item on some type of control
document (i.e. order)
Back
45Choosing the Dimensions
- Once the fact table is established choosing the
dimensions may be straightforward - Some are natural (i.e. customer for order line
items) - Some are required for the users reporting
requirements - Dimensions must be at the same granularity or
coarser then the fact table (cant have monthly
sales in the fact table and have days in the time
dimension)
Back
46Choosing the Facts
- Transaction facts are usually just the amount of
the transaction - Snapshot facts can be complex
- Line item facts can contain several values (i.e.
quantities, gross amounts, adjustments, etc.) - Facts should always match the grain of the table
- Aggregates can be created to increase performance