Title: Data Models for Warehouse
1Data Models for Warehouse
- Session-12/13
- Data Management for Decision Support
2Data Models
- Data Models
- relations
- stars snowflakes
- cubes
- Operators
- slice dice
- roll-up, drill down
- pivoting
- other
3Data Models
- Star schemas are database schemas that exploit
the structure of data for decision support query - Queries in DSS tend to
- Examine a set of factual transactions- POS,
Customer events - Facts are analyzed in variety of ways - POS
transaction by week, or store - For example a retail store
- POS is at the center
- Product information - SKU, hierarchy of ( section
dept, BU) - Time information - day, week, month, year
- Stores - Store-id, hierarchy (regions, city,
locality) - Suppliers- Sup-id, location, discounts
4Data Models
Products
Time
Sales Transactions
Stores
Suppliers
Information is split between two classes- Factual
information and Reference information
5FACT DATA
- Fact data records the information on factual
event that occurred in the business- POS, Phone
calls, Banking transactions - Typically 70 of Warehouse data is Fact data
- Important to identify and define structure right
in the first place as restructuring is an
expensive process - Detail content of FACT is derived from the
business requirement - Recorded Facts do not change as they are events
of past
6Dimension Data
- Information that is used for analyzing the
elemental data, for example, product hierarchy,
time periods, customers, stores - It is the reference data used for analysis of
Facts - Organizing the information in separate reference
tables offers better query performance - It differs from Fact data as it changes over
time, due to changes in business, reorganization - It should be structured to permit rapid changes
7FACT and Dimensions
- Millions to billions of rows
- Multiple foreign keys
- Numeric
- Does not change
- Tens to millions of rows
- One primary key
- Textual decription
- Frequently modifies
8Decision Support Queries
- Examples
- Average number of sales of Haldiram per store
over last month (various types within the brand) - Projected sales of Deepavali gift packs against
the actual - The top 20 customers (spending) over last
quarter - The customers with average balance in excess of
Rs. 25000 for past one year - gt Each of these queries is based on Factual data
9Decision Support Queries
POS Transaction Membership card
Transaction Account transactions
Sales of Haldiram Customer Spend Account
Balance
Quantity Sold Product Store Date, Time Revenue
Realized Customer-Id Store Transaction
Value Date and Time Customer AC number type of
transaction amount
10Star Schema
- The star schema is a data-modeling technique used
to map multidimensional decision support into a
relational database. - Star schemas yield an easily implemented model
for multidimensional data analysis while still
preserving the relational structure of the
operational database. - Four Components
- Facts
- Dimensions
- Attributes
- Attribute hierarchies
11A Simple Star Schema
12Star Schema
- Facts
- Facts are numeric measurements (values) that
represent a specific business aspect or activity. - The fact table contains facts that are linked
through their dimensions. - Facts can be computed or derived at run-time
(metrics). - Dimensions
- Dimensions are qualifying characteristics that
provide additional perspectives to a given fact. - Dimensions are stored in dimension tables.
13Identifying Facts and Dimensions
Elemental Transaction
Determine Key Dimensions
Check if Fact is a dimension
Check if dimensions is a Fact
14Identification Step 1
- Examine the enterprise model and identify the
transaction that or of interest- driven by
business requirement analysis - These will be transaction that describes events
fundamental to the business e.g., calls for
Telecom, account transactions in banking - For each potential Fact ask a question- Is this
information operated upon by business process?
Daily sales versus POS, even if system reports
daily sales POS may be the FACT - The limit of current recording should not
influence Warehouse design
15Identification Step 1
Fact Table POS Transaction Stock movement and
position Customer events Account
transactions Claims and receipts Call
events Customer events(install, disconnect,
payment)
Sector and Business Retail Sales Shrinkage Retai
l Banking Customer profiling
Profitability Insurance Product
Profitability Telecom Call Analysis
Customer Analysis
16Identification Step 2
- Look at the logical model to find the entities
associated with entities in the fact table. List
out all such logically associate entities. - These are candidate References, the task is to
find key dimension entities that may not be
directly associated. - For example, retail banking account transaction
are candidate fact table. The account transaction
is candidate reference. But, the customer I
indirectly related to transaction. Although, a
better choice. - Analyze account transaction by account?
- Analyze how customers use our services?
- You store both relationships but customer becomes
a dimension
17Identification Step3
- FACT is not actually a denormalized dimension
table - Consider the following
- house-details
- Cable-laid
- Sales-persons visit
- connected to the service
- promotional material sent
- subscription cancelled
-
- Home-details - candidate fact
- Operational events
- Report on number of connections quarter-to-date
- Time-lag between laying and subscrition
18Identification Step 4
- Dimension is not a FACT
- Lot depends on DSS requirements-
- Customer can be FACT or Dimension
- Promotions can be fact or dimensions
- Ask questions using other dimensions- Using how
many other dimensions, Can I view this entity. - Can I view promotion by Time?
- Can I view promotions by product?
- Can I view promotion by store?
- Can I vie promotions by suppliers?
- If answer to these question is yes, then it is a
FACT
19Star Schema
- Attributes
- Each dimension table contains attributes.
Attributes are often used to search, filter, or
classify facts. - Dimensions provide descriptive characteristics
about the facts through their attributes.
Possible Attributes For Sales Dimensions
20Three Dimensional View Of Sales
21Slice And Dice View Of Sales
22Star Schema
- Attribute Hierarchies
- Attributes within dimensions can be ordered in a
well-defined attribute hierarchy. - The attribute hierarchy provides a top-down data
organization that is used for two main purposes - Aggregation
- Drill-down/roll-up data analysis
23A Location Attribute Hierarchy
24Attribute Hierarchies In Multidimensional Analysis
25Star Schema
- Star Schema Representation
- Facts and dimensions are normally represented by
physical tables in the data warehouse database. - The fact table is related to each dimension table
in a many-to-one (M1) relationship. - Fact and dimension tables are related by foreign
keys and are subject to the primary/foreign key
constraints.
26Star Schema For Sales
27Orders Star Schema
28The Multi-Dimensional Model
- Sales by product line over the past six months
- Sales by store between 1990 and 1995
Store Info
Key columns joining fact table to dimension tables
Numerical Measures
Prod Code Time Code Store Code Sales Qty
Fact table for measures
Product Info
Dimension tables
Time Info
. . .
29Dimensional Modeling
- Dimensions are organized into hierarchies
- E.g., Time dimension days ? weeks ? quarters
- E.g., Product dimension product ? product line ?
brand - Dimensions have attributes
30Dimension Hierarchies
Store Dimension
Product Dimension
Total
Total
Region
Manufacturer
District
Brand
Stores
Products
31ROLAP Dimensional Modeling Using Relational DBMS
- Special schema design star, snowflake
- Special indexes bitmap, multi-table join
- Special tuning maximize query throughput
- Proven technology (relational model, DBMS), tend
to outperform specialized MDDB especially on
large data sets - Products
- IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
32MOLAP Dimensional Modeling Using the Multi
Dimensional Model
- MDDB a special-purpose data model
- Facts stored in multi-dimensional arrays
- Dimensions used to index array
- Sometimes on top of relational DB
- Products
- Pilot, Arbor Essbase, Gentia
33Star Schema (in RDBMS)
34Star Schema Example
35Star Schema with Sample Data
36The Classic Star Schema
- A single fact table, with detail and summary data
- Fact table primary key has only one key column
per dimension - Each key is generated
- Each dimension is a single table, highly
denormalized
Benefits Easy to understand, easy to define
hierarchies, reduces of physical joins, low
maintenance, very simple metadata Drawbacks
Summary data in the fact table yields poorer
performance for summary levels, huge dimension
tables a problem
37The Classic Star Schema
The biggest drawback dimension tables must carry
a level indicator for every record and every
query must use it. In the example below, without
the level constraint, keys for all stores in the
NORTH region, including aggregates for region and
district will be pulled from the fact table,
resulting in error.
Example Select A.STORE_KEY, A.PERIOD_KEY,
A.dollars from Fact_Table A where A.STORE_KEY in
(select STORE_KEY from Store_Dimension
B where region North and Level 2) and
etc...
Level is needed whenever aggregates are stored
with detail facts.
38The Level Problem
- Level is a problem because because it causes
potential for error. If the query builder, human
or program, forgets about it, perfectly
reasonable looking WRONG answers can occur. - One alternative the FACT CONSTELLATION model...
39The Fact Constellation Schema
District Fact Table
Region Fact Table
District_ID PRODUCT_KEY PERIOD_KEY
Region_ID PRODUCT_KEY PERIOD_KEY
Dollars Units Price
Dollars Units Price
40The Fact Constellation Schema
In the Fact Constellations, aggregate tables are
created separately from the detail, therefor it
is impossible to pick up, for example, Store
detail when querying the District Fact Table.
Major Advantage No need for the Level
indicator in the dimension tables, since no
aggregated data is stored with lower-level
detail Disadvantage Dimension tables are still
very large in some cases, which can slow
performance front-end must be able to detect
existence of aggregate facts, which requires more
extensive metadata
41Another Alternative to Level
- Fact Constellation is a good alternative to the
Star, but when dimensions have very high
cardinality, the sub-selects in the dimension
tables can be a source of delay. - An alternative is to normalize the dimension
tables by attribute level, with each smaller
dimension table pointing to an appropriate
aggregated fact table, the Snowflake Schema ...
42The Snowflake Schema
Store Dimension
STORE KEY
District_ID
Region_ID
Store Description City State District ID District
Desc. Region_ID Region Desc. Regional Mgr.
District Desc. Region_ID
Region Desc. Regional Mgr.
Store Fact Table
District Fact Table
RegionFact Table
Region_ID PRODUCT_KEY PERIOD_KEY
District_ID PRODUCT_KEY PERIOD_KEY
STORE KEY
PRODUCT KEY
Dollars Units Price
PERIOD KEY
Dollars Units Price
Dollars Units Price
43The Snowflake Schema
- No LEVEL in dimension tables
- Dimension tables are normalized by decomposing at
the attribute level - Each dimension table has one key for each level
of the dimensionís hierarchy - The lowest level key joins the dimension table to
both the fact table and the lower level attribute
table
How does it work? The best way is for the query
to be built by understanding which summary levels
exist, and finding the proper snowflaked
attribute tables, constraining there for keys,
then selecting from the fact table.
44The Snowflake Schema
- Additional features The original Store Dimension
table, completely de-normalized, is kept intact,
since certain queries can benefit by its
all-encompassing content. - In practice, start with a Star Schema and create
the snowflakes with queries. This eliminates
the need to create separate extracts for each
table, and referential integrity is inherited
from the dimension table.
Advantage Best performance when queries involve
aggregation Disadvantage Complicated
maintenance and metadata, explosion in the number
of tables in the database
45Advantages of ROLAP Dimensional Modeling
- Define complex, multi-dimensional data with
simple model - Reduces the number of joins a query has to
process - Allows the data warehouse to evolve with rel. low
maintenance - HOWEVER! Star schema and relational DBMS are not
the magic solution - Query optimization is still problematic
46Aggregates
- Add up amounts for day 1
- In SQL SELECT sum(amt) FROM SALE
- WHERE date 1
81
47Aggregates
- Add up amounts by day
- In SQL SELECT date, sum(amt) FROM SALE
- GROUP BY date
48Another Example
- Add up amounts by day, product
- In SQL SELECT date, sum(amt) FROM SALE
- GROUP BY date, prodId
rollup
drill-down
49Aggregates
- Operators sum, count, max, min, median,
ave - Having clause
- Using dimension hierarchy
- average by region (within store)
- maximum by month (within date)
50ROLAP vs. MOLAP
- ROLAPRelational On-Line Analytical Processing
- MOLAPMulti-Dimensional On-Line Analytical
Processing
51The MOLAP Cube
Fact table view
Multi-dimensional cube
dimensions 2
523-D Cube
Multi-dimensional cube
Fact table view
day 2
day 1
dimensions 3
53Example
roll-up to region
Dimensions Time, Product, Store Attributes Pro
duct (upc, price, ) Store Hierarchies Pro
duct ? Brand ? Day ? Week ? Quarter Store ?
Region ? Country
NY
Store
SF
roll-up to brand
LA
10 34 56 32 12 56
Juice Milk Coke Cream Soap Bread
Product
roll-up to week
M T W Th F S S
Time
56 units of bread sold in LA on M
54Cube Aggregation Roll-up
Example computing sums
day 2
. . .
day 1
129
55Cube Operators for Roll-up
day 2
. . .
day 1
sale(s1,,)
129
sale(s2,p2,)
sale(,,)
56Extended Cube
day 2
sale(,p2,)
day 1
57Aggregation Using Hierarchies
store
day 2
day 1
region
country
(store s1 in Region A stores s2, s3 in Region B)
58Slicing
day 2
day 1
TIME day 1
59Slicing Pivoting
60Summary of Operations
- Aggregation (roll-up)
- aggregate (summarize) data to the next higher
dimension element - e.g., total sales by city, year ? total sales by
region, year - Navigation to detailed data (drill-down)
- Selection (slice) defines a subcube
- e.g., sales where city Gainesville and date
1/15/90 - Calculation and ranking
- e.g., top 3 of cities by average income
- Visualization operations (e.g., Pivot)
- Time functions
- e.g., time average
61Query Analysis Tools
- Query Building
- Report Writers (comparisons, growth, graphs,)
- Spreadsheet Systems
- Web Interfaces
- Data Mining