Data Models for Warehouse

1 / 61
About This Presentation
Title:

Data Models for Warehouse

Description:

List out all such logically associate entities. ... Fact Constellation is a good alternative to the Star, but when dimensions have ... – PowerPoint PPT presentation

Number of Views:50
Avg rating:3.0/5.0
Slides: 62
Provided by: all1113

less

Transcript and Presenter's Notes

Title: Data Models for Warehouse


1
Data Models for Warehouse
  • Session-12/13
  • Data Management for Decision Support

2
Data Models
  • Data Models
  • relations
  • stars snowflakes
  • cubes
  • Operators
  • slice dice
  • roll-up, drill down
  • pivoting
  • other

3
Data Models
  • Star schemas are database schemas that exploit
    the structure of data for decision support query
  • Queries in DSS tend to
  • Examine a set of factual transactions- POS,
    Customer events
  • Facts are analyzed in variety of ways - POS
    transaction by week, or store
  • For example a retail store
  • POS is at the center
  • Product information - SKU, hierarchy of ( section
    dept, BU)
  • Time information - day, week, month, year
  • Stores - Store-id, hierarchy (regions, city,
    locality)
  • Suppliers- Sup-id, location, discounts

4
Data Models
Products
Time
Sales Transactions
Stores
Suppliers
Information is split between two classes- Factual
information and Reference information
5
FACT DATA
  • Fact data records the information on factual
    event that occurred in the business- POS, Phone
    calls, Banking transactions
  • Typically 70 of Warehouse data is Fact data
  • Important to identify and define structure right
    in the first place as restructuring is an
    expensive process
  • Detail content of FACT is derived from the
    business requirement
  • Recorded Facts do not change as they are events
    of past

6
Dimension Data
  • Information that is used for analyzing the
    elemental data, for example, product hierarchy,
    time periods, customers, stores
  • It is the reference data used for analysis of
    Facts
  • Organizing the information in separate reference
    tables offers better query performance
  • It differs from Fact data as it changes over
    time, due to changes in business, reorganization
  • It should be structured to permit rapid changes

7
FACT and Dimensions
  • Millions to billions of rows
  • Multiple foreign keys
  • Numeric
  • Does not change
  • Tens to millions of rows
  • One primary key
  • Textual decription
  • Frequently modifies

8
Decision Support Queries
  • Examples
  • Average number of sales of Haldiram per store
    over last month (various types within the brand)
  • Projected sales of Deepavali gift packs against
    the actual
  • The top 20 customers (spending) over last
    quarter
  • The customers with average balance in excess of
    Rs. 25000 for past one year
  • gt Each of these queries is based on Factual data

9
Decision Support Queries
  • Examples

POS Transaction Membership card
Transaction Account transactions
Sales of Haldiram Customer Spend Account
Balance
Quantity Sold Product Store Date, Time Revenue
Realized Customer-Id Store Transaction
Value Date and Time Customer AC number type of
transaction amount
10
Star Schema
  • The star schema is a data-modeling technique used
    to map multidimensional decision support into a
    relational database.
  • Star schemas yield an easily implemented model
    for multidimensional data analysis while still
    preserving the relational structure of the
    operational database.
  • Four Components
  • Facts
  • Dimensions
  • Attributes
  • Attribute hierarchies

11
A Simple Star Schema
12
Star Schema
  • Facts
  • Facts are numeric measurements (values) that
    represent a specific business aspect or activity.
  • The fact table contains facts that are linked
    through their dimensions.
  • Facts can be computed or derived at run-time
    (metrics).
  • Dimensions
  • Dimensions are qualifying characteristics that
    provide additional perspectives to a given fact.
  • Dimensions are stored in dimension tables.

13
Identifying Facts and Dimensions
Elemental Transaction
Determine Key Dimensions
Check if Fact is a dimension
Check if dimensions is a Fact
14
Identification Step 1
  • Examine the enterprise model and identify the
    transaction that or of interest- driven by
    business requirement analysis
  • These will be transaction that describes events
    fundamental to the business e.g., calls for
    Telecom, account transactions in banking
  • For each potential Fact ask a question- Is this
    information operated upon by business process?
    Daily sales versus POS, even if system reports
    daily sales POS may be the FACT
  • The limit of current recording should not
    influence Warehouse design

15
Identification Step 1
Fact Table POS Transaction Stock movement and
position Customer events Account
transactions Claims and receipts Call
events Customer events(install, disconnect,
payment)
Sector and Business Retail Sales Shrinkage Retai
l Banking Customer profiling
Profitability Insurance Product
Profitability Telecom Call Analysis
Customer Analysis
16
Identification Step 2
  • Look at the logical model to find the entities
    associated with entities in the fact table. List
    out all such logically associate entities.
  • These are candidate References, the task is to
    find key dimension entities that may not be
    directly associated.
  • For example, retail banking account transaction
    are candidate fact table. The account transaction
    is candidate reference. But, the customer I
    indirectly related to transaction. Although, a
    better choice.
  • Analyze account transaction by account?
  • Analyze how customers use our services?
  • You store both relationships but customer becomes
    a dimension

17
Identification Step3
  • FACT is not actually a denormalized dimension
    table
  • Consider the following
  • house-details
  • Cable-laid
  • Sales-persons visit
  • connected to the service
  • promotional material sent
  • subscription cancelled
  • Home-details - candidate fact
  • Operational events
  • Report on number of connections quarter-to-date
  • Time-lag between laying and subscrition

18
Identification Step 4
  • Dimension is not a FACT
  • Lot depends on DSS requirements-
  • Customer can be FACT or Dimension
  • Promotions can be fact or dimensions
  • Ask questions using other dimensions- Using how
    many other dimensions, Can I view this entity.
  • Can I view promotion by Time?
  • Can I view promotions by product?
  • Can I view promotion by store?
  • Can I vie promotions by suppliers?
  • If answer to these question is yes, then it is a
    FACT

19
Star Schema
  • Attributes
  • Each dimension table contains attributes.
    Attributes are often used to search, filter, or
    classify facts.
  • Dimensions provide descriptive characteristics
    about the facts through their attributes.

Possible Attributes For Sales Dimensions
20
Three Dimensional View Of Sales
21
Slice And Dice View Of Sales
22
Star Schema
  • Attribute Hierarchies
  • Attributes within dimensions can be ordered in a
    well-defined attribute hierarchy.
  • The attribute hierarchy provides a top-down data
    organization that is used for two main purposes
  • Aggregation
  • Drill-down/roll-up data analysis

23
A Location Attribute Hierarchy
24
Attribute Hierarchies In Multidimensional Analysis
25
Star Schema
  • Star Schema Representation
  • Facts and dimensions are normally represented by
    physical tables in the data warehouse database.
  • The fact table is related to each dimension table
    in a many-to-one (M1) relationship.
  • Fact and dimension tables are related by foreign
    keys and are subject to the primary/foreign key
    constraints.

26
Star Schema For Sales
27
Orders Star Schema
28
The Multi-Dimensional Model
  • Sales by product line over the past six months
  • Sales by store between 1990 and 1995

Store Info
Key columns joining fact table to dimension tables
Numerical Measures
Prod Code Time Code Store Code Sales Qty
Fact table for measures
Product Info
Dimension tables
Time Info
. . .
29
Dimensional Modeling
  • Dimensions are organized into hierarchies
  • E.g., Time dimension days ? weeks ? quarters
  • E.g., Product dimension product ? product line ?
    brand
  • Dimensions have attributes

30
Dimension Hierarchies
Store Dimension
Product Dimension
Total
Total
Region
Manufacturer
District
Brand
Stores
Products
31
ROLAP Dimensional Modeling Using Relational DBMS
  • Special schema design star, snowflake
  • Special indexes bitmap, multi-table join
  • Special tuning maximize query throughput
  • Proven technology (relational model, DBMS), tend
    to outperform specialized MDDB especially on
    large data sets
  • Products
  • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

32
MOLAP Dimensional Modeling Using the Multi
Dimensional Model
  • MDDB a special-purpose data model
  • Facts stored in multi-dimensional arrays
  • Dimensions used to index array
  • Sometimes on top of relational DB
  • Products
  • Pilot, Arbor Essbase, Gentia

33
Star Schema (in RDBMS)
34
Star Schema Example
35
Star Schema with Sample Data
36
The Classic Star Schema
  • A single fact table, with detail and summary data
  • Fact table primary key has only one key column
    per dimension
  • Each key is generated
  • Each dimension is a single table, highly
    denormalized

Benefits Easy to understand, easy to define
hierarchies, reduces of physical joins, low
maintenance, very simple metadata Drawbacks
Summary data in the fact table yields poorer
performance for summary levels, huge dimension
tables a problem
37
The Classic Star Schema
The biggest drawback dimension tables must carry
a level indicator for every record and every
query must use it. In the example below, without
the level constraint, keys for all stores in the
NORTH region, including aggregates for region and
district will be pulled from the fact table,
resulting in error.
Example Select A.STORE_KEY, A.PERIOD_KEY,
A.dollars from Fact_Table A where A.STORE_KEY in
(select STORE_KEY from Store_Dimension
B where region North and Level 2) and
etc...
Level is needed whenever aggregates are stored
with detail facts.
38
The Level Problem
  • Level is a problem because because it causes
    potential for error. If the query builder, human
    or program, forgets about it, perfectly
    reasonable looking WRONG answers can occur.
  • One alternative the FACT CONSTELLATION model...

39
The Fact Constellation Schema
District Fact Table
Region Fact Table
District_ID PRODUCT_KEY PERIOD_KEY
Region_ID PRODUCT_KEY PERIOD_KEY
Dollars Units Price
Dollars Units Price
40
The Fact Constellation Schema
In the Fact Constellations, aggregate tables are
created separately from the detail, therefor it
is impossible to pick up, for example, Store
detail when querying the District Fact Table.
Major Advantage No need for the Level
indicator in the dimension tables, since no
aggregated data is stored with lower-level
detail Disadvantage Dimension tables are still
very large in some cases, which can slow
performance front-end must be able to detect
existence of aggregate facts, which requires more
extensive metadata
41
Another Alternative to Level
  • Fact Constellation is a good alternative to the
    Star, but when dimensions have very high
    cardinality, the sub-selects in the dimension
    tables can be a source of delay.
  • An alternative is to normalize the dimension
    tables by attribute level, with each smaller
    dimension table pointing to an appropriate
    aggregated fact table, the Snowflake Schema ...

42
The Snowflake Schema
Store Dimension
STORE KEY
District_ID
Region_ID
Store Description City State District ID District
Desc. Region_ID Region Desc. Regional Mgr.
District Desc. Region_ID
Region Desc. Regional Mgr.
Store Fact Table
District Fact Table
RegionFact Table
Region_ID PRODUCT_KEY PERIOD_KEY
District_ID PRODUCT_KEY PERIOD_KEY
STORE KEY
PRODUCT KEY
Dollars Units Price
PERIOD KEY
Dollars Units Price
Dollars Units Price
43
The Snowflake Schema
  • No LEVEL in dimension tables
  • Dimension tables are normalized by decomposing at
    the attribute level
  • Each dimension table has one key for each level
    of the dimensionís hierarchy
  • The lowest level key joins the dimension table to
    both the fact table and the lower level attribute
    table

How does it work? The best way is for the query
to be built by understanding which summary levels
exist, and finding the proper snowflaked
attribute tables, constraining there for keys,
then selecting from the fact table.
44
The Snowflake Schema
  • Additional features The original Store Dimension
    table, completely de-normalized, is kept intact,
    since certain queries can benefit by its
    all-encompassing content.
  • In practice, start with a Star Schema and create
    the snowflakes with queries. This eliminates
    the need to create separate extracts for each
    table, and referential integrity is inherited
    from the dimension table.

Advantage Best performance when queries involve
aggregation Disadvantage Complicated
maintenance and metadata, explosion in the number
of tables in the database
45
Advantages of ROLAP Dimensional Modeling
  • Define complex, multi-dimensional data with
    simple model
  • Reduces the number of joins a query has to
    process
  • Allows the data warehouse to evolve with rel. low
    maintenance
  • HOWEVER! Star schema and relational DBMS are not
    the magic solution
  • Query optimization is still problematic

46
Aggregates
  • Add up amounts for day 1
  • In SQL SELECT sum(amt) FROM SALE
  • WHERE date 1

81
47
Aggregates
  • Add up amounts by day
  • In SQL SELECT date, sum(amt) FROM SALE
  • GROUP BY date

48
Another Example
  • Add up amounts by day, product
  • In SQL SELECT date, sum(amt) FROM SALE
  • GROUP BY date, prodId

rollup
drill-down
49
Aggregates
  • Operators sum, count, max, min, median,
    ave
  • Having clause
  • Using dimension hierarchy
  • average by region (within store)
  • maximum by month (within date)

50
ROLAP vs. MOLAP
  • ROLAPRelational On-Line Analytical Processing
  • MOLAPMulti-Dimensional On-Line Analytical
    Processing

51
The MOLAP Cube
Fact table view
Multi-dimensional cube
dimensions 2
52
3-D Cube
Multi-dimensional cube
Fact table view
day 2
day 1
dimensions 3
53
Example
roll-up to region
Dimensions Time, Product, Store Attributes Pro
duct (upc, price, ) Store Hierarchies Pro
duct ? Brand ? Day ? Week ? Quarter Store ?
Region ? Country
NY
Store
SF
roll-up to brand
LA
10 34 56 32 12 56
Juice Milk Coke Cream Soap Bread
Product
roll-up to week
M T W Th F S S
Time
56 units of bread sold in LA on M
54
Cube Aggregation Roll-up
Example computing sums
day 2
. . .
day 1
129
55
Cube Operators for Roll-up
day 2
. . .
day 1
sale(s1,,)
129
sale(s2,p2,)
sale(,,)
56
Extended Cube

day 2
sale(,p2,)
day 1
57
Aggregation Using Hierarchies
store
day 2
day 1
region
country
(store s1 in Region A stores s2, s3 in Region B)
58
Slicing
day 2
day 1
TIME day 1
59
Slicing Pivoting
60
Summary of Operations
  • Aggregation (roll-up)
  • aggregate (summarize) data to the next higher
    dimension element
  • e.g., total sales by city, year ? total sales by
    region, year
  • Navigation to detailed data (drill-down)
  • Selection (slice) defines a subcube
  • e.g., sales where city Gainesville and date
    1/15/90
  • Calculation and ranking
  • e.g., top 3 of cities by average income
  • Visualization operations (e.g., Pivot)
  • Time functions
  • e.g., time average

61
Query Analysis Tools
  • Query Building
  • Report Writers (comparisons, growth, graphs,)
  • Spreadsheet Systems
  • Web Interfaces
  • Data Mining
Write a Comment
User Comments (0)