Title: Overview of Advanced Databases FT228
1Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
Business Systems Intelligence1. Data
Warehousing 2
2Entity Relationship Data Model (ERD)
- ER Approach works by dividing the data into many
discreet entities - Each entity becomes a Table in the physical
schema - Why it has been so successful
- Coupled with the concept of Normalization it
drives all the redundancy out of the database - Change (or add or delete) the data at just one
point - Can build very fast access methods (index)
- Results in efficient transactional processing
- Where is the catch?
3ERD Where Is The Catch?
- Lets have a look at a typical ER data model first
- Some Observations
- A Symmetric Model
- All the tables look the same
- Which table is more important ? Which is the
largest? - Which tables contain numerical measurements of
the business? - Which table contain nearly static descriptive
attributes? - Very hard to visualize and keep it in head
- A large number of possible connections to any two
(or more) tables
4A Typical OLTP Oriented ER Data Model
5ERD Catch Continues
- ERD and Normalization result in large number of
tables - Hard to be understood by the users (db
programmers) - Hard to navigate by DBMS software in an optimum
way - Real value of ERD is in using tables individually
or in pairs - Too complex for queries that span multiple tables
with a large number of records
6How to Simplify a Data Model
- Two general methods
- De-Normalization
- Dimensional Modeling
- De-Normalization
- Reverses the effect of Normalization
- Reintroduce redundancy while reducing the number
of tables - Popular approaches Pre-Join de-normalization,
Column Replication or Movement and Aggregation
7Data Warehouse Design
- The database component of a data warehouse is
described using a technique called dimensionality
modelling - Logical design technique that aims to present the
data in a standard, intuitive form that allows
for high-performance access - Uses the concepts of ER modelling with some
important restrictions
8Data Warehouse Design (cont)
- Every dimensional model (DM) is composed of
- One table with a composite primary key, called
the fact table - A set of smaller tables called dimension tables
- Each dimension table has a simple (non-composite)
primary key that corresponds exactly to one of
the components of the composite key in the fact
table.
9Data Warehouse Design
- Forms star-like structure, which is called a star
schema or star join - Star schema is a logical structure that has a
fact table containing factual data in the centre,
surrounded by dimension tables containing
reference data, can be denormalised. - Facts are generated by events that occurred in
the past, and are unlikely to change, regardless
of how they are analysed
10Data Warehouse Design (cont)
Dimension Table
Dimension Table
Fact Table
Dimension Table
Dimension Table
Star Schema
11Data Warehouse Design (cont)
- Bulk of data in data warehouse is in fact tables,
which can be extremely large. - Important to treat fact data as read-only
reference data that will not change over time. - Most useful fact tables contain one or more
numerical measures, or facts that occur for
each record and are numeric and additive.
12Data Warehouse Design (cont)
- Dimension tables usually contain descriptive
textual information. - Dimension attributes are used as the constraints
in data warehouse queries. - Star schemas can be used to speed up query
performance by denormalizing reference
information into a single dimension table.
13Dimensional Modeling
- Models the data around two basic concepts Facts
Dimensions. - Facts
- Facts are numeric measurements (values) that
represent a specific business aspect or activity. - Facts can be computed or derived at run-time
(metrics). - Examples Unit Cost, Sale Amount, Quantity Sold
14Dimensional Modeling (cont)
- Dimensions
- Dimensions are qualifying characteristics that
provide additional perspectives to a given fact. - Examples Date (Day, Month, Qtr, Year), Product
(Type, Category)
15Dimensional Modeling (cont...)
- Every dimensional model (DM) is composed of one
(or more) fact tables, and a set of smaller
dimension tables. - Look on Fact table through one (or more)
dimensions. - What is the sale amount in Consumer Product
category, for elderly customers in the second
quarter of 2004? - Forms star-like structure, which is called a
star schema or star join.
16Example Dimensional Model
17Time Dimension Exercise
- Every Data Warehouse will need Time information
- I.e. a Time Dimension
- Compose a generic Time Dimension Table
- E.g. what are the different attributes you can
use to describe 11th February, 2008.
18Time Dimension Exercise
create table time_dimension ( date_key Number
not null, full_date Date, day_of_week
Number, day_num_in_month Number, day_num_overa
ll Number, day_name Varchar2(9), day_abbrev
Varchar2(3), weekday_flag Varchar2(1), week_n
um_in_year Number, week_num_overall
Number, week_begin_date Date, week_begin_date_
key Number, month Number, month_num_overall
Number, month_name Varchar2(9), month_abbrev
Varchar2(3), quarter Number, year
Number, yearmo Number, fiscal_month
Number, fiscal_quarter Number, fiscal_year
Number, last_day_in_month_flag
Varchar2(1), same_weekday_year_ago_date
Date, primary key (date_key))
19Data Model Design for Data Warehouses
- Nine-Step Methodology includes following steps
- Choosing the subject
- Choosing the grain
- Identifying and conforming the dimensions
- Choosing the facts
- Storing pre-calculations in the fact table
- Rounding out the dimension tables
- Choosing the duration of the database
- Tracking slowly changing dimensions
- Deciding the query priorities and the query mode
20Step 1 Choosing The Subject
- The subject (or function) refers to the subject
matter of a particular data mart - A business process is a major operational process
in an organization - Typically supported by a legacy system (database)
or an OLTP - Examples Orders, Invoices, Inventory etc.
21Step 1 Choosing The Subject (cont)
- First data mart built should be the one that is
most likely to be - Delivered on time
- Within budget
- To answer the most commercially important
business questions
22Step 2 Choosing The Grain
- Grain is the fundamental, atomic level of data to
be represented. - Decide what a record of the fact table is to
represent. - Grain is also termed as unit of analyses.
- Typical grains
- Individual Transactions
- Daily aggregates (snapshots)
- Monthly aggregates
23Step 2 Choosing The Grain (cont)
- Identify dimensions of the fact table. The grain
decision for the fact table also determines the
grain of each dimension table. - Also include time as a core dimension, which is
always present in star schemas. - Sometimes grain varies for different facts within
same business process. How?
24Step 3 Identifying Conforming The Dimensions
- Dimensions set the context for asking questions
about the facts in the fact table. - If any dimension occurs in two data marts, they
must be exactly the same dimension, or one must
be a mathematical subset of the other. - A dimension used in more than one data mart is
referred to as being conformed.
25Step 3 Identifying Conforming The Dimensions
- Choose the dimensions that apply to each fact in
the fact table. - Typical dimensions time, product, customer etc.
- Need to identify the descriptive attributes that
explain each dimension - Need to determine hierarchies within each
dimension?
26Steps 4 5 Choosing The Facts
- The grain of the fact table determines which
facts can be used in the data mart. - Facts should be numeric and additive.
- Example Quantity Sold, Amount etc.
- Unusable facts include
- Non-numeric facts
- Non-additive facts
- Fact at different granularity from other facts in
table - Storing Pre-Calculations in the Fact Table
- Once the facts have been selected each should be
re-examined to determine whether there are
opportunities to use pre-calculations.
27Step 6 Rounding Out The Dimension Tables
- Text descriptions are added to the dimension
tables. - Text descriptions should be as intuitive and
understandable to the users as possible. - Usefulness of a data mart is determined by the
scope and nature of the attributes of the
dimension tables. - (See exercise on Time Dimension)
28Step 7 Choosing The Duration Of The Database
- Duration measures how far back in time the fact
table goes. - Very large fact tables raise at least two very
significant data warehouse design issues. - Often difficult to source increasing old data.
- It is mandatory that the old versions of the
important dimensions be used, not the most
current versions. Known as the Slowly Changing
Dimension problem.
29Step 8 Tracking Slowly Changing Dimensions
- The slowly changing dimension problem means that
the proper description of the old dimension data
must be used with old fact data. - Often, a generalized key must be assigned to
important dimensions in order to distinguish
multiple snapshots of dimensions over a period of
time.
30Step 8 Tracking Slowly Changing Dimensions
(cont)
- Three basic types of slowly changing dimensions
- Where a changed dimension attribute is
overwritten. - Where a changed dimension attribute causes a new
dimension record to be created. - Where a changed dimension attribute causes an
alternate attribute to be created so that both
the old and new values of the attribute are
simultaneously accessible in the same dimension
record.
31Step 9 Deciding The Query Priorities And The
Query Modes
- Most critical physical design issues affecting
the end-users perception includes - Physical sort order of the fact table on disk
- Presence of pre-stored summaries or aggregations.
- Additional physical design issues include
administration, backup, indexing performance, and
security.
32Dimensional Modelling
- Dimensional Modelling is a logical design
technique that seeks to present the data in a
standard framework that is intuitive and allows
for high-performance access
33Dimensional Modelling (cont)
- Fact table
- Consists of a multi-part primary key and usually
contains numeric data - Numeric data is aggregated based on the
multi-part primary key - Additive is crucial because the DW applications
typically retrieve data based on more than one
set of facts - Dimension table
- Contains descriptive information
- Are the entry point for queries
34Dimensional Modelling
- Dimensional Model
- SELECT description, SUM(quoted_price),
SUM(quantity), - SUM(unit_price) , SUM(total_comm)
- FROM order_fact of,
- part_dimension pd
- WHERE of.part_nr pd.part_nr
- GROUP BY description
-
- ER-Model
- SELECT description, SUM(quoted_price),
SUM(quantity), - SUM(unit_price), SUM(total_comm)
- FROM order o,
- order_detail od,
- part p,
- customer c,
- slsrep s
- WHERE o.order_nr od.order_nr
- AND p.part_nr od.part_nr
- AND o.customer_nr c.customer_nr
35Dimensional Modelling
36Simple DW Example
- Supermarket (Chain Store)
- Business Area Sales
- Grain Individual Purchases
- Dimensions
- Time
- Product
- Store
- Customer
- Employee
- Facts
- Total Sales
- Number of items
- Total Cost Value
37Data Warehouse Design (cont)
Customer
Time
Individual Purchases
Employee
Products
Store
38More Detailed Exercise
- A more detailed exercise See handout
39Exam Question Example
- You are working on a data warehousing project for
the examinations department at the Dublin
Institute of Technology (DIT). The examinations
department looks after all exam and continuous
asessment results for all of the students within
the Institute. The purpose the data warehousing
project is to allow new reporting capabilities so
that examinations department staff can examine
grade patterns for particular courses monitor
average grades for modules and patterns in grades
for courses given by particular staff members
and to help with student retention efforts. - Currently in the examinations departments
transactional systems information about each
student is indexed by a student number, and
includes name, address, date of birth, etc.
Similarly, information stored about the
instructors working within the Institute is
indexed by staff ID and includes name, address,
department, etc. The information that needs to be
stored about each course includes the course
title, course code, and the weighting given to
the continuous assessment element and exam
element of a particular course. At the
Institutes progression boards each year the
continuous assessment and exam results for each
student, for each course, are entered into the
Institutes transactional grade storage system
and this informaion should be trasnferred across
to the data warehouse. - Design a star schema for the above scenario. (20
marks) - Discuss how the star schema supports the
reporting requirements outlined in the above
scenario. (15 marks)
40Different Types Of Dimensional Model
- The star-schema can be extended in two ways
- Snow flake model
- Multi-star model (also know as fact
constellation)
41Example Star Schema
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
42Example Snowflake Schema
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
43Example Fact Constellation
Shipping Fact Table
time_key
Sales Fact Table
item_key
time_key
shipper_key
item_key
from_location
branch_key
to_location
location_key
dollars_cost
units_sold
units_shipped
dollars_sold
avg_sales
Measures
44Summary
- Over the last two lectures we have introduced
them idea of data warehouses - Data warehouses evolved to address the issues of
using transactional databases to answer new kinds
of questions - IBM have a very detailed RedBook ondimensional
modelling that is well worth looking at - http//www.redbooks.ibm.com/redbooks/pdfs/sg247138
.pdf -