Title: Data Warehousing
1Data Warehousing
Introduction and Chapter 1
2Introduction
- Definitions
- Legacy Systems
- Dimensions
- Data Dependencies Model
- Dimensional Model
3An ER Model
4A Dimensional Model
Time
Market
Product
5Why Data Warehouses?
- To meet the long sought after goal of providing
the user with more flexible data bases containing
data that can be accessed every which way.
6OLTP vs. OLAP
- OLTP (Online transaction processing) has been the
standard reason for IS and DP for the last thirty
years. Most legacy systems are quite good at
capturing data but do not facilitate data access. - OLAP (Online analytical processing) is a set of
procedures for defining and using a dimension
framework for decision support
7The Goals for and Characteristics of a DW
- Make organizational data accessible
- Facilitate consistency
- Adaptable and yet resilient to change
- Secure and reliable
- Designed with a focus on supporting decision
making
8The Goals for and Characteristics of a DW
- Generate an environment in which data can be
sliced and diced in multiple ways - It is more than data, it is a set of tools to
query, analyze, and present information - The DW is the place where operational data is
published (cleaned up, assembled, etc.)
9Basic elements of the data warehouse
DataStaging Area
Operational Source Systems
DataPresentation Area
DataAccessTools
Services Clean, combine, and standardizeConform
DimensionsNo user query services Data
Store Flat files and relational
tables Processing Sorting and sequential
processing
Data Mart 1 DimensionalAtomic and summary
dataBased on a single business process
Ad hoc query tools Report Writers Analytical
Applications Modeling Forecasting Scoring Dat
a Mining
Extract
Load
Access
DW BusConformed facts and dimensions
Extract
Data Mart 2 Similar design
Extract
Load
Access
10Data Staging Area
- Extract-Transformation-Load
- Extract Reading the source data and copying the
data to the staging area - Transformation
- Cleaning
- Combining
- Duplicating
- Assigning keys
- Load present data to the bulk loading facilities
of the data mart
11Organization of data in the presentation area of
the data warehouse
- Data in the warehouse are dimensional, not
normalized relations - However, data that are ultimately presented in
the data warehouse will often be derived directly
from relational DBs - Data should be atomic someplace in the warehouse
even if the presentation is aggregate - Uses the bus architecture to support a
decentralized set of data marts
12Updates to a data warehouse
- For many years, the dogma stated that data
warehouses are never updated. - This is unrealistic since labels, titles, etc.
change. - Some components will, therefore, be changed
albeit, via a managed load (as opposed to
transactional updates)
13Dimensional Modeling Terms and Concepts
- Fact table
- Dimension tables
14Fact Tables
- Fact table a table in the data warehouse that
contains - Numerical performance measures
- Foreign keys that tie the fact table to the
dimension tables
15Fact Tables
- Each row records a measurement describing a
transaction - Where?
- When?
- Who?
- How much?
- How many?
- The level of detail represented by this data is
referred to as the grain of the data warehouse - Questions can only be asked down to a level
corresponding with the grain of the data warehouse
16Fact Tables
- Fact tables contain numeric data that can be one
of three types - Additive
- Semi-additive
- Non-additive
- Fact tables contain foreign keys
- A group of foreign keys will be used to create a
concatenated primary key - Fact tables generally dont contain textual data
17Dimension tables
- Tables containing textual descriptors of the
business - Dimension tables are usually wide (e.g., 100
columns) - Dimension tables are usually shallow (100s of
thousand or a few million rows) - Values in the dimensions usually provide
- Constraints on queries (e.g., view customer by
region) - Report headings
18Dimension tables
- The quality of the dimensions will determine the
quality of the data warehouse that is, the DW is
only as good as its dimension attributes - Dimensions are often split into hierarchical
branches (i.e., snowflakes) because of the
hierarchical nature of organizations - Product part ? Product ? Brand
- Dimensions are usually highly denormalized
19Dimension tables
- The dimension attributes define the constraints
for the DW. Without good dimensions, it becomes
difficult to narrow down on a solution when the
DW is used for decision support
20Bringing together facts and dimensions Building
the dimensional Model
- Start with the normalized ER Model
- Group the ER diagram components into segments
based on common business processes and model each
as a unit - Find MM relationships in the model with numeric
and additive non-key facts and include them in a
fact table - Denormalize the other tables as needed and
designate one field as a primary key
21A Dimensional Model
22So, What is a DW?
- A data warehouse is a subject-oriented,
integrated, non-volatile, and time-variant
collection of data in support of managements
decisions W.H. Inmon (the father of DW)
23Subject Oriented
- Data in a data warehouse are organized around the
major subjects of the organization
24Integrated
- Data from multiple sources are standardized
(scrubbed, cleansed, etc.) and brought into one
environment
25Non-Volatile
- Once added to the DW, data are not changed
(barring the existence of major errors)
26Time Variant
- The DW captures data at a specific moment, thus,
it is a snap-shot view of the organization at
that moment in time. As these snap-shots
accumulate, the analyst is able to examine the
organization over time (a time series!) - The snap-shot is called a production data extract