Title: DATA WAREHOUSE DESIGN
1COMP 332PRINCIPLES OFDATABASE DESIGN
2DATA WAREHOUSE DESIGN OUTLINE
- Operational versus Data Warehouse Databases
- Data Warehouse Architecture
- Conceptual Schema (User View)
- Logical Schema (System View)
- Data Warehouse Design
- Physical Design
3INTRODUCTION 9.1
- A data warehouse is a subject-oriented,
integrated, time-variant, nonvolatile collection
of data in support of management decisions
- large repository of 100s of gigabytes or
terrabytes - used primarily for
- standard reports and graphical presentation
customer profiling trend analysis market
analysis decision effectiveness - dimensional analysis aggregate/summarize data
- data mining discover non-obvious relationships
in the data (e.g., risk analysis fraud
detection) - due to huge data volumes, performanceis highly
sensitive to the database design
4OPERATIONAL VS DATA WAREHOUSE DATABASES
- operational database
- supports mission-critical requirements for online
transaction processing (OLTP) and batch
processing - organized around business (functional)
areas/processes - stores detailed, nonredundant, updateable and
current data
How many gadgets were sold to customer number
123876 on Sept. 19?
- data warehouse database
- supports ad hoc query processing (online
analytical processing (OLAP)) for decision
support - organized around subjects, such as customers,
product, etc. - stores summarized, redundant, non-updateable and
historic data
What three products resulted in the most frequent
calls to the hotline over the past quarter?
5OPERATIONAL VS DATA WAREHOUSE DATABASES
6DATA WAREHOUSE ARCHITECTURE
Operational data sources
Data mart
Data mart
Data mart
Feeder DB1
Backend Tools report, query OLAP data
mining applications visualization
- extract
- clean
- load
- refresh
Feeder DB2
Data warehouse
. . .
Metadata
web pages
7DATA WAREHOUSE ARCHITECTURE (contd)
- data extraction - extracts data from operation
systems - database heterogeneity - different DBMSs
- data heterogeneity - different definitions/represe
ntations of data - data cleaning/scrubbing - makes extracted data
consistent - inconsistent field lengths, descriptions, value
assignments - missing data, duplicate data, violation of
integrity constraints - data loading - puts data into the data warehouse
- may need to sort, summarize, aggregate data
build indexes - data refresh - propagate updates on operational
data - when to refresh? - periodically
- how to refresh? - data shipping transaction
shipping - metadata
- for use by designers and administrators (e.g.,
about data sources) - for use by end users (describes data content of
data warehouse)
8CONCEPTUAL SCHEMA (USER VIEW) 9.4
9CONCEPTUAL SCHEMA (USER VIEW)
- but often represented as a multidimensional
structure (cube/hypercube) organized around
subject areas
Tokyo
. . .
M a r k e t
Singapore
Hong Kong
1200
1500
1800
2100
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
10MULTIDIMENSIONAL MODEL
- facts
- raw numeric data that defines the objects
associated with the subject areas and that is to
be analyzed - can be aggregated and summarized
- e.g., price, revenue, units of items sold, etc.
- dimensions
- provide the context for the facts and the paths
along which basic access operations to facts
occur - the set of dimensions uniquely determine a fact
and give it meaning - dimensions have attributes that are often
hierarchical - e.g., product category, industry, year of
introduction, average profit margin - e.g., time years, quarters, months, weeks, days
11MULTIDIMENSIONAL MODEL OPERATIONS
- drill down decrease the aggregation level more
detailed view - e.g., sales by quarter to sales by month
- roll up increase the aggregation level less
detailed view - e.g., sales by quarter to sales by year
- slice and dice apply selection and projection
to dimensions - e.g., select all cameras and project on markets
and units - pivot re-orient the multidimensional view of
the data - rank sort the data
12LOGICAL SCHEMA (SYSTEM VIEW) 9.2
Star schema
13LOGICAL SCHEMA (SYSTEM VIEW)
Star schema
1
1
N
N
N
N
1
1
N
N
1
1
14DATA WAREHOUSE DESIGN 9.4.3
- Step 1 Analyze end-user requirements and
environment - similar to normal database design, but adapted
for data warehouse - Step 2 Define Cubes, Dimensions, Hierarchies
- high-level, conceptual multidimensional modeling
- Step 3 Define Dimension Members
- logical design of dimensions
- Step 4 Define Aggregations and Other Formulas
- what to aggregate how to store aggregates when
to pre-aggregate
15DATA WAREHOUSE PHYSICAL DESIGN 9.3
- need to deal with very large volumes of data
- need to optimize for efficient query processing
(since there are few updates) - require physical database structuresand query
processing techniques - that enhance query performance for very large
data warehouses - major issues to consider
- indexing
- join optimization
- materializing views of aggregations
16INDEXING
- keys for dimension tables are typically
multi-attribute keys - represent the entire hierarchy of dimension
attributes - become foreign key of dimension in fact table
- create an artificial key for a dimension if
necessary - create indexes on each primary key column of each
dimension table and on all foreign keys in the
fact table - facilitates joins between fact table and
dimension tables - make use of join indexes (pre-computed join) to
speed up joins - maps an attribute value of a dimension table to
one or more rows in a fact table - multi-key join indexes can represent n-way joins
17BITMAP INDEXES
- queries against low cardinality attributes can be
optimized using bitmap indexes - e.g., gender 2 values (male/female)
- marital status 4 values (single, married,
divorced, widowed) - bitmap index construction
- each row in a table is represented by a single
bit in a bitmap - each attribute value has its own bit vector
- if the row has a qualifying value, its bit is set
to 1 - speeds up special index operations such as
intersection or union
18BITMAP INDEXES (contd)
AND
intersection bitmap
19JOIN PROCESSING
- most RDBMs can only do pairwise joins (i.e., two
relations at a time), but data warehouse
operations require multi-way joins - join order is important due to creation of
intermediate results that can be large and need
to be stored - N! ways to join N tables
- many possible join algorithms to choose from
- only fact table is directly related to most other
tables - allow unrelated tables to be joined
- result is the Cartesian product
- usually this is smaller than joining with the
fact table
20VIEW MATERIALIZATION 9.3.2
- to speed up query processing, it may be
cost-effective to store some aggregations of data
rather than to compute them - usually applied to fact table based on the
attributes of one or more dimension tables - e.g., rollup sales by week, by month, by product
type, etc. - the aggregations can either be
- calculated as needed high compute cost
- stored high storage cost update cost
- aggregate data can be used to derive a higher
level summary - e.g., produce yearly results from monthly summary
data
21PARTITIONING 9.3.3
- Horizontal
- especially for time dimension (e.g., sales by
month) - query processing dependent
- Vertical
- to separate rarely used data from frequently used
data - may be advantageous to store table column-wise to
facilitate aggregation