DATA WAREHOUSE DESIGN - PowerPoint PPT Presentation

1 / 16

About This Presentation

Title:

DATA WAREHOUSE DESIGN

Description:

Camera Hong Kong Q4 2100. Camera Singapore Q1 1000. Camera Singapore Q2 1100. Tuner Tokyo Q3 250 ... maps an attribute value of a dimension table to one or more ... – PowerPoint PPT presentation

Number of Views:22

Avg rating:3.0/5.0

Slides: 17

Provided by: fredloc

Category:

more less

Transcript and Presenter's Notes

Title: DATA WAREHOUSE DESIGN

1
COMP 332PRINCIPLES OFDATABASE DESIGN

DATA WAREHOUSE DESIGN

2
DATA WAREHOUSE DESIGN OUTLINE

Operational versus Data Warehouse Databases
Data Warehouse Architecture
Conceptual Schema (User View)
Logical Schema (System View)
Data Warehouse Design
Physical Design

3
INTRODUCTION 9.1

A data warehouse is a subject-oriented,
integrated, time-variant, nonvolatile collection
of data in support of management decisions

large repository of 100s of gigabytes or
terrabytes
used primarily for
standard reports and graphical presentation
customer profiling trend analysis market
analysis decision effectiveness
dimensional analysis aggregate/summarize data
data mining discover non-obvious relationships
in the data (e.g., risk analysis fraud
detection)
due to huge data volumes, performanceis highly
sensitive to the database design

4
OPERATIONAL VS DATA WAREHOUSE DATABASES

operational database
supports mission-critical requirements for online
transaction processing (OLTP) and batch
processing
organized around business (functional)
areas/processes
stores detailed, nonredundant, updateable and
current data

How many gadgets were sold to customer number
123876 on Sept. 19?

data warehouse database
supports ad hoc query processing (online
analytical processing (OLAP)) for decision
support
organized around subjects, such as customers,
product, etc.
stores summarized, redundant, non-updateable and
historic data

What three products resulted in the most frequent
calls to the hotline over the past quarter?
5
OPERATIONAL VS DATA WAREHOUSE DATABASES
6
DATA WAREHOUSE ARCHITECTURE
Operational data sources
Data mart
Data mart
Data mart
Feeder DB1
Backend Tools report, query OLAP data
mining applications visualization

extract
clean
load
refresh

Feeder DB2
Data warehouse
. . .
Metadata
web pages
7
DATA WAREHOUSE ARCHITECTURE (contd)

data extraction - extracts data from operation
systems
database heterogeneity - different DBMSs
data heterogeneity - different definitions/represe
ntations of data
data cleaning/scrubbing - makes extracted data
consistent
inconsistent field lengths, descriptions, value
assignments
missing data, duplicate data, violation of
integrity constraints
data loading - puts data into the data warehouse
may need to sort, summarize, aggregate data
build indexes
data refresh - propagate updates on operational
data
when to refresh? - periodically
how to refresh? - data shipping transaction
shipping
metadata
for use by designers and administrators (e.g.,
about data sources)
for use by end users (describes data content of
data warehouse)

8
CONCEPTUAL SCHEMA (USER VIEW) 9.4

similar to a spreadsheet

9
CONCEPTUAL SCHEMA (USER VIEW)

but often represented as a multidimensional
structure (cube/hypercube) organized around
subject areas

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
1200
1500
1800
2100
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
10
MULTIDIMENSIONAL MODEL

facts
raw numeric data that defines the objects
associated with the subject areas and that is to
be analyzed
can be aggregated and summarized
e.g., price, revenue, units of items sold, etc.
dimensions
provide the context for the facts and the paths
along which basic access operations to facts
occur
the set of dimensions uniquely determine a fact
and give it meaning
dimensions have attributes that are often
hierarchical
e.g., product category, industry, year of
introduction, average profit margin
e.g., time years, quarters, months, weeks, days

11
MULTIDIMENSIONAL MODEL OPERATIONS

drill down decrease the aggregation level more
detailed view
e.g., sales by quarter to sales by month
roll up increase the aggregation level less
detailed view
e.g., sales by quarter to sales by year
slice and dice apply selection and projection
to dimensions
e.g., select all cameras and project on markets
and units
pivot re-orient the multidimensional view of
the data
rank sort the data

12
LOGICAL SCHEMA (SYSTEM VIEW) 9.2
Star schema
13
LOGICAL SCHEMA (SYSTEM VIEW)
Star schema
1
1
N
N
N
N
1
1
N
N
1
1
14
DATA WAREHOUSE DESIGN 9.4.3

Step 1 Analyze end-user requirements and
environment
similar to normal database design, but adapted
for data warehouse
Step 2 Define Cubes, Dimensions, Hierarchies
high-level, conceptual multidimensional modeling
Step 3 Define Dimension Members
logical design of dimensions
Step 4 Define Aggregations and Other Formulas
what to aggregate how to store aggregates when
to pre-aggregate

15
DATA WAREHOUSE PHYSICAL DESIGN 9.3

need to deal with very large volumes of data
need to optimize for efficient query processing
(since there are few updates)
require physical database structuresand query
processing techniques
that enhance query performance for very large
data warehouses
major issues to consider
indexing
join optimization
materializing views of aggregations

16
INDEXING

keys for dimension tables are typically
multi-attribute keys
represent the entire hierarchy of dimension
attributes
become foreign key of dimension in fact table
create an artificial key for a dimension if
necessary
create indexes on each primary key column of each
dimension table and on all foreign keys in the
fact table
facilitates joins between fact table and
dimension tables
make use of join indexes (pre-computed join) to
speed up joins
maps an attribute value of a dimension table to
one or more rows in a fact table
multi-key join indexes can represent n-way joins

17
BITMAP INDEXES

queries against low cardinality attributes can be
optimized using bitmap indexes
e.g., gender 2 values (male/female)
marital status 4 values (single, married,
divorced, widowed)
bitmap index construction
each row in a table is represented by a single
bit in a bitmap
each attribute value has its own bit vector
if the row has a qualifying value, its bit is set
to 1
speeds up special index operations such as
intersection or union

18
BITMAP INDEXES (contd)
AND

intersection bitmap
19
JOIN PROCESSING

most RDBMs can only do pairwise joins (i.e., two
relations at a time), but data warehouse
operations require multi-way joins
join order is important due to creation of
intermediate results that can be large and need
to be stored
N! ways to join N tables
many possible join algorithms to choose from
only fact table is directly related to most other
tables
allow unrelated tables to be joined
result is the Cartesian product
usually this is smaller than joining with the
fact table

20
VIEW MATERIALIZATION 9.3.2

to speed up query processing, it may be
cost-effective to store some aggregations of data
rather than to compute them
usually applied to fact table based on the
attributes of one or more dimension tables
e.g., rollup sales by week, by month, by product
type, etc.
the aggregations can either be
calculated as needed high compute cost
stored high storage cost update cost
aggregate data can be used to derive a higher
level summary
e.g., produce yearly results from monthly summary
data

21
PARTITIONING 9.3.3