DATA WAREHOUSE DESIGN - PowerPoint PPT Presentation

1 / 16
About This Presentation
Title:

DATA WAREHOUSE DESIGN

Description:

Camera Hong Kong Q4 2100. Camera Singapore Q1 1000. Camera Singapore Q2 1100. Tuner Tokyo Q3 250 ... maps an attribute value of a dimension table to one or more ... – PowerPoint PPT presentation

Number of Views:22
Avg rating:3.0/5.0
Slides: 17
Provided by: fredloc
Category:
Tags: data | design | warehouse | hong | kong | map | of

less

Transcript and Presenter's Notes

Title: DATA WAREHOUSE DESIGN


1
COMP 332PRINCIPLES OFDATABASE DESIGN
  • DATA WAREHOUSE DESIGN

2
DATA WAREHOUSE DESIGN OUTLINE
  • Operational versus Data Warehouse Databases
  • Data Warehouse Architecture
  • Conceptual Schema (User View)
  • Logical Schema (System View)
  • Data Warehouse Design
  • Physical Design

3
INTRODUCTION 9.1
  • A data warehouse is a subject-oriented,
    integrated, time-variant, nonvolatile collection
    of data in support of management decisions
  • large repository of 100s of gigabytes or
    terrabytes
  • used primarily for
  • standard reports and graphical presentation
    customer profiling trend analysis market
    analysis decision effectiveness
  • dimensional analysis aggregate/summarize data
  • data mining discover non-obvious relationships
    in the data (e.g., risk analysis fraud
    detection)
  • due to huge data volumes, performanceis highly
    sensitive to the database design

4
OPERATIONAL VS DATA WAREHOUSE DATABASES
  • operational database
  • supports mission-critical requirements for online
    transaction processing (OLTP) and batch
    processing
  • organized around business (functional)
    areas/processes
  • stores detailed, nonredundant, updateable and
    current data

How many gadgets were sold to customer number
123876 on Sept. 19?
  • data warehouse database
  • supports ad hoc query processing (online
    analytical processing (OLAP)) for decision
    support
  • organized around subjects, such as customers,
    product, etc.
  • stores summarized, redundant, non-updateable and
    historic data

What three products resulted in the most frequent
calls to the hotline over the past quarter?
5
OPERATIONAL VS DATA WAREHOUSE DATABASES
6
DATA WAREHOUSE ARCHITECTURE
Operational data sources
Data mart
Data mart
Data mart
Feeder DB1
Backend Tools report, query OLAP data
mining applications visualization
  • extract
  • clean
  • load
  • refresh

Feeder DB2
Data warehouse
. . .
Metadata
web pages
7
DATA WAREHOUSE ARCHITECTURE (contd)
  • data extraction - extracts data from operation
    systems
  • database heterogeneity - different DBMSs
  • data heterogeneity - different definitions/represe
    ntations of data
  • data cleaning/scrubbing - makes extracted data
    consistent
  • inconsistent field lengths, descriptions, value
    assignments
  • missing data, duplicate data, violation of
    integrity constraints
  • data loading - puts data into the data warehouse
  • may need to sort, summarize, aggregate data
    build indexes
  • data refresh - propagate updates on operational
    data
  • when to refresh? - periodically
  • how to refresh? - data shipping transaction
    shipping
  • metadata
  • for use by designers and administrators (e.g.,
    about data sources)
  • for use by end users (describes data content of
    data warehouse)

8
CONCEPTUAL SCHEMA (USER VIEW) 9.4
  • similar to a spreadsheet

9
CONCEPTUAL SCHEMA (USER VIEW)
  • but often represented as a multidimensional
    structure (cube/hypercube) organized around
    subject areas

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
1200
1500
1800
2100
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
10
MULTIDIMENSIONAL MODEL
  • facts
  • raw numeric data that defines the objects
    associated with the subject areas and that is to
    be analyzed
  • can be aggregated and summarized
  • e.g., price, revenue, units of items sold, etc.
  • dimensions
  • provide the context for the facts and the paths
    along which basic access operations to facts
    occur
  • the set of dimensions uniquely determine a fact
    and give it meaning
  • dimensions have attributes that are often
    hierarchical
  • e.g., product category, industry, year of
    introduction, average profit margin
  • e.g., time years, quarters, months, weeks, days

11
MULTIDIMENSIONAL MODEL OPERATIONS
  • drill down decrease the aggregation level more
    detailed view
  • e.g., sales by quarter to sales by month
  • roll up increase the aggregation level less
    detailed view
  • e.g., sales by quarter to sales by year
  • slice and dice apply selection and projection
    to dimensions
  • e.g., select all cameras and project on markets
    and units
  • pivot re-orient the multidimensional view of
    the data
  • rank sort the data

12
LOGICAL SCHEMA (SYSTEM VIEW) 9.2
Star schema
13
LOGICAL SCHEMA (SYSTEM VIEW)
Star schema
1
1
N
N
N
N
1
1
N
N
1
1
14
DATA WAREHOUSE DESIGN 9.4.3
  • Step 1 Analyze end-user requirements and
    environment
  • similar to normal database design, but adapted
    for data warehouse
  • Step 2 Define Cubes, Dimensions, Hierarchies
  • high-level, conceptual multidimensional modeling
  • Step 3 Define Dimension Members
  • logical design of dimensions
  • Step 4 Define Aggregations and Other Formulas
  • what to aggregate how to store aggregates when
    to pre-aggregate

15
DATA WAREHOUSE PHYSICAL DESIGN 9.3
  • need to deal with very large volumes of data
  • need to optimize for efficient query processing
    (since there are few updates)
  • require physical database structuresand query
    processing techniques
  • that enhance query performance for very large
    data warehouses
  • major issues to consider
  • indexing
  • join optimization
  • materializing views of aggregations

16
INDEXING
  • keys for dimension tables are typically
    multi-attribute keys
  • represent the entire hierarchy of dimension
    attributes
  • become foreign key of dimension in fact table
  • create an artificial key for a dimension if
    necessary
  • create indexes on each primary key column of each
    dimension table and on all foreign keys in the
    fact table
  • facilitates joins between fact table and
    dimension tables
  • make use of join indexes (pre-computed join) to
    speed up joins
  • maps an attribute value of a dimension table to
    one or more rows in a fact table
  • multi-key join indexes can represent n-way joins

17
BITMAP INDEXES
  • queries against low cardinality attributes can be
    optimized using bitmap indexes
  • e.g., gender 2 values (male/female)
  • marital status 4 values (single, married,
    divorced, widowed)
  • bitmap index construction
  • each row in a table is represented by a single
    bit in a bitmap
  • each attribute value has its own bit vector
  • if the row has a qualifying value, its bit is set
    to 1
  • speeds up special index operations such as
    intersection or union

18
BITMAP INDEXES (contd)
AND

intersection bitmap
19
JOIN PROCESSING
  • most RDBMs can only do pairwise joins (i.e., two
    relations at a time), but data warehouse
    operations require multi-way joins
  • join order is important due to creation of
    intermediate results that can be large and need
    to be stored
  • N! ways to join N tables
  • many possible join algorithms to choose from
  • only fact table is directly related to most other
    tables
  • allow unrelated tables to be joined
  • result is the Cartesian product
  • usually this is smaller than joining with the
    fact table

20
VIEW MATERIALIZATION 9.3.2
  • to speed up query processing, it may be
    cost-effective to store some aggregations of data
    rather than to compute them
  • usually applied to fact table based on the
    attributes of one or more dimension tables
  • e.g., rollup sales by week, by month, by product
    type, etc.
  • the aggregations can either be
  • calculated as needed high compute cost
  • stored high storage cost update cost
  • aggregate data can be used to derive a higher
    level summary
  • e.g., produce yearly results from monthly summary
    data

21
PARTITIONING 9.3.3
  • Horizontal
  • especially for time dimension (e.g., sales by
    month)
  • query processing dependent
  • Vertical
  • to separate rarely used data from frequently used
    data
  • may be advantageous to store table column-wise to
    facilitate aggregation
Write a Comment
User Comments (0)
About PowerShow.com