Lecture 10: Data Warehouses - PowerPoint PPT Presentation

About This Presentation

Title:

Lecture 10: Data Warehouses

Description:

2. Introduction. In the late 80s and early 90s, companies ... TIME. LOCATION. pname week city. PID date ZIP. year. category quarter state. 11. 25.3 OLAP Queries ... – PowerPoint PPT presentation

Number of Views:77

Avg rating:3.0/5.0

Slides: 32

Provided by: RaghuRamak

Learn more at: http://web.cecs.pdx.edu

Category:

more less

Transcript and Presenter's Notes

Title: Lecture 10: Data Warehouses

1
Lecture 10 Data Warehouses

Introduction
Operational vs. Warehouse
Multidimensional Data
Examples
MOLAP vs ROLAP
Dimensional Hierarchies
OLAP Queries
Demos
Comparison with SQL Queries
CUBE Operator
Multidimensional Design
Star/Snowflake Schemas

Online Aggregation
Implementation Issues
Bitmap Index
Constructing a Data Warehouse
Views
Materialized View Example
Materialized View is an Index
Issues in Materialized Views
Maintaining Materialized Views

2
Introduction

In the late 80s and early 90s, companies began to
use their DBMSs for complex, interactive,
exploratory analysis of historical data.
This was called Decision Support, and On-Line
Analytic Processing (OLAP).
DS slowed down the operation of the company,
called On-Line Transaction Processing (OLTP).
This led to the creation of Data Warehouses,
separate from operational Databases.

3
Operational vs Data Warehouse Requirements
4
Operational vs Data Warehouse Requirements, ctd
5
Data Warehousing
Operational Data

Integrated data spanning long time periods, often
augmented with summary information.
Several terabytes to petabytes common.
Interactive response times expected for
complex queries ad-hoc updates uncommon.

EXTRACT TRANSFORM LOAD REFRESH
DATA WAREHOUSE
Metadata Repository
SUPPORTS
DATA MINING
6
Multidimensional Data

In order to support OLAP, warehouse data is often
structured multidimensionally, as measures and
dimensions.
Measure Numeric attribute, e.g. sales amount
Dimension attribute categorizing the measure,
e.g. product, store, date of sale.
The fact table is a foreign key for each
dimension, plus an attribute for each measure.
There will also be a dimension table for each
dimension.
On the next page, the fact tables are red, the
dimension tables are green.

7
Examples of MultiDimensional Data

Purchase(ProductID, StoreID, DateID, Amt)
Product(ID, SKU, size, brand)
Store(ID, Address, Sales District, Region,
Manager)
Date (ID, Week, Month, Holiday, Promotion)
Claims(ProvID, MembID, Procedure, DateID, Cost)
Providers(ID, Practice, Address, ZIP, City,
State)
Members(ID, Contract, Name, Address)
Procedure (ID, Name, Type)
Telecomm (CustID, SalesRepID, ServiceID, DateID)
SalesRep(ID, Address, Sales District, Region,
Manager)
Service(ID, Name, Category)

8
MOLAP vs ROLAP

Multidimensional data can be stored physically in
a (disk-resident, persistent) array called MOLAP
systems. Alternatively, can store as a relation
called ROLAP systems.
The main relation, which relates dimensions to a
measure, is called the fact table. Each
dimension can have additional attributes and an
associated dimension table.
E.g., Products(pid, locid, timeid, amt)
Fact tables are much larger than dimensional
tables.

9
25.2 Multidimensional Data Model
timeid
locid
amt
pid

Collection of numeric measures, which depend on
a set of dimensions.
E.g., measure Amt, dimensions Product (key pid),
Location (locid), and Time (timeid).

Slice locid1 is shown
locid
10
Dimension Hierarchies

For each dimension, some of the attributes may be
organized in a hierarchy

PRODUCT
TIME
LOCATION
year
category quarter
state
pname week
city
PID date
ZIP
11
25.3 OLAP Queries

Influenced by SQL and by spreadsheets.
A common operation is to aggregate a measure over
one or more dimensions.
Find total sales.
Find total sales for each city, or for each
state.
Find top five products ranked by total sales.
Roll-up Aggregating at different levels of a
dimension hierarchy.
E.g., Given total sales by city, we can roll-up
to get sales by state.

12
OLAP Queries

Drill-down The inverse of roll-up.
E.g., Given total sales by state, can drill-down
to get total sales by city.
E.g., Can also drill-down on different dimension
to get total sales by product for each state.
Pivoting Aggregation on selected dimensions.
E.g., Pivoting on State and Year
yields this cross-tabulation

OR CA Total
63 81 144
2007
38 107 145

Slicing and Dicing Equality
and range selections on one
or more dimensions.

2008
75 35 110
2009
176 223 339
Total
13
Cognos Demo

Now we watch a demo of Cognos (bought by IBM)
Dimensions ProductsMargin ranges
Measure Order value (sales)
First pivot from Product dimension to Margin
Range
Notice how quickly the cube changes
Slice to Low Margin, pivot to Product and Company
Region
Drill Down to High Tech, IDES AG
Now the guilty product is clear.

14
Tableau Demo

http//www.tableausoftware.com/products/tour2
Note the many measures.
Pivot on sales, date (drill down to month),
region as color.
Clear date, pivot on product and drill down on
subcategory.
Change region from color to rows
Move profit into color
Change bars to circles
Pivot on dates (columns)

15
Comparison with SQL Queries

The cross-tabulation obtained by pivoting can
also be computed using a collection of
SQLqueries

SELECT T.year, L.state, SUM(S.amt) FROM Sales
S, Times T, Locations L WHERE S.timeidT.timeid
AND S.locidL.locid GROUP BY T.year, L.state
SELECT T.year, SUM(S.amt) FROM Sales S, Times
T WHERE S.timeidT.timeid GROUP BY T.year
SELECT L.state,SUM(S.amt) FROM Sales S,
Location L WHERE S.locidL.locid GROUP BY L.state
16
The CUBE Operator

Generalizing the previous example, if there are k
dimensions, we have 2k possible SQL GROUP BY
queries that can be generated through pivoting on
a subset of dimensions.
CUBE pid, locid, timeid BY SUM Sales
Equivalent to rolling up Sales on all eight
subsets of the set pid, locid, timeid each
roll-up corresponds to an SQL query of the form

SELECT SUM(S.amt) FROM Sales S GROUP BY
grouping-list
Lots of work on optimizing the CUBE operator!
17
Example Multidimensional Design
TIMES
holiday_flag
week
date
timeid
month
quarter
year
(Fact table)
amt
locid
timeid
pid
SALES
PRODUCTS
LOCATIONS
price
category
pname
pid
country
state
city
locid

This kind of schema is very common in OLAP
applications
It is called a star schema
What is wrong with it?

18
Star/Snowflake Schemas

Why normalize?
Space
Redundancy, anomalies
Why unnormalize?
Performance
Which is more important in D. Warehouses?
If normalized, it is a snowflake schema

19
Online Aggregation

Consider an aggregate query, e.g., finding the
average sales by state. Can we provide the user
with some information before the exact average is
computed for all states?
Can show the current running average for each
state as the computation proceeds.
Even better, if we use statistical techniques and
sample tuples to aggregate instead of simply
scanning the aggregated table, we can provide
bounds such as the average for Oregon is
2000102 with 95 probability.
Should also use nonblocking algorithms!

20
25.6 Implementation Issues

New indexing techniques Bitmap indexes, Join
indexes, array representations, compression,
precomputation of aggregations, etc.
E.g., Bitmap index

sex custid name sex rating rating
Bit-vector 1 bit for each possible value.
F
M
21
Bitmap Indexes

Work when an attribute has few values, e.g.
gender or rating
Advantage Small enough to fit in memory
Many queries can be answered by bit-vector ops,
e.g. females with rating 3.

22
25.7 Constructing a D. Warehouse

Extract
Is the data in native format?
Clean
How many ways can you spell Mr.?
Errors, missing information
Transform
Fix semantic mismatches.
E.g. Lastfirst vs. Name
Load
Do it in parallel or else.
Refresh
Both data and indexes

23
25.8,9 Views and Decision Support

In large databases, precomputation is necessary
for decent response times
Examples brain, google
Example Precompute daily sums for the cube.
What can be derived from those precomputations?
These precomputed queries are called Materialized
Views (SQL Server Indexed views).

24
Materialized View Example
CREATE VIEW DailySum(date, sumamt) AS SELECT
date, SUM(amt) FROM Times Join Sales
USING(timeid) GROUP BY date
Mat.View
Query
SELECT week, SUM(amt) FROM Times Join Sales
USING(timeid) Group By week
Modified Query
SELECT week, SUM(sumamt) FROM Times Join DailySum
USING (week) GROUP BY week
25
Pros and Cons of Materialized Views

Pro Modified query is a join of two small
tables original query is a join with one huge
table.
Con Materialized views take up space, need to be
updated.

26
A Materialized View is an Index

Recall the definition of an index
Data structure that provides fast access to data
Table indexes were of the form (value,
pointer), perhaps at leaf level of a search
structure. This is different.
Needs to be maintained as underlying tables
change.
Ideally, we want incremental view maintenance
algorithms.

27
What views should we materialize?

Remember the software that automatically chooses
optimal index configurations?
The same software will choose optimal
materialized views, given a workload and
available space.

28
What about the optimizer?

Given a query and a set of materialized views,
can we use the materialized views to answer the
query?
This is tricky. Best reference is 348

29
Refreshing Materialized Views

How often should we refresh the materialized
view?
Many enterprises refresh warehouse data only
weekly/nightly, so can afford to completely
rebuild their materialized views.
Others want their warehouses to be current, so
materialized views must be updated incrementally
if possible.
Let's look at some simple examples.

30
25.10 Maintaining Materialized Views