Title: Lecture 10: Data Warehouses
1Lecture 10 Data Warehouses
- Introduction
- Operational vs. Warehouse
- Multidimensional Data
- Examples
- MOLAP vs ROLAP
- Dimensional Hierarchies
- OLAP Queries
- Demos
- Comparison with SQL Queries
- CUBE Operator
- Multidimensional Design
- Star/Snowflake Schemas
- Online Aggregation
- Implementation Issues
- Bitmap Index
- Constructing a Data Warehouse
- Views
- Materialized View Example
- Materialized View is an Index
- Issues in Materialized Views
- Maintaining Materialized Views
2Introduction
- In the late 80s and early 90s, companies began to
use their DBMSs for complex, interactive,
exploratory analysis of historical data. - This was called Decision Support, and On-Line
Analytic Processing (OLAP). - DS slowed down the operation of the company,
called On-Line Transaction Processing (OLTP). - This led to the creation of Data Warehouses,
separate from operational Databases.
3Operational vs Data Warehouse Requirements
4Operational vs Data Warehouse Requirements, ctd
5Data Warehousing
Operational Data
- Integrated data spanning long time periods, often
augmented with summary information. - Several terabytes to petabytes common.
- Interactive response times expected for
complex queries ad-hoc updates uncommon.
EXTRACT TRANSFORM LOAD REFRESH
DATA WAREHOUSE
Metadata Repository
SUPPORTS
DATA MINING
6Multidimensional Data
- In order to support OLAP, warehouse data is often
structured multidimensionally, as measures and
dimensions. - Measure Numeric attribute, e.g. sales amount
- Dimension attribute categorizing the measure,
e.g. product, store, date of sale. - The fact table is a foreign key for each
dimension, plus an attribute for each measure. - There will also be a dimension table for each
dimension. - On the next page, the fact tables are red, the
dimension tables are green.
7Examples of MultiDimensional Data
- Purchase(ProductID, StoreID, DateID, Amt)
- Product(ID, SKU, size, brand)
- Store(ID, Address, Sales District, Region,
Manager) - Date (ID, Week, Month, Holiday, Promotion)
- Claims(ProvID, MembID, Procedure, DateID, Cost)
- Providers(ID, Practice, Address, ZIP, City,
State) - Members(ID, Contract, Name, Address)
- Procedure (ID, Name, Type)
- Telecomm (CustID, SalesRepID, ServiceID, DateID)
- SalesRep(ID, Address, Sales District, Region,
Manager) - Service(ID, Name, Category)
8MOLAP vs ROLAP
- Multidimensional data can be stored physically in
a (disk-resident, persistent) array called MOLAP
systems. Alternatively, can store as a relation
called ROLAP systems. - The main relation, which relates dimensions to a
measure, is called the fact table. Each
dimension can have additional attributes and an
associated dimension table. - E.g., Products(pid, locid, timeid, amt)
- Fact tables are much larger than dimensional
tables.
925.2 Multidimensional Data Model
timeid
locid
amt
pid
- Collection of numeric measures, which depend on
a set of dimensions. - E.g., measure Amt, dimensions Product (key pid),
Location (locid), and Time (timeid).
Slice locid1 is shown
locid
10Dimension Hierarchies
- For each dimension, some of the attributes may be
organized in a hierarchy
PRODUCT
TIME
LOCATION
year
category quarter
state
pname week
city
PID date
ZIP
1125.3 OLAP Queries
- Influenced by SQL and by spreadsheets.
- A common operation is to aggregate a measure over
one or more dimensions. - Find total sales.
- Find total sales for each city, or for each
state. - Find top five products ranked by total sales.
- Roll-up Aggregating at different levels of a
dimension hierarchy. - E.g., Given total sales by city, we can roll-up
to get sales by state.
12OLAP Queries
- Drill-down The inverse of roll-up.
- E.g., Given total sales by state, can drill-down
to get total sales by city. - E.g., Can also drill-down on different dimension
to get total sales by product for each state. - Pivoting Aggregation on selected dimensions.
- E.g., Pivoting on State and Year
yields this cross-tabulation
OR CA Total
63 81 144
2007
38 107 145
- Slicing and Dicing Equality
- and range selections on one
- or more dimensions.
2008
75 35 110
2009
176 223 339
Total
13Cognos Demo
- Now we watch a demo of Cognos (bought by IBM)
- Dimensions ProductsMargin ranges
- Measure Order value (sales)
- First pivot from Product dimension to Margin
Range - Notice how quickly the cube changes
- Slice to Low Margin, pivot to Product and Company
Region - Drill Down to High Tech, IDES AG
- Now the guilty product is clear.
14Tableau Demo
- http//www.tableausoftware.com/products/tour2
- Note the many measures.
- Pivot on sales, date (drill down to month),
region as color. - Clear date, pivot on product and drill down on
subcategory. - Change region from color to rows
- Move profit into color
- Change bars to circles
- Pivot on dates (columns)
15Comparison with SQL Queries
- The cross-tabulation obtained by pivoting can
also be computed using a collection of
SQLqueries
SELECT T.year, L.state, SUM(S.amt) FROM Sales
S, Times T, Locations L WHERE S.timeidT.timeid
AND S.locidL.locid GROUP BY T.year, L.state
SELECT T.year, SUM(S.amt) FROM Sales S, Times
T WHERE S.timeidT.timeid GROUP BY T.year
SELECT L.state,SUM(S.amt) FROM Sales S,
Location L WHERE S.locidL.locid GROUP BY L.state
16The CUBE Operator
- Generalizing the previous example, if there are k
dimensions, we have 2k possible SQL GROUP BY
queries that can be generated through pivoting on
a subset of dimensions. - CUBE pid, locid, timeid BY SUM Sales
- Equivalent to rolling up Sales on all eight
subsets of the set pid, locid, timeid each
roll-up corresponds to an SQL query of the form
SELECT SUM(S.amt) FROM Sales S GROUP BY
grouping-list
Lots of work on optimizing the CUBE operator!
17Example Multidimensional Design
TIMES
holiday_flag
week
date
timeid
month
quarter
year
(Fact table)
amt
locid
timeid
pid
SALES
PRODUCTS
LOCATIONS
price
category
pname
pid
country
state
city
locid
- This kind of schema is very common in OLAP
applications - It is called a star schema
- What is wrong with it?
18Star/Snowflake Schemas
- Why normalize?
- Space
- Redundancy, anomalies
- Why unnormalize?
- Performance
- Which is more important in D. Warehouses?
- If normalized, it is a snowflake schema
19Online Aggregation
- Consider an aggregate query, e.g., finding the
average sales by state. Can we provide the user
with some information before the exact average is
computed for all states? - Can show the current running average for each
state as the computation proceeds. - Even better, if we use statistical techniques and
sample tuples to aggregate instead of simply
scanning the aggregated table, we can provide
bounds such as the average for Oregon is
2000102 with 95 probability. - Should also use nonblocking algorithms!
2025.6 Implementation Issues
- New indexing techniques Bitmap indexes, Join
indexes, array representations, compression,
precomputation of aggregations, etc. - E.g., Bitmap index
sex custid name sex rating rating
Bit-vector 1 bit for each possible value.
F
M
21Bitmap Indexes
- Work when an attribute has few values, e.g.
gender or rating - Advantage Small enough to fit in memory
- Many queries can be answered by bit-vector ops,
e.g. females with rating 3.
2225.7 Constructing a D. Warehouse
- Extract
- Is the data in native format?
- Clean
- How many ways can you spell Mr.?
- Errors, missing information
- Transform
- Fix semantic mismatches.
- E.g. Lastfirst vs. Name
- Load
- Do it in parallel or else.
- Refresh
- Both data and indexes
2325.8,9 Views and Decision Support
- In large databases, precomputation is necessary
for decent response times - Examples brain, google
- Example Precompute daily sums for the cube.
- What can be derived from those precomputations?
- These precomputed queries are called Materialized
Views (SQL Server Indexed views).
24Materialized View Example
CREATE VIEW DailySum(date, sumamt) AS SELECT
date, SUM(amt) FROM Times Join Sales
USING(timeid) GROUP BY date
Mat.View
Query
SELECT week, SUM(amt) FROM Times Join Sales
USING(timeid) Group By week
Modified Query
SELECT week, SUM(sumamt) FROM Times Join DailySum
USING (week) GROUP BY week
25Pros and Cons of Materialized Views
- Pro Modified query is a join of two small
tables original query is a join with one huge
table. - Con Materialized views take up space, need to be
updated.
26A Materialized View is an Index
- Recall the definition of an index
- Data structure that provides fast access to data
- Table indexes were of the form (value,
pointer), perhaps at leaf level of a search
structure. This is different. - Needs to be maintained as underlying tables
change. - Ideally, we want incremental view maintenance
algorithms.
27What views should we materialize?
- Remember the software that automatically chooses
optimal index configurations? - The same software will choose optimal
materialized views, given a workload and
available space.
28What about the optimizer?
- Given a query and a set of materialized views,
can we use the materialized views to answer the
query? - This is tricky. Best reference is 348
29Refreshing Materialized Views
- How often should we refresh the materialized
view? - Many enterprises refresh warehouse data only
weekly/nightly, so can afford to completely
rebuild their materialized views. - Others want their warehouses to be current, so
materialized views must be updated incrementally
if possible. - Let's look at some simple examples.
3025.10 Maintaining Materialized Views
- Incremental view maintenance
- Defn make changes in view that correspond to
changes in the base tables - Example V SELECT a FROM R
- How is V modified if r is inserted to R?
- How is V modified if r is deleted from R?
31Maintaining Materialized Views
- Consider V R ? S
- How is V modified if r is inserted to R?
- How is V modified if r is deleted from R?
- Consider V SELECT g,COUNT()
- FROM R GROUP BY g
- How is V modified if r is inserted to R?
- How is V modified if r is deleted from R
- For more general cases, see 348