Title: OLAP
1OLAP
2Overview
- Traditional database systems are tuned to many,
small, simple queries. - Some new applications use fewer, more
time-consuming, analytic queries. - New architectures have been developed to handle
analytic queries efficiently.
3The Data Warehouse
- The most common form of data integration.
- Copy sources into a single DB (warehouse) and try
to keep it up-to-date. - Usual method periodic reconstruction of the
warehouse, perhaps overnight. - Frequently essential for analytic queries.
4OLTP
- Most database operations involve On-Line
Transaction Processing (OTLP). - Short, simple, frequent queries and/or
modifications, each involving a small number of
tuples. - Examples Answering queries from a Web interface,
sales at cash registers, selling airline tickets.
5OLAP
- On-Line Analytical Processing (OLAP, or
analytic) queries are, typically - Few, but complex queries --- may run for hours.
- Queries do not depend on having an absolutely
up-to-date database.
6Common Architecture
- Databases at store branches handle OLTP.
- Local store databases copied to a central
warehouse overnight. - Analysts use the warehouse for OLAP.
7Star Schemas
- A star schema is a common organization for data
at a warehouse. It consists of - Fact table a very large accumulation of facts
such as sales. - Often insert-only.
- Dimension tables smaller, generally static
information about the entities involved in the
facts.
8Example Star Schema
- Suppose we want to record in a warehouse
information about every car sale the serial
number, the date of sale, the dealer who sold the
car, the day, the time, and the price charged. - The fact table is a relation
- Sales(serialNo, date, dealer, price)
9Example -- Continued
- The dimension tables include information about
the autos, dealers, and days dimensions - Autos(serialNo, model, color)
- Dealers(name, city, state)
- Days(day, week, month, year)
- (5, 27, 7, 2000)
Day dimension table probably not stored.
10Visualization Star Schema
Dimension Table (Dealer)
Dimension Table (Autos)
Dimension Attrs.
Dependent Attrs.
Fact Table - Sales
Dimension Table (Day)
Dimension Table (etc.)
11Dimensions and Dependent Attributes
- Two classes of fact-table attributes
- Dimension attributes the key of a dimension
table. - Dependent attributes a value determined by the
dimension attributes of the tuple. - price is the dependent attribute of our example
Sales relation. - It is determined by the combination of dimension
attributes.
12Approaches to Building Warehouses
- ROLAP relational OLAP Tune a relational
DBMS to support star schemas. - MOLAP multidimensional OLAP Use a
specialized DBMS with a model such as the data
cube.
13Data Cubes
- Keys of dimension tables are the dimensions of a
hypercube. - Example for the Sales data, the three dimensions
are serialNo, date, and dealer. - Dependent attributes (e.g., price) appear at the
points of the cube.
14Visualization -- Data Cubes
car
price
dealer
date
15Slicing and Dicing
- Slice
- focus on particular partitions along (one or
more) dimension i.e., focusing on a particular
slice of the cube - WHERE clause in SQL
- Dice
- partitions the cube into smaller subcubes and the
points in each subcube are aggregated - GROUP BY clause in SQL
16Slicing and Dicing in SQL
- SELECT grouping-attributes and aggregations
- FROM fact table joined with (zero or more)
dimension tables - WHERE certain attributes are compared with
constants / slicing / - GROUP BY grouping-attributes / dicing /
17Slicing and Dicing Example
- Suppose a particular car model, say Gobi, is
not selling as well as anticipated. How to
analyze? - Maybe its the color
- Slice for Gobi. Dice for color.
- SELECT color, SUM(price)
- FROM Sales NATURAL JOIN Autos
- WHERE model 'Gobi'
- GROUP BY color
- Doesnt show anything interesting.
18Slicing and Dicing Example
- Suppose the previous query doesn't tell us much
each color produces about the same revenue. - Since the query does not dice for time, we only
see the total over all time for each color. - We may thus issue a revised query that also
partitions time by month. - SELECT color, month, SUM(price)
- FROM (Sales NATURAL JOIN Autos) JOIN Days ON date
day - WHERE model 'Gobi'
- GROUP BY color, month
19Slicing and Dicing Example
- We might discover that red Gobis havnt sold well
recently. - Does this problem exists at all dealers, or only
some dealers have had low sales of red Gobis? - Lets dice on dealer dimension as well.
- SELECT dealer, month, SUM(price)
- FROM (Sales NATURAL JOIN Autos) JOIN Days ON date
day - WHERE model 'Gobi' AND color 'red
- GROUP BY month, dealer
20Slicing and Dicing Example
- At this point, we find that the sales per month
for red Gobis are so small that we cannot observe
any trends easily. - Thus, we decide that it was a mistake to dice by
month. A better idea would be to partition only
by years, and look at only the last two years
(2006 and 2007). - SELECT dealer, year, SUM(price)
- FROM (Sales NATURAL JOIN Autos) JOIN Days ON date
day - WHERE model 'Gobi' AND color 'red' AND
- (year 2001 OR year 2002)
- GROUP BY year, dealer
21Drill-Down and Roll-Up
- Previous examples illustrate two common patterns
in sequences of queries that slice-and-dice the
data cube. - Drill-down is the process of partitioning more
finely and/or focusing on specific values in
certain dimensions. - Each of the example steps except the last is an
instance of drill-down. - Roll-up is the process of partitioning more
coarsely. - The last step, where we grouped by years instead
of months to eliminate the effect of randomness
in the data, is an example of roll-up.
22Marginals
- The data cube also includes aggregation
(typically SUM) along the margins of the cube. - The marginals include aggregations over one
dimension, two dimensions,
23Visualization --- Data Cube w/Aggregation
car
SUM over all Days
price
dealer
date
24Cube operator
- CUBE(F) is an augmented table for fact table F
- tuples (or points) added in CUBE(F)
- have a value, denoted to each dimension
- represents aggregation along that dimension
25Example
- Sales(serialNo, date, dealer, price)
-
- Sales(model, color, date, dealer, val, cnt)
serialNo dimension not well suited for the cube
as summing the price over all dates, or over all
dealers, but keeping the serial number fixed has
no effect. We will replace the serial number by
model and color to which the serial number
connects. Also, we will have as independent
attributes, val for sum of prices and cnt for
number of cars sold.
26Example CUBE(Sale)
- ('Gobi', 'red', '2001-05-21', 'Friendly Fred',
45000, 2) - On May 21, 2001, dealer Friendly Fred sold two
red Gobis for a total of 45,000. - Hypothetical tuple that would be in both Sales
and CUBE(Sales). - ('Gobi', , '2001-05-21', 'Friendly Fred',
152000, 7) - On May 21, 2001, Friendly Fred sold seven Gobis
of all colors, for a total price of 152,000. - Note that this tuple is in CUBE(Sales) but not in
Sales.
27Example CUBE(Sale)
- ('Gobi', , '2001-05-21', , 2348000, 100)
- On May 21, 2001, there were 100 Gobis sold by all
the dealers, and the total price of those Gobis
was 2,348,000. - ('Gobi', , , , 1339800000, 58000)
- Over all time, dealers, and colors, 58,000 Gobis
have been sold for a total price of
1,339,800,000. - (, , , , 3521727000, 198000)
- Total sales of all models in all colors, over all
time at all dealers is 198,000 cars for a total
price of 3,521,727,000.
28CUBE Helps Answering Queries
- SELECT color, AVG(price)
- FROM Sales
- WHERE model 'Gobi'
- GROUP BY color
- is answered by looking for all tuples of
CUBE(Sales) with the form - ('Gobi', c, , , v, n)
- where c is any specific color.
- v and n will be the sum and number of sales of
Gobis in that color. - The average price, is v/n.
- The answer to the query is the set of (c v/n)
pairs obtained from all - ('Gobi', c, , , v, n) tuples.
29CUBE in SQL
- SELECT model, color, date, dealer, SUM(val) AS v,
SUM(cnt) AS n - FROM Sales
- GROUP BY model, color, date, dealer
- WITH CUBE
- Suppose we materialize this into SalesCube.
- Then the previous query is rewritten into
- SELECT color, SUM(v)/SUM(n)
- FROM SalesCube
- WHERE model 'Gobi' AND
- date IS NULL AND
- dealer IS NULL