Data Analysis - PowerPoint PPT Presentation

1 / 34

About This Presentation

Title:

Data Analysis

Description:

Short, simple, frequent queries and/or modifications, each ... selling airline tickets. OLAP ... Computing Baskets1 is cheap, since it doesn't involve a join. ... – PowerPoint PPT presentation

Number of Views:347

Avg rating:3.0/5.0

Slides: 35

Provided by: aaa23

Category:

more less

Transcript and Presenter's Notes

Title: Data Analysis

1
Data Analysis
2
Overview

Traditional database systems are tuned to many,
small, simple queries.
Some applications use fewer, more time-consuming,
analytic queries.

3
OLTP

Most database operations involve On-Line
Transaction Processing (OTLP).
Short, simple, frequent queries and/or
modifications, each involving a small number of
tuples.
Examples
Answering queries from a Web interface,
sales at cash registers,
selling airline tickets.

4
OLAP

On-Line Analytical Processing (OLAP, or
analytic) queries are, typically
Few, but complex queries --- may run for hours.
Queries don't depend on having an absolutely
up-to-date database.
Example
Partition car sales by years, and look at only
the last two years (2007 and 2008). Aggregate
price.

5
Common Architecture

Databases at store branches handle OLTP.
Local store databases copied to a central data
warehouse overnight.
Analysts use the data warehouse for OLAP.

6
Star Schemas

A star schema is a common organization for data
at a warehouse. It consists of
Fact table a very large accumulation of facts
such as sales.
Dimension tables smaller, generally static
information about the entities involved in the
facts.

7
Example Star Schema

Suppose we want to record in a warehouse
information about every car sale, e.g.
serial number,
dealer who sold the car,
date of the sale, and
price charged.
The fact table is thus
Sales(serialNo, dealer, day, price)

8
Example -- Continued

The dimension tables include information about
the autos, dealers, and days dimensions
Autos(serialNo, model, color)
Dealers(name, city, province)
Days(day, week, month, year)

Days dimension table probably not stored.
9
Visualization Star Schema
Dimension Table (Dealer)
Dimension Table (Autos)
Dimension Attrs.
Dependent Attrs., e.g. price
Fact Table - Sales
Dimension Table (Day)
Dimension Table (etc.)
10
Data Cubes

Keys of dimension tables are the dimensions of a
hypercube.
Example for the Sales data, the three dimensions
are serialNo, date, and dealer.
Dependent attributes (e.g., price) appear at the
points of the cube.

11
Visualization -- Data Cubes
car
price
dealer
date
12
Slicing and Dicing

Slice
focus on particular partitions along (one or
more) dimension i.e., focus on a particular slice
of the cube
WHERE clause in SQL
Dice
partition the cube into smaller subcubes and
aggregated the points in each
GROUP BY clause in SQL

13
Slicing and Dicing in SQL

SELECT grouping-attributes and aggregations
FROM fact table joined with (zero or more)
dimension tables
WHERE certain attributes are compared with
constants / slicing /
GROUP BY grouping-attributes / dicing /

14
Slicing and Dicing Example

Suppose a particular car model, say Gobi, isn't
selling as well as anticipated. How to analyze?
Maybe its the color
Slice for Gobi. Dice for color.
SELECT color, SUM(price)
FROM Sales NATURAL JOIN Autos
WHERE model 'Gobi'
GROUP BY color

15
Slicing and Dicing Example

Suppose the previous query doesn't tell us much
each color produces about the same revenue.
Since the query does not dice for time, we only
see the total over all time for each color.
We may thus issue a revised query that also
partitions time by month.
SELECT color, month, SUM(price)
FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
WHERE model 'Gobi'
GROUP BY color, month

16
Slicing and Dicing Example

We might discover that red Gobis havent sold
well recently.
Does this problem exists at all dealers, or only
some dealers have had low sales of red Gobis?
Lets dice on dealer dimension.
SELECT dealer, month, SUM(price)
FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
WHERE model 'Gobi' AND color 'red
GROUP BY month, dealer

17
Slicing and Dicing Example

At this point, we find that the sales per month
for red Gobis are so small that we cannot observe
any trends easily.
We decide it was a mistake to dice by month. A
better partition by years, and look at only the
last two years (2007 and 2008).
SELECT dealer, year, SUM(price)
FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
WHERE model 'Gobi' AND color 'red' AND
(year 2007 OR year 2008)
GROUP BY year, dealer

18
Drill-Down and Roll-Up

Previous examples illustrate two common patterns
in sequences of queries that slice-and-dice the
data cube.
Drill-down is the process of partitioning more
finely and/or focusing on specific values in
certain dimensions.
Each of the example steps except the last is an
instance of drill-down.
Roll-up is the process of partitioning more
coarsely.
The last step, where we grouped by years instead
of months to eliminate the effect of randomness
in the data, is an example of roll-up.

19
Marginals --- Data Cube w/Aggregation
20
Cube operator

CUBE(F) is an augmented table for fact table F
tuples (or points) added in CUBE(F)
have a value, denoted to each dimension
represents aggregation along that dimension

21
Example

Sales(serialNo, date, dealer, price)
Sales(model, color, date, dealer, val, cnt)

serialNo dimension not well suited for the cube
as summing the price over all dates, or over all
dealers, but keeping the serial number fixed has
no effect.
We replace the serialNo by model and color to
which the serial number connects. Also, we will
have as independent attributes, val for sum of
prices and cnt for number of cars sold.
22
Example CUBE(Sale)

('Gobi', 'red', '2001-05-21', 'Friendly Fred',
45000, 2)
On May 21, 2001, dealer Friendly Fred sold two
red Gobis for a total of 45,000.
This tuple is in both Sales and CUBE(Sales).
('Gobi', , '2001-05-21', 'Friendly Fred',
152000, 7)
On May 21, 2001, Friendly Fred sold seven Gobis
of all colors, for a total price of 152,000.
Note that this tuple is in CUBE(Sales) but not in
Sales.

23
Example CUBE(Sale)

('Gobi', , '2001-05-21', , 2348000, 100)
On May 21, 2001, there were 100 Gobis sold by all
the dealers, and the total price of those Gobis
was 2,348,000.
('Gobi', , , , 1339800000, 58000)
Over all time, dealers, and colors, 58,000 Gobis
have been sold for a total price of
1,339,800,000.
(, , , , 3521727000, 198000)
Total sales of all models in all colors, over all
time at all dealers is 198,000 cars for a total
price of 3,521,727,000.

24
CUBE Helps Answering Queries

SELECT color, AVG(price)
FROM Sales
WHERE model 'Gobi'
GROUP BY color
is answered by looking for all tuples of
CUBE(Sales) with the form
('Gobi', c, , , v, n)
where c is any specific color. v and n will be
the sum and number of sales of Gobis in that
color. The average price, is v/n. The answer to
query is the set of (c v/n) pairs obtained from
all ('Gobi', c, , , v, n) tuples.

25
CUBE in SQL

SELECT model, color, date, dealer, SUM(val) AS v,
SUM(cnt) AS n
FROM Sales
GROUP BY model, color, date, dealer
WITH CUBE
Suppose we materialize this into SalesCube.
Then the previous query is rewritten into
SELECT color, SUM(v)/SUM(n)
FROM SalesCube
WHERE model 'Gobi' AND
date IS NULL AND
dealer IS NULL

26
Data Mining

Data mining is a popular term for summarization
of big data sets in useful ways.
Examples
Clustering Web pages by topic.
Finding characteristics of fraudulent credit-card
use.
Finding products that are usually bought together.

27
Market-Basket Data

An important form of mining from relational data
involves market baskets sets of items that
are purchased together as a customer leaves a
store.
Summary of basket data is frequent itemsets
sets of items that often appear together in
baskets.

28
Example Market Baskets

If people often buy hamburger and ketchup
together, the store can
Put hamburger and ketchup near each other and put
potato chips between.
Run a sale on hamburger and raise the price of
ketchup.

29
Finding Frequent Pairs

The simplest case is when we only want to find
frequent pairs of items.
Assume data is in a relation Baskets(basket,
item).
The support threshold s is the minimum number
of baskets in which a pair appears before we are
interested.

30
Frequent Pairs in SQL

SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket b2.basket
AND b1.item
GROUP BY b1.item, b2.item
HAVING COUNT() s

31
A-Priori Trick (1)

Straightforward implementation involves a join of
a huge Baskets table with itself.
The A-Priori algorithm speeds the query by
recognizing that a pair of items i, j cannot
have support s unless both i and j do.

32
A-Priori Trick (2)

Use a materialized view to hold only information
about frequent items.
INSERT INTO Baskets1(basket, item)
SELECT FROM Baskets
WHERE item IN (
SELECT item FROM Baskets
GROUP BY item
HAVING COUNT() s
)

33
A-Priori Algorithm

Materialize the view Baskets1.
Run the obvious query, but on Baskets1 instead of
Baskets.
Notes
Computing Baskets1 is cheap, since it doesnt
involve a join.
Baskets1 probably has many fewer tuples than
Baskets.
Running time shrinks with the square of the
number of tuples involved in the join.

34
Course Plug