Title: Data Analysis
1Data Analysis
2Overview
- Traditional database systems are tuned to many,
small, simple queries. - Some applications use fewer, more time-consuming,
analytic queries.
3OLTP
- Most database operations involve On-Line
Transaction Processing (OTLP). - Short, simple, frequent queries and/or
modifications, each involving a small number of
tuples. - Examples
- Answering queries from a Web interface,
- sales at cash registers,
- selling airline tickets.
4OLAP
- On-Line Analytical Processing (OLAP, or
analytic) queries are, typically - Few, but complex queries --- may run for hours.
- Queries don't depend on having an absolutely
up-to-date database. - Example
- Partition car sales by years, and look at only
the last two years (2007 and 2008). Aggregate
price.
5Common Architecture
- Databases at store branches handle OLTP.
- Local store databases copied to a central data
warehouse overnight. - Analysts use the data warehouse for OLAP.
6Star Schemas
- A star schema is a common organization for data
at a warehouse. It consists of - Fact table a very large accumulation of facts
such as sales. - Dimension tables smaller, generally static
information about the entities involved in the
facts.
7Example Star Schema
- Suppose we want to record in a warehouse
information about every car sale, e.g. - serial number,
- dealer who sold the car,
- date of the sale, and
- price charged.
- The fact table is thus
- Sales(serialNo, dealer, day, price)
8Example -- Continued
- The dimension tables include information about
the autos, dealers, and days dimensions -
- Autos(serialNo, model, color)
- Dealers(name, city, province)
- Days(day, week, month, year)
Days dimension table probably not stored.
9Visualization Star Schema
Dimension Table (Dealer)
Dimension Table (Autos)
Dimension Attrs.
Dependent Attrs., e.g. price
Fact Table - Sales
Dimension Table (Day)
Dimension Table (etc.)
10Data Cubes
- Keys of dimension tables are the dimensions of a
hypercube. - Example for the Sales data, the three dimensions
are serialNo, date, and dealer. - Dependent attributes (e.g., price) appear at the
points of the cube.
11Visualization -- Data Cubes
car
price
dealer
date
12Slicing and Dicing
- Slice
- focus on particular partitions along (one or
more) dimension i.e., focus on a particular slice
of the cube - WHERE clause in SQL
- Dice
- partition the cube into smaller subcubes and
aggregated the points in each - GROUP BY clause in SQL
13Slicing and Dicing in SQL
- SELECT grouping-attributes and aggregations
- FROM fact table joined with (zero or more)
dimension tables - WHERE certain attributes are compared with
constants / slicing / - GROUP BY grouping-attributes / dicing /
14Slicing and Dicing Example
- Suppose a particular car model, say Gobi, isn't
selling as well as anticipated. How to analyze? - Maybe its the color
- Slice for Gobi. Dice for color.
- SELECT color, SUM(price)
- FROM Sales NATURAL JOIN Autos
- WHERE model 'Gobi'
- GROUP BY color
15Slicing and Dicing Example
- Suppose the previous query doesn't tell us much
each color produces about the same revenue. - Since the query does not dice for time, we only
see the total over all time for each color. - We may thus issue a revised query that also
partitions time by month. - SELECT color, month, SUM(price)
- FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
- WHERE model 'Gobi'
- GROUP BY color, month
16Slicing and Dicing Example
- We might discover that red Gobis havent sold
well recently. - Does this problem exists at all dealers, or only
some dealers have had low sales of red Gobis? - Lets dice on dealer dimension.
- SELECT dealer, month, SUM(price)
- FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
- WHERE model 'Gobi' AND color 'red
- GROUP BY month, dealer
17Slicing and Dicing Example
- At this point, we find that the sales per month
for red Gobis are so small that we cannot observe
any trends easily. - We decide it was a mistake to dice by month. A
better partition by years, and look at only the
last two years (2007 and 2008). - SELECT dealer, year, SUM(price)
- FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
- WHERE model 'Gobi' AND color 'red' AND
- (year 2007 OR year 2008)
- GROUP BY year, dealer
18Drill-Down and Roll-Up
- Previous examples illustrate two common patterns
in sequences of queries that slice-and-dice the
data cube. - Drill-down is the process of partitioning more
finely and/or focusing on specific values in
certain dimensions. - Each of the example steps except the last is an
instance of drill-down. - Roll-up is the process of partitioning more
coarsely. - The last step, where we grouped by years instead
of months to eliminate the effect of randomness
in the data, is an example of roll-up.
19Marginals --- Data Cube w/Aggregation
20Cube operator
- CUBE(F) is an augmented table for fact table F
- tuples (or points) added in CUBE(F)
- have a value, denoted to each dimension
- represents aggregation along that dimension
21Example
- Sales(serialNo, date, dealer, price)
-
-
- Sales(model, color, date, dealer, val, cnt)
serialNo dimension not well suited for the cube
as summing the price over all dates, or over all
dealers, but keeping the serial number fixed has
no effect.
We replace the serialNo by model and color to
which the serial number connects. Also, we will
have as independent attributes, val for sum of
prices and cnt for number of cars sold.
22Example CUBE(Sale)
- ('Gobi', 'red', '2001-05-21', 'Friendly Fred',
45000, 2) - On May 21, 2001, dealer Friendly Fred sold two
red Gobis for a total of 45,000. - This tuple is in both Sales and CUBE(Sales).
- ('Gobi', , '2001-05-21', 'Friendly Fred',
152000, 7) - On May 21, 2001, Friendly Fred sold seven Gobis
of all colors, for a total price of 152,000. - Note that this tuple is in CUBE(Sales) but not in
Sales.
23Example CUBE(Sale)
- ('Gobi', , '2001-05-21', , 2348000, 100)
- On May 21, 2001, there were 100 Gobis sold by all
the dealers, and the total price of those Gobis
was 2,348,000. - ('Gobi', , , , 1339800000, 58000)
- Over all time, dealers, and colors, 58,000 Gobis
have been sold for a total price of
1,339,800,000. - (, , , , 3521727000, 198000)
- Total sales of all models in all colors, over all
time at all dealers is 198,000 cars for a total
price of 3,521,727,000.
24CUBE Helps Answering Queries
- SELECT color, AVG(price)
- FROM Sales
- WHERE model 'Gobi'
- GROUP BY color
- is answered by looking for all tuples of
CUBE(Sales) with the form - ('Gobi', c, , , v, n)
- where c is any specific color. v and n will be
the sum and number of sales of Gobis in that
color. The average price, is v/n. The answer to
query is the set of (c v/n) pairs obtained from
all ('Gobi', c, , , v, n) tuples.
25CUBE in SQL
- SELECT model, color, date, dealer, SUM(val) AS v,
SUM(cnt) AS n - FROM Sales
- GROUP BY model, color, date, dealer
- WITH CUBE
- Suppose we materialize this into SalesCube.
- Then the previous query is rewritten into
- SELECT color, SUM(v)/SUM(n)
- FROM SalesCube
- WHERE model 'Gobi' AND
- date IS NULL AND
- dealer IS NULL
26Data Mining
- Data mining is a popular term for summarization
of big data sets in useful ways. - Examples
- Clustering Web pages by topic.
- Finding characteristics of fraudulent credit-card
use. - Finding products that are usually bought together.
27Market-Basket Data
- An important form of mining from relational data
involves market baskets sets of items that
are purchased together as a customer leaves a
store. - Summary of basket data is frequent itemsets
sets of items that often appear together in
baskets.
28Example Market Baskets
- If people often buy hamburger and ketchup
together, the store can - Put hamburger and ketchup near each other and put
potato chips between. - Run a sale on hamburger and raise the price of
ketchup.
29Finding Frequent Pairs
- The simplest case is when we only want to find
frequent pairs of items. - Assume data is in a relation Baskets(basket,
item). - The support threshold s is the minimum number
of baskets in which a pair appears before we are
interested.
30Frequent Pairs in SQL
- SELECT b1.item, b2.item
- FROM Baskets b1, Baskets b2
- WHERE b1.basket b2.basket
- AND b1.item
- GROUP BY b1.item, b2.item
- HAVING COUNT() s
31A-Priori Trick (1)
- Straightforward implementation involves a join of
a huge Baskets table with itself. - The A-Priori algorithm speeds the query by
recognizing that a pair of items i, j cannot
have support s unless both i and j do.
32A-Priori Trick (2)
- Use a materialized view to hold only information
about frequent items. - INSERT INTO Baskets1(basket, item)
- SELECT FROM Baskets
- WHERE item IN (
- SELECT item FROM Baskets
- GROUP BY item
- HAVING COUNT() s
- )
33A-Priori Algorithm
- Materialize the view Baskets1.
- Run the obvious query, but on Baskets1 instead of
Baskets. - Notes
- Computing Baskets1 is cheap, since it doesnt
involve a join. - Baskets1 probably has many fewer tuples than
Baskets. - Running time shrinks with the square of the
number of tuples involved in the join.
34Course Plug
- Fall 2009 CSC 485E / CSC 571 Advanced Databases
- Spring 2010 SENG 474 Data Mining