Lecture 1: Data Warehousing - PowerPoint PPT Presentation

About This Presentation

Title:

Lecture 1: Data Warehousing

Description:

Lecture 1: Data Warehousing Based on the s by Jeffrey D. Ullman and Hector Garcia-Molina at Stanford University * ... – PowerPoint PPT presentation

Number of Views:221

Avg rating:3.0/5.0

Slides: 94

Provided by: Janet240

Learn more at: https://www.cs.kent.edu

Category:

more less

Transcript and Presenter's Notes

Title: Lecture 1: Data Warehousing

1
Lecture 1 Data Warehousing

Based on the slides by Jeffrey D. Ullman and
Hector Garcia-Molina at Stanford University

2
Overview

Traditional database systems are tuned to many,
small, simple queries.
Some new applications use fewer, more
time-consuming, complex queries.
New architectures have been developed to handle
complex analytic queries efficiently.

3
The Data Warehouse

The most common form of data integration.
Copy sources into a single DB (warehouse) and try
to keep it up-to-date.
Usual method periodic reconstruction of the
warehouse, perhaps overnight.
Frequently essential for analytic queries.

4
OLTP

Most database operations involve On-Line
Transaction Processing (OTLP).
Short, simple, frequent queries and/or
modifications, each involving a small number of
tuples.
Examples Answering queries from a Web interface,
sales at cash registers, selling airline tickets.

5
OLAP

Of increasing importance are On-Line Application
Processing (OLAP) queries.
Few, but complex queries --- may run for hours.
Queries do not depend on having an absolutely
up-to-date database.

6
OLAP Examples

Amazon analyzes purchases by its customers to
come up with an individual screen with products
of likely interest to the customer.
Analysts at Wal-Mart look for items with
increasing sales in some region.

7
Warehouse Architecture
Metadata
8
Why a Warehouse?

Two Approaches
Query-Driven (Lazy)
Warehouse (Eager)

9
Data Warehouse

Databases at store branches handle OLTP.
Local store databases copied to a central
warehouse overnight.
Analysts use the warehouse for OLAP.

10
Query-Driven Approach
11
Advantages of Warehousing

High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
Modify, summarize (store aggregates)
Add historical information

12
Advantages of Query-Driven

No need to copy data
less storage
no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources

13
OLTP vs. OLAP

OLTP On Line Transaction Processing
Describes processing at operational sites
OLAP On Line Analytical Processing
Describes processing at warehouse

14
OLTP vs. OLAP
OLTP
OLAP

Mostly updates
Many small transactions
Mb-Tb of data
Raw data
Clerical users
Up-to-date data
Consistency, recoverability critical

Mostly reads
Queries long, complex
Gb-Tb of data
Summarized, consolidated data
Decision-makers, analysts as users

15
Star Schemas

A star schema is a common organization for data
at a warehouse. It consists of
Fact table a very large accumulation of facts
such as sales.
Often insert-only.
Dimension tables smaller, generally static
information about the entities involved in the
facts.

16
Example Star Schema

Suppose we want to record in a warehouse
information about every beer sale the bar, the
brand of beer, the drinker who bought the beer,
the day, the time, and the price charged.
The fact table is a relation
Sales(bar, beer, drinker, day, time, price)

17
Example, Continued

The dimension tables include information about
the bar, beer, and drinker dimensions
Bars(bar, addr, license)
Beers(beer, manf)
Drinkers(drinker, addr, phone)

18
Visualization Star Schema
Dimension Table (Drinkers)
Dimension Table (Bars)
Dimension Attrs.
Dependent Attrs.
Fact Table - Sales
Dimension Table (Beers)
Dimension Table (etc.)
19
Dimensions and Dependent Attributes

Two classes of fact-table attributes
Dimension attributes the key of a dimension
table.
Dependent attributes a value determined by the
dimension attributes of the tuple.

20
Example Dependent Attribute

price is the dependent attribute of our example
Sales relation.
It is determined by the combination of dimension
attributes bar, beer, drinker, and the time
(combination of day and time-of-day attributes).

21
Approaches to Building Warehouses

ROLAP relational OLAP Tune a relational
DBMS to support star schemas.
MOLAP multidimensional OLAP Use a
specialized DBMS with a model such as the data
cube.

22
ROLAP Techniques

Bitmap indexes For each key value of a
dimension table (e.g., each beer for relation
Beers) create a bit-vector telling which tuples
of the fact table have that value.
Materialized views Store the answers to several
useful queries (views) in the warehouse itself.

23
Typical OLAP Queries

Often, OLAP queries begin with a star join the
natural join of the fact table with all or most
of the dimension tables.
Example
SELECT
FROM Sales, Bars, Beers, Drinkers
WHERE Sales.bar Bars.bar AND
Sales.beer Beers.beer AND
Sales.drinker Drinkers.drinker

24
Typical OLAP Queries --- (2)

The typical OLAP query will
Start with a star join.
Select for interesting tuples, based on dimension
data.
Group by one or more dimensions.
Aggregate certain attributes of the result.

25
Example OLAP Query

For each bar in Palo Alto, find the total sale of
each beer manufactured by Anheuser-Busch.
Filter addr Palo Alto and manf
Anheuser-Busch.
Grouping by bar and beer.
Aggregation Sum of price.

26
Example In SQL

SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars
NATURAL JOIN Beers
WHERE addr Palo Alto AND
manf Anheuser-Busch
GROUP BY bar, beer

27
Using Materialized Views

A direct execution of this query from Sales and
the dimension tables could take too long.
If we create a materialized view that contains
enough information, we may be able to answer our
query much faster.

28
Example Materialized View

Which views could help with our query?
Key issues
It must join Sales, Bars, and Beers, at least.
It must group by at least bar and beer.
It must not select out Palo-Alto bars or
Anheuser-Busch beers.
It must not project out addr or manf.

29
Example --- Continued

Here is a materialized view that could help
CREATE VIEW BABMS(bar, addr,
beer, manf, sales) AS
SELECT bar, addr, beer, manf,
SUM(price) sales
FROM Sales NATURAL JOIN Bars
NATURAL JOIN Beers
GROUP BY bar, addr, beer, manf

30
Example --- Concluded

Heres our query using the materialized view
BABMS
SELECT bar, beer, sales
FROM BABMS
WHERE addr Palo Alto AND
manf Anheuser-Busch

31
MOLAP and Data Cubes

Keys of dimension tables are the dimensions of a
hypercube.
Example for the Sales data, the four dimensions
are bar, beer, drinker, and time.
Dependent attributes (e.g., price) appear at the
points of the cube.

32
Visualization - Data Cubes
beer
price
bar
drinker
33
Marginals

The data cube also includes aggregation
(typically SUM) along the margins of the cube.
The marginals include aggregations over one
dimension, two dimensions,

34
Visualization - Data Cube w/ Aggregation
beer
SUM over all Drinkers
price
bar
drinker
35
Example Marginals

Our 4-dimensional Sales cube includes the sum of
price over each bar, each beer, each drinker, and
each time unit (perhaps days).
It would also have the sum of price over all
bar-beer pairs, all bar-drinker-day triples,

36
Structure of the Cube

Think of each dimension as having an additional
value .
A point with one or more s in its coordinates
aggregates over the dimensions with the s.
Example Sales(Joes Bar, Bud, , ) holds
the sum over all drinkers and all time of the Bud
consumed at Joes.

37
Drill-Down

Drill-down de-aggregate break an aggregate
into its constituents.
Example having determined that Joes Bar sells
very few Anheuser-Busch beers, break down his
sales by particular A.-B. beer.

38
Roll-Up

Roll-up aggregate along one or more dimensions.
Example given a table of how much Bud each
drinker consumes at each bar, roll it up into a
table giving total amount of Bud consumed for
each drinker.

39
Roll Up and Drill Down
of Anheuser-Busch by drinker/bar
of A-B / drinker
Jim Bob Mary
Joes Bar 45 33 30
Nut- House 50 36 42
Blue Chalk 38 31 40
Jim Bob Mary
133 100 112
Roll upby Bar
Drill downby Beer
of A-B Beers / drinker
Jim Bob Mary
Bud 40 29 40
Mlob 45 31 37
Bud Light 48 40 35
40
Materialized Data-Cube Views

Data cubes invite materialized views that are
aggregations in one or more dimensions.
Dimensions may not be completely aggregated ---
an option is to group by an attribute of the
dimension table.

41
Example

A materialized view for our Sales data cube
might
Aggregate by drinker completely.
Not aggregate at all by beer.
Aggregate by time according to the week.
Aggregate according to the city of the bar.

42
Example

A materialized view for our Sales data cube
might
Aggregate by drinker completely.
Not aggregate at all by beer.
Aggregate by time according to the week.
Aggregate according to the city of the bar.

43
Warehouse Models Operators

Data Models
relations
stars snowflakes
cubes
Operators
slice dice
roll-up, drill down
pivoting
other

44
Star
45
Star Schema
46
Terms

Fact table
Dimension tables
Measures

47
Dimension Hierarchies
sType
store
city
region
è snowflake schema è constellations
48
Cube
Fact table view
Multi-dimensional cube
dimensions 2
49
3-D Cube
Multi-dimensional cube
Fact table view
dimensions 3
50
ROLAP vs. MOLAP

ROLAPRelational On-Line Analytical Processing
MOLAPMulti-Dimensional On-Line Analytical
Processing

51
Aggregates

Add up amounts for day 1
In SQL SELECT sum(amt) FROM SALE
WHERE date 1

81
52
Aggregates

Add up amounts by day
In SQL SELECT date, sum(amt) FROM SALE
GROUP BY date

53
Another Example

Add up amounts by day, product
In SQL SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId

rollup
drill-down
54
Aggregates

Operators sum, count, max, min, median,
ave
Having clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)

55
Cube Aggregation
Example computing sums
day 2
. . .
day 1
129
56
Cube Operators
day 2
. . .
day 1
sale(c1,,)
129
sale(c2,p2,)
sale(,,)
57
Extended Cube

day 2
sale(,p2,)
day 1
58
Aggregation Using Hierarchies
customer
region
country
(customer c1 in Region A customers c2, c3 in
Region B)
59
Pivoting
Fact table view
Multi-dimensional cube
60
Query Analysis Tools

Query Building
Report Writers (comparisons, growth, graphs,)
Spreadsheet Systems
Web Interfaces
Data Mining

61
Other Operations

Time functions
e.g., time average
Computed Attributes
e.g., commission sales rate
Text Queries
e.g., find documents with words X AND B
e.g., rank documents by frequency of
words X, Y, Z

62
Implementing a Warehouse

Monitoring Sending data from sources
Integrating Loading, cleansing,...
Processing Query processing, indexing, ...
Managing Metadata, Design, ...

63
Monitoring

Source Types relational, flat file, IMS, VSAM,
IDMS, WWW, news-wire,
Incremental vs. Refresh

64
Monitoring Techniques

Periodic snapshots
Database triggers
Log shipping
Data shipping (replication service)
Transaction shipping
Polling (queries to source)
Screen scraping
Application level monitoring

è Advantages Disadvantages!!
65
Monitoring Issues

Frequency
periodic daily, weekly,
triggered on big change, lots of changes, ...
Data transformation
convert data to uniform format
remove add fields (e.g., add date to get
history)
Standards (e.g., ODBC)
Gateways

66
Integration

Data Cleaning
Data Loading
Derived Data

67
Data Cleaning

Migration (e.g., yen ð dollars)
Scrubbing use domain-specific knowledge (e.g.,
social security numbers)
Fusion (e.g., mail list, customer merging)
Auditing discover rules relationships(like
data mining)

68
Loading Data

Incremental vs. refresh
Off-line vs. on-line
Frequency of loading
At night, 1x a week/month, continuously
Parallel/Partitioned load

69
Derived Data

Derived Warehouse Data
indexes
aggregates
materialized views (next slide)
When to update derived data?
Incremental vs. refresh

70
Materialized Views

Define new warehouse relations using SQL
expressions

71
Processing

ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms

72
ROLAP Server

Relational OLAP Server

tools
Special indices, tuning Schema is denormalized
73
MOLAP Server

Multi-Dimensional OLAP Server

M.D. tools
multi-dimensional server
could also sit on relational DBMS
74
Index Structures

Traditional Access Methods
B-trees, hash tables, R-trees, grids,
Popular in Warehouses
inverted lists
bit map indexes
join indexes
text indexes

75
Inverted Lists
. . .
data records
inverted lists
age index
76
Using Inverted Lists

Query
Get people with age 20 and name fred
List for age 20 r4, r18, r34, r35
List for name fred r18, r52
Answer is intersection r18

77
Bit Maps
. . .
age index
data records
bit maps
78
Using Bit Maps

Query
Get people with age 20 and name fred
List for age 20 1101100000
List for name fred 0100000001
Answer is intersection 010000000000

Good if domain cardinality small
Bit vectors can be compressed

79
Join

Combine SALE, PRODUCT relations
In SQL SELECT FROM SALE, PRODUCT

80
What to Materialize?

Store in warehouse results useful for common
queries
Example

total sales
day 2
. . .
day 1
129
materialize
81
Materialization Factors

Type/frequency of queries
Query response time
Storage cost
Update cost

82
Cube Aggregates Lattice
129
all
city
product
date
city, product
city, date
product, date
use greedy algorithm to decide what to materialize
city, product, date
83
Dimension Hierarchies
all
state
city
84
Dimension Hierarchies
all
product
city
date
product, date
city, product
city, date
state
city, product, date
state, date
state, product
state, product, date
not all arcs shown...
85
Interesting Hierarchy
all
years
weeks
quarters
conceptual dimension table
months
days
86
Managing