Title: Lecture 1: Data Warehousing
1Lecture 1 Data Warehousing
- Based on the slides by Jeffrey D. Ullman and
- Hector Garcia-Molina at Stanford University
2Overview
- Traditional database systems are tuned to many,
small, simple queries. - Some new applications use fewer, more
time-consuming, complex queries. - New architectures have been developed to handle
complex analytic queries efficiently.
3The Data Warehouse
- The most common form of data integration.
- Copy sources into a single DB (warehouse) and try
to keep it up-to-date. - Usual method periodic reconstruction of the
warehouse, perhaps overnight. - Frequently essential for analytic queries.
4OLTP
- Most database operations involve On-Line
Transaction Processing (OTLP). - Short, simple, frequent queries and/or
modifications, each involving a small number of
tuples. - Examples Answering queries from a Web interface,
sales at cash registers, selling airline tickets.
5OLAP
- Of increasing importance are On-Line Application
Processing (OLAP) queries. - Few, but complex queries --- may run for hours.
- Queries do not depend on having an absolutely
up-to-date database.
6OLAP Examples
- Amazon analyzes purchases by its customers to
come up with an individual screen with products
of likely interest to the customer. - Analysts at Wal-Mart look for items with
increasing sales in some region.
7Warehouse Architecture
Metadata
8Why a Warehouse?
- Two Approaches
- Query-Driven (Lazy)
- Warehouse (Eager)
9Data Warehouse
- Databases at store branches handle OLTP.
- Local store databases copied to a central
warehouse overnight. - Analysts use the warehouse for OLAP.
10Query-Driven Approach
11Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
- Modify, summarize (store aggregates)
- Add historical information
12Advantages of Query-Driven
- No need to copy data
- less storage
- no need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only query interface needed at sources
- May be less draining on sources
13OLTP vs. OLAP
- OLTP On Line Transaction Processing
- Describes processing at operational sites
- OLAP On Line Analytical Processing
- Describes processing at warehouse
14OLTP vs. OLAP
OLTP
OLAP
- Mostly updates
- Many small transactions
- Mb-Tb of data
- Raw data
- Clerical users
- Up-to-date data
- Consistency, recoverability critical
- Mostly reads
- Queries long, complex
- Gb-Tb of data
- Summarized, consolidated data
- Decision-makers, analysts as users
15Star Schemas
- A star schema is a common organization for data
at a warehouse. It consists of - Fact table a very large accumulation of facts
such as sales. - Often insert-only.
- Dimension tables smaller, generally static
information about the entities involved in the
facts.
16Example Star Schema
- Suppose we want to record in a warehouse
information about every beer sale the bar, the
brand of beer, the drinker who bought the beer,
the day, the time, and the price charged. - The fact table is a relation
- Sales(bar, beer, drinker, day, time, price)
17Example, Continued
- The dimension tables include information about
the bar, beer, and drinker dimensions - Bars(bar, addr, license)
- Beers(beer, manf)
- Drinkers(drinker, addr, phone)
18Visualization Star Schema
Dimension Table (Drinkers)
Dimension Table (Bars)
Dimension Attrs.
Dependent Attrs.
Fact Table - Sales
Dimension Table (Beers)
Dimension Table (etc.)
19Dimensions and Dependent Attributes
- Two classes of fact-table attributes
- Dimension attributes the key of a dimension
table. - Dependent attributes a value determined by the
dimension attributes of the tuple.
20Example Dependent Attribute
- price is the dependent attribute of our example
Sales relation. - It is determined by the combination of dimension
attributes bar, beer, drinker, and the time
(combination of day and time-of-day attributes).
21Approaches to Building Warehouses
- ROLAP relational OLAP Tune a relational
DBMS to support star schemas. - MOLAP multidimensional OLAP Use a
specialized DBMS with a model such as the data
cube.
22ROLAP Techniques
- Bitmap indexes For each key value of a
dimension table (e.g., each beer for relation
Beers) create a bit-vector telling which tuples
of the fact table have that value. - Materialized views Store the answers to several
useful queries (views) in the warehouse itself.
23Typical OLAP Queries
- Often, OLAP queries begin with a star join the
natural join of the fact table with all or most
of the dimension tables. - Example
- SELECT
- FROM Sales, Bars, Beers, Drinkers
- WHERE Sales.bar Bars.bar AND
- Sales.beer Beers.beer AND
- Sales.drinker Drinkers.drinker
24Typical OLAP Queries --- (2)
- The typical OLAP query will
- Start with a star join.
- Select for interesting tuples, based on dimension
data. - Group by one or more dimensions.
- Aggregate certain attributes of the result.
25Example OLAP Query
- For each bar in Palo Alto, find the total sale of
each beer manufactured by Anheuser-Busch. - Filter addr Palo Alto and manf
Anheuser-Busch. - Grouping by bar and beer.
- Aggregation Sum of price.
26Example In SQL
- SELECT bar, beer, SUM(price)
- FROM Sales NATURAL JOIN Bars
- NATURAL JOIN Beers
- WHERE addr Palo Alto AND
- manf Anheuser-Busch
- GROUP BY bar, beer
27Using Materialized Views
- A direct execution of this query from Sales and
the dimension tables could take too long. - If we create a materialized view that contains
enough information, we may be able to answer our
query much faster.
28Example Materialized View
- Which views could help with our query?
- Key issues
- It must join Sales, Bars, and Beers, at least.
- It must group by at least bar and beer.
- It must not select out Palo-Alto bars or
Anheuser-Busch beers. - It must not project out addr or manf.
29Example --- Continued
- Here is a materialized view that could help
- CREATE VIEW BABMS(bar, addr,
- beer, manf, sales) AS
- SELECT bar, addr, beer, manf,
- SUM(price) sales
- FROM Sales NATURAL JOIN Bars
- NATURAL JOIN Beers
- GROUP BY bar, addr, beer, manf
30Example --- Concluded
- Heres our query using the materialized view
BABMS - SELECT bar, beer, sales
- FROM BABMS
- WHERE addr Palo Alto AND
- manf Anheuser-Busch
31MOLAP and Data Cubes
- Keys of dimension tables are the dimensions of a
hypercube. - Example for the Sales data, the four dimensions
are bar, beer, drinker, and time. - Dependent attributes (e.g., price) appear at the
points of the cube.
32Visualization - Data Cubes
beer
price
bar
drinker
33Marginals
- The data cube also includes aggregation
(typically SUM) along the margins of the cube. - The marginals include aggregations over one
dimension, two dimensions,
34Visualization - Data Cube w/ Aggregation
beer
SUM over all Drinkers
price
bar
drinker
35Example Marginals
- Our 4-dimensional Sales cube includes the sum of
price over each bar, each beer, each drinker, and
each time unit (perhaps days). - It would also have the sum of price over all
bar-beer pairs, all bar-drinker-day triples,
36Structure of the Cube
- Think of each dimension as having an additional
value . - A point with one or more s in its coordinates
aggregates over the dimensions with the s. - Example Sales(Joes Bar, Bud, , ) holds
the sum over all drinkers and all time of the Bud
consumed at Joes.
37Drill-Down
- Drill-down de-aggregate break an aggregate
into its constituents. - Example having determined that Joes Bar sells
very few Anheuser-Busch beers, break down his
sales by particular A.-B. beer.
38Roll-Up
- Roll-up aggregate along one or more dimensions.
- Example given a table of how much Bud each
drinker consumes at each bar, roll it up into a
table giving total amount of Bud consumed for
each drinker.
39Roll Up and Drill Down
of Anheuser-Busch by drinker/bar
of A-B / drinker
Jim Bob Mary
Joes Bar 45 33 30
Nut- House 50 36 42
Blue Chalk 38 31 40
Jim Bob Mary
133 100 112
Roll upby Bar
Drill downby Beer
of A-B Beers / drinker
Jim Bob Mary
Bud 40 29 40
Mlob 45 31 37
Bud Light 48 40 35
40Materialized Data-Cube Views
- Data cubes invite materialized views that are
aggregations in one or more dimensions. - Dimensions may not be completely aggregated ---
an option is to group by an attribute of the
dimension table.
41Example
- A materialized view for our Sales data cube
might - Aggregate by drinker completely.
- Not aggregate at all by beer.
- Aggregate by time according to the week.
- Aggregate according to the city of the bar.
42Example
- A materialized view for our Sales data cube
might - Aggregate by drinker completely.
- Not aggregate at all by beer.
- Aggregate by time according to the week.
- Aggregate according to the city of the bar.
43Warehouse Models Operators
- Data Models
- relations
- stars snowflakes
- cubes
- Operators
- slice dice
- roll-up, drill down
- pivoting
- other
44Star
45Star Schema
46Terms
- Fact table
- Dimension tables
- Measures
47Dimension Hierarchies
sType
store
city
region
è snowflake schema è constellations
48Cube
Fact table view
Multi-dimensional cube
dimensions 2
493-D Cube
Multi-dimensional cube
Fact table view
dimensions 3
50ROLAP vs. MOLAP
- ROLAPRelational On-Line Analytical Processing
- MOLAPMulti-Dimensional On-Line Analytical
Processing
51Aggregates
- Add up amounts for day 1
- In SQL SELECT sum(amt) FROM SALE
- WHERE date 1
81
52Aggregates
- Add up amounts by day
- In SQL SELECT date, sum(amt) FROM SALE
- GROUP BY date
53Another Example
- Add up amounts by day, product
- In SQL SELECT date, sum(amt) FROM SALE
- GROUP BY date, prodId
rollup
drill-down
54Aggregates
- Operators sum, count, max, min, median,
ave - Having clause
- Using dimension hierarchy
- average by region (within store)
- maximum by month (within date)
55Cube Aggregation
Example computing sums
day 2
. . .
day 1
129
56Cube Operators
day 2
. . .
day 1
sale(c1,,)
129
sale(c2,p2,)
sale(,,)
57Extended Cube
day 2
sale(,p2,)
day 1
58Aggregation Using Hierarchies
customer
region
country
(customer c1 in Region A customers c2, c3 in
Region B)
59Pivoting
Fact table view
Multi-dimensional cube
60Query Analysis Tools
- Query Building
- Report Writers (comparisons, growth, graphs,)
- Spreadsheet Systems
- Web Interfaces
- Data Mining
61Other Operations
- Time functions
- e.g., time average
- Computed Attributes
- e.g., commission sales rate
- Text Queries
- e.g., find documents with words X AND B
- e.g., rank documents by frequency of
words X, Y, Z
62Implementing a Warehouse
- Monitoring Sending data from sources
- Integrating Loading, cleansing,...
- Processing Query processing, indexing, ...
- Managing Metadata, Design, ...
63Monitoring
- Source Types relational, flat file, IMS, VSAM,
IDMS, WWW, news-wire, - Incremental vs. Refresh
64Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Screen scraping
- Application level monitoring
è Advantages Disadvantages!!
65Monitoring Issues
- Frequency
- periodic daily, weekly,
- triggered on big change, lots of changes, ...
- Data transformation
- convert data to uniform format
- remove add fields (e.g., add date to get
history) - Standards (e.g., ODBC)
- Gateways
66Integration
- Data Cleaning
- Data Loading
- Derived Data
67Data Cleaning
- Migration (e.g., yen ð dollars)
- Scrubbing use domain-specific knowledge (e.g.,
social security numbers) - Fusion (e.g., mail list, customer merging)
- Auditing discover rules relationships(like
data mining)
68Loading Data
- Incremental vs. refresh
- Off-line vs. on-line
- Frequency of loading
- At night, 1x a week/month, continuously
- Parallel/Partitioned load
69Derived Data
- Derived Warehouse Data
- indexes
- aggregates
- materialized views (next slide)
- When to update derived data?
- Incremental vs. refresh
70Materialized Views
- Define new warehouse relations using SQL
expressions
71Processing
- ROLAP servers vs. MOLAP servers
- Index Structures
- What to Materialize?
- Algorithms
72ROLAP Server
tools
Special indices, tuning Schema is denormalized
73MOLAP Server
- Multi-Dimensional OLAP Server
M.D. tools
multi-dimensional server
could also sit on relational DBMS
74Index Structures
- Traditional Access Methods
- B-trees, hash tables, R-trees, grids,
- Popular in Warehouses
- inverted lists
- bit map indexes
- join indexes
- text indexes
75Inverted Lists
. . .
data records
inverted lists
age index
76Using Inverted Lists
- Query
- Get people with age 20 and name fred
- List for age 20 r4, r18, r34, r35
- List for name fred r18, r52
- Answer is intersection r18
77Bit Maps
. . .
age index
data records
bit maps
78Using Bit Maps
- Query
- Get people with age 20 and name fred
- List for age 20 1101100000
- List for name fred 0100000001
- Answer is intersection 010000000000
- Good if domain cardinality small
- Bit vectors can be compressed
79Join
- Combine SALE, PRODUCT relations
- In SQL SELECT FROM SALE, PRODUCT
80What to Materialize?
- Store in warehouse results useful for common
queries - Example
total sales
day 2
. . .
day 1
129
materialize
81Materialization Factors
- Type/frequency of queries
- Query response time
- Storage cost
- Update cost
82Cube Aggregates Lattice
129
all
city
product
date
city, product
city, date
product, date
use greedy algorithm to decide what to materialize
city, product, date
83Dimension Hierarchies
all
state
city
84Dimension Hierarchies
all
product
city
date
product, date
city, product
city, date
state
city, product, date
state, date
state, product
state, product, date
not all arcs shown...
85Interesting Hierarchy
all
years
weeks
quarters
conceptual dimension table
months
days
86Managing
- Metadata
- Warehouse Design
- Tools
87Metadata
- Administrative
- definition of sources, tools, ...
- schemas, dimension hierarchies,
- rules for extraction, cleaning,
- refresh, purging policies
- user profiles, access control, ...
88Metadata
- Business
- business terms definition
- data ownership, charging
- Operational
- data lineage
- data currency (e.g., active, archived, purged)
- use stats, error reports, audit trails
89Design
- What data is needed?
- Where does it come from?
- How to clean data?
- How to represent in warehouse (schema)?
- What to summarize?
- What to materialize?
- What to index?
90Tools
- Development
- design edit schemas, views, scripts, rules,
queries, reports - Planning Analysis
- what-if scenarios (schema changes, refresh
rates), capacity planning - Warehouse Management
- performance monitoring, usage patterns, exception
reporting - System Network Management
- measure traffic (sources, warehouse, clients)
- Workflow Management
- reliable scripts for cleaning analyzing data
91Current State of Industry
- Extraction and integration done off-line
- Usually in large, time-consuming, batches
- Everything copied at warehouse
- Not selective about what is stored
- Query benefit vs storage update cost
- Query optimization aimed at OLTP
- High throughput instead of fast response
- Process whole query before displaying anything
92Future Directions
- Better performance
- Larger warehouses
- Easier to use
- What are companies research labs working on?
93Conclusions
- Massive amounts of data and complexity of queries
will push limits of current warehouses - Need better systems
- easier to use
- provide quality information