Title: Basic data structure in Data Warehouse
1Basic data structure in Data Warehouse
- Level of aggregation
- Data stored in OLTP
- Making the database match data warehouse
- The star schema
Dr. Reuven Bakalash
2- Data Warehouse is way of storing data for later
retrieval. This retrieval is used to support
decision-making - DW is a storage mechanism for read-only,
historical, aggregated data - Read-only the user cant change the data
- Historical usually the data is at least a day
old (in OLTP data is hold as long as it is
current or active) - All the data is aggregated to a certain point of
time
3Data aggregated to daily totals
Data as found in standard OLTP
4OLTP systems are standard, normalized
databases.OLTP systems are optimized for insert,
updates, and deletes (transactions)OLTP
systems achieve greater speed by minimizing
repeated data (normalization), and by limiting
the number of indexes.Indexes slow down
inserts, updates, and deletes. Therefore, we use
just enough indexes to get by. Over-indexing
significantly decreases performance.Minimization
of repeated data. An example
5The data behind the order looks very different
For each order we are repeating a lot of
information. Instead, we use relational
technology to tie each detail to the heading
record. The new detail records should look like
this
6OrderID
OrderDate
Product1
Quantity1
Product2
Quantity2
. . .
An order table like this has a limit of product
sets. If we make room for 20 sets, we cant
handle that order for 21 products. If and order
have just one product, we have wasted
space. Therefore we break those repeated fields
into a separate table
Now, the order can have any number of products.
7A simplified logical view of the tables looks
like this
No extended cost for each record in the
OrderDetail table. We store minimum data to speed
inserts, updates, and deletes. Any number that
can be calculated is calculated and not stored.
8- Advantages
- Minimized data entry. For each record we enter
the PK from two tables and add the quantity - Reduced OrderDetail record.Normalized table
structure takes up much less space. It helps
speed inserts, updates, and deletes - Most of the fields that link to other tables are
numeric, allowing better performance that with
text fields. Numeric fields also index faster and
more efficiently - With each transaction indexes are updated along
with the table. This overhead significantly
decrease the performance. With normalization
there are fewer indexes per table.
9- In Data Warehouse we retrieve data for analysis.
Therefore DW utilizes joins and queries multiple
tables to get the data. Joins are much slower
than reading from a single table. So we want to
minimize the number of tables - The fewer indexes, the slower data retrieval, the
slower select queries run. In DW we need fast
retrieval by any attribute, therefore abundance
of secondary indexes help. -
- Conclusion, two separate database structures are
needed - An OLTP system for transactions
- And an OLAP system for data retrieval
10Which view of the data is easier for the end-user
to understand?
11We had to use joins to create this table. We can
create a VIEW that looks like this, but we still
use joins in the background, and therefore not
achieve the best performance.The solution to
put all the required data in one table. Then
queries will be performed without joins.
The second view is much easier to understand.
12How we view information
- How many bottles of Aniseed Syrup did we sell
last week? - Are overall sales of Condiments up or down this
year compared to previous years? - On a quarterly and then monthly basis, are Diary
Products sales cyclical? - In what regions are sales down this year compared
to the same period last year? What products in
those regions account for the greatest percentage
of the decrease? - These questions have in common
- Time element
- Aggregated data
- Looking at data in terms of by conditions (by
category, e.g. by quarter) - If we want to view the data as aggregated
numbers broken down along a series of by
criteria, why dont we just store data in this
format?
13The database structure to support DW queries
- Facts and dimensions
- Measures numeric values that are measurable and
additive (e.g., sales dollars) - Dimensions by conditions map into dimensions.
There is always a time dimension. Product and
geography are very common as well. - Building dimensions into tables
- Dimension tables have a single-field primary key.
The full information is stored in the other
fields. These fields do not link to other tables.
The dimension tables are often fat. - Dimension tables are often short (in compare to
fact table). - e.g. 30,000 products in the product dimension
table, sold each day for 10 years. 30,000x365x10
equals 110,000,000 records in the fact table.
14A product dimension table. e.g. X12JC is a
refrigerator, which falls into category of major
appliances, and the sub-category of refrigerators.
All the hierarchies are contained in the
individual dimension tables. No additional tables
are needed to hold dimensional information. This
hierarchy, all in the same table, allows us to
perform a query that performs sums by category.
We can then drill down to calculate sums for
the subcategories.We can then calculate the sums
for the individual products in a particular
subcategory. The actual sums are based on
measures stored in the fact table.
15Consolidated dimensional hierarchies star
schemas
- A simplified example of two dimension tables
joined to the fact table. The fact table has only
one measure SalesDollars.
In order to see the total sales for a particular
month for a particular category, the SQL looks
like this SELECT Sum ( SalesFact.SalesDollars)
AS SumOfSalesDollars FROM TimeDimension JOIN
(ProductDimension JOIN SalesFact ON Product
Dimension.ProductID SalesFact.ProductID) ON
TimeDimension.TimeID SalesFact.TimeID WHERE Pro
ductDimension.CategoryBrass Goods AND
TimeDimension.Month3 AND TimeDimension.Year19
99 To drill down to a subcategory SELECT Sum
( SalesFact.SalesDollars) AS SumOfSalesDollars F
ROM TimeDimension JOIN (ProductDimension JOIN
SalesFact ON Product Dimension.ProductID
SalesFact.ProductID) ON TimeDimension.TimeID
SalesFact.TimeID WHERE ProductDimension.Subcatego
ryWidgets AND TimeDimension.Month3 AND
TimeDimension.Year1999
16- A simplified star schema with three dimensions.
- Lets assume 10 years of daily data, 200 stores,
and 500 products. The fact table is 365,000,000
records long. - The primary keys have numeric data
17Snowflakes schemas
Sometimes, the dimension tables have the
hierarchies broken out into separate tables,
trying to save space. However, in the overall
scheme the dimension tables usually only hold
about 1 of the records. This is more normalized
structure, but leads to more difficult queries
and slower response time.
18Fact granularity
- The granularity (or frequency) of the fact table
is determined by the time dimension. e.g. weekly
or monthly totals. The lower granularity the more
records in the table. - The granularity determines how far you can drill
down without returning to the transaction-level
data - Most OLAP systems have a daily grain
- The fact table is often sparsely populated. We
only store the records that have a value.
19The relationship of a star schema to a report
Time dimension
Product dimension
time_key (FK) Product_key (FK) Store_key
(FK) Promo_key (FK) Dollars Units cost
product_key (PK) SKU description brand category Pa
ckage_type Size flavor
Time_key (PK) SQL_date Day_of_week Week_number mon
th
Store dimension
store_key (PK) Store_ID Store_name address Distric
t region
Promotion dimension
promotion_key (PK) Promotion_name Promotion_type P
rice_treatment Ad_treatment Display_treatment Coup
on_type
Drag And drop
Drag And drop
Drag And drop
Drag And drop
Compute
District Brand Total Total cost Gross
profit Atherton Clean Fast 1,233 1,058 175 Ath
erton Ever More 2,239 2,200
39 Atherton Zippy 848 650 198 Belmont Cle
an Fast 2,097 1,848 249 Belmont Ever
More 2,428 2,350 78 Belmont Zippy 633
580 53