Basic data structure in Data Warehouse - PowerPoint PPT Presentation

1 / 19
About This Presentation
Title:

Basic data structure in Data Warehouse

Description:

Data Warehouse is way of storing data for later retrieval. This retrieval is used to support ... How many bottles of Aniseed Syrup did we sell last week? ... – PowerPoint PPT presentation

Number of Views:67
Avg rating:3.0/5.0
Slides: 20
Provided by: ReuvenB
Category:

less

Transcript and Presenter's Notes

Title: Basic data structure in Data Warehouse


1
Basic data structure in Data Warehouse
  • Level of aggregation
  • Data stored in OLTP
  • Making the database match data warehouse
  • The star schema

Dr. Reuven Bakalash
2
  • Data Warehouse is way of storing data for later
    retrieval. This retrieval is used to support
    decision-making
  • DW is a storage mechanism for read-only,
    historical, aggregated data
  • Read-only the user cant change the data
  • Historical usually the data is at least a day
    old (in OLTP data is hold as long as it is
    current or active)
  • All the data is aggregated to a certain point of
    time

3
Data aggregated to daily totals
Data as found in standard OLTP
4
OLTP systems are standard, normalized
databases.OLTP systems are optimized for insert,
updates, and deletes (transactions)OLTP
systems achieve greater speed by minimizing
repeated data (normalization), and by limiting
the number of indexes.Indexes slow down
inserts, updates, and deletes. Therefore, we use
just enough indexes to get by. Over-indexing
significantly decreases performance.Minimization
of repeated data. An example
5
The data behind the order looks very different
For each order we are repeating a lot of
information. Instead, we use relational
technology to tie each detail to the heading
record. The new detail records should look like
this
6
OrderID
OrderDate
Product1
Quantity1
Product2
Quantity2
. . .
An order table like this has a limit of product
sets. If we make room for 20 sets, we cant
handle that order for 21 products. If and order
have just one product, we have wasted
space. Therefore we break those repeated fields
into a separate table
Now, the order can have any number of products.
7
A simplified logical view of the tables looks
like this
No extended cost for each record in the
OrderDetail table. We store minimum data to speed
inserts, updates, and deletes. Any number that
can be calculated is calculated and not stored.
8
  • Advantages
  • Minimized data entry. For each record we enter
    the PK from two tables and add the quantity
  • Reduced OrderDetail record.Normalized table
    structure takes up much less space. It helps
    speed inserts, updates, and deletes
  • Most of the fields that link to other tables are
    numeric, allowing better performance that with
    text fields. Numeric fields also index faster and
    more efficiently
  • With each transaction indexes are updated along
    with the table. This overhead significantly
    decrease the performance. With normalization
    there are fewer indexes per table.

9
  • In Data Warehouse we retrieve data for analysis.
    Therefore DW utilizes joins and queries multiple
    tables to get the data. Joins are much slower
    than reading from a single table. So we want to
    minimize the number of tables
  • The fewer indexes, the slower data retrieval, the
    slower select queries run. In DW we need fast
    retrieval by any attribute, therefore abundance
    of secondary indexes help.
  • Conclusion, two separate database structures are
    needed
  • An OLTP system for transactions
  • And an OLAP system for data retrieval

10
Which view of the data is easier for the end-user
to understand?
11
We had to use joins to create this table. We can
create a VIEW that looks like this, but we still
use joins in the background, and therefore not
achieve the best performance.The solution to
put all the required data in one table. Then
queries will be performed without joins.
The second view is much easier to understand.
12
How we view information
  • How many bottles of Aniseed Syrup did we sell
    last week?
  • Are overall sales of Condiments up or down this
    year compared to previous years?
  • On a quarterly and then monthly basis, are Diary
    Products sales cyclical?
  • In what regions are sales down this year compared
    to the same period last year? What products in
    those regions account for the greatest percentage
    of the decrease?
  • These questions have in common
  • Time element
  • Aggregated data
  • Looking at data in terms of by conditions (by
    category, e.g. by quarter)
  • If we want to view the data as aggregated
    numbers broken down along a series of by
    criteria, why dont we just store data in this
    format?

13
The database structure to support DW queries
  • Facts and dimensions
  • Measures numeric values that are measurable and
    additive (e.g., sales dollars)
  • Dimensions by conditions map into dimensions.
    There is always a time dimension. Product and
    geography are very common as well.
  • Building dimensions into tables
  • Dimension tables have a single-field primary key.
    The full information is stored in the other
    fields. These fields do not link to other tables.
    The dimension tables are often fat.
  • Dimension tables are often short (in compare to
    fact table).
  • e.g. 30,000 products in the product dimension
    table, sold each day for 10 years. 30,000x365x10
    equals 110,000,000 records in the fact table.

14
A product dimension table. e.g. X12JC is a
refrigerator, which falls into category of major
appliances, and the sub-category of refrigerators.
All the hierarchies are contained in the
individual dimension tables. No additional tables
are needed to hold dimensional information. This
hierarchy, all in the same table, allows us to
perform a query that performs sums by category.
We can then drill down to calculate sums for
the subcategories.We can then calculate the sums
for the individual products in a particular
subcategory. The actual sums are based on
measures stored in the fact table.
15
Consolidated dimensional hierarchies star
schemas
  • A simplified example of two dimension tables
    joined to the fact table. The fact table has only
    one measure SalesDollars.

In order to see the total sales for a particular
month for a particular category, the SQL looks
like this SELECT Sum ( SalesFact.SalesDollars)
AS SumOfSalesDollars FROM TimeDimension JOIN
(ProductDimension JOIN SalesFact ON Product
Dimension.ProductID SalesFact.ProductID) ON
TimeDimension.TimeID SalesFact.TimeID WHERE Pro
ductDimension.CategoryBrass Goods AND
TimeDimension.Month3 AND TimeDimension.Year19
99 To drill down to a subcategory SELECT Sum
( SalesFact.SalesDollars) AS SumOfSalesDollars F
ROM TimeDimension JOIN (ProductDimension JOIN
SalesFact ON Product Dimension.ProductID
SalesFact.ProductID) ON TimeDimension.TimeID
SalesFact.TimeID WHERE ProductDimension.Subcatego
ryWidgets AND TimeDimension.Month3 AND
TimeDimension.Year1999
16
  • A simplified star schema with three dimensions.
  • Lets assume 10 years of daily data, 200 stores,
    and 500 products. The fact table is 365,000,000
    records long.
  • The primary keys have numeric data

17
Snowflakes schemas
Sometimes, the dimension tables have the
hierarchies broken out into separate tables,
trying to save space. However, in the overall
scheme the dimension tables usually only hold
about 1 of the records. This is more normalized
structure, but leads to more difficult queries
and slower response time.
18
Fact granularity
  • The granularity (or frequency) of the fact table
    is determined by the time dimension. e.g. weekly
    or monthly totals. The lower granularity the more
    records in the table.
  • The granularity determines how far you can drill
    down without returning to the transaction-level
    data
  • Most OLAP systems have a daily grain
  • The fact table is often sparsely populated. We
    only store the records that have a value.

19
The relationship of a star schema to a report
Time dimension
Product dimension
time_key (FK) Product_key (FK) Store_key
(FK) Promo_key (FK) Dollars Units cost
product_key (PK) SKU description brand category Pa
ckage_type Size flavor
Time_key (PK) SQL_date Day_of_week Week_number mon
th
Store dimension
store_key (PK) Store_ID Store_name address Distric
t region
Promotion dimension
promotion_key (PK) Promotion_name Promotion_type P
rice_treatment Ad_treatment Display_treatment Coup
on_type
Drag And drop
Drag And drop
Drag And drop
Drag And drop
Compute
District Brand Total Total cost Gross
profit Atherton Clean Fast 1,233 1,058 175 Ath
erton Ever More 2,239 2,200
39 Atherton Zippy 848 650 198 Belmont Cle
an Fast 2,097 1,848 249 Belmont Ever
More 2,428 2,350 78 Belmont Zippy 633
580 53
Write a Comment
User Comments (0)
About PowerShow.com