Multidimensional Data Technology MDD - PowerPoint PPT Presentation

1 / 28

About This Presentation

Title:

Multidimensional Data Technology MDD

Description:

Let's assume 10 years of daily data, 200 stores, and 50000 products. ... MDDs are capable of providing stunning query performance, which is mostly a ... – PowerPoint PPT presentation

Number of Views:37

Avg rating:3.0/5.0

Slides: 29

Provided by: reu1

Category:

more less

Transcript and Presenter's Notes

Title: Multidimensional Data Technology MDD

1
Multidimensional Data Technology (MDD)
2
Dimension tables
Fact table
day product branch units,
A tuple at Fact Table describes a single atomic
measure

A simplified star schema with three dimensions.
Lets assume 10 years of daily data, 200 stores,
and 50000 products. The fact table is
36,500,000,000 records long.
But where is the aggregated data?

3
Aggregated data
Atomic data
Days/ weeks/ months/ quarters/ years
Summary Tables
Fact Table
4

Relational systems are being used for
applications requiring storing, updating and
retrieval In fact, relational DBMS were never
intended to provide the very powerful functions
for data synthesis, analysis, and consolidation
that is being defined as multi-dimensional data
analysis. These types of functions were always
intended to be provided by separate, enduser
tools that were outside and complementary to the
relational DBMS products.
E.F.Codd, S.B.Codd, C.T.Salley. Beyond Decision
Support, Computerworld, 27(30), July 1993
(E.F. Codd is the founder of relational databases)

5
Aggregation in relational DBs

Aggregation (summarization) is the key factor in
performance in business intelligence and decision
support systems
Limited availability of information persists due
to poor aggregation performance
Partial or outdated information results in lost
productivity, lost revenue, user dissatisfaction,
etc.

"The single most dramatic way to affect
performance in a large data warehouse is to
provide a proper set of aggregates in some cases
speeding queries by a factor of 100 or even 1000.
No other means exist to harvest such spectacular
gains." Ralph Kimball, "The Data Warehouse
Lifecycle Toolkit, 1998 (one of the founders of
the data warehouse industry)
6
MDD technology

Known as multidimensional OLAP
Faster than relational aggregation
Limited in amount of data and dimensions
External to DW

7
The source The Case for Relational OLAP,
MicroStartegy Inc.
Atomicity (Gigabytes)
1000
x
Promotion Analysis DSS
Retail Merchant DSS
x
100
Relational aggregation
Banking Profit DSS
x
x
Insurance Policy DSS
10
Bank Credit Scoring DSS
x
x
Financials DSS
x
Utility Task DSS
MDD
Dimensionality
10
100
1000
OLAP/ROLAP
8
MDD external to DW
Data star scheme
Data MDD structure
Source
Operational data
Data Extract Data Cleanup Data Load
OLAP
External data
e.g. load 3h, aggregation 10h
9

Rather than storing information as records, and
records in tables, MDDs (logically) store data in
arrays
MDDs have no standard access method (such as SQL)
or API's
MDDs are capable of providing stunning query
performance, which is mostly a function of
anticipating the manner in which data will be
accessed.
Because information in a MDD is stored is much
coarser grain than a RDB, the index is much
smaller and is usually resident in memory. Once
the in-memory index is scanned (in a blink of the
eye), a few pages are drawn from the database.
Another interesting aspect of MDDs is that
information is stored in arrays. This means that
values in the arrays can be updated without
affecting the index. This is the reason that MDDs
are so nicely suited for read-write applications.

10
The naive approach to MDD
Measure (units, )
Day1 2 3 4
Store 1
GEOGR 3d
Store 2
PRODUCTS 2d
TIME (1st Dimension)
Store 5
11
Queries

How many bottles of Aniseed Syrup did we sale in
the Lexington store the first day?
How many bottles of Aniseed Syrup did we sale in
Lexington store the last 4 days?
How many bottles of Aniseed Syrup did we sell
last 4 days?

Day1 2 3 4
Prod 4
Aniseed Syrup
Store 1
Prod 2
Prod 1
Prod 4
Aniseed Syrup
Lexington Store
Prod 2
Units sold
Prod 1
Prod 4
Aniseed Syrup
Store 3
Prod 2
Prod 1
12
Storage
D3 Block Pointer
Index
Data
13
Day1 2 3 4
Prod 4
Prod 3
Store 1
Prod 2
Prod 1
.
Prod 4
.
Data
INDn1
Data
INDn2
Prod 3
Store 2
Prod 2
.
Prod 1
Prod 4
Prod 3
Store 3
Prod 2
DATA FILE
Prod 1
SPARSE DATA
14
An example of indexing sparse data
.
n
Block Pointer
IND0
INDn
.
Start
End
Size
Data
INDn1
Data
INDn2
Index File
.
Data File
15

Three types of multidimensional data in OLAP
applications
Data from external sources
Detailed (atomic) data
Pre-aggregated data
On-the-fly aggregated data

16
Picking the optimum amount of pre-calculation. In
this case, pre-calculating a well-chosen set of
about a third of all possible results will
deliver a good balance of build time, database
size and responsiveness.
17
Dimension Hierarchies
Store Dimension
Product Dimension
Total
Total
Region
Manufacturer
District
Brand
Stores
Products
18
Two hierarchical dimensions. The white cells
represent potential detail items, the pale green
are first level consolidations and the pale blue
are second level consolidations. The darker green
and blue cells are, respectively, the cross
products of first and second level
consolidations, and the single yellow cell is the
cross of the second level consolidations.
19
(No Transcript)
20
The concentrating effect of sparsity and
multidimensionality.
The red dots indicate that a cell is populated,
and each of the frames indicates the percentage
of the detail cells that have been randomly
populated, and the resulting compound growth
factor (CGF). Notice how much more densely
populated are the derived/consolidated (colored)
areas, particularly with very sparse detailed
data. The higher level consolidations (shown in
darker colors) become almost fully populated even
with very sparse detailed data, which leads to
higher compound growth factors (CGFs) with
sparser data.
21
The compound growth factor falls as input data
density rises. Typical multidimensional data is
very sparse, and the CGF can be well over 2.0,
particularly if there are more than six
dimensions. The data shown in this chart was
obtained by running 120 simulations with randomly
generated data cells in a larger version of the
test model. The blue dots are the actual CGF
values observed, and the red line is a
logarithmic curve fit.
22
Density increases rapidly at higher consolidation
levels. These results, based upon 60 statistical
simulations, show data densities at base level,
and four consolidation levels note the
logarithmic density scale. However sparse the
base data, by the third consolidation level, the
data is almost 100 percent dense.
23
Database explosion. This graph shows the ratio
(not percentages) of fully calculated database
size compared to input data size, depending on
the number of sparse dimensions.
24
Sparce structures
Data in multidimensional applications tends to be
clustered Into relatively dense blocks, with
large gaps in between

Hypercube - single-cube logical structure. All
parts of the data space have identical
dimensionality.
In relational applications it uses a single fact
table star schema
Multicube - database is segmented into a set of
multidimensional structures each of which is
composed of a subset of the overall number of
dimensions in the database.
Block multicubes use orthogonal dimensions, so
there are no special dimensions at the data level
Series multicubes treat each variable as a
separate cube, with its own set of other
dimensions.

Designing a hypercube model is a top-down
process, with three major steps.
First, you decide what aspect, or process of the
business you will capture in the model, such as
sales activity or claims processing.
Next, identify the values that you want capture,
such as sales amounts or elements of costs. This
information is almost always numeric.
Last, identify the granularity of the data, that
is, the lowest level of detail at which you will
capture it. These elements are the dimensions.
Common dimensions are measure, time, scenario,
geography, product and customer.

A drawback of this "positional" architecture is
that even minor changes in the dimensional
structure require a complete reorganization of
the database.
Another drawback in the hypercube model is that
every value shares the same dimensionality.
For example, we may use payment amount by
customer, and it is redundant to store it
repeatedly by geography.

Multicubes, uses a more elegant solution. Though
implementations vary across products, multicubes
dimension each variable separately and deal
internally with the consequences. The downside is
that these approaches are less straightforward
and carry steeper learning curves.

28
The chart shows the ratio of the final database
size and the input data volume a figure of 1.0
would mean that there was no database growth
through pre-calculations or indexing, whereas a
figure of 10.0 means ten-fold growth. They are
all published results from audited runs of the
OLAP Council's APB-1 benchmark. All four
performed the same set of analyses on the same
volume of input data (approximately 35Mb), and
all operated in MOLAP mode. This not only shows
the amazing differences between superficially
similar products, but also that MOLAP products
can both explode and implode databases (which
proves the point that database explosion is
nothing to do with ROLAP vs MOLAP architectures).
It also shows how much better Essbase 5 was at
handling database explosion than Essbase 4.1,
even though both were pure MOLAPs.

Write a Comment

User Comments (0)