Multidimensional Data Technology MDD - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

Multidimensional Data Technology MDD

Description:

Let's assume 10 years of daily data, 200 stores, and 50000 products. ... MDDs are capable of providing stunning query performance, which is mostly a ... – PowerPoint PPT presentation

Number of Views:37
Avg rating:3.0/5.0
Slides: 29
Provided by: reu1
Category:

less

Transcript and Presenter's Notes

Title: Multidimensional Data Technology MDD


1
Multidimensional Data Technology (MDD)
2
Dimension tables
Fact table
day product branch units,
A tuple at Fact Table describes a single atomic
measure
  • A simplified star schema with three dimensions.
  • Lets assume 10 years of daily data, 200 stores,
    and 50000 products. The fact table is
    36,500,000,000 records long.
  • But where is the aggregated data?

3
Aggregated data
Atomic data
Days/ weeks/ months/ quarters/ years
Summary Tables
Fact Table
4
  • Relational systems are being used for
    applications requiring storing, updating and
    retrieval In fact, relational DBMS were never
    intended to provide the very powerful functions
    for data synthesis, analysis, and consolidation
    that is being defined as multi-dimensional data
    analysis. These types of functions were always
    intended to be provided by separate, enduser
    tools that were outside and complementary to the
    relational DBMS products.
  • E.F.Codd, S.B.Codd, C.T.Salley. Beyond Decision
    Support, Computerworld, 27(30), July 1993
    (E.F. Codd is the founder of relational databases)

5
Aggregation in relational DBs
  • Aggregation (summarization) is the key factor in
    performance in business intelligence and decision
    support systems
  • Limited availability of information persists due
    to poor aggregation performance
  • Partial or outdated information results in lost
    productivity, lost revenue, user dissatisfaction,
    etc.

"The single most dramatic way to affect
performance in a large data warehouse is to
provide a proper set of aggregates in some cases
speeding queries by a factor of 100 or even 1000.
No other means exist to harvest such spectacular
gains." Ralph Kimball, "The Data Warehouse
Lifecycle Toolkit, 1998 (one of the founders of
the data warehouse industry)
6
MDD technology
  • Known as multidimensional OLAP
  • Faster than relational aggregation
  • Limited in amount of data and dimensions
  • External to DW

7
The source The Case for Relational OLAP,
MicroStartegy Inc.
Atomicity (Gigabytes)
1000
x
Promotion Analysis DSS
Retail Merchant DSS
x
100
Relational aggregation
Banking Profit DSS
x
x
Insurance Policy DSS
10
Bank Credit Scoring DSS
x
x
Financials DSS
x
Utility Task DSS
MDD
Dimensionality
10
100
1000
OLAP/ROLAP
8
MDD external to DW
Data star scheme
Data MDD structure
Source
Operational data
Data Extract Data Cleanup Data Load
OLAP
External data
e.g. load 3h, aggregation 10h
9
  • Rather than storing information as records, and
    records in tables, MDDs (logically) store data in
    arrays
  • MDDs have no standard access method (such as SQL)
    or API's
  • MDDs are capable of providing stunning query
    performance, which is mostly a function of
    anticipating the manner in which data will be
    accessed.
  • Because information in a MDD is stored is much
    coarser grain than a RDB, the index is much
    smaller and is usually resident in memory. Once
    the in-memory index is scanned (in a blink of the
    eye), a few pages are drawn from the database.
  • Another interesting aspect of MDDs is that
    information is stored in arrays. This means that
    values in the arrays can be updated without
    affecting the index. This is the reason that MDDs
    are so nicely suited for read-write applications.

10
The naive approach to MDD
Measure (units, )
Day1 2 3 4
Store 1
GEOGR 3d
Store 2
PRODUCTS 2d
TIME (1st Dimension)
Store 5
11
Queries
  • How many bottles of Aniseed Syrup did we sale in
    the Lexington store the first day?
  • How many bottles of Aniseed Syrup did we sale in
    Lexington store the last 4 days?
  • How many bottles of Aniseed Syrup did we sell
    last 4 days?

Day1 2 3 4
Prod 4
Aniseed Syrup
Store 1
Prod 2
Prod 1
Prod 4
Aniseed Syrup
Lexington Store
Prod 2
Units sold
Prod 1
Prod 4
Aniseed Syrup
Store 3
Prod 2
Prod 1
12
Storage
D3 Block Pointer
Index
Data
13
Day1 2 3 4
Prod 4
Prod 3
Store 1
Prod 2
Prod 1
.
Prod 4
.
Data
INDn1
Data
INDn2
Prod 3
Store 2
Prod 2
.
Prod 1
Prod 4
Prod 3
Store 3
Prod 2
DATA FILE
Prod 1
SPARSE DATA
14
An example of indexing sparse data
.
n
Block Pointer
IND0
INDn
.
Start
End
Size
Data
INDn1
Data
INDn2
Index File
.
Data File
15
  • Three types of multidimensional data in OLAP
    applications
  • Data from external sources
  • Detailed (atomic) data
  • Pre-aggregated data
  • On-the-fly aggregated data

16
Picking the optimum amount of pre-calculation. In
this case, pre-calculating a well-chosen set of
about a third of all possible results will
deliver a good balance of build time, database
size and responsiveness.
17
Dimension Hierarchies
Store Dimension
Product Dimension
Total
Total
Region
Manufacturer
District
Brand
Stores
Products
18
Two hierarchical dimensions. The white cells
represent potential detail items, the pale green
are first level consolidations and the pale blue
are second level consolidations. The darker green
and blue cells are, respectively, the cross
products of first and second level
consolidations, and the single yellow cell is the
cross of the second level consolidations.
19
(No Transcript)
20
The concentrating effect of sparsity and
multidimensionality.
The red dots indicate that a cell is populated,
and each of the frames indicates the percentage
of the detail cells that have been randomly
populated, and the resulting compound growth
factor (CGF). Notice how much more densely
populated are the derived/consolidated (colored)
areas, particularly with very sparse detailed
data. The higher level consolidations (shown in
darker colors) become almost fully populated even
with very sparse detailed data, which leads to
higher compound growth factors (CGFs) with
sparser data.
21
The compound growth factor falls as input data
density rises. Typical multidimensional data is
very sparse, and the CGF can be well over 2.0,
particularly if there are more than six
dimensions. The data shown in this chart was
obtained by running 120 simulations with randomly
generated data cells in a larger version of the
test model. The blue dots are the actual CGF
values observed, and the red line is a
logarithmic curve fit.
22
Density increases rapidly at higher consolidation
levels. These results, based upon 60 statistical
simulations, show data densities at base level,
and four consolidation levels note the
logarithmic density scale. However sparse the
base data, by the third consolidation level, the
data is almost 100 percent dense.
23
Database explosion. This graph shows the ratio
(not percentages) of fully calculated database
size compared to input data size, depending on
the number of sparse dimensions.
24
Sparce structures
Data in multidimensional applications tends to be
clustered Into relatively dense blocks, with
large gaps in between
  • Hypercube - single-cube logical structure. All
    parts of the data space have identical
    dimensionality.
  • In relational applications it uses a single fact
    table star schema
  • Multicube - database is segmented into a set of
    multidimensional structures each of which is
    composed of a subset of the overall number of
    dimensions in the database.
  • Block multicubes use orthogonal dimensions, so
    there are no special dimensions at the data level
  • Series multicubes treat each variable as a
    separate cube, with its own set of other
    dimensions.

25
  • Designing a hypercube model is a top-down
    process, with three major steps.
  • First, you decide what aspect, or process of the
    business you will capture in the model, such as
    sales activity or claims processing.
  • Next, identify the values that you want capture,
    such as sales amounts or elements of costs. This
    information is almost always numeric.
  • Last, identify the granularity of the data, that
    is, the lowest level of detail at which you will
    capture it. These elements are the dimensions.
  • Common dimensions are measure, time, scenario,
    geography, product and customer.

26
  • A drawback of this "positional" architecture is
    that even minor changes in the dimensional
    structure require a complete reorganization of
    the database.
  • Another drawback in the hypercube model is that
    every value shares the same dimensionality.
  • For example, we may use payment amount by
    customer, and it is redundant to store it
    repeatedly by geography.

27
  • Multicubes, uses a more elegant solution. Though
    implementations vary across products, multicubes
    dimension each variable separately and deal
    internally with the consequences. The downside is
    that these approaches are less straightforward
    and carry steeper learning curves.

28
The chart shows the ratio of the final database
size and the input data volume a figure of 1.0
would mean that there was no database growth
through pre-calculations or indexing, whereas a
figure of 10.0 means ten-fold growth. They are
all published results from audited runs of the
OLAP Council's APB-1 benchmark. All four
performed the same set of analyses on the same
volume of input data (approximately 35Mb), and
all operated in MOLAP mode. This not only shows
the amazing differences between superficially
similar products, but also that MOLAP products
can both explode and implode databases (which
proves the point that database explosion is
nothing to do with ROLAP vs MOLAP architectures).
It also shows how much better Essbase 5 was at
handling database explosion than Essbase 4.1,
even though both were pure MOLAPs.
Write a Comment
User Comments (0)
About PowerShow.com