Title: Multidimensional Data Technology MDD
1Multidimensional Data Technology (MDD)
2Dimension tables
Fact table
day product branch units,
A tuple at Fact Table describes a single atomic
measure
- A simplified star schema with three dimensions.
- Lets assume 10 years of daily data, 200 stores,
and 50000 products. The fact table is
36,500,000,000 records long. - But where is the aggregated data?
3Aggregated data
Atomic data
Days/ weeks/ months/ quarters/ years
Summary Tables
Fact Table
4- Relational systems are being used for
applications requiring storing, updating and
retrieval In fact, relational DBMS were never
intended to provide the very powerful functions
for data synthesis, analysis, and consolidation
that is being defined as multi-dimensional data
analysis. These types of functions were always
intended to be provided by separate, enduser
tools that were outside and complementary to the
relational DBMS products. - E.F.Codd, S.B.Codd, C.T.Salley. Beyond Decision
Support, Computerworld, 27(30), July 1993
(E.F. Codd is the founder of relational databases)
5Aggregation in relational DBs
- Aggregation (summarization) is the key factor in
performance in business intelligence and decision
support systems - Limited availability of information persists due
to poor aggregation performance - Partial or outdated information results in lost
productivity, lost revenue, user dissatisfaction,
etc.
"The single most dramatic way to affect
performance in a large data warehouse is to
provide a proper set of aggregates in some cases
speeding queries by a factor of 100 or even 1000.
No other means exist to harvest such spectacular
gains." Ralph Kimball, "The Data Warehouse
Lifecycle Toolkit, 1998 (one of the founders of
the data warehouse industry)
6MDD technology
- Known as multidimensional OLAP
- Faster than relational aggregation
- Limited in amount of data and dimensions
- External to DW
7The source The Case for Relational OLAP,
MicroStartegy Inc.
Atomicity (Gigabytes)
1000
x
Promotion Analysis DSS
Retail Merchant DSS
x
100
Relational aggregation
Banking Profit DSS
x
x
Insurance Policy DSS
10
Bank Credit Scoring DSS
x
x
Financials DSS
x
Utility Task DSS
MDD
Dimensionality
10
100
1000
OLAP/ROLAP
8MDD external to DW
Data star scheme
Data MDD structure
Source
Operational data
Data Extract Data Cleanup Data Load
OLAP
External data
e.g. load 3h, aggregation 10h
9- Rather than storing information as records, and
records in tables, MDDs (logically) store data in
arrays - MDDs have no standard access method (such as SQL)
or API's - MDDs are capable of providing stunning query
performance, which is mostly a function of
anticipating the manner in which data will be
accessed. - Because information in a MDD is stored is much
coarser grain than a RDB, the index is much
smaller and is usually resident in memory. Once
the in-memory index is scanned (in a blink of the
eye), a few pages are drawn from the database. - Another interesting aspect of MDDs is that
information is stored in arrays. This means that
values in the arrays can be updated without
affecting the index. This is the reason that MDDs
are so nicely suited for read-write applications.
10The naive approach to MDD
Measure (units, )
Day1 2 3 4
Store 1
GEOGR 3d
Store 2
PRODUCTS 2d
TIME (1st Dimension)
Store 5
11Queries
- How many bottles of Aniseed Syrup did we sale in
the Lexington store the first day? - How many bottles of Aniseed Syrup did we sale in
Lexington store the last 4 days? - How many bottles of Aniseed Syrup did we sell
last 4 days?
Day1 2 3 4
Prod 4
Aniseed Syrup
Store 1
Prod 2
Prod 1
Prod 4
Aniseed Syrup
Lexington Store
Prod 2
Units sold
Prod 1
Prod 4
Aniseed Syrup
Store 3
Prod 2
Prod 1
12Storage
D3 Block Pointer
Index
Data
13Day1 2 3 4
Prod 4
Prod 3
Store 1
Prod 2
Prod 1
.
Prod 4
.
Data
INDn1
Data
INDn2
Prod 3
Store 2
Prod 2
.
Prod 1
Prod 4
Prod 3
Store 3
Prod 2
DATA FILE
Prod 1
SPARSE DATA
14An example of indexing sparse data
.
n
Block Pointer
IND0
INDn
.
Start
End
Size
Data
INDn1
Data
INDn2
Index File
.
Data File
15- Three types of multidimensional data in OLAP
applications - Data from external sources
- Detailed (atomic) data
- Pre-aggregated data
- On-the-fly aggregated data
16Picking the optimum amount of pre-calculation. In
this case, pre-calculating a well-chosen set of
about a third of all possible results will
deliver a good balance of build time, database
size and responsiveness.
17Dimension Hierarchies
Store Dimension
Product Dimension
Total
Total
Region
Manufacturer
District
Brand
Stores
Products
18Two hierarchical dimensions. The white cells
represent potential detail items, the pale green
are first level consolidations and the pale blue
are second level consolidations. The darker green
and blue cells are, respectively, the cross
products of first and second level
consolidations, and the single yellow cell is the
cross of the second level consolidations.
19(No Transcript)
20The concentrating effect of sparsity and
multidimensionality.
The red dots indicate that a cell is populated,
and each of the frames indicates the percentage
of the detail cells that have been randomly
populated, and the resulting compound growth
factor (CGF). Notice how much more densely
populated are the derived/consolidated (colored)
areas, particularly with very sparse detailed
data. The higher level consolidations (shown in
darker colors) become almost fully populated even
with very sparse detailed data, which leads to
higher compound growth factors (CGFs) with
sparser data.
21The compound growth factor falls as input data
density rises. Typical multidimensional data is
very sparse, and the CGF can be well over 2.0,
particularly if there are more than six
dimensions. The data shown in this chart was
obtained by running 120 simulations with randomly
generated data cells in a larger version of the
test model. The blue dots are the actual CGF
values observed, and the red line is a
logarithmic curve fit.
22Density increases rapidly at higher consolidation
levels. These results, based upon 60 statistical
simulations, show data densities at base level,
and four consolidation levels note the
logarithmic density scale. However sparse the
base data, by the third consolidation level, the
data is almost 100 percent dense.
23Database explosion. This graph shows the ratio
(not percentages) of fully calculated database
size compared to input data size, depending on
the number of sparse dimensions.
24Sparce structures
Data in multidimensional applications tends to be
clustered Into relatively dense blocks, with
large gaps in between
- Hypercube - single-cube logical structure. All
parts of the data space have identical
dimensionality. - In relational applications it uses a single fact
table star schema - Multicube - database is segmented into a set of
multidimensional structures each of which is
composed of a subset of the overall number of
dimensions in the database. - Block multicubes use orthogonal dimensions, so
there are no special dimensions at the data level
- Series multicubes treat each variable as a
separate cube, with its own set of other
dimensions.
25- Designing a hypercube model is a top-down
process, with three major steps. - First, you decide what aspect, or process of the
business you will capture in the model, such as
sales activity or claims processing. - Next, identify the values that you want capture,
such as sales amounts or elements of costs. This
information is almost always numeric. - Last, identify the granularity of the data, that
is, the lowest level of detail at which you will
capture it. These elements are the dimensions. - Common dimensions are measure, time, scenario,
geography, product and customer.
26- A drawback of this "positional" architecture is
that even minor changes in the dimensional
structure require a complete reorganization of
the database. - Another drawback in the hypercube model is that
every value shares the same dimensionality. - For example, we may use payment amount by
customer, and it is redundant to store it
repeatedly by geography.
27- Multicubes, uses a more elegant solution. Though
implementations vary across products, multicubes
dimension each variable separately and deal
internally with the consequences. The downside is
that these approaches are less straightforward
and carry steeper learning curves.
28The chart shows the ratio of the final database
size and the input data volume a figure of 1.0
would mean that there was no database growth
through pre-calculations or indexing, whereas a
figure of 10.0 means ten-fold growth. They are
all published results from audited runs of the
OLAP Council's APB-1 benchmark. All four
performed the same set of analyses on the same
volume of input data (approximately 35Mb), and
all operated in MOLAP mode. This not only shows
the amazing differences between superficially
similar products, but also that MOLAP products
can both explode and implode databases (which
proves the point that database explosion is
nothing to do with ROLAP vs MOLAP architectures).
It also shows how much better Essbase 5 was at
handling database explosion than Essbase 4.1,
even though both were pure MOLAPs.