Title: OnLine Analytical Processing on Large Databases Managed by Computational Grids
1On-Line Analytical Processing on Large Databases
Managed by Computational Grids
- A Min Tjoa2 , Peter Brezany1 , Bernhard Fiser1
- Umut Onan1 , Ibrahim Elsayed1
- 1Institute for Software Science, University of
Vienna, Austria - brezany,fiser,onan,elsayed_at_par.univie.ac.at
- 2Institute of Software Technology and Interactive
Systems - Vienna University of Technology, Austria
- tjoa_at_ifs.tuwien.ac.at
Globe 2004, Zaragoza, Spain
2Outline
- Motivation
- Knowledge Discovery in Grid Datasets
- GridMiner Project in Vienna
- OLAP Background Information
- Storage Management Structures
- OLAP Services
- Centralized Service
- Distributed Service
- Federated Service
- Caching
- Future Work
- Conclusions
3Motivation
Business
Medicine
Scientific experiments
Data and data exploration
cloud
Simulations
Earth observations
4Stages of a Data Exploration Project
Time to
Importance complete to
success (percent of total) (percent of
total)
Based on Data Preparation for Data Mining, by
Dorian Pyle, Morgan Kaufmann
- Exploring the problem 10 15
- Exploring the solution 9 20 14 80
- Implementation specification 1 51
- Knowledge discovery
- a. Data preparation 60 15
-
- b. Data surveying 15 3
- c. Data modeling 5 2
80
20
5The Knowledge Discovery Process
Knowledge
OLAP Queries
OLAP
Online Analytical Mining
Evaluation and Presentation
Data Mining
Selection and Transformation
Data Warehouse
Cleaning and Integration
6The GridMiner Project in Vienna
- GridMiner A knowledge discovery Grid
infrastructure (http//www.gridminer.org/) - OGSA-based architecture
- Workflow management
- Grid-aware data preprocessing and data mining
services - Data mediation service
- OLAP service
- GUI
- Current Implementation on top of Globus Toolkit
3.0 - Application Management of patients with
traumatic brain injuries
7GridMiner Architecture
GridMiner Workflow
GM DSCE Dynamic Service Control
GridMiner Core
GMPPS Preprocessing
GMDMS Data Mining
GMPRS Presentation
GMDT Transformation
GMOMS OLAM
GridMiner Base
GMMS Mediation
GMIS Information
GMRB Resource Broker
GMCMS OLAP / Cubes
Grid Core
Grid Core Services
Security
File and Database Access Service
Replica Management
Basic Grid Services
Fabric
Grid Resources
Data Source
8Collaboration of GM-Services
Simple Scenario
GMPPS Preprocessing
GMDMS Data Mining
GMDIS Integration
GMPRS Presentation
Intermediate Result 1
Intermediate Result 2 (e.g. flat table)
Intermediate Result 3 (e.g. PMML)
Final Result
Data Sources
9Collaboration (2)
Complex Scenarios
GMPPS
GMDIS
GMPPS
GMDMS
GMPRS
GMPPS
GMDMS
GMPRS
GMPPS
GMPPS
GMDMS
GMPRS
GMDMS
GMPRS
GMPPS
GMDIS
GMCMS
GMOMS
GMPRS
GMPPS
GMPPS
10Workflow Models
Static workflows
Dynamic workflows
11Dynamic Workflows
- Dynamic Service Control Language (DSCL)
- based on XML
- easy to use
- Dynamic Service Control Engine (DSCE)
- processes the workflow according to DSCL
DSCL
DSCE
Service A
Service B
Service D
Service C
12DSCL Control Flow
Automatic conversion
Users view
dscl
variables
composition
sequence
createService activityIDact1
parallel
invoke activityIDact2.1
invoke activityIDact2.2
sequence
13Grid Database Access III
14Grid Database Access With OGSA-DAI
GDS gets a query via Perform Document GDS Engine
process specified activities GDS returns results
15Grid Database Access With OGSA-DAI
16Grid Data Mediation Service - Architecture
17GDMS Example Scenario
- Heterogeneities
- Name in A is Alexander Wöhrer
- Name in C has to be combined
- Distribution
- 3 data sources
18GDMS Mapping Schema
- Builds together with the mediated schema (user
view) the main configuration of the GDMS - One Mapping Schema for each Table in the Virtual
Data Source (VDS)
19Data Mediation - Future Work
- Easier Creation, Usage and Maintainability with
the help of Ontologies - Better Performance
- Evolve to a distributed mediaton service
20OLAP
Basic Concepts
21Relational (ROLAP) vs. Multi-Dimensional Model
(MOLAP)
Model Mini Van, Coupe, Sedan
Colour Red, Blue, White, Green
Two-Dimensional Model
22ROLAP vs. MOLAP
Model Year Color Sales Chevy 1990 Blue 87 Chevy 1
990 Red 5 Chevy 1990 ALL 92 Chevy ALL Blue 87 Chev
y ALL Red 5 Chevy ALL ALL 92 Ford 1990 Blue 99 For
d 1990 Green 64 Ford 1990 ALL 163 Ford
1991 Blue 7 Ford 1991 Red 8 Ford 1991 ALL 15 Ford
ALL Blue 106 Ford ALL Green 64 Ford ALL Red 8 All
1990 Blue 186 All 1990 Gree 64 ALL 1991 Blue 7 AL
L 1991 Red 8 Ford ALL ALL 178 ALL 1990 ALL 255 AL
L 1991 ALL 15 ALL ALL Blue 193 ALL ALL Green 64 AL
L ALL Red 13 ALL ALL ALL 270
SELECT Model, Year, Color, SUM(sales) AS
Sales FROM Sales WHERE Model IN Ford,
Chevy AND Year BETWEEN 1990
AND 1992 GROUP BY CUBE (Model, Year, Color)
Model Year Color Sales Chevy 1990 Red 5 Chevy 199
0 Blue 87 Ford 1990 Green 64 Ford 1990 Blue 99 For
d 1991 Red 8 Ford 1991 Blue 7
?
23ROLAP vs. MOLAP
Cross Tab
Data Cube
24ROLAP vs. MOLAP
- The ROLAPs summary records are stored directly
int standard relational tables, without any need
for data conversion. A complex analytical query
is cumbersome to express in SQL and it might not
be efficient to execute. - The array-based model, MOLAP (Multi-Dimensional
OLAP), has the advantage that native arrays
provide an immediate form of indexing for cube
queries.
25MOLAP
- Each value of the attribute associated with the
dimension determines a position of the dimension. - The observed values, called measures are located
at the intersections of the dimension positions.
The intersections are called cells and are
populated with measures. - Dimensional indexing a unique integer value is
assigned to each possible dimension position, and
the corresponding mapping information is stored
in an appropriate index structure.
26Towards Grid OLAP
Research Objectives High-Performance
Grid OLAP Services
27Requirements
- Operation on large data sets
- Centralized OLAP Service (parallel computing
power can be included) - Distributed OLAP service
- Federation of autonomous distributed OLAP services
28Development Strategy
OE
Network
OE
OE
OLAP Engine
OE
29Development Strategy (2)
- Precondition No open-source OLAP system
available - Decision development (in Java) from scratch
- Advantage motivation for research activities
addressing all facets - Disadvantage a possible long implementation
curve - First step centralized sequential Grid OLAP
service
30Towards Centralized Service
OLAP
Workflow Engine
DSCL, OMML
OMML
XML
GUI
Mediator
PMML
PMML
RD
XMLD
CSV
Data Mining Engine
31Data Organization
- Storing Multi-dimensional Data into
One-dimensional Storage - Chunking
- Bit Encoded Sparse Structure BESS
- Dynamic Bit Encoding (DBE)
32Toward Indexing
The simplest method for computing a linear
address from the multidimensional one (1)
assign each possible position within one
dimension an unique integer value and
store these matching information in another
table (2) Bit-shift the integer assigned to
the row dimension and logical OR it with the
integer assigned to the column dimension.
(3) Use the combined integer as your memory
address.
Drawback We want to store 12 values, but we
reserve 65534 addresses. Another important
issue How to determine the position index size?
Model Index(hex) Mini Van 0x00 Coupe
0x01 Sedan 0x02
Color Index(hex) Red 0x00 Blue
0x01 White 0x02 Green 0x03
(Coupe, White) ? 0x0102 (a linear address of
the measure)
33Chunking
34Dense and Sparse Chunk Storage
35Sparsity ExampleHP Application
- Web access analysis engine
- E.g., a newspaper Web site received 1.5 million
hits a week - Modeling the data using 4 dimensions
- ip address of the originate site (48,128 values)
- referring site (10,432 values)
- subject uri (18,085 values)
- hours of day (24 values)
- The resulting cube contained over 200 trillion
cells!
36Bit Encoded Sparse Structure (BESS)
A 100, B 1000, C 1000
Principles
Chunking
37Toward Dynamic Bit Encoding
First Improvement
PosIndex of dimension j
Stretched PosIndex of dimension 1
Logical combined stretched PosIndex of all
dimensions
38Dynamic Bit EncodingThe BESS Enhancement
39Dynamic Bit Encoding Example for 2 Dimensions
Index 0 0 0 0 0 0 i10 i00
Mask 0 0 0 0 1 0 0
1 Encoded 0 0 0 0 i10 0 0
i00
Dimension 0
Index 0 0 0 i41 i31 i21 i11
i01 Mask 0 1 1 1 0 1 1
0 Encoded 0 i41 i31 i21 0 i11 i01
0
Dimension 1
Enc0 0 0 0 0 0 0 i10 i00
Enc1 0 0 0 0 1 0 0
1 Global 0 i41 i31 i21 i00 i11 i01
i00
Encoded Global Index
40Performance
Each time factor (in seconds) in the following
tables was retrieved by averaging
3 x 107 encodings
41Computing Chunk Index and Measure Index
chunkPosIndex
NUM_DIM_BITS
x measure
mPosIndex (relative to ChunkPosIndex)
NUM_DIM_BITS (the number of bits used within each
dimension of the chunk)
chunkPosIndex posIndex SHIFT NUM_DIM_BITS to
the right
computing mPosIndex ? next slide
42Computing Measure Index
AND
To get the multidimensional index of the measure
mIndex perform binary shift of mPosIndex by j
NUM_DIM_BITS, 0 ? j ? maxDim, of each dimension j
to the left and combine all results together
using a logical OR.
43Chunk SwappingHandling Huge Data Volumes
I/O
chunkIndex
44Disk-Based Chunk Organization
45Cube Construction
- Initialization
- Read tuple from Grid Data/Mediation Service and
split into positions and measures. - Lookup and store index of each position.
- Calculate the global index number of the chunk
used to store the measures associated with their
positions - Locate and load the chunk or create a new one, if
it did not exist before. - Use indices of the positions to store the
measures into the chunk. - Continue at step 2 until no more tuples are
available on the input. - Finalization.
46Query Processing
- Generating Subcubes
- Query Algorithm
- Query Plan
47Materializing Subcubes
Summarizing z, ?xy
Summarizing y ?x, z ?z
Summarizing x y ?y, z ?z
48Materializing Subcubes (2)
Index table examples
49The Query Algorithm
For retrieval of one element, we first need a
structure to specify our request. For example,
we want to get the sum along y of the sum along z
for x0
50Query Plan (Schedule)
Query lattice for a four-dimensional cube
Improving caching, minimizing I/O (communication
optimization in distributed version) For
example, there are 2 paths form node ABCD to AB
51Parallel OLAP
52Parallel OLAP
53Parallel OLAP
54Distributed OLAP Aggregation of Compute and
Storage Resources
Tuple Stream
55Federated OLAPMotivating Example
- Effective management of a network requires
collecting, correlating, and analyzing a variety
of network trace data. - Analysis of flow data collecting at each router
and stored in a local data warehouse adjacent
to the router is a challenging application. - All flow information is conceptually part of a
single relation with the following schema - Flow ( RouterId, SourceIP, SourcePort,
SourceMask, SourceAS, DestIP, DestPort,
DestMask, DestAS, StartTime, EndTime, NumPackets,
NumBytes)
56CachingCentralized
Scheuermann, et al. Cache System Watchman
Cache Entry
query ID consists of an internal representation
of of the query string exact query match
employed
A profit metric is defined for each query and on
the basis of it a decision is taken whether the
query has to be cached or can be replaced.
Deshpande, et al. Chunk Based Caching
Fine granularity caching it allows queries to
partially reuse the results of previous
queries with which they overlap.
57CachingDistributed
P. Kalnis, et al. An Adaptive Peer-to-Peer
Network for Distributed Caching of OLAP Results
DW1
P1
c2
P6
c1
P2
P5
P3
P7
DW2
P4
Assume that P2 issues a query referring to chunks
c1, c2 and c3.
58GridMiner Current Structure
59Conclusions
- Overview of GridMiner and its OLAP research
effort - Other approaches
- Research Exhibition at Supercomputing 2004,
Pittsburgh, USA - www.gridminer.org