OnLine Analytical Processing on Large Databases Managed by Computational Grids

1 / 59

About This Presentation

Title:

OnLine Analytical Processing on Large Databases Managed by Computational Grids

Description:

Coupe 0x01. Sedan 0x02. Color Index(hex) Red 0x00. Blue ... (Coupe, White) 0x0102 (a linear address of the measure) Chunking. Dense and Sparse Chunk Storage ... –

Number of Views:58

Avg rating:3.0/5.0

Slides: 60

Provided by: ACE558

Category:

more less

Transcript and Presenter's Notes

Title: OnLine Analytical Processing on Large Databases Managed by Computational Grids

1
On-Line Analytical Processing on Large Databases
Managed by Computational Grids

A Min Tjoa2 , Peter Brezany1 , Bernhard Fiser1
Umut Onan1 , Ibrahim Elsayed1
1Institute for Software Science, University of
Vienna, Austria
brezany,fiser,onan,elsayed_at_par.univie.ac.at
2Institute of Software Technology and Interactive
Systems
Vienna University of Technology, Austria
tjoa_at_ifs.tuwien.ac.at

Globe 2004, Zaragoza, Spain
2
Outline

Motivation
Knowledge Discovery in Grid Datasets
GridMiner Project in Vienna
OLAP Background Information
Storage Management Structures
OLAP Services
Centralized Service
Distributed Service
Federated Service
Caching
Future Work
Conclusions

3
Motivation
Business
Medicine
Scientific experiments
Data and data exploration
cloud
Simulations
Earth observations
4
Stages of a Data Exploration Project
Time to
Importance complete to
success (percent of total) (percent of
total)
Based on Data Preparation for Data Mining, by
Dorian Pyle, Morgan Kaufmann

Exploring the problem 10 15
Exploring the solution 9 20 14 80
Implementation specification 1 51
Knowledge discovery
a. Data preparation 60 15
b. Data surveying 15 3
c. Data modeling 5 2

80
20
5
The Knowledge Discovery Process
Knowledge
OLAP Queries
OLAP
Online Analytical Mining
Evaluation and Presentation
Data Mining
Selection and Transformation
Data Warehouse
Cleaning and Integration
6
The GridMiner Project in Vienna

GridMiner A knowledge discovery Grid
infrastructure (http//www.gridminer.org/)
OGSA-based architecture
Workflow management
Grid-aware data preprocessing and data mining
services
Data mediation service
OLAP service
GUI
Current Implementation on top of Globus Toolkit
3.0
Application Management of patients with
traumatic brain injuries

7
GridMiner Architecture
GridMiner Workflow
GM DSCE Dynamic Service Control
GridMiner Core
GMPPS Preprocessing
GMDMS Data Mining
GMPRS Presentation
GMDT Transformation
GMOMS OLAM
GridMiner Base
GMMS Mediation
GMIS Information
GMRB Resource Broker
GMCMS OLAP / Cubes
Grid Core
Grid Core Services
Security
File and Database Access Service
Replica Management
Basic Grid Services
Fabric
Grid Resources
Data Source
8
Collaboration of GM-Services
Simple Scenario
GMPPS Preprocessing
GMDMS Data Mining
GMDIS Integration
GMPRS Presentation
Intermediate Result 1
Intermediate Result 2 (e.g. flat table)
Intermediate Result 3 (e.g. PMML)
Final Result
Data Sources
9
Collaboration (2)
Complex Scenarios
GMPPS
GMDIS
GMPPS
GMDMS
GMPRS
GMPPS
GMDMS
GMPRS
GMPPS
GMPPS
GMDMS
GMPRS
GMDMS
GMPRS
GMPPS
GMDIS
GMCMS
GMOMS
GMPRS
GMPPS
GMPPS
10
Workflow Models
Static workflows
Dynamic workflows
11
Dynamic Workflows

Dynamic Service Control Language (DSCL)
based on XML
easy to use
Dynamic Service Control Engine (DSCE)
processes the workflow according to DSCL

DSCL
DSCE
Service A
Service B
Service D
Service C
12
DSCL Control Flow
Automatic conversion
Users view
dscl
variables
composition
sequence
createService activityIDact1
parallel
invoke activityIDact2.1
invoke activityIDact2.2
sequence

13
Grid Database Access III
14
Grid Database Access With OGSA-DAI
GDS gets a query via Perform Document GDS Engine
process specified activities GDS returns results
15
Grid Database Access With OGSA-DAI
16
Grid Data Mediation Service - Architecture
17
GDMS Example Scenario

Heterogeneities
Name in A is Alexander Wöhrer
Name in C has to be combined
Distribution
3 data sources

18
GDMS Mapping Schema

Builds together with the mediated schema (user
view) the main configuration of the GDMS
One Mapping Schema for each Table in the Virtual
Data Source (VDS)

19
Data Mediation - Future Work

Easier Creation, Usage and Maintainability with
the help of Ontologies
Better Performance
Evolve to a distributed mediaton service

20
OLAP
Basic Concepts
21
Relational (ROLAP) vs. Multi-Dimensional Model
(MOLAP)
Model Mini Van, Coupe, Sedan
Colour Red, Blue, White, Green
Two-Dimensional Model
22
ROLAP vs. MOLAP
Model Year Color Sales Chevy 1990 Blue 87 Chevy 1
990 Red 5 Chevy 1990 ALL 92 Chevy ALL Blue 87 Chev
y ALL Red 5 Chevy ALL ALL 92 Ford 1990 Blue 99 For
d 1990 Green 64 Ford 1990 ALL 163 Ford
1991 Blue 7 Ford 1991 Red 8 Ford 1991 ALL 15 Ford
ALL Blue 106 Ford ALL Green 64 Ford ALL Red 8 All
1990 Blue 186 All 1990 Gree 64 ALL 1991 Blue 7 AL
L 1991 Red 8 Ford ALL ALL 178 ALL 1990 ALL 255 AL
L 1991 ALL 15 ALL ALL Blue 193 ALL ALL Green 64 AL
L ALL Red 13 ALL ALL ALL 270
SELECT Model, Year, Color, SUM(sales) AS
Sales FROM Sales WHERE Model IN Ford,
Chevy AND Year BETWEEN 1990
AND 1992 GROUP BY CUBE (Model, Year, Color)
Model Year Color Sales Chevy 1990 Red 5 Chevy 199
0 Blue 87 Ford 1990 Green 64 Ford 1990 Blue 99 For
d 1991 Red 8 Ford 1991 Blue 7
?
23
ROLAP vs. MOLAP
Cross Tab
Data Cube
24
ROLAP vs. MOLAP

The ROLAPs summary records are stored directly
int standard relational tables, without any need
for data conversion. A complex analytical query
is cumbersome to express in SQL and it might not
be efficient to execute.
The array-based model, MOLAP (Multi-Dimensional
OLAP), has the advantage that native arrays
provide an immediate form of indexing for cube
queries.

25
MOLAP

Each value of the attribute associated with the
dimension determines a position of the dimension.
The observed values, called measures are located
at the intersections of the dimension positions.
The intersections are called cells and are
populated with measures.
Dimensional indexing a unique integer value is
assigned to each possible dimension position, and
the corresponding mapping information is stored
in an appropriate index structure.

26
Towards Grid OLAP
Research Objectives High-Performance
Grid OLAP Services
27
Requirements

Operation on large data sets
Centralized OLAP Service (parallel computing
power can be included)
Distributed OLAP service
Federation of autonomous distributed OLAP services

28
Development Strategy
OE
Network
OE
OE
OLAP Engine
OE
29
Development Strategy (2)

Precondition No open-source OLAP system
available
Decision development (in Java) from scratch
Advantage motivation for research activities
addressing all facets
Disadvantage a possible long implementation
curve
First step centralized sequential Grid OLAP
service

30
Towards Centralized Service
OLAP
Workflow Engine
DSCL, OMML
OMML
XML
GUI
Mediator
PMML
PMML
RD
XMLD
CSV
Data Mining Engine
31
Data Organization

Storing Multi-dimensional Data into
One-dimensional Storage
Chunking
Bit Encoded Sparse Structure BESS
Dynamic Bit Encoding (DBE)

32
Toward Indexing
The simplest method for computing a linear
address from the multidimensional one (1)
assign each possible position within one
dimension an unique integer value and
store these matching information in another
table (2) Bit-shift the integer assigned to
the row dimension and logical OR it with the
integer assigned to the column dimension.
(3) Use the combined integer as your memory
address.
Drawback We want to store 12 values, but we
reserve 65534 addresses. Another important
issue How to determine the position index size?
Model Index(hex) Mini Van 0x00 Coupe
0x01 Sedan 0x02
Color Index(hex) Red 0x00 Blue
0x01 White 0x02 Green 0x03
(Coupe, White) ? 0x0102 (a linear address of
the measure)
33
Chunking
34
Dense and Sparse Chunk Storage
35
Sparsity ExampleHP Application

Web access analysis engine
E.g., a newspaper Web site received 1.5 million
hits a week
Modeling the data using 4 dimensions
ip address of the originate site (48,128 values)
referring site (10,432 values)
subject uri (18,085 values)
hours of day (24 values)
The resulting cube contained over 200 trillion
cells!

36
Bit Encoded Sparse Structure (BESS)
A 100, B 1000, C 1000
Principles
Chunking
37
Toward Dynamic Bit Encoding
First Improvement
PosIndex of dimension j
Stretched PosIndex of dimension 1
Logical combined stretched PosIndex of all
dimensions
38
Dynamic Bit EncodingThe BESS Enhancement
39
Dynamic Bit Encoding Example for 2 Dimensions
Index 0 0 0 0 0 0 i10 i00
Mask 0 0 0 0 1 0 0
1 Encoded 0 0 0 0 i10 0 0
i00
Dimension 0
Index 0 0 0 i41 i31 i21 i11
i01 Mask 0 1 1 1 0 1 1
0 Encoded 0 i41 i31 i21 0 i11 i01
0
Dimension 1
Enc0 0 0 0 0 0 0 i10 i00
Enc1 0 0 0 0 1 0 0
1 Global 0 i41 i31 i21 i00 i11 i01
i00
Encoded Global Index
40
Performance
Each time factor (in seconds) in the following
tables was retrieved by averaging
3 x 107 encodings
41
Computing Chunk Index and Measure Index
chunkPosIndex
NUM_DIM_BITS
x measure
mPosIndex (relative to ChunkPosIndex)

NUM_DIM_BITS (the number of bits used within each
dimension of the chunk)
chunkPosIndex posIndex SHIFT NUM_DIM_BITS to
the right
computing mPosIndex ? next slide
42
Computing Measure Index
AND
To get the multidimensional index of the measure
mIndex perform binary shift of mPosIndex by j
NUM_DIM_BITS, 0 ? j ? maxDim, of each dimension j
to the left and combine all results together
using a logical OR.
43
Chunk SwappingHandling Huge Data Volumes
I/O
chunkIndex
44
Disk-Based Chunk Organization
45
Cube Construction

Initialization
Read tuple from Grid Data/Mediation Service and
split into positions and measures.
Lookup and store index of each position.
Calculate the global index number of the chunk
used to store the measures associated with their
positions
Locate and load the chunk or create a new one, if
it did not exist before.
Use indices of the positions to store the
measures into the chunk.
Continue at step 2 until no more tuples are
available on the input.
Finalization.

46
Query Processing

Generating Subcubes
Query Algorithm
Query Plan

47
Materializing Subcubes
Summarizing z, ?xy
Summarizing y ?x, z ?z
Summarizing x y ?y, z ?z
48
Materializing Subcubes (2)
Index table examples
49
The Query Algorithm
For retrieval of one element, we first need a
structure to specify our request. For example,
we want to get the sum along y of the sum along z
for x0
50
Query Plan (Schedule)
Query lattice for a four-dimensional cube
Improving caching, minimizing I/O (communication
optimization in distributed version) For
example, there are 2 paths form node ABCD to AB
51
Parallel OLAP
52
Parallel OLAP
53
Parallel OLAP
54
Distributed OLAP Aggregation of Compute and
Storage Resources
Tuple Stream
55
Federated OLAPMotivating Example

Effective management of a network requires
collecting, correlating, and analyzing a variety
of network trace data.
Analysis of flow data collecting at each router
and stored in a local data warehouse adjacent
to the router is a challenging application.
All flow information is conceptually part of a
single relation with the following schema
Flow ( RouterId, SourceIP, SourcePort,
SourceMask, SourceAS, DestIP, DestPort,
DestMask, DestAS, StartTime, EndTime, NumPackets,
NumBytes)

56
CachingCentralized
Scheuermann, et al. Cache System Watchman
Cache Entry
query ID consists of an internal representation
of of the query string exact query match
employed
A profit metric is defined for each query and on
the basis of it a decision is taken whether the
query has to be cached or can be replaced.
Deshpande, et al. Chunk Based Caching
Fine granularity caching it allows queries to
partially reuse the results of previous
queries with which they overlap.
57
CachingDistributed
P. Kalnis, et al. An Adaptive Peer-to-Peer
Network for Distributed Caching of OLAP Results
DW1
P1
c2
P6
c1
P2
P5
P3
P7
DW2
P4
Assume that P2 issues a query referring to chunks
c1, c2 and c3.
58
GridMiner Current Structure
59
Conclusions