OnLine Analytical Processing on Large Databases Managed by Computational Grids



1
On-Line Analytical Processing on Large Databases
Managed by Computational Grids
  • A Min Tjoa², Peter Brezany¹, Bernhard Fiser¹
  • Umut Onan¹, Ibrahim Elsayed¹
  • ¹Institute for Software Science, University of
    Vienna, Austria
  • {brezany,fiser,onan,elsayed}@par.univie.ac.at
  • ²Institute of Software Technology and Interactive
    Systems
  • Vienna University of Technology, Austria
  • tjoa@ifs.tuwien.ac.at

Globe 2004, Zaragoza, Spain
2
Outline
  • Motivation
  • Knowledge Discovery in Grid Datasets
  • GridMiner Project in Vienna
  • OLAP Background Information
  • Storage Management Structures
  • OLAP Services
    • Centralized Service
    • Distributed Service
    • Federated Service
  • Caching
  • Future Work
  • Conclusions

3
Motivation
[Figure: a data and data-exploration cloud fed by business, medicine,
scientific experiments, simulations, and Earth observations]
4
Stages of a Data Exploration Project
(Based on Data Preparation for Data Mining, by Dorian Pyle, Morgan
Kaufmann)

  Stage                           Time to complete    Importance to success
                                  (percent of total)  (percent of total)
  Exploring the problem                  10                  15
  Exploring the solution                  9                  14
  Implementation specification            1                  51
  Knowledge discovery                    80                  20
    a. Data preparation                  60                  15
    b. Data surveying                    15                   3
    c. Data modeling                      5                   2
5
The Knowledge Discovery Process
From raw data to knowledge:
  1. Cleaning and Integration → Data Warehouse
  2. Selection and Transformation
  3. Data Mining / OLAP queries (Online Analytical Mining)
  4. Evaluation and Presentation → Knowledge
6
The GridMiner Project in Vienna
  • GridMiner: a knowledge discovery Grid
    infrastructure (http://www.gridminer.org/)
  • OGSA-based architecture
  • Workflow management
  • Grid-aware data preprocessing and data mining
    services
  • Data mediation service
  • OLAP service
  • GUI
  • Current implementation on top of Globus Toolkit
    3.0
  • Application: management of patients with
    traumatic brain injuries

7
GridMiner Architecture
Layers, top to bottom:
  • GridMiner Workflow: GM DSCE (Dynamic Service Control)
  • GridMiner Core: GMPPS (Preprocessing), GMDMS (Data Mining),
    GMPRS (Presentation), GMDT (Transformation), GMOMS (OLAM)
  • GridMiner Base: GMMS (Mediation), GMIS (Information),
    GMRB (Resource Broker), GMCMS (OLAP / Cubes)
  • Grid Core: Grid Core Services (Security, File and Database Access
    Service, Replica Management), Basic Grid Services
  • Fabric: Grid Resources, Data Source
8
Collaboration of GM-Services
Simple scenario (pipeline):
Data Sources → GMPPS (Preprocessing) → Intermediate Result 1
→ GMDIS (Integration) → Intermediate Result 2 (e.g. flat table)
→ GMDMS (Data Mining) → Intermediate Result 3 (e.g. PMML)
→ GMPRS (Presentation) → Final Result
9
Collaboration (2)
Complex scenarios:
[Figure: branching workflows composing multiple GMPPS, GMDIS, GMDMS,
GMCMS, GMOMS, and GMPRS services]
10
Workflow Models
Static workflows
Dynamic workflows
11
Dynamic Workflows
  • Dynamic Service Control Language (DSCL)
  • based on XML
  • easy to use
  • Dynamic Service Control Engine (DSCE)
  • processes the workflow according to the DSCL document

[Figure: a DSCL document drives the DSCE, which controls Services A-D]
12
DSCL Control Flow
Automatic conversion from the user's view into a DSCL document
(reconstructed sketch, angle brackets restored):

<dscl>
  <variables> ... </variables>
  <composition>
    <sequence>
      <createService activityID="act1"/>
      <parallel>
        <invoke activityID="act2.1"/>
        <invoke activityID="act2.2"/>
      </parallel>
    </sequence>
  </composition>
</dscl>
13
Grid Database Access III
14
Grid Database Access With OGSA-DAI
The GDS receives a query via a Perform Document; the GDS engine
processes the specified activities; the GDS returns the results.
15
Grid Database Access With OGSA-DAI
16
Grid Data Mediation Service - Architecture
17
GDMS Example Scenario
  • Heterogeneities
  • the name in source A is a single field (e.g. Alexander Wöhrer)
  • the name in source C has to be combined from several fields
  • Distribution
  • 3 data sources

18
GDMS Mapping Schema
  • Together with the mediated schema (the user
    view), it forms the main configuration of the GDMS
  • One mapping schema for each table in the Virtual
    Data Source (VDS)

19
Data Mediation - Future Work
  • Easier creation, usage, and maintainability with
    the help of ontologies
  • Better performance
  • Evolution into a distributed mediation service

20
OLAP
Basic Concepts
21
Relational (ROLAP) vs. Multi-Dimensional Model
(MOLAP)
Model: Mini Van, Coupe, Sedan
Colour: Red, Blue, White, Green
Two-dimensional model
22
ROLAP vs. MOLAP
Base table:

  Model  Year  Color  Sales
  Chevy  1990  Red        5
  Chevy  1990  Blue      87
  Ford   1990  Green     64
  Ford   1990  Blue      99
  Ford   1991  Red        8
  Ford   1991  Blue       7

Query:

  SELECT Model, Year, Color, SUM(Sales) AS Sales
  FROM Sales
  WHERE Model IN ('Ford', 'Chevy') AND Year BETWEEN 1990 AND 1992
  GROUP BY CUBE (Model, Year, Color)

Result:

  Model  Year  Color  Sales
  Chevy  1990  Blue      87
  Chevy  1990  Red        5
  Chevy  1990  ALL       92
  Chevy  ALL   Blue      87
  Chevy  ALL   Red        5
  Chevy  ALL   ALL       92
  Ford   1990  Blue      99
  Ford   1990  Green     64
  Ford   1990  ALL      163
  Ford   1991  Blue       7
  Ford   1991  Red        8
  Ford   1991  ALL       15
  Ford   ALL   Blue     106
  Ford   ALL   Green     64
  Ford   ALL   Red        8
  Ford   ALL   ALL      178
  ALL    1990  Blue     186
  ALL    1990  Green     64
  ALL    1990  ALL      255
  ALL    1991  Blue       7
  ALL    1991  Red        8
  ALL    1991  ALL       15
  ALL    ALL   Blue     193
  ALL    ALL   Green     64
  ALL    ALL   Red       13
  ALL    ALL   ALL      270
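The effect of GROUP BY CUBE can be reproduced in plain Python: each
input row contributes to every combination of (value | ALL) over the
three dimensions. A minimal sketch using the base table above (names
are illustrative):

```python
from itertools import product

# Base table from the slide: (Model, Year, Color, Sales)
rows = [
    ("Chevy", 1990, "Red", 5),
    ("Chevy", 1990, "Blue", 87),
    ("Ford", 1990, "Green", 64),
    ("Ford", 1990, "Blue", 99),
    ("Ford", 1991, "Red", 8),
    ("Ford", 1991, "Blue", 7),
]

def cube(rows):
    """Sum Sales over every combination of (value | ALL) per dimension,
    as GROUP BY CUBE (Model, Year, Color) would."""
    totals = {}
    for model, year, color, sales in rows:
        # Each source row contributes to 2^3 = 8 aggregate cells.
        for key in product((model, "ALL"), (year, "ALL"), (color, "ALL")):
            totals[key] = totals.get(key, 0) + sales
    return totals

totals = cube(rows)
print(totals[("ALL", "ALL", "ALL")])   # 270, the grand total
print(totals[("Ford", "ALL", "ALL")])  # 178
```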
23
ROLAP vs. MOLAP
Cross Tab
Data Cube
24
ROLAP vs. MOLAP
  • ROLAP's summary records are stored directly
    in standard relational tables, without any need
    for data conversion. However, a complex analytical
    query is cumbersome to express in SQL and may not
    be efficient to execute.
  • The array-based model, MOLAP (Multi-Dimensional
    OLAP), has the advantage that native arrays
    provide an immediate form of indexing for cube
    queries.

25
MOLAP
  • Each value of the attribute associated with a
    dimension determines a position along that dimension.
  • The observed values, called measures, are located
    at the intersections of the dimension positions.
    The intersections are called cells and are
    populated with measures.
  • Dimensional indexing: a unique integer value is
    assigned to each possible dimension position, and
    the corresponding mapping information is stored
    in an appropriate index structure.

26
Towards Grid OLAP
Research Objectives High-Performance
Grid OLAP Services
27
Requirements
  • Operation on large data sets
  • Centralized OLAP Service (parallel computing
    power can be included)
  • Distributed OLAP service
  • Federation of autonomous distributed OLAP services

28
Development Strategy
OE
Network
OE
OE
OLAP Engine
OE
29
Development Strategy (2)
  • Precondition: no open-source OLAP system was
    available
  • Decision: development (in Java) from scratch
  • Advantage: motivation for research activities
    addressing all facets
  • Disadvantage: a possibly long implementation
    curve
  • First step: a centralized, sequential Grid OLAP
    service

30
Towards Centralized Service
[Figure: centralized-service architecture - the GUI and the Workflow
Engine drive the OLAP engine via DSCL/OMML; the Mediator supplies data
from relational (RD), XML (XMLD), and CSV sources; results are
exchanged with the Data Mining Engine as PMML]
31
Data Organization
  • Storing Multi-dimensional Data into
    One-dimensional Storage
  • Chunking
  • Bit Encoded Sparse Structure BESS
  • Dynamic Bit Encoding (DBE)

32
Toward Indexing
The simplest method for computing a linear address from a
multidimensional one: (1) assign each possible position within one
dimension a unique integer value and store this mapping information in
a lookup table; (2) bit-shift the integer assigned to the row dimension
and logically OR it with the integer assigned to the column dimension;
(3) use the combined integer as the memory address.

  Model     Index (hex)      Color   Index (hex)
  Mini Van  0x00             Red     0x00
  Coupe     0x01             Blue    0x01
  Sedan     0x02             White   0x02
                             Green   0x03

  (Coupe, White) → 0x0102 (the linear address of the measure)

Drawback: we want to store only 12 values, but we reserve 2^16 = 65,536
addresses. Another important issue: how to determine the position index
size?
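The shift-and-OR addressing above can be sketched as follows
(BITS_PER_DIM = 8 matches the one-byte-per-dimension indices of the
example; function and variable names are illustrative):

```python
# Position indices from the slide's lookup tables.
MODEL = {"Mini Van": 0x00, "Coupe": 0x01, "Sedan": 0x02}
COLOR = {"Red": 0x00, "Blue": 0x01, "White": 0x02, "Green": 0x03}

BITS_PER_DIM = 8  # one byte per dimension, as in the 0x0102 example

def linear_address(model, color):
    """Shift the row-dimension index left, then OR in the column index."""
    return (MODEL[model] << BITS_PER_DIM) | COLOR[color]

addr = linear_address("Coupe", "White")
print(hex(addr))  # 0x102

# Only 3 x 4 = 12 of the 2**16 possible addresses are ever used,
# which is exactly the sparsity drawback noted on the slide.
```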
33
Chunking
34
Dense and Sparse Chunk Storage
35
Sparsity Example: an HP Application
  • Web access analysis engine
  • E.g., a newspaper Web site received 1.5 million
    hits a week
  • Modeling the data using 4 dimensions:
  • IP address of the originating site (48,128 values)
  • referring site (10,432 values)
  • subject URI (18,085 values)
  • hours of day (24 values)
  • The resulting cube contained over 200 trillion
    cells!
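The cell count follows directly from the four cardinalities; a quick
check, which also shows how sparse the cube must be given at most
1.5 million non-empty cells per week:

```python
# Dimension cardinalities from the slide.
ip_addresses = 48_128
referrers    = 10_432
uris         = 18_085
hours        = 24

cells = ip_addresses * referrers * uris * hours
hits_per_week = 1_500_000  # at most this many cells can be non-empty

print(f"{cells:,} cells")  # over 200 trillion
print(f"density <= {hits_per_week / cells:.2e}")  # far below one in a million
```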

36
Bit Encoded Sparse Structure (BESS)
Principles: chunking; example dimension sizes |A| = 100, |B| = 1000,
|C| = 1000
37
Toward Dynamic Bit Encoding
First improvement: the PosIndex of each dimension j is "stretched" (its
bits are spread out to the bit positions reserved for that dimension),
and the stretched PosIndexes of all dimensions are logically combined
into a single global index.
38
Dynamic Bit EncodingThe BESS Enhancement
39
Dynamic Bit Encoding Example for 2 Dimensions
Dimension 0 (mask reserves bits 0 and 3):

  Index    0   0   0   0   0   0   i10 i00
  Mask     0   0   0   0   1   0   0   1
  Encoded  0   0   0   0   i10 0   0   i00

Dimension 1 (mask reserves bits 1, 2, 4, 5, 6):

  Index    0   0   0   i41 i31 i21 i11 i01
  Mask     0   1   1   1   0   1   1   0
  Encoded  0   i41 i31 i21 0   i11 i01 0

Encoded global index (Enc0 OR Enc1):

  Enc0     0   0   0   0   i10 0   0   i00
  Enc1     0   i41 i31 i21 0   i11 i01 0
  Global   0   i41 i31 i21 i10 i11 i01 i00
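One way to realize the stretching in software is a bit-deposit: spread
each dimension's position-index bits into the bit positions its mask
reserves, then OR the per-dimension results together. A sketch with
illustrative names; the masks are taken from the two-dimension example
above:

```python
def deposit_bits(value, mask):
    """Spread the low-order bits of value into the 1-positions of mask."""
    result, src_bit = 0, 0
    for pos in range(mask.bit_length()):
        if (mask >> pos) & 1:
            if (value >> src_bit) & 1:
                result |= 1 << pos
            src_bit += 1
    return result

def encode_global_index(pos_indices, masks):
    """OR together the stretched position indices of all dimensions."""
    g = 0
    for idx, mask in zip(pos_indices, masks):
        g |= deposit_bits(idx, mask)
    return g

# Masks from the 2-dimension example: dimension 0 owns bits 0 and 3,
# dimension 1 owns bits 1, 2, 4, 5 and 6.
masks = [0b00001001, 0b01110110]
print(bin(encode_global_index([0b11, 0b11111], masks)))  # 0b1111111
```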
40
Performance
Each time factor (in seconds) in the following tables was obtained by
averaging over 3 × 10^7 encodings.
41
Computing Chunk Index and Measure Index
NUM_DIM_BITS: the number of bits used within each dimension of the
chunk. For each dimension, chunkPosIndex = posIndex shifted
NUM_DIM_BITS to the right; mPosIndex (the measure position relative to
chunkPosIndex) is formed from the remaining low-order bits.
Computing mPosIndex → next slide.
42
Computing Measure Index
mPosIndex of a dimension is obtained by ANDing its posIndex with a mask
of the NUM_DIM_BITS low-order bits. To get the multidimensional index
of the measure, mIndex: binary-shift mPosIndex of each dimension j by
j × NUM_DIM_BITS to the left (0 ≤ j ≤ maxDim) and combine all results
together using a logical OR.
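The two computations (chunkPosIndex by right shift, mPosIndex by
AND-masking, mIndex by shift-and-OR) can be sketched as follows
(NUM_DIM_BITS = 4 is an assumed value; all names are illustrative):

```python
NUM_DIM_BITS = 4  # 16 positions per dimension inside a chunk (assumed)
LOW_MASK = (1 << NUM_DIM_BITS) - 1

def split_position(pos_index):
    """Split one dimension's position index into its chunk part
    (high-order bits) and its in-chunk measure part (low-order bits)."""
    chunk_pos = pos_index >> NUM_DIM_BITS  # right shift by NUM_DIM_BITS
    m_pos = pos_index & LOW_MASK           # AND with the low-bits mask
    return chunk_pos, m_pos

def measure_index(m_pos_indices):
    """Combine the per-dimension in-chunk indices: shift dimension j
    left by j * NUM_DIM_BITS and OR all results together."""
    m_index = 0
    for j, m_pos in enumerate(m_pos_indices):
        m_index |= m_pos << (j * NUM_DIM_BITS)
    return m_index

# Position index 0x2A in one dimension: chunk 0x2, in-chunk offset 0xA.
print(split_position(0x2A))            # (2, 10)
print(hex(measure_index([0xA, 0x3])))  # 0x3a
```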
43
Chunk Swapping: Handling Huge Data Volumes
[Figure: chunks, addressed by chunkIndex, are swapped between memory
and disk via I/O]
44
Disk-Based Chunk Organization
45
Cube Construction
  1. Initialization.
  2. Read a tuple from the Grid Data/Mediation Service and split it
     into positions and measures.
  3. Look up and store the index of each position.
  4. Calculate the global index of the chunk used to store the
     measures associated with their positions.
  5. Locate and load the chunk, or create a new one if it does not
     exist yet.
  6. Use the indices of the positions to store the measures into the
     chunk.
  7. Continue at step 2 until no more tuples are available on the
     input.
  8. Finalization.
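The construction loop can be sketched with dictionary-based chunks (a
simplified sketch, not the GridMiner implementation; NUM_DIM_BITS and
all names are illustrative):

```python
from collections import defaultdict

NUM_DIM_BITS = 2  # 4 positions per dimension within a chunk (assumed)
LOW_MASK = (1 << NUM_DIM_BITS) - 1

next_index = defaultdict(int)  # per-dimension counter for new positions
position_index = {}            # (dimension, value) -> integer index

def pos_index(dim, value):
    """Look up the integer index of a position, assigning one if new."""
    key = (dim, value)
    if key not in position_index:
        position_index[key] = next_index[dim]
        next_index[dim] += 1
    return position_index[key]

def build_cube(tuples):
    """Steps 2-7: read tuples, index positions, and store measures
    into dictionary-based chunks."""
    chunks = {}
    for *positions, measure in tuples:
        # Steps 2-3: split the tuple and look up position indices.
        idxs = [pos_index(d, v) for d, v in enumerate(positions)]
        # Step 4: global chunk index and in-chunk measure index.
        chunk_key = tuple(i >> NUM_DIM_BITS for i in idxs)
        m_key = tuple(i & LOW_MASK for i in idxs)
        # Step 5: locate the chunk or create a new one.
        chunk = chunks.setdefault(chunk_key, {})
        # Step 6: store the measure at its in-chunk position.
        chunk[m_key] = chunk.get(m_key, 0) + measure
    return chunks  # step 8 (finalization, e.g. flushing to disk) omitted

cube = build_cube([("Ford", "Blue", 99), ("Ford", "Green", 64),
                   ("Chevy", "Blue", 87)])
print(len(cube))  # number of chunks created
```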

46
Query Processing
  • Generating Subcubes
  • Query Algorithm
  • Query Plan

47
Materializing Subcubes
[Figure: materializing subcubes of a three-dimensional cube by
summarizing along z, along y, or along x]
48
Materializing Subcubes (2)
Index table examples
49
The Query Algorithm
To retrieve one element, we first need a structure to specify the
request. For example, we want the sum along y of the sum along z for
x = x0.
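For a small dense cube, such a request can be evaluated by nesting the
two summations (toy data; names are illustrative):

```python
# A toy dense 3-D cube: cube[x][y][z] holds a measure (made-up values).
cube = [
    [[1, 2], [3, 4]],  # x = 0
    [[5, 6], [7, 8]],  # x = 1
]

def sum_y_of_sum_z(cube, x0):
    """Sum along z first, then along y, for the fixed slice x = x0."""
    return sum(sum(z_line) for z_line in cube[x0])

print(sum_y_of_sum_z(cube, 0))  # 1 + 2 + 3 + 4 = 10
```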
50
Query Plan (Schedule)
Query lattice for a four-dimensional cube: used for improving caching
and minimizing I/O (communication optimization in the distributed
version). For example, there are two paths from node ABCD to node AB
(via ABC or via ABD).
51
Parallel OLAP
52
Parallel OLAP
53
Parallel OLAP
54
Distributed OLAP: Aggregation of Compute and Storage Resources
[Figure: a tuple stream is partitioned across distributed OLAP engines]
55
Federated OLAP: Motivating Example
  • Effective management of a network requires
    collecting, correlating, and analyzing a variety
    of network trace data.
  • Analysis of flow data collected at each router
    and stored in a local data warehouse adjacent
    to the router is a challenging application.
  • All flow information is conceptually part of a
    single relation with the following schema:
  • Flow(RouterId, SourceIP, SourcePort,
    SourceMask, SourceAS, DestIP, DestPort,
    DestMask, DestAS, StartTime, EndTime, NumPackets,
    NumBytes)

56
Caching: Centralized
Scheuermann et al., the cache system Watchman:
  • A cache entry's query ID consists of an internal
    representation of the query string; exact query
    matching is employed.
  • A profit metric is defined for each query, and on
    its basis a decision is taken whether the query
    result is to be cached or replaced.
Deshpande et al., chunk-based caching:
  • Fine-granularity caching: it allows queries to
    partially reuse the results of previous queries
    with which they overlap.
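The fine-granularity idea can be sketched as a chunk-keyed cache that
fetches only the chunks a query does not already cover (an illustrative
sketch, not the Watchman or chunk-caching implementation):

```python
class ChunkCache:
    """Answer a query from cached chunks where possible; fetch only
    the missing chunks from the backend (partial reuse)."""

    def __init__(self, fetch_chunk):
        self.fetch_chunk = fetch_chunk  # backend loader, e.g. a warehouse call
        self.store = {}                 # chunk id -> chunk data

    def get(self, chunk_ids):
        hits = {c: self.store[c] for c in chunk_ids if c in self.store}
        misses = [c for c in chunk_ids if c not in self.store]
        for c in misses:  # only the non-overlapping part hits the backend
            self.store[c] = self.fetch_chunk(c)
            hits[c] = self.store[c]
        return hits, misses

cache = ChunkCache(fetch_chunk=lambda c: f"data({c})")
_, misses1 = cache.get(["c1", "c2"])        # both fetched
_, misses2 = cache.get(["c1", "c2", "c3"])  # only c3 fetched
print(misses1, misses2)  # ['c1', 'c2'] ['c3']
```

This mirrors the peer-to-peer example on the next slide, where a query
for chunks c1, c2 and c3 reuses the cached c1 and c2.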
57
Caching: Distributed
P. Kalnis et al., "An Adaptive Peer-to-Peer Network for Distributed
Caching of OLAP Results":
[Figure: peers P1-P7 form a peer-to-peer network over data warehouses
DW1 and DW2; chunks c1 and c2 are cached at peers]
Assume that P2 issues a query referring to chunks c1, c2 and c3.
58
GridMiner Current Structure
59
Conclusions
  • Overview of GridMiner and its OLAP research
    effort
  • Other approaches
  • Research Exhibition at Supercomputing 2004,
    Pittsburgh, USA
  • www.gridminer.org