Title: Parallelizing the Data Cube
1Parallelizing the Data Cube
- PhD Oral Defence
- Todd Eavis
- July 23, 2003
2Overview
- Motivation for Parallel, Relational OLAP
- Core Algorithms and Methods
- Primary Systems Contributions
- Experimental Evaluation and Results
- Conclusions and Future Work
4Why study OLAP and the Data Cube?
- On-line Analytical Processing (OLAP): the foundation for a range of essential business applications
- Sales and marketing analysis, planning and budgeting
- 4 billion dollar industry by 2005
- Data Cube: a core OLAP construct, first proposed in 1996 by Gray et al. [GBLP], that supports sophisticated multi-dimensional data analysis
- Relevance to the research community? Results of Citeseer queries:
- OLAP: 797 papers
- Data Cube: 362 papers
- Our interest: Data Cube generation and querying
5Scale of OLAP Data Warehouses
- Average size of production data warehouses is currently 700 GB (survey.com/Olap Report)
- Expected to reach 4 TB by 2004
- 1/3 are currently < 50 GB. In two years, this number will drop to just 6%
- Biggest data warehouses growing by a factor of 20 (Winter Report)
- Biggest expected to exceed 100 TB within 2 years
- Our interest: exploiting parallel algorithms
6Fundamental Design Alternatives
- MOLAP (Multi-dimensional OLAP)
- Materialize data cube as a multi-dimensional array
- In theory: implicit indexing. In practice: hybrid schemes for sparse and dense regions
- Best for dense, low-dimensional spaces
- ROLAP (Relational OLAP)
- Store data as relational tables
- Requires an explicit multi-dimensional index
- Scales well to higher dimensions and higher cardinalities
- Our interest: highly scalable ROLAP model
8Computing the Full Cube in Parallel
- Small number of previous projects [GC, LHL, MM, NWY]
- Speedup quite limited
- Our approach: Parallel Pipesort [DEHR2, DER3]
- Model 2^d views as a task graph
- Create Scan Pipelines [AADG] as a Minimum Cost Spanning Tree using O(dn(m + n log n)) bipartite matching (n nodes, m edges)
- Partition task graph into sub-trees with O(p^3 d + p 2^d) augmented k-min-max [BSP] and distribute sub-trees to p processors
- Use over-sampling (S × p sub-trees) to improve load balance. High-low pairing in S - 1 rounds provides an approximation to an NP-Complete problem
- Use performance-optimized algorithms for sorting, scanning, and I/O to generate local views
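The load-balancing step above can be sketched as follows. This is a minimal illustration, assuming each sub-tree carries a single estimated cost and that the number of sub-trees is a multiple of p; the function name `high_low_pairing` and the cost values are hypothetical, and this is one simple reading of high-low pairing rather than the thesis implementation. The p most expensive sub-trees seed the processors, and in every later round the currently lightest processor receives the heaviest remaining sub-tree.

```python
def high_low_pairing(costs, p):
    """Assign sub-trees (given by estimated cost) to p processors.

    Returns (bins, loads): bins[k] lists the sub-tree indices given to
    processor k, loads[k] is the summed cost. Hypothetical helper.
    """
    if p <= 0 or len(costs) % p != 0:
        raise ValueError("expected a positive p dividing the sub-tree count")
    # Sub-tree indices ordered from most to least expensive.
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    # Seed round: each processor gets one of the p heaviest sub-trees.
    bins = [[i] for i in order[:p]]
    loads = [costs[i] for i in order[:p]]
    # Remaining rounds: pair light processors with heavy sub-trees.
    for start in range(p, len(order), p):
        chunk = order[start:start + p]                    # heaviest first
        procs = sorted(range(p), key=lambda k: loads[k])  # lightest first
        for k, i in zip(procs, chunk):
            bins[k].append(i)
            loads[k] += costs[i]
    return bins, loads


bins, loads = high_low_pairing([9, 8, 7, 6, 5, 4, 3, 2], p=2)
print(bins, loads)
```

With an over-sampling factor of 4 and two processors, the eight sub-trees are handed out in one seed round plus three pairing rounds.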
9Computing Partial Cubes in Parallel
- Important in practice for environments with higher dimensions and/or specific visualization needs
- Little previous work, only partial solutions [BR, GC, SAG]
- Our approach: greedy algorithms for Schedule Tree construction [DER2, DER5]
- Solution consists of algorithms for generating efficient Essential Trees (red) and algorithms for adding beneficial non-selected nodes (blue)
- Greedy method: record state information in Plan Objects. Incrementally add nodes with maximum benefit
- Pre-sorting candidate views by estimated size can reduce run-time from O(n^3) to O(n^2)
- O(dn) heuristic extensions for higher dimensional space. A confidence factor β limits risk
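A rough sketch of the greedy idea above, under simplifying assumptions that are not taken from the thesis: views are sets of dimension names, the cost of computing a view is the estimated size of its smallest available proper superset, and the benefit of a non-selected view is the drop in total cost it causes once added. The names `build_cost` and `greedy_add` and the sizes are hypothetical, and the pre-sorting optimization is not shown.

```python
def build_cost(view, available, sizes):
    """Cost of a view = size of its cheapest available proper superset."""
    return min(sizes[v] for v in available if view < v)


def greedy_add(selected, candidates, sizes, root):
    """Repeatedly add the non-selected view with the largest positive benefit."""
    tree = {root} | set(selected)

    def total(views):
        return sum(build_cost(v, views, sizes) for v in views if v != root)

    remaining = set(candidates) - tree
    added = []
    while remaining:
        base = total(tree)
        best, best_gain = None, 0
        # Sorted iteration keeps tie-breaking deterministic.
        for cand in sorted(remaining, key=sorted):
            gain = base - total(tree | {cand})
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:          # no candidate reduces the total cost
            break
        tree.add(best)
        remaining.discard(best)
        added.append(best)
    return added


V = frozenset
sizes = {V("ABC"): 1000, V("AB"): 100, V("AC"): 500, V("BC"): 400,
         V("A"): 10, V("B"): 20}
print(greedy_add([V("A"), V("B")], [V("AB"), V("AC"), V("BC")], sizes, V("ABC")))
```

Here only AB is added: computing A and B from AB (size 100) instead of from the raw data (size 1000) outweighs the cost of materializing AB itself.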
10A Parallel Data Cube Query Engine
- Views must be indexed prior to access
- Related work: sequential r-trees for data cube [RYR] and general purpose parallel r-trees [SL]
- Our approach: Parallel RCUBE [DER1, DER6]
- Records ordered as per Hilbert Space Filling Curve
- P-processor round-robin record striping
- Construct partial r-tree indexes on each node, packing page blocks in Hilbert order
- Parallel Query Engine
- Combines indexing and OLAP post-processing (query transformation, parallel Sample Sort, record permutation, etc.)
- Uses surrogate views to support Partial Cubes
- Supports linear dimension hierarchies
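The ordering and striping steps above can be illustrated with a small sketch. It assumes two dimensions and a grid whose side n is a power of two, purely for brevity (a real cube view has d dimensions), and it uses the standard iterative mapping from grid coordinates to a Hilbert index. The names `hilbert_index` and `stripe_records` are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid.

    n must be a power of two; 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def stripe_records(records, n, p):
    """Sort records in Hilbert order, then deal them round-robin to p nodes."""
    ordered = sorted(records, key=lambda r: hilbert_index(n, r[0], r[1]))
    return [ordered[i::p] for i in range(p)]


for node, part in enumerate(stripe_records([(1, 0), (0, 0), (1, 1), (0, 1)], 2, 2)):
    print(node, part)
```

Because neighbouring positions on the curve are neighbouring cells in space, round-robin striping tends to give every node a share of any range query.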
11The Virtual Data Cube
- Motivation: hide the complexity of Data Cube algorithms and implementation
- Requires no knowledge of:
- Format or extent of indexing
- Degree of materialization (full or partial)
- Representation of hierarchies
- Physical order of view attributes
- Degree of parallelism
13Systems Overview
- Full, robust Data Cube prototype [DER4]
- Approximately 20,000 lines of code
- C/C++, LEDA, MPI, STL
- Template-based graph algorithms
- Designed for, and evaluated on, contemporary parallel machines
- Shared nothing Linux cluster (Dalhousie)
- Shared disk SunFire multi-processor (HPCVL)
- Supporting systems include
- Flex/Bison based data generator
- Batch query generator
- View Subset generator
14Key Performance Issues
- Dynamic selection of best sorting algorithm
- Radix sort versus quicksort
- Minimization of data movement
- Use of horizontal and vertical indirection
- New pipeline aggregation algorithm
- Lazy aggregation
- Streamlined I/O
- I/O manager
- Independent I/O and computation threads
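As a hedged illustration of dynamic sort selection, the sketch below compares a linear-time estimate (one pass per key byte) with a comparison-based estimate (n log n) and picks the cheaper one. The cost formulas and the weights `radix_weight` and `compare_weight` are placeholder assumptions, not the experimentally tuned metrics of the prototype; keys are assumed to be non-negative integers, and Python's built-in sort stands in for quicksort.

```python
import math


def radix_sort(keys, key_bytes):
    """LSD radix sort on non-negative integers, one byte per pass."""
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return keys


def choose_and_sort(keys, key_bytes, radix_weight=1.0, compare_weight=1.0):
    """Pick the sort with the lower estimated cost; returns (sorted, name)."""
    n = len(keys)
    if n < 2:
        return list(keys), "none"
    radix_cost = radix_weight * n * key_bytes         # one pass per key byte
    compare_cost = compare_weight * n * math.log2(n)  # comparison-based bound
    if radix_cost < compare_cost:
        return radix_sort(list(keys), key_bytes), "radix"
    return sorted(keys), "comparison"


print(choose_and_sort([5, 3, 9, 1], key_bytes=4)[1])
print(choose_and_sort(list(range(1000, 0, -1)), key_bytes=2)[1])
```

Small inputs with wide keys fall to the comparison sort, while large inputs with narrow keys favour radix sort.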
15Costing model
- Sophisticated cost model, common to both full and partial cube [DEHR1]
- Based upon view size estimator
- Probabilistic counting technique
- Experimentally supported metrics for
- Dynamic sorting (linear time versus comparison based)
- In-memory scanning and data movement
- Machine specific read and write I/O
- Dynamically considers impact of computation versus I/O
16A Better Search Strategy
- Standard r-tree search strategy employs Depth First Search
- Our approach: Linear Breadth First Search
- Map the search algorithm to the linearly ordered levels of the packed index
- Resolve query with a left-to-right, top-to-bottom walk of the tree
- Disk head never moves backwards
- Resolution consists of a sequence of clustered scans
- Degrades gracefully to a sequential scan of the index plus a sequential scan of the data
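The level-by-level walk can be sketched as below. The layout is an assumption made for illustration: the packed index is a list of levels, each level a left-to-right list of `(box, (start, end))` entries whose range points into the next level (or into the data array for leaves), with boxes given as `(xmin, ymin, xmax, ymax)`. The function names are hypothetical.

```python
def intersects(a, b):
    """True if rectangles a and b (xmin, ymin, xmax, ymax) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def linear_bfs(levels, data, query):
    """Resolve a range query one level at a time, always left to right.

    Returns (hits, visited) where visited lists (level, node) in visit order.
    """
    current = [0]        # start at the root
    visited = []
    for depth, level in enumerate(levels):
        nxt = []
        for i in current:            # ascending positions within the level
            box, (start, end) = level[i]
            visited.append((depth, i))
            if intersects(box, query):
                nxt.extend(range(start, end))
        current = nxt
    hits = [data[i] for i in current if contains(query, data[i])]
    return hits, visited


data = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 4), (5, 5), (6, 4), (7, 5)]
levels = [
    [((0, 0, 7, 5), (0, 4))],
    [((0, 0, 1, 1), (0, 2)), ((2, 0, 3, 1), (2, 4)),
     ((4, 4, 5, 5), (4, 6)), ((6, 4, 7, 5), (6, 8))],
]
print(linear_bfs(levels, data, (1, 0, 4, 4)))
```

Since children are appended in the order their parents are visited, node positions only increase within a level, which is what lets the walk map onto forward-only disk reads.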
18Experimental Evaluation
- Default test environment includes
- 16 to 24 processors
- 2 million records/80 MB
- 4 to 14 dimensions
- Random query batches
- 24-node Linux cluster, 16-node SunFire MP (disk array)
- Parallel speedup approaching linear for all components
- Efficiency between 80% and 95%
[Figure panels: Partial Cube, Full Cube, Query Processing]
19Full Cube Evaluation
- Shared Disk
- 80% to 90% efficiency
- Disk array is bottleneck
- Optimized pipeline processing
- Order of magnitude improvement
- Over-sampling factor
- SF = 2 consistently best
20Partial Cube Evaluation
- Tree pruning with confidence factor (on 14 dimensions)
- Can eliminate up to 60% of original tree
- Virtually no reduction in tree quality
- Using partial cube algorithms for full cube
- All within 6% of best benchmark
- Recursive algorithm with 0.1
- Partial cube of 3 dimensions or less
- Reductions over naïve method of 65% to 70%
21Query Evaluation
- Overhead of using surrogate views (1 to 16 processors)
- Run time on materialized views versus time when those views were unavailable
- Record retrieval imbalance (16 processors)
- Only 0.3% from optimal load balance
- Ratio of blocks retrieved to required seeks
- Random query batches
- Up to 140:1 for large, sparse spaces
23Thesis Conclusions
- ROLAP is a viable alternative to MOLAP in a parallel setting
- Partial cubes can be efficiently generated
- ROLAP cubes can be efficiently indexed
- Virtual cube abstraction can be efficiently supported
24Research Highlights
- First parallel ROLAP system in the Data Cube literature
- A balanced approach to data cube research
- Algorithm design
- Systems engineering
- Extensive performance analysis
- Evaluated on contemporary parallel machines
- Commodity-style shared nothing cluster
- Shared disk architectures
- Integration of three independent data cube research projects into a single cohesive OLAP framework: the Virtual Cube
25Future Work
- Automated partial cube specification
- Extension of virtual cube
- Parallel Query optimization
- In addition to range queries or linear hierarchies
- High volume query environments
- OLAP visualization
- New projects are building on the current base
- Generation of Iceberg Cubes
- Mining of association rules
26Thank You!
References: The Data Cube Literature
- [GBLP] J. Gray, A. Bosworth, A. Layman and H. Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, ICDE, 1996.
- [GC] S. Goil and A. Choudhary, A Parallel Scalable Infrastructure for OLAP and Data Mining, IDEAS, 1999.
- [LHL] H. Lu, X. Huang and Z. Li, Computing Data Cubes Using Massively Parallel Processors, PCW '97, 1997.
- [MM] S. Muto and M. Kitsuregawa, A Dynamic Load Balancing Strategy for Parallel Datacube Computation, WDW O, 1999.
- [NWY] R. Ng, A. Wagner and Y. Yin, Iceberg-cube Computation with PC Clusters, SIGMOD, 2001.
- [AADG] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan and S. Sarawagi, On the Computation of Multidimensional Aggregates, VLDB, 1996.
- [BSP] R. Becker, S. Schach and Y. Perl, A Shifting Algorithm for Min-Max Tree Partitioning, Journal of the ACM, 1982.
- [BR] K. Beyer and R. Ramakrishnan, Bottom-Up Computation of Sparse and Iceberg CUBEs, SIGMOD, 1999.
- [RYR] N. Roussopoulos, Y. Kotidis and M. Roussopoulos, Cubetree: Organization of and Bulk Incremental Updates on the Data Cube, SIGMOD, 1997.
- [SL] B. Schnitzer and S. Leutenegger, Master-Client R-Trees: A New Parallel Architecture, SSDM, 1999.
References: Our own Virtual Data Cube Research
- [DEHR1] F. Dehne, T. Eavis, S. Hambrusch and A. Rau-Chaplin, Parallelizing The Data Cube, Parallel and Distributed Databases: An International Journal, 2001.
- [DER1] F. Dehne, T. Eavis and A. Rau-Chaplin, Distributed Multi-dimensional ROLAP Indexing for the Data Cube, CCGrid, 2003.
- [DER2] F. Dehne, T. Eavis and A. Rau-Chaplin, Computing Partial Data Cubes for Parallel Data Warehousing Applications, Euro PVM-MPI, 2001.
- [DER3] F. Dehne, T. Eavis and A. Rau-Chaplin, Coarse Grained Parallel On-Line Analytical Processing (OLAP) For Data Mining, ICCS, 2001.
- [DER4] F. Dehne, T. Eavis and A. Rau-Chaplin, A Cluster Architecture for Parallel Data Warehousing, CCGrid, 2001.
- [DEHR2] F. Dehne, S. Hambrusch, T. Eavis and A. Rau-Chaplin, Parallelizing The Data Cube, ICDT, 2001.
- [CHER] Y. Chen, F. Dehne, T. Eavis and A. Rau-Chaplin, Parallel ROLAP Data Cube Construction on Shared Nothing Multi-Processors, IPDPS, 2003.
- [DER5] F. Dehne, T. Eavis and A. Rau-Chaplin, Computing Partial Data Cubes, submitted to HICSS, 2003.
- [DER6] F. Dehne, T. Eavis and A. Rau-Chaplin, RCUBE: Parallel Multi-Dimensional ROLAP Indexing, submitted to Journal of Data Mining and Knowledge Discovery, 2003.