A Comparative Study of Spatial Indexing Techniques for Multidimensional Scientific Datasets

1
A Comparative Study of Spatial Indexing
Techniques for Multidimensional Scientific
Datasets

SSDBM 2004
Beomseok Nam and Alan Sussman
Department of Computer Science
University of Maryland, College Park

2
Motivation

Data Chunking
Huge scientific datasets (>100 GB)
Group individual point data
Efficient multidimensional indexing trees for
scientific datasets
Data chunking leads to rectangular objects
Fast index search
Fast index creation for large datasets ingested
every day

3
Data Chunking

Partition a multidimensional dataset into
coarse-grained hyper-rectangular blocks
Scientific datasets
Collection of multidimensional arrays
Have spatial/temporal locality
Sensor devices store data in the order it is
acquired, or simulations generate it that way
Results in tight bounding boxes
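
As a rough illustration of the chunking idea above, the following Python sketch groups points in acquisition order and computes one bounding box per chunk. The function names, the chunk size, and the toy (latitude, longitude, time) samples are assumptions made for this example, not part of the original slides.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner)


def bounding_box(points: Sequence[Point]) -> Box:
    """Return the minimum bounding hyper-rectangle of a non-empty point set."""
    dims = len(points[0])
    lower = tuple(min(p[d] for p in points) for d in range(dims))
    upper = tuple(max(p[d] for p in points) for d in range(dims))
    return lower, upper


def chunk_in_acquisition_order(points: Sequence[Point], chunk_size: int) -> List[Box]:
    """Group consecutive points into chunks and return one box per chunk."""
    return [bounding_box(points[i:i + chunk_size])
            for i in range(0, len(points), chunk_size)]


# Hypothetical sensor sweep: (latitude, longitude, time) samples stored
# in the order they were acquired.
samples = [(float(i // 3), float(i % 3), float(i)) for i in range(9)]
boxes = chunk_in_acquisition_order(samples, chunk_size=3)
```

Because neighbouring samples are also neighbours in storage order, each chunk covers a small region and its box stays tight. The index then only needs to store the boxes, not the individual points.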

4
Data Chunking

Figure: data and the bounding boxes. The problem space holds a dataset partitioned into 9 chunks (Chunk1 to Chunk9).

5
Spatial Indexing Structures

Space partitioning methods
Internal node: binary KD-tree
Dimension-independent fan-outs
KDB-trees, hB-trees, hB^Π-trees, Hybrid trees,
etc.
Data partitioning methods
Internal node: list of bounding boxes
Dimension-dependent internal node fan-outs
R-trees, R+-trees, R*-trees, X-trees, etc.

6
Space partitioning methods

KDB-trees
Can index only point data due to no overlap
Cascading split problem
Minimum node utilization is not guaranteed
Spatial KD-tree
Can index non-point data by allowing overlap
Memory based binary tree, not suitable for large
DB
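
To make the role of overlap concrete, here is a minimal Python sketch of a range search over a binary node that stores two boundaries per split, in the spirit of the spatial KD-tree described above. The `Node` layout and the field names `max_lower` and `min_upper` are assumptions chosen for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


def intersects(a: Box, b: Box) -> bool:
    """True if two hyper-rectangles share at least one point."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d]
               for d in range(len(a[0])))


@dataclass
class Node:
    dim: int = 0
    max_lower: float = 0.0  # upper boundary of the lower child
    min_upper: float = 0.0  # lower boundary of the upper child
    lower: Optional["Node"] = None
    upper: Optional["Node"] = None
    objects: Optional[List[Box]] = None  # set only on leaf nodes


def search(node: Node, query: Box) -> List[Box]:
    """Collect every stored box that intersects the query box."""
    if node.objects is not None:
        return [o for o in node.objects if intersects(o, query)]
    hits: List[Box] = []
    # Both children are visited when the query falls in the overlap region.
    if node.lower is not None and query[0][node.dim] <= node.max_lower:
        hits += search(node.lower, query)
    if node.upper is not None and query[1][node.dim] >= node.min_upper:
        hits += search(node.upper, query)
    return hits


lower_leaf = Node(objects=[((0.0, 0.0), (4.0, 1.0)), ((1.0, 0.0), (6.0, 1.0))])
upper_leaf = Node(objects=[((5.0, 0.0), (9.0, 1.0))])
root = Node(dim=0, max_lower=6.0, min_upper=5.0, lower=lower_leaf, upper=upper_leaf)

in_overlap = search(root, ((5.5, 0.0), (5.8, 1.0)))   # visits both children
outside = search(root, ((7.0, 0.0), (8.0, 1.0)))      # visits the upper child only
```

The wider the gap between `max_lower` and `min_upper`, the more queries have to descend into both children, which is why the splitting strategy tries to keep that overlap small.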

7
Space Partitioning Methods (2)

Hybrid Trees
Two split positions in one split dimension
When overlap-free split is not possible, allows
overlap of sub-regions
Point data only
Overlap allowed only in non-leaf nodes
SH-trees
New space partitioning method for non-point data
Combination of SKD-trees and Hybrid trees

8
Data partitioning methods

R-trees
Allows overlapping regions
Non-point data can be indexed
Large overlap leads to poor search performance
R*-trees
Optimized version of R-trees
Forced reinsertions (fast search, but expensive
insertion)

R+-trees
No overlap
Object duplication methods
Point data only
Infinite recursive split possible for rectangles

9
Data partitioning methods (2)

X-trees
Avoids highly overlapping regions via supernode,
which spans multiple pages on disk
Additional disk management overhead
Overlap-free split based on split history
Not always possible for non-point data

10
Node Splitting for SH-trees

Two split positions
Goal of node split is to minimize the overlap
(maxLower - minUpper)
Iterate over each dimension to find the minimum
(maxLower - minUpper), and split on that dimension

Figure: node split showing the two split positions, minUpper and maxLower.
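
The split rule on this slide can be sketched in Python as follows. The slide only states the objective (minimize maxLower minus minUpper over all dimensions); the way objects are assigned to the two groups here, by sorting on box centers and cutting at the median, is an assumption made for the example, as are all names.

```python
from typing import List, NamedTuple, Sequence, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


class Split(NamedTuple):
    overlap: float      # max_lower - min_upper; negative means a gap
    dim: int
    max_lower: float    # upper boundary of the lower group
    min_upper: float    # lower boundary of the upper group
    lower_group: List[Box]
    upper_group: List[Box]


def split_node(boxes: Sequence[Box]) -> Split:
    """Pick the dimension whose two split positions overlap the least.

    Requires at least two boxes so that both groups are non-empty.
    """
    best = None
    for d in range(len(boxes[0][0])):
        # Assumed grouping: order by box center along d, cut at the median.
        ordered = sorted(boxes, key=lambda b: b[0][d] + b[1][d])
        half = len(ordered) // 2
        lower_group, upper_group = ordered[:half], ordered[half:]
        max_lower = max(b[1][d] for b in lower_group)
        min_upper = min(b[0][d] for b in upper_group)
        candidate = Split(max_lower - min_upper, d, max_lower, min_upper,
                          lower_group, upper_group)
        if best is None or candidate.overlap < best.overlap:
            best = candidate
    return best


boxes = [((0.0, 0.0), (2.0, 10.0)), ((1.0, 0.0), (3.0, 10.0)),
         ((6.0, 0.0), (8.0, 10.0)), ((7.0, 0.0), (9.0, 10.0))]
split = split_node(boxes)
```

In this toy input the boxes are well separated along dimension 0 but all span the full range of dimension 1, so the sketch selects dimension 0.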
11
Default insertion into SH-trees

Dynamic split update
Update one of split positions to include the
newly inserted object
This may affect the bbx of other children
Cascading Overlap Problem
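
A minimal Python sketch of the dynamic split update follows. It assumes one plausible policy, namely that the new object goes to the child whose split position has to move the least; the slides do not specify the policy, and `SplitNode` and `insert_default` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


@dataclass
class SplitNode:
    dim: int
    max_lower: float  # upper boundary of the lower child
    min_upper: float  # lower boundary of the upper child


def insert_default(node: SplitNode, box: Box) -> str:
    """Widen one split position so that the chosen child covers the box."""
    lo, hi = box[0][node.dim], box[1][node.dim]
    grow_lower = max(0.0, hi - node.max_lower)   # growth if sent to the lower child
    grow_upper = max(0.0, node.min_upper - lo)   # growth if sent to the upper child
    if grow_lower <= grow_upper:
        node.max_lower = max(node.max_lower, hi)
        return "lower"
    node.min_upper = min(node.min_upper, lo)
    return "upper"


node = SplitNode(dim=0, max_lower=3.0, min_upper=6.0)   # children start disjoint
side = insert_default(node, ((2.5, 0.0), (7.0, 1.0)))
overlap_after = node.max_lower - node.min_upper
```

After this single insertion the two children overlap on dimension 0 although they were disjoint before. Every descendant of the widened child inherits the larger region, which is the effect the slide calls the cascading overlap problem.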

12
Greedy Insertion into SH-trees
Figure: default insertion compared with greedy insertion.

Move the minimum-overlap split to a higher level of
the internal KD-tree node
Expensive, O(N^2), with no guarantee of an optimal
split

13
Performance Evaluation

Platform
SunBlade 100 (500MHz Sparcv9, 256MB, 7200RPM IDE,
9ms seek time)
Turned off file cache
Kronos Landsat Dataset
3D AVHRR level 1B datasets (Latitude, Longitude,
Time)
One month (Jan. 1992), 30 GB
Workload generator (Customer Behavior Model
Graph)
Synthetic Dataset
Uniformly distributed 200,000 high dimensional
hyper-cubes in the unit hyper-cube
Implementations
R-trees, R*-trees, and X-trees from
www.rtreeportal.org, with some minor
modifications for fair comparison
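
A synthetic workload like the one listed above could be generated along these lines. The slides give only the object count and the uniform distribution, so the cube side length, the dimensionality, the seed, and the function name are all assumptions for this sketch.

```python
import random
from typing import List, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lower corner, upper corner)


def make_hypercubes(count: int, dims: int, side: float, seed: int = 0) -> List[Box]:
    """Place `count` axis-aligned hyper-cubes uniformly inside the unit hyper-cube."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    cubes = []
    for _ in range(count):
        # Draw the lower corner so the whole cube stays inside [0, 1]^dims.
        lower = tuple(rng.uniform(0.0, 1.0 - side) for _ in range(dims))
        upper = tuple(x + side for x in lower)
        cubes.append((lower, upper))
    return cubes


# Smaller than the 200,000 objects used in the evaluation, for a quick check.
cubes = make_hypercubes(count=1000, dims=8, side=0.05)
```

Fixing the seed keeps the dataset identical across index implementations, which matters when comparing I/O counts between trees.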

14
Index Creation for Kronos datasets

I/O: SH-tree (greedy/default) < R-tree < R-tree
< X-tree
Time: SH-tree (default) < R-tree < SH-tree
(greedy) < R-tree < X-tree

15
Index Search for Kronos datasets

Sum over 2,000 queries
I/O: SH-tree (greedy/default) < R-tree < X-tree
< R-tree
Time: SH-tree (greedy/default) < X-tree < R-tree
< R-tree
Greedy insertion algorithm does not improve
search performance much

16
Index Creation for Synthetic Dataset (High
Dimensions)

Average over 200,000 hyper-cubes
I/O: SH-tree (greedy/default) < R-tree < R-tree
< X-tree
Time: R-tree < SH-tree (default) < R-tree <
SH-tree (greedy) < X-tree
Size of X-tree root node is 667 pages

17
Index Search for Synthetic Dataset (High
Dimensions)

Average over 10,000 queries
I/O: SH-tree (greedy/default) < R-tree < X-tree
< R-tree
Time: SH-tree (greedy/default) < X-tree < R-tree
< R-tree
X-tree search time is fast, but SH-tree is
superior
SH-tree performance is almost independent of
dimensions

18
Conclusion

SH-trees outperform other trees for both
insertion and search
Future directions
Find better reorganization algorithm
Integrating SH-trees with Grid middleware (e.g.,
Storage Resource Broker, DataCutter) to
effectively index large distributed datasets

19
Back-up
20
Greedy Insertion