Title: A Comparative Study of Spatial Indexing Techniques for Multidimensional Scientific Datasets
1A Comparative Study of Spatial Indexing Techniques for Multidimensional Scientific Datasets
- SSDBM 2004
- Beomseok Nam and Alan Sussman
- Department of Computer Science
- University of Maryland, College Park
2Motivation
- Data Chunking
- Huge scientific datasets (>100GB)
- Group individual point data
- Efficient multidimensional indexing trees for scientific datasets
- Data chunking leads to rectangular objects
- Fast index search
- Fast index creation for large datasets ingested every day
3Data Chunking
- Partition a multidimensional dataset into coarse-grained hyper-rectangular blocks
- Scientific datasets
- Collection of multidimensional arrays
- Have spatial/temporal locality
- Sensor devices store data in the order it is acquired, or simulations generate it that way
- Results in tight bounding boxes
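As a rough illustration of chunking, the sketch below tiles a multidimensional array with hyper-rectangular chunks and returns each chunk's bounding box in index space. The regular grid, the function name `chunk_bounding_boxes`, and the half-open bounds are assumptions made for the example; the slides do not say how the chunks are actually formed.

```python
from itertools import product

def chunk_bounding_boxes(shape, chunk_shape):
    """Yield (lower, upper) index bounding boxes that tile an array of `shape`.

    Bounds are half-open: lower[d] <= index < upper[d].
    Hypothetical helper: a regular grid is assumed for illustration only.
    """
    # Number of chunks along each dimension, rounding up for partial chunks.
    counts = [-(-n // c) for n, c in zip(shape, chunk_shape)]
    for grid_pos in product(*(range(k) for k in counts)):
        lower = tuple(g * c for g, c in zip(grid_pos, chunk_shape))
        # Clip the last chunk in each dimension to the array boundary.
        upper = tuple(min(lo + c, n)
                      for lo, c, n in zip(lower, chunk_shape, shape))
        yield lower, upper

# A 6 x 6 array split into 2 x 2 chunks gives a 3 x 3 grid of bounding boxes.
boxes = list(chunk_bounding_boxes((6, 6), (2, 2)))
```

An index built over `boxes` then stores one rectangle per chunk instead of one entry per data point.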
4Data Chunking
- Data and the bounding boxes
[Figure: problem space with the dataset partitioned into 9 chunks, Chunk1 to Chunk9, each enclosed by its bounding box]
5Spatial Indexing Structures
- Space partitioning methods
- Internal node: binary KD-tree
- Internal node fan-out is independent of the number of dimensions
- KDB-trees, hB-trees, hBΠ-trees, Hybrid trees, etc.
- Data partitioning methods
- Internal node: list of bounding boxes
- Internal node fan-out depends on the number of dimensions
- R-trees, R+-trees, R*-trees, X-trees, etc.
6Space partitioning methods
- KDB-trees
- Can index only point data, because overlap is not allowed
- Cascading split problem
- Minimum node utilization is not guaranteed
- Spatial KD-tree (SKD-tree)
- Can index non-point data by allowing overlap
- Memory-based binary tree, not suitable for large databases
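To see why a strictly overlap-free split handles points but not rectangles, the sketch below sorts rectangles against a single split plane. Any rectangle that straddles the plane fits neither side, so an index must either cut or duplicate it, or allow the two sides to overlap. The `(lower, upper)` tuple representation and the name `classify` are assumptions for illustration.

```python
def classify(rects, dim, split):
    """Sort rectangles into left / right / straddling for one split plane.

    Each rectangle is (lower, upper), two tuples of coordinates.
    Hypothetical helper, not taken from any of the indexes discussed.
    """
    left, right, straddling = [], [], []
    for lower, upper in rects:
        if upper[dim] <= split:
            left.append((lower, upper))        # entirely below the plane
        elif lower[dim] >= split:
            right.append((lower, upper))       # entirely above the plane
        else:
            straddling.append((lower, upper))  # crosses the plane
    return left, right, straddling

rects = [((0, 0), (2, 2)), ((5, 1), (7, 3)), ((3, 0), (6, 1))]
left, right, straddling = classify(rects, dim=0, split=4)
```

Here the third rectangle spans x = 3 to x = 6 and crosses the plane at x = 4, which is the case a point-only index never has to handle.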
7Space Partitioning Methods (2)
- Hybrid Trees
- Two split positions in one split dimension
- When an overlap-free split is not possible, allows overlap of sub-regions
- Point data only
- Overlap allowed only in non-leaf nodes
- SH-trees
- New space partitioning method for non-point data
- Combination of SKD-trees and Hybrid trees
8Data partitioning methods
- R-trees
- Allows overlapping regions
- Non-point data can be indexed
- Large overlap leads to poor search performance
- R*-trees
- Optimized version of R-trees
- Forced reinsertions (fast search, but expensive insertion)
- R+-trees
- No overlap
- Object duplication methods
- Point data only
- Infinite recursive split possible for rectangles
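The cost of overlap during search can be sketched with a simple intersection test: a range query has to descend into every child whose bounding box it intersects, so overlapping children mean more subtrees to visit. The helper names and the `(lower, upper)` box representation are assumptions for illustration.

```python
def intersects(a, b):
    """True if two axis-aligned boxes, each (lower, upper), intersect."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi))

def children_to_visit(child_boxes, query):
    """Indices of the child bounding boxes a range query must descend into."""
    return [i for i, box in enumerate(child_boxes) if intersects(box, query)]

query = ((3, 1), (4, 2))
disjoint = [((0, 0), (4, 4)), ((5, 0), (9, 4))]
overlapping = [((0, 0), (6, 4)), ((3, 0), (9, 4))]

visits_disjoint = children_to_visit(disjoint, query)        # one subtree
visits_overlapping = children_to_visit(overlapping, query)  # both subtrees
```

With disjoint children the query follows a single path; with overlapping children the same query has to follow both.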
9Data partitioning methods (2)
- X-trees
- Avoids highly overlapping regions via a supernode, which spans multiple pages on disk
- Additional disk management overhead
- Overlap-free split based on split history
- Not always possible for non-point data
10Node Splitting for SH-trees
- Two split positions
- Goal of node split is to minimize (maxLower - minUpper)
- Iterate over each dimension to find the minimum (maxLower - minUpper), and split on that dimension
[Figure: a node split showing the two split positions, minUpper and maxLower]
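A minimal sketch of this split selection follows, under several assumptions that the slides do not confirm: entries are sorted by their center along each dimension and divided into two equal halves, maxLower is read as the largest upper bound in the lower half, and minUpper as the smallest lower bound in the upper half. The function name `choose_split` is hypothetical.

```python
def choose_split(rects):
    """Return (dim, max_lower, min_upper) minimizing max_lower - min_upper.

    Each rectangle is (lower, upper). The half-and-half grouping by center
    is an assumption for illustration, not the SH-tree's documented rule.
    Requires at least two rectangles.
    """
    ndim = len(rects[0][0])
    best = None
    for dim in range(ndim):
        # Order entries by center (lower + upper is twice the center).
        ordered = sorted(rects, key=lambda r: r[0][dim] + r[1][dim])
        half = len(ordered) // 2
        lower_group, upper_group = ordered[:half], ordered[half:]
        max_lower = max(upper[dim] for _, upper in lower_group)
        min_upper = min(lower[dim] for lower, _ in upper_group)
        gap = max_lower - min_upper  # positive means the halves overlap
        if best is None or gap < best[0]:
            best = (gap, dim, max_lower, min_upper)
    _, dim, max_lower, min_upper = best
    return dim, max_lower, min_upper

rects = [((0, 0), (2, 5)), ((1, 4), (3, 9)), ((6, 1), (8, 6)), ((7, 3), (9, 8))]
split = choose_split(rects)
```

For these four rectangles, dimension 0 separates the halves cleanly (difference of -3), while dimension 1 would leave them overlapping (difference of 3), so dimension 0 is chosen.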
11Default insertion into SH-trees
- Dynamic split update
- Update one of the split positions to include the newly inserted object
- This may affect the bounding box (bbx) of other children
- Cascading Overlap Problem
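The dynamic split update can be sketched for a single split node: inserting an object moves one of the two split positions outward, and repeated insertions can turn an overlap-free split into an overlapping one. The `SplitNode` class, its field names, and the rule of growing whichever side needs less enlargement are assumptions for illustration; a real internal node holds a whole KD-tree of such splits, which is where a change can spread to other children.

```python
from dataclasses import dataclass

@dataclass
class SplitNode:
    """One split with two positions (hypothetical model of an SH-tree split)."""
    dim: int
    max_lower: float  # upper boundary of the lower partition
    min_upper: float  # lower boundary of the upper partition

    def overlap(self):
        """Width of the region covered by both partitions (0 if disjoint)."""
        return max(0.0, self.max_lower - self.min_upper)

    def insert(self, lower, upper):
        """Extend one split position to cover the new object; return the side."""
        lo, hi = lower[self.dim], upper[self.dim]
        grow_lower = max(0.0, hi - self.max_lower)
        grow_upper = max(0.0, self.min_upper - lo)
        # Assumed rule: grow the side that needs the smaller enlargement.
        if grow_lower <= grow_upper:
            self.max_lower = max(self.max_lower, hi)
            return "lower"
        self.min_upper = min(self.min_upper, lo)
        return "upper"

node = SplitNode(dim=0, max_lower=3.0, min_upper=6.0)
first = node.insert((2, 0), (5, 1))   # lower side grows to 5
second = node.insert((4, 0), (8, 1))  # upper side grows down to 4
```

After the second insertion the two partitions overlap between 4 and 5, even though the split started with a gap between 3 and 6.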
12Greedy Insertion into SH-trees
[Figure: comparison of default and greedy insertion]
- Move the minimum-overlap split to a higher level in an internal KD-tree node
- Expensive, O(N^2), with no guarantee of an optimal split
13Performance Evaluation
- Platform
- SunBlade 100 (500MHz Sparcv9, 256MB, 7200RPM IDE, 9ms seek time)
- Turned off file cache
- Kronos Landsat Dataset
- 3D AVHRR level 1B datasets (Latitude, Longitude, Time)
- One month (Jan. 1992), 30GB
- Workload generator (Customer Behavior Model Graph)
- Synthetic Dataset
- Uniformly distributed 200,000 high-dimensional hyper-cubes in the unit hyper-cube
- Implementations
- R-trees, R*-trees, and X-trees from www.rtreeportal.org, with some minor modifications for fair comparison
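A synthetic dataset of this kind could be generated roughly as follows. The side length, the seed, and the function name `random_hypercubes` are assumptions; the slides give only the count (200,000) and the uniform distribution in the unit hyper-cube.

```python
import random

def random_hypercubes(count, ndim, side, seed=0):
    """Return `count` axis-aligned hyper-cubes inside the unit hyper-cube.

    Each cube is (lower, upper). Lower corners are drawn uniformly so that
    the cube of edge length `side` stays within [0, 1] in every dimension.
    Hypothetical generator; the slides do not state the cube size.
    """
    rng = random.Random(seed)  # fixed seed for repeatable experiments
    cubes = []
    for _ in range(count):
        lower = tuple(rng.uniform(0.0, 1.0 - side) for _ in range(ndim))
        upper = tuple(x + side for x in lower)
        cubes.append((lower, upper))
    return cubes

# A small sample; the experiment in the slides uses 200,000 cubes.
cubes = random_hypercubes(count=1000, ndim=8, side=0.05)
```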
14Index Creation for Kronos datasets
- I/O: SH-tree (greedy/default) < R-tree < R-tree < X-tree
- Time: SH-tree (default) < R-tree < SH-tree (greedy) < R-tree < X-tree
15Index Search for Kronos datasets
- Sum over 2,000 queries
- I/O: SH-tree (greedy/default) < R-tree < X-tree < R-tree
- Time: SH-tree (greedy/default) < X-tree < R-tree < R-tree
- Greedy insertion algorithm does not improve search performance much
16Index Creation for Synthetic Dataset (High Dimensions)
- Average over 200,000 hyper-cubes
- I/O: SH-tree (greedy/default) < R-tree < R-tree < X-tree
- Time: R-tree < SH-tree (default) < R-tree < SH-tree (greedy) < X-tree
- Size of X-tree root node is 667 pages
17Index Search for Synthetic Dataset (High Dimensions)
- Average over 10,000 queries
- I/O: SH-tree (greedy/default) < R-tree < X-tree < R-tree
- Time: SH-tree (greedy/default) < X-tree < R-tree < R-tree
- X-tree search time is fast, but SH-tree is superior
- SH-tree performance is almost independent of dimensions
18Conclusion
- SH-trees outperform other trees for both insertion and search
- Future directions
- Find a better reorganization algorithm
- Integrate SH-trees with Grid middleware (e.g., Storage Resource Broker, DataCutter) to effectively index large distributed datasets
19Back-up
20Greedy Insertion
- Move the minimum-overlap split to a higher level in an internal KD-tree node
21Cascading Overlap Problem
[Figure: comparison of default and greedy insertion]
- Greedy Insertion
- Move the minimum-overlap split to a higher level in an internal KD-tree node
- Expensive, O(N^2), with no guarantee of an optimal split
22Greedy Insertion Anomaly
23Greedy Insertion Anomaly (2)