A Comparative Study of Spatial Indexing Techniques for Multidimensional Scientific Datasets

1
A Comparative Study of Spatial Indexing
Techniques for Multidimensional Scientific
Datasets
  • SSDBM 2004
  • Beomseok Nam and Alan Sussman
  • Department of Computer Science
  • University of Maryland, College Park

2
Motivation
  • Data Chunking
  • Huge scientific datasets (>100 GB)
  • Group individual point data
  • Efficient multidimensional indexing trees for
    scientific datasets
  • Data chunking leads to rectangular objects
  • Fast index search
  • Fast index creation for large datasets ingested
    every day

3
Data Chunking
  • Partition a multidimensional dataset into
    coarse-grained hyper-rectangular blocks
  • Scientific datasets
  • Collection of multidimensional arrays
  • Have spatial/temporal locality
  • Sensor devices store data in the order it is
    acquired, or simulations generate it that way
  • Results in tight bounding boxes
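
As a minimal sketch of the chunking idea above, the following Python partitions an array's index space into regular hyper-rectangular blocks and reports each block's bounding box. The function name `chunk_bounding_boxes` and the regular, index-space grid are assumptions made for illustration; the slides do not specify how chunk boundaries are chosen, and real bounding boxes would be computed from the data coordinates (e.g., latitude, longitude, time).

```python
from itertools import product

def chunk_bounding_boxes(shape, chunk_shape):
    """Yield (lower, upper) index-space bounding boxes, one per chunk.

    `upper` is exclusive; chunks at the array edge are clipped to `shape`.
    """
    # Starting offsets of the chunks along every dimension.
    starts_per_dim = [range(0, n, c) for n, c in zip(shape, chunk_shape)]
    for lower in product(*starts_per_dim):
        upper = tuple(min(lo + c, n)
                      for lo, c, n in zip(lower, chunk_shape, shape))
        yield lower, upper

# A 6 x 6 problem space split into 2 x 2 blocks gives a 3 x 3 grid of chunks.
boxes = list(chunk_bounding_boxes(shape=(6, 6), chunk_shape=(2, 2)))
print(len(boxes))  # 9
```

Each (lower, upper) pair is the kind of rectangular object that the indexing trees discussed in this talk would store.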

4
Data Chunking
  • Data and the bounding boxes

[Figure: problem space; dataset partitioned into 9 chunks (Chunk1 to Chunk9), each with its bounding box]
5
Spatial Indexing Structures
  • Space partitioning methods
  • Internal node: binary KD-tree
  • Fan-out independent of dimension
  • KDB-trees, hB-trees, hBΠ-trees, Hybrid trees,
    etc.
  • Data partitioning methods
  • Internal node: list of bounding boxes
  • Dimension-dependent internal node fan-outs
  • R-trees, R+-trees, R*-trees, X-trees, etc.

6
Space partitioning methods
  • KDB-trees
  • Can index only point data due to no overlap
  • Cascading split problem
  • Minimum node utilization is not guaranteed
  • Spatial KD-tree
  • Can index non-point data by allowing overlap
  • Memory based binary tree, not suitable for large
    DB

7
Space Partitioning Methods (2)
  • Hybrid Trees
  • Two split positions in one split dimension
  • When overlap-free split is not possible, allows
    overlap of sub-regions
  • Point data only
  • Overlap allowed only in non-leaf nodes
  • SH-trees
  • New space partitioning method for non-point data
  • Combination of SKD-trees and Hybrid trees

8
Data partitioning methods
  • R-trees
  • Allows overlapping regions
  • Non-point data can be indexed
  • Large overlap leads to poor search performance
  • R*-trees
  • Optimized version of R-trees
  • Forced reinsertions (fast search, but expensive
    insertion)
  • R+-trees
  • No overlap
  • Object duplication methods
  • Point data only
  • Infinite recursive split possible for rectangles
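
To illustrate why large overlap hurts search in data partitioning trees, here is a small sketch that counts how many child bounding boxes a range query has to descend into. The helper names `intersects` and `children_to_visit` and the example coordinates are hypothetical, not taken from any of the tree implementations compared in this talk.

```python
def intersects(a, b):
    """True if boxes a and b, each a (lower, upper) pair of tuples, share a point."""
    (alo, ahi), (blo, bhi) = a, b
    return all(l1 <= h2 and l2 <= h1
               for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi))

def children_to_visit(child_boxes, query):
    """Indices of the child entries whose bounding box intersects the query."""
    return [i for i, box in enumerate(child_boxes) if intersects(box, query)]

disjoint = [((0, 0), (4, 4)), ((5, 0), (9, 4))]
overlapping = [((0, 0), (6, 4)), ((3, 0), (9, 4))]
query = ((4, 1), (4.5, 2))

print(children_to_visit(disjoint, query))     # [0]
print(children_to_visit(overlapping, query))  # [0, 1]
```

With overlapping children the same query follows two subtrees instead of one, and that extra work compounds at every level of the tree.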

9
Data partitioning methods (2)
  • X-trees
  • Avoids highly overlapping regions via supernode,
    which spans multiple pages on disk
  • Additional disk management overhead
  • Overlap-free split based on split history
  • Not always possible for non-point data

10
Node Splitting for SH-trees
  • Two split positions
  • Goal of node split is to minimize (maxLower -
    minUpper), the overlap between the two sides
  • Iterate over each dimension to find the minimum
    (maxLower - minUpper), and split that dimension

[Figure: node split showing the minUpper and maxLower split positions]
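
The split rule on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: objects are ordered by their center along each dimension and divided into two equal halves (the slides do not say how the two groups are formed), `maxLower` is taken to be the largest upper bound in the lower group, and `minUpper` the smallest lower bound in the upper group. The function name `choose_split` is hypothetical.

```python
def choose_split(boxes):
    """Return (dimension, max_lower, min_upper) minimizing max_lower - min_upper.

    `boxes` is a list of (lower, upper) coordinate tuples with at least two entries.
    """
    dims = len(boxes[0][0])
    best = None
    for d in range(dims):
        # Assumption: order objects by center along d and cut the list in half.
        ordered = sorted(boxes, key=lambda b: b[0][d] + b[1][d])
        half = len(ordered) // 2
        lower_group, upper_group = ordered[:half], ordered[half:]
        max_lower = max(hi[d] for _, hi in lower_group)
        min_upper = min(lo[d] for lo, _ in upper_group)
        if best is None or max_lower - min_upper < best[1] - best[2]:
            best = (d, max_lower, min_upper)
    return best

boxes = [((0, 0), (2, 5)), ((1, 4), (3, 9)), ((6, 1), (8, 6)), ((7, 3), (9, 8))]
print(choose_split(boxes))  # (0, 3, 6)
```

In this example dimension 0 separates the two groups with a gap (3 - 6 = -3), while dimension 1 would leave an overlap of width 3, so dimension 0 is chosen.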
11
Default insertion into SH-trees
  • Dynamic split update
  • Update one of split positions to include the
    newly inserted object
  • This may affect the bounding boxes of other children
  • Cascading Overlap Problem
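
A one-dimensional sketch of the dynamic split update described above: inserting an object on the lower side of a split may push that side's split position past the upper side's, which creates overlap that later searches must pay for. The function name `update_lower_split` and the numbers are hypothetical, and the sketch only models a single pair of split positions, not a full SH-tree node.

```python
def update_lower_split(max_lower, min_upper, obj_hi):
    """Extend the lower side's split position so it covers an object ending at obj_hi.

    Returns the updated max_lower and the resulting overlap width with the upper side.
    """
    new_max_lower = max(max_lower, obj_hi)
    overlap = max(0, new_max_lower - min_upper)
    return new_max_lower, overlap

# Before insertion the two sides are separated: max_lower = 3, min_upper = 6.
print(update_lower_split(3, 6, obj_hi=2))  # (3, 0)  object already covered
print(update_lower_split(3, 6, obj_hi=7))  # (7, 1)  the sides now overlap on [6, 7]
```

Because a split position inside an internal KD-tree bounds every child beneath it, widening it this way enlarges the region of all of those children, not only the one that received the new object.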

12
Greedy Insertion into SH-trees
[Figure: default vs. greedy insertion]
  • Move minimum overlapping split into high level in
    an internal KD-tree node
  • Expensive, O(N^2), with no guarantee of optimal
    split

13
Performance Evaluation
  • Platform
  • SunBlade 100 (500MHz Sparcv9, 256MB, 7200RPM IDE,
    9ms seek time)
  • Turned off file cache
  • Kronos Landsat Dataset
  • 3D AVHRR level 1B datasets (Latitude, Longitude,
    Time)
  • One month (Jan. 1992), 30 GB
  • Workload generator (Customer Behavior Model
    Graph)
  • Synthetic Dataset
  • Uniformly distributed 200,000 high dimensional
    hyper-cubes in the unit hyper-cube
  • Implementations
  • R-trees, R*-trees, and X-trees from
    www.rtreeportal.org, with some minor
    modifications for fair comparison
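
A synthetic dataset like the one described above could be generated roughly as follows. The slides give only the count (200,000) and the uniform distribution; the dimensionality, the side length, the seed, and the function name `random_hypercubes` are assumptions chosen for illustration.

```python
import random

def random_hypercubes(n, dims, side, seed=0):
    """Return n axis-aligned hyper-cubes of the given side inside the unit hyper-cube.

    Lower corners are drawn uniformly so that every cube fits within [0, 1]^dims.
    """
    rng = random.Random(seed)
    cubes = []
    for _ in range(n):
        lower = tuple(rng.uniform(0.0, 1.0 - side) for _ in range(dims))
        upper = tuple(lo + side for lo in lower)
        cubes.append((lower, upper))
    return cubes

# Hypothetical parameters: 1,000 cubes in 8 dimensions with side 0.05.
cubes = random_hypercubes(n=1000, dims=8, side=0.05)
print(len(cubes), len(cubes[0][0]))  # 1000 8
```

Fixing the seed keeps the dataset reproducible, so every index structure under comparison is built from the same input.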

14
Index Creation for Kronos datasets
  • I/O: SH-tree (greedy/default) < R-tree < R-tree
    < X-tree
  • Time: SH-tree (default) < R-tree < SH-tree
    (greedy) < R-tree < X-tree

15
Index Search for Kronos datasets
  • Sum over 2,000 queries
  • I/O: SH-tree (greedy/default) < R-tree < X-tree
    < R-tree
  • Time: SH-tree (greedy/default) < X-tree < R-tree
    < R-tree
  • Greedy insertion algorithm does not improve
    search performance much

16
Index Creation for Synthetic Dataset (High
Dimensions)
  • Average over 200,000 hyper-cubes
  • I/O: SH-tree (greedy/default) < R-tree < R-tree
    < X-tree
  • Time: R-tree < SH-tree (default) < R-tree <
    SH-tree (greedy) < X-tree
  • Size of X-tree root node is 667 pages

17
Index Search for Synthetic Dataset (High
Dimensions)
  • Average over 10,000 queries
  • I/O: SH-tree (greedy/default) < R-tree < X-tree
    < R-tree
  • Time: SH-tree (greedy/default) < X-tree < R-tree
    < R-tree
  • X-tree search time is fast, but SH-tree is
    superior
  • SH-tree performance is almost independent of
    dimensions

18
Conclusion
  • SH-trees outperform other trees for both
    insertion and search
  • Future directions
  • Find a better reorganization algorithm
  • Integrate SH-trees with Grid middleware (e.g.,
    Storage Resource Broker, DataCutter) to
    effectively index large distributed datasets

19
Back-up
20
Greedy Insertion
  • Move minimum overlapping split into high level in
    an internal KD-tree node.

21
Cascading Overlap Problem
[Figure: default vs. greedy insertion]
  • Greedy Insertion
  • Move minimum overlapping split into high level in
    an internal KD-tree node
  • Expensive, O(N^2), with no guarantee of optimal
    split

22
Greedy Insertion Anomaly
23
Greedy Insertion Anomaly (2)