Title: Generalized Multidimensional Data Mapping and Query Processing GiMP
1. Generalized Multi-dimensional Data Mapping and
Query Processing (GiMP)
- Authors: Rui Zhang et al.
- ACM TODS 2005
- Presented by Youngdae Kim at IDS Lab.
- 18 Sep, 2007
2. Background
- Multi-dimensional data
- spatial data
- geographic information
- ex) Pohang located at (129, 35)
- object with many fields
- ex) employee relation with fields id, salary, name, age, address, ...
- Queries
- point query
- give me object(s) located at (3, 5)
- give me employee(s) with age = 35 and name = Jack
- range query (window query)
- give me all objects whose locations overlap the ranges [3, 7] and [4, 6]
- kNN query
- give me the k nearest neighbors of object a
[Figure: two-dimensional data space with axes d1 and d2, each marked from 0 to 5]
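The three query types above can be illustrated with a brute-force sketch. The data set and the function names `point_query`, `range_query`, and `knn_query` are made up for illustration and do not come from the paper; a real system would answer these queries through an index rather than a full scan.

```python
import math

# Hypothetical toy data set: 2-D points in the (d1, d2) plane.
points = [(1, 1), (3, 5), (4, 4), (5, 6), (6, 5), (8, 2)]

def point_query(data, target):
    """Return every object located exactly at `target`."""
    return [p for p in data if p == target]

def range_query(data, lo, hi):
    """Return objects inside the window [lo[i], hi[i]] in every dimension."""
    return [p for p in data
            if all(l <= x <= h for x, l, h in zip(p, lo, hi))]

def knn_query(data, q, k):
    """Return the k objects nearest to q under Euclidean distance."""
    return sorted(data, key=lambda p: math.dist(p, q))[:k]
```

For example, `range_query(points, (3, 4), (7, 6))` scans every point and keeps those with d1 in [3, 7] and d2 in [4, 6].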
3. Background (cont.)
- Index structure
- R-tree
- recursively packs nearby regions into bounding rectangles
- does not use the stable DBMS index structure (e.g., B-tree)
- not easy to integrate with current DBMSs (complicated concurrency and recovery problems exist)
- why not use B-tree?
- not easy to assign orders (or keys) to multi-dimensional data sequentially while preserving their proximity
- but efficiency and reliability are high if we can use B-tree
[Figure: R-tree example; points that are close in multi-d space: are they still close after mapping to one-d?]
4. Mapping-based Indexing Schemes
- General strategy
- mapping
- multi-dimensional data -> one-dimensional data (key)
- one-to-one or many-to-one
- query processing using B-tree
- transform multi-dimensional query into key range(s)
- get matched entries using B-tree
- discard false positives
- we obtain a superset of the answers -> irrelevant data may be included
- discard them
- Examples
- UB-tree, Pyramid technique, iMinMax, iDistance
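The general strategy above (map, search key ranges, discard false positives) can be sketched as follows. This is only an illustration under stated assumptions: the mapping `key_of` (Euclidean distance from the origin) is a made-up many-to-one mapping, not one of the schemes listed above, and a sorted Python list stands in for the B-tree.

```python
import bisect
import math

def key_of(p):
    # Hypothetical many-to-one mapping: Euclidean distance from the origin.
    return math.dist(p, (0,) * len(p))

def build_index(data):
    # A sorted list of (key, point) pairs stands in for the B-tree.
    return sorted((key_of(p), p) for p in data)

def window_query(index, lo, hi):
    # 1. Transform the window into a key range. With non-negative
    #    coordinates, `lo` is the corner nearest the origin and `hi`
    #    the farthest, so every point in the window has a key in between.
    k_lo, k_hi = key_of(lo), key_of(hi)
    # 2. Fetch all entries whose keys fall in [k_lo, k_hi].
    start = bisect.bisect_left(index, (k_lo,))
    candidates = []
    for k, p in index[start:]:
        if k > k_hi:
            break
        candidates.append(p)
    # 3. Discard false positives by checking the original window.
    answers = [p for p in candidates
               if all(l <= x <= h for x, l, h in zip(p, lo, hi))]
    return candidates, answers
```

The candidate list is a superset of the answers; how much larger it is depends entirely on the chosen mapping.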
5. Observations
- Crux of mapping-based indexing scheme
- mapping method
- key = distance from reference point + scattering factor
- query transformation
- multi-dimensional window query transformed into one-dimensional range query
- for kNN query, use the incremental mapping mechanism
[Figure: point p1 at some distance from reference point r; key (p1) = distance + scattering factor]
6. Contributions
- Generalizes the mapping-based indexing and query processing process (GiMP)
- defines a framework for easy extension
- cf) GiST generalizes tree-search indexing schemes
- Suggests a measurement to predict the performance of mapping-based indexing schemes
- Solves the mappability problem
- Is there a one-to-one mapping for a given data space?
7. GiMP Structure
[Figure: GiMP structure, layered on a B-tree]
- Data mapping components: Reference(P), Distance(P1, P2), Base(P)
- Queries: point query, range query, nearest neighbor
- Query components: MapRange(rg), MapAnnulus(Q, rmin, rmax)
- Basic operations: insert, delete
- Basic operation components: Insert(P), Delete(P)
8. GiMP Data Mapping
- Components
- Reference (P)
- reference point for P
- ex) starting point with Z-value 0
- Distance (P1, P2)
- distance between P1 and P2 in multi-dimensional space
- can be L1, Euclidean, Max, or any user-defined distance
- Base (P)
- value to be added to the transformed value
- usually used for scattering keys
- Key (P) = Base (P) + Distance (P, Reference (P))
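As a minimal sketch of how the three components combine into a key, the helper below composes user-supplied `reference`, `distance`, and `base` callables. The helper name `make_key_function` and the example instantiation (a single reference point at the origin, Euclidean distance, zero base) are assumptions made for illustration, not part of the paper.

```python
import math

def make_key_function(reference, distance, base):
    """Compose the data-mapping components into Key(P)."""
    def key(p):
        # Key(P) = Base(P) + Distance(P, Reference(P))
        return base(p) + distance(p, reference(p))
    return key

# Hypothetical instantiation: one reference point, Euclidean distance, no scattering.
origin = (0.0, 0.0)
euclidean_key = make_key_function(
    reference=lambda p: origin,
    distance=math.dist,
    base=lambda p: 0.0,
)

# Swapping in a user-defined distance (here L1) only changes one component.
l1_key = make_key_function(
    reference=lambda p: origin,
    distance=lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    base=lambda p: 0.0,
)
```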
9. GiMP Query Processing
- Components
- MapRange (rg)
- transform given range (rg) into key range
- MapAnnulus (Q, rmin, rmax)
- transform given annulus into one-dimensional intervals, usually incremental mapping
- used for kNN search
[Figure: MapRange yields a set of intervals [a1, b1], [a2, b2], ..., [an, bn]; MapAnnulus maps the annulus between rmin and rmax to a set of intervals [a1, b1], [a2, b2]]
10. GiMP Basic Operations
- Components
- Insert (P)
- calculate Key (P) and insert into B-tree using the usual B-tree insertion operation
- Delete (P)
- use the usual B-tree deletion operation
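The two basic operations can be sketched with a sorted list standing in for the B-tree. The class name `SortedKeyIndex` and the use of `bisect` are illustrative assumptions; a real deployment would delegate to the DBMS's own B-tree insertion and deletion.

```python
import bisect

class SortedKeyIndex:
    """Sorted list of (key, point) pairs standing in for the B-tree."""

    def __init__(self, key_fn):
        self.key_fn = key_fn      # the Key(P) function of the chosen mapping
        self.entries = []

    def insert(self, p):
        # Calculate Key(P), then do an ordinary ordered insertion.
        bisect.insort(self.entries, (self.key_fn(p), p))

    def delete(self, p):
        # Locate the entry by its key, then remove it if present.
        entry = (self.key_fn(p), p)
        i = bisect.bisect_left(self.entries, entry)
        if i < len(self.entries) and self.entries[i] == entry:
            del self.entries[i]
            return True
        return False
```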
11. GiMP UB-tree Instantiation
- Data mapping (one-to-one)
- use Z-value to map multi-dimensional data
- P -> Z-value, one-to-one mapping
- Reference (P) = the point with Z-value 0
- Distance (P1, P2) = difference of Z-values
- Base (P) = 0, since Z-value mapping is one-to-one
- Key (P) = Base (P) + Distance (P, Reference (P)) = Z-value of P
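A Z-value is obtained by interleaving the bits of the coordinates. The sketch below handles the two-dimensional case on a small integer grid; the function name `z_value`, the 3-bit default, and the choice to put the first coordinate in the even bit positions are assumptions, since the slides do not fix a bit order.

```python
def z_value(x, y, bits=3):
    """Interleave the bits of x and y into a single Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # bit i of x goes to an even position
        z |= ((y >> i) & 1) << (2 * i + 1)    # bit i of y goes to an odd position
    return z
```

Because every grid cell receives a distinct Z-value, no scattering factor is needed, which is why Base (P) is 0 in this instantiation.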
12. GiMP UB-tree Instantiation (cont.)
- Query processing
- MapRange (rg)
- find the Z-value range(s) corresponding to rg
- ex) suppose rg is the orange region
[Figure: the orange region maps to the intervals to search, [12, 15] and [24, 27], which are passed to the B-tree search]
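A brute-force version of MapRange for Z-values can be sketched as below: enumerate the grid cells in the window, compute their Z-values, and merge consecutive values into intervals. This is an illustration only; practical UB-tree implementations compute the intervals without enumerating cells. The window used in the usage note is a hypothetical choice that happens to give the same two intervals as the example above under the assumed bit order; the actual region on the slide is not recoverable.

```python
def z_value(x, y, bits=3):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def map_range(x_lo, x_hi, y_lo, y_hi, bits=3):
    """Return the Z-value intervals covering an inclusive integer window."""
    zs = sorted(z_value(x, y, bits)
                for x in range(x_lo, x_hi + 1)
                for y in range(y_lo, y_hi + 1))
    intervals = []
    for z in zs:
        if intervals and z == intervals[-1][1] + 1:
            intervals[-1][1] = z          # extend the current run
        else:
            intervals.append([z, z])      # start a new run
    return [tuple(iv) for iv in intervals]
```

With this bit order, `map_range(2, 5, 2, 3)` yields the intervals (12, 15) and (24, 27).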
13. GiMP Pyramid Instantiation
- Data mapping (many-to-one)
- divide the d-dimensional space into 2d pyramids that share the center point of the space as their top and have a (d-1)-dimensional surface of the data space as their base
- each of the 2d pyramids is divided into several partitions
- each data point has a height
- key (P) = height of P + pyramid number
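[Figure: data space divided into pyramids p0 to p3 around the center point; each pyramid has a (d-1)-dimensional surface as its base and is split into partitions; the height of point v is marked]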
14. GiMP Pyramid Instantiation (cont.)
- Data mapping (many-to-one) (cont.)
- Reference (P) = center point
- Distance (P, Reference (P)) = height of P
- Base (P) = pyramid number
- Key (P) = Base (P) + Distance (P, Reference (P)) = pyramid number + height of P
- ex) assume height of v = 2.5, then Key (v) = 1 + 2.5 = 3.5
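The key computation can be sketched as follows, using the commonly described formulation of the Pyramid technique on a unit hypercube. Normalizing to [0, 1]^d is an assumption made here so that heights stay below 1 and keys of different pyramids do not overlap; the example above uses a height of 2.5, which suggests a differently scaled space. The function name `pyramid_key` and the pyramid numbering are likewise illustrative.

```python
def pyramid_key(v):
    """Return (pyramid number, height, key) for point v in [0, 1]^d."""
    d = len(v)
    # The pyramid is determined by the dimension in which v deviates most
    # from the center point (0.5, ..., 0.5).
    j = max(range(d), key=lambda i: abs(0.5 - v[i]))
    height = abs(0.5 - v[j])
    # Pyramids 0..d-1 lie on the low side of the center, d..2d-1 on the high side.
    pyramid = j if v[j] < 0.5 else j + d
    # Key(P) = Base(P) + Distance(P, Reference(P)) = pyramid number + height
    return pyramid, height, pyramid + height
```

Many points share the same key (all points at the same height in the same pyramid), which is what makes this mapping many-to-one.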
15. GiMP Pyramid Instantiation (cont.)
- Query processing
- MapRange (rg)
- find the key range for the partitions which overlap rg
- ex) suppose rg is the dark-shaded region
[Figure: pyramids p0 to p3 over axes d0 and d1; MapRange returns the corresponding intervals for the light-shaded partitions]
16. GiMP Pyramid Instantiation (cont.)
- Query processing (cont.)
- MapAnnulus (Q, rmin, rmax)
- incremental key range search
- ex) suppose we first try (a) and then (b) for kNN search
- at (a), range query transforms to [2 + hQ - r0, 2 + hQ + r0] for pyr2
- save the lower bound (2 + hQ - r0) and upper bound (2 + hQ + r0)
- at (b), range query transforms to [2, 2 + hQ - r0] and [2 + hQ + r0, 2 + hQ + r0 + dr] for pyr2 -> the keys to be searched form a continuous range
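The incremental step in this example can be sketched for a single pyramid as below. It is a deliberately simplified, one-dimensional view that mirrors the example: the query is represented only by its height hQ inside the pyramid, and the search radius grows from `r_min` to `r_max`. The function name `annulus_intervals` and the `h_max` clamp (0.5 for a unit hypercube) are assumptions; a full Pyramid-technique implementation has to derive the height range per pyramid from the query geometry.

```python
def annulus_intervals(base, h_q, r_min, r_max, h_max):
    """Key intervals newly covered when the radius grows from r_min to r_max.

    base  : pyramid number (e.g. 2 for pyr2)
    h_q   : height of the query point in that pyramid
    h_max : largest possible height, so keys stay in [base, base + h_max]
    """
    floor, ceil = base, base + h_max
    # Range already searched with radius r_min (the saved lower/upper bounds).
    inner_lo = max(floor, base + h_q - r_min)
    inner_hi = min(ceil, base + h_q + r_min)
    # Range required for radius r_max.
    outer_lo = max(floor, base + h_q - r_max)
    outer_hi = min(ceil, base + h_q + r_max)
    intervals = []
    if outer_lo < inner_lo:
        intervals.append((outer_lo, inner_lo))   # new keys below the old range
    if inner_hi < outer_hi:
        intervals.append((inner_hi, outer_hi))   # new keys above the old range
    return intervals
```

Only the two new slices are searched at each step, and together with the previously searched range they form one continuous key range.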
17. GiMP iDistance Instantiation
- Data mapping (many-to-one)
- data space is divided into Np partitions
- each partition has a reference point
- data point P belongs to partition i if i = argmin dist(P, ri)
- key (P) = distance (P, ri) + i × c
- Reference (P) = nearest reference point to P
- Base (P) = i × c
- Distance (P1, P2) = Euclidean distance between P1 and P2
- Key (P) = Base (P) + Distance (P, Reference (P))
[Figure: N partitions with reference points r1, r2, ..., rN; point p lies at distance d from its reference point, with key (p) marked]
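The iDistance key can be sketched directly from the definitions above. The function name `idistance_key` and the 0-based partition index are assumptions; the constant `c` is assumed to be chosen larger than any distance within a partition so that the key ranges of different partitions do not overlap.

```python
import math

def idistance_key(p, refs, c):
    """Return (partition index i, key) for point p.

    refs : list of reference points, one per partition
    c    : constant used to separate the key ranges of the partitions
    """
    # Reference(P): the nearest reference point to P.
    i = min(range(len(refs)), key=lambda j: math.dist(p, refs[j]))
    # Key(P) = Base(P) + Distance(P, Reference(P)) = i * c + dist(P, r_i)
    return i, i * c + math.dist(p, refs[i])
```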
18. GiMP iDistance Instantiation
- Query processing
- MapAnnulus (Q, rmin, rmax)
19. Performance of GiMP
- Direct implementation vs GiMP
20. Performance Prediction
- What dominates the overall performance?
- the mapping process
- how the query is mapped to the one-dimensional ranges
- redundant mapping causes performance degradation
- Mapping redundancy
- ratio between the mapped region and the query region
- mr = nm / na; mr = 1 is optimal
- nm = the number of pages that contain the data points that are in the mapped region
- na = the minimum number of pages that contain the data points in the answer set of a query Q
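A rough way to compute mapping redundancy for one query is sketched below, reading it as the ratio of nm to na. Two assumptions are made loudly here: leaf pages are modeled as lists of (key, point) entries, and na is taken as the answer-set size divided by the page capacity, rounded up, which is one reading of "minimum number of pages". The function name `mapping_redundancy` is hypothetical.

```python
import math

def mapping_redundancy(pages, intervals, answers, capacity):
    """Estimate mr = nm / na for one query.

    pages     : list of pages, each a list of (key, point) entries
    intervals : key intervals [(lo, hi), ...] produced by the query mapping
    answers   : the true answer set of the query (must be non-empty)
    capacity  : number of entries that fit in one page
    """
    # nm: pages holding at least one entry whose key is in the mapped region.
    n_m = sum(
        1 for page in pages
        if any(lo <= k <= hi for k, _ in page for lo, hi in intervals)
    )
    # na: assumed here to be the fewest pages that could hold all answers.
    n_a = math.ceil(len(answers) / capacity)
    return n_m / n_a
```

A value of 1 means the mapped key ranges touch no more pages than the answers strictly require; larger values indicate redundant page accesses.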
21. Performance Prediction (cont.)
- Experimental results with amr (averaged mapping
redundancy)
22. Mappability Problem
- Observation
- naturally, one-to-one mapping indexing schemes perform better than many-to-one ones
- Mappability
- whether a one-to-one mapping exists from a d-dimensional data space to a one-dimensional domain
- existence of a one-to-one mapping depends on the nature of the data space (countable or uncountable property)
23. Conclusion
- Users can define their own mapping-based indexing scheme by implementing the components of GiMP
- MR (mapping redundancy) is a governing factor in the efficiency of mapping-based indexing schemes, so it can be used as a performance prediction measure
- Existence of a one-to-one mapping depends on the nature of the data space