Title: Spatial Join Algorithms
1. Spatial Join Algorithms
2. Why Do We Need a Spatial Join?
- Spatial data are commonly found in applications like cartography, CAD, and GIS.
- In a spatial join, two spatial relations are combined based on some spatial criterion.
- Examples of spatial joins:
  - Only a spatial join is required:
    - Find all forests that are in a city.
    - Find all cities that are crossed by a river.
    - Find all cities that are affected by the fire region.
    - Find all buildings that overlap with a park.
  - Spatial join with a selection criterion:
    - Find all government-owned buildings that overlap with a park.
    - Find all forests in the USA that receive more than 20 inches of average rainfall per year.
    - Find all day cares in Lafayette that are within 3 miles of houses with rent less than 600.
3. Why Not Use Relational Join Algorithms?
- Nested-loop join
  - Every object of one relation has to be checked against all objects of the other relation. Since we consider very large relations of spatial objects, the performance of the nested loop is not acceptable.
- Hash-based join
  - Hash-based joins are suitable for equi-joins but not for spatial joins.
- Sort-merge join
  - There is no total ordering of spatial objects.
- However, these algorithms can be adapted to deal with spatial properties.
4. Is the Spatial Join Problem Already Solved in Another Domain?
- Geometric domain
  - The abstraction of the spatial join is finding the intersections between two sets of geometric shapes.
  - Many solutions are provided in the context of geometry.
  - Geometric solutions consider only CPU cost.
  - Geometric solutions are acceptable if the data set can fit in memory.
- VLSI domain
  - Spatial access methods (e.g., the R-tree) are also defined in the VLSI context.
  - A divide-and-conquer algorithm (Gutting et al, IS 93) is designed for the rectangle intersection problem with large inputs.
- However, the I/O time is still not well addressed. I/O is essential for spatial join algorithms due to the massive amount of spatial data.
5. What Is Special about Spatial?
- There exists no total ordering among spatial objects that preserves spatial proximity.
  - Space-filling curves can be used, but the ordering they give preserves proximity only approximately.
- Many spatial operators are not closed.
  - The intersection of two polygons may return any number of single points, dangling edges, or disjoint polygons.
- Spatial operators are more expensive than standard relational operators.
  - Examples of spatial operators are overlap, contained, and include.
- Spatial databases tend to be large.
  - The cities, rivers, restaurants, gas stations, forests, and highways in the USA.
- Spatial data have a complex structure.
  - Imagine representing the boundaries of Lafayette in a database.
6. Filter Step and Refinement Step
- Filter step
  - An approximation of each spatial object (e.g., the minimum bounding rectangle, MBR) is used to eliminate tuples that cannot be part of the result. This step produces candidates that are a superset of the actual result.
- Refinement step
  - Each candidate is examined to check whether it is part of the result. This step incurs I/O cost for fetching the exact object from disk and CPU cost for running a computational geometry algorithm.
- An intermediate step (geometric filter) is used in some algorithms.
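The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Circle` type stands in for an arbitrary exact geometry, all function names are hypothetical, and the filter uses a plain nested loop to keep the example short, whereas the algorithms on the following slides replace that loop with something cheaper.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Circle:
    """Hypothetical stand-in for an exact spatial object."""
    x: float
    y: float
    r: float

    def mbr(self):
        """Minimum bounding rectangle as (xlo, ylo, xhi, yhi)."""
        return (self.x - self.r, self.y - self.r, self.x + self.r, self.y + self.r)


def mbrs_intersect(a, b):
    """True if two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def exact_intersect(c1, c2):
    """Exact geometry test (cheap for circles, costly for real polygons)."""
    return hypot(c1.x - c2.x, c1.y - c2.y) <= c1.r + c2.r


def filter_step(rel_a, rel_b):
    """Candidate pairs whose MBRs intersect: a superset of the result."""
    return [(a, b) for a in rel_a for b in rel_b
            if mbrs_intersect(a.mbr(), b.mbr())]


def refinement_step(candidates):
    """Keep only the candidates that pass the exact test."""
    return [(a, b) for a, b in candidates if exact_intersect(a, b)]


A = [Circle(0, 0, 1)]
B = [Circle(1.5, 1.5, 1), Circle(1, 0, 0.5), Circle(5, 5, 1)]
candidates = filter_step(A, B)        # two pairs survive the MBR test
result = refinement_step(candidates)  # one of them is a false hit
```

Here the pair with `Circle(1.5, 1.5, 1)` passes the filter because the MBRs overlap at a corner, but the refinement step rejects it.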
7. The Filter Step
- Transformation approaches
- Neither relation is indexed
- Both relations are indexed
- Only one relation is indexed
- Unified approaches that work regardless of whether an index exists
8. Transformation Approaches for Spatial Join
- Map to the one-dimensional space (Orenstein et al, TSE 88)
  - Rectangles are sorted according to the Z-order.
  - Two one-dimensional spatial join algorithms are proposed (spatial-merge and spatial-filter).
  - Both algorithms are later enhanced in (Aref et al, SDH 94) with the linear-scan and estimate-based spatial join algorithms.
- Map to higher dimensions (Becker et al, ICDE 93)
  - D-dimensional rectangles are transformed into points in the 2D-dimensional space (corner transformation).
  - A multi-dimensional join algorithm is used with the support of grid files.
- Transformation-Based Spatial Join (Song et al, CIKM 99, TKDE 99)
  - The corner transformation is used.
  - A special algorithm for spatial join that does not rely on indexing is proposed.
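Both transformations can be illustrated with a minimal sketch. The function names, the 16-bit default, and the choice of which coordinate supplies the low bit are assumptions made for the example, not details taken from the cited papers. The corner transformation is shown for D = 1, where an interval becomes a point in 2-dimensional space.

```python
def z_value(x, y, bits=16):
    """Z-order (Morton) value of a grid cell: interleave the bits of x and y.

    x and y are non-negative integer cell coordinates; x supplies the
    even-numbered bits and y the odd-numbered bits (an arbitrary convention).
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def corner_point(lo, hi):
    """Corner transformation for D = 1: interval [lo, hi] -> 2-D point."""
    return (lo, hi)


def in_query_region(point, qlo, qhi):
    """Intervals intersecting [qlo, qhi] fall in the region lo <= qhi, hi >= qlo."""
    lo, hi = point
    return lo <= qhi and hi >= qlo


# Sorting the cells of a 2 x 2 grid by Z-value gives a one-dimensional order.
cells = [(1, 1), (0, 1), (1, 0), (0, 0)]
z_sorted = sorted(cells, key=lambda c: z_value(*c))
```

After the corner transformation, the intersection test against a query interval becomes a range condition on the point's coordinates, which is what lets a point access method such as a grid file support the join.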
9. Spatial Join Algorithms without Indexing Support
10. PBSM (Partition-Based Spatial-Merge Join) (Patel et al, SIGMOD 96)
- The spatial universe is divided into P disjoint partitions.
- The MBRs from the two relations are mapped to their partitions. One MBR can be clipped to several partitions.
- Each partition is joined in memory using the plane-sweep algorithm.
- The number of partitions is chosen so that each partition fits in memory.
- Additional techniques are provided to handle data skew.
- If the data inside one partition still cannot fit in memory, recursive partitioning may be used.
- The output data may contain duplicates.
  - A sorting step needs to be done in the refinement step to remove the duplicates.
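The partitioning idea, and the reason duplicates appear, can be sketched as follows. This is a simplified illustration, not the algorithm as published: each grid tile is treated as its own partition, the per-partition join is a nested loop rather than a plane sweep, skew handling is omitted, and all names are hypothetical.

```python
from collections import defaultdict


def intersects(a, b):
    """Closed-rectangle intersection test on (xlo, ylo, xhi, yhi)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def tiles_for(mbr, universe, nx, ny):
    """Grid tiles overlapped by an MBR (an MBR may span several tiles)."""
    ux0, uy0, ux1, uy1 = universe
    w, h = (ux1 - ux0) / nx, (uy1 - uy0) / ny

    def clamp(v, n):
        return min(n - 1, max(0, int(v)))

    cx0, cx1 = clamp((mbr[0] - ux0) // w, nx), clamp((mbr[2] - ux0) // w, nx)
    cy0, cy1 = clamp((mbr[1] - uy0) // h, ny), clamp((mbr[3] - uy0) // h, ny)
    return [(cx, cy) for cx in range(cx0, cx1 + 1) for cy in range(cy0, cy1 + 1)]


def partition(relation, universe, nx, ny):
    """Map each (object id, MBR) entry to every tile it overlaps."""
    parts = defaultdict(list)
    for oid, mbr in relation:
        for tile in tiles_for(mbr, universe, nx, ny):
            parts[tile].append((oid, mbr))
    return parts


def pbsm_filter(rel_a, rel_b, universe, nx=2, ny=2):
    """Join partition by partition, then sort and drop adjacent duplicates."""
    parts_a = partition(rel_a, universe, nx, ny)
    parts_b = partition(rel_b, universe, nx, ny)
    raw = []
    for tile, objs_a in parts_a.items():
        for ida, ma in objs_a:
            for idb, mb in parts_b.get(tile, []):
                if intersects(ma, mb):
                    raw.append((ida, idb))
    raw.sort()
    deduped = [p for k, p in enumerate(raw) if k == 0 or p != raw[k - 1]]
    return raw, deduped


A = [("a1", (4, 4, 6, 6))]                            # spans all four tiles
B = [("b1", (4.5, 4.5, 5.5, 5.5)), ("b2", (8, 8, 9, 9))]
raw, deduped = pbsm_filter(A, B, universe=(0, 0, 10, 10))
```

In this example `a1` and `b1` both straddle the center of the universe, so the same pair is found once in each of the four tiles before the sort removes the repeats.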
11. Spatial Hash-Join (Lo et al, SIGMOD 96)
- A general framework for extending relational hash join algorithms.
- Can produce duplicate results.
12. PBSM as an Instance of the Spatial Hash-Join Framework
13. Spatial Hash Join: Designer's Choice
14. Other Spatial Join Algorithms for Non-Indexed Relations
- Seeded tree approach (Lo et al, SSD 95)
  - A seeded tree is built for each of the two relations using spatial sampling techniques.
  - A depth-first traversal algorithm is used for joining the two seeded trees.
- Size Separation Spatial Join, S3J (Koudas et al, SIGMOD 97)
  - For each entry in both data sets A and B:
    - Compute the Hilbert value H of the centroid.
    - Determine the level to which this entry belongs (similar to the Filter Tree) and place the entry in this level file.
  - For each level file, sort entries by the Hilbert value.
  - Perform a synchronized scan over the pages of each level.
- Scalable Sweeping-Based Spatial Join, SSSJ (Arge et al, VLDB 98)
  - Similar to PBSM, it is a partition-based and plane-sweep-based approach.
  - The main contribution is that it utilizes the foundations of computational geometry algorithms to improve the in-memory plane-sweep algorithm.
  - This requires changing the partitioning function to partition over only one dimension.
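Since both PBSM and SSSJ rely on an in-memory plane sweep, a minimal version of that building block is sketched here. It is a textbook-style sweep over rectangles sorted by their lower x-coordinate, written under the assumption that both inputs fit in memory; it does not include the refinements that SSSJ contributes, and the function name is hypothetical.

```python
def plane_sweep_join(rects_a, rects_b):
    """Report all intersecting (a, b) pairs of (xlo, ylo, xhi, yhi) rectangles."""
    a = sorted(rects_a)  # tuples sort by xlo first
    b = sorted(rects_b)
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] <= b[j][0]:
            # The sweep line stops at a[i]: scan the unprocessed b entries
            # whose xlo still lies inside a[i]'s x-extent.
            k = j
            while k < len(b) and b[k][0] <= a[i][2]:
                if a[i][1] <= b[k][3] and b[k][1] <= a[i][3]:
                    out.append((a[i], b[k]))
                k += 1
            i += 1
        else:
            # Symmetric case: the sweep line stops at b[j].
            k = i
            while k < len(a) and a[k][0] <= b[j][2]:
                if a[k][1] <= b[j][3] and b[j][1] <= a[k][3]:
                    out.append((a[k], b[j]))
                k += 1
            j += 1
    return out


A = [(0, 0, 2, 2), (3, 0, 5, 2), (6, 6, 7, 7)]
B = [(1, 1, 4, 3), (4.5, 1.5, 8, 8), (10, 10, 11, 11)]
pairs = plane_sweep_join(A, B)
```

Each intersecting pair is reported exactly once, at the moment the sweep line reaches whichever of the two rectangles starts first, so only rectangles that overlap in x are ever compared in y.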
15. Spatial Join Algorithms When Both Relations Are Indexed
16. Spatial Join Using Depth-First R-Tree Traversal (Brinkhoff et al, SIGMOD 93)
- A sketch of the proposed algorithm for the case of equal heights and the intersection operator:
  - Procedure SpatialJoin(R, S: R-tree node)
    - For all entries ER in R and all entries ES in S where ER.rect intersects ES.rect
      - If R and S are leaf pages
        - Output (ER, ES)
      - Else
        - SpatialJoin(ER.ref, ES.ref) // ER.ref and ES.ref are the nodes referenced by ER and ES
    - End
- The main idea is to synchronously traverse the two R-trees in a depth-first traversal.
- Enhancements are proposed to tune:
  - The CPU time
  - The I/O time
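The procedure above translates almost directly into Python. The sketch below assumes two trees of equal height and a toy in-memory `Node` type; the `leaf` and `inner` helpers are hypothetical conveniences for building small test trees, and none of the CPU or I/O tuning is included.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy R-tree node: entries are (rect, ref) pairs.

    In a leaf, ref is an object id; in an inner node, ref is a child Node.
    """
    is_leaf: bool
    entries: list


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bounding_rect(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def leaf(objects):
    return Node(True, list(objects))


def inner(children):
    return Node(False, [(bounding_rect([rect for rect, _ in c.entries]), c)
                        for c in children])


def spatial_join(r, s, out):
    """Synchronized depth-first traversal of two equal-height R-trees."""
    for er_rect, er_ref in r.entries:
        for es_rect, es_ref in s.entries:
            if intersects(er_rect, es_rect):
                if r.is_leaf and s.is_leaf:
                    out.append((er_ref, es_ref))
                else:
                    spatial_join(er_ref, es_ref, out)


tree_r = inner([leaf([((0, 0, 1, 1), "r1"), ((2, 2, 3, 3), "r2")]),
                leaf([((8, 8, 9, 9), "r3")])])
tree_s = inner([leaf([((0.5, 0.5, 1.5, 1.5), "s1")]),
                leaf([((8.5, 8.5, 10, 10), "s2"), ((20, 20, 21, 21), "s3")])])
pairs = []
spatial_join(tree_r, tree_s, pairs)
```

Only two of the four possible leaf-page combinations are visited here, because the directory rectangles of the other two combinations do not intersect.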
17. Spatial Join Using Depth-First R-Tree Traversal (Cont.)
- CPU-time tuning
  - Restricting the search space. Among all the entries in nodes R and S, we only check the entries that intersect R ∩ S.
  - Spatial sorting and plane-sweep. Use the plane-sweep algorithm for the set of candidates from nodes R and S.
- I/O-time tuning
  - Local plane-sweep order. Use the plane-sweep order for fetching pages from disk.
  - Local plane-sweep order with pinning. In addition to the previous approach, we pin the rectangles that have maximal degree. The degree of a rectangle is the number of rectangles it intersects.
  - Local Z-order with pinning. Instead of using the plane-sweep order for reading the disk pages, we use the Z-order of the centroids of the rectangles.
18. Other Spatial Join Algorithms When Both Relations Are Indexed
- BFRJ, Breadth-First R-tree Join (Huang et al, VLDB 97)
  - Synchronous traversal of two R-trees in a breadth-first traversal.
  - Unlike the depth-first traversal, where a local optimization is achieved for a node-by-node join, in BFRJ a global optimization is achieved for each level.
  - The main idea is that, based on the global optimization, we can decide which nodes need to be joined to each other.
  - Notice that this cannot be done using depth-first traversal because of the limited scope available at each step.
- Both relations are indexed using the PMR quadtree (Hoel et al, VLDB 95)
  - Performs a synchronized tree traversal at the leaf level.
19. Spatial Join Algorithms for One Indexed Relation
- Spatial join using seeded trees (Lo et al, SIGMOD 94, TKDE 98)
  - A seeded tree is built for the non-indexed relation.
  - The steps to build the seeded tree are guided by the existing R-tree.
  - The R-tree and the seeded tree are joined using the depth-first traversal approach.
- Sort and Match (Papadopoulos et al, SSD 99)
  - The STR bulk-loading algorithm is applied to the non-indexed relation.
  - Instead of building the packed tree, it directly matches the leaf nodes created in memory with the existing R-tree index.
- Slot Index Spatial Join, SISJ (Mamoulis et al, TKDE 03)
  - SISJ combines the ideas of the seeded tree join with the spatial hash join.
  - The key idea is to define the spatial partitions of the spatial hash join using the existing R-tree.
20. A Unified Approach for Indexed and Non-Indexed Spatial Joins (Arge et al, EDBT 00)
- An extension of SSSJ to deal with indexed relations.
- Similar to SSSJ, non-indexed data are sorted according to their MBRs and fed into the plane-sweep algorithm.
- For indexed data, an additional pre-processing step is required to exploit the index structure and directly extract the data in the sorted order required by the plane-sweep algorithm.
- A main conclusion of this paper is that using an index-based approach for spatial join whenever indexes are available does not always lead to the best execution time.
- A cost model is proposed to decide whether to follow an index-based approach or the unified approach.
21. The Refinement Step
22. Multi-Step Processing (Brinkhoff et al, SIGMOD 94)
- The refinement step is divided into two steps:
  - Identifying more false and true hits.
    - In this step, approximations more accurate than the MBR are investigated to identify false and true hits.
  - Exact geometry intersection.
    - Eventually, all the remaining pairs of candidates are examined at this stage. This is the most time-consuming step: it requires CPU time to compute the exact intersection test and I/O time to read the spatial objects from disk.
- It is important to notice that improvements in the exact geometry intersection step have the lowest impact, since their effect can be canceled by the improvements in the previous two steps.
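One way an extra approximation can save exact tests is sketched below. The example is an assumption-laden illustration rather than the method of the cited paper: objects are circles given as `(x, y, r)`, the MBR plays the role of the enclosing approximation, and an inscribed square plays the role of an approximation that lies entirely inside the object. If two inscribed squares intersect, the pair is a true hit and the exact test can be skipped. The sketch shows only this true-hit side; identifying additional false hits would need a tighter enclosing approximation than the MBR.

```python
from math import hypot, sqrt


def rects_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr(c):
    """Enclosing approximation of a circle (x, y, r)."""
    x, y, r = c
    return (x - r, y - r, x + r, y + r)


def inner_square(c):
    """Axis-aligned square inscribed in the circle (lies fully inside it)."""
    x, y, r = c
    h = r / sqrt(2)
    return (x - h, y - h, x + h, y + h)


def exact_test(c1, c2):
    return hypot(c1[0] - c2[0], c1[1] - c2[1]) <= c1[2] + c2[2]


def multi_step_join(rel_a, rel_b):
    results = []
    stats = {"true_hits": 0, "exact_tests": 0}
    for a in rel_a:
        for b in rel_b:
            if not rects_intersect(mbr(a), mbr(b)):
                continue                      # rejected by the MBR filter
            if rects_intersect(inner_square(a), inner_square(b)):
                stats["true_hits"] += 1       # accepted without exact test
                results.append((a, b))
                continue
            stats["exact_tests"] += 1         # remaining candidates only
            if exact_test(a, b):
                results.append((a, b))
    return results, stats


A = [(0, 0, 1)]
B = [(1, 0, 1), (1.9, 0, 1), (1.5, 1.5, 1), (5, 5, 1)]
results, stats = multi_step_join(A, B)
```

Of the three pairs that survive the MBR filter in this example, one is accepted by the inscribed squares alone, so only two exact tests are run.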
23. Multi-Step Processing (Cont.)
24. Exact Geometry Processing in Multi-Step Processing
25. Other Work for the Refinement Step
- Approximations other than the MBR (Veenhof et al, BNCOD 95)
  - Approximations of spatial objects are constructed by rotating two parallel lines around the object.
- Symbolic Intersection Detection (SSD 97, GeoInformatica 98)
  - Concerned with the exact geometry computation.
  - Enumerates all the possible situations that two clipped polygon segments can have inside an MBR.
- Raster approximation (Zimbrao et al, VLDB 98).
26. Other Problems Related to the Spatial Join
- Non-blocking spatial join (Luo et al, ICDE 02)
- Multiway spatial join
  - (Mamoulis et al, GIS 98), (Mamoulis et al, SIGMOD 99), (Papadias et al, PODS 99), (Papadias et al, EDBT 02)
- Selectivity estimation
  - (Faloutsos et al, SIGMOD 00), (An et al, ICDE 01), (Mamoulis et al, SSTD 01), (Sun et al, EDBT 02)
- Cost models
  - For R-tree-based indexed relations (Gunther, ICDE 93)
- Parallel spatial join
  - PMR quadtree (Hoel et al, VLDB 94). R-tree (Brinkhoff et al, ICDE 96). Non-indexed relations (Patel et al, GIS 00)
- Cascaded spatial join (Aref et al, GIS 96)
- Caching strategies (Abel et al, GeoInformatica 99)
- Duplicate detection (Dittrich et al, ICDE 00)
- High-dimensional spatial join (Koudas et al, ICDE 98)
27. Summary
- Spatial join algorithms are performed in two steps: the filter step and the refinement step.
- In the filter step, five approaches are used, depending on index availability:
  - Transformation approaches
  - No index is available
  - Only one index is available
  - Two indices are available
  - A unified approach
- The refinement step can be further divided into a geometric filter step and an exact geometry processing step.