Title: Multi-Dimensional Range Searching
1 Multi-Dimensional Range Searching
Committee: Subhash Suri (Chair), Amr El Abbadi, Teofilo Gonzalez
- Amit Bhosle
- Department of Computer Science
- University of California Santa Barbara
2 What I Will Cover Today
- Introduction: problem statement and applications
- Some classical solutions
- A lower bound by Chazelle in orthogonal range searching
- Indexing schemes in the context of range searching
- Some indexing structures (R-trees and Box trees)
3 What is Range Searching?
- Preprocess a set P of objects for efficiently answering queries.
- Typically, P is a collection of geometric objects (points, rectangles, polygons) in $\mathbb{R}^d$.
- Query range Q: d-rectangles, balls, halfspaces, simplices, etc.
- Either count all objects in $P \cap Q$ or report the objects themselves.
4 Example: Points in $\mathbb{R}^2$
[Figure: a planar point set with two query ranges labeled Q1 and Q2.]
5 Why Study Range Searching?
- Applications in several fields
- Databases
- Spatial databases (G.I.S.)
- Computer Graphics
- Robotics
- Algorithmic tool (example ?? )
- And more.
6 Some Classical Approaches
- Gridding or bucketing
  - Simple; independent of query shape.
  - Good only for a uniform distribution of input and can be quite bad for skewed data.
- Range Trees
  - Good query time and space for lower dimensions.
  - Poor in higher dimensions: $O(\log^d n)$.
- kD-Trees
  - Linear space in all dimensions.
  - Query time becomes almost linear for high dimensions: $O(n^{1-1/d})$.
7 The Grid Approach
[Figure: a grid with k divisions along each axis, labeled 1, 2, ..., k.]
8 Grids (Contd.)
- Either the queries should be aligned with the grid or the result is approximate.
- Cell sizes need not be uniform and can be adapted to the data distribution.
- $O(k^{d-1})$ query time, $O(nd\,k^{d})$ preprocessing time and $O(n k^{d})$ space ($k$ is the number of divisions of each axis).
- Error decreases as $k$ increases, but the space requirement increases.
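A minimal sketch of the bucketing idea in 2-D, assuming points lie in $[0, 1)^2$ and the grid is uniform. The function names `build_grid` and `grid_query` are illustrative and not from the slides. Unlike the approximate variant mentioned above, this sketch filters the points of each overlapping cell so that the reported set is exact.

```python
from collections import defaultdict

def build_grid(points, k):
    """Bucket 2-D points from [0, 1)^2 into a k x k uniform grid."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(int(x * k), int(y * k))].append((x, y))
    return grid

def grid_query(grid, k, x1, x2, y1, y2):
    """Report points in [x1, x2] x [y1, y2], scanning only overlapping cells.

    Assumes 0 <= x1 <= x2 and 0 <= y1 <= y2.
    """
    out = []
    for cx in range(int(x1 * k), min(int(x2 * k), k - 1) + 1):
        for cy in range(int(y1 * k), min(int(y2 * k), k - 1) + 1):
            # Filter inside the cell so partially covered cells stay exact.
            out.extend(p for p in grid.get((cx, cy), ())
                       if x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
    return out
```

With skewed data most points fall into a few cells, so the per-cell filtering degenerates towards a linear scan, which is the weakness noted for this approach.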
9 Range Trees
- 1-D case: build a balanced binary tree using the points' coordinates as the keys.
[Figure: a balanced binary search tree over the keys 2, 4, 5, 7, 8, 12, 15, 19, with the values 6 and 17 marked as the endpoints of a query.]
- Counting?
- Optimal time/space.
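A small sketch of 1-D range counting and reporting. It swaps the balanced binary tree for a sorted array with binary search, which gives the same $O(\log n)$ counting time for a static point set. The keys are the ones that appear on this slide, and the query endpoints 6 and 17 are assumed to be the query shown there.

```python
from bisect import bisect_left, bisect_right

def range_count(sorted_keys, lo, hi):
    """Number of keys k with lo <= k <= hi, in O(log n)."""
    return bisect_right(sorted_keys, hi) - bisect_left(sorted_keys, lo)

def range_report(sorted_keys, lo, hi):
    """The keys k with lo <= k <= hi, in O(log n + output size)."""
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_right(sorted_keys, hi)]

keys = sorted([2, 4, 5, 7, 8, 12, 15, 19])
print(range_count(keys, 6, 17))   # 4
print(range_report(keys, 6, 17))  # [7, 8, 12, 15]
```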
10 Range Trees (Contd.)
- d dimensions:
  - Build a 1-D range tree on the first dimension.
  - Each internal node points to another tree built recursively on the remaining d-1 dimensions for the points in its subtree.
[Figure: a primary tree over the points P1, ..., P4, with the associated trees of its internal nodes.]
11 Range Trees (Contd.)
[Figure: a range tree over the points P1, ..., P8, with associated structures for the subsets P1, ..., P4 and P5, ..., P8.]
12 Range Trees (Contd.)
- Search by the first dimension gives us $O(\log n)$ subtrees which together contain the output point(s).
- Search the remaining d-1 dimensions recursively among these.
- $O(\log^d n + k)$ query time, $O(n \log^{d-1} n)$ preprocessing time and space.
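The two-level search can be sketched for d = 2 as follows. This is an illustrative simplification, not the slides' construction: the primary tree is implicit over an x-sorted array, and each node's associated structure is a sorted list of y-values rather than a second tree. The class name `RangeTree2D` is hypothetical. Counting takes $O(\log^2 n)$; the build below re-sorts at every node, so it is slower than the optimal preprocessing bound.

```python
from bisect import bisect_left, bisect_right

class RangeTree2D:
    """Implicit primary tree on x; each node keeps its points' y-values sorted."""

    def __init__(self, points):
        self.pts = sorted(points)
        self.xs = [p[0] for p in self.pts]
        self.ys = {}  # (lo, hi) -> sorted y-values of pts[lo:hi]
        self._build(0, len(self.pts))

    def _build(self, lo, hi):
        if hi <= lo:
            return
        self.ys[(lo, hi)] = sorted(p[1] for p in self.pts[lo:hi])
        if hi - lo > 1:
            mid = (lo + hi) // 2
            self._build(lo, mid)
            self._build(mid, hi)

    def count(self, x1, x2, y1, y2):
        """Count points with x1 <= x <= x2 and y1 <= y <= y2."""
        a, b = bisect_left(self.xs, x1), bisect_right(self.xs, x2)
        return self._count(0, len(self.pts), a, b, y1, y2) if a < b else 0

    def _count(self, lo, hi, a, b, y1, y2):
        if a >= hi or b <= lo:
            return 0
        if a <= lo and hi <= b:
            # Canonical subtree: answer the y-range on its sorted list.
            ys = self.ys[(lo, hi)]
            return bisect_right(ys, y2) - bisect_left(ys, y1)
        mid = (lo + hi) // 2
        return (self._count(lo, mid, a, b, y1, y2)
                + self._count(mid, hi, a, b, y1, y2))
```

Each level of the primary tree stores every point once, which is where the extra $\log n$ factor in the space bound comes from.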
13 kD-Trees (k-dimensional Trees)
- 1-d tree: split along the median point and recursively build subtrees for the left and right sets.
- Higher dimensions: same approach, but cycle through the dimensions. Or, select the next dimension as the one with the widest spread.
- Efficiency of query processing drops as dimensions increase (becomes almost linear). However, the space requirement remains linear: $O(n \cdot d)$.
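A compact sketch of the median-split construction with cycling dimensions, plus an orthogonal range-reporting query. The function names and the nested-tuple node layout are illustrative choices; the build re-sorts at each level for brevity rather than using linear-time median selection.

```python
def build_kdtree(points, depth=0):
    """Split on the median, cycling through the dimensions.

    A node is (point, axis, left_subtree, right_subtree); None is an empty tree.
    """
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid], axis,
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid + 1:], depth + 1))

def kd_report(node, lo, hi, out):
    """Append to out every point p with lo[i] <= p[i] <= hi[i] for all i."""
    if node is None:
        return out
    point, axis, left, right = node
    if all(l <= c <= h for l, c, h in zip(lo, point, hi)):
        out.append(point)
    # Left subtree holds coordinates <= the split value, right holds >=.
    if lo[axis] <= point[axis]:
        kd_report(left, lo, hi, out)
    if point[axis] <= hi[axis]:
        kd_report(right, lo, hi, out)
    return out
```

Each point is stored exactly once, which is why the space stays linear regardless of the dimension.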
14 kD-Trees (Contd.)
[Figure: a kD-tree subdivision of the plane for the points a through o, alongside the corresponding tree.]
15 kD-Trees (Contd.)
- Query complexity: how many cells can a query box intersect?
- Let us consider a facet of the query.
- Any axis-parallel line can intersect at most 2 of these 4 cells (the cells obtained after two levels of splitting).
- Each of these 4 cells contains exactly n/4 points.
- $Q(n) = 2\,Q(n/4) + 1$
- $Q(n) = O(n^{1/2})$
- i.e., a query is answered in $O(n^{1-1/d} + m)$ time, where $m$ is the output size.
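The recurrence can be unrolled as follows. The last line is the standard generalization to d dimensions, where d levels of splitting produce $2^d$ cells and an axis-parallel hyperplane meets at most $2^{d-1}$ of them; the name $Q_d$ is introduced here for illustration.

```latex
\begin{align*}
Q(n) &= 2\,Q(n/4) + 1 \\
     &= 2^{j}\,Q(n/4^{j}) + (2^{j} - 1) \\
     &= O\!\left(2^{\log_4 n}\right) = O(n^{1/2}) \qquad (j = \log_4 n), \\
Q_d(n) &= 2^{d-1}\,Q_d(n/2^{d}) + O(1) = O\!\left(n^{(d-1)/d}\right) = O(n^{1-1/d}).
\end{align*}
```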
16 Summary of Classical Solutions
- Classical solutions, though good for low-dimensional spaces, do not perform well in higher dimensions.
- Updates (inserts/deletes) are expensive (we did not discuss them).
- Desired properties of the data structure:
  - Near-linear size.
  - Query time $O(k + f(n))$, where $f$ is a very slowly increasing function.
  - Preprocessing time is not as important as the above two.
17 Lower Bounds in Orthogonal Range Searching
- Bernard Chazelle, Princeton
- Proved that for the range reporting problem, $O(k + \mathrm{polylog}(n))$ query time requires $\Omega(n (\log n / \log\log n)^{d-1})$ space on a pointer machine.
- The lower bound holds only for pointer algorithms.
  - These algorithms need an explicit pointer to an object to access it. For example, they cannot use the coordinates of the points for indexing into a structure.
- Algorithms based on Range Trees and kD-trees fall in this class of algorithms.
18 Models of Computation
Memory access rules differ for pointer and RAM machines.
[Figure: a machine model with Memory, Input Device, Output Device, Central Processing Unit and Control Unit.]
19 Chazelle's Lower Bound
[Figure: the digraph G with its root node marked.]
Data structure: a digraph G(V, E) of bounded out-degree.
G has a representative node for each input point and some other internal nodes.
For a query q, the algorithm non-deterministically traverses G, adds/deletes some (internal) nodes and edges, and produces a set W(q) of nodes which is a superset of the answer points' nodes.
20 Chazelle's Lower Bound (Contd.)
Range Trees and kD-Trees are some such data structures.
[Figure: a range tree over the points P1, ..., P4 and a kD-tree over the points a through g.]
21 Chazelle's Lower Bound (Contd.)
Desired query time: $O(k + \mathrm{polylog}(n))$ ($k$ is the output size).
[Figure: the digraph G with its root node marked.]
For the query time to be linear in k, the nodes for the answer set should be close to each other. This should hold for any query q.
22 Chazelle's Lower Bound (Contd.)
- If we have a set of queries, $S = \{q_1, q_2, \ldots, q_s\}$, such that
  - $|P \cap q_i| \ge \log^b n$ (each range has many points)
  - $|P \cap q_i \cap q_j| \le 1$ (no two ranges share many points)
- Then, each $q_i$ has a representative subset of nodes in G which is compact (and has many edges). $\Rightarrow$ G has many edges.
- By the bounded out-degree condition on G, it consequently has many vertices, and thus requires a large amount of memory.
23 Chazelle's Lower Bound (Contd.)
- If S exhibits the desired properties, then he shows that for a query time of
  - $|W(q)| \le a\,(k + \log^b n)$,
  - $|V| > |S| (\log^b n) / 2^{16a+4}$ (long algebraic proof).
- Recall that W(q) is the output set produced for a query q.
- The point set and the queries are generated as follows:
  - Let $n = m^{\lambda}$,
  - where $m = \lceil 2 \log^b p \rceil$ and
  - $\lambda = \lfloor \log p / (1 + b \log\log p) \rfloor$ for some large integer p.
- If p is large enough, $m \ge \log^b n$ and $\lambda \approx \log n / (1 + b \log\log n)$.
24 Bad Input Set and Queries
- Define the point set as $P = \{(\rho_m(i), i) : 0 \le i < n\}$.
  - $\rho_m(i)$: write $i$ in base $m$ over $\lambda$ bits and reverse the bits.
- Consider a tree T which encodes the x-coordinates of points in P (take their m-ary representation).
- Each node has m children labeled 0, 1, 2, ..., m-1.
[Figure: the tree T; the children of each node are labeled 0, 1, 2, ..., m-1, and level i corresponds to the i-th bit.]
Height of T is $\lambda$ (one level for each bit).
25 Generating S
[Figure: a node of T with children labeled 0, 1, 2, ..., m-1.]
- A node at depth r is associated with $m^{\lambda-r}$ points: those points whose m-ary representation has its first r bits given by the labels on the path from the root to this node.
- Sort them by the y-coordinate and split them into groups of m points.
- Total no. of groups $= \sum_r (\text{no. of nodes at level } r) \cdot (m^{\lambda-r}/m)$
- Each group can be enclosed in a query box.
- Total of $\lambda m^{\lambda-1}$ queries.
26 E.g., $n = 3^3$
27 queries
For $n = m^{\lambda}$, we have $\lambda m^{\lambda-1}$ queries.
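The construction can be sketched in code for small parameters. This is an illustrative reading of the slides, assuming each query is the bounding box of one group, that groups are formed at depths 0 through one less than the tree height, and that `m` and `lam` stand for the base and the number of digits. The function names are hypothetical.

```python
def reverse_digits(i, m, lam):
    """Write i in base m over lam digits and reverse the digit order."""
    out = 0
    for _ in range(lam):
        i, digit = divmod(i, m)
        out = out * m + digit
    return out

def bad_instance(m, lam):
    """Return the point set and the query boxes (x1, x2, y1, y2)."""
    n = m ** lam
    points = [(reverse_digits(i, m, lam), i) for i in range(n)]
    queries = []
    for r in range(lam):               # depth r in the tree T
        width = m ** (lam - r)         # x-interval covered by a depth-r node
        for node in range(m ** r):
            lo = node * width
            members = sorted((p for p in points if lo <= p[0] < lo + width),
                             key=lambda p: p[1])
            for g in range(0, len(members), m):   # groups of m points by y
                xs = [p[0] for p in members[g:g + m]]
                ys = [p[1] for p in members[g:g + m]]
                queries.append((min(xs), max(xs), min(ys), max(ys)))
    return points, queries

def inside(box, points):
    """Set of points lying in the closed box."""
    x1, x2, y1, y2 = box
    return {p for p in points if x1 <= p[0] <= x2 and y1 <= p[1] <= y2}

points, queries = bad_instance(3, 3)
print(len(points), len(queries))  # 27 27
```

For m = 3 and three digits, each query box contains exactly 3 points and any two boxes share at most one point, which are the two properties required of S.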
27 Indexing Perspective
[Figure: data stored on disk.]
Data is too large to be stored in memory and has to be stored on the disk (in chunks of size B, possibly with repetition; B, the block size, is the unit of data transfer from disk in one read).
Storage redundancy: maximum number of copies of a data item.
A query regarding items satisfying some criteria is answered by retrieving blocks from the disk such that the contained points form a superset of the answer.
Access overhead: ratio of the no. of blocks retrieved to the minimum no. of blocks required to answer the query.
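The two measures can be computed directly for a toy scheme. The sketch below assumes an n x n grid instance tiled by square blocks (so redundancy is 1 and the blocks a query must retrieve are exactly those it intersects) and takes the minimum number of blocks for a query of size |Q| to be the ceiling of |Q|/B. All function names are illustrative.

```python
from collections import Counter
from math import ceil

def square_tiling(n, side):
    """Partition the n x n grid into side x side blocks (side must divide n)."""
    return [frozenset((i, j) for i in range(a, a + side)
                      for j in range(b, b + side))
            for a in range(0, n, side) for b in range(0, n, side)]

def redundancy(blocks):
    """Maximum number of blocks that any single data item appears in."""
    copies = Counter(p for blk in blocks for p in blk)
    return max(copies.values())

def access_overhead(blocks, query, B):
    """Blocks touched by the query divided by the ideal ceil(|query| / B)."""
    retrieved = sum(1 for blk in blocks if blk & query)
    return retrieved / ceil(len(query) / B)

blocks = square_tiling(16, 4)                    # B = 16 points per block
row = frozenset((0, j) for j in range(16))       # a 1 x 16 query
print(redundancy(blocks), access_overhead(blocks, row, 16))  # 1 4.0
```

A long thin query against square blocks touches 4 blocks where 1 would ideally suffice, which illustrates why the block aspect ratio matters.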
28 Indexing Perspective (Contd.)
[Figure: the data partitioned into blocks (redundancy 1), then covered by a second set of blocks (now redundancy 2).]
- Better overhead if queries have the same aspect ratio as our blocks.
- Else, far more blocks have to be retrieved than the bare minimum!
- Idea: have blocks of several different aspect ratios.
29 Indexing - Limitations
- A query can have any aspect ratio.
- Not possible to have blocks of every aspect ratio with limited memory.
- Have blocks with a sufficient no. of different aspect ratios so that any aspect ratio can be approximated (Hellerstein, Koutsoupias, Papadimitriou).
30 Overhead for Redundancy r
- Choose blocks so that any aspect ratio can be approximated.
- Blocks will have the shape $B^{x} \times B^{1-x}$.
- Let $x = (2i-1)/2r$, $i = 1, 2, \ldots, r$.
- Store all such blocks.
- Redundancy is r, since there are r shapes for the blocks and an input point can be present in only 1 block of a particular shape.
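The family of block shapes can be enumerated with a few lines. This sketch assumes B is chosen so that every side length is an integer (for example, B is a perfect 2r-th power); `block_shapes` is a hypothetical helper name.

```python
from fractions import Fraction

def block_shapes(B, r):
    """Shapes B^x by B^(1-x) for x = (2i - 1) / (2r), i = 1, ..., r."""
    shapes = []
    for i in range(1, r + 1):
        x = Fraction(2 * i - 1, 2 * r)
        width = round(B ** float(x))     # assumes B^x is an integer
        shapes.append((width, B // width))
    return shapes

print(block_shapes(256, 2))  # [(4, 64), (64, 4)]
```

Every shape holds B points, and the r shapes spread the aspect ratios evenly on a logarithmic scale between tall-thin and short-wide blocks.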
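31 Overhead for Fixed Redundancy
- If a query is aligned with the blocks, then let k blocks suffice.
- We can easily form a query which will require $2k + 2$ blocks to be covered.
- They achieve $k = B^{1/(2r)}$.
[Figure: a query q covered by k aligned blocks, and a query q that is not aligned with the blocks.]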
32 Lower Bound on Access Overhead for r = 1
- Access overhead $a = \Omega(B^{1-1/d})$.
- d = 2: use only $B \times 1$ and $1 \times B$ queries ($2n^2/B$ total queries).
- Let $s \in S$ intersect x horizontal and y vertical lines.
- $x \cdot y \ge B$
- $x + y \ge 2B^{1/2}$
- i.e., s intersects at least $2B^{1/2}$ of the above queries.
- Block-query product: $(n^2/B) \cdot 2B^{1/2}$
- $\Rightarrow$ Average no. of blocks a query intersects: $B^{1/2}$
[Figure: an n x n grid with a block s spanning x horizontal and y vertical lines.]
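The two steps of this argument can be written out explicitly: the bound on $x + y$ is the AM-GM inequality applied to $xy \ge B$, and the average is the total number of block-query intersections divided by the number of queries.

```latex
\begin{align*}
x + y &\ge 2\sqrt{xy} \ge 2\sqrt{B} = 2B^{1/2}, \\
\text{average no. of blocks per query}
  &\ge \frac{(n^2/B) \cdot 2B^{1/2}}{2n^2/B} = B^{1/2}.
\end{align*}
```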
33 Indexing Structures
- R-trees as indexing structures: an extension of B-Trees to multiple dimensions.
- Node degrees: internal node between t and 2t; root between 2 and 2t.
- Input objects are associated with leaves, and all leaves are at the same level.
- Each internal node stores the smallest bounding box of the objects in its subtree.
34 R-Trees
[Figure: an R-tree with internal nodes A, B, C and H over the input objects p through z, shown together with the corresponding bounding boxes.]
Performance is measured as the number of disk
accesses required to answer a query.
35 Lower Bounds on R-trees
- Query processing in bounding box hierarchies.
- Almost similar to query processing in kD-trees.
- Crossing number as a measure of efficiency of a bounding box hierarchy: the smaller, the better!
- There is a collection of n d-rectangles such that for any R-tree T of min-degree t, there is a query box intersecting $\Omega((n/t)^{1-1/d})$ nodes of T and none of the input d-rectangles. (Pankaj Agarwal, et al.)
36 R-tree Efficiency
- Bounding box of any t squares hits $\ge 2(t^{1/2} - 1)$ queries.
- Total bounding-box/query intersections: $\Omega(n / t^{1/2})$
- Total queries: $2(n^{1/2} - 1) = O(n^{1/2})$ $\Rightarrow$ a query intersects at least $\Omega((n/t)^{1/2})$ bounding boxes.
- In general, an empty query box intersects $\Omega((n/t)^{1-1/d})$ bounding boxes of the R-tree.
[Figure: the input rectangles and the query boxes.]
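37 Good Box-trees and Conversion to Good R-trees
- Pankaj Agarwal, et al.
- kD-trees for rectangle intersection queries
[Figure: a rectangle with corners (a, b) and (c, d), and a rectangle with corners (x1, y1) and (x2, y2).]
The rectangles intersect iff $(c, d) \ge (x_1, y_1)$ and $(a, b) \le (x_2, y_2)$, i.e., $(-a, -b, c, d) \ge (-x_2, -y_2, x_1, y_1)$, with all comparisons taken componentwise.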
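A short sketch of this reduction, checking the 4-D dominance test against the direct intersection test. It assumes rectangles are given as (a, b, c, d) with lower-left corner (a, b) and upper-right corner (c, d), and queries as (x1, y1, x2, y2); the function names are illustrative.

```python
def to_point(rect):
    """Map the rectangle (a, b, c, d) to the 4-D point (-a, -b, c, d)."""
    a, b, c, d = rect
    return (-a, -b, c, d)

def query_corner(query):
    """Map the query (x1, y1, x2, y2) to the corner (-x2, -y2, x1, y1)."""
    x1, y1, x2, y2 = query
    return (-x2, -y2, x1, y1)

def dominates(p, q):
    """True iff p >= q in every coordinate."""
    return all(pi >= qi for pi, qi in zip(p, q))

def intersects(rect, query):
    """Direct closed-rectangle intersection test, used as the reference."""
    a, b, c, d = rect
    x1, y1, x2, y2 = query
    return a <= x2 and x1 <= c and b <= y2 and y1 <= d

rect, query = (1, 1, 4, 3), (3, 2, 6, 5)
print(dominates(to_point(rect), query_corner(query)), intersects(rect, query))
```

A rectangle intersection query thus becomes a one-sided (dominance) range query over points in four dimensions, which a kD-tree on those points can answer.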
38 kD-Trees to Box Trees
- Trivial to verify that the original problem of range searching on rectangles is now a problem of range searching on points.
- Build a kD-tree on these points: $O(n^{1-1/2d} + k)$ query time.
- Convert to a box-tree as follows:
  - replace each point in the leaves of the kD-tree with the corresponding d-rectangle;
  - at each internal node, store the bounding box of its children.
- Careful analysis shows that the query time is actually $O(n^{1-1/d} + k \log n)$.
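The conversion can be sketched for rectangles in the plane. This is a simplified illustration, not the construction from the cited work: it splits the rectangles by a median on the 4-D coordinates (-a, -b, c, d), cycling through the four axes, keeps one rectangle per leaf, and stores only bounding boxes at internal nodes. The dictionary node layout and the function names are assumptions.

```python
def bbox(rects):
    """Smallest rectangle (a, b, c, d) enclosing all the given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def overlaps(r, q):
    """Closed-rectangle intersection test on (xmin, ymin, xmax, ymax)."""
    return r[0] <= q[2] and q[0] <= r[2] and r[1] <= q[3] and q[1] <= r[3]

def build_box_tree(rects, depth=0):
    """kD-tree style median split on the 4-D points; expects >= 1 rectangle."""
    if len(rects) == 1:
        return {"box": rects[0], "rect": rects[0], "kids": []}
    axis = depth % 4
    rects = sorted(rects, key=lambda r: (-r[0], -r[1], r[2], r[3])[axis])
    mid = len(rects) // 2
    kids = [build_box_tree(rects[:mid], depth + 1),
            build_box_tree(rects[mid:], depth + 1)]
    return {"box": bbox(rects), "rect": None, "kids": kids}

def box_query(node, q, out):
    """Append to out every stored rectangle that intersects the query q."""
    if not overlaps(node["box"], q):
        return out                      # prune: nothing below can intersect
    if node["rect"] is not None:
        out.append(node["rect"])
    for kid in node["kids"]:
        box_query(kid, q, out)
    return out
```

The query only compares bounding boxes, so the same traversal would work on any bounding box hierarchy; the kD-tree order is what controls how many nodes get visited.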
39 Box Tree Analysis
- What is a visited node? A node is said to be visited if the query algorithm continues to its children nodes.
- Two types:
  - the input boxes in the subtree of a visited node v include one or more output boxes (at most k such nodes per level);
  - all boxes stored in the subtree of v are disjoint from the query Q (not many such nodes can be visited).
40 Box Tree Analysis
[Figure: the query Q with two input boxes a and b from the subtree of v.]
All input boxes cannot be separated from Q by the same hyperplane. Thus, there are at least 2 such hyperplanes, each separating an input box in the subtree of v from Q.
In 2d-dimensional space, the points representing a and b lie on opposite sides of the above hyperplane through a facet of Q. Thus, this hyperplane intersects the cell representing v. The other hyperplane also intersects the cell of v. Thus, their intersection, which is a (2d-2)-flat, also intersects the cell of v.
41 Box Tree Analysis (Contd.)
- By the property of kD-trees, there can be at most $O(2^{i(2d-2)/2d}) = O(2^{i(1-1/d)})$ such cells at depth i.
- Height of the tree is $O(\log n)$ (kD-trees are perfectly balanced).
- Thus the total number of visited nodes for a query is $\sum_i (k + 2^{i(1-1/d)}) = O(k \log n + n^{1-1/d})$.
- Using a slightly modified construction of the box tree, they reduce the query time to $O(k + n^{1-1/d})$.
42 Avenues for Further Research
- Lower bounds suggest that there may be no data structure which scales well in high-dimensional space for an entirely generic set of inputs and queries.
- Interesting assumptions about the input objects and queries might result in better performance.
- Pankaj Agarwal et al. showed that R-Trees do not have a good worst-case performance even if the input is a set of hypercubes.
43 Further Research
- What if queries are also hypercubes or have O(1) aspect ratio?
- The lower bounds do not hold in these cases, both for R-Trees and indexing.
- Mark de Berg, et al. constructed box trees with polylog query time for collision checking in industrial installations.
46 2-d Case
[Figure: the input rectangles and the query boxes.]
47 kD-Trees (Contd.)
[Figure: a kD-tree subdivision for the points a through g, alongside the corresponding tree.]
$O(n)$ size data structure. $O(n \log n)$ construction time.
48 Indexing Perspective
- Hellerstein, Koutsoupias, Papadimitriou
- Efficiency of an indexing scheme for a database:
  - Storage redundancy: how many copies of a data item.
  - Access overhead: how many times more blocks than necessary does a query retrieve.
- An indexing problem is defined in the context of a workload.
- Workload consists of
  - a domain (e.g., $\mathbb{R}^d$),
  - a subset of the domain called the instance (e.g., a set of points in $\mathbb{R}^d$), and
  - a set of subsets of the instance, the set of queries (e.g., d-rectangles).
49 Range Searching as Indexing Workloads
- Range queries in $\mathbb{R}^2$
  - Domain, $D = \mathbb{R}^2$
  - Instance, $I = \{(i, j) : 1 \le i, j \le n\}$
  - Query, $Q_{a,b,c,d} = \{(i, j) : a \le i \le b,\ c \le j \le d\}$; one query for each quadruple (a, b, c, d) with $1 \le a \le b \le n$ and $1 \le c \le d \le n$
- Indexing schemes
  - A collection $S = \{s_1, s_2, \ldots, s_s\}$ of blocks,
  - $s_i \subseteq I$
  - A query retrieves a set of blocks which cover it (possibly retrieving more blocks than necessary).
50 Access Overhead for Fixed Storage Redundancy
If we have blocks with the same aspect ratio as the query, then we get the best overhead.
But a query can have any aspect ratio. It is not possible to have blocks in S of all possible aspect ratios (storage redundancy is fixed at r).
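51 Overhead when r = 1 (Contd.)
- d = 3
- Consider $B \times 1 \times 1$, $1 \times B \times 1$ and $1 \times 1 \times B$ queries.
- Let $s \in S$ intersect x, y and z lines in each direction.
- $x \cdot y \cdot z \ge B$ $\Rightarrow$ no. of queries intersected $= xy + yz + zx \ge 3B^{2/3}$
- No. of blocks $= n^3/B$
- Block-query intersecting pairs $= 3B^{2/3} \cdot n^3/B$
- No. of queries $= 3n^3/B$
- Thus, a query intersects $B^{2/3}$ blocks.
- In d dimensions, overhead is $\Omega(B^{1-1/d})$.
[Figure: a block in 3-D with extents x, y and z.]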