Title: Indexing Techniques
1 Indexing Techniques
- Zhang Xiaofeng
- Li Xiaolan
- He Qi
2 Preview
- The Pyramid-Technique
- Study iMinMax(θ)
- iDistance
- Hyper-cube iDistance
3 The Pyramid-Technique
- By Zhang Xiaofeng
- zhangxi4@comp.nus.edu.sg
4 Outline
- The Pyramid-Technique versus other techniques
- How to use it
- Some papers for you
5 Start!
6 - What is the Pyramid-Technique?
- Why do we need the Pyramid-Technique?
- What are the problems of other techniques?
7 So-called Balanced Split
- Used in the space partitioning of many indexing structures (such as the SS-tree, SR-tree, and TV-tree).
- Splits the data space into equally filled regions.
8 The problems
- In high-dimensional space, it seems impossible!
- Why?
9 Example
10 Conclusion
- So we split only once, and only in a few dimensions.
11 - Now we can see the problem!
- If we query even a very small range, say 0.01 per dimension in a 20-dimensional space, the ratio of accessed data pages is still very large. This means we would have to access all the data pages!
12 The Pyramid-Technique
- The Pyramid-Technique can handle this more efficiently.
- Let us look at the figure.
13 Partitioning using the Pyramid-Technique
[Figure: partitioning using the Pyramid-Technique versus a balanced split]
14 How to do it?
15 - The Pyramid-Technique is based on a special partitioning strategy that is optimized for high-dimensional data.
- After partitioning, each point in the data space is mapped to a 1-dimensional value, and a B+-tree is used to build the indexing structure.
16 How to partition?
- Step 1: split the data space into 2d pyramids that have the center point of the data space (0.5, 0.5, ..., 0.5) as their top and a (d-1)-dimensional surface of the data space as their base.
- For example, in 2-dimensional space:
17 [Figure: a pyramid, with the center point of the data space at its top and a (d-1)-dimensional surface as its base]
18 - Each of the 2d pyramids is divided into several partitions, each corresponding to one data page of the B+-tree.
19 [Figure: a pyramid divided into partitions]
20 - Step 2: number the pyramids.
- Note that in the 2-dimensional example in the figure, the base of pyramid p_i is the surface on which the i-th or (i-d)-th coordinate is 0 or 1, respectively.
21 [Figure: the 2-dimensional space divided into four numbered pyramids (0-3) around the center point, each with a (d-1)-dimensional surface as its base]
22 Some definitions
- Definition 1: A d-dimensional point v is defined to be located in pyramid p_i, where i = j_max if v[j_max] < 0.5 and i = j_max + d if v[j_max] >= 0.5, with j_max the dimension in which |0.5 - v[j]| is maximal.
23 - Definition 2: Height of a point v
- The height of v is its distance from the center point along the pyramid's dimension: h_v = |0.5 - v[i mod d]|.
24 - Definition 3: Pyramid value of a point v
- The pyramid value of v is defined as pv_v = i + h_v, where i is the number of the pyramid containing v (a minimal sketch of Definitions 1-3 follows below).
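The sketch below is our own illustration of Definitions 1-3, following the cited Pyramid-Technique paper; the function names are ours, and coordinates are assumed to be normalized to [0, 1].

```python
# Minimal sketch of Definitions 1-3; assumes coordinates normalized to [0, 1].

def pyramid_number(v):
    """Definition 1: the number i of the pyramid containing point v."""
    d = len(v)
    # Dimension in which v deviates most from the center 0.5.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 have their base at coordinate 0, pyramids d..2d-1 at 1.
    return j_max if v[j_max] < 0.5 else j_max + d

def height(v, i):
    """Definition 2: distance of v from the center along the pyramid's dimension."""
    return abs(0.5 - v[i % len(v)])

def pyramid_value(v):
    """Definition 3: pv_v = i + h_v, the 1-dimensional key stored in the B+-tree."""
    i = pyramid_number(v)
    return i + height(v, i)

print(pyramid_value((0.2, 0.6)))  # lies in p0 with height 0.3 -> pyramid value 0.3
```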
25 How to handle querying
- Point query
- Given a point p, determine whether p is in the dataset.
26 - Range query
- In the case of range queries, the problem is defined as follows: given a d-dimensional interval [q_0^min, q_0^max], ..., [q_(d-1)^min, q_(d-1)^max], determine the points in the database which lie inside the range.
27 - First, we should determine which pyramids are intersected by the range query.
- Second, we have to determine which pyramid values inside an affected pyramid p_i are affected by the query. Thus, we are looking for the interval [h_low, h_high].
28 - For simplicity, we focus the description of the algorithm only on the pyramids p_i where i < d.
29 - As the first step, we transform the query rectangle q into an equivalent rectangle q̂ whose intervals are defined relative to the center point: q̂_j = [q_j^min - 0.5, q_j^max - 0.5].
30 - We also define MIN(r) and MAX(r) like this:
- MIN(r) = 0 if r^min <= 0 <= r^max, and MIN(r) = min(|r^min|, |r^max|) otherwise.
- Note that both ends of a transformed interval may be negative, so |r^min| may be larger than |r^max|.
- Analogously, we define MAX(r) = max(|r^min|, |r^max|).
31 - A pyramid p_i (i < d) is intersected by the transformed hyper-rectangle q̂
- iff
- q̂_i^min <= -MIN(q̂_j) for all j != i.
- (Hint: the value of q̂_i^min is negative. A sketch of this test follows below.)
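As an illustration, here is a minimal Python sketch of MIN, MAX, and the intersection test as reconstructed above; the names and the sample query are ours, and the query rectangle is assumed to be already shifted by -0.5 in every dimension.

```python
# Sketch of the intersection test; q is the transformed query rectangle,
# one (min, max) interval per dimension, each already shifted by -0.5.

def MIN(r):
    lo, hi = r
    return 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))

def MAX(r):
    lo, hi = r
    return max(abs(lo), abs(hi))

def intersects(i, q):
    """True iff pyramid p_i (i < d) intersects the transformed rectangle q."""
    lo_i = q[i][0]  # negative whenever p_i (i < d) can be affected
    return all(lo_i <= -MIN(q[j]) for j in range(len(q)) if j != i)

# A query touching the center in dimension 1 intersects pyramid p0:
print(intersects(0, [(-0.3, -0.1), (-0.05, 0.2)]))  # True
```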
32 - In the second step, we have to determine which pyramid values inside an affected pyramid p_i are affected by the query. That is, we should determine the heights in the range [h_low, h_high].
- For this we define some further notation.
33 - There are two cases:
- Case 1
34-35 (No transcript)
36 Some papers for you
- The Pyramid-Technique: Towards Breaking the Curse of Dimensionality
- The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries
- The TV-tree: An Index Structure for High-Dimensional Data
37 Study iMinMax(θ)
- By Li Xiaolan
- lixiaola@comp.nus.edu.sg
38 Outline
- Quick review of iMinMax(θ)
- Partition with θ
- Point & range search
- Comparison with iDistance
- References
39 Quick review of iMinMax(θ)
- xmin and xmax are the smallest and largest coordinate values of p = (x1, x2, ..., xd)
- dmin and dmax are the corresponding dimensions for xmin and xmax
- θ is a real number, called the tuning knob (a sketch of the resulting mapping follows below)
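This is our own minimal sketch of the iMinMax(θ) mapping as reviewed above; the function name is ours, and dimensions are numbered 1-based to match the examples later in this section.

```python
# Sketch of the iMinMax(theta) key: a point is indexed over its min edge when
# xmin + theta < 1 - xmax, and over its max edge otherwise.

def iminmax_key(p, theta):
    d = len(p)
    dmin = min(range(d), key=lambda j: p[j])  # dimension of xmin
    dmax = max(range(d), key=lambda j: p[j])  # dimension of xmax
    xmin, xmax = p[dmin], p[dmax]
    if xmin + theta < 1.0 - xmax:
        return (dmin + 1) + xmin   # min edge: key = dmin + xmin
    return (dmax + 1) + xmax       # max edge: key = dmax + xmax

print(iminmax_key((0.2, 0.6), theta=0.0))  # 0.2 + 0 < 1 - 0.6 -> key 1.2
```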
40 Partition with θ
- How can iMinMax(θ) partition data?
- How can θ influence the partition?
- Problems of estimating θ
41 [Figure: 2-dimensional example with x = d1 and y = d2. The lines y = x and y = 1 - x - θ divide the space into four partitions (labeled 1-4, with example points A and B). Index values: y = 1 + x for x in [0, (1-θ)/2] and for x in [(1-θ)/2, 1]; y = 2 + y for y in [0, (1-θ)/2] and for y in [(1-θ)/2, 1]. The partition boundaries on the index axis are 1, 1 + (1-θ)/2, 2, and 2 + (1-θ)/2.]
42 [Figure: the dividing lines y = 1 - x - θ for -1 < θ < 0, y = 1 - x for θ = 0, and y = 1 - x - θ for 0 < θ < 1, together with y = x]
- Points on the dashed line have the same index value.
- Varying θ yields different partitions.
43 Problems of estimating θ
- With the cluster center at (0.6, 0.6) and θ = -0.1, the points can be evenly partitioned.
- For a high-dimensional dataset, θ has to seek out the cluster center.
[Figure: normal distribution of skewed data (θ = -0.1)]
44 Problems of estimating θ
- Since points are mapped to many surfaces based on their edges and dimensions, and θ is only a single real number, how can we judge whether θ is near the cluster center?
- Every dataset has an optimal θ value, but it is still difficult to guess a suitable value of θ for most datasets.
- Without any clue, varying θ from -1 to 1 and observing the performance may find the optimal value for θ, but it is expensive!
- Estimating θ may remain a problem for further study.
45 Point & Range Search
- Point search
- The query point p is mapped to its index value based on iMinMax(θ)
- A B+-tree indexes the points according to y
- Simple; the algorithm is omitted here (a rough stand-in is sketched below)
- Range search
- A great feature of iMinMax(θ)
- Basic and better transformations
- Algorithm explanation
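Since the slides omit the point-search algorithm, the following stand-in is our own sketch: a sorted list plays the role of the B+-tree, and iminmax_key is the mapping sketched earlier.

```python
# Stand-in for the omitted point-search step: a sorted list plays the B+-tree.
# Reuses iminmax_key from the earlier sketch.
import bisect

def build_index(points, theta):
    return sorted((iminmax_key(p, theta), p) for p in points)

def point_search(index, p, theta):
    key = iminmax_key(p, theta)
    i = bisect.bisect_left(index, (key, ()))  # descend to the "leaf" for key
    while i < len(index) and index[i][0] == key:
        if index[i][1] == p:                  # same key: compare the points
            return True
        i += 1
    return False
```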
46 Basic transformation
- Suppose the original query range is
- Q = (q1, q2, ..., qd)
- qj = [xj1, xj2], 1 <= j <= d
- Basic transformation:
- qj' = [j + xj1, j + xj2]
- Answer set:
- ans(Q) = ∪j ans(qj')
47 Example
- Q = ([0.2, 0.4], [0.3, 0.5], [0.1, 0.7])
- Q' = ([1.2, 1.4], [2.3, 2.5], [3.1, 3.7])
- Subqueries (a sketch reproducing this transformation follows below):
- q1' = [1.2, 1.4]
- q2' = [2.3, 2.5]
- q3' = [3.1, 3.7]
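A one-line sketch of the basic transformation, reproducing the example above (our own illustration; the function name is ours):

```python
# Basic transformation: dimension j's interval [lo, hi] becomes [j + lo, j + hi].

def basic_transform(Q):
    return [(j + 1 + lo, j + 1 + hi) for j, (lo, hi) in enumerate(Q)]

Q = [(0.2, 0.4), (0.3, 0.5), (0.1, 0.7)]
print(basic_transform(Q))  # [(1.2, 1.4), (2.3, 2.5), (3.1, 3.7)]
```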
48 [Figure: the subqueries q1' = [1.35, 1.8] and q2' = [2.3, 2.7] drawn on the 2-dimensional partition diagram, with axis ticks at 0.3, 0.35, 0.7, and 0.8]
- All the colored parts need to be examined; 4 triangles are not in the answer set.
49 Better transformation
- If Q = (q1, q2, ..., qd) with qj = [xj1, xj2] satisfies min{xj1} + θ >= 1 - max{xj1}, then
- for each point (x1, x2, ..., xd) in the query range, we have xmin >= min{xj1} and xmax >= max{xj1},
- so xmin + θ >= 1 - xmax. It means that all the answer points lie on the max edge, with y = dmax + xmax >= dmax + max{xj1},
- so the transformed subquery can be narrowed to qj' = [j + max{xl1}, j + xj2].
50 Better Transformation
- Example
- Q = ([0.3, 0.5], [0.4, 0.7], [0.6, 0.9]), θ = 0.4
- For any point (x1, x2, x3) in the range: xmin >= 0.3, xmax >= 0.6
- 0.3 + 0.4 > 1 - 0.6, so xmin + θ > 1 - xmax
- Every answer point falls on the xmax edge (indexed by dmax + xmax)!
- Q' = ([1.6, 1.5], [2.6, 2.7], [3.6, 3.9])
51 Better transformation
- Likewise, if Q with qj = [xj1, xj2] satisfies min{xj2} + θ < 1 - max{xj2}, all the answer points lie on the min edge,
- and the transformed subquery can be narrowed to qj' = [j + xj1, j + min{xl2}].
- So now we have the better solution.
52 - With θ = 0.5 and Q = ([0.2, 0.3], [0.4, 0.6]): 0.2 + 0.5 > 1 - 0.4, so the lower bound becomes 0.4, giving q1' = [1.4, 1.3] and q2' = [2.4, 2.6]. q1' can be pruned! It simply means we only have to check the partitions that intersect the query range (a combined sketch of both narrowing rules follows below).
- 2 triangles can be pruned!
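The sketch below is our reconstruction of the two narrowing rules plus the pruning step; the function name is ours, and the conditions are the ones derived on the previous slides.

```python
# Better transformation: raise lower bounds when all answers must lie on the
# max edge, lower upper bounds when they must lie on the min edge, and drop
# any subquery whose interval becomes empty.

def better_transform(Q, theta):
    min_lo = min(lo for lo, _ in Q)
    max_lo = max(lo for lo, _ in Q)
    min_hi = min(hi for _, hi in Q)
    max_hi = max(hi for _, hi in Q)
    out = []
    for j, (lo, hi) in enumerate(Q):
        if min_lo + theta >= 1.0 - max_lo:    # answers on the max edge only
            lo = max(lo, max_lo)
        elif min_hi + theta < 1.0 - max_hi:   # answers on the min edge only
            hi = min(hi, min_hi)
        if lo <= hi:                          # empty interval: prune subquery
            out.append((j + 1 + lo, j + 1 + hi))
    return out

# Slide 52's numbers: theta = 0.5, Q = ([0.2, 0.3], [0.4, 0.6]).
print(better_transform([(0.2, 0.3), (0.4, 0.6)], 0.5))  # q1 pruned: [(2.4, 2.6)]
```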
53 Range search explanation
- First, the d-dimensional query range is transformed into d subqueries qi = [li, hi]
- Next, pruneSubquery is invoked to check whether qi can be pruned
- For each remaining subquery, the B+-tree is traversed to the appropriate leaf node
- Every x in [li, hi] is examined
- The final answer set is obtained
54 Comparison with iDistance
- Query range in iMinMax
- θ = 0 when comparing
- When to terminate searching
- Performance comparison
55 Query range in iMinMax
56 Whole dataset
- Since the region bounded by the window query is larger than the search sphere, false drops appear.
- Points A and C are the nearest neighbors, while point B is a false hit. Only by enlarging r can we prune B from the answer set and add D in.
[Figure 7: example of false drops, showing query point P with search radius r and data points A, B, C, D]
57 When to terminate searching
- If all the K nearest neighbors can be found within the current sphere of radius r, the algorithm terminates (a rough sketch of this loop follows below).
- Otherwise, the algorithm continues until the whole dataset is searched.
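This stand-in sketch is ours: the indexed range query is replaced by a linear filter over the points, and the radius step dr is an arbitrary choice.

```python
# Enlarge the search sphere until it holds K answers or covers the data space.
import math

def knn_search(points, q, K, dr=0.05):
    r = dr
    r_max = math.sqrt(len(q))          # largest possible distance in unit space
    while True:
        # Stand-in for the indexed range query with the current radius r.
        inside = sorted((p for p in points if math.dist(p, q) <= r),
                        key=lambda p: math.dist(p, q))
        if len(inside) >= K or r >= r_max:
            return inside[:K]          # all K found within r, or all data seen
        r += dr                        # otherwise enlarge the sphere and retry
```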
58 Performance Comparison
59 Performance Comparison
- For the uniform dataset, iMinMax performs as well as iDistance.
- For the clustered dataset, iMinMax is inferior to iDistance.
- Many false drops may occur in a clustered dataset, so iMinMax may have to search a large part of the whole dataset to guarantee 100% accuracy.
- However, iMinMax can produce approximate answers very quickly, at lower cost than a linear scan, with up to 95% accuracy.
60 References
- Indexing the Edges: A simple and yet efficient approach to high-dimensional indexing
- Progressive KNN Search Using B+-trees
- Indexing the Distance: An efficient method to KNN processing
61 iDistance
- By He Qi
- heqi@comp.nus.edu.sg
62 Topics
- Two important issues in iDistance: data space partitioning and selection of reference points
- Problems of iDistance
- An improvement of iDistance: hyper-cube iDistance
- Questions & future work
63 Data space partitioning
- Equal partitioning of the data space
- Does not consider the distribution of data; data points may be allocated unevenly and loosely.
- Cluster-based partitioning
- Considers the distribution of data; close data points are allocated densely in each cluster.
64 Equal partitioning of data space
[Figure: a KNN query (query point q, search radius r) in a 2-dimensional skewed data space]
65 Selection of reference points
66 Contrast between two kinds of reference point selection
[Figure: two choices of reference point, 1. the centroid of a partition and 2. a point outside the partition, showing the overlap between sub-spaces, the search radius r, and the points with the same distance from the reference point]
67 Problems of iDistance
- For similarity range and KNN queries, the search space is large. The reason is that many points in a partition have the same indexing value.
- The cost of identifying the partitions that overlap with a query region is high, because computing the distance between two points is expensive.
- Window/range queries are not supported.
68 Hyper-cube iDistance
69 Basic idea
- Instead of using hyper-spheres to bound partitions and query regions as in iDistance, we use hyper-cubes with the same orientation (axis-parallel) to bound partitions and query regions. Note that the partitioning process is the same.
70 How to index?
- For each partition, we pick a hyper-plane as the reference plane.
- The indexing value of a data point is the distance between the point and its projection onto the reference plane (a sketch of this key follows below).
- The centroid of each partition (hyper-cube) is also kept.
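This sketch is our own reading of the proposal: we assume an axis-parallel reference plane {x : x[axis] = offset}, so the point-to-projection distance collapses to one coordinate difference, and we borrow an iDistance-style constant to keep different partitions in disjoint key ranges; all names are illustrative.

```python
# Hyper-cube iDistance key sketch: distance from a point to its projection
# onto an axis-parallel reference plane {x : x[axis] = offset}.

def hypercube_key(partition_id, p, axis, offset):
    dist = abs(p[axis] - offset)        # |x_axis - offset| is that distance
    return partition_id * 10.0 + dist   # keep partitions in disjoint key ranges

print(hypercube_key(2, (0.3, 0.7), axis=1, offset=0.0))  # 2 * 10 + 0.7 = 20.7
```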
71 Hyper-cube iDistance
- Contrast with iDistance: in iDistance, to get the same effect, a reference point would have to be placed infinitely far away, while the overlaps also increase rapidly.
72 How to perform a KNN query?
- Given a query point and the current search radius, for each partition overlapping the query region, the yellow and purple regions need to be searched.
- Pruning: each point in the search space is first checked for membership in the query region (which is much cheaper than computing a distance); only for those in the query region (the purple region) is the distance from q computed. A sketch of this filter follows below.
- Stopping criterion: the search stops when 1) K points are found, and 2) the distance of the farthest object in the answer set from q is less than or equal to the current search radius.
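The filter sketch below is our own illustration, taking the query region as an axis-parallel cube of half-width r around q:

```python
# Cheap filter first: cube membership needs only coordinate comparisons;
# the exact distance is computed only for points that pass it.
import math

def in_cube(p, q, r):
    return all(abs(pj - qj) <= r for pj, qj in zip(p, q))

def refine(candidates, q, r, K):
    hits = [p for p in candidates
            if in_cube(p, q, r) and math.dist(p, q) <= r]
    return sorted(hits, key=lambda p: math.dist(p, q))[:K]
```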
73 Contrasting with iDistance
74 Questions & Future work
- How to handle data entries with the same indexing value in a B+-tree?
- Given a data set and an indexing method, how to compute the average cost of a certain kind of query?
- How to account for the interdependences and varying importance of dimensions in hyper-cube iDistance?
75 Summary
- Four indexing techniques for high-dimensional databases:
- Pyramid-Technique
- iMinMax(θ)
- iDistance
- Hyper-cube iDistance