Title: Indexing Techniques
1 Indexing Techniques
- Zhang Xiaofeng
- Li Xiaolan
- He Qi
2 Preview
- The Pyramid-Technique
- Study iMinMax(θ)
- iDistance
- Hyper-cube iDistance
3 The Pyramid-Technique
- By Zhang Xiaofeng
- zhangxi4@comp.nus.edu.sg
4 Outline
- The Pyramid-Technique versus other techniques
- How to use it
- Some papers for you
5 Start!
6 - What is the Pyramid-Technique?
- Why do we need the Pyramid-Technique?
- What are the problems of other techniques?
7 So-called Balanced Split
- Used in the space partitioning of many indexing structures (such as the SS-tree, SR-tree, and TV-tree).
- Splits the data space into equally filled regions.
8 The problems
- In high-dimensional space, it seems impossible!
- Why?
9 Example
10 Conclusion
- So we split only once, and only in a few dimensions.
11 - Now we can see the problem!
- If we query even a very small range, say 0.01 per dimension in a 20-dimensional space, the ratio of accessed data pages is still very large. This means we would have to access all the data pages!
12 The Pyramid-Technique
- The Pyramid-Technique can handle this more efficiently.
- Let us look at the figure.
13 Partitioning using the Pyramid-Technique
[Figure: partitioning using the Pyramid-Technique versus a balanced split]
14 How to do it?
15 - The Pyramid-Technique is based on a special partitioning strategy that is optimized for high-dimensional data.
- After partitioning, each point in the data space is mapped to a 1-dimensional value, and a B+-tree is used to build the indexing structure.
16 How to partition?
- Step 1: split the data space into 2d pyramids that have the center point of the data space (0.5, 0.5, ..., 0.5) as their top and a (d-1)-dimensional surface of the data space as their base.
- For example, in 2-dimensional space:
17 [Figure: a pyramid, with the center point of the data space at its top and a (d-1)-dimensional surface as its base]
18 - Each of the 2d pyramids is divided into several partitions, each corresponding to one data page of the B+-tree.
19 [Figure: a pyramid divided into partitions]
20 - Step 2: number the pyramids.
- Note that in the 2-dimensional example in the figure, the base of pyramid p_i is the surface on which the i-th or (i-d)-th coordinate is 0 or 1, respectively.
21 [Figure: the 2-dimensional space divided into four numbered pyramids (0-3) around the center point, each with a (d-1)-dimensional surface as its base]
22 Some definitions
- Definition 1: A d-dimensional point v is defined to be located in pyramid p_i, where i = j_max if v[j_max] < 0.5 and i = j_max + d if v[j_max] >= 0.5, with j_max the dimension in which |0.5 - v[j]| is maximal.
23 - Definition 2: Height of a point v
- The height of v is its distance from the center point along the pyramid's dimension: h_v = |0.5 - v[i mod d]|.
24 - Definition 3: Pyramid value of a point v
- The pyramid value of v is defined as pv_v = i + h_v, where i is the number of the pyramid containing v (a minimal sketch of Definitions 1-3 follows below).
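The sketch below is our own illustration of Definitions 1-3, following the cited Pyramid-Technique paper; the function names are ours, and coordinates are assumed to be normalized to [0, 1].

```python
# Minimal sketch of Definitions 1-3; assumes coordinates normalized to [0, 1].

def pyramid_number(v):
    """Definition 1: the number i of the pyramid containing point v."""
    d = len(v)
    # Dimension in which v deviates most from the center 0.5.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 have their base at coordinate 0, pyramids d..2d-1 at 1.
    return j_max if v[j_max] < 0.5 else j_max + d

def height(v, i):
    """Definition 2: distance of v from the center along the pyramid's dimension."""
    return abs(0.5 - v[i % len(v)])

def pyramid_value(v):
    """Definition 3: pv_v = i + h_v, the 1-dimensional key stored in the B+-tree."""
    i = pyramid_number(v)
    return i + height(v, i)

print(pyramid_value((0.2, 0.6)))  # lies in p0 with height 0.3 -> pyramid value 0.3
```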
25 How to handle querying
- Point query
- Given a point p, determine whether p is in the dataset.
26 - Range query
- In the case of range queries, the problem is defined as follows: given a d-dimensional interval [q_0^min, q_0^max], ..., [q_(d-1)^min, q_(d-1)^max], determine the points in the database which lie inside the range.
27 - First, we should determine which pyramids are intersected by the range query.
- Second, we have to determine which pyramid values inside an affected pyramid p_i are affected by the query. Thus, we are looking for the interval [h_low, h_high].
28 - For simplicity, we focus the description of the algorithm only on the pyramids p_i where i < d.
29 - As the first step, we transform the query rectangle q into an equivalent rectangle q̂ whose intervals are defined relative to the center point: q̂_j = [q_j^min - 0.5, q_j^max - 0.5].
30 - We also define MIN(r) and MAX(r) like this:
- MIN(r) = 0 if r^min <= 0 <= r^max, and MIN(r) = min(|r^min|, |r^max|) otherwise.
- Note that both ends of a transformed interval may be negative, so |r^min| may be larger than |r^max|.
- Analogously, we define MAX(r) = max(|r^min|, |r^max|).
31 - A pyramid p_i (i < d) is intersected by the transformed hyper-rectangle q̂
- iff
- q̂_i^min <= -MIN(q̂_j) for all j != i.
- (Hint: the value of q̂_i^min is negative. A sketch of this test follows below.)
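As an illustration, here is a minimal Python sketch of MIN, MAX, and the intersection test as reconstructed above; the names and the sample query are ours, and the query rectangle is assumed to be already shifted by -0.5 in every dimension.

```python
# Sketch of the intersection test; q is the transformed query rectangle,
# one (min, max) interval per dimension, each already shifted by -0.5.

def MIN(r):
    lo, hi = r
    return 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))

def MAX(r):
    lo, hi = r
    return max(abs(lo), abs(hi))

def intersects(i, q):
    """True iff pyramid p_i (i < d) intersects the transformed rectangle q."""
    lo_i = q[i][0]  # negative whenever p_i (i < d) can be affected
    return all(lo_i <= -MIN(q[j]) for j in range(len(q)) if j != i)

# A query touching the center in dimension 1 intersects pyramid p0:
print(intersects(0, [(-0.3, -0.1), (-0.05, 0.2)]))  # True
```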
32 - In the second step, we have to determine which pyramid values inside an affected pyramid p_i are affected by the query. That is, we should determine the heights in the range [h_low, h_high].
- For this we define some further notation.
33 - There are two cases:
- Case 1
34-35 (No transcript)
36 Some papers for you
- The Pyramid-Technique: Towards Breaking the Curse of Dimensionality
- The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries
- The TV-tree: An Index Structure for High-Dimensional Data
37 Study iMinMax(θ)
- By Li Xiaolan
- lixiaola@comp.nus.edu.sg
38 Outline
- Quick review of iMinMax(θ)
- Partition with θ
- Point & range search
- Comparison with iDistance
- References
39 Quick review of iMinMax(θ)
- xmin and xmax are the smallest and largest coordinate values of p = (x1, x2, ..., xd)
- dmin and dmax are the corresponding dimensions for xmin and xmax
- θ is a real number, called the tuning knob (a sketch of the resulting mapping follows below)
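This is our own minimal sketch of the iMinMax(θ) mapping as reviewed above; the function name is ours, and dimensions are numbered 1-based to match the examples later in this section.

```python
# Sketch of the iMinMax(theta) key: a point is indexed over its min edge when
# xmin + theta < 1 - xmax, and over its max edge otherwise.

def iminmax_key(p, theta):
    d = len(p)
    dmin = min(range(d), key=lambda j: p[j])  # dimension of xmin
    dmax = max(range(d), key=lambda j: p[j])  # dimension of xmax
    xmin, xmax = p[dmin], p[dmax]
    if xmin + theta < 1.0 - xmax:
        return (dmin + 1) + xmin   # min edge: key = dmin + xmin
    return (dmax + 1) + xmax       # max edge: key = dmax + xmax

print(iminmax_key((0.2, 0.6), theta=0.0))  # 0.2 + 0 < 1 - 0.6 -> key 1.2
```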
40 Partition with θ
- How can iMinMax(θ) partition data?
- How can θ influence the partition?
- Problems of estimating θ
41 [Figure: 2-dimensional example with x = d1 and y = d2. The lines y = x and y = 1 - x - θ divide the space into four partitions (labeled 1-4, with example points A and B). Index values: y = 1 + x for x in [0, (1-θ)/2] and for x in [(1-θ)/2, 1]; y = 2 + y for y in [0, (1-θ)/2] and for y in [(1-θ)/2, 1]. The partition boundaries on the index axis are 1, 1 + (1-θ)/2, 2, and 2 + (1-θ)/2.]
42 [Figure: the dividing lines y = 1 - x - θ for -1 < θ < 0, y = 1 - x for θ = 0, and y = 1 - x - θ for 0 < θ < 1, together with y = x]
- Points on the dashed line have the same index value.
- Varying θ yields different partitions.
43 Problems of estimating θ
- With the cluster center at (0.6, 0.6) and θ = -0.1, the points can be evenly partitioned.
- For a high-dimensional dataset, θ has to seek out the cluster center.
[Figure: normal distribution of skewed data (θ = -0.1)]
44 Problems of estimating θ
- Since points are mapped to many surfaces based on their edges and dimensions, and θ is only a single real number, how can we judge whether θ is near the cluster center?
- Every dataset has an optimal θ value, but it is still difficult to guess a suitable value of θ for most datasets.
- Without any clue, varying θ from -1 to 1 and observing the performance may find the optimal value for θ, but it is expensive!
- Estimating θ may remain a problem for further study.
45 Point & Range Search
- Point search
- The query point p is mapped to its index value based on iMinMax(θ)
- A B+-tree indexes the points according to y
- Simple; the algorithm is omitted here (a rough stand-in is sketched below)
- Range search
- A great feature of iMinMax(θ)
- Basic and better transformations
- Algorithm explanation
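Since the slides omit the point-search algorithm, the following stand-in is our own sketch: a sorted list plays the role of the B+-tree, and iminmax_key is the mapping sketched earlier.

```python
# Stand-in for the omitted point-search step: a sorted list plays the B+-tree.
# Reuses iminmax_key from the earlier sketch.
import bisect

def build_index(points, theta):
    return sorted((iminmax_key(p, theta), p) for p in points)

def point_search(index, p, theta):
    key = iminmax_key(p, theta)
    i = bisect.bisect_left(index, (key, ()))  # descend to the "leaf" for key
    while i < len(index) and index[i][0] == key:
        if index[i][1] == p:                  # same key: compare the points
            return True
        i += 1
    return False
```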
46 Basic transformation
- Suppose the original query range is
- Q = (q1, q2, ..., qd)
- qj = [xj1, xj2], 1 <= j <= d
- Basic transformation:
- qj' = [j + xj1, j + xj2]
- Answer set:
- ans(Q) = ∪j ans(qj')
47 Example
- Q = ([0.2, 0.4], [0.3, 0.5], [0.1, 0.7])
- Q' = ([1.2, 1.4], [2.3, 2.5], [3.1, 3.7])
- Subqueries (a sketch reproducing this transformation follows below):
- q1' = [1.2, 1.4]
- q2' = [2.3, 2.5]
- q3' = [3.1, 3.7]
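A one-line sketch of the basic transformation, reproducing the example above (our own illustration; the function name is ours):

```python
# Basic transformation: dimension j's interval [lo, hi] becomes [j + lo, j + hi].

def basic_transform(Q):
    return [(j + 1 + lo, j + 1 + hi) for j, (lo, hi) in enumerate(Q)]

Q = [(0.2, 0.4), (0.3, 0.5), (0.1, 0.7)]
print(basic_transform(Q))  # [(1.2, 1.4), (2.3, 2.5), (3.1, 3.7)]
```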
48 [Figure: the subqueries q1' = [1.35, 1.8] and q2' = [2.3, 2.7] drawn on the 2-dimensional partition diagram, with axis ticks at 0.3, 0.35, 0.7, and 0.8]
- All the colored parts need to be examined; 4 triangles are not in the answer set.
49 Better transformation
- If Q = (q1, q2, ..., qd) with qj = [xj1, xj2] satisfies min{xj1} + θ >= 1 - max{xj1}, then
- for each point (x1, x2, ..., xd) in the query range, we have xmin >= min{xj1} and xmax >= max{xj1},
- so xmin + θ >= 1 - xmax. It means that all the answer points lie on the max edge, with y = dmax + xmax >= dmax + max{xj1},
- so the transformed subquery can be narrowed to qj' = [j + max{xl1}, j + xj2].
50 Better Transformation
- Example
- Q = ([0.3, 0.5], [0.4, 0.7], [0.6, 0.9]), θ = 0.4
- For any point (x1, x2, x3) in the range: xmin >= 0.3, xmax >= 0.6
- 0.3 + 0.4 > 1 - 0.6, so xmin + θ > 1 - xmax
- Every answer point falls on the xmax edge (indexed by dmax + xmax)!
- Q' = ([1.6, 1.5], [2.6, 2.7], [3.6, 3.9])
51 Better transformation
- Likewise, if Q with qj = [xj1, xj2] satisfies min{xj2} + θ < 1 - max{xj2}, all the answer points lie on the min edge,
- and the transformed subquery can be narrowed to qj' = [j + xj1, j + min{xl2}].
- So now we have the better solution.
52 - With θ = 0.5 and Q = ([0.2, 0.3], [0.4, 0.6]): 0.2 + 0.5 > 1 - 0.4, so the lower bound becomes 0.4, giving q1' = [1.4, 1.3] and q2' = [2.4, 2.6]. q1' can be pruned! It simply means we only have to check the partitions that intersect the query range (a combined sketch of both narrowing rules follows below).
- 2 triangles can be pruned!
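The sketch below is our reconstruction of the two narrowing rules plus the pruning step; the function name is ours, and the conditions are the ones derived on the previous slides.

```python
# Better transformation: raise lower bounds when all answers must lie on the
# max edge, lower upper bounds when they must lie on the min edge, and drop
# any subquery whose interval becomes empty.

def better_transform(Q, theta):
    min_lo = min(lo for lo, _ in Q)
    max_lo = max(lo for lo, _ in Q)
    min_hi = min(hi for _, hi in Q)
    max_hi = max(hi for _, hi in Q)
    out = []
    for j, (lo, hi) in enumerate(Q):
        if min_lo + theta >= 1.0 - max_lo:    # answers on the max edge only
            lo = max(lo, max_lo)
        elif min_hi + theta < 1.0 - max_hi:   # answers on the min edge only
            hi = min(hi, min_hi)
        if lo <= hi:                          # empty interval: prune subquery
            out.append((j + 1 + lo, j + 1 + hi))
    return out

# Slide 52's numbers: theta = 0.5, Q = ([0.2, 0.3], [0.4, 0.6]).
print(better_transform([(0.2, 0.3), (0.4, 0.6)], 0.5))  # q1 pruned: [(2.4, 2.6)]
```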
53 Range search explanation
- First, the d-dimensional query range is transformed into d subqueries qi = [li, hi]
- Next, pruneSubquery is invoked to check whether qi can be pruned
- For each remaining subquery, the B+-tree is traversed to the appropriate leaf node
- Every x in [li, hi] is examined
- The final answer set is obtained
54 Comparison with iDistance
- Query range in iMinMax
- θ = 0 when comparing
- When to terminate searching
- Performance comparison
55 Query range in iMinMax
56 Whole dataset
- Since the region bounded by the window query is larger than the search sphere, false drops appear.
- Points A and C are the nearest neighbors, while point B is a false hit. Only by enlarging r can we prune B from the answer set and add D in.
[Figure 7: example of false drops, showing query point P with search radius r and data points A, B, C, D]
57 When to terminate searching
- If all the K nearest neighbors can be found within the current sphere of radius r, the algorithm terminates (a rough sketch of this loop follows below).
- Otherwise, the algorithm continues until the whole dataset is searched.
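This stand-in sketch is ours: the indexed range query is replaced by a linear filter over the points, and the radius step dr is an arbitrary choice.

```python
# Enlarge the search sphere until it holds K answers or covers the data space.
import math

def knn_search(points, q, K, dr=0.05):
    r = dr
    r_max = math.sqrt(len(q))          # largest possible distance in unit space
    while True:
        # Stand-in for the indexed range query with the current radius r.
        inside = sorted((p for p in points if math.dist(p, q) <= r),
                        key=lambda p: math.dist(p, q))
        if len(inside) >= K or r >= r_max:
            return inside[:K]          # all K found within r, or all data seen
        r += dr                        # otherwise enlarge the sphere and retry
```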
58 Performance Comparison
59 Performance Comparison
- For the uniform dataset, iMinMax performs as well as iDistance.
- For the clustered dataset, iMinMax is inferior to iDistance.
- Many false drops may occur in a clustered dataset, so iMinMax may have to search a large part of the whole dataset to guarantee 100% accuracy.
- However, iMinMax can produce approximate answers very quickly, at lower cost than a linear scan, with up to 95% accuracy.
60 References
- Indexing the Edges: A simple and yet efficient approach to high-dimensional indexing
- Progressive KNN Search Using B+-trees
- Indexing the Distance: An efficient method to KNN processing
61 iDistance
- By He Qi
- heqi@comp.nus.edu.sg
62 Topics
- Two important issues in iDistance: data space partitioning and selection of reference points
- Problems of iDistance
- An improvement of iDistance: hyper-cube iDistance
- Questions & future work
63 Data space partitioning
- Equal partitioning of the data space
- Does not consider the distribution of data; data points may be allocated unevenly and loosely.
- Cluster-based partitioning
- Considers the distribution of data; close data points are allocated densely in each cluster.
64 Equal partitioning of data space
[Figure: a KNN query (query point q, search radius r) in a 2-dimensional skewed data space]
65 Selection of reference points
66 Contrast between two kinds of reference point selection
[Figure: two choices of reference point, 1. the centroid of a partition and 2. a point outside the partition, showing the overlap between sub-spaces, the search radius r, and the points with the same distance from the reference point]
67 Problems of iDistance
- For similarity range and KNN queries, the search space is large. The reason is that many points in a partition have the same indexing value.
- The cost of identifying the partitions that overlap with a query region is high, because computing the distance between two points is expensive.
- Window/range queries are not supported.
68 Hyper-cube iDistance
69 Basic idea
- Instead of using hyper-spheres to bound partitions and query regions as in iDistance, we use hyper-cubes with the same orientation (axis-parallel) to bound partitions and query regions. Note that the partitioning process is the same.
70 How to index?
- For each partition, we pick a hyper-plane as the reference plane.
- The indexing value of a data point is the distance between the point and its projection onto the reference plane (a sketch of this key follows below).
- The centroid of each partition (hyper-cube) is also kept.
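This sketch is our own reading of the proposal: we assume an axis-parallel reference plane {x : x[axis] = offset}, so the point-to-projection distance collapses to one coordinate difference, and we borrow an iDistance-style constant to keep different partitions in disjoint key ranges; all names are illustrative.

```python
# Hyper-cube iDistance key sketch: distance from a point to its projection
# onto an axis-parallel reference plane {x : x[axis] = offset}.

def hypercube_key(partition_id, p, axis, offset):
    dist = abs(p[axis] - offset)        # |x_axis - offset| is that distance
    return partition_id * 10.0 + dist   # keep partitions in disjoint key ranges

print(hypercube_key(2, (0.3, 0.7), axis=1, offset=0.0))  # 2 * 10 + 0.7 = 20.7
```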
71 Hyper-cube iDistance
- Contrast with iDistance: in iDistance, to get the same effect, a reference point would have to be placed infinitely far away, while the overlaps also increase rapidly.
72 How to perform a KNN query?
- Given a query point and the current search radius, for each partition overlapping the query region, the yellow and purple regions need to be searched.
- Pruning: each point in the search space is first checked for membership in the query region (which is much cheaper than computing a distance); only for those in the query region (the purple region) is the distance from q computed. A sketch of this filter follows below.
- Stopping criterion: the search stops when 1) K points are found, and 2) the distance of the farthest object in the answer set from q is less than or equal to the current search radius.
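The filter sketch below is our own illustration, taking the query region as an axis-parallel cube of half-width r around q:

```python
# Cheap filter first: cube membership needs only coordinate comparisons;
# the exact distance is computed only for points that pass it.
import math

def in_cube(p, q, r):
    return all(abs(pj - qj) <= r for pj, qj in zip(p, q))

def refine(candidates, q, r, K):
    hits = [p for p in candidates
            if in_cube(p, q, r) and math.dist(p, q) <= r]
    return sorted(hits, key=lambda p: math.dist(p, q))[:K]
```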
73 Contrasting with iDistance
74 Questions & Future work
- How to handle data entries with the same indexing value in a B+-tree?
- Given a data set and an indexing method, how to compute the average cost of a certain kind of query?
- How to account for the interdependences and varying importance of dimensions in hyper-cube iDistance?
75 Summary
- Four indexing techniques for high-dimensional databases:
- Pyramid-Technique
- iMinMax(θ)
- iDistance
- Hyper-cube iDistance