Indexing Techniques - PowerPoint PPT Presentation

1
Indexing Techniques
  • Zhang Xiaofeng
  • Li Xiaolan
  • He Qi

2
Preview
  • The Pyramid-Technique
  • Study iMinMax(θ)
  • iDistance
  • Hyper-cube iDistance

3
The Pyramid-Technique
  • By Zhang Xiaofeng
  • zhangxi4_at_comp.nus.edu.sg

4
Outline
  • The Pyramid-Technique versus other techniques
  • How to use it
  • Some papers for you

5
Start!
6
  • What is the Pyramid-Technique?
  • Why do we need the Pyramid-Technique?
  • What are the problems of other techniques?

7
So-called Balanced Split
  • Used in the space partitioning of many indexing structures (such as the SS-tree, SR-tree, TV-tree).
  • Splits the data space into equally filled regions.

8
The problems
  • In high-dimensional space, it seems impossible!
  • Why?

9
Example
  • In 20-dimensional space, splitting each dimension just once would already create 2^20 (about one million) regions, far more than the number of data pages.

10
Conclusion
  • So we split only once, and only in a few dimensions.

11
  • Now we can see the problem!
  • If we query even a very small range, say 0.01 in each of 20 dimensions, the ratio of data pages touched is still quite large. This means we may have to access all the data pages!

12
The Pyramid-Technique
  • The Pyramid-Technique can handle it more efficiently.
  • Let us look at the figure.

13
[Figure: partitioning using the Pyramid-Technique vs. balanced split]
14
How to do it?
15
  • The Pyramid-Technique is based on a special partitioning strategy that is optimized for high-dimensional data.
  • After partitioning, each point in the data space is mapped to a 1-dimensional value, and a B+-tree is used to build the index.

16
How to partition?
  • Step 1:
  • split the data space into 2d pyramids that share the center point of the data space (0.5, 0.5, …, 0.5) as their top and each have a (d−1)-dimensional surface of the data space as its base.
  • For example, in 2-dimensional space:

17
[Figure: a pyramid with the center point as its top and a (d−1)-dimensional surface as its base]
18
  • Each of the 2d pyramids is divided into several partitions, each corresponding to one data page of the B+-tree.

19
[Figure: partitions within a pyramid]
20
  • Step 2: number the pyramids.
  • Note the 2-dimensional example in the figure: the base of pyramid P_i is the surface on which the i-th coordinate is 0 (for i < d) or the (i−d)-th coordinate is 1 (for i >= d).

21
[Figure: the four pyramids of the 2-d space, numbered 0 to 3 around the center point, each with a (d−1)-dimensional base surface]
22
Some definitions
  • Definition 1 (Pyramid of a point): a d-dimensional point v is defined to be located in pyramid P_i, where
    i = j_max if v_(j_max) < 0.5, and i = j_max + d if v_(j_max) >= 0.5,
    with j_max the dimension in which v deviates most from the center, j_max = argmax_j |0.5 − v_j|.

23
  • Definition 2 (Height of a point v): the height of v is its deviation from the center in dimension j_max,
    h_v = |0.5 − v_(j_max)|
[Figure: the height of v, measured from the center point]
24
  • Definition 3 (Pyramid value of a point v): the pyramid value of v is defined as
    pv_v = i + h_v,
    where i is the number of the pyramid containing v.
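Definitions 1 to 3 can be sketched in Python. This is a minimal illustration, assuming points lie in the unit hypercube [0,1]^d; the function names are ours, not from the slides.

```python
# A minimal sketch of Definitions 1-3 for the Pyramid-Technique,
# assuming points in the unit hypercube [0,1]^d.

def pyramid_number(v):
    """Definition 1: the number i of the pyramid containing point v."""
    d = len(v)
    # dimension in which v deviates most from the center 0.5
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    return j_max if v[j_max] < 0.5 else j_max + d

def height(v):
    """Definition 2: the deviation of v from the center in dimension j_max."""
    return max(abs(0.5 - x) for x in v)

def pyramid_value(v):
    """Definition 3: pv(v) = i + h(v), the 1-d key stored in the B+-tree."""
    return pyramid_number(v) + height(v)
```

Since 0 <= h_v <= 0.5 and pyramid numbers are integers, keys of different pyramids never overlap in the B+-tree.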

25
How to handle queries
  • Point query
  • Given a point p, determine whether p is in the dataset.

26
  • Range query
  • In the case of range queries, the problem is defined as follows: given a d-dimensional interval, determine the points in the database which are inside the range.

27
  • First, we should determine which pyramids are intersected by the range query.
  • Second, we have to determine which pyramid values inside an affected pyramid P_i are affected by the query. Thus, we are looking for the interval [h_low, h_high].

28
  • For simplicity, we focus the description of the algorithm only on pyramids P_i where i < d.

29
  • As the first step, we transform the query rectangle q into an equivalent rectangle q̂ such that each interval is defined relative to the center point.

30
  • We also define MIN(r) and MAX(r) for a center-relative interval r = [r_min, r_max] like this:
    MIN(r) = 0 if r_min <= 0 <= r_max, and min(|r_min|, |r_max|) otherwise
    MAX(r) = max(|r_min|, |r_max|)
  • Note that MIN(r) = 0 whenever the interval contains the center.
  • Analogously, we define these quantities for the transformed query intervals q̂_j.

31
  • A pyramid P_i (i < d) is intersected by a hyper-rectangle q̂
  • iff
  • q̂_i_min <= −MIN(q̂_j) for every j != i
  • (Hint: the value of q̂_i_min is negative.)
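The first step of range-query processing can be sketched as follows. This is a hedged reconstruction of the intersection test for pyramids P_i with i < d only (pyramids with i >= d are symmetric); the function names are illustrative.

```python
# A hedged sketch of the pyramid/rectangle intersection test for
# pyramids P_i with i < d. Names are illustrative, not from the paper.

def to_center_relative(q):
    """Shift each query interval [lo, hi] to be relative to the center 0.5."""
    return [(lo - 0.5, hi - 0.5) for lo, hi in q]

def interval_min(r):
    """MIN(r): the minimum absolute value attained inside interval r."""
    lo, hi = r
    return 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))

def intersects_pyramid(q_hat, i):
    """P_i (i < d) is intersected iff q_hat[i] lower bound <= -MIN(q_hat[j]) for all j != i."""
    lo_i = q_hat[i][0]
    return all(lo_i <= -interval_min(r) for j, r in enumerate(q_hat) if j != i)
```

For example, the 2-d query [0.1, 0.2] x [0.4, 0.6] lies entirely in the pyramid based at x = 0, so only P_0 passes the test.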

32
  • In the second step, we have to determine which pyramid values inside an affected pyramid P_i are affected by the query. That is, we should determine the heights h that lie in the range [h_low, h_high].

33
  • There are two cases:
  • Case 1: the query rectangle contains the center point; then h_low = 0 and h_high = MAX(q̂_i).

34
  • Case 2: the query rectangle does not contain the center point; then h_low > 0 and is determined from the interval bounds of the query inside the pyramid.

36
Some papers for you
  • The Pyramid-Technique: Towards Breaking the Curse of Dimensionality
  • The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries
  • The TV-tree: An Index Structure for High-Dimensional Data

37
Study iMinMax(θ)
  • By Li Xiaolan
  • lixiaola_at_comp.nus.edu.sg

38
Outline
  • Quick Review of iMinMax(θ)
  • Partition with θ
  • Point and Range Search
  • Comparison with iDistance
  • Reference

39
Quick review of iMinMax(θ)
  • x_min and x_max are the smallest and largest values of p = (x1, x2, …, xd)
  • d_min and d_max are the corresponding dimensions for x_min and x_max
  • θ is a real number, called the tuning knob
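The mapping just reviewed can be written as a small sketch. We assume the rule y = d_min + x_min when x_min + θ < 1 − x_max, and y = d_max + x_max otherwise, with 1-based dimension numbers matching the deck's later examples; this is an illustration, not the paper's exact code.

```python
# A minimal sketch of the iMinMax(theta) mapping: each d-dimensional
# point is reduced to the single value stored in the B+-tree.

def iminmax_key(point, theta):
    """Map a d-dimensional point to its single B+-tree key (1-based dims)."""
    d_min, x_min = min(enumerate(point, start=1), key=lambda t: t[1])
    d_max, x_max = max(enumerate(point, start=1), key=lambda t: t[1])
    if x_min + theta < 1.0 - x_max:     # the point is "closer" to its min edge
        return d_min + x_min
    return d_max + x_max                # otherwise index by the max edge
```

With θ = 0.4, the point (0.3, 0.6, 0.4) has 0.3 + 0.4 >= 1 − 0.6, so it is indexed by its max edge as 2 + 0.6 = 2.6.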

40
Partition with θ
  • How does iMinMax(θ) partition data?
  • How does θ influence the partitioning?
  • Problems of estimating θ

41
[Figure: iMinMax(θ) partitioning in 2-d. Axes x (d1) and y (d2); the lines y = x and y = 1 − x − θ split the space into regions A and B with partitions 1 to 4. Index values: y' = 1 + x for x in [0, (1−θ)/2] and for x in [(1−θ)/2, 1]; y' = 2 + y for y in [0, (1−θ)/2] and for y in [(1−θ)/2, 1]. The key axis runs 1, 1+(1−θ)/2, 2, 2+(1−θ)/2, 3.]
42
[Figure: the split lines y = 1 − x − θ for −1 < θ < 0, y = 1 − x for θ = 0, and y = 1 − x − θ for 0 < θ < 1, together with y = x.]
Points on the dashed line have the same index value; varying θ yields different partitions.
43
Problems of estimating θ
If the cluster center is at (0.6, 0.6) and θ = −0.1, then the points can be evenly partitioned. For a high-dimensional dataset, θ has to look out for the cluster center.
[Figure: normal distribution of skewed data (θ = −0.1)]
44
Problems of estimating θ
  • Since points are mapped to many surfaces based on their edges and dimensions, while θ is only a single real number, how can we judge whether θ is near the cluster center?
  • Every dataset has an optimal θ value, but it is still difficult to guess the right value of θ for most datasets.
  • Without any clue, varying θ from −1 to 1 and observing the performance may yield the optimal value of θ, but it is expensive!
  • Estimating θ may be a remaining problem for further study.

45
Point and Range Search
  • Point search
  • The query point p is mapped to x_p based on iMinMax(θ)
  • A B+-tree indexes the points according to their mapped values
  • Simple; the algorithm is omitted here
  • Range search
  • A great feature of iMinMax(θ)
  • Basic and better transformations
  • Algorithm explanation

46
Basic transformation
  • Suppose the original query range is
  • Q = (q1, q2, …, qd)
  • qj = [x_j1, x_j2], 1 <= j <= d
  • Basic transformation:
  • qj' = [j + x_j1, j + x_j2]
  • Answer set:
  • ans(Q) = ∪_j ans(qj')
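The basic transformation can be sketched as follows (the helper name is ours; dimensions are 1-based as in the deck's example):

```python
# A sketch of the basic transformation: a d-dimensional range query Q
# becomes d one-dimensional subqueries over the B+-tree keys.

def basic_subqueries(Q):
    """Q is a list of per-dimension ranges (x_j1, x_j2); dims are 1-based."""
    return [(j + lo, j + hi) for j, (lo, hi) in enumerate(Q, start=1)]
```

This reproduces the deck's three-dimensional example on the next slide.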

47
Example
  • Q = ([0.2, 0.4], [0.3, 0.5], [0.1, 0.7])
  • Q' = ([1.2, 1.4], [2.3, 2.5], [3.1, 3.7])
  • Subqueries:
  • q1' = [1.2, 1.4]
  • q2' = [2.3, 2.5]
  • q3' = [3.1, 3.7]

48
[Figure: the basic transformation in 2-d (axes x = d1 and y = d2, split lines y = x and y = 1 − x − θ), with subqueries q1' = [1.35, 1.8] and q2' = [2.3, 2.7].]
All the colored parts need to be examined; 4 triangles are not in the answer set.
49
Better transformation
If Q = (q1, q2, …, qd) satisfies
min_j(x_j1) + θ >= 1 − max_j(x_j1),
then for each point (x1, x2, …, xd) in the query range we have x_min >= min_j(x_j1) and x_max >= max_j(x_j1), so x_min + θ >= 1 − x_max. It means that all the answer points fall on the max edge, with y = d_max + x_max >= d_max + max_j(x_j1), so the transformed subqueries can be narrowed to
qj' = [j + max_l(x_l1), j + x_j2]
50
Better Transformation
  • Example
  • Q = ([0.3, 0.5], [0.4, 0.7], [0.6, 0.9])
  • point (x1, x2, x3)
  • min_j(x_j1) = 0.3, max_j(x_j1) = 0.6, θ = 0.4
  • 0.3 + 0.4 > 1 − 0.6, so x_min + θ > 1 − x_max
  • every point falls on the x_max edge (d_max, x_max)!
  • Q' = ([1.6, 1.5], [2.6, 2.7], [3.6, 3.9])

51
Better transformation
Likewise, if Q satisfies
min_j(x_j2) + θ < 1 − max_j(x_j2),
all the answer points fall on the min edge, and the transformed subqueries can be narrowed to
qj' = [j + x_j1, j + min_l(x_l2)]
So the better solution is to use the narrowed subqueries whenever one of the two conditions holds, and the basic transformation otherwise.
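A hedged sketch of the better transformation: the two guard conditions below are our sufficient-condition reading of the deck's theorems, and the function name is illustrative. Dimensions are 1-based as in the examples.

```python
# A hedged sketch of the "better" (narrowed) transformation. When every
# answer point must fall on the max (resp. min) edge, each subquery's
# lower (resp. upper) bound tightens; otherwise fall back to the basic one.

def better_subqueries(Q, theta):
    """Q is a list of per-dimension ranges (x_j1, x_j2)."""
    los = [lo for lo, _ in Q]
    his = [hi for _, hi in Q]
    if min(los) + theta >= 1.0 - max(los):
        # every answer point falls on the max edge: tighten lower bounds
        return [(j + max(los), j + hi) for j, (_, hi) in enumerate(Q, start=1)]
    if min(his) + theta < 1.0 - max(his):
        # every answer point falls on the min edge: tighten upper bounds
        return [(j + lo, j + min(his)) for j, (lo, _) in enumerate(Q, start=1)]
    # otherwise: the basic transformation
    return [(j + lo, j + hi) for j, (lo, hi) in enumerate(Q, start=1)]
```

With the query ([0.3, 0.5], [0.4, 0.7], [0.6, 0.9]) and θ = 0.4, the first narrowed subquery has a lower bound above its upper bound, so it can be pruned outright.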
52
Since 0.2 + 0.5 > 1 − 0.4, the lower bound is 0.4: q1' = [1.4, 1.3] and q2' = [2.4, 2.6], so q1' can be pruned! It simply means we only have to check the partitions that intersect the query range.
2 triangles can be pruned!
53
Range search explanation
  • First, a d-dimensional query range is transformed into d subqueries, qi = [li, hi]
  • Next, pruneSubquery is invoked to check whether qi can be pruned.
  • For each remaining subquery, the B+-tree is traversed to the appropriate leaf node.
  • Examine every x in [li, hi]
  • Get the final answer set

54
Comparison with iDistance
  • Query range in iMinMax
  • θ = 0 when comparing
  • When to terminate searching
  • Performance comparison

55
Query range in iMinMax
56
Since the region bounded by the window query is larger than the search sphere, false drops appear. Points A and C are the nearest neighbors, while point B is a false hit; only by enlarging r can we prune B from the answer set and add D in.
[Figure 7: an example of false drops over the whole dataset, with query point P, search radius r, and points A, B, C, D]
57
When to terminate searching
  • If all the K nearest neighbors can be found within the current sphere of radius r, the algorithm terminates.
  • Otherwise, the algorithm continues until the whole dataset has been searched.

58
Performance Comparison
59
Performance Comparison
  • For the uniform dataset, iMinMax performs as well as iDistance.
  • For the clustered dataset, iMinMax is inferior to iDistance.
  • Many false drops may occur in a clustered dataset, so iMinMax may have to search a large part of the whole dataset to guarantee 100% accuracy.
  • However, iMinMax can produce approximate answers very quickly, more cheaply than a linear scan, with up to 95% accuracy.

60
Reference
  • Indexing the Edges: A Simple and Yet Efficient Approach to High-Dimensional Indexing
  • Progressive KNN Search Using B+-trees
  • Indexing the Distance: An Efficient Method to KNN Processing

61
iDistance
  • By He Qi
  • heqi_at_comp.nus.edu.sg

62
Topics
  • Two important issues in iDistance: data space partitioning and selection of reference points
  • Problems of iDistance
  • An improvement on iDistance: hyper-cube iDistance
  • Questions and future work

63
Data space partitioning
  • Equal partitioning of data space
  • Do not consider the distribution of data, data
    points may be allocated unevenly and loosely.
  • Cluster based partitioning
  • Consider the distribution of data, close data
    points are allocated densely in each cluster.

64
Equal partitioning of data space
[Figure: a KNN query (query point q, search radius r) in a 2-dimensional skewed data space]
65
Selection of Reference points
66
Contrast between two kinds of reference point selection:
  1. Centroid of a partition
  2. Outside of a partition
[Figure: for each choice, the partition, its reference point, the search radius r, the points with the same distance from the reference point, and the overlap between sub-spaces]
67
Problems of iDistance
  • For similarity range and KNN queries, the search space is large. The reason is that many points in a partition have the same indexing value.
  • The cost of identifying the partitions that overlap with a query region is high, because computing the distance between two points is expensive.
  • It does not support window/range queries.

68
Hyper-cube iDistance
  • an improvement on iDistance

69
Basic idea
  • Instead of using hyper-spheres to bound partitions and query regions as in iDistance, we use hyper-cubes with the same orientation (faces parallel to the axes) to bound partitions and query regions. Note the partitioning process itself is the same.

70
How to index?
  • For each partition, we pick a hyper-plane as the reference plane.
  • The indexing value of a data point is the distance between the point and its projection onto the reference plane.
  • The centroid of each partition (hyper-cube) is also kept.
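The indexing scheme above can be sketched as follows. We assume an axis-aligned reference plane x[axis] = offset, so a point's distance to its projection is just |x[axis] − offset|; the per-partition spacing constant c is our assumption (mirroring plain iDistance's partition offsets), not something stated on the slides.

```python
# A minimal sketch of a hyper-cube iDistance key, under the assumptions
# stated above. Names are illustrative.

def hypercube_idistance_key(point, partition_id, axis, offset, c=1.0):
    dist = abs(point[axis] - offset)   # distance from the point to its projection
    return partition_id * c + dist     # keep partitions in disjoint key ranges
```

For example, a point (0.2, 0.7) in partition 3 with reference plane x = 0 would get the key 3.2.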

71
Hyper-cube iDistance
  • Contrast with iDistance: to get the same effect in iDistance, a reference point would have to be placed infinitely far away, while the overlaps would also increase rapidly.

72
How to perform KNN query?
  • Given a query point and the current search
    radius, For each partition overlapping with the
    query region, the yellow and purple regions need
    to be searched.
  • Pruning each point in search space should be
    examined if it is in the query region (which is
    much cheaper than computing distance) first. Only
    those in the query region (purple region) need
    compute distance from q.
  • Stopping criterion the search stops when 1) K
    points are found, 2) and the distance of the
    farthest object in answer set from q is less than
    or equal to the current search radius.
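The pruning step can be sketched as follows: membership in the axis-aligned query cube costs a few comparisons, so the more expensive exact distance is computed only for points that pass the cheap test. The names are ours, not from the slides.

```python
# A sketch of cube-first pruning before exact distance computation.
import math

def in_query_cube(point, q, r):
    """Cheap filter: is point inside the cube of half-side r around q?"""
    return all(abs(x - qx) <= r for x, qx in zip(point, q))

def knn_filter(candidates, q, r):
    """Return (point, distance) pairs for candidates inside the search sphere."""
    hits = []
    for p in candidates:
        if in_query_cube(p, q, r):      # cheap comparisons first
            d = math.dist(p, q)         # exact distance only for survivors
            if d <= r:
                hits.append((p, d))
    return hits
```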

73
Contrasting with iDistance
74
Questions Future work
  • How to handle data entries with the same indexing value in a B+-tree?
  • Given a data set and an indexing method, how to compute the average cost of a certain kind of query?
  • How to account for the interdependences and varying importance of dimensions in hyper-cube iDistance?

75
Summary
  • Four indexing techniques for high-dimensional databases
  • Pyramid
  • iMinMax(?)
  • iDistance
  • Hyper-cube iDistance