When Is Nearest Neighbors Indexable? - PowerPoint PPT Presentation
1
When Is Nearest Neighbors Indexable?
  • Uri Shaft (Oracle Corp.)
  • Raghu Ramakrishnan (UW-Madison)

2
Motivation - Scalability Experiments
  • Dozens of papers describe experiments about index
    scalability with increasing dimensionality.
  • Constants:
    • Number of data points
    • Data and query distribution
    • Index structure / search algorithm
  • Variable:
    • Number of dimensions
  • Measurement:
    • Performance of the index

3
Example From PODS 1997
4
Example From PODS 1997
5
Motivation
  • In many cases the conclusion is that the
    empirical evidence suggests the index structures
    do scale with dimensionality.
  • We would like to investigate these claims
    mathematically and supply a proof of scalability
    or non-scalability.

6
Historical Context
  • Continues the work done in "When Is Nearest
    Neighbors Meaningful?" (ICDT 1999).
  • The previous work was about the behavior of
    distance distributions.
  • This work is about the behavior of indexing
    structures under similar conditions.

7
Contents
  • Vanishing Variance property
  • Convex Description index structures
  • Indexing Theorem
  • The performance of a CD index does not scale for
    VV workloads using the Euclidean distance.
  • Conclusion
  • Future Work

8
Vanishing Variance
  • Same definition as in the ICDT 99 work (although
    it was not named in that work).
  • In 1999 we showed that such workloads become
    meaningless: the relative differences among the
    distances between the query and the various data
    points become arbitrarily small.
  • We use the same result here.

9
Vanishing Variance
  • A scalability experiment contains a series of
    workloads W1, W2, ..., Wm, ...
  • m is the number of dimensions.
  • Each workload Wm has n data points and a query
    point (drawn from the same distribution).
  • The distance distribution is denoted Dm.
  • Vanishing Variance: Var(Dm / E[Dm]) -> 0 as
    m -> infinity.
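As an illustration (ours, not the paper's), a short Python simulation of the effect behind Vanishing Variance on i.i.d. uniform data: the normalized spread (std/mean) of the distance distribution from a query to the data points shrinks as dimensionality grows.

```python
import math
import random

random.seed(0)

def distance_spread(dim, n_points=200):
    """Sample distances from one uniform query to n_points uniform
    points in [0, 1]^dim and return std/mean of those distances."""
    q = [random.random() for _ in range(dim)]
    dists = [math.dist([random.random() for _ in range(dim)], q)
             for _ in range(n_points)]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean

# The normalized spread shrinks as dimensionality grows:
for dim in (2, 20, 200):
    print(dim, round(distance_spread(dim), 3))
```

The shrinking ratio is exactly why all query-to-point distances become nearly indistinguishable in high dimensions.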

10
Contents
  • Vanishing Variance property
  • Convex Description index structures
  • Indexing Theorem
  • The performance of a CD index does not scale for
    VV workloads using the Euclidean distance.
  • Conclusion
  • Future Work

11
Convex Description Index
  • Data points are distributed into buckets (e.g.,
    disk pages). Access to a bucket is all or nothing.
    We allow redundancy. A bucket contains at least
    two data points.
  • Each bucket is associated with a description: a
    convex region containing all data points in the
    bucket.
  • The search algorithm accesses at least all buckets
    whose convex region is closer than the nearest
    neighbor.
  • The cost of a search is the number of data points
    retrieved.
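A minimal Python sketch of the cost model this definition implies, using bounding rectangles as the convex descriptions; all function names are ours, hypothetical, not the paper's:

```python
import math

def mindist(lows, highs, q):
    """Minimum Euclidean distance from query q to the axis-aligned
    rectangle [lows, highs] (0 if q is inside the rectangle)."""
    s = 0.0
    for lo, hi, x in zip(lows, highs, q):
        d = max(lo - x, x - hi, 0.0)
        s += d * d
    return math.sqrt(s)

def bounding_rect(points):
    """Convex description of a bucket: its bounding rectangle."""
    dims = range(len(points[0]))
    return ([min(p[i] for p in points) for i in dims],
            [max(p[i] for p in points) for i in dims])

def cd_search_cost(buckets, q):
    """Cost of a nearest-neighbor search: the number of data points in
    every bucket whose description is at least as close to q as the
    nearest neighbor -- the minimum any CD search must retrieve."""
    nn = min(math.dist(p, q) for b in buckets for p in b)
    return sum(len(b) for b in buckets
               if mindist(*bounding_rect(b), q) <= nn)
```

For example, with two well-separated buckets of two points each, a query near one bucket retrieves only that bucket's two points.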

12
Example: R-Tree
  • Buckets are disk pages. Under normal construction
    buckets contain more than two data points each.
  • Bucket descriptions are convex and contain all
    data points (Bounding Rectangles).
  • Search algorithm accesses all buckets whose
    convex region is closer than the nearest neighbor
    (and probably a few more).

13
Convex Description Indexes
  • All R-Tree variants
  • X-Tree
  • M-Tree
  • kdb-Tree
  • SS-Tree and SR-Tree
  • Many more

14
Other indexes (non-CD)
  • Probability structures (P-Tree, VLDB 2000)
  • Access is based on clusters; a near-enough bucket
    may not be accessed.
  • Projection indexes (like the VA-file)
  • Compression structures
  • All data points are accessed in pieces, not in
    buckets.

15
Contents
  • Vanishing Variance property
  • Convex Description index structures
  • Indexing Theorem
  • The performance of a CD index does not scale for
    VV workloads using the Euclidean distance.
  • Conclusion
  • Future Work

16
Indexing Theorem
  • If:
    • the scalability experiment uses a series of
      workloads with Vanishing Variance,
    • the distance metric is Euclidean, and
    • the indexing structure is a Convex Description
      index,
  • Then:
    • the expected cost of a query converges to the
      number of data points, i.e., a linear scan of
      the data.
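An illustrative experiment (ours, not the paper's proof) that shows the theorem's conclusion empirically: with a crude but valid CD index over uniform data, the fraction of points a query must retrieve approaches 1 (a linear scan) as dimensionality grows.

```python
import math
import random

random.seed(2)

def accessed_fraction(dim, n=256, bucket_size=8):
    """Fraction of points a CD search must retrieve for one query on
    uniform data. Bucketing is crude: sort on the first coordinate,
    chop into fixed-size buckets, and describe each bucket by its
    bounding rectangle (a convex description)."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    q = [random.random() for _ in range(dim)]
    pts.sort(key=lambda p: p[0])
    buckets = [pts[i:i + bucket_size] for i in range(0, n, bucket_size)]
    nn = min(math.dist(p, q) for p in pts)
    cost = 0
    for b in buckets:
        lows = [min(p[i] for p in b) for i in range(dim)]
        highs = [max(p[i] for p in b) for i in range(dim)]
        # Minimum distance from q to the bucket's bounding rectangle.
        gap = math.sqrt(sum(max(lo - x, x - hi, 0.0) ** 2
                            for lo, hi, x in zip(lows, highs, q)))
        if gap <= nn:
            cost += len(b)
    return cost / n

# The accessed fraction grows toward 1 with dimensionality:
for dim in (2, 16, 128):
    print(dim, accessed_fraction(dim))
```

Any smarter bucketing only delays the effect; the theorem says no CD bucketing escapes it under Vanishing Variance.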

17
Sketch of Proof
  • Because of Vanishing Variance, the ratios of
    distances between the query and the various data
    points become arbitrarily close to 1.
  • When using the Euclidean distance, we can look at
    an arbitrary data bucket and a query point, choose
    two data points from the bucket, and form a
    triangle.
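A numeric check of the triangle argument (our illustration; we take Y to be the midpoint of D1D2, which lies in the bucket's convex description):

```python
import math

# Median-length formula: for the midpoint Y of segment D1D2,
#   |QY|^2 = |QD1|^2 / 2 + |QD2|^2 / 2 - |D1D2|^2 / 4.
# Under Vanishing Variance all three side lengths are about equal to
# some r, so |QY| is about r * sqrt(3)/2 -- strictly less than r.
D1, D2 = (0.0, 0.0), (1.0, 0.0)
Q = (0.5, math.sqrt(3) / 2)        # equilateral triangle: all sides 1
Y = ((D1[0] + D2[0]) / 2, (D1[1] + D2[1]) / 2)
ratio = math.dist(Q, Y) / math.dist(Q, D1)
print(ratio)                       # sqrt(3)/2, about 0.866
```

So a point of the bucket's convex region is closer to Q than any data point, which is why the search cannot prune the bucket.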

18
[Figure: query point Q, a data bucket containing
points D1 and D2, and a point Y inside the bucket's
convex description.]
The distances from Q to D1, D2, ..., Dn are about
the same, but the distance from Q to Y is much
smaller. Therefore, the distance from Q to the data
bucket is less than the distance to the nearest
neighbor.
19
Contents
  • Vanishing Variance property
  • Convex Description index structures
  • Indexing Theorem
  • The performance of a CD index does not scale for
    VV workloads using the Euclidean distance.
  • Conclusion
  • Future Work

20
Conclusion
  • Dozens of papers describe experiments about index
    scalability with increasing dimensionality.
  • We wanted to investigate these claims
    mathematically and supply a proof of scalability
    or non-scalability.
  • We proved that the indexes in many of these
    experiments do not scale with dimensionality.

21
Conclusion
  • Use this theorem to channel indexing research
    into more useful and practical avenues.
  • Review previous results accordingly.

22
Future Work
  • Remove the restriction of at least two data
    points per bucket.
  • An easy exercise; it needs to take into account
    the cost of traversing a hierarchical data
    structure.
  • Investigate other Lp metrics.
  • Investigate projection indexes using the Euclidean
    metric (it looks like they do not scale either).

23
Future Work
  • Find a scalable indexing structure for uniform
    data and the L metric.
  • Hint: use compression.
  • Find the number of data points needed for the
    R-Tree to be practical on uniform data with the
    L2 metric.
  • Approx

24
Questions