When Is Nearest Neighbors Indexable? - PowerPoint PPT Presentation
1
When Is Nearest Neighbors Indexable?
  • Uri Shaft (Oracle Corp.)
  • Raghu Ramakrishnan (UW-Madison)

2
Motivation - Scalability Experiments
  • Dozens of papers describe experiments about index
    scalability with increasing dimensionality.
  • Constants:
    • Number of data points
    • Data and query distribution
    • Index structure / search algorithm
  • Variable:
    • Number of dimensions
  • Measurement:
    • Performance of the index

3
Example From PODS 1997
4
Example From PODS 1997
5
Motivation
  • In many cases the conclusion is that the
    empirical evidence suggests the index structures
    do scale with dimensionality.
  • We would like to investigate these claims
    mathematically and supply a proof of scalability
    or non-scalability.

6
Historical Context
  • Continues the work done in "When Is Nearest
    Neighbors Meaningful?" (ICDT 1999).
  • The previous work was about the behavior of
    distance distributions.
  • This work is about the behavior of indexing
    structures under similar conditions.

7
Contents
  • Vanishing Variance property
  • Convex Description index structures
  • Indexing Theorem
  • The performance of a CD index does not scale for
    VV workloads using the Euclidean distance.
  • Conclusion
  • Future Work

8
Vanishing Variance
  • Same definition as in the ICDT 99 work (although
    it was not named in that work).
  • In 1999 we showed that such workloads become
    meaningless: the relative differences among the
    distances between the query and the various data
    points become arbitrarily small.
  • We use the same result here.

9
Vanishing Variance
  • A scalability experiment contains a series of
    workloads W1, W2, ..., Wm, ...
  • m is the number of dimensions.
  • Each workload Wm has n data points and a query
    point (drawn from the same distribution).
  • The distance distribution is denoted Dm.
  • Vanishing Variance: Var(Dm / E[Dm]) -> 0 as
    m -> infinity.
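As an illustration (ours, not the paper's), a short Python simulation of the effect behind Vanishing Variance on i.i.d. uniform data: the normalized spread (std/mean) of the distance distribution from a query to the data points shrinks as dimensionality grows.

```python
import math
import random

random.seed(0)

def distance_spread(dim, n_points=200):
    """Sample distances from one uniform query to n_points uniform
    points in [0, 1]^dim and return std/mean of those distances."""
    q = [random.random() for _ in range(dim)]
    dists = [math.dist([random.random() for _ in range(dim)], q)
             for _ in range(n_points)]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean

# The normalized spread shrinks as dimensionality grows:
for dim in (2, 20, 200):
    print(dim, round(distance_spread(dim), 3))
```

The shrinking ratio is exactly why all query-to-point distances become nearly indistinguishable in high dimensions.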

10
Contents
  • Vanishing Variance property
  • Convex Description index structures
  • Indexing Theorem
  • The performance of a CD index does not scale for
    VV workloads using the Euclidean distance.
  • Conclusion
  • Future Work

11
Convex Description Index
  • Data points are distributed into buckets (e.g.,
    disk pages). Access to a bucket is all or nothing.
    We allow redundancy. A bucket contains at least
    two data points.
  • Each bucket is associated with a description: a
    convex region containing all data points in the
    bucket.
  • The search algorithm accesses at least all buckets
    whose convex region is closer than the nearest
    neighbor.
  • The cost of a search is the number of data points
    retrieved.
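A minimal Python sketch of the cost model this definition implies, using bounding rectangles as the convex descriptions; all function names are ours, hypothetical, not the paper's:

```python
import math

def mindist(lows, highs, q):
    """Minimum Euclidean distance from query q to the axis-aligned
    rectangle [lows, highs] (0 if q is inside the rectangle)."""
    s = 0.0
    for lo, hi, x in zip(lows, highs, q):
        d = max(lo - x, x - hi, 0.0)
        s += d * d
    return math.sqrt(s)

def bounding_rect(points):
    """Convex description of a bucket: its bounding rectangle."""
    dims = range(len(points[0]))
    return ([min(p[i] for p in points) for i in dims],
            [max(p[i] for p in points) for i in dims])

def cd_search_cost(buckets, q):
    """Cost of a nearest-neighbor search: the number of data points in
    every bucket whose description is at least as close to q as the
    nearest neighbor -- the minimum any CD search must retrieve."""
    nn = min(math.dist(p, q) for b in buckets for p in b)
    return sum(len(b) for b in buckets
               if mindist(*bounding_rect(b), q) <= nn)
```

For example, with two well-separated buckets of two points each, a query near one bucket retrieves only that bucket's two points.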

12
Example: R-Tree
  • Buckets are disk pages. Under normal construction
    buckets contain more than two data points each.
  • Bucket descriptions are convex and contain all
    data points (Bounding Rectangles).
  • Search algorithm accesses all buckets whose
    convex region is closer than the nearest neighbor
    (and probably a few more).

13
Convex Description Indexes
  • All R-Tree variants
  • X-Tree
  • M-Tree
  • kdb-Tree
  • SS-Tree and SR-Tree
  • Many more

14
Other indexes (non-CD)
  • Probability structures (P-Tree, VLDB 2000)
  • Access is based on clusters; a near-enough bucket
    may not be accessed.
  • Projection indexes (like the VA-file)
  • Compression structures
  • All data points are accessed in pieces, not in
    buckets.

15
Contents
  • Vanishing Variance property
  • Convex Description index structures
  • Indexing Theorem
  • The performance of a CD index does not scale for
    VV workloads using the Euclidean distance.
  • Conclusion
  • Future Work

16
Indexing Theorem
  • If:
    • the scalability experiment uses a series of
      workloads with Vanishing Variance,
    • the distance metric is Euclidean, and
    • the indexing structure is a Convex Description
      index,
  • Then:
    • the expected cost of a query converges to the
      number of data points, i.e., a linear scan of
      the data.
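An illustrative experiment (ours, not the paper's proof) that shows the theorem's conclusion empirically: with a crude but valid CD index over uniform data, the fraction of points a query must retrieve approaches 1 (a linear scan) as dimensionality grows.

```python
import math
import random

random.seed(2)

def accessed_fraction(dim, n=256, bucket_size=8):
    """Fraction of points a CD search must retrieve for one query on
    uniform data. Bucketing is crude: sort on the first coordinate,
    chop into fixed-size buckets, and describe each bucket by its
    bounding rectangle (a convex description)."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    q = [random.random() for _ in range(dim)]
    pts.sort(key=lambda p: p[0])
    buckets = [pts[i:i + bucket_size] for i in range(0, n, bucket_size)]
    nn = min(math.dist(p, q) for p in pts)
    cost = 0
    for b in buckets:
        lows = [min(p[i] for p in b) for i in range(dim)]
        highs = [max(p[i] for p in b) for i in range(dim)]
        # Minimum distance from q to the bucket's bounding rectangle.
        gap = math.sqrt(sum(max(lo - x, x - hi, 0.0) ** 2
                            for lo, hi, x in zip(lows, highs, q)))
        if gap <= nn:
            cost += len(b)
    return cost / n

# The accessed fraction grows toward 1 with dimensionality:
for dim in (2, 16, 128):
    print(dim, accessed_fraction(dim))
```

Any smarter bucketing only delays the effect; the theorem says no CD bucketing escapes it under Vanishing Variance.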

17
Sketch of Proof
  • Because of Vanishing Variance, the ratios of
    distances between the query and the various data
    points become arbitrarily close to 1.
  • When using the Euclidean distance, we can look at
    an arbitrary data bucket and a query point, choose
    two data points from the bucket, and form a
    triangle.
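A numeric check of the triangle argument (our illustration; we take Y to be the midpoint of D1D2, which lies in the bucket's convex description):

```python
import math

# Median-length formula: for the midpoint Y of segment D1D2,
#   |QY|^2 = |QD1|^2 / 2 + |QD2|^2 / 2 - |D1D2|^2 / 4.
# Under Vanishing Variance all three side lengths are about equal to
# some r, so |QY| is about r * sqrt(3)/2 -- strictly less than r.
D1, D2 = (0.0, 0.0), (1.0, 0.0)
Q = (0.5, math.sqrt(3) / 2)        # equilateral triangle: all sides 1
Y = ((D1[0] + D2[0]) / 2, (D1[1] + D2[1]) / 2)
ratio = math.dist(Q, Y) / math.dist(Q, D1)
print(ratio)                       # sqrt(3)/2, about 0.866
```

So a point of the bucket's convex region is closer to Q than any data point, which is why the search cannot prune the bucket.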

18
[Figure: query point Q, a data bucket containing
points D1 and D2, and a point Y inside the bucket's
convex description.]
The distances from Q to D1, D2, ..., Dn are about
the same, but the distance from Q to Y is much
smaller. Therefore, the distance from Q to the data
bucket is less than the distance to the nearest
neighbor.
19
Contents
  • Vanishing Variance property
  • Convex Description index structures
  • Indexing Theorem
  • The performance of a CD index does not scale for
    VV workloads using the Euclidean distance.
  • Conclusion
  • Future Work

20
Conclusion
  • Dozens of papers describe experiments about index
    scalability with increasing dimensionality.
  • We wanted to investigate these claims
    mathematically and supply a proof of scalability
    or non-scalability.
  • We proved that the indexes in many of these
    experiments do not scale with dimensionality.

21
Conclusion
  • Use this theorem to channel indexing research
    into more useful and practical avenues.
  • Review previous results accordingly.

22
Future Work
  • Remove the restriction of at least two data
    points per bucket.
  • An easy exercise; it needs to take into account
    the cost of traversing a hierarchical data
    structure.
  • Investigate other Lp metrics.
  • Investigate projection indexes using the Euclidean
    metric (it looks like they do not scale either).

23
Future Work
  • Find a scalable indexing structure for uniform
    data and the L metric.
  • Hint: use compression.
  • Find the number of data points needed for the
    R-Tree to be practical on uniform data with the
    L2 metric.
  • Approx

24
Questions