Title: When Is Nearest Neighbor Meaningful?
1. When Is Nearest Neighbor Meaningful?
- By Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft
2. Nearest neighbor queries
- Typical query in 2D
- Unstable query in 2D
3. Main theoretical instability result
(i.e., as dimensionality increases, all points become equidistant from the query point)
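For precision, a LaTeX rendering of the paper's Theorem 1 (paraphrased from the published version; here X_m is the p-th power of the distance from the query to a random data point in dimension m, and DMIN_m, DMAX_m are the nearest and farthest of the n data-point distances):

    \[
      \text{If}\quad \lim_{m \to \infty}
        \operatorname{Var}\!\left(\frac{X_m}{\mathbb{E}[X_m]}\right) = 0,
      \quad\text{then for every } \varepsilon > 0,
    \]
    \[
      \lim_{m \to \infty}
        \Pr\!\left[\mathrm{DMAX}_m \le (1 + \varepsilon)\,\mathrm{DMIN}_m\right] = 1.
    \]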
4. IID contrast as dimensionality increases
5. Repercussions of the technical result
- Serious questions are raised about techniques that map approximate similarity into high-dimensional nearest neighbor problems.
- The ease with which linear scan beats more complex access methods for high-D nearest neighbor is explained by our theorem.
- These results should not be taken to mean that all high-dimensional nearest neighbor problems are badly framed, or that more complex access methods will always fail on individual high-D data sets.
6. Example application of the result
- Assume the following (a simulation sketch of this setting follows the list):
  - The data distribution and query distribution are IID in all dimensions.
  - All the appropriate moments are finite (i.e., up to the ⌈2p⌉-th moment).
  - The query point is chosen independently of the data points.
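Under these assumptions the theorem predicts that the farthest-to-nearest distance ratio collapses toward 1 as dimensionality grows. A minimal Python sketch, not from the slides (the sample size, dimensions, and choice of uniform data are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000  # number of data points (arbitrary)

    for dim in (2, 10, 100, 1000):
        data = rng.uniform(size=(n, dim))  # data IID in all dimensions
        query = rng.uniform(size=dim)      # query chosen independently of the data
        dists = np.linalg.norm(data - query, axis=1)  # Euclidean (p = 2) distances
        print(f"dim={dim:4d}  DMAX/DMIN = {dists.max() / dists.min():.2f}")

The printed ratio should fall toward 1 as dim grows, which is exactly the instability the theorem describes.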
7. Examples that meet our condition
- IID (independent and identically distributed) dimensions, Q ∼ D (query distribution follows the data distribution)
- Variance converging to 0 at a bounded rate, Q ∼ D
- Variance converging to infinity at a bounded rate, Q ∼ D
- Partial correlation between all dimensions, Q ∼ D
- Variance converging to 0 at a bounded rate, and partial correlation between all dimensions, Q ∼ D
- Perfectly realized clustering, Q IID uniform
8. Examples that don't meet our condition
- Total correlation between all dimensions, Q ∼ D (illustrated in the sketch after this list)
- All dimensions are linear combinations of a fixed number of IID random variables, Q ∼ D
- Perfectly realized clustering with the query distribution following the data distribution, Q ∼ D
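For contrast, a sketch of the first failing case, under my own construction (every dimension is a copy of a single underlying uniform value, for data and query alike). Adding dimensions adds no new information, so the DMAX/DMIN contrast never washes out:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    for dim in (2, 10, 100, 1000):
        base = rng.uniform(size=n)               # one value per point...
        data = np.tile(base[:, None], (1, dim))  # ...copied into every dimension
        query = np.full(dim, rng.uniform())      # query built the same way (Q ~ D)
        dists = np.linalg.norm(data - query, axis=1)
        print(f"dim={dim:4d}  DMAX/DMIN = {dists.max() / dists.min():.2f}")

Every distance here is just sqrt(dim) times the same one-dimensional gap, so the ratio does not shrink as dim grows.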
9. Contrast in ideally clustered data
- Figure: top right, typical distance distribution; bottom left, ideal clusters; bottom right, distance distribution for ideally clustered data/queries.
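A sketch of what the bottom panels illustrate (the cluster count, spread, and query-at-a-center convention are my assumptions, not the paper's parameters): with ideal, well-separated clusters and a query near a cluster center, the distance distribution splits into two well-separated modes, so the nearest neighbor stays meaningful even at high dimensionality:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, k, per = 1000, 5, 200                    # dimensions, clusters, points per cluster
    centers = rng.uniform(0, 100, size=(k, dim))  # widely separated cluster centers
    data = np.concatenate(
        [c + rng.normal(scale=0.1, size=(per, dim)) for c in centers]
    )
    query = centers[0] + rng.normal(scale=0.1, size=dim)  # query near one center
    dists = np.linalg.norm(data - query, axis=1)
    # The query's own cluster sits at tiny distances and every other cluster
    # at huge ones: a bimodal distance distribution with DMIN << DMAX.
    print(f"DMIN = {dists.min():.1f}, DMAX = {dists.max():.1f}")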