1. When is Nearest Neighbor Meaningful?
- Authors: Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft
Presentation by Vuk Malbasa for CIS664, Prof. Vasilis Megalooekonomou
2. Overview
- Introduction
- Examples of when NN is useful and not
- Conditions under which NN is not useful
- Application of results
- Meaningful applications of high-dimensional NN
- Experimental Studies
- Conclusions
3. Introduction
- Nearest neighbor (NN) is a technique in which an unseen example is assumed to have properties similar to those of the already classified point closest to it.
- The examples on the left are cases where using NN is obviously useful.
- Are there cases where this technique is not useful?
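The technique described in this slide can be sketched in a few lines. This is a minimal illustration only: the function name `nearest_neighbor_label`, the toy points, and the choice of Euclidean distance are assumptions made for the example, not something the slides specify.

```python
import math

def nearest_neighbor_label(query, points, labels):
    """Return the label of the already classified point closest to `query`."""
    # Euclidean distance is an assumption; any distance function could be used.
    best_index = min(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return labels[best_index]

# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels = ["A", "A", "B", "B"]

print(nearest_neighbor_label((0.3, 0.1), points, labels))
print(nearest_neighbor_label((4.8, 5.1), points, labels))
```

With groups this far apart, the closest point is unambiguous, which is the kind of case where NN is clearly useful.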
4. Examples
[Figure labels: query point; center of circle; histogram of distances to other points]
5. Conditions under which NN is not useful
Definition: A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor.
[Figure labels: Dmax, Dmin, (1 + ε) Dmin]
It can be shown that, under certain conditions, for any fixed ε > 0 the probability that a query is unstable converges to 1 as dimensionality rises.
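The instability definition can be checked directly for a given query and data set. The sketch below is an illustration under stated assumptions: "most" is read as "more than half" (the `majority` parameter), distances are Euclidean, and the name `is_unstable` is hypothetical.

```python
import math

def is_unstable(query, points, eps, majority=0.5):
    """True if most points are closer than (1 + eps) times the NN distance."""
    distances = [math.dist(query, p) for p in points]
    d_min = min(distances)  # distance to the nearest neighbor
    # Count points strictly inside the (1 + eps) * d_min shell.
    within = sum(1 for d in distances if d < (1 + eps) * d_min)
    # "Most" is interpreted as a fraction above `majority` (an assumption).
    return within / len(distances) > majority

# Hypothetical data: all points at nearly the same distance from the query.
print(is_unstable((0, 0), [(1, 0), (0, 1.05), (-1.1, 0)], eps=0.2))
# Hypothetical data: one clear nearest neighbor, the rest far away.
print(is_unstable((0, 0), [(1, 0), (10, 0), (0, 20)], eps=0.2))
```

Note that when the query coincides with a data point the nearest neighbor distance is zero, no point lies strictly inside the shell, and the function reports the query as stable.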
6. Conditions under which NN is not useful
If, for a given scenario (a set of data points and a set of query points), the condition below is satisfied, then NN is not useful. Stated differently: if, as the dimensionality m of the data increases, the variance of the distance distribution scaled by the overall magnitude of the distance converges to zero, then NN is meaningless.
(1)
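The equation itself is not present in this copy of the slide. One plausible way to write the condition described in the prose is given below. The notation is an assumption introduced here and does not appear in the slides: d_m is the distance function in m dimensions, P_m a data point, Q_m a query point, and p a positive constant.

```latex
\lim_{m \to \infty}
\operatorname{var}\!\left(
  \frac{d_m(P_m, Q_m)^p}{\mathrm{E}\!\left[ d_m(P_m, Q_m)^p \right]}
\right) = 0
\tag{1}
```

Read this way, the quantity inside the variance is the distance scaled by its expected magnitude, which matches the phrase "variance of the distribution scaled by the overall magnitude of the distance".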
7. Application of results
- Example 1
- The data distribution and query distribution are IID in all dimensions.
- All appropriate moments are finite.
- The query point is chosen independently of the data points.
- In this case queries are unstable.
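A small simulation can illustrate why the IID case leads to unstable queries. The sketch below estimates the variance of the squared Euclidean distance divided by its mean. Uniform(0, 1) coordinates, the squared distance, the sample size, and the name `normalized_variance` are all assumptions chosen for illustration.

```python
import random
import statistics

def normalized_variance(m, samples=1000, seed=0):
    """Estimate var(d^2 / E[d^2]) for IID Uniform(0, 1) data and query points."""
    rng = random.Random(seed)
    squared = []
    for _ in range(samples):
        # Squared distance between one random data point and one random query.
        squared.append(sum((rng.random() - rng.random()) ** 2 for _ in range(m)))
    mean = statistics.fmean(squared)
    return statistics.pvariance([s / mean for s in squared])

for m in (1, 10, 100):
    print(m, round(normalized_variance(m), 4))
```

Under these assumptions the squared distance is a sum of m IID terms, so its scaled variance shrinks roughly in proportion to 1/m, and the printed values should fall as m grows.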
8. Application of results
- Example 2
- Same as the previous example, but all dimensions of both query points and data points are completely dependent.
- value for dimension 1 = value for dimension 2
- In this case queries are not unstable and NN is meaningful.
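When every coordinate of a point carries the same value, the Euclidean distance in m dimensions is just the one-dimensional distance multiplied by sqrt(m), so the ratio of farthest to nearest distance does not depend on m. The sketch below illustrates this; the Uniform(0, 1) values, the point count, and the name `contrast_fully_dependent` are assumptions made for the example.

```python
import math
import random

def contrast_fully_dependent(m, n_points=1000, seed=0):
    """Dmax / Dmin when each point repeats one value across all m dimensions."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n_points)]
    q = rng.random()
    # Each point is (v, v, ..., v); the query is (q, q, ..., q).
    distances = [math.dist([q] * m, [v] * m) for v in values]
    return max(distances) / min(distances)

for m in (1, 10, 100):
    print(m, contrast_fully_dependent(m))
```

Because the same seed produces the same underlying values for every m, the printed ratios agree up to floating-point rounding.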
9. Application of results
- Example 3
- Every dimension is unique, but all dimensions are correlated with all other dimensions and the variance of each additional dimension increases.
- First, independent variables $U_1, \dots, U_m$ are generated such that $U_i \sim \mathrm{Uniform}(0, \sqrt{i})$.
- $X_1 = U_1$, and for $2 \le i \le m$, $X_i = U_i + (X_{i-1}/2)$.
- In this case queries are unstable.
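The construction in Example 3 can be sketched as a short generator. The function name `example3_point` and the use of Python's `random.Random` are assumptions for illustration; the recurrence follows the slide, with each U drawn from Uniform(0, sqrt(i)) and each later coordinate adding half of the previous one.

```python
import math
import random

def example3_point(m, rng):
    """Generate one m-dimensional point with correlated coordinates."""
    point = []
    previous = 0.0
    for i in range(1, m + 1):
        u = rng.uniform(0.0, math.sqrt(i))  # U_i ~ Uniform(0, sqrt(i))
        # X_1 = U_1; for i >= 2, X_i = U_i + X_{i-1} / 2.
        x = u if i == 1 else u + previous / 2
        point.append(x)
        previous = x
    return point

rng = random.Random(0)
print(example3_point(5, rng))
```

Each coordinate depends on its predecessor, so all dimensions are correlated, while the widening range of U_i makes the variance grow with i.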
10. Meaningful applications of high-dimensional NN
- The query point matches one of the data points exactly.
- The query point falls within some small distance of one of the data points (this becomes increasingly difficult as dimensionality rises).
- The data is clustered into several clusters with a fixed maximum distance ε, and the query point falls within one of these clusters. (If the query point is not required to fall within some cluster, then the query is unstable.)
- Implicit low dimensionality (the underlying dimensionality of the data is low regardless of the actual dimensionality).
11. Experimental Studies
- Condition (1) describes what happens as dimensionality approaches infinity.
- Experiments are needed to observe the rate of this convergence.
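A rough version of such an experiment is easy to run. The sketch below measures the ratio of farthest to nearest distance for a single query against IID Uniform(0, 1) data at several dimensionalities. The distribution, the number of points, the single query, and the name `relative_contrast` are assumptions for illustration and are not the setup used in the original experiments.

```python
import math
import random

def relative_contrast(m, n_points=500, seed=0):
    """Dmax / Dmin for one query against IID Uniform(0, 1) points."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(m)] for _ in range(n_points)]
    query = [rng.random() for _ in range(m)]
    distances = [math.dist(query, p) for p in data]
    return max(distances) / min(distances)

for m in (2, 20, 200):
    print(m, round(relative_contrast(m), 2))
```

Under these assumptions the ratio should drop toward 1 as m grows. Averaging over many queries would give a smoother picture than the single query used here.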
12. Experimental Studies
13. Conclusions
- Query instability is an indication of a meaningless query.
- While there are situations where high-dimensional NN queries are meaningful, they are very specific and differ from the case of independent dimensions.
- The contrast between distances decreases fastest over the first 20 dimensions.
14. Conclusions
- Make sure that the distance distribution between query points and data points allows for enough contrast.
- When evaluating an NN processing technique, test it on meaningful workloads.