1
When is Nearest Neighbor Meaningful?
  • Authors
  • Kevin Beyer, Jonathan Goldstein, Raghu
    Ramakrishnan
  • Uri Shaft

Presentation by Vuk Malbasa for CIS664, Prof.
Vasilis Megalooekonomou
2
Overview
  • Introduction
  • Examples of when NN is useful and not
  • Conditions under which NN is not useful
  • Application of results
  • Meaningful applications of high-dimensional NN
  • Experimental Studies
  • Conclusions

3
Introduction
  • Nearest neighbor (NN) is a technique where an unseen
    example is assumed to have similar properties to
    the already classified point closest to it.
  • The examples on the left are cases where it is
    obvious that using NN is useful.
  • Are there cases where this technique is not
    useful?

4
Examples
[Figure labels: query point; center of circle; histogram of distances to other points]
5
Conditions under which NN is not useful
Definition: A nearest neighbor query is unstable
for a given ε if the distance from the query
point to most data points is less than (1 + ε)
times the distance from the query point to its
nearest neighbor.
[Figure labels: Dmax (distance to the farthest point), Dmin (distance to the nearest neighbor), (1 + ε)Dmin]
It can be shown that, under certain conditions, for
any fixed ε > 0, as dimensionality rises, the
probability that a query is unstable converges to
1.
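The definition above can be checked directly on a sample. The following is a minimal sketch, not the authors' code: it assumes Euclidean distance, uniformly distributed data, and reads "most data points" as "more than half" (the `most` parameter). The function names are hypothetical.

```python
import math
import random


def fraction_within(query, points, eps):
    """Fraction of data points closer to the query than (1 + eps) * Dmin."""
    dists = [math.dist(query, p) for p in points]
    d_min = min(dists)  # Dmin: distance to the nearest neighbor
    return sum(1 for d in dists if d < (1 + eps) * d_min) / len(dists)


def is_unstable(query, points, eps, most=0.5):
    """'Most data points' is read as more than `most` of them (an assumption)."""
    return fraction_within(query, points, eps) > most


def uniform_points(n, m, rng):
    """n points drawn uniformly from the unit cube in m dimensions."""
    return [[rng.random() for _ in range(m)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for m in (2, 200):
        points = uniform_points(500, m, rng)
        query = uniform_points(1, m, rng)[0]
        frac = fraction_within(query, points, eps=0.5)
        print(m, frac, is_unstable(query, points, eps=0.5))
```

With these settings the fraction should be small in 2 dimensions and close to 1 in 200 dimensions, which is the behavior the slide describes.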
6
Conditions under which NN is not useful
If, for a given scenario (a set of data points and a
set of query points), the equation below is
satisfied, then NN is not useful. Stated
differently: if, as the dimensionality m of the data
increases, the variance of the distance
distribution scaled by the overall magnitude of
the distance converges to zero, then NN is
meaningless.
(1)
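For reference, condition (1) can be written out as follows. This is a sketch based on the theorem in the Beyer et al. paper, not copied from the slide. The symbols are assumptions taken from the paper's notation: d_m is the distance function in m dimensions, P_m a data point, Q_m a query point, and p a positive constant.

```latex
\lim_{m \to \infty}
  \mathrm{Var}\!\left(
    \frac{\bigl(d_m(P_m, Q_m)\bigr)^{p}}
         {\mathrm{E}\!\left[\bigl(d_m(P_m, Q_m)\bigr)^{p}\right]}
  \right) = 0
\tag{1}

% Consequence: for every \varepsilon > 0,
\lim_{m \to \infty}
  \Pr\!\left[\, D_{\max,m} \le (1 + \varepsilon)\, D_{\min,m} \,\right] = 1
```

The second line is the instability statement from the previous slide: the farthest point ends up within a factor (1 + ε) of the nearest one with probability approaching 1.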
7
Application of results
  • Example 1
  • The data distribution and query distribution are
    IID in all dimensions
  • All appropriate moments are finite
  • Query point is chosen independently of data
    points
  • In this case queries are unstable.
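Example 1 can be illustrated by estimating the scaled variance of the distance empirically. The sketch below assumes a uniform distribution on [0, 1] in every dimension, Euclidean distance, and p = 1. These are illustrative choices; the function name and sample size are hypothetical.

```python
import math
import random
import statistics


def scaled_variance_iid(m, n_pairs=1000, seed=0):
    """Estimate Var(d / E[d]) for the distance between IID uniform points."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        p = [rng.random() for _ in range(m)]  # data point
        q = [rng.random() for _ in range(m)]  # query point, drawn independently
        dists.append(math.dist(p, q))
    mean = statistics.fmean(dists)
    return statistics.pvariance([d / mean for d in dists])


if __name__ == "__main__":
    for m in (1, 10, 100):
        print(m, scaled_variance_iid(m))
```

The estimate should shrink steadily as m grows, consistent with the scaled variance converging to zero and queries becoming unstable.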

8
Application of results
  • Example 2
  • Same as the previous example, but all dimensions of
    both query points and data points are completely
    dependent.
  • Value for dimension 1 = value for dimension 2 = ...
  • In this case queries are not unstable and NN is
    meaningful.
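For contrast with the independent case, here is the same kind of estimate when every coordinate of a point carries one shared value. As before, the uniform distribution, Euclidean distance, and function name are assumptions made for illustration.

```python
import math
import random
import statistics


def scaled_variance_dependent(m, n_pairs=1000, seed=0):
    """Estimate Var(d / E[d]) when all m coordinates of a point are equal."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        x = rng.random()  # the single value shared by all dimensions of the data point
        y = rng.random()  # likewise for the query point
        dists.append(math.dist([x] * m, [y] * m))  # equals sqrt(m) * |x - y|
    mean = statistics.fmean(dists)
    return statistics.pvariance([d / mean for d in dists])


if __name__ == "__main__":
    for m in (1, 10, 100):
        print(m, scaled_variance_dependent(m))
```

Because every distance is just sqrt(m) times the one-dimensional distance, dividing by the mean cancels the factor and the scaled variance does not depend on m. The data behaves like one-dimensional data, so the contrast between near and far points is preserved.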

9
Application of results
  • Example 3
  • Every dimension is unique, but all dimensions are
    correlated with all other dimensions and the
    variance of each additional dimension increases.
  • First, independent variables U1, ..., Um are
    generated such that Ui ~ Uniform(0, sqrt(i))
  • X1 = U1; for 2 ≤ i ≤ m, Xi = Ui + (X(i-1) / 2)
  • In this case queries are unstable.
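The construction in Example 3 can be sketched as a small generator. The sketch assumes the recurrence X1 = U1, Xi = Ui + X(i-1)/2 with Ui drawn uniformly from [0, sqrt(i)], and uses Euclidean distance for the contrast measure. The helper names are hypothetical.

```python
import math
import random


def sample_point(m, rng):
    """One m-dimensional point following the Example 3 recurrence."""
    x = []
    prev = 0.0  # stands in for X_0, so that X_1 = U_1
    for i in range(1, m + 1):
        u = rng.uniform(0.0, math.sqrt(i))  # U_i ~ Uniform(0, sqrt(i))
        prev = u + prev / 2                 # X_i = U_i + X_(i-1) / 2
        x.append(prev)
    return x


def relative_contrast(m, n_points, rng):
    """(Dmax - Dmin) / Dmin for one random query against n_points data points."""
    points = [sample_point(m, rng) for _ in range(n_points)]
    query = sample_point(m, rng)
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    rng = random.Random(0)
    for m in (2, 20, 200):
        print(m, relative_contrast(m, 200, rng))
```

Even though neighboring dimensions are correlated, the relative contrast should drop sharply as m grows, in line with the slide's conclusion that queries are unstable here.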

10
Meaningful applications of high-dimensional NN
  • Query point matches one of the data points
    exactly.
  • Query point falls within some small distance of
    one of the data points (this becomes increasingly
    difficult as dimensionality rises).
  • Data is clustered into several clusters with a
    fixed maximum distance ε, and the query point
    falls within one of these clusters. (If the query
    point isn't required to fall within some cluster
    then the query is unstable).
  • Implicit low dimensionality (underlying
    dimensionality of data is low regardless of
    actual dimensionality).
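The clustered case can be illustrated with a small synthetic workload. This is a sketch under stated assumptions: cluster centers are uniform in the unit cube, each point lies within `radius` of its center in every coordinate (a simplification of a fixed maximum distance), and the query is drawn inside a randomly chosen cluster. All names and parameter values are hypothetical.

```python
import math
import random


def fraction_within(query, points, eps):
    """Fraction of data points closer to the query than (1 + eps) * Dmin."""
    dists = [math.dist(query, p) for p in points]
    d_min = min(dists)
    return sum(1 for d in dists if d < (1 + eps) * d_min) / len(dists)


def jitter(center, radius, rng):
    """A point within `radius` of `center` in every coordinate."""
    return [c + rng.uniform(-radius, radius) for c in center]


def clustered_workload(m, n_clusters, per_cluster, radius, rng):
    """Tight clusters around random centers, plus a query inside one cluster."""
    centers = [[rng.random() for _ in range(m)] for _ in range(n_clusters)]
    points = [jitter(c, radius, rng) for c in centers for _ in range(per_cluster)]
    query = jitter(rng.choice(centers), radius, rng)
    return query, points


if __name__ == "__main__":
    rng = random.Random(0)
    query, points = clustered_workload(100, 10, 50, 0.01, rng)
    print(fraction_within(query, points, eps=0.5))
```

In 100 dimensions with 10 tight clusters, only points from the query's own cluster can fall within (1 + ε)Dmin, so the fraction stays at or below one tenth and the query is not unstable in the sense defined earlier.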

11
Experimental Studies
  • Condition (1) describes what happens
    as dimensionality approaches infinity.
  • Experiments are needed to observe the rate of
    this convergence.
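One simple way to observe the rate of convergence is to track the average ratio Dmax / Dmin as dimensionality grows. The sketch below uses IID uniform data and Euclidean distance. The data sizes and dimension values are illustrative assumptions, not the experimental setup from the presentation.

```python
import math
import random
import statistics


def mean_ratio(m, n_points=300, n_queries=5, seed=0):
    """Average Dmax / Dmin over a few queries on IID uniform data."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(m)] for _ in range(n_points)]
    ratios = []
    for _ in range(n_queries):
        query = [rng.random() for _ in range(m)]
        dists = [math.dist(query, p) for p in points]
        ratios.append(max(dists) / min(dists))
    return statistics.fmean(ratios)


if __name__ == "__main__":
    for m in (1, 2, 5, 10, 20, 50, 100):
        print(f"m={m:4d}  Dmax/Dmin ~ {mean_ratio(m):.2f}")
```

A ratio approaching 1 means the farthest point is barely farther than the nearest one. Printing the ratio over a range of m shows how quickly the contrast is lost for this particular distribution.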

12
Experimental Studies
13
Conclusions
  • Query instability is an indication of a
    meaningless query.
  • While there are situations where high-dimensional
    NN queries are meaningful, they are very specific
    and differ from the independent-dimensions
    case.
  • The contrast in distances decreases fastest over
    the first 20 dimensions.

14
Conclusions
  • Make sure that the distance distribution between
    query points and data points allows for enough
    contrast.
  • When evaluating a NN processing technique, test
    it on meaningful workloads.

15
  • Thanks!

16
  • ?