Title: Nearest Neighbor Retrieval Using Distance-Based Hashing
Michalis Potamias and Panagiotis Papapetrou, supervised by Prof. George Kollios
Database Group
Analysis
- Probability of collision between any two objects
- Same probability on a k-bit hash table
- Probability of collision in at least one of the l hash tables
- Accuracy, i.e. the probability, over all queries Q, that we retrieve the nearest neighbor N(Q)
- LookupCost: expected number of objects that collide with Q in at least one of the l hash tables
- HashCost: number of distance computations needed to evaluate the h-functions
- Total cost per query: LookupCost + HashCost
- Efficiency (over all queries)
- Use sampling on sample queries to estimate Accuracy and Efficiency (sketched below)
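A minimal sketch of that sampling-based estimate in Python, assuming the per-function collision probability C(Q, X) has already been estimated from sample data; the names collision_prob and nearest are illustrative, not from the poster:

    from statistics import mean

    def index_collision_prob(c, k, l):
        # Collision in at least one of l tables, each keyed by k bits:
        # 1 - (1 - c^k)^l, assuming the k functions are drawn independently.
        return 1.0 - (1.0 - c ** k) ** l

    def estimate_accuracy_and_lookup_cost(sample_queries, database,
                                          collision_prob, nearest, k, l):
        # collision_prob(q, x): estimated Pr[h(q) == h(x)] over the h-family.
        # nearest(q): true nearest neighbor N(q), precomputed for the sample.
        acc = mean(index_collision_prob(collision_prob(q, nearest(q)), k, l)
                   for q in sample_queries)
        cost = mean(sum(index_collision_prob(collision_prob(q, x), k, l)
                        for x in database)
                    for q in sample_queries)
        return acc, cost

Here Accuracy is the expected chance that N(Q) survives the filter step, and LookupCost is the expected number of retrieved candidates, i.e. the distance computations paid in the refine step.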
Hash Based Indexing
- Idea
- Come up with hash functions that hash similar objects to similar buckets
- Hash every database object to some buckets
- At query time, apply the same hash function to the query
- Filter: retrieve the collisions; the rest of the database is pruned (see the sketch after this list)
- Refine: compute actual distances and return the object with the smallest distance as the NN
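A minimal filter-and-refine sketch in Python; hash_key(x, i), the k-bit key of object x in table i, and the table layout are assumptions of this sketch, not the paper's API:

    def build_tables(database, hash_key, l):
        # Hash every database object into each of the l tables.
        tables = [dict() for _ in range(l)]
        for x in database:
            for i, table in enumerate(tables):
                table.setdefault(hash_key(x, i), []).append(x)
        return tables

    def query_nn(q, tables, hash_key, distance):
        # Filter: collect every object that collides with q in some table.
        candidates = {}
        for i, table in enumerate(tables):
            for x in table.get(hash_key(q, i), []):
                candidates[id(x)] = x  # dedup without requiring hashable objects
        # Refine: actual distance computations on the surviving candidates only.
        return min(candidates.values(), key=lambda x: distance(q, x), default=None)

Only the candidates that collide with the query in at least one table pay an actual distance computation.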
Abstract
A method is proposed for indexing spaces with
arbitrary distance measures, so as to achieve
efficient approximate nearest neighbor retrieval.
Hashing methods, such as Locality Sensitive
Hashing (LSH), have been successfully applied for
similarity indexing in vector spaces and string
spaces under the Hamming distance. The key
novelty of the hashing technique proposed here is
that it can be applied to spaces with arbitrary
distance measures. First, we describe a
domain-independent method for constructing a
family of binary hash functions. Then, we use
these functions to construct multiple multi-bit
hash tables. We show that the LSH formalism is
not applicable for analyzing the behavior of
these tables as index structures. We present a
novel formulation that uses statistical
observations from sample data to analyze
retrieval accuracy and efficiency for the
proposed indexing method. Experiments on several
real-world data sets demonstrate that our method
produces good trade-offs between accuracy and
efficiency, and significantly outperforms
VP-trees, which are a well-known method for
distance-based indexing.
Locality Sensitive Hashing
- Locality sensitive family of functions
- Amplify the gap between p1 and p2: randomly pick l hash vectors of k functions each
- Probability of collision in at least one of the l hash tables (formulas below)
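For reference, the gap amplification here is the textbook LSH calculation (stated for completeness, not taken from the poster): if a single function collides on close pairs with probability p1 and on far pairs with probability p2 (p1 > p2), then a k-bit hash vector collides with probability p^k, and at least one of l independent tables collides with probability

    P_close = 1 - (1 - p1^k)^l,    P_far = 1 - (1 - p2^k)^l.

Concatenating k functions shrinks both probabilities, but p2^k shrinks much faster than p1^k; OR-ing over l tables then lifts the close-pair probability back up, widening the gap.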
Additional Optimizations
- Hierarchical DBH (see the sketch after this list)
- Rank queries according to D(Q, N(Q))
- Divide the space into disjoint subsets (equi-height)
- Train separate indices for each subset
- Reduce hash cost
- Use a small number of pseudoline points, so that the distances needed to evaluate many h-functions can be shared
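A sketch of the hierarchical idea under stated assumptions: train_dbh is a hypothetical constructor for a single DBH index, and the equi-height split uses quantiles of D(Q, N(Q)):

    import numpy as np

    def train_hierarchical_dbh(train_queries, nn_distance, train_dbh, levels):
        # Rank training queries by the distance to their nearest neighbor.
        dists = np.array([nn_distance(q) for q in train_queries])
        # Equi-height split: quantile edges give bins of ~equal query mass.
        edges = np.quantile(dists, np.linspace(0.0, 1.0, levels + 1))
        indices = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            subset = [q for q, d in zip(train_queries, dists) if lo <= d <= hi]
            indices.append(train_dbh(subset))  # index tuned to this distance range
        return edges, indices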
Problem
- NEAREST NEIGHBOR: Given a database S and a distance function D, for a previously unseen query q, locate a point p in the database such that D(q, o) >= D(q, p) for every point o in the database.
- COST MODEL: Minimize the number of distance computations; computing D may be very expensive:
- Dynamic Time Warping for time series
- Edit Distance variants for DNA alignment
- PROBLEM DEFINITION: Define an index structure that answers nearest neighbor queries efficiently.
- A SOLUTION: Brute force! Try them all and get the exact answer (see the sketch below).
- OUR SOLUTION: Are we willing to trade accuracy for efficiency?
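For concreteness, the brute-force baseline is a single linear scan, with D passed in as the black-box distance:

    def brute_force_nn(q, database, distance):
        # Exact answer at the price of |S| distance computations per query.
        return min(database, key=lambda p: distance(q, p))

Exact, but every query pays one distance computation per database object, which is precisely the cost the index is designed to avoid.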
DBH: Hashing using Pseudoline Projections
Works on an arbitrary space, but is not locality sensitive! Define a line projection function F that maps an arbitrary space into the real line R. F is real-valued, so it is discretized into a binary h-function: h(X) = 0 if F(X) lies in [t1, t2], and 1 otherwise. Hash tables should be balanced; thus t1, t2 are chosen from the set V of projection values so that roughly half the objects fall in [t1, t2] (a sketch follows).
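A sketch of one pseudoline h-function in Python. The projection formula is the standard line projection of X onto the "pseudoline" through two pivot objects X1, X2, which needs only the values of D and no metric properties; the particular quantile choice of t1, t2 is just one way to satisfy the balancing constraint, assumed here for concreteness:

    def line_projection(x, x1, x2, distance):
        # Project x onto the pseudoline through pivots x1, x2
        # (law-of-cosines form; assumes distance(x1, x2) > 0).
        d12 = distance(x1, x2)
        return (distance(x, x1) ** 2 + d12 ** 2 - distance(x, x2) ** 2) / (2.0 * d12)

    def make_binary_hash(x1, x2, sample, distance):
        # Discretize the real-valued projection into a balanced 0/1 hash.
        values = sorted(line_projection(s, x1, x2, distance) for s in sample)
        # Pick t1, t2 from the observed values V so ~half the sample lands inside.
        t1 = values[len(values) // 4]
        t2 = values[3 * len(values) // 4]
        return lambda x: 0 if t1 <= line_projection(x, x1, x2, distance) <= t2 else 1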
Experiments
ACCURACY vs. EFFICIENCY: How often is the actual NN retrieved? How much time does NN retrieval take?
Conclusion
- General purpose
- Distance is a black box
- Does not require metric properties
- Statistical analysis is possible
- Even when the exact NN is not returned, a very close neighbor is returned; for many applications that's fine!
- Not sublinear in the size of the DB
- Statistical (not probabilistic) guarantees
- Needs representative sample sets
- Hands dataset: actual performance differed from the simulation because the training set was not representative!