Title: Nearest Neighbor Retrieval Using Distance-Based Hashing
Michalis Potamias and Panagiotis Papapetrou, supervised by Prof. George Kollios
Database Group
Analysis
- Probability of collision between any two objects
- Same probability on a k-bit hash table
- Probability of collision in at least one of the l hash tables
- Accuracy, i.e. the probability, over all queries Q, that we retrieve the nearest neighbor N(Q)
- LookupCost: expected number of objects that collide with Q in at least one of the l hash tables
- HashCost: number of distance computations needed to evaluate the h-functions
- Total cost per query: LookupCost + HashCost
- Efficiency (over all queries)
- Use sampling on sample queries to estimate Accuracy and Efficiency (sketched below)
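A minimal sketch of that sampling-based estimate in Python, assuming the per-function collision probability C(Q, X) has already been estimated from sample data; the names collision_prob and nearest are illustrative, not from the poster:

    from statistics import mean

    def index_collision_prob(c, k, l):
        # Collision in at least one of l tables, each keyed by k bits:
        # 1 - (1 - c^k)^l, assuming the k functions are drawn independently.
        return 1.0 - (1.0 - c ** k) ** l

    def estimate_accuracy_and_lookup_cost(sample_queries, database,
                                          collision_prob, nearest, k, l):
        # collision_prob(q, x): estimated Pr[h(q) == h(x)] over the h-family.
        # nearest(q): true nearest neighbor N(q), precomputed for the sample.
        acc = mean(index_collision_prob(collision_prob(q, nearest(q)), k, l)
                   for q in sample_queries)
        cost = mean(sum(index_collision_prob(collision_prob(q, x), k, l)
                        for x in database)
                    for q in sample_queries)
        return acc, cost

Here Accuracy is the expected chance that N(Q) survives the filter step, and LookupCost is the expected number of retrieved candidates, i.e. the distance computations paid in the refine step.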
Hash Based Indexing
- Idea
- Come up with hash functions that hash similar objects to similar buckets
- Hash every database object to some buckets
- At query time, apply the same hash function to the query
- Filter: retrieve the collisions; the rest of the database is pruned (see the sketch after this list)
- Refine: compute actual distances and return the object with the smallest distance as the NN
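A minimal filter-and-refine sketch in Python; hash_key(x, i), the k-bit key of object x in table i, and the table layout are assumptions of this sketch, not the paper's API:

    def build_tables(database, hash_key, l):
        # Hash every database object into each of the l tables.
        tables = [dict() for _ in range(l)]
        for x in database:
            for i, table in enumerate(tables):
                table.setdefault(hash_key(x, i), []).append(x)
        return tables

    def query_nn(q, tables, hash_key, distance):
        # Filter: collect every object that collides with q in some table.
        candidates = {}
        for i, table in enumerate(tables):
            for x in table.get(hash_key(q, i), []):
                candidates[id(x)] = x  # dedup without requiring hashable objects
        # Refine: actual distance computations on the surviving candidates only.
        return min(candidates.values(), key=lambda x: distance(q, x), default=None)

Only the candidates that collide with the query in at least one table pay an actual distance computation.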
Abstract
A method is proposed for indexing spaces with
arbitrary distance measures, so as to achieve
efficient approximate nearest neighbor retrieval.
Hashing methods, such as Locality Sensitive
Hashing (LSH), have been successfully applied for
similarity indexing in vector spaces and string
spaces under the Hamming distance. The key
novelty of the hashing technique proposed here is
that it can be applied to spaces with arbitrary
distance measures. First, we describe a
domain-independent method for constructing a
family of binary hash functions. Then, we use
these functions to construct multiple multi-bit
hash tables. We show that the LSH formalism is
not applicable for analyzing the behavior of
these tables as index structures. We present a
novel formulation that uses statistical
observations from sample data to analyze
retrieval accuracy and efficiency for the
proposed indexing method. Experiments on several
real-world data sets demonstrate that our method
produces good trade-offs between accuracy and
efficiency, and significantly outperforms
VP-trees, which are a well-known method for
distance-based indexing.
Locality Sensitive Hashing
- Locality sensitive family of functions
- Amplify the gap between p1 and p2: randomly pick l hash vectors of k functions each
- Probability of collision in at least one of the l hash tables (formulas below)
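For reference, the gap amplification here is the textbook LSH calculation (stated for completeness, not taken from the poster): if a single function collides on close pairs with probability p1 and on far pairs with probability p2 (p1 > p2), then a k-bit hash vector collides with probability p^k, and at least one of l independent tables collides with probability

    P_close = 1 - (1 - p1^k)^l,    P_far = 1 - (1 - p2^k)^l.

Concatenating k functions shrinks both probabilities, but p2^k shrinks much faster than p1^k; OR-ing over l tables then lifts the close-pair probability back up, widening the gap.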
Additional Optimizations
- Hierarchical DBH (see the sketch after this list)
- Rank queries according to D(Q, N(Q))
- Divide the space into disjoint subsets (equi-height)
- Train separate indices for each subset
- Reduce hash cost
- Use a small number of pseudoline points, so that the distances needed to evaluate many h-functions can be shared
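A sketch of the hierarchical idea under stated assumptions: train_dbh is a hypothetical constructor for a single DBH index, and the equi-height split uses quantiles of D(Q, N(Q)):

    import numpy as np

    def train_hierarchical_dbh(train_queries, nn_distance, train_dbh, levels):
        # Rank training queries by the distance to their nearest neighbor.
        dists = np.array([nn_distance(q) for q in train_queries])
        # Equi-height split: quantile edges give bins of ~equal query mass.
        edges = np.quantile(dists, np.linspace(0.0, 1.0, levels + 1))
        indices = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            subset = [q for q, d in zip(train_queries, dists) if lo <= d <= hi]
            indices.append(train_dbh(subset))  # index tuned to this distance range
        return edges, indices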
Problem
- NEAREST NEIGHBOR: Given a database S and a distance function D, for a previously unseen query q, locate a point p in the database such that D(q, o) >= D(q, p) for every point o in the database.
- COST MODEL: Minimize the number of distance computations; computing D may be very expensive:
- Dynamic Time Warping for time series
- Edit Distance variants for DNA alignment
- PROBLEM DEFINITION: Define an index structure that answers nearest neighbor queries efficiently.
- A SOLUTION: Brute force! Try them all and get the exact answer (see the sketch below).
- OUR SOLUTION: Are we willing to trade accuracy for efficiency?
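For concreteness, the brute-force baseline is a single linear scan, with D passed in as the black-box distance:

    def brute_force_nn(q, database, distance):
        # Exact answer at the price of |S| distance computations per query.
        return min(database, key=lambda p: distance(q, p))

Exact, but every query pays one distance computation per database object, which is precisely the cost the index is designed to avoid.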
DBH: Hashing using Pseudoline Projections
Works on an arbitrary space, but is not locality sensitive! Define a line projection function F that maps an arbitrary space into the real line R. F is real-valued, so it is discretized into a binary h-function: h(X) = 0 if F(X) lies in [t1, t2], and 1 otherwise. Hash tables should be balanced; thus t1, t2 are chosen from the set V of projection values so that roughly half the objects fall in [t1, t2] (a sketch follows).
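A sketch of one pseudoline h-function in Python. The projection formula is the standard line projection of X onto the "pseudoline" through two pivot objects X1, X2, which needs only the values of D and no metric properties; the particular quantile choice of t1, t2 is just one way to satisfy the balancing constraint, assumed here for concreteness:

    def line_projection(x, x1, x2, distance):
        # Project x onto the pseudoline through pivots x1, x2
        # (law-of-cosines form; assumes distance(x1, x2) > 0).
        d12 = distance(x1, x2)
        return (distance(x, x1) ** 2 + d12 ** 2 - distance(x, x2) ** 2) / (2.0 * d12)

    def make_binary_hash(x1, x2, sample, distance):
        # Discretize the real-valued projection into a balanced 0/1 hash.
        values = sorted(line_projection(s, x1, x2, distance) for s in sample)
        # Pick t1, t2 from the observed values V so ~half the sample lands inside.
        t1 = values[len(values) // 4]
        t2 = values[3 * len(values) // 4]
        return lambda x: 0 if t1 <= line_projection(x, x1, x2, distance) <= t2 else 1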
Experiments
ACCURACY vs. EFFICIENCY: How often is the actual NN retrieved? How much time does NN retrieval take?
Conclusion
- General purpose
- Distance is a black box
- Does not require metric properties
- Statistical analysis is possible
- Even when the exact NN is not returned, a very close neighbor is returned; for many applications that's fine!
- Not sublinear in the size of the DB
- Statistical (not probabilistic) guarantees
- Needs representative sample sets
- Hands dataset: actual performance differed from the simulation because the training set was not representative!