Title: Similarity Search in High Dimensions via Hashing
1. Similarity Search in High Dimensions via Hashing
- Aristides Gionis, Piotr Indyk and Rajeev Motwani
- Department of Computer Science
- Stanford University
- presented by Jiyun Byun
- Vision Research Lab in ECE at UCSB
2. Outline
- Introduction
- Locality Sensitive Hashing
- Analysis
- Experiments
- Concluding Remarks
3. Introduction
- Nearest neighbor search (NNS)
- The curse of dimensionality
- experimental approaches use heuristics
- analytical approaches
- Approximate approach
- ε-Nearest Neighbor Search (ε-NNS)
- Goal: for any given query q ∈ R^d, return a point p ∈ P with d(q,p) ≤ (1+ε)·d(q,P), where d(q,P) is the distance of q to its closest point in P
- right answers are much closer than irrelevant ones
- time/quality trade-off
4. Locality Sensitive Hashing (LSH)
- Collision probability depends on the distance between points
- higher collision probability for close objects
- small collision probability for those that are far apart
- Given a query point,
- hash it using a set of hash functions
- inspect the entries in each bucket it falls into
5. Locality Sensitive Hashing
6. Locality Sensitive Hashing (LSH): Setting
- C: the largest coordinate among all points in the given dataset P of dimension d (P ⊂ R^d)
- Embed P into the Hamming cube {0,1}^d'
- dimension d' = C·d
- v(p) = Unary_C(x_1) ... Unary_C(x_d)
- use the unary code (x ones followed by C−x zeros) for each coordinate of each point (a sketch follows below)
- isometric embedding
- d_1(p,q) = d_H(v(p), v(q))
- the embedding preserves the distance between points
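To make the embedding concrete, here is a minimal Python sketch of the unary encoding, assuming non-negative integer coordinates bounded by C (the function names are illustrative, not from the paper):

```python
def unary(x: int, C: int) -> list[int]:
    # Unary (thermometer) code: x ones followed by C - x zeros.
    return [1] * x + [0] * (C - x)

def embed(p: list[int], C: int) -> list[int]:
    # v(p) = Unary_C(x_1) ... Unary_C(x_d), a point in {0,1}^(C*d).
    bits: list[int] = []
    for x in p:
        bits.extend(unary(x, C))
    return bits

# The embedding is isometric: the l1 distance d_1(p, q) equals the
# Hamming distance d_H(v(p), v(q)).
p, q = [3, 1], [1, 2]
vp, vq = embed(p, 10), embed(q, 10)
assert sum(a != b for a, b in zip(vp, vq)) == sum(abs(a - b) for a, b in zip(p, q))
```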
7. Locality Sensitive Hashing (LSH): Hash functions (1/2)
- Build hash functions on the Hamming cube in d' dimensions
- Choose L subsets of the dimensions I_1, I_2, ..., I_L
- I_j consists of k elements from {1, ..., d'}
- found by sampling uniformly at random with replacement
- Project each point on each I_j
- g_j(p): projection of p on I_j, obtained by concatenating the bit values of p for the dimensions in I_j
- Store p in buckets g_j(p), j = 1, ..., L (see the sketch after this list)
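A minimal sketch of this construction in Python, assuming points are already embedded as bit vectors in {0,1}^d'; the parameters k and L follow the slide, everything else is an assumption:

```python
import random
from collections import defaultdict

def build_indices(points, d_prime, k, L, seed=0):
    rng = random.Random(seed)
    # Choose L subsets I_1, ..., I_L of k coordinates each, sampled
    # uniformly at random with replacement from {0, ..., d' - 1}.
    subsets = [[rng.randrange(d_prime) for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for idx, v in enumerate(points):
        for j, I in enumerate(subsets):
            g = tuple(v[i] for i in I)  # g_j(p): bits of p on the dimensions in I_j
            tables[j][g].append(idx)    # store p in bucket g_j(p)
    return subsets, tables
```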
8. Locality Sensitive Hashing (LSH): Hash functions (2/2)
- Two levels of hashing
- LSH function
- maps a point p to bucket g_j(p)
- standard hash function
- maps the contents of buckets into a hash table of size M (see the sketch below)
- B: bucket capacity; α: memory utilization parameter
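A sketch of the second level, with Python's built-in hash standing in for the paper's standard hash function (M is the table size from the slide):

```python
def second_level_slot(g: tuple, M: int) -> int:
    # Map the k-bit bucket label g_j(p) into a table of size M, so only
    # non-empty buckets consume memory (2^k can far exceed n).
    return hash(g) % M
```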
9. Query processing
- Search the buckets g_1(q), ..., g_L(q)
- until c·L points are found or all L indices are searched (see the sketch after this list)
- Approximate K-NNS
- output the K points closest to q
- fewer if fewer than K points are found
- (r, ε)-neighbor with parameter r
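A sketch of the query procedure against the tables built above; c is the cutoff constant from the slide, K the number of neighbors requested:

```python
def query(q_bits, subsets, tables, points, c, K=1):
    L = len(subsets)
    candidates = []
    # Search buckets g_1(q), ..., g_L(q); stop early once c*L candidates
    # have been collected or all L indices have been searched.
    for I, table in zip(subsets, tables):
        g = tuple(q_bits[i] for i in I)
        candidates.extend(table.get(g, []))
        if len(candidates) >= c * L:
            break
    # Rank distinct candidates by exact Hamming distance to q and return
    # the K closest (fewer if fewer than K candidates were found).
    def d_hamming(idx):
        return sum(a != b for a, b in zip(points[idx], q_bits))
    return sorted(set(candidates), key=d_hamming)[:K]
```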
10. Analysis
- A family H of hash functions is (r_1, r_2, p_1, p_2)-sensitive if, for any points p and q:
- if d(p,q) ≤ r_1 then Pr[h(p) = h(q)] ≥ p_1
- if d(p,q) > r_2 then Pr[h(p) = h(q)] ≤ p_2
- where r_1 < r_2 and p_1 > p_2
- The family of single-coordinate projections in the Hamming cube H^d' is (r, r(1+ε), 1 − r/d', 1 − r(1+ε)/d')-sensitive (a worked instance follows)
- if d_H(q,p) = r (p and q differ on r bits), then Pr[h(q) = h(p)] = 1 − r/d'
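As a worked instance of these bounds (numbers chosen purely for illustration): with d' = 1000, r = 100, and ε = 1, a single bit projection collides with probability p_1 = 0.9 for near points and p_2 = 0.8 for far points; concatenating k bits in g_j amplifies the gap:

```latex
\Pr[g_j(q) = g_j(p)] = \left(1 - \frac{d_H(q,p)}{d'}\right)^{k},
\qquad
p_1^{k} = 0.9^{20} \approx 0.12, \quad p_2^{k} = 0.8^{20} \approx 0.012
\quad (k = 20).
```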
11. LSH solves the (r, ε)-neighbor problem
- Determine whether
- there exists a point within distance r of the query point q
- or all points are at least a distance r(1+ε) away from q
- In the former case,
- return a point within distance r(1+ε) of q
- Repeat the construction to boost the probability of success (see the bound after this list)
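The boosting step can be quantified (continuing the illustrative numbers above): a near point is missed by one index with probability at most 1 − p_1^k, so with L independent indices

```latex
\Pr[\text{miss in all } L \text{ indices}] \le \left(1 - p_1^{k}\right)^{L},
\qquad \text{e.g. } (1 - 0.12)^{30} \approx 0.02 \ \text{for } L = 30.
```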
12. ε-NN problem
- For a given query point q,
- return a point p from the dataset P
- built from multiple instances of the (r, ε)-neighbor solution (a sketch follows)
- (r_0, ε)-neighbor, (r_0(1+ε), ε)-neighbor, (r_0(1+ε)^2, ε)-neighbor, ..., r_max-neighbor
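A sketch of this reduction in Python; `solve_r_neighbor` is a hypothetical handle to one (r, ε)-neighbor structure, not an API from the paper:

```python
def eps_nn(q, r0, eps, r_max, solve_r_neighbor):
    # Probe (r0, eps)-, (r0(1+eps), eps)-, (r0(1+eps)^2, eps)-neighbor
    # instances in increasing order of radius until one returns a point.
    r = r0
    while r <= r_max:
        p = solve_r_neighbor(q, r)  # a point within r(1+eps) of q, or None
        if p is not None:
            return p
        r *= (1 + eps)
    return None  # no point within r_max of q
```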
13. Experiments (1/3)
- Datasets
- color histograms (Corel Draw)
- n = 20,000, d = 8, ..., 64
- texture features (aerial photos)
- n = 270,000, d = 60
- Query sets
- Disk
- each second-level bucket is directly mapped to a disk block
14. Experiments (2/3)
[Figure: normalized frequency vs. interpoint distance for the color histogram and texture feature datasets]
15. Experiments (3/3)
- Performance measures
- speed: average number of disk blocks accessed per query
- effective error (a sketch follows)
- E = (1/|Q|) · Σ_{q ∈ Q} d_LSH(q) / d*(q)
- d_LSH(q): distance from q to the nearest neighbor found by LSH; d*(q): distance from q to the true nearest neighbor
- miss ratio
- the fraction of queries for which no answer was found
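Both quality measures are straightforward to compute; a minimal sketch, assuming `d_lsh[i]` is None when query i returned no answer (variable names are mine, not the paper's):

```python
def effective_error(d_lsh, d_star):
    # Average of d_LSH(q) / d*(q) over the queries that returned an answer.
    ratios = [a / b for a, b in zip(d_lsh, d_star) if a is not None]
    return sum(ratios) / len(ratios)

def miss_ratio(d_lsh):
    # Fraction of queries for which no answer was found.
    return sum(a is None for a in d_lsh) / len(d_lsh)
```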
16. Experiments, color histogram (1/4)
- Error vs. number of indices (L)
17. Experiments, color histogram (2/4)
[Figure: disk accesses vs. number of database points, for approximate 1-NNS and approximate 10-NNS]
18. Experiments, color histogram (3/4)
[Figure: miss ratio vs. number of database points, for approximate 1-NNS and approximate 10-NNS]
19. Experiments, color histogram (4/4)
[Figure: disk accesses vs. number of dimensions, for approximate 1-NNS and approximate 10-NNS]
20. Experiments, texture features (1/2)
- Number of indices vs. error
21. Experiments, texture features (2/2)
- Number of indices vs. size
22. Concluding remarks
- Locality Sensitive Hashing
- fast approximate similarity search
- dynamic/join versions
- Future work
- hybrid techniques combining tree-based and hashing-based approaches