Title: Similarity Search in High Dimensions via Hashing
1. Similarity Search in High Dimensions via Hashing
- Aristides Gionis, Piotr Indyk and Rajeev Motwani
- Department of Computer Science
- Stanford University
- presented by Jiyun Byun
- Vision Research Lab in ECE at UCSB
2. Outline
- Introduction
- Locality Sensitive Hashing
- Analysis
- Experiments
- Concluding Remarks
3. Introduction
- Nearest neighbor search (NNS)
- The curse of dimensionality
- experimental approaches use heuristics
- analytical approaches
- Approximate approach
- ε-Nearest Neighbor Search (ε-NNS)
- Goal: for any given query q ∈ R^d, return a point p ∈ P with d(q,p) ≤ (1+ε)·d(q,P), where d(q,P) is the distance of q to its closest point in P
- right answers are much closer than irrelevant ones
- time/quality trade-off
4. Locality Sensitive Hashing (LSH)
- Collision probability depends on the distance between points
- higher collision probability for close objects
- small collision probability for those that are far apart
- Given a query point,
- hash it using a set of hash functions
- inspect the entries in each bucket it falls into
5. Locality Sensitive Hashing
6. Locality Sensitive Hashing (LSH): Setting
- C: the largest coordinate among all points in the given dataset P of dimension d (P ⊂ R^d)
- Embed P into the Hamming cube {0,1}^d'
- dimension d' = C·d
- v(p) = Unary_C(x_1) ... Unary_C(x_d)
- use the unary code (x ones followed by C−x zeros) for each coordinate of each point (a sketch follows below)
- isometric embedding
- d_1(p,q) = d_H(v(p), v(q))
- the embedding preserves the distance between points
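To make the embedding concrete, here is a minimal Python sketch of the unary encoding, assuming non-negative integer coordinates bounded by C (the function names are illustrative, not from the paper):

```python
def unary(x: int, C: int) -> list[int]:
    # Unary (thermometer) code: x ones followed by C - x zeros.
    return [1] * x + [0] * (C - x)

def embed(p: list[int], C: int) -> list[int]:
    # v(p) = Unary_C(x_1) ... Unary_C(x_d), a point in {0,1}^(C*d).
    bits: list[int] = []
    for x in p:
        bits.extend(unary(x, C))
    return bits

# The embedding is isometric: the l1 distance d_1(p, q) equals the
# Hamming distance d_H(v(p), v(q)).
p, q = [3, 1], [1, 2]
vp, vq = embed(p, 10), embed(q, 10)
assert sum(a != b for a, b in zip(vp, vq)) == sum(abs(a - b) for a, b in zip(p, q))
```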
7. Locality Sensitive Hashing (LSH): Hash functions (1/2)
- Build hash functions on the Hamming cube in d' dimensions
- Choose L subsets of the dimensions I_1, I_2, ..., I_L
- I_j consists of k elements from {1, ..., d'}
- found by sampling uniformly at random with replacement
- Project each point on each I_j
- g_j(p): projection of p on I_j, obtained by concatenating the bit values of p for the dimensions in I_j
- Store p in buckets g_j(p), j = 1, ..., L (see the sketch after this list)
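A minimal sketch of this construction in Python, assuming points are already embedded as bit vectors in {0,1}^d'; the parameters k and L follow the slide, everything else is an assumption:

```python
import random
from collections import defaultdict

def build_indices(points, d_prime, k, L, seed=0):
    rng = random.Random(seed)
    # Choose L subsets I_1, ..., I_L of k coordinates each, sampled
    # uniformly at random with replacement from {0, ..., d' - 1}.
    subsets = [[rng.randrange(d_prime) for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for idx, v in enumerate(points):
        for j, I in enumerate(subsets):
            g = tuple(v[i] for i in I)  # g_j(p): bits of p on the dimensions in I_j
            tables[j][g].append(idx)    # store p in bucket g_j(p)
    return subsets, tables
```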
8. Locality Sensitive Hashing (LSH): Hash functions (2/2)
- Two levels of hashing
- LSH function
- maps a point p to bucket g_j(p)
- standard hash function
- maps the contents of buckets into a hash table of size M (see the sketch below)
- B: bucket capacity; α: memory utilization parameter
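A sketch of the second level, with Python's built-in hash standing in for the paper's standard hash function (M is the table size from the slide):

```python
def second_level_slot(g: tuple, M: int) -> int:
    # Map the k-bit bucket label g_j(p) into a table of size M, so only
    # non-empty buckets consume memory (2^k can far exceed n).
    return hash(g) % M
```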
9. Query processing
- Search the buckets g_1(q), ..., g_L(q)
- until c·L points are found or all L indices are searched (see the sketch after this list)
- Approximate K-NNS
- output the K points closest to q
- fewer if fewer than K points are found
- (r, ε)-neighbor with parameter r
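A sketch of the query procedure against the tables built above; c is the cutoff constant from the slide, K the number of neighbors requested:

```python
def query(q_bits, subsets, tables, points, c, K=1):
    L = len(subsets)
    candidates = []
    # Search buckets g_1(q), ..., g_L(q); stop early once c*L candidates
    # have been collected or all L indices have been searched.
    for I, table in zip(subsets, tables):
        g = tuple(q_bits[i] for i in I)
        candidates.extend(table.get(g, []))
        if len(candidates) >= c * L:
            break
    # Rank distinct candidates by exact Hamming distance to q and return
    # the K closest (fewer if fewer than K candidates were found).
    def d_hamming(idx):
        return sum(a != b for a, b in zip(points[idx], q_bits))
    return sorted(set(candidates), key=d_hamming)[:K]
```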
10. Analysis
- A family H of hash functions is (r_1, r_2, p_1, p_2)-sensitive if, for any points p and q:
- if d(p,q) ≤ r_1 then Pr[h(p) = h(q)] ≥ p_1
- if d(p,q) > r_2 then Pr[h(p) = h(q)] ≤ p_2
- where r_1 < r_2 and p_1 > p_2
- The family of single-coordinate projections in the Hamming cube H^d' is (r, r(1+ε), 1 − r/d', 1 − r(1+ε)/d')-sensitive (a worked instance follows)
- if d_H(q,p) = r (p and q differ on r bits), then Pr[h(q) = h(p)] = 1 − r/d'
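As a worked instance of these bounds (numbers chosen purely for illustration): with d' = 1000, r = 100, and ε = 1, a single bit projection collides with probability p_1 = 0.9 for near points and p_2 = 0.8 for far points; concatenating k bits in g_j amplifies the gap:

```latex
\Pr[g_j(q) = g_j(p)] = \left(1 - \frac{d_H(q,p)}{d'}\right)^{k},
\qquad
p_1^{k} = 0.9^{20} \approx 0.12, \quad p_2^{k} = 0.8^{20} \approx 0.012
\quad (k = 20).
```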
11. LSH solves the (r, ε)-neighbor problem
- Determine whether
- there exists a point within distance r of the query point q
- or all points are at least a distance r(1+ε) away from q
- In the former case,
- return a point within distance r(1+ε) of q
- Repeat the construction to boost the probability of success (see the bound after this list)
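The boosting step can be quantified (continuing the illustrative numbers above): a near point is missed by one index with probability at most 1 − p_1^k, so with L independent indices

```latex
\Pr[\text{miss in all } L \text{ indices}] \le \left(1 - p_1^{k}\right)^{L},
\qquad \text{e.g. } (1 - 0.12)^{30} \approx 0.02 \ \text{for } L = 30.
```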
12. ε-NN problem
- For a given query point q,
- return a point p from the dataset P
- built from multiple instances of the (r, ε)-neighbor solution (a sketch follows)
- (r_0, ε)-neighbor, (r_0(1+ε), ε)-neighbor, (r_0(1+ε)^2, ε)-neighbor, ..., r_max-neighbor
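A sketch of this reduction in Python; `solve_r_neighbor` is a hypothetical handle to one (r, ε)-neighbor structure, not an API from the paper:

```python
def eps_nn(q, r0, eps, r_max, solve_r_neighbor):
    # Probe (r0, eps)-, (r0(1+eps), eps)-, (r0(1+eps)^2, eps)-neighbor
    # instances in increasing order of radius until one returns a point.
    r = r0
    while r <= r_max:
        p = solve_r_neighbor(q, r)  # a point within r(1+eps) of q, or None
        if p is not None:
            return p
        r *= (1 + eps)
    return None  # no point within r_max of q
```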
13. Experiments (1/3)
- Datasets
- color histograms (Corel Draw)
- n = 20,000, d = 8, ..., 64
- texture features (aerial photos)
- n = 270,000, d = 60
- Query sets
- Disk
- each second-level bucket is directly mapped to a disk block
14. Experiments (2/3)
[Figure: normalized frequency vs. interpoint distance for the color histogram and texture feature datasets]
15. Experiments (3/3)
- Performance measures
- speed: average number of disk blocks accessed per query
- effective error (a sketch follows)
- E = (1/|Q|) · Σ_{q ∈ Q} d_LSH(q) / d*(q)
- d_LSH(q): distance from q to the nearest neighbor found by LSH; d*(q): distance from q to the true nearest neighbor
- miss ratio
- the fraction of queries for which no answer was found
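Both quality measures are straightforward to compute; a minimal sketch, assuming `d_lsh[i]` is None when query i returned no answer (variable names are mine, not the paper's):

```python
def effective_error(d_lsh, d_star):
    # Average of d_LSH(q) / d*(q) over the queries that returned an answer.
    ratios = [a / b for a, b in zip(d_lsh, d_star) if a is not None]
    return sum(ratios) / len(ratios)

def miss_ratio(d_lsh):
    # Fraction of queries for which no answer was found.
    return sum(a is None for a in d_lsh) / len(d_lsh)
```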
16. Experiments, color histogram (1/4)
- Error vs. number of indices (L)
17. Experiments, color histogram (2/4)
[Figure: disk accesses vs. number of database points, for approximate 1-NNS and approximate 10-NNS]
18. Experiments, color histogram (3/4)
[Figure: miss ratio vs. number of database points, for approximate 1-NNS and approximate 10-NNS]
19. Experiments, color histogram (4/4)
[Figure: disk accesses vs. number of dimensions, for approximate 1-NNS and approximate 10-NNS]
20. Experiments, texture features (1/2)
- Number of indices vs. error
21. Experiments, texture features (2/2)
- Number of indices vs. size
22. Concluding remarks
- Locality Sensitive Hashing
- fast approximate similarity search
- dynamic/join versions
- Future work
- hybrid techniques combining tree-based and hashing-based approaches