1
CS 361A (Advanced Data Structures and Algorithms)
  • Lecture 19 (Dec 5, 2005)
  • Nearest Neighbors: Dimensionality Reduction and
    Locality-Sensitive Hashing
  • Rajeev Motwani

2
Metric Space
  • Metric Space (M,D)
  • For points p,q in M, D(p,q) is the distance from p to q
  • The only reasonable model for high-dimensional geometric space
  • Defining Properties
  • Reflexive: D(p,q) = 0 if and only if p = q
  • Symmetric: D(p,q) = D(q,p)
  • Triangle Inequality: D(p,q) ≤ D(p,r) + D(r,q)
  • Interesting Cases
  • M → points in d-dimensional space
  • D → Hamming or Euclidean Lp-norms

3
High-Dimensional Near Neighbors
  • Nearest Neighbors Data Structure
  • Given N points P = {p1, …, pN} in metric space (M,D)
  • Queries: which point p ∈ P is closest to point q?
  • Complexity: trade off preprocessing space against query time
  • Applications
  • vector quantization
  • multimedia databases
  • data mining
  • machine learning

4
Known Results
  Query Time       | Storage          | Technique          | Paper
  dN               | dN               | Brute Force        |
  2^d · log N      | N^(2^(d+1))      | Voronoi Diagram    | Dobkin-Lipton 76
  d^(d/2) · log N  | N^(d/2)          | Random Sampling    | Clarkson 88
  d^5 · log N      | N^d              | Combination        | Meiser 93
  log^(d-1) N      | N · log^(d-1) N  | Parametric Search  | Agarwal-Matousek 92
  • Some expressions are approximate
  • Bottom line: exponential dependence on d

5
Approximate Nearest Neighbor
  • Exact Algorithms
  • Benchmark: brute force needs space O(N), query time O(N)
  • Known Results: exponential dependence on dimension
  • Theory/Practice: no better than brute-force search
  • Approximate Near-Neighbors
  • Given N points P = {p1, …, pN} in metric space (M,D)
  • Given error parameter ε > 0
  • Goal: for query q and nearest neighbor p, return r such that
    D(q,r) ≤ (1+ε) · D(q,p)
  • Justification
  • Mapping objects to a metric space is heuristic anyway
  • Get tremendous performance improvement

6
Results for Approximate NN
  Query Time             | Storage                     | Technique                          | Paper
  d^d · ε^(-d) · log N   | dN                          | Balanced Trees                     | Arya et al 94
  d^2 · polylog(N,d)     | N^(2d) · dN · polylog(N,d)  | Random Projection                  | Kleinberg 97
  log^3 N                | N^(1/ε²)                    | Search Trees + Dimension Reduction | Indyk-Motwani 98
  d · N^(1/ε) · log^2 N  | N^(1+1/ε) · log N           | Locality-Sensitive Hashing         | Indyk-Motwani 98
  External Memory        | External Memory             | Locality-Sensitive Hashing         | Gionis-Indyk-Motwani 99
  • Will show the main ideas of the last 3 results
  • Some expressions are approximate

7
Approximate r-Near Neighbors
  • Given N points P = {p1, …, pN} in metric space (M,D)
  • Given error parameter ε > 0, distance threshold r > 0
  • Query
  • If no point p with D(q,p) < r, return FAILURE
  • Else, return any p with D(q,p) < (1+ε)r
  • Application
  • Solving Approximate Nearest Neighbor
  • Assume maximum distance is R
  • Run in parallel for r = 1, (1+ε), (1+ε)², …, R
  • Time/space: O(log R) overhead
  • Indyk-Motwani reduce this to O(polylog N) overhead

8
Hamming Metric
  • Hamming Space
  • Points in M: bit-vectors {0,1}^d (can generalize
    to {0,1,2,…,q}^d)
  • Hamming Distance: D(p,q) = number of positions where
    p,q differ
  • Remarks
  • Simplest high-dimensional setting
  • Still useful in practice
  • In theory, as hard (or easy) as Euclidean space
  • Trivial in low dimensions
  • Example
  • Hypercube in d = 3 dimensions
  • 000, 001, 010, 011, 100, 101, 110, 111
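
A quick illustration (a minimal Python sketch, not from the slides): Hamming distance is just a coordinate-wise count of disagreements.

```python
# Hamming distance between two bit-vectors, represented here as strings.
def hamming(p: str, q: str) -> int:
    """Number of positions where p and q differ."""
    assert len(p) == len(q)
    return sum(a != b for a, b in zip(p, q))

# Corners of the d = 3 hypercube from the example above:
print(hamming("000", "111"))  # 3
print(hamming("011", "010"))  # 1
```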

9
Dimensionality Reduction
  • Overall Idea
  • Map from high to low dimensions
  • Preserve distances approximately
  • Solve Nearest Neighbors in new space
  • Performance improvement at cost of approximation
    error
  • Mapping?
  • Hash function family H = {H1, …, Hm}
  • Each Hi: {0,1}^d → {0,1}^t with t << d
  • Pick HR from H uniformly at random
  • Map each point in P using the same HR
  • Solve NN problem on HR(P) = {HR(p1), …, HR(pN)}

10
Reduction for Hamming Spaces
  • Theorem: For any r and small ε > 0, there is a hash
    family H such that for any p,q and random HR ∈ H:
  • D(p,q) < r ⇒ D(HR(p),HR(q)) < (c + ε/12)·t
  • D(p,q) > (1+ε)r ⇒ D(HR(p),HR(q)) > (c + ε/12)·t
  • with probability > 1−δ, provided, for some
    constant C, t ≥ C · ε^(−2) · log(2/δ)
11
Remarks
  • For fixed threshold r, can distinguish between
  • Near: D(p,q) < r
  • Far: D(p,q) > (1+ε)r
  • For N points, need t = O(ε^(−2) · log N)
  • Yet, can reduce to O(log N)-dimensional space,
    while approximately preserving distances
  • Works even if points are not known in advance

12
Hash Family
  • Projection Function
  • Let S be an ordered multiset of s indexes from {1,…,d}
  • pS: {0,1}^d → {0,1}^s projects p into an s-dimensional
    subspace
  • Example
  • d = 5, p = 01100
  • s = 3, S = {2,2,4} ⇒ pS = 110
  • Choosing hash function HR in H
  • Repeat for i = 1,…,t
  • Pick Si randomly (with replacement) from {1,…,d}
  • Pick random hash function fi: {0,1}^s → {0,1}
  • hi(p) = fi(pSi)
  • HR(p) = (h1(p), h2(p),…,ht(p))
  • Remark: note the similarity to Bloom Filters
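
A minimal Python sketch of this construction (illustrative only; the memoized random bit below stands in for a truly random function fi):

```python
import random

def make_hash_family(d: int, s: int, t: int, seed: int = 0):
    """Sample HR = (h1, ..., ht): each hi projects p onto s random
    coordinates (chosen with replacement) and feeds the projection
    to a random function fi: {0,1}^s -> {0,1}."""
    rng = random.Random(seed)
    index_sets = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]
    tables = [{} for _ in range(t)]  # lazily realized random functions fi

    def HR(p: str):
        bits = []
        for S, f in zip(index_sets, tables):
            proj = "".join(p[i] for i in S)   # pS: project p onto S
            if proj not in f:                 # fi(proj): fresh random bit,
                f[proj] = rng.randrange(2)    # remembered for consistency
            bits.append(f[proj])
        return tuple(bits)

    return HR

HR = make_hash_family(d=5, s=3, t=8)
print(HR("01100"), HR("01101"))  # nearby points agree in most coordinates
```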

13
Illustration of Hashing
  [Figure: the d-bit vector p (e.g., 0 1 1 0 0 0 1 0 1 0) is projected
  onto index sets S1,…,St, giving s-bit strings pS1,…,pSt; each pSi is
  hashed by fi to a single bit hi(p), and HR(p) = (h1(p),…,ht(p)).]
14
Analysis I
  • Choose random index set S
  • Claim: For any p,q, Pr[pS = qS] = (1 − D(p,q)/d)^s
  • Why?
  • p,q differ in D(p,q) bit positions
  • Need all s indexes of S to avoid these positions
  • Sampling with replacement from {1,…,d}

15
Analysis II
  • Choose s = d/r
  • Since 1−x ≤ e^(−x) for x < 1, we obtain
    Pr[pS = qS] = (1 − D(p,q)/d)^(d/r) ≈ e^(−D(p,q)/r)
  • Thus: D(p,q) < r ⇒ Pr[pS = qS] > e^(−1) (approximately),
    and D(p,q) > (1+ε)r ⇒ Pr[pS = qS] < e^(−(1+ε))

16
Analysis III
  • Recall hi(p) = fi(pSi)
  • Thus Pr[hi(p) ≠ hi(q)] = ½ · (1 − Pr[pSi = qSi])
  • Choosing c = ½ · (1 − e^(−1)):
    D(p,q) < r ⇒ Pr[hi(p) ≠ hi(q)] < c, and
    D(p,q) > (1+ε)r ⇒ Pr[hi(p) ≠ hi(q)] > c + ε/6

17
Analysis IV
  • Recall HR(p) = (h1(p),h2(p),…,ht(p))
  • D(HR(p),HR(q)) = number of i's where hi(p), hi(q) differ
  • By linearity of expectation,
    E[D(HR(p),HR(q))] = t · Pr[hi(p) ≠ hi(q)]
  • Theorem almost proved
  • For a high-probability bound, need the Chernoff Bound

18
Chernoff Bound
  • Consider Bernoulli random variables X1, X2, …, Xn
  • Values are 0-1
  • Pr[Xi = 1] = x and Pr[Xi = 0] = 1−x
  • Define X = X1 + X2 + … + Xn with E[X] = nx
  • Theorem: For independent X1,…,Xn and any 0 < λ < 1,
    Pr[ |X − nx| > λ·nx ] < 2e^(−λ²·nx/3)
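
As a sanity check (a small Python experiment, not part of the slides; the parameters n, x, λ are arbitrary illustrations), the empirical deviation probability sits well below the bound:

```python
import math, random

# Empirically compare Pr[|X - nx| > lam*n*x] with the Chernoff bound
# 2*exp(-lam^2 * n * x / 3), for X a sum of n Bernoulli(x) variables.
n, x, lam, trials = 1000, 0.3, 0.2, 2000
rng = random.Random(1)

exceed = 0
for _ in range(trials):
    X = sum(rng.random() < x for _ in range(n))
    if abs(X - n * x) > lam * n * x:
        exceed += 1

print("empirical:", exceed / trials)                    # ~0.0000
print("bound:    ", 2 * math.exp(-lam**2 * n * x / 3))  # ~0.0366
```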
19
Analysis V
  • Define
  • Xi = 0 if hi(p) = hi(q), and 1 otherwise
  • n = t
  • Then X = X1 + X2 + … + Xt = D(HR(p),HR(q))
  • Case 1: D(p,q) < r ⇒ x < c
  • Case 2: D(p,q) > (1+ε)r ⇒ x > c + ε/6
  • Observe the sloppy bounding of constants in Case 2

20
Putting it all together
  • Recall: Pr[ |X − tx| > λ·tx ] < 2e^(−λ²·tx/3)
  • Thus, the error probability is at most δ provided
    t ≥ C · ε^(−2) · log(2/δ)
  • Choosing C = 1200/c
  • Theorem is proved!!

21
Algorithm I
  • Set error probability δ
  • Select hash HR and map points p → HR(p)
  • Processing query q
  • Compute HR(q)
  • Find nearest neighbor HR(p) for HR(q)
  • If D(HR(q),HR(p)) ≤ (c + ε/12)·t then return p,
    else FAILURE
  • Remarks
  • Brute force for finding HR(p) implies query time O(Nt)
  • Need another approach for lower dimensions

22
Algorithm II
  • Fact: Exact nearest neighbors in {0,1}^t requires
  • Space O(2^t)
  • Query time O(t)
  • How?
  • Precompute/store answers to all queries
  • Number of possible queries is 2^t
  • Since t = O(ε^(−2) · log N), we get 2^t = N^O(1/ε²)
  • Theorem: In Hamming space {0,1}^d, can solve
    approximate nearest neighbor with
  • Space N^O(1/ε²)
  • Query time O(d · ε^(−2) · log N)

23
Different Metric
  • Many applications have sparse points
  • Many dimensions, but few 1s
  • Example: points → documents, dimensions → words
  • Better to view as sets
  • Previous approach would require large s
  • For sets A,B, define sim(A,B) = |A ∩ B| / |A ∪ B|
  • Observe
  • A = B ⇒ sim(A,B) = 1
  • A,B disjoint ⇒ sim(A,B) = 0
  • Question: handling D(A,B) = 1 − sim(A,B)?

24
Min-Hash
  • Random permutations π1,…,πt of the universe (dimensions)
  • Define mapping hj(A) = min_{a ∈ A} πj(a)
  • Fact: Pr[hj(A) = hj(B)] = sim(A,B)
  • Proof? Already seen!!
  • Overall hash function
  • HR(A) = (h1(A), h2(A),…,ht(A))
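
A minimal Python sketch of min-hashing (illustrative; explicit permutations are fine for a tiny universe, though practical implementations use hash functions instead):

```python
import random

def minhash_signature(A: set, universe: int, t: int, seed: int = 0):
    """HR(A) = (h1(A), ..., ht(A)), where hj(A) = min over a in A
    of pij(a), for a random permutation pij of {0, ..., universe-1}."""
    rng = random.Random(seed)
    sig = []
    for _ in range(t):
        perm = list(range(universe))
        rng.shuffle(perm)                  # random permutation pij
        sig.append(min(perm[a] for a in A))
    return tuple(sig)

# Same seed => same permutations for both sets, as the scheme requires.
A, B = {0, 2, 3}, {0, 3, 5}
sa = minhash_signature(A, universe=8, t=200)
sb = minhash_signature(B, universe=8, t=200)
agree = sum(x == y for x, y in zip(sa, sb)) / len(sa)
print(agree)  # close to sim(A,B) = |{0,3}| / |{0,2,3,5}| = 0.5
```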

25
Min-Hash Analysis
  • Select t as in the Hamming-space reduction
  • Hamming Distance
  • D(HR(A),HR(B)) = number of j's such that hj(A) ≠ hj(B)
  • Theorem: For any A,B, D(HR(A),HR(B)) concentrates
    around t · D(A,B)
  • Proof? Exercise (apply the Chernoff Bound)
  • Obtain an ANN algorithm similar to the earlier result

26
Generalization
  • Goal
  • abstract technique used for Hamming space
  • enable application to other metric spaces
  • handle Dynamic ANN
  • Dynamic Approximate r-Near Neighbors
  • Fix threshold r
  • Query: if any point is within distance r of q,
    return any point within distance (1+ε)r
  • Allow insertions/deletions of points in P
  • Recall: the earlier method required preprocessing
    all possible queries in the hash range space

27
Locality-Sensitive Hashing
  • Fix metric space (M,D), threshold r, error ε
  • Choose probability parameters Q1 > Q2 > 0
  • Definition: Hash family H = {h: M → S} for (M,D) is called
    (r, (1+ε)r, Q1, Q2)-sensitive if, for random h and any p,q in M:
  • D(p,q) < r ⇒ Pr[h(p) = h(q)] > Q1
  • D(p,q) > (1+ε)r ⇒ Pr[h(p) = h(q)] < Q2
  • Intuition
  • p,q near ⇒ likely to collide
  • p,q far ⇒ unlikely to collide

28
Examples
  • Hamming Space M = {0,1}^d
  • Point p = b1…bd
  • H = {hi : hi(b1…bd) = bi, for i = 1…d}
  • Sampling one bit at random
  • Pr[hi(q) = hi(p)] = 1 − D(p,q)/d
  • Set Similarity D(A,B) = 1 − sim(A,B)
  • Recall Pr[hj(A) = hj(B)] = sim(A,B)
  • H = {min-hash functions hj}
  • Pr[h(A) = h(B)] = 1 − D(A,B)

29
Multi-Index Hashing
  • Overall Idea
  • Fix LSH family H
  • Boost the Q1, Q2 gap by defining G = H^k
  • Using G, each point hashes into l buckets
  • Intuition
  • r-near neighbors likely to collide
  • few non-near pairs in any bucket
  • Define
  • G = { g : g(p) = h1(p) h2(p) … hk(p) } (concatenation)
  • Hamming metric ⇒ sample k random bits

30
Example (l = 4)
  [Figure: four hash tables g1,…,g4, each gi a concatenation of k
  bit-samples h1…hk; the query q and an r-near point p collide in
  at least one of the four buckets.]
31
Overall Scheme
  • Preprocessing
  • Prepare hash table for range of G
  • Select l hash functions g1, g2, …, gl
  • Insert(p): add p to buckets g1(p), g2(p), …, gl(p)
  • Delete(p): remove p from buckets g1(p), g2(p), …, gl(p)
  • Query(q)
  • Check buckets g1(q), g2(q), …, gl(q)
  • Report nearest of (say) the first 3l points
  • Complexity
  • Assume computing D(p,q) needs O(d) time
  • Assume storing p needs O(d) space
  • Insert/Delete/Query Time: O(d·l·k)
  • Preprocessing/Storage: O(dN + N·l·k)
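
A compact Python sketch of this scheme for the Hamming metric (illustrative only; the class name and the parameter choices k, l below are made up for the example):

```python
import random
from collections import defaultdict

class HammingLSH:
    """Multi-index LSH for Hamming space: l hash tables, each keyed by
    g(p) = (h1(p), ..., hk(p)), where each hi samples one random bit."""

    def __init__(self, d: int, k: int, l: int, seed: int = 0):
        rng = random.Random(seed)
        # For each of the l tables, sample k random bit positions.
        self.gs = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _key(self, g, p: str):
        return tuple(p[i] for i in g)

    def insert(self, p: str):
        for g, table in zip(self.gs, self.tables):
            table[self._key(g, p)].append(p)

    def delete(self, p: str):
        for g, table in zip(self.gs, self.tables):
            table[self._key(g, p)].remove(p)

    def query(self, q: str):
        """Report the nearest of the first 3l colliding points
        (None if no bucket contains a candidate)."""
        limit = 3 * len(self.gs)
        candidates = []
        for g, table in zip(self.gs, self.tables):
            candidates.extend(table.get(self._key(g, q), ()))
            if len(candidates) >= limit:
                break
        candidates = candidates[:limit]
        return min(candidates,
                   key=lambda p: sum(a != b for a, b in zip(p, q)),
                   default=None)

index = HammingLSH(d=8, k=3, l=4)
for p in ["00000000", "00001111", "11110000", "11111111"]:
    index.insert(p)
print(index.query("00000001"))  # likely "00000000"
```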

32
Collision Probability vs. Distance
  [Figure: collision probability as a function of distance D(p,q),
  starting near 1, passing Q1 at distance r, and dropping below Q2
  at distance (1+ε)r.]
33
Multi-Index versus Error
  • Set l = N^z, where z = log(1/Q1) / log(1/Q2)
  • Theorem: For l = N^z, any query returns an r-near
    neighbor correctly with probability at least 1/6.
  • Consequently (ignoring k = O(log N) factors)
  • Time O(d·N^z)
  • Space O(N^(1+z))
  • Hamming Metric ⇒ z ≤ 1/(1+ε)
  • Boost Probability: use several parallel hash tables

34
Analysis
  • Define (for fixed query q)
  • p = any point with D(q,p) < r
  • FAR(q) = all p' with D(q,p') > (1+ε)r
  • BUCKET(q,j) = all p' with gj(p') = gj(q)
  • Event Esize: Σj |BUCKET(q,j) ∩ FAR(q)| ≤ 3l
  • (⇒ query cost bounded by O(d·l))
  • Event ENN: gj(p) = gj(q) for some j
  • (⇒ nearest point in the l buckets is an r-near
    neighbor)
  • Analysis
  • Show Pr[Esize] = x > 2/3 and Pr[ENN] = y > 1/2
  • Thus Pr[not(Esize and ENN)] ≤ (1−x) + (1−y) < 5/6

35
Analysis Bad Collisions
  • Choose k = log N / log(1/Q2), so that Q2^k ≤ 1/N
  • Fact: E[ |BUCKET(q,j) ∩ FAR(q)| ] ≤ N·Q2^k ≤ 1
  • Clearly E[ Σj |BUCKET(q,j) ∩ FAR(q)| ] ≤ l
  • Markov Inequality: Pr[X > r·E[X]] < 1/r, for X > 0
  • Lemma 1: Pr[Esize] > 2/3, since
    Pr[ Σj |BUCKET(q,j) ∩ FAR(q)| > 3l ] < 1/3

36
Analysis Good Collisions
  • Observe: Pr[gj(p) = gj(q)] ≥ Q1^k = Q1^(log N / log(1/Q2)) = N^(−z)
  • Since l = N^z ⇒ Pr[gj(p) ≠ gj(q) for all j] ≤ (1 − N^(−z))^(N^z) ≤ 1/e
  • Lemma 2: Pr[ENN] ≥ 1 − 1/e > 1/2

37
Euclidean Norms
  • Recall
  • x = (x1, x2, …, xd) and y = (y1, y2, …, yd) in R^d
  • L1-norm: D(x,y) = Σi |xi − yi|
  • Lp-norm (for p > 1): D(x,y) = ( Σi |xi − yi|^p )^(1/p)

38
Extension to L1-Norm
  • Round coordinates to {1,…,M}
  • Embed L1-{1,…,M}^d into Hamming {0,1}^(dM)
  • Unary Mapping: each coordinate xi becomes 1^(xi) 0^(M−xi)
  • Apply the algorithm for Hamming spaces
  • Error due to rounding: 1/M
  • Space/Time overhead due to the mapping: d → dM
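
A small Python sketch of the unary embedding (illustrative): L1 distance on {1,…,M}^d becomes exactly Hamming distance on {0,1}^(dM).

```python
def unary_embed(x, M: int) -> str:
    """Map a point in {1,...,M}^d to {0,1}^(d*M) by encoding each
    coordinate xi as 1^xi 0^(M-xi); L1 distance becomes Hamming."""
    return "".join("1" * xi + "0" * (M - xi) for xi in x)

p, q, M = (3, 1), (1, 4), 5
hp, hq = unary_embed(p, M), unary_embed(q, M)
hamming = sum(a != b for a, b in zip(hp, hq))
l1 = sum(abs(a - b) for a, b in zip(p, q))
print(hamming, l1)  # both 5
```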

39
Extension to L2-Norm
  • Observe
  • Little difference between L1-norm and L2-norm for high d
  • Additional error is small
  • More generally Lp, for 1 ≤ p ≤ 2
  • Figiel et al 1977, Johnson-Schechtman 1982
  • Can embed Lp into L1
  • Dimensions: d → O(d)
  • Distances preserved within factor (1+α)
  • Key Idea: random rotation of space

40
Improved Bounds
  • Indyk-Motwani 1998
  • For any Lp-norm
  • Query Time O(log^3 N)
  • Space N^O(1/ε²)
  • Problem: impractical
  • Today: only a high-level sketch

41
Better Reduction
  • Recall
  • Reduced Approximate Nearest Neighbors to
    Approximate r-Near Neighbors
  • Space/Time Overhead: O(log R)
  • R = max distance in metric space
  • Ring-Cover Trees
  • Remove the dependence on R
  • Reduce overhead to O(polylog N)

42
Approximate r-Near Neighbors
  • Idea
  • Impose a regular grid on R^d
  • Decompose into cubes of side length s
  • Label cubes with points at distance < r
  • Data Structure
  • Query q: determine the cube containing q
  • Cube labels: candidate r-near neighbors
  • Goals
  • Small s ⇒ lower error
  • Fewer cubes ⇒ smaller storage

43
  [Figure: a regular grid with points p1, p2, p3; each cube is
  labeled with the points within distance r of it.]
44
Grid Analysis
  • Assume r = 1
  • Choose side length s = ε/d^(1/p)
  • Cube Diameter: s·d^(1/p) = ε
  • Number of cubes: N · O(1/ε)^d
  • Theorem: For any Lp-norm, can solve Approximate
    r-Near Neighbor using
  • Space N · O(1/ε)^d
  • Time O(d)

45
Dimensionality Reduction
  • Johnson-Lindenstrauss 84, Frankl-Maehara 88: For any
    0 < ε < 1, can map the points in P into a subspace of
    dimension O(ε^(−2) · log N) while preserving all
    inter-point distances to within a factor of (1+ε)
  • Proof idea: project onto random lines
  • Result for NN
  • Space N^O(1/ε²)
  • Time O(polylog N)
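
A minimal Python sketch of random projection in the JL spirit (illustrative; it uses a Gaussian projection rather than the random-lines construction named above, and t is chosen arbitrarily for the demo):

```python
import math, random

def jl_project(points, t: int, seed: int = 0):
    """Map d-dimensional points to t dimensions with a random Gaussian
    matrix; t = O(eps^-2 * log N) preserves all pairwise distances
    to within a (1 +/- eps) factor, with high probability."""
    rng = random.Random(seed)
    d = len(points[0])
    # Entries N(0, 1/t), so squared lengths are preserved in expectation.
    R = [[rng.gauss(0, 1 / math.sqrt(t)) for _ in range(d)] for _ in range(t)]
    return [[sum(row[i] * p[i] for i in range(d)) for row in R] for p in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(42)
pts = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(3)]
proj = jl_project(pts, t=200)
print(dist(pts[0], pts[1]), dist(proj[0], proj[1]))  # close to each other
```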

46
References
  • P. Indyk and R. Motwani. Approximate Nearest Neighbors:
    Towards Removing the Curse of Dimensionality. STOC 1998.
  • A. Gionis, P. Indyk, and R. Motwani. Similarity Search in
    High Dimensions via Hashing. VLDB 1999.