Title: Compact Data Representations and their Applications
1Compact Data Representations and their
Applications
- Moses Charikar
- Princeton University
2Sketching Paradigm
- Construct compact representation (sketch) of data such that
- Interesting functions of data can be computed (estimated) from compact representation
3Why care about compact representations ?
- Practical motivations
- Algorithmic techniques for massive data sets
- Compact representations lead to reduced space, time requirements
- Make impractical tasks feasible
- Theoretical Motivations
- Interesting mathematical problems
- Connections to many areas of research
4Questions
- What is the data ?
- What functions do we want to compute on the data ?
- How do we estimate functions on the sketches ?
- Different considerations arise from different combinations of answers
- Compact representation schemes are functions of the requirements
5What is the data ?
- Sets, vectors, points in Euclidean space, points in a metric space, vertices of a graph.
- Mathematical representation of objects (e.g. documents, images, customer profiles, queries).
6What functions do we want to compute on the data ?
- Local functions: pairs of objects, e.g. distance between objects
- Sketch of each object, such that function can be estimated from pairs of sketches
- Global functions: entire data set, e.g. statistical properties of data
- Sketch of entire data set, ability to update, combine sketches
7Local functions: distance/similarity
- Distance is a general metric, i.e. satisfies triangle inequality
- Normed space: $x = (x_1, x_2, \ldots, x_d)$, $y = (y_1, y_2, \ldots, y_d)$
- Other special metrics (e.g. Earth Mover Distance)
8Estimating distance from sketches
- Arbitrary function of sketches
- Information theory, communication complexity question.
- Sketches are points in normed space
- Embedding original distance function in normed space. [Bourgain 85] [Linial, London, Rabinovich 94]
- Original metric is (same) normed space
- Original data points are high dimensional
- Sketches are points in low dimensions
- Dimension reduction in normed spaces [Johnson, Lindenstrauss 84]
9Global functions
- Statistical properties of entire data set
- Frequency moments
- Sortedness of data
- Set membership
- Size of join of relations
- Histogram representation
- Most frequent items in data set
- Clustering of data
10Streaming algorithms
- Perform computation in one (or constant) pass(es) over data using a small amount of storage space
- Availability of sketch function facilitates streaming algorithm
- Additional requirements: sketch should allow
- Update to incorporate new data items
- Combination of sketches for different data sets
[Figure labels: input, storage]
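The two requirements above (update and combination) hold automatically for any sketch that is a linear function of the data. A minimal sketch of that idea, assuming a toy sketch made of k random +1/-1 projections of an item-frequency vector; the class name `LinearSketch` and its methods are hypothetical, chosen only for illustration:

```python
import random


class LinearSketch:
    """Toy linear sketch: k random +1/-1 projections of a frequency vector."""

    def __init__(self, k, seed):
        self.k = k
        self.seed = seed            # sketches must share k and seed to be combined
        self.counters = [0] * k

    def _sign(self, row, item):
        # Pseudo-random +1/-1 matrix entry for (row, item), recomputed on
        # demand so the projection matrix is never stored.
        rng = random.Random(f"{self.seed}:{row}:{item}")
        return 1 if rng.random() < 0.5 else -1

    def update(self, item, delta=1):
        """Incorporate `delta` more copies of `item` (negative delta deletes)."""
        for row in range(self.k):
            self.counters[row] += delta * self._sign(row, item)

    def combine(self, other):
        """Sketch of the union of the two underlying data sets."""
        assert (self.k, self.seed) == (other.k, other.seed)
        merged = LinearSketch(self.k, self.seed)
        merged.counters = [x + y for x, y in zip(self.counters, other.counters)]
        return merged


first, second, whole = LinearSketch(8, 42), LinearSketch(8, 42), LinearSketch(8, 42)
for item in [3, 5, 3]:
    first.update(item)
    whole.update(item)
for item in [5, 9]:
    second.update(item)
    whole.update(item)
combined = first.combine(second)
```

Because every counter is a linear function of the input, `combined.counters` equals the counters of the sketch that saw the whole stream.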
11Goals
- Glimpse of sketching techniques, especially in geometric settings.
- Basic theoretical ideas, no messy details
- Concrete application
12Talk Outline
- Dimension reduction
- Similarity preserving hash functions
- sketching vector norms
- sketching Earth Mover Distance (EMD)
- Application to image retrieval
13Low Distortion Embeddings
- Given metric spaces $(X_1, d_1)$ and $(X_2, d_2)$, an embedding $f: X_1 \to X_2$ has distortion $D$ if the ratio of distances changes by at most a factor of $D$
- Dimension Reduction
- Original space high dimensional
- Make target space be of low dimension, while
maintaining small distortion
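One common way to make "ratio of distances changes by at most D" precise is the following condition. This is the standard textbook formulation, stated here as an assumption about what the slide intends; the scaling factor $r$ is not named in the slides.

```latex
\exists\, r > 0 \ \text{such that}\ \forall x, y \in X_1:\quad
r \cdot d_1(x, y) \;\le\; d_2\bigl(f(x), f(y)\bigr) \;\le\; D \cdot r \cdot d_1(x, y)
```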
14Dimension Reduction in L2
- n points in Euclidean space (L2 norm) can be mapped down to $O((\log n)/\epsilon^2)$ dimensions with distortion at most $1+\epsilon$. [Johnson, Lindenstrauss 84]
- Two interesting properties
- Linear mapping
- Oblivious: choice of linear mapping does not depend on point set
- Quite simple [JL84, FM88, IM98, DG99, Ach01]. Even a random +1/-1 matrix works
- Many applications
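The random +1/-1 construction can be illustrated in a few lines. This is a minimal sketch, assuming the usual 1/sqrt(k) scaling; the dimensions, seed and helper names are hypothetical, and the target dimension is picked by hand rather than from the bound.

```python
import math
import random


def random_sign_matrix(k, d, rng):
    """k x d matrix of independent +1/-1 entries, scaled by 1/sqrt(k)."""
    scale = 1.0 / math.sqrt(k)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Linear map: one inner product per row of the matrix."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(0)
d, k = 500, 200
matrix = random_sign_matrix(k, d, rng)   # oblivious: drawn before seeing any point
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]
sketches = [project(matrix, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [l2(sketches[i], sketches[j]) / l2(points[i], points[j])
          for i in range(4) for j in range(i + 1, 4)]
```

Each entry of `ratios` should be close to 1, with typical deviation on the order of 1/sqrt(k).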
15Dimension reduction for L1
- [C, Sahai 02] Linear embeddings are not good for dimension reduction in L1
- There exist O(n) points in L1 in n dimensions, such that any linear mapping with distortion $\delta$ needs $n/\delta^2$ dimensions
16Dimension reduction for L1
- [C, Brinkman 03] Strong lower bounds for dimension reduction in L1
- There exist n points in L1, such that any embedding with constant distortion $\delta$ needs $n^{1/\delta^2}$ dimensions
- Simpler proof by [Lee, Naor 04]
- Does not rule out other sketching techniques
17Talk Outline
- Dimension reduction
- Similarity preserving hash functions
- sketching vector norms
- sketching Earth Mover Distance (EMD)
- Application to image retrieval
18Similarity Preserving Hash Functions
- Similarity function sim(x,y), distance d(x,y)
- Family of hash functions F with probability distribution such that
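The condition itself did not survive in the transcript. The usual way a similarity preserving hash family is defined, given here as an assumption about what the slide showed, is that the collision probability equals the similarity:

```latex
\Pr_{h \in F}\bigl[\, h(x) = h(y) \,\bigr] \;=\; \mathrm{sim}(x, y)
\qquad \text{for all objects } x, y
```

Under this condition, the fraction of agreeing hash values over many independently drawn $h$ is an unbiased estimate of $\mathrm{sim}(x, y)$.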
19Applications
- Compact representation scheme for estimating similarity
- Approximate nearest neighbor search [Indyk, Motwani 98] [Kushilevitz, Ostrovsky, Rabani 98]
20Relaxations of SPH
- Estimate distance measure, not similarity measure in [0,1].
- Measure $E[f(h(x), h(y))]$.
- Estimator will approximate distance function.
21Sketching Set Similarity: Minwise Independent Permutations
[Broder, Manasse, Glassman, Zweig 97]
[Broder, C, Frieze, Mitzenmacher 98]
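A minimal sketch of the min-wise idea, assuming the similarity being estimated is the Jaccard coefficient |A ∩ B| / |A ∪ B| and using explicit uniformly random permutations of a small universe (practical schemes substitute hash functions). The function names and the example sets are hypothetical.

```python
import random


def minhash_sketch(items, permutations):
    """One coordinate per permutation: the smallest rank among the set's items."""
    return [min(perm[x] for x in items) for perm in permutations]


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of coordinates on which the two sketches agree."""
    agree = sum(1 for u, v in zip(sketch_a, sketch_b) if u == v)
    return agree / len(sketch_a)


rng = random.Random(1)
universe = 150
permutations = []
for _ in range(500):
    perm = list(range(universe))
    rng.shuffle(perm)                 # perm[x] is the rank of element x
    permutations.append(perm)

A = set(range(0, 100))
B = set(range(50, 150))
true_similarity = len(A & B) / len(A | B)
estimate = estimate_similarity(minhash_sketch(A, permutations),
                               minhash_sketch(B, permutations))
```

Under a random permutation, the minimum of A ∪ B is equally likely to be any of its elements, so the two minima coincide with probability exactly |A ∩ B| / |A ∪ B|.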
22Sketching L1
- Design sketch for vectors to estimate L1 norm
- Hash function to distinguish between small and large distances [KOR 98]
- Map L1 to Hamming space
- Bit vectors $a = (a_1, a_2, \ldots, a_n)$ and $b = (b_1, b_2, \ldots, b_n)$
- Distinguish between distances $\le (1-\epsilon)n/k$ versus $\ge (1+\epsilon)n/k$
- XOR random set of k bits
- $\Pr[h(a) = h(b)]$ differs by constant in two cases
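The gap between the two cases can be checked numerically. This is a minimal sketch under one simplifying assumption: the k bit positions are drawn independently with replacement, which makes the collision probability an exact closed form. The parameter values and function names are hypothetical.

```python
import random


def xor_hash(bits, positions):
    """Parity (XOR) of the bits at the sampled positions."""
    parity = 0
    for p in positions:
        parity ^= bits[p]
    return parity


def collision_probability(n, k, hamming_distance):
    """Pr[h(a) = h(b)] for k positions drawn uniformly with replacement:
    the hashes agree iff an even number of draws hit a differing position."""
    return 0.5 * (1.0 + (1.0 - 2.0 * hamming_distance / n) ** k)


def empirical_collision_rate(a, b, k, trials, rng):
    same = 0
    for _ in range(trials):
        positions = [rng.randrange(len(a)) for _ in range(k)]
        same += xor_hash(a, positions) == xor_hash(b, positions)
    return same / trials


rng = random.Random(2)
n, k = 1000, 10                          # n/k = 100, epsilon = 0.5
a = [rng.randint(0, 1) for _ in range(n)]
near = [bit ^ 1 if i < 50 else bit for i, bit in enumerate(a)]    # distance (1-eps)n/k
far = [bit ^ 1 if i < 150 else bit for i, bit in enumerate(a)]    # distance (1+eps)n/k

p_near = collision_probability(n, k, 50)
p_far = collision_probability(n, k, 150)
rate_near = empirical_collision_rate(a, near, k, 4000, rng)
rate_far = empirical_collision_rate(a, far, k, 4000, rng)
```

For these parameters the closed form gives roughly 0.67 for the near pair and 0.51 for the far pair, a constant gap that repeated independent hashes can detect.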
23Sketching L1 via stable distributions
- $a = (a_1, a_2, \ldots, a_n)$ and $b = (b_1, b_2, \ldots, b_n)$
- Sketching L2
- $f(a) = \sum_i a_i X_i$, $f(b) = \sum_i b_i X_i$, with $X_i$ independent Gaussian
- $f(a) - f(b)$ has Gaussian distribution scaled by $\|a-b\|_2$
- Form many coordinates, estimate $\|a-b\|_2$ by taking L2 norm
- Sketching L1
- $f(a) = \sum_i a_i X_i$, $f(b) = \sum_i b_i X_i$, with $X_i$ independent Cauchy distributed
- $f(a) - f(b)$ has Cauchy distribution scaled by $\|a-b\|_1$
- Form many coordinates, estimate $\|a-b\|_1$ by taking median [Indyk 00] (streaming applications)
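The L1 case can be sketched as follows. This is a minimal illustration, assuming the median is taken over absolute values of the coordinate differences (the median of the absolute value of a standard Cauchy variable is 1); the sizes, seeds and function names are hypothetical.

```python
import math
import random
import statistics


def cauchy(rng):
    """Standard Cauchy sample via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def make_sketcher(n, m, seed):
    """Return f mapping an n-vector to m coordinates sum_i a_i * X_i."""
    rng = random.Random(seed)
    rows = [[cauchy(rng) for _ in range(n)] for _ in range(m)]

    def sketch(vec):
        return [sum(x * v for x, v in zip(row, vec)) for row in rows]

    return sketch


def estimate_l1(sketch_a, sketch_b):
    # Each difference is Cauchy scaled by ||a - b||_1, so the median of the
    # absolute differences estimates that scale.
    return statistics.median(abs(u - v) for u, v in zip(sketch_a, sketch_b))


rng = random.Random(3)
n, m = 20, 2001
sketch = make_sketcher(n, m, seed=7)
a = [rng.randint(-5, 5) for _ in range(n)]
b = [rng.randint(-5, 5) for _ in range(n)]
true_l1 = sum(abs(x - y) for x, y in zip(a, b))
estimate = estimate_l1(sketch(a), sketch(b))
```

The median is used because a Cauchy variable has no finite mean or variance, so averaging the coordinates would not concentrate.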
24Earth Mover Distance (EMD)
[Figure: point sets P and Q; EMD(P,Q)]
25Bipartite/Bichromatic matching
- Minimum cost matching between two sets of points.
- Point weights correspond to multiple copies of points
Fast estimation of bipartite matching [Agarwal, Varadarajan 04]
Goal: Sketch point set to enable estimation of min cost matching
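For reference, the quantity being sketched can be computed exactly on tiny inputs by exhaustive search. This is only a baseline for checking estimates, assuming equal-size point sets and an L1 ground distance; the function names are hypothetical and the running time is factorial in the set size.

```python
from itertools import permutations


def l1_dist(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def min_cost_matching(P, Q, dist=l1_dist):
    """Cost of the cheapest perfect matching between P and Q (brute force)."""
    assert len(P) == len(Q)
    return min(
        sum(dist(p, Q[j]) for p, j in zip(P, perm))
        for perm in permutations(range(len(Q)))
    )


# Two unit-weight points on each side.
cost_simple = min_cost_matching([(0, 0), (2, 0)], [(2, 1), (0, 1)])

# A point of weight 2 is represented as two copies of the same point.
P = [(0, 0), (0, 0), (5, 0)]
Q = [(0, 1), (5, 0), (5, 0)]
cost_weighted = min_cost_matching(P, Q)
```

In the weighted example one copy of the origin must travel to (5, 0), which dominates the cost.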
26Detour: Classification with pairwise relationships [Kleinberg, Tardos 99]
[Figure label: edge weights $w_e$]
27LP Relaxation and Rounding
[Kleinberg, Tardos 99]
[Chekuri, Khanna, Naor, Zosin 01]
28Approximating metrics by trees
29EMD on trees: embedding into L1
suggested by Piotr Indyk
[Figure label: $|w_T(P) - w_T(Q)|$]
$\mathrm{EMD}(P,Q) = \sum_T l_T \, |w_T(P) - w_T(Q)|$
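The sum over subtrees can be evaluated directly. This is a minimal sketch, assuming the reading that T ranges over subtrees rooted at non-root vertices, l_T is the length of the edge joining T to its parent, and w_T(P) is the number of points of P inside T. The tree encoding (a `parent` array and an `edge_length` array) is hypothetical.

```python
def tree_emd(parent, edge_length, P, Q):
    """EMD between equal-size vertex multisets P and Q on a weighted tree.

    parent[v] is v's parent (None for the root); edge_length[v] is the length
    of the edge from v to parent[v]. Repeat a vertex to give it more weight.
    """
    n = len(parent)
    diff = [0] * n                      # becomes w_T(P) - w_T(Q) for T rooted at v
    for v in P:
        diff[v] += 1
    for v in Q:
        diff[v] -= 1

    def depth(v):
        steps = 0
        while parent[v] is not None:
            v = parent[v]
            steps += 1
        return steps

    total = 0.0
    # Visit deepest vertices first so each subtree total is complete when used.
    for v in sorted(range(n), key=depth, reverse=True):
        if parent[v] is not None:
            total += edge_length[v] * abs(diff[v])
            diff[parent[v]] += diff[v]
    return total


# Root 0 with children 1 and 2; vertex 1 has children 3 and 4.
parent = [None, 0, 0, 1, 1]
edge_length = [0, 1, 1, 2, 2]
emd = tree_emd(parent, edge_length, P=[3, 4], Q=[2, 3])
```

In the example the point at vertex 3 stays put and the point at vertex 4 travels along 4, 1, 0, 2 for a cost of 2 + 1 + 1 = 4. Each term of the sum is one coordinate of an L1 vector, which is what makes this an embedding into L1.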
30EMD on general metrics
- Approximate metric by probability distribution on trees
- Sample tree from distribution and compute L1 representation
- $\mathrm{EMD}(P,Q) \le E[d(v(P), v(Q))] \le O(\log n) \cdot \mathrm{EMD}(P,Q)$
31Tree approximations for Euclidean points
distortion $O(d \log \Delta)$ [Bartal 96, CCGGP 98]
proposed by [Indyk, Thaper 03] for estimating EMD
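For points on a grid, a tree of this kind can be obtained from randomly shifted nested grids. The following is a rough sketch in the spirit of that construction, not a faithful reproduction of the cited algorithm: it assumes 2-D integer points in [0, Δ), cells of side 1, 2, 4, ..., Δ, and one coordinate per cell equal to the cell side times the number of points in it. All names are hypothetical.

```python
import random


def grid_embedding(points, delta, shift):
    """Sparse L1 vector: for each level, each shifted grid cell maps to
    (cell side) * (number of points in the cell)."""
    vec = {}
    side, level = 1, 0
    while side <= delta:
        for x, y in points:
            cell = (level, (x + shift[0]) // side, (y + shift[1]) // side)
            vec[cell] = vec.get(cell, 0) + side
        side *= 2
        level += 1
    return vec


def l1_distance(u, v):
    return sum(abs(u.get(c, 0) - v.get(c, 0)) for c in set(u) | set(v))


rng = random.Random(5)
delta = 16
shift = (rng.randrange(delta), rng.randrange(delta))   # one random shift for all levels

P = [(1, 1), (10, 3)]
Q = [(2, 1), (10, 3)]
same = l1_distance(grid_embedding(P, delta, shift), grid_embedding(P, delta, shift))
moved = l1_distance(grid_embedding(P, delta, shift), grid_embedding(Q, delta, shift))
```

A single shift gives a noisy value; the distortion guarantee is a statement about the expectation over random shifts.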
32Talk Outline
- Dimension reduction
- Similarity preserving hash functions
- sketching vector norms
- sketching Earth Mover Distance (EMD)
- Application to image retrieval
33Motivation
- Apply sketching techniques in complex setting
- Compact data structures for high-quality and efficient image retrieval ?
- Evaluate effectiveness of sketching techniques
- [Lv, C, Li 04]
34Region Based Image Retrieval (RBIR)
- Region representation
- color (histogram, moments, Fourier coefficients)
- position, shape
- Region based image similarity measure
- Independent best match [Blobworld, NETRA]
- each region in one matched to best region in other
- One-to-one match [Windsurf, WALRUS]
- one-to-one matching between two sets of regions
- EMD match
35Overview
36Region Representation
- Color moments
- First three moments in HSV color space
- → 9-D vector
- Bounding box
- Aspect ratio
- Bounding box size
- Area ratio
- Region centroid
- → 5-D vector
- Weighted L1 distance
37Addressing problems with EMD match
- Region weights proportional to region size?
- Big background has disproportionate effect
- Ground distance = region distance?
- Pair of different regions can have large effect
- distance meaningless beyond certain point
- EMD-match
- Region weights: square root of region size
- Ground distance: thresholded region distance
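The two EMD-match adjustments can be written down directly. This is a minimal sketch, assuming the weights are normalized to sum to 1 and the region distance is a weighted L1 distance; the normalization, the threshold value and the function names are assumptions, not taken from the slides.

```python
import math


def region_weights(region_sizes):
    """Weights proportional to sqrt(region size), normalized to sum to 1."""
    roots = [math.sqrt(size) for size in region_sizes]
    total = sum(roots)
    return [r / total for r in roots]


def thresholded_distance(u, v, coordinate_weights, threshold):
    """Weighted L1 distance between two region vectors, capped at `threshold`."""
    d = sum(w * abs(a - b) for w, a, b in zip(coordinate_weights, u, v))
    return min(d, threshold)


# A background 9 times larger than the object gets 3 times its weight, not 9.
weights = region_weights([9, 1])
capped = thresholded_distance([0, 0], [3, 4], [1, 1], threshold=5)
```

The square root dampens the influence of a large background, and the cap keeps one very dissimilar pair of regions from dominating the match cost.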
38Compact region representation
- 14-D region vectors → N×K bit vectors
- Hamming distance ≈ (weighted) L1 distance
- XOR groups of K bits → N bit vector
- Hamming distance ≈ thresholded L1 distance
39Thresholding distance by XORing bits
Number of bits XORed controls the shape of the flattened curve
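A short calculation shows where the flattening comes from. Assume, for illustration, that each of the $K$ bits in a group differs between two regions independently with probability $p$ (with $p$ proportional to their L1 distance). The XOR of the group then differs exactly when an odd number of the bits differ:

```latex
\Pr[\text{XOR of } K \text{ bits differs}]
  \;=\; \sum_{j \text{ odd}} \binom{K}{j} p^{j} (1-p)^{K-j}
  \;=\; \frac{1 - (1 - 2p)^{K}}{2}
```

For small $p$ this is approximately $Kp$, so it tracks the distance linearly. As $p$ grows it saturates at $1/2$, and a larger $K$ makes the curve rise faster and flatten sooner.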
40EMD embedding: combining region vectors
- Pick random pattern (bit positions and bit values)
- Add region weights matching pattern
- M such patterns give M coordinates of image vector
- Related to [Indyk, Thaper 04]
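The three steps above can be sketched as follows. This is an illustrative reading of the slide, assuming a pattern is a list of (bit position, required bit value) pairs and that a region "matches" when all of its bits agree with the pattern; the function names and the toy data are hypothetical.

```python
import random


def random_pattern(num_bits, pattern_size, rng):
    """Pick `pattern_size` distinct bit positions and a required value for each."""
    positions = rng.sample(range(num_bits), pattern_size)
    return [(pos, rng.randint(0, 1)) for pos in positions]


def matches(bits, pattern):
    return all(bits[pos] == value for pos, value in pattern)


def image_vector(regions, patterns):
    """regions: list of (bit_vector, weight) pairs. One coordinate per
    pattern: the total weight of the regions matching that pattern."""
    return [sum(w for bits, w in regions if matches(bits, pat)) for pat in patterns]


# Hand-written patterns make the result easy to follow.
regions = [([1, 0], 5), ([0, 1], 3), ([1, 1], 2)]
patterns = [[(0, 1)], [(0, 0), (1, 1)]]
vector = image_vector(regions, patterns)     # M = 2 coordinates

rng = random.Random(4)
random_patterns = [random_pattern(2, 1, rng) for _ in range(6)]
random_vector = image_vector(regions, random_patterns)
```

In the hand-written example the first pattern (bit 0 equals 1) collects the regions of weight 5 and 2, and the second collects only the region of weight 3.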
41Evaluation Criteria
- Effectiveness of EMD match
- Compactness of data structures without compromising quality
- Efficiency and effectiveness of embedding and filtering algorithm
42Evaluation Methodology
- 10,000 images
- 32 queries with similar images identified: http://dbvis.inf.uni-konstanz.de/research/projects/SimSearch/effpics.html
- Segmentation via JSEG: avg. regions 7.16, min 1, max 57
- Effectiveness measured by average precision: average of precision values at positions of k target images
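The average precision measure can be computed as below. This is a minimal sketch, assuming that target images missing from the ranked list contribute a precision of 0; that convention and the function name are assumptions.

```python
def average_precision(ranked_ids, target_ids):
    """Mean of the precision values at the rank of each target image."""
    targets = set(target_ids)
    if not targets:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, image_id in enumerate(ranked_ids, start=1):
        if image_id in targets:
            hits += 1
            precision_sum += hits / position   # precision at this target's rank
    return precision_sum / len(targets)


# Targets found at ranks 1 and 3: precisions are 1/1 and 2/3.
score = average_precision(["a", "b", "c"], ["a", "c"])
```

A perfect ranking, with all k targets in the first k positions, scores 1.0.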
44Effectiveness of EMD Match
SIMPLIcity avg. precision 0.331
45Region representation size
46Effect of region bit vector size
47Effect of number of random patterns (raw image vector size)
48Search quality: Effect of image bit vector size and filtering
49Average query time: Effect of image bit vector size and filtering
50Conclusions
- Compact representations at the heart of several algorithmic techniques for large data sets
- Compact representations tailored to applications
- Effective for region based image retrieval
- Many interesting research questions
- sketching EMD over points in $\mathbb{R}^2$
- upper bounds for dimension reduction in L1