Title: Multimedia DBs
1Multimedia DBs
2Multimedia dbs
- A multimedia database stores text, strings and images
- Similarity queries (content-based retrieval)
- Given an image, find the images in the database that are similar (or you can describe the query image)
- Extract features, index in feature space, answer similarity queries using GEMINI
- Again, average values help!
3Image Features
- Features extracted from an image are based on
- Color distribution
- Shapes and structure
- ..
4Images - color
What is an image? A: a 2-d RGB array
5Images - color
Color histograms, and distance function
6Images - color
Mathematically, the distance function between a vector x and a query q is
$D(x, q) = (x - q)^T A (x - q) = \sum_{i,j} a_{ij} (x_i - q_i)(x_j - q_j)$
$A = I$ ?
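A minimal sketch of the quadratic-form distance above, to make the role of the matrix A concrete. The 3-bin histograms and the entries of A are made-up illustration values, not taken from the slides; the off-diagonal entry stands for two color bins assumed to be perceptually close.

```python
import numpy as np

def quadratic_form_distance(x, q, A):
    """D(x, q) = (x - q)^T A (x - q) for histograms x, q and similarity matrix A."""
    diff = np.asarray(x, dtype=float) - np.asarray(q, dtype=float)
    return float(diff @ A @ diff)

# Hypothetical 3-bin color histograms (fraction of pixels per bin).
x = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# Assumed similarity matrix: bins 0 and 1 are treated as similar colors (a_01 = 0.5).
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

d_cross = quadratic_form_distance(x, q, A)          # with cross-talk terms
d_plain = quadratic_form_distance(x, q, np.eye(3))  # A = I: squared Euclidean distance
```

With A = I the cross terms vanish and the distance reduces to the squared Euclidean distance; the off-diagonal entries are what make bins interact.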
7Images - color
- Problem: "cross-talk"
- Features are not orthogonal ->
- SAMs will not work properly
- Q: what to do?
- A: it is a feature-extraction question
8Images - color
- Possible answers:
- avg red, avg green, avg blue
- it turns out that this lower-bounds the histogram distance ->
- no cross-talk
- SAMs are applicable
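The filter-and-refine idea behind the avg-RGB answer can be sketched as follows. This is an illustration under a stated substitution: the expensive distance here is plain pixel-wise Euclidean distance, not the histogram distance, because the slides do not give the scaling constant for the histogram bound. For the pixel-wise case the scaled avg-RGB distance is a lower bound by the Cauchy-Schwarz inequality, so the filter produces no false dismissals. All function names are hypothetical.

```python
import numpy as np

def avg_rgb(img):
    """Average red, green, blue of an (H, W, 3) image: a 3-d feature vector."""
    return img.reshape(-1, 3).mean(axis=0)

def full_distance(a, b):
    """Stand-in for the expensive distance: Euclidean over all pixel values."""
    return float(np.linalg.norm(a.astype(float) - b.astype(float)))

def feature_distance(fa, fb, n_pixels):
    """Scaled avg-RGB distance; never exceeds full_distance (Cauchy-Schwarz)."""
    return float(np.sqrt(n_pixels) * np.linalg.norm(fa - fb))

def range_query(images, query, eps):
    """Filter cheaply on avg RGB, then verify survivors with the full distance."""
    n = query.shape[0] * query.shape[1]
    fq = avg_rgb(query)
    candidates = [i for i, img in enumerate(images)
                  if feature_distance(avg_rgb(img), fq, n) <= eps]
    return [i for i in candidates if full_distance(images[i], query) <= eps]
```

In a real system the 3-d feature vectors would be precomputed and stored in a spatial access method rather than scanned in a list comprehension.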
9Images - color
[Figure: performance. Axes: time, selectivity. Curves: seq scan, w/ avg RGB]
11Images - shapes
- distance function: Euclidean, on the area, perimeter, and 20 "moments"
- (Q: how to normalize them?
- A: divide by standard deviation)
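The normalization answer can be sketched as below. The feature values are invented for illustration (area, perimeter, and a single moment rather than 20), and the function names are hypothetical.

```python
import numpy as np

def normalize_features(F):
    """Divide each feature column by its standard deviation over the collection."""
    F = np.asarray(F, dtype=float)
    std = F.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unchanged instead of dividing by zero
    return F / std, std

def shape_distance(z1, z2):
    """Euclidean distance between two normalized feature vectors."""
    return float(np.linalg.norm(z1 - z2))

# Hypothetical shape features: columns are area, perimeter, one moment.
F = np.array([[1200.0, 140.0, 0.31],
              [ 800.0, 120.0, 0.35],
              [1500.0, 160.0, 0.28]])
Z, std = normalize_features(F)
d01 = shape_distance(Z[0], Z[1])
```

Without this step the area column would dominate the Euclidean distance simply because its values are numerically larger. A query vector has to be divided by the same `std` before it is compared.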
13Images - shapes
- distance function: Euclidean, on the area, perimeter, and 20 "moments"
- (Q: other features / distance functions?
- A1: turning angle
- A2: dilations/erosions
- A3: ... )
15Images - shapes
- distance function: Euclidean, on the area, perimeter, and 20 "moments"
- Q: how to do dim. reduction?
- A: Karhunen-Loeve (= centered PCA/SVD)
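A minimal sketch of the Karhunen-Loeve answer, computed as centered PCA through the SVD. The function names and the choice of `k` are assumptions for illustration.

```python
import numpy as np

def kl_transform(X, k):
    """Project the rows of X onto the top-k principal directions (centered PCA via SVD)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean  # centering is what makes this the KL transform
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, mean, components

def project_query(q, mean, components):
    """Map a new query into the same k-dimensional space."""
    return (np.asarray(q, dtype=float) - mean) @ components.T
```

Because this is an orthogonal projection, distances in the reduced space never exceed the original Euclidean distances, which is the lower-bounding property a GEMINI-style filter needs.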
16Images - shapes
[Figure: log(# of I/Os) vs. # of features kept, with the "all kept" case marked]
17Dimensionality Reduction
- Many problems (like time-series and image similarity) can be expressed as proximity problems in a high-dimensional space
- Given a query point, we try to find the points that are close
- But in high-dimensional spaces things are different!
18Effects of High-dimensionality
- Assume a uniformly distributed set of points in high dimensions, $[0,1]^d$
- Let's have a query with length 0.1 in each dimension -> query selectivity in 100-d: $10^{-100}$
- If we want constant selectivity (0.1), the length of the side must be ~1 ($0.1^{1/100} \approx 0.977$)!
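The arithmetic behind these two bullets, as a small sketch (the function names are hypothetical):

```python
def selectivity(side, d):
    """Fraction of the unit hypercube [0,1]^d covered by a cube query of the given side."""
    return side ** d

def side_for_selectivity(sel, d):
    """Side length a cube query needs in order to cover a fraction `sel` of [0,1]^d."""
    return sel ** (1.0 / d)

sel_100d = selectivity(0.1, 100)            # about 1e-100
side_100d = side_for_selectivity(0.1, 100)  # about 0.977
```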
19Effects of High-dimensionality
- Surface is everything!
- Probability that a point is closer than 0.1 to a (d-1)-dimensional surface: $1 - 0.8^d$
- d = 2: 0.36
- d = 10: ~0.89
- d = 100: ~1
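These probabilities follow from the fact that a uniformly distributed point avoids the 0.1-wide border only if every coordinate falls in [0.1, 0.9]. A short sketch, with a hypothetical function name:

```python
def prob_near_surface(d, margin=0.1):
    """Probability that a uniform point in [0,1]^d lies within `margin` of some face."""
    return 1.0 - (1.0 - 2.0 * margin) ** d

p2, p10, p100 = (prob_near_surface(d) for d in (2, 10, 100))
```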
20Effects of High-dimensionality
- Number of grid cells and surfaces
- Number of k-dimensional surfaces in a d-dimensional hypercube
- Binary partitioning -> $2^d$ cells
- Indexing in high dimensions is extremely difficult: the "curse of dimensionality"
21Dimensionality Reduction
- The main idea: reduce the dimensionality of the space.
- Project the d-dimensional points into a k-dimensional space so that
- $k \ll d$
- distances are preserved as well as possible
- Solve the problem in low dimensions
- (the GEMINI idea, of course)
22DR requirements
- The ideal mapping should:
- Be fast to compute: O(N) or O(N log N), but not $O(N^2)$
- Preserve distances, leading to small discrepancies
- Provide a fast algorithm to map a new query (why?)
23MDS (multidimensional scaling)
- Input: a set of N items, the pair-wise (dis)similarities and the dimensionality k
- Optimization criterion:
- $\text{stress} = \left( \sum_{i,j} \left( D(S_i, S_j) - D(S_{ki}, S_{kj}) \right)^2 / \sum_{i,j} D(S_i, S_j)^2 \right)^{1/2}$
- where $D(S_i, S_j)$ is the distance between time series $S_i$, $S_j$, and $D(S_{ki}, S_{kj})$ is the Euclidean distance of the k-dim representations
- Steepest descent algorithm:
- start with an assignment (time series to k-dim point)
- minimize stress by moving points
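A rough sketch of the stress criterion and one steepest-descent move. It assumes the input is a symmetric N x N distance matrix `D` with a zero diagonal and that no two embedded points coincide. The step follows the gradient of the squared-error numerator, which has the same minimizers as the stress because the denominator is constant. The learning rate and function names are illustrative choices, not part of the original method description.

```python
import numpy as np

def pairwise_distances(Y):
    """Euclidean distance between every pair of rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.linalg.norm(diff, axis=2)

def stress(D, Y):
    """sqrt( sum_ij (D_ij - d_ij)^2 / sum_ij D_ij^2 ), d = distances in the embedding."""
    d = pairwise_distances(Y)
    return float(np.sqrt(((D - d) ** 2).sum() / (D ** 2).sum()))

def descent_step(D, Y, lr=0.01):
    """Move every point one step against the gradient of sum_ij (d_ij - D_ij)^2."""
    diff = Y[:, None, :] - Y[None, :, :]
    d = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(d, 1.0)  # diagonal diffs are zero, so this only avoids 0/0
    coeff = (d - D) / d
    grad = 4.0 * (coeff[:, :, None] * diff).sum(axis=1)
    return Y - lr * grad
```

A full run would start from a random assignment and repeat `descent_step` until the stress stops improving. Each step touches all N^2 pairs.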
24MDS
- Disadvantages
- Running time is $O(N^2)$, because of slow convergence
- Also, it requires O(N) time to insert a new point: not practical for queries
25FastMap [Faloutsos and Lin, 1995]
- Maps objects to k-dimensional points so that distances are preserved well
- It is an approximation of Multidimensional Scaling
- Works even when only distances are known
- Is efficient, and allows efficient query transformation
26FastMap
- Find two objects that are far away
- Project all points on the line the two objects
define, to get the first coordinate
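The two steps above can be sketched using only a distance matrix. The projection uses the cosine law, and the pivot choice is a simple "farthest from farthest" heuristic. Function names and the starting object are assumptions made for illustration.

```python
import numpy as np

def choose_pivots(D, start=0):
    """Heuristic: take the object farthest from `start`, then the one farthest from it."""
    a = int(np.argmax(D[start]))
    b = int(np.argmax(D[a]))
    return a, b

def first_coordinate(D, a, b):
    """Cosine-law projection of every object onto the line through pivots a and b."""
    dab = D[a, b]
    return (D[a] ** 2 + dab ** 2 - D[b] ** 2) / (2.0 * dab)

# Toy example: four objects that actually lie on a line at positions 0, 1, 3, 7.
pos = np.array([0.0, 1.0, 3.0, 7.0])
D = np.abs(pos[:, None] - pos[None, :])
a, b = choose_pivots(D)
x1 = first_coordinate(D, a, b)
```

Only distances to the two pivots are needed, so a new query object can be mapped with two distance computations per coordinate.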
27FastMap - next iteration
28Results
Documents / cosine similarity -> Euclidean distance (how?)
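One plausible answer to the "(how?)", offered as an assumption rather than as what the slides intended: if document vectors are scaled to unit length, then $\|x - y\|^2 = 2 - 2\cos(x, y)$, so a cosine similarity converts directly into a Euclidean distance. A small sketch with hypothetical term-count vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean_from_cosine(cos_sim):
    """Euclidean distance between the unit-length versions of the two vectors."""
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * cos_sim)))

# Hypothetical term-count vectors for two documents.
x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 1.0, 1.0])
d = euclidean_from_cosine(cosine_similarity(x, y))
```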