Title: Multimedia DBs
1. Multimedia DBs
2. Multimedia DBs
- A multimedia database stores text, strings, and images
- Similarity queries (content-based retrieval)
  - Given a query image, find the images in the database that are similar (or describe the query image)
- Extract features, index them in feature space, and answer similarity queries using GEMINI
  - Again, average values help!
- (Used in QBIC, IBM Almaden)
3. Image Features
- Features extracted from an image are based on
  - Color distribution
  - Shapes and structure
  - ...
4. Images - color
- What is an image? A 2-d array of RGB values
5. Images - color
- Color histograms, and a distance function between them
6. Images - color
- Mathematically, the distance function between a vector x and a query q is
  D(x, q) = (x - q)^T A (x - q) = Σ_{i,j} a_ij (x_i - q_i)(x_j - q_j)
- A = I ?
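As a concrete illustration (not part of the original slides), here is a minimal sketch of the quadratic-form distance above; the 3-bin histograms and the similarity matrix A are made-up values.

```python
import numpy as np

def histogram_distance(x, q, A):
    """Quadratic-form distance D(x, q) = (x - q)^T A (x - q)."""
    diff = x - q
    return float(diff @ A @ diff)

# Toy 3-bin color histograms and a made-up color-similarity matrix A;
# A = I would reduce this to plain (squared) Euclidean distance.
x = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
A = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
print(histogram_distance(x, q, A))
```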
7. Images - color
- Problem: cross-talk
  - Features are not orthogonal ->
  - SAMs (spatial access methods) will not work properly
- Q: what to do?
- A: it is a feature-extraction question
8. Images - color
- Possible answers:
  - avg red, avg green, avg blue
- It turns out that this lower-bounds the histogram distance ->
  - no cross-talk
  - SAMs are applicable (see the filter-and-refine sketch below)
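A minimal sketch of the resulting GEMINI-style filter-and-refine search. For simplicity it uses a pixel-wise Euclidean distance as a stand-in for the expensive histogram distance; the key point is only that the cheap avg-RGB distance lower-bounds the expensive one, so the filter never discards a true answer. All data and function names are illustrative.

```python
import numpy as np

def avg_rgb(img):
    """Average R, G, B over all pixels -> a 3-d feature vector."""
    return np.asarray(img, dtype=float).reshape(-1, 3).mean(axis=0)

def avg_rgb_dist(a, b):
    """Cheap filter distance on the 3-d averages; it lower-bounds full_dist."""
    return float(np.linalg.norm(avg_rgb(a) - avg_rgb(b)))

def full_dist(a, b):
    """Expensive distance (here: pixel-wise Euclidean, standing in for the
    quadratic-form histogram distance of the previous slides)."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def range_query(query, database, threshold):
    """GEMINI-style filter-and-refine: because the filter distance never
    exceeds the full distance, no true answer is discarded."""
    hits = []
    for i, img in enumerate(database):
        if avg_rgb_dist(query, img) <= threshold:    # filter step (cheap, indexable)
            if full_dist(query, img) <= threshold:   # refinement step (expensive)
                hits.append(i)
    return hits

# Tiny synthetic "images": 2x2 RGB arrays with made-up values.
rng = np.random.default_rng(0)
db = [rng.random((2, 2, 3)) for _ in range(5)]
print(range_query(db[0], db, threshold=0.5))
```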
9. Images - color
[Plot: response time vs. selectivity, comparing sequential scan with the avg-RGB method]
10. Images - shapes
- Distance function: Euclidean, on the area, perimeter, and 20 moments
- (Q: how to normalize them?
11. Images - shapes
- Distance function: Euclidean, on the area, perimeter, and 20 moments
- (Q: how to normalize them?
- A: divide by the standard deviation)
12. Images - shapes
- Distance function: Euclidean, on the area, perimeter, and 20 moments
- (Q: other features / distance functions?
13. Images - shapes
- Distance function: Euclidean, on the area, perimeter, and 20 moments
- (Q: other features / distance functions?
  - A1: turning angle
  - A2: dilations/erosions
  - A3: ...)
14. Images - shapes
- Distance function: Euclidean, on the area, perimeter, and 20 moments
- Q: how to do dim. reduction?
15. Images - shapes
- Distance function: Euclidean, on the area, perimeter, and 20 moments
- Q: how to do dim. reduction?
- A: Karhunen-Loève (= centered PCA/SVD); see the sketch below
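A minimal sketch of the two answers above: divide each shape feature by its standard deviation, then apply Karhunen-Loève (centered PCA via SVD) and keep the first few components. The 22-column feature matrix is random placeholder data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 22))        # 1000 shapes x (area, perimeter, 20 moments); placeholder data

# 1. Normalize: center each feature and divide by its standard deviation.
Xc = X - X.mean(axis=0)
Xn = Xc / Xc.std(axis=0)

# 2. Karhunen-Loève / PCA via SVD: keep the top-k principal components.
k = 4
U, S, Vt = np.linalg.svd(Xn, full_matrices=False)
X_reduced = Xn @ Vt[:k].T         # 1000 x k representation to index with a SAM

print(X_reduced.shape)
```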
16. Images - shapes
[Plot: log(# of I/Os) vs. number of features kept; the "all kept" case shown for reference]
17. Dimensionality Reduction
- Many problems (like time-series and image similarity) can be expressed as proximity problems in a high-dimensional space
- Given a query point, we try to find the points that are close
- But in high-dimensional spaces things are different!
18. Effects of High-dimensionality
- Assume a uniformly distributed set of points in high dimensions, [0,1]^d
- Take a query with side length 0.1 in each dimension -> query selectivity in 100-d is 0.1^100 = 10^-100
- If we want constant selectivity (0.1), the side length must be 0.1^(1/100) ≈ 0.98, i.e., almost the entire data space! (checked numerically below)
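A quick numeric check of these two claims (illustrative only):

```python
d = 100
print(0.1 ** d)            # selectivity of a query cube with side 0.1 in 100-d: 1e-100
print(0.1 ** (1 / d))      # side length needed for 10% selectivity: ~0.977
```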
19. Effects of High-dimensionality
- Surface is everything!
- Probability that a uniform point is closer than 0.1 to the (d-1)-dimensional surface of the cube is 1 - 0.8^d:
  - d = 2: 0.36
  - d = 10: ≈ 0.9
  - d = 100: ≈ 1
20. Effects of High-dimensionality
- Number of grid cells and surfaces
  - Number of k-dimensional surfaces in a d-dimensional hypercube: C(d, k) * 2^(d-k)
  - Binary partitioning -> 2^d cells
- Indexing in high dimensions is extremely difficult: the "curse of dimensionality"
21. X-tree
- Performance is impacted by the amount of overlap between index nodes
  - Need to follow different paths
  - Overlap, multi-overlap, weighted overlap
- R-tree-like behavior when overlap is small
- Sequential access when overlap is large
- When an overflow occurs (see the sketch below):
  - Split into two nodes if overlap is small
  - Otherwise create a super-node with twice the capacity
- Tradeoffs are made locally over different regions of the data space
- No performance comparisons with linear scan!
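A rough sketch of the overflow decision described above. The overlap measure, the 20% threshold, the Node layout, and the split_fn placeholder are simplifications of mine, not the published X-tree algorithm.

```python
from dataclasses import dataclass, field

MAX_OVERLAP = 0.2   # illustrative threshold; the real X-tree derives its own

@dataclass
class Node:                              # hypothetical, minimal index node
    mbr: tuple                           # (lows, highs), one value per dimension
    entries: list = field(default_factory=list)
    capacity: int = 4

def volume(lows, highs):
    v = 1.0
    for lo, hi in zip(lows, highs):
        v *= max(0.0, hi - lo)
    return v

def overlap_fraction(a, b):
    """Shared volume of two MBRs, as a fraction of their union's volume."""
    lows = tuple(max(x, y) for x, y in zip(a[0], b[0]))
    highs = tuple(min(x, y) for x, y in zip(a[1], b[1]))
    inter = volume(lows, highs)
    union = volume(*a) + volume(*b) - inter
    return inter / union if union > 0 else 0.0

def handle_overflow(node, split_fn):
    """Split if the two resulting MBRs overlap little; otherwise keep one
    super-node with twice the capacity (the decision is made locally)."""
    left, right = split_fn(node)                 # split_fn: an R*-tree-style split routine
    if overlap_fraction(left.mbr, right.mbr) <= MAX_OVERLAP:
        return [left, right]                     # R-tree-like split
    node.capacity *= 2                           # super-node
    return [node]
```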
22. Pyramid Tree
- Designed for range queries
- Map each d-dimensional point to a 1-d value
- Build a B-tree on the 1-d values
- A range query is transformed into a set of 1-d ranges
- More efficient than the X-tree, Hilbert order, and sequential scan
23. Pyramid transformation
- 2d pyramids, with their tops at the center of the data space
- Points in different pyramids are ordered by pyramid id
- Points within a pyramid are ordered by height
- value(v) = pyramid(v) + height(v) (see the sketch below)
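A minimal sketch of this mapping, following my reading of the Pyramid-Technique (Berchtold et al., 1998): the pyramid of a point is picked by the dimension that deviates most from the center, and the height is that deviation; tie-breaking and the query transformation are omitted.

```python
def pyramid_value(v):
    """Map a point v in [0,1]^d to a 1-d value: pyramid id + height."""
    d = len(v)
    # The dimension with the largest deviation from the center 0.5 picks the pyramid.
    j = max(range(d), key=lambda k: abs(v[k] - 0.5))
    pyramid = j if v[j] < 0.5 else j + d    # 2d pyramids in total
    height = abs(v[j] - 0.5)                # in [0, 0.5]
    return pyramid + height                 # integer part orders pyramids, fraction orders by height

print(pyramid_value([0.2, 0.9, 0.5]))       # falls in pyramid 1 + d = 4 (d = 3), height 0.4 -> 4.4
```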
24. Vector Approximation (VA) file
- Tile the d-dimensional data space uniformly
  - A fixed number of bits in each dimension (e.g., 8)
  - 256 partitions along each dimension
  - 256^d tiles
- Approximate each point by its corresponding tile
  - Size of an approximation: 8d bits = d bytes
  - Size of each point: 4d bytes (assuming a word per dimension)
- 2-step approach, the first step using the VA file (see the sketch below)
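A minimal sketch of building the approximations, assuming the data are already scaled to [0,1) and using the uniform 8-bit grid from the slide:

```python
import numpy as np

BITS = 8
CELLS = 2 ** BITS                     # 256 partitions along each dimension

def approximate(points):
    """Quantize points in [0,1)^d to their tile indices: d bytes per point."""
    cells = np.minimum((np.asarray(points) * CELLS).astype(int), CELLS - 1)
    return cells.astype(np.uint8)

rng = np.random.default_rng(1)
data = rng.random((5, 4))             # 5 points in 4-d, placeholder data
print(approximate(data))              # the VA file: 4 bytes per point vs. 16 bytes for the floats
```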
25. Simple NN searching
- d = distance to the kth NN so far
- For each approximation a_i:
  - If lb(q, a_i) < d then
    - Compute r = distance(q, v_i)
    - If r < d then
      - Add point i to the set of NNs
      - Update d
- Performance depends on the ordering of the vectors and their approximations (a runnable sketch follows)
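A runnable version of this scan (generalized to k neighbors); the lower bound lb(q, a) is taken as the distance from q to the tile encoded by a, and all data are placeholders.

```python
import heapq
import numpy as np

BITS, CELLS = 8, 256

def approximate(points):
    return np.minimum((np.asarray(points) * CELLS).astype(int), CELLS - 1)

def lb(q, cell):
    """Lower bound: distance from q to the tile encoded by `cell`."""
    lo, hi = cell / CELLS, (cell + 1) / CELLS
    gap = np.maximum(lo - q, 0) + np.maximum(q - hi, 0)
    return float(np.linalg.norm(gap))

def simple_nn(q, vectors, approx, k=1):
    """Sequential scan over the approximations; the exact distance is computed
    only when the lower bound beats the current kth-NN distance d."""
    best = []                                # max-heap of (-distance, index)
    d = float("inf")
    for i, a in enumerate(approx):
        if lb(q, a) < d:
            r = float(np.linalg.norm(q - vectors[i]))
            if r < d:
                heapq.heappush(best, (-r, i))
                if len(best) > k:
                    heapq.heappop(best)
                if len(best) == k:
                    d = -best[0][0]          # update the kth-NN distance
    return sorted((-neg, i) for neg, i in best)

rng = np.random.default_rng(2)
data = rng.random((1000, 8))                 # placeholder vectors
print(simple_nn(rng.random(8), data, approximate(data), k=3))
```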
26. Near-optimal NN searching
- d = kth-smallest ub(q, a) so far
- For each approximation a_i:
  - Compute lb(q, a_i) and ub(q, a_i)
  - If lb(q, a_i) < d then
    - If ub(q, a_i) < d then
      - Add point i to the set of NNs
      - Update d
    - InsertHeap(Heap, lb(q, a_i), i)
27. Near-optimal NN searching (2)
- d = distance to the kth NN so far
- Repeat
  - Examine the next entry (l_i, i) from the heap
  - If d < l_i then break
  - Else
    - Compute r = distance(q, v_i)
    - If r < d then
      - Add point i to the set of NNs
      - Update d
- Forever
- Sub-linear (~log n) vectors are visited after the first phase (both phases are sketched below)
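A runnable sketch of the two phases above. The upper bound ub(q, a) is taken as the distance from q to the farthest corner of the tile, the helper approximate() repeats the VA-file quantization from earlier, and all data are placeholders.

```python
import heapq
import numpy as np

BITS, CELLS = 8, 256

def approximate(points):
    return np.minimum((np.asarray(points) * CELLS).astype(int), CELLS - 1)

def bounds(q, cell):
    """(lower, upper) bound on the distance from q to any point in the tile."""
    lo, hi = cell / CELLS, (cell + 1) / CELLS
    near = np.maximum(lo - q, 0) + np.maximum(q - hi, 0)
    far = np.maximum(np.abs(q - lo), np.abs(q - hi))
    return float(np.linalg.norm(near)), float(np.linalg.norm(far))

def near_optimal_nn(q, vectors, approx, k=1):
    # Phase 1: scan the approximations, keep candidates keyed by lower bound.
    heap, ubs = [], []
    d = float("inf")                         # kth-smallest upper bound so far
    for i, a in enumerate(approx):
        l, u = bounds(q, a)
        if l < d:
            heapq.heappush(heap, (l, i))
            ubs = sorted(ubs + [u])[:k]
            if len(ubs) == k:
                d = ubs[-1]
    # Phase 2: visit candidates in lower-bound order; stop when the next
    # lower bound exceeds the kth exact distance found so far.
    best = []                                # max-heap of (-distance, index)
    d = float("inf")
    while heap:
        l, i = heapq.heappop(heap)
        if d < l:
            break
        r = float(np.linalg.norm(q - vectors[i]))
        if r < d:
            heapq.heappush(best, (-r, i))
            if len(best) > k:
                heapq.heappop(best)
            if len(best) == k:
                d = -best[0][0]
    return sorted((-neg, i) for neg, i in best)

rng = np.random.default_rng(3)
data = rng.random((1000, 8))                 # placeholder vectors
print(near_optimal_nn(rng.random(8), data, approximate(data), k=3))
```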
28. SS-tree and SR-tree
- SS-tree: uses spheres for index nodes
  - Higher fanout, since the storage cost per entry is reduced
- SR-tree: uses both rectangles and spheres for index nodes
  - An index node is defined by the intersection of the two volumes
  - More accurate representation of the data (see the mindist sketch below)
  - Higher storage cost
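A small sketch of why the intersection helps during search: the minimum possible distance from a query to an SR-tree node can be lower-bounded by the larger of the rectangle and sphere lower bounds (function names are mine, not from the paper).

```python
import numpy as np

def mindist_rect(q, lows, highs):
    """Smallest possible distance from q to an axis-aligned rectangle."""
    gap = np.maximum(lows - q, 0) + np.maximum(q - highs, 0)
    return float(np.linalg.norm(gap))

def mindist_sphere(q, center, radius):
    """Smallest possible distance from q to a bounding sphere."""
    return max(0.0, float(np.linalg.norm(q - center)) - radius)

def mindist_sr_node(q, lows, highs, center, radius):
    # The node's region lies in both volumes, so the larger of the two
    # lower bounds is still valid, and usually tighter than either alone.
    return max(mindist_rect(q, lows, highs), mindist_sphere(q, center, radius))

q = np.array([0.9, 0.9])
print(mindist_sr_node(q, np.array([0.0, 0.0]), np.array([0.5, 0.5]),
                      np.array([0.25, 0.25]), 0.4))
```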
29. Metric Tree (M-tree)
- Definition of a metric:
  - d(x, y) > 0
  - d(x, y) = d(y, x)
  - d(x, y) + d(y, z) ≥ d(x, z)
  - d(x, x) = 0
- Works for non-vector spaces too:
  - Edit distance (an example follows)
  - d(u, v) = sqrt((u - v)^T A (u - v)), as used in QBIC
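For concreteness, a standard Levenshtein edit distance, one example of a metric on a non-vector space (strings); this particular implementation is mine, not from the slides.

```python
def edit_distance(u, v):
    """Minimum number of insertions, deletions, and substitutions turning
    string u into string v; this is a metric on strings."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        curr = [i]
        for j, cv in enumerate(v, 1):
            curr.append(min(prev[j] + 1,                 # delete cu
                            curr[j - 1] + 1,             # insert cv
                            prev[j - 1] + (cu != cv)))   # substitute
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # 3
```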
30. Basic idea
[Diagram: parent node p with routing objects x and y; index entries (x, d(x,p), r(x)) and (y, d(y,p), r(y)); an object z in y's subtree with d(y, z) ≤ r(y)]
- Index entry = (routing object, distance to parent, covering radius)
- All objects in the subtree are within the covering radius of the routing object (a data-structure sketch follows).
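A minimal data-structure sketch of such an index entry (field names are mine; the real M-tree distinguishes leaf entries from routing entries in more detail):

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class MTreeEntry:
    routing_object: Any               # an object from the indexed metric space
    dist_to_parent: float             # d(routing object, parent routing object)
    covering_radius: float            # every object in the subtree lies within this radius
    children: List["MTreeEntry"] = field(default_factory=list)
    is_leaf: bool = False             # leaf entries hold actual data objects (radius 0)
```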
31. Range queries
[Diagram: query q with range t, routing objects x and y under parent p, and an object z in y's subtree]
- By the triangle inequality, d(q, z) ≥ d(q, y) - d(y, z), and d(y, z) ≤ r(y), so d(q, z) ≥ d(q, y) - r(y)
- Therefore, if d(q, y) - r(y) > t, then d(q, z) > t for every object z in the subtree
- Prune subtree y if d(q, y) - r(y) > t   (C1)
32. Range queries
[Same setting as before]
- Prune subtree y if d(q, y) - r(y) > t   (C1)
- Also, d(q, y) ≥ d(q, p) - d(p, y) and d(q, y) ≥ d(p, y) - d(q, p), so d(q, y) ≥ |d(q, p) - d(p, y)|
- Therefore, if |d(q, p) - d(p, y)| - r(y) > t, then d(q, y) - r(y) > t, without computing d(q, y)
- Prune subtree y if |d(q, p) - d(p, y)| - r(y) > t   (C2)
33. Range query algorithm
- RQ(q, t, Root, subtrees S1, S2, ...)
- For each subtree Si:
  - Prune if condition C2 holds (uses only stored distances)
  - Otherwise compute the distance to the routing object of Si and prune if condition C1 holds
  - Otherwise search the children of Si recursively (sketched below)
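A sketch of this recursion using the MTreeEntry structure above and an arbitrary metric dist; leaf handling and other bookkeeping are simplified.

```python
def range_query(q, t, entry, dist, d_q_parent=None, results=None):
    """Report all indexed objects within distance t of q, using the MTreeEntry
    nodes above; dist is the metric, d_q_parent is d(q, entry's routing object)."""
    if results is None:
        results = []
    for child in entry.children:
        # C2: prune using only stored distances (no new distance computation).
        if (d_q_parent is not None and
                abs(d_q_parent - child.dist_to_parent) - child.covering_radius > t):
            continue
        d_q_child = dist(q, child.routing_object)
        # C1: prune if even the closest possible object in the subtree is out of range.
        if d_q_child - child.covering_radius > t:
            continue
        if child.is_leaf:
            if d_q_child <= t:
                results.append(child.routing_object)
        else:
            range_query(q, t, child, dist, d_q_child, results)
    return results
```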
34. Nearest neighbor query
- Maintain a priority list of the k NN distances found so far
- Minimum distance to a subtree with root x: dmin(q, x) = max(d(q, x) - r(x), 0)
  - |d(q, p) - d(p, x)| - r(x) ≤ d(q, x) - r(x), so we may not need to compute d(q, x)
- Maximum distance to a subtree with root x: dmax(q, x) = d(q, x) + r(x)
[Diagram: for any object z in x's subtree, d(q, z) + r(x) ≥ d(q, x), hence d(q, z) ≥ d(q, x) - r(x); likewise d(q, z) ≤ d(q, x) + r(x)]
35. Nearest neighbor query
- Maintain an estimate dp of the kth-smallest maximum distance seen so far
- Prune a subtree x if dmin(q, x) > dp (see the sketch below)
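A compact sketch of the resulting branch-and-bound k-NN search over MTreeEntry nodes, using a best-first queue ordered by dmin. For simplicity, dp here is just the kth-smallest exact distance found so far, whereas the slides also prime it with dmax values; treat this as an approximation of the described method.

```python
import heapq
import itertools

def dmin(q, e, dist):
    """Smallest possible distance from q to anything in e's subtree."""
    return max(dist(q, e.routing_object) - e.covering_radius, 0.0)

def knn(q, root, dist, k=1):
    """Best-first k-NN over MTreeEntry nodes, pruning subtrees with dmin > dp."""
    tie = itertools.count()                     # breaks ties in the heaps
    best = []                                   # max-heap of (-distance, _, object)
    dp = float("inf")                           # kth-NN distance found so far
    queue = [(0.0, next(tie), root)]            # min-heap ordered by dmin
    while queue:
        lo, _, e = heapq.heappop(queue)
        if lo > dp:                             # prune: subtree cannot improve the result
            continue
        for c in e.children:
            if c.is_leaf:
                d = dist(q, c.routing_object)
                if d < dp or len(best) < k:
                    heapq.heappush(best, (-d, next(tie), c.routing_object))
                    if len(best) > k:
                        heapq.heappop(best)
                    if len(best) == k:
                        dp = -best[0][0]
            else:
                lo_c = dmin(q, c, dist)
                if lo_c <= dp:
                    heapq.heappush(queue, (lo_c, next(tie), c))
    return sorted(((-neg, obj) for neg, _, obj in best), key=lambda pair: pair[0])
```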
36. References
- Christos Faloutsos, Ron Barber, Myron Flickner, Jim Hafner, Wayne Niblack, Dragutin Petkovic, William Equitz: Efficient and Effective Querying by Image Content. JIIS 3(3/4): 231-262 (1994)
- Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel: The X-tree: An Index Structure for High-Dimensional Data. VLDB 1996: 28-39
- Stefan Berchtold, Christian Böhm, Hans-Peter Kriegel: The Pyramid-Technique: Towards Breaking the Curse of Dimensionality. SIGMOD Conference 1998: 142-153
- Roger Weber, Hans-Jörg Schek, Stephen Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205
- Paolo Ciaccia, Marco Patella, Pavel Zezula: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997: 426-435