M-Tree: An Efficient Access Method for Similarity Search in Metric Space

Slides: 31

Provided by: Ami987

Transcript and Presenter's Notes

1
M-Tree: An Efficient Access Method for Similarity
Search in Metric Space

Presenters
Amool Gupta
Amit Sharma

2
MOTIVATION

What basic problem does it address, and why?
What other techniques solve the same problem, and how
is this one a step ahead?
Basic fundamentals of this indexing structure

3
Similarity Search Problem
4
Similarity Searching

Effectiveness
- How the similarity measure is formulated:
a model of human perception
Efficiency
- How the required performance is achieved
over huge volumes of data: an index structure

5
(No Transcript)
6
(No Transcript)
7
Examples of Distance Functions

Lp metric functions (for vectors)
- L1: Manhattan distance
- L2: Euclidean distance
- L∞: maximum distance
Edit distance (for strings)
Hausdorff distance
Earth mover's distance
Quadratic form distance
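
To make the first few entries of the list concrete, here is a minimal Python sketch of the Lp family and the edit distance. The function names (`lp_distance`, `linf_distance`, `edit_distance`) are illustrative choices, not taken from the slides, and the edit distance is assumed to be the standard Levenshtein variant with unit costs.

```python
from typing import Sequence


def lp_distance(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """L_p distance between two equal-length vectors (a metric for p >= 1)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def linf_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """L_infinity distance: the largest coordinate-wise difference."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return max((abs(a - b) for a, b in zip(x, y)), default=0.0)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions that turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(lp_distance((0, 0), (3, 4), 1))    # L1 (Manhattan)
print(lp_distance((0, 0), (3, 4), 2))    # L2 (Euclidean)
print(linf_distance((0, 0), (3, 4)))     # L-infinity
print(edit_distance("kitten", "sitting"))
```

Any of these functions can be plugged into a metric index as the distance d, because the index itself never looks inside the objects.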

8
Metric Spaces: An Abstraction of Similarity

A metric space M = (D, d) is a pair, where
D is a domain (universe) of values, and
d is a distance function that, ∀ x, y, z ∈ D,
satisfies the metric axioms:
d(x,y) ≥ 0, d(x,y) = 0 ⇔ x = y (positivity)
d(x,y) = d(y,x) (symmetry)
d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
All the distance functions seen in the previous
examples are metrics, and so are the (weighted)
Lp-norms.
The only distance seen so far that does not fit
the metric framework is DTW (dynamic time warping).
Metric indexes use only the metric axioms to
organize objects, and exploit the triangle
inequality to prune the search space.
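
The pruning idea in the last point can be illustrated with a small Python sketch. It assumes a single reference object (called `pivot` here, a hypothetical name) whose distances to all stored objects were computed in advance. By the triangle inequality, |d(Q, P) − d(O, P)| is a lower bound on d(Q, O), so any object whose bound exceeds the search radius can be discarded without computing its real distance to the query.

```python
import math


def lower_bound(d_query_pivot: float, d_object_pivot: float) -> float:
    """Triangle-inequality lower bound on d(query, object)."""
    return abs(d_query_pivot - d_object_pivot)


pivot = (0.0, 0.0)
objects = [(1.0, 0.0), (9.0, 1.0), (0.0, 2.0)]

# Computed once, when the objects are indexed.
d_object_pivot = [math.dist(o, pivot) for o in objects]

query, radius = (10.0, 0.0), 2.0
d_query_pivot = math.dist(query, pivot)  # one distance computation per query

# Only objects that survive the bound need a real distance computation.
candidates = [o for o, d_op in zip(objects, d_object_pivot)
              if lower_bound(d_query_pivot, d_op) <= radius]
results = [o for o in candidates if math.dist(o, query) <= radius]
print(candidates, results)
```

In this toy data set two of the three objects are eliminated by the bound alone. The bound can only rule objects out; survivors still need the exact check.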

9
Limitations of SAMs

SAMs (spatial access methods) are limited to indexing DB objects
represented by feature values in a
multi-dimensional vector space (we need a more
generic indexing strategy).
The dissimilarity of objects is measured by an Lp distance
between feature values.
They assume that distance computation is trivial.
Limitations of Metric Trees
They do not support dynamic database environments.
They reduce distance computations but pay no
attention to I/O costs.

10
What is a relative distance?

OA + AB = OB
AB = OB − OA
AB: relative position of B w.r.t. A

(Figure: points O, A, and B.)
11
M-Tree

The key idea is to reduce the number of distance
computations and, at the same time, reduce I/O.
The M-Tree partitions objects on the basis of their
relative distances, as measured by a specific
distance function, and stores these objects in
nodes.

(Figure: routing object Or with covering radius r(Or), its parent P(Or), and the root.)
12
M-Tree Structure

Leaf nodes store all indexed DB objects, by
their keys or feature values.
Internal nodes are called routing nodes.
Each routing object Or is associated with:
- Or: feature value of the DB object
- Ptr(T(Or)): pointer to the root of subtree T(Or)
- r(Or): covering radius, the maximum
distance of any object in subtree T(Or) from
routing object Or
- d(Or, P(Or)): distance of the routing object from
its parent object P(Or)

13
M-Tree Structure

Leaf node entry for a database object:
- Oj: feature value of the DB object
- oid(Oj): object key
- d(Oj, P(Oj)): distance of Oj from its parent P(Oj)
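
The two entry layouts described on these slides map naturally onto small record types. The following Python sketch is one possible in-memory representation; the class and field names (`LeafEntry`, `RoutingEntry`, `Node`, `dist_to_parent`, and so on) are assumptions made for illustration, and a disk-based implementation would store page references rather than object pointers.

```python
from dataclasses import dataclass, field
from typing import Any, List, Union


@dataclass
class LeafEntry:
    obj: Any                      # Oj: feature value of the DB object
    oid: int                      # oid(Oj): object key
    dist_to_parent: float = 0.0   # d(Oj, P(Oj)), stored at insert time


@dataclass
class RoutingEntry:
    obj: Any                      # Or: feature value of the routing object
    subtree: "Node"               # Ptr(T(Or)): root of the covered subtree
    radius: float                 # r(Or): covering radius
    dist_to_parent: float = 0.0   # d(Or, P(Or)); unused at the root


@dataclass
class Node:
    is_leaf: bool
    entries: List[Union[LeafEntry, RoutingEntry]] = field(default_factory=list)


# A one-leaf tree: the routing object covers a single stored object.
leaf = Node(is_leaf=True,
            entries=[LeafEntry(obj=(1.0, 2.0), oid=7, dist_to_parent=0.5)])
routing = RoutingEntry(obj=(1.0, 2.5), subtree=leaf, radius=0.5)
root = Node(is_leaf=False, entries=[routing])
```

Both entry types carry the precomputed distance to the parent, which is what later lets a query skip distance computations.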

14
Processing Queries

Generally, SAMs try to prune the tree for a given query,
and the main emphasis is on developing an efficient
pruning method that reduces the number of disk accesses.
Once the tree is pruned, however, the distance from the
query point Q to each point in the pruned tree must
still be computed.
In contrast, the M-Tree emphasizes pruning
as well as reducing distance computations,
which it achieves by making maximum use of the
precomputed distances stored in its nodes.

15
Range Query

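Given a query point Q and
a maximum search distance r(Q),
the range query range(Q, r(Q)) selects all objects Oj such
that d(Oj, Q) ≤ r(Q).

The subtree of Or is of interest only if its region
intersects the query region. How can intersection be
detected using precomputed distances? If the distance
between Q and Or is at most the sum of the two radii,
r(Q) + r(Or), the regions intersect. By the triangle
inequality, |d(P(Or), Q) − d(Or, P(Or))| is a lower bound
on d(Or, Q), so a subtree whose bound already exceeds
r(Q) + r(Or) can be skipped without computing d(Or, Q).

(Figure: query Q with radius r(Q); routing object Or with covering radius r(Or), its parent P(Or), and the root.)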
16
Range Query

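Leaf node

An object Oj in a leaf node is a solution to the range query
if it lies within the query radius r(Q). The precomputed
distance d(Oj, P(Oj)) can again be used to find out whether
the object can lie within the query radius or not.

(Figure: query Q with radius r(Q); leaf object Oj, its parent P(Or), and the root.)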
17
Algorithm for Range Queries
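
The transcript does not include the algorithm from this slide, so the following Python sketch is a reconstruction from the preceding two slides rather than the presenters' own pseudocode. The `Entry` and `Node` types and the function name `range_search` are hypothetical. Each entry is first tested with the precomputed parent distance, and a real distance is computed only if that cheap test fails to exclude it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Entry:
    obj: Any                          # Or (routing) or Oj (leaf)
    dist_to_parent: float = 0.0       # d(O, P(O)), precomputed
    radius: float = 0.0               # r(Or); 0 for leaf entries
    subtree: Optional["Node"] = None  # T(Or); None for leaf entries


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def range_search(node: Node, query: Any, r_q: float,
                 dist: Callable[[Any, Any], float],
                 d_parent_q: Optional[float] = None,
                 out: Optional[list] = None) -> list:
    """Collect every object Oj below `node` with dist(Oj, query) <= r_q."""
    if out is None:
        out = []
    for e in node.entries:
        # Cheap test: |d(P, Q) - d(O, P)| is a lower bound on d(O, Q).
        # The root has no parent, so the test is skipped there.
        if (d_parent_q is not None
                and abs(d_parent_q - e.dist_to_parent) > r_q + e.radius):
            continue
        d = dist(e.obj, query)  # the only real distance computation
        if node.is_leaf:
            if d <= r_q:
                out.append(e.obj)
        elif d <= r_q + e.radius:  # the two regions intersect
            range_search(e.subtree, query, r_q, dist, d, out)
    return out


# Toy 1-D tree: two leaves under one root, distance = absolute difference.
leaf_a = Node(True, [Entry(1.0, 1.0), Entry(2.0, 0.0), Entry(3.0, 1.0)])
leaf_b = Node(True, [Entry(9.0, 1.0), Entry(10.0, 0.0), Entry(12.0, 2.0)])
root = Node(False, [Entry(2.0, 0.0, 1.0, leaf_a),
                    Entry(10.0, 0.0, 2.0, leaf_b)])
dist = lambda a, b: abs(a - b)
print(range_search(root, 3.5, 1.6, dist))  # [2.0, 3.0]
```

In the example the second subtree is never visited, because the distance from the query to its routing object exceeds r(Q) + r(Or).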
18
K nearest neighbors queries

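Given a query point Q and
an integer k ≥ 1,
the k-NN query NN(Q, k) selects the k indexed objects that
have the shortest distances to Q.

(Figure: query Q; min bound and max bound on the distance from Q to the objects under routing object Or, with covering radius r(Or), parent P(Or), and the root.)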
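
A best-first search is one way to turn the min bound from the figure into an algorithm. The Python sketch below is a simplified illustration under stated assumptions, not the presenters' procedure: it uses max(d(Or, Q) − r(Or), 0) as the min bound for a subtree, keeps the k closest objects found so far, and stops when no pending subtree can contain anything closer. The max bound d(Or, Q) + r(Or) could tighten the search radius earlier and is left out here. All type and function names are hypothetical.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Entry:
    obj: Any
    dist_to_parent: float = 0.0       # d(O, P(O)), precomputed
    radius: float = 0.0               # r(Or); 0 for leaf entries
    subtree: Optional["Node"] = None  # None for leaf entries


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def knn_search(root: Node, query: Any, k: int,
               dist: Callable[[Any, Any], float]) -> List[Tuple[float, Any]]:
    """Return the k nearest objects as (distance, object), closest first."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tie = itertools.count()          # tiebreaker so objects are never compared
    best: list = []                  # max-heap via negated distances
    pending = [(0.0, next(tie), root, None)]  # (min bound, tie, node, d(P, Q))

    def d_k() -> float:
        """Distance of the current k-th neighbor, or infinity if fewer than k."""
        return -best[0][0] if len(best) == k else math.inf

    while pending:
        d_min, _, node, d_parent_q = heapq.heappop(pending)
        if d_min > d_k():
            break                    # nothing left can beat the k-th neighbor
        for e in node.entries:
            if (d_parent_q is not None
                    and abs(d_parent_q - e.dist_to_parent) > d_k() + e.radius):
                continue             # excluded using precomputed distances
            d = dist(e.obj, query)
            if node.is_leaf:
                if d < d_k():
                    heapq.heappush(best, (-d, next(tie), e.obj))
                    if len(best) > k:
                        heapq.heappop(best)  # drop the farthest candidate
            else:
                bound = max(d - e.radius, 0.0)
                if bound <= d_k():
                    heapq.heappush(pending, (bound, next(tie), e.subtree, d))
    return sorted(((-nd, obj) for nd, _, obj in best), key=lambda t: t[0])


leaf_a = Node(True, [Entry(1.0, 1.0), Entry(2.0, 0.0), Entry(3.0, 1.0)])
leaf_b = Node(True, [Entry(9.0, 1.0), Entry(10.0, 0.0), Entry(12.0, 2.0)])
root = Node(False, [Entry(2.0, 0.0, 1.0, leaf_a),
                    Entry(10.0, 0.0, 2.0, leaf_b)])
dist = lambda a, b: abs(a - b)
print(knn_search(root, 3.5, 2, dist))  # [(0.5, 3.0), (1.5, 2.0)]
```

Unlike a range query, the search radius here is not known in advance; it starts at infinity and shrinks as closer objects are found.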
19
SPLIT MANAGEMENT

The M-Tree grows in a bottom-up fashion.
Overflow of a node N is managed by splitting N into
two nodes, N and N' (newly created).
PARTITION: the entries are distributed between N
and N'.
PROMOTE: two entries are promoted as routing
objects and moved to the parent level.

20
SPLIT MANAGEMENT

If the split node is a leaf, then the covering
radius of a promoted object, say Op1, is set to
r(Op1) = max{ d(Oj, Op1) | Oj ∈ N1 }
whereas if overflow occurs in an internal node
r(Op1) = max{ d(Or, Op1) + r(Or) | Or ∈ N1 }
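
The two covering-radius rules can be written directly as code. This Python sketch uses hypothetical function names and represents the entries of an internal node as (Or, r(Or)) pairs, which is an assumption made for brevity.

```python
from typing import Any, Callable, Iterable, Tuple


def leaf_covering_radius(promoted: Any, objects: Iterable[Any],
                         dist: Callable[[Any, Any], float]) -> float:
    """r(Op1) = max d(Oj, Op1) over the objects Oj assigned to the leaf."""
    return max((dist(o, promoted) for o in objects), default=0.0)


def internal_covering_radius(promoted: Any,
                             routing: Iterable[Tuple[Any, float]],
                             dist: Callable[[Any, Any], float]) -> float:
    """r(Op1) = max (d(Or, Op1) + r(Or)) over the (Or, r(Or)) pairs."""
    return max((dist(o, promoted) + r for o, r in routing), default=0.0)


dist = lambda a, b: abs(a - b)
print(leaf_covering_radius(2.0, [1.0, 2.0, 3.5], dist))                 # 1.5
print(internal_covering_radius(5.0, [(2.0, 1.0), (10.0, 2.0)], dist))   # 7.0
```

The internal-node rule adds each child's own radius, so the new region is guaranteed to enclose every child region, though it may be larger than strictly necessary.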

21
SPLIT POLICIES

A specific implementation of the Promote and Partition
methods defines a split policy.
An ideal split policy should promote two objects and
partition the other objects so that the resulting regions have
- minimum volume
- minimum overlap
How is this different from SAMs?

22
PROMOTE: Choosing Routing Objects

m_RAD: minimizes the sum of the two covering radii
mM_RAD: minimizes the maximum of the two radii
M_LB_DIST: maximum lower bound on distance
RANDOM
SAMPLING
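
As an illustration of the first two policies, the Python sketch below tries every pair of candidate objects, partitions the entries for each pair, and scores the pair by the resulting covering radii. It is a simplified sketch under stated assumptions: the split node is a leaf, entries are assigned to the nearer candidate (ties go to the first), and the function name `promote` is hypothetical.

```python
from itertools import combinations
from math import inf
from typing import Any, Callable, Sequence, Tuple


def promote(objects: Sequence[Any], dist: Callable[[Any, Any], float],
            policy: str = "m_RAD") -> Tuple[Any, Any]:
    """Pick two routing objects by exhaustive search over all pairs.

    m_RAD  : minimize r(Op1) + r(Op2)
    mM_RAD : minimize max(r(Op1), r(Op2))
    """
    if policy not in ("m_RAD", "mM_RAD"):
        raise ValueError("unknown policy: " + policy)
    if len(objects) < 2:
        raise ValueError("need at least two objects to promote")
    best_pair, best_score = None, inf
    for p1, p2 in combinations(objects, 2):
        # Assign each object to the nearer candidate.
        n1 = [o for o in objects if dist(o, p1) <= dist(o, p2)]
        n2 = [o for o in objects if dist(o, p1) > dist(o, p2)]
        r1 = max((dist(o, p1) for o in n1), default=0.0)
        r2 = max((dist(o, p2) for o in n2), default=0.0)
        score = r1 + r2 if policy == "m_RAD" else max(r1, r2)
        if score < best_score:  # keep the first pair with the best score
            best_pair, best_score = (p1, p2), score
    return best_pair


dist = lambda a, b: abs(a - b)
print(promote([0, 1, 2, 10, 11], dist, "m_RAD"))   # (1, 10)
print(promote([0, 1, 2, 10, 11], dist, "mM_RAD"))  # (1, 10)
```

The exhaustive search needs a number of distance computations that grows quadratically with the node size for each split, which gives some intuition for why cheaper policies such as RANDOM and SAMPLING are also on the list.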

23
PARTITION: Distribution of Entries

Generalized hyperplane (unbalanced split; why?):
assign each object Oj ∈ N to the nearest routing
object:
if d(Oj, Op1) ≤ d(Oj, Op2) then
assign Oj to N1, else
assign Oj to N2.
Balanced: compute d(Oj, Op1) and d(Oj, Op2) for
all
Oj ∈ N. Repeat until N is empty:
- Assign to N1 the nearest neighbor of Op1 in N
and remove it from N.
- Assign to N2 the nearest neighbor of Op2 in N
and remove it from N.
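
Both distribution strategies are short enough to sketch in Python. The function names are hypothetical, and the input list is assumed to contain all entries of the overflowing node, including the two promoted objects.

```python
from typing import Any, Callable, List, Sequence, Tuple


def partition_hyperplane(objects: Sequence[Any], p1: Any, p2: Any,
                         dist: Callable[[Any, Any], float]
                         ) -> Tuple[List[Any], List[Any]]:
    """Generalized hyperplane: each object goes to its nearer routing object."""
    n1, n2 = [], []
    for o in objects:
        (n1 if dist(o, p1) <= dist(o, p2) else n2).append(o)
    return n1, n2


def partition_balanced(objects: Sequence[Any], p1: Any, p2: Any,
                       dist: Callable[[Any, Any], float]
                       ) -> Tuple[List[Any], List[Any]]:
    """Balanced: alternately hand p1 and p2 their nearest remaining object."""
    remaining = list(objects)
    n1, n2 = [], []
    while remaining:
        nearest = min(remaining, key=lambda o: dist(o, p1))
        remaining.remove(nearest)
        n1.append(nearest)
        if remaining:
            nearest = min(remaining, key=lambda o: dist(o, p2))
            remaining.remove(nearest)
            n2.append(nearest)
    return n1, n2


dist = lambda a, b: abs(a - b)
data = [1, 2, 3, 4, 10]
print(partition_hyperplane(data, 1, 10, dist))  # ([1, 2, 3, 4], [10])
print(partition_balanced(data, 1, 10, dist))    # ([1, 2, 3], [10, 4])
```

The example hints at the answer to the "why?" on the slide: the hyperplane rule follows the data distribution, so a cluster near one routing object produces an uneven split, while the balanced rule evens out the node sizes at the price of assigning object 4 to a routing object that is far from it, which enlarges that covering radius.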

24
Experimental Results

Assumed a constant node size.
Tested all split policies.
Results:
- The balanced partition method was shown to add
significant overhead and to increase the I/O cost.
- The fastest split policy was observed to be RANDOM and
the slowest m_RAD.
- Average volume covered per page (quality of tree
construction): M_LB_DIST proved effective.