Title: M-Tree: An Efficient Access Method for Similarity Search in Metric Space
1M-Tree: An Efficient Access Method for Similarity Search in Metric Space
- Presenters
- Amool Gupta
- Amit Sharma
2MOTIVATION
- Basic problem that it addresses (why?)
- Other techniques that solve the same problem, and how this one is a step ahead
- Basic fundamentals of this indexing structure
3Similarity Search Problem
4Similarity Searching
- Effectiveness
- - The way of formulating the similarity measures: a model of human perception
- Efficiency
- - The way of achieving the required performance over huge volumes of data: an index structure
7Examples of Distance Functions
- Lp metric functions (for vectors)
- - L1: Manhattan distance
- - L2: Euclidean distance
- - L∞: maximum (Chebyshev) distance
- Edit distance (for strings)
- Hausdorff distance
- Earth mover's distance
- Quadratic form distance
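A minimal sketch of the first few distance functions in the list above, assuming vectors are plain sequences of numbers and strings are compared with unit-cost insertions, deletions, and substitutions. The function names are illustrative and do not come from the slides.

```python
import math
from typing import Sequence


def lp_distance(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Lp (Minkowski) distance between two equal-length vectors, p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    """L1 distance: sum of coordinate-wise absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """L2 distance: straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chebyshev(x: Sequence[float], y: Sequence[float]) -> float:
    """L-infinity distance: largest coordinate-wise absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit (Levenshtein) distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(manhattan((0, 0), (3, 4)))           # 7
print(euclidean((0, 0), (3, 4)))           # 5.0
print(chebyshev((0, 0), (3, 4)))           # 4
print(edit_distance("kitten", "sitting"))  # 3
```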
8Metric Spaces-An abstraction of Similarity
- A metric space M = (D, d) is a pair, where
- D is a domain (universe) of values, and
- d is a distance function that, ∀ x, y, z ∈ D, satisfies the metric axioms
- - d(x,y) ≥ 0, d(x,y) = 0 ⇔ x = y (positivity)
- - d(x,y) = d(y,x) (symmetry)
- - d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
- All the distance functions seen in the previous examples are metrics, and so are the (weighted) Lp-norms
- The only distance seen so far that does not fit the metric framework is dynamic time warping (DTW)
- Metric indexes only use the metric axioms to organize objects, and exploit the triangle inequality to prune the search space
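The three axioms can be spot-checked on a finite sample, as in the sketch below. The helper name `first_violated_axiom` and the tolerance parameter are assumptions made for illustration. Passing the check on a sample does not prove that a function is a metric; it can only expose a violation.

```python
from itertools import product


def first_violated_axiom(points, d, tol=1e-9):
    """Return the name of the first axiom violated on the sample, or None."""
    for x, y in product(points, repeat=2):
        dxy = d(x, y)
        # Positivity: d(x,y) >= 0, and d(x,y) = 0 exactly when x = y.
        if dxy < -tol or (dxy <= tol) != (x == y):
            return "positivity"
        # Symmetry: d(x,y) = d(y,x).
        if abs(dxy - d(y, x)) > tol:
            return "symmetry"
    for x, y, z in product(points, repeat=3):
        # Triangle inequality: d(x,y) <= d(x,z) + d(z,y).
        if d(x, y) > d(x, z) + d(z, y) + tol:
            return "triangle inequality"
    return None


sample = [0, 1, 2, 5]
print(first_violated_axiom(sample, lambda a, b: abs(a - b)))   # None
print(first_violated_axiom(sample, lambda a, b: (a - b) ** 2))  # triangle inequality
```

The squared difference fails because d(0,2) = 4 exceeds d(0,1) + d(1,2) = 2, so it could not safely be used for triangle-inequality pruning.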
9- Limitations of spatial access methods (SAMs)
- - SAMs are limited to indexing DB objects represented by feature values in a multi-dimensional vector space (we need a more generic indexing strategy)
- - Dissimilarity of objects is measured by an Lp distance between feature values
- - They assume that distance computation is trivial
- Limitations of metric trees
- - They do not support dynamic database environments
- - They reduce distance computations but pay no attention to I/O costs
10What is a relative distance?
- OA + AB ≥ OB
- AB ≥ |OA − OB|
- AB: relative position of B w.r.t. A
(Figure: points O, A, and B)
11M-Tree
- The key idea is to reduce distance computations and, at the same time, reduce I/O.
- The M-Tree partitions objects on the basis of their relative distances, as measured by a specific distance function, and stores these objects in nodes.
(Figure: routing object Or with covering radius r(Or), its parent object P(Or), and the root)
12M-Tree Structure
- Leaf nodes store all indexed DB objects, by their keys or feature values.
- Internal nodes are called routing nodes.
- A routing object Or is associated with
- - Or: feature value of the DB object
- - ptr(T(Or)): pointer to the root of the subtree T(Or)
- - r(Or): covering radius, the maximum relative distance of objects in subtree T(Or) from routing object Or
- - d(Or, P(Or)): distance of the routing object from its parent object P(Or)
13M-Tree Structure
- Leaf node entry for a database object
- - Oj: feature value of the DB object
- - oid(Oj): object key
- - d(Oj, P(Oj)): distance of Oj from its parent P(Oj)
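The two entry layouts described on these slides could be written down as follows. This is a sketch only: the class and field names (`LeafEntry`, `RoutingEntry`, `Node`, `dist_to_parent`, and so on) are hypothetical, and a real implementation would store nodes as fixed-size disk pages rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Any, List, Union


@dataclass
class LeafEntry:
    obj: Any               # Oj: feature value of the DB object
    oid: int               # oid(Oj): object key
    dist_to_parent: float  # d(Oj, P(Oj))


@dataclass
class RoutingEntry:
    obj: Any               # Or: feature value of the routing object
    subtree: "Node"        # ptr(T(Or)): root of the subtree T(Or)
    radius: float          # r(Or): covering radius
    dist_to_parent: float  # d(Or, P(Or)); not meaningful in the root node


@dataclass
class Node:
    is_leaf: bool
    entries: List[Union[LeafEntry, RoutingEntry]] = field(default_factory=list)


# A two-level toy tree: one routing entry covering a single leaf object.
leaf = Node(is_leaf=True,
            entries=[LeafEntry(obj=(1.0, 2.0), oid=7, dist_to_parent=0.5)])
root = Node(is_leaf=False,
            entries=[RoutingEntry(obj=(1.0, 2.5), subtree=leaf,
                                  radius=0.5, dist_to_parent=0.0)])
```

Storing d(O, P(O)) in every entry is what later allows a search to discard entries without computing a new distance.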
14Processing Queries
- Generally, SAMs try to prune the tree for a given query, and the main emphasis is on developing an efficient pruning method that reduces the number of disk accesses. Once the tree is pruned, however, the distance from the query point Q to each point in the pruned tree must still be computed.
- By contrast, the M-Tree emphasizes pruning as well as reducing distance computations, which it achieves by making maximal use of the precomputed distances stored in its nodes.
15Range Query
- Given a query point Q and
- a maximum search distance r(Q),
- the range query range(Q, r(Q)) returns all objects Oj such that d(Oj, Q) ≤ r(Q)
The subtree of Or is of interest only if its region intersects the query region. How can the intersection be detected using precomputed distances? If the distance between Q and Or is at most the sum of the two radii, r(Q) + r(Or), the regions intersect.
(Figure: query region of radius r(Q) around Q; routing object Or with covering radius r(Or), its parent object P(Or), and the root)
16Range Query
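An object in a leaf node is a solution to the range query if it lies within the query radius r(Q) of Q. We can again use relative distances to find whether the object lies within that radius or not.
(Figure: query region of radius r(Q) around Q; leaf object Oj, its parent object P(Or), and the root)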
17Algorithm for Range Queries
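A minimal sketch of a range search that follows the two preceding slides. It assumes an in-memory tree and hypothetical `Entry` and `Node` classes; the function name `range_search` and its parameters are likewise illustrative. The pruning test relies on the triangle inequality: |d(P, Q) − d(O, P)| is a lower bound on d(O, Q), and both terms are already known when an entry is examined.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Entry:
    obj: Any                          # Oj (leaf) or Or (routing)
    dist_to_parent: float             # d(O, P(O)); unused in the root
    radius: float = 0.0               # r(Or); routing entries only
    subtree: Optional["Node"] = None  # T(Or); routing entries only


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def range_search(node, q, r_q, d, d_parent_q=None, out=None):
    """Collect all objects O with d(O, Q) <= r(Q).

    d_parent_q is d(P, Q) for the routing object P that owns `node`
    (None at the root, which has no parent object).
    """
    if out is None:
        out = []
    for e in node.entries:
        limit = r_q if node.is_leaf else r_q + e.radius
        # |d(P, Q) - d(O, P)| is a lower bound on d(O, Q), so if it already
        # exceeds the limit the entry is skipped without computing d(O, Q).
        if d_parent_q is not None and abs(d_parent_q - e.dist_to_parent) > limit:
            continue
        d_oq = d(e.obj, q)
        if d_oq <= limit:
            if node.is_leaf:
                out.append(e.obj)
            else:
                range_search(e.subtree, q, r_q, d, d_oq, out)
    return out


# Toy tree over numbers with d(x, y) = |x - y|.
leaf_a = Node(True, [Entry(0, 2.0), Entry(2, 0.0), Entry(4, 2.0)])
leaf_b = Node(True, [Entry(9, 1.0), Entry(10, 0.0), Entry(11, 1.0)])
root = Node(False, [Entry(2, 0.0, radius=2.0, subtree=leaf_a),
                    Entry(10, 0.0, radius=1.0, subtree=leaf_b)])

print(range_search(root, 3, 1.5, lambda x, y: abs(x - y)))  # [2, 4]
```

For the query range(2, 0.5) on the same toy tree, only three distances are computed (two routing objects and one leaf object); the other two leaf objects in the first subtree are discarded from their stored parent distances alone.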
18K nearest neighbors queries
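- Given a query point Q and
- an integer k ≥ 1,
- the k-NN query NN(Q, k) returns the k indexed objects that have the shortest distances to Q
(Figure: query point Q with the Min Bound and Max Bound marked for routing object Or, which has covering radius r(Or); parent object P(Or) and the root)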
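One way to read the Min Bound and Max Bound labels: every object in T(Or) lies between max(d(Or, Q) − r(Or), 0) and d(Or, Q) + r(Or) from Q. The sketch below uses only the lower bound to drive a best-first search with a priority queue. It is a simplified illustration under that assumption, not the algorithm from the slides; the class and function names are hypothetical, and the upper bound and the stored parent distances are left unused for brevity.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Entry:
    obj: Any
    radius: float = 0.0               # r(Or); routing entries only
    subtree: Optional["Node"] = None  # T(Or); routing entries only


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def bounds(d_or_q, r_or):
    """(Min Bound, Max Bound) on d(O, Q) for any object O in T(Or)."""
    return max(d_or_q - r_or, 0.0), d_or_q + r_or


def knn_search(root, q, k, d):
    """Return up to k (distance, object) pairs, nearest first."""
    best = []                   # max-heap via negated distances
    pending = [(0.0, 0, root)]  # min-heap of (Min Bound, tiebreak, node)
    tick = 1                    # tiebreak so nodes/objects are never compared
    while pending:
        d_min, _, node = heapq.heappop(pending)
        if len(best) == k and d_min > -best[0][0]:
            break  # no remaining subtree can hold a closer object
        for e in node.entries:
            d_oq = d(e.obj, q)
            if node.is_leaf:
                if len(best) < k:
                    heapq.heappush(best, (-d_oq, tick, e.obj))
                elif d_oq < -best[0][0]:
                    heapq.heapreplace(best, (-d_oq, tick, e.obj))
            else:
                lower, _upper = bounds(d_oq, e.radius)
                if len(best) < k or lower <= -best[0][0]:
                    heapq.heappush(pending, (lower, tick, e.subtree))
            tick += 1
    return sorted(((-nd, obj) for nd, _, obj in best), key=lambda t: t[0])


# Toy tree over numbers with d(x, y) = |x - y|.
leaf_a = Node(True, [Entry(0), Entry(2), Entry(4)])
leaf_b = Node(True, [Entry(9), Entry(10), Entry(11)])
root = Node(False, [Entry(2, radius=2.0, subtree=leaf_a),
                    Entry(10, radius=1.0, subtree=leaf_b)])

print(knn_search(root, 8, 3, lambda x, y: abs(x - y)))  # [(1, 9), (2, 10), (3, 11)]
```

In this example the first leaf is never read: its Min Bound of 4 exceeds the distance of the current third-nearest object, so the search stops.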
19SPLIT MANAGEMENT
- The M-Tree grows in a bottom-up fashion
- Overflow of a node N is managed by splitting N into two nodes, N and N′ (newly created)
- PARTITION: the entries are distributed among N and N′
- PROMOTE: two entries are promoted as routing objects and moved to the parent level
20SPLIT MANAGEMENT
- If the split node is a leaf, then the covering radius of a promoted object, say Op1, is set to
- - r(Op1) = max{ d(Oj, Op1) | Oj ∈ N1 }
- whereas if overflow occurs in an internal node
- - r(Op1) = max{ d(Or, Op1) + r(Or) | Or ∈ N1 }
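The two covering-radius rules above translate directly into code. In this sketch the function names are illustrative, and routing entries are assumed to be given as (Or, r(Or)) pairs.

```python
def leaf_covering_radius(promoted, leaf_objects, d):
    """r(Op1) = max{ d(Oj, Op1) | Oj in N1 } for a split leaf node."""
    return max((d(o, promoted) for o in leaf_objects), default=0.0)


def internal_covering_radius(promoted, routing_entries, d):
    """r(Op1) = max{ d(Or, Op1) + r(Or) | Or in N1 } for a split internal node.

    routing_entries is an iterable of (Or, r(Or)) pairs.
    """
    return max((d(o, promoted) + r for o, r in routing_entries), default=0.0)


dist = lambda x, y: abs(x - y)
print(leaf_covering_radius(2, [0, 2, 4], dist))                  # 2
print(internal_covering_radius(5, [(2, 2.0), (10, 1.0)], dist))  # 6.0
```

The internal-node rule adds r(Or) because the promoted object must cover the whole region of each child, not only the child's routing object.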
21SPLIT POLICIES
- A specific implementation of the Promote and Partition methods defines a split policy
- An ideal split policy should promote two objects and partition the other objects so that the obtained regions have
- - Minimum volume
- - Minimum overlap
- How is it different from SAMs?
22PROMOTE Choosing Routing objects
- m_RAD: minimum sum of the two radii
- mM_RAD: minimizes the maximum of the two radii
- M_LB_DIST: maximum lower bound on distance
- RANDOM
- SAMPLING
23PARTITIONING-Distribution of Entries
- Generalized Hyperplane (unbalanced split (why?)): assign each object Oj ∈ N to the nearest routing object
- - if d(Oj, Op1) ≤ d(Oj, Op2) then
- - assign Oj to N1, else
- - assign Oj to N2.
- Balanced: compute d(Oj, Op1) and d(Oj, Op2) for all Oj ∈ N. Repeat until N is empty:
- - Assign to N1 the nearest neighbor of Op1 in N and remove it from N
- - Assign to N2 the nearest neighbor of Op2 in N and remove it from N.
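Both distribution strategies can be sketched in a few lines. The function names are illustrative, and for brevity the balanced version recomputes distances inside `min` instead of caching d(Oj, Op1) and d(Oj, Op2) up front as the slide describes.

```python
def hyperplane_partition(objects, p1, p2, d):
    """Generalized hyperplane: each object goes to the nearer promoted object."""
    n1, n2 = [], []
    for o in objects:
        (n1 if d(o, p1) <= d(o, p2) else n2).append(o)
    return n1, n2


def balanced_partition(objects, p1, p2, d):
    """Balanced: alternately hand N1 and N2 their nearest remaining object."""
    remaining = list(objects)
    n1, n2 = [], []
    while remaining:
        nearest = min(remaining, key=lambda o: d(o, p1))
        remaining.remove(nearest)
        n1.append(nearest)
        if remaining:
            nearest = min(remaining, key=lambda o: d(o, p2))
            remaining.remove(nearest)
            n2.append(nearest)
    return n1, n2


dist = lambda x, y: abs(x - y)
objs = [0, 1, 2, 3, 10]
print(hyperplane_partition(objs, 1, 10, dist))  # ([0, 1, 2, 3], [10])
print(balanced_partition(objs, 1, 10, dist))    # ([1, 0, 2], [10, 3])
```

The example shows why the hyperplane split is unbalanced: four of the five objects are nearer to 1, so they all land in N1. The balanced split evens out the node sizes but forces 3 into N2, which enlarges the covering radius of 10 from 0 to 7.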
24Experimental Results
- Assumed constant node size
- Tested all split policies
- Results
- - The Balanced partition method was shown to add significant overhead and to increase the I/O cost
- - The fastest split policy was observed to be RANDOM and the slowest m_RAD
- - For average volume covered per page (quality of tree construction), M_LB_DIST proved effective
25Experimental Results(2)
26I/O cost
27Avg Volume per page
28I/O cost
29I/O cost for M-Tree and R-Tree