M-Tree: An Efficient Access Method for Similarity Search in Metric Space

Slides: 31
Provided by: Ami987

1
M-Tree: An Efficient Access Method for Similarity
Search in Metric Space
  • Presenters
  • Amool Gupta
  • Amit Sharma

2
MOTIVATION
  • What basic problem does it address? (Why)
  • What other techniques solve the same problem, and
    how is this one a step ahead?
  • Fundamentals of this indexing structure

3
Similarity Search Problem
4
Similarity Searching
  • Effectiveness
  • - The way of formulating the similarity measures:
    a model of human perception
  • Efficiency
  • - The way of achieving the required performance
    over huge volumes of data: an index structure

7
Examples of Distance Functions
  • Lp metric functions (for vectors)
  • L1: Manhattan distance
  • L2: Euclidean distance
  • L∞: maximum (Chebyshev) distance
  • Edit distance (for strings)
  • Hausdorff distance
  • Earth mover's distance
  • Quadratic form distance
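
A minimal Python sketch of three of the distance functions listed above. The function names (lp_distance, linf_distance, edit_distance) are illustrative choices, not taken from the slides, and the edit distance assumes unit costs for insertion, deletion, and substitution.

```python
def lp_distance(x, y, p):
    """Minkowski (Lp) distance between two equal-length vectors, for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def linf_distance(x, y):
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def edit_distance(s, t):
    """Edit (Levenshtein) distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute or match
        prev = curr
    return prev[-1]
```

With p = 1 and p = 2, lp_distance gives the Manhattan and Euclidean distances respectively.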

8
Metric Spaces: An Abstraction of Similarity
  • A metric space M = (D, d) is a pair, where
  • D is a domain (universe) of values, and
  • d is a distance function that, ∀ x, y, z ∈ D,
    satisfies the metric axioms
  • d(x,y) ≥ 0, d(x,y) = 0 ⇔ x = y (positivity)
  • d(x,y) = d(y,x) (symmetry)
  • d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
  • All the distance functions seen in the previous
    examples are metrics, and so are the (weighted)
    Lp-norms
  • The only distance seen so far that does not fit
    the metric framework is dynamic time warping (DTW)
  • Metric indexes use only the metric axioms to
    organize objects, and exploit the triangle
    inequality to prune the search space
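
To make the pruning idea concrete: the triangle inequality implies |d(Q,P) − d(P,O)| ≤ d(Q,O) for any pivot object P, so two already-known distances give a lower bound on an unknown one. The sketch below assumes Euclidean points; the names lower_bound and can_skip are hypothetical helpers, not part of the presentation.

```python
import math


def lower_bound(d_q_p, d_p_o):
    """Lower bound on d(Q, O) derived from d(Q, P) and d(P, O)."""
    return abs(d_q_p - d_p_o)


def can_skip(d_q_p, d_p_o, radius):
    """True if O cannot lie within `radius` of Q, so d(Q, O) need not be computed."""
    return lower_bound(d_q_p, d_p_o) > radius


# Example: Q is far from pivot P, and O is close to P, so O is far from Q.
Q, P, O = (0.0, 0.0), (10.0, 0.0), (10.0, 1.0)
d_q_p = math.dist(Q, P)   # computed once for the query
d_p_o = math.dist(P, O)   # assumed to be stored in the index
```

Here lower_bound(d_q_p, d_p_o) is 9, so a query with radius 5 can discard O without ever computing d(Q, O).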

9
  • Limitations of SAMs (spatial access methods)
  • SAMs are limited to indexing DB objects
    represented by feature values in a
    multi-dimensional vector space (we need a more
    generic indexing strategy)
  • Dissimilarity of objects is measured by an Lp
    distance between feature values
  • They assume that distance computation is trivial
  • Limitations of metric trees
  • They do not support dynamic database environments
  • They reduce distance computations but pay no
    attention to I/O costs

10
What is a relative distance?
  • OA AB OB
  • AB OA OB
  • AB: relative position of B w.r.t. A

(Figure: points O, A, and B)
11
M-Tree
  • The key idea is to reduce the number of distance
    computations and, at the same time, reduce I/O.
  • The M-Tree partitions objects on the basis of their
    relative distances, as measured by a specific
    distance function, and stores these objects in
    nodes.

(Figure: a routing object Or with covering radius r(Or), its parent object P(Or), and the root)
12
M-Tree Structure
  • Leaf nodes store all indexed DB objects, by
    their keys or feature values.
  • Internal nodes are called routing nodes.
  • Each routing object Or is associated with
  • Or: feature value of the DB object
  • ptr(T(Or)): pointer to the root of sub-tree T(Or)
  • r(Or): covering radius, i.e. the maximum
    distance of the objects in sub-tree T(Or) from
    the routing object Or
  • d(Or, P(Or)): distance of the routing object from
    its parent object P(Or)

13
M-Tree Structure
  • Leaf node entry for a database object:
  • Oj: feature value of the DB object
  • oid(Oj): object key (identifier)
  • d(Oj, P(Oj)): distance of Oj from its parent P(Oj)
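
The two entry layouts described above might be modelled as follows. This is only a sketch: the class and field names (LeafEntry, RoutingEntry, Node, dist_to_parent, and so on) are assumptions chosen to mirror the slide notation, and entries in the root have no parent, so 0.0 is used there as a placeholder.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class LeafEntry:
    obj: Any                # Oj: feature value of the DB object
    oid: int                # oid(Oj): object key
    dist_to_parent: float   # d(Oj, P(Oj))


@dataclass
class RoutingEntry:
    obj: Any                # Or: feature value of the routing object
    subtree: "Node"         # ptr(T(Or)): root of the covered sub-tree
    radius: float           # r(Or): covering radius
    dist_to_parent: float   # d(Or, P(Or)); placeholder 0.0 in the root


@dataclass
class Node:
    is_leaf: bool
    entries: List[Any] = field(default_factory=list)


# A two-level example: one leaf with two objects, covered by one routing entry.
leaf = Node(is_leaf=True, entries=[
    LeafEntry(obj=(1.0, 2.0), oid=7, dist_to_parent=0.0),
    LeafEntry(obj=(1.0, 3.0), oid=8, dist_to_parent=1.0),
])
root = Node(is_leaf=False, entries=[
    RoutingEntry(obj=(1.0, 2.0), subtree=leaf, radius=1.0, dist_to_parent=0.0),
])
```

Storing dist_to_parent in every entry is what later lets a query reuse one computed distance for several entries.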

14
Processing Queries
  • SAMs generally try to prune the tree for a given
    query; the main emphasis is on developing an
    efficient pruning method that reduces the number
    of disk accesses. Once the tree is pruned, however,
    the distance from the query point Q to each point
    in the pruned tree must still be computed.
  • In contrast, the M-Tree emphasizes pruning
    as well as reducing distance computations,
    which it achieves by making maximum use of the
    precomputed distances stored in its nodes

15
Range Query
  • Given a query point Q and
  • a maximum search distance r(Q),
  • the range query range(Q, r(Q)) returns all objects
    Oj such that d(Oj, Q) ≤ r(Q)

Or is of interest if its region intersects the query
region. How can intersection be detected using
precomputed distances? If the distance between Q and
Or is at most the sum of the two radii, r(Q) + r(Or),
the regions intersect.

(Figure: query Q with radius r(Q), routing object Or with covering radius r(Or), its parent object P(Or), and the root)
16
Range Query
  • Leaf node

An object in a leaf node is a solution to the range
query if it lies within the query radius r(Q) of Q.
We can again use relative distances to find whether
the object lies within that radius or not.

(Figure: query Q with radius r(Q), leaf object Oj, its parent object P(Or), and the root)
17
Algorithm for Range Queries
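
The transcript does not include the algorithm itself, so what follows is a hedged sketch of a range search in the spirit of the two preceding slides, not the presenters' listing. It assumes Euclidean points, represents a node as a plain list of entries, and uses hypothetical names (Entry, range_search). Each entry is first tested with the stored parent distance; d(entry, Q) is computed only if that cheap test fails to exclude it.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Entry:
    obj: Point
    dist_to_parent: float                     # d(O, P(O)), stored in the tree
    radius: float = 0.0                       # r(Or); 0 for leaf entries
    subtree: Optional[List["Entry"]] = None   # None marks a leaf entry
    oid: Optional[int] = None                 # set only for leaf entries


def range_search(node, q, r_q, dist=math.dist, d_parent_q=None, out=None):
    """Collect the oids of all objects within r_q of q."""
    out = [] if out is None else out
    for e in node:
        # Cheap test: |d(P,Q) - d(O,P)| is a lower bound on d(O,Q).
        if d_parent_q is not None and \
                abs(d_parent_q - e.dist_to_parent) > r_q + e.radius:
            continue                          # pruned without computing d(O,Q)
        d = dist(e.obj, q)
        if d > r_q + e.radius:
            continue                          # regions do not intersect
        if e.subtree is not None:
            range_search(e.subtree, q, r_q, dist, d, out)
        else:
            out.append(e.oid)
    return out


# Example tree: two leaves, each covered by one routing entry in the root.
leaf_a = [Entry((0.0, 0.0), 0.0, oid=1), Entry((1.0, 0.0), 1.0, oid=2)]
leaf_b = [Entry((10.0, 0.0), 0.0, oid=3), Entry((10.0, 2.0), 2.0, oid=4)]
root = [Entry((0.0, 0.0), 0.0, radius=1.0, subtree=leaf_a),
        Entry((10.0, 0.0), 0.0, radius=2.0, subtree=leaf_b)]
```

For the query range_search(root, (10.0, 0.0), 1.0), the object with oid 4 is discarded by the cheap test alone, because its stored parent distance (2) already exceeds the query radius.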
18
K nearest neighbors queries
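  • Given a query point Q and
  • an integer k ≥ 1,
  • the k-NN query NN(Q, k) returns the k indexed
    objects that have the shortest distances to Q

(Figure: query Q, the min bound and max bound on the distance from Q to objects in the sub-tree of Or, covering radius r(Or), parent object P(Or), and the root)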
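
The min and max bounds on the distance from Q to any object under a routing object Or follow from the covering radius: max(d(Or,Q) − r(Or), 0) and d(Or,Q) + r(Or). Below is a simplified best-first k-NN sketch built on the min bound. It is an illustration under stated assumptions (Euclidean points, nodes as lists, hypothetical names Entry, d_min, d_max, knn), and it omits refinements such as using the max bound or stored parent distances to tighten the search.

```python
import heapq
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Entry:
    obj: Point
    radius: float = 0.0                       # r(Or); 0 for leaf entries
    subtree: Optional[List["Entry"]] = None   # None marks a leaf entry
    oid: Optional[int] = None


def d_min(d_or_q, r_or):
    """Smallest possible distance from Q to any object in T(Or)."""
    return max(d_or_q - r_or, 0.0)


def d_max(d_or_q, r_or):
    """Largest possible distance from Q to any object in T(Or)."""
    return d_or_q + r_or


def knn(root, q, k, dist=math.dist):
    pending = [(0.0, 0, root)]    # (min bound, tie-breaker, node)
    best = []                     # max-heap of (-distance, oid), size <= k
    counter = 1
    while pending:
        bound, _, node = heapq.heappop(pending)
        if len(best) == k and bound > -best[0][0]:
            break                 # no remaining node can improve the result
        for e in node:
            d = dist(e.obj, q)
            if e.subtree is not None:
                lo = d_min(d, e.radius)
                if len(best) < k or lo <= -best[0][0]:
                    heapq.heappush(pending, (lo, counter, e.subtree))
                    counter += 1
            elif len(best) < k:
                heapq.heappush(best, (-d, e.oid))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, e.oid))
    return sorted((-neg_d, oid) for neg_d, oid in best)


# Example tree: two leaves, each covered by one routing entry in the root.
leaf_a = [Entry((0.0, 0.0), oid=1), Entry((1.0, 0.0), oid=2)]
leaf_b = [Entry((10.0, 0.0), oid=3), Entry((10.0, 2.0), oid=4)]
root = [Entry((0.0, 0.0), radius=1.0, subtree=leaf_a),
        Entry((10.0, 0.0), radius=2.0, subtree=leaf_b)]
```

For the query point (9.0, 0.0) and k = 2, the sub-tree around (0, 0) has a min bound of 8 and is never opened, since both neighbors found in the other leaf are closer than that.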
19
SPLIT MANAGEMENT
  • The M-Tree grows in a bottom-up fashion
  • Overflow of a node N is managed by splitting N into
    two nodes, N and N′ (newly created)
  • PARTITION: the entries are distributed among N
    and N′
  • PROMOTE: two entries are promoted as routing
    objects and moved to the parent level

20
SPLIT MANAGEMENT
  • If the split node is a leaf, then the covering
    radius of a promoted object, say Op1, is set to
  • r(Op1) = max{ d(Oj, Op1) | Oj ∈ N1 }
  • whereas if overflow occurs in an internal node
  • r(Op1) = max{ d(Or, Op1) + r(Or) | Or ∈ N1 }
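
The two covering-radius rules translate directly into code. In this sketch the function names are illustrative, and dist stands for whatever metric the tree uses (the usage below assumes Euclidean points).

```python
import math


def leaf_covering_radius(promoted, objects, dist):
    """r(Op1) when a leaf is split: the farthest object assigned to N1."""
    return max(dist(o, promoted) for o in objects)


def internal_covering_radius(promoted, routing_entries, dist):
    """r(Op1) when an internal node is split.

    routing_entries is an iterable of (Or, r(Or)) pairs assigned to N1;
    each child's own covering radius is added to its distance from Op1.
    """
    return max(dist(o, promoted) + r for o, r in routing_entries)


# Example usage with Euclidean points.
r_leaf = leaf_covering_radius((0.0, 0.0), [(0.0, 0.0), (3.0, 4.0)], math.dist)
r_internal = internal_covering_radius(
    (0.0, 0.0), [((3.0, 4.0), 1.0), ((1.0, 0.0), 0.5)], math.dist)
```

The internal-node radius is an upper bound obtained through the triangle inequality, so it may be larger than the true maximum distance to the objects stored below.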

21
SPLIT POLICIES
  • A specific implementation of the Promote and
    Partition methods defines a split policy
  • An ideal split policy should promote two objects and
    partition the other objects so that the resulting
    regions have
  • - Minimum volume
  • - Minimum overlap
  • How is this different from SAMs?

22
PROMOTE: Choosing Routing Objects
  • m_RAD: minimizes the sum of the two radii
  • mM_RAD: minimizes the maximum of the two radii
  • M_LB_DIST: maximum lower bound on distance
  • RANDOM
  • SAMPLING
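
As one illustration of a promotion policy, the sketch below tries every pair of candidate objects, partitions the remaining objects by nearest candidate, and keeps the pair whose larger covering radius is smallest, which is the mM_RAD criterion as described on the slide. The function name promote_mm_rad, the exhaustive pair enumeration, and the nearest-candidate partition used for scoring are assumptions of this sketch.

```python
import math
from itertools import combinations


def promote_mm_rad(objs, dist=math.dist):
    """Return the pair (Op1, Op2) minimizing the maximum of the two radii."""
    best = None
    for p1, p2 in combinations(objs, 2):
        # Score the pair by assigning each object to its nearest candidate.
        n1 = [o for o in objs if dist(o, p1) <= dist(o, p2)]
        n2 = [o for o in objs if dist(o, p1) > dist(o, p2)]
        r1 = max((dist(o, p1) for o in n1), default=0.0)
        r2 = max((dist(o, p2) for o in n2), default=0.0)
        score = max(r1, r2)       # m_RAD would use r1 + r2 here instead
        if best is None or score < best[0]:
            best = (score, p1, p2)
    return best[1], best[2]


# Example: two well-separated clusters on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
```

On the example points the policy promotes the middle object of each cluster, (1.0, 0.0) and (11.0, 0.0), since both resulting radii are then 1. Trying all pairs costs many distance computations, which is consistent with the slower split times the deck reports for radius-based policies.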

23
PARTITIONING: Distribution of Entries
  • Generalized hyperplane (unbalanced split; why?):
    assign each object Oj ∈ N to the nearest routing
    object
  • if d(Oj,Op1) ≤ d(Oj,Op2) then
  • assign Oj to N1, else
  • assign Oj to N2.
  • Balanced: compute d(Oj,Op1) and d(Oj,Op2) for
    all Oj ∈ N, then repeat until N is empty:
  • Assign to N1 the nearest neighbor of Op1 in N
    and remove it from N
  • Assign to N2 the nearest neighbor of Op2 in N
    and remove it from N.
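
Both distribution strategies can be sketched in a few lines. The function names are illustrative, and for brevity the balanced version recomputes distances inside the loop rather than caching them up front as the slide suggests.

```python
import math


def hyperplane_partition(objs, p1, p2, dist=math.dist):
    """Generalized hyperplane: each object goes to its nearest routing object."""
    n1, n2 = [], []
    for o in objs:
        (n1 if dist(o, p1) <= dist(o, p2) else n2).append(o)
    return n1, n2


def balanced_partition(objs, p1, p2, dist=math.dist):
    """Balanced: alternately hand each routing object its nearest remaining object."""
    remaining = list(objs)
    n1, n2 = [], []
    while remaining:
        nearest = min(remaining, key=lambda o: dist(o, p1))
        remaining.remove(nearest)
        n1.append(nearest)
        if remaining:
            nearest = min(remaining, key=lambda o: dist(o, p2))
            remaining.remove(nearest)
            n2.append(nearest)
    return n1, n2


# Example: four objects near Op1 and only one near Op2.
objs = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
op1, op2 = (0.0, 0.0), (10.0, 0.0)
```

On this input the hyperplane split produces groups of sizes 4 and 1, which shows why it is unbalanced, while the balanced split produces sizes 3 and 2 at the cost of placing (3.0, 0.0) with the farther routing object and so enlarging that covering radius.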

24
Experimental Results
  • Assumed a constant node size
  • Tested all split policies
  • Results
  • The balanced partition method was shown to add
    significant overhead and to increase the I/O cost
  • The fastest split policy was observed to be RANDOM
    and the slowest m_RAD
  • Average volume covered per page (quality of tree
    construction): M_LB_DIST proved effective

25
Experimental Results (2)
26
I/O cost
27
Avg Volume per page
28
I/O cost
29
I/O cost for M-Tree and R-Tree
30
  • Thanks