On Fast Non-Metric Similarity Search by Metric Access Methods

1
On Fast Non-Metric Similarity Search by Metric
Access Methods
  • Tomáš Skopal (tomas_at_skopal.net)
  • Charles University in Prague, Faculty of Mathematics and
    Physics, Department of Software Engineering,
    Prague, Czech Republic

2
Presentation Outline
  • introduction
  • motivation of non-metric similarity search
  • metric access methods, intrinsic dimensionality
  • our objective: fast non-metric search
  • turning non-metric into metric
  • the TriGen algorithm
  • experimental results
  • conclusions and future work

3
Similarity Search in Multimedia Databases
  • non-structured data instances
  • multimedia objects, texts, sequences, time
    series, etc.
  • distance function d: U × U → R
  • d(O1,O2) is interpreted as a dissimilarity score of
    two objects
  • metric properties (∀ Oi, Oj, Ok ∈ U):
  • reflexivity: d(Oi, Oj) = 0 ⇔ Oi = Oj
  • positivity: d(Oi, Oj) > 0 ⇔ Oi ≠ Oj
  • symmetry: d(Oi, Oj) = d(Oj, Oi)
  • triangular inequality: d(Oi, Oj) + d(Oj, Ok) ≥
    d(Oi, Ok)
  • triangular triplet (a,b,c): a + b ≥ c ∧ a + c ≥
    b ∧ b + c ≥ a
  • when the triangular inequality is satisfied by d, then
    for every O1,O2,O3 ∈ U, (d(O1,O2), d(O2,O3),
    d(O1,O3)) is a triangular triplet
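
The triangular-triplet condition above is easy to state as a predicate. The following is a minimal sketch; the function name is_triangular is an illustrative choice, not something defined in the presentation.

```python
def is_triangular(a, b, c):
    """Return True if (a, b, c) is a triangular triplet,
    i.e. each value is at most the sum of the other two."""
    return a + b >= c and a + c >= b and b + c >= a


# A 3-4-5 triangle satisfies all three inequalities.
print(is_triangular(3, 4, 5))
# Here 1 + 1 < 4, so the triplet is not triangular.
print(is_triangular(1, 1, 4))
```

A measure d satisfies the triangular inequality on a dataset exactly when this predicate holds for the distance triplet of every three objects.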

4
Metric Access Methods
  • given a metric d and a dataset S ⊆ U, metric
    access methods (MAMs) can be used to organize
    the objects of S
  • Reason: fast query processing (range and k-nearest
    neighbor queries)
  • Principle of MAMs: structured decomposition of
    objects into equivalence classes, such that only
    some candidate classes have to be searched when
    querying
  • the filtering of non-relevant classes is possible
    due to the metric properties (esp. the triangular
    inequality)
  • Examples: M-tree, PM-tree, D-index, gh-tree,
    vp-tree, LAESA, etc.

5
Metric Access Methods, cont.
  • intrinsic dimensionality
  • definition (as proposed in [4]): ρ(S,d) = μ² /
    (2σ²) (μ is the mean and σ² is the variance of the distance
    distribution in S)
  • indicates how efficiently (quickly) a
    dataset S can be queried using a metric d
  • low ρ (e.g. below 10) means the dataset is
    well-structured, i.e. there exist tight
    clusters of objects
  • high ρ means the dataset is poorly structured,
    i.e. objects are almost equally distant
  • in consequence, intrinsically high-dimensional
    datasets are hard to organize, so querying
    becomes inefficient (it degrades to a sequential scan)
  • example: an M-tree hierarchy built on a
    high-dimensional dataset
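
As an illustration of the definition ρ(S,d) = μ² / (2σ²), the sketch below estimates the intrinsic dimensionality from all pairwise distances of a small dataset. The function name and the two toy datasets are assumptions made for this example, and a real dataset would normally be sampled rather than enumerated exhaustively.

```python
import itertools
import math
import statistics


def intrinsic_dimensionality(objects, dist):
    """Estimate rho = mu^2 / (2 * sigma^2) over all pairwise distances."""
    distances = [dist(x, y) for x, y in itertools.combinations(objects, 2)]
    mu = statistics.fmean(distances)
    var = statistics.pvariance(distances)
    if var == 0:
        # All objects equally distant: the worst case for indexing.
        return math.inf
    return mu * mu / (2 * var)


# Two tight clusters: distances are either very small or very large.
clustered = [(0.0,), (0.1,), (10.0,), (10.1,)]
# Corners of a simplex-like shape: all distances are close to each other.
spread = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]

rho_clustered = intrinsic_dimensionality(clustered, math.dist)
rho_spread = intrinsic_dimensionality(spread, math.dist)
print(rho_clustered, rho_spread)
```

The clustered dataset yields a much lower ρ than the nearly equidistant one, which matches the interpretation given on the slide.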

6
Metric vs. non-metric measures
  • non-metric measures are often robust (resistant
    to outliers, errors in objects, etc.)
  • the symmetry and mainly the triangular inequality
    are often violated
  • cannot be directly used with MAMs

7
Examples of Non-metric measures
  • various k-median distances
  • measure the distance between the two (k-th) most
    similar portions of objects
  • COSIMIR
  • back-propagation network with a single output
    neuron serving as a distance, allows training
  • Dynamic Time Warping distance
  • sequence alignment technique
  • minimizes the sum of distances between sequence
    elements
  • fractional Lp distances
  • generalization of Minkowski distances (p < 1)
  • more robust to extreme differences in coordinates
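
To make the non-metric behaviour concrete, the following sketch evaluates a fractional Lp distance with p = 0.5 on three points and shows a violation of the triangular inequality. The function name fractional_lp and the chosen points are illustrative assumptions.

```python
def fractional_lp(x, y, p):
    """Minkowski-style distance (sum |xi - yi|^p)^(1/p); non-metric for p < 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


a, b, c = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
d_ab = fractional_lp(a, b, 0.5)
d_bc = fractional_lp(b, c, 0.5)
d_ac = fractional_lp(a, c, 0.5)

# Going through b is "shorter" than going directly from a to c.
print(d_ab + d_bc, d_ac)
print(d_ab + d_bc >= d_ac)
```

Here the direct distance from a to c exceeds the sum of the two partial distances, so (d_ab, d_bc, d_ac) is not a triangular triplet.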

8
Turning Non-metric into Metric
  • reflexivity and positivity
  • by setting a minimum distance lower bound d⁻ < 0,
    i.e. O1 ≠ O2 ⇒ drp(O1, O2) = d(O1, O2) − d⁻ +
    some small value, otherwise drp(O1, O2) = 0
  • symmetry
  • e.g. ds(O1, O2) = min(d(O1, O2), d(O2, O1))
  • the query is processed using ds, and the query
    result is re-filtered using d
  • how to satisfy the triangular inequality?
  • we apply a modifying function f on d, making
    the semi-metric a metric
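
The symmetrization step can be sketched as follows: a range query is first answered with ds and the candidates are then re-filtered with the original d. The asymmetric measure asym and the linear scan standing in for a MAM are assumptions made only for this illustration.

```python
def asym(x, y):
    """Hypothetical asymmetric measure: moving 'down' costs twice as much."""
    return (y - x) if y >= x else 2 * (x - y)


def symmetrized(d):
    """Build ds(x, y) = min(d(x, y), d(y, x))."""
    return lambda x, y: min(d(x, y), d(y, x))


def range_query(query, radius, objects, d):
    """Linear scan standing in for a metric access method."""
    return [o for o in objects if d(query, o) <= radius]


dataset = [2, 4, 5, 6, 8]
ds = symmetrized(asym)

# Step 1: search with the symmetric ds (ds <= d, so no result is lost).
candidates = range_query(5, 1.5, dataset, ds)
# Step 2: re-filter the candidates with the original measure d.
result = [o for o in candidates if asym(5, o) <= 1.5]
print(candidates, result)
```

Because ds never exceeds d, the candidate set is a superset of the true answer, and the re-filtering step removes the false positives.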

9
SP-modifiers
  • Let f be a function f: R → R, such that f(0) = 0
    and f is increasing (i.e. f(x) > f(y) ⇔ x
    > y).
  • For similarity search purposes, f(d(·,·)),
    further denoted as df, can be safely used
    instead of just d. (In the case of a range query (Q,
    rQ), the query radius rQ is modified to
    f(rQ).)
  • Proof: all similarity orderings are preserved.
  • Consider the set of all pairs of objects from U.
  • Create an ordering of the pairs with respect to
    the distance between the two objects in each pair.
  • The ordering does not change after the
    application of any f on the distances, because
    f is increasing.
  • We call such a function f a similarity-preserving
    modifier (SP-modifier for short).
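
The ordering argument can be checked on a small example. The sketch below uses the squared difference as d and the square root as an increasing f; both choices, and the helper names, are assumptions for illustration only.

```python
import math


def sq_dist(x, y):
    """Squared difference: a semi-metric that violates the triangular inequality."""
    return (x - y) ** 2


def modified(d, f):
    """Build df(x, y) = f(d(x, y))."""
    return lambda x, y: f(d(x, y))


def knn(query, objects, d, k):
    """k nearest neighbours by linear scan (ties keep input order)."""
    return sorted(objects, key=lambda o: d(query, o))[:k]


def range_query(query, radius, objects, d):
    return [o for o in objects if d(query, o) <= radius]


data = [0.0, 1.0, 2.5, 4.0, 7.0]
df = modified(sq_dist, math.sqrt)

knn_d = knn(2.0, data, sq_dist, 3)
knn_df = knn(2.0, data, df, 3)

# For the range query, the radius is modified to f(rQ).
range_d = range_query(2.0, 5.0, data, sq_dist)
range_df = range_query(2.0, math.sqrt(5.0), data, df)
print(knn_d == knn_df, range_d == range_df)
```

Both query types return the same objects under d and under df, as the proof on this slide states.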

10
TG-modifiers
  • We want to find an SP-modifier that forces d
    to satisfy the triangular inequality
  • any concave SP-modifier f is metric-preserving
    (proof in [3])
  • when applied to any metric d(·,·), df is a metric
    as well
  • when applied to a triangular triplet (a,b,c),
    (f(a),f(b),f(c)) is a triangular triplet as well
  • any concave SP-modifier is triangle-generating
    (TG-modifier)
  • when applied to all possible triplets, some of
    them become triangular (theory of concave
    functions)
  • the more concave f is, the more triplets become
    triangular
  • once a triplet becomes triangular, it remains
    triangular after the application of any other
    TG-modifier
  • Theorem: Every semi-metric can be turned by a
    single TG-modifier into a metric.
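
A small numeric illustration of these properties, assuming the square root as the concave modifier (the helper names are hypothetical): a triangular triplet stays triangular, a mildly non-triangular one becomes triangular, and a harder one needs a more concave modifier obtained by nesting.

```python
import math


def is_triangular(a, b, c):
    return a + b >= c and a + c >= b and b + c >= a


def apply_modifier(f, triplet):
    """Apply f to every distance in the triplet."""
    return tuple(f(x) for x in triplet)


def nested_sqrt(x):
    """A more concave modifier built by nesting the square root."""
    return math.sqrt(math.sqrt(x))


ok = (3.0, 4.0, 5.0)      # already triangular
mild = (1.0, 1.0, 4.0)    # 1 + 1 < 4
hard = (1.0, 1.0, 16.0)   # still non-triangular after one sqrt

print(is_triangular(*apply_modifier(math.sqrt, ok)))
print(is_triangular(*mild), is_triangular(*apply_modifier(math.sqrt, mild)))
print(is_triangular(*apply_modifier(math.sqrt, hard)),
      is_triangular(*apply_modifier(nested_sqrt, hard)))
```

The last line shows why more concave modifiers are needed for some triplets: one square root is not enough for (1, 1, 16), while the nested one is.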

11
Proof: Incremental Triplet Stretching
We repeatedly apply TG-modifiers to all
distance triplets (generated by d(·,·) on S),
starting with a less concave TG-modifier and
proceeding with more concave ones.
We continue applying more and more concave
TG-modifiers (e.g. by nesting them) until we
turn all the triplets into triangular ones.
12
Optimal TG-modifier
  • There exist infinitely many TG-modifiers that
    turn a given semi-metric into a metric. However,
    not all are suitable for fast similarity search.
  • The optimal TG-modifier should:
  • turn every non-triangular triplet generated by d
    (considering the objects from S) into a
    triangular one (i.e. enforce the triangular
    inequality)
  • keep the intrinsic dimensionality of S with
    respect to df as low as possible

13
Scaling the concavity
  • How to find an optimal TG-modifier for a given d
    (and S)?
  • We make use of some predefined TG-bases
  • a TG-base is an extended TG-modifier that
    takes a concavity weight w ≥ 0 as a second
    parameter, i.e. f: R × R → R
  • for w = 0, the TG-base turns into the identity, i.e.
    f(x,0) = x
  • with increasing w, the TG-modifier f(x,w) becomes
    more concave
  • the greater w (thus the more concave f),
  • the more triplets become triangular
  • the higher the intrinsic dimensionality is
  • we can relax the strict condition that all
    triplets must become triangular by introducing a
    TG-error tolerance θ (a tolerated ratio of
    non-triangular triplets to all triplets)
  • a choice of exact or approximate search (θ = 0 or
    θ > 0)

14
Proposed TG-bases
  • general-purpose TG-bases
  • Fractional Power TG-base (FP-base)
  • Rational Bézier Quadric TG-bases (RBQ-bases)
  • each such TG-base is additionally parameterized by the
    second Bézier point (a,b)
  • choosing a different (a,b) lets us predefine the
    place of maximum concavity in the TG-base
  • we need to find an optimal w for a TG-base f,
    such that df becomes a metric, but w is as low as
    possible: the TriGen algorithm
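
The transcript names the FP-base without giving its formula. The sketch below assumes the fractional-power form f(x, w) = x^(1/(1+w)), which is consistent with the stated properties (identity for w = 0, more concave as w grows), and measures the ratio of non-triangular triplets for several weights. The function names, the squared Euclidean measure and the sample points are all assumptions for illustration.

```python
import itertools


def fp_base(x, w):
    """Assumed FP-base: identity for w = 0, increasingly concave for larger w."""
    return x ** (1.0 / (1.0 + w))


def tg_error(objects, dist, w):
    """Ratio of non-triangular triplets to all triplets under fp_base(., w)."""
    bad = total = 0
    for x, y, z in itertools.combinations(objects, 3):
        a = fp_base(dist(x, y), w)
        b = fp_base(dist(y, z), w)
        c = fp_base(dist(x, z), w)
        total += 1
        if not (a + b >= c and a + c >= b and b + c >= a):
            bad += 1
    return bad / total


def sq_euclid(p, q):
    """Squared Euclidean distance: a semi-metric, not a metric."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


points = [(0, 0), (1, 0), (0, 2), (3, 1), (4, 4)]
errors = [tg_error(points, sq_euclid, w) for w in (0.0, 0.5, 1.0)]
print(errors)
```

With w = 1 the assumed base is the square root, which turns the squared Euclidean distance back into the Euclidean metric, so no non-triangular triplets remain.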

15
The TriGen algorithm
The algorithm finds a TG-modifier (formed by a
TG-base and an appropriate concavity weight w)
that turns a given semi-metric d into an
(approximate) metric, while the intrinsic
dimensionality is kept as low as possible. The
algorithm repeatedly halves the concavity
interval when searching for the optimal
concavity weight.
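
The interval-halving idea can be sketched as below. This is a simplified illustration under stated assumptions, not the author's exact procedure: it uses a single assumed FP-base f(x, w) = x^(1/(1+w)), checks all triplets instead of a sample, and does not compare several TG-bases by intrinsic dimensionality. All names are hypothetical.

```python
import itertools
import math


def fp_base(x, w):
    """Assumed FP-base (identity for w = 0)."""
    return x ** (1.0 / (1.0 + w))


def tg_error(objects, dist, w):
    """Ratio of non-triangular triplets to all triplets under fp_base(., w)."""
    bad = total = 0
    for x, y, z in itertools.combinations(objects, 3):
        a, b, c = (fp_base(dist(p, q), w) for p, q in ((x, y), (y, z), (x, z)))
        total += 1
        bad += not (a + b >= c and a + c >= b and b + c >= a)
    return bad / total


def find_weight(objects, dist, tolerance=0.0, iterations=20):
    """Search for a low w whose TG-error does not exceed the tolerance."""
    if tg_error(objects, dist, 0.0) <= tolerance:
        return 0.0                      # d is already good enough
    w_lo, w_hi, w_best = 0.0, math.inf, None
    w = 1.0
    for _ in range(iterations):
        if tg_error(objects, dist, w) <= tolerance:
            w_hi = w_best = w           # feasible: try a less concave modifier
        else:
            w_lo = w                    # infeasible: more concavity is needed
        # Double w until a feasible upper bound exists, then halve the interval.
        w = (w_lo + w_hi) / 2 if math.isfinite(w_hi) else 2 * w
    return w_best


def sq_euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


points = [(0, 0), (1, 0), (0, 2), (3, 1), (4, 4)]
w_exact = find_weight(points, sq_euclid)
print(w_exact, tg_error(points, sq_euclid, w_exact))
```

Keeping w as low as possible matters because, as stated earlier, a more concave modifier raises the intrinsic dimensionality and therefore slows down querying. A positive tolerance trades exactness for a lower weight.
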
16
Experimental Results
  • The testbed
  • two datasets (one real: images (histograms); one
    synthetic: polygons)
  • 10 non-metric measures (6 for images, 4 for
    polygons)
  • TriGen was used to turn each
    semi-metric into a metric
  • 2 MAMs: M-tree and PM-tree
  • Testing of:
  • 1) intrinsic dimensionalities of the datasets
    (with respect to df, where f is the TG-modifier
    found by TriGen)
  • 2) k-NN query performance and
    retrieval error (when the TG-error tolerance θ >
    0)

17
Experiments: intrinsic dimensionalities
18
Experiments: k-NN queries
19
Experiments: k-NN queries
20
Conclusions and Future Work
  • We have presented:
  • a way of fast searching in non-metric datasets by
    metric access methods
  • in particular, the TriGen algorithm for turning
    any semi-metric into a metric
  • Future work:
  • a generalized framework for fast exact and
    approximate similarity search (either metric or
    non-metric), a combination with previous work [2]

21
References
  • [1] T. Skopal, J. Pokorný, V. Snášel: Nearest
    Neighbours Search using the PM-tree. In DASFAA
    2005, Beijing, China, pages 803-815. LNCS 3453,
    Springer.
  • [2] T. Skopal, P. Moravec, J. Pokorný, V.
    Snášel: Metric Indexing for the Vector Model in
    Text Retrieval. In SPIRE 2004, Padova, Italy,
    pages 183-195. LNCS 3246, Springer.
  • [3] P. Corazza: Introduction to metric-preserving
    functions. American Mathematical Monthly 104(4),
    1999.
  • [4] E. Chávez, G. Navarro: A Probabilistic Spell
    for the Curse of Dimensionality. In ALENEX 2001,
    LNCS 2153, Springer.