Title: Nearest Neighbours Search using the PM-tree
1. Nearest Neighbours Search using the PM-tree
- Tomáš Skopal (1), Jaroslav Pokorný (1), Václav Snášel (2)
- (1) Charles University in Prague, Department of Software Engineering, Czech Republic
- (2) VSB - Technical University of Ostrava, Department of Computer Science, Czech Republic
2. Presentation Outline
- Similarity search in Metric Spaces
- M-tree
  - the structure
  - k-NN search
- PM-tree (an extension of M-tree)
  - motivation
  - the structure
  - k-NN search
- Experimental Results
3. Similarity search in Metric Spaces
- Similarity search
  - methods for content-based retrieval in multimedia databases
  - the similarity measure is often modelled by a metric d (satisfying the triangular inequality, symmetry, reflexivity, non-negativity)
  - similarity queries (query by example) are realized as metric queries
    - range query (Q, rQ), specified by a query object Q and a query radius rQ
    - k-NN query (Q, k), specified by a query object Q and the number of nearest neighbours k
- Metric Access Methods (MAMs)
  - designed to search in metric datasets while keeping the search costs minimal
    - search costs = number of distance computations + I/O costs
  - only distances between objects are used for indexing (the structure of the object representation is not used for indexing)
  - many MAMs are not suitable for similarity search in large datasets
    - they are either static methods or have high I/O search costs
  - M-tree and (recently) D-index are the only suitable candidates so far
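The two query types can be illustrated by a sequential scan, the baseline that a MAM tries to beat. This is a minimal sketch: the Euclidean distance stands in for an arbitrary metric d, and the function names and sample points are illustrative assumptions, not part of the presentation.

```python
import heapq
import math

def euclidean(a, b):
    """L2 distance, used here as one example of a metric d."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(data, q, r_q, d=euclidean):
    """Range query (Q, rQ): all objects within distance r_q of q."""
    return [o for o in data if d(q, o) <= r_q]

def knn_query(data, q, k, d=euclidean):
    """k-NN query (Q, k): the k objects closest to q."""
    return heapq.nsmallest(k, data, key=lambda o: d(q, o))

# Illustrative data; every query costs len(data) distance computations.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.0, 0.5)
in_range = range_query(data, q, 1.2)
nearest = knn_query(data, q, 2)
```

Both calls compute d(Q, O) for every object, which is exactly the cost an index structure is meant to reduce.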
4. M-tree (metric tree)
- dynamic, balanced, and paged tree structure (like e.g. the B-tree or R-tree)
- the leaves are clusters of indexed objects Oj (ground objects)
- routing entries in the inner nodes represent hyper-spherical metric regions (Oi, rOi), recursively bounding the object clusters in the leaves
- the triangular inequality allows discarding irrelevant M-tree branches (i.e. metric regions) during query evaluation
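The discarding step can be sketched as a single comparison. The function name and the sample region below are illustrative assumptions; the test itself follows from the triangular inequality.

```python
import math

def ball_overlaps_query(center, radius, q, r_q, d=math.dist):
    """Return True if the region (center, radius) may hold objects within r_q of q.

    By the triangular inequality, every object O in the region satisfies
    d(q, O) >= d(q, center) - radius, so the region can be discarded
    whenever d(q, center) - radius > r_q.
    """
    return d(q, center) - radius <= r_q

region = ((0.0, 0.0), 1.0)  # (Oi, rOi), illustrative values
far = ball_overlaps_query(*region, q=(5.0, 0.0), r_q=2.0)   # branch can be discarded
near = ball_overlaps_query(*region, q=(2.5, 0.0), r_q=2.0)  # branch must be visited
```

Only the distance from Q to the routing object is computed; none of the objects below the routing entry are touched when the test fails.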
5. k-NN search in the M-tree
- branch-and-bound algorithm (similar to that of the R-tree)
- modification of the range query algorithm, but the query radius rQ is dynamic
  - rQ decreases from infinity to the distance to the k-th neighbour
- two structures are used: a priority queue PR and a sorted array NN
- PR stores requests for nodes not yet filtered from the search
  - a request has the form ⟨routing entry to a node N, dmin(N)⟩, where dmin(N) is the lower-bound distance from Q to all possible objects in N, i.e.
    $d_{\min}(N) = \max\{0,\ d(Q, O_i) - r_{O_i}\}$
    where (Oi, rOi) is the region of N's routing entry (requests in PR are sorted by dmin(N))
- NN stores k candidate objects (or distance upper bounds)
  - at the end of the algorithm run, NN contains the result, i.e. the k nearest neighbours
  - an entry has the form ⟨candidate object Oi, d(Q, Oi)⟩ or ⟨-, dmax(N)⟩, where dmax(N) is the upper-bound distance from Q to all possible objects in N, i.e.
    $d_{\max}(N) = d(Q, O_i) + r_{O_i}$
- PR stores only requests with dmin() < dmax(); other requests are removed from PR
  - i.e. requests that do not overlap the dynamic query region (Q, rQ) are removed
- Query processing: the requests in PR are processed from the head of the queue (lowest dmin first) → a node N is retrieved, while the PR and NN structures are updated by the routing/ground entries of N
  - PR is initialized to ⟨root, ∞⟩, NN is initialized with k entries ⟨-, ∞⟩, i.e. (⟨-, ∞⟩, ⟨-, ∞⟩, ...)
- optimal in I/O costs (the same I/O costs as the range query (Q, d(Q, NNk)), where NNk is the k-th nearest neighbour)
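The algorithm above can be sketched on a toy in-memory tree. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Node` class and `knn_search` are hypothetical names, nodes are assumed non-empty, the root request uses dmin = 0, and a node's ⟨-, dmax⟩ placeholder is dropped from NN once the node itself is read so that no object is counted twice.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    center: tuple                                   # routing object Oi
    radius: float                                   # covering radius rOi
    children: list = field(default_factory=list)    # inner node: child nodes
    objects: list = field(default_factory=list)     # leaf: ground objects

def knn_search(root, q, k, d=math.dist):
    """Return ([(distance, object), ...], number of nodes accessed)."""
    order = itertools.count()                 # tie-breaker for equal dmin
    pr = [(0.0, next(order), root)]           # requests, smallest dmin first
    nn = [(math.inf, None, None)] * k         # (distance or dmax, object, owner node)
    accessed = 0

    def nn_insert(entry):
        nn.append(entry)
        nn.sort(key=lambda e: e[0])
        nn.pop()                              # keep only the k smallest entries

    while pr:
        dmin, _, node = heapq.heappop(pr)
        if dmin > nn[-1][0]:                  # outside the dynamic region (Q, rQ)
            break                             # so are all remaining requests
        accessed += 1
        # The node is read now: drop its own dmax placeholder (assumption above).
        for i, entry in enumerate(nn):
            if entry[2] is node:
                nn[i] = (math.inf, None, None)
        nn.sort(key=lambda e: e[0])
        for o in node.objects:                # ground entries
            dist = d(q, o)
            if dist < nn[-1][0]:
                nn_insert((dist, o, None))
        for child in node.children:           # routing entries
            dist = d(q, child.center)
            c_dmin = max(0.0, dist - child.radius)
            c_dmax = dist + child.radius
            if c_dmin <= nn[-1][0]:
                heapq.heappush(pr, (c_dmin, next(order), child))
            if c_dmax < nn[-1][0]:
                nn_insert((c_dmax, None, child))   # upper bound shrinks rQ early
    return [(dist, o) for dist, o, _ in nn if o is not None], accessed

# Illustrative tree: three leaves under one root.
leaf_a = Node((0.0, 0.0), 1.5, objects=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
leaf_b = Node((5.0, 5.0), 1.5, objects=[(5.0, 5.0), (6.0, 5.0)])
leaf_c = Node((10.0, 0.0), 1.0, objects=[(10.0, 0.0), (10.0, 1.0)])
root = Node((5.0, 2.0), 6.0, children=[leaf_a, leaf_b, leaf_c])
result, accessed = knn_search(root, (0.2, 0.1), k=2)
```

In this toy tree only the root and one leaf are read: one leaf is never queued because its dmin already exceeds the dmax bounds stored in NN, and the other is cut off once two real neighbours are known.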
6. k-NN search in M-tree, example (k = 2)
7. k-NN search in M-tree, example (k = 2)
8. k-NN search in M-tree, example (k = 2)
- 5 nodes accessed, the same nodes as accessed by the range query (Q, d(Q, O5))
9. PM-tree motivation
- metric regions in the M-tree are unnecessarily large
  - → indexing of large portions of empty space (the dead space)
  - → higher probability of intersection with the query region
  - → less efficient search
- reduction of the metric region volume should lead to more effective discarding of irrelevant subtrees
- the question is how to specify a compact metric region bounding all the objects more tightly → generalization of the M-tree to other metric region shapes
10. PM-tree region
- utilization of global pivots (inspired by LAESA-like methods)
  - given a fixed set of p global pivots Pi (selected from (a part of) the dataset)
  - p hyper-ring regions (Pi, HRi) are defined for each routing entry
    - array HR of p intervals ⟨HRi.min, HRi.max⟩
    - each interval HRi bounds the distances of objects to the respective pivot Pi
- PM-tree region = M-tree region + HR array (pivots Pi are shared by all PM-tree regions)
  - the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects in the leaves
  - the more pivots, the more tightly bounded the region
- the PM-tree is built the same way as the M-tree, i.e. the hyper-rings only cut off the M-tree sphere
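Computing the HR array for one cluster of objects can be sketched as follows. The function name, the pivots and the cluster are illustrative assumptions; a real PM-tree would maintain these intervals incrementally during insertions.

```python
import math

def hyper_rings(objects, pivots, d=math.dist):
    """For each pivot Pt return (HRt.min, HRt.max) over the given objects."""
    rings = []
    for p in pivots:
        dists = [d(p, o) for o in objects]
        rings.append((min(dists), max(dists)))
    return rings

pivots = [(0.0, 0.0), (10.0, 0.0)]     # global pivots, shared by all regions
cluster = [(3.0, 4.0), (6.0, 8.0)]     # objects under one routing entry
hr = hyper_rings(cluster, pivots)      # one interval per pivot
```

Each interval costs one distance per object and pivot at build time, but is reused by every later query.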
11. PM-tree, query processing
- distances d(Q, Pi) for all i ≤ p must be computed prior to processing a query
- a metric region (Oi, rOi, HR) is relevant to (intersected by) a range query (Q, rQ) exactly when all the hyper-rings and the hyper-sphere overlap the range query region
  - → the more hyper-rings, the lower the probability of intersection with the query
  - → no additional distance computations are needed for the intersection test
[Figure: a query Q tested against an M-tree region and against a PM-tree region]
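The relevance test for a range query can be sketched as below. The function and variable names are hypothetical and the numbers are illustrative; the point is that the test only consumes distances that are already available: d(Q, Oi), needed by the M-tree anyway, and the d(Q, Pt) values computed once per query.

```python
import math

def pm_region_overlaps(d_q_center, radius, hr, d_q_pivots, r_q):
    """True if the region (Oi, rOi, HR) may intersect the range query (Q, rQ)."""
    if d_q_center > radius + r_q:             # hyper-sphere test, as in the M-tree
        return False
    # Every hyper-ring <lo, hi> must overlap the interval [d(Q,Pt) - rQ, d(Q,Pt) + rQ].
    return all(dqp - r_q <= hi and dqp + r_q >= lo
               for (lo, hi), dqp in zip(hr, d_q_pivots))

pivots = [(0.0, 0.0)]
q, r_q = (2.0, 0.0), 1.5
center, radius = (6.0, 0.0), 3.0
hr = [(8.0, 8.6)]                             # objects lie 8.0 to 8.6 from the pivot

d_q_pivots = [math.dist(q, p) for p in pivots]   # computed once per query
d_q_center = math.dist(q, center)
sphere_only = pm_region_overlaps(d_q_center, radius, [], [], r_q)
with_rings = pm_region_overlaps(d_q_center, radius, hr, d_q_pivots, r_q)
```

Here the hyper-sphere alone overlaps the query, while the single hyper-ring already excludes it, so the subtree would be visited by an M-tree but discarded by a PM-tree.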
12. k-NN search in the PM-tree
- 3 modifications of the M-tree's k-NN algorithm
  - different intersection test between the query region (Q, rQ) and the PM-tree region (Oi, rOi, HR):
    $\bigwedge_{t=1}^{p} \left( d(P_t, Q) - r_Q \le HR_t.\mathrm{max} \ \wedge\ d(P_t, Q) + r_Q \ge HR_t.\mathrm{min} \right)$
  - different dmin construction (possible distance increase to the farthest hyper-ring):
    $d_{\min}(N) = \max\{0,\ d(Q, O_i) - r_{O_i},\ HR_{\mathrm{farthest}}\}$
    $HR_{\mathrm{farthest}} = \max_{t=1..p} \max\{ d(P_t, Q) - HR_t.\mathrm{max},\ HR_t.\mathrm{min} - d(P_t, Q) \}$
  - different dmax construction (possible distance decrease to the farthest object in the nearest hyper-ring):
    $d_{\max}(N) = \min\{ d(Q, O_i) + r_{O_i},\ HR_{\mathrm{nearest}} \}$
    $HR_{\mathrm{nearest}} = \min_{t=1..p} \{ d(P_t, Q) + HR_t.\mathrm{max} \}$
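The modified bounds can be sketched as two small functions. This is a sketch under assumptions: the names are hypothetical, the numbers are illustrative, and dmax is taken as the smaller of the sphere bound and the tightest ring bound d(Pt, Q) + HRt.max, which is what the triangular inequality guarantees.

```python
import math

def pm_dmin(d_q_center, radius, hr, d_q_pivots):
    """Lower bound on d(Q, O) for any object O in the region (Oi, rOi, HR)."""
    hr_farthest = max(
        (max(dqp - hi, lo - dqp) for (lo, hi), dqp in zip(hr, d_q_pivots)),
        default=0.0,                           # no pivots: plain M-tree bound
    )
    return max(0.0, d_q_center - radius, hr_farthest)

def pm_dmax(d_q_center, radius, hr, d_q_pivots):
    """Upper bound on d(Q, O) for any object O in the region (Oi, rOi, HR)."""
    hr_nearest = min(
        (dqp + hi for (_, hi), dqp in zip(hr, d_q_pivots)),
        default=math.inf,                      # no pivots: plain M-tree bound
    )
    return min(d_q_center + radius, hr_nearest)

# Illustrative values: d(Q, Oi) = 4, rOi = 3, two pivots.
hr = [(8.0, 8.6), (5.9, 6.5)]                  # (HRt.min, HRt.max) per pivot
d_q_pivots = [2.0, 0.1]                        # d(Pt, Q), computed once per query
m_tree_bounds = (pm_dmin(4.0, 3.0, [], []), pm_dmax(4.0, 3.0, [], []))
pm_tree_bounds = (pm_dmin(4.0, 3.0, hr, d_q_pivots), pm_dmax(4.0, 3.0, hr, d_q_pivots))
```

With these values the sphere alone gives the interval from 1 to 7, while the hyper-rings narrow it to roughly 6 to 6.6, so the request is both queued later and shrinks rQ sooner.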
13. k-NN search in PM-tree, example (k = 2)
14. k-NN search in PM-tree, example (k = 2)
15. k-NN search in PM-tree, example (k = 2)
- 5 nodes accessed, the same nodes as accessed by the range query (Q, d(Q, O5))
16. Experimental Results (synthetic datasets)
- synthetic vector datasets (4D to 60D), 100,000 tuples, 1000 clusters
- disk page sizes 1 KB to 4 KB, index sizes 4.5 MB to 55 MB
17. Experimental Results (image database)
- WBIIS image database, approx. 10,000 256D vectors (gray histograms)
- disk page size 32 KB, index sizes 16 MB to 20 MB
18. References
- [1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases. ADBIS 2004, Budapest, Hungary
- [2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search. DATESO 2004, Desná, Czech Republic
- [3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles. ADBIS 2003, Dresden, Germany, LNCS 2798, Springer
- [4] Skopal T.: Metric Indexing in Information Retrieval. PhD thesis, VSB-Technical University of Ostrava. http://urtax.ms.mff.cuni.cz/skopal/phd/thesis.pdf