Title: Christian B
1 Christian Böhm
University for Health Informatics and Technology, Innsbruck
Similarity Search and Data Mining: Database Techniques Supporting Next Decade's Applications
Keynote at iiWAS 2002
2 1. Similarity Search
3 Feature Based Similarity
4 Simple Similarity Queries
- Specify a query object, then either
- find all similar objects (range query), or
- find the k most similar objects (k-nearest neighbor query)
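The two query types can be sketched with a plain sequential scan. This is a minimal illustration, assuming feature vectors are tuples of floats and similarity is Euclidean distance; the names `range_query` and `knn_query` are hypothetical and not from the talk.

```python
import heapq
import math


def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_query(points, query, eps):
    """Return all points whose distance to the query is at most eps."""
    return [p for p in points if euclidean(p, query) <= eps]


def knn_query(points, query, k):
    """Return the k points closest to the query, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: euclidean(p, query))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
query = (0.0, 0.0)
in_range = range_query(points, query, 1.5)
nearest = knn_query(points, query, 2)
```

Both functions touch every point, which is exactly the sequential scan that an index structure tries to beat.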
5 Multidimensional Index Structure (R-tree)
6 Range Query with Depth-First Traversal
7 Nearest Neighbor Priority Algorithm
Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995
4 page accesses
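The priority algorithm can be sketched as a best-first search over a priority queue ordered by minimum distance. This is a simplified illustration, not the authors' code: the `Node` class is a hypothetical stand-in for an R-tree page (built by hand, no insertion logic), and every popped node counts as one page access.

```python
import heapq
import itertools
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


class Node:
    """Hypothetical tree page: either child pages or data points, plus its bounding box."""

    def __init__(self, children=None, points=None):
        self.children = children or []
        self.points = points or []
        boxes = [(p, p) for p in self.points] + [(c.lo, c.hi) for c in self.children]
        self.lo = tuple(min(v) for v in zip(*(b[0] for b in boxes)))
        self.hi = tuple(max(v) for v in zip(*(b[1] for b in boxes)))


def mindist(q, node):
    """Smallest possible distance between q and any point inside the node's box."""
    return math.sqrt(sum(max(lo - x, x - hi, 0.0) ** 2
                         for x, lo, hi in zip(q, node.lo, node.hi)))


def nearest_neighbor(root, q):
    """Best-first search: always expand the queue entry with the smallest distance."""
    tie = itertools.count()  # tie-breaker so heap entries never compare nodes or points
    heap = [(mindist(q, root), next(tie), root)]
    accesses = 0
    while heap:
        dist, _, item = heapq.heappop(heap)
        if not isinstance(item, Node):
            return item, dist, accesses  # first point popped is the nearest neighbor
        accesses += 1
        for child in item.children:
            heapq.heappush(heap, (mindist(q, child), next(tie), child))
        for p in item.points:
            heapq.heappush(heap, (euclidean(q, p), next(tie), p))
    return None, math.inf, accesses


leaf_a = Node(points=[(1.0, 1.0), (2.0, 2.0)])
leaf_b = Node(points=[(8.0, 8.0), (9.0, 9.0)])
leaf_c = Node(points=[(5.0, 1.0), (6.0, 2.0)])
root = Node(children=[leaf_a, leaf_b, leaf_c])
q = (1.5, 1.2)
nn, dist, accesses = nearest_neighbor(root, q)
```

In this toy tree only the root and `leaf_a` are read; the other two leaves stay in the queue because their minimum distance exceeds the distance of the point that is returned.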
8 Problems of High-Dim. Index Structures
- Curse of dimensionality
- Search performance of index deteriorates in high dimensions
- Index is outperformed by sequential scan
- Solution
- Optimize various parameters of index structures
- Needed: a cost model for queries. How many pages are expected to be accessed for
- Range queries (with given ε)
- Nearest neighbor queries (with given k)
9 Cost Estimation (Uniformity/Independence)
- Minkowski sum: estimation of the access probability of a page
Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000
Nearest neighbor: estimate the distance by point density
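The Minkowski-sum idea can be illustrated for the simplest case. The sketch below is not the formula from the cited paper; it assumes a unit data space, uniformly distributed query points, rectangular page regions and the maximum metric, so that enlarging a page region by the query range ε adds 2ε to each side. Clipping each side to 1 is a crude stand-in for boundary handling. The function names are hypothetical.

```python
def access_probability(side_lengths, eps):
    """Volume of a page region enlarged by eps on every side (maximum metric),
    i.e. the probability that a uniformly placed range query touches the page."""
    prob = 1.0
    for a in side_lengths:
        prob *= min(a + 2.0 * eps, 1.0)  # clip to the unit data space
    return prob


def expected_page_accesses(pages, eps):
    """Sum of the access probabilities of all page regions."""
    return sum(access_probability(sides, eps) for sides in pages)


pages = [(0.5, 0.5), (0.2, 0.3)]
estimate = expected_page_accesses(pages, 0.1)

# In 16 dimensions a page that was split once per dimension is hit by
# every query with eps = 0.25 under these assumptions.
high_dim = access_probability((0.5,) * 16, 0.25)
```

The second call shows the curse of dimensionality in miniature: the enlarged region already covers the whole data space.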
10 Cost Estimation
- Boundary and saturation effects in high-dim. space (considered by our model extension)
- Correlation between attributes (considered by the concept of fractal dimension)
- Cluster structure also has an impact on performance
- Currently neglected by our model
- Histograms and similar data descriptions are difficult in high-dimensional space (number of histogram bins is exponential in the dimensionality)
- Other descriptions of cluster structure (dendrograms)
- Subject to future work
11 Optimization of Index Structures
- To prevent index-based query processing from being outperformed by the sequential scan
- Optimize various parameters such as
- Logical block size of the index pages
- Indexed dimension
- I/O schedule optimization (fast index scan)
- Data quantization
- Observe the balance! (Master Confucius)
12 Page Size Optimization
13 Page Size Optimization
14 Optimized Dimension Assignment
Hi-dim. index (R-tree). Problem in hi-dim: too few splits in each dimension
Inverted lists (B-trees) with matching. Problem in hi-dim: too many results in each dimension
15 Optimized Dimension Assignment
Compromise between hi-dim. index (R-tree) and inverted lists (B-trees): a moderate number of R-trees, each indexing a few dimensions
OPTIMIZE!
16 Schedule Optimization (Fast Index Scan)
Range query: required pages are known from the directory
17 Schedule Optimization (NN Queries)
- Current expenses are traded for possible later savings
- Start at 100 page and extend forward and backward
- Optimize the cumulated cost balance (CCB)
18 Quantization
- Approximate the points by a quantization grid based on quantiles
- Benefit: fewer bits for representation
- Cost: grid cell partially intersected ⇒ access the original point data
- How to choose the grid resolution?
Weber, Schek, Blott: A Quantitative Analysis and Performance Study..., VLDB 1998
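A quantile-based grid can be sketched as follows. This is an illustrative simplification rather than the cited method: each dimension is cut into 2^bits cells that hold roughly equal numbers of points, and a point is replaced by its tuple of cell numbers. The helper names are hypothetical.

```python
import bisect


def quantile_boundaries(values, bits):
    """Interior cell boundaries so that each of the 2**bits cells holds about
    the same number of values."""
    cells = 2 ** bits
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // cells] for i in range(1, cells)]


def build_grid(points, bits):
    """One list of quantile boundaries per dimension."""
    dims = len(points[0])
    return [quantile_boundaries([p[d] for p in points], bits) for d in range(dims)]


def quantize(point, grid):
    """Approximate a point by the cell number it falls into in each dimension."""
    return tuple(bisect.bisect_right(bounds, x) for x, bounds in zip(point, grid))


points = [(float(i), float(7 - i)) for i in range(8)]
grid = build_grid(points, bits=2)          # 4 cells per dimension
codes = [quantize(p, grid) for p in points]
```

With `bits=2` each coordinate needs 2 bits instead of a full float. The price named on the slide still applies: when a query region only partially intersects a cell, the exact point has to be fetched to decide.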
19 Independent Quantization (IQ tree)
Combines index, scan, and quantization
Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000
Grid resolution optimized by cost model
20 Open Research Problems in Optimization
- Multi-Parameter Optimization
- How can parameters be optimized simultaneously?
- Are there conflicts between optimization goals?
- Example
Uniform data ⇒ Quantization
Correlated data ⇒ Tree Striping
21 Open Research Problems in Optimization
- Consider Insert/Delete/Update
- If the data set faces heavy updates, the constructed index should look different from the index for a more static data set
- Update-bound: construct the index rather simply
- Query-bound: spend more effort to organize the data
- Can be considered as an optimization problem
22 2. Data Mining
23 KDD Algorithms Based on Similarity Queries
24 Join Applications
- Catalogue conversion (catalogue matching)
- e.g. astronomy catalogues
25 Clustering
- Clustering (e.g. DBSCAN)
Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters, KDD 1996
26 Cache Behavior
27 Clustering and Similarity Join
- DBSCAN uses the similarity join as its basic operation
Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000
28 k-Nearest Neighbor Classification
Objects with known class
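A k-nearest neighbor classifier can be sketched on top of a nearest neighbor query. This is a minimal illustration, assuming Euclidean distance and a simple majority vote among the k nearest objects with known class; `knn_classify` is a hypothetical name.

```python
import heapq
import math
from collections import Counter


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(training, query, k):
    """Assign the class that occurs most often among the k nearest labeled objects."""
    neighbors = heapq.nsmallest(k, training, key=lambda item: euclidean(item[0], query))
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Objects with known class: (feature vector, class label)
training = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"),
]
label_near_a = knn_classify(training, (0.5, 0.5), k=3)
label_near_b = knn_classify(training, (5.0, 5.5), k=3)
```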
29 Distance Range Join (ε-Join)
- Most widespread and best evaluated join
- Often also called the similarity join
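As a reference point, the ε-join can be written as a nested loop over both sets. This is a brute-force sketch assuming Euclidean distance, useful mainly for checking faster index-based variants; `range_join` is a hypothetical name.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_join(R, S, eps):
    """All pairs (r, s) from R x S whose distance is at most eps."""
    return [(r, s) for r in R for s in S if euclidean(r, s) <= eps]


R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 4.0), (9.0, 9.0)]
pairs = range_join(R, S, 1.0)
```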
30 k-Closest Pair Query
SELECT * FROM R, S
ORDER BY ||R.obj - S.obj||
STOP AFTER k
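The query can be mirrored by a brute-force sketch that ranks all pairs of the cross product by distance and keeps the first k. It assumes Euclidean distance; `k_closest_pairs` is a hypothetical name.

```python
import heapq
import math
from itertools import product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_closest_pairs(R, S, k):
    """The k pairs from R x S with the smallest distance, closest first."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: euclidean(*pair))


R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 4.5), (9.0, 9.0)]
closest = k_closest_pairs(R, S, 2)
```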
31 k-Nearest Neighbor Join
SELECT * FROM R, S
GROUP BY R.obj
ORDER BY ||R.obj - S.obj||
STOP AFTER K (≠ k)
- In SQL notation
- (limited to k = 1)
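In contrast to the k-closest pair query, the k-nearest neighbor join keeps k partners for every object of R. A brute-force sketch, assuming Euclidean distance and a hypothetical function name `knn_join`:

```python
import heapq
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_join(R, S, k):
    """Map every object of R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: euclidean(r, s)) for r in R}


R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (1.0, 1.0), (5.0, 4.0), (6.0, 6.0)]
joined = knn_join(R, S, 2)
```

The result always holds k partners per object of R, so its size is fixed in advance, which is not the case for the ε-join.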
32 R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε);
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if ||p - q|| ≤ ε then report (p,q);
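The RSJ procedure can be tried out with a small in-memory sketch. The `Page` class below is a hypothetical stand-in for an R-tree page that is built by hand; `CacheLoad` is omitted because nothing is read from disk, and both trees are assumed to have the same height, as in the slide's pseudocode.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


class Page:
    """Hypothetical index page: a directory page has children, a data page has points."""

    def __init__(self, children=None, points=None):
        self.children = children or []
        self.points = points or []
        boxes = [(p, p) for p in self.points] + [(c.lo, c.hi) for c in self.children]
        self.lo = tuple(min(v) for v in zip(*(b[0] for b in boxes)))
        self.hi = tuple(max(v) for v in zip(*(b[1] for b in boxes)))

    def is_directory(self):
        return bool(self.children)


def mindist(a, b):
    """Smallest possible distance between any two points of the two page regions."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))


def r_tree_sim_join(R, S, eps, out):
    if R.is_directory() and S.is_directory():
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:      # only page pairs that can contain results
                    r_tree_sim_join(r, s, eps, out)
    else:                                     # assume R and S are both data pages
        for p in R.points:
            for q in S.points:
                if euclidean(p, q) <= eps:
                    out.append((p, q))


R = Page(children=[Page(points=[(0.0, 0.0), (1.0, 1.0)]),
                   Page(points=[(10.0, 10.0), (11.0, 11.0)])])
S = Page(children=[Page(points=[(0.0, 1.0), (2.0, 2.0)]),
                   Page(points=[(10.0, 11.0), (20.0, 20.0)])])
result = []
r_tree_sim_join(R, S, 1.0, result)
```

Of the four possible pairs of data pages, only two pass the `mindist` test here, so half of the point comparisons are skipped.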
33 Modeling and Optimization
- Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 16:30
- Mating probability of index pages
- Probability that the distance between two pages is ≤ ε
- Two-fold application of the Minkowski sum
34 Modeling and Optimization
- I/O cost
- High const. cost per page
- Large capacity optimum
- CPU cost
- Low const. cost per page
- Low capacity optimum
- CPU performance like a CPU-optimized index
- I/O performance like an I/O-optimized index
35 Open Problems for Research (Sim. Join)
- Modeling and Optimization
- Dimension
- Quantization
- Page scheduling
- Caching strategies
- Nearest Neighbor Join
- Applications
- Algorithms
- General
- Integration into object-relational DBMS
36 3. New Challenges
37 New Challenges
- Uncertain Features
- Application
- Biometric Identification
- Particularities
- Features individually associated with uncertainty (e.g. as Gaussian distributions)
- Queries
- Probability of match
- Find objects with highest probability of match
- Find objects with probability of match > ε
[Figure: relative probability plotted over feature a1]
38 New Challenges
- Support of e-commerce in all phases
- Marketing → customer segmentation
- Sales and booking → advanced similarity search
- Add-on products → sales transaction analysis
- Advanced Similarity
- Adaptable
- Multimodal models
- Relevance-feedback
- Convex hull
39 New Challenges
- Stock quotes: technical chart analysis
- Known: database techniques for similarity search in time sequences (DFT, etc.)
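The DFT technique mentioned here can be sketched as a filter step: each sequence is reduced to its first few Fourier coefficients, and the distance between these features never exceeds the true Euclidean distance (Parseval's theorem, given the 1/sqrt(n) scaling used below). The sketch uses a naive O(n^2) transform from the standard library; the function names are hypothetical.

```python
import cmath
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dft_features(series, num_coeffs):
    """First num_coeffs coefficients of the orthonormal discrete Fourier transform."""
    n = len(series)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(x * cmath.exp(-2j * math.pi * f * t / n)
                        for t, x in enumerate(series))
            for f in range(num_coeffs)]


def feature_distance(fa, fb):
    """Euclidean distance between two complex feature vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))


a = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0]
b = [2.0, 2.0, 3.0, 5.0, 3.0, 1.0, 1.0, 2.0]

true_dist = euclidean(a, b)
lower_bound = feature_distance(dft_features(a, 3), dft_features(b, 3))
full_dist = feature_distance(dft_features(a, len(a)), dft_features(b, len(b)))
```

Because `lower_bound` never exceeds `true_dist`, a range query on the short feature vectors cannot miss a true result; the candidates it returns are then checked against the full sequences.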
40 New Challenges
- Professional analyst tools use
- Trading signals generated by indicators (e.g. MACD)
- Formations indicating trends in charts
- Relationships to the market and to derivatives
41 Conclusion
- Database primitives: abstraction from the application
- Primitives: similarity search (range queries, nearest neighbor queries), similarity join
- Applications built on the primitives: clustering, classification, outlier detection
- Advantages
- General solution, reuse
- Separately optimizable