Christian B - PowerPoint PPT Presentation

About This Presentation

Title:

Christian B

Slides: 42

Provided by: dbs3

Tags: christian | lof | models | nn | striping

Transcript and Presenter's Notes

Title: Christian B

1
Christian Böhm
University for Health Informatics and Technology, Innsbruck
Similarity Search and Data Mining: Database Techniques Supporting Next Decade's Applications
Keynote at iiWAS 2002
2
Part 1: Similarity Search
3
Feature Based Similarity
4
Simple Similarity Queries

Specify a query object and
Find similar objects: range query
Find the k most similar objects: nearest neighbor query
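
The two query types can be illustrated with a minimal sketch that scans all feature vectors linearly. The function names, the Euclidean distance, and the sample points are assumptions made for illustration; the slides do not prescribe a metric or an implementation.

```python
import heapq
import math

def range_query(points, q, eps):
    """Return all points whose distance to the query object q is at most eps."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """Return the k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

# Hypothetical 2-d feature vectors and query object.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.0, 0.5)
in_range = range_query(points, q, 1.2)
nearest = knn_query(points, q, 2)
```

A range query needs a radius chosen in advance, while the nearest neighbor query adapts its radius to the data, which is why the two are treated separately in the rest of the talk.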

5
Multidimensional Index Structure (R-tree)
6
Range Query with Depth-First Traversal
7
Nearest Neighbor Priority Algorithm
Hjaltason, Samet: Ranking in Spatial Databases,
SSD 1995
4 page accesses
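
The priority idea can be sketched as a best-first search: pages and points share one priority queue ordered by their (minimum) distance to the query, and the first point taken from the queue is the nearest neighbor. The `IndexPage` class, the bounding-box layout, and the sample tree below are hypothetical stand-ins for a real R-tree and do not reproduce the example drawn on the slide.

```python
import heapq
import itertools
import math

class IndexPage:
    """Hypothetical index page: a directory page has children, a data page has points."""
    def __init__(self, children=(), points=()):
        self.children = list(children)
        self.points = list(points)
        corners = self.points or [c for ch in self.children for c in (ch.lo, ch.hi)]
        dims = range(len(corners[0]))
        self.lo = tuple(min(c[i] for c in corners) for i in dims)  # lower box corner
        self.hi = tuple(max(c[i] for c in corners) for i in dims)  # upper box corner

def mindist(q, page):
    """Smallest possible distance from q to any point inside the page's box."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, page.lo, page.hi)))

def nearest_neighbor(root, q):
    """Best-first search: always process the queue entry closest to q."""
    tie = itertools.count()            # tie-breaker so the heap never compares pages
    queue = [(0.0, next(tie), root)]
    accesses = 0
    while queue:
        dist, _, item = heapq.heappop(queue)
        if not isinstance(item, IndexPage):
            return item, dist, accesses    # first point popped is the nearest one
        accesses += 1                      # expanding a page costs one page access
        for child in item.children:
            heapq.heappush(queue, (mindist(q, child), next(tie), child))
        for p in item.points:
            heapq.heappush(queue, (math.dist(q, p), next(tie), p))
    return None, math.inf, accesses

left = IndexPage(points=[(0.0, 0.0), (1.0, 1.0)])
right = IndexPage(points=[(5.0, 5.0), (6.0, 6.0)])
far = IndexPage(points=[(9.0, 0.0), (10.0, 1.0)])
root = IndexPage(children=[left, right, far])
point, dist, accesses = nearest_neighbor(root, (4.5, 4.5))
```

In this toy tree only the root and one data page are expanded; pages whose minimum distance exceeds the distance of the result stay unread in the queue.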
8
Problems of High-Dim. Index Structures

Curse of dimensionality
Search performance of the index deteriorates in high dimensions
Index is outperformed by the sequential scan
Solution
Optimize various parameters of index structures
Needed: cost model for queries
How many pages are expected to be accessed for
Range queries (with given ε)
Nearest neighbor queries (with given k)

9
Cost Estimation (Uniformity/Independence)

Minkowski sum: estimation of the access probability of a page
Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000

Nearest neighbor: estimate distance by point density
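
A minimal sketch of both estimates under the uniformity and independence assumption follows. It further assumes a unit data space, the maximum metric (so that the Minkowski sum of a page region and the query range is again a box), and only a crude clipping at the data space boundary. These simplifications and all function names are illustrative assumptions, not the formulas of the cited paper.

```python
def access_probability(side_lengths, eps):
    """Volume of the page region enlarged by eps on every side (Minkowski sum).

    With uniformly distributed query points in the unit data space, this volume
    is the probability that a range query with radius eps touches the page.
    """
    prob = 1.0
    for a in side_lengths:
        prob *= min(a + 2.0 * eps, 1.0)   # clip at the extent of the data space
    return prob

def expected_page_accesses(n_pages, side_lengths, eps):
    """Expected number of pages read if all pages have the same shape."""
    return n_pages * access_probability(side_lengths, eps)

def expected_knn_distance(n_points, dim, k=1):
    """Radius r of the max-metric ball expected to hold k of n uniform points.

    Point density argument: n * (2r)**dim = k, solved for r.
    """
    return 0.5 * (k / n_points) ** (1.0 / dim)

p_page = access_probability([0.2, 0.2], 0.1)
r_2d = expected_knn_distance(1_000_000, 2)
r_16d = expected_knn_distance(1_000_000, 16)
```

Comparing `r_2d` with `r_16d` for the same number of points shows how quickly the estimated nearest neighbor distance grows with the dimension, and with it the access probability of every page.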
10
Cost Estimation

Boundary and saturation effects in high-dim. space (considered by our model extension)
Correlation between attributes (considered by the concept of fractal dimension)
Cluster structure also has an impact on performance
Currently neglected by our model
Histograms and similar data descriptions are difficult in high-dimensional space (number of histogram bins is exponential in the dimensionality)
Other descriptions of cluster structure (dendrograms)
Subject to future work

11
Optimization of Index Structures

To prevent the sequential scan from outperforming
index-based query processing
Optimize various parameters such as
Logical block size of the index pages
Indexed dimension
I/O schedule optimization (fast index scan)
Data quantization
Observe the balance! (Master Confucius)

12
Page Size Optimization
13
Page Size Optimization
14
Optimized Dimension Assignment
Hi-dim. Index
Inverted List
Matching
R-tree
B-tree
Problem in hi-dim.: too few splits in each dimension
Problem in hi-dim.: too many results in each dimension
15
Optimized Dimension Assignment
Hi-dim. Index
Inverted List
Matching
R-tree
B-tree
Compromise: a moderate number of R-trees, each
indexing a few dimensions
OPTIMIZE!
16
Schedule Optimization (Fast Index Scan)
Range query: required pages are known from the
directory
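
Because the set of required pages is known before any data page is read, the reads can be planned in advance. The sketch below assumes a simple disk model with a fixed seek cost and a fixed transfer cost per page, and a greedy rule that reads over a gap of unneeded pages whenever that is cheaper than a new seek. The cost model, the rule, and the function names are assumptions for illustration.

```python
def plan_reads(pages, seek_cost, transfer_cost):
    """Group the required page numbers into contiguous read runs (start, end)."""
    runs = []
    for page in sorted(pages):
        if runs:
            start, end = runs[-1]
            gap = page - end - 1                  # unneeded pages in between
            if gap * transfer_cost <= seek_cost:
                runs[-1] = (start, page)          # cheaper to read over the gap
                continue
        runs.append((page, page))                 # start a new run with one seek
    return runs

def schedule_cost(runs, seek_cost, transfer_cost):
    """Total cost: one seek per run plus the transfer of every page in the run."""
    return sum(seek_cost + (end - start + 1) * transfer_cost
               for start, end in runs)

required = [3, 4, 6, 20, 21]
runs = plan_reads(required, seek_cost=10, transfer_cost=1)
cost = schedule_cost(runs, seek_cost=10, transfer_cost=1)
naive = schedule_cost([(p, p) for p in required], seek_cost=10, transfer_cost=1)
```

Here `naive` is the cost of fetching every required page with its own seek, which the planned schedule undercuts by reading page 5 although it is not needed.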
17
Schedule Optimization (NN Queries)

Current expenses are traded for possible later
savings
Start at 100 page and extend forward and
backward
Optimize the cumulated cost balance (CCB)

18
Quantization

Approximate the points by a quantization grid
based on quantiles
Benefit: fewer bits for representation
Cost: grid cell partially intersected ⇒ access
the original point data
How to choose the grid resolution?

Weber, Schek, Blott: A Quantitative Analysis and
Performance Study..., VLDB 1998
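
The trade-off can be sketched for a single dimension: cell boundaries are placed at quantiles, each value is stored as a small cell number, and a range query is first decided on the cell alone. Only partially intersected cells require the exact value. The function names and the one-dimensional simplification are assumptions; a multidimensional grid would apply the same steps per dimension.

```python
import bisect

def quantile_boundaries(values, bits):
    """Boundaries so that each of the 2**bits cells holds about equally many values."""
    ordered = sorted(values)
    cells = 2 ** bits
    return [ordered[(i * len(ordered)) // cells] for i in range(1, cells)]

def encode(value, boundaries):
    """Cell number of a value: the count of boundaries that are <= value."""
    return bisect.bisect_right(boundaries, value)

def cell_interval(cell, boundaries, lo, hi):
    """Lower and upper bound of a cell; lo and hi are the data space limits."""
    left = lo if cell == 0 else boundaries[cell - 1]
    right = hi if cell == len(boundaries) else boundaries[cell]
    return left, right

def classify(cell, boundaries, lo, hi, q_lo, q_hi):
    """Decide from the approximation alone how the cell relates to [q_lo, q_hi]."""
    left, right = cell_interval(cell, boundaries, lo, hi)
    if right < q_lo or left > q_hi:
        return "miss"        # cell lies completely outside the query range
    if q_lo <= left and right <= q_hi:
        return "hit"         # cell lies completely inside the query range
    return "refine"          # partially intersected: fetch the exact value

values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
bounds = quantile_boundaries(values, bits=2)
codes = [encode(v, bounds) for v in values]
decisions = [classify(c, bounds, 0.0, 1.0, 0.25, 0.65) for c in range(4)]
```

A finer grid produces fewer "refine" decisions but needs more bits per value, which is the resolution question raised on the slide.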
19
Independent Quantization (IQ tree)
Combines index, scan, and quantization
Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000
Grid resolution optimized by cost model
20
Open Research Problems in Optimization

Multi-Parameter Optimization
How can parameters be optimized simultaneously?
Are there conflicts between optimization goals?
Example

Uniform data ⇒ Quantization
Correlated data ⇒ Tree Striping
21
Open Research Problems in Optimization

Consider Insert/Delete/Update
If the data set faces heavy updates, the
constructed index should look different
from the index for a more static data set
Update-bound: construct a rather simple index
Query-bound: spend more effort to organize the data
Can be considered as an optimization problem

22
Part 2: Data Mining
23
KDD Algorithms Based on Similarity Queries
24
Join Applications

Catalogue conversion (catalogue matching)
e.g. astronomy catalogues

25
Clustering

Clustering (e.g. DBSCAN)
Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering
Clusters, KDD 1996

26
Cache Behavior
27
Clustering and Similarity Join

DBSCAN uses the similarity join as its basic operation
Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join,
CIKM 2000

28
k-Nearest Neighbor Classification

Example

Objects with known class
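
A minimal sketch of k-nearest neighbor classification: the query object receives the majority class among its k nearest objects with known class. The Euclidean distance, the function name, and the sample data are assumptions for illustration.

```python
from collections import Counter
import heapq
import math

def knn_classify(training, q, k):
    """training is a list of (point, label); return the majority label of the k nearest."""
    nearest = heapq.nsmallest(k, training, key=lambda item: math.dist(item[0], q))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical objects with known class.
training = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
            ((5.0, 5.0), "B"), ((5.0, 6.0), "B")]
label_near_a = knn_classify(training, (0.5, 0.5), 3)
label_near_b = knn_classify(training, (5.0, 5.4), 3)
```

Classifying many query objects at once turns this into one nearest neighbor query per object, which is the workload the k-nearest neighbor join later in the talk addresses.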
29
Distance Range Join (ε-Join)

Most widespread and best evaluated join
Often also called the similarity join
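
As a reference point, the distance range join can be written as a nested loop over both sets. This is a minimal sketch that assumes Euclidean distance; it defines the result of the join and is not meant as an efficient algorithm.

```python
import math

def eps_join(R, S, eps):
    """All pairs (r, s) from R x S whose distance is at most eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]

R = [(0.0, 0.0), (2.0, 2.0)]
S = [(0.0, 1.0), (2.0, 3.0), (9.0, 9.0)]
pairs = eps_join(R, S, 1.0)
```

The result size depends entirely on eps: a small value may return nothing, a large value returns almost the full cross product.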

30
k-Closest Pair Query
In SQL notation

SELECT * FROM R, S
ORDER BY ||R.obj - S.obj||
STOP AFTER k
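
The same query can be sketched outside SQL: order all pairs of the cross product by distance and stop after k. The function name and the Euclidean distance are assumptions for illustration, and the full enumeration of the cross product is only acceptable for small inputs.

```python
import heapq
import itertools
import math

def k_closest_pairs(R, S, k):
    """The k pairs (r, s) with the smallest distance, closest pair first."""
    pairs = itertools.product(R, S)
    return heapq.nsmallest(k, pairs, key=lambda rs: math.dist(rs[0], rs[1]))

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 7.0), (9.0, 9.0)]
closest = k_closest_pairs(R, S, 2)
```

Unlike the distance range join, the result size is fixed at k and no distance threshold has to be chosen.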

31
k-Nearest Neighbor Join
In SQL notation
(limited to k = 1)

SELECT * FROM R, S
GROUP BY R.obj
ORDER BY ||R.obj - S.obj||
STOP AFTER K ( ≠ k )
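
In contrast to the k-closest pair query, the k-nearest neighbor join returns k partners for every object of R. A minimal sketch, assuming Euclidean distance and a hypothetical function name:

```python
import heapq
import math

def knn_join(R, S, k):
    """Map every r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 7.0), (9.0, 9.0)]
neighbors = knn_join(R, S, 1)
```

The result always contains k entries per object of R (as long as S has at least k objects), so every object of R appears in it, even outliers that are far from all of S.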

32
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r)
          CacheLoad(s)
          r_tree_sim_join (r,s,ε)
  else ( assume R,S both DataPg )
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if ||p - q|| ≤ ε then report (p,q)
R
S
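
The procedure can be made runnable with a small in-memory stand-in for the R-tree. The `Page` class, the bounding-box representation, and the sample trees are hypothetical; the cache loads of the slide are omitted because all pages already live in memory, and, as on the slide, both trees are assumed to have the same height.

```python
import math

class Page:
    """Hypothetical R-tree page: a directory page has children, a data page has points."""
    def __init__(self, children=(), points=()):
        self.children = list(children)
        self.points = list(points)
        corners = self.points or [c for ch in self.children for c in (ch.lo, ch.hi)]
        dims = range(len(corners[0]))
        self.lo = tuple(min(c[i] for c in corners) for i in dims)  # lower box corner
        self.hi = tuple(max(c[i] for c in corners) for i in dims)  # upper box corner

    def is_dir(self):
        return bool(self.children)

def mindist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    return math.sqrt(sum(max(bl - ah, al - bh, 0.0) ** 2
                         for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi)))

def r_tree_sim_join(R, S, eps, out):
    """Append all pairs (p, q) with distance <= eps to out and return it."""
    if R.is_dir() and S.is_dir():
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:      # only descend into close page pairs
                    r_tree_sim_join(r, s, eps, out)
    else:  # assume R and S are both data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
    return out

tree_r = Page(children=[Page(points=[(0.0, 0.0), (1.0, 1.0)]),
                        Page(points=[(10.0, 10.0)])])
tree_s = Page(children=[Page(points=[(1.0, 2.0)]),
                        Page(points=[(10.0, 11.0), (20.0, 20.0)])])
result = r_tree_sim_join(tree_r, tree_s, 1.0, [])
```

The `mindist` test on page pairs is what saves work: in the sample trees two of the four data page pairs are never compared point by point.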
33
Modeling and Optimization

Böhm, Kriegel: A Cost Model and Index
Architecture for the Similarity Join, Wednesday,
16:30
Mating probability of index pages
Probability that the distance between two pages is ≤ ε
Two-fold application of Minkowski sum
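
A rough sketch of the mating probability under strong simplifications: both pages are cubes, their positions are uniformly distributed in a unit data space, distances are measured with the maximum metric, and boundary effects are only clipped. Enlarging one page by ε and then by the extent of the other page is the two-fold Minkowski sum. These assumptions and the function names are illustrative and are not taken from the cited paper.

```python
def mating_probability(side_r, side_s, eps, dim):
    """Probability that two cube-shaped pages are within max-metric distance eps.

    Per dimension the two pages mate if their centers differ by at most
    side_r / 2 + side_s / 2 + eps, an interval of length side_r + side_s + 2 * eps.
    """
    per_dim = min(side_r + side_s + 2.0 * eps, 1.0)   # clip at the data space extent
    return per_dim ** dim

def expected_page_pairs(n_pages_r, n_pages_s, side_r, side_s, eps, dim):
    """Expected number of page pairs the join has to process."""
    return n_pages_r * n_pages_s * mating_probability(side_r, side_s, eps, dim)

p_mate = mating_probability(0.1, 0.1, 0.05, 2)
pairs_to_process = expected_page_pairs(100, 100, 0.1, 0.1, 0.05, 2)
```

The estimate shows the tension the next slide discusses: larger pages reduce the number of pages but raise the mating probability of each pair.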

34
Modeling and Optimization

I/O cost
High const. cost per page
Large capacity optimum
CPU cost
Low const. cost per page
Low capacity optimum
CPU performance like a CPU-optimized index
I/O performance like an I/O-optimized index

35
Open Problems for Research (Sim. Join)

Modeling and Optimization
Dimension
Quantization
Page scheduling
Caching strategies
Nearest Neighbor Join
Applications
Algorithms
General
Integration into object-relational DBMS

36
Part 3: New Challenges
37
New Challenges

Uncertain Features
Application
Biometric Identification
Particularities
Features individually associated with uncertainty
(e.g. as Gaussian distributions)
Queries
Probability of match
Find objects with highest probability of match
Find objects with probability of match > ε

Relative probability
Feature a1
38
New Challenges

Support of e-commerce in all phases
Marketing → customer segmentation
Sales and booking → advanced similarity search
Add-on products → sales transaction analysis
Advanced Similarity
Adaptable
Multimodal models
Relevance-feedback
Convex hull

39
New Challenges

Stock quotes: technical chart analysis
Known: database techniques for similarity search
in time sequences (DFT, etc.)
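
The DFT technique can be sketched as follows: each sequence is reduced to its first few Fourier coefficients, and the distance between these short feature vectors never exceeds the Euclidean distance of the full sequences (Parseval's theorem for the orthonormal transform), so it can be used to filter candidates without false dismissals. The function names and the sample sequences are assumptions for illustration.

```python
import cmath
import math

def dft_features(series, n_coeffs):
    """First n_coeffs coefficients of the orthonormal DFT of a sequence."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * math.pi * f * t / n)
                for t, x in enumerate(series)) / math.sqrt(n)
            for f in range(n_coeffs)]

def feature_distance(fa, fb):
    """Euclidean distance between two complex feature vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))

# Hypothetical price sequences of equal length.
a = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
b = [1.0, 2.0, 4.0, 4.0, 3.0, 1.0, 1.0, 0.0]
true_dist = math.dist(a, b)
lower_bound = feature_distance(dft_features(a, 3), dft_features(b, 3))
full_dist = feature_distance(dft_features(a, 8), dft_features(b, 8))
```

Since `lower_bound` never exceeds `true_dist`, a range query on the three-coefficient features returns a superset of the true answers, which are then verified on the original sequences.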

40
New Challenges

Professional analyst tools use
Trading signals generated by indicators (e.g.
MACD)
Formations indicating trends in charts
Relationships to the market and to derivatives

41
Conclusion

Database primitives: abstraction from
application
Similarity Search ⇒
Clustering, Classification ⇒ Similarity Join
Outlier Detection
Advantages
General solution, reuse
Separately optimizable

Range Queries
Nearest Neighbor Queries
