Title: Substructure Similarity Search in Graph Databases
1Substructure Similarity Search in Graph Databases
- By X. Yan, P. Yu, J. Han
- Ömer Can KOLÇAK
2Outline
- Introduction
- Preliminary Concepts
- Structural Filtering
- Feature Set Selection
- Algorithm Implementation
3Introduction
- Data research has been facing a new challenge
raised by the emergence of complex structural
data.
- Graphs have broad applications and are used
in datasets, especially in cheminformatics
and bioinformatics (e.g., ChemIDplus, PDB).
- In computer vision and pattern recognition,
graphs are also used to represent complex structures
such as hand-drawn symbols, 3D objects, and
medical images.
4Introduction
- All these applications indicate the importance
and broad usage of graph databases and their
similarity search systems.
- While discovery in graph datasets has been
studied, a systematic examination of graph query
systems is equally important.
5Structure Search Queries
- Full Structure Search
- Substructure Search
- Full Structure Similarity Search
6Question
- What if no matches occur for a given query graph?
Query Refinement Process
7Query Refinement Process
- Refining a query manually is time consuming
- Instead, define the portion of the query for exact
matching
- and let the system change the remaining portion slightly
- The amount of change allowed is given by the RELAXATION RATIO
8Example
9Introduction
- Existing tools such as ChemIDplus can only
provide full structure similarity search and
exact substructure search.
- Pairwise substructure similarity computation is
very expensive.
- When only one edge is missing, exact substructure
search may still work.
- What if the number of deletions is more than one?
10Graph Similarity Filtering
- GRAFIL
- A feature-based structural filtering algorithm
- no pairwise computation
- Instead, two data structures
- feature-graph matrix
- edge-feature matrix
- filters the dataset by using these matrices
- not on the database, on the matrices
11Contribution of Grafil
- A significant contribution of this study is an
examination of an increasingly important search
problem in graph databases and the proposal of a
feature-based filtering algorithm for efficient
substructure similarity search.
- The concept presented in Grafil can also be applied to
searching approximate, non-consecutive sequences,
trees, and other complicated structures.
12Preliminary Concepts
- DEFINITION (SUBSTRUCTURE SIMILARITY SEARCH)
- Given a graph database D = {G1, G2, ..., Gn} and a
query graph Q, similarity search is to discover all
the graphs that approximately contain this query
graph.
- Target graphs = graphs in the dataset
13Preliminary Concepts
- DEFINITION (RELAXATION RATIO)
- Given two graphs G and Q, if P is the maximum
common subgraph of G and Q, then the substructure
similarity between G and Q is defined by
|E(P)| / |E(Q)|, and 1 - |E(P)| / |E(Q)| is called
the relaxation ratio.
14Example
Maximum Common Subgraph P: |E(P)| = 11
Substructure Similarity = 11/12 ≈ 92%
Relaxation Ratio = 1 - (11/12) ≈ 8%
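The arithmetic of this example can be reproduced with a minimal sketch. The function names are illustrative, not from the slides. The edge count of the maximum common subgraph is taken as given, since finding it is the expensive part that Grafil tries to avoid.

```python
def substructure_similarity(mcs_edges: int, query_edges: int) -> float:
    """|E(P)| / |E(Q)|, where P is the maximum common subgraph of G and Q."""
    if query_edges <= 0:
        raise ValueError("the query graph must have at least one edge")
    if not 0 <= mcs_edges <= query_edges:
        raise ValueError("|E(P)| must lie between 0 and |E(Q)|")
    return mcs_edges / query_edges


def relaxation_ratio(mcs_edges: int, query_edges: int) -> float:
    """1 - |E(P)| / |E(Q)|: the fraction of query edges left unmatched."""
    return 1.0 - substructure_similarity(mcs_edges, query_edges)


# Values from the example slide: |E(P)| = 11, |E(Q)| = 12.
similarity = substructure_similarity(11, 12)
ratio = relaxation_ratio(11, 12)
print(f"similarity = {similarity:.1%}, relaxation ratio = {ratio:.1%}")
```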
15Structural Filtering
- Given a query graph, the major target of our
algorithm is to filter out as many graphs as possible
using a feature-based approach.
- Features
- Paths
- Discriminative Frequent Structures
- Elementary Structures
- etc...
16Example
This query graph contains seven occurrences of
these features: one fa, two fb's, and four fc's.
17Feature-Graph Matrix
- easily maintainable
18Framework
- Given a graph database and a query graph, the
substructure similarity search can be performed
in the following four steps:
- Index Construction: Select small structures as
features in the graph database, and build the
feature-graph matrix between the features and the
graphs in the database.
- Feature Miss Estimation: Select a feature set,
calculate the number of selected features
contained in the query graph, then compute the
upper bound of feature misses (dmax) if the query
graph is relaxed with one edge deletion.
19Framework
- Query Processing: Use the feature-graph matrix to
calculate the difference in the number of
features between each graph G in the database and
query Q. If the difference is greater than dmax,
eliminate graph G. The remaining graphs
constitute a candidate answer set, written as CQ.
- Query Relaxation: Relax the query further if the
user needs more matches than those returned from
the previous step; iterate Steps 2 to 4.
20Feature Miss Estimation
Construct a feature set: all features
for k = 1, dmax = 4
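The estimation step can be sketched with the edge-feature matrix: each row is a query edge, each column is one occurrence of a feature in the query, and an entry is 1 when the edge belongs to that occurrence. The bound dmax is then the largest number of columns that any k rows can hit. The matrix below is a hypothetical one, invented to be consistent with dmax = 4 for k = 1 (the figure on the slide is not reproduced here), and `max_feature_misses` is an illustrative name. The brute-force search is only meant for tiny examples, since this is a maximum coverage problem and grows quickly with k.

```python
from itertools import combinations


def max_feature_misses(edge_feature_rows, k):
    """Largest number of feature occurrences destroyed by deleting any k edges."""
    n_edges = len(edge_feature_rows)
    k = min(k, n_edges)
    best = 0
    for chosen in combinations(range(n_edges), k):
        # A column (feature occurrence) is missed if any deleted edge is part of it.
        hit = 0
        for column in zip(*(edge_feature_rows[i] for i in chosen)):
            if any(column):
                hit += 1
        best = max(best, hit)
    return best


# Hypothetical edge-feature matrix (assumption, not taken from the slides).
# Columns: fa, fb(1), fb(2), fc(1), fc(2), fc(3), fc(4)
EDGE_FEATURE = [
    [0, 1, 1, 1, 0, 0, 0],  # edge e1
    [1, 1, 0, 0, 1, 0, 1],  # edge e2
    [1, 0, 1, 0, 0, 1, 1],  # edge e3
]

d_max = max_feature_misses(EDGE_FEATURE, k=1)
print("dmax for k = 1:", d_max)
```

With this matrix, deleting a single edge can destroy at most four of the seven feature occurrences, which matches the bound used on the slide.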
21Framework on Example
- Given a graph database and a query graph
- Index Construction
- Build the feature-graph matrix
- Feature Miss Estimation
- Calculate dmax = 4
22Framework on Example
- Query Processing
- Calculate the difference in the number of
features between each graph G and query Q.
Total number of occurrences in Q: 7
dmax = 4
Misses: G1 = 5, G2 = 3, G3 = 2, G4 = 3
CQ = {G2, G3, G4}
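The query processing step can be sketched as a lookup against the feature-graph matrix. The matrix entries below are an assumption: they are chosen so that the miss counts come out as 5, 3, 2, 3 for G1 to G4, since the matrix figure itself is not reproduced in this transcript. The function names are illustrative.

```python
# Assumed feature-graph matrix: occurrences of each feature in G1..G4.
FEATURE_GRAPH = {
    "fa": [0, 1, 0, 0],
    "fb": [0, 0, 1, 0],
    "fc": [2, 3, 4, 4],
}
# Occurrences of each feature in the query graph (from the slides).
QUERY_COUNTS = {"fa": 1, "fb": 2, "fc": 4}
GRAPH_IDS = ["G1", "G2", "G3", "G4"]


def feature_misses(feature_graph, query_counts, n_graphs):
    """Per graph, count how many query feature occurrences it lacks."""
    misses = [0] * n_graphs
    for feature, q_count in query_counts.items():
        for j, g_count in enumerate(feature_graph[feature]):
            # Only a shortfall counts; extra occurrences in G are not misses.
            misses[j] += max(0, q_count - g_count)
    return misses


def filter_candidates(misses, d_max, graph_ids):
    """Keep the graphs whose miss count does not exceed the bound dmax."""
    return [g for g, m in zip(graph_ids, misses) if m <= d_max]


misses = feature_misses(FEATURE_GRAPH, QUERY_COUNTS, len(GRAPH_IDS))
candidates = filter_candidates(misses, d_max=4, graph_ids=GRAPH_IDS)
print(misses, candidates)
```

No graph in the database is touched during this step; only the precomputed counts are compared, which is what makes the filter cheap.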
23Question
- Should we use all the features together in a
single filter?
- Does a filter achieve good filtering performance
if all the features are used together?
- Intuitively, such a strategy would improve the
performance since all the available information
is used.
- However, this is not true.
24Question Feature Miss Estimation
for k = 1, dmax = 2
25Question Query Processing
Total number of occurrences in Q (fa and fb only): 3
dmax = 2
Misses: G1 = 3, G2 = 2, G3 = 2, G4 = 3
CQ = {G2, G3}
26Answer
- By adding all features to the feature set, we may
fail to filter out some graphs that do not satisfy
the query requirement.
- To improve the accuracy of the filtering, we
should select feature sets by grouping the
features.
- The example implies that the filtering power may
be weakened if we deploy all the features in one
filter.
- To measure the filtering power, we use
selectivity.
27Selectivity
- DEFINITION (SELECTIVITY)
- Given a graph database D, a query Q, and a
feature f, the selectivity of f is defined by its
average frequency difference within D and Q.
Occurrences of fa in the query graph: 1
Occurrences of fb in the query graph: 2
Occurrences of fc in the query graph: 4
Selectivity of fa = 3/4
Selectivity of fb = 7/4
Selectivity of fc = 3/4
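The three selectivity values can be checked with a short sketch. Two things here are assumptions: the frequency difference is taken as max(0, count in Q - count in G), in line with how misses are counted, and the per-graph counts are chosen to be consistent with the numbers on the slides rather than copied from the original figure.

```python
from fractions import Fraction


def selectivity(query_count, graph_counts):
    """Average shortfall of a feature across the database, relative to the query."""
    total = sum(max(0, query_count - count) for count in graph_counts)
    return Fraction(total, len(graph_counts))


# Assumed occurrences of each feature in G1..G4.
FEATURE_GRAPH = {
    "fa": [0, 1, 0, 0],
    "fb": [0, 0, 1, 0],
    "fc": [2, 3, 4, 4],
}
# Occurrences of each feature in the query graph (from the slides).
QUERY_COUNTS = {"fa": 1, "fb": 2, "fc": 4}

selectivities = {
    feature: selectivity(QUERY_COUNTS[feature], counts)
    for feature, counts in FEATURE_GRAPH.items()
}
for feature, value in selectivities.items():
    print(feature, value)
```

Under these assumptions fb stands out with 7/4, which means it separates the query from the database graphs far more strongly than fa or fc.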
28Feature Set Selection
- Rule 1. Select a large number of features
- Rule 2. Make sure features cover the query graph
uniformly.
- Rule 3. Separate features with different
selectivity.
29Feature Set Selection of Grafil
- Grafil has two types of feature set selection
- Base component (Grafil-base): combines features
of the same size
- Clustering component
30Clustering Component
- Grafil first combines the features whose sizes
differ by at most 1, and sorts them by
selectivity.
- Hierarchical clustering
- Grafil divides them into three groups with high
selectivity, medium selectivity, and low
selectivity.
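The grouping step can be sketched as a one-dimensional agglomerative clustering over selectivity values. The slides only say that hierarchical clustering produces three groups, so the merge rule used here (repeatedly merge the two neighbouring groups with the closest mean selectivity) is an assumption, and the feature names and values are hypothetical.

```python
def cluster_by_selectivity(selectivities, n_groups=3):
    """Sort features by selectivity, then merge neighbours until n_groups remain."""
    ordered = sorted(selectivities.items(), key=lambda kv: kv[1], reverse=True)
    clusters = [[item] for item in ordered]  # start with one cluster per feature

    def mean(cluster):
        return sum(value for _, value in cluster) / len(cluster)

    while len(clusters) > n_groups:
        # Gap between the mean selectivity of each pair of adjacent clusters.
        gaps = [
            abs(mean(clusters[i]) - mean(clusters[i + 1]))
            for i in range(len(clusters) - 1)
        ]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [[name for name, _ in cluster] for cluster in clusters]


# Hypothetical selectivity values for seven features.
EXAMPLE = {
    "f1": 2.0, "f2": 1.75,
    "f3": 1.0, "f4": 0.875, "f5": 0.75,
    "f6": 0.25, "f7": 0.125,
}
groups = cluster_by_selectivity(EXAMPLE)
print(groups)
```

Each resulting group would then back its own filter, so that highly selective features are not diluted by weak ones in a single feature set.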
31Grafil Algorithm
- base component
- clustering component
- pipeline model
32Conclusion
- We discuss the problem of substructure similarity
search in large-scale graph databases, a problem
raised by the emergence of massive, complex
structural data.
- Different from previous work, our solution
explores a filtering algorithm using indexed
structural patterns, without doing costly
structure comparisons.
- The successful transformation of the
structure-based similarity measure into a
feature-based measure renders our method
attractive in terms of accuracy and efficiency.