Substructure Similarity Search in Graph Databases

Provided by: mehme9

1
Substructure Similarity Search in Graph Databases

By X. Yan, P. Yu, and J. Han
Ömer Can KOLÇAK

2
Outline

Introduction
Preliminary Concepts
Structural Filtering
Feature Set Selection
Algorithm Implementation

3
Introduction

Data research faces a new challenge raised by the
emergence of complex structural data.
Graphs have broad applications and are used in
datasets, especially in cheminformatics and
bioinformatics (e.g., ChemIDplus, PDB).
In computer vision and pattern recognition,
graphs are also used to represent complex
structures such as hand-drawn symbols, 3D objects
and medical images.

4
Introduction

All these applications indicate the importance
and broad usage of graph databases and their
similarity search systems.
While discovery in graph datasets has been
studied, a systematic examination of graph query
systems is becoming equally important.

5
Structure Search Queries

Full Structure Search
Substructure Search
Full Structure Similarity Search

6
Question

What if no matches occur for a given query graph?

Query Refinement Process

7
Query Refinement Process

Manual refinement is time-consuming.
Instead, define the portion of the query for exact
matching and let the system change the remaining
portion slightly.

RELAXATION RATIO

8
Example
9
Introduction

Existing tools such as ChemIDplus provide only
full structure similarity search and exact
substructure search.
Pairwise substructure similarity computation is
very expensive.
If only one edge is missed, exact substructure
search may still work.
What if the number of deletions is more than one?

10
Graph Similarity Filtering

Grafil (Graph Similarity Filtering)
A feature-based structural filtering algorithm
No pairwise similarity computation
Instead, it uses two data structures:
feature-graph matrix
edge-feature matrix
Filtering is performed on these matrices, not on
the database itself

11
Contribution of Grafil

A significant contribution of this study is an
examination of an increasingly important search
problem in graph databases and the proposal of a
feature-based filtering algorithm for efficient
substructure similarity search.
The concept presented in Grafil can also be
applied to searching approximate, non-consecutive
sequences, trees and other complicated structures.

12
Preliminary Concepts

DEFINITION (SUBSTRUCTURE SIMILARITY SEARCH)
Given a graph database D = {G1, G2, ..., Gn} and a
query graph Q, similarity search is to discover all
the graphs that approximately contain this query
graph.
Target graphs: the graphs in the dataset

13
Preliminary Concepts

DEFINITION (RELAXATION RATIO)
Given two graphs G and Q, if P is the maximum
common subgraph of G and Q, then the substructure
similarity between G and Q is defined by
|E(P)| / |E(Q)|, and 1 - |E(P)| / |E(Q)| is called
the relaxation ratio.

14
Example
Substructure similarity: 11/12 ≈ 92%
Maximum common subgraph P: |E(P)| = 11
Relaxation ratio: 1 - 11/12 ≈ 8%
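
The arithmetic of this example can be checked with a small sketch. It assumes the edge count of the maximum common subgraph is already known (finding P itself is the expensive part and is not attempted here); the function names are illustrative only and do not come from the presentation.

```python
from fractions import Fraction


def substructure_similarity(mcs_edges: int, query_edges: int) -> Fraction:
    """Return |E(P)| / |E(Q)|, where P is the maximum common subgraph."""
    if query_edges <= 0:
        raise ValueError("the query graph must have at least one edge")
    if not 0 <= mcs_edges <= query_edges:
        raise ValueError("P is a subgraph of Q, so 0 <= |E(P)| <= |E(Q)|")
    return Fraction(mcs_edges, query_edges)


def relaxation_ratio(mcs_edges: int, query_edges: int) -> Fraction:
    """Return 1 - |E(P)| / |E(Q)|."""
    return 1 - substructure_similarity(mcs_edges, query_edges)


# Values from the example slide: |E(P)| = 11, |E(Q)| = 12.
similarity = substructure_similarity(11, 12)
ratio = relaxation_ratio(11, 12)
print(f"similarity = {similarity} ({float(similarity):.0%})")  # 11/12 (92%)
print(f"relaxation ratio = {ratio} ({float(ratio):.0%})")      # 1/12 (8%)
```

Exact fractions are used so that the rounded percentages on the slide can be compared against the unrounded values.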
15
Structural Filtering

Given a query graph, the main goal of the
algorithm is to filter out as many graphs as
possible using a feature-based approach.
Possible features:
Paths
Discriminative frequent structures
Elementary structures
etc.

16
Example
The query graph contains seven occurrences of
these features: one fa, two fb's and four fc's.
17
Feature-Graph Matrix
- easily maintainable
18
Framework

Given a graph database and a query graph, the
substructure similarity search can be performed
in the following four steps:
1. Index Construction: Select small structures as
features in the graph database, and build the
feature-graph matrix between the features and the
graphs in the database.
2. Feature Miss Estimation: Select a feature set,
calculate the number of selected features
contained in the query graph, then compute the
upper bound of feature misses (dmax) if the query
graph is relaxed with one edge deletion.

19
Framework

3. Query Processing: Use the feature-graph matrix
to calculate the difference in the number of
features between each graph G in the database and
query Q. If the difference is greater than dmax,
eliminate graph G. The remaining graphs
constitute the candidate answer set, written as CQ.
4. Query Relaxation: Relax the query further if
the user needs more matches than those returned by
the previous step; iterate Steps 2 to 4.

20
Feature Miss Estimation
Construct a feature set: all features
For k = 1, dmax = 4
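
A minimal sketch of how dmax could be read off an edge-feature matrix follows. The matrix itself is not in the transcript, so the one below is hypothetical: a three-edge query whose rows were chosen so that k = 1 gives the dmax = 4 stated on the slide. The brute-force search over edge subsets is also an assumption made for clarity and is only practical for small queries.

```python
from itertools import combinations

# Hypothetical edge-feature matrix (NOT from the transcript).
# Rows: query edges. Columns: the seven feature occurrences in the query
# (one fa, two fb, four fc). A 1 means deleting the edge destroys that occurrence.
OCCURRENCES = ["fa", "fb1", "fb2", "fc1", "fc2", "fc3", "fc4"]
EDGE_FEATURE = {
    "e1": [0, 1, 1, 1, 0, 0, 0],
    "e2": [1, 1, 0, 0, 1, 0, 1],
    "e3": [1, 0, 1, 0, 0, 1, 1],
}


def max_feature_misses(edge_feature, k):
    """Largest number of feature occurrences hit by deleting any k edges."""
    width = len(next(iter(edge_feature.values())))
    best = 0
    for chosen in combinations(edge_feature, k):
        # An occurrence is missed if at least one deleted edge belongs to it.
        hit = sum(
            1 for col in range(width)
            if any(edge_feature[edge][col] for edge in chosen)
        )
        best = max(best, hit)
    return best


d_max = max_feature_misses(EDGE_FEATURE, k=1)
print(d_max)  # 4
```

Any graph that misses more than this many feature occurrences cannot contain the query relaxed by k edges, which is what makes dmax usable as a filtering threshold.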
21
Framework on Example

Given a graph database and a query graph
Index Construction
Build the feature-graph matrix

Feature Miss Estimation
Calculate dmax = 4

22
Framework on Example

Query Processing
Calculate the difference in the number of
features between each graph G and query Q.

Total number of occurrences: 7
dmax = 4
Misses: 5, 3, 2, 3
CQ = {G2, G3, G4}
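
The query-processing step of this example can be sketched as follows. The feature-graph matrix is not included in the transcript, so the counts below are hypothetical values chosen to reproduce the misses on the slide (5, 3, 2, 3). Counting a miss as max(0, query count - graph count) is likewise an assumption about how the "difference in the number of features" is measured.

```python
# Hypothetical feature-graph matrix (NOT from the transcript): number of
# occurrences of each feature in each database graph.
GRAPHS = ["G1", "G2", "G3", "G4"]
QUERY_COUNTS = {"fa": 1, "fb": 2, "fc": 4}  # seven occurrences in total
FEATURE_GRAPH = {
    "fa": [0, 1, 0, 0],
    "fb": [0, 0, 1, 0],
    "fc": [2, 3, 4, 4],
}


def feature_misses(features=("fa", "fb", "fc")):
    """Per graph, count the query feature occurrences the graph lacks."""
    misses = {}
    for col, graph in enumerate(GRAPHS):
        misses[graph] = sum(
            max(0, QUERY_COUNTS[f] - FEATURE_GRAPH[f][col]) for f in features
        )
    return misses


def candidate_set(d_max, features=("fa", "fb", "fc")):
    """Keep the graphs whose misses do not exceed d_max."""
    return [g for g, m in feature_misses(features).items() if m <= d_max]


misses = feature_misses()
candidates = candidate_set(d_max=4)
print(misses)      # {'G1': 5, 'G2': 3, 'G3': 2, 'G4': 3}
print(candidates)  # ['G2', 'G3', 'G4']
```

Only the matrix is consulted here; no graph in the database is compared structurally with the query.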
23
Question

Should we use all the features together in a
single filter?
Does a filter achieve good filtering performance
if all the features are used together?
Intuitively, such a strategy should improve
performance, since all the available information
is used.
However, this is not the case.

24
Question: Feature Miss Estimation
For k = 1, dmax = 2
25
Question: Query Processing

Total number of occurrences: 3
dmax = 2
Misses: 3, 3, 2, 2
CQ = {G2, G3}
26
Answer

By putting all features into one feature set, we
may fail to filter out some graphs that do not
satisfy the query requirement.
To improve the accuracy of the filtering, we
should select feature sets by grouping the
features.
The example implies that the filtering power may
be weakened if we deploy all the features in one
filter.
To measure the filtering power, we use
selectivity.

27
Selectivity

DEFINITION (SELECTIVITY)
Given a graph database D, a query Q, and a
feature f, the selectivity of f is defined by its
average frequency difference within D and Q.

Occurrences of fa in the query graph: 1
Occurrences of fb in the query graph: 2
Occurrences of fc in the query graph: 4
Selectivity of fa: 3/4
Selectivity of fb: 7/4
Selectivity of fc: 3/4
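
The selectivity values above can be reproduced with a short sketch. It assumes that "average frequency difference" means the average, over all graphs in D, of max(0, occurrences in Q - occurrences in G). The per-graph counts are hypothetical (they are not in the transcript) and were chosen to match the three values on the slide.

```python
from fractions import Fraction

# Hypothetical occurrence counts per database graph G1..G4 (NOT from the transcript).
QUERY_COUNTS = {"fa": 1, "fb": 2, "fc": 4}
FEATURE_GRAPH = {
    "fa": [0, 1, 0, 0],
    "fb": [0, 0, 1, 0],
    "fc": [2, 3, 4, 4],
}


def selectivity(feature):
    """Average shortfall of the feature across the database graphs."""
    counts = FEATURE_GRAPH[feature]
    shortfall = sum(max(0, QUERY_COUNTS[feature] - c) for c in counts)
    return Fraction(shortfall, len(counts))


selectivities = {f: selectivity(f) for f in QUERY_COUNTS}
for name, value in selectivities.items():
    print(name, value)
# fa 3/4
# fb 7/4
# fc 3/4
```

Under this reading, a feature with high selectivity is one that most database graphs lack relative to the query, so it contributes more to pruning.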
28
Feature Set Selection

Rule 1. Select a large number of features
Rule 2. Make sure features cover the query graph
uniformly.
Rule 3. Separate features with different
selectivity.

29
Feature Set Selection of Grafil

Grafil has two components for feature set
selection:
Base component (Grafil-base): combines features
of the same size
Clustering component

30
Clustering Component

Grafil first combines the features whose sizes
differ by at most 1, and sorts them by
selectivity.
Hierarchical clustering
Grafil divides them into three groups with high
selectivity, medium selectivity and low
selectivity.
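
One way to picture the grouping step is a one-dimensional agglomerative clustering over selectivity values, sketched below. The slide does not say which linkage or stopping rule Grafil uses, so single linkage and a fixed target of three groups are assumptions; the feature names and selectivity values are made up for illustration.

```python
def cluster_by_selectivity(selectivity, groups=3):
    """Merge the closest adjacent clusters until `groups` clusters remain."""
    # Sort features from high to low selectivity; start with one cluster each.
    ordered = sorted(selectivity.items(), key=lambda item: item[1], reverse=True)
    clusters = [[item] for item in ordered]
    while len(clusters) > groups:
        # Single-linkage gap: lowest value of one cluster minus the highest
        # value of the next cluster in the sorted order.
        gaps = [
            clusters[i][-1][1] - clusters[i + 1][0][1]
            for i in range(len(clusters) - 1)
        ]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [[name for name, _ in cluster] for cluster in clusters]


# Hypothetical selectivity values for six features of similar size.
example = {"f1": 1.75, "f2": 1.6, "f3": 0.8, "f4": 0.75, "f5": 0.1, "f6": 0.05}
high, medium, low = cluster_by_selectivity(example)
print(high, medium, low)  # ['f1', 'f2'] ['f3', 'f4'] ['f5', 'f6']
```

Each resulting group would then serve as its own feature set, so that features with very different selectivity are not mixed in one filter.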

31
Grafil Algorithm

Base component
Clustering component
Pipeline model

32
Conclusion

We discussed the problem of substructure
similarity search in large-scale graph databases,
a problem raised by the emergence of massive,
complex structural data.
Unlike previous work, our solution explores a
filtering algorithm that uses indexed structural
patterns, without costly structure comparisons.
The successful transformation of the
structure-based similarity measure into a
feature-based measure makes our method
attractive in terms of accuracy and efficiency.