Title: gSpan: Graph-Based Substructure Pattern Mining
1gSpan: Graph-Based Substructure Pattern Mining
- Xifeng Yan and Jiawei Han
- Presented by
- Quang Lam Nguyen
2Agenda
- Motivation and Basics
- gSpan - Algorithm
- Example
- Experimental Results
- Conclusion
3Mining frequent connected sub-graphs
- Given a minimum support minSup and a graph G, find
all the graphs in the database that have G as a
subgraph
- If the number of such graphs is no less than
minSup, then G is a frequent subgraph
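The definition above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not how gSpan itself counts support. The graph representation (a pair of vertex-label and edge-label dictionaries) and the names `contains`, `support` and `is_frequent` are assumptions made for this example.

```python
from itertools import permutations

def contains(graph, pattern):
    """Return True if `pattern` occurs in `graph` as a labeled subgraph.

    Both are (vertex_labels, edge_labels) pairs: vertex_labels maps a vertex
    id to its label, edge_labels maps an undirected edge (u, v) to its label.
    Brute force over all vertex mappings, so only suitable for tiny graphs.
    """
    g_vertices, g_edges = graph
    p_vertices, p_edges = pattern
    # Undirected lookup: (u, v) and (v, u) are the same edge.
    g_lookup = {frozenset(e): label for e, label in g_edges.items()}
    p_nodes = list(p_vertices)
    for image in permutations(g_vertices, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_vertices[u] != g_vertices[mapping[u]] for u in p_nodes):
            continue  # vertex labels do not match
        if all(g_lookup.get(frozenset((mapping[u], mapping[v]))) == label
               for (u, v), label in p_edges.items()):
            return True
    return False

def support(database, pattern):
    """Number of database graphs that contain the pattern."""
    return sum(1 for g in database if contains(g, pattern))

def is_frequent(database, pattern, min_sup):
    return support(database, pattern) >= min_sup

# Hypothetical toy database of three labeled graphs.
database = [
    ({0: "X", 1: "Y", 2: "Z"}, {(0, 1): "a", (1, 2): "b"}),
    ({0: "Y", 1: "X"}, {(0, 1): "a"}),
    ({0: "X", 1: "Z"}, {(0, 1): "a"}),
]
pattern = ({0: "X", 1: "Y"}, {(0, 1): "a"})
print(support(database, pattern), is_frequent(database, pattern, 2))
```

The pattern X-a-Y appears in the first two graphs only, so it is frequent for minSup = 2 but not for minSup = 3.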
4Basics - MinSup
[Figure: three example graphs, labeled (1), (2), (3)]
5gSpan - Algorithm
- Overview
- Graph, DFS Tree, DFS Code, Minimum DFS Code
- DFS Search Tree
- Pseudo Pseudo Code
6gSpan Overview (1)
- 3 concepts
- DFS Lexicographic Order of sub-graphs
- Minimum DFS Codes
- DFS (Depth First Search)
- Avoids the memory and time consumption problems of
candidate generation and BFS
- Output: set of frequent substructure patterns
within a graph dataset
7gSpan Overview (2)
- gSpan builds a search tree and returns the frequent
sub-graphs it finds.
[Figure: search tree levels for 0 edges, 1 edge, 2 edges]
- New edges are only added if the new child
represents a frequent sub-graph in the given
graph data set.
- The algorithm ensures that children representing
the same sub-graph are never built twice
8Graph, DFS-Tree, DFS-Code
- One graph can have several DFS-Trees
[Figure: panels a), b), c) showing DFS trees of one graph, with forward and backward edges marked]
- Each DFS Tree is represented by a sequence of
edges: the DFS code
9DFS Code
Edge  DFS code entry
0     (0,1,X,a,Y)
1     (1,2,Y,b,X)
2     (2,0,X,a,X)
3     (2,3,X,c,Z)
4     (3,1,Z,b,Y)
5     (1,4,Y,d,Z)
[Figure: example graph with vertices 0 (X), 1 (Y), 2 (X), 3 (Z), 4 (Z) and edge labels a, a, b, b, c, d]
-> A graph can have several DFS codes, so a canonical
form must be chosen
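As a small illustration, the DFS code on this slide can be written as a list of 5-tuples (i, j, label_i, edge_label, label_j). The sketch below rebuilds the labeled graph from the code and separates forward edges (i < j) from backward edges (i > j). The helper names `is_forward` and `rebuild_graph` are made up for this example.

```python
# DFS code from the slide: (i, j, label_i, edge_label, label_j)
dfs_code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),
    (2, 3, "X", "c", "Z"),
    (3, 1, "Z", "b", "Y"),
    (1, 4, "Y", "d", "Z"),
]

def is_forward(edge):
    """A forward edge discovers a new vertex, so its target index is larger."""
    i, j = edge[0], edge[1]
    return i < j

def rebuild_graph(code):
    """Recover vertex labels and undirected labeled edges from a DFS code."""
    vertex_labels, edges = {}, {}
    for i, j, label_i, edge_label, label_j in code:
        vertex_labels[i] = label_i
        vertex_labels[j] = label_j
        edges[frozenset((i, j))] = edge_label
    return vertex_labels, edges

vertex_labels, edges = rebuild_graph(dfs_code)
forward = [e[:2] for e in dfs_code if is_forward(e)]
backward = [e[:2] for e in dfs_code if not is_forward(e)]
print(vertex_labels)
print("forward:", forward)
print("backward:", backward)
```

The four forward edges form the DFS tree over the five vertices; the two backward edges close cycles.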
10Minimum DFS Code Graph Isomorphism
- Sort DFS Code Edges (<_e) and sort DFS Codes
(<_l)
- According to the <_l order of the DFS Codes of a graph G,
there is a minimum code min(G), the canonical form
- min(G) is unique
- Two graphs G and H are isomorphic if and only if
min(G) = min(H)
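The isomorphism test can be illustrated by brute force: enumerate every DFS code of a small graph, take the smallest, and compare. This is only a sketch for tiny, connected, simple graphs. It is not gSpan's own minimality check, and the sort key `edge_key` is an assumption that mirrors the edge order for codes sharing a common prefix (backward edges first, then forward edges from the deepest vertex, then labels).

```python
def all_dfs_codes(vlabel, elabel):
    """Yield every DFS code of a small connected, undirected, labeled graph.

    vlabel maps vertex -> label; elabel maps frozenset({u, v}) -> edge label.
    """
    adj = {v: set() for v in vlabel}
    for edge in elabel:
        u, w = tuple(edge)
        adj[u].add(w)
        adj[w].add(u)

    def grow(index, stack, code):
        # Backtrack while the deepest vertex has no undiscovered neighbor.
        while stack and all(w in index for w in adj[stack[-1]]):
            stack = stack[:-1]
        if not stack:  # connected graph: every vertex and edge is emitted
            yield tuple(code)
            return
        v = stack[-1]
        for w in adj[v]:
            if w in index:
                continue
            i, j = index[v], len(index)
            step = [(i, j, vlabel[v], elabel[frozenset((v, w))], vlabel[w])]
            # Backward edges from the new vertex, smallest target index first.
            for k, u in sorted((index[u], u) for u in adj[w]
                               if u in index and u != v):
                step.append((j, k, vlabel[w], elabel[frozenset((w, u))], vlabel[u]))
            yield from grow({**index, w: j}, stack + [w], code + step)

    for start in vlabel:
        yield from grow({start: 0}, [start], [])

def edge_key(edge):
    i, j, li, le, lj = edge
    if i < j:                  # forward edge: deeper source vertex first
        return (1, -i, li, le, lj)
    return (0, j, le)          # backward edge: sorts before forward edges

def min_dfs_code(vlabel, elabel):
    return min(all_dfs_codes(vlabel, elabel),
               key=lambda code: [edge_key(e) for e in code])

def isomorphic(g, h):
    return min_dfs_code(*g) == min_dfs_code(*h)

# Hypothetical example: the same labeled triangle under two vertex namings,
# plus a third graph that differs in one edge label.
g = ({1: "X", 2: "Y", 3: "X"},
     {frozenset((1, 2)): "a", frozenset((2, 3)): "b", frozenset((3, 1)): "a"})
h = ({"p": "X", "q": "X", "r": "Y"},
     {frozenset(("p", "r")): "a", frozenset(("r", "q")): "b", frozenset(("q", "p")): "a"})
k = ({1: "X", 2: "Y", 3: "X"},
     {frozenset((1, 2)): "a", frozenset((2, 3)): "b", frozenset((3, 1)): "c"})
print(isomorphic(g, h), isomorphic(g, k))
```

Enumerating all DFS codes grows very quickly with graph size, so this only serves to make the idea of a canonical form concrete.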
11DFS Search Tree (1)
- Each node represents a graph through a DFS Tree
12DFS Search Tree (2) - Pruning
[Figure: DFS search tree in which a node whose DFS code is not minimal is pruned]
13DFS Search Tree (3)
- Each node represents a DFS Tree (thus a graph),
and a node's children are DFS trees grown by one
edge.
- The tree is built depth-first,
- And whenever a node is expanded, the
lexicographic ordering <_l is applied to the
node's children.
- That is, a node's leftmost child is always the
smallest, and is searched first.
- This ensures that the smallest motifs are always
visited first.
14Example (1)
- Dataset: graph a)
- Step 1: Clean the graph according to minSup -> b)
MinSup = 2
- Step 2: Find all frequent single-edged
graphs/patterns
(0,1,a,c) -> (a_5,c_3),(a_6,c_1)
(0,1,b,c) -> (b_2,c_3),(b_4,c_1)
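Steps 1 and 2 can be sketched as follows. The toy data is hypothetical and only loosely modeled on the vertex names above. Support is assumed to count graphs, as in the problem definition, and edges are assumed to carry no labels. The function names are illustrative.

```python
from collections import Counter

def frequent_labels(db, min_sup):
    """Vertex labels that occur in at least min_sup graphs."""
    counts = Counter()
    for vlabels, _ in db:
        counts.update(set(vlabels.values()))  # count each label once per graph
    return {label for label, c in counts.items() if c >= min_sup}

def clean(db, min_sup):
    """Step 1: drop vertices with infrequent labels, and their edges."""
    keep = frequent_labels(db, min_sup)
    cleaned = []
    for vlabels, edges in db:
        v = {n: label for n, label in vlabels.items() if label in keep}
        e = [(a, b) for a, b in edges if a in v and b in v]
        cleaned.append((v, e))
    return cleaned

def frequent_single_edges(db, min_sup):
    """Step 2: label pairs of edges that occur in at least min_sup graphs."""
    counts = Counter()
    for vlabels, edges in db:
        counts.update({tuple(sorted((vlabels[a], vlabels[b]))) for a, b in edges})
    return sorted(pair for pair, c in counts.items() if c >= min_sup)

# Hypothetical database: each graph is (vertex id -> label, list of edges).
db = [
    ({1: "c", 5: "a", 2: "b", 7: "d"}, [(5, 1), (2, 1), (7, 1)]),
    ({1: "c", 6: "a", 4: "b"}, [(6, 1), (4, 1)]),
]
cleaned = clean(db, min_sup=2)
print(frequent_single_edges(cleaned, min_sup=2))
```

With minSup = 2 the label d is removed in Step 1, and the remaining frequent single-edge patterns are (a, c) and (b, c).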
15Example (2)
- Sort graphs and go depth-first
- Expand if children are frequent sub-graphs
- Otherwise backtrack; prune if not minimal
- Return: pattern (a,b,c) and its instances
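The control flow of these steps (sorted children, depth-first growth, backtracking on infrequent children, pruning non-minimal duplicates) can be shown on a much simpler pattern type. The sketch below deliberately swaps graphs for itemsets, so it is not gSpan: a pattern is a tuple of items, a child adds one item, and a pattern counts as "minimal" when its items are in sorted order. All names are hypothetical.

```python
def mine(db, min_sup):
    """Depth-first pattern growth over itemsets (a toy stand-in for graphs)."""
    items = sorted({x for transaction in db for x in transaction})
    results = []

    def support(pattern):
        return sum(1 for t in db if set(pattern) <= t)

    def is_minimal(pattern):
        # Canonical form: the same itemset written in any other order
        # is a duplicate and gets pruned.
        return list(pattern) == sorted(pattern)

    def grow(pattern):
        for x in items:                    # children in sorted order
            if x in pattern:
                continue
            child = pattern + (x,)
            if not is_minimal(child):      # prune: duplicate of another node
                continue
            if support(child) < min_sup:   # infrequent: backtrack
                continue
            results.append(child)
            grow(child)                    # go depth-first

    grow(())
    return results

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
print(mine(db, min_sup=2))
```

In gSpan the same skeleton applies with DFS codes as patterns, rightmost-path extension as the child generator, and the minimum DFS code test as the pruning check.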
16Experimental Results
- Synthetic Data
- Chemical Data
17Synthetic Data
Number of graphs: 10 000
Number of frequent sub-graphs: 200
minSup: 100
N: Number of labels
I: Size of sub-graphs
T: Size of graphs
18Chemical Data
340 molecules, with 66 atom types and 4 bond types as
labels. On average, a molecule has only 27 vertices
and 28 edges.
19Conclusion
- Lower memory requirements
- No Candidate Generation
- False Positives Pruning
- Lexicographic Ordering minimizes search tree
20Appendix
- Sources
- DFS Edge Order <_e
- DFS Lexicographic Order <_l
21Sources
- Other presentations on gSpan
- download.informatik.uni-freiburg.de/lectures/ML/2004-2005WS/Misc/Slides/16_1_Toxicology_4up.pdf
- www.informatik.uni-freiburg.de/ml/teaching/ws04/lm/20041109_gSpan_Guetlein.ppt
- Nice text on gSpan and similar algorithms
- http://www.diva-portal.org/ntnu/abstract.xsql?dbid=1112
- Webpage for graph mining
- http://hms.liacs.nl/graphs.html
22DFS Codes DFS Edge Order - <_e
- Recall from class that we can define an order
over the edges of such a graph with respect to
the DFS tree
- A forward edge is listed as (v_i, v_j) with i < j;
a backward edge has i > j
- Given edges e = (v_i, v_j) and e' = (v_i', v_j')
- e <_e e' if
- j < j' or (i > i' and j = j') when both are
forward edges
- i < i' or (j < j' and i = i') when both are
backward edges
- j <= i' when e is forward and e' is backward
- i < j' when e is backward and e' is forward
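The four cases can be written as a comparison function. This is a sketch under two assumptions: forward edges have i < j and backward edges have i > j, and the forward-before-backward case uses j <= i', which is what lets the forward edge (1,2) precede the backward edge (2,0) in the DFS code shown earlier. The names `edge_lt` and `compare` are illustrative.

```python
from functools import cmp_to_key

def is_forward(edge):
    i, j = edge
    return i < j

def edge_lt(e1, e2):
    """True if e1 comes before e2 in the DFS edge order."""
    (i1, j1), (i2, j2) = e1, e2
    f1, f2 = is_forward(e1), is_forward(e2)
    if f1 and f2:                 # both forward
        return j1 < j2 or (i1 > i2 and j1 == j2)
    if not f1 and not f2:         # both backward
        return i1 < i2 or (i1 == i2 and j1 < j2)
    if f1:                        # e1 forward, e2 backward
        return j1 <= i2
    return i1 < j2                # e1 backward, e2 forward

def compare(e1, e2):
    if edge_lt(e1, e2):
        return -1
    return 1 if edge_lt(e2, e1) else 0

# (i, j) pairs of the DFS code from the earlier slide, in code order.
code_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 1), (1, 4)]
shuffled = list(reversed(code_edges))
print(sorted(shuffled, key=cmp_to_key(compare)))
```

Sorting the reversed edge list with this comparison recovers the order in which the edges appear in the DFS code.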
23DFS Lexicographic Order - <_l