gSpan: Graph-Based Substructure Pattern Mining

1
gSpan: Graph-Based Substructure Pattern Mining
  • Xifeng Yan and Jiawei Han
  • Presented by
  • Quang Lam Nguyen

2
Agenda
  • Motivation and Basics
  • gSpan - Algorithm
  • Example
  • Experimental Results
  • Conclusion

3
Mining frequent connected sub-graphs
  • Given a minimum support minSup and a graph G, find
    all the graphs in the database that have G as a
    subgraph
  • If the number of such graphs is no less than
    minSup, then G is a frequent subgraph
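
The definition above can be sketched in a few lines of Python. This is only an illustration, not how gSpan itself counts support: the subgraph test is a brute-force search over vertex mappings, and the graph representation, the helper names (`make_graph`, `contains`, `support`, `is_frequent`) and the toy database are all assumptions made for this example.

```python
from itertools import permutations

def make_graph(vertex_labels, labeled_edges):
    """Build an undirected labeled graph (hypothetical representation).

    vertex_labels: dict vertex -> label
    labeled_edges: iterable of (u, v, edge_label)
    """
    edges = {frozenset((u, v)): lab for u, v, lab in labeled_edges}
    return dict(vertex_labels), edges

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a subgraph (brute force)."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_nodes):
            continue  # vertex labels do not match
        if all(g_edges.get(frozenset(mapping[v] for v in edge)) == lab
               for edge, lab in p_edges.items()):
            return True  # every pattern edge exists with the same label
    return False

def support(database, pattern):
    """Number of database graphs that contain `pattern`."""
    return sum(contains(g, pattern) for g in database)

def is_frequent(database, pattern, min_sup):
    """A pattern is frequent if its support is no less than min_sup."""
    return support(database, pattern) >= min_sup

# Toy database, made up for this sketch.
db = [
    make_graph({0: "X", 1: "Y", 2: "Z"}, [(0, 1, "a"), (1, 2, "b")]),
    make_graph({0: "Y", 1: "X"}, [(0, 1, "a")]),
    make_graph({0: "X", 1: "Z"}, [(0, 1, "a")]),
]
pattern = make_graph({0: "X", 1: "Y"}, [(0, 1, "a")])
print(support(db, pattern), is_frequent(db, pattern, 2))
```

The brute-force test takes factorial time in the worst case, so it is usable only on very small graphs.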

4
Basics - MinSup
[Figure: panels (1), (2), (3)]

5
gSpan - Algorithm
  • Overview
  • Graph, DFS Tree, DFS Code, Minimum DFS Code
  • DFS Search Tree
  • Pseudo Pseudo Code

6
gSpan Overview (1)
  • 3 concepts: DFS lexicographic order of sub-graphs,
    minimum DFS codes, and DFS (depth-first search)
  • Avoids the memory and time consumption problems
    of candidate generation and BFS
  • Output: set of frequent substructure patterns
    within a graph dataset

7
gSpan Overview (2)
  • gSpan builds a search tree and returns the
    frequent sub-graphs.

[Figure: search tree with levels labeled 0 edges, 1 edge, 2 edges]
  • New edges are only added if the new child
    represents a frequent sub-graph in the given
    graph data set.
  • The algorithm ensures that children representing
    the same sub-graph are never built twice

8
Graph, DFS-Tree, DFS-Code
  • One graph can have several DFS-Trees

[Figure: panels a), b), c); forward and backward edges are marked]
  • Each DFS tree is represented by a sequence of
    edges: the DFS code

9
DFS Code
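[Figure: graph with vertices 0 (X), 1 (Y), 2 (X), 3 (Z), 4 (Z) and edges labeled a, b, a, c, b, d]
Edge 0: (0,1,X,a,Y)
Edge 1: (1,2,Y,b,X)
Edge 2: (2,0,X,a,X)
Edge 3: (2,3,X,c,Z)
Edge 4: (3,1,Z,b,Y)
Edge 5: (1,4,Y,d,Z)
--> A graph can have several DFS codes, so a canonical form must be chosen
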
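
As a small illustration, the DFS code on this slide can be read back into a graph. The sketch below assumes the usual convention that an edge (i, j) with i < j is a forward edge and one with i > j is a backward edge; the function name `decode` is hypothetical.

```python
# The six 5-tuples (i, j, label_i, edge_label, label_j) from the slide.
dfs_code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),
    (2, 3, "X", "c", "Z"),
    (3, 1, "Z", "b", "Y"),
    (1, 4, "Y", "d", "Z"),
]

def decode(code):
    """Recover vertex labels and split edges into forward and backward.

    Assumption: i < j marks a forward edge, i > j a backward edge.
    """
    vertex_labels = {}
    forward, backward = [], []
    for i, j, label_i, edge_label, label_j in code:
        for v, lab in ((i, label_i), (j, label_j)):
            # Each vertex must carry the same label everywhere in the code.
            if vertex_labels.setdefault(v, lab) != lab:
                raise ValueError(f"inconsistent label for vertex {v}")
        (forward if i < j else backward).append((i, j, edge_label))
    return vertex_labels, forward, backward

vertex_labels, forward, backward = decode(dfs_code)
print(vertex_labels)  # {0: 'X', 1: 'Y', 2: 'X', 3: 'Z', 4: 'Z'}
print(forward)        # [(0, 1, 'a'), (1, 2, 'b'), (2, 3, 'c'), (1, 4, 'd')]
print(backward)       # [(2, 0, 'a'), (3, 1, 'b')]
```

The four forward edges form the DFS tree over the five vertices; the two backward edges close cycles.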
10
Minimum DFS Code / Graph Isomorphism
  • Sort DFS code edges (<_e) and sort DFS codes
    (<_l)
  • According to the <_l order of the DFS codes of a
    graph G, there is a minimum code min(G), the
    canonical form
  • min(G) is unique
  • Two graphs G and H are isomorphic if and only if
    min(G) = min(H)
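
The role of a canonical form can be illustrated without implementing the minimum DFS code itself. The sketch below swaps in a much simpler (and much slower) canonical form: it tries every vertex numbering and keeps the lexicographically smallest description. It is not min(G) as defined on this slide; it only shows the property that two graphs are isomorphic exactly when their canonical forms are equal. The function name and the toy graphs are assumptions.

```python
from itertools import permutations

def canonical_form(labels, edges):
    """Brute-force canonical form of a small labeled undirected graph.

    labels: list of vertex labels, indexed by vertex number
    edges:  list of (u, v, edge_label)
    NOTE: a stand-in for the minimum DFS code, factorial in len(labels).
    """
    n = len(labels)
    best = None
    for perm in permutations(range(n)):
        # perm[v] is the new number given to vertex v.
        new_labels = [None] * n
        for v, new in enumerate(perm):
            new_labels[new] = labels[v]
        new_edges = sorted(
            (min(perm[u], perm[v]), max(perm[u], perm[v]), lab)
            for u, v, lab in edges
        )
        key = (tuple(new_labels), tuple(new_edges))
        if best is None or key < best:
            best = key
    return best

# G and H are the same path X -a- Y -b- X with different vertex numbering;
# K uses edge label a twice and is therefore a different graph.
G = (["X", "Y", "X"], [(0, 1, "a"), (1, 2, "b")])
H = (["Y", "X", "X"], [(0, 2, "a"), (0, 1, "b")])
K = (["X", "Y", "X"], [(0, 1, "a"), (1, 2, "a")])

print(canonical_form(*G) == canonical_form(*H))  # True
print(canonical_form(*G) == canonical_form(*K))  # False
```

gSpan's minimum DFS code serves the same purpose but is built edge by edge, which lets the search discard non-minimal codes early.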

11
DFS Search Tree (1)
  • Each node represents a graph, given as a DFS tree

12
DFS Search Tree (2) - Pruning
[Figure: DFS search tree; labels: "Not minimal", "Pruning"]

13
DFS Search Tree (3)
  • Each node represents a DFS tree (thus a graph),
    and a node's children are DFS trees grown by one
    edge.
  • The tree is built depth-first.
  • Whenever a node is expanded, the lexicographic
    ordering <_l is applied to the node's children.
  • That is, a node's leftmost child is always the
    smallest, and is searched first.
  • This ensures that the smallest motifs are always
    visited first.

14
Example (1)
  • Dataset: graph a)
  • Step 1: Clean the graph according to minSup -> b)

minSup = 2
  • Step 2: Find all frequent single-edge
    graphs/patterns

(0,1,a,c) -> (a_5,c_3),(a_6,c_1)
(0,1,b,c) -> (b_2,c_3),(b_4,c_1)
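
Steps 1 and 2 can be sketched as follows. The slide's graph a) is not recoverable, so the toy graph below is made up: it is chosen only so that it reproduces the two instance lists above, plus one extra vertex with the hypothetical label d that Step 1 removes. Support is counted as the number of instances inside the single graph, which is what the example appears to do. The function names are assumptions.

```python
from collections import Counter, defaultdict

MIN_SUP = 2

# Made-up graph: vertex -> label, and unlabeled edges.
labels = {1: "c", 2: "b", 3: "c", 4: "b", 5: "a", 6: "a", 7: "d"}
edges = [(5, 3), (6, 1), (2, 3), (4, 1), (7, 1)]

def clean(labels, edges, min_sup):
    """Step 1: drop vertices with infrequent labels, and their edges."""
    counts = Counter(labels.values())
    kept = {v: lab for v, lab in labels.items() if counts[lab] >= min_sup}
    kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, kept_edges

def frequent_single_edges(labels, edges, min_sup):
    """Step 2: group edges by their label pair and keep frequent pairs."""
    instances = defaultdict(list)
    for u, v in edges:
        # Order the endpoints by label so a-c and c-a are one pattern.
        (lu, u), (lv, v) = sorted([(labels[u], u), (labels[v], v)])
        instances[(lu, lv)].append((u, v))
    return {p: inst for p, inst in instances.items() if len(inst) >= min_sup}

kept, kept_edges = clean(labels, edges, MIN_SUP)
result = frequent_single_edges(kept, kept_edges, MIN_SUP)
print(result)
# {('a', 'c'): [(5, 3), (6, 1)], ('b', 'c'): [(2, 3), (4, 1)]}
```

Each surviving pattern keeps its list of instances, so later extensions only need to look at edges adjacent to those instances.
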
15
Example (2)
  • Sort the graphs and go depth-first
  • Expand if the children are frequent sub-graphs
  • Otherwise backtrack; prune if not minimal
  • Return the pattern (a,b,c) and its instances

16
Experimental Results
  • Synthetic Data
  • Chemical Data

17
Synthetic Data
  • Number of graphs: 10 000
  • Number of frequent sub-graphs: 200
  • minSup: 100
  • N: number of labels
  • I: size of sub-graphs
  • T: size of graphs

18
Chemical Data
  • 340 molecules
  • 66 atom types and 4 bond types as labels
  • On average only 27 vertices with 28 edges

19
Conclusion
  • Lower memory requirements
  • No Candidate Generation
  • False Positives Pruning
  • Lexicographic Ordering minimizes search tree

20
Appendix
  • Sources
  • DFS Edge Order <_e
  • DFS Lexicographic Order <_l

21
Sources
  • Other presentations on gSpan
  • download.informatik.uni-freiburg.de/lectures/ML/2004-2005WS/Misc/Slides/16_1_Toxicology_4up.pdf
  • www.informatik.uni-freiburg.de/ml/teaching/ws04/lm/20041109_gSpan_Guetlein.ppt
  • Nice text on gSpan and similar algorithms
  • http://www.diva-portal.org/ntnu/abstract.xsql?dbid=1112
  • Webpage for graph mining
  • http://hms.liacs.nl/graphs.html

22
DFS Codes: DFS Edge Order - <_e
  • Recall from class that we can define an order
    over the edges of such a graph with respect to
    the DFS tree
  • An edge is always listed (v_i, v_j) such that i < j
  • Given edges e = (v_i, v_j) and e' = (v_i', v_j')
  • e <_e e' if
  • j < j' or (i > i' and j = j') when both are
    forward edges
  • i < i' or (j < j' and i = i') when both are
    backward edges
  • j < i' when e is forward and e' is backward
  • i < j' when e is backward and e' is forward

23
DFS Lexicographic Order - <_l