Title: Lei Shi
1Seminar 2009
Frequent Subgraph/Substructure Mining
- Lei Shi
- Department of Computer Science and Engineering
- State University of New York at Buffalo
2Outline
- Introduction
- Apriori-based Subgraph Mining
- Pattern Growth Subgraph Mining
- Summary
3Graphs are everywhere
4Graph Mining Problems
- Graph Pattern Mining
- Frequent subgraph pattern mining
- Pattern summarization
- Optimal graph patterns
- Graph patterns with constraints
- Approximate graph patterns
- Graph Classification
- Graph clustering
- Important node identification
- Bridge and hub identification
- Other Important Topics
- Graph compression
- Graph model
- Social network analysis.
5Subgraph pattern Mining
- Frequent subgraph
- A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
- Application of subgraph pattern mining
- Mining biochemical structures
- Program control flow analysis
- Mining XML structures or Web communities
- Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
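The support definition on this slide can be made concrete with a small sketch. It assumes each graph is stored as a dictionary of vertex labels plus a dictionary of labeled undirected edges, and it tests containment by brute force over all vertex mappings, which is only practical for tiny graphs. The names `make_graph`, `contains`, `support`, and the toy dataset are illustrative and do not come from the slides.

```python
from itertools import permutations

def make_graph(vertex_labels, labeled_edges):
    """Build (vertex labels, {frozenset({u, v}): edge label}) for an undirected graph."""
    return vertex_labels, {frozenset((u, v)): lab for u, v, lab in labeled_edges}

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a label-preserving subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_vertices = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        if any(p_labels[v] != g_labels[mapping[v]] for v in p_vertices):
            continue
        # Every pattern edge must exist in the graph with the same label.
        if all(g_edges.get(frozenset(mapping[v] for v in edge)) == lab
               for edge, lab in p_edges.items()):
            return True
    return False

def support(dataset, pattern):
    """Fraction of graphs in `dataset` that contain `pattern`."""
    return sum(contains(g, pattern) for g in dataset) / len(dataset)

# Toy dataset of three small "molecules" (hypothetical data).
dataset = [
    make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1, "-"), (1, 2, "-")]),
    make_graph({0: "C", 1: "O"}, [(0, 1, "-")]),
    make_graph({0: "C", 1: "C", 2: "N"}, [(0, 1, "-"), (1, 2, "=")]),
]
c_o = make_graph({0: "C", 1: "O"}, [(0, 1, "-")])
c_n = make_graph({0: "C", 1: "N"}, [(0, 1, "=")])

min_sup = 0.5
print(support(dataset, c_o) >= min_sup)  # C-O occurs in 2 of 3 graphs
print(support(dataset, c_n) >= min_sup)  # C=N occurs in 1 of 3 graphs
```

With a minimum support of 0.5, the C-O pattern counts as frequent and the C=N pattern does not.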
6Frequent Subgraph Example
(1) (2) (3)
7Key Challenges in Subgraph Mining
- Graph isomorphism
- Detect whether two graphs are identical in structure
- Graph representation (Canonical Labeling)
- A canonical label is a unique code of a given graph.
- The canonical label should be the same no matter how the graph is represented, as long as the graphs have the same topological structure and the same labeling of edges and vertices.
- Subgraph candidate generation
- Generate candidate frequent subgraphs from the dataset
8Subgraph Mining Approaches
- Apriori-based
- AGM/AcGM: Inokuchi et al. (PKDD'00)
- FSG: Kuramochi and Karypis (ICDM'01)
- M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM'01, pages 313-320, Nov. 2001.
- PATH: Vanetik and Gudes (ICDM'02, ICDM'04)
- FFSM: Huan et al. (ICDM'03) and SPIN: Huan et al. (KDD'04)
- FTOSM: Horvath et al. (KDD'06)
- Pattern growth based
- Subdue: Holder et al. (KDD'94)
- MoFa: Borgelt and Berthold (ICDM'02)
- gSpan: Yan and Han (ICDM'02)
- Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721.
- Gaston: Nijssen and Kok (KDD'04)
- CMTreeMiner: Chi et al. (TKDE'05)
- LEAP: Yan et al. (SIGMOD'08)
9Outline
- Introduction and Background
- Apriori-based Subgraph Mining
- Pattern Growth Subgraph Mining
- Summary
10Apriori-based Approach
- FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM'01, Nov. 2001.
- Flattened representation as canonical labeling
- Apriori-based method to generate subgraph candidates
11Graph Representation in FSG
12Graph Representation in FSG
- Flattened Representation
- Lexicographic order (dictionary order)
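One way to picture a flattened representation with lexicographic ordering is the sketch below. It assumes the label is built by listing the vertex labels and then the upper triangle of the adjacency matrix column by column, and that the canonical label is the lexicographically smallest such code over all vertex orderings. Picking the smallest code is an arbitrary choice for this illustration, and the exhaustive search over permutations is far slower than what FSG actually does. The function name `canonical_label` is hypothetical.

```python
from itertools import permutations

def canonical_label(vertex_labels, edges):
    """Return a canonical code for a small labeled undirected graph.

    vertex_labels: list of labels, vertices are 0..n-1
    edges: dict {(u, v): edge_label}
    """
    n = len(vertex_labels)
    # Symmetric lookup so edge direction in the input does not matter.
    sym = {}
    for (u, v), lab in edges.items():
        sym[(u, v)] = lab
        sym[(v, u)] = lab
    best = None
    for order in permutations(range(n)):
        # Flatten: vertex labels first, then upper-triangle entries column by column.
        code = [vertex_labels[v] for v in order]
        for col in range(1, n):
            for row in range(col):
                code.append(sym.get((order[row], order[col]), "0"))  # "0" marks no edge
        code = tuple(code)
        if best is None or code < best:
            best = code
    return best

# The same path graph written with two different vertex numberings.
g1 = canonical_label(["a", "b", "a"], {(0, 1): "x", (1, 2): "y"})
g2 = canonical_label(["b", "a", "a"], {(0, 1): "x", (0, 2): "y"})
# A graph with the same shape but different edge labels.
g3 = canonical_label(["a", "b", "a"], {(0, 1): "x", (1, 2): "x"})

print(g1 == g2)  # same structure and labels, so same canonical code
print(g1 == g3)  # different edge labels, so different canonical code
```

Because the code is minimized over every vertex ordering, two graphs receive the same label exactly when they have the same structure and labeling, which is the property the canonical label needs.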
13Apriori-based method
- Apriori Property
- If a graph is frequent, all of its subgraphs are frequent.
- Candidate Generation
- Create a set of candidates of size k+1
- from two given frequent k-subgraphs
- containing the same (k-1)-subgraph
- The join can result in several candidates of size k+1
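The join step can be sketched in a heavily simplified form. The sketch assumes every subgraph is a set of labeled edges over vertex identifiers that are already aligned across subgraphs, so sharing a (k-1)-edge core reduces to a set intersection. This sidesteps the isomorphism and canonical-labeling work that a real Apriori-based miner must do, and it is also why this version yields one candidate per pair, whereas the real join can yield several. The name `join_candidates` is hypothetical.

```python
from itertools import combinations

def join_candidates(frequent_k):
    """Join pairs of k-edge subgraphs that share k-1 edges into (k+1)-edge candidates.

    frequent_k: iterable of frozensets of edges, each edge a tuple (u, v, label)
    with u < v. Vertex ids are assumed to be aligned across subgraphs.
    """
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        k = len(a)
        if len(b) == k and len(a & b) == k - 1:
            candidates.add(a | b)  # the union has exactly k + 1 edges
    return candidates

e1, e2, e3 = (0, 1, "x"), (1, 2, "x"), (1, 3, "y")
frequent_2 = [frozenset({e1, e2}), frozenset({e1, e3}), frozenset({e2, e3})]

candidates_3 = join_candidates(frequent_2)
print(candidates_3)  # a single 3-edge candidate containing e1, e2, e3
```

Each candidate would still need a support count, and by the Apriori property it can be discarded early if any of its k-edge subgraphs is not in the frequent set.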
14Apriori-based method
- Graph candidate generated Example
15Apriori-based method
16Apriori-based method
- Experiment Result
- Chemical compound dataset, which contains 340 compounds and 24 different atoms (vertices)
17Outline
- Introduction
- Apriori-based Subgraph Mining
- Pattern Growth Subgraph Mining
- Summary
18Motivation of gSpan
- Weakness of Apriori-based approach
- Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
- Pruning false positives requires subgraph isomorphism testing, which is an NP-complete problem and therefore costly.
- gSpan: Graph-Based Substructure Pattern Mining
- Change the way a graph is represented (DFS: Depth-First Search)
- Use pattern growth to generate new subgraph candidates
19gSpan: Graph-Based Substructure Pattern Mining
- DFS (Depth-First Search) Code
- First step: traverse the graph by DFS and use the edges on the path to represent the graph.
- Second step: DFS lexicographic order
- Pattern growth subgraph generation
20DFS code
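An edge is represented by a 5-tuple (i, j, l_i, l_(i,j), l_j): the DFS indices of its two endpoints, the label of the first endpoint, the edge label, and the label of the second endpoint.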
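The first step, turning one DFS traversal into a sequence of 5-tuples, might look like the sketch below. It assumes an undirected graph without self-loops and visits neighbors in sorted order of their identifiers, so it produces one DFS code for the chosen start vertex, not the minimum DFS code that the second step selects. When a vertex is discovered, its backward edges to already discovered vertices are emitted before any forward edge that leaves it. The function name `dfs_code` and the example graph are illustrative only.

```python
def dfs_code(vertex_labels, edges, start):
    """Return one DFS code (list of 5-tuples) for the traversal starting at `start`.

    vertex_labels: dict {vertex: label}
    edges: list of (u, v, edge_label) for an undirected graph without self-loops
    """
    adj = {v: {} for v in vertex_labels}
    for u, v, lab in edges:
        adj[u][v] = lab
        adj[v][u] = lab

    index = {}  # vertex -> DFS discovery index
    code = []

    def visit(v, parent):
        index[v] = len(index)
        # Backward edges: to discovered vertices other than the DFS parent.
        back = sorted((index[u], u) for u in adj[v] if u in index and u not in (v, parent))
        for j, u in back:
            code.append((index[v], j, vertex_labels[v], adj[v][u], vertex_labels[u]))
        # Forward edges: each one discovers a new vertex.
        for u in sorted(adj[v]):
            if u not in index:
                code.append((index[v], len(index), vertex_labels[v], adj[v][u], vertex_labels[u]))
                visit(u, v)

    visit(start, None)
    return code

labels = {0: "X", 1: "Y", 2: "X", 3: "Z"}
graph_edges = [(0, 1, "a"), (1, 2, "b"), (2, 0, "a"), (1, 3, "c")]

example_code = dfs_code(labels, graph_edges, start=0)
for edge in example_code:
    print(edge)
```

In the printed code, the tuple whose first index is larger than its second is the backward edge that closes the triangle. The other three tuples are forward edges.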
21DFS code
- Second step: DFS lexicographic order
22Pattern Growth Approach
- Pattern Growth (free extension)
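Free extension can be illustrated on a single graph: given the edges of one occurrence of a pattern, every remaining graph edge that touches the occurrence is a possible one-edge growth. The sketch below lists them. It assumes edges are stored as two-element frozensets of vertex identifiers and ignores labels and duplicate detection, both of which a complete miner has to handle. The name `one_edge_extensions` is hypothetical.

```python
def one_edge_extensions(graph_edges, pattern_edges):
    """List graph edges that extend the embedded pattern by exactly one edge.

    graph_edges, pattern_edges: sets of frozenset({u, v}); the pattern is an
    occurrence inside the graph, so pattern_edges is a subset of graph_edges.
    """
    pattern_vertices = {v for edge in pattern_edges for v in edge}
    return sorted(
        tuple(sorted(edge))
        for edge in graph_edges - pattern_edges
        if pattern_vertices & edge  # the new edge must touch the pattern
    )

graph = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]}

grow_from_one = one_edge_extensions(graph, {frozenset((0, 1))})
grow_from_two = one_edge_extensions(graph, {frozenset((0, 1)), frozenset((1, 2))})

print(grow_from_one)  # edges touching vertex 0 or 1
print(grow_from_two)  # (0, 2) closes a cycle, (2, 3) adds a new vertex
```

Because the same larger pattern can be reached by growing from different smaller ones, unrestricted extension produces duplicates, so a miner needs some way to recognize patterns it has already generated.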
23Pattern Growth Approach
24Pattern Growth Approach
25Pattern Growth Approach
26Pattern Growth Approach
27gSpan
28gSpan
29Pattern Growth Approach
- Experimental result using Chemical data
- 340 molecules
- 66 atom types and 4 bond types as labels
- On average, only 27 vertices and 28 edges per graph
30Summary
- Graph representation
- Flattened representation vs. DFS code
- Generation of Candidate Patterns
- apriori vs. pattern growth
32Pattern-Growth Approach
33Frequent Graph Pattern
- Given a graph dataset D, find subgraph g, s.t. support(g) >= min_sup
- where support(g) is the percentage of graphs in D that contain g and min_sup is the minimum support threshold.
- Problem 1: Exponential Pattern Set
- Problem 2: Threshold Setting
34Difference between frequent itemset and frequent subgraph discovery
35Frequent itemset discovery
36Subgraph Mining Algorithms
- Apriori-based approach
- AGM/AcGM: Inokuchi et al. (PKDD'00)
- FSG: Kuramochi and Karypis (ICDM'01)
- PATH: Vanetik and Gudes (ICDM'02, ICDM'04)
- FFSM: Huan et al. (ICDM'03) and SPIN: Huan et al. (KDD'04)
- FTOSM: Horvath et al. (KDD'06)
- Pattern growth approach
- Subdue: Holder et al. (KDD'94)
- MoFa: Borgelt and Berthold (ICDM'02)
- gSpan: Yan and Han (ICDM'02)
- Gaston: Nijssen and Kok (KDD'04)
- CMTreeMiner: Chi et al. (TKDE'05)
- LEAP: Yan et al. (SIGMOD'08)
37Framework of Subgraph Mining Algorithms
- Search Order
- breadth vs. depth
- complete vs. incomplete
- Generation of Candidate Patterns
- apriori vs. pattern growth
- Discovery Order of Patterns
- DFS order
- path -> tree -> graph
- Elimination of Duplicate Subgraphs
- passive vs. active
- Support Calculation
- store embeddings or not
38Frequent Subgraph Examples
39Example (cont.)
41Outline
- Introduction and Background
- Apriori-based Subgraph Mining
- Pattern Growth Subgraph Mining
- Summary
- DFS code
- Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02) (December 09-12, 2002). IEEE Computer Society, Washington, DC, 721.
42Pattern Growth Approach