Title: Efficient Mining of Graph-Based Data
1Efficient Mining of Graph-Based Data
Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook
University of Texas at Arlington
Department of Computer Science and Engineering
http://cygnus.uta.edu/subdue
2Motivation
- Structural/relational data
- Ease of graph representation
3Graph-Based Discovery
[Figure: three panels. Input Database (objects T1-T4, B1-B4, C1, R1); Substructure S1 in graph form (an object with shape triangle, on an object with shape square); Compressed Database (instances replaced by S1).]
4Algorithm
- Create substructure for each unique vertex label
Substructures: triangle (4), square (4), circle (1), rectangle (1)
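The first step can be sketched in a few lines. This is a minimal illustration, not the Subdue implementation: the vertex identifiers and the dictionary representation are assumptions made for the example; only the label counts come from the slide.

```python
from collections import Counter


def initial_substructures(vertex_labels):
    """Return one single-vertex substructure per unique label, with its instance count."""
    return Counter(vertex_labels.values())


# Hypothetical vertex ids mapped to labels; only the counts match the slide.
vertices = {
    "T1": "triangle", "T2": "triangle", "T3": "triangle", "T4": "triangle",
    "B1": "square", "B2": "square", "B3": "square", "B4": "square",
    "C1": "circle",
    "R1": "rectangle",
}

counts = initial_substructures(vertices)
for label, n in counts.most_common():
    print(f"{label} ({n})")
```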
5Algorithm
- Expand best substructure by an edge or an edge plus a neighboring vertex
Substructures
6Algorithm
- Keep only best beam-width substructures on queue
- Terminate when queue is empty or number of discovered substructures > limit
- Compress graph and repeat to generate hierarchical description
- Note: polynomially constrained
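The search loop described on the algorithm slides can be sketched as a generic beam search. This is a simplified sketch under stated assumptions, not Subdue itself: `beam_discover`, `expand`, and `value` are hypothetical names, and the toy domain below uses substrings of a string (a chain-shaped "graph") with a crude savings score in place of real subgraphs and the description-length metric. The compress-and-repeat step is not shown.

```python
def beam_discover(initial, expand, value, beam_width=4, limit=100):
    """Beam search over candidate substructures; a higher value(s) is better."""
    queue = sorted(initial, key=value, reverse=True)[:beam_width]
    seen = set(queue)
    discovered = []
    # Stop when the queue is empty or `limit` substructures have been considered.
    while queue and len(discovered) < limit:
        best = queue.pop(0)
        discovered.append(best)
        for child in expand(best):
            if child not in seen:
                seen.add(child)
                queue.append(child)
        # Keep only the best beam_width candidates on the queue.
        queue = sorted(queue, key=value, reverse=True)[:beam_width]
    return sorted(discovered, key=value, reverse=True)


# Toy domain (an assumption for illustration): substructures are substrings.
TEXT = "abcabcabcd"


def expand(s):
    """Extend s by the character that follows each of its occurrences."""
    children = set()
    start = TEXT.find(s)
    while start != -1:
        end = start + len(s)
        if end < len(TEXT):
            children.add(s + TEXT[end])
        start = TEXT.find(s, start + 1)
    return sorted(children)


def value(s):
    """Crude savings: symbols saved by replacing instances, minus the definition cost."""
    return TEXT.count(s) * (len(s) - 1) - len(s)


result = beam_discover(sorted(set(TEXT)), expand, value, beam_width=2, limit=6)
print(result[0])
```

With a narrow beam the search follows the most promising expansions first and drops the rest, which is what keeps the run time bounded.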
7Evaluation Metric
- Substructures evaluated based on ability to compress input graph
- Compression measured using minimum description length (DL)
- Best substructure S in graph G minimizes DL(S) + DL(G|S)
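To make the trade-off concrete, the sketch below uses a size-based proxy for description length: the number of vertices plus the number of edges. This proxy, the `GraphSize` type, and all the numbers are assumptions for illustration; the actual metric encodes graphs in bits. Instances are assumed not to overlap, and each instance is replaced by a single vertex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSize:
    vertices: int
    edges: int

    @property
    def total(self) -> int:
        # Proxy for description length: vertices + edges (an assumption).
        return self.vertices + self.edges


def compressed_size(g: GraphSize, s: GraphSize, instances: int) -> GraphSize:
    """Size of G after each non-overlapping instance of S collapses to one vertex."""
    return GraphSize(
        vertices=g.vertices - instances * (s.vertices - 1),
        edges=g.edges - instances * s.edges,
    )


def description_length(g: GraphSize, s: GraphSize, instances: int) -> int:
    """Proxy for DL(S) + DL(G|S)."""
    return s.total + compressed_size(g, s, instances).total


# Hypothetical numbers: a 20-vertex, 19-edge graph and two candidate substructures.
g = GraphSize(vertices=20, edges=19)
large = GraphSize(vertices=4, edges=3)   # 4 instances in g
small = GraphSize(vertices=2, edges=1)   # 4 instances in g

print(g.total, description_length(g, large, 4), description_length(g, small, 4))
```

Under this proxy the larger substructure gives the shorter description, so it would be preferred.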
8Examples
9Inexact Graph Match
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph into an isomorphism of another
- Match if cost/size < threshold
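The matching rule can be illustrated with a brute-force sketch for very small graphs. The cost model here is an assumption: each vertex relabel, insertion, or deletion and each edge insertion or deletion costs 1, and "size" is taken to be vertices plus edges of the larger graph. Trying every vertex mapping is exponential, so this shows only the idea, not how an efficient matcher would work.

```python
from itertools import permutations


def match_cost(labels1, edges1, labels2, edges2):
    """Minimum unit-cost transformation of graph 1 into a graph isomorphic to graph 2.

    Graphs are undirected: labels is a list of vertex labels, edges a set of index pairs.
    """
    n = max(len(labels1), len(labels2))
    l1 = list(labels1) + [None] * (n - len(labels1))  # None = missing vertex
    l2 = list(labels2) + [None] * (n - len(labels2))
    e2 = {frozenset(e) for e in edges2}
    best = None
    for perm in permutations(range(n)):
        vertex_cost = sum(l1[i] != l2[perm[i]] for i in range(n))
        mapped = {frozenset((perm[a], perm[b])) for a, b in edges1}
        edge_cost = len(mapped ^ e2)  # edges to insert or delete
        cost = vertex_cost + edge_cost
        if best is None or cost < best:
            best = cost
    return best


def is_match(g1, g2, threshold):
    """Apply the slide's rule: match if cost / size < threshold."""
    size = max(len(g1[0]) + len(g1[1]), len(g2[0]) + len(g2[1]))
    return match_cost(*g1, *g2) / size < threshold


# Two hypothetical instances that differ in one vertex label (triangle vs. circle).
g1 = (["object", "triangle", "object", "square"], {(0, 1), (2, 3), (0, 2)})
g2 = (["object", "square", "object", "circle"], {(0, 1), (2, 3), (0, 2)})

print(match_cost(*g1, *g2), is_match(g1, g2, threshold=0.2))
```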
10Parallel/Distributed Discovery
- Divide graph into P partitions using Metis, distribute to P processors
- Each processor performs serial Subdue on local partition
- Broadcast best substructures, evaluate on other processors
- Master processor stores best global substructures
- Close to linear speedup
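The communication pattern above can be sketched with a thread pool standing in for the P processors. Everything here is a simplification: partitions are given as plain lists of vertex labels rather than produced by Metis, `local_best` is a hypothetical stand-in for serial discovery (it just picks the most frequent labels), and `evaluate` counts instances instead of measuring compression.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def local_best(partition, k=1):
    """Stand-in for serial discovery: the k most frequent labels in one partition."""
    return [label for label, _ in Counter(partition).most_common(k)]


def evaluate(candidate, partition):
    """Stand-in for evaluating a broadcast substructure on another partition."""
    return partition.count(candidate)


def distributed_discover(partitions, k=1):
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # Each worker runs discovery on its local partition.
        local = list(pool.map(partial(local_best, k=k), partitions))
        # "Broadcast": every local best is evaluated on every partition.
        candidates = sorted({c for best in local for c in best})
        totals = {}
        for c in candidates:
            totals[c] = sum(pool.map(partial(evaluate, c), partitions))
    # The master keeps candidates ranked by their global score.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


partitions = [
    ["triangle", "square", "circle", "circle"],
    ["triangle", "square", "triangle", "square"],
]
ranking = distributed_discover(partitions)
print(ranking)
```

In this toy run "circle" is the best substructure in the first partition but not globally, which is why the local results have to be re-evaluated on the other partitions.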
11Graph-Based Concept Learning
- One graph stores positive examples
- One graph stores negative examples
- Find substructure that compresses positive graph but not negative graph
- (PosEgsNotCovered) + (NegEgsCovered)
- Multiple iterations implement a set-covering approach
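The set-covering loop can be sketched as follows. This is an illustration under assumptions, not the Subdue concept learner: each candidate substructure is represented only by the set of example ids it covers (the names `S_a` to `S_d` and the ids are made up), and the error is taken to be the number of positive examples not covered plus the number of negative examples covered.

```python
def error(covered, positives, negatives):
    """Positive examples not covered plus negative examples covered."""
    return len(positives - covered) + len(negatives & covered)


def set_cover(candidates, positives, negatives):
    """Greedily pick substructures until every positive example is covered."""
    remaining = set(positives)
    chosen = []
    while remaining:
        name, covered = min(
            candidates.items(),
            key=lambda kv: (error(kv[1], remaining, negatives), kv[0]),
        )
        newly_covered = covered & remaining
        if not newly_covered:
            break  # no candidate makes progress
        chosen.append(name)
        remaining -= newly_covered  # next iteration targets what is left
    return chosen, remaining


# Hypothetical example ids: 1-5 are positive, 6-8 are negative.
positives = {1, 2, 3, 4, 5}
negatives = {6, 7, 8}
candidates = {
    "S_a": {1, 2, 3},
    "S_b": {4, 5, 6},
    "S_c": {4, 5},
    "S_d": {1, 2, 3, 4, 5, 6, 7, 8},
}

chosen, uncovered = set_cover(candidates, positives, negatives)
print(chosen, uncovered)
```

Note that `S_d` covers every positive example but is never chosen, because it also covers all the negatives.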
12Concept-Learning Example
13Concept-Learning Results
- Chess endgames (19,257 examples)
- Black King is (+) or is not (-) in check
- 99.8% FOIL, 99.21% Subdue
14More Concept-Learning Results
- Tic-Tac-Toe endgames
- (+) is win for X (958 examples)
- 100% Subdue, 92.35% FOIL
- Bach chorales
- Musical sequences (20 sequences)
- 100% Subdue, 85.71% FOIL
15Graph-Based Clustering
- Iterate Subdue until the graph is compressed to a single vertex
- Each cluster (substructure) inserted into a classification lattice
[Figure: classification lattice descending from Root]
16Clustering Example: Animals
17Graph-Based Clustering Results
18Cobweb Results
- Comparison of Subdue and Cobweb results
- Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
- Subdue lattice identifies overlap between (reptile) and (amphibian/fish)
19Clustering Example: DNA
20Graph-Based Clustering Results
21Evaluation of Clusterings
- Traditional evaluation
- Not applicable to hierarchical domains
- Does not make sense to compare clusters in different subtrees
- Not applicable to relational clusterings
22Properties of Good Clusterings
- Small number of clusters
- Large coverage → good generality
- Big cluster descriptions
- More features → more inferential power
- Minimal or no overlap between clusters
- More distinct clusters → better defined concepts
23New Evaluation Heuristic for Hierarchical Clusterings
- Clustering rooted at C with c children H_i having |H_i| instances H_{i,k}
- distance() measured by inexact graph match
- Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7
24Graph-Based Data Mining Application Domains
- Biochemical domains
- Protein data
- DNA data
- Toxicology (cancer) data
- Spatial-temporal domains
- Earthquake data
- Aircraft Safety and Reporting System
- Telecommunications data
- Program source code
- Web topology
25Theoretical Analysis
- Galois lattice (Liquiere et al.)
- Conceptual graphs (Sowa et al.)
- PAC analysis (Jappy et al.)
26Graph-based Data Mining
- Pattern (substructure) discovery
- Hierarchical discovery
- Distributed discovery
- Concept learning
- Clustering
- Compression heuristic based on minimum description length
27Future Work
- Concept learning
- Theoretical analysis
- Comparison to ILP systems
- Clustering
- Classification lattice
- Hierarchical relational conceptual clustering evaluation metric
- Probabilistic substructures
- Domains: WWW, source code
28Subdue Source Code and Data
- http://cygnus.uta.edu/subdue