Efficient Mining of Graph-Based Data

1
Efficient Mining of Graph-Based Data
Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook
University of Texas at Arlington
Department of Computer Science and Engineering
http://cygnus.uta.edu/subdue
2
Motivation
  • Structural/relational data
  • Ease of graph representation

3
Graph-Based Discovery
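[Figure: three panels labeled Input Database, Substructure S1 (graph form), and Compressed Database. The input holds triangles T1-T4 and squares B1-B4 together with C1 and R1. S1 is drawn with object vertices, shape edges to triangle and square, and an on edge. In the compressed database the four instances are replaced by S1, while C1 and R1 remain.]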
4
Algorithm
  • Create substructure for each unique vertex label

Substructures
triangle (4), square (4), circle (1), rectangle (1)
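
The initialization step can be sketched in a few lines of Python. This is a minimal illustration, not the Subdue source: the function name `initial_substructures` and the vertex identifiers are hypothetical, and the labels are chosen only to reproduce the counts listed above.

```python
from collections import Counter

def initial_substructures(vertex_labels):
    """Return (label, instance_count) pairs, most frequent first."""
    return Counter(vertex_labels.values()).most_common()

# Hypothetical vertex ids; labels chosen to match the counts on the slide.
vertex_labels = {}
for i in range(4):
    vertex_labels[f"t{i}"] = "triangle"
    vertex_labels[f"s{i}"] = "square"
vertex_labels["c0"] = "circle"
vertex_labels["r0"] = "rectangle"

subs = initial_substructures(vertex_labels)
for label, count in subs:
    print(f"{label} ({count})")
```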
5
Algorithm
  • Expand best substructure by an edge or by an
    edge plus its neighboring vertex

Substructures
6
Algorithm
  • Keep only best beam-width substructures on queue
  • Terminate when queue is empty or number of
    discovered substructures > limit
  • Compress graph and repeat to generate
    hierarchical description
  • Note: polynomially constrained
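
The search loop on slides 4 to 6 can be sketched as a beam search. The version below is a deliberate simplification and not the Subdue implementation: substructures are restricted to label paths, instances may overlap, a path and its reverse count as different patterns, and the score `len(pattern) * len(instances)` is only a stand-in for the compression-based value. The names `discover`, `beam_width` and `limit` are hypothetical.

```python
from collections import defaultdict

def discover(labels, edges, beam_width=2, limit=10):
    """Beam search over path-shaped substructures (a simplification)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # One initial substructure per unique vertex label.
    queue = defaultdict(list)
    for v, lab in labels.items():
        queue[(lab,)].append((v,))

    def value(item):
        pattern, instances = item
        # Stand-in score: number of vertices covered by all instances.
        return len(pattern) * len(instances)

    discovered = []
    # Stop when the queue is empty or more than `limit` substructures exist.
    while queue and len(discovered) <= limit:
        # Keep only the best beam_width substructures.
        best = sorted(queue.items(), key=value, reverse=True)[:beam_width]
        discovered.extend(best)
        # Expand each kept substructure by an edge plus neighboring vertex.
        queue = defaultdict(list)
        for pattern, instances in best:
            for inst in instances:
                for nxt in adj[inst[-1]]:
                    if nxt not in inst:
                        queue[pattern + (labels[nxt],)].append(inst + (nxt,))
    return sorted(discovered, key=value, reverse=True)

# Toy graph: three triangle-square pairs, one of them also touching a circle.
labels = {"t1": "triangle", "t2": "triangle", "t3": "triangle",
          "s1": "square", "s2": "square", "s3": "square", "c": "circle"}
edges = [("t1", "s1"), ("t2", "s2"), ("t3", "s3"), ("s1", "c")]
found = discover(labels, edges)
print(found[0][0], len(found[0][1]))
```

The hierarchical step (compress the graph with the best substructure and repeat) is omitted here to keep the sketch short.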

7
Evaluation Metric
  • Substructures evaluated based on ability to
    compress input graph
  • Compression measured using minimum description
    length (DL)
  • Best substructure S in graph G minimizes
    DL(S) + DL(G|S)
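
To make the metric concrete, here is a rough sketch that takes the objective to be the sum DL(S) + DL(G|S), where G|S is G with every instance of S collapsed to one vertex. The bit-counting scheme in `description_length` is an assumption made for illustration and is much cruder than a real MDL encoding; instances are also assumed not to overlap.

```python
import math

def description_length(n_vertices, n_edges, n_labels):
    """Crude bit count: one label per vertex; two endpoints and a label per edge."""
    if n_vertices == 0:
        return 0.0
    label_bits = math.log2(max(n_labels, 2))
    endpoint_bits = math.log2(max(n_vertices, 2))
    return n_vertices * label_bits + n_edges * (2 * endpoint_bits + label_bits)

def compressed_value(graph, sub, n_instances, n_labels):
    """Return DL(S) + DL(G|S) for sizes given as (vertices, edges) pairs."""
    gv, ge = graph
    sv, se = sub
    # Each non-overlapping instance collapses to a single new vertex.
    cv = gv - n_instances * (sv - 1)
    ce = ge - n_instances * se
    # The compressed graph uses one extra label, the name of S.
    return (description_length(sv, se, n_labels)
            + description_length(cv, ce, n_labels + 1))

graph = (20, 30)                                        # hypothetical input graph
baseline = description_length(20, 30, 4)
value_a = compressed_value(graph, (4, 4), 4, 4)         # larger substructure
value_b = compressed_value(graph, (1, 0), 4, 4)         # single-vertex substructure
print(round(baseline, 1), round(value_a, 1), round(value_b, 1))
```

Under these assumptions the four-vertex substructure yields the smaller total, so it would be preferred over the single-vertex one.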

8
Examples
9
Inexact Graph Match
  • Some variations may occur between instances
  • Want to abstract over minor differences
  • Difference = cost of transforming one graph into
    an isomorphism of another
  • Match if cost/size < threshold
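
A small sketch of the thresholded match follows. It is a brute-force illustration for tiny graphs only, not the search Subdue uses. Unit cost per vertex or edge change, and taking "size" to be the vertex plus edge count of the larger graph, are assumptions; the helper names are hypothetical.

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: list of vertex labels; edges: pairs of vertex indices (undirected)."""
    return labels, {frozenset(e) for e in edges}

def transform_cost(g1, g2):
    """Minimum number of unit edits turning the smaller graph into the larger."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    if len(labels1) > len(labels2):
        (labels1, edges1), (labels2, edges2) = g2, g1
    n1, n2 = len(labels1), len(labels2)
    best = None
    # perm[i] is the vertex of the larger graph matched to vertex i.
    for perm in permutations(range(n2), n1):
        cost = n2 - n1                                   # vertices to insert
        cost += sum(labels1[i] != labels2[perm[i]] for i in range(n1))
        mapped = {frozenset((perm[u], perm[v])) for u, v in edges1}
        cost += len(mapped ^ edges2)                     # edges to add or delete
        best = cost if best is None else min(best, cost)
    return best

def inexact_match(g1, g2, threshold):
    """Match if cost / size < threshold (size definition is an assumption)."""
    size = max(len(g1[0]) + len(g1[1]), len(g2[0]) + len(g2[1]))
    return transform_cost(g1, g2) / size < threshold

a = make_graph(["triangle", "square"], [(0, 1)])
b = make_graph(["triangle", "square", "circle"], [(0, 1), (1, 2)])
print(transform_cost(a, b), inexact_match(a, b, 0.5))
```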

10
Parallel/Distributed Discovery
  • Divide graph into P partitions using Metis,
    distribute to P processors
  • Each processor performs serial Subdue on local
    partition
  • Broadcast best substructures, evaluate on other
    processors
  • Master processor stores best global substructures
  • Close to linear speedup
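
The nominate, broadcast and evaluate pattern can be simulated sequentially as below. This is a sketch under strong assumptions: each partition is reduced to a list of already-identified substructure names, local discovery is replaced by counting, and no real partitioner or message passing is involved. The name `parallel_discover` is hypothetical.

```python
from collections import Counter

def parallel_discover(partitions, local_best=1):
    """Simulate P processors, each holding one partition."""
    local_counts = [Counter(p) for p in partitions]
    # Each processor nominates its best local substructures.
    nominated = set()
    for counts in local_counts:
        nominated.update(name for name, _ in counts.most_common(local_best))
    # Broadcast: every processor evaluates every nominee on its own partition;
    # the master sums the local values into a global score.
    scores = {name: sum(c[name] for c in local_counts) for name in nominated}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

partitions = [["A", "A", "B"], ["B", "B", "C"], ["A", "B", "D"]]
ranking = parallel_discover(partitions)
print(ranking)
```

Note that B wins globally even though it is the local best on only one partition, which is why nominees are re-evaluated on the other processors.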

11
Graph-Based Concept Learning
  • One graph stores positive examples
  • One graph stores negative examples
  • Find substructure that compresses positive graph
    but not negative graph
  • (PosEgsNotCovered) + (NegEgsCovered)
  • Multiple iterations implement a set-covering
    approach
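
The set-covering loop can be sketched as follows. This is an illustration only: examples are feature sets rather than graphs, "covers" is a plain membership test standing in for a subgraph match, and the error is taken to be the sum of uncovered positives and covered negatives. The name `learn_concept` is hypothetical.

```python
def learn_concept(candidates, positives, negatives, max_rules=10):
    """candidates maps a name to a predicate that says whether it covers an example."""
    remaining = list(positives)
    rules = []
    while remaining and len(rules) < max_rules:
        def error(name):
            covers = candidates[name]
            pos_not_covered = sum(not covers(e) for e in remaining)
            neg_covered = sum(covers(e) for e in negatives)
            return pos_not_covered + neg_covered
        best = min(candidates, key=error)
        if not any(candidates[best](e) for e in remaining):
            break                      # no candidate covers anything new
        rules.append(best)
        # Set covering: drop the positives that the chosen rule explains.
        remaining = [e for e in remaining if not candidates[best](e)]
    return rules

# Toy data: each example is a set of features.
positives = [{"a", "b"}, {"a", "c"}, {"d"}]
negatives = [{"b", "c"}, {"c"}]
candidates = {f: (lambda e, f=f: f in e) for f in "acd"}
rules = learn_concept(candidates, positives, negatives)
print(rules)
```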

12
Concept-Learning Example
13
Concept-Learning Results
  • Chess endgames (19,257 examples)
  • Black King is (+) or is not (-) in check
  • 99.8% FOIL, 99.21% Subdue

14
More Concept-Learning Results
  • Tic-Tac-Toe endgames
  • (+) is win for X (958 examples)
  • 100% Subdue, 92.35% FOIL
  • Bach chorales
  • Musical sequences (20 sequences)
  • 100% Subdue, 85.71% FOIL

15
Graph-Based Clustering
  • Iterate Subdue until graph is compressed to a single vertex
  • Each cluster (substructure) inserted into a
    classification lattice

[Figure: classification lattice with top node labeled Root]
16
Clustering Example: Animals
17
Graph-Based Clustering Results
18
Cobweb Results
  • Comparison of Subdue and Cobweb results
  • Subdue lattice produced better generalization,
    resulting in fewer clusters at higher levels
  • Subdue lattice identifies overlap between
    (reptile) and (amphibian/fish)

19
Clustering Example: DNA
20
Graph-Based Clustering Results
  • Coverage: 61%, 68%, 71%

21
Evaluation of Clusterings
  • Traditional evaluation
  • Not applicable to hierarchical domains
  • Does not make sense to compare clusters in
    different subtrees
  • Not applicable to relational clusterings

22
Properties of Good Clusterings
  • Small number of clusters
  • Large coverage → good generality
  • Big cluster descriptions
  • More features → more inferential power
  • Minimal or no overlap between clusters
  • More distinct clusters → better defined concepts

23
New Evaluation Heuristic for Hierarchical
Clusterings
  • Clustering rooted at C with c children H_i, each
    having |H_i| instances H_{i,k}
  • distance() measured by inexact graph match
  • Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7

24
Graph-Based Data Mining Application Domains
  • Biochemical domains
  • Protein data
  • DNA data
  • Toxicology (cancer) data
  • Spatial-temporal domains
  • Earthquake data
  • Aircraft Safety and Reporting System
  • Telecommunications data
  • Program source code
  • Web topology

25
Theoretical Analysis
  • Galois lattice (Liquiere et al.)
  • Conceptual graphs (Sowa et al.)
  • PAC analysis (Jappy et al.)

26
Graph-based Data Mining
  • Pattern (substructure) discovery
  • Hierarchical discovery
  • Distributed discovery
  • Concept learning
  • Clustering
  • Compression heuristic based on minimum
    description length

27
Future Work
  • Concept learning
  • Theoretical analysis
  • Comparison to ILP systems
  • Clustering
  • Classification lattice
  • Hierarchical relational conceptual clustering
    evaluation metric
  • Probabilistic substructures
  • Domains: WWW, source code

28
Subdue Source Code and Data
  • http://cygnus.uta.edu/subdue