Title: Clustering Graphs by Weighted Substructure Mining


1
Clustering Graphs by Weighted Substructure Mining
  • Max Planck Institute for Biological Cybernetics
  • Koji Tsuda

Joint work with Taku Kudo (Google Japan)
2
Unsupervised Clustering of Labeled Undirected Graphs
3
Graph Structures in Biology
  • DNA Sequence
  • RNA
  • Texts in literature
  • Compounds

[Figure: a chemical compound drawn as a graph with atom labels C, H, and O; example text: "Amitriptyline inhibits adenosine uptake"]
4
Substructure Representation
  • 0/1 vector of pattern indicators
  • Huge dimensionality!
  • Need Graph Mining for selecting features
  • Better than paths (Marginalized graph kernels)

5
Overview
  • Clustering algorithm based on substructure representation
  • Key: selecting informative substructures
  • EM-based graph clustering
  • Fitting a binomial mixture model
  • Combination of L1 regularization and weighted substructure mining

6
Quick Review of Graph Mining
7
Graph Mining
  • Analysis of Graph Databases
  • Find all patterns satisfying predetermined conditions
  • Frequent Substructure Mining
  • Combinatorial, Exhaustive
  • Recently developed
  • AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (Nijssen and Kok, 2004)

8
Graph Mining
  • Frequent Substructure Mining
  • Enumerate all patterns that occur in at least m graphs
  • Indicator of pattern k in graph i

Support(k): number of graphs in which pattern k occurs
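
A minimal sketch of the support computation, assuming the pattern indicators are already available as a 0/1 matrix. The toy matrix X and the function names are hypothetical and only illustrate the definitions above; a real miner never materializes the full matrix.

```python
# Hypothetical toy data: rows are graphs i, columns are patterns k.
# X[i][k] == 1 means pattern k occurs in graph i (the 0/1 indicator).
X = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
]

def support(X, k):
    """Number of graphs in which pattern k occurs."""
    return sum(row[k] for row in X)

def frequent_patterns(X, m):
    """Indices of patterns that occur in at least m graphs."""
    n_patterns = len(X[0])
    return [k for k in range(n_patterns) if support(X, k) >= m]
```

With m = 2, only the first two columns of the toy matrix qualify as frequent.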
9
gSpan (Yan and Han, 2002)
  • Efficient Frequent Substructure Mining Method
  • DFS Code
  • Efficient detection of isomorphic patterns
  • We extend gSpan in this work

10
Enumeration on Tree-shaped Search Space
  • Each node has a pattern
  • Generate nodes from the root
  • Add an edge at each step

11
Tree Pruning
Support(g): number of graphs in which pattern g occurs
  • Anti-monotonicity
  • If support(g) < m, stop exploring!
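
The pruning rule can be sketched on a simpler pattern class. The code below is an illustration, not gSpan: it uses item sets as a stand-in for graph patterns (an assumption made to keep the example self-contained), where adding one item plays the role of adding one edge. The function name mine_frequent is hypothetical.

```python
def mine_frequent(transactions, m):
    """Depth-first enumeration of item sets with support >= m.

    Item sets stand in for graph patterns here: extending a set by one
    item plays the role of adding one edge to a graph pattern.
    """
    items = sorted({item for t in transactions for item in t})
    found = {}

    def support(pattern):
        # Number of transactions that contain every item of the pattern.
        return sum(1 for t in transactions if pattern <= t)

    def grow(pattern, start):
        for j in range(start, len(items)):
            child = pattern | {items[j]}
            s = support(child)
            if s < m:
                # Anti-monotonicity: no superset of child can be frequent,
                # so the whole subtree below child is never generated.
                continue
            found[child] = s
            grow(child, j + 1)

    grow(frozenset(), 0)
    return found
```

The same argument carries over to graphs because a supergraph can occur only in graphs where the smaller pattern already occurs.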

[Figure: search tree in which the branches below an infrequent pattern are not generated]
12
Discriminative Patterns: Weighted Substructure Mining
  • w_i > 0: positive class
  • w_i < 0: negative class
  • Weighted Substructure Mining
  • Patterns with large frequency difference
  • Not anti-monotonic: use a bound
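
A sketch of how such a bound can work, assuming the score of a pattern is the absolute weighted sum of its occurrences. Whether the talk uses exactly this bound is an assumption; the names gain and gain_bound are hypothetical.

```python
def gain(w, x):
    """|sum_i w_i x_i| for one pattern with 0/1 occurrence vector x."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)))

def gain_bound(w, x):
    """Upper bound on the gain of every superpattern of this pattern.

    A superpattern can only occur in a subset of the graphs where the
    pattern occurs, so the best case keeps all positive-weight (or all
    negative-weight) occurrences and drops the others.
    """
    pos = sum(wi for wi, xi in zip(w, x) if xi and wi > 0)
    neg = -sum(wi for wi, xi in zip(w, x) if xi and wi < 0)
    return max(pos, neg)
```

During the tree search, a branch can be cut whenever gain_bound falls below the best gain (or the threshold) required so far, which restores pruning even though the gain itself is not anti-monotonic.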

13
Multiclass version
  • Multiple weight vectors
  • (graph belongs to class )
  • (otherwise)
  • Search patterns overrepresented in a class

14
EM-based clustering of graphs
15
EM-based graph clustering
  • Motivation
  • Learning a mixture model in the feature space of
    patterns
  • Basis for more complex probabilistic inference
  • L1 regularization + Graph Mining
  • E-step -> Mining -> M-step

16
Probabilistic Model
  • Binomial Mixture
  • Each Component

Mixing weight for cluster
Parameter vector for cluster
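
One parametrization consistent with the two labels above can be sketched as follows. It is an assumption rather than a transcription of the slide: x is the 0/1 pattern vector of a graph, c the number of clusters, d the number of patterns, \alpha_\ell the mixing weight, and \theta_\ell the parameter vector of cluster \ell, with each pattern modeled as an independent 0/1 variable in logistic form.

```latex
p(x \mid \Theta) = \sum_{\ell=1}^{c} \alpha_\ell \, p_\ell(x \mid \theta_\ell),
\qquad
p_\ell(x \mid \theta_\ell) = \prod_{k=1}^{d}
  \frac{\exp(\theta_{\ell k} x_k)}{1 + \exp(\theta_{\ell k})}
```

Under this form, the occurrence probability of pattern k in cluster \ell is 1 / (1 + \exp(-\theta_{\ell k})).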
17
Function to minimize
  • L1-Regularized log likelihood
  • Baseline constant
  • ML parameter estimate using single binomial
    distribution
  • In the solution, most parameters are exactly equal to the baseline constants
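
A hedged sketch of what such an objective can look like, assuming each cluster \ell models pattern k as a 0/1 variable with logistic parameter \theta_{\ell k}, mixing weights \alpha_\ell, n graphs with indicators x_{ik}, and a regularization constant \lambda. The symbols are assumptions, not taken from the slide.

```latex
\min_{\Theta} \;
  -\sum_{i=1}^{n} \log \sum_{\ell=1}^{c} \alpha_\ell \, p_\ell(x_i \mid \theta_\ell)
  + \lambda \sum_{\ell=1}^{c} \sum_{k=1}^{d}
    \lvert \theta_{\ell k} - \theta_{0k} \rvert,
\qquad
\theta_{0k} = \log \frac{\bar{x}_k}{1 - \bar{x}_k},
\quad
\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_{ik}
```

Here \theta_{0k} is the maximum-likelihood estimate under a single distribution fitted to all graphs, and the L1 penalty pulls every \theta_{\ell k} toward it, which is why most parameters end up exactly at the baseline.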

18
E-step
  • Active pattern
  • E-step computed only with active patterns
    (computable!)
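
Why the E-step needs only active patterns can be checked numerically. The sketch below assumes a logistic parametrization of each cluster (theta[l][k] for cluster l and pattern k) and treats a pattern as inactive when its parameter is the same in every cluster; such a pattern contributes the same factor to every cluster and cancels in the posterior. All names are hypothetical.

```python
import math

def log_component(x, theta, patterns):
    """Sum over the given patterns of theta_k * x_k - log(1 + exp(theta_k))."""
    return sum(theta[k] * x[k] - math.log1p(math.exp(theta[k]))
               for k in patterns)

def e_step(X, alpha, theta, patterns):
    """Posterior cluster probabilities R[i][l], using only `patterns`."""
    R = []
    for x in X:
        logs = [math.log(a) + log_component(x, th, patterns)
                for a, th in zip(alpha, theta)]
        top = max(logs)  # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in logs]
        total = sum(weights)
        R.append([w / total for w in weights])
    return R
```

Calling e_step once with all pattern indices and once with only the active ones gives the same posteriors (up to rounding) whenever the omitted patterns share one parameter value across clusters.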

19
M-step
  • Putative cluster assignment by E-step
  • Each parameter is solved separately
  • Use graph mining to find active patterns
  • Then, solve it only for active patterns

20
Solution
  • Occurrence probability in a cluster
  • Overall occurrence probability
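
A sketch of a closed-form update for one parameter, under assumptions that are not confirmed by the slide: a logistic parametrization, an L1 penalty lam > 0 toward the baseline, and an overall occurrence probability strictly between 0 and 1. The function name solve_parameter is hypothetical.

```python
import math

def solve_parameter(r, x, lam):
    """Minimize -sum_i r_i*(theta*x_i - log(1+e^theta)) + lam*|theta - theta0|.

    r: responsibilities of one cluster, x: 0/1 indicators of one pattern.
    Returns (theta, is_active). Assumes lam > 0 and 0 < mean(x) < 1.
    """
    n = len(x)
    p0 = sum(x) / n                    # overall occurrence probability
    theta0 = math.log(p0 / (1 - p0))   # baseline constant
    N = sum(r)                         # effective cluster size
    s = sum(ri * xi for ri, xi in zip(r, x))  # weighted occurrences in cluster
    diff = s - N * p0                  # N * (cluster prob. - overall prob.)
    if abs(diff) <= lam:
        return theta0, False           # parameter stays exactly at the baseline
    # Otherwise the cluster probability is shrunk toward the baseline by lam / N.
    p = (s - lam) / N if diff > 0 else (s + lam) / N
    return math.log(p / (1 - p)), True
```

The parameter leaves the baseline only when the occurrence probability in the cluster differs from the overall one by more than lam / N.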

21
Important Observation
For active pattern k, the occurrence probability
in a graph cluster is significantly different
from the average
22
Mining for Active Patterns F
  • F is rewritten in the following form
  • Active patterns can be found by graph mining!
    (multiclass)
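
One way to see the connection to multiclass weighted mining, sketched under an assumed logistic parametrization with L1 penalty lam: a pattern is active for cluster l when |sum_i r_il * (x_ik - mean_k)| > lam, which equals |sum_i w_li * x_ik| with w_li = r_il - N_l / n. The exact form of F on the slide is not reproduced here, and the names below are hypothetical.

```python
def mining_weights(R):
    """w[l][i] = r_il - N_l / n, one weight vector per cluster l.

    R[i][l] is the posterior probability of cluster l for graph i.
    """
    n = len(R)
    c = len(R[0])
    N = [sum(R[i][l] for i in range(n)) for l in range(c)]
    return [[R[i][l] - N[l] / n for i in range(n)] for l in range(c)]

def is_active(W, x, lam):
    """True if some cluster's weighted occurrence sum exceeds lam in magnitude."""
    return any(abs(sum(wi * xi for wi, xi in zip(w, x))) > lam for w in W)
```

Graphs that belong to a cluster more strongly than average get positive weights and the rest get negative ones, so the search for active patterns becomes a search for patterns overrepresented or underrepresented in some cluster.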

23
Experiments: RNA graphs
  • Stem as a node
  • Secondary structure by RNAfold
  • 0/1 Vertex label (self loop or not)

24
Clustering RNA graphs
  • Three Rfam families
  • Intron GP I (Int, 30 graphs)
  • SSU rRNA 5 (SSU, 50 graphs)
  • RNase bact a (RNase, 50 graphs)
  • Three bipartition problems
  • Results evaluated by ROC scores (Area under the
    ROC curve)

25
Examples of RNA Graphs
26
ROC Scores
27
Number of Patterns and Time
28
Found Patterns
29
Conclusion
  • Probabilistic clustering based on substructure
    representation
  • Inference helped by graph mining
  • Many possible extensions
  • Naïve Bayes
  • Graph PCA, LFD, CCA
  • Semi-supervised learning
  • Applications in Biology?