Title: Clustering Graphs by Weighted Substructure Mining
1. Clustering Graphs by Weighted Substructure Mining
- Max Planck Institute for Biological Cybernetics
- Koji Tsuda
Joint work with Taku Kudo (Google Japan)
2. Unsupervised Clustering of Labeled Undirected Graphs
3. Graph Structures in Biology
- DNA Sequence
- RNA
- Texts in literature
[Figure: a molecule drawn as a graph of C, H, and O atoms; the sentence "Amitriptyline inhibits adenosine uptake" as an example text]
4. Substructure Representation
- 0/1 vector of pattern indicators
- Huge dimensionality!
- Need Graph Mining for selecting features
- Better than paths (Marginalized graph kernels)
[Figure label: patterns]
5. Overview
- Clustering algorithm based on substructure representation
- Key: selecting informative substructures
- EM-based Graph Clustering
- Fitting a binomial mixture model
- Combination of L1 regularization and weighted substructure mining
6. Quick Review of Graph Mining
7. Graph Mining
- Analysis of Graph Databases
- Find all patterns satisfying predetermined conditions
- Frequent Substructure Mining
- Combinatorial, Exhaustive
- Recently developed
- AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
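8. Graph Mining
- Frequent Substructure Mining
- Enumerate all patterns that occur in at least m graphs
- Indicator of pattern k in graph i: 1 if the pattern occurs in the graph, 0 otherwise
- Support(k): number of occurrences of pattern k (the number of graphs whose indicator is 1)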
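The support definition above can be sketched in a few lines. This is only an illustration: the indicator matrix `X` (one row per graph, one column per pattern), the toy values, and the function names are assumptions, not part of the slides, and a real miner never materializes the full matrix.

```python
from typing import List


def support(X: List[List[int]]) -> List[int]:
    """Column sums of a 0/1 indicator matrix: support[k] = number of graphs containing pattern k."""
    n_patterns = len(X[0]) if X else 0
    return [sum(row[k] for row in X) for k in range(n_patterns)]


def frequent_patterns(X: List[List[int]], m: int) -> List[int]:
    """Indices of patterns that occur in at least m graphs."""
    return [k for k, s in enumerate(support(X)) if s >= m]


# Hypothetical toy data: 4 graphs (rows), 3 patterns (columns).
X = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

if __name__ == "__main__":
    print(support(X))
    print(frequent_patterns(X, m=2))
```

With the minimum support m = 2, only the patterns present in at least two of the four toy graphs are kept.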
9. gSpan (Yan and Han, 2002)
- Efficient Frequent Substructure Mining Method
- DFS Code
- Efficient detection of isomorphic patterns
- We extend gSpan for our work
10. Enumeration on Tree-shaped Search Space
- Each node has a pattern
- Generate nodes from the root
- Add an edge at each step
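11. Tree Pruning
- Support(g): number of occurrences of pattern g
- Anti-monotonicity: extending a pattern can never increase its support
- If support(g) < m, stop exploring!
[Figure: search tree with the pruned subtree labeled "Not generated"]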
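The pruning rule can be illustrated with a much simpler pattern class. The sketch below swaps graphs for itemsets (sets of labels), so that "pattern occurs in graph" becomes a plain subset test and no isomorphism check or DFS code is needed; it is not gSpan. The function name `mine_frequent` and the toy database are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def mine_frequent(db: Sequence[FrozenSet[str]], m: int) -> List[Tuple[Pattern, int]]:
    """Depth-first enumeration of itemsets with support >= m.

    Each search-tree node holds one pattern; children add one item.
    A node with support < m is not expanded (anti-monotonicity).
    """
    items = sorted(set().union(*db))
    found: List[Tuple[Pattern, int]] = []

    def grow(pattern: Pattern, start: int) -> None:
        for j in range(start, len(items)):
            child = pattern + (items[j],)
            s = sum(1 for t in db if set(child) <= t)  # support of the child
            if s < m:
                continue  # prune: no extension of child can reach support m
            found.append((child, s))
            grow(child, j + 1)

    grow((), 0)
    return found


# Hypothetical toy database of four "graphs", each reduced to a set of labels.
db = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]

if __name__ == "__main__":
    for pattern, s in mine_frequent(db, m=2):
        print(pattern, s)
```

Extending items only in sorted order plays the role that canonical DFS codes play for graphs: every pattern is generated exactly once.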
12. Discriminative Patterns: Weighted Substructure Mining
- w_i > 0: positive class
- w_i < 0: negative class
- Weighted Substructure Mining
- Patterns with large frequency difference
- Not anti-monotonic: use a bound
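The slide's formulas were lost, so the following is a sketch under an assumption: the score of a pattern is taken to be the weighted gain sum_i w_i (2 x_i - 1), where x_i is the 0/1 indicator of the pattern in graph i. This is a common form in the weighted mining literature, not confirmed by the slide. Because a larger pattern can only occur in a subset of the graphs where the smaller one occurs, the absolute gain of every extension is bounded by keeping only the positive-weight (or only the negative-weight) occurrences. Function names are hypothetical.

```python
from typing import Sequence


def gain(w: Sequence[float], x: Sequence[int]) -> float:
    """Weighted gain of one pattern: sum_i w_i * (2 * x_i - 1)."""
    return sum(wi * (2 * xi - 1) for wi, xi in zip(w, x))


def gain_bound(w: Sequence[float], x: Sequence[int]) -> float:
    """Upper bound on |gain| of any pattern whose occurrences are a subset of x's.

    gain = 2 * (sum of w_i over occurrences) - sum(w), so the extremes are
    reached by keeping only positive-weight or only negative-weight occurrences.
    """
    total = sum(w)
    pos = sum(wi for wi, xi in zip(w, x) if xi == 1 and wi > 0)
    neg = sum(-wi for wi, xi in zip(w, x) if xi == 1 and wi < 0)
    return max(2 * pos - total, 2 * neg + total)


# Hypothetical toy data: two positive and two negative graphs.
w = [0.5, 0.5, -0.5, -0.5]
x = [1, 1, 1, 0]  # the pattern occurs in graphs 0, 1 and 2

if __name__ == "__main__":
    print(gain(w, x), gain_bound(w, x))
```

During the tree search, a subtree can be skipped whenever `gain_bound` of its root falls below the required threshold, which replaces the support-based pruning used for frequent mining.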
13. Multiclass version
- Multiple weight vectors, one per class
- Positive weight (graph belongs to the class)
- Negative weight (otherwise)
- Search for patterns overrepresented in a class
14. EM-based clustering of graphs
15. EM-based graph clustering
- Motivation
- Learning a mixture model in the feature space of patterns
- Basis for more complex probabilistic inference
- L1 regularization + Graph Mining
- E-step -> Mining -> M-step
16. Probabilistic Model
- Binomial Mixture
- Each Component
- Mixing weight for each cluster
- Parameter vector for each cluster
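The equations on this slide did not survive extraction. One standard way to write a binomial (Bernoulli product) mixture over 0/1 pattern vectors is sketched below. The symbols are assumptions rather than the slide's own notation: $c$ clusters, $d$ patterns, mixing weights $\alpha_\ell$, and a logit-scale parameter $\theta_{\ell k}$ for pattern $k$ in cluster $\ell$.

```latex
p(\mathbf{x} \mid \Theta) = \sum_{\ell=1}^{c} \alpha_\ell \, p_\ell(\mathbf{x} \mid \boldsymbol{\theta}_\ell),
\qquad
p_\ell(\mathbf{x} \mid \boldsymbol{\theta}_\ell)
  = \prod_{k=1}^{d} \frac{\exp(\theta_{\ell k} x_k)}{1 + \exp(\theta_{\ell k})},
\qquad
\sum_{\ell=1}^{c} \alpha_\ell = 1 .
```

Under this parameterization the probability that pattern $k$ occurs in a graph from cluster $\ell$ is $1 / (1 + \exp(-\theta_{\ell k}))$.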
17. Function to minimize
- L1-regularized negative log likelihood
- Baseline constant
- ML parameter estimate using a single binomial distribution
- In the solution, most parameters are exactly equal to the baseline constants
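The objective itself is missing from the extracted slide. A plausible form, consistent with the bullets above, is sketched here under stated assumptions: $n$ graphs with 0/1 pattern vectors $\mathbf{x}_i$, $c$ mixture components $p_\ell$ with weights $\alpha_\ell$ and logit-scale parameters $\theta_{\ell k}$, a regularization constant $\lambda > 0$, and baseline constants $\theta_{0k}$. None of these symbols are taken from the slide.

```latex
\min_{\Theta} \;
  -\sum_{i=1}^{n} \log \sum_{\ell=1}^{c} \alpha_\ell \, p_\ell(\mathbf{x}_i \mid \boldsymbol{\theta}_\ell)
  \;+\; \lambda \sum_{\ell=1}^{c} \sum_{k=1}^{d} \lvert \theta_{\ell k} - \theta_{0k} \rvert ,
\qquad
\theta_{0k} = \log \frac{\eta_{0k}}{1 - \eta_{0k}},
\quad
\eta_{0k} = \frac{1}{n} \sum_{i=1}^{n} x_{ik} .
```

Here $\eta_{0k}$ is the maximum likelihood estimate of the occurrence probability of pattern $k$ under a single binomial distribution. The L1 penalty pulls each $\theta_{\ell k}$ toward $\theta_{0k}$, which is why most parameters end up exactly at the baseline.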
18. E-step
- Active pattern: a pattern whose parameter differs from its baseline constant
- E-step computed only with active patterns (computable!)
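Why the E-step needs only active patterns can be checked numerically. In the sketch below, each cluster is assumed to be a product of Bernoulli distributions with logit parameters; for an inactive pattern the parameter is the same in every cluster, so its factor cancels when the posterior is normalized. All names and toy values are hypothetical.

```python
import math
from typing import List, Sequence


def log_bernoulli(theta: float, x: int) -> float:
    """log p(x | theta) for a Bernoulli variable with logit parameter theta."""
    return theta * x - math.log1p(math.exp(theta))


def e_step(x: Sequence[int], alpha: Sequence[float],
           theta: Sequence[Sequence[float]], patterns: Sequence[int]) -> List[float]:
    """Posterior cluster probabilities for one graph, using only the listed patterns."""
    logs = [
        math.log(a) + sum(log_bernoulli(th[k], x[k]) for k in patterns)
        for a, th in zip(alpha, theta)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Toy model: 2 clusters, 3 patterns. Patterns 1 and 2 sit at the same
# (baseline) value in both clusters, so only pattern 0 is active.
alpha = [0.5, 0.5]
theta = [[1.5, -1.0, 0.3],
         [-0.7, -1.0, 0.3]]
x = [1, 0, 1]

if __name__ == "__main__":
    print(e_step(x, alpha, theta, patterns=[0, 1, 2]))  # all patterns
    print(e_step(x, alpha, theta, patterns=[0]))        # active pattern only
```

Both calls give the same posterior, so the sum over the huge set of inactive patterns never has to be evaluated.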
19. M-step
- Putative cluster assignment by E-step
- Each parameter is solved separately
- Use graph mining to find active patterns
- Then, solve it only for active patterns
20. Solution
- Occurrence probability in a cluster
- Overall occurrence probability
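The closed-form solution was lost in extraction. As a hedged sketch, assume the M-step problem for one pattern in one cluster is to minimize -sum_i r_i (theta x_i - log(1 + e^theta)) + lam |theta - theta0|, where r_i are the E-step responsibilities, x_i the 0/1 indicators, theta0 the baseline logit, and lam > 0. Setting the subgradient to zero gives the three cases below. The function name and the exact objective are assumptions, not the slide's formula.

```python
import math
from typing import Sequence


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def solve_parameter(r: Sequence[float], x: Sequence[int],
                    theta0: float, lam: float) -> float:
    """Minimize -sum_i r_i*(theta*x_i - log(1+e^theta)) + lam*|theta - theta0|.

    Assumes lam > 0 and sum(r) > 0.
    """
    size = sum(r)                                # soft cluster size
    count = sum(ri * xi for ri, xi in zip(r, x))  # soft occurrence count
    slack = count - size * sigmoid(theta0)        # cluster count minus expected count
    if abs(slack) <= lam:
        return theta0                             # inactive: stays at the baseline
    if slack > lam:
        return logit((count - lam) / size)        # overrepresented in the cluster
    return logit((count + lam) / size)            # underrepresented in the cluster


# Hypothetical toy data: responsibilities of four graphs for one cluster.
r = [0.9, 0.8, 0.1, 0.2]
x = [1, 1, 0, 0]

if __name__ == "__main__":
    print(solve_parameter(r, x, theta0=0.0, lam=0.3))  # active pattern
    print(solve_parameter(r, x, theta0=0.0, lam=1.0))  # shrunk to the baseline
```

In probability terms, `count / size` plays the role of the occurrence probability in a cluster and `sigmoid(theta0)` that of the overall occurrence probability; the parameter leaves the baseline only when the two differ by more than `lam / size`.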
21. Important Observation
For active pattern k, the occurrence probability in a graph cluster is significantly different from the average
22. Mining for Active Patterns F
- F is rewritten in the following form
- Active patterns can be found by graph mining! (multiclass)
23. Experiments: RNA graphs
- Stem as a node
- Secondary structure by RNAfold
- 0/1 Vertex label (self loop or not)
24. Clustering RNA graphs
- Three Rfam families
- Intron GP I (Int, 30 graphs)
- SSU rRNA 5 (SSU, 50 graphs)
- RNase bact a (RNase, 50 graphs)
- Three bipartition problems
- Results evaluated by ROC scores (area under the ROC curve)
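For a bipartition problem, the ROC score can be computed directly from its pairwise definition. The sketch below assumes each graph gets a real-valued score (for example the posterior probability of one cluster) and a 0/1 family label; the function name and toy numbers are hypothetical and are not results from the experiments.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve.

    Equals the probability that a random positive example scores higher
    than a random negative one, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores for four graphs from two families.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]

if __name__ == "__main__":
    print(roc_auc(scores, labels))
```

Since cluster indices are arbitrary in unsupervised clustering, one reasonable convention is to report the larger of `roc_auc(scores, labels)` and its complement.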
25. Examples of RNA Graphs
26. ROC Scores
27. Number of Patterns and Time
28. Found Patterns
29. Conclusion
- Probabilistic clustering based on substructure representation
- Inference helped by graph mining
- Many possible extensions
- Naïve Bayes
- Graph PCA, LFD, CCA
- Semi-supervised learning
- Applications in Biology?