Title: Graph Mining Applications in Machine Learning Problems
1 Graph Mining Applications in Machine Learning Problems
- Koji Tsuda
- Max Planck Institute for Biological Cybernetics
2 Motivations for graph analysis
- Existing methods assume tables
- Structured data beyond this framework
- -> New methods for analysis are needed
3 Graphs..
4 Graph Structures in Biology
- DNA Sequence
- RNA
- Texts in literature
[Figures: a chemical compound drawn as a graph of C, H, and O atoms; the sentence "Amitriptyline inhibits adenosine uptake" represented as a graph]
5 Overview
- Path representation
- Graph kernels and their disadvantages
- Substructure representation
- Graph mining
- EM-based graph clustering (Tsuda and Kudo, ICML 2006)
6 Path Representations: Marginalized Graph Kernels
7 Marginalized Graph Kernels (Kashima, Tsuda, Inokuchi, ICML 2003)
- We define a kernel function between graphs
- Both vertices and edges are labeled
8 Label path
- A sequence of vertex and edge labels
- Generated by a random walk on the graph
- Uniform initial, transition, and termination probabilities
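As a sketch of the walk's generation probability (the notation below is assumed, following the usual marginalized-kernel formulation, not copied from the slides): a walk visiting vertices v_1, ..., v_L emits the label path h of their vertex and edge labels with probability

    p(h | G) = p_s(v_1) [ \prod_{i=2}^{L} p_t(v_i | v_{i-1}) ] p_q(v_L),

where p_s, p_t, and p_q are the (here uniform) initial, transition, and termination probabilities.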
9 Path-probability vector
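Read together with the previous slide, each graph G is presumably summarized by the vector of these probabilities, one entry per possible label path h, i.e. \phi(G) = ( p(h | G) )_h; this vector is infinite-dimensional, so the kernel defined next is never evaluated by writing it out explicitly.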
10 Kernel definition
- Kernels for paths
- Take expectation over all possible paths!
- Marginalized kernels for graphs
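A hedged sketch of the resulting definition (symbols assumed): given a kernel K_z between label paths (for example a product of vertex- and edge-label kernels, zero for paths of different lengths), the marginalized graph kernel is the expectation of K_z under the two path distributions,

    K(G, G') = \sum_{h} \sum_{h'} p(h | G) p(h' | G') K_z(h, h').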
11 Computation
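As a hedged summary of how such kernels are typically computed: although the expectation ranges over infinitely many paths, it can be reorganized into a system of linear equations indexed by pairs of vertices (one vertex from each graph) and solved by matrix inversion or fixed-point iteration, which is what yields the polynomial cost quoted two slides below.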
12 Graph Kernel Applications
- Chemical Compounds (Mahe et al., 2005)
- Protein 3D structures (Borgwardt et al., 2005)
- RNA graphs (Karklin et al., 2005)
- Pedestrian detection
- Signal Processing
13 Strong points of MGK
- Polynomial-time computation: O(n^3)
- Positive definite kernel
- Support Vector Machines
- Kernel PCA
- Kernel CCA
- And so on
14 Drawbacks of graph kernels
- Global similarity measure
- Fails to capture subtle differences
- Long paths suppressed
- Results not interpretable
- Structural features ignored (e.g. loops)
- No labels -> kernel always 1
15 Substructure Representation: Graph Mining
16 Substructure Representation
- 0/1 vector of pattern indicators
- Huge dimensionality!
- Graph mining is needed to select the feature patterns (see the sketch below)
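A minimal sketch of this representation, assuming the patterns have already been mined. The occurrence test uses NetworkX's GraphMatcher as one possible label-preserving subgraph-isomorphism check; all names are illustrative, not taken from the slides:

    from networkx.algorithms.isomorphism import GraphMatcher

    def contains_pattern(graph, pattern):
        # True if `pattern` occurs in `graph` with matching vertex/edge labels.
        gm = GraphMatcher(
            graph, pattern,
            node_match=lambda a, b: a.get("label") == b.get("label"),
            edge_match=lambda a, b: a.get("label") == b.get("label"),
        )
        return gm.subgraph_is_isomorphic()

    def indicator_vector(graph, patterns):
        # 0/1 feature vector: one dimension per mined pattern.
        return [int(contains_pattern(graph, p)) for p in patterns]

The dimensionality equals the number of patterns, which is why mining must restrict the pattern set before such a vector can be used.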
17 Graph Mining
- A subfield of Data Mining
- KDD, ICDM, PKDD
- Not popular in ICML, NIPS
- Analysis of graph databases
- Frequent Substructure Mining
- Combinatorial algorithms
- Recently developed
- AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
18 Graph Mining
- Frequent Substructure Mining
- Enumerate all patterns that occur in at least m graphs
- Indicator of pattern k in graph i (1 if the pattern occurs, 0 otherwise)
- Support(k): the number of graphs in which pattern k occurs (in symbols below)
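In symbols (the indicator notation is assumed here, not copied from the slides): writing x_i^{(k)} \in {0, 1} for the indicator of pattern k in graph i,

    support(k) = \sum_{i=1}^{n} x_i^{(k)},

and frequent substructure mining enumerates every pattern k with support(k) >= m.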
19 Enumeration on Tree-shaped Search Space
- Each node has a pattern
- Generate nodes from the root
- Add an edge at each step
20 Tree Pruning
- Support(g): the number of graphs in which pattern g occurs
- Anti-monotonicity: the support never increases as a pattern grows
- If support(g) < m, stop exploring!
- The subtree below g is then not generated (see the sketch below)
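A self-contained toy sketch of this pruned search, under a deliberately simplified assumption: patterns and graphs are modeled as sets of labeled edges and "occurrence" is edge-set inclusion rather than true subgraph isomorphism. Support is still anti-monotone under this simplification, so the pruning logic carries over:

    def support(pattern, graphs):
        # Number of graphs that contain every edge of the pattern.
        return sum(1 for g in graphs if pattern <= g)

    def mine(pattern, all_edges, graphs, min_support, out):
        # Depth-first search over the pattern tree with anti-monotone pruning:
        # any extension of `pattern` occurs in a subset of the graphs containing it.
        if support(pattern, graphs) < min_support:
            return  # prune: the whole subtree below `pattern` is skipped
        if pattern:
            out.append(pattern)
        # Extend only with "larger" edges so each edge set is generated once.
        for e in sorted(all_edges):
            if not pattern or e > max(pattern):
                mine(pattern | {e}, all_edges, graphs, min_support, out)

    graphs = [frozenset({("A", "a", "B"), ("B", "c", "A")}),
              frozenset({("A", "a", "B"), ("A", "b", "C")}),
              frozenset({("A", "b", "C")})]
    frequent = []
    mine(frozenset(), frozenset.union(*graphs), graphs, 2, frequent)
    # frequent == [frozenset({("A", "a", "B")}), frozenset({("A", "b", "C")})]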
21 gSpan (Yan and Han, 2002)
- An efficient frequent substructure mining method
- DFS Code
- Efficient detection of isomorphic patterns
- We extend gSpan in our work
22 Depth First Search (DFS) Code
- Each edge added during the depth-first traversal is recorded as a tuple (vertex id, vertex id, vertex label, edge label, vertex label), e.g. (0,1,A,a,B), so every pattern corresponds to a sequence of such tuples
- Isomorphic patterns can yield different DFS codes; the lexicographically smallest one serves as the canonical code, and a pattern reached via a non-minimum DFS code is pruned
[Figure: DFS Code Tree on a graph G, with codes such as (0,1,A,a,B), (1,2,B,c,A), (2,0,A,b,A), (0,2,A,b,A), (0,3,A,b,C); two isomorphic branches G0 and G1; caption: "Non-minimum DFS code. Prune it."]
23 Discriminative patterns
- w_i > 0: positive class
- w_i < 0: negative class
- Weighted Substructure Mining
- Patterns with a large frequency difference between the classes
- Not anti-monotonic -> use a bound (sketched below)
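A hedged sketch of the weighted criterion and its pruning bound (the exact form used in this work is not shown on the slide): score a pattern k by

    gain(k) = | \sum_{i=1}^{n} w_i x_i^{(k)} |,

which is large when the pattern's frequency differs between the classes but is not anti-monotone. Since any supergraph of k can only occur in a subset of the graphs that contain k, every descendant k' in the search tree satisfies

    gain(k') <= max( \sum_{i: w_i > 0, x_i^{(k)} = 1} w_i , \sum_{i: w_i < 0, x_i^{(k)} = 1} |w_i| ),

so the subtree below k can be pruned whenever this bound falls below the current threshold.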
24 Multiclass version
- Multiple weight vectors, one per class
- Each weight takes one value if the graph belongs to the class and another value otherwise
- Search for patterns overrepresented in a class
25 Summary of Graph Mining
- An efficient way of searching for patterns that satisfy predetermined conditions
- NP-hard in the worst case
- But the actual speed depends on the data
- Faster for..
- Sparse graphs
- Diverse kinds of labels
26 EM-based clustering of graphs (Tsuda and Kudo, ICML 2006)
27 EM-based graph clustering
- Motivation
- Learn a mixture model in the feature space of patterns
- A basis for more complex probabilistic inference
- L1 regularization -> Graph Mining
- E-step -> Mining -> M-step
28 Probabilistic Model
- Binomial mixture
- Each component is a product of binomial distributions over the pattern indicators
- A mixing weight for each cluster
- A parameter vector for each cluster
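In a standard form this model presumably takes (the symbols \pi_c and \theta_c are assumed notation): for a 0/1 pattern vector x,

    p(x) = \sum_{c=1}^{C} \pi_c \prod_{k} \theta_{ck}^{x_k} (1 - \theta_{ck})^{1 - x_k},

where \pi_c is the mixing weight of cluster c and \theta_{ck} is the occurrence probability of pattern k in cluster c.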
29 Ordinary EM algorithm
- Maximize the log likelihood
- E-step: compute the posterior cluster probabilities
- M-step: re-estimate the parameters using the posterior probabilities
- Both steps are computationally prohibitive (!), because the number of patterns, and hence of parameters, is huge
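To make the two steps concrete, here is a self-contained NumPy sketch of this ordinary EM on an explicitly materialized 0/1 pattern matrix; it deliberately ignores the point of the following slides (L1 regularization and mining of active patterns), and all names are illustrative:

    import numpy as np

    def em_binomial_mixture(X, n_clusters, n_iter=50, eps=1e-9, seed=0):
        # X: (n_graphs, n_patterns) binary matrix of pattern indicators.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(n_clusters, 1.0 / n_clusters)            # mixing weights
        theta = rng.uniform(0.3, 0.7, size=(n_clusters, d))   # occurrence probabilities
        for _ in range(n_iter):
            # E-step: posterior cluster probabilities, computed in the log domain.
            log_lik = X @ np.log(theta.T + eps) + (1 - X) @ np.log(1 - theta.T + eps)
            log_post = np.log(pi + eps) + log_lik
            log_post -= log_post.max(axis=1, keepdims=True)
            post = np.exp(log_post)
            post /= post.sum(axis=1, keepdims=True)
            # M-step: re-estimate mixing weights and occurrence probabilities.
            weight = post.sum(axis=0)
            pi = weight / n
            theta = (post.T @ X) / (weight[:, None] + eps)
        return pi, theta, post

In the slides' setting the number of pattern columns is astronomically large, so neither X nor theta can actually be stored; this is what the regularization below addresses.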
30 Regularization
- L1-regularized log likelihood
- Baseline constants: the ML parameter estimates from a single binomial distribution fitted to all graphs
- In the solution, most parameters are exactly equal to their baseline constants
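A hedged sketch of an objective of this kind (the exact penalty used in the paper is not shown here): with baseline constants \bar{\theta}_k,

    maximize \sum_{i=1}^{n} \log \sum_{c=1}^{C} \pi_c p(x_i | \theta_c) - \lambda \sum_{c,k} | \theta_{ck} - \bar{\theta}_k |,

so the L1 penalty pulls each \theta_{ck} exactly onto its baseline unless the data give a sufficiently strong reason to move it.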
31 E-step
- Active pattern: a pattern whose parameter differs from its baseline constant in at least one cluster
- The E-step can be computed with the active patterns only (computable!), because the factors of inactive patterns are identical across clusters and cancel in the posterior
32 M-step
- Putative cluster assignments from the posterior probabilities
- Each parameter is solved for separately
- Naïve way: solve for all parameters and then identify the active patterns
- Instead: use graph mining to find the active patterns directly
33 Solution
- Occurrence probability in a cluster
- Overall occurrence probability
34 Solution
35 Important Observation
For an active pattern k, the occurrence probability in a graph cluster is significantly different from the average.
36 Mining for Active Patterns
- Active pattern condition
- It can equivalently be written as a weighted-support condition over the graphs
- The active pattern set F can therefore be found by graph mining! (multiclass version)
37 Experiments: RNA graphs
- Stem as a node
- Secondary structure by RNAfold
- 0/1 vertex label (self-loop or not)
38 Clustering RNA graphs
- Three Rfam families
- Intron GP I (Int, 30 graphs)
- SSU rRNA 5 (SSU, 50 graphs)
- RNase bact a (RNase, 50 graphs)
- Three bipartition problems
- Results evaluated by ROC scores (area under the ROC curve)
39 Examples of RNA Graphs
40 ROC Scores
41 Number of Patterns and Computation Time
42 Found Patterns
43 Conclusion
- Substructure representation is better than paths
- Probabilistic inference helped by graph mining
- Many possible extensions
- Naïve Bayes
- Graph PCA, LFD, CCA
- Semi-supervised learning
- Applications in Biology?
44 Ongoing work..