Title: DISCOVERING LARGER NETWORK MOTIFS
- Li Chen
- 4/16/2009
- CSC 8910 Analysis of Biological Network, Spring 2009
- Dr. Yi Pan
THE REVIEW ON MODELS AND ALGORITHMS FOR MOTIF
DISCOVERY IN PROTEIN-PROTEIN INTERACTION NETWORKS
- Two distinct definitions of a motif, based on frequency and on statistical significance
- Definition 1: a motif is a sub-graph that appears more than a threshold number of times.
- Definition 2: a motif is a sub-graph that appears more often than expected by chance (over-represented motif).
- Two characteristics used to evaluate a motif
- Frequency
- 1. Arbitrary overlaps of nodes and edges (non-identical case)
- 2. Only overlaps of nodes (edge-disjoint case)
- 3. No overlaps (edge and vertex-disjoint case)
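The three overlap cases above can be illustrated with a minimal sketch. It assumes the occurrences of a motif have already been found and are given as lists of edges; the function name `count_frequencies` is hypothetical. Selecting a largest set of disjoint occurrences is a hard problem in general, so the greedy pass here only gives a lower bound for the two disjoint cases.

```python
def count_frequencies(occurrences):
    """Count motif occurrences under the three overlap cases.

    occurrences: list of occurrences, each a list of (u, v) edges.
    Returns (non_identical, edge_disjoint, vertex_disjoint).
    The two disjoint counts come from a greedy pass in input order,
    so they are lower bounds, not guaranteed maxima.
    """
    # Normalise each occurrence to a set of undirected edges.
    norm = [frozenset(frozenset(e) for e in occ) for occ in occurrences]

    def greedy_disjoint(items_of):
        used, kept = set(), 0
        for occ in norm:
            items = items_of(occ)
            if used.isdisjoint(items):
                used |= items
                kept += 1
        return kept

    def edges_of(occ):
        return set(occ)

    def nodes_of(occ):
        return {v for e in occ for v in e}

    non_identical = len(set(norm))  # any overlap allowed
    return non_identical, greedy_disjoint(edges_of), greedy_disjoint(nodes_of)


# Occurrences of a 3-node path in the path graph 0-1-2-3-4.
paths = [[(0, 1), (1, 2)], [(1, 2), (2, 3)], [(2, 3), (3, 4)]]
print(count_frequencies(paths))
```

On this small example the three cases give different counts, which is why the chosen frequency concept has to be stated alongside any reported motif frequency.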
- Statistical significance compares the frequencies obtained for the observed network with those obtained for random networks.
- 1. Z-score
- 2. Abundance
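The formulas for these two measures did not survive on the slide. The sketch below uses the commonly used definitions, which are an assumption here: the Z-score is the difference between the observed frequency and the mean random frequency, divided by the standard deviation of the random frequencies; the abundance is the same difference divided by the sum of the two frequencies plus a small constant. The function names and the default `eps` value are illustrative choices only.

```python
from statistics import mean, pstdev


def z_score(f_obs, f_rand):
    """(observed - mean of random) / standard deviation of random."""
    sigma = pstdev(f_rand)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("random frequencies have zero variance")
    return (f_obs - mean(f_rand)) / sigma


def abundance(f_obs, f_rand, eps=4.0):
    """(observed - mean of random) / (observed + mean of random + eps).

    eps keeps the ratio from becoming large when both counts are small;
    the default here is an arbitrary illustrative value.
    """
    mu = mean(f_rand)
    return (f_obs - mu) / (f_obs + mu + eps)


# Motif seen 10 times in the real network, 4 and 6 times in two random ones.
print(z_score(10, [4, 6]))
print(abundance(10, [4, 6], eps=5.0))
```

Unlike the Z-score, the abundance stays between -1 and 1, which makes it easier to compare across networks of different sizes.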
- Models of Random Graphs
- Preserve the same degree distribution as the biological network
- Preserve the degree sequence (used in the search for n-node motifs)
- Based on geometric random networks and a Poisson distribution of the degree
- Incorporate node clustering into the model
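One common way to obtain a random network with the same degree sequence is repeated edge swapping. The sketch below is a minimal version under stated assumptions: the graph is undirected and simple, and the function name `degree_preserving_shuffle` and its parameters are hypothetical. It is not claimed to be the procedure used in the reviewed paper.

```python
import random


def degree_preserving_shuffle(edges, n_swaps, seed=None):
    """Randomise an undirected simple graph while keeping every node degree.

    Repeatedly picks two edges a-b and c-d and rewires them to a-d and c-b,
    rejecting swaps that would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
print(degree_preserving_shuffle(ring, n_swaps=20, seed=1))
```

Each accepted swap removes one edge at every one of the four nodes and adds one back, so the degree of every node is unchanged.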
- 3. Compact Topological Motifs: introduces a compact graph representation obtained by grouping together maximal sets of nodes that are indistinguishable.
- [Figure] The graph on the left shows the sets U1 and U2 as compact nodes and U1U2 as a compact edge.
- Motif Discovery Algorithm
- Exact algorithms for motifs with a small number of nodes
- 1. Exhaustive Recursive Search (ERS): the input network is represented by an adjacency matrix M. (motif size < 4)
- 2. ESU: starts with individual nodes and adds one node at a time until the required size k is reached. (motif size < 14)
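The node-by-node growth used by ESU can be sketched as follows. This is a minimal illustration that assumes an undirected graph stored as a dictionary of neighbour sets with comparable (here integer) node labels; the function name `esu` is hypothetical. It only enumerates the connected k-node sub-graphs and leaves out the grouping of sub-graphs into isomorphism classes.

```python
def esu(adj, k):
    """Enumerate every connected k-node sub-graph exactly once.

    adj: dict mapping each node to the set of its neighbours.
    Returns a list of frozensets of nodes.
    """
    found = []

    def extend(sub, ext, v):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        ext = set(ext)
        # Nodes that are already in the sub-graph or adjacent to it.
        closed = sub | {u for s in sub for u in adj[s]}
        while ext:
            w = ext.pop()
            # New candidates: neighbours of w with a label above the root v
            # that are not yet reachable from the current sub-graph.
            exclusive = {u for u in adj[w] if u > v and u not in closed}
            extend(sub | {w}, ext | exclusive, v)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return found


# A triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sorted(sorted(s) for s in esu(graph, 3)))
```

Restricting extensions to labels above the root node and to nodes not yet adjacent to the sub-graph is what keeps each sub-graph from being reported more than once.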
- Approximate Algorithms
- 1. Search Algorithm Based on Sampling (MFINDER): it picks edges of the input graph at random until a set of k nodes is obtained, which gives a sample sub-graph, and assigns weights to the samples to correct for the non-uniform sampling. It scales well with large networks, but does not scale well with large motifs.
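The edge-picking step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the graph is an undirected edge list, and the function name `sample_subgraph` and the `max_tries` parameter are hypothetical. The weight computation that corrects for non-uniform sampling is deliberately left out, so this is not the full MFINDER procedure.

```python
import random


def sample_subgraph(edges, k, rng, max_tries=1000):
    """Grow one connected k-node sample by picking random edges.

    Starts from a random edge, then repeatedly picks a random edge with
    exactly one endpoint inside the current node set. Returns a frozenset
    of k nodes, or None if no sample was found within max_tries restarts.
    """
    for _ in range(max_tries):
        nodes = set(rng.choice(edges))
        while len(nodes) < k:
            frontier = [e for e in edges if (e[0] in nodes) != (e[1] in nodes)]
            if not frontier:
                break  # component too small, restart from a new edge
            nodes |= set(rng.choice(frontier))
        if len(nodes) == k:
            return frozenset(nodes)
    return None


path = [(0, 1), (1, 2), (2, 3)]
print(sorted(sample_subgraph(path, 3, random.Random(0))))
```

Sub-graphs that can be reached through many different edge orders are sampled more often, which is the bias the sample weights are meant to correct.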
- 2. Rand-ESU: unlike MFINDER, it does not need to compute the weights of all samples. ESU builds a tree whose leaves correspond to sub-graphs of size k, while internal nodes correspond to sub-graphs of size 1 up to k-1, depending on the tree level. Rand-ESU assigns to each level in the tree a probability that the nodes are further explored, so as to guarantee that all leaves are visited with uniform probability.
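A minimal sketch of the per-level exploration probabilities follows. It assumes an undirected graph stored as a dictionary of neighbour sets with integer labels; the function name `rand_esu` and the `probs` parameter are hypothetical. Every leaf of the tree is reached with the same probability, the product of the level probabilities, so dividing the number of sampled sub-graphs by that product gives an estimate of the total count.

```python
import math
import random


def rand_esu(adj, k, probs, rng):
    """Sample connected k-node sub-graphs by pruning the ESU tree.

    probs[0] is the probability of starting from a given node; probs[d]
    is the probability of following a branch from size d to size d + 1.
    Returns (sampled sub-graphs, estimated total number of sub-graphs).
    """
    if len(probs) != k:
        raise ValueError("need one probability per tree level")
    found = []

    def extend(sub, ext, v):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        p = probs[len(sub)]
        ext = set(ext)
        closed = sub | {u for s in sub for u in adj[s]}
        while ext:
            w = ext.pop()
            exclusive = {u for u in adj[w] if u > v and u not in closed}
            if rng.random() < p:  # explore this child only with probability p
                extend(sub | {w}, ext | exclusive, v)

    for v in adj:
        if rng.random() < probs[0]:
            extend({v}, {u for u in adj[v] if u > v}, v)

    scale = math.prod(probs)
    estimate = len(found) / scale if scale > 0 else 0.0
    return found, estimate


graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
samples, estimate = rand_esu(graph, 3, [1.0, 1.0, 0.5], random.Random(7))
print(len(samples), estimate)
```

With all probabilities set to 1.0 the sketch reduces to the full enumeration, which is a convenient way to check it on small graphs.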
- 3. NeMoFINDER: combines approaches from the data mining and computational biology communities. It searches for repeated trees and extends them to sub-graphs. This reduces the computation time for the discovery of larger motifs, but at the cost of missing some potentially interesting sub-graphs.
- 4. Sub-graph Counting by Scalar Computation: it characterizes a biological network by a set of measures based on scalars and functionals of the adjacency matrix associated with the network. Its advantages are mathematical elegance and computational efficiency.
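A well-known instance of a scalar computed from the adjacency matrix is the triangle count, which equals trace(A^3) / 6 for an undirected simple graph. The sketch below shows only this one measure as an illustration; it is not the specific set of measures used by the reviewed method, and the function name `count_triangles` is hypothetical.

```python
def count_triangles(a):
    """Number of triangles in an undirected simple graph.

    a: square 0/1 adjacency matrix as a list of lists.
    Each triangle contributes 6 closed walks of length 3
    (3 starting nodes times 2 directions), hence the division by 6.
    """
    n = len(a)
    a2 = [[sum(a[i][t] * a[t][j] for t in range(n)) for j in range(n)]
          for i in range(n)]
    trace_a3 = sum(a2[i][t] * a[t][i] for i in range(n) for t in range(n))
    return trace_a3 // 6


# Complete graph on 4 nodes.
k4 = [[0, 1, 1, 1],
      [1, 0, 1, 1],
      [1, 1, 0, 1],
      [1, 1, 1, 0]]
print(count_triangles(k4))
```

No sub-graph is enumerated explicitly here, which is where the computational efficiency of this family of methods comes from.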
- 5. A-priori-based Motif Detection: the basic idea is that if a sub-graph is frequent, so are all its sub-graphs. It builds candidate motifs of size k by joining motifs of size k-1 and then evaluates their frequency.
A ROADMAP OF CLUSTERING ALGORITHM IN
BIOINFORMATICS APPLICATIONS
- Desirable features of clustering algorithms to evaluate
- Scalability
- Robustness
- Order insensitivity
- Minimum user-specified input
- Mixed data types
- Arbitrary-shaped clusters
- Point proportion admissibility: duplicating data and re-clustering should not alter the results.
- Six categories of clustering algorithms
- Partitioning Clustering Algorithm
- Hierarchical Clustering Algorithm
- Grid-based Clustering Algorithm
- Density-based Clustering Algorithm
- Model-based Clustering Algorithm
- Graph-based Clustering Algorithm
- Partition Clustering Algorithm
- Numerical Methods
- 1. K-means algorithm and Farthest First Traversal k-center (FFT) algorithm
- 2. K-medoids or PAM (Partitioning Around Medoids)
- 3. CLARA (Clustering Large Applications)
- 4. CLARANS (Clustering Large Applications Based upon Randomized Search) and Fuzzy K-means
- Discrete Methods
- 1. K-modes
- 2. Fuzzy K-modes
- 3. Squeezer and COOLCAT
- Mixed Discrete and Numerical Clustering Methods
- 1. K-prototypes
- Hierarchical Clustering Algorithm
- Divides the data into a tree of nodes, where each node represents a cluster.
- Two categories based on methods or purposes
- 1. Agglomerative vs. Divisive
- 2. Single vs. Complete vs. Average linkage
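The agglomerative direction and the three linkage rules can be illustrated with a minimal sketch. It assumes numeric points given as tuples and Euclidean distance; the function name `agglomerative` is hypothetical, and the quadratic search over cluster pairs is kept for readability rather than speed.

```python
import math


def agglomerative(points, n_clusters, linkage="single"):
    """Bottom-up clustering: start with singletons, merge the closest pair.

    linkage selects how the distance between two clusters is derived from
    the pairwise point distances: "single" (minimum), "complete" (maximum)
    or "average" (mean). Returns clusters as lists of point indices.
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda ds: sum(ds) / len(ds),
    }[linkage]
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        return combine([math.dist(points[i], points[j]) for i in c1 for j in c2])

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        a, b = min(pairs,
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Points on a line with slowly growing gaps.
pts = [(0,), (20,), (41,), (63,), (86,)]
print(agglomerative(pts, 2, "single"))
print(agglomerative(pts, 2, "complete"))
```

On this chain of points single linkage keeps absorbing the nearest neighbour into one long cluster, while complete linkage prefers two compact groups, so the choice of linkage changes the resulting tree.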
- Popular because natural data can have various levels of subsets
- Drawbacks
- 1. Slow
- 2. Errors are not tolerated: a wrong merge or split cannot be undone later
- 3. Information is lost when moving between levels
- Two kinds of methods
- 1. Numerical methods: BIRCH, CURE, spectral clustering
- 2. Discrete methods: ROCK, Chameleon, LIMBO
- Grid-based Clustering Algorithm
- Forms a grid structure of cells from the input data; each data point is then assigned to a cell of the grid.
- STING combines a numerical grid-based clustering method and a hierarchical method
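The cell-assignment step can be sketched in a few lines. This is a generic illustration that assumes numeric points and square cells of a fixed width; the function name `assign_to_grid` is hypothetical and the sketch does not reproduce STING itself.

```python
import math
from collections import defaultdict


def assign_to_grid(points, cell_size):
    """Map each point to the grid cell that contains it.

    Returns a dict from cell coordinates (a tuple of integers) to the list
    of indices of the points that fall into that cell.
    """
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(math.floor(x / cell_size) for x in point)
        cells[key].append(idx)
    return dict(cells)


data = [(0.1, 0.2), (0.4, 0.9), (2.5, 2.5)]
print(assign_to_grid(data, cell_size=1.0))
```

Later clustering steps then work on per-cell summaries such as point counts instead of on individual points, which is what makes grid-based methods fast.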
- Density-based Clustering Algorithm
- Use a local density standard
- Clusters are dense subspaces separated by low density spaces
- Example of a bioinformatics application: finding the densest subspaces in interactome (protein-protein interaction) networks
- DBSCAN, OPTICS, DENCLUE, WaveCluster and CLIQUE use numerical values for clustering
- SEQOPTICS is used for sequence clustering
- HIERDENC (Hierarchical Density-based Clustering), MULIC (Multiple Layer Incremental Clustering), projected (subspace) clustering, CACTUS, STIRR, CLICK and CLOPE use discrete values for clustering
- Model-based Clustering Algorithm
- Uses a model, often derived from a statistical distribution
- Bioinformatics applications
- 1. gene expression
- 2. interactomes
- 3. sequences
- Numerical model-based methods
- 1. Self-Organizing Maps
- Discrete model-based clustering algorithm
- 1. COBWEB
- Numerical and discrete model-based clustering methods
- 1. BILCOM (Bi-level Clustering of Mixed Discrete and Numerical Biomedical Data), using an empirical Bayesian approach
- Examples
- 1. Gene expression clustering
- 2. Protein sequence clustering
- 3. AutoClass
- 4. SVM Clustering methods
- Graph-based Clustering Algorithm
- Applied to interactomes for complex prediction and to sequence networks
- Examples
- 1. MCODE (Molecular Complex Detection)
- 2. SPC (Super Paramagnetic Clustering)
- 3. RNSC (Restricted Neighborhood Search Clustering)
- 4. MCL (Markov Clustering)
- 5. TribeMCL
- 6. SPC
- 7. CD-HIT
- 8. ProClust
- 9. BAG algorithms
- Usage in Bioinformatics Applications
- Gene expression clustering
- 1. K-means algorithm
- 2. Hierarchical algorithm
- 3. SOMs
- Interactomes
- 1. AutoClass
- 2. SVM clustering
- 3. COBWEB
- 4. MULIC
- Sequence clustering
- 1. Hierarchical clustering algorithm
REFERENCES
- 1 Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform, pages bbn058, February 2009.
- 2 Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging Lossy and Lossless Compression by Motif Pattern Discovery. Electronic Notes in Discrete Mathematics, 21:219-225, 2005. General Theory of Information Transfer and Combinatorics.
- 3 Giovanni Ciriello and Concettina Guerra. A review on models and algorithms for motif discovery in protein-protein interaction networks. Brief Funct Genomic Proteomic, 7(2):147-156, 2008.
- 4 Jun Huan, Wei Wang, and Jan Prins. Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. Data Mining, IEEE International Conference on, 0:549, 2003.
- 5 Michihiro Kuramochi and George Karypis. Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery, 11(3):243-271, November 2005.
- 6 Laxmi Parida. Discovering Topological Motifs Using a Compact Notation. Journal of Computational Biology, 14(3):300-323, 2007.