Title: Network motifs: discovery and applications
1Network motifs: discovery and applications
- Guy Zinman
- Seminar in Bioinformatics
- Technion, Spring 2005
2Outline
- Theory of network motifs
- Definition, Algorithm
- Application to E. Coli transcription network
- The dynamic behavior of the motifs
- Finding active subnetworks
- Simulated annealing
- experiments
3Network
4Network
- Dictionary definition
- A group or system of (electric) components and
connecting circuitry designed to function in a
specific manner.
- A network is the backbone of a complex system.
- Studies of networks are similar to paleontology:
learning about an animal from its backbone.
5Network motifs
- The notion of a motif, widely used for sequence
analysis, is generalized to the level of
networks.
- Network motifs are defined as patterns of
interconnections that recur in many different
parts of a network at frequencies much higher
than those found in randomized networks.
6Network motifs (cont.)
- Such motifs are found in networks from
- Biochemistry
- Transcriptional regulation networks
- Neurobiology
- Neuron connectivity
- Ecology
- Food webs
- Engineering
- Electronic circuits
- World Wide Web
7Network motifs (cont.)
9Schematic view of motif detection
- Occurrence of the FFL motif
10Random vs designed/evolved features
- Large networks may contain information about
design principles and/or evolution of the complex
system.
- Which features are there for a reason?
- design principles (e.g. feed-forward loops)
- constraints (e.g. all the nodes on the Internet
must be connected to each other)
- evolution, growth dynamics (e.g. network growth
is mainly due to gene duplication)
11Network motifs
- Alon U. et al., "Network Motifs: Simple Building
Blocks of Complex Networks", Science, 2002.
- Different motifs were found in different classes
of networks.
- The motifs reflect the underlying processes that
generate each type of network.
12Motifs detected
- Two significant motifs
- Both appeared numerous times in non-homologous
gene systems that perform diverse biological
functions
13Motifs detected
14Motifs detected
15Main tasks for detecting network motifs
- There are two main tasks in detecting network
motifs:
- (1) generating an ensemble of proper random
networks
- (2) counting the subgraphs in the real network
and in the random networks.
16The algorithm
- Starting point: a graph with directed edges
- Scan for n-node subgraphs (n = 3, 4) and count
the number of occurrences
- Compare to Erdos-Renyi randomized graph
- (randomization preserves the in-, out- and
inout-degree of each node)
17All 3-node connected subgraphs
- 13 different (non-isomorphic) types of 3-node
connected subgraphs
- There are
- 199 4-node subgraphs,
- 9364 5-node subgraphs
18Generation of randomized network
- Algorithm A
- Employ a Markov-chain algorithm based on starting
with the real network and repeatedly swapping
randomly chosen pairs of connections (X1 -> Y1,
X2 -> Y2 is replaced by X1 -> Y2, X2 -> Y1) until
the network is well randomized.
- Switching is prohibited if either of the
connections X1 -> Y2 or X2 -> Y1 already exists.
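The switching procedure of Algorithm A can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `randomize_by_switching`, the `n_swaps` and `seed` parameters, the attempt cap, and the extra rule that forbids creating self-loops are all assumptions made for the example.

```python
import random

def randomize_by_switching(edges, n_swaps, seed=0):
    """Degree-preserving randomization of a directed graph by edge switching."""
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # hypothetical safety cap, not from the slides
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (x1, y1), (x2, y2) = edge_list[i], edge_list[j]
        # Switching is prohibited if X1 -> Y2 or X2 -> Y1 already exists.
        if (x1, y2) in edge_set or (x2, y1) in edge_set:
            continue
        # Assumption of this sketch: do not create self-loops either.
        if x1 == y2 or x2 == y1:
            continue
        edge_set -= {(x1, y1), (x2, y2)}
        edge_set |= {(x1, y2), (x2, y1)}
        edge_list[i], edge_list[j] = (x1, y2), (x2, y1)
        done += 1
    return edge_list
```

Each accepted switch leaves the in-degree and out-degree of every node unchanged, because every source keeps one outgoing edge and every target keeps one incoming edge.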
19Generation of randomized network
- Algorithm B
- Each network is represented as a connectivity
matrix M, such that $M_{ij} = 1$ if there is a
connection directed from node i to node j, and 0
otherwise.
- The goal is to create a randomized connectivity
matrix $M_{\mathrm{rand}}$, which has the same number of
nonzero elements in each row and column as the
corresponding row and column of the real
connectivity matrix.
20Generation of randomized network
- $R_i = \sum_j M_{\mathrm{rand},ij} = \sum_j M_{ij}$, $C_j = \sum_i M_{\mathrm{rand},ij} = \sum_i M_{ij}$.
- To generate the randomized networks, we start
with an empty matrix $M_{\mathrm{rand}}$.
- We then repeatedly randomly choose a row n
according to the weights $p_i = R_i / \sum_i R_i$ and a column
m according to the weights $q_j = C_j / \sum_j C_j$.
- If $M_{\mathrm{rand},nm} = 0$, we set $M_{\mathrm{rand},nm} = 1$.
- We then set $R_n = R_n - 1$ and $C_m = C_m - 1$. If the
entry (n, m) was previously entered into the
randomized matrix, that is, if $M_{\mathrm{rand},nm} = 1$, or if
m = n, we choose a new (n, m).
- This process is repeated until all $R_i = 0$ and
$C_j = 0$.
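As a rough illustration of Algorithm B, the sketch below fills an empty matrix by drawing rows and columns with weights proportional to their remaining counts. The function name `randomize_matrix`, the `seed` and `max_tries` parameters, and the choice to return `None` when the filling gets stuck (for example when only a diagonal entry remains) are assumptions of this example, not part of the original method description.

```python
import random

def randomize_matrix(M, seed=0, max_tries=10000):
    """Fill an empty matrix so row and column sums match those of M."""
    rng = random.Random(seed)
    n = len(M)
    R = [sum(row) for row in M]                             # remaining per row
    C = [sum(M[i][j] for i in range(n)) for j in range(n)]  # remaining per column
    Mrand = [[0] * n for _ in range(n)]
    tries = 0
    while sum(R) > 0:
        tries += 1
        if tries > max_tries:
            return None  # stuck; the caller may retry with another seed
        r = rng.choices(range(n), weights=R)[0]  # row drawn with weight R_i
        c = rng.choices(range(n), weights=C)[0]  # column drawn with weight C_j
        if r == c or Mrand[r][c] == 1:
            continue  # self-connection or already filled: choose a new pair
        Mrand[r][c] = 1
        R[r] -= 1
        C[c] -= 1
    return Mrand
```

A caller would typically loop over seeds until a non-`None` matrix is returned.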
21Network motif detection
- For each nonzero element (i, j):
- Loop through all connected elements $M_{ik} = 1$,
$M_{ki} = 1$, $M_{jk} = 1$, and $M_{kj} = 1$. This is
recursively repeated with elements (i, k), (k,
i), (j, k), and (k, j) until an n-node subgraph is
obtained.
- A table is formed that counts the number of
appearances of each type of subgraph in the
network, correcting for the fact that multiple
submatrices of M can correspond to one isomorphic
architecture owing to symmetries.
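To make the counting step concrete for one subgraph type, here is a small sketch that counts feed-forward loops (X -> Y, Y -> Z, X -> Z) in a directed edge list. It is a simplified stand-in for the general n-node enumeration: the function name `count_ffl` is hypothetical, self-loops are ignored, and the sketch does not check that the three nodes carry no additional edges between them.

```python
def count_ffl(edges):
    """Count triples (x, y, z) with edges x->y, y->z and x->z."""
    succ = {}
    for x, y in edges:
        if x != y:  # ignore self-loops in this sketch
            succ.setdefault(x, set()).add(y)
    count = 0
    for x, ys in succ.items():
        for y in ys:
            for z in succ.get(y, ()):
                if z in ys:  # x also points directly at z
                    count += 1
    return count
```

Because the three roles in a feed-forward loop are distinct, each loop is found exactly once and no symmetry correction is needed for this particular subgraph.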
22Network motif detection
- This process is repeated for each of the
randomized networks. The number of appearances of
each type of subgraph in the random ensemble is
recorded, to assess its statistical significance.
- The present concepts and algorithms are easily
generalized to nondirected or directed graphs
with several colors of edges and nodes,
multipartite graphs, and so forth.
23Criteria for Network Motif Selection
- The probability that it appears in a randomized
network an equal or greater number of times than
in the real network is smaller than P = 0.01.
- Reminder: the p-value is the probability of getting
the given result (or a more extreme one) when the
tested subject is not affected by the experiment.
If the p-value < 0.01, the subject is considered to
be affected (the null hypothesis is rejected).
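The selection criterion can be estimated directly from the random ensemble. The sketch below assumes the subgraph counts are already available; the function name `empirical_p_value` and the example numbers are made up for illustration.

```python
def empirical_p_value(real_count, random_counts):
    """Fraction of randomized networks in which the subgraph appears
    at least as many times as in the real network."""
    hits = sum(1 for c in random_counts if c >= real_count)
    return hits / len(random_counts)

# Hypothetical counts: 40 appearances in the real network versus four
# randomized networks; one of the four reaches or exceeds the real count.
p = empirical_p_value(40, [5, 7, 9, 41])
```

With an ensemble of 1000 randomized networks, the smallest nonzero p-value this estimate can resolve is 0.001.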
24Run time complexity
- The performance of this algorithm scales with the
total number of n-node subgraphs in the network.
- The number of subgraphs and the algorithm runtime
also increase dramatically for subgraphs with
n ≥ 5.
25Sampling method for subgraph counting
- Kashtan et al., "Efficient sampling algorithm for
estimating subgraph concentrations and detecting
network motifs", Bioinformatics, 2004.
- This algorithm samples subgraphs in order to
estimate their relative frequency.
- The runtime of the algorithm asymptotically does
not depend on the network size.
- Surprisingly, few samples are needed to detect
network motifs reliably.
26Subgraph sampling
- Procedure description
- Pick a random edge from the network and then
expand the subgraph iteratively by picking random
neighboring edges until the subgraph reaches n
nodes.
- For each random choice of an edge, in order to
pick an edge that will expand the subgraph size
by one, prepare a list of all such candidate
edges and then randomly choose an edge from the
list.
27Subgraph sampling
- Finally, the sampled subgraph is defined by the
set of n nodes and all the edges that connect
these nodes in the original network.
- Finding n-node subgraphs for n ≥ 5 is much easier
now.
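The sampling procedure described on the last two slides can be sketched as follows. This is only the edge-expansion step: the published algorithm also weights each sample to correct for the non-uniform probability of reaching a given subgraph, which is omitted here. The function name `sample_subgraph` and its signature are assumptions of this example.

```python
import random

def sample_subgraph(edges, n, rng):
    """Sample one connected n-node subgraph by random edge expansion."""
    edges = list(edges)
    nodes = set(rng.choice(edges))  # start from a random edge
    while len(nodes) < n:
        # Candidate edges are those that add exactly one new node.
        candidates = [e for e in edges if (e[0] in nodes) != (e[1] in nodes)]
        if not candidates:
            return None  # the component is too small to reach n nodes
        nodes.update(rng.choice(candidates))
    # The subgraph is the node set plus ALL original edges among those nodes.
    induced = {e for e in edges if e[0] in nodes and e[1] in nodes}
    return frozenset(nodes), induced
```

The candidate list is rebuilt by scanning all edges at every step, which keeps the sketch short; an adjacency structure would be the natural choice for large networks.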
28Comparing sampling method results with exhaustive
enumeration
29Transcriptional Regulation Network of Escherichia
coli
- Operon: a group of contiguous genes that are
transcribed into a single mRNA molecule.
- The transcriptional network is represented as a
directed graph: each operon represents a node and
edges represent direct transcriptional
interactions.
30Application to E. Coli
- Alon U. et al., "Network motifs in the transcriptional
regulation network of Escherichia coli", Nature
Genetics, 2002.
- Database: RegulonDB
- Contains interactions between transcription
factors (TFs) and the operons they regulate
- Contains 577 interactions, 424 operons and 116
TFs
- 35 more TFs were added from the literature
- The previously described algorithm was run on this
data (1000 random networks)
31Significant motifs
- Feedforward loop
- found in 22 different systems,
- 10 TFs and 40 operons
- P-value < 0.001
32Concentration of FFL
33Same in the yeast regulatory network
- Young et al., "Transcriptional Regulatory Networks
in Saccharomyces cerevisiae", Science, 2002
34- Can you think of a possible role for this motif?
35Dynamics for the FFL
36- Mangan et al., "Structure and function of the
feed-forward loop", PNAS, 2003.
- Consider Sx and Sy as input signals: small
molecules that activate or inhibit the activity
of X and Y.
37Coherency of FFLs
- The FFL is coherent if the direct effect of the
general TF on the effector has the same sign as
its net indirect effect through the specific TF.
- 85% of the FFLs found were coherent.
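The coherence test reduces to a sign comparison. In the sketch below, activation is encoded as +1 and repression as -1; this encoding and the function name `is_coherent` are assumptions for illustration.

```python
def is_coherent(sign_xy, sign_yz, sign_xz):
    """True if the direct effect X -> Z has the same sign as the
    indirect effect X -> Y -> Z (the product of the two signs)."""
    return sign_xz == sign_xy * sign_yz

# X activates Y, Y activates Z, X activates Z: coherent.
all_activating = is_coherent(+1, +1, +1)
# X activates Y, Y represses Z, X activates Z: incoherent.
mixed = is_coherent(+1, -1, +1)
```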
38Significant motif
- Single Input Motif (SIM)
- A single transcription factor controls a set of
operons.
- All operons in a SIM are regulated with the same
sign.
- Appeared in 24 different systems
39Dynamics for the SIM
40Significant motif
- Dense Overlapping Regulon (DOR):
- a layer of overlapping interactions between
operons and a group of TFs, much denser than
such a structure would appear in an Erdos-Renyi
random graph
41E. Coli network
42DOR detection
- Briefly:
- Define a (nonmetric) distance measure between
operons k and j.
- The operons were clustered.
- DORs corresponded to clusters with more than C = 10
connections, with a ratio of connections to TFs
greater than R = 2.
43mFinder
- A software tool for estimating subgraph
concentrations and detecting network motifs.
- www.weizmann.ac.il/mcb/UriAlon/
44Discussion
- The concept of homology between genes based on
sequence motifs has been crucial for
understanding the function of uncharacterized
genes.
- Likewise, the notion of similarity between
connectivity patterns in networks, based on
network motifs, may be helpful in gaining insight
into the dynamic behavior of newly identified
gene circuits.
45Discussion
- Until now we considered only transcription
interactions, specifically those manifested by
transcription factors that bind regulatory sites.
- This transcriptional network can be thought of as
the slow part of the cellular regulation network
(time scale of minutes).
46Discussion
- An additional layer of faster interactions, which
includes interactions between proteins (often on a
subsecond timescale), contributes to the full
regulatory behavior.
47Finding active subnetworks
- Ideker, T. et al., "Discovering regulatory and signaling
circuits in molecular interaction networks",
Bioinformatics, 2002.
- Integrates protein-protein and protein-DNA
interactions with mRNA expression data, with the
goal of better understanding the molecular
mechanisms behind the observed gene expression.
- Uses a method of searching the network to find
active subnetworks, i.e., connected sets of
genes with unexpectedly high levels of
differential expression under one or more
perturbations.
48Methodology
- Using a molecular interaction network to analyze
changes in expression over 20 perturbations to
the yeast galactose utilization (GAL) pathway.
- Determining which conditions significantly
affected the gene expression in each active
subnetwork.
49The means
- Combining a rigorous statistical measure for
scoring subnetworks with a search algorithm for
identifying subnetworks with high score.
50Basic z-score calculation
- To rate the biological activity of a particular
subnetwork, begin by assessing the significance
of differential expression for each gene.
- The error model is provided by the VERA
(Variability and ERror Assessment) program.
- VERA estimates the parameters of a statistical
model using the method of maximum likelihood.
- Output: p-values ($p_i$), representing the
significance of expression change.
51Basic z-score calculation
- Each $p_i$ is converted to a z-score:
- $z_i = \Phi^{-1}(1 - p_i)$
- $\Phi^{-1}$: the inverse normal CDF (cumulative
distribution function)
- Smaller p-values correspond to larger z-scores
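The conversion from p-value to z-score is a one-liner with the Python standard library. The function name `p_to_z` is hypothetical; `statistics.NormalDist().inv_cdf` is the standard-library inverse normal CDF and requires 0 < p < 1.

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a p-value to a z-score via the inverse standard normal CDF."""
    return NormalDist().inv_cdf(1.0 - p)

# Smaller p-values give larger z-scores.
z_strong = p_to_z(0.001)
z_weak = p_to_z(0.2)
```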
52Scoring of Subnetworks
- Aggregate z-score for an entire subnetwork A of k
genes:
- $z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i$
- Notice:
- $z_A$ will also be distributed according to the
standard normal (provided the $z_i$ are
independent standard normal variables).
- Subnetworks of all sizes are comparable under
this scoring system, independent of k.
- A high $z_A$ indicates a biologically active
subnetwork.
53Calibrating z against background distribution
- Randomly sample gene sets of size k using a Monte
Carlo approach, compute their scores $z_A$, and
calculate the mean $\mu_k$ and standard deviation
$\sigma_k$ for each k.
- The corrected subnet score $s_A$ is
$s_A = (z_A - \mu_k) / \sigma_k$
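A minimal sketch of the scoring and calibration steps, assuming the aggregate score is the sum of the member z-scores divided by the square root of k, and the corrected score subtracts the background mean and divides by the background standard deviation. The function names, `n_samples`, and `seed` are assumptions of this example.

```python
import math
import random
import statistics

def aggregate_z(z_scores):
    """Aggregate z-score of a gene set: sum of z_i divided by sqrt(k)."""
    return sum(z_scores) / math.sqrt(len(z_scores))

def calibrated_score(subnet_z, all_z, n_samples=1000, seed=0):
    """Correct the aggregate score against random gene sets of equal size."""
    rng = random.Random(seed)
    k = len(subnet_z)
    # Monte Carlo background: aggregate scores of random size-k gene sets.
    background = [aggregate_z(rng.sample(all_z, k)) for _ in range(n_samples)]
    mu = statistics.fmean(background)
    sigma = statistics.stdev(background)
    return (aggregate_z(subnet_z) - mu) / sigma
```

Here `all_z` stands for the z-scores of every gene in the network, from which background sets are drawn without replacement.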
54Scoring an example subnetwork
55Scoring over multiple conditions
- Starting with a matrix of p-values (genes vs.
conditions) and corresponding z-scores.
- Producing m different aggregate scores, one for
each condition, and sorting them.
- Finding the probability that at least j of the m
conditions had scores above $z_{A(j)}$
- A Monte Carlo technique is used for estimating the
mean and the standard deviation from random gene
sets of size k.
56Scoring over multiple conditions
57Finding the maximal scoring
- Problem
- Finding the maximal scoring connected subgraph
is NP-hard.
58The Difficulty in Searching Global Optima
[Figure: significance score vs. subnetwork, with a global maximum and local maxima marked]
59Rugged landscapes and local maxima problem
60Monte Carlo random search
- Also known as the Metropolis algorithm
- A simulation technique for conformational
sampling and optimization based on a random
search for energetically favourable conformations
- Finding a global (or at least a good local) maximum
by a biased random walk may take some luck
61[Figure: significance score vs. subnetwork, with a global maximum and local maxima marked]
62Climbing mountains easier: simulated annealing
In order to get out of a local maximum one needs
to allow for locally unfavorable moves
[Figure: significance score vs. subnetwork, with a global maximum and local maxima marked]
63Introduction to simulated annealing
- Simulated annealing (Kirkpatrick et al., 1983).
- A mathematical method developed together with
Monte Carlo techniques to avoid false maxima.
- The method simulates slow cooling of a solidifying
solution to form a single crystal.
- Origin
- The annealing process of heated solids
- Intuition
- By allowing occasional descents in the search
process, we might be able to escape the trap of
local maxima.
- In our context
- Allow nodes to be removed from the subsets, even
if the resulting subnetwork's score is a (little)
lower.
64- What can be an adverse effect of this method?
65Consequences of the Occasional Descents
- Desired effect: helps escape local optima.
- Adverse effect: might leave the global optimum
after reaching it.
- So the result is not guaranteed to be optimal.
- But here we don't care: any high-scoring
subnetwork is suspected to be biologically
significant.
66Climbing mountains easier: simulated annealing
- Defining a temperature function.
- Increasing the effective temperature means a
higher probability of accepting unfavorable moves
(moves that increase the energy or, in our case,
lower the score). Thus, the likelihood of
escaping from a local maximum may be tuned.
67Control of Annealing Process
Acceptance of a search step (Metropolis
criterion)
Assume the performance change in the search
direction is $\Delta$.
Always accept an ascending step, i.e. $\Delta \ge 0$.
Accept a descending step only if it passes a random
test, i.e. with probability $p = e^{\Delta / T}$.
68Control of Annealing Process
Cooling schedule
T, the annealing temperature, is the parameter
that controls the frequency of acceptance of
descending steps.
We gradually reduce the temperature T(k) from 1
towards 0.
The probability of accepting descending steps
decreases as T decreases.
69In our context
- Input
- Graph G = (V, E) of molecular interactions,
- N: number of iterations
- $T_i$: temperature function which decreases from
$T_{start}$ to $T_{end}$
- Output
- $G_w$: subgraph of G
- Initialize $G_w$ by setting each node to an
active/inactive state randomly (with p = ½).
70Simulated Annealing Algorithm
- For i = 1 to N DO
- Randomly pick a node v from V and toggle its
state.
- Compute the score $s_i$ for the working subgraph
$G_w$
- IF ($s_i > s_{i-1}$), keep v toggled
- ELSE keep v toggled with probability
$p = e^{(s_i - s_{i-1}) / T_i}$
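The annealing loop can be sketched as below. This is an illustrative simplification, not the published implementation: the score is supplied by the caller as a function of the active node set (the real method scores connected components of the active subgraph), the temperature follows a geometric schedule from `t_start` to `t_end`, unfavorable toggles are kept with the Metropolis probability exp((new - current) / t), and the best state seen is remembered. All names and default values are assumptions of this example.

```python
import math
import random

def anneal(nodes, score, n_iter, t_start=1.0, t_end=0.01, seed=0):
    """Toggle-based simulated annealing over subsets of nodes."""
    rng = random.Random(seed)
    nodes = list(nodes)
    # Initialize each node to active/inactive with probability 1/2.
    active = {v for v in nodes if rng.random() < 0.5}
    current = score(active)
    best, best_score = set(active), current
    for i in range(n_iter):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / max(1, n_iter - 1))
        v = rng.choice(nodes)
        active ^= {v}  # toggle the state of v
        new = score(active)
        # Keep improvements always; keep other moves with Metropolis probability.
        if new > current or rng.random() < math.exp((new - current) / t):
            current = new
            if current > best_score:
                best, best_score = set(active), current
        else:
            active ^= {v}  # undo the toggle
    return best, best_score
```

Remembering the best state is a convenience added in this sketch; it sidesteps the adverse effect of walking away from a good solution late in the run.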
71Heuristics for improved annealing
- Look for M active subnetworks simultaneously.
- M is a user-defined variable
- Maintaining multiple components can improve the
efficiency of annealing.
- Can be done by
- multiple annealing runs
- Or by
- extending the annealing approach to maintain a
graph state vector of the top M component scores.
72Galactose metabolic flow
73Results
Experiment 1: small network of 362 interactions. 2
conditions of the expression data: gal80 deletion
vs. WT. 5 significant subnetworks were found,
including 41 out of 77 significant genes.
74Score and temperature vs. number of iteration
- Temperature cooling is geometric from 1 to 0.
- N
- By the end of the run, each of the 5 subnetworks
reaches a (local) maximum.
75Evaluation of the subnetworks
Z-score distribution of the top 5 active networks.
Z-score distribution with real data
Z-score distribution with random data (scrambled
node z-scores)
76Experiment 2
Results
- Network consists of all known interactions: 7145
protein-protein interactions from BIND and 317
regulation interactions from TRANSFAC
- Expression data includes 20 perturbations to
genes in the galactose pathway.
- 7 active subnetworks found. The biggest consists
of 340 genes.
- Repeating annealing on the network above
generated 5 significant sub-sub-networks.
- All results were evaluated with methods similar
to what we have seen.
78Discussion
79Cytoscape
80Summary
- Theory of network motifs
- Definition, Algorithm
- Application to E. Coli transcription network
- The dynamic behavior of the motifs
- Finding active subnetworks
- Simulated annealing
- 2 experiments
81References
- S Shen-Orr, R Milo, S Mangan & U Alon,
Network motifs in the transcriptional regulation
network of Escherichia coli.
Nature Genetics 31:64-68 (2002).
- R Milo, S Shen-Orr, S Itzkovitz, N Kashtan, D
Chklovskii & U Alon,
Network Motifs: Simple Building Blocks of Complex
Networks.
Science 298:824-827 (2002).
- T Ideker, O Ozier, B Schwikowski & A Siegel,
Discovering regulatory and signaling circuits in
molecular interaction networks.
Bioinformatics 18:S233 (2002).
82- S Mangan & U Alon,
Structure and function of the feed-forward loop
network motif.
PNAS 100:11980-11985 (2003).
- N Kashtan, S Itzkovitz, R Milo & U Alon,
Efficient sampling algorithm for estimating
subgraph concentrations and detecting network
motifs.
Bioinformatics 20:1746-1758 (2004).
- S Kirkpatrick, C D Gelatt & M P Vecchi,
Optimization by simulated annealing.
Science 220:671-680 (1983).
83Thank you