Title: Module Networks
1. Module Networks
- Discovering Regulatory Modules and their Condition-Specific Regulators from Gene Expression Data
Cohen Jony
2. Outline
- The Problem
- Regulators
- Module Networks
- Learning Module Networks
- Results
- Conclusion
3. The Problem
- Inferring regulatory networks from gene expression data.
4. Regulators
5. Regulation types
6. Regulators example
This is an example of a regulatory module.
7. Known solution: Bayesian Networks
The problem: too many variables and too little data cause statistical noise to produce spurious dependencies, resulting in models that significantly overfit the data.
8. From Bayesian Networks to Module Networks
9. Module Networks
- We assume that we are given a domain of random variables X = {X1, ..., Xn}.
- We use Val(Xi) to denote the domain of values of the variable Xi.
- A module set C is a set of formal variables M1, ..., MK; all the variables assigned to a module share the same CPD.
- Note that all the variables in a module must have the same domain of values!
10. Module Networks
- A module network template T = (S, θ) for C defines, for each module Mj in C:
- 1) a set of parents PaMj from X;
- 2) a conditional probability template (CPT) P(Mj | PaMj), which specifies a distribution over Val(Mj) for each assignment in Val(PaMj).
- We use S to denote the dependency structure encoded by {PaMj : Mj in C} and θ to denote the parameters required for the CPTs P(Mj | PaMj), Mj in C.
11. Module Networks
- A module assignment function for C is a function A : X → {1, ..., K} such that A(Xi) = j only if Val(Xi) = Val(Mj).
- A module network is defined by both the module network template and the assignment function; together they induce a joint distribution (sketched below).
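As a sketch (following the formulation in the cited Segal et al. paper, with notation matching the slides), the template and the assignment function together induce a joint distribution in which every variable assigned to module Mj uses that module's shared CPT:

\[
P(X_1, \ldots, X_n) \;=\; \prod_{j=1}^{K} \; \prod_{X_i : A(X_i) = j} P(X_i \mid \mathrm{Pa}_{M_j})
\]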
12. Example
- In our example, we have three modules M1, M2, and M3.
- PaM1 = Ø, PaM2 = {MSFT}, and PaM3 = {AMAT, INTL}.
- We have that A(MSFT) = 1, A(MOT) = 2, A(INTL) = 2, and so on (encoded in the code sketch below).
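A minimal Python sketch (a hypothetical representation, not the authors' code) encoding this example as plain dictionaries: one parent set per module and one assignment of variables to modules.

# Hypothetical encoding of the slide's stock example; the data structures
# are illustrative, not taken from the paper's implementation.
parents = {
    1: [],                 # PaM1 = Ø
    2: ["MSFT"],           # PaM2 = {MSFT}
    3: ["AMAT", "INTL"],   # PaM3 = {AMAT, INTL}
}

# Module assignment function A: variable -> module index.
assignment = {
    "MSFT": 1,
    "MOT": 2,
    "INTL": 2,
    # ... remaining stocks
}

# Variables assigned to module 2; they all share one CPD (regulation program).
module_2_members = [x for x, j in assignment.items() if j == 2]
print(module_2_members)  # ['MOT', 'INTL']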
13. Learning Module Networks
- The iterative learning procedure searches for the model with the highest score using an Expectation-Maximization (EM) style algorithm.
- An important property of the EM algorithm is that each iteration is guaranteed to improve the score of the model, until convergence to a local maximum.
- Each iteration of the algorithm consists of two steps: an M-step and an E-step (see the loop sketch below).
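A high-level Python sketch of the alternating procedure; the helper callables (learn_regulation_program, best_module_for, score) are hypothetical placeholders, and the scoring and tree-learning details are abstracted away.

def learn_module_network(data, num_modules, init_assignment,
                         learn_regulation_program, best_module_for,
                         score, max_iters=100):
    """Alternate between an M-step (learn regulation programs) and an
    E-step (reassign genes) until the score stops improving.
    Modules are indexed 0 .. num_modules - 1."""
    assignment = dict(init_assignment)
    prev_score = float("-inf")
    for _ in range(max_iters):
        # M-step: learn the best regulation program (regression tree)
        # for each module given the current gene-to-module assignment.
        programs = {
            j: learn_regulation_program(
                data, [g for g, m in assignment.items() if m == j])
            for j in range(num_modules)
        }
        # E-step: reassign each gene to the module whose program best
        # predicts its expression profile.
        assignment = {g: best_module_for(g, data, programs) for g in assignment}
        current = score(data, programs, assignment)
        if current <= prev_score:   # converged to a local maximum
            break
        prev_score = current
    return programs, assignment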
14. Learning Module Networks cont.
M-step
- In the M-step, the procedure is given a partition of the genes into modules and learns the best regulation program (regression tree) for each module.
- The regulation program is learned via a combinatorial search over the space of trees.
- The tree is grown from the root to its leaves. At any given node, the query that best partitions the gene expression into two distinct distributions is chosen, until no such split exists (see the split-selection sketch below).
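A simplified sketch of greedy split selection for one tree node. The real procedure scores candidate splits with the Bayesian score; here a plain variance-reduction criterion stands in for it, and all names are hypothetical.

import numpy as np

def best_split(expr, regulator_values, candidate_regulators, thresholds):
    """Pick the query 'is regulator r above threshold t?' that best
    partitions the module's expression values (expr: genes x experiments)
    into two distinct groups. regulator_values[r] holds the regulator's
    value in each experiment. Returns None when no valid split exists."""
    def group_score(values):
        # Higher is better: negative total squared deviation from the mean.
        return -np.sum((values - values.mean()) ** 2) if values.size else 0.0

    base = group_score(expr.ravel())
    best = None
    for r in candidate_regulators:
        for t in thresholds:
            mask = regulator_values[r] > t
            left, right = expr[:, mask], expr[:, ~mask]
            if left.size == 0 or right.size == 0:
                continue
            gain = group_score(left.ravel()) + group_score(right.ravel()) - base
            if best is None or gain > best[0]:
                best = (gain, r, t)
    return best

Growing the tree then recurses on the two resulting subsets of experiments until best_split finds no further useful split.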
15. Learning Module Networks cont.
E-step
- In the E-step, given the inferred regulation programs, we determine the module whose associated regulation program best predicts each gene's behavior.
- We compute the probability of a gene's measured expression values in the dataset under each regulation program, obtaining an overall probability that this gene's expression profile was generated by that regulation program.
- We then select the module whose program gives the gene's expression profile the highest probability, and re-assign the gene to this module (see the sketch after this list).
- We take care not to assign a regulator gene to a module in which it is also a regulatory input.
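A minimal sketch of the reassignment rule; the interface is hypothetical (log_likelihood(program, profile) is assumed to return the log-probability of a gene's expression profile under a module's regulation program).

def reassign_gene(gene, profile, programs, regulators_of, log_likelihood):
    """Return the module whose regulation program gives this gene's
    expression profile the highest probability, skipping modules that use
    the gene itself as a regulatory input. All arguments other than gene
    and profile are hypothetical placeholders."""
    best_module, best_ll = None, float("-inf")
    for j, program in programs.items():
        if gene in regulators_of(j):  # never assign a regulator to a module it regulates
            continue
        ll = log_likelihood(program, profile)
        if ll > best_ll:
            best_module, best_ll = j, ll
    return best_module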
16. Bayesian score
- When the priors satisfy the assumptions described below, the Bayesian score decomposes into local module scores (see the reconstruction below).
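A reconstruction in the spirit of the cited Segal et al. paper (the exact notation may differ): the score of a module network is a sum of per-module terms, each depending only on that module's parents and on the variables assigned to it.

\[
\text{score}(S, A : D) \;=\; \sum_{j=1}^{K} \text{score}_j\big(\mathrm{Pa}_{M_j},\, A^{-1}(j) : D\big)
\]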
17. Bayesian score cont.
- Where Lj(U, X, θMj : D) is the likelihood function.
- Where P(θMj | Sj = U) is the parameter prior.
- Where Sj = U denotes that we chose a structure in which U are the parents of module Mj.
- Where Aj = X denotes that A assigns exactly the set of variables X to module Mj (the local score combining these terms is sketched below).
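A hedged reconstruction of the local module score that these terms enter, following the cited paper; the exact form, in particular how the structure and assignment priors are written, may differ from the original slide.

\[
\text{score}_j(U, X : D) \;=\; \log \int L_j(U, X, \theta_{M_j} : D)\, P(\theta_{M_j} \mid S_j = U)\, d\theta_{M_j} \;+\; \log P(S_j = U) \;+\; \log P(A_j = X)
\]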
18. Assumptions
- Let P(A), P(S | A), and P(θ | S, A) be the assignment, structure, and parameter priors.
- P(θ | S, A) satisfies parameter independence if it decomposes as a product of independent priors, one per module.
- P(θ | S, A) satisfies parameter modularity if the prior over module Mj's parameters is the same for all structures S1 and S2 in which Mj has the same set of parents (both conditions are sketched below).
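A reconstruction of the two conditions following the cited paper (notation here is mine):

\[
\text{Parameter independence:}\quad P(\theta \mid S, A) \;=\; \prod_{j=1}^{K} P\big(\theta_{M_j \mid \mathrm{Pa}_{M_j}} \mid S, A\big)
\]
\[
\text{Parameter modularity:}\quad P\big(\theta_{M_j \mid \mathrm{Pa}_{M_j}} \mid S_1, A\big) \;=\; P\big(\theta_{M_j \mid \mathrm{Pa}_{M_j}} \mid S_2, A\big) \quad \text{for all } S_1, S_2 \text{ with } \mathrm{Pa}^{S_1}_{M_j} = \mathrm{Pa}^{S_2}_{M_j}
\]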
19. Assumptions
- P(θ, S | A) satisfies assignment independence if P(θ | S, A) = P(θ | S) and P(S | A) = P(S).
- P(S) satisfies structure modularity if it is a product of local terms ρj(Sj), where Sj denotes the choice of parents for module Mj and ρj is a distribution over the possible parent sets for module Mj.
- P(A) satisfies assignment modularity if it is proportional to a product of local terms αj(Aj), where Aj is the choice of variables assigned to module Mj and {αj : j = 1, ..., K} is a family of functions from 2^X to the positive reals (see the sketch below).
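The corresponding formulas, again reconstructed following the cited paper and using the ρj and αj named on this slide:

\[
\text{Structure modularity:}\quad P(S) \;=\; \prod_{j=1}^{K} \rho_j(S_j)
\qquad
\text{Assignment modularity:}\quad P(A) \;\propto\; \prod_{j=1}^{K} \alpha_j(A_j)
\]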
20. Assumptions - Explanations
- Parameter independence, parameter modularity, and structure modularity are the natural analogues of standard assumptions in Bayesian network learning.
- Parameter independence implies that P(θ | S, A) is a product of terms that parallels the decomposition of the likelihood, with one prior term per local likelihood term Lj.
- Parameter modularity states that the prior for the parameters of a module Mj depends only on the choice of parents for Mj and not on other aspects of the structure.
- Structure modularity implies that the prior over the structure S is a product of terms, one per module.
21. Assumptions - Explanations
- The remaining two assumptions, assignment independence and assignment modularity, are new to module networks.
- Assignment independence makes the priors on the parents and parameters of a module independent of the exact set of variables assigned to the module.
- Assignment modularity implies that the prior on A is proportional to a product of local terms, one corresponding to each module.
- Thus, the reassignment of one variable from module Mi to another module Mj does not change our preferences on the assignment of variables in modules other than i and j.
22. Experiments
- The network learning procedure was evaluated on synthetic data, gene expression data, and stock market data.
- The data consisted solely of continuous values. As all of the variables have the same domain, the definition of the module set reduces to a specification of the total number of modules.
- Beam search was used as the search algorithm, with a lookahead of three splits to evaluate each operator.
- For comparison, Bayesian networks were learned with precisely the same structure learning algorithm, simply treating each variable as its own module.
23. Synthetic data
- The synthetic data was generated by a known module network.
- The generating model had 10 modules and a total of 35 variables that were a parent of some module. From the learned module network, 500 variables were selected, including the 35 parents.
- The procedure was run on training sets of various sizes, ranging from 25 to 500 instances, each repeated 10 times with different training sets.
24. Synthetic data - results
- Generalization to unseen test data was measured as the likelihood ascribed by the learned model to 4500 unseen instances.
- As expected, models learned with larger training sets do better, but when run with the correct number of 10 modules, the gain from increasing the number of data instances beyond 100 samples is small.
- Models learned with a larger number of modules had a wider spread of variable-to-module assignments and consequently achieved poorer performance.
25. Synthetic data results cont.
- Log-likelihood per instance assigned to held-out
data.
- For all training set sizes, except 25, the model
with 10 modules performs the best.
26. Synthetic data results cont.
- Fraction of variables assigned to the largest 10 modules.
- Models learned using 100, 200, or 500 instances and up to 50 modules assigned 80% of the variables to 10 modules.
27. Synthetic data results cont.
- Average percentage of correct parent-child relationships recovered.
- The total number of parent-child relationships in the generating model was 2250.
- The procedure recovers 74% of the true relationships when learning from a dataset of 500 instances.
28. Synthetic data results cont.
- As the variables begin fragmenting over a large number of modules, the learned structure contains many spurious relationships.
- Thus, in domains with a modular structure, statistical noise is likely to prevent overly detailed learned models such as Bayesian networks from extracting the commonality between different variables with shared behavior.
29. Gene Expression Data
- Expression data measuring the response of yeast to different stress conditions was used.
- The data consists of 6157 genes and 173 experiments.
- 2355 genes that varied significantly in the data were selected, and a module network was learned over these genes.
- A Bayesian network was also learned over this data set.
30. Candidate regulators
- A set of 466 candidate regulators was compiled from SGD and YPD.
- It includes both transcription factors and signaling proteins that may have transcriptional impact.
- It also includes genes described as similar to such regulators.
- Global regulators, whose regulation is not specific to a small set of genes or a single process, were excluded.
31. Gene Expression results
- The figure demonstrates that module networks generalize to unseen data much better than Bayesian networks for almost all choices of the number of modules.
32. Biological validity
- The biological validity of the learned module network with 50 modules was tested.
- The enriched annotations reflect the key biological processes expected in our dataset.
- For example, the "protein folding" module contains 10 genes, 7 of which are annotated as protein folding genes. In the whole data set, there are only 26 genes with this annotation. Thus, the p-value of this annotation, that is, the probability of choosing 7 or more genes in this category when choosing 10 random genes, is less than 10^-12 (see the calculation sketch below).
- 42 modules out of 50 had at least one significantly enriched annotation with a p-value less than 0.005.
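A minimal sketch of this enrichment p-value as a hypergeometric tail probability. The background population is assumed here to be the 2355 selected genes; the slide does not state which background was used, and a larger background (e.g. all 6157 genes) would give an even smaller p-value.

from scipy.stats import hypergeom

background = 2355   # assumption: background = the 2355 selected genes
annotated = 26      # genes annotated as protein folding
module_size = 10    # genes in the module
hits = 7            # annotated genes in the module

# P(X >= 7) when drawing 10 genes at random without replacement.
p_value = hypergeom.sf(hits - 1, background, annotated, module_size)
print(p_value)  # on the order of 1e-12, consistent with the slide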
33. Biological validity cont.
- The enrichment of both the HAP4 motif and STRE, recognized by Hap4 and Msn4 respectively, supports their inclusion in the module's regulation program.
- Lines represent 500 bp of genomic sequence located upstream of the start codon of each of the genes; colored boxes represent the presence of cis-regulatory motifs located in these regions.
34. Stock Market Data
- NASDAQ stock prices for 2143 companies, covering 273 trading days.
- Stock → variable, instance → trading day.
- The value of the variable is the log of the ratio between that day's and the previous day's closing stock price (see the snippet below).
- As potential controllers, the 250 of the 2143 stocks whose average trading volume was the largest across the dataset were selected.
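An illustrative computation of the variable values described above (hypothetical prices, not data from the paper):

import numpy as np

closing_prices = np.array([25.10, 25.60, 24.90, 25.05])  # hypothetical closing prices
# Log of the ratio between each day's and the previous day's closing price.
log_returns = np.log(closing_prices[1:] / closing_prices[:-1])
print(log_returns)  # one value per trading day, after the first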
35. Stock Market Data
- Cross-validation is used to evaluate the generalization ability of the different models.
- Module networks perform significantly better than Bayesian networks in this domain.
36. Stock Market Data
- Module networks were compared with AutoClass.
- Significant enrichment for 21 annotations, covering a wide variety of sectors, was found.
- In 20 of the 21 cases, the enrichment was far more significant in the modules learned using module networks than in those learned by AutoClass.
37. Conclusions
- The results show that learned module networks have much higher generalization performance than a Bayesian network learned from the same data.
- Parameter sharing between variables in the same module allows each parameter to be estimated from a much larger sample; this lets us learn dependencies that would be considered too weak based on the statistics of single variables (these are well-known advantages of parameter sharing).
- An interesting aspect of the method is that it determines automatically which variables have shared parameters.
38. Conclusions
- The assumption of shared structure significantly restricts the space of possible dependency structures, allowing us to learn more robust models than those learned in a classical Bayesian network setting.
- In a module network, a spurious correlation would have to arise between a possible parent and a large number of other variables before the algorithm would introduce the dependency.
39. Overview of Module Networks
40. Literature
- Reference: Discovering Regulatory Modules and their Condition-Specific Regulators from Gene Expression Data, by Eran Segal, Michal Shapira, Aviv Regev, Dana Pe'er, David Botstein, Daphne Koller, and Nir Friedman.
- Bibliography:
- P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass: a Bayesian classification system. In ML '88, 1988.
41. THE END