Title: Genetic Networks
1Genetic Networks
2Cellular Networks
- Most processes in the cell are controlled by networks of interacting molecules
- Metabolic Networks
- Signal Transduction Networks
- Regulatory Networks
3Unifying View
- The cell as a state machine
- Cell state $S = (P_1, P_2, \ldots, R_1, R_2, \ldots, m_1, m_2, \ldots)$
- P: proteins, R: mRNA molecules, m: metabolites
- Each cell, at any given time, can be characterized by its state S
- Dynamics
- $(\mathrm{Input}(t), S(t)) \rightarrow S(t + \Delta t)$
4What does it mean?
- Steady cell state = cell type
- Neuron
- RBC
- muscle cell
- Tumor cell
- Dynamics = cellular process
- Differentiation
- Apoptosis
- Cell Cycle
5Gene Regulation Networks
- Regulation of expression of genes is crucial
- Regulation occurs at many stages
- pre-transcriptional (chromatin structure)
- transcription initiation
- RNA editing (splicing) and transport
- Translation initiation
- Post-translation modification
- RNA and protein degradation
- Understanding regulatory processes is a central
problem of biological research
6Genetic Network Models: Goals
- Incorporate rule-based dependencies between genes
- Rule-based dependencies may constitute important biological information.
- Allow systematic study of global network dynamics
- In particular, individual gene effects on long-run network behavior.
- Must be able to cope with uncertainty
- Small sample size, noisy measurements, biological noise
- Quantify the relative influence and sensitivity of genes in their interactions with other genes
- This allows us to focus on individual genes or groups of genes.
- What model should we use?
7Level of Biochemical Detail
- Detailed models require lots of data!
- Highly detailed biochemical models are only feasible for very small systems which are extensively studied
- Example: Arkin et al. (1998), Genetics 149(4):1633-48
- lysis-lysogeny switch in Lambda phage
- 5 genes, 67 parameters based on 50 years of research
- stochastic simulation required a supercomputer!
8Example: Lysis-Lysogeny
Arkin et al. (1998), Genetics 149(4):1633-48
9Level of Biochemical Detail
- In-depth biochemical simulation of e.g. a whole cell is infeasible (so far)
- Less detailed network models are useful when data is scarce and/or network structure is unknown
- Once network structure has been determined, we can refine the model
10Boolean or Continuous?
- Boolean Networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states.
- Allows analysis at the network level
- Provides useful insights into network dynamics
- Algorithms exist for network inference from binary data
11Boolean Formalism: Cons
- Boolean abstraction is a poor fit to real data
- Cannot model important concepts
- amplification of a signal
- subtraction and addition of signals
- compensating for a smoothly varying environmental parameter (e.g. temperature, nutrients)
- varying dynamical behavior (e.g. cell cycle period)
- Feedback control
- negative feedback is used to stabilize expression
- but it causes oscillation in a Boolean model
12Boolean Formalism: Pros
- Studies give rise to qualitative phenomena, as observed by experimentalists.
- Some studied systems exhibit multiple steady states and switchlike transitions between them.
- It is experimentally shown that such systems are robust to exact values of kinetic parameters of individual reactions.
13Concentrations or Molecules?
- Use of concentrations assumes individual molecules can be ignored
- Known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in lambda)
- Requires stochastic simulation (Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5)
- Significantly increases model complexity
14Concentrations or Molecules?
- Eukaryotes: larger cell volume, typically longer half-lives. Few known stochastic effects.
- Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell (Holstege et al. (1998), Cell 95:717-728).
- Human: 95% of the transcriptome is expressed at <5 copies/cell (Velculescu et al. (1997), Cell 88:243-251).
15Spatial or Non-Spatial
- Spatiality introduces additional complexity
- intercellular interactions
- spatial differentiation
- cell compartments
- cell types
- Spatial patterns also provide more data
- e.g. stripe formation in Drosophila
- Mjolsness et al. (1991), J. Theor. Biol. 152:429-454.
- Few (no?) large-scale spatial gene expression data sets available so far.
16Example: Drosophila Segmentation
[Figure: expression of transcription factors in the embryo along the anterior-posterior axis. Expression levels (low to high) of hb, Kr, gt, and bcd are shown together with eve (even-skipped) stripe 2 expression.]
17Deterministic or Stochastic?
- Many sources of stochasticity
- Biological stochasticity
- Experimental noise
- Stochastic models can account for those
- Deterministic models are usually simpler to
analyze (dynamics, steady states) and interpret
18Modeling Approaches
- Boolean Networks
- Linear Models
- Bayesian Networks
19Boolean Network
20What is a Boolean Network?
- A Boolean network is a kind of graph
- G(V, F): V is a set of nodes (genes), F is a list of Boolean functions
- Every node has only two values: ON (1) and OFF (0)
- Every function gives the resulting value of one node
- Representations: standard, wiring, automaton
21What is a Boolean Network?
- Attractor: certain states are revisited infinitely often, depending on the initial starting state.
- Basin of attraction
- Limit-cycle attractor
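The three terms above can be made concrete with a small sketch that follows every start state of a Boolean network until a state repeats. The two 2-gene networks (`toggle` and `latch`) and the helper name `find_attractors` are hypothetical illustrations, not taken from the slides; synchronous updating of all genes is assumed.

```python
from itertools import product

def find_attractors(update, n):
    """Map each attractor (states in cycle order) to its basin of attraction."""
    basins = {}
    for start in product((0, 1), repeat=n):
        # Follow the trajectory until a state repeats.
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = update(state)
        cycle = seen[seen.index(state):]
        # Rotate the cycle so its smallest state comes first (canonical form).
        k = cycle.index(min(cycle))
        key = tuple(cycle[k:] + cycle[:k])
        basins.setdefault(key, []).append(start)
    return basins

# Hypothetical network 1: A' = NOT B, B' = A (one limit cycle of length 4).
def toggle(state):
    a, b = state
    return (1 - b, a)

# Hypothetical network 2: A' = A OR B, B' = A AND B (three fixed points).
def latch(state):
    a, b = state
    return (a | b, a & b)

for name, rule in (("toggle", toggle), ("latch", latch)):
    for attractor, basin in find_attractors(rule, 2).items():
        kind = "fixed point" if len(attractor) == 1 else "limit cycle"
        print(name, kind, attractor, "basin:", basin)
```

Exhaustive enumeration like this only works for small n, since the state space has 2^n states.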
22Boolean Network Example
Nodes (genes)
23Boolean Network Example
Nodes (genes)
24Basic Structure of Boolean Networks
- Each node is a gene
- 1 means active/expressed
- 0 means inactive/unexpressed
A, B -> X
Boolean function:
A B | X
0 0 | 1
0 1 | 1
1 0 | 0
1 1 | 1
In this example, two genes (A and B) regulate
gene X. In principle, any number of input genes
are possible. Positive/negative feedback is also
common (and necessary for homeostasis).
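A Boolean function of this kind can be stored directly as a lookup table. The sketch below reads the slide's truth table as rows (A, B, X) = (0,0,1), (0,1,1), (1,0,0), (1,1,1); the equivalent formula X = (NOT A) OR B is derived from those rows rather than stated on the slide, and the names `X_TABLE`, `next_x`, and `as_formula` are illustrative.

```python
# Truth table for gene X, keyed by the states of its regulators (A, B).
X_TABLE = {
    (0, 0): 1,
    (0, 1): 1,
    (1, 0): 0,
    (1, 1): 1,
}

def next_x(a, b):
    """Look up the state of X given the states of A and B."""
    return X_TABLE[(a, b)]

def as_formula(a, b):
    """The same function written as a logic expression: X = (NOT A) OR B."""
    return int((not a) or b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, next_x(a, b), as_formula(a, b))
```

The table form generalizes to any number of inputs: a gene with k regulators needs a table with 2^k rows.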
25Dynamics of Boolean Networks
Genes: A B C D E F
State at one time point: 0 1 1 0 0 1
At a given time point, all the genes form a
genome-wide gene activity pattern (GAP) (binary
string of length n ). Consider the state space
formed by all possible GAPs.
26State Space of Boolean Networks
- Similar GAPs lie close together.
- There is an inherent directionality in the state space.
- Some states are attractors (or limit-cycle attractors). The system may alternate between several attractors.
- Other states are transient.
Picture generated using the program DDLab.
27Reverse Engineering Problem
Can we infer the structure and rules of a genetic
network from gene expression measurements?
28Reverse Engineering Problem
- Input: gene expression data
- Output: network structure and parameters (or regulation rules)
29Gene Expression Time Series Data
gene 1 gene 2 gene 3
Problem: how can these data be used to infer how these three genes influence each other?
30Modelling Gene Expression Data
gene 1 gene 2 gene 3
Assume that genes exist in two states, on and off: if the expression of gene i is above level $t_i$, consider it on; otherwise, consider it off.
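The thresholding rule can be sketched in a few lines. The expression values and thresholds below are made up for illustration (the slide's data are not reproduced here), and `discretize` is a hypothetical helper name.

```python
def discretize(series, threshold):
    """Return 1 (on) where expression is above the threshold, else 0 (off)."""
    return [1 if value > threshold else 0 for value in series]

# Hypothetical expression time series and per-gene thresholds t_i.
expression = {
    "gene1": [0.2, 0.9, 1.4, 1.1, 0.3],
    "gene2": [2.0, 1.8, 0.4, 0.2, 0.1],
    "gene3": [0.5, 0.5, 0.7, 1.2, 1.3],
}
thresholds = {"gene1": 0.8, "gene2": 1.0, "gene3": 1.0}

bits = {gene: discretize(values, thresholds[gene])
        for gene, values in expression.items()}
for gene, stream in bits.items():
    print(gene, stream)
```

Each gene ends up as a bit stream of the same length as its time series, which is the form the later entropy calculations expect.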
31Modelling Gene Expression Data
[Figure: expression time series for gene 1, gene 2, and gene 3, with the thresholds t1, t2, and t3 marked.]
32Modelling Gene Expression Data
[Figure: the same time series for gene 1, gene 2, and gene 3, with each measurement labeled on (above its threshold t1, t2, or t3) or off (below it).]
33Modelling Gene Expression Data
- we obtain the following discretized gene
expression data
- the gene expression data is now in the form of
bit streams
34Information Theoretic Tools
- we define some necessary information theoretic tools
- Shannon entropy of a data stream
- $H(X) = -\sum_i p_i \log(p_i)$
- where $p_i$ is the probability that a random element of data stream X is i
- (the base of the logarithm can be anything, but must be consistent throughout; usually we use base 2)
35Information Theoretic Tools
- e.g. Shannon entropy of data streams X and Y
- X = 0, 1, 1, 1, 1, 1, 1, 0, 0, 0
- Y = 0, 0, 0, 1, 1, 0, 0, 1, 1, 1
- $H(X) = -\sum_i p_i \log_2(p_i)$
- $= -(p_{X=0} \log_2(p_{X=0}) + p_{X=1} \log_2(p_{X=1}))$
- $= -(0.4 \log_2(0.4) + 0.6 \log_2(0.6))$
- $= 0.971$
- $H(Y) = -\sum_i p_i \log_2(p_i)$
- $= -(0.5 \log_2(0.5) + 0.5 \log_2(0.5))$
- $= 1.0$
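The two entropies above can be checked numerically. This is a minimal sketch using base-2 logarithms and empirical frequencies as the probabilities; the function name `entropy` is an illustrative choice.

```python
from collections import Counter
from math import log2

def entropy(stream):
    """Shannon entropy (in bits) of a data stream, using empirical frequencies."""
    n = len(stream)
    return -sum((count / n) * log2(count / n)
                for count in Counter(stream).values())

X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

print(round(entropy(X), 3))  # 0.971
print(round(entropy(Y), 3))  # 1.0
```

Symbols that never occur simply do not appear in the counter, which matches the convention that 0 log 0 contributes nothing.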
36Information Theoretic Tools
- e.g. Shannon joint entropy of data streams X and Y
- X = 0, 1, 1, 1, 1, 1, 1, 0, 0, 0
- Y = 0, 0, 0, 1, 1, 0, 0, 1, 1, 1
- $H(X, Y) = -\sum_{x,y} p_{X=x,Y=y} \log_2(p_{X=x,Y=y})$
- $= -(p_{X=0,Y=0} \log_2(p_{X=0,Y=0}) + p_{X=1,Y=0} \log_2(p_{X=1,Y=0}) + p_{X=0,Y=1} \log_2(p_{X=0,Y=1}) + p_{X=1,Y=1} \log_2(p_{X=1,Y=1}))$
- $= -(0.1 \log_2(0.1) + 0.4 \log_2(0.4) + 0.3 \log_2(0.3) + 0.2 \log_2(0.2))$
- $= 1.85$
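The joint probabilities used above come from counting how often each (X, Y) pair occurs when the two streams are read in parallel. A minimal sketch, with `joint_entropy` as an illustrative name:

```python
from collections import Counter
from math import log2

def joint_entropy(*streams):
    """Entropy (in bits) of the tuples formed by reading the streams in parallel."""
    rows = list(zip(*streams))
    n = len(rows)
    return -sum((count / n) * log2(count / n)
                for count in Counter(rows).values())

X = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

pair_counts = Counter(zip(X, Y))
print(dict(pair_counts))
print(round(joint_entropy(X, Y), 2))  # 1.85
```

Because the function accepts any number of streams, the same code also covers three-way joint entropies.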
37Information Theoretic Tools
- Define
- Conditional Entropy
- $H(X|Y) = H(X, Y) - H(Y)$
- $H(Y|X) = H(X, Y) - H(X)$
- Mutual Information
- $M(X, Y) = H(Y) - H(Y|X)$
- $= H(X) - H(X|Y)$
- $= H(X) + H(Y) - H(X, Y)$
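As a worked check, the chain rule shows why the three forms of the mutual information agree, and the numbers from the X and Y streams of the earlier example can be plugged in. The values are rounded to three decimals, with $H(X, Y) \approx 1.846$ being the unrounded form of the 1.85 quoted earlier.

```latex
\begin{align*}
H(X, Y) &= H(Y) + H(X \mid Y) = H(X) + H(Y \mid X) \\
M(X, Y) &= H(Y) - H(Y \mid X)
         = H(Y) - \bigl(H(X, Y) - H(X)\bigr)
         = H(X) + H(Y) - H(X, Y) \\[4pt]
H(X \mid Y) &\approx 1.846 - 1.0 = 0.846 \\
H(Y \mid X) &\approx 1.846 - 0.971 = 0.875 \\
M(X, Y) &\approx 0.971 + 1.0 - 1.846 = 0.125
\end{align*}
```

The small mutual information indicates that, in this example, knowing X says little about Y.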
38Information Theoretic Tools
- It is easy to show that
- Let X be an input data stream
- and Y be an output data stream
- If $M(Y, X) = H(Y)$
- then X exactly determines Y
- Look for pairs (x, y) where $M(Y_{t+1}, X_t) = H(Y_{t+1})$
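The criterion can be tested directly on bit streams by pairing each input value at time t with the output value at time t+1. In the sketch below the streams `x`, `y`, and `z` are hypothetical, and a small tolerance stands in for exact equality because the entropies are floating-point numbers.

```python
from collections import Counter
from math import log2

def entropy(stream):
    n = len(stream)
    return -sum((c / n) * log2(c / n) for c in Counter(stream).values())

def mutual_information(x, y):
    # M(X, Y) = H(X) + H(Y) - H(X, Y)
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def determines_next(x, y, tol=1e-9):
    """True if x at time t determines y at time t+1: M(Y_{t+1}, X_t) = H(Y_{t+1})."""
    x_t, y_next = x[:-1], y[1:]
    return abs(mutual_information(x_t, y_next) - entropy(y_next)) < tol

# Hypothetical streams: y at t+1 is the negation of x at t; z is not a function of x.
x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 1, 0, 0, 1, 0, 1, 1]
z = [1, 1, 0, 1, 0, 0, 1, 0]

print(determines_next(x, y))  # True
print(determines_next(x, z))  # False
```

With short or noisy time series the equality will rarely hold exactly, so the tolerance becomes a modeling choice rather than a numerical detail.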
39Identification of the Network Graph
- step 1: put data in state transition table form
41Identification of the Network Graph
- the state transition table tells us how to get from state $i-1$ to state $i$, as a lookup table
- however, it is difficult to discern functional relationships, so
- step 2: use information theoretic tools to discover which inputs determine the outputs
42Identification of the Network Graph
- step 2a: calculate entropies
Note: $\lim_{x \to 0^+} x^x = 1$, therefore in the limit $0 \log(0) = 0$.
$H(A_i) = -(0.25 \log(0.25) + 0.75 \log(0.75)) = 0.81$
$H(B_i) = -(0.75 \log(0.75) + 0.25 \log(0.25)) = 0.81$
$H(C_i) = -(0.5 \log(0.5) + 0.5 \log(0.5)) = 1$
$H(A_{i-1}) = H(B_{i-1}) = H(C_{i-1}) = -(0.5 \log(0.5) + 0.5 \log(0.5)) = 1$
$H(A_{i-1}, C_{i-1}) = -(0.25 \log(0.25) + 0.25 \log(0.25) + 0.25 \log(0.25) + 0.25 \log(0.25)) = 2$
43Identification of the Network Graph
- step 2a: calculate entropies
$H(A_i, A_{i-1}, C_{i-1}) = -(0.25 \log(0.25) + 0.25 \log(0.25) + 0.25 \log(0.25) + 0.25 \log(0.25)) = 2$
$H(B_i, A_{i-1}, C_{i-1}) = -(0.25 \log(0.25) + 0.25 \log(0.25) + 0.25 \log(0.25) + 0.25 \log(0.25)) = 2$
$H(C_i, A_{i-1}) = -(0.5 \log(0.5) + 0.5 \log(0.5)) = 1$
44Identification of the Network Graph
- step 2b: calculate mutual information
$M(A_i, [A_{i-1}, C_{i-1}]) = H(A_i) + H(A_{i-1}, C_{i-1}) - H(A_i, A_{i-1}, C_{i-1}) = 0.81 + 2 - 2 = 0.81 = H(A_i)$, therefore $A_{i-1}$ and $C_{i-1}$ determine $A_i$
$M(B_i, [A_{i-1}, C_{i-1}]) = H(B_i) + H(A_{i-1}, C_{i-1}) - H(B_i, A_{i-1}, C_{i-1}) = 0.81 + 2 - 2 = 0.81 = H(B_i)$, therefore $A_{i-1}$ and $C_{i-1}$ determine $B_i$
$M(C_i, A_{i-1}) = H(C_i) + H(A_{i-1}) - H(C_i, A_{i-1}) = 1 + 1 - 1 = 1 = H(C_i)$, therefore $A_{i-1}$ determines $C_i$
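Steps 2a and 2b can be automated by trying input sets of increasing size until the mutual information equals the output entropy. The sketch below assumes the transition table lists each of the 8 states of (A, B, C) exactly once, which is consistent with the entropies quoted on the slides, and it generates the next states from the three rules given in step 3. The function names and the tolerance are illustrative choices.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(rows):
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

# Assumed transition table: every state (A, B, C) at step i-1 and its successor.
prev_states = list(product((0, 1), repeat=3))
next_states = [(a | c, a & c, 1 - a) for a, b, c in prev_states]
genes = "ABC"

def mutual_information(out_idx, in_idxs):
    """M(output_i, [inputs_{i-1}]) = H(out) + H(in) - H(out, in)."""
    outputs = [s[out_idx] for s in next_states]
    inputs = [tuple(s[j] for j in in_idxs) for s in prev_states]
    joint = [(o,) + i for o, i in zip(outputs, inputs)]
    return entropy(outputs) + entropy(inputs) - entropy(joint)

def smallest_determining_inputs(out_idx, tol=1e-9):
    """Smallest input set whose mutual information equals the output entropy."""
    h_out = entropy([s[out_idx] for s in next_states])
    for k in range(1, len(genes) + 1):
        for in_idxs in combinations(range(len(genes)), k):
            if abs(mutual_information(out_idx, in_idxs) - h_out) < tol:
                return tuple(genes[j] for j in in_idxs)
    return None

wiring = {g: smallest_determining_inputs(i) for i, g in enumerate(genes)}
print(wiring)
```

Searching small input sets first returns the sparsest wiring that explains the data, but the number of candidate sets grows quickly with the number of genes and the allowed number of inputs.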
45Identification of the Boolean Circuits
- step 3: determine the functional relationship between variables (this is simply the truth table)
$A_i = A_{i-1}$ OR $C_{i-1}$
46Identification of the Boolean Circuits
- step 3: determine the functional relationship between variables
$B_i = A_{i-1}$ AND $C_{i-1}$
47Identification of the Boolean Circuits
- step 3: determine the functional relationship between variables
$C_i$ = NOT $A_{i-1}$
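Once the inputs of a gene are known, its rule is read off by projecting the state transition table onto those inputs. The table below is an assumption: it lists all eight states once, with successors consistent with the three rules on these slides. `truth_table` is a hypothetical helper that also reports when the chosen inputs do not determine the output.

```python
# Assumed state transition table: (A, B, C) at step i-1 -> (A, B, C) at step i.
TRANSITIONS = {
    (0, 0, 0): (0, 0, 1),
    (0, 0, 1): (1, 0, 1),
    (0, 1, 0): (0, 0, 1),
    (0, 1, 1): (1, 0, 1),
    (1, 0, 0): (1, 0, 0),
    (1, 0, 1): (1, 1, 0),
    (1, 1, 0): (1, 0, 0),
    (1, 1, 1): (1, 1, 0),
}
GENES = "ABC"

def truth_table(output, inputs):
    """Project the transition table onto the given input genes."""
    out_idx = GENES.index(output)
    in_idxs = [GENES.index(g) for g in inputs]
    table = {}
    for prev, nxt in TRANSITIONS.items():
        key = tuple(prev[j] for j in in_idxs)
        value = nxt[out_idx]
        # A conflicting entry means these inputs do not determine the output.
        if table.setdefault(key, value) != value:
            raise ValueError(f"{inputs} do not determine {output}")
    return table

table_a = truth_table("A", "AC")  # matches OR
table_b = truth_table("B", "AC")  # matches AND
table_c = truth_table("C", "A")   # matches NOT
print(table_a, table_b, table_c, sep="\n")
```

Naming the resulting table as OR, AND, or NOT is then a matter of comparing it with the known two-input and one-input Boolean functions.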
48Problems With This Approach
- no theory exists for determining the discretization level $t_i$
- the assumption that genes can be modeled as either on or off may be sufficient for some genes, but will certainly not be sufficient for all genes
- Ignores noise of all kinds (experimental, biological)
49Boolean networks are inherently deterministic
- Conceptually, the regularity of genetic function and interaction is not due to hard-wired logical rules, but rather to the intrinsic self-organizing stability of the dynamical system.
- Additionally, we may want to model an open system with inputs (stimuli) that affect the dynamics of the network.
- From an empirical viewpoint, the assumption of
only one logical rule per gene may lead to
incorrect conclusions when inferring these rules
from gene expression measurements, as the latter
are typically noisy and the number of samples is
small relative to the number of parameters to be
inferred.
50Linear Models
- Basic model: weighted sum of inputs
- Simple network representation
- Only first-order approximation
- Parameters of the model: weight matrix containing N×N interaction weights
- Fitting the model: find the parameters $w_{ji}$, $b_i$ such that the model best fits the available data
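One common way to write such a weighted-sum model is sketched below. The discrete-time form, the symbol $x_i(t)$ for the expression level of gene $i$ at time $t$, and the least-squares fitting criterion are assumptions made for illustration; the slides only name the weights $w_{ji}$ and the bias terms $b_i$.

```latex
\begin{align*}
x_i(t+1) &= \sum_{j=1}^{N} w_{ji}\, x_j(t) + b_i, \qquad i = 1, \ldots, N \\
(\hat{w}, \hat{b}) &= \arg\min_{w,\, b} \sum_{t} \sum_{i=1}^{N}
  \Bigl( x_i(t+1) - \sum_{j=1}^{N} w_{ji}\, x_j(t) - b_i \Bigr)^{2}
\end{align*}
```

Under this form each gene contributes N weights and one bias, so the full model has N(N+1) free parameters.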
51Underdetermined problem!
- Assumes fully connected network: need at least as many data points (arrays, conditions) as variables (genes)!
- Underdetermined (underconstrained, ill-posed) model: we have many more parameters than data values to fit
- No single solution, rather an infinite number of parameter settings that will all fit the data equally well
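A tiny numeric sketch shows what "infinite number of parameter settings" means in practice. The data are hypothetical: one target gene, three candidate input genes, and only two measurements, with no bias term for simplicity.

```python
def predict(weights, inputs):
    """Weighted sum of the input expression levels."""
    return sum(w * x for w, x in zip(weights, inputs))

# Hypothetical data: 2 measurements of 3 input genes, and the target gene's level.
inputs = [[1.0, 0.0, 1.0],
          [0.0, 1.0, 1.0]]
targets = [1.0, 1.0]

def fits(weights, tol=1e-12):
    """True if the weights reproduce every measurement."""
    return all(abs(predict(weights, x) - y) < tol
               for x, y in zip(inputs, targets))

def family(s):
    """One-parameter family of weight vectors; every member fits the data."""
    return [1.0 - s, 1.0 - s, s]

for s in (0.0, 0.25, 0.5, 1.0, -3.0):
    print(family(s), fits(family(s)))
```

The solutions range from "genes 1 and 2 regulate the target" (s = 0) to "only gene 3 regulates the target" (s = 1), and the data cannot distinguish between them.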
52Solution 1: reduce N
- Rather than trying to model all genes, we can reduce the dimensionality of the problem
- Network of clusters: construct a linear model based on the cluster centroids
- rat CNS data (4 clusters): Wahde and Hertz (2000), Biosystems 55(1-3):129-136.
- yeast cell cycle (15-18 clusters): Mjolsness et al. (2000), NIPS 12; van Someren et al. (2000), ISMB 2000, 355-366.
- Network of Principal Components: linear model between characteristic modes of the data
- Holter et al. (2001), PNAS 98(4):1693-1698.
53Solution 2
- Take advantage of additional information
- replicates
- accuracy of measurements
- smoothness of time series
-
- Most likely, the network will still be poorly constrained.
- Need a method to identify and extract those parts of the model that are well-determined and robust
54Danger of Overfitting
- The linear model assumes every gene is regulated by all other genes (i.e. full connectivity)
- This is the richest model of its kind
- Danger of overfitting the training data
- Will result in poor prediction on new data
- Far from reality: each gene has only a few regulators