Title: Promoters and regulatory elements analysis
1. Promoters and regulatory elements analysis
- Biological introduction to promoters and regulatory elements
- Experimental procedures to characterize regulatory elements
- Biophysics of DNA-protein binding sites
- Word search algorithms
- Expectation-Maximization algorithms
- Gibbs sampling
- Evaluation of motif discovery algorithms
- Motif discovery programs
- Introduction to the exercise
2. Transcription regulation by transcription factors (prokaryotes)
[Figure: DNA with a promoter motif and TF binding site motifs, RNA polymerase (RNA POL), transcription factors TF1 and TF2, and the RNA transcript]
- RNA polymerase recognizes the promoter motif.
- Transcription factor 2 recognizes its binding site and activates RNA POL.
- Transcription factor 1 activates RNA POL from a remote site, for instance via DNA looping.
3. Experimental techniques for identifying transcription factor binding sites and regulatory elements
1. Foot-printing (in vitro): locates binding sites within short regions
2. Foot-printing (in vivo): locates binding sites within short regions, no TF identification
3. Chromatin immunoprecipitation (ChIP): locates binding sites within larger regions (~100 bp)
4. Electrophoretic mobility shift assays (EMSA, bandshifts): determine the relative binding efficiency of an oligonucleotide containing a binding site (in vitro)
5. Mutational analysis: defines the regulatory element, no TF identification (in vivo)
6. Promoter mapping: defines the transcription start site (TSS); TF binding sites are expected to occur within a certain distance range from the TSS (debatable)
7. Expression profiling (microarrays): defines sets of coordinately regulated genes
- A combination of techniques 1-5 is required for identification and precise localization of functional TF binding sites.
- Techniques 6 and 7 together define promoter sequence sets for motif discovery.
4. Transcription Factor Binding Sites: Features and Facts
- Degenerate sequence motifs: several related sequences bind the same TF, e.g. TATAAA, TTTAAA, TATAAG, TTTAAG, etc.
- Typical length: 6-20 bp
- Low specificity: 1 site per 250 to 4000 bp
- Quantitative recognition mechanism: the affinities of different qualifying sequences vary (affinity = DNA-protein binding equilibrium constant Kb, unit M^-1).
- Regulatory function often depends on cooperative interactions with neighboring sites (combinatorial gene regulatory code).
5. Motif Discovery: Applications in Transcription Regulation
Input: a set of promoter sequences known to be regulated by a transcription factor or environmental condition (regulon). Promoter sequences are sequences upstream of known transcription start sites, e.g. from −200 to +10. Sets of co-regulated genes may be defined by microarray experiments.
Output: a motif (consensus sequence or position weight matrix) over-represented in the input sequences, and, for each sequence, the predicted motif locations.
6. Formal tools to describe regulatory elements
- Consensus sequences
  - example: TATAAA (for the eukaryotic TATA-box)
  - a limited number of mismatches may be allowed
  - may contain IUPAC codes for ambiguous positions, e.g. R = A or G
- Position weight matrices (PWM)
  - synonym: position-specific scoring matrix (PSSM)
  - a table with a number for each residue at each position of a fixed-length (gap-free) motif
  - two numerical representations: probabilities or additive scores
- More advanced descriptors
  - HMMs can model spacers of variable length between conserved blocks
  - Dinucleotide matrices: dependencies between nearest neighbors
7. Representation of the binding specificity by a scoring matrix (also referred to as weight matrix)

Strong binding site:
  Sequence   C    T    T    T    G    A    T    C    T
  Scores     5    5    5    5    5    5    5    3    5    Total:  43

Random sequence:
  Sequence   A    C    G    T    A    C    G    T    A
  Scores    -10  -10  -13   5   -10  -15  -13  -11  -6    Total: -83
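The additive scoring illustrated above can be written in a few lines of Python. This is a minimal sketch with made-up matrix values (the slide's full 9-column matrix is not reproduced here); names and numbers are illustrative only.

```python
# Score a DNA sequence with an additive position weight matrix (PWM).
# Each column holds one score per base; the total score is the sum of
# the scores of the observed bases, as in the slide example above.

def score_sequence(seq, pwm):
    """Return the additive PWM score of seq (len(seq) must equal len(pwm))."""
    assert len(seq) == len(pwm)
    return sum(pwm[i][base] for i, base in enumerate(seq))

# Illustrative 3-column matrix (values are made up for demonstration).
pwm = [
    {"A": -10, "C": 5,   "G": -13, "T": 4},
    {"A": 3,   "C": -11, "G": -15, "T": 5},
    {"A": 5,   "C": -10, "G": -13, "T": -6},
]

print(score_sequence("CTA", pwm))  # 5 + 5 + 5 = 15 (high-scoring site)
print(score_sequence("ACG", pwm))  # -10 + -11 + -13 = -34 (random-like)
```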
8. Biophysical interpretation of protein binding sites
Columns of a weight matrix characterize the specificity of base-pair acceptor sites on the protein surface. Weight matrix elements represent negated energy contributions to the total binding energy → the weight matrix score is inversely related to the binding energy.
9. Berg and von Hippel Theory
Matrix elements w(i,b) represent free energy contributions and can therefore be scaled in relative energy units, e.g. kcal/mol. The probability of finding base b at position i in a transcription factor binding site is given by the Boltzmann-type expression shown below. λ is a parameter related to the stringency of the selection: selection for low free binding energy (high affinity) results in a high λ. w(i,b) represents the negated energy value (high score → high affinity). The total binding energy (weight matrix score) is proportional to the logarithm of the DNA-protein binding constant, which can be measured by various experimental protocols, e.g. competitive bandshift experiments.
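The formula itself appears only as an image in the original slides. A plausible reconstruction following Berg and von Hippel, in which the symbol λ and the explicit normalization over bases are assumptions, is:

$$p(i,b) \;=\; \frac{e^{\lambda\, w(i,b)}}{\sum_{b'} e^{\lambda\, w(i,b')}}$$

Since w(i,b) holds negated energy contributions, a large λ (stringent selection) concentrates the probability mass on the highest-scoring, i.e. lowest-energy, bases.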
10. Why and when is motif discovery difficult?
Problem 1: long input sequences → it is difficult to guess the location of motif occurrences (see below).
Problem 2: few input sequences → inaccurate estimation of the position-specific base frequencies.
11. Word search algorithms for consensus sequences
- Purpose: to find an optimal consensus sequence for a given sequence set.
- Algorithm: for each word wi of size k and mismatch threshold d:
  - Count the total frequency of word wi in the data set.
  - Compute the expected frequency of word wi in the data set, based on the base composition of the data set or some other null model.
  - From the observed and expected frequencies, compute a P-value for word wi according to some statistical distribution (e.g. the Poisson distribution).
  - Return the word with the most significant (lowest) P-value, or the N best words.
- This algorithm is also referred to as word enumeration; it is guaranteed to find the optimal word (a minimal sketch follows this list).
- Computationally feasible only for short words (length up to about 15).
- Heuristic algorithms exist for longer words.
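A minimal Python sketch of this word-enumeration idea, assuming an equiprobable-base null model and a Poisson P-value, and ignoring the mismatch threshold d for brevity:

```python
from itertools import product
from math import exp

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu)."""
    pmf, cdf = exp(-mu), 0.0
    for i in range(k):
        cdf += pmf
        pmf *= mu / (i + 1)
    return max(0.0, 1.0 - cdf)

def enumerate_words(sequences, k):
    """Score every k-mer by over-representation; most significant first."""
    counts = {"".join(w): 0 for w in product("ACGT", repeat=k)}
    n_positions = 0
    for seq in sequences:
        for j in range(len(seq) - k + 1):
            n_positions += 1
            word = seq[j:j + k]
            if word in counts:              # skip windows with non-ACGT letters
                counts[word] += 1
    expected = n_positions * 0.25 ** k      # equiprobable-base null model
    results = [(word, obs, expected, poisson_sf(obs, expected))
               for word, obs in counts.items()]
    return sorted(results, key=lambda r: r[3])

# Example: the shared word TATAAA should rank at (or near) the top.
seqs = ["ACGTTATAAAGG", "CCTATAAATTGC", "GGTATAAACCAA"]
for word, obs, exp_, p_value in enumerate_words(seqs, 6)[:3]:
    print(word, obs, round(exp_, 4), p_value)
```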
12. Expectation-Maximization (EM) Algorithms
- EM algorithm essentials
  - An iterative procedure to maximize the likelihood of a probabilistic model with respect to given data
  - It can deal with missing (unobservable) data (the motif positions)
  - Not guaranteed to find the global maximum
- Input
  - A model type with initial parameters to be optimized
  - A mathematical formula for computing the likelihood of the model given the complete (observed and unobserved) data
  - The observed data
- Output: the maximum likelihood estimates of
  - The model parameters (PWM scores or base probabilities)
  - The unobserved data (the motif positions)
13. Motif Discovery by EM: Inputs and outputs
14. Motif Discovery with EM: basic version
Input model: a base probability matrix p(i,b) of motif length I, and a background frequency model q(b).
Observable data: the sequences Sn.
Hidden data: the starting positions of the motif, yn.
Log-likelihood for a sequence S = s1 s2 ... sJ and starting position y: see the expression below. The log-likelihood of the model given a complete sequence set S1, S2, ..., SN with starting positions y1, y2, ..., yN is the sum of the log-likelihoods of the individual sequences given their corresponding starting positions.
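The likelihood formula is shown as an image in the original slides. A plausible reconstruction under the stated model (motif of length I starting at position y, background model elsewhere) is:

$$\log L(S, y) \;=\; \sum_{i=1}^{I} \log p\!\left(i,\, s_{y+i-1}\right) \;+\; \sum_{j \notin \{y,\dots,y+I-1\}} \log q\!\left(s_j\right), \qquad \log L \;=\; \sum_{n=1}^{N} \log L(S_n, y_n)$$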
15. EM: How it works for sequence motifs
Initialization: define values for the initial model p0(i,b) and the background frequencies q(b).
Expectation step, kth iteration (non-trivial, see next slide): compute expected parameter values ek+1(i,b) based on the previous model pk(i,b), the background frequencies q(b), and the sequences S1, S2, ..., SN.
Maximization step, (k+1)th iteration (straightforward): compute maximum likelihood base probabilities pk+1(i,b) from the expected frequencies ek+1(i,b).
Termination: stop when the total log-likelihood increases by less than a threshold value (or after n iterations). Find the most likely starting positions y1, y2, ..., yN of the motif under the last model.
16. EM: Expectation Step
For a given sequence of length J, compute weights wj for the possible starting positions 1, 2, ..., J−I+1. Using these weights, add up the contributions of all subsequences to compute the expected number of each base at each motif position. Then sum the contributions of all sequences to obtain ek+1(i,b).
Example (one sequence S = TAC, motif length I = 2)

Current motif model pk(i,b):          Background model q(b):
       pos 1   pos 2
  A     0.7     0.1                     A  0.25
  C     0.1     0.3                     C  0.25
  G     0.1     0.3                     G  0.25
  T     0.1     0.3                     T  0.25

L(S,1) = 0.1 × 0.1 × 0.25 = 0.0025,   w(S,1) = 0.045   (motif on TA, background C)
L(S,2) = 0.25 × 0.7 × 0.3 = 0.0525,   w(S,2) = 0.955   (background T, motif on AC)

Weighted base counts accumulated into ek+1(i,b):

         from start 1      from start 2      sum = ek+1(i,b)
         pos 1  pos 2      pos 1  pos 2      pos 1  pos 2
    A     .000   .045       .955   .000       .955   .045
    C     .000   .000       .000   .955       .000   .955
    G     .000   .000       .000   .000       .000   .000
    T     .045   .000       .000   .000       .045   .000
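The E-step can be written compactly in Python. This minimal sketch (one motif occurrence per sequence; function and variable names are my own) reproduces the numbers of the worked example above:

```python
# E-step of the basic motif-discovery EM (one motif occurrence per sequence).
# Written to reproduce the worked example: S = "TAC", motif length I = 2.

BASES = "ACGT"

def e_step(sequences, p, q):
    """p[i][b]: motif base probabilities, q[b]: background; returns (weights, e)."""
    I = len(p)
    e = [{b: 0.0 for b in BASES} for _ in range(I)]
    all_weights = []
    for S in sequences:
        # Likelihood L(S, y) for each possible motif start position y.
        L = []
        for y in range(len(S) - I + 1):
            lik = 1.0
            for j, base in enumerate(S):
                lik *= p[j - y][base] if y <= j < y + I else q[base]
            L.append(lik)
        total = sum(L)
        w = [l / total for l in L]              # position weights w(S, y)
        all_weights.append(w)
        # Add the weighted base counts into the expected counts e[i][b].
        for y, wy in enumerate(w):
            for i in range(I):
                e[i][S[y + i]] += wy
    return all_weights, e

p = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},     # pk(1, b)
     {"A": 0.1, "C": 0.3, "G": 0.3, "T": 0.3}]     # pk(2, b)
q = {b: 0.25 for b in BASES}
weights, e = e_step(["TAC"], p, q)
print([round(x, 3) for x in weights[0]])   # [0.045, 0.955]
print([{b: round(e[i][b], 3) for b in BASES} for i in range(2)])
```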
17. EM: Maximization Step
The maximum likelihood (ML) estimate for the new parameters pk+1(i,b) is equal to the expected number of bases ek+1(i,b) divided by the number of sequences.
Instead of ML, a posterior mean estimator (PME) may be used. The PME is based on a prior probability distribution over the parameter space. Example: the add-one prior (a uniform distribution over the parameter space).
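Written as formulas (the originals are shown as images; these reconstructions follow the verbal description above, and the add-one normalization by N + 4 is an assumption):

$$p_{k+1}^{\mathrm{ML}}(i,b) \;=\; \frac{e_{k+1}(i,b)}{N}, \qquad p_{k+1}^{\mathrm{PME}}(i,b) \;=\; \frac{e_{k+1}(i,b) + 1}{N + 4}$$

where N is the number of sequences and 4 the number of bases.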
18. Variations of the basic EM algorithm implemented in different motif discovery programs
- Choice of the initial model (seeding): start with different random models; start with over-represented words (converted to a matrix); try different motif lengths.
- Priors: for instance, add pseudo-counts proportional to the background base probabilities (maximum a posteriori (MAP) estimation).
- Background models: higher order Markov models; a first order Markov model takes dinucleotide frequencies into account.
- Multiple motif search: find the N best motifs.
- Computation of statistical quality scores: E-values indicate whether a motif of a certain quality could have been found by chance in random sequences.
19. OOPS, ZOOPS, and TCM
- OOPS: one occurrence per sequence
  - the basic EM procedure
- ZOOPS: zero or one occurrence per sequence
  - additional model parameter: the fraction of motif-containing sequences
  - E-step: sequence weighting in addition to position weighting
- TCM: two-component mixture model
  - also called anr, any number of repetitions
  - a sequence may contain zero to many occurrences of the motif
  - components: motif model and background model
  - additional model parameter: the component frequencies
  - the set of sequences is essentially treated as one long sequence
  - E-step: probabilistic assignment of overlapping subsequences to components
20. Gibbs sampling
Motivation: EM is not guaranteed to find the globally optimal motif, and EM is deterministic (same initial model → same result). Stochastic algorithms are also not guaranteed to find the global optimum, but the same initial model can lead to different results. Running a stochastic algorithm several times therefore gives a better chance of finding the optimal or a near-optimal solution.
Gibbs sampling: a modification of the E-step that considers only one sequence position for computing the new base frequency matrix. The motif position is chosen by sampling from the probability distribution given by the sequence position weights w(S,j).
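A minimal Python sketch of such a site sampler (one occurrence per sequence, add-one pseudo-counts, leave-one-out profile; the names and the simple likelihood-ratio weighting are my own simplifications, not the exact procedure of any particular program):

```python
import random

BASES = "ACGT"

def profile_from(sites, exclude, I):
    """Base probabilities from the current motif sites, leaving one sequence out."""
    counts = [{b: 1.0 for b in BASES} for _ in range(I)]   # add-one pseudo-counts
    for n, site in enumerate(sites):
        if n == exclude:
            continue
        for i, base in enumerate(site):
            counts[i][base] += 1.0
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def gibbs_motif(seqs, I, iterations=200, q=0.25):
    """Iteratively resample motif start positions; return the final positions."""
    starts = [random.randrange(len(s) - I + 1) for s in seqs]
    for _ in range(iterations):
        for n, seq in enumerate(seqs):
            # Profile built from all sequences except the current one.
            sites = [s[y:y + I] for s, y in zip(seqs, starts)]
            p = profile_from(sites, n, I)
            # Position weights w(S, j): motif model vs background likelihood ratio.
            weights = []
            for j in range(len(seq) - I + 1):
                lik = 1.0
                for i in range(I):
                    lik *= p[i][seq[j + i]] / q
                weights.append(lik)
            # Sample the new start position from the weight distribution.
            starts[n] = random.choices(range(len(weights)), weights=weights)[0]
    return starts

random.seed(0)
seqs = ["ACGTTATAAAGG", "CCTATAAATTGC", "GGTATAAACCAA"]
print(gibbs_motif(seqs, 6))   # should converge on the shared TATAAA positions
```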
21. Do motif discovery algorithms work in practice?
- Two recent studies suggest that the performance of motif discovery algorithms is very poor.
  - Only about 20 to 30% of the motifs found and remapped to the training set correspond to transcription factor binding sites identified by experiments.
- Possible reasons:
  - The heuristic motif discovery algorithms fail to find the optimal motif.
  - The sequence sets are too small for the estimation of statistically robust models.
  - The experimental data defining the binding sites used in these tests are flawed.
  - The reasons for the poor performance are currently being investigated.
- Take-home lesson:
  - Use motif discovery programs with caution!
  - Realistic benchmarking procedures are important (see next slide).
22. Limitations of Motif Discovery Algorithms
- The following paper presents and discusses the outcome of a motif discovery competition (Tompa et al. 2005).
24. Performance indices used by Tompa et al. 2005 (2)
- Specificity: xSp = xTP / (xTP + xFP)
25. Performance indices used by Tompa et al. 2005
26. Poor Performance of Motif Discovery Algorithms on Eukaryotic Benchmark Data Sets (results from Tompa et al. 2005)
27. Similar results are obtained with prokaryotic benchmark datasets
28. Programs used in the Exercise
- MEME
  - EM algorithm
  - supports oops, zoops, and tcm modes
  - slow but effective procedure for finding good seeds (starting models)
  - uses higher order Markov models as background models
- MotifSampler
  - implements Gibbs sampling
  - uses higher order Markov models as background models
  - can be run several times in parallel
- MDScan
  - seeding step scans only t sequences for over-represented words
  - refinement: an EM-like (but not strictly EM) probabilistic iterative alignment procedure