Title: Promoters and regulatory elements analysis
1. Promoters and regulatory elements analysis
- Biological introduction to promoters and regulatory elements
- Experimental procedures to characterize regulatory elements
- Biophysics of DNA-protein binding sites
- Word search algorithms
- Expectation-Maximization algorithms
- Gibbs sampling
- Evaluation of motif discovery algorithms
- Motif discovery programs
- Introduction to the exercise
2. Transcription regulation by transcription factors (prokaryotes)
[Figure: DNA with a promoter motif and TF binding site motifs, RNA polymerase (RNA POL), transcription factors TF1 and TF2, and the RNA transcript]
- RNA polymerase recognizes the promoter motif.
- Transcription factor 2 recognizes its binding site and activates RNA POL.
- Transcription factor 1 activates RNA POL from a remote site, for instance via DNA looping.
3. Experimental techniques for identifying transcription factor binding sites and regulatory elements
1. Foot-printing (in vitro): locates binding sites within short regions
2. Foot-printing (in vivo): locates binding sites within short regions, no TF identification
3. Chromatin immunoprecipitation (ChIP): locates binding sites within larger regions (~100 bp)
4. Electrophoretic mobility shift assays (EMSA, bandshifts): determine the relative binding efficiency of an oligonucleotide containing a binding site (in vitro)
5. Mutational analysis: defines the regulatory element, no TF identification (in vivo)
6. Promoter mapping: defines the transcription start site (TSS); TF binding sites are expected to occur within a certain distance range from the TSS (debatable)
7. Expression profiling (microarrays): defines sets of coordinately regulated genes
- A combination of techniques 1-5 is required for identification and precise localization of functional TF binding sites.
- Techniques 6 and 7 together define promoter sequence sets for motif discovery.
4. Transcription Factor Binding Sites: Features and Facts
- Degenerate sequence motifs: several related sequences bind the same TF, e.g. TATAAA, TTTAAA, TATAAG, TTTAAG, etc.
- Typical length: 6-20 bp
- Low specificity: 1 site per 250 to 4000 bp
- Quantitative recognition mechanism: the affinities of different qualifying sequences vary (affinity = DNA-protein binding equilibrium constant Kb, unit M^-1).
- Regulatory function often depends on cooperative interactions with neighboring sites (combinatorial gene regulatory code).
5. Motif Discovery: Applications in Transcription Regulation
Input: a set of promoter sequences known to be regulated by a transcription factor or environmental condition (regulon). Promoter sequences are sequences upstream of known transcription start sites, e.g. from −200 to +10. Sets of co-regulated genes may be defined by microarray experiments.
Output: a motif (consensus sequence or position weight matrix) over-represented in the input sequences, and, for each sequence, the predicted motif locations.
6. Formal tools to describe regulatory elements
- Consensus sequences
  - example: TATAAA (for the eukaryotic TATA-box)
  - a limited number of mismatches may be allowed
  - may contain IUPAC codes for ambiguous positions, e.g. R = A or G
- Position weight matrices (PWM)
  - synonym: position-specific scoring matrix (PSSM)
  - a table with a number for each residue at each position of a fixed-length (gap-free) motif
  - two numerical representations: probabilities or additive scores
- More advanced descriptors
  - HMMs can model spacers of variable length between conserved blocks
  - Dinucleotide matrices: dependencies between nearest neighbors
7. Representation of the binding specificity by a scoring matrix (also referred to as weight matrix)

Strong binding site:
  Sequence   C    T    T    T    G    A    T    C    T
  Scores     5    5    5    5    5    5    5    3    5    Total:  43

Random sequence:
  Sequence   A    C    G    T    A    C    G    T    A
  Scores    -10  -10  -13   5   -10  -15  -13  -11  -6    Total: -83
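The additive scoring illustrated above can be written in a few lines of Python. This is a minimal sketch with made-up matrix values (the slide's full 9-column matrix is not reproduced here); names and numbers are illustrative only.

```python
# Score a DNA sequence with an additive position weight matrix (PWM).
# Each column holds one score per base; the total score is the sum of
# the scores of the observed bases, as in the slide example above.

def score_sequence(seq, pwm):
    """Return the additive PWM score of seq (len(seq) must equal len(pwm))."""
    assert len(seq) == len(pwm)
    return sum(pwm[i][base] for i, base in enumerate(seq))

# Illustrative 3-column matrix (values are made up for demonstration).
pwm = [
    {"A": -10, "C": 5,   "G": -13, "T": 4},
    {"A": 3,   "C": -11, "G": -15, "T": 5},
    {"A": 5,   "C": -10, "G": -13, "T": -6},
]

print(score_sequence("CTA", pwm))  # 5 + 5 + 5 = 15 (high-scoring site)
print(score_sequence("ACG", pwm))  # -10 + -11 + -13 = -34 (random-like)
```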
8. Biophysical interpretation of protein binding sites
Columns of a weight matrix characterize the specificity of base-pair acceptor sites on the protein surface. Weight matrix elements represent negated energy contributions to the total binding energy → the weight matrix score is inversely related to the binding energy.
9. Berg and von Hippel Theory
Matrix elements w(i,b) represent free energy contributions and can therefore be scaled in relative energy units, e.g. kcal/mol. The probability of finding base b at position i in a transcription factor binding site is given by the Boltzmann-type expression shown below. λ is a parameter related to the stringency of the selection: selection for low free binding energy (high affinity) results in a high λ. w(i,b) represents the negated energy value (high score → high affinity). The total binding energy (weight matrix score) is proportional to the logarithm of the DNA-protein binding constant, which can be measured by various experimental protocols, e.g. competitive bandshift experiments.
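The formula itself appears only as an image in the original slides. A plausible reconstruction following Berg and von Hippel, in which the symbol λ and the explicit normalization over bases are assumptions, is:

$$p(i,b) \;=\; \frac{e^{\lambda\, w(i,b)}}{\sum_{b'} e^{\lambda\, w(i,b')}}$$

Since w(i,b) holds negated energy contributions, a large λ (stringent selection) concentrates the probability mass on the highest-scoring, i.e. lowest-energy, bases.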
10. Why and when is motif discovery difficult?
Problem 1: long input sequences → it is difficult to guess the location of motif occurrences (see below).
Problem 2: few input sequences → inaccurate estimation of the position-specific base frequencies.
11. Word search algorithms for consensus sequences
- Purpose: to find an optimal consensus sequence for a given sequence set.
- Algorithm: for each word wi of size k and mismatch threshold d:
  - Count the total frequency of word wi in the data set.
  - Compute the expected frequency of word wi in the data set, based on the base composition of the data set or some other null model.
  - From the observed and expected frequencies, compute a P-value for word wi according to some statistical distribution (e.g. the Poisson distribution).
  - Return the word with the most significant (lowest) P-value, or the N best words.
- This algorithm is also referred to as word enumeration; it is guaranteed to find the optimal word (a minimal sketch follows this list).
- Computationally feasible only for short words (length up to about 15).
- Heuristic algorithms exist for longer words.
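A minimal Python sketch of this word-enumeration idea, assuming an equiprobable-base null model and a Poisson P-value, and ignoring the mismatch threshold d for brevity:

```python
from itertools import product
from math import exp

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu)."""
    pmf, cdf = exp(-mu), 0.0
    for i in range(k):
        cdf += pmf
        pmf *= mu / (i + 1)
    return max(0.0, 1.0 - cdf)

def enumerate_words(sequences, k):
    """Score every k-mer by over-representation; most significant first."""
    counts = {"".join(w): 0 for w in product("ACGT", repeat=k)}
    n_positions = 0
    for seq in sequences:
        for j in range(len(seq) - k + 1):
            n_positions += 1
            word = seq[j:j + k]
            if word in counts:              # skip windows with non-ACGT letters
                counts[word] += 1
    expected = n_positions * 0.25 ** k      # equiprobable-base null model
    results = [(word, obs, expected, poisson_sf(obs, expected))
               for word, obs in counts.items()]
    return sorted(results, key=lambda r: r[3])

# Example: the shared word TATAAA should rank at (or near) the top.
seqs = ["ACGTTATAAAGG", "CCTATAAATTGC", "GGTATAAACCAA"]
for word, obs, exp_, p_value in enumerate_words(seqs, 6)[:3]:
    print(word, obs, round(exp_, 4), p_value)
```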
12. Expectation-Maximization (EM) Algorithms
- EM algorithm essentials
  - An iterative procedure to maximize the likelihood of a probabilistic model with respect to given data
  - It can deal with missing (unobservable) data (the motif positions)
  - Not guaranteed to find the global maximum
- Input
  - A model type with initial parameters to be optimized
  - A mathematical formula for computing the likelihood of the model given the complete (observed and unobserved) data
  - The observed data
- Output: the maximum likelihood estimates of
  - The model parameters (PWM scores or base probabilities)
  - The unobserved data (the motif positions)
13. Motif Discovery by EM: Inputs and outputs
14. Motif Discovery with EM: basic version
Input model: a base probability matrix p(i,b) of motif length I, and a background frequency model q(b).
Observable data: the sequences Sn.
Hidden data: the starting positions of the motif, yn.
Log-likelihood for a sequence S = s1 s2 ... sJ and starting position y: see the expression below. The log-likelihood of the model given a complete sequence set S1, S2, ..., SN with starting positions y1, y2, ..., yN is the sum of the log-likelihoods of the individual sequences given their corresponding starting positions.
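The likelihood formula is shown as an image in the original slides. A plausible reconstruction under the stated model (motif of length I starting at position y, background model elsewhere) is:

$$\log L(S, y) \;=\; \sum_{i=1}^{I} \log p\!\left(i,\, s_{y+i-1}\right) \;+\; \sum_{j \notin \{y,\dots,y+I-1\}} \log q\!\left(s_j\right), \qquad \log L \;=\; \sum_{n=1}^{N} \log L(S_n, y_n)$$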
15. EM: How it works for sequence motifs
Initialization: define values for the initial model p0(i,b) and the background frequencies q(b).
Expectation step, kth iteration (non-trivial, see next slide): compute expected parameter values ek+1(i,b) based on the previous model pk(i,b), the background frequencies q(b), and the sequences S1, S2, ..., SN.
Maximization step, (k+1)th iteration (straightforward): compute maximum likelihood base probabilities pk+1(i,b) from the expected frequencies ek+1(i,b).
Termination: stop when the total log-likelihood increases by less than a threshold value (or after n iterations). Find the most likely starting positions y1, y2, ..., yN of the motif under the last model.
16. EM: Expectation Step
For a given sequence of length J, compute weights wj for the possible starting positions 1, 2, ..., J−I+1. Using these weights, add up the contributions of all subsequences to compute the expected number of each base at each motif position. Then sum the contributions of all sequences to obtain ek+1(i,b).
Example (one sequence S = TAC, motif length I = 2)

Current motif model pk(i,b):          Background model q(b):
       pos 1   pos 2
  A     0.7     0.1                     A  0.25
  C     0.1     0.3                     C  0.25
  G     0.1     0.3                     G  0.25
  T     0.1     0.3                     T  0.25

L(S,1) = 0.1 × 0.1 × 0.25 = 0.0025,   w(S,1) = 0.045   (motif on TA, background C)
L(S,2) = 0.25 × 0.7 × 0.3 = 0.0525,   w(S,2) = 0.955   (background T, motif on AC)

Weighted base counts accumulated into ek+1(i,b):

         from start 1      from start 2      sum = ek+1(i,b)
         pos 1  pos 2      pos 1  pos 2      pos 1  pos 2
    A     .000   .045       .955   .000       .955   .045
    C     .000   .000       .000   .955       .000   .955
    G     .000   .000       .000   .000       .000   .000
    T     .045   .000       .000   .000       .045   .000
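The E-step can be written compactly in Python. This minimal sketch (one motif occurrence per sequence; function and variable names are my own) reproduces the numbers of the worked example above:

```python
# E-step of the basic motif-discovery EM (one motif occurrence per sequence).
# Written to reproduce the worked example: S = "TAC", motif length I = 2.

BASES = "ACGT"

def e_step(sequences, p, q):
    """p[i][b]: motif base probabilities, q[b]: background; returns (weights, e)."""
    I = len(p)
    e = [{b: 0.0 for b in BASES} for _ in range(I)]
    all_weights = []
    for S in sequences:
        # Likelihood L(S, y) for each possible motif start position y.
        L = []
        for y in range(len(S) - I + 1):
            lik = 1.0
            for j, base in enumerate(S):
                lik *= p[j - y][base] if y <= j < y + I else q[base]
            L.append(lik)
        total = sum(L)
        w = [l / total for l in L]              # position weights w(S, y)
        all_weights.append(w)
        # Add the weighted base counts into the expected counts e[i][b].
        for y, wy in enumerate(w):
            for i in range(I):
                e[i][S[y + i]] += wy
    return all_weights, e

p = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},     # pk(1, b)
     {"A": 0.1, "C": 0.3, "G": 0.3, "T": 0.3}]     # pk(2, b)
q = {b: 0.25 for b in BASES}
weights, e = e_step(["TAC"], p, q)
print([round(x, 3) for x in weights[0]])   # [0.045, 0.955]
print([{b: round(e[i][b], 3) for b in BASES} for i in range(2)])
```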
17. EM: Maximization Step
The maximum likelihood (ML) estimate for the new parameters pk+1(i,b) is equal to the expected number of bases ek+1(i,b) divided by the number of sequences.
Instead of ML, a posterior mean estimator (PME) may be used. The PME is based on a prior probability distribution over the parameter space. Example: the add-one prior (a uniform distribution over the parameter space).
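Written as formulas (the originals are shown as images; these reconstructions follow the verbal description above, and the add-one normalization by N + 4 is an assumption):

$$p_{k+1}^{\mathrm{ML}}(i,b) \;=\; \frac{e_{k+1}(i,b)}{N}, \qquad p_{k+1}^{\mathrm{PME}}(i,b) \;=\; \frac{e_{k+1}(i,b) + 1}{N + 4}$$

where N is the number of sequences and 4 the number of bases.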
18. Variations of the basic EM algorithm implemented in different motif discovery programs
- Choice of the initial model (seeding): start with different random models; start with over-represented words (converted to a matrix); try different motif lengths.
- Priors: for instance, add pseudo-counts proportional to the background base probabilities (maximum a posteriori (MAP) estimation).
- Background models: higher order Markov models; a first order Markov model takes dinucleotide frequencies into account.
- Multiple motif search: find the N best motifs.
- Computation of statistical quality scores: E-values indicate whether a motif of a certain quality could have been found by chance in random sequences.
19. OOPS, ZOOPS, and TCM
- OOPS: one occurrence per sequence
  - the basic EM procedure
- ZOOPS: zero or one occurrence per sequence
  - additional model parameter: the fraction of motif-containing sequences
  - E-step: sequence weighting in addition to position weighting
- TCM: two-component mixture model
  - also called anr, any number of repetitions
  - a sequence may contain zero to many occurrences of the motif
  - components: motif model and background model
  - additional model parameter: the component frequencies
  - the set of sequences is essentially treated as one long sequence
  - E-step: probabilistic assignment of overlapping subsequences to components
20. Gibbs sampling
Motivation: EM is not guaranteed to find the globally optimal motif, and EM is deterministic (same initial model → same result). Stochastic algorithms are also not guaranteed to find the global optimum, but the same initial model can lead to different results. Running a stochastic algorithm several times therefore gives a better chance of finding the optimal or a near-optimal solution.
Gibbs sampling: a modification of the E-step that considers only one sequence position for computing the new base frequency matrix. The motif position is chosen by sampling from the probability distribution given by the sequence position weights w(S,j).
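A minimal Python sketch of such a site sampler (one occurrence per sequence, add-one pseudo-counts, leave-one-out profile; the names and the simple likelihood-ratio weighting are my own simplifications, not the exact procedure of any particular program):

```python
import random

BASES = "ACGT"

def profile_from(sites, exclude, I):
    """Base probabilities from the current motif sites, leaving one sequence out."""
    counts = [{b: 1.0 for b in BASES} for _ in range(I)]   # add-one pseudo-counts
    for n, site in enumerate(sites):
        if n == exclude:
            continue
        for i, base in enumerate(site):
            counts[i][base] += 1.0
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def gibbs_motif(seqs, I, iterations=200, q=0.25):
    """Iteratively resample motif start positions; return the final positions."""
    starts = [random.randrange(len(s) - I + 1) for s in seqs]
    for _ in range(iterations):
        for n, seq in enumerate(seqs):
            # Profile built from all sequences except the current one.
            sites = [s[y:y + I] for s, y in zip(seqs, starts)]
            p = profile_from(sites, n, I)
            # Position weights w(S, j): motif model vs background likelihood ratio.
            weights = []
            for j in range(len(seq) - I + 1):
                lik = 1.0
                for i in range(I):
                    lik *= p[i][seq[j + i]] / q
                weights.append(lik)
            # Sample the new start position from the weight distribution.
            starts[n] = random.choices(range(len(weights)), weights=weights)[0]
    return starts

random.seed(0)
seqs = ["ACGTTATAAAGG", "CCTATAAATTGC", "GGTATAAACCAA"]
print(gibbs_motif(seqs, 6))   # should converge on the shared TATAAA positions
```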
21. Do motif discovery algorithms work in practice?
- Two recent studies suggest that the performance of motif discovery algorithms is very poor.
  - Only about 20 to 30% of the motifs found and remapped to the training set correspond to transcription factor binding sites identified by experiments.
- Possible reasons:
  - The heuristic motif discovery algorithms fail to find the optimal motif.
  - The sequence sets are too small for the estimation of statistically robust models.
  - The experimental data defining the binding sites used in these tests are flawed.
  - The reasons for the poor performance are currently being investigated.
- Take-home lesson:
  - Use motif discovery programs with caution!
  - Realistic benchmarking procedures are important (see next slide).
22. Limitations of Motif Discovery Algorithms
- The following paper presents and discusses the outcome of a motif discovery competition (Tompa et al. 2005).
24. Performance indices used by Tompa et al. 2005 (2)
- Specificity: xSp = xTP / (xTP + xFP)
25. Performance indices used by Tompa et al. 2005
26. Poor Performance of Motif Discovery Algorithms on Eukaryotic Benchmark Data Sets (results from Tompa et al. 2005)
27. Similar results are obtained with prokaryotic benchmark datasets
28. Programs used in the Exercise
- MEME
  - EM algorithm
  - supports oops, zoops, and tcm modes
  - slow but effective procedure for finding good seeds (starting models)
  - uses higher order Markov models as background models
- MotifSampler
  - implements Gibbs sampling
  - uses higher order Markov models as background models
  - can be run several times in parallel
- MDScan
  - seeding step scans only t sequences for over-represented words
  - refinement: an EM-like (but not strictly EM) probabilistic iterative alignment procedure