Promoters and regulatory elements analysis
1
Promoters and regulatory elements analysis
  • Biological introduction to promoters and regulatory elements
  • Experimental procedures to characterize regulatory elements
  • Biophysics of DNA-protein binding sites
  • Word search algorithms
  • Expectation-Maximization algorithms
  • Gibbs sampling
  • Evaluation of motif discovery algorithms
  • Motif discovery programs
  • Introduction to exercise
2
Transcription regulation by transcription factors (prokaryotes)
[Figure: RNA polymerase (RNA POL) and transcription factors TF1 and TF2 bound to DNA, with the promoter motif, TF binding site motifs, and the RNA transcript marked]
  • RNA polymerase recognizes the promoter motif
  • Transcription factor 2 recognizes its binding site and activates RNA POL
  • Transcription factor 1 activates RNA POL from a remote site, for instance via DNA looping
3
Experimental techniques for identifying
transcription factor binding sites and regulatory
elements
  • Foot-printing (in vitro): locates binding sites within short regions
  • Foot-printing (in vivo): locates binding sites within short regions, no TF identification
  • Chromatin immunoprecipitation (ChIP): locates binding sites within larger regions (100 bp)
  • Electrophoretic mobility shift assays (EMSA, bandshifts): determine the relative binding efficiency of an oligonucleotide containing a binding site (in vitro)
  • Mutational analysis: defines the regulatory element, no TF identification (in vivo)
  • Promoter mapping: defines the transcription start site (TSS); TF binding sites are expected to occur within a certain distance range from the TSS (debatable)
  • Expression profiling (micro-arrays): defines sets of coordinately regulated genes
  • A combination of techniques 1-5 is required for identification and precise localization of functional TF binding sites
  • Techniques 6 and 7 together define promoter sequence sets for motif discovery

4
Transcription Factor Binding Sites: Features and Facts
Degenerate sequence motifs: several related sequences bind the same TF, e.g. TATAAA, TTTAAA, TATAAG, TTTAAG, etc. Typical length: 6-20 bp. Low specificity: one site per 250 to 4000 bp. Quantitative recognition mechanism: the affinities of different qualifying sequences vary (affinity = DNA-protein binding equilibrium constant Kb, unit M-1). Regulatory function often depends on cooperative interactions with neighboring sites (combinatorial gene regulatory code).
5
Motif Discovery Applications in Transcription
Regulation
Input: a set of promoter sequences known to be regulated by the same transcription factor or environmental condition (regulon). Promoter sequences: sequences upstream of known transcription start sites, e.g. from -200 to -10. Sets of co-regulated genes may be defined by microarray experiments.
Output: a motif (consensus sequence or position weight matrix) over-represented in the input sequences and, for each sequence, the predicted motif locations.
6
Formal tools to describe regulatory elements
  • Consensus sequences
  • example: TATAAA (for the eukaryotic TATA-box)
  • a limited number of mismatches may be allowed
  • may contain IUPAC codes for ambiguous positions, e.g. R = A or G
  • Position weight matrices (PWM)
  • synonym: position-specific scoring matrix (PSSM)
  • a table with numbers for each residue at each position of a fixed-length (gap-free) motif
  • two numerical representations: probabilities, additive scores
  • More advanced descriptors
  • HMMs can model spacers of variable length between conserved blocks
  • dinucleotide matrices model dependencies between nearest neighbors
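A minimal sketch of consensus matching with IUPAC ambiguity codes, translating the consensus into a regular expression (the function name is illustrative, not from any library):

```python
import re

# IUPAC ambiguity codes translated to regex character classes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def consensus_hits(consensus, seq):
    """Start positions (0-based) where `seq` matches the IUPAC consensus."""
    pattern = "".join(IUPAC[c] for c in consensus)
    # lookahead so that overlapping matches are also reported
    return [m.start() for m in re.finditer(f"(?={pattern})", seq)]

hits = consensus_hits("TATAWA", "GGTATAAACC")  # W = A or T -> hit at position 2
```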

7
Representation of the binding specificity by a
scoring matrix (also referred to as weight
matrix)
Strong binding site:   C    T    T    T    G    A    T    C    T   Total
Score:                 5    5    5    5    5    5    5    3    5     43
Random sequence:       A    C    G    T    A    C    G    T    A
Score:               -10  -10  -13    5  -10  -15  -13  -11   -6    -83
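The additive scoring can be sketched in Python. Only the matrix entries visible on the slide are known, so the matrix below is stored sparsely; it is a partial reconstruction for illustration, not the full 4x9 matrix:

```python
# Sparse weight matrix: one dict per motif position, {base: score},
# containing only the entries shown on the slide.
w = [
    {"C": 5, "A": -10},
    {"T": 5, "C": -10},
    {"T": 5, "G": -13},
    {"T": 5},
    {"G": 5, "A": -10},
    {"A": 5, "C": -15},
    {"T": 5, "G": -13},
    {"C": 3, "T": -11},
    {"T": 5, "A": -6},
]

def score(seq):
    """Additive weight-matrix score of a 9-bp sequence."""
    return sum(w[i][b] for i, b in enumerate(seq))

print(score("CTTTGATCT"))  # strong binding site -> 43
print(score("ACGTACGTA"))  # random sequence -> -83
```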

8
Biophysical interpretation of protein binding
sites
Columns of a weight matrix characterize the specificity of base-pair acceptor sites on the protein surface. Weight matrix elements represent negated energy contributions to the total binding energy → the weight matrix score is inversely proportional to the binding energy.
9
Berg and von Hippel Theory
Matrix elements w(i,b) represent free energy contributions, and thus can be scaled in relative energy units, e.g. kcal/mol. The probability of finding base b at position i in a transcription factor binding site is given by the formula on the slide, where λ is a parameter related to the stringency of the selection. Selection for low free binding energy (high affinity) results in a high λ. w(i,b) represents the negated energy value (high score → high affinity). The total binding energy (the weight matrix score) is proportional to the logarithm of the DNA-protein binding constant, which can be measured by various experimental protocols, e.g. competitive bandshift experiments.
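The probability formula itself did not survive the transcript (it was an image). In Berg–von Hippel theory it takes a Boltzmann form; a reconstruction in the slide's notation, under that assumption:

```latex
p(b \mid i) \;=\; \frac{e^{\lambda\, w(i,b)}}{\sum_{b'} e^{\lambda\, w(i,b')}}
```

High λ concentrates the probability on the highest-scoring (highest-affinity) bases.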
10
Why and when is motif discovery difficult?
Problem 1: long input sequences → difficult to guess the location of motif occurrences (see below)
Problem 2: few input sequences → inaccurate estimation of position-specific base frequencies
11
Word search algorithms for consensus sequences
  • Purpose: to find an optimal consensus sequence for a given sequence set.
  • Algorithm: for each word wi of size k and mismatch threshold d:
  • count the total frequency of word wi in the data set
  • compute the expected frequency of word wi in the data set, based on the base composition of the data set or some other null model
  • from the observed and expected frequencies, compute a P-value for word wi according to some statistical distribution (e.g. the Poisson distribution)
  • Return the word with the lowest (most significant) P-value, or the N best words.
  • This algorithm is also referred to as word enumeration. It is guaranteed to find the optimal word.
  • Computationally feasible only for short words (length ≤ 15).
  • Heuristic algorithms exist for longer words.
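A minimal sketch of the enumeration loop, assuming an i.i.d. base-composition null model and using a plain observed/expected ratio in place of a Poisson P-value to keep the sketch short (function name and toy data are illustrative):

```python
from itertools import product

def word_enumeration(seqs, k):
    """Exhaustively score every k-mer by over-representation in `seqs`."""
    concat = "".join(seqs)
    # background base frequencies estimated from the data set itself
    bg = {b: concat.count(b) / len(concat) for b in "ACGT"}
    n_positions = sum(len(s) - k + 1 for s in seqs)
    results = {}
    for word in map("".join, product("ACGT", repeat=k)):
        observed = sum(s.count(word) for s in seqs)
        p_word = 1.0
        for b in word:
            p_word *= bg[b]                  # i.i.d. null model
        expected = n_positions * p_word
        results[word] = (observed, expected)
    # rank by observed/expected ratio (a stand-in for a Poisson P-value)
    return sorted(results.items(),
                  key=lambda kv: -(kv[1][0] / kv[1][1] if kv[1][1] else 0.0))

# Toy data with a planted word GATC in every sequence
ranked = word_enumeration(["AAGATCAA", "GATCGGGG", "TTGATCTT"], 4)
```

The exhaustive loop over 4^k words is exactly why the slide limits word enumeration to short words: the search space grows exponentially with k.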

12
Expectation-Maximization (EM) Algorithms
  • EM algorithm essentials:
  • an iterative procedure to maximize the likelihood of a probabilistic model with respect to given data
  • can deal with missing (unobservable) data (here: the motif positions)
  • not guaranteed to find the global maximum
  • Input:
  • a model type with initial parameters to be optimized
  • a mathematical formula for computing the likelihood of the model given the complete (observed and unobserved) data
  • the observed data
  • Output: the maximum likelihood estimates of
  • the model parameters (PWM scores or base probabilities)
  • the unobserved data (motif positions)

13
Motif Discovery by EM: inputs and outputs
14
Motif Discovery with EM: basic version
Input model: a base probability matrix p(i,b) of length I and a background frequency model q(b)
Observable data: the sequences Sn
Hidden data: the starting positions yn of the motif
The log-likelihood for a sequence S = s1 s2 … sJ with starting position y is given by the formula on the slide. The log-likelihood of the model given a complete sequence set S1, S2, …, SN with starting positions y1, y2, …, yN is the sum of the log-likelihoods of the individual sequences given their starting positions.
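The per-sequence formula was an image on the slide; a standard reconstruction for the one-occurrence-per-sequence model, consistent with the worked example two slides below (motif positions scored with p, all other positions with the background q):

```latex
\log L(S, y) \;=\; \sum_{i=1}^{I} \log p\!\left(i, s_{y+i-1}\right)
\;+\; \sum_{j < y \;\text{or}\; j \ge y+I} \log q(s_j),
\qquad
\log L \;=\; \sum_{n=1}^{N} \log L(S_n, y_n)
```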
15
EM: how it works for sequence motifs
Initialization: define some values for the initial model p0(i,b) and the background frequencies q(b).
Expectation step for the kth iteration (non-trivial, see next slide): compute expected base counts e_{k+1}(i,b) based on the previous model p_k(i,b), the background frequencies q(b), and the sequences S1, S2, …, SN.
Maximization step for the (k+1)th iteration (straightforward): compute maximum likelihood base probabilities p_{k+1}(i,b) from the expected counts e_{k+1}(i,b).
Termination: stop when the total log-likelihood increases by less than a threshold value (or after n iterations). Then find the most likely starting positions y1, y2, …, yN of the motif under the final model.
16
EM: Expectation Step
For a given sequence of length J, compute weights w(S,j) for the possible starting positions j = 1, 2, …, J-I+1. Using these weights, add up the contributions of all subsequences to compute the expected number of each base at each motif position. Then sum the contributions of all sequences to obtain e_{k+1}(i,b).
Example (one sequence S = TAC, motif length I = 2):

L(S,1) = 0.1 · 0.1 · 0.25 = 0.0025,  w(S,1) = 0.045  (motif TA)
L(S,2) = 0.25 · 0.7 · 0.3 = 0.0525,  w(S,2) = 0.955  (motif AC)

pk(i,b):            q(b):       e_{k+1}(i,b):
    i=1   i=2                        i=1     i=2
A   0.7   0.1       A 0.25      A   0.955   0.045
C   0.1   0.3       C 0.25      C   0.000   0.955
G   0.1   0.3       G 0.25      G   0.000   0.000
T   0.1   0.3       T 0.25      T   0.045   0.000

Contribution of y=1 (w = 0.045): position 1: T +0.045; position 2: A +0.045
Contribution of y=2 (w = 0.955): position 1: A +0.955; position 2: C +0.955
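The E-step above can be sketched in Python on the same toy example, assuming the OOPS model (one motif occurrence per sequence); function names are illustrative:

```python
BASES = "ACGT"

def likelihood(seq, y, p, q):
    """Likelihood of `seq` with the motif starting at 0-based position y.
    p: list of dicts, one per motif column; q: background base frequencies."""
    I = len(p)
    L = 1.0
    for j, b in enumerate(seq):
        if y <= j < y + I:
            L *= p[j - y][b]      # position inside the motif
        else:
            L *= q[b]             # background position
    return L

def e_step(seqs, p, q):
    """Expected base counts e[i][b], summed over all sequences."""
    I = len(p)
    e = [{b: 0.0 for b in BASES} for _ in range(I)]
    for seq in seqs:
        Ls = [likelihood(seq, y, p, q) for y in range(len(seq) - I + 1)]
        total = sum(Ls)
        ws = [L / total for L in Ls]      # posterior weight of each start
        for y, w in enumerate(ws):
            for i in range(I):
                e[i][seq[y + i]] += w
    return e

# Toy example from the slide: S = TAC, motif length 2
p = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
     {"A": 0.1, "C": 0.3, "G": 0.3, "T": 0.3}]
q = {b: 0.25 for b in BASES}
e = e_step(["TAC"], p, q)
# e[0]["A"] ≈ 0.955, e[0]["T"] ≈ 0.045, e[1]["C"] ≈ 0.955, e[1]["A"] ≈ 0.045
```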


17
EM: Maximization Step
The maximum likelihood (ML) estimate for the new parameters p_{k+1}(i,b) is equal to the expected number of bases e_{k+1}(i,b) divided by the number of sequences.
Instead of ML, a posterior mean estimator (PME) may be used. The PME is based on a prior probability distribution over the parameter space. Example: the add-one prior (a uniform distribution over parameter space).
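The two estimator formulas were shown as images; reconstructed here from the surrounding definitions (N sequences with one motif occurrence each, four bases for the add-one prior) as a sketch rather than the slide's exact notation:

```latex
p_{k+1}(i,b) \;=\; \frac{e_{k+1}(i,b)}{N} \quad \text{(ML)},
\qquad
p_{k+1}(i,b) \;=\; \frac{e_{k+1}(i,b) + 1}{N + 4} \quad \text{(PME, add-one prior)}
```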
18
Variations of the basic EM algorithm implemented
in different motif discovery programs
  • Choice of initial model (seeding): start with different random models, start with over-represented words (converted to a matrix), try different motif lengths
  • Priors: for instance, add pseudo-counts proportional to the background base probabilities (maximum a posteriori (MAP) estimation)
  • Background models: higher-order Markov models; a first-order Markov model takes dinucleotide frequencies into account
  • Multiple motif search: find the N best motifs
  • Computation of statistical quality scores: E-values indicate whether a motif of a certain quality could have been found by chance in random sequences
19
OOPS, ZOOPS, and TCM
  • OOPS: one occurrence per sequence
  • the basic EM procedure
  • ZOOPS: zero or one occurrence per sequence
  • additional model parameter: the fraction of motif-containing sequences
  • E-step: sequence weighting in addition to position weighting
  • TCM: two-component mixture model
  • also called anr (any number of repetitions)
  • a sequence may contain zero to many occurrences of the motif
  • components: motif model, background model
  • additional model parameters: the component frequencies
  • the set of sequences is essentially treated as one long sequence
  • E-step: probabilistic assignment of overlapping subsequences to components

20
Gibbs sampling
Motivation: EM is not guaranteed to find the globally optimal motif, and EM is deterministic: same initial model → same result. Stochastic algorithms are also not guaranteed to find the global optimum, but the same initial model can give different results. Running a stochastic algorithm several times therefore gives a better chance of finding an optimal or near-optimal solution.
Gibbs sampling: a modification of the E-step that considers only one sequence position for computing the new base frequency matrix. The motif position is chosen by sampling from the probability distribution given by the sequence position weights w(S,j).
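A toy Gibbs motif sampler along these lines; all names and the planted-motif data are illustrative, and a uniform background with one occurrence per sequence is assumed:

```python
import random

BASES = "ACGT"

def profile_from(seqs, starts, I, exclude, pseudo=1.0):
    """Base-probability matrix from current motif positions, leaving out
    sequence `exclude` (add-one-style pseudocounts keep probabilities > 0)."""
    counts = [{b: pseudo for b in BASES} for _ in range(I)]
    for n, (seq, y) in enumerate(zip(seqs, starts)):
        if n == exclude:
            continue
        for i in range(I):
            counts[i][seq[y + i]] += 1
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]

def gibbs_motif(seqs, I, iters=200, seed=0):
    """Toy Gibbs sampler for one motif of length I (one occurrence/sequence)."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - I + 1) for s in seqs]
    for _ in range(iters):
        n = rng.randrange(len(seqs))               # pick one sequence
        p = profile_from(seqs, starts, I, exclude=n)
        seq = seqs[n]
        # weight every possible start by the profile likelihood
        ws = []
        for y in range(len(seq) - I + 1):
            w = 1.0
            for i in range(I):
                w *= p[i][seq[y + i]] / 0.25       # uniform background
            ws.append(w)
        # sample the new start from the weight distribution (not argmax)
        r = rng.random() * sum(ws)
        acc = 0.0
        for y, w in enumerate(ws):
            acc += w
            if r <= acc:
                starts[n] = y
                break
    return starts

# Toy data with a planted motif TATAA
seqs = ["GCGCTATAAGCGC", "AATATAACCGGTT", "GGGGTTATAAGGG"]
starts = gibbs_motif(seqs, I=5)
motifs = [s[y:y + 5] for s, y in zip(seqs, starts)]
```

Sampling (rather than taking the best position) is what lets the procedure escape local optima; with a fixed seed the run is reproducible, while different seeds explore different solutions.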
21
Do motif discovery algorithms work in practice?
  • Two recent studies suggest that the performance of motif discovery algorithms is poor.
  • Only about 20 to 30% of the motifs found and remapped to the training set correspond to transcription factor binding sites identified by experiments.
  • Possible reasons:
  • the heuristic motif discovery algorithms fail to find the optimal motif
  • the sequence sets are too small for the estimation of statistically robust models
  • the experimental data defining the binding sites used in these tests are flawed
  • The reasons for the poor performance are currently being investigated.
  • Take-home lessons:
  • use motif discovery programs with caution!
  • realistic benchmarking procedures are important (see next slide)

22
Limitations of Motif Discovery Algorithms
  • The following paper presents and discusses the
    outcome of a motif discovery competition

23
(No Transcript)
24
Performance indices used by Tompa et al. 2005 (2)
  • Specificity: xSp = xTP / (xTP + xFP)

25
Performance indices used by Tompa et al. 2005
26
Poor Performance of Motif Discovery Algorithms on Eukaryotic Benchmark Data Sets (Results from Tompa et al. 2005)
27
Similar results are obtained with prokaryotic
benchmark datasets
28
Programs used in Exercise
  • MEME
  • EM algorithm
  • supports oops, zoops, and tcm modes
  • slow but effective procedure to find good seeds (starting models)
  • uses higher-order Markov models as background models
  • MotifSampler
  • implements Gibbs sampling
  • uses higher-order Markov models as background models
  • can be run several times in parallel
  • MDScan
  • seeding step scans only the top t sequences for over-represented words
  • refinement: an EM-like (though not strictly probabilistic) iterative alignment procedure