Title: Gibbs sampling
1Gibbs sampling
2The Motif Finding Problem
- Given a set of DNA sequences
- cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaat
ctatgcgtttccaaccat - agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaac
gctcagaaccagaagtgc - aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgat
gtataagacgaaaatttt - agcctccgatgtaagtcatagctgtaactattacctgccacccctattac
atcttacgtacgtataca - ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgct
cgatcgttaacgtacgtc - Find the motif in each of the individual sequences
3The Motif Finding Problem
- If starting positions s(s1, s2, st) are given,
finding consensus is easy because we can simply
construct (and evaluate) the profile to find the
motif. - But the starting positions s are usually not
given. How can we find the best profile matrix? - Gibbs sampling
- Expectation-Maximization algorithm
4Notations
- Set of symbols
- Sequences S S1, S2, , SN
- Starting positions of motifs A a1, a2, , aN
- Motif model ( ) qij P(symbol at the i-th
position j) - Background model pj P(symbol j)
- Count of symbols in each column cij count of
symbol, j, in the i-th column in the aligned
region
5Motif Finding Problem
- Problem find starting positions and model
parameters simultaneously to maximize the
posterior probability
- This is equivalent to maximizing the likelihood
by Bayes Theorem, assuming uniform prior
distribution
6Equivalent Scoring Function
- Maximize the log-odds ratio
7Sampling and optimization
- To maximize a function, f(x)
- Brute force method try all possible x
- Sample method sample x from probability
distribution p(x) f(x) - Idea suppose xmax is argmax of f(x), then it is
also argmax of p(x), thus we have a high
probability of selecting xmax
8Gibbs Sampling
- Idea a joint distribution may be hard to sample
from, but it may be easy to sample from the
conditional distributions where all variables are
fixed except one - To sample from p(x1, x2, xn), let each state of
the Markov chain represent (x1, x2, xn), the
probability of moving to a state (x1, x2, xn)
is p(xi x1, xi-1,xi1,xn). It is also called
Markov Chain Monte Carlo (MCMC) method.
9Gibbs Sampling
10Gibbs Sampling in Motif Finding
- Randomly initialize A0
- Repeat
- (1) randomly choose a sequence z from S
- A At \ az
- compute ?t estimator of ? given S and A
- (2) sample az according to P(az x), which is
proportional to Qx/Px - update At1 A union x
- Select At that maximizes F
Qx the probability of generating x according to
?t Px the probability of generating x
according to the background model
11Estimator of ?
- Given an alignment A, i.e. the starting positions
of motifs, ? can be estimated by its MLE with
smoothing (e.g. Dirichlet prior with parameter
bj)
12The Motif Finding Problem
- Given a set of DNA sequences
- cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaat
ctatgcgtttccaaccat - agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaac
gctcagaaccagaagtgc - aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgat
gtataagacgaaaatttt - agcctccgatgtaagtcatagctgtaactattacctgccacccctattac
atcttacgtacgtataca - ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgct
cgatcgttaacgtacgtc - Find the motif in each of the individual sequences
13Gene Regulation
Gene 1
Gene 2
Gene 3
Gene 4
Transcription factor binding site, or motif
instances
14Evolutionary Conservation
CACGTGACC
CACGTGAAC
CACGTGAAC
15Overview of TGS
How did the motifs evolve?
How to find the ancestral motif instances?
Colored lines regulatory regions of genes
Colored boxes motif instances
16How to find the ancestral motif instances?
17How did the motifs evolve?
Background substitution matrix
A C G T A 0.8515 0.0278 0.0775
0.0432 C 0.0464 0.8026 0.0344
0.1167 G 0.1167 0.0350 0.8023
0.0460 T 0.0429 0.0785 0.0264 0.8522
Motif substitution matrix
A C G T A 0.9802 0.0066 0.0066
0.0066 C 0.0120 0.9640 0.0120
0.0120 G 0.0120 0.0120 0.9640
0.0120 T 0.0066 0.0066 0.0066 0.9802
18Evolution of motifs
250 million years
19Overview of Gibbs Sampler
Implementation
Iteratively sample from conditional distribution
when other parameters are fixed.
20Implementation
Parameters
21Implementation
Prior distribution
Beta(1,1)
Poisson distribution
22Implementation
Initialization
Parameters are sampled using prior distributions
Motif instances in current species are sampled
from sequences directly for each current species
Motif instances in ancestral species are randomly
assigned with one of its immediate child motif
instances.
23Implementation
Motif instance updating
Updating motif instances in ancestral species
Updating motif instances in current species
24Implementation
Updating motif instance in ancestral species
Ancestral Motif Weight Matrix 1
2 3 4 5 6
7 8 9 A .036 .892 .036
.036 .036 .036 .892 .036 .036 C
.892 .036 .892 .036 .036 .036
.036 .75 .75 G .036 .036
.036 .892 .036 .892 .036 .036
.036 T .036 .036 .036 .036
.892 .036 .036 .178 .178
M11
M12
CCCGTGACC
CACGTGAAC
25Implementation
Updating motif instances for current species
Updated ancestral motif instance CACTTGAAC
M11
M12
CACACCACGTGAGCTT...
CACATCACGTGAACTT
26Multiple Species?
?
CAGGTGATC
CACGTGAAC
CACGTGAAC
CACGTGAAC
CAGGTGATC
CACGTGATC