Gibbs sampling - PowerPoint PPT Presentation

About This Presentation

Title:

Gibbs sampling

Description:

... A C G T A 0.8515 0.0278 0.0775 0.0432 C 0.0464 0.8026 0.0344 0.1167 G 0.1167 0.0350 0.8023 0.0460 T 0.0429 0.0785 0.0264 0.8522 Background ... – PowerPoint PPT presentation

Number of Views:100

Avg rating:3.0/5.0

Slides: 27

Provided by: hat89

Learn more at: http://darwin.informatics.indiana.edu

Category:

more less

Transcript and Presenter's Notes

Title: Gibbs sampling

1
Gibbs sampling
2
The Motif Finding Problem

Given a set of DNA sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaat
ctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaac
gctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgat
gtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattac
atcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgct
cgatcgttaacgtacgtc
Find the motif in each of the individual sequences

3
The Motif Finding Problem

If starting positions s(s1, s2, st) are given,
finding consensus is easy because we can simply
construct (and evaluate) the profile to find the
motif.
But the starting positions s are usually not
given. How can we find the best profile matrix?
Gibbs sampling
Expectation-Maximization algorithm

4
Notations

Set of symbols
Sequences S S1, S2, , SN
Starting positions of motifs A a1, a2, , aN
Motif model ( ) qij P(symbol at the i-th
position j)
Background model pj P(symbol j)
Count of symbols in each column cij count of
symbol, j, in the i-th column in the aligned
region

5
Motif Finding Problem

Problem find starting positions and model
parameters simultaneously to maximize the
posterior probability

This is equivalent to maximizing the likelihood
by Bayes Theorem, assuming uniform prior
distribution

6
Equivalent Scoring Function

Maximize the log-odds ratio

7
Sampling and optimization

To maximize a function, f(x)
Brute force method try all possible x
Sample method sample x from probability
distribution p(x) f(x)
Idea suppose xmax is argmax of f(x), then it is
also argmax of p(x), thus we have a high
probability of selecting xmax

8
Gibbs Sampling

Idea a joint distribution may be hard to sample
from, but it may be easy to sample from the
conditional distributions where all variables are
fixed except one
To sample from p(x1, x2, xn), let each state of
the Markov chain represent (x1, x2, xn), the
probability of moving to a state (x1, x2, xn)
is p(xi x1, xi-1,xi1,xn). It is also called
Markov Chain Monte Carlo (MCMC) method.

9
Gibbs Sampling
10
Gibbs Sampling in Motif Finding

Randomly initialize A0
Repeat
(1) randomly choose a sequence z from S
A At \ az
compute ?t estimator of ? given S and A
(2) sample az according to P(az x), which is
proportional to Qx/Px
update At1 A union x
Select At that maximizes F

Qx the probability of generating x according to
?t Px the probability of generating x
according to the background model
11
Estimator of ?

Given an alignment A, i.e. the starting positions
of motifs, ? can be estimated by its MLE with
smoothing (e.g. Dirichlet prior with parameter
bj)

12
The Motif Finding Problem

Given a set of DNA sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaat
ctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaac
gctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgat
gtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattac
atcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgct
cgatcgttaacgtacgtc
Find the motif in each of the individual sequences

13
Gene Regulation
Gene 1
Gene 2
Gene 3
Gene 4
Transcription factor binding site, or motif
instances
14
Evolutionary Conservation
CACGTGACC
CACGTGAAC
CACGTGAAC
15
Overview of TGS
How did the motifs evolve?
How to find the ancestral motif instances?
Colored lines regulatory regions of genes
Colored boxes motif instances
16
How to find the ancestral motif instances?
17
How did the motifs evolve?
Background substitution matrix
A C G T A 0.8515 0.0278 0.0775
0.0432 C 0.0464 0.8026 0.0344
0.1167 G 0.1167 0.0350 0.8023
0.0460 T 0.0429 0.0785 0.0264 0.8522
Motif substitution matrix
A C G T A 0.9802 0.0066 0.0066
0.0066 C 0.0120 0.9640 0.0120
0.0120 G 0.0120 0.0120 0.9640
0.0120 T 0.0066 0.0066 0.0066 0.9802
18
Evolution of motifs

Distant species

250 million years
19
Overview of Gibbs Sampler
Implementation
Iteratively sample from conditional distribution
when other parameters are fixed.
20
Implementation
Parameters
21
Implementation
Prior distribution
Beta(1,1)
Poisson distribution
22
Implementation
Initialization
Parameters are sampled using prior distributions
Motif instances in current species are sampled
from sequences directly for each current species
Motif instances in ancestral species are randomly
assigned with one of its immediate child motif
instances.
23
Implementation
Motif instance updating
Updating motif instances in ancestral species
Updating motif instances in current species
24
Implementation
Updating motif instance in ancestral species
Ancestral Motif Weight Matrix 1
2 3 4 5 6
7 8 9 A .036 .892 .036
.036 .036 .036 .892 .036 .036 C
.892 .036 .892 .036 .036 .036
.036 .75 .75 G .036 .036
.036 .892 .036 .892 .036 .036
.036 T .036 .036 .036 .036
.892 .036 .036 .178 .178
M11
M12
CCCGTGACC
CACGTGAAC
25
Implementation
Updating motif instances for current species
Updated ancestral motif instance CACTTGAAC
M11
M12
CACACCACGTGAGCTT...
CACATCACGTGAACTT
26
Multiple Species?
?
CAGGTGATC
CACGTGAAC
CACGTGAAC
CACGTGAAC
CAGGTGATC
CACGTGATC

Write a Comment

User Comments (0)