Gibbs sampling - PowerPoint PPT Presentation

Slides: 27
Provided by: hat89

1
Gibbs sampling
2
The Motif Finding Problem
  • Given a set of DNA sequences
  • cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaat
    ctatgcgtttccaaccat
  • agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaac
    gctcagaaccagaagtgc
  • aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgat
    gtataagacgaaaatttt
  • agcctccgatgtaagtcatagctgtaactattacctgccacccctattac
    atcttacgtacgtataca
  • ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgct
    cgatcgttaacgtacgtc
  • Find the motif in each of the individual sequences

3
The Motif Finding Problem
  • If the starting positions s = (s1, s2, ..., st) are given,
    finding the consensus is easy: we can simply
    construct (and evaluate) the profile to find the
    motif.
  • But the starting positions s are usually not
    given. How can we find the best profile matrix?
  • Gibbs sampling
  • Expectation-Maximization algorithm
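As a minimal sketch of the "easy" case, the snippet below builds a profile and reads off the consensus once starting positions are known. The motif width of 8 and the string "acgtacgt" are assumptions made here only to obtain starting positions for the five example sequences; the helper names `profile` and `consensus` are hypothetical.

```python
from collections import Counter

# The five example sequences from the slides, with wrapped lines rejoined.
seqs = [
    "cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc",
]
MOTIF_LEN = 8  # assumed motif width

# Assumption: we look up "acgtacgt" only to stand in for "given" positions.
starts = [s.find("acgtacgt") for s in seqs]

def profile(seqs, starts, width):
    """Column-wise base counts over the aligned windows."""
    windows = [s[a:a + width] for s, a in zip(seqs, starts)]
    return [Counter(column) for column in zip(*windows)]

def consensus(prof):
    """Most frequent base in each profile column."""
    return "".join(col.most_common(1)[0][0] for col in prof)

prof = profile(seqs, starts, MOTIF_LEN)
print(starts)
print(consensus(prof))
```

Without the starting positions there is nothing to align, which is why the remaining slides turn to sampling.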

4
Notations
  • Set of symbols (the alphabet; for DNA, {A, C, G, T})
  • Sequences S = {S1, S2, ..., SN}
  • Starting positions of motifs A = {a1, a2, ..., aN}
  • Motif model (θ): qij = P(symbol at the i-th
    position = j)
  • Background model: pj = P(symbol = j)
  • Count of symbols in each column: cij = count of
    symbol j in the i-th column of the aligned
    region

5
Motif Finding Problem
  • Problem: find the starting positions and model
    parameters simultaneously to maximize the
    posterior probability
  • By Bayes' theorem, this is equivalent to maximizing
    the likelihood, assuming a uniform prior
    distribution

6
Equivalent Scoring Function
  • Maximize the log-odds ratio
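The formula on this slide did not survive in the transcript. A commonly used log-odds score for an alignment, written with the notation from the Notations slide, is sketched below; the motif width w is a symbol introduced here, and this form is an assumption rather than the slide's own equation.

```latex
F \;=\; \sum_{i=1}^{w} \sum_{j \in \{A,C,G,T\}} c_{ij} \,\log \frac{q_{ij}}{p_j}
```

Each column i contributes more when its observed symbols are frequent under the motif model q and rare under the background p.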

7
Sampling and optimization
  • To maximize a function f(x) (with f(x) ≥ 0):
  • Brute-force method: try all possible x
  • Sampling method: sample x from a probability
    distribution p(x) ∝ f(x)
  • Idea: suppose xmax is the argmax of f(x); then it is
    also the argmax of p(x), so xmax is the single most
    likely value to be drawn
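The idea can be illustrated with a toy example: draw x with probability proportional to f(x) and keep the best value seen. The score function `f` and the candidate range below are hypothetical, chosen only so that the maximum is easy to recognise.

```python
import random

def f(x):
    # Hypothetical non-negative score with its maximum at x = 7.
    return 1.0 / (1 + (x - 7) ** 2)

candidates = list(range(20))
weights = [f(x) for x in candidates]      # p(x) is proportional to f(x)

rng = random.Random(0)
draws = rng.choices(candidates, weights=weights, k=2000)

best = max(set(draws), key=f)                     # best value that was sampled
most_frequent = max(set(draws), key=draws.count)  # value sampled most often
print(best, most_frequent)
```

Here the candidate set is small enough for brute force; sampling pays off when the space is too large to enumerate but draws from p(x) are still available.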

8
Gibbs Sampling
  • Idea: a joint distribution may be hard to sample
    from, but it may be easy to sample from the
    conditional distributions in which all variables
    are fixed except one
  • To sample from p(x1, x2, ..., xn), let each state of
    the Markov chain represent (x1, x2, ..., xn). The
    probability of moving to a state that differs only
    in xi is p(xi | x1, ..., xi-1, xi+1, ..., xn). Gibbs
    sampling is a Markov Chain Monte Carlo (MCMC) method.
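A minimal sketch of this scheme on a two-variable binary distribution follows. The joint table `joint` is a made-up example; each step picks one coordinate at random and resamples it from its conditional given the other.

```python
import random
from collections import Counter

# Hypothetical joint distribution over (x1, x2), each in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def sample_conditional(state, i, rng):
    """Resample coordinate i from p(x_i | all other coordinates)."""
    options = []
    for value in (0, 1):
        candidate = list(state)
        candidate[i] = value
        options.append(tuple(candidate))
    # Joint probabilities of the options are proportional to the conditional.
    weights = [joint[c] for c in options]
    return rng.choices(options, weights=weights, k=1)[0]

def gibbs(n_steps, seed=0):
    """Run the chain and return the visit frequency of each state."""
    rng = random.Random(seed)
    state = (0, 1)                       # arbitrary starting state
    counts = Counter()
    for _ in range(n_steps):
        i = rng.randrange(2)             # choose which variable to update
        state = sample_conditional(state, i, rng)
        counts[state] += 1
    return {s: c / n_steps for s, c in counts.items()}

freqs = gibbs(50000)
print(freqs)
```

Over a long run the visit frequencies should approach the joint probabilities, even though the code only ever samples one variable at a time.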

9
Gibbs Sampling
10
Gibbs Sampling in Motif Finding
  • Randomly initialize A0
  • Repeat:
  • (1) randomly choose a sequence z from S
  • A* = At \ {az}
  • compute θt, the estimator of θ given S and A*
  • (2) sample az according to P(az = x), which is
    proportional to Qx/Px
  • update At+1 = A* ∪ {x}
  • Select the At that maximizes F

Qx: the probability of generating x according to θt
Px: the probability of generating x according to the
background model
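The loop above can be sketched in a few dozen lines. This is an illustrative version, not the presenter's implementation: the pseudocount of 0.5, the background estimated from overall base frequencies, the iteration count and all function names are assumptions.

```python
import math
import random

ALPHABET = "acgt"

def estimate_theta(seqs, starts, width, skip, pseudo=0.5):
    """Column frequencies q[i][j] from all aligned windows except `skip`."""
    counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
    for k, (s, a) in enumerate(zip(seqs, starts)):
        if k == skip:
            continue
        for i in range(width):
            counts[i][s[a + i]] += 1
    return [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts]

def background(seqs):
    """Background model p[j] from smoothed overall base frequencies."""
    joined = "".join(seqs)
    return {b: (joined.count(b) + 1) / (len(joined) + 4) for b in ALPHABET}

def log_odds(seqs, starts, width, bg):
    """Score F: sum of log(q/p) over every aligned symbol."""
    theta = estimate_theta(seqs, starts, width, skip=None)
    return sum(math.log(theta[i][s[a + i]] / bg[s[a + i]])
               for s, a in zip(seqs, starts) for i in range(width))

def gibbs_motif(seqs, width, n_iter=2000, seed=0):
    rng = random.Random(seed)
    bg = background(seqs)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]  # random A0
    best, best_f = list(starts), log_odds(seqs, starts, width, bg)
    for _ in range(n_iter):
        z = rng.randrange(len(seqs))                 # (1) choose a sequence
        theta = estimate_theta(seqs, starts, width, skip=z)
        s = seqs[z]
        weights = []
        for a in range(len(s) - width + 1):          # (2) weight = Qx / Px
            ratio = 1.0
            for i in range(width):
                ratio *= theta[i][s[a + i]] / bg[s[a + i]]
            weights.append(ratio)
        starts[z] = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        f = log_odds(seqs, starts, width, bg)
        if f > best_f:                               # keep the best alignment
            best, best_f = list(starts), f
    return best, best_f

seqs = [
    "cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc",
]
best, score = gibbs_motif(seqs, width=8)
for s, a in zip(seqs, best):
    print(a, s[a:a + 8])
```

The sampler is stochastic, so different seeds can settle on different alignments; running several restarts and keeping the highest F is a common precaution.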
11
Estimator of θ
  • Given an alignment A, i.e. the starting positions
    of the motifs, θ can be estimated by its MLE with
    smoothing (e.g. a Dirichlet prior with parameters
    bj)
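The slide's formula is missing from the transcript. One standard smoothed estimator consistent with the notation above is sketched here, assuming the counts cij come from the N - 1 sequences left in the alignment after one is removed and writing B for the sum of the pseudocounts (a symbol introduced here).

```latex
q_{ij} \;=\; \frac{c_{ij} + b_j}{N - 1 + B},
\qquad B = \sum_{j} b_j
```

For example, with N = 5, a count of c_ij = 3 and b_j = 0.25 for each of the four bases (so B = 1), this gives q_ij = 3.25 / 5 = 0.65.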

13
Gene Regulation
[Figure: Gene 1 through Gene 4 with their transcription factor binding sites, or motif instances]
14
Evolutionary Conservation
CACGTGACC
CACGTGAAC
CACGTGAAC
15
Overview of TGS
How did the motifs evolve?
How to find the ancestral motif instances?
Colored lines: regulatory regions of genes
Colored boxes: motif instances
16
How to find the ancestral motif instances?
17
How did the motifs evolve?
Background substitution matrix
       A       C       G       T
  A  0.8515  0.0278  0.0775  0.0432
  C  0.0464  0.8026  0.0344  0.1167
  G  0.1167  0.0350  0.8023  0.0460
  T  0.0429  0.0785  0.0264  0.8522
Motif substitution matrix
       A       C       G       T
  A  0.9802  0.0066  0.0066  0.0066
  C  0.0120  0.9640  0.0120  0.0120
  G  0.0120  0.0120  0.9640  0.0120
  T  0.0066  0.0066  0.0066  0.9802
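To see what the two matrices imply, the sketch below computes the probability of one 9-mer evolving into another under each model. It assumes that rows index the ancestral base and columns the descendant base, and that positions evolve independently; neither assumption is stated on the slide. The function name `evolve_prob` is hypothetical.

```python
BASES = "ACGT"

# Values copied from the slide; rows assumed to be the ancestral base.
BACKGROUND = {
    "A": [0.8515, 0.0278, 0.0775, 0.0432],
    "C": [0.0464, 0.8026, 0.0344, 0.1167],
    "G": [0.1167, 0.0350, 0.8023, 0.0460],
    "T": [0.0429, 0.0785, 0.0264, 0.8522],
}
MOTIF = {
    "A": [0.9802, 0.0066, 0.0066, 0.0066],
    "C": [0.0120, 0.9640, 0.0120, 0.0120],
    "G": [0.0120, 0.0120, 0.9640, 0.0120],
    "T": [0.0066, 0.0066, 0.0066, 0.9802],
}

def evolve_prob(ancestor, descendant, matrix):
    """P(descendant | ancestor), multiplying per-position substitutions."""
    p = 1.0
    for a, d in zip(ancestor, descendant):
        p *= matrix[a][BASES.index(d)]
    return p

anc = "CACGTGAAC"
for desc in ("CACGTGAAC", "CACGTGACC"):
    print(desc,
          evolve_prob(anc, desc, MOTIF),
          evolve_prob(anc, desc, BACKGROUND))
```

The motif matrix keeps far more mass on the diagonal, so a perfectly conserved site is much more probable under the motif model than under the background model.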
18
Evolution of motifs
  • Distant species

250 million years
19
Overview of Gibbs Sampler
Implementation
Iteratively sample each parameter from its conditional
distribution while the other parameters are held fixed.
20
Implementation
Parameters
21
Implementation
Prior distribution
Beta(1,1)
Poisson distribution
22
Implementation
Initialization
Parameters are sampled from their prior distributions
Motif instances in each current species are sampled
directly from that species' sequences
Each motif instance in an ancestral species is randomly
assigned one of its immediate child motif
instances.
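The ancestral part of this initialization can be sketched as a post-order walk over the species tree. The tree shape, the node names and the leaf instances below are all hypothetical; only the rule "copy one immediate child's instance at random" comes from the slide.

```python
import random

# Hypothetical tree: each ancestral node lists its immediate children.
children = {"root": ["anc1", "sp3"], "anc1": ["sp1", "sp2"]}

# Motif instances assumed to be already sampled for the current species.
instances = {"sp1": "CACGTGACC", "sp2": "CACGTGAAC", "sp3": "CACGTGAAC"}

def init_ancestors(node, children, instances, rng):
    """Give every ancestor a copy of one immediate child's motif instance."""
    if node in instances:                 # leaf, or ancestor already assigned
        return instances[node]
    child_instances = [init_ancestors(c, children, instances, rng)
                       for c in children[node]]
    instances[node] = rng.choice(child_instances)
    return instances[node]

rng = random.Random(0)
init_ancestors("root", children, instances, rng)
print(instances)
```

Children are resolved before their parent, so every ancestor has something to copy from by the time it is assigned.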
23
Implementation
Motif instance updating
Updating motif instances in ancestral species
Updating motif instances in current species
24
Implementation
Updating motif instance in ancestral species
Ancestral Motif Weight Matrix
     1     2     3     4     5     6     7     8     9
A  .036  .892  .036  .036  .036  .036  .892  .036  .036
C  .892  .036  .892  .036  .036  .036  .036  .75   .75
G  .036  .036  .036  .892  .036  .892  .036  .036  .036
T  .036  .036  .036  .036  .892  .036  .036  .178  .178
M11
M12
CCCGTGACC
CACGTGAAC
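One plausible way to resample an ancestral instance is sketched below: at each position, weight every base by its weight-matrix entry times the probability of that base evolving into each child's base. The transcript does not preserve the actual update formula, so this conditional is an assumption, as are the function names; the matrices and the two child instances are taken from the slides.

```python
import random

BASES = "ACGT"

# Ancestral motif weight matrix from the slide: WEIGHTS[base][position].
WEIGHTS = {
    "A": [.036, .892, .036, .036, .036, .036, .892, .036, .036],
    "C": [.892, .036, .892, .036, .036, .036, .036, .75, .75],
    "G": [.036, .036, .036, .892, .036, .892, .036, .036, .036],
    "T": [.036, .036, .036, .036, .892, .036, .036, .178, .178],
}
# Motif substitution matrix; rows assumed to be the ancestral base.
MOTIF_SUB = {
    "A": [0.9802, 0.0066, 0.0066, 0.0066],
    "C": [0.0120, 0.9640, 0.0120, 0.0120],
    "G": [0.0120, 0.0120, 0.9640, 0.0120],
    "T": [0.0066, 0.0066, 0.0066, 0.9802],
}

def ancestral_conditional(pos, child_instances):
    """Normalised weights over ACGT for the ancestral base at `pos`."""
    weights = []
    for b in BASES:
        w = WEIGHTS[b][pos]
        for child in child_instances:
            w *= MOTIF_SUB[b][BASES.index(child[pos])]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def sample_ancestor(child_instances, rng):
    """Draw each ancestral position independently from its conditional."""
    return "".join(
        rng.choices(BASES, weights=ancestral_conditional(i, child_instances), k=1)[0]
        for i in range(len(child_instances[0]))
    )

kids = ["CCCGTGACC", "CACGTGAAC"]
rng = random.Random(0)
print(ancestral_conditional(1, kids))
print(sample_ancestor(kids, rng))
```

At position 2 the children disagree (C versus A), and the weight matrix tips the conditional strongly toward A.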
25
Implementation
Updating motif instances for current species
Updated ancestral motif instance: CACTTGAAC
M11
M12
CACACCACGTGAGCTT...
CACATCACGTGAACTT
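A sketch of the corresponding update for a current species: slide a window along the sequence and sample a start position with weight equal to the probability that the ancestral instance evolved into that window. Using only the motif substitution matrix for the weight is a simplifying assumption (a fuller model would also account for the background), and `window_weights` is a hypothetical name.

```python
import random

BASES = "ACGT"
# Motif substitution matrix; rows assumed to be the ancestral base.
MOTIF_SUB = {
    "A": [0.9802, 0.0066, 0.0066, 0.0066],
    "C": [0.0120, 0.9640, 0.0120, 0.0120],
    "G": [0.0120, 0.0120, 0.9640, 0.0120],
    "T": [0.0066, 0.0066, 0.0066, 0.9802],
}

def window_weights(seq, ancestor):
    """P(window | ancestor) for every possible start position in seq."""
    width = len(ancestor)
    weights = []
    for start in range(len(seq) - width + 1):
        p = 1.0
        for a, d in zip(ancestor, seq[start:start + width]):
            p *= MOTIF_SUB[a][BASES.index(d)]
        weights.append(p)
    return weights

seq = "CACATCACGTGAACTT"
ancestor = "CACTTGAAC"           # updated ancestral instance from the slide
weights = window_weights(seq, ancestor)

rng = random.Random(0)
start = rng.choices(range(len(weights)), weights=weights, k=1)[0]
print(start, seq[start:start + len(ancestor)])
```

The window with the fewest mismatches against the ancestor dominates the weights, so it is drawn far more often than any other start position.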
26
Multiple Species?
?
CAGGTGATC
CACGTGAAC
CACGTGAAC
CACGTGAAC
CAGGTGATC
CACGTGATC