Motif Finding
Slides: 86
Provided by: fenBilk

1
Motif Finding
  • PSSMs
  • Expectation Maximization
  • Gibbs Sampling

2
Complexity of Transcription
3
Representing Binding Sites for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
  • A single site
  • AAGTTAATGA
  • A set of sites represented as a consensus
  • VDRTWRWWSHD (IUPAC degenerate DNA)

4
Nucleic acid codes
code description
A Adenine
C Cytosine
G Guanine
T Thymine
U Uracil
R Purine (A or G)
Y Pyrimidine (C, T, or U)
M C or A
K T, U, or G
W T, U, or A
S C or G
B C, T, U, or G (not A)
D A, T, U, or G (not C)
H A, T, U, or C (not G)
V A, C, or G (not T, not U)
N Any base (A, C, G, T, or U)
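The code table above can be turned into a small lookup for testing whether a site is compatible with a degenerate consensus. This is a minimal sketch: the names `IUPAC` and `matches` are illustrative, and U is left out on the assumption that only DNA is being matched.

```python
# IUPAC nucleotide codes for DNA (U is omitted; it takes the place of T in RNA).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "W": "AT", "S": "CG", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches(consensus, site):
    """Return True if every base of `site` is allowed by the consensus code."""
    if len(consensus) != len(site):
        return False
    return all(base in IUPAC[code] for code, base in zip(consensus, site))

print(matches("CGGNNNTANCGG", "CGGATATACCGG"))  # site agrees at every fixed position
print(matches("CGGNNNTANCGG", "CGGATATTCCGG"))  # T where the consensus requires A
```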
5
From frequencies to log scores
f matrix
A 5 0 1 0 0
C 0 2 2 4 0
G 0 3 1 0 4
T 0 0 1 1 1
w matrix
A 1.6 -1.7 -0.2 -1.7 -1.7
C -1.7 0.5 0.5 1.3 -1.7
G -1.7 1.0 -0.2 -1.7 1.3
T -1.7 -1.7 -0.2 -0.2 -0.2
$w(b,i) = \log \frac{f(b,i) + s(N)}{p(b)}$
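The conversion from the f matrix to the w matrix can be sketched as follows. The transcript does not define s(N), so this is an assumption: a pseudocount of sqrt(N)/4 per base, with each column normalized by N + sqrt(N), log base 2, and a uniform background p(b) = 0.25. Under those assumptions the result agrees with the w matrix to one decimal place. The function name `log_scores` is illustrative.

```python
import math

def log_scores(counts, background=0.25):
    """counts: dict base -> per-position counts. Returns log2 weights, 1 decimal."""
    n_sites = sum(row[0] for row in counts.values())   # N: sites in the alignment
    pseudo = math.sqrt(n_sites) / len(counts)          # assumed s(N) = sqrt(N)/4
    total = n_sites + math.sqrt(n_sites)               # counts plus all pseudocounts
    return {
        base: [round(math.log2((c + pseudo) / total / background), 1) for c in row]
        for base, row in counts.items()
    }

f = {"A": [5, 0, 1, 0, 0], "C": [0, 2, 2, 4, 0],
     "G": [0, 3, 1, 0, 4], "T": [0, 0, 1, 1, 1]}
w = log_scores(f)
for base in "ACGT":
    print(base, w[base])
```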
6
TFs do not act alone
http://www.bioinformatics.ca/
7
PSSMs for Liver TFs
HNF3
HNF1
HNF4
C/EBP
8
PSSMs for Helix-Turn-Helix Motif
9
Promoter
10
Promoter Weight Matrices (PWM)
11
E. coli PWMs
12
Motif Logo
Position 1234567
         TGGGGGA
         TGAGAGA
         TGGGGGA
         TGAGAGA
         TGAGGGA
  • Motifs can mutate on less important bases.
  • The five motifs at top right have mutations in
    positions 3 and 5.
  • Representations called motif logos illustrate the
    conserved regions of a motif.

http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html
13
Example Calmodulin-Binding Motif
(calcium-binding proteins)
14
Sequence Motifs
http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html
15
Regulatory Motifs
  • Transcription Factors bind to regulatory motifs
  • Motifs are 6 to 20 nucleotides long
  • Activators and repressors
  • Usually located near target gene, mostly upstream

16
Challenges
  • How to recognize a regulatory motif?
  • Can we identify new occurrences of known motifs
    in genome sequences?
  • Can we discover new motifs within upstream
    sequences of genes?

17
Motif Representation
  • Exact motif: CGGATATA
  • A consensus represents only the deterministic
    nucleotides.
  • Example: HAP1 binding sites in 5 sequences.
  • Consensus motif: CGGNNNTANCGG
  • N stands for any nucleotide.
  • Representing only the consensus loses information.
    How can this be avoided?

CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG
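The consensus line under the HAP1 sites can be derived mechanically: keep a base where all sites agree and write N elsewhere. A minimal sketch, with `strict_consensus` as an illustrative name:

```python
def strict_consensus(sites):
    """Keep a base where all aligned sites agree; write N everywhere else."""
    return "".join(
        column[0] if len(set(column)) == 1 else "N"
        for column in zip(*sites)
    )

hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]
print(strict_consensus(hap1_sites))  # CGGNNNTANCGG
```

The N positions are exactly where the information is lost: position 9, for example, is A in three of the five sites, which the consensus cannot express.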
18
PSPM: Position Specific Probability Matrix
  • Represents a motif of length k (here k = 5)
  • Count the number of occurrences of each
    nucleotide in each position

1 2 3 4 5
A 10 25 5 70 60
C 30 25 80 10 15
T 50 25 5 10 5
G 10 25 10 10 20
19
PSPM: Position Specific Probability Matrix
  • Defines P_i over {A, C, G, T} for i = 1, ..., k.
  • P_i(A): frequency of nucleotide A in position i.

1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
20
Identification of Known Motifs within Genomic
Sequences
  • Motivation
  • identification of new genes controlled by the
    same TF.
  • Infer the function of these genes.
  • enable better understanding of the regulation
    mechanism.

21
PSPM: Position Specific Probability Matrix
  • Each k-mer is assigned a probability.
  • Example: P(TCCAG) = 0.5 × 0.25 × 0.8 × 0.7 × 0.2 = 0.014

1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
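As a sketch of how the count table becomes a PSPM and how a k-mer is scored against it (function names `to_pspm` and `kmer_probability` are illustrative, and the counts are the ones from the earlier count table):

```python
counts = {
    "A": [10, 25, 5, 70, 60],
    "C": [30, 25, 80, 10, 15],
    "T": [50, 25, 5, 10, 5],
    "G": [10, 25, 10, 10, 20],
}

def to_pspm(counts):
    """Divide each count by its column total to obtain per-position frequencies."""
    k = len(next(iter(counts.values())))
    totals = [sum(row[i] for row in counts.values()) for i in range(k)]
    return {base: [row[i] / totals[i] for i in range(k)] for base, row in counts.items()}

def kmer_probability(pspm, kmer):
    """Multiply the per-position probabilities of the bases in `kmer`."""
    p = 1.0
    for i, base in enumerate(kmer):
        p *= pspm[base][i]
    return p

pspm = to_pspm(counts)
print(round(kmer_probability(pspm, "TCCAG"), 4))  # 0.014
```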
24
Detecting a Known Motif within a Sequence using
PSPM
  • The PSPM is moved along the query sequence.
  • At each position the sub-sequence is scored for a
    match to the PSPM.
  • Example
  • sequence ATGCAAGTCT
  • Position 1: ATGCA, 0.1 × 0.25 × 0.1 × 0.1 × 0.6 = 1.5 × 10^-4
  • Position 2: TGCAA, 0.5 × 0.25 × 0.8 × 0.7 × 0.6 = 0.042

1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
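The sliding-window scan described above might look like this in code. It is a sketch only; `scan` is an illustrative name and the matrix is the PSPM from the table.

```python
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def scan(pspm, sequence):
    """Return the PSPM probability of every window of length k in `sequence`."""
    k = len(pspm["A"])
    scores = []
    for start in range(len(sequence) - k + 1):
        p = 1.0
        for i, base in enumerate(sequence[start:start + k]):
            p *= pspm[base][i]
        scores.append(p)
    return scores

scores = scan(PSPM, "ATGCAAGTCT")
for position, p in enumerate(scores, start=1):
    print(position, "ATGCAAGTCT"[position - 1:position + 4], p)
```

A sequence of length 10 with k = 5 gives 10 - 5 + 1 = 6 windows; the first two reproduce the values for positions 1 and 2.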
25
Detecting a Known Motif within a Sequence using
PSSM
  • Is it a random match, or is it indeed an
    occurrence of the motif?
  • PSPM -> PSSM (Position Specific Scoring
    Matrix)
  • Odds score matrix O_i(n), where n ∈ {A, C, G, T}, for
    i = 1, ..., k
  • Defined as P_i(n)/P(n), where P(n) is the background
    frequency.
  • The higher O_i(n), the higher the odds that n at position
    i is part of a real motif.

26
PSSM as Odds Score Matrix
  • Assumption: the background frequency of each
    nucleotide is 0.25.
  • Original PSPM (P_i)
  • Odds matrix (O_i)
  • Going to log scale we get an additive score: the log
    odds matrix (log2 O_i)

P_i:
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
O_i:
1 2 3 4 5
A 0.4 1 0.2 2.8 2.4
log2 O_i:
1 2 3 4 5
A -1.322 0 -2.322 1.485 1.263
27
Calculating using Log Odds Matrix
  • Log odds ≤ 0 implies a random match; log odds > 0 implies
    a real match (?).
  • Example sequence: ATGCAAGTCT
  • Position 1: ATGCA, -1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7;
    odds = 2^-2.7 = 0.15
  • Position 2: TGCAA, 1 + 0 + 1.68 + 1.48 + 1.26 = 5.42;
    odds = 2^5.42 = 42.8
1 2 3 4 5
A -1.32 0 -2.32 1.48 1.26
C 0.26 0 1.68 -1.32 -0.74
T 1 0 -2.32 -1.32 -2.32
G -1.32 0 -1.32 -1.32 -0.32
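A sketch of building the log-odds matrix from the PSPM and scoring a window by summation, assuming the uniform background of 0.25 stated earlier. The names `log_odds_matrix` and `window_score` are illustrative.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def log_odds_matrix(pspm, background=0.25):
    """log2 of (position-specific probability / background frequency)."""
    return {b: [math.log2(p / background) for p in row] for b, row in pspm.items()}

def window_score(logodds, window):
    """Sum the log-odds entries picked out by the bases of `window`."""
    return sum(logodds[base][i] for i, base in enumerate(window))

L = log_odds_matrix(PSPM)
s1 = window_score(L, "ATGCA")
s2 = window_score(L, "TGCAA")
print(round(s1, 2), round(2 ** s1, 2))
print(round(s2, 2), round(2 ** s2, 1))
```

Without intermediate rounding, position 2 gives odds of about 43.0; the 42.8 on the slide follows from rounding the matrix entries to two decimals before summing.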
28
Calculating the probability of a match
  • ATGCAAG
  • Position 1: ATGCA, odds 0.15
  • Position 2: TGCAA, odds 42.8
  • Position 3: GCAAG, odds 0.18

P(1) = 0.003, P(2) = 0.992, P(3) = 0.004
P(i) = S_i / Σ_j S_j. Example: P(1) = 0.15 / (0.15 + 42.8 + 0.18) = 0.003
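The normalization step is a one-liner: each odds score is divided by the sum over all candidate positions. A minimal sketch, with `match_probabilities` as an illustrative name and the rounded odds for the three windows of ATGCAAG as input:

```python
def match_probabilities(odds):
    """Normalize per-position odds scores so that they sum to 1."""
    total = sum(odds)
    return [s / total for s in odds]

probs = match_probabilities([0.15, 42.8, 0.18])
for position, p in enumerate(probs, start=1):
    print(position, round(p, 3))
```

This treats the motif as occurring exactly once in the sequence, which is what makes the normalized values interpretable as probabilities of the start position.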
29
Building a PSSM
  • Collect all known sequences that bind a certain
    TF.
  • Align all sequences (using multiple sequence
    alignment).
  • Compute the frequency of each nucleotide in each
    position (PSPM).
  • Incorporate background frequency for each
    nucleotide (PSSM).

30
Finding new Motifs
  • We are given a group of genes, which presumably
    contain a common regulatory motif.
  • We know nothing of the TF that binds to the
    putative motif.
  • The problem: discover the motif.

31
Example
Predicting the cAMP Receptor Protein (CRP)
binding site motif
32
Extract experimentally defined CRP Binding Sites
GGATAACAATTTCACA
AGTGTGTGAGCGGATAACAA
AAGGTGTGAGTTAGCTCACTCCCC
TGTGATCTCTGTTACATAG
ACGTGCGAGGATGAGAACACA
ATGTGTGTGCTCGGTTTAGTTCACC
TGTGACACAGTGCAAACGCG
CCTGACGGAGTTCACA
AATTGTGAGTGTCTATAATCACG
ATCGATTTGGAATATCCATCACA
TGCAAAGGACGTCACGATTTGGG
AGCTGGCGACCTGGGTCATG
TGTGATGTGTATCGAACCGTGT
ATTTATTTGAACCACATCGCA
GGTGAGAGCCATCACAG
GAGTGTGTAAGCTGTGCCACG
TTTATTCCATGTCACGAGTGT
TGTTATACACATCACTAGTG
AAACGTGCTCCCACTCGCA
TGTGATTCGATTCACA
33
Create a Multiple Sequence Alignment
GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA
34
Generate a PSSM
A C G T
1 -0.43 0.1 -0.46 0.55
2 1.37 0.12 -1.59 -11.2
3 1.69 -1.28 -11.2 -1.43
4 -1.28 0.12 -11.2 1.32
5 0.91 -11.2 -0.46 0.47
6 1.53 -1.38 -1.48 -1.43
7 0.9 -0.48 -11.2 0.12
8 -1.37 -1.28 -11.2 1.68
9 -11.2 -11.2 1.73 -0.56
10 -11.2 -0.51 -11.2 1.72
11 -0.48 -11.2 1.72 -11.2
12 1.56 -1.59 -11.2 -0.46
13 -0.51 -0.38 -0.55 0.88
14 -11.2 0.5 0.57 0.13
15 0.17 -0.51 0.12 0.12
16 0.9 -11.2 0.5 -0.48
17 0.17 0.16 0.06 -0.48
18 -0.4 -0.38 0.82 -0.48
19 -1.38 -1.28 -11.2 1.68
20 -1.48 1.7 -11.2 -1.38
21 1.5 -1.38 -1.43 -1.28
35
Shannon Entropy
  • Expected variation per column can be calculated
  • Low entropy means higher conservation

36
Entropy
  • The entropy (H) for a column is computed from
  • a: a residue,
  • f_a: frequency of residue a in a column,
  • p_a: probability of residue a in that column
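The formula itself did not survive in this transcript. As an assumption, the sketch below uses the standard Shannon form, H = -Σ_a p_a log2 p_a, with the observed column frequencies standing in for the probabilities; `column_entropy` and `information_content` are illustrative names, and the 2-bit maximum applies to DNA only.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy, in bits, of one alignment column given as a string."""
    n = len(column)
    probs = [count / n for count in Counter(column).values()]
    return sum(-p * math.log2(p) for p in probs)

def information_content(column, max_bits=2.0):
    """Reduction in uncertainty relative to a uniform DNA column (log2(4) = 2)."""
    return max_bits - column_entropy(column)

for column in ("AAAA", "AACC", "ACGT"):
    print(column, column_entropy(column), information_content(column))
```

A fully conserved column has entropy 0, and a column with all four bases equally represented has the maximum of 2 bits, which matches the statement that low entropy means higher conservation.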

37
Entropy
  • entropy measures can determine which evolutionary
    distance (PAM250, BLOSUM80, etc) should be used
  • Entropy yields amount of information per column
    (discussed with sequence logos in a bit)

38
Log-odds score
  • Profiles can also indicate log-odds score
  • log2(observed / expected)
  • Result is a bit score

39
Matlab
  • multialign
  • 1. Enter an array of sequences.
  • seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};
  • 2. Promote terminations with gaps in the
    alignment.
  • multialign(seqs,'terminalGapAdjust',true)
  • ans =
  • --CACGTAACATCTC--
  • ACGACGTAACATCTTCT
  • -AAACGTAACATCTCGC

40
Matlab
  • 3. Compare alignment without termination gap
    adjustment.
  • multialign(seqs)
  • ans =
  • CA--CGTAACATCT--C
  • ACGACGTAACATCTTCT
  • AA-ACGTAACATCTCGC

41
Matlab
  • >> a = {'ATATAGGAG','AATTATAGA','TTAGAGAAA'}
  • >> a
  • 'ATATAGGAG' 'AATTATAGA' 'TTAGAGAAA'

42
Char function
  • >> cseq = char(a)
  • cseq =
  • ATATAGGAG
  • AATTATAGA
  • TTAGAGAAA

43
Double function
  • >> intseq = double(cseq)
  • intseq =
  • 65 84 65 84 65 71 71 65 71
  • 65 65 84 84 65 84 65 71 65
  • 84 84 65 71 65 71 65 65 65

44
double
  • >> double('A')
  • ans =
  • 65
  • >> double('C')
  • ans =
  • 67
  • >> double('G')
  • ans =
  • 71
  • >> double('T')
  • ans =
  • 84

45
Initiate PSPM matrix
  • >> Pspm = zeros(4,length(intseq))
  • Pspm =
  • 0 0 0 0 0 0 0 0 0
  • 0 0 0 0 0 0 0 0 0
  • 0 0 0 0 0 0 0 0 0
  • 0 0 0 0 0 0 0 0 0

46
Use a for loop to count each nucleotide at each
position
  • >> for i = 1:length(intseq)
  • Pspm(1,i) = length(find(intseq(:,i)==65));  % count of A in column i
  • Pspm(2,i) = length(find(intseq(:,i)==67));  % count of C in column i
  • Pspm(3,i) = length(find(intseq(:,i)==71));  % count of G in column i
  • Pspm(4,i) = length(find(intseq(:,i)==84));  % count of T in column i
  • end
  • >> Pspm
  • Pspm =
  • 2 1 2 0 3 0 2 2 2
  • 0 0 0 0 0 0 0 0 0
  • 0 0 0 1 0 2 1 1 1
  • 1 2 1 2 0 1 0 0 0

47
Add pseudocounts
  • >> Pspmp = Pspm + 1
  • Pspmp =
  • 3 2 3 1 4 1 3 3 3
  • 1 1 1 1 1 1 1 1 1
  • 1 1 1 2 1 3 2 2 2
  • 2 3 2 3 1 2 1 1 1

48
Normalize to get frequencies
  • >> Pspmnorm = Pspmp./repmat(sum(Pspmp),4,1)
  • Pspmnorm =
  • Columns 1 through 7
  • 0.4286 0.2857 0.4286 0.1429 0.5714 0.1429 0.4286
  • 0.1429 0.1429 0.1429 0.1429 0.1429 0.1429 0.1429
  • 0.1429 0.1429 0.1429 0.2857 0.1429 0.4286 0.2857
  • 0.2857 0.4286 0.2857 0.4286 0.1429 0.2857 0.1429
  • Columns 8 through 9
  • 0.4286 0.4286
  • 0.1429 0.1429
  • 0.2857 0.2857
  • 0.1429 0.1429

49
Calculate odds score
  • >> Pswm = Pspmnorm/0.25
  • Pswm =
  • Columns 1 through 7
  • 1.7143 1.1429 1.7143 0.5714 2.2857 0.5714 1.7143
  • 0.5714 0.5714 0.5714 0.5714 0.5714 0.5714 0.5714
  • 0.5714 0.5714 0.5714 1.1429 0.5714 1.7143 1.1429
  • 1.1429 1.7143 1.1429 1.7143 0.5714 1.1429 0.5714
  • Columns 8 through 9
  • 1.7143 1.7143
  • 0.5714 0.5714
  • 1.1429 1.1429
  • 0.5714 0.5714

50
Log odds ratio
  • >> logPswm = log2(Pswm)
  • logPswm =
  • Columns 1 through 7
  • 0.7776 0.1926 0.7776 -0.8074 1.1926 -0.8074 0.7776
  • -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
  • -0.8074 -0.8074 -0.8074 0.1926 -0.8074 0.7776 0.1926
  • 0.1926 0.7776 0.1926 0.7776 -0.8074 0.1926 -0.8074
  • Columns 8 through 9
  • 0.7776 0.7776
  • -0.8074 -0.8074
  • 0.1926 0.1926
  • -0.8074 -0.8074

51
Estimate the probability of the given sequence to
belong to the defined PSWM
  • >> Unknown = 'TTAAGAAGG'
  • Unknown =
  • TTAAGAAGG
  • >> intunknown = double(Unknown)
  • intunknown =
  • 84 84 65 65 71 65 65 71 71

52
Get the index of the PSWM for the unknown sequence
  • >> for i = 1:length(intunknown)
  • A = find(intunknown==65);
  • intunknown(A) = 1;
  • C = find(intunknown==67);
  • intunknown(C) = 2;
  • G = find(intunknown==71);
  • intunknown(G) = 3;
  • T = find(intunknown==84);
  • intunknown(T) = 4;
  • end
  • >> intunknown
  • intunknown =
  • 4 4 1 1 3 1 1 3 3

53
Calculate the log odds-ratio of the Unknown
'TTAAGAAGG'
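  • >> idx = sub2ind(size(logPswm), intunknown, 1:length(intunknown));  % pair each row index with its own column
  • >> logunknown = logPswm(idx)  % logPswm(intunknown) alone would index linearly and read only column 1
  • logunknown =
  • Columns 1 through 7
  • 0.1926 0.7776 0.7776 -0.8074 -0.8074 -0.8074 0.7776
  • Columns 8 through 9
  • 0.1926 0.1926
  • >> Punknown = sum(logunknown)
  • Punknown =
  • 0.4887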

54
Is this a significant score or just random
similarity?
  • >> cseq
  • cseq =
  • ATATAGGAG
  • AATTATAGA
  • TTAGAGAAA
  • >> Unknown
  • Unknown =
  • TTAAGAAGG

55
What would be the maximum score?
  • >> logPswm
  • logPswm =
  • Columns 1 through 7
  • 0.7776 0.1926 0.7776 -0.8074 1.1926 -0.8074 0.7776
  • -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
  • -0.8074 -0.8074 -0.8074 0.1926 -0.8074 0.7776 0.1926
  • 0.1926 0.7776 0.1926 0.7776 -0.8074 0.1926 -0.8074
  • Columns 8 through 9
  • 0.7776 0.7776
  • -0.8074 -0.8074
  • 0.1926 0.1926
  • -0.8074 -0.8074
  • >> maxscore = max(logPswm)
  • maxscore =
  • Columns 1 through 7
  • 0.7776 0.7776 0.7776 0.7776 1.1926 0.7776 0.7776
  • Columns 8 through 9
  • 0.7776 0.7776

56
Write a function using the above statements to
scan a sequence
  • Write a function named logodds that calculates
    the log-odds ratio of a given alignment.
  • Write a function named scanmotif that calls the
    logodds to search through a sequence using a
    sliding window to calculate the logodds of a
    subsequence and store these scores. The function
    should allow for selection of a maximum number of
    locations that are likely to contain the motif
    based on the scores obtained.
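One possible shape for the two functions, sketched in Python rather than MATLAB so it can be run without the toolbox. The signatures, the add-one pseudocount, and the uniform background of 0.25 are assumptions carried over from the worked session, not a prescribed solution.

```python
import math

BASES = "ACGT"

def logodds(alignment, pseudocount=1, background=0.25):
    """Log2-odds matrix (dict base -> list) from equal-length aligned sites."""
    k = len(alignment[0])
    total = len(alignment) + pseudocount * len(BASES)
    return {
        b: [math.log2((sum(site[i] == b for site in alignment) + pseudocount)
                      / total / background) for i in range(k)]
        for b in BASES
    }

def scanmotif(alignment, sequence, max_hits=3):
    """Score every window of `sequence`; return the best (score, start) pairs."""
    matrix = logodds(alignment)
    k = len(alignment[0])
    scores = [
        (sum(matrix[base][i] for i, base in enumerate(sequence[start:start + k])), start)
        for start in range(len(sequence) - k + 1)
    ]
    return sorted(scores, reverse=True)[:max_hits]

sites = ["ATATAGGAG", "AATTATAGA", "TTAGAGAAA"]
matrix = logodds(sites)
score = sum(matrix[base][i] for i, base in enumerate("TTAAGAAGG"))
print(round(score, 4))
print(scanmotif(sites, "CCCATATAGGAGCCC", max_hits=2))
```

Here every base of TTAAGAAGG is looked up in its own column of the matrix, which gives a summed score of about 0.49. Start positions returned by `scanmotif` are 0-based.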

57
Position Specific Scoring Matrix (PSSM)
  • incorporate information theory to indicate
    information contained within each column of a
    multiple alignment.
  • information is a logarithmic transformation of
    the frequency of each residue in the motif

58
PSSMs and Pseudocounts
  • Problem: PSSMs are only as good as the initial
    MSA
  • Some residues may be underrepresented
  • Other columns may be too conserved
  • Solution: introduce pseudocounts to get a better
    indication
59
Pseudocounts
  • New estimated probability: $P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$
  • $P_{ca}$: probability of residue a in column c
  • $n_{ca}$: count of a's in column c
  • $b_{ca}$: pseudocount of a's in column c
  • $N_c$: total count in column c
  • $B_c$: total pseudocount in column c
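With the symbols defined above, the usual pseudocount estimate is (n_ca + b_ca) / (N_c + B_c); that form is assumed here. The sketch applies it to a single column with a pseudocount of 1 per nucleotide, so B_c = 4. The function name is illustrative.

```python
def estimated_probability(n_ca, b_ca, N_c, B_c):
    """Pseudocount-adjusted probability of residue a in column c."""
    return (n_ca + b_ca) / (N_c + B_c)

column = "AAT"  # first column of a three-site alignment
for residue in "ACGT":
    p = estimated_probability(column.count(residue), 1, len(column), 4)
    print(residue, round(p, 4))
```

Residues that never occur in the column (C and G here) receive 1/7 instead of 0, so their log-odds scores stay finite.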

60
PSSMs and pseudocounts
  • probabilities converted into a log-odds form
    (usually log2 so the information can be reported
    in bits) and placed in the PSSM.

61
Searching PSSMs
  • value for the first residue in the sequence
    occurring in the first column is calculated by
    searching the PSSM
  • the value for the residue occurring in each
    column is calculated

62
Searching PSSMs
  • values are added (since they are logarithms) to
    produce a summed log odds score, S
  • S can be converted to an odds score using the
    formula 2^S
  • odds scores for each position can be summed
    together and normalized to produce a probability
    of the motif occurring at each location.

63
Information in PSSMs
  • Information theory: the amount of information
    contained within each sequence.
  • With no information, the amount of uncertainty can be
    measured as log2(20) = 4.32 for amino acids, since
    there are 20 amino acids. For nucleic acid
    sequences, the amount of uncertainty can be
    measured as log2(4) = 2.

64
Information in PSSMs
  • If a column is completely conserved then the
    uncertainty is 0: there is only one choice.
  • If two residues occur with equal probability, there is
    uncertainty in deciding which residue it is
    (log2(2) = 1 bit).

65
Measure of Uncertainty
  • Measured as the entropy

66
Relative Entropy
  • Relative entropy takes into account the overall
    composition of the organism being studied
  • B_a is the background frequency of residue a in the
    organism
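The relative entropy formula is missing from this transcript. The sketch below assumes the standard form, Σ_a p_a log2(p_a / B_a), with observed column frequencies used for p_a; the AT-rich background values are made up for illustration.

```python
import math
from collections import Counter

def relative_entropy(column, background):
    """Relative entropy, in bits, of a column against background frequencies."""
    n = len(column)
    return sum(
        (count / n) * math.log2((count / n) / background[a])
        for a, count in Counter(column).items()
    )

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # hypothetical genome composition

print(relative_entropy("AAAA", uniform))
print(relative_entropy("AAAA", at_rich))
```

A conserved A is less informative in an A-rich genome: the same column scores 2 bits against the uniform background and about 1.32 bits against the AT-rich one.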

67
PSSM Uncertainty
  • Uncertainty for whole model is summed over all
    columns

68
Sequence Logos
  • Information in PSSMs can be viewed visually
  • Sequence logos illustrate information in each
    column of a motif
  • height of logo is calculated as the amount by
    which uncertainty has been decreased

69
Sequence Logos
70
Statistical Methods
  • Commonly used methods for locating motifs
  • Expectation-Maximization (EM)
  • Gibbs Sampling

71
Expectation-Maximization
  • Begin with set of sequences with an unknown
    signal in common
  • Signal may be subtle
  • Approximate length of signal must be given
  • Randomly assign locations of this motif in each
    sequence

72
Expectation-Maximization
  • Two steps
  • Expectation Step
  • Maximization Step

73
Expectation-Maximization
  • Expectation step
  • Residue Frequencies for each position calculated
  • Residues not in a motif are background
  • Frequencies used to determine probability of
    finding site at any position in a sequence to fit
    motif model

74
Maximization Step
  • Determine location for each sequence that
    maximally aligns to the motif pattern
  • Once a new motif location is found for each sequence,
    the motif pattern is revised in the expectation step
  • E-M continues until the solution converges
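The loop described on these slides can be sketched as follows. This is a simplified hard-assignment version: each sequence contributes only its single best-scoring window, and the background is assumed uniform at 0.25 rather than estimated from the residues outside the motif. Full EM implementations such as MEME weight every candidate position instead. All names are illustrative.

```python
import math
import random

BASES = "ACGT"

def profile(sequences, starts, width, pseudocount=1.0):
    """Expectation step: per-column residue frequencies of the current sites."""
    cols = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for seq, start in zip(sequences, starts):
            counts[seq[start + i]] += 1
        total = sum(counts.values())
        cols.append({b: counts[b] / total for b in BASES})
    return cols

def best_start(seq, cols, background=0.25):
    """Maximization step for one sequence: the window with the best log-odds."""
    width = len(cols)
    def score(start):
        return sum(math.log2(cols[i][seq[start + i]] / background) for i in range(width))
    return max(range(len(seq) - width + 1), key=score)

def em_motif(sequences, width, starts=None, max_iter=50, seed=0):
    """Alternate the two steps until the motif locations stop changing."""
    rng = random.Random(seed)
    if starts is None:
        starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(max_iter):
        cols = profile(sequences, starts, width)
        new_starts = [best_start(s, cols) for s in sequences]
        if new_starts == starts:
            break
        starts = new_starts
    return starts

toy = ["CCTGACGTCAGG", "TGACGTCATTTT", "AAAATGACGTCA", "GTTGACGTCACC"]
print(em_motif(toy, width=8))
```

Because the result depends on the random initial locations, the procedure can settle on a local optimum; running it from several seeds and keeping the best-scoring profile is a common remedy.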

75
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA
CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT
TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC
GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC
CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG
TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA
CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC
ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGT
AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC
CCAGCACACACACTTATCCAGTGGTAAATACACATCAT
TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT
ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA
TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAG
CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA
CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA
GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT
TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT
GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT
CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG
CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCA
GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG
CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC
76
Residue Counts
  • Given motif alignment, count for each location is
    calculated

77
Residue Frequencies
  • The counts are then converted to frequencies

78
Example Maximization Step
  • Consider the first sequence
  • TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
  • There are 41 residues, so 41 - 6 + 1 = 36 sites to
    consider

79
MEME Software
  • One of three motif models
  • OOPS: one expected occurrence per sequence
  • ZOOPS: zero or one expected occurrence per
    sequence
  • TCM: any number of occurrences of the motif

80
Gibbs Sampling
  • Similar to E-M algorithm
  • Combines E-M and simulated annealing
  • Goal: find the most probable pattern by sampling from
    motif probabilities to maximize the ratio of
    model:background probabilities

81
Predictive Update Step
  • random motif start position chosen for all
    sequences except one
  • Initial alignment used to calculate residue
    frequencies for motif and background
  • similar to the Expectation Step of EM

82
Sampling Step
  • ratio of model:background probabilities
    normalized and weighted
  • motif start position chosen based on a random
    sampling with the given weights
  • Different than E-M algorithm

83
Gibbs Sampling
  • process repeated until residue frequencies in
    each column do not change
  • The sampling step is then repeated for a
    different initial random alignment
  • Sampling allows escape from local maxima
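The predictive update and sampling steps can be sketched as below. This is a bare-bones version under stated assumptions: a uniform background of 0.25 instead of one estimated from the non-motif residues, a fixed number of iterations instead of a convergence test, and no shifting or motif-width search. `gibbs_motif` is an illustrative name.

```python
import math
import random

BASES = "ACGT"

def gibbs_motif(sequences, width, n_iter=200, pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(n_iter):
        held_out = rng.randrange(len(sequences))
        # Predictive update: build the profile from every sequence except one.
        cols = []
        for i in range(width):
            counts = {b: pseudocount for b in BASES}
            for j, (seq, start) in enumerate(zip(sequences, starts)):
                if j != held_out:
                    counts[seq[start + i]] += 1
            total = sum(counts.values())
            cols.append({b: counts[b] / total for b in BASES})
        # Sampling step: weight each window by its model:background odds,
        # then draw the new start position at random with those weights.
        seq = sequences[held_out]
        weights = [
            math.prod(cols[i][seq[start + i]] / 0.25 for i in range(width))
            for start in range(len(seq) - width + 1)
        ]
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

toy = ["CCTGACGTCAGG", "TGACGTCATTTT", "AAAATGACGTCA", "GTTGACGTCACC"]
print(gibbs_motif(toy, width=8, seed=1))
```

The difference from the E-M sketch is the last line of the loop: the new position is drawn at random in proportion to its weight rather than taken as the maximum, which is what lets the sampler leave a local maximum.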

84
Gibbs Sampling
  • Dirichlet priors (pseudocounts) are added into
    the nucleotide counts to improve performance
  • shifting routine shifts motif a few bases to the
    left or the right
  • A range of motif sizes is checked

85
Gibbs Sampler Web Interface
  • http://bayesweb.wadsworth.org/gibbs/gibbs.html