Motif Finding

Provided by: fenBilk

1
Motif Finding

PSSMs
Expectation Maximization
Gibbs Sampling

2
Complexity of Transcription
3
Representing Binding Sites for a TF
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA

A single site
AAGTTAATGA

A set of sites represented as a consensus
VDRTWRWWSHD (IUPAC degenerate DNA)

4
Nucleic acid codes
code description
A Adenine
C Cytosine
G Guanine
T Thymine
U Uracil
R Purine (A or G)
Y Pyrimidine (C, T, or U)
M C or A
K T, U, or G
W T, U, or A
S C or G
B C, T, U, or G (not A)
D A, T, U, or G (not C)
H A, T, U, or C (not G)
V A, C, or G (not T, not U)
N Any base (A, C, G, T, or U)
5
From frequencies to log scores
f matrix
A 5 0 1 0 0
C 0 2 2 4 0
G 0 3 1 0 4
T 0 0 1 1 1
w matrix
A 1.6 -1.7 -0.2 -1.7 -1.7
C -1.7 0.5 0.5 1.3 -1.7
G -1.7 1.0 -0.2 -1.7 1.3
T -1.7 -1.7 -0.2 -0.2 -0.2
$$w(b,i) = \log \frac{f(b,i) + s(N)}{p(b)}$$
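The step from the f matrix to the w matrix can be sketched in Python. The slide does not say how the pseudocount term s(N) is chosen, so the sketch makes three assumptions: a total pseudocount of sqrt(N) shared among the bases according to the background, a uniform background p(b) = 0.25, and N = 5 sites (the column sums of f). Under those assumptions the result matches the w matrix on the slide to one decimal place.

```python
import math

# Count matrix f from the slide: rows are bases, columns are motif positions.
f = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
background = {b: 0.25 for b in "ACGT"}  # assumed uniform p(b)

def weight_matrix(counts, bg):
    """Convert counts to log2 scores using an assumed sqrt(N) pseudocount."""
    n_sites = sum(row[0] for row in counts.values())  # N, sites per column
    pseudo_total = math.sqrt(n_sites)                 # assumption, not from the slide
    w = {}
    for base, row in counts.items():
        w[base] = []
        for c in row:
            prob = (c + pseudo_total * bg[base]) / (n_sites + pseudo_total)
            w[base].append(math.log2(prob / bg[base]))
    return w

w = weight_matrix(f, background)
for base in "ACGT":
    print(base, [round(x, 1) for x in w[base]])
```

A different pseudocount choice shifts the values for unseen bases (the -1.7 entries) the most, since those depend entirely on the pseudocount.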
6
TFs do not act alone
http://www.bioinformatics.ca/
7
PSSMs for Liver TFs
HNF3
HNF1
HNF4
C/EBP
8
PSSMs for Helix-Turn-Helix Motif
9
Promoter
10
Promoter Weight Matrices (PWM)
11
E. coli PWMs
12
Motif Logo
Position 1234567
         TGGGGGA
         TGAGAGA
         TGGGGGA
         TGAGAGA
         TGAGGGA

Motifs can mutate on less important bases.
The five motifs at top right have mutations in
position 3 and 5.
Representations called motif logos illustrate the
conserved regions of a motif.

http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html
13
Example Calmodulin-Binding Motif
(calcium-binding proteins)
14
Sequence Motifs
http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html
15
Regulatory Motifs

Transcription Factors bind to regulatory motifs
Motifs are 6 to 20 nucleotides long
Activators and repressors
Usually located near target gene, mostly upstream

16
Challenges

How to recognize a regulatory motif?
Can we identify new occurrences of known motifs
in genome sequences?
Can we discover new motifs within upstream
sequences of genes?

17
Motif Representation

Exact motif: CGGATATA
A consensus represents only the deterministic nucleotides.
Example: HAP1 binding sites in 5 sequences.
Consensus motif: CGGNNNTANCGG
N stands for any nucleotide.
Representing only the consensus loses information.
How can this be avoided?

CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG
18
PSPM: Position Specific Probability Matrix

Represents a motif of length k (here k = 5)
Count the number of occurrences of each
nucleotide in each position

1 2 3 4 5
A 10 25 5 70 60
C 30 25 80 10 15
T 50 25 5 10 5
G 10 25 10 10 20
19
PSPM: Position Specific Probability Matrix

Defines P_i(n), n ∈ {A,C,G,T}, for i = 1,...,k.
P_i(A): frequency of nucleotide A in position i.

1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
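Going from the count matrix to the probability matrix is a per-column normalization. A minimal Python sketch using the counts from the previous slide (the function name `to_pspm` is only illustrative):

```python
# Counts per nucleotide (rows) and motif position (columns), from the slide.
counts = {
    "A": [10, 25, 5, 70, 60],
    "C": [30, 25, 80, 10, 15],
    "T": [50, 25, 5, 10, 5],
    "G": [10, 25, 10, 10, 20],
}

def to_pspm(counts):
    """Divide each count by its column total to get P_i(n)."""
    k = len(next(iter(counts.values())))
    totals = [sum(counts[b][i] for b in counts) for i in range(k)]
    return {b: [counts[b][i] / totals[i] for i in range(k)] for b in counts}

pspm = to_pspm(counts)
for base, row in pspm.items():
    print(base, row)
```

Every column of the result sums to 1, which is what lets a column be read as a probability distribution over the four nucleotides.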
20
Identification of Known Motifs within Genomic
Sequences

Motivation
identification of new genes controlled by the
same TF.
Infer the function of these genes.
enable better understanding of the regulation
mechanism.

21
PSPM: Position Specific Probability Matrix

Each k-mer is assigned a probability.
Example: P(TCCAG) = 0.5 × 0.25 × 0.8 × 0.7 × 0.2 = 0.014

1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
22
Detecting a Known Motif within a Sequence using
PSPM

The PSPM is moved along the query sequence.
At each position the sub-sequence is scored for a
match to the PSPM.
Example sequence: ATGCAAGTCT

1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
23
Detecting a Known Motif within a Sequence using
PSPM

The PSPM is moved along the query sequence.
At each position the sub-sequence is scored for a
match to the PSPM.
Example sequence: ATGCAAGTCT
Position 1 ATGCA: 0.1 × 0.25 × 0.1 × 0.1 × 0.6 = 1.5 × 10^-4

1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
24
Detecting a Known Motif within a Sequence using
PSPM

The PSPM is moved along the query sequence.
At each position the sub-sequence is scored for a
match to the PSPM.
Example sequence: ATGCAAGTCT
Position 1 ATGCA: 0.1 × 0.25 × 0.1 × 0.1 × 0.6 = 1.5 × 10^-4
Position 2 TGCAA: 0.5 × 0.25 × 0.8 × 0.7 × 0.6 = 0.042

1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
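The sliding-window scan described above can be sketched in Python as follows. The matrix is the PSPM from the slide; the function names are illustrative only.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def kmer_probability(kmer, pspm):
    """Product of the per-position probabilities of each base in the k-mer."""
    return math.prod(pspm[base][i] for i, base in enumerate(kmer))

def scan(sequence, pspm):
    """Return (1-based position, k-mer, probability) for every window."""
    k = len(pspm["A"])
    return [(start + 1, sequence[start:start + k],
             kmer_probability(sequence[start:start + k], pspm))
            for start in range(len(sequence) - k + 1)]

hits = scan("ATGCAAGTCT", PSPM)
for pos, kmer, p in hits:
    print(pos, kmer, f"{p:.2e}")
```

A sequence of length 10 scanned with a matrix of width 5 yields 10 - 5 + 1 = 6 windows; of these, the window at position 2 (TGCAA) has the highest probability.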
25
Detecting a Known Motif within a Sequence using
PSSM

Is it a random match, or is it indeed an
occurrence of the motif?
PSPM -> PSSM (Position Specific Scoring
Matrix)
Odds score matrix O_i(n), where n ∈ {A,C,G,T} and
i = 1,...,k,
defined as P_i(n)/P(n), where P(n) is the background
frequency of n.
As O_i(n) increases, the odds are higher that n at
position i is part of a real motif.

26
PSSM as Odds Score Matrix

Assumption: the background frequency of each
nucleotide is 0.25.

Original PSPM (P_i)
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6

Odds Matrix (O_i)
1 2 3 4 5
A 0.4 1 0.2 2.8 2.4

Going to log scale we get an additive score:
Log odds Matrix (log2 O_i)
1 2 3 4 5
A -1.322 0 -2.322 1.485 1.263
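A short Python sketch of the same conversion, applied to all four rows of the PSPM rather than only the A row. It assumes the uniform background of 0.25 stated on the slide; the function name is illustrative.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def log_odds(pspm, background=0.25):
    """log2 of P_i(n) / P(n) for every base n and position i."""
    return {b: [math.log2(p / background) for p in row]
            for b, row in pspm.items()}

pssm = log_odds(PSPM)
for base, row in pssm.items():
    print(base, [round(x, 3) for x in row])
```

Because the scores are logarithms, the score of a k-mer becomes a sum over positions instead of a product.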
27
Calculating using Log Odds Matrix

Log odds ≤ 0 implies a random match; log odds > 0
implies a real match (?).
Example sequence: ATGCAAGTCT
Position 1 ATGCA: -1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7;
odds = 2^-2.7 = 0.15
Position 2 TGCAA: 1 + 0 + 1.68 + 1.48 + 1.26 = 5.42;
odds = 2^5.42 = 42.8

1 2 3 4 5
A -1.32 0 -2.32 1.48 1.26
C 0.26 0 1.68 -1.32 -0.74
T 1 0 -2.32 -1.32 -2.32
G -1.32 0 -1.32 -1.32 -0.32
28
Calculating the probability of a match

Sequence: ATGCAAG
Position 1 ATGCA 0.15
Position 2 TGCAA 42.8
Position 3 GCAAG 0.18

P(1) ≈ 0.003, P(2) ≈ 0.992, P(3) ≈ 0.004
$P(i) = S_i / \sum_j S_j$. Example: 0.15 / (0.15 + 42.8 + 0.18) = 0.003
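The normalization of odds scores into per-position probabilities can be sketched in Python. The matrix holds the two-decimal log-odds values from the earlier slide, so the results agree with the slide only up to rounding.

```python
LOG_ODDS = {
    "A": [-1.32, 0, -2.32, 1.48, 1.26],
    "C": [0.26, 0, 1.68, -1.32, -0.74],
    "T": [1, 0, -2.32, -1.32, -2.32],
    "G": [-1.32, 0, -1.32, -1.32, -0.32],
}

def match_probabilities(sequence, matrix):
    """Sum log odds per window, convert to odds, then normalize to sum to 1."""
    k = len(matrix["A"])
    scores = [sum(matrix[b][i] for i, b in enumerate(sequence[s:s + k]))
              for s in range(len(sequence) - k + 1)]
    odds = [2 ** s for s in scores]
    total = sum(odds)
    return scores, odds, [o / total for o in odds]

scores, odds, probs = match_probabilities("ATGCAAG", LOG_ODDS)
for i, (s, o, p) in enumerate(zip(scores, odds, probs), start=1):
    print(f"Position {i}: score={s:.2f} odds={o:.2f} P={p:.3f}")
```

The normalization assumes the motif occurs exactly once among the scanned windows; without that assumption the values are only relative weights.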
29
Building a PSSM

Collect all known sequences that bind a certain
TF.
Align all sequences (using multiple sequence
alignment).
Compute the frequency of each nucleotide in each
position (PSPM).
Incorporate background frequency for each
nucleotide (PSSM).

30
Finding new Motifs

We are given a group of genes, which presumably
contain a common regulatory motif.
We know nothing of the TF that binds to the
putative motif.
The problem: discover the motif.

31
Example
Predicting the cAMP Receptor Protein (CRP)
binding site motif
32
Extract experimentally defined CRP Binding Sites
GGATAACAATTTCACA
AGTGTGTGAGCGGATAACAA
AAGGTGTGAGTTAGCTCACTCCCC
TGTGATCTCTGTTACATAG
ACGTGCGAGGATGAGAACACA
ATGTGTGTGCTCGGTTTAGTTCACC
TGTGACACAGTGCAAACGCG
CCTGACGGAGTTCACA
AATTGTGAGTGTCTATAATCACG
ATCGATTTGGAATATCCATCACA
TGCAAAGGACGTCACGATTTGGG
AGCTGGCGACCTGGGTCATG
TGTGATGTGTATCGAACCGTGT
ATTTATTTGAACCACATCGCA
GGTGAGAGCCATCACAG
GAGTGTGTAAGCTGTGCCACG
TTTATTCCATGTCACGAGTGT
TGTTATACACATCACTAGTG
AAACGTGCTCCCACTCGCA
TGTGATTCGATTCACA
33
Create a Multiple Sequence Alignment
GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA
34
Generate a PSSM
A C G T
1 -0.43 0.1 -0.46 0.55
2 1.37 0.12 -1.59 -11.2
3 1.69 -1.28 -11.2 -1.43
4 -1.28 0.12 -11.2 1.32
5 0.91 -11.2 -0.46 0.47
6 1.53 -1.38 -1.48 -1.43
7 0.9 -0.48 -11.2 0.12
8 -1.37 -1.28 -11.2 1.68
9 -11.2 -11.2 1.73 -0.56
10 -11.2 -0.51 -11.2 1.72
11 -0.48 -11.2 1.72 -11.2
12 1.56 -1.59 -11.2 -0.46
13 -0.51 -0.38 -0.55 0.88
14 -11.2 0.5 0.57 0.13
15 0.17 -0.51 0.12 0.12
16 0.9 -11.2 0.5 -0.48
17 0.17 0.16 0.06 -0.48
18 -0.4 -0.38 0.82 -0.48
19 -1.38 -1.28 -11.2 1.68
20 -1.48 1.7 -11.2 -1.38
21 1.5 -1.38 -1.43 -1.28
35
Shannon Entropy

Expected variation per column can be calculated
Low entropy means higher conservation

36
Entropy

The entropy (H) for a column is
$$H = -\sum_a f_a \log p_a$$
where a is a residue,
f_a is the frequency of residue a in a column,
p_a is the probability of residue a in that column
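As a rough illustration, column entropy can be computed in Python as below. The sketch makes two assumptions that the slide leaves open: logarithms are base 2 (so the result is in bits), and the probability of each residue is estimated by its observed frequency in the column, with no pseudocounts.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy in bits of one alignment column (a string of residues)."""
    total = len(column)
    # p * log2(1/p) summed over the residues that actually occur in the column.
    return sum((n / total) * math.log2(total / n)
               for n in Counter(column).values())

print(column_entropy("AAAA"))  # fully conserved column
print(column_entropy("AACC"))  # two residues, equally frequent
print(column_entropy("ACGT"))  # all four nucleotides, equally frequent
```

A fully conserved column has entropy 0, and a DNA column with all four bases equally frequent reaches the maximum of 2 bits, which matches the statement that low entropy means higher conservation.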

37
Entropy

entropy measures can determine which evolutionary
distance (PAM250, BLOSUM80, etc) should be used
Entropy yields amount of information per column
(discussed with sequence logos in a bit)

38
Log-odds score

Profiles can also indicate log-odds score
log2(observed/expected)
Result is a bit score

39
Matlab

multialign
1 Enter an array of sequences.
>> seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};
2 Promote terminations with gaps in the
alignment.
>> multialign(seqs,'terminalGapAdjust',true)
ans =
--CACGTAACATCTC--
ACGACGTAACATCTTCT
-AAACGTAACATCTCGC

40
Matlab

3 Compare alignment without termination gap
adjustment.
>> multialign(seqs)
ans =
CA--CGTAACATCT--C
ACGACGTAACATCTTCT
AA-ACGTAACATCTCGC

41
Matlab

>> a = {'ATATAGGAG','AATTATAGA','TTAGAGAAA'};
>> a
a =
    'ATATAGGAG'    'AATTATAGA'    'TTAGAGAAA'

42
Char function

>> cseq = char(a)
cseq =
ATATAGGAG
AATTATAGA
TTAGAGAAA

43
Double function

>> intseq = double(cseq)
intseq =
    65    84    65    84    65    71    71    65    71
    65    65    84    84    65    84    65    71    65
    84    84    65    71    65    71    65    65    65

44
double

>> double('A')
ans =
    65
>> double('C')
ans =
    67
>> double('G')
ans =
    71
>> double('T')
ans =
    84

45
Initiate PSPM matrix

>> Pspm = zeros(4,size(intseq,2))   % 4 rows (A,C,G,T), one column per position
Pspm =
     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0

46
Use a for loop to count each nucleotide at each
position

>> for i = 1:size(intseq,2)
Pspm(1,i) = length(find(intseq(:,i)==65));   % A
Pspm(2,i) = length(find(intseq(:,i)==67));   % C
Pspm(3,i) = length(find(intseq(:,i)==71));   % G
Pspm(4,i) = length(find(intseq(:,i)==84));   % T
end
>> Pspm
Pspm =
     2     1     2     0     3     0     2     2     2
     0     0     0     0     0     0     0     0     0
     0     0     0     1     0     2     1     1     1
     1     2     1     2     0     1     0     0     0

47
Add pseudocounts

>> Pspmp = Pspm + 1
Pspmp =
     3     2     3     1     4     1     3     3     3
     1     1     1     1     1     1     1     1     1
     1     1     1     2     1     3     2     2     2
     2     3     2     3     1     2     1     1     1

48
Normalize to get frequencies

>> Pspmnorm = Pspmp./repmat(sum(Pspmp),4,1)
Pspmnorm =
  Columns 1 through 7
    0.4286    0.2857    0.4286    0.1429    0.5714    0.1429    0.4286
    0.1429    0.1429    0.1429    0.1429    0.1429    0.1429    0.1429
    0.1429    0.1429    0.1429    0.2857    0.1429    0.4286    0.2857
    0.2857    0.4286    0.2857    0.4286    0.1429    0.2857    0.1429
  Columns 8 through 9
    0.4286    0.4286
    0.1429    0.1429
    0.2857    0.2857
    0.1429    0.1429

49
Calculate odds score

>> Pswm = Pspmnorm/0.25
Pswm =
  Columns 1 through 7
    1.7143    1.1429    1.7143    0.5714    2.2857    0.5714    1.7143
    0.5714    0.5714    0.5714    0.5714    0.5714    0.5714    0.5714
    0.5714    0.5714    0.5714    1.1429    0.5714    1.7143    1.1429
    1.1429    1.7143    1.1429    1.7143    0.5714    1.1429    0.5714
  Columns 8 through 9
    1.7143    1.7143
    0.5714    0.5714
    1.1429    1.1429
    0.5714    0.5714

50
Log odds ratio

>> logPswm = log2(Pswm)
logPswm =
  Columns 1 through 7
    0.7776    0.1926    0.7776   -0.8074    1.1926   -0.8074    0.7776
   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074
   -0.8074   -0.8074   -0.8074    0.1926   -0.8074    0.7776    0.1926
    0.1926    0.7776    0.1926    0.7776   -0.8074    0.1926   -0.8074
  Columns 8 through 9
    0.7776    0.7776
   -0.8074   -0.8074
    0.1926    0.1926
   -0.8074   -0.8074

51
Estimate the probability of the given sequence to
belong to the defined PSWM

>> Unknown = 'TTAAGAAGG'
Unknown =
TTAAGAAGG
>> intunknown = double(Unknown)
intunknown =
    84    84    65    65    71    65    65    71    71

52
Get the index of the PSWM for the unknown sequence

>> A = find(intunknown==65);
>> intunknown(A) = 1;   % A -> row 1
>> C = find(intunknown==67);
>> intunknown(C) = 2;   % C -> row 2
>> G = find(intunknown==71);
>> intunknown(G) = 3;   % G -> row 3
>> T = find(intunknown==84);
>> intunknown(T) = 4;   % T -> row 4
>> intunknown
intunknown =
     4     4     1     1     3     1     1     3     3

53
Calculate the log odds-ratio of the Unknown
'TTAAGAAGG'

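% logPswm(intunknown) alone would use linear indexing and read only
% column 1, so pair each row index with its own column:
>> cols = 1:length(intunknown);
>> logunknown = logPswm(sub2ind(size(logPswm),intunknown,cols))
logunknown =
  Columns 1 through 7
    0.1926    0.7776    0.7776   -0.8074   -0.8074   -0.8074    0.7776
  Columns 8 through 9
    0.1926    0.1926
>> Punknown = sum(logunknown)
Punknown =
    0.4887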
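As a cross-check of the steps above, here is the same pipeline in Python: count, add a pseudocount of 1, normalize, divide by a background of 0.25, take log2, and score the unknown sequence. In MATLAB, indexing a matrix with a single index vector is linear indexing, so each base has to be paired explicitly with its own column; the sketch does that by walking the sequence and the matrix columns together. Function names are illustrative.

```python
import math

BASES = "ACGT"
sites = ["ATATAGGAG", "AATTATAGA", "TTAGAGAAA"]

def log_odds_matrix(sites, pseudocount=1, background=0.25):
    """One {base: log2 odds} dict per alignment column."""
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for column in zip(*sites):
        matrix.append({b: math.log2((column.count(b) + pseudocount)
                                    / total / background)
                       for b in BASES})
    return matrix

def score(sequence, matrix):
    """Sum the log odds of each base taken from its own column."""
    return sum(col[base] for col, base in zip(matrix, sequence))

matrix = log_odds_matrix(sites)
print(round(score("TTAAGAAGG", matrix), 4))  # 0.4887
```

The per-position terms are 0.1926, 0.7776, 0.7776, -0.8074, -0.8074, -0.8074, 0.7776, 0.1926 and 0.1926, which sum to about 0.4887.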

54
Is this significant score or just random
similarity?

>> cseq
cseq =
ATATAGGAG
AATTATAGA
TTAGAGAAA
>> Unknown
Unknown =
TTAAGAAGG

55
What would be the maximum score?

>> logPswm
logPswm =
  Columns 1 through 7
    0.7776    0.1926    0.7776   -0.8074    1.1926   -0.8074    0.7776
   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074   -0.8074
   -0.8074   -0.8074   -0.8074    0.1926   -0.8074    0.7776    0.1926
    0.1926    0.7776    0.1926    0.7776   -0.8074    0.1926   -0.8074
  Columns 8 through 9
    0.7776    0.7776
   -0.8074   -0.8074
    0.1926    0.1926
   -0.8074   -0.8074
>> maxscore = max(logPswm)
maxscore =
  Columns 1 through 7
    0.7776    0.7776    0.7776    0.7776    1.1926    0.7776    0.7776
  Columns 8 through 9
    0.7776    0.7776

56
Write a function using the above statements to
scan a sequence

Write a function named logodds that calculates
the log-odds ratio of a given alignment.
Write a function named scanmotif that calls the
logodds to search through a sequence using a
sliding window to calculate the logodds of a
subsequence and store these scores. The function
should allow for selection of a maximum number of
locations that are likely to contain the motif
based on the scores obtained.
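One possible shape for these two functions, sketched in Python rather than MATLAB. The pseudocount of 1, the uniform background of 0.25 and the `max_hits` parameter name are assumptions carried over from, or added to, the worked example; this is one way to structure the exercise, not a reference solution.

```python
import math

BASES = "ACGT"

def logodds(alignment, pseudocount=1.0, background=0.25):
    """Return one {base: log2 odds} dict per column of the alignment."""
    total = len(alignment) + pseudocount * len(BASES)
    return [{b: math.log2((column.count(b) + pseudocount) / total / background)
             for b in BASES}
            for column in zip(*alignment)]

def scanmotif(sequence, alignment, max_hits=3):
    """Score every window; return the max_hits best (score, start) pairs."""
    matrix = logodds(alignment)
    width = len(matrix)
    scored = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scored.append((sum(col[b] for col, b in zip(matrix, window)), start))
    return sorted(scored, reverse=True)[:max_hits]

sites = ["ATATAGGAG", "AATTATAGA", "TTAGAGAAA"]
hits = scanmotif("CCCCATATAGGAGCCCC", sites, max_hits=2)
for s, start in hits:
    print(f"start={start} score={s:.4f}")
```

Start positions are 0-based here. In the usage example the best window starts at index 4, where a copy of the first aligned site was placed.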

57
Position Specific Scoring Matrix (PSSM)

incorporate information theory to indicate
information contained within each column of a
multiple alignment.
information is a logarithmic transformation of
the frequency of each residue in the motif

58
PSSMs and Pseudocounts

Problem: PSSMs are only as good as the initial
MSA
Some residues may be underrepresented
Other columns may be too conserved
Solution: introduce pseudocounts to get a better
indication

59
Pseudocounts

New estimated probability:
$$P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$$
P_ca: probability of residue a in column c
n_ca: count of a's in column c
b_ca: pseudocount of a's in column c
N_c: total count in column c
B_c: total pseudocount in column c

60
PSSMs and pseudocounts

probabilities converted into a log-odds form
(usually log2 so the information can be reported
in bits) and placed in the PSSM.

61
Searching PSSMs

value for the first residue in the sequence
occurring in the first column is calculated by
searching the PSSM
the value for the residue occurring in each
column is calculated

62
Searching PSSMs

values are added (since they are logarithms) to
produce a summed log odds score, S
S can be converted to an odds score using the
formula 2^S
odds scores for each position can be summed
together and normalized to produce a probability
of the motif occurring at each location.

63
Information in PSSMs

Information theory: the amount of information
contained within each sequence.
With no information, the amount of uncertainty can be
measured as log2(20) = 4.32 bits for amino acids, since
there are 20 amino acids. For nucleic acid
sequences, the amount of uncertainty can be
measured as log2(4) = 2 bits.

64
Information in PSSMs

If a column is completely conserved then the
uncertainty is 0: there is only one choice.
Two residues occurring with equal probability give
1 bit of uncertainty in deciding which residue it is.

65
Measure of Uncertainty

Measured as the entropy:
$$H_c = -\sum_a P_{ca} \log_2 P_{ca}$$

66
Relative Entropy

Relative entropy takes into account the overall
composition of the organism being studied
B_a is the background frequency of residue a in the
organism

67
PSSM Uncertainty

Uncertainty for whole model is summed over all
columns

68
Sequence Logos

Information in PSSMs can be viewed visually
Sequence logos illustrate information in each
column of a motif
height of logo is calculated as the amount by
which uncertainty has been decreased
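A minimal Python sketch of the height calculation, using the five example sites from the Motif Logo slide. It assumes a maximum uncertainty of log2(4) = 2 bits for DNA and estimates probabilities directly from the observed column frequencies; it applies no small-sample correction, which logo tools may add.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Bits by which a column reduces the maximum uncertainty."""
    total = len(column)
    entropy = sum((n / total) * math.log2(total / n)
                  for n in Counter(column).values())
    return math.log2(alphabet_size) - entropy

sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
heights = [information_content(col) for col in zip(*sites)]
for position, h in enumerate(heights, start=1):
    print(position, round(h, 3))
```

Fully conserved positions reach the full 2 bits, while positions 3 and 5, where A and G both occur, come out near 1 bit.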

69
Sequence Logos
70
Statistical Methods

Commonly used methods for locating motifs
Expectation-Maximization (EM)
Gibbs Sampling

71
Expectation-Maximization

Begin with set of sequences with an unknown
signal in common
Signal may be subtle
Approximate length of signal must be given
Randomly assign locations of this motif in each
sequence

72
Expectation-Maximization

Two steps
Expectation Step
Maximization Step

73
Expectation-Maximization

Expectation step
Residue Frequencies for each position calculated
Residues not in a motif are background
Frequencies used to determine probability of
finding site at any position in a sequence to fit
motif model

74
Maximization Step

Determine location for each sequence that
maximally aligns to the motif pattern
Once a new motif location is found for each sequence,
the motif pattern is revised in the expectation step
E-M continues until solution converges
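The alternation between re-estimating the motif and re-locating it can be sketched in Python. This is a simplified illustration under several assumptions that are not in the slides: exactly one motif occurrence per sequence, a fixed uniform background of 0.25, a pseudocount of 1, a fixed number of iterations instead of a convergence test, and fractional weights over all start positions rather than a single best location per sequence. The function names describe what each half does instead of using the E and M labels.

```python
import math
import random

BASES = "ACGT"

def estimate_model(seqs, weights, width, pseudo=1.0):
    """Re-estimate motif column frequencies from the position weights."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for seq, z in zip(seqs, weights):
        for start, weight in enumerate(z):
            for i in range(width):
                counts[i][seq[start + i]] += weight
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def position_weights(seqs, motif, width, background=0.25):
    """Per sequence, the normalized model/background ratio of each start."""
    weights = []
    for seq in seqs:
        odds = [math.prod(motif[i][seq[s + i]] / background
                          for i in range(width))
                for s in range(len(seq) - width + 1)]
        total = sum(odds)
        weights.append([o / total for o in odds])
    return weights

def em_motif(seqs, width, iterations=50, seed=0):
    rng = random.Random(seed)
    # Random initial assignment: one start position per sequence.
    weights = []
    for seq in seqs:
        z = [0.0] * (len(seq) - width + 1)
        z[rng.randrange(len(z))] = 1.0
        weights.append(z)
    for _ in range(iterations):
        motif = estimate_model(seqs, weights, width)
        weights = position_weights(seqs, motif, width)
    return motif, weights

seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA",
    "TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA",
    "CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA",
]
motif, weights = em_motif(seqs, width=6)
best_starts = [max(range(len(z)), key=z.__getitem__) for z in weights]
print(best_starts)
```

The result depends on the random starting assignment, so in practice the procedure is usually run from several starting points and the best-scoring result kept.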

75
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA
CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT
TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC
GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC
CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG
TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA
CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC
ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGT
AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC
CCAGCACACACACTTATCCAGTGGTAAATACACATCAT
TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT
ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA
TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAG
CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA
CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA
GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT
TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT
GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT
CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG
CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCA
GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG
CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC
76
Residue Counts

Given motif alignment, count for each location is
calculated

77
Residue Frequencies

The counts are then converted to frequencies

78
Example Maximization Step

Consider the first sequence:
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
There are 41 residues, so for a motif of width 6
there are 41 - 6 + 1 = 36 sites to consider

79
MEME Software

One of three motif models
OOPS One expected occurrence per sequence
ZOOPS Zero or one expected occurrence per
sequence
TCM Any number of occurrences of the motif

80
Gibbs Sampling

Similar to E-M algorithm
Combines E-M and simulated annealing
Goal: find the most probable pattern by sampling from
motif probabilities to maximize the ratio of
model:background probabilities

81
Predictive Update Step

random motif start position chosen for all
sequences except one
Initial alignment used to calculate residue
frequencies for motif and background
similar to the Expectation Step of EM

82
Sampling Step

ratio of model:background probabilities
normalized and weighted
motif start position chosen based on a random
sampling with the given weights
Different than E-M algorithm

83
Gibbs Sampling

process repeated until residue frequencies in
each column do not change
The sampling step is then repeated for a
different initial random alignment
Sampling allows escape from local maxima

84
Gibbs Sampling

Dirichlet priors (pseudocounts) are added into
the nucleotide counts to improve performance
shifting routine shifts motif a few bases to the
left or the right
A range of motif sizes is checked
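The predictive update and sampling steps can be sketched in Python as follows. This is a bare-bones illustration under stated assumptions: a fixed uniform background of 0.25 instead of one estimated from the residues outside the motif, a flat pseudocount of 1 in place of Dirichlet priors, a fixed number of sweeps instead of a convergence test, and no shifting routine, restarts or range of motif sizes. Function names are illustrative.

```python
import math
import random

BASES = "ACGT"

def column_frequencies(seqs, starts, width, skip, pseudo=1.0):
    """Motif frequencies from every sequence except the one at index skip."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for idx, (seq, start) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for i in range(width):
            counts[i][seq[start + i]] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def gibbs_motif(seqs, width, sweeps=200, seed=0, background=0.25):
    rng = random.Random(seed)
    # Random initial start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for idx, seq in enumerate(seqs):
            # Predictive update: build the model without the current sequence.
            motif = column_frequencies(seqs, starts, width, skip=idx)
            # Sampling step: weight each start by its model/background ratio
            # and draw the new start at random according to those weights.
            ratios = [math.prod(motif[i][seq[s + i]] / background
                                for i in range(width))
                      for s in range(len(seq) - width + 1)]
            starts[idx] = rng.choices(range(len(ratios)), weights=ratios)[0]
    return starts

seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA",
    "TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA",
    "CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA",
]
starts = gibbs_motif(seqs, width=6)
print([(s, seq[s:s + 6]) for s, seq in zip(starts, seqs)])
```

Because the new start is drawn at random rather than set to the best-scoring window, a run can move away from a poor alignment. The answer varies with the seed, so several runs are normally compared.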