Title: Motif Finding
1 Motif Finding
2 Motif Finding
- Can be used to identify
  - Promoters
  - Transcription factor binding sites
- Problem: identifying a motif without any prior knowledge of what the motif looks like
  - If you do not know what the motif looks like or where it is located, you need an algorithm that, given a set of sequences, finds short substrings that occur more often than expected at random.
- Methods
  - Position weight matrices
  - Expectation maximization
  - Gibbs sampling
  - Phylogenetic footprinting
3 An Example
- Seven 32-nucleotide DNA sequences, generated randomly
CTGCGGTACCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCGTAGGCTAAGAGTT
GTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCACATT
GCCGTACCGACACACATTCTTTGGCATCCCTA
TAGGTCTCGCTCGGCTGGTCGAATGGTCCGAG
4 An Example
- Insert a pattern, P = ATGCAACT, of length l = 8 at a random position in each sequence
CTGCGGTACATGCAACTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCATGCAACTGTAGGCTAAGAGTT
GTATGCAACTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCATGCAACTTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCATGCAACTACATT
GCCGTACCGACACACATTCTTTGGATGCAACTCATCCCTA
TAGGTCTCGCTATGCAACTCGGCTGGTCGAATGGTCCGAG
5 Problem
- If you don't know what the pattern is, or where it has been inserted, can you find the pattern?
CTGCGGTACATGCAACTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCATGCAACTGTAGGCTAAGAGTT
GTATGCAACTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCATGCAACTTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCATGCAACTACATT
GCCGTACCGACACACATTCTTTGGATGCAACTCATCCCTA
TAGGTCTCGCTATGCAACTCGGCTGGTCGAATGGTCCGAG
6 Solution
- Count the number of times each l-mer occurs in the sequences.
  - (32 + 8) × 7 = 280 total nucleotides
  - The chance of seeing any particular 8-mer at random is less than 280/4^8 ≈ 0.004
- After counting all 8-mers, one appears many more times than the rest. This overrepresented 8-mer is the pattern P we are trying to find. Use EMBOSS wordcount.
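The counting step can be sketched in a few lines of plain Python. This is a stand-in for illustration, not the EMBOSS tool itself; the names `SEQS` and `count_lmers` are hypothetical. It slides a window of length 8 over the seven sequences with the inserted pattern and tallies every 8-mer.

```python
from collections import Counter

# The seven 40-nucleotide example sequences with the pattern inserted.
SEQS = [
    "CTGCGGTACATGCAACTCCCAAAGTTTCTGTCTTACTAAC",
    "TGGTCCCAGGACTCATGCATGCAACTGTAGGCTAAGAGTT",
    "GTATGCAACTGTAATGGTCAACGTGTCCCGCCAAACATTA",
    "AATGTCATGCAACTTCACTGGTGCCATTAATTATAGAATG",
    "TTTAACCGATATGAAATAGGCCTGGCCATGCAACTACATT",
    "GCCGTACCGACACACATTCTTTGGATGCAACTCATCCCTA",
    "TAGGTCTCGCTATGCAACTCGGCTGGTCGAATGGTCCGAG",
]

def count_lmers(seqs, l):
    """Count every overlapping l-mer across all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - l + 1):
            counts[seq[i:i + l]] += 1
    return counts

counts = count_lmers(SEQS, 8)
top_lmer, top_count = counts.most_common(1)[0]
print(top_lmer, top_count)
```

Exact counting like this only works while every copy of the pattern is identical.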
7 Problem 2
- Suppose we allow for mutations at some positions, shown in lowercase (try using EMBOSS wordcount)
CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT
GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT
GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA
TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG
8 Solution
- Use a profile matrix to allow for variations in the pattern
  - The length (number of columns) of the matrix is the length of the profile, l.
  - The rows of the matrix correspond to the possible bases/residues.
9 Solution
- Given a set of t sequences, construct a 4 × l matrix, and align the sequences in all possible positions.
- For each possible combination of starting positions, one per sequence
  - Count the number of times each nucleotide appears in each position.
  - The score of the resulting alignment is the sum, over all positions, of the largest nucleotide count at that position.
- The alignment with the greatest score corresponds to the motif.
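The exhaustive procedure just described can be sketched as below. This is a minimal illustration on a made-up toy input, not a practical implementation; the function names and the three toy sequences are hypothetical, and the approach is only feasible for very small inputs.

```python
from itertools import product

def score(seqs, starts, l):
    """Sum over columns of the largest nucleotide count in the alignment."""
    total = 0
    for j in range(l):
        column = [seq[s + j] for seq, s in zip(seqs, starts)]
        total += max(column.count(b) for b in "ACGT")
    return total

def brute_force_motif_search(seqs, l):
    """Try every combination of starting positions; return (best score, starts)."""
    ranges = [range(len(seq) - l + 1) for seq in seqs]
    best = max(product(*ranges), key=lambda starts: score(seqs, starts, l))
    return score(seqs, best, l), best

# Toy input (hypothetical): ACGT planted at 0-based offsets 2, 1 and 3.
toy = ["TTACGTGG", "CACGTTCA", "GGGACGTA"]
best_score, best_starts = brute_force_motif_search(toy, 4)
print(best_score, best_starts)
```

A perfect score equals t × l, which here is 3 × 4 = 12, and is reached only when all selected windows are identical.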
10 Example
CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT
GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT
GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA
TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG
- Construct a PWM for the alignment in bold
12 Example
- Score the alignment by summing the maximum count in each column
- Score: 25
- Consensus sequence: TTGGTCCA
13 Example
CTGCGGTAC[ATcCAgCT]CCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGC[ggGCAACT]GTAGGCTAAGAGTT
GT[ATGgAtCT]GTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTC[AaGCAACc]TCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCC[tTGgAACT]ACATT
GCCGTACCGACACACATTCTTTGG[ATGCcAtT]CATCCCTA
TAGGTCTCGCT[ATGgcACT]CGGCTGGTCGAATGGTCCGAG
- Construct a PWM for the bracketed alignment
15 Example
- Score the alignment by summing the maximum count in each column
- Score: 42
- Consensus sequence: ATGCAACT
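The score and consensus for this alignment can be checked with a short sketch. The start positions in `STARTS` are an assumption: they are the 0-based offsets at which the mutated copies of the pattern appear to sit in each sequence. The names `SEQS`, `STARTS`, and `profile` are illustrative.

```python
# Mutated example sequences (lowercase marks the mutated bases).
SEQS = [
    "CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC",
    "TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT",
    "GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA",
    "AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG",
    "TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT",
    "GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA",
    "TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG",
]
# Assumed 0-based start of the planted 8-mer in each sequence.
STARTS = [9, 18, 2, 6, 27, 24, 11]
L = 8

def profile(seqs, starts, l):
    """Return {base: [count per column]} for the chosen windows."""
    prof = {b: [0] * l for b in "ACGT"}
    for seq, s in zip(seqs, starts):
        for j, base in enumerate(seq[s:s + l].upper()):
            prof[base][j] += 1
    return prof

prof = profile(SEQS, STARTS, L)
# Largest count in each column, and the base that achieves it.
column_max = [max(prof[b][j] for b in "ACGT") for j in range(L)]
consensus = "".join(max("ACGT", key=lambda b: prof[b][j]) for j in range(L))
print(sum(column_max), consensus)
```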
16 Formally Stated
- Given
  - A set of t DNA sequences, each with n nucleotides
- Select one position in each of the t sequences, forming an array of starting positions $s = (s_1, s_2, \ldots, s_t)$, where $1 \le s_i \le n - l + 1$
  - How many starting positions are there in each sequence?
  - How many combinations of starting positions are there in all?
17 Scoring
- Let P(s) be the profile matrix corresponding to starting positions s.
- $M_{P(s)}(j)$ = largest count in column j of P(s).
  - In the last example, $M_{P(s)}(1) = 5$
  - In the last example, $M_{P(s)}(3) = 6$
- Score: $\mathrm{Score}(s) = \sum_{j=1}^{l} M_{P(s)}(j)$
This score is based on absolute frequencies; a statistically based score could be used instead.
18 Representing patterns: Matrices
CTTGGTGACGTG GTGAGTGACGTC CGGGTTGACGCA CCTACTTACGTA
TATGGTGACGTC TCGGATGACGAT TAGGATGACGTC CCTGGTGACGCC
CGCGGTGACGTA GCCGTTGACGCC CGCGATGACGCA CCTGTTGACGTG
TTGCATGACGTC GTTGGTGACGTG GAGGATGACGTT GGTCGTGACGTA

Given N sequence fragments of fixed length, one can assemble a position frequency matrix (the number of times a particular nucleotide appears at a given position):

A  0  3  0  2  5  0  0 16  0  0  1  5
C  7  5  3  3  1  0  0  0 16  0  5  6
G  5  4  6 11  7  0 15  0  0 16  0  3
T  4  4  7  0  3 16  1  0  0  0 10  2

Position frequency matrix (PFM), also known as a raw count matrix or conservation and substitution matrix
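Assembling a PFM from fixed-length fragments can be sketched as follows; `SITES` and `position_frequency_matrix` are illustrative names. One caveat: a direct count of the sixteen fragments as transcribed here appears to give G = 12, C = 2 in column 4 and T = 11, C = 4 in column 11, each one off from the tabulated matrix, so the listing and the table may not correspond exactly.

```python
# The sixteen 12-nucleotide fragments as listed in the text.
SITES = """CTTGGTGACGTG GTGAGTGACGTC CGGGTTGACGCA CCTACTTACGTA
TATGGTGACGTC TCGGATGACGAT TAGGATGACGTC CCTGGTGACGCC
CGCGGTGACGTA GCCGTTGACGCC CGCGATGACGCA CCTGTTGACGTG
TTGCATGACGTC GTTGGTGACGTG GAGGATGACGTT GGTCGTGACGTA""".split()

def position_frequency_matrix(sites):
    """Count each nucleotide at each position of equal-length fragments."""
    width = len(sites[0])
    pfm = {b: [0] * width for b in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm

pfm = position_frequency_matrix(SITES)
for base in "ACGT":
    print(base, pfm[base])
```

Every column sums to N, the number of fragments, which is a quick sanity check on any PFM.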
19 More on scores
Converting a PFM into a PWM, using pseudocounts (N = 16)

PFM:
A  0  3  0  2  5  0  0 16  0  0  1  5
C  7  5  3  3  1  0  0  0 16  0  5  6
G  5  4  6 11  7  0 15  0  0 16  0  3
T  4  4  7  0  3 16  1  0  0  0 10  2

For each matrix element, compute the corresponding weight.

PWM:
A     -2      0     -2 -0.415  0.585     -2     -2  2.088     -2     -2     -1  0.585
C      1  0.585      0      0     -1     -2     -2     -2  2.088     -2  0.585  0.807
G  0.585  0.322  0.807  1.585      1     -2      2     -2     -2  2.088     -2      0
T  0.322  0.322      1     -2      0  2.088     -1     -2     -2     -2  1.459 -0.415

- n(b,i) = raw count (PFM matrix element) of nucleotide b in column i
- N = number of sequences used to create the PFM (= column sum)
- pseudocounts = correction for small sample size
- p(b) = background frequency of nucleotide b
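A minimal sketch of the conversion follows. The exact formula is an assumption: the form used here, log2(((n(b,i) + 1) / N) / p(b)) with a pseudocount of 1 and a uniform background p(b) = 0.25, was chosen because it matches the tabulated weights for the entries checked, to within rounding. Other pseudocount schemes are common and would give slightly different numbers. `pfm_to_pwm` is an illustrative name.

```python
import math

PFM = {
    "A": [0, 3, 0, 2, 5, 0, 0, 16, 0, 0, 1, 5],
    "C": [7, 5, 3, 3, 1, 0, 0, 0, 16, 0, 5, 6],
    "G": [5, 4, 6, 11, 7, 0, 15, 0, 0, 16, 0, 3],
    "T": [4, 4, 7, 0, 3, 16, 1, 0, 0, 0, 10, 2],
}

def pfm_to_pwm(pfm, n_seqs, pseudocount=1.0, background=0.25):
    """Log-odds weights: log2(((count + pseudocount) / N) / background).

    Assumption: the pseudocount is added to the count only, which is the
    form that matches the tabulated values; it is not taken from the slide.
    """
    return {
        base: [math.log2((n + pseudocount) / n_seqs / background) for n in row]
        for base, row in pfm.items()
    }

pwm = pfm_to_pwm(PFM, n_seqs=16)
for base in "ACGT":
    print(base, [round(w, 3) for w in pwm[base]])
```

The pseudocount keeps zero counts from producing a weight of minus infinity, so a single mismatch at a conserved position is penalised rather than ruled out.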
20 Detecting binding sites using a PWM
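One common way to detect sites with a PWM is to slide it along a sequence, sum the weights of the bases in each window, and report windows above a threshold. The sketch below makes several assumptions: the log-odds conversion (pseudocount 1, background 0.25), the threshold of 10.0, and the test sequence are all hypothetical choices for illustration.

```python
import math

PFM = {
    "A": [0, 3, 0, 2, 5, 0, 0, 16, 0, 0, 1, 5],
    "C": [7, 5, 3, 3, 1, 0, 0, 0, 16, 0, 5, 6],
    "G": [5, 4, 6, 11, 7, 0, 15, 0, 0, 16, 0, 3],
    "T": [4, 4, 7, 0, 3, 16, 1, 0, 0, 0, 10, 2],
}
N = 16
WIDTH = 12

# Assumed log-odds conversion (pseudocount 1, uniform background 0.25).
PWM = {b: [math.log2((n + 1) / N / 0.25) for n in row] for b, row in PFM.items()}

def scan(sequence, pwm, width, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        s = sum(pwm[base][j] for j, base in enumerate(window))
        if s >= threshold:
            hits.append((i, s))
    return hits

# Hypothetical test sequence: the column-wise best 12-mer flanked by AT repeats.
seq = "ATATATAT" + "CCTGGTGACGTC" + "ATATATAT"
hits = scan(seq, PWM, WIDTH, threshold=10.0)
print(hits)
```

The choice of threshold trades sensitivity against false positives, and in practice it is usually calibrated against background sequence.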
21 Problems with Matrices
- A large search space gives a larger proportion of false positives
  - There are (n - l + 1)^t sets of starting points for t sequences of length n when looking for a motif of length l.
  - This grows exponentially with the number of sequences.
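To get a feel for this growth, the count can be evaluated directly. The sizes n = 40 and l = 8 are borrowed from the planted-motif example; `num_alignments` is an illustrative name.

```python
def num_alignments(n, l, t):
    """Ways to pick one length-l window in each of t sequences of length n."""
    return (n - l + 1) ** t

# n = 40 and l = 8 give 33 possible starting positions per sequence.
for t in (1, 2, 7, 20):
    print(t, num_alignments(40, 8, t))
```

With only seven sequences there are already 33^7, roughly 4.3 × 10^10, candidate alignments to score.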
22 Phylogenetic footprinting
Using cross-species conservation of regulatory elements to improve binding site prediction
23 Phylogenetic-Footprinting
- A method of identifying conserved motifs in a set of orthologous sequences from multiple species, using the notion that
  - Selective pressure causes functional elements to evolve at a slower rate than nonfunctional sequences
- Identifies the best conserved motifs in those regions
- Guaranteed to report all sets of motifs with the lowest parsimony scores.
24 Parsimony Score
- Hamming distance (parsimony score)
  - Number of substitutions needed to turn one sequence into another of the same length
  - If v = ATTGTC and w = ACTCTC, then d(v, w) = 2
v: ATTGTC
    x x
w: ACTCTC
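The Hamming distance is a position-by-position comparison, as in this small sketch (`hamming_distance` is an illustrative name):

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(v, w))

print(hamming_distance("ATTGTC", "ACTCTC"))
```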
25 Substring Parsimony Problem
- Given
  - n orthologous sequences S1, S2, S3, ..., Sn
  - T = phylogenetic tree relating the sequences
  - k = length of motif
  - d = maximum parsimony score
- Problem
  - Find each set of k-mers, one from each input sequence, with parsimony score at most d with respect to T
- Parameters
  - k and d can be specified by the user
26 Terminology
[Figure: phylogenetic tree with leaf nodes and internal nodes]
- Leaf nodes: Chicken, Human, Mouse
- Internal nodes: for example, v
- C(v) = {Human, Mouse}
27 DP Solution
- Proceeds from the leaves of T up to its root
- At each node u of T, compute a table $W_u$ containing $4^k$ entries, one for each possible k-mer
- For a string s of length k,
  - $W_u[s]$ = the best parsimony score that can be achieved for the subtree rooted at u
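28 DP Solution
- C(u) = set of children of u
- d(s,t) = Hamming distance between sequences s and t
- $\Sigma = \{A, C, G, T\}$
- Tables W can now be computed:
$$W_u[s] = \begin{cases} 0 & \text{if } u \text{ is a leaf and } s \text{ is a substring of } S_u \\ +\infty & \text{if } u \text{ is a leaf and } s \text{ is not a substring of } S_u \\ \sum_{v \in C(u)} \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right) & \text{if } u \text{ is not a leaf} \end{cases}$$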
29 An Exact Algorithm
$W_u[s]$ = best parsimony score for the subtree rooted at node u, if u is labeled with string s.
30 Small Example
ATGC... (Chicken)
ACGT... (Human)
ACGG... (Mouse)
Size of motif sought: k = 2
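The bottom-up table computation can be sketched on this small example. Two things are assumptions: only the four-letter prefixes shown are used as the sequences, and the tree shape groups Human with Mouse under one internal node, with Chicken joining at the root. The function names are illustrative.

```python
from itertools import product

def hamming(s, t):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(s, t))

def leaf_table(seq, k, kmers):
    """W[s] = 0 if s occurs in seq, else infinity."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {s: 0 if s in present else float("inf") for s in kmers}

def table(node, seqs, k, kmers):
    """Compute W_u bottom-up; a node is a leaf name or a tuple of children."""
    if isinstance(node, str):
        return leaf_table(seqs[node], k, kmers)
    child_tables = [table(child, seqs, k, kmers) for child in node]
    # For each label s, sum over children the cheapest child label t.
    return {
        s: sum(min(w[t] + hamming(s, t) for t in kmers) for w in child_tables)
        for s in kmers
    }

k = 2
kmers = ["".join(p) for p in product("ACGT", repeat=k)]
# Only the prefixes shown in the example; the tree shape is an assumption.
seqs = {"Chicken": "ATGC", "Human": "ACGT", "Mouse": "ACGG"}
tree = ("Chicken", ("Human", "Mouse"))
root = table(tree, seqs, k, kmers)
best = min(root.values())
print(best, sorted(s for s in kmers if root[s] == best))
```

Each internal node compares all 4^k labels against all 4^k child labels, so this direct version costs on the order of k · 16^k operations per tree edge and is only practical for short motifs.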
31 Problems with Phylogenetic-Footprinting
- The organisms being compared must have diverged enough for nonfunctional sequence to have changed
- Comparing non-mammalian organisms to mammalian organisms can limit the types of functional elements found (primate-specific, mammalian-specific, etc.)
- Requires several different organisms