Motif Finding - PowerPoint PPT Presentation

About This Presentation

Title:

Motif Finding

Slides: 32

Provided by: ryang

Transcript and Presenter's Notes

Title: Motif Finding

1
Motif Finding
2
Motif Finding

Can be used to identify
Promoters
Transcription Factor Binding Sites
Problem: the identification of a motif without
any prior knowledge of what the motif looks like
If you do not know what the motif looks like, or
where it is located, you need an algorithm that,
given a set of sequences, can find short
substrings that occur more often than expected by chance.
Methods
Position Weight Matrices
Expectation Maximization
Gibbs Sampling
Phylogenetic Footprinting

3
An Example

7 32-nucleotide DNA sequences, generated randomly

CTGCGGTACCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCGTAGGCTAAGAGTT
GTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCACATT
GCCGTACCGACACACATTCTTTGGCATCCCTA
TAGGTCTCGCTCGGCTGGTCGAATGGTCCGAG
4
An Example

Insert a pattern, P = ATGCAACT, of length l = 8 at
a random position in each sequence

CTGCGGTACATGCAACTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCATGCAACTGTAGGCTAAGAGTT
GTATGCAACTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCATGCAACTTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCATGCAACTACATT
GCCGTACCGACACACATTCTTTGGATGCAACTCATCCCTA
TAGGTCTCGCTATGCAACTCGGCTGGTCGAATGGTCCGAG
5
Problem

If you don't know what the pattern is, or where
it has been inserted, can you find the pattern?

Count the number of times each l-mer occurs in
the sequences.
$(32 + 8) \times 7 = 280$ total nucleotides
The probability of finding any given 8-mer by chance
is less than $280 / 4^8 \approx 0.004$
After counting all 8-mers, one appears many more
times than the rest. This overrepresented 8-mer is the
pattern P we are trying to find. Use EMBOSS wordcount.
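
The counting step can be sketched in Python as follows. This is a minimal illustration, not the EMBOSS wordcount tool itself: the function name `count_lmers` is hypothetical, and the input is the set of seven 40-nucleotide sequences from the example with the pattern already inserted.

```python
from collections import Counter

# The seven 40-nucleotide example sequences (pattern already inserted).
SEQUENCES = [
    "CTGCGGTACATGCAACTCCCAAAGTTTCTGTCTTACTAAC",
    "TGGTCCCAGGACTCATGCATGCAACTGTAGGCTAAGAGTT",
    "GTATGCAACTGTAATGGTCAACGTGTCCCGCCAAACATTA",
    "AATGTCATGCAACTTCACTGGTGCCATTAATTATAGAATG",
    "TTTAACCGATATGAAATAGGCCTGGCCATGCAACTACATT",
    "GCCGTACCGACACACATTCTTTGGATGCAACTCATCCCTA",
    "TAGGTCTCGCTATGCAACTCGGCTGGTCGAATGGTCCGAG",
]


def count_lmers(sequences, l):
    """Count every length-l substring (l-mer) across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            counts[seq[i:i + l]] += 1
    return counts


counts = count_lmers(SEQUENCES, 8)
top_lmer, top_count = counts.most_common(1)[0]

total_nt = sum(len(s) for s in SEQUENCES)  # (32 + 8) * 7
chance_bound = total_nt / 4 ** 8           # the bound quoted on the slide

print(top_lmer, top_count)
print(total_nt, round(chance_bound, 4))
```

The most frequent 8-mer is the inserted pattern, found once in each of the seven sequences.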

7
Problem 2

Suppose we allow for mutations at some positions
(try to use EMBOSS wordcount)

CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT
GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT
GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA
TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG
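
A short Python sketch of why exact counting breaks down here. It uses the known pattern P = ATGCAACT purely for illustration (in the real problem P is unknown), and the helper names `hamming` and `closest_window` are hypothetical. Lowercase letters in the input mark the mutated positions and are uppercased before comparison.

```python
# Mutated example sequences; lowercase letters mark the mutated positions.
MUTATED = [
    "CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC",
    "TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT",
    "GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA",
    "AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG",
    "TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT",
    "GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA",
    "TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG",
]
PATTERN = "ATGCAACT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_window(seq, pattern):
    """Return (distance, start) of the window of seq closest to pattern."""
    seq = seq.upper()
    l = len(pattern)
    return min(
        (hamming(seq[i:i + l], pattern), i) for i in range(len(seq) - l + 1)
    )


# Exact occurrences of the pattern: none survive the mutations.
exact_hits = sum(seq.upper().count(PATTERN) for seq in MUTATED)

# Best approximate match in each sequence (0-based start positions).
closest = [closest_window(seq, PATTERN) for seq in MUTATED]

print(exact_hits)
print(closest)
```

No sequence contains the pattern exactly any more, yet every sequence still contains a window within two mismatches of it, which is what motivates a scoring scheme that tolerates variation.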
8
Solution

Use a profile matrix to allow for variations in
the pattern
The length (number of columns) of the matrix is
the length of the profile, l.
The matrix has one row for each possible
base/residue (4 rows for DNA).

9
Solution

Given a set of t sequences, construct a 4 x l
matrix for every possible alignment of the
sequences.
For each possible combination of starting positions,
one per sequence
Count the number of times each nucleotide appears
in each position.
The score of the resulting alignment is the sum,
over all positions, of the highest nucleotide
count in each position.
The alignment with the greatest score corresponds
to the motif.
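
The procedure above can be sketched as an exhaustive search. This is a minimal illustration on hypothetical toy data (three short sequences sharing the 4-mer GATC), not the example sequences from the slides; the function names are assumptions, and start positions are 0-based.

```python
from itertools import product


def score(seqs, starts, l):
    """Sum over columns of the largest nucleotide count in that column."""
    windows = [s[i:i + l] for s, i in zip(seqs, starts)]
    return sum(max(col.count(b) for b in "ACGT") for col in zip(*windows))


def brute_force_motif(seqs, l):
    """Try every combination of start positions; keep the best-scoring one."""
    ranges = [range(len(s) - l + 1) for s in seqs]
    best_starts, best_score = None, -1
    for starts in product(*ranges):
        sc = score(seqs, starts, l)
        if sc > best_score:
            best_starts, best_score = starts, sc
    return best_starts, best_score


# Hypothetical toy input: each sequence contains the 4-mer GATC once.
TOY = ["TTGATCAA", "GATCCCGT", "ACGGATCA"]
best_starts, best_score = brute_force_motif(TOY, 4)
print(best_starts, best_score)
```

On this toy input the search examines 5 x 5 x 5 = 125 alignments. The same loop is impractical for realistic inputs because the number of alignments grows exponentially with the number of sequences.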

10
Example
CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT
GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT
GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA
TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG

Construct a PWM for the alignment in bold

12
Example

Score the alignment by taking the maximum score
in each column

Score: 25
Consensus sequence: TTGGTCCA

13
Example
CTGCGGTACATcCAgCTCCCAAAGTTTCTGTCTTACTAAC
TGGTCCCAGGACTCATGCggGCAACTGTAGGCTAAGAGTT
GTATGgAtCTGTAATGGTCAACGTGTCCCGCCAAACATTA
AATGTCAaGCAACcTCACTGGTGCCATTAATTATAGAATG
TTTAACCGATATGAAATAGGCCTGGCCtTGgAACTACATT
GCCGTACCGACACACATTCTTTGGATGCcAtTCATCCCTA
TAGGTCTCGCTATGgcACTCGGCTGGTCGAATGGTCCGAG

Construct a PWM for the alignment in bold

15
Example

Score the alignment by taking the maximum score
in each column

Score: 42
Consensus sequence: ATGCAACT
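
The score and consensus can be checked with a short Python sketch. It assumes that the alignment being scored is the set of seven mutated motif instances (the 8-mers containing the lowercase letters), since the highlighting is not preserved here; under that assumption it reproduces the reported values. The function names are hypothetical.

```python
# The seven mutated motif instances read off the example sequences
# (ASSUMPTION: this is the alignment that was highlighted on the slide).
INSTANCES = [
    "ATcCAgCT",
    "ggGCAACT",
    "ATGgAtCT",
    "AaGCAACc",
    "tTGgAACT",
    "ATGCcAtT",
    "ATGgcACT",
]


def build_profile(instances):
    """4 x l count matrix: profile[b][j] = occurrences of base b in column j."""
    l = len(instances[0])
    profile = {b: [0] * l for b in "ACGT"}
    for inst in instances:
        for j, ch in enumerate(inst.upper()):
            profile[ch][j] += 1
    return profile


def score_and_consensus(profile):
    """Score = sum of column maxima; consensus = most frequent base per column."""
    l = len(profile["A"])
    score = sum(max(profile[b][j] for b in "ACGT") for j in range(l))
    consensus = "".join(
        max("ACGT", key=lambda b: profile[b][j]) for j in range(l)
    )
    return score, consensus


profile = build_profile(INSTANCES)
score, consensus = score_and_consensus(profile)
for b in "ACGT":
    print(b, profile[b])
print(score, consensus)
```

The column maxima are 5, 5, 6, 4, 5, 5, 6, 6, which sum to 42.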

16
Formally Stated

Given
A set of t DNA sequences, each with n nucleotides
Select one position in each of the t sequences,
forming an array of starting positions
$s = (s_1, s_2, \ldots, s_t)$ where $1 \le s_i \le n - l + 1$
How many starting positions are there in each
sequence?
How many combinations of starting positions are
there in all?

17
Scoring

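Let $P(s)$ = the profile matrix corresponding to
starting positions s.
$M_{P(s)}(j)$ = the largest count in column j of $P(s)$.
$M_{P(\text{last example})}(1) = 5$
$M_{P(\text{last example})}(3) = 6$
Score: $\mathrm{Score}(s) = \sum_{j=1}^{l} M_{P(s)}(j)$

This score is based on absolute frequencies. A
statistically based score would be [formula not preserved in the transcript]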
18
Representing patterns Matrices
CTTGGTGACGTG GTGAGTGACGTC CGGGTTGACGCA CCTACTTACGTA
TATGGTGACGTC TCGGATGACGAT TAGGATGACGTC CCTGGTGACGCC
CGCGGTGACGTA GCCGTTGACGCC CGCGATGACGCA CCTGTTGACGTG
TTGCATGACGTC GTTGGTGACGTG GAGGATGACGTT GGTCGTGACGTA
Given N sequence fragments of fixed length, one
can assemble a position frequency matrix (number
of times a particular nucleotide appears at a
given position)
Pos  1  2  3  4  5  6  7  8  9 10 11 12
A    0  3  0  2  5  0  0 16  0  0  1  5
C    7  5  3  3  1  0  0  0 16  0  5  6
G    5  4  6 11  7  0 15  0  0 16  0  3
T    4  4  7  0  3 16  1  0  0  0 10  2
Position frequency matrix (PFM) (aka raw count
matrix or conservation and substitution matrix)
19
More on scores
Using pseudocounts (N = 16)
Converting a PFM into a PWM
Pos  1  2  3  4  5  6  7  8  9 10 11 12
A    0  3  0  2  5  0  0 16  0  0  1  5
C    7  5  3  3  1  0  0  0 16  0  5  6
G    5  4  6 11  7  0 15  0  0 16  0  3
T    4  4  7  0  3 16  1  0  0  0 10  2
For each matrix element, compute the weight:
Pos       1       2       3       4       5       6       7       8       9      10      11      12
A        -2       0      -2  -0.415   0.585      -2      -2   2.088      -2      -2      -1   0.585
C         1   0.585       0       0      -1      -2      -2      -2   2.088      -2   0.585   0.807
G     0.585   0.322   0.807   1.585       1      -2       2      -2      -2   2.088      -2       0
T     0.322   0.322       1      -2       0   2.088      -1      -2      -2      -2   1.459  -0.415
n(b,i) = raw count (PFM matrix element) of
nucleotide b in column i
N = number of sequences used to create the PFM (= column sum)
pseudocounts = correction for small sample size
p(b) = background frequency of nucleotide b
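
The conversion can be sketched in Python. The formula on the slide is not preserved, so the one used below is an assumption inferred from the numbers: the weights in the table agree, to within rounding, with log2((n(b,i) + 1) / (N * p(b))) using a pseudocount of 1 and a uniform background p(b) = 0.25. The function name `pfm_to_pwm` is hypothetical.

```python
import math

# Position frequency matrix from the slide (16 sequences, 12 columns).
PFM = {
    "A": [0, 3, 0, 2, 5, 0, 0, 16, 0, 0, 1, 5],
    "C": [7, 5, 3, 3, 1, 0, 0, 0, 16, 0, 5, 6],
    "G": [5, 4, 6, 11, 7, 0, 15, 0, 0, 16, 0, 3],
    "T": [4, 4, 7, 0, 3, 16, 1, 0, 0, 0, 10, 2],
}


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Log-odds weights log2((n(b,i) + pseudocount) / (N * p(b))).

    ASSUMPTION: this form, pseudocount=1 and p(b)=0.25 are inferred from
    the table values; they are not stated in the transcript.
    """
    n_seqs = sum(pfm[b][0] for b in "ACGT")  # column sum = N
    return {
        b: [math.log2((n + pseudocount) / (n_seqs * background)) for n in pfm[b]]
        for b in "ACGT"
    }


PWM = pfm_to_pwm(PFM)
for b in "ACGT":
    print(b, [round(w, 3) for w in PWM[b]])
```

A count of 0 maps to -2, a count of 3 maps to 0 (no better than background), and a count of 16 maps to roughly 2.09.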
20
Detecting binding sites using a PWM
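
One common way to use a PWM for detection is to slide it along a sequence and sum the weights of the bases in each window. The sketch below is an illustration under stated assumptions, not the method shown on the slide: the weights use an assumed log-odds form with pseudocount 1 and background 0.25, the test sequence is hypothetical, and the function name `scan` is made up for this example.

```python
import math

# Position frequency matrix from the earlier slide (16 sequences).
PFM = {
    "A": [0, 3, 0, 2, 5, 0, 0, 16, 0, 0, 1, 5],
    "C": [7, 5, 3, 3, 1, 0, 0, 0, 16, 0, 5, 6],
    "G": [5, 4, 6, 11, 7, 0, 15, 0, 0, 16, 0, 3],
    "T": [4, 4, 7, 0, 3, 16, 1, 0, 0, 0, 10, 2],
}
N = 16

# ASSUMPTION: log-odds weights with pseudocount 1 and background 0.25.
PWM = {b: [math.log2((n + 1) / (N * 0.25)) for n in row] for b, row in PFM.items()}


def scan(seq, pwm):
    """Score every window of seq; return a list of (start, score) pairs."""
    width = len(pwm["A"])
    return [
        (i, sum(pwm[seq[i + j]][j] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


# Hypothetical test sequence: the consensus site CCTGGTGACGTC at offset 6.
SEQ = "AAAAAA" + "CCTGGTGACGTC" + "AAAAAA"
hits = scan(SEQ, PWM)
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, round(best_score, 3))
```

In practice a score threshold decides which windows are reported as candidate binding sites; choosing that threshold is outside the scope of this sketch.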
21
Problems with Matrices

A large search space gives a larger proportion of false
positives
There are $(n - l + 1)^t$ sets of starting points for t
sequences of length n, when looking for a motif
of length l.
This grows exponentially with the number of
sequences.
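
To make the size of the search space concrete, here is a small Python calculation for the running example (t = 7 sequences of length n = 40, motif length l = 8). The function name `num_alignments` is hypothetical.

```python
def num_alignments(n, l, t):
    """Number of ways to pick one start position in each of t sequences."""
    per_sequence = n - l + 1
    return per_sequence ** t


per_sequence = 40 - 8 + 1
total = num_alignments(40, 8, 7)
print(per_sequence)  # start positions per sequence
print(total)         # combinations over all seven sequences
```

Each sequence has 33 possible start positions, and seven sequences give 33^7, about 4.3 x 10^10 alignments; adding one more sequence multiplies that by another factor of 33.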

22
Phylogenetic footprinting

Using cross-species conservation of regulatory
elements to improve binding site prediction
23
Phylogenetic-Footprinting

A method of identifying conserved motifs in a set
of orthologous sequences from multiple species
using the notion that
Selective pressure causes functional elements to
evolve at a slower rate than nonfunctional
sequences
Identifies the best conserved motifs in those
regions
Guaranteed to report all sets of motifs with the
lowest parsimony scores.

24
Parsimony Score

Hamming distance (Parsimony Score)
Number of changes applied to a sequence to obtain
another sequence
If v = ATTGTC and w = ACTCTC, then d(v, w) = 2

v: ATTGTC
    x x
w: ACTCTC
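
The distance in this example can be computed with a few lines of Python. The function name `hamming` is illustrative.

```python
def hamming(v, w):
    """Number of positions at which two equal-length sequences differ."""
    if len(v) != len(w):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(v, w))


v, w = "ATTGTC", "ACTCTC"
distance = hamming(v, w)
mismatches = [i + 1 for i, (a, b) in enumerate(zip(v, w)) if a != b]
print(distance, mismatches)
```

The two sequences differ at positions 2 and 4 (1-based), giving a distance of 2.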
25
Substring Parsimony Problem

Given
n orthologous sequences $S_1, S_2, S_3, \ldots, S_n$
T = phylogenetic tree relating the sequences
k = length of motif
d = maximum parsimony score
Problem
Find a set of k-mers, one from each input sequence,
with parsimony score $\le d$ with respect to T
Parameters
k and d can be specified by the user

26
Terminology
Leaf Nodes
Internal Nodes
Chicken Human Mouse
v
C(v) = {Human, Mouse}
C(v) = ?
27
DP Solution

Proceeds from the leaves of T up to its root
At each node u of T, compute a table $W_u$
containing $4^k$ entries, one for each possible
k-mer
For a string s of length k,
$W_u[s]$ = the best parsimony score that can be
achieved for the subtree rooted at u

28
DP Solution

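C(u) = set of children of u
d(s,t) = Hamming distance between sequences s and
t
$\Sigma = \{A, C, G, T\}$
Tables W can now be computed
$W_u[s] = 0$ if u is a leaf and s is a substring of $S_u$
$W_u[s] = \infty$ if u is a leaf and s is not a substring of $S_u$
$W_u[s] = \sum_{v \in C(u)} \min_{t \in \Sigma^k} \left( W_v[t] + d(s,t) \right)$ if u is not a leaf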

29
An Exact Algorithm
$W_u[s]$ = best parsimony score for the subtree rooted
at node u, if u is labeled with string s.
30
Small Example
ATGC... (Chicken) ACGT... (Human) ACGG... (Mouse)
Size of motif sought: k = 2
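
The dynamic program can be sketched in Python on this small example. Several things here are assumptions made for illustration: only the four visible letters of each sequence are used (the rest is elided on the slide), the tree groups Human and Mouse under one internal node with Chicken as its sibling, internal nodes take the sum over children of the cheapest child label, and the names `parsimony_table`, `SEQS` and `TREE` are hypothetical.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def parsimony_table(node, seqs, k):
    """Return W_u as a dict mapping each k-mer s to its best parsimony score."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if isinstance(node, str):  # leaf: node is a species name
        seq = seqs[node]
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        return {s: 0 if s in present else float("inf") for s in kmers}
    # Internal node: sum over children of the cheapest child label t.
    child_tables = [parsimony_table(child, seqs, k) for child in node]
    return {
        s: sum(min(w[t] + hamming(s, t) for t in kmers) for w in child_tables)
        for s in kmers
    }


# ASSUMPTION: only the visible prefixes of the three sequences are used.
SEQS = {"Chicken": "ATGC", "Human": "ACGT", "Mouse": "ACGG"}
# ASSUMPTION: tree shape ((Human, Mouse), Chicken), written as nested tuples.
TREE = ("Chicken", ("Human", "Mouse"))

root_table = parsimony_table(TREE, SEQS, 2)
best_score = min(root_table.values())
best_labels = sorted(s for s, w in root_table.items() if w == best_score)
print(best_score, best_labels)
```

Human and Mouse share the 2-mers AC and CG, so their common ancestor can be labeled at cost 0, but no 2-mer occurs in all three prefixes, so the best score at the root is 1. Recovering the actual k-mers chosen at the leaves would require a traceback, which this sketch omits.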
31
Problems with Phylogenetic-Footprinting

Divergence of organisms being compared needs to
be sufficient to allow for sequence divergence
Comparing non-mammalian organisms to mammalian
organisms can limit types of functional elements
being found (primate specific, mammalian
specific, etc)
Requires several different organisms
