Title: Efficient Algorithms for Motif Search
1 Efficient Algorithms for Motif Search
- Sudha Balla
- Sanguthevar Rajasekaran
- University of Connecticut
2 Problem 1 Definition
Input: n sequences of length m each, and integers
l and d such that l << m and d < l. Each input
sequence has an occurrence of a motif M of length
l at a Hamming distance of d from M. Output: M.
The above problem is known as the Planted (l, d)
Motif Problem.
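The definition can be made concrete with a small checker. This is a minimal sketch, not part of the original slides: the function names are illustrative, and it assumes that an occurrence means an l-mer within Hamming distance at most d of M.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))


def occurs_within(motif, seq, d):
    """True if some l-mer of seq is within Hamming distance d of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def is_planted_motif(motif, seqs, d):
    """True if motif has an occurrence (assumed: distance <= d) in every sequence."""
    return all(occurs_within(motif, s, d) for s in seqs)
```

For example, is_planted_motif("ACGT", ["TTACCTTT", "ACGAAAAA"], 1) holds because ACCT and ACGA each differ from ACGT in one position.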
3 Problem 2 Definition
Input is a database DB of n sequences, integers
l, d, and q. Output should be all the patterns
in DB such that each pattern is of length l and
it occurs in at least q of the n sequences. A
pattern u is considered an occurrence of another
pattern v as long as the edit distance between u
and v is at most d.
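Problem 2 replaces Hamming distance with edit distance. The following is a minimal sketch of the standard dynamic-programming (Levenshtein) computation, assuming unit cost for insertions, deletions and substitutions; the slides do not specify a cost model, so that is an assumption here.

```python
def edit_distance(u, v):
    """Unit-cost edit distance between u and v, computed row by row."""
    prev = list(range(len(v) + 1))      # distances from the empty prefix of u
    for i, a in enumerate(u, 1):
        cur = [i]                       # deleting the first i characters of u
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]
```

With this definition, ACT is an occurrence of ACGT for d = 1, since a single deletion turns one into the other.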
4 Problem 1 State of the Art
Two kinds of algorithms are known: approximate
and exact. WINNOWER (Pevzner and Sze 2000) and
PROJECTION (Buhler and Tompa 2001) are
approximate algorithms. MITRA (Eskin and Pevzner
2002) is an exact algorithm.
5 A Probabilistic Analysis
Problem 1 is complicated by the fact that, for a
given value of l, the higher the value of d, the
higher the expected number of motifs that occur
by random chance. For instance, when n = 20,
m = 600, l = 9, d = 2, the expected number of
spurious motifs is 1.6. On the other hand, for
n = 20, m = 600, l = 10, d = 2, the expected
number of spurious motifs is only 6.1 x 10^-8.
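These figures can be reproduced with a common estimate. The slides do not state the formula, so the one below is an assumption: sequences are taken to be i.i.d. uniform over a 4-letter alphabet, and the positions within a sequence are treated as independent. Under that model, a fixed l-mer matches a random l-mer within d mismatches with probability p, and the expected count is 4^l (1 - (1 - p)^(m - l + 1))^n.

```python
from math import comb


def expected_spurious(n, m, l, d, sigma=4):
    """Expected number of l-mers occurring by chance in all n sequences.

    Assumes i.i.d. uniform sequences over an alphabet of size sigma and
    independence between the m - l + 1 positions of each sequence.
    """
    # Probability that a random l-mer is within d mismatches of a fixed one.
    p = sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1)) / sigma ** l
    # Probability that a fixed l-mer occurs somewhere in one random sequence.
    per_seq = 1 - (1 - p) ** (m - l + 1)
    return sigma ** l * per_seq ** n
```

Under these assumptions the estimate gives about 1.6 for (l, d) = (9, 2) and about 6.1 x 10^-8 for (10, 2), with n = 20 and m = 600.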
6 WINNOWER
Generate all l-mers from all the input
sequences. The number of such l-mers is
O(nm). Generate a graph G(V,E). Each l-mer is a
node in G. Two nodes are connected if the hamming
distance between them is at most 2d. Find all
cliques in the graph. Process these cliques to
identify M.
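The graph construction step can be sketched as follows. This covers only the construction, not the clique search or the edge elimination. Names are illustrative, and connecting only l-mers from different sequences is an assumption made here so that the graph is multipartite.

```python
from itertools import combinations


def hamming(u, v):
    """Number of mismatching positions between equal-length strings."""
    return sum(a != b for a, b in zip(u, v))


def build_graph(seqs, l, d):
    """Nodes are (sequence index, offset) pairs; edges join l-mers from
    different sequences whose Hamming distance is at most 2d."""
    nodes = [(j, i) for j, s in enumerate(seqs)
             for i in range(len(s) - l + 1)]

    def lmer(node):
        j, i = node
        return seqs[j][i:i + l]

    edges = set()
    for a, b in combinations(nodes, 2):
        # Assumption: no edges inside one sequence (one part per sequence).
        if a[0] != b[0] and hamming(lmer(a), lmer(b)) <= 2 * d:
            edges.add((a, b))
    return nodes, edges
```

The threshold 2d reflects that two occurrences of M, each within d of M, can differ from each other in at most 2d positions. The pairwise loop is quadratic in the number of l-mers.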
7 WINNOWER Details
Pevzner and Sze observe that the graph G
constructed above is 'almost random' and is
multipartite. They use the notion of an
extendable clique. If Q is any clique, node u is
called a neighbor of Q if the nodes in Q and u
also form a clique. A clique is called
extendable if it has at least one neighbor in
every part of the multipartite graph G. The
algorithm WINNOWER is based on the observation
that every edge in a maximal n-clique belongs to
at least C(n-2, k-2) extendable cliques of size
k, where C(a, b) denotes the binomial
coefficient. This observation is used to
eliminate edges.
8 PROJECTION
Let C be the collection of all l-mers in the
input. Project these l-mers along k randomly
chosen columns. (k is typically 7). Group the
k-mers such that equal k-mers are in the same
group. If a group is of size greater than a
threshold s (s is typically 3), then M is likely
to have this k-mer. The rest of M is computed
using maximum likelihood estimates.
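The projection and grouping steps can be sketched as below. This is a minimal illustration with hypothetical function names; the refinement of a heavy group into a full motif by maximum likelihood is not shown.

```python
import random
from collections import defaultdict


def project_buckets(seqs, l, k, rng):
    """Hash every l-mer by its letters at k randomly chosen columns."""
    cols = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for j, s in enumerate(seqs):
        for i in range(len(s) - l + 1):
            lmer = s[i:i + l]
            key = "".join(lmer[c] for c in cols)   # the projected k-mer
            buckets[key].append((j, i))            # remember where it came from
    return cols, buckets


def heavy_buckets(buckets, s):
    """Groups whose size exceeds the threshold s."""
    return {key: members for key, members in buckets.items()
            if len(members) > s}
```

A typical call would be project_buckets(seqs, l, 7, random.Random(seed)) followed by heavy_buckets(buckets, 3). In practice the projection is repeated over several random column choices.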
9 MITRA
MITRA is based on WINNOWER and uses pairwise
similarity information. MITRA uses a mismatch
tree data structure and splits the space of all
possible patterns into disjoint subspaces that
start with a given prefix. Pruning is applied in
each subspace.
10 Pattern Branching
One way of solving the planted motif search
problem is to start from each l-mer in the input,
search the neighbors of this l-mer, score them
appropriately and output the best scoring
neighbor. Pattern Branching only examines a
selected subset of neighbors of any l-mer u of
the input and hence is more efficient. For any
l-mer u, let D_i(u) stand for the set of neighbors
of u that are at a hamming distance of i. For any
input sequence S_j, let d(u, S_j) denote the
minimum hamming distance between u and any l-mer
of S_j. Let d(u, S) = Σ_{j=1..n} d(u, S_j).
11 Pattern Branching Contd
For any l-mer u in the input, let BestNeighbor(u)
stand for the neighbor v in D_1(u) whose distance
d(v, S) is minimum from among all the elements of
D_1(u). The PatternBranching algorithm starts
from a u and identifies u_1 = BestNeighbor(u).
Then it identifies u_2 = BestNeighbor(u_1), and so
on. It finally outputs u_d. The best u_d from
among all possible u's is output.
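The branching procedure described above can be sketched as follows. It is a minimal illustration, assuming the final candidates are ranked by the same total distance d(u, S) that guides each step, and that ties are broken arbitrarily; function names are hypothetical.

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def dist_to_seq(u, seq):
    """d(u, S_j): minimum Hamming distance from u to any l-mer of seq."""
    l = len(u)
    return min(hamming(u, seq[i:i + l]) for i in range(len(seq) - l + 1))


def total_distance(u, seqs):
    """d(u, S): sum of d(u, S_j) over all input sequences."""
    return sum(dist_to_seq(u, s) for s in seqs)


def best_neighbor(u, seqs, alphabet="ACGT"):
    """The string at Hamming distance exactly 1 from u with minimum d(., S)."""
    best = None
    for i in range(len(u)):
        for c in alphabet:
            if c != u[i]:
                v = u[:i] + c + u[i + 1:]
                score = total_distance(v, seqs)
                if best is None or score < best[0]:
                    best = (score, v)
    return best[1]


def pattern_branching(seqs, l, d, alphabet="ACGT"):
    """Branch d times from every input l-mer; return the best end point."""
    best = None
    for s in seqs:
        for i in range(len(s) - l + 1):
            u = s[i:i + l]
            for _ in range(d):
                u = best_neighbor(u, seqs, alphabet)
            score = total_distance(u, seqs)   # assumed ranking criterion
            if best is None or score < best[0]:
                best = (score, u)
    return best[1]
```

Each start point examines only d * l * (|alphabet| - 1) neighbors rather than the whole d-neighborhood, which is where the saving comes from. The result is a heuristic answer, not a guaranteed one.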
12 A Simple Algorithm
1) Form all possible l-mers from the input
sequences. Let C be this collection. Let C' be
the collection of l-mers in the first input
sequence.
2) For every u in C', generate all l-mers that are
at a hamming distance of d from u. Let C'' be the
collection of these l-mers. Note that C''
contains M.
3) For every pair of l-mers (u, v) with u in C
and v in C'', compute the hamming distance between
u and v. Output that l-mer of C'' that has a
neighbor (i.e., an l-mer at a hamming distance of
d) in each one of the n input sequences.
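The three steps can be sketched directly. This is a minimal illustration rather than the authors' implementation: names are hypothetical, and neighbors are taken to be l-mers within Hamming distance at most d (an assumption; the slide says "a hamming distance of d").

```python
from itertools import combinations, product


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def neighbors(u, d, alphabet="ACGT"):
    """All strings within Hamming distance d of u, including u itself."""
    out = set()
    for r in range(d + 1):
        for pos in combinations(range(len(u)), r):
            # For each chosen position, try every letter that differs.
            choices = [[c for c in alphabet if c != u[p]] for p in pos]
            for subs in product(*choices):
                v = list(u)
                for p, c in zip(pos, subs):
                    v[p] = c
                out.add("".join(v))
    return out


def simple_motif_search(seqs, l, d, alphabet="ACGT"):
    # Step 1: l-mers of every sequence (the first one seeds the candidates).
    per_seq = [{s[i:i + l] for i in range(len(s) - l + 1)} for s in seqs]
    # Step 2: candidate motifs are the neighbors of l-mers in sequence 1.
    candidates = set()
    for u in per_seq[0]:
        candidates |= neighbors(u, d, alphabet)
    # Step 3: keep candidates that have a neighbor in every sequence.
    return sorted(v for v in candidates
                  if all(any(hamming(v, u) <= d for u in lmers)
                         for lmers in per_seq))
```

Step 3 compares every candidate with every input l-mer, which is what makes this version expensive.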
13 A Simple Algorithm Contd
The run time of the above algorithm is
14 PMS1
1) Generate all possible l-mers from each of the n
input sequences. Let C_i be the collection of
l-mers from the i-th sequence.
2) For each C_i and each u in C_i, generate all
l-mers v such that u and v are at a hamming
distance of d. Let C_i' be the collection of
these neighbors of C_i.
3) Sort all the l-mers in every C_i'. Let L_i be
the sorted list corresponding to C_i'.
4) Merge all the L_i's and output the generated
(in step 2) l-mer that occurs in all the L_i's.
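The four steps can be sketched as below. This is an illustrative version with two stated departures from the slides: neighbors are taken within Hamming distance at most d, and Python's built-in comparison sort stands in for the radix sort on packed integers. Function names are hypothetical.

```python
from itertools import combinations, product


def neighbors(u, d, alphabet="ACGT"):
    """All strings within Hamming distance d of u, including u itself."""
    out = set()
    for r in range(d + 1):
        for pos in combinations(range(len(u)), r):
            choices = [[c for c in alphabet if c != u[p]] for p in pos]
            for subs in product(*choices):
                v = list(u)
                for p, c in zip(pos, subs):
                    v[p] = c
                out.add("".join(v))
    return out


def intersect_sorted(a, b):
    """Merge two sorted duplicate-free lists, keeping the common items."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def pms1(seqs, l, d, alphabet="ACGT"):
    sorted_lists = []
    for s in seqs:
        nbrs = set()                          # neighbors of all l-mers of s
        for i in range(len(s) - l + 1):       # steps 1 and 2
            nbrs |= neighbors(s[i:i + l], d, alphabet)
        sorted_lists.append(sorted(nbrs))     # step 3 (comparison sort here)
    common = sorted_lists[0]
    for lst in sorted_lists[1:]:              # step 4: merge the lists
        common = intersect_sorted(common, lst)
    return common
```

Unlike the simple algorithm, no candidate is ever compared against the input l-mers directly; membership in every list is established by the merge alone.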
15 PMS1 Contd
The run time of PMS1 is (Here w is the word
length of the computer. Radix sort is used.)
16 PMS2
Note that if M occurs in every input sequence,
then every substring of M also occurs in every
input sequence. In particular, there are at
least l - k + 1 k-mers (for d < k < l) such
that each of these occurs in every input sequence
at a hamming distance of at most d. Let Q be
the collection of k-mers that can be formed out
of M. There are l - k + 1 k-mers in Q. Each one
of these k-mers will be present in each input
sequence at a hamming distance of at most d.
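The observation can be checked on a small instance. The sketch below only illustrates the property that PMS2 relies on and is not the PMS2 algorithm itself; names are hypothetical.

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def occurs_within(pattern, seq, d):
    """True if some window of seq is within Hamming distance d of pattern."""
    k = len(pattern)
    return any(hamming(pattern, seq[i:i + k]) <= d
               for i in range(len(seq) - k + 1))


def kmers(motif, k):
    """The l - k + 1 substrings of length k of the motif (the collection Q)."""
    return [motif[i:i + k] for i in range(len(motif) - k + 1)]


def all_kmers_occur(motif, seqs, k, d):
    """Check that every k-mer of the motif occurs within d in every sequence."""
    return all(occurs_within(q, s, d) for q in kmers(motif, k) for s in seqs)
```

The property holds because restricting an occurrence of M to any window of k columns cannot increase the number of mismatches.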
17 PMS3
This algorithm enables one to handle large values
of d. Let d' = ⌊d/2⌋. Let M be the motif of
interest with |M| = l = 2l' for some integer l'.
Let M1 refer to the first half of M and M2 to the
second half. We know that M occurs in every input
sequence. Let S be an arbitrary input sequence
and let p be the occurrence of M in S. If p1 and
p2 are the two halves of p, then either (1) the
hamming distance between M1 and p1 is at most d',
or (2) the hamming distance between M2 and p2 is
at most d'.
18 PMS3 Contd
Also, note that in every input sequence either
the first half M1 occurs with a hamming distance
of at most d' = ⌊d/2⌋ or the second half M2 occurs
with a hamming distance of at most d'. As a
result, at least one of M1 and M2 occurs with a
hamming distance of at most d' in at least n/2 of
the sequences. PMS3 exploits these observations.
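The halving argument is a pigeonhole step: if both halves of an occurrence had more than ⌊d/2⌋ mismatches, the whole occurrence would have more than d. The sketch below checks this exhaustively for a short motif. It assumes an even-length motif, and the function names are illustrative.

```python
from itertools import product


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def half_distances(motif, occurrence):
    """Mismatch counts in the first and second halves (even length assumed)."""
    h = len(motif) // 2
    return (hamming(motif[:h], occurrence[:h]),
            hamming(motif[h:], occurrence[h:]))


def half_bound_holds(motif, d, alphabet="ACGT"):
    """Exhaustive check: every string within distance d of motif has at
    least one half within distance d // 2 of the matching motif half."""
    for p in map("".join, product(alphabet, repeat=len(motif))):
        left, right = half_distances(motif, p)
        if left + right <= d and min(left, right) > d // 2:
            return False
    return True
```

The exhaustive loop enumerates |alphabet|^l strings, so it is only meant for very short motifs.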
19 Experimental Data
l   d   T (s)     l   d   T (s)     l   d   T (s)
9   2   1.44
10  2   0.84
11  2   0.78      11  3   19.84
12  2   0.84      12  3   15.53
13  2   0.70      13  3   20.98     13  4   228.94
14  2   1.05      14  3   20.38     14  4   226.83
15  2   1.33      15  3   20.53     15  4   217.34
16  2   2.61      16  3   21.20     16  4   216.92
20 A Comparison with MITRA
For l = 11 and d = 2, MITRA takes one minute
whereas PMS2 takes around a second. For l = 12 and
d = 3, two versions of MITRA take one minute and
four minutes, respectively. PMS2 takes 15.53
seconds. For l = 14 and d = 4, two versions of
MITRA take 4 minutes and 10 minutes, respectively.
PMS2 takes 226.83 seconds.
21 Known Algorithms for Problem 2
Sagot's (1998) algorithm runs in time
O(n^2 m l^d |Σ|^d), where Σ is the alphabet, and
is based on generalized suffix trees. Space used
is O(n^2 m / w), where w is the word length of the
computer. This algorithm builds a suffix tree on
the given sequences in O(nm) time using O(nm)
space. If u is any l-mer present in the input,
there are O(l^d (|Σ| - 1)^d) possible neighbors
for u. Any of these neighbors could potentially
be a motif of interest. Since there are O(nm)
l-mers in the input, the number of such neighbors
is O(n m l^d (|Σ| - 1)^d).
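The neighbor bound can be compared with an exact count. The sketch below counts the strings at Hamming distance between 1 and d from an l-mer, which is the sum of C(l, i) (|Σ| - 1)^i over i = 1..d, and evaluates the looser bound l^d (|Σ| - 1)^d. Function names are illustrative.

```python
from math import comb


def neighbor_count(l, d, sigma=4):
    """Exact number of strings at Hamming distance 1..d from a fixed l-mer."""
    return sum(comb(l, i) * (sigma - 1) ** i for i in range(1, d + 1))


def neighbor_bound(l, d, sigma=4):
    """The bound l^d (sigma - 1)^d used in the run-time analysis."""
    return l ** d * (sigma - 1) ** d
```

For DNA with l = 9 and d = 2 the exact count is 27 + 324 = 351, against a bound of 729.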
22 Sagot's Algorithm Contd
This algorithm, for each such neighbor v, walks
through the tree to check if v is a possible
answer. This walking step is referred to as
'spelling'. The spelling operation takes a total
of O(n^2 m l^d (|Σ| - 1)^d) time using an
additional O(nm) space. When employed for solving
Problem 2, the same algorithm takes
O(n^2 m l^d |Σ|^d) time. The algorithm of Adebiyi
and Kaufmann (2002) takes an expected
O(nm + d (nm)^1.9 log nm) time.
23 An Algorithm Similar to PMS1
The basic idea behind the algorithm is as follows.
We generate all possible l-mers in the database.
There are at most mn such l-mers and these are
the patterns of interest. For each such l-mer we
want to determine if it occurs in at least q of
the input sequences. Let u be one of the above
l-mers. If v is a string such that the edit
distance between u and v is at most d, then we
say v is a neighbor of u. We generate all the
neighbors of u. For each neighbor v of u we
determine a list of input sequences in which v is
present. These lists (over all possible neighbors
of u) are then merged to obtain a list of input
sequences in which u occurs (within an edit
distance of d).
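The idea can be sketched as below. This is a minimal illustration, not the authors' array-based implementation: neighbors are generated by applying up to d single-character edits, sets stand in for the merged lists, and it assumes l > d so that the empty string is never a neighbor. Function names are hypothetical.

```python
def edit_neighbors(u, d, alphabet="ACGT"):
    """All strings within edit distance d of u (reached by <= d single edits)."""
    seen = {u}
    frontier = {u}
    for _ in range(d):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])            # insertion
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])            # deletion
                    for c in alphabet:
                        nxt.add(s[:i] + c + s[i + 1:])    # substitution
        frontier = nxt - seen
        seen |= nxt
    return seen


def frequent_patterns(db, l, d, q, alphabet="ACGT"):
    """Map each l-mer of db to the sequences it occurs in (within edit
    distance d), keeping only l-mers that occur in at least q sequences."""
    patterns = {s[i:i + l] for s in db for i in range(len(s) - l + 1)}
    result = {}
    for u in patterns:
        hits = set()
        for v in edit_neighbors(u, d, alphabet):
            # Sequences that contain the neighbor v exactly.
            hits.update(j for j, s in enumerate(db) if v in s)
        if len(hits) >= q:
            result[u] = sorted(hits)
    return result
```

For instance, with db = ["AACGTT", "GGACTTG", "CCCCCC"], l = 4, d = 1 and q = 2, the pattern ACGT is reported for sequences 0 and 1, because its neighbor ACT appears in the second sequence.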
24 New Algorithm Contd
The above algorithm runs in time O(n^2 m l^d |Σ|^d).
The space used is O(nmd + l^d |Σ|^d). This is
less space than prior algorithms use. Only arrays
are used in the new algorithm. The underlying
constant is small, and hence the algorithm will
potentially perform better in practice than
Sagot's algorithm.
25 Thank You.