Title: An Optimal Algorithm for MAX-SUM SEGMENT
1An Optimal Algorithm for MAX-SUM SEGMENT and Its
Applications in Bioinformatics
- Hsueh-I Lu
- Academia Sinica, Taiwan
2This is joint work with
- Institute of Statistics, National Central University
- Tsai-Hung Fan
- Tsung-Shan Tsou
- Institute of Biomedical Sciences, Academia Sinica
- Adam Yao
- Shufen Lee
- Tsai-Cheng Wang
3Outline
- The MAX-SUM SEGMENT problem
- Previous work and our algorithm
- A feature of our algorithm
- Processing the input sequence in an on-line manner
- An application in bioinformatics
- Finding new repeats in genomic sequences
4The MAX-SUM SEGMENT Problem
- Input
- a sequence S of n numbers
- two length bounds L and U.
- Output
- a segment S[i, j] with maximum sum over all segments of S with length at least L and at most U.
5Example
- S = -3, 2, -2, 5, -4, 1, -2, 3, 1
- L = 2, U = 2
- L = 2, U = 8
(Figure: segments of S marked with sums 2, 4, 5, and 3, one of them labeled "A feasible segment".)
6Simple Observations
- There are only O(n^2) segments S[i, j] in S: O(n) values for i and O(n) values for j.
- The sum of each segment S[i, j] can be computed in O(1) time.
7Simple Observations
- The sum of each segment S[i, j] can be computed in O(1) time, given appropriate linear-time preprocessing.
8Prefix sums of S
- Define prefix-sum(i) = sum of S[1, i], with prefix-sum(0) = 0.
- O(n) time to pre-compute prefix-sum(i) for all indices i.
- Sum of S[i, j] = prefix-sum(j) - prefix-sum(i - 1)
(Figure: positions 1, i, and j marked along S.)
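A minimal Python sketch of the prefix-sum idea on this slide. The names `prefix_sums` and `segment_sum` are illustrative and do not come from the slides; indices are 1-based as on the slides, and prefix-sum(0) is taken to be 0.

```python
from itertools import accumulate

def prefix_sums(S):
    """Return P with P[i] = S[1] + ... + S[i] (1-based) and P[0] = 0."""
    return [0] + list(accumulate(S))

def segment_sum(P, i, j):
    """Sum of S[i, j] in O(1) time, using 1-based inclusive indices."""
    return P[j] - P[i - 1]

S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]   # the sequence from the Example slide
P = prefix_sums(S)                    # O(n) preprocessing
print(segment_sum(P, 2, 4))           # 2 + (-2) + 5 = 5
```

After the linear-time pass that builds `P`, every segment sum costs a single subtraction.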
9As a result
- The problem has a naïve O(n^2)-time algorithm
- Go through all O(n^2) feasible segments and output one segment with maximum sum.
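The naïve algorithm can be sketched in Python as follows. This is only an illustration: the function name `naive_max_sum_segment` and the `(sum, i, j)` return format are assumptions, and the sketch assumes 1 <= L <= U.

```python
from itertools import accumulate

def naive_max_sum_segment(S, L, U):
    """Try every feasible segment; return (best_sum, i, j), 1-based inclusive.

    Returns None when no segment has a length in [L, U].
    """
    n = len(S)
    P = [0] + list(accumulate(S))          # P[k] = prefix-sum(k), P[0] = 0
    best = None
    for j in range(1, n + 1):              # ending index
        # Feasible starting indices i satisfy L <= j - i + 1 <= U.
        for i in range(max(1, j - U + 1), j - L + 2):
            total = P[j] - P[i - 1]        # O(1) per segment
            if best is None or total > best[0]:
                best = (total, i, j)
    return best

S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]        # the sequence from the Example slide
print(naive_max_sum_segment(S, 2, 2))      # (4, 8, 9)
print(naive_max_sum_segment(S, 2, 8))      # (5, 2, 4)
```

On the example sequence, the best segment is S[8, 9] with sum 4 when L = U = 2, and S[2, 4] with sum 5 when L = 2 and U = 8.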
10Previous work
- Lin, Jiang, and Chao [JCSS 2002] gave the first
known O(n)-time algorithm for the MAX-SUM SEGMENT
problem.
11Lin-Jiang-Chao
- Based upon a clever but somewhat complicated
technique called left-negative decomposition of
the input sequence.
12Our result
- An alternative linear-time algorithm
- Bypassing the pre-processing step of left-negative segment decomposition.
- Has the capability to process the input sequence in an on-line manner
13On-line processing
- Suppose the input sequence S is given in iterations.
- For each i = 1, 2, ..., n, the number S[i] is given in the i-th iteration.
- When S[i] is received, our algorithm is capable of iteratively outputting a max-sum segment of S[1, i].
- The required working space is O(U - L + 1).
14Our algorithm
15Geometric representation
height of j-th point = prefix-sum(j)
S = 2, -1, 1, 1, -2, 3, -2
(Figure: prefix-sum(j) plotted against the index j.)
16Sum of S[i, j] = prefix-sum(j) - prefix-sum(i - 1)
- For each ending index j,
- Maximizing the sum of S[i, j] is equivalent to minimizing prefix-sum(i - 1).
17Lowest point in [j - U, j - L]
valley(j)
Feasible index set F(j) for index j.
(Figure: axis labels j - U, j - L, and j.)
18valley(j)
- Clearly, for each ending index j, 1 + valley(j) has to be the best starting index.
- In other words, 1 + valley(j) maximizes the sum of S[i, j] over all feasible starting indices i.
19As a result
- The MAX-SUM SEGMENT problem is linear-time reducible to computing valley(j) for all indices j.
- More specifically, the solution to MAX-SUM SEGMENT is simply the segment S[1 + valley(j), j] with maximum sum over all indices j.
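The reduction can be sketched in Python as below. To keep the reduction itself visible, valley(j) is computed directly from its definition (a minimum over the window), which is not linear time; the helper names are illustrative, and the sketch assumes 1 <= L <= U with the window clipped at index 0.

```python
from itertools import accumulate

def valley_by_definition(P, j, L, U):
    """Index k in F(j) = [j - U, j - L] (clipped at 0) with the lowest P[k]."""
    lo, hi = max(0, j - U), j - L
    return min(range(lo, hi + 1), key=lambda k: P[k])

def solve_via_valleys(S, L, U):
    """Return (best_sum, i, j), 1-based inclusive, or None if L > len(S)."""
    n = len(S)
    P = [0] + list(accumulate(S))          # P[k] = prefix-sum(k)
    best = None
    for j in range(L, n + 1):              # F(j) is empty when j < L
        v = valley_by_definition(P, j, L, U)
        total = P[j] - P[v]                # sum of S[1 + valley(j), j]
        if best is None or total > best[0]:
            best = (total, v + 1, j)
    return best

print(solve_via_valleys([-3, 2, -2, 5, -4, 1, -2, 3, 1], 2, 8))   # (5, 2, 4)
```

Once valley(j) is available for every j, the remaining work is one subtraction and one comparison per ending index.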
20Computing valley(j) for all indices j in O(n) time
- Case 1: U is ineffective (e.g., U = n)
- Case 2: U is effective
21Case 1: U is ineffective
(Figure: the feasible index sets F(j), F(j + 1), and F(j + 2) for the ending indices j, j + 1, and j + 2.)
22F(j + 1) = F(j) ∪ {j - L + 1}
- if prefix-sum(j - L + 1) < prefix-sum(valley(j))
- let valley(j + 1) = j - L + 1
- otherwise,
- let valley(j + 1) = valley(j)
O(1) time to compute valley(j + 1) from valley(j)
O(n) time to compute valley(j) for all j
O(n) time to compute a max-sum segment.
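A short Python sketch of Case 1, where only the lower bound L matters. The function name and return format are assumptions made for illustration; the sketch assumes L >= 1 and U >= len(S). In iteration j the single index entering the feasible set is j - L, so one comparison updates the valley.

```python
from itertools import accumulate

def max_sum_segment_no_upper_bound(S, L):
    """Case 1 sketch: return (best_sum, i, j), 1-based inclusive, or None."""
    n = len(S)
    P = [0] + list(accumulate(S))      # P[k] = prefix-sum(k), P[0] = 0
    best = None
    valley = 0                         # F(L) = {0}, so valley(L) = 0
    for j in range(L, n + 1):
        if P[j - L] < P[valley]:       # the new index j - L is lower
            valley = j - L
        total = P[j] - P[valley]       # sum of S[1 + valley(j), j]
        if best is None or total > best[0]:
            best = (total, valley + 1, j)
    return best

print(max_sum_segment_no_upper_bound([-3, 2, -2, 5, -4, 1, -2, 3, 1], 2))  # (5, 2, 4)
```

Each iteration does a constant amount of work, so the whole pass is linear in n.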
23Case 2: U is effective
(Figure: the feasible index sets F(j), F(j + 1), and F(j + 2) for consecutive ending indices.)
24We need a data structure D(j) for the indices in
F(j)
- The specification
- valley(j) can be obtained from D(j) in O(1) time.
- D(j + 1) can be updated from D(j) in amortized O(1) time.
25A solution
- Let D(j) be a list of indices X(1), X(2), ... such that
- X(1) is the lowest point in F(j) = [j - U, j - L]
- That is, X(1) = valley(j).
- X(2) is the lowest point in [X(1) + 1, j - L]
- X(3) is the lowest point in [X(2) + 1, j - L]
26(Figure: axis labels j - U and j - L.)
27Spec. 1
- valley(j) can be obtained from D(j) in O(1) time?
- Yes! Just read the first number in D(j).
28Spec. 2
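- D(j + 1) can be updated from D(j) in amortized O(1) time?
- Yes! Two steps
- (1) If valley(j) is not in F(j + 1), we just delete the first number from the list.
- (2) Scan the list of indices in D(j) from right to left to see where the new index j - L + 1 fits in: indices at the right end whose points are not lower than the new point are deleted, and j - L + 1 is appended.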
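One way to realize the list D(j) is a double-ended queue. The Python sketch below is an illustration under stated assumptions, not the authors' code: the function name and return format are made up, 1 <= L <= U is assumed, and ties between equally low points are resolved in favor of the rightmost index. In iteration j the index entering F(j) = [j - U, j - L] is j - L.

```python
from collections import deque
from itertools import accumulate

def max_sum_segment(S, L, U):
    """Return (best_sum, i, j), 1-based inclusive, or None if L > len(S)."""
    n = len(S)
    P = [0] + list(accumulate(S))   # P[k] = prefix-sum(k), P[0] = 0
    D = deque()                     # indices; P strictly increases left to right
    best = None
    for j in range(L, n + 1):
        # Step (1): drop the front index once it falls out of F(j).
        if D and D[0] < j - U:
            D.popleft()
        # Step (2): scan from the right; indices that are not lower than the
        # new index j - L can never be a valley again, so remove them.
        new = j - L
        while D and P[D[-1]] >= P[new]:
            D.pop()
        D.append(new)
        valley = D[0]               # Spec. 1: the first index is valley(j)
        total = P[j] - P[valley]
        if best is None or total > best[0]:
            best = (total, valley + 1, j)
    return best

S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]
print(max_sum_segment(S, 2, 2))     # (4, 8, 9)
print(max_sum_segment(S, 2, 8))     # (5, 2, 4)
```

Every index is appended once and removed at most once, which is where the amortized O(1) cost per iteration comes from.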
29(No Transcript)
30Time complexity
- The overall time complexity is O(n), although some iterations may take more than constant time: each index enters the list once and is deleted from it at most once.
31On-line processing
- In the j-th iteration, when S[j] is received,
- prefix-sum(j) can be computed on the fly
- valley(j) can be computed iteratively,
- So, our algorithm can process the input sequence in an online manner.
- The required working space is O(U - L + 1).
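The on-line behavior can be sketched as a Python generator that consumes one number per iteration. This is a hedged illustration with made-up names, assuming 1 <= L <= U. Note that the `window` deque never holds more than U - L + 1 entries, but this particular sketch also buffers the most recent prefix sums in `pending` (up to L + 1 of them), so it does not by itself demonstrate the space bound stated on the slide.

```python
from collections import deque

def online_max_sum(stream, L, U):
    """After each number, yield the best (sum, i, j) seen so far, or None."""
    window = deque()             # (index, prefix_sum) pairs playing the role of D(j)
    pending = deque([(0, 0)])    # recent prefix sums whose index is not yet in F(j)
    prefix = 0
    best = None
    for j, x in enumerate(stream, start=1):
        prefix += x                          # prefix-sum(j), computed on the fly
        pending.append((j, prefix))
        if j >= L:
            new = pending.popleft()          # index j - L enters F(j)
            if window and window[0][0] < j - U:
                window.popleft()             # front index has left F(j)
            while window and window[-1][1] >= new[1]:
                window.pop()                 # not lower than the new point
            window.append(new)
            valley, low = window[0]
            if best is None or prefix - low > best[0]:
                best = (prefix - low, valley + 1, j)
        yield best

for answer in online_max_sum(iter([-3, 2, -2, 5, -4]), 2, 8):
    print(answer)
```

The generator never looks at the input again after a number has been consumed, so it works on any iterable, including one that is produced incrementally.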
32An application in bioinformatics
- Identifying repeats in genomic sequences.
33Repeats
- DNA repeats usually contain important biological information.
- TIGR (The Institute for Genomic Research) maintains databases for repeats in various genomes.
34Finding repeats via MAX-SUM SEGMENT
- Input: a DNA sequence R
- Output: a segment R[i, j] of R that is likely to be a previously unknown repeat in R.
- Step 1. Filter out known repeats
- Step 2. Self-alignment and reducing the 2D alignment scores into a sequence S of numbers.
- Step 3. Run MAX-SUM SEGMENT algorithm on S
35(1) Masking Repeats of Chromosome 1 listed in the
TIGR Rice Repeat Database
(Figure: Unmasked Chromosome 1 and Masked Chromosome 1, with the masked region marked.)
36(2) Filtering Simple Regions of Chromosome 1
(Figure: Masked Chromosome 1 and Filtered and Masked Chromosome 1, with the masked and filtered regions marked.)
37Aligning Filtered and Rice Chromosome 1 against
itself
Negative scores to penalize unaligned area.
E < 1e-10
38Max-Sum
Adding the scores in the same column into a single score.
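As a rough illustration of collapsing 2D scores into a sequence, the Python sketch below sums each column of a small grid. The grid layout (one row per alignment, one column per sequence position), the function name `column_sums`, and the numbers are all made up for this example; the slides do not specify how the scores are stored.

```python
def column_sums(score_rows):
    """Collapse a 2D grid of scores into one number per column."""
    return [sum(column) for column in zip(*score_rows)]

# Toy, made-up scores: negative entries stand for unaligned positions.
rows = [
    [ 2, 3, -1, -1],
    [ 1, 2,  2, -1],
    [-1, 4,  1, -1],
]
S = column_sums(rows)
print(S)   # [2, 9, 2, -3]
```

The resulting list plays the role of the sequence S that is passed to the MAX-SUM SEGMENT algorithm.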
39left = 36345576, right = 36345588, Score = 253
Length44744612 4 1 30 144 56 2 9 2 1 1 9
-3 36345580
ataaaaaatattagaagaaaagtatagagtgcatatagaaatataattaa
gaaataatagaaattcggaattagaaaacaacagatattagaagaagagt
atagagtccatataagaatttagaatgaactaaaattcggaataaaaatt
aaaattaaagatagaatttagagtctata
over 80 homologues (e-value < e-10) in rice
chromosome 1 but none in the TIGR Rice Repeat
Database released on July 29, 2002
40Conclusion
- We give a new O(n)-time algorithm for the MAX-SUM SEGMENT problem.
- Our algorithm can handle the input sequence in an online manner, which is an important feature for handling genome-sized input. The working space required is only O(U - L + 1).
- Our algorithm can be an important subroutine in finding new repeats in genomic sequences.