An Optimal Algorithm for MAX-SUM SEGMENT

Title: An Optimal Algorithm for MAX-SUM SEGMENT

1
An Optimal Algorithm for MAX-SUM SEGMENT and Its
Applications in Bioinformatics

Hsueh-I Lu
Academia Sinica, Taiwan

2
This is joint work with

Institute of Statistics, National Central
University
Tsai-Hung Fan
Tsung-Shan Tsou
Institute of Biomedical Sciences, Academia Sinica
Adam Yao
Shufen Lee
Tsai-Cheng Wang

3
Outline

The MAX-SUM SEGMENT problem
Previous work and our algorithm
A feature of our algorithm
Processing the input sequence in an on-line
manner
An application in bioinformatics
Finding new repeats in genomic sequences

4
The MAX-SUM SEGMENT Problem

Input
a sequence S of n numbers
two length bounds L and U.
Output
a segment S[i, j] with maximum sum over all
segments of S with length at least L and at most
U.

5
Example

S = -3, 2, -2, 5, -4, 1, -2, 3, 1
L = 2, U = 2
L = 2, U = 8
[Figure: feasible segments of S marked with their sums: 2, 4, 5, and 3]
6
Simple Observations

There are only O(n^2) segments S[i, j] in S:
O(n) values for i and O(n) values for j.
The sum of each segment S[i, j] can be computed
in O(1) time.
7
Simple Observations

There are only O(n^2) segments S[i, j] in S.
The sum of each segment S[i, j] can be computed
in O(1) time, with appropriate linear-time
preprocessing.
8
Prefix sums of S

Define prefix-sum(i) = sum of S[1, i], with prefix-sum(0) = 0.
O(n) time to pre-compute prefix-sum(i) for all
indices i.
Sum of S[i, j]
= prefix-sum(j) - prefix-sum(i - 1)
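The prefix-sum preprocessing can be sketched as follows. This is a minimal illustration, not code from the talk: the function names `prefix_sums` and `segment_sum` are made up here, and the list `P` carries an extra entry `P[0] = 0` so that the slides' 1-based indices can be used directly.

```python
from itertools import accumulate


def prefix_sums(S):
    """Return P with P[0] = 0 and P[i] = S[1] + ... + S[i] (1-based, as on the slides)."""
    return [0] + list(accumulate(S))


def segment_sum(P, i, j):
    """Sum of S[i, j] for 1 <= i <= j <= n, in O(1) time after O(n) preprocessing."""
    return P[j] - P[i - 1]


# The sequence from the Example slide.
S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]
P = prefix_sums(S)
print(P)                     # [0, -3, -1, -3, 2, -2, -1, -3, 0, 1]
print(segment_sum(P, 2, 4))  # 2 + (-2) + 5 = 5
```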
9
As a result

The problem has a naïve O(n^2)-time algorithm:
go through all O(n^2) feasible segments and output
one segment with maximum sum.
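A sketch of this naïve approach, for reference. The function name `max_sum_segment_naive` and the `(sum, i, j)` return shape are choices made for this illustration; indices are 1-based and inclusive, as on the slides.

```python
from itertools import accumulate


def max_sum_segment_naive(S, L, U):
    """Try every feasible segment; return (best_sum, i, j) or None if none is feasible."""
    n = len(S)
    P = [0] + list(accumulate(S))  # P[i] = prefix-sum(i), P[0] = 0
    best = None
    for i in range(1, n + 1):
        # Ending indices j that keep the length j - i + 1 within [L, U].
        for j in range(i + L - 1, min(i + U - 1, n) + 1):
            s = P[j] - P[i - 1]
            if best is None or s > best[0]:
                best = (s, i, j)
    return best


S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]
print(max_sum_segment_naive(S, 2, 2))  # (4, 8, 9)
print(max_sum_segment_naive(S, 2, 8))  # (5, 2, 4)
```

On the sequence from the Example slide, this gives sum 4 for L = 2, U = 2 and sum 5 for L = 2, U = 8.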

10
Previous work

Lin, Jiang, and Chao (JCSS 2002) gave the first
known O(n)-time algorithm for the MAX-SUM SEGMENT
problem.

11
Lin-Jiang-Chao

Based upon a clever but somewhat complicated
technique called left-negative decomposition of
the input sequence.

12
Our result

An alternative linear-time algorithm
Bypassing the pre-processing step of
left-negative segment decomposition.
Has the capability to process the input sequence
in an on-line manner

13
On-line processing

Suppose the input sequence S is given in
iterations.
For each i = 1, 2, ..., n, the number S[i] is given
in the i-th iteration.
When S[i] is received, our algorithm is capable
of iteratively outputting a max-sum segment of
S[1, i].
The required working space is O(U - L + 1).

14
Our algorithm
15
Geometric representation
height of j-th point = prefix-sum(j)
S = 2, -1, 1, 1, -2, 3, -2
[Figure: prefix sums plotted as points against index j]
16
Sum of S[i, j] = prefix-sum(j) - prefix-sum(i - 1)

For each ending index j,
maximizing the sum of S[i, j] is equivalent to
minimizing prefix-sum(i - 1).

17
valley(j): lowest point in [j - U, j - L]
Feasible index set F(j) for index j.
[Figure: points j - U, j - L, and j marked on the index axis]
18
valley(j)

Clearly, for each ending index j, 1 + valley(j) has
to be the best starting index.
In other words, 1 + valley(j) maximizes the sum of
S[i, j] over all feasible starting indices i.
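The claim follows from the prefix-sum identity. A short derivation, writing P(k) as shorthand for prefix-sum(k) (a notation introduced here, not on the slides) and taking F(j) to contain the nonnegative indices in [j - U, j - L]:

```latex
\begin{align*}
\max_{i \,:\, L \le j-i+1 \le U} \operatorname{sum}(S[i,j])
  &= \max_{i \,:\, j-U \le i-1 \le j-L} \bigl( P(j) - P(i-1) \bigr) \\
  &= P(j) - \min_{k \in F(j)} P(k) \\
  &= P(j) - P(\mathrm{valley}(j)).
\end{align*}
```

The minimizing index k = valley(j) corresponds to the starting index i = k + 1.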

19
As a result

The MAX-SUM SEGMENT problem is linear-time
reducible to computing valley(j) for all indices
j.
More specifically, the solution to MAX-SUM
SEGMENT is simply the segment S[1 + valley(j), j]
with maximum sum over all indices j.

20
Computing valley(j) for all indices j in O(n) time

Case 1: U is ineffective (e.g., U = n)
Case 2: U is effective

21
Case 1: U is ineffective
[Figure: feasible index sets F(j), F(j + 1), and F(j + 2) for ending indices j, j + 1, and j + 2]
22
F(j + 1) = F(j) ∪ {j - L + 1}

if prefix-sum(j - L + 1) < prefix-sum(valley(j))
let valley(j + 1) = j - L + 1
otherwise,
let valley(j + 1) = valley(j)

O(1) time to compute valley(j + 1) from valley(j)
O(n) time to compute valley(j) for all j
O(n) time to compute a max-sum segment.
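A minimal sketch of Case 1, assuming the upper bound U never excludes any index (so F(j) = [0, j - L]). The function name `max_sum_segment_no_upper` and the `(sum, i, j)` return shape are illustrative choices, not part of the talk.

```python
from itertools import accumulate


def max_sum_segment_no_upper(S, L):
    """Max-sum segment of length >= L when U is ineffective; returns (sum, i, j) or None."""
    n = len(S)
    if L < 1 or L > n:
        return None
    P = [0] + list(accumulate(S))  # P[k] = prefix-sum(k)
    valley = 0                     # F(L) = {0}, so valley(L) = 0
    best = (P[L] - P[0], 1, L)
    for j in range(L + 1, n + 1):
        k = j - L                  # the single new index in F(j)
        if P[k] < P[valley]:       # O(1) update of the valley
            valley = k
        s = P[j] - P[valley]
        if s > best[0]:
            best = (s, valley + 1, j)
    return best


S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]
print(max_sum_segment_no_upper(S, 2))  # (5, 2, 4)
```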
23
Case 2: U is effective
[Figure: feasible index sets F(j), F(j + 1), and F(j + 2) for consecutive ending indices]
24
We need a data structure D(j) for the indices in
F(j)

The specification
valley(j) can be obtained from D(j) in O(1)
time.
D(j + 1) can be updated from D(j) in amortized
O(1) time.

25
A solution

Let D(j) be a list of indices X(1), X(2), ... such
that
X(1) is the lowest point in F(j) = [j - U, j - L].
That is, X(1) = valley(j).
X(2) is the lowest point in [X(1) + 1, j - L].
X(3) is the lowest point in [X(2) + 1, j - L].

26
j - U
j - L
27
Spec. 1

valley(j) can be obtained from D(j) in O(1)
time?
Yes! Just read the first number in D(j).

28
Spec. 2

D(j + 1) can be updated from D(j) in amortized O(1)
time?
Yes! Two steps
(1) If valley(j) is not in F(j + 1), we just delete
the first number from the list.
(2) Scan the list of indices in D(j) from right
to left to see where the new index j - L + 1 fits in:
delete every index whose point is not lower than
that of j - L + 1, then append j - L + 1.

30
time complexity

The overall time complexity is O(n), although
some iterations may take more than constant time:
each index is inserted into the list once and
deleted from it at most once.

31
On-line processing

In the j-th iteration, when S[j] is received,
prefix-sum(j) can be computed on the fly and
valley(j) can be computed iteratively.
So, our algorithm can process the input sequence
in an on-line manner.
The required working space is O(U - L + 1).
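A sketch of the on-line mode as a Python generator, under stated assumptions. The name `online_max_sum_segments` is hypothetical. The deque `D` holds at most U - L + 1 entries; for simplicity this sketch also buffers the last L prefix sums in `pending` so that index j - L can enter the window L iterations after it was computed, so the sketch as written uses O(U) space rather than the O(U - L + 1) bound stated on the slide.

```python
from collections import deque


def online_max_sum_segments(stream, L, U):
    """After each received number S[j], yield a max-sum segment (sum, i, j') of S[1, j].

    Yields None while no feasible segment exists yet.
    """
    pending = deque([(0, 0)])  # (index, prefix sum) pairs waiting to enter the window
    D = deque()                # window entries with strictly increasing prefix sums
    best = None
    total = 0                  # prefix-sum(j), computed on the fly
    for j, x in enumerate(stream, start=1):
        total += x
        if j >= L:
            k, pk = pending.popleft()       # k == j - L
            if D and D[0][0] < j - U:       # front index left F(j)
                D.popleft()
            while D and D[-1][1] >= pk:     # scan from the right
                D.pop()
            D.append((k, pk))
            valley, pv = D[0]
            if best is None or total - pv > best[0]:
                best = (total - pv, valley + 1, j)
        pending.append((j, total))
        yield best


S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]
for answer in online_max_sum_segments(iter(S), 2, 8):
    print(answer)
```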

32
An application in bioinformatics

Identifying repeats in genomic sequences.

33
Repeats

DNA repeats usually contain important biological
information.
TIGR (The Institute for Genomic Research)
maintains databases for repeats in various
genomes.

34
Finding repeats via MAX-SUM SEGMENT

Input: a DNA sequence R
Output: a segment R[i, j] of R that is likely to
be a previously unknown repeat in R.
Step 1. Filter out known repeats
Step 2. Self-alignment and reducing the 2D
alignment scores into a sequence S of numbers.
Step 3. Run MAX-SUM SEGMENT algorithm on S

35
1. Masking Repeats of Chromosome 1 listed in the
TIGR Rice Repeat Database
Unmasked Chromosome 1
Masked Chromosome 1
Masked region
36
2. Filtering Simple Regions of Chromosome 1
Masked Chromosome 1
Filtered and Masked Chromosome 1
Masked region
Filtered region
37
Aligning Filtered and Rice Chromosome 1 against
itself
Negative scores to penalize unaligned area.
E < 1e-10
38
Max-Sum
Adding the scores in the same column into a
score.
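A toy sketch of this column-wise reduction. The layout is an assumption made for illustration: each row holds the per-position scores of one alignment, each column is one sequence position, and the numbers are invented. The helper name `column_sums` is likewise hypothetical.

```python
def column_sums(scores):
    """Collapse a rectangular 2D score table into one number per column."""
    return [sum(column) for column in zip(*scores)]


# Hypothetical toy scores: negative entries stand for penalized, unaligned positions.
scores = [
    [2, -1, 3, -2],
    [1, -1, 2, -2],
    [-1, -1, 4, -2],
]
S = column_sums(scores)
print(S)  # [2, -3, 9, -6]
```

The resulting one-dimensional sequence S is what a MAX-SUM SEGMENT routine would then be run on.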
39
left = 36345576, right = 36345588, Score = 253
Length44744612 4 1 30 144 56 2 9 2 1 1 9
-3 36345580
ataaaaaatattagaagaaaagtatagagtgcatatagaaatataattaa
gaaataatagaaattcggaattagaaaacaacagatattagaagaagagt
atagagtccatataagaatttagaatgaactaaaattcggaataaaaatt
aaaattaaagatagaatttagagtctata
over 80 homologues (e-value < e-10) in rice
chromosome 1 but none in the TIGR Rice Repeat
Database released on July 29, 2002
40
Conclusion

We give a new O(n)-time algorithm for the MAX-SUM
SEGMENT problem.
Our algorithm can handle the input sequence in an
on-line manner, which is an important feature for
handling genome-sized input. The working space
required is only O(U - L + 1).
Our algorithm can be an important subroutine in
finding new repeats in genomic sequences.
