Title: On the Range Maximum-Sum Segment Query Problem
1. On the Range Maximum-Sum Segment Query Problem
- Kuan-Yu Chen and Kun-Mao Chao
- Department of Computer Science and Information Engineering
- National Taiwan University, Taiwan
2. Outline
- Motivations
- Problems arising from some bioinformatics applications
- Defining the RMSQ problem
- Our main idea
- Reducing the RMSQ problem to the Range Minima Query (RMQ) problem
- Conclusions and applications
- Solving some relevant problems in O(n) time
3. Applications to Biomolecular Sequence Analysis
- Locating conserved regions or GC-rich regions in a DNA sequence
- Assign a real number (also called a score) to each residue
- Look for the maximum-sum or maximum-average segments
- Add length constraints or average constraints
4. An example: Locating GC-rich regions (1)
- One reasonable scoring expression to measure the richness of a region is x - pl, where x is the CG count of the region, l is the length of the region, and p is a positive ratio constant.
- The goal is to design an algorithm that reports the region maximizing the expression x - pl.
5. An example: Locating GC-rich regions (2)
- Let x be the CG count of the region, and y be the AT count of the region, so that l = x + y.
- Hence, we have
- x - pl = x - p(x + y) = (1 - p)x - py
- Therefore, to calculate the value of x - pl, one can assign
- w(G) = w(C) = 1 - p
- w(A) = w(T) = -p
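The weight assignment can be checked numerically with a short sketch. The function name `residue_scores`, the sample string, and the value p = 0.5 are illustrative assumptions, not taken from the slides.

```python
def residue_scores(seq, p):
    """Map each base to w(G) = w(C) = 1 - p and w(A) = w(T) = -p."""
    weights = {"G": 1 - p, "C": 1 - p, "A": -p, "T": -p}
    return [weights[base] for base in seq.upper()]


seq, p = "GGCAT", 0.5                    # hypothetical region and ratio constant
scores = residue_scores(seq, p)
x = sum(base in "GC" for base in seq)    # CG count of the region
total = sum(scores)                      # should equal x - p * l with l = len(seq)
```

With these weights, the sum of the scores over a region is exactly x - pl, so locating the best region becomes a maximum-sum segment problem on the score sequence.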
6. The Maximum-Sum Segment
- Also called the maximum-sum interval or the maximum-scoring region
- Given a sequence of numbers, the maximum-sum segment is simply the contiguous subsequence having the greatest total sum.
- Example: <5, -5.1, 1, 3, -4, 2, 3, -4, 7>, with greatest total sum 8
- Zero prefix-/suffix-sums are possible: here both <1, 3, -4, 2, 3, -4, 7> and <2, 3, -4, 7> sum to 8.
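For reference, the maximum-sum segment of a whole sequence can be found with the standard linear scan usually attributed to Kadane. The slides do not name this scan; it is shown here only as a baseline sketch, and the function name and 0-based return convention are assumptions.

```python
def max_sum_segment(a):
    """Return (best_sum, start, end) with 0-based inclusive indices.

    Linear scan: restart the running segment whenever its sum drops below zero.
    """
    best_sum, best_lo, best_hi = a[0], 0, 0
    cur_sum, cur_lo = a[0], 0
    for i in range(1, len(a)):
        if cur_sum < 0:
            cur_sum, cur_lo = a[i], i    # a negative running sum cannot help
        else:
            cur_sum += a[i]              # zero-sum prefixes are kept
        if cur_sum > best_sum:
            best_sum, best_lo, best_hi = cur_sum, cur_lo, i
    return best_sum, best_lo, best_hi


seq = [5, -5.1, 1, 3, -4, 2, 3, -4, 7]
result = max_sum_segment(seq)
```

Because the scan keeps zero-sum prefixes, it reports the longer of the two optimal segments for this input.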
7. A Relevant Problem: RMQ
- Range Minima (Maxima) Query Problem (also called Discrete Range Searching)
- Given a sequence of numbers, we preprocess the sequence so that the minimum (maximum) value within a given querying interval can be retrieved efficiently.
- Example: <5, -5.1, 1, 3, -4, 2, 3, -4, 7>
[Figure: a querying interval on the sequence, with its minimum and maximum marked]
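One simple way to answer range minima queries in O(1) time is a sparse table. Note that this is a swapped-in technique: it needs O(n log n) preprocessing, whereas the results in these slides rely on an RMQ structure with O(n) preprocessing. The class name and 0-based inclusive interface are assumptions made for this sketch.

```python
class SparseTableRMQ:
    """Range-minimum index queries: O(n log n) preprocessing, O(1) per query."""

    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        # table[j][i] = index of a minimum of values[i .. i + 2**j - 1]
        self.table = [list(range(n))]
        j = 1
        while (1 << j) <= n:
            prev, half = self.table[-1], 1 << (j - 1)
            row = []
            for i in range(n - (1 << j) + 1):
                left, right = prev[i], prev[i + half]
                row.append(left if self.values[left] <= self.values[right] else right)
            self.table.append(row)
            j += 1

    def argmin(self, lo, hi):
        """Index of a minimum of values[lo..hi], 0-based inclusive."""
        j = (hi - lo + 1).bit_length() - 1       # largest power of two that fits
        left = self.table[j][lo]
        right = self.table[j][hi - (1 << j) + 1]
        return left if self.values[left] <= self.values[right] else right


seq = [5, -5.1, 1, 3, -4, 2, 3, -4, 7]
rmq_min = SparseTableRMQ(seq)
rmq_max = SparseTableRMQ([-v for v in seq])      # maxima via negation
```

Range maxima queries reuse the same structure on the negated sequence.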
8. Range Maximum-Sum Segment Query Problem
- Definition
- The input is a sequence <a1, a2, ..., an> of real numbers, which is to be preprocessed.
- A query consists of two intervals S and E.
- Our goal is to return the maximum-sum segment whose starting index lies in S and whose end index lies in E.
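The definition can be pinned down with a brute-force reference that tries every admissible start and end. It does no preprocessing and is far from the O(1) query time targeted here; it only fixes the semantics. The function name, the 1-based inclusive interval convention, and the sample intervals are assumptions of this sketch.

```python
def rmsq_bruteforce(a, S, E):
    """Return (best_sum, start, end), 1-based inclusive, or None if no segment fits.

    S = (s_lo, s_hi) bounds the starting index; E = (e_lo, e_hi) bounds the end index.
    """
    (s_lo, s_hi), (e_lo, e_hi) = S, E
    best = None
    for i in range(s_lo, s_hi + 1):
        total = 0
        for j in range(i, e_hi + 1):
            total += a[j - 1]                    # a is a 0-based Python list
            if j >= e_lo and (best is None or total > best[0]):
                best = (total, i, j)
    return best


a = [9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3]
disjoint = rmsq_bruteforce(a, (3, 5), (9, 13))      # hypothetical query intervals
overlapping = rmsq_bruteforce(a, (1, 9), (5, 15))   # hypothetical query intervals
```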
9. A Nonoverlapping Example
- Input sequence
- 9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3
[Figure: the starting region and the end region marked on the sequence; the reported segment has total sum 6]
10. An Overlapping Example
- Input sequence
- 9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3
[Figure: overlapping starting and end regions marked on the sequence; the reported segment has total sum 8]
11. Our Results
- We propose an algorithm that runs in O(n) preprocessing time and O(1) query time under the unit-cost RAM model.
- In fact, we show that RMSQ and RMQ are computationally linearly equivalent.
- We show that the RMSQ techniques yield alternative O(n)-time algorithms for the following problems:
- The maximum-sum segment with length constraints
- All maximal-sum segments
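12. Strategy
- Reduce the RMSQ problem to the RMQ problem
- Theorem. If there is an <f(n), g(n)>-time solution for the RMQ problem (preprocessing time f(n), query time g(n)), then there is an <f(n) + O(n), g(n) + O(1)>-time solution for the RMSQ problem.
[Figure: RMSQ reduces to RMQ with O(n) additional preprocessing time and O(1) additional query time]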
13. Cumulative Sum / Prefix Sum
prefix-sum(i) = a1 + a2 + ... + ai
14. Computing sum(i, j) in O(1) time
- prefix-sum(i) = a1 + a2 + ... + ai
- All n prefix sums are computable in O(n) time.
- sum(i, j) = prefix-sum(j) - prefix-sum(i-1)
[Figure: the segment from i to j, with prefix-sum(j) and prefix-sum(i-1) marked]
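A minimal sketch of the prefix-sum table and the O(1) segment sum follows. The extra entry P[0] = 0 is a convention assumed here so that sum(1, j) needs no special case; the function names are illustrative.

```python
from itertools import accumulate


def prefix_sums(a):
    """Return P with P[0] = 0 and P[i] = a1 + ... + ai (1-based i)."""
    return [0] + list(accumulate(a))


def segment_sum(P, i, j):
    """sum(i, j) for 1-based inclusive indices, in O(1) time."""
    return P[j] - P[i - 1]


a = [9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3]
P = prefix_sums(a)
```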
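15. Case 1: Nonoverlapping
- sum(i, j) = prefix-sum(j) - prefix-sum(i-1)
- To maximize sum(i, j), maximize prefix-sum(j) and minimize prefix-sum(i-1).
- Prefix-sum sequence of the input
- 9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3
[Figure: plot of the prefix sums; a range minima query finds the lowest point for the start, and a range maxima query finds the highest point for the end]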
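16. Case 2: Overlapping
- Some problems may occur: the highest point may lie to the left of the lowest point, so the pair does not form a valid segment (negative sum).
- Prefix-sum sequence of the input
- 9, -10, 4, -2, 5, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3
[Figure: plot of the prefix sums with overlapping regions; the highest point and the lowest point found independently give a negative sum]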
17. Case 2: Overlapping
- Divide into 3 possible cases
- Prefix-sum sequence of the input
- 9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3
[Figure: the overlapping query split into three cases. In two of them the lowest point and the highest point are searched in disjoint ranges, each by a range minima query with preprocessing time f(n) and query time g(n). In the third, both are searched in the same range: what should we do?]
18. Dealing with the Special Case: Single Range Query
- Input sequence
- 9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3
- Challenge: Can this special case be reduced to the RMQ problem?
[Figure: a single querying range on the sequence; the reported segment has total sum 6]
19. Reduction Procedure
- Step 1. Find a partner for each index.
- Step 2. Record the sum of each pair in an array.
- Step 3. Retrieve the maximum-sum pair by applying the RMQ techniques.
20. Our First Attempt (1)
- Step 1: For each index i, we define the lowest point preceding i as its partner.
[Figure: prefix-sum sequence; the partner of i is the lowest point within the region preceding i]
21. Our First Attempt (2)
- Step 2: Record sum(partner(i), i) in an array.
[Figure: prefix-sum sequence with i, its lowest point, and sum(partner(i), i) marked]
22. Our First Attempt (3)
- Step 3: Apply the RMQ techniques to the array.
[Figure: applying RMQ to the sequence of sum(partner(i), i) values over the querying interval retrieves the maximum-sum pair]
23. Bump into Difficulties
- What if the partners go beyond the querying interval?
- We might have to update every pair!
[Figure: partner(i) lies outside the querying interval, so sum(partner(i), i) needs to be updated]
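24. A Better Partner
- How?
- Left_bound(i): find the nearest point preceding i that is at least as large as prefix-sum(i).
- New partner(i): find the lowest point between Left_bound(i) and i.
[Figure: prefix-sum sequence with i, Left_bound(i), and the new partner(i) marked]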
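The new partner can be computed for all indices in one left-to-right pass with a stack. This is a sketch under an assumed indexing convention that the slides leave to the figure: points are indices 0..n of a prefix array P with P[0] = 0, Left_bound(i) is the largest k < i with P[k] >= P[i] (0 if no such k exists), partner(i) is a lowest point among Left_bound(i), ..., i-1, and the pair (partner(i), i) stands for the segment a[partner(i)+1 .. i]. The function name is hypothetical.

```python
from itertools import accumulate


def good_partners(P):
    """Return (left_bound, partner) for prefix indices 1..n; entry 0 is unused."""
    n = len(P) - 1
    left_bound = [0] * (n + 1)
    partner = [0] * (n + 1)
    # Stack entries are (k, m): k is a prefix index, m is the lowest point in the
    # stretch of points that k has absorbed (ending at k itself).
    stack = [(0, 0)]
    for i in range(1, n + 1):
        lowest = None
        while stack and P[stack[-1][0]] < P[i]:      # pop points strictly below i
            _, m = stack.pop()
            if lowest is None or P[m] < P[lowest]:
                lowest = m
        left_bound[i] = stack[-1][0] if stack else 0
        # If nothing was popped, point i-1 is the left bound and the only candidate.
        partner[i] = left_bound[i] if lowest is None else lowest
        stack.append((i, i if lowest is None else lowest))
    return left_bound, partner


a = [9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3]
P = [0] + list(accumulate(a))
left_bound, partner = good_partners(P)
pair_sums = [P[i] - P[partner[i]] for i in range(1, len(P))]
```

Each point is pushed and popped at most once, so the pass takes O(n) time. The array `pair_sums` is the one to which the RMQ techniques are then applied.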
25. Why Is It Better? (1)
- It remains the best choice.
- It saves lots of update steps.
- It turns out that zero or one point needs to be updated.
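26. Why Is It Better? (2) -- Remains the Best
- Left_bound(i): the nearest point preceding i that is at least as large as prefix-sum(i).
- partner(i): the lowest point between Left_bound(i) and i.
[Figure: prefix-sum sequence with i, Left_bound(i), and partner(i) marked; the region to the left of Left_bound(i) is labeled as the impossible region]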
27. Why Is It Better? (3) -- Minimal-Maximal Property
- Height(partner(i)) < Height(j) < Height(i), for all partner(i) < j < i
[Figure: between partner(i) and i, no point is higher than i (the maximal point) and no point is lower than partner(i) (the minimal point); the next higher point is also marked]
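28. Why Is It Better? (4) -- Save Some Updates
- No point between partner(i) and i is higher than i, so these points cannot be the right end of the maximum-sum segment.
[Figure: a querying interval containing i but not partner(i), with the next higher point marked]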
29. Why Is It Better? (5) -- Nesting Property
- For two indices i < j, it cannot be the case that partner(i) < partner(j) ≤ i < j.
[Figure: two pairs (partner(i), i) and (partner(j), j), each running from its minimal point to its maximal point]
30. Why Is It Better? (6) -- An example
- No overlapping is allowed
- 9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3
- Nesting Property
- 9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3
31. When a Query Comes -- Case 1: No Exceeding
- Retrieve the maximum pair (partner(i), i).
- The maximum pair (partner(i), i) lies in the querying interval.
- We are done. Output (partner(i), i).
[Figure: the maximum pair (partner(i), i) inside the querying interval]
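32. When a Query Comes -- Case 2: Exceeding
- Retrieve the maximum pair (partner(i), i).
- The maximum pair (partner(i), i) goes beyond the querying interval.
- Points of the querying interval to the left of i cannot be the right end of the maximum-sum segment.
- Update partner(i): new_partner(i) is the lowest point preceding i inside the querying interval.
- By the nesting property, retrieve the maximum pair (partner(j), j) among the indices j to the right of i.
- Compare (new_partner(i), i) and (partner(j), j).
[Figure: querying interval with i and j as maximal points, partner(i) outside the interval, and partner(j) as a minimal point inside it]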
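The two query cases can be sketched as follows for a single querying range. This is an illustration under assumptions, not the authors' implementation: points are indices of a prefix array P with P[0] = 0, a pair (k, t) stands for the segment a[k+1 .. t], partners are recomputed directly from the definition in quadratic time so the block stands alone, and Python's `min` and `max` stand in for the O(1) RMQ lookups.

```python
from itertools import accumulate


def build_partners(P):
    """partner[t] straight from the definition (quadratic, for clarity only)."""
    partner = [0] * len(P)
    for t in range(1, len(P)):
        left_bound = max((k for k in range(t) if P[k] >= P[t]), default=0)
        partner[t] = min(range(left_bound, t), key=lambda k: P[k])
    return partner


def rmsq_single_range(P, partner, lo, hi):
    """Best sum of a segment with lo <= start <= end <= hi (1-based inclusive)."""
    def pair_sum(t):
        return P[t] - P[partner[t]]

    i = max(range(lo, hi + 1), key=pair_sum)         # stand-in for one RMQ lookup
    if partner[i] >= lo - 1:
        return pair_sum(i)                           # Case 1: no exceeding
    # Case 2: partner(i) lies left of the interval, so update it ...
    new_partner = min(range(lo - 1, i), key=lambda k: P[k])
    updated = P[i] - P[new_partner]
    # ... and compare with the best pair whose right end j lies to the right of i.
    later = max((pair_sum(j) for j in range(i + 1, hi + 1)), default=float("-inf"))
    return max(updated, later)


a = [9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3]
P = [0] + list(accumulate(a))
partner = build_partners(P)
clipped_answer = rmsq_single_range(P, partner, 4, 10)    # hypothetical query range
```

For the range 4..10 the maximum pair ends at point 9 but starts at point 2, outside the range, so Case 2 applies and the updated pair is reported.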
33. Time Complexity
- RMSQ can be reduced to the RMQ problem in O(n) time.
- Since under the unit-cost RAM model there is an <O(n), O(1)>-time solution for the RMQ problem, there is an <O(n), O(1)>-time solution for the RMSQ problem.
- On the other hand, RMQ can be reduced to the RMSQ problem in O(n) time, too. (Range Maxima Query: between each two adjacent elements, we insert a negative number whose absolute value is larger than both of them.)
[Figure: RMQ reduces to RMSQ with O(n) additional preprocessing time and O(1) additional query time]
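The reverse reduction can be illustrated with a small sketch: pad the sequence with large negative separators, and a maximum-sum segment of the padded range is then a single original element, namely the range maximum. The brute-force `rmsq_oracle` is only a stand-in for a real RMSQ structure, the padded sequence would be built once in practice, and all names and the 0-based indexing are assumptions.

```python
def pad_with_separators(a):
    """Insert, between adjacent elements, a negative number larger in absolute value than both."""
    b = []
    for idx, value in enumerate(a):
        if idx:
            b.append(-(max(abs(a[idx - 1]), abs(value)) + 1))
        b.append(value)
    return b


def rmsq_oracle(b, lo, hi):
    """Stand-in RMSQ: best (sum, start, end) with lo <= start <= end <= hi, 0-based."""
    best = None
    for s in range(lo, hi + 1):
        total = 0
        for e in range(s, hi + 1):
            total += b[e]
            if best is None or total > best[0]:
                best = (total, s, e)
    return best


def range_max_via_rmsq(a, i, j):
    """Maximum of a[i..j] (0-based inclusive) answered by one RMSQ query."""
    b = pad_with_separators(a)           # element a[k] sits at position 2k of b
    total, _, _ = rmsq_oracle(b, 2 * i, 2 * j)
    return total


seq = [5, -5.1, 1, 3, -4, 2, 3, -4, 7]
```

Extending a segment across a separator always lowers its sum, which is why the optimum never spans more than one original element.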
34. Use RMSQ Techniques to Solve Two Relevant Problems
- 1. Finding the maximum-sum segment with length constraints in O(n) time.
- Y.-L. Lin, T. Jiang, K.-M. Chao, 2002
- T.-H. Fan et al., 2003
- 2. Finding all maximal scoring subsequences in O(n) time.
- W. L. Ruzzo and M. Tompa, 1999
35. Problem 1: The Maximum-Sum Segment with Length Constraints
- Lin, Jiang, and Chao (JCSS 2002) and Fan et al. (CIAA 2003) gave O(n)-time algorithms for this problem.
- Length at least L, and at most U
[Figure: a segment whose length lies between L and U]
36. Problem 1: Finding the Maximum-Sum Segment with Length Constraints
- Length at least L, at most U
- For each index i, find the maximum-sum segment whose starting point lies in [i-U+1, i-L+1] and whose end point is i.
[Figure: the RMSQ query for index i, with the starting range determined by L and U]
- Runs in O(n) time since each query costs O(1) time.
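The per-index queries can be sketched in linear time with a sliding-window minimum over the prefix sums. This swaps in a monotonic deque in place of the RMSQ structure described in the slides: since every query ends at a single index i and the starting ranges slide to the right, a deque suffices here. The function name and the 0-based return convention are assumptions, and the sketch assumes 1 <= L <= U and L <= n.

```python
from collections import deque
from itertools import accumulate


def max_sum_with_length(a, L, U):
    """Best (sum, start, end), 0-based inclusive, among segments with L <= length <= U."""
    P = [0] + list(accumulate(a))
    window = deque()                     # prefix indices k, with P[k] increasing front to back
    best = None
    for t in range(L, len(a) + 1):       # the segment is a[k .. t-1], with sum P[t] - P[k]
        k_new = t - L                    # newest admissible start (shortest length L)
        while window and P[window[-1]] >= P[k_new]:
            window.pop()
        window.append(k_new)
        while window[0] < t - U:         # start too far left: length would exceed U
            window.popleft()
        k = window[0]                    # lowest prefix sum among admissible starts
        if best is None or P[t] - P[k] > best[0]:
            best = (P[t] - P[k], k, t - 1)
    return best


a = [9, -10, 4, -2, 4, -5, 4, -3, 6, -11, 8, -3, 4, -5, 3]
constrained = max_sum_with_length(a, 2, 4)   # hypothetical bounds L = 2, U = 4
```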
37. Problem 2: All Maximal-Sum Segments
- Ruzzo and Tompa (ISMB 1999) gave an O(n)-time algorithm for this problem.
- Recursive definition.
[Figure: a sequence with a segment S and the parts L(S) and R(S) on its left and right]
38. Problem 2: Finding All Maximal Scoring Subsequences
- Recursive calls.
- Input sequence
[Figure: an RMSQ query finds S in the input sequence; the recursive calls continue on L(S) and R(S)]
- Runs in O(n) time since each query costs O(1) time.