Title: RNA Secondary Structure Prediction
1RNA Secondary Structure Prediction
- RNA is a single-stranded chain of the nucleotides A, C, G, and U. The string of nucleotides specifies the linear structure of the RNA strand.
- When RNA folds, complementary nucleotides form base pairs (C-G and A-U).
- The tertiary (three-dimensional) structure is too complicated for us to calculate.
- We calculate only secondary structures: lists of base pairs.
- Knowing the base pairs tells us a lot about the three-dimensional structure.
2Chemical Structure of RNA
- Four base types.
- Distinguishable ends.
3Partial Tertiary Structure
4Yet Another Tertiary Structure
5Our Final Tertiary Picture
6A Partial RNA Secondary Structure
7Pure Secondary Structure
8Our Basic Model
- RNA linear structure: $R = r_1 r_2 \ldots r_n$, with each $r_i$ from $\{A, C, G, U\}$.
- RNA secondary structure: pairs $(r_i, r_j)$ such that $0 < i < j < n+1$.
- Goal: secondary structures with minimum free energy.
9Implementing Model Restrictions
- No knots: no pairs $(r_i, r_j)$ and $(r_k, r_l)$ such that $i < k < j < l$.
  - RNA does contain knots, but the model ignores them.
  - Implemented by the program's loop structure.
- No close base pairs: $j - i > t$ for some $t > 0$.
  - Implemented by a high free energy.
- Complementary base pairs: A-U, C-G.
  - Implemented by a high free energy.
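The three restrictions can be checked mechanically for a candidate list of pairs. The following is a minimal sketch, assuming 1-based positions as on the slides; the function name `violations` and the default threshold `t=3` are illustrative choices, not values given in the slides.

```python
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def violations(rna, pairs, t=3):
    """Return a list of messages, one per violated restriction.

    rna   -- string over A, C, G, U
    pairs -- iterable of (i, j) with 1-based positions
    t     -- hypothetical minimum separation: a pair needs j - i > t
    """
    problems = []
    pairs = sorted(pairs)
    for i, j in pairs:
        if not (0 < i < j < len(rna) + 1):
            problems.append(f"({i},{j}) is out of range")
            continue
        if j - i <= t:
            problems.append(f"({i},{j}) is too close")
        if (rna[i - 1], rna[j - 1]) not in COMPLEMENTARY:
            problems.append(f"({i},{j}) is not complementary")
    # Knot: two pairs (i, j) and (k, l) with i < k < j < l.
    # Because the pairs are sorted, every pair with k > i comes later.
    for idx, (i, j) in enumerate(pairs):
        for k, l in pairs[idx + 1:]:
            if i < k < j < l:
                problems.append(f"({i},{j}) and ({k},{l}) form a knot")
    return problems
```

For example, `violations("GGGAAAUCCC", [(1, 10), (2, 9), (3, 8)])` returns an empty list, while the crossing pairs `(1, 8)` and `(2, 10)` are reported as a knot.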
10Our Two Algorithms
- Independent base pairs: quite easy, but inaccurate.
- Loop free energy calculation: the best we can do for today's class.
11Independent Base Pair Algorithm
- Assumption: independent base pairs.
- Advantage 1: simpler calculations.
- Advantage 2: illustrates ideas for a much more accurate algorithm.
- Disadvantage: unrealistic answers.
12Independent Base Pairs: What Makes It Easy?
- Assumption: the energy of each base pair is independent of all of the other pairs and of the loop structure.
- Consequence: the total free energy is the sum of all of the base pair free energies.
13Independent Base Pairs: Basic Approach
- Use solutions for smaller strings to determine solutions for larger strings.
- This is precisely the kind of decoupling required for dynamic programming algorithms to work.
14Independent Base Pairs: Notation
- $a(r_i, r_j)$: the free energy of a base pair joining $r_i$ and $r_j$.
- $S_{i,j}$: the secondary structure of the RNA strand from base $r_i$ to base $r_j$, i.e., the set of base pairs between $r_i$ and $r_j$ inclusive.
- $E(S_{i,j})$: the free energy associated with the secondary structure $S_{i,j}$.
- We define $a(r_i, r_j)$ to be large when constraints are violated.
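One way to realize "large when constraints are violated" is to return infinity for forbidden pairs. This is a minimal sketch: the energy values, the separation threshold, and the name `pair_energy` are placeholders chosen for illustration, not values from the slides.

```python
import math

# Placeholder energies (illustrative only, not measured values).
PAIR_ENERGY = {("A", "U"): -2, ("U", "A"): -2,
               ("C", "G"): -3, ("G", "C"): -3}
MIN_SEPARATION = 3   # hypothetical t: a pair needs j - i > t

def pair_energy(rna, i, j):
    """a(r_i, r_j) for 1-based positions i < j; infinite if not allowed."""
    if j - i <= MIN_SEPARATION:
        return math.inf                      # bases too close together
    # Non-complementary bases are missing from the table, so they
    # also fall through to the "large" value.
    return PAIR_ENERGY.get((rna[i - 1], rna[j - 1]), math.inf)
```

Using infinity means a forbidden pair can never win a minimization, so no separate constraint check is needed inside the dynamic program.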
15Independent Base Pairs: Calculating Free Energy
- Consider the RNA strand from position $i$ to $j$.
- Consider whether $r_j$ is paired.
- If $r_j$ is paired, $E(S_{i,j}) = E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1})$ for some $i-1 < k < j$.
- If $r_j$ isn't paired, then $E(S_{i,j}) = E(S_{i,j-1})$.
16Independent Base Pairs: Algorithm
- We search for intervals with minimum free energy.
- For each interval, the free energy is given by this formula:
- $E(S_{i,j}) = \min\big( E(S_{i+1,j-1}) + a(r_i, r_j),\; E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1}) \big)$, minimized over $i-1 < k < j+1$, with empty intervals contributing zero energy.
- The free energy of the RNA strand is $E(S_{1,n})$.
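The table fill can be sketched as follows. This is a minimal sketch under stated assumptions: positions are 1-based, empty intervals have energy zero, the unpaired case for $r_j$ is written as its own term, and the pair energies and separation threshold are placeholders rather than values from the slides.

```python
import math

# Placeholder energies and threshold (illustrative only).
PAIR_ENERGY = {("A", "U"): -2, ("U", "A"): -2,
               ("C", "G"): -3, ("G", "C"): -3}
MIN_SEPARATION = 3

def pair_energy(rna, i, j):
    """a(r_i, r_j); infinite when the pair violates a restriction."""
    if j - i <= MIN_SEPARATION:
        return math.inf
    return PAIR_ENERGY.get((rna[i - 1], rna[j - 1]), math.inf)

def min_free_energy(rna):
    """Return E(S_{1,n}) under the independent base pair model."""
    n = len(rna)
    # E[i][j] for 1 <= i <= j <= n; entries with j < i (empty) stay 0.
    E = [[0] * (n + 2) for _ in range(n + 2)]
    for length in range(2, n + 1):              # smaller intervals first
        for i in range(1, n - length + 2):
            j = i + length - 1
            best = E[i][j - 1]                  # r_j is not paired
            for k in range(i, j):               # r_j is paired with r_k
                cand = E[i][k - 1] + pair_energy(rna, k, j) + E[k + 1][j - 1]
                best = min(best, cand)
            E[i][j] = best
    return E[1][n]
```

With these placeholder energies, `min_free_energy("GGGAAAUCCC")` evaluates to `-9`, corresponding to three nested G-C pairs.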
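17Independent Base Pairs: Question 1
- How does this formula deal with the case where $r_j$ isn't paired with any base?
- It is a special case of $E(S_{i,k-1}) + a(r_k, r_j) + E(S_{k+1,j-1})$, $i-1 < k < j+1$.
- The special case with $k = j$: no pair is formed, so the pair term and the empty interval $S_{j+1,j-1}$ contribute nothing, leaving $E(S_{i,j-1})$.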
18Independent Base Pairs: Question 2
- What is the high-level algorithm flow?
- Advance from smaller to larger intervals, calculating free energy costs.
- Trace back the path that corresponds to the minimum free energy cost.
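19Independent Base Pairs: Question 3
- In what orders can the intervals' free energy costs be evaluated?
- Major lower, minor upper bound (outer loop over the lower bound, inner loop over the upper bound)
- Major upper, minor lower bound (outer loop over the upper bound, inner loop over the lower bound)
- Diagonally (by increasing interval length)
- Any order (e.g., random) that respects the partial order induced by inclusion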
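The three named orders can be written down and checked against the inclusion requirement. This is a minimal sketch; the function names and the loop directions (descending lower bound, ascending upper bound) are choices made here so that every proper subinterval is visited first.

```python
def lower_major(n):
    """Outer loop on the lower bound i (descending), inner on the upper bound j."""
    return [(i, j) for i in range(n, 0, -1) for j in range(i, n + 1)]

def upper_major(n):
    """Outer loop on the upper bound j (ascending), inner on the lower bound i."""
    return [(i, j) for j in range(1, n + 1) for i in range(j, 0, -1)]

def diagonal(n):
    """Visit intervals by increasing length (diagonals of the table)."""
    return [(i, i + d) for d in range(n) for i in range(1, n - d + 1)]

def respects_inclusion(order):
    """True if each interval appears after all of its proper subintervals."""
    seen = set()
    for i, j in order:
        for a in range(i, j + 1):
            for b in range(a, j + 1):
                if (a, b) != (i, j) and (a, b) not in seen:
                    return False
        seen.add((i, j))
    return True
```

Running the lower bound in ascending order instead would visit interval (1, 2) before (2, 2), which `respects_inclusion` rejects.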
20Independent Base Pairs: Question 4
- What are the time and storage requirements of this algorithm?
- Express your answer in terms of the number of bases in the RNA strand.
- Since the number of intervals is quadratic, the storage requirements are quadratic.
- Since the time requirement for each interval is linear, total time is cubic.
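As a rough count behind the cubic bound, assume the recurrence tries each $k$ with $i \le k < j$ for interval $(i,j)$, i.e., $j - i$ candidates. Summing over all intervals, grouped by $d = j - i$:

$$\sum_{1 \le i \le j \le n} (j - i) \;=\; \sum_{d=0}^{n-1} (n - d)\, d \;=\; \frac{(n-1)\, n\, (n+1)}{6} \;=\; \Theta(n^3),$$

while the table itself holds $n(n+1)/2 = \Theta(n^2)$ entries.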
21Independent Base Pairs: Question 5
- Why not simply calculate free energies as they are needed? Why store them at all?
- Because the recursive calls would turn our polynomial algorithm into an exponential algorithm.
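The effect of storing results can be observed by counting evaluations. This is a minimal sketch that keeps only the shape of the recursion (pair energies are set to zero, since only the number of calls matters here); `count_calls` is an illustrative name.

```python
from functools import lru_cache

def count_calls(n, memoize):
    """Count evaluations of E(i, j) needed to obtain E(1, n)."""
    calls = 0

    def energy(i, j):
        nonlocal calls
        calls += 1
        if j <= i:                      # empty or single-base interval
            return 0
        best = energy(i, j - 1)         # r_j unpaired
        for k in range(i, j):           # r_j paired with r_k
            best = min(best, energy(i, k - 1) + energy(k + 1, j - 1))
        return best

    if memoize:
        # Rebinding the name makes the recursive calls hit the cache too,
        # so `calls` then counts only distinct (i, j) arguments.
        energy = lru_cache(maxsize=None)(energy)
    energy(1, n)
    return calls
```

Comparing `count_calls(12, False)` with `count_calls(12, True)` shows the gap: the memoized count is bounded by the number of distinct intervals, while the plain recursion re-evaluates the same intervals many times.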
22Independent Base Pairs: Question 6
- How does traceback work for this algorithm?
- Recalculate which subinterval yields the minimum free energy.
- Alternatively, save traceback paths while the table is filled.
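The recalculation variant of traceback can be sketched as follows. It is a minimal sketch under the same assumptions as the table fill: 1-based positions, empty intervals with zero energy, and placeholder integer pair energies (integers keep the equality tests exact).

```python
import math

# Placeholder energies and threshold (illustrative only).
PAIR_ENERGY = {("A", "U"): -2, ("U", "A"): -2,
               ("C", "G"): -3, ("G", "C"): -3}
MIN_SEPARATION = 3

def pair_energy(rna, i, j):
    if j - i <= MIN_SEPARATION:
        return math.inf
    return PAIR_ENERGY.get((rna[i - 1], rna[j - 1]), math.inf)

def fill_table(rna):
    """Fill E[i][j] from smaller to larger intervals."""
    n = len(rna)
    E = [[0] * (n + 2) for _ in range(n + 2)]
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            E[i][j] = min(
                [E[i][j - 1]]
                + [E[i][k - 1] + pair_energy(rna, k, j) + E[k + 1][j - 1]
                   for k in range(i, j)])
    return E

def traceback(rna, E):
    """Recover one minimum-energy set of pairs by redoing each choice."""
    pairs = []
    stack = [(1, len(rna))]
    while stack:
        i, j = stack.pop()
        if j <= i:
            continue
        if E[i][j] == E[i][j - 1]:               # r_j was left unpaired
            stack.append((i, j - 1))
            continue
        for k in range(i, j):                    # find the partner of r_j
            if E[i][j] == E[i][k - 1] + pair_energy(rna, k, j) + E[k + 1][j - 1]:
                pairs.append((k, j))
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return sorted(pairs)
```

For example, `traceback("GGGAAAUCCC", fill_table("GGGAAAUCCC"))` returns `[(1, 10), (2, 9), (3, 8)]`. Saving a pointer per table entry during the fill would avoid the inner search at the cost of extra storage.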
23Loop Free Energy Algorithm
- A base pair's free energy is not independent of all other base pairs.
- An RNA molecule's free energy actually depends on its loop structure.
- What do we mean by loops?
24Types of Loops
- Each base pair $(r_i, r_j)$ encloses a loop:
- Hairpin loop
- Bulge on i or j
- Interior loop
- Helical region
25Hairpin Loop
- There are no base pairs $(r_k, r_l)$ for $i < k < l < j$.
26Bulge on i and j
- Bulge on $i$:
  - $(r_i, r_j)$ and $(r_k, r_{j-1})$ are base pairs with $k > i+1$.
  - $r_{i+1}$ is not paired.
- The bulge on $j$ is symmetric.
27Interior loop
- $(r_i, r_j)$ and $(r_k, r_l)$ are base pairs with $i+1 < k < l < j-1$.
- $r_{i+1}$ and $r_{j-1}$ are not in base pairs.
28Helical region
- $(r_i, r_j)$ and $(r_{i+1}, r_{j-1})$ are base pairs.
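The four loop definitions can be turned into a small classifier. This is a minimal sketch, assuming a knot-free set of 1-based pairs; the function name `classify_loop` and the catch-all `"other"` result (for a pair that directly encloses more than one stem, a case these definitions do not name) are additions made here.

```python
def classify_loop(pairs, i, j):
    """Name the loop closed by base pair (i, j), per the definitions above."""
    inner = [(k, l) for (k, l) in pairs if i < k < l < j]
    if not inner:
        return "hairpin loop"
    # Keep only pairs not nested inside another enclosed pair.
    outer = [(k, l) for (k, l) in inner
             if not any(p < k and l < q for (p, q) in inner)]
    if len(outer) > 1:
        return "other"                 # not one of the four named types
    k, l = outer[0]
    if (k, l) == (i + 1, j - 1):
        return "helical region"
    if l == j - 1:                     # then k > i + 1
        return "bulge on i"
    if k == i + 1:                     # then l < j - 1
        return "bulge on j"
    return "interior loop"             # k > i + 1 and l < j - 1
```

For the nested pairs `{(1, 10), (2, 9), (3, 8)}`, the pair `(1, 10)` closes a helical region and `(3, 8)` closes a hairpin loop.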
29Free energy analysis
- $E(S_{i,j}) = E(S_{i+1,j})$ when $r_i$ isn't paired.
- $E(S_{i,j}) = E(S_{i,j-1})$ when $r_j$ isn't paired.
- $E(S_{i,j}) = \min_k \big( E(S_{i,k}) + E(S_{k+1,j}) \big)$ for $i < k < j$, with $k$ between the partners of $r_i$ and $r_j$, when $r_i$ and $r_j$ are paired but not to each other.
- $E(S_{i,j}) = E(L_{i,j})$, where $E(L_{i,j})$ is the loop energy, when $r_i$ and $r_j$ are paired to each other.
30Free Energy Functions
- $a(r_i, r_j)$: free energy of base pair $(r_i, r_j)$.
- $H(k)$: destabilizing free energy of a hairpin loop with size $k$.
- $R$: stabilizing free energy of adjacent base pairs (helical region).
- $B(k)$: destabilizing free energy of a bulge of size $k$.
- $I(k)$: destabilizing free energy of an interior loop of size $k$.
31Loop Energy Formulas
- $H(j-i-1)$ for a hairpin loop
- $R + E(S_{i+1,j-1})$ for a helical region
- $B(k) + E(S_{i+k+1,j-1})$ for a bulge on $i$
- $B(k) + E(S_{i+1,j-k-1})$ for a bulge on $j$
- $I(k_1 + k_2) + E(S_{i+k_1+1,\,j-k_2-1})$ for an interior loop
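The index arithmetic in these formulas can be sketched by enumerating every candidate for the loop closed by a pair $(r_i, r_j)$. This is a minimal sketch: the function name, the callable parameters `H`, `B`, `I`, and the requirement that the inner interval hold at least two bases are assumptions made here; the pair term $a(r_i, r_j)$ is not included.

```python
def loop_candidates(E, i, j, H, R, B, I):
    """List (description, energy) options for the loop closed by (i, j).

    E       -- table with E[p][q] = energy of interval p..q (1-based)
    H, B, I -- callables: hairpin, bulge, and interior-loop penalties
    R       -- helical-region energy (a number)
    """
    options = [("hairpin loop", H(j - i - 1))]
    if j - i >= 3:                                  # inner pair must fit
        options.append(("helical region", R + E[i + 1][j - 1]))
    for k in range(1, j - i - 2):                   # bulge of size k
        options.append((f"bulge on i, size {k}", B(k) + E[i + k + 1][j - 1]))
        options.append((f"bulge on j, size {k}", B(k) + E[i + 1][j - k - 1]))
    for k1 in range(1, j - i - 3):                  # unpaired bases after r_i
        for k2 in range(1, j - i - 2 - k1):         # unpaired bases before r_j
            options.append((f"interior loop {k1}+{k2}",
                            I(k1 + k2) + E[i + k1 + 1][j - k2 - 1]))
    return options
```

Taking `min(options, key=lambda o: o[1])` then selects the cheapest loop type for the pair. The nested loop over `k1` and `k2` is the part whose size grows quadratically with the interval length.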
32Free Energy Calculation for Interval (i,j)
- Minimize over:
  - Case where $(r_i, r_j)$ is not a pair.
  - Case where $(r_i, r_j)$ is a pair.
    - Add $a(r_i, r_j)$ to the formulas.
    - Minimize over $k$, $k_1$, and $k_2$.
33What is the Apparent Complexity?
- The interior loop calculations are given by $I(k_1 + k_2) + E(S_{i+k_1+1,\,j-k_2-1})$.
- The number of interior loop possibilities is quadratic in the interval size.
- The number of intervals is quadratic in the size of the problem.
- The complexity appears to grow as $n^4$.
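A rough count makes the quadratic factor explicit. Assume, for counting purposes only (the slides do not state these bounds), that an interior loop has at least one unpaired base on each side ($k_1, k_2 \ge 1$) and that the inner interval holds at least two bases. With $m = j - i - 3$, the admissible choices satisfy $k_1 + k_2 \le m$, so

$$\#\{(k_1, k_2)\} \;=\; \sum_{k_1=1}^{m-1} (m - k_1) \;=\; \frac{m(m-1)}{2},$$

which is quadratic in the interval size. Summing this over the $\Theta(n^2)$ intervals gives $\Theta(n^4)$ candidate evaluations.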
34What is the Actual Complexity?
- Overall reduction from $n^4$ to $n^3$ is possible.
- Per-interval work is reduced from $n^2$ to linear.
- Store the minimum free energy $V_{i,j,k}$ where the interval $(i,j)$ contains an interior loop of size $k$.
35Multiple Solutions
- Care must be taken to define the issues.
- Multiple solutions can be obtained by adding flexibility to the traceback logic.
- The number of solutions can grow exponentially.
36References
- M. Zuker, "The use of dynamic programming in RNA secondary structure prediction." In M. S. Waterman, editor, Mathematical Methods for DNA Sequences. Boca Raton, FL: CRC Press, 1989.
- J. Setubal and J. Meidanis, Ch. 8.1, Introduction to Computational Molecular Biology. Pacific Grove, CA: Brooks/Cole Publishing Co., 1997.