Title: Prediction of RNA structure
1. Prediction of RNA structure
2. The central dogma
- DNA contains the code for all of the essential machinery of the cell, mainly proteins.
- Information about proteins is in genes.
- The genic information is first copied to an intermediate molecule (transcription).
- The transcript leaves the nucleus and is translated into a protein.
4. The Central Dogma
- However, not all genes are translated!
5. Example: tRNA
6. RNA
- Could it be that there is a large trove of non-coding RNA that has not been discovered yet?
- The answer seems to be YES.
- Today's lecture: How can we computationally identify ncRNA?
7. Novel ncRNAs are abundant. Ex: miRNAs
- miRNAs were the second major story in 2001 (after the genome).
- Subsequently, many other non-coding genes have been found.
9. Scientific American, 2006
10. Bacterial Riboswitches
11. ncRNA gene finding
- The RNA world hypothesis
- RNA genes are as important as protein-coding genes.
- Many undiscovered ncRNAs exist.
- Computational methods for discovering ncRNA are not mature.
- What are the clues to non-coding genes?
- Structure: Given a sequence, what is the structure into which it can fold with minimum energy?
12. RNA structure: Basics
- Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U.
- The complementary bases form pairs.
- A <-> U, C <-> G, G <-> U
- Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.
14. RNA structure: pseudoknots
- Sometimes, unpaired bases in loops form crossing pairs. These are pseudoknots.
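The crossing condition can be made concrete with a small check. This is a minimal sketch, assuming base-pairs are given as 0-based index tuples; the function names `crossing_pairs` and `has_pseudoknot` are illustrative and not from the lecture.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return all couples of base-pairs (i, j), (k, l) that interleave: i < k < j < l."""
    # Normalize each pair so that the smaller index comes first, then sort.
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    crossings = []
    for (i, j), (k, l) in combinations(normalized, 2):
        # After sorting we know i <= k, so the pairs cross exactly when
        # k lies strictly inside (i, j) and l lies strictly outside it.
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings


def has_pseudoknot(pairs):
    """A structure contains a pseudoknot when at least two base-pairs cross."""
    return bool(crossing_pairs(pairs))
```

Nested pairs such as (0, 9), (1, 8) do not cross, while (0, 5), (3, 8) do.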
15. De novo RNA structure prediction
- Any set of non-crossing base-pairs defines a secondary structure.
- Abstract question:
- Given an RNA string, find a structure that maximizes the number of non-crossing base-pairs.
- Incorporate the true energetics of folding.
- Incorporate pseudoknots.
17. De novo RNA prediction: a combinatorial formulation
A C G A U U
- Input
  - A string over {A, C, G, U}
  - A pairs with U, C pairs with G, G pairs with U
- Output
  - A subset of possible base-pairings of maximum size (score) such that
  - No two base-pairs intersect
  - Each nucleotide pairs with at most one other nucleotide
- How can we compute this set efficiently?
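The output constraints can be written down as a validity check. This is a hedged sketch rather than part of the lecture: the name `is_valid_structure`, the 0-based indexing, and the `ALLOWED` table are assumptions made for illustration.

```python
# Pairing rules from the slide: A-U, C-G, and the G-U wobble pair.
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def is_valid_structure(seq, pairs):
    """Check that `pairs` (0-based tuples with i < j) is a feasible output for `seq`."""
    partner = {}
    for i, j in pairs:
        if not (0 <= i < j < len(seq)):
            return False                      # malformed or out-of-range pair
        if (seq[i], seq[j]) not in ALLOWED:
            return False                      # bases are not complementary
        if i in partner or j in partner:
            return False                      # a nucleotide pairs at most once
        partner[i], partner[j] = j, i

    # Non-crossing check: scan left to right, keeping a stack of open pairs.
    # In a non-crossing structure every closing position matches the most
    # recently opened one, exactly like balanced parentheses.
    stack = []
    for pos in range(len(seq)):
        if pos not in partner:
            continue
        if partner[pos] > pos:
            stack.append(pos)
        elif not stack or stack.pop() != partner[pos]:
            return False
    return True
```

For the example string A C G A U U, the set {(0, 5), (1, 2), (3, 4)} passes all three checks, while {(0, 4), (3, 5)} fails because the two pairs intersect.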
18. RNA structure
- Nussinov's algorithm
- Score B for every base-pair. No penalty for loops. No pseudoknots.
- Let W(i,j) be the score of the best structure of the subsequence from i to j.
for i = n down to 1
  for j = i+1 to n
    W(i,j) = max { W(i+1,j-1) + B(r_i,r_j), W(i+1,j), W(i,j-1), max_{i<k<j} W(i,k) + W(k+1,j) }
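The table fill can be sketched in a few lines. This is a minimal illustration under stated assumptions: every allowed base-pair scores B = 1, indices are 0-based, adjacent bases may pair (no loop penalty, as on the slide), and the names `nussinov_table` and `nussinov_score` are hypothetical.

```python
PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def nussinov_table(seq, B=1):
    """Fill W, where W[i][j] is the best score for the subsequence seq[i..j]."""
    n = len(seq)
    # One extra row and column so that W[i+1][j-1] and W[k+1][j] are always
    # defined; every entry with j <= i stays 0 (empty or single-base interval).
    W = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):            # i = n down to 1 in slide notation
        for j in range(i + 1, n):             # j = i+1 to n
            best = max(W[i + 1][j], W[i][j - 1])          # i or j unpaired
            if seq[i] + seq[j] in PAIRS:
                best = max(best, W[i + 1][j - 1] + B)     # i pairs with j
            for k in range(i + 1, j):
                best = max(best, W[i][k] + W[k + 1][j])   # split at k
            W[i][j] = best
    return W


def nussinov_score(seq):
    """Maximum number of non-crossing base-pairs for the whole sequence."""
    return nussinov_table(seq)[0][len(seq) - 1] if seq else 0
```

The three nested loops give O(n^3) time and O(n^2) space, which is the cost the later slides on candidate lists try to reduce.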
19. Obtaining RNA structure
for i = n down to 1
  for j = i+1 to n
    compute W(i,j) and record in S(i,j) the case that attains it
20. Obtaining RNA Structure
Procedure print_RNA(i,j)
  if i >= j
    return
  if S(i,j) = '/'
    print (i,j)
    print_RNA(i+1,j-1)
  else if S(i,j) = '-'
    print_RNA(i+1,j)
  else if S(i,j) = '|'
    print_RNA(i,j-1)
  else
    k = S(i,j)
    print_RNA(i,k)
    print_RNA(k+1,j)
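The fill and the traceback fit together as follows. This is a sketch, not the lecture's code: it assumes 0-based indices, score 1 per base-pair, and the markers '/' (i pairs with j), '-' (i unpaired), and '|' (j unpaired); the marker '|' in particular is an assumption of this sketch.

```python
PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def fill(seq):
    """Return (W, S): W holds the scores, S records which case attains W[i][j]."""
    n = len(seq)
    W = [[0] * (n + 1) for _ in range(n + 1)]
    S = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            best, move = W[i + 1][j], "-"                 # i unpaired
            if W[i][j - 1] > best:
                best, move = W[i][j - 1], "|"             # j unpaired
            if seq[i] + seq[j] in PAIRS and W[i + 1][j - 1] + 1 > best:
                best, move = W[i + 1][j - 1] + 1, "/"     # i pairs with j
            for k in range(i + 1, j):
                if W[i][k] + W[k + 1][j] > best:
                    best, move = W[i][k] + W[k + 1][j], k  # split at k
            W[i][j], S[i][j] = best, move
    return W, S


def print_rna(S, i, j, out):
    """Recursive traceback; appends the base-pairs of one optimal structure to `out`."""
    if i >= j:
        return
    move = S[i][j]
    if move == "/":
        out.append((i, j))
        print_rna(S, i + 1, j - 1, out)
    elif move == "-":
        print_rna(S, i + 1, j, out)
    elif move == "|":
        print_rna(S, i, j - 1, out)
    else:                                     # move is the split position k
        print_rna(S, i, move, out)
        print_rna(S, move + 1, j, out)


def fold(seq):
    """Convenience wrapper: returns the list of base-pairs of an optimal structure."""
    _, S = fill(seq)
    pairs = []
    print_rna(S, 0, len(seq) - 1, pairs)
    return pairs
```

The recursion depth grows with the sequence length, so an explicit stack would be the safer choice for long inputs.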
21. RNA structure example
[Table: W(i,j) with columns i = 1..6 and rows j = 2..6; the entries are not in the transcript]
22. RNA Structure: Details
23. Base-pairing and Loops
- Base-pairs arise from complementary nucleotides
- Single-stranded
- A stack is formed when 2 base-pairs are contiguous.
- Loops arise when there are unpaired bases.
- They are characterized by the number of base-pairs that close them.
  - Hairpin: closed by 1 base-pair
  - Bulge/Interior loops: 2 base-pairs
  - Multiple internal loops (multi-loops): k base-pairs
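Counting the base-pairs that close each loop is enough to classify it. The sketch below assumes a non-crossing structure given as 0-based (i, j) tuples with i < j; the function name `classify_loops` and the label strings are illustrative.

```python
def classify_loops(pairs):
    """Map each base-pair (i, j) to the kind of loop it closes on its inner side."""
    partner = {}
    for i, j in pairs:
        partner[i], partner[j] = j, i

    loop_type = {}
    for i, j in sorted(pairs):
        # Walk the interior of (i, j); whenever a base-pair opens, record it
        # and jump past its closing base, so only directly enclosed pairs count.
        inner = []
        p = i + 1
        while p < j:
            q = partner.get(p)
            if q is not None and q > p:
                inner.append((p, q))
                p = q + 1
            else:
                p += 1

        closing = 1 + len(inner)              # (i, j) itself plus enclosed pairs
        if closing == 1:
            loop_type[(i, j)] = "hairpin"
        elif closing == 2:
            # Two contiguous base-pairs with no unpaired base between them.
            stacked = inner[0] == (i + 1, j - 1)
            loop_type[(i, j)] = "stack" if stacked else "bulge/interior"
        else:
            loop_type[(i, j)] = "multi-loop"
    return loop_type
```

For the structure {(0, 11), (1, 10), (2, 5), (6, 9)}, the outer pair closes a stack, (1, 10) closes a multi-loop with k = 3, and the two inner pairs close hairpins.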
24. Scoring: Loops, multi-loops
- Zuker-Turner Energy Rules
- http://www.bioinfo.rpi.edu/zukerm/rna/energy/node2.html
- Stacking Energies
- Energy for Bulges and Interior Loops
- Energy for Multi-loops
25. Improving the efficiency
- Computing structure on a 2000nt sequence takes about 20s (2GHz, 2Gb) (Wexler et al., 2006).
- How much time would it take to search 10000 candidate regions?
- What would be the impact of an O(n) speedup on this search?
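A back-of-the-envelope calculation answers both questions. It assumes every candidate region is about 2000 nt long, and it idealizes the O(n) speedup as dividing the running time by exactly n with constant factor 1; a real implementation would gain less. The function names are illustrative.

```python
def search_time_seconds(seconds_per_region, regions):
    """Total time when every region is folded independently."""
    return seconds_per_region * regions


def with_linear_speedup(total_seconds, n):
    """Idealized O(n) speedup: divide by n (assumed constant factor of 1)."""
    return total_seconds / n


baseline = search_time_seconds(20.0, 10_000)   # 20 s per 2000-nt fold
print(baseline / 3600)                         # about 55.6 hours
print(with_linear_speedup(baseline, 2000))     # 100.0 seconds under the idealization
```

Under these assumptions the search drops from more than two days to a couple of minutes, which is why shaving a factor of n off the folding algorithm matters for genome-scale scans.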
26. A triangle inequality
- After the computations are done,
- A triangle inequality is satisfied: W(i,j) ≥ W(i,k) + W(k+1,j) for all i ≤ k < j
27. Towards more efficient algorithms
- Consider the main loop again.
- We only need to branch when i forms a base-pair. Therefore, w.l.o.g.,
  W(i,j) = max { W(i+1,j), max_{i<k≤j} B(r_i,r_k) + W(i+1,k-1) + W(k+1,j) }
28. A candidate list approach
- Note that if C is the maximum size of the candidate list, then the running time is O(Cn^2).
- How can we reduce the size of the candidate list?
- Consider only those k s.t. b_k is complementary to b_i.
We will consider cases when k is NOT optimal, using the triangle inequality.
29. Candidate list
- Consider i, j, k s.t.
- W(i,j) = B(r_i,r_k) + W(i+1,k-1) + W(k+1,j)   (1)
- Claim: For all j' > j,
- j is not a candidate in computing W(i,j')
30. Proof
31. The candidate list algorithm
- For all i = n down to 1
- For all j = i+1 to n
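One way to realize the candidate list idea for the simple base-pair counting score is sketched below. It is an illustration under stated assumptions, not the lecture's or Wexler et al.'s exact procedure: the score is 1 per base-pair, indices are 0-based, and j is kept as a candidate partner for i only when closing (i, j) strictly beats every decomposition of the interval [i, j]. The name `candidate_list_fold` is hypothetical.

```python
PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def candidate_list_fold(seq, B=1):
    """Return (optimal score, largest candidate list seen) for `seq`."""
    n = len(seq)
    W = [[0] * (n + 1) for _ in range(n + 1)]
    max_candidates = 0
    for i in range(n - 1, -1, -1):
        # Each entry is (k, closed) with closed = B + W[i+1][k-1], the score
        # of the best structure on [i, k] in which i pairs with k.
        candidates = []
        for j in range(i + 1, n):
            best = W[i + 1][j]                            # i unpaired
            for k, closed in candidates:                  # i pairs with k < j
                best = max(best, closed + W[k + 1][j])
            if seq[i] + seq[j] in PAIRS:
                closed = B + W[i + 1][j - 1]
                if closed > best:
                    # (i, j) is strictly better than every split, so j may be
                    # needed later as a branching point for larger intervals.
                    candidates.append((j, closed))
                    best = closed
            W[i][j] = best
        max_candidates = max(max_candidates, len(candidates))
    score = W[0][n - 1] if n else 0
    return score, max_candidates
```

Skipping non-candidates is safe because of the triangle inequality: if pairing i with k is no better than some decomposition of [i, k], then extending that decomposition to [i, j] is at least as good as branching at k. The inner loop runs over the candidate list only, so the running time is O(Cn^2), where C is the largest list size returned as the second value.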
32. How large is the candidate list?
- Here, the answer depends very much on the energy computations.
- If we simply count the number of base-pairs, the candidate list should be large (probably).
- For more realistic functions, it is very small (experiments with n up to 1000 showed C = 6).
33. Empirical estimates of C
A Study of Accessible Motifs and RNA Folding Complexity. Ydo Wexler, Chaya Zilberstein, and Michal Ziv-Ukelson. LNBI 3909, p. 473 ff.
34. A thermodynamic argument
- It is conjectured, and empirically shown, that RNA has the polymer-zeta property:
- The probability that a string of length m folds into an RNA structure is b/m^c for constants b, c > 0.
- Note that if c > 1, the expected number of indices j that base-pair with i is O(1).
- In practice, c = 1.5.
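The O(1) bound can be seen by summing the folding probability over all interval lengths. The derivation below is a sketch that assumes each index j at distance m from i contributes probability b/m^c and uses linearity of expectation; the series converges exactly when c > 1, which covers the practical value c = 1.5.

```latex
\mathbb{E}\bigl[\#\{\, j : (i,j) \text{ is a candidate} \,\}\bigr]
  \;\le\; \sum_{m=1}^{n} \frac{b}{m^{c}}
  \;\le\; b \sum_{m=1}^{\infty} \frac{1}{m^{c}}
  \;=\; b\,\zeta(c) \;<\; \infty \quad \text{for } c > 1,
\qquad \zeta(1.5) \approx 2.612 .
```

For c ≤ 1 the same sum grows with n, so the bound on the candidate list no longer holds.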
35. End of Lecture
36. Incorporating pseudoknots in structure prediction
- Q: Determine the optimal structure when pseudoknots are allowed.
- Pseudoknots are only loosely defined.
- If any interleaving is allowed, then simply select a structure in which a maximum number of nucleotides can be paired.
- Under some restrictive notions, the structure problem becomes NP-hard.
37. Akutsu's simple pseudoknots
- A region [i0, k0] forms a simple pseudoknot if there exist positions j0', j0 s.t.
- Each (i,j) ∈ M satisfies either
- i0 ≤ i < j0' ≤ j < j0 or j0' ≤ i < j0 ≤ j ≤ k0
- Within one of the two groups, there is no interleaving.
- i < i' < j0' or j0' ≤ i < i' implies j > j'.
38. Simple pseudoknots
- A region [i0, k0] forms a simple pseudoknot if it can be partitioned into 3 regions A, B, C s.t. every base-pair has the following properties:
- It connects either (A,B) or (B,C).
- Two base-pairs within (A,B) (or (B,C)) do not interleave.
39. Simple pseudoknotted structure
[Figure: simple pseudoknots M1 and M2 embedded in a structure M]
- A collection of simple pseudoknots (Mi) and other base-pairs M.
- None of the simple pseudoknot regions overlap.
- M is a secondary structure without pseudoknots for the sequence obtained by excising all pseudoknotted regions.
- Can we identify a sub-structure for the pseudoknots?
40. Akutsu's algorithm (Main idea)
- Rotate the sequence so that it forms two loops.
- Each allowed base-pair is a horizontal line in exactly one of the two loops. The horizontal lines in the two loops are non-crossing.
- All base-pairs can be ordered.
41. Main idea
[Figure: the rotated sequence with positions i0, j0', j0, k0 marked]
- The base-pairs have a total order that a d.p. can exploit.
- (i,j) < (i',j') if one of the following holds:
- iltiltjltj
- iltiltjltjltj0
- j0ltiltiltjltj
42. Frontiers of D.P.
[Figure: a frontier (i, j, k) on the rotated sequence with positions i0, j0', j0, k0 marked]
- We need to construct an increasing path of base-pairs.
- Consider a frontier triple (i,j,k), corresponding to the regions (i0,i) and (j,k).
- We have the following cases:
  - (i,j) form a base-pair
  - (j,k) form a base-pair
  - Neither (i,j) nor (j,k) form a base-pair
43. Ordering frontiers
[Figure: a frontier (i, j, k) on the rotated sequence with positions i0, j0', j0, k0 marked]
- frontier triple (i,j,k) gt (i,j,k) if and
only if - i lt i
- j lt j
- k lt k
44. Ordering Frontiers
- (i,j,k) gt (i,j,k) if and only if
45. Recurrences
- SL(i,j,k) is the optimum score of a frontier (i,j,k) assuming that
  - i < j0' < j < j0 < k
  - (i,j) form a base-pair
- SR(i,j,k) is the optimum score of a frontier (i,j,k) assuming that i < j0' < j < j0 < k, and
  - (j,k) form a base-pair
- SM(i,j,k) is the optimum score of a frontier (i,j,k) assuming that i < j0' < j < j0 < k, and
  - Neither (i,j) nor (j,k) form a base-pair
46. Computing SL
47. Putting it all together
48. Computing optimal pseudo-knotted structures
- Let S(i,j) be the optimal score of a simple pseudoknotted structure.
- Either (i,j) is a simple pseudoknot:
  - S(i,j) = Spseudo(i,j)
- Or not:
  - S(i,j) = max { v(a_i,a_j) + S(i+1,j-1), max_k S(i,k-1) + S(k,j) }
49. Time Complexity
- We compute Spseudo(i,j) for all (i,j). Each computation is O(n^3). Total time?
- For each i0, perform the following computation:
- Spseudo(i0,k0) = max_{i0 ≤ i < j ≤ k0} { SL(i,j,k0), ... }
- Total time O(n^4). Can you do better?
50. Recursive pseudoknots
- Each loop of the RNA structure is a recursive pseudoknotted RNA structure.
- The optimal recursive pseudoknotted RNA structure problem can be solved in O(n^5) time.
51. Open Questions
- There should be a direct generalization of simple pseudoknots to the following structure.
- Rivas and Eddy do consider such a generalization, but a systematic treatment is missing.
- Q: Given a pseudo-knotted structure, is it an Akutsu simple pseudoknotted structure?
- A linear time algorithm was devised recently for this problem.
52. Structure as a proxy for ncRNA
- Is any genomic region with an energetically favorable fold a candidate ncRNA?
- Rivas and Eddy show otherwise.
- They use
  - LOD score: L = log [ Pr(s|RNA) / Pr(s|Null) ]
  - Z-score: Z = (G - μ)/σ
53. LOD scores for putative ncRNA
- A region of C. elegans containing two tRNAs is chosen. The positions of the tRNAs are indicated by stars.
- The true positions and the LOD-score correspond.
- Stronger results in M. jannaschii (AT 70%)
- Weak results in E. coli (GC 50%)
54. Correlation with GC content
- A GC content detector has results similar to structure prediction!
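A GC content detector of this kind needs nothing more than a sliding window. The sketch below is illustrative; the function names, the window size, and the step are assumptions rather than the settings used in the study.

```python
def gc_content(seq):
    """Fraction of G and C bases in `seq` (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GCgc" for base in seq) / len(seq)


def gc_profile(seq, window, step=1):
    """GC fraction of every window, as a list of (start, fraction) tuples."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

In an AT-rich genome, windows whose GC fraction stands out from the background would be flagged, which is why such a detector can track structure-based scores without folding anything.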
55. Significance of detected regions
- The test for significance used Z-scores on shuffled sequences.
- Earlier tests shuffled sequences using a large window.
- Shuffling high-scoring windows led to lower Z-scores.
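The shuffling test can be sketched as follows. Several things here are assumptions for illustration: the base-pair count from a Nussinov-style fold stands in for the folding energy G, shuffling is a plain mononucleotide shuffle, and the names `pair_score` and `z_score` are hypothetical. With real energies, lower values are better, so the sign of Z flips relative to this sketch.

```python
import random
import statistics

PAIRS = {"AU", "UA", "CG", "GC", "GU", "UG"}


def pair_score(seq):
    """Maximum number of non-crossing base-pairs (stand-in for folding energy)."""
    n = len(seq)
    W = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            best = max(W[i + 1][j], W[i][j - 1])
            if seq[i] + seq[j] in PAIRS:
                best = max(best, W[i + 1][j - 1] + 1)
            for k in range(i + 1, j):
                best = max(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1] if n else 0


def z_score(seq, score=pair_score, n_shuffles=100, seed=0):
    """Z = (G - mu) / sigma, with mu and sigma estimated from shuffled copies of seq."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    observed = score(seq)
    bases = list(seq)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)                    # preserves base composition
        shuffled_scores.append(score("".join(bases)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:
        return 0.0                            # no variation among the shuffles
    return (observed - mu) / sigma
```

Because shuffling keeps the base composition of the window, the Z-score measures how much the native order matters beyond composition, which is the quantity the slides compare against GC content.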
56. Testing chimeric sequence
- Test on a chimeric sequence. A real tRNA was embedded in a 2000bp random sequence with similar GC-composition.
57. Increasing window size
- tRNA Z-score computation with a larger window (85nt).
58.
- For very strong signals, a Z-score > 4 may still be significant.
59. Z-score distribution
- Z-scores of 415 tRNA genes. 98 have Z-score lower than 4.
60. Structure
- We hoped that RNA structure computation would give us a clue for detecting novel ncRNA.
- It turns out that random DNA can fold into a low-energy structure.