Prediction of RNA structure


1
Prediction of RNA structure

Vineet Bafna

2
The central dogma

The DNA contains the code to all of the essential
machinery of the cell, mainly proteins.
Information about proteins is in genes.
The genic information is first copied to an
intermediate molecule (transcription)
The transcript goes outside the nucleus and is
translated into a protein.

3
(No Transcript)
4
The Central Dogma

However, not all genes are translated!

5
Example: tRNA
6
RNA

Could it be that there is a large trove of non-coding RNA that has not been discovered yet?
The answer seems to be YES.
Today's lecture: How can we computationally identify ncRNA?

7
Novel ncRNAs are abundant. Example: miRNAs

miRNAs were the second major story in 2001 (after
the genome).
Subsequently, many other non-coding genes have
been found

8
(No Transcript)
9
Scientific American, 2006
10
Bacterial Riboswitches

-Breaker Lab

11
ncRNA gene finding

The RNA world hypothesis
RNA genes are as important as protein-coding genes.
Many undiscovered ncRNAs exist.
Computational methods for discovering ncRNA are not mature.
What are the clues to non-coding genes?
Structure: Given a sequence, what is the structure into which it can fold with minimum energy?

12
RNA structure Basics

Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U.
The complementary bases form pairs:
A <-> U, C <-> G, G <-> U
Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

13
RNA structure Basics

Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U.
The complementary bases form pairs.
Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

14
RNA structure: pseudoknots

Sometimes, unpaired bases in loops form crossing
pairs. These are pseudoknots.
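
To make the non-crossing condition concrete, here is a minimal sketch that lists the crossing (interleaving) pairs in a set of base-pairs. The function name `crossing_pairs` and the representation of a base-pair as a tuple (i, j) with i < j are assumptions made for this illustration, not part of the lecture.

```python
def crossing_pairs(pairs):
    """Return every couple of base-pairs (p, q) that interleave.

    Each base-pair is a tuple (i, j) with i < j. Two pairs (i, j) and
    (i2, j2) cross when i < i2 < j < j2; nested or disjoint pairs do not.
    """
    ordered = sorted(pairs)
    found = []
    for a, p in enumerate(ordered):
        for q in ordered[a + 1:]:
            if p[0] < q[0] < p[1] < q[1]:
                found.append((p, q))
    return found


# A nested structure has no crossings; an interleaved one does.
print(crossing_pairs([(1, 6), (2, 3), (4, 5)]))
print(crossing_pairs([(1, 4), (2, 6)]))
```

A structure is pseudoknot-free exactly when this list is empty.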

15
De novo RNA structure prediction

Any set of non-crossing base-pairs defines a secondary structure.
Abstract question:
Given an RNA string, find a structure that maximizes the number of non-crossing base-pairs.
Incorporate the true energetics of folding.
Incorporate pseudoknots.

16
(No Transcript)
17
De novo RNA prediction: a combinatorial formulation
A C G A U U

Input:
A string over {A,C,G,U}
A pairs with U, C pairs with G, G pairs with U
Output:
A subset of possible base-pairings of maximum size (score) such that
No two base-pairs intersect
Each nucleotide pairs with at most one other nucleotide
How can we compute this set efficiently?

18
RNA structure

Nussinov's algorithm
Score B for every base-pair. No penalty for loops. No pseudoknots.
Let W(i,j) be the score of the best structure of the subsequence from i to j.

for i = n down to 1
  for j = i+1 to n
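
The loop body is not in the transcript. The sketch below fills the table with the standard four-case Nussinov recurrence (i unpaired, j unpaired, i paired with j, or a split at k), which is an assumption about what the slide showed. The name `nussinov`, the default B = 1, and the absence of a minimum hairpin length are also choices made only for this illustration.

```python
# Allowed pairs, as stated in the lecture: A-U, C-G and the wobble pair G-U.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def nussinov(seq, B=1):
    """Best score over non-crossing structures of seq (B per base-pair)."""
    n = len(seq)
    # W[i][j] uses 1-based positions; empty intervals (j < i) keep score 0.
    W = [[0] * (n + 1) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            best = max(W[i + 1][j], W[i][j - 1])        # i or j unpaired
            if (seq[i - 1], seq[j - 1]) in PAIRS:        # i pairs with j
                best = max(best, W[i + 1][j - 1] + B)
            for k in range(i + 1, j):                    # split at k
                best = max(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[1][n]


print(nussinov("ACGAUU"))  # 3: pairs (1,6), (2,3), (4,5)
```

The three nested loops give O(n^3) time and the table takes O(n^2) space.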

19
Obtaining RNA structure
for i = n downto 1
  for j = i+1 to n
20
Obtaining RNA Structure
Procedure print_RNA(i,j)
  if i >= j
    return
  if S(i,j) = /
    print (i,j)
    print_RNA(i+1,j-1)
  else if S(i,j) = -
    print_RNA(i+1,j)
  else if S(i,j) = |
    print_RNA(i,j-1)
  else
    k = S(i,j)
    print_RNA(i,k)
    print_RNA(k+1,j)
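
A runnable sketch of the traceback follows. It assumes that the table S stores, for each (i,j), which case achieved the optimum: "/" for the pair (i,j), "-" for moving to (i+1,j), "|" for moving to (i,j-1), or an integer split point k. The "|" marker, the function names `fold` and `print_rna`, and returning the pairs in a list instead of printing them are assumptions made for this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def fold(seq):
    """Fill the score table W and the traceback table S (1-based indices)."""
    n = len(seq)
    W = [[0] * (n + 1) for _ in range(n + 2)]
    S = [[None] * (n + 1) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            options = [(W[i + 1][j], "-"), (W[i][j - 1], "|")]
            if (seq[i - 1], seq[j - 1]) in PAIRS:
                options.append((W[i + 1][j - 1] + 1, "/"))
            options += [(W[i][k] + W[k + 1][j], k) for k in range(i + 1, j)]
            # Compare on the score only; the marker may be a str or an int.
            W[i][j], S[i][j] = max(options, key=lambda t: t[0])
    return W, S


def print_rna(S, i, j, out):
    """Collect the base-pairs of one optimal structure on i..j into out."""
    if i >= j:
        return
    mark = S[i][j]
    if mark == "/":
        out.append((i, j))
        print_rna(S, i + 1, j - 1, out)
    elif mark == "-":
        print_rna(S, i + 1, j, out)
    elif mark == "|":
        print_rna(S, i, j - 1, out)
    else:
        print_rna(S, i, mark, out)
        print_rna(S, mark + 1, j, out)


seq = "ACGAUU"
W, S = fold(seq)
pairs = []
print_rna(S, 1, len(seq), pairs)
print(sorted(pairs))  # [(1, 6), (2, 3), (4, 5)]
```

The recursion is fine for short sequences; for long inputs an explicit stack would avoid Python's recursion limit.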
21
RNA structure example
(DP table indexed by i = 1..6 and j = 2..6; entries not transcribed)
22
RNA Structure Details
23
Base-pairing Loops

Base-pairs arise from complementary nucleotides.
Single-stranded
A stack occurs when 2 base-pairs are contiguous.
Loops arise when there are unpaired bases.
They are characterized by the number of base-pairs that close them:
Hairpin: closed by 1 base-pair
Bulge/interior loops: closed by 2 base-pairs
Multi-loops (multiple internal loops): closed by k base-pairs

24
Scoring Loops, multi-loops

Zuker-Turner Energy Rules
http://www.bioinfo.rpi.edu/zukerm/rna/energy/node2.html
Stacking Energies
Energy for Bulges and Interior Loops
Energy for Multi-loops

25
Improving the efficiency

Computing the structure of a 2000 nt sequence takes about 20 s (2 GHz, 2 GB)
(Wexler et al., 2006)
How much time would it take to search 10000 candidate regions?
What would be the impact of an O(n) speedup on this search?
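
A back-of-the-envelope calculation for these two questions, assuming every candidate region is about 2000 nt long and that an O(n) speedup translates into an idealized factor of n = 2000 (real constant factors would differ):

```python
SECONDS_PER_FOLD = 20      # from the slide: about 20 s per 2000 nt sequence
CANDIDATE_REGIONS = 10_000
N = 2000                   # sequence length, used as the idealized speedup

total_seconds = SECONDS_PER_FOLD * CANDIDATE_REGIONS
total_hours = total_seconds / 3600
ideal_seconds = total_seconds / N

print(f"cubic algorithm: {total_seconds} s, about {total_hours:.1f} hours")
print(f"with an idealized factor-n speedup: {ideal_seconds:.0f} s")
```

Under these assumptions the search drops from roughly 56 hours to under two minutes, which is why the size of the candidate list matters.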

26
A triangle inequality

After the computations are done, a triangle inequality is satisfied:
for all i <= k < j, W(i,j) >= W(i,k) + W(k+1,j)

27
Towards more efficient algorithms

Consider the main loop again.
We only need to branch when i forms a base-pair.
Therefore, w.l.o.g.

28
A candidate list approach

Note that if C is the maximum size of the candidate list, then the running time is O(C n^2).
How can we reduce the size of the candidate list?
Consider only those k s.t. b_k is complementary to b_i.

We will consider cases when k is NOT optimal, using the triangle inequality.
29
Candidate list

Consider i, j, k with i < k < j s.t.
W(i,j) = B(r_i,r_k) + W(i+1,k-1) + W(k+1,j)   (1)
Claim: For all j' > j,
j is not a candidate in computing W(i,j').

30
Proof
31
The candidate list algorithm

For all i = n down to 1
  For all j = i+1 to n
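
The loop body is missing from the transcript. The sketch below is one way to realize the candidate-list idea for the simple base-pair-counting score. It assumes the decomposition "i is unpaired, or i pairs with some k", and keeps k as a candidate for i only when pairing i with k is strictly better than every other way of folding i..k. The function names `fold_plain` and `fold_candidates` are hypothetical, and this is not claimed to be the exact procedure of Wexler et al.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def can_pair(seq, i, j):
    return (seq[i - 1], seq[j - 1]) in PAIRS


def fold_plain(seq):
    """Reference O(n^3) version: try every partner k for i."""
    n = len(seq)
    W = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            best = W[i + 1][j]                       # i unpaired
            for k in range(i + 1, j + 1):            # i pairs with k
                if can_pair(seq, i, k):
                    best = max(best, 1 + W[i + 1][k - 1] + W[k + 1][j])
            W[i][j] = best
    return W[1][n]


def fold_candidates(seq):
    """Same optimum, but only candidate partners of i are tried."""
    n = len(seq)
    W = [[0] * (n + 2) for _ in range(n + 2)]
    largest_list = 0
    for i in range(n, 0, -1):
        cands = []   # (k, score of the best fold of i..k in which i pairs with k)
        for j in range(i + 1, n + 1):
            best = W[i + 1][j]
            for k, paired in cands:
                best = max(best, paired + W[k + 1][j])
            if can_pair(seq, i, j):
                paired = 1 + W[i + 1][j - 1]
                # j becomes a candidate only if pairing (i, j) beats every
                # other decomposition; otherwise the triangle inequality
                # shows it can never be needed for a longer interval.
                if paired > best:
                    best = paired
                    cands.append((j, paired))
            W[i][j] = best
        largest_list = max(largest_list, len(cands))
    return W[1][n], largest_list


score, C = fold_candidates("GCAUGCAUGGCUAUCG")
print(score, C, fold_plain("GCAUGCAUGGCUAUCG"))
```

The second return value is the largest candidate list seen, which plays the role of C in the O(C n^2) bound.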

32
How large is the candidate list?

Here, the answer depends very much on the energy computations.
If we simply count the number of base-pairs, the candidate list is probably large.
For more realistic energy functions, it is very small (experiments with n up to 1000 showed C ≈ 6).

33
Empirical estimates of C
Ydo Wexler, Chaya Zilberstein, and Michal Ziv-Ukelson. A Study of Accessible Motifs and RNA Folding Complexity. LNBI 3909, p. 473 ff.
34
A thermodynamic argument

It is conjectured, and empirically shown, that RNA has the polymer-zeta property:
The probability that a string of length m folds into an RNA structure is b/m^c for constants b, c > 0.
Note that if c > 1, the expected number of indices j that base-pair with i is O(1), because the sum of b/m^c over all m converges.
In practice, c ≈ 1.5.
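
A small numerical check of this argument, as a sketch: if a partner at distance m closes a structure with probability b/m^c, the expected number of partners of a fixed position i is bounded by the sum of b/m^c over m. That sum stays bounded when c > 1 (for example c = 1.5) but keeps growing when c = 1. The function name `expected_partners` and the choice b = 1 are assumptions made for this example.

```python
def expected_partners(b, c, max_len):
    """Sum of b / m**c for m = 1..max_len."""
    return sum(b / m ** c for m in range(1, max_len + 1))


for max_len in (10**2, 10**3, 10**4, 10**5):
    bounded = expected_partners(1.0, 1.5, max_len)
    growing = expected_partners(1.0, 1.0, max_len)
    print(f"n={max_len:>6}  c=1.5: {bounded:.3f}   c=1.0: {growing:.3f}")
```

With c = 1.5 the partial sums level off below about 2.62, while with c = 1 they grow like log n.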

35
End of Lecture
36
Incorporating pseudoknots in structure prediction

Q: Determine the optimal structure when pseudoknots are allowed.
Pseudoknots are only loosely defined.
If any interleaving is allowed, then simply select a structure in which a maximum number of nucleotides can be paired.
Under some restrictive notions, the structure problem becomes NP-hard.

37
Akutsu's simple pseudoknots

A region [i0,k0] forms a simple pseudoknot if there exist positions j0', j0 s.t.
Each (i,j) ∈ M satisfies either
i0 ≤ i < j0' ≤ j < j0 or j0' ≤ i < j0 ≤ j ≤ k0
Within each of the two groups, there is no interleaving:
i < i' < j0' or j0' ≤ i < i' implies j > j'.

38
Simple pseudoknots
A region [i0,k0] forms a simple pseudoknot if it can be partitioned into 3 consecutive regions A, B, C s.t. every base-pair has the following properties:
It connects either (A,B) or (B,C).
Two base-pairs within (A,B) (or within (B,C)) do not interleave.
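
The three-region definition can be checked by brute force over the two split points. The sketch below is only an illustration of the definition, not Akutsu's algorithm. The function names, the half-open convention A = [i0, j1), B = [j1, j2), C = [j2, k0], and the assumption that the input pairs already use each position at most once are choices made for this example.

```python
from itertools import combinations


def interleave(p, q):
    """True if base-pairs p and q (tuples (i, j), i < j) cross each other."""
    (a, b), (c, d) = sorted([p, q])
    return a < c < b < d


def simple_pseudoknot_split(pairs, i0, k0):
    """Return split points (j1, j2) witnessing a simple pseudoknot, or None.

    Every pair must connect A to B or B to C, and pairs inside the same
    group must not interleave.
    """
    for j1 in range(i0 + 1, k0 + 1):
        for j2 in range(j1 + 1, k0 + 1):
            ab = [(i, j) for i, j in pairs if i0 <= i < j1 <= j < j2]
            bc = [(i, j) for i, j in pairs if j1 <= i < j2 <= j <= k0]
            if len(ab) + len(bc) != len(pairs):
                continue  # some pair connects neither (A,B) nor (B,C)
            if any(interleave(p, q)
                   for group in (ab, bc)
                   for p, q in combinations(group, 2)):
                continue
            return (j1, j2)
    return None


# Two nested A-B pairs crossing two nested B-C pairs.
print(simple_pseudoknot_split([(0, 5), (1, 4), (2, 8), (3, 7)], 0, 8))
# Three mutually crossing pairs cannot be split into two nested groups.
print(simple_pseudoknot_split([(0, 3), (1, 4), (2, 5)], 0, 5))
```

The search tries O(n^2) split points and is meant only to make the definition tangible on small examples.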

39
Simple pseudoknotted structure
M
M2
M1

Collection of simple pseudoknots (Mi) and other
basepairs M.
None of the simple pseudoknot regions overlap.
M is a secondary structure without pseudoknots
for the sequence obtained by excising all
pseudoknotted regions.
Can we identify a sub-structure for the
pseudo-knots?

40
Akutsu's algorithm (main idea)

Rotate the sequence so that it forms two loops.
Each allowed base-pair is a horizontal line in
exactly one of the two loops. The horizontal
lines in the two loops are non-crossing.
All base-pairs can be ordered.

41
Main idea
(Figure labels: j0, k0, i0, j0)

The base-pairs have a total order that a d.p. can exploit.
(i,j) < (i',j') if one of the following holds:
i < i < j < j
i < i < j < j < j0
j0 < i < i < j < j

42
Frontiers of D.P.
(Figure labels: j0, k0, i, j, k, i0, j0)

We need to construct an increasing path of base-pairs.
Consider a frontier triple (i,j,k), corresponding to the regions (i0,i) and (j,k).
We have the following cases:
(i,j) form a base-pair
(j,k) form a base-pair
Neither (i,j) nor (j,k) form a base-pair

43
Ordering frontiers
(Figure labels: j0, k0, i, j, k, i0, j0)

Frontier triple (i,j,k) > (i',j',k') if and only if
i < i
j < j
k < k

44
Ordering Frontiers

(i,j,k) > (i',j',k') if and only if

45
Recurrences

SL(i,j,k) is the optimum score of a frontier (i,j,k), assuming that i < j0' < j < j0 < k and (i,j) form a base-pair.
SR(i,j,k) is the optimum score of a frontier (i,j,k), assuming that i < j0' < j < j0 < k and (j,k) form a base-pair.
SM(i,j,k) is the optimum score of a frontier (i,j,k), assuming that i < j0' < j < j0 < k and neither (i,j) nor (j,k) form a base-pair.

46
Computing SL
47
Putting it all together
48
Computing optimal pseudo-knotted structures

Let S(i,j) be the optimal score of a simple pseudoknotted structure.
Either (i,j) is a simple pseudoknot:
S(i,j) = Spseudo(i,j)
Or not:
S(i,j) = max { v(a_i,a_j) + S(i+1,j-1), max_k [ S(i,k-1) + S(k,j) ] }

49
Time Complexity

We compute Spseudo(i,j) for all (i,j). Each computation is O(n^3). Total time?
For each i0, perform the following computation:
Spseudo(i0,k0) = max over i0 ≤ i < j ≤ k0 of SL(i,j,k0), ...
Total time: O(n^4). Can you do better?

50
Recursive pseudoknots

Each loop of the RNA structure is a recursive
pseudoknotted RNA structure.
The optimal recursive pseudoknotted RNA structure problem can be solved in O(n^5) time.

51
Open Questions

There should be a direct generalization of simple
pseudoknots to the following structure.
Rivas and Eddy do consider such a generalization,
but a systematic treatment is missing.
Q: Given a pseudoknotted structure, is it an Akutsu simple pseudoknotted structure?
A linear-time algorithm was devised recently for this problem.

52
Structure as a proxy for ncRNA

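Is any genomic region with an energetically favorable fold a candidate ncRNA?
Rivas and Eddy show otherwise.
They use:
LOD score: L = log [ Pr(s | RNA) / Pr(s | Null) ]
Z-score: Z = (G - μ)/σ, where G is the score of the sequence and μ, σ are the mean and standard deviation of the scores of shuffled sequences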

53
LOD scores for putative ncRNA

A region of C. elegans containing two tRNAs is chosen. The positions of the tRNAs are indicated by stars.
The true positions and the LOD-score correspond.
Stronger results in M. jannaschii (AT 70%)
Weak results in E. coli (GC 50%)

54
Correlation with GC content

A GC-content detector gives results similar to structure prediction!

55
Significance of detected regions

Significance was tested using Z-scores on shuffled sequences.
Earlier tests shuffled sequences using a large window.
Shuffling high-scoring windows led to lower Z-scores.
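
A minimal sketch of a shuffle-based Z-score, under strong simplifying assumptions: the score is the maximum number of non-crossing base-pairs (higher is better) and not a folding energy (lower is better), so the sign convention differs from an energy-based Z-score. The shuffle is a plain mononucleotide shuffle of the whole window. The function names and the parameters `n_shuffles` and `seed` are hypothetical, and this is not the procedure used by Rivas and Eddy.

```python
import random
import statistics

PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def pair_score(seq):
    """Maximum number of non-crossing base-pairs (Nussinov-style DP)."""
    n = len(seq)
    W = [[0] * (n + 1) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            best = max(W[i + 1][j], W[i][j - 1])
            if (seq[i - 1], seq[j - 1]) in PAIRS:
                best = max(best, W[i + 1][j - 1] + 1)
            for k in range(i + 1, j):
                best = max(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[1][n]


def z_score(seq, n_shuffles=100, seed=0):
    """Z = (G - mu) / sigma, with mu and sigma taken over shuffled copies."""
    rng = random.Random(seed)
    g = pair_score(seq)
    letters = list(seq)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)   # keeps the base composition of the window
        shuffled_scores.append(pair_score("".join(letters)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    return (g - mu) / sigma if sigma > 0 else 0.0


print(z_score("GGGAUCAAUUCCCGAUAAGCUUAUCGGA", n_shuffles=50))
```

Because shuffling preserves base composition, a score that is driven only by GC content tends to look unremarkable against its own shuffles.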

56
Testing chimeric sequence

Test on chimeric sequence. A real tRNA was
embedded in 2000bp random sequence with similar
GC-composition.

57
Increasing window size

tRNA Z-score computation with a larger window (85 nt).

For very strong signals, a Z-score > 4 may still be significant.

59
Z-score distribution

Z-scores of 415 tRNA genes. 98 have Z-score
lower than 4.

60
Structure

We hoped that RNA structure computation would give us a clue for detecting novel ncRNA.
It turns out that random DNA can also fold into a low-energy structure.
