Prediction of RNA structure - PowerPoint PPT Presentation

1
Prediction of RNA structure
  • Vineet Bafna

2
The central dogma
  • The DNA contains the code to all of the essential
    machinery of the cell, mainly proteins.
  • Information about proteins is in genes.
  • The genic information is first copied to an
    intermediate molecule (transcription)
  • The transcript goes outside the nucleus and is
    translated into a protein.

3
(No Transcript)
4
The Central Dogma
  • However, not all genes are translated!

5
Example: tRNA
6
RNA
  • Could there be a large trove of non-coding RNA
    that has not been discovered yet?
  • The answer seems to be YES
  • Today's lecture: How can we computationally
    identify ncRNA?

7
Novel ncRNAs are abundant. Example: miRNAs
  • miRNAs were the second major story in 2001 (after
    the genome).
  • Subsequently, many other non-coding genes have
    been found

8
(No Transcript)
9
Scientific American, 2006
10
Bacterial Riboswitches
  • Breaker Lab

11
ncRNA gene finding
  • The RNA world hypothesis
  • RNA genes are as important as protein-coding genes.
  • Many undiscovered ncRNA exist
  • Computational methods for discovering ncRNA are
    not mature.
  • What are the clues to non-coding genes?
  • Structure: Given a sequence, what is the
    structure into which it can fold with minimum
    energy?

12
RNA structure: Basics
  • Key: RNA is single-stranded. Think of a string
    over 4 letters: A, C, G, and U.
  • The complementary bases form pairs.
  • A <-> U, C <-> G, G <-> U
  • Base-pairing defines a secondary structure. The
    base-pairing is usually non-crossing.

13
RNA structure: Basics
  • Key: RNA is single-stranded. Think of a string
    over 4 letters: A, C, G, and U.
  • The complementary bases form pairs.
  • Base-pairing defines a secondary structure. The
    base-pairing is usually non-crossing.

14
RNA structure: pseudoknots
  • Sometimes, unpaired bases in loops form crossing
    pairs. These are pseudoknots.


15
De novo RNA structure prediction
  • Any set of non-crossing base-pairs defines a
    secondary structure.
  • Abstract question:
  • Given an RNA string, find a structure that
    maximizes the number of non-crossing base-pairs
  • Incorporate the true energetics of folding
  • Incorporate pseudoknots

16
(No Transcript)
17
De novo RNA prediction: a combinatorial formulation
A C G A U U
  • Input
  • A string over {A, C, G, U}
  • A pairs with U, C pairs with G, G pairs with U
  • Output
  • A subset of possible base-pairings of maximum
    size (score) such that
  • No two base-pairs intersect
  • Each nucleotide pairs with at most one other
    nucleotide
  • How can we compute this set efficiently?
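The output constraints above can be checked mechanically. The following is a minimal Python sketch, assuming 1-based positions and base-pairs given as (i, j) tuples with i < j. The names `PAIRS` and `is_valid_structure` are illustrative and do not come from the slides.

```python
# Allowed pairings from the slide: A-U, C-G, and the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def is_valid_structure(seq, pairs):
    """Return True if `pairs` (1-based (i, j) tuples, i < j) is a legal structure."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(seq)):
            return False
        if (seq[i - 1], seq[j - 1]) not in PAIRS:
            return False  # the two bases are not complementary
        if i in used or j in used:
            return False  # each nucleotide pairs with at most one other
        used.update((i, j))
    # No two base-pairs may intersect: i < k < j < l is a crossing.
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True


print(is_valid_structure("ACGAUU", [(1, 6), (2, 3)]))  # nested pairs
print(is_valid_structure("ACGAUU", [(1, 5), (4, 6)]))  # crossing pairs
```

A checker like this only verifies a proposed structure. Finding the largest valid set is the optimization problem the question asks about.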

18
RNA structure
  • Nussinov's algorithm
  • Score B for every base-pair. No penalty for
    loops. No pseudoknots.
  • Let W(i,j) be the score of the best structure of
    the subsequence from i to j.

for i = n down to 1
  for j = i+1 to n
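The recurrence evaluated inside this double loop did not survive in the transcript. One standard form of Nussinov's recurrence is sketched below as a reconstruction rather than the slide's exact formula. It is consistent with W and B as defined above and with the four traceback cases used later in print_RNA. It assumes r_i denotes the i-th base and that the pairing case is only allowed when r_i and r_j are complementary.

```latex
W(i,j) = \max
\begin{cases}
  W(i+1,\,j-1) + B(r_i, r_j) & \text{($i$ pairs with $j$)} \\
  W(i+1,\,j)                 & \text{($i$ is unpaired)} \\
  W(i,\,j-1)                 & \text{($j$ is unpaired)} \\
  \max_{i < k < j} \bigl[\, W(i,k) + W(k+1,j) \,\bigr] & \text{(bifurcation)}
\end{cases}
\qquad
W(i,j) = 0 \ \text{ for } j \le i .
```

Filling i from n down to 1 and j from i+1 up to n guarantees that every term on the right-hand side has already been computed.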

19
Obtaining RNA structure
for i = n down to 1
  for j = i+1 to n

(Figure: dynamic-programming table indexed by i and j)
20
Obtaining RNA Structure
Procedure print_RNA(i,j)
  if S(i,j) = '/'
    print (i,j)
    print_RNA(i+1,j-1)
  else if S(i,j) = '-'
    print_RNA(i+1,j)
  else if S(i,j) = '|'
    print_RNA(i,j-1)
  else
    k = S(i,j)
    print_RNA(i,k)
    print_RNA(k+1,j)
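The fill and traceback steps can be sketched together in Python. This is a minimal illustration under stated assumptions, not the lecture's code: every base-pair scores B = 1, indices are 1-based, and the table S stores '/' (i pairs with j), '-' (i unpaired), an assumed marker '|' (j unpaired), or a split position k. The function name `fold` is hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def fold(seq, B=1):
    """Fill W and the traceback table S, then return (score, sorted base-pairs)."""
    n = len(seq)
    # 1-based indices; W[i][j] stays 0 whenever j <= i.
    W = [[0] * (n + 2) for _ in range(n + 2)]
    S = [[None] * (n + 2) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            best, mark = W[i + 1][j], "-"  # i unpaired
            if W[i][j - 1] > best:
                best, mark = W[i][j - 1], "|"  # j unpaired
            if (seq[i - 1], seq[j - 1]) in PAIRS and W[i + 1][j - 1] + B > best:
                best, mark = W[i + 1][j - 1] + B, "/"  # i pairs with j
            for k in range(i + 1, j - 1):  # bifurcation into [i,k] and [k+1,j]
                if W[i][k] + W[k + 1][j] > best:
                    best, mark = W[i][k] + W[k + 1][j], k
            W[i][j], S[i][j] = best, mark

    pairs = []

    def print_rna(i, j):
        """Follow the marks in S, collecting base-pairs instead of printing them."""
        if j <= i:
            return
        mark = S[i][j]
        if mark == "/":
            pairs.append((i, j))
            print_rna(i + 1, j - 1)
        elif mark == "-":
            print_rna(i + 1, j)
        elif mark == "|":
            print_rna(i, j - 1)
        else:
            print_rna(i, mark)
            print_rna(mark + 1, j)

    print_rna(1, n)
    return W[1][n], sorted(pairs)


print(fold("ACGAUU"))
```

The fill takes O(n^3) time because of the inner loop over k, which is the cost the later slides on candidate lists aim to reduce.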
21
RNA structure example
(Table: dynamic-programming matrix with columns i = 1..6 and rows j = 2..6; entries not in the transcript)
22
RNA Structure Details
23
Base-pairing and loops
  • Base-pairs arise from complementary nucleotides
  • Single-stranded
  • A stack is formed when 2 base-pairs are contiguous
  • Loops arise when there are unpaired bases.
  • They are characterized by the number of
    base-pairs that close them.
  • Hairpin: closed by 1 base-pair
  • Bulge/interior loops: closed by 2 base-pairs
  • Multi-loops: closed by k base-pairs

24
Scoring Loops, multi-loops
  • Zuker-Turner Energy Rules
  • http://www.bioinfo.rpi.edu/zukerm/rna/energy/node2.html
  • Stacking Energies
  • Energy for Bulges and Interior Loops
  • Energy for Multi-loops

25
Improving the efficiency
  • Computing the structure of a 2000nt sequence takes
    about 20s (2GHz, 2Gb)
  • (Wexler et al., 2006)
  • How much time would it take to search 10000
    candidate regions?
  • What would be the impact of an O(n) speedup on
    this search?
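The two questions can be answered with back-of-the-envelope arithmetic. The sketch below assumes every candidate region is about 2000nt long and that an O(n) speedup divides the running time by exactly n. That second assumption is an idealization that ignores constant factors. The function names are illustrative.

```python
def total_search_seconds(seconds_per_fold, regions):
    """Total time if every candidate region costs one full fold."""
    return seconds_per_fold * regions


def with_linear_speedup(seconds, n):
    """Idealized effect of a factor-n speedup (constants ignored)."""
    return seconds / n


total = total_search_seconds(20, 10_000)
print(total, "s, about", round(total / 3600, 1), "hours")
print(with_linear_speedup(total, 2000), "s after an idealized factor-n speedup")
```

Under these assumptions the search takes 200,000 s, or roughly 55.6 hours, and a factor-n speedup with n = 2000 would bring it down to about 100 s.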

26
A triangle inequality
  • After the computations are done,
  • A triangle inequality is satisfied
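The inequality itself is missing from the transcript. For a score W that is maximized, as defined earlier, the property meant here is presumably the following (a reconstruction, not the slide's own formula):

```latex
W(i,j) \;\ge\; W(i,k) + W(k+1,j) \qquad \text{for all } i \le k < j .
```

It holds because an optimal structure on [i,k] and an optimal structure on [k+1,j] use disjoint positions, so their union is a valid non-crossing structure on [i,j] and W(i,j) can be no smaller than its score.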

27
Towards more efficient algorithms
  • Consider the main loop again
  • We only need to branch, when i forms a base-pair.
    Therefore, w.l.o.g

28
A candidate list approach
  • Note that if C is the maximum size of the
    candidate list, then the running time is O(Cn^2)
  • How can we reduce the size of the candidate list?
  • Consider only those k s.t. b_k is complementary to
    b_i

We will consider cases when k is NOT optimal,
using the triangle inequality
29
Candidate list
  • Consider i,j,k s.t.
  • W(i,j) = B(r_i,r_k) + W(i+1,k-1) + W(k+1,j)   (1)
  • Claim: For all j' > j,
  • j is not a candidate in computing W(i,j')

30
Proof
31
The candidate list algorithm
  • For all i = n down to 1
  • For all j = i+1 to n
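The body of this loop is not in the transcript. The sketch below shows one way a candidate-list scheme can work for the simple base-pair-counting score (B = 1). It is written in the spirit of the sparsification described on the preceding slides and is not the slide's exact algorithm. Position k enters the candidate list of i only when pairing i with k is strictly better than every other way of folding [i,k]. The names `nussinov_full` and `nussinov_candidates` are illustrative.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def nussinov_full(seq):
    """Reference O(n^3) fill: try every partner k for i."""
    n = len(seq)
    W = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i + 1, n + 1):
            best = W[i + 1][j]  # i unpaired
            for k in range(i + 1, j + 1):
                if (seq[i - 1], seq[k - 1]) in PAIRS:
                    best = max(best, 1 + W[i + 1][k - 1] + W[k + 1][j])
            W[i][j] = best
    return W[1][n]


def nussinov_candidates(seq):
    """Same score, but i only branches on its candidate partners."""
    n = len(seq)
    W = [[0] * (n + 2) for _ in range(n + 2)]
    largest_list = 0
    for i in range(n, 0, -1):
        candidates = []  # (k, score of the best structure on [i,k] with i-k paired)
        for j in range(i + 1, n + 1):
            best = W[i + 1][j]  # i unpaired
            for k, closed_ik in candidates:
                best = max(best, closed_ik + W[k + 1][j])
            if (seq[i - 1], seq[j - 1]) in PAIRS:
                closed_ij = 1 + W[i + 1][j - 1]
                if closed_ij > best:  # strictly better: j becomes a candidate
                    candidates.append((j, closed_ij))
                    best = closed_ij
            W[i][j] = best
        largest_list = max(largest_list, len(candidates))
    return W[1][n], largest_list


for s in ("ACGAUU", "GGGAAAUCC", "GCGCUUCGCCGCGCGCC"):
    print(s, nussinov_full(s), nussinov_candidates(s))
```

Skipping non-candidates is safe because of the triangle inequality: if pairing i with k is not strictly better than some decomposition of [i,k], then extending that decomposition to j scores at least as well. The second return value is the largest candidate list seen, which plays the role of C in the O(Cn^2) bound.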

32
How large is the candidate list?
  • Here, the answer depends very much on the energy
    computations.
  • If we simply count the number of base-pairs, the
    candidate list should be large (probably)
  • For more realistic functions, it is very small
    (experiments with n up to 1000 showed C ≈ 6)

33
Empirical estimates of C
A Study of Accessible Motifs and RNA Folding
Complexity. Ydo Wexler, Chaya Zilberstein, and
Michal Ziv-Ukelson. LNBI 3909, p. 473 ff.
34
A thermodynamic argument
  • It is conjectured, and empirically shown that RNA
    has the polymer-zeta property
  • Probability that a string of length m folds into
    an RNA structure is b/m^c for constants b, c > 0
  • Note that if c > 1, the expected number of indices j
    that base-pair with i is O(1)
  • In practice c ≈ 1.5
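The O(1) claim follows from summing the bound over all span lengths: the expected number of partners of i is at most the sum of b/m^c over m, and that series converges only when c > 1. A small numerical sketch illustrates this, taking b = 1 as an arbitrary assumption. The function name is illustrative.

```python
def expected_partners(b, c, max_len):
    """Sum of b / m**c over span lengths m = 1..max_len."""
    return sum(b / m ** c for m in range(1, max_len + 1))


for n in (100, 10_000, 100_000):
    bounded = expected_partners(1.0, 1.5, n)  # c = 1.5: partial sums level off
    growing = expected_partners(1.0, 1.0, n)  # c = 1.0: partial sums keep growing
    print(n, round(bounded, 3), round(growing, 3))
```

For c = 1.5 the partial sums stay below the limit zeta(1.5), which is about 2.612, whatever the sequence length. For c = 1 they grow like log n, so the candidate list would no longer be bounded by a constant.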

35
End of Lecture
36
Incorporating pseudoknots in structure prediction
  • Q: Determine the optimal structure when pseudoknots
    are allowed.
  • Pseudoknots are only loosely defined.
  • If any interleaving is allowed, then simply
    select a structure in which a max number of
    nucleotides can be paired.
  • Under some restrictive notions, the structure
    problem becomes NP-hard.

37
Akutsu's simple pseudoknots
  • A region [i0,k0] forms a simple pseudoknot if
    there exist positions j0', j0 s.t.
  • Each (i,j) ∈ M satisfies either
  • i0 ≤ i < j0' ≤ j < j0 or j0' ≤ i < j0 ≤ j ≤ k0
  • Within each of the two groups, there is no
    interleaving:
  • i < i' < j0' or j0' ≤ i < i' implies j > j'.
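The definition can be tested by brute force over the two split positions. The sketch below assumes the split points satisfy i0 < j0' < j0 ≤ k0, that pairs are 1-based (i, j) tuples with i < j, and that each position occurs in at most one pair. It tries every (j0', j0) and is meant to illustrate the definition, not to be efficient. The function names are hypothetical.

```python
def _nested(group):
    """True if, ordered by left end, the right ends strictly decrease."""
    group = sorted(group)
    return all(a[1] > b[1] for a, b in zip(group, group[1:]))


def is_simple_pseudoknot(pairs, i0, k0):
    """Check whether `pairs` on the region [i0, k0] form a simple pseudoknot."""
    ends = [x for pair in pairs for x in pair]
    if len(set(ends)) != len(ends):
        return False  # some position is used by two base-pairs
    for j0p in range(i0 + 1, k0 + 1):  # candidate j0'
        for j0 in range(j0p + 1, k0 + 1):  # candidate j0
            left = [(i, j) for i, j in pairs if i0 <= i < j0p <= j < j0]
            right = [(i, j) for i, j in pairs if j0p <= i < j0 <= j <= k0]
            if len(left) + len(right) != len(pairs):
                continue  # some pair fits neither group for this split
            if _nested(left) and _nested(right):
                return True
    return False


# Two stems that cross each other but are nested internally.
print(is_simple_pseudoknot([(1, 6), (2, 5), (3, 9), (4, 8)], 1, 9))
# Three mutually crossing pairs cannot be split into two nested groups.
print(is_simple_pseudoknot([(1, 4), (2, 5), (3, 6)], 1, 6))
```

In the first example the split j0' = 3, j0 = 7 places (1,6) and (2,5) in the first group and (3,9) and (4,8) in the second.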

38
Simple pseudoknots
A
B
C
  • A region [i0,k0] forms a simple pseudoknot if it
    can be partitioned into 3 regions s.t. every
    base-pair has the following properties
  • It either connects (A,B) , or (B,C)
  • Two base-pairs within (A,B) (or, (B,C)) do not
    interleave

39
Simple pseudoknotted structure
M
M2
M1
  • Collection of simple pseudoknots (Mi) and other
    basepairs M.
  • None of the simple pseudoknot regions overlap.
  • M is a secondary structure without pseudoknots
    for the sequence obtained by excising all
    pseudoknotted regions.
  • Can we identify a sub-structure for the
    pseudo-knots?

40
Akutsu's algorithm (Main idea)
  • Rotate the sequence so that it forms two loops.
  • Each allowed base-pair is a horizontal line in
    exactly one of the two loops. The horizontal
    lines in the two loops are non-crossing.
  • All base-pairs can be ordered.

41
Main idea
j0
k0
i0
j0
  • The base-pairs have a total order that a d.p. can
    exploit.
  • (i,j) < (i',j') if one of the following holds
  • i < i' < j < j'
  • i < i' < j' < j < j0
  • j0' < i' < i < j < j'

42
Frontiers of D.P.
j0
k0
i
j
k
i0
j0
  • We need to construct an increasing path of
    base-pairs.
  • Consider a frontier triple (i,j,k),
    corresponding to the region (i0,i) and (j,k)
  • We have the following cases
  • (i,j) form a base-pair
  • (j,k) form a base-pair
  • Neither (i,j) nor (j,k) form a base-pair

43
Ordering frontiers
j0
k0
i
j
k
i0
j0
  • frontier triple (i,j,k) > (i',j',k') if and
    only if
  • i' < i
  • j < j'
  • k' < k

44
Ordering Frontiers
  • (i,j,k) > (i',j',k') if and only if

45
Recurrences
  • SL(i,j,k) is the optimum score of a frontier
    (i,j,k) assuming that
  • i < j0' < j < j0 < k
  • (i,j) form a base-pair
  • SR(i,j,k) is the optimum score of a frontier
    (i,j,k) assuming that i < j0' < j < j0 < k, and
  • (j,k) form a base-pair
  • SM(i,j,k) is the optimum score of a frontier
    (i,j,k) assuming that i < j0' < j < j0 < k, and
  • Neither (i,j) nor (j,k) form a base-pair

46
Computing SL
47
Putting it all together
48
Computing optimal pseudo-knotted structures
  • Let S(i,j) be the optimal score of a simple
    pseudoknotted structure.
  • Either (i,j) is a simple pseudoknot:
  • S(i,j) = Spseudo(i,j)
  • Or not:
  • S(i,j) = max { v(a_i,a_j) + S(i+1,j-1),
  • max over k of S(i,k-1) + S(k,j) }

49
Time Complexity
  • We compute Spseudo(i,j) for all (i,j). Each
    computation is O(n^3). Total time?
  • For each i0, perform the following computation
  • Spseudo(i0,k0) = max over i0 ≤ i < j ≤ k0 of { SL(i,j,k0), ... }
  • Total time O(n^4). Can you do better?

50
Recursive pseudoknots
  • Each loop of the RNA structure is a recursive
    pseudoknotted RNA structure.
  • The optimal recursive pseudoknotted RNA structure
    problem can be solved in O(n^5) time.

51
Open Questions
  • There should be a direct generalization of simple
    pseudoknots to the following structure.
  • Rivas and Eddy do consider such a generalization,
    but a systematic treatment is missing.
  • Q: Given a pseudoknotted structure, is it an
    Akutsu simple pseudoknotted structure?
  • Linear time algorithm was devised recently for
    this problem.

52
Structure as a proxy for ncRNA
  • Is any genomic region with an energetically
    favorable fold a candidate ncRNA?
  • Rivas and Eddy show otherwise.
  • They use
  • LOD score: L = log [ Pr(s | RNA) / Pr(s | Null) ]
  • Z-score: Z = (G − μ) / σ
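The Z-score compares the score G of the real sequence with the mean and standard deviation of the same score over shuffled versions of that sequence. The sketch below shows the computation under stated assumptions: `score` stands for any hypothetical function mapping a sequence to a number (for example a folding energy), the shuffle is a plain mononucleotide shuffle, and the population standard deviation is used. None of these choices is confirmed by the slides.

```python
import random
import statistics


def z_score(g, background):
    """Z = (G - mean) / standard deviation of the background scores."""
    mu = statistics.mean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (g - mu) / sigma


def shuffled_scores(seq, score, n_shuffles=100, seed=0):
    """Score n_shuffles random permutations of seq (base composition is preserved)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        letters = list(seq)
        rng.shuffle(letters)
        scores.append(score("".join(letters)))
    return scores


def toy_score(s):
    """Toy stand-in for a folding score: count of 'GC' dinucleotides."""
    return s.count("GC")


seq = "GCGCGCAUAUAUGCGCGC"
background = shuffled_scores(seq, toy_score, n_shuffles=200)
print(z_score(toy_score(seq), background))
```

With a real folding energy in place of `toy_score`, lower energies are better, so a strongly negative Z indicates a fold more stable than expected for that base composition.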

53
LOD scores for putative ncRNA
  • A region of C. elegans containing two tRNAs is
    chosen. The positions of the tRNAs are indicated by
    stars.
  • The true positions and the LOD-score correspond.
  • Stronger results in M. jannaschii (AT 70%)
  • Weak results in E. coli (GC 50%)

54
Correlation with GC content
  • GC content detector has results similar to
    structure prediction!

55
Significance of detected regions
  • Significance was tested using Z-scores on
    shuffled sequences.
  • Earlier tests shuffled sequences using a large
    window.
  • Shuffling high scoring windows led to lower
    Z-scores.

56
Testing chimeric sequence
  • Test on a chimeric sequence: a real tRNA was
    embedded in a 2000bp random sequence with similar
    GC-composition.

57
Increasing window size
  • tRNA Z-score computation with a larger window
    (85nt).

58
  • For very strong signals, a Z-score > 4 may still
    be significant.

59
Z-score distribution
  • Z-scores of 415 tRNA genes. 98 have Z-score
    lower than 4.

60
Structure
  • We hoped that RNA structure computation would
    give us a clue into detecting novel ncRNA
  • Turns out that random DNA can fold into a low
    energy structure.