Prediction of RNA

About This Presentation

Description:

Dot matrix analysis, Global Energy Minimization Methods, Mfold, VIENNA RNA package ... Jaeger et al., 1989, 1990, Zuker, 1994 MFOLD ... – PowerPoint PPT presentation

Slides: 52

Provided by: Ole106

Transcript and Presenter's Notes

1
Prediction of RNA secondary structure

Features of RNA secondary structure, presentations
Dot matrix analysis, Global Energy Minimization Methods, Mfold, VIENNA RNA package
Dynamic Programming methods and Sequence Covariation methods

Computational analysis of microRNAs

Small RNA world
Computational identification of miRNAs
Prediction of miRNA targets

2
Prediction of RNA secondary structure
3
RNA follows the same basic rules of base-pairing as DNA, but short single-stranded RNA molecules can take a variety of 3D shapes (tRNA, ribozymes, splicing, etc.)

Information for self-assembly
the genetic code specifying the order of amino acids
control of the beginnings and ends of coding sequences
splicing signals
determination of stability and relative transcriptional level
regulation of gene expression

http://www.rnabase.org
4

What is RNA secondary structure?
RNA secondary structure is similar to an alignment of protein and nucleic acid sequences, except that the sequence folds back on itself and complementary bases pair rather than identical or similar bases.
Also, an alignment of 2 or more biosequences is a statement about an inferred evolutionary history. In contrast, with RNA it is conservation of structure, not necessarily of sequence, that is most important.

Main Points
RNA structure is dynamic in solution, i.e. constantly fluctuating between different folded states
There are many alternative structures that are nearly identical in energy (both predicted and actual)
Highly sensitive to solution conditions, e.g. salt and temperature
Highly sensitive to protein binding
Tertiary structure (e.g. pseudoknots) is important
The biologically important structure may not have the lowest predicted free energy, but it should be one of the lower ones, so sub-optimal structures must be examined
Three-dimensional structure is difficult to determine due to the flexibility of the molecule
Most analysis of correctness must therefore rely on phylogenetically determined models
Phylogenetic models look for invariant base pairs, but may not identify all unique structures
Structural information also comes from nuclease digestion studies and sometimes crosslinking

6
The complementary bases, C-G and A-U form stable
base pairs with each other through the creation
of hydrogen bonds between donor and acceptor
sites on the bases. These are called
Watson-Crick (W-C) base pairs. In addition, we
consider the weaker G-U wobble pair, where the
bases bond in a skewed fashion. All of these are
called canonical base pairs. Other base pairs
occur, some of which are stable. They are called
non-canonical base pairs.
Most common
Biologically informative
Difficult to compare
7
A computer-predicted folding of Bacillus subtilis RNase P RNA
A circular representation of the B. subtilis folding. The nucleotides are stretched out uniformly along the circumference of a circle, and the base pairs are represented by circular arcs that link paired bases and meet the circle at right angles.
The triangular image in the figure is referred to as an RNA structure dot plot
Plot sequence vs. reverse complement
Possible stems run perpendicular to the axis of symmetry
8
MOUNTAINS
Less common
Used in the RNA literature
Much easier to see similarity than with squiggles
Good for revealing the pattern of nested stems
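A mountain plot can be computed directly from a list of base pairs. The sketch below is only an illustration: it assumes the height at position k is the number of base pairs (i, j) with i < k < j, and the function names and the 0-based pair list are hypothetical, not taken from any package named in these slides.

```python
def mountain_heights(pairs, n):
    """Height at each position k: number of pairs (i, j) with i < k < j.

    Assumed definition for illustration; pairs are 0-based with i < j.
    """
    return [sum(1 for i, j in pairs if i < k < j) for k in range(n)]


def render_mountain(heights):
    """Draw the mountain as rows of '#', highest level first."""
    top = max(heights, default=0)
    rows = ["".join("#" if h >= level else " " for h in heights)
            for level in range(top, 0, -1)]
    return "\n".join(rows)


# Hypothetical hairpin: three nested pairs closing a three-base loop.
pairs = [(0, 8), (1, 7), (2, 6)]
heights = mountain_heights(pairs, 9)
print(heights)
print(render_mountain(heights))
```

Nested stems appear as slopes, and the hairpin loop appears as the plateau at the top of the peak.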
9
(No Transcript)
10
Single-stranded
Double-stranded
Stem and loop/hairpin loop
Bulge loop
Interior loop
Junctions or multi-loops

Interactions Among Secondary Structures
Pseudoknots
Kissing Hairpins
Hairpin-bulge contacts
11

RNA structure prediction methods
self-complementary regions (Dot Plot Analysis)
most energetically stable molecules: Base-Pair Maximization, Free Energy Methods
conserved patterns of base-pairing during evolution: Covariance Models

Energy minimization: a dynamic programming approach. It does not require prior sequence alignment, but it requires estimation of the energy terms contributing to secondary structure.
Comparative sequence analysis: uses sequence alignment to find conserved residues and covariant base pairs. Most trusted.
12
Development of RNA prediction

Tinoco et al., 1971: extrapolation from studies on small molecules
Pipas and McMahon, 1975: computer programs estimating all possible structures in tRNAs
Nussinov and Jacobson, 1980: precise and efficient algorithm for structure predictions (two scoring matrices approach)
Zuker and Stiegler, 1981: dynamic programming algorithm
Jaeger et al., 1989, 1990; Zuker, 1994: MFOLD
Set of possible structures within a given energy range
Indication of reliability
Uses covariance information
Wuchty et al., 1999: the partition function method

13
Self-complementary regions in RNA sequences
Dot matrix method: search for self-complementary regions (long window, many matches)
Possible stems run perpendicular to the axis of symmetry
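The dot matrix search for self-complementary regions can be sketched as follows. This is a minimal illustration under stated assumptions: the window size, match threshold, minimum loop length, and the inclusion of G-U wobble pairs are choices made here for the example, and the function name is hypothetical.

```python
# Watson-Crick pairs plus the G-U wobble pair (assumed scoring set).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def self_complement_dots(seq, window=4, min_matches=4, min_loop=3):
    """Return dots (i, j), 0-based, where the window starting at i pairs
    antiparallel with the window ending at j (i+k pairs with j-k)."""
    n = len(seq)
    dots = []
    for i in range(n - window + 1):
        # The two windows must not overlap and must leave room for a loop.
        for j in range(i + 2 * window - 1 + min_loop, n):
            matches = sum((seq[i + k], seq[j - k]) in PAIRS
                          for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


print(self_complement_dots("GGGGAAAACCCC"))
```

Because base i+k is compared with base j-k, a run of dots for a long stem lies on a diagonal perpendicular to the axis of symmetry of the plot.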
14

1. StemLoop
StemLoop finds stems (inverted repeats) within a sequence. You specify the minimum stem length, minimum and maximum loop sizes, and the minimum number of bonds per stem. All stems or only the best stems can be displayed on your screen or written into a file.
2. DotPlot
DotPlot makes a dot-plot with the output file from Compare or StemLoop.

Calculates score over a window
Finds stems over a threshold score
Minimum/maximum loop size
Sort by position or score
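The kind of search described above (minimum stem length, loop size limits, minimum bonds per stem) can be sketched in a few lines. This is not the StemLoop program itself: the function name, the default parameters, and the bond counts (3 for G-C, 2 for A-U, 1 for G-U) are assumptions made for illustration, and only perfect stems without bulges are found.

```python
# Assumed hydrogen-bond counts per pair type (illustrative values).
BONDS = {("G", "C"): 3, ("C", "G"): 3, ("A", "U"): 2, ("U", "A"): 2,
         ("G", "U"): 1, ("U", "G"): 1}


def find_stems(seq, min_stem=3, min_loop=3, max_loop=8, min_bonds=6):
    """Return stems as (start5, end3, stem_length, loop_size, bonds),
    0-based and sorted by bond count, best first."""
    n = len(seq)
    stems = []
    for i in range(n):                        # i: last base of the 5' arm
        for loop in range(min_loop, max_loop + 1):
            j = i + loop + 1                  # j: first base of the 3' arm
            if j >= n:
                break
            length = bonds = 0
            # Extend the stem outward from the loop while bases still pair.
            while i - length >= 0 and j + length < n:
                b = BONDS.get((seq[i - length], seq[j + length]))
                if b is None:
                    break
                bonds += b
                length += 1
            if length >= min_stem and bonds >= min_bonds:
                stems.append((i - length + 1, j + length - 1,
                              length, loop, bonds))
    return sorted(stems, key=lambda s: -s[4])


for stem in find_stems("GGGGAAAACCCC"):
    print(stem)
```

Sorting by the bond count mirrors the "sort by score" option; sorting on the first tuple field instead would give the "sort by position" order.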
15
Inverted repeats only
16
RNA Folding by Energy Minimization
The quickest and easiest route to RNA structure prediction is through the use of simple energy rules. One way is to assign an energy to each base pair in a secondary structure. That is, there is a function e such that e(r_i, r_j) is the energy of a base pair. The energy, E(S), of the entire structure S is then given by

E(S) = \sum_{(i,j) \in S} e(r_i, r_j)

Reasonable values of e at 37 °C are -3, -2 and -1 kcal/mole for GC, AU and GU base pairs, respectively. Unfortunately, such simple-minded rules are insufficient to capture the destabilizing effects of various loops, or the nearest-neighbor interactions in helices and loops. More sophistication is required.
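As a small worked example of this simple energy rule, the sketch below sums the per-pair energies of a given structure using the -3, -2 and -1 kcal/mole values quoted above. The function name and the representation of a structure as a list of 0-based (i, j) pairs are assumptions for illustration.

```python
# Simple per-pair energies (kcal/mole) from the text: GC -3, AU -2, GU -1.
PAIR_ENERGY = {frozenset("GC"): -3.0, frozenset("AU"): -2.0, frozenset("GU"): -1.0}


def structure_energy(seq, pairs):
    """Sum e(r_i, r_j) over all base pairs (i, j) of the structure."""
    total = 0.0
    for i, j in pairs:
        key = frozenset((seq[i], seq[j]))
        if key not in PAIR_ENERGY:
            raise ValueError(f"bases {seq[i]}-{seq[j]} at ({i}, {j}) cannot pair")
        total += PAIR_ENERGY[key]
    return total


# Hypothetical hairpin: two G-C pairs and one G-U pair.
print(structure_energy("GGGAAAUCC", [(0, 8), (1, 7), (2, 6)]))
```

Note that this sum ignores loops and stacking entirely, which is exactly the limitation the text points out.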

The energy associated with any position in the structure is influenced only by local sequence and structure
The structure is assumed to be formed by folding that does not produce knots

17
GLOBAL Energy-Minimization Methods (Minimum Free Energy, Maximum Enthalpy)
Stabilization energy (kcal/mole)
Total free energy optimality criterion
Boltzmann function-based optimality criteria
Loops/bulges introduce positive free energy and are destabilizing.
How is stability measured?
(A) Base pairs: stability introduced by double-stranded regions (dsRNA). Energies of paired bases are stored in a look-up table; these vary with temperature. The energies of all base pairs in a structure are summed; this sum is the cost of the structure.
(B) Stacking energies (DH Turner, Rochester): stacking energies are energies added by surrounding bases.
(C) Loops (loop destabilizing energies): instability introduced by single-stranded (unpaired) loops; all enthalpic.
(D) Branches, multibranches
18
-each base is compared to every other base (similar to dot matrix)
-energy is estimated by nearest-neighbor rule
-complementary regions are evaluated by dynamic programming algorithm
Energies are determined empirically

Energy scoring, base pairing (kcal/mol)
G-C  -3
A-U  -2
G-U  -1

Energy scoring, loop penalties (kcal/mol)
Size  Internal  Bulge  Hairpin
1     -         3.9    -
2     4.1       3.1    -
3     4.5       3.5    4.5
4     4.9       4.2    5.5
5     5.3       4.8    4.9

Stacking energies for base pairs
19
Base-pair stacking
Favorable energies come from base-pair stacking, NOT from formation of base pairs
Unpaired bases make hydrogen bonds with water, therefore there is no net change when they pair
Favorable interactions come from electronic interactions between stacked bases
Base-pair stacking is the ONLY favorable energy term in RNA folding
20
Base comparisons
Free energy calculations
Stacking energies for base pairs (kcal/mole, 37 °C)
21
Matches/Mismatches
Mismatches get some favorable energy from stacking even if not hydrogen-bonded, for instance a mismatch next to an AU pair:
5' AX 3'
3' UY 5'
22
Algorithm
RNA folding is implicitly an N^4 algorithm
N^2 dynamic programming to find the stems
N^2 dynamic programming to find the best combination
The Zuker algorithm is N^3 due to approximations in searching for lopsided internal loops
Note that very asymmetric internal loops will not be found with the default settings
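To make the dynamic programming idea concrete, here is a minimal N^3 sketch. It is a Nussinov-style recursion that minimizes the sum of simple per-pair energies (GC -3, AU -2, GU -1 kcal/mole), not the Zuker algorithm: it has no stacking, bulge, or loop terms. The function names and the minimum hairpin loop of 3 bases are assumptions for illustration.

```python
PAIR_ENERGY = {frozenset("GC"): -3.0, frozenset("AU"): -2.0, frozenset("GU"): -1.0}


def pair_energy(a, b):
    """Energy of pairing bases a and b, or None if they cannot pair."""
    return PAIR_ENERGY.get(frozenset((a, b)))


def fill_matrix(seq, min_loop=3):
    """W[i][j] = minimum energy of any pairing of seq[i..j] (inclusive)."""
    n = len(seq)
    W = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = W[i][j - 1]                      # case: j is unpaired
            for k in range(i, j - min_loop):        # case: j pairs with k
                e = pair_energy(seq[k], seq[j])
                if e is not None:
                    left = W[i][k - 1] if k > i else 0.0
                    best = min(best, left + W[k + 1][j - 1] + e)
            W[i][j] = best
    return W


def traceback(seq, W, min_loop=3):
    """Recover one optimal set of base pairs from the filled matrix."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if W[i][j] == W[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            e = pair_energy(seq[k], seq[j])
            if e is None:
                continue
            left = W[i][k - 1] if k > i else 0.0
            if W[i][j] == left + W[k + 1][j - 1] + e:
                pairs.append((k, j))
                if k > i:
                    stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return sorted(pairs)


seq = "GGGAAAUCC"
W = fill_matrix(seq)
print(W[0][len(seq) - 1], traceback(seq, W))
```

The three nested loops (span, i, k) give the cubic running time; the exact equality tests in the traceback are safe here only because the assumed energies are whole numbers.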
23
(No Transcript)
24

Dynamic Programming Methods
Use Trace-Back methods
Applied by Zuker & Stiegler using energetics as the criterion
The StemLoop program calculates the optimal energies of local stems and loops independently, based on inverted repeats (it ignores internal loops and bulges, but mFOLD does not).

[Figure: energy matrix W, with rows i = 1, 2, 3, ..., n and columns j = 1, 2, 3, ..., n; cell (i, j) and cell (i+1, j-1) are marked]
25
Zuker Algorithm
Calculation proceeds from the center towards the edges
Includes stacking, bulge, internal, and hairpin loop terms
Start from the center because the center line is the location of hairpin loops
Limited number of alternative structures!
Vienna RNA Package http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
26
ENERGY DOT PLOT
-alternative choices
Which regions are more/less predictive?
27
Reliability of secondary structure prediction
Pnum(i): the total number of energy dots in the i-th row and i-th column of the energy dot plot. It represents the number of base pairs that the i-th base can form with all other bases within the defined energy range. The lower this value, the more well-defined the local structure.
Hnum(i,j): the sum of Pnum(i) and Pnum(j) less 1, i.e. the total number of dots in the i-th row and j-th column. The lower this number, the more well-determined the double-stranded region.
Ssum(i): the number of foldings in which base i is single-stranded divided by m, the number of foldings. It represents the probability that base i is single-stranded: 1, probably single-stranded; 0, probably not.
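The three reliability measures defined above can be computed from simple data structures. In this sketch the energy dot plot is assumed to be a set of 0-based (i, j) pairs and each folding a set of pairs; the function names and these representations are hypothetical choices for illustration, not the mfold file formats.

```python
def p_num(dots, n):
    """P-num(i): number of dots in row i and column i of the energy dot plot."""
    counts = [0] * n
    for i, j in dots:
        counts[i] += 1
        counts[j] += 1
    return counts


def h_num(dots, n):
    """H-num(i, j) = P-num(i) + P-num(j) - 1 for every dot (i, j)."""
    p = p_num(dots, n)
    return {(i, j): p[i] + p[j] - 1 for i, j in dots}


def s_sum(foldings, n):
    """S-sum(i): fraction of the m foldings in which base i is unpaired."""
    m = len(foldings)
    return [sum(1 for f in foldings if all(i not in pair for pair in f)) / m
            for i in range(n)]


# Hypothetical dot plot and two foldings for a 9-base sequence.
dots = {(0, 8), (1, 7), (2, 6), (0, 7)}
foldings = [{(0, 8), (1, 7), (2, 6)}, {(0, 7), (1, 6)}]
print(p_num(dots, 9))
print(h_num(dots, 9))
print(s_sum(foldings, 9))
```

The "less 1" in H-num corrects for the dot (i, j) itself, which is counted once in P-num(i) and once in P-num(j).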
28
MFold
MFold/PlotFold predicts optimal and suboptimal secondary structures for an RNA or DNA molecule using the most recent energy minimization method of Zuker.
MFold calculates energy matrices that determine all optimal and suboptimal secondary structures for an RNA or DNA molecule. The program writes these energy matrices to an output file. A companion program, PlotFold, reads this output file and displays a representative set of optimal and suboptimal secondary structures for the molecule within any increment of the computed minimum free energy you choose. You can choose any of several different graphic representations for displaying the secondary structures in PlotFold.
P-Num Plot
This plot shows the amount of variability in pairing at each position in the sequence in all predicted foldings within the increment of the optimal folding energy you specify.
Squiggles Plot
The squiggles plot is a representation similar to what you might draw by hand; that is, bonds formed between bases are drawn as chords. Bases are shown participating in stems, as well as in hairpin, bulge, interior, and multibranched loops.
29
Lower-left to upper-right diagonals: free energy encoded by colors (dark is most optimal).
Note that some short-cut algorithms will not explore all possible structures but instead will ignore the 'blank' areas in the biplot.
Once structures are predicted they can be compared using a Structure Dot Plot.
Structure plots summarize the commonalities between two predicted structures (in this case the top two structures).
http://www.bioinfo.rpi.edu/zukerm/rna/
30

LIMITATION
Does not compute all the structures within a given energy range of the minimum free-energy structure

31
Vienna RNA Package 1.4
http://www.tbi.univie.ac.at/ivo/RNA/

three kinds of dynamic programming algorithms for structure prediction
1-the minimum free energy algorithm of (Zuker & Stiegler 1981), which yields a single optimal structure
2-the partition function algorithm of (McCaskill 1990), which calculates base pair probabilities in the thermodynamic ensemble
3-the suboptimal folding algorithm of (Wuchty et al. 1999), which generates all suboptimal structures within a given energy range of the optimal energy.
For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree-editing (Shapiro & Zhang 1990).
Finally, an algorithm is provided to design sequences with a predefined structure (inverse folding).

RNAfold -- predict minimum energy secondary structures and pair probabilities
RNAeval -- evaluate energy of RNA secondary structures
RNAheat -- calculate the specific heat (melting curve) of an RNA sequence
RNAinverse -- inverse fold (design) sequences with predefined structure
RNAdistance -- compare secondary structures
RNApdist -- compare base pair probabilities
RNAsubopt -- complete suboptimal folding
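The partition function idea can be illustrated with a toy model. The sketch below is not the Vienna RNA Package implementation and does not use its API: it applies a McCaskill-style recursion to a simplified energy model with only per-pair energies (GC -3, AU -2, GU -1 kcal/mole, an assumption), a minimum hairpin loop of 3, and RT at 37 °C. The function name is hypothetical.

```python
import math

PAIR_ENERGY = {frozenset("GC"): -3.0, frozenset("AU"): -2.0, frozenset("GU"): -1.0}
RT = 0.0019872 * 310.15   # gas constant (kcal/mol/K) times 37 C in kelvin


def partition_function(seq, min_loop=3):
    """Z = sum over all nested structures S of exp(-E(S) / RT)."""
    n = len(seq)
    if n == 0:
        return 1.0
    # Z[i][j] starts at 1.0: short intervals admit only the open chain.
    Z = [[1.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            total = Z[i][j - 1]                     # case: j is unpaired
            for k in range(i, j - min_loop):        # case: j pairs with k
                e = PAIR_ENERGY.get(frozenset((seq[k], seq[j])))
                if e is not None:
                    left = Z[i][k - 1] if k > i else 1.0
                    total += left * math.exp(-e / RT) * Z[k + 1][j - 1]
            Z[i][j] = total
    return Z[0][n - 1]


seq = "GGGAAAUCC"
z = partition_function(seq)
# Boltzmann probability of the three-pair hairpin (energy -7 in this toy model).
print(z, math.exp(7.0 / RT) / z)
```

Dividing the Boltzmann weight of one structure by Z gives its probability in the ensemble; base pair probabilities would need an additional outside recursion that is omitted here.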
32
Minimum free energy structure and base pair
probabilities for the Sarcin loop of 23S
ribosomal RNA, as predicted by the RNAfold
program.
33
Evaluation
Biological RNAs (with important structure) are difficult to distinguish from random RNAs:
Same number and length of stems and loops
Same GC content of stems
Same predicted free energy
Biologically important structures are exceptional in lacking competing structures; this ensures that the structure will be present regardless of the net ΔG
The PNUM plot shows the number of alternative structures within an energy increment
Agrees well with phylogenetic predictions, but most effective for large molecules
34

Sequence Covariation Methods (non-independent changes)
Determined by comparing sequences among species. Joint substitutions that are seen may reflect sites paired in the structure. Improves structure prediction by Dynamic Programming Methods.
For double-stranded regions in RNA molecules, sequence changes that take place in evolution should maintain the base pairing.
Sequence changes in loops and single-stranded regions should not have such a constraint.
You are looking for sequence positions at which covariation maintains the base-pairing properties.

Seq 1 ----------------G-------------C---------
Seq 2 ----------------C-------------G---------
Seq 3 ----------------A-------------C---------
Seq 4 ----------------A-------------T---------
?
AT
CG
AC
GC
http://www.rna.icmb.utexas.edu/
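The search for covarying positions that maintain base pairing can be sketched as a scan over pairs of alignment columns. This is a simplified illustration: it assumes an ungapped RNA alignment of equal-length sequences, accepts Watson-Crick and G-U pairs, and requires at least two different pair types as evidence of compensatory change. The function name and thresholds are hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covarying_columns(alignment, min_pair_types=2):
    """Return (i, j, pair_types) for column pairs where every sequence can
    form a base pair and at least min_pair_types different pairs occur."""
    n = len(alignment[0])
    hits = []
    for i in range(n):
        for j in range(i + 1, n):
            observed = {(s[i], s[j]) for s in alignment}
            if observed <= PAIRS and len(observed) >= min_pair_types:
                hits.append((i, j, sorted(observed)))
    return hits


# Hypothetical toy alignment: columns 0 and 4 change together.
alignment = ["GAAAC",
             "CAAAG",
             "AAAAU",
             "UAAAA"]
for hit in covarying_columns(alignment):
    print(hit)
```

A column pair that is perfectly conserved (one pair type only) is consistent with pairing but gives no covariation signal, which is why it is excluded by the default threshold.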
35
Covariance: secondary structure prediction in RNA takes into account conserved patterns of base pairing
Positions of covariance are conserved matches, since they maintain the secondary structure
Computationally challenging
36
Eddy & Durbin (1994): formal covariance model

LIMITATION
slow
unsuitable for searching through large genomes
usually use information from already existing RNA secondary structure
How to discover this information?
Construct a more general model
Train the model
Discover the most likely base-paired regions

Similarity with HMMs
Mutual information content M superimposed on the information content of each sequence position in an RNA alignment
http://www.cbs.dtu.dk/%7Egorodkin/appl/slogo.html
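Mutual information between two alignment columns can be computed with the standard definition M(i,j) = sum over base pairs (x, y) of f_ij(x,y) * log2( f_ij(x,y) / (f_i(x) * f_j(y)) ). The sketch below is a minimal illustration for an ungapped alignment; the function name is hypothetical and no small-sample correction is applied.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j of an alignment."""
    m = len(alignment)
    fi = Counter(s[i] for s in alignment)            # single-column counts
    fj = Counter(s[j] for s in alignment)
    fij = Counter((s[i], s[j]) for s in alignment)   # joint counts
    return sum((c / m) * log2((c / m) / ((fi[x] / m) * (fj[y] / m)))
               for (x, y), c in fij.items())


# Hypothetical toy alignment: columns 0 and 4 covary, columns 1 and 2 are fixed.
alignment = ["GAAAC",
             "CAAAG",
             "AAAAU",
             "UAAAA"]
print(mutual_information(alignment, 0, 4))
print(mutual_information(alignment, 1, 2))
```

With four bases the value ranges from 0 bits (independent or invariant columns) up to 2 bits (perfect covariation over all four bases).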
37
(No Transcript)
38
Phylogeny-based prediction
Inference of structure from covariance or mutual information depends on having the correct alignment
Correct alignment depends on knowing the correct structures
Can only find common structures, not structures unique to a molecule
Can, in principle, detect pseudoknots
39
(No Transcript)
40
Interaction among base pairs versus
Context-free grammar
41
SCFG
Stochastic context-free grammars
42
Language
Terminal symbols: A, C, G, U
Nonterminal symbols: S0, S1, S2, S3, ...
COVE is an implementation of stochastic context-free grammar methods for RNA sequence/structure analysis
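To show how a stochastic context-free grammar can generate paired bases, here is a toy sampler. It is not COVE and not a trained model: the single-nonterminal grammar S -> x S y (a base pair) | x S (an unpaired 5' base) | loop, the rule probabilities, and the fixed GAAA loop are all assumptions chosen for illustration.

```python
import random

WC_PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")]


def sample_hairpin(rng, p_pair=0.6, p_unpaired=0.1, loop="GAAA"):
    """Sample a sequence and its base pairs from the toy grammar
    S -> x S y | x S | loop, applying rules until the loop rule fires."""
    left, right, paired_left = [], [], []
    while True:
        r = rng.random()
        if r < p_pair:                          # S -> x S y (emit a pair)
            x, y = rng.choice(WC_PAIRS)
            paired_left.append(len(left))
            left.append(x)
            right.append(y)
        elif r < p_pair + p_unpaired:           # S -> x S (unpaired base)
            left.append(rng.choice("ACGU"))
        else:                                   # S -> loop (terminate)
            break
    seq = "".join(left) + loop + "".join(reversed(right))
    n = len(seq)
    # The t-th pair emitted is the t-th outermost: its 3' base is at n-1-t.
    pairs = [(p, n - 1 - t) for t, p in enumerate(paired_left)]
    return seq, pairs


seq, pairs = sample_hairpin(random.Random(0))
print(seq, pairs)
```

The pair rule emits both partners in one step, which is what lets a context-free grammar capture the long-range, nested dependencies between paired positions that a position-by-position model cannot.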
43
CAUCAGGGAAGAUCUCUUG
44
RNA world
http://www.imb-jena.de/RNA.html
Tutorials
http://www.ambion.com/techlib/resources/linkspage.html
RNA Secondary Structure Prediction at Belozersky Institute, Moscow
http://www.genebee.msu.su/services/rna2_reduced.html
45
RNA-specifying genes

tRNAscan-SE
- identifies transfer RNA genes in genomic DNA or RNA sequences. It combines:
the specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin, 1994)
the speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991)
an implementation of an algorithm described by Pavesi and colleagues (1994), which searches for eukaryotic pol III tRNA promoters (our implementation is referred to as EufindtRNA).
tRNAscan and EufindtRNA are used as first-pass prefilters to identify "candidate" tRNA regions of the sequence. These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA prediction. In this way, tRNAscan-SE attains the best of both worlds:
- a false positive rate of less than one per 15 billion nucleotides of random sequence
- the combined sensitivities of tRNAscan and EufindtRNA (detection of 99% of true tRNAs)
- search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3, which gives a 650-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).
Published in Lowe & Eddy, Nucleic Acids Research 25: 955-964 (1997).

http://lowelab.ucsc.edu/tRNAscan-SE/
NCBI CP000030
46
Automatic detection of conserved RNA structure elements in complete RNA virus genomes
Nucleic Acids Research, 1998, Vol. 26, No. 16
A new method for detecting conserved RNA secondary structures in a family of related RNA sequences. The method is based on a combination of thermodynamic structure prediction and phylogenetic comparison. In contrast to purely phylogenetic methods, our algorithm can be used for small data sets of approximately 10 sequences, efficiently exploiting the information contained in the sequence variability.

(i) Distant groups of RNA viruses have very little or no detectable sequence homology and often very different genomic organization.
(ii) RNA viruses show an extremely high mutation rate, of the order of 10^-5 to 10^-3 mutations per nucleotide and replication.
Due to the high sequence variation, the application of classical methods of sequence analysis is, therefore, difficult or outright impossible.
The high mutation rate of RNA viruses also explains their short genomes, of less than 20,000 nt. A large number of complete genomic sequences is available in databases. The non-coding regions are most likely functionally important, since the high selection pressure acting on viral replication rates makes 'junk RNA' very unlikely.

RNA secondary structures are predicted as minimum energy structures by means of dynamic programming techniques. An efficient implementation of this algorithm is part of the Vienna RNA Package.
47
Sequences are aligned using a standard multiple alignment procedure. Secondary structures for each sequence are predicted and gaps are inserted based on the sequence alignment. The resulting aligned structures can be represented as aligned mountain plots. From the aligned structures, consistently predicted base pairs are identified. The alignment is used to identify compensatory mutations that support base pairs and inconsistent mutants that contradict pairs. This information is used to rank proposed base pairs by their credibility and to filter the original list of predicted pairs.
48
Aligned mountain representations m(k) of the RNA
secondary structure of 13 complete HCV genomes.
Peaks and plateaux in the mountain
representation correspond to hairpins and
unpaired regions in the secondary structure.
Colors indicate the number of consistent
mutations: red 1, yellow 2 and green 3 different
types of base pairs. These saturated colors
indicate that there are only compatible
sequences. Decreasing saturation of the colors
indicates an increasing number of non-compatible
sequences
Comparison of predicted minimum energy structures
in region A (around position 8000) of the HCV
genome. The lower left part of the plot shows a
conventional picture of the predicted structure.
Base pairs marked in green have non-consistent
mutations, circles indicate compensatory
mutations. The extended outer stem contains a
number of compensatory mutations supporting its
existence.
49
The TAR structure of HIV-1. Almost all predicted
base pairs are consistent with all 13 sequences,
most of them are predicted in at least 11
sequences. A large number of compensatory
mutations supports the thermodynamic
predictions. Our computed consensus structure
(lower left) matches the structure determined by
probing and phylogenetic reconstruction (4). We
display here the consensus dot plot, the
classical secondary structure and a mountain
representation. The latter is a convenient
alternative to dot plots for larger structural
motifs. Base pairs are represented by slabs
connecting the two sequence positions. The width
and color of a slab corresponds to size and color
of the corresponding dot plot entry.
50
Consensus structures of the HIV-1 RRE region from
a set of 13 sequences and from the 21 sequences
51
Primary Structure of RNA, e.g., human tRNA gene for methionine
>gi|1181147|emb|Z69292.1|HSC6TRNAM H.sapiens tRNA-Met gene
GGCCUUUUUUUUCCUUUUUUUUA
AUUUUAUUGAGACAGGGUCUCGCUAUGUUGCCUGCCUGGGUCUUCCA
AAGUGCAGUGACUACAGGGAGCUGAGCCCGGCGCCUAGCCCACCAGUGU
AUUGAUAUUUAUUUUUCUAUC CCUUGUUUUGUUUUCUGUUUGAUUCUG
GUGAUUCCUUUUUCCAAAGUGAGUUGGCAACCUGUGGUAGCCA
GCAAGUAGGCAACUGCUCGUAGGUUUUUUCUUAAAUUACGAGGUAGUCU
GUUCGGCAUCUCCUGUAAGUA GUUAAGAGUACUGUGAGACCGUGUGCU
UGGCAGAACAGCAGAGUGGCGCAGCGGAAGCGUGCUGGGCCCA
UAACCCAGAGGUCGAUGGAUCGAAACCAUCCUCUGCUAGGUCCUUUUUU
UUUUUCCCCCCCCGUCUAUUU UCCUGAGGAUCCCUUUUUUUAAGUUAC
AGUUUUUUAGGUUAAACAAUGACAAAGAAAACAAAAUGAACCC
GAGUAUUUCUUUAAUUCCAGAAUUACAAGCAUUUCCGGGAAAUAAUGUG
AAACUACAAUCUCUGCAUGUA CAAUUUUGAUUUUCAUGGACACCCAAG
UGUCAUUAAUCAAUAUUUCAUCUGUAAACAAAGCAAAUUUCUC
UUGUUUAGAGGCUAUACCACUGUUGCAGCCAGUUAUGACAGUUGUAAGU
UAACCUGCCAAGAAGGAGAAU CGUUACAUAAACUGAGUGCCAAGGGUG
GGGUGGGGUGGGAGCCCAGGAAUGGAGUUUUAUAUCUUUUGAU
ACAUAAUUCAGAAAGCACUAUUUGCCAAGUAGUUAACGCCAUCGAUUAG
GAAUUC
http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
http://lowelab.ucsc.edu/tRNAscan-SE/
6137599 n
L35894

http://www.bioinfo.rpi.edu/zukerm/rna/
http://www.bioinfo.rpi.edu/applications/mfold/old/rna/
