Title: Molecular structure prediction
1. Molecular structure prediction
2. Outline
- RNA
  - RNA folding
  - Dynamic programming for RNA secondary structure prediction
- Protein
  - Secondary Structure Prediction
  - Homology Modeling
  - Protein Threading
  - ab-initio
3. RNA Basics
- RNA bases: A, C, G, U
- Canonical base pairs
  - A-U (2 hydrogen bonds)
  - G-C (3 hydrogen bonds, more stable)
  - G-U (wobble pairing)
- Each base can pair with at most one other base.
Image: http://www.bioalgorithms.info/
4. RNA Secondary Structure
Structural elements labelled in the figure: pseudoknot, stem, interior loop, single-stranded region, bulge loop, junction (multiloop), hairpin loop.
Image: Wuchty
5. RNA secondary structure representation
Circular representation of Bacillus subtilis RNase P RNA
6. RNA secondary structure representation
Dot plot representation of the same Bacillus subtilis RNA folding. A dot is placed to represent each base pair.
7. RNA secondary structure definition
An RNA sequence is represented as $R = r_1, r_2, r_3, \ldots, r_n$ ($r_i$ is the $i$-th nucleotide). Each $r_i$ belongs to the set $\{A, C, G, U\}$. A secondary structure on $R$ is a set $S$ of ordered pairs, written as $i \cdot j$, $1 \le i < j \le n$, satisfying
8. Computing RNA secondary structure
- Working hypothesis
  - The native secondary structure of an RNA molecule is the one with the minimum free energy
- Restrictions
  - No knots: no two pairs $(r_i, r_j)$, $(r_k, r_l)$ with $i < k < j < l$
  - No close base pairs: a pair $(r_i, r_j)$ requires $j - i > 3$
  - Base pairs: A-U, C-G and G-U
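The restrictions above can be checked mechanically. The following is a minimal sketch, assuming a structure is given as a list of 1-based index pairs (i, j); the function name `is_valid_structure` and the `CANONICAL` set are illustrative names, not part of the slides.

```python
from typing import Iterable, Tuple

# Allowed base pairs (both orientations): A-U, C-G and the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_valid_structure(seq: str, pairs: Iterable[Tuple[int, int]]) -> bool:
    """Check a set of 1-based pairs (i, j), i < j, against the restrictions."""
    pairs = sorted(pairs)
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(seq)):
            return False
        if j - i <= 3:                       # no close base pairs: need j - i > 3
            return False
        if (seq[i - 1], seq[j - 1]) not in CANONICAL:
            return False
        if i in used or j in used:           # a base pairs with at most one other base
            return False
        used.update((i, j))
    # No knots: reject any two pairs with i < k < j < l.
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            if i < k < j < l:
                return False
    return True


seq = "AUACCCUGUGGUAU"
print(is_valid_structure(seq, [(1, 14), (2, 13), (3, 12), (4, 11), (5, 10)]))  # True
print(is_valid_structure(seq, [(3, 7), (4, 8)]))                               # False (knot)
```

Because the pairs are sorted by their first index, one pass over later pairs is enough to detect every crossing.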
9. Computing RNA secondary structure
- Tinoco-Uhlenbeck postulate
  - Assumption: the free energy of each base pair is independent of all the other pairs and of the loop structures
  - Consequence: the total free energy of an RNA is the sum of all of the base pair free energies
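As a small illustration of the consequence, the total energy of a structure is a plain sum over its base pairs. This sketch assumes the per-pair values quoted in the notes of this deck (-3, -2 and -1 kcal/mol for GC, AU and GU); `PAIR_ENERGY` and `total_free_energy` are illustrative names.

```python
# Per-pair free energies in kcal/mol, as quoted in the notes of this deck.
PAIR_ENERGY = {
    frozenset("GC"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}


def total_free_energy(seq, pairs):
    """Sum independent base-pair energies for 1-based pairs (i, j)."""
    return sum(PAIR_ENERGY[frozenset((seq[i - 1], seq[j - 1]))] for i, j in pairs)


seq = "AUACCCUGUGGUAU"
# Three A-U pairs and two C-G pairs: 3 * (-2) + 2 * (-3)
print(total_free_energy(seq, [(1, 14), (2, 13), (3, 12), (4, 11), (5, 10)]))  # -12.0
```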
10. Independent Base Pairs Approach
- Use solutions for smaller strings to find solutions for larger strings
- This is precisely the basic principle behind dynamic programming algorithms!
11. RNA folding: Dynamic Programming
- Notation
  - $e(r_i, r_j)$: free energy of a base pair joining $r_i$ and $r_j$
  - $B_{ij}$: secondary structure of the RNA strand from base $r_i$ to base $r_j$. Its energy is $E(B_{ij})$
  - $S(i,j)$: optimal free energy associated with segment $r_i \ldots r_j$
  - $S(i,j) = \max_{B} \, -E(B_{ij})$
12. RNA folding: Dynamic Programming
There are only four possible ways that a secondary structure of nested base pairs can be constructed on an RNA strand from position i to j:
- i is unpaired, added on to a structure for $i+1 \ldots j$
  - $S(i,j) = S(i+1,j)$
- j is unpaired, added on to a structure for $i \ldots j-1$
  - $S(i,j) = S(i,j-1)$
13. RNA folding: Dynamic Programming
- i and j are both paired, but not to each other: the structure for $i \ldots j$ adds together structures for two sub-regions, $i \ldots k$ and $k+1 \ldots j$
  - $S(i,j) = \max_{i<k<j} \bigl[ S(i,k) + S(k+1,j) \bigr]$
- i and j are paired to each other, added on to a structure for $i+1 \ldots j-1$
  - $S(i,j) = S(i+1,j-1) + e(r_i, r_j)$
14. RNA folding: Dynamic Programming
Since there are only four cases, the optimal score S(i,j) is just the maximum of the four possibilities.
To compute this efficiently, we need to make sure that the scores for the smaller sub-regions have already been calculated: dynamic programming!
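Written out as a single recurrence, the four cases combine as follows. This is a restatement of the cases on the preceding slides, assuming e is taken as a positive score (3, 2, 1) so that the maximum is the right choice.

```latex
\[
S(i,j) = \max
\begin{cases}
S(i+1,\,j) & \text{$i$ unpaired} \\
S(i,\,j-1) & \text{$j$ unpaired} \\
S(i+1,\,j-1) + e(r_i, r_j) & \text{$i$ and $j$ paired to each other} \\
\max_{i<k<j} \bigl[ S(i,k) + S(k+1,j) \bigr] & \text{$i$ and $j$ paired, but not to each other}
\end{cases}
\]
```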
15. RNA folding: Dynamic Programming
Notes
- $S(i,j) = 0$ if $j - i < 4$: close base pairs are not allowed.
- Reasonable values of e are -3, -2, and -1 kcal/mole for GC, AU and GU, respectively. In the DP procedure, we use 3, 2, 1 (or replace max with min).
- Build the upper triangular part of the DP matrix:
  - start with the diagonal, all 0
  - work outward on larger and larger regions
  - end with S(1,n)
- Traceback starts with S(1,n) and finds the optimal path that leads there.
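The fill order described in these notes can be sketched as below. This is a minimal illustration of the recurrence with the positive scores 3, 2, 1, not a full energy model; `SCORE`, `pair_score`, `fill` and the `min_sep` parameter are illustrative names, and the code uses 0-based indices internally.

```python
# Positive pair scores used in the DP (GC = 3, AU = 2, GU = 1); other pairs score 0.
SCORE = {frozenset("GC"): 3, frozenset("AU"): 2, frozenset("GU"): 1}


def pair_score(a, b):
    return SCORE.get(frozenset((a, b)), 0)


def fill(seq, min_sep=4):
    """Fill the upper triangle of S; S[i][j] stays 0 whenever j - i < min_sep."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for d in range(min_sep, n):              # work outward on larger and larger regions
        for i in range(n - d):
            j = i + d
            best = max(S[i + 1][j], S[i][j - 1])           # i unpaired / j unpaired
            e = pair_score(seq[i], seq[j])
            if e > 0:                                      # i and j paired to each other
                best = max(best, S[i + 1][j - 1] + e)
            for k in range(i + 1, j):                      # i, j paired, not to each other
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


S = fill("AUACCCUGUGGUAU")
print(S[0][13])   # S(1,14) = 12
print(S[4][10])   # S(5,11) = 6
```

The inner loop over k makes the fill cubic in the sequence length, which is acceptable for short sequences such as the 14-nucleotide example.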
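16. Initialisation: no close base pairs (rows i, columns j)
i\j  A  U  A  C  C  C  U  G  U  G  G  U  A  U
A    0  0  0  0
U    .  0  0  0  0
A    .  .  0  0  0  0
C    .  .  .  0  0  0  0
C    .  .  .  .  0  0  0  0
C    .  .  .  .  .  0  0  0  0
U    .  .  .  .  .  .  0  0  0  0
G    .  .  .  .  .  .  .  0  0  0  0
U    .  .  .  .  .  .  .  .  0  0  0  0
G    .  .  .  .  .  .  .  .  .  0  0  0  0
G    .  .  .  .  .  .  .  .  .  .  0  0  0  0
U    .  .  .  .  .  .  .  .  .  .  .  0  0  0
A    .  .  .  .  .  .  .  .  .  .  .  .  0  0
U    .  .  .  .  .  .  .  .  .  .  .  .  .  0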
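17. Propagation (rows i, columns j)
i\j  A  U  A  C  C  C  U  G  U  G  G  U  A  U
A    0  0  0  0  0
U    .  0  0  0  0  0
A    .  .  0  0  0  0  2
C    .  .  .  0  0  0  0  3
C    .  .  .  .  0  0  0  0  0
C    .  .  .  .  .  0  0  0  0  3
U    .  .  .  .  .  .  0  0  0  0  1
G    .  .  .  .  .  .  .  0  0  0  0  1
U    .  .  .  .  .  .  .  .  0  0  0  0  2
G    .  .  .  .  .  .  .  .  .  0  0  0  0  1
G    .  .  .  .  .  .  .  .  .  .  0  0  0  0
U    .  .  .  .  .  .  .  .  .  .  .  0  0  0
A    .  .  .  .  .  .  .  .  .  .  .  .  0  0
U    .  .  .  .  .  .  .  .  .  .  .  .  .  0
Example cell S(5,9), C5 and U9:
- C5 unpaired: S(6,9) = 0
- U9 unpaired: S(5,8) = 0
- C5-U9 paired: S(6,8) + e(C,U) = 0
- C5 paired, U9 paired (not to each other): S(5,6) + S(7,9) = 0, S(5,7) + S(8,9) = 0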
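18. Propagation (rows i, columns j)
i\j  A  U  A  C  C  C  U  G  U  G  G  U  A  U
A    0  0  0  0  0  0  2
U    .  0  0  0  0  0  2  3
A    .  .  0  0  0  0  2  3  5
C    .  .  .  0  0  0  0  3  3  3
C    .  .  .  .  0  0  0  0  0  3  6
C    .  .  .  .  .  0  0  0  0  3  3
U    .  .  .  .  .  .  0  0  0  0  1  1
G    .  .  .  .  .  .  .  0  0  0  0  1  2
U    .  .  .  .  .  .  .  .  0  0  0  0  2  2
G    .  .  .  .  .  .  .  .  .  0  0  0  0  1
G    .  .  .  .  .  .  .  .  .  .  0  0  0  0
U    .  .  .  .  .  .  .  .  .  .  .  0  0  0
A    .  .  .  .  .  .  .  .  .  .  .  .  0  0
U    .  .  .  .  .  .  .  .  .  .  .  .  .  0
Example cell S(5,11), C5 and G11:
- C5 unpaired: S(6,11) = 3
- G11 unpaired: S(5,10) = 3
- C5-G11 paired: S(6,10) + e(C,G) = 6
- C5 paired, G11 paired (not to each other): S(5,6) + S(7,11) = 1, S(5,7) + S(8,11) = 0, S(5,8) + S(9,11) = 0, S(5,9) + S(10,11) = 0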
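19. Propagation: completed matrix (rows i, columns j)
i\j  A  U  A  C  C  C  U  G  U  G  G  U  A  U
A    0  0  0  0  0  0  2  3  5  6  6  8 10 12
U    .  0  0  0  0  0  2  3  5  6  6  8 10 10
A    .  .  0  0  0  0  2  3  5  5  6  8  8  8
C    .  .  .  0  0  0  0  3  3  3  6  6  6  6
C    .  .  .  .  0  0  0  0  0  3  6  6  6  6
C    .  .  .  .  .  0  0  0  0  3  3  3  3  3
U    .  .  .  .  .  .  0  0  0  0  1  1  3  3
G    .  .  .  .  .  .  .  0  0  0  0  1  2  3
U    .  .  .  .  .  .  .  .  0  0  0  0  2  2
G    .  .  .  .  .  .  .  .  .  0  0  0  0  1
G    .  .  .  .  .  .  .  .  .  .  0  0  0  0
U    .  .  .  .  .  .  .  .  .  .  .  0  0  0
A    .  .  .  .  .  .  .  .  .  .  .  .  0  0
U    .  .  .  .  .  .  .  .  .  .  .  .  .  0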
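20. Traceback (rows i, columns j)
i\j  A  U  A  C  C  C  U  G  U  G  G  U  A  U
A    0  0  0  0  0  0  2  3  5  6  6  8 10 12
U    .  0  0  0  0  0  2  3  5  6  6  8 10 10
A    .  .  0  0  0  0  2  3  5  5  6  8  8  8
C    .  .  .  0  0  0  0  3  3  3  6  6  6  6
C    .  .  .  .  0  0  0  0  0  3  6  6  6  6
C    .  .  .  .  .  0  0  0  0  3  3  3  3  3
U    .  .  .  .  .  .  0  0  0  0  1  1  3  3
G    .  .  .  .  .  .  .  0  0  0  0  1  2  3
U    .  .  .  .  .  .  .  .  0  0  0  0  2  2
G    .  .  .  .  .  .  .  .  .  0  0  0  0  1
G    .  .  .  .  .  .  .  .  .  .  0  0  0  0
U    .  .  .  .  .  .  .  .  .  .  .  0  0  0
A    .  .  .  .  .  .  .  .  .  .  .  .  0  0
U    .  .  .  .  .  .  .  .  .  .  .  .  .  0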
21. Final prediction
AUACCCUGUGGUAU
Total free energy: -12 kcal/mol
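The traceback that leads to this prediction can be sketched as follows. It is a minimal illustration under two assumptions that are not stated on the slides: ties are broken in favour of pairing i with j, and the result is printed in dot-bracket notation. All function names are illustrative.

```python
SCORE = {frozenset("GC"): 3, frozenset("AU"): 2, frozenset("GU"): 1}


def pair_score(a, b):
    return SCORE.get(frozenset((a, b)), 0)


def fill(seq, min_sep=4):
    """DP fill of the score matrix (0-based indices)."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for d in range(min_sep, n):
        for i in range(n - d):
            j = i + d
            best = max(S[i + 1][j], S[i][j - 1])
            e = pair_score(seq[i], seq[j])
            if e > 0:
                best = max(best, S[i + 1][j - 1] + e)
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


def traceback(seq, S, min_sep=4):
    """Recover one optimal set of pairs, returned as sorted 1-based (i, j)."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i < min_sep:
            continue
        e = pair_score(seq[i], seq[j])
        if e > 0 and S[i][j] == S[i + 1][j - 1] + e:   # i paired with j (preferred on ties)
            pairs.append((i + 1, j + 1))
            stack.append((i + 1, j - 1))
        elif S[i][j] == S[i + 1][j]:                   # i unpaired
            stack.append((i + 1, j))
        elif S[i][j] == S[i][j - 1]:                   # j unpaired
            stack.append((i, j - 1))
        else:                                          # split into two sub-regions
            for k in range(i + 1, j):
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return sorted(pairs)


def dot_bracket(n, pairs):
    chars = ["."] * n
    for i, j in pairs:
        chars[i - 1], chars[j - 1] = "(", ")"
    return "".join(chars)


seq = "AUACCCUGUGGUAU"
S = fill(seq)
pairs = traceback(seq, S)
print(pairs)                          # [(1, 14), (2, 13), (3, 12), (4, 11), (5, 10)]
print(dot_bracket(len(seq), pairs))   # (((((....)))))
print(-S[0][len(seq) - 1])            # -12
```

Other structures with the same score exist for this sequence; a different tie-breaking rule would return one of those instead.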
22. Protein structure prediction
23. The sequence-structure gap
The gap is getting bigger.
[Figure: number of known sequences versus known structures; vertical axis from 0 to 200,000]
24. The protein folding problem
- The information for 3D structures is coded in the protein sequence
- Proteins fold into their native structure in seconds
25. Secondary Structure Prediction
- Given a primary sequence
  - ADSGHYRFASGFTYKKMNCTEAA
- what secondary structure will it adopt?
26. Backbone
- A polypeptide chain. The R1 side chains identify the component amino acids.
- Atoms inside each quadrilateral are on the same plane, which can rotate according to the angles φ and ψ.
27. Protein structure
28. Secondary Structure Prediction Methods
- Chou-Fasman / GOR method
  - Based on amino acid frequencies
- Machine learning methods
  - PHDsec and PSIpred
29. Chou and Fasman (1974)
Name            P(a)  P(b)  P(turn)
Alanine          142    83       66
Arginine          98    93       95
Aspartic Acid    101    54      146
Asparagine        67    89      156
Cysteine          70   119      119
Glutamic Acid    151    37       74
Glutamine        111   110       98
Glycine           57    75      156
Histidine        100    87       95
Isoleucine       108   160       47
Leucine          121   130       59
Lysine           114    74      101
Methionine       145   105       60
Phenylalanine    113   138       60
Proline           57    55      152
Serine            77    75      143
Threonine         83   119       96
Tryptophan       108   137       96
Tyrosine          69   147      114
Valine           106   170       50
The propensity of an amino acid to be part of a certain secondary structure (e.g. proline has a low propensity of being in an alpha helix or beta sheet, so it acts as a breaker).
30. Secondary Structure Method Improvements
- Sliding window approach
  - Most alpha helices are 12 residues long
  - Most beta strands are 6 residues long
- Look at all windows and calculate a score for each window. If score > threshold, predict that this is an alpha helix/beta sheet.
TGTAGPOLKCHIQWMLPLKK
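The sliding window idea can be sketched with the P(a) column of the Chou-Fasman table. This is a simplified illustration rather than the full Chou-Fasman procedure: the window score is assumed to be the mean P(a), and the default threshold of 100 is a hypothetical choice. `P_ALPHA`, `window_scores` and `helix_windows` are illustrative names.

```python
# Helix propensities P(a) from the Chou-Fasman table, keyed by one-letter code.
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}


def window_scores(seq, window=12):
    """Mean P(a) of every window of the given length (empty if seq is shorter)."""
    return [
        sum(P_ALPHA[r] for r in seq[s:s + window]) / window
        for s in range(len(seq) - window + 1)
    ]


def helix_windows(seq, window=12, threshold=100.0):
    """Start positions (0-based) of windows whose mean P(a) exceeds the threshold."""
    return [s for s, score in enumerate(window_scores(seq, window)) if score > threshold]


print(helix_windows("A" * 12 + "G" * 12))   # [0, 1, 2, 3, 4, 5]
print(helix_windows("ADSGHYRFASGFTYKKMNCTEAA", window=6))
```

The window length of 12 follows the helix length quoted on the slide; a beta-strand variant would use the P(b) column and a window of 6.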
31. Improvements since the 1980s
- Adding information from conservation in the MSA
- Smarter algorithms (e.g. machine learning)
Success -> 75-80%
32. Machine learning approach for predicting Secondary Structure (PHD, PSIpred)
- Step 1
  - Generating a multiple sequence alignment
[Figure: Query searched against SwissProt; alignment of Query and Subject sequences]
33. Step 2
- Additional sequences are added using a profile. We end up with an MSA which represents the protein family.
[Figure: Query, seed, MSA; alignment of Query and Subject sequences]
34. Step 3
- The sequence profile of the protein family is compared (by machine learning methods) to sequences with known secondary structure.
[Figure: Query, seed, MSA, machine learning approach, known structures; alignment of Query and Subject sequences]
35. Neural Network architecture used in BetaTPred2
36. Predicting protein 3D structure
- Goal: 3D structure from 1D sequence
- An existing fold: homology modeling, fold recognition
- A new fold: ab-initio
37. Homology Modeling
- Simplest, reliable approach
- Basis: proteins with similar sequences tend to fold into similar structures
- It has been observed that even proteins with 25% sequence identity fold into similar structures
- Does not work for remote homologs (< 25% pairwise identity)
38. Homology Modeling
- Given
  - A query sequence Q
  - A database of known protein structures
- Find protein P such that P has high sequence similarity to Q
- Return P's structure as an approximation to Q's structure
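The search step above can be sketched as picking the template with the highest percent identity to the query. This is a minimal illustration, assuming the alignments are already computed and given as equal-length gapped strings; identity is counted over columns without gaps, which is only one of several conventions. The 25% cutoff is the figure quoted in this deck, and `percent_identity` and `pick_template` are hypothetical names.

```python
def percent_identity(a, b):
    """Percent identical residues over aligned columns with no gap ('-')."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not cols:
        return 0.0
    return 100.0 * sum(x == y for x, y in cols) / len(cols)


def pick_template(alignments, threshold=25.0):
    """alignments: template name -> (aligned query, aligned template).

    Returns (name, identity) for the best template above the threshold, else None.
    """
    scored = [(percent_identity(q, t), name) for name, (q, t) in alignments.items()]
    scored = [(pid, name) for pid, name in scored if pid > threshold]
    if not scored:
        return None
    pid, name = max(scored)
    return name, pid


# Toy alignments with made-up template names.
candidates = {"t1": ("ACDE", "ACDF"), "t2": ("ACDE", "WWWE")}
print(pick_template(candidates))   # ('t1', 75.0)
```

When no template passes the cutoff the function returns None, which corresponds to the remote-homolog case where homology modeling does not apply.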
39. Homology modeling needs three items of input
- The sequence of a protein with unknown 3D structure, the "target sequence."
- A 3D template: a structure having the highest sequence identity with the target sequence (> 25% sequence identity)
- A sequence alignment between the target sequence and the template sequence
40. Fold recognition: Protein Threading
- Which of the known folds is likely to be similar
to the (unknown) fold of a new protein when only
its amino-acid sequence is known?
41. Protein Threading
- The goal: find the correct sequence-structure alignment between a target sequence and its native-like fold in PDB
- Energy function: knowledge (or statistics) based rather than physics based
  - Should be able to distinguish correct structural folds from incorrect structural folds
  - Should be able to distinguish correct sequence-fold alignments from incorrect sequence-fold alignments
42. Protein Threading
- Basic premise
  - The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand).
- Statistics from Protein Data Bank (2,000 structures)
  - 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB.
- Chances for a protein to have a structural fold that already exists in PDB are quite good.
43. Protein Threading
- Basic components
- Structure database
- Energy function
- Sequence-structure alignment algorithm
- Prediction reliability assessment
44. ab-initio folding
- Goal: predict structure from first principles
- Requires
  - A free energy function, sufficiently close to the true potential
  - A method for searching the conformational space
- Advantages
  - Works for novel folds
  - Shows that we understand the process
- Disadvantages
  - Applicable to short sequences only
45. Qian et al. (Nature 2007) used distributed computing to predict the 3D structure of a protein from its amino-acid sequence. Here, their predicted structure (grey) of a protein is overlaid with the experimentally determined crystal structure (color) of that protein. The agreement between the two is excellent. Computing resources: 70,000 home computers for about two years.
46. Overall Approach
[Flowchart: boxes and decision labels, in the order they appear]
- Protein Sequence
- Multiple Sequence Alignment
- Database Searching
- Homologue in PDB? (Yes / No)
- Secondary Structure Prediction
- Fold Recognition
- Predicted Fold? (Yes / No)
- Sequence-Structure Alignment
- Homology Modelling
- Ab-initio Structure Prediction
- 3-D Protein Model
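One plausible reading of this flowchart is sketched below. The branching is an assumption, since the arrows of the original figure are not recoverable here: a homologue in PDB leads straight to alignment and homology modelling, otherwise fold recognition is tried, and ab-initio prediction is the fallback. The function `plan` and its arguments are hypothetical.

```python
def plan(homologue_in_pdb: bool, fold_predicted: bool) -> list:
    """Return the assumed sequence of steps from protein sequence to 3-D model."""
    steps = ["multiple sequence alignment", "database searching"]
    if not homologue_in_pdb:
        # No homologue found: try to recognise a known fold first.
        steps += ["secondary structure prediction", "fold recognition"]
        if not fold_predicted:
            # No known fold fits: fall back to ab-initio prediction.
            return steps + ["ab-initio structure prediction", "3-D protein model"]
    # A homologue or a predicted fold provides the template.
    return steps + ["sequence-structure alignment", "homology modelling", "3-D protein model"]


for case in [(True, False), (False, True), (False, False)]:
    print(case, "->", plan(*case)[-2])
```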
47. Thank you for learning with me!