molecule's structure prediction - PowerPoint PPT Presentation

Slides: 48

Provided by: SarahA166

Transcript and Presenter's Notes

Title: molecule's structure prediction

1
molecule's structure prediction
2
Outline

RNA
RNA folding
Dynamic programming for RNA secondary structure
prediction
Protein
Secondary Structure Prediction
Homology Modeling
Protein Threading
ab-initio

3
RNA Basics
G-C: 3 hydrogen bonds (more stable)
A-U: 2 hydrogen bonds

RNA bases A,C,G,U
Canonical Base Pairs
A-U
G-C
G-U
wobble pairing
Bases can only pair with one other base.

Image: http://www.bioalgorithms.info/
4
RNA Secondary Structure
Pseudoknot
Stem
Interior Loop
Single-Stranded
Bulge Loop
Junction (Multiloop)
Hairpin loop
Image: Wuchty
5
RNA secondary structure representation
Circular representation
Bacillus Subtilis RNase P RNA
6
RNA secondary structure representation
Dot plot representation of the same
Bacillus Subtilis RNA folding. A dot is placed
to represent a base pair.
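As a rough illustration of the dot plot idea, the sketch below marks each base pair (i, j) in an n x n grid. The function name `dot_plot` and the characters `*` (pair) and `-` (empty) are choices made here for illustration, not part of the slides.

```python
def dot_plot(n, pairs):
    """Return an n x n text grid with '*' at row i, column j for each pair.

    pairs holds 1-based (i, j) tuples with i < j, so only the upper
    triangle of the grid is ever marked.
    """
    grid = [["-"] * n for _ in range(n)]
    for i, j in pairs:
        grid[i - 1][j - 1] = "*"
    return ["".join(row) for row in grid]


if __name__ == "__main__":
    for row in dot_plot(5, [(1, 5)]):
        print(row)
```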
7
RNA secondary structure definition
An RNA sequence is represented as R = r1, r2,
r3, ..., rn (ri is the i-th nucleotide). Each ri
belongs to the set {A, C, G, U}. A secondary
structure on R is a set S of ordered pairs,
written as i.j, 1 ≤ i < j ≤ n, satisfying
8
Computing RNA secondary structure

Working hypothesis
The native secondary structure of an RNA
molecule is the one with the minimum free energy
Restrictions
No knots: no two pairs (ri,rj), (rk,rl) with i < k < j < l
No close base pairs: (ri,rj) requires j - i > 3
Base pairs: A-U, C-G and G-U
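The restrictions above can be checked mechanically. The following is a minimal sketch, assuming a structure is given as a list of 1-based (i, j) pairs; the function name `is_valid_structure` and the `min_sep` parameter are hypothetical names introduced here.

```python
# Allowed base pairs: A-U, C-G and the G-U wobble pair.
ALLOWED = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def is_valid_structure(seq, pairs, min_sep=4):
    """Check a set of 1-based pairs (i, j) against the three restrictions."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= len(seq)):
            return False
        if j - i < min_sep:                    # close base pair (needs j - i > 3)
            return False
        if frozenset((seq[i - 1], seq[j - 1])) not in ALLOWED:
            return False
        if i in used or j in used:             # a base pairs with one other base only
            return False
        used.update((i, j))
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:                  # crossing pairs form a knot
                return False
    return True


if __name__ == "__main__":
    seq = "AUACCCUGUGGUAU"
    print(is_valid_structure(seq, [(1, 14), (2, 13), (3, 12)]))  # nested pairs
    print(is_valid_structure(seq, [(3, 7), (4, 8)]))             # crossing pairs
```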

9
Computing RNA secondary structure

Tinoco-Uhlenbeck postulate
Assumption: the free energy of each base pair is
independent of all the other pairs and the loop
structures
Consequence: the total free energy of an RNA is
the sum of all of the base pair free energies
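Under this postulate the energy of a structure is a plain sum over its base pairs. A small sketch, assuming the pair energies of -3, -2 and -1 kcal/mol for GC, AU and GU that this presentation uses in its later worked example; `total_free_energy` is a name chosen here.

```python
# Pair energies in kcal/mol (values taken from the worked example in this deck).
PAIR_ENERGY = {frozenset("GC"): -3, frozenset("AU"): -2, frozenset("GU"): -1}


def total_free_energy(seq, pairs):
    """Sum independent base-pair energies for 1-based pairs (i, j)."""
    return sum(
        PAIR_ENERGY.get(frozenset((seq[i - 1], seq[j - 1])), 0)
        for i, j in pairs
    )


if __name__ == "__main__":
    pairs = [(1, 14), (2, 13), (3, 12), (4, 11), (5, 10)]
    print(total_free_energy("AUACCCUGUGGUAU", pairs))
```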

10
Independent Base Pairs Approach

Use solution for smaller strings to find
solutions for larger strings
This is precisely the basic principle behind
dynamic programming algorithms!

11
RNA folding Dynamic Programming

Notation
e(ri,rj): free energy of a base pair joining ri
and rj
Bij: secondary structure of the RNA strand
from base ri to base rj. Its energy is E(Bij)
S(i,j): optimal free energy associated with
segment ri...rj
S(i,j) = max over B of [-E(Bij)]
12
RNA folding Dynamic Programming
There are only four possible ways that a
secondary structure of nested base pairs can be
constructed on an RNA strand from position i to j

i is unpaired, added on to
a structure for i+1...j
S(i,j) = S(i+1,j)

j is unpaired, added on to
a structure for i...j-1
S(i,j) = S(i,j-1)

13
RNA folding Dynamic Programming

i, j paired, but not to each other
the structure for i...j adds together
structures for 2 sub regions,
i...k and k+1...j
S(i,j) = max over i < k < j of [S(i,k) + S(k+1,j)]

i, j paired, added on to
a structure for i+1...j-1
S(i,j) = S(i+1,j-1) + e(ri,rj)
14
RNA folding Dynamic Programming
Since there are only four cases, the optimal
score S(i,j) is just the maximum of the four
possibilities.
To compute this efficiently, we need to make sure
that the scores for the smaller sub-regions have
already been calculated: Dynamic Programming!
15
RNA folding Dynamic Programming
Notes
S(i,j) = 0 if j - i < 4 (do not allow close base pairs)
Reasonable values of e are -3, -2, and -1 kcal/mole for GC, AU and
GU, respectively. In the DP procedure, we use
3, 2, 1 (or replace max with min)
Build upper triangular part of DP matrix
- start with diagonal, all 0
- works outward on larger and larger regions
- ends with S(1,n)
Traceback starts with S(1,n), and finds the optimal path that
leads there.
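The fill procedure described in these notes can be sketched as follows. This is one straightforward reading of the four-case recurrence, using scores 3, 2, 1 with max and 0-based indices; the names `pair_score` and `fill_matrix` are introduced here and are not from the slides.

```python
# Positive scores 3, 2, 1 stand in for energies -3, -2, -1 kcal/mol.
PAIR_SCORE = {frozenset("GC"): 3, frozenset("AU"): 2, frozenset("GU"): 1}


def pair_score(a, b):
    """Score of pairing bases a and b; 0 if they cannot pair."""
    return PAIR_SCORE.get(frozenset((a, b)), 0)


def fill_matrix(seq, min_sep=4):
    """Fill the upper-triangular matrix S (0-based) diagonal by diagonal."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]            # S(i,j) = 0 whenever j - i < 4
    for d in range(min_sep, n):                # work outward on larger regions
        for i in range(n - d):
            j = i + d
            best = max(S[i + 1][j], S[i][j - 1])        # i or j unpaired
            e = pair_score(seq[i], seq[j])
            if e:
                best = max(best, S[i + 1][j - 1] + e)   # i and j paired together
            for k in range(i + 1, j):                   # two sub-regions
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


if __name__ == "__main__":
    S = fill_matrix("AUACCCUGUGGUAU")
    print(S[0][-1])   # score of the whole strand, S(1,n) in the slides' notation
```

Filling by increasing j - i guarantees that every sub-region score is available when it is needed, at a cost of O(n^3) time because of the inner loop over k.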
16
Initialisation: no close base pairs (rows i, columns j)
i\j   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A     0  0  0  0
U        0  0  0  0
A           0  0  0  0
C              0  0  0  0
C                 0  0  0  0
C                    0  0  0  0
U                       0  0  0  0
G                          0  0  0  0
U                             0  0  0  0
G                                0  0  0  0
G                                   0  0  0  0
U                                      0  0  0
A                                         0  0
U                                            0
17
Propagation (rows i, columns j)
i\j   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A     0  0  0  0  0
U        0  0  0  0  0
A           0  0  0  0  2
C              0  0  0  0  3
C                 0  0  0  0  0
C                    0  0  0  0  3
U                       0  0  0  0  1
G                          0  0  0  0  1
U                             0  0  0  0  2
G                                0  0  0  0  1
G                                   0  0  0  0
U                                      0  0  0
A                                         0  0
U                                            0
Example: C5, U9
C5 unpaired: S(6,9) = 0
U9 unpaired: S(5,8) = 0
C5-U9 paired: S(6,8) + e(C,U) = 0
C5 paired, U9 paired: S(5,6) + S(7,9) = 0, S(5,7) + S(8,9) = 0
18
Propagation (rows i, columns j)
i\j   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A     0  0  0  0  0  0  2
U        0  0  0  0  0  2  3
A           0  0  0  0  2  3  5
C              0  0  0  0  3  3  3
C                 0  0  0  0  0  3  6
C                    0  0  0  0  3  3
U                       0  0  0  0  1  1
G                          0  0  0  0  1  2
U                             0  0  0  0  2  2
G                                0  0  0  0  1
G                                   0  0  0  0
U                                      0  0  0
A                                         0  0
U                                            0
Example: C5, G11
C5 unpaired: S(6,11) = 3
G11 unpaired: S(5,10) = 3
C5-G11 paired: S(6,10) + e(C,G) = 6
C5 paired, G11 paired: S(5,6) + S(7,11) = 1, S(5,7) + S(8,11) = 0,
S(5,8) + S(9,11) = 0, S(5,9) + S(10,11) = 0
19
Propagation (rows i, columns j)
i\j   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A     0  0  0  0  0  0  2  3  5  6  6  8 10 12
U        0  0  0  0  0  2  3  5  6  6  8 10 10
A           0  0  0  0  2  3  5  5  6  8  8  8
C              0  0  0  0  3  3  3  6  6  6  6
C                 0  0  0  0  0  3  6  6  6  6
C                    0  0  0  0  3  3  3  3  3
U                       0  0  0  0  1  1  3  3
G                          0  0  0  0  1  2  3
U                             0  0  0  0  2  2
G                                0  0  0  0  1
G                                   0  0  0  0
U                                      0  0  0
A                                         0  0
U                                            0
20
Traceback (rows i, columns j)
i\j   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A     0  0  0  0  0  0  2  3  5  6  6  8 10 12
U        0  0  0  0  0  2  3  5  6  6  8 10 10
A           0  0  0  0  2  3  5  5  6  8  8  8
C              0  0  0  0  3  3  3  6  6  6  6
C                 0  0  0  0  0  3  6  6  6  6
C                    0  0  0  0  3  3  3  3  3
U                       0  0  0  0  1  1  3  3
G                          0  0  0  0  1  2  3
U                             0  0  0  0  2  2
G                                0  0  0  0  1
G                                   0  0  0  0
U                                      0  0  0
A                                         0  0
U                                            0
21
FINAL PREDICTION
AUACCCUGUGGUAU
Total free energy: -12 kcal/mol
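For readers who want to reproduce the final prediction, here is a self-contained sketch of matrix fill plus traceback. It assumes scores 3, 2, 1 for GC, AU, GU and a minimum separation of 4, as in the notes; the function names and the dot-bracket output format are choices made here. Several structures can share the optimal score, so the traceback below returns only one of them, determined by the order in which the cases are tested.

```python
PAIR_SCORE = {frozenset("GC"): 3, frozenset("AU"): 2, frozenset("GU"): 1}
MIN_SEP = 4  # pairs (i, j) need j - i >= 4


def score(a, b):
    return PAIR_SCORE.get(frozenset((a, b)), 0)


def fill(seq):
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for d in range(MIN_SEP, n):
        for i in range(n - d):
            j = i + d
            best = max(S[i + 1][j], S[i][j - 1])
            e = score(seq[i], seq[j])
            if e:
                best = max(best, S[i + 1][j - 1] + e)
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


def traceback(seq, S, i, j, pairs):
    """Recover one optimal set of pairs for the region i..j (0-based)."""
    if j - i < MIN_SEP:
        return
    e = score(seq[i], seq[j])
    if S[i][j] == S[i + 1][j]:                      # i unpaired
        traceback(seq, S, i + 1, j, pairs)
    elif S[i][j] == S[i][j - 1]:                    # j unpaired
        traceback(seq, S, i, j - 1, pairs)
    elif e and S[i][j] == S[i + 1][j - 1] + e:      # i pairs with j
        pairs.append((i, j))
        traceback(seq, S, i + 1, j - 1, pairs)
    else:                                           # split into two sub-regions
        for k in range(i + 1, j):
            if S[i][j] == S[i][k] + S[k + 1][j]:
                traceback(seq, S, i, k, pairs)
                traceback(seq, S, k + 1, j, pairs)
                return


def fold(seq):
    """Return (dot-bracket structure, free energy in kcal/mol)."""
    if not seq:
        return "", 0
    S = fill(seq)
    pairs = []
    traceback(seq, S, 0, len(seq) - 1, pairs)
    structure = ["."] * len(seq)
    for i, j in pairs:
        structure[i], structure[j] = "(", ")"
    return "".join(structure), -S[0][len(seq) - 1]


if __name__ == "__main__":
    print(fold("AUACCCUGUGGUAU"))
```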
22
Protein structure prediction
23
The sequence-structure gap: the gap is getting
bigger
[Figure: Sequences vs. Structures; vertical axis from 0 to 200,000]
24
The protein folding problem

The information for 3D structures is coded in the
protein sequence
Proteins fold into their native structure in
seconds

25
Secondary Structure Prediction

Given a primary sequence
ADSGHYRFASGFTYKKMNCTEAA
what secondary structure will it adopt ?

26
Backbone

A polypeptide chain. The R side chains identify
the component amino acids.
Atoms inside each quadrilateral are on the same
plane, which can rotate according
to angles φ and ψ.

27
Protein structure
28
Secondary Structure Prediction Methods

Chou-Fasman / GOR Method
Based on amino acid frequencies
Machine learning methods
PHDsec and PSIpred

29
Chou and Fasman (1974)
Name           P(a)  P(b)  P(turn)
Alanine         142    83       66
Arginine         98    93       95
Aspartic Acid   101    54      146
Asparagine       67    89      156
Cysteine         70   119      119
Glutamic Acid   151    37       74
Glutamine       111   110       98
Glycine          57    75      156
Histidine       100    87       95
Isoleucine      108   160       47
Leucine         121   130       59
Lysine          114    74      101
Methionine      145   105       60
Phenylalanine   113   138       60
Proline          57    55      152
Serine           77    75      143
Threonine        83   119       96
Tryptophan      108   137       96
Tyrosine         69   147      114
Valine          106   170       50
The propensity of an amino acid to be part of a
certain secondary structure (e.g. Proline has a
low propensity of being in an alpha helix or beta
sheet, so it acts as a breaker)

Success rate of 50%

30
Secondary Structure Method Improvements

Sliding window approach
Most alpha helices are 12 residues long
Most beta strands are 6 residues long
Look at all windows, calculate a score for each
window. If the score > threshold, predict this is an alpha
helix/beta sheet

TGTAGPOLKCHIQWMLPLKK
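A sliding-window score can be sketched with the P(a) column of the Chou and Fasman table. This is only an illustration: the averaging rule, the default window of 6, the threshold of 100 and the neutral value 100 for letters that are not standard amino acids are all assumptions made here, not the published method.

```python
# P(a) helix propensities from the table, keyed by one-letter code.
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}
NEUTRAL = 100  # assumed value for letters missing from the table


def window_scores(seq, window=6):
    """Average P(a) over every window of the given length."""
    return [
        sum(P_ALPHA.get(res, NEUTRAL) for res in seq[s:s + window]) / window
        for s in range(len(seq) - window + 1)
    ]


def helix_windows(seq, window=6, threshold=100.0):
    """Start positions (0-based) of windows scoring above the threshold."""
    return [s for s, v in enumerate(window_scores(seq, window)) if v > threshold]


if __name__ == "__main__":
    seq = "ADSGHYRFASGFTYKKMNCTEAA"
    print(helix_windows(seq, window=6))
```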
31
Improvements since 1980s

Adding information from conservation in MSA
Smarter algorithms (e.g. Machine learning).

Success rate: 75-80%
32
Machine learning approach for predicting
Secondary Structure (PHD, PSIpred)
Step 1
Generating a multiple sequence alignment

[Figure: Query searched against SwissProt; Query aligned with Subject sequences]
33

Step 2
Additional sequences are added using a profile.
We end up with a MSA which represents the protein
family.

[Figure: Query, seed MSA; Query aligned with Subject sequences]
34
Step 3

The sequence profile of the protein family is
compared (by machine learning methods) to
sequences with known secondary structure.

[Figure: Machine Learning Approach; Query, seed MSA and Known structures; Query aligned with Subject sequences]
35
Neural Network architecture used in BetaTPred2
36
Predicting protein 3d structure

Goal: 3D structure from 1D sequence

An existing fold: homology modeling, fold recognition
A new fold: ab-initio
37
Homology Modeling

Simplest, reliable approach
Basis: proteins with similar sequences tend to
fold into similar structures
It has been observed that even proteins with 25%
sequence identity fold into similar structures
Does not work for remote homologs (< 25% pairwise
identity)

38
Homology Modeling

Given:
A query sequence Q
A database of known protein structures
Find protein P such that P has high sequence
similarity to Q
Return P's structure as an approximation to Q's
structure
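The template search above can be illustrated with a toy example. The sketch assumes the query and every database entry are already aligned to equal length (gaps as '-'), measures similarity as percent identity, and uses the 25% cutoff mentioned in this presentation; the database contents and the names `percent_identity` and `best_template` are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over aligned, non-gap positions of two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def best_template(query, database, min_identity=25.0):
    """Return (name, identity) of the most similar entry, or None."""
    best = None
    for name, seq in database.items():
        identity = percent_identity(query, seq)
        if identity >= min_identity and (best is None or identity > best[1]):
            best = (name, identity)
    return best


if __name__ == "__main__":
    # Hypothetical pre-aligned template sequences.
    database = {"tmplA": "ADSGHYRF", "tmplB": "ADSGKKKK"}
    print(best_template("ADSGHYRW", database))
```

A real pipeline would compute the alignment itself and then build coordinates from the template structure; this sketch covers only the selection step.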

39
Homology modeling needs three items of input

The sequence of a protein with unknown 3D
structure, the "target sequence."
A 3D template: a structure having the highest
sequence identity with the target sequence (> 25%
sequence identity)
A sequence alignment between the target sequence
and the template sequence

40
Fold recognition Protein Threading

Which of the known folds is likely to be similar
to the (unknown) fold of a new protein when only
its amino-acid sequence is known?

41
Protein Threading

The goal: find the correct sequence-structure
alignment between a target sequence and its
native-like fold in PDB
Energy function: knowledge (or statistics) based
rather than physics based
Should be able to distinguish correct structural
folds from incorrect structural folds
Should be able to distinguish correct
sequence-fold alignment from incorrect
sequence-fold alignments

42
Protein Threading

Basic premise
Statistics from Protein Data Bank (2,000
structures)
Chances for a protein to have a structural fold
that already exists in PDB are quite good.

The number of unique structural (domain) folds in
nature is fairly small (possibly a few thousand)
90% of new structures submitted to PDB in the
past three years have similar structural folds in
PDB
43
Protein Threading

Basic components
Structure database
Energy function
Sequence-structure alignment algorithm
Prediction reliability assessment

44
ab-initio folding

Goal: Predict structure from first principles
Requires:
A free energy function, sufficiently close to the
true potential
A method for searching the conformational space
Advantages:
Works for novel folds
Shows that we understand the process
Disadvantages:
Applicable to short sequences only

45
Qian et al. (Nature 2007) used distributed
computing to predict the 3D structure of a
protein from its amino-acid sequence. Here, their
predicted structure (grey) of a protein is
overlaid with the experimentally determined
crystal structure (color) of that protein. The
agreement between the two is excellent. The
computation used 70,000 home computers for about two years.
46
Overall Approach
Protein Sequence
Multiple Sequence Alignment
Database Searching
Homologue in PDB
Secondary Structure Prediction
Fold Recognition
No
Yes
Predicted Fold
Yes
Sequence-Structure Alignment
Homology Modelling
Ab-initio Structure Prediction
No
3-D Protein Model
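One plausible reading of this flowchart is a two-step decision: use homology modelling when a homologue exists in PDB, fall back to fold recognition when a fold can be predicted, and otherwise use ab-initio prediction. The sketch below encodes that reading; the function name and its two boolean inputs are assumptions made here, since the transcript does not preserve the arrows of the original diagram.

```python
def choose_method(homologue_in_pdb, fold_predicted):
    """Pick a 3D prediction route from the two yes/no decisions."""
    if homologue_in_pdb:
        return "homology modelling"
    if fold_predicted:
        return "fold recognition (threading)"
    return "ab-initio structure prediction"


if __name__ == "__main__":
    print(choose_method(homologue_in_pdb=False, fold_predicted=True))
```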
47
Thank you for learning with me!
