molecule's structure prediction - PowerPoint PPT Presentation

Provided by: SarahA166
Transcript and Presenter's Notes

Title: molecule's structure prediction


1
molecule's structure prediction
2
Outline
  • RNA
  • RNA folding
  • Dynamic programming for RNA secondary structure
    prediction
  • Protein
  • Secondary Structure Prediction
  • Homology Modeling
  • Protein Threading
  • ab-initio

3
RNA Basics
G-C: 3 hydrogen bonds (more stable)
A-U: 2 hydrogen bonds
  • RNA bases: A, C, G, U
  • Canonical base pairs:
  • A-U
  • G-C
  • G-U (wobble pairing)
  • Each base can pair with at most one other base.

Image: http://www.bioalgorithms.info/
4
RNA Secondary Structure
Structural elements labeled in the figure: pseudoknot, stem,
interior loop, single-stranded region, bulge loop,
junction (multiloop), hairpin loop.
Image: Wuchty
5
RNA secondary structure representation
Circular representation
Bacillus subtilis RNase P RNA
6
RNA secondary structure representation
Dot plot representation of the same
Bacillus subtilis RNA folding. A dot is placed
to represent a base pair.
7
RNA secondary structure definition
An RNA sequence is represented as R = r1, r2,
r3, ..., rn (ri is the i-th nucleotide). Each ri
belongs to the set {A, C, G, U}. A secondary
structure on R is a set S of ordered pairs,
written as i·j, 1 ≤ i < j ≤ n, satisfying
8
Computing RNA secondary structure
  • Working hypothesis:
  • The native secondary structure of an RNA
    molecule is the one with the minimum free energy
  • Restrictions:
  • No knots: no two pairs (ri,rj), (rk,rl) with i < k < j < l
  • No close base pairs: a pair (ri,rj) requires j - i > 3
  • Base pairs: A-U, C-G and G-U
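
The three restrictions above can be checked mechanically. The following is a minimal Python sketch, not code from the presentation: the names `CANONICAL` and `is_valid_structure` are made up for illustration, pairs are given as 1-based (i, j) tuples to match the slides, and the rule that each base pairs at most once is taken from the RNA Basics slide.

```python
# Allowed base pairs (both orientations): A-U, C-G and G-U.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}


def is_valid_structure(seq, pairs, min_sep=4):
    """Return True if `pairs` is a legal secondary structure on `seq`.

    `pairs` holds 1-based (i, j) tuples with i < j.
    `min_sep=4` encodes the restriction j - i > 3.
    """
    pairs = list(pairs)
    used = set()
    for i, j in pairs:
        if not 1 <= i < j <= len(seq):
            return False
        if j - i < min_sep:                      # close base pair
            return False
        if (seq[i - 1], seq[j - 1]) not in CANONICAL:
            return False
        if i in used or j in used:               # a base pairs at most once
            return False
        used.update((i, j))
    # No knots: reject any two pairs with i < k < j < l.
    return not any(i < k < j < l for i, j in pairs for k, l in pairs)
```

For example, the pairs (3,7) and (4,8) on AUACCCUGUGGUAU are each allowed on their own but cross each other, so the check rejects them together.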

9
Computing RNA secondary structure
  • Tinoco-Uhlenbeck postulate
  • Assumption: the free energy of each base pair is
    independent of all the other pairs and the loop
    structures
  • Consequence: the total free energy of an RNA is
    the sum of all of the base pair free energies
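
Under this postulate the energy of a structure is a plain sum over its pairs. Here is a small Python sketch of that sum, assuming the per-pair values quoted in the notes of this deck (-3, -2 and -1 kcal/mol for GC, AU and GU); `PAIR_ENERGY` and `total_free_energy` are illustrative names, and pairs are 1-based (i, j) tuples.

```python
# Per-pair free energies in kcal/mol, as quoted in the deck's notes.
PAIR_ENERGY = {
    frozenset("GC"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}


def total_free_energy(seq, pairs):
    """Sum the independent base-pair energies of a structure.

    Raises KeyError if a pair is not one of GC, AU or GU.
    """
    return sum(PAIR_ENERGY[frozenset((seq[i - 1], seq[j - 1]))]
               for i, j in pairs)
```

Loop contributions are ignored by construction, which is exactly the simplification the postulate makes.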

10
Independent Base Pairs Approach
  • Use solution for smaller strings to find
    solutions for larger strings
  • This is precisely the basic principle behind
    dynamic programming algorithms!

11
RNA folding Dynamic Programming
  • Notation
  • e(ri,rj): free energy of a base pair joining ri
    and rj
  • Bij: a secondary structure of the RNA strand
    from base ri to base rj. Its energy is E(Bij)
  • S(i,j): optimal free energy associated with
    segment ri...rj
  • S(i,j) = max over all structures Bij of -E(Bij)

12
RNA folding Dynamic Programming
There are only four possible ways that a
secondary structure of nested base pairs can be
constructed on an RNA strand from position i to j:
  • i is unpaired, added on to
  • a structure for i+1...j
  • S(i,j) = S(i+1,j)
  • j is unpaired, added on to
  • a structure for i...j-1
  • S(i,j) = S(i,j-1)

13
RNA folding Dynamic Programming
  • i and j are both paired, but not to each other
  • the structure for i...j adds together
  • structures for 2 sub-regions,
  • i...k and k+1...j
  • S(i,j) = max over i < k < j of [S(i,k) + S(k+1,j)]
  • i and j are paired to each other, added on to
  • a structure for i+1...j-1
  • S(i,j) = S(i+1,j-1) + e(ri,rj)

14
RNA folding Dynamic Programming
Since there are only four cases, the optimal
score S(i,j) is just the maximum of the four
possibilities.
To compute this efficiently, we need to make sure
that the scores for the smaller sub-regions have
already been calculated: Dynamic Programming!
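
Collected into one expression, the four cases give the recurrence below. This is a restatement of the preceding slides in LaTeX, assuming e(r_i, r_j) is used as a positive pairing score and taken as 0 for bases that cannot pair.

```latex
S(i,j) = \max
\begin{cases}
  S(i+1,\,j)                                   & \text{$i$ unpaired} \\
  S(i,\,j-1)                                   & \text{$j$ unpaired} \\
  S(i+1,\,j-1) + e(r_i, r_j)                   & \text{$i$ and $j$ paired to each other} \\
  \max\limits_{i<k<j} \bigl[\, S(i,k) + S(k+1,j) \,\bigr] & \text{$i$ and $j$ paired, but not to each other}
\end{cases}
```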
15
RNA folding Dynamic Programming
Notes
  • S(i,j) = 0 if j - i < 4 (do not allow close base pairs)
  • Reasonable values of e are -3, -2, and -1 kcal/mole
    for GC, AU and GU, respectively. In the DP procedure,
    we use 3, 2, 1 (or replace max with min)
  • Build the upper triangular part of the DP matrix:
    start with the diagonal (all 0), work outward on larger
    and larger regions, and end with S(1,n)
  • Traceback starts with S(1,n) and finds the optimal
    path that leads there.
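
The fill order described in these notes can be sketched in Python as follows. This is an illustrative sketch rather than the presenter's code: `PAIR_SCORE`, `MIN_SEP` and `fill_matrix` are made-up names, indices are 0-based (the slides are 1-based), and bases that cannot pair contribute a score of 0.

```python
# Positive pairing scores used in the DP procedure: GC = 3, AU = 2, GU = 1.
PAIR_SCORE = {("G", "C"): 3, ("C", "G"): 3,
              ("A", "U"): 2, ("U", "A"): 2,
              ("G", "U"): 1, ("U", "G"): 1}
MIN_SEP = 4  # S(i,j) = 0 whenever j - i < 4


def fill_matrix(seq):
    """Fill the upper-triangular score matrix S for `seq` (0-based)."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    # Work outward from the diagonal: short spans before long ones.
    for span in range(MIN_SEP, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j],              # i unpaired
                       S[i][j - 1])              # j unpaired
            pair = PAIR_SCORE.get((seq[i], seq[j]), 0)
            best = max(best, S[i + 1][j - 1] + pair)   # i paired with j
            for k in range(i + 1, j):            # i and j paired elsewhere
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S
```

With this sketch, `fill_matrix("AUACCCUGUGGUAU")[0][13]` corresponds to S(1,14) on the slides. The three nested loops make the running time cubic in the sequence length.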
16
Initialisation: no close base pairs (rows are i, columns are j)
   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A  0  0  0  0
U     0  0  0  0
A        0  0  0  0
C           0  0  0  0
C              0  0  0  0
C                 0  0  0  0
U                    0  0  0  0
G                       0  0  0  0
U                          0  0  0  0
G                             0  0  0  0
G                                0  0  0  0
U                                   0  0  0
A                                      0  0
U                                         0
17
Propagation (rows are i, columns are j)
   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A  0  0  0  0  0
U     0  0  0  0  0
A        0  0  0  0  2
C           0  0  0  0  3
C              0  0  0  0  0
C                 0  0  0  0  3
U                    0  0  0  0  1
G                       0  0  0  0  1
U                          0  0  0  0  2
G                             0  0  0  0  1
G                                0  0  0  0
U                                   0  0  0
A                                      0  0
U                                         0
Example: cell S(5,9), segment C5...U9
  • C5 unpaired: S(6,9) = 0
  • U9 unpaired: S(5,8) = 0
  • C5-U9 paired: S(6,8) + e(C,U) = 0
  • C5 paired, U9 paired: S(5,6) + S(7,9) = 0,
    S(5,7) + S(8,9) = 0
18
Propagation (rows are i, columns are j)
   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A  0  0  0  0  0  0  2
U     0  0  0  0  0  2  3
A        0  0  0  0  2  3  5
C           0  0  0  0  3  3  3
C              0  0  0  0  0  3  6
C                 0  0  0  0  3  3
U                    0  0  0  0  1  1
G                       0  0  0  0  1  2
U                          0  0  0  0  2  2
G                             0  0  0  0  1
G                                0  0  0  0
U                                   0  0  0
A                                      0  0
U                                         0
Example: cell S(5,11), segment C5...G11
  • C5 unpaired: S(6,11) = 3
  • G11 unpaired: S(5,10) = 3
  • C5-G11 paired: S(6,10) + e(C,G) = 6
  • C5 paired, G11 paired: S(5,6) + S(7,11) = 1,
    S(5,7) + S(8,11) = 0, S(5,8) + S(9,11) = 0,
    S(5,9) + S(10,11) = 0
19
Propagation, completed matrix (rows are i, columns are j)
   A  U  A  C  C  C  U  G  U  G  G  U  A  U
A  0  0  0  0  0  0  2  3  5  6  6  8 10 12
U     0  0  0  0  0  2  3  5  6  6  8 10 10
A        0  0  0  0  2  3  5  5  6  8  8  8
C           0  0  0  0  3  3  3  6  6  6  6
C              0  0  0  0  0  3  6  6  6  6
C                 0  0  0  0  3  3  3  3  3
U                    0  0  0  0  1  1  3  3
G                       0  0  0  0  1  2  3
U                          0  0  0  0  2  2
G                             0  0  0  0  1
G                                0  0  0  0
U                                   0  0  0
A                                      0  0
U                                         0
20
Traceback: start at S(1,14) = 12 in the completed matrix
(the same matrix as on the previous slide) and follow the
choices that produced each score.
21
FINAL PREDICTION
AUACCCUGUGGUAU
Total free energy: -12 kcal/mol
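
The traceback that leads to this prediction can be sketched as below. This is a self-contained Python illustration under stated assumptions, not the presenter's implementation: all function names are hypothetical, indices are 0-based, ties are broken in the order "i unpaired, j unpaired, i-j paired, split", and the dot-bracket string is an added convenience that does not appear on the slides. Other tie-breaking orders can return a different structure with the same score.

```python
PAIR_SCORE = {("G", "C"): 3, ("C", "G"): 3, ("A", "U"): 2,
              ("U", "A"): 2, ("G", "U"): 1, ("U", "G"): 1}
MIN_SEP = 4  # no pairs with j - i < 4


def fill_matrix(seq):
    """Score matrix S, filled from short spans to long spans."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for span in range(MIN_SEP, n):
        for i in range(n - span):
            j = i + span
            pair = PAIR_SCORE.get((seq[i], seq[j]), 0)
            best = max(S[i + 1][j], S[i][j - 1], S[i + 1][j - 1] + pair)
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


def traceback(seq, S):
    """Recover one optimal set of 0-based base pairs from S."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i < MIN_SEP:
            continue                              # nothing can pair here
        if S[i][j] == S[i + 1][j]:                # i unpaired
            stack.append((i + 1, j))
        elif S[i][j] == S[i][j - 1]:              # j unpaired
            stack.append((i, j - 1))
        elif ((seq[i], seq[j]) in PAIR_SCORE and
              S[i][j] == S[i + 1][j - 1] + PAIR_SCORE[(seq[i], seq[j])]):
            pairs.append((i, j))                  # i paired with j
            stack.append((i + 1, j - 1))
        else:                                     # split into two regions
            for k in range(i + 1, j):
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return sorted(pairs)


def to_dot_bracket(n, pairs):
    """Render pairs as a dot-bracket string of length n."""
    chars = ["."] * n
    for i, j in pairs:
        chars[i], chars[j] = "(", ")"
    return "".join(chars)
```

Calling `traceback(seq, fill_matrix(seq))` on AUACCCUGUGGUAU returns pairs whose scores sum to 12, matching the total of -12 kcal/mol once the sign is restored.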
22
Protein structure prediction
23
The sequence-structure gap
The gap is getting bigger
(Chart: number of known sequences versus known structures;
vertical axis from 0 to 200000.)
24
The protein folding problem
  • The information for 3D structures is coded in the
    protein sequence
  • Proteins fold into their native structure in
    seconds

25
Secondary Structure Prediction
  • Given a primary sequence
  • ADSGHYRFASGFTYKKMNCTEAA
  • what secondary structure will it adopt ?

26
Backbone
  • A polypeptide chain. The R1 side chains identify
    the component amino acids.
  • Atoms inside each quadrilateral are on the same
    plane, which can rotate according
    to the angles φ and ψ.

27
Protein structure
28
Secondary Structure Prediction Methods
  • Chou-Fasman / GOR Method
  • Based on amino acid frequencies
  • Machine learning methods
  • PHDsec and PSIpred

29
Chou and Fasman (1974)
Name            P(a)  P(b)  P(turn)
Alanine          142    83     66
Arginine          98    93     95
Aspartic Acid    101    54    146
Asparagine        67    89    156
Cysteine          70   119    119
Glutamic Acid    151    37     74
Glutamine        111   110     98
Glycine           57    75    156
Histidine        100    87     95
Isoleucine       108   160     47
Leucine          121   130     59
Lysine           114    74    101
Methionine       145   105     60
Phenylalanine    113   138     60
Proline           57    55    152
Serine            77    75    143
Threonine         83   119     96
Tryptophan       108   137     96
Tyrosine          69   147    114
Valine           106   170     50
The propensity of an amino acid to be part of a
certain secondary structure (e.g. proline has a
low propensity of being in an alpha helix or beta
sheet → breaker)
  • Success rate of 50%

30
Secondary Structure Method Improvements
  • Sliding window approach
  • Most alpha helices are 12 residues long
  • Most beta strands are 6 residues long
  • Look at all windows, calculate a score for each
    window. If > threshold → predict this is an alpha
    helix/beta sheet

TGTAGPOLKCHIQWMLPLKK
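
The sliding-window idea can be sketched in Python as below, using the P(a) column of the Chou and Fasman table. Several things here are assumptions for illustration only: the window score is taken to be the mean propensity, the threshold of 100 is arbitrary, and `P_ALPHA` and `helix_windows` are made-up names. The published Chou-Fasman procedure has additional nucleation and extension rules that this sketch does not attempt.

```python
# P(a) values from the Chou and Fasman table, keyed by one-letter code.
P_ALPHA = {
    "A": 142, "R": 98,  "D": 101, "N": 67,  "C": 70,
    "E": 151, "Q": 111, "G": 57,  "H": 100, "I": 108,
    "L": 121, "K": 114, "M": 145, "F": 113, "P": 57,
    "S": 77,  "T": 83,  "W": 108, "Y": 69,  "V": 106,
}


def helix_windows(seq, window=12, threshold=100.0):
    """Return (start, score) for every window scoring above threshold.

    `start` is 0-based; `score` is the mean P(a) over the window.
    Residues missing from P_ALPHA raise KeyError.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        residues = seq[start:start + window]
        score = sum(P_ALPHA[r] for r in residues) / window
        if score > threshold:
            hits.append((start, score))
    return hits
```

The same loop works for beta strands by swapping in the P(b) column and a window of 6.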
31
Improvements since 1980s
  • Adding information from conservation in MSA
  • Smarter algorithms (e.g. Machine learning).

Success → 75-80%
32
Machine learning approach for predicting
Secondary Structure (PHD, PSIpred)
  • Step 1
  • Generating a multiple sequence alignment

Diagram labels: Query, SwissProt, alignment of Query with
Subject sequences
33
  • Step 2
  • Additional sequences are added using a profile.
    We end up with an MSA which represents the protein
    family.

Diagram labels: Query, seed MSA, alignment of Query with
Subject sequences
34
Step 3
  • The sequence profile of the protein family is
    compared (by machine learning methods) to
    sequences with known secondary structure.

Diagram labels: Query, seed MSA, Machine Learning Approach,
Known structures, alignment of Query with Subject sequences
35
Neural Network architecture used in BetaTPred2
36
Predicting protein 3d structure
  • Goal: 3D structure from 1D sequence

An existing fold: homology modeling, fold recognition
A new fold: ab-initio
37
Homology Modeling
  • Simplest, reliable approach
  • Basis: proteins with similar sequences tend to
    fold into similar structures
  • It has been observed that even proteins with 25%
    sequence identity fold into similar structures
  • Does not work for remote homologs (< 25% pairwise
    identity)

38
Homology Modeling
  • Given:
  • A query sequence Q
  • A database of known protein structures
  • Find protein P such that P has high sequence
    similarity to Q
  • Return P's structure as an approximation to Q's
    structure
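
The template selection step can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: every database sequence is taken to be already aligned to the query (equal length, with "-" for gaps), similarity is reduced to percent identity, and `percent_identity` and `pick_template` are hypothetical names. Real pipelines use a search and alignment tool for this step and then build coordinates from the chosen template.

```python
def percent_identity(a, b):
    """Percent identity over aligned, non-gap positions of a and b."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def pick_template(query, database, min_identity=25.0):
    """Return (name, identity) of the best template above min_identity.

    `database` maps a template name to its sequence aligned to `query`.
    Returns None when no template exceeds the cutoff.
    """
    best = None
    best_identity = min_identity
    for name, aligned_seq in database.items():
        identity = percent_identity(query, aligned_seq)
        if identity > best_identity:
            best, best_identity = (name, identity), identity
    return best
```

The 25% default mirrors the cutoff mentioned on the homology modeling slides; below it the function returns None, which is the case where threading or ab-initio methods take over.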

39
Homology modeling needs three items of input
  • The sequence of a protein with unknown 3D
    structure, the "target sequence."
  • A 3D template: a structure having the highest
    sequence identity with the target sequence (> 25%
    sequence identity)
  • A sequence alignment between the target sequence
    and the template sequence

40
Fold recognition Protein Threading
  • Which of the known folds is likely to be similar
    to the (unknown) fold of a new protein when only
    its amino-acid sequence is known?

41
Protein Threading
  • The goal: find the correct sequence-structure
    alignment between a target sequence and its
    native-like fold in PDB
  • Energy function: knowledge (or statistics) based
    rather than physics based
  • Should be able to distinguish correct structural
    folds from incorrect structural folds
  • Should be able to distinguish correct
    sequence-fold alignment from incorrect
    sequence-fold alignments

42
Protein Threading
  • Basic premise
  • The number of unique structural (domain) folds in
    nature is fairly small (possibly a few thousand)
  • Statistics from Protein Data Bank (2,000
    structures)
  • 90% of new structures submitted to PDB in the
    past three years have similar structural folds in
    PDB
  • Chances for a protein to have a structural fold
    that already exists in PDB are quite good.
43
Protein Threading
  • Basic components
  • Structure database
  • Energy function
  • Sequence-structure alignment algorithm
  • Prediction reliability assessment

44
ab-initio folding
  • Goal: predict structure from first principles
  • Requires:
  • A free energy function, sufficiently close to the
    true potential
  • A method for searching the conformational space
  • Advantages:
  • Works for novel folds
  • Shows that we understand the process
  • Disadvantages:
  • Applicable to short sequences only

45
Qian et al. (Nature 2007) used distributed
computing to predict the 3D structure of a
protein from its amino-acid sequence. Here, their
predicted structure (grey) of a protein is
overlaid with the experimentally determined
crystal structure (color) of that protein. The
agreement between the two is excellent. The computation
used 70,000 home computers for about two years.
46
Overall Approach
Flowchart boxes and decision labels:
Protein Sequence
Multiple Sequence Alignment
Database Searching
Homologue in PDB
Secondary Structure Prediction
Fold Recognition
No
Yes
Predicted Fold
Yes
Sequence-Structure Alignment
Homology Modelling
Ab-initio Structure Prediction
No
3-D Protein Model
47
Thank you for learning with me!