Protein structure prediction: The holy grail of bioinformatics

Slides: 97

Provided by: nsm5

Transcript and Presenter's Notes

1
Protein structure prediction: The holy grail of
bioinformatics
2
Proteins: Four levels of structural organization
Primary structure
Secondary structure
Tertiary structure
Quaternary structure
3
Primary structure: the linear amino acid sequence
4
Secondary structure: spatial arrangement of
amino-acid residues that are adjacent in the
primary structure
5
α helix: A helical structure whose chain coils
tightly as a right-handed screw, with all the side
chains sticking outward in a helical array. The
tight structure of the α helix is stabilized by
same-strand hydrogen bonds between -NH groups and
-CO groups spaced at four amino-acid residue
intervals.
6
The β-pleated sheet is made of loosely coiled β
strands that are stabilized by hydrogen bonds
between -NH and -CO groups from adjacent strands.
7
An antiparallel β sheet. Adjacent β strands run
in opposite directions. Hydrogen bonds between NH
and CO groups connect each amino acid to a single
amino acid on an adjacent strand, stabilizing the
structure.
8
A parallel β sheet. Adjacent β strands run in the
same direction. Hydrogen bonds connect each amino
acid on one strand with two different amino acids
on the adjacent strand.
9
(No Transcript)
10
Silk fibroin
11
α helix
β sheet (parallel and antiparallel)
tight turns
flexible loops
irregular elements (random coil)
12
Tertiary structure: three-dimensional structure
of protein
13
The tertiary structure is formed by the folding
of secondary structures by covalent and
non-covalent forces, such as hydrogen bonds,
hydrophobic interactions, salt bridges between
positively and negatively charged residues, as
well as disulfide bonds between pairs of
cysteines.
14
Quaternary structure: spatial arrangement of
subunits and their contacts.
15
(No Transcript)
16
Holoproteins and Apoproteins
Holoprotein = Apoprotein + Prosthetic group
17
Apohemoglobin = 2α + 2β
18
Prosthetic group: Heme
19
Hemoglobin = Apohemoglobin + 4 Heme
20
(No Transcript)
21
Christian B. Anfinsen (1916-1995)
Sela M, White FH, Anfinsen CB. 1959. The
reductive cleavage of disulfide bonds and its
application to problems of protein structure.
Biochim. Biophys. Acta 31:417-426.
22
(No Transcript)
23
The denaturation and renaturation of proteins
24
(No Transcript)
25
Reducing agents
Ammonium thioglycolate (alkaline): pH 9.0-10
Glycerylmonothioglycolate (acid): pH 6.5-8.2
26
Oxidant
27
What do we need to know in order to state that
the tertiary structure of a protein has been
solved?

Ideally: We need to determine the position of all
atoms and their connectivity.
Less ideally: We need to determine the position
of all Cα atoms (backbone structure).

28
Protein structure: Limitations and caveats

Not all proteins or parts of proteins assume a
well-defined 3D structure in solution.
Protein structure is not static; different parts
of the structure undergo different degrees of
thermal motion.
There may be a number of slightly different
conformations in solution.
Some proteins undergo conformational changes when
interacting with other molecules.

29
Experimental Protein Structure Determination

X-ray crystallography: most accurate; in vitro;
needs crystals; 100-200K per structure
NMR: fairly accurate; in solution; no need for
crystals; limited to very small proteins
Cryo-electron microscopy: imaging technology;
low resolution

30
Why predict protein structure?

Structural knowledge brings some understanding of
function and mechanism of action
Predicted structures can be used in
structure-based drug design
It can help us understand the effects of
mutations on structure and function
It is a very interesting scientific problem
(still unsolved in its most general form after
more than 50 years of effort)

31
Secondary structure prediction
32
Secondary structure prediction

Historically, the first structure prediction
methods predicted secondary structure
Can be used to improve alignment accuracy
Can be used to detect domain boundaries within
proteins with remote sequence homology
Often the first step towards 3D structure
prediction
Informative for mutagenesis studies

33
Protein Secondary Structures (Simplifications)
α-HELIX
β-STRAND
COIL (everything else)
34
Assumptions

The entire information for forming secondary
structure is contained in the primary sequence
side groups of residues will determine structure
examining windows of 13-17 residues is sufficient
to predict secondary structure
α-helices: 5-40 residues long
β-strands: 5-10 residues long

35
Predicting Secondary Structure From Primary
Structure

accuracy: 64-75%
higher accuracy for α-helices than for β-sheets
accuracy is dependent on protein family
predictions of engineered (artificial) proteins
are less accurate

36
A surprising result!

Chameleon
sequences

37
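The Chameleon sequence
Native sequence, containing sequence 1 (α-helix)
and sequence 2 (β-strand):
TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
Replace both sequences with an engineered peptide
(chameleon):
TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK
The same peptide folds as an α-helix in the first
position and as a β-strand in the second.
Source: Minor and Kim. 1996. Nature 380:730-734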
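
The two sequences on this slide can be compared directly. The following is a minimal sketch, not part of the original presentation, that searches the engineered sequence for its longest segment occurring twice and reports which native segments it replaced. The function name `longest_repeat` is a hypothetical helper introduced only for illustration.

```python
# Sequences copied from the slide.
native = "TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK"
engineered = "TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK"


def longest_repeat(seq):
    """Return (segment, (first_start, second_start)) for the longest
    substring that occurs at least twice without overlapping itself."""
    n = len(seq)
    for k in range(n // 2, 0, -1):          # try long segments first
        first_seen = {}
        for i in range(n - k + 1):
            segment = seq[i:i + k]
            if segment in first_seen and i - first_seen[segment] >= k:
                return segment, (first_seen[segment], i)
            first_seen.setdefault(segment, i)
    return "", ()


repeat, positions = longest_repeat(engineered)
print("chameleon peptide:", repeat, "at 0-based positions", positions)
for start in positions:
    # Show the native segment that the chameleon peptide replaced.
    print("native segment replaced:", native[start:start + len(repeat)])
```

The search is brute force, which is adequate for a 40-residue sequence; it makes no claim about how the original authors designed or located the peptide.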
38
Measures of prediction accuracy

Qindex and Q3
Correlation coefficient

39
Qindex

Qindex (Qhelix, Qstrand, Qcoil, Q3):
percentage of residues correctly predicted as
α-helix, β-strand, coil, or for all 3
conformations.
Drawbacks:
- even a random assignment of structure can
achieve a high score (Holley and Karplus, 1991)
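
As an illustration of the Q indices described above, here is a small sketch that computes the per-state scores and Q3 from an observed and a predicted three-state string. The one-letter state codes H (helix), E (strand) and C (coil), the function name `q_scores`, and the example strings are assumptions made for this sketch, not data from the presentation.

```python
def q_scores(observed, predicted):
    """Per-state and overall percentage of correctly predicted residues.

    Both arguments are equal-length strings over H (helix), E (strand), C (coil).
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    scores = {}
    for state in "HEC":
        total = sum(1 for o in observed if o == state)
        correct = sum(1 for o, p in zip(observed, predicted)
                      if o == state and p == state)
        # Undefined (nan) when the state never occurs in the observed structure.
        scores["Q" + state] = 100.0 * correct / total if total else float("nan")
    matches = sum(1 for o, p in zip(observed, predicted) if o == p)
    scores["Q3"] = 100.0 * matches / len(observed)
    return scores


# Toy example: 11 of 13 residues are predicted correctly.
observed_ss = "CHHHHHCCEEEEC"
predicted_ss = "CHHHHCCCEEECC"
scores = q_scores(observed_ss, predicted_ss)
print(scores)
```

Because Q3 only counts matching residues, a prediction that assigns the most common state everywhere can still score well on proteins dominated by that state, which is the drawback the slide points out.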

40
Correlation coefficient
Cα = 1 (100%)
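
The formula on this slide did not survive transcription. A correlation measure commonly used for secondary structure prediction is the Matthews correlation coefficient, computed separately for each state; the sketch below assumes that this is the coefficient meant. The function name `matthews` and the example strings are illustrative only.

```python
import math


def matthews(observed, predicted, state):
    """Matthews correlation coefficient for one state (e.g. 'H').

    Returns a value between -1 and 1; 1 means every residue is classified
    correctly with respect to `state`.
    """
    if len(observed) != len(predicted):
        raise ValueError("strings must be of equal length")
    tp = tn = fp = fn = 0
    for o, p in zip(observed, predicted):
        if p == state and o == state:
            tp += 1          # correctly predicted as `state`
        elif p == state:
            fp += 1          # predicted as `state`, observed otherwise
        elif o == state:
            fn += 1          # observed as `state`, missed by the prediction
        else:
            tn += 1          # correctly predicted as not `state`
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


perfect = matthews("HHHCCC", "HHHCCC", "H")
inverted = matthews("HHHCCC", "CCCHHH", "H")
print(perfect, inverted)
```

Unlike Q3, this coefficient penalizes over-prediction and under-prediction of a state, so a trivial prediction that assigns one state everywhere scores 0 rather than a high percentage.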
41
Methods of secondary structure prediction
42
First generation methods: single residue
statistics

Chou & Fasman (1974, 1978)
Some residues have particular
secondary-structure preferences. Based on
empirical frequencies of residues in α-helices,
β-sheets, and coils.
Examples: Glu favors the α-helix; Val favors the
β-strand
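
The idea of deriving preferences from empirical frequencies can be sketched as follows: the propensity of a residue for a state is its frequency among residues in that state divided by its overall frequency. The tiny data set below is invented purely to make the arithmetic visible and is not real structural data; `propensities` is a hypothetical helper, and the published Chou-Fasman tables were derived from real structures.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Propensity of each residue type for `state`.

    propensity = (fraction of the residue among `state` positions)
               / (fraction of the residue among all positions)
    A value above 1 means the residue is over-represented in that state.
    """
    in_state = Counter()
    overall = Counter()
    for seq, ss in zip(sequences, structures):
        for residue, s in zip(seq, ss):
            overall[residue] += 1
            if s == state:
                in_state[residue] += 1
    n_state = sum(in_state.values())
    n_all = sum(overall.values())
    if n_state == 0:
        raise ValueError("state does not occur in the data")
    return {residue: (in_state[residue] / n_state) / (overall[residue] / n_all)
            for residue in overall}


# Toy data (NOT real): note that in the structure strings 'E' means strand,
# while in the sequences 'E' means glutamate.
toy_sequences = ["EEAVVG", "VEEAGV"]
toy_structures = ["HHHEEC", "EHHHCE"]
helix = propensities(toy_sequences, toy_structures, "H")
strand = propensities(toy_sequences, toy_structures, "E")
print("helix:", helix)
print("strand:", strand)
```

In this toy set glutamate comes out as a helix former and valine as a strand former, mirroring the examples on the slide, but only because the data were constructed that way.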

43
Chou-Fasman method
44
(No Transcript)
45
Chou-Fasman Method

Accuracy: Q3 = 50-60%

46
Second generation methods: segment statistics

Similar to single-residue methods, but
incorporating additional information (adjacent
residues, segmental statistics).
Problems:
Low accuracy: Q3 below 66%.
Q3 of β-strands (E): 28-48%.
Predicted structures were too short.

47
The GOR method

developed by Garnier, Osguthorpe & Robson
builds on Chou-Fasman Pij values
evaluates each residue PLUS the adjacent 8
N-terminal and 8 carboxyl-terminal residues
sliding window of 17 residues
underpredicts β-strand regions
GOR method accuracy: Q3 = 64%
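
The 17-residue sliding window described above (the central residue plus 8 neighbours on each side) can be sketched as below. This shows only the windowing step, not the GOR scoring itself; the padding character and the function name `windows` are assumptions made for this example.

```python
def windows(sequence, flank=8, pad="-"):
    """Yield (central_residue, window) for every position in `sequence`.

    Each window holds the central residue plus `flank` residues on either
    side (17 residues for flank=8). Positions beyond the sequence ends are
    filled with `pad`, which is one possible convention among several.
    """
    padded = pad * flank + sequence + pad * flank
    width = 2 * flank + 1
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + width]


example = list(windows("TEAVDAATAEKVFKQYANDNGVDG"))
for residue, window in example[:3]:
    print(residue, window)
```

A window-based predictor would then score each window and assign the resulting state to the central residue, moving one position at a time along the sequence.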

48
Third generation methods

Third generation methods reached 77% accuracy.
They consist of two new ideas:
1. A biological idea:
Using evolutionary information based on
conservation analysis of multiple sequence
alignments.
2. A technological idea:
Using neural networks.

49
Artificial Neural Networks
An attempt to imitate the human brain (assuming
that this is the way it works).
50
Neural network models

machine learning approach
provide training sets of structures (e.g.
α-helices, non-α-helices)
computers are trained to recognize patterns in
known secondary structures
provide test set (proteins with known structures)
accuracy: 70-75%

51
Reasons for improved accuracy

Align sequence with other related proteins of the
same protein family
Find members that have a known structure
If there are significant matches between
structure and sequence, assign secondary
structures to the corresponding residues

52
New and Improved Third-Generation Methods

Exploit evolutionary information. Based on
conservation analysis of multiple sequence
alignments.
PHD (Q3 = 70%)
Rost, B., Sander, C. (1993) J. Mol. Biol. 232,
584-599.
PSIPRED (Q3 = 77%)
Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.
Arguably remains the top secondary structure
prediction method (won all CASP competitions
since 1998).
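
To make "conservation analysis of multiple sequence alignments" concrete, the sketch below turns a small alignment into per-column residue frequencies, the kind of profile that can be given to a predictor in place of a single sequence. The alignment is a toy example invented for illustration, and `column_profile` is a hypothetical helper; it does not reproduce the input encoding of PHD or PSIPRED.

```python
from collections import Counter


def column_profile(alignment):
    """Per-column residue frequencies for equal-length aligned sequences.

    Gap characters ('-') are ignored when computing frequencies.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        profile.append({residue: n / total for residue, n in counts.items()}
                       if total else {})
    return profile


# Toy alignment (NOT real data).
toy_alignment = ["AEKVF", "AEKIF", "ADKVF", "A-KLF"]
profile = column_profile(toy_alignment)
for position, frequencies in enumerate(profile):
    print(position, frequencies)
```

Columns dominated by a single residue are conserved, while columns with several residues show which substitutions the family tolerates at that position.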

53
Secondary Structure Prediction Summary

1st Generation - 1970s
Q3 = 50-55%
Chou & Fasman, GOR
2nd Generation - 1980s
Q3 = 60-65%
Qian & Sejnowski, GORIII
3rd Generation - 1990s
Q3 = 70-80%
PHD, PSIPRED
Many 3rd generation methods exist:
PSI-PRED - http://bioinf.cs.ucl.ac.uk/psipred/
JPRED - http://www.compbio.dundee.ac.uk/www-jpred/
PHD - http://www.embl-heidelberg.de/predictprotein/predictprotein.html
NNPRED - http://www.cmpharm.ucsf.edu/nomi/nnpredict.html

54
The sequence-structure gap
November 3, 2009

More than 8,835,796 known protein sequences,
61,086 experimentally determined structures
The gap is getting bigger.

55
Sequences vs. Structures
[Figure: chart comparing the numbers of sequences and structures; axis ticks run from 0 to 200,000]
56
Protein Secondary Structures (Simplifications)
α-HELIX
β-STRAND
COIL (everything else)
57
Beyond Secondary Structure, Before Tertiary
Structure

Supersecondary structures (motifs): small,
discrete, commonly observed aggregates of
secondary structures
helix-loop-helix
β-α-β
Domains: independent units of structure
β barrel
four-helix bundle
The terms domain and motif are sometimes used
interchangeably.

58
Helix-loop-helix
59
Beyond Secondary Structure, Before Tertiary
Structure
Folds: Compact folding arrangements of a
polypeptide chain (a protein or part of a
protein). The terms domain and fold are
sometimes used interchangeably.
60
EF Fold
Found in Calcium binding proteins such as
Calmodulin
61
Leucine Zipper
62
Rossmann Fold

The beta-alpha-beta-alpha-beta subunit
Often present in nucleotide-binding proteins

63
β sandwich
β barrel
64
α/β horseshoe
65
Four helix bundle

24 amino acid peptide with a hydrophobic surface
Assembles into 4 helix bundle through hydrophobic
regions
Maintains solubility of membrane proteins

66
TIM Barrel
67
PDB New Fold Growth
[Figure legend: Old fold, New fold]

The number of unique folds in nature is fairly
small (possibly a few thousand)
90% of new structures submitted to PDB in the
past three years have similar structural folds in
PDB

68
Protein data bank

http://www.rcsb.org/pdb/

69
Protein 3D structure data
The structure of a protein consists of the 3D
(X,Y,Z) coordinates of each non-hydrogen atom of
the protein. Some protein structures also include
coordinates of covalently linked prosthetic
groups, non-covalently linked ligand molecules,
or metal ions. For some purposes (e.g. structural
alignment) only the Cα coordinates are needed.
Example of PDB format:
                      X        Y        Z       occupancy  temp. factor
ATOM  18  N   GLY 27  40.315   161.004  11.211  1.00       10.11
ATOM  19  CA  GLY 27  39.049   160.737  10.462  1.00       14.18
ATOM  20  C   GLY 27  38.729   159.239  10.784  1.00       20.75
ATOM  21  O   GLY 27  39.507   158.484  11.404  1.00       21.88
Note: these ATOM records provide no information
about connectivity between atoms. The last two
numbers (occupancy, temperature factor) relate to
disorder of atomic positions in crystals.
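
The records shown on this slide can be read into a small data structure as sketched below. The sketch assumes the simplified, whitespace-separated layout used on the slide; real PDB files are fixed-column and carry additional fields (for example chain identifiers), so this parser is an illustration rather than a general PDB reader. `Atom` and `parse_atoms` are names introduced for this example.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float


# The four records from the slide, one per line.
SLIDE_RECORDS = """\
ATOM 18 N GLY 27 40.315 161.004 11.211 1.00 10.11
ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18
ATOM 20 C GLY 27 38.729 159.239 10.784 1.00 20.75
ATOM 21 O GLY 27 39.507 158.484 11.404 1.00 21.88
"""


def parse_atoms(text):
    """Parse simplified, whitespace-separated ATOM lines into Atom tuples."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or fields[0] != "ATOM":
            continue  # skip anything that is not a simplified ATOM line
        atoms.append(Atom(int(fields[1]), fields[2], fields[3], int(fields[4]),
                          *map(float, fields[5:10])))
    return atoms


atoms = parse_atoms(SLIDE_RECORDS)
# Keep only the alpha carbons, as is done for structural alignment.
alpha_carbons = [atom for atom in atoms if atom.name == "CA"]
print(alpha_carbons)
```

Filtering on the atom name `CA` gives the reduced Cα representation mentioned on the slide.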
70
(No Transcript)
71
Protein structure: Some computational tasks

Building a protein structure model from X-ray
data
Building a protein structure model from NMR data
Computing the energy for a given protein
structure (conformation)
Energy minimization: finding the structure with
the minimal energy according to some empirical
force fields.
Simulating the protein folding process (molecular
dynamics)
Structure visualization
Computing secondary structure from atomic
coordinates
Protein superposition, structural alignment
Protein fold classification
Threading: finding a fold (prototype structure)
that fits to a sequence
Docking: fitting ligands onto a protein surface
by molecular dynamics or energy minimization
Protein 3D structure prediction from sequence

72
Viewing protein structures

When looking at a protein structure, we may ask
the following types of questions
Is a particular residue on the inside or outside
of a protein?
Which amino acids interact with each other?
Which amino acids are in contact with a ligand
(DNA, peptide hormone, small molecule, etc.)?
Is an observed mutation likely to disturb the
protein structure?
Standard capabilities of protein structure
software:
Display of protein structures in different ways
(wireframe, backbone, sticks, spacefill, ribbon).
Highlighting of individual atoms, residues or
groups of residues
Calculation of interatomic distances
Advanced feature: Superposition of related
structures
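
Calculating an interatomic distance is a direct application of the Euclidean distance formula to two sets of (X, Y, Z) coordinates. The sketch below uses the N and Cα coordinates of Gly 27 from the PDB example earlier in this presentation; the 4.0 Å contact cutoff is an arbitrary value chosen for illustration, not one given in the slides.

```python
import math

# Coordinates (in angstroms) of two atoms of Gly 27 from the PDB example.
n_gly27 = (40.315, 161.004, 11.211)
ca_gly27 = (39.049, 160.737, 10.462)


def distance(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.dist(a, b)


def in_contact(a, b, cutoff=4.0):
    """True if two atoms lie within `cutoff` angstroms (cutoff is an assumption)."""
    return distance(a, b) <= cutoff


bond_length = distance(n_gly27, ca_gly27)
print(f"N-CA distance: {bond_length:.3f} A")
print("in contact:", in_contact(n_gly27, ca_gly27))
```

The same function answers questions such as which residues touch a ligand: compute the distances between ligand atoms and protein atoms and keep the pairs below the chosen cutoff.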

73
Example: c-abl oncoprotein SH2 domain, display:
wireframe
74
Example: c-abl oncoprotein SH2 domain, display:
sticks
75
Example: c-abl oncoprotein SH2 domain, display:
backbone
76
Example: c-abl oncoprotein SH2 domain, display:
spacefill
77
Example: c-abl oncoprotein SH2 domain, display:
ribbons
78
Predicting protein 3D structure

Goal: 3D structure from 1D sequence

An existing fold: homology modeling, fold
recognition
A new fold: ab-initio
79
Homology modeling

Based on two major observations:
The structure of a protein is uniquely defined by
its amino acid sequence.
Similar sequences adopt similar structures.
(Distantly related sequences may still fold into
similar structures.)

80
Homology modeling needs three items of input

The sequence of a protein with unknown 3D
structure, the "target sequence."
A 3D template: a structure having the highest
sequence identity with the target sequence (>30%
sequence identity)
A sequence alignment between the target sequence
and the template sequence
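
The 30% threshold above refers to the sequence identity between the aligned target and template. A minimal sketch of that calculation follows; it assumes identity is counted over aligned (non-gap) positions, which is one of several conventions in use, and the two aligned strings are invented for illustration.

```python
def percent_identity(aligned_target, aligned_template):
    """Percentage of identical residues among aligned, non-gap positions."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def usable_as_template(aligned_target, aligned_template, threshold=30.0):
    """Apply the >30% sequence identity rule of thumb from the slide."""
    return percent_identity(aligned_target, aligned_template) > threshold


# Toy alignment (NOT real data): 8 identical residues out of 10 aligned positions.
identity = percent_identity("TEAVDA-TAEK", "TEAIDAATGEK")
print(f"{identity:.1f}% identity")
```

Choosing a different denominator, such as the length of the shorter sequence, changes the value, so the convention should be stated whenever a threshold is applied.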

81
Homology Modeling: How it works

Find template
Align target sequence
with template
Generate model
- add loops
- add sidechains
Refine model

82
Two zones of homology modeling
Rost, Protein Eng. 1999
83
Automated Web-Based Homology Modelling

SWISS Model: http://www.expasy.org/swissmod/SWISS-MODEL.html
WHAT IF: http://www.cmbi.kun.nl/swift/servers/
The CPHModels Server: http://www.cbs.dtu.dk/services/CPHmodels/
3D Jigsaw: http://www.bmm.icnet.uk/3djigsaw/
SDSC1: http://cl.sdsc.edu/hm.html
EsyPred3D: http://www.fundp.ac.be/urbm/bioinfo/esypred/

84
Fold recognition: Protein Threading

Which of the known folds is likely to be similar
to the (unknown) fold of a new protein when only
its amino-acid sequence is known?

85
Protein Threading

The goal: find the correct sequence-structure
alignment between a target sequence and its
native-like fold in PDB
Energy function: knowledge (or statistics) based
rather than physics based
Should be able to distinguish correct structural
folds from incorrect structural folds
Should be able to distinguish correct
sequence-fold alignment from incorrect
sequence-fold alignments

86
Protein Threading

Basic premise
Statistics from Protein Data Bank (2,000
structures)
Chances for a protein to have a structural fold
that already exists in PDB are quite good.

The number of unique structural (domain) folds in
nature is fairly small (possibly a few thousand)
90% of new structures submitted to PDB in the
past three years have similar structural folds in
PDB
87
Protein Threading

Basic components
Structure database
Energy function
Sequence-structure alignment algorithm
Prediction reliability assessment

88
Protein Threading: structure database

Build a template database

89
Process

Threading - A protein fold recognition technique
that involves incrementally replacing the
sequence of a known protein structure with a
query sequence of unknown structure. The new
model structure is evaluated using a simple
heuristic measure of protein fold quality. The
process is repeated against all known 3D
structures until an optimal fit is found.
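
The threading loop described above can be sketched in a heavily simplified form. In this toy version each template is reduced to a string of burial states (b = buried, e = exposed), the query is slid along each template without gaps, and the fit is scored by rewarding hydrophobic residues at buried positions and polar residues at exposed positions. The template names, burial strings, hydrophobic residue set and scoring rule are all assumptions made for illustration; real threading methods allow gaps and use statistically derived energy functions.

```python
# Residues treated as hydrophobic in this toy scoring rule (an assumption).
HYDROPHOBIC = set("AVLIMFWC")


def fit_score(sequence, environment):
    """+1 for each residue whose hydrophobicity matches its environment, -1 otherwise."""
    score = 0
    for residue, env in zip(sequence, environment):
        buried = env == "b"
        score += 1 if (residue in HYDROPHOBIC) == buried else -1
    return score


def thread(query, templates):
    """Try every gapless placement of `query` on every template.

    Returns (best_score, template_name, offset), or None if nothing fits.
    """
    best = None
    for name, environment in templates.items():
        for offset in range(len(environment) - len(query) + 1):
            window = environment[offset:offset + len(query)]
            score = fit_score(query, window)
            if best is None or score > best[0]:
                best = (score, name, offset)
    return best


# Hypothetical template library: burial pattern per position.
toy_templates = {"fold_A": "bebebebe", "fold_B": "eebbeebbee"}
best_fit = thread("KVLDEF", toy_templates)
print(best_fit)
```

The structure of the loop (score the query on each known fold, keep the best) matches the process on the slide; everything inside `fit_score` is where a real method would substitute its own fold-quality measure.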

90
Fold recognition methods

3D-PSSM
http://www.sbg.bio.ic.ac.uk/3dpssm/
Fugue
http://www-cryst.bioc.cam.ac.uk/fugue/
HHpred
http://protevo.eb.tuebingen.mpg.de/toolkit/index.php?viewhhpred

91
ab-initio folding

Goal: Predict structure from first principles
Requires:
A free energy function, sufficiently close to the
true potential
A method for searching the conformational space
Advantages:
Works for novel folds
Shows that we understand the process
Disadvantages:
Applicable to short sequences only

92
Rosetta: Simons et al. 1997
http://www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
93
Qian et al. (Nature 2007) used distributed
computing to predict the 3D structure of a
protein from its amino-acid sequence. Here, their
predicted structure (grey) of a protein is
overlaid with the experimentally determined
crystal structure (color) of that protein. The
agreement between the two is excellent. The
prediction took 70,000 home computers about two
years.
94
Overall Approach
Protein Sequence
Database Searching, Multiple Sequence Alignment
Homologue in PDB?
Yes: Sequence-Structure Alignment, then Homology Modelling, then 3-D Protein Model
No: Secondary Structure Prediction, then Fold Recognition
Predicted Fold?
Yes: Sequence-Structure Alignment, then Homology Modelling, then 3-D Protein Model
No: Ab-initio Structure Prediction, then 3-D Protein Model
95
ExPASy Proteomics Server: Expert Protein Analysis
System

links to lots of protein prediction resources
http://expasy.org/

96
RMSDmin: The root mean square deviation (RMSD) is
the measure of the average distance between the
backbones of superimposed proteins. In the study
of globular protein conformations, one
customarily measures the similarity in
three-dimensional structure by the RMSD of the Cα
atomic coordinates after optimal rigid body
superposition. A widely used way to compare the
structures of biomolecules or solid bodies is to
translate or rotate one structure with respect
to the other to minimize the RMSD. This RMSDmin
can be used as a distance measure between two
proteins.
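
The RMSD itself is straightforward to compute once the two structures are superimposed. The sketch below covers only that final step: it assumes the two lists of Cα coordinates are already paired and already superimposed, and it does not perform the rotation and translation search needed to obtain RMSDmin (the Kabsch algorithm is one standard way to do that). The coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two paired lists of (x, y, z) points.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy coordinates (NOT real data): the second set is shifted by 1 along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
deviation = rmsd(model, reference)
print(f"RMSD: {deviation:.2f}")
```

In this toy case a pure translation would bring the two sets into exact agreement, so RMSDmin would be 0 even though the RMSD of the unfitted coordinates is 1; this is why the superposition step matters.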
