Title: Protein structure prediction: The holy grail of bioinformatics
1. Protein structure prediction: The holy grail of bioinformatics
2. Proteins: Four levels of structural organization
- Primary structure
- Secondary structure
- Tertiary structure
- Quaternary structure
3. Primary structure: the linear amino acid sequence
4. Secondary structure: spatial arrangement of amino-acid residues that are adjacent in the primary structure
5. α helix: A helical structure whose chain coils tightly as a right-handed screw, with all the side chains sticking outward in a helical array. The tight structure of the α helix is stabilized by same-strand hydrogen bonds between -NH groups and -CO groups spaced at four amino-acid residue intervals.
6. The β-pleated sheet is made of loosely coiled β strands that are stabilized by hydrogen bonds between -NH and -CO groups from adjacent strands.
7. An antiparallel β sheet. Adjacent β strands run in opposite directions. Hydrogen bonds between NH and CO groups connect each amino acid to a single amino acid on an adjacent strand, stabilizing the structure.
8. A parallel β sheet. Adjacent β strands run in the same direction. Hydrogen bonds connect each amino acid on one strand with two different amino acids on the adjacent strand.
10. Silk fibroin
11. α helix; β sheet (parallel and antiparallel); tight turns; flexible loops; irregular elements (random coil)
12. Tertiary structure: the three-dimensional structure of a protein
13. The tertiary structure is formed by the folding of secondary structures by covalent and non-covalent forces, such as hydrogen bonds, hydrophobic interactions, salt bridges between positively and negatively charged residues, as well as disulfide bonds between pairs of cysteines.
14. Quaternary structure: spatial arrangement of subunits and their contacts.
16. Holoproteins and apoproteins
Holoprotein = Apoprotein + Prosthetic group
17. Apohemoglobin = 2α + 2β
18. Prosthetic group: Heme
19. Hemoglobin = Apohemoglobin + 4 Heme
21. Christian B. Anfinsen (1916-1995)
Sela M, White FH, Anfinsen CB. 1959. The reductive cleavage of disulfide bonds and its application to problems of protein structure. Biochim. Biophys. Acta 31:417-426.
22. Not all proteins fold independently: chaperones.
24. The denaturation and renaturation of proteins
26. Reducing agents
- Ammonium thioglycolate (alkaline): pH 9.0-10
- Glyceryl monothioglycolate (acid): pH 6.5-8.2
27. Oxidant
28. What do we need to know in order to state that the tertiary structure of a protein has been solved?
- Ideally: we need to determine the position of all atoms and their connectivity.
- Less ideally: we need to determine the position of all Cα atoms (backbone structure).
29. Protein structure: Limitations and caveats
- Not all proteins or parts of proteins assume a well-defined 3D structure in solution.
- Protein structure is not static; there are various degrees of thermal motion for different parts of the structure.
- There may be a number of slightly different conformations in solution.
- Some proteins undergo conformational changes when interacting with other molecules.
30. Experimental protein structure determination
- X-ray crystallography
  - most accurate
  - in vitro
  - needs crystals
  - 100-200K per structure
- NMR
  - fairly accurate
  - in solution (closer to in vivo conditions)
  - no need for crystals
  - limited to very small proteins
- Cryo-electron microscopy
  - imaging technology
  - low resolution
31. Why predict protein structure?
- Structural knowledge gives some understanding of function and mechanism of action
- Predicted structures can be used in structure-based drug design
- It can help us understand the effects of mutations on structure and function
- It is a very interesting scientific problem (still unsolved in its most general form after more than 50 years of effort)
32. Secondary structure prediction
33. Secondary structure prediction
- Historically, the first structure prediction methods predicted secondary structure
- Can be used to improve alignment accuracy
- Can be used to detect domain boundaries within proteins with remote sequence homology
- Often the first step towards 3D structure prediction
- Informative for mutagenesis studies
34. Protein secondary structures (simplifications)
α-HELIX
β-STRAND
COIL (everything else)
35. Assumptions
- The entire information for forming secondary structure is contained in the primary sequence
- Side groups of residues will determine structure
- Examining windows of 13-17 residues is sufficient to predict secondary structure
- α-helices are 5-40 residues long
- β-strands are 5-10 residues long
36. Predicting secondary structure from primary structure
- accuracy 64-75%
- higher accuracy for α-helices than for β-sheets
- accuracy is dependent on protein family
- predictions of engineered (artificial) proteins are less accurate
37. A surprising result!
38. The Chameleon sequence
sequence 1 sequence 2
TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
Replace both sequences with an engineered peptide (chameleon)
TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK
α-helix (first copy of the peptide), β-strand (second copy)
Source: Minor and Kim. 1996. Nature 380:730-734
39. Measures of prediction accuracy
- Qindex and Q3
- Correlation coefficient
40. Qindex
- Qindex (Qhelix, Qstrand, Qcoil, Q3)
  - percentage of residues correctly predicted as α-helix, β-strand, coil, or for all 3 conformations.
- Drawbacks
  - even a random assignment of structure can achieve a high score (Holley & Karplus 1991)
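The Q indices are simple enough to compute by hand. The sketch below is only an illustration: it assumes a three-letter alphabet (H = helix, E = strand, C = coil), and the function name `q_index` and the example strings are made up for this purpose.

```python
def q_index(observed: str, predicted: str) -> dict:
    """Per-state accuracy (QH, QE, QC) and overall accuracy (Q3), in percent.

    Both arguments are equal-length strings over H (helix), E (strand)
    and C (coil); this three-letter alphabet is an assumption of the sketch.
    """
    if len(observed) != len(predicted):
        raise ValueError("sequences must have equal length")
    scores = {}
    for state in "HEC":
        # Positions whose observed state is `state`
        positions = [i for i, s in enumerate(observed) if s == state]
        if positions:
            correct = sum(predicted[i] == state for i in positions)
            scores["Q" + state] = 100.0 * correct / len(positions)
    total_correct = sum(o == p for o, p in zip(observed, predicted))
    scores["Q3"] = 100.0 * total_correct / len(observed)
    return scores


# Made-up example strings
print(q_index("CCHHHHHHCCEEEECC", "CCHHHHCCCCEEECCC"))

# The drawback noted on the slide: predicting the dominant state
# everywhere still gives a high Q3 when one state dominates.
print(q_index("CCCCCCCCHHCCCCCCCCCC", "C" * 20)["Q3"])  # 90.0
```

The second call shows why Q3 alone can mislead: an all-coil "prediction" that finds no helix at all still scores 90%.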
41. Correlation coefficient
- pα: true positives
- oα: false positives (overpredicted)
- nα: true negatives
- uα: false negatives (underpredicted)
- C_α = 1 (100%) for a perfect prediction
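The formula itself did not survive in the transcript. A common way to combine the four counts above is the Matthews correlation coefficient, C = (p·n − u·o) / sqrt((n+u)(n+o)(p+u)(p+o)); the sketch below assumes that form, which may not be exactly what the slide showed. The function name and the zero-denominator convention are choices made here.

```python
import math


def state_correlation(observed: str, predicted: str, state: str) -> float:
    """Matthews-style correlation for one state (assumed form, see text).

    p = true positives, o = overpredicted (false positives),
    n = true negatives, u = underpredicted (false negatives).
    """
    p = o = n = u = 0
    for obs, pred in zip(observed, predicted):
        if pred == state and obs == state:
            p += 1
        elif pred == state:
            o += 1
        elif obs == state:
            u += 1
        else:
            n += 1
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    if denom == 0:
        return 0.0  # convention chosen here for the undefined case
    return (p * n - u * o) / denom


print(state_correlation("CCHHHHCC", "CCHHHHCC", "H"))  # 1.0
print(state_correlation("CCHHHHCC", "CCCCCCCC", "H"))  # 0.0
```

Unlike Q3, an all-coil prediction earns no credit here: the second call returns 0 for the helix state.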
42. Methods of secondary structure prediction
43. First generation methods: single residue statistics
- Chou & Fasman (1974, 1978)
- Some residues have particular secondary-structure preferences. Based on empirical frequencies of residues in α-helices, β-sheets, and coils.
- Examples: Glu → α-helix; Val → β-strand
44. Chou-Fasman method
46. Chou-Fasman method
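The residue preferences behind single-residue statistics can be expressed as propensities: how over- or under-represented an amino acid is in a given state compared with its overall frequency. The sketch below covers only that counting step, not the nucleation and extension rules of the full method. The function name and the tiny data set are invented for illustration and are far too small to give meaningful values.

```python
from collections import Counter


def propensities(sequences, structures, state="H"):
    """Propensity of each residue type for `state`.

    Defined here as (fraction of `state` residues that are this amino
    acid) / (fraction of all residues that are this amino acid).
    Values above 1 mean the residue is over-represented in `state`.
    """
    all_counts = Counter()
    state_counts = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            all_counts[aa] += 1
            if s == state:
                state_counts[aa] += 1
    total = sum(all_counts.values())
    total_state = sum(state_counts.values())
    if total_state == 0:
        raise ValueError("no residues observed in state " + state)
    return {
        aa: (state_counts[aa] / total_state) / (all_counts[aa] / total)
        for aa in all_counts
    }


# Toy training data (made up): sequences with per-residue H/E/C labels
seqs = ["AEELKVVG", "GVVEEALK"]
labels = ["HHHHHEEC", "CEEHHHHH"]
helix = propensities(seqs, labels, "H")
strand = propensities(seqs, labels, "E")
print(helix["E"], strand["V"])
```

In this toy set Glu comes out as helix-favouring and Val as strand-favouring, in line with the examples on the slide.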
47. Second generation methods: segment statistics
- Similar to single-residue methods, but incorporating additional information (adjacent residues, segmental statistics).
- Problems
  - Low accuracy: Q3 below 66%.
  - Q3 of β-strands (E): 28-48%.
  - Predicted structures were too short.
48. The GOR method
- developed by Garnier, Osguthorpe & Robson
- builds on Chou-Fasman Pij values
- evaluates each residue PLUS the adjacent 8 N-terminal and 8 carboxyl-terminal residues
- sliding window of 17 residues
- underpredicts β-strand regions
- GOR method accuracy: Q3 64%
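The sliding-window idea can be sketched as follows: the state of each residue is chosen from contributions of the residue itself and its 8 neighbours on either side. The scoring table `toy_info` below is entirely hypothetical; the real GOR parameters are derived from statistics over known structures and are not reproduced here.

```python
HALF_WINDOW = 8  # 8 residues on each side gives a window of 17


def predict_window(sequence, info, states="HEC"):
    """Assign each residue the state with the highest summed window score.

    info[state][(offset, amino_acid)] is the contribution of seeing
    `amino_acid` at `offset` from the central residue (hypothetical table).
    """
    prediction = []
    for i in range(len(sequence)):
        totals = {}
        for state in states:
            total = 0.0
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
                j = i + offset
                if 0 <= j < len(sequence):  # window is truncated at chain ends
                    total += info[state].get((offset, sequence[j]), 0.0)
            totals[state] = total
        prediction.append(max(states, key=lambda s: totals[s]))
    return "".join(prediction)


# Hypothetical toy table: Glu favours helix, Val favours strand,
# and coil gets a constant baseline from the central residue.
toy_info = {
    "H": {(0, "E"): 1.0, (-1, "E"): 0.3, (1, "E"): 0.3},
    "E": {(0, "V"): 1.0, (-1, "V"): 0.3, (1, "V"): 0.3},
    "C": {(0, aa): 0.5 for aa in "ACDEFGHIKLMNPQRSTVWY"},
}
print(predict_window("GGEEEEGGVVVGG", toy_info))  # CCHHHHCCEEECC
```

The toy table only fills offsets -1 to +1, so most of the 17-residue window contributes nothing here; a real table would have an entry for every offset and amino acid.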
49. Third generation methods
- Third generation methods reached 77% accuracy.
- They consist of two new ideas
  1. A biological idea: using evolutionary information based on conservation analysis of multiple sequence alignments.
  2. A technological idea: using neural networks.
50. Artificial neural networks
An attempt to imitate the human brain (assuming that this is the way it works).
51. Neural network models
- machine learning approach
- provide training sets of structures (e.g. α-helices, non-α-helices)
- computers are trained to recognize patterns in known secondary structures
- provide test set (proteins with known structures)
- accuracy 70-75%
52. Reasons for improved accuracy
- Align sequence with other related proteins of the same protein family
- Find members that have a known structure
- If there are significant matches between structure and sequence, assign secondary structures to corresponding residues
53. New and improved third-generation methods
- Exploit evolutionary information. Based on conservation analysis of multiple sequence alignments.
- PHD (Q3 70%)
  - Rost B, Sander C. (1993) J. Mol. Biol. 232, 584-599.
- PSIPRED (Q3 77%)
  - Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.
  - Arguably remains the top secondary structure prediction method (won all CASP competitions since 1998).
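The evolutionary information these methods exploit comes from the columns of a multiple sequence alignment. As a rough sketch, the code below turns an alignment into per-column residue frequencies and a simple conservation score. The function names and the toy alignment are invented here, and the actual predictors feed much richer profiles into neural networks.

```python
from collections import Counter


def column_profile(alignment):
    """Per-column residue frequencies for equal-length aligned sequences.

    Gap characters ('-') are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile


def conservation(profile):
    """Frequency of the most common residue per column (1.0 = fully conserved)."""
    return [max(col.values()) if col else 0.0 for col in profile]


# Toy alignment (made up)
msa = ["MKVLA", "MKILA", "MRVLG", "MK-LA"]
print(conservation(column_profile(msa)))
```

Columns that score 1.0 are fully conserved across the family; variable columns score lower.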
54. Secondary structure prediction summary
- 1st generation - 1970s
  - Q3 50-55%
  - Chou-Fasman, GOR
- 2nd generation - 1980s
  - Q3 60-65%
  - Qian & Sejnowski, GOR III
- 3rd generation - 1990s
  - Q3 70-80%
  - PHD, PSIPRED
- Many 3rd generation methods exist
  - PSI-PRED - http://bioinf.cs.ucl.ac.uk/psipred/
  - JPRED - http://www.compbio.dundee.ac.uk/www-jpred/
  - PHD - http://www.embl-heidelberg.de/predictprotein/predictprotein.html
  - NNPRED - http://www.cmpharm.ucsf.edu/nomi/nnpredict.html
55. The sequence-structure gap
September 13, 2011
- More than 13,137,813 known protein sequences, 76,495 experimentally determined structures.
56. The sequence-structure gap
The gap is getting bigger.
[Figure: number of known sequences and structures over time; vertical axis from 0 to 200,000]
57. Protein secondary structures (simplifications)
α-HELIX
β-STRAND
COIL (everything else)
58. Beyond secondary structure, before tertiary structure
- Supersecondary structures (motifs): small, discrete, commonly observed aggregates of secondary structures
  - helix-loop-helix
  - β-α-β
- Domains: independent units of structure
  - β barrel
  - four-helix bundle
- The terms domain and motif are sometimes used interchangeably.
59. Helix-loop-helix
60. Beyond secondary structure, before tertiary structure
Folds: compact folding arrangements of a polypeptide chain (a protein or part of a protein). The terms domain and fold are sometimes used interchangeably.
61. EF fold
Found in calcium-binding proteins such as calmodulin
62. Leucine zipper
63. Rossmann fold
- The beta-alpha-beta-alpha-beta subunit
- Often present in nucleotide-binding proteins
64. β sandwich
β barrel
65. α/β horseshoe
66. Four-helix bundle
- 24 amino acid peptide with a hydrophobic surface
- Assembles into a 4-helix bundle through hydrophobic regions
- Maintains solubility of membrane proteins
67. TIM barrel
68. PDB new fold growth
[Figure legend: old fold, new fold]
- The number of unique folds in nature is fairly small (possibly a few thousand)
- 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
69. Protein Data Bank
70. Protein 3D structure data
The structure of a protein consists of the 3D (X, Y, Z) coordinates of each non-hydrogen atom of the protein. Some protein structures also include coordinates of covalently linked prosthetic groups, non-covalently linked ligand molecules, or metal ions. For some purposes (e.g. structural alignment) only the Cα coordinates are needed.
Example of PDB format:
                            X        Y        Z    occupancy  temp. factor
ATOM   18   N    GLY   27   40.315   161.004  11.211   1.00     10.11
ATOM   19   CA   GLY   27   39.049   160.737  10.462   1.00     14.18
ATOM   20   C    GLY   27   38.729   159.239  10.784   1.00     20.75
ATOM   21   O    GLY   27   39.507   158.484  11.404   1.00     21.88
Note: the ATOM records shown provide no information about connectivity between atoms. The last two numbers (occupancy, temperature factor) relate to disorder of atomic positions in crystals.
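Records like the ones in this example can be read with a few lines of code. The sketch below splits the simplified records from the slide on whitespace, which is an assumption that only fits this example: real PDB files use fixed column positions and carry extra fields such as the chain identifier, so production code should slice by column or use an established parser. The function and field names are made up here.

```python
import math

EXAMPLE = """\
ATOM 18 N GLY 27 40.315 161.004 11.211 1.00 10.11
ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18
ATOM 20 C GLY 27 38.729 159.239 10.784 1.00 20.75
ATOM 21 O GLY 27 39.507 158.484 11.404 1.00 21.88
"""


def parse_atoms(text):
    """Parse the simplified ATOM records shown on the slide.

    Assumes ten whitespace-separated fields per record; this does NOT
    hold for real fixed-column PDB files.
    """
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or fields[0] != "ATOM":
            continue
        atoms.append({
            "serial": int(fields[1]),
            "name": fields[2],
            "residue": fields[3],
            "res_num": int(fields[4]),
            "xyz": tuple(float(v) for v in fields[5:8]),
            "occupancy": float(fields[8]),
            "b_factor": float(fields[9]),
        })
    return atoms


atoms = parse_atoms(EXAMPLE)

# Keep only the C-alpha coordinates, as used for structural alignment
ca_coords = [a["xyz"] for a in atoms if a["name"] == "CA"]
print(ca_coords)  # [(39.049, 160.737, 10.462)]

# Interatomic distance between the backbone N and CA of this residue
by_name = {a["name"]: a["xyz"] for a in atoms}
print(round(math.dist(by_name["N"], by_name["CA"]), 3))
```

`math.dist` requires Python 3.8 or later.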
72. Protein structure: Some computational tasks
- Building a protein structure model from X-ray data
- Building a protein structure model from NMR data
- Computing the energy for a given protein structure (conformation)
- Energy minimization: finding the structure with the minimal energy according to some empirical force fields
- Simulating the protein folding process (molecular dynamics)
- Structure visualization
- Computing secondary structure from atomic coordinates
- Protein superposition, structural alignment
- Protein fold classification
- Threading: finding a fold (prototype structure) that fits a sequence
- Docking: fitting ligands onto a protein surface by molecular dynamics or energy minimization
- Protein 3D structure prediction from sequence
73. Viewing protein structures
- When looking at a protein structure, we may ask the following types of questions
  - Is a particular residue on the inside or outside of a protein?
  - Which amino acids interact with each other?
  - Which amino acids are in contact with a ligand (DNA, peptide hormone, small molecule, etc.)?
  - Is an observed mutation likely to disturb the protein structure?
- Standard capabilities of protein structure software
  - Display of protein structures in different ways (wireframe, backbone, sticks, spacefill, ribbon)
  - Highlighting of individual atoms, residues or groups of residues
  - Calculation of interatomic distances
  - Advanced feature: superposition of related structures
74. Example: c-abl oncoprotein SH2 domain, display: wireframe
75. Example: c-abl oncoprotein SH2 domain, display: sticks
76. Example: c-abl oncoprotein SH2 domain, display: backbone
77. Example: c-abl oncoprotein SH2 domain, display: spacefill
78. Example: c-abl oncoprotein SH2 domain, display: ribbons
79. Predicting protein 3D structure
- Goal: 3D structure from 1D sequence
- An existing fold: homology modeling, fold recognition
- A new fold: ab initio
80. Homology modeling
- Based on two major observations (and some simplifications)
  - The structure of a protein is uniquely defined by its amino acid sequence.
  - Similar sequences adopt similar structures. (Distantly related sequences may still fold into similar structures.)
81. Homology modeling needs three items of input
- The sequence of a protein with unknown 3D structure, the "target sequence."
- A 3D template: a structure having the highest sequence identity with the target sequence (>30% sequence identity)
- A sequence alignment between the target sequence and the template sequence
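Whether a template passes the 30% threshold can be checked directly from the target-template alignment. The sketch below counts identical residues over the columns where both sequences have a residue. Other definitions of percent identity exist (for example, dividing by the shorter sequence length), so the denominator is a choice made here, and the sequences are invented.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        aligned += 1
        identical += (a == b)
    return 100.0 * identical / aligned if aligned else 0.0


# Made-up aligned pair
target = "MKT-AYIAKQR"
template = "MKSLAYLA-QR"
identity = percent_identity(target, template)
print(round(identity, 1), "usable as template:", identity > 30.0)
```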
82. Homology modeling: How it works
- Find template
- Align target sequence with template
- Generate model
  - add loops
  - add sidechains
- Refine model
83. Two zones of homology modeling
Rost, Protein Eng. 1999
84. Automated web-based homology modelling
- SWISS Model: http://www.expasy.org/swissmod/SWISS-MODEL.html
- WHAT IF: http://www.cmbi.kun.nl/swift/servers/
- The CPHModels Server: http://www.cbs.dtu.dk/services/CPHmodels/
- 3D Jigsaw: http://www.bmm.icnet.uk/3djigsaw/
- SDSC1: http://cl.sdsc.edu/hm.html
- EsyPred3D: http://www.fundp.ac.be/urbm/bioinfo/esypred/
85. Fold recognition: protein threading
- Which of the known folds is likely to be similar to the (unknown) fold of a new protein when only its amino-acid sequence is known?
86. Protein threading
- The goal: find the correct sequence-structure alignment between a target sequence and its native-like fold in PDB
- Energy function: knowledge (or statistics) based rather than physics based
  - Should be able to distinguish correct structural folds from incorrect structural folds
  - Should be able to distinguish correct sequence-fold alignments from incorrect sequence-fold alignments
87. Protein threading
- Basic premise
  - Statistics from Protein Data Bank (2,000 structures)
  - Chances for a protein to have a structural fold that already exists in PDB are quite good.
- The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)
- 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
88. Protein threading
- Basic components
  - Structure database
  - Energy function
  - Sequence-structure alignment algorithm
  - Prediction reliability assessment
89. Protein threading: structure database
- Build a template database
90. Process
- Threading: a protein fold recognition technique that involves incrementally replacing the sequence of a known protein structure with a query sequence of unknown structure. The new model structure is evaluated using a simple heuristic measure of protein fold quality. The process is repeated against all known 3D structures until an optimal fit is found.
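The threading loop can be illustrated with a deliberately crude toy. Here each template is reduced to a string of per-position environments ('b' = buried, 'e' = exposed), and the fit measure rewards hydrophobic residues in buried positions and polar residues in exposed ones. The templates, the hydrophobic set and the scoring rule are hypothetical stand-ins for a real structure database and knowledge-based energy function, and real threading programs also allow gaps in the alignment.

```python
HYDROPHOBIC = set("AVILMFWC")  # simplistic classification assumed here


def fit_score(query: str, environment: str) -> int:
    """+1 for each residue whose hydrophobicity matches its environment, else -1."""
    score = 0
    for aa, env in zip(query, environment):
        buried = env == "b"
        hydrophobic = aa in HYDROPHOBIC
        score += 1 if buried == hydrophobic else -1
    return score


def thread(query: str, templates: dict):
    """Slide the query over every template (no gaps) and keep the best fit.

    Returns (score, template_name, offset), or None if nothing fits.
    """
    best = None
    for name, env in templates.items():
        for offset in range(len(env) - len(query) + 1):
            s = fit_score(query, env[offset:offset + len(query)])
            if best is None or s > best[0]:
                best = (s, name, offset)
    return best


# Hypothetical template "database"
templates = {"fold_A": "eebbeebe", "fold_B": "bbbbeeee"}
print(thread("DLLKKV", templates))  # (6, 'fold_A', 1)
```

The query matches `fold_A` perfectly at offset 1, so that placement is reported as the optimal fit.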
91. Fold recognition methods
- 3D-PSSM
  - http://www.sbg.bio.ic.ac.uk/3dpssm/
- Fugue
  - http://www-cryst.bioc.cam.ac.uk/fugue/
- HHpred
  - http://protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred
92. Ab initio folding
- Goal: predict structure from first principles
- Requires
  - A free energy function, sufficiently close to the true potential
  - A method for searching the conformational space
- Advantages
  - Works for novel folds
  - Shows that we understand the process
- Disadvantages
  - Applicable to short sequences only
93. Rosetta: Simons et al. 1997
http://www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
94. Qian et al. (Nature 2007) used distributed computing to predict the 3D structure of a protein from its amino-acid sequence. Here, their predicted structure (grey) of a protein is overlaid with the experimentally determined crystal structure (color) of that protein. The agreement between the two is excellent. The computation used 70,000 home computers for about two years.
95. Overall approach
Protein sequence → database searching and multiple sequence alignment → homologue in PDB?
- Yes: sequence-structure alignment → homology modelling → 3-D protein model
- No: secondary structure prediction → fold recognition → predicted fold?
  - Yes: sequence-structure alignment → homology modelling → 3-D protein model
  - No: ab initio structure prediction → 3-D protein model
96. ExPASy Proteomics Server: Expert Protein Analysis System
- links to lots of protein prediction resources
- http://expasy.org/
97. RMSDmin
The root mean square deviation (RMSD) is the measure of the average distance between the backbones of superimposed proteins. In the study of globular protein conformations, one customarily measures the similarity in three-dimensional structure by the RMSD of the Cα atomic coordinates after optimal rigid body superposition. A widely used way to compare the structures of biomolecules or solid bodies is to translate and rotate one structure with respect to the other to minimize the RMSD. This RMSDmin can be used as a distance measure between two proteins.
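One standard way to find the optimal rigid-body superposition is the Kabsch algorithm. The slide does not name a method, so treat the sketch below as one possible implementation rather than the one the author had in mind. It assumes NumPy is available and that the two coordinate sets list equivalent atoms (for example, matched Cα atoms) in the same order. The function name `rmsd_min` and the test coordinates are made up.

```python
import numpy as np


def rmsd_min(P, Q) -> float:
    """Minimum RMSD between two (N, 3) coordinate arrays after optimal
    translation and rotation (Kabsch algorithm)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("expected two arrays of shape (N, 3)")
    # Remove translation: move both centroids to the origin
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the remaining deviation
    diff = (R @ P0.T).T - Q0
    return float(np.sqrt((diff ** 2).sum() / len(P)))


# Made-up coordinates: Q is P rotated by 30 degrees about z and shifted,
# so the minimum RMSD should be essentially zero.
P = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0],
              [0.0, 1.5, 1.0], [2.0, 0.5, 2.0]])
theta = np.pi / 6
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([5.0, -3.0, 2.0])
print(rmsd_min(P, Q))
```

Because translation and rotation are both removed before measuring, two copies of the same structure in different orientations give an RMSDmin of (numerically) zero, which is what makes it usable as a distance measure between proteins.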