Protein structure prediction: The holy grail of bioinformatics

Provided by: nsm5


1
Protein structure prediction: The holy grail of
bioinformatics
2
Proteins: Four levels of structural organization
  • Primary structure
  • Secondary structure
  • Tertiary structure
  • Quaternary structure
3
Primary structure: the linear amino acid sequence
4
Secondary structure: spatial arrangement of
amino-acid residues that are adjacent in the
primary structure
5
α helix: A helical structure, whose chain coils
tightly as a right-handed screw with all the side
chains sticking outward in a helical array. The
tight structure of the α helix is stabilized by
same-strand hydrogen bonds between -NH groups and
-CO groups spaced at four amino-acid residue
intervals.
6
The β-pleated sheet is made of loosely coiled β
strands that are stabilized by hydrogen bonds between
-NH and -CO groups from adjacent strands.
7
An antiparallel β sheet. Adjacent β strands run
in opposite directions. Hydrogen bonds between NH
and CO groups connect each amino acid to a single
amino acid on an adjacent strand, stabilizing the
structure.
8
A parallel β sheet. Adjacent β strands run in the
same direction. Hydrogen bonds connect each amino
acid on one strand with two different amino acids
on the adjacent strand.
9
(No Transcript)
10
Silk fibroin
11
  • α helix
  • β sheet (parallel and antiparallel)
  • tight turns
  • flexible loops
  • irregular elements (random coil)
12
Tertiary structure: three-dimensional structure
of protein
13
The tertiary structure is formed by the folding
of secondary structures by covalent and
non-covalent forces, such as hydrogen bonds,
hydrophobic interactions, salt bridges between
positively and negatively charged residues, as
well as disulfide bonds between pairs of
cysteines.
14
Quaternary structure: spatial arrangement of
subunits and their contacts.
15
(No Transcript)
16
Holoproteins and Apoproteins
Holoprotein = Apoprotein + Prosthetic group
17
Apohemoglobin = 2α + 2β
18
Prosthetic group
Heme
19
Hemoglobin = Apohemoglobin + 4 Heme
20
(No Transcript)
21
Christian B. Anfinsen (1916-1995)
Sela M, White FH, Anfinsen CB. 1959. The
reductive cleavage of disulfide bonds and its
application to problems of protein structure.
Biochim. Biophys. Acta 31:417-426.
22
(No Transcript)
23
The denaturation and renaturation of proteins
24
(No Transcript)
25
Reducing agents
  • Ammonium thioglycolate (alkaline): pH 9.0-10
  • Glyceryl monothioglycolate (acid): pH 6.5-8.2
26
Oxidant
27
What do we need to know in order to state that
the tertiary structure of a protein has been
solved?
  • Ideally: we need to determine the position of all
    atoms and their connectivity.
  • Less ideally: we need to determine the position
    of all Cα atoms (backbone structure).

28
Protein structure: Limitations and caveats
  • Not all proteins or parts of proteins assume a
    well-defined 3D structure in solution.
  • Protein structure is not static; there are
    various degrees of thermal motion for different
    parts of the structure.
  • There may be a number of slightly different
    conformations in solution.
  • Some proteins undergo conformational changes when
    interacting with other molecules.

29
Experimental Protein Structure Determination
  • X-ray crystallography
      • most accurate
      • in vitro
      • needs crystals
      • 100-200K per structure
  • NMR
      • fairly accurate
      • in vivo
      • no need for crystals
      • limited to very small proteins
  • Cryo-electron-microscopy
      • imaging technology
      • low resolution

30
Why predict protein structure?
  • Structural knowledge gives some understanding of
    function and mechanism of action
  • Predicted structures can be used in
    structure-based drug design
  • It can help us understand the effects of
    mutations on structure and function
  • It is a very interesting scientific problem
    (still unsolved in its most general form after
    more than 50 years of effort)

31
Secondary structure prediction
32
Secondary structure prediction
  • Historically, the first structure prediction methods
    predicted secondary structure
  • Can be used to improve alignment accuracy
  • Can be used to detect domain boundaries within
    proteins with remote sequence homology
  • Often the first step towards 3D structure
    prediction
  • Informative for mutagenesis studies

33
Protein Secondary Structures (Simplifications)
α-HELIX
β-STRAND
COIL (everything else)
34
Assumptions
  • The entire information for forming secondary
    structure is contained in the primary sequence
  • side groups of residues will determine structure
  • examining windows of 13-17 residues is sufficient
    to predict secondary structure
  • α-helices: 5-40 residues long
  • β-strands: 5-10 residues long

35
Predicting Secondary Structure From Primary
Structure
  • accuracy: 64-75%
  • higher accuracy for α-helices than for β-sheets
  • accuracy is dependent on protein family
  • predictions of engineered (artificial) proteins
    are less accurate

36
A surprising result!
  • Chameleon sequences

37
The Chameleon sequence
sequence 1 sequence 2
TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
Replace both sequences with an engineered peptide
(chameleon)
TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK
α-helix β-strand
Source: Minor and Kim. 1996. Nature 380:730-734
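The two 40-residue sequences on this slide can be compared directly. The sketch below is only an illustration: it searches the engineered peptide for its longest repeated segment and shows which segments of the original sequence sit at the same positions. The function name `longest_repeat` is hypothetical, and the brute-force search is chosen for readability, not speed.

```python
ORIGINAL   = "TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK"
ENGINEERED = "TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK"

def longest_repeat(seq: str):
    """Return (segment, first_start, second_start) for the longest
    substring that occurs at least twice in seq (brute force)."""
    n = len(seq)
    for length in range(n - 1, 0, -1):          # try long segments first
        seen = {}
        for start in range(n - length + 1):
            sub = seq[start:start + length]
            if sub in seen:
                return sub, seen[sub], start
            seen[sub] = start
    return "", -1, -1

segment, first, second = longest_repeat(ENGINEERED)
print(segment, first, second)

# The segments of the original sequence at the same two positions
# (labelled "sequence 1" and "sequence 2" on the slide).
print(ORIGINAL[first:first + len(segment)])
print(ORIGINAL[second:second + len(segment)])
```

The same 11-residue segment appears twice in the engineered peptide, once in place of each original segment, which is what makes the result on this slide surprising for window-based predictors.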
38
Measures of prediction accuracy
  • Qindex and Q3
  • Correlation coefficient

39
Qindex
  • Qindex (Qhelix, Qstrand, Qcoil, Q3)
      • percentage of residues correctly predicted as
        α-helix, β-strand, coil, or for all 3
        conformations.
  • Drawbacks
      • even a random assignment of structure can
        achieve a high score (Holley & Karplus 1991)
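A minimal sketch of how these percentages can be computed, assuming the observed and predicted structures are given as equal-length strings over the three states H (helix), E (strand) and C (coil). The function name `q_scores` and the example strings are made up for illustration.

```python
def q_scores(observed: str, predicted: str) -> dict:
    """Return Q3 and the per-state Q values (in percent)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    scores = {}
    # Q3: share of all residues whose predicted state equals the observed one.
    correct = sum(o == p for o, p in zip(observed, predicted))
    scores["Q3"] = 100.0 * correct / len(observed)
    # Per-state Q: of the residues observed in a state, the share predicted in it.
    for state in "HEC":
        n_observed = observed.count(state)
        n_correct = sum(o == p == state for o, p in zip(observed, predicted))
        scores["Q" + state] = (100.0 * n_correct / n_observed
                               if n_observed else float("nan"))
    return scores

observed  = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEHHCC"
print(q_scores(observed, predicted))

# The drawback in practice: a prediction with no information at all
# (coil everywhere) still gets a non-zero Q3.
print(q_scores(observed, "C" * len(observed))["Q3"])
```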

40
Correlation coefficient
Cα = 1 (100%)
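The slide does not spell out the formula. One common choice in the secondary-structure literature is the Matthews correlation coefficient computed separately for each state, and the sketch below assumes that this is what is meant. A value of 1 corresponds to a perfect prediction for that state. The function name and example strings are illustrative only.

```python
import math

def state_correlation(observed: str, predicted: str, state: str = "H") -> float:
    """Matthews correlation coefficient for one state (e.g. 'H' for helix)."""
    tp = tn = fp = fn = 0
    for o, p in zip(observed, predicted):
        if o == state and p == state:
            tp += 1          # correctly predicted in the state
        elif o != state and p != state:
            tn += 1          # correctly predicted outside the state
        elif p == state:
            fp += 1          # over-prediction
        else:
            fn += 1          # under-prediction
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

observed  = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEHHCC"
print(state_correlation(observed, predicted, "H"))
print(state_correlation(observed, observed, "H"))   # perfect prediction
```

Unlike a plain percentage, this coefficient penalizes both over- and under-prediction of a state.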
41
Methods of secondary structure prediction
42
First generation methods: single residue
statistics
  • Chou & Fasman (1974, 1978)
  • Some residues have particular
    secondary-structure preferences. Based on
    empirical frequencies of residues in α-helices,
    β-sheets, and coils.
  • Examples: Glu → α-helix, Val → β-strand
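A minimal sketch of how such single-residue preferences can be derived from empirical frequencies. It assumes the usual propensity ratio: the fraction of a residue type found in a given state, divided by the overall fraction of residues in that state. The sequence and structure strings are made-up toy data, not real statistics, and the function name is hypothetical.

```python
from collections import Counter

def propensities(sequence: str, structure: str, state: str) -> dict:
    """Propensity of each residue type for one structural state.

    A value above 1 means the residue is found in that state more often
    than expected from the overall composition.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    in_state = [aa for aa, s in zip(sequence, structure) if s == state]
    if not in_state:
        raise ValueError("state does not occur in the structure string")
    state_fraction = len(in_state) / len(sequence)
    totals = Counter(sequence)
    state_counts = Counter(in_state)
    return {aa: (state_counts[aa] / totals[aa]) / state_fraction
            for aa in totals}

# Toy data. In the structure string, H = helix, E = strand, C = coil;
# in the sequence string, E is the one-letter code for Glu.
SEQUENCE  = "EEAEVVKVEA"
STRUCTURE = "HHHHEEEECC"

helix = propensities(SEQUENCE, STRUCTURE, "H")
strand = propensities(SEQUENCE, STRUCTURE, "E")
print("Glu helix propensity:", helix["E"])
print("Val strand propensity:", strand["V"])
```

In this toy data set Glu comes out as a helix former and Val as a strand former, matching the examples on the slide. Real propensity tables are derived from many known structures.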

43
Chou-Fasman method
44
(No Transcript)
45
Chou-Fasman Method
  • Accuracy: Q3 50-60%

46
Second generation methods: segment statistics
  • Similar to single-residue methods, but
    incorporating additional information (adjacent
    residues, segmental statistics).
  • Problems
      • Low accuracy: Q3 below 66% (results).
      • Q3 of β-strands (E): 28-48%.
      • Predicted structures were too short.

47
The GOR method
  • developed by Garnier, Osguthorpe & Robson
  • builds on Chou-Fasman Pij values
  • evaluates each residue PLUS adjacent 8 N-terminal
    and 8 carboxyl-terminal residues
  • sliding window of 17 residues
  • underpredicts β-strand regions
  • GOR method accuracy: Q3 64%
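The sliding window can be sketched as follows. This is not the GOR method itself: the real method uses statistically derived tables that depend on the position within the window, whereas `toy_score` below is a hypothetical stand-in that ignores the offset. Only the window mechanics (8 residues on each side, 17 in total) are taken from the slide.

```python
STATES = "HEC"   # helix, strand, coil

def windows(sequence: str, half_width: int = 8, pad: str = "-"):
    """One window per residue, centred on it; ends are padded."""
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1            # 17 for half_width = 8
    return [padded[i:i + size] for i in range(len(sequence))]

def predict(sequence: str, score, half_width: int = 8) -> str:
    """Assign to each residue the state with the highest summed score
    over its window. score(state, offset, aa) is supplied by the caller."""
    prediction = []
    for window in windows(sequence, half_width):
        totals = {
            state: sum(score(state, position - half_width, aa)
                       for position, aa in enumerate(window) if aa != "-")
            for state in STATES
        }
        prediction.append(max(STATES, key=totals.get))
    return "".join(prediction)

def toy_score(state: str, offset: int, aa: str) -> float:
    """Hypothetical scores for illustration only (offset is ignored)."""
    favoured = {"H": "EAL", "E": "VIY", "C": "GPNS"}
    return 1.0 if aa in favoured[state] else 0.0

print(windows("ABC", half_width=1))
print(predict("EEEEEEEEE", toy_score))
print(predict("VVVVV", toy_score, half_width=2))
```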

48
Third generation methods
  • Third generation methods reached 77% accuracy.
  • They rest on two new ideas:
      • 1. A biological idea: using evolutionary
        information based on conservation analysis of
        multiple sequence alignments.
      • 2. A technological idea: using neural networks.

49
Artificial Neural Networks
An attempt to imitate the human brain (assuming
that this is the way it works).
50
Neural network models
  • machine learning approach
  • provide training sets of structures (e.g.
    α-helices, non-α-helices)
  • computers are trained to recognize patterns in
    known secondary structures
  • provide test set (proteins with known structures)
  • accuracy: 70-75%

51
Reasons for improved accuracy
  • Align sequence with other related proteins of the
    same protein family
  • Find members that have a known structure
  • If there are significant matches between structure
    and sequence, assign secondary structures to
    corresponding residues

52
New and Improved Third-Generation Methods
  • Exploit evolutionary information. Based on
    conservation analysis of multiple sequence
    alignments.
  • PHD (Q3 70%)
      • Rost, B., Sander, C. (1993) J. Mol. Biol. 232,
        584-599.
  • PSIPRED (Q3 77%)
      • Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.
      • Arguably remains the top secondary structure
        prediction method (won all CASP competitions
        since 1998).

53
Secondary Structure Prediction Summary
  • 1st Generation - 1970s
      • Q3 50-55%
      • Chou & Fasman, GOR
  • 2nd Generation - 1980s
      • Q3 60-65%
      • Qian & Sejnowski, GORIII
  • 3rd Generation - 1990s
      • Q3 70-80%
      • PHD, PSIPRED
  • Many 3rd generation methods exist
      • PSI-PRED - http://bioinf.cs.ucl.ac.uk/psipred/
      • JPRED - http://www.compbio.dundee.ac.uk/www-jpred/
      • PHD - http://www.embl-heidelberg.de/predictprotein/predictprotein.html
      • NNPRED - http://www.cmpharm.ucsf.edu/nomi/nnpredict.html

54
The sequence-structure gap
November 3, 2009
  • More than 8,835,796 known protein sequences,
    61,086 experimentally determined structures
  • The gap is getting bigger.

55
Sequences vs. Structures
[Chart: number of sequences vs. number of structures; vertical axis from 0 to 200,000]
56
Protein Secondary Structures (Simplifications)
α-HELIX
β-STRAND
COIL (everything else)
57
Beyond Secondary Structure, Before Tertiary
Structure
  • Supersecondary structures (motifs): small,
    discrete, commonly observed aggregates of
    secondary structures
      • helix-loop-helix
      • β-α-β
  • Domains: independent units of structure
      • β barrel
      • four-helix bundle
  • The terms domain and motif are sometimes used
    interchangeably.

58
Helix-loop-helix
59
Beyond Secondary Structure, Before Tertiary
Structure
Folds: compact folding arrangements of a
polypeptide chain (a protein or part of a
protein). The terms domain and fold are
sometimes used interchangeably.
60
EF Fold
Found in calcium-binding proteins such as
calmodulin
61
Leucine Zipper
62
Rossmann Fold
  • The beta-alpha-beta-alpha-beta subunit
  • Often present in nucleotide-binding proteins

63
β sandwich
β barrel
64
α/β horseshoe
65
Four helix bundle
  • 24 amino acid peptide with a hydrophobic surface
  • Assembles into 4 helix bundle through hydrophobic
    regions
  • Maintains solubility of membrane proteins

66
TIM Barrel
67
PDB New Fold Growth
[Chart legend: old fold, new fold]
  • The number of unique folds in nature is fairly
    small (possibly a few thousand)
  • 90% of new structures submitted to PDB in the
    past three years have similar structural folds in
    PDB

68
Protein data bank
  • http://www.rcsb.org/pdb/

69
Protein 3D structure data
The structure of a protein consists of the 3D
(X,Y,Z) coordinates of each non-hydrogen atom of
the protein. Some protein structures also include
coordinates of covalently linked prosthetic
groups, non-covalently linked ligand molecules,
or metal ions. For some purposes (e.g. structural
alignment) only the Cα coordinates are needed.
Example of PDB format:

                            X        Y        Z    occupancy  temp. factor
ATOM   18  N   GLY   27   40.315  161.004  11.211    1.00       10.11
ATOM   19  CA  GLY   27   39.049  160.737  10.462    1.00       14.18
ATOM   20  C   GLY   27   38.729  159.239  10.784    1.00       20.75
ATOM   21  O   GLY   27   39.507  158.484  11.404    1.00       21.88

Note: the ATOM records shown provide no information
about connectivity between atoms. The last two
numbers (occupancy, temperature factor) relate to
disorder of atomic positions in crystals.
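The four ATOM records of the example can be read with a few lines of code. The sketch below assumes the simplified, whitespace-separated layout shown on the slide. Real PDB files use fixed column positions and carry additional fields (for example a chain identifier), so this parser is an illustration rather than a general PDB reader. The names `Atom`, `parse_atom` and `distance` are hypothetical.

```python
import math
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    res_num: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float

def parse_atom(line: str) -> Atom:
    """Parse one simplified, whitespace-separated ATOM record."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "ATOM":
        raise ValueError(f"not a simplified ATOM record: {line!r}")
    _, serial, name, residue, res_num, x, y, z, occ, temp = fields
    return Atom(int(serial), name, residue, int(res_num),
                float(x), float(y), float(z), float(occ), float(temp))

RECORDS = """\
ATOM 18 N  GLY 27 40.315 161.004 11.211 1.00 10.11
ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18
ATOM 20 C  GLY 27 38.729 159.239 10.784 1.00 20.75
ATOM 21 O  GLY 27 39.507 158.484 11.404 1.00 21.88
"""

atoms = [parse_atom(line) for line in RECORDS.splitlines()]
by_name = {atom.name: atom for atom in atoms}

def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the file."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

# Keep only the Cα atom, as one would for a structural alignment.
print(by_name["CA"])
# Distance between the backbone N and Cα of this residue.
print(round(distance(by_name["N"], by_name["CA"]), 3))
```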
70
(No Transcript)
71
Protein structure: Some computational tasks
  • Building a protein structure model from X-ray
    data
  • Building a protein structure model from NMR data
  • Computing the energy for a given protein
    structure (conformation)
  • Energy minimization: finding the structure with
    the minimal energy according to some empirical
    force fields.
  • Simulating the protein folding process (molecular
    dynamics)
  • Structure visualization
  • Computing secondary structure from atomic
    coordinates
  • Protein superposition, structural alignment
  • Protein fold classification
  • Threading: finding a fold (prototype structure)
    that fits to a sequence
  • Docking: fitting ligands onto a protein surface
    by molecular dynamics or energy minimization
  • Protein 3D structure prediction from sequence

72
Viewing protein structures
  • When looking at a protein structure, we may ask
    the following types of questions:
      • Is a particular residue on the inside or
        outside of a protein?
      • Which amino acids interact with each other?
      • Which amino acids are in contact with a ligand
        (DNA, peptide hormone, small molecule, etc.)?
      • Is an observed mutation likely to disturb the
        protein structure?
  • Standard capabilities of protein structure
    software:
      • Display of protein structures in different ways
        (wireframe, backbone, sticks, spacefill, ribbon).
      • Highlighting of individual atoms, residues or
        groups of residues
      • Calculation of interatomic distances
  • Advanced feature: superposition of related
    structures

73
Example: c-abl oncoprotein SH2 domain, display:
wireframe
74
Example: c-abl oncoprotein SH2 domain, display:
sticks
75
Example: c-abl oncoprotein SH2 domain, display:
backbone
76
Example: c-abl oncoprotein SH2 domain, display:
spacefill
77
Example: c-abl oncoprotein SH2 domain, display:
ribbons
78
Predicting protein 3D structure
  • Goal: 3D structure from 1D sequence
  • An existing fold: homology modeling, fold
    recognition
  • A new fold: ab-initio
79
Homology modeling
  • Based on two major observations:
  • The structure of a protein is uniquely defined by
    its amino acid sequence.
  • Similar sequences adopt similar structures.
    (Distantly related sequences may still fold into
    similar structures.)

80
Homology modeling needs three items of input
  • The sequence of a protein with unknown 3D
    structure, the "target sequence."
  • A 3D template: a structure having the highest
    sequence identity with the target sequence (>30%
    sequence identity)
  • A sequence alignment between the target sequence
    and the template sequence
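The sequence-identity criterion can be checked once the target and template are aligned. The sketch below assumes a pairwise alignment given as two equal-length strings with "-" for gaps, and it counts identity over the columns where both sequences have a residue. Other denominators are also in use, so this is one convention among several. The function names and the toy alignment are hypothetical.

```python
def percent_identity(aligned_target: str, aligned_template: str,
                     gap: str = "-") -> float:
    """Percent identical residues over the aligned (gap-free) columns."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

def usable_template(identity: float, threshold: float = 30.0) -> bool:
    """Apply the >30% rule of thumb from the slide."""
    return identity > threshold

target   = "ACDE-FGH"
template = "ACDQKFG-"
identity = percent_identity(target, template)
print(round(identity, 1), usable_template(identity))
```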

81
Homology Modeling: How it works
  • Find template
  • Align target sequence with template
  • Generate model
      • add loops
      • add sidechains
  • Refine model

82
Two zones of homology modeling
Rost, Protein Eng. 1999
83
Automated Web-Based Homology Modelling
  • SWISS Model: http://www.expasy.org/swissmod/SWISS-MODEL.html
  • WHAT IF: http://www.cmbi.kun.nl/swift/servers/
  • The CPHModels Server: http://www.cbs.dtu.dk/services/CPHmodels/
  • 3D Jigsaw: http://www.bmm.icnet.uk/3djigsaw/
  • SDSC1: http://cl.sdsc.edu/hm.html
  • EsyPred3D: http://www.fundp.ac.be/urbm/bioinfo/esypred/

84
Fold recognition: Protein Threading
  • Which of the known folds is likely to be similar
    to the (unknown) fold of a new protein when only
    its amino-acid sequence is known?

85
Protein Threading
  • The goal: find the correct sequence-structure
    alignment between a target sequence and its
    native-like fold in PDB
  • Energy function: knowledge (or statistics) based
    rather than physics based
      • Should be able to distinguish correct
        structural folds from incorrect structural folds
      • Should be able to distinguish correct
        sequence-fold alignment from incorrect
        sequence-fold alignments

86
Protein Threading
  • Basic premise
  • Statistics from Protein Data Bank (2,000
    structures)
  • Chances for a protein to have a structural fold
    that already exists in PDB are quite good.

The number of unique structural (domain) folds in
nature is fairly small (possibly a few thousand).
90% of new structures submitted to PDB in the
past three years have similar structural folds in
PDB.
87
Protein Threading
  • Basic components
  • Structure database
  • Energy function
  • Sequence-structure alignment algorithm
  • Prediction reliability assessment

88
Protein Threading structure database
  • Build a template database

89
Process
  • Threading - A protein fold recognition technique
    that involves incrementally replacing the
    sequence of a known protein structure with a
    query sequence of unknown structure. The new
    model structure is evaluated using a simple
    heuristic measure of protein fold quality. The
    process is repeated against all known 3D
    structures until an optimal fit is found.
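The process described above can be sketched in a heavily simplified form. Everything specific here is an assumption made for illustration: each template is reduced to a string of buried ("b") and exposed ("e") positions, the query is placed on the template without gaps, and the "simple heuristic measure" is a toy score that rewards hydrophobic residues in buried positions and polar residues in exposed ones. Real threading programs use gapped alignments and statistically derived energy functions.

```python
HYDROPHOBIC = set("AVILMFWC")   # assumed hydrophobic set, for illustration

def fit_score(sequence: str, environment: str) -> int:
    """+1 for each residue whose character matches its position
    (hydrophobic in buried, polar in exposed), -1 otherwise."""
    return sum(1 if (aa in HYDROPHOBIC) == (env == "b") else -1
               for aa, env in zip(sequence, environment))

def best_template(sequence: str, templates: dict) -> str:
    """Score the query against every template of matching length
    and return the name of the best-fitting one."""
    candidates = {name: env for name, env in templates.items()
                  if len(env) == len(sequence)}
    if not candidates:
        raise ValueError("no template matches the query length")
    return max(candidates,
               key=lambda name: fit_score(sequence, candidates[name]))

# Hypothetical template library: alternating vs. blocked burial patterns.
TEMPLATES = {
    "fold_A": "bebebebe",
    "fold_B": "bbbbeeee",
}

query = "VKIELTVN"
for name, env in TEMPLATES.items():
    print(name, fit_score(query, env))
print("best:", best_template(query, TEMPLATES))
```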

90
Fold recognition methods
  • 3D-PSSM: http://www.sbg.bio.ic.ac.uk/3dpssm/
  • Fugue: http://www-cryst.bioc.cam.ac.uk/fugue/
  • HHpred: http://protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred

91
ab-initio folding
  • Goal: predict structure from first principles
  • Requires
      • A free energy function, sufficiently close to
        the true potential
      • A method for searching the conformational space
  • Advantages
      • Works for novel folds
      • Shows that we understand the process
  • Disadvantages
      • Applicable to short sequences only

92
Rosetta: Simons et al. 1997
http://www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
93
Qian et al. (Nature 2007) used distributed
computing to predict the 3D structure of a
protein from its amino-acid sequence. Here, their
predicted structure (grey) of a protein is
overlaid with the experimentally determined
crystal structure (color) of that protein. The
agreement between the two is excellent. The
computation used 70,000 home computers for about
two years.
94
Overall Approach
  • Protein Sequence
  • Database Searching, Multiple Sequence Alignment
  • Homologue in PDB?
      • Yes: Homology Modelling
      • No: Secondary Structure Prediction, Fold
        Recognition
  • Predicted Fold?
      • Yes: Sequence-Structure Alignment, then
        Homology Modelling
      • No: Ab-initio Structure Prediction
  • 3-D Protein Model
95
ExPASy Proteomics Server: Expert Protein Analysis
System
  • links to lots of protein prediction resources
  • http://expasy.org/

96
RMSDmin: The root mean square deviation (RMSD) is
the measure of the average distance between the
backbones of superimposed proteins. In the study
of globular protein conformations, one
customarily measures the similarity in
three-dimensional structure by the RMSD of the Cα
atomic coordinates after optimal rigid body
superposition. A widely used way to compare the
structures of biomolecules or solid bodies is to
translate and rotate one structure with respect
to the other to minimize the RMSD. This RMSDmin
can be used as a distance measure between two
proteins.
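A minimal sketch of the RMSD calculation for two equal-length lists of paired Cα coordinates. It covers the RMSD itself and the translation step (moving both structures to a common centroid). The optimal rotation needed for a true RMSDmin, for example with the Kabsch algorithm, is deliberately left out, so the value printed here is only an upper bound on RMSDmin. The function names and coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b) -> float:
    """Root mean square deviation between paired (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def center(coords):
    """Translate coordinates so that their centroid is at the origin."""
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]

structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# The same structure, shifted by 2 units along x.
structure_b = [(x + 2.0, y, z) for x, y, z in structure_a]

print(rmsd(structure_a, structure_b))                  # before translation
print(rmsd(center(structure_a), center(structure_b)))  # after translation
```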