Title: Introduction to Bioinformatics: Lecture XI Computational Protein Structure Prediction
1 Introduction to Bioinformatics, Lecture XI: Computational Protein Structure Prediction
- Jarek Meller
- Division of Biomedical Informatics,
- Children's Hospital Research Foundation
- Department of Biomedical Engineering, UC
2 Outline of the lecture
- Protein structure and complexity of conformational search: from similarity-based methods to de novo structure prediction
- Multiple sequence alignment and family profiles
- Secondary structure and solvent accessibility prediction
- Matching sequences with known structures: threading and fold recognition
- Ab initio folding simulations
3 Polypeptide chains: backbone and side-chains
[Figure labels: N-ter, C-ter]
4 Distinct chemical nature of amino acid side-chains
[Figure labels: N-ter, C-ter, PHE, CYS, VAL, GLU, ARG]
5 Hydrogen bonds and secondary structures
[Figure labels: β-strand, α-helix]
6 Tertiary structure and long-range contacts
[Figure label: annexin]
7 Quaternary structure and protein-protein interactions: annexin hexamer
8 Domains, interactions, complexes: cyclin D and Cdk
[Figure label: Cyclin Box]
9 Domains, interactions, complexes: VHL
10 Protein folding problem
- The protein folding problem consists of predicting the three-dimensional structure of a protein from its amino acid sequence
- The hierarchical organization of protein structures helps to break the problem into secondary structure, tertiary structure and protein-protein interaction predictions
- Computational approaches to protein structure prediction: similarity-based and de novo methods
11 Polypeptide chains: backbone and rotational degrees of freedom
        H      O             R2
        |      ||            |
+H3N -- Cα --- C ---- N ---- Cα --- C --- O-
        |             |      |      ||
        R1            H      H      O
The equilibrium length of the peptide bond (C-N) is about 1.33 Å. The average Cα-Cα distance in a polypeptide chain is about 3.8 Å. The angle of rotation around the N-Cα bond is called φ, and the angle around the Cα-C bond is called ψ. These two angles define the overall conformation of a polypeptide chain. Simplifying, if each of these single bonds has three discrete states (rotations), each residue has 3 × 3 = 9 states, implying 9^N possible backbone conformations for a chain of N residues.
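To get a feeling for the 9^N estimate, the count can be evaluated for a few chain lengths. The sketch below is only an illustration: the function names and the sampling rate of 10^12 conformations per second are assumptions, not part of the lecture.

```python
import math

STATES_PER_BOND = 3    # discrete rotational states per single bond (from the slide)
BONDS_PER_RESIDUE = 2  # phi (N-Ca) and psi (Ca-C)


def backbone_conformations(n_residues: int) -> int:
    """Number of backbone conformations, (3 * 3)^N = 9^N."""
    return (STATES_PER_BOND ** BONDS_PER_RESIDUE) ** n_residues


def exhaustive_search_years(n_residues: int, rate_per_second: float = 1e12) -> float:
    """Years needed to enumerate all conformations at an assumed sampling rate."""
    seconds = backbone_conformations(n_residues) / rate_per_second
    return seconds / (3600 * 24 * 365)


if __name__ == "__main__":
    for n in (10, 50, 100):
        count = backbone_conformations(n)
        print(f"N={n}: 9^N is about 10^{math.log10(count):.1f}, "
              f"exhaustive search would take about {exhaustive_search_years(n):.2e} years")
```

Even with this crude three-state model, exhaustive enumeration is hopeless for chains of realistic length, which is why conformational search is treated as a global optimization problem.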
12 Scoring alternative conformations with empirical force fields (folding potentials)
Ideally, each misfolded structure should have an energy higher than the native energy, i.e. $E_{\mathrm{misfolded}} - E_{\mathrm{native}} > 0$.
[Figure: energy axis E with the misfolded state above the native state]
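The condition above can be checked directly once energies of the native structure and of a set of misfolded (decoy) structures are available. The following is a minimal sketch; the function names and the example energies are hypothetical and serve only to illustrate the inequality.

```python
def energy_gaps(e_native, e_decoys):
    """Return E_misfolded - E_native for every misfolded (decoy) energy."""
    return [e_misfolded - e_native for e_misfolded in e_decoys]


def native_is_recognized(e_native, e_decoys):
    """True if every misfolded structure has a higher energy than the native one."""
    return all(gap > 0 for gap in energy_gaps(e_native, e_decoys))


if __name__ == "__main__":
    # Made-up energies in arbitrary units, for illustration only.
    e_native = -42.0
    e_decoys = [-35.5, -40.1, -28.7]
    print("gaps:", energy_gaps(e_native, e_decoys))
    print("native recognized:", native_is_recognized(e_native, e_decoys))
```

A force field for which this check fails on some decoy would rank that misfolded structure above the native one.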
13 Ab initio (or de novo) folding simulations
- When dealing with a new fold, similarity-based methods cannot be applied
- Ab initio folding simulations consist of a conformational search with an empirical scoring function (force field) to be maximized (or minimized)
- Computational bottleneck: exponential search space and sampling problem (global optimization!)
- Fundamental problem: inaccuracy of empirical force fields
- Importance of mixed protocols, such as Rosetta by D. Baker and colleagues (more when Monte Carlo protocols for global optimization are introduced)
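As a rough illustration of conformational search with a scoring function, the sketch below runs a Metropolis Monte Carlo search over discrete torsion states. Everything in it is an assumption made for illustration: the toy energy simply counts torsions that differ from a hidden "native" assignment, and the temperature and step count are arbitrary. It is not the Rosetta protocol or a realistic force field.

```python
import math
import random


def toy_energy(conf, native):
    """Hypothetical energy: number of torsions not in their native state."""
    return sum(1 for c, n in zip(conf, native) if c != n)


def metropolis_search(native, n_steps=5000, temperature=0.5, seed=0):
    """Metropolis Monte Carlo over torsion states {0, 1, 2}; returns best conformation found."""
    rng = random.Random(seed)
    conf = [rng.randrange(3) for _ in native]  # random starting conformation
    energy = toy_energy(conf, native)
    best_conf, best_energy = list(conf), energy
    for _ in range(n_steps):
        i = rng.randrange(len(conf))
        old_state = conf[i]
        # Trial move: switch one torsion to a different discrete state.
        conf[i] = rng.choice([s for s in range(3) if s != old_state])
        new_energy = toy_energy(conf, native)
        delta = new_energy - energy
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_conf, best_energy = list(conf), energy
        else:
            conf[i] = old_state  # reject the move
    return best_conf, best_energy


if __name__ == "__main__":
    native = [0, 1, 2, 1, 0, 2, 2, 1, 0, 1]
    conf, energy = metropolis_search(native)
    print("best energy found:", energy)
```

With a realistic, rugged energy function the same loop can get trapped in local minima, which is the sampling problem named in the list above.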
14 Similarity-based approaches to structure prediction: from sequence alignment to fold recognition
- High level of redundancy in biology: sequence similarity is often sufficient to use the "guilt by association" rule: if similar sequence, then similar structure and function
- Multiple alignments and family profiles can detect evolutionary relatedness at much lower sequence similarity, which is hard to detect with pairwise sequence alignments: PSI-BLAST by S. Altschul et al.
- For sufficiently close proteins one may superimpose the backbones using sequence alignment and then perform a conformational search (with the backbone fixed) to find the optimal geometry of the side-chains (according to an atomistic empirical force field): homology modeling (e.g. Modeller by A. Sali et al.)
- Many structures are already known (see PDB) and one can match sequences directly with structures to enhance structure recognition: fold recognition
- For both fold recognition and de novo simulation, prediction of intermediate attributes such as secondary structure or solvent accessibility helps to achieve better sensitivity and specificity
15 Protein families and domains
The notion of a protein family is derived from evolutionary considerations: members of the same family are related, perform the same function and are assumed to have diverged from the same ancestor. The notion of a domain is derived from structural considerations: "A domain is defined as an autonomous structural unit, or a reusable sequence unit that may be found in multiple protein contexts" (Bateman et al.).
PFAM (7246 families as of April 2004): http://www.sanger.ac.uk/Software/Pfam/
PRODOM: http://prodes.toulouse.inra.fr/prodom/current/html/home.php
CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi
Check pfam00134.11, Cyclin_N
16 Multiple alignment and PSSM
17 Multiple alignment, clustering and families
- DP search gives the optimal solution but scales exponentially with the number of sequences K, O(n^K), so it is not practical for more than 3 or 4 sequences.
- Standard heuristics start from pairwise alignments (e.g. PSI-BLAST, ClustalW)
- Hidden Markov Model approach to family profiles (profile HMM) as an alternative, with pre-fixed parameters trained separately for each family. Some initial multiple alignments are necessary for training (next lecture).
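Once a multiple alignment is available, a family profile can be summarized as a position-specific scoring matrix (PSSM). The sketch below shows one simple way to build log-odds scores from column counts. The uniform background frequency (1/20), the pseudocount of 1 and the function names are assumptions chosen for brevity; PSI-BLAST uses a more elaborate weighting scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log2-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(length):
        # Count residues in this column; gaps and unknown symbols are skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum of column scores for an ungapped sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    alignment = ["ACDA", "ACEA", "AC-A"]  # toy alignment, '-' marks a gap
    pssm = build_pssm(alignment)
    print(score_sequence(pssm, "ACDA"), score_sequence(pssm, "WWWW"))
```

A sequence that matches the conserved columns scores higher than an unrelated one, which is the basis for detecting remote family members with profiles.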
18 Predicting 1D protein profiles from sequences: secondary structures and solvent accessibility
a) Multiple alignment and family profiles improve prediction of local structural propensities.
b) Use of advanced machine learning techniques, such as Neural Networks or Support Vector Machines, improves results as well.
B. Rost and C. Sander were the first to achieve more than 70% accuracy in three-state (H, E, C) classification, applying a) and b).
SABLE server: http://sable.cchmc.org
POLYVIEW server: http://polyview.cchmc.org
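Three-state accuracy of the kind quoted above is usually computed per residue, as the percentage of positions where the predicted state (H, E or C) agrees with the observed one. A minimal sketch, with a hypothetical function name and made-up strings:

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state (H, E, C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    observed = "CCHHHHHHCCEEEEECC"   # made-up example
    predicted = "CCHHHHHCCCEEEECCC"
    print(f"three-state accuracy: {q3_accuracy(predicted, observed):.1f}%")
```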
19 Predicting 1D protein profiles from sequences: secondary structures and solvent accessibility
20 Predicting transmembrane domains
21 Hydropathy profiles and membrane domains prediction
Problem: Design a simple algorithm for finding putative transmembrane regions based on hydropathy (or hydrophobicity) profiles. Consider an extension based on prototypes and k-NN.
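One possible starting point for this problem is a sliding-window average of a hydropathy scale, flagging stretches where the smoothed profile exceeds a threshold. The sketch below uses the Kyte-Doolittle scale; the window of 19 residues and the threshold of 1.6 are commonly quoted choices for that scale but are assumptions here, as are the function names. The prototype/k-NN extension is not covered.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy in a window centered on each residue (truncated at the ends)."""
    half = window // 2
    profile = []
    for i in range(len(seq)):
        segment = seq[max(0, i - half):min(len(seq), i + half + 1)]
        # Raises KeyError for symbols outside the 20 standard amino acids.
        profile.append(sum(KYTE_DOOLITTLE[aa] for aa in segment) / len(segment))
    return profile


def putative_tm_regions(seq, window=19, threshold=1.6):
    """Return (start, end) index pairs, end exclusive, where the profile stays >= threshold."""
    regions, start = [], None
    # The trailing -inf sentinel closes a region that runs to the end of the sequence.
    for i, value in enumerate(hydropathy_profile(seq, window) + [float("-inf")]):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            regions.append((start, i))
            start = None
    return regions


if __name__ == "__main__":
    # Made-up sequence: polar flanks around a 21-residue hydrophobic stretch.
    seq = "DKDEKRDEKD" * 2 + "LIVLLAVIALLVIGLAVLIVA" + "KRDEKDDKER" * 2
    print(putative_tm_regions(seq))
```

Because of the smoothing, the flagged stretch of window centers is shorter than the hydrophobic segment itself, so in practice the reported region is often extended by part of the window on each side.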
22 Predicting transmembrane domains
23 Going beyond sequence similarity: threading and fold recognition
When sequence similarity is not detectable, use a library of known structures to match your query with target structures. As in the case of de novo folding, one needs a scoring function that measures compatibility between sequences and structures.
24 Why fold recognition?
- Divergent (common ancestor) vs. convergent (no ancestor) evolution
- PDB: virtually all proteins with 30% sequence identity have similar structures; however, most of the similar structures share only up to 10% sequence identity!
- www.columbia.edu/rost/Papers/1997_evolution/paper.html (B. Rost)
- www.bioinfo.mbb.yale.edu/genome/foldfunc/ (H. Hegyi, M. Gerstein)
25 Simple contact model for protein structure prediction
Each amino acid is represented by a point in 3D space, and two amino acids are said to be in contact if their distance is smaller than a cutoff distance, e.g. 7 Å.
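The contact definition translates almost directly into code. The sketch below lists contacts from one point per residue; the function name and the optional `min_separation` parameter (used to skip neighbors along the chain, which are always within the cutoff) are assumptions added for illustration.

```python
import math


def contact_pairs(coords, cutoff=7.0, min_separation=1):
    """Return index pairs (i, j), i < j, whose points are closer than the cutoff (in Å).

    coords: one (x, y, z) point per amino acid.
    min_separation: minimum |i - j| along the sequence for a pair to be considered.
    """
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.append((i, j))
    return contacts


if __name__ == "__main__":
    # Made-up coordinates: four points on a line, 3.8 Å apart.
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    print(contact_pairs(coords))
```

`math.dist` requires Python 3.8 or later.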
26 Sequence-to-structure matching with contact models
- Generalized string matching problem: aligning a string of amino acids against a string of structural sites characterized by other residues in contact
- Finding an optimal alignment with gaps using inter-residue pairwise models, $E = \sum_{k<l} \epsilon_{kl}$, is NP-hard because of the non-local character of scores at a given structural site (identity of the interaction partners may change depending on location of gaps in the alignment)
- R.H. Lathrop, Protein Eng. 7 (1994)
27 Hydrophobic contact model and sequence-to-structure alignment
[Figure label: HPHPP]
- Solutions to this, yet another instance of the global optimization problem:
- Heuristic (e.g. frozen environment approximation)
- Profile or local scoring functions (folding potentials)
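To make the hydrophobic contact model concrete, the sketch below scores an H/P sequence placed on a structure described by its contact list, and scans ungapped placements only. This sidesteps the hard gapped problem rather than solving it. The energy of -1 per H-H contact, the function names and the example contact list are all assumptions for illustration.

```python
def threading_energy(hp_seq, contacts, offset=0, eps_hh=-1.0):
    """Contact energy of an H/P sequence placed without gaps at structural sites offset, offset+1, ...

    contacts: pairs (k, l) of structural site indices that are in contact.
    Only contacts with both sites occupied by the sequence contribute.
    """
    energy = 0.0
    for k, l in contacts:
        i, j = k - offset, l - offset  # sequence positions sitting at sites k and l
        if 0 <= i < len(hp_seq) and 0 <= j < len(hp_seq):
            if hp_seq[i] == "H" and hp_seq[j] == "H":
                energy += eps_hh
    return energy


def best_ungapped_alignment(hp_seq, contacts, n_sites, eps_hh=-1.0):
    """Return (energy, offset) of the lowest-energy ungapped placement."""
    if len(hp_seq) > n_sites:
        raise ValueError("sequence is longer than the structure")
    return min(
        (threading_energy(hp_seq, contacts, offset, eps_hh), offset)
        for offset in range(n_sites - len(hp_seq) + 1)
    )


if __name__ == "__main__":
    contacts = [(1, 3), (2, 4), (0, 5)]  # made-up contact list for a 7-site structure
    print(best_ungapped_alignment("HPHPP", contacts, n_sites=7))
```

With gaps allowed, the partner of a given site depends on where the gaps fall elsewhere in the alignment, which is why heuristics such as the frozen environment approximation or local (profile) scoring functions are used instead.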
28 Using sequence similarity, predicted secondary structures and contact potentials: fold recognition protocols
In practice, fold recognition methods are often mixtures of sequence matching and threading (e.g., with compatibility between a sequence and a structure measured by contact potentials, and predicted secondary structures compared to the secondary structure of a template). D. Fischer and D. Eisenberg, Curr. Opinion in Struct. Biol. 1999, 9: 208
29 Some fold recognition servers
- PSI-BLAST (Altschul SF et al., Nucl. Acids Res. 25: 3389)
- LiveBench evaluation (http://BioInfo.PL/LiveBench/1/)
- FFAS (L. Rychlewski, L. Jaroszewski, W. Li, A. Godzik (2000), Protein Science 9: 232): sequence profile against profile
- 3D-PSSM (Kelley LA, MacCallum RM, Sternberg JE, JMB 299: 499): 1D-3D profile combined with secondary structures and solvation potential
- GenTHREADER (Jones DT, JMB 287: 797): sequence profile combined with pairwise interactions and solvation potential
- LOOPP: annotations of remote homologs
- http://www.tc.cornell.edu/CBIO/loopp