Title: Homology Modeling (advanced)
1Homology Modeling (advanced)
- Boris Steipe
- University of Toronto
- boris.steipe_at_utoronto.ca
2Concepts
- Review of homology modeling basics
- Multiple sequence alignment revisited
- Modeling goals revisited
- Modeling methods revisited
- Modeling problems and energy minimization
- Conclusions
3Concept 1
- Sequence alignment is the single most important
step in homology modeling.
4Alignment is the limiting step for homology model
accuracy
No amount of forcefield minimization will put a
misaligned residue in the right place!
HOMSTRAD @ CASP4: Williams MG et al. (2001)
Proteins Suppl. 5: 92-97
5Superposition vs. Alignment
- The coordinates of two proteins can be
superimposed in space.
- An alignment may be derived from a superposition
by correlating residues that are close in space.
- An optimal sequence alignment may lead to a
different alignment ...
Superposition of 1GTR and 2TS1
6Superposition vs. Alignment
[Alignment figure, shown in blocks of three rows: TyrRS
(top), 1GTR from residue 26 to 262 (middle) and 2TS1 from
residue 29 to 223 (bottom).]
Example: E. coli GlnRS (1GTR) and G.
stearothermophilus TyrRS (2TS1). Although the
optimal sequence alignment (top/middle) is not
unreasonable (19% ID, 40/212 residues; similar
function; ATP binding motif conserved (box)),
comparison with the structure shows it is
actually wrong for all but 11 residues! The
superposition-based alignment (middle/bottom) is
quite dissimilar in sequence (4.5% ID, 12/265
residues), but the superposition actually matches
39% of residues (104/265) as pairs in space
over the length of the domain.
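Deriving an alignment from a superposition, as described above, can be sketched in a few lines. This is a minimal illustration, not the method used for the 1GTR/2TS1 example: it assumes the two chains are already superimposed, takes one C-alpha coordinate per residue, and pairs residues that are mutual nearest neighbours within a cutoff. The function names and the 3.5 Å cutoff are hypothetical choices.

```python
import math

def structural_pairs(ca_a, ca_b, cutoff=3.5):
    """Pair residues of two superimposed chains by spatial proximity.

    ca_a, ca_b: lists of (x, y, z) C-alpha coordinates, already superimposed.
    Returns (i, j) index pairs that are mutual nearest neighbours and lie
    within `cutoff` (same length unit as the coordinates, assumed Angstrom).
    """
    if not ca_a or not ca_b:
        return []
    pairs = []
    for i, p in enumerate(ca_a):
        # nearest residue of chain B to residue i of chain A
        j = min(range(len(ca_b)), key=lambda k: math.dist(p, ca_b[k]))
        if math.dist(p, ca_b[j]) > cutoff:
            continue
        # keep the pair only if i is also the nearest residue of A to j
        i_back = min(range(len(ca_a)), key=lambda k: math.dist(ca_a[k], ca_b[j]))
        if i_back == i:
            pairs.append((i, j))
    return pairs

def percent_identity(seq_a, seq_b, pairs):
    """Percentage of paired positions that carry the same residue type."""
    if not pairs:
        return 0.0
    same = sum(1 for i, j in pairs if seq_a[i] == seq_b[j])
    return 100.0 * same / len(pairs)

# Toy data (not real coordinates): the third residue of each chain is unpaired.
toy_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
toy_b = [(0.5, 0.0, 0.0), (4.0, 0.5, 0.0), (20.0, 0.0, 0.0)]
toy_pairs = structural_pairs(toy_a, toy_b)
toy_identity = percent_identity("GKT", "GRT", toy_pairs)
```

Residues paired this way need not be identical in sequence, which is why a superposition-based alignment can have very low sequence identity while still matching many residues in space.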
7Inserts may be accommodated in a distant part of
the structure
Example - a five residue insert
- Sequence alignment (shows what happened)
- gktlit-----nfsqehip
- gktlisflyeqnfsqehip
- Structure alignment (shows how it's accommodated)
- gktlitnfsq-----ehip
- gktlisflyeqnfsqehip
α-helix
8Indels (inserts or deletions)
- Comparisons of alignments and structures
demonstrate that uniform gap penalty assumptions
are NOT BIOLOGICAL.
- Indels are most often observed in loops, less
often in secondary structure elements.
- When they do not occur in loops, there is
frequently a maintenance of helical or strand
properties.
9Can we do better than using a uniform gap
assumption?
- Required: position-specific gap penalties
- One approach: implemented in Clustal as secondary
structure masks
- Get secondary structure information, convert it
to Clustal mask format. (Easy - read the
documentation!)
- Alternatively, use a manual sequence alignment
editor to move gaps out of secondary structure
regions.
However: this is "automatically" achieved by
modern multiple sequence alignment programs.
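The idea of position-specific gap penalties can be illustrated with a small sketch. It does not reproduce Clustal's mask format or its scoring; the function names, the one-letter secondary structure codes (H = helix, E = strand, anything else = loop) and the penalty values are all assumptions chosen for illustration.

```python
def gap_open_penalties(secondary_structure, base=10.0, ss_factor=4.0):
    """Return one gap-opening penalty per position.

    Positions inside a helix (H) or strand (E) get a penalty raised by
    `ss_factor`, so an aligner using these values prefers to open gaps in loops.
    """
    penalties = []
    for state in secondary_structure:
        if state in ("H", "E"):
            penalties.append(base * ss_factor)
        else:
            penalties.append(base)
    return penalties

def cheapest_gap_position(penalties):
    """Index of the first position where opening a gap costs least."""
    return min(range(len(penalties)), key=penalties.__getitem__)

# Toy secondary structure string: helix, short loop, strand.
toy_penalties = gap_open_penalties("HHHCCEE")
toy_position = cheapest_gap_position(toy_penalties)
```

With a uniform penalty every position would cost the same; here the loop positions are the cheapest places to open a gap.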
10Concept 4
11Homology Modeling Process
[Flowchart: TAR, Search, HOM / TEM, Align, MSA, Model, 3D,
Complete, 3DC, Analyse, PUB]
Legend:
- TAR: Target sequence
- Search: Sequence database similarity search
- nr: non-redundant GenBank subset (with annotated
structures)
- HOM: Homologous sequences
- TEM: Sequences of homologues with known structure
- Align: Careful Multiple Sequence Alignment
- MSA: Multiple Sequence Alignment
- Model: Generate 3D Model
- ExPDB: Modeling template structure database
- Complete: Add ligands, substrates etc. to model
- Analyse: Interpret and conclude
- PUB: Publish results
Tools named in the diagram: PSI-BLAST, Cinema, SwissModel,
TextEditor, RasMol, Consurf
Other labels in the diagram: LIG, 3D, 3DC
Note at the Search step (PSI-BLAST, nr (PDB)): These are
really two queries rolled into one procedure.
12SwissModel ... first approach mode
http://www.expasy.org/swissmod
13Uses of structure revisited - I
- Prototype 1: Analytical
- Explain mechanistic aspects of protein.
- (e.g. in terms of)
- residues involved in catalysis
- global properties (like electrostatics)
- shape, relative orientation and distances of
domains or subdomains
- flexibility and dynamics
- e.g. hypothesizing about the rate limiting step
14Uses of structure revisited - II
- Prototype 2: Comparative
- Bring conservation patterns into a spatial
context in order to infer causality from
(database) correlations.
- (e.g. in terms of)
- describing context-specific conservation patterns
and analyzing these according to conserved
properties
- analyzing the predicted effect of sequence
variation (e.g. for engineering changes, fusing
domains or predicting SNP effects)
- distinguishing physiological vs. nonphysiological
interactions
15Item 2
Multiple Sequence Alignment Revisited
16Current State of the Art: ProbCons
ProbCons is a novel tool for generating multiple
alignments of protein sequences. Using a
combination of probabilistic modeling and
consistency-based alignment techniques, ProbCons
has achieved the highest accuracies of all
alignment methods to date. On the BAliBASE
benchmark alignment database, alignments produced
by ProbCons show statistically significant
improvement over current programs, containing an
average of 7% more correctly aligned columns than
those of T-Coffee, 11% more correctly aligned
columns than those of CLUSTAL W, and 14% more
correctly aligned columns than those of DIALIGN.
http://probcons.stanford.edu
Do, C.B., Mahabhashyam, M.S.P., Brudno, M., and
Batzoglou, S. 2005. PROBCONS: Probabilistic
Consistency-based Multiple Sequence
Alignment. Genome Research 15: 330-340.
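The accuracy measure quoted above, "correctly aligned columns", can be sketched as follows. This is a simplified column score under stated assumptions, not the scoring program distributed with BAliBASE: both alignments contain the same sequences in the same order, gaps are written as "-", and every reference column counts (real benchmarks typically restrict scoring to annotated core regions). The function names are hypothetical.

```python
def alignment_columns(alignment):
    """Convert an alignment (list of equal-length gapped strings) into columns.

    Each column is a tuple holding, per sequence, the index of the residue
    found in that column, or None where the sequence has a gap.
    """
    counters = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        column = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                column.append(None)
            else:
                column.append(counters[s])
                counters[s] += 1
        columns.append(tuple(column))
    return columns

def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    test_columns = set(alignment_columns(test))
    reference_columns = alignment_columns(reference)
    hits = sum(1 for column in reference_columns if column in test_columns)
    return hits / len(reference_columns)

# Toy example: the test alignment places one gap differently from the reference.
toy_reference = ["AC-GT", "ACTGT"]
toy_test = ["ACG-T", "ACTGT"]
toy_score = column_score(toy_test, toy_reference)
```

In the toy example three of the five reference columns are reproduced, so a single misplaced gap already costs two columns.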
17Item 3
Modeling Goals Revisited
18A homology model is ...
- A 3-D map that integrates information on
- evolutionary conservation of structures
- a protein sequence
- principles of protein structure
Always ask: where does the information come from
... how reliable is it?
19What is a homology model useful for ?
Goal: Biochemical inference from 3D similarity
- Bonds
- Angles, planar and dihedral
- Surfaces, solvent accessibility
- Amino acid functions, presence in structure
patterns
- Spatial relationship of residues to active site
- Spatial relationship to other residues
- Participation in function / mechanism
- Static and dynamic disorder
- Electrostatics
- Conservation patterns (structural and functional)
- Plausibility of posttranslational modification
sites
- Suitability as drug target
Unreliable
Primary use
Educated guesswork
... but you can't predict the structural
consequences of posttranslational modifications!
20Abuse of homology models
- Modelling properties that cannot / will not be
verified
- Analysing geometry of model
- Interpreting loop structures near indels
- Inferring relative domain arrangement
- Inferring structures of complexes
Homology models map information from a sequence
alignment into 3D space. They cannot be used to
"predict structure".
21Databases of Models
- Don't make models unless you check first...
- Swiss-Model repository
- 64,000 models based on 4000 structures and
Swiss-Prot proteins
- ModBase
- Made with "Modeller"
- 15,000 reliable models for
substantial segments of approximately 4,000
proteins in the genomes of Saccharomyces
cerevisiae, Mycoplasma genitalium, Methanococcus
jannaschii, Caenorhabditis elegans, and
Escherichia coli.
- 3D crunch
- 1998 large scale modeling experiment
22 http://modbase.compbio.ucsf.edu/modbase-cgi-new/index.cgi
23 http://swissmodel.expasy.org/repository/
24 http://www.expasy.ch/swissmod/SM_3DCrunch.html
25Example Interpreting peptide scans
Deswal R, Singh R, Lynn AM, Frank R. Identification
of immunodominant regions of Brassica juncea
glyoxalase I as potential antitumor
immunomodulation targets. Peptides. 2005;
26(3): 395-404.
Goals: Validate exposed properties of
immuno-reactive peptides identified by peptide
scanning.
Methods: Simple threading of sequence on human
homologue
26Example Comparative Drug Design
Caffrey CR, Placha L, Barinka C, Hradilek M,
Dostal J, Sajid M, McKerrow JH, Majer P,
Konvalinka J, Vondrasek J. Homology modeling and
SAR analysis of Schistosoma japonicum cathepsin D
(SjCD) with statin inhibitors identify a unique
active site steric barrier with potential for the
design of specific inhibitors. Biol Chem. 2005
Apr; 386(4): 339-49.
Goals: Compare active sites to obtain hints for
drug design
Methods: Homology modeling of S. japonicum
sequence on human structure with a commercial
package (Insight, Accelrys); extensive energy
minimization.
Comments: Questionable results. Inappropriate
method, and residues that are identified actually
appear conserved (F/F, M/I).
27Example Inferring complexes (I)
Liu G, Li Z, Chiang Y, Acton T, Montelione GT,
Murray D, Szyperski T. High-quality homology models
derived from NMR and X-ray structures of E. coli
proteins YgdK and SufE suggest that all members of
the YgdK/SufE protein family are enhancers of
cysteine desulfurases. Protein Sci. 2005 Jun;
14(6): 1597-608.

The structural biology of proteins mediating
iron-sulfur (Fe-S) cluster assembly is central
for understanding several important biological
processes. Here we present the NMR structure of
the 16-kDa protein YgdK from Escherichia coli,
which shares 35% sequence identity with the E.
coli protein SufE. The SufE X-ray crystal
structure was solved in parallel with the YgdK
NMR structure in the Northeast Structural
Genomics (NESG) consortium. Both proteins are (1)
key components for Fe-S metabolism, (2) exhibit
the same distinct fold, and (3) belong to a
family of at least 70 prokaryotic and eukaryotic
sequence homologs. Accurate homology models were
calculated for the YgdK/SufE family based on YgdK
NMR and SufE crystal structure. Both structural
templates contributed equally, exemplifying
synergy of NMR and X-ray crystallography. SufE
acts as an enhancer of the cysteine desulfurase
activity of SufS by SufE-SufS complex formation.
A homology model of CsdA, a desulfurase encoded
in the same operon as YgdK, was modeled using the
X-ray structure of SufS as a template. Protein
surface and electrostatic complementarities
strongly suggest that YgdK and CsdA likewise form
a functional two-component desulfurase complex.
Moreover, structural features of YgdK and SufS,
which can be linked to their interaction with
desulfurases, are conserved in all homology
models. It thus appears very likely that all
members of the YgdK/SufE family act as enhancers
of SufS-like desulfurases. The present study
exemplifies that "refined" selection of two (or
more) targets enables high-quality homology
modeling of large protein families.
28Example Inferring complexes (II)
Methods: "Nest" http://honiglab.cpmc.columbia.edu/
Comments: Structural similarity of models is NOT
a sign of accuracy!
29Example Annotation of Function
Saunders NF, Goodchild A, Raftery M, Guilhaus M,
Curmi PM, Cavicchioli R. Predicted roles for
hypothetical proteins in the low-temperature
expressed proteome of the Antarctic archaeon
Methanococcoides burtonii. J Proteome Res.
2005; 4(2): 464-72.
Goals: Derive functional annotation
Methods: InterProScan, Prospect, prediction of
subcellular localization (secretion),
visualization of conserved genomic context
Comments: Difficult challenge (archaeon!). Well
done, state-of-the-art analysis gives functional
information for 55/135 novel proteins. (See
also http://psychro.bioinformatics.unsw.edu.au/)
30Synopsis of Goals
Valid goals use 3-D models as a map of
information on conservation, such as spatial
proximity and surface exposure of residues.
Poorly stated goals attempt to
interpret details of geometry.
31Item 4
Modeling Methods Revisited
32Homology Modeling Software?
- Freely available packages perform as well as
commercial ones at CASP (Critical Assessment of
Structure Prediction)
- Swiss Model (see February's Integrated
Assignment)
- Modeller (http://guitar.rockefeller.edu)
- others ...
33Swissmodel in comparison
3D-Crunch Experiment: 211,000 sequences → 64,000
models
- >50% seqID → 1 Å RMSD
- 40-49% seqID → 63% < 3 Å
- 25-29% seqID → 49% < 4 Å
Manual alternatives: Modeller ...
Automatic alternatives: SwissModel, sdsc1, 3djigsaw,
pcomb_pcons, cphmodels, easypred
First place for RMSD and correctly
aligned, second place for coverage
Guex et al. (1999) TIBS 24: 365-367
EVA: Eyrich et al. (2001) Bioinformatics 17: 1242-1243
(http://cubic.bioc.columbia.edu/eva)
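The RMSD values quoted above compare model coordinates with an experimental structure. A minimal sketch of the calculation follows; it assumes the two coordinate sets are already superimposed and list equivalent atoms (for example C-alpha atoms) in the same order. The superposition step itself is not shown, and the function name is a hypothetical choice.

```python
import math

def rmsd(coords_model, coords_reference):
    """Root-mean-square deviation between two equally long coordinate lists.

    Each list holds (x, y, z) tuples for equivalent atoms, already superimposed.
    The result is in the same length unit as the input (assumed Angstrom).
    """
    if len(coords_model) != len(coords_reference) or not coords_model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2
                  for p, q in zip(coords_model, coords_reference))
    return math.sqrt(squared / len(coords_model))

# Toy coordinates: every atom of the model is displaced by 1 unit along z.
toy_model = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
toy_reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_rmsd = rmsd(toy_model, toy_reference)
```

Because RMSD is computed over equivalent atoms, its value depends on the alignment that defines which atoms are equivalent.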
34Comparison of Approaches
Wallner B, Elofsson A. All are not equal: a
benchmark of different homology modeling
programs. Protein Sci. 2005; 14(5): 1315-27.
35Item 5
Modeling Problems
36Homology Modeling in Practice
How to assess model reliability?
- All indels are wrong
- Structure analysis ("threading",
"solvent accessibility", compatibility with
ligands) can point out possible alignment
errors
- But there is no point in "repairing"
stereochemistry; only review the alignment.
37Homology Modeling in Practice
Can you predict function from your model? No
(and yes) - the model may be incompatible with a
specific function.
38Homology Modeling in Practice
Evaluation of errors: "We found that 'through
space' proximity to gaps and chain termini, local
three-dimensional 'density', three-dimensional
environment conservation, and B-factor of the
template contribute to local deviations in the
backbone in addition to local sequence identity."
Cardozo T, Batalov S, Abagyan R. Estimating local
backbone structural deviation in homology models.
Comput Chem. 2000; 24(1): 13-31.
39Can energy minimization correct errors ?
40Energy Minimization (slides by David Wishart)
41Energy Minimization
- Efficient way of polishing and shining your
protein model
- Removes atomic overlaps and unnatural strains in
the structure
- Stabilizes or reinforces strong hydrogen bonds,
breaks weak ones
- Brings protein to lowest energy in about 1-2
minutes CPU time
42Energy Minimization (Theory)
- Treat protein molecule as a set of balls (with
mass) connected by rigid rods and springs
- Rods and springs have empirically determined
force constants
- Allows one to treat atomic-scale motions in
proteins as classical physics problems (OK
approximation)
43Standard Energy Function
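\[
E = K_r (r_i - r_j)^2 + K_\theta (\theta_i - \theta_j)^2 + K_\phi \left(1 - \cos(n\phi_j)\right)^2 + \frac{q_i q_j}{4\pi\varepsilon r_{ij}} + \frac{A_{ij}}{r^{6}} - \frac{B_{ij}}{r^{12}} + \frac{C_{ij}}{r^{10}} - \frac{D_{ij}}{r^{12}}
\]
Terms, in order: bond length, bond bending, bond
torsion, Coulomb, van der Waals, H-bond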
44Energy Terms
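- Stretching (bond length r): $K_r (r_i - r_j)^2$
- Bending (bond angle θ): $K_\theta (\theta_i - \theta_j)^2$
- Torsional (torsion angle φ): $K_\phi \left(1 - \cos(n\phi_j)\right)^2$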
45Energy Terms
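- Coulomb (distance r): $\frac{q_i q_j}{4\pi\varepsilon r_{ij}}$
- van der Waals (distance r): $\frac{A_{ij}}{r^{6}} - \frac{B_{ij}}{r^{12}}$
- H-bond (distance r): $\frac{C_{ij}}{r^{10}} - \frac{D_{ij}}{r^{12}}$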
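The six energy terms can be written as small functions. This is only a sketch: each term is coded in the form given on the slides (including the signs and exponents of the van der Waals and H-bond terms), units are arbitrary, and all parameter values in the example are made up. Reading r_i and r_j (or θ_i and θ_j) as an actual value and a reference value is an assumption, as are the function names.

```python
import math

def stretching(k_r, r_i, r_j):
    """Bond length term: K_r (r_i - r_j)^2."""
    return k_r * (r_i - r_j) ** 2

def bending(k_theta, theta_i, theta_j):
    """Bond angle term: K_theta (theta_i - theta_j)^2, angles in radians."""
    return k_theta * (theta_i - theta_j) ** 2

def torsional(k_phi, n, phi):
    """Torsion term: K_phi (1 - cos(n phi))^2, phi in radians."""
    return k_phi * (1.0 - math.cos(n * phi)) ** 2

def coulomb(q_i, q_j, r_ij, epsilon=1.0):
    """Electrostatic term: q_i q_j / (4 pi epsilon r_ij)."""
    return q_i * q_j / (4.0 * math.pi * epsilon * r_ij)

def van_der_waals(a_ij, b_ij, r):
    """van der Waals term in the form shown on the slide: A/r^6 - B/r^12."""
    return a_ij / r ** 6 - b_ij / r ** 12

def hydrogen_bond(c_ij, d_ij, r):
    """H-bond term in the form shown on the slide: C/r^10 - D/r^12."""
    return c_ij / r ** 10 - d_ij / r ** 12

# Made-up parameters, for illustration only: total energy as a sum of terms.
example_energy = (
    stretching(2.0, 1.5, 1.0)
    + bending(1.0, 2.0, 1.9)
    + torsional(1.0, 3, 0.0)
    + coulomb(1.0, -1.0, 3.0)
    + van_der_waals(1.0, 1.0, 1.0)
    + hydrogen_bond(1.0, 1.0, 1.0)
)
```

In a real force field each term is summed over all bonds, angles, torsions and atom pairs of the molecule; the example above evaluates one instance of each.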
46An Energy Surface
High Energy
Low Energy
Overhead View Side View
47A More Realistic Protein Energy Surface
The Folding Funnel
48Minimization Methods
- Energy surfaces for proteins are complex
hyperdimensional spaces
- Biggest problem is overcoming the local minimum
problem
- Simple methods (slow) to complex methods (fast):
- Monte Carlo Method
- Steepest Descent
- Conjugate Gradient
49Monte Carlo Algorithm
1. Generate a conformation or alignment (a state)
2. Calculate that state's energy or score
3. If that state's energy is less than the previous
state's, accept that state and go back to step 1
4. If that state's energy is greater than the
previous state's, accept it if a randomly chosen
number between 0 and 1 is < e^(-ΔE/kT), where ΔE is
the energy increase over the previous state;
otherwise reject it
5. Go back to step 1 and repeat until done
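The steps above can be sketched as a short Metropolis-style loop. This is a generic illustration on a one-dimensional toy energy function, not code from any modeling package. It uses the standard Metropolis criterion, in which the acceptance probability depends on the energy difference between the new and the previous state; the function names, the proposal move and the value of kT are assumptions.

```python
import math
import random

def metropolis(energy, propose, x0, kT=1.0, steps=1000, rng=None):
    """Monte Carlo search; returns the lowest-energy state seen and its energy.

    energy:  function mapping a state to its energy or score
    propose: function (state, rng) -> new candidate state
    """
    rng = rng or random.Random(0)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        x_new = propose(x, rng)            # step 1: generate a state
        e_new = energy(x_new)              # step 2: calculate its energy
        delta = e_new - e
        # step 3: always accept downhill moves;
        # step 4: accept uphill moves with probability exp(-delta / kT)
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# Toy problem: a parabola with its minimum at x = 2.
def toy_energy(x):
    return (x - 2.0) ** 2

def toy_move(x, rng):
    return x + rng.uniform(-0.5, 0.5)

toy_best_x, toy_best_e = metropolis(toy_energy, toy_move, 10.0,
                                    kT=0.1, steps=5000, rng=random.Random(42))
```

Accepting some uphill moves is what lets the search leave a local minimum; lowering kT makes the search greedier.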
50Conformational Sampling
Mid-energy lower energy lowest energy
highest energy
51Monte Carlo Minimization
High Energy
Low Energy
Performs a progressive or directed random search
52Steepest Descent & Conjugate Gradients
- Frequently used for energy minimization of large
(and small) molecules
- Ideal for calculating minima for complex (i.e.
non-linear) surfaces or functions
- Both use derivatives to calculate the slope and
direction of the optimization path
- Both require that the scoring or energy function
be differentiable (smooth)
53Steepest Descent Minimization
High Energy
Low Energy
Makes small locally steep moves down gradient
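A minimal sketch of steepest descent on a toy two-dimensional surface may help here. It assumes a fixed step size and an analytically known gradient; real minimizers choose the step adaptively. The function names and the toy surface are illustrative assumptions.

```python
import math

def steepest_descent(gradient, x0, step=0.05, tol=1e-8, max_iter=10000):
    """Repeatedly move a small step against the local gradient.

    gradient: function mapping a point (list of floats) to its gradient
    Stops when the gradient norm falls below `tol` or after `max_iter` steps.
    """
    x = list(x0)
    for _ in range(max_iter):
        g = gradient(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Toy surface f(x, y) = (x - 1)^2 + 10 (y + 2)^2 with its minimum at (1, -2).
def toy_gradient(point):
    x, y = point
    return [2.0 * (x - 1.0), 20.0 * (y + 2.0)]

toy_minimum = steepest_descent(toy_gradient, [0.0, 0.0])
```

Each move uses only the slope at the current point, so the method finds the nearest minimum downhill of the starting point and nothing beyond it.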
54Conjugate Gradient Minimization
High Energy
Low Energy
Includes information about the prior history of
path
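How the prior history of the path enters can be seen in a sketch of the textbook conjugate gradient method for a quadratic surface f(x) = 1/2 x^T A x - b^T x, with A symmetric and positive definite. Protein energy surfaces are not quadratic, so this shows the principle only; the function name and the toy surface are assumptions.

```python
def conjugate_gradient(A, b, x0, tol=1e-20, max_iter=None):
    """Minimize 1/2 x^T A x - b^T x for a symmetric positive definite A.

    A is a list of rows, b and x0 are lists of floats.
    """
    n = len(b)
    x = list(x0)
    # residual = negative gradient at the current point
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    d = list(r)                          # first direction: steepest descent
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter or 10 * n):
        if rs < tol:
            break
        Ad = [sum(A[i][j] * d[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(d[i] * Ad[i] for i in range(n))   # exact line minimum
        x = [x[i] + alpha * d[i] for i in range(n)]
        r = [r[i] - alpha * Ad[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        # new direction = new negative gradient + a share of the previous direction
        d = [r[i] + (rs_new / rs) * d[i] for i in range(n)]
        rs = rs_new
    return x

# Toy surface (x - 1)^2 + 10 (y + 2)^2, up to a constant, with minimum at (1, -2).
toy_A = [[2.0, 0.0], [0.0, 20.0]]
toy_b = [2.0, -40.0]
toy_cg_minimum = conjugate_gradient(toy_A, toy_b, [0.0, 0.0])
```

Mixing the previous direction into the new one avoids the zig-zag path of plain steepest descent in narrow valleys; on an n-dimensional quadratic, the method needs at most n steps in exact arithmetic.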
55Energy Minimization (end of slides by David
Wishart)
- Very complex programs that have taken years to
develop and refine
- Several freeware options to choose from
- XPLOR (Axel Brunger, Yale)
- GROMACS (Groningen, The Netherlands)
- AMBER (Peter Kollman, UCSF)
- CHARMM (Martin Karplus, Harvard)
- TINKER (Jay Ponder, Wash U)
56However (CASP5 (2002) - State of the art in
homology modeling)
The good
The ugly
better
worse than template
shocking!
Coordinate manipulations do not improve accuracy!
Remote sequence similarity detection methods have
improved.
Tramontano A, Morea V (2003) Assessment of
homology-based predictions in CASP5. Proteins
Suppl. 6: 352-368
57Can energy minimization correct errors ?
Apparently not: errors are avoided by better
alignment, judicious choice of templates and
careful interpretation, considering the
limitations of the method.