Title: The University of St Andrews
1. Refinement
- Jim Naismith
- St. Andrews University
- CCP4 BBSRC SUMMER SCHOOL
- 23rd-27th August 1999
2. What is refinement?
- Needed for a publication
- A black box, something you do with CNS
- Lowering the R-factor
- Lowering the Free R-factor
- Improving your model's geometry
- Producing the most biologically meaningful
structure from your experimental data
3. History of refinement
- First structures were not refined
- the mid-1970s saw the first attempts; many programs and approaches
- Konnert-Hendrickson emerged as the dominant approach (implemented as PROTIN/PROLSQ)
- TNT is probably the best implementation of these ideas
4. More history
- In the late 1980s a new program emerged: X-PLOR
- based on energies; pictorially satisfying to molecular people, not so abstract
- included simulated annealing; all approaches to this point use least squares; R-factors drop
- early 1990s: introduction of the Free R to curb the boom years of the late 80s
5. Recent developments
- The Free R saw R-factors climb
- Geometry was examined more carefully; models had substantial errors; this counteracts the climate of R-factor fixation
- Free R and geometric analysis tend to concur
- new maximum likelihood targets introduced (CNS, REFMAC, TNT)
6. Your data > your model
- Your model can only be as meaningful as your data: crap data, crap model
- Resolution is very important
- refinement is a process where a model is adjusted in a way that minimises a certain target value
- a model consists of x, y and z coordinates and a B (thermal) factor
7. Effect of missing data
- Kevin Cowtan's ducks
- Note that low resolution data give a low resolution model
- omitting low resolution terms severely distorts the image
- systematically absent data
- What does omitting free data do?
8. What is your model?
- Your model is the x, y and z positions of each atom (waters and ligands as well)
- each atom also has a B-factor
- Atoms with zero occupancy are in the geometric model but do not model the data
- It also includes any modification you make to the Fcalc (bulk solvent, anisotropic correction)
9. What is the target value?
- What can we measure about a structure?
- R-factors
- bond deviations
- angle deviations
- dihedral and planar deviations
- NCS deviations
- B-factor deviations along bonds and through angles
- Ramachandran plot (the free geometry term?)
10. Don't forget the R-factor!
- A reasonable R-factor is a necessary but not sufficient proof of correctness
- R-free consists of 2-10% of the data, which are not used in calculating adjustments to the model (sketched below)
- These data are only used to calculate an R-factor after a refinement step
- R-free validates the refinement protocol but not the structure
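A minimal Python sketch of how the two statistics are computed; the amplitudes, the 5% model error and the 5% free-set fraction are invented purely for illustration:

    import numpy as np

    def r_factor(f_obs, f_calc):
        # Conventional crystallographic R: sum|Fo - Fc| / sum Fo
        return np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs)

    rng = np.random.default_rng(0)
    f_obs = np.abs(rng.normal(size=1000)) + 10.0         # invented amplitudes
    f_calc = f_obs * (1 + 0.05 * rng.normal(size=1000))  # model with ~5% error
    free = rng.random(1000) < 0.05                       # ~5% flagged free

    # The working set drives refinement; the free set is only ever
    # used for this calculation, never to adjust the model.
    print(f"R-work = {r_factor(f_obs[~free], f_calc[~free]):.3f}")
    print(f"R-free = {r_factor(f_obs[free],  f_calc[free]):.3f}")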
11. Why does R-free work?
- Refinement creates feedback error: the phases, which are calculated, dominate the refinement process, and errors then become self-fulfilling.
- By removing a statistically valid sample, one tests whether refinement is actually improving your model or fudging your model to fit the data
12. How to use R-free
- Create your free set at the start. Never ever refine against it, otherwise it is lost
- I use it in map calculation
- If you are doing molecular replacement or extending resolution, you must keep the same set free. This is done by using MNFs and updating reflections in CCP4
- Problems with Free-R where there are multiple NCS molecules (>4)
13. Why geometry terms?
- In small molecule crystallography, we don't worry about anything other than R-factors
- Protein crystallography is underdetermined
- Small molecule refinements have 7 measurements to 1 parameter (something which is refined). Why?
- Resolution: even at 1.85 Å there are approximately 2 measurements per parameter (rough arithmetic sketched below)
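A back-of-envelope Python sketch of the observations-per-parameter arithmetic. The atom count, the ~34 Å³ of crystal volume per atom (which assumes roughly 50% solvent) and the reciprocal-sphere reflection estimate are rough illustrative assumptions, not the slide's own numbers:

    import math

    n_atoms = 2000                  # non-H atoms, a ~250-residue protein
    n_params = 4 * n_atoms          # x, y, z and one B per atom
    cell_volume = 34.0 * n_atoms    # ~34 A^3 per atom incl. ~50% solvent

    for d in (2.8, 1.85):
        # unique reflections to resolution d: reciprocal sphere of
        # radius 1/d, halved for Friedel pairs (a standard rough estimate)
        n_obs = (4 * math.pi / 3) * cell_volume / d**3 / 2
        print(f"d = {d:4.2f} A   observations/parameter ~ {n_obs / n_params:.1f}")

With these round figures the ratio comes out near 0.8 at 2.8 Å and near 2.8 at 1.85 Å: without restraints the problem is badly underdetermined.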
14. Getting more observations
- Using our knowledge, we can use geometry to create extra observations
- bond lengths: we know a C-C bond is 1.54 Å
- bond angles: we know CH2-CH2-CH2 is 109°
- bond planarity: we know the Tyr ring and the peptide bond are planar
- NCS: we know two molecules should be similar
- bonded atoms should have similar B-factors
- these are called RESTRAINTS
- this leads to more measurements
15. Restraints versus constraints
- A restraint increases observations
- A constraint decreases parameters
- By restraining a bond distance, we allow the atoms to move relative to each other. As they move closer or further than the ideal they pay a penalty. However, if the gain elsewhere is higher than the penalty, we pay it (see the sketch after slide 18)
- By constraining, we allow no flexibility
16. When to use constraints?
- Most common use is in very large NCS ensembles, where they dramatically cut parameters
- rigid body refinement constrains all atoms to move together
- probably should be used more frequently in B-factor refinement at low resolution (2.8 Å)
- rarely used in protein crystallography in general because they are hard to do
17. Where do we get restraints?
- Fortunately we don't have to go and measure them ourselves.
- A compendium known as Engh and Huber exists and gives all values for proteins
- For novel ligands, you should trawl the small molecule database and use sense
18. What does a restraint look like?
- The Jack-Levitt form (energy based; CNS, X-PLOR) is the easiest to visualise
- bond CH2 N 1.431 657.77
- The values are the optimum bond length and the energy paid per ångström of deviation from it (sketched below)
- In PROTIN the values are expressed in the same units as the parameter and reflect a distance from ideality
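A sketch of the energy-style restraint, assuming the usual harmonic form E = k(b - b0)² and using the CH2-N numbers from above; the function name is mine:

    def bond_energy(b, b0=1.431, k=657.77):
        # Harmonic penalty: deviations from the ideal length b0 cost
        # energy quadratically, with k setting the stiffness.
        return k * (b - b0) ** 2

    # A CH2-N bond off by 0.02 A pays a small penalty; off by 0.1 A it
    # pays 25x more, so a large deviation needs a large compensating
    # gain in the X-ray term to be worth keeping.
    for b in (1.431, 1.451, 1.531):
        print(f"b = {b:.3f} A   E = {bond_energy(b):8.3f}")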
19. Weighting restraints
- In CNS one has no control over the weighting of the relative geometric terms. These are defined along with the parameters.
- In PROTIN most people use the defaults, so again one has no discretion in practice
- TNT is more flexible
- Mainly the realm of experts
20. Weighting (more)
- In practice, then, you weight only the data and any non-crystallographic restraints
- Most programs will calculate a weight based on a quick refinement
- Accurate determination of the weight for the X-ray term is best done by trial and error (even in CNS); see the recipes
- The Free R-factor appears to be an excellent guide
21. Non-crystallographic symmetry
- How tight should it be?
- Dealing with two or more molecules with identical chemical composition
- Null hypothesis: the molecules are identical
- Restraints should be very tight, unless you have a good reason to believe otherwise
- a contentious issue; many people don't restrain tightly enough
22. How and what is NCS restrained?
- Simplest method (and most common) is to restrain atomic positions (general form):
- x̄(i) = M · x(j)
- E(ncs) = Wncs · Σ [x(i) − x̄(i)]²
- x(i) are the coordinates of an atom in subunit a, x(j) the coordinates of the equivalent atom in subunit b, M is the NCS matrix, Wncs is a user defined weight, and x̄(i) is the predicted coordinate position with perfect non-crystallographic symmetry (sketched in code below)
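A Python sketch of this sum-of-squares NCS term; the operator, weight and coordinates are toy values for illustration:

    import numpy as np

    def ncs_restraint_energy(xyz_a, xyz_b, M, t, w_ncs=1.0):
        # Map subunit b onto subunit a with the NCS operator (rotation M,
        # translation t) and penalise the residual distances between
        # equivalent atoms.
        predicted = xyz_b @ M.T + t       # where subunit a "should" be
        return w_ncs * np.sum((xyz_a - predicted) ** 2)

    # Toy example: two copies related by a two-fold about z, with small
    # random differences standing in for real structural noise.
    rng = np.random.default_rng(0)
    xyz_b = rng.normal(size=(100, 3))
    M = np.diag([-1.0, -1.0, 1.0])        # 180 degree rotation about z
    t = np.zeros(3)
    xyz_a = xyz_b @ M.T + t + 0.05 * rng.normal(size=(100, 3))
    print(f"E_ncs = {ncs_restraint_energy(xyz_a, xyz_b, M, t):.3f}")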
23. How to do it?
- Typically one restrains different parts of the molecule differently (common sense)
- core main chain region very tight (<0.1 Å)
- core side chain tight (<0.2 Å)
- outer main chain tight (<0.2 Å)
- outer side chain loose (<0.3 Å)
- Exclude obvious differences that you verify on the graphics
24. NCS and multidomain
- Very common problem; for example, see the figure
- if you try to restrain the whole molecule you introduce substantial errors (possibly fatal)
- the solution is to treat each domain separately
- only the hinge does not obey NCS
25. Alternative approach
- Rather than restrain the atomic positions, one could restrain the orientation of an atom with respect to its neighbours.
- Most of the flexibility occurs around dihedral bonds, so restraining dihedral angles, with appropriate exclusions, eliminates the need for matrices and domain splitting (SHELX)
26. Resolution and its role
- The consequences of resolution are not considered by most people
- The less experimental data you have, the more you MUST rely on your restraints. Undue weighting of the X-ray term for low resolution data is the road to hell. The null hypothesis is that all geometry/NCS is perfect; you do not have the information to go against this.
27. Resolution and convergence
- The higher the resolution, the deeper the well for refinement, but the narrower too: a small radius of convergence. At high resolution the atomic positions are tightly defined and cannot move very far.
- A common error is to get trapped in the wrong well, known as a false minimum
28. Minimisation
29. How to avoid the false minimum
- Traditionally this was done by starting at low resolution with an incomplete model
- still, in my view, a good starting point: large radius of convergence
- claimed to be irrelevant with an ML target
- Restrain the molecule as tightly as possible (especially at the start)
- Especially important with NCS
- Common problem with molecular replacement: hard to remove previous errors
30. How to get out?
- Graphic rebuilding (OOPS)
- omit difficult regions and re-refine
- re-start refinement at lower resolution
- chuck out waters
- reset all B-factors and occupancies
- impose NCS
- simulated annealing/torsion dynamics
31. Annealing
- Unlike all other minimisation strategies, annealing does not have to go downhill
- depending on the thermal energy in the system, the target can actually increase (illustrated below)
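A toy Python sketch of why this helps: with a Metropolis-style acceptance rule, uphill moves are accepted with probability exp(-dE/T), so the search can climb out of a false minimum while the temperature is high. The one-dimensional double-well target is invented for illustration; real annealing in refinement programs runs molecular dynamics, not this random walk:

    import math, random

    def anneal(f, x0, t_start=5.0, t_end=0.01, steps=5000, step=0.5):
        # Simulated annealing on a 1-D target f with geometric cooling.
        random.seed(0)
        x, t = x0, t_start
        cool = (t_end / t_start) ** (1.0 / steps)
        for _ in range(steps):
            x_new = x + random.uniform(-step, step)
            dE = f(x_new) - f(x)
            # Downhill moves always accepted; uphill with prob exp(-dE/T)
            if dE < 0 or random.random() < math.exp(-dE / t):
                x = x_new
            t *= cool
        return x

    # Double well: the well near x = -1 is shallow, the one near
    # x = +1.5 is deeper.  Starting in the shallow well, pure downhill
    # minimisation would be stuck there; annealing can cross over.
    f = lambda x: (x - 1.5) ** 2 * (x + 1.0) ** 2 - 0.5 * x
    print(f"final x = {anneal(f, x0=-1.0):.3f}")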
32. Is simulated annealing the answer?
- Apparently not
- When it first appeared everyone found it wonderful (only in CNS/X-PLOR)
- Free-R suggests that its use is very limited (it tends to make things worse)
- It is useful for calculating maps of difficult regions; it seems to reduce bias (simulated annealing omit maps)
33. Torsional dynamics (CNS)
- Still based on an annealing algorithm
- This time the molecule is split into rigid groups; only the rotation between the groups is allowed to vary
- By reducing the variables it allows you to increase the radius of convergence (related to temperature)
- this is a powerful technique for removing errors at the start of refinement
- a full description by Paul Adams is here
34. Bulk solvent
- Your model is your attempt to mimic the diffraction pattern
- Diffraction comes from the whole crystal
- The cavities are full of water
- An atomic model does not represent these channels
- The water cavities (bulk solvent) contribute strongly to data at resolution lower than 8 Å and dominate below 15 Å
35. Handling bulk solvent
- To refine against low resolution data without some sort of model for the bulk solvent will cause errors
- Ignoring low resolution data causes errors
- The simpler approach (REFMAC/TNT) scales your Fcalcs at low resolution (see the sketch below)
- The CNS approach is to calculate a partial structure factor based on the volume occupied by solvent (best)
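A sketch of the scaling idea in the Babinet style: an exponential damping of Fcalc at low resolution, where the flat solvent scatters roughly out of phase with the protein. The k_sol and B_sol values are merely plausible illustrations, and the exact functional form varies between programs:

    import numpy as np

    def babinet_scale(f_calc, s2, k_sol=0.75, b_sol=200.0):
        # s2 is (sin(theta)/lambda)^2; k_sol and b_sol are refinable
        # scale and smearing parameters (illustrative values here).
        return f_calc * (1.0 - k_sol * np.exp(-b_sol * s2))

    # Effect across resolution: strong damping at 20 A, negligible at 3 A.
    for d in (20.0, 8.0, 3.0):
        s2 = 1.0 / (4.0 * d * d)      # (sin(theta)/lambda)^2 = 1/(4 d^2)
        print(f"d = {d:5.1f} A   scale = {babinet_scale(1.0, s2):.3f}")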
36. Anisotropic B
- Data are always stronger along the Lorentz axis
- If your data are stronger along one crystal axis than another, you have marked anisotropy
- Even if you don't, you probably have it anyway!
- It is corrected for by applying an anisotropic B-factor correction (typically 6 Bs; sketched below)
- If you apply this, check it!
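A sketch of such a correction written as the standard anisotropic Debye-Waller factor exp(-2 pi^2 s^T U s), where the symmetric 3x3 tensor U carries the 6 independent values; the tensor here is invented, and real programs also constrain U to the space-group symmetry:

    import numpy as np

    def aniso_scale(f_calc, s_vec, U):
        # s_vec is the reciprocal-space vector of the reflection (1/A);
        # U is a symmetric 3x3 displacement tensor in A^2.
        s_vec = np.asarray(s_vec, dtype=float)
        return f_calc * np.exp(-2.0 * np.pi**2 * s_vec @ U @ s_vec)

    # Illustrative tensor: stronger fall-off along z than along x or y.
    U = np.diag([0.01, 0.01, 0.05])
    for s in ([0.2, 0.0, 0.0], [0.0, 0.0, 0.2]):
        print(f"s = {s}   scale = {aniso_scale(1.0, s, U):.3f}")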
37. Method of least squares
- There are two excellent and detailed lectures by David Watkin on the method of least squares refinement. There is also Randy Read's lecture, which discusses some of these concepts.
- I'll give you the bluffer's guide to LS refinement
38. Least squares in the least effort
- We wish to make Fc the same as Fo; we must have a starting value for Fc.
- We construct a matrix which tells us how each parameter influences Fc (the gradient)
- This matrix is square and explicitly deals with correlation between parameters
39. More least squares
- The inverse of the matrix is multiplied by a vector to give the shifts which need to be applied to improve the model.
- Each term in the vector is composed of the change in Fc with the parameter, multiplied by the difference between Fo and Fc
- constraints show their effect here
- The big challenge is to set up the normal matrix
40. Least squares in protein crystallography
- Formulating the full matrix was beyond our abilities until very recently
- Inverting the matrix is extremely time consuming
- The result is that we make approximations for the off-diagonal terms in the normal matrix (usually set them to 0). There are a number of other mathematical tricks. (One cycle is sketched below.)
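A small numpy sketch of one least-squares cycle, including the diagonal approximation; the Jacobian and residuals are random stand-ins for d(Fc)/d(parameter) and Fo - Fc:

    import numpy as np

    def ls_shifts(jacobian, residuals, diagonal_only=False):
        # Gauss-Newton cycle: build the normal matrix N = J^T J and the
        # gradient vector g = J^T r, then solve N * shifts = g.
        N = jacobian.T @ jacobian
        g = jacobian.T @ residuals
        if diagonal_only:
            # The usual protein-crystallography approximation: ignore the
            # off-diagonal (parameter-correlation) terms of N.
            return g / np.diag(N)
        return np.linalg.solve(N, g)

    # Toy linear problem: 50 "observations", 3 "parameters".
    rng = np.random.default_rng(1)
    J = rng.normal(size=(50, 3))      # d(Fc)/d(parameter)
    r = rng.normal(size=50)           # Fo - Fc residuals
    print("full matrix :", ls_shifts(J, r))
    print("diagonal    :", ls_shifts(J, r, diagonal_only=True))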
41. Conjugate gradient
- This is a type of algorithm for minimisation; it is still based on least squares
- In essence, it uses the results from the previous calculation to influence the current one (see the sketch below)
- The principle is that you are on the right road, so without a compelling reason you should stay on it
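A minimal linear conjugate gradient sketch, solving the normal equations N x = g without ever inverting N; the 5x5 test matrix is invented:

    import numpy as np

    def conjugate_gradient(N, g, n_iter=50, tol=1e-10):
        # N must be symmetric positive definite (e.g. a normal matrix).
        x = np.zeros_like(g)
        r = g - N @ x                  # residual
        p = r.copy()                   # first search direction
        for _ in range(n_iter):
            Np = N @ p
            alpha = (r @ r) / (p @ Np) # step length along p
            x = x + alpha * p
            r_new = r - alpha * Np
            if np.linalg.norm(r_new) < tol:
                break
            # Each new direction is built from the previous one: the
            # earlier calculation influences the current step.
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p
            r = r_new
        return x

    rng = np.random.default_rng(2)
    A = rng.normal(size=(5, 5))
    N = A.T @ A + 5 * np.eye(5)
    g = rng.normal(size=5)
    print(np.allclose(conjugate_gradient(N, g), np.linalg.solve(N, g)))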
42. Consequences of least squares
- We are trying to minimise the difference between Fobs and Fcalc.
- We express this as an R-factor which only involves the modulus; in reality the target is a complex number involving phases
- Where do we get observed phases?
- The phases are derived from the current working model (the last cycle of refinement); we assume the calculated phases are correct
43. Phases soak up error
- The assumption that the calculated phases are correct must be wrong; provided you are moving in the right direction, it becomes progressively more true
- If you do not have a good estimate of the phases from your model, least squares cannot work
- Least squares shunts the error into the phases, thus you move further away from the correct solution
44. Maximum likelihood
- Maximum likelihood is extremely complex and difficult. I don't really understand it.
- The following are the maxims I live by:
- it assumes the phases have errors
- it down-weights terms which are poorly defined (i.e. high resolution data at the starting stages)
- it will give a different model than least squares
- it should be much better than LS in the early stages
45. Recipes/hints I
- Spend as much time on the graphics as possible; fit the map as best you can
- set to zero occupancy anything you are unsure of
- impose NCS, decide on exclusions; when in doubt, include
- start with lower resolution data (2.8 Å is my favourite)
46. Recipes/hints II
- Know your data!
- What is the Wilson B-factor?
- What are the R-merge and the redundancy?
- calculate Ramachandran plots regularly and monitor rms deviations of bonds/angles etc.
- experiment with refinement; if it does not work, toss it; restrain things tightly
- don't get hung up on a 0.2% difference
47. CNS strategy
- For molecular replacement I find this to be a generally successful scheme:
- rigid body: assembly, then dimer, monomer, domain
- B-factors: reset to the Wilson B
- positional refinement till convergence
- torsional dynamics (includes positional on the fly)
- B-factors until convergence
- At this point I do a major graphics rebuild
48. More CNS_SOLVE
- After this rebuild I almost never do torsional dynamics again; I cycle through positional refinement, B-factors, water addition and manual rebuilding/correcting
- I often reset all the Bs to the Wilson value after a rebuild
- Each automated pass is usually 3 xyz, 3 B-factor and 2 water
49. MIR structures
- In the early stages, these have to be restrained more tightly; the model phases are too inaccurate for sensible refinement (rule of thumb: an R-factor above 45%, or less than 65% of the atoms, is unlikely to work)
- use experimental phases to hold the model in place (this is easily done in all refinement packages); it is very effective
50. Water addition
- CNS has a script to do this; so does CCP4 (ARP)
- I check every water once, after it has been included. I chuck out non-H-bonded waters and those without density. I add them in batches; when the Free R does not drop by more than 0.5%, that's it, no more, unless the map is very clear
51. What to look for
- The B-factor distribution is very important; the absolute value is less important; it is the outliers that you need to pick up
- cold on the inside, hot on the outside
- water B-factors should not be more than twice the average
- real space R-factors are a good guide, routine with CNS
52. What to measure?
- R-factors, bond deviations, angle deviations, NCS deviations, Ramachandran plot
- What's good?
- Rfree less than 30%, R and Rfree gap less than 5%
- bonds < 0.02 Å, angles < 2°, Ramachandran 80% in core (a checker is sketched below)
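These rules of thumb can be written down as a trivial checker; the field names and the example numbers are mine, the thresholds are the ones above (R-factors as fractions, bonds in Å, angles in degrees, Ramachandran core as a fraction):

    THRESHOLDS = {
        "r_free":     ("<", 0.30),
        "r_gap":      ("<", 0.05),   # Rfree - Rwork
        "bond_rmsd":  ("<", 0.02),
        "angle_rmsd": ("<", 2.0),
        "rama_core":  (">", 0.80),
    }

    def check_model(stats):
        # Flag any statistic that falls on the wrong side of its limit.
        for key, (op, limit) in THRESHOLDS.items():
            value = stats[key]
            ok = value < limit if op == "<" else value > limit
            print(f"{key:12s} {value:6.3f}  {'ok' if ok else 'LOOK AGAIN'}")

    check_model({"r_free": 0.28, "r_gap": 0.04, "bond_rmsd": 0.015,
                 "angle_rmsd": 1.8, "rama_core": 0.85})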
53. How do I know it's correct?
- The oldest and most reliable method: does your map contain sensible information which your model did not?
- a co-factor where you expect one
- a side chain instead of Ala
- When this happens, the phases must be substantially correct
- Trust your instinct and all the values; don't focus on any one number
54. What do you do with final coordinates?
- Deposit them!
- This is very important
- Still a cumbersome pain, but we owe it to our sponsors to get it right
- Currently all data are deposited at the RCSB
- Important to deposit as much information as possible
55. Words of warning
- What does your coordinate file actually mean?
- Accuracy: your xyz are given to 3 decimal places; is this realistic?
- Occupancy set to 0: why?
- High B-factors: how do you explain their meaning to biologists?
- How do we explain messy regions?
56. Validation
- Validation should proceed hand in hand with refinement, not be done at the end
- It is a very useful tool in refining your structure
- Remember, not everything in your structure can be perfect. Some bonds will be too long; you may have Ramachandran outliers. Try to assess validation information critically.