The University, St. Andrews

Provided by: drjamesh

1
Refinement
  • Jim Naismith
  • St. Andrews University
  • CCP4 BBSRC SUMMER SCHOOL
  • 23rd-27th August 1999

2
What is refinement?
  • Needed for a publication
  • A black box, something you do with CNS
  • Lowering the R-factor
  • Lowering the Free R-factor
  • Improving your model's geometry
  • Producing the most biologically meaningful
    structure from your experimental data

3
History of refinement
  • First structures were not refined
  • mid 1970s saw first attempts, many programs and
    approaches
  • Konnert-Hendrickson emerges as dominant program
    (implemented as PROTIN/PROLSQ)
  • TNT probably best implementation of these ideas

4
More History
  • In the late 1980s a new program emerged: X-PLOR
  • based on energies, pictorially satisfying to
    molecular people, not so abstract
  • included simulated annealing (all approaches to
    this point had used least squares); R-factors dropped
  • early 1990s: introduction of Free R to curb the
    boom years of the late 80s

5
Recent developments
  • Free R saw R-factors climb
  • Geometry examined more carefully; models had
    substantial errors, counteracting the climate of
    R-factor fixation
  • Free R and geometric analysis tend to concur
  • new maximum likelihood targets introduced (CNS,
    REFMAC, TNT)

6
Your data > your model
  • Your model can only be as meaningful as your
    data: crap data, crap model
  • Resolution is very important
  • refinement is a process where a model is adjusted
    in a way to minimise a certain target value
  • a model consists of x, y and z coordinates and a
    B (thermal) factor

7
Effect of missing data
  • Kevin Cowtan's ducks
  • Note that low resolution data give a low resolution
    model
  • omitting low resolution terms severely distorts
    the image
  • systematically absent data
  • What does omitting free data do?

8
What is your model?
  • Your model is the x, y and z positions of each
    atom (waters, ligands as well)
  • each atom also has a B-factor
  • Atoms with zero occupancy are in the geometric
    model but do not model the data
  • It also includes any modification you make to the
    Fcalc (bulk solvent, anisotropic)

9
What is the target value?
  • What can we measure about a structure?
  • R-factors
  • bond deviations
  • angle deviations
  • dihedral and planar deviations
  • NCS deviations
  • B-factor deviations along bonds, through angles
  • Ramachandran (the Free geometry term?)

10
Don't forget the R-factor!
  • Reasonable R-factor is a necessary but not
    sufficient proof of correctness
  • R-free consists of 2-10% of the data, which are not
    used in calculating adjustments to the model
  • These data are only used to calculate an R-factor
    after a refinement step
  • R-free validates refinement protocol but not
    structure
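
A minimal sketch of the two numbers described above, assuming the conventional amplitude R-factor, R = sum(| |Fo| - |Fc| |) / sum(|Fo|), and assuming Fcalc is already on the same scale as Fobs (no scale factor is fitted here). The function names and the toy amplitudes are made up for illustration.

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """Conventional R-factor on amplitudes: sum||Fo| - |Fc|| / sum|Fo|."""
    f_obs = np.abs(np.asarray(f_obs, dtype=float))
    f_calc = np.abs(np.asarray(f_calc, dtype=float))
    return np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs)

def r_work_and_free(f_obs, f_calc, free_mask):
    """Return (R, R-free); free_mask is True for reflections held out of refinement."""
    f_obs = np.asarray(f_obs, dtype=float)
    f_calc = np.asarray(f_calc, dtype=float)
    free_mask = np.asarray(free_mask, dtype=bool)
    r_work = r_factor(f_obs[~free_mask], f_calc[~free_mask])
    r_free = r_factor(f_obs[free_mask], f_calc[free_mask])
    return r_work, r_free

# Toy data: five reflections, one of them in the free set.
f_obs = np.array([100.0, 80.0, 60.0, 40.0, 20.0])
f_calc = np.array([90.0, 85.0, 60.0, 30.0, 25.0])
free = np.array([False, False, False, True, False])
r_work, r_free = r_work_and_free(f_obs, f_calc, free)
```

The free reflections only ever enter through the second call to r_factor, never through anything that adjusts the model.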

11
Why does R-free work?
  • Refinement creates feedback error: the calculated
    phases dominate the refinement process, so
    errors become self-fulfilling.
  • By removing a statistically valid sample, one
    can test whether refinement is actually improving
    your model or fudging your model to fit the data

12
How to use R-Free
  • Create your set at the start. Never ever refine
    against it, otherwise it is lost
  • I use it in map calculation
  • If you are doing molecular replacement or extending
    resolution, you must keep the same set free. This
    is done by using MNF and update reflections in
    CCP4
  • Problems with Free-R where there are multiple NCS
    molecules (>4)
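
A rough sketch of the "create it once, keep it forever" rule, not the CCP4 procedure itself. It assumes reflections are held in a fixed order and that newly measured higher-resolution reflections are simply appended to the end; a real program would match flags by Miller index. The function names, the 5% fraction and the seeds are illustrative choices.

```python
import numpy as np

def make_free_flags(n_reflections, fraction=0.05, seed=1999):
    """Flag a random fraction of reflections as free (True = never refined against)."""
    rng = np.random.default_rng(seed)
    n_free = max(1, int(round(fraction * n_reflections)))
    flags = np.zeros(n_reflections, dtype=bool)
    flags[rng.choice(n_reflections, size=n_free, replace=False)] = True
    return flags

def extend_free_flags(old_flags, n_total, fraction=0.05, seed=2000):
    """Keep every existing flag untouched; only flag the newly added reflections."""
    old_flags = np.asarray(old_flags, dtype=bool)
    n_new = n_total - old_flags.size
    if n_new <= 0:
        return old_flags.copy()
    return np.concatenate([old_flags, make_free_flags(n_new, fraction, seed)])

flags = make_free_flags(2000)              # set created at the start
extended = extend_free_flags(flags, 3000)  # resolution extended later
```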

13
Why geometry terms?
  • In small molecule crystallography, we don't worry
    about anything other than R-factors
  • Protein crystallography is underdetermined
  • Small molecule refinements have 7 measurements to
    1 parameter (something which is refined). Why?
  • Resolution: even at 1.85 Å there are approx. 2
    measurements per parameter

14
Getting more observations
  • Using our knowledge we can use geometry to
    create the extra observations
  • bond lengths: we know a C-C bond is 1.54 Å
  • bond angles: we know CH2-CH2-CH2 is 109°
  • planarity: we know the Tyr ring and the peptide
    are planar
  • ncs: we know two molecules should be similar
  • bonded atoms: similar B-factors
  • these are called RESTRAINTS
  • leads to more measurements
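
The bookkeeping behind this can be sketched in a few lines, assuming four parameters per atom (x, y, z and B) and counting each restraint as one extra observation. The atom, reflection and restraint counts below are invented purely to show the arithmetic.

```python
def observations_per_parameter(n_reflections, n_atoms, n_restraints=0,
                               params_per_atom=4):
    """Ratio of observations (reflections plus restraints) to refined parameters."""
    return (n_reflections + n_restraints) / (params_per_atom * n_atoms)

# Hypothetical protein: 2000 atoms, 16000 unique reflections.
data_only = observations_per_parameter(16000, 2000)
# Same model with 20000 geometric restraints counted as observations.
with_restraints = observations_per_parameter(16000, 2000, n_restraints=20000)
```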

15
Restraints versus constraints
  • Restraints increase observations
  • Constraints decrease parameters
  • By restraining a bond distance, we allow the
    atoms to move relative to each other. As they
    move closer or further than the ideal they pay a
    penalty. However, if the gain elsewhere is higher
    than the penalty, we pay
  • By constraining, we allow no flexibility

16
When to use constraints?
  • Most common use in very large ncs ensembles,
    dramatically cuts parameters
  • rigid body refinement constrains all atoms to
    move together
  • probably should be used more frequently in
    B-factor refinement at low resolution (2.8Å)
  • rarely used in protein crystallography in general
    because they are hard to do

17
Where do we get restraints?
  • Fortunately we don't have to go and measure them
    ourselves.
  • A compendium known as Engh and Huber exists
    and gives all values for proteins
  • For novel ligands, you should trawl the small
    molecule database and use common sense

18
What does a restraint look like?
  • Jack-Levitt (energy based; CNS, X-PLOR) is the
    easiest to visualise
  • bond CH2 N 1.431 657.77
  • The values are the optimum bond length and the
    energy paid per angstrom deviation from it
  • In PROTIN the values are expressed in the same
    units as the parameter and reflect a distance
    from ideality
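
To make the energy picture concrete, here is a small sketch using the two numbers from the example line above. It assumes a harmonic penalty, E = k (r - r0)^2, with 1.431 read as the ideal length r0 in Å and 657.77 as the force constant k; the functional form and the units are assumptions, not something stated on the slide.

```python
def bond_energy(r, r0=1.431, k=657.77):
    """Harmonic bond restraint (assumed form): zero at the ideal length r0."""
    return k * (r - r0) ** 2

at_ideal = bond_energy(1.431)      # no penalty at the ideal bond length
stretched = bond_energy(1.451)     # 0.02 A too long
compressed = bond_energy(1.411)    # 0.02 A too short, same penalty
```

Because the penalty grows with the square of the deviation, small departures from ideality are cheap and large ones quickly become expensive.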

19
Weighting restraints
  • In CNS one has no control over the weighting of
    the relative geometric terms. These are defined
    along with parameters.
  • In PROTIN most people use the defaults, so again
    one has no discretion in practice
  • TNT more flexible
  • Mainly realm of experts

20
Weighting (more)
  • In practice then you weight only the data and any
    non-crystallographic restraints
  • Most programs will calculate a weight based on a
    quick refinement
  • Accurate determination of the weight for the X-ray
    term is best done by trial and error (even in CNS);
    see recipes
  • Free R-factor appears to be an excellent guide

21
Non-crystallographic symmetry
  • How tight should it be?
  • Dealing with two or more molecules with identical
    chemical composition
  • Null hypothesis, molecules identical
  • Restraints should be very tight, unless you have a
    good reason to believe otherwise
  • contentious issue, many people don't restrain
    tightly enough

22
How and what is NCS restrained?
  • Simplest method (and most common) is to restrain
    atomic positions (general form)
  • $i' = M\,i$
  • $E_{ncs} = W_{ncs} \sum (i' - j)^2$
  • $i$: coordinates of atom in subunit a; $j$: coordinates
    of equivalent atom in subunit b; $M$ is the NCS matrix;
    $W_{ncs}$ is a user-defined weight; and $i'$ is the
    predicted coordinate position with perfect
    non-crystallographic symmetry
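
The positional NCS term can be sketched as follows, assuming the NCS operator M is a rotation matrix plus a translation vector and that the sum runs over all equivalent atom pairs. The function name, the coordinates and the weight of 300 are made up for the example.

```python
import numpy as np

def ncs_energy(coords_a, coords_b, rotation, translation, weight):
    """E(ncs) = W * sum |M i - j|^2 over equivalent atoms of subunits a and b."""
    coords_a = np.asarray(coords_a, dtype=float)
    coords_b = np.asarray(coords_b, dtype=float)
    # Predicted position of each subunit-a atom under perfect NCS.
    predicted = coords_a @ np.asarray(rotation, dtype=float).T + translation
    return weight * np.sum((predicted - coords_b) ** 2)

# Hypothetical operator: 90 degree rotation about z, then a 10 A shift along x.
rot_z90 = np.array([[0.0, -1.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])
shift = np.array([10.0, 0.0, 0.0])

subunit_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
perfect_b = subunit_a @ rot_z90.T + shift
# Move one atom of subunit b by 0.1 A so that it deviates from perfect NCS.
subunit_b = perfect_b + np.array([[0.1, 0.0, 0.0], [0.0, 0.0, 0.0]])

e_perfect = ncs_energy(subunit_a, perfect_b, rot_z90, shift, weight=300.0)
e_deviant = ncs_energy(subunit_a, subunit_b, rot_z90, shift, weight=300.0)
```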

23
How to do it?
  • Typically one restrains different parts of the
    molecule (common sense)
  • core main chain region very tight (<0.1 Å)
  • core side chain tight (<0.2 Å)
  • outer main chain tight (<0.2 Å)
  • outer side chain loose (<0.3 Å)
  • Exclude obvious differences you verify from the
    graphics

24
NCS and multidomain
  • Very common problem, for example see figure
  • if you try to restrain the whole molecule you
    introduce substantial errors (possibly fatal)
  • solution is to treat each domain separately
  • only the hinge does not obey NCS

25
Alternative approach
  • Rather than restrain the atomic positions, one
    could restrain the orientation of each atom with
    respect to its neighbors.
  • Most of the flexibility occurs around dihedral
    angles, so restraining dihedral angles with
    appropriate exclusions eliminates the need for
    matrices and domain splitting (SHELX)

26
Resolution and its role
  • The consequences of resolution are not considered
    by most people
  • The less experimental data you have, the more you
    MUST rely on your restraints. Undue weighting of
    the X-ray term for low resolution data is the
    road to hell. The null hypothesis is that all
    geometry/ncs is perfect. You do not have the
    information to go against this.

27
Resolution and convergence
  • The higher the resolution, the deeper the well for
    refinement but the narrower too: a small radius
    of convergence. At high resolution the atomic
    positions are tightly defined and cannot move
    very far.
  • A common error is to get trapped in the wrong
    well, known as a false minimum

28
Minimisation

29
How to avoid the false minimum
  • Traditionally this was done by starting at low
    resolution with an incomplete model
  • still, in my view, a good starting point: large
    radius of convergence
  • claimed to be irrelevant with ML target
  • Restrain the molecule as tightly as possible
    (especially at the start)
  • Especially important with NCS
  • Common problem with molecular replacement, hard
    to remove previous errors

30
How to get out?
  • Graphic rebuilding (OOPS)
  • omit difficult regions and re-refine
  • re-start refinement at lower resolution
  • chuck out waters
  • reset all B-factor and occupancies
  • impose NCS
  • simulated annealing/torsion dynamics

31
Annealing
  • Unlike all other minimization strategies,
    annealing does not have to go downhill
  • depending on the thermal energy in the system, the
    target can actually increase
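
A toy illustration of why annealing can climb out of a false minimum. Note the swap: this uses a Metropolis Monte Carlo acceptance rule on a one-dimensional double well, not the molecular dynamics used for crystallographic annealing, and every name, temperature and step size here is an arbitrary choice for the sketch.

```python
import math
import random

def accept_move(delta, temperature, rng=random):
    """Metropolis rule: always accept downhill, sometimes accept uphill."""
    if delta <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta / temperature)

def double_well(x):
    """Two minima: a false one near x = +1 and a deeper one near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def anneal(target, x0, t_start=2.0, t_end=0.01, cooling=0.95,
           steps_per_t=50, step=0.5, seed=0):
    """Slowly cool from t_start to t_end, returning the best point visited."""
    rng = random.Random(seed)
    x, best, t = x0, x0, t_start
    while t > t_end:
        for _ in range(steps_per_t):
            trial = x + rng.uniform(-step, step)
            if accept_move(target(trial) - target(x), t, rng):
                x = trial  # may be an uphill move while t is high
                if target(x) < target(best):
                    best = x
        t *= cooling
    return best

best = anneal(double_well, x0=1.0)  # start inside the false minimum
```

At high temperature uphill moves are accepted often, so the walk can cross the barrier; as the temperature drops the rule behaves more and more like plain downhill minimisation.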

32
Is simulated annealing the answer?
  • Apparently not
  • When it first appeared everyone found it
    wonderful (only in CNS/X-PLOR)
  • Free-R suggests that its use is very limited
    (tends to make things worse)
  • It is useful for calculating maps of difficult
    regions; it seems to reduce bias (simulated
    annealing omit maps)

33
Torsional dynamics (CNS)
  • Still based on an annealing algorithm
  • This time the molecule is split into rigid
    groups, only the rotation between the groups is
    allowed to vary
  • By reducing the variables it allows you to
    increase the radius of convergence (related to
    temperature)
  • this is a powerful technique in removing errors
    at the start of refinement
  • a full description by Paul Adams is here

34
Bulk solvent
  • Your model is your attempt to mimic the
    diffraction pattern
  • Diffraction comes from the whole crystal
  • The cavities are full of water
  • An atomic model does not represent these channels
  • The water cavities (bulk solvent) contribute
    strongly to data at a resolution lower than 8Å
    and dominate below 15Å

35
Handling bulk solvent
  • Refining against low resolution data without
    some sort of model for bulk solvent will cause
    errors
  • Ignoring low resolution data causes errors
  • Simpler approach (REFMAC/TNT) scales your Fcalcs
    at low resolution
  • CNS approach is to calculate a partial structure
    factor, based on the volume occupied by solvent
    (best)

36
Anisotropic B
  • Data is always stronger along the Lorentz axis
  • If your data is stronger along one crystal axis
    than another you have marked anisotropy
  • Even if you don't, you probably have it anyway!
  • It is corrected for by applying anisotropic
    B-factor corrections (typically 6 Bs)
  • If you apply this, check it!

37
Method of least squares
  • There are two excellent and detailed lectures by
    David Watkin on the method of least squares
    refinement. There is also Randy Read's lecture
    which discusses some of these concepts.
  • I'll give you the bluffer's guide to LS refinement

38
Least squares in the least effort
  • We wish to make Fc the same as Fo, we must have a
    starting value for Fc.
  • We construct a matrix which tells us how each
    parameter influences Fc (gradient)
  • This matrix is square and explicitly deals with
    correlation between parameters

39
More least squares
  • The inverse of the matrix is multiplied by a vector
    to give the shifts which need to be applied to
    improve the model.
  • Each term in the vector is composed of the change
    in Fc with parameter multiplied by the difference
    between Fo and Fc
  • constraints show their effect here
  • The big challenge is to set up the normal matrix

40
Least squares in protein cryst
  • Formulating the full matrix was beyond our
    abilities until very recently
  • Inverting the matrix is extremely time consuming
  • The result is we make approximations for the
    off-diagonal terms in the Normal matrix (usually
    set them to 0). There are a number of other
    mathematical tricks.
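
A sketch of what setting the off-diagonal terms to zero does, on an invented linear model with two parameters and unit weights. With only the diagonal kept, no matrix inversion is needed, but a single cycle no longer lands on the answer and the shifts have to be iterated. Convergence is shown for this toy case only and is not guaranteed in general.

```python
import numpy as np

derivs = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 1.0],
                   [1.0, 2.0]])
p_true = np.array([2.0, -1.0])
f_obs = derivs @ p_true

def diagonal_shifts(derivs, f_obs, params):
    """Shifts with all off-diagonal normal-matrix terms set to zero."""
    normal = derivs.T @ derivs
    vector = derivs.T @ (f_obs - derivs @ params)
    return vector / np.diag(normal)       # each parameter treated independently

params = np.zeros(2)
first_shift = diagonal_shifts(derivs, f_obs, params)

# Repeat the cheap cycle many times instead of inverting the full matrix once.
for _ in range(100):
    params = params + diagonal_shifts(derivs, f_obs, params)
```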

41
Conjugate gradient
  • This is a type of algorithm for minimisation; it
    is still based on least squares
  • In essence, it uses the results from the previous
    calculation to influence the current one
  • The principle is that you are on the right road, so
    without a compelling reason you should stay on it
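
A minimal sketch of the idea, using the textbook linear conjugate gradient method to solve a small set of normal equations N x = v. Refinement programs apply the method to a non-linear target, so this is an illustration of the principle rather than what any particular program does; the matrix and vector are invented.

```python
import numpy as np

def conjugate_gradient(normal, vector, n_iter=None, tol=1e-12):
    """Solve normal @ x = vector for a symmetric positive-definite matrix."""
    normal = np.asarray(normal, dtype=float)
    vector = np.asarray(vector, dtype=float)
    x = np.zeros_like(vector)
    residual = vector - normal @ x
    direction = residual.copy()
    rs_old = residual @ residual
    for _ in range(n_iter or len(vector)):
        if rs_old < tol:
            break
        nd = normal @ direction
        alpha = rs_old / (direction @ nd)     # step length along the direction
        x = x + alpha * direction
        residual = residual - alpha * nd
        rs_new = residual @ residual
        # New direction = current downhill direction plus a share of the
        # previous one: the earlier step influences the current step.
        direction = residual + (rs_new / rs_old) * direction
        rs_old = rs_new
    return x

normal = np.array([[3.0, 3.0], [3.0, 6.0]])
vector = np.array([3.0, 0.0])
shifts = conjugate_gradient(normal, vector)
```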

42
Consequences of least squares
  • We are trying to minimize the difference between
    Fobs and Fcalc.
  • We express this as an R-factor which only
    involves the modulus, in reality the target is a
    complex number involving phases
  • Where do we get observed phase?
  • The phases are derived from the current working
    model (last cycle of refinement); we assume the
    calculated phases are correct

43
Phases soak up error
  • The assumption that the calculated phases are
    correct must be wrong, but provided you are moving
    in the right direction, it becomes progressively
    more true
  • If you do not have a good estimate of the
    phases from your model, least squares cannot work
  • Least squares shunts the error into the phases,
    thus you move further away from the correct
    solution

44
Maximum likelihood
  • Maximum likelihood is extremely complex and
    difficult. I don't really understand it.
  • The following are the maxims I live by
  • it assumes the phases have errors
  • it down-weights terms which are poorly defined
    (i.e. high resolution data at the starting stages)
  • it will give a different model than least squares
  • should be much better than LS in early stages

45
Recipes/hints I
  • Spend as much time on the graphics as possible,
    fit the map as best as you can
  • set to zero occupancy anything you are unsure of
  • impose ncs, decide on exclusions, when in doubt
    include
  • start with lower resolution data (2.8Å is my
    favourite)

46
Recipes/hints II
  • Know your data!
  • What is the Wilson B-factor?
  • What are the R-merge and redundancy?
  • calculate Ramachandran plots regularly and
    monitor rms deviations of bonds/angles etc
  • experiment with refinement; if it does not work
    toss it, restrain things tightly
  • don't get hung up on a 0.2% difference

47
CNS strategy
  • For molecular replacement I find this to be a
    generally successful scheme
  • rigid body, assembly, dimer, monomer, domain
  • b-factor, reset to Wilson B
  • positional refinement till convergence
  • torsional dynamics (includes positional on the
    fly)
  • b-factor until convergence
  • At this point I do a major graphics rebuild

48
More CNS_SOLVE
  • After this rebuild I almost never do torsional
    dynamics again, cycle through positional,
    B-factor, water addition and manual
    rebuilding/correcting
  • I often reset all the Bs to the Wilson value after
    a rebuild
  • Each automated pass is usually 3 xyz, 3 B-factor
    and 2 water

49
MIR structures
  • In the early stages, these have to be restrained
    more tightly; the model phases are too inaccurate
    for sensible refinement (rule of thumb: R-factor
    above 45% or less than 65% of atoms is unlikely
    to work)
  • use experimental phases to hold the model in
    place (this is easily done in all refinement
    packages and is very effective)

50
Water addition
  • CNS has a script to do this, so does CCP4 (ARP)
  • I check every water once, after it has been
    included. I chuck out those with no H-bonds and
    those without density. I add them in batches; when
    the Free R does not drop by more than 0.5%, that's
    it, no more, unless the map is very clear

51
What to look for
  • B-factor distribution is very important, the
    absolute value is less important, it is outliers
    that you need to pick up
  • cold on the inside, hot on the outside
  • water B-factor should not be more than twice the
    average
  • real space R-factors are a good guide, routine
    with CNS

52
What to measure?
  • R-factors, bond deviations, angle deviations, ncs
    deviations, Ramachandran plot
  • What's good?
  • Rfree less than 30%, R and Rfree gap less than 5%
  • bonds < 0.02 Å, angles < 2°, Ramachandran 80% in
    core
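
These rules of thumb are easy to turn into a quick checklist. The sketch below assumes R-factors and the Ramachandran core fraction are given in percent, rms bond deviations in Å and rms angle deviations in degrees; the function name and the example statistics are hypothetical.

```python
def check_model(r_work, r_free, rms_bonds, rms_angles, rama_core_percent):
    """Apply the rule-of-thumb thresholds; returns a dict of pass/fail flags."""
    return {
        "Rfree below 30%": r_free < 30.0,
        "R to Rfree gap below 5%": (r_free - r_work) < 5.0,
        "rms bonds below 0.02 A": rms_bonds < 0.02,
        "rms angles below 2 deg": rms_angles < 2.0,
        "Ramachandran core above 80%": rama_core_percent > 80.0,
    }

# Hypothetical statistics for two models.
good = check_model(22.0, 26.0, 0.012, 1.5, 90.0)
suspect = check_model(20.0, 32.0, 0.012, 1.5, 90.0)
failed = [name for name, ok in suspect.items() if not ok]
```

No single flag settles anything; the checklist is only a prompt to look at all the numbers together.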

53
How do I know it's correct?
  • The oldest and most reliable test is: does your
    map contain sensible information which your model
    did not?
  • co-factor where you expect one
  • side chain instead of Ala
  • When this happens, the phases must be
    substantially correct
  • Trust your instinct and all values, don't focus
    on any one number

54
What do you do with final coordinates?
  • Deposit them!
  • This is very important
  • Still a cumbersome pain but we owe it to our
    sponsors to get it right
  • Currently all data are deposited at RCSB
  • Important to deposit as much information as
    possible

55
Words of warning
  • What does your coordinate file actually mean?
  • Accuracy: your xyz are given to 3 decimal places,
    is this realistic?
  • Occupancy set to 0, why?
  • High B-factors, how do you explain the meaning to
    biologists?
  • How do we explain messy regions?

56
Validation
  • Validation should proceed hand in hand with
    refinement, not be done at the end
  • It is a very useful tool in refining your structure
  • Remember not everything in your structure can be
    perfect. Some bonds will be too long, you may have
    Ramachandran outliers. Try to assess validation
    information critically