Title: Binding site prediction on protein surfaces using Support Vector Machines
1Binding site prediction on protein surfaces using
Support Vector Machines
- James Bradford
- Leeds Bioinformatics Group
2Talk Structure
- Motivation.
- Introduction to machine learning and Support Vector Machines.
- Methods.
- Results.
- Initial findings.
3Binding site prediction
- Complements our work on the docking problem.
- Reduces the search space for our docking algorithm.
  - Decrease in number of docking solutions.
  - Scoring functions are computationally intensive.
4Machine learning
- Making predictions by automated learning from existing knowledge.
- Learning
  - requires training data where the answer is known.
  - generates rules or other functions that fit the training data.
- The trained method is then used to predict on new data.
5Support Vector Machines (SVMs)
- A family of learning algorithms that aim to
  - generate a hyperplane that divides a training set of examples labeled positive and negative such that all points with the same label appear on the same side of the hyperplane,
  - maximise the distance between the two classes and the hyperplane (optimal separating hyperplane, OSH).
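As a small illustration of the two aims above, the sketch below checks whether a candidate hyperplane separates a labelled set and computes its margin. The toy points and the two candidate hyperplanes (`w`, `b`) are made up for illustration. A real SVM finds the margin-maximising hyperplane by optimisation, which is not shown here.

```python
import math

def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest label-weighted distance; positive only if every point lies on its own side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Toy, hand-made data (not from the talk): labels are +1 / -1.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate hyperplanes that both separate the data.
m_a = margin((1.0, 1.0), -2.0, points, labels)  # the line x + y = 2
m_b = margin((1.0, 0.0), -1.5, points, labels)  # the line x = 1.5
```

Both candidates put each class on its own side, but the first leaves a wider gap to the nearest points, which is the property the OSH maximises.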
6Support Vector Machines (SVMs)
- In practice, training sets are usually non-separable...
  - Position the OSH to minimise the number of misclassified points (Fig 1).
- ...and non-linear.
  - Use a kernel function to map the data from the input space into a high-dimensional feature space (Fig 2).
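To make the kernel idea concrete, the sketch below uses a degree-2 polynomial kernel on 2D points. The kernel is chosen purely for illustration, since the talk does not state which kernel was used. It shows that the kernel value equals a dot product in an explicit 3D feature space, and that XOR-style points, which no straight line separates in 2D, are split by a plane in that feature space.

```python
import math

def poly2_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

# XOR-style toy data: the label is the sign of x1 * x2.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]

# In feature space the second coordinate (sqrt(2) * x1 * x2) alone separates
# the classes, i.e. the hyperplane with normal (0, 1, 0) through the origin.
normal = (0.0, 1.0, 0.0)
predictions = [1 if dot(normal, feature_map(p)) > 0 else -1 for p in points]
```

The kernel lets the SVM work with such dot products directly, without ever constructing the feature vectors.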
Fig 1
Fig 2
7SVMs and Binding Site Prediction
Surface patches of contiguous residues are
classified as either part of or outside the
interface between two proteins i.e. the binding
site.
Molecular surface of trypsin showing the Bowman-Birk inhibitor binding site (PDB code 1tab)
Figure labels: interface patch, non-interface patch, actual binding site
8Training Set
- SVM trained on either the whole training set...
- ...or training sets subdivided into
  - homodimers
  - enzymes
  - inhibitors
  - heterodimers (transient, small)
  - heterodimers (transient, large)
  - hetero-obligomers.
- Some interface properties are specific to protein type, for example see
  - Bradford & Westhead (2003) Asymmetric mutation rates at enzyme-inhibitor interfaces: Implications for the docking problem. Protein Science 12: 2099-2103.
9PDB -> SVM
Calculate solvent excluded surface
Label each surface vertex with eight chemical, geometrical or physical properties
Training branch: Define true binding site -> Generate interface and non-interface patches -> Calculate patch attributes -> Train SVM
Prediction branch: Generate patches -> Calculate patch attributes -> Predict
10Patch Characteristics
- Interface patch
  - Circular.
  - Centre = centre of actual binding site.
  - Number of surface vertices = 0.08 x number of surface vertices of smallest protein in dimer.
- Non-interface patch
  - As for interface patch except...
  - Centre randomly selected from non-interface vertex set.
- Why not just use the actual interface?
  - No prior knowledge of size and shape of interface in blind prediction.
- No. of non-interface patches = no. of interface patches.
  - SVM training is balanced.
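The patch rules above can be sketched as follows. This is a minimal illustration and not the talk's implementation: the function names are hypothetical, rounding the patch size to the nearest integer is an assumption, and "circular" is approximated by taking the vertices nearest to the centre in straight-line distance.

```python
import math
import random

PATCH_FRACTION = 0.08  # from the slide: 0.08 x vertices of the smallest protein

def patch_size(n_vertices_smallest_protein):
    """Number of vertices per patch (rounding is an assumption)."""
    return max(1, round(PATCH_FRACTION * n_vertices_smallest_protein))

def make_patch(vertices, centre_index, size):
    """Return the indices of the `size` vertices closest to the centre vertex.

    Assumption: straight-line distance stands in for 'circular';
    a distance measured along the surface could be used instead.
    """
    centre = vertices[centre_index]
    order = sorted(range(len(vertices)), key=lambda i: math.dist(vertices[i], centre))
    return order[:size]

def non_interface_patches(vertices, non_interface_indices, size, n_patches, seed=0):
    """Draw randomly centred patches, one per interface patch, to keep training balanced."""
    rng = random.Random(seed)
    centres = [rng.choice(non_interface_indices) for _ in range(n_patches)]
    return [make_patch(vertices, c, size) for c in centres]
```

Passing the number of interface patches as `n_patches` gives equal class sizes, which is the balance the slide refers to.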
11Surface Properties
- Eight properties seen as useful in distinguishing binding sites from the rest of the surface.
  - Conservation
  - Shape index
  - Curvedness
  - Hydrophobicity
  - Electrostatic potential
  - Residue propensity
  - Solvent accessibility
  - Secondary structure
- Jones & Thornton (1996) Principles of protein-protein interactions. Proc. Natl. Acad. Sci. USA 93: 13-20.
12Conservation
- Calculated using Scorecons (William Valdar)
- Clusters of conserved residues can sometimes
characterise a functional site.
Conservation at the BPTI binding site on trypsin (PDB code 2ptc)
Figure panels: interface, conservation
13Shape Index and Curvedness
- Calculated from the principal curvatures at each surface vertex.
- Shape index
  - Scale: -1 (concave) through 0 (flat) to 1 (convex).
  - Makes concave clefts and convex protrusions easy to identify.
Shape characteristics of Bowman-Birk inhibitor (PDB code 1tab)
Figure panels: shape index, curvedness, interface
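For reference, both quantities can be written as simple functions of the two principal curvatures. The sketch below uses the common Koenderink-style definitions. The sign convention (positive curvature = convex) is an assumption chosen to match the -1 to 1 scale on the slide, and the talk does not say exactly which formulas were used.

```python
import math

def shape_index(k1, k2):
    """Shape index in [-1, 1] from the two principal curvatures.

    Assumed sign convention: positive curvature = convex, so -1 is a
    concave cup and +1 a convex cap.
    """
    k_max, k_min = max(k1, k2), min(k1, k2)
    # atan2 avoids division by zero when both curvatures are equal.
    # For a perfectly flat point (0, 0) it returns 0.
    return (2.0 / math.pi) * math.atan2(k_max + k_min, k_max - k_min)

def curvedness(k1, k2):
    """Magnitude of curvature, independent of shape: sqrt((k1^2 + k2^2) / 2)."""
    return math.sqrt((k1 * k1 + k2 * k2) / 2.0)
```

Shape index describes the kind of local shape, while curvedness describes how strongly the surface bends, so a shallow and a deep cleft share a shape index but differ in curvedness.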
14Electrostatic Potential
- Calculated by Delphi.
- Interface may be marked by an area of particularly positive or negative potential.
Positive potential at the eglin c interface complements negative potential on the thermitase binding surface (PDB code 2tec).
Figure panels (electrostatic potential): thermitase binding site on eglin c, eglin c binding site on thermitase
15Other Properties (1)
- Hydrophobicity
  - Simple hydrophobicity scale (Fauchère and Pliska 1983).
  - Homodimer interfaces tend to be hydrophobic.
- Solvent accessibility
  - MSMS (Michael Sanner).
  - Outputs accessible surface area of each atom.
  - Clefts are less accessible than protrusions.
16Other Properties (2)
- Residue propensity
  - Knowledge based.
  - Calculated for each amino acid as the fraction of accessible surface area (ASA) that amino acid contributes to the interface compared to its contribution to the whole surface (Jones and Thornton 1996).
  - Residue propensity > 1 means that the residue occurs more frequently at the interface.
- Secondary structure
  - Extracted from PDB atom coordinates using STRIDE (Frishman & Argos 1995).
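The propensity definition above is a ratio of two ASA fractions, which the sketch below spells out. The function name and the ASA totals are made up for illustration, and the "whole surface" totals are assumed to include the interface.

```python
def residue_propensity(interface_asa, surface_asa):
    """Interface propensity per amino acid from summed accessible surface areas.

    interface_asa / surface_asa: dicts mapping residue type to its total ASA
    in the interface / on the whole surface of the data set.
    """
    total_interface = sum(interface_asa.values())
    total_surface = sum(surface_asa.values())
    propensities = {}
    for residue, asa_surface in surface_asa.items():
        frac_interface = interface_asa.get(residue, 0.0) / total_interface
        frac_surface = asa_surface / total_surface
        propensities[residue] = frac_interface / frac_surface
    return propensities

# Made-up ASA totals (square angstroms) for three residue types.
interface_asa = {"LEU": 300.0, "LYS": 100.0, "SER": 100.0}
surface_asa = {"LEU": 1000.0, "LYS": 2000.0, "SER": 1000.0}
propensity = residue_propensity(interface_asa, surface_asa)
```

With these toy numbers leucine is over-represented at the interface (value above 1) and lysine under-represented (value below 1).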
17Patch Attributes
- Mean and standard deviation
  - Conservation
  - Shape index
  - Curvedness
  - Hydrophobicity
  - Electrostatic potential
  - Solvent accessibility
  - Residue propensity
- Proportion
  - Conserved / Variable
  - Concave / Convex
  - Helix / Sheet / Other
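A patch's attribute vector can be assembled from its per-vertex values roughly as below. This is a sketch under stated assumptions: the function and key names are hypothetical, and the population standard deviation is used because the slide does not say which form was intended.

```python
from statistics import mean, pstdev

def patch_attributes(vertex_values, categories):
    """Summarise one patch as a flat attribute dict.

    vertex_values: property name -> list of per-vertex numbers in the patch.
    categories: category name -> list of per-vertex labels in the patch.
    """
    attributes = {}
    for name, values in vertex_values.items():
        attributes[name + "_mean"] = mean(values)
        # Population standard deviation is an assumption.
        attributes[name + "_sd"] = pstdev(values)
    for name, labels in categories.items():
        # Proportion of patch vertices carrying each label that occurs.
        for label in sorted(set(labels)):
            attributes[name + "_" + label] = labels.count(label) / len(labels)
    return attributes

# Hypothetical four-vertex patch.
attrs = patch_attributes(
    {"shape_index": [-0.5, -0.5, 0.5, 0.5]},
    {"secondary_structure": ["helix", "helix", "sheet", "other"]},
)
```

In practice a fixed list of labels per category would be needed so that every patch yields an attribute vector of the same length, even when a label is absent from the patch.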
18Initial Results
19Summary
- Methods have been implemented to train an SVM to distinguish between an interface patch and a non-interface patch.
- Training on data sets subdivided by protein type is more accurate than training on all proteins together.
- Results need to be validated.
- Successful predictions on blind data are the ultimate aim.
20Acknowledgements
- Supervisor: David Westhead
- Funding: BBSRC
- Support: Leeds Bioinformatics Group
Contact
- Email: bmbjrb_at_bmb.leeds.ac.uk
- Website: http://www.bioinformatics.leeds.ac.uk