Title: Chemoinformatics tools for lead discovery
1Chemoinformatics tools for lead discovery
- Peter Willett, University of Sheffield, UK
2Overview of talk
- Approaches to virtual screening
- Fingerprint-based similarity searching
- Turbo similarity searching
- Conclusions
3Virtual screening
- The huge numbers of molecules available in public and in-house databases mean that there is a requirement for tools to rank compounds in order of decreasing probability of activity
- Range of methods available, varying in their sophistication and in the amount of information that is available
- Use of structure-based methods when an X-ray structure for the biological target is available
- If this is not the case, must then make use of information about (potential) ligands
4Ligand-Based Methods
- Similarity searching
  - Use when just a single bioactive reference structure is available
- 3D pharmacophore searching
  - Use when it has been possible to carry out a pharmacophore mapping exercise
- Machine learning
  - Use when a fair number of both actives and inactives have been identified
5Similarity Searching I
- Use of a similarity measure to quantify the resemblance between an active target, or reference, structure and each database structure
- The similar property principle means that high-ranked structures are likely to have similar activities to that of the target structure
- Similarity searching hence provides an obvious way of following up on an initial active
6Similarity searching II
- Many ways in which the similarity between two molecules can be computed
- A similarity measure has two components
  - A structure representation
  - A similarity coefficient to compare two representations
- Most operational systems use similarity measures based on 2D fingerprints and the Tanimoto coefficient
7Fragment bit-strings (fingerprints)
- Originally developed for 2D substructure search
- Similarity is based on the fragments common to two molecules
- Widely used in both in-house and commercial chemoinformatics systems
8Similarity coefficients
- Tanimoto coefficient for binary bit strings: S = C / (T + D - C)
  - C: bits set in common between Target and Database structure
  - T: bits set in Target
  - D: bits set in Database structure
  - Values between zero (no bits in common) and unity (identical fingerprints)
- Many other, related similarity coefficients exist
  - Tversky, cosine, Euclidean distance ...
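As a minimal sketch of the definitions above, the Tanimoto coefficient can be computed from C, T and D and used to rank a database against a target. The set-of-bit-positions representation, the function name `tanimoto`, the toy molecules and the handling of two empty fingerprints are all assumptions made for illustration; they are not taken from the talk.

```python
def tanimoto(target: set, database: set) -> float:
    """Tanimoto coefficient for fingerprints stored as sets of 'on' bit positions."""
    c = len(target & database)   # C: bits set in common
    t = len(target)              # T: bits set in the target
    d = len(database)            # D: bits set in the database structure
    if t + d - c == 0:
        return 0.0               # assumption: two empty fingerprints score zero
    return c / (t + d - c)


# Hypothetical fingerprints (illustrative only).
target_fp = {1, 4, 7, 9}
database_fps = {
    "mol_a": {1, 4, 7, 9},   # identical to the target
    "mol_b": {1, 4, 8},      # two bits in common
    "mol_c": {2, 3},         # no bits in common
}

# Similarity search: rank database structures by decreasing similarity.
ranking = sorted(
    database_fps,
    key=lambda name: tanimoto(target_fp, database_fps[name]),
    reverse=True,
)
print(ranking)
```

In practice a chemoinformatics toolkit would supply both the fingerprints and the coefficient; the sketch only shows how the three counts combine.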
9Combination of search techniques using data
fusion I
- Tanimoto/fingerprint measures most common but many other types, e.g.,
  - Computed physicochemical properties
  - 3D grid describing the molecular electrostatic potential
- These reflect different molecular characteristics, so may enhance search performance by using more than one similarity measure
  - Data fusion or consensus scoring
10Combination of search techniques using data
fusion II
- Combination of different rankings of the same sets of molecules
- Two basic approaches
  - Generate rankings from the same reference molecule using different similarity measures (similarity fusion)
  - Generate rankings from different reference molecules using the same similarity measure (group fusion)
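Similarity fusion can be sketched as follows: one reference, two similarity measures, and a fused ranking. The choice of Tanimoto and cosine coefficients on the same fingerprints, the sum-of-ranks fusion rule, and every name and data value here are assumptions for illustration; the talk does not specify how similarity fusion rankings were combined.

```python
import math


def tanimoto(a: set, b: set) -> float:
    c = len(a & b)
    denom = len(a) + len(b) - c
    return c / denom if denom else 0.0


def cosine(a: set, b: set) -> float:
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def rank_positions(scores: dict) -> dict:
    """Map each molecule to its rank position (1 = most similar)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def similarity_fusion(reference: set, database: dict, measures) -> list:
    """Fuse the rankings from several measures by summing rank positions (assumed rule)."""
    fused = {name: 0 for name in database}
    for measure in measures:
        scores = {name: measure(reference, fp) for name, fp in database.items()}
        for name, pos in rank_positions(scores).items():
            fused[name] += pos
    # A smaller rank sum means the molecule was ranked highly by more measures.
    return sorted(fused, key=fused.get)


# Hypothetical data (illustrative only).
reference = {1, 2, 3, 4}
database = {"a": {1, 2, 3, 4, 5}, "b": {1, 2}, "c": {7, 8}}
fused_ranking = similarity_fusion(reference, database, [tanimoto, cosine])
print(fused_ranking)
```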
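11Group fusion
[Figure: ranked list for Reference 1]
12After truncation to required rank
[Figure: truncated ranked lists for Reference 1, Reference 2 and Reference 3]
13Fused
[Figure: group fusion of the lists into a final truncated list; labels: r 1000, r 2000; legend: new active, active found in earlier list]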
14Group fusion rules
- Useful performance increases, even with just 10 actives, as better coverage of structural space with multiple starting points
- Improvement most obvious when searching for heterogeneous sets of active molecules
- Best results obtained by
  - Fusing similarity coefficient values, rather than ranks
  - Re-ranking using the maximum of the similarity values associated with each molecule
  - Using the Tanimoto coefficient
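The rules above (fuse scores rather than ranks, re-rank on the maximum, use the Tanimoto coefficient) can be sketched in a few lines. The set-based fingerprints, the function names and the toy data are assumptions for illustration only.

```python
def tanimoto(a: set, b: set) -> float:
    c = len(a & b)
    denom = len(a) + len(b) - c
    return c / denom if denom else 0.0


def group_fusion_max(references: list, database: dict) -> list:
    """Score each database molecule by its maximum similarity to any reference,
    then re-rank by that fused score (MAX rule on similarity values)."""
    fused = {
        name: max(tanimoto(ref, fp) for ref in references)
        for name, fp in database.items()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical, structurally different reference actives.
references = [{1, 2, 3}, {7, 8, 9}]
database = {
    "x": {1, 2, 3, 4},   # close to the first reference
    "y": {7, 8},         # close to the second reference
    "z": {1, 7, 20},     # weakly similar to both
}
fused_ranking = group_fusion_max(references, database)
print(fused_ranking)
```

Because each molecule keeps its best score over all references, a compound close to any one of the starting points is ranked highly, which is one way to read the "better coverage of structural space" point.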
15Turbo similarity searching I
- Similar property principle: nearest neighbours are likely to exhibit the same activity as the reference structure
- Group fusion improves the identification of active compounds
- Potential for further enhancements by group fusion of rankings from the reference structure and from its assumed active nearest neighbours
16Turbo similarity searching II
[Figure: diagram with elements labelled REFERENCE STRUCTURE, RANKED LIST and NEAREST NEIGHBOURS]
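A minimal sketch of turbo similarity searching, assuming set-based fingerprints, the Tanimoto coefficient and MAX fusion on similarity scores: run a conventional search, treat the top k nearest neighbours as assumed actives, then fuse the searches from the reference and from each neighbour. The function names, the value of k and the toy data are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    c = len(a & b)
    denom = len(a) + len(b) - c
    return c / denom if denom else 0.0


def similarity_search(reference: set, database: dict) -> list:
    """Conventional search: names ranked by decreasing similarity to the reference."""
    return sorted(database, key=lambda n: tanimoto(reference, database[n]), reverse=True)


def turbo_search(reference: set, database: dict, k: int) -> list:
    # Step 1: the k nearest neighbours of the reference are assumed to be active.
    neighbours = similarity_search(reference, database)[:k]
    # Step 2: the reference and its assumed-active neighbours all act as references.
    references = [reference] + [database[n] for n in neighbours]
    # Step 3: MAX group fusion on similarity scores. Each neighbour scores 1.0
    # against itself, so the neighbours stay at the top of the fused list.
    fused = {n: max(tanimoto(r, fp) for r in references) for n, fp in database.items()}
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical data (illustrative only).
reference = {1, 2, 3, 4}
database = {
    "n1": {1, 2, 3, 5},
    "mid": {1, 2, 8, 9},
    "n2": {3, 5, 6},
    "far": {10, 11},
}
conventional = similarity_search(reference, database)
turbo = turbo_search(reference, database, k=1)
print(conventional)
print(turbo)
```

In this toy case "n2" shares little with the reference but more with the nearest neighbour "n1", so it moves up in the turbo ranking.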
17Experimental details
- MDL Drug Data Report (MDDR) dataset of 11 activity classes and 102K structures
- In all, 8294 actives in the 11 classes, with (turbo) similarity searches being carried out using each of these as the reference structure
- ECFP_4 fingerprints/Tanimoto coefficient
- MAX group fusion on similarity scores
- Increasing numbers of nearest neighbours
18Numbers of nearest neighbours
19Upper and lower bound experiments
20Rationale for upper bound results
- The true actives in the set of assumed actives yield significant enhancements in performance
- The true inactives in the set of assumed actives have little effect on performance
- Taken together, the two groups of compounds yield the observed net enhancement
21Use of machine-learning methods for similarity
searching I
- Turbo similarity searching uses group fusion to enhance conventional similarity searching
- Machine learning is a more powerful virtual screening tool than similarity searching
  - But requires a training set containing known actives and inactives
- Given an active reference structure, a training set can be generated by
  - Using the k nearest neighbours of the reference structure as the actives
  - Using k randomly chosen, low-similarity compounds as the inactives
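The training-set construction described above might be sketched as follows. The similarity cutoff used to define "low-similarity" compounds, the fixed random seed, the function names and the toy data are all assumptions; the talk does not state how the low-similarity pool was defined.

```python
import random


def tanimoto(a: set, b: set) -> float:
    c = len(a & b)
    denom = len(a) + len(b) - c
    return c / denom if denom else 0.0


def build_training_set(reference: set, database: dict, k: int,
                       low_similarity_cutoff: float = 0.2, seed: int = 0):
    """Return (assumed_actives, assumed_inactives) for one active reference structure."""
    scores = {name: tanimoto(reference, fp) for name, fp in database.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumed actives: the k nearest neighbours of the reference.
    assumed_actives = ranked[:k]
    # Assumed inactives: k compounds drawn at random from those at or below
    # the (hypothetical) similarity cutoff.
    low_pool = [n for n in ranked[k:] if scores[n] <= low_similarity_cutoff]
    rng = random.Random(seed)
    assumed_inactives = rng.sample(low_pool, min(k, len(low_pool)))
    return assumed_actives, assumed_inactives


# Hypothetical data (illustrative only).
reference = {1, 2, 3, 4}
database = {
    "a": {1, 2, 3, 4, 5},
    "b": {1, 2, 3},
    "c": {1, 9, 10, 11, 12},
    "d": {20, 21},
    "e": {30, 31, 32},
    "f": {2, 3, 40},
}
actives, inactives = build_training_set(reference, database, k=2)
print(actives, inactives)
```

The two lists could then be passed to whichever machine-learning method is being used in place of experimentally confirmed actives and inactives.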
22Use of machine-learning methods for similarity
searching II
23Results I
- Experiments with the MDDR dataset show that group fusion is better than machine-learning methods when averaged over all of the classes
- However, group fusion is inferior for the most diverse datasets (as measured by the mean pair-wise similarities)
- Additional searches using 10 MDDR activity classes that are as structurally diverse as possible
24Results II
25Conclusions I
- Fingerprint-based similarity searching using a known reference structure is long-established in chemoinformatics
- When small numbers of actives are available, group fusion will enhance performance when the sought actives are structurally heterogeneous
26Conclusions II
- Can also enhance conventional similarity search, even if there is just a single active, by assuming that the nearest neighbours are also active
- Can be effected in two ways
  - Use of group fusion to combine similarity rankings (overall best approach)
  - Use of substructural analysis to compute fragment weights (best with highly heterogeneous sets of actives)
27Acknowledgements
- Collaborators
  - Jerome Hert, Martin Whittle and David Wilton
  - Pierre Acklin, Kamal Azzaoui, Edgar Jacoby and Ansgar Schuffenhauer
  - Alexander Alex, Jens Loesel and Jonathan Mason
- Funding, software and data support
  - Barnard Chemical Information, Daylight Chemical Information Systems, MDL Information Systems, Novartis Institutes for BioMedical Research, Pfizer Global Research and Development, Royal Society, Scitegic, Tripos, and the Wolfson Foundation