Title: In silico screening in modern drug discovery research
1In silico screening in modern drug discovery
research
- Presented by Olga Komina
- Department of Computer Science and Engineering
- University of Nebraska-Lincoln
- July 2004
2Modern Drug Discovery
- Multidisciplinary area of research
- Combinatorial chemistry
- Chemoinformatics
- Molecular biology
- Biochemistry
- Medicine
- Macromolecular modeling
- Pharmacology
- Drug Discovery is a goal of research. Methods and
approaches from different science areas can be
applied to achieve the goal.
3 Drug Discovery Pipeline
- Target identification and validation
- Assay development
- Virtual screening (VS)
- High throughput screening (HTS)
- Quantitative structure-activity relationship
(QSAR) and refinement of compounds
- Characterization of prospective drugs
- Testing on animals for activity and side effects
- Clinical trials
- FDA approval
4Computer-aided Drug Design Strategy
5Mechanism of Drug Action
6Virtual Screening (VS)
- In silico screening of large compound databases
in order to reduce the scale of high-throughput
screening
- Conceptual diversity
- Small molecule screening
- Protein structure based screening
- Algorithmic diversity
- Similarity searching
- Clustering and partitioning
- Simple filters
- Artificial intelligence
- Integration of different computational approaches
- Similarity paradox
7Similarity Paradox
8Descriptors of Molecular Structure Properties
- 1D-descriptors encode chemical composition and
physicochemical properties
- MW, CmOnHk, hydrophobicity
- 2D-descriptors encode chemical topology
- Connectivity indices, degree of branching, degree
of flexibility, number of aromatic bonds
- 3D-descriptors encode 3D shape, volume,
functionality, surface area
- Pharmacophore: the spatial arrangement of
chemical groups that determines a molecule's activity
9Connectivity Indices
- Connectivity of an atom: the number of atoms
connected to it
- Connectivity of a bond: the reciprocal of the
square root of the product of the connectivities
of its two atoms
- Connectivity index of a molecule: summation of
all bond connectivities
- Example molecule: isobutyl alcohol
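The definitions above can be checked on the example molecule. The following is a minimal sketch, assuming a hydrogen-suppressed graph (only heavy atoms are counted) and illustrative atom labels that do not come from the slides.

```python
from math import sqrt

def connectivity_index(bonds):
    """Sum of bond connectivities over a hydrogen-suppressed molecular graph."""
    # Connectivity of an atom: the number of atoms bonded to it.
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Connectivity of a bond: 1 / sqrt(product of its two atom connectivities).
    return sum(1.0 / sqrt(degree[a] * degree[b]) for a, b in bonds)

# Isobutyl alcohol, (CH3)2CH-CH2-OH, heavy atoms only (labels are illustrative).
isobutyl_alcohol = [("C1", "C3"), ("C2", "C3"), ("C3", "C4"), ("C4", "O")]
chi = connectivity_index(isobutyl_alcohol)
print(round(chi, 3))  # 2.27
```

The two methyl bonds each contribute 1/sqrt(1*3), the CH-CH2 bond 1/sqrt(3*2), and the C-O bond 1/sqrt(2*1).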
10Classification of Atoms to Atom Types
- Developed for prediction of log P values
- A molecule is characterized by the counts of 120
atom types
- Atom type: commonly occurring atomic states of
C, H, O, N, S, P, Se and halogens (F, Cl, Br, I)
11Atom Types (Carbon)
12Example Description of a Molecule by the Count
of Atom Types
13Molecular Fingerprints
- Molecule A: 00011101010
- Molecule B: 00101111000
- Tanimoto coefficient:
$T_c = \frac{N_c}{N_A + N_B - N_c} = \frac{3}{5 + 5 - 3} = \frac{3}{7} \approx 0.43$
- $N_c$ is the number of common bits set on, $N_A$
is the number of bits set in A, and $N_B$ is the
number of bits set in B.
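The Tanimoto calculation for the two fingerprints on this slide can be reproduced with a short sketch. It assumes fingerprints are given as equal-length strings of "0" and "1"; the function name and the convention of returning 0.0 for two empty fingerprints are choices made here, not part of the slides.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient Nc / (NA + NB - Nc) for two bit-string fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    n_a = fp_a.count("1")  # bits set in A
    n_b = fp_b.count("1")  # bits set in B
    n_c = sum(1 for x, y in zip(fp_a, fp_b) if x == y == "1")  # common bits
    denominator = n_a + n_b - n_c
    # Assumed convention: two all-zero fingerprints get a similarity of 0.0.
    return n_c / denominator if denominator else 0.0

tc = tanimoto("00011101010", "00101111000")
print(round(tc, 2))  # 0.43
```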
14Drugs vs. Non-drugs
- Enriching screening libraries with drug-like
compounds
- "Fail fast, fail cheap" strategy
- Manual classification is time-consuming and biased
- Computational approaches speed up the screening,
reduce the size and improve the quality of
combinatorial libraries
- Assumption: typical drugs have something in
common that other compounds lack
15Lipinski Rule of Five (1997)
- Poor absorption and permeation are more likely to
occur when there are more than 5 hydrogen-bond
donors, more than 10 hydrogen-bond acceptors, the
molecular mass is greater than 500, or the log P
value is greater than 5.
- Further research studied a broader range of
physicochemical and structural properties
- Related problems
- Compound toxicity
- Compound mutagenicity
- Blood-brain barrier penetration
- Central nervous system activity
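The four thresholds of the rule of five translate directly into a simple filter. The sketch below assumes the descriptor values are already computed; the function name and the two example compounds are hypothetical and use invented values.

```python
def rule_of_five_violations(h_bond_donors, h_bond_acceptors, molecular_mass, log_p):
    """Return the rule-of-five criteria that a compound exceeds."""
    checks = {
        "more than 5 hydrogen-bond donors": h_bond_donors > 5,
        "more than 10 hydrogen-bond acceptors": h_bond_acceptors > 10,
        "molecular mass greater than 500": molecular_mass > 500,
        "log P greater than 5": log_p > 5,
    }
    return [name for name, violated in checks.items() if violated]

# Hypothetical descriptor values, invented for illustration only.
small_compound = rule_of_five_violations(2, 5, 350.4, 2.1)
large_compound = rule_of_five_violations(6, 12, 720.0, 6.3)
print(small_compound)       # []
print(len(large_compound))  # 4
```

A non-empty result flags a compound as more likely to show poor absorption or permeation; it does not by itself mark the compound as a non-drug.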
16Data Sets
- Drug Databases
- World Drug Index (WDI)
- Comprehensive Medicinal Chemistry (CMC)
- MACCS-II Drug Data Report (MDDR)
- Non-drug Databases
- Available Chemicals Directory (ACD)
- Quality of training sets
17Artificial Neural Networks
- ANNs are self-learning systems that learn from
experience
- Biologically inspired
- A neuron is a processing element
- An artificial neuron simulates four basic
functions of a natural neuron:
- Receives input from other sources
- Combines those inputs in some way
- Performs nonlinear operations on the result
- Outputs the final result
18Artificial Neuron
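The four functions of an artificial neuron (receive, combine, apply a nonlinear operation, output) can be written in a few lines. This is a minimal sketch that assumes a weighted sum as the combining step and a sigmoid as the nonlinearity; the slides do not say which functions were used.

```python
from math import exp

def sigmoid(z):
    """A common nonlinear activation; assumed here, not specified in the slides."""
    return 1.0 / (1.0 + exp(-z))

def neuron(inputs, weights, bias):
    # 1. Receive inputs; 2. combine them as a weighted sum plus a bias.
    combined = sum(w * x for w, x in zip(weights, inputs)) + bias
    # 3. Apply the nonlinear operation; 4. output the result.
    return sigmoid(combined)

output = neuron([0.0, 0.0], [1.0, 1.0], 0.0)
print(output)  # 0.5
```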
19Network Topology
20ANN Training
- Supervised: both inputs and outputs are provided
- Initial weights chosen randomly
- Errors propagated back through the system to
adjust weights
- Most common algorithm: backward-error propagation
(back-propagation)
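The training loop described above (random initial weights, errors propagated back to adjust the weights) can be sketched for a small feedforward network with one hidden layer. Everything concrete here is an assumption made for illustration: the sigmoid units, the squared-error loss, the learning rate, and the toy data set (logical AND), which stands in for real descriptor vectors.

```python
import random
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def forward(x, w_hidden, w_out):
    # Each weight row ends with a bias weight, fed by a constant input of 1.0.
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - y) ** 2 for x, y in data) / len(data)

def train(data, n_hidden=3, epochs=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Initial weights chosen randomly.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, y in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Error term at the output, using the sigmoid derivative o * (1 - o).
            delta_out = (output - y) * output * (1.0 - output)
            # Propagate the error back to each hidden unit.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Adjust weights against the error gradient.
            for j, v in enumerate(hidden + [1.0]):
                w_out[j] -= rate * delta_out * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= rate * delta_hidden[j] * v
    return w_hidden, w_out

# Toy supervised data (inputs and target outputs): logical AND.
toy_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
untrained = train(toy_data, epochs=0)
trained = train(toy_data)
error_before = mean_squared_error(toy_data, *untrained)
error_after = mean_squared_error(toy_data, *trained)
print(error_before > error_after)
```

Using the same seed for both calls means the untrained network is exactly the starting point of the trained one, so the two errors are directly comparable.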
21ANNs for Drug Classification (1998)
- Input: counts of atom types
- Topology: 92 x 5 x 1
- Feedforward with backpropagation
- Training: 5000 ACD and 5000 WDI
- Accuracy: 83% ACD, 77% WDI
22ANNs for Drug Classification (1998)
- Input: seven 1D descriptors (including MW, log P,
aromatic density) and ISIS fingerprints
- Topology: 173 x 0/5/10 x 1
- Bayesian learning procedure
- Training: 3500 ACD and 3500 CMC
- Accuracy: 90% CMC, 80% MDDR, 90% ACD
23Misclassification Examples
Misclassified non-drug
Misclassified drug
24ANNs to Predict Biological Activity
- Applications
- CNS-active compounds
- Protein kinase inhibitors
- G protein-coupled receptor ligands
- Best prediction accuracy: 80%
- Advantage: capable of predicting structurally
diverse compounds
- Disadvantage: no definite rules
25Recursive Partitioning
- Statistical method for analyzing and mining large
data sets that consist of active and inactive
molecules
- HTS data analyzed to discover SAR
- Easy to visualize and interpret
- Applicable to a variety of classification problems
- Here, the problem of assigning chemical compounds
to property classes based on their structural and
physicochemical features
26Partitioning Problem Definition
- Given a training set of D descriptor values and P
property values for each molecule in the set, the
task is to create a set of yes/no questions
organized into a hierarchical tree form, with one
question per node and class predictions at leaf
nodes, that has minimum classification error.
27Single Property RP
- Single property classification, such as molecules
classified as active or inactive
- Drugs vs. Non-drugs
- C4.5, C5.0
28Single Property RP (cont.)
- All possible questions are asked based on single
descriptor values, and scores of the corresponding
partitions are computed
- The descriptor resulting in the best score is
used to grow the tree
- Loop back to question asking until a terminating
condition is met
29Gini Impurity Metric
- Impurity, $I$, of a node:
- $I = \sum_{i \neq j} p_i p_j$
- where $p_i$ and $p_j$ are the fractions of the
members of a node that belong to class values $i$
and $j$, respectively
- The Gini metric maximizes the decrease in
impurity, $\Delta I$, from a potential node question
- $\Delta I = I - p_L I_L - p_R I_R$
- where $p_L$ and $p_R$ are the fractions of the node
members that partition to nodes L and R,
respectively, for a given question, and $I_L$ and
$I_R$ are the impurities of the new nodes
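The impurity score and the question search from the previous slides can be sketched together. The sketch assumes the impurity is the Gini form summed over ordered pairs of distinct classes, which equals 1 minus the sum of squared class fractions, and assumes questions of the form "descriptor d <= threshold?"; the data values are invented.

```python
from collections import Counter

def gini_impurity(labels):
    """Sum over i != j of p_i * p_j, computed as 1 - sum_i p_i**2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Delta I = I - pL * IL - pR * IR for one candidate question."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))

def best_question(rows, labels):
    """Try every 'descriptor d <= threshold?' question and keep the best one."""
    best = (0.0, None, None)  # (decrease, descriptor index, threshold)
    for d in range(len(rows[0])):
        for threshold in sorted({row[d] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[d] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[d] > threshold]
            if not left or not right:
                continue  # the question does not partition the node
            gain = impurity_decrease(labels, left, right)
            if gain > best[0]:
                best = (gain, d, threshold)
    return best

# Invented toy data: two descriptors per molecule, one activity label each.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 6.0]]
labels = ["active", "active", "inactive", "inactive"]
print(best_question(rows, labels))  # (0.5, 0, 2.0)
```

Growing a tree then amounts to applying `best_question` recursively to the left and right partitions until a terminating condition is met.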
30Tree Growth
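[Figure: the entire training set at the root is split by a yes/no question on Descriptor 3 into Node L and Node R]
- Pruning phase metric: $R_\alpha = R_0 + \alpha N_{leaf}$,
where $R_0$ is the number of misclassifications in
the training set and $N_{leaf}$ is the number of
leaf nodes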
31Application of Single Property RP for
Drug/Non-drug Classification
- Input: 120 atom types
- C5.0
- Training: 5000 WDI, 5000 ACD
- Prediction error: 21%
- The presence of alcohols, tertiary and secondary
amines, phenols, enols, and carboxylic groups
accounts for 75% of correct classifications for
drugs.
32Decision Tree for Drug/Non-drug Classification
33Multiple Property RP
- Single property (SP) RP is not sufficient for many
biological systems
- ADMET properties
- Absorption
- Distribution
- Metabolism
- Excretion
- Toxicity
- Nonspecific binding to multiple targets causes
side effects
- Dependent properties
34Partially Unified Multiple Property RP
- Developed for prediction of multiple dependent
properties
- Discovers features that distinguish the classes
of different properties and features that make
them similar
- Some nodes apply to all properties while others
apply to only a single property
- Classes are NOT mutually exclusive
- Nodes are labeled with one class of a single
property type
35Mapping to SP Representation
- D descriptor values: x1, x2, x3, ..., xD
- P property values: y1, y2, y3, ..., yP
- New descriptor K is a property descriptor
- (x1, x2, x3, y1, y2, y3) maps to three records:
- (x1, x2, x3, 1, y1)
- (x1, x2, x3, 2, y2)
- (x1, x2, x3, 3, y3)
- Every path from the root to a leaf has a split on
the descriptor K
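The mapping on this slide turns one multi-property record into P single-property records, each tagged with the property descriptor K. A minimal sketch, assuming K is simply the 1-based index of the property (the function name is hypothetical):

```python
def to_single_property_rows(descriptors, properties):
    """Expand (x1..xD, y1..yP) into P rows of the form (x1..xD, K, yK)."""
    return [tuple(descriptors) + (k, y) for k, y in enumerate(properties, start=1)]

rows = to_single_property_rows(("x1", "x2", "x3"), ("y1", "y2", "y3"))
for row in rows:
    print(row)
```

For the slide's example this yields (x1, x2, x3, 1, y1), (x1, x2, x3, 2, y2) and (x1, x2, x3, 3, y3), so a standard single-property tree can then split on K like on any other descriptor.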
36
1. Pure Specific Tree
2. Generic node growth: $\max(\min_k \Delta I_k) > 0$
37PUMP-RP (cont.)
- A split with an improvement for each property is
chosen
- The metric maximizes the minimum decrease in
impurity from each potential node question
- A compound may appear in more than one leaf node
- Each K node is regrown recursively
- The resulting tree is overgeneralized
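The max-min criterion for a generic node can be sketched as follows: for every candidate question, compute the decrease in impurity separately for each property, take the minimum, and keep the question whose minimum is largest and positive. The threshold-style questions, the Gini form of the impurity, and the toy data are assumptions made for this illustration.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def decrease(labels, goes_left):
    """Decrease in impurity for one property under one yes/no partition."""
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_generic_question(rows, labels_by_property):
    """Pick the question maximizing the minimum decrease over all properties."""
    best = (0.0, None, None)  # (min decrease, descriptor index, threshold)
    for d in range(len(rows[0])):
        for threshold in sorted({row[d] for row in rows}):
            goes_left = [row[d] <= threshold for row in rows]
            if all(goes_left) or not any(goes_left):
                continue  # not a real partition
            score = min(decrease(labels, goes_left) for labels in labels_by_property)
            if score > best[0]:  # a generic node must improve every property
                best = (score, d, threshold)
    return best

# Invented toy data: four compounds, two descriptors, two properties.
rows = [[1.0, 9.0], [2.0, 1.0], [3.0, 8.0], [4.0, 2.0]]
property_a = [1, 1, 0, 0]
property_b = [1, 1, 1, 0]
print(best_generic_question(rows, [property_a, property_b]))
```

In this toy case the question that is best for property A alone (descriptor 0 <= 2.0) is not selected, because descriptor 0 <= 3.0 gives a larger worst-case improvement across both properties.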
38Finding the Best Tree
- $R_{\alpha\beta} = R_0 + \alpha (N_{leaf} - \beta N_{generic})$
- where $\beta$ is a generality parameter,
- $N_{generic}$ is the number of generic nodes
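Selecting the best tree with this metric can be sketched as a comparison of candidate pruned trees. The sketch assumes the metric reads R = R0 + alpha * (Nleaf - beta * Ngeneric), with R0 the number of training-set misclassifications; the candidate trees and parameter values below are invented.

```python
def tree_cost(misclassified, n_leaf, n_generic, alpha, beta):
    """Assumed form of the metric: R0 + alpha * (Nleaf - beta * Ngeneric)."""
    return misclassified + alpha * (n_leaf - beta * n_generic)

def best_tree(candidates, alpha, beta):
    """Return the name of the candidate tree with the lowest cost."""
    return min(candidates, key=lambda t: tree_cost(t[1], t[2], t[3], alpha, beta))[0]

# Hypothetical candidates: (name, misclassifications, leaf nodes, generic nodes).
candidates = [("large", 4, 12, 1), ("medium", 7, 6, 3), ("small", 15, 3, 2)]
print(best_tree(candidates, alpha=1.0, beta=0.5))  # medium
```

With beta set to 0 the generic nodes earn no credit and the cost reduces to misclassifications plus a penalty on the number of leaves.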
39Application of PUMP-RP for Drug Specificity
- Cyclooxygenase (COX) inhibitors
- COX-2 inhibitors are anti-inflammatory agents
- COX-1 inhibitors damage the gastrointestinal tract
- A good drug should be highly specific to COX-2
- Celebrex, Vioxx are widely prescribed
- Goal: to obtain a model of activity and
selectivity of COX-2 inhibitors as a function of
their physicochemical properties
40Data and Results
- 100 2D and 3D descriptors
- Each property has two classes: active and
inactive
- Gini Impurity score
- Accuracy
- On the training set
- 60-80% COX-2, 78-91% COX-1
- On the test set
- 50-89% COX-2, 60-100% COX-1
- Disadvantage: not capable of predicting compounds
with molecular scaffolds not yet discovered
41Extension to PUMP-RP
- To model systems with more than two properties
- A semi-generic node applies to more than one
property but not all
- To model multiple properties with the opportunity
to observe which properties are more closely
related than others
- Problems to apply it to:
- ADMET properties
- Activity/ADMET properties
- COX-2/COX-1/Drugs
- Drug-drug interactions based on target
specificity
- Modified Gini Impurity score
42Gini Impurity for the Extended PUMP-RP
- Modified scoring function
- $\max(\max(\min_k \Delta I_k))$, where $k \in P$
43Tree Built by the Extended PUMP-RP
44Targeting RNA
- Emerging field in drug discovery
- RNA plays an essential role in many biological
processes
- Natural antibiotics are RNA-targeting drugs
(streptomycin, tetracycline, etc.)
- Potential drug targets: viral RNAs
- Antisense strategy
45Targeting RNA
- HTS against RNA targets is less successful than
for protein targets
- Identification of new classes of RNA ligands is
extremely rare
- Limited knowledge of the chemistry and structure
of RNA recognition
- RNA consists of 4 nucleotides, so it is less
diverse than proteins; RNA flexibility
46What Can Be Done?
- Assumption: compounds binding RNA have something
in common that other compounds lack
- Dataset: a comprehensive database containing
examples of binding between small molecules and
RNAs
- Computational approaches to extract common
features of such compounds and to train models
for prediction (AI methods)
47Concluding Remarks
- Drug Discovery is a goal of multidisciplinary
research
- No algorithm to discover a drug
- Old problem: given a compound structure, what are
its properties?
- Computational approaches can assist the drug
discovery process
- Limitation: lack of systematic biological data
- Market pressure and prospective profit bring more
and more resources into drug discovery
48Multilevel Neighborhoods of Atoms
phenol