Active Learning Strategies for Drug Screening
Megon Walker1 and Simon Kasif1,2
1Bioinformatics Program, Boston University, Boston, MA
2Department of Biomedical Engineering, Boston University, Boston, MA
1. Introduction
At the intersection of drug discovery and experimental design, active learning algorithms guide the selection of successive compound batches for biological assays when screening a chemical library, with the goal of identifying many target-binding compounds in as few screening iterations as possible.1-3 The active learning paradigm refers to the ability of the learner to modify its sampling strategy, choosing the data used for training based on the data seen so far. During each round of screening, the active learning algorithm selects a batch of unlabeled compounds to be tested for target-binding activity and added to the training set. Once the labels for this batch are known, the model of activity is recomputed on all examples labeled so far, and a new chemical set is selected for screening (Figure 1). The drug screening pipeline proposed here combines committee-based active learning, built with bagging and boosting techniques, with several options for sample selection. Our best strategy retrieves up to 87% of the active compounds after screening only 30% of the chemical datasets analyzed.

2. Objectives
- Exploitation: maximize the number of target-binding (active) drugs retrieved with each batch.
- Exploration: maximize the prediction accuracy of the committee during each iteration of querying.

3. Methods
Datasets
- A binary feature vector for each compound indicates the presence or absence of structural fragments.
- The 200 features with the highest feature-activity mutual information (MI) were selected for each dataset (a sketch of this step follows the list).
- The data are retrospective: activity labels are provided along with the features.
- Labels: target binding / active (A) or non-binding / inactive (I).
- 632 DuPont thrombin-targeting compounds4 (149 A, 483 I, mean MI 0.126).
- 1346 Abbott monoamine oxidase inhibitors5 (221 A, 1125 I, mean MI 0.006).
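As a concrete illustration of the MI-based feature selection step above, here is a minimal sketch assuming binary fragment features and binary activity labels. The function names (mutual_information, select_top_mi) and the commented-out loader are ours for illustration, not the poster's.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical MI (in bits) between two binary vectors x and y."""
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x = np.mean(x == xv)
            p_y = np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def select_top_mi(X, y, k=200):
    """Indices of the k features with the highest feature-activity MI.

    X: (n_compounds, n_features) binary fragment matrix
    y: (n_compounds,) binary activity labels (1 = active, 0 = inactive)
    """
    scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Example: keep the 200 highest-MI fragment features, as in Methods.
# X, y = load_dataset(...)   # hypothetical loader
# X_reduced = X[:, select_top_mi(X, y, k=200)]
```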
Pipeline
The pipeline is summarized in Figures 1-3.

Figure 1: The Drug Discovery Cycle. Compounds are encoded as binary descriptor vectors (drugs x features). Labels for a batch drawn from the unlabeled training set are queried; a committee of classifiers (naïve Bayes or perceptron members, combined by bagging or boosting) is trained on sub-samples of the labeled training set; the unlabeled testing set drugs are classified by the committee via weighted majority vote; selection and screening close the cycle.

Figure 2: Pipeline Flowchart. Start -> input data files with compound descriptors -> designate training and testing sets for this round of cross-validation -> if this is the first batch of drugs whose labels are queried, it is selected by the chemists' domain knowledge; otherwise sample selection (random, uncertainty, density, or P(active)) chooses it -> labels for the batch are queried from the unlabeled training set -> the committee is retrained on all labels so far -> repeat until all training set labels have been queried -> repeat until cross-validation is completed -> report accuracy and performance statistics -> End.

Figure 3: Querying for labels, training classifiers on sub-samples. After the 1st query, the few labeled drugs (A/I) are split into sub-samples that train classifiers 1 and 2, while the remaining training drugs and the held-out test drugs are still unlabeled; after the 2nd query, more labeled drugs are distributed across the classifiers' training sub-samples, and fewer drugs remain unlabeled.
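To make the cycle in Figures 1-3 concrete, here is a minimal sketch of one cross-validation fold of committee-based active learning with bagging, using scikit-learn's BernoulliNB as the naïve Bayes member (perceptron members and the boosting variant are analogous and omitted). The function names, committee size, batch size, and the random stand-in for the chemists' first pick are our illustrative assumptions, not details from the poster.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def bootstrap_indices(y, rng):
    """Class-stratified bootstrap so each committee member sees both classes."""
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        idx.extend(rng.choice(members, size=len(members), replace=True))
    return np.array(idx)

def train_committee(X_lab, y_lab, n_members=10, rng=None):
    """Bagging: one naive Bayes classifier per bootstrap resample
    of the labeled training set."""
    rng = rng if rng is not None else np.random.default_rng(0)
    committee = []
    for _ in range(n_members):
        idx = bootstrap_indices(y_lab, rng)
        committee.append(BernoulliNB().fit(X_lab[idx], y_lab[idx]))
    return committee

def committee_p_active(committee, X):
    """Mean P(active) across members -- the probabilistic analogue of the
    committee's weighted majority vote."""
    probs = []
    for m in committee:
        p = m.predict_proba(X)
        col = np.flatnonzero(m.classes_ == 1)
        probs.append(p[:, col[0]] if col.size else np.zeros(len(X)))
    return np.mean(probs, axis=0)

def active_learning_fold(X, y, score_fn, batch_size=32, n_rounds=10, seed=0):
    """One cross-validation fold of the screening loop in Figure 2.
    y acts only as the oracle that answers label queries for each batch."""
    rng = np.random.default_rng(seed)
    unlabeled = set(range(len(y)))
    # First batch: random stand-in for the chemists' domain-knowledge pick.
    batch = rng.choice(sorted(unlabeled), size=batch_size, replace=False)
    labeled, hits_per_round = [], []
    for _ in range(n_rounds):
        labeled.extend(int(i) for i in batch)
        unlabeled.difference_update(int(i) for i in batch)
        committee = train_committee(X[labeled], y[labeled], rng=rng)
        hits_per_round.append(int(y[labeled].sum()))  # actives found so far
        if not unlabeled:
            break
        pool = np.array(sorted(unlabeled))
        scores = score_fn(committee, X[pool])         # higher score = query first
        batch = pool[np.argsort(scores)[::-1][:batch_size]]
    return hits_per_round
```

The score_fn argument is a pluggable sample selection criterion; sketches of the four criteria named in Figure 2 follow the Discussion.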
4. Results
Figure 4: Hit Performance and Sensitivity (panels a-c; plots not reproduced here).

5. Discussion
Exploitation: the number of active drugs retrieved with each batch queried.
- P(active) sample selection shows the best hit performance when the feature information content is higher (Figure 4a). After 30% of the drugs are labeled (cross-validation averages):
  1. P(active) retrieves 84% of the actives
  2. density retrieves 77% of the actives
  3. uncertainty retrieves 65% of the actives
  4. random retrieves 42% of the actives
- Density sample selection shows the best initial hit performance when the feature information content is lower (Figure 4b), although classifier sensitivity is compromised; hit performance is linear for all strategies after 20% of the drugs are labeled.
Exploration: the prediction accuracy of the committee on the testing set during each iteration of querying.
- Uncertainty sample selection shows the best testing set sensitivity.
- Growing the labeled training set over progressive rounds of querying yields no significant increase in testing set sensitivity (Figure 4c). Does the actives:inactives ratio of the labeled training set bias the classifier? Are multiple modes of drug activity present in the datasets?
- Tradeoff: the sample selection methods with the best hit performance display the lowest testing set sensitivity (Figure 4c). One reading of each selection criterion is sketched below.
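The poster does not spell out formulas for its four selection criteria, so the sketch below gives one common reading of each, written to plug into the score_fn slot of the loop shown earlier (it reuses committee_p_active from that sketch). The density formula in particular is our assumption.

```python
import numpy as np

def score_p_active(committee, X_pool):
    """Greedy exploitation: query compounds the committee rates most likely active."""
    return committee_p_active(committee, X_pool)

def score_uncertainty(committee, X_pool):
    """Query compounds the committee is least certain about (P(active) near 0.5)."""
    return 1.0 - 2.0 * np.abs(committee_p_active(committee, X_pool) - 0.5)

def score_density(committee, X_pool):
    """Uncertainty weighted by mean Tanimoto similarity to the rest of the
    pool, favoring densely populated regions (our assumption; the poster
    does not give its density formula)."""
    Xf = X_pool.astype(float)                      # binary 0/1 fragment matrix
    inter = Xf @ Xf.T                              # pairwise intersection counts
    ones = Xf.sum(axis=1, keepdims=True)
    union = ones + ones.T - inter
    tanimoto = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    return score_uncertainty(committee, X_pool) * tanimoto.mean(axis=1)

def score_random(committee, X_pool, rng=np.random.default_rng(0)):
    """Baseline: uniformly random selection."""
    return rng.random(len(X_pool))
```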
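Finally, a small helper showing how a Figure 4-style hit performance curve can be computed from the loop's per-round hit counts; the function name and arguments are ours for illustration.

```python
import numpy as np

def hit_performance(hits_per_round, batch_size, n_total, n_actives):
    """Figure 4-style coordinates: fraction of the library labeled vs.
    fraction of all actives retrieved, per round of querying."""
    labeled = np.arange(1, len(hits_per_round) + 1) * batch_size
    return labeled / n_total, np.array(hits_per_round) / n_actives

# Example with the thrombin dataset sizes from Methods (632 compounds, 149 actives):
# frac_labeled, frac_hits = hit_performance(hits, batch_size=32,
#                                           n_total=632, n_actives=149)
```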
6. References
1. N. Abe and H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging. ICML 1998, 1-9.
2. G. Forman. Incremental Machine Learning to Reduce Biochemistry Lab Costs in the Search for Drug Discovery. BIOKDD 2002, 33-36.
3. M. Warmuth, G. Rätsch, M. Mathieson, J. Liao, and C. Lemmen. Active Learning in the Drug Discovery Process. NIPS 2001, 1449-1456.
4. KDD Cup 2001. http://www.cs.wisc.edu/dpage/kddcup2001/
5. R. Brown and Y. Martin. Use of Structure-Activity Data To Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection. Journal of Chemical Information and Computer Sciences, 1996, 36, 572-584.