Title: Active Learning Strategies for Compound Screening
Slide 1: Active Learning Strategies for Compound Screening
- Megon Walker (1) and Simon Kasif (1,2)
- (1) Bioinformatics Program, Boston University
- (2) Department of Biomedical Engineering, Boston University
- 229th ACS National Meeting, March 13-17, 2005, San Diego, CA
Slide 2: Outline
- Introduction to active learning for compound screening
- Objectives and performance criteria
- Algorithms and procedures
- Thrombin dataset results
- Preliminary conclusions
Slide 3: Introduction - drug discovery
- drug discovery is an iterative process
- goal: identify many target-binding compounds with minimal screening iterations
[Figure: iterative screening cycle over compounds, descriptors, screening, and selection]
Slide 4: Introduction - supervised learning
- input: a data set with positive and negative examples
- output: a classifier such that for each example
  - +1 if the example is positive
  - -1 if the example is negative
- standard learning
  - classifier trains on a static training set
  - train, then test
- active learning
  - classifier chooses data points for its training set
  - classifier requests labels
  - iterative rounds of training and testing
Slide 5: Introduction - active learning for compound screening
[Figure: compounds selected across successive rounds (1st query, 2nd query)]
- Mamitsuka et al. Proceedings of the Fifteenth International Conference on Machine Learning, 1998, 1-9.
- Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43(2), 667-673.
Slide 6: Objectives
- exploitation
  - Hit Performance
  - Enrichment Factor (EF)
- exploration
  - Accurate model of activity
  - Sensitivity
Slide 7: Methods - datasets
[Figure: active learning workflow flowchart, repeated on slides 7-13; see slide 8]
- 632 DuPont thrombin-targeting compounds
  - 149 actives
  - 483 inactives
- a binary feature vector for each compound
  - shape-based features
  - pharmacophore features
  - 139,351 features
- retrospective data
- 200 features selected by mutual information (MI) w.r.t. activity labels
  - mean MI 0.126
- Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43(2), 667-673.
- Eksterowicz et al. J. Mol. Graph. Model. 2002, 20(6), 469-477.
- Putta et al. J. Chem. Inf. Comput. Sci. 2002, 42(5), 1230-1240.
- KDD Cup 2001. http://www.cs.wisc.edu/dpage/kddcup2001/
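The MI-based feature selection above can be sketched in plain Python; the function names are illustrative, not the authors' code.

```python
import math

def mutual_information(feature, labels):
    """Mutual information (in bits) between a binary feature column and binary labels."""
    n = len(feature)
    counts = {}
    for f, y in zip(feature, labels):
        counts[(f, y)] = counts.get((f, y), 0) + 1
    mi = 0.0
    for (f, y), c in counts.items():
        p_xy = c / n
        p_x = sum(v for (fx, _), v in counts.items() if fx == f) / n
        p_y = sum(v for (_, yy), v in counts.items() if yy == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def select_top_features(X, labels, k):
    """Rank the feature columns of X (list of rows) by MI with the activity
    labels and return the indices of the top k, as in the 200-feature step."""
    n_features = len(X[0])
    scored = [(mutual_information([row[j] for row in X], labels), j)
              for j in range(n_features)]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]
```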
Slide 8: Methods - cross validation
[Flowchart: the screening workflow shown on slides 7-13]
1. Start: input data files.
2. Pick training and testing data for the next round of cross validation.
3. If this is the 1st batch, select it randomly or by a chemist; otherwise select samples by P(active), uncertainty, or density.
4. Query the labels of the training set batch.
5. Train a classifier committee on labeled training set subsamples.
6. Predict compound labels by committee weighted majority vote.
7. If not all training set labels have been queried, return to step 3.
8. If cross validation is not completed, return to step 2; otherwise report accuracy and performance statistics and end.
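The loop in the flowchart can be sketched as follows; `select_batch`, `train`, and `predict` are placeholder hooks for the sample selection strategies and committee machinery of the later slides, not the authors' implementation.

```python
import random

def active_learning_screen(pool, oracle, batch_size, select_batch, train, predict):
    """Batch active learning loop mirroring the flowchart.
    pool         -- dict: compound id -> feature vector (unlabeled library)
    oracle       -- callable: id -> +1/-1 label (the label query / assay)
    select_batch -- strategy picking the next batch ids from the remaining pool
    train        -- builds a classifier committee from labeled data
    predict      -- committee weighted-vote prediction for one feature vector
    """
    labeled = {}
    unlabeled = dict(pool)
    committee = None
    while unlabeled:
        if not labeled:                       # 1st batch: random selection
            batch = random.sample(list(unlabeled), min(batch_size, len(unlabeled)))
        else:                                 # later batches: strategy-driven
            batch = select_batch(unlabeled, committee, batch_size)
        for cid in batch:                     # query labels for this batch
            labeled[cid] = (unlabeled.pop(cid), oracle(cid))
        committee = train(labeled)            # retrain committee on all labels so far
    # final pass: predict a label for every compound in the library
    return {cid: predict(committee, x) for cid, x in pool.items()}
```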
Slide 9: Methods - perceptron
- given
  - binary input vector, x
  - weight vector, w
  - threshold value, T
  - learning rate, η
  - classification, t ∈ {+1, -1}
- TEST: predict +1 if w · x > T, else -1
- TRAIN
  - if classified correctly, do nothing
  - if misclassified, update w ← w + η t x (the standard perceptron rule)
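A minimal sketch of the perceptron above, assuming the standard threshold-unit update rule:

```python
def perceptron_train(data, epochs=10, lr=1.0, threshold=0.0):
    """Train a threshold perceptron on (x, t) pairs with t in {+1, -1}.
    On a misclassified example, apply w <- w + lr * t * x."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else -1
            if pred != t:                      # misclassified: nudge the weights
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
    return w

def perceptron_predict(w, x, threshold=0.0):
    """TEST step: +1 if the weighted sum exceeds the threshold, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else -1
```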
Slide 10: Methods - classifier committees
- bagging: members resample the training set with a uniform sampling distribution
- boosting: compounds misclassified by classifier 1 are more likely to be resampled for classifier 2
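A sketch of the two resampling schemes; doubling the weight of misclassified examples is a simplified stand-in for the slide's description, not necessarily the exact boosting variant the authors used.

```python
import random

def bagging_committee(data, n_members, train_one):
    """Bagging: each member trains on a uniform bootstrap resample."""
    return [train_one([random.choice(data) for _ in range(len(data))])
            for _ in range(n_members)]

def boosting_committee(data, n_members, train_one, predict_one):
    """Boosting-style resampling: examples the previous member misclassified
    become more likely to be resampled for the next member."""
    weights = [1.0] * len(data)
    members = []
    for _ in range(n_members):
        resample = random.choices(data, weights=weights, k=len(data))
        model = train_one(resample)
        members.append(model)
        for i, (x, t) in enumerate(data):
            if predict_one(model, x) != t:
                weights[i] *= 2.0          # misclassified -> more likely resampled
    return members
```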
Slide 11: Methods - weighted voting
- a weighted vote of all classifiers predicts each compound's activity label
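A sketch of the weighted majority vote; how member weights are assigned is not specified on the slide, so they are taken as given here (e.g. a member's training accuracy would be one option).

```python
def weighted_vote(committee, x, predict_one):
    """Committee prediction for compound x: the sign of the weight-summed
    member votes. Each committee entry is a (model, weight) pair; ties and
    negative sums map to the inactive label -1."""
    score = sum(weight * predict_one(model, x) for model, weight in committee)
    return 1 if score > 0 else -1
```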
Slide 12: Methods - sample selection strategies
- P(active): select the compounds predicted active with highest probability by the committee
- uncertainty: select the compounds on which the committee disagrees most strongly
- density with respect to actives: select the compounds most similar to previously labeled or predicted actives
- Tanimoto similarity metric
  - given compound bitstrings A and B
  - a = bits on in A
  - b = bits on in B
  - c = bits on in both A and B
  - T(A, B) = c / (a + b - c)
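The Tanimoto metric in code, using the a, b, c bit counts defined above:

```python
def tanimoto(a_bits, b_bits):
    """Tanimoto similarity between two equal-length binary fingerprints:
    T = c / (a + b - c), where a and b count the bits on in each fingerprint
    and c counts the bits on in both. Two all-zero fingerprints return 0.0."""
    a = sum(a_bits)
    b = sum(b_bits)
    c = sum(x & y for x, y in zip(a_bits, b_bits))
    return c / (a + b - c) if (a + b - c) else 0.0
```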
Slide 13: Methods - performance criteria
- Hit Performance
- Enrichment Factor (EF)
- Sensitivity
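The slide names the criteria but their formulas were in the graphic; assuming the standard definitions, they compute as:

```python
def enrichment_factor(n_hits_selected, n_selected, n_actives_total, n_total):
    """EF = (hit rate in the selected set) / (hit rate in the full library).
    Standard definition, assumed here since the slide's formula is not shown."""
    return (n_hits_selected / n_selected) / (n_actives_total / n_total)

def sensitivity(true_positives, false_negatives):
    """Sensitivity (recall) = TP / (TP + FN): the fraction of actives recovered."""
    return true_positives / (true_positives + false_negatives)
```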
Slide 14: Results - hit performance
Slide 15: Results - sensitivity
- uncertainty sampling
  - highest testing set sensitivity initially
  - no significant increase in testing set sensitivity
Slide 16: Results - bagging vs. boosting
- boosting
  - training set TP climbs faster and converges higher
  - overfits to the training data
Slide 17: Results - classifiers
Slide 18: Conclusions
- Sample selection
- Bagging vs. boosting
- Committee vs. single classifier
- Testing set sensitivity
- Trade-off between exploration and exploitation