Title: Active Learning Strategies for Compound Screening
Slide 1: Active Learning Strategies for Compound Screening
- Megon Walker (1) and Simon Kasif (1,2)
- (1) Bioinformatics Program, Boston University
- (2) Department of Biomedical Engineering, Boston University
- 229th ACS National Meeting, March 13-17, 2005, San Diego, CA
Slide 2: Outline
- Introduction to active learning for compound screening
- Objectives and performance criteria
- Algorithms and procedures
- Thrombin dataset results
- Preliminary conclusions
Slide 3: Introduction - drug discovery
- drug discovery is an iterative process
- goal: identify many target-binding compounds with minimal screening iterations
[Figure: iterative screening cycle over compounds, descriptors, screening, and selection]
Slide 4: Introduction - supervised learning
- input: a data set with positive and negative examples
- output: a classifier such that for each example
  - +1 if the example is positive
  - -1 if the example is negative
- standard learning
  - classifier trains on a static training set
  - train, then test
- active learning
  - classifier chooses data points for its training set
  - classifier requests labels
  - iterative rounds of training and testing
Slide 5: Introduction - active learning for compound screening
[Figure: compounds selected across successive rounds (1st query, 2nd query)]
- Mamitsuka et al. Proceedings of the Fifteenth International Conference on Machine Learning, 1998, 1-9.
- Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43(2), 667-673.
Slide 6: Objectives
- exploitation
  - Hit Performance
  - Enrichment Factor (EF)
- exploration
  - Accurate model of activity
  - Sensitivity
Slide 7: Methods - datasets
[Figure: active learning workflow flowchart, repeated on slides 7-13; see slide 8]
- 632 DuPont thrombin-targeting compounds
  - 149 actives
  - 483 inactives
- a binary feature vector for each compound
  - shape-based features
  - pharmacophore features
  - 139,351 features
- retrospective data
- 200 features selected by mutual information (MI) w.r.t. activity labels
  - mean MI 0.126
- Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43(2), 667-673.
- Eksterowicz et al. J. Mol. Graph. Model. 2002, 20(6), 469-477.
- Putta et al. J. Chem. Inf. Comput. Sci. 2002, 42(5), 1230-1240.
- KDD Cup 2001. http://www.cs.wisc.edu/dpage/kddcup2001/
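The MI-based feature selection above can be sketched in plain Python; the function names are illustrative, not the authors' code.

```python
import math

def mutual_information(feature, labels):
    """Mutual information (in bits) between a binary feature column and binary labels."""
    n = len(feature)
    counts = {}
    for f, y in zip(feature, labels):
        counts[(f, y)] = counts.get((f, y), 0) + 1
    mi = 0.0
    for (f, y), c in counts.items():
        p_xy = c / n
        p_x = sum(v for (fx, _), v in counts.items() if fx == f) / n
        p_y = sum(v for (_, yy), v in counts.items() if yy == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def select_top_features(X, labels, k):
    """Rank the feature columns of X (list of rows) by MI with the activity
    labels and return the indices of the top k, as in the 200-feature step."""
    n_features = len(X[0])
    scored = [(mutual_information([row[j] for row in X], labels), j)
              for j in range(n_features)]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]
```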
Slide 8: Methods - cross validation
[Flowchart: the screening workflow shown on slides 7-13]
1. Start: input data files.
2. Pick training and testing data for the next round of cross validation.
3. If this is the 1st batch, select it randomly or by a chemist; otherwise select samples by P(active), uncertainty, or density.
4. Query the labels of the training set batch.
5. Train a classifier committee on labeled training set subsamples.
6. Predict compound labels by committee weighted majority vote.
7. If not all training set labels have been queried, return to step 3.
8. If cross validation is not completed, return to step 2; otherwise report accuracy and performance statistics and end.
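The loop in the flowchart can be sketched as follows; `select_batch`, `train`, and `predict` are placeholder hooks for the sample selection strategies and committee machinery of the later slides, not the authors' implementation.

```python
import random

def active_learning_screen(pool, oracle, batch_size, select_batch, train, predict):
    """Batch active learning loop mirroring the flowchart.
    pool         -- dict: compound id -> feature vector (unlabeled library)
    oracle       -- callable: id -> +1/-1 label (the label query / assay)
    select_batch -- strategy picking the next batch ids from the remaining pool
    train        -- builds a classifier committee from labeled data
    predict      -- committee weighted-vote prediction for one feature vector
    """
    labeled = {}
    unlabeled = dict(pool)
    committee = None
    while unlabeled:
        if not labeled:                       # 1st batch: random selection
            batch = random.sample(list(unlabeled), min(batch_size, len(unlabeled)))
        else:                                 # later batches: strategy-driven
            batch = select_batch(unlabeled, committee, batch_size)
        for cid in batch:                     # query labels for this batch
            labeled[cid] = (unlabeled.pop(cid), oracle(cid))
        committee = train(labeled)            # retrain committee on all labels so far
    # final pass: predict a label for every compound in the library
    return {cid: predict(committee, x) for cid, x in pool.items()}
```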
Slide 9: Methods - perceptron
- given
  - binary input vector, x
  - weight vector, w
  - threshold value, T
  - learning rate, η
  - classification, t ∈ {+1, -1}
- TEST: predict +1 if w · x > T, else -1
- TRAIN
  - if classified correctly, do nothing
  - if misclassified, update w ← w + η t x (the standard perceptron rule)
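A minimal sketch of the perceptron above, assuming the standard threshold-unit update rule:

```python
def perceptron_train(data, epochs=10, lr=1.0, threshold=0.0):
    """Train a threshold perceptron on (x, t) pairs with t in {+1, -1}.
    On a misclassified example, apply w <- w + lr * t * x."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else -1
            if pred != t:                      # misclassified: nudge the weights
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
    return w

def perceptron_predict(w, x, threshold=0.0):
    """TEST step: +1 if the weighted sum exceeds the threshold, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else -1
```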
Slide 10: Methods - classifier committees
- bagging: members resample the training set with a uniform sampling distribution
- boosting: compounds misclassified by classifier 1 are more likely to be resampled for classifier 2
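A sketch of the two resampling schemes; doubling the weight of misclassified examples is a simplified stand-in for the slide's description, not necessarily the exact boosting variant the authors used.

```python
import random

def bagging_committee(data, n_members, train_one):
    """Bagging: each member trains on a uniform bootstrap resample."""
    return [train_one([random.choice(data) for _ in range(len(data))])
            for _ in range(n_members)]

def boosting_committee(data, n_members, train_one, predict_one):
    """Boosting-style resampling: examples the previous member misclassified
    become more likely to be resampled for the next member."""
    weights = [1.0] * len(data)
    members = []
    for _ in range(n_members):
        resample = random.choices(data, weights=weights, k=len(data))
        model = train_one(resample)
        members.append(model)
        for i, (x, t) in enumerate(data):
            if predict_one(model, x) != t:
                weights[i] *= 2.0          # misclassified -> more likely resampled
    return members
```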
Slide 11: Methods - weighted voting
- a weighted vote of all classifiers predicts each compound's activity label
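A sketch of the weighted majority vote; how member weights are assigned is not specified on the slide, so they are taken as given here (e.g. a member's training accuracy would be one option).

```python
def weighted_vote(committee, x, predict_one):
    """Committee prediction for compound x: the sign of the weight-summed
    member votes. Each committee entry is a (model, weight) pair; ties and
    negative sums map to the inactive label -1."""
    score = sum(weight * predict_one(model, x) for model, weight in committee)
    return 1 if score > 0 else -1
```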
Slide 12: Methods - sample selection strategies
- P(active): select the compounds predicted active with highest probability by the committee
- uncertainty: select the compounds on which the committee disagrees most strongly
- density with respect to actives: select the compounds most similar to previously labeled or predicted actives
- Tanimoto similarity metric
  - given compound bitstrings A and B
  - a = bits on in A
  - b = bits on in B
  - c = bits on in both A and B
  - T(A, B) = c / (a + b - c)
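The Tanimoto metric in code, using the a, b, c bit counts defined above:

```python
def tanimoto(a_bits, b_bits):
    """Tanimoto similarity between two equal-length binary fingerprints:
    T = c / (a + b - c), where a and b count the bits on in each fingerprint
    and c counts the bits on in both. Two all-zero fingerprints return 0.0."""
    a = sum(a_bits)
    b = sum(b_bits)
    c = sum(x & y for x, y in zip(a_bits, b_bits))
    return c / (a + b - c) if (a + b - c) else 0.0
```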
Slide 13: Methods - performance criteria
- Hit Performance
- Enrichment Factor (EF)
- Sensitivity
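The slide names the criteria but their formulas were in the graphic; assuming the standard definitions, they compute as:

```python
def enrichment_factor(n_hits_selected, n_selected, n_actives_total, n_total):
    """EF = (hit rate in the selected set) / (hit rate in the full library).
    Standard definition, assumed here since the slide's formula is not shown."""
    return (n_hits_selected / n_selected) / (n_actives_total / n_total)

def sensitivity(true_positives, false_negatives):
    """Sensitivity (recall) = TP / (TP + FN): the fraction of actives recovered."""
    return true_positives / (true_positives + false_negatives)
```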
Slide 14: Results - hit performance
Slide 15: Results - sensitivity
- uncertainty sampling
  - highest testing set sensitivity initially
  - no significant increase in testing set sensitivity
Slide 16: Results - bagging vs. boosting
- boosting
  - training set TP climbs faster and converges higher
  - overfits to the training data
Slide 17: Results - classifiers
Slide 18: Conclusions
- Sample selection
- Bagging vs. boosting
- Committee vs. single classifier
- Testing set sensitivity
- Trade-off between exploration and exploitation