1
Active Learning Strategies for Compound Screening
  • Megon Walker¹ and Simon Kasif¹,²
  • ¹Bioinformatics Program, Boston University
  • ²Department of Biomedical Engineering, Boston University
  • 229th ACS National Meeting
  • March 13-17, 2005
  • San Diego, CA

2
Outline
  • Introduction to active learning for compound
    screening
  • Objectives and performance criteria
  • Algorithms and procedures
  • Thrombin dataset results
  • Preliminary conclusions

3
Introduction: drug discovery
  • drug discovery is an iterative process
  • goal: identify many target-binding compounds with minimal screening iterations

[Diagram: iterative screening cycle over compounds, descriptors, screening, and selection]
4
Introduction: supervised learning
  • input: a data set with positive and negative examples
  • output: a classifier that assigns each example
    • +1 if the example is positive
    • -1 if the example is negative
  • standard learning
    • classifier trains on a static training set
    • train, then test
  • active learning
    • classifier chooses data points for its training set
    • classifier requests labels
    • iterative rounds of training and testing
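The iterative train/query loop above can be sketched as pool-based active learning. This is a minimal illustration, not the authors' code: `oracle`, `train`, and `margin` are hypothetical placeholders for the screening assay, the learner, and the learner's confidence score.

```python
import random

def active_learning_loop(pool, oracle, train, margin,
                         batch_size=10, rounds=5, seed=0):
    """Pool-based active learning: the learner, not a fixed split,
    chooses which unlabeled compounds to send to the oracle
    (e.g. a screening assay) for labeling."""
    rng = random.Random(seed)
    labeled, unlabeled = {}, list(pool)
    batch = rng.sample(unlabeled, batch_size)   # 1st batch: random (or chemist-picked)
    model = None
    for _ in range(rounds):
        for x in batch:                         # query labels for this batch
            labeled[x] = oracle(x)
            unlabeled.remove(x)
        model = train(labeled)                  # retrain on everything labeled so far
        if not unlabeled:
            break
        # next batch: the compounds the current model is least certain about
        batch = sorted(unlabeled, key=lambda x: abs(margin(model, x)))[:batch_size]
    return model, labeled
```

Each round labels one batch, retrains, and lets the new model pick the next batch, which is exactly what separates this loop from a static train-then-test split.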

5
Introduction: active learning for compound screening

[Diagram: compounds selected in the 1st and 2nd query rounds]
  • Mamitsuka et al. Proceedings of the Fifteenth International Conference on Machine Learning, 1998, 1-9.
  • Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43, 667-673.

6
Objectives
  • exploitation
    • Hit Performance
    • Enrichment Factor (EF)
  • exploration: an accurate model of activity
    • Sensitivity

7
Methods: datasets
  • 632 DuPont thrombin-targeting compounds
    • 149 actives
    • 483 inactives
  • a binary feature vector for each compound
    • shape-based features
    • pharmacophore features
    • 139,351 features
  • retrospective data
  • 200 features selected by mutual information (MI) w.r.t. activity labels
    • mean MI 0.126

[Flowchart: Start → Input data files → Pick training and testing data for next round of cross validation → 1st batch? (yes: select 1st batch randomly or by chemist; no: sample selection by P(active), uncertainty, or density) → Query training set batch labels → Train classifier committee on labeled training set subsamples → Predict compound labels by committee weighted majority vote → loop until all training set labels are queried and cross validation is completed → Accuracy and performance statistics → End]
  • Warmuth et al. J. Chem. Inf. Comput. Sci. 2003 Mar-Apr;43(2):667-73.
  • Eksterowicz et al. J. Mol. Graph. Model. 2002 Jun;20(6):469-77.
  • Putta et al. J. Chem. Inf. Comput. Sci. 2002 Sep-Oct;42(5):1230-40.
  • KDD Cup 2001. http://www.cs.wisc.edu/dpage/kddcup2001/
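The "200 features selected by MI" step could be sketched as below. This is an illustrative reconstruction, not the authors' pipeline; `top_k_features` and `mutual_information` are hypothetical helper names, and the matrix rows are compounds with binary feature columns.

```python
from math import log2
from collections import Counter

def mutual_information(feature, labels):
    """MI (in bits) between one binary feature column and the activity labels."""
    n = len(feature)
    joint = Counter(zip(feature, labels))       # empirical joint distribution
    pf, pl = Counter(feature), Counter(labels)  # marginals
    mi = 0.0
    for (f, l), c in joint.items():
        p_fl = c / n
        mi += p_fl * log2(p_fl / ((pf[f] / n) * (pl[l] / n)))
    return mi

def top_k_features(matrix, labels, k=200):
    """Return the indices of the k feature columns with highest MI."""
    cols = list(zip(*matrix))                   # transpose: one tuple per feature
    ranked = sorted(range(len(cols)),
                    key=lambda j: mutual_information(cols[j], labels),
                    reverse=True)
    return ranked[:k]
```

A feature perfectly correlated with balanced labels scores 1 bit; an independent feature scores 0, which is why ranking by MI keeps the most label-informative columns.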
8
Methods: cross validation
  • 5-fold (5X) cross validation
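A 5-fold split can be sketched with the standard library alone; this is a generic illustration (the slide does not specify how folds were assigned), with `k_fold_splits` as a hypothetical helper name.

```python
import random

def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) index pairs for k-fold cross validation;
    every compound lands in exactly one test fold."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

For the 632-compound thrombin set this gives five disjoint test folds of 126-127 compounds each.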

9
Methods: perceptron
  • given
    • binary input vector, x
    • weight vector, w
    • threshold value, T
    • learning rate, n
    • classification, t ∈ {+1, -1}
  • TEST: predict +1 if w · x > T, else -1
  • TRAIN
    • if classified correctly, do nothing
    • if misclassified, update w ← w + n · t · x
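The TEST and TRAIN steps above are the standard perceptron rule; a minimal sketch in the slide's notation (w, x, T, n, t), with hypothetical function names:

```python
def predict(w, x, T):
    """TEST: +1 if the weighted sum clears the threshold, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > T else -1

def train_perceptron(data, n_features, T=0.0, n=1.0, epochs=20):
    """TRAIN: on a mistake, nudge w toward t * x; otherwise do nothing."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, t in data:              # x: binary feature vector, t: +1/-1 label
            if predict(w, x, T) != t:  # misclassified
                w = [wi + n * t * xi for wi, xi in zip(w, x)]
    return w
```

On linearly separable data the update provably converges; in the committee setting each member is one such perceptron trained on its own subsample.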

10
Methods: classifier committees
  • bagging: uniform sampling distribution over the training set
  • boosting: compounds misclassified by classifier 1 are more likely to be resampled by classifier 2
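The bagging case can be sketched as below; a generic illustration, with `bagged_committee` and `train_fn` as hypothetical names (boosting would instead bias the resampling weights toward previously misclassified compounds).

```python
import random

def bagged_committee(train_set, train_fn, n_members=10, seed=0):
    """Bagging: each committee member is trained on a uniform
    bootstrap resample (with replacement) of the labeled set."""
    rng = random.Random(seed)
    return [train_fn([rng.choice(train_set) for _ in range(len(train_set))])
            for _ in range(n_members)]
```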

11
Methods: weighted voting
  • a weighted vote of all classifiers predicts each compound's activity label
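A weighted majority vote reduces to taking the sign of a weighted sum; a minimal sketch (the slide does not specify how the per-classifier weights are set, so `committee` here is any list of (classifier, weight) pairs):

```python
def weighted_vote(committee, x):
    """committee: list of (classifier, weight) pairs, each classifier
    returning +1 or -1; the sign of the weighted sum is the label."""
    total = sum(weight * clf(x) for clf, weight in committee)
    return 1 if total > 0 else -1
```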

12
Methods: sample selection strategies
  • P(active): select compounds predicted active with highest probability by the committee
  • uncertainty: select compounds on which the committee disagrees most strongly
  • density with respect to actives: select compounds most similar to previously labeled or predicted actives
    • Tanimoto similarity metric
    • given compound bitstrings A and B
      • a = bits on in A
      • b = bits on in B
      • c = bits on in both A and B
    • T(A, B) = c / (a + b - c)
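With a, b, and c defined as above, the Tanimoto metric is a one-liner over the two bitstrings:

```python
def tanimoto(A, B):
    """Tanimoto similarity of two bitstrings:
    a = bits on in A, b = bits on in B, c = bits on in both."""
    a = sum(A)
    b = sum(B)
    c = sum(1 for x, y in zip(A, B) if x and y)
    return c / (a + b - c)     # 1.0 for identical bitstrings
```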

13
Methods: performance criteria
  • Hit Performance
  • Enrichment Factor (EF) = (hits found / compounds screened) / (total actives / total compounds)
  • Sensitivity = TP / (TP + FN)
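The slide's formulas were lost in this transcript; the definitions below are the standard ones for virtual screening (EF = observed hit rate over the baseline hit rate) and may differ in detail from the authors' exact criteria.

```python
def sensitivity(tp, fn):
    """Fraction of true actives recovered: TP / (TP + FN)."""
    return tp / (tp + fn)

def enrichment_factor(hits_found, n_screened, total_actives, total_compounds):
    """Hit rate in the screened subset relative to the baseline hit rate."""
    return (hits_found / n_screened) / (total_actives / total_compounds)
```

For the thrombin set (149 actives of 632 compounds) the baseline hit rate is about 0.24, so a batch whose compounds are all active reaches EF ≈ 4.2.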

14
Results: hit performance
15
Results: sensitivity
  • uncertainty sampling
    • highest testing-set sensitivity initially
    • no significant increase in testing-set sensitivity over rounds

16
Results: bagging vs. boosting
  • boosting
    • training-set TP climbs faster and converges higher
    • overfits to the training data

17
Results: classifiers
18
Conclusions
  • Sample selection
  • Bag vs. boost
  • Committee vs. single classifier
  • Testing set sensitivity
  • Trade-off between exploration and exploitation