Active Learning Strategies for Drug Screening
Megon Walker¹ and Simon Kasif¹,²
¹ Bioinformatics Program, Boston University, Boston, MA
² Department of Biomedical Engineering, Boston University, Boston, MA
  • 2. Objectives
  • exploitation: optimize the number of target-binding
    (active) drugs retrieved with each batch
  • exploration: optimize the prediction accuracy of
    the committee during each iteration of querying
  • 3. Methods
  • Datasets
  • a binary feature vector for each compound
    indicates the presence or absence of structural
    fragments
  • the 200 features with the highest feature-activity
    mutual information (MI) were selected for each dataset
  • retrospective data: labels were provided with the
    features
  • labels: target-binding or active (A);
    non-binding or inactive (I)
  • 632 DuPont thrombin-targeting compounds [4] (149 A,
    483 I, mean MI 0.126)
  • 1346 Abbott monoamine oxidase inhibitors [5] (221
    A, 1125 I, mean MI 0.006)
  • Pipeline
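The feature-selection step in the Datasets list above (keep the features with the highest feature-activity MI) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, labels are assumed to be encoded as 1 (active) and 0 (inactive), and MI is computed in bits, which the poster does not specify.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """MI (in bits, an assumption) between one binary feature column and the labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        p_joint = count / n
        # p(f, l) / (p(f) * p(l)) simplifies to count * n / (count_f * count_l)
        ratio = count * n / (feature_counts[f] * label_counts[l])
        mi += p_joint * math.log2(ratio)
    return mi

def select_top_features(X, labels, k=200):
    """Return (indices of the k highest-MI feature columns, all MI scores)."""
    n_features = len(X[0])
    scores = [
        mutual_information([row[j] for row in X], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores

# Toy example: feature 0 matches the label exactly, feature 1 is independent of it.
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
labels = [0, 1, 0, 1]
top, scores = select_top_features(X, labels, k=1)
print(top, scores)
```

With the real datasets, `k=200` would reproduce the selection size stated above.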

1. Introduction
At the intersection of drug discovery and experimental design,
active learning algorithms guide the selection of successive
compound batches for biological assays when screening a chemical
library, with the goal of identifying many target-binding
compounds in a minimal number of screening iterations [1-3].
The active learning paradigm refers to the ability of the learner
to modify its sampling strategy for training data based on
previously seen data. During each round of screening, the active
learning algorithm selects a batch of unlabeled compounds to be
tested for target-binding activity and added to the training set.
Once the labels for this batch are known, the model of activity
is recomputed on all examples labeled so far, and a new set of
compounds is selected for screening (Figure 1). The drug
screening pipeline proposed here combines committee-based active
learning with bagging and boosting techniques and several options
for sample selection. Our best strategy retrieves up to 87% of
the active compounds after screening only 30% of the chemical
datasets analyzed.
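The screening loop described in the introduction (select a batch, query its labels, retrain, repeat) can be sketched generically. All names here (`active_learning_loop`, `oracle`, `train`, `select_batch`) are hypothetical placeholders: `oracle` stands in for the biological assay, and the toy `train` and `select_batch` below are not the committee methods used in the poster.

```python
def active_learning_loop(pool, oracle, train, select_batch, batch_size, first_batch):
    """Batch-mode active learning until every compound in the pool is labeled.

    pool: list of compound ids
    oracle(compound) -> 1 (active) or 0 (inactive); stands in for the assay
    train(labeled) -> model, where labeled maps compound id to label
    select_batch(model, unlabeled, k) -> list of k compound ids to query next
    Returns the cumulative number of actives found after each round.
    """
    labeled = {c: oracle(c) for c in first_batch}
    unlabeled = [c for c in pool if c not in labeled]
    hits = [sum(labeled.values())]
    while unlabeled:
        model = train(labeled)  # recompute the model on all labels so far
        batch = select_batch(model, unlabeled, min(batch_size, len(unlabeled)))
        if not batch:
            break  # guard against a selector that returns nothing
        for c in batch:
            labeled[c] = oracle(c)  # "screen" the selected compounds
        chosen = set(batch)
        unlabeled = [c for c in unlabeled if c not in chosen]
        hits.append(sum(labeled.values()))
    return hits

# Toy run: compounds 7, 8 and 9 are active; the selector simply prefers high ids.
actives = {7, 8, 9}
hits = active_learning_loop(
    pool=list(range(10)),
    oracle=lambda c: int(c in actives),
    train=lambda labeled: None,
    select_batch=lambda model, unlabeled, k: sorted(unlabeled, reverse=True)[:k],
    batch_size=2,
    first_batch=[0, 1],
)
print(hits)
```

Passing the selector and trainer as functions keeps the loop independent of the classifier and of the sample selection strategy, so different strategies can be compared on the same pool.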
Figure 2: Pipeline Flowchart
  • Start
  • Input data files with compound descriptors
  • Designate training and testing sets for this round of
    cross validation
  • 1st batch of drugs whose labels are queried?
    yes: 1st batch selected by chemist's domain knowledge
    no: sample selection (random, uncertainty, density, or
    P(active))
  • Labels for a batch from the unlabeled training set queried
  • Committee of classifiers trained on sub-samples from the
    labeled training set drugs
    classifiers: naïve Bayes, perceptron
    committees: bagging, boosting
  • Unlabeled training set and testing set drugs classified by
    committee (weighted majority vote)
  • All training set labels queried?
    no: return to sample selection
    yes: continue
  • Cross validation completed?
    no: return to designating training and testing sets
    yes: continue
  • Accuracy and performance statistics
  • End

Figure 1: The Drug Discovery Cycle
Cycle stages: compounds, descriptors, selection, screening.
Example descriptor matrix (rows are drugs, columns are binary
features):
  Drugs  0 1 1 1 0
  Drugs  1 0 1 1 0
  Drugs  1 1 1 0 1
  Drugs  0 1 1 0 1

Figure 3: Querying for labels; training classifiers on
sub-samples
After 1st query:
  Drugs                   A/I
  train classifier 1      I
  train classifier 2      A
  NOT labeled (6 drugs)   ?
  test (2 drugs)          ?
After 2nd query:
  Drugs                   A/I
  train classifier 1      I
  train classifier 1      A
  train classifier 1      A
  train classifier 2      I
  train classifier 2      A
  train classifier 2      I
  NOT labeled (2 drugs)   ?
  test (2 drugs)          ?
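The committee and sample selection steps of the pipeline can be illustrated with a small bagged committee of Bernoulli naïve Bayes classifiers. This is a sketch under stated assumptions rather than the authors' implementation: all function names are hypothetical, votes are weighted equally (the poster uses a weighted majority vote whose weights are not given), and the density strategy is left out because the poster does not define it.

```python
import math
import random

def train_naive_bayes(X, y):
    """Bernoulli naive Bayes with Laplace smoothing on binary features."""
    model = {}
    n_features = len(X[0])
    for cls in (0, 1):
        rows = [x for x, label in zip(X, y) if label == cls]
        prior = (len(rows) + 1) / (len(X) + 2)
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[cls] = (math.log(prior), probs)
    return model

def predict(model, x):
    """Return 1 (active) or 0 (inactive) for one binary feature vector."""
    scores = {}
    for cls, (log_prior, probs) in model.items():
        scores[cls] = log_prior + sum(
            math.log(p if bit else 1.0 - p) for bit, p in zip(x, probs))
    return 1 if scores[1] > scores[0] else 0

def train_bagged_committee(X, y, n_members=5, seed=0):
    """Train each member on a bootstrap sub-sample of the labeled drugs."""
    rng = random.Random(seed)
    n = len(X)
    committee = []
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]
        committee.append(
            train_naive_bayes([X[i] for i in idx], [y[i] for i in idx]))
    return committee

def vote_fraction(committee, x):
    """Fraction of members voting 'active' (equal weights, an assumption)."""
    return sum(predict(m, x) for m in committee) / len(committee)

def select_batch(committee, X_unlabeled, k, strategy, seed=0):
    """Return indices of k unlabeled compounds chosen by the named strategy."""
    indices = list(range(len(X_unlabeled)))
    if strategy == "random":
        return random.Random(seed).sample(indices, k)
    frac = [vote_fraction(committee, x) for x in X_unlabeled]
    if strategy == "p_active":       # most likely to be active first
        return sorted(indices, key=lambda i: -frac[i])[:k]
    if strategy == "uncertainty":    # committee most evenly split first
        return sorted(indices, key=lambda i: abs(frac[i] - 0.5))[:k]
    raise ValueError("unknown strategy: " + strategy)

# Toy data: member A learns from feature 0, member C from feature 1.
X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
member_a = train_naive_bayes(X, [1, 1, 1, 0, 0, 0])
member_c = train_naive_bayes(X, [0, 1, 0, 1, 0, 1])
pool = [[1, 0], [0, 1], [1, 1], [0, 0]]
print(select_batch([member_a, member_c], pool, 1, "p_active"))
print(select_batch([member_a, member_c], pool, 2, "uncertainty"))
```

In this toy pool, the compound both members call active is picked by P(active) selection, while the two compounds on which the members disagree are picked by uncertainty selection.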
  • 5. Discussion
  • exploitation: the number of active drugs retrieved
    with each batch queried
  • P(active) sample selection shows the best hit
    performance when feature information content is
    higher (Figure 4a)
  • - after 30% of drugs are labeled (cross-validation
    averages):
  • 1. P(active) retrieves 84 actives
  • 2. density retrieves 77 actives
  • 3. uncertainty retrieves 65 actives
  • 4. random retrieves 42 actives
  • density sample selection shows the best initial
    hit performance when feature information content
    is lower (Figure 4b)
  • - classifier sensitivity is compromised
  • - hit performance is linear for all strategies
    after 20% of drugs are labeled
  • exploration: the prediction accuracy of the
    committee on the testing data set during each
    iteration of querying
  • uncertainty sample selection shows the best testing
    set sensitivity
  • increases in the labeled training set size during
    progressive rounds of querying result in no
    significant increase in testing set sensitivity
    (Figure 4c)
  • - does the actives:inactives ratio of the labeled
    training set bias the classifier?
  • - are multiple modes of drug activity present
    in the datasets?
  • tradeoff: the sample selection methods with
    the best hit performance display the lowest
    testing set sensitivity (Figure 4c)
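The two quantities compared in the discussion can be computed as below. This is a sketch with hypothetical function names: the poster describes hit performance as the number of actives retrieved, so expressing it as a fraction of all actives is an assumption, while sensitivity follows its standard definition TP / (TP + FN).

```python
def hit_performance(queried_labels, total_actives):
    """Fraction of all actives found among the compounds queried so far."""
    return sum(queried_labels) / total_actives

def sensitivity(true_labels, predicted_labels):
    """True-positive rate on the testing set: TP / (TP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")

# Toy example: 3 of 6 actives found so far; 2 of 3 test actives predicted correctly.
print(hit_performance([1, 0, 1, 1], total_actives=6))
print(sensitivity([1, 1, 0, 1], [1, 0, 0, 1]))
```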

4. Results
Figure 4: Hit Performance and Sensitivity
  • 6. References
  • [1] N. Abe and H. Mamitsuka. Query Learning
    Strategies Using Boosting and Bagging. ICML 1998,
    1-9.
  • [2] G. Forman. Incremental Machine Learning to Reduce
    Biochemistry Lab Costs in the Search for Drug
    Discovery. BIOKDD 2002, 33-36.
  • [3] M. Warmuth, G. Ratsch, M. Mathieson, J. Liao, C.
    Lemmen. Active Learning in the Drug Discovery
    Process. NIPS 2001, 1449-1456.
  • [4] KDD Cup 2001.
    http://www.cs.wisc.edu/~dpage/kddcup2001/
  • [5] R. Brown and Y. Martin. Use of Structure-Activity
    Data To Compare Structure-Based Clustering
    Methods and Descriptors for Use in Compound
    Selection. Journal of Chemical Information and
    Computer Sciences, 1996, 36, 572-584.