Rule Extraction From Trained Neural Networks

1
Rule Extraction From Trained Neural Networks
  • Brian Hudson
  • University of Portsmouth, UK

2
Artificial Neural Networks
3
Trepan
  • A method for extracting a decision tree from an
    artificial neural network (Craven, 1996).
  • The tree is built by expanding nodes in a best
    first manner, producing an unbalanced tree.
  • The splitting tests at the nodes are m-of-n tests
  • e.g. 2-of-{x1, x2, x3}, where the xi are Boolean
    conditions
  • The network is used as an oracle to answer
    queries during the learning process.
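
A minimal sketch of how an m-of-n splitting test can be evaluated, assuming each condition is a Boolean predicate over one example. The names `Condition` and `m_of_n` are illustrative and not taken from the Trepan implementation.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical type alias: a condition is any Boolean predicate over one example.
Condition = Callable[[Mapping[str, object]], bool]


def m_of_n(m: int, conditions: Sequence[Condition], example: Mapping[str, object]) -> bool:
    """Return True when at least m of the given conditions hold for the example."""
    satisfied = sum(1 for cond in conditions if cond(example))
    return satisfied >= m


# The test 2-of-{x1, x2, x3} over three Boolean features.
conds = [
    lambda e: bool(e["x1"]),
    lambda e: bool(e["x2"]),
    lambda e: bool(e["x3"]),
]

print(m_of_n(2, conds, {"x1": True, "x2": False, "x3": True}))   # two conditions hold
print(m_of_n(2, conds, {"x1": True, "x2": False, "x3": False}))  # only one condition holds
```

An ordinary single-feature test is the special case 1-of-{x1}.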

4
Splitting Tests
  • Start with a set of candidate tests
  • binary tests on each value for nominal features
  • binary tests on thresholds for real-valued
    features
  • Find optimal splitting test by a beam search,
    initializing beam with candidate test maximizing
    the information gain.
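
As a rough illustration of the selection criterion, the sketch below scores binary candidate tests by information gain and picks the best one to initialise the beam. The function names and the toy data are assumptions made for this example; the beam search itself is not shown.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels: Sequence[str], outcomes: Sequence[bool]) -> float:
    """Entropy reduction obtained by splitting the labels on a binary test outcome."""
    n = len(labels)
    passed = [y for y, o in zip(labels, outcomes) if o]
    failed = [y for y, o in zip(labels, outcomes) if not o]
    remainder = len(passed) / n * entropy(passed) + len(failed) / n * entropy(failed)
    return entropy(labels) - remainder


# Toy data: four examples and two hypothetical candidate tests.
labels = ["pos", "pos", "neg", "neg"]
candidates: Dict[str, List[bool]] = {
    "x1": [True, True, False, False],   # separates the classes perfectly
    "x2": [True, False, True, False],   # carries no information
}

# The candidate with the highest gain would seed the beam.
best = max(candidates, key=lambda name: information_gain(labels, candidates[name]))
print(best, information_gain(labels, candidates[best]))
```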

5
Splitting Tests
  • To each m-of-n test in the beam and each
    candidate test, apply two operators
  • m-of-(n+1)
  • e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3}
  • (m+1)-of-(n+1)
  • e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3}
  • Admit new tests to the beam if they increase the
    information gain and differ significantly
    (chi-squared) from existing tests.
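
The two operators can be sketched as follows, assuming an m-of-n test is represented as a pair of a threshold m and a set of condition names. The representation and the name `expand` are hypothetical. The admission checks (information gain and the chi-squared difference test) are deliberately left out.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical representation: (m, set of condition names).
MofN = Tuple[int, FrozenSet[str]]


def expand(test: MofN, candidate: str) -> List[MofN]:
    """Apply both operators to one beam entry and one candidate condition.

    Returns the m-of-(n+1) test followed by the (m+1)-of-(n+1) test,
    or an empty list if the candidate is already part of the test.
    """
    m, conds = test
    if candidate in conds:
        return []
    grown = conds | {candidate}
    return [(m, grown), (m + 1, grown)]


start: MofN = (2, frozenset({"x1", "x2"}))
for m, conds in expand(start, "x3"):
    print(f"{m}-of-{{{', '.join(sorted(conds))}}}")
```

Applied to 2-of-{x1, x2} with candidate x3, this yields the two tests given as examples on the slide.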

6
Data Modelling
  • The amount of training data reaching each node
    decreases with depth of tree.
  • TREPAN creates new training cases by sampling the
    distributions of the training data
  • empirical distributions for nominal inputs
  • kernel density estimates for continuous inputs
  • Apply oracle (i.e. neural network) to new
    training cases to assign output values.
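
A sketch of this data-generation step under simplifying assumptions: each feature is sampled independently, nominal features are drawn from their observed values, and continuous features are drawn from a Gaussian kernel density estimate (pick a training value, add Gaussian noise). The `oracle` callable stands in for the trained network, and the bandwidth value is an arbitrary placeholder, not the setting used by TREPAN.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Example = Dict[str, object]


def generate_cases(
    training: Sequence[Example],
    nominal: Sequence[str],
    continuous: Sequence[str],
    oracle: Callable[[Example], str],
    n_new: int,
    bandwidth: float = 0.1,   # assumed kernel width, chosen only for illustration
    seed: int = 0,
) -> List[Tuple[Example, str]]:
    """Create n_new synthetic cases and label each one by querying the oracle."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_new):
        x: Example = {}
        for f in nominal:
            # Empirical distribution: choosing a training row uniformly
            # reproduces the observed value frequencies.
            x[f] = rng.choice(training)[f]
        for f in continuous:
            # Gaussian KDE sample: a training value plus kernel noise.
            x[f] = rng.gauss(float(rng.choice(training)[f]), bandwidth)
        cases.append((x, oracle(x)))
    return cases


# Toy usage with a stand-in oracle in place of a neural network.
training = [
    {"base": "G", "score": 0.9},
    {"base": "T", "score": 0.2},
    {"base": "G", "score": 0.7},
]
oracle = lambda x: "pos" if x["score"] > 0.5 else "neg"
new_cases = generate_cases(training, ["base"], ["score"], oracle, n_new=5)
print(len(new_cases))
```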

7
Application to Bioinformatics
  • Prediction of Splice Junction sites in Eukaryotic
    DNA

8
Splice Junction Sites
9
Consensus Sequences
  • Donor
  • -3  -2  -1  1   2   3   4   5   6
  • C/G A   G   G   T   A/G A   G   T
  • Acceptor
  • -12 -11 -10 -9  -8  -7  -6  -5  -4  -3  -2  -1  1
  • C/T C/T C/T C/T C/T C/T C/T C/T C/T C/T A   G   G
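
To make the consensus notation concrete, the sketch below counts how many positions of a 9-base window agree with the donor consensus listed above (positions -3 to 6). The function name and the scoring by simple match count are assumptions for illustration. This is not the method used to build the dataset or the trees.

```python
from typing import Sequence

# Donor consensus from the slide; each entry lists the bases allowed at that position.
DONOR_CONSENSUS = ["CG", "A", "G", "G", "T", "AG", "A", "G", "T"]


def consensus_matches(window: str, consensus: Sequence[str] = DONOR_CONSENSUS) -> int:
    """Count positions where the window's base is one of the consensus bases."""
    if len(window) != len(consensus):
        raise ValueError("window length must equal consensus length")
    return sum(base in allowed for base, allowed in zip(window.upper(), consensus))


print(consensus_matches("CAGGTAAGT"))  # matches at every position
print(consensus_matches("CAGGTTTTT"))  # mismatches at positions 3, 4 and 5
```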

10
EBI Dataset
  • Clean dataset generated at EBI (Thanaraj,
    1999)
  • Donors
  • training set: 567 positive, 943 negative
  • test set: 229 positive, 373 negative
  • Acceptors
  • training set: 637 positive, 468 negative
  • test set: 273 positive, 213 negative

11
Results
12
TREPAN Donor Tree
[Figure: decision tree; surviving labels: Yes, No, and the donor consensus C/G A G G T A/G A G T]
13
C5 Donor Tree (extract)
p5 = G:
  p3 = C or p3 = T => NEGATIVE
  p3 = A:
    p2 = G => POSITIVE
    p2 = A:
      p4 = A or p4 = G => POSITIVE
      p4 = C or p4 = T => NEGATIVE
    p2 = C:
      p4 = A => POSITIVE
      else => NEGATIVE
    p2 = T:
      p6 = A or p6 = G => NEGATIVE
      p6 = C or p6 = T => POSITIVE
  p3 = G:
    p4 = T => NEGATIVE
    p4 = C:
      p6 = T => POSITIVE
      else => NEGATIVE
14
Trepan Acceptor Tree
[Figure: decision tree; surviving label: C/T C/T A G G]
15
Application to Chemoinformatics
  • Learning general rules
  • Conformational Analysis
  • QSAR dataset

16
Oprea Dataset
  • 137 diverse compounds
  • Classification
  • 62 leads, 75 drugs
  • 14 descriptors (from Cerius-2)
  • MW, MR, AlogP
  • Ndonor, Nacceptor, Nrotbond
  • Number of Lipinski violations
  • T.I. Oprea, A.M. Davis, S.J. Teague & P.D.
    Leeson, Is there a difference between Leads and
    Drugs? A Historical Perspective, J. Chem. Inf.
    Comput. Sci., 41, 1308-1315 (2001).

17
C5 tree
MW < 380 [Mode: lead]
  Rule of 5 Violations = 0 [Mode: lead]
    Hbond acceptor < 2 [Mode: lead] => lead
    Hbond acceptor > 2 [Mode: drug] => drug
  Rule of 5 Violations > 0 [Mode: lead] => lead
MW > 380 [Mode: drug] => drug
18
Trepan Oprea Tree
19
Conformational Analysis
  • 300 conformations from a 5 ns MD simulation of
    rosiglitazone
  • Classified by length of long axis into
  • Extended: distance > 10 Å
  • Folded: distance < 10 Å
  • 8 torsion angles
  • In house data.

20
Rosiglitazone
  • Agonist of PPAR gamma Nuclear Receptor
  • Regulates HDL/LDL and triglycerides
  • Active ingredient of Avandia for Type II Diabetes

21
Distances
22
C5 tree
T5 < 269 [Mode: extended]
  T5 < 52 [Mode: extended]
    T7 < 185 [Mode: extended] => extended
    T7 > 185 [Mode: folded]
      T6 < 75 [Mode: folded] => folded
      T6 > 75 [Mode: extended]
        T5 < 41 [Mode: folded]
          T8 < 249 [Mode: folded] => folded
          T8 > 249 [Mode: extended] => extended
        T5 > 41 [Mode: extended] => extended
  T5 > 52 [Mode: extended]
    T6 < 73 [Mode: extended]
      T8 < 242 [Mode: extended]
        T5 < 7 [Mode: extended]
          T8 < 22 [Mode: extended] => extended
          T8 > 22 [Mode: folded] => folded
        T5 > 7 [Mode: extended] => extended
      T8 > 242 [Mode: extended] => extended
    T6 > 73 [Mode: extended] => extended
T5 > 269 [Mode: folded] => folded
23
Trepan Conformation Tree
24
Ferreira Dataset
  • typical QSAR dataset
  • 48 HIV-1 Protease inhibitors
  • Activity as pIC50
  • Low: pIC50 < 8.0
  • High: pIC50 > 8.0
  • 14 descriptors (mostly topological)
  • R. Kiralj and M.M.C. Ferreira, A-priori
    Molecular Descriptors in QSAR: a case of HIV-1
    protease inhibitors. I. The Chemometric Approach,
    J. Mol. Graph. Modell., 21, 435-448 (2003).

25
Original Results
  • PLS model
  • Activity determined by X9, X11, X10, X13
  • R² = 0.91, Q² = 0.85, Ncomps = 3

26
C5 tree
X11 < 2.5 [Mode: low]
  X13 < 16.7 [Mode: low] => low
  X13 > 16.7 [Mode: high] => high
X11 > 2.5 [Mode: high] => high
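
Read as nested rules, the tree above amounts to the small function sketched below. The comparisons are taken as strict "less than" and "greater than"; how values exactly on a threshold are handled is not recoverable from the slide, so assigning them to the upper branch is an assumption. The function name is illustrative.

```python
def classify_activity(x11: float, x13: float) -> str:
    """Classify a compound as 'low' or 'high' activity from descriptors X11 and X13."""
    if x11 < 2.5:
        # Within the low-X11 branch, X13 decides the class.
        return "low" if x13 < 16.7 else "high"
    # Assumption: values at or above the threshold follow the upper branch.
    return "high"


print(classify_activity(x11=2.0, x13=10.0))
print(classify_activity(x11=2.0, x13=20.0))
print(classify_activity(x11=3.0, x13=10.0))
```

Only two of the four descriptors highlighted by the PLS model (X11 and X13) appear in this tree.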
27
Trepan Ferreira Tree
28
Accuracy
29
Conclusions
  • Reasonable Accuracy
  • Comprehensible Rules

30
Acknowledgements
  • David Whitley.
  • Tony Browne.
  • Martyn Ford.
  • BBSRC grant reference BIO/12005.