Title: Rule Extraction From Trained Neural Networks
1Rule Extraction From Trained Neural Networks
- Brian Hudson
- University of Portsmouth, UK
2Artificial Neural Networks
3Trepan
- A method for extracting a decision tree from an artificial neural network (Craven, 1996).
- The tree is built by expanding nodes in a best-first manner, producing an unbalanced tree.
- The splitting tests at the nodes are m-of-n tests
- e.g. 2-of-{x1, x2, x3}, where the xi are Boolean conditions
- The network is used as an oracle to answer queries during the learning process.
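As a minimal illustration of the m-of-n tests described above, the sketch below evaluates one such test on a single example. The function name `m_of_n`, the feature names `a`, `b`, `c` and the representation of conditions as Python callables are assumptions made for illustration; they are not taken from Trepan itself.

```python
from typing import Callable, Dict, Sequence

Example = Dict[str, float]
Condition = Callable[[Example], bool]


def m_of_n(m: int, conditions: Sequence[Condition], example: Example) -> bool:
    """Return True when at least m of the Boolean conditions hold for the example."""
    satisfied = sum(1 for cond in conditions if cond(example))
    return satisfied >= m


# Hypothetical Boolean conditions x1, x2, x3 on illustrative features.
x1: Condition = lambda e: e["a"] > 0.5
x2: Condition = lambda e: e["b"] > 0.5
x3: Condition = lambda e: e["c"] > 0.5

example = {"a": 0.9, "b": 0.1, "c": 0.7}
# x1 and x3 hold, so the test 2-of-{x1, x2, x3} is satisfied.
result = m_of_n(2, [x1, x2, x3], example)
```

With the same example, the stricter test 3-of-{x1, x2, x3} fails because x2 does not hold.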
4Splitting Tests
- Start with a set of candidate tests
- binary tests on each value for nominal features
- binary tests on thresholds for real-valued features
- Find the optimal splitting test by a beam search, initializing the beam with the candidate test that maximizes the information gain.
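The selection of the initial candidate test can be sketched as follows for one real-valued feature. This is a simplified illustration, not Trepan's implementation: the helper names are hypothetical, thresholds are taken as midpoints between adjacent observed values (an assumption), and the beam search itself is omitted.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels: Sequence[str], passes: Sequence[bool]) -> float:
    """Entropy reduction obtained by splitting the labels with a binary test."""
    yes = [y for y, p in zip(labels, passes) if p]
    no = [y for y, p in zip(labels, passes) if not p]
    n = len(labels)
    remainder = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - remainder


def best_threshold(values: Sequence[float], labels: Sequence[str]) -> Tuple[float, float]:
    """Return (threshold, gain) for the binary test `value > threshold` with the highest gain."""
    distinct = sorted(set(values))
    thresholds: List[float] = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, information_gain(labels, [v > t for v in values])) for t in thresholds]
    return max(scored, key=lambda pair: pair[1])


values = [0.1, 0.2, 0.8, 0.9]
labels = ["pos", "pos", "neg", "neg"]
threshold, gain = best_threshold(values, labels)
```

In this toy data the midpoint between 0.2 and 0.8 separates the two classes perfectly, so it is the candidate that would seed the beam.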
5Splitting Tests
- To each m-of-n test in the beam and each candidate test, apply two operators
- m-of-(n+1)
- e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3}
- (m+1)-of-(n+1)
- e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3}
- Admit new tests to the beam if they increase the information gain and differ significantly (chi-squared) from existing tests.
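The two operators above can be sketched on a simple data structure. The representation of a test as a pair `(m, conditions)` and the function name `expand` are assumptions for illustration; the information-gain and chi-squared admission checks are deliberately left out.

```python
from typing import FrozenSet, List, Tuple

# An m-of-n test: the threshold m and the set of n condition names.
MofN = Tuple[int, FrozenSet[str]]


def expand(test: MofN, candidate: str) -> List[MofN]:
    """Apply both operators to an m-of-n test using one new candidate condition."""
    m, conditions = test
    if candidate in conditions:
        return []  # the candidate is already part of the test; nothing to add
    bigger = conditions | {candidate}
    return [
        (m, bigger),      # m-of-(n+1): add the condition, keep the threshold
        (m + 1, bigger),  # (m+1)-of-(n+1): add the condition and raise the threshold
    ]


new_tests = expand((2, frozenset({"x1", "x2"})), "x3")
```

Starting from 2-of-{x1, x2} and the candidate x3, this yields 2-of-{x1, x2, x3} and 3-of-{x1, x2, x3}, matching the two examples in the list.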
6Data Modelling
- The amount of training data reaching each node decreases with depth of tree.
- TREPAN creates new training cases by sampling the distributions of the training data
- empirical distributions for nominal inputs
- kernel density estimates for continuous inputs
- Apply the oracle (i.e. the neural network) to the new training cases to assign output values.
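The sampling step can be illustrated with a small standard-library sketch. It makes several assumptions that are not in the slides: each feature is sampled independently, the kernel is Gaussian with a fixed bandwidth, the constraints along the path to the current node are ignored, and `toy_oracle` is a hypothetical stand-in for the trained network.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple, Union

Value = Union[str, float]
Case = Dict[str, Value]


def sample_case(training: Sequence[Case], nominal: Sequence[str],
                continuous: Sequence[str], bandwidth: float,
                rng: random.Random) -> Case:
    """Draw one synthetic case, sampling every feature independently."""
    case: Case = {}
    for name in nominal:
        # Empirical distribution: copy the value of a randomly chosen training case.
        case[name] = rng.choice(training)[name]
    for name in continuous:
        # Gaussian kernel density estimate: pick a training value, then add kernel noise.
        centre = float(rng.choice(training)[name])
        case[name] = rng.gauss(centre, bandwidth)
    return case


def label_with_oracle(cases: Sequence[Case],
                      oracle: Callable[[Case], str]) -> List[Tuple[Case, str]]:
    """Ask the oracle for the output value of every synthetic case."""
    return [(case, oracle(case)) for case in cases]


def toy_oracle(case: Case) -> str:
    """Hypothetical stand-in for the trained neural network."""
    return "positive" if float(case["x"]) > 0.5 else "negative"


rng = random.Random(0)
training = [{"base": "A", "x": 0.2}, {"base": "G", "x": 0.9}, {"base": "A", "x": 0.7}]
cases = [sample_case(training, ["base"], ["x"], bandwidth=0.1, rng=rng) for _ in range(5)]
labelled = label_with_oracle(cases, toy_oracle)
```

The labelled cases would then be added to the data at the node before a splitting test is chosen.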
7Application to Bioinformatics
- Prediction of splice junction sites in eukaryotic DNA
8Splice Junction Sites
9Consensus Sequences
- Donor
- Position:   -3   -2   -1   1    2    3    4    5    6
- Consensus:  C/G  A    G    G    T    A/G  A    G    T
- Acceptor
- Position:   -12  -11  -10  -9   -8   -7   -6   -5   -4   -3   -2   -1   1
- Consensus:  C/T  C/T  C/T  C/T  C/T  C/T  C/T  C/T  C/T  C/T  A    G    G
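To make the consensus notation concrete, the sketch below counts how many positions of a nine-base window agree with the donor consensus listed above. The names `DONOR_CONSENSUS` and `consensus_matches` and the two example windows are illustrative assumptions; this is not the classifier used in the study.

```python
from typing import List

# Donor consensus for positions -3 to 6; each entry lists the bases allowed at that position.
DONOR_CONSENSUS: List[str] = ["CG", "A", "G", "G", "T", "AG", "A", "G", "T"]


def consensus_matches(window: str, consensus: List[str] = DONOR_CONSENSUS) -> int:
    """Count the positions at which the window agrees with the consensus."""
    if len(window) != len(consensus):
        raise ValueError("window length must equal consensus length")
    return sum(base in allowed for base, allowed in zip(window.upper(), consensus))


perfect = consensus_matches("CAGGTAAGT")  # agrees at all nine positions
partial = consensus_matches("TAGGTCAGT")  # disagrees at positions -3 and 3
```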
10EBI Dataset
- Clean dataset generated at EBI (Thanaraj, 1999)
- Donors
- training set: 567 positive, 943 negative
- test set: 229 positive, 373 negative
- Acceptors
- training set: 637 positive, 468 negative
- test set: 273 positive, 213 negative
11Results
12TREPAN Donor Tree
[Figure: TREPAN donor tree with Yes/No branches, annotated with the donor consensus C/G A G G T A/G A G T]
13C5 Donor Tree (extract)
p5 = G
    p3 = C or p3 = T => NEGATIVE
    p3 = A
        p2 = G => POSITIVE
        p2 = A
            p4 = A or p4 = G => POSITIVE
            p4 = C or p4 = T => NEGATIVE
        p2 = C
            p4 = A => POSITIVE
            else => NEGATIVE
        p2 = T
            p6 = A or p6 = G => NEGATIVE
            p6 = C or p6 = T => POSITIVE
    p3 = G
        p4 = T => NEGATIVE
        p4 = C
            p6 = T => POSITIVE
            else => NEGATIVE
14Trepan Acceptor Tree
[Figure: Trepan acceptor tree, annotated with the acceptor consensus fragment C/T C/T A G G]
15Application to Chemoinformatics
- Learning general rules
- Conformational Analysis
- QSAR dataset
16Oprea Dataset
- 137 diverse compounds
- Classification
- 62 leads, 75 drugs
- 14 descriptors (from Cerius-2)
- MW, MR, AlogP
- Ndonor, Nacceptor, Nrotbond
- Number of Lipinski violations
- T.I. Oprea, A.M. Davis, S.J. Teague and P.D. Leeson, Is There a Difference between Leads and Drugs? A Historical Perspective, J. Chem. Inf. Comput. Sci., 41, 1308-1315 (2001).
17C5 tree
MW < 380 [Mode: lead]
    Rule of 5 Violations = 0 [Mode: lead]
        Hbond acceptor < 2 [Mode: lead] => lead
        Hbond acceptor > 2 [Mode: drug] => drug
    Rule of 5 Violations > 0 [Mode: lead] => lead
MW > 380 [Mode: drug] => drug
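The C5 tree above can be read as a short chain of nested conditions. The sketch below is one such reading; the function and parameter names are hypothetical, and assigning boundary values (a molecular weight of exactly 380, or exactly 2 hydrogen-bond acceptors) to the lower branch is an assumption, since the slide does not say which side is inclusive.

```python
def classify_oprea(mw: float, ro5_violations: int, hbond_acceptors: int) -> str:
    """Classify a compound as 'lead' or 'drug' following the C5 tree on the slide."""
    if mw > 380:
        return "drug"
    # Molecular weight at or below 380 (boundary handling is an assumption).
    if ro5_violations > 0:
        return "lead"
    # No rule-of-5 violations: decide on the hydrogen-bond acceptor count.
    return "drug" if hbond_acceptors > 2 else "lead"


heavy = classify_oprea(mw=450.0, ro5_violations=0, hbond_acceptors=5)
light_few_acceptors = classify_oprea(mw=300.0, ro5_violations=0, hbond_acceptors=1)
```

Written this way, the tree reduces to three rules, which is the kind of comprehensibility the comparison with Trepan is concerned with.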
18Trepan Oprea Tree
19Conformational Analysis
- 300 conformations from a 5 ns MD simulation of rosiglitazone
- Classified by length of long axis into
- Extended: distance > 10 Å
- Folded: distance < 10 Å
- 8 torsion angles
- In-house data.
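The class labels can be illustrated with a small sketch that measures the distance between two points and applies the 10 Å cutoff. Treating the long axis as the distance between two end atoms, the function name, and assigning a distance of exactly 10 Å to the folded class are all assumptions made here; the coordinates are invented for the example.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def classify_conformation(end_a: Point, end_b: Point, cutoff: float = 10.0) -> str:
    """Label a conformation by the distance (in angstroms) between two end atoms."""
    distance = math.dist(end_a, end_b)
    return "extended" if distance > cutoff else "folded"


stretched = classify_conformation((0.0, 0.0, 0.0), (12.0, 0.0, 0.0))  # 12 angstroms apart
compact = classify_conformation((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))     # 5 angstroms apart
```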
20Rosiglitazone
- Agonist of PPAR gamma Nuclear Receptor
- Regulates HDL/LDL and triglycerides
- Active ingredient of Avandia for Type II Diabetes
21Distances
22C5 tree
T5 < 269 [Mode: extended]
    T5 < 52 [Mode: extended]
        T7 < 185 [Mode: extended] => extended
        T7 > 185 [Mode: folded]
            T6 < 75 [Mode: folded] => folded
            T6 > 75 [Mode: extended]
                T5 < 41 [Mode: folded]
                    T8 < 249 [Mode: folded] => folded
                    T8 > 249 [Mode: extended] => extended
                T5 > 41 [Mode: extended] => extended
    T5 > 52 [Mode: extended]
        T6 < 73 [Mode: extended]
            T8 < 242 [Mode: extended]
                T5 < 7 [Mode: extended]
                    T8 < 22 [Mode: extended] => extended
                    T8 > 22 [Mode: folded] => folded
                T5 > 7 [Mode: extended] => extended
            T8 > 242 [Mode: extended] => extended
        T6 > 73 [Mode: extended] => extended
T5 > 269 [Mode: folded] => folded
23Trepan Conformation Tree
24Ferreira Dataset
- typical QSAR dataset
- 48 HIV-1 Protease inhibitors
- Activity as pIC50
- Low: pIC50 < 8.0
- High: pIC50 > 8.0
- 14 descriptors (mostly topological)
- R. Kiralj and M.M.C. Ferreira, A-priori Molecular Descriptors in QSAR: a case of HIV-1 protease inhibitors. I. The Chemometric Approach, J. Mol. Graph. Modell., 21, 435-448 (2003).
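The activity classes above can be illustrated with a short sketch. It uses the standard definition of pIC50 as the negative base-10 logarithm of the IC50 in mol/L; the function names and the example concentrations are assumptions, and placing a pIC50 of exactly 8.0 in the low class is a choice made here because the slide does not specify it.

```python
import math


def pic50(ic50_molar: float) -> float:
    """Convert an IC50 in mol/L to pIC50 = -log10(IC50)."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def activity_class(value: float, cutoff: float = 8.0) -> str:
    """Bin a pIC50 value into the 'high' or 'low' activity class."""
    return "high" if value > cutoff else "low"


potent = activity_class(pic50(1e-9))  # 1 nM inhibitor, pIC50 of about 9
weak = activity_class(pic50(1e-7))    # 100 nM inhibitor, pIC50 of about 7
```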
25Original Results
- PLS model
- Activity determined by
- X9, X11, X10, X13
- R2 = 0.91, Q2 = 0.85, Ncomps = 3
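For reference, the two statistics quoted above are conventionally defined as below. These are the usual chemometric definitions and are assumed, not confirmed, to be the ones used in the original PLS study. Here $y_i$ is the observed activity, $\hat{y}_i$ the fitted value, $\hat{y}_{(i)}$ the cross-validated prediction for compound $i$ from a model built without it, and $\bar{y}$ the mean observed activity.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
Q^2 = 1 - \frac{\sum_i (y_i - \hat{y}_{(i)})^2}{\sum_i (y_i - \bar{y})^2}
```

R2 measures the quality of the fit, while Q2 measures predictive ability under cross-validation, so Q2 is normally the lower of the two.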
26C5 tree
X11 < 2.5 [Mode: low]
    X13 < 16.7 [Mode: low] => low
    X13 > 16.7 [Mode: high] => high
X11 > 2.5 [Mode: high] => high
27Trepan Ferreira Tree
28Accuracy
29Conclusions
- Reasonable Accuracy
- Comprehensible Rules
30Acknowledgements
- David Whitley.
- Tony Browne.
- Martyn Ford.
- BBSRC grant reference BIO/12005.