Title: Rule Extraction 2
Slide 1: Rule Extraction 2
Slide 2: TREPAN
- References
  - Craven, M. W. (1996), Extracting Comprehensible Models from Trained Neural Networks. PhD thesis, University of Wisconsin-Madison.
  - Craven, M. and Shavlik, J. (1996), Extracting tree-structured representations of trained networks. In Touretzky, D., Mozer, M., and Hasselmo, M., editors, Advances in Neural Information Processing Systems (volume 8). MIT Press, Cambridge, MA.
Slide 3: TREPAN
- Basic idea
  - First, a black-box model (e.g. a neural network, but any type of black-box model) is built from the available data.
  - The black-box model is used as an oracle: whenever any analysis is needed, the black-box model is analyzed, not the original system.
  - Next, a humanly comprehensible descriptive model is built from the black-box model.
  - The comprehensible model is a decision tree.
Slide 4: TREPAN
- Attributes xj can be either discrete (binary) or continuous.
- Essentially, networks built for classification purposes are modeled by decision trees.
Slide 5: TREPAN
- Problems to be solved
  - How to expand the tree step by step? How to choose the next node to be expanded?
  - How to deal with the decreasing number of samples from the root to the leaves?
  - How to choose the best test for the node to be expanded?
  - What stopping criteria should be used in the tree-expansion process?
  - What pruning should be done before the process is finished?
Slide 6: TREPAN / How to expand?
- 1. How to choose the next node to be expanded?
  - Suggested heuristic: look for the node where the ratio of incorrectly classified patterns is the highest.
  - Many wrong decisions in a node?
    - Promise of high gain if properly refined (expanded).
Slide 7: TREPAN / How to expand?
- Choose the node n for which the promise of gain G(n) is the highest.
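The formula for G(n) did not survive the extraction. In Craven's thesis the priority of a node is reach(n) multiplied by (1 - fidelity(n)), i.e. the fraction of instances reaching the node times the fraction of them on which the tree disagrees with the oracle. A minimal sketch assuming that form (all names are illustrative):

```python
def promise(n_reached, n_total, fidelity):
    """Priority of expanding a node, assuming the reach * (1 - fidelity)
    form from Craven's thesis.  fidelity is the fraction of the node's
    instances on which the tree agrees with the black-box oracle."""
    reach = n_reached / n_total          # fraction of instances reaching the node
    return reach * (1.0 - fidelity)      # high reach + low fidelity = high promise
```

A node reached by many instances but classified poorly scores highest, which matches the slide's heuristic of looking for many wrong decisions.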
Slide 8: TREPAN / How to expand?
- Example
  - 10000 patterns are classified into 3 classes C1, C2, C3; the original numbers of patterns in the classes are 4500, 1800 and 3700, respectively.
  - When the number of patterns reaching a node drops under 3000, new artificial patterns are generated.
Slide 9: TREPAN / How to expand?
[Figure: example tree expansion; internal nodes apply tests Test1 ... Test4.]
Slide 10: TREPAN / Membership queries
- 2. How to deal with the decreasing number of samples from the root to the leaves?
  - Each test in each node divides the set of samples into two subsets, so the number of samples is decreasing.
  - Analyzing a node that has fewer samples gives poorer results.
- Example
Slide 11: TREPAN / Membership queries
- Idea: if the number of samples drops below the appropriate level, new instances are generated. The black-box model (NN) is used as an ORACLE: new input patterns are generated and presented to the oracle (membership queries), and its answer is taken as the ground truth.
- Problem: how to generate the new input patterns?
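The idea above can be sketched in a few lines. This is a hedged illustration, not TREPAN's exact control flow; `generate_instance` and `oracle` are assumed callables supplied by the caller.

```python
def top_up_with_queries(min_samples, real_examples, generate_instance, oracle):
    """Membership-query sketch: while too few examples reach the node,
    generate artificial instances (which must respect the node's
    constraints) and label them with the black-box oracle."""
    examples = list(real_examples)
    while len(examples) < min_samples:
        x = generate_instance()           # drawn from the node's attribute distributions
        examples.append((x, oracle(x)))   # the oracle's answer is taken as ground truth
    return examples
```

With the threshold of 3000 from the earlier example, a node reached by 2100 real patterns would receive 900 oracle-labelled artificial ones.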
Slide 12: TREPAN / Membership queries
- Membership query: artificial input patterns (attribute vectors) are generated, and the black-box model is used to generate the classification (output) of each pattern.
- The constraints introduced by the previous nodes (parent, grandparent, etc.) should be kept.
- The distribution of the instances associated with the actual node is to be conserved.
Slide 13: TREPAN / Membership queries
- The distribution of the instances associated with the actual node:
  - One possible approach is to always use uniform distributions; this is correct if the fidelity of the black-box (NN) model should be uniform over the entire instance space.
  - Another approach takes the actual distribution into account, so that the extraction algorithm focuses on the parts of the space where most examples are found. The fidelity of the tree will be higher in these parts than in the less frequently used ones.
Slide 14: TREPAN / Membership queries
- Example
  - The effects of the constraints introduced by the parent, grandparent, etc. nodes:
  - If membership queries are to be generated in node4, x1 should be true in all instances, and most of the examples should have a small negative x2 attribute.
Slide 15: TREPAN / Membership queries
- Estimating attribute distributions
  - Discrete-valued attributes: the frequencies of the values are used as empirical distributions.
  - Continuous-valued attributes: kernel density estimates, a consistent estimate (as N goes to infinity).
  - In all cases marginal distributions are used (dependencies among attributes are neglected), but locally different ones.
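Both estimators can be sketched compactly. The kernel width 1/sqrt(m) for m examples is the choice reported in Craven's thesis; the function names are illustrative.

```python
import math
from collections import Counter

def discrete_distribution(values):
    """Empirical frequencies of a discrete attribute's values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def kernel_density(values, width=None):
    """Gaussian kernel density estimate for a continuous attribute.
    Width 1/sqrt(m), as in Craven's thesis, gives a consistent
    estimate as the number of examples m grows."""
    m = len(values)
    h = width if width is not None else 1.0 / math.sqrt(m)
    norm = m * h * math.sqrt(2.0 * math.pi)
    return lambda x: sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values) / norm
```

Note that each attribute is modelled on its own (marginals only), as the slide states; dependencies among attributes are neglected.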
Slide 16: TREPAN / Membership queries
- Give the first 3 membership queries in the node if the initial distributions (at node0) are:
[Table of initial attribute distributions; not preserved in the extraction.]
Slide 17: TREPAN / Membership queries
- Distributions of the attributes at node1 (after performing Test0)
- Probability of the conditions tested at node1
Slide 18: TREPAN / Membership queries
- Distribution of the possible outcomes of the "1 of {x1 >= 0.5, x2}" test at node1
- Distribution of the conditions tested at node2 (using a uniform distribution over [0, 1])
Slide 19: TREPAN / Membership queries
- A random generator of uniform distribution over [0, 1] is used, providing the following random numbers (ξi):
  0.950  0.231  0.607  0.486  0.018  0.762  0.456  0.892  0.821  0.444
- 1st step: ξ1 = 0.950
  - choose x1 < 0.5, x2 = F
Slide 20: TREPAN / Membership queries
- 2nd step: second random number, ξ2 = 0.231
Slide 21: TREPAN / Membership queries
- 3rd step: next random number, ξ3 = 0.607
- Therefore the second membership query is of the same type as the previous one.
Slide 22: TREPAN / Membership queries
- 4th step: next random number, ξ4 = 0.486
Slide 23: TREPAN / Membership queries
- 5th step: next random number, ξ5 = 0.018
  - choose x1 >= 0.5, x2 = T
Slide 24: TREPAN / Membership queries
- 6th step: next random number, ξ6 = 0.762
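The stepping above is inverse-CDF sampling: each uniform number picks the condition whose cumulative-probability interval contains it. The probability table from the slides was lost, so the interval boundaries below are hypothetical; only the random numbers and the two recorded choices come from the lecture.

```python
def sample_category(categories, probs, u):
    """Inverse-CDF sampling: return the category whose cumulative
    probability interval contains the uniform number u in [0, 1)."""
    cum = 0.0
    for cat, p in zip(categories, probs):
        cum += p
        if u < cum:
            return cat
    return categories[-1]  # guard against floating-point rounding

conditions = ["x1>=0.5, x2=T", "x1>=0.5, x2=F", "x1<0.5, x2=T", "x1<0.5, x2=F"]
probs = [0.10, 0.25, 0.30, 0.35]   # hypothetical; the slide's table was lost

# Consistent with the recorded 1st and 5th steps of the lecture:
assert sample_category(conditions, probs, 0.950) == "x1<0.5, x2=F"
assert sample_category(conditions, probs, 0.018) == "x1>=0.5, x2=T"
```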
Slide 25: TREPAN / Search for best test
- 3. How to choose the best test for the node to be expanded?
  - Tests: m-of-n type tests are used.
  - (An integer threshold m and a set of n Boolean literals. The test is satisfied if at least m of the literals are satisfied.)
  - Example: "2 of {A, B, C}" is the same as (A and B) or (A and C) or (B and C) or (A and B and C) (the last term is not needed).
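An m-of-n test is a one-liner; the sketch below (names illustrative) checks the slide's "2 of {A, B, C}" example on a case where exactly A and C hold.

```python
def m_of_n(m, literals, example):
    """An m-of-n test: satisfied iff at least m of the n Boolean
    literal functions hold on the example."""
    return sum(1 for lit in literals if lit(example)) >= m

# "2 of {A, B, C}" on an example where A and C hold but B does not:
lits = [lambda e: e["A"], lambda e: e["B"], lambda e: e["C"]]
example = {"A": True, "B": False, "C": True}
```

With m = 1 the test reduces to a plain disjunction, with m = n to a conjunction, so m-of-n tests strictly generalize both.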
Slide 26: TREPAN / Search for best test
- Literals used in tests
  - Binary attributes: a binary test separates the data according to the attribute's value.
  - Discrete features with more than 2 values: binary tests of the form "attribute = value1?", "attribute = value2?", etc.
  - Real-valued features: binary tests of the form "xj > threshold_j1?" or "xj < threshold_j2?".
Slide 27: TREPAN / Search for best test
- Real-valued features
  - Midpoints between the attribute values of the patterns reaching the actual node are the threshold candidates.
  - Only those midpoints are considered where the two values belong to different classes.
Slide 28: TREPAN / Search for best test
- Search for the best splitting test
  - The best binary test at the current node is selected based on the information-gain criterion. It is the seed of the search process.
  - Two operators are used in the search:
    - m of n -> m of (n+1)
    - m of n -> (m+1) of (n+1)
  - A limited form of backtracking: conflicting conditions can be omitted.
Slide 29: TREPAN / Search for best test
- Examples
  - 2 of {x1, x3} -> 2 of {x1, x3, x4 > 0.7}
  - 2 of {x1, x3} -> 3 of {x1, x3, x4 > 0.7}
  - 2 of {x1, x3, x4 > 0.7, not x3} -> 2 of {x1, x4 > 0.7} (backtracking removes the conflicting condition)
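The two operators can be sketched as a successor function over tests represented as (m, list-of-literals); this representation and the function name are illustrative, not TREPAN's internals.

```python
def successors(test, candidate_literals):
    """TREPAN's two search operators applied to an m-of-n test:
      m of n  ->  m     of (n+1)   (add a literal)
      m of n  ->  (m+1) of (n+1)   (add a literal and raise the threshold)"""
    m, lits = test
    out = []
    for lit in candidate_literals:
        if lit not in lits:                 # only literals not already in the test
            out.append((m, lits + [lit]))
            out.append((m + 1, lits + [lit]))
    return out
```

Applied to the seed "2 of {x1, x3}" with candidate literal "x4 > 0.7", this yields exactly the first two transformations on the slide above.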
Slide 30: TREPAN / Search for best test
- Using these operators, a beam-search method is applied with a beam width of two. Both the seed test and its negated version are used as starting points.
- Example
  - Assume the best binary test is "1 of {x1}". Naturally, "1 of {not x1}" is of the same value.
  - If the target is "2 of {x1, x3, x7}", the first seed is to be used; if the target is "2 of {not x1, x3, x7}", the negated seed is needed.
Slide 31: TREPAN / Search for best test
- Information-gain criterion for tests
  - The test T is selected that maximizes the information gained about the class labels of the examples S:
    gain(T) = info(S) - info_T(S)
  - where info(S) is the information needed to classify the examples of S, which belong to classes C1, C2, ..., Ck:
    info(S) = - Σ_{i=1..k} (|S ∩ Ci| / |S|) · log2(|S ∩ Ci| / |S|)
Slide 32: TREPAN / Search for best test
- info_T(S) is the information needed after performing test T, which has outcomes i = 1, ..., n. The subset of examples that have the i-th outcome is Si:
    info_T(S) = Σ_{i=1..n} (|Si| / |S|) · info(Si)
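The two formulas translate directly into code; class distributions are given as count vectors (function names are illustrative).

```python
import math

def info(counts):
    """info(S): entropy in bits of a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(parent_counts, outcome_counts):
    """gain(T) = info(S) - info_T(S), where outcome_counts[i] is the
    class distribution of the subset Si reaching the i-th outcome of T."""
    n = sum(parent_counts)
    info_t = sum(sum(sub) / n * info(sub) for sub in outcome_counts)
    return info(parent_counts) - info_t
```

For a balanced two-class node, a test that splits the classes perfectly gains the full 1 bit, while a split that preserves the class ratio in both branches gains nothing.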
Slide 33: TREPAN / Search for best test
- Example: a neural network performs two-class classification. A decision tree is generated using TREPAN; when expanding the k-th node, the 1000 examples reaching that node have the following attribute distribution. Which of the possible 4 tests gives the highest gain?
Slides 34-37: TREPAN / Search for best test
[Worked gain computations for the four candidate tests; the test marked "-> BEST!" achieves the highest gain.]
Slide 38: TREPAN / Stopping criteria
- 4. Stopping criteria in the tree-expansion process
  - Local criterion: purity of the example set reaching the node. The node becomes a leaf if, with high probability, it covers only instances of a single class.
  - Global criterion: complexity of the decision tree, i.e. the complexity of the generated explanations. A limit on the number of nodes, the depth of the tree, etc.
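The local criterion can be sketched as a simple purity check; the 0.99 threshold below is illustrative, not TREPAN's exact probabilistic rule.

```python
def is_pure(class_counts, threshold=0.99):
    """Local stopping sketch: declare the node a leaf when one class
    accounts for at least `threshold` of the examples reaching it.
    The threshold value is an assumption for illustration."""
    return max(class_counts) / sum(class_counts) >= threshold
```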
Slide 39: TREPAN / Pruning the tree
- 5. Pruning the final decision tree
  - Purpose: to detect subtrees that predict the same class at all of their leaves, and to collapse each such subtree into a single leaf.
  - Other pruning methods could be used as well, but they are not built into TREPAN.
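The collapse step above can be written as a short bottom-up pass. The tuple representation of a tree used here is illustrative, not TREPAN's internal one.

```python
def prune(tree):
    """Collapse any subtree whose leaves all predict the same class
    into a single leaf.  A tree is either a class label (a leaf) or
    a tuple (test, left_subtree, right_subtree)."""
    if not isinstance(tree, tuple):
        return tree                          # already a leaf
    test, left, right = tree
    left, right = prune(left), prune(right)  # prune children first (bottom-up)
    if not isinstance(left, tuple) and left == right:
        return left                          # both branches predict the same class
    return (test, left, right)
```

Pruning children first lets agreement propagate upward, so a whole uniform subtree collapses to one leaf in a single pass.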