Title: Recent developments in tree induction for KDD
1. Recent developments in tree induction for KDD: Towards soft tree induction
- Louis WEHENKEL
- University of Liège Belgium
- Department of Electrical and Computer Engineering
2. A. Supervised learning (notation)
- x = (x1, ..., xm): vector of input variables (numerical and/or symbolic)
- y: single output variable
  - Symbolic → classification problem
  - Numeric → regression problem
- LS = ((x1,y1), ..., (xN,yN)): sample of I/O pairs
- Learning (or modeling) algorithm
  - Mapping from the sample space to a hypothesis space H
  - Say y = f(x) + e, where e = modeling error
  - Guess f_LS in H so as to minimize e
3. Statistical viewpoint
- x and y are random variables distributed according to p(x,y)
- LS is distributed according to p^N(x,y)
- f_LS is a random function (selected in H)
- e(x) = y − f_LS(x) is also a random variable
- Given a metric to measure the error, we can define the best possible model (Bayes model)
  - Regression: f_B(x) = E(y|x)
  - Classification: f_B(x) = argmax_y P(y|x)
4. B. Crisp decision trees (what is it?)
[Figure: example crisp decision tree with yes/no tests "X1 < 0.6" and "X2 < 1.5" and leaves "Y is big", "Y is small", "Y is very big"]
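To make the example concrete, here is a minimal sketch of how such a crisp tree produces a prediction. Which leaf hangs on which branch is my reading of the (partly lost) figure, so treat the branch assignment as illustrative.

```python
# Hypothetical rendering of the example tree: two crisp tests, three leaves.
def predict_y_size(x1: float, x2: float) -> str:
    if x1 < 0.6:                 # root test
        if x2 < 1.5:             # second test on the "yes" branch (assumed)
            return "Y is small"
        return "Y is very big"
    return "Y is big"            # "no" branch of the root (assumed)
```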
5. B. Crisp decision trees (what is it?)
[Figure: corresponding partition of the input space by the thresholds X1 = 0.6 and X2 = 1.5]
6. Tree induction (Overview)
- Growing the tree (uses GS, a part of LS); a code sketch of the growing loop follows this list
  - Top down (until all nodes are closed)
  - At each step
    - Select the open node to split (best first, greedy approach)
    - Find the best input variable and the best question
    - If the node can be purified, split it; otherwise close the node
- Pruning the tree (uses PS, the rest of LS)
  - Bottom up (until all nodes are contracted)
  - At each step
    - Select the test node to contract (worst first, greedy)
    - Contract and evaluate
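A minimal sketch of the growing step for regression, under my own assumptions: the score is the total residual variance of the two sub-samples, the stopping rule is a minimum node size, the recursion is depth-first rather than best-first, and pruning on PS is omitted for brevity.

```python
import numpy as np

def grow(X, y, min_samples=5):
    """Greedy top-down growing of a crisp regression tree on (X, y) = GS."""
    if len(y) < min_samples or y.var() == 0.0:        # pure or too small: close the node
        return {"leaf": True, "value": float(y.mean())}
    best = None
    for j in range(X.shape[1]):                       # best variable and best question
        for th in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= th
            score = y[mask].var() * mask.sum() + y[~mask].var() * (~mask).sum()
            if best is None or score < best[0]:
                best = (score, j, float(th))
    if best is None:                                  # no admissible split: close the node
        return {"leaf": True, "value": float(y.mean())}
    _, j, th = best
    mask = X[:, j] <= th
    return {"leaf": False, "attr": j, "th": th,
            "left": grow(X[mask], y[mask], min_samples),
            "right": grow(X[~mask], y[~mask], min_samples)}
```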
7. Tree Growing
- Demo: Titanic database
- Comments
- Tree growing is a local process
- Very efficient
- Can select relevant input variables
- Cannot determine appropriate tree shape
- (Just like real trees)
8. Tree Pruning
- Strategy
  - To determine the appropriate tree shape, let the tree grow too big (along all branches), and then reshape it by pruning away irrelevant parts
  - Tree pruning uses a global criterion to determine the appropriate shape
  - Tree pruning is even faster than growing
  - Tree pruning avoids overfitting the data
9. Growing + Pruning (graphically)
[Figure: error on GS and on PS as a function of tree complexity]
10. C. Soft trees (what is it?)
- Generalization of crisp trees using continuous splits and aggregation of terminal node predictions
[Figure: a soft split; the propagation weight varies continuously between 0 and 1 across the transition region]
11. Soft trees (discussion)
- Each split is defined by two parameters
  - Position a and width b of the transition region
- Generalizes decision/regression trees into a model that is continuous and differentiable w.r.t. the model parameters
  - Test nodes: a_j, b_j
  - Terminal nodes: n_i
- Other names (of similar models)
  - Fuzzy trees, continuous trees
  - Tree-structured (neural, Bayesian) networks
  - Hierarchical models
12. Soft trees (Motivations)
- Improve performance (w.r.t. crisp trees)
  - Use of a larger hypothesis space
  - Reduced variance and bias
  - Improved optimization (à la backprop)
- Improve interpretability
  - More "honest" model
  - Reduced parameter variance
  - Reduced complexity
13. D. Plan of the presentation
- Bias/Variance tradeoff (in tree induction)
- Main techniques to reduce variance
- Why soft trees have lower variance
- Techniques for learning soft trees
14. Concept of variance
- The learning sample is random
- The learned model is a function of the sample
  - The model is therefore also random → variance
  - Model predictions have variance
  - Model structure / parameters have variance
- Variance reduces accuracy and interpretability
- Variance can be reduced by various averaging or smoothing techniques
15. Theoretical explanation
- Bias, variance and residual error
  - Residual error: difference between the output variable and the best possible model (i.e. the error of the Bayes model)
  - Bias: difference between the best possible model and the average model produced by the algorithm
  - Variance: average variability of the model around the average model
- Expected error² = residual² + bias² + variance
- NB: these notions depend on the metric used for measuring error
16. Regression (locally, at point x)
- Find ŷ = f(x) such that E_{y|x}{err(y, ŷ)} is minimum, where err is an error measure
- Usually, err = squared error (y − ŷ)²
- f(x) = E_{y|x}{y} minimizes the error at every point x
- The Bayes model is the conditional expectation
17. Learning algorithm (1)
- Usually, p(y|x) is unknown
- Use LS = ((x1,y1), ..., (xN,yN)) and a learning algorithm to choose a hypothesis in H
  - ŷ_LS(x) = f(LS, x)
- At each input point x, the prediction ŷ_LS(x) is a random variable
- The distribution of ŷ_LS(x) depends on the sample size N and on the learning algorithm used
18. Learning algorithm (2)
[Figure: distribution p_LS(ŷ(x)) of the prediction ŷ(x) at a fixed point x]
- Since LS is randomly drawn, the estimation ŷ(x) is a random variable
19. Good learning algorithm
- A good learning algorithm should minimize the average (generalization) error over all learning sets
- In regression, the usual error is the mean squared error, so we want to minimize (at each point x)
  - Err(x) = E_LS{ E_{y|x}{ (y − ŷ_LS(x))² } }
- There exists a useful additive decomposition of this error into three (positive) terms
20. Bias/variance decomposition (1)
[Figure: conditional distribution of y at x, with its mean E_{y|x}{y} and spread var_{y|x}{y}]
- Err(x) = E_{y|x}{ (y − E_{y|x}{y})² } + ...
- E_{y|x}{y} = argmin_{y'} E_{y|x}{ (y − y')² } = Bayes model
- var_{y|x}{y} = residual error = minimal error
21. Bias/variance decomposition (2)
[Figure: squared distance bias²(x) between the Bayes model E_{y|x}{y} and the average model E_LS{ŷ(x)}]
- Err(x) = var_{y|x}{y} + (E_{y|x}{y} − E_LS{ŷ(x)})² + ...
- E_LS{ŷ(x)} = average model (w.r.t. LS)
- bias²(x) = error between the Bayes model and the average model
22. Bias/variance decomposition (3)
[Figure: spread var_LS{ŷ(x)} of the predictions ŷ(x) around the average model]
- Err(x) = var_{y|x}{y} + bias²(x) + E_LS{ (ŷ(x) − E_LS{ŷ(x)})² }
- var_LS{ŷ(x)} = variance
23. Bias/variance decomposition (4)
- Local error decomposition
  - Err(x) = var_{y|x}{y} + bias²(x) + var_LS{ŷ(x)}
- Global error decomposition (take the average w.r.t. p(x))
  - E_X{Err(x)} = E_X{var_{y|x}{y}} + E_X{bias²(x)} + E_X{var_LS{ŷ(x)}}
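Collecting the three preceding slides, the full identity reads as follows (my notation, matching the slides; the cross terms vanish because y and LS are independent and each difference is centred on its own mean):

```latex
\mathrm{Err}(x) \,=\, E_{LS}\,E_{y|x}\!\left(y-\hat y_{LS}(x)\right)^2
 \,=\, \underbrace{E_{y|x}\!\left(y-E_{y|x}y\right)^2}_{\text{residual error}}
 \,+\, \underbrace{\left(E_{y|x}y-E_{LS}\,\hat y_{LS}(x)\right)^2}_{\mathrm{bias}^2(x)}
 \,+\, \underbrace{E_{LS}\!\left(\hat y_{LS}(x)-E_{LS}\,\hat y_{LS}(x)\right)^2}_{\mathrm{var}_{LS}(x)}
```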
24. Illustration (1)
- Problem definition
  - One input x, uniform random variable in [0,1]
  - y = h(x) + e, where e ~ N(0,1)
[Figure: h(x) = E_{y|x}{y} plotted as a function of x]
25. Illustration (2)
- Small variance, high bias method
26. Illustration (3)
- Small bias, high variance method
27. Illustration (Methods comparison)
- Artificial problem with 10 inputs, all uniform random variables in [0,1]
- The true function depends only on 5 inputs
  - y(x) = 10·sin(π·x1·x2) + 20·(x3 − 0.5)² + 10·x4 + 5·x5 + e,
  - where e is a N(0,1) random variable
- Experimentation (a code sketch of this protocol follows this list)
  - E_LS ≈ average over 50 learning sets of size 500
  - E_{x,y} ≈ average over 2000 cases
  - Estimate variance and bias (+ residual error)
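A hedged sketch of this protocol with a full regression tree as the learner; using scikit-learn's DecisionTreeRegressor is my substitution for the authors' own tree software, and the estimator below is only one straightforward way to compute the two terms.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def friedman(X):
    """True function of the slide: depends on the first 5 of the 10 inputs."""
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3] + 5 * X[:, 4])

def bias_variance(n_ls=50, n_train=500, n_test=2000, n_inputs=10):
    """Estimate squared bias and variance of a full regression tree."""
    X_test = rng.uniform(size=(n_test, n_inputs))
    preds = []
    for _ in range(n_ls):                                   # E_LS by averaging over learning sets
        X = rng.uniform(size=(n_train, n_inputs))
        y = friedman(X) + rng.normal(size=n_train)          # e ~ N(0,1); residual error = 1
        preds.append(DecisionTreeRegressor().fit(X, y).predict(X_test))
    preds = np.array(preds)
    avg_model = preds.mean(axis=0)
    bias2 = np.mean((friedman(X_test) - avg_model) ** 2)    # squared bias, averaged over x
    var = np.mean(preds.var(axis=0))                        # model variance, averaged over x
    return bias2, var
```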
28. Illustration (Linear regression)

Method          Err²   Bias² + Noise   Variance
Linear regr.     7.0        6.8           0.2
k-NN (k = 1)    15.4        5.0          10.4
k-NN (k = 10)    8.5        7.2           1.3
MLP (10)         2.0        1.2           0.8
MLP (10, 10)     4.6        1.4           3.2
Regr. tree      10.2        3.5           6.7

- Very few parameters → small variance
- The goal function is not linear → high bias
29. Illustration (k-Nearest Neighbors)

Method          Err²   Bias² + Noise   Variance
Linear regr.     7.0        6.8           0.2
k-NN (k = 1)    15.4        5.0          10.4
k-NN (k = 10)    8.5        7.2           1.3
MLP (10)         2.0        1.2           0.8
MLP (10, 10)     4.6        1.4           3.2
Regr. tree      10.2        3.5           6.7

- Small k → high variance and moderate bias
- High k → smaller variance but higher bias
30. Illustration (Multilayer Perceptrons)

Method          Err²   Bias² + Noise   Variance
Linear regr.     7.0        6.8           0.2
k-NN (k = 1)    15.4        5.0          10.4
k-NN (k = 10)    8.5        7.2           1.3
MLP (10)         2.0        1.2           0.8
MLP (10, 10)     4.6        1.4           3.2
Regr. tree      10.2        3.5           6.7

- Small bias
- Variance increases with the model complexity
31. Illustration (Regression trees)

Method          Err²   Bias² + Noise   Variance
Linear regr.     7.0        6.8           0.2
k-NN (k = 1)    15.4        5.0          10.4
k-NN (k = 10)    8.5        7.2           1.3
MLP (10)         2.0        1.2           0.8
MLP (10, 10)     4.6        1.4           3.2
Regr. tree      10.2        3.5           6.7

- Small bias: a (complex enough) tree can approximate any non-linear function
- High variance (see later)
32. Variance reduction techniques
- In the context of a given method
  - Adapt the learning algorithm to find the best trade-off between bias and variance
  - Not a panacea, but the least we can do
  - Example: pruning, weight decay
- Wrapper techniques
  - Change the bias/variance trade-off
  - Universal, but destroys some features of the initial method
  - Example: bagging
33. Variance reduction: 1 model (1)
- General idea: reduce the ability of the learning algorithm to over-fit the LS
- Pruning
  - reduces the model complexity explicitly
- Early stopping
  - reduces the amount of search
- Regularization
  - reduces the size of the hypothesis space
34. Variance reduction: 1 model (2)
[Figure: E = bias² + var as a function of the degree of fitting; bias² decreases and var increases with fitting, and the optimal fitting is at the minimum of E]
- Bias² ≈ error on the learning set, E ≈ error on an independent test set
- Selection of the optimal level of tuning
  - a priori (not optimal)
  - by cross-validation (less efficient)
35. Variance reduction: 1 model (3)
- Examples
  - Post-pruning of regression trees
  - Early stopping of MLP by cross-validation

Method                   E     Bias   Variance
Full regr. tree (488)   10.2    3.5      6.7
Pruned regr. tree (93)   9.1    4.3      4.8
Fully learned MLP        4.6    1.4      3.2
Early-stopped MLP        3.8    1.5      2.3

- As expected, this reduces variance and increases bias
36. Variance reduction: bagging (1)
- Idea: the average model E_LS{ŷ(x)} has the same bias as the original method but zero variance
- Bagging (Bootstrap AGGregatING); a code sketch follows this list
  - To compute E_LS{ŷ(x)}, we should draw an infinite number of learning sets (of size N)
  - Since we have only one single LS, we simulate sampling from nature by bootstrap sampling from the given LS
  - Bootstrap sampling: sampling with replacement of N objects from LS (N is the size of LS)
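A minimal sketch of bagging as just described, again with unpruned scikit-learn regression trees as the base learner (an assumption; any tree grower would do). X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, n_models=50, rng=None):
    """Grow one unpruned regression tree per bootstrap replicate of (X, y)."""
    rng = rng or np.random.default_rng(0)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))    # sampling with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Average the individual tree predictions (this averaging reduces variance)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```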
37. Variance reduction: bagging (2)
[Figure: schematic of bagging, with bootstrap replicates drawn from LS]
38. Variance reduction: bagging (3)
- Application to regression trees

Method              E     Bias   Variance
3-test regr. tree  14.8   11.1      3.7
  bagged           11.7   10.7      1.0
Full regr. tree    10.2    3.5      6.7
  bagged            5.3    3.8      1.5

- Strong variance reduction without increasing bias (although the model is much more complex than a single tree)
39. Dual bagging (1)
- Instead of perturbing learning sets to obtain several predictions, directly perturb the test case at the prediction stage
- Given a model ŷ(.) and a test case x
  - Form k attribute vectors by adding Gaussian noise to x: x+e1, x+e2, ..., x+ek
  - Average the predictions of the model at these points to get the prediction at point x:
    1/k · (ŷ(x+e1) + ŷ(x+e2) + ... + ŷ(x+ek))
- Noise level σ (variance of the Gaussian noise) selected by cross-validation; a code sketch follows
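A sketch of dual bagging (dual perturb and combine) at prediction time. The model is any already-fitted regressor with a scikit-learn-style predict method (my assumption), and k is the number of noisy copies of the test point.

```python
import numpy as np

def dual_bagging_predict(model, x, sigma, k=25, rng=None):
    """Average the model's predictions over k Gaussian perturbations of x."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    noisy = x + rng.normal(scale=sigma, size=(k, x.size))   # x+e1, ..., x+ek
    return float(model.predict(noisy).mean())               # 1/k * sum of predictions
```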
40. Dual bagging (2)

Noise level σ    E     Bias   Variance
0.0             10.2    3.5      6.7
0.2              6.3    3.5      2.8
0.5              5.3    4.4      0.9
2.0             13.3   13.1      0.2

- Smooths the function ŷ(.)
- Too much noise increases bias
  - there is a (new) trade-off between bias and variance
41. Dual bagging (classification trees)
- σ = 1.5 → error = 4.6
- σ = 0.3 → error = 1.4
- σ = 0 → error = 3.7
42. Variance in tree induction
- Tree induction is among the ML methods with the highest variance (together with 1-NN)
- Main reason
  - Generalization is local
  - It depends on small parts of the learning set
- Sources of variance
  - Discretization of numerical attributes (60 %)
    - The selected thresholds have a high variance
  - Structure choice (10 %)
    - Sometimes, attribute scores are very close
  - Estimation at leaf nodes (30 %)
    - Because of the recursive partitioning, prediction at leaf nodes is based on very small samples of objects
- Consequences
  - Questionable interpretability and higher error rates
43. Threshold variance (1)
- Test on numerical attributes: a(o) < a_th
- Discretization: find a_th which optimizes the score (a search sketch for the regression case follows this list)
  - Classification: maximize information
  - Regression: minimize residual variance
[Figure: score as a function of the attribute value a(o), with the selected threshold a_th at the optimum]
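A sketch of the regression case (minimize residual variance). Taking candidate thresholds at midpoints between consecutive attribute values is a common convention that I assume here; it is not necessarily what the author's software does.

```python
import numpy as np

def best_threshold(a, y):
    """a: attribute values, y: outputs; returns (a_th, score) minimizing residual variance."""
    order = np.argsort(a)
    a, y = a[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(a)):
        if a[i] == a[i - 1]:
            continue                       # no valid cut between equal attribute values
        left, right = y[:i], y[i:]
        score = left.var() * len(left) + right.var() * len(right)
        if score < best[1]:
            best = ((a[i - 1] + a[i]) / 2, score)
    return best
```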
44. Threshold variance (2)
45. Threshold variance (3)
46. Tree variance
- DT/RT are among the machine learning methods which present the highest variance

Method                  E     Bias   Variance
RT, no test            25.5   25.4      0.1
RT, 1 test             19.0   17.7      1.3
RT, 3 tests            14.8   11.1      3.7
RT, full (250 tests)   10.2    3.5      6.7
47. DT variance reduction
- Pruning
  - Necessary to select the right complexity
  - Decreases variance but increases bias → small effect on accuracy
- Threshold stabilization
  - Smoothing of score curves, bootstrap sampling
  - Reduces parameter variance but has only a slight effect on accuracy and prediction variance
- Bagging
  - Very efficient at reducing variance
  - But jeopardizes interpretability of trees and computational efficiency
- Dual bagging
  - In terms of variance reduction, similar to bagging
  - Much faster, and can be simulated by soft trees
- Fuzzy tree induction
  - Build soft trees in a full-fledged approach
48. Dual tree bagging → soft trees
- Reformulation of dual bagging as an explicit soft tree propagation algorithm
- Algorithms
  - Forward-backward propagation in soft trees
  - Softening of thresholds during the learning stage
- Some results
49. Dual bagging → soft thresholds
- x + e < x_th ? → sometimes left, sometimes right
- Multiple crisp propagations can be replaced by one soft propagation
- E.g. if e has a uniform pdf over [a_th − λ/2, a_th + λ/2], then the probability of right propagation is as follows
[Figure: probability of right propagation as a function of a(o): 0 below the transition region of width λ around a_th, rising linearly to 1 above it; TS_left and TS_right denote the two soft sub-samples]
50. Forward-backward algorithm
- Top-down propagation of probability
  - P(Root|x) = 1
  - P(N1|x) = P(Test1|x) · P(Root|x)
  - P(L3|x) = P(¬Test1|x) · P(Root|x)
  - P(L1|x) = P(Test2|x) · P(N1|x)
  - P(L2|x) = P(¬Test2|x) · P(N1|x)
- Bottom-up aggregation of predictions
[Figure: a soft tree with test nodes Test1 and Test2, internal node N1 and leaves L1, L2, L3, annotated with these propagation probabilities]
A code sketch of the forward pass follows.
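A hedged sketch of the forward (top-down) pass: each test node splits the propagation probability between its children, and leaf predictions are aggregated weighted by the probability of reaching them. The piecewise-linear discriminator of width λ matches the soft-threshold slide; the data structure and names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class SoftNode:
    # Leaf: only `label` is set. Test node: attribute index, position a, width lam, children.
    label: Optional[float] = None
    attr: int = 0
    a: float = 0.0
    lam: float = 1.0
    left: Optional["SoftNode"] = None    # followed when the test is "false"
    right: Optional["SoftNode"] = None   # followed when the test is "true"

def p_right(node, x):
    """Piecewise-linear probability of right propagation over a transition of width lam."""
    t = (x[node.attr] - node.a) / node.lam + 0.5
    return float(np.clip(t, 0.0, 1.0))

def predict(node, x, p=1.0):
    """Sum of leaf labels weighted by the probability p of reaching them from this node."""
    if node.label is not None:           # leaf: contribute p * label
        return p * node.label
    pr = p_right(node, x)
    return predict(node.right, x, p * pr) + predict(node.left, x, p * (1 - pr))
```

As λ shrinks, the propagation probability tends to a step function and the behaviour approaches that of the original crisp tree.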
51. Learning of λ values
- Use of an independent validation set and bisection search
- One single value can be learned very efficiently (amounts to 10 full tests of a DT/RT on the validation set)
- A combination of several values can also be learned, with the risk of overfitting
  - (see fuzzy tree induction, in what follows)
52. Some results with dual bagging
53. Fuzzy tree induction
- General ideas
- Learning Algorithm
- Growing
- Refitting
- Pruning
- Backfitting
54. General Ideas
- Obviously, soft trees have much lower variance than crisp trees
- In the "dual bagging" approach, attribute selection is carried out in a classical way, then tests are softened in a post-processing stage
- It might be more effective to combine the two methods
  - Fuzzy tree induction
55. Soft trees
- Samples are handled as fuzzy subsets
  - Each observation belongs to such a fuzzy subset with a certain membership degree
- The SCORE measure is modified
  - Objects are weighted by their membership degree
- Output y
  - Denotes the membership degree to a class
- Goal of fuzzy tree induction
  - Provide a smooth model of y as a function of the input variables
56. Fuzzy discretization
- Same as fuzzification
- Carried out locally, at the tree growing stage
- At each test node
  - On the basis of the local fuzzy sub-training set
  - Select an attribute, together with a discriminator, so as to maximize the local SCORE
  - Split in a soft way and proceed recursively
- Criteria for SCORE
  - Minimal residual variance
  - Maximal (fuzzy) information quantity
  - Etc.
57. Attaching labels to leaves
- Basically, for each terminal node, we need to determine a local estimate ŷ_i of y
- During intermediate steps
  - Use the average of y in the local sub-learning set
  - Direct computation
- Refitting of the labels
  - Once the full tree has been grown, and at each step of pruning
  - Determine all values simultaneously, to minimize the square error
  - Amounts to a linear least-squares problem
  - Direct solution
58. Refitting (Explanation)
- A leaf corresponds to a basis function μ_i(x)
  - Product of the discriminators encountered on the path from the root
- The tree prediction is equivalent to a weighted average of these basis functions
  - ŷ(x) = ŷ1·μ1(x) + ŷ2·μ2(x) + ... + ŷk·μk(x)
  - the weights ŷ_i are the labels attached to the terminal nodes
- Refitting amounts to tuning the ŷ_i parameters to minimize the square error on the training set (a least-squares sketch follows)
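Since the tree output is linear in the leaf labels once the basis functions are fixed, refitting is an ordinary linear least-squares problem; a minimal sketch, with illustrative names:

```python
import numpy as np

def refit_leaf_labels(mu, y):
    """mu: (N, k) matrix of leaf memberships mu_i(x_j); y: (N,) training outputs.
    Returns the k leaf labels minimizing the training square error."""
    labels, *_ = np.linalg.lstsq(mu, y, rcond=None)
    return labels
```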
59. Tree growing and pruning
- Grow the tree
- Refit the leaf labels
- Prune the tree, refitting the leaf labels at each stage
- Test the sequence of pruned trees on a validation set
- Select the best pruning level
60. Backfitting (1)
- After growing and pruning, the fuzzy tree structure has been determined
- The leaf labels are globally optimal, but not the parameters of the discriminators (tuned locally)
- The resulting model has 2 parameters per test node and 1 parameter per terminal node
- The output (and hence the mean square error) of the fuzzy tree is a smooth function of these parameters
- The parameters can be optimized by using a standard LSE technique, e.g. Levenberg-Marquardt
61. Backfitting (2)
- How to compute the derivatives needed by the nonlinear optimization technique?
  - Use a modified version of backpropagation to compute derivatives with respect to the parameters
  - Yields an efficient algorithm (linear in the size of the tree)
- Backfitting starts from the tree produced after growing and pruning
  - Already a good approximation of a local optimum
  - Only a small number of iterations are necessary to backfit
- Backfitting may also lead to overfitting
A sketch of the outer optimization loop follows.
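A sketch of the outer loop only, using SciPy's Levenberg-Marquardt driver in place of the tailor-made backpropagation of derivatives described above (which is not reproduced here); soft_tree_predict is a hypothetical function mapping a flat parameter vector (split positions, widths and leaf labels) to predictions.

```python
import numpy as np
from scipy.optimize import least_squares

def backfit(params0, X, y, soft_tree_predict):
    """Jointly tune all soft-tree parameters by nonlinear least squares."""
    def residuals(p):
        return soft_tree_predict(p, X) - y              # training-set residuals
    result = least_squares(residuals, params0, method="lm")  # Levenberg-Marquardt
    return result.x
```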
62. Summary and conclusions
- Variance is problem number one in decision/regression tree induction
- It is possible to reduce variance significantly
  - Bagging and/or tree softening
- Soft trees have the advantage of preserving interpretability and computational efficiency
- Two approaches have been presented to obtain soft trees
  - Dual bagging
    - Generic approach
    - Fast and simple
    - Best approach for very large databases
  - Fuzzy tree induction
    - Similar to an ANN type of model, but (more) interpretable
    - Best approach for small learning sets (probably)
63. Some references for further reading
- Variance evaluation/reduction, bagging
  - Contact: Pierre GEURTS (PhD student), geurts_at_montefiore.ulg.ac.be
  - Papers
    - Discretization of continuous attributes for supervised learning: variance evaluation and variance reduction. L. Wehenkel (invited). Proc. of IFSA'97, International Fuzzy Systems Association World Congress, Prague, June 1997, pp. 381-388.
    - Investigation and reduction of discretization variance in decision tree induction. P. Geurts and L. Wehenkel. Proc. of ECML 2000.
    - Some enhancements of decision tree bagging. P. Geurts. Proc. of PKDD 2000.
    - Dual perturb and combine algorithm. P. Geurts. Proc. of AI and Statistics 2001.
64. See also www.montefiore.ulg.ac.be/services/stochastic/
- Fuzzy/soft tree induction
  - Contact: Cristina OLARU (PhD student), olaru_at_montefiore.ulg.ac.be
  - Papers
    - Automatic induction of fuzzy decision trees and its application to power system security assessment. X. Boyen and L. Wehenkel. Int. Journal on Fuzzy Sets and Systems, Vol. 102, No. 1, pp. 3-19, 1999.
    - On neurofuzzy and fuzzy decision trees approaches. C. Olaru and L. Wehenkel (invited). Proc. of IPMU'98, 7th Int. Congr. on Information Processing and Management of Uncertainty in Knowledge-based Systems, 1998.