Transcript and Presenter's Notes

Title: Recent developments in tree induction for KDD


1
Recent developments in tree induction for KDD: "Towards soft tree induction"
  • Louis WEHENKEL
  • University of Liège, Belgium
  • Department of Electrical and Computer Engineering

2
A. Supervised learning (notation)
  • x = (x1, ..., xm): vector of input variables (numerical and/or symbolic)
  • y: single output variable
  • Symbolic y: classification problem
  • Numeric y: regression problem
  • LS = ((x1,y1), ..., (xN,yN)): sample of I/O pairs
  • Learning (or modeling) algorithm
  • Mapping from the sample space to a hypothesis space H
  • Say y = f(x) + e, where e = modeling error
  • "Guess" fLS in H so as to minimize e

3
Statistical viewpoint
  • x and y are random variables distributed
    according to p(x,y)
  • LS is distributed according to p^N(x,y)
  • fLS is a random function (selected in H)
  • e(x) = y - fLS(x) is also a random variable
  • Given a metric to measure the error, we can define the best possible model (Bayes model)
  • Regression: fB(x) = E(y|x)
  • Classification: fB(x) = argmax_y P(y|x)

4
B. Crisp decision trees (what is it ?)
[Figure: an example crisp decision tree. The root tests X1 < 0.6; one branch leads to the leaf "Y is big", the other to a second test X2 < 1.5 whose two branches lead to the leaves "Y is small" and "Y is very big". A code reading of this tree is sketched below.]
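
To make the crisp-tree semantics concrete, here is a minimal Python reading of the tree in the figure; the branch orientation is assumed (the flattened slide does not make it explicit) and the function name is illustrative.

```python
def crisp_tree_predict(x1, x2):
    """Crisp decision tree: each test sends a case down exactly one branch.

    One possible reading of the tree in the figure (branch orientation assumed).
    """
    if x1 < 0.6:
        return "Y is big"
    elif x2 < 1.5:
        return "Y is small"
    else:
        return "Y is very big"

print(crisp_tree_predict(0.4, 2.0))  # -> "Y is big"
```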
5
B. Crisp decision trees (what is it ?)
[Figure: the corresponding partition of the (X1, X2) input space into three regions by the thresholds X1 = 0.6 and X2 = 1.5.]
6
Tree induction (Overview)
  • Growing the tree (uses GS, a part of LS; see the sketch below)
  • Top down (until all nodes are closed)
  • At each step:
  • Select the open node to split (best first, greedy approach)
  • Find the best input variable and the best question
  • If the node can be purified, split; otherwise close the node
  • Pruning the tree (uses PS, the rest of LS)
  • Bottom up (until all nodes are contracted)
  • At each step:
  • Select the test node to contract (worst first, greedy)
  • Contract and evaluate

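A minimal Python sketch of the growing phase described above, assuming numerical inputs and variance reduction as the score; for brevity it expands nodes depth-first rather than best-first and omits the pruning pass on PS. All names are illustrative.

```python
import numpy as np

def find_best_question(X, y):
    """Find the best (input variable, threshold) pair by variance reduction."""
    best_j, best_thr, best_score = None, None, 0.0
    total = y.var() * len(y)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[1:]:
            left = X[:, j] < thr
            score = total - (y[left].var() * left.sum() + y[~left].var() * (~left).sum())
            if score > best_score:
                best_j, best_thr, best_score = j, thr, score
    return best_j, best_thr

def grow_tree(X, y, min_samples=5):
    """Split recursively; close the node when it cannot be purified further."""
    if len(y) < min_samples or y.var() == 0.0:
        return {"leaf": True, "value": y.mean()}
    j, thr = find_best_question(X, y)
    if j is None:                       # no useful question found: close the node
        return {"leaf": True, "value": y.mean()}
    left = X[:, j] < thr
    return {"leaf": False, "var": j, "thr": thr,
            "left": grow_tree(X[left], y[left], min_samples),
            "right": grow_tree(X[~left], y[~left], min_samples)}
```
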
7
Tree Growing
  • Demo: Titanic database
  • Comments
  • Tree growing is a local process
  • Very efficient
  • Can select relevant input variables
  • Cannot determine appropriate tree shape
  • (Just like real trees)

8
Tree Pruning
  • Strategy
  • To determine the appropriate tree shape, let the tree grow too big (along all branches), and then reshape it by pruning away irrelevant parts
  • Tree pruning uses a global criterion to determine the appropriate shape
  • Tree pruning is even faster than growing
  • Tree pruning avoids overfitting the data

9
Growing + Pruning (graphically)
[Figure: error on GS and on PS plotted against tree complexity.]
10
C. Soft trees (what is it ?)
  • Generalization of crisp trees using continuous
    splits and aggregation of terminal node
    predictions

11
Soft trees (discussion)
  • Each split is defined by two parameters
  • Position a, and width b of the transition region (one possible discriminator is sketched below)
  • Generalize decision/regression trees into a continuous and differentiable model w.r.t. the model parameters
  • Test nodes: parameters aj, bj
  • Terminal nodes: labels ni
  • Other names (of similar models)
  • Fuzzy trees, continuous trees
  • Tree-structured (neural, Bayesian) networks
  • Hierarchical models

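As an illustration of the continuous split, here is a sketch with a sigmoid discriminator parameterized by a position a and a width b; the slides fix only these two parameters and the differentiability requirement, not the exact functional form, so the sigmoid is an assumption.

```python
import numpy as np

def soft_split_right(x, a, b):
    """Degree (probability) of propagation to the 'right' branch.

    a: position of the split, b: width of the transition region.
    A sigmoid is one differentiable choice; as b -> 0 it tends to the crisp test x >= a.
    """
    return 1.0 / (1.0 + np.exp(-(x - a) / b))

def soft_stump_predict(x, a, b, n_left, n_right):
    """One-test soft tree: aggregate the two terminal-node predictions."""
    p = soft_split_right(x, a, b)
    return (1.0 - p) * n_left + p * n_right
```
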
12
Soft trees (Motivations)
  • Improve performance (w.r.t. crisp trees)
  • Use of a larger hypothesis space
  • Reduced variance and bias
  • Improved optimization (à la backprop)
  • Improve interpretability
  • More "honest" model
  • Reduced parameter variance
  • Reduced complexity

13
D. Plan of the presentation
  • Bias/Variance tradeoff (in tree induction)
  • Main techniques to reduce variance
  • Why soft trees have lower variance
  • Techniques for learning soft trees

14
Concept of variance
  • Learning sample is random
  • Learned model is function of the sample
  • Model is also random → variance
  • Model predictions have variance
  • Model structure / parameters have variance
  • Variance reduces accuracy and
    interpretability
  • Variance can be reduced by various
    averaging or smoothing techniques

15
Theoretical explanation
  • Bias, variance and residual error
  • Residual error
  • Difference between output variable and the best
    possible model (i.e. error of the Bayes model)
  • Bias
  • Difference between the best possible model and
    the average model produced by algorithm
  • Variance
  • Average variability of model around average model
  • Expected error² = res² + bias² + var
  • NB: these notions depend on the metric used for measuring error

16
Regression (locally, at point x)
  • Find ŷ = f(x) such that E_y|x{err(y, ŷ)} is minimum, where err is an error measure
  • Usually, err = squared error = (y - ŷ)²
  • f(x) = E_y|x{y} minimizes the error at every point x
  • Bayes model is the conditional expectation

17
Learning algorithm (1)
  • Usually, p(y|x) is unknown
  • Use LS = ((x1,y1), ..., (xN,yN)) and a learning algorithm to choose a hypothesis in H
  • ŷ_LS(x) = f(LS, x)
  • At each input point x, the prediction ŷ_LS(x) is a random variable
  • Distribution of ŷ_LS(x) depends on the sample size N and on the learning algorithm used

18
Learning algorithm (2)
[Figure: sampling distribution p_LS(ŷ(x)) of the prediction ŷ at a fixed point x.]
  • Since LS is randomly drawn, the estimation ŷ(x) is a random variable

19
Good learning algorithm
  • A good learning algorithm should minimize the
    average (generalization) error over all learning
    sets
  • In regression, the usual error is the mean squared error, so we want to minimize (at each point x):
  • Err(x) = E_LS{ E_y|x{(y - ŷ_LS(x))²} }
  • There exists a useful additive decomposition of this error into three (positive) terms

20
Bias/variance decomposition (1)
[Figure: distribution of y at point x, centered at its conditional mean E_y|x{y}, with spread var_y|x{y}.]
  • Err(x) = E_y|x{(y - E_y|x{y})²} + ...
  • E_y|x{y} = arg min_ŷ E_y|x{(y - ŷ)²} = Bayes model
  • var_y|x{y} = residual error = minimal error

21
Bias/variance decomposition (2)
[Figure: the squared distance between the Bayes model E_y|x{y} and the average model E_LS{ŷ(x)} at point x.]
  • Err(x) = var_y|x{y} + (E_y|x{y} - E_LS{ŷ(x)})² + ...
  • E_LS{ŷ(x)} = average model (w.r.t. LS)
  • bias²(x) = error between Bayes and average model

22
Bias/variance decomposition (3)
[Figure: spread of the learned predictions ŷ(x) around the average model.]
  • Err(x) = var_y|x{y} + bias²(x) + E_LS{(ŷ(x) - E_LS{ŷ(x)})²}
  • var_LS{ŷ(x)} = variance

23
Bias/variance decomposition (4)
[Figure: the residual error var_y|x{y} and the variance var_LS{ŷ(x)} shown together at point x.]
  • Local error decomposition:
  • Err(x) = var_y|x{y} + bias²(x) + var_LS{ŷ(x)}
  • Global error decomposition (take the average w.r.t. p(x)):
  • E_X{Err(x)} = E_X{var_y|x{y}} + E_X{bias²(x)} + E_X{var_LS{ŷ(x)}}

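For reference, the local decomposition built up over the last three slides can be restated compactly in LaTeX (ŷ_LS denotes the model learned from a random LS; this only rewrites the slides' formulas):

```latex
\mathrm{Err}(x)
  = \underbrace{\operatorname{var}_{y|x}\{y\}}_{\text{residual error}}
  + \underbrace{\bigl(E_{y|x}\{y\} - E_{LS}\{\hat{y}_{LS}(x)\}\bigr)^{2}}_{\operatorname{bias}^{2}(x)}
  + \underbrace{E_{LS}\bigl\{\bigl(\hat{y}_{LS}(x) - E_{LS}\{\hat{y}_{LS}(x)\}\bigr)^{2}\bigr\}}_{\operatorname{var}_{LS}\{\hat{y}(x)\}}
```
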
24
Illustration (1)
  • Problem definition
  • One input x, a uniform random variable in [0,1]
  • y = h(x) + e, where e ~ N(0,1)

[Figure: the goal function h(x) = E_y|x{y} plotted against x.]
25
Illustration (2)
  • Small variance, high bias method

26
Illustration (3)
  • Small bias, high variance method

27
Illustration (Methods comparison)
  • Artificial problem with 10 inputs, all uniform random variables in [0,1]
  • The true function depends only on 5 inputs:
  • y(x) = 10·sin(π·x1·x2) + 20·(x3 - 0.5)² + 10·x4 + 5·x5 + e,
  • where e is a N(0,1) random variable
  • Experimentation (a sketch follows below):
  • E_LS ≈ average over 50 learning sets of size 500
  • E_x,y ≈ average over 2000 cases
  • Estimate variance and bias (+ residual error)

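A sketch of this protocol: the goal function below is the one stated on the slide; the random seeds, the use of k-NN as the illustrated learner, and the scikit-learn dependency are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor  # example learner (assumption)

rng = np.random.default_rng(0)

def sample(n):
    """y = 10·sin(pi·x1·x2) + 20·(x3 - 0.5)^2 + 10·x4 + 5·x5 + N(0,1) noise."""
    X = rng.uniform(0.0, 1.0, size=(n, 10))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3] + 5.0 * X[:, 4] + rng.normal(0.0, 1.0, size=n))
    return X, y

X_test, y_test = sample(2000)                 # E_x,y: average over 2000 cases
preds = []
for _ in range(50):                           # E_LS: 50 learning sets of size 500
    X_ls, y_ls = sample(500)
    preds.append(KNeighborsRegressor(n_neighbors=10).fit(X_ls, y_ls).predict(X_test))
preds = np.array(preds)                       # shape (50, 2000)

variance = preds.var(axis=0).mean()                             # E_X{var_LS}
bias2_plus_noise = ((y_test - preds.mean(axis=0)) ** 2).mean()  # bias^2 + residual
err = ((y_test - preds) ** 2).mean()                            # ~ bias2_plus_noise + variance
print(err, bias2_plus_noise, variance)
```
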
28
Illustration (Linear regression)
Method          Err²    Bias² + Noise   Variance
Linear regr.    7.0     6.8             0.2
k-NN (k=1)      15.4    5.0             10.4
k-NN (k=10)     8.5     7.2             1.3
MLP (10)        2.0     1.2             0.8
MLP (10, 10)    4.6     1.4             3.2
Regr. Tree      10.2    3.5             6.7
  • Very few parameters → small variance
  • Goal function is not linear → high bias

29
Illustration (k-Nearest Neighbors)
Method          Err²    Bias² + Noise   Variance
Linear regr.    7.0     6.8             0.2
k-NN (k=1)      15.4    5.0             10.4
k-NN (k=10)     8.5     7.2             1.3
MLP (10)        2.0     1.2             0.8
MLP (10, 10)    4.6     1.4             3.2
Regr. Tree      10.2    3.5             6.7
  • Small k → high variance and moderate bias
  • High k → smaller variance but higher bias

30
Illustration (Multilayer Perceptrons)
Method          Err²    Bias² + Noise   Variance
Linear regr.    7.0     6.8             0.2
k-NN (k=1)      15.4    5.0             10.4
k-NN (k=10)     8.5     7.2             1.3
MLP (10)        2.0     1.2             0.8
MLP (10, 10)    4.6     1.4             3.2
Regr. Tree      10.2    3.5             6.7
  • Small bias
  • Variance increases with the model complexity

31
Illustration (Regression trees)
Method          Err²    Bias² + Noise   Variance
Linear regr.    7.0     6.8             0.2
k-NN (k=1)      15.4    5.0             10.4
k-NN (k=10)     8.5     7.2             1.3
MLP (10)        2.0     1.2             0.8
MLP (10, 10)    4.6     1.4             3.2
Regr. Tree      10.2    3.5             6.7
  • Small bias, a (complex enough) tree can
    approximate any non linear function
  • High variance (see later)

32
Variance reduction techniques
  • In the context of a given method
  • Adapt the learning algorithm to find the best
    trade-off between bias and variance.
  • Not a panacea but the least we can do.
  • Examples: pruning, weight decay.
  • Wrapper techniques
  • Change the bias/variance trade-off.
  • Universal but destroys some features of the
    initial method.
  • Example: bagging.

33
Variance reduction 1 model (1)
  • General idea: reduce the ability of the learning algorithm to over-fit the LS
  • Pruning
  • reduces the model complexity explicitly
  • Early stopping
  • reduces the amount of search
  • Regularization
  • reduces the size of the hypothesis space

34
Variance reduction 1 model (2)
[Figure: E = bias² + var as a function of the degree of fitting; bias² decreases and var increases with fitting, and the optimal fitting level minimizes E.]
  • Bias² ≈ error on the learning set, E ≈ error on an independent test set
  • Selection of the optimal level of tuning:
  • a priori (not optimal)
  • by cross-validation (less efficient)

35
Variance reduction 1 model (3)
  • Examples
  • Post-pruning of regression trees
  • Early stopping of MLP by cross-validation

Method                    E      Bias   Variance
Full regr. tree (488)     10.2   3.5    6.7
Pruned regr. tree (93)    9.1    4.3    4.8
Fully learned MLP         4.6    1.4    3.2
Early-stopped MLP         3.8    1.5    2.3
  • As expected, reduces variance and increases bias

36
Variance reduction bagging (1)
  • Idea: the average model E_LS{ŷ(x)} has the same bias as the original method but zero variance
  • Bagging (Bootstrap AGGregatING), sketched below:
  • To compute E_LS{ŷ(x)}, we should draw an infinite number of LS (of size N)
  • Since we have only one single LS, we simulate sampling from nature by bootstrap sampling from the given LS
  • Bootstrap sampling: sampling with replacement of N objects from LS (N is the size of LS)

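A minimal bagging sketch for regression trees; the scikit-learn base trees and the helper names are assumptions, not part of the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # base learner (assumption)

def bagging_fit(X, y, n_models=25, seed=0):
    """Grow one full tree per bootstrap sample (size N, drawn with replacement from LS)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    return [DecisionTreeRegressor().fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_models))]

def bagging_predict(models, X):
    """Average the individual predictions: an estimate of E_LS{ŷ(x)}."""
    return np.mean([m.predict(X) for m in models], axis=0)
```
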
37
Variance reduction bagging (2)
[Figure: bagging scheme: several bootstrap samples are drawn from LS, a model is grown on each, and their predictions are averaged.]
38
Variance reduction bagging (3)
  • Application to regression trees

Method               E      Bias   Variance
3-test regr. tree    14.8   11.1   3.7
Bagged               11.7   10.7   1.0
Full regr. tree      10.2   3.5    6.7
Bagged               5.3    3.8    1.5
  • Strong variance reduction without increasing bias
    (although the model is much more complex than a
    single tree)

39
Dual bagging (1)
  • Instead of perturbing learning sets to obtain several predictions, directly perturb the test case at the prediction stage (see the sketch below)
  • Given a model ŷ(·) and a test case x:
  • Form k attribute vectors by adding Gaussian noise to x: x+e1, x+e2, ..., x+ek
  • Average the predictions of the model at these points to get the prediction at point x:
  • 1/k · (ŷ(x+e1) + ŷ(x+e2) + ... + ŷ(x+ek))
  • Noise level (variance of the Gaussian noise) selected by cross-validation

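A sketch of the prediction stage of dual bagging; the model can be any fitted tree (or other regressor), and the parameter names are illustrative. The noise level is the quantity selected by cross-validation above (passed here as a standard deviation).

```python
import numpy as np

def dual_bagging_predict(model, x, noise_level, k=100, seed=0):
    """Perturb the test case x with Gaussian noise k times and average the predictions."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    X_noisy = x + rng.normal(0.0, noise_level, size=(k, x.size))  # x+e1, ..., x+ek
    return model.predict(X_noisy).mean()                          # 1/k · sum of ŷ(x+ei)
```
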
40
Dual bagging (2)
  • With regression trees

Noise level   E      Bias   Variance
0.0           10.2   3.5    6.7
0.2           6.3    3.5    2.8
0.5           5.3    4.4    0.9
2.0           13.3   13.1   0.2
  • Smooths the function ŷ(·)
  • Too much noise increases bias:
  • there is a (new) trade-off between bias and variance

41
Dual bagging (classification trees)
Noise level 1.5 → error 4.6
Noise level 0.3 → error 1.4
Noise level 0 → error 3.7
42
Variance in tree induction
  • Tree induction is among the ML methods of highest
    variance
  • (together with 1-NN)
  • Main reason
  • Generalization is local
  • Depends on small parts of the learning set
  • Sources of variance
  • Discretization of numerical attributes (60%)
  • The selected thresholds have a high variance
  • Structure choice (10%)
  • Sometimes, attribute scores are very close
  • Estimation at leaf nodes (30%)
  • Because of the recursive partitioning, prediction at leaf nodes is based on very small samples of objects
  • Consequences
  • Questionable interpretability and higher error
    rates

43
Threshold variance (1)
  • Tests on numerical attributes: a(o) < a_th
  • Discretization: find a_th which optimizes the score (see the sketch below)
  • Classification: maximize information
  • Regression: minimize residual variance

[Figure: score plotted as a function of the attribute value a(o); the selected threshold a_th lies at the optimum of the curve.]
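
A sketch of the threshold search in the classification case (maximize information, i.e. entropy reduction); the regression case replaces the entropy by the residual variance. Names are illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a sample of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_threshold(a, y):
    """Scan candidate thresholds on attribute a(o); keep a_th maximizing information."""
    order = np.argsort(a)
    a_s, y_s = a[order], y[order]
    best_thr, best_gain, n = None, 0.0, len(y_s)
    h_total = entropy(y_s)
    for i in range(1, n):
        if a_s[i] == a_s[i - 1]:
            continue
        thr = 0.5 * (a_s[i] + a_s[i - 1])
        gain = h_total - (i / n) * entropy(y_s[:i]) - ((n - i) / n) * entropy(y_s[i:])
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr
```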
44
Threshold variance (2)
45
Threshold variance (3)
46
Tree variance
  • DT/RT are among the machine learning methods
    which present the highest variance

Method                  E      Bias   Variance
RT, no test             25.5   25.4   0.1
RT, 1 test              19.0   17.7   1.3
RT, 3 tests             14.8   11.1   3.7
RT, full (250 tests)    10.2   3.5    6.7
47
DT variance reduction
  • Pruning
  • Necessary to select the right complexity
  • Decreases variance but increases bias → small effect on accuracy
  • Threshold stabilization
  • Smoothing of score curves, bootstrap sampling
  • Reduces parameter variance but has only a slight
    effect on accuracy and prediction variance
  • Bagging
  • Very efficient at reducing variance
  • But jeopardizes interpretability of trees and
    computational efficiency
  • Dual bagging
  • In terms of variance reduction, similar to
    bagging
  • Much faster and can be simulated by soft trees
  • Fuzzy tree induction
  • Build soft trees in a full-fledged approach

48
Dual tree bagging → Soft trees
  • Reformulation of dual bagging as an explicit soft
    tree propagation algorithm
  • Algorithms
  • Forward-backward propagation in soft trees
  • Softening of thresholds during learning stage
  • Some results

49
Dual bagging → soft thresholds
  • x + e < x_th → sometimes left, sometimes right
  • Multiple crisp propagations can be replaced by one soft propagation
  • E.g. if e has a uniform pdf in [a_th - λ/2, a_th + λ/2], then the probability of right propagation is as follows (see the figure and the sketch below)

[Figure: probability of right propagation versus attribute value; it rises linearly from 0 to 1 over a transition region of width λ around a_th, splitting the sample into TS_left and TS_right.]
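
The uniform-noise case gives a piecewise-linear propagation probability; a direct sketch (function and variable names are illustrative):

```python
import numpy as np

def p_right(a_value, a_th, lam):
    """Probability of right propagation for the test a(o) < a_th when the attribute
    value is blurred by uniform noise of width lam: it ramps linearly from 0 to 1
    over the transition region [a_th - lam/2, a_th + lam/2]."""
    if lam == 0.0:                                   # crisp test
        return float(a_value >= a_th)
    return float(np.clip((a_value - (a_th - lam / 2.0)) / lam, 0.0, 1.0))
```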
50
Forward-backward algorithm
Top-down propagation of probabilities:
  • P(Root|x) = 1
  • P(N1|x) = P(Test1|x) · P(Root|x)
  • P(L3|x) = P(¬Test1|x) · P(Root|x)
  • P(L1|x) = P(Test2|x) · P(N1|x)
  • P(L2|x) = P(¬Test2|x) · P(N1|x)
Bottom-up aggregation of predictions, weighted by the leaf probabilities P(Li|x).
[Figure: example soft tree with root test Test1, internal node N1 carrying test Test2, and leaves L1, L2, L3. A recursive sketch follows below.]
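
A sketch of the forward-backward computation as a single recursion over a nested-dictionary soft tree; the structure, names, and the placeholder discriminators are illustrative (in practice `p_test` would be the soft discriminator of the node, e.g. the ramp or sigmoid sketched earlier).

```python
def soft_tree_predict(node, x, p_reach=1.0):
    """Forward: propagate P(node|x) top-down; backward: sum leaf labels weighted by P(leaf|x)."""
    if node["leaf"]:
        return p_reach * node["label"]
    p = node["p_test"](x)                       # probability of the 'yes' branch given x
    return (soft_tree_predict(node["yes"], x, p_reach * p)
            + soft_tree_predict(node["no"], x, p_reach * (1.0 - p)))

# Tiny example mirroring the figure (the probabilities here are placeholders):
tree = {"leaf": False, "p_test": lambda x: 0.8,            # Test1 at the root
        "yes": {"leaf": False, "p_test": lambda x: 0.5,    # Test2 at node N1
                "yes": {"leaf": True, "label": 1.0},       # L1
                "no":  {"leaf": True, "label": 2.0}},      # L2
        "no":  {"leaf": True, "label": 3.0}}               # L3
print(soft_tree_predict(tree, x=None))  # 0.8*(0.5*1.0 + 0.5*2.0) + 0.2*3.0 = 1.8
```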
51
Learning of λ values
  • Use of an independent validation set and
    bisection search
  • One single value can be learned very efficiently
    (amounts to 10 full tests of a DT/RT on the
    validation set)
  • Combination of several values can also be learned
    with the risk of overfitting
  • (see fuzzy tree induction, in what follows)

52
Some results with dual bagging
53
Fuzzy tree induction
  • General ideas
  • Learning Algorithm
  • Growing
  • Refitting
  • Pruning
  • Backfitting

54
General Ideas
  • Obviously, soft trees have much lower variance
    than crisp trees
  • In the "Dual Bagging" approach, attribute selection is carried out in a classical way, then tests are softened in a post-processing stage
  • Might be more effective to combine the two
    methods
  • Fuzzy tree induction

55
Soft trees
  • Samples are handled as fuzzy subsets
  • Each observation belongs to such a FS with a
    certain membership degree
  • SCORE measure is modified
  • Objects are weighted by their membership degree
  • Output y
  • Denotes the membership degree to a class
  • Goal of Fuzzy tree induction
  • Provide a smooth model of y as a function of
    the input variables

56
Fuzzy discretization
  • Same as fuzzification
  • Carried out locally, at the tree growing stage
  • At each test node
  • On the basis of local fuzzy sub-training set
  • Select the attribute, together with the discriminator, so as to maximize the local SCORE
  • Split in a soft way and proceed recursively
  • Criteria for SCORE
  • Minimal residual variance
  • Maximal (fuzzy) information quantity
  • Etc

57
Attaching labels to leaves
  • Basically, for each terminal node, we need to
    determine a local estimate yi of y
  • During intermediate steps
  • Use average of y in local sub-learning set
  • Direct computation
  • Refitting of the labels
  • Once the full tree has been grown and at each
    step of pruning
  • Determine all values simultaneously
  • To minimize Square Error
  • Amounts to a linear least squares problem
  • Direct solution

58
Refitting (Explanation)
  • A leaf corresponds to a basis function mi(x)
  • Product of the discriminators encountered on the path from the root
  • Tree prediction is equivalent to a weighted average of these basis functions:
  • ŷ(x) = y1·m1(x) + y2·m2(x) + ... + yk·mk(x)
  • the weights yi are the labels attached to the terminal nodes
  • Refitting amounts to tuning the yi parameters to minimize the square error on the training set (a least-squares sketch follows below)

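A sketch of the refitting step: collect the basis-function values mi(x) of the training cases in a matrix M (one column per leaf) and solve the linear least-squares problem directly; names are illustrative.

```python
import numpy as np

def refit_leaf_labels(M, y):
    """M[n, i] = m_i(x_n), the membership of training case n in leaf i.

    Returns the leaf labels (y_1, ..., y_k) minimizing ||M @ labels - y||^2.
    """
    labels, *_ = np.linalg.lstsq(M, y, rcond=None)
    return labels
```
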
59
Tree growing and pruning
  • Grow tree
  • Refit leaf labels
  • Prune the tree, refitting the leaf labels at each stage
  • Test sequence of pruned trees on validation set
  • Select best pruning level

60
Backfitting (1)
  • After growing and pruning, the fuzzy tree
    structure has been determined
  • Leaf labels are globally optimal, but not the
    parameters of the discriminators (tuned locally)
  • Resulting model has 2 parameters per test node,
    and 1 parameter per terminal node
  • The output (and hence the mean square error) of the fuzzy tree is a smooth function of these parameters
  • The parameters can be optimized using a standard (nonlinear) least-squares technique, e.g. Levenberg-Marquardt

61
Backfitting (2)
  • How to compute the derivatives needed by the nonlinear optimization technique?
  • Use a modified version of backpropagation to compute derivatives with respect to the parameters
  • Yields an efficient algorithm (linear in the size of the tree)
  • Backfitting starts from the tree produced after growing and pruning
  • Already a good approximation of a local optimum
  • Only a small number of iterations is necessary to backfit
  • Backfitting may also lead to overfitting

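A minimal backfitting sketch for a one-test soft tree, using scipy's Levenberg-Marquardt solver; the sigmoid discriminator, the parameter names, and the toy data are assumptions, and the solver's own derivative handling stands in for the backpropagation-style computation described above.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(params, x, y):
    """One-test soft tree with parameters (a, b, y_left, y_right)."""
    a, b, y_left, y_right = params
    p_right = 1.0 / (1.0 + np.exp(-(x - a) / b))
    return (1.0 - p_right) * y_left + p_right * y_right - y

# Toy data and a rough initial tree (as would be obtained from growing/pruning):
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = np.where(x < 0.6, 1.0, 3.0) + rng.normal(0.0, 0.1, 200)
fit = least_squares(residuals, x0=[0.5, 0.1, 0.8, 2.5], args=(x, y), method="lm")
print(fit.x)  # backfitted (a, b, y_left, y_right)
```
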
62
Summary and conclusions
  • Variance is the number one problem in decision/regression tree induction
  • It is possible to reduce variance significantly
  • Bagging and/or tree softening
  • Soft trees have the advantage of preserving
    interpretability and computational efficiency
  • Two approaches have been presented to get soft
    trees
  • Dual bagging
  • Generic approach
  • Fast and simple
  • Best approach for very large databases
  • Fuzzy tree induction
  • Similar to ANN type of model, but (more)
    interpretable
  • Best approach for small learning sets (probably)

63
Some references for further reading
  • Variance evaluation/reduction, bagging
  • Contact: Pierre GEURTS (PhD student), geurts@montefiore.ulg.ac.be
  • Papers
  • Discretization of continuous attributes for
    supervised learning - Variance evaluation and
    variance reduction. (Invited)
  • L. Wehenkel. Proc. of IFSA'97, International
    Fuzzy Systems Association World Congress, Prague,
    June 1997, pp. 381--388.
  • Investigation and Reduction of Discretization
    Variance in Decision Tree Induction.
  • Pierre GEURTS and Louis WEHENKEL, Proc. of
    ECML2000
  • Some Enhancements of Decision Tree Bagging.
  • Pierre GEURTS, Proc. of PKDD2000
  • Dual Perturb and Combine Algorithm.
  • Pierre GEURTS, Proc. of AI and Statistics 2001.

64
See also www.montefiore.ulg.ac.be/services/stochastic/
  • Fuzzy/soft tree induction
  • Contact: Cristina OLARU (PhD student), olaru@montefiore.ulg.ac.be
  • Papers
  • Automatic induction of fuzzy decision trees and
    its application to power system security
    assessment.
  • X. Boyen, L. Wehenkel, Int. Journal on Fuzzy
    Sets and Systems, Vol. 102, No 1, pp. 3-19, 1999.
  • On neurofuzzy and fuzzy decision trees
    approaches.
  • C. Olaru, L. Wehenkel. (Invited) Proc. of
    IPMU'98, 7th Int. Congr. on Information
    Processing and Management of Uncertainty in
    Knowledge based Systems, 1998.