Regression Tree Ensembles - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Regression Tree Ensembles

1
Regression Tree Ensembles
  • Sergey Bakin

2
Problem Formulation
  • Training data set of N data points (xi, yi),
    i = 1, ..., N.
  • x are predictor variables (a P-dimensional vector);
    they can be fixed design points or independently
    sampled from the same distribution.
  • y is a numeric response variable.
  • Problem: estimate the regression function
    E(y | x) = F(x), which can be a very complex function.

3
Ensembles of Models
  • Pre-1990s: a multitude of techniques developed to
    tackle regression problems
  • 1990s: new idea - use a collection of basic
    models (an ensemble)
  • Substantial improvements in accuracy compared
    with any single basic model
  • Examples: Bagging, Boosting, Random Forests

4
Key Ingredients of Ensembles
  • Type of basic model used in the ensemble (e.g.
    regression trees, K-NN, neural networks)
  • The way basic models are built (data sub-sampling
    schemes, injection of randomness)
  • The way basic models are combined
  • Optional postprocessing (tuning) of the resulting
    ensemble

5
Random Forests (RF)
  • Developed by Leo Breiman, Department of
    Statistics, University of California, Berkeley, in
    the late 1990s.
  • RF is resistant to overfitting
  • RF is capable of handling a large number of
    predictors

6
Key Features of RF
  • The Randomised Regression Tree is the basic model
  • Each tree is grown on a bootstrap sample
  • The Ensemble (Forest) is formed by averaging the
    predictions from the individual trees

7
Regression Trees
  • Perform recursive binary division of the data:
    start with the Root node (all points) and split it
    into two parts (Left Node and Right Node)
  • A split attempts to separate data points with high
    y values from data points with low y values as much
    as possible
  • A split is based on a single predictor and a split
    point
  • To find the best splitter, all possible splitters
    and split points are tried (see the sketch below)
  • Splitting is then repeated for the children.
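A minimal sketch of the split search for a single numeric predictor, assuming the usual sum-of-squares impurity; the function and variable names are illustrative, not CART's actual implementation:

```r
# Find the best split point for one numeric predictor x by minimising
# the combined within-node sum of squares of y.
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  # candidate split points: midpoints between consecutive distinct x values
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  sse  <- function(v) sum((v - mean(v))^2)
  score <- sapply(cuts, function(s) sse(y[x < s]) + sse(y[x >= s]))
  list(split_point = cuts[which.min(score)],
       improve     = sse(y) - min(score))  # reduction in SSE, cf. rpart's "improve"
}
```

In a full tree this search is run over every predictor at every node, and the node is split on the predictor/split-point pair with the largest improvement.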

8
(No Transcript)
9
RT Competitor List
Primary splits:
      x2  < -110.6631  to the right, improve=734.0907, (0 missing)
      x6  <  107.5704  to the left,  improve=728.0376, (0 missing)
      x51 <  101.4707  to the left,  improve=720.1280, (0 missing)
      x30 < -113.879   to the right, improve=716.6580, (0 missing)
      x67 <  -93.76226 to the right, improve=715.6400, (0 missing)
      x78 <   93.27373 to the left,  improve=715.6400, (0 missing)
      x62 <   93.99937 to the left,  improve=715.6400, (0 missing)
      x44 <   96.059   to the left,  improve=715.6400, (0 missing)
      x25 <  -85.65475 to the right, improve=685.0943, (0 missing)
      x21 < -118.4764  to the right, improve=685.0736, (0 missing)
      x82 <  119.6532  to the left,  improve=685.0736, (0 missing)
      x79 <  -81.00349 to the right, improve=675.7913, (0 missing)
      x18 <  -70.78995 to the right, improve=663.0757, (0 missing)
10
(No Transcript)
11
Predictions from a Tree model
  • A prediction from a tree is obtained by dropping
    x down the tree until it reaches a terminal node.
  • The predicted value is the average of the
    response values of the training data points in that
    terminal node.
  • Example: if x1 > -110.66 and x82 > 118.65 then
    Prediction = 0.61 (see the sketch below)
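A small sketch of "dropping x down the tree", assuming the tree is stored as a nested list; the node structure and field names are hypothetical, not the representation used by CART or rpart:

```r
# Internal node: list(var, point, left, right); terminal node: list(pred),
# where pred is the mean response of the training points in that node.
predict_tree <- function(node, x) {
  if (!is.null(node$pred)) return(node$pred)   # reached a terminal node
  if (x[[node$var]] < node$point) {
    predict_tree(node$left, x)                 # go left of the split point
  } else {
    predict_tree(node$right, x)                # go right of the split point
  }
}
```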

12
Pruning of CART trees
  • Prediction Error (PE) = Variance + Bias²
  • PE vs tree size has a U-shape: very large and very
    small trees are bad
  • Trees are grown until the terminal nodes become
    small...
  • ...and then pruned back
  • Use holdout data to estimate the PE of the trees
    (see the sketch below).
  • Select the tree that has the smallest PE.
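A sketch of this grow-then-prune procedure using R's rpart package; the data frame `train` and the formula are placeholders, and rpart's built-in cross-validation stands in for the holdout estimate of PE described above:

```r
library(rpart)

# Grow a deliberately large tree: no complexity penalty, small nodes allowed.
fit <- rpart(y ~ ., data = train,
             control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# Choose the complexity parameter with the smallest cross-validated error,
# then prune the large tree back to that size.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```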

13
Randomised Regression Trees I
  • Each tree is grown on a bootstrap sample: N data
    points are sampled with replacement
  • Each such sample contains about 63% of the original
    data points - some records occur multiple times
  • Each tree is built on its own bootstrap sample, so
    the trees are likely to be different (see the sketch
    below)
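A sketch of the bootstrap step, showing why each sample covers roughly 63% of the original records; N and the data set name are illustrative:

```r
N   <- 1000
idx <- sample(N, size = N, replace = TRUE)  # bootstrap sample of row indices
# boot_data <- train[idx, ]                 # the tree would be grown on this sample

# Fraction of distinct original records drawn: about 1 - exp(-1), i.e. roughly 0.63
length(unique(idx)) / N
```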

14
Randomised Regression Trees II
  • At each split, only M randomly selected
    predictors are allowed to compete as potential
    splitters, e.g. 10 out of 100.
  • A new group of eligible splitters is selected at
    random at each step.
  • At each step the splitter selected is likely to
    be somewhat suboptimal
  • Every predictor gets a chance to compete as a
    splitter, so important predictors are very likely to
    be eventually used as splitters (see the sketch below)
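The feature-subsampling step can be sketched as a small change to the split search, reusing the hypothetical best_split() from the earlier slide; names are again illustrative:

```r
# At a given node, score only M randomly chosen predictors (e.g. 10 out of 100)
# and split on the best of them; a fresh subset is drawn at every node.
choose_splitter <- function(X, y, M) {
  eligible <- sample(ncol(X), M)                 # random subset of predictor columns
  scores   <- lapply(eligible, function(j) best_split(X[, j], y))
  best     <- which.max(sapply(scores, `[[`, "improve"))
  c(list(var = eligible[best]), scores[[best]])  # chosen predictor, split point, improve
}
```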

15
Competitor List for Randomised RT
Primary splits:
      x6  <  107.5704  to the left,  improve=728.0376, (0 missing)
      x78 <   93.27373 to the left,  improve=715.6400, (0 missing)
      x62 <   93.99937 to the left,  improve=715.6400, (0 missing)
      x79 <  -81.00349 to the right, improve=675.7913, (0 missing)
      x80 <   63.85983 to the left,  improve=654.7728, (0 missing)
      x24 <   59.5085  to the left,  improve=648.3837, (0 missing)
      x90 <  -59.35043 to the right, improve=646.8825, (0 missing)
      x75 <  -52.43783 to the right, improve=639.5996, (0 missing)
      x68 <   50.18278 to the left,  improve=631.1139, (0 missing)
      Y   <  -33.42134 to the right, improve=606.9931, (0 missing)
      x34 <  132.8378  to the left,  improve=555.2047, (0 missing)
16
Randomised Regression Trees III
  • M = 1: the splitter is selected at random, but not
    the split point.
  • M = P: the original deterministic CART algorithm
  • The trees are deliberately not pruned.

17
Combining the Trees
  • Each tree represents a regression model which
    fits the training data very closely: a low-bias,
    high-variance model.
  • The idea behind RF: take predictions from a large
    number of highly variable trees and average them.
  • The result is a low-bias, low-variance model (see
    the sketch below)
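A sketch of the averaging step, assuming a list of fitted trees and the per-tree predict_tree() from the earlier slide; averaging leaves the (low) bias of an individual unpruned tree roughly unchanged while shrinking the variance:

```r
# Forest prediction = average of the individual tree predictions for a point x.
predict_forest <- function(trees, x) {
  mean(sapply(trees, function(tr) predict_tree(tr, x)))
}
```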

18
(No Transcript)
19
Correlation vs Strength
  • Another decomposition for the PE of Random Forests
    (illustrated by the simulation below):
    PE(RF) ≤ ρ(BT) · PE(Tree)
  • ρ(BT): correlation between any 2 trees in the
    forest
  • PE(Tree): prediction error (strength) of a single
    tree.
  • M = 1: low correlation, low strength
  • M = P: high correlation, high strength
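The role of the between-tree correlation can be illustrated with a small simulation: the variance of an average of B equally correlated predictions falls only to about ρσ², not to zero, which is the intuition behind the bound above. The numbers are illustrative:

```r
set.seed(1)
B <- 500; rho <- 0.3; sigma <- 1

# Simulate B "tree predictions" that share a common component, so that any
# two of them have correlation rho and each has variance sigma^2.
common <- rnorm(10000, sd = sqrt(rho) * sigma)
preds  <- sapply(1:B, function(b) common + rnorm(10000, sd = sqrt(1 - rho) * sigma))

var(rowMeans(preds))   # close to rho * sigma^2 = 0.3, however large B is
```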

20
RF as K-NN regression model I
  • RF induces a proximity measure in the predictor
    space: P(x1, x2) = proportion of trees where x1 and
    x2 landed in the same terminal node.
  • Prediction at point x: a weighted average of the
    training responses, with weights given by the
    proximities P(x, xi).
  • Only a fraction of the data points actually
    contributes to the prediction.
  • Strongly resembles the formula used for K-NN
    predictions (see the sketch below)
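A sketch of the proximity-weighted prediction; the matrix of terminal-node ids and its layout are assumptions made for illustration, not part of the original slides:

```r
# leaf_ids: N x ntree matrix, terminal-node id of each training point in each tree
# x_leaves: length-ntree vector, terminal-node id of the query point x in each tree
rf_knn_predict <- function(leaf_ids, x_leaves, y) {
  # proximity P(x, xi): proportion of trees where x and xi share a terminal node
  prox <- rowMeans(sweep(leaf_ids, 2, x_leaves, "=="))
  # weighted average of the training responses, as in a K-NN rule;
  # only points with non-zero proximity actually contribute
  sum(prox * y) / sum(prox)
}
```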

21
RF as K-NN regression model II
  • Lin, Y. and Jeon, Y. (2002). Random Forests and
    Adaptive Nearest Neighbours. Technical Report 1055,
    Department of Statistics, University of Wisconsin.
  • Breiman, L. (2004). Consistency for a Simple Model
    of Random Forests. Technical Report 670, Statistics
    Department, University of California at Berkeley.

22
It was shown that
  • Randomisation does reduce the variance component
  • The optimal M is independent of the sample size
  • RF does behave as an Adaptive K-Nearest Neighbour
    model: the shape and size of the neighbourhood are
    adapted to the local behaviour of the target
    regression function.

23
Case Study: Postcode Ranking in Motor Insurance
  • 575 postcodes in NSW
  • For each postcode: the number of claims as well as
    the exposure - the number of policies in the postcode
  • Problem: ranking of postcodes for pricing purposes

24
Approach
  • Each postcode is represented by the (x, y)
    coordinates of its centroid.
  • Model the expected claim frequency as a function of
    (x, y).
  • The target surface is likely to be highly
    irregular.
  • Add coordinates of the postcodes along 100 randomly
    generated directions to allow greater flexibility
    (see the sketch below).
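A sketch of the random-direction construction: each postcode centroid (x, y) is projected onto 100 random unit vectors, giving 100 extra derived predictors. The object `centroids` and the column names are placeholders:

```r
set.seed(42)
n_dirs <- 100
theta  <- runif(n_dirs, 0, pi)             # random direction angles in the plane
dirs   <- cbind(cos(theta), sin(theta))    # 100 unit direction vectors

# centroids: one row per postcode, columns x and y (centroid coordinates)
proj <- as.matrix(centroids[, c("x", "y")]) %*% t(dirs)
colnames(proj) <- paste0("dir", seq_len(n_dirs))

# cbind(centroids, proj) would then be used as the predictor set for the forest
```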

25
Tuning RF: M
26
Tuning RF: Size of Node
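These two tuning parameters correspond to mtry (M, the number of predictors tried at each split) and nodesize (minimum size of terminal nodes) in R's randomForest package. A sketch of a simple grid search over them, using out-of-bag error as the criterion (the data X, y and the grid values are illustrative; the slides' own tuning criterion is not shown in this transcript):

```r
library(randomForest)

grid <- expand.grid(mtry = c(2, 5, 10, 20), nodesize = c(1, 5, 10, 25))
grid$oob_mse <- apply(grid, 1, function(g) {
  fit <- randomForest(x = X, y = y, ntree = 500,
                      mtry = g["mtry"], nodesize = g["nodesize"])
  tail(fit$mse, 1)   # out-of-bag mean squared error after all trees
})

grid[which.min(grid$oob_mse), ]   # best (mtry, nodesize) combination
```
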
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
Things not covered
  • Clustering, missing value imputation and outlier
    detection
  • Identification of important variables
  • OOB testing of RF models
  • Postprocessing of RF models