Title: Regression Tree Ensembles
1 Regression Tree Ensembles
2 Problem Formulation
- Training data set of N data points (x_i, y_i), i = 1,…,N.
- x is the vector of predictor variables (a P-dimensional vector); the x_i can be fixed design points or independently sampled from the same distribution.
- y is a numeric response variable.
- Problem: estimate the regression function E(y|x) = F(x)
- F can be a very complex function.
3 Ensembles of Models
- Before the 1990s: a multitude of techniques was developed to tackle regression problems
- 1990s: new idea - use a collection of basic models (an ensemble)
- Substantial improvements in accuracy compared with any single basic model
- Examples: Bagging, Boosting, Random Forests
4 Key Ingredients of Ensembles
- Type of basic model used in the ensemble (RT, K-NN, NN)
- The way basic models are built (data sub-sampling schemes, injection of randomness)
- The way basic models are combined
- Possible postprocessing (tuning) of the resulting ensemble (optional)
5 Random Forests (RF)
- Developed by Leo Breiman, Department of Statistics, University of California, Berkeley, in the late 1990s
- RF is resistant to overfitting
- RF is capable of handling a large number of predictors
6 Key Features of RF
- A Randomised Regression Tree is the basic model
- Each tree is grown on a bootstrap sample
- The ensemble (forest) is formed by averaging the predictions from the individual trees
7 Regression Trees
- Performs recursive binary division of the data: start with the Root node (all points) and split it into 2 parts (Left node and Right node)
- The split attempts to separate data points with high y_i's from data points with low y_i's as much as possible
- The split is based on a single predictor and a split point
- To find the best splitter, all possible splits and split points are tried (a sketch of this search follows this list)
- Splitting is repeated for the children
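The exhaustive search can be written down directly. The sketch below is plain base R and a simplified illustration (not the CART/rpart implementation): it scores every (predictor, split point) pair by the reduction in the sum of squared errors around the node means and keeps the best one.

  best_split <- function(X, y) {
    # X: data frame of predictors, y: numeric response
    sse <- function(v) sum((v - mean(v))^2)
    best <- list(improve = -Inf)
    for (j in seq_len(ncol(X))) {
      for (s in sort(unique(X[[j]]))[-1]) {          # candidate split points
        left  <- y[X[[j]] <  s]
        right <- y[X[[j]] >= s]
        improve <- sse(y) - sse(left) - sse(right)   # reduction in within-node SSE
        if (improve > best$improve)
          best <- list(var = names(X)[j], point = s, improve = improve)
      }
    }
    best
  }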
8 (No Transcript)
9 RT Competitor List
Primary splits:
  x2  < -110.6631  to the right, improve=734.0907, (0 missing)
  x6  <  107.5704  to the left,  improve=728.0376, (0 missing)
  x51 <  101.4707  to the left,  improve=720.1280, (0 missing)
  x30 < -113.879   to the right, improve=716.6580, (0 missing)
  x67 < -93.76226  to the right, improve=715.6400, (0 missing)
  x78 <  93.27373  to the left,  improve=715.6400, (0 missing)
  x62 <  93.99937  to the left,  improve=715.6400, (0 missing)
  x44 <  96.059    to the left,  improve=715.6400, (0 missing)
  x25 < -85.65475  to the right, improve=685.0943, (0 missing)
  x21 < -118.4764  to the right, improve=685.0736, (0 missing)
  x82 <  119.6532  to the left,  improve=685.0736, (0 missing)
  x79 < -81.00349  to the right, improve=675.7913, (0 missing)
  x18 < -70.78995  to the right, improve=663.0757, (0 missing)
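A listing of this kind is what summary() prints for an rpart regression tree. The snippet below is a hypothetical illustration only: the slides do not give the underlying data, so a toy data frame with predictors x1..x100 and an assumed signal is simulated.

  library(rpart)
  set.seed(1)
  dat <- as.data.frame(matrix(rnorm(500 * 100, sd = 100), ncol = 100))
  names(dat) <- paste0("x", 1:100)
  dat$y <- 0.01 * dat$x2 - 0.01 * dat$x6 + rnorm(500)   # assumed toy signal
  fit <- rpart(y ~ ., data = dat, method = "anova")
  summary(fit)   # prints the "Primary splits" (competitor) listing for each node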
10 (No Transcript)
11 Predictions from a Tree model
- A prediction from a tree is obtained by dropping x down the tree until it reaches a terminal node
- The predicted value is the average of the response values of the training data points in that terminal node
- Example: if x1 > -110.66 and x82 > 118.65, then Prediction = 0.61
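Continuing with the hypothetical rpart fit sketched earlier, predict() does exactly what this slide describes: it drops each new x down the tree and returns the mean response of the terminal node it lands in.

  new_x <- dat[1, names(dat) != "y", drop = FALSE]   # any new predictor vector
  predict(fit, newdata = new_x)                      # terminal-node average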
12 Pruning of CART trees
- Prediction Error (PE) = Variance + Bias²
- PE vs Tree Size has a U-shape: very large and very small trees are bad
- Trees are grown until terminal nodes become small...
- ... and then pruned back
- Use holdout data to estimate the PE of the trees
- Select the tree that has the smallest PE
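In rpart the grow-then-prune idea is driven by the complexity parameter cp: the tree is grown deep, an error estimate for each subtree size is stored in the cptable, and prune() cuts the tree back. This is a hedged sketch, reusing the hypothetical dat from earlier; rpart estimates PE by cross-validation rather than a separate holdout set, but the principle is the same.

  library(rpart)
  big_tree <- rpart(y ~ ., data = dat, method = "anova",
                    control = rpart.control(cp = 0, minsplit = 5))   # grow deep
  best_cp  <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
  pruned   <- prune(big_tree, cp = best_cp)                          # prune back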
13 Randomised Regression Trees I
- Each tree is grown on a bootstrap sample: N data points are sampled with replacement
- Each such sample contains about 63% of the original data points; some records occur multiple times
- Each tree is built on its own bootstrap sample, so the trees are likely to be different
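A quick check of the "about 63%" figure: the expected fraction of distinct original records in a bootstrap sample is 1 - (1 - 1/N)^N, which tends to 1 - 1/e ≈ 0.632.

  N <- 10000
  boot <- sample(N, replace = TRUE)    # one bootstrap sample of row indices
  length(unique(boot)) / N             # roughly 0.63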
14 Randomised Regression Trees II
- At each split, only M randomly selected predictors are allowed to compete as potential splitters, e.g. 10 out of 100 (see the sketch after this list)
- A new group of eligible splitters is selected at random at each step
- At each step the splitter selected is likely to be somewhat suboptimal
- Every predictor gets a chance to compete as a splitter, so important predictors are very likely to be eventually used as splitters
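In the randomForest package this M is the mtry argument. A hedged sketch, reusing the hypothetical dat from earlier (the parameter values are illustrative, not the ones used in the talk):

  library(randomForest)
  set.seed(3)
  rf <- randomForest(y ~ ., data = dat, ntree = 500,
                     mtry = 10,       # M = 10 predictors compete at each split
                     nodesize = 5)    # minimum size of terminal nodes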
15 Competitor List for Randomised RT
Primary splits:
  x6  <  107.5704  to the left,  improve=728.0376, (0 missing)
  x78 <  93.27373  to the left,  improve=715.6400, (0 missing)
  x62 <  93.99937  to the left,  improve=715.6400, (0 missing)
  x79 < -81.00349  to the right, improve=675.7913, (0 missing)
  x80 <  63.85983  to the left,  improve=654.7728, (0 missing)
  x24 <  59.5085   to the left,  improve=648.3837, (0 missing)
  x90 < -59.35043  to the right, improve=646.8825, (0 missing)
  x75 < -52.43783  to the right, improve=639.5996, (0 missing)
  x68 <  50.18278  to the left,  improve=631.1139, (0 missing)
  Y   < -33.42134  to the right, improve=606.9931, (0 missing)
  x34 <  132.8378  to the left,  improve=555.2047, (0 missing)
16 Randomised Regression Trees III
- M = 1: the splitter is selected at random, but not the split point
- M = P: the original deterministic CART algorithm
- Trees are deliberately not pruned
17 Combining the Trees
- Each tree represents a regression model which fits the training data very closely: a low bias, high variance model
- The idea behind RF: take predictions from a large number of highly variable trees and average them (see the sketch below)
- The result is a low bias, low variance model
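The averaging can be seen directly with the hypothetical rf fit above: predict() can return every individual tree's prediction, and the forest prediction is simply their mean.

  pred <- predict(rf, newdata = dat[1:3, ], predict.all = TRUE)
  pred$aggregate               # forest predictions (average over the trees)
  rowMeans(pred$individual)    # the same values computed by hand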
18 (No Transcript)
19 Correlation Vs Strength
- Another decomposition for the PE of Random Forests: PE(RF) ≤ ρ(BT) × PE(Tree)
- ρ(BT): the correlation between any 2 trees in the forest
- PE(Tree): prediction error (strength) of a single tree
- M = 1: low correlation, low strength
- M = P: high correlation, high strength
20 RF as K-NN regression model I
- RF induces a proximity measure in the predictor space: P(x1, x2) = proportion of trees where x1 and x2 landed in the same terminal node
- Prediction at point x (see the sketch after this list)
- Only a fraction of the data points actually contributes to the prediction
- Strongly resembles the formula used for K-NN predictions
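The formula on the original slide is an image and is not transcribed. As a hedged sketch of the form the bullets describe, the prediction at x is a proximity-weighted average of the training responses, which is exactly the K-NN shape:

  rf_knn_predict <- function(prox, y) {
    # prox: proximities P(x, x_i) between the query point x and each training
    #       point (proportion of trees sharing a terminal node); y: responses
    sum(prox * y) / sum(prox)    # weighted average; zero-proximity points drop out
  }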
21 RF as K-NN regression model II
- Lin, Y. and Jeon, Y. Random Forests and Adaptive Nearest Neighbours. Technical Report 1055, Department of Statistics, University of Wisconsin, 2002.
- Breiman, L. Consistency for a Simple Model of Random Forests. Technical Report 670, Statistics Department, University of California at Berkeley, 2004.
22 It was shown that
- Randomisation does reduce the variance component
- The optimal M is independent of the sample size
- RF does behave as an Adaptive K-Nearest Neighbour model: the shape and size of the neighbourhood are adapted to the local behaviour of the target regression function
23 Case Study: Postcode Ranking in Motor Insurance
- 575 postcodes in NSW
- For each postcode: the number of claims as well as the exposure (the number of policies in the postcode)
- Problem: ranking of postcodes for pricing purposes
24 Approach
- Each postcode is represented by the (x, y) coordinates of its centroid
- Model the expected claim frequency as a function of (x, y)
- The target surface is likely to be highly irregular
- Add coordinates of the postcodes along 100 randomly generated directions to allow greater flexibility (a sketch follows this list)
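A hedged sketch of the "100 random directions" construction (the variable names and data layout are assumptions; the slides do not give them): each direction is a unit vector at a random angle, and the new predictor is the projection of the postcode centroid onto that direction.

  set.seed(2)
  centroids <- data.frame(cx = runif(575), cy = runif(575))   # toy centroids
  angles <- runif(100, 0, pi)                                 # 100 random directions
  proj <- sapply(angles, function(a) centroids$cx * cos(a) + centroids$cy * sin(a))
  colnames(proj) <- paste0("dir", 1:100)   # extra predictors appended for the RF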
25 Tuning RF: M
26 Tuning RF: Size of Node
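The two slides above show tuning plots that are not reproduced here. A hedged sketch of the kind of grid search they suggest, varying mtry (M) and nodesize and comparing out-of-bag MSE (the use of OOB error is an assumption; the talk's plots may have used a holdout set), reusing the hypothetical dat:

  library(randomForest)
  grid <- expand.grid(mtry = c(2, 5, 10, 20, 50), nodesize = c(1, 5, 10, 25))
  grid$oob_mse <- apply(grid, 1, function(g) {
    fit <- randomForest(y ~ ., data = dat, ntree = 300,
                        mtry = g["mtry"], nodesize = g["nodesize"])
    tail(fit$mse, 1)                       # OOB MSE after the last tree
  })
  grid[which.min(grid$oob_mse), ]          # best (mtry, nodesize) combination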
27 (No Transcript)
28 (No Transcript)
29 (No Transcript)
30 (No Transcript)
31 (No Transcript)
32 Things not covered
- Clustering, missing value imputation and outlier detection
- Identification of important variables
- OOB testing of RF models
- Postprocessing of RF models