Title: Regression Tree Ensembles
1 Regression Tree Ensembles
2 Problem Formulation
- Training data set of N data points (x_i, y_i), i = 1,…,N.
- x is the vector of predictor variables (a P-dimensional vector); the x_i can be fixed design points or independently sampled from the same distribution.
- y is a numeric response variable.
- Problem: estimate the regression function E(y|x) = F(x)
- F can be a very complex function.
3 Ensembles of Models
- Before the 1990s: a multitude of techniques was developed to tackle regression problems
- 1990s: new idea - use a collection of basic models (an ensemble)
- Substantial improvements in accuracy compared with any single basic model
- Examples: Bagging, Boosting, Random Forests
4 Key Ingredients of Ensembles
- Type of basic model used in the ensemble (RT, K-NN, NN)
- The way basic models are built (data sub-sampling schemes, injection of randomness)
- The way basic models are combined
- Possible postprocessing (tuning) of the resulting ensemble (optional)
5 Random Forests (RF)
- Developed by Leo Breiman, Department of Statistics, University of California, Berkeley, in the late 1990s
- RF is resistant to overfitting
- RF is capable of handling a large number of predictors
6 Key Features of RF
- A Randomised Regression Tree is the basic model
- Each tree is grown on a bootstrap sample
- The ensemble (forest) is formed by averaging the predictions from the individual trees
7 Regression Trees
- Performs recursive binary division of the data: start with the Root node (all points) and split it into 2 parts (Left node and Right node)
- The split attempts to separate data points with high y_i's from data points with low y_i's as much as possible
- The split is based on a single predictor and a split point
- To find the best splitter, all possible splits and split points are tried (a sketch of this search follows this list)
- Splitting is repeated for the children
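The exhaustive search can be written down directly. The sketch below is plain base R and a simplified illustration (not the CART/rpart implementation): it scores every (predictor, split point) pair by the reduction in the sum of squared errors around the node means and keeps the best one.

  best_split <- function(X, y) {
    # X: data frame of predictors, y: numeric response
    sse <- function(v) sum((v - mean(v))^2)
    best <- list(improve = -Inf)
    for (j in seq_len(ncol(X))) {
      for (s in sort(unique(X[[j]]))[-1]) {          # candidate split points
        left  <- y[X[[j]] <  s]
        right <- y[X[[j]] >= s]
        improve <- sse(y) - sse(left) - sse(right)   # reduction in within-node SSE
        if (improve > best$improve)
          best <- list(var = names(X)[j], point = s, improve = improve)
      }
    }
    best
  }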
8 (No Transcript)
9 RT Competitor List
Primary splits:
  x2  < -110.6631  to the right, improve=734.0907, (0 missing)
  x6  <  107.5704  to the left,  improve=728.0376, (0 missing)
  x51 <  101.4707  to the left,  improve=720.1280, (0 missing)
  x30 < -113.879   to the right, improve=716.6580, (0 missing)
  x67 < -93.76226  to the right, improve=715.6400, (0 missing)
  x78 <  93.27373  to the left,  improve=715.6400, (0 missing)
  x62 <  93.99937  to the left,  improve=715.6400, (0 missing)
  x44 <  96.059    to the left,  improve=715.6400, (0 missing)
  x25 < -85.65475  to the right, improve=685.0943, (0 missing)
  x21 < -118.4764  to the right, improve=685.0736, (0 missing)
  x82 <  119.6532  to the left,  improve=685.0736, (0 missing)
  x79 < -81.00349  to the right, improve=675.7913, (0 missing)
  x18 < -70.78995  to the right, improve=663.0757, (0 missing)
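A listing of this kind is what summary() prints for an rpart regression tree. The snippet below is a hypothetical illustration only: the slides do not give the underlying data, so a toy data frame with predictors x1..x100 and an assumed signal is simulated.

  library(rpart)
  set.seed(1)
  dat <- as.data.frame(matrix(rnorm(500 * 100, sd = 100), ncol = 100))
  names(dat) <- paste0("x", 1:100)
  dat$y <- 0.01 * dat$x2 - 0.01 * dat$x6 + rnorm(500)   # assumed toy signal
  fit <- rpart(y ~ ., data = dat, method = "anova")
  summary(fit)   # prints the "Primary splits" (competitor) listing for each node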
10 (No Transcript)
11 Predictions from a Tree model
- A prediction from a tree is obtained by dropping x down the tree until it reaches a terminal node
- The predicted value is the average of the response values of the training data points in that terminal node
- Example: if x1 > -110.66 and x82 > 118.65, then Prediction = 0.61
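Continuing with the hypothetical rpart fit sketched earlier, predict() does exactly what this slide describes: it drops each new x down the tree and returns the mean response of the terminal node it lands in.

  new_x <- dat[1, names(dat) != "y", drop = FALSE]   # any new predictor vector
  predict(fit, newdata = new_x)                      # terminal-node average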
12 Pruning of CART trees
- Prediction Error (PE) = Variance + Bias²
- PE vs Tree Size has a U-shape: very large and very small trees are bad
- Trees are grown until terminal nodes become small...
- ... and then pruned back
- Use holdout data to estimate the PE of the trees
- Select the tree that has the smallest PE
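In rpart the grow-then-prune idea is driven by the complexity parameter cp: the tree is grown deep, an error estimate for each subtree size is stored in the cptable, and prune() cuts the tree back. This is a hedged sketch, reusing the hypothetical dat from earlier; rpart estimates PE by cross-validation rather than a separate holdout set, but the principle is the same.

  library(rpart)
  big_tree <- rpart(y ~ ., data = dat, method = "anova",
                    control = rpart.control(cp = 0, minsplit = 5))   # grow deep
  best_cp  <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
  pruned   <- prune(big_tree, cp = best_cp)                          # prune back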
13 Randomised Regression Trees I
- Each tree is grown on a bootstrap sample: N data points are sampled with replacement
- Each such sample contains about 63% of the original data points; some records occur multiple times
- Each tree is built on its own bootstrap sample, so the trees are likely to be different
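A quick check of the "about 63%" figure: the expected fraction of distinct original records in a bootstrap sample is 1 - (1 - 1/N)^N, which tends to 1 - 1/e ≈ 0.632.

  N <- 10000
  boot <- sample(N, replace = TRUE)    # one bootstrap sample of row indices
  length(unique(boot)) / N             # roughly 0.63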
14 Randomised Regression Trees II
- At each split, only M randomly selected predictors are allowed to compete as potential splitters, e.g. 10 out of 100 (see the sketch after this list)
- A new group of eligible splitters is selected at random at each step
- At each step the splitter selected is likely to be somewhat suboptimal
- Every predictor gets a chance to compete as a splitter, so important predictors are very likely to be eventually used as splitters
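In the randomForest package this M is the mtry argument. A hedged sketch, reusing the hypothetical dat from earlier (the parameter values are illustrative, not the ones used in the talk):

  library(randomForest)
  set.seed(3)
  rf <- randomForest(y ~ ., data = dat, ntree = 500,
                     mtry = 10,       # M = 10 predictors compete at each split
                     nodesize = 5)    # minimum size of terminal nodes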
15 Competitor List for Randomised RT
Primary splits:
  x6  <  107.5704  to the left,  improve=728.0376, (0 missing)
  x78 <  93.27373  to the left,  improve=715.6400, (0 missing)
  x62 <  93.99937  to the left,  improve=715.6400, (0 missing)
  x79 < -81.00349  to the right, improve=675.7913, (0 missing)
  x80 <  63.85983  to the left,  improve=654.7728, (0 missing)
  x24 <  59.5085   to the left,  improve=648.3837, (0 missing)
  x90 < -59.35043  to the right, improve=646.8825, (0 missing)
  x75 < -52.43783  to the right, improve=639.5996, (0 missing)
  x68 <  50.18278  to the left,  improve=631.1139, (0 missing)
  Y   < -33.42134  to the right, improve=606.9931, (0 missing)
  x34 <  132.8378  to the left,  improve=555.2047, (0 missing)
16 Randomised Regression Trees III
- M = 1: the splitter is selected at random, but not the split point
- M = P: the original deterministic CART algorithm
- Trees are deliberately not pruned
17 Combining the Trees
- Each tree represents a regression model which fits the training data very closely: a low bias, high variance model
- The idea behind RF: take predictions from a large number of highly variable trees and average them (see the sketch below)
- The result is a low bias, low variance model
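The averaging can be seen directly with the hypothetical rf fit above: predict() can return every individual tree's prediction, and the forest prediction is simply their mean.

  pred <- predict(rf, newdata = dat[1:3, ], predict.all = TRUE)
  pred$aggregate               # forest predictions (average over the trees)
  rowMeans(pred$individual)    # the same values computed by hand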
18 (No Transcript)
19 Correlation Vs Strength
- Another decomposition for the PE of Random Forests: PE(RF) ≤ ρ(BT) × PE(Tree)
- ρ(BT): the correlation between any 2 trees in the forest
- PE(Tree): prediction error (strength) of a single tree
- M = 1: low correlation, low strength
- M = P: high correlation, high strength
20 RF as K-NN regression model I
- RF induces a proximity measure in the predictor space: P(x1, x2) = proportion of trees where x1 and x2 landed in the same terminal node
- Prediction at point x (see the sketch after this list)
- Only a fraction of the data points actually contributes to the prediction
- Strongly resembles the formula used for K-NN predictions
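The formula on the original slide is an image and is not transcribed. As a hedged sketch of the form the bullets describe, the prediction at x is a proximity-weighted average of the training responses, which is exactly the K-NN shape:

  rf_knn_predict <- function(prox, y) {
    # prox: proximities P(x, x_i) between the query point x and each training
    #       point (proportion of trees sharing a terminal node); y: responses
    sum(prox * y) / sum(prox)    # weighted average; zero-proximity points drop out
  }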
21 RF as K-NN regression model II
- Lin, Y. and Jeon, Y. Random Forests and Adaptive Nearest Neighbours. Technical Report 1055, Department of Statistics, University of Wisconsin, 2002.
- Breiman, L. Consistency for a Simple Model of Random Forests. Technical Report 670, Statistics Department, University of California at Berkeley, 2004.
22 It was shown that
- Randomisation does reduce the variance component
- The optimal M is independent of the sample size
- RF does behave as an Adaptive K-Nearest Neighbour model: the shape and size of the neighbourhood are adapted to the local behaviour of the target regression function
23 Case Study: Postcode Ranking in Motor Insurance
- 575 postcodes in NSW
- For each postcode: the number of claims as well as the exposure (the number of policies in the postcode)
- Problem: ranking of postcodes for pricing purposes
24 Approach
- Each postcode is represented by the (x, y) coordinates of its centroid
- Model the expected claim frequency as a function of (x, y)
- The target surface is likely to be highly irregular
- Add coordinates of the postcodes along 100 randomly generated directions to allow greater flexibility (a sketch follows this list)
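A hedged sketch of the "100 random directions" construction (the variable names and data layout are assumptions; the slides do not give them): each direction is a unit vector at a random angle, and the new predictor is the projection of the postcode centroid onto that direction.

  set.seed(2)
  centroids <- data.frame(cx = runif(575), cy = runif(575))   # toy centroids
  angles <- runif(100, 0, pi)                                 # 100 random directions
  proj <- sapply(angles, function(a) centroids$cx * cos(a) + centroids$cy * sin(a))
  colnames(proj) <- paste0("dir", 1:100)   # extra predictors appended for the RF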
25 Tuning RF: M
26 Tuning RF: Size of Node
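The two slides above show tuning plots that are not reproduced here. A hedged sketch of the kind of grid search they suggest, varying mtry (M) and nodesize and comparing out-of-bag MSE (the use of OOB error is an assumption; the talk's plots may have used a holdout set), reusing the hypothetical dat:

  library(randomForest)
  grid <- expand.grid(mtry = c(2, 5, 10, 20, 50), nodesize = c(1, 5, 10, 25))
  grid$oob_mse <- apply(grid, 1, function(g) {
    fit <- randomForest(y ~ ., data = dat, ntree = 300,
                        mtry = g["mtry"], nodesize = g["nodesize"])
    tail(fit$mse, 1)                       # OOB MSE after the last tree
  })
  grid[which.min(grid$oob_mse), ]          # best (mtry, nodesize) combination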
27 (No Transcript)
28 (No Transcript)
29 (No Transcript)
30 (No Transcript)
31 (No Transcript)
32 Things not covered
- Clustering, missing value imputation and outlier detection
- Identification of important variables
- OOB testing of RF models
- Postprocessing of RF models