Introduction to Predictive Modeling with Examples

About This Presentation

Title:

Introduction to Predictive Modeling with Examples

Description:

Introduction to Predictive Modeling with Examples. Nationwide Insurance Company, November 2. D. A. Dickey – PowerPoint PPT presentation

Slides: 67

Transcript and Presenter's Notes

Title: Introduction to Predictive Modeling with Examples

1
Introduction to Predictive Modeling with Examples
Nationwide Insurance Company, November 2 D. A.
Dickey
2
Cool <------------------------> Nerdy
Analytics = Statistics
Predictive Modeling = Regression
Part 1: Simple Linear Regression
3
If the Life Line is long and deep, then this
represents a long life full of vitality and
health. A short line, if strong and deep, also
shows great vitality in your life and the ability
to overcome health problems. However, if the line
is short and shallow, then your life may have the
tendency to be controlled by others.
http://www.ofesite.com/spirit/palm/lines/linelife.htm
4
Wilson & Mather, JAMA 229 (1974). X = life line length, Y = age at death.
proc sgplot; scatter Y=age X=line; reg Y=age X=line; run;
Result: Predicted Age at Death = 79.24 - 1.367(lifeline)
(Is this real??? Is this repeatable???)
5
We Use LEAST SQUARES
Squared residuals sum to 9609
6
(No Transcript)
7
Error sum of squares SSq versus slope and intercept (truncated at SSq = 9700)
8
Best line is the one that minimizes the sum of squared residuals.
It is best for this sample. Is it the true relationship for everyone?
SAS PROC REG will compute it.
What other lines might be the true line for everyone?
Probably not the purple one. The red one has slope 0 (no effect).
Is the red line unreasonable? Can we reject H0: slope is 0?
9
Simulation: Age at Death = 67 + 0(life line) + e
Error e has a normal distribution with mean 0 and variance 200.
Simulate 20 cases with n = 50 bodies each.
NOTE: Regression equations:
Age(rep1)  = 80.56253 - 1.345896*line
Age(rep2)  = 61.76292 + 0.745289*line
Age(rep3)  = 72.14366 - 0.546996*line
Age(rep4)  = 95.85143 - 3.087247*line
Age(rep5)  = 67.21784 - 0.144763*line
Age(rep6)  = 71.0178  - 0.332015*line
Age(rep7)  = 54.9211  + 1.541255*line
Age(rep8)  = 69.98573 - 0.472335*line
Age(rep9)  = 85.73131 - 1.240894*line
Age(rep10) = 59.65101 + 0.548992*line
Age(rep11) = 59.38712 + 0.995162*line
Age(rep12) = 72.45697 - 0.649575*line
Age(rep13) = 78.99126 - 0.866334*line
Age(rep14) = 45.88373 + 2.283475*line
Age(rep15) = 59.28049 + 0.790884*line
Age(rep16) = 73.6395  - 0.814287*line
Age(rep17) = 70.57868 - 0.799404*line
Age(rep18) = 72.91134 - 0.821219*line
Age(rep19) = 55.46755 + 1.238873*line
Age(rep20) = 63.82712 + 0.776548*line
Predicted Age at Death = 79.24 - 1.367(lifeline) would NOT be unusual if there is no true relationship.
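The simulation on this slide can be sketched in a few lines of Python. This is only an illustration, not the SAS program behind the slide: the life line lengths are not given in the transcript, so the sketch assumes (hypothetically) that they are spread uniformly between 6 and 12, and the names fit_line and simulate_slopes are made up for the example.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def simulate_fits(reps=20, n=50, seed=1):
    """Fit a line to each of `reps` samples generated with a true slope of 0."""
    rng = random.Random(seed)
    sd = 200 ** 0.5  # error variance 200, as on the slide
    fits = []
    for _ in range(reps):
        # ASSUMPTION: life line lengths uniform on 6..12 (hypothetical values).
        line = [rng.uniform(6.0, 12.0) for _ in range(n)]
        age = [67.0 + 0.0 * x + rng.gauss(0.0, sd) for x in line]
        fits.append(fit_line(line, age))
    return fits

fits = simulate_fits()
for rep, (b0, b1) in enumerate(fits, start=1):
    print(f"Age(rep{rep}) = {b0:.2f} + ({b1:.3f})*line")
```

Even though the true slope is exactly 0, the fitted slopes scatter on both sides of 0, which is the point of the slide.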
10
Distribution of t Under H0
Conclusion: Estimated slopes vary.
Standard deviation of estimated slopes = "standard error" (estimated).
Compute t = (estimate - hypothesized)/standard error.
p-value is the probability of a larger |t| when the hypothesis is correct (e.g. 0 slope).
p-value is the sum of two tail areas.
Traditionally, p < 0.05 -> hypothesized value is wrong; p > 0.05 is inconclusive.
11
proc reg data=life; model age=line; run;

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             79.23341         14.83229      5.34     <.0001
Line         1             -1.36697          1.59782     -0.86     0.3965

Tail areas: 0.19825 (below t = -0.86) + 0.19825 (above t = 0.86) = 0.39650
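The t statistic in this output can be recomputed from the estimate and its standard error. The sketch below does that and then approximates the two tail areas. Note the assumption: Python's standard library has no t distribution, so a standard normal stands in for the t with 48 degrees of freedom, and the resulting p-value is only close to the 0.3965 that SAS reports.

```python
from statistics import NormalDist

estimate = -1.36697   # slope estimate from the PROC REG output
std_error = 1.59782   # its standard error
hypothesized = 0.0    # H0: slope is 0

t = (estimate - hypothesized) / std_error

# ASSUMPTION: normal approximation to the t distribution with 48 DF,
# so this p-value is approximate (SAS uses the exact t distribution).
one_tail = 1.0 - NormalDist().cdf(abs(t))
p_approx = 2.0 * one_tail  # sum of the two tail areas

print(f"t = {t:.2f}, approximate two-sided p = {p_approx:.3f}")
```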
12
Conclusion: insufficient evidence against the hypothesis of no linear relationship.
H0 versus H1
H0: Innocence. H1: Guilt. Beyond reasonable doubt: P < 0.05
H0: True slope is 0 (no association). H1: True slope is not 0. P = 0.3965
13
Simulation: Age at Death = 67 + 0(life line) + e
Error e has a normal distribution with mean 0 and variance 200.  <- WHY?
Simulate 20 cases with n = 50 bodies each.
Want an estimate of variability around the true line. True variance is σ².
Use sums of squared residuals (SS).
Sum of squared residuals from the mean is "SS(total)" = 9755.
Sum of squared residuals around the line is "SS(error)" = 9609.
(1) SS(total) - SS(error) is SS(model) = 146.
(2) Variance estimate is SS(error)/(degrees of freedom) = 200.
(3) SS(model)/SS(total) is R², i.e. the proportion of variability explained by the model.

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1        146.51753     146.51753      0.73   0.3965
Error              48       9608.70247     200.18130
Corrected Total    49       9755.22000
Root MSE 14.14854   R-Square 0.0150
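The three numbered steps above are plain arithmetic on the sums of squares, and it can help to see them reproduce the SAS table. A minimal sketch, using only the two sums of squares and the sample size from the slide (the variable names are illustrative):

```python
import math

ss_total = 9755.22000   # corrected total sum of squares
ss_error = 9608.70247   # sum of squared residuals around the fitted line
n = 50                  # bodies in the sample
model_df = 1            # one slope
error_df = n - 1 - model_df

ss_model = ss_total - ss_error            # step (1)
mse = ss_error / error_df                 # step (2): variance estimate
root_mse = math.sqrt(mse)
r_square = ss_model / ss_total            # step (3)
f_value = (ss_model / model_df) / mse

print(f"SS(model) = {ss_model:.5f}, MSE = {mse:.5f}, Root MSE = {root_mse:.5f}")
print(f"R-Square = {r_square:.4f}, F = {f_value:.2f}")
```

With one predictor, F is the square of the slope's t statistic, which is why both tests share the p-value 0.3965.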
14
Part 2: Multiple Regression
Issues:
(1) Testing joint importance versus individual significance
(2) Prediction versus modeling individual effects
(3) Collinearity (correlation among inputs)
Example: Hypothetical company's sales Y depend on TV advertising X1 and Radio Advertising X2.
Y = b0 + b1X1 + b2X2 + e
Two engine plane can still fly if engine 1 fails.
Two engine plane can still fly if engine 2 fails.
Neither is critical individually.
Jointly critical (can't omit both!!)
15
Data Sales; length sval $8; length cval $8;
input store TV radio sales;
(more code)
cards;
1 869 868 9089
2 836 820 8290
(more data)
40 969 961 10130
;
proc g3d data=sales; scatter radio*TV=sales / shape=sval color=cval zmin=8000; run;
(3-D scatter plot; axes: Sales, Radio, TV)
16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
P2 axis
23
P2 axis
24
P2 axis
25
Conclusion: Can predict well with just TV, just radio, or both!
SAS code:
proc reg data=next; model sales = TV radio;

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2         32660996      16330498    358.84   <.0001   <- (Can't omit both)
Error              37          1683844         45509
Corrected Total    39         34344840
Root MSE 213.32908   R-Square 0.9510   <- Explaining 95% of variation in sales

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            531.11390        359.90429      1.48     0.1485
TV           1              5.00435          5.01845      1.00     0.3251   <- (can omit TV)
radio        1              4.66752          4.94312      0.94     0.3512   <- (can omit radio)

Estimated Sales = 531 + 5.0 TV + 4.7 radio, with error variance 45509 (standard deviation 213).
TV is approximately equal to radio, so, approximately,
Estimated Sales = 531 + 9.7 TV or
Estimated Sales = 531 + 9.7 radio

26
Setting TV = radio (approximate relationship):
Estimated Sales = 531 + 9.7 TV. Is this the BEST TV line?
Estimated Sales = 531 + 9.7 radio. Is this the BEST radio line?
Proc Reg Data=Stores; Model Sales = TV; Model Sales = radio; run;
27

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1         32620420      32620420    718.84   <.0001
Error              38          1724420         45379
Corrected Total    39         34344840
Root MSE 213.02459   R-Square 0.9498

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            478.50829        355.05866      1.35     0.1857
TV           1              9.73056          0.36293     26.81     <.0001

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1         32615742      32615742    716.79   <.0001
Error              38          1729098         45503
Corrected Total    39         34344840
Root MSE 213.31333   R-Square 0.9497

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            612.08604        350.59871      1.75     0.0889
radio        1              9.58381          0.35797     26.77     <.0001
28
Sums of squares capture variation explained by each variable.
Type I: How much when it is added to the model?
Type II: How much when all other variables are present (as if it had been added last)?

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|    Type I SS   Type II SS
Intercept    1            531.11390        359.90429      1.48     0.1485   3964160640        99106
TV           1              5.00435          5.01845      1.00     0.3251     32620420        45254
radio        1              4.66752          4.94312      0.94     0.3512        40576        40576

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|    Type I SS   Type II SS
Intercept    1            531.11390        359.90429      1.48     0.1485   3964160640        99106
radio        1              4.66752          4.94312      0.94     0.3512     32615742        40576
TV           1              5.00435          5.01845      1.00     0.3251        45254        45254
29
Summary: Good predictions given by
Sales = 531 + 5.0 x TV + 4.7 x Radio or
Sales = 479 + 9.7 x TV or
Sales = 612 + 9.6 x Radio or
(lots of others)
Why the confusion? The evil Multicollinearity!! (correlated X's)
30
Those Mysterious Degrees of Freedom (DF)
First Martian -> information about average height, 0 information about variation.
2nd Martian gives first piece of information (DF) about error variance around mean.
n Martians -> n-1 DF for error (variation)
31
(Plot: Martian Height versus Martian Weight)
2 points -> no information on variation of errors
n points -> n-2 error DF
32
Source             DF   Sum of Squares   Mean Square
Model               2         32660996      16330498
Error              37          1683844         45509
Corrected Total    39         34344840

How Many Table Legs? (regress Y on X1, X2)
(Figure labels: X1, X2, error)
Three legs will all touch the floor.
Fourth leg gives first chance to measure error (first error DF).
Fit a plane -> n-3 (37) error DF (2 model DF, n-1 = 39 total DF)
Regress Y on X1 X2 ... X7 -> n-8 error DF (7 model DF, n-1 total DF)
33
Grades vs. IQ and Study Time
Data tests; input IQ Study_Time Grade; IQ_S = IQ*Study_Time;
cards;
105 10 75
110 12 79
120  6 68
116 13 85
122 16 91
130  8 79
114 20 98
102 15 76
;
Proc reg data=tests; model Grade = IQ;
Proc reg data=tests; model Grade = IQ Study_Time;

Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1             62.57113         48.24164      1.30     0.2423
IQ             1              0.16369          0.41877      0.39     0.7094

Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1              0.73655         16.26280      0.05     0.9656
IQ             1              0.47308          0.12998      3.64     0.0149
Study_Time     1              2.10344          0.26418      7.96     0.0005
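Because the eight data rows are listed on the slide, the two sets of estimates can be checked by hand. The sketch below recomputes them in plain Python by solving the 2x2 normal equations on centered data. It is only a check of the arithmetic, not how PROC REG is implemented, and the helper names mean and cross are made up for the example.

```python
iq         = [105, 110, 120, 116, 122, 130, 114, 102]
study_time = [10, 12, 6, 13, 16, 8, 20, 15]
grade      = [75, 79, 68, 85, 91, 79, 98, 76]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of products of deviations from the means."""
    ubar, vbar = mean(u), mean(v)
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v))

# Model 1: Grade on IQ alone
slope_iq_alone = cross(iq, grade) / cross(iq, iq)
intercept_alone = mean(grade) - slope_iq_alone * mean(iq)

# Model 2: Grade on IQ and Study_Time (2x2 normal equations, centered data)
s11, s22, s12 = cross(iq, iq), cross(study_time, study_time), cross(iq, study_time)
s1y, s2y = cross(iq, grade), cross(study_time, grade)
det = s11 * s22 - s12 ** 2
b_iq = (s22 * s1y - s12 * s2y) / det
b_study = (s11 * s2y - s12 * s1y) / det
b0 = mean(grade) - b_iq * mean(iq) - b_study * mean(study_time)

print(f"Grade = {intercept_alone:.5f} + {slope_iq_alone:.5f} x IQ")
print(f"Grade = {b0:.5f} + {b_iq:.5f} x IQ + {b_study:.5f} x Study_Time")
```

The IQ coefficient roughly triples once Study_Time is in the model, matching the two parameter tables above.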

34
Contrast:
TV advertising loses significance when radio is added.
IQ gains significance when study time is added.
Model for Grades: Predicted Grade = 0.74 + 0.47 x IQ + 2.10 x Study Time
Question: Does an extra hour of study really deliver 2.10 points for everyone regardless of IQ? Current model only allows this.
35
proc reg; model Grade = IQ Study_Time IQ_S;

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3        610.81033     203.60344     26.22   0.0043
Error               4         31.06467       7.76617
Corrected Total     7        641.87500
Root MSE 2.78678   R-Square 0.9516

Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1             72.20608         54.07278      1.34     0.2527
IQ             1             -0.13117          0.45530     -0.29     0.7876
Study_Time     1             -4.11107          4.52430     -0.91     0.4149
IQ_S           1              0.05307          0.03858      1.38     0.2410

Interaction model:
Predicted Grade = 72.21 - 0.13 x IQ - 4.11 x Study Time + 0.053 x IQ x Study Time
                = (72.21 - 0.13 x IQ) + (-4.11 + 0.053 x IQ) x Study Time
IQ = 102 predicts Grade = (72.21 - 13.26) + (5.41 - 4.11) x Study Time = 58.95 + 1.30 x Study Time
IQ = 122 predicts Grade = (72.21 - 15.86) + (6.47 - 4.11) x Study Time = 56.35 + 2.36 x Study Time
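The regrouping above says that, in the interaction model, both the intercept and the Study Time slope depend on IQ. A small sketch using the rounded coefficients from the slide makes that explicit (the function name grade_line is illustrative only):

```python
# Rounded coefficients from the interaction model on the slide
B0, B_IQ, B_STUDY, B_INTERACT = 72.21, -0.13, -4.11, 0.053

def grade_line(iq):
    """Return (intercept, slope) of predicted Grade versus Study Time at a fixed IQ."""
    intercept = B0 + B_IQ * iq
    slope = B_STUDY + B_INTERACT * iq
    return intercept, slope

for iq in (102, 122):
    intercept, slope = grade_line(iq)
    print(f"IQ {iq}: Grade = {intercept:.2f} + {slope:.2f} x Study Time")
```

Any other IQ can be plugged in the same way, though predictions far outside the observed range of 102 to 130 would be extrapolation.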
36
Slope = 2.36
Slope = 1.30

Adding interaction makes everything insignificant
(individually) !
Do we need to omit insignificant terms until only
significant ones remain?
Has an acquitted defendant proved his innocence?
Common sense trumps statistics!

37
Part 3: Diagnosing Problems in Regression
Main problems are:
Multicollinearity (correlation among inputs)
Outliers

Proc Corr; Var TV radio sales;
Pearson Correlation Coefficients, N = 40
Prob > |r| under H0: Rho=0
              TV       radio       sales
TV       1.00000     0.99737     0.97457
                      <.0001      <.0001
radio    0.99737     1.00000     0.97450
          <.0001                  <.0001
sales    0.97457     0.97450     1.00000
          <.0001      <.0001

(Plot: TV versus Radio, with Principal Component Axis 1 (P1) and Principal Component Axis 2 (P2))
38

Principal Components
Center and scale variables to mean 0, variance 1.
Call these X1 (TV) and X2 (radio).
n variables -> total variation is n (n = 2 here).
Find the most variable linear combination P1 = __X1 + __X2.

              TV       radio
TV       1.00000     0.99737
                      <.0001
radio    0.99737     1.00000
          <.0001

Variances are 1.9973 out of 2 (along the P1 axis) and 0.0027 out of 2 (along the P2 axis).
The ratio of the corresponding standard deviations (27.6) is the "condition number"; large -> unstable regression.
Rule of thumb: Ratio 1 is perfect, > 30 problematic.
Spread on the long axis is 27.6 times that on the short axis.
Variance Inflation Factor:
(1) Regress predictor i on all the others, getting r-square Ri².
(2) VIF is 1/(1 - Ri²) for variable i (measures collinearity).
(3) VIF > 10 is a problem.
39
Variance Inflation Factor:
(1) Regress predictor i on all the others, getting r-square Ri².
(2) VIF is 1/(1 - Ri²) for variable i (measures collinearity).
(3) VIF > 10 is a problem.
Example:
Proc Reg Data=Sales; Model Sales = TV Radio / VIF collinoint;

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept    1            531.11390        359.90429      1.48     0.1485                    0
TV           1              5.00435          5.01845      1.00     0.3251            190.65722
radio        1              4.66752          4.94312      0.94     0.3512            190.65722

Collinearity Diagnostics (intercept adjusted)
                                         --Proportion of Variation--
Number   Eigenvalue   Condition Index          TV        radio
1           1.99737           1.00000     0.00131      0.00131
2           0.00263          27.57948     0.99869      0.99869

We have a MAJOR problem!
(Note: other diagnostics besides VIF and condition number are available.)
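With only two predictors, both diagnostics follow directly from the TV-radio correlation. The sketch below assumes the rounded correlation 0.99737 from the PROC CORR output, so its results differ slightly from the SAS values (190.65722 and 27.57948), which come from the unrounded data.

```python
import math

r = 0.99737  # Pearson correlation between TV and radio (rounded, from PROC CORR)

# With two predictors, regressing one on the other gives R-square = r**2.
vif = 1.0 / (1.0 - r ** 2)

# Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]] are 1 + r and 1 - r.
eig_large, eig_small = 1.0 + r, 1.0 - r
condition_index = math.sqrt(eig_large / eig_small)

print(f"VIF = {vif:.1f}")
print(f"eigenvalues = {eig_large:.5f}, {eig_small:.5f}")
print(f"condition index = {condition_index:.2f}")
```

The two eigenvalues sum to 2, the total variation of two standardized variables, and nearly all of it lies along the first principal component.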
40
(Plot: TV versus radio, both axes running from 800 to 1200, with principal component axes P1 and P2)
Another problem: Outliers
Example: Add one point to the TV-Radio data: TV = 1021, radio = 954, Sales = 9020.
Proc Reg; Model Sales = TV radio / p r;

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2         33190059      16595030    314.07   <.0001
Error              38          2007865         52839
Corrected Total    40         35197924
Root MSE 229.86639   R-Square 0.9430

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            689.01260        382.52628      1.80     0.0796
TV           1             -6.28994          2.90505     -2.17     0.0367   ???????
radio        1             15.78081          2.86870      5.50     <.0001

       Dependent   Predicted                 Std Error    Student    Cook's
Obs     Variable       Value     Residual     Residual   Residual         D
39          9277        9430    -153.4358        225.3     -0.681     0.006
40         10130        9759     370.5848        226.1      1.639     0.030
41          9020        9322    -301.8727        121.9     -2.476     5.224
41
(No Transcript)
42
(No Transcript)
43
P1
P2
44

Ordinary residual for store 41 is not too bad (-301.87).
PRESS residuals:
Remove store i, with Sales Y(i).
Fit the model to the other 40 stores.
Get the model prediction P(i) for store i.
PRESS residual is Y(i) - P(i).

proc reg data=raw; model sales = TV radio;
output out=out1 r=r press=press; run;
(Plot: regular (O) and PRESS (dot) residuals, viewed along the P2 axis (2nd principal component); store number 41 is marked)
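The PRESS recipe above (drop one point, refit, predict the dropped point) is easy to sketch for a one-predictor model. The data below are hypothetical, chosen so the last point is an outlier in X, and fit_line and press_residuals are illustrative names, not SAS functions.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def press_residuals(x, y):
    """Leave each point out, refit, and return actual minus predicted."""
    out = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_rest, y_rest)
        out.append(y[i] - (b0 + b1 * x[i]))
    return out

# HYPOTHETICAL data: five ordinary points and one outlying point at the end.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

b0, b1 = fit_line(x, y)
ordinary = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
press = press_residuals(x, y)

for i, (e, p) in enumerate(zip(ordinary, press), start=1):
    print(f"obs {i}: ordinary residual {e:8.3f}   PRESS residual {p:8.3f}")
```

An influential point pulls the fitted line toward itself, so its ordinary residual looks modest while its PRESS residual is much larger, which is the pattern the slide shows for store 41.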
45
Part 4: Classification Variables (dummy variables, indicator variables)
Predicted Accidents = 1181 + 2579 X11
X11 is 1 in November, 0 elsewhere.
Interpretation:
In November, predict 1181 + 2579(1) = 3760.
In any other month, predict 1181 + 2579(0) = 1181.
1181 is the average of the other months.
2579 is the added November effect (vs. average of others).
Model for NC Crashes involving Deer:
Proc reg data=deer; model deer = X11;

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1         30473250      30473250     90.45   <.0001
Error              58         19539666        336891
Corrected Total    59         50012916
Root MSE 580.42294   R-Square 0.6093

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1           1181.09091         78.26421     15.09     <.0001
X11                      1           2578.50909        271.11519      9.51     <.0001
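The interpretation above (the intercept is the average of the other months, the coefficient is the added November effect) holds for any regression on a single 0/1 dummy. The sketch below checks it on a handful of hypothetical monthly counts, which are not the NC deer data.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# HYPOTHETICAL monthly crash counts; x11 is 1 for November rows, 0 otherwise.
counts = [900, 1100, 1300, 3500, 3900, 1200, 1000]
x11    = [0,   0,    0,    1,    1,    0,    0]

intercept, november_effect = fit_line(x11, counts)

mean_other = sum(c for c, d in zip(counts, x11) if d == 0) / x11.count(0)
mean_november = sum(c for c, d in zip(counts, x11) if d == 1) / x11.count(1)

print(f"intercept = {intercept:.1f} (mean of other months = {mean_other:.1f})")
print(f"intercept + effect = {intercept + november_effect:.1f} "
      f"(mean of November = {mean_november:.1f})")
```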
46
(No Transcript)
47
Looks like December and October need dummies too!
Proc reg data=deer; model deer = X10 X11 X12;

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3         46152434      15384145    223.16   <.0001
Error              56          3860482         68937
Corrected Total    59         50012916
Root MSE 262.55890   R-Square 0.9228

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1            929.40000         39.13997     23.75     <.0001
X10                      1           1391.20000        123.77145     11.24     <.0001
X11                      1           2830.20000        123.77145     22.87     <.0001
X12                      1           1377.40000        123.77145     11.13     <.0001

Average of Jan through Sept. is 929 crashes per month.
Add 1391 in October, 2830 in November, 1377 in December.
48
(No Transcript)
49
What the heck, let's do all but one (need "average of rest" so must leave out at least one).
Proc reg data=deer; model deer = X1 X2 ... X10 X11;

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              11         48421690       4401972    132.79   <.0001
Error              48          1591226         33151
Corrected Total    59         50012916
Root MSE 182.07290   R-Square 0.9682

Parameter Estimates
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1           2306.80000         81.42548     28.33     <.0001
X1                       1           -885.80000        115.15301     -7.69     <.0001
X2                       1          -1181.40000        115.15301    -10.26     <.0001
X3                       1          -1220.20000        115.15301    -10.60     <.0001
X4                       1          -1486.80000        115.15301    -12.91     <.0001
X5                       1          -1526.80000        115.15301    -13.26     <.0001
X6                       1          -1433.00000        115.15301    -12.44     <.0001
X7                       1          -1559.20000        115.15301    -13.54     <.0001
X8                       1          -1646.20000        115.15301    -14.30     <.0001
X9                       1          -1457.20000        115.15301    -12.65     <.0001
X10                      1             13.80000        115.15301      0.12     0.9051
X11                      1           1452.80000        115.15301     12.62     <.0001

Average of rest is just the December mean, 2307.
Subtract 886 in January, add 1452 in November.
October (X10) is not significantly different from December.
50
(No Transcript)
51
positive
negative
52
Add date (days since Jan 1 1960 in SAS) to capture trend.
Proc reg data=deer; model deer = date X1 X2 ... X10 X11;

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              12         49220571       4101714    243.30   <.0001
Error              47           792345         16858
Corrected Total    59         50012916
Root MSE 129.83992   R-Square 0.9842

Parameter Estimates
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1          -1439.94000        547.36656     -2.63     0.0115
X1                       1           -811.13686         82.83115     -9.79     <.0001
X2                       1          -1113.66253         82.70543    -13.47     <.0001
X3                       1          -1158.76265         82.60154    -14.03     <.0001
X4                       1          -1432.28832         82.49890    -17.36     <.0001
X5                       1          -1478.99057         82.41114    -17.95     <.0001
X6                       1          -1392.11624         82.33246    -16.91     <.0001
X7                       1          -1525.01849         82.26796    -18.54     <.0001
X8                       1          -1618.94416         82.21337    -19.69     <.0001
X9                       1          -1436.86982         82.17106    -17.49     <.0001
X10                      1             27.42792         82.14183      0.33     0.7399
X11                      1           1459.50226         82.12374     17.77     <.0001
date                     1              0.22341          0.03245      6.88     <.0001

Trend is 0.22 more accidents per day (1 per 5 days) and is significantly different from 0.
53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
Part 5: Logistic Regression
The problem: response is binary (yes or no, accident or no accident, claim or no claim, at fault or not at fault).
Prediction is prediction of probability (of fault, for example).
59

Logistic idea: Map p in (0,1) to L in the whole real line, p = probability of fabric igniting.
Use L = ln(p/(1-p)).
Model L as linear in flame exposure time.
Predicted L = a + b(time)
Given temperature X, compute a + bX, then p = e^L/(1 + e^L).
p(i) = e^(a + bXi)/(1 + e^(a + bXi))
Write p(i) if response, 1 - p(i) if not.
Multiply all n of these together, get function Q(a,b), find a, b to maximize.
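The steps above can be sketched directly: the logit map, its inverse, and the likelihood Q(a,b) as a product over cases. The exposure times and ignition outcomes below are hypothetical, and the coarse grid search is only a stand-in for the proper iterative maximization that a logistic regression procedure would perform.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))

def inv_logit(L):
    """Map L back to a probability: e^L / (1 + e^L)."""
    return math.exp(L) / (1.0 + math.exp(L))

def likelihood(a, b, x, responded):
    """Q(a, b): product of p(i) for responses and 1 - p(i) for non-responses."""
    q = 1.0
    for xi, ri in zip(x, responded):
        p = inv_logit(a + b * xi)
        q *= p if ri else 1.0 - p
    return q

# HYPOTHETICAL exposure times and ignition outcomes (1 = ignited).
time = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ignited = [0, 0, 1, 0, 1, 1]

# ASSUMPTION: a crude grid over a in [-8, 0] and b in [0, 4], step 0.1,
# used here only to illustrate "find a, b to maximize Q".
q_max, a_hat, b_hat = max(
    (likelihood(a / 10, b / 10, time, ignited), a / 10, b / 10)
    for a in range(-80, 1)
    for b in range(0, 41)
)

print(f"a = {a_hat:.1f}, b = {b_hat:.1f}, Q = {q_max:.4f}")
```

In practice the log of Q is maximized instead of Q itself, since a product of many probabilities becomes numerically tiny.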

60
(No Transcript)
61
Example Ignition