
Transcript and Presenter's Notes

Title: Use of Weighted Least Squares


1
Use of Weighted Least Squares
2
In fitting models of the form
y_i = f(x_i) + ε_i,  i = 1, …, n,
least squares is optimal under the condition
ε_1, …, ε_n are i.i.d. N(0, σ²)
and is a reasonable fitting method when this condition is at least
approximately satisfied. (Most importantly, we require here that there
should be no significant outliers.)
3
In the case where we have instead
ε_1, …, ε_n are independent N(0, σ_i²),
it is natural to use instead weighted least squares: choose f from
within the permitted class of functions to minimise
Σ w_i (y_i − f(x_i))²,
where we take w_i proportional to 1/σ_i² (clearly only relative weights
matter).
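Written out in R, the quantity being minimised is just a weighted residual
sum of squares; lm() with a weights argument, as used on the next slide,
minimises the same quantity. A minimal sketch (the function name is only
for illustration):

wls.objective <- function(y, fitted, w) {
  # weighted residual sum of squares: sum of w_i * (y_i - f(x_i))^2
  sum(w * (y - fitted)^2)
}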


4
For the hill races data, it is natural to assume
greater variability in the times for the longer
races, with the variability perhaps proportional
to the distance. We therefore try refitting the
quadratic model with weights proportional to
1/distance²:
> model2w <- lm(time ~ -1 + dist + I(dist^2) + climb + I(climb^2),
+               data = hills[-18,], weights = 1/dist^2)
5
(No Transcript)
6
The fitted model is now
time = 4.94 distance + 0.0548 (distance)² + 0.00349 climb + 0.00000134 (climb)² + ε.
Note that the residual summary above is on a
reweighted scale, and cannot be directly
compared with the earlier residual
summaries. While the coefficients here appear to
have changed somewhat from those in the earlier,
unweighted, fit of Model 2, the fitted model is
not really very different.
7
This is confirmed by the plot of the residuals
from the weighted fit against those from the
unweighted fit, produced by
> plot(resid(model2w) ~ resid(model2))
8
(No Transcript)
9
Resistant Regression
10
As already observed, least squares fitting is
very sensitive to outlying observations. However,
there are also a large number of resistant
fitting techniques available. One such is least
trimmed squares: choose f from within the
permitted class of functions to minimise the sum
of the q smallest of the squared residuals
(y_i − f(x_i))², where q is roughly half of n, so
that the largest squared residuals are simply
ignored.
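In R the trimmed criterion is easy to write down directly. A minimal
sketch (the function name is only for illustration, and taking q as
roughly half the observations is an assumption here; lqs(), used below,
applies its own default):

lts.objective <- function(y, fitted, q = floor(length(y) / 2) + 1) {
  r2 <- sort((y - fitted)^2)   # squared residuals, smallest first
  sum(r2[1:q])                 # keep only the q smallest; large residuals are ignored
}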

11
(No Transcript)
12
Example: phones data. The R dataset phones in
the package MASS gives the annual number of phone
calls (millions) in Belgium over the period
1950-73. Consider the model
calls = a + b year.
The following two graphs plot the data
and show the result of fitting the model by
least squares and then fitting the same model by
least trimmed squares.
13
(No Transcript)
14
These graphs are produced by the following code:
> plot(calls ~ year)
> phonesls <- lm(calls ~ year)
> abline(phonesls)
> plot(calls ~ year)
> library(lqs)
> phoneslts <- lqs(calls ~ year)
> abline(phoneslts)
15
The explanation for the data is that, for a period
of time, the total length of all phone calls in each
year was accidentally recorded instead of the number
of calls.
16
Nonparametric Regression
17
Sometimes we simply wish to fit a smooth model
without specifying any particular functional form
for f. Again there are very many techniques here.
One such is called loess. This constructs the
fitted value f(xi) for each observation i by
performing a local regression using only those
observations with x values in the neighbourhood
of xi (and attaching most weight to the closest
observations).
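To illustrate the idea only (loess itself fits local quadratics and adds
further refinements), a crude local linear fit at a single target point x0
could look like the sketch below; the function name and the use of tricube
weights here are just for illustration:

local.fit <- function(x, y, x0, span = 0.75) {
  k <- ceiling(span * length(x))           # number of neighbouring observations used
  d <- abs(x - x0)
  idx <- order(d)[1:k]                     # the k observations nearest to x0
  w <- (1 - (d[idx] / max(d[idx]))^3)^3    # tricube weights: closest points count most
  fit <- lm(y[idx] ~ x[idx], weights = w)  # weighted local regression
  sum(coef(fit) * c(1, x0))                # fitted value at x0
}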

18
Example cars data. The R data frame cars (in
the base package) records 50 observations of
speed (mph) and stopping distance (ft). These
observations were collected in the 1920s! We
treat stopping distance as the response variable
and seek to model its dependence on speed.
19
(No Transcript)
20
We try to fit a model using loess. Possible R code is
> data(cars)
> attach(cars)
> plot(cars)
> library(modreg)
> carslo <- loess(dist ~ speed)
> lines(fitted(carslo) ~ speed)
21
(No Transcript)
22
An optional argument span can be increased from
its default value of 0.75 to give more smoothing:
> plot(cars)
> carslo2 <- loess(dist ~ speed, span = 1)
> lines(fitted(carslo2) ~ speed)
23
(No Transcript)
24
More robust and resistant fits can be given by
specifying the further optional argument
family = "symmetric".
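For example, with the cars data still attached, one possible robust fit
(the object name carslo3 is just for illustration) is

> carslo3 <- loess(dist ~ speed, family = "symmetric")
> lines(fitted(carslo3) ~ speed)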
25
Models with Qualitative Explanatory Variables
(Factors)
Data: n = 22 pairs (x_i, y_i), where y is the response;
the data arise under two different sets of conditions
(type 1 or type 2) and are presented below, sorted by x
within type.
26
Row     y     x  type
  1   3.4   2.4     1
  2   4.6   2.8     1
  3   3.8   3.7     1
  4   5.0   4.4     1
  5   4.4   5.1     1
  6   5.7   5.2     1
  7   6.4   6.0     1
  8   6.6   7.9     1
  9   8.9   8.4     1
 10   6.7   8.9     1
 11   7.9   9.6     1
 12   8.7  10.4     1
 13   9.1  12.0     1
 14  10.1  12.9     1
 15   7.1   5.1     2
 16   7.2   6.3     2
 17   8.6   7.2     2
 18   8.3   8.1     2
 19   9.7   8.8     2
 20   9.2   9.1     2
 21  10.2   9.6     2
 22   9.8  10.0     2
27
(No Transcript)
28
Distinguishing the two types (an appropriate R
command will do this; one possibility is sketched below)
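One possible command (a sketch only, assuming the variables x, y and the
factor type are available) is

> plot(x, y, pch = as.numeric(type))
> legend("topleft", legend = c("type 1", "type 2"), pch = 1:2)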
29
We model the responses, first ignoring the variable type:
> mod1 <- lm(y ~ x)
> abline(mod1)
30
(No Transcript)
31
> summary(mod1)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-1.58460 -0.83189 -0.07654  0.79318  1.48079

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.4644     0.6249   3.944 0.000803 ***
x             0.6540     0.0785   8.331  6.2e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.033 on 20 degrees of freedom
Multiple R-Squared: 0.7763,     Adjusted R-squared: 0.7651
F-statistic:  69.4 on 1 and 20 DF,  p-value: 6.201e-08
32
> summary.aov(mod1)
            Df Sum Sq Mean Sq F value    Pr(>F)
x            1 74.035  74.035  69.398 6.201e-08 ***
Residuals   20 21.336   1.067
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
33
We now model the responses using a model which
includes the qualitative variable type, which was
declared as a factor when the data frame was set up:
> type <- factor(c(rep(1,14), rep(2,8)))
> mod2 <- lm(y ~ x + type)
34
> summary(mod2)

Call:
lm(formula = y ~ x + type)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90463 -0.39486 -0.03586  0.34657  1.59988

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.18426    0.37348   5.848 1.24e-05 ***
x            0.60903    0.04714  12.921 7.36e-11 ***
type2        1.69077    0.27486   6.151 6.52e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6127 on 19 degrees of freedom
Multiple R-Squared: 0.9252,     Adjusted R-squared: 0.9173
F-statistic: 117.5 on 2 and 19 DF,  p-value: 2.001e-11
35
Interpreting the output: the fit is
fitted y = 2.184 + 0.609 x                              (type 1)
fitted y = 2.184 + 1.691 + 0.609 x = 3.875 + 0.609 x    (type 2)
so e.g. for observation 1 (x = 2.4, type 1) the fitted value is
2.184 + 0.609 × 2.4 ≈ 3.65, and for observation 20 (x = 9.1, type 2)
it is 3.875 + 0.609 × 9.1 ≈ 9.42.
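The two fitted values can be checked directly from the stored coefficients
(a small check, assuming mod2 as fitted above):

> b <- coef(mod2)
> b["(Intercept)"] + b["x"] * 2.4               # observation 1  (type 1, x = 2.4):  3.6459
> b["(Intercept)"] + b["type2"] + b["x"] * 9.1  # observation 20 (type 2, x = 9.1):  9.4172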
36
(No Transcript)
37
> summary.aov(mod2)
            Df Sum Sq Mean Sq F value    Pr(>F)
x            1 74.035  74.035 197.223 1.744e-11 ***
type         1 14.204  14.204  37.838 6.522e-06 ***
Residuals   19  7.132   0.375
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
38
The fitted values for Model 2 can be obtained in R by
> fitted.values(mod2)
        1         2         3         4         5         6         7         8
 3.645930  3.889543  4.437671  4.863993  5.290315  5.351218  5.838443  6.995603
        9        10        11        12        13        14        15        16
 7.300119  7.604634  8.030956  8.518181  9.492632 10.040760  6.981083  7.711921
       17        18        19        20        21        22
 8.260049  8.808177  9.234499  9.417209  9.721724  9.965337
39
The total variation in the responses is Syy = 95.371.
The variable x explains 74.035 of this total (77.6%),
and the coefficient associated with it (0.6090) is
highly significant (significantly different from 0):
it has a negligible P-value.

40
In the presence of x, type explains a further
14.204 of the total variation, and its coefficient
is also highly significant. Together the two
variables explain 92.5% of the total variation.
In the presence of x, we gain much by including
type.
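These percentages come straight from the ANOVA tables; as a quick check in R:

> 74.035 / 95.371              # proportion of Syy explained by x alone:  0.776
> (74.035 + 14.204) / 95.371   # proportion explained by x and type:     0.925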
41
Finally we extend the previous model (mod2) by
allowing for an interaction between the
explanatory variables x and type. An interaction
exists between two explanatory variables when the
effect of one on a response variable is different
at different values/levels of the other.
42
For example, consider the effect of policyholders'
age and gender on a response variable, claim rate.
If the effect of age on claim rate is different
for males and females, then there is an
interaction between age and gender.
43
> mod3 <- lm(y ~ x * type)
> summary(mod3)

Call:
lm(formula = y ~ x * type)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90080 -0.38551 -0.01445  0.36309  1.60651

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.22119    0.40345   5.506 3.15e-05 ***
x            0.60385    0.05152  11.721 7.36e-10 ***
type2        1.35000    1.20826   1.117    0.279
x:type2      0.04305    0.14843   0.290    0.775
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.628 on 18 degrees of freedom
Multiple R-Squared: 0.9256,     Adjusted R-squared: 0.9132
F-statistic:  74.6 on 3 and 18 DF,  p-value: 2.388e-10
44
> summary.aov(mod3)
            Df Sum Sq Mean Sq  F value    Pr(>F)
x            1 74.035  74.035 187.7155 5.810e-11 ***
type         1 14.204  14.204  36.0142 1.124e-05 ***
x:type       1  0.033   0.033   0.0841    0.7751
Residuals   18  7.099   0.394
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
45
The interaction appears to have added nothing -
the coefficient of determination is effectively
unchanged compared to the previous model. We also
note that the extra parameter value is small and
is not significant. In this particular case, an
interaction term is not helpful - including it
has simply confused the issue.
46
In a case where an interaction term does improve
the fit and its coefficient is significant, both
variables and the interaction between them
should be included in the model.