Title: Statistical Data Analysis 0655'221A
1. Example
- The data in this example come from measurements of height (cm) and weight (kg) on 42 electrical engineering students.
- The data are actually self-reported, so we must be aware that there is measurement error in both variables.
- This could affect our results, but the most likely consequence is that it will make the data harder to explain, because they will be noisier.
- The question is: is there a relationship between height and weight (we expect there is), and if so, what form does it take?
2. Plot the data
- There appears to be some evidence of a linear relationship between weight and height.
- The relationship is positive: weight increases with height.
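A minimal sketch of this scatterplot in R, assuming the data are in a data frame called heights with columns Weight and Height (the same names used in the lm() call on the next slide):

  # Scatterplot of weight against height
  plot(Weight ~ Height, data = heights,
       xlab = "Height (cm)", ylab = "Weight (kg)")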
3. Fit the model
- We propose the following model:
  $$\text{Weight}_i = \beta_0 + \beta_1\,\text{Height}_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
- We fit the model in a very similar manner to that for ANOVA:
- fit <- lm(Weight ~ Height, data = heights)
- To get some diagnostic plots (the pred-res plot and the norplot) we type:
- plot(fit)
4.
- Everything seems to be okay in the pred-res plot.
- There may be a slight funnel effect, but nothing significant.
5.
- Our assumption of normality may be violated, but for the time being we will ignore this.
6.
> summary(fit)

Call:
lm(formula = Weight ~ Height, data = heights)

Residuals:
     Min       1Q   Median       3Q      Max
 -16.310   -6.080   -2.714    5.574   20.021

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -93.2313    26.0258  -3.582 0.000914 ***
Height        0.9327     0.1492   6.252 2.09e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.851 on 40 degrees of freedom
Multiple R-Squared: 0.4942, Adjusted R-squared: 0.4816
F-statistic: 39.09 on 1 and 40 degrees of freedom, p-value: 2.089e-07
7. Influence
- Outliers can cause many problems with statistical analyses.
- In regression, outliers have the potential to alter the fit.
- How?
- Recall that the least squares procedure tries to minimise
  $$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- If one particular observation is a long way from the bulk of the data, i.e. if $(y_i - \hat{y}_i)^2$ is large, then it will generally influence the fit.
8.
- The R2 for the black line is 0.8995.
- When we drop the influential point from the analysis the R2 is 0.9805.
- The lines may not look too different on the plot, but if we look at the regression tables we can see a vast difference:

With the influential point (black line):
             Estimate   Std.Err   t-value   Pr(>|t|)
Intercept    -49.1483   11.8924   -4.1327   1e-04
X              3.5482    0.1192   29.7659   0e+00

Without the influential point:
             Estimate   Std.Err   t-value   Pr(>|t|)
Intercept      7.7864    4.2126    1.8484   0.0676
X              2.9696    0.0423   70.1278   0.0000
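This effect is easy to reproduce in R with simulated data. The following is an illustrative sketch (the simulated numbers are made up, not the data behind the tables above):

  set.seed(1)
  x <- c(1:30, 60)                      # one x value far from the bulk of the data
  y <- c(5 + 3 * (1:30) + rnorm(30, sd = 4), 50)   # its y value breaks the trend
  fit.all  <- lm(y ~ x)                 # fit with all 31 points
  fit.drop <- lm(y ~ x, subset = -31)   # fit with the influential point removed
  summary(fit.all)$r.squared            # R2 with the point included
  summary(fit.drop)$r.squared           # R2 without it
  rbind(all = coef(fit.all), dropped = coef(fit.drop))  # compare the coefficients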
9.
- Outliers have the most influence when they are a long way from the mean of the x values and the mean of the y values.
- You can think of the regression line as a plank on a fulcrum.
- The fulcrum is centred at the mean.
- If you place a heavy weight near the fulcrum, the tilt of the plank will not change too much.
- However, if you place such a weight at the end of the plank, then the tilt will change considerably.
- Points with high influence are also called high leverage points.

In this plot, although the point is a long way from the mean of the y values, it is very close to the mean of the x values; consequently it will have a large residual, but will not be influential.
10. Detecting influential observations
- This can be hard to do.
- However, in general, if an observation has high influence then it will have a small residual.
- However, this in itself is not useful; we need to know how much of a part each observation played in giving the fitted value.
- Firstly, let us re-write our regression model in a more convenient form.
11. Matrix form of the regression model (not examinable)
- Let y be an (n × 1) vector of responses, $y = (y_1, y_2, \ldots, y_n)^T$.
- Let β be a (2 × 1) vector of coefficients, $\beta = (\beta_0, \beta_1)^T$.
- Let ε be an (n × 1) vector of errors, $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$.
- And let X be an (n × 2) matrix with a column of 1s and a column of the x values:
  $$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$
- Then we can rewrite our regression model as
  $$y = X\beta + \varepsilon$$
12. Hat matrix
- If we write our model in this form, then it is very simple to write down the least squares estimates of the slope and the intercept:
  $$\hat{\beta} = (X^T X)^{-1} X^T y$$
- Now if we multiply each side by X, then we get
  $$\hat{y} = X\hat{\beta} = X(X^T X)^{-1} X^T y = Hy, \qquad \text{where } H = X(X^T X)^{-1} X^T$$
- The matrix H is called the hat matrix, and this equation shows us that each fitted value is a (linear) combination of all the observed y values.
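We can verify this numerically for the heights example. A sketch, assuming fit and heights from the earlier slides:

  X <- cbind(1, heights$Height)           # design matrix: column of 1s, then the x values
  H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix H = X (X^T X)^{-1} X^T
  all.equal(as.vector(H %*% heights$Weight),
            as.vector(fitted(fit)))       # H y reproduces the fitted values
  all.equal(diag(H),
            as.vector(lm.influence(fit)$hat))  # diag(H) matches R's hat diagonals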
13. Hat matrix diagonals
- From this we can see that the larger the value of $h_{ij}$, the more influence $y_j$ has on the fitted value $\hat{y}_i$.
- In general, if $h_{ii}$ (the ith diagonal element of the hat matrix) is large, then the ith observation is influential.
- The values $h_{ii}$ for $i = 1, \ldots, n$ are called the hat matrix diagonals and are produced by most regression analysis packages (not Excel).
14. Leverage plots
- The best way to use the hat matrix diagonals is in a leverage plot.
- A leverage plot puts the hat matrix diagonals on the y-axis and the squared residuals on the x-axis.
- This plot can be divided into four quadrants.
15. Large hat matrix diagonal?
- What do we mean by large?
- It varies, but there are a couple of rules of thumb.
- If k is the number of coefficients (excluding the intercept), then a point with $h_{ii} > 2(k+1)/n$ might be worth investigating, and an $h_{ii} > 3(k+1)/n$ is considered large.
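These bounds are easy to check in R. A sketch, assuming fit is our fitted lm object:

  h <- lm.influence(fit)$hat     # hat matrix diagonals
  k <- length(coef(fit)) - 1     # number of coefficients, excluding the intercept
  n <- length(h)
  which(h > 2 * (k + 1) / n)     # points worth investigating
  which(h > 3 * (k + 1) / n)     # points with large leverage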
16. Interpreting leverage plots
- How do we interpret this plot?
- If points are in the lower left quadrant (small residual, small hat matrix diagonal), then we can ignore them.
- If they're in the upper left (small residual, large hat matrix diagonal), we might consider these points influential and drop them from the regression.
- If they're in the lower right corner (large residual, small hat matrix diagonal), then they're not influential, but they will put unnecessary noise into the model, i.e. they don't affect the fit, but they do affect the significance of the fit.
- If they're in the upper right corner (large residual, large hat matrix diagonal), we're in trouble!
17. Example
- The data in the following example come from ultrasound measurements on 40 babies. The variables are bi-parietal diameter (BPD), a measure of brain size in mm, and birth weight (BW) in grams.
- What is the first thing we should do?
- We have two continuous variables relating to the same individual.
- Therefore a scatterplot is the most appropriate plot.
- We hope to see what trend (if any) exists in the data set.
18.
- The relationship between BW and BPD looks approximately linear, therefore we propose a regression model.
19. Fitting the model
- We assume that BW and BPD are in a data frame called babies. We type:
- fit <- lm(BW ~ BPD, data = babies)
- plot(fit)
20.
- The pred-res plot shows the residuals to be in a homogeneous band around zero.
- There is slight evidence that the residuals increase with the predictors.
- The norplot is approximately straight.
- There are a couple of residuals that are perhaps too large.
- There is some bunching in the plot.
21. Constructing a leverage plot
- We can ask R to give us the hat matrix diagonals as well as the residuals:
- res <- resid(fit)
- hats <- lm.influence(fit)$hat
- To construct the leverage plot we need the squared residuals, so we make a new variable ressq:
- ressq <- res^2
- We now plot hats vs. ressq:
- plot(ressq, hats)
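To relate the plot to the rules of thumb from earlier, we can add the bounds as horizontal lines. A sketch, using k = 1 and n = 40 for these data:

  abline(h = 2 * (1 + 1) / 40, lty = 2)  # 2(k+1)/n: worth investigating
  abline(h = 3 * (1 + 1) / 40, lty = 1)  # 3(k+1)/n = 0.15: large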
22.
- We have one coefficient in this model (besides the intercept), so k = 1, and n = 40 observations, so 3(k+1)/n = (3 × 2)/40 = 3/20 = 0.15.
- Looking at our plot we can see there is one point that is larger than this bound.
- We find this is point 23 via the R statement (1:40)[hats > 0.15].
23.
- If we go back to the original data, we can see that point 23 is a low birth weight baby.
- This point has high leverage, because it is anchoring the regression line.
- What happens if we leave it out? (See the refit sketched below.)
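In R we drop point 23 with the subset argument; this is the call that produces the second regression table below:

  fit2 <- lm(BW ~ BPD, data = babies, subset = -23)  # refit without observation 23
  summary(fit2)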
24. Regression analysis (before deletion of point 23)

Call:
lm(formula = BW ~ BPD, data = babies)

Residuals:
    Min      1Q  Median      3Q     Max
-287.72 -132.50   28.93  102.50  345.61

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2109.473    232.063   -9.09 4.52e-11 ***
BPD            45.418      3.065   14.82  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 158.1 on 38 degrees of freedom
Multiple R-Squared: 0.8525, Adjusted R-squared: 0.8486
F-statistic: 219.6 on 1 and 38 degrees of freedom, p-value: 0
25. Regression analysis (after deletion of point 23)

Call:
lm(formula = BW ~ BPD, data = babies, subset = -23)

Residuals:
    Min      1Q  Median      3Q     Max
-285.75 -137.38   14.09  111.16  347.62

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2208.988    254.803  -8.669 1.93e-10 ***
BPD            46.685      3.345  13.956 2.22e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 158.3 on 37 degrees of freedom
Multiple R-Squared: 0.8404, Adjusted R-squared: 0.8361
F-statistic: 194.8 on 1 and 37 degrees of freedom, p-value: 2.22e-16
26. Removing influential points
- Rarely successful.
- Especially when we have the anchoring situation, as with these data.
- What usually happens is that as you delete one influential point, another becomes influential.
- So what should we do?
- Only remove points if it substantially improves the model fit:
  - i.e. if the adj-R2 increases;
  - if the coefficients change;
  - if the significance changes.
- Always record which points you removed and why.
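These checks are quick to run in R. A sketch, assuming fit and fit2 are the fits before and after deleting point 23:

  summary(fit)$adj.r.squared    # 0.8486 before deletion
  summary(fit2)$adj.r.squared   # 0.8361 after: no improvement, so keep the point
  coef(summary(fit))            # coefficient table before deletion
  coef(summary(fit2))           # ...and after, to compare coefficients and significance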