Multiple regression: a refresher


Provided by: edj9

1
Multiple regression: a refresher
In this and other sessions we will be using data
from the 2002 European Social Survey (ESS).
Measures of ten human values have
been constructed for 20 countries in the European
Union.
We will study one of the ten values, hedonism,
defined as pleasure and sensuous
gratification for oneself.
The scores on the hedonism variable range from
-3.76 to 2.90, where higher scores indicate more
hedonistic beliefs.
In this session we consider the application of
multiple regression to a subset of the data for
three countries only: the UK, Germany and France.
Hedonism is taken as the outcome variable in our
analysis. We consider three explanatory
variables:
Age in years
Gender (coded 0 for male and 1 for female)
Country (coded 1 for the UK, 2 for Germany and 3 for France)
2
Regression with a single continuous explanatory
variable
Line of best fit through the data: $y_i = \beta_0 + \beta_1 x_i + e_i$
Ordinary least squares estimates $\beta_0$ and $\beta_1$ so as to
minimise the sum of the squared values of $e_i$
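As a rough illustration of what ordinary least squares does, the sketch below fits a line to simulated data in Python. The data are synthetic (they are not the ESS data), and the function name ols_line and the simulation settings are assumptions made for this example only.

```python
import random
from statistics import fmean

def ols_line(x, y):
    """Return (b0, b1) minimising the sum of squared residuals."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Synthetic data (assumption): ages 14 to 98, true line 0.7 - 0.018 * age
rng = random.Random(0)
age = [rng.uniform(14, 98) for _ in range(2000)]
hedonism = [0.7 - 0.018 * a + rng.gauss(0, 0.5) for a in age]

b0, b1 = ols_line(age, hedonism)
residuals = [y - (b0 + b1 * a) for a, y in zip(age, hedonism)]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

With an intercept in the model, the fitted residuals sum to zero (up to rounding error), which is a quick sanity check on any least squares fit.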
3
Terminology
Y: response variable, outcome variable,
dependent variable
X: explanatory variable, predictor variable,
independent variable
4
Linear regression with a continuous predictor
Research questions
Is there an association between y and x?
For example, in the values data set, is there an
association between hedonism (y) and age (x)?
5
Interpretation
For every one-year increase in age, hedonism decreases
by 0.018 units.
At age = 0 (x = 0) the predicted average hedonism level is
0.712. The notion of the hedonism score of a
newborn baby, where hedonism is measured by
answers to survey questions put to people in the
age range 14 to 98 years, is not very meaningful.
6
Centering
When an x value of 0 is outside the range of x,
and therefore the interpretation of the intercept
is not meaningful, people often center the x
variable. In our data set we can center age
around its average value of 46 years. This gives
new intercept and slope estimates: the slope is
unchanged at -0.018, and the intercept becomes the
predicted hedonism at the average age.
Note that centering a predictor variable does not
change the estimate of the slope or the position
of the regression line through the data.
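The effect of centering can be checked numerically. The sketch below uses a small made-up data set (not the ESS data) and a hypothetical helper ols_line. It shows that the slope is the same before and after centering, and that the centered intercept equals the mean of y.

```python
from statistics import fmean

def ols_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up values for illustration only
age = [18, 25, 33, 41, 46, 52, 60, 67, 75, 83]
hedonism = [0.5, 0.3, 0.1, -0.1, -0.05, -0.3, -0.35, -0.6, -0.6, -0.9]

mean_age = fmean(age)
age_centered = [a - mean_age for a in age]

b0, b1 = ols_line(age, hedonism)            # raw age
c0, c1 = ols_line(age_centered, hedonism)   # centered age

print(f"raw:      intercept = {b0:.3f}, slope = {b1:.4f}")
print(f"centered: intercept = {c0:.3f}, slope = {c1:.4f}")
```

The centered intercept is simply the raw line evaluated at the average age, b0 + b1 * mean_age, so the fitted line itself has not moved.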
7
Linear regression with a continuous explanatory
variable: assumptions
1. Independence. The residuals ($e_i$) are assumed to
be independent of each other. This means that
knowing the value of the residual for one person
tells us nothing about the value of the residual
for any other person. The residuals are also assumed
to be independent of x, that is, $\mathrm{cov}(x_i, e_i) = 0$.
2. Normality. The residuals follow a Normal distribution,
that is, $e_i \sim N(0, \sigma_e^2)$.
3. Constant variance. The variance of the residuals is
constant with respect to x. This is known as homoskedasticity.
8
Constant variance assumption
Residual variance constant with respect to x:
homoskedasticity
Residual variance not constant with respect to x:
heteroskedasticity
9
Checking the model assumptions
We can evaluate the validity of assumptions 2 and
3 by use of diagnostic plots.
Assumption 2: Normality. Standardised residuals
plotted against Normal scores of standardised
residuals should lie on a straight line.
Assumption 3: Constant variance. The vertical scatter
of points should be roughly the same for any
value of x.
Assumption 1: Independence. If we suspect residuals are not
independent of each other then we can fit more
complex models to test this, for example a
multilevel model.
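The two graphical checks can also be summarised numerically. The sketch below is only an illustration: the residuals are simulated rather than taken from a fitted model, and the helper names normal_scores and correlation are hypothetical. It computes the correlation between sorted standardised residuals and Normal scores (close to 1 when the points lie on a straight line) and compares the residual spread in the lower and upper halves of x.

```python
import random
from statistics import NormalDist, fmean, pstdev

def normal_scores(n):
    """Standard Normal quantiles for ranks 1..n."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def correlation(u, v):
    u_bar, v_bar = fmean(u), fmean(v)
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

# Simulated residuals standing in for those of a fitted model (assumption)
rng = random.Random(1)
x = [rng.uniform(14, 98) for _ in range(1000)]
resid = [rng.gauss(0, 0.5) for _ in x]

# Assumption 2: sorted standardised residuals against Normal scores
mean_r, sd_r = fmean(resid), pstdev(resid)
std_resid = sorted((r - mean_r) / sd_r for r in resid)
qq_corr = correlation(std_resid, normal_scores(len(resid)))

# Assumption 3: residual spread for low and high values of x
pairs = sorted(zip(x, resid))
half = len(pairs) // 2
sd_low = pstdev([r for _, r in pairs[:half]])
sd_high = pstdev([r for _, r in pairs[half:]])

print(f"Normal plot correlation = {qq_corr:.4f}")
print(f"residual SD: low x = {sd_low:.3f}, high x = {sd_high:.3f}")
```

These summaries complement the plots rather than replace them. A plot shows where the departure from the assumption occurs, which a single number cannot.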
10
Hypothesis testing: p-values
Null hypothesis: there is no relationship
between hedonism and age in the population ($\beta_1 = 0$),
and the relationship we observe in the sample
could have arisen by chance.
Alternative hypothesis: there is a relationship
in the population ($\beta_1 \neq 0$).
The standard error is a measure of the
imprecision of our estimates (as the standard
error gets smaller, the precision of our estimates
increases). In our example $SE(\hat\beta_1) = 0.001$. We can
look at the Z or t ratio,
$z = \hat\beta_1 / SE(\hat\beta_1) = -0.018 / 0.001 = -18$,
which yields a p-value < 0.001. This says that if there
were no relationship in the population between
hedonism and age, we would expect less than 0.1%
of samples to produce a slope estimate of
magnitude greater than 0.018.
Note that the SE decreases with n, so that with
large enough samples any non-zero effect, however
small, becomes statistically significant.
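The z-ratio and its two-sided p-value can be computed directly from the estimate and its standard error. This is a minimal sketch assuming a standard Normal reference distribution. The function name z_test is hypothetical.

```python
import math

def z_test(estimate, se, null_value=0.0):
    """Return the z-ratio and two-sided p-value under a standard Normal."""
    z = (estimate - null_value) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # P(|Z| > |z|)
    return z, p

# Slope and standard error quoted in the slides
z, p = z_test(-0.018, 0.001)
print(f"z = {z:.1f}, p = {p:.3g}")
```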
11
Hypothesis testing: confidence intervals
Alternatively, but equivalently, we can construct
a 95% CI for $\beta_1$:
$\hat\beta_1 \pm 1.96 \times SE(\hat\beta_1) = -0.018 \pm 1.96 \times 0.001 = (-0.020, -0.016)$
Zero (the value of $\beta_1$ under the null hypothesis)
is well outside the 95% confidence interval, so
we reject the null hypothesis and conclude that
the relationship is statistically significant at
the 5% level. Note that -1.96 and 1.96 are the 2.5%
and 97.5% points of a standard Normal
distribution.
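The interval can be reproduced in a few lines. The sketch below assumes a Normal-based interval. The function name confidence_interval is hypothetical, and the critical value is looked up rather than hard-coded as 1.96.

```python
from statistics import NormalDist

def confidence_interval(estimate, se, level=0.95):
    """Normal-based confidence interval for a single coefficient."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return estimate - z_crit * se, estimate + z_crit * se

# Slope and standard error quoted in the slides
lower, upper = confidence_interval(-0.018, 0.001)
rejects_null = not (lower <= 0.0 <= upper)
print(f"95% CI = ({lower:.3f}, {upper:.3f}), reject beta_1 = 0: {rejects_null}")
```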
12
Comparing groups: regression with a single
categorical predictor
Suppose we fit the regression model
$y_i = \beta_0 + \beta_1 x_i + e_i$,
where $y_i$ is the hedonism score of individual i,
and $x_i = 1$ if the individual is female, and 0 if
the respondent is male. We then obtain
$\hat\beta_0 = -0.069$ and $\hat\beta_1 = -0.156$.
The predictions for men and women are
-0.069 and -0.069 - 0.156 = -0.225.
The difference between men and women has a
z-ratio of -0.156/0.025 = -6.24, so we reject the
null hypothesis that the male and female means
are equal. The 95% CI for $\beta_1$ is
(-0.206, -0.106).
13
Alternative parameterisation
In the model $y_i = \beta_0 + \beta_1 x_i + e_i$,
where $x_i = 1$ if the individual is a woman, and 0
if the respondent is a man, $\beta_0$ estimates the mean
hedonism for men and $\beta_1$ the difference between
the means for women and men. Suppose we
wanted to know the 95% CI for mean hedonism for
women. There are two ways to obtain it.
The first is to reparameterise the model so that
$\beta_0$ estimates the mean for men and $\beta_1$ estimates
the mean for women rather than the gender
difference.
To do this we need to know that the intercept
coefficient also has an explanatory variable
associated with it, namely a constant vector (a
vector of 1s). It is usually omitted from model
equations because $\beta_0 \times 1 = \beta_0$. However, if we change
the values in the explanatory variable
multiplying the intercept, it can change the
meaning of coefficients in the model.
MLwiN creates a hidden variable called cons for
this constant vector.
14
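Alternative parameterisation
Suppose we fit the regression where we recode
the explanatory variable associated with $\beta_0$ so it
is no longer a constant vector but an indicator
for men (1 for men, 0 for women):
$y_i = \beta_0\,\text{male}_i + \beta_1\,\text{female}_i + e_i$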
15
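Comparing the two models
Model 1 (constant vector plus female indicator):
the predictions for men and women are
$\hat\beta_0 = -0.069$ and $\hat\beta_0 + \hat\beta_1 = -0.069 - 0.156 = -0.225$.
Model 2 (male indicator plus female indicator):
the predictions for men and women are
$\hat\beta_0$ and $\hat\beta_1$, read off directly.
Note that both models predict identical means for
men and women.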
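The equivalence of the two parameterisations can be checked by fitting both to the same data. The sketch below is an illustration on simulated data, not the ESS data. The ols helper is a hypothetical small solver for the normal equations written for this example, not the routine MLwiN uses.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Simulated scores (assumption): group means loosely based on the slides
rng = random.Random(2)
female = [i % 2 for i in range(1000)]
y = [(-0.225 if f else -0.069) + rng.gauss(0, 0.9) for f in female]

# Model 1: constant vector of 1s plus the female indicator
b = ols([[1, f] for f in female], y)
# Model 2: the constant vector is replaced by a male indicator
g = ols([[1 - f, f] for f in female], y)

men_1, women_1 = b[0], b[0] + b[1]
men_2, women_2 = g[0], g[1]
print(f"model 1: men = {men_1:.3f}, women = {women_1:.3f}")
print(f"model 2: men = {men_2:.3f}, women = {women_2:.3f}")
```

Both fits return the two sample means. Only the meaning of the second coefficient differs (a difference in model 1, a mean in model 2).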
16
Testing functions of parameters
We can also obtain the confidence interval for
the mean hedonism for women from the model 1
estimates using a Wald test, which allows us to
construct confidence intervals for linear
combinations of parameters and to test against the
null hypothesis that these linear combinations
equal a particular value (typically 0).
For model 1, the Wald test gives
$1 \times (-0.069) + 1 \times (-0.156) \pm 0.034 = -0.225 \pm 0.034 = (-0.259, -0.191)$
Note this is very similar but not identical to
the confidence interval obtained from model 2,
(-0.262, -0.188). This is because different
testing procedures make different assumptions.
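A Wald interval for a linear combination needs the covariance matrix of the estimates as well as the estimates themselves. The slides do not report that matrix, so the values below are assumptions chosen only to be consistent with the reported standard error of 0.025 for the gender difference and a half-width near 0.034. The function name wald_ci is hypothetical.

```python
import math

def wald_ci(coefs, cov, weights, z_crit=1.96):
    """CI for the linear combination sum(w_j * b_j) of the coefficients."""
    est = sum(w * b for w, b in zip(weights, coefs))
    # Variance of the combination: w' V w
    var = sum(wi * wj * cov[i][j]
              for i, wi in enumerate(weights)
              for j, wj in enumerate(weights))
    half = z_crit * math.sqrt(var)
    return est, est - half, est + half

coefs = [-0.069, -0.156]          # model 1 estimates quoted in the slides
cov = [[0.000325, -0.000325],     # ASSUMED covariance matrix, not from the source
       [-0.000325, 0.000625]]     # 0.000625 = 0.025 ** 2

# Mean for women = 1 * beta_0 + 1 * beta_1
est, lower, upper = wald_ci(coefs, cov, [1, 1])
print(f"estimate = {est:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The negative covariance between the intercept and the difference matters here. Ignoring it would make the interval for the women's mean too wide.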
17
Comparing groups with more than two categories
We used two different parameterisations to
estimate the two gender means. Generally, the
first parameterisation, where the intercept is
multiplied by a constant vector of 1s, is
preferred. This is because when we add multiple
predictors into the model, interpretation of the
coefficients is more straightforward.
For every extra category in a predictor variable
we need to include an extra indicator or dummy
variable in our model. With an n-category
variable we need to include n-1 indicator
variables in addition to the intercept term to
model the means of the n groups. For example, to
model the three country means (UK, France and
Germany), with the UK as the reference category,
we can fit the model
$\text{hedonism}_i = \beta_0 + \beta_1\,\text{Germany}_i + \beta_2\,\text{France}_i + e_i$
18
Country difference in hedonism
For UK residents (Germany = 0, France = 0):
predicted hedonism = $-0.384 + (0.256 \times 0) + (0.492 \times 0) = -0.384$
For German residents (Germany = 1, France = 0):
predicted hedonism = $-0.384 + (0.256 \times 1) + (0.492 \times 0) = -0.128$
For French residents (Germany = 0, France = 1):
predicted hedonism = $-0.384 + (0.256 \times 0) + (0.492 \times 1) = 0.108$
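The dummy coding and the three predictions can be written out as a short Python sketch. The coefficient values are those quoted on the slide. The dictionary and function names are made up for this example.

```python
# Estimates quoted in the slides; UK is the reference category
COEF = {"intercept": -0.384, "germany": 0.256, "france": 0.492}

def dummies(country):
    """Return the (Germany, France) indicator pair; UK maps to (0, 0)."""
    return int(country == "Germany"), int(country == "France")

def predict(country):
    germany, france = dummies(country)
    return (COEF["intercept"]
            + COEF["germany"] * germany
            + COEF["france"] * france)

for country in ("UK", "Germany", "France"):
    print(country, dummies(country), round(predict(country), 3))
```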
19
Hypothesis testing for categorical predictors
with more than two groups
What if we want to test the France/Germany
difference?
We could reparameterise the model so that
Germany, instead of the UK, was the reference
category.
Or we could conduct a Wald test on the equality
of the Germany and France coefficients.
20
More than one predictor variable: statistical
control
When modelling the effects of the country
predictor variable we already entered multiple
dummy or indicator variables.
We can add multiple predictor variables into our
model, where categorical predictor variables will
be handled by a set of dummy variables and
continuous predictor variables will be handled by
including the variable directly.
Once we include more than one predictor variable,
our model can address the issue of statistical
control:
Does the association of one predictor variable
with the response persist when we simultaneously
account for further predictor variables?
21
Example of statistical control with the hedonism
data
We have already seen that:
Women are less hedonistic than men.
Hedonism decreases with age.
However, women live longer than men, so some of
the gender gap will be due to the fact that women
are on average older than men.
Some, but how much?
We can answer this question by fitting age and
gender in the same model.
This will tell us if the gender gap persists
after controlling for age.
22
Modelling gender and age simultaneously
The gender effect in the model where gender is
the only predictor variable is -0.156. So the
gender effect persists strongly after controlling
for age.
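The logic of statistical control can be seen on simulated data. In the sketch below (an illustration only, not the ESS data) women are made five years older on average, the true gender effect is set to -0.15 and the true age slope to -0.018. All of these settings, and the small ols helper, are assumptions for the example.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

rng = random.Random(3)
n = 2000
female = [i % 2 for i in range(n)]
# Centered age; women are 5 years older on average (assumed)
age = [rng.gauss(5 if f else 0, 15) for f in female]
y = [-0.15 * f - 0.018 * a + rng.gauss(0, 0.1) for f, a in zip(female, age)]

# Gender as the only predictor
unadjusted = ols([[1, f] for f in female], y)[1]
# Gender and age in the same model
adjusted = ols([[1, f, a] for f, a in zip(female, age)], y)[1]

print(f"gender effect, gender only:     {unadjusted:.3f}")
print(f"gender effect, controlling age: {adjusted:.3f}")
```

The unadjusted estimate mixes the gender effect with the age difference between the groups. Adding age to the model removes that part and leaves an estimate close to the value used in the simulation.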
23
Statistical control: another example
Imagine attainment scores in two schools.
Fitting school as a single predictor:
School B has a negative effect.
But controlling for prior ability:
School B has a positive effect.
24
Interactions between predictor variables
Recall our model with age and gender effects.
It may be that the gender gap changes as a
function of age or, equivalently, that the age slope is
not the same for men and women.
We can test for this by including an interaction
between age and female as an extra explanatory
variable in the model.
We do this by including a variable that is the
product of age and female.
25
Gender x Age interaction effects
Results
This gives a prediction line for males (female = 0) of
$-0.058 - 0.153 \times 0 - 0.019\,\text{age}_i + 0.002 \times 0 \times \text{age}_i = -0.058 - 0.019\,\text{age}_i$
and for females (female = 1) of
$-0.058 - 0.153 \times 1 - 0.019\,\text{age}_i + 0.002 \times 1 \times \text{age}_i = (-0.058 - 0.153) + (-0.019 + 0.002)\,\text{age}_i$
That is, females have an intercept 0.153 lower
than males and a slope 0.002 greater than males.
Note the gender difference in the slopes, 0.002,
has a z-ratio of 2, so is just statistically
significant at the 5% level.
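The two prediction lines follow mechanically from the four coefficients. A small Python sketch makes the arithmetic explicit, using the estimates quoted on the slide. The constant and function names are made up for this example, and age is taken in whatever units the fitted model used.

```python
# Estimates quoted in the slides
B0, B_FEMALE, B_AGE, B_INTERACTION = -0.058, -0.153, -0.019, 0.002

def predict(female, age):
    """Predicted hedonism from the model with a female x age interaction."""
    return B0 + B_FEMALE * female + B_AGE * age + B_INTERACTION * female * age

def line_for(female):
    """Intercept and slope of the prediction line for one gender."""
    return B0 + B_FEMALE * female, B_AGE + B_INTERACTION * female

male_line = line_for(0)
female_line = line_for(1)
print("male:   intercept = %.3f, slope = %.3f" % male_line)
print("female: intercept = %.3f, slope = %.3f" % female_line)
```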
26
Graphing the lines
male: $-0.058 - 0.019\,\text{age}_i$
female: $(-0.058 - 0.153) + (-0.019 + 0.002)\,\text{age}_i$
The slightly flatter (less negative) slope for
females means the gender gap decreases with age.
27
Examining the gender gap
We may want to know: does the gender gap remain
statistically significant even at higher ages,
when it is diminished?
The gender gap (female minus male prediction) is
$-0.153 + 0.002\,\text{age}_i$
So we can plot this function with its
associated confidence envelope and see for which
ages the confidence interval does not include
0 (no gender gap).
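The confidence envelope depends on the variances and covariance of the female and interaction coefficients, which the slides do not report. The sketch below therefore uses assumed values: a standard error of 0.025 for the female coefficient (a placeholder), 0.001 for the interaction (implied by its z-ratio of 2), and zero covariance. It illustrates the calculation only and is not expected to reproduce the slide's graph.

```python
import math

B_FEMALE, B_INTERACTION = -0.153, 0.002   # estimates quoted in the slides
VAR_FEMALE = 0.025 ** 2        # ASSUMED standard error of 0.025
VAR_INTERACTION = 0.001 ** 2   # implied by 0.002 having a z-ratio of 2
COV = 0.0                      # ASSUMED covariance between the two estimates

def gender_gap(age, z_crit=1.96):
    """Gender gap at a given age with a Wald 95% confidence interval."""
    gap = B_FEMALE + B_INTERACTION * age
    var = VAR_FEMALE + age ** 2 * VAR_INTERACTION + 2 * age * COV
    half = z_crit * math.sqrt(var)
    return gap, gap - half, gap + half

for age in range(0, 51, 10):
    gap, lower, upper = gender_gap(age)
    verdict = "excludes 0" if upper < 0 or lower > 0 else "includes 0"
    print(f"age {age:2d}: gap = {gap:.3f}, CI = ({lower:.3f}, {upper:.3f}) {verdict}")
```

The interval widens as age moves away from zero because the interaction term's uncertainty is multiplied by age, while the gap itself shrinks. Together these make the gap indistinguishable from zero at high enough ages.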
28
Graphing the gender gap with 95% CI
The gender gap becomes statistically
insignificant at (age - 46) = 30, that is, at 76 years.
