Title: Lecture 8 Regression: Relationships between continuous variables
1 Lecture 8 Regression: Relationships between continuous variables
Slides available from the Statistics & SPSS page of www.gpryce.com
- Social Science Statistics Module I
- Gwilym Pryce
2 Notices
- Register
- Revision lecture next week
- Worked examples on
- Confidence Intervals?
- Hypothesis Tests?
- Regression?
- Email me any particular issues
- Learning Support strategy
3 Learning Support strategy
- 1. Independent learning
  - This is a PG course and a degree of independent learning is assumed.
  - Do the reading, attend the labs, review the lectures, and make use of the computer labs and online help in your own time.
- 2. Lab Overview & Feedback
  - Please feed back to the tutors & Class Reps how you think that is going and how it could be improved.
  - Tutors and Class Reps will then report back to me how things are going each week.
- 3. Talk to tutors if you are struggling
  - Let the tutors know if you are struggling (assuming you have done the reading, attended labs etc.)
  - Tutors cannot guarantee extra support, but it might be possible to arrange extra tutorials etc.
- 4. Support from the University's Maths Adviser, Shazia Ahmed
  - If you have gone through steps 1 to 3, Shazia has agreed to run one-on-one sessions with students who are struggling with particular mathematical or statistical concepts (though she has made it clear that she cannot advise on SPSS problems, nor will she do the assignment for you).
  - Students who have particular problems in this regard can contact her directly:
  - Shazia Ahmed, Maths Adviser, Student Learning Service, McMillan Reading Room, Tel 330 5631, Fax 330 8063
- 5. Departmental Support
  - Struggling students should enquire whether their own dept has support to offer.
  - All the grad school courses are only intended to constitute a generic training component.
  - Individual depts & supervisors should supplement with additional training and support as necessary.
- 6. Tutor of Last Resort
  - Students who have gone through steps 1 to 5 above, and who still feel they are not receiving enough support, can email me directly
4 Plan
- 1. Linear & Non-linear Relationships
- 2. Fitting a line using OLS
- 3. Inference in Regression
- 4. Omitted Variables & R²
- 5. Categorical Explanatory Variables
- 6. Summary
5 1. Linear & Non-linear relationships between variables
- Often of greatest interest in social science is the investigation of relationships between variables:
  - is social class related to political perspective?
  - is income related to education?
  - is worker alienation related to job monotony?
- We are also interested in the direction of causation, but this is more difficult to prove empirically
  - our empirical models are usually structured assuming a particular theory of causation
6 Relationships between scale variables
- The most straightforward way to investigate evidence for a relationship is to look at scatter plots
- It is traditional to:
  - put the dependent variable (i.e. the effect) on the vertical axis, or y axis
  - put the explanatory variable (i.e. the cause) on the horizontal axis, or x axis
7 Scatter plot of IQ and Income
8 We would like to find the line of best fit
Predicted values (i.e. values of y lying on the line of best fit) are given by y_hat = a + b×xi, where a is the y-intercept and b is the slope
9 What does the output mean?
10 Sometimes the relationship appears non-linear
11 ... a straight line of best fit is not always very satisfactory
12 Could try a quadratic line of best fit
13 We can simulate a non-linear relationship by first transforming one of the variables
14 e.g. squaring IQ and taking the natural log of IQ
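As a rough sketch of the transformation idea, the code below squares a variable and takes its natural log, then fits a straight line to each transformed version. The data are made up for illustration (they are not the IQ and income data from the slides), and `fit_line` is a hypothetical helper name using the standard least-squares formulas.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data (not from the lecture): y follows a curve in x.
iq = [80, 90, 100, 110, 120]
income = [10 + 0.002 * x ** 2 for x in iq]

iq_squared = [x ** 2 for x in iq]     # transformed variable: IQ squared
log_iq = [math.log(x) for x in iq]    # transformed variable: ln(IQ)

# A straight line in the transformed variable is a curve in the original one.
a_sq, b_sq = fit_line(iq_squared, income)
a_ln, b_ln = fit_line(log_iq, income)
```

Because the made-up income values are an exact function of IQ squared, the line fitted on `iq_squared` recovers the intercept 10 and slope 0.002; the log version fits less well, which is the kind of comparison you would make between candidate transformations.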
15 ... or a cubic line of best fit (over-fitted?)
16 Or could try two linear lines: structural break
17 2. Fitting a line using OLS
- The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation:
$$\min_{a,\,b} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ = observed value of y, and $\hat{y}_i$ = predicted value of $y_i$ = the value on the line of best fit corresponding to $x_i$
18 Example: School Performance in 8 Schools
y = school performance; x = ave. HH income of pupils (£000s)
- Write this model output as an equation.
- When xi = 41 what is the value of yi?
- When xi = 41 what is the value of y_hat?
- What is the difference between yi and y_hat when xi = 41, and what does this difference mean?
- Where does the line of best fit cut the vertical axis?
- What is the value of school performance when average HH income of pupils is zero?
- How sensitive is school performance to the economic status of its intake?
- How is this sensitivity calculated?
19
- y_hat = 6 + 2xi
- yi = 6 + 2xi + ei
- From the table of observations we can see that, when xi = 41, yi = 91.7.
  - NB if there was another school with xi = 41, the observed value of y might not be the same due to random variation.
- When xi = 41 what is the value of y_hat?
  - y_hat = 6 + 2×41 = 88
- The difference between yi and y_hat when xi = 41 is 91.7 - 88.0 = 3.7. This difference is the error or residual.
  - i.e. our model predicts that school performance will equal 88 when x = 41, but for this particular school the actual performance is 91.7, so the model underpredicts performance by 3.7.
- The line of best fit (our model) cuts the vertical axis where x = 0.
  - y_hat = 6 + 2xi = 6 + 2×0 = 6
- The value of school performance = 6 when average HH income of pupils, x, is zero.
- The regression slope, also called b or the slope coefficient, is a measure of how sensitive the dependent variable is to change in the explanatory variables. SPSS has estimated that the slope in this case = 2.
  - i.e. for every unit increase in the explanatory variable (average income of parents measured in £000s) school performance rises by two units.
  - i.e. for every extra £1,000 average income, school performance goes up by two units.
- How is this sensitivity calculated? Good question! It is the slope of the line of best fit, calculated using the OLS formula which minimises the sum of squared residuals
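The arithmetic in these answers can be checked with a few lines of code. This is only a sketch: the intercept 6, slope 2, x = 41 and y = 91.7 come from the slide, while the function name `predict` is an assumption made for illustration.

```python
def predict(x, a=6.0, b=2.0):
    """Predicted school performance y_hat = a + b*x (a = 6, b = 2 from the slide)."""
    return a + b * x

x_i = 41      # average household income of pupils, in 000s
y_i = 91.7    # observed school performance for that school

y_hat = predict(x_i)      # value on the line of best fit
residual = y_i - y_hat    # positive residual: the model underpredicts

print(f"y_hat = {y_hat:.1f}, residual = {residual:.1f}")
print(f"intercept (x = 0): {predict(0):.1f}")
```

Raising x by one unit always raises `predict(x)` by the slope, 2, which is the sense in which the slope measures sensitivity.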
20 Regression estimates of a, b using Ordinary Least Squares (OLS)
- Solving the min error sum of squares problem yields estimates of the slope b and y-intercept a of the straight line
[Figure: scatter plot for the first sample with line of best fit y_hat = 6 + 2xi; slope b = 2, intercept a = 6]
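The slide's own formula is not reproduced in this transcript, so the following is a minimal sketch using the standard closed-form OLS solution for one explanatory variable: b = sum of (x - x_bar)(y - y_bar) divided by sum of (x - x_bar)^2, and a = y_bar - b×x_bar. The function names and the data are hypothetical (the data are not the eight schools from the lecture).

```python
def ols_fit(xs, ys):
    """Return (a, b): OLS intercept and slope for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared deviations of each observation from the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data for illustration only.
xs = [20, 25, 30, 35, 40, 45]
ys = [47.0, 55.5, 66.5, 75.0, 87.5, 95.0]

a, b = ols_fit(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)

# Moving the intercept or the slope away from the OLS values makes the fit worse.
worse_intercept = sum_squared_residuals(xs, ys, a + 1.0, b)
worse_slope = sum_squared_residuals(xs, ys, a, b + 0.1)
```

Comparing `best` with the two perturbed values shows what "minimises the sum of squared residuals" means in practice: no other straight line gives a smaller total.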
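21 A Second random sample of 8 schools
Now consider what would happen if we collected another sample and calculated the line of best fit for this new sample
[Figure: second sample with line of best fit; slope b = 2.1, intercept a = 7.6]
22 A Third Random Sample of 8 Schools
[Figure: third sample with line of best fit; slope b = 1.9, intercept a = 15.2]
23 A Fourth Random Sample of 8 Schools
[Figure: fourth sample with line of best fit; slope b = 2.0, intercept a = 14.5]
24 A Fifth Random Sample of 8 Schools
[Figure: fifth sample with line of best fit; slope b = 1.9, intercept a = 14.0]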
25 Further random samples
[Figure: scatter plots with lines of best fit for Samples 6, 7, 8 and 9]
26
- Sample 1: b = 2.0
- Sample 2: b = 2.1
- Sample 3: b = 1.9
- Sample 4: b = 2.0
- Sample 5: b = 1.9
- Sample 6: b = 1.7
- Sample 7: b = 1.8
- Sample 8: b = 2.5
- Sample 9: b = 2.2
- Average b from 9 samples = 2.0
- Standard deviation of b from 9 samples = 0.2, i.e. average deviation of b from sample to sample = 0.2 = Standard Error of the slope
- Notice that, in the second, third etc. samples we have found schools with exactly the same values of x as in the first sample.
- Despite this, we find random variation in the performance of the school for a given value of x.
- This means that the slope coefficient will also vary from sample to sample.
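The average and standard deviation quoted for the nine slopes can be reproduced directly. This sketch uses the nine values of b listed on the slide; using the sample standard deviation (dividing by n - 1) is an assumption, though either version rounds to 0.2 here.

```python
import statistics

# Slope estimates from the nine samples on the slide.
slopes = [2.0, 2.1, 1.9, 2.0, 1.9, 1.7, 1.8, 2.5, 2.2]

mean_b = statistics.mean(slopes)   # centre of the sample-to-sample variation
sd_b = statistics.stdev(slopes)    # sample SD (n - 1 in the denominator)

print(f"average b = {mean_b:.1f}, SD of b = {sd_b:.1f}")
```

In practice we only have one sample, so SPSS estimates this standard error from the data rather than from repeated samples.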
27
- Q1/ What would the sampling distribution of b look like if the sample size was large?
- Q2/ What will the average of all sample slopes be and what symbol do we use to denote this value?
- Q3/ What section of that distribution are we usually most interested in?
28 If n is large
- A1/ sample slope b is normally distributed if n is large.
- A2/ average of all sample slopes = population slope β
- A3/ we are usually most interested in the central 95% of the distribution of b
  - We want to be 95% sure that the population value of the slope lies between some lower bound and some upper bound.
[Figure: sampling distribution of b, centred on β = average b]
29
- Q/ Why is it useful that b is normally distributed?
30
- A/ If b is normally distributed, it means that we can use the standard normal curve to help us work out the lower and upper bounds of the central 95% of the sampling distribution of b
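31 Convert to z value
$$z = \frac{b - \beta}{s_b}$$
where $s_b$ is the SE of b
[Figure: diagram with points labelled a, b and c, showing the conversion of b to a z value]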
32
- Because the sampling distribution of the regression slope from large samples is normal (i.e. has a bell-shaped histogram), we can use the standard normal curve (z distribution) to work out confidence intervals and hypothesis tests on b.
- i.e. we can use the known probabilities for areas under the standard normal curve to work out:
  - The lower and upper bounds for the central 95% of b
  - The probability of observing a sample like our own with a value of b at least as far away from the H0 assumed value of β
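As an illustration of the first of these uses, here is a minimal sketch of a large-sample 95% confidence interval, b plus or minus z times the SE of b. The inputs b = 2.0 and SE = 0.2 are the rounded figures from the nine-sample illustration earlier; the function name `confidence_interval` is an assumption, and the normal curve is used on the assumption that n is large.

```python
from statistics import NormalDist

def confidence_interval(b, se_b, level=0.95):
    """Large-sample confidence interval for the slope using the z distribution."""
    # z cuts off the central `level` share of the standard normal curve
    # (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return b - z * se_b, b + z * se_b

lower, upper = confidence_interval(2.0, 0.2)
print(f"95% CI for the slope: {lower:.2f} to {upper:.2f}")
```

For small samples the same calculation would use the appropriate t value in place of z, which widens the interval.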
33 Small samples
- If the sample is small, b (once standardised) will have a t-distribution.
- Since the t-distribution is asymptotically normal (i.e. tends towards the z distribution as n increases) we tend to use the t-distribution whether the sample is large or small.
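34 Convert to t value
$$t = \frac{b - \beta}{s_b}$$
where $s_b$ is the SE of b
[Figure: diagram with points labelled a, b and c, showing the conversion of b to a t value]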
35 3. Hypothesis tests on the slope coefficient
- Regressions are usually run on samples, but usually we want to say something about the population value of the relationship between x and y.
- Repeated samples would yield a range of values for estimates of b ~ N(β, s_b)
  - i.e. b is normally distributed with mean β, the population value of the slope (the value of b if the regression were run on the population)
- If there is no relationship in the population between x and y, then β = 0
- H0: β = 0, H1: β ≠ 0 is the hypothesis test which SPSS runs automatically on every regression you run; it produces the output in the two columns headed t and Sig. in the Coefficients table.
  - i.e. every SPSS output table of coefficients includes the results of a hypothesis test on whether there is any relationship at all between x and y.
36
37 Returning to our IQ example
- Q1/ what is the estimate of slope in this sample and what does it tell us?
- Q2/ what is the standard error and what does it mean?
- Q3/ what is the value of the intercept term and what does it mean?
- Q4/ how would we test the hypothesis that β = 0, and what does this hypothesis mean?
38
- A1/ the estimate of the slope in this sample is about 260. This tells us that for every unit increase in IQ, income typically rises by around 260.
- A2/ the standard error tells us how much the estimate of the slope typically varies from sample to sample. We do not know the SE of b for sure, but SPSS estimates it at 11
  - i.e. the slope estimate is likely to vary by around 11 from sample to sample.
- A3/ the value of the intercept term is estimated to be -8,237. The intercept term tells us the value of the dependent variable when the explanatory variables are all zero
  - i.e. where the line of best fit cuts the vertical axis
  - So we estimate that for someone with zero IQ, their income will typically be -8,237.
39
- A4/ we would test the hypothesis that β = 0 by calculating the probability of observing a sample with an estimated slope of 260 when the value of the population slope is zero.
  - We would calculate this probability (Sig. = probability of falsely rejecting H0: β = 0) by calculating the associated value on the t-distribution and using this to work out the areas in the tails.
  - tc = (258.5 - 0)/11.01 = 23.5, where tc is the value of t you have calculated. You then want to work out what proportion of t lies above tc and below -tc.
  - We would then look up this value for t in the t tables for the degrees of freedom associated with our regression: sample size - (1 + the number of explanatory variables).
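The calculation of tc and of the degrees of freedom can be sketched as follows. The slope 258.5 and standard error 11.01 are the figures quoted above; the function names are assumptions, and the sample size n = 100 is purely hypothetical because the slides do not state n for the IQ example.

```python
def t_statistic(b, se_b, beta_h0=0.0):
    """tc = (estimated slope - slope assumed under H0) / SE of the slope."""
    return (b - beta_h0) / se_b

def degrees_of_freedom(n, k):
    """Sample size minus (1 + number of explanatory variables)."""
    return n - (1 + k)

t_c = t_statistic(258.5, 11.01)
df = degrees_of_freedom(n=100, k=1)   # n = 100 is a made-up sample size

print(f"tc = {t_c:.1f}, df = {df}")
```

A tc this far from zero puts almost none of the t-distribution in the tails, which is why SPSS would report Sig. as 0.000.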
40 Hypothesis test on b
- (1) H0: β = 0
  - (i.e. slope coefficient, if regression run on population, would = 0)
  - H1: β ≠ 0
- (2) α = 0.05 or 0.01 etc.
- (3) Reject H0 iff P < α
  - (N.B. Rule of thumb: if n fairly large, P < 0.05 if tc ≥ 2 in absolute value)
- (4) Calculate P and conclude.
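The four steps can be sketched in code. Note the assumption: this uses the standard normal curve as a large-sample approximation to the t-distribution, so it is not what SPSS computes for small samples. The function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(t_c):
    """Approximate two-tailed P: area above |tc| plus area below -|tc|.

    Uses the standard normal curve, so it assumes n is fairly large.
    """
    return 2 * (1 - NormalDist().cdf(abs(t_c)))

def reject_h0(t_c, alpha=0.05):
    """Step (3): reject H0 if and only if P < alpha."""
    return two_sided_p_value(t_c) < alpha

# Rule of thumb from the slide: |tc| of about 2 gives P just under 0.05.
for t_c in (1.5, 2.0, -2.0, 23.5):
    print(t_c, round(two_sided_p_value(t_c), 4), reject_h0(t_c))
```

Running this shows why the rule of thumb works: tc = 2 gives P of roughly 0.046, just below the usual 0.05 threshold, while tc = 1.5 does not lead to rejection.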
41 Floor Area Example
- You run a regression of house price on floor area which yields the following output. Use this output to answer the following questions:
  - Q/ What is the Constant? What does its value mean here?
  - Q/ What is the slope coefficient and what does it tell you here?
  - Q/ What is the estimated value of an extra square metre?
  - Q/ How would you test for the existence of a relationship between purchase price and floor area?
  - Q/ How much is a 200m² house worth?
  - Q/ How much is a 100m² house worth?
  - Q/ On average, how much is the slope coefficient likely to vary from sample to sample?
- NB Write down your answers: you'll need them later!
42 Floor area example
- (1) H0: no relationship between house price and floor area.
  - H1: there is a relationship
- (2), (3), (4)
  - P = 1 - CDF.T(24.469, 554) = 0.000000
- Reject H0
43 4. Omitted Variables, Goodness of Fit and R²
Q/ Is floor area the only factor?
Q/ How much of the variation in Price does it explain?
44 R-square
- R-square tells you what proportion of the variation in y is explained by your model
  - 0 < R² < 1 (NB you want R² to be near 1).
- If you have more than one explanatory variable, use Adjusted R², which takes into account the distortion caused by adding extra variables.
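A minimal sketch of how these two measures are usually computed: R² = 1 - (sum of squared residuals / total sum of squares), and Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of explanatory variables. These are the standard textbook definitions rather than formulas taken from the slides, and the data and function names are made up for illustration.

```python
def r_squared(ys, y_hats):
    """Proportion of the variation in y explained by the fitted values."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                      # total
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalise R-square for the number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed and predicted values.
ys = [1.0, 2.0, 3.0, 4.0]
y_hats = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(ys, y_hats)
adj_r2 = adjusted_r_squared(r2, n=len(ys), k=1)
print(f"R-square = {r2:.2f}, Adjusted R-square = {adj_r2:.2f}")
```

Adding an extra explanatory variable can never lower R², but it does raise k, so Adjusted R² only improves if the new variable explains enough extra variation.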
45 House Price Example cont'd: Two explanatory variables
Now add number of bathrooms as an extra explanatory variable
- Q/ How has the estimated value of an extra square metre changed?
- Q/ Do a hypothesis test for the existence of a relationship between price and number of bathrooms.
- Q/ How much will an extra bathroom typically add to the value of a house?
- Q/ What is the value of a 200m² house with one bathroom? Compare your estimate with that from the previous model.
- Q/ What is the value of a 100m² house with one bathroom? Compare your estimate with that from the previous model.
- Q/ What is the value of a 100m² house with two bathrooms? Compare your estimate with that from the previous model.
- Q/ On average, how much is the slope coefficient on floor area likely to vary from sample to sample?
46 Scatter plot (with floor spikes)
47 3D Surface Plots: Construction, Price & Unemployment during a boom
Q = -246 + 27P - 0.2P² - 73U + 3U²
Non-linear effects can also be modelled when you have more than one explanatory variable
48 Construction Equation in a Slump
Q = 315 + 4P - 73U + 5U²
49 5. Categorical Explanatory Variables
- Sometimes the observations in a particular subgroup of the sample display consistently higher y values
  - i.e. for a particular category of observations.
- If you assume the slope will have the same value, and that only the intercept is shifting, you can model the effect of categorical variables by including dummy variables
- A dummy variable is simply a binary variable
  - e.g. male = 1 or 0
50
- To model the effect of a categorical explanatory variable in this way you need to:
  - Decide on a baseline category. This is usually an arbitrary decision, so just choose the largest or most familiar category.
    - E.g. if the category is UK Region, choose London as the baseline
  - Create dummies (binary variables) for all remaining categories
    - E.g. Compute yorksh_dum = 0.
    - If (Region = 'Yorkshire') yorksh_dum = 1.
    - Execute.
  - Include in your regression the dummies for all categories except your baseline category.
    - E.g. suppose you only have two regions in your sample, London and Yorkshire:
    - you would do a regression of house price on floorarea and yorksh_dum
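The same dummy-creation steps can be sketched outside SPSS. The code below is an illustration only: `make_dummies` is a hypothetical helper, and the list of regions is made up.

```python
def make_dummies(values, baseline):
    """Build one 0/1 dummy per category, leaving out the baseline category."""
    categories = sorted(set(values) - {baseline})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical sample of four houses.
regions = ["London", "Yorkshire", "Yorkshire", "London"]

dummies = make_dummies(regions, baseline="London")
yorksh_dum = dummies["Yorkshire"]   # 1 for Yorkshire houses, 0 for London
print(yorksh_dum)
```

With three or more regions the function returns one dummy for each non-baseline region, matching the rule that the baseline category gets no dummy of its own.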
51
- By including dummy variables you are saying that the difference between categories can be modelled as a parallel shift of the regression line above or below the baseline category
- The value of the coefficient on the dummy variable tells you how much higher (or lower, if it is negative) the value of the dependent variable would be for observations in that category
  - E.g. if the regression output were as follows:
  - price = -2000 + 500 floorarea - 27500 yorksh_dum
  - then the results tell us that a house of a given size is £27,500 cheaper in Yorkshire compared with London.
  - i.e. the coefficient tells you the size of the intercept shift associated with that category of observations
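To see the parallel shift numerically, the example equation can be evaluated for the same house in each region. The coefficients are those of the example equation on the slide; the function name and the 100 square metre floor area are chosen for illustration.

```python
def predicted_price(floorarea, yorksh_dum):
    """Example equation: price = -2000 + 500*floorarea - 27500*yorksh_dum."""
    return -2000 + 500 * floorarea - 27500 * yorksh_dum

london = predicted_price(100, yorksh_dum=0)      # baseline category
yorkshire = predicted_price(100, yorksh_dum=1)   # dummy switched on

print(london, yorkshire, london - yorkshire)
```

The gap between the two predictions is 27,500 whatever floor area is chosen, because the dummy shifts the intercept and leaves the slope of 500 unchanged.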
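52 Coefficient on Dummy Variable = size of Intercept Shift
[Figure: House price (vertical axis) against Floorarea (horizontal axis), with parallel lines for London and Yorkshire; slope = 500 in both areas, and the Yorkshire line lies 27,500 below the London line]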
53 Summary
- 1. Linear & Non-linear Relationships
- 2. Fitting a line using OLS
- 3. Inference in Regression
- 4. Omitted Variables & R²
- 5. Categorical Explanatory Variables
- Revision lecture next week
- Worked examples on
- Confidence Intervals?
- Hypothesis Tests?
- Regression?
54 Reading
- Regression Analysis
  - Pryce chapter on relationships.
  - Field, A. chapters on regression.
  - Moore and McCabe chapters on regression.
  - Kennedy, P. A Guide to Econometrics.
  - Bryman, Alan, and Cramer, Duncan (1999) Quantitative Data Analysis with SPSS for Windows: A Guide for Social Scientists, Chapters 9 and 10.
  - Achen, Christopher H. Interpreting and Using Regression (London: Sage, 1982).