Lecture 8 Regression: Relationships between continuous variables Slides available from Statistics

1
Lecture 8: Regression
Relationships between continuous variables
Slides available from the Statistics & SPSS page of www.gpryce.com
  • Social Science Statistics Module I
  • Gwilym Pryce

2
Notices
  • Register
  • Revision lecture next week
  • Worked examples on
  • Confidence Intervals?
  • Hypothesis Tests?
  • Regression?
  • Email me any particular issues
  • Learning Support strategy

3
Learning Support strategy
  • Independent learning
  • this is a PG course and a degree of independent
    learning is assumed.
  • do the reading, attend the labs, review the
    lectures, make use of the computer labs and
    online help in your own time.
  • Lab Overview & Feedback
  • Please give feedback to the tutors & Class Reps on how
    you think that is going and how it could be
    improved.
  • Tutors and Class Reps will then report back to me
    how things are going each week.
  • Talk to tutors if you are struggling
  • Let the tutors know if you are struggling
    (assuming you have done the reading, attended
    labs etc.)
  • Tutors cannot guarantee extra support, but it
    might be possible to arrange extra tutorials etc.
  • Support from Maths Adviser: Shazia Ahmed,
    University's Maths Adviser
  • If you have gone through steps 1 to 3, Shazia has
    agreed to run one-on-one sessions with students
    that are struggling with particular mathematical
    or statistical concepts (though she has made it
    clear that she cannot advise on SPSS problems,
    nor will she do the assignment for you).
  • Students who have particular problems in this
    regard can contact her directly
  • Shazia Ahmed, Maths Adviser, Student Learning
    Service McMillan Reading Room, Tel  330 5631
    Fax 330 8063
  • Departmental Support
  • Struggling students should enquire whether their
    own dept has support to offer.
  • All the grad school courses are only intended to
    constitute a generic training component
  • Individual depts & supervisors should supplement
    with additional training and support as
    necessary.
  • Tutor of Last Resort
  • Students who have gone through steps 1 to 5
    above, and who still feel they are not receiving
    enough support, can email me directly

4
Plan
  • 1. Linear & Non-linear Relationships
  • 2. Fitting a line using OLS
  • 3. Inference in Regression
  • 4. Omitted Variables & R2
  • 5. Categorical Explanatory Variables
  • 6. Summary

5
1. Linear & Non-linear relationships between
variables
  • Often of greatest interest in social science is
    investigation into relationships between
    variables
  • is social class related to political perspective?
  • is income related to education?
  • is worker alienation related to job monotony?
  • We are also interested in the direction of
    causation, but this is more difficult to prove
    empirically
  • our empirical models are usually structured
    assuming a particular theory of causation

6
Relationships between scale variables
  • The most straightforward way to investigate
    evidence for a relationship is to look at scatter
    plots
  • it is traditional to
  • put the dependent variable (i.e. the effect) on
    the vertical axis
  • or y axis
  • put the explanatory variable (i.e. the cause)
    on the horizontal axis
  • or x axis

7
Scatter plot of IQ and Income
8
We would like to find the line of best fit
Predicted values (i.e. values of y lying on the
line of best fit) are given by y_hat = a + bx,
where a is the intercept and b is the slope
9
What does the output mean?


10
Sometimes the relationship appears non-linear
11
straight line of best fit is not always very
satisfactory
12
Could try a quadratic line of best fit
13
We can simulate a non-linear relationship by
first transforming one of the variables
14
e.g. squaring IQ and taking the natural log of IQ
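A minimal Python sketch of the transformation idea, using made-up IQ values rather than the lecture's data. A straight line fitted to the transformed variable corresponds to a curve when plotted against IQ itself.

```python
import math

# Hypothetical IQ scores, for illustration only (not the lecture's data).
iq = [85, 100, 115, 130]

# Transformed versions of the explanatory variable.
iq_squared = [x ** 2 for x in iq]    # squaring IQ
ln_iq = [math.log(x) for x in iq]    # natural log of IQ

# Either transformed column can then be used in place of IQ in a regression.
for x, sq, ln in zip(iq, iq_squared, ln_iq):
    print(x, sq, round(ln, 3))
```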
15
or a cubic line of best fit (over-fitted?)
16
Or could try two linear lines: a structural break
17
2. Fitting a line using OLS
  • The most popular algorithm for drawing the line
    of best fit is one that minimises the sum of
    squared deviations from the line to each
    observation

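i.e. choose the line that minimises the sum over all observations of (yi - y_hat_i)^2
where yi = observed value of y, and y_hat_i = predicted
value of yi = the value on the line of
best fit corresponding to xi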
18
Example: School Performance in 8 Schools
y = school performance, x = ave. HH income of
pupils (000s)
  1. Write this model output as an equation.
  2. When xi = 41 what is the value of yi?
  3. When xi = 41 what is the value of y_hat?
  4. What is the difference between yi and y_hat when
    xi = 41, and what does this difference mean?
  5. Where does the line of best fit cut the vertical
    axis?
  6. What is the value of school performance when
    average HH income of pupils is zero?
  7. How sensitive is school performance to the
    economic status of its intake?
  8. How is this sensitivity calculated?

etc
19
  • y_hat = 6 + 2xi
  • yi = 6 + 2xi + ei
  • From the table of observations we can see that,
    when xi = 41, yi = 91.7.
  • NB if there was another school with xi = 41, the
    observed value of y might not be the same due to
    random variation.
  • When xi = 41 what is the value of y_hat?
  • y_hat = 6 + 2(41) = 88
  • The difference between yi and y_hat when xi = 41
    is 91.7 - 88.0 = 3.7. This difference is the
    error or residual.
  • i.e. our model predicts that school performance
    will equal 88 when x = 41, but for this
    particular school, the actual performance is
    91.7, so the model underpredicts performance by
    3.7.
  • The line of best fit (our model) cuts the
    vertical axis where x = 0.
  • y_hat = 6 + 2xi = 6 + 2(0) = 6
  • The value of school performance = 6 when average
    HH income of pupils, x, is zero.
  • The regression slope, also called b or
    the slope coefficient, is a measure of how
    sensitive the dependent variable is to changes in
    the explanatory variable. SPSS has estimated
    that the slope in this case = 2.
  • i.e. for every unit increase in the explanatory
    variable (average income of parents measured in
    000s) school performance rises by two units.
  • i.e. for every extra 1,000 of average income,
    school performance goes up by two units.
  • How is this sensitivity calculated? Good
    question! It is the slope of the line of best
    fit, calculated using the OLS formula which
    minimises the sum of squared residuals.
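The worked answers above can be checked with a few lines of Python. This is only a sketch: the intercept 6, the slope 2 and the observation (41, 91.7) come from the slides, while the function name `predict` is introduced here for illustration.

```python
# Fitted line from the slides: y_hat = 6 + 2x
a, b = 6.0, 2.0

def predict(x):
    """Predicted school performance for average HH income x (000s)."""
    return a + b * x

x_i, y_i = 41, 91.7        # the observed school from the table
y_hat = predict(x_i)       # value on the line of best fit
residual = y_i - y_hat     # positive, so the model under-predicts

print(y_hat)               # prediction at x = 41
print(round(residual, 1))  # error (residual) for this school
print(predict(0))          # where the line cuts the vertical axis
```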

20
Regression estimates of a, b using Ordinary Least
Squares (OLS)
  • Solving the minimum error sum of squares problem
    yields estimates of the slope b and y-intercept a
    of the straight line
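As a sketch of what solving this problem produces, the standard closed-form OLS solution for one explanatory variable is b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2) and a = y_bar - b*x_bar. The Python below applies it to hypothetical data (not the schools data from the slides); `ols_fit` is a name chosen here for illustration.

```python
def ols_fit(xs, ys):
    """Return (a, b): OLS intercept and slope for a single explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares of x, both about the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data lying exactly on y = 6 + 2x, so the fit should recover a = 6, b = 2.
xs = [10, 20, 30, 40]
ys = [26, 46, 66, 86]
a, b = ols_fit(xs, ys)
print(a, b)
```

With real data the points do not lie exactly on the line, and the same formulas give the line with the smallest sum of squared residuals.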

[Figure: line of best fit y_hat = 6 + 2xi, with slope b = 2 and intercept a = 6]
21
A Second random sample of 8 schools
Now consider what would happen if we collected
another sample and calculated the line of best
fit for this new sample
[Figure: line of best fit for the second sample, with slope b = 2.1 and intercept a = 7.6]
22
A Third Random Sample of 8 Schools
[Figure: line of best fit for the third sample, with slope b = 1.9 and intercept a = 15.2]
23
A Fourth Random Sample of 8 Schools
[Figure: line of best fit for the fourth sample, with slope b = 2.0 and intercept a = 14.5]
24
A Fifth Random Sample of 8 Schools
[Figure: line of best fit for the fifth sample, with slope b = 1.9 and intercept a = 14.0]
25
Further random samples
[Figure: lines of best fit for Samples 6, 7, 8 and 9]
26
Sample 1: b = 2.0
Sample 2: b = 2.1
Sample 3: b = 1.9
Sample 4: b = 2.0
Sample 5: b = 1.9
Sample 6: b = 1.7
Sample 7: b = 1.8
Sample 8: b = 2.5
Sample 9: b = 2.2
Average b from 9 samples = 2.0
Standard deviation of b from 9 samples = 0.2
i.e. average deviation of b from sample to sample = 0.2 = Standard Error of the slope
  • Notice that, in the second, third etc samples we
    have found schools with exactly the same values
    of x as in the first sample.
  • Despite this, we find random variation in the
    performance of the school for a given value of x.
  • This means that the slope coefficient will also
    vary from sample to sample.
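The summary figures for the nine samples can be reproduced with Python's standard library. A short sketch, using the nine slope estimates listed on this slide:

```python
import statistics

# Slope estimates b from the nine random samples of 8 schools (from the slide).
slopes = [2.0, 2.1, 1.9, 2.0, 1.9, 1.7, 1.8, 2.5, 2.2]

mean_b = statistics.mean(slopes)   # average b across samples
sd_b = statistics.stdev(slopes)    # sample standard deviation of b

print(round(mean_b, 1))  # average slope
print(round(sd_b, 1))    # sample-to-sample variation, i.e. the standard error of the slope
```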

27
  • Q1/ What would the sampling distribution of b
    look like if the sample size was large?
  • Q2/ What will the average of all sample slopes be
    and what symbol do we use to denote this value?
  • Q3/ What section of that distribution are we
    usually most interested in?

28
If n is large
  • A1/ sample slope b is normally distributed if n
    is large.
  • A2/ average of all sample slopes = population
    slope β
  • A3/ we are usually most interested in the central
    95% of the distribution of b
  • We want to be 95% sure that the population value
    of the slope lies between some lower bound and
    some upper bound.

[Figure: sampling distribution of b, centred on β = average b]
29
  • Q/ Why is it useful that b is normally
    distributed?

30
  • A/ If b is normally distributed, it means that we
    can use the standard normal curve to help us work
    out the lower and upper bounds of the central 95%
    of the sampling distribution of b

31
[Figure: sampling distribution of b, with labels a, b and c]
Convert to z value: z = (b - β)/sb
where sb is the SE of b and β is the population slope
32
  • Because the sampling distribution of the
    regression slope from large samples is normal
    (i.e. has a bell-shaped histogram), we can use
    the standard normal curve (z distribution) to
    work out confidence intervals and hypothesis
    tests on b.
  • i.e we can use the known probabilities for areas
    under the standard normal curve to work out
  • The lower and upper bounds for the central 95% of
    b
  • The probability of observing a sample like our
    own with a value of b at least as far away from
    the value of the population slope assumed under H0
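A minimal sketch of the first of these calculations, assuming the large-sample normal approximation described on this slide. It uses the average slope (2.0) and standard error (0.2) from the nine school samples purely as an illustration.

```python
from statistics import NormalDist

b, s_b = 2.0, 0.2   # slope estimate and its standard error (from the school samples)

# z value that leaves 2.5% in each tail of the standard normal curve.
z = NormalDist().inv_cdf(0.975)

lower = b - z * s_b   # lower bound of the central 95%
upper = b + z * s_b   # upper bound of the central 95%

print(round(z, 2), round(lower, 2), round(upper, 2))
```

With samples as small as 8 schools the normal curve is only a rough guide, so the bounds here should be read as an illustration of the method rather than a final answer.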

33
Small samples
  • If the sample is small, b will have a
    t-distribution.
  • Since the t-distribution is asymptotically normal
    (i.e. tends towards the z distribution as n
    increases) we tend to use the t-distribution
    whether the sample is large or small.

34
[Figure: sampling distribution of b, with labels a, b and c]
Convert to t value: t = (b - β)/sb
where sb is the SE of b and β is the population slope
35
3. Hypothesis tests on the slope coefficient
  • Regressions are usually run on samples, but
    usually we want to say something about the
    population value of the relationship between x
    and y.
  • Repeated samples would yield a range of values
    for estimates of b: b ~ N(β, sb)
  • i.e. b is normally distributed with mean β, where
    β = the value of the slope if the regression were
    run on the population
  • If there is no relationship in the population
    between x and y, then β = 0
  • H0: β = 0, H1: β ≠ 0 is the hypothesis test
    which SPSS runs automatically on every regression
    you run and produces the output in two columns
    headed t and Sig. in the Coefficients table.
  • i.e. every SPSS output table of coefficients
    includes the results of a hypothesis test on
    whether there is any relationship at all between
    x and y.

36
  • Some examples

37
Returning to our IQ example
  • Q1/ what is the estimate of slope in this sample
    and what does it tell us?
  • Q2/ what is the standard error and what does it
    mean?
  • Q3/ what is the value of the intercept term and
    what does it mean?
  • Q4/ how would we test the hypothesis that the
    population slope β = 0, and what does this
    hypothesis mean?

38
  • A1/ the estimate of slope in this sample is 260.
    This tells us that for every unit increase in IQ,
    income typically rises by around 260.
  • A2/ the standard error tells us how much the
    estimate of the slope typically varies from
    sample to sample. We do not know the SE of b for
    sure, but SPSS estimates it at 11
  • i.e. the slope estimate is likely to vary by
    around 11 from sample to sample.
  • A3/ the value of the intercept term is estimated
    to be -8,237. The intercept term tells us the
    value of the dependent variable when the
    explanatory variables are all zero.
  • i.e. where the line of best fit cuts the vertical
    axis
  • So we estimate that for someone with zero IQ,
    their income will typically be -8,237.

39
  • A4/ we would test the hypothesis that β = 0 by
    calculating the probability of observing a sample
    with an estimated slope as far from zero as 260
    (or further) when the value of the population
    slope is zero.
  • We would calculate this probability (sig. =
    probability of falsely rejecting H0: β = 0) by
    calculating the associated value on the
    t-distribution and using this to work out the areas
    in the tails.
  • tc = (258.5 - 0)/11.01 = 23.5, where tc is the
    value of t you have calculated. You then want to
    work out what proportion of t lies above tc and
    below -tc.
  • We would then look up this value for t in the t
    tables for the degrees of freedom associated with
    our regression: sample size - (1 + the number of
    explanatory variables).
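The arithmetic in A4 can be sketched in Python as follows. The slope (258.5) and standard error (11.01) are the values quoted above; the sample size passed to `degrees_of_freedom` is hypothetical, because the IQ sample size is not stated in the slides, and both function names are introduced here for illustration.

```python
def t_statistic(b, beta_h0, se_b):
    """t value for a slope estimate b, given the H0 value of the slope and the SE of b."""
    return (b - beta_h0) / se_b

def degrees_of_freedom(n, k):
    """Sample size minus (1 + number of explanatory variables)."""
    return n - (1 + k)

tc = t_statistic(258.5, 0.0, 11.01)
print(round(tc, 1))

# Hypothetical sample size of 100 with one explanatory variable (IQ).
print(degrees_of_freedom(100, 1))
```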

40
Hypothesis test on b
  • (1) H0: β = 0
  • (i.e. slope coefficient, if regression run on
    population, would = 0)
  • H1: β ≠ 0
  • (2) α = 0.05 or 0.01 etc.
  • (3) Reject H0 iff P < α
  • (N.B. Rule of thumb: if n fairly large, P < 0.05
    if |tc| ≥ 2)
  • (4) Calculate P and conclude.

41
Floor Area Example
  • You run a regression of house price on floor area
    which yields the following output. Use this
    output to answer the following questions
  • Q/ What is the Constant? What does its value
    mean here?
  • Q/ What is the slope coefficient and what does it
    tell you here?
  • Q/ What is the estimated value of an extra square
    metre?
  • Q/ How would you test for the existence of a
    relationship between purchase price and floor
    area?
  • Q/ How much is a 200m2 house worth?
  • Q/ How much is a 100m2 house worth?
  • Q/ On average, how much is the slope coefficient
    likely to vary from sample to sample?
  • NB: Write down your answers; you'll need them
    later!

42
Floor area example
  • (1) H0: no relationship between house price and
    floor area.
  • H1: there is a relationship
  • (2), (3), (4)
  • P = 1 - CDF.T(24.469, 554) = 0.000000
  • Reject H0
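A rough Python check of this conclusion, under stated assumptions. The standard library has no t-distribution, so this sketch swaps in the standard normal curve, which is very close to the t-distribution at 554 degrees of freedom. It also doubles the tail area to cover both tails, whereas the SPSS expression above gives one tail. The function names are introduced here for illustration.

```python
from statistics import NormalDist

def two_sided_p_normal(t_value):
    """Approximate two-sided P using the standard normal curve (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t_value)))

def reject_h0(p, alpha=0.05):
    """Decision rule: reject H0 if and only if P < alpha."""
    return p < alpha

p = two_sided_p_normal(24.469)   # t value from the floor area regression
print(f"{p:.6f}", reject_h0(p))

# Rule of thumb from the previous slide: a t value of about 2 gives P just under 0.05.
print(round(two_sided_p_normal(2.0), 3))
```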

43
4. Omitted Variables, Goodness of Fit and R2
Q/ Is floor area the only factor?
Q/ How much of the variation in Price does it explain?
44
R-square
  • R-square tells you how much of the variation in y
    is explained by your model
  • 0 < R2 < 1 (NB you want R2 to be near
    1).
  • If you have more than one explanatory variable,
    use Adjusted R2, which takes into account the
    distortion caused by adding extra variables.
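A sketch of how these two measures are usually computed, using the standard textbook formulas R2 = 1 - SS_residual/SS_total and Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the sample size and k the number of explanatory variables. The data and function names below are hypothetical, not taken from the house price example.

```python
def r_squared(ys, y_hats):
    """Share of the variation in y explained by the fitted values."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # unexplained variation
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalise R2 for the number of explanatory variables k, given sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed values and fitted values from a line of best fit.
ys = [1.0, 3.0, 2.0]
y_hats = [1.5, 2.0, 2.5]
print(r_squared(ys, y_hats))

# Hypothetical case: R2 = 0.5 from 10 observations and 2 explanatory variables.
print(round(adjusted_r_squared(0.5, 10, 2), 3))
```

Adjusted R2 is never larger than R2, and the gap widens as more explanatory variables are added to a small sample.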

45
House Price Example cont'd: Two explanatory
variables
Now add number of bathrooms as an extra
explanatory variable
  • Q/ How has the estimated value of an extra square
    metre changed?
  • Q/ Do a hypothesis test for the existence of a
    relationship between price and number of
    bathrooms.
  • Q/ How much will an extra bathroom typically add
    to the value of a house?
  • Q/ What is the value of a 200m2 house with one
    bathroom? Compare your estimate with that from
    the previous model.
  • Q/ What is the value of a 100m2 house with one
    bathroom? Compare your estimate with that from
    the previous model.
  • Q/ What is the value of a 100m2 house with two
    bathrooms? Compare your estimate with that from
    the previous model.
  • Q/ On average, how much is the slope coefficient
    on floor area likely to vary from sample to
    sample?

46
Scatter plot (with floor spikes)
47
3D Surface Plots: Construction, Price &
Unemployment during a boom
Q = -246 + 27P - 0.2P^2 - 73U + 3U^2
Non-linear effects can also be modelled when you
have more than one explanatory variable
48
Construction Equation in a Slump
Q = 315 + 4P - 73U + 5U^2
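To see what the squared terms do, the two construction equations can be evaluated directly. This is only a sketch: the coefficients are read from the slides with the missing operators restored, and the input values of P and U are hypothetical because the slides do not state their units.

```python
def construction_boom(p, u):
    """Q = -246 + 27P - 0.2P^2 - 73U + 3U^2 (boom equation)."""
    return -246 + 27 * p - 0.2 * p ** 2 - 73 * u + 3 * u ** 2

def construction_slump(p, u):
    """Q = 315 + 4P - 73U + 5U^2 (slump equation)."""
    return 315 + 4 * p - 73 * u + 5 * u ** 2

# Hypothetical values of price P and unemployment U.
print(construction_boom(50, 5), construction_slump(50, 5))

# Non-linearity: the effect of one more unit of U depends on the starting level of U.
effect_low = construction_boom(50, 6) - construction_boom(50, 5)
effect_high = construction_boom(50, 11) - construction_boom(50, 10)
print(round(effect_low, 6), round(effect_high, 6))
```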
49
5. Categorical Explanatory Variables
  • Sometimes certain observations display
    consistently higher y values for a particular
    subgroup in the sample.
  • i.e. for a particular category of observations.
  • If you assume the slope will have the same value,
    and that only the intercept is shifting, you can
    model the effect of categorical variables by
    including dummy variables
  • A dummy variable is simply a binary variable
  • e.g. male = 1 or 0

50
  • To model the effect of a categorical explanatory
    variable in this way you need to
  • Decide on a baseline category. This is usually
    an arbitrary decision, so just choose the largest
    or most familiar category.
  • E.g. if the category is UK Region, choose London
    as the baseline
  • Create dummies (binary variables) for all
    remaining categories
  • E.g. Compute yorksh_dum = 0.
  • if (Region = 'Yorkshire') yorksh_dum = 1.
  • Execute.
  • Include in your regression the dummies for all
    categories except your baseline category.
  • E.g. suppose you only have two regions in your
    sample, London and Yorkshire,
  • you would do a regression of house price on
    floorarea and yorksh_dum
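The same dummy-creation step can be sketched outside SPSS. The Python below mirrors the Compute/if logic on a hypothetical list of regions; `make_dummy` is a helper name introduced here for illustration.

```python
def make_dummy(values, category):
    """Return 1 for observations in the given category and 0 otherwise."""
    return [1 if v == category else 0 for v in values]

# Hypothetical Region values for four houses; London is the baseline category.
regions = ["London", "Yorkshire", "Yorkshire", "London"]

yorksh_dum = make_dummy(regions, "Yorkshire")
print(yorksh_dum)
```

No dummy is created for London: baseline observations are the ones for which every dummy equals 0.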

51
  • By including dummy variables you are saying that
    the difference between categories can be modelled
    as a parallel shift of the regression line above
    or below the baseline category
  • The value of the coefficient on the dummy
    variable tells you how much higher the value of
    the dependent variable would be for observations in
    that category, relative to the baseline category
  • E.g. if the regression output were as follows
  • price = -2000 + 500*floorarea - 27500*yorksh_dum
  • then the results tell us that a house of a
    given size is 27,500 cheaper in Yorkshire
    compared with London.
  • i.e. the coefficient tells you the size of the
    intercept shift associated with that category of
    observations
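A short sketch of the intercept shift, using the example coefficients from this slide (-2000, 500 and -27500). The floor area of 100 is a hypothetical input, and `predict_price` is a name introduced here for illustration.

```python
def predict_price(floorarea, yorksh_dum):
    """Example equation from the slide: price = -2000 + 500*floorarea - 27500*yorksh_dum."""
    return -2000 + 500 * floorarea - 27500 * yorksh_dum

london = predict_price(100, 0)      # baseline category, dummy = 0
yorkshire = predict_price(100, 1)   # same size house, dummy = 1

print(london, yorkshire, london - yorkshire)

# The gap is the same at any floor area: a parallel shift of the line.
print(predict_price(200, 0) - predict_price(200, 1))
```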

52
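Coefficient on Dummy Variable = size of Intercept
Shift
[Figure: House price (vertical axis) against Floorarea (horizontal axis), with parallel lines for London and Yorkshire. Slope = 500, same for both areas; the Yorkshire line lies 27,500 below the London line.]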
53
Summary
  • 1. Linear & Non-linear Relationships
  • 2. Fitting a line using OLS
  • 3. Inference in Regression
  • 4. Omitted Variables & R2
  • 5. Categorical Explanatory variables
  • Revision lecture next week
  • Worked examples on
  • Confidence Intervals?
  • Hypothesis Tests?
  • Regression?

54
Reading
  • Regression Analysis
  • Pryce chapter on relationships.
  • Field, A. chapters on regression.
  • Moore and McCabe Chapters on regression.
  • Kennedy, P. A Guide to Econometrics
  • Bryman, Alan, and Cramer, Duncan (1999)
    Quantitative Data Analysis with SPSS for
    Windows: A Guide for Social Scientists, Chapters
    9 and 10.
  • Achen, Christopher H. Interpreting and Using
    Regression (London: Sage, 1982).