Title: Regression
1. Regression
2. What is Regression?
- A way of predicting the value of one variable from another.
- It is a hypothetical model of the relationship between two variables.
- The model used is a linear one.
- Therefore, we describe the relationship using the equation of a straight line.
3. Describing a Straight Line
- b1
  - Regression coefficient for the predictor
  - Gradient (slope) of the regression line
  - Direction/strength of the relationship
- a
  - Intercept (value of Y when X = 0)
  - Point at which the regression line crosses the Y-axis (ordinate)
- εᵢ
  - Unexplained error.
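Putting these terms together, the straight line for a single predictor can be written as follows (a standard formulation using the symbols defined above; the slide itself lists the components without showing the equation):

```latex
Y_i = a + b_1 X_i + \varepsilon_i
```

Here a is the intercept, b_1 the gradient, and \varepsilon_i the unexplained error for case i.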
4. Same Intercept, Different Gradient
5. Same Gradient, Different Intercept
6. The Method of Least Squares
Why is this line a better summary of the data than a line which is marginally steeper or shallower, or which is a millimetre or two further up the page? In fact, the line has been chosen in such a way that the sum of the squares of the vertical distances between the points and the line is minimised. As we have seen earlier in the module, squaring differences has the advantage of making positive and negative differences equivalent.
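As an illustration of this idea, the sketch below computes the intercept and gradient that minimise the sum of squared vertical distances, using the usual closed-form least-squares formulas. The data and variable names are made up purely for illustration; they are not taken from the slides.

```python
# Least-squares fit of a straight line Y = a + b*X (illustrative data only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Gradient: covariance of X and Y divided by the variance of X.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
# Intercept: the least-squares line passes through (mean_x, mean_y).
a = mean_y - b * mean_x

# The quantity being minimised: the sum of squared vertical distances.
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

print(f"a = {a:.3f}, b = {b:.3f}, sum of squared residuals = {ss_residual:.3f}")
```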
7. How Good is the Model?
- The regression line is only a model based on the data.
- This model might not reflect reality.
- We need some way of testing how well the model fits the observed data.
- How?
8. Sum of Squares
- SST
  - Total variability (variability between scores and the mean).
- SSR
  - Residual/error variability (variability between the regression model and the actual data).
- SSM
  - Model variability (difference in variability between the model and the mean).
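In symbols, writing \hat{Y}_i for the model's prediction and \bar{Y} for the mean of the outcome, these three quantities are conventionally defined as follows (standard definitions matching the descriptions above):

```latex
SS_T = \sum_i (Y_i - \bar{Y})^2, \qquad
SS_R = \sum_i (Y_i - \hat{Y}_i)^2, \qquad
SS_M = \sum_i (\hat{Y}_i - \bar{Y})^2, \qquad
SS_T = SS_M + SS_R
```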
9. Testing the Model: ANOVA
- If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR.
10. Testing the Model: ANOVA
- Mean Squared Error
  - Sums of Squares are total values.
  - They can be expressed as averages, obtained by dividing each sum of squares by its degrees of freedom.
  - These are called Mean Squares, MS.
- If you know F, you can check whether the model is significantly better at predicting the dependent variable than chance alone.
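The note on the later SPSS slide gives these formulas in words; written out, the mean squares and the F ratio are:

```latex
MS_M = \frac{SS_M}{df_M}, \qquad
MS_R = \frac{SS_R}{df_R}, \qquad
F = \frac{MS_M}{MS_R}
```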
11. Testing the Model: R²
- R²
  - The proportion of variance accounted for by the regression model (you can transform R² into a percentage).
  - The Pearson correlation coefficient squared.
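Expressed with the sums of squares defined earlier, this proportion is (a standard identity implied rather than stated on the slide):

```latex
R^2 = \frac{SS_M}{SS_T}
```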
12. Regression: An Example
13. SPSS output showing the F ratio
If the improvement due to fitting the model is much greater than the inaccuracy within the model, then the value of F will be greater than 1. In this instance the value of F is 99.587. SPSS tells us that the probability of obtaining this value of F by chance is very low (p < .001). Note: Mean Square = Sum of Squares / df; F = MS regression / MS residual.
14. SPSS output showing R²
In this instance the model explains 33.5% of the variation in the dependent variable.
15. SPSS Output: Model Parameters
16. Produce your own regression equations at the following site
- http://people.hofstra.edu/faculty/Stefan_Waner/newgraph/regressionframes.html
- Linked from the statistical simulations page of the website.
17. Multiple Regression: when there is more than one independent variable
- b1
  - Regression coefficient for the first predictor
  - Direction/strength of the relationship
- b2
  - Regression coefficient for the second predictor
  - Direction/strength of the relationship
- bn
  - Regression coefficient for the nth predictor
  - Direction/strength of the relationship
- a
  - Intercept (value of Y when X1, X2, ..., Xn are all 0)
  - Point at which the regression line crosses the Y-axis (ordinate)
- εᵢ
  - Unexplained error.
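Combining these terms, the multiple regression model can be written as follows (a standard formulation using the symbols above):

```latex
Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i
```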
18. Multiple regression: an example
19. Checking Assumptions: Checking Residuals
Linearity: this assumption is that there is a straight-line relationship between the independent and dependent variables (n.b. if there is not, it may be possible to make it linear by transforming one or more variables).
Homoscedasticity: this assumption means that the variance around the regression line is the same for all values of the independent variable(s).
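A common way to check both assumptions is to plot the residuals against the fitted values. The sketch below illustrates the idea; the data are hypothetical, and numpy and matplotlib are assumed to be available.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)

# Fit a straight line by least squares.
b, a = np.polyfit(x, y, 1)          # gradient, intercept
fitted = a + b * x
residuals = y - fitted

# Residuals vs fitted values: look for curvature (non-linearity)
# and for a funnel shape (heteroscedasticity).
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```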
20. The effect of outliers
Because the regression line minimises the squared difference between the points and the line, outliers can have a very large effect (their squared distance to the line makes a big difference). For example, a point lying 10 units from the line contributes 100 to the sum of squares, whereas ten points each lying 1 unit from the line together contribute only 10. This is why it is sometimes advisable to run the regression analysis omitting outliers.