Title: Correlation and regression
1. Correlation and regression
- Friday, February 24th 2006
2. Outline
- Lines: intercept and gradient
- Correlation
  - Line fitting
  - The correlation coefficient: Pearson's r
- Regression
  - What is it?
  - Least squares
  - Testing the model
  - Example SPSS output
- Multiple regression
  - Coefficients
  - Effect size
  - Assumptions
  - Transforming variables
  - Interactions
  - Outliers
3. Looking at the relationship between two interval-ratio variables
(You can use ordinal variables if you adjust them to represent rank order.)
- When we want to know how two variables are related to one another, the pattern of the data points on a scatterplot can illustrate various patterns and relationships, including:
  - data correlation
  - positive or direct relationships between variables
  - negative or inverse relationships between variables
  - non-linear patterns
[Figure: scatterplot showing the relationship between Grip Strength and Arm Strength]
4. Thinking about lines: what can we measure?
- Gradient: a measure of how the line slopes
- Intercept: where the line cuts the y-axis
- Correlation: a measure of how well the line fits the data

Equation for a line: y = a + bx, where a is the point at which the line crosses the y-axis (when x = 0), and b is a measure of the slope (the amount of change in y that occurs with a 1-unit change in x).
[Figure: plot of the line y = 1.5 + 0.5x, with x and y axes running from 0 to 5]
5. Same Intercept, Different Gradient
For each line: y = 35 + bx, where b varies.
6. Same Gradient, Different Intercept
For each line: y = a + 2.5x, where a varies.
7. In Groups
- Draw pictures of the following lines (see the plotting sketch below):
  - y = 2 + 3x
  - y = -2 + x
  - y = 4 - 2x
  - y = 3 - 0.5x
- Write equations for the following lines
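If you want to check your drawings, here is a minimal matplotlib sketch that plots the four lines from the exercise (the axis range 0 to 5 is an assumption, chosen to match the earlier figure):

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(0, 5, 100)
    # The four lines from the exercise, as (intercept a, gradient b) pairs
    lines = {"y = 2 + 3x": (2, 3), "y = -2 + x": (-2, 1),
             "y = 4 - 2x": (4, -2), "y = 3 - 0.5x": (3, -0.5)}
    for label, (a, b) in lines.items():
        plt.plot(x, a + b * x, label=label)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()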
8. Linear relationship
- The technique of line-fitting, known as regression, is used to measure how well a line fits a scatter of points.
- When the data points form a straight line on the graph, the linear relationship between the variables is stronger and the correlation is higher.
- The following scatterplot shows a strong linear relationship between the two variables.
- We say that these two variables are highly correlated.
9. Positive and negative relationships
- Positive or direct relationships
  - If the points cluster around a line that runs from the lower left to the upper right of the graph area, then the relationship between the two variables is positive or direct.
  - An increase in the value of x is more likely to be associated with an increase in the value of y.
  - The closer the points are to the line, the stronger the relationship.
- Negative or inverse relationships
  - If the points tend to cluster around a line that runs from the upper left to the lower right of the graph, then the relationship between the two variables is negative or inverse.
  - An increase in the value of x is more likely to be associated with a decrease in the value of y.
  - The closer the points are to the line, the stronger the relationship.
10. There are lots of online sites where you can explore this topic
- Three examples:
  - http://argyll.epsb.ca/jreed/math9/strand4/scatterPlot.htm
    - This site lets you produce your own scatter plot, produce a line of best fit, practice interpolating data points on the line, and look at the correlation coefficient.
  - http://www.stat.berkeley.edu/stark/Java/Html/Correlation.htm
    - This site lets you alter a scatter plot and add your own points, see the point of averages, standard deviation lines, and correlation coefficient, as well as plot the regression line and more.
  - http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html
    - This site allows you to guess correlations.
- You can also take a look at Chapter 8 of Statistics for the Terrified.
11. Working out the correlation coefficient (Pearson's r)
- Pearson's r tells us how much one variable changes as the values of another change: their covariation.
- Variation is measured with the standard deviation. This measures the average variation of each variable from the mean for that variable.
- Covariation is measured by calculating the amount by which each value of X varies from the mean of X and the amount by which each value of Y varies from the mean of Y, multiplying the differences together, and finding the average (by dividing by n - 1).
- Pearson's r is calculated by dividing this by (SD of x) × (SD of y) in order to standardize it.
12. Working out the correlation coefficient (Pearson's r)
- This can also be calculated as the average of the products of the standardized values of x and y:
  r = Σ(zx × zy) / (n - 1)
- r will always fall between +1 and -1.
- A correlation of either +1 or -1 means perfect association between the two variables.
- A correlation of 0 means that there is no association.
- Note: correlation does not mean causation. We can only determine causation by reference to our theory. However (thinking about it the other way round), there is unlikely to be causation if there is no correlation.
13-15. Worked Example
Average of x = 4, SD = 2; average of y = 7, SD = 4; n = 5.
(Note: reminder of how to standardize scores: z = (score - mean) / SD.)
Products of the standardized x and y values: 0.75, -0.25, 0, -0.75, 2.25; total = 2.00.
Divide by n - 1: 2.00 / (5 - 1) = 2/4 = 0.5, so r = 0.5.
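A minimal Python sketch of this recipe (standardize, multiply the pairs, average by dividing by n - 1). The x and y values here are made up for illustration, since the slide's raw data table is not reproduced in this text:

    import statistics

    def pearson_r(x, y):
        """Pearson's r via the slide's recipe: standardize each variable,
        multiply the paired z-scores, then divide their sum by n - 1."""
        n = len(x)
        mx, my = statistics.mean(x), statistics.mean(y)
        sx, sy = statistics.stdev(x), statistics.stdev(y)  # sample SDs (n - 1)
        zx = [(v - mx) / sx for v in x]
        zy = [(v - my) / sy for v in y]
        return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

    # Made-up example data, not the slide's table
    x = [1, 2, 4, 6, 7]
    y = [3, 5, 6, 9, 12]
    r = pearson_r(x, y)
    print(f"r = {r:.3f}, r squared = {r * r:.3f}")  # r squared: see next slide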
16. Explained Variation
- Pearson's r measures the strength of association between two variables.
- It does not tell you how much of variable y is explained by variable x. To get this you need to calculate r2. This is known as the coefficient of determination.
- In this example r2 = 0.5 × 0.5 = 0.25. Therefore 25% of the variation in y is explained by x.
17. What is Regression?
- A way of predicting the value of one variable from another.
- It is a hypothetical model of the relationship between two variables.
- The model used is a linear one.
- Therefore, we describe the relationship using the equation of a straight line.
18. How the correlation coefficient describes a linear relationship
- The regression line for y on x estimates the average value of y corresponding to each value of x.
- The regression line always goes through the point of averages (the point that contains the average y score and the average x score).
- Associated with each increase of one SD of x there is an increase of r SDs in y, on average.
[Figure: the regression estimate: a rise of r × SDy for each run of SDx, with the line passing through the point of averages]
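This gives a direct recipe for the line: the gradient is b = r × SDy / SDx, and the intercept follows from forcing the line through the point of averages. A small self-contained sketch, again on made-up data:

    import statistics

    def regression_line(x, y):
        """Slope and intercept via the slide's recipe: b = r * SDy / SDx,
        with the line passing through the point of averages (mean x, mean y)."""
        n = len(x)
        mx, my = statistics.mean(x), statistics.mean(y)
        sx, sy = statistics.stdev(x), statistics.stdev(y)
        r = sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (n - 1)
        b = r * sy / sx     # gradient
        a = my - b * mx     # intercept from the point of averages
        return a, b

    x = [1, 2, 4, 6, 7]     # made-up data, as before
    y = [3, 5, 6, 9, 12]
    a, b = regression_line(x, y)
    print(f"y = {a:.2f} + {b:.2f}x")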
19. Regression and the description of a Straight Line
Yi = a + biXi + εi
- bi
  - Regression coefficient for the predictor
  - Gradient (slope) of the regression line
  - Direction/strength of relationship
- a
  - Intercept (value of Y when X = 0)
  - Point at which the regression line crosses the Y-axis (ordinate)
- εi
  - Unexplained error
20. The Method of Least Squares
Why is this line a better summary of the data than a line which is marginally more steep or marginally more shallow, or which is a millimetre or two further up the page? In fact the line has been chosen in such a way that the sum of the squares of the vertical distances between the points and the line is minimised. As we have seen earlier in the module, squaring differences has the advantage of making positive and negative differences equivalent.
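A brief sketch of this idea on made-up data: the least-squares line always has a smaller sum of squared vertical distances than any nearby steeper, shallower, or shifted line.

    def sum_of_squares(x, y, a, b):
        """Sum of squared vertical distances from each point to y = a + b*x."""
        return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    x = [1, 2, 4, 6, 7]
    y = [3, 5, 6, 9, 12]

    # Closed-form least-squares estimates
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx

    print("least squares:", sum_of_squares(x, y, a, b))        # smallest possible
    print("steeper line: ", sum_of_squares(x, y, a, b + 0.3))  # always larger
    print("shifted line: ", sum_of_squares(x, y, a + 0.5, b))  # always larger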
21. How Good is the Model?
- The regression line is only a model based on the data.
- This model might not reflect reality.
- We need some way of testing how well the model fits the observed data.
- How?
22. Sum of Squares
- SST
  - Total variability (variability between the scores and the mean).
- SSR
  - Residual/error variability (variability between the regression model and the actual data).
- SSM
  - Model variability (difference in variability between the model and the mean).
23. Testing the Model: ANOVA
- If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR.
24. Testing the Model: ANOVA
- Mean squared error
  - Sums of squares are total values.
  - They can be expressed as averages. The averages are obtained by dividing the sum of squares by the degrees of freedom for each model.
  - These are called mean squares, MS.
- If you know F you can check whether the model is significantly better at predicting the dependent variable than chance alone.
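A sketch of these quantities for a simple one-predictor regression, on the same made-up data as above (degrees of freedom are 1 for the model and n - 2 for the residuals in this one-predictor case):

    x = [1, 2, 4, 6, 7]
    y = [3, 5, 6, 9, 12]

    # Fit the least-squares line (closed form, as before)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    fitted = [a + b * xi for xi in x]

    sst = sum((yi - my) ** 2 for yi in y)                    # total: scores vs mean
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual: data vs model
    ssm = sst - ssr                                          # model: model vs mean

    ms_model = ssm / 1          # df = 1 (one predictor)
    ms_resid = ssr / (n - 2)    # df = n - 2 for simple regression
    f_ratio = ms_model / ms_resid
    r_squared = ssm / sst       # proportion of variance explained (next slide)

    print(f"SST={sst:.2f} SSR={ssr:.2f} SSM={ssm:.2f} "
          f"F={f_ratio:.2f} R2={r_squared:.3f}")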
25. Testing the Model: R2
- R2
  - The proportion of variance accounted for by the regression model (you can transform R2 into a percentage).
  - The Pearson correlation coefficient squared.
26. Regression: An Example
27. SPSS output showing the F ratio
If the improvement due to fitting the model is much greater than the inaccuracy within the model, then the value of F will be greater than 1. In this instance the value of F is 99.587. SPSS tells us that the probability of obtaining this value of F by chance is very low (p < .001).
Note: Mean square = sum of squares / df; F = MS regression / MS residual.
28. SPSS output showing R2
In this instance the model explains 33.5% of the variation in the dependent variable.
29. SPSS Output: Model Parameters
30. Produce your own regression equations at the following site
- http://people.hofstra.edu/faculty/Stefan_Waner/newgraph/regressionframes.html
- Let's discover what the equation is that relates age to number of jobs ever held (assuming that there is one).
  - y = number of jobs ever held
  - x = age
- Using the equation that we've got from this site:
  - How many jobs would you predict that someone who is 25 would have ever held?
  - What about someone who is 40?
31. Multiple Regression: when there is more than one independent variable
Yi = a + b1X1 + b2X2 + ... + bnXn + εi
- b1
  - Regression coefficient for the first predictor, controlling for the other predictors
  - Direction/strength of relationship
- b2
  - Regression coefficient for the second predictor, controlling for the other predictors
  - Direction/strength of relationship
- bn
  - Regression coefficient for the nth predictor, controlling for the other predictors
  - Direction/strength of relationship
- a
  - Intercept (value of Y when X1, X2, ..., Xn are all 0)
  - Point at which the regression line crosses the Y-axis (ordinate)
- εi
  - Unexplained error
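A minimal sketch of fitting such an equation with numpy's least-squares solver. The data and the meanings attached to the columns are made up for illustration:

    import numpy as np

    # Made-up data: two predictors and an outcome
    x1 = np.array([36.0, 40.0, 20.0, 45.0, 30.0, 38.0])        # e.g. hours worked
    x2 = np.array([25.0, 40.0, 60.0, 35.0, 50.0, 28.0])        # e.g. age
    y = np.array([1500., 2100., 1200., 2500., 1900., 1700.])   # e.g. monthly pay

    # Design matrix: a column of 1s for the intercept, then one column per predictor
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b1, b2 = coef
    print(f"y = {a:.1f} + {b1:.2f}*x1 + {b2:.2f}*x2")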
32. Multiple regression: an example
33. SPSS Output Example: Coefficients
- This is a regression of usual gross monthly pay on the number of hours a respondent has worked, their age, and whether or not they have a degree.
- The coefficients for each independent variable show the effect of that variable holding all other variables in the model constant.
- Questions (a worked sketch follows below):
  - How much would you expect a 30-year-old to earn if they do not have a degree and work 36 hours a week?
  - How much would you expect a 60-year-old to earn if they have a degree and work 20 hours a week?
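The SPSS coefficient table itself is not reproduced in this text, so the numbers below are purely hypothetical placeholders; the sketch only shows how you would plug values into the fitted equation to answer questions like those above:

    # Hypothetical coefficients standing in for the missing SPSS table:
    # pay = a + b_hours*hours + b_age*age + b_degree*degree
    a, b_hours, b_age, b_degree = 200.0, 25.0, 10.0, 400.0  # NOT the lecture's values

    def predicted_pay(hours, age, degree):
        """Plug values into the fitted equation (degree coded 1 = yes, 0 = no)."""
        return a + b_hours * hours + b_age * age + b_degree * degree

    print(predicted_pay(hours=36, age=30, degree=0))  # question 1
    print(predicted_pay(hours=20, age=60, degree=1))  # question 2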
34. SPSS Output Example: effect size
- Standardized coefficients enable us to measure the different effect sizes of different independent variables, i.e. to answer the question: does a person's age or whether they have a degree make more difference to their pay?
- We cannot use the normal coefficients to make this comparison because each variable is measured in different units: a degree is coded 1 or 0 (you have one or not), age ranges from 16 to 90 (relatively evenly spread out), and hours of work from 0 to 100 (but bunched up between 20 and 50).
- Standardized coefficients measure the number of standard deviations of change in the dependent variable (gross pay) that is produced by one standard deviation of change in each independent variable.
- Since we are now comparing like with like, we can determine whether a one-standard-deviation change in respondents' age has more or less effect than a one-standard-deviation change in having a degree.
- As the standardized coefficient for holding a degree is larger than the standardized coefficient for age, we can say that this variable has a larger effect. However, the number of hours worked has the largest effect.
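A standardized coefficient can be computed from a raw coefficient by rescaling with the two standard deviations. A minimal sketch; the raw coefficients and SDs below are illustrative stand-ins, not the lecture's values:

    def standardized_coefficient(b_raw, sd_x, sd_y):
        """Beta = b * (SD of predictor) / (SD of outcome): the SD change in y
        produced by a one-SD change in x, making predictors comparable."""
        return b_raw * sd_x / sd_y

    # Illustrative numbers only (the SPSS table is not reproduced in this text)
    print(standardized_coefficient(b_raw=25.0, sd_x=12.0, sd_y=900.0))   # hours
    print(standardized_coefficient(b_raw=400.0, sd_x=0.45, sd_y=900.0))  # degree (0/1)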
35. Checking Assumptions: Checking Residuals
Linearity: this assumption is that there is a straight-line relationship between the independent and dependent variables (n.b. if there is not, it may be possible to make it linear by transforming one or more variables).
Homoscedasticity: this assumption means that the variance around the regression line is the same for all values of the independent variable(s).
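Both assumptions are commonly checked by plotting residuals against fitted values. A minimal matplotlib sketch, reusing the made-up data and the line fitted to it earlier:

    import matplotlib.pyplot as plt

    x = [1, 2, 4, 6, 7]
    y = [3, 5, 6, 9, 12]
    a, b = 1.62, 1.35   # least-squares fit for these data, rounded

    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]

    # A flat, evenly spread band around zero supports linearity and
    # homoscedasticity; curvature suggests non-linearity, and a funnel
    # shape suggests heteroscedasticity.
    plt.scatter(fitted, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()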
36. Example: Transforming variables
- Some variables have a non-linear effect. If this is the case you may be able to transform them in such a way as to model their effect.
- A common way of transforming variables is to square them. If you include just a squared term the modelled effect will accelerate (getting rapidly larger). If you include both a squared term and the original item you can explore a curvilinear relationship.
- Example: the curvilinear effect of age on income.
  - I suspect that age actually has a curvilinear effect on income, i.e. that people initially earn more as they get older (so age has a positive effect) but that eventually this evens out and then perhaps declines.
  - In order to explore this I will do the same regression as above but will include both age and age2 as independent variables. (I calculate age2 using the compute function in SPSS, setting age2 = age × age.)
37. Example: Transforming variables
As you can see, the coefficient for age is large and positive and the coefficient for age2 is small and negative. This means that the combined effect of age for someone who is 25 is:
64.596 × age - 0.732 × age2 = 64.596 × 25 - 0.732 × 25² = 1,614.90 - 457.50 = 1,157.40
What is the combined effect of age for someone who is 45? 65? What happens to the size of the squared term in comparison to the original term? (The sketch below evaluates these.)
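A quick way to explore the slide's question is to evaluate the combined age term at several ages, using the coefficients quoted above:

    def age_effect(age):
        """Combined effect of age on pay from the fitted model:
        64.596 * age - 0.732 * age**2 (coefficients from the slide)."""
        return 64.596 * age - 0.732 * age ** 2

    for a in (25, 45, 65):
        print(a, round(age_effect(a), 2))
    # The negative squared term grows faster than the linear term,
    # so the curve flattens and eventually turns down at higher ages.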
38. Example: Interactions
It occurred to me that age may have a different effect on those people with higher levels of education than on those with lower educational levels. In order to investigate this I decided to look at the interaction of age and degree: I created a new variable, age × degree. This will be equal to age for those with a degree (scored 1) and will equal zero for those without a degree (scored 0). This means that the combined effect of age for someone who is 25 and has a degree is:
(64.323 + 7.703) × age - 0.735 × age2 = 72.026 × 25 - 0.735 × 25² = 1,800.65 - 459.38 = 1,341.28
Substantively, this means that age has a stronger (positive) influence on pay when people have a degree.
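The same calculation in code, separating the two groups; the coefficients are those quoted on the slide, and the interaction coefficient 7.703 contributes only when degree = 1:

    def age_effect(age, degree):
        """Combined age effect with the age-by-degree interaction:
        (64.323 + 7.703 * degree) * age - 0.735 * age**2 (slide coefficients)."""
        return (64.323 + 7.703 * degree) * age - 0.735 * age ** 2

    print(age_effect(25, degree=1))   # 1341.275, matching the slide
    print(age_effect(25, degree=0))   # the shallower age effect without a degree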
39. Note: The effect of outliers
Because the regression line minimizes the squared differences of points to the line, outliers can have a very large effect (as their squared distance to the line will make a big difference). This is why it is sometimes advisable to run the regression analysis omitting outliers.
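A small demonstration of this sensitivity on made-up data: adding a single extreme point and refitting pulls the slope sharply away from the fit to the well-behaved points.

    import numpy as np

    def fit_line(x, y):
        """Least-squares slope and intercept for y = a + b*x."""
        b, a = np.polyfit(x, y, 1)
        return a, b

    x = np.array([1., 2., 4., 6., 7.])
    y = np.array([3., 5., 6., 9., 12.])
    print(fit_line(x, y))              # fit without the outlier

    x_out = np.append(x, 8.0)
    y_out = np.append(y, 40.0)         # one extreme point
    print(fit_line(x_out, y_out))      # slope pulled sharply upward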
40. Next Week
- Regression requires that your dependent variable is interval-ratio.
- Next week we will look at logistic regression, which is similar to regression analysis (and produces similar-looking equations and SPSS output), but is used where the dependent variable is dichotomous.