Title: Correlation and regression
1. Correlation and regression
- Friday, February 24th 2006
2. Outline
- Lines: intercept and gradient
- Correlation
  - Line fitting
  - The correlation coefficient: Pearson's r
- Regression
  - What is it?
  - Least squares
  - Testing the model
  - Example SPSS output
- Multiple regression
  - Coefficients
  - Effect size
  - Assumptions
  - Transforming variables
  - Interactions
  - Outliers
3. Looking at the relationship between two interval-ratio variables
(You can use ordinal variables if you adjust them to represent rank order.)
- When we want to know how two variables are related to one another, the pattern of the data points on a scatterplot can illustrate various patterns and relationships, including:
  - data correlation
  - positive or direct relationships between variables
  - negative or inverse relationships between variables
  - non-linear patterns
[Figure: scatterplot showing the relationship between Grip Strength and Arm Strength]
4. Thinking about lines: what can we measure?
- Gradient: a measure of how the line slopes
- Intercept: where the line cuts the y-axis
- Correlation: a measure of how well the line fits the data

Equation for a line: y = a + bx, where a is the point at which the line crosses the y-axis (when x = 0), and b is a measure of the slope (the amount of change in y that occurs with a 1-unit change in x).
[Figure: plot of the line y = 1.5 + 0.5x, with x and y axes running from 0 to 5]
5. Same Intercept, Different Gradient
For each line: y = 35 + bx, where b varies.
6. Same Gradient, Different Intercept
For each line: y = a + 2.5x, where a varies.
7. In Groups
- Draw pictures of the following lines (see the plotting sketch below):
  - y = 2 + 3x
  - y = -2 + x
  - y = 4 - 2x
  - y = 3 - 0.5x
- Write equations for the following lines
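If you want to check your drawings, here is a minimal matplotlib sketch that plots the four lines from the exercise (the axis range 0 to 5 is an assumption, chosen to match the earlier figure):

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(0, 5, 100)
    # The four lines from the exercise, as (intercept a, gradient b) pairs
    lines = {"y = 2 + 3x": (2, 3), "y = -2 + x": (-2, 1),
             "y = 4 - 2x": (4, -2), "y = 3 - 0.5x": (3, -0.5)}
    for label, (a, b) in lines.items():
        plt.plot(x, a + b * x, label=label)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()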
8. Linear relationship
- The technique of line-fitting, known as regression, is used to measure how well a line fits a scatter of points.
- When the data points form a straight line on the graph, the linear relationship between the variables is stronger and the correlation is higher.
- The following scatterplot shows a strong linear relationship between the two variables.
- We say that these two variables are highly correlated.
9. Positive and negative relationships
- Positive or direct relationships
  - If the points cluster around a line that runs from the lower left to the upper right of the graph area, then the relationship between the two variables is positive or direct.
  - An increase in the value of x is more likely to be associated with an increase in the value of y.
  - The closer the points are to the line, the stronger the relationship.
- Negative or inverse relationships
  - If the points tend to cluster around a line that runs from the upper left to the lower right of the graph, then the relationship between the two variables is negative or inverse.
  - An increase in the value of x is more likely to be associated with a decrease in the value of y.
  - The closer the points are to the line, the stronger the relationship.
10. There are lots of online sites where you can explore this topic
- Three examples:
  - http://argyll.epsb.ca/jreed/math9/strand4/scatterPlot.htm
    - This site lets you produce your own scatter plot, produce a line of best fit, practice interpolating data points on the line, and look at the correlation coefficient.
  - http://www.stat.berkeley.edu/stark/Java/Html/Correlation.htm
    - This site lets you alter a scatter plot and add your own points, see the point of averages, standard deviation lines, and correlation coefficient, as well as plot the regression line and more.
  - http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html
    - This site allows you to guess correlations.
- You can also take a look at Chapter 8 of Statistics for the Terrified.
11. Working out the correlation coefficient (Pearson's r)
- Pearson's r tells us how much one variable changes as the values of another change: their covariation.
- Variation is measured with the standard deviation. This measures the average variation of each variable from the mean for that variable.
- Covariation is measured by calculating the amount by which each value of X varies from the mean of X and the amount by which each value of Y varies from the mean of Y, multiplying the differences together, and finding the average (by dividing by n - 1).
- Pearson's r is calculated by dividing this by (SD of x) × (SD of y) in order to standardize it.
12. Working out the correlation coefficient (Pearson's r)
- This can also be calculated as the average of the products of the standardized values of x and y:
  r = Σ(zx × zy) / (n - 1)
- r will always fall between +1 and -1.
- A correlation of either +1 or -1 means perfect association between the two variables.
- A correlation of 0 means that there is no association.
- Note: correlation does not mean causation. We can only determine causation by reference to our theory. However (thinking about it the other way round), there is unlikely to be causation if there is no correlation.
13-15. Worked Example
Average of x = 4, SD = 2; average of y = 7, SD = 4; n = 5.
(Note: reminder of how to standardize scores: z = (score - mean) / SD.)
Products of the standardized x and y values: 0.75, -0.25, 0, -0.75, 2.25; total = 2.00.
Divide by n - 1: 2.00 / (5 - 1) = 2/4 = 0.5, so r = 0.5.
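A minimal Python sketch of this recipe (standardize, multiply the pairs, average by dividing by n - 1). The x and y values here are made up for illustration, since the slide's raw data table is not reproduced in this text:

    import statistics

    def pearson_r(x, y):
        """Pearson's r via the slide's recipe: standardize each variable,
        multiply the paired z-scores, then divide their sum by n - 1."""
        n = len(x)
        mx, my = statistics.mean(x), statistics.mean(y)
        sx, sy = statistics.stdev(x), statistics.stdev(y)  # sample SDs (n - 1)
        zx = [(v - mx) / sx for v in x]
        zy = [(v - my) / sy for v in y]
        return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

    # Made-up example data, not the slide's table
    x = [1, 2, 4, 6, 7]
    y = [3, 5, 6, 9, 12]
    r = pearson_r(x, y)
    print(f"r = {r:.3f}, r squared = {r * r:.3f}")  # r squared: see next slide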
16. Explained Variation
- Pearson's r measures the strength of association between two variables.
- It does not tell you how much of variable y is explained by variable x. To get this you need to calculate r2. This is known as the coefficient of determination.
- In this example r2 = 0.5 × 0.5 = 0.25. Therefore 25% of the variation in y is explained by x.
17. What is Regression?
- A way of predicting the value of one variable from another.
- It is a hypothetical model of the relationship between two variables.
- The model used is a linear one.
- Therefore, we describe the relationship using the equation of a straight line.
18. How the correlation coefficient describes a linear relationship
- The regression line for y on x estimates the average value of y corresponding to each value of x.
- The regression line always goes through the point of averages (the point that contains the average y score and the average x score).
- Associated with each increase of one SD of x there is an increase of r SDs in y, on average.
[Figure: the regression estimate: a rise of r × SDy for each run of SDx, with the line passing through the point of averages]
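This gives a direct recipe for the line: the gradient is b = r × SDy / SDx, and the intercept follows from forcing the line through the point of averages. A small self-contained sketch, again on made-up data:

    import statistics

    def regression_line(x, y):
        """Slope and intercept via the slide's recipe: b = r * SDy / SDx,
        with the line passing through the point of averages (mean x, mean y)."""
        n = len(x)
        mx, my = statistics.mean(x), statistics.mean(y)
        sx, sy = statistics.stdev(x), statistics.stdev(y)
        r = sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (n - 1)
        b = r * sy / sx     # gradient
        a = my - b * mx     # intercept from the point of averages
        return a, b

    x = [1, 2, 4, 6, 7]     # made-up data, as before
    y = [3, 5, 6, 9, 12]
    a, b = regression_line(x, y)
    print(f"y = {a:.2f} + {b:.2f}x")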
19. Regression and the description of a Straight Line
Yi = a + biXi + εi
- bi
  - Regression coefficient for the predictor
  - Gradient (slope) of the regression line
  - Direction/strength of relationship
- a
  - Intercept (value of Y when X = 0)
  - Point at which the regression line crosses the Y-axis (ordinate)
- εi
  - Unexplained error
20. The Method of Least Squares
Why is this line a better summary of the data than a line which is marginally more steep or marginally more shallow, or which is a millimetre or two further up the page? In fact the line has been chosen in such a way that the sum of the squares of the vertical distances between the points and the line is minimised. As we have seen earlier in the module, squaring differences has the advantage of making positive and negative differences equivalent.
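A brief sketch of this idea on made-up data: the least-squares line always has a smaller sum of squared vertical distances than any nearby steeper, shallower, or shifted line.

    def sum_of_squares(x, y, a, b):
        """Sum of squared vertical distances from each point to y = a + b*x."""
        return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    x = [1, 2, 4, 6, 7]
    y = [3, 5, 6, 9, 12]

    # Closed-form least-squares estimates
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx

    print("least squares:", sum_of_squares(x, y, a, b))        # smallest possible
    print("steeper line: ", sum_of_squares(x, y, a, b + 0.3))  # always larger
    print("shifted line: ", sum_of_squares(x, y, a + 0.5, b))  # always larger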
21. How Good is the Model?
- The regression line is only a model based on the data.
- This model might not reflect reality.
- We need some way of testing how well the model fits the observed data.
- How?
22. Sum of Squares
- SST
  - Total variability (variability between the scores and the mean).
- SSR
  - Residual/error variability (variability between the regression model and the actual data).
- SSM
  - Model variability (difference in variability between the model and the mean).
23. Testing the Model: ANOVA
- If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR.
24. Testing the Model: ANOVA
- Mean squared error
  - Sums of squares are total values.
  - They can be expressed as averages. The averages are obtained by dividing the sum of squares by the degrees of freedom for each model.
  - These are called mean squares, MS.
- If you know F you can check whether the model is significantly better at predicting the dependent variable than chance alone.
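A sketch of these quantities for a simple one-predictor regression, on the same made-up data as above (degrees of freedom are 1 for the model and n - 2 for the residuals in this one-predictor case):

    x = [1, 2, 4, 6, 7]
    y = [3, 5, 6, 9, 12]

    # Fit the least-squares line (closed form, as before)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    fitted = [a + b * xi for xi in x]

    sst = sum((yi - my) ** 2 for yi in y)                    # total: scores vs mean
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual: data vs model
    ssm = sst - ssr                                          # model: model vs mean

    ms_model = ssm / 1          # df = 1 (one predictor)
    ms_resid = ssr / (n - 2)    # df = n - 2 for simple regression
    f_ratio = ms_model / ms_resid
    r_squared = ssm / sst       # proportion of variance explained (next slide)

    print(f"SST={sst:.2f} SSR={ssr:.2f} SSM={ssm:.2f} "
          f"F={f_ratio:.2f} R2={r_squared:.3f}")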
25. Testing the Model: R2
- R2
  - The proportion of variance accounted for by the regression model (you can transform R2 into a percentage).
  - The Pearson correlation coefficient squared.
26. Regression: An Example
27. SPSS output showing the F ratio
If the improvement due to fitting the model is much greater than the inaccuracy within the model, then the value of F will be greater than 1. In this instance the value of F is 99.587. SPSS tells us that the probability of obtaining this value of F by chance is very low (p < .001).
Note: Mean square = sum of squares / df; F = MS regression / MS residual.
28. SPSS output showing R2
In this instance the model explains 33.5% of the variation in the dependent variable.
29. SPSS Output: Model Parameters
30. Produce your own regression equations at the following site
- http://people.hofstra.edu/faculty/Stefan_Waner/newgraph/regressionframes.html
- Let's discover what the equation is that relates age to number of jobs ever held (assuming that there is one).
  - y = number of jobs ever held
  - x = age
- Using the equation that we've got from this site:
  - How many jobs would you predict that someone who is 25 would have ever held?
  - What about someone who is 40?
31. Multiple Regression: when there is more than one independent variable
Yi = a + b1X1 + b2X2 + ... + bnXn + εi
- b1
  - Regression coefficient for the first predictor, controlling for the other predictors
  - Direction/strength of relationship
- b2
  - Regression coefficient for the second predictor, controlling for the other predictors
  - Direction/strength of relationship
- bn
  - Regression coefficient for the nth predictor, controlling for the other predictors
  - Direction/strength of relationship
- a
  - Intercept (value of Y when X1, X2, ..., Xn are all 0)
  - Point at which the regression line crosses the Y-axis (ordinate)
- εi
  - Unexplained error
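A minimal sketch of fitting such an equation with numpy's least-squares solver. The data and the meanings attached to the columns are made up for illustration:

    import numpy as np

    # Made-up data: two predictors and an outcome
    x1 = np.array([36.0, 40.0, 20.0, 45.0, 30.0, 38.0])        # e.g. hours worked
    x2 = np.array([25.0, 40.0, 60.0, 35.0, 50.0, 28.0])        # e.g. age
    y = np.array([1500., 2100., 1200., 2500., 1900., 1700.])   # e.g. monthly pay

    # Design matrix: a column of 1s for the intercept, then one column per predictor
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b1, b2 = coef
    print(f"y = {a:.1f} + {b1:.2f}*x1 + {b2:.2f}*x2")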
32. Multiple regression: an example
33. SPSS Output Example: Coefficients
- This is a regression of usual gross monthly pay on the number of hours a respondent has worked, their age, and whether or not they have a degree.
- The coefficients for each independent variable show the effect of that variable holding all other variables in the model constant.
- Questions (a worked sketch follows below):
  - How much would you expect a 30-year-old to earn if they do not have a degree and work 36 hours a week?
  - How much would you expect a 60-year-old to earn if they have a degree and work 20 hours a week?
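The SPSS coefficient table itself is not reproduced in this text, so the numbers below are purely hypothetical placeholders; the sketch only shows how you would plug values into the fitted equation to answer questions like those above:

    # Hypothetical coefficients standing in for the missing SPSS table:
    # pay = a + b_hours*hours + b_age*age + b_degree*degree
    a, b_hours, b_age, b_degree = 200.0, 25.0, 10.0, 400.0  # NOT the lecture's values

    def predicted_pay(hours, age, degree):
        """Plug values into the fitted equation (degree coded 1 = yes, 0 = no)."""
        return a + b_hours * hours + b_age * age + b_degree * degree

    print(predicted_pay(hours=36, age=30, degree=0))  # question 1
    print(predicted_pay(hours=20, age=60, degree=1))  # question 2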
34. SPSS Output Example: effect size
- Standardized coefficients enable us to measure the different effect sizes of different independent variables, i.e. to answer the question: does a person's age or whether they have a degree make more difference to their pay?
- We cannot use the normal coefficients to make this comparison because each variable is measured in different units: a degree is coded 1 or 0 (you have one or not), age ranges from 16 to 90 (relatively evenly spread out), and hours of work from 0 to 100 (but bunched up between 20 and 50).
- Standardized coefficients measure the number of standard deviations of change in the dependent variable (gross pay) that is produced by one standard deviation of change in each independent variable.
- Since we are now comparing like with like, we can determine whether a one-standard-deviation change in respondents' age has more or less effect than a one-standard-deviation change in having a degree.
- As the standardized coefficient for holding a degree is larger than the standardized coefficient for age, we can say that this variable has a larger effect. However, the number of hours worked has the largest effect.
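A standardized coefficient can be computed from a raw coefficient by rescaling with the two standard deviations. A minimal sketch; the raw coefficients and SDs below are illustrative stand-ins, not the lecture's values:

    def standardized_coefficient(b_raw, sd_x, sd_y):
        """Beta = b * (SD of predictor) / (SD of outcome): the SD change in y
        produced by a one-SD change in x, making predictors comparable."""
        return b_raw * sd_x / sd_y

    # Illustrative numbers only (the SPSS table is not reproduced in this text)
    print(standardized_coefficient(b_raw=25.0, sd_x=12.0, sd_y=900.0))   # hours
    print(standardized_coefficient(b_raw=400.0, sd_x=0.45, sd_y=900.0))  # degree (0/1)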
35. Checking Assumptions: Checking Residuals
Linearity: this assumption is that there is a straight-line relationship between the independent and dependent variables (n.b. if there is not, it may be possible to make it linear by transforming one or more variables).
Homoscedasticity: this assumption means that the variance around the regression line is the same for all values of the independent variable(s).
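Both assumptions are commonly checked by plotting residuals against fitted values. A minimal matplotlib sketch, reusing the made-up data and the line fitted to it earlier:

    import matplotlib.pyplot as plt

    x = [1, 2, 4, 6, 7]
    y = [3, 5, 6, 9, 12]
    a, b = 1.62, 1.35   # least-squares fit for these data, rounded

    fitted = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]

    # A flat, evenly spread band around zero supports linearity and
    # homoscedasticity; curvature suggests non-linearity, and a funnel
    # shape suggests heteroscedasticity.
    plt.scatter(fitted, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()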
36. Example: Transforming variables
- Some variables have a non-linear effect. If this is the case you may be able to transform them in such a way as to model their effect.
- A common way of transforming variables is to square them. If you include just a squared term the modelled effect will accelerate (getting rapidly larger). If you include both a squared term and the original item you can explore a curvilinear relationship.
- Example: the curvilinear effect of age on income.
  - I suspect that age actually has a curvilinear effect on income, i.e. that people initially earn more as they get older (so age has a positive effect) but that eventually this evens out and then perhaps declines.
  - In order to explore this I will do the same regression as above but will include both age and age2 as independent variables. (I calculate age2 using the compute function in SPSS, setting age2 = age × age.)
37. Example: Transforming variables
As you can see, the coefficient for age is large and positive and the coefficient for age2 is small and negative. This means that the combined effect of age for someone who is 25 is:
64.596 × age - 0.732 × age2 = 64.596 × 25 - 0.732 × 25² = 1,614.90 - 457.50 = 1,157.40
What is the combined effect of age for someone who is 45? 65? What happens to the size of the squared term in comparison to the original term? (The sketch below evaluates these.)
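A quick way to explore the slide's question is to evaluate the combined age term at several ages, using the coefficients quoted above:

    def age_effect(age):
        """Combined effect of age on pay from the fitted model:
        64.596 * age - 0.732 * age**2 (coefficients from the slide)."""
        return 64.596 * age - 0.732 * age ** 2

    for a in (25, 45, 65):
        print(a, round(age_effect(a), 2))
    # The negative squared term grows faster than the linear term,
    # so the curve flattens and eventually turns down at higher ages.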
38. Example: Interactions
It occurred to me that age may have a different effect on those people with higher levels of education than on those with lower educational levels. In order to investigate this I decided to look at the interaction of age and degree: I created a new variable, age × degree. This will be equal to age for those with a degree (scored 1) and will equal zero for those without a degree (scored 0). This means that the combined effect of age for someone who is 25 and has a degree is:
(64.323 + 7.703) × age - 0.735 × age2 = 72.026 × 25 - 0.735 × 25² = 1,800.65 - 459.38 = 1,341.28
Substantively, this means that age has a stronger (positive) influence on pay when people have a degree.
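The same calculation in code, separating the two groups; the coefficients are those quoted on the slide, and the interaction coefficient 7.703 contributes only when degree = 1:

    def age_effect(age, degree):
        """Combined age effect with the age-by-degree interaction:
        (64.323 + 7.703 * degree) * age - 0.735 * age**2 (slide coefficients)."""
        return (64.323 + 7.703 * degree) * age - 0.735 * age ** 2

    print(age_effect(25, degree=1))   # 1341.275, matching the slide
    print(age_effect(25, degree=0))   # the shallower age effect without a degree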
39. Note: The effect of outliers
Because the regression line minimizes the squared differences of points to the line, outliers can have a very large effect (as their squared distance to the line will make a big difference). This is why it is sometimes advisable to run the regression analysis omitting outliers.
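A small demonstration of this sensitivity on made-up data: adding a single extreme point and refitting pulls the slope sharply away from the fit to the well-behaved points.

    import numpy as np

    def fit_line(x, y):
        """Least-squares slope and intercept for y = a + b*x."""
        b, a = np.polyfit(x, y, 1)
        return a, b

    x = np.array([1., 2., 4., 6., 7.])
    y = np.array([3., 5., 6., 9., 12.])
    print(fit_line(x, y))              # fit without the outlier

    x_out = np.append(x, 8.0)
    y_out = np.append(y, 40.0)         # one extreme point
    print(fit_line(x_out, y_out))      # slope pulled sharply upward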
40. Next Week
- Regression requires that your dependent variable is interval-ratio.
- Next week we will look at logistic regression, which is similar to regression analysis (and produces similar-looking equations and SPSS output), but is used where the dependent variable is dichotomous.