Title: Correlation and Regression
1Chapter 15
- Correlation and Regression
2Introduction
When one does correlational research, he or she
is interested in the relationship between two
variables.
3Some examples of questions that would be answered
with a correlational study
- Do taller people tend to weigh more than shorter
people?
- Do people with higher IQ scores tend to do better
in school than others?
- Do children who eat more sugar in their diet tend
to be more active than other children?
4In each case, the researcher would obtain a pair
of observations from each member of the sample.
5Examples
- To determine if taller people tend to weigh
more than shorter people, the researcher would
have to obtain the height and weight of each
member in a sample.
- To determine if people with higher IQ scores tend
to do better than others in school, the
researcher would have to obtain the IQ score and
some index of academic performance (e.g., GPA)
from each member of the sample.
- To determine if children who eat more sugar are
more active than other children, a researcher
would have to record the amount of sugar consumed
and the activity level of each child in the
sample.
7Bivariate Distributions
Because correlational research involves getting
pairs of scores, what results is a bivariate
distribution. Bivariate distributions should be
distinguished from univariate distributions.
8When we do correlational research, we need ways
to statistically describe the nature of the
relationship between the two variables.
9Scatter Plots
One way we can get a sense of the relationship
between the two variables is to construct a
scatter plot. In a scatter plot, each pair of
scores is represented by a point in a
two-dimensional space. The horizontal position
of each point is determined by the value of one
of the variables. The vertical position is
determined by the value of the other variable.
10Example
Suppose we have the following pairs: (1, 2), (1, 3),
(2, 2), (3, 4). What would the scatter plot look like for
these four pairs?
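One way to see the answer is to draw the four points on a small text grid. The sketch below is only an illustration: the helper name text_scatter and the grid size (X from 1 to 4, Y from 1 to 5) are assumptions chosen to fit these four pairs, not part of the original slides.

```python
# The four (X, Y) pairs from the example.
pairs = [(1, 2), (1, 3), (2, 2), (3, 4)]


def text_scatter(points, x_max=4, y_max=5):
    """Return a text scatter plot: X runs left to right, Y runs bottom to top.

    Hypothetical helper for illustration; assumes whole-number scores
    between 1 and x_max (for X) and between 1 and y_max (for Y).
    """
    occupied = set(points)
    rows = []
    # Print the highest Y value first so the vertical axis reads upward.
    for y in range(y_max, 0, -1):
        cells = ["*" if (x, y) in occupied else "." for x in range(1, x_max + 1)]
        rows.append(f"{y} | " + " ".join(cells))
    # Label the horizontal axis under the grid.
    rows.append("    " + " ".join(str(x) for x in range(1, x_max + 1)))
    return "\n".join(rows)


plot = text_scatter(pairs)
print(plot)
```

Each asterisk marks one pair: its column is the X score and its row is the Y score.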
12Which variable you designate as X and which you
designate as Y is largely arbitrary. The only
time when it might make a difference is when one
of the two variables might be logically used to
predict the other.
13When that is true, we should plot the variable
used to make the prediction (the predictor
variable) along the X-axis, and the variable we
are trying to predict (the criterion variable)
along the Y-axis.
14For example, when it comes to height and weight,
we would probably be more likely to use one's
height (the predictor variable) to predict one's
weight (the criterion variable). Therefore, it
would make sense to plot height along the X-axis
and weight along the Y-axis.
15Interpreting Scatter Plots
16This scatter plot depicts a perfect, positive
relationship between the two variables. By
perfect, positive we mean that there is perfect
consistency between the two variables. A given
increase in X is always accompanied by the same
amount of increase in Y.
17This scatter plot depicts a strong, positive
relationship. As X increases, so does Y.
However, the increase is not perfectly consistent.
18This scatter plot depicts a weak, positive
relationship. As X increases, there is only a
slight tendency for Y to increase.
19This scatter plot depicts an instance where there
is no relationship between X and Y. As X
increases, Y neither increases nor decreases.
20This scatter plot depicts a perfect, negative
(inverse) relationship. As X increases, Y
decreases.
21This is a strong, negative (inverse) relationship.
22This is a weak, negative (inverse) relationship.
23In other words, relationships can vary from
perfect positive to perfect negative. The full
dimension runs from perfect, negative at one end,
through no relationship in the middle, to perfect,
positive at the other end.
24How would you describe the relationship between
height and weight?
25The Pearson Correlation Coefficient
The Pearson correlation coefficient is a
statistic that quite precisely describes the
relationship between two variables. Specifically,
it indicates whether the relationship is positive
or negative and how strong the relationship is.
26The conceptual formula
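The standard textbook statement of the conceptual (deviation-score) formula is written out below in two equivalent forms. Its numerator matches the sum of products of deviations that the following slides call SP. The symbols \bar{X} and \bar{Y} (the two means) and SS_X and SS_Y (the sums of squared deviations of X and of Y) are notation assumed here, not taken from the slides.

```latex
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}
         {\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}
\qquad \text{or} \qquad
r = \frac{SP}{\sqrt{SS_X \, SS_Y}}
```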
27The numerator is the interesting part of this
formula.
First, it determines how the X and Y members of a
pair deviate from their respective means.
Then, by multiplying these deviations, it
determines if they deviate in the same direction
(in which case the product will be positive) or
in opposite directions (in which case the product
will be negative).
Finally, by summing the products of these
deviations, it determines if there is a
consistent pattern across all pairs.
The sum of the products of the deviation scores
is frequently represented by the symbol SP.
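The three steps can be traced in a few lines of Python. This is a minimal sketch using the four pairs from the earlier scatter plot example. The denominator (the square root of the product of the two sums of squared deviations) is the standard one and is assumed here, since the slides describe only the numerator. All variable names are illustrative.

```python
from math import sqrt

# Illustrative data: the four (X, Y) pairs from the scatter plot example.
pairs = [(1, 2), (1, 3), (2, 2), (3, 4)]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Step 1: how far each X and each Y falls from its own mean.
dev_x = [x - mean_x for x in xs]
dev_y = [y - mean_y for y in ys]

# Step 2: a product is positive when the pair deviates in the same
# direction and negative when it deviates in opposite directions.
products = [dx * dy for dx, dy in zip(dev_x, dev_y)]

# Step 3: SP, the sum of the products of the deviations.
sp = sum(products)

# Assumed denominator: square root of SS_X times SS_Y.
ss_x = sum(dx ** 2 for dx in dev_x)
ss_y = sum(dy ** 2 for dy in dev_y)
r = sp / sqrt(ss_x * ss_y)

print(products)          # [0.5625, -0.1875, -0.1875, 1.5625]
print(sp, round(r, 3))   # 1.75 0.636
```

Two of the four products are negative, but the positive products outweigh them, so SP and r come out positive.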
30Some examples
31X and Y scores consistently deviate in the same
direction from their respective means. This
results in a large positive value for the sum of
the products of the deviations.
32In this case, the X and Y values consistently
deviate in opposite directions. This results in
a large negative value for the sum of the
products of the deviations.
33What will happen in this case?
34Or in these two cases?
35Let's take a look at the height and weight data.
36Calculating the Pearson correlation coefficient
37While you can use the conceptual formula to
compute r, it is easier to use the raw-score, or
computational, formula.
38A good strategy is to break the formula into
three components and compute the value for each
component and then insert them into the formula.
39It's also a good strategy to compute the
following quantities before you begin.
40Example: What is r for the following pairs of
scores?
(1, 2), (1, 3), (2, 2), (3, 4)
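A worked version of this example is sketched below. It assumes one common form of the raw-score formula, r = (N*SXY - SX*SY) / sqrt((N*SXX - SX^2) * (N*SYY - SY^2)), where SX, SY, SXY, SXX and SYY are the sums of X, Y, XY, X squared and Y squared. The slides' own version of the formula and their list of preliminary quantities are not shown, so both the form and the variable names are assumptions.

```python
from math import sqrt

pairs = [(1, 2), (1, 3), (2, 2), (3, 4)]

# Preliminary quantities, computed before touching the formula.
n = len(pairs)
sum_x = sum(x for x, _ in pairs)        # 7
sum_y = sum(y for _, y in pairs)        # 11
sum_xy = sum(x * y for x, y in pairs)   # 21
sum_x2 = sum(x * x for x, _ in pairs)   # 15
sum_y2 = sum(y * y for _, y in pairs)   # 33

# The three components of the (assumed) raw-score formula.
numerator = n * sum_xy - sum_x * sum_y  # 4*21 - 7*11 = 7
x_part = n * sum_x2 - sum_x ** 2        # 4*15 - 49  = 11
y_part = n * sum_y2 - sum_y ** 2        # 4*33 - 121 = 11

# Insert the components into the formula.
r = numerator / sqrt(x_part * y_part)
print(round(r, 3))  # 0.636
```

The result agrees with what the conceptual formula gives for the same four pairs, as it should, since the two formulas are algebraically equivalent.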
42What is r for the following pairs of scores?
(1, 4), (1, 5), (2, 2), (4, 1)
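The arithmetic for this second set can be laid out as follows. The raw-score formula used is one common form and is an assumption, since the slides' own formula is not shown. The sums are N = 4, \sum X = 8, \sum Y = 12, \sum XY = 17, \sum X^2 = 22 and \sum Y^2 = 46.

```latex
r = \frac{N \sum XY - \sum X \sum Y}
         {\sqrt{\left[ N \sum X^2 - (\sum X)^2 \right]
                \left[ N \sum Y^2 - (\sum Y)^2 \right]}}
  = \frac{4(17) - (8)(12)}{\sqrt{\left[ 4(22) - 8^2 \right] \left[ 4(46) - 12^2 \right]}}
  = \frac{-28}{\sqrt{(24)(40)}}
  \approx -0.90
```

The negative sign reflects that the larger X scores are paired with the smaller Y scores.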
44What's the correlation between height and weight?
45Interpreting correlation coefficients
- The correlation coefficient can take any value
between -1 and 1.
- The sign of the correlation coefficient indicates
whether the relationship is positive (as X
increases, Y also increases) or negative (as X
increases, Y decreases).
- Its absolute value indicates how strong the
relationship is. Values close to 0 indicate a weak
or nonexistent relationship. Values close to
either 1 or -1 indicate a very strong one.
46r = -1
-1 < r < 0
r = 0
0 < r < 1
r = 1
47While the correlation coefficient conveys
information about the strength of a relationship
in a precise way, it doesn't do so in a manner
that is particularly meaningful.
48For example, a correlation coefficient of .8
indicates a stronger relationship than .7. Just
how strong is a relationship of .8 or .7, however?
49The Coefficient of Determination
The coefficient of determination conveys
information about the strength of a relationship
in a manner that is quite meaningful.
50The Coefficient of Determination
Specifically, it tells you how strong a
relationship is by indicating the proportion of
variance that is shared by the two variables.
51To illustrate
Suppose that the circle below labeled X
represents the variance of X and that the circle
labeled Y represents the variance in Y.
52If X and Y are correlated, that means that the
two variables co-vary to some extent. In other
words, they share some variance. The shared
variance is represented by the portion of the
circles that overlap.
53The stronger the relationship between X and Y,
the more overlap, or shared variance, there will
be.
54Calculating the Coefficient of Determination
To calculate the coefficient of determination,
simply square the correlation coefficient (r^2).
This will tell you exactly what proportion of the
variance in one variable is shared with the
second variable. Unique variance is simply
1 - r^2.
56Example
If the correlation between height and weight is
.64, then the coefficient of determination would
be .41. That indicates that the two variables
share 41% of their variance. That also means
that 59% of the variance in each variable is
unique (i.e., not shared with the other).
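The arithmetic behind this example is short enough to check directly. The sketch below squares the correlation and subtracts the result from 1. The variable names are illustrative only.

```python
r = 0.64

# Coefficient of determination: proportion of shared variance.
r_squared = r ** 2

# Unique variance: the proportion not shared with the other variable.
unique = 1 - r_squared

print(f"shared: {r_squared:.0%}, unique: {unique:.0%}")  # shared: 41%, unique: 59%
```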
58While the coefficient of determination conveys
information about the strength of a relationship,
it does not convey information about the type of
relationship (positive vs. negative). That is
because the sign of the correlation coefficient
is lost when it is squared.
59Linear Regression
When we calculate a correlation coefficient, we
are really determining the extent to which the
relationship can be described by a straight line.
60In this case, a straight line provides a fairly
good description of the relationship (i.e., the
points tend to fall close to the line).
Consequently, the correlation coefficient would be
relatively large.
61In contrast, a straight line does a poorer job
describing this relationship, since the points
tend to fall well away from the line. In
this case, we would expect a small correlation
coefficient.
63The line that best describes the relationship
between two variables (i.e., comes closest to all
of the points) is referred to as the regression
line. By obtaining the formula for the regression
line, we can more precisely describe how X and Y
are related.
64Example
The relationships depicted below are both strong
and positive, yet they are not the same
relationship.
The difference is reflected in the slope of the
regression lines.
65We can also use the regression line to predict a
Y value given any value of X.
66Obtaining the Regression Equation
67All straight lines have a common formula:
Y = b(X) + a
b is referred to as the slope. It indicates how
much Y will change for every unit of change in
X.
a is referred to as the Y intercept. It
indicates the point at which the line intercepts
the Y-axis. Different lines will have different
values for b and a.
68To obtain the regression equation, we must
calculate values for the slope (b) and intercept
(a). Here are the formulas
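The standard least-squares formulas are assumed in the sketch below: the slope is b = SP / SS_X (the sum of products of deviations divided by the sum of squared deviations of X), and the intercept is a = mean of Y minus b times the mean of X. The data are the four pairs from the earlier correlation example, and the helper name predict is hypothetical.

```python
# Illustrative data: (X, Y) pairs from the earlier correlation example.
pairs = [(1, 2), (1, 3), (2, 2), (3, 4)]

n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n

# SP: sum of products of deviations; SS_X: sum of squared X deviations.
sp = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)

# Assumed least-squares formulas for slope and intercept.
b = sp / ss_x
a = mean_y - b * mean_x


def predict(x):
    """Predicted Y for a given X, using Y = b(X) + a."""
    return b * x + a


print(round(b, 3), round(a, 3))  # 0.636 1.636
print(round(predict(2), 3))      # 2.909
```

With these values the regression equation is roughly Y = 0.636(X) + 1.636, so each one-unit increase in X raises the predicted Y by about 0.636.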
69Example
What would the regression equation be for
predicting weight from height given the following
information
71The complete regression equation would be
You can predict a weight for any height by
substituting that height for X in the regression
equation.
72The stronger the correlation is, the more
accurate the prediction will be.
73It's important to remember that the regression
equation for predicting Y from X is not the same
as the regression equation for predicting X from
Y.
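A small numerical check makes the point concrete. The sketch below fits both regression lines to the pairs (1, 4), (1, 5), (2, 2), (4, 1) from the second correlation example, assuming the standard least-squares formulas. It then compares the X-from-Y line with what you would get by simply solving the Y-from-X equation for X. The function name regression is hypothetical.

```python
pairs = [(1, 4), (1, 5), (2, 2), (4, 1)]


def regression(points):
    """Least-squares slope and intercept for predicting the second
    value of each pair (criterion) from the first (predictor)."""
    n = len(points)
    mean_p = sum(p for p, _ in points) / n
    mean_c = sum(c for _, c in points) / n
    sp = sum((p - mean_p) * (c - mean_c) for p, c in points)
    ss_p = sum((p - mean_p) ** 2 for p, _ in points)
    slope = sp / ss_p
    return slope, mean_c - slope * mean_p


# Predicting Y from X.
b_yx, a_yx = regression(pairs)
# Predicting X from Y: swap the roles of the two variables.
b_xy, a_xy = regression([(y, x) for x, y in pairs])

# Solving Y = b_yx * X + a_yx for X gives a different line.
inverted_slope = 1 / b_yx
inverted_intercept = -a_yx / b_yx

print(round(b_yx, 3), round(a_yx, 3))                      # -1.167 5.333
print(round(b_xy, 3), round(a_xy, 3))                      # -0.7 4.1
print(round(inverted_slope, 3), round(inverted_intercept, 3))  # -0.857 4.571
```

The two approaches would agree only if the correlation were perfect. Here the product of the two slopes equals r squared (about .82), not 1.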
74A few remaining points about correlation and
regression
75Non-linear relationships
A correlation only determines the extent to which
a straight line describes the relationship
between two variables. Sometimes the relationship
isn't linear.
76In this case, there is a curvilinear
relationship. Since the relationship isn't
linear, r might equal 0 even though X and Y are
clearly related.
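A made-up example shows how this can happen. In the sketch below, Y is exactly X squared, so Y is completely determined by X, yet the Pearson coefficient comes out to 0 because the positive and negative products of deviations cancel. The data and the function name pearson_r are illustrative assumptions.

```python
from math import sqrt


def pearson_r(pairs):
    """Pearson r from deviation scores: SP / sqrt(SS_X * SS_Y)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return sp / sqrt(ss_x * ss_y)


# Hypothetical U-shaped data: (-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4).
curved = [(x, x * x) for x in range(-2, 3)]

r = pearson_r(curved)
print(r)  # 0.0
```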
77The problem of restricted range
Sometimes the true relationship between two
variables is masked because the range of values for
one or both of the variables has been restricted.
78Consider this scatter plot. What is the
relationship between X and Y?
79It might appear to be weak only because X and Y
vary only over a restricted range. If allowed to
vary over a wider range, a stronger relationship
might emerge.
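The effect can be reproduced with made-up numbers. In the sketch below, Y rises steadily with X but is pushed up or down by 2 units on alternating cases. The correlation is computed once over the full range of X (1 to 10) and once over a restricted range (4 to 6). The data and names are hypothetical and chosen only to illustrate the idea.

```python
from math import sqrt


def pearson_r(pairs):
    """Pearson r from deviation scores: SP / sqrt(SS_X * SS_Y)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return sp / sqrt(ss_x * ss_y)


# Hypothetical data: Y = X plus 2 for odd X, Y = X minus 2 for even X.
full = [(x, x + (2 if x % 2 else -2)) for x in range(1, 11)]

# Keep only the cases whose X value falls in a narrow band.
restricted = [(x, y) for x, y in full if 4 <= x <= 6]

r_full = pearson_r(full)
r_restricted = pearson_r(restricted)
print(round(r_full, 2), round(r_restricted, 2))
```

The same underlying trend produces a noticeably smaller coefficient in the restricted sample, because the disturbance is large relative to the narrow spread of X.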
80Multiple correlation and regression
Often we are interested in how much of the
variance we can account for in a criterion
(dependent) variable. Typically, we can account
for more variance if we take into account
multiple predictor (independent) variables.
81For example, we might be able to account for more
of the variance in weight if we considered age
and gender in addition to height. This would
also lead to better prediction.
82This is the idea behind multiple correlation and
multiple regression.