Title: CORRELATION AND REGRESSION
1CORRELATION AND REGRESSION
2An Example
- Recall sentence 11 in Problem Set 3A (and 9)
- If you want to get ahead, stay in school.
- Underlying this nagging parental advice is the
following claimed empirical relationship:
- LEVEL OF EDUCATION ==> LEVEL OF SUCCESS IN LIFE
- Suppose we collect data by means of a
survey asking respondents (say a representative
sample of the population aged 35-55) to report
the number of years of formal EDUCATION they
completed and also their current INCOME (as an
indicator of SUCCESS). We then analyze the
association between the two interval variables in
this reformulated hypothesis:
- LEVEL OF EDUCATION ==> LEVEL OF INCOME
(# of years reported) ($000 per year)
- Since these are both interval (continuous)
variables, we analyze their association by means
of a scattergram.
3Having collected such data in two rather
different societies A and B, we produce these two
scattergrams.
4An Example (cont.)
- Note that the two scattergrams are drawn with the
same horizontal and vertical scales.
- With respect to the vertical axis in scattergram
A, this violates the usual guideline that scales
be set to minimize white space in the
scattergram.
- But here we want to facilitate comparison between
the two charts.
- Both scattergrams show a clear positive
association between the two variables, i.e., the
plotted points in both form an upward-sloping
pattern running from Low-Low to High-High.
- At the same time there are obvious differences
between the two scattergrams (and thus between
the relationships between INCOME and EDUCATION
in societies A and B).
5Questions For Discussion
- In which society, A or B, is the hypothesis most
powerfully confirmed?
- In which society, A or B, is there a greater
incentive for people to stay in school?
- Which society, A or B, does the U.S. more closely
resemble?
- How might we characterize the difference between
societies A and B?
6An Example (cont.)
- We can visually compare and contrast the nature
of the associations between the two variables in
the two scattergrams by drawing a number of
vertical strips in each scattergram (as we did in
the earlier Pearson scattergram of SON'S HEIGHT
by FATHER'S HEIGHT).
- Points that lie within each vertical strip
represent respondents who have (just about) the
same value on the independent (horizontal)
variable of EDUCATION.
- Within each strip, we can estimate (by eyeball
methods) the average magnitude of the dependent
(vertical) variable INCOME and put a mark at the
appropriate level.
7Average Income for Selected Levels of Education
8We can connect these marks to form a line of
averages that is apparently (close to being) a
straight line.
9An Example (cont.)
- Now we can assess two distinct characteristics of
the relationships between EDUCATION and INCOME in
scattergrams A and B.
- How much does the average level of INCOME
change among people with different levels of
EDUCATION?
- How much dispersion of INCOME is there among
people with the same level of EDUCATION?
10An Example (cont.)
- In both scattergrams, the line of averages is
upward-sloping, indicating a clear apparent
positive effect of EDUCATION on INCOME.
- But in the scattergram for society A, the upward
slope of the line of averages is fairly shallow.
- The line of averages indicates that average
INCOME increases by only about $1000 for each
additional year of EDUCATION.
- On the other hand, in the scattergram for society
B, the upward slope of the line of averages is
much steeper.
- The graph in Figure 1B indicates that average
INCOME increases by about $4000 for each
additional year of EDUCATION.
- In this sense, EDUCATION is on average more
rewarding in society B than A.
11An Example (cont.)
- There is another difference between the two
scattergrams.
- In scattergram A, there is almost no dispersion
within each vertical strip (and almost no
dispersion around the line of averages as a
whole).
- In scattergram B, there is a lot of dispersion
within each vertical strip (and around the line
of averages as a whole).
- We can put this point in substantive language.
- In society A, while additional years of EDUCATION
produce rewards in terms of INCOME that are
modest (as we saw before), these modest rewards
are essentially certain.
- In society B, while additional years of EDUCATION
produce on average much more substantial rewards
in terms of INCOME (as we saw before), these
large expected rewards are highly uncertain and
are indeed realized only on average.
- For example, in scattergram B (but not A), we can
find many pairs of cases such that one case has
(much) higher EDUCATION but the other case has
(much) higher INCOME.
12An Example (cont.)
- This means that in society B, while EDUCATION has
a big impact on INCOME, there are evidently
other (independent) variables (maybe family
wealth, ambition, career choice, athletic or
other talent, just plain luck, etc.) that also
have major effects on LEVEL OF INCOME.
- In contrast, in society A it appears that LEVEL
OF EDUCATION (almost) wholly determines LEVEL OF
INCOME and that essentially nothing else matters.
- Another difference between the two societies is
that, while both societies have similar
distributions of EDUCATION, their INCOME
distributions are quite different.
- A is quite egalitarian with respect to INCOME,
which ranges only from about $40,000 to about
$60,000, while B is considerably less egalitarian
with respect to INCOME, which ranges from under
$10,000 to at least $100,000 and possibly
higher.
- In summary, in society A the INCOME rewards of
EDUCATION are modest but essentially certain,
while in society B the INCOME rewards of
EDUCATION are substantial on average but quite
uncertain in individual cases.
13Two Kinds of Strength of Association
- This example illustrates that, given bivariate
data for interval variables, strength of
association between them can mean either of two
quite different things:
- the independent variable has a very reliable or
certain association with the dependent variable,
as is true for society A but not B, or
- the independent variable has on average a big
impact on the dependent variable, as is true for
society B but not A.
- There are two bivariate summary statistics that
capture these two different kinds of strength of
association between interval variables:
- the second kind is captured by the regression
coefficient, customarily designated b, which is
the slope of the line of averages, and
- the first kind is captured by the correlation
coefficient, customarily designated r, which is
(more or less) determined by the magnitude of
dispersion in each vertical strip.
- In scattergrams A and B, A has the greater
correlation coefficient and B has the greater
regression coefficient.
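The contrast between the two kinds of strength can be checked on made-up numbers. The sketch below is illustrative only: the two income series are invented (they are not the survey data behind scattergrams A and B), and b and r are computed with the covariance-based formulas that these notes develop later.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Average crossproduct of deviations from the means (dividing by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def regression_coefficient(xs, ys):
    # b = Cov(X,Y) / Var(X); Var(X) is the covariance of X with itself.
    return covariance(xs, ys) / covariance(xs, xs)

def correlation_coefficient(xs, ys):
    # r = Cov(X,Y) / (SD(X) * SD(Y))
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical respondents with 8 to 19 years of EDUCATION.
education = list(range(8, 20))
# Alternating +1/-1 pattern standing in for "everything else" affecting INCOME.
noise = [1 if i % 2 == 0 else -1 for i in range(len(education))]

# Society A: shallow slope, little dispersion. Society B: steep slope, much dispersion.
income_a = [40_000 + 1_000 * e + 500 * s for e, s in zip(education, noise)]
income_b = [10_000 + 4_000 * e + 20_000 * s for e, s in zip(education, noise)]

b_a = regression_coefficient(education, income_a)
r_a = correlation_coefficient(education, income_a)
b_b = regression_coefficient(education, income_b)
r_b = correlation_coefficient(education, income_b)

print(f"Society A: b = {b_a:.0f} per year, r = {r_a:.2f}")
print(f"Society B: b = {b_b:.0f} per year, r = {r_b:.2f}")
```

With these invented numbers, A comes out with the larger r and B with the larger b, which is the pattern described above.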
14Review The Equation of a Line
- Having drawn (at least by eyeball methods) the
line of averages in a scattergram, it is
convenient to write an equation for the line.
- You should recall from high-school algebra that,
given any graph with a horizontal axis X and a
vertical axis Y, any straight line drawn on the
graph has an equation of the form (using the
symbols you probably used in high school)
- y = mx + b, where
- m is the slope of the line expressed as Δy / Δx
(change in y divided by change in x, or "rise
over run"), and
- b is the y-intercept (i.e., the value of y when
x = 0).
- Evidently to further torment students, in college
statistics the symbol b is used in place of m to
represent the slope of the line and the symbol a
is used in place of b to represent the intercept
(and a is customarily placed in front of the bx
term), so the equation for a straight line is
usually written as
- y = a + bx.
15Slope and Intercept in Scattergrams A and B
16Equation of a Line (cont.)
- The equation for the line of averages in
scattergram A appears to be approximately
- y = a + bx
- AVERAGE INCOME = $40,000 + $1000 × EDUCATION.
- The equation for the line of averages in Figure B
appears to be approximately
- AVERAGE INCOME = $10,000 + $4000 × EDUCATION.
- Given such an equation (or formula), we can take
any value for the independent variable EDUCATION,
plug it into the appropriate formula above, and
calculate or predict the corresponding average
or expected value of the dependent variable
INCOME.
- In society A, such a prediction is likely to be
quite reliable, because there is very little
dispersion in any vertical strip and the
association/correlation between the two variables
is almost perfect.
- In society B, this prediction will be much less
reliable or fuzzier, because there is
considerable dispersion in every vertical strip
and the association/correlation between the two
variables is much less than perfect.
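As a small illustration of "plugging in", the sketch below evaluates both eyeballed lines at a few levels of EDUCATION. The function and constant names are made up for this example. The intercepts and slopes are the approximate values read off the scattergrams and are taken here to be positive.

```python
def predicted_income(education_years, intercept, slope):
    """Average (expected) INCOME on a line of averages: a + b * x."""
    return intercept + slope * education_years

# Eyeballed (intercept a, slope b) pairs for the two societies.
SOCIETY_A = (40_000, 1_000)
SOCIETY_B = (10_000, 4_000)

for years in (8, 12, 16):
    avg_a = predicted_income(years, *SOCIETY_A)
    avg_b = predicted_income(years, *SOCIETY_B)
    print(f"{years} years: A predicts {avg_a}, B predicts {avg_b}")
```

The code says nothing about reliability: the same arithmetic gives a tight prediction in A and a fuzzy one in B, because the dispersion around each line is not part of the equation.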
18Before, we used eyeball methods to draw the line
of averages in the SON'S HEIGHT by FATHER'S
HEIGHT scattergram. Let's write its equation in
the form SON'S HEIGHT = a + b × FATHER'S HEIGHT.
20Equation for Son's Height
- What is the equation for the line of averages?
- SH = 63 + 0.5 × FH
- SH = 63 + 0.5 × 74 = 63 + 37 = 100" (8' 4")
- Clearly something is wrong.
- 63" is not the true intercept, i.e., it is
- not average SH when FH = 0,
- but rather average SH when FH = 58".
- My height is 16" greater than 58", so my son's
expected height is
- 63 + 0.5 × 16 = 71"
- To read the true y-intercept a on the chart, we
need to extend the FH scale down to FH = 0 to see
where the line of averages intersects the true SH
axis.
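The "extend the scale down to FH = 0" step can also be done by arithmetic. The sketch below assumes only what the slide states: the eyeballed slope is 0.5 and the line of averages is at 63" where FH = 58". The variable and function names are made up for this example.

```python
SLOPE = 0.5            # inches of son's height per inch of father's height
FH_AT_LEFT_EDGE = 58   # father's height where the chart's vertical axis is drawn
SH_AT_LEFT_EDGE = 63   # height of the line of averages at that edge

# Slide back along the line from FH = 58 to FH = 0 to get the true intercept a.
true_intercept = SH_AT_LEFT_EDGE - SLOPE * FH_AT_LEFT_EDGE

def expected_son_height(father_height):
    """Expected SH from the line of averages written as SH = a + b * FH."""
    return true_intercept + SLOPE * father_height

print("true intercept a:", true_intercept)
print("expected SH for FH = 74:", expected_son_height(74))
```

Using the true intercept gives the same answer as starting from 63" and adding half of the father's excess over 58".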
22Determining the Line of Averages (Regression
Line) and the Degree of Association (Correlation)
Numerically
- Clearly statisticians are not satisfied with
eyeballing a scattergram and
- visually estimating the slope and intercept of
the line of averages, or (especially)
- guessing at the degree of association between the
DEPENDENT and INDEPENDENT variables.
- Before we can make exact numerical calculations,
we must have an exact definition of the line of
averages.
- This will lead to precise formulas for the slope,
intercept, and association/correlation.
23Example Worksheet in Topic 13
24Correlation and Regression (cont.)
- Consider the scattergram to the right, which
displays the bivariate numerical data for x and y
in the Sample Problem presented on p. 9 of
Handout 13.
- The "vertical strips" kind of argument simply
will not work, because most strips include no
data points and only one strip (for x = 5)
includes more than one point.
- Using this simple example, we will now proceed a
little more formally.
25Correlation and Regression (cont.)
- Suppose we were to ask what horizontal line
(i.e., a line for which the slope b = 0, so its
equation is simply y = a) would best fit (come
as close as possible to) the plotted points.
- In order to answer this question, we must have a
specific criterion for "best fitting".
- The one statisticians use is called the least
squares criterion.
- For each horizontal line, we calculate the sum
(or mean) of squared deviations of the
y-observations from the line.
- We're looking for the horizontal line that
minimizes this sum/mean.
- Let's try the horizontal line y = 6.
- The sum of squared deviations = 138,
- so the mean squared deviation is 23.
- Can we do better than this?
28Correlation and Regression (cont.)
- That is, we want to find the horizontal line such
that the squared vertical deviations from the
line to each point are (in total or on average)
as small as possible.
- You might remember (or should be able to guess)
what line this is: it is the line y = mean of y,
or in this case y = 5.
- Recall from Handout 6, p. 4, point (e) that the
sum (or average) of the squared deviations from
the mean is less than the sum of the squared
deviations from any other value of the variable.
- You should also recall that we have a special
name for the average squared deviation from the
mean (of y), namely, the variance (of y).
- This (or its positive square root, i.e., the
standard deviation of y) is the standard
univariate measure of dispersion in y.
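The claim that the mean gives the best fitting horizontal line can be checked by brute force. The sketch below uses five invented y-values (not the Handout 13 data) and tries a grid of candidate heights. The function name is made up for this example.

```python
def sum_squared_deviations(ys, height):
    """Sum of squared vertical deviations from the horizontal line y = height."""
    return sum((y - height) ** 2 for y in ys)

ys = [2, 4, 5, 4, 5]              # hypothetical y-observations
mean_y = sum(ys) / len(ys)

# Candidate horizontal lines y = 0.0, 0.5, 1.0, ..., 8.0
candidates = [h / 2 for h in range(0, 17)]
best_height = min(candidates, key=lambda h: sum_squared_deviations(ys, h))

# The mean squared deviation at the best line is the variance of y.
variance_y = sum_squared_deviations(ys, mean_y) / len(ys)

print("mean of y:", mean_y)
print("best fitting horizontal line: y =", best_height)
print("variance of y:", variance_y)
```

Moving the line one unit away from the mean adds exactly n (here 5) to the sum of squared deviations, the same pattern as the step from 132 to 138 in the six-case example above.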
30Correlation and Regression (cont.)
- So the line y = 5 is the best fitting horizontal
line by the least squares criterion.
- Now suppose that, in finding the best fitting
line by the least squares criterion, we are no
longer restricted to horizontal lines but can tip
the line up or down, so it has a non-zero slope
and its height varies with the independent
variable x.
- More particularly, suppose that we pivot such
straight lines on the point that is right in the
middle of all the plotted data points,
- specifically the point that represents the mean
of x and the mean of y (in this case, x = 4,
y = 5),
- until it has a slope that best fits all the
plotted points by the same least squares
criterion.
- In our example, we can clearly improve the fit by
tipping the line counter-clockwise.
32Tipping the Horizontal Line Can Improve the Fit
by the Least Squares Criterion
33Finding the Best Fitting Line
- The question is how far we must tip the line to
get the best fit (by the least squares
criterion).
- If we tip it too far, the fit becomes worse.
- When we have found this best fitting (by the
least squares criterion) line, we have found what
is thereby defined as the regression line.
- Fortunately, we don't have to find this best
fitting line by trial and error methods.
- There are formulas for finding the slope of the
regression line (i.e., the regression
coefficient) and the intercept from the numerical
data.
- There is a related formula for finding the
correlation coefficient from the numerical data,
which tells us how well this best fitting
regression line fits the data points.
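To see "tipping" in action, the sketch below pivots a line on the point (mean of x, mean of y) and tries a grid of slopes by trial and error. It then compares the winner with the formula slope Cov(X,Y) / Var(X) given later in these notes. The five data points are invented for illustration and are not the Handout 13 data.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

def mean_squared_deviation(slope):
    """MSD of the points from the line with this slope through (mean_x, mean_y)."""
    return sum((y - (mean_y + slope * (x - mean_x))) ** 2
               for x, y in zip(xs, ys)) / n

# Trial and error: tip the line in steps of 0.1 from slope 0.0 up to 1.2.
slopes = [i / 10 for i in range(0, 13)]
best_slope = min(slopes, key=mean_squared_deviation)

# The formula route: covariance divided by the variance of x.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
formula_slope = cov_xy / var_x

for slope in (0.0, 0.3, 0.6, 0.9, 1.2):
    print(f"slope {slope:.1f}: MSD = {mean_squared_deviation(slope):.2f}")
print("best slope on the grid:", best_slope, "| formula slope:", formula_slope)
```

The printed MSD falls as the line is tipped up from horizontal, bottoms out, and then rises again, which is the "tip it too far and the fit becomes worse" behaviour.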
35Association Between X and Y?
- The Mean Squared Deviation (MSD) from the
regression line equals 7.9375.
- The degree of association would be related to how
big this MSD is compared with the MSD around the
best fitting horizontal line, which equals 22.
- Degree of association = 1 - 7.9375 / 22
- = 1 - .3608 = 0.6392
- Note that this number must be between 0 and 1
(so it cannot be negative).
36Correlation and Regression (cont.)
- One statistical theorem tells us that the
regression line goes through the point
corresponding to the mean of x and the mean of y.
- Another theorem gives us formulas to calculate
the slope and intercept of the regression line,
given the numerical data.
- A third theorem
- gives us the formula to calculate the correlation
coefficient, and also
- tells us that the squared correlation coefficient
r² measures how much we can improve the goodness
of fit (by the least squares criterion) when we
move from the best fitting horizontal line to the
best fitting tipped line.
37How to Calculate the Regression and Correlation
Coefficients
- Let's call the independent variable X and the
dependent variable Y. (This usage is standard.)
- Set up a worksheet like the one that follows
(also p. 9 of Handout 13).
38How to Calculate Coefficients (cont.)
- Compute the following univariate statistics: (1)
the mean of X, (2) the mean of Y, (3) the
variance of X, (4) the variance of Y, (5) the SD
of X, and (6) the SD of Y.
39How to Calculate Coefficients (cont.)
- Now we make bivariate calculations for each case
by multiplying its deviation from the mean with
respect to X by its deviation from the mean with
respect to Y.
- This is called the crossproduct of the
deviations.
40How to Calculate Coefficients (cont.)
- Notice that a crossproduct in a given case
- is positive if, in that case, the X and Y values
both deviate in the same direction (i.e., both
upwards or both downwards) from their respective
means and
- is negative if, in that case, the X and Y
values deviate in opposite directions (i.e., one
upwards and one downwards) from their respective
means.
- In either event, the absolute crossproduct is
large if both absolute deviations are large and
small if either deviation is small.
- The crossproduct is zero if either deviation is
zero.
- Thus
- (a) if there is a positive relationship between
the two variables, most crossproducts are
positive and many are large,
- (b) if there is a negative relationship between
the two variables, most crossproducts are
negative and many are large, and
- (c) if the two variables are unrelated, the
crossproducts are positive in about half the
cases and negative in about half the cases.
- So, the average of all crossproducts is
indicative of the association between variables.
41How to Calculate Coefficients (cont.)
- Add up the crossproducts over all cases.
- The sum (and thus the average) of crossproducts
is
- positive in the event of (a) above,
- negative in the event of (b) above, and
- (close to) zero in the event of (c) above.
- Divide the sum of the crossproducts by the number
of cases to get the average (mean) crossproduct.
This average is called the covariance of X and Y,
and its formula is the bivariate parallel to the
univariate variance (of X or Y).
42How to Calculate Coefficients (cont.)
- Notice the following:
- If the association between X and Y is positive,
their covariance is positive (because most
crossproducts are positive).
- If the association between X and Y is negative,
their covariance is negative (because most
crossproducts are negative).
- If there is almost no association between X
and Y, their covariance is approximately zero
(because positive and negative crossproducts
approximately cancel out when added up).
- Thus the covariance does measure the association
between the two variables.
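The worksheet steps above (deviations, crossproducts, average crossproduct) can be sketched in a few lines. The data are invented for illustration and are not the Handout 13 worksheet. Variances and the covariance divide by the number of cases n, as in these notes.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical values of the independent variable X
ys = [2, 4, 5, 4, 5]   # hypothetical values of the dependent variable Y
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Deviation of each case from the mean, on X and on Y.
dev_x = [x - mean_x for x in xs]
dev_y = [y - mean_y for y in ys]

# Crossproduct of the deviations, case by case.
crossproducts = [dx * dy for dx, dy in zip(dev_x, dev_y)]

# Covariance = average crossproduct.
covariance = sum(crossproducts) / n

print("  x   y  dev_x  dev_y  crossproduct")
for x, y, dx, dy, cp in zip(xs, ys, dev_x, dev_y, crossproducts):
    print(f"{x:3} {y:3} {dx:6.1f} {dy:6.1f} {cp:13.1f}")
print("covariance of X and Y:", covariance)
```

In this example no crossproduct is negative, so the covariance is positive, matching case (a).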
43How to Calculate Coefficients (cont.)
- However, the covariance is not a (fully) valid
measure of the association, because the magnitude
of the average (positive or negative)
crossproduct depends not only on
- how closely the two variables are associated, but
also
- on the magnitude of their dispersions (as
indicated by their standard deviations or
variances).
- So, for example,
- two very closely (and positively) associated
variables have a positive but small covariance if
they both have small standard deviations, and
- two not so closely (but still positively)
associated variables have a larger covariance if
their standard deviations are sufficiently larger.
44Multiplying all Values by 10 does not Change the
Degree of Association between the Two Variables X
and Y
45But Doing This Increases the Covariance of X and
Y 100-Fold
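The point of these two slides can be verified numerically. The sketch below uses invented data, multiplies every X and Y value by 10, and compares the covariance and the correlation coefficient before and after, with the correlation computed as the covariance divided by both standard deviations. The helper names are made up for this example.

```python
from math import sqrt

def covariance(xs, ys):
    """Average crossproduct of deviations from the means (dividing by n)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    # Covariance divided by SD(X) and by SD(Y); Var(X) = Cov(X, X).
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
xs_times_10 = [10 * x for x in xs]
ys_times_10 = [10 * y for y in ys]

cov_before = covariance(xs, ys)
cov_after = covariance(xs_times_10, ys_times_10)
r_before = correlation(xs, ys)
r_after = correlation(xs_times_10, ys_times_10)

print("covariance:", cov_before, "->", cov_after)
print(f"correlation: {r_before:.4f} -> {r_after:.4f}")
```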
46The Correlation Coefficient
- The correlation coefficient is a measure of
association only, which (like other measures of
association) is standardized so that it always
falls in the range from -1 to +1.
- This is accomplished by dividing the covariance
of X and Y by the standard deviation of X and
also by the standard deviation of Y.
- This is equivalent to calculating the covariance
of X and Y if the X and Y data are converted into
standard scores.
- The degree of correlation in a scattergram can be
observed by looking only at the plotted points,
the regression line, and the orientation of the
horizontal and vertical axes.
- The units of measurement on each axis make no
difference and do not have to be shown.
48The Correlation Coefficient (cont.)
- Divide the covariance of X and Y by the standard
deviation of X and also by the standard deviation
of Y.
- This gives the correlation coefficient r, which
measures the degree (or reliability or
completeness) and direction of association
between two interval variables X and Y.
- Correlation coefficient r = Cov(X,Y) / [SD(X) × SD(Y)]
- If you calculate r to be greater than +1 or less
than -1, you have made a mistake.
- The correlation coefficient is the one measure of
association for which you are expected to know
how to use the calculating formula (above).
- It is a good idea first to construct a
scattergram of the data on X and Y and then to
check whether your calculated correlation
coefficient looks plausible in light of the
scattergram.
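A minimal sketch of the calculating formula, following the worksheet order (variances, SDs, covariance, then r). The data are invented, the function name is made up for this example, and the final range check mirrors the "you have made a mistake" rule above.

```python
from math import sqrt

def correlation_coefficient(xs, ys):
    """r = Cov(X,Y) / (SD(X) * SD(Y)), with all averages dividing by n."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    sd_x, sd_y = sqrt(var_x), sqrt(var_y)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov_xy / (sd_x * sd_y)

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

r = correlation_coefficient(xs, ys)
# Allow a tiny rounding margin; anything further outside [-1, +1] is a mistake.
if not -1.0 - 1e-12 <= r <= 1.0 + 1e-12:
    raise ValueError("r outside [-1, +1]: check the calculation")
print(f"r = {r:.4f}")
```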
49Calculating the Correlation Coefficient
50Properties of the Correlation Coefficient
- The correlation coefficient formula is based on a
ratio of X-units × Y-units (i.e., their
respective deviations from the mean) divided by
X-units × Y-units (i.e., their respective SDs).
- This means that all units of measurement cancel
out and the correlation coefficient is a pure
number that is not expressed in any units.
- For example, suppose the correlation between the
height (measured in inches) and weight (measured
in pounds) of students in the class is r = .45.
- This is not r = .45 inches or r = .45 pounds,
or r = .45 pounds per inch, etc.,
- rather it is just r = .45.
- Moreover, the magnitude of the correlation
coefficient is independent of the units used to
measure the two variables.
- If we measured students' heights in feet, meters,
etc. (instead of inches) and/or measured their
weights in ounces, kilograms, etc. (instead of
pounds), and then calculated the correlation
coefficient based on the new numbers, it would be
just the same as before, i.e., r = .45.
- In addition, if you check the formula above, you
can see that r is symmetric, i.e., it is
unchanged if we interchange X and Y.
- Thus, the correlation between two variables is
the same regardless of which variable is
considered to be independent and which dependent.
51Calculating the Regression Coefficient
- To compute the regression coefficient b, divide
the covariance of X and Y by the variance of the
independent variable X.
- Regression coefficient b = Cov(X,Y) / Var(X)
52Properties of the Regression Coefficient
- So, from one point of view, the regression
coefficient answers this question:
- How big is the covariance of the IND and DEP
variables compared with the variance of the IND
variable?
- But (as we have already seen) from a more
practical point of view, the regression
coefficient (i.e., the slope of the regression
line or line of averages) answers this
question:
- How many units, on average, does the dependent
variable increase when the independent variable
increases by one unit?
- If the dependent variable decreases (has a
"negative increase") when the independent
variable increases, the regression coefficient is
negative.
- Thus the magnitude of the regression coefficient,
unlike the correlation coefficient, depends on
- which variable is considered to be independent
and which dependent, and also
- the units in which each variable is measured.
53Properties of the Regression Coefficient (cont.)
- The regression coefficient is a ratio of X-units
× Y-units (i.e., their respective deviations
from the mean) divided by X-units × X-units
(i.e., the variance of X).
- Thus the regression coefficient is expressed in
units of the dependent variable per unit of the
independent variable, that is,
- it is a rate,
- like miles per hour (rate of speed) or
- miles per gallon (rate of fuel efficiency), and
so
- its numerical value changes as these units
change.
54Properties of the Regression Coefficient (cont.)
- Remember that we informally calculated regression
coefficients (by eyeball methods) in the INCOME
by EDUCATION example.
- In society A, the regression coefficient was
about $1000 (of INCOME [DEP]) per YEAR (of
EDUCATION [IND]).
- In society B, the regression coefficient was
about $4000 (of INCOME [DEP]) per YEAR (of
EDUCATION [IND]).
- Likewise we informally calculated the regression
coefficient in the Pearson (SON'S HEIGHT by
FATHER'S HEIGHT) scattergram.
- It was about 0.5 INCHES (of SH [DEP]) per INCH
(of FH [IND]).
- In this (rather special) case, both variables are
in the same currency and measured in the same
units (INCHES), so
- if we change the unit of measurement for both
variables (to FEET, CENTIMETERS, etc.), the
magnitude of the regression coefficient does not
change.
55Properties of the Regression Coefficient (cont.)
- But suppose we measure the height in inches and
weight in pounds of all students in the class,
suppose we treat height as the independent
variable, and suppose the regression coefficient
is b = 6.
- This means 6 pounds (dependent variable units)
of weight per inch (independent variable units)
of height.
- That is, if we lined students up in order from
shortest to tallest, we would observe that their
weights increase, on average, by about 6 pounds
for each additional inch of height.
- This also means that students' weights increase,
on average, by about 2.7 kilograms (since there
are about .45 kilograms to a pound) for each
additional inch of height.
- So if weight were measured in kilograms instead
of pounds, the regression coefficient would be
b = 6 × .45 = 2.7 kilograms per inch.
- Likewise if we measured weight in pounds but
height in feet, the regression coefficient would
be about b = 6 × 12 = 72 pounds per foot.
- Remember that the magnitude of the correlation
coefficient is unchanged by such changes in units.
56Properties of the Regression Coefficient (cont.)
- If we took weight to be the independent variable
and height the dependent variable,
- the magnitude of the regression coefficient would
clearly change, because
- we would divide Cov(X,Y) by Var(Y), instead of by
Var(X).
- This makes sense because the regression
coefficient is now answering a different
question.
- If we lined students up in order of their weights
and observed how their heights varied with
weight, the regression coefficient would be
telling us how much students' heights increase,
on average, in some unit of height (inch, meter,
etc.) as their weights increase by one unit
(pound, kilogram, or whatever), so it would be
yet a different number, e.g., perhaps about
b = 0.1 inches per pound.
- Remember that the magnitude of the correlation
coefficient is unchanged by reversing the roles
of the independent and dependent variables.
58Specifying the Regression Line
- To specify the regression line of the
relationship between independent variable X and
dependent variable Y, we need to know, in
addition to the slope of the regression line
(i.e., the regression coefficient b), how high or
low the line with that slope lies in the
scattergram.
- Actually, we already know enough to draw the
regression line into the scattergram precisely.
- The regression line is the line that passes
through the point that equals the mean of x and
the mean of y and that has a slope b.
- But to write the equation of the regression line,
we need to know the intercept a, i.e., where the
regression line passes through the vertical axis
representing values of the dependent variable Y,
when that vertical Y axis intersects the
horizontal X axis at the point x = 0.
- This intercept a is equal to the mean of Y minus
b times the mean of X.
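Putting the slope and intercept formulas together, the sketch below computes b and a for invented data and confirms that the resulting line passes through the point (mean of x, mean of y). The names regression_line and predict are made up for this example.

```python
def regression_line(xs, ys):
    """Return (a, b): intercept and slope of the least squares regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    b = cov_xy / var_x          # regression coefficient (slope)
    a = mean_y - b * mean_x     # intercept: mean of Y minus b times mean of X
    return a, b

def predict(x, a, b):
    """Expected value of Y at X = x on the regression line y = a + b * x."""
    return a + b * x

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

a, b = regression_line(xs, ys)
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

print(f"regression line: y = {a:.1f} + {b:.1f} x")
print(f"predicted y at the mean of x ({mean_x}): {predict(mean_x, a, b):.1f}")
```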
59Finding the Intercept
60Finding the Intercept
61Properties of the Intercept
- The magnitude of the intercept is expressed in
Y-units.
- The value of the intercept answers this
(sometimes quite artificial or even absurd)
question:
- What is the average (or expected or predicted)
value
- of the dependent variable
- when the independent variable X is zero?
- Using a and b together, we can answer this
question: what is the expected or predicted value
of Y when X is any specified value? The expected
or predicted value of Y, given that X is some
particular value x, is given by the regression
equation (i.e., the equation of the regression
line)
- y = a + bx.
62Relationship between Calculations for Correlation
and Regression Coefficients
- Note from the formulas that the sign (i.e.,
+ or -) of b and r are both determined by the
sign of Cov(X,Y), from which it follows that
- b and r themselves always have the same sign, and
- if one is zero, the other is also zero.
- Notice also that the regression and correlation
coefficients are equal in the event the
independent and dependent variables have the same
dispersion, i.e., same SD.
- For example, in the SON'S HEIGHT by FATHER'S
HEIGHT scattergram, by eyeball methods we can
determine the regression coefficient b is about
0.5.
- It is also apparent, both from common
expectations and examination of the scattergram,
that the dispersions (SDs) of SON'S HEIGHT and
FATHER'S HEIGHT are just about the same, so we
also know that the correlation coefficient is
also about 0.5.
- In general, b = r × SD(Y) / SD(X) and
r = b × SD(X) / SD(Y).
63Other Formulas
- You will find other formulas for the correlation
and regression coefficients in textbooks.
- For a horrendous looking example, see Weisberg et
al., near top of p. 305 and below.
- Such formulas are mathematically equivalent to
(i.e., give the same answers as) the formulas
given here.
- They are actually easier to work with if you are
processing many cases or programming a calculator
or computer, because they require you (or the
computer) to pass through the data only once.
(They involve only X and Y values, not X and Y
deviations from the mean.)
- But the formulas presented here make more
intuitive sense and are easy enough to use in the
simple sorts of problems that you will be
presented with in problem sets and tests.
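For readers curious what a one-pass formula looks like in practice, here is a sketch of one common "raw sums" form. It is not claimed to be the exact formula printed in Weisberg et al. The function name is made up for this example, and the data are invented. The loop touches each case once and accumulates only sums of X, Y, X², Y², and XY.

```python
from math import sqrt

def one_pass_coefficients(pairs):
    """Return (a, b, r) from raw sums gathered in a single pass over the data."""
    n = sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    # These equal n**2 times Cov(X,Y), Var(X), and Var(Y) respectively.
    sxy = n * sum_xy - sum_x * sum_y
    sxx = n * sum_xx - sum_x ** 2
    syy = n * sum_yy - sum_y ** 2
    b = sxy / sxx                  # regression coefficient
    r = sxy / sqrt(sxx * syy)      # correlation coefficient
    a = (sum_y - b * sum_x) / n    # intercept: mean of Y minus b times mean of X
    return a, b, r

data = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]   # hypothetical (x, y) cases
a, b, r = one_pass_coefficients(data)
print(f"a = {a:.1f}, b = {b:.1f}, r = {r:.4f}")
```

One caveat worth knowing: with floating-point data whose values are large relative to their spread, subtracting two big sums can lose precision, which is another reason the deviation-based formulas are the safer choice for hand calculation.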
64The Coefficient of Determination r²
- For various reasons, the squared correlation
coefficient r² is often reported.
- To compute r², just multiply the correlation
coefficient by itself.
- This results in a number that
- (i) always lies between 0 and 1 (i.e., is never
negative and so does not indicate the direction
of association), and
- (ii) is closer to 0 than r is for all
intermediate values of r; for example,
(±0.45)² = 0.2025.
- The statistic r² is sometimes called the
coefficient of determination.
- There are three reasons for using this statistic.
65The Coefficient of Determination r²
- (1) A scattergram with r about 0.5 does not
appear to be halfway between a scattergram with
r = 1 and one with r = 0; it looks closer to
the one with zero association. The scattergram
that looks halfway in between perfect and zero
association has a correlation of about r = 0.7
(and r² = 0.5).
66The Coefficient of Determination r²
- (2) r² has a specific interpretation that r
itself lacks.
- Recall that the line y = 5 (= mean of y) is the
horizontal line that best fits the plotted points
by the least squares criterion.
- Recall also that the quantity "average squared
deviation from the line y = 5" has a special and
familiar name.
- It is the variance of the dependent variable Y
(the square root of which is the SD of Y).
67The Coefficient of Determination r²
- When we tip the line, we can almost always
improve the fit at least a bit.
- The regression line is the tipped line with the
best fit according to the least squares
criterion.
- But even this line usually fits the points far
from perfectly.
- For each plotted point, there is some vertical
distance (positive if the point lies above the
line, negative if it lies below) between the
(best fitting) regression line and the point.
- This vertical distance is called the residual
for that case.
- These residuals are the errors in prediction that
are left over after we use the regression line
to predict the value of the dependent variable in
each case.
- The ratio of the average squared residual
to the variance of Y can be characterized
as the proportion of the variance in Y that is
not predicted or explained by the regression
equation that has X as the independent variable.
- Therefore 1 minus this ratio can be characterized
as the proportion of the variance in Y that is
predicted or explained by the regression
equation.
- It turns out that the latter proportion is
exactly equal to r².
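The identity can be checked directly. The sketch below, on invented data, computes the residuals from the regression line, forms 1 minus (average squared residual / variance of Y), and compares the result with the square of r computed from the covariance formula.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n

b = cov_xy / var_x               # slope of the regression line
a = mean_y - b * mean_x          # intercept of the regression line
r = cov_xy / sqrt(var_x * var_y)

# Residual = observed y minus the y predicted by the regression line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_squared_residual = sum(e ** 2 for e in residuals) / n

unexplained = mean_squared_residual / var_y   # proportion not explained
explained = 1 - unexplained                   # proportion explained

print(f"proportion of variance explained: {explained:.4f}")
print(f"r squared:                        {r ** 2:.4f}")
```

The same arithmetic applied to the figures quoted earlier, 1 - 7.9375 / 22, gives about 0.6392.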
69The Coefficient of Determination r²
- (3) Finally, serious regression analysis is
almost always multiple (multivariate) regression,
where the effects of multiple independent (and/or
control) variables on a single dependent
variable are analyzed.
- In this case, we want some summary measure of the
overall extent to which the set of all
independent variables explains variation in the
dependent variable, regardless of whether
individual independent variables have positive or
negative effects (i.e., regardless of whether
bivariate correlations are positive or negative).
- The coefficient of determination r² provides
this measure.
70Look Underneath the Correlation
- Correlation Reflects Linear Association Only.
- Suppose that you calculate a correlation
coefficient and find that r = 0 (so b = 0 also).
- You should not jump to the conclusion that there
is no association of any kind between the
variables.
- The zero coefficient tells you that there is no
linear (straight-line) association between the
variables.
- But there may be a very clear curvilinear
(curved-line) association between them.
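A small numerical illustration of this point, as a sketch in Python with made-up values: here Y is completely determined by X (Y is X squared), yet the correlation coefficient is zero because the relationship is curved and symmetric rather than linear.

```python
def mean(values):
    return sum(values) / len(values)

def correlation(x, y):
    """Correlation coefficient r."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical cases: Y depends perfectly on X, but along a U-shaped curve.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = correlation(x, y)
print("r =", r)
```

The downward-sloping left half of the curve cancels the upward-sloping right half, so no tipped straight line fits better than a flat one.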
71Look Underneath the Correlation (cont.)
- A Univariate Outlier May Create a Correlation.
- Suppose that you calculate a correlation
coefficient and find that r = 0.9.
- You should not jump to the conclusion that there
is a strong and reliable association between the
variables.
- In the adjacent scattergram, a single univariate
outlier produces the high correlation by itself.
- You should at least double check your data entry
for this case.
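The effect can be reproduced with a few made-up numbers. In this Python sketch (the data are hypothetical, not those in the scattergram), nine cases show no association at all, and a tenth case with extreme values on both variables pushes r above 0.9.

```python
def mean(values):
    return sum(values) / len(values)

def correlation(x, y):
    """Correlation coefficient r."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Nine hypothetical cases arranged in a 3-by-3 grid: no association.
x = [1, 2, 3, 1, 2, 3, 1, 2, 3]
y = [1, 1, 1, 2, 2, 2, 3, 3, 3]
r_without = correlation(x, y)

# One added case that is extreme on X and on Y (perhaps a data entry error).
r_with = correlation(x + [20], y + [20])

print("r without the outlier:", round(r_without, 3))
print("r with the outlier:   ", round(r_with, 3))
```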
72Look Underneath the Correlation (cont.)
- A Bivariate Outlier May Greatly Attenuate a
Correlation.
- Clerical errors (or deviant cases) can attenuate,
as well as enhance, apparent correlation.
- In the adjacent scattergram, a single bivariate
outlier reduces what is otherwise an almost
perfect association between the variables to a
more modest level.
- Note that the deviant case is not an outlier in
the univariate sense.
- Its value on each variable separately is
unexceptional.
- What is exceptional is the combination of values
on the two variables that it exhibits.
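A sketch of the same idea in Python, again with hypothetical numbers rather than the data in the scattergram: ten cases lie almost exactly on a line, and one added case has an ordinary X value and an ordinary Y value but an unusual combination of the two.

```python
def mean(values):
    return sum(values) / len(values)

def correlation(x, y):
    """Correlation coefficient r."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Ten hypothetical cases with an almost perfect positive association.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.1, 4.0, 4.8, 6.1, 7.0, 8.2, 8.9, 10.0]
r_clean = correlation(x, y)

# The deviant case: X = 2 and Y = 9 are each well within the observed
# ranges, but the pairing (low X with high Y) is far off the pattern.
outlier_x, outlier_y = 2, 9
r_with = correlation(x + [outlier_x], y + [outlier_y])

print("r without the deviant case:", round(r_clean, 3))
print("r with the deviant case:   ", round(r_with, 3))
```

A univariate check (looking at the range of each variable separately) would not flag this case; only the scattergram or the residual from the regression line reveals it.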
73Applied Regression (and Correlation) Analysis
- Regression (especially multiple regression)
analysis is now very commonly used in political
science research.
- But perhaps its application is most intuitive in
practical situations in which researchers
literally want to make predictions about future
cases based on analysis of past cases.
- Here are two examples.
74Predicting College GPAs.
- A college Admissions Office has recorded the
combined SAT scores of all of its incoming
students over a number of years.
- It has also recorded the first-year college GPAs
of the same students.
- The Admissions Office can therefore calculate the
regression coefficient b, the intercept a, and
the correlation coefficient r for the data they
have collected in the past. It can then use the
regression equation
- PREDICTED COLLEGE GPA = a + b x SAT SCORE
- to predict the potential college GPAs of the
next crop of applicants on the basis of their SAT
scores (and use these predictions in making their
admissions decisions).
- Even better, it can collect and record more
extensive data and use a multiple regression
equation such as
- GPA = a + b1 x SATV + b2 x SATM + b3 x HSGPA
+ b4 x AP + . . .
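The bivariate version of this procedure might look like the following Python sketch. The past SAT scores and GPAs are invented for illustration, and the function names are assumptions of this example, not anything an actual Admissions Office uses.

```python
def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Least-squares intercept a and slope b for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical records from past entering classes.
past_sat = [1000, 1100, 1200, 1300, 1400]
past_gpa = [2.4, 2.8, 3.0, 3.3, 3.5]

# Step 1: calculate a and b from the past cases.
a, b = fit_line(past_sat, past_gpa)

# Step 2: apply PREDICTED COLLEGE GPA = a + b x SAT SCORE to a new applicant.
def predicted_gpa(sat_score):
    return a + b * sat_score

applicant_sat = 1250
print("intercept a:", round(a, 3))
print("slope b:    ", round(b, 4))
print("predicted GPA:", round(predicted_gpa(applicant_sat), 2))
```

How much weight such a prediction deserves depends on r: the lower the correlation in the past data, the larger the typical residual around the predicted GPA.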
75Restricting the Values of the Independent
Variable Attenuates Correlation
76Predicting Presidential Elections Months in
Advance
- A number of political scientists have devised
predictive models that you may have heard about
during the past Presidential election year.
- Much like the hypothetical college Admissions
Office, these political scientists have assembled
data concerning the past 15 or so Presidential
elections, in particular
  - the percent of the popular vote won by the
  candidate of the incumbent party (the party that
  controlled the White House going into the
  election), which is the dependent variable of
  interest, and
  - data on a number of independent variables whose
  values become available well in advance of the
  election, typically including
    - one or more indicators of the state of the
    economy (growth rate, unemployment rate, etc.),
    usually as of the end of the second quarter (June
    30) of the election year
    - some poll measure of the incumbent President's
    approval rating as of about June 30.
- Using this data, they calculate the coefficients
for the regression equation.
77Predicting Presidential Elections (cont.)
- You constructed two bivariate scattergrams along
these lines in PS 11
  - INCUMBENT VOTE by GDP
  - INCUMBENT VOTE by PRESIDENTIAL APPROVAL
- Remember that
  - GDP was real Gross Domestic Product (economic)
  growth over the Fall, Winter, and Spring quarters
  preceding the election (e.g., from October 1,
  2003 through June 30, 2004)
  - PRES APPROVAL was the incumbent President's
  approval rating in the first Gallup Poll taken
  after June 30 of the election year.
- Let's look at each of these more closely, and
find (by either eyeball or calculation) the
regression equation.
78Predicting Presidential Elections (cont.)
79Predicting Presidential Elections (cont.)
80Predicting Presidential Elections (cont.)
- In July 2008, GDP was about 1, so we could
predict
- McCain POP VOTE = 47.7 + 1.12 x 1
- = 47.7 + 1.12 = 48.8
- This would be a pretty fuzzy prediction because
the calculated correlation coefficient is only
about r = 0.5 (r^2 = .25)
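The arithmetic of this prediction, written out as a short Python sketch. The intercept, slope, and r are the values quoted on the slide; the function and variable names are assumptions of this example.

```python
# Bivariate regression of INCUMBENT VOTE on GDP, as quoted above.
INTERCEPT = 47.7   # a
SLOPE = 1.12       # b

def predict_incumbent_vote(gdp_growth):
    """Predicted incumbent-party share of the popular vote."""
    return INTERCEPT + SLOPE * gdp_growth

predicted = predict_incumbent_vote(1.0)   # GDP of about 1 in July 2008

# How fuzzy is the prediction? With r of about 0.5, r^2 is .25, so
# roughly three quarters of the variance in INCUMBENT VOTE is left
# unexplained by GDP alone.
r = 0.5
explained = r ** 2
unexplained = 1 - explained

print(f"predicted incumbent vote: {predicted:.1f}")
print(f"explained share of variance: {explained:.2f}")
print(f"unexplained share of variance: {unexplained:.2f}")
```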
81The Regression Equation
- Putting this all together, we now have the
equation for the regression line in the example
83Predicting Presidential Elections (cont.)
84Predicting Presidential Elections (cont.)
- We can make still sharper predictions by using
both independent variables simultaneously to
predict the value of the dependent variable.
- Here is the multiple (vs. bivariate) regression
equation (coefficients calculated by SPSS)
- INC VOTE = 35.8 + .49 x GDP + .30 x POP (with r^2
= .71)
- It might seem surprising that the apparent impact of
each independent variable (its regression
coefficient) is smaller in this multivariate
analysis than in the bivariate analyses.
- This is because the two independent variables are
themselves correlated (r = .4)
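Why correlated independent variables shrink each coefficient can be seen in a small constructed example. This Python sketch does not use the election data: the six cases are made up so that Y is exactly 10 + 2 x X1 + 1 x X2, with X1 and X2 positively correlated, and the two-variable least-squares formulas are written out by hand.

```python
def mean(values):
    return sum(values) / len(values)

def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

# Hypothetical data: two positively correlated independent variables.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 3, 2, 5, 4]
y = [10 + 2 * a + 1 * b for a, b in zip(x1, x2)]

s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)

# Bivariate slopes: each independent variable used alone.
bivariate_b1 = s1y / s11
bivariate_b2 = s2y / s22

# Multiple regression slopes: both independent variables used together.
denominator = s11 * s22 - s12 ** 2
multiple_b1 = (s22 * s1y - s12 * s2y) / denominator
multiple_b2 = (s11 * s2y - s12 * s1y) / denominator

print("bivariate slopes:", round(bivariate_b1, 3), round(bivariate_b2, 3))
print("multiple slopes: ", round(multiple_b1, 3), round(multiple_b2, 3))
```

Used alone, each variable also picks up part of the effect of the other variable it is correlated with, so its bivariate slope is inflated. The multiple regression separates the two effects and recovers the smaller coefficients (2 and 1) that were built into the data.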
85Out-of-Sample Predictions
86A Relevant Example
89How Much Is an Additional Year of Education
Really Worth?
90How Much Is an Additional Year of Education
Really Worth? (cont.)
91How Much Is an Additional Year of Education
Really Worth? (cont.)
92How Much Is an Additional Year of Education
Really Worth? (cont.)
93Test 2
- If you answered (C) to M-C Q9, one point has
been added to your score and your grade has been
increased by 0.16 of a grade point.
- If you answered 3 to Blue Book Q1(d), show it
to me and I will add one point to your score (and
0.16 to your grade).