CORRELATION AND REGRESSION - PowerPoint PPT Presentation

About This Presentation
Title:

CORRELATION AND REGRESSION

Description:

correlation and regression topic #13 ... – PowerPoint PPT presentation

Slides: 94
Provided by: userpage5

Transcript and Presenter's Notes

Title: CORRELATION AND REGRESSION


1
CORRELATION AND REGRESSION
  • Topic 13

2
An Example
  • Recall sentence 11 in Problem Set 3A (and 9)
  • "If you want to get ahead, stay in school."
  • Underlying this nagging parental advice is the
    following claimed empirical relationship

  • LEVEL OF EDUCATION → LEVEL OF SUCCESS IN
    LIFE
  • Suppose we collect data by means of a
    survey asking respondents (say a representative
    sample of the population aged 35-55) to report
    the number of years of formal EDUCATION they
    completed and also their current INCOME (as an
    indicator of SUCCESS). We then analyze the
    association between the two interval variables in
    this reformulated hypothesis.

  • LEVEL OF EDUCATION → LEVEL OF INCOME
  • (# of years reported)
    ($000 per year)
  • Since these are both interval continuous
    variables, we analyze their association by means
    of a scattergram.

3
Having collected such data in two rather
different societies A and B, we produce these two
scattergrams.
4
An Example (cont.)
  • Note that the two scattergrams are drawn with the
    same horizontal and vertical scales.
  • With respect to the vertical axis in scattergram
    A, this violates the usual guideline that scales
    be set to minimize white space in the
    scattergram.
  • But here we want to facilitate comparison between
    the two charts.
  • Both scattergrams show a clear positive
    association between the two variables, i.e., the
    plotted points in both form an upward-sloping
    pattern running from Low-Low to High-High.
  • At the same time there are obvious differences
    between the two scattergrams (and thus between
    the relationships between INCOME and EDUCATION
    in societies A and B).

5
Questions For Discussion
  • In which society, A or B, is the hypothesis most
    powerfully confirmed?
  • In which society, A or B, is there a greater
    incentive for people to stay in school?
  • Which society, A or B, does the U.S. more closely
    resemble?
  • How might we characterize the difference between
    societies A and B?

6
An Example (cont.)
  • We can visually compare and contrast the nature
    of the associations between the two variables in
    the two scattergrams by drawing a number of
    vertical strips in each scattergram (as we did in
    the earlier Pearson scattergram of SON'S HEIGHT
    by FATHER'S HEIGHT).
  • Points that lie within each vertical strip
    represent respondents who have (just about) the
    same value on the independent (horizontal)
    variable of EDUCATION.
  • Within each strip, we can estimate (by eyeball
    methods) the average magnitude of the dependent
    (vertical) variable INCOME and put a mark at the
    appropriate level.
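
The vertical-strip procedure can be sketched in a few lines of Python. This is only an illustration: the six respondents below are hypothetical (not data from the survey described here), and each distinct EDUCATION value is treated as its own strip.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (EDUCATION in years, INCOME in dollars) responses.
respondents = [
    (10, 48_000), (10, 52_000),
    (12, 51_000), (12, 53_000),
    (16, 55_000), (16, 57_000),
]

def line_of_averages(pairs):
    """Return {education: average income}, one entry per vertical strip."""
    strips = defaultdict(list)
    for education, income in pairs:
        strips[education].append(income)
    return {edu: mean(incomes) for edu, incomes in sorted(strips.items())}

averages = line_of_averages(respondents)
print(averages)
```

Plotting the resulting (EDUCATION, average INCOME) pairs and connecting them gives the line of averages.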

7
Average Income for Selected Levels of Education
8
We can connect these marks to form a line of
averages that is apparently (close to being) a
straight line.
9
An Example (cont.)
  • Now we can assess two distinct characteristics of
    the relationships between EDUCATION and INCOME in
    scattergrams A and B.
  • How much does the average level of INCOME
    change among people with different levels of
    EDUCATION?
  • How much dispersion of INCOME is there among
    people with the same level of EDUCATION?

10
An Example (cont.)
  • In both scattergrams, the line of averages is
    upward-sloping, indicating a clear apparent
    positive effect of EDUCATION on INCOME.
  • But in the scattergram for society A, the upward
    slope of the line of averages is fairly shallow.
  • The line of averages indicates that average
    INCOME increases by only about $1000 for each
    additional year of EDUCATION.
  • On the other hand, in the scattergram for society
    B, the upward slope of the line of averages is
    much steeper.
  • The graph in Figure 1B indicates that average
    INCOME increases by about $4000 for each
    additional year of EDUCATION.
  • In this sense, EDUCATION is on average more
    rewarding in society B than A.

11
An Example (cont.)
  • There is another difference between the two
    scattergrams.
  • In scattergram A, there is almost no dispersion
    within each vertical strip (and almost no
    dispersion around the line of averages as a
    whole).
  • In scattergram B, there is a lot of dispersion
    within each vertical strip (and around the line
    of averages as a whole).
  • We can put this point in substantive language.
  • In society A, while additional years of EDUCATION
    produce rewards in terms of INCOME that are
    modest (as we saw before), these modest rewards
    are essentially certain.
  • In society B, while additional years of EDUCATION
    produce on average much more substantial rewards
    in terms of INCOME (as we saw before), these
    large expected rewards are highly uncertain and
    are indeed realized only on average.
  • For example, in scattergram B (but not A), we can
    find many pairs of cases such that one case has
    (much) higher EDUCATION but the other case has
    (much) higher INCOME.

12
An Example (cont.)
  • This means that in society B, while EDUCATION has
    a big impact on INCOME, there are evidently
    other (independent) variables (maybe family
    wealth, ambition, career choice, athletic or
    other talent, just plain luck, etc.) that also
    have major effects on LEVEL OF INCOME.
  • In contrast, in society A it appears that LEVEL
    OF EDUCATION (almost) wholly determines LEVEL OF
    INCOME and that essentially nothing else matters.
  • Another difference between the two societies is
    that, while both societies have similar
    distributions of EDUCATION, their INCOME
    distributions are quite different.
  • (A is quite egalitarian with respect to INCOME,
    which ranges only from about $40,000 to about
    $60,000, while B is considerably less egalitarian
    with respect to INCOME, which ranges from under
    $10,000 to at least $100,000 and possibly
    higher.)
  • In summary, in society A the INCOME rewards of
    EDUCATION are modest but essentially certain,
    while in society B the INCOME rewards of
    EDUCATION are substantial on average but quite
    uncertain in individual cases.

13
Two Kinds of Strength of Association
  • This example illustrates that, given bivariate
    data for interval variables, strength of
    association between them can mean either of two
    quite different things:
  • the independent variable has a very reliable or
    certain association with the dependent variable,
    as is true for society A but not B, or
  • the independent variable has on average a big
    impact on the dependent variable, as is true for
    society B but not A.
  • There are two bivariate summary statistics that
    capture these two different kinds of strength of
    association between interval variables:
  • the regression coefficient, customarily
    designated b, which is the slope of the line of
    averages (the second kind of strength above), and
  • the correlation coefficient, customarily
    designated r, which is (more or less) determined
    by the magnitude of dispersion in each vertical
    strip (the first kind of strength above).
  • In scattergrams A and B, A has the greater
    correlation coefficient and B has the greater
    regression coefficient.

14
Review: The Equation of a Line
  • Having drawn (at least by eyeball methods) the
    line of averages in a scattergram, it is
    convenient to write an equation for the line.
  • You should recall from high-school algebra that,
    given any graph with a horizontal axis X and a
    vertical axis Y, any straight line drawn on the
    graph has an equation of the form (using the
    symbols you probably used in high school)
  • y = mx + b, where
  • m is the slope of the line expressed as Δy / Δx
    (change in y divided by change in x, or "rise
    over run"), and
  • b is the y-intercept (i.e., the value of y when x
    = 0).
  • Evidently to further torment students, in college
    statistics the symbol b is used in place of m to
    represent the slope of the line and the symbol a
    is used in place of b to represent the intercept
    (and a is customarily placed in front of the bx
    term), so the equation for a straight line is
    usually written as
  • y = a + bx.

15
Slope and Intercept in Scattergrams A and B
16
Equation of a Line (cont.)
  • The equation for the line of averages in
    scattergram A appears to be approximately
  • y = a + bx
  • AVERAGE INCOME = $40,000 + $1000 × EDUCATION.
  • The equation for the line of averages in Figure B
    appears to be approximately
  • AVERAGE INCOME = $10,000 + $4000 × EDUCATION.
  • Given such an equation (or formula), we can take
    any value for the independent variable EDUCATION,
    plug it into the appropriate formula above, and
    calculate or predict the corresponding average
    or expected value of the dependent variable
    INCOME.
  • In society A, such a prediction is likely to be
    quite reliable, because there is very little
    dispersion in any vertical strip and the
    association/correlation between the two variables
    is almost perfect.
  • In society B, this prediction will be much less
    reliable or fuzzier, because there is
    considerable dispersion in every vertical strip
    and the association/correlation between the two
    variables is much less than perfect.
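
As a small illustration of "plugging in," the sketch below evaluates the two eyeballed equations for a respondent with 16 years of EDUCATION. The function name and the choice of 16 years are illustrative assumptions; the intercepts and slopes are the approximate values read off scattergrams A and B.

```python
def predicted_income(education_years, intercept, slope):
    """Average (expected) INCOME for a given EDUCATION, from y = a + bx."""
    return intercept + slope * education_years

# Approximate intercepts and slopes for societies A and B.
society_a = predicted_income(16, intercept=40_000, slope=1_000)
society_b = predicted_income(16, intercept=10_000, slope=4_000)

print(society_a)  # 56000
print(society_b)  # 74000
```

The code returns a single number for each society, but only in society A is an individual's actual INCOME likely to be close to it.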

17
(No Transcript)
18
Earlier we used eyeball methods to draw the line
of averages in the SON'S HEIGHT by FATHER'S
HEIGHT scattergram. Let's write its equation in
the form SON'S HEIGHT = a + b × FATHER'S HEIGHT.
19
(No Transcript)
20
Equation for Son's Height
  • What is the equation for the line of averages?
  • SH = 63 + 0.5 × FH
  • SH = 63 + 0.5 × 74 = 63 + 37 = 100 (8' 4")
  • Clearly something is wrong.
  • 63" is not the true intercept, i.e., it is
  • not average SH when FH = 0,
  • but rather average SH when FH = 58".
  • My height is 16" greater than 58", so my son's
    expected height is
  • 63 + 0.5 × 16 = 71"
  • To read the true y-intercept a on the chart, we
    need to extend the FH scale down to FH = 0 to see
    where the line of averages intersects the true SH
    axis.
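
The true intercept can also be worked out arithmetically rather than by extending the chart. The sketch below assumes only what the slide states: a slope of 0.5 and an average SH of 63" at FH = 58". The variable names are illustrative.

```python
slope = 0.5            # inches of SH per inch of FH (eyeballed)
fh_at_chart_edge = 58  # FH value where the chart's vertical axis is drawn
sh_at_chart_edge = 63  # average SH read off the chart at FH = 58

# Slide the line back from FH = 58 to FH = 0 to get the true intercept.
true_intercept = sh_at_chart_edge - slope * fh_at_chart_edge

# Expected SH for a father who is 74" tall.
expected_sh = true_intercept + slope * 74

print(true_intercept)  # 34.0
print(expected_sh)     # 71.0
```

Under these assumptions the equation with the true intercept is SH = 34 + 0.5 × FH.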

21
(No Transcript)
22
Determining the Line of Averages (Regression
Line) and the Degree of Association (Correlation)
Numerically
  • Clearly statisticians are not satisfied with
    eyeballing a scattergram and
  • visually estimating the slope and intercept of
    the line of averages, or (especially)
  • guessing at the degree of association between the
    DEPENDENT and INDEPENDENT variables.
  • Before we can make exact numerical calculations,
    we must have an exact definition of the line of
    averages.
  • This will lead to precise formulas for the slope,
    intercept, and association/correlation.

23
Example Worksheet in Topic 13
24
Correlation and Regression (cont.)
  • Consider the scattergram to the right, which
    displays the bivariate numerical data for x and y
    in the Sample Problem presented on p. 9 of
    Handout 13.
  • The "vertical strips" kind of argument simply
    will not work, because most strips include no
    data points and only one strip (for x = 5)
    includes more than one point.
  • Using this simple example, we will now proceed a
    little more formally.

25
Correlation and Regression (cont.)
  • Suppose we were to ask what horizontal line
    (i.e., a line for which the slope b = 0, so its
    equation is simply y = a) would best fit (come
    as close as possible to) the plotted points.
  • In order to answer this question, we must have a
    specific criterion for "best fitting."
  • The one statisticians use is called the least
    squares criterion.
  • For each horizontal line, we calculate the sum
    (or mean) of squared deviations of the
    y-observations from the line.
  • We're looking for the horizontal line that
    minimizes this sum/mean.
  • Let's try the horizontal line y = 6.
  • The sum of squared deviations = 138,
  • so the mean squared deviation is 23.
  • Can we do better than this?
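
The calculation for a trial horizontal line can be sketched as follows. The worksheet data are not reproduced in this transcript, so the six y-values below are hypothetical, chosen only so that they reproduce the figures quoted above (sum 138 and mean 23 for the line y = 6).

```python
# Hypothetical y-observations (six cases); not the handout's actual data.
y = [1, 0, 1, 6, 12, 10]

def sum_squared_deviations(values, height):
    """Sum of squared vertical deviations from the horizontal line y = height."""
    return sum((v - height) ** 2 for v in values)

ssd_at_6 = sum_squared_deviations(y, 6)
msd_at_6 = ssd_at_6 / len(y)

print(ssd_at_6)  # 138
print(msd_at_6)  # 23.0
```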

26
(No Transcript)
27
(No Transcript)
28
Correlation and Regression (cont.)
  • That is, we want to find the horizontal line such
    that the squared vertical deviations from the
    line to each point are (in total or on average)
    as small as possible.
  • You might remember (or should be able to guess)
    what line this is: it is the line y = mean of y,
    or in this case y = 5.
  • Recall from Handout 6, p. 4, point (e) that the
    sum (or average) of the squared deviations from
    the mean is less than the sum of the squared
    deviations from any other value of the variable.
  • You should also recall that we have a special
    name for the average squared deviation from the
    mean (of y), namely, the variance (of y).
  • This (or its positive square root, i.e., the
    standard deviation of y) is the standard
    univariate measure of dispersion in y.
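
A quick numerical check of this claim: the sketch below tries several horizontal lines and confirms that the mean squared deviation is smallest at the mean of y, where it equals the variance. The y-values are hypothetical, chosen to have mean 5 and variance 22; `pvariance` from Python's standard `statistics` module divides by the number of cases, matching the definition used here.

```python
from statistics import mean, pvariance

# Hypothetical y-observations with mean 5 and variance 22.
y = [1, 0, 1, 6, 12, 10]

def mean_squared_deviation(values, height):
    """Average squared deviation from the horizontal line y = height."""
    return sum((v - height) ** 2 for v in values) / len(values)

candidates = [3, 4, 5, 6, 7]
msd_by_height = {c: mean_squared_deviation(y, c) for c in candidates}
best_height = min(msd_by_height, key=msd_by_height.get)

print(msd_by_height)  # {3: 26.0, 4: 23.0, 5: 22.0, 6: 23.0, 7: 26.0}
print(best_height, mean(y), pvariance(y))
```

Only five candidate heights are scanned, which illustrates the result rather than proving it.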

29
(No Transcript)
30
Correlation and Regression (cont.)
  • So the line y = 5 is the best fitting horizontal
    line by the least squares criterion.
  • Now suppose that, in finding the best fitting
    line by the least squares criterion, we are no
    longer restricted to horizontal lines but can tip
    the line up or down, so it has a non-zero slope
    and its height varies with the independent
    variable x.
  • More particularly, suppose that we pivot such
    straight lines on the point that is right in the
    middle of all the plotted data points,
  • specifically the point that represents the mean
    of x and the mean of y (in this case, x = 4, y =
    5),
  • until it has a slope that best fits all the
    plotted points by the same least squares
    criterion.
  • In our example, we can clearly improve the fit by
    tipping the line counter-clockwise.

31
(No Transcript)
32
Tipping the Horizontal Line Can Improve the Fit
by the Least Squares Criterion
33
Finding the Best Fitting Line
  • The question is how far we must tip the line to
    get the best fit (by the least squares
    criterion).
  • If we tip it too far, the fit becomes worse.
  • When we have found this best fitting (by the
    least squares criterion) line, we have found what
    is thereby defined as the regression line.
  • Fortunately, we don't have to find this best
    fitting line by trial and error methods.
  • There are formulas for finding the slope of the
    regression line (i.e., the regression
    coefficient) and the intercept from the numerical
    data.
  • There is a related formula for finding the
    correlation coefficient from the numerical data,
    which tells us how well this best fitting
    regression line fits the data points.

34
(No Transcript)
35
Association Between X and Y?
  • Mean Squared Deviation from regression line
    equals 7.9375.
  • The degree of association would be related to how
    big this MSD is compared with the MSD around the
    best fitting horizontal line, which equals 22.
  • Degree of association = 1 − 7.9375 / 22
    = 1 − 0.3608 = 0.6392
  • Note that this number must be between 0 and 1
    (it cannot be negative).
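
The "tipping" idea can be illustrated by pivoting lines on the point of means and computing the MSD for a few trial slopes. The data are hypothetical, constructed only to be consistent with the summary figures quoted in these slides (mean of x = 4, mean of y = 5, MSD of 22 around the horizontal line and 7.9375 around the regression line). The slope 1.875 is the least squares slope for this hypothetical data set, not a value taken from the handout.

```python
# Hypothetical data consistent with the quoted summary figures.
x = [1, 2, 4, 5, 5, 7]
y = [1, 0, 1, 6, 12, 10]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

def msd_around_line(slope):
    """MSD of y around the line through (mean_x, mean_y) with this slope."""
    return sum(
        (yi - (mean_y + slope * (xi - mean_x))) ** 2 for xi, yi in zip(x, y)
    ) / n

for slope in (0, 1, 1.875, 3):
    print(slope, msd_around_line(slope))

degree_of_association = 1 - msd_around_line(1.875) / msd_around_line(0)
print(round(degree_of_association, 4))  # 0.6392
```

The MSD falls as the line is tipped up from horizontal, bottoms out at the least squares slope, and rises again if the line is tipped too far.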

36
Correlation and Regression (cont.)
  • One statistical theorem tells us that the
    regression line goes through the point
    corresponding to the mean of x and the mean of y.
  • Another theorem gives us formulas to calculate
    the slope and intercept of the regression line,
    given the numerical data.
  • A third theorem
  • gives us the formula to calculate the correlation
    coefficient, and also
  • tells us that the squared correlation coefficient
    r² measures how much we can improve the goodness
    of fit (by the least squares criterion) when we
    move from the best fitting horizontal line to the
    best fitting tipped line.

37
How to Calculate the Regression and Correlation
Coefficients
  • Let's call the independent variable X and the
    dependent variable Y. (This usage is standard.)
  • Set up a worksheet like the one that follows
    (also p. 9 of Handout 13).

38
How to Calculate Coefficients (cont.)
  • Compute the following univariate statistics: (1)
    the mean of X, (2) the mean of Y, (3) the
    variance of X, (4) the variance of Y, (5) the SD
    of X, and (6) the SD of Y.

39
How to Calculate Coefficients (cont.)
  • Now we make bivariate calculations for each case
    by multiplying its deviation from the mean with
    respect to X by its deviation from the mean with
    respect to Y.
  • This is called the crossproduct of the
    deviations.

40
How to Calculate Coefficients (cont.)
  • Notice that a crossproduct in a given case
  • is positive if, in that case, the X and Y values
    both deviate in the same direction (i.e., both
    upwards or both downwards) from their respective
    means and
  • it is negative if, in that case, the X and Y
    values deviate in opposite directions (i.e., one
    upwards and one downwards) from their respective
    means.
  • In either event, the absolute crossproduct is
    large if both absolute deviations are large and
    small if either deviation is small.
  • The crossproduct is zero if either deviation is
    zero.
  • Thus
  • (a) if there is a positive relationship between
    the two variables, most crossproducts are
    positive and many are large
  • (b) if there is a negative relationship between
    the two variables, most crossproducts are
    negative and many are large and
  • (c) if the two variables are unrelated, the
    crossproducts are positive in about half the
    cases and negative in about half the cases.
  • So, the average of all crossproducts is
    indicative of the association between variables.

41
How to Calculate Coefficients (cont.)
  • Add up the crossproducts over all cases.
  • The sum (and thus the average) of crossproducts
    is
  • positive in the event of (a) above,
  • negative in the event of (b) above, and
  • (close to) zero in the event of (c) above.
  • Divide the sum of the crossproducts by the number
    of cases to get the average (mean) crossproduct.
    This average is called the covariance of X and Y,
    and its formula is the bivariate parallel to the
    univariate variance (of X or Y).
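
The crossproduct and covariance steps can be sketched as below. The data are hypothetical (six cases with mean of x = 4 and mean of y = 5, matching the summary figures quoted in these slides); they are not claimed to be the handout's worksheet.

```python
# Hypothetical bivariate data.
x = [1, 2, 4, 5, 5, 7]
y = [1, 0, 1, 6, 12, 10]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# One crossproduct per case: (deviation of x) times (deviation of y).
crossproducts = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]

# The covariance is the mean crossproduct (dividing by n, as in the slides).
covariance = sum(crossproducts) / n

print(crossproducts)
print(covariance)  # 7.5
```

Here every nonzero crossproduct is positive, as expected for a positive relationship.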

42
How to Calculate Coefficients (cont.)
  • Notice the following
  • If the association between X and Y is positive,
    their covariance is positive (because most
    crossproducts are positive).
  • If the association between X and Y is negative,
    their covariance is negative (because most
    crossproducts are negative).
  • If there is almost no association between X
    and Y , their covariance is approximately zero
    (because positive and negative crossproducts
    approximately cancel out when added up).
  • Thus the covariance does measure the association
    between the two variables.

43
How to Calculate Coefficients (cont.)
  • However, the covariance is not a (fully) valid
    measure of the association, because the magnitude
    of the average (positive or negative)
    crossproduct depends on not only
  • how closely the two variables are associated, but
    also
  • on the magnitude of their dispersions (as
    indicated by their standard deviations or
    variances).
  • So, for example
  • two very closely (and positively) associated
    variables have a positive but small covariance if
    they both have small standard deviations
  • two not so closely (but still positively)
    associated variables have a larger covariance if
    their standard deviations are sufficiently larger.

44
Multiplying all Values by 10 does not Change the
Degree of Association between the Two Variables X
and Y
45
But Doing This Increases the Covariance of X and
Y 100-Fold
46
The Correlation Coefficient
  • The correlation coefficient is a measure of
    association only, which (like other measures of
    association) is standardized so that it always
    falls in the range from -1 to 1.
  • This is accomplished by dividing the covariance
    of X and Y by the standard deviation of X and
    also by the standard deviation of Y.
  • This is equivalent to calculating the covariance
    of X and Y if the X and Y data are converted into
    standard scores.
  • The degree of correlation in a scattergram can be
    observed by looking only at the plotted points,
    the regression line, and the orientation of the
    horizontal and vertical axes.
  • The units of measurement on each axis make no
    difference and do not have to be shown.
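
The effect of rescaling, and the equivalence with standard scores, can be checked numerically. The sketch uses hypothetical data and two illustrative helper functions; `pstdev` from the standard `statistics` module is the SD computed with division by the number of cases.

```python
from statistics import mean, pstdev

def covariance(xs, ys):
    """Mean crossproduct of deviations (dividing by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)

def standard_scores(values):
    """Convert values to standard scores: (value - mean) / SD."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

# Hypothetical data, then the same data with every value multiplied by 10.
x = [1, 2, 4, 5, 5, 7]
y = [1, 0, 1, 6, 12, 10]
x10 = [10 * v for v in x]
y10 = [10 * v for v in y]

cov_raw = covariance(x, y)        # 7.5
cov_scaled = covariance(x10, y10)  # 100 times larger

# Covariance of standard scores is the same before and after rescaling.
r_raw = covariance(standard_scores(x), standard_scores(y))
r_scaled = covariance(standard_scores(x10), standard_scores(y10))

print(cov_raw, cov_scaled)
print(round(r_raw, 4), round(r_scaled, 4))
```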

47
(No Transcript)
48
The Correlation Coefficient (cont.)
  • Divide the covariance of X and Y by the standard
    deviation of X and also by the standard deviation
    of Y.
  • This gives the correlation coefficient r, which
    measures the degree (or reliability or
    completeness) and direction of association
    between two interval variables X and Y.
  • Correlation coefficient r = Cov(X,Y) / [SD(X) × SD(Y)]
  • If you calculate r to be greater than 1 or less
    than -1, you have made a mistake.
  • The correlation coefficient is the one measure of
    association for which you are expected to know
    how to use the calculating formula (above).
  • It is a good idea first to construct a
    scattergram of the data on X and Y and then to
    check whether your calculated correlation
    coefficient looks plausible in light of the
    scattergram.
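
A minimal sketch of the calculating formula, applied to hypothetical data (not the handout's worksheet), with variances and SDs computed by dividing by the number of cases:

```python
import math

# Hypothetical bivariate data.
x = [1, 2, 4, 5, 5, 7]
y = [1, 0, 1, 6, 12, 10]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

# r = Cov(X,Y) / [SD(X) * SD(Y)]
r = cov_xy / (sd_x * sd_y)

print(round(r, 4))  # 0.7995
assert -1 <= r <= 1, "r outside [-1, 1] means a calculation mistake"
```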

49
Calculating the Correlation Coefficient
50
Properties of the Correlation Coefficient
  • The correlation coefficient formula is based on a
    ratio of X-units × Y-units (i.e., their
    respective deviations from the mean) divided by
    X-units × Y-units (i.e., their respective SDs).
  • This means that all units of measurement cancel
    out and the correlation coefficient is a pure
    number that is not expressed in any units.
  • For example, suppose the correlation between the
    height (measured in inches) and weight (measured
    in pounds) of students in the class is r = .45.
  • This is not r = .45 inches or r = .45 pounds,
    or r = .45 pounds per inch, etc.;
  • rather it is just r = .45.
  • Moreover, the magnitude of the correlation
    coefficient is independent of the units used to
    measure the two variables.
  • If we measured students' heights in feet, meters,
    etc. (instead of inches) and/or measured their
    weights in ounces, kilograms, etc. (instead of
    pounds), and then calculated the correlation
    coefficient based on the new numbers, it would be
    just the same as before, i.e., r = .45.
  • In addition, if you check the formula above, you
    can see that r is symmetric, i.e., it is
    unchanged if we interchange X and Y.
  • Thus, the correlation between two variables is
    the same regardless of which variable is
    considered to be independent and which dependent.

51
Calculating the Regression Coefficient
  • To compute the regression coefficient b, divide
    the covariance of X and Y by the variance of the
    independent variable X.
  • Regression coefficient b = Cov(X,Y) / Var(X)
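
A minimal sketch of this formula on hypothetical data (x is the independent variable; the variance is computed by dividing by the number of cases):

```python
# Hypothetical bivariate data.
x = [1, 2, 4, 5, 5, 7]
y = [1, 0, 1, 6, 12, 10]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
var_x = sum((xi - mean_x) ** 2 for xi in x) / n

# b = Cov(X,Y) / Var(X): units of y per unit of x.
b = cov_xy / var_x

print(b)  # 1.875
```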

52
Properties of the Regression Coefficient
  • So, from one point of view, the regression
    coefficient answers this question
  • How big is the covariance of the IND and DEP
    variables compared with the variance of the IND
    variable?
  • But (as we have already seen) from a more
    practical point of view, the regression
    coefficient (i.e., the slope of the regression
    line or line of averages) answers this
    question
  • How many units, on average, does the dependent
    variable increase when the independent variable
    increases by one unit?
  • If the dependent variable decreases (has a
    negative increase) when the independent
    variable increases, the regression coefficient is
    negative.
  • Thus the magnitude of the regression coefficient,
    unlike that of the correlation coefficient,
    depends on
  • which variable is considered to be independent
    and which dependent, and also
  • the units in which each variable is measured.

53
Properties of the Regression Coefficient (cont.)
  • The regression coefficient is a ratio of X-units
    × Y-units (i.e., their respective deviations
    from the mean) divided by X-units × X-units
    (i.e., the variance of X).
  • Thus the regression coefficient is expressed in
    units of the dependent variable per unit of the
    independent variable; that is,
  • it is a rate,
  • like miles per hour rate of speed or
  • miles per gallon rate of fuel efficiency, and
    so
  • its numerical value changes as these units
    change.

54
Properties of the Regression Coefficient (cont.)
  • Remember that we informally calculated regression
    coefficients (by eyeball methods) in the INCOME
    by EDUCATION example.
  • In society A, the regression coefficient was
    about $1000 (of INCOME [DEP]) per YEAR (of
    EDUCATION [IND]).
  • In society B, the regression coefficient was
    about $4000 (of INCOME [DEP]) per YEAR (of
    EDUCATION [IND]).
  • Likewise we informally calculated the regression
    coefficient in the Pearson (SON'S HEIGHT by
    FATHER'S HEIGHT) scattergram.
  • It was about 0.5 INCHES (of SH [DEP]) per INCH
    (of FH [IND]).
  • In this (rather special) case, both variables are
    in the same "currency" and measured in the same
    units (INCHES), so
  • if we change the unit of measurement for both
    variables (to FEET, CENTIMETERS, etc.), the
    magnitude of the regression coefficient does not
    change.

55
Properties of the Regression Coefficient (cont.)
  • But suppose we measure the height in inches and
    weight in pounds of all students in the class,
    suppose we treat height as the independent
    variable, and suppose the regression coefficient
    is b = 6.
  • This means 6 pounds (dependent variable units)
    of weight per inch (independent variable units)
    of height.
  • That is, if we lined students up in order from
    shortest to tallest, we would observe that their
    weights increase, on average, by about 6 pounds
    for each additional inch of height.
  • This also means that students' weights increase,
    on average, by about 2.7 kilograms (since there
    are about .45 kilograms to a pound) for each
    additional inch of height.
  • So if weight were measured in kilograms instead
    of pounds, the regression coefficient would be b
    = 6 × .45 = 2.7 kilograms per inch.
  • Likewise if we measured weight in pounds but
    height in feet, the regression coefficient would
    be about b = 6 × 12 = 72 pounds per foot.
  • Remember that the magnitude of the correlation
    coefficient is unchanged by such changes in units.
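
The unit dependence of b can be checked numerically. The four (height, weight) pairs below are hypothetical, made up so that b comes out to exactly 6 pounds per inch. The conversion factor .45 kilograms per pound is the approximate one used above.

```python
def regression_coefficient(ind, dep):
    """b = Cov(ind, dep) / Var(ind), both computed with division by n."""
    n = len(ind)
    m_ind = sum(ind) / n
    m_dep = sum(dep) / n
    cov = sum((i - m_ind) * (d - m_dep) for i, d in zip(ind, dep)) / n
    var = sum((i - m_ind) ** 2 for i in ind) / n
    return cov / var

# Hypothetical students: heights in inches, weights in pounds.
height_in = [60, 64, 68, 72]
weight_lb = [115, 119, 173, 177]

weight_kg = [w * 0.45 for w in weight_lb]  # approximate conversion
height_ft = [h / 12 for h in height_in]

b_lb_per_in = regression_coefficient(height_in, weight_lb)
b_kg_per_in = regression_coefficient(height_in, weight_kg)
b_lb_per_ft = regression_coefficient(height_ft, weight_lb)

print(b_lb_per_in)            # 6.0
print(round(b_kg_per_in, 2))  # 2.7
print(round(b_lb_per_ft, 2))  # 72.0
```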

56
Properties of the Regression Coefficient (cont.)
  • If we took weight to be the independent variable
    and height the dependent variable,
  • the magnitude of the regression coefficient would
    clearly change, because
  • we would divide Cov(X,Y) by Var(Y), instead of by
    Var(X).
  • This makes sense because the regression
    coefficient is now answering a different
    question.
  • If we lined students up in order of their weights
    and observed how their heights varied with
    weight, the regression coefficient would be
    telling us how much students' heights increase,
    on average, in some unit of height (inch, meter,
    etc.) as their weights increase by one unit
    (pound, kilogram, or whatever), so it would be
    yet a different number, e.g., perhaps about b =
    0.1 inches per pound.
  • Remember that the magnitude of the correlation
    coefficient is unchanged by reversing the roles
    of the independent and dependent variables.
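
Reversing the roles of the two variables can be sketched as follows, again on hypothetical data. The helper names are illustrative.

```python
import math

def covariance(u, v):
    """Mean crossproduct of deviations (dividing by n)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n

def regression_coefficient(ind, dep):
    """Slope when `ind` is independent: Cov(ind, dep) / Var(ind)."""
    return covariance(ind, dep) / covariance(ind, ind)

def correlation(u, v):
    """Cov(u, v) divided by both standard deviations."""
    return covariance(u, v) / math.sqrt(covariance(u, u) * covariance(v, v))

# Hypothetical bivariate data.
x = [1, 2, 4, 5, 5, 7]
y = [1, 0, 1, 6, 12, 10]

b_y_on_x = regression_coefficient(x, y)  # x independent
b_x_on_y = regression_coefficient(y, x)  # y independent

print(b_y_on_x, round(b_x_on_y, 4))
print(round(correlation(x, y), 4), round(correlation(y, x), 4))
```

The two slopes differ, while the correlation is the same either way. As a side check, the product of the two slopes equals r².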

57
(No Transcript)
58
Specifying the Regression Line
  • To specify the regression line of the
    relationship between independent variable X and
    dependent variable Y, we need to know, in
    addition to the slope of the regression line
    (i.e., the regression coefficient b), how high or
    low the line with that slope lies in the
    scattergram.
  • Actually, we already know enough to draw the
    regression line into the scattergram precisely.
  • The regression line is the line that passes
    through the point that equals the mean of x and
    the mean of y and that has a slope b.
  • But to write the equation of the regression line,
    we need to know the intercept a, i.e., where the
    regression line passes through the vertical axis
    representing values of the dependent variable Y
    when the vertical Y axis intersects the
    horizontal X axis at the point x = 0.
  • This intercept a is equal to the mean of Y minus
    b times the mean of X.
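
A minimal sketch of the intercept calculation on hypothetical data, followed by a check that the resulting line passes through the point of means:

```python
# Hypothetical bivariate data (x independent, y dependent).
x = [1, 2, 4, 5, 5, 7]
y = [1, 0, 1, 6, 12, 10]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
var_x = sum((xi - mean_x) ** 2 for xi in x) / n

b = cov_xy / var_x          # slope
a = mean_y - b * mean_x     # intercept: mean of Y minus b times mean of X

print(a, b)                 # -2.5 1.875
print(a + b * mean_x)       # 5.0, the mean of y
```

The negative intercept here is an example of the intercept answering a somewhat artificial question, since no case in this data set has x = 0.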

59
Finding the Intercept
60
Finding the Intercept
61
Properties of the Intercept
  • The magnitude of the intercept is expressed in
    Y-units.
  • The value of the intercept answers this
    (sometimes quite artificial or even absurd)
    question
  • What is the average (or expected or predicted)
    value
  • of the dependent variable
  • when the independent variable X is zero?
  • Using a and b together, we can answer this
    question: what is the expected or predicted value
    of Y when X is any specified value? The expected
    or predicted value of Y, given that X is some
    particular value x, is given by the regression
    equation (i.e., the equation of the regression
    line)
  • y = a + bx.

62
Relationship between Calculations for Correlation
and Regression Coefficients
  • Note from the formulas that the signs (i.e.,
    + or −) of b and r are both determined by the
    sign of Cov(X,Y), from which it follows that
  • b and r themselves always have the same sign, and
  • if one is zero, the other is also zero.
  • Notice also that the regression and correlation
    coefficients are equal in the event the
    independent and dependent variables have the same
    dispersion, i.e., same SD.
  • For example, in the SON'S HEIGHT by FATHER'S
    HEIGHT scattergram, by eyeball methods we can
    determine that the regression coefficient b is
    about 0.5.
  • It is also apparent, both from common
    expectations and examination of the scattergram,
    that the dispersions (SDs) of SON'S HEIGHT and
    FATHER'S HEIGHT are just about the same, so we
    also know that the correlation coefficient is
    about 0.5.
  • In general, b = r × SD(Y) / SD(X) and
    r = b × SD(X) / SD(Y).
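
These two conversion formulas can be verified numerically. The sketch computes b and r directly from their definitions on hypothetical data and then recovers each from the other.

```python
import math

# Hypothetical bivariate data.
x = [1, 2, 4, 5, 5, 7]
y = [1, 0, 1, 6, 12, 10]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

b = cov_xy / sd_x ** 2        # Cov(X,Y) / Var(X)
r = cov_xy / (sd_x * sd_y)    # Cov(X,Y) / [SD(X) * SD(Y)]

b_from_r = r * sd_y / sd_x
r_from_b = b * sd_x / sd_y

print(b, b_from_r)
print(r, r_from_b)
```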

63
Other Formulas
  • You will find other formulas for the correlation
    and regression coefficients in textbooks.
  • For a horrendous looking example, see Weisberg et
    al., near top of p. 305 and below.
  • Such formulas are mathematically equivalent to
    (i.e., give the same answers as) the formulas
    given here.
  • They are actually easier to work with if you are
    processing many cases or programming a calculator
    or computer, because they require you (or the
    computer) to pass through the data only once.
    (They involve only X and Y values, not X and Y
    deviations from the mean.)
  • But the formulas presented here make more
    intuitive sense and are easy enough to use in the
    simple sorts of problems that you will be
    presented with in problem sets and tests.
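
As an illustration of what a one-pass formula looks like, the sketch below accumulates only running sums of x, y, xy, x² and y² and derives b and r from them. This is one common computational form, not necessarily the one printed in the textbook cited above; the function name and data are hypothetical.

```python
import math

def one_pass_b_and_r(pairs):
    """Compute (b, r) from running sums, reading the data only once."""
    n = sum_x = sum_y = sum_xy = sum_xx = sum_yy = 0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_xx += x * x
        sum_yy += y * y
    cov = sum_xy / n - (sum_x / n) * (sum_y / n)
    var_x = sum_xx / n - (sum_x / n) ** 2
    var_y = sum_yy / n - (sum_y / n) ** 2
    return cov / var_x, cov / math.sqrt(var_x * var_y)

# Hypothetical data as (x, y) pairs.
data = [(1, 1), (2, 0), (4, 1), (5, 6), (5, 12), (7, 10)]
b, r = one_pass_b_and_r(data)

print(b, round(r, 4))  # 1.875 0.7995
```

Note that with floating-point arithmetic this form can lose precision when the means are large relative to the SDs, which is one more reason the deviation-based formulas are a sensible default for hand calculation.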

64
The Coefficient of Determination r²
  • For various reasons, the squared correlation
    coefficient r ² is often reported.
  • To compute r ², just multiply the correlation
    coefficient by itself.
  • This results in a number that
  • (i) always lies between 0 and 1 (i.e., is never
    negative and so does not indicate the direction
    of association), and
  • (ii) is closer to 0 than r is for all
    intermediate values of r; for example,
    (±0.45)² = 0.2025.
  • The statistic r ² is sometimes called the
    coefficient of determination.
  • There are three reasons for using this statistic.

65
The Coefficient of Determination r²
  • (1) A scattergram with r of about 0.5 does not
    appear to be halfway between a scattergram with
    r = 1 and one with r = 0; it looks closer to
    the one with zero association. The scattergram
    that looks halfway in between perfect and zero
    association has a correlation of about r = 0.7
    (and r² = 0.5).

66
The Coefficient of Determination r²
  • (2) r² has a specific interpretation that r
    itself lacks.
  • Recall that the line y = ȳ (the mean of y) is the
    horizontal line that best fits the plotted points
    by the least squares criterion.
  • Recall also that the quantity "average squared
    deviation from the line y = ȳ" has a special and
    familiar name.
  • It is the variance of the dependent variable Y
    (the square root of which is the SD of Y).

67
The Coefficient of Determination r²
  • When we tip the line, we can almost always
    improve the fit at least a bit.
  • The regression line is the tipped line with the
    best fit according to the least squares
    criterion.
  • But even this line usually fits the points far
    from perfectly.
  • For each plotted point, there is some vertical
    distance (positive if the point lies above the
    line, negative if it lies below) between the
    (best fitting) regression line and the point.
  • This vertical distance is called the residual
    for that case.
  • These residuals are the errors in prediction that
    are left over after we use the regression line
    to predict the value of the dependent variable in
    each case.
  • The ratio of the average squared residuals
    divided by the variance of Y can be characterized
    as the proportion of the variance in Y that is
    not predicted or explained by the regression
    equation that has X as the independent variable.
  • Therefore 1 minus this ratio can be characterized
    as the proportion of the variance in Y that is
    predicted or explained by the regression
    equation.
  • It turns out that the latter proportion is
    exactly equal to r².
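
The argument above can be checked numerically. The following Python sketch uses made-up data and hypothetical names; it computes the proportion of variance left unexplained by the regression line and compares 1 minus that ratio with r² computed directly.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(ys)

a, b = fit_line(xs, ys)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Variance of Y: average squared deviation from the horizontal line y = mean.
variance_y = sum((y - mean_y) ** 2 for y in ys) / n

# Residuals: vertical distances from the regression line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_squared_residual = sum(e ** 2 for e in residuals) / n

unexplained = mean_squared_residual / variance_y
explained = 1 - unexplained

# r squared computed directly from the deviations.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
r_squared = sxy ** 2 / (sxx * syy)

print(explained, r_squared)
```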

69
The Coefficient of Determination r²
  • (3) Finally, serious regression analysis is
    almost always multiple (multivariate) regression,
    where the effects of multiple independent (and/or
    control) variables on a single dependent
    variable are analyzed.
  • In this case, we want some summary measure of the
    overall extent to which the set of all
    independent variables explains variation in the
    dependent variable, regardless of whether
    individual independent variables have positive or
    negative effects (i.e., regardless of whether
    bivariate correlations are positive or negative).
  • The coefficient of determination r² (usually
    written R² in the multiple regression case)
    provides this measure.

70
Look Underneath the Correlation
  • Correlation Reflects Linear Association Only.
  • Suppose that you calculate a correlation
    coefficient and find that r = 0 (so b = 0 also).
  • You should not jump to the conclusion that there
    is no association of any kind between the
    variables.
  • The zero coefficient tells you that there is no
    linear (straight-line) association between the
    variables.
  • But there may be a very clear curvilinear
    (curved-line) association between them.
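
A small Python sketch makes the point concrete. The data are made up: Y is exactly X squared, a perfect curvilinear relationship, yet the correlation coefficient comes out as zero. The helper name is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: a perfect U-shaped (curvilinear) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

Y is completely determined by X here, but no tilted straight line fits better than the horizontal line at the mean of Y.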

71
Look Underneath the Correlation (cont.)
  • A Univariate Outlier May Create a Correlation.
  • Suppose that you calculate a correlation
    coefficient and find that r = 0.9.
  • You should not jump to the conclusion that there
    is a strong and reliable association between the
    variables.
  • In the adjacent scattergram, a single univariate
    outlier produces the high correlation by itself.
  • You should at least double check your data entry
    for this case.
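
The following Python sketch reproduces this effect with made-up data (not the data behind the slide's scattergram). Nine cases form a square grid with zero correlation; adding one case that is extreme on both variables pushes r to about 0.98. The helper name is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: a 3-by-3 grid of cases with no association at all.
xs = [1, 2, 3, 1, 2, 3, 1, 2, 3]
ys = [1, 1, 1, 2, 2, 2, 3, 3, 3]
r_without_outlier = pearson_r(xs, ys)

# One extra case, far outside the range of the others on X and on Y
# (for example, a data-entry error).
r_with_outlier = pearson_r(xs + [20], ys + [20])

print(r_without_outlier, r_with_outlier)
```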

72
Look Underneath the Correlation (cont.)
  • A Bivariate Outlier May Greatly Attenuate a
    Correlation.
  • Clerical errors (or deviant cases) can attenuate,
    as well as enhance, apparent correlation.
  • In the adjacent scattergram, a single bivariate
    outlier reduces what is otherwise an almost
    perfect association between the variables to a
    more modest level.
  • Note that the deviant case is not an outlier in
    the univariate sense.
  • Its value on each variable separately is
    unexceptional.
  • What is exceptional is the combination of values
    on the two variables that it exhibits.
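
Here is a comparable Python sketch for the bivariate outlier, again with made-up data and a hypothetical helper name. Ten cases lie exactly on a line (r = 1); one added case has an ordinary X value and an ordinary Y value, but an unusual combination of the two, and it pulls r down to about 0.76.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: ten cases lying exactly on the line y = x.
xs = list(range(1, 11))
ys = list(range(1, 11))
r_clean = pearson_r(xs, ys)

# One extra case: X = 2 and Y = 9 are each within the normal range,
# but the combination (low X with high Y) is exceptional.
r_with_outlier = pearson_r(xs + [2], ys + [9])

print(r_clean, r_with_outlier)
```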

73
Applied Regression (and Correlation) Analysis
  • Regression (especially multiple regression)
    analysis is now very commonly used in political
    science research.
  • But perhaps its application is most intuitive in
    practical situations in which researchers
    literally want to make predictions about future
    cases based on analysis of past cases.
  • Here are two examples.

74
Predicting College GPAs.
  • A college Admissions Office has recorded the
    combined SAT scores of all of its incoming
    students over a number of years.
  • It has also recorded the first-year college GPAs
    of the same students.
  • The Admissions Office can therefore calculate the
    regression coefficient b, the intercept a, and
    the correlation coefficient r for the data it
    has collected in the past. It can then use the
    regression equation
  • PREDICTED COLLEGE GPA = a + b x SAT SCORE
  • to predict the potential college GPAs of the
    next crop of applicants on the basis of their SAT
    scores (and use these predictions in making its
    admissions decisions).
  • Even better, it can collect and record more
    extensive data and use a multiple regression
    equation such as
  • GPA = a + b1 x SATV + b2 x SATM + b3 x HSGPA
    + b4 x AP + . . .
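
A minimal Python sketch of the bivariate version of this procedure follows. The five past (SAT, GPA) records are invented purely for illustration, as are the function names; a real Admissions Office would use its own records and many more cases.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical past students: combined SAT score and first-year GPA.
past_sat = [1000, 1100, 1200, 1300, 1400]
past_gpa = [2.5, 2.9, 3.0, 3.1, 3.5]

a, b = fit_line(past_sat, past_gpa)

def predicted_gpa(sat_score):
    """PREDICTED COLLEGE GPA = a + b * SAT SCORE."""
    return a + b * sat_score

# Prediction for a hypothetical applicant with a combined SAT of 1250.
print(round(a, 2), round(b, 4), round(predicted_gpa(1250), 2))
```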

75
Restricting the Values of the Independent
Variable Attenuates Correlation
76
Predicting Presidential Elections Months in
Advance
  • A number of political scientists have devised
    predictive models that you may have heard about
    during the past Presidential election year.
  • Much like the hypothetical college Admissions
    Office, these political scientists have assembled
    data concerning the past 15 or so Presidential
    elections, in particular:
  • the percent of the popular vote won by the
    candidate of the incumbent party (the party that
    controlled the White House going into the
    election), which is the dependent variable of
    interest, and
  • data on a number of independent variables whose
    values become available well in advance of the
    election, typically including
  • one or more indicators of the state of the
    economy (growth rate, unemployment rate, etc.),
    usually as of the end of the second quarter (June
    30) of the election year, and
  • some poll measure of the incumbent President's
    approval rating as of about June 30.
  • Using this data, they calculate the coefficients
    for the regression equation.

77
Predicting Presidential Elections (cont.)
  • You constructed two bivariate scattergrams along
    these lines in PS 11
  • INCUMBENT VOTE by GDP
  • INCUMBENT VOTE by PRESIDENTIAL APPROVAL
  • Remember that
  • GDP was real Gross Domestic Product (economic)
    growth over the Fall, Winter, and Spring quarters
    preceding the election (e.g., from October 1,
    2003 through June 30, 2004)
  • PRES APPROVAL was the incumbent President's
    approval rating in the first Gallup Poll taken
    after June 30 of the election year.
  • Let's look at each of these more closely, and
    find (by either eyeball or calculation) the
    regression equation.

78
Predicting Presidential Elections (cont.)
79
Predicting Presidential Elections (cont.)
80
Predicting Presidential Elections (cont.)
  • In July 2008, GDP was about 1, so we could
    predict
  • McCain POP VOTE = 47.7 + 1.12 x 1
  • = 47.7 + 1.12 = 48.8
  • This would be a pretty fuzzy prediction because
    the calculated correlation coefficient is only
    about r = 0.5 (r² = .25).
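
The arithmetic of this prediction can be written out as a short Python sketch. The intercept, slope, GDP value, and r are the ones quoted on the slide; the function and variable names are hypothetical.

```python
INTERCEPT = 47.7   # a, as quoted on the slide
SLOPE = 1.12       # b, as quoted on the slide

def predicted_incumbent_vote(gdp_growth):
    """INCUMBENT VOTE = a + b * GDP."""
    return INTERCEPT + SLOPE * gdp_growth

prediction = predicted_incumbent_vote(1.0)   # GDP of about 1 in July 2008

# With r of about 0.5, only r squared = .25 of the variance in the
# incumbent vote is explained; the rest is left unexplained.
r = 0.5
share_unexplained = 1 - r ** 2

print(round(prediction, 1))
print(share_unexplained)
```

With three quarters of the variance unexplained, the predicted 48.8 should be read as the center of a wide range rather than as a precise forecast.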

81
The Regression Equation
  • Putting this all together, we now have the
    equation for the regression line in the example

83
Predicting Presidential Elections (cont.)
84
Predicting Presidential Elections (cont.)
  • We can make still sharper predictions by using
    both independent variables simultaneously to
    predict the value of the dependent variable.
  • Here is the multiple (vs. bivariate) regression
    equation (coefficients calculated by SPSS)
  • INC VOTE = 35.8 + .49 x GDP + .30 x POP (with r²
    = .71)
  • It might seem surprising that the apparent impact
    of each independent variable (its regression
    coefficient) is smaller in this multivariate
    analysis.
  • This is because the two independent variables are
    themselves correlated (r = .4).
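
Why correlated independent variables shrink each other's coefficients can be seen in a small Python sketch. The data are synthetic (not the election data, and not what SPSS computed): Y is constructed as exactly 1 times X1 plus 1 times X2, with X1 and X2 positively correlated. The two-variable coefficients are found by solving the least squares normal equations directly; all names are hypothetical.

```python
def sum_of_products(us, vs):
    """Sum of products of deviations from the means."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))

# Synthetic data: y is built as exactly 1 * x1 + 1 * x2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]                      # correlated with x1 (r = .8)
y = [u + v for u, v in zip(x1, x2)]

s11 = sum_of_products(x1, x1)
s22 = sum_of_products(x2, x2)
s12 = sum_of_products(x1, x2)
s1y = sum_of_products(x1, y)
s2y = sum_of_products(x2, y)

# Bivariate regression of y on x1 alone: x1 also gets "credit" for x2.
bivariate_b1 = s1y / s11

# Multiple regression of y on x1 and x2 (two normal equations).
det = s11 * s22 - s12 ** 2
multiple_b1 = (s22 * s1y - s12 * s2y) / det
multiple_b2 = (s11 * s2y - s12 * s1y) / det

print(bivariate_b1, multiple_b1, multiple_b2)
```

In the bivariate analysis X1 picks up part of the effect of X2 because the two move together; once both are in the equation, each coefficient reflects only its own contribution.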

85
Out-of-Sample Predictions
86
A Relevant Example
89
How Much Is an Additional Year of Education
Really Worth?
90
How Much Is an Additional Year of Education
Really Worth? (cont.)
91
How Much Is an Additional Year of Education
Really Worth? (cont.)
92
How Much Is an Additional Year of Education
Really Worth? (cont.)
93
Test 2
  • If you answered (C) to M-C Q9, one point has
    been added to your score and your grade has been
    increased by 0.16 of a grade point.
  • If you answered 3 to Blue Book Q1(d), show it
    to me and I will add one point to your score (and
    0.16 to your grade).