Title: Bivariate Regression and Correlation
1Bivariate Regression and Correlation
2Analytic Tool
- Answer the following:
- Suppose that you have a sample of 64 individuals, the sample mean is 20, and the sample standard deviation is 16.
- Can you reject the null hypothesis that the population mean is less than or equal to zero at the .05 significance level?
- Can you reject the null hypothesis that the population mean is less than or equal to 18 at the .05 significance level?
3Answer to the Analytic Tool
- Suppose that you have a sample of 64 individuals, the sample mean is 20, and the sample standard deviation is 16.
- The one-tailed critical value of the t-statistic at the .05 significance level with 63 degrees of freedom is about 1.67.
- To determine whether the mean is significantly greater than zero, we calculate
- $t = (20 - 0) / (16 / \sqrt{64}) = 20 / (16 / 8) = 20 / 2 = 10$
- Since $t > 1.67$, we can reject the null hypothesis.
- To determine whether the mean is significantly greater than 18, we calculate
- $t = (20 - 18) / (16 / \sqrt{64}) = 2 / (16 / 8) = 2 / 2 = 1$
- Since $t < 1.67$, we cannot reject the null hypothesis.
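The two calculations above can be checked with a short Python sketch. The function name `one_sample_t` and the constant `T_CRIT` are illustrative names, not part of the lecture; the critical value is hard-coded from a t table (roughly 1.669 for 63 degrees of freedom, one-tailed) rather than computed.

```python
import math

def one_sample_t(sample_mean, null_mean, sample_sd, n):
    """t-statistic for the null hypothesis that the population mean <= null_mean."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - null_mean) / standard_error

# Assumption: one-tailed .05 critical value for 63 degrees of freedom,
# read from a t table (approximately 1.669).
T_CRIT = 1.669

t_zero = one_sample_t(20, 0, 16, 64)       # test against a null value of 0
t_eighteen = one_sample_t(20, 18, 16, 64)  # test against a null value of 18

reject_zero = t_zero > T_CRIT
reject_eighteen = t_eighteen > T_CRIT

print(t_zero, reject_zero)
print(t_eighteen, reject_eighteen)
```

Only the null value changes between the two tests; the standard error (16 / 8 = 2) is the same in both.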
4Agenda
- Today we will begin to learn how to investigate the relationship between two continuous variables.
- You will learn
- 1) how to graphically present the relationship between two variables
- 2) how to measure the correlation between two variables
5Review
- Thus far, we have learned how to conduct three general types of hypothesis tests.
- 1) Hypothesis tests concerning the sample mean of a continuous random variable.
- e.g. Is the mean income in the U.S. greater than 40,000?
- 2) Hypothesis tests concerning the difference in the means of two samples of a continuous random variable.
- e.g. Is the mean income for men greater than the mean income for women?
- 3) Hypothesis tests concerning the independence of two categorical variables.
- e.g. Are race and vote choice independent?
6Introduction
- Suppose that you have two continuous variables measured for the same set of observations.
- You would like to know whether these two variables are related.
- How would you proceed based on what you've been taught so far?
7Possible Methods (based on what we've covered so far)
- Option 1. Split one of the variables at some critical value (say the median). Then test for a difference in the means of the second variable between the observations above and below the critical value.
- e.g. Test whether cities with large percent increases in government expenditures had higher or lower percent changes in unemployment than cities with small percent changes in government expenditures.
- Option 2. Divide both variables into smaller sets of categories (say quartiles or quintiles). Then perform a chi-squared test to see if those categories are independent.
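Option 1 might be sketched in Python as follows. The data are invented purely for illustration, the function name `median_split_t` is hypothetical, and the unequal-variance form of the two-sample t-statistic is an assumption; the lecture does not say which form of the difference-in-means test to use.

```python
import math
import statistics

# Hypothetical data for eight cities (invented for illustration only):
# percent change in government expenditures and in unemployment.
spending_change = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
unemployment_change = [0.5, 0.1, 0.4, 0.2, -0.3, -0.1, -0.6, -0.4]

def median_split_t(x, y):
    """Split y by whether x is above the median of x, then compare the group means."""
    cut = statistics.median(x)
    low = [yi for xi, yi in zip(x, y) if xi <= cut]
    high = [yi for xi, yi in zip(x, y) if xi > cut]
    # Assumption: unequal-variance standard error for the difference in means.
    standard_error = math.sqrt(
        statistics.variance(low) / len(low) + statistics.variance(high) / len(high)
    )
    return (statistics.mean(high) - statistics.mean(low)) / standard_error

t_split = median_split_t(spending_change, unemployment_change)
print(t_split)
```

A negative value here means that the high-spending group has the lower mean change in unemployment. The statistic would still need to be compared against a critical value to decide significance.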
8Graphical Representation of the Data
- One way to analyze relationships between two continuous variables is with a scatterplot.
- A scatterplot is a type of diagram that displays the covariation of two continuous variables as a set of points on a Cartesian coordinate system.
9Interpretation of the Scatterplot
A Positive Relationship between the two variables
occurs when an increase in the variable
represented on the x-axis corresponds to an
increase in the variable represented on the
y-axis.
A Negative Relationship between the two variables
exists when an increase in the variable
represented on the x-axis corresponds to a
decrease in the variable represented on the
y-axis.
10Interpretation of the Scatterplot
A curvilinear relationship exists if a change in the variable on the x-axis has a different effect on the variable represented along the y-axis, depending on the value of x (or y).
No relationship exists if a change in the variable represented along the x-axis does not correspond to a change in the variable along the y-axis.
11Group Analytic Tool Group
- How would you devise a statistic to determine
whether there was a positive or negative
relationship?
12Covariance
- Covariance is a statistical measure of the relationship between two variables measured on the same sample.
- $\mathrm{Cov}(X, Y) = \frac{\sum_i (Y_i - \mathrm{Mean}(Y)) (X_i - \mathrm{Mean}(X))}{N - 1}$
- If your relationship is positive, then the covariance will be positive: large values of X will be associated with large values of Y and small values of X will be associated with small values of Y.
- If your relationship is negative, then the covariance will be negative: large values of X will be associated with small values of Y and small values of X will be associated with large values of Y.
- If there is no relationship, then the covariance will be close to zero: large values of X will be associated with both large and small values of Y and small values of X will be associated with both large and small values of Y.
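The three cases can be illustrated with a minimal Python sketch of the covariance formula. The function name `covariance` and the tiny data sets are made up for illustration; they are chosen so that the sign of the result is easy to verify by hand.

```python
def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations, divided by N - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross_products = sum((yi - mean_y) * (xi - mean_x) for xi, yi in zip(x, y))
    return cross_products / (n - 1)

# Hypothetical data: two small and two large values of X.
x = [1, 1, 5, 5]
y_positive = [1, 2, 4, 5]  # large X paired with large Y
y_negative = [5, 4, 2, 1]  # large X paired with small Y
y_none = [1, 5, 1, 5]      # large X paired with both small and large Y

print(covariance(x, y_positive))
print(covariance(x, y_negative))
print(covariance(x, y_none))
```

In the last data set each value of X appears with both a small and a large value of Y, so the positive and negative cross-products cancel.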
13Computing the Covariance
- Note that the equation for covariance can be defined multiple ways.
- The intuitive expression
- $\mathrm{Cov}(X, Y) = \frac{\sum_i (Y_i - \mathrm{Mean}(Y)) (X_i - \mathrm{Mean}(X))}{N - 1}$
- is equivalent to
- $\mathrm{Cov}(X, Y) = \frac{N \sum_i X_i Y_i - (\sum_i X_i)(\sum_i Y_i)}{(N - 1) N}$
- The second expression may be easier to use for calculations in Excel.
- Note: It is useful to calculate results yourself rather than with Excel's canned function because (at least with my version) Excel assumes that it is estimating the covariance of two populations and uses n as a denominator rather than n - 1.
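As a quick check that the two expressions agree, and to see what changes when n is used as the denominator, here is a small Python sketch. The function names and the data are illustrative assumptions, not taken from the lecture.

```python
import math

def cov_definition(x, y):
    """Intuitive form: cross-products of deviations from the means, over N - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((yi - mean_y) * (xi - mean_x) for xi, yi in zip(x, y)) / (n - 1)

def cov_computational(x, y):
    """Computational form: (N * sum(XY) - sum(X) * sum(Y)) / ((N - 1) * N)."""
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return (n * sum_xy - sum(x) * sum(y)) / ((n - 1) * n)

def cov_population(x, y):
    """Same numerator as the sample covariance, but with n as the denominator."""
    n = len(x)
    return cov_definition(x, y) * (n - 1) / n

# Hypothetical data
x = [2, 4, 6, 8]
y = [1, 3, 2, 6]

print(cov_definition(x, y), cov_computational(x, y), cov_population(x, y))
```

The population version is always smaller in absolute value by a factor of (n - 1) / n, which matters most in small samples.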
14Comments on Covariance
- Covariance is a very good indicator of the direction of the relationship between two variables.
- Covariance is not a good measure of the magnitude of the relationship. This is because covariance is sensitive to the scale of the variables under investigation.
- Note: you can see this if you simply multiply both variables by a constant and compare the covariances.
- So, it is not proper to compare the covariances from two different data sets to see if the relationship is stronger in one case than the other.
- How would you improve covariance to make your findings less sensitive to scale?
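The scale sensitivity noted above can be seen in a few lines of Python. This is a sketch with made-up data and an illustrative function name: multiplying both variables by 10 multiplies the covariance by 100, even though the pattern in the data is unchanged.

```python
def covariance(x, y):
    """Sample covariance with N - 1 in the denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((yi - mean_y) * (xi - mean_x) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data
x = [1, 1, 5, 5]
y = [1, 2, 4, 5]

# The same data expressed in units ten times smaller (e.g. cm to mm)
x_rescaled = [10 * xi for xi in x]
y_rescaled = [10 * yi for yi in y]

cov_original = covariance(x, y)
cov_rescaled = covariance(x_rescaled, y_rescaled)
print(cov_original, cov_rescaled)
```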
15Correlation
- Correlation is a statistical measure of association closely related to covariance.
- The correlation coefficient, denoted $R_{XY}$ or just R, is defined as
- $R_{XY} = \frac{\sum_i (Y_i - \mathrm{Mean}(Y)) (X_i - \mathrm{Mean}(X))}{\sqrt{\sum_i (X_i - \mathrm{Mean}(X))^2 \sum_i (Y_i - \mathrm{Mean}(Y))^2}} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X) \, \mathrm{SD}(Y)}$
- $R_{XY}$ by definition can only take values between -1 and 1.
- The larger the absolute value of $R_{XY}$, the stronger the relationship between X and Y.
- If $R_{XY} = 1$, then X and Y are positively related and X is a perfect predictor of Y.
- If $R_{XY} = -1$, then X and Y are negatively related and X is a perfect predictor of Y.
- If $R_{XY} = 0$, then X and Y have no linear relationship.
16Correlation cont.
- $R_{XY}$ by definition can only take values between -1 and 1.
- The larger the absolute value of $R_{XY}$, the stronger the relationship between X and Y.
- If $R_{XY} = 1$, then X and Y are positively related and X is a perfect predictor of Y (and Y is a perfect predictor of X).
- If $R_{XY} = -1$, then X and Y are negatively related and X is a perfect predictor of Y (and Y is a perfect predictor of X).
- If $R_{XY} = 0$, then X and Y have no linear relationship.
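The boundary cases of the correlation coefficient can be checked with a short Python sketch. The function name `correlation` and the data are illustrative assumptions. The U-shaped example is included because a coefficient of zero rules out only a linear relationship: there Y is completely determined by X, yet the coefficient is zero.

```python
import math

def correlation(x, y):
    """Correlation: cross-products of deviations over the root of the product of the sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((yi - mean_y) * (xi - mean_x) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data
x = [1, 2, 3, 4, 5]
y_line_up = [2 * xi + 1 for xi in x]     # exact increasing line
y_line_down = [10 - 3 * xi for xi in x]  # exact decreasing line
y_u_shape = [(xi - 3) ** 2 for xi in x]  # curvilinear (U-shaped)

r_up = correlation(x, y_line_up)
r_down = correlation(x, y_line_down)
r_curve = correlation(x, y_u_shape)
# Rescaling X by 100 leaves the correlation unchanged, unlike the covariance.
r_rescaled = correlation([100 * xi for xi in x], y_line_up)

print(r_up, r_down, r_curve, r_rescaled)
```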
17Comments on Correlation
- The correlation coefficient provides a very useful summary of the relationship between X and Y.
- But it takes real effort to use knowledge of the correlation coefficient and the value of X (or Y) to make a prediction about the value of Y (or X).
- Additionally, correlation does not imply causation.
- e.g. In our original example, do changes in government expenditures cause changes in unemployment, or do changes in unemployment cause changes in government expenditures?
18Getting at Causation
- When we do statistical analyses, we generally have to make assumptions about what constitutes a cause and what constitutes an effect.
- That is, we make a formal statement about our hypothesized relationship like
- $Y_i = f(\text{stuff})$, where Y is the dependent variable and stuff is the set of independent variables.
- If we are clever, we can estimate the effect of stuff (and that's what we will be talking about for the next few weeks) to test whether it has a statistically significant influence on Y.
- If we are really clever, can we test for causality as well? How?
19Getting at Causation cont.
- In order for a variable to be a cause, it is necessary (but not sufficient) for the variable to occur prior to the effect.
- Possible Research Designs to Examine Causality:
- For a dependent variable that doesn't change much
  - Measure a stable set of individual-level characteristics (e.g. race, gender, parents' value for the dependent variable), then examine which stable characteristics explain variation in your sample.
- For a dependent variable that does change
  - Measure the dependent variable and the independent variables at time t, retake the measurements for the same sample at time t+1, then examine whether changes (stability) in the independent variables led to changes (stability) in the dependent variable. (Note: ideally, you'd show that changes in X occurred before changes in Y.)
- For a dependent variable that does change
  - Cohort Analysis