Title: Regression
1. Regression and Correlation (1)
- A relationship between 2 variables X and Y
- The relationship seen as a straight line
- Two problems
- How can we tell if our regression line is useful?
- Test of hypothesis about the slope, $\beta_1$
- Correlation
- Useful features of r
- Test of hypothesis about $\rho$
- Examples
2. A relationship between two variables X and Y
- We often have pairs of scores for a given set of cases. For example, we might have:
- number of years of education and annual income
- IQ and GPA
- income and number of books in the household
- More generally, we have any X and Y, and our
question is, does knowing something about X tell
us anything about Y?
3. A relationship between two variables X and Y
- Does knowing something about X tell us anything about Y?
- For example, knowing how many years of education a person has, could you usefully estimate their annual income, or the number of cigarettes they smoke in a year?
4. A relationship between two variables X and Y
- Often, the answer to that question is yes: there is a relationship between the X and Y scores you have measured.
- On average, as number of years of education goes up (across a set of people), number of cigarettes smoked per year goes down.
5. A relationship between two variables X and Y
- In the graph on the next slide, we see two things:
- Y goes down as X goes up.
- At each value of X, there is some variability in Y, but substantially less than there is in Y overall.
6. [Figure: scatterplot of Y (cigarettes per year) against X (years of education)]
- Note that the range of the Y values at a single value of X is small compared to the whole range of Y in the data set.
7. The relationship seen as a straight line
- The relationship between an X and a Y can be described using the equation for a straight line.
- $Y = \beta_0 + \beta_1 X + \varepsilon$
- $\beta_0$ = Y-intercept, $\beta_1$ = slope, $\varepsilon$ = error
- Note: this is the (theoretical) population equation relating Y to X.
8. Two problems
- $Y = \beta_0 + \beta_1 X + \varepsilon$
- In principle, this equation would let us predict the value of Y for a given X without error IF:
- A. X were the only variable that influenced Y
- Usually, it isn't.
- B. We knew the population values of $\beta_0$ and $\beta_1$
- Usually, we don't.
9. Two problems
- Be sure to distinguish between:
- A. Actual values of Y in the population.
- B. Values of Y we would predict using $Y = \beta_0 + \beta_1 X + \varepsilon$ if we had the population values for $\beta_0$ and $\beta_1$.
- C. Values of Y we predict on the basis of the X-Y relationship in our sample data: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
- Why no $\varepsilon$ here?
10. Two problems
- When we predict Y on the basis of X for a given case, two things can cause the predicted values to be different from the values we would find if we actually measured Y for that case:
- 1. We don't know the population values of $\beta_0$ and $\beta_1$, only the sample values $\hat{\beta}_0$ and $\hat{\beta}_1$.
- Note that if we did know $\beta_0$ and $\beta_1$, this source of error would disappear.
11. Two problems
- 2. In the population, Y is not uniquely determined by X. As a result, for each value of X, there is a distribution of Y values.
- Relative to our predicted Y for a given value of X, the observed values of Y will sometimes be higher and sometimes be lower.
- These errors are random; over the long term, they will cancel each other out.
- But even if we knew $\beta_0$ and $\beta_1$, this source of error would still exist.
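The second source of error can be illustrated with a small simulation. This is only a sketch: the population values `BETA0`, `BETA1`, and `SIGMA` are hypothetical numbers chosen for illustration, and the errors are assumed to be normally distributed, which the slides do not state.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population values, chosen only for illustration.
BETA0, BETA1, SIGMA = 10.0, -0.5, 2.0

def draw_y(x):
    """Draw one Y from the population model Y = beta0 + beta1*X + error."""
    return BETA0 + BETA1 * x + random.gauss(0.0, SIGMA)

x = 12
predicted = BETA0 + BETA1 * x  # prediction made with the true population line

# Observed-minus-predicted Y for many cases that share the same X.
errors = [draw_y(x) - predicted for _ in range(10_000)]

mean_error = sum(errors) / len(errors)
share_above = sum(e > 0 for e in errors) / len(errors)

print(f"mean error: {mean_error:.3f}")
print(f"share of cases above the line: {share_above:.3f}")
```

Even though the prediction uses the true $\beta_0$ and $\beta_1$, individual cases still miss the line. The misses fall on both sides and average out to roughly zero.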
12. Two problems
- In other words:
- We don't have population values for the slope and the intercept of the line relating X to Y. That's one problem.
- Even if we had population values for the slope and the intercept, the equation relating X to Y would still not perfectly predict Y. That's the other problem.
13. How can we tell if our regression line is useful?
- The line is useful if the predicted values of Y are close to the observed values of Y (in the sample).
- We use our sample X and Y values to compute the regression line, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$.
- We then use this line to predict the same Y values, and compare our predicted values with the observed values in the sample data. If the prediction is good, we can then use the regression line to predict Y for values of X not in our sample.
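14. How can we tell if our regression line is useful?
- $(Y_i - \hat{Y}_i) = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$, since $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$.
- Therefore, the sum of the squared deviations of predicted Y values from actual Y values is:
- $SSE = \sum [Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)]^2$
- Now $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least squares estimators of $\beta_0$ and $\beta_1$, giving a smaller SSE than any other values of $\hat{\beta}_0$ and $\hat{\beta}_1$ would.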
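As a rough check on the least squares claim, the sketch below computes SSE for a small data set and then nudges the estimates away from their least squares values. The (X, Y) pairs are hypothetical, invented for illustration, and the estimates are computed with the $SS_{XY}/SS_{XX}$ formula given on slide 17.

```python
# Hypothetical (X, Y) pairs, invented for illustration only.
xs = [8, 10, 12, 12, 14, 16, 18]
ys = [30, 26, 22, 25, 18, 15, 10]

def sse(b0, b1):
    """Sum of squared deviations of observed Y from the line b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)

# Least squares estimates of the slope and the intercept.
b1_hat = ss_xy / ss_xx
b0_hat = y_bar - b1_hat * x_bar
best = sse(b0_hat, b1_hat)

# Shift the intercept or the slope a little and recompute SSE each time.
for d0, d1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    nudged = sse(b0_hat + d0, b1_hat + d1)
    print(f"shift ({d0:+.1f}, {d1:+.1f}): SSE = {nudged:.3f} vs. {best:.3f}")
```

Every shifted line gives a larger SSE than the least squares line, which is what "least squares" means.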
15. [Figure: plot of Y against X with a horizontal line at the mean, $\bar{Y}$]
- When there is no relation between X and Y, the best estimator of the Y value for any case is the mean, $\bar{Y}$.
- Notice that the slope of this line is zero!
16. How can we tell if our regression line is useful?
- If X is completely unrelated to Y, the best estimate we could make of Y would be the mean, $\bar{Y}$, for any value of X.
- We find out whether our regression line is useful by asking whether its slope is different from 0.
- $H_0$: $\beta_1 = 0$
- Why not $\hat{\beta}_1$?
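17. How can we tell if our regression line is useful?
- To test that null hypothesis, we use the fact that $\hat{\beta}_1$ is one slope taken from the sampling distribution of $\hat{\beta}_1$.
- $\hat{\beta}_1 = \frac{SS_{XY}}{SS_{XX}}$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
- where $SS_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}$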
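18. How can we tell if our regression line is useful?
- $SS_{XX} = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$
- (n = sample size)
- For the sampling distribution of $\hat{\beta}_1$:
- The mean is $\beta_1$ and the standard deviation is $\sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{SS_{XX}}}$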
19. How can we tell if our regression line is useful?
- We estimate $\sigma_{\hat{\beta}_1}$ by $s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{XX}}}$
- where $s = \sqrt{\frac{SSE}{n - 2}}$
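The quantities on slides 17 to 19 can be chained together in a few lines. The sketch below assumes a small hypothetical data set (the numbers are invented for illustration) and uses the computational sum formulas to get the slope, the intercept, $s$, and $s_{\hat{\beta}_1}$.

```python
import math

# Hypothetical (X, Y) pairs, invented for illustration only.
xs = [8, 10, 12, 12, 14, 16, 18]
ys = [30, 26, 22, 25, 18, 15, 10]
n = len(xs)

# Computational ("sum") formulas for the sums of squares.
sum_x, sum_y = sum(xs), sum(ys)
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n

# Least squares estimates of the slope and the intercept.
b1_hat = ss_xy / ss_xx
b0_hat = sum_y / n - b1_hat * sum_x / n

# s estimates sigma; it is based on SSE with n - 2 degrees of freedom.
sse = sum((y - (b0_hat + b1_hat * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Estimated standard error of the slope.
s_b1 = s / math.sqrt(ss_xx)

print(f"slope = {b1_hat:.3f}, intercept = {b0_hat:.3f}")
print(f"s = {s:.3f}, s_b1 = {s_b1:.3f}")
```

Dividing the slope by `s_b1` gives the t statistic used to test $H_0$: $\beta_1 = 0$.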
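20. Test of hypothesis about the slope, $\beta_1$
- Since $\sigma$ is unknown, we use t to test $H_0$.
- One-tailed: $H_0$: $\beta_1 = 0$ vs. $H_A$: $\beta_1 < 0$ (or $\beta_1 > 0$)
- Two-tailed: $H_0$: $\beta_1 = 0$ vs. $H_A$: $\beta_1 \neq 0$
- Test statistic: $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}}$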
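21. Test of hypothesis about the slope, $\beta_1$
- Rejection region:
- One-tailed: $t_{obt} < -t_{\alpha}$ (or $t_{obt} > t_{\alpha}$ when $H_A$: $\beta_1 > 0$)
- Two-tailed: $|t_{obt}| > t_{\alpha/2}$
- $t_{crit}$ is based on $n - 2$ degrees of freedom.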
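The rejection regions can be written as a small decision function. This is a sketch, not a standard API: the function name `reject_h0` and the labels "less", "greater", and "two-sided" are choices made here for illustration, and `t_crit` is assumed to be the positive table value for $n - 2$ degrees of freedom.

```python
def reject_h0(t_obt, t_crit, alternative):
    """Return True if t_obt falls in the rejection region.

    t_crit is the positive table value with n - 2 degrees of freedom:
    t_alpha for a one-tailed test, t_(alpha/2) for a two-tailed test.
    alternative is "less", "greater", or "two-sided" (labels chosen here).
    """
    if alternative == "less":
        return t_obt < -t_crit
    if alternative == "greater":
        return t_obt > t_crit
    if alternative == "two-sided":
        return abs(t_obt) > t_crit
    raise ValueError(f"unknown alternative: {alternative!r}")

# Hypothetical obtained t values; the critical values are the 5-df
# table values quoted in the examples later in these slides.
print(reject_h0(-3.2, 2.571, "two-sided"))
print(reject_h0(1.4, 2.015, "greater"))
```

The first call falls in the two-tailed rejection region because $|-3.2| > 2.571$. The second does not reach the upper-tail cutoff of 2.015.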
22. Correlation
- The Pearson correlation coefficient r is a numerical, descriptive measure of the strength and direction of the relationship between two variables X and Y.
- $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}}$
- r gives much the same information as $\hat{\beta}_1$. However, r is scale-less and $-1 \le r \le 1$.
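One way to see what "scale-less" means is to change the units of X and compare the slope with r. The sketch below uses hypothetical data (invented for illustration) and a helper named `slope_and_r`, which is simply a name chosen here.

```python
import math

def slope_and_r(xs, ys):
    """Return (slope estimate, Pearson r) computed from sums of squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / ss_xx, ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical data: education in years, and the same values in months.
years = [8, 10, 12, 12, 14, 16, 18]
cigarettes = [30, 26, 22, 25, 18, 15, 10]
months = [12 * x for x in years]

b1_years, r_years = slope_and_r(years, cigarettes)
b1_months, r_months = slope_and_r(months, cigarettes)

print(f"slope per year:  {b1_years:.4f}, r = {r_years:.4f}")
print(f"slope per month: {b1_months:.4f}, r = {r_months:.4f}")
```

The slope shrinks by a factor of 12 when X is measured in months, while r is unchanged.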
23. Useful features of r
- r indexes the X-Y relationship:
- r > 0 means Y increases as X increases
- r < 0 means Y decreases as X increases
- r = 0 means there is no linear relationship between X and Y
- r is the sample correlation coefficient. We can use it to estimate rho ($\rho$), the population correlation coefficient, and use r to test $H_0$: $\rho = 0$.
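24. Test of hypothesis about $\rho$
- One-tailed: $H_0$: $\rho = 0$ vs. $H_A$: $\rho < 0$ (or $\rho > 0$)
- Two-tailed: $H_0$: $\rho = 0$ vs. $H_A$: $\rho \neq 0$
- Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$, where $\rho$ is the value specified by $H_0$ (here, 0)
- $t_{crit}$ has $n - 2$ degrees of freedom.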
25. Example 1
- $H_0$: $\rho = 0$
- $H_A$: $\rho \neq 0$
- Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
- $t_{crit} = t(5, \alpha/2 = .025) = 2.571$
26. Example 1: Sum formulas
- First, calculations involving X:
- $\sum X = 74$, $(\sum X)^2 = 5476$, $\sum X^2 = 922$
- Then, analogous calculations involving Y:
- $\sum Y = 82$, $(\sum Y)^2 = 6724$, $\sum Y^2 = 1076$
- Then, calculations involving X and Y:
- $\sum XY = 976$
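27. Example 1: Sums of squares formulas
- $SS_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}$
- $SS_{XX} = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$
- $SS_{YY} = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$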
28. Example 1: calculate r
- $SS_{XY} = 109.143$
- $SS_{XX} = 139.71$
- $SS_{YY} = 115.429$
- $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}} = .859$
29. Example 1: do t-test
- $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
- $t = \frac{.859 - 0}{\sqrt{\frac{1 - .738}{5}}} = \frac{.859}{.229} = 3.751$
- Since $3.751 > 2.571$, reject $H_0$: a significant correlation exists.
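Example 1 can be reproduced directly from the sums on slide 26. The sketch below assumes $n = 7$, which is inferred from the 5 degrees of freedom used in the example and not stated outright. The function name `r_and_t_from_sums` is a name chosen here for illustration.

```python
import math

def r_and_t_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Compute SS_XY, SS_XX, SS_YY, r, and t (for H0: rho = 0) from raw sums."""
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_xx = sum_x2 - sum_x ** 2 / n
    ss_yy = sum_y2 - sum_y ** 2 / n
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    t = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))
    return ss_xy, ss_xx, ss_yy, r, t

# Sums from slide 26; n = 7 is inferred from df = n - 2 = 5.
ss_xy, ss_xx, ss_yy, r, t = r_and_t_from_sums(
    n=7, sum_x=74, sum_y=82, sum_x2=922, sum_y2=1076, sum_xy=976
)

T_CRIT = 2.571  # two-tailed critical value from slide 25

print(f"SS_XY = {ss_xy:.3f}, SS_XX = {ss_xx:.3f}, SS_YY = {ss_yy:.3f}")
print(f"r = {r:.3f}, t = {t:.3f}")
print("reject H0" if abs(t) > T_CRIT else "fail to reject H0")
```

Because the code carries unrounded values through every step, its t is close to the slide's 3.751 but not digit-for-digit identical, since the slide divides the rounded .859 by the rounded .229. The decision is the same either way.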
30. Example 2
- $H_0$: $\rho = 0$
- $H_A$: $\rho > 0$
- Note: these are the Greek letter rho ($\rho$), NOT the English letter P.
- Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
- $t_{crit} = t(7 - 2 = 5, \alpha = .05) = 2.015$
31. Example 2: Sum formulas
- First, calculations involving X:
- $\sum X = 4.2$, $(\sum X)^2 = 17.64$, $\sum X^2 = 2.86$
- Then, analogous calculations involving Y:
- $\sum Y = 32$, $(\sum Y)^2 = 1024$, $\sum Y^2 = 161.5$
- Then, calculations involving X and Y:
- $\sum XY = 21.35$
32. Example 2: calculate r
- $SS_{XY} = 21.35 - \frac{(4.2)(32)}{7} = 2.15$
- $SS_{XX} = 2.86 - \frac{17.64}{7} = .34$
33. Example 2: calculate r
- $SS_{YY} = 161.5 - \frac{1024}{7} = 15.2143$
- $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}} = \frac{2.15}{\sqrt{(.34)(15.2143)}} = .945$
34. Example 2: do t-test
- $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
- $t = \frac{.945 - 0}{\sqrt{\frac{1 - .893}{5}}} = \frac{.945}{.146} = 6.48$
- Since $6.48 > 2.015$, reject $H_0$: a significant correlation exists.
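The one-tailed test in Example 2 can be checked the same way, working step by step from the sums on slide 31. This is a sketch: the variable names are chosen here, and the upper-tail rule (reject when t exceeds the critical value) follows from $H_A$: $\rho > 0$.

```python
import math

# Sums from slide 31; n = 7 as given in the critical-value line.
n = 7
sum_x, sum_x2 = 4.2, 2.86
sum_y, sum_y2 = 32, 161.5
sum_xy = 21.35

# Sums of squares via the computational formulas.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

# Sample correlation and the t statistic for H0: rho = 0.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
t_obt = r / math.sqrt((1 - r ** 2) / (n - 2))

T_CRIT = 2.015  # one-tailed critical value, 5 df, alpha = .05 (slide 30)
reject = t_obt > T_CRIT  # upper-tail test because HA is rho > 0

print(f"SS_XY = {ss_xy:.2f}, SS_XX = {ss_xx:.2f}, SS_YY = {ss_yy:.4f}")
print(f"r = {r:.3f}, t = {t_obt:.2f}, reject H0: {reject}")
```

Carrying unrounded values through gives the 6.48 shown on the slide. Dividing the rounded .945 by the rounded .146 would give about 6.47.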