Regression - PowerPoint PPT Presentation

About This Presentation

Title:

Regression

Description:

First, calculations involving X: ΣX = 74, (ΣX)² = 5476, ΣX² = 922. Then, analogous calculations involving Y: ΣY = 82, (ΣY)² = 6724, ΣY² = 1076 ...

Slides: 35

Provided by: patric53

Transcript and Presenter's Notes

1
Regression & Correlation (1)

A relationship between 2 variables X and Y
The relationship seen as a straight line
Two problems
How can we tell if our regression line is useful?
Test of hypothesis about the slope, β1
Correlation
Useful features of r
Test of hypothesis about ρ
Examples

2
A relationship between two variables X & Y

We often have pairs of scores for a given set of
cases. For example, we might have:
# of years of education and annual income
IQ and GPA
income and # of books in the household
More generally, we have any X and Y, and our
question is: does knowing something about X tell
us anything about Y?

3
A relationship between two variables X & Y

Does knowing something about X tell us anything
about Y?
For example, knowing how many years of education
a person has, could you usefully estimate their
annual income, or the number of cigarettes they
smoke in a year?

4
A relationship between two variables X & Y

Often, the answer to that question is: Yes,
there is a relationship between the X and Y
scores you have measured.
On average, as number of years of education goes
up (across a set of people), number of cigarettes
smoked per year goes down.

5
A relationship between two variables X & Y

In the graph on the next slide, we see two
things:
Y goes down as X goes up.
At each value of X, there is some variability in
Y, but substantially less than there is in Y
overall.

6
[Figure: scatterplot of Y (cigarettes per year) against X (years of education).]
Note that the range of the Y values for a given
value of X (marked on the graph) is small,
compared to the whole range of Y in the data set.
7
The relationship seen as a straight line

The relationship between an X and a Y can be
described using the equation for a straight line.
Y = β0 + β1X + ε
β0 = Y-intercept, β1 = slope, ε = error
Note: this is the (theoretical) population
equation relating Y to X.

8
Two problems

Y = β0 + β1X + ε
In principle, this equation would let us predict
the value of Y for a given X without error IF:
A. X were the only variable that influenced Y.
Usually, it isn't.
B. We knew the population values of β0 and β1.
Usually, we don't.

9
Two problems

Be sure to distinguish between:
A. Actual values of Y in the population.
B. Values of Y we would predict using
Y = β0 + β1X + ε
if we had the population values for β0 and β1.
C. Values of Y we predict on the basis of the X-Y
relationship in our sample data:
Ŷ = β̂0 + β̂1X

Why no ε here?
10
Two problems

When we predict Y on the basis of X for a given
case, two things can cause the predicted values
to be different from the values we would find if
we actually measured Y for that case:
1. We don't know the population values of β0 and
β1, only the sample values β̂0 and β̂1.
Note that if we did know β0 and β1, this source of
error would disappear.

11
Two problems

2. In the population, Y is not uniquely
determined by X. As a result, for each value of
X, there is a distribution of Y values.
Relative to our predicted Y for a given value of
X, the observed values of Y will sometimes be
higher and sometimes be lower.
These errors are random: over the long term,
they will cancel each other out.
But even if we knew β0 and β1, this source of
error would still exist.
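
A minimal sketch of this second source of error, assuming a made-up population (intercept 20, slope -1, error standard deviation 2; none of these values come from the slides). Even with β0 and β1 known exactly, individual predictions miss, but the misses average out to roughly zero:

```python
import random

def simulate_errors(beta0, beta1, sigma, xs, seed=0):
    """Return observed-minus-predicted Y for each x, with beta0 and beta1 known."""
    rng = random.Random(seed)
    errors = []
    for x in xs:
        predicted = beta0 + beta1 * x                 # the population line
        observed = predicted + rng.gauss(0.0, sigma)  # Y is not determined by X alone
        errors.append(observed - predicted)
    return errors

# Hypothetical population values, chosen only for illustration.
errors = simulate_errors(beta0=20.0, beta1=-1.0, sigma=2.0, xs=[12] * 10_000)
mean_error = sum(errors) / len(errors)
largest_error = max(abs(e) for e in errors)

print(f"mean error over many cases: {mean_error:.3f}")
print(f"largest single error:       {largest_error:.3f}")
```

The mean error is close to zero, while the largest single error is not: knowing the population line removes the first problem but not this one.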

12
Two problems

In other words:
We don't have population values for the slope and
the intercept of the line relating X to Y. That's
one problem.
Even if we had population values for the slope
and the intercept, the equation relating X to Y
would still not perfectly predict Y. That's the
other problem.

13
How can we tell if our regression line is useful?

The line is useful if the predicted values of Y
are close to the observed values of Y (in the
sample).
We use our sample X and Y values to compute the
regression line, Ŷ = β̂0 + β̂1X.
We then use this line to predict the same Y
values, and compare our predicted values with the
observed values in the sample data. If the
prediction is good, we can then use the
regression line to predict Y for values of X not
in our sample.

14
How can we tell if our regression line is useful?

(Yi - Ŷi) = Yi - (β̂0 + β̂1Xi) (since Ŷi = β̂0 + β̂1Xi)
Therefore, the sum of the squared deviations of
predicted Y values from actual Y values is
SSE = Σ[Yi - (β̂0 + β̂1Xi)]²
Now β̂0 and β̂1 are the least squares estimators
of β0 and β1, giving a smaller SSE than any other
values of the intercept and slope would.
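
The least squares property can be checked numerically. The sketch below uses a small made-up data set (not from the slides) whose least squares intercept and slope work out to 5.8 and -0.6, and compares the SSE of that line with the SSE of nearby lines:

```python
def sse(b0, b1, xs, ys):
    """Sum of squared deviations of observed Y from the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical sample, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 5, 4, 2]

# Least squares estimates for this sample (SSXY / SSXX = -6 / 10).
best = sse(5.8, -0.6, xs, ys)

# SSE for lines with a slightly different intercept and/or slope.
others = [
    sse(5.8 + d0, -0.6 + d1, xs, ys)
    for d0 in (-0.5, 0.0, 0.5)
    for d1 in (-0.2, 0.0, 0.2)
    if (d0, d1) != (0.0, 0.0)
]

print(f"SSE at the least squares line: {best:.2f}")
print(f"smallest SSE among the other lines: {min(others):.2f}")
```

Every perturbed line gives a larger SSE than the least squares line, which is what "least squares" means.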

15
[Figure: plot of Y against X with a horizontal line at the mean, Ȳ.]
When there is no relation between X and Y, the
best estimator of the Y value for any case is the
mean, Ȳ.
Notice that the slope of this line is zero!
16
How can we tell if our regression line is useful?

If X is completely unrelated to Y, the best
estimate we could make of Y would be the mean, Ȳ,
for any value of X.
We find out whether our regression line is useful
by asking whether its slope is different from 0.
H0: β1 = 0

Why not β̂1?
17
How can we tell if our regression line is useful?

To test that null hypothesis, we use the fact
that β̂1 is one slope taken from the sampling
distribution of β̂1.
β̂1 = SSXY / SSXX
β̂0 = Ȳ - β̂1X̄
where SSXY = Σ(Xi - X̄)(Yi - Ȳ) = ΣXiYi - (ΣXi)(ΣYi) / n
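
As a rough illustration, the slope and intercept can be computed from the computational (sum-based) forms of SSXY and SSXX. The data are made up for the example, and the function name `least_squares` is only a label chosen here:

```python
def least_squares(xs, ys):
    """Return (intercept, slope) from the computational forms of SSXY and SSXX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # SSXY = sum(XY) - (sum X)(sum Y) / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    # SSXX = sum(X^2) - (sum X)^2 / n
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    slope = ss_xy / ss_xx
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope

# Hypothetical sample, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 5, 4, 2]
b0, b1 = least_squares(xs, ys)
print(f"slope = {b1:.2f}, intercept = {b0:.2f}")
```

For this sample, SSXY = 54 - (15)(20)/5 = -6 and SSXX = 55 - 225/5 = 10, so the slope is -0.6 and the intercept is 4 - (-0.6)(3) = 5.8.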

18
How can we tell if our regression line is useful?

SSXX = Σ(Xi - X̄)² = ΣX² - (ΣX)² / n
(n = sample size)
For the sampling distribution of β̂1:
The mean = β1
The standard deviation σβ̂1 = σ / √SSXX

19
How can we tell if our regression line is useful?

We estimate σβ̂1 by sβ̂1 = s / √SSXX
where s = √(SSE / (n - 2))
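
A short sketch of the estimated standard error of the slope, s / √SSXX with s = √(SSE / (n - 2)), again on made-up data. The function name `slope_standard_error` is an assumption for this example:

```python
import math

def slope_standard_error(xs, ys):
    """Return (s, s_b1): the residual standard deviation and the slope's standard error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))   # n - 2 because two parameters were estimated
    return s, s / math.sqrt(ss_xx)

# Hypothetical sample, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 5, 4, 2]
s, s_b1 = slope_standard_error(xs, ys)
print(f"s = {s:.4f}, standard error of the slope = {s_b1:.4f}")
```

Here SSE = 2.4 and n = 5, so s = √0.8 and the standard error of the slope is √0.8 / √10 = √0.08.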
20
Test of hypothesis about the slope, β1

Since σ is unknown, we use t to test H0.
One-tailed test: H0: β1 = 0; HA: β1 < 0 (or β1 > 0)
Two-tailed test: H0: β1 = 0; HA: β1 ≠ 0
Test statistic: t = (β̂1 - 0) / sβ̂1

21
Test of hypothesis about the slope, β1

Rejection region:
One-tailed test: tobt < -tα (or tobt > tα)
Two-tailed test: |tobt| > tα/2
tcrit is based on n - 2 degrees of freedom.
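
The whole test can be sketched end to end for a two-tailed test at α = .05. The data are made up, and the critical value 3.182 (3 degrees of freedom, .025 in each tail) is typed in from a standard t table rather than computed:

```python
import math

def slope_t_test(xs, ys):
    """Return (t, df) for testing H0: beta1 = 0 against the sample slope."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(ss_xx)
    return (b1 - 0) / s_b1, n - 2

# Hypothetical sample, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 5, 4, 2]
t_obt, df = slope_t_test(xs, ys)

T_CRIT = 3.182  # two-tailed, alpha = .05, df = 3 (from a t table)
reject_h0 = abs(t_obt) > T_CRIT
print(f"t = {t_obt:.3f} on {df} df; reject H0: {reject_h0}")
```

With only five cases, t is about -2.12, which does not reach the critical value, so H0 is not rejected even though the sample slope is clearly negative.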

22
Correlation

The Pearson correlation coefficient r is a
numerical, descriptive measure of the strength
and direction of the relationship between two
variables X and Y.
r = SSXY / √(SSXX · SSYY)
r gives much the same information as β̂1. However,
r is scale-less and -1 ≤ r ≤ 1.
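
A small sketch of r = SSXY / √(SSXX · SSYY), using made-up data, that also illustrates "scale-less": multiplying every Y by 100 (a hypothetical change of units) leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from the sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_yy = sum((y - mean_y) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Hypothetical sample, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 5, 4, 2]

r = pearson_r(xs, ys)
r_rescaled = pearson_r(xs, [100 * y for y in ys])  # same data, different units for Y
print(f"r = {r:.4f}, r after rescaling Y = {r_rescaled:.4f}")
```

For this sample r = -6 / √60, about -0.77. The slope would be 100 times larger after rescaling, but r stays the same.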

23
Useful features of r

r indexes the X-Y relationship:
r > 0 means Y increases as X increases
r < 0 means Y decreases as X increases
r = 0 means there is no linear relationship
between X and Y
r is the sample correlation coefficient. We can
use it to estimate rho (ρ), the population
correlation coefficient, and use r to test
H0: ρ = 0

24
Test of hypothesis about ρ