Title: Regression
1Chapter 5
2Chapter outline
- The least-squares regression line
- Facts about least-squares regression
- Residuals
- Influential observations
- Cautions about correlation and regression
- Association does not imply causation
3Correlation and Regression
- The regression effect is shown by the slope of
the line.
- Correlation can be seen in the spread of points
around the regression line. The greater the
spread of points around the regression
line, the less predictive X is of Y and,
consequently, the weaker the correlation.
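The link between spread and correlation can be checked numerically. The sketch below is only an illustration and is not taken from the slides: the data are made up, and `pearson_r` is a helper name introduced here. Both data sets are scattered around the same line y = 2x; only the amount of scatter differs.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: points scattered around the line y = 2x.
xs = [1, 2, 3, 4, 5, 6]
wobble = [1, -1, 1, -1, 1, -1]  # alternating up/down offsets
tight = [2 * x + 0.1 * w for x, w in zip(xs, wobble)]  # small spread
loose = [2 * x + 3.0 * w for x, w in zip(xs, wobble)]  # large spread

r_tight = pearson_r(xs, tight)
r_loose = pearson_r(xs, loose)
print(f"small spread: r = {r_tight:.3f}")
print(f"large spread: r = {r_loose:.3f}")
```

With the small spread, r comes out very close to 1; with the large spread it is noticeably lower, even though the underlying trend is the same.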
4Correlation r = 1
6Imperfect Correlation and Relationships
- We rarely see perfect correlation.
- Even when the correlation is not perfect, we can draw a
line to summarize the trend in the data points.
This is the regression line.
7Regression Line
- Regression line: a straight line that describes
how a response variable y changes as an
explanatory variable x changes.
- It can sometimes be used to predict the value of
y for a given value of x.
8Making Predictions
9Where do we Draw the Line?
10Minimize the sum of the distances between the
points and the line
[Figure: distances of the points from the line: -0.25, 2, 2, -3.5, -0.25]
Square the Distances
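A short check shows why the distances are squared. The sketch below uses the five values listed on the slide, assuming they are signed vertical deviations of the points from the line (the figure itself is not available, so this reading is an assumption).

```python
# Values listed on the slide, assumed to be signed vertical deviations
# of the points from the line.
deviations = [-0.25, 2, 2, -3.5, -0.25]

plain_sum = sum(deviations)                     # positives and negatives cancel
squared_sum = sum(d ** 2 for d in deviations)   # every deviation counts

print(plain_sum)    # 0.0
print(squared_sum)  # 20.375
```

The plain sum is zero even though no point lies on the line, so it cannot measure fit. The sum of squares cannot cancel in this way.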
12The best-fitting line minimizes the sum of
the squared distances of every point in the
scatterplot from the regression line.
Minimize $\sum (y - \hat{y})^2$, where $\hat{y}$ is the value
the line predicts. This line, the best-fitting
line, is the one that, compared with any
other line you could plot through the points,
produces the smallest sum of squared deviations.
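The least-squares criterion can be tried out on a small example. The following is a minimal sketch, assuming made-up data and the usual closed-form solution for the slope and intercept. The function names and the candidate lines are hypothetical and chosen only for illustration.

```python
def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Intercept a and slope b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
best = sum_squared_deviations(a, b, xs, ys)

# A few other candidate lines (intercept, slope), chosen arbitrarily.
candidates = [(a + 0.5, b), (a, b + 0.2), (a - 0.3, b - 0.1), (1.0, 1.0)]
for cand_a, cand_b in candidates:
    other = sum_squared_deviations(cand_a, cand_b, xs, ys)
    print(f"a={cand_a:.2f}, b={cand_b:.2f}: {other:.3f} (best: {best:.3f})")
```

Every candidate line gives a larger sum of squared deviations than the least-squares line, which is what "best-fitting" means here.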
14- The slope b is the change in the predicted y when x increases
by 1.
- The intercept a is the predicted value of y when
x = 0.
15Finding the equation of the regression line
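The formulas for this slide did not survive in the transcript. A standard way to write the least-squares line, consistent with the slope b and intercept a defined on the previous slide, is sketched here. The symbols $\hat{y}$ (predicted value), $\bar{x}$, $\bar{y}$ (means) and $s_x$, $s_y$ (standard deviations) are notation assumed for this sketch, and r is the correlation.

```latex
\hat{y} = a + b x, \qquad
b = r \, \frac{s_y}{s_x}, \qquad
a = \bar{y} - b \, \bar{x}
```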
16Facts about the least-squares regression line
- Fact 1: It is a mathematical model for the data.
- Fact 2: The distinction between explanatory and
response variables is essential in regression.
- Fact 3: There is a close connection between
correlation and the slope of the least-squares line.
- Fact 4: The least-squares regression line always
passes through the point $(\bar{x}, \bar{y})$, where $\bar{x}$ is the
mean of the x values and $\bar{y}$ is the mean of the y
values.
- Fact 5: The correlation r describes the strength
of a straight-line relationship. In the
regression setting, this description takes a
specific form: the square of the correlation, $r^2$,
is the fraction of the variation in the values of
y that is explained by the least-squares
regression of y on x.
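Facts 3, 4 and 5 can be verified numerically. The sketch below assumes made-up data and the standard relation b = r * s_y / s_x for Fact 3, which the transcript does not spell out. Variable names are chosen here for illustration.

```python
from statistics import mean, stdev

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Correlation r: sum of products of standardized values, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

# Fact 3: the slope is tied to r through b = r * s_y / s_x.
b = r * s_y / s_x
a = y_bar - b * x_bar

# Fact 4: the line passes through (x_bar, y_bar).
predicted_at_mean = a + b * x_bar

# Fact 5: r^2 is the fraction of the variation in y explained by the line.
total_variation = sum((y - y_bar) ** 2 for y in ys)
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
fraction_explained = 1 - unexplained / total_variation

print(f"r = {r:.4f}, b = {b:.4f}, a = {a:.4f}")
print(f"line at x_bar: {predicted_at_mean:.4f}, y_bar: {y_bar:.4f}")
print(f"r^2 = {r**2:.4f}, fraction explained = {fraction_explained:.4f}")
```

For these data the last line prints the same value twice, which is Fact 5 in action.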
18Residual plots
- A residual plot is a scatterplot of the
regression residuals against the explanatory
variable. Residual plots help us assess the fit
of a regression line.
- A residual is the difference between an observed
value of the response variable and the value predicted
by the regression line. That is,
- Residual = observed y - predicted y
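To make the definition concrete, the sketch below computes the residuals for a small made-up data set and pairs each one with its x value, which is the information a residual plot displays. The helper `least_squares_line` is a name introduced here, not something from the slides.

```python
def least_squares_line(xs, ys):
    """Intercept a and slope b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)

# Residual = observed y - predicted y, one per data point.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# The (x, residual) pairs are what a residual plot shows.
for x, res in zip(xs, residuals):
    print(f"x = {x}: residual = {res:+.2f}")

print(f"sum of residuals = {sum(residuals):.10f}")
```

The residuals of a least-squares line sum to zero (up to rounding), so a residual plot is centred on the horizontal line at zero. A curved or fanning pattern around that line suggests that a straight line fits poorly.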
22Outliers and Influential Observations
- An outlier is an observation that lies outside
the overall pattern of the other observations.
- An observation is influential for a statistical
calculation if removing it would markedly change
the result of the calculation.
- Points that are outliers in the x direction of a
scatterplot are often influential for the
least-squares regression line. Influential
observations can also be described as outliers.
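The effect of a point that is an outlier in the x direction can be seen by fitting the line with and without it. The sketch below assumes made-up data, and `least_squares_slope` is a hypothetical helper name.

```python
def least_squares_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up data: five points with a rising trend, plus one point
# far from the others in the x direction.
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
outlier = (15, 3)

def slope_of(pts):
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return least_squares_slope(xs, ys)

slope_without = slope_of(points)
slope_with = slope_of(points + [outlier])

print(f"slope without the outlier: {slope_without:.3f}")
print(f"slope with the outlier:    {slope_with:.3f}")
```

Removing the single far-out point changes the slope from slightly negative to clearly positive, so by the definition above that point is influential.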
25Beware extrapolation
- Extrapolation is the use of a regression line for
prediction far outside the range of values of the
explanatory variable x that you used to obtain
the line. Such predictions are often not
accurate.
- Example:
- Suppose Angela was 1.20m tall on January 1st
1975, and 1.40m tall on January 1st 1976. By
extrapolation, estimate her height on January 1st
1977.
- By extrapolation, it could be estimated that by
January 1st 1977 she would have grown another
0.20m to be 1.60m tall. This, however, assumes that
she continued to grow at the same rate. This must
eventually become a false assumption; otherwise,
by January 1st 1980 she would be 2.20m tall, a giantess.
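The arithmetic of the Angela example can be written out as a straight line through the two observations. This is a minimal sketch; the function name `extrapolated_height` and the extra year 1990 are additions for illustration.

```python
# Two observations from the example: (year, height in metres).
year1, height1 = 1975, 1.20
year2, height2 = 1976, 1.40

# Growth rate implied by the two observations, in metres per year.
growth_per_year = (height2 - height1) / (year2 - year1)

def extrapolated_height(year):
    """Height predicted by extending the straight line through both points."""
    return height1 + growth_per_year * (year - year1)

for year in (1977, 1980, 1990):
    print(f"{year}: {extrapolated_height(year):.2f} m")
```

The line gives 1.60 m for 1977, which is plausible, but 2.20 m for 1980 and 4.20 m for 1990, which is not. The further the prediction moves from the observed range, the less the constant-growth assumption can be trusted.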
26Lurking variable
- A lurking variable is a variable that has an
important effect on the relationship among the
variables in a study but is not included among the
variables studied.
- Example: Studies of the relationship between
treatment of heart disease and the patient's
gender show that women are in general treated
less aggressively than men with similar symptoms.
Women are less likely to undergo bypass
operations.
- Question: Might this be discrimination?
- Answer: Not necessarily. Be aware of the lurking variable, age. Although
half of heart disease victims are women, they are
on average much older than male victims, so age
rather than gender may account for the difference in treatment.
27Association does not imply causation
Example: Sales of rum and the number of Methodist
ministers are positively correlated, but a large
number of ministers does not encourage rum
drinking. Is there a lurking variable that
influences both rum sales and Methodist ministers?
In this example, both the sales of rum
and the number of Methodist ministers were
correlated with the number of people in the U.S.
As the number of people increases, demand increases
for both Methodist ministers
and rum.
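A small simulation shows how a common cause can produce a strong correlation between two variables that do not affect each other. Everything in the sketch below is made up for illustration: the numbers are not real data on population, ministers or rum, and removing the straight-line trend on population is one simple way to adjust for it, chosen here and not taken from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def remove_trend(xs, ys):
    """Residuals of ys after subtracting its least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Entirely made-up numbers in arbitrary units. Both series are built
# from population plus their own unrelated ups and downs.
population = [10, 20, 30, 40, 50, 60, 70, 80]
bumps_ministers = [1, -1, -1, 1, 1, -1, -1, 1]
bumps_rum = [1, 1, -1, -1, -1, -1, 1, 1]
ministers = [2 * p + 5 * e for p, e in zip(population, bumps_ministers)]
rum_sales = [3 * p + 8 * e for p, e in zip(population, bumps_rum)]

r_raw = pearson_r(ministers, rum_sales)
r_adjusted = pearson_r(remove_trend(population, ministers),
                       remove_trend(population, rum_sales))

print(f"ministers vs rum sales:          r = {r_raw:.3f}")
print(f"after removing population trend: r = {r_adjusted:.3f}")
```

The raw correlation is close to 1 because both series rise with population. Once the population trend is removed from each series, nothing is left to link them and the correlation drops to zero.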