Title: Chapters 8 & 9
1Chapters 8 & 9
- Linear Regression & Regression Wisdom
2Price of Homes Based on Size (in Square Feet)
Sold in Ames between Sep. 2004 and Oct. 2005
r = 0.8718945
3Statistical Modeling
- Statistical Model: An equation that fits the pattern between a response variable and possible explanatory variables, accounting for deviations from the model. (Simplest case: one quantitative response variable and one quantitative explanatory variable.)
- Response Variable (Y): The quantitative outcome of a study.
- Explanatory Variable (X): A quantitative variable that may explain or predict the response variable.
- What is the best model for predicting weight (Y) from height (X)?
- What is the best model for predicting blood pressure (Y) from age (X)?
4Correlation and the Line
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT, r = 0.8718945
5Regression line
- Explains how the response variable (y) changes in relation to the explanatory variable (x)
- Use the line to predict the value of y for a given value of x
6Regression line
- Need a mathematical formula
- We want to predict y from x
- The predicted values are called ŷ (y-hat).
- The observed values are called y.
7Which Line is Best?
- What are some ways we can determine which model out of all the possible models is the best one?
- What are some ways that we can numerically rank the different models (i.e., the different lines)?
8Which Model is Best?
Price = -90.2458 + 0.1598 SQFT (red)
Price = -300 + 0.3 SQFT (blue)
Price = 0 + 0.1 SQFT (green)
9Regression line
- Putting a hat on it means we have predicted something from the model
- Look at the vertical distance between each observed point and the line
- This is the amount of error in the regression line
- The goal is to find the line so that these errors are minimized.
10Least squares regression
- Most commonly used regression line
- Makes the sum of the squared errors as small as possible
- Based on the statistics x̄, s_x, ȳ, s_y, and r
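To make "as small as possible" concrete, here is a minimal Python sketch. The data points are made up purely for illustration (they are not the Ames home data), and the helper names `sse` and `least_squares` are hypothetical. It fits the least squares line with the usual closed-form expressions and then checks that nudging the slope or intercept makes the sum of squared errors larger.

```python
from statistics import mean

def sse(b0, b1, xs, ys):
    """Sum of squared vertical errors for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Closed-form least squares intercept and slope."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, invented for this illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(xs, ys)
best = sse(b0, b1, xs, ys)

# Nudging the intercept or slope in either direction increases the SSE.
for db0, db1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    print(db0, db1, sse(b0 + db0, b1 + db1, xs, ys) > best)
```

Any other candidate line can be ranked the same way: the smaller its sum of squared errors, the better it fits by this criterion.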
11Regression line equation
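In the b0 and b1 notation of the following slides, the least squares line presumably takes the standard form below, with the slope and intercept computed from the five summary statistics (x̄, s_x, ȳ, s_y, and r):

```latex
\hat{y} = b_0 + b_1 x, \qquad
b_1 = r \, \frac{s_y}{s_x}, \qquad
b_0 = \bar{y} - b_1 \bar{x}
```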
12Regression line equation
- b1 = slope of the line. For every unit increase in x, the predicted y changes by the amount of the slope.
- Interpreting b1 (slope)
- For every one unit increase in the explanatory variable, there will be, on average, a b1 unit(s) increase/decrease in the response variable.
- For example: For every one square foot increase in size, on average, there will be a $159.80 increase in home price.
- MEMORIZE THIS!!!!!
13Regression line equation
- b0 = y-intercept of the line. The value of y when x = 0.
- Interpreting b0 (y-intercept)
- When the explanatory variable = 0, on average, the value of the response variable = b0.
- For example: When the sq. ft. of a home is 0, the price of the home will be -$90,245.80 on average.
- MEMORIZE THIS!!!!!
- BE CAREFUL. The interpretation of the intercept does not always make sense. When interpreting, be sure to mention if the interpretation does not make sense.
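The two interpretations can be checked numerically. This is a small sketch using the home-price line Price = -90.2458 + 0.1598 SQFT; it assumes Price is measured in thousands of dollars (which is what the $159.80 per square foot figure implies), and the function name `predicted_price` is made up for the example.

```python
def predicted_price(sqft):
    """Predicted home price in thousands of dollars (assumed units)."""
    b0 = -90.2458   # intercept of the fitted line
    b1 = 0.1598     # slope of the fitted line
    return b0 + b1 * sqft

# Slope: the change in predicted price for one extra square foot.
change = predicted_price(1501) - predicted_price(1500)
print(f"One more sq. ft. changes the prediction by ${change * 1000:.2f}")

# Intercept: the prediction at 0 sq. ft., which is not a meaningful home.
print(f"Predicted price at 0 sq. ft.: ${predicted_price(0) * 1000:,.2f}")
```

The choice of 1500 sq. ft. as the starting size is arbitrary; on a straight line the change per extra square foot is the same everywhere.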
14Example: Kobe's Shooting
- I visited CNNSI's website and checked out some of Kobe Bryant's personal scoring numbers. I looked at the number of times he shot the ball and his point total for each game so far this year.
- Let's come up with the regression equation for this data.
15Kobe's Shooting
r = 0.7293762
Form: Linear. Strength: Moderate to Strong. Direction: Positive.
16Calculating the regression line
- Remember that b1 = r(s_y / s_x) and b0 = ȳ - b1 x̄
- Our explanatory variable (x) is the number of shots
- Our response variable (y) is the number of points
- So the five numbers needed are x̄, s_x, ȳ, s_y, and r
17Calculating the Regression Line
- Find the Slope
- Find the Intercept
18Calculating the regression line.
- Don't forget to write the equation.
- DON'T FORGET TO WRITE THE EQUATION IN THE CONTEXT OF THE PROBLEM: predicted points = 3.436 + 1.19 × (number of shots).
19Interpretation
- How would we interpret b1?
- For each additional shot Kobe Bryant takes, on average we would expect him to score 1.19 more points.
- How would we interpret b0?
- If Kobe Bryant did not take any shots, then on average we would expect him to score 3.436 points.
20Prediction
- Use the regression equation to predict y from x.
- Ex. What is the predicted number of points when Kobe shoots 30 times in a game?
- Ex. What is the predicted number of points when Kobe shoots 55 times in a game?
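Both predictions come from plugging x into the fitted line. The sketch below uses the slope 1.19 and intercept 3.436 quoted in the interpretation slide; the function name `predicted_points` is only illustrative.

```python
def predicted_points(shots):
    """Predicted points in a game: y-hat = 3.436 + 1.19 * shots."""
    return 3.436 + 1.19 * shots

for shots in (30, 55):
    print(shots, "shots ->", round(predicted_points(shots), 3), "points")
```

Rounded to whole points, these are the two values used when plotting the line by hand.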
21Plotting the regression line
- Find two points on the line
- Ex. x = 30, ŷ ≈ 39 and x = 55, ŷ ≈ 69
- If you are plotting by hand it is ok to round values
- Plot these two points on the graph
- Connect the points
- This is the regression line
22Plotting the Regression Line
23Properties of regression line
- r is related to the value of b1
- r has the same sign as b1
- A change of one standard deviation in x corresponds to a change of r standard deviations in the predicted y
- The regression line always goes through the point (x̄, ȳ)
24Properties of regression line
- r²
- Percent of variation in y that is explained by the least squares regression of y on x
- The higher the value of r², the more the regression line explains the changes that occur in the y variable
- The higher the value of r², the better the regression line fits the data
25Properties of regression line
- r²
- 0 ≤ r² ≤ 1 since -1 ≤ r ≤ 1
- Interpretation of r²
- r² is the percent of variation in the response variable that can be explained by the least squares regression of the response variable on the explanatory variable.
- For Kobe's example: about 53% of the variability in the number of points Kobe Bryant scores in a game can be explained by the LS regression of points per game on number of shots per game.
- MEMORIZE THIS!!!!
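Since r² is just the correlation squared, it can be recomputed from the r values reported earlier. A minimal sketch follows (the function name `percent_explained` is hypothetical); small differences from a figure quoted on a slide can arise if r was rounded before squaring.

```python
def percent_explained(r):
    """r squared, expressed as a percent of variation explained."""
    return 100 * r ** 2

# Correlations reported earlier in the slides.
print("Kobe, points vs. shots:", round(percent_explained(0.7293762), 1))
print("Homes, price vs. sq. ft.:", round(percent_explained(0.8718945), 1))
```

Because r is squared, a negative correlation gives the same r² as a positive one of the same size.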
26Residuals
- Amount of variation in y not taken into account by the regression line
- Formula: residual = observed - predicted = y - ŷ
- There is a residual for each data point
- Mean of the residuals is zero
27Calculating Residuals: Kobe
- Find the residual for the point (46,81)
- First find the predicted number of points for a game with 46 shots
- Now find the residual
28Calculating Residuals: Kobe
- Find the residual for the point (26,35)
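Both residuals follow the same two steps: predict, then subtract. The sketch below assumes the fitted line predicted points = 3.436 + 1.19 × shots from the interpretation slide; the function names are illustrative only.

```python
def predicted_points(shots):
    """Predicted points: y-hat = 3.436 + 1.19 * shots."""
    return 3.436 + 1.19 * shots

def residual(shots, points):
    """Observed minus predicted: y - y-hat."""
    return points - predicted_points(shots)

for shots, points in [(46, 81), (26, 35)]:
    print((shots, points),
          "predicted:", round(predicted_points(shots), 3),
          "residual:", round(residual(shots, points), 3))
```

A positive residual means the point lies above the line (more points than predicted); a residual near zero means the point lies almost on the line.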
29Residual Plots
- Scatterplot of Residuals
- Explanatory variable on horizontal axis
- Residuals on vertical axis
- Horizontal line at residual = 0
30Residual Plots
31Interpreting Residual Plots
- Is there a curved pattern?
- This could mean a non-linear relationship
- Is there increasing spread about the line as x increases?
- This could mean non-constant variance
- Is there decreasing spread about the line as x increases?
- This could mean non-constant variance
32Interpreting Residual Plots
- Points with large residuals
- These are probably outliers in the y direction
- These will pull the regression line in the direction of the outlier (up or down)
- Extreme points in the x direction
- These are called influential points
- They do not always show up in residuals because the residual could be small
- Removing the outlier could markedly change the regression line
33Reading JMP Data
- Bivariate Fit of BAC by # of Beers
34Reading JMP Data
- Linear Fit
- BAC = -0.011654 + 0.0180112 × # of Beers
This is the regression line for the data. Slope is 0.0180112. y-Intercept is -0.011654. The response variable is the BAC. The explanatory variable is the # of Beers.
35Reading JMP Data
- Summary of Fit
- RSquare 0.803536
- RSquare Adj 0.788424
- Root Mean Square Error 0.020920
- Mean of Response 0.076000
- Observations (or Sum Wgts) 15
This gives some summary of the data.
RSquare = r² = (r)² = (correlation)²
Root Mean Square Error = s
Mean of Response = ȳ
Observations = n
36Reading JMP Data
- Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio
  Model       1   0.02327041       0.023270      53.1700
  Error      13   0.00568959       0.000438      Prob > F
  C. Total   14   0.02896000
This is called the ANOVA Table. This is another way to analyze the data. We aren't going to discuss this in this class.
37Reading JMP Data
- Parameter Estimates
  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   -0.011654    0.013179    -0.88     0.3926
  beers       0.0180112    0.002470    7.29
This tells you what the y-intercept and slope are. It also gives the standard error for each of the estimates. If you were to form confidence intervals for the parameter estimates, you would need these values. We won't discuss that in this class.
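For readers who want to see how the pieces of the JMP output fit together, here is a short sketch that recomputes a few of the reported values from the others. The relationships used (RSquare as model sum of squares over total, root mean square error as the square root of the error mean square, t Ratio as estimate over standard error) are the standard ones and are stated here as assumptions about how JMP arrives at its numbers; the variable and function names are illustrative.

```python
from math import sqrt

# Values copied from the JMP output in the slides.
ss_model, ss_error, ss_total = 0.02327041, 0.00568959, 0.02896000
df_error = 13
slope, slope_se = 0.0180112, 0.002470
intercept, intercept_se = -0.011654, 0.013179

r_square = ss_model / ss_total        # compare with RSquare
rmse = sqrt(ss_error / df_error)      # compare with Root Mean Square Error
r = sqrt(r_square)                    # positive, because the slope is positive
t_slope = slope / slope_se            # compare with the t Ratio column
t_intercept = intercept / intercept_se

def predicted_bac(beers):
    """Predicted BAC from the fitted line for a given number of beers."""
    return intercept + slope * beers

print("RSquare:", round(r_square, 6), " s:", round(rmse, 6), " r:", round(r, 4))
print("t ratios:", round(t_slope, 2), round(t_intercept, 2))
print("Predicted BAC after 5 beers:", round(predicted_bac(5), 4))
```

The 5-beer prediction is only an example input, chosen here for illustration.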
38Reading JMP Data
Here is your residual plot. Check it to see if
there are any problems with linearity of the data
and constant variance.
39Example
40Example
- Age at first word vs. Gesell score.
- Scatterplot: Weak negative linear relationship between two variables. Possible outliers at (42,57) and (17,121).
- Regression: r = -0.64, r² = 40.96%.
41Example
42Example
- Age at first word vs. Gesell score.
- Prediction
- When x = 17
- When x = 42
- Residuals
- point (17,121)
- point (42,57)
43Example
44Example
- Residual Plot
- Outliers at x = 17 and x = 42
- Small residual for x = 42
- Could be influential
- Remove (42,57) from data.
- Regression line changes markedly.
- r = -0.33, r² = 10.89%.
45Example
46Outliers--What should you do?
- Make sure data points have been recorded correctly
- Collect more data
- Remove the outlier
- Examine collection techniques
- Examine outside influences
47Cautions about regression
- Linear relationship only
- Not resistant
- Using averaged data
- Makes relationship appear stronger
- Taking average removes variation
- Extrapolation
- Predicting y when the x value is outside the range of the original data
48Cautions about Regression
- Extrapolation
- Remember the data about home prices vs. the amount of sq. footage in the home.
- The regression line was fit to data collected from homes with 900 to 3,000 sq. ft.
- Its intercept would mean that if my home has no square footage, then I pay -$75,470.
- If you must extrapolate, at least don't expect that your prediction will come true.
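One simple safeguard is to flag any prediction whose x value falls outside the range the data came from. The sketch below is illustrative only: the function name `predict_price` is made up, the 900 to 3,000 sq. ft. range is the one quoted on this slide, and the coefficients are borrowed from the earlier home-price line (Price = -90.2458 + 0.1598 SQFT, assumed to be in thousands of dollars), which may not be the exact line this slide refers to.

```python
def predict_price(sqft, lo=900, hi=3000):
    """Return (predicted price, is_extrapolation) for a home of the given size.

    Coefficients come from the earlier home-price line (assumed to be in
    thousands of dollars); lo and hi bound the sizes the data covered.
    """
    price = -90.2458 + 0.1598 * sqft
    return price, not (lo <= sqft <= hi)

for sqft in (0, 1500, 5000):
    price, extrapolated = predict_price(sqft)
    note = "extrapolation, treat with caution" if extrapolated else "within data range"
    print(sqft, round(price, 1), note)
```

The line still returns a number for any input; the flag is a reminder that the number is not backed by data outside the observed range.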
49Cautions about regression
- ASSOCIATION IS NOT CAUSATION!
- Strong association between explanatory and
response variables does not mean that the
explanatory variable causes the response
variable.
50Proving Causation
- Experiment
- Change the values of x and control for lurking variables.
- Not all problems can be solved by experiment
- Smoking causes lung cancer.
- Living near power lines causes leukemia.
51Proving Causation
- Lurking variable
- Important effect on variables, but not included in study.
- Example
- Do taller people make more money? What do you think a lurking variable might be?
52Proving Causation
- Proving smoking causes lung cancer
- Association is strong
- Association is consistent
- High doses are associated with stronger response
- Cause precedes the effect in time
- Cause is plausible
53Review
Number of Calories By Sugar Content (g) for 13
Cereals
Let's calculate the formula for this regression line
54Review
- Let's review all the formulas we need
55Review
- Here are all the numbers you need
56Review
- First, calculate sx and sy
57Review
- Second, calculate r
- Third, calculate b1
58Review
- Fourth, calculate x̄ and ȳ
- Fifth, calculate the intercept b0 (we're almost done!!)
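The review steps can be sketched in Python as follows. The data are invented for illustration and are NOT the 13 cereals from the slides; the function name `fit_from_steps` is hypothetical. One practical difference from the slide order: the means are computed first, because the standard deviations and the correlation both need them.

```python
from statistics import mean, stdev

def fit_from_steps(xs, ys):
    """Follow the review steps: s_x and s_y, then r, b1, the means, and b0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)      # means (used by every other step)
    s_x, s_y = stdev(xs), stdev(ys)        # first: sample standard deviations
    # second: correlation as the sum of products of standardized values
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x                     # third: slope
    b0 = y_bar - b1 * x_bar                # fifth: intercept
    return {"s_x": s_x, "s_y": s_y, "r": r, "b1": b1, "b0": b0,
            "x_bar": x_bar, "y_bar": y_bar}

# Hypothetical (sugar in g, calories) pairs, invented for illustration only.
sugar = [1, 3, 5, 6, 9, 12]
calories = [90, 100, 105, 110, 120, 130]

fit = fit_from_steps(sugar, calories)
print(f"predicted calories = {fit['b0']:.2f} + {fit['b1']:.2f} * sugar")
print(f"r^2 = {fit['r'] ** 2:.3f}")
```

Because b0 is defined as ȳ - b1 x̄, the fitted line automatically passes through the point (x̄, ȳ).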
59Review
- Last, but definitely the most important, WRITE
DOWN THE EQUATION IN THE CONTEXT OF THE PROBLEM
60Review
- Interpret b1
- For every one gram increase in sugar, the number of calories will increase, on average, by 3.36.
- Interpret r²
- About 55% of the variability in the number of calories in cereal can be explained by the LS regression of calories on sugar content.