Title: Ch8: Linear Regression Fat Versus Protein: An Example
1 Ch8: Linear Regression. Fat Versus Protein: An Example
- We now build on Ch 7's scatter-plots.
- Would you say there is a strong association?
- Does that association look linear?
Content of Fat and Protein for Food at Burger King
Data taken from 30 items on the Burger King menu
2 The Linear Model
- A large positive or negative correlation indicates that there seems to be a linear association between these two variables, but it doesn't tell us exactly what that association is.
- We can say more about the linear relationship between two quantitative variables with a model.
- A model simplifies reality to help us understand underlying patterns and relationships.
- A linear model is just an equation of a straight line through the data.
- The points in the scatter-plot don't all line up, but a straight line can summarize the general pattern.
- The linear model can help us understand how the values are associated.
3 Practice Fitting Lines
- We can eyeball what fit looks best. Any votes?
- Luckily for us, there's an algorithm to determine the best-fit model.
4 The Straight Line and the Residuals
- With real, non-trivial data, the model will never be perfect, regardless of the line we draw.
- Some points will be above the line and some will be below.
- The estimate made from a model is the predicted value (denoted as $\hat{y}$, read "y-hat").
5 Residuals (cont.)
- The difference between the observed value and its associated predicted value is called the residual.
- To find the residuals, we always subtract the predicted value from the observed one.
- A negative residual means the predicted value is too big (an overestimate).
- A positive residual means the predicted value is too small (an underestimate).
6 Best Fit Means Least Squares
- Some residuals are positive, others are negative, and, on average, they cancel each other out.
- So, we can't assess how well the line fits by adding up all the residuals.
- Similar to what we did with the standard deviation, we square the residuals and add the squares.
- The smaller the sum, the better the fit.
- The line of best fit is the line for which the sum of the squared residuals is smallest.
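As a rough illustration of the least squares idea, the sketch below compares an eyeballed line with the least squares line on a tiny dataset. The four data points and the eyeballed line are made up for this example (they are not the Burger King data), and the function names are illustrative only.

```python
# Hypothetical (x, y) pairs made up for illustration; NOT the Burger King data.
points = [(10, 15), (20, 30), (30, 33), (40, 50)]

def sum_squared_residuals(intercept, slope, data):
    """Add up (observed - predicted)^2 over all points."""
    total = 0.0
    for x, y in data:
        predicted = intercept + slope * x
        residual = y - predicted  # observed minus predicted
        total += residual ** 2
    return total

def least_squares_line(data):
    """Return (intercept, slope) of the line minimizing the squared residuals."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

b0, b1 = least_squares_line(points)
best = sum_squared_residuals(b0, b1, points)
# An arbitrary "eyeballed" guess: intercept 0, slope 1.3 (an assumption).
eyeballed = sum_squared_residuals(0.0, 1.3, points)

print("least squares line:", b0, b1, "sum of squares:", best)
print("eyeballed line sum of squares:", eyeballed)
```

Any other line tried in place of the eyeballed one should give a sum of squares at least as large as the least squares line's.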
7 The Least Squares Line Parameters
- In our model, the first parameter, b1, is the slope.
- The slope is built from the correlation and the standard deviations (s, not σ, as we use standard deviations of the sample). It is always in units of y per unit of x.
- The model's second parameter, b0, is the intercept.
- The intercept is built from the mean and the slope, and is always in units of y.
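The slide's formulas did not survive in this transcript. The standard least squares expressions that match the description above (slope from the correlation and the sample standard deviations, intercept from the means and the slope) are given below; the subscripted symbols are the usual textbook notation and are assumed here rather than taken from the slide.

```latex
b_1 = r \, \frac{s_y}{s_x}, \qquad
b_0 = \bar{y} - b_1 \bar{x}, \qquad
\hat{y} = b_0 + b_1 x
```

Because $s_y$ is in units of y and $s_x$ is in units of x, the ratio gives the slope its units of y per unit of x.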
8 How to Find These in Reality, via Excel
- Realistically, you will always use a computer to build such a model.
- The following slides walk through 2 different ways you can use Excel to determine regressions, using the Burger King dataset. (See also the class handout.)
- Note: Since regression and correlation are closely related, we need to check the same conditions for regressions as we did for correlations:
  - Quantitative Variables Condition
  - Straight Enough Condition
  - Outlier Condition
9 Method 1) Use Excel's Chart Wizard
- Create a scatter-plot (as per Chapter 7).
- Select Add Trendline from the Chart menu (which appears once you have a chart).
- Select Linear.
- For options, pick both Display Equation on Chart and Display R-Squared Value on Chart.
  - R-Squared is just the square of the correlation, r.
- Advantages to this method: it's generally the easier one, and we don't need the Analysis ToolPak add-in.
- Disadvantages: we don't automatically get the plot of the residuals or many other regression statistics.
- Note: The sheet in Ch08-excel-answers.xls for Problem 42 walks through generating a residuals plot.
10 Method 2) Use Excel's Regression Wizard
- Going to Tools > Data Analysis > Regression Analysis yields a pop-up wizard.
- Enter the range for what you are trying to predict, the response variable.
- Enter the input range, picking the column that will serve as X.
- Check Labels if you included the data headers.
- You have the option of where to put the voluminous output. Using a separate sheet in the same workbook is the least confusing for many people.
- For the options, check Residuals, Line Fit Plot, and Residuals Plot. Don't bother with standardized residuals.
- If you don't get the Data Analysis menu option in Tools, go to the Add-in option.
11 Method 2) Regression in Excel: Outputs
- Coefficients (the intercept is listed above the slope)
- Multiple R (this is effectively the absolute value of the correlation)
  - Excel does not give you the correlation sign. You should check if you have a negative correlation.
- R-Square (this is r²) is the percentage of variance in Y that is explainable by X.
- Graphs
  - You may also want to adjust your graphs so they are the right size, don't show a lot of dead space, and look better.
  - I would recommend formatting the regression line in the Line Fit Plot to actually BE a line.
- Advantages to this method
  - You also get to see the residuals.
  - You get other useful information, such as confidence intervals, even if we don't use this information right now.
- Disadvantages
  - Not as simple to use. (Much easier to make mistakes.)
  - You may not have Data Analysis installed.
  - You may misinterpret the correlation.
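To see why the sign warning matters, here is a minimal Python sketch that computes the same headline quantities by hand. It is not Excel's implementation; the data are made up for illustration, and the dictionary keys simply mirror the labels used above.

```python
import math

# Hypothetical data with a negative association, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 4, 1]

def summarize(xs, ys):
    """Return intercept, slope, r, |r| ("Multiple R"), and R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    syy = sum((yi - mean_y) ** 2 for yi in ys)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {
        "intercept": intercept,
        "slope": slope,
        "r": r,
        "multiple_r": abs(r),  # the sign of r is lost here
        "r_squared": r ** 2,
    }

result = summarize(x, y)
for name, value in result.items():
    print(name, round(value, 4))
```

The slope and r are both negative, while "multiple_r" is positive. One way to recover the sign of the correlation from output like this is to look at the sign of the slope.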
12 Fat Versus Protein: An Example
- The regression line for the Burger King data fits the data well.
- The equation turns out to be $\widehat{fat} = 6.8 + 0.97 \, protein$.
- The predicted fat content for a BK Broiler chicken sandwich (with 30 g of protein) is 6.8 + 0.97(30) = 35.9 grams of fat.
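The arithmetic on this slide implies a fitted line with intercept 6.8 and slope 0.97 (grams of fat per gram of protein). A small sketch of using that line for prediction follows. The function names and the "observed" fat value are hypothetical, added only to show how a residual would be computed.

```python
INTERCEPT = 6.8   # grams of fat, from the slide's arithmetic
SLOPE = 0.97      # grams of fat per gram of protein, from the slide's arithmetic

def predict_fat(protein_grams):
    """Predicted fat (grams) for an item with the given protein (grams)."""
    return INTERCEPT + SLOPE * protein_grams

def residual(observed_fat, protein_grams):
    """Observed minus predicted; positive means the model underestimated."""
    return observed_fat - predict_fat(protein_grams)

predicted = predict_fat(30)
print("predicted fat:", round(predicted, 1))

# Hypothetical observed value, NOT from the slides, just to show the sign rule.
example_residual = residual(40.0, 30)
print("residual for a hypothetical 40 g item:", round(example_residual, 1))
```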
13 Correlation and the Line
- Moving one standard deviation away from the mean in x moves us r standard deviations away from the mean in y.
- This relationship is shown in a scatter-plot of z-scores for fat and protein.
- Put generally, moving any number of standard deviations away from the mean in x moves us r times that number of standard deviations away from the mean in y.
14 How Big Can Predicted Values Get?
- r can never be bigger than 1 (in absolute value), so each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was.
- This property of the linear model is called regression to the mean; the line is called the regression line.
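In z-score form, the two slides above can be summarized in one line. The notation below (hats for predicted values, z for standardized values) is the usual textbook convention and is assumed here.

```latex
\hat{z}_y = r \, z_x,
\qquad z_x = \frac{x - \bar{x}}{s_x},
\qquad \hat{z}_y = \frac{\hat{y} - \bar{y}}{s_y},
\qquad |r| \le 1 \;\Rightarrow\; |\hat{z}_y| \le |z_x|
```

For example, with the correlation of 0.83 quoted for the Burger King data, an item 2 standard deviations above the mean in protein is predicted to be 0.83 × 2 = 1.66 standard deviations above the mean in fat.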
15 Residuals Revisited
- The linear model assumes that the relationship between the two variables is a perfect straight line. The residuals are the part of the data that hasn't been modeled.
- Data = Model + Residual
- or (equivalently)
- Residual = Data - Model
- Or, in symbols: $e = y - \hat{y}$
16 Residuals Revisited (cont.)
- Residuals help us to see whether the model makes sense.
- When a regression model is appropriate, nothing interesting should be left behind.
- After we fit a regression model, we usually plot the residuals in the hope of finding... nothing.
  - No curves or lines
  - No increasing or decreasing variation as we move along the x-axis
17 Residuals Revisited (cont.)
- The residuals for the BK menu regression look appropriately boring: there are no obvious patterns.
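For contrast, here is a small sketch of residuals that are not boring. It fits a least squares line to deliberately curved, made-up data (y = x squared) and prints the residuals, which show a clear U-shaped pattern. The data and names are illustrative assumptions, not part of the BK example.

```python
# Hypothetical curved data (y = x^2), made up to show a residual pattern.
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed - predicted, listed in order of increasing x.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print("residuals:", residuals)
```

The residuals are positive at both ends and negative in the middle, which is the kind of curve a residual plot is meant to reveal.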
18 R²: The Variation That Is Accounted For
- The variation in the residuals is the key to assessing how well the model fits.
- In the BK menu items example, total fat has a standard deviation of 16.4 grams. The standard deviation of the residuals from our model's prediction of fat is 9.2 grams.
- Which shows more variation?
Variation in Fat in BK Items, and in the Model's Residuals
19 R²: The Variation Accounted For (cont.)
- If the correlation were 1.0 and the model predicted the fat values perfectly, the residuals would all be zero and have no variation.
- As it is, the correlation is 0.83, not perfection.
- However, we did see that the model residuals had less variation than total fat alone.
- We can determine how much of the variation is accounted for by the model and how much is left in the residuals.
20 R²: The Variation Accounted For (cont.)
- The squared correlation, R² (pronounced "R-squared"), gives the fraction of the data's variance accounted for by the model.
- Thus, 1 - R² is the fraction of the original variance left in the residuals.
- An R² of 0 means that none of the variance in the data is in the model; all of it is still in the residuals.
- When interpreting a regression model, you need to tell what R² means.
- For the BK model, R² = 0.83² = 0.69,
  - so 69% of the variation in total fat is accounted for by the model,
  - and 31% of the variability in total fat has been left in the residuals.
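The numbers quoted for the BK example can be cross-checked in a few lines. The sketch below uses only figures stated in these slides (r = 0.83, standard deviations of 16.4 g and 9.2 g); the variable names are illustrative, and the second calculation is only approximate because the slides do not say exactly how the residual standard deviation was computed.

```python
r = 0.83                  # correlation quoted for the BK data
sd_total_fat = 16.4       # grams, standard deviation of total fat
sd_residuals = 9.2        # grams, standard deviation of the residuals

r_squared = r ** 2
left_in_residuals = 1 - r_squared

# Rough cross-check: fraction of variance NOT left in the residuals.
approx_r_squared = 1 - (sd_residuals / sd_total_fat) ** 2

print("R-squared from r:", round(r_squared, 2))
print("fraction left in residuals:", round(left_in_residuals, 2))
print("R-squared from the two standard deviations:", round(approx_r_squared, 2))
```

Both routes give roughly 0.69, which is why the residuals (9.2 g) vary noticeably less than total fat itself (16.4 g).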
21 How Big Should R² Be?
- R² is always between 0% and 100%.
- What makes a good R² value depends on the kind of data you are analyzing and on what you want to do with the results.
- Unless you are modeling trivial, already known relationships (e.g., weight in kg to weight in lbs) or are using very textbook data, R² will never be 100% and will rarely be above 90%.
22 Interpreting Model Results
- A regression model can always explain some variation: R² is rarely 0. But that doesn't mean it makes sense.
- The Random-r sheet in Ch08-excel-answers.xls shows a random number stream used to explain another random number stream. Note the non-zero values for b1 and R².
- Does this model provide a real explanation of the variation?
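The same experiment can be sketched outside Excel. The code below is not the Random-r sheet itself; it is an assumed stand-in that regresses one stream of random numbers on another, independently generated stream. The seed and sample size are arbitrary choices.

```python
import random

random.seed(8)  # arbitrary seed so the run is repeatable
n = 30
x = [random.random() for _ in range(n)]
y = [random.random() for _ in range(n)]  # generated independently of x

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
r_squared = sxy ** 2 / (sxx * syy)

print("slope b1:", slope)
print("R-squared:", r_squared)
```

Neither value comes out exactly zero, even though there is no real relationship to explain. Changing the seed changes the numbers but not that conclusion.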
23 Assumptions and Conditions
- Quantitative Variables Condition
  - We currently only know how to do regression on two quantitative variables, so make sure to check this condition.
  - More advanced classes will show you how to incorporate categorical data.
- Straight Enough Condition
  - The linear model assumes that the relationship between the variables is linear.
  - A scatter-plot will let you check that the assumption is reasonable.
24 Assumptions and Conditions (cont.)
- It is a good idea to check linearity again after computing the regression, when we can examine the residuals.
- You should also check for outliers, which could change the regression.
- If the data seem to clump or cluster in the scatter-plot, that could be a sign of trouble worth looking into further.
- If the scatter-plot is not straight enough, stop here.
  - You can't use a linear model for just any two variables, even if they are related.
  - They must have a linear association or the model won't mean a thing.
- Some nonlinear relationships can be saved by re-expressing the data to make the scatter-plot more linear. (But we won't do this in DS212.)
25 Assumptions and Conditions (cont.)
- Outlier Condition
  - Watch out for outliers.
  - Outlying points can dramatically change a regression model.
  - Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables.
  - Don't automatically delete outliers, but instead study them further to see if they are data errors or indicate something that your study did not initially incorporate!
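A single point really can flip the sign of the slope. The sketch below shows this on made-up data: five points lying exactly on a rising line, then the same points plus one extreme outlier. All values and names are illustrative assumptions.

```python
def slope_of(points):
    """Least squares slope for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical data: five points on the rising line y = x.
clean = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]

# The same data plus one hypothetical outlier far to the right and far below.
with_outlier = clean + [(20, -10)]

slope_clean = slope_of(clean)
slope_with_outlier = slope_of(with_outlier)

print("slope without the outlier:", slope_clean)
print("slope with the outlier:", slope_with_outlier)
```

The first slope is positive and the second is negative, so the outlier alone reverses the apparent direction of the relationship. That is a reason to investigate the point, not necessarily to delete it.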
26 Reality Check: Is the Regression Reasonable?
- Statistics don't come out of nowhere. They are based on data.
- The results of a statistical analysis should reinforce your common sense, not fly in its face.
- If the results are surprising, then either you have learned something new about the world or your analysis is wrong.
- When you perform a regression, think about the coefficients and ask yourself whether they make sense.
27 What Can Go Wrong?
- Don't fit a straight line to a nonlinear relationship.
- Beware extraordinary points (y-values that stand off from the linear pattern or extreme x-values).
- Don't extrapolate beyond the data; the linear model may no longer hold outside of the range of the data.
- Don't infer that x causes y just because there is a good linear model for their relationship: association is not causation.
- Don't choose a model based on R² alone.
- We will study more about regression (and what can go wrong) in Chapter 9!
28 What have we learned?
- When the relationship between two quantitative variables is fairly straight, a linear model can help summarize that relationship.
- The regression line doesn't pass through all the points, but it is the best compromise in the sense that it has the smallest sum of squared residuals.
- The correlation, r, tells us several things about the regression:
  - The slope of the line is based on the correlation, adjusted for the units of x and y.
  - For each SD (standard deviation) in x that we are away from the x mean, we expect to be r SDs in y away from the y mean.
  - Since r is always between -1 and 1, each predicted y is fewer SDs away from its mean than the corresponding x was (regression to the mean).
  - R² gives us the fraction of the variation in the response accounted for by the regression model.
29 What have we learned? (conclusion)
- The residuals also reveal how well the model works.
  - If a plot of the residuals against predicted values shows a pattern, we should re-examine the data to see why.
  - The standard deviation of the residuals quantifies the amount of scatter around the line.
- We have learned how to use technology (Excel) to perform regressions instead of calculating everything by hand!
30 Step-by-Step 1: Regression Analysis
- We want to examine the relationship between calories and sugar in breakfast cereals. Specifically, we'd like to see if we can predict how many calories a serving of a cereal has, given its sugar content. The Ch08-excel-answers.xls file contains a dataset with 77 types of cereal.
- What should we do first? (After thinking, that is.)
- Hint: What were the 3 rules of Data Analysis?
31 Step-by-Step 2: Regression Analysis
- A scatter-plot (after we re-order the columns) is a good way to check the association.
- What are the 3 things we need for a linear regression model to be appropriate? Do they all apply here?
32 Step-by-Step 3: Regression Analysis
- Given that the scatter-plot shows we met all 3 conditions, we can now run a regression analysis.
- What does our model predict?
- Does our model appear to be good at predicting caloric content?
- What should we do next?
33 Step-by-Step 4: Regression Analysis
- It is useful to look at the residuals plot.
- If we used the otherwise easy Add Trendline approach, we need to do a bit of work to get this plot.
- What do the residuals tell us about the appropriateness of our linear model?
34 Example: More Predicting Housing Prices Given Size
- Problem 7: Using a sample of home sales in New Mexico in 1993, a linear regression model attempts to predict price ($K) given size (sq ft) and has an R² of 71.4%.
  - What are the variables and their units in this regression? Which is the explanatory variable and which is the response variable?
  - What units does the slope have?
  - Do you think the slope is positive or negative?
- Problem 11, continued:
  - What is the correlation between size and price? Don't forget the sign!
  - What would you predict about the price of a home that is one SD above the average house size?
  - What would you predict about the price of a home that is 2 SD below the average house size?
- (Note: we haven't yet mentioned what b0 and b1 are, nor have we listed figures for the average price or size! But we don't need them yet.)
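One way to work through the correlation questions is sketched below. It relies on two things: the stated R² of 71.4%, and the assumption that larger homes tend to sell for more, which makes the correlation positive. The function name is illustrative.

```python
import math

r_squared = 0.714  # R-squared stated in the problem (71.4%)

# Assumption: price rises with size, so we take the positive square root.
r = math.sqrt(r_squared)

def predicted_price_z(size_z):
    """Predicted price, in SDs from the mean price, for a size z-score."""
    return r * size_z

print("correlation r:", round(r, 3))
print("1 SD above average size ->", round(predicted_price_z(1), 2), "SDs in price")
print("2 SDs below average size ->", round(predicted_price_z(-2), 2), "SDs in price")
```

No slope, intercept, or means are needed, which is the point of the note above: in standard-deviation units the prediction depends only on r.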
35 Example: Even More Predicting Housing Prices Given Square Footage
- Problem 13: We finally get the model coefficients for the Albuquerque real estate market. The intercept is $47.82K and the slope is $0.061K per sq ft.
- What does the slope of the line say about housing prices and size?
- What price would you predict for a 3000 sq ft house in this market at this time?
- It turns out that a 1200 sq ft house sold for $6K less than one would have expected given the model. What was the selling price, and what is the $6K difference called?
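A short sketch of the arithmetic behind these questions, using the intercept and slope given above. The function and variable names are illustrative, and the numbers are in thousands of dollars.

```python
INTERCEPT_K = 47.82        # $K, from the problem statement
SLOPE_K_PER_SQFT = 0.061   # $K per square foot, from the problem statement

def predict_price_k(size_sqft):
    """Predicted price in $K for a house of the given size in sq ft."""
    return INTERCEPT_K + SLOPE_K_PER_SQFT * size_sqft

price_3000 = predict_price_k(3000)

# The 1200 sq ft house sold for $6K less than predicted, so its residual
# (observed minus predicted) is -6.
predicted_1200 = predict_price_k(1200)
residual_k = -6.0
actual_1200 = predicted_1200 + residual_k

print("predicted price, 3000 sq ft ($K):", round(price_3000, 2))
print("predicted price, 1200 sq ft ($K):", round(predicted_1200, 2))
print("actual selling price, 1200 sq ft ($K):", round(actual_1200, 2))
```

The slope reads as roughly $61 of additional price per additional square foot, and the $6K shortfall is a negative residual.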
36 Example: Predicting Birth Rates
- Problem 42: The table shows the number of live births per 1000 women aged 15-44 in the US.
- Make a scatter-plot and fit a regression. Does the scatter-plot suggest a linear model is justified?
- Plot the residuals to see if there are any possible problems.
- Interpret the slope of the line.
- Use the model to predict the birthrate for 1978, 2005, and 2020.
- Give your confidence in each of these predictions.
- Extra: What year does the model suggest that US women will stop reproducing? Does this make sense?
37 Example: Predicting Birth Rates
- The scatter-plot and Add Trendline can be used to generate the following chart of the data and the regression.
- Any concerns?
38 Example: Predicting Birth Rates
- We can calculate the predicted values and residuals, generating this table, from which we can then develop the following scatter-plot.