Ch8: Linear Regression Fat Versus Protein: An Example

Provided by: Addison6
1
Ch8: Linear Regression. Fat Versus Protein: An
Example
  • We now build on Ch 7's scatter-plots.
  • Would you say there is a strong association?
  • Does that association look linear?

Content of Fat and Protein for Food at Burger King
Data taken from 30 items on the Burger King menu
2
The Linear Model
  • A large + or - correlation indicates that there
    seems to be a linear association between these
    two variables, but it doesn't tell us exactly
    what that association is.
  • We can say more about the linear relationship
    between two quantitative variables with a model.
  • A model simplifies reality to help us understand
    underlying patterns and relationships.
  • A linear model is just an equation of a straight
    line through the data.
  • The points in the scatter-plot don't all line up,
    but a straight line can summarize the general
    pattern.
  • The linear model can help us understand how the
    values are associated.

3
Practice Fitting Lines
  • We can eyeball which fit looks best. Any votes?
  • Luckily for us, there's an algorithm to determine
    the "best fit" model.

4
The Straight Line and the Residuals
  • With real, non-trivial data, the model will never
    be perfect, regardless of the line we draw.
  • Some points will be above the line and some will
    be below.
  • The estimate made from a model is the predicted
    value, denoted as ŷ (read "y-hat").

5
Residuals (cont.)
  • The difference between the observed value and its
    associated predicted value is called the
    residual.
  • To find the residuals, we always subtract the
    predicted value from the observed one:
    residual = observed - predicted.
  • A negative residual means the predicted value is
    too big (an overestimate).
  • A positive residual means the predicted value is
    too small (an underestimate).
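As a quick sketch of the definition (using made-up fat values, not the actual BK data), residuals are just observed minus predicted:

```python
# Residual = observed value - predicted value.
# The numbers below are made up for illustration, not the real BK data.
observed = [22.0, 35.0, 9.0]   # actual fat content (grams)
predicted = [25.0, 31.0, 9.0]  # fat content predicted by a model

residuals = [obs - pred for obs, pred in zip(observed, predicted)]
print(residuals)  # [-3.0, 4.0, 0.0]
# -3.0: the model overestimated (negative residual)
# +4.0: the model underestimated (positive residual)
```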

6
Best Fit Means Least Squares
  • Some residuals are positive, others are negative,
    and, on average, they cancel each other out.
  • So, we can't assess how well the line fits by
    adding up all the residuals.
  • Similar to what we did with the standard
    deviation, we square the residuals and add the
    squares.
  • The smaller the sum, the better the fit.
  • The line of best fit is the line for which the
    sum of the squared residuals is smallest.
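A minimal sketch of the least-squares idea, using a tiny made-up dataset: of two candidate lines, the better fit is the one with the smaller sum of squared residuals.

```python
# Tiny made-up dataset for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def sum_squared_residuals(b0, b1):
    """Sum of squared residuals for the candidate line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Compare an eyeballed line with one much closer to the least-squares fit:
print(round(sum_squared_residuals(0.0, 2.5), 2))  # larger sum -> worse fit
print(round(sum_squared_residuals(0.2, 1.9), 2))  # smaller sum -> better fit
```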

7
The Least Squares Line Parameters
  • In our model, the first parameter, b1, is the
    slope.
  • The slope is built from the correlation and the
    standard deviations (s, not σ, since we use the
    sample standard deviations): b1 = r·(sy/sx).
    It is always in units of y per unit of x.
  • The model's second parameter, b0, is the
    intercept.
  • The intercept is built from the means and the
    slope, b0 = ȳ - b1·x̄, and is always in units
    of y.
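A sketch of these two formulas in code. The correlation (0.83) and fat SD (16.4 g) are quoted on later slides; the protein mean and SD used here are assumed values for illustration, not figures from these slides.

```python
# Slope and intercept from summary statistics:
#   b1 = r * (s_y / s_x),  b0 = y_bar - b1 * x_bar
r = 0.83        # correlation between protein and fat (from later slides)
s_x = 14.0      # SD of protein (grams) -- assumed for illustration
s_y = 16.4      # SD of fat (grams, from later slides)
x_bar = 17.2    # mean protein (grams) -- assumed for illustration
y_bar = 23.5    # mean fat (grams) -- assumed for illustration

b1 = r * s_y / s_x        # slope, in grams of fat per gram of protein
b0 = y_bar - b1 * x_bar   # intercept, in grams of fat

print(round(b1, 2))  # ≈ 0.97
print(round(b0, 1))  # ≈ 6.8
```

With these summary statistics the parameters come out close to the BK model quoted later (intercept 6.8, slope 0.97).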

8
How to Find These in Reality, via Excel
  • Realistically, you will always use a computer to
    build such a model.
  • Thus, the following slides walk through two
    different ways you can use Excel to determine
    regressions, using the Burger King dataset (see
    also the class handout).
  • Note: Since regression and correlation are
    closely related, we need to check the same
    conditions for regression as we did for
    correlation:
  • Quantitative Variables Condition
  • Straight Enough Condition
  • Outlier Condition

9
Method 1) Use Excel's Chart Wizard
  • Create a scatter-plot (as per Chapter 7).
  • Select Add Trendline from the Chart menu (which
    appears once you have a chart).
  • Select Linear.
  • For options, pick both "Display equation on
    chart" and "Display R-squared value on chart."
  • R-squared is just the square of the
    correlation, r.
  • Advantages to this method: it's generally the
    easier one, and we don't need the Analysis
    ToolPak add-in.
  • Disadvantages: we don't automatically get the
    plot of the residuals or many other regression
    statistics.
  • Note: The sheet in Ch08-excel-answers.xls for
    Problem 42 walks through generating a residuals
    plot.

10
Method 2) Use Excel's Regression Wizard
  • Going to Tools > Data Analysis > Regression
    yields a pop-up wizard.
  • Enter the range for what you are trying to
    predict, the response variable.
  • Enter the input range, and pick the column that
    will serve as X.
  • Check "Labels" if you included the data headers.
  • You have the option of where to put the
    voluminous output. Using a separate sheet in the
    same workbook is the least confusing for many
    people.
  • For the options, check Residuals, Line Fit Plot,
    and Residuals Plot. Don't bother with
    standardized residuals.

If you don't get the Data Analysis menu
option in Tools, go to the Add-Ins option.
11
Method 2) Regression in Excel: Outputs
  • Coefficients (the intercept is listed above the
    slope).
  • Multiple R (this is effectively the absolute
    value of the correlation).
  • Excel does not give you the correlation's sign.
    You should check whether you have a negative
    correlation.
  • R Square (this is r²): the % of variance in Y
    that is explainable by X.
  • Graphs
  • You may also want to adjust your graphs so they
    are the right size, don't show a lot of dead
    space, and look better.
  • I would recommend formatting the regression line
    in the Line Fit Plot to actually BE a line.
  • Advantages to this method:
  • you also get to see the residuals.
  • You get other useful information, such as
    confidence intervals, even if we don't use this
    information right now.
  • Disadvantages:
  • Not as simple to use. (Much easier to make
    mistakes)
  • You may not have Data Analysis installed.
  • You may misinterpret the correlation.

12
Fat Versus Protein: An Example
  • The regression line for the Burger King data fits
    the data well.
  • The equation turns out to be:
    predicted fat = 6.8 + 0.97 × protein (in grams).
  • The predicted fat content for a BK Broiler
    chicken sandwich (30 grams of protein) is
    6.8 + 0.97(30) = 35.9 grams of fat.
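The prediction on this slide, as a one-line function:

```python
# The BK model from this slide: predicted fat = 6.8 + 0.97 * protein (grams).
def predicted_fat(protein_grams):
    return 6.8 + 0.97 * protein_grams

# A BK Broiler chicken sandwich has 30 grams of protein:
print(round(predicted_fat(30), 1))  # 35.9 grams of fat
```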

13
Correlation and the Line
  • Moving one standard deviation away from the mean
    in x moves us r standard deviations away from the
    mean in y.
  • This relationship is shown in a scatter-plot of
    z-scores for fat and protein.
  • Put generally, moving any number of standard
    deviations away from the mean in x moves us r
    times that number of standard deviations away
    from the mean in y.
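In z-score terms the line is simply ẑy = r·zx. A sketch using r = 0.83, the correlation quoted later for the BK data:

```python
# Predicted z-score of y for a given z-score of x: z_y_hat = r * z_x.
r = 0.83  # correlation for the BK fat/protein data (from later slides)

def predicted_z_y(z_x):
    return r * z_x

print(predicted_z_y(1.0))   # 0.83: that many SDs above the mean in fat
print(predicted_z_y(-2.0))  # -1.66: 1.66 SDs below the mean in fat
```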

14
How Big Can Predicted Values Get?
  • r can never be bigger than 1 (in absolute value),
    so each predicted y tends to be closer to
    its mean (in standard deviations) than its
    corresponding x was.
  • This property of the linear model is called
    "regression to the mean"; the line is called the
    regression line.

15
Residuals Revisited
  • The linear model assumes that the relationship
    between the two variables is a perfect straight
    line. The residuals are the part of the data that
    hasn't been modeled.
  • Data = Model + Residual
  • or (equivalently)
  • Residual = Data - Model
  • Or, in symbols: e = y - ŷ
16
Residuals Revisited (cont.)
  • Residuals help us to see whether the model makes
    sense.
  • When a regression model is appropriate, nothing
    interesting should be left behind.
  • After we fit a regression model, we usually plot
    the residuals in the hope of finding... nothing.
  • No curves or lines
  • No increasing or decreasing variation as we move
    along the x-axis

17
Residuals Revisited (cont.)
  • The residuals for the BK menu regression look
    appropriately boring: there are no obvious
    patterns.

18
R²: The Variation That Is Accounted For
  • The variation in the residuals is the key to
    assessing how well the model fits.
  • In the BK menu items example, total fat has a
    standard deviation of 16.4 grams. The standard
    deviation of the residuals from our model's
    prediction of fat is 9.2 grams.
  • Which shows more variation?

Variation in Fat in BK Items, and in the Model's
Residuals
19
R²: The Variation Accounted For (cont.)
  • If the correlation were 1.0 and the model
    predicted the fat values perfectly, the residuals
    would all be zero and have no variation.
  • As it is, the correlation is 0.83, not perfection.
  • However, we did see that the model residuals had
    less variation than total fat alone.
  • We can determine how much of the variation is
    accounted for by the model and how much is left
    in the residuals.

20
R²: The Variation Accounted For (cont.)
  • The squared correlation, R², pronounced
    "R-squared," gives the fraction of the data's
    variance accounted for by the model.
  • Thus, 1 - R² is the fraction of the original
    variance left in the residuals.
  • An R² of 0 means that none of the variance in the
    data is in the model; all of it is still in the
    residuals.
  • When interpreting a regression model, you need to
    Tell what R² means.
  • For the BK model, R² = 0.83² = 0.69,
  • so 69% of the variation in total fat is accounted
    for by the model,
  • and 31% of the variability in total fat has been
    left in the residuals.
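The arithmetic on this slide, plus a cross-check against the standard deviations quoted earlier (total fat SD 16.4 g, residual SD 9.2 g):

```python
# R^2 directly from the correlation:
r = 0.83
r_squared = r ** 2
print(round(r_squared, 2))  # 0.69 -> 69% of the variation accounted for

# Cross-check: the fraction of variance left in the residuals is
# (residual SD / total SD)^2, so R^2 = 1 - that fraction.
fraction_left = (9.2 / 16.4) ** 2
print(round(1 - fraction_left, 2))  # ≈ 0.69 again
```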

21
How Big Should R² Be?
  • R² is always between 0% and 100%.
  • What makes a "good" R² value depends on the kind
    of data you are analyzing and on what you want to
    do with the results.
  • Unless you are modeling trivial, already-known
    relationships (e.g., weight in kg to weight in
    lbs) or are using very "textbook" data, R² will
    never be 100% and will rarely be above 90%.

22
Interpreting Model Results
  • A regression model can always explain some
    variation; R² is rarely 0. But that doesn't mean
    the model makes sense.
  • The Random-r sheet in Ch08-excel-answers.xls
    shows a random number stream used to "explain"
    another random number stream. Note the non-zero
    values for b1 and R².
  • Does this model provide a real explanation of the
    variation?

23
Assumptions and Conditions
  • Quantitative Variables Condition
  • We currently only know how to do Regression on
    two quantitative variables, so make sure to check
    this condition.
  • More advanced classes will show you how to
    incorporate categorical data.
  • Straight Enough Condition
  • The linear model assumes that the relationship
    between the variables is linear.
  • A scatter-plot will let you check that the
    assumption is reasonable.

24
Assumptions and Conditions (cont.)
  • It is a good idea to check linearity again after
    computing the regression when we can examine the
    residuals.
  • You should also check for outliers, which could
    change the regression.
  • If the data seem to clump or cluster in the
    scatter-plot, that could be a sign of trouble
    worth looking into further.
  • If the scatter-plot is not "straight enough,"
    stop here.
  • You can't use a linear model for any two
    variables, even if they are related.
  • They must have a linear association or the model
    won't mean a thing.
  • Some nonlinear relationships can be saved by
    re-expressing the data to make the scatter-plot
    more linear. (But we won't do this in DS212.)

25
Assumptions and Conditions (cont.)
  • Outlier Condition
  • Watch out for outliers.
  • Outlying points can dramatically change a
    regression model.
  • Outliers can even change the sign of the slope,
    misleading us about the underlying relationship
    between the variables.
  • Don't automatically delete outliers; instead,
    study them further to see if they are data errors
    or indicate something that your study did not
    initially incorporate!

26
Reality Check: Is the Regression Reasonable?
  • Statistics don't come out of nowhere. They are
    based on data.
  • The results of a statistical analysis should
    reinforce your common sense, not fly in its face.
  • If the results are surprising, then either you
    have learned something new about the world or
    your analysis is wrong.
  • When you perform a regression, think about the
    coefficients and ask yourself whether they make
    sense.

27
What Can Go Wrong?
  • Don't fit a straight line to a nonlinear
    relationship.
  • Beware extraordinary points (y-values that stand
    off from the linear pattern or extreme x-values).
  • Don't extrapolate beyond the data: the linear
    model may no longer hold outside of the range of
    the data.
  • Don't infer that x causes y just because there is
    a good linear model for their relationship:
    association is not causation.
  • Don't choose a model based on R² alone.
  • We will study more about Regression (and what can
    go wrong) in Chapter 9!

28
What have we learned?
  • When the relationship between two quantitative
    variables is fairly straight, a linear model can
    help summarize that relationship.
  • The regression line doesn't pass through all the
    points, but it is the best compromise in the
    sense that it has the smallest sum of squared
    residuals.
  • The correlation, r, tells us several things
    about the regression:
  • The slope of the line is based on the
    correlation, adjusted for the units of x and y.
  • For each SD (standard deviation) in x that we are
    away from the x mean, we expect to be r SDs in y
    away from the y mean.
  • Since r is always between -1 and 1, each
    predicted y is fewer SDs away from its mean than
    the corresponding x was (regression to the mean).
  • R² gives us the fraction of the variability in
    the response accounted for by the regression
    model.

29
What have we learned? (conclusion)
  • The residuals also reveal how well the model
    works.
  • If a plot of the residuals against predicted
    values shows a pattern, we should re-examine the
    data to see why.
  • The standard deviation of the residuals
    quantifies the amount of scatter around the line.
  • We have learned how to use technology (Excel) to
    perform regressions instead of calculating
    everything by hand!

30
Step-by-Step 1: Regression Analysis
  • We want to examine the relationship between
    calories and sugar in breakfast cereals.
    Specifically, we'd like to see if we can predict
    how many calories a serving of a cereal has given
    its sugar content. The Ch08-excel-answers.xls
    file contains a dataset with 77 types.
  • What should we do first? (After thinking, that
    is.)
  • Hint: What were the 3 rules of Data Analysis?

31
Step-by-Step 2: Regression Analysis
  • A scatter-plot (after we re-order the columns) is
    a good way to check the association.
  • What are the 3 things we need to have a linear
    regression model be appropriate? Do they all
    apply here?

32
Step-by-Step 3: Regression Analysis
  • Given the scatter-plot shows we met all 3
    conditions, we can now run a regression analysis.
  • What does our model predict?
  • Does our model appear to be good at predicting
    caloric content?
  • What should we do next?

33
Step-by-Step 4: Regression Analysis
  • It is useful to look at the residuals plot.
  • If we used the otherwise-easy "Add Trendline"
    approach, we need to do a bit of work to get this
    plot.
  • What do the residuals tell us about the
    appropriateness of our linear model?

34
Example: More Predicting Housing Prices Given Size
  • Problem 7: Using a sample of home sales in New
    Mexico in 1993, a linear regression model
    attempts to predict price ($K) given size (sq ft)
    and has an R² of 71.4%.
  • What are the variables and their units in this
    regression? Which is the explanatory variable
    and which is the response variable?
  • What units does the slope have?
  • Do you think the slope is positive or negative?
  • Problem 11, continued:
  • What is the correlation between size and price?
    Don't forget the sign!
  • What would you predict about the price of a home
    that is one SD above the average house size?
  • What would you predict about the price of a home
    that is 2 SDs below the average house size?
  • (Note: we haven't yet mentioned what b0 and b1
    are, nor have we listed figures for the average
    price or size! But we don't need them yet.)

35
Example: Even More Predicting Housing Prices
Given Square Footage
  • Problem 13: We finally get the model
    coefficients for the Albuquerque real estate
    market. The intercept is $47.82K and the slope
    is 0.061 $K/sq ft.
  • What does the slope of the line say about housing
    prices and size?
  • What price would you predict for a 3000 sq ft
    house in this market at this time?
  • It turns out that a 1200 sq ft house sold for
    $6K less than one would have expected given
    the model. What was the selling price, and what
    is the $6K difference called?
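A sketch of the arithmetic for this model, predicted price ($K) = 47.82 + 0.061 × size (sq ft):

```python
# Albuquerque model from this slide: price in $K, size in sq ft.
def predicted_price_k(sq_ft):
    return 47.82 + 0.061 * sq_ft

print(round(predicted_price_k(3000), 2))  # 230.82 ($K) for 3000 sq ft

# The 1200 sq ft house sold for $6K below its prediction
# (i.e., a residual of -6):
predicted = predicted_price_k(1200)       # 121.02 ($K)
print(round(predicted - 6, 2))            # 115.02 ($K) actual selling price
```

The $6K shortfall is, of course, the residual: observed price minus predicted price.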

36
Example: Predicting Birth Rates
  • Problem 42: The table shows the number of live
    births per 1000 women aged 15-44 in the US.
  • Make a scatter-plot and fit a regression. Does
    the scatter-plot suggest a linear model is
    justified?
  • Plot the residuals to see if there are any
    possible problems.
  • Interpret the slope of the line.
  • Use the model to predict the birthrate for 1978,
    2005, and 2020.
  • Give your confidence in each of these
    predictions.
  • Extra: What year does the model suggest that US
    women will stop reproducing? Does this make
    sense?

37
Example: Predicting Birth Rates
  • The scatter-plot and "Add Trendline" can be used
    to generate the following chart of the data and
    the regression.
  • Any concerns?

38
Example: Predicting Birth Rates
  • We can calculate the predicted values and
    residuals, generating this table, from which we
    can then develop the following scatter-plot.