Chapter 4: Describing Relationships Between Variables
1
Chapter 4 Describing Relationships Between
Variables
  • 4.1 Fitting a Least Squares Line
  • Describe a relationship between two variables x
    and y.
  • We will find the best linear fit of y versus x:
    y ≈ β0 + β1x, where β0 and β1 are unknown
    parameters.
  • Goal: find estimates b0 and b1 for the
    parameters β0 and β1.

2
Example 4.1
  • Eight batches of plastic are made, and from each
    batch one test item is molded and its hardness y
    is measured at time x. The following are the 8
    measurements.

3
Example 4.1
  • Scatterplot: Is a linear relationship
    appropriate?
  • How do we find an equation for this line?

4
Least Squares Principle
  • We will fit a line given by ŷ = b0 + b1x, where
    b0 and b1 are estimates for the parameters β0
    and β1.
  • Note that a straight line will not pass perfectly
    through every one of our data points.
  • If we plug a data value xi into the equation
    ŷ = b0 + b1x, the value ŷi we get will not be
    exactly our data value yi.

5
Least Squares Principle
  • We need to minimize the squared distances
    between the actual data values, yi, and the
    values given by our equation, ŷi.
  • Thus, we wish to minimize Σ(yi − ŷi)².
  • We are minimizing the residuals (more soon).

6
Least Squares Principle
  • How do we find the estimates b0 and b1?
  • Use calculus.
  • Plugging ŷi = b0 + b1xi into Σ(yi − ŷi)²
    yields Σ(yi − b0 − b1xi)².
  • We need to minimize this sum.
  • Take partial derivatives with respect to b0 and
    b1 and set them equal to zero.
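As a sketch of this step, the following uses sympy to take the two partial derivatives and solve them. The data values are made up for illustration, since the Example 4.1 measurements are not reproduced in this transcript:

```python
import sympy as sp

# Made-up illustrative data (not the Example 4.1 measurements).
xs = [16, 24, 32, 40]
ys = [199, 205, 241, 249]

b0, b1 = sp.symbols("b0 b1")

# Sum of squared residuals: S(b0, b1) = sum of (yi - b0 - b1*xi)^2.
S = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Setting both partial derivatives to zero and solving gives the
# least squares estimates (this is where the normal equations arise).
estimates = sp.solve([sp.diff(S, b0), sp.diff(S, b1)], [b0, b1])
print(estimates)
```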

7
Normal Equations
  • Taking partial derivatives with respect to b0 and
    b1 yields what are known as the Normal Equations:
  • Σyi = n·b0 + b1·Σxi
  • Σxiyi = b0·Σxi + b1·Σxi²

8
Least Squares Estimates
  • Solving these equations (details omitted) for b0
    and b1 yields the following:
  • b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
  • b0 = ȳ − b1·x̄
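A minimal numerical sketch of these formulas, again with made-up data:

```python
import numpy as np

# Made-up illustrative data.
x = np.array([16.0, 24.0, 32.0, 40.0])
y = np.array([199.0, 205.0, 241.0, 249.0])

xbar, ybar = x.mean(), y.mean()

# Closed-form least squares slope and intercept.
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
print(b0, b1)  # np.polyfit(x, y, 1) returns the same estimates
```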

9
Example 4.2
  • Continued from Example 4.1.
  • Find the least squares estimates given the data.

10
Interpretation
  • b1: for every 1-unit increase in the x variable,
    the y variable increases, on average, by the
    value of b1.
  • This interpretation is only true for a linear
    model.
  • b0: the value of y when x is equal to 0.
  • This is not always meaningful.
  • Example: for GPA vs. ACT score, b0 = −5.7, and a
    GPA of −5.7 is impossible.

11
Example 4.3
  • Continued from Example 4.1.
  • Eight batches of plastic are made, and from each
    batch one test item is molded and its hardness y
    is measured at time x.
  • b1 = 2.433 means that for every 1-unit increase
    in time, the hardness increases, on average, by
    2.433.
  • b0 = 153.060 means that when no time has passed,
    the hardness is 153.060.

12
Prediction
  • We can predict y with the least squares line.
  • Simply insert a value of x into the least squares
    equation to obtain a predicted value of y.
  • What is the predicted hardness for time x = 24?
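Using the estimates from Example 4.3 (b0 = 153.060, b1 = 2.433), a quick worked check:

ŷ = 153.060 + 2.433(24) = 153.060 + 58.392 = 211.452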

13
Extrapolation
  • Extrapolation is when a value of x beyond the
    range of our actual x observations is used to
    find a predicted ŷ.
  • Predicted values should not be used when
    extrapolating beyond the data set.
  • We do not know the behavior of the relationship
    beyond the range of our x values.
  • Example: What is the predicted hardness for time
    x = 110?

14
Example 4.4
15
Example 4.5
  • What is the predicted hardness for time x = 57?
  • What is the predicted hardness for time x = 5?

16
Linear Fit
  • We have a fitted line, but does it fit well?
  • To check the fit, we look at:
  • Correlation
  • Coefficient of Determination
  • Residuals

17
Correlation
  • Correlation quantifies the linear fit between y
    and x.
  • r will always lie between −1 and 1.
  • r close to 0 indicates a weak linear
    relationship.
  • r close to either −1 or 1 indicates a strong
    linear relationship.
  • The sign of r indicates whether the relationship
    is positive or negative.
  • So a positive value of r tells us that y is
    increasing linearly in x, and a negative value of
    r tells us that y is decreasing linearly in x.
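As a small sketch, r can be computed directly from its definition; the data below are made up for illustration:

```python
import numpy as np

# Made-up illustrative data.
x = np.array([16.0, 24.0, 32.0, 40.0])
y = np.array([199.0, 205.0, 241.0, 249.0])

# Sample correlation from its definition.
xd, yd = x - x.mean(), y - y.mean()
r = np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))
print(r)  # np.corrcoef(x, y)[0, 1] gives the same value
```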

18
Coefficient of Determination
  • Coefficient of Determination (R²): the fraction
    of raw variation in y accounted for by the fitted
    equation.
  • It quantifies the fit of other types of
    relationships as well (not just linear).
  • The value of R² will always lie between 0 and 1.
  • Values closer to 0 indicate a weak relationship
    between the variables.
  • Values closer to 1 indicate a strong
    relationship between the variables.
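A minimal sketch of computing R² as the fraction of variation accounted for, R² = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)², again with made-up data:

```python
import numpy as np

# Made-up illustrative data.
x = np.array([16.0, 24.0, 32.0, 40.0])
y = np.array([199.0, 205.0, 241.0, 249.0])

# Fit the least squares line and compute fitted values.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R^2 = 1 - SSE/SSTO.
sse = np.sum((y - y_hat) ** 2)
ssto = np.sum((y - y.mean()) ** 2)
print(1 - sse / ssto)
```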

19
Example 4.6
  • Continued from Example 4.1.
  • From r we can tell that there is a strong,
    positive, linear relationship (the linear model
    fits well).
  • From R² we can tell that our model fits well.
  • R² = r² only for a linear model.

20
Residuals
  • We hope that the fitted values, ŷi, will look
    like our data, yi, except for small fluctuations
    explainable as random variation.
  • To assess this, we look at what are called
    residuals: ei = yi − ŷi.

21
Residuals
  • When we fit the best line, we are minimizing the
    (squared) residuals.
  • These residuals should be patternless (randomly
    scattered).

22
Residuals
  • To use residuals to check the fit, we need to
    check their pattern.
  • Residual Plot: check for random scatter centered
    around zero (the x-axis).
  • The residuals are plotted against x or against
    the fitted values ŷ.
  • Normal Probability Plot: check for a straight
    line,
  • i.e., that the residuals follow a normal
    distribution.
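A sketch of both checks using matplotlib and scipy, with made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Made-up illustrative data.
x = np.array([16.0, 24.0, 32.0, 40.0, 48.0, 56.0])
y = np.array([199.0, 205.0, 241.0, 249.0, 262.0, 285.0])

# Fit the line and compute residuals ei = yi - y_hat_i.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Residual plot: look for random scatter around the zero line.
ax1.scatter(x, residuals)
ax1.axhline(0, color="gray")
ax1.set_xlabel("x")
ax1.set_ylabel("residual")

# Normal probability plot: points near a straight line suggest
# the residuals are approximately normal.
stats.probplot(residuals, plot=ax2)

plt.tight_layout()
plt.show()
```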

23
Residual Plot 1
(Panels: actual data; residual plot)
The residuals are randomly scattered around 0.
Thus, the residual plot shows a good fit (a linear
model is appropriate).
24
Residual Plot 2
(Panels: actual data; residual plot)
The residual plot shows a distinct curved pattern.
Thus, a linear model is not appropriate (a bad
fit). The data are probably better described with
a quadratic model.
25
Residual Plot 3
(Panels: actual data; residual plot)
The residual plot shows a cone-shaped pattern:
there is more spread for larger fitted values (a
bad fit). The researcher may want to investigate
the data collection process.
26
Residual Plot 4
  • Residuals vs. the time order of the observations:
  • As time increases, the residuals increase.
  • This pattern suggests that some variable changing
    in time is acting on y and has not been accounted
    for in fitting the values.
  • After seeing a residual plot with this pattern,
    the researcher may want to inspect the process
    from which the data were obtained.
  • Example: instrument drift could produce a
    pattern like this.

(Figure: ordered residual plot)
27
Normal Prob. Plot for Residuals
  • If we really have random variation, the residuals
    should be centered at zero and scattered evenly
    above and below zero.
  • A histogram of the residuals should look like the
    one on the following slide.
  • To check that our residuals follow a normal
    distribution, we can use a normal probability
    plot.

28
Example 4.7
  • Continued from Example 4.1.

29
Example 4.7
(Figure: residual plot and normal probability plot)
  • The residual plot shows random scatter around 0.
  • The normal probability plot follows a straight
    line.
  • Conclusion: the linear model fits well.

30
More on Model Fit
  • It is often wise to check multiple forms of model
    fit.
  • Each assessment may only be painting half the
    picture.
  • The most common combination:
  • R²
  • Residual plot

31
Example 4.8
  • Trying to predict stopping distance (ft) given
    the current speed (mph).

(Figure: scatterplot of Distance vs. Speed)
32
Example 4.8
(Figure: residual plot)
  • Although the data seemed linear and the R² was
    extremely high, the residual plot shows a
    distinct curved pattern.
  • Thus, the fit could be improved upon.
  • Use a quadratic model instead of a linear one,
    as sketched below.
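A sketch of this fix with numpy; the stopping-distance numbers below are made up for illustration:

```python
import numpy as np

# Made-up speed (mph) and stopping distance (ft) data.
speed = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
dist = np.array([20.0, 48.0, 90.0, 150.0, 225.0, 310.0])

# Fit linear and quadratic models and compare residuals.
lin = np.polyval(np.polyfit(speed, dist, 1), speed)
quad = np.polyval(np.polyfit(speed, dist, 2), speed)

print("linear residuals:   ", dist - lin)   # shows a curved pattern
print("quadratic residuals:", dist - quad)  # pattern largely gone
```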

33
Example 4.9
  • Predicting win percentage based on rebounds/game
    for NBA teams.
  • Residual plot: there is random scatter around 0.
  • Linear model seems to fit well.

34
Example 4.9
  • Although the residual plot indicates a good fit,
    R² = 0.2014, which is very low.
  • From the scatterplot, we notice that the data are
    somewhat linear, but only a very weak relationship
    exists (hence the low R²).

35
Linear Regression Cautions
  • r measures only linear relationships.
  • Correlation does not imply causation.
  • An example from Wikipedia: since the 1950s, both
    the atmospheric CO2 level and crime levels have
    increased sharply. Thus, we would expect a large
    correlation between crime and CO2 levels.
    However, we would not assume that atmospheric CO2
    causes crime.
  • Both R² and r can be drastically affected by a
    few unusual data points.
  • Example on page 137.

36
4.2 Fitting Curves and Surfaces
  • Use least squares.
  • Computation and interpretation become more
    complicated.
  • Curve fitting:
  • A natural generalization of the linear equation
    is the polynomial equation
    y ≈ β0 + β1x + β2x² + … + βkx^k.
  • Computation of the estimates b0, b1, …, bk is
    done by computer; see the sketch below.
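A minimal sketch of fitting such a polynomial by least squares with numpy (made-up data):

```python
import numpy as np

# Made-up illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.9, 10.2, 17.8, 27.9, 40.1])

# Fit a degree-2 polynomial y ~ b0 + b1*x + b2*x^2.
coeffs = np.polyfit(x, y, 2)    # coefficients, highest degree first
y_hat = np.polyval(coeffs, x)   # fitted values
print(coeffs)
print(np.polyval(coeffs, 3.5))  # prediction at an x inside the data range
```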

37
Surface Fitting
  • In surface fitting we have more than one
    predictor variable (x's) for our response (y).
  • Again, computation of the estimates b0, b1, …, bk
    is done by computer; a sketch follows.
  • Example: we want to predict brick strength (y)
    given levels of temperature and humidity (x's).
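A sketch of fitting y ≈ b0 + b1·x1 + b2·x2 by least squares with numpy; the temperature, humidity, and strength numbers below are made up:

```python
import numpy as np

# Made-up temperature (x1), humidity (x2), and strength (y) data.
x1 = np.array([100.0, 100.0, 150.0, 150.0, 200.0, 200.0])
x2 = np.array([20.0, 40.0, 20.0, 40.0, 20.0, 40.0])
y = np.array([55.0, 52.0, 63.0, 59.0, 70.0, 66.0])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates for y ~ b0 + b1*x1 + b2*x2.
(b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
print(b0, b1, b2)
```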

38
Interpretation
  • Given ŷ = b0 + b1x1 + b2x2, the interpretation
    is as follows:
  • b0 represents the value of y when x1 = 0 and
    x2 = 0.
  • b1 represents the increase/decrease in y for
    every one-unit increase in x1, holding x2
    constant.
  • b2 represents the increase/decrease in y for
    every one-unit increase in x2, holding x1
    constant.
  • Note: these statements are general.
  • You will need to interpret them within the
    context of the problem.

39
Residual Plots
  • Residuals are computed the same way as before.
  • Normal probability plot of residuals.
  • Residual plot against x:
  • useful for checking a linear fit or curve fit.
  • Residual plot against fitted values:
  • useful for any fit.
  • Use a computer due to the computational
    intensity.