Title: Chapter 4 Describing Relationships Between Variables
Slide 1: Chapter 4 Describing Relationships Between Variables
- 4.1 Fitting a Least Squares Line
- Describe a relationship between two variables x and y.
- We will find the best linear fit of y versus x: $y \approx \beta_0 + \beta_1 x$.
- $\beta_0$ and $\beta_1$ are unknown parameters.
- Goal: find estimates $b_0$ and $b_1$ for the parameters $\beta_0$ and $\beta_1$.
Slide 2: Example 4.1
- Eight batches of plastic are made and from each
batch one test item is molded and its hardness y
is measured at time x. The following are the 8
measurements
Slide 3: Example 4.1
- Scatterplot: is a linear relationship appropriate?
- How do we find an equation for this line?
Slide 4: Least Squares Principle
- We will fit a line given by $\hat{y} = b_0 + b_1 x$, where $b_0$ and $b_1$ are estimates for the parameters $\beta_0$ and $\beta_1$.
- Note that a straight line will not pass perfectly through every one of our data points.
- If we plug a data value $x_i$ into the equation $\hat{y} = b_0 + b_1 x$, the value we get for $\hat{y}_i$ will not be exactly our data value $y_i$.
Slide 5: Least Squares Principle
- Need to minimize the squared distances between the actual data value, $y_i$, and the value given by our equation, $\hat{y}_i$.
- Thus, we wish to minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
- We are minimizing the sum of squared residuals (more soon).
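Slide 6: Least Squares Principles
- How do we find estimates for b0 and b1?
- Use calculus.
- Plugging $\hat{y}_i = b_0 + b_1 x_i$ into the equation $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ yields:
- Need to minimize $S(b_0, b_1) = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2$.
- Take partial derivatives and set equal to zero.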
Slide 7: Normal Equations
- Taking partial derivatives with respect to b0 and
b1 yields what are known as the Normal Equations.
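As a sketch of the omitted calculus step, write the quantity being minimized as $S(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$ (the label $S$ is an assumption made here for convenience). Setting both partial derivatives to zero and rearranging gives the two normal equations shown in the last two lines:

```latex
\begin{align*}
\frac{\partial S}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0 \\
\frac{\partial S}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 \\
n b_0 + b_1 \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} y_i \\
b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} x_i y_i
\end{align*}
```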
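Slide 8: Least Squares Estimates
- Solving these equations (details omitted) for b0 and b1 yields the following:
- $b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
- $b_0 = \bar{y} - b_1 \bar{x}$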
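The closed-form solution of the normal equations can be sketched in a few lines of Python. The function name `least_squares_line` and the demo data are hypothetical; the demo values are not the Example 4.1 measurements and were chosen to lie exactly on y = 1 + 2x so the result is easy to check by hand.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x
x_demo = [0, 1, 2, 3]
y_demo = [1, 3, 5, 7]
b0, b1 = least_squares_line(x_demo, y_demo)
print(b0, b1)
```

Because the demo points fall exactly on a line, the estimates recover its intercept and slope; with real data the line will not pass through every point.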
Slide 9: Example 4.2
- Continued from Example 4.1
- Find the least squares estimates given
Slide 10: Interpretation
- b1 means that for every 1-unit increase in the x variable, the y variable increases, on average, by the value of b1.
- Only true for a linear model
- b0 means the value of y when x is equal to 0.
- Not always meaningful
- Example: GPA vs. ACT score, b0 = -5.7
Slide 11: Example 4.3
- Continued from Example 4.1
- Eight batches of plastic are made and from each batch one test item is molded and its hardness y is measured at time x.
- b1 means that for every 1-unit increase in time, the hardness increases, on average, by 2.433.
- b0 means that when no time has passed, the hardness is 153.060.
Slide 12: Prediction
- We can predict y with the least squares line.
- Simply insert a value of x into the least squares equation to obtain a predicted value of y.
- What is the predicted hardness for time x = 24?
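A minimal sketch of this prediction step, using the intercept 153.060 and slope 2.433 reported in Example 4.3. The function name `predict_hardness` is hypothetical.

```python
B0 = 153.060  # intercept reported in Example 4.3
B1 = 2.433    # slope reported in Example 4.3


def predict_hardness(time):
    """Predicted hardness from the fitted line y-hat = B0 + B1 * time."""
    return B0 + B1 * time


# 153.060 + 2.433 * 24 = 153.060 + 58.392 = 211.452
print(round(predict_hardness(24), 3))
```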
Slide 13: Extrapolation
- Extrapolation is when a value of x beyond the range of our actual x observations is used to find a predicted $\hat{y}$.
- Predicted values should not be used when extrapolating beyond the data set.
- We do not know the behavior beyond the range of our x values.
- Example: What is the predicted hardness for time x = 110?
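One way to guard against accidental extrapolation is to check a new x against the observed range before predicting. The sketch below assumes placeholder observation times, because the Example 4.1 measurements are not reproduced here; only the coefficients come from Example 4.3. The function name and the choice to raise an error (rather than warn) are also assumptions.

```python
B0, B1 = 153.060, 2.433  # coefficients reported in Example 4.3

# Placeholder times for illustration only; NOT the Example 4.1 data
observed_times_placeholder = [10, 20, 30, 40]


def predict_within_range(x_new, x_observed, b0, b1):
    """Predict y-hat = b0 + b1*x_new, refusing to extrapolate."""
    lo, hi = min(x_observed), max(x_observed)
    if not lo <= x_new <= hi:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lo}, {hi}]"
        )
    return b0 + b1 * x_new


# Inside the placeholder range: a prediction is returned
print(predict_within_range(25, observed_times_placeholder, B0, B1))

# Outside the placeholder range: the function refuses to predict
try:
    predict_within_range(110, observed_times_placeholder, B0, B1)
except ValueError as err:
    print("refused:", err)
```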
Slide 14: Example 4.4
Slide 15: Example 4.5
- What is the predicted hardness for time x = 57?
- What is the predicted hardness for time x = 5?
Slide 16: Linear Fit
- We have a fitted line, but does it fit well?
- To check the fit:
- Correlation
- Coefficient of Determination
- Residuals
Slide 17: Correlation
- Correlation quantifies the linear fit between y and x.
- r will always lie between -1 and 1.
- r close to 0 indicates a weak linear relationship.
- r close to either -1 or 1 indicates a strong linear relationship.
- The sign of r indicates if the relationship is positive or negative.
- So a positive value of r tells us that y is increasing linearly in x and a negative value of r tells us that y is decreasing linearly in x.
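For reference, the sample correlation can be computed from sums of deviations as $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$. The sketch below uses a hypothetical function name and hypothetical data (not taken from any example in these slides).

```python
import math


def sample_correlation(x, y):
    """Sample correlation r between paired lists x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a fairly strong positive linear trend
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 1, 4, 3, 5]
r = sample_correlation(x_demo, y_demo)
print(round(r, 3))  # Sxy = 8, Sxx = 10, Syy = 10, so r = 0.8
```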
Slide 18: Coefficient of Determination
- Coefficient of Determination ($R^2$): the fraction of raw variation in y accounted for by the fitted equation.
- Quantifies the fit of other types of relationships (not just linear)
- The value of $R^2$ will always lie between 0 and 1.
- Values closer to 0 indicate a weak relationship between the variables.
- Values closer to 1 indicate a strong relationship between the variables.
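The "fraction of raw variation accounted for" can be sketched as $R^2 = \left(\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2\right) / \sum (y_i - \bar{y})^2$. The function name and data below are hypothetical; for this demo data the least squares line works out to $\hat{y} = 0.6 + 0.8x$.

```python
def r_squared(y, y_hat):
    """Fraction of the raw variation in y accounted for by the fitted values."""
    y_bar = sum(y) / len(y)
    total = sum((yi - y_bar) ** 2 for yi in y)                   # raw variation in y
    leftover = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # variation left unexplained
    return (total - leftover) / total


# Hypothetical data and its least squares fitted values
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 1, 4, 3, 5]
y_hat_demo = [0.6 + 0.8 * xi for xi in x_demo]
r2 = r_squared(y_demo, y_hat_demo)
print(round(r2, 2))  # total = 10, leftover = 3.6, so R^2 = 0.64
```

For this straight-line fit the result equals the square of the correlation (0.8 squared), which is the special case that holds only for a linear model.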
Slide 19: Example 4.6
- Continued from Example 4.1
- From r we can tell that there is a strong, positive, linear relationship (the linear model fits well).
- From $R^2$ we can tell that our model fits well.
- $R^2 = r^2$ only with a linear model.
Slide 20: Residuals
- We hope that the fitted values, $\hat{y}_i$, will look like our data, $y_i$, except for small fluctuations explainable only as random variation.
- To assess this, we look at what are called residuals: $e_i = y_i - \hat{y}_i$.
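A residual is the observed value minus the fitted value, $e_i = y_i - \hat{y}_i$. The sketch below computes residuals for hypothetical data whose least squares line is $\hat{y} = 0.6 + 0.8x$; the function name is also hypothetical.

```python
def residuals(x, y, b0, b1):
    """Observed y minus fitted value b0 + b1*x, for each data point."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data; 0.6 and 0.8 are its least squares intercept and slope
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 1, 4, 3, 5]
e = residuals(x_demo, y_demo, 0.6, 0.8)
print([round(ei, 2) for ei in e])   # [0.6, -1.2, 1.0, -0.8, 0.4]
print(round(sum(e), 10))            # least squares residuals sum to zero
```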
Slide 21: Residuals
- When we are fitting a best line, we are minimizing the sum of squared residuals.
- These residuals should be patternless (randomly scattered).
Slide 22: Residuals
- To use residuals to check the fit, we need to check their pattern.
- Residual Plot: check for random scatter centered around zero (x-axis).
- Residuals plotted against x or $\hat{y}$
- Normal Probability Plot: check for a straight line
- i.e., the residuals follow a normal distribution
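The points of a normal probability plot can be sketched without any plotting library: sort the residuals and pair each with a standard normal quantile. The plotting position $(i - 0.5)/n$ used below is one common convention and is an assumption here, as are the function name and the demo residuals.

```python
from statistics import NormalDist


def normal_plot_points(resid):
    """Pair each sorted residual with a standard normal quantile."""
    n = len(resid)
    ordered = sorted(resid)
    # Plotting position (i - 0.5)/n for the i-th smallest residual (assumed convention)
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Hypothetical residuals for illustration
demo_residuals = [0.6, -1.2, 1.0, -0.8, 0.4]
points = normal_plot_points(demo_residuals)
for q, r in points:
    print(round(q, 3), r)
```

If the plotted pairs fall close to a straight line, the residuals are consistent with a normal distribution; strong curvature suggests otherwise.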
Slide 23: Residual Plot 1
[Figure panels: Actual Data; Residual Plot]
The residuals are randomly scattered around 0. Thus, the residual plot shows a good fit (a linear model is appropriate).
Slide 24: Residual Plot 2
[Figure panels: Actual Data; Residual Plot]
The residual plot shows a distinct curved pattern. Thus, a linear model is not appropriate (bad fit). The data are probably better described with a quadratic model.
Slide 25: Residual Plot 3
[Figure panels: Actual Data; Residual Plot]
The residual plot shows a cone-shaped pattern. There is more spread for larger fitted values (bad fit). The researcher may want to investigate the data collection process.
Slide 26: Residual Plot 4
- Residuals vs. the time order of the observations
- As time increases, the residuals increase.
- This pattern suggests that some variable changing in time is acting on y and has not been accounted for in fitting the values.
- After seeing a residual plot with this pattern, the researcher may want to inspect the process from which the data were obtained.
- Example: instrumental drift could produce a pattern like this.
[Figure: Ordered Residual Plot]
Slide 27: Normal Prob. Plot for Residuals
- If we really have random variation, residuals should be centered at zero and scattered evenly above and below zero.
- Histogram of residuals should look like the one following.
- To check that our residuals follow a normal distribution, we can use a normal probability plot.
Slide 28: Example 4.7
- Continued from Example 4.1
Slide 29: Example 4.7
[Figure: Residual Plot]
- Residual plot shows random scatter around 0.
- Normal probability plot follows a straight line.
- Conclusion: linear model fits well.
Slide 30: More on Model Fit
- It is often wise to check multiple forms of model fit.
- Each assessment may only be painting half the picture.
- Most common combination:
- $R^2$
- Residual plot
Slide 31: Example 4.8
- Trying to predict stopping distance (ft) given the current speed (mph).
[Figure: Distance vs. Speed]
Slide 32: Example 4.8
[Figure: Residual Plot]
- Although the data seemed linear, and the $R^2$ was extremely high, the residual plot shows a distinct curved pattern.
- Thus, the fit could be improved upon.
- Use quadratic instead of linear.
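The same situation (a very high $R^2$ together with curved residuals) can be reproduced with a small hypothetical data set. The values below follow y = x^2 exactly and are not the stopping-distance data; the helper name `fit_line` is also an assumption.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y-hat = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical curved data: y = x^2 for x = 1..10
x_demo = list(range(1, 11))
y_demo = [xi ** 2 for xi in x_demo]

b0, b1 = fit_line(x_demo, y_demo)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x_demo, y_demo)]

y_bar = sum(y_demo) / len(y_demo)
r2 = 1 - sum(e ** 2 for e in resid) / sum((yi - y_bar) ** 2 for yi in y_demo)

# Sign of each residual, in order of increasing x
signs = "".join("+" if e > 0 else "-" for e in resid)
print(round(r2, 3), signs)
```

The straight line accounts for roughly 95% of the variation, yet the residuals are positive at both ends and negative in the middle, which is the curved pattern a residual plot would reveal.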
Slide 33: Example 4.9
- Predicting win percentage based on rebounds/game for NBA teams.
- Residual plot: there is random scatter around 0.
- Linear model seems to fit well.
Slide 34: Example 4.9
- Although the residual plot indicates a good fit, $R^2 = 0.2014$, which is very low.
- From the scatterplot, we notice that the data are somewhat linear, but a very weak relationship exists (thus the low $R^2$).
Slide 35: Linear Regression Cautions
- r measures only linear relationships.
- Correlation does not imply causation.
- An example from Wikipedia: Since the 1950s, both the atmospheric CO2 level and crime levels have increased sharply. Thus, we would expect a large correlation between crime and CO2 levels. However, we would not assume that atmospheric CO2 causes crime.
- Both $R^2$ and r can be drastically affected by a few unusual data points.
- Example on page 137
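To illustrate how much a single unusual point can move r, the sketch below computes the correlation for a small hypothetical data set with and without one extreme observation. All values and names are made up for illustration; this is not the textbook example referred to above.

```python
import math


def sample_correlation(x, y):
    """Sample correlation r between paired lists x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with essentially no linear trend
x_base = [1, 2, 3, 4, 5]
y_base = [3, 1, 2, 3, 1]
r_without = sample_correlation(x_base, y_base)

# Same data plus one extreme point at (20, 20)
r_with = sample_correlation(x_base + [20], y_base + [20])

print(round(r_without, 2), round(r_with, 2))
```

Here the five base points give a weak negative r, while adding the single extreme point pushes r above 0.9, so a scatterplot should always accompany the summary number.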
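Slide 36: 4.2 Fitting Curves and Surfaces
- Use least squares
- Computation and interpretation become more complicated.
- Curve fitting
- A natural generalization of the linear equation $y \approx \beta_0 + \beta_1 x$ is the polynomial equation $y \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$.
- Computation of the estimates $b_0, b_1, \ldots, b_k$ is done by computer.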
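As a rough idea of what the computer does for a polynomial fit, the sketch below builds the normal equations for $\hat{y} = b_0 + b_1 x + \cdots + b_k x^k$ and solves them by Gaussian elimination. This is an illustrative approach under stated assumptions, not necessarily what any particular statistics package does; the function names and data are hypothetical, and forming normal equations directly can be numerically poor for high degrees.

```python
def solve_linear_system(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def fit_polynomial(x, y, degree):
    """Least squares coefficients [b0, b1, ..., bk] for a degree-k polynomial."""
    k = degree + 1
    # Normal equations: sums of x**(r+c) on the left, sums of x**r * y on the right
    a = [[sum(xi ** (r + c) for xi in x) for c in range(k)] for r in range(k)]
    b = [sum((xi ** r) * yi for xi, yi in zip(x, y)) for r in range(k)]
    return solve_linear_system(a, b)


# Hypothetical data lying exactly on y = 2 - 3x + 0.5x^2
x_demo = [0, 1, 2, 3, 4, 5]
y_demo = [2 - 3 * xi + 0.5 * xi ** 2 for xi in x_demo]
coeffs = fit_polynomial(x_demo, y_demo, degree=2)
print([round(c, 6) for c in coeffs])
```

Setting `degree=1` reduces this to the straight-line fit from Section 4.1.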
Slide 37: Surface Fitting
- In surface fitting we have more than 1 predictor variable (x's) with our response (y).
- Again, computation of the estimates is done by computer.
- Example: we want to predict brick strength (y) given a level of temperature and humidity (x's).
Slide 38: Interpretation
- Given $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$, the interpretation is as follows:
- b0 represents the value of y when x1 = 0 and x2 = 0.
- b1 represents the increase/decrease in y for every one-unit increase in x1, holding x2 constant.
- b2 represents the increase/decrease in y for every one-unit increase in x2, holding x1 constant.
- Note: these statements are general.
- You will need to do this within the context of the problem.
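The "holding constant" reading of the coefficients can be checked numerically. The coefficients below are hypothetical values chosen for illustration and are not fitted to any data in these slides; the function name is also an assumption.

```python
# Hypothetical coefficients for y-hat = B0 + B1*x1 + B2*x2 (illustration only)
B0, B1, B2 = 10.0, 2.0, -0.5


def predict(x1, x2):
    """Fitted value from the two-predictor equation."""
    return B0 + B1 * x1 + B2 * x2


# Increase x1 by one unit while x2 stays at 7: the change equals B1
change_x1 = predict(4, 7) - predict(3, 7)

# Increase x2 by one unit while x1 stays at 3: the change equals B2
change_x2 = predict(3, 8) - predict(3, 7)

# With both predictors at zero, the fitted value equals B0
at_origin = predict(0, 0)

print(change_x1, change_x2, at_origin)
```

A negative coefficient such as the hypothetical B2 here corresponds to the "decrease" half of the increase/decrease wording.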
Slide 39: Residual Plots
- Computed the same way as before
- Normal probability plot of residuals
- Residual plot against x
- Useful for checking a linear fit or curve fit
- Residual plot against fitted values
- Useful for any fit
- Use computer due to computational intensity