Chapter 4 Describing Relationships Between Variables - PowerPoint PPT Presentation

About This Presentation

Title:

Chapter 4 Describing Relationships Between Variables

Description:

Describe a relationship between two variables x and y. We will find the best linear fit of y ... Predicting win percentage based on rebounds/game for NBA teams. ... – PowerPoint PPT presentation

Slides: 40

Provided by: karl252

Transcript and Presenter's Notes

1
Chapter 4 Describing Relationships Between
Variables

4.1 Fitting a Least Squares Line
Describe a relationship between two variables x and y.
We will find the best linear fit of y versus x: $y \approx \beta_0 + \beta_1 x$.
$\beta_0$ and $\beta_1$ are unknown parameters.
Goal: find estimates b0 and b1 for the parameters $\beta_0$ and $\beta_1$.

2
Example 4.1

Eight batches of plastic are made and from each
batch one test item is molded and its hardness y
is measured at time x. The following are the 8
measurements

3
Example 4.1

Scatterplot: Is a linear relationship appropriate?
How do we find an equation for this line?

4
Least Squares Principle

We will fit a line given by $\hat{y} = b_0 + b_1 x$, where b0 and b1 are estimates for the parameters $\beta_0$ and $\beta_1$.
Note that a straight line will not pass perfectly through every one of our data points.
If we plug a data value xi into the equation $\hat{y} = b_0 + b_1 x$, the value we get for $\hat{y}_i$ will not be exactly our data value yi.

5
Least Squares Principle

Need to minimize the squared distances between the actual data values, yi, and the values given by our equation, $\hat{y}_i$.
Thus, we wish to minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
We are minimizing the sum of squared residuals (more soon).

6
Least Squares Principles

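How do we find estimates for b0 and b1?
Use calculus.
Plugging $\hat{y}_i = b_0 + b_1 x_i$ into the equation $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ yields $\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$.
Need to minimize $\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$ over b0 and b1.
Take partial derivatives and set equal to zero.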
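
For reference, the calculus step can be written out as below. The name $S(b_0, b_1)$ for the sum of squares is a label introduced here for convenience and does not come from the slides.

```latex
S(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2

\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0

\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0
```

Dividing each equation by -2 and moving the terms in b0 and b1 to one side gives two linear equations in the two unknowns.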

7
Normal Equations

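Taking partial derivatives with respect to b0 and b1 and setting them equal to zero yields what are known as the Normal Equations:
$n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$
$b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$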

8
Least Squares Estimates

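Solving these equations (details omitted) for b0 and b1 yields the following:
$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$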
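
A minimal sketch of how the least squares estimates might be computed, assuming the standard closed-form solution (slope = sum of cross-deviations over sum of squared x-deviations, intercept = mean of y minus slope times mean of x). The function name `least_squares_line` and the sample data are made up for illustration and are not the Example 4.1 measurements.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up data lying exactly on the line y = 1 + 2x
b0, b1 = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

Because the made-up points fall exactly on a line, the estimates recover that line's intercept and slope.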

9
Example 4.2

Continued from example 4.1
Find the least squares estimates given

10
Interpretation

b1 means that for every 1 unit increase in the x variable, the y variable changes, on average, by the value of b1 (an increase if b1 is positive, a decrease if it is negative).
Only true for a linear model.
b0 is the predicted value of y when x is equal to 0.
Not always meaningful.
Example: GPA vs. ACT score, b0 = -5.7 (a negative GPA is impossible).

11
Example 4.3

Continued from example 4.1
Eight batches of plastic are made and from each
batch one test item is molded and its hardness y
is measured at time x.
b1 = 2.433 means that for every 1 unit increase in time, the hardness increases, on average, by 2.433.
b0 = 153.060 means that when no time has passed, the predicted hardness is 153.060.

12
Prediction

We can predict y with the least squares line.
Simply insert a value of x into the least squares
equation to obtain a predicted value of y.
What is the predicted hardness for time x = 24?
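
As a sketch of the arithmetic, the estimates reported in Example 4.3 (b0 = 153.060, b1 = 2.433) can be plugged into the fitted line. The function name `predict_hardness` is an assumption for illustration.

```python
B0 = 153.060  # intercept reported in Example 4.3
B1 = 2.433    # slope reported in Example 4.3


def predict_hardness(time):
    """Predicted hardness from the fitted line y-hat = B0 + B1 * time."""
    return B0 + B1 * time


print(round(predict_hardness(24), 3))  # 153.060 + 2.433 * 24 = 211.452
```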

13
Extrapolation

Extrapolation is when a value of x beyond the range of our actual x observations is used to find a predicted $\hat{y}$.
Predicted values should not be used when extrapolating beyond the data set.
We do not know the behavior beyond the range of our x values.
Example: What is the predicted hardness for time x = 110?
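
One way to guard against accidental extrapolation is to compare the new x with the observed range before trusting the prediction. The sketch below uses the Example 4.3 estimates; the list `observed_times` is a hypothetical placeholder, not the actual Example 4.1 measurements, and `predict_checked` is a name invented here.

```python
def predict_checked(x_new, x_observed, b0, b1):
    """Return (prediction, is_extrapolation) for the line b0 + b1 * x."""
    lo, hi = min(x_observed), max(x_observed)
    is_extrapolation = not (lo <= x_new <= hi)
    return b0 + b1 * x_new, is_extrapolation


# Hypothetical observed times: placeholders, NOT the Example 4.1 data
observed_times = [16, 32, 48, 64, 80]

for t in (24, 110):
    y_hat, extrapolating = predict_checked(t, observed_times, 153.060, 2.433)
    note = "extrapolation, do not trust" if extrapolating else "within observed range"
    print(t, round(y_hat, 3), note)
```

The line still produces a number for x = 110, which is exactly why the check is useful: the formula itself gives no warning.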

14
Example 4.4
15
Example 4.5

What is the predicted hardness for time x = 57?
What is the predicted hardness for time x = 5?

16
Linear Fit

We have a fitted line, but does it fit well?
To check the fit:
Correlation
Coefficient of Determination
Residuals

17
Correlation

Correlation (r) quantifies the strength of the linear relationship between y and x.
r will always lie between -1 and 1.
r close to 0 indicates a weak linear relationship.
r close to either -1 or 1 indicates a strong linear relationship.
The sign of r indicates if the relationship is positive or negative.
So a positive value of r tells us that y tends to increase linearly in x, and a negative value of r tells us that y tends to decrease linearly in x.
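
The slides do not show the formula for r, so the sketch below assumes the usual sample correlation: the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations. The function name and data are illustrative only.

```python
from math import sqrt


def correlation(x, y):
    """Sample correlation r between x and y (always between -1 and 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Made-up data: a perfectly increasing line and a perfectly decreasing line
r_up = correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_down = correlation([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

The two made-up data sets sit at the extremes: a perfect increasing line and a perfect decreasing line.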

18
Coefficient of Determination

Coefficient of Determination ($R^2$): the fraction of raw variation in y accounted for by the fitted equation.
Quantifies the fit of other types of relationships (not just linear).
The value of $R^2$ will always lie between 0 and 1.
Values closer to 0 indicate a weak relationship between the variables.
Values closer to 1 indicate a strong relationship between the variables.
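
A small sketch of the "fraction of raw variation accounted for" idea, assuming the common definition: (total sum of squares minus residual sum of squares) divided by total sum of squares. The function name `r_squared` and the toy numbers are assumptions made here.

```python
def r_squared(y, y_hat):
    """Fraction of the raw variation in y accounted for by fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)               # raw variation in y
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # variation left over
    return (sst - sse) / sst


# Toy data: fitted values equal to the data, then fitted values equal to the mean
perfect = r_squared([1, 3, 5], [1, 3, 5])
useless = r_squared([1, 3, 5], [3, 3, 3])
print(perfect, useless)
```

The first call represents a fit that reproduces the data exactly, and the second a fit that does no better than predicting the mean of y every time.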

19
Example 4.6

Continued from example 4.1
From r we can tell that there is a strong, positive, linear relationship (the linear model fits well).
From $R^2$ we can tell that our model fits well.
$R^2 = r^2$ only with a linear model.

20
Residuals

We hope that the fitted values, $\hat{y}_i$, will look like our data, $y_i$, except for small fluctuations explainable only as random variation.
To assess this, we look at what are called residuals: $e_i = y_i - \hat{y}_i$.

21
Residuals

When we are fitting a best line, we are minimizing the sum of the squared residuals.
These residuals should be patternless (randomly scattered).

22
Residuals

To use residuals to check the fit, we need to check their pattern.
Residual Plot: check for random scatter centered around zero (the x-axis).
Residuals are plotted against x or $\hat{y}$.
Normal Probability Plot: check for a straight line,
i.e., the residuals follow a normal distribution.
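
The two checks can be prepared numerically before any plotting. This sketch computes residuals and the coordinates of a normal probability plot; the plotting positions (i - 0.5) / n are one common convention assumed here, and the function names and numbers are illustrative.

```python
from statistics import NormalDist


def residuals(x, y, b0, b1):
    """Observed y minus fitted y-hat = b0 + b1 * x, one value per data point."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def normal_plot_points(res):
    """Pair each sorted residual with a standard normal quantile.

    Uses plotting positions (i - 0.5) / n, an assumed convention.
    Points close to a straight line suggest roughly normal residuals.
    """
    n = len(res)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(res)))


# Made-up data and a made-up fitted line y-hat = 1 + 2x
res = residuals([0, 1, 2], [1.3, 2.9, 4.8], 1, 2)
points = normal_plot_points(res)
for q, e in points:
    print(round(q, 3), round(e, 3))
```

Plotting the residuals against x (or the fitted values) gives the residual plot, and plotting the pairs returned by `normal_plot_points` gives the normal probability plot.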

23
Residual Plot 1
[Figures: Actual Data; Residual Plot]
The residuals are randomly scattered around 0. Thus, the residual plot shows a good fit (a linear model is appropriate).
24
Residual Plot 2
[Figures: Actual Data; Residual Plot]
The residual plot shows a distinct curved pattern. Thus, a linear model is not appropriate (bad fit). The data are probably better described with a quadratic model.
25
Residual Plot 3
[Figures: Actual Data; Residual Plot]
The residual plot shows a cone-shaped pattern. There is more spread for larger fitted values, so the variability is not constant (bad fit). The researcher may want to investigate the data collection process.
26
Residual Plot 4

Residuals vs. the time order of the observations
As time increases, the residuals increase.
This pattern suggests that some variable changing in time is acting on y and has not been accounted for in fitting the values.
After seeing a residual plot with this pattern, the researcher may want to inspect the process from which the data were obtained.
Example: instrument drift could produce a pattern like this.

[Figure: Ordered Residual Plot]
27
Normal Prob. Plot for Residuals

If we really have random variation, the residuals should be centered at zero and scattered evenly above and below zero.
A histogram of the residuals should look like the one following.
To check that our residuals follow a normal distribution, we can use a normal probability plot.

28
Example 4.7

Continued from example 4.1

29
Example 4.7
Residual Plot

The residual plot shows random scatter around 0.
The normal probability plot follows a straight line.
Conclusion: the linear model fits well.

30
More on Model Fit

It is often wise to check multiple forms of model fit.
Each assessment may only be painting half the picture.
Most common combination:
$R^2$
Residual plot

31
Example 4.8

Trying to predict stopping distance (ft) given
the current speed (mph).

Distance vs. Speed
32
Example 4.8
Residual Plot

Although the data seemed linear, and the $R^2$ was extremely high, the residual plot shows a distinct curved pattern.
Thus, the fit could be improved upon.
Use a quadratic model instead of a linear one.
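
The same situation can be reproduced with a small sketch. The data below are hypothetical (distance set proportional to speed squared) and are not the Example 4.8 measurements; the point is only that a straight line can score a high R-squared while its residuals still show a curved pattern.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data (NOT the Example 4.8 measurements): distance = speed^2 / 20
speed = [10, 20, 30, 40, 50, 60]
distance = [s ** 2 / 20 for s in speed]

b0, b1 = fit_line(speed, distance)
resid = [d - (b0 + b1 * s) for s, d in zip(speed, distance)]

mean_d = sum(distance) / len(distance)
sst = sum((d - mean_d) ** 2 for d in distance)
sse = sum(e ** 2 for e in resid)
r2 = 1 - sse / sst

# Sign of each residual in order of increasing speed
signs = "".join("+" if e > 0 else "-" for e in resid)
print(round(r2, 3), signs)
```

With these made-up numbers, R-squared comes out above 0.95, yet the residuals are positive at both ends and negative in the middle, which is the curved pattern a residual plot would reveal.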

33
Example 4.9

Predicting win percentage based on rebounds/game
for NBA teams.
Residual plot: there is random scatter around 0.
The linear model seems to fit well.

34
Example 4.9

Although the residual plot indicates a good fit, $R^2 = 0.2014$, which is very low.
From the scatterplot, we notice that the data are somewhat linear, but a very weak relationship exists (thus the low $R^2$).

35
Linear Regression Cautions

r measures only linear relationships.
Correlation does not imply causation.
An example from Wikipedia: Since the 1950s, both the atmospheric CO2 level and crime levels have increased sharply. Thus, we would expect a large correlation between crime and CO2 levels. However, we would not assume that atmospheric CO2 causes crime.
Both $R^2$ and r can be drastically affected by a few unusual data points.
Example on page 137

36
4.2 Fitting Curves and Surfaces

Use least squares.
Computation and interpretation become more complicated.
Curve fitting
A natural generalization of the linear equation is the polynomial equation $y \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$.
Computation of the estimates b0, b1, ..., bk is done by computer.
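
To give a feel for what the computer does, here is a minimal sketch that fits a polynomial by building and solving the normal equations directly. This is an assumption about one workable approach, not the method any particular package uses; statistical software typically relies on more numerically stable techniques. All names and data are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_polynomial(x, y, degree):
    """Least squares coefficients [b0, b1, ..., bk] from the normal equations."""
    k = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(k)] for i in range(k)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(a, b)


# Made-up data lying exactly on y = 1 + 2x + 3x^2
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * v + 3 * v ** 2 for v in xs]
coeffs = fit_polynomial(xs, ys, 2)
print([round(c, 6) for c in coeffs])
```

Setting `degree` to 1 reduces this to the straight-line fit from Section 4.1.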

37
Surface Fitting

In surface fitting we have more than 1 predictor variable (x's) with our response (y).
Again, computation of the estimates is done by computer.
Example: we want to predict brick strength (y) given a level of temperature and humidity (x's).

38
Interpretation

Given $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$, the interpretation is as follows:
b0 represents the predicted value of y when x1 = 0 and x2 = 0.
b1 represents the increase/decrease in y for every one unit increase in x1, holding x2 constant.
b2 represents the increase/decrease in y for every one unit increase in x2, holding x1 constant.
Note: these statements are general.
You will need to interpret them within the context of the problem.
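
A sketch of a two-predictor fit and of the "holding constant" interpretation, assuming the closed-form solution of the normal equations on mean-centered data. The function name `fit_two_predictors` and the temperature, humidity, and strength values are invented for illustration and are not real brick data.

```python
def fit_two_predictors(x1, x2, y):
    """Least squares (b0, b1, b2) for y-hat = b0 + b1 * x1 + b2 * x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Invented data lying exactly on strength = 10 + 3 * temperature - 2 * humidity
temperature = [0, 1, 0, 1, 2]
humidity = [0, 0, 1, 1, 1]
strength = [10 + 3 * t - 2 * h for t, h in zip(temperature, humidity)]

b0, b1, b2 = fit_two_predictors(temperature, humidity, strength)


def predict(t, h):
    return b0 + b1 * t + b2 * h


# Raising temperature by one unit with humidity held fixed changes y-hat by b1
change = predict(2, 1) - predict(1, 1)
print(round(b0, 6), round(b1, 6), round(b2, 6), round(change, 6))
```

The final comparison holds humidity fixed and raises temperature by one unit, so the change in the prediction equals b1, which is what the general interpretation statement says.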

39
Residual Plots