Title: Describing Relationships: Regression, Prediction, and Causation
1Chapter 15
- Describing Relationships: Regression,
Prediction, and Causation
2Linear Regression
- Objective: To quantify the linear relationship
between an explanatory variable and a response
variable.
- We can then predict the average response for all
subjects with a given value of the explanatory
variable.
- Regression equation: y = a + bx
- x is the value of the explanatory variable
- y is the average value of the response variable
- Note that a and b are just the intercept and
slope of a straight line.
- Note that r and b are not the same thing, but
their signs will agree.
3Thought Question 1
How would you draw a line through the points?
How do you determine which line fits best?
4Linear Equations
High School Teacher
5The Linear Model
- Remember from Algebra that a straight line can be
written as: y = mx + b
- In Statistics we use a slightly different
notation: ŷ = b0 + b1x
- We write ŷ (read "y-hat") to emphasize that the
points that satisfy this equation are just our
predicted values, not the actual data values.
6Example: Fat Versus Protein
- The following is a scatterplot of total fat
versus protein for 30 items on the Burger King
menu.
7Residuals
- The model won't be perfect, regardless of the
line we draw.
- Some points will be above the line and some will
be below.
- The estimate made from a model is the predicted
value (denoted as ŷ, read "y-hat").
8Residuals (cont.)
- The difference between the observed value and its
associated predicted value is called the
residual.
- To find the residuals, we always subtract the
predicted value from the observed one:
residual = observed - predicted = y - ŷ
9Residuals (cont.)
- A negative residual means the predicted value is
too big (an overestimate).
- A positive residual means the predicted value is
too small (an underestimate).
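The sign convention can be checked with a short Python sketch. It uses the Burger King line quoted in these slides (predicted fat = 6.8 + 0.97 x protein); the two menu items and the function names are invented for illustration and are not real menu data.

```python
def predict_fat(protein_g):
    """Predicted fat (g) from the slides' line: 6.8 + 0.97 * protein."""
    return 6.8 + 0.97 * protein_g


def residual(observed, predicted):
    """Residual = observed value minus predicted value."""
    return observed - predicted


def describe(e):
    """Translate the sign of a residual into words."""
    if e > 0:
        return "prediction too small (underestimate)"
    if e < 0:
        return "prediction too big (overestimate)"
    return "prediction exactly right"


# Hypothetical (protein g, observed fat g) pairs, invented for illustration.
items = [(10, 20.0), (30, 30.0)]

for protein_g, fat_g in items:
    e = residual(fat_g, predict_fat(protein_g))
    print(f"protein = {protein_g} g: residual = {e:+.1f} g, {describe(e)}")
```

The first invented item sits above the line (positive residual) and the second sits below it (negative residual).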
10Best Fit Means Least Squares
- Some residuals are positive, others are negative,
and, on average, they cancel each other out.
- So, we can't assess how well the line fits by
adding up all the residuals.
- Similar to what we did with deviations, we square
the residuals and add the squares.
- The smaller the sum, the better the fit.
- The line of best fit is the line for which the
sum of the squared residuals is smallest.
11Least Squares
- Used to determine the best line
- We want the line to be as close as possible to
the data points in the vertical (y) direction
(since that is what we are trying to predict)
- Least Squares: use the line that minimizes the
sum of the squares of the vertical distances of
the data points from the line
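As a rough illustration of why squared residuals are used, the sketch below compares three candidate lines on a tiny made-up data set. The data, the candidate lines, and the function names are assumptions chosen for illustration; the third candidate is the least squares line for these particular points.

```python
# Hypothetical data points, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def residuals(intercept, slope):
    """Vertical distances: observed y minus the line's predicted y."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(intercept, slope):
    """The quantity that least squares makes as small as possible."""
    return sum(e ** 2 for e in residuals(intercept, slope))


# Candidate lines as (intercept, slope).
candidates = {
    "flat line at the mean of y": (4.0, 0.0),
    "line drawn by eye": (2.0, 0.7),
    "least squares line": (2.2, 0.6),
}

for name, (b0, b1) in candidates.items():
    plain_sum = sum(residuals(b0, b1))
    squared_sum = sum_squared_residuals(b0, b1)
    print(f"{name}: sum of residuals = {plain_sum:.2f}, "
          f"sum of squared residuals = {squared_sum:.2f}")
```

The flat line and the least squares line both have residuals that add up to essentially zero, so the plain sum cannot tell them apart. Only the sum of squares shows which line is closer to the points.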
12The Linear Model (cont.)
- We write b1 and b0 for the slope and intercept of
the line. The b's are called the coefficients of
the linear model.
- The coefficient b1 is the slope, which tells us
how rapidly ŷ changes with respect to x. The
coefficient b0 is the intercept, which tells
where the line hits (intercepts) the y-axis.
13The Least Squares Line
- In our model, we have a slope (b1).
- The slope is built from the correlation and the
standard deviations: b1 = r(sy / sx)
- Our slope is always in units of y per unit of x.
- The slope has the same sign as the correlation
coefficient.
14The Least Squares Line (cont.)
- In our model, we also have an intercept (b0).
- The intercept is built from the means and the
slope: b0 = ȳ - b1x̄, where x̄ and ȳ are the means
of x and y.
- Our intercept is always in units of y.
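A minimal sketch of how the slope and intercept can be built from summary statistics, using the standard least squares formulas b1 = r(sy / sx) and b0 = ȳ - b1x̄. The data set is made up for illustration, and the variable names are this sketch's own.

```python
import statistics

# Hypothetical data points, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs (divide by n - 1)

# Correlation: sum of products of z-scores, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

b1 = r * s_y / s_x        # slope, in units of y per unit of x
b0 = y_bar - b1 * x_bar   # intercept, in units of y

print(f"r = {r:.4f}, slope b1 = {b1:.2f}, intercept b0 = {b0:.2f}")
```

Because the standard deviations are always positive, the slope computed this way necessarily has the same sign as r.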
15Example
- Fill in the missing information in the table
below
16Interpretation of the Slope and Intercept
- The slope indicates the amount by which ŷ
changes when x changes by one unit.
- The intercept is the value of ŷ when x = 0. It is
not always meaningful.
17Example
- The regression line for the Burger King data is
predicted fat = 6.8 + 0.97(protein)
- Interpret the slope and the intercept.
- Slope: For every one-gram increase in protein,
the predicted fat content increases by 0.97 g.
- Intercept: A BK meal that has 0 g of protein
is predicted to contain 6.8 g of fat.
18Thought Question 2
From a long-term study on several families,
researchers constructed a scatterplot of the
cholesterol level of a child at age 50 versus the
cholesterol level of the father at age 50. You
know the cholesterol level of your best friend's
father at age 50. How could you use this
scatterplot to predict what your best friend's
cholesterol level will be at age 50?
19Predictions
- In predicting a value of y based on some given
value of x ...
- 1. If there is not a linear correlation, the
best predicted y-value is ȳ (the mean of the
observed y-values).
- 2. If there is a linear correlation, the best
predicted y-value is found by substituting the
x-value into the regression equation.
20Example: Fat Versus Protein
- The regression line for the Burger King data
fits the data well.
- The equation is
predicted fat = 6.8 + 0.97(protein)
- The predicted fat content for a BK Broiler
chicken sandwich that contains 30 g of protein is
6.8 + 0.97(30) = 35.9 grams of fat.
21Prediction via Regression Line
Husband and Wife Ages
Hand, et al., A Handbook of Small Data Sets,
London: Chapman and Hall
- The regression equation is y = 3.6 + 0.97x
- y is the average age of all husbands who have
wives of age x
- For all women aged 30, we predict the average
husband age to be 32.7 years:
3.6 + (0.97)(30) = 32.7 years
- Suppose we know that an individual wife's age is
30. What would we predict her husband's age to
be?
22The Least Squares Line (cont.)
- Since regression and correlation are closely
related, we need to check the same conditions for
regressions as we did for correlations:
- Quantitative Variables Condition
- Straight Enough Condition
- Outlier Condition
23Guidelines for Using The Regression Equation
- 1. If there is no linear correlation, don't
use the regression equation to make predictions.
- 2. When using the regression equation for
predictions, stay within the scope of the
available sample data.
- 3. A regression equation based on old data is
not necessarily valid now.
- 4. Don't make predictions about a population
that is different from the population from
which the sample data were drawn.
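One way to make guideline 2 concrete is to have the prediction function refuse x-values outside the range of the sample. The sketch below is only an illustration: the function name is hypothetical, and the observed age range of 18 to 65 is an assumption made up for this example, not taken from the husband and wife data set.

```python
def predict_within_scope(x, intercept, slope, x_min, x_max):
    """Return intercept + slope * x, but only for x inside the sampled range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the observed range [{x_min}, {x_max}]; "
            "the regression equation should not be used here"
        )
    return intercept + slope * x


# Husband/wife line from the slides: y = 3.6 + 0.97x.
# ASSUMPTION: wives' ages in the sample ran from 18 to 65 (invented range).
INTERCEPT, SLOPE = 3.6, 0.97
AGE_MIN, AGE_MAX = 18, 65

print(round(predict_within_scope(30, INTERCEPT, SLOPE, AGE_MIN, AGE_MAX), 1))

try:
    predict_within_scope(90, INTERCEPT, SLOPE, AGE_MIN, AGE_MAX)
except ValueError as err:
    print("Refused:", err)
```

Raising an error rather than returning a number is a design choice for this sketch; a warning would also be reasonable when mild extrapolation is acceptable.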
24Definitions
- Marginal Change - Refers to the slope: the
amount the response variable changes when the
explanatory variable changes by one unit.
- Outlier - A point lying far away from the other
data points.
- Influential Point - An outlier that has the
potential to change the regression line.
25Residuals Revisited
- Residuals help us to see whether the model makes
sense.
- When a regression model is appropriate, nothing
interesting should be left behind.
- After we fit a regression model, we usually plot
the residuals in the hope of finding... nothing.
26Residual Plot Analysis
If a residual plot does not reveal any pattern,
the regression equation is a good representation
of the association between the two variables. If
a residual plot reveals some systematic pattern,
the regression equation is not a good
representation of the association between the two
variables.
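A small sketch of what a "systematic pattern" can look like. The data below are deliberately curved (y = x squared) and entirely made up; fitting a straight line to them leaves residuals whose signs form a U shape instead of scattering randomly. The helper `fit_line` is this sketch's own name and uses the usual least squares formulas.

```python
import statistics

# Hypothetical, deliberately curved data (y = x squared).
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]


def fit_line(xs, ys):
    """Least squares intercept and slope for paired data."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Reduce each residual to its sign to make the pattern easy to see.
signs = "".join("+" if e > 0 else "-" if e < 0 else "0" for e in residuals)
print("residuals:", residuals)
print("sign pattern:", signs)
```

Positive residuals at both ends and negative ones in the middle indicate that the straight line misses a bend in the data, even though the correlation for these points is high.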
27Residuals Revisited (cont.)
- The residuals for the BK menu regression look
appropriately boring.
[Figure: residual plot for the BK menu regression]
28Coefficient of Determination (R²)
- Measures usefulness of regression prediction
- R² (or r², the square of the correlation)
measures the percentage of the variation in the
values of the response variable (y) that is
explained by the regression line
- r = 1: R² = 1; regression line explains all (100%)
of the variation in y
- r = 0.7: R² = 0.49; regression line explains almost
half (50%) of the variation in y
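The "fraction of variation explained" reading can be checked numerically. This sketch computes R² as 1 minus the ratio of leftover (residual) variation to total variation, then compares it with the square of the correlation. The data set and all names are assumptions made up for illustration.

```python
import math
import statistics

# Hypothetical data points, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Variation left over after using the line (sum of squared residuals).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / syy
r = sxy / math.sqrt(sxx * syy)

print(f"R^2 from variation explained: {r_squared:.3f}")
print(f"r squared:                    {r ** 2:.3f}")
```

For a least squares line with one explanatory variable the two numbers agree, which is why R² can be read either as the squared correlation or as the share of variation in y accounted for by the line.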
29R² (cont.)
- Along with the slope and intercept for a
regression, you should always report R² so that
readers can judge for themselves how successful
the regression is at fitting the data.
- Statistics is about variation, and R² measures
the success of the regression model in terms of
the fraction of the variation of y accounted for
by the regression.
30A Caution: Beware of Extrapolation
- Sarah's height was plotted against her age
- Can you predict her height at age 42 months?
- Can you predict her height at age 30 years (360
months)?
31A Caution: Beware of Extrapolation
- Regression line: y = 71.95 + 0.383x
(x = age in months, y = height in cm)
- Height at age 42 months? y = 88 cm.
- Height at age 30 years? y = 209.8 cm.
- She is predicted to be 6' 10.5" at age 30.
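The arithmetic behind both predictions can be reproduced with a few lines of Python. The regression line is the one quoted on the slide; the function names and the centimeter-to-inch conversion helper are this sketch's own additions.

```python
CM_PER_INCH = 2.54


def predict_height_cm(age_months):
    """Line from the slide: height (cm) = 71.95 + 0.383 * age (months)."""
    return 71.95 + 0.383 * age_months


def cm_to_feet_inches(cm):
    """Convert centimeters to (whole feet, remaining inches)."""
    total_inches = cm / CM_PER_INCH
    feet = int(total_inches // 12)
    return feet, total_inches - 12 * feet


for age in (42, 360):
    cm = predict_height_cm(age)
    feet, inches = cm_to_feet_inches(cm)
    print(f"age {age} months: {cm:.1f} cm (about {feet} ft {inches:.1f} in)")
```

The equation happily returns a number for any age. The calculation for 360 months is carried out correctly; what fails is the assumption that the straight-line growth pattern seen in childhood continues for decades.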
32Correlation Does Not Imply Causation
- Even very strong correlations may not correspond
to a real causal relationship.
33Evidence of Causation
- A properly conducted experiment establishes the
connection
- Other considerations:
  - A reasonable explanation for a cause and effect
  exists
  - The connection happens in repeated trials
  - The connection happens under varying conditions
  - Potential confounding factors are ruled out
  - Alleged cause precedes the effect in time
34Evidence of Causation
- An observed relationship can be used for
prediction without worrying about causation as
long as the patterns found in past data continue
to hold true.
- We must make sure that the prediction makes
sense.
- We must be very careful of extreme extrapolation.
35Reasons Two Variables May Be Related (Correlated)
- Explanatory variable causes change in response
variable
- Response variable causes change in explanatory
variable
- Explanatory variable may be a cause, but is not the
sole cause of changes in the response variable
- Confounding variables may exist
- Both variables may result from a common cause
  - such as both variables changing over time
- The correlation may be merely a coincidence
36Response causes Explanatory
- Explanatory: Hotel advertising dollars
- Response: Occupancy rate
- Positive correlation? Does more advertising lead
to an increased occupancy rate?
- Actual correlation is negative: lower occupancy
leads to more advertising
37Explanatory is not Sole Contributor
- Explanatory: Consumption of barbecued foods
- Response: Incidence of stomach cancer
- Barbecued foods are known to contain carcinogens,
but other lifestyle choices may also contribute
38Common Response (both variables change due to
common cause)
- Explanatory: Divorce among men
- Response: Percent abusing alcohol
- Both may result from an unhappy marriage.
39Both Variables are Changing Over Time
- Both divorces and suicides have increased
dramatically since 1900.
- Are divorces causing suicides?
- Are suicides causing divorces?
- The population has increased dramatically since
1900 (causing both to increase).
- Better to investigate: Has the rate of divorce
or the rate of suicide changed over time?
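The difference between counts and rates can be sketched with entirely made-up numbers. Nothing below is real data: the populations, the two event counts, and the helper name are all invented purely to show the arithmetic of turning counts into rates.

```python
def rate_per_100k(count, population):
    """Events per 100,000 people."""
    return count * 100_000 / population


# Entirely hypothetical numbers for illustration; not real statistics.
years = [1900, 1950, 2000]
population = [1_000_000, 2_000_000, 4_000_000]
event_a_counts = [1_000, 2_000, 4_000]
event_b_counts = [100, 200, 400]

for year, pop, a, b in zip(years, population, event_a_counts, event_b_counts):
    print(f"{year}: counts A = {a}, B = {b}; "
          f"rates per 100k A = {rate_per_100k(a, pop):.0f}, "
          f"B = {rate_per_100k(b, pop):.0f}")
```

In this invented scenario both counts quadruple, so they are strongly correlated with each other, yet neither rate changes at all: the growing population alone produces the shared upward trend.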
40The Relationship May Be Just a Coincidence
- We will see some strong correlations (or
apparent associations) just by chance, even when
the variables are not related in the population
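A quick simulation suggests how easily this happens with small samples. The sketch below generates many pairs of completely unrelated random variables, five observations each, and records the strongest correlation found. The sample size, the number of pairs, and the seed are arbitrary choices for illustration.

```python
import math
import random
import statistics


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is repeatable
N_POINTS = 5             # observations per variable (small on purpose)
N_PAIRS = 500            # number of unrelated variable pairs examined

strongest = 0.0
for _ in range(N_PAIRS):
    xs = [rng.gauss(0, 1) for _ in range(N_POINTS)]
    ys = [rng.gauss(0, 1) for _ in range(N_POINTS)]   # independent of xs
    r = correlation(xs, ys)
    if abs(r) > abs(strongest):
        strongest = r

print(f"strongest correlation among {N_PAIRS} unrelated pairs: r = {strongest:.3f}")
```

Every pair is independent by construction, so any strong correlation that turns up is pure coincidence. Looking at many pairs with few observations each makes such coincidences very likely.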
41Coincidence (?)
Vaccines and Brain Damage
- A required whooping cough vaccine was blamed for
seizures that caused brain damage
- This led to reduced production of the vaccine
(due to lawsuits)
- A study of 38,000 children found no evidence for
the accusations (reported in the New York Times)
- People confused association with
cause-and-effect
- Virtually every kid received the vaccine, so it was
inevitable that, by chance, brain damage caused
by other factors would occasionally occur in a
recently vaccinated child
42Key Concepts
- Least Squares Regression Equation
- R²
- Correlation does not imply causation
- Confirming causation
- Reasons variables may be correlated