Title: Simple linear regression (Review, 18.5, 18.8)
1. Lecture 19
- Simple linear regression (Review, 18.5, 18.8)
- Midterm 2: Wed, April 2, 6-8pm
- Extra office hours: Tue, 2pm-5pm
- Regular office hours: Mon 12-1pm, Tue 1-2pm
- Exam Review
2. Review of Regression Analysis
- Goal: Estimate E(Y|X), the regression function
- Uses:
- E(Y|X) is a good prediction of Y based on X
- E(Y|X) describes the relationship between Y and X
- Simple linear regression model: E(Y|X) is a
straight line (the regression line)
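3. Simple Linear Regression Model
- The data $(x_1, y_1), \dots, (x_n, y_n)$ are assumed
to be a realization of
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the errors $\varepsilon_1, \dots, \varepsilon_n$ are independent $N(0, \sigma_\varepsilon^2)$.
- $\beta_0, \beta_1, \sigma_\varepsilon$ are the unknown parameters of the
model. The objective of regression is to estimate
them.
- $\beta_1$, the slope, is the amount that Y changes on
average for each one unit increase in X.
- $\sigma_\varepsilon$ is the
standard deviation of the amount by which Y
differs from E(Y|X), i.e., the standard deviation of
the errors. Its estimate is called the standard error of estimate.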
4. Estimation of Regression Line
- We estimate the regression line $E(Y|X) = \beta_0 + \beta_1 X$ by
the least squares line $\hat{y} = b_0 + b_1 x$, the line
that minimizes the sum of squared prediction
errors for the data.
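The least squares criterion can be sketched in a few lines of Python. The data values and the helper names `least_squares` and `sse` are made up for illustration; the slope and intercept use the usual closed-form solution, with $b_1$ the ratio of the cross-product sum to the sum of squares of x and $b_0 = \bar{y} - b_1 \bar{x}$.

```python
# Hypothetical toy data (not from the lecture): x = predictor, y = response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared prediction errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared prediction errors for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = least_squares(x, y)
print(f"least squares line: y-hat = {b0:.3f} + {b1:.3f} x")

# Nudging the intercept or slope away from (b0, b1) should not lower the SSE.
print(sse(x, y, b0, b1) <= sse(x, y, b0 + 0.1, b1))
print(sse(x, y, b0, b1) <= sse(x, y, b0, b1 - 0.05))
```

Comparing the SSE of the fitted line with that of nearby lines is only a spot check of the minimizing property, not a proof.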
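5. Fitted Values and Residuals
- The least squares line decomposes the data into
two parts, $y_i = \hat{y}_i + e_i$, where
- $\hat{y}_i = b_0 + b_1 x_i$
- $e_i = y_i - \hat{y}_i$
- $\hat{y}_1, \dots, \hat{y}_n$ are called the fitted or predicted
values.
- $e_1, \dots, e_n$ are called the residuals.
- The residuals are estimates of
the errors $\varepsilon_1, \dots, \varepsilon_n$.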
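As a small illustration of this decomposition, the sketch below fits a least squares line to made-up data and splits each observation into a fitted value and a residual. The data and variable names are assumptions for illustration only.

```python
# Hypothetical toy data (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]                 # fitted (predicted) values
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # residual = observed - fitted

for xi, yi, fi, ei in zip(x, y, fitted, residuals):
    print(f"x={xi:.0f}  y={yi:5.2f}  fitted={fi:5.2f}  residual={ei:+.2f}")

# With an intercept in the model, least squares residuals sum to zero
# (up to floating-point rounding).
print(abs(sum(residuals)) < 1e-9)
```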
6. Estimating $\sigma_\varepsilon$
- The standard error of estimate (root mean
squared error), $s_\varepsilon$, is an estimate of $\sigma_\varepsilon$, the standard deviation of the errors.
- The standard error of estimate is basically
the standard deviation of the residuals.
- $s_\varepsilon$ measures how useful the simple linear
regression model is for prediction.
- If the simple regression model holds, then
approximately
- 68% of the data will lie within one $s_\varepsilon$ of the
LS line.
- 95% of the data will lie within two $s_\varepsilon$ of the
LS line.
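The 68% and 95% rules of thumb can be checked on simulated data. The sketch below is only an illustration: the generating intercept, slope, and error standard deviation are invented, and the standard error of estimate is computed as the square root of SSE divided by n - 2.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical model used only to simulate data (not the lecture's example).
TRUE_B0, TRUE_B1, TRUE_SIGMA = 5.0, 1.5, 2.0
n = 2000
x = [random.uniform(0, 10) for _ in range(n)]
y = [TRUE_B0 + TRUE_B1 * xi + random.gauss(0, TRUE_SIGMA) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Standard error of estimate: sqrt(SSE / (n - 2)).
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Share of observations within one and two s_e of the least squares line.
within_one = sum(abs(e) <= s_e for e in residuals) / n
within_two = sum(abs(e) <= 2 * s_e for e in residuals) / n
print(f"s_e = {s_e:.3f} (simulated sigma was {TRUE_SIGMA})")
print(f"within 1 s_e: {within_one:.1%}, within 2 s_e: {within_two:.1%}")
```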
7. Example 18.2 in JMP
8. SEs of Parameter Estimates
- From the JMP output, read off the standard errors of the slope and intercept estimates.
- Imagine yourself taking repeated samples of the
prices of cars with the odometer readings $x_1, \dots, x_n$
from the population.
- For each sample, you could estimate the
regression line by least squares. Each time, the
least squares line would be a little different.
- The standard errors estimate how much the least
squares estimates of the slope and intercept
would vary over these repeated samples.
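The repeated-sampling thought experiment can be imitated by simulation. In the sketch below the population line, error standard deviation, and x values are all invented. It refits the slope on 1000 simulated samples taken at the same x values and compares the spread of those slopes with the textbook expression $\sigma_\varepsilon / \sqrt{\sum (x_i - \bar{x})^2}$, which the reported standard error of the slope estimates.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population model and fixed x values (assumptions for illustration).
TRUE_B0, TRUE_B1, TRUE_SIGMA = 10.0, -0.5, 1.0
x = [float(i) for i in range(1, 21)]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

def fit_slope(y):
    """Least squares slope for responses y observed at the fixed x values."""
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

# Draw many samples of y at the same x values and refit the slope each time.
slopes = []
for _ in range(1000):
    y = [TRUE_B0 + TRUE_B1 * xi + random.gauss(0, TRUE_SIGMA) for xi in x]
    slopes.append(fit_slope(y))

empirical_sd = statistics.stdev(slopes)
theoretical_sd = TRUE_SIGMA / math.sqrt(sxx)  # sd of the slope when sigma is known
print(f"sd of slopes over repeated samples: {empirical_sd:.4f}")
print(f"sigma / sqrt(Sxx):                  {theoretical_sd:.4f}")
```

In practice only one sample is available, so software plugs the standard error of estimate in for the unknown error standard deviation.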
9. Cause-and-effect Relationships
- A test of whether the slope is zero is a test of
whether there is a linear relationship between x
and y in the observed data, i.e., is a change in
x associated with a change in y.
- This does not test whether a change in x causes a
change in y. Such a relationship can only be
established based on a carefully controlled
experiment or extensive subject matter knowledge
about the relationship.
10. Example of Pitfall
- A researcher measures the number of television
sets per person, X, and the average life expectancy,
Y, for the world's nations. The regression line
has a positive slope: nations with many TV sets
have higher life expectancies. Could we lengthen
the lives of people in Rwanda by shipping them TV
sets?
11. Using the Regression Equation
- Before using the regression model, we need to
assess how well it fits the data.
- If we are satisfied with how well the model fits
the data, we can use it to predict the values of
y.
- To make a prediction we use
- Point prediction, and
- Interval prediction
12. Point Prediction
- Example 18.7
- Predict the selling price of a three-year-old
Taurus with 40,000 miles on the odometer (Example
18.2).
- It is predicted that a car with 40,000 miles would
sell for $14,575.
- How close is this prediction to the real price?
13. Interval Estimates
- Two intervals can be used for differing purposes:
- Prediction interval: predicts y for a given
value of x, i.e., y | x.
- Confidence interval: estimates the average y for
a given x, i.e., E(y | x).
14. Interval Estimates, Example
- Example 18.7 - continued
- Provide an interval estimate for the bidding
price on a Ford Taurus with 40,000 miles on the
odometer.
- Two types of predictions are required:
- A prediction for a specific car
- An estimate for the average price per car
15. Prediction vs. Confidence Intervals
- The prediction interval attempts to cover future
observations at a given value x with probability
0.95 (e.g.).
- The confidence interval attempts to cover means
of observations at a given value x with
probability 0.95 (e.g.). The means should be
thought of as arising in potential alternative
studies whose data were collected the same way as
in our study.
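16. Interval Estimates, Example
- Solution
- A prediction interval provides the price estimate
for a single car. Let $x_g = 40{,}000$ miles. The interval is
$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$,
where $s_\varepsilon$ is the standard error of estimate, $s_x^2$ is the sample variance of x, and $t_{.025,98} \approx 1.984$.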
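17. Interval Estimates, Example
- Solution continued
- A confidence interval provides the estimate of
the mean price per car for a Ford Taurus with
a 40,000-mile reading on the odometer.
- The confidence interval (95%) is
$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$,
where $s_\varepsilon$ is the standard error of estimate and $s_x^2$ is the sample variance of x.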
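To see how the two intervals differ numerically, the sketch below computes both at the same x value using the standard textbook formulas: half-width $t \cdot s_\varepsilon \sqrt{1 + 1/n + (x_g - \bar{x})^2 / \sum (x_i - \bar{x})^2}$ for a single observation, and the same expression without the leading 1 for the mean. The data are simulated from made-up values, not taken from Example 18.2, and the critical value 1.984 for 98 degrees of freedom is hard-coded from a t table.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical data loosely shaped like a used-car example:
# x = odometer (thousands of miles), y = price (thousands of dollars).
# The generating values below are made up, not the lecture's estimates.
n = 100
x = [random.uniform(20, 50) for _ in range(n)]
y = [17.0 - 0.06 * xi + random.gauss(0, 0.3) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_x^2
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
s_e = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))

T_CRIT = 1.984  # t critical value for alpha/2 = .025 and 98 d.f., from a t table
x_g = 40.0      # given x value: 40 thousand miles
y_hat = b0 + b1 * x_g

distance_term = 1 / n + (x_g - x_bar) ** 2 / sxx
pi_half = T_CRIT * s_e * math.sqrt(1 + distance_term)  # single observation
ci_half = T_CRIT * s_e * math.sqrt(distance_term)      # mean response

print(f"point prediction:        {y_hat:.3f}")
print(f"95% prediction interval: ({y_hat - pi_half:.3f}, {y_hat + pi_half:.3f})")
print(f"95% confidence interval: ({y_hat - ci_half:.3f}, {y_hat + ci_half:.3f})")
```

The prediction interval is always the wider of the two, because it has to allow for the error of an individual car as well as the uncertainty in the estimated line.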
18. Regression Diagnostics
- The three conditions required for the validity of
the regression analysis are:
- The error variable is normally distributed.
- The error variance is constant for all values of
x.
- The errors are independent of each other.
19. Outliers
- An outlier is an observation that is unusually
small or large.
- Several possibilities need to be investigated
when an outlier is observed:
- There was an error in recording the value.
- The point does not belong in the sample.
- The observation is valid.
- Identify outliers from the scatter diagram.
- It is customary to suspect an observation is an
outlier if |standardized residual| > 2.
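The rule of thumb can be sketched as follows. The data are made up, with one planted unusual value. The standardized residual is taken here to be the residual divided by the standard error of estimate, which is an assumption about the definition, since some software also adjusts each residual for its leverage.

```python
import math

# Hypothetical data: roughly y = 2x, with one planted unusual value at x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 20.0, 12.1, 13.8, 16.2, 17.9, 20.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standardized residual = residual / s_e (assumed definition, see above).
standardized = [e / s_e for e in residuals]

# Flag observations whose standardized residual exceeds 2 in absolute value.
flagged = [i for i, r in enumerate(standardized) if abs(r) > 2]
for i in flagged:
    print(f"suspect outlier: x={x[i]}, y={y[i]}, standardized residual={standardized[i]:.2f}")
```

A flagged point is only a candidate: it still has to be checked against the possibilities listed above.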
20. Leverage and Influential Points
- An observation has high leverage if it is an
outlier in the x direction.
- An observation is influential if removing it
would markedly change the least squares line.
- Observations that have high leverage are
influential if they do not fall very close to the
least squares line for the other points.
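A quick way to see the difference between leverage and influence is to refit the line with and without the high-leverage point. In the sketch below (made-up data, hypothetical helper `slope`), a point far out in x barely moves the slope when it lies near the line for the other points, and moves it a lot when it does not.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical base data lying close to the line y = 2x.
base_x = [1.0, 2.0, 3.0, 4.0, 5.0]
base_y = [2.0, 4.1, 5.9, 8.0, 10.1]

b1_base = slope(base_x, base_y)

# Add a high-leverage point (x = 15) that is far from the line for the others.
b1_off_line = slope(base_x + [15.0], base_y + [10.0])

# Add a high-leverage point (x = 15) that is close to the line for the others.
b1_on_line = slope(base_x + [15.0], base_y + [30.0])

print(f"slope without the extra point:         {b1_base:.3f}")
print(f"slope with high-leverage point off it: {b1_off_line:.3f}")
print(f"slope with high-leverage point on it:  {b1_on_line:.3f}")
```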
21. 18.8 Coefficient of Correlation
- The coefficient of correlation is used to measure
the strength of association between two
variables.
- The coefficient values range between -1 and 1.
- If r = -1 (negative association) or r = 1
(positive association), every point falls on the
regression line.
- If r = 0, there is no linear pattern.
- The coefficient can be used to test for a linear
relationship between two variables.
22. Testing the coefficient of correlation
- To test the coefficient of correlation for a linear
relationship between X and Y:
- X and Y must be observational
- X and Y must be bivariate normally distributed
23. Testing the coefficient of correlation
- When no linear relationship exists between the two
variables, $\rho = 0$.
- The hypotheses are
- $H_0: \rho = 0$, $H_1: \rho \neq 0$
- The test statistic is
$t = r \sqrt{\frac{n-2}{1-r^2}}$, where r is the sample coefficient of correlation.
The statistic is Student t distributed with d.f.
n - 2, provided the variables are bivariate
normally distributed.
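As a worked illustration, the sketch below computes the sample correlation r and the statistic $t = r\sqrt{(n-2)/(1-r^2)}$ for made-up data, then compares it with a two-sided 5% critical value for 6 degrees of freedom (2.447, hard-coded from a t table). It also checks the known identity that this statistic equals the slope estimate divided by its standard error.

```python
import math

# Hypothetical toy data (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 2.4, 4.8, 4.1, 6.5, 5.2, 7.9, 6.6]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample coefficient of correlation and its t statistic (n - 2 d.f.).
r = sxy / math.sqrt(sxx * syy)
t_from_r = r * math.sqrt((n - 2) / (1 - r ** 2))

# The same statistic computed as slope / standard error of the slope.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
s_e = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
t_from_slope = b1 / (s_e / math.sqrt(sxx))

T_CRIT = 2.447  # t critical value for alpha/2 = .025 and 6 d.f., from a t table
print(f"r = {r:.4f}, t = {t_from_r:.3f}, slope t = {t_from_slope:.3f}")
print("reject H0 (no linear relationship):", abs(t_from_r) > T_CRIT)
```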
24. Transformations
- Suppose that the residual plot indicates
curvature in the regression function. What do we
do?
- One possibility: Transform x or transform y.
- Check handout 2
25. Transformation for display.jmp
- Y = Sales, X = Display Feet
- Y = Sales, X = Square Root of Display Feet / Log of
Display Feet
26. Predictions with Transformations
- Linear Fit
- Sales = -46.28718 + 154.90188 Square Root
DisplayFeet
- For 5 display feet, the average amount of sales
is $-46.28718 + 154.90188 \sqrt{5} \approx 300.08$.
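The prediction from the transformed fit can be reproduced directly. The two coefficients are the ones printed on the slide. That the terms are added, and the function name `predicted_sales`, are assumptions for illustration.

```python
import math

# Coefficients as printed on the slide; the "+" joining the two terms is assumed.
INTERCEPT = -46.28718
SLOPE_SQRT_FEET = 154.90188

def predicted_sales(display_feet):
    """Average sales predicted by the square-root fit for a given display length."""
    return INTERCEPT + SLOPE_SQRT_FEET * math.sqrt(display_feet)

# Transform x first (square root), then apply the fitted line.
print(round(predicted_sales(5), 2))
```

The key step is to apply the transformation to x before plugging it into the fitted line, rather than plugging in the raw display feet.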
27. Guidelines for exam
- Preparation for exams:
- Work on lectures
- The book (remember you are required to have one:
the red thing)
- Work on assignments
- Lastly, work on the practice exams (without
looking at the solutions)
- Comprehension questions will be similar to the
homework.
28. Topics for Exam 2
- 13.5 Inference for ratio of two variances
- 13.6 Inference for difference between two
proportions
- Chapter 15
- One-way ANOVA
- Multiple Comparisons
- Randomized Blocks
- Two-way ANOVA (Interactions: IMPORTANT)
- Chapter 18
- Simple Linear regression (Estimation and Testing)
- Regression Diagnostics
- Point and Interval Prediction
- Assessing the model
- Finance Application