Title: Statistics 222
1Statistics 222
- Chapter 15
- Multiple Regression
2Multiple Regression
- Multiple Regression analysis is the study of how
a dependent variable (y) is related to two or
more independent variables (multiple xs).
3The Multiple Regression Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$
- We begin with the assumption that the regression
equation follows this regression model. In other
words, there is a linear association between y
and x1, x2, ..., xp (multiplied by their respective
βs) plus the error term ε.
- The error term accounts for the variability in y
that cannot be explained by the linear effect of
the xs.
4The Multiple Regression Equation
- The multiple regression equation describes how
the expected value of y is related to the xs and
has this form:
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$
5Developing the Multiple Regression Equation
- As with simple regression, we don't have values
for the βs (population parameters), so we
estimate the βs with bs that we derive from a
sample's data set.
- Again, we use the least squares method to
develop a regression equation by solving for the
b-values that minimize Σ(yi − ŷi)²
(the sum of the squared errors).
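The least squares idea above can be sketched in a few lines of Python. This is only an illustration: the data values are hypothetical (they are not the Butler Trucking data), and NumPy's `np.linalg.lstsq` stands in for Excel's Regression tool.

```python
import numpy as np

# Hypothetical sample (NOT the Butler Trucking data): two x-variables and y.
x1 = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0])
x2 = np.array([1.0, 3.0, 2.0, 3.0, 2.0, 4.0])
y = np.array([4.1, 6.4, 6.0, 7.6, 7.2, 9.8])

# Design matrix: a column of ones (for b0) followed by the x columns.
X = np.column_stack([np.ones_like(x1), x1, x2])


def sse(b):
    """Sum of squared errors, sum((y_i - yhat_i)^2), for coefficients b."""
    residuals = y - X @ b
    return float(residuals @ residuals)


# Least squares: the b-values that minimize sse(b).
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("b0, b1, b2 =", b_hat)
print("SSE at the least squares solution:", sse(b_hat))
# Moving any coefficient away from b_hat can only increase the SSE.
print("SSE after nudging b1:", sse(b_hat + np.array([0.0, 0.01, 0.0])))
```

Any other choice of b-values gives a sum of squared errors at least as large as the one at `b_hat`, which is what "least squares" means.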
6Example Butler Trucking
- Butler Trucking Company makes deliveries
throughout a local geographic area. To develop
better work schedules, the managers want to
estimate the total daily travel time that it
should take to complete any given delivery route.
- Initially the managers believe that the total
daily travel time is closely related to the
number of miles traveled.
- They obtain a sample of 10 delivery schedules and
they track the number of miles traveled (x1) and
the total delivery time (y) of each.
8Is there reason to believe the relationship is
linear?
- Before proceeding with regression analysis, we
should plot the (x, y) pairs to see if there
appears to be a linear relationship between the
variables.
- See next slide.
9There appears to be somewhat of a positive linear
relationship between miles traveled and travel
time so we can proceed with the regression
analysis.
10Open the file DataSetsForCh15.xls and click on
the worksheet Butler Trucking I
11From the menu, select Tools, Data Analysis,
Regression, ok.
12Select C2:C12 for the Y range and B2:B12 for
the X range. Check off Labels and click ok.
13See the regression output.
14The Regression Equation
- Based upon Excel's Regression analysis output, we
develop the following regression equation:
$\hat{y} = 1.27 + 0.0678 x_1$
For every additional mile driven, travel time
increases by .0678 of an hour (that's about 4
minutes).
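As a quick check of the arithmetic, the equation can be written as a small Python function. The coefficients come from the slide; the function name and the 100-mile route are made up for illustration.

```python
# Coefficients from the slide's estimated equation: yhat = 1.27 + 0.0678 * x1
B0 = 1.27    # intercept, in hours
B1 = 0.0678  # hours of travel time per additional mile


def predict_travel_time(miles):
    """Estimated travel time in hours for a route of the given length."""
    return B0 + B1 * miles


# Slope converted from hours per mile to minutes per mile.
minutes_per_mile = B1 * 60

print(predict_travel_time(100))  # estimate for a hypothetical 100-mile route
print(minutes_per_mile)          # roughly 4 minutes per additional mile
```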
15Is the regression equation significant?
- Recall that for simple regression, we can use
either the F-test to test for overall
significance of the equation or a t-test
to test the hypothesis that β1 = 0.
- Using α = .05 and referring to our Excel output,
we see that the model is significant because p =
.004 (which is less than .05).
- Furthermore, longer travel times are associated
with more miles driven.
16Since we are doing simple linear regression
analysis, the p-values for both the F-test and
the t-test are the same, and both are significant at
α = .05.
17Referring to the Excel regression output, we see
that r², the coefficient of determination, is .664.
Therefore, 66.4% of the variability in travel
time can be explained by the linear effect of the
number of miles traveled.
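For reference, r² can be computed as 1 − SSE/SST, which holds for a least squares fit that includes an intercept. The sketch below assumes that setup; the function name and the small data sets are hypothetical and are not the Butler Trucking figures.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE/SST.

    y     : observed values
    y_hat : values predicted by a least squares fit with an intercept
    """
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)              # total variability in y
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variability
    return 1 - sse / sst


# Two hypothetical extremes:
print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions explain everything
print(r_squared([1, 2, 3], [2, 2, 2]))  # predicting the mean explains nothing
```

An r² of .664 sits between these extremes: about two thirds of the variability in travel time is accounted for by miles traveled.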
18Adding a 2nd independent variable
- Let's say that the managers believe there is
another variable that affects travel time, and
that is the number of deliveries.
- Let's run the regression analysis again using two
independent variables: miles driven will be x1
and number of deliveries will be x2.
19Open the file DataSetsForCh15.xls and click on
the worksheet Butler Trucking II
20From the menu, select Tools, Data Analysis,
Regression, ok.
21Select D2:D12 for the Y range and B2:C12 for
the X range. Check off Labels and click ok.
22See the regression output.
23The Regression Equation (with two independent
variables)
- Based upon Excel's Regression analysis output, we
develop the following regression equation:
$\hat{y} = -0.669 + 0.0611 x_1 + 0.923 x_2$
24Note that the value of β1 declined
- When there was just one independent variable
(number of miles driven), the estimate of β1 was
.0678.
- When we added a second independent variable, the
estimate of β1 became .0611.
- The reason that β1 declined is that the
explanatory power is now shared between two
variables (x1 and x2) that are slightly
correlated, so x2 picked up some of the
explanatory power that was initially attributed
to x1 when x2 wasn't in the picture yet.
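The effect can be reproduced with a small made-up data set. In this sketch the data are hypothetical and noise-free (y is built as exactly 0.06·x1 + 0.9·x2), and x1 and x2 are positively correlated. The helper name `fit` is also an assumption for illustration.

```python
import numpy as np

# Hypothetical data, not the Butler Trucking sample.
x1 = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0])  # miles driven
x2 = np.array([1.0, 3.0, 2.0, 3.0, 2.0, 4.0])         # number of deliveries
y = 0.06 * x1 + 0.9 * x2                              # made-up travel times


def fit(*columns):
    """Least squares coefficients [b0, b1, ...] for the given x columns."""
    X = np.column_stack([np.ones_like(y), *columns])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs


b1_alone = fit(x1)[1]        # slope on miles when deliveries are left out
b1_with_x2 = fit(x1, x2)[1]  # slope on miles once deliveries are included

print(b1_alone, b1_with_x2)
```

With x2 left out, the slope on x1 absorbs part of the effect of deliveries and comes out larger than 0.06. Once x2 is included, the slope on x1 drops back to 0.06.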
25The Definition of βi in multiple regression
- βi represents an estimate of the change in y
corresponding to a one-unit change in xi when all
other independent variables are held constant.
- β1 = .0611. This means that for every additional
mile driven, travel time should increase by .0611
hours (about 3.67 minutes) when the number of
deliveries is held constant.
- β2 = .923. This means that for every additional
delivery made, travel time should increase by
.923 hours (about 55 minutes) when the number of
miles driven is held constant.
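The "held constant" interpretation can be checked directly from the estimated equation. The coefficients below are the ones on the slides; the function name and the example route (100 miles, 2 deliveries) are hypothetical.

```python
# Coefficients from the slides: yhat = -0.669 + 0.0611 * x1 + 0.923 * x2
B0, B1, B2 = -0.669, 0.0611, 0.923


def predict_travel_time(miles, deliveries):
    """Estimated travel time in hours."""
    return B0 + B1 * miles + B2 * deliveries


base = predict_travel_time(100, 2)  # hypothetical route

# Change one variable by one unit while holding the other constant.
one_more_mile = predict_travel_time(101, 2) - base
one_more_delivery = predict_travel_time(100, 3) - base

print(base)
print(one_more_mile * 60)      # minutes added by one extra mile
print(one_more_delivery * 60)  # minutes added by one extra delivery
```

Each difference equals the corresponding coefficient, whatever route is used as the starting point.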
26Is the regression equation significant?
- For simple regression, we could use either the
F-test or do one t-test (H0: β1 = 0) to test for
the significance of the regression equation.
- But now we are doing multiple regression because
we have more than one independent variable.
Therefore, we still do the F-test to determine
the overall significance of the regression
equation but we must also do a t-test for each
independent variable.
- The t-test is used to test each individual
independent variable for significance.
27The F-test
- To test the multiple regression equation for
overall significance, we test this set of
hypotheses:
- H0: β1 = β2 = β3 = ... = βp = 0
- Ha: One or more of the βi is not 0.
- Recall that the F-statistic is calculated as
the ratio MSR/MSE.
- MSE is an unbiased estimate of σ², the variance
of the error term ε, and MSR will be similar to
MSE if none of the βi are significantly different
from 0 (thus, the ratio will be about 1:1).
28We see that MSR is 10.8, MSE is .328 and the
F-ratio is 32.878 resulting in a p-value of
.00027. Therefore, we reject the null hypothesis
and conclude the overall model is significant.
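The numbers on this slide can be checked with a short calculation. The sketch assumes n = 10 observations and p = 2 independent variables, so the F-statistic has 2 and n − p − 1 = 7 degrees of freedom. The p-value formula used here is a closed form that is valid only when the numerator has exactly 2 degrees of freedom, and the function name is made up for illustration.

```python
msr = 10.8   # mean square due to regression (rounded, from the slide)
mse = 0.328  # mean square error (rounded, from the slide)
n = 10       # observations (assumed: the sample of 10 delivery schedules)
p = 2        # independent variables

# Ratio of the rounded values; it differs slightly from the reported 32.878,
# presumably because the slide rounds MSR and MSE.
f_from_rounded = msr / mse
f_reported = 32.878

df_num = p
df_den = n - p - 1


def f_upper_tail_df2(f, df_den):
    """P(F > f) for an F distribution with 2 numerator degrees of freedom.

    Closed form valid ONLY for 2 numerator degrees of freedom.
    """
    return (1 + 2 * f / df_den) ** (-df_den / 2)


p_value = f_upper_tail_df2(f_reported, df_den)

print(f_from_rounded)
print(p_value)
```

The resulting p-value is far below .05, in line with the slide's conclusion that the overall model is significant.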
29The t-tests
- Since we have two βs (other than β0), we must
perform two t-tests.
- For number-of-miles (x1):
- H0: β1 = 0
- Ha: β1 ≠ 0
- For number-of-deliveries (x2):
- H0: β2 = 0
- Ha: β2 ≠ 0
30We see that the t-statistic for miles-driven is
6.18, resulting in a p-value of .00045, and the
t-statistic for number-of-deliveries is 4.17,
resulting in a p-value of .004. Therefore, we
reject both null hypotheses and conclude that
both variables (x1 and x2) are significant.
32The Coefficient of Determination
- r² is now .9037 (as compared to .664 with one x).
- We see that we have greatly improved our ability
to predict the value of y (travel time) when we
add a second variable.
- Therefore, we could say that 90.37% of the
variability in delivery time can be explained by
the regression equation that includes miles
driven and number of deliveries as independent
variables.
- The Coefficient of Determination should now be
referred to as the multiple coefficient of
determination since we have multiple xs.
33The Adjusted r²
- In general, r² never decreases (and usually
increases) as additional independent variables
are added to the regression equation, even when
they contribute little. These variables are also
likely to be correlated amongst themselves to
some degree. Therefore, to avoid over-estimating
the impact of adding additional variables, a
correction factor is applied to r² to adjust it
downward. The adjustment factor is a function of
n (the number of observations) and p (the number
of independent variables).
- In this case, the adjusted r² is .876. So we
would ultimately conclude that 87.6% of the
variability in delivery time can be explained by
the regression equation.
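The slides do not spell out the adjustment, but the standard formula is adjusted r² = 1 − (1 − r²)(n − 1)/(n − p − 1). The sketch below applies it to the values on the slides, assuming n = 10 observations and p = 2 independent variables. The function name is hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Standard adjusted r-squared.

    r2 : multiple coefficient of determination
    n  : number of observations
    p  : number of independent variables
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Values from the slides: r2 = .9037, with n = 10 and p = 2 assumed.
print(round(adjusted_r2(0.9037, n=10, p=2), 3))  # 0.876
```

The penalty grows as p gets closer to n, so adding a weak variable to a small sample can lower the adjusted r² even though r² itself goes up.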
34Multi-collinearity
- When multi-collinearity exists, that means that
the independent variables themselves are
correlated.
- For example, if we had used number of miles
driven and number of gallons of gas consumed
to estimate travel time, we would have been
using two independent variables that themselves
are significantly correlated.
35The problem caused by multicollinearity
- It helps to think of the variables as entering
the model one at a time. If miles driven is
introduced first, all the variation in travel
time due to miles driven is attributed to that
factor. Then when gallons-consumed is introduced,
there is not much more (if any) variation in
travel time that has not already been explained
by miles driven.
- The result is that gallons-consumed may end up
with an insignificant β when it wouldn't be
insignificant if it had been introduced first.
- In other words, with x1 already in the model, x2
does not make a significant contribution to
explaining y. But if x1 weren't already in the
model, x2 would have made a significant
contribution to explaining y.
36How to identify and fix a multi-collinearity
problem
- Run a regression analysis (to obtain an r-value)
using one independent variable as x and the
other as y.
- The r-value will tell you if they are
significantly correlated.
- The rule-of-thumb is: if |r| > .7, then leave one
of the independent variables out of the
regression equation.
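The same check can be done by computing the correlation coefficient directly rather than running a regression. The sketch below uses hypothetical miles and gallons figures (not from the course data sets), and the function name is made up.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient between two lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for two candidate independent variables.
miles = [50, 60, 70, 80, 90, 100]
gallons = [5.1, 6.0, 6.8, 8.1, 9.0, 9.9]

r = pearson_r(miles, gallons)

# Rule of thumb from the slide: drop one variable if the correlation is
# stronger than .7 (in either direction).
too_correlated = abs(r) > 0.7

print(r, too_correlated)
```

Here the two variables are almost perfectly correlated, so only one of them should go into the regression equation.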
37Example 1 (p. 646 - 10)
- Auto Rental News provided data that shows the
number of cars in service (in thousands), the
number of rental locations, and the rental
revenue (millions) for 15 car rental companies.
- A. Determine the estimated regression equation
that can be used to predict the rental revenue
given the number of cars in service and the
number of locations.
- B. Is the model significant? Are the variables
significant?
- C. Provide an interpretation of the slopes of
the estimated regression equation.
- D. What percentage of variation in revenue can
be explained by the regression model that
includes number of cars in service?
38Open the file DataSetsForCh15.xls and click on
the worksheet RentalCars
39From the menu, select Tools, Data Analysis,
Regression, ok.
40Select D2:D17 for the Y range and B2:C17 for
the X range. Check off Labels and click ok.
41See the regression output.
42Question (a)
- Determine the estimated regression equation that
can be used to predict the rental revenue given
the number of cars in service and the number of
locations.
$\hat{y} = 105.97 + 8.94 x_1 - 0.191 x_2$
43Question (b)
- Is the model significant? Are the variables
significant?
- The results of the F-test indicate an F-value of
96.66, resulting in a p-value < .01, so the
overall model is significant.
- The results of the t-test for each variable
indicate that number of rental cars is
significant (p < .001) but number of rental
locations is not (p = .086).
44Question (c)
- Provide an interpretation of the slopes of the
estimated regression equation.
- For each 1,000 additional rental cars, revenue
should increase by $8.94 million when the number
of locations is held constant.
- The second variable, number of locations, is not
significant at the .05 level (p = .086). That is,
it does not provide significant explanatory
power.
45Question (d)
- What percentage of variation in revenue can be
explained by the regression model that includes
number of cars in service?
- To answer this question, it is necessary to
remove number of locations from the regression
equation and re-run the analysis (see next
slide).
46Regression analysis using only one independent
variable (number of rental cars). r² is 92%.
47Homework 11
- Problem 9 on page 645
- Develop the regression equation and use it to
make an estimate.
- Use worksheet Schools.
- Problem 17 on page 649
- Compute and interpret r² and adjusted r².
48Extra Credit Homework Project (10 points)
- Case Problem 1 on page 695 (Consumer Research,
Inc.)
- Use worksheet credit cards in
DataSetsForCh15.xls.
- Write a Managerial Report that provides and
describes descriptive statistics of the data,
develops regression equations, and discusses the
findings. Make predictions and discuss the need
for other independent variables.
- Due in 2 weeks.