Statistics 222 - PowerPoint PPT Presentation

Slides: 49
Provided by: margaret1
Transcript and Presenter's Notes

1
Statistics 222
  • Chapter 15
  • Multiple Regression

2
Multiple Regression
  • Multiple Regression analysis is the study of how
    a dependent variable (y) is related to two or
    more independent variables (multiple xs).

3
The Multiple Regression Model
y = β0 + β1x1 + β2x2 + ... + βpxp + ε
  • We begin with the assumption that the regression
    equation follows this regression model. In other
    words, there is a linear association between y
    and x1, x2, ..., xp (multiplied by their respective
    βs) plus the error term ε.
  • The error term accounts for the variability in y
    that cannot be explained by the linear effect of
    the xs.

4
The Multiple Regression Equation
  • The multiple regression equation describes how
    the expected value of y is related to the xs and
    has this format

E(y) = β0 + β1x1 + β2x2 + ... + βpxp
5
Developing the Multiple Regression Equation
  • As with simple regression, we don't have values
    for the βs (population parameters), so we
    estimate the βs with bs that we derive from a
    sample's data set.
  • Again, we use the least squares method to
    develop a regression equation by solving for our
    b-values in such a way that minimizes Σ(yi − ŷi)²
    (the sum of the squared errors).
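
The least squares idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure Excel uses internally: the function name fit_least_squares and the five data rows are made up for the example, and the b-values are found by solving the normal equations with plain Gaussian elimination.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bp] that minimize sum((yi - yhat_i) ** 2).

    rows: one tuple of x-values per observation (hypothetical input format)
    y:    observed values of the dependent variable
    """
    # Design matrix: a leading 1 in every row carries the intercept b0.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])

    # Augmented normal equations: (X'X) b = X'y
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]

    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up data generated exactly from y = 2 + 3*x1 - 1*x2 (no error term),
# so the fitted b-values should recover 2, 3 and -1.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [3, 7, 7, 11, 12]
b_values = fit_least_squares(rows, y)
print([round(b, 4) for b in b_values])
```

With real sample data the residuals are not zero, and the returned b-values are the ones that make the sum of squared residuals as small as possible.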

6
Example: Butler Trucking
  • Butler Trucking Company makes deliveries
    throughout a local geographic area. To develop
    better work schedules, the managers want to
    estimate the total daily travel time that it
    should take to complete any given delivery route.
  • Initially the managers believe that the total
    daily travel time is closely related to the
    number of miles traveled.
  • They obtain a sample of 10 delivery schedules and
    track the number of miles traveled (x1) and
    the total delivery time (y) of each.

7
(No Transcript)
8
Is there reason to believe the relationship is
linear?
  • Before proceeding with regression analysis, we
    should plot the (x, y) pairs to see if there
    appears to be a linear relationship between the
    variables.
  • See next slide.

9
There appears to be somewhat of a positive linear
relationship between miles traveled and travel
time so we can proceed with the regression
analysis.
10
Open the file DataSetsForCh15.xls and click on
the worksheet Butler Trucking I
11
From the menu, select Tools, Data Analysis,
Regression, ok.
12
Select C2:C12 for the Y range and B2:B12 for
the X range. Check off Labels and click ok.
13
See the regression output.
14
The Regression Equation
  • Based upon Excel's Regression analysis output, we
    develop the following regression equation

ŷ = 1.27 + .0678 x1
For every additional mile driven, travel time
increases by .0678 of an hour (that's about 4
minutes).
15
Is the regression equation significant?
  • Recall that for simple regression, we can use
    either the F-test to test for overall
    significance of the equation or we do a t-test
    to test the hypothesis that β1 = 0.
  • Using α = .05 and referring to our Excel output,
    we see that the model is significant because
    p = .004 (which is less than .05).
  • Furthermore, longer travel times are associated
    with more miles driven.

16
Since we are doing simple linear regression
analysis, the p-values for both the F-test and
the t-test are the same, and are significant at
α = .05.
17
Referring to the Excel regression output, we see
that r², the coefficient of determination, is .664.
Therefore, 66.4% of the variability in travel
time can be explained by the linear effect of the
number of miles traveled.
18
Adding a 2nd independent variable
  • Let's say that the managers believe there is
    another variable that affects travel time:
    the number of deliveries.
  • Let's run the regression analysis again using two
    independent variables: miles driven will be x1
    and number of deliveries will be x2.

19
Open the file DataSetsForCh15.xls and click on
the worksheet Butler Trucking II
20
From the menu, select Tools, Data Analysis,
Regression, ok.
21
Select D2:D12 for the Y range and B2:C12 for
the X range. Check off Labels and click ok.
22
See the regression output.
23
The Regression Equation (with two independent
variables)
  • Based upon Excel's Regression analysis output, we
    develop the following regression equation

ŷ = −.669 + .0611 x1 + .923 x2
24
Note that the value of β1 declined
  • When there was just one independent variable
    (number of miles driven), the value of β1 was
    .0678.
  • When we added a second independent variable, the
    value of β1 became .0611.
  • The reason that β1 declined is that the
    explanatory power is now shared between two
    variables (x1 and x2) that are slightly
    correlated, so x2 picked up some of the
    explanatory power that was initially attributed
    to x1 when x2 wasn't in the picture yet.

25
The Definition of βi in multiple regression
  • βi represents an estimate of the change in y
    corresponding to a one-unit change in xi when all
    other independent variables are held constant.
  • β1 = .0611. This means that for every additional
    mile driven, travel time should increase by .0611
    hours (about 3.67 minutes) when the number of
    deliveries is held constant.
  • β2 = .923. This means that for every additional
    delivery made, travel time should increase by
    .923 hours (about 55 minutes) when the number of
    miles driven is held constant.
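
To make "held constant" concrete, here is a small Python sketch that plugs values into the two-variable equation from slide 23. The coefficients are the ones reported in the slides; the function name predicted_hours and the 100-mile, 2-delivery route are assumptions chosen only for illustration.

```python
# Coefficients as reported on slide 23 of this presentation.
B0, B1, B2 = -0.669, 0.0611, 0.923


def predicted_hours(miles, deliveries):
    """Estimated travel time in hours from the two-variable equation."""
    return B0 + B1 * miles + B2 * deliveries


# Hypothetical baseline route: 100 miles, 2 deliveries.
base = predicted_hours(100, 2)

# Change one variable by one unit while holding the other constant.
extra_mile = predicted_hours(101, 2) - base
extra_delivery = predicted_hours(100, 3) - base

print(f"One more mile:     {extra_mile * 60:.2f} minutes")
print(f"One more delivery: {extra_delivery * 60:.2f} minutes")
```

Whatever baseline route is chosen, the two differences equal the slopes themselves, which is exactly what the definition of βi says.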

26
Is the regression equation significant?
  • For simple regression, we could use either the
    F-test or do one t-test (H0: β1 = 0) to test for
    the significance of the regression equation.
  • But now we are doing multiple regression because
    we have more than one independent variable.
    Therefore, we still do the F-test to determine
    the overall significance of the regression
    equation but we must also do a t-test for each
    independent variable.
  • The t-test is used to test each individual
    independent variable for significance.

27
The F-test
  • To test the multiple regression equation for
    overall significance, we test this set of
    hypotheses
  • H0: β1 = β2 = ... = βp = 0
  • Ha: One or more of the βi is not 0.
  • Recall that the F-statistic is calculated by
    obtaining the ratio of MSR/MSE.
  • MSE is an unbiased estimate of σ², the variance
    of the error term ε, and MSR will be similar to
    MSE if none of the βi are significantly different
    from 0 (thus, the ratio will be close to 1).

28
We see that MSR is 10.8, MSE is .328 and the
F-ratio is 32.878 resulting in a p-value of
.00027. Therefore, we reject the null hypothesis
and conclude the overall model is significant.
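
The F-test numbers above can be checked with a short Python sketch. It assumes the usual degrees of freedom for this example (p = 2 in the numerator and n − p − 1 = 7 in the denominator, with n = 10 routes). The helper name is made up, and the closed-form tail probability it uses is valid only when the numerator has exactly 2 degrees of freedom; for other cases a statistics library would be needed.

```python
def f_upper_tail_2_numerator_df(f_value, denom_df):
    """P(F > f_value) for an F distribution with (2, denom_df) degrees of freedom.

    Special case only: with 2 numerator degrees of freedom the upper tail
    reduces to (1 + 2*f/denom_df) ** (-denom_df / 2).
    """
    return (1.0 + 2.0 * f_value / denom_df) ** (-denom_df / 2.0)


n, p = 10, 2            # 10 delivery routes, 2 independent variables
msr, mse = 10.8, 0.328  # rounded values read from the regression output

# The ratio of the rounded mean squares; Excel reports 32.878 because it
# works from unrounded values.
f_from_rounded = msr / mse

# p-value for the F-ratio reported in the output.
p_value = f_upper_tail_2_numerator_df(32.878, n - p - 1)

print(f"MSR/MSE from rounded values: {f_from_rounded:.2f}")
print(f"p-value for F = 32.878:      {p_value:.5f}")
print("Reject H0" if p_value < 0.05 else "Do not reject H0")
```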
29
The t-tests
  • Since we have two βs (other than β0), we must
    perform two t-tests
  • For number-of-miles (x1)
  • H0: β1 = 0
  • Ha: β1 ≠ 0
  • For number-of-deliveries (x2)
  • H0: β2 = 0
  • Ha: β2 ≠ 0

30
We see that the t statistic for miles driven is 6.18,
resulting in a p-value of .00045, and the t statistic
for number of deliveries is 4.17, resulting in a
p-value of .004. Therefore, we reject both null
hypotheses and conclude that both (x1 and x2) are
significant.
31
(No Transcript)
32
The Coefficient of Determination
  • r² is now .9037 (as compared to .664 with one x).
  • We see that we have greatly improved our ability
    to predict the value of y (travel time) when we
    add a second variable.
  • Therefore, we could say that 90.37% of the
    variability in delivery time can be explained by
    the regression equation that includes miles
    driven and number of deliveries as independent
    variables.
  • The Coefficient of Determination should now be
    referred to as the multiple coefficient of
    determination since we have multiple xs.
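
As a rough cross-check, the multiple coefficient of determination can be rebuilt from the mean squares reported on slide 28. The sketch below assumes the usual degrees of freedom (p = 2 for regression, n − p − 1 = 7 for error); because MSR and MSE are rounded in the slides, the result only approximately matches the .9037 from Excel.

```python
n, p = 10, 2            # 10 delivery routes, 2 independent variables
msr, mse = 10.8, 0.328  # rounded mean squares from slide 28

ssr = msr * p            # sum of squares due to regression
sse = mse * (n - p - 1)  # sum of squares due to error
sst = ssr + sse          # total sum of squares

r_squared = ssr / sst    # share of total variability explained
print(f"r-squared from rounded mean squares: {r_squared:.4f}")
```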

33
The Adjusted r²
  • In general, r² always increases as additional
    independent variables are added to the regression
    equation. These variables are likely to be
    correlated amongst themselves to some degree.
    Therefore, to avoid over-estimating the impact of
    adding additional variables, a correction
    factor is applied to r² to adjust it downward.
    The adjustment factor is a function of n (the
    number of observations) and p (the number of
    variables).
  • In this case, the adjusted r² is .876. So we
    would ultimately conclude that 87.6% of the
    variability in delivery time can be explained by
    the regression equation.
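
The slides do not spell out the correction factor. A common textbook form is adjusted r² = 1 − (1 − r²)(n − 1)/(n − p − 1), and the sketch below assumes that form. With the Butler Trucking values (r² = .9037, n = 10, p = 2) it reproduces the .876 shown above. The function name is made up for the example.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjust r-squared downward for the number of independent variables.

    n: number of observations
    p: number of independent variables
    """
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)


adj = adjusted_r_squared(0.9037, n=10, p=2)
print(round(adj, 3))
```

Note that the penalty grows as p approaches n, which is why adding variables to a small sample can lower the adjusted value even though r² itself goes up.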

34
Multi-collinearity
  • When multi-collinearity exists, that means that
    the independent variables themselves are
    correlated.
  • For example, if we had used number of miles
    driven and number of gallons of gas consumed
    to estimate travel time, we would have been
    using two independent variables that themselves
    are significantly correlated.

35
The problem caused by multicollinearity
  • When regression analysis is performed the
    variables are introduced into the model one at a
    time. Therefore, if miles driven is introduced
    first, all the variation in travel time due to
    miles driven will be attributed to that factor.
    Then when gallons-consumed is introduced, there
    is not much more (if any) variation in the
    travel time that has not already been explained
    by miles driven.
  • The result is that gallons-consumed may end up
    with an insignificant β when it wouldn't be
    insignificant if it had been introduced first.
  • In other words, with x1 already in the model, x2
    does not make a significant contribution to
    explaining y. But if x1 wasn't already in the
    model, x2 would have made a significant
    contribution to explaining y.

36
How to identify and fix a multi-collinearity
problem
  • Run a regression analysis (to obtain an r-value)
    using one independent variable as x and the
    other as y.
  • The r-value will tell you if they are
    significantly correlated.
  • The rule-of-thumb is if r > .7, then leave one
    of the independent variables out of the
    regression equation.
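
The check described above can be sketched in Python without running a full regression, since the r-value between two variables is just their sample correlation. The miles and gallons figures below are invented for illustration, the function names are hypothetical, and the sketch compares the absolute value of r to .7 on the assumption that a strong negative correlation is as much of a problem as a strong positive one.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def too_correlated(xs, ys, threshold=0.7):
    """Apply the rule of thumb: flag the pair if |r| exceeds the threshold."""
    return abs(pearson_r(xs, ys)) > threshold


# Made-up values for two candidate independent variables.
miles = [60, 85, 40, 110, 95]
gallons = [7.1, 9.8, 5.2, 12.9, 11.0]

r = pearson_r(miles, gallons)
print(f"r = {r:.3f}")
if too_correlated(miles, gallons):
    print("Leave one of the two variables out of the regression equation.")
```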

37
Example 1 (p. 646 - 10)
  • Auto Rental News provided data that shows the
    number of cars in service (in thousands), the
    number of rental locations, and the rental
    revenue (in $ millions) for 15 car rental companies.
  • A. Determine the estimated regression equation
    that can be used to predict the rental revenue
    given the number of cars in service and the
    number of locations.
  • B. Is the model significant? Are the variables
    significant?
  • C. Provide an interpretation of the slopes of the
    estimated regression equation.
  • D. What percentage of variation in revenue can be
    explained by the regression model that includes
    number of cars in service?

38
Open the file DataSetsForCh15.xls and click on
the worksheet RentalCars
39
From the menu, select Tools, Data Analysis,
Regression, ok.
40
Select D2:D17 for the Y range and B2:C17 for
the X range. Check off Labels and click ok.
41
See the regression output.
42
Question (a)
  • Determine the estimated regression equation that
    can be used to predict the rental revenue given
    the number of cars in service and the number of
    locations.

ŷ = 105.97 + 8.94 x1 − .191 x2
43
Question (b)
  • Is the model significant? Are the variables
    significant?
  • The results of the F-test indicate an F-value of
    96.66 resulting in a p-value < .01, so the
    overall model is significant.
  • The results of the t-test for each variable
    indicate that number of rental cars is
    significant (p = .000) but number of rental
    locations is not (p = .086).

44
Question (c)
  • Provide an interpretation of the slopes of the
    estimated regression equation
  • For each 1,000 additional rental cars, revenue
    should increase by $8.94 million.
  • The second variable, number of locations, is not
    significant at the .05 level (p = .086). That is,
    it does not provide significant explanatory
    power.

45
Question (d)
  • What percentage of variation in revenue can be
    explained by the regression model that includes
    number of cars in service?
  • To answer this question, it is necessary to
    remove number of locations from the regression
    equation and re-run the analysis (see next
    slide).

46
Regression analysis using only one independent
variable (number of rental cars). r² is 92%.
47
Homework 11
  • 9 on page 645
  • Develop the regression equation and use it to
    make an estimate
  • Use worksheet Schools
  • 17 on page 649
  • Compute and interpret r² and r² (adjusted)

48
Extra Credit Homework Project (10 points)
  • Case Problem 1 on page 695 (Consumer Research,
    Inc.)
  • Use worksheet credit cards in
    DataSetsForCh15.xls
  • Write a Managerial Report that provides and
    describes descriptive statistics of the data,
    develops regression equations, and discusses the
    findings. Make predictions and discuss the need
    for other independent variables.
  • Due in 2 weeks.