Title: Multiple Regression
Lecture 8

19.1 Introduction
- In this chapter we extend the simple linear regression model to allow for any number of independent variables.
- We expect to build a model that fits the data better than the simple linear regression model.
Examples of Multiple Regression
- Business decision making: La Quinta Inns wants to decide where to locate new inns. It wants to predict operating margin based on variables related to competition, market awareness, demand generators, demographics, and physical location.
- College admissions: The admissions officer wants to predict which students will be most successful. She wants to predict college GPA based on high school GPA, SAT score, and amount of time spent on extracurricular activities.
More Examples
- Improving operations: A parcel delivery service would like to increase the number of packages sorted at each of its hub locations. Three factors that the company can control and that influence sorting performance are the number of sorting lines, the number of sorting workers, and the number of truck drivers. What can the company do to improve sorting performance?
- Understanding relationships: Executive compensation. Does it matter how long the executive has been at the firm, controlling for other factors? Do CEOs pay themselves less if they have a large stake in the company's stock, controlling for other factors? Does having an MBA increase executive salary, controlling for other factors?
Introduction
- We shall use computer printout to
- Assess the model
- How well does it fit the data?
- Is it useful?
- Are any required conditions violated?
- Employ the model
- Interpreting the coefficients
- Making predictions using the prediction equation
- Estimating the expected value of the dependent variable
Example 19.1
- Where to locate a new motor inn?
- La Quinta Motor Inns is planning an expansion.
- Management wishes to predict which sites are likely to be profitable.
- Several areas where predictors of profitability can be identified are
- Competition
- Market awareness
- Demand generators
- Demographics
- Physical quality
Margin
Profitability (operating margin) is modeled in terms of:
- Rooms (competition): number of hotel/motel rooms within 3 miles of the site.
- Nearest (market awareness): distance to the nearest La Quinta inn.
- Office space (demand generators/customers): nearby office space.
- College enrollment (demand generators/customers): nearby college enrollment.
- Income (demographics/community): median household income.
- Disttwn (physical location): distance to downtown.
Model and Required Conditions
- We allow for k independent variables to potentially be related to the dependent variable:
- y = b0 + b1x1 + b2x2 + ... + bkxk + e
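The coefficients b0, b1, ..., bk are estimated by least squares. A minimal sketch with numpy on synthetic data (the data and "true" coefficients below are illustrative, not from the La Quinta example):

```python
import numpy as np

# Synthetic data with k = 2 predictors and known "true" coefficients (illustrative)
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
e = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + e           # y = b0 + b1*x1 + b2*x2 + e

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates of b0, b1, b2
print(b)                                    # estimates close to (2.0, 1.5, -0.8)
```

Statistical software (JMP, Excel) performs exactly this computation and adds the standard errors and test statistics discussed below.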
Multiple Regression for k = 2, Graphical Demonstration - I
The simple linear regression model allows for one independent variable x: y = b0 + b1x + e, a straight line in the (x, y) plane.
The multiple linear regression model allows for more than one independent variable: y = b0 + b1x1 + b2x2 + e. Note how the straight line becomes a plane over the (x1, x2) plane.
Multiple Regression for k = 2, Graphical Demonstration - II
With a squared term, y = b0 + b1x1^2 + b2x2 + e. Note how the parabola y = b0 + b1x^2 becomes a parabolic surface over the (x1, x2) plane.
Required conditions for the error variable
- The error e is normally distributed.
- The mean of the error e is equal to zero for each combination of the x's, i.e., E(e) = 0.
- The standard deviation of e is constant (se) for all values of the x's.
- The errors are independent.
Estimating the Coefficients and Assessing the Model, Example
- Data were collected from 100 randomly selected inns that belong to La Quinta, and the following suggested model was run:
- Margin = b0 + b1Rooms + b2Nearest + b3Office + b4College + b5Income + b6Disttwn + e
Xm19-01
19.3 Estimating the Coefficients and Assessing the Model
- The procedure used to perform regression analysis:
- Estimate the model coefficients and statistics by least squares, using JMP.
- Diagnose violations of required conditions. Try to remedy problems when identified.
- Assess the model fit using statistics obtained from the sample.
- If the model assessment indicates a good fit to the data, use it to interpret the coefficients and generate predictions.
Model Assessment
- The model is assessed using three tools:
- The standard error of estimate
- The coefficient of determination
- The F-test of the analysis of variance
- The standard error of estimate participates in building the other tools.
Standard Error of Estimate
- The standard deviation of the error is estimated by the standard error of estimate: se = sqrt(SSE / (n - k - 1))
- The magnitude of se is judged by comparing it to the mean value of y.
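The formula se = sqrt(SSE / (n - k - 1)) is easy to compute from the residuals; a quick sketch (the residuals here are simulated stand-ins, not the La Quinta residuals):

```python
import numpy as np

# Simulated residuals standing in for y - y_hat from a model with k = 6 predictors
rng = np.random.default_rng(1)
n, k = 100, 6
residuals = rng.normal(0, 5.5, n)       # illustrative: true error sd of 5.5

sse = np.sum(residuals ** 2)            # SSE = sum of squared residuals
se = np.sqrt(sse / (n - k - 1))         # standard error of estimate
print(round(se, 2))                     # close to the error standard deviation
```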
Standard Error of Estimate
- From the printout, se = 5.51.
- Calculating the mean value of y,
- It seems se is not particularly small relative to the mean of y.
- Question: Can we conclude the model does not fit the data well?
Coefficient of Determination
- The definition is R2 = 1 - SSE / SS(Total).
- From the printout, R2 = 0.5251.
- 52.51% of the variation in operating margin is explained by the six independent variables; 47.49% remains unexplained.
- When adjusted for degrees of freedom, Adjusted R2 = 1 - [SSE/(n-k-1)] / [SS(Total)/(n-1)] = 49.44%.
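The adjusted R2 can be reproduced from the printout values n = 100, k = 6, R2 = 0.5251 alone, using the algebraically equivalent form Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1):

```python
n, k = 100, 6            # La Quinta example: 100 inns, 6 predictors
r2 = 0.5251              # R-squared from the printout

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))  # roughly 0.4945, matching the printout's ~49.44% up to rounding
```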
Testing the Validity of the Model
- We pose the question: Is there at least one independent variable linearly related to the dependent variable?
- To answer the question we test the hypotheses
- H0: b1 = b2 = ... = bk = 0
- H1: At least one bi is not equal to zero.
- If at least one bi is not equal to zero, the model has some validity.
Testing the Validity of the La Quinta Inns Regression Model
- The hypotheses are tested by an ANOVA procedure.
Testing the Validity of the La Quinta Inns Regression Model
- Total variation in y = SSR + SSE.
- If SSR is large relative to SSE, much of the variation in y is explained by the regression model; the model is useful, and thus the null hypothesis should be rejected. Therefore, we reject for large F.
Rejection region: F > Fa,k,n-k-1
Testing the Validity of the La Quinta Inns Regression Model
Fa,k,n-k-1 = F0.05,6,100-6-1 = 2.17; F = 17.14 > 2.17. Also, the p-value (Significance F) = 0.0000. Reject the null hypothesis.
Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the bi is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.
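The F statistic can be checked from the printout values alone, using F = (SSR/k)/(SSE/(n-k-1)) = (R2/k)/((1-R2)/(n-k-1)):

```python
n, k, r2 = 100, 6, 0.5251                 # values from the La Quinta printout

f = (r2 / k) / ((1 - r2) / (n - k - 1))   # F = MSR / MSE, written in terms of R-squared
print(round(f, 2))                        # 17.14, matching the printout
```

Since 17.14 exceeds the critical value F0.05,6,93 = 2.17, the null hypothesis is rejected.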
Interpreting the Coefficients
- b0 = 38.14. This is the intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.
- b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by 0.0076 (assuming the other variables are held constant).
Interpreting the Coefficients
- b2 = 1.65. In this model, for each additional mile that the nearest competitor is from a La Quinta inn, the operating margin increases on average by 1.65 when the other variables are held constant.
- b3 = 0.020. For each additional 1,000 sq-ft of office space, the operating margin increases on average by 0.02 when the other variables are held constant.
- b4 = 0.21. For each additional thousand students, the operating margin increases on average by 0.21 when the other variables are held constant.
Interpreting the Coefficients
- b5 = 0.41. For each additional $1,000 increase in median household income, the operating margin increases on average by 0.41 when the other variables remain constant.
- b6 = -0.23. For each additional mile to the downtown center, the operating margin decreases on average by 0.23 when the other variables are held constant.
Testing the Coefficients
- The hypotheses for each bi are
- H0: bi = 0, H1: bi ≠ 0
- d.f. = n - k - 1
- JMP printout
Multiple Regression Model
- Multiple regression model:
- y = b0 + b1x1 + b2x2 + ... + bkxk + e
- Required conditions:
- The regression function is a linear function of the independent variables x1,...,xk (the multiple regression surface does not systematically overestimate/underestimate y for any combination of x1,...,xk).
- The error e is normally distributed.
- The standard deviation of e is constant (se) for all values of the x's.
- The errors are independent.
Confidence Intervals for Coefficients
- Note that the test of H0: bi = 0 is a test of whether xi helps to predict y given x1,...,xi-1,xi+1,...,xk. Results of the test might change as we change the other independent variables in the model.
- A confidence interval for bi is bi ± t(a/2, n-k-1) s(bi).
- In the La Quinta data, a 95% confidence interval for b1 (the coefficient on number of rooms) is:
Using the Linear Regression Equation
- The model can be used for making predictions by
- Producing a prediction interval estimate for a particular value of y, for given values of the xi.
- Producing a confidence interval estimate for the expected value of y, for given values of the xi.
- The model can be used to learn about relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients bi.
La Quinta Inns, Predictions
Xm19-01
- Predict the average operating margin of an inn at a site with the following characteristics:
- 3815 rooms within 3 miles,
- Closest competitor 0.9 miles away,
- 476,000 sq-ft of office space,
- 24,500 college students,
- $35,000 median household income,
- 11.2 miles to the downtown center.
MARGIN = 38.14 - 0.0076(3815) + 1.65(0.9) + 0.020(476) + 0.21(24.5) + 0.41(35) - 0.23(11.2) = 37.1
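The arithmetic of the point prediction can be checked directly from the fitted coefficients:

```python
# Coefficients and site characteristics from the slides (Office, College, and
# Income are entered in thousands, matching the fitted model's units)
b = [38.14, -0.0076, 1.65, 0.020, 0.21, 0.41, -0.23]
x = [1, 3815, 0.9, 476, 24.5, 35, 11.2]   # leading 1 multiplies the intercept

margin = sum(bi * xi for bi, xi in zip(b, x))
print(round(margin, 1))                   # 37.1
```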
Prediction Intervals and Confidence Intervals for Mean
- Prediction interval for y given x1,...,xk
- Confidence interval for the mean of y given x1,...,xk
- For the inn with the characteristics on the previous slide:
- Confidence interval for the mean: (32.970, 41.213)
- Prediction interval: (25.395, 48.788)
- The prediction interval for an individual y is wider than the confidence interval for the mean of y.
19.4 Regression Diagnostics - II
- The conditions required for the model assessment to apply must be checked.
- Is the error variable normally distributed? Draw a histogram of the residuals.
- Is the regression function correctly specified as a linear function of x1,...,xk? Plot the residuals versus the x's and the predicted values.
- Is the error variance constant?
- Are the errors independent? Plot the residuals versus the time periods.
- Can we identify outliers?
- Is multicollinearity a problem?
Multicollinearity
- Condition in which independent variables are highly correlated.
- Multicollinearity causes two kinds of difficulties:
- The t statistics appear to be too small.
- The b coefficients cannot be interpreted as slopes.
- Diagnostics:
- High correlation between independent variables
- Counterintuitive signs on regression coefficients
- Low values for t-statistics despite a significant overall fit, as measured by the F statistic
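The first diagnostic, high correlation between independent variables, can be checked with a correlation matrix. A synthetic illustration (x2 is deliberately built as a near-copy of x1):

```python
import numpy as np

# Two deliberately collinear predictors (synthetic data, for illustration only)
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # x2 is nearly a copy of x1

r = np.corrcoef(x1, x2)[0, 1]               # pairwise correlation of the predictors
print(round(r, 3))                          # close to 1: a multicollinearity warning
```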
Diagnostics: Multicollinearity
- Example 19.2: Predicting house price (Xm19-02)
- A real estate agent believes that a house's selling price can be predicted using the house size, number of bedrooms, and lot size.
- A random sample of 100 houses was drawn and the data recorded.
- Analyze the relationship among the four variables.
Diagnostics: Multicollinearity
- The proposed model is PRICE = b0 + b1BEDROOMS + b2H-SIZE + b3LOTSIZE + e
The model is valid, but no variable is significantly related to the selling price?!
Diagnostics: Multicollinearity
- Multicollinearity is found to be a problem.
- Multicollinearity causes two kinds of difficulties:
- The t statistics appear too small.
- The b coefficients cannot be interpreted as slopes.
19.5 Regression Diagnostics - III
- The Durbin-Watson Test
- This test detects first-order autocorrelation between consecutive residuals in a time series.
- If autocorrelation exists, the error variables are not independent.
- The statistic is d = sum over i of (ei - ei-1)^2 divided by the sum of ei^2, where ei is the residual at time i.
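The statistic d = sum (ei - ei-1)^2 / sum ei^2 is easy to compute directly; a sketch on simulated residuals (the AR coefficient 0.8 below is an arbitrary illustrative choice):

```python
import numpy as np

def durbin_watson(res):
    """d = sum of squared successive differences / sum of squared residuals."""
    res = np.asarray(res, dtype=float)
    return np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

rng = np.random.default_rng(2)
noise = rng.normal(size=200)

# Independent residuals: d should be near 2
d_indep = durbin_watson(noise)

# Positively autocorrelated residuals (AR(1) with coefficient 0.8): d well below 2
ar = np.empty(200)
ar[0] = noise[0]
for t in range(1, 200):
    ar[t] = 0.8 * ar[t - 1] + noise[t]
d_pos = durbin_watson(ar)

print(round(d_indep, 2), round(d_pos, 2))
```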
Positive first-order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).
Negative first-order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).
- One-tail test for positive first-order autocorrelation:
- If d < dL, there is enough evidence to show that positive first-order correlation exists.
- If d > dU, there is not enough evidence to show that positive first-order correlation exists.
- If d is between dL and dU, the test is inconclusive.
- One-tail test for negative first-order autocorrelation:
- If d > 4 - dL, negative first-order correlation exists.
- If d < 4 - dU, negative first-order correlation does not exist.
- If d falls between 4 - dU and 4 - dL, the test is inconclusive.
- Two-tail test for first-order autocorrelation:
- If d < dL or d > 4 - dL, first-order autocorrelation exists.
- If d falls between dL and dU, or between 4 - dU and 4 - dL, the test is inconclusive.
- If d falls between dU and 4 - dU, there is no evidence for first-order autocorrelation.
Example 19.3
- How does the weather affect the sales of lift tickets at a ski resort?
- Data from the past 20 years (ticket sales, along with the total snowfall and the average temperature during Christmas week in each year) were collected.
- The model hypothesized was
- TICKETS = b0 + b1SNOWFALL + b2TEMPERATURE + e
- Regression analysis yielded the following results:
The model seems to be very poor:
- The fit is very low (R-square = 0.12).
- It is not valid (Signif. F = 0.33).
- No variable is linearly related to Sales.
A test for independent errors: Durbin-Watson
- Tests H0: no time-order autocorrelation vs. H1: autocorrelation in time order.
- Autocorrelation: adjacent observations in time are correlated.
- In the JMP regression window: red diamond > Row Diagnostics > Durbin-Watson Test; a new red diamond in the bottom section gives the p-value (sometimes expensive to compute).
Durbin-Watson Test in JMP
- H0: No first-order autocorrelation.
- H1: First-order autocorrelation.
- Use Row Diagnostics > Durbin-Watson Test in JMP after fitting the model.
- The reported autocorrelation is an estimate of the correlation between errors.
The autocorrelation has occurred over time. Therefore, a time-dependent variable added to the model may correct the problem.
The modified regression model: TICKETS = b0 + b1SNOWFALL + b2TEMPERATURE + b3YEARS + e
- All the required conditions are met for this model.
- The fit of this model is high: R2 = 0.74.
- The model is useful: Significance F = 5.93E-5.
- SNOWFALL and YEARS are linearly related to ticket sales.
- TEMPERATURE is not linearly related to ticket sales.
Model Building (Chapter 20)

20.2 Polynomial Models
- There are models where the independent variables (xi) may appear as functions of a smaller number of predictor variables.
- Polynomial models are one such example.
Polynomial Models
- Multiple regression model: y = b0 + b1x1 + b2x2 + ... + bpxp + e
- Polynomial model with one predictor: y = b0 + b1x + b2x^2 + ... + bpx^p + e
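Fitting a polynomial model is ordinary multiple regression once the powers of x are placed in the design matrix; a minimal sketch:

```python
import numpy as np

def poly_design(x, p):
    """Design matrix with columns 1, x, x^2, ..., x^p for a p-th order model."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** j for j in range(p + 1)])

X = poly_design([1.0, 2.0, 3.0], 2)   # second order model: columns 1, x, x^2
print(X)
```

The resulting matrix can be handed to any least-squares routine exactly as in the multiple regression case.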
Polynomial Models with One Predictor Variable
- First order model: y = b0 + b1x + e
- Second order model: y = b0 + b1x + b2x^2 + e
Polynomial Models with One Predictor Variable
- Third order model: y = b0 + b1x + b2x^2 + b3x^3 + e
Interaction
- Two independent variables x1 and x2 interact if the effect of x1 on y is influenced by the value of x2.
- Interaction can be brought into the multiple linear regression model by including the independent variable x1x2.
- Example:
Interaction (cont.)
- y = b0 + b1x1 + b2x2 + b3x1x2 + e
- Slope for x1: b1 + b3x2
- y = (b0 + b2x2) + (b1 + b3x2)x1 + e
- Example: Is the expected income increase from an extra year of education higher for people with IQ 100 or with IQ 130 (or is it the same)?
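With hypothetical coefficients (not estimates from any real data), the interaction model answers the IQ question by comparing the slopes b1 + b3*x2 at the two IQ values:

```python
# Hypothetical coefficients for income = b0 + b1*educ + b2*iq + b3*educ*iq + e
b1, b3 = 1.2, 0.05              # illustrative values only

slope_iq100 = b1 + b3 * 100     # income gain per extra year of education at IQ 100
slope_iq130 = b1 + b3 * 130     # ... at IQ 130

# With b3 > 0 the education effect is larger at the higher IQ
print(slope_iq100 < slope_iq130)
```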
Polynomial Models with Two Predictor Variables
- First order model: y = b0 + b1x1 + b2x2 + e. The effect of one predictor variable on y is independent of the effect of the other predictor variable on y: for x2 = 1, 2, 3 the lines y = [b0 + b2(1)] + b1x1, y = [b0 + b2(2)] + b1x1, and y = [b0 + b2(3)] + b1x1 are parallel.
- First order model with interaction: y = b0 + b1x1 + b2x2 + b3x1x2 + e. The two variables interact to affect the value of y: the lines y = [b0 + b2(1)] + [b1 + b3(1)]x1, y = [b0 + b2(2)] + [b1 + b3(2)]x1, and y = [b0 + b2(3)] + [b1 + b3(3)]x1 have different slopes.
Polynomial Models with Two Predictor Variables
- Second order model: y = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + e. For a fixed value of x2 the model is a parabola in x1; for example, for x2 = 1: y = [b0 + b2(1) + b4(1^2)] + b1x1 + b3x1^2 + e, and similarly for x2 = 2 and x2 = 3.
Selecting a Model
- Several models have been introduced.
- How do we select the right model?
- Selecting a model:
- Use your knowledge of the problem (variables involved and the nature of the relationship between them) to select a model.
- Test the model using statistical techniques.
Selecting a Model: Example
- Example 20.1: The location of a new restaurant
- A fast food restaurant chain tries to identify new locations that are likely to be profitable.
- The primary market for such restaurants is middle-income adults and their children (between the ages of 5 and 12).
- Which regression model should be proposed to predict the profitability of new locations?
Selecting a Model: Example
- Solution
- The dependent variable will be Gross Revenue.
- Quadratic relationships between Revenue and each predictor variable should be observed. Why?
- Families with very young or older kids will not visit the restaurant as frequently as families with mid-range ages of kids.
- Members of middle-class families are more likely to visit a fast food restaurant than members of poor or wealthy families.
Selecting a Model: Example
- Solution
- The quadratic regression model built is
Sales = b0 + b1INCOME + b2AGE + b3INCOME^2 + b4AGE^2 + b5(INCOME)(AGE) + e
SALES = annual gross sales
INCOME = median annual household income in the neighborhood
AGE = mean age of children in the neighborhood
Selecting a Model: Example
- Example 20.2
- To verify the validity of the model proposed in Example 20.1 for recommending the location of a new fast food restaurant, 25 areas with fast food restaurants were randomly selected.
- Each area included one of the firm's restaurants and three competing restaurants.
- Data collected included (Xm20-02.xls):
- Previous year's annual gross sales.
- Mean annual household income.
- Mean age of children.
Selecting a Model: Example
Xm20-02
Collected data; added data.
The Quadratic Relationships: Graphical Illustration
Model Validation
This is a valid model that can be used to make predictions.
Model Validation
The model can be used to make predictions... but multicollinearity is a problem! The t-tests may be distorted; therefore, do not interpret the coefficients or test them individually.
In Excel: Tools > Data Analysis > Correlation
20.3 Nominal Independent Variables
- In many real-life situations one or more independent variables are nominal.
- Including nominal variables in a regression analysis model is done via indicator (or dummy) variables.
- An indicator variable (I) can assume one of two values, zero or one. For example:
I = 1 if the temperature was below 50 degrees; 0 if the temperature was 50 degrees or more
I = 1 if the data were collected before 1980; 0 if the data were collected after 1980
I = 1 if a degree earned is in Finance; 0 if a degree earned is not in Finance
Nominal Independent Variables; Example: Auction Car Price (II)
- Example 18.2 - revised (Xm18-02a)
- Recall: A car dealer wants to predict the auction price of a car.
- The dealer now believes that the odometer reading and the car's color are variables that affect a car's price.
- Three color categories are considered:
- White
- Silver
- Other colors
Note: Color is a nominal variable.
Nominal Independent Variables; Example: Auction Car Price (II)
- Example 18.2 - revised (Xm18-02b)
I1 = 1 if the color is white; 0 if the color is not white
I2 = 1 if the color is silver; 0 if the color is not silver
The category "Other colors" is defined by I1 = 0 and I2 = 0.
How Many Indicator Variables?
- Note: To represent the situation of three possible colors we need only two indicator variables.
- Conclusion: To represent a nominal variable with m possible categories, we must create m - 1 indicator variables.
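The m - 1 rule for the three-color example can be sketched as a small encoding function:

```python
# Three categories (white, silver, other) need only m - 1 = 2 indicator variables
def color_dummies(color):
    """Return (I1, I2): I1 = 1 for white, I2 = 1 for silver; (0, 0) means other."""
    return (1 if color == "white" else 0,
            1 if color == "silver" else 0)

print(color_dummies("white"))    # (1, 0)
print(color_dummies("silver"))   # (0, 1)
print(color_dummies("red"))      # (0, 0) -- the "other colors" category
```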
Nominal Independent Variables; Example: Auction Car Price
- Solution
- The proposed model is y = b0 + b1(Odometer) + b2I1 + b3I2 + e
- The data: white car: I1 = 1, I2 = 0; silver car: I1 = 0, I2 = 1; other color: I1 = 0, I2 = 0.
Example: Auction Car Price; The Regression Equation
From Excel (Xm18-02b) we get the regression equation:
PRICE = 16701 - 0.0555(Odometer) + 90.48(I1) + 295.48(I2)
The equation for a white car:
Price = 16701 - 0.0555(Odometer) + 90.48(1) + 295.48(0)
The equation for a silver car:
Price = 16701 - 0.0555(Odometer) + 90.48(0) + 295.48(1)
The equation for an other color car:
Price = 16701 - 0.0555(Odometer) + 90.48(0) + 295.48(0)
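The three color-specific equations differ only in their intercepts; a sketch of the fitted equation (the odometer reading of 40,000 miles is an arbitrary illustrative input):

```python
# Fitted auction-price equation from the slides
def price(odometer, i1=0, i2=0):
    """i1 = 1 for a white car, i2 = 1 for a silver car, both 0 for other colors."""
    return 16701 - 0.0555 * odometer + 90.48 * i1 + 295.48 * i2

odo = 40000                      # illustrative odometer reading
p_other = price(odo)
p_white = price(odo, i1=1)
p_silver = price(odo, i2=1)
print(round(p_white - p_other, 2), round(p_silver - p_other, 2))   # 90.48 295.48
```

The differences equal the indicator coefficients, which is exactly why they are interpreted as color premiums relative to the "other colors" baseline.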
Example: Auction Car Price; The Regression Equation
From Excel we get the regression equation:
PRICE = 16701 - 0.0555(Odometer) + 90.48(I1) + 295.48(I2)
For one additional mile the auction price decreases by 5.55 cents.
A white car sells, on average, for $90.48 more than a car in the "Other colors" category.
A silver car sells, on average, for $295.48 more than a car in the "Other colors" category.
Example: Auction Car Price; The Regression Equation
Xm18-02b
Nominal Independent Variables; Example: MBA Program Admission (II)
- Recall: The Dean wanted to evaluate applications for the MBA program by predicting the future performance of the applicants.
- The following three predictors were suggested:
- Undergraduate GPA
- GMAT score
- Years of work experience
- It is now believed that the type of undergraduate degree should be included in the model.
Note: The undergraduate degree is nominal data.
Nominal Independent Variables; Example: MBA Program Admission (II)
I1 = 1 if B.A.; 0 otherwise
I2 = 1 if B.B.A.; 0 otherwise
I3 = 1 if B.Sc. or B.Eng.; 0 otherwise
The category "Other" is defined by I1 = 0, I2 = 0, I3 = 0.
Nominal Independent Variables; Example: MBA Program Admission (II)
MBA-II