Title: Multiple Regression
Lecture 8

19.1 Introduction
- In this chapter we extend the simple linear regression model to allow for any number of independent variables.
- We expect to build a model that fits the data better than the simple linear regression model.
Examples of Multiple Regression
- Business decision making: La Quinta Inns wants to decide where to locate new inns. It wants to predict operating margin based on variables related to competition, market awareness, demand generators, demographics, and physical location.
- College admissions: The admissions officer wants to predict which students will be most successful. She wants to predict college GPA based on high school GPA, SAT score, and amount of time spent on extracurricular activities.
More Examples
- Improving operations: A parcel delivery service would like to increase the number of packages sorted at each of its hub locations. Three factors that the company can control and that influence sorting performance are the number of sorting lines, the number of sorting workers, and the number of truck drivers. What can the company do to improve sorting performance?
- Understanding relationships: Executive compensation. Does it matter how long the executive has been at the firm, controlling for other factors? Do CEOs pay themselves less if they have a large stake in the company's stock, controlling for other factors? Does having an MBA increase executive salary, controlling for other factors?
Introduction
- We shall use computer printout to
- Assess the model
- How well does it fit the data?
- Is it useful?
- Are any required conditions violated?
- Employ the model
- Interpreting the coefficients
- Making predictions using the prediction equation
- Estimating the expected value of the dependent variable
Example 19.1
- Where to locate a new motor inn?
- La Quinta Motor Inns is planning an expansion.
- Management wishes to predict which sites are likely to be profitable.
- Several areas where predictors of profitability can be identified are
- Competition
- Market awareness
- Demand generators
- Demographics
- Physical quality
Margin
Profitability (operating margin) is modeled in terms of:
- Rooms (competition): number of hotel/motel rooms within 3 miles of the site.
- Nearest (market awareness): distance to the nearest La Quinta inn.
- Office space (demand generators/customers): nearby office space.
- College enrollment (demand generators/customers): nearby college enrollment.
- Income (demographics/community): median household income.
- Disttwn (physical location): distance to downtown.
Model and Required Conditions
- We allow for k independent variables to potentially be related to the dependent variable:
- y = b0 + b1x1 + b2x2 + ... + bkxk + e
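The coefficients b0, b1, ..., bk are estimated by least squares. A minimal sketch with numpy on synthetic data (the data and "true" coefficients below are illustrative, not from the La Quinta example):

```python
import numpy as np

# Synthetic data with k = 2 predictors and known "true" coefficients (illustrative)
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
e = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + e           # y = b0 + b1*x1 + b2*x2 + e

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates of b0, b1, b2
print(b)                                    # estimates close to (2.0, 1.5, -0.8)
```

Statistical software (JMP, Excel) performs exactly this computation and adds the standard errors and test statistics discussed below.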
Multiple Regression for k = 2, Graphical Demonstration - I
The simple linear regression model allows for one independent variable x: y = b0 + b1x + e, a straight line in the (x, y) plane.
The multiple linear regression model allows for more than one independent variable: y = b0 + b1x1 + b2x2 + e. Note how the straight line becomes a plane over the (x1, x2) plane.
Multiple Regression for k = 2, Graphical Demonstration - II
With a squared term, y = b0 + b1x1^2 + b2x2 + e. Note how the parabola y = b0 + b1x^2 becomes a parabolic surface over the (x1, x2) plane.
Required conditions for the error variable
- The error e is normally distributed.
- The mean of the error e is equal to zero for each combination of the x's, i.e., E(e) = 0.
- The standard deviation of e is constant (se) for all values of the x's.
- The errors are independent.
Estimating the Coefficients and Assessing the Model, Example
- Data were collected from 100 randomly selected inns that belong to La Quinta, and the following suggested model was run:
- Margin = b0 + b1Rooms + b2Nearest + b3Office + b4College + b5Income + b6Disttwn + e
Xm19-01
19.3 Estimating the Coefficients and Assessing the Model
- The procedure used to perform regression analysis:
- Estimate the model coefficients and statistics by least squares, using JMP.
- Diagnose violations of required conditions. Try to remedy problems when identified.
- Assess the model fit using statistics obtained from the sample.
- If the model assessment indicates a good fit to the data, use it to interpret the coefficients and generate predictions.
Model Assessment
- The model is assessed using three tools:
- The standard error of estimate
- The coefficient of determination
- The F-test of the analysis of variance
- The standard error of estimate participates in building the other tools.
Standard Error of Estimate
- The standard deviation of the error is estimated by the standard error of estimate: se = sqrt(SSE / (n - k - 1))
- The magnitude of se is judged by comparing it to the mean value of y.
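The formula se = sqrt(SSE / (n - k - 1)) is easy to compute from the residuals; a quick sketch (the residuals here are simulated stand-ins, not the La Quinta residuals):

```python
import numpy as np

# Simulated residuals standing in for y - y_hat from a model with k = 6 predictors
rng = np.random.default_rng(1)
n, k = 100, 6
residuals = rng.normal(0, 5.5, n)       # illustrative: true error sd of 5.5

sse = np.sum(residuals ** 2)            # SSE = sum of squared residuals
se = np.sqrt(sse / (n - k - 1))         # standard error of estimate
print(round(se, 2))                     # close to the error standard deviation
```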
Standard Error of Estimate
- From the printout, se = 5.51.
- Calculating the mean value of y,
- It seems se is not particularly small relative to the mean of y.
- Question: Can we conclude the model does not fit the data well?
Coefficient of Determination
- The definition is R2 = 1 - SSE / SS(Total).
- From the printout, R2 = 0.5251.
- 52.51% of the variation in operating margin is explained by the six independent variables; 47.49% remains unexplained.
- When adjusted for degrees of freedom, Adjusted R2 = 1 - [SSE/(n-k-1)] / [SS(Total)/(n-1)] = 49.44%.
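The adjusted R2 can be reproduced from the printout values n = 100, k = 6, R2 = 0.5251 alone, using the algebraically equivalent form Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1):

```python
n, k = 100, 6            # La Quinta example: 100 inns, 6 predictors
r2 = 0.5251              # R-squared from the printout

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))  # roughly 0.4945, matching the printout's ~49.44% up to rounding
```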
Testing the Validity of the Model
- We pose the question: Is there at least one independent variable linearly related to the dependent variable?
- To answer the question we test the hypotheses
- H0: b1 = b2 = ... = bk = 0
- H1: At least one bi is not equal to zero.
- If at least one bi is not equal to zero, the model has some validity.
Testing the Validity of the La Quinta Inns Regression Model
- The hypotheses are tested by an ANOVA procedure.
Testing the Validity of the La Quinta Inns Regression Model
- Total variation in y = SSR + SSE.
- If SSR is large relative to SSE, much of the variation in y is explained by the regression model; the model is useful, and thus the null hypothesis should be rejected. Therefore, we reject for large F.
Rejection region: F > Fa,k,n-k-1
Testing the Validity of the La Quinta Inns Regression Model
Fa,k,n-k-1 = F0.05,6,100-6-1 = 2.17; F = 17.14 > 2.17. Also, the p-value (Significance F) = 0.0000. Reject the null hypothesis.
Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the bi is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.
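The F statistic can be checked from the printout values alone, using F = (SSR/k)/(SSE/(n-k-1)) = (R2/k)/((1-R2)/(n-k-1)):

```python
n, k, r2 = 100, 6, 0.5251                 # values from the La Quinta printout

f = (r2 / k) / ((1 - r2) / (n - k - 1))   # F = MSR / MSE, written in terms of R-squared
print(round(f, 2))                        # 17.14, matching the printout
```

Since 17.14 exceeds the critical value F0.05,6,93 = 2.17, the null hypothesis is rejected.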
Interpreting the Coefficients
- b0 = 38.14. This is the intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.
- b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by 0.0076 (assuming the other variables are held constant).
Interpreting the Coefficients
- b2 = 1.65. In this model, for each additional mile that the nearest competitor is from a La Quinta inn, the operating margin increases on average by 1.65 when the other variables are held constant.
- b3 = 0.020. For each additional 1,000 sq-ft of office space, the operating margin increases on average by 0.02 when the other variables are held constant.
- b4 = 0.21. For each additional thousand students, the operating margin increases on average by 0.21 when the other variables are held constant.
Interpreting the Coefficients
- b5 = 0.41. For each additional $1,000 increase in median household income, the operating margin increases on average by 0.41 when the other variables remain constant.
- b6 = -0.23. For each additional mile to the downtown center, the operating margin decreases on average by 0.23 when the other variables are held constant.
Testing the Coefficients
- The hypotheses for each bi are
- H0: bi = 0, H1: bi ≠ 0
- d.f. = n - k - 1
- JMP printout
Multiple Regression Model
- Multiple regression model:
- y = b0 + b1x1 + b2x2 + ... + bkxk + e
- Required conditions:
- The regression function is a linear function of the independent variables x1,...,xk (the multiple regression surface does not systematically overestimate/underestimate y for any combination of x1,...,xk).
- The error e is normally distributed.
- The standard deviation of e is constant (se) for all values of the x's.
- The errors are independent.
Confidence Intervals for Coefficients
- Note that the test of H0: bi = 0 is a test of whether xi helps to predict y given x1,...,xi-1,xi+1,...,xk. Results of the test might change as we change the other independent variables in the model.
- A confidence interval for bi is bi ± t(a/2, n-k-1) s(bi).
- In the La Quinta data, a 95% confidence interval for b1 (the coefficient on number of rooms) is:
Using the Linear Regression Equation
- The model can be used for making predictions by
- Producing a prediction interval estimate for a particular value of y, for given values of the xi.
- Producing a confidence interval estimate for the expected value of y, for given values of the xi.
- The model can be used to learn about relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients bi.
La Quinta Inns, Predictions
Xm19-01
- Predict the average operating margin of an inn at a site with the following characteristics:
- 3815 rooms within 3 miles,
- Closest competitor 0.9 miles away,
- 476,000 sq-ft of office space,
- 24,500 college students,
- $35,000 median household income,
- 11.2 miles to the downtown center.
MARGIN = 38.14 - 0.0076(3815) + 1.65(0.9) + 0.020(476) + 0.21(24.5) + 0.41(35) - 0.23(11.2) = 37.1
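The arithmetic of the point prediction can be checked directly from the fitted coefficients:

```python
# Coefficients and site characteristics from the slides (Office, College, and
# Income are entered in thousands, matching the fitted model's units)
b = [38.14, -0.0076, 1.65, 0.020, 0.21, 0.41, -0.23]
x = [1, 3815, 0.9, 476, 24.5, 35, 11.2]   # leading 1 multiplies the intercept

margin = sum(bi * xi for bi, xi in zip(b, x))
print(round(margin, 1))                   # 37.1
```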
Prediction Intervals and Confidence Intervals for Mean
- Prediction interval for y given x1,...,xk
- Confidence interval for the mean of y given x1,...,xk
- For the inn with the characteristics on the previous slide:
- Confidence interval for the mean: (32.970, 41.213)
- Prediction interval: (25.395, 48.788)
- The prediction interval for an individual y is wider than the confidence interval for the mean of y.
19.4 Regression Diagnostics - II
- The conditions required for the model assessment to apply must be checked.
- Is the error variable normally distributed? Draw a histogram of the residuals.
- Is the regression function correctly specified as a linear function of x1,...,xk? Plot the residuals versus the x's and the predicted values.
- Is the error variance constant?
- Are the errors independent? Plot the residuals versus the time periods.
- Can we identify outliers?
- Is multicollinearity a problem?
Multicollinearity
- Condition in which independent variables are highly correlated.
- Multicollinearity causes two kinds of difficulties:
- The t statistics appear to be too small.
- The b coefficients cannot be interpreted as slopes.
- Diagnostics:
- High correlation between independent variables
- Counterintuitive signs on regression coefficients
- Low values for t-statistics despite a significant overall fit, as measured by the F statistic
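The first diagnostic, high correlation between independent variables, can be checked with a correlation matrix. A synthetic illustration (x2 is deliberately built as a near-copy of x1):

```python
import numpy as np

# Two deliberately collinear predictors (synthetic data, for illustration only)
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # x2 is nearly a copy of x1

r = np.corrcoef(x1, x2)[0, 1]               # pairwise correlation of the predictors
print(round(r, 3))                          # close to 1: a multicollinearity warning
```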
Diagnostics: Multicollinearity
- Example 19.2: Predicting house price (Xm19-02)
- A real estate agent believes that a house's selling price can be predicted using the house size, number of bedrooms, and lot size.
- A random sample of 100 houses was drawn and the data recorded.
- Analyze the relationship among the four variables.
Diagnostics: Multicollinearity
- The proposed model is PRICE = b0 + b1BEDROOMS + b2H-SIZE + b3LOTSIZE + e
The model is valid, but no variable is significantly related to the selling price?!
Diagnostics: Multicollinearity
- Multicollinearity is found to be a problem.
- Multicollinearity causes two kinds of difficulties:
- The t statistics appear too small.
- The b coefficients cannot be interpreted as slopes.
19.5 Regression Diagnostics - III
- The Durbin-Watson Test
- This test detects first-order autocorrelation between consecutive residuals in a time series.
- If autocorrelation exists, the error variables are not independent.
- The statistic is d = sum over i of (ei - ei-1)^2 divided by the sum of ei^2, where ei is the residual at time i.
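The statistic d = sum (ei - ei-1)^2 / sum ei^2 is easy to compute directly; a sketch on simulated residuals (the AR coefficient 0.8 below is an arbitrary illustrative choice):

```python
import numpy as np

def durbin_watson(res):
    """d = sum of squared successive differences / sum of squared residuals."""
    res = np.asarray(res, dtype=float)
    return np.sum(np.diff(res) ** 2) / np.sum(res ** 2)

rng = np.random.default_rng(2)
noise = rng.normal(size=200)

# Independent residuals: d should be near 2
d_indep = durbin_watson(noise)

# Positively autocorrelated residuals (AR(1) with coefficient 0.8): d well below 2
ar = np.empty(200)
ar[0] = noise[0]
for t in range(1, 200):
    ar[t] = 0.8 * ar[t - 1] + noise[t]
d_pos = durbin_watson(ar)

print(round(d_indep, 2), round(d_pos, 2))
```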
Positive first-order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).
Negative first-order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).
- One-tail test for positive first-order autocorrelation:
- If d < dL, there is enough evidence to show that positive first-order correlation exists.
- If d > dU, there is not enough evidence to show that positive first-order correlation exists.
- If d is between dL and dU, the test is inconclusive.
- One-tail test for negative first-order autocorrelation:
- If d > 4 - dL, negative first-order correlation exists.
- If d < 4 - dU, negative first-order correlation does not exist.
- If d falls between 4 - dU and 4 - dL, the test is inconclusive.
- Two-tail test for first-order autocorrelation:
- If d < dL or d > 4 - dL, first-order autocorrelation exists.
- If d falls between dL and dU, or between 4 - dU and 4 - dL, the test is inconclusive.
- If d falls between dU and 4 - dU, there is no evidence for first-order autocorrelation.
Example 19.3
- How does the weather affect the sales of lift tickets at a ski resort?
- Data from the past 20 years (ticket sales, along with the total snowfall and the average temperature during Christmas week in each year) were collected.
- The model hypothesized was
- TICKETS = b0 + b1SNOWFALL + b2TEMPERATURE + e
- Regression analysis yielded the following results:
The model seems to be very poor:
- The fit is very low (R-square = 0.12).
- It is not valid (Signif. F = 0.33).
- No variable is linearly related to Sales.
A test for independent errors: Durbin-Watson
- Tests H0: no time-order autocorrelation vs. H1: autocorrelation in time order.
- Autocorrelation: adjacent observations in time are correlated.
- In the JMP regression window: red diamond > Row Diagnostics > Durbin-Watson Test; a new red diamond in the bottom section gives the p-value (sometimes expensive to compute).
Durbin-Watson Test in JMP
- H0: No first-order autocorrelation.
- H1: First-order autocorrelation.
- Use Row Diagnostics > Durbin-Watson Test in JMP after fitting the model.
- The reported autocorrelation is an estimate of the correlation between errors.
The autocorrelation has occurred over time. Therefore, a time-dependent variable added to the model may correct the problem.
The modified regression model: TICKETS = b0 + b1SNOWFALL + b2TEMPERATURE + b3YEARS + e
- All the required conditions are met for this model.
- The fit of this model is high: R2 = 0.74.
- The model is useful: Significance F = 5.93E-5.
- SNOWFALL and YEARS are linearly related to ticket sales.
- TEMPERATURE is not linearly related to ticket sales.
Model Building (Chapter 20)

20.2 Polynomial Models
- There are models where the independent variables (xi) may appear as functions of a smaller number of predictor variables.
- Polynomial models are one such example.
Polynomial Models
- Multiple regression model: y = b0 + b1x1 + b2x2 + ... + bpxp + e
- Polynomial model with one predictor: y = b0 + b1x + b2x^2 + ... + bpx^p + e
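Fitting a polynomial model is ordinary multiple regression once the powers of x are placed in the design matrix; a minimal sketch:

```python
import numpy as np

def poly_design(x, p):
    """Design matrix with columns 1, x, x^2, ..., x^p for a p-th order model."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** j for j in range(p + 1)])

X = poly_design([1.0, 2.0, 3.0], 2)   # second order model: columns 1, x, x^2
print(X)
```

The resulting matrix can be handed to any least-squares routine exactly as in the multiple regression case.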
Polynomial Models with One Predictor Variable
- First order model: y = b0 + b1x + e
- Second order model: y = b0 + b1x + b2x^2 + e
Polynomial Models with One Predictor Variable
- Third order model: y = b0 + b1x + b2x^2 + b3x^3 + e
Interaction
- Two independent variables x1 and x2 interact if the effect of x1 on y is influenced by the value of x2.
- Interaction can be brought into the multiple linear regression model by including the independent variable x1x2.
- Example:
Interaction (cont.)
- y = b0 + b1x1 + b2x2 + b3x1x2 + e
- Slope for x1: b1 + b3x2
- y = (b0 + b2x2) + (b1 + b3x2)x1 + e
- Example: Is the expected income increase from an extra year of education higher for people with IQ 100 or with IQ 130 (or is it the same)?
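With hypothetical coefficients (not estimates from any real data), the interaction model answers the IQ question by comparing the slopes b1 + b3*x2 at the two IQ values:

```python
# Hypothetical coefficients for income = b0 + b1*educ + b2*iq + b3*educ*iq + e
b1, b3 = 1.2, 0.05              # illustrative values only

slope_iq100 = b1 + b3 * 100     # income gain per extra year of education at IQ 100
slope_iq130 = b1 + b3 * 130     # ... at IQ 130

# With b3 > 0 the education effect is larger at the higher IQ
print(slope_iq100 < slope_iq130)
```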
Polynomial Models with Two Predictor Variables
- First order model: y = b0 + b1x1 + b2x2 + e. The effect of one predictor variable on y is independent of the effect of the other predictor variable on y: for x2 = 1, 2, 3 the lines y = [b0 + b2(1)] + b1x1, y = [b0 + b2(2)] + b1x1, and y = [b0 + b2(3)] + b1x1 are parallel.
- First order model with interaction: y = b0 + b1x1 + b2x2 + b3x1x2 + e. The two variables interact to affect the value of y: the lines y = [b0 + b2(1)] + [b1 + b3(1)]x1, y = [b0 + b2(2)] + [b1 + b3(2)]x1, and y = [b0 + b2(3)] + [b1 + b3(3)]x1 have different slopes.
Polynomial Models with Two Predictor Variables
- Second order model: y = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + e. For a fixed value of x2 the model is a parabola in x1; for example, for x2 = 1: y = [b0 + b2(1) + b4(1^2)] + b1x1 + b3x1^2 + e, and similarly for x2 = 2 and x2 = 3.
Selecting a Model
- Several models have been introduced.
- How do we select the right model?
- Selecting a model:
- Use your knowledge of the problem (variables involved and the nature of the relationship between them) to select a model.
- Test the model using statistical techniques.
Selecting a Model: Example
- Example 20.1: The location of a new restaurant
- A fast food restaurant chain tries to identify new locations that are likely to be profitable.
- The primary market for such restaurants is middle-income adults and their children (between the ages of 5 and 12).
- Which regression model should be proposed to predict the profitability of new locations?
Selecting a Model: Example
- Solution
- The dependent variable will be Gross Revenue.
- Quadratic relationships between Revenue and each predictor variable should be observed. Why?
- Families with very young or older kids will not visit the restaurant as frequently as families with mid-range ages of kids.
- Members of middle-class families are more likely to visit a fast food restaurant than members of poor or wealthy families.
Selecting a Model: Example
- Solution
- The quadratic regression model built is
Sales = b0 + b1INCOME + b2AGE + b3INCOME^2 + b4AGE^2 + b5(INCOME)(AGE) + e
SALES = annual gross sales
INCOME = median annual household income in the neighborhood
AGE = mean age of children in the neighborhood
Selecting a Model: Example
- Example 20.2
- To verify the validity of the model proposed in Example 20.1 for recommending the location of a new fast food restaurant, 25 areas with fast food restaurants were randomly selected.
- Each area included one of the firm's restaurants and three competing restaurants.
- Data collected included (Xm20-02.xls):
- Previous year's annual gross sales.
- Mean annual household income.
- Mean age of children.
Selecting a Model: Example
Xm20-02
Collected data; added data.
The Quadratic Relationships: Graphical Illustration
Model Validation
This is a valid model that can be used to make predictions.
Model Validation
The model can be used to make predictions... but multicollinearity is a problem! The t-tests may be distorted; therefore, do not interpret the coefficients or test them individually.
In Excel: Tools > Data Analysis > Correlation
20.3 Nominal Independent Variables
- In many real-life situations one or more independent variables are nominal.
- Including nominal variables in a regression analysis model is done via indicator (or dummy) variables.
- An indicator variable (I) can assume one of two values, zero or one. For example:
I = 1 if the temperature was below 50 degrees; 0 if the temperature was 50 degrees or more
I = 1 if the data were collected before 1980; 0 if the data were collected after 1980
I = 1 if a degree earned is in Finance; 0 if a degree earned is not in Finance
Nominal Independent Variables; Example: Auction Car Price (II)
- Example 18.2 - revised (Xm18-02a)
- Recall: A car dealer wants to predict the auction price of a car.
- The dealer now believes that the odometer reading and the car's color are variables that affect a car's price.
- Three color categories are considered:
- White
- Silver
- Other colors
Note: Color is a nominal variable.
Nominal Independent Variables; Example: Auction Car Price (II)
- Example 18.2 - revised (Xm18-02b)
I1 = 1 if the color is white; 0 if the color is not white
I2 = 1 if the color is silver; 0 if the color is not silver
The category "Other colors" is defined by I1 = 0 and I2 = 0.
How Many Indicator Variables?
- Note: To represent the situation of three possible colors we need only two indicator variables.
- Conclusion: To represent a nominal variable with m possible categories, we must create m - 1 indicator variables.
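The m - 1 rule for the three-color example can be sketched as a small encoding function:

```python
# Three categories (white, silver, other) need only m - 1 = 2 indicator variables
def color_dummies(color):
    """Return (I1, I2): I1 = 1 for white, I2 = 1 for silver; (0, 0) means other."""
    return (1 if color == "white" else 0,
            1 if color == "silver" else 0)

print(color_dummies("white"))    # (1, 0)
print(color_dummies("silver"))   # (0, 1)
print(color_dummies("red"))      # (0, 0) -- the "other colors" category
```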
Nominal Independent Variables; Example: Auction Car Price
- Solution
- The proposed model is y = b0 + b1(Odometer) + b2I1 + b3I2 + e
- The data: white car: I1 = 1, I2 = 0; silver car: I1 = 0, I2 = 1; other color: I1 = 0, I2 = 0.
Example: Auction Car Price; The Regression Equation
From Excel (Xm18-02b) we get the regression equation:
PRICE = 16701 - 0.0555(Odometer) + 90.48(I1) + 295.48(I2)
The equation for a white car:
Price = 16701 - 0.0555(Odometer) + 90.48(1) + 295.48(0)
The equation for a silver car:
Price = 16701 - 0.0555(Odometer) + 90.48(0) + 295.48(1)
The equation for an other color car:
Price = 16701 - 0.0555(Odometer) + 90.48(0) + 295.48(0)
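The three color-specific equations differ only in their intercepts; a sketch of the fitted equation (the odometer reading of 40,000 miles is an arbitrary illustrative input):

```python
# Fitted auction-price equation from the slides
def price(odometer, i1=0, i2=0):
    """i1 = 1 for a white car, i2 = 1 for a silver car, both 0 for other colors."""
    return 16701 - 0.0555 * odometer + 90.48 * i1 + 295.48 * i2

odo = 40000                      # illustrative odometer reading
p_other = price(odo)
p_white = price(odo, i1=1)
p_silver = price(odo, i2=1)
print(round(p_white - p_other, 2), round(p_silver - p_other, 2))   # 90.48 295.48
```

The differences equal the indicator coefficients, which is exactly why they are interpreted as color premiums relative to the "other colors" baseline.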
Example: Auction Car Price; The Regression Equation
From Excel we get the regression equation:
PRICE = 16701 - 0.0555(Odometer) + 90.48(I1) + 295.48(I2)
For one additional mile the auction price decreases by 5.55 cents.
A white car sells, on average, for $90.48 more than a car in the "Other colors" category.
A silver car sells, on average, for $295.48 more than a car in the "Other colors" category.
Example: Auction Car Price; The Regression Equation
Xm18-02b
Nominal Independent Variables; Example: MBA Program Admission (II)
- Recall: The Dean wanted to evaluate applications for the MBA program by predicting the future performance of the applicants.
- The following three predictors were suggested:
- Undergraduate GPA
- GMAT score
- Years of work experience
- It is now believed that the type of undergraduate degree should be included in the model.
Note: The undergraduate degree is nominal data.
Nominal Independent Variables; Example: MBA Program Admission (II)
I1 = 1 if B.A.; 0 otherwise
I2 = 1 if B.B.A.; 0 otherwise
I3 = 1 if B.Sc. or B.Eng.; 0 otherwise
The category "Other" is defined by I1 = 0, I2 = 0, I3 = 0.
Nominal Independent Variables; Example: MBA Program Admission (II)
MBA-II