Multiple Regression - PowerPoint PPT Presentation
Provided by: krithik
Slides: 74

Transcript and Presenter's Notes

Title: Multiple Regression
1
Lecture 8
  • Multiple Regression

2
19.1 Introduction
  • In this chapter we extend the simple linear
    regression model, and allow for any number of
    independent variables.
  • We expect to build a model that fits the data
    better than the simple linear regression model.

3
Examples of Multiple Regression
  • Business decision making: La Quinta Inns wants to
    decide where to locate new inns. It wants to
    predict operating margin based on variables
    related to competition, market awareness, demand
    generators, demographics, and physical location.
  • College admissions: The admissions officer wants
    to predict which students will be most
    successful. She wants to predict college GPA
    based on high school GPA, SAT score, and
    amount of time spent on extracurricular
    activities.

4
More Examples
  • Improving operations: A parcel delivery service
    would like to increase the number of packages
    sorted in each of its hub locations. Three
    factors that the company can control and that
    influence sorting performance are the number of
    sorting lines, the number of sorting workers,
    and the number of truck drivers. What can the
    company do to improve sorting performance?
  • Understanding relationships: Executive
    compensation. Controlling for other factors, does
    it matter how long the executive has been at the
    firm? Do CEOs pay themselves less if they have a
    large stake in the stock of the company? Does
    having an MBA increase executive salary?

5
Introduction
  • We shall use computer printout to
  • Assess the model
  • How well does it fit the data?
  • Is it useful?
  • Are any required conditions violated?
  • Employ the model
  • Interpret the coefficients
  • Make predictions using the prediction equation
  • Estimate the expected value of the dependent
    variable

6
Example 19.1
  • Where to locate a new motor inn?
  • La Quinta Motor Inns is planning an expansion.
  • Management wishes to predict which sites are
    likely to be profitable.
  • Several areas where predictors of profitability
    can be identified are
  • Competition
  • Market awareness
  • Demand generators
  • Demographics
  • Physical quality

7
[Diagram: operating Margin (profitability) and its
proposed predictors]
Competition: Rooms - number of hotel/motel rooms
within 3 miles of the site
Market awareness: Nearest - distance to the nearest
La Quinta inn
Customers: Office space, College enrollment
Community: Income - median household income
Physical: Disttwn - distance to downtown
8
Model and Required Conditions
  • We allow for k independent variables to be
    potentially related to the dependent variable
  • y = b0 + b1x1 + b2x2 + … + bkxk + e
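The coefficients b0, …, bk are estimated by least squares. As a minimal pure-Python sketch (the toy data below are hypothetical, not the La Quinta sample), the normal equations (X'X)b = X'y can be solved directly:

```python
# Least-squares estimation for y = b0 + b1*x1 + ... + bk*xk + e,
# via the normal equations (X'X)b = X'y. Pure Python; toy data.

def fit_ols(rows, y):
    """rows: list of [x1, ..., xk]; returns [b0, b1, ..., bk]."""
    X = [[1.0] + list(r) for r in rows]   # prepend intercept column
    p = len(X[0])
    # Build X'X and X'y.
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X)))
            for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(p)]
    # Solve the p x p system by Gauss-Jordan elimination with pivoting.
    M = [XtX[a] + [Xty[a]] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [M[r][j] - f * M[c][j] for j in range(p + 1)]
    return [M[a][p] / M[a][a] for a in range(p)]

# Noise-free data generated from y = 4 + 2*x1 - 3*x2, so the
# estimates should recover the coefficients (almost) exactly.
rows = [[1, 0], [0, 1], [1, 1], [2, 1], [2, 3], [5, 2]]
y = [4 + 2 * x1 - 3 * x2 for x1, x2 in rows]
b = fit_ols(rows, y)
```

In practice the printout from JMP or Excel reports these estimates; the sketch only shows where they come from.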

9
Multiple Regression for k = 2, Graphical
Demonstration - I
[Figure: 3D plot with axes y, X1, X2. The simple
linear regression model allows for one independent
variable, x: y = b0 + b1x + e, a straight line. The
multiple linear regression model allows for more than
one independent variable: y = b0 + b1x1 + b2x2 + e.
Note how the straight line becomes a plane.]
10
Multiple Regression for k = 2, Graphical
Demonstration - II
[Figure: 3D plot with axes y, X1, X2 and intercept b0.
Note how the parabola y = b0 + b1x² becomes the
parabolic surface y = b0 + b1x1² + b2x2.]
11
Required conditions for the error variable
  • The error e is normally distributed.
  • The mean of the error e is equal to zero for each
    combination of x's, i.e., E(e) = 0.
  • The standard deviation of the error is a constant
    (se) for all values of the x's.
  • The errors are independent.

12
Estimating the Coefficients and Assessing the
Model, Example
  • Data were collected from 100 randomly selected
    inns that belong to La Quinta, and the following
    suggested model was run
  • Margin = b0 + b1Rooms + b2Nearest + b3Office +
    b4College + b5Income + b6Disttwn + e

Xm19-01
13
19.3 Estimating the Coefficients and Assessing
the Model
  • The procedure used to perform regression
    analysis
  • Estimate the model coefficients and statistics
    using least squares using JMP.
  • Diagnose violations of required conditions. Try
    to remedy problems when identified.
  • Assess the model fit using statistics obtained
    from the sample.
  • If the model assessment indicates good fit to the
    data, use it to interpret the coefficients and
    generate predictions.

14
Model Assessment
  • The model is assessed using three tools
  • The standard error of estimate
  • The coefficient of determination
  • The F-test of the analysis of variance
  • The standard error of estimate participates in
    building the other tools.

15
Standard Error of Estimate
  • The standard deviation of the error is estimated
    by the standard error of estimate,
    se = sqrt(SSE/(n-k-1))
  • The magnitude of se is judged by comparing it to
    the mean of the dependent variable.
16
Standard Error of Estimate
  • From the printout, se = 5.51.
  • Calculate the mean value of y from the data.
  • It seems se is not particularly small relative to
    the mean of y.
  • Question: Can we conclude the model does not fit
    the data well?

17
Coefficient of Determination
  • The definition is R² = 1 - SSE/SS(Total)
  • From the printout, R² = 0.5251
  • 52.51% of the variation in operating margin is
    explained by the six independent variables.
    47.49% remains unexplained.
  • When adjusted for degrees of freedom, Adjusted
    R² = 1 - [SSE/(n-k-1)] / [SS(Total)/(n-1)]
  • = 49.44%
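The arithmetic behind these numbers can be checked directly. With SSE/SS(Total) = 0.4749 (sums of squares scaled so SS(Total) = 1), n = 100 observations, and k = 6 predictors from the printout:

```python
# R-squared and adjusted R-squared from the sums of squares.

def r_squared(sse, ss_total):
    return 1 - sse / ss_total

def adjusted_r_squared(sse, ss_total, n, k):
    # Penalize for degrees of freedom used by the k predictors.
    return 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))

sse, ss_total, n, k = 0.4749, 1.0, 100, 6  # scaled so SS(Total) = 1
r2 = r_squared(sse, ss_total)                     # 0.5251
adj_r2 = adjusted_r_squared(sse, ss_total, n, k)  # about 0.4945 (49.44%)
```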

18
Testing the Validity of the Model
  • We pose the question
  • Is there at least one independent variable
    linearly related to the dependent variable?
  • To answer the question we test the hypothesis
  • H0: b1 = b2 = … = bk = 0
  • H1: At least one bi is not equal to zero.
  • If at least one bi is not equal to zero, the
    model has some validity.

19
Testing the Validity of the La Quinta Inns
Regression Model
  • The hypotheses are tested by an ANOVA procedure.

20
Testing the Validity of the La Quinta Inns
Regression Model
  • Total variation in y = SSR + SSE.
  • If SSR is large relative to SSE, much of the
    variation in y is explained by the regression
    model; the model is useful and thus, the null
    hypothesis should be rejected. Thus, we reject
    for large F, where
  • F = (SSR/k) / (SSE/(n-k-1))

Rejection region: F > Fα,k,n-k-1
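With the printout values (R² = 0.5251, n = 100, k = 6, and sums of squares scaled so SS(Total) = 1), the F statistic on the next slide can be reproduced:

```python
# ANOVA F statistic: F = (SSR/k) / (SSE/(n-k-1)).

def f_statistic(ssr, sse, n, k):
    return (ssr / k) / (sse / (n - k - 1))

n, k = 100, 6
ssr, sse = 0.5251, 0.4749        # scaled so SS(Total) = 1
F = f_statistic(ssr, sse, n, k)  # about 17.14; reject H0 since F > 2.17
```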
21
Testing the Validity of the La Quinta Inns
Regression Model
Conclusion: There is sufficient evidence to
reject the null hypothesis in favor of the
alternative hypothesis. At least one of the bi
is not equal to zero. Thus, at least one
independent variable is linearly related to y.
This linear regression model is valid.
Fα,k,n-k-1 = F0.05,6,100-6-1 = 2.17; F = 17.14 > 2.17.
Also, the p-value (Significance F) is
0.0000. Reject the null hypothesis.
22
Interpreting the Coefficients
  • b0 = 38.14. This is the intercept, the value of y
    when all the variables take the value zero.
    Since the data range of the independent
    variables does not cover the value zero, do not
    interpret the intercept.
  • b1 = -0.0076. In this model, for each additional
    room within 3 miles of the La Quinta inn, the
    operating margin decreases on average by 0.0076
    (assuming the other variables are held constant).

23
Interpreting the Coefficients
  • b2 = 1.65. In this model, for each additional
    mile between a La Quinta inn and its nearest
    competitor, the operating margin increases on
    average by 1.65 when the other variables are
    held constant.
  • b3 = 0.020. For each additional 1000 sq-ft of
    office space, the operating margin increases
    on average by 0.02 when the other variables are
    held constant.
  • b4 = 0.21. For each additional thousand students,
    the operating margin increases on average by 0.21
    when the other variables are held constant.

24
Interpreting the Coefficients
  • b5 = 0.41. For each additional $1000 of median
    household income, the operating margin
    increases on average by 0.41 when the other
    variables remain constant.
  • b6 = -0.23. For each additional mile to the
    downtown center, the operating margin decreases
    on average by 0.23 when the other variables are
    held constant.

25
Testing the Coefficients
  • The hypothesis for each bi is
  • JMP printout

H0: bi = 0    H1: bi ≠ 0
d.f. = n - k - 1
26
Multiple Regression Model
  • Multiple regression model
  • y = b0 + b1x1 + b2x2 + … + bkxk + e
  • Required conditions
  • The regression function is a linear function of
    the independent variables x1,…,xk (the multiple
    regression line does not systematically
    overestimate/underestimate y for any combination
    of x1,…,xk).
  • The error e is normally distributed.
  • The standard deviation of the error is a constant
    (se) for all values of the x's.
  • The errors are independent.

27
Confidence Intervals for Coefficients
  • Note that the test of H0: bi = 0 is a test of
    whether xi helps to predict y given
    x1,…,xi-1,xi+1,…,xk. Results of the test might
    change as we change the other independent
    variables in the model.
  • A confidence interval for bi is bi ± tα/2 s(bi),
    with d.f. = n - k - 1.
  • In the La Quinta data, a 95% confidence interval
    for b1 (the coefficient on number of rooms) can
    be read from the printout.

28
Using the Linear Regression Equation
  • The model can be used for making predictions by
  • Producing a prediction interval estimate for a
    particular value of y, for given values of the xi.
  • Producing a confidence interval estimate for the
    expected value of y, for given values of the xi.
  • The model can be used to learn about the
    relationships between the independent variables
    xi and the dependent variable y, by interpreting
    the coefficients bi.

29
La Quinta Inns, Predictions
Xm19-01
  • Predict the average operating margin of an inn at
    a site with the following characteristics
  • 3815 rooms within 3 miles,
  • Closest competitor 0.9 miles away,
  • 476,000 sq-ft of office space,
  • 24,500 college students,
  • $35,000 median household income,
  • 11.2 miles to the downtown center.

MARGIN = 38.14 - 0.0076(3815) + 1.65(0.9) +
0.020(476) + 0.21(24.5) +
0.41(35) - 0.23(11.2) = 37.1
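Plugging the site's characteristics into the fitted equation reproduces the predicted margin (office space in 1000s of sq-ft, enrollment in 1000s of students, income in $1000s, as on the slide):

```python
# Evaluate the fitted La Quinta equation at the site described above.

def margin(rooms, nearest, office, college, income, disttwn):
    return (38.14 - 0.0076 * rooms + 1.65 * nearest + 0.020 * office
            + 0.21 * college + 0.41 * income - 0.23 * disttwn)

m = margin(rooms=3815, nearest=0.9, office=476,
           college=24.5, income=35, disttwn=11.2)  # about 37.1
```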
30
Prediction Intervals and Confidence Intervals for
Mean
  • Prediction interval for y given x1,…,xk
  • Confidence interval for the mean of y given
    x1,…,xk
  • For the inn with the characteristics on the
    previous slide
  • Confidence interval for the mean:
    (32.970, 41.213)
  • Prediction interval: (25.395, 48.788)

31
19.4 Regression Diagnostics - II
  • The conditions required for the model assessment
    to apply must be checked.
  • Is the error variable normally distributed?
    Draw a histogram of the residuals.
  • Is the regression function correctly specified as
    a linear function of x1,…,xk? Plot the
    residuals versus the x's and the predicted values.
  • Is the error variance constant?
  • Are the errors independent?
    Plot the residuals versus the time periods.
  • Can we identify outliers?
  • Is multicollinearity a problem?
32
Multicollinearity
  • Condition in which independent variables are
    highly correlated.
  • Multicollinearity causes two kinds of
    difficulties
  • The t statistics appear to be too small.
  • The b coefficients cannot be interpreted as
    slopes.
  • Diagnostics
  • High correlation between independent variables
  • Counterintuitive signs on regression coefficients
  • Low values for t-statistics despite a significant
    overall fit, as measured by the F statistic
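A quick check for the first diagnostic, high pairwise correlation, can be done with a plain Pearson correlation; the house-size and bedroom figures below are hypothetical, chosen only to show two predictors that track each other:

```python
# Pairwise Pearson correlation between two independent variables.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

h_size = [1.5, 2.0, 2.4, 3.1, 3.6]  # hypothetical house size (1000 sq ft)
bedrooms = [2, 3, 3, 4, 5]          # tracks house size closely
r = pearson(h_size, bedrooms)       # near 1: a multicollinearity warning
```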

33
Diagnostics Multicollinearity
  • Example 19.2 Predicting house price (Xm19-02)
  • A real estate agent believes that a house selling
    price can be predicted using the house size,
    number of bedrooms, and lot size.
  • A random sample of 100 houses was drawn and data
    recorded.
  • Analyze the relationship among the four variables

34
Diagnostics Multicollinearity
  • The proposed model is PRICE = b0 + b1BEDROOMS +
    b2H-SIZE + b3LOTSIZE + e

The model is valid, but no variable is
significantly related to the selling price?!
35
Diagnostics Multicollinearity
  • Multicollinearity is found to be a problem.
  • Multicollinearity causes two kinds of
    difficulties
  • The t statistics appear too small.
  • The b coefficients cannot be interpreted as
    slopes.

36
19.5 Regression Diagnostics - III
  • The Durbin-Watson Test
  • This test detects first-order autocorrelation
    between consecutive residuals in a time series.
  • If autocorrelation exists, the error variables
    are not independent.
  • d = Σ(ei - ei-1)² / Σ ei², where ei is the
    residual at time i.
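The statistic can be computed directly from a residual series; the two series below are hypothetical illustrations of the patterns described on the next slide:

```python
# Durbin-Watson statistic:
# d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2.
# d near 2 suggests no first-order autocorrelation; small d suggests
# positive, large d negative autocorrelation.

def durbin_watson(residuals):
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

smooth = [1.0, 1.2, 0.9, 1.1, 0.8, 1.0]    # similar neighbors -> small d
jumpy = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # alternating signs -> large d
d_pos, d_neg = durbin_watson(smooth), durbin_watson(jumpy)
```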
37
Positive first-order autocorrelation occurs when
consecutive residuals tend to be similar.
Then, the value of d is small (less than 2).
[Plot: residuals versus time drifting smoothly
around 0 - positive first-order autocorrelation]

Negative first-order autocorrelation occurs when
consecutive residuals tend to differ markedly.
Then, the value of d is large (greater than 2).
[Plot: residuals versus time alternating in sign
around 0 - negative first-order autocorrelation]
38
  • One-tail test for positive first-order
    autocorrelation
  • If d < dL there is enough evidence to show that
    positive first-order correlation exists
  • If d > dU there is not enough evidence to show
    that positive first-order correlation exists
  • If d is between dL and dU the test is
    inconclusive.
  • One-tail test for negative first-order
    autocorrelation
  • If d > 4-dL, negative first-order correlation
    exists
  • If d < 4-dU, negative first-order correlation
    does not exist
  • If d falls between 4-dU and 4-dL the test is
    inconclusive.

39
  • Two-tail test for first-order autocorrelation
  • If d < dL or d > 4-dL, first-order
    autocorrelation exists
  • If d falls between dL and dU or between 4-dU and
    4-dL, the test is inconclusive
  • If d falls between dU and 4-dU there is no
    evidence for first-order autocorrelation

[Number line: 0, dL, dU, 2, 4-dU, 4-dL, 4]
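The two-tail rule above can be written as a small helper; the dL and dU bounds below are hypothetical stand-ins for Durbin-Watson table values:

```python
# Two-tail Durbin-Watson decision rule.

def dw_two_tail(d, dL, dU):
    if d < dL or d > 4 - dL:
        return "first-order autocorrelation exists"
    if dL <= d <= dU or 4 - dU <= d <= 4 - dL:
        return "inconclusive"
    return "no evidence of first-order autocorrelation"

# Illustrative (hypothetical) bounds: dL = 1.10, dU = 1.54.
verdict_low = dw_two_tail(0.9, 1.10, 1.54)   # exists
verdict_mid = dw_two_tail(2.0, 1.10, 1.54)   # no evidence
verdict_gray = dw_two_tail(1.3, 1.10, 1.54)  # inconclusive
```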
40
Example 19.3
  • How does the weather affect the sales of lift
    tickets at a ski resort?
  • Data on the past 20 years' sales of tickets,
    along with the total snowfall and the average
    temperature during Christmas week in each year,
    were collected.
  • The model hypothesized was
  • TICKETS = b0 + b1SNOWFALL + b2TEMPERATURE + e
  • Regression analysis yielded the following
    results

41
The model seems to be very poor
  • The fit is very low (R² = 0.12),
  • It is not valid (Significance F = 0.33),
  • No variable is linearly related to Sales.

42
A test for independent errors Durbin-Watson
  • Tests H0: no autocorrelation vs. H1:
    autocorrelation in time order
  • Autocorrelation: adjacent observations in time
    are correlated.
  • In the JMP regression window: red diamond →
    Row Diagnostics → Durbin-Watson Test; a new red
    diamond in the bottom section gives the p-value
    (sometimes expensive to compute).

43
Durbin-Watson Test in JMP
  • H0: No first-order autocorrelation.
  • H1: First-order autocorrelation.
  • Use Row Diagnostics → Durbin-Watson Test in JMP
    after fitting the model.
  • The reported autocorrelation is an estimate of
    the correlation between errors.

44
The autocorrelation has occurred over
time. Therefore, a time-dependent variable added
to the model may correct the problem.
The modified regression model: TICKETS = b0 +
b1SNOWFALL + b2TEMPERATURE + b3YEARS + e
  • All the required conditions are met for this
    model.
  • The fit of this model is high: R² = 0.74.
  • The model is useful: Significance F = 5.93E-5.
  • SNOWFALL and YEARS are linearly related to
    ticket sales.
  • TEMPERATURE is not linearly related to ticket
    sales.

45
Model Building (Chapter 20)
46
20.2 Polynomial Models
  • There are models where the independent variables
    (xi) may appear as functions of a smaller number
    of predictor variables.
  • Polynomial models are one such example.

47
Polynomial Models
  • Multiple regression model: y = b0 + b1x1 + b2x2
    + … + bpxp + e
  • Polynomial model: y = b0 + b1x + b2x² + … +
    bpx^p + e

48
Polynomial Models with One Predictor Variable
  • First order model (p = 1)

y = b0 + b1x + e
  • Second order model (p = 2)

y = b0 + b1x + b2x² + e
49
Polynomial Models with One Predictor
  • Third order model (p = 3)

y = b0 + b1x + b2x² + b3x³ + e
50
Interaction
  • Two independent variables x1 and x2 interact if
    the effect of x1 on y is influenced by the value
    of x2.
  • Interaction can be brought into the multiple
    linear regression model by including the
    independent variable x1·x2.
  • Example

51
Interaction (cont.)
  • y = b0 + b1x1 + b2x2 + b3x1x2 + e
  • Slope for x1: b1 + b3x2
  • y = (b0 + b2x2) + (b1 + b3x2)x1 + e
  • Example
  • Is the expected income increase from an extra
    year of education higher for people with IQ 100
    or with IQ 130 (or is it the same)?
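The slope expression can be illustrated numerically. The coefficients below are hypothetical, chosen only to show that with b3 > 0 the income gain per extra year of education is larger at IQ 130 than at IQ 100:

```python
# Interaction: with y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + e,
# the slope of y in x1 is b1 + b3*x2.

b1, b3 = 2.0, 0.05  # hypothetical coefficients (income in $1000s)

def slope_x1(x2):
    """Marginal effect of one more year of education at IQ = x2."""
    return b1 + b3 * x2

gain_iq_100 = slope_x1(100)  # 2 + 0.05*100 = 7.0
gain_iq_130 = slope_x1(130)  # 2 + 0.05*130 = 8.5, larger when b3 > 0
```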

52
Polynomial Models with Two Predictor Variables
  • First order model: y = b0 + b1x1 + b2x2 + e

[Figure: y versus x1 for X2 = 1, 2, 3.
Without interaction, the effect of one predictor
variable on y is independent of the effect of the
other predictor variable on y:
y = [b0 + b2(j)] + b1x1 for X2 = j (parallel lines).
With the interaction term b3x1x2, the two variables
interact to affect the value of y:
y = [b0 + b2(j)] + [b1 + b3(j)]x1 for X2 = j
(lines with different slopes).]
53
Polynomial Models with Two Predictor Variables
  • Second order model: y = b0 + b1x1 + b2x2 +
    b3x1² + b4x2² + e

[Figure: y versus x1 for X2 = 1, 2, 3. For a fixed
X2 = j this is a parabola in x1:
y = [b0 + b2(j) + b4(j²)] + b1x1 + b3x1² + e.]
54
Selecting a Model
  • Several models have been introduced.
  • How do we select the right model?
  • Selecting a model
  • Use your knowledge of the problem (variables
    involved and the nature of the relationship
    between them) to select a model.
  • Test the model using statistical techniques.

55
Selecting a Model Example
  • Example 20.1: The location of a new restaurant
  • A fast food restaurant chain tries to identify
    new locations that are likely to be profitable.
  • The primary market for such restaurants is
    middle-income adults and their children (between
    the ages of 5 and 12).
  • Which regression model should be proposed to
    predict the profitability of new locations?

56
Selecting a Model Example
  • Solution
  • The dependent variable will be Gross Revenue.
  • Quadratic relationships between Revenue and each
    predictor variable should be observed. Why?
  • Families with very young or older kids will not
    visit the restaurant as frequently as families
    with mid-range ages of kids.
  • Members of middle-class families are more likely
    to visit a fast food restaurant than members of
    poor or wealthy families.

57
Selecting a Model Example
  • Solution
  • The quadratic regression model built is

Sales = b0 + b1INCOME + b2AGE + b3INCOME² +
b4AGE² + b5(INCOME)(AGE) + e
SALES = annual gross sales
INCOME = median annual household income in the
neighborhood
AGE = mean age of children in the neighborhood
58
Selecting a Model Example
  • Example 20.2
  • To verify the validity of the model proposed in
    Example 20.1 for recommending the location of a
    new fast food restaurant, 25 areas with fast food
    restaurants were randomly selected.
  • Each area included one of the firm's restaurants
    and three competing restaurants.
  • Data collected included (Xm20-02.xls)
  • Previous year's annual gross sales.
  • Mean annual household income.
  • Mean age of children.

59
Selecting a Model Example
Xm20-02
Collected data
Added data
60
The Quadratic Relationships Graphical
Illustration
61
Model Validation
This is a valid model that can be used
to make predictions.
62
Model Validation
The model can be used to make predictions...
but multicollinearity is a problem!! The t-tests
may be distorted; therefore, do not interpret
the coefficients or test them.
In Excel: Tools > Data Analysis > Correlation
63
20.3 Nominal Independent Variables
  • In many real-life situations one or more
    independent variables are nominal.
  • Including nominal variables in a regression
    model is done via indicator (or dummy)
    variables.
  • An indicator variable (I) can assume one of
    two values, zero or one. For example:

I = 1 if the temperature was below 50°;
    0 if the temperature was 50° or more
I = 1 if data were collected before 1980;
    0 if data were collected after 1980
I = 1 if a degree earned is in Finance;
    0 if a degree earned is not in Finance
64
Nominal Independent Variables Example Auction
Car Price (II)
  • Example 18.2 - revised (Xm18-02a)
  • Recall: A car dealer wants to predict the auction
    price of a car.
  • The dealer now believes that odometer reading and
    the car's color are variables that affect a car's
    price.
  • Three color categories are considered
  • White
  • Silver
  • Other colors

Note Color is a nominal variable.
65
Nominal Independent Variables Example Auction
Car Price (II)
  • Example 18.2 - revised (Xm18-02b)

I1 = 1 if the color is white;
     0 if the color is not white
I2 = 1 if the color is silver;
     0 if the color is not silver
The category "Other colors" is defined by I1 = 0
and I2 = 0.
66
How Many Indicator Variables?
  • Note: To represent the situation of three
    possible colors we need only two indicator
    variables.
  • Conclusion: To represent a nominal variable with
    m possible categories, we must create m-1
    indicator variables.
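The m-1 rule can be sketched as a small coding helper, using the car-color categories from this example (the last category listed is the baseline coded all zeros):

```python
# Build m-1 indicator (dummy) columns for a nominal variable with
# m categories; the last category is the baseline (all zeros).

def dummy_code(values, categories):
    """categories: ordered list; the last one is the baseline."""
    return [[1 if v == c else 0 for c in categories[:-1]]
            for v in values]

colors = ["White", "Silver", "Other", "White"]
coded = dummy_code(colors, ["White", "Silver", "Other"])
# "Other" is represented by I1 = 0 and I2 = 0.
```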

67
Nominal Independent Variables Example Auction
Car Price
  • Solution
  • The proposed model is y = b0 + b1(Odometer) +
    b2I1 + b3I2 + e
  • The data

White car
Other color
Silver color
68
Example Auction Car Price The Regression
Equation
From Excel (Xm18-02b) we get the regression
equation PRICE = 16701 - 0.0555(Odometer) +
90.48(I1) + 295.48(I2)
The equation for a silver car:
Price = 16701 - 0.0555(Odometer) + 90.48(0) +
295.48(1)
The equation for a white car:
Price = 16701 - 0.0555(Odometer) + 90.48(1) +
295.48(0)
The equation for an other-color car:
Price = 16701 - 0.0555(Odometer) + 90.48(0) +
295.48(0)
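The three equations collapse into one function of the odometer reading and the two indicators I1 (white) and I2 (silver); the odometer value below is a hypothetical illustration:

```python
# Evaluate the fitted auction-price equation for each color category.

def price(odometer, i1, i2):
    return 16701 - 0.0555 * odometer + 90.48 * i1 + 295.48 * i2

odo = 40000                      # hypothetical odometer reading
p_white = price(odo, 1, 0)
p_silver = price(odo, 0, 1)
p_other = price(odo, 0, 0)
# A white car sells for 90.48 more, and a silver car 295.48 more,
# than an "Other colors" car with the same odometer reading.
```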
69
Example Auction Car Price The Regression
Equation
From Excel we get the regression equation PRICE
= 16701 - 0.0555(Odometer) + 90.48(I1) + 295.48(I2)
For one additional mile the auction price
decreases by 5.55 cents.
A white car sells, on average, for $90.48
more than a car in the "Other colors" category.
A silver car sells, on average, for
$295.48 more than a car in the "Other colors"
category.
70
Example Auction Car Price The Regression
Equation
Xm18-02b
71
Nominal Independent Variables Example MBA
Program Admission (MBA II)
  • Recall: The Dean wanted to evaluate applications
    for the MBA program by predicting future
    performance of the applicants.
  • The following three predictors were suggested
  • Undergraduate GPA
  • GMAT score
  • Years of work experience
  • It is now believed that the type of undergraduate
    degree should be included in the model.

Note The undergraduate degree is nominal data.
72
Nominal Independent Variables Example MBA
Program Admission (II)
I1 = 1 if B.A.; 0 otherwise
I2 = 1 if B.B.A.; 0 otherwise
I3 = 1 if B.Sc. or B.Eng.; 0 otherwise
The category "Other" is defined by I1 = 0,
I2 = 0, I3 = 0.
73
Nominal Independent Variables Example MBA
Program Admission (II)
MBA-II