Title: Simple and multiple linear regression
- What should be the sales of ice cream?
- Before building a movie theater, one must estimate the daily number of people entering the building.
- How can we estimate it?
- There are 2 million individuals in the city.
- One could conduct a local market study. However, it is often imprecise, especially for new projects.
- One could get data from similar projects in other
Probably, taking into account the size of the city
Case study: Ice Cream Sales
- The file icecream.xls contains pairs of data representing ice cream sales and the temperature recorded that day, for 30 days.
- Is there a relation between temperature and sales?
- Can temperature be used to predict ice cream sales?
- If so, what is the prediction when the temperature is 25?
- One of the principal objectives of statistics is to explain the variability that we observe in data.
- Linear regression (or linear models) is a widely used statistical tool to study the presence of a linear relation between
  - a dependent variable Y (quantitative and continuous)
  - and one or more variables X1, X2, ..., Xp (qualitative and/or quantitative), called independent or explanatory variables.
- For example, a manager could be interested in seeing whether he can explain a good part of the variability that he observes in the sales of his different branches (dependent variable Y) over the last 12 months by the area, number of employees, number of paid overtime hours, quality of customer service, number of promotions, etc. (independent or explanatory variables).
A regression model can be used to meet one of the following three objectives:
- Describe data coming from non-experimental studies, i.e. we observe reality as it is.
- Examine a hypothesis (data coming from controlled experimental studies).
- Predict (if we like to take risks!).
- We are interested in knowing which important factors influence or determine the value of a property, and we want to build a model that would help us evaluate this value using certain factors.
- To do this, we have obtained the total value for a sample of 79 properties in a given region. The following variables have also been collected for each property:
OBS  total value  land value  no. of acres  first floor  outdoor condition  heating type
1    199657       63247       1.63          1726         Good               NatGas
2    78482        38091       0.495         1184         Good               NatGas
3    119962       37665       0.375         1014         Good               Electric
4    116492       54062       0.981         1260         Average            Electric
5    131263       61546       1.14          1314         Average            NatGas
...
78   253480       57948       0.862         1720         Good               Electric
79   257037       57489       0.95          2004         Excellnt

OBS  no. of rooms  no. of bedrooms  no. of completed bathrooms  no. of non-completed bathrooms  no. of fire-places  GARAGE
1    8             4                2                           1                               2                   Garage
2    6             2                1                           0                               0                   NoGarage
3    7             3                2                           0                               1                   Garage
4    6             3                2                           0                               1                   Garage
5    8             4                2                           1                               2                   NoGarage
...
78   10            5                5                           1                               1                   Garage
79   9             4                2                           2                               2
Is there a link between the total value and the
different factors?
The Pearson correlation coefficient r is used to
measure the intensity of the linear relation
between two quantitative variables.
- The correlation coefficient r takes its values between -1 and 1.
- If a perfect linear relation exists between X and Y, then $r = \pm 1$ (r = 1 if X and Y vary in the same direction and r = -1 if X varies in the opposite direction of Y).
- If r = 0, there is no linear link between X and Y.
- The further r moves away from 0 and the closer it gets to $\pm 1$, the stronger the linear link between X and Y.
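As a minimal sketch of how r is computed, the function below applies the usual definition (covariation of X and Y divided by the square root of the product of their variations). The data are hypothetical and chosen only to show the two extreme cases r = 1 and r = -1; the name `pearson_r` is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (not taken from the course files)
x = [4, 5, 6, 7, 8]
perfect_up = [2 * v + 1 for v in x]     # exact increasing line
perfect_down = [-3 * v + 2 for v in x]  # exact decreasing line

print(pearson_r(x, perfect_up))    # perfect positive linear relation
print(pearson_r(x, perfect_down))  # perfect negative linear relation
```

Any data that do not fall exactly on a line give a value strictly between -1 and 1.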
[Figure: three scatter plots of Y against X, illustrating r = 0.035, r = 1 and r = -1.]
Descriptive statistics

Variable   N    Mean     Median   St.Deviation  Minimum  Maximum
Total      79   187253   156761   84401         74365    453744
Land       79   65899    59861    22987         35353    131224
Acre       79   1.579    1.040    1.324         0.290    5.880
Sq.Feet    79   1678     1628     635           672      3501
Rooms      79   8.519    8.000    2.401         5        18
Bedrooms   79   3.987    4.000    1.266         2        8
C.Bathro   79   2.241    2.000    1.283         1        7
Bathro     79   0.7215   1.000    0.715         0        3
Fire-pl.   79   1.975    2.000    1.368         0        7

Pearson Correlation Coefficients

           Total   Land    Acre    Sq.Feet  Rooms   Bedroom  C.Bathro  Bathro
Land       0.815
Acre       0.608   0.918
Sq.Feet    0.767   0.516   0.301
Rooms      0.626   0.518   0.373   0.563
Bedrooms   0.582   0.497   0.382   0.431    0.791
C.Bathro   0.626   0.506   0.376   0.457    0.479   0.586
Bathro     0.436   0.236   0.074   0.354    0.489   0.166    0.172
Fire-pl.   0.548   0.497   0.391   0.365    0.394   0.400    0.486     0.386
BE CAREFUL! It is important to interpret the correlation coefficient together with the graph.
r = 0.816 in all cases below.
[Figure: four scatter plots against X (two of the panels are labelled Y1 and Y2), each with r = 0.816.]
Simple linear regression
- To describe a linear relation between two quantitative variables, or to be able to predict Y for a given value of X, we use a regression line:
  $Y = \beta_0 + \beta_1 X + \varepsilon$
- Since any statistical model is only an approximation (we hope the best possible!) and because the linear link is never perfect, there is always an error in the model, denoted $\varepsilon$.
- If there were a perfect linear relation between Y and X, the error term would always be equal to 0, and all the variability of Y would be explained by the independent variable X.
- So, for a given value of X, we would like to estimate Y.
- Thus, with the help of the sample data, we estimate the regression model parameters $\beta_0$ and $\beta_1$ so as to minimize the sum of squared residuals (errors).
- The squared correlation coefficient is called the coefficient of determination; it is the percentage of the variability of Y explained by X:
  $R^2 = 1 - \frac{n-2}{n-1}\,\frac{S_e^2}{S_y^2}$,
- where $S_e$ is the standard deviation of the errors and $S_y$ is the standard deviation of Y.
- We can also use the adjusted coefficient of determination to indicate the percentage of the variability of Y explained by X:
  $R^2_{\text{adjusted}} = 1 - \frac{S_e^2}{S_y^2}$.
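The estimation step and the two coefficients of determination can be sketched as follows. This is only an illustration under stated assumptions: it uses the first five properties listed in the sample (first-floor area and total value), so the estimates will differ from those obtained on all 79 properties, and the helper name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# First five properties of the sample: first-floor area (sq.ft) and total value
x = [1726, 1184, 1014, 1260, 1314]
y = [199657, 78482, 119962, 116492, 131263]
n = len(x)

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sse = sum(e ** 2 for e in residuals)          # residual sum of squares
mean_y = sum(y) / n
sst = sum((yi - mean_y) ** 2 for yi in y)     # total sum of squares

se2 = sse / (n - 2)   # S_e^2: variance of the errors
sy2 = sst / (n - 1)   # S_y^2: sample variance of Y

r2 = 1 - (n - 2) / (n - 1) * se2 / sy2        # algebraically equal to 1 - SSE/SST
r2_adjusted = 1 - se2 / sy2

print(f"b0 = {b0:.1f}, b1 = {b1:.3f}")
print(f"R2 = {r2:.3f}, adjusted R2 = {r2_adjusted:.3f}")
```

The adjusted value is never larger than the unadjusted one, because it compares variances (which account for the degrees of freedom) rather than raw sums of squares.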
Simple linear regression example

MODEL 1. Regression Analysis

The regression equation is
Total = 16209 + 102 Sq.Feet

Predictor   Coef      StDev   T       P
Constant    16209     17447   0.93    0.356
Sq.Feet     101.939   9.734   10.47   0.000

S = 54556   R-Sq = 58.8%   R-Sq(adj) = 58.2%

Analysis of Variance
Source           DF   SS           MS           F        P
Regression       1    3.26460E11   3.26460E11   109.68   0.000
Residual Error   77   2.29181E11   2976374177
Total            78   5.55641E11
MODEL 2.

The regression equation is
Total = - 347 + 22021 Rooms

Predictor   Coef    StDev   T       P
Constant    -347    27621   -0.01   0.990
Rooms       22021   3122    7.05    0.000

S = 66210   R-Sq = 39.3%   R-Sq(adj) = 38.5%

Analysis of Variance
Source           DF   SS           MS           F       P
Regression       1    2.18090E11   2.18090E11   49.75   0.000
Residual Error   77   3.37551E11   4383775699
Total            78   5.55641E11

MODEL 3.

The regression equation is
Total = 32428 + 38829 Bedrooms

Predictor   Coef    StDev   T      P
Constant    32428   25826   1.26   0.213
Bedrooms    38829   6177    6.29   0.000

S = 69056   R-Sq = 33.9%   R-Sq(adj) = 33.1%

Analysis of Variance
Source           DF   SS           MS           F       P
Regression       1    1.88445E11   1.88445E11   39.52   0.000
Residual Error   77   3.67196E11   4768775127
Total            78   5.55641E11
- Model 1
  - total value = 16209 + 102 (number of square feet).
  - $R^2$ = 58.8%. Thus 58.8% of the variability of the total value is explained by the number of square feet.
- Model 2
  - total value = -347 + 22021 (number of rooms).
  - $R^2$ = 39.3%. Thus 39.3% of the variability of the total value is explained by the number of rooms.
- Model 3
  - total value = 32428 + 38829 (number of bedrooms).
  - $R^2$ = 33.9%. Thus 33.9% of the variability of the total value is explained by the number of bedrooms.
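The quantities printed in such a regression output are related to one another, which gives a quick way to check a reading of the table. The sketch below recomputes T, F, R-Sq and R-Sq(adj) for model 1 from the other printed numbers; the value used for the standard deviation of Total (84401) comes from the descriptive statistics, and the variable names are illustrative only.

```python
# Values copied from the MODEL 1 output (Total against Sq.Feet)
n = 79
coef, stdev = 101.939, 9.734   # slope and its standard deviation
ss_regression = 3.26460e11
ss_total = 5.55641e11
ms_error = 2976374177
s = 54556                      # S: standard deviation of the errors
sy = 84401                     # standard deviation of Total (descriptive statistics)

t_stat = coef / stdev                              # T column
f_stat = (ss_regression / 1) / ms_error            # F = MS regression / MS error
r2 = ss_regression / ss_total                      # R-Sq
r2_from_s = 1 - (n - 2) / (n - 1) * s**2 / sy**2   # same R-Sq from S and Sy
r2_adjusted = 1 - s**2 / sy**2                     # R-Sq(adj)

print(round(t_stat, 2), round(f_stat, 2))
print(round(100 * r2, 1), round(100 * r2_from_s, 1), round(100 * r2_adjusted, 1))
```

Up to rounding of the printed inputs, the results should agree with the T, F, R-Sq and R-Sq(adj) values shown for model 1.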
Which one of the 3 previous models would you choose and why?
- Model 1, because it has the largest value of $R^2$.
$1-\alpha$ confidence interval for the mean of the values of Y for a specific value of X
- For model 1 and a value of X = 1500 sq.ft, we obtain the following point estimate:
  est. total value = 16 209 + 101.939 × 1500 ≈ 169 117
- 95% confidence interval for the mean of the total value for properties of 1500 sq.ft: [156 418, 181 817]
  - as calculated by CI-regression.xls
$1-\alpha$ confidence interval for a new value of Y (prediction) given a specific value of X
- For model 1 and a value of X = 1500 sq.ft, we obtain the following point estimate:
  est. total value = 16 209 + 101.939 × 1500 ≈ 169 117
- 95% confidence interval for a predicted total value when the area of the first floor is 1500 sq.ft: [59 742, 278 492]
- The confidence interval for a predicted value is always wider than the one for the mean of Y at a specific X.
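Both intervals can be approximately reproduced with the standard textbook formulas for simple regression, which is presumably what the spreadsheet implements (an assumption; the spreadsheet itself is not shown). The sketch below uses only numbers printed earlier: the model 1 coefficients and S, and the mean and standard deviation of Sq.Feet, which are rounded. The critical value `t_crit` is hardcoded as an approximate tabulated Student quantile for 77 degrees of freedom rather than computed.

```python
import math

# Model 1 quantities taken from the outputs in the text
n = 79
b0, b1 = 16209, 101.939
s = 54556                  # standard deviation of the errors
x_mean, x_sd = 1678, 635   # Sq.Feet mean and standard deviation (rounded)

# Assumption: approximate 97.5% quantile of Student's t with n - 2 = 77 df
t_crit = 1.9913

x0 = 1500
y_hat = b0 + b1 * x0
sxx = (n - 1) * x_sd ** 2  # sum of squared deviations of X around its mean

# Standard error for the mean of Y at x0, and for a single new value of Y at x0
se_mean = s * math.sqrt(1 / n + (x0 - x_mean) ** 2 / sxx)
se_pred = s * math.sqrt(1 + 1 / n + (x0 - x_mean) ** 2 / sxx)

ci_mean = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
ci_pred = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

print([round(v) for v in ci_mean])
print([round(v) for v in ci_pred])
```

The extra "1 +" under the square root for a new observation is what makes the prediction interval so much wider: it adds the variability of an individual property around the regression line to the uncertainty about the line itself.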
Inference on regression model parameters
- If there is no linear link between Y and X, then $\beta_1 = 0$. So we want to test the following hypotheses:
  $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$
- We reject $H_0$ when the p-value is too small.
- This test is valid if
  - the relation between X and Y is linear
  - the data are independent
  - the variance of Y is the same for every value of X
  - Y has a normal distribution for every value of X, or the sample size n is large.
Multiple linear regression
- It is more likely that the variability of the dependent variable Y will be explained not by a single independent variable X, but rather by a linear combination of several independent variables X1, X2, ..., Xp.
- In this case, the multiple regression model is given by
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
- As before, using the sample data, we estimate the regression model parameters $\beta_0, \beta_1, \dots, \beta_p$ so as to minimize the sum of squared residuals (errors).
- The squared multiple correlation coefficient $R^2$, also called the coefficient of determination, represents the percentage of the variability of Y explained by the independent variables X1, X2, ..., Xp.
- When we add one or more independent variables to the model, $R^2$ increases.
- The question is whether $R^2$ increases to a significant degree.
- Note that we cannot have more independent variables in the model than there are observations in the sample. (general rule n ?
MODEL 1.

The regression equation is
Total = - 89131 + 3.05 Land - 20730 Acre + 43.3 Sq.Feet - 4352 Rooms + 10049 Bedroom + 7606 C.Bathro + 18725 Bathro + 882 Fire-pl.

Predictor   Coef     StDev    T       P
Constant    -89131   18302    -4.87   0.000
Land        3.0518   0.5260   5.80    0.000
Acre        -20730   7907     -2.62   0.011
Sq.Feet     43.336   7.670    5.65    0.000
Rooms       -4352    3036     -1.43   0.156
Bedroom     10049    5307     1.89    0.062
C.Bathro    7606     3610     2.11    0.039
Bathro      18725    6585     2.84    0.006
Fire-pl.    882      3184     0.28    0.783

S = 29704   R-Sq = 88.9%   R-Sq(adj) = 87.6%

Analysis of Variance
Source           DF   SS            MS            F       P
Regression       8    4.93877E11    61734659810   69.97   0.000
Residual Error   70   61763515565   882335937
Total            78
MODEL 2. Regression Analysis

The regression equation is
Total = - 97512 + 3.11 Land - 21880 Acre + 40.2 Sq.Feet + 4411 Bedroom + 8466 C.Bathro + 14328 Bathro

Predictor   Coef     StDev    T       P
Constant    -97512   17466    -5.58   0.000
Land        3.1103   0.5236   5.94    0.000
Acre        -21880   7884     -2.78   0.007
Sq.Feet     40.195   7.384    5.44    0.000
Bedroom     4411     3469     1.27    0.208
C.Bathro    8466     3488     2.43    0.018
Bathro      14328    5266     2.72    0.008

S = 29763   R-Sq = 88.5%   R-Sq(adj) = 87.6%

Analysis of Variance
Source           DF   SS            MS            F       P
Regression       6    4.91859E11    81976430646   92.54   0.000
Residual Error   72   63782210167   885864030
Total            78   5.55641E11
MODEL 3. Regression Analysis

The regression equation is
Total = - 90408 + 3.20 Land - 22534 Acre + 41.1 Sq.Feet + 10234 C.Bathro + 14183 Bathro

Predictor   Coef     StDev    T       P
Constant    -90408   16618    -5.44   0.000
Land        3.2045   0.5205   6.16    0.000
Acre        -22534   7901     -2.85   0.006
Sq.Feet     41.060   7.383    5.56    0.000
C.Bathro    10234    3213     3.19    0.002
Bathro      14183    5287     2.68    0.009

S = 29889   R-Sq = 88.3%   R-Sq(adj) = 87.5%

Analysis of Variance
Source           DF   SS            MS            F        P
Regression       5    4.90426E11    98085283380   109.80   0.000
Residual Error   73   65214377146   893347632
Total            78
Model without the area of the land (number of acres) because of its multicollinearity with the land value.
MODEL 4.

The regression equation is
Total = - 55533 + 1.82 Land + 49.8 Sq.Feet + 11696 C.Bathro + 18430 Bathro

Predictor   Coef     StDev    T       P
Constant    -55533   11783    -4.71   0.000
Land        1.8159   0.1929   9.42    0.000
Sq.Feet     49.833   7.028    7.09    0.000
C.Bathro    11696    3321     3.52    0.001
Bathro      18430    5312     3.47    0.001

S = 31297   R-Sq = 87.0%   R-Sq(adj) = 86.3%

Analysis of Variance
Source           DF   SS            MS           F        P
Regression       4    4.83160E11    1.20790E11   123.32   0.000
Residual Error   74   72481137708   979474834
Total            78   5.55641E11
Which one of the 4 previous models would you choose and why?
- Probably model 4, because all the independent variables are significant at the 5% level (i.e. for each $\beta$ in the model, p-value < 5%) and, although $R^2$ is smaller, it is only marginally smaller. Moreover, all the model coefficients make sense!
- In model 1, the variables number of rooms and number of fire-places are not statistically significant at the 5% level (p-value > 5%). The variable number of bedrooms is borderline, with a p-value of 0.0624.
Which one of the 4 previous models would you choose and why? (continued)
- In model 2, the variable number of bedrooms is not statistically significant at the 5% level.
- In model 3 (and the previous models), the coefficient of the variable number of acres is negative, which is contrary to common sense and to what we observed in the scatter plot and the positive Pearson correlation coefficient (r = 0.608).
- In models 1 to 3, the negative coefficient for the variable number of acres is due to the strong linear relation between the value of the land and the area of the land (r = 0.918): a multicollinearity problem.
- If two or more explanatory variables are strongly correlated (> 0.85 in absolute value), one says that there is multicollinearity. It influences the estimation of the parameters in the model.
- If two explanatory variables are highly correlated, then we can get rid of one of them: because of the strong correlation, the additional contribution of that variable is not significant once the other is in the model.
- The correlation between several pairs of variables can be calculated in Excel using Correlation in the Data Analysis toolbox.
How can we choose a particular linear regression model among all the possible ones?
- There are several techniques:
  - Step-by-step selection by adding one variable at a time, starting with the most significant one (stepwise, forward).
  - Selection starting from the model in which all the variables are included and removing one variable at a time, starting with the least significant (backward).
  - Construct all possible models and choose the best subset of variables according to certain specific criteria (e.g. adjusted $R^2$, Mallows' Cp).
Example of selection among the best subsets

Best Subsets Regression. Response is Total

Vars   R-Sq   Adj. R-Sq   C-p     S
1      66.4   65.9        136.8   49262
1      58.8   58.2        184.7   54556
1      39.3   38.5        307.6   66210
2      82.7   82.2        35.9    35564
2      78.8   78.3        60.3    39343
2      74.4   73.7        88.1    43244
3      85.6   85.0        19.5    32637
3      84.8   84.2        24.5    33521
3      84.8   84.2        24.9    33591
4      87.1   86.4        12.2    31115
4      87.0   86.3        13.1    31297
4      86.6   85.9        15.2    31682
5      88.3   87.5        6.9     29889
5      87.6   86.7        11.2    30744
5      87.4   86.5        12.4    30979
6      88.5   87.6        7.3     29763
6      88.3   87.3        8.6     30030
6      88.3   87.3        8.9     30096
7      88.9   87.8        7.1     29510
7      88.6   87.4        9.1     29924
7      88.3   87.2        10.6    30240
8      88.9   87.6        9.0     29704

[The X marks showing which of the variables Land, Acre, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro and Fire-pl. enter each subset are not recoverable.]
Selection of the model without the variable number of acres

Best Subsets Regression. Response is Total

Vars   R-Sq   Adj. R-Sq   C-p     S
1      66.4   65.9        120.6   49262
1      58.8   58.2        164.9   54556
1      39.3   38.5        278.3   66210
2      82.7   82.2        27.6    35564
2      72.7   71.9        86.0    44704
2      72.5   71.8        86.8    44813
3      84.8   84.2        17.2    33521
3      84.8   84.2        17.6    33591
3      84.0   83.3        22.3    34467
4      87.0   86.3        6.9     31297
4      86.1   85.3        12.1    32352
4      85.3   84.5        16.5    33226
5      87.3   86.4        6.9     31100
5      87.0   86.1        8.5     31439
5      87.0   86.1        8.9     31509
6      87.8   86.8        6.1     30707
6      87.3   86.3        8.7     31264
6      87.0   85.9        10.5    31656
7      87.8   86.6        8.0     30908

[The X marks showing which of the variables Land, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro and Fire-pl. enter each subset are not recoverable.]
The selection of the best model is based on a combination of the following criteria:
- The greatest value of $R^2$ adjusted for the number of variables in the model.
- The smallest value of Cp.
- Among models with comparable adjusted $R^2$ and Cp, we choose the model that makes the most sense according to the experts in the field.
- Among models with comparable adjusted $R^2$ and Cp, we choose the model whose independent variables are the easiest and least expensive to measure.
- The model validity.
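The two numerical criteria can be recomputed from the ANOVA tables of the models fitted earlier. The sketch below uses the usual definitions, which are assumed here rather than quoted from the course material: adjusted $R^2$ compares the error variance with the variance of Y, and Mallows' Cp compares a model's residual sum of squares with the error mean square of the full 8-variable model. The function names are illustrative.

```python
n = 79
ss_total = 5.55641e11
ms_error_full = 882335937   # MS error of the full 8-variable model

def adjusted_r2(sse, p):
    """Adjusted R^2 for a model with p explanatory variables."""
    return 1 - (sse / (n - p - 1)) / (ss_total / (n - 1))

def mallows_cp(sse, p):
    """Mallows' Cp relative to the error mean square of the full model."""
    return sse / ms_error_full - (n - 2 * (p + 1))

# (number of variables, residual sum of squares) read from the ANOVA tables
models = {
    "model 2": (6, 63782210167),
    "model 3": (5, 65214377146),
    "model 4": (4, 72481137708),
}
for name, (p, sse) in models.items():
    print(name, round(100 * adjusted_r2(sse, p), 1), round(mallows_cp(sse, p), 1))
```

Up to rounding, these values should match the Adj. R-Sq and C-p columns of the best-subsets listing that includes all eight variables. A Cp close to the number of parameters (p + 1) suggests the reduced model loses little compared with the full one.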
$1-\alpha$ confidence interval for the mean of Y and for a new value of Y (prediction), given a specific combination of values for X1, X2, ..., Xp.
- For model 4 and a property with a land value of 65 000, 1500 sq.ft, 2 completed bathrooms and 1 non-completed bathroom, we obtain the following point estimate:
  est. total value = -55 533 + 1.816 × 65 000 + 49.833 × 1 500 + 11 696 × 2 + 18 430 × 1 ≈ 179 074
- 95% confidence interval for the mean of the total value: [170 842, 187 306]
- 95% confidence interval for a predicted total value: [116 173, 241 974]
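The point estimate is simply the regression equation evaluated at the property's characteristics. A small sketch, using the model 4 coefficients as printed in its output (the dictionary layout is only illustrative):

```python
# Coefficients of MODEL 4 as printed in its output
coefficients = {
    "Constant": -55533,
    "Land": 1.8159,
    "Sq.Feet": 49.833,
    "C.Bathro": 11696,
    "Bathro": 18430,
}

# The property described in the text
property_values = {"Land": 65000, "Sq.Feet": 1500, "C.Bathro": 2, "Bathro": 1}

estimate = coefficients["Constant"] + sum(
    coefficients[name] * value for name, value in property_values.items()
)
print(round(estimate))
```

The result differs from 179 074 by a few units because the printed coefficients are rounded; the software works with more decimals.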
- For a 1500 sq.ft property, the multiple regression model gives narrower 95% confidence intervals than the simple regression model.
- Therefore, adding several other variables to the model helped to better explain the variability of the total value and to improve our estimates.
- If two or more independent variables are correlated, we say that there is multicollinearity. This can influence the value of the parameters in the model.
- Also, if two independent variables are strongly correlated, then only one of the two should be included in the model, the other one bringing very little additional information.
- Certain conditions are required for the validity of the model and of the corresponding inference (similar to those for simple linear regression).
Dummy variables
- How can one take qualitative information into account in a regression?
- Application: test on two or more means
- If a qualitative variable takes two values, one defines one dummy variable taking the values 0 or 1.
- Examples
  - Sex = 1 if male, 0 otherwise
  - Garage = 1 if garage, 0 if not.
Trick (continued)
- More generally, if a qualitative variable can take m values, one defines (m-1) dummy variables, all taking the values 0 or 1.
- Example: sex and job category (executive, white-collar, blue-collar)
  - X1 = 1 if male, 0 otherwise.
  - X2 = 1 if executive, 0 otherwise.
  - X3 = 1 if white-collar, 0 otherwise.
- One wants to explain the salary of an employee (Y) with the following variables: sex, job category and experience.
  - X1 = 1 if male, 0 otherwise.
  - X2 = 1 if executive, 0 otherwise.
  - X3 = 1 if white-collar, 0 otherwise.
  - X4 = years of experience.
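The coding above can be sketched as a small function. The category left without its own dummy (blue-collar, and female for sex) becomes the baseline against which the other coefficients are interpreted. The function name and the string labels are hypothetical.

```python
# Job categories from the example; blue-collar has no dummy and is the baseline
JOB_CATEGORIES = ("executive", "white-collar", "blue-collar")

def encode_employee(sex, job, experience_years):
    """Return [X1, X2, X3, X4] as defined in the example."""
    if job not in JOB_CATEGORIES:
        raise ValueError(f"unknown job category: {job!r}")
    x1 = 1 if sex == "male" else 0
    x2 = 1 if job == "executive" else 0
    x3 = 1 if job == "white-collar" else 0
    return [x1, x2, x3, experience_years]

print(encode_employee("female", "executive", 12))   # [0, 1, 0, 12]
print(encode_employee("male", "blue-collar", 3))    # [1, 0, 0, 3]
```

Three job categories need only two dummies: a third one would always equal 1 - X2 - X3 and would make the parameters impossible to estimate separately from the constant.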
Example (continued)
- Regression model
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$
- Question: interpret $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$.
- How do we know whether women have a smaller salary?
P-value for one-tailed tests in Excel
- In general, the p-value for a one-tailed alternative H1 is not given; only the p-value of the two-tailed test is. For example, in regression, Excel calculates the p-value P corresponding to
  $H_0: \beta_i = 0$ vs $H_1: \beta_i \neq 0$.
- How can we calculate the p-value corresponding to a one-tailed alternative H1?
  - Let P be the p-value for the two-tailed test.
  - If H1 is of the form $\beta_i > 0$ and the estimated coefficient is positive, then the p-value of the right-tailed test is P/2. Otherwise it is 1 - P/2.
  - If H1 is of the form $\beta_i < 0$ and the estimated coefficient is negative, then the p-value of the left-tailed test is P/2. Otherwise it is 1 - P/2.
  - In other words, the one-tailed p-value is half of the two-tailed p-value when the estimated coefficient has the same sign as the coefficient in H1. Otherwise, it is 1 - P/2.
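The rule translates directly into a few lines of code. In this sketch the function name and the labels "greater" and "less" are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
def one_tailed_p(two_tailed_p, estimate, alternative):
    """One-tailed p-value derived from the two-tailed p-value.

    alternative is "greater" for H1: beta > 0 or "less" for H1: beta < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    # Does the estimated coefficient have the sign claimed by H1?
    same_sign = (estimate > 0) if alternative == "greater" else (estimate < 0)
    return two_tailed_p / 2 if same_sign else 1 - two_tailed_p / 2

# Hypothetical example: two-tailed p-value of 0.10, estimated coefficient 2.5
print(one_tailed_p(0.10, 2.5, "greater"))  # estimate agrees with H1: half of P
print(one_tailed_p(0.10, 2.5, "less"))     # estimate contradicts H1: 1 - P/2
```

When the estimate contradicts H1, the one-tailed p-value is always above 0.5, so H0 can never be rejected in that direction.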
- One wants to know whether having a garage increases the total value of the property. The hypotheses to be tested are
  $H_0: \beta_{garage} \le 0$ vs $H_1: \beta_{garage} > 0$
- Since the estimated coefficient for garage is 22372 > 0, the p-value corresponding to $H_1: \beta_{garage} > 0$ is 0.058/2 = 0.029 < 0.05. The answer is yes, because we accept H1.
- Does the decision depend on the coding?
- If the dummy is defined as 0 if there is a garage and 1 otherwise, we would have obtained

Total = - 72080 + 1.83 Land + 47.2 Sq.Feet + 11535 C.Bathro + 18899 Bathro - 22372 Garage

Predictor   Coef     StDev    T       P
Constant    -72080   14175    -5.08   0.000
Land        1.8342   0.1892   9.69    0.000
Sq.Feet     47.175   7.013    6.73    0.000
C.Bathro    11535    3256     3.54    0.001
Bathro      18899    5211     3.63    0.001
Garage      -22372   11116    -2.01   0.058

S = 30671   R-Sq = 87.6%   R-Sq(adj)
- In that case, the right choice of hypotheses would have been
  $H_0: \beta_{garage} = 0$ vs $H_1: \beta_{garage} < 0$
- The corresponding p-value remains 0.029 = 0.058/2, because the estimated coefficient -22372 < 0 has the same sign as $\beta_{garage}$ in H1.
Comparison of several means
- Suppose one wants to compare the respective means of a quantitative variable Y for two groups: $\mu_1$ = mean of group 1, $\mu_2$ = mean of group 2.
- One can use regression by defining X = 1 for group 1 and X = 0 for group 2.
- In this case, $\beta = \mu_1 - \mu_2$.
- Hypothesis $H_1: \mu_1 > \mu_2$ corresponds to $H_1: \beta > 0$.
- Hypothesis $H_1: \mu_1 < \mu_2$ corresponds to $H_1: \beta < 0$.
- Hypothesis $H_1: \mu_1 \neq \mu_2$ corresponds to $H_1: \beta \neq 0$.
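That the slope of a regression on a 0/1 variable equals the difference between the two group means can be checked numerically. The sketch below uses hypothetical measurements and an illustrative helper `fit_line`; with sample data, the estimated slope equals the difference of the sample means and the estimated constant equals the mean of the group coded 0.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Hypothetical measurements for the two groups
group1 = [5.1, 4.8, 5.6, 5.0]
group2 = [4.2, 4.9, 4.4]

x = [1] * len(group1) + [0] * len(group2)   # X = 1 for group 1, 0 for group 2
y = group1 + group2

b0, b1 = fit_line(x, y)
mean1 = sum(group1) / len(group1)
mean2 = sum(group2) / len(group2)

print(b1, mean1 - mean2)   # slope vs difference of the sample means
print(b0, mean2)           # constant vs mean of group 2
```

The groups do not need to have the same size for this identity to hold.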
- A manager has some doubts about the (positive) effect of a course intended to improve the speed at which employees perform a given task.
- To confirm his belief, he asked a technician to choose 10 employees at random and to measure the time (in hours) each of them took to complete a task.
- Then the same employees attended the course.
- After the course, the employees had to carry out a similar task.
- The results are summarized in the following table: manager.xls
- a) Should the company maintain the training program? Take $\alpha$ = 5%.
- b) The technician in charge of the measurements forgot to identify the employees on the measurement form. What is the conclusion using that data set?
- Unfortunately, case b) is based on a real case.
- For situation a), the data are paired and we have to check whether the differences Before - After are significantly positive. The p-value is 0.0003 < 0.05 = $\alpha$.
- One accepts H1 and the manager concludes that the program should be maintained.
- In the second case, the data are not paired. One can use regression with
  - Y = time of execution, and
  - X = 1 for measurements before the course and X = 0 for measurements after the course.
- In that case, the right choice for H1 is
  $H_1: \beta > 0$
- Results are given by
- Since $H_1: \beta > 0$ (which is equivalent to $H_1: \mu_{before} > \mu_{after}$) and the estimated coefficient is 0.244 > 0, the p-value is 0.201/2 = 0.1005 > 0.05.
- One accepts H0, so the training program should not be maintained.
- This is a very good example of the consequence of the greater variability of two independent samples compared with a paired sample.
Remark: Comparing several means
- If one needs to compare the means of k groups for some variable Y, one can also use regression.
- For i = 1, 2, ..., k-1, set
  - $X_i$ = 1 for group i, 0 otherwise.
- Then
  - $\beta_0$ = mean of group k = $\mu_k$, and
  - $\beta_i = \mu_i - \mu_k$, $1 \le i \le k-1$.
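These identities follow from taking the expected value of Y in each group, as the short derivation below sketches. It assumes the regression model contains only the k-1 group dummies and that the errors have mean zero.

```latex
\begin{aligned}
E[Y \mid \text{group } k] &= \beta_0 = \mu_k
  && \text{(all dummies equal 0)}, \\
E[Y \mid \text{group } i] &= \beta_0 + \beta_i = \mu_i
  && \text{(only } X_i = 1\text{)}, \\
\Longrightarrow \quad \beta_i &= \mu_i - \mu_k, && 1 \le i \le k-1 .
\end{aligned}
```

Group k plays the role of the baseline: every coefficient measures how far another group's mean lies from it.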
- Therefore, the regression test where H0 is given by
  $H_0: \beta_1 = \beta_2 = \dots = \beta_{k-1} = 0$
- is equivalent to a test where H0 is given by
  $H_0: \mu_1 = \mu_2 = \dots = \mu_k$
- If H0 is rejected, then we conclude that at least two means are different.
- The p-value is the Significance F, found in the ANOVA table.