Simple and multiple linear regression - PowerPoint PPT Presentation

Slides: 63
Provided by: sec181

Transcript and Presenter's Notes

Title: Simple and multiple linear regression


1
MBA Statistics 51-651-00, Course 4
  • Simple and multiple linear regression
  • What should be the sales of ice cream?

2
Example
  • Before building a movie theater, one must
    estimate the daily number of people who will
    enter the building.
  • How can we estimate it?
  • There are 2 million inhabitants in the city.

3
Possible solutions
  • One could conduct a local market study. However,
    such studies are often imprecise, especially for
    new projects.
  • One could get data from similar projects in other
    cities.

4
What do you think? Can we do better?
5
Probably, taking into account the size of the city
6
Case study: Ice Cream Sales
  • The file icecream.xls contains pairs of data
    representing ice cream sales and temperature
    recorded that day, for 30 days.
  • Is there a relation between temperature and
    sales?
  • Can temperature be used to predict ice cream
    sales?
  • If so, what is the predicted sales figure when
    the temperature is 25?

7
Introduction
  • One of the principal objectives of statistics is
    to explain the variability that we observe in
    data.
  • Linear regression (or linear models) is a
    widely used statistical tool for studying the
    presence of a linear relation between
  • a dependent variable Y (quantitative and
    continuous)
  • and one or more variables X1, X2, ..., Xp
    (qualitative and/or quantitative), called
    independent or explanatory variables.

8
  • For example, a manager could be interested in
    seeing whether he can explain a good part of the
    variability in sales that he observes across his
    different branches (dependent variable Y) over
    the last 12 months by the area, number of
    employees, number of paid overtime hours, quality
    of customer service, number of promotions, etc.
    (independent or explanatory variables).

9
A regression model can be used to meet one of
the following three objectives:
  • Describe data coming from non-experimental
    studies, i.e. we observe reality as it is.
  • Test a hypothesis (data coming from controlled
    experimental studies).
  • Predict (if we like to take risks!).

10
Example
  • We are interested in knowing what are the
    important factors that influence or determine the
    value of a property and we want to build a model
    that would help us evaluate this value using
    certain factors.
  • To do this, we have obtained the total value for
    a sample of 79 properties in a given region. The
    following variables have also been collected for
    each property

11
Brief glimpse of the data file house.xls

OBS  Total   Land   # of   # of sq. feet  Outdoor    Heating
     value   value  acres  (first floor)  condition  type
  1  199657  63247  1.63   1726           Good       NatGas
  2   78482  38091  0.495  1184           Good       NatGas
  3  119962  37665  0.375  1014           Good       Electric
  4  116492  54062  0.981  1260           Average    Electric
  5  131263  61546  1.14   1314           Average    NatGas
...
 78  253480  57948  0.862  1720           Good       Electric
 79  257037  57489  0.95   2004           Excellnt   Electric

OBS  # of   # of      # of completed  # of non-completed  # of         GARAGE
     rooms  bedrooms  bathrooms       bathrooms           fire-places
  1   8     4         2               1                   2            Garage
  2   6     2         1               0                   0            NoGarage
  3   7     3         2               0                   1            Garage
  4   6     3         2               0                   1            Garage
  5   8     4         2               1                   2            NoGarage
...
 78  10     5         5               1                   1            Garage
 79   9     4         2               2                   2            Garage
12
Is there a link between the total value and the
different factors?
15
The Pearson correlation coefficient r is used to
measure the strength of the linear relation
between two quantitative variables.
  • The correlation coefficient r takes its values
    between -1 and 1.
  • If a perfect linear relation exists between X and
    Y, then r = ±1 (r = 1 if X and Y vary in the
    same direction and r = -1 if X varies in the
    opposite direction of Y).
  • If r = 0, there is no linear link between X and
    Y.
  • The further r moves away from 0 and the closer
    it gets to ±1, the stronger the linear link
    between X and Y.
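The properties listed above can be checked numerically. The following is a minimal sketch in Python; the function name pearson_r and the small data set are hypothetical and serve only to illustrate the two extreme cases r = 1 and r = -1.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only to illustrate the extreme cases.
x = [4, 5, 6, 7, 8, 9, 10]
y_up = [2 * v + 1 for v in x]      # exact increasing line
y_down = [-3 * v + 4 for v in x]   # exact decreasing line

print(pearson_r(x, y_up))    # very close to 1
print(pearson_r(x, y_down))  # very close to -1
```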

16
[Figure: three scatter plots of Y against X (X from 4 to 14), with r = 0.035, r = 1 and r = -1 respectively.]
17
Descriptive statistics

Variable   N   Mean    Median  Std.Deviation  Minimum  Maximum
Total      79  187253  156761  84401          74365    453744
Land       79  65899   59861   22987          35353    131224
Acre       79  1.579   1.040   1.324          0.290    5.880
Sq.Feet    79  1678    1628    635            672      3501
Rooms      79  8.519   8.000   2.401          5        18
Bedrooms   79  3.987   4.000   1.266          2        8
C.Bathro   79  2.241   2.000   1.283          1        7
Bathro     79  0.7215  1.000   0.715          0        3
Fire-pl.   79  1.975   2.000   1.368          0        7

Pearson Correlation Coefficients

          Total  Land   Acre   Sq.Feet  Rooms  Bedroom  C.Bathro  Bathro
Land      0.815
Acre      0.608  0.918
Sq.Feet   0.767  0.516  0.301
Rooms     0.626  0.518  0.373  0.563
Bedrooms  0.582  0.497  0.382  0.431    0.791
C.Bathro  0.626  0.506  0.376  0.457    0.479  0.586
Bathro    0.436  0.236  0.074  0.354    0.489  0.166    0.172
Fire-pl.  0.548  0.497  0.391  0.365    0.394  0.400    0.486     0.386
18
Be careful! It is important to interpret the
correlation coefficient together with the graph.
r = 0.816 in all four cases below.
[Figure: four scatter plots, Y1, Y2, Y3 and Y4 against X, all with the same correlation coefficient r = 0.816.]
19
Simple linear regression
  • To describe a linear relation between two
    quantitative variables, or to be able to predict
    Y for a given value of X, we use a regression
    line:
  • Y = β0 + β1X + ε
  • Since any statistical model is only an
    approximation (the best possible one, we hope!)
    and because the linear link is never perfect,
    the model always contains an error term,
    denoted ε.
  • If there were a perfect linear relation between
    Y and X, the error term would always be equal to
    0, and all the variability of Y would be
    explained by the independent variable X.

20
  • So, for a given value of X, we would like to
    estimate Y.
  • Thus, using the sample data, we estimate the
    regression model parameters β0 and β1 so as to
    minimize the sum of squares of the residuals
    (errors).
  • The squared correlation coefficient is called the
    coefficient of determination; it gives the
    percentage of the variability of Y explained by
    X:
  • R² = 1 - [(n-2)/(n-1)] Se²/Sy²,
  • where Se is the standard deviation of the errors
    and Sy is the standard deviation of Y.

21
  • We can also use the adjusted coefficient of
    determination to indicate the percentage of the
    variability of Y explained by X:
  • R² adjusted = 1 - Se²/Sy².
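As an illustration of the least-squares estimates and of the two coefficients of determination, here is a minimal sketch in Python. The helper names fit_line and r_squared and the temperature/sales figures are hypothetical (they are not the icecream.xls data); Se is computed with n-2 degrees of freedom and Sy with n-1, as in the formulas above.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line Y = b0 + b1*X."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

def r_squared(x, y):
    """Return (R2, adjusted R2) computed from Se and Sy."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    my = sum(y) / n
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    se2 = sse / (n - 2)   # squared standard deviation of the errors
    sy2 = sst / (n - 1)   # squared standard deviation of Y
    r2 = 1 - (n - 2) / (n - 1) * se2 / sy2
    r2_adj = 1 - se2 / sy2
    return r2, r2_adj

# Hypothetical temperatures and sales, for illustration only.
temperature = [15, 18, 20, 22, 25, 27, 30]
sales = [210, 250, 270, 310, 330, 380, 400]

b0, b1 = fit_line(temperature, sales)
r2, r2_adj = r_squared(temperature, sales)
print(f"sales = {b0:.1f} + {b1:.2f} * temperature")
print(f"R2 = {r2:.3f}, adjusted R2 = {r2_adj:.3f}")
```

The adjusted value is always slightly below R², because the factor (n-2)/(n-1) is smaller than 1.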

22
Simple linear regression example
MODEL 1. Regression Analysis

The regression equation is
Total = 16209 + 102 Sq.Feet

Predictor  Coef     StDev  T      P
Constant   16209    17447  0.93   0.356
Sq.Feet    101.939  9.734  10.47  0.000

S = 54556   R-Sq = 58.8%   R-Sq(adj) = 58.2%

Analysis of Variance
Source          DF  SS          MS          F       P
Regression       1  3.26460E11  3.26460E11  109.68  0.000
Residual Error  77  2.29181E11  2976374177
Total           78  5.55641E11
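The summary figures of this output are linked to the Analysis of Variance table by standard identities: R-Sq is the regression sum of squares divided by the total sum of squares, F is the ratio of the two mean squares, and S is the square root of the residual mean square. A short Python sketch, using only numbers copied from the Model 1 output (the variable names are ours):

```python
import math

# Values copied from the Analysis of Variance table of Model 1.
ss_regression = 3.26460e11
ss_total = 5.55641e11
ms_error = 2976374177
df_regression = 1

r_sq = ss_regression / ss_total                       # share of variability explained
f_stat = (ss_regression / df_regression) / ms_error   # regression MS / residual MS
s = math.sqrt(ms_error)                               # standard deviation of the errors

print(f"R-Sq = {100 * r_sq:.1f}%")
print(f"F = {f_stat:.2f}")
print(f"S = {s:.0f}")
```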
23
MODEL 2.

The regression equation is
Total = -347 + 22021 Rooms

Predictor  Coef   StDev  T      P
Constant   -347   27621  -0.01  0.990
Rooms      22021  3122   7.05   0.000

S = 66210   R-Sq = 39.3%   R-Sq(adj) = 38.5%

Analysis of Variance
Source          DF  SS          MS          F      P
Regression       1  2.18090E11  2.18090E11  49.75  0.000
Residual Error  77  3.37551E11  4383775699
Total           78  5.55641E11

MODEL 3.

The regression equation is
Total = 32428 + 38829 Bedrooms

Predictor  Coef   StDev  T     P
Constant   32428  25826  1.26  0.213
Bedrooms   38829  6177   6.29  0.000

S = 69056   R-Sq = 33.9%   R-Sq(adj) = 33.1%

Analysis of Variance
Source          DF  SS          MS          F      P
Regression       1  1.88445E11  1.88445E11  39.52  0.000
Residual Error  77  3.67196E11  4768775127
Total           78  5.55641E11
24
  • Model 1:
  • total value = 16209 + 102 (# of square feet).
  • R² = 58.8%. Thus 58.8% of the variability of
    the total value is explained by the # of square
    feet.
  • Model 2:
  • total value = -347 + 22021 (# of rooms).
  • R² = 39.3%. Thus 39.3% of the variability of
    the total value is explained by the # of rooms.
  • Model 3:
  • total value = 32428 + 38829 (# of bedrooms).
  • R² = 33.9%. Thus 33.9% of the variability of
    the total value is explained by the # of
    bedrooms.

25
Which one of the 3 previous models would you
choose and why?
  • Model 1, because it has the largest value of R².

26
1-α confidence interval for the mean of the
values of Y for a specific value of X
  • For model 1 and a value of X = 1500 sq.ft we
    obtain the following point estimate:
  • est. total value = 16 209 + 101.939 × 1500 =
    169 117
  • 95% confidence interval for the mean of the
    total value for properties of 1500 sq.ft:
  • [156 418, 181 817]
  • as calculated by CI-regression.xls

27
1-α confidence interval for a new value of Y
(prediction), given a specific value of X
  • For model 1 and a value of X = 1500 sq.ft we
    obtain the following point estimate:
  • est. total value = 16 209 + 101.939 × 1500 =
    169 117
  • 95% confidence interval for a predicted total
    value when the area of the first floor is 1500
    sq.ft:
  • [59 742, 278 492]
  • The confidence interval for a predicted value is
    always wider than the one for the mean of the
    values of Y at a specific X.
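Both intervals can be approximately reproduced from the summary figures given earlier. The sketch below is in Python and uses the usual textbook formulas, which are not shown in the slides and may differ in detail from what CI-regression.xls does: the half-width for the mean is t × S × sqrt(1/n + (x0 - mean of X)² / Sxx), and the prediction interval adds 1 under the square root. The critical value 1.9913 (Student's t, 97.5% quantile, 77 degrees of freedom) is hard-coded as an assumption, and the function name intervals is ours.

```python
import math

# Summary values taken from the Model 1 output and the descriptive statistics.
n = 79
b0, b1 = 16209.0, 101.939      # fitted line: Total = b0 + b1 * Sq.Feet
s = 54556.0                    # standard deviation of the errors (S)
x_mean, x_sd = 1678.0, 635.0   # mean and standard deviation of Sq.Feet
t_crit = 1.9913                # assumed t quantile: 97.5%, n - 2 = 77 d.f.

def intervals(x0):
    """Point estimate, CI for the mean of Y, and prediction interval at x0."""
    y_hat = b0 + b1 * x0
    sxx = (n - 1) * x_sd ** 2                   # sum of squares of X about its mean
    h = 1 / n + (x0 - x_mean) ** 2 / sxx        # variance factor for the mean
    half_mean = t_crit * s * math.sqrt(h)
    half_pred = t_crit * s * math.sqrt(1 + h)   # extra 1 for a single new value
    return (y_hat,
            (y_hat - half_mean, y_hat + half_mean),
            (y_hat - half_pred, y_hat + half_pred))

y_hat, ci_mean, ci_pred = intervals(1500)
print(round(y_hat), [round(v) for v in ci_mean], [round(v) for v in ci_pred])
```

Because the inputs are rounded, the results agree with the intervals quoted above only to within a few units.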

28
Inference on regression model parameters
  • If there is no linear link between Y and X, then
    β1 = 0. So we want to test the following
    hypotheses:
  • H0: β1 = 0 vs H1: β1 ≠ 0
  • We reject H0 when the p-value is too small
    (below the chosen significance level).
  • This test is valid if
  • the relation between X and Y is linear;
  • the data are independent;
  • the variance of Y is the same for every value of
    X;
  • Y has a normal distribution for every value of
    X, or the sample size n is large.
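For the Sq.Feet coefficient of Model 1, the T column of the output is simply the coefficient divided by its standard deviation. Rejecting H0 when the two-tailed p-value is below 5% is equivalent to rejecting when |T| exceeds the corresponding critical value of Student's t distribution. A small Python sketch; the critical value 1.9913 (97.5% quantile, 77 degrees of freedom) is hard-coded as an assumption:

```python
# Coefficient and standard deviation of Sq.Feet from the Model 1 output.
coef, st_dev = 101.939, 9.734
t_stat = coef / st_dev          # the T column of the output

# Assumed critical value: 97.5% quantile of Student's t with 77 d.f.
t_crit = 1.9913
reject_h0 = abs(t_stat) > t_crit

print(f"T = {t_stat:.2f}, reject H0 at the 5% level: {reject_h0}")
```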

29
Multiple linear regression
  • It is more likely that the variability of the
    dependent variable Y will be explained not by a
    single independent variable X, but rather by a
    linear combination of several independent
    variables X1, X2, ..., Xp.
  • In this case, the multiple regression model is
    given by
  • Y = β0 + β1X1 + β2X2 + ... + βpXp + ε
  • Again, using the sample data, we estimate the
    regression model parameters β0, β1, ..., βp so
    as to minimize the sum of squares of the
    residuals (errors).
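One way to obtain the least-squares estimates in the multiple case is to solve the normal equations (X'X)b = X'y. The Python sketch below does this with a small Gaussian elimination so that it needs no external library; the function names solve and fit_multiple and the data are hypothetical, and statistical software typically uses numerically more robust methods.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple(rows, y):
    """Least-squares coefficients [b0, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]         # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2 (no error term).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 7), (6, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefs = fit_multiple(rows, y)
print([round(c, 6) for c in coefs])
```

With no error term in the data, the fit recovers the generating coefficients up to rounding error.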

30
  • The squared multiple correlation coefficient R²,
    also called the coefficient of determination,
    represents the percentage of the variability of Y
    explained by the independent variables X1, X2,
    ..., Xp.
  • When we add one or more independent variables to
    the model, R² increases (it can never decrease).
  • The question is whether R² increases to a
    significant degree.
  • Note that we cannot have more independent
    variables in the model than there are
    observations in the sample (general rule: n ≥
    5p).

31
Example
MODEL 1.

The regression equation is
Total = -89131 + 3.05 Land - 20730 Acre + 43.3 Sq.Feet - 4352 Rooms
        + 10049 Bedroom + 7606 C.Bathro + 18725 Bathro + 882 Fire-pl.

Predictor  Coef    StDev   T      P
Constant   -89131  18302   -4.87  0.000
Land       3.0518  0.5260  5.80   0.000
Acre       -20730  7907    -2.62  0.011
Sq.Feet    43.336  7.670   5.65   0.000
Rooms      -4352   3036    -1.43  0.156
Bedroom    10049   5307    1.89   0.062
C.Bathro   7606    3610    2.11   0.039
Bathro     18725   6585    2.84   0.006
Fire-pl.   882     3184    0.28   0.783

S = 29704   R-Sq = 88.9%   R-Sq(adj) = 87.6%

Analysis of Variance
Source          DF  SS           MS           F      P
Regression       8  4.93877E11   61734659810  69.97  0.000
Residual Error  70  61763515565  882335937
Total           78  5.55641E11
32
MODEL 2. Regression Analysis

The regression equation is
Total = -97512 + 3.11 Land - 21880 Acre + 40.2 Sq.Feet + 4411 Bedroom
        + 8466 C.Bathro + 14328 Bathro

Predictor  Coef    StDev   T      P
Constant   -97512  17466   -5.58  0.000
Land       3.1103  0.5236  5.94   0.000
Acre       -21880  7884    -2.78  0.007
Sq.Feet    40.195  7.384   5.44   0.000
Bedroom    4411    3469    1.27   0.208
C.Bathro   8466    3488    2.43   0.018
Bathro     14328   5266    2.72   0.008

S = 29763   R-Sq = 88.5%   R-Sq(adj) = 87.6%

Analysis of Variance
Source          DF  SS           MS           F      P
Regression       6  4.91859E11   81976430646  92.54  0.000
Residual Error  72  63782210167  885864030
Total           78  5.55641E11
33
MODEL 3. Regression Analysis

The regression equation is
Total = -90408 + 3.20 Land - 22534 Acre + 41.1 Sq.Feet + 10234 C.Bathro
        + 14183 Bathro

Predictor  Coef    StDev   T      P
Constant   -90408  16618   -5.44  0.000
Land       3.2045  0.5205  6.16   0.000
Acre       -22534  7901    -2.85  0.006
Sq.Feet    41.060  7.383   5.56   0.000
C.Bathro   10234   3213    3.19   0.002
Bathro     14183   5287    2.68   0.009

S = 29889   R-Sq = 88.3%   R-Sq(adj) = 87.5%

Analysis of Variance
Source          DF  SS           MS           F       P
Regression       5  4.90426E11   98085283380  109.80  0.000
Residual Error  73  65214377146  893347632
Total           78  5.55641E11
34
Model without the area of the land (# of acres)
because of its multicollinearity with the land
value.

MODEL 4.

The regression equation is
Total = -55533 + 1.82 Land + 49.8 Sq.Feet + 11696 C.Bathro + 18430 Bathro

Predictor  Coef    StDev   T      P
Constant   -55533  11783   -4.71  0.000
Land       1.8159  0.1929  9.42   0.000
Sq.Feet    49.833  7.028   7.09   0.000
C.Bathro   11696   3321    3.52   0.001
Bathro     18430   5312    3.47   0.001

S = 31297   R-Sq = 87.0%   R-Sq(adj) = 86.3%

Analysis of Variance
Source          DF  SS           MS          F       P
Regression       4  4.83160E11   1.20790E11  123.32  0.000
Residual Error  74  72481137708  979474834
Total           78  5.55641E11
35
Which one of the 4 previous models would you
choose and why?
  • Probably model 4, because all the independent
    variables are significant at the 5% level (i.e.
    for each β in the model, p-value < 5%) and,
    although R² is smaller, it is only marginally
    smaller. Moreover, all the model coefficients
    make sense!
  • In model 1, the variables "# of rooms" and
    "# of fire-places" are not statistically
    significant at the 5% level (p-value > 5%). The
    variable "# of bedrooms" is borderline, with a
    p-value of 0.0624.

36
Which one of the 4 previous models would you
choose and why? (continued)
  • In model 2, the variable "# of bedrooms" is not
    statistically significant at the 5% level.
  • In model 3 (and the previous models), the
    coefficient of the variable "# of acres" is
    negative, which is contrary to common sense and
    to what we observed in the scatter plot and the
    positive Pearson correlation coefficient (r =
    0.608).
  • In models 1 to 3, the negative coefficient for
    the variable "# of acres" is due to the strong
    linear relation between the value of the land and
    the area of the land (r = 0.918): a
    multicollinearity problem.
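The R-Sq(adj) values used in this comparison can be recomputed from the Analysis of Variance tables of the four models. The Python sketch below uses the usual formula for a model with p independent variables, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), which reduces to 1 - Se²/Sy². The function name adjusted_r2 is ours; the sums of squares are copied from the outputs above.

```python
def adjusted_r2(ss_regression, ss_total, n, p):
    """Adjusted R2 for a model with p independent variables and n observations."""
    r2 = ss_regression / ss_total
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 79
ss_total = 5.55641e11
# (regression sum of squares, number of independent variables) for models 1 to 4.
models = {
    "Model 1": (4.93877e11, 8),
    "Model 2": (4.91859e11, 6),
    "Model 3": (4.90426e11, 5),
    "Model 4": (4.83160e11, 4),
}

adj = {name: adjusted_r2(ssr, ss_total, n, p) for name, (ssr, p) in models.items()}
for name, value in adj.items():
    print(f"{name}: R-Sq(adj) = {100 * value:.1f}%")
```

Dropping four variables between model 1 and model 4 costs only about one percentage point of adjusted R².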

37
Multicollinearity
  • If two or more explanatory variables are strongly
    correlated (> 0.85 in absolute value), one says
    that there is multicollinearity. It affects the
    estimation of the parameters in the model.
  • If two explanatory variables are highly
    correlated, one of them can be dropped: once one
    of them is in the model, the contribution of the
    other is not significant.
  • The correlations between several pairs of
    variables can be calculated in Excel using
    Correlation in the Data Analysis toolbox.
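The 0.85 rule of thumb is easy to apply mechanically. The following Python sketch screens a few pairs of explanatory variables; the correlations are copied from the Pearson correlation table of the property data, and only a subset of pairs is listed to keep the example short.

```python
# Pairwise Pearson correlations between explanatory variables,
# copied from the correlation table of the property data (subset of pairs).
correlations = {
    ("Land", "Acre"): 0.918,
    ("Land", "Sq.Feet"): 0.516,
    ("Acre", "Sq.Feet"): 0.301,
    ("Rooms", "Bedrooms"): 0.791,
    ("C.Bathro", "Bathro"): 0.172,
}
THRESHOLD = 0.85  # rule of thumb quoted above

flagged = [pair for pair, r in correlations.items() if abs(r) > THRESHOLD]
print(flagged)
```

Only the pair (Land, Acre) exceeds the threshold, which is why the acreage variable is dropped in model 4.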

38
How can we choose a particular linear regression
model among all the possible ones?
  • There are several techniques:
  • Step-by-step selection, adding one variable at
    a time, starting with the most significant one
    (stepwise, forward).
  • Selection starting from the model in which all
    the variables are included and removing one
    variable at a time, starting with the least
    significant (backward).
  • Construct all possible models and choose the best
    subset of variables according to specific
    criteria (e.g. adjusted R², Mallows' Cp).

39
Example of selection among the best subsets

Best Subsets Regression: Response is Total

[The original table also marks with an X which of the variables Land, Acre, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro and Fire-pl. enter each subset; those columns are not recoverable here.]

Vars  R-Sq  Adj. R-Sq  C-p    S
1     66.4  65.9       136.8  49262
1     58.8  58.2       184.7  54556
1     39.3  38.5       307.6  66210
2     82.7  82.2       35.9   35564
2     78.8  78.3       60.3   39343
2     74.4  73.7       88.1   43244
3     85.6  85.0       19.5   32637
3     84.8  84.2       24.5   33521
3     84.8  84.2       24.9   33591
4     87.1  86.4       12.2   31115
4     87.0  86.3       13.1   31297
4     86.6  85.9       15.2   31682
5     88.3  87.5       6.9    29889
5     87.6  86.7       11.2   30744
5     87.4  86.5       12.4   30979
6     88.5  87.6       7.3    29763
6     88.3  87.3       8.6    30030
6     88.3  87.3       8.9    30096
7     88.9  87.8       7.1    29510
7     88.6  87.4       9.1    29924
7     88.3  87.2       10.6   30240
8     88.9  87.6       9.0    29704
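The C-p column can be recomputed for the models whose Analysis of Variance tables were shown earlier. The Python sketch below uses the standard definition of Mallows' Cp, which is not stated in the slides: the residual sum of squares of the subset model divided by the residual mean square of the full model, minus n, plus twice the number of estimated parameters (variables plus intercept). The function name mallows_cp is ours.

```python
def mallows_cp(sse_subset, mse_full, n, n_vars):
    """Mallows' Cp for a subset model with n_vars independent variables."""
    return sse_subset / mse_full - n + 2 * (n_vars + 1)

n = 79
mse_full = 882335937   # residual mean square of the 8-variable model
# (residual sum of squares, number of variables), from the ANOVA tables.
models = {
    "8 variables": (61763515565, 8),
    "6 variables": (63782210167, 6),
    "5 variables": (65214377146, 5),
    "4 variables, no Acre": (72481137708, 4),
}

cp = {name: mallows_cp(sse, mse_full, n, k) for name, (sse, k) in models.items()}
for name, value in cp.items():
    print(f"{name}: Cp = {value:.1f}")
```

For the full model, Cp always equals the number of parameters (here 9), which gives a reference point when reading the column.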
40
Selection of the model without the variable
# of acres

Best Subsets Regression: Response is Total

[The original table also marks with an X which of the variables Land, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro and Fire-pl. enter each subset; those columns are not recoverable here.]

Vars  R-Sq  Adj. R-Sq  C-p    S
1     66.4  65.9       120.6  49262
1     58.8  58.2       164.9  54556
1     39.3  38.5       278.3  66210
2     82.7  82.2       27.6   35564
2     72.7  71.9       86.0   44704
2     72.5  71.8       86.8   44813
3     84.8  84.2       17.2   33521
3     84.8  84.2       17.6   33591
3     84.0  83.3       22.3   34467
4     87.0  86.3       6.9    31297
4     86.1  85.3       12.1   32352
4     85.3  84.5       16.5   33226
5     87.3  86.4       6.9    31100
5     87.0  86.1       8.5    31439
5     87.0  86.1       8.9    31509
6     87.8  86.8       6.1    30707
6     87.3  86.3       8.7    31264
6     87.0  85.9       10.5   31656
7     87.8  86.6       8.0    30908
41
The selection of the best model is based on a
combination of the following criteria:
  • The greatest value of R² adjusted for the number
    of variables in the model.
  • The smallest value of Cp.
  • Among models with comparable adjusted R² and Cp,
    we choose the model that makes the most "common
    sense" according to experts in the field.
  • Among models with comparable adjusted R² and Cp,
    we choose the model whose independent variables
    are the easiest and least expensive to measure.
  • The validity of the model.

42
1-α confidence interval for the mean of Y and for
a new value of Y (prediction), given a specific
combination of values for X1, X2, ..., Xp
  • For model 4 and a property with a land value of
    65 000, 1500 sq.ft, 2 completed bathrooms and 1
    non-completed bathroom, we obtain the following
    point estimate:
  • est. total value = -55 533 + 1.816 × 65 000 +
    49.833 × 1 500 + 11 696 × 2 + 18 430 × 1 ≈
    179 074
  • 95% confidence interval for the mean of the
    total value:
  • [170 842, 187 306]
  • 95% confidence interval for a predicted total
    value:
  • [116 173, 241 974]
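The point estimate is just the regression equation evaluated at the chosen values. A minimal Python sketch, with the coefficients copied from the Model 4 output (the dictionary names are ours); because the printed coefficients are rounded, the result differs from 179 074 by a few units.

```python
# Coefficients of Model 4, copied from its regression output.
model4 = {"Constant": -55533, "Land": 1.8159, "Sq.Feet": 49.833,
          "C.Bathro": 11696, "Bathro": 18430}

# Property described above: land value, first-floor area, bathrooms.
house = {"Land": 65000, "Sq.Feet": 1500, "C.Bathro": 2, "Bathro": 1}

estimate = model4["Constant"] + sum(model4[name] * value
                                    for name, value in house.items())
print(round(estimate))
```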

43
Notes
  • For a 1500 sq.ft property, the multiple
    regression model gives narrower 95% confidence
    intervals than the simple regression model.
  • Therefore, adding several other variables to the
    model helped to better explain the variability of
    the total value and to improve our estimates.
  • If two or more independent variables are
    correlated, we say that there is
    multicollinearity. This can influence the values
    of the parameters in the model.
  • Also, if two independent variables are strongly
    correlated, only one of the two should be
    included in the model, the other one bringing
    very little additional information.
  • Certain conditions are required for the validity
    of the model and of the corresponding inference
    (similar to those of simple linear regression).

44
Dummy variables
  • How can one take into account qualitative
    information in a regression?
  • Application: tests on two or more means

45
Trick
  • If a qualitative variable takes two values, one
    defines one dummy variable taking the value 0 or
    1.
  • Examples:
  • Sex = 1 if male, 0 otherwise.
  • Garage = 1 if garage, 0 if not.

46
Trick (continued)
  • More generally, if a qualitative variable can
    take m values, one defines (m-1) dummy variables,
    each taking the value 0 or 1.
  • Example: sex and job category (executive,
    white-collar, blue-collar)
  • X1 = 1 if male, 0 otherwise.
  • X2 = 1 if executive, 0 otherwise.
  • X3 = 1 if white-collar, 0 otherwise.
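The coding rule can be sketched in Python as follows. The function names dummy_code and encode are hypothetical; the last category of each list plays the role of the baseline, i.e. the one for which all dummies are 0.

```python
def dummy_code(value, categories):
    """Encode one of m categories as m-1 indicators; the last one is the baseline."""
    return [1 if value == c else 0 for c in categories[:-1]]

SEX = ["male", "female"]                              # baseline: female
JOB = ["executive", "white-collar", "blue-collar"]    # baseline: blue-collar

def encode(sex, job):
    """Return [X1, X2, X3] as defined in the example above."""
    return dummy_code(sex, SEX) + dummy_code(job, JOB)

print(encode("male", "executive"))       # [1, 1, 0]
print(encode("female", "white-collar"))  # [0, 0, 1]
print(encode("female", "blue-collar"))   # [0, 0, 0]
```

Using m-1 rather than m dummies avoids an exact linear dependence with the constant term of the regression.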

47
Example
  • One wants to explain the salary of an employee
    (Y) with the following variables: sex, job
    category and experience.
  • X1 = 1 if male, 0 otherwise.
  • X2 = 1 if executive, 0 otherwise.
  • X3 = 1 if white-collar, 0 otherwise.
  • X4 = years of experience.

48
Example (continued)
  • Regression model:
  • Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
  • Question: interpret β0, β1, β2, β3, β4.
  • How do we know whether women have a lower salary?

49
P-value for one-tailed tests in Excel
  • In general, the p-value for a one-tailed
    alternative hypothesis H1 is not reported, only
    the p-value of the two-tailed test. For example,
    in regression, Excel calculates the p-value P
    corresponding to
  • H0: βi = 0 vs H1: βi ≠ 0.
  • How can we calculate the p-value corresponding to
    a one-tailed hypothesis H1?

50
Rules
  • P = p-value for the two-tailed test.
  • If H1 is of the form βi > 0 and the estimated
    coefficient bi > 0, then the p-value of the
    right-tailed test is P/2. Otherwise it is
    1 - P/2.
  • If H1 is of the form βi < 0 and the estimated
    coefficient bi < 0, then the p-value of the
    left-tailed test is P/2. Otherwise it is
    1 - P/2.
  • In other words, the one-tailed p-value is half
    of the two-tailed p-value when the estimated
    coefficient has the same sign as the coefficient
    in H1. Otherwise, it is 1 - P/2.

51
Question
  • One wants to know whether having a garage
    increases the total value of the property. The
    hypotheses to be tested should be
  • H0: βgarage ≤ 0 vs H1: βgarage > 0
  • Since the estimate bgarage = 22372 > 0, the
    p-value corresponding to H1: βgarage > 0 is
    0.058/2 = 0.029 < 0.05. The answer is yes,
    because we accept H1.
  • Does the decision depend on the coding?
52
  • If the dummy is defined as 0 if there is a
    garage and 1 otherwise, we would have obtained:

The regression equation is
Total = -72080 + 1.83 Land + 47.2 Sq.Feet + 11535 C.Bathro + 18899 Bathro
        - 22372 Garage

Predictor  Coef    StDev   T      P
Constant   -72080  14175   -5.08  0.000
Land       1.8342  0.1892  9.69   0.000
Sq.Feet    47.175  7.013   6.73   0.000
C.Bathro   11535   3256    3.54   0.001
Bathro     18899   5211    3.63   0.001
Garage     -22372  11116   -2.01  0.058

S = 30671   R-Sq = 87.6%   R-Sq(adj) = 86.8%

53
  • In that case, the right choice of hypotheses
    would have been
  • H0: βgarage = 0 vs H1: βgarage < 0
  • The corresponding p-value remains 0.029 = 0.058/2
    because the estimate bgarage = -22372 < 0 has the
    same sign as βgarage in H1.
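The rule for one-tailed p-values, applied to both codings of the garage dummy, can be sketched in Python as follows. The function name one_tailed_p and its argument convention (">" or "<" for the direction of H1) are hypothetical.

```python
def one_tailed_p(two_tailed_p, estimate, alternative):
    """One-tailed p-value deduced from the two-tailed p-value.

    alternative is ">" for H1: coefficient > 0 and "<" for H1: coefficient < 0.
    """
    same_sign = (estimate > 0) if alternative == ">" else (estimate < 0)
    return two_tailed_p / 2 if same_sign else 1 - two_tailed_p / 2

# Garage coded 1 if garage: estimate 22372, H1: coefficient > 0.
p_first_coding = one_tailed_p(0.058, 22372, ">")
# Garage coded 1 if no garage: estimate -22372, H1: coefficient < 0.
p_second_coding = one_tailed_p(0.058, -22372, "<")
# Estimate with the wrong sign for H1: the p-value becomes large.
p_wrong_sign = one_tailed_p(0.058, -22372, ">")

print(p_first_coding, p_second_coding, p_wrong_sign)
```

Both codings lead to the same one-tailed p-value, so the decision does not depend on the coding as long as H1 is stated consistently with it.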

54
Comparison of several means
  • Suppose one wants to compare the respective means
    of a quantitative variable Y for two groups: μ1 =
    mean of group 1, μ2 = mean of group 2.
  • One can use regression by defining X = 1 for
    group 1, and X = 0 for group 2.
  • In this case, the slope is β = μ1 - μ2.
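A quick numerical check of this identity, as a Python sketch with hypothetical measurements: regressing Y on the 0/1 variable X gives an estimated slope equal to the difference of the two sample means, and an intercept equal to the mean of group 2.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line Y = b0 + b1*X."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Hypothetical measurements for the two groups.
group1 = [12.0, 15.0, 11.0, 14.0]
group2 = [9.0, 10.0, 8.0, 13.0]

x = [1] * len(group1) + [0] * len(group2)   # X = 1 for group 1, 0 for group 2
y = group1 + group2
b0, b1 = fit_line(x, y)

mean1 = sum(group1) / len(group1)
mean2 = sum(group2) / len(group2)
print(b0, mean2)          # intercept and mean of group 2
print(b1, mean1 - mean2)  # slope and difference of the means
```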

55
  • Hypothesis H1: μ1 > μ2 corresponds to H1: β > 0.
  • Hypothesis H1: μ1 < μ2 corresponds to H1: β < 0.
  • Hypothesis H1: μ1 ≠ μ2 corresponds to H1: β ≠ 0.

56
Example
  • A manager has some doubts about the (positive)
    effect of a course meant to improve the speed at
    which employees perform a given task.
  • To check his belief, he asked a technician to
    choose 10 employees at random and to measure the
    time (in hours) each takes to complete the task.
  • Then the same employees attended the course.
  • After the course, the employees had to carry out
    a similar task.
  • The results are summarized in the following
    table: manager.xls

57
Questions
  • a) Should the company maintain the training
    program? Take α = 5%.
  • b) The technician in charge of the measurements
    forgot to identify the employees on the
    measurement form. What is the conclusion using
    that data set?
  • Unfortunately, case b) is based on a real case.

58
Solution
  • For situation a), the data are paired and we have
    to check whether the differences "Before - After"
    are significantly positive. The p-value is 0.0003
    < 0.05 = α.
  • One accepts H1 and the manager concludes that
    the program should be maintained.

59
  • In the second case, the data are not paired. One
    can use regression with
  • Y = time of execution, and
  • X = 1 for measurements before the course and X =
    0 for measurements after the course.
  • In that case, the right choice for H1 is
  • H1: β > 0
  • Results are given by

60
  • Since H1: β > 0 (which is equivalent to H1:
    μbefore > μafter), and
  • the estimate b = 0.244 > 0, the p-value is
    0.201/2 = 0.1005 > 0.05.
  • One cannot reject H0, so the training program
    should not be maintained.
  • This is a very good example of the consequence of
    the greater variability of two independent
    samples compared to a paired sample.
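The effect of losing the pairing can be illustrated with a Python sketch. The task times below are hypothetical (the manager.xls data are not reproduced here) and were chosen so that every employee improves by a small, similar amount while employees differ a lot from one another. The unpaired statistic is the pooled two-sample t statistic, which coincides with the T value of the slope in the dummy-variable regression described above.

```python
import math
from statistics import mean, stdev

# Hypothetical task times in hours, for the same 10 employees.
before = [5.0, 6.2, 4.1, 7.3, 5.5, 6.8, 4.9, 7.0, 5.8, 6.1]
after = [4.8, 5.9, 3.9, 7.0, 5.3, 6.5, 4.6, 6.8, 5.5, 5.9]
n = len(before)

# Paired analysis: one-sample t statistic on the differences Before - After.
diffs = [b - a for b, a in zip(before, after)]
t_paired = mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Unpaired analysis: two-sample t statistic with pooled variance.
pooled_var = ((n - 1) * stdev(before) ** 2
              + (n - 1) * stdev(after) ** 2) / (2 * n - 2)
t_unpaired = (mean(before) - mean(after)) / math.sqrt(pooled_var * 2 / n)

print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

The same mean difference gives a large t statistic when the pairing is used and a small one when it is ignored, because the between-employee variability then ends up in the error term.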

61
Remark: Comparing several means
  • If one needs to compare the means of k groups
    for some variable Y, one can also use regression.
  • For i = 1, 2, ..., k-1, set
  • Xi = 1 for group i, 0 otherwise.
  • Then
  • β0 = mean of group k = μk, and
  • βi = μi - μk, 1 ≤ i ≤ k-1.

62
  • Therefore, the regression test where H0 is given
    by
  • H0: β1 = β2 = ... = βk-1 = 0
  • is equivalent to a test where H0 is given by
  • H0: μ1 = μ2 = ... = μk
  • If H0 is rejected, then we conclude that at least
    two means are different.
  • The p-value is the Significance of F, found in
    the ANOVA table.