Simple and multiple linear regression

About This Presentation

Title:

Simple and multiple linear regression

Description:

Simple and multiple linear regression – PowerPoint PPT presentation

Number of Views:1703

Avg rating:3.0/5.0

Slides: 63

Provided by: sec181

Category:

more less

Transcript and Presenter's Notes

Title: Simple and multiple linear regression

1
MBAStatistics 51-651-00COURSE 4

Simple and multiple linear regression
What should be the sales of ice cream?

2
Example

Before beginning building a movie theater, one
must estimate the daily number of people entering
the building.
How can we estimate it?
There are 2 millions individuals in the city.

3
Possible solutions

One could realize a local market study. However
it is often imprecise, specially for new
projects.
One could get data from similar projects in other
cities.

4
What do you think?Can we do better?
5
Probably, taking into account the size of the city
6
Case study Ice Cream Sales

The file icecream.xls contains pairs of data
representing ice cream sales and temperature
recorded that day, for 30 days.
Is there a relation between temperature and
sales?
Can temperature be used to predict ice cream
sales?
If so whats the prediction when the temperature
is 25?

7
Introduction

One of the principle objectives of statistics is
to explain the variability that we observe in
data.
Linear regression (or linear models) is a
statistical tool MUCH USED to study the presence
of a linear relation between
a dependent variable Y (quantitative and
continuous)
and one or more independent variables X1, X2, ,
Xp (qualitative and/or quantitative), called
independent or explanatory variables.

For example, a manager could be interested in
seeing if he could explain a good part of the
variability that he observes in sales in his
differents branches (dependant variable Y) in the
last 12 months, by the area, number of employees,
number of payed overtime hours, quality of
customer service, number of promotions, etc. (
independent or explanatory variables).

9
A regression model can be used to answer one of
the following three objectives

Describe data coming from non experimental
studies i.e. we observe reality as it is.
Examine the hypothesis (data coming from
controled experimental studies).
Predict (if we like to take risks!!).

10
Example

We are interested in knowing what are the
important factors that influence or determine the
value of a property and we want to build a model
that would help us evaluate this value using
certain factors.
To do this, we have obtained the total value for
a sample of 79 properties in a given region. The
following variables have also been collected for
each property

11
Brief glimpse of the data filehouse.xls
of
square
feet
total land first
outdoor heating OBS value
value of acres floor condition
type 1 199657 63247
1.63 1726 Good NatGas 2
78482 38091 0.495
1184 Good NatGas 3 119962
37665 0.375 1014 Good
Electric 4 116492 54062
0.981 1260 Average
Electric 5 131263 61546
1.14 1314 Average NatGas ...
78 253480
57948 0.862 1720 Good
Electric 79 257037 57489
0.95 2004 Excellnt
Electric
of of
of completed of non completed of OBS
rooms bedroom bathrooms
bathrooms fire-places GARAGE 1 8
4 2 1
2 Garage 2 6 2
1 0 0
NoGarage 3 7 3 2
0 1 Garage 4
6 3 2
0 1 Garage 5 8
4 2 1
2 NoGarage ... 78
10 5 5 1
1 Garage 79 9 4
2 2 2
Garage
12
Is there a link between the total value and the
different factors?
13
(No Transcript)
14
(No Transcript)
15
The Pearson correlation coefficient r is used to
measure the intensity of the linear relation
between two quantitative variables.

The correlation coefficient r will take its
values between -1 and 1.
If a perfect linear relation exist between X and
Y, then r ?1 (r 1 if X and Y vary in the same
direction and r -1 if X varies in the opposite
direction of Y).
If r 0, there is no linear link between X and
Y.
The more the r value furthers from 0 to get
closer to ?1, the more the linear link intensity
between X and Y becomes larger.

16
Y 6.5 r 0.035 Y
r 1

31
6.0 29

27

25 5.5
23
21

19 5.0
17
15

13 4.5
11

------------------------------------ 4.0
4
5 6 7 8 9 10 11 12 13 14
----------------------------------- 4 5
6 7 8 9 10 11 12 13 14
X X Y
r -1 -8.0 -10.5 -13.0
-15.5 -18.0
-20.5 -23.0
-25.5 -28.0
-30.5
-33.0
----------------------------------
4 5 6 7 8 9 10 11 12 13 14
X
17
Descriptive statistics Variable N Mean
Median Sta.Deviation Minimum Maximum
Total 79 187253 156761 84401
74365 453744 Land 79 65899
59861 22987 35353 131224 Acre
79 1.579 1.040 1.324 0.290
5.880 Sq.Feet 79 1678 1628
635 672 3501 Rooms 79
8.519 8.000 2.401 5 18
Bedrooms 79 3.987 4.000 1.266
2 8 C.Bathro 79 2.241
2.000 1.283 1 7 Bathro
79 0.7215 1.000 0.715 0
3 Fire-pl. 79 1.975 2.000
1.368 0 7 Pearson
Correlation Coefficients Total Land
Acre Sq.Feet Rooms Bedroom C.Bathro
Bathro Land 0.815 Acre 0.608
0.918 Sq.Feet 0.767 0.516 0.301 Rooms
0.626 0.518 0.373 0.563 Bedrooms 0.582
0.497 0.382 0.431 0.791 C.Bathro 0.626
0.506 0.376 0.457 0.479 0.586 Bathro
0.436 0.236 0.074 0.354 0.489 0.166
0.172 Fire-pl. 0.548 0.497 0.391 0.365
0.394 0.400 0.486 0.386
18
BE CAREFULL!! it is important to interpret the
correlation coefficient with the graph.
r 0.816 in all cases below 12.5
10

10.0
8

Y1 Y2

7.5 6

5.0
4

2.5
2 ---------------------------------
-- ------------------------------------
4 5 6 7 8 9 10 11 12
13 14 4 5 6 7 8 9 10 11 12 13
14 X
X 15.0
Y4
12.5

12.5

Y3
10.0
10.0

7.5 7.5

5.0
5.0
-----------------------------------
----------------------------- 4 5 6 7
8 9 10 11 12 13 14 8
19 X
X
19
Simple linear regression

To describe a linear relation between two
quantitative variables or to be able to predict
Y for a given value of X, we use a regression
line
Y ?0 ?1X ??
Since any statistical model is only an
approximation (we hope the best possible !!) and
because the linear link is never perfect , in the
model, there is always an error, noted ?.
If there was a perfect linear relation between Y
and X, the error term would always be equal to
0, and all the variability of Y would be
explained by the independent variable X.

So, for a given value of X, we would like to
estimate Y.
Thus, with the help of the data sample we will
estimate the regression model parameters ?0 and
?1 in order to minimize the residuals (errors)
sum of squares.
The squared correlation coefficient is called the
coefficient of determination and the percentage
of the variability of Y explained by X
R2 1 - (n-2)/(n-1)Se /Sy2,
where Se is the standard deviation of the errors
and Sy is the standard deviation of Y.

We can also use the adjusted coefficient of
determination to indicate the percentage of the
variability of Y explained by X
R2ajusted 1 - Se/Sy2 .

22
Simple linear regression example
MODEL 1. Regression Analysis The regression
equation is Total 16209 102
Sq.Feet Predictor Coef StDev
T P Constant 16209 17447
0.93 0.356 Sq.Feet 101.939
9.734 10.47 0.000 S 54556 R-Sq
58.8 R-Sq(adj) 58.2 Analysis of
Variance Source DF SS
MS F P Regression 1
3.26460E11 3.26460E11 109.68
0.000 Residual Error 77 2.29181E11
2976374177 Total 78 5.55641E11
23
MODEL 2. The regression equation is Total -
347 22021 Rooms Predictor Coef
StDev T P Constant -347
27621 -0.01 0.990 Rooms
22021 3122 7.05 0.000 S 66210
R-Sq 39.3 R-Sq(adj) 38.5 Analysis
of Variance Source DF SS
MS F P Regression 1
2.18090E11 2.18090E11 49.75
0.000 Residual Error 77 3.37551E11
4383775699 Total 78
5.55641E11 ______________________________________
____________________________ MODEL 3. The
regression equation is Total 32428 38829
Bedrooms Predictor Coef StDev
T P Constant 32428 25826
1.26 0.213 Bedrooms 38829
6177 6.29 0.000 S 69056 R-Sq
33.9 R-Sq(adj) 33.1 Analysis of
Variance Source DF SS
MS F P Regression 1
1.88445E11 1.88445E11 39.52
0.000 Residual Error 77 3.67196E11
4768775127 Total 78 5.55641E11
24

Model 1
total value 16209 102( of squared feet ).
R2 58.8. Thus 58.8 of the variability of
the total value is explained by the of squared
feet .
Model 2
total value -347 22021( of rooms ).
R2 39.3. Thus 39.3 of the variability of
the total value is explained by the of rooms .
Model 3
total value 32428 38829 ( of bedrooms ).
R2 33.9. Thus 33.9 of the variability of
the total value is explained by the of bedrooms
.

25
Which one of the 3 previous models would you
choose and why?

Model 1 because it has the largest value of R2.

26
1-? confidence interval for the mean of the
values of Y for a specific value of X

For model 1 and a value of X1500 sq.ft we
obtain the following point estimation
est. total value 16 209 1021500 169 117
95 confidence interval for the mean of the total
value for properties of 1500 sq.ft
156 418, 181 817
as calculated by CI-regression.xls

27
1-? confidence interval for a new value of Y
(prediction) being given a specific value of X

For model 1 and a value of X1500 sq.ft we
obtain the following point estimation
est.total value 16 209 101.9391500 169
117
95 confidence interval for a predicted total
value when the area of the first floor is 1500
sq.ft
59 742, 278 492
The confidence interval for a predicted value is
always larger than for the mean of the value of
Y for a specific X .

28
Inference on regression model parameters

If there is no linear link between Y and X then
?1 0. So, we want to examine the following
hypothesis
H0 ?1 0 vs H1 ?1 ? 0
We will reject H0 when the p-value is too
small
This test will be valid if
the relation between X and Y is linear
the data are independent
the variance of Y is the same for every value of
X.
Y has a normal distribution for every value of
X or the sample size n is large.

29
Multiple linear regression

It is more likely possible that the variability
of the dependent variable Y will be explained
not only by one independent variable X, but
rather by a linear combination of several
independent variables X1, X2, , Xp.
In this case, the multiple regression model is
given by
Y ?0 ?1X1 ?2X2 ?pXp ??
Also, using the sample data, we will estimate the
regression model parameters ?0, ?1, , ?p in
order to minimize the residuals (errors) sum of
squares.

The multiple correlation coefficient R2, also
called the coefficient of determination,
represents the percentage of the variability of Y
explained by the independent variables X1, X2,
, Xp.
In the model, when we add one or more independent
variables, R2 increases.
The question is to know if R2 increases to a
significant degree.
Note that we cannot have more independent
variables in the model that there are
observations in the sample. (general rule n ?
5p).

31
Example
MODEL 1. The regression equation is Total -
89131 3.05 Land - 20730 Acre 43.3 Sq.Feet -
4352 Rooms 10049 Bedroom 7606
C.Bathro 18725 Bathro 882 Fire-pl. Predictor
Coef StDev T
P Constant -89131 18302 -4.87
0.000 Land 3.0518 0.5260
5.80 0.000 Acre -20730 7907
-2.62 0.011 Sq.Feet 43.336
7.670 5.65 0.000 Rooms -4352
3036 -1.43 0.156 Bedroom
10049 5307 1.89 0.062 CBathro
7606 3610 2.11 0.039 Bathro
18725 6585 2.84
0.006 Fire-pl. 882 3184
0.28 0.783 S 29704 R-Sq 88.9
R-Sq(adj) 87.6 Analysis of Variance Source
DF SS MS F
P Regression 8 4.93877E11 61734659810
69.97 0.000 Residual Error 70
61763515565 882335937 Total 78
5.55641E11
32
MODEL 2 Regression Analysis The regression
equation is Total - 97512 3.11 Land - 21880
Acre 40.2 Sq.Feet 4411 Bedroom
8466 C.bathro 14328 Bathro Predictor
Coef StDev T P Constant
-97512 17466 -5.58 0.000 Land
3.1103 0.5236 5.94 0.000 Acre
-21880 7884 -2.78
0.007 Sq.Feet 40.195 7.384
5.44 0.000 Bedroom 4411 3469
1.27 0.208 C.bathro 8466
3488 2.43 0.018 Bathro 14328
5266 2.72 0.008 S 29763
R-Sq 88.5 R-Sq(adj) 87.6 Analysis of
Variance Source DF SS
MS F P Regression 6
4.91859E11 81976430646 92.54
0.000 Residual Error 72 63782210167
885864030 Total 78 5.55641E11
33
MODEL 3 Regression Analysis The regression
equation is Total - 90408 3.20 Land - 22534
Acre 41.1 Sq.Feet 10234 C.bathro
14183 Bathro Predictor Coef StDev
T P Constant -90408
16618 -5.44 0.000 Land 3.2045
0.5205 6.16 0.000 Acre
-22534 7901 -2.85 0.006 Sq.Feet
41.060 7.383 5.56
0.000 C.bathro 10234 3213
3.19 0.002 Bathro 14183 5287
2.68 0.009 S 29889 R-Sq 88,3
R-Sq(adj) 87,5 Analysis of Variance Source
DF SS MS F
P Regression 5 4.90426E11
98085283380 109.80 0.000 Residual Error
73 65214377146 893347632 Total 78
5.55641E11
34
Model without the area of the land ( of
acres ) because of the multicolinearity with the
land value.
MODEL 4 The regression equation is Total -
55533 1.82 Land 49.8 Sq.Feet 11696 C.bathro
18430 Bathro Predictor Coef
StDev T P Constant
-55533 11783 -4.71 0.000 Land
1.8159 0.1929 9.42
0.000 Sq.Feet 49.833 7.028
7.09 0.000 C.bathro 11696 3321
3.52 0.001 Bathro 18430
5312 3.47 0.001 S 31297 R-Sq
87.0 R-Sq(adj) 86.3 Analysis of
Variance Source DF SS
MS F P Regression 4
4.83160E11 1.20790E11 123.32
0.000 Residual Error 74 72481137708
979474834 Total 78 5.55641E11
35
Which one of the 4 previous models would you
choose and why?

Probably model 4 because all the independent
variables are significant at the 5 level (i.e.
for each ? in the model, p-value lt 5) and
although R2 is smaller, it is just marginally
smaller. Moreover, all the model coefficients
make sense !
In model 1 , the variables of rooms and
of fire-place are not statistically
significant at the 5 level (p-value gt 5). The
variable of bedrooms is at the limit with a
p-value 0.0624.

36
Which one of the 4 previous models would you
choose and why?(continued)

In model 2 the variable of bedroom is not
statistically significant at the 5 level.
In model 3 (and the previous models), the
variable of acres coefficient is negative
which is contrary to common sense and to
what we observed in the scatter plot and the
positive Pearson correlation coefficient (r
0.608).
In models 1 to 3, the negative coefficient for
the variable of acres is due to the fact
that there is a strong linear relation between
the value of the land and the area of the land
(r 0.918) multicolinearity problem.

37
Multicolinearity

If two or more explanatory variables are strongly
correlated (gt 0.85 in absolute value), one says
that there is multicolinearity. It has an
influence on the estimation of parameters in the
model.
If two explanatory variables are highly
correlated, then can get rid of one of these
variables. Because of the strong correlation, the
contribution of the other variable is not
significant.
The correlation between several pairs of
variables can be calculated in Excel using
correlation in the Data Analysis toolbox.

38
How can we choose a particular linear regression
model among all the possible ones?

There are several techniques
Step by step selection by adding one variable at
a time, starting with the most significant one
(stepwise, forward).
Selection starting from the model in which all
the variables are included and removing one
variable at a time starting with the least
significant (backward).
Construct all possible models and choose the best
subset of variables according to certain
specific criteria (ex adjusted R2 , Cp de
Mallow.)

39
Example of selection among the best subsets
Best Subsets Regression Response is Total

B C
S e b B F
q R d a a i
L A f o r t t r
a c e o o h h e
Adj. n r e m o r r
p Vars R-Sq R-Sq C-p s d e t s
m o o l 1 66.4 65.9 136.8 49262 X
1 58.8 58.2 184.7
54556 X 1 39.3 38.5
307.6 66210 X 2 82.7
82.2 35.9 35564 X X 2
78.8 78.3 60.3 39343 X X
2 74.4 73.7 88.1 43244 X X
3 85.6 85.0 19.5 32637 X X X
3 84.8 84.2 24.5 33521 X
X X 3 84.8 84.2 24.9 33591
X X X 4 87.1 86.4 12.2
31115 X X X X 4 87.0 86.3
13.1 31297 X X X X 4 86.6
85.9 15.2 31682 X X X X 5
88.3 87.5 6.9 29889 X X X X X
5 87.6 86.7 11.2 30744 X X X X X
5 87.4 86.5 12.4 30979 X X X
X X 6 88.5 87.6 7.3 29763 X
X X X X X 6 88.3 87.3 8.6
30030 X X X X X X 6 88.3 87.3
8.9 30096 X X X X X X 7 88.9
87.8 7.1 29510 X X X X X X X 7
88.6 87.4 9.1 29924 X X X X X X X
7 88.3 87.2 10.6 30240 X X X X X X
X 8 88.9 87.6 9.0 29704 X X X X
X X X X
40
Selection of the model without the variable
of acres
Best Subsets Regression Response is Total
B C
S e b B F
q R d a
a i L f o
r t t r a
e o o h h e Adj.
n e m o r r p Vars R-Sq R-Sq C-p
s d t s m o o l 1 66.4 65.9 120.6
49262 X 1 58.8 58.2 164.9
54556 X 1 39.3 38.5
278.3 66210 X 2 82.7
82.2 27.6 35564 X X 2
72.7 71.9 86.0 44704 X X
2 72.5 71.8 86.8 44813 X X
3 84.8 84.2 17.2 33521 X X
X 3 84.8 84.2 17.6 33591 X X
X 3 84.0 83.3 22.3 34467
X X X 4 87.0 86.3 6.9
31297 X X X X 4 86.1 85.3 12.1
32352 X X X X 4 85.3 84.5
16.5 33226 X X X X 5 87.3
86.4 6.9 31100 X X X X X 5
87.0 86.1 8.5 31439 X X X X X
5 87.0 86.1 8.9 31509 X X X X X
6 87.8 86.8 6.1 30707 X X X
X X X 6 87.3 86.3 8.7 31264 X
X X X X X 6 87.0 85.9 10.5 31656
X X X X X X 7 87.8 86.6 8.0
30908 X X X X X X X
41
The selection of the best model is done according
to the combination

The greatest value of R2 adjusted for the number
of variables in the model.
The smallest value of Cp .
For the models with R2 adjusted and comparable
Cp, we will choose the model which has the most
common sense according to the experts in the
field.
For the models with R2 adjusted and comparable
Cp, the model with the independent variables
that are the easiest and least expensive to
measure.
The model validity.

42
1-? confidence interval for Y mean and a new
value of Y (prediction) being given a specific
value combination for X1, X2, , Xp .

For model 4 and property with a land 65 000,
sq.ft 1500, 2 completed bathrooms and 1
not-completed, we obtain the following point
estimation
est. total value -55 533 1.81665 000
49.8331 500 11 6962 18 4301 179 074
95 confidence interval for the mean of the total
value
170 842, 187 306
95 confidence interval for a total predicted
value
116 173, 241 974

43
Notes

For a 1500 sq.ft property, the multiple
regression model gives a smaller 95 confidence
intervals than the simple regression model.
Therefore the addition of several other variables
in the model helped to better explain the total
value variability and to improve our estimations.
If two or more independent variables are
correlated we will say that there is
multicolinearity. This can influence the value
of the parameters in the model .
Also, if two independent variables are strongly
correlated then only one of the two variables
would be included in the model, the other one
bringing very little additional information.
Certain conditions are required for the validity
of the model and the corresponding inference
(similar to the simple linear regression ).

44
Dummy variables

How can one take into account qualitative
information in a regression?
Application Test on two or more means

45
Trick

If a qualitative variable takes two values, one
defined one dummy variable taking values 0 or 1.
Examples
Sex 1 if male, 0 otherwise
Garage 1 if garage, 0 if not.

46
Trick (continued)

More generally, if a qualitative variable can
take m values, one defines (m-1) dummy variables
all taking values 0 or 1.
Example Sex and job category (executive,
white-collar, blue-collar)
X1 1 if male, 0 otherwise.
X2 1 si exe, 0 otherwise.
X3 1 si w-c, 0 otherwise.

47
Example

One wants to explain the salary of an employee
(Y) with the following variables sex, job
category and experience.
X1 1 if male, 0 otherwise.
X2 1 if exe, 0 otherwise.
X3 1 if w-c, 0 otherwise.
X4 years of experience.

48
Example (continued)

Regression model
Y ?0 ?1X1 ?2X2 ?3 X3 ?4X4 ??
Question Interpret ?0, ?1, ?2, ?3 , ?4 .
How do know if women have a smaller salary?

49
P-value for one-tailed tests in Excel.

The evaluation of the p-value of a one-tailed
test hypothesis H1 is not given in general, only
the p-value of a two-tailed test . For
example, in regression, Excel calculates the
p-value P corresponding to
H0 bi 0 vs H1 bi ? 0 .
How can we calculate the p-value correponding to
one-tailed hypotheses H1?

50
Rules

P p-value for the two-tailed test.
If H1 is of the form bi gt 0 and bi gt0, then the
p-value of the right-tailed is P/2. Otherwise
it is 1- P/2.
If H1 is of the form bi lt 0 and bi lt0, then the
p-value of the left-tailed is P/2. Otherwise
it is 1- P/2.
In other words, the one-tailed p-value is half
of the two-tailed p-value when the estimated
coefficient has the same sign as the coefficient
in H1. Otherwise, it is 1- p-value/2.

51
Question

One wants to know if having a garage increase the
total value of the property. The hypotheses to be
tested should be
H0 bgarage ? 0 vs H1 bgarage gt 0
Since bgarage 22372 gt 0, the p-value
corresponding to H1 bgarage gt 0 is 0.058/2
0.029 lt 0.05. The anwser is yes because we
accept H1.
Does the decision depend on coding?

If the dummy is defined by 0 if there is a
garage and 1 otherwise, we would have got
Totale - 72080 1,83 Terrain 47,2 Pied2
11535 SbainsC 18899 Sbains - 22372
Garage
Predictor Coef StDev T
P
Constant -72080 14175 -5,08
0,000
Terrain 1,8342 0,1892 9,69
0,000
Pied2 47,175 7,013 6,73
0,000
SbainsC 11535 3256 3,54
0,001
Sbains 18899 5211 3,63
0,001
Garage -22372 11116 -2,01
0,058
S 30671 R-Sq 87,6 R-Sq(adj)
86,8

In that case, the right choice for hypotheses
would have been
H0 bgarage 0 vs H1 bgarage lt 0
The corresponding p-value stays 0.029 0.058/2
because bgarage -22372 lt 0 has the same sign as
bgarage in H1.

54
Comparison of several means

Suppose one wants to compare the respective means
of a quantitative variable Y for two groups m1
mean of group 1, m2 mean of group 2.
One can use regression by defining X 1 for
group 1, and X 0 for group 2.
In this case, b m1 m2.

Hypothesis H1 m1gt m2 correspond to H1 b gt 0
.
Hypothesis H1 m1lt m2 correspond to H1 b lt 0.
Hypothesis H1 m1 ? m2 correspond to H1 b ? 0
.

56
Example

A manager has some doubts on the (positive)
effects of a course in order to improve the speed
a given task is performed by employees.
To confirm his belief, he asked a technician to
choose at random 10 employees and to measure the
time (hours) to complete a task.
Then the same employees attend the course.
After the course the employees had to realize a
similar task.
The results are summarized in the following
table manager.xls

57
Questions

a) Should the company maintain the formation
program? Take a 5.
b) The technician in charge of the measurements
forget to identify employees on the measurements
form. What is the conclusion using that data set?
Unfortunately, case b is based on a real case.

58
Solution

For situation a), data are paired and we have to
check if the differences Before After are
significantly positive. The p-value is 0.0003 lt
0.05 a.
One accepts H1 and the manager conclude that
the program should be maintained.

In the second case, data are not paired. One can
use regression with
Y time of execution, and
X 1 for measurements before the course and X
0 for measurements after the course.
In that case, the right choice for H1 is
H1 b gt 0
Results are given by

Since H1 b gt 0 (which is equivalent to H1
mbeforegt mafter ), and
b 0.244 gt 0, the p-value is 0.201/2 0.1005 gt
0.05.
One accepts H0, so the formation program
shouldnt be maintained.
This is a very good example of the consequence of
the greater variability for two samples compared
to a paired sample.

61
Remark Comparing several means

If one needs to compare the means of k groups,
for some variable Y, one can use also regression.
For i1, 2, , k-1, set
Xi 1 for group i, 0 otherwise.
Then
?0 mean of group k ?k and
?i ?i - ?k, 1 ? i ? k-1.