Title: Announcements:
1. Announcements
- Next Homework is on the Web
- Due next Tuesday
2. Mosquito repellent experiment
- 30 people were recruited for an experiment.
- Groups of 10 were randomly assigned to one of three repellent types.
- They were put into a mosquito-filled room for 10 minutes (and told not to kill the mosquitoes!).
- Total number of bites in each group was counted after the experiment.
- (Source: Steve Gulyas, CRC Testing)
3. Estimate of Total Variability in the Data
$S^2 = \dfrac{(X_1 - \bar{X})^2 + \cdots + (X_{30} - \bar{X})^2}{30 - 1} = \dfrac{11561.37}{29} = 398.67$
where $\bar{X} = (X_1 + \cdots + X_{30})/30$.
4. Same data grouped by repellent type
Grouping the data by treatment explains some of the variability! (Analysis of variance makes this explanation more precise.)
5. ANOVA table
For the test of $H_0: \mu_A = \mu_B = \mu_C$

Source of Variation   df   Sum of Squares   Mean Square   F      P
Repellent              2        6952.3         3476.1     20.4   0.0000
Error                 27        4609.1          170.7
Total                 29       11561.4

- Mean Square for Error (170.7): estimate of the average variance of counts across the repellent types.
- Sum of Squares for Error (4609.1): the variance of counts within each repellent type is proportional to this.
- Sum of Squares Total (11561.4): the total variability in the data is proportional to this.
- Sum of squares treatment + sum of squares error = sum of squares total: 6952.3 + 4609.1 = 11561.4.
- $R^2 = SS_{Treat} / SS_{Total} = 0.6013$ is the fraction of variability accounted for by treatment.
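The decomposition of the total sum of squares into a treatment part and an error part can be checked numerically. The sketch below uses small made-up bite counts (the lecture's 30 observations are not listed on the slides), so the group labels and numbers are purely illustrative.

```python
# Hypothetical bite counts for three repellent groups (NOT the lecture data).
groups = {
    "A": [12, 15, 9, 14],
    "B": [22, 25, 19, 26],
    "C": [31, 28, 35, 30],
}


def mean(values):
    return sum(values) / len(values)


all_counts = [x for counts in groups.values() for x in counts]
grand_mean = mean(all_counts)

# Total sum of squares: squared deviations from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_counts)

# Treatment sum of squares: group size times squared deviation of each group mean.
ss_treat = sum(len(c) * (mean(c) - grand_mean) ** 2 for c in groups.values())

# Error sum of squares: squared deviations from each observation's own group mean.
ss_error = sum((x - mean(c)) ** 2 for c in groups.values() for x in c)

# Fraction of variability accounted for by treatment.
r_squared = ss_treat / ss_total

print(f"SSTreat={ss_treat:.2f}  SSE={ss_error:.2f}  SSTotal={ss_total:.2f}")
print(f"R^2={r_squared:.3f}")
```

Whatever data are plugged in, the treatment and error sums of squares add up to the total, which is why $R^2$ always lies between 0 and 1.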
6. Explaining why ANOVA is an analysis of variance
For the test of $H_0: \mu_A = \mu_B = \mu_C$
- MST = 6952.3 / 2 = 3476.1. $\sqrt{MST}$ describes the standard deviation among the repellents.
- MSE = 4609.1 / 27 = 170.7. $\sqrt{MSE}$ describes the standard deviation of the count within each repellent type.
- F = MST / MSE = 20.4.
- It makes sense that F is large and that the p-value, $\Pr(F_{3-1,\,30-3} > 20.4) \approx 0$, is small, because the variance among treatments is much larger than the variance within the units that get each treatment.
- (Note that the F test assumes the counts are independent and normally distributed with the same variance.)
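As a quick arithmetic check, the mean squares and the F statistic can be recomputed from the sums of squares and degrees of freedom in the repellent ANOVA table. This sketch uses only the numbers printed on the slides; the variable names are my own.

```python
# Values copied from the repellent ANOVA table.
ss_treat, df_treat = 6952.3, 3 - 1    # 3 repellent types
ss_error, df_error = 4609.1, 30 - 3   # 30 people, 3 repellent types

mst = ss_treat / df_treat   # mean square for treatment
mse = ss_error / df_error   # mean square for error
f_stat = mst / mse

# Square roots put the two mean squares on the scale of bite counts.
sd_among = mst ** 0.5
sd_within = mse ** 0.5

print(f"MST={mst:.1f}  MSE={mse:.1f}  F={f_stat:.1f}")
print(f"sd among repellents={sd_among:.1f}  sd within repellent={sd_within:.1f}")
```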
7. It turns out that ANOVA is a special case of regression. We'll come back to that in a class or two. First, let's learn about regression (chapters 12 and 13).
- Simple Linear Regression example
- Ingrid is a small business owner who wants to buy a fleet of Mitsubishi Sigmas. To save money, she decides to buy second-hand cars and wants to estimate how much to pay. To do this, she asks one of her employees to collect data on how much people have paid for these cars recently. (From Matt Wand)
8. Regression Plot
[Figure: scatterplot of Price against Age (years). Data: each point is a car.]
9. Plot suggests a simple model
- Price of car = intercept + slope × car's age + error
- or
- $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \ldots, 39$.
- Estimate $\beta_0$ and $\beta_1$.
- Outline for Regression
- Estimating the regression parameters and ANOVA tables for regression
- Testing and confidence intervals
- Multiple regression models and ANOVA
- Regression Diagnostics
10. Plot suggests a model
- Price of car = intercept + slope × car's age + error
- or
- $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \ldots, 39$.
- Estimate $\beta_0$ and $\beta_1$ with $b_0$ and $b_1$. Find these with least squares.
- In other words, find $b_0$ and $b_1$ to minimize the sum of squared errors
- $SSE = (y_1 - (b_0 + b_1 x_1))^2 + \cdots + (y_n - (b_0 + b_1 x_n))^2$
- See green line on next page.
Each term is the squared difference between an observed $y$ and the regression line $b_0 + b_1 x$.
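For a single predictor, the least-squares minimization has a well-known closed-form solution: the slope is $S_{xy}/S_{xx}$ and the intercept is $\bar{y} - b_1 \bar{x}$. The sketch below applies it to a handful of invented (age, price) pairs, since the 39 cars from the lecture are not listed; the data and function names are hypothetical.

```python
# Hypothetical (age in years, price) pairs; NOT the lecture's 39 cars.
ages = [6, 8, 9, 11, 12, 14, 15]
prices = [6500, 5200, 5400, 3900, 3600, 2500, 2700]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n

# Closed-form least-squares estimates for a straight line.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
s_xx = sum((x - x_bar) ** 2 for x in ages)
b1 = s_xy / s_xx            # slope
b0 = y_bar - b1 * x_bar     # intercept


def sse(intercept, slope):
    """Sum of squared vertical distances from the points to a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(ages, prices))


print(f"b0={b0:.1f}  b1={b1:.1f}  SSE={sse(b0, b1):.1f}")
# Nudging either estimate away from the least-squares values increases SSE.
print(sse(b0 + 100, b1) > sse(b0, b1), sse(b0, b1 + 10) > sse(b0, b1))
```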
11. Regression Plot
Price = 8198.25 - 385.108 Age
S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%
[Figure: scatterplot of Price against Age with the fitted line. A vertical segment from one point to the line is marked: it has length $y_i - b_0 - b_1 x_i$ for some $i$, and its squared length contributes one term to the Sum of Squared Errors (SSE).]
12. Regression Plot
Do Minitab example
S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%
General model: Price = $\beta_0 + \beta_1$ Age + error
Fitted model: Price = 8198.25 - 385.108 Age
[Figure: scatterplot of Price against Age (years) with the fitted line.]
13. Regression parameter estimates, $b_0$ and $b_1$, minimize
- $SSE = (y_1 - (b_0 + b_1 x_1))^2 + \cdots + (y_n - (b_0 + b_1 x_n))^2$
- Full model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
- Suppose the errors ($\varepsilon_i$'s) are independent $N(0, \sigma^2)$.
- What do you think a good estimate of $\sigma^2$ is?
- $MSE = SSE/(n-2)$ is an estimate of $\sigma^2$.
- Note how SSE looks like the numerator in the sample variance $s^2$.
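Writing the two estimates side by side makes the parallel explicit. In the sample variance the deviations are taken around one estimated quantity (the mean), so the divisor is $n-1$; in the regression they are taken around a line with two estimated parameters, so the divisor is $n-2$.

```latex
s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}
\qquad\text{versus}\qquad
MSE = \frac{\sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2}{n - 2}
    = \frac{SSE}{n - 2}
```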
14. (I divided price by 1000. Think about why this doesn't matter.)

Source           DF   SS       MS       F       P
Regression        1   33.274   33.274   28.79   0.000
Residual Error   37   42.763    1.156
Total            38   76.038

- Sum of Squares Total = $(y_1 - \bar{y})^2 + \cdots + (y_{39} - \bar{y})^2 = 76.038$
- Sum of Squared Errors = $(y_1 - (b_0 + b_1 x_1))^2 + \cdots + (y_n - (b_0 + b_1 x_n))^2 = 42.763$
- Sum of Squares for Regression = SSTotal - SSE
- What do these mean?
15. Regression Plot
Price = 8198.25 - 385.108 Age
S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%
[Figure: scatterplot of Price against Age showing two lines: the overall mean of 3,656 and the regression line.]
16. (I divided price by 1000. Think about why this doesn't really matter.)

Source           DF          SS       MS       F       P
Regression        1 = p-1    33.274   33.274   28.79   0.000
Residual Error   37 = n-p    42.763    1.156
Total            38 = n-1    76.038

p is the number of regression parameters (2 for now).
- $SSTotal = (y_1 - \bar{y})^2 + \cdots + (y_{39} - \bar{y})^2 = 76.038$
- SSTotal / 38 is an estimate of the variance around the overall mean (i.e. the variance in the data without doing regression).
- $SSE = (y_1 - (b_0 + b_1 x_1))^2 + \cdots + (y_n - (b_0 + b_1 x_n))^2 = 42.763$
- MSE = SSE / 37 is an estimate of the variance around the line (i.e. the variance that is not explained by the regression).
- SSR = SSTotal - SSE
- MSR = SSR / 1 is the variance in the data that is explained by the regression.
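The entries of the regression ANOVA table follow from two sums of squares and the sample size. The sketch below redoes that arithmetic with the numbers on the slide (price in thousands); the variable names are my own. It also shows that the square root of MSE, multiplied back by 1000, lands on the S that Minitab reports for the unscaled prices.

```python
n, p = 39, 2            # 39 cars; intercept and slope
ss_total = 76.038       # price measured in thousands
ss_error = 42.763

ss_reg = ss_total - ss_error     # sum of squares for regression
msr = ss_reg / (p - 1)           # variance explained by the regression
mse = ss_error / (n - p)         # variance around the line
f_stat = msr / mse

# Residual standard deviation, converted back to the original price units.
s_original_units = 1000 * mse ** 0.5

print(f"SSR={ss_reg:.3f}  MSR={msr:.3f}  MSE={mse:.3f}  F={f_stat:.2f}")
print(f"S in original price units = {s_original_units:.1f}")
```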
17. (I divided price by 1000. Think about why this doesn't really matter.)

Source           DF          SS       MS       F       P
Regression        1 = p-1    33.274   33.274   28.79   0.000
Residual Error   37 = n-p    42.763    1.156
Total            38 = n-1    76.038

p is the number of regression parameters.
- A test of $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \ne 0$
- Reject if the variance explained by the regression is high compared to the unexplained variability in the data. Reject if F is large.
- F = MSR / MSE
- p-value is $\Pr(F_{p-1,\,n-p} > MSR / MSE)$
- Reject $H_0$ for any $\alpha$ greater than the p-value
- (See Minitab example and confidence intervals for estimated parameters)
- (Assuming errors are independent and normal.)
18. $R^2$
- Another summary of a regression is
- $R^2 = \dfrac{\text{Sum of Squares for Regression}}{\text{Sum of Squares Total}}$
- $0 \le R^2 \le 1$
This is the fraction (often reported as a percentage) of the variation in the data that is described by the regression.
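For the car data, plugging in the sums of squares from the regression ANOVA table reproduces the R-Sq value that Minitab prints with the fitted line:

```latex
R^2 = \frac{SSR}{SSTotal} = \frac{33.274}{76.038} \approx 0.438,
\qquad\text{equivalently}\qquad
R^2 = 1 - \frac{SSE}{SSTotal} = 1 - \frac{42.763}{76.038} \approx 0.438
```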
19. Two different ways to assess the worth of a regression
- Absolute size of slope: bigger is better
- Size of error variance: smaller is better
- $R^2$ close to one
- Large F statistic
20. Multiple Regression
- Cheese Example: In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed to determine the amount of acetic acid and hydrogen sulfide they contained.
- Overall scores for each cheese were obtained by combining the scores from several tasters.
- The goal is to predict the taste score based on the acetic acid and hydrogen sulfide content. (From Matt Wand)
21. Model
- A simple model for taste is
- $\text{Taste}_i = \beta_0 + \beta_1\,\text{acetic}_i + \beta_2\,\text{H2S}_i + \text{error}_i$, $i = 1, \ldots, n = 30$
- Again the intercept and slopes are selected to minimize the error sum of squares
- $SSE = (\text{taste}_1 - (b_0 + b_1\,\text{acetic}_1 + b_2\,\text{H2S}_1))^2 + \cdots + (\text{taste}_{30} - (b_0 + b_1\,\text{acetic}_{30} + b_2\,\text{H2S}_{30}))^2$
- Geometrically:
- The simple linear model estimated a line. A model with an intercept and 2 slopes estimates a surface.
- Note that you could add more predictors too.
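One standard way to minimize this SSE is to solve the normal equations $(X^\top X)\,b = X^\top y$, where each row of $X$ is (1, acetic, H2S). The sketch below does this in plain Python on a few invented cheese measurements (the lecture's 30 samples are not listed), so the data and the helper `solve` are illustrative only. Minitab's own numerical method may differ.

```python
# Hypothetical (acetic, H2S, taste) rows; NOT the lecture's 30 cheeses.
rows = [
    (4.5, 3.1, 12.0),
    (5.2, 7.5, 21.0),
    (5.7, 5.0, 39.0),
    (4.9, 8.2, 16.0),
    (6.1, 3.9, 47.0),
    (5.4, 6.0, 26.0),
]

# Design matrix with a leading 1 for the intercept.
X = [[1.0, acetic, h2s] for acetic, h2s, _ in rows]
y = [taste for _, _, taste in rows]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


# Normal equations: (X'X) b = X'y.
k = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
b = solve(xtx, xty)  # [b0, b1 (acetic), b2 (H2S)]

fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
sse = sum(e ** 2 for e in residuals)

print("b0, b1, b2 =", [round(v, 3) for v in b])
print("SSE =", round(sse, 3))
```

At the least-squares solution the residuals are orthogonal to every column of $X$, which is a convenient check that the solver worked.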
22. Minitab
- Stat > Regression > Regression
- Response is taste
- Predictors are acetic and h2s
- Output:

The regression equation is
taste = - 34.0 - 7.57 H2S + 14.8 acetic

Predictor   Coef      SE Coef   T       P
Constant    -33.99    26.53     -1.28   0.211
H2S          -7.570    3.474    -2.18   0.038
acetic       14.763    4.242     3.48   0.002

S = 12.98   R-Sq = 40.6%   R-Sq(adj) = 36.2%

Analysis of Variance
Source           DF   SS       MS       F      P
Regression        2   3114.0   1557.0   9.24   0.001
Residual Error   27   4548.9    168.5
23. Minitab
The regression equation is
taste = - 34.0 - 7.57 H2S + 14.8 acetic

Predictor   Coef      SE Coef   T       P
Constant    -33.99    26.53     -1.28   0.211
H2S          -7.570    3.474    -2.18   0.038
acetic       14.763    4.242     3.48   0.002

- T = Coef / SE Coef (this is the test statistic)
- P-value is for the test of $H_0$: Coef = 0 versus $H_A$: Coef is not 0 (if p-value < $\alpha$, then reject $H_0$)
- $(1-\alpha)$ CI for Coef: $\text{Coef} \pm t_{\alpha/2,\ \text{error df}} \cdot \text{SE Coef}$
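The T column and the confidence intervals can be rebuilt from the Coef and SE Coef columns alone. The sketch below does so for a 95% interval, taking the critical value $t_{0.025,27} \approx 2.052$ from a standard t table (an outside value, not printed on the slides); the dictionary layout is my own.

```python
# (coefficient, standard error) pairs copied from the Minitab output.
coefs = {
    "Constant": (-33.99, 26.53),
    "H2S": (-7.570, 3.474),
    "acetic": (14.763, 4.242),
}
T_CRIT = 2.052  # t_{0.025, 27}, looked up in a t table (error df = 27)

results = {}
for name, (coef, se) in coefs.items():
    t_stat = coef / se                 # test statistic for H0: Coef = 0
    half_width = T_CRIT * se           # half-width of the 95% CI
    results[name] = (t_stat, coef - half_width, coef + half_width)

for name, (t_stat, lower, upper) in results.items():
    print(f"{name:9s} T={t_stat:6.2f}  95% CI=({lower:7.2f}, {upper:7.2f})")
```

A 95% interval that excludes 0 corresponds to a p-value below 0.05, which matches the output: the intervals for H2S and acetic exclude 0, while the interval for the constant does not.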
24. Minitab
Analysis of Variance
Source           DF   SS       MS       F      P
Regression        2   3114.0   1557.0   9.24   0.001
Residual Error   27   4548.9    168.5
Total            29   7662.9

- The regression equation is taste = - 34.0 - 7.57 H2S + 14.8 acetic
- Model is regression equation + error: taste = - 34.0 - 7.57 H2S + 14.8 acetic + error
- MSE = 168.5 is the estimated variance of the error.
- F stat = MSR / MSE (this is the test statistic)
- P-value is for the test of $H_0: \beta_1 = \beta_2 = 0$ (both slopes are 0) versus $H_A$: at least one is not 0
This is an overall test of whether or not the regression is useful.
25. Using the regression equation
- taste = - 34.0 - 7.57 H2S + 14.8 acetic
- If H2S = 3 and acetic = 5, then what is the expected taste score? (NOTE that this is not an extrapolation.)
- For the value, just plug H2S = 3 and acetic = 5 into the equation.
- For a confidence interval (CI): Stat > Regression > Regression, Options button, prediction interval for new obs (put the values in the order that they're in the regression equation).

New Obs   Fit     SE Fit   95.0% CI          95.0% PI
1         17.11   3.17     (10.60, 23.63)    (-10.30, 44.53)

The prediction interval is wider than the CI since prediction includes error variability as well as variability in estimating the parameters.
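The fit and both intervals can be reproduced, up to rounding, from quantities already in the output. The sketch below uses the unrounded coefficients from the Predictor table, SE Fit = 3.17, MSE = 168.5, and the table value $t_{0.025,27} \approx 2.052$ (an outside value). It relies on the standard result that the prediction standard error is $\sqrt{MSE + (\text{SE Fit})^2}$, which is what makes the PI wider.

```python
# Coefficients from the Predictor table (more digits than the printed equation).
b0, b_h2s, b_acetic = -33.99, -7.570, 14.763
h2s, acetic = 3, 5

fit = b0 + b_h2s * h2s + b_acetic * acetic   # expected taste score

se_fit = 3.17    # SE Fit from the Minitab output
mse = 168.5      # estimated error variance from the ANOVA table
T_CRIT = 2.052   # t_{0.025, 27} from a t table

# CI: uncertainty in the estimated mean only.
ci_half = T_CRIT * se_fit
ci = (fit - ci_half, fit + ci_half)

# PI: adds the error variance of a single new observation.
pi_half = T_CRIT * (mse + se_fit ** 2) ** 0.5
pi = (fit - pi_half, fit + pi_half)

print(f"fit = {fit:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```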
26. Dummy (or indicator) variables
- When some predictor variables are categorical, regression can still be used.
- Dummy variables are used to indicate the fabric of each observation.
27. Regression Model for Burn Time Data
- Burn time = $\mu_1$ if fabric 1, $\mu_2$ if fabric 2, $\mu_3$ if fabric 3, $\mu_4$ if fabric 4, plus error
- or
- $y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \varepsilon_i$
- (the x's are indicator variables)
- $x_{1i} = 1$ if observation i is fabric 1 and 0 otherwise
- $x_{2i} = 1$ if observation i is fabric 2 and 0 otherwise
- $x_{3i} = 1$ if observation i is fabric 3 and 0 otherwise
- $x_{4i} = 1$ if observation i is fabric 4 and 0 otherwise
- Betas are fabric-specific means.
- The model does not have an intercept.
- (Stat > Regression > Regression, Options, "Fit intercept" button)
28. An Equivalent Model
- $y_i = \gamma_0 + \gamma_2 x_{2i} + \gamma_3 x_{3i} + \gamma_4 x_{4i} + \varepsilon_i$
- $x_{2i} = 1$ if observation i is fabric 2 and 0 otherwise
- $x_{3i} = 1$ if observation i is fabric 3 and 0 otherwise
- $x_{4i} = 1$ if observation i is fabric 4 and 0 otherwise
- Fabric 1 mean = $\gamma_0$
- Fabric 2 mean = $\gamma_0 + \gamma_2$
- Fabric 3 mean = $\gamma_0 + \gamma_3$
- Fabric 4 mean = $\gamma_0 + \gamma_4$
- This model does have an intercept.
$\gamma_0$ is the mean for fabric 1. The rest of the $\gamma$'s are offsets.
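The two parameterizations give the same fitted values. With only indicator predictors, least squares sets each fitted value to the sample mean of that observation's group, so the no-intercept coefficients are the group means and the intercept-model coefficients are the fabric 1 mean plus offsets. The sketch below illustrates this on invented burn times (the lecture data are not listed); all names are hypothetical.

```python
# Hypothetical burn times, 4 per fabric (NOT the lecture data).
burn = {
    1: [17.8, 16.2, 17.5, 15.9],
    2: [11.2, 11.0, 10.5, 11.1],
    3: [11.8, 11.0, 10.0, 9.2],
    4: [14.9, 10.8, 12.8, 10.7],
}
FABRICS = (1, 2, 3, 4)
means = {k: sum(v) / len(v) for k, v in burn.items()}

# No-intercept model: one coefficient per fabric, each a fabric mean.
betas = {k: means[k] for k in FABRICS}

# Intercept model: fabric 1 is the baseline, the others are offsets from it.
gamma0 = means[1]
gammas = {k: means[k] - means[1] for k in (2, 3, 4)}


def indicators(fabric):
    """Dummy variables x1..x4 for one observation."""
    return {k: 1 if fabric == k else 0 for k in FABRICS}


def fit_no_intercept(fabric):
    x = indicators(fabric)
    return sum(betas[k] * x[k] for k in FABRICS)


def fit_with_intercept(fabric):
    x = indicators(fabric)
    return gamma0 + sum(gammas[k] * x[k] for k in (2, 3, 4))


for k in FABRICS:
    print(k, round(fit_no_intercept(k), 3), round(fit_with_intercept(k), 3))
```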
29. The regression equation is
Burn Time = 16.9 - 5.90 Fabric 2 - 6.35 Fabric 3 - 5.85 Fabric 4

Predictor   Coef       SE Coef   T       P
Constant    16.8500    0.5806    29.02   0.000
Fabric 2    -5.9000    0.8211    -7.19   0.000
Fabric 3    -6.3500    0.8211    -7.73   0.000
Fabric 4    -5.8500    0.8211    -7.12   0.000

S = 1.161   R-Sq = 87.2%   R-Sq(adj) = 83.9%

Analysis of Variance (Note that this is the same as before!)
Source           DF   SS        MS       F       P
Regression        3   109.810   36.603   27.15   0.000
Residual Error   12    16.180    1.348
Total            15   125.990

95% CIs for fabric means:
(Point estimate of mean) $\pm\ t_{0.025,12}\sqrt{MSE / 4}$
Fabric 2: $(16.85 - 5.90) \pm 2.179\sqrt{1.348 / 4} = 10.95 \pm 2.179(0.5806)$
(0.5806 is the std dev of the estimate of $\gamma_0 + \gamma_2$.)
(As usual, we're assuming the errors are independent and normal with constant variance.)
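The interval formula can be applied to all four fabrics using only numbers from the output: the constant and offsets, MSE = 1.348, 4 observations per fabric, and $t_{0.025,12} = 2.179$. This is a minimal sketch with variable names of my own choosing.

```python
T_CRIT = 2.179                  # t_{0.025, 12}, as given on the slide
mse, n_per_fabric = 1.348, 4    # from the ANOVA table; 16 observations, 4 fabrics

gamma0 = 16.85                                      # fabric 1 mean (Constant)
offsets = {1: 0.0, 2: -5.90, 3: -6.35, 4: -5.85}    # fabric coefficients

se_mean = (mse / n_per_fabric) ** 0.5   # std dev of each estimated fabric mean
half_width = T_CRIT * se_mean

intervals = {}
for fabric, offset in offsets.items():
    point = gamma0 + offset
    intervals[fabric] = (point - half_width, point + half_width)

print(f"SE of a fabric mean = {se_mean:.4f}")
for fabric, (lower, upper) in intervals.items():
    print(f"Fabric {fabric}: ({lower:.2f}, {upper:.2f})")
```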
30. Back to cheese
- Suppose the cheeses come from two regions of Australia and we want to include that info in the model
- $\text{Taste}_i = \beta_0 + \beta_1\,\text{acetic}_i + \beta_2\,\text{H2S}_i + \beta_3\,\text{Region}_i + \text{error}_i$, $i = 1, \ldots, n = 30$
- $\text{Region}_i = 1$ if the ith sample comes from region 1 and 0 otherwise. $\beta_3$ is the effect of region 1.
- If $\beta_3 > 0$, then region 1 tends to increase the mean score (and vice versa).
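To see why the region coefficient is read as "the effect of region 1", compare the mean taste score for two cheeses with the same acetic acid value $a$ and hydrogen sulfide value $h$ (symbols introduced here for illustration) that differ only in region:

```latex
\begin{aligned}
E[\text{Taste} \mid \text{Region} = 1] - E[\text{Taste} \mid \text{Region} = 0]
  &= (\beta_0 + \beta_1 a + \beta_2 h + \beta_3) - (\beta_0 + \beta_1 a + \beta_2 h) \\
  &= \beta_3
\end{aligned}
```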