Announcements:

Slides: 31

Provided by: johnstau

Learn more at: https://people.math.umass.edu

Transcript and Presenter's Notes

1
Announcements

Next Homework is on the Web
Due next Tuesday

Mosquito repellent experiment
30 people were recruited for an experiment.
Groups of 10 were randomly assigned to one of
three repellent types.
They were put into a mosquito-filled room for 10
minutes (and told not to kill the mosquitoes!).
Total number of bites in each group was counted
after the experiment.
(Source: Steve Gulyas, CRC Testing)

3
Estimate Of Total Variability In the Data
$S^2 = \left[(X_1 - \bar{X})^2 + \dots + (X_{30} - \bar{X})^2\right] / (30 - 1) = 11561.37 / 29 = 398.67$
(where $\bar{X} = (X_1 + \dots + X_{30}) / 30$)
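
As a minimal sketch of the calculation above, the sample variance can be computed directly from its definition. The bite counts below are hypothetical placeholders (the slides report only the sum of squares, not the raw data), and `sample_variance` is an illustrative name rather than anything from the lecture.

```python
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

# Hypothetical bite counts, NOT the data from the experiment.
counts = [12, 30, 45, 8, 22, 51]
s2 = sample_variance(counts)

# Cross-check the hand-rolled formula against the standard library.
assert abs(s2 - statistics.variance(counts)) < 1e-9

# The slide's own arithmetic: total sum of squares divided by n - 1.
slide_s2 = 11561.37 / (30 - 1)
print(round(slide_s2, 2))
```

Dividing by n - 1 rather than n is what makes this the usual unbiased estimate of the variance.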
4
Same data grouped by repellent type
Grouping the data by treatment explains some
of the variability! (Analysis of variance makes
this explanation more precise.)
5
ANOVA table
For test $H_0: \mu_A = \mu_B = \mu_C$

Source of Variation   df   Sum of Squares   Mean Square      F        P
repellent              2       6952.3          3476.1      20.4   0.0000
Error                 27       4609.1           170.7
Total                 29      11561.4

Estimate of average variance of counts across the
repellent types.
Variance of counts within each repellent type is
proportional to this.
Total variability in the data is proportional to
this.
Sum of squares treatment + sum of squares error =
sum of squares total: 6952.3 + 4609.1 = 11561.4
$R^2 = SSTreat / SSTotal = 0.6013$ is the
fraction of variability accounted for by
treatment.
6
Explaining why ANOVA is an analysis of variance:
$MST = 6952.3 / 2 = 3476.1$. $\sqrt{MST}$ describes the standard deviation among the repellents.
$MSE = 4609.1 / 27 = 170.7$. $\sqrt{MSE}$ describes the standard deviation of the counts within each repellent type.
$F = MST / MSE = 20.4$. It makes sense that this is large, and that the p-value $\Pr(F_{3-1,\,30-3} > 20.4) \approx 0$ is small, because the variance among treatments is much larger than the variance within the units that get each treatment.
(Note that the F test assumes the counts are independent and normally distributed with the same variance.)
For test $H_0: \mu_A = \mu_B = \mu_C$
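
The arithmetic on this slide can be reproduced from the sums of squares in the ANOVA table. The sketch below is illustrative only: the helper name `f_tail_two_numerator_df` is made up here, and it relies on the fact that an F distribution with 2 numerator degrees of freedom has the closed-form tail probability (1 + 2f/df2)^(-df2/2), which avoids needing a statistics library.

```python
def f_tail_two_numerator_df(f, df2):
    """Pr(F > f) when F has 2 numerator and df2 denominator degrees of freedom.

    Closed form valid ONLY for 2 numerator df: (1 + 2 f / df2) ** (-df2 / 2).
    """
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)

# Sums of squares and degrees of freedom taken from the ANOVA table.
ss_treat, df_treat = 6952.3, 2    # 3 repellents - 1
ss_error, df_error = 4609.1, 27   # 30 people - 3 repellents

mst = ss_treat / df_treat          # variability among repellents
mse = ss_error / df_error          # variability within a repellent type
f_stat = mst / mse
p_value = f_tail_two_numerator_df(f_stat, df_error)
r_squared = ss_treat / (ss_treat + ss_error)

print(f"MST={mst:.2f}  MSE={mse:.2f}  F={f_stat:.2f}")
print(f"R^2={r_squared:.4f}  p-value={p_value:.2e}")
```

The p-value comes out far below 0.0001, which is why the table shows it as 0.0000.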
7
It turns out that ANOVA is a special case of
regression. We'll come back to that in a class or
two. First, let's learn about regression
(chapters 12 and 13).

Simple Linear Regression example
Ingrid is a small business owner who wants to buy
a fleet of Mitsubishi Sigmas. To save money, she
decides to buy second-hand cars and wants to
estimate how much to pay. In order to do this,
she asks one of her employees to collect data on
how much people have paid for these cars
recently. (From Matt Wand)

8
Regression Plot
[Figure: scatter plot of Price ($), 0 to 9000, against Age (years), 6 to 15. Data: each point is a car.]
9

Plot suggests a simple model:
Price of car = intercept + slope times car's age + error
or
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \dots, 39$.
Estimate $\beta_0$ and $\beta_1$.

Outline for Regression
Estimating the regression parameters and ANOVA tables for regression
Testing and confidence intervals
Multiple regression models and ANOVA
Regression Diagnostics

Plot suggests a model:
Price of car = intercept + slope times car's age + error
or
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \dots, 39$.
Estimate $\beta_0$ and $\beta_1$ with $b_0$ and $b_1$. Find these
with least squares.
In other words, find $b_0$ and $b_1$ to minimize the sum of
squared errors
$SSE = [y_1 - (b_0 + b_1 x_1)]^2 + \dots + [y_n - (b_0 + b_1 x_n)]^2$
See green line on next page.

Each term is the squared difference between the observed
y and the regression line $(b_0 + b_1 x)$.
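
For a single predictor, the least squares problem above has a well-known closed-form solution: the slope is the ratio of the x-y cross-deviation sum to the x deviation sum of squares, and the intercept makes the line pass through the point of means. The sketch below uses hypothetical (age, price) pairs, not the 39 cars from the lecture, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, NOT the car prices used in the slides.
ages = [6, 8, 9, 11, 12, 14]
prices = [6200, 5100, 4900, 3800, 3900, 2600]

b0, b1 = fit_line(ages, prices)
best = sse(ages, prices, b0, b1)

# Moving either estimate away from the least squares values increases SSE.
assert best < sse(ages, prices, b0 + 50, b1)
assert best < sse(ages, prices, b0, b1 + 5)
print(f"b0={b0:.1f}  b1={b1:.1f}  SSE={best:.0f}")
```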
11
Regression Plot
[Figure: scatter plot of Price against Age with the fitted line Price = 8198.25 - 385.108 Age; S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%. Callouts: the vertical line from a data point to the fitted line has length $y_i - b_0 - b_1 x_i$ for some i; the squared length of this line contributes one term to the Sum of Squared Errors (SSE).]
12
Regression Plot
Do Minitab example
[Figure: scatter plot of Price against Age (years) with the fitted line; S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%.]
General Model: Price = $\beta_0 + \beta_1$ Age + error
Fitted Model: Price = 8198.25 - 385.108 Age
13

Regression parameter estimates, $b_0$ and $b_1$,
minimize
$SSE = [y_1 - (b_0 + b_1 x_1)]^2 + \dots + [y_n - (b_0 + b_1 x_n)]^2$
Full model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Suppose errors ($\varepsilon_i$'s) are independent $N(0, \sigma^2)$.
What do you think a good estimate of $\sigma^2$ is?
$MSE = SSE/(n-2)$ is an estimate of $\sigma^2$.
Note how SSE looks like the numerator in $S^2$.

14
(I divided price by 1000. Think about why this
doesn't matter.)

Source            DF       SS       MS       F       P
Regression         1   33.274   33.274   28.79   0.000
Residual Error    37   42.763    1.156
Total             38   76.038

Sum of Squares Total = $[y_1 - \bar{y}]^2 + \dots + [y_{39} - \bar{y}]^2 = 76.038$
Sum of Squared Errors = $[y_1 - (b_0 + b_1 x_1)]^2 + \dots + [y_n - (b_0 + b_1 x_n)]^2 = 42.763$
Sum of Squares for Regression = SSTotal - SSE
What do these mean?

15
Regression Plot
[Figure: scatter plot of Price against Age with the fitted line Price = 8198.25 - 385.108 Age; S = 1075.07, R-Sq = 43.8%, R-Sq(adj) = 42.2%. Callouts: overall mean of $3,656; regression line.]
16
(I divided price by 1000. Think about why this
doesn't really matter.)

Source            DF           SS       MS       F       P
Regression         1 = p-1   33.274   33.274   28.79   0.000
Residual Error    37 = n-p   42.763    1.156
Total             38 = n-1   76.038
p is the number of regression parameters (2 for now)

$SSTotal = [y_1 - \bar{y}]^2 + \dots + [y_{39} - \bar{y}]^2 = 76.038$
$SSTotal / 38$ is an estimate of the variance
around the overall mean
(i.e. variance in the data without doing
regression).
$SSE = [y_1 - (b_0 + b_1 x_1)]^2 + \dots + [y_n - (b_0 + b_1 x_n)]^2 = 42.763$
$MSE = SSE / 37$ is an estimate of the variance
around the line
(i.e. variance that is not explained by the
regression).
$SSR = SSTotal - SSE$
$MSR = SSR / 1$ is the variance in the data that is
explained by the regression.
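
One way to see why dividing price by 1000 does not really matter is to rebuild the table from SSTotal and SSE on both scales. In the sketch below (the function name `anova_summary` is illustrative), rescaling the response by 1000 multiplies every sum of squares by 1000 squared, so ratios such as F and the fraction SSR / SSTotal are unchanged, while the square root of MSE returns to the dollar scale reported in the plot (S = 1075.07).

```python
import math

def anova_summary(ss_total, ss_error, n, p):
    """Fill in the regression ANOVA table from SSTotal and SSE."""
    ss_reg = ss_total - ss_error
    msr = ss_reg / (p - 1)
    mse = ss_error / (n - p)
    return {
        "SSR": ss_reg,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,
        "R2": ss_reg / ss_total,
    }

n, p = 39, 2  # 39 cars, 2 regression parameters (intercept and slope)

# Values from the table, with price measured in thousands of dollars.
thousands = anova_summary(76.038, 42.763, n, p)

# Same table with price in dollars: sums of squares scale by 1000 ** 2.
dollars = anova_summary(76.038 * 1e6, 42.763 * 1e6, n, p)

print(f"F = {thousands['F']:.2f} (thousands), {dollars['F']:.2f} (dollars)")
print(f"SSR / SSTotal = {thousands['R2']:.3f}")
print(f"sqrt(MSE) in dollars = {math.sqrt(dollars['MSE']):.1f}")
```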

17
(I divided price by 1000. Think about why this
doesn't really matter.)

Source            DF           SS       MS       F       P
Regression         1 = p-1   33.274   33.274   28.79   0.000
Residual Error    37 = n-p   42.763    1.156
Total             38 = n-1   76.038
p is the number of regression parameters

A test of $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$
Reject if the variance explained by the
regression is high compared to the unexplained
variability in the data. Reject if F is large.
$F = MSR / MSE$
p-value is $\Pr(F_{p-1,\,n-p} > MSR / MSE)$
Reject $H_0$ for any $\alpha$ greater than the p-value
(See Minitab example and confidence intervals for
estimated parameters)
(Assuming errors are independent and normal.)

18
R2

Another summary of a regression is
$R^2$ = Sum of Squares for Regression / Sum of Squares Total
$0 \le R^2 \le 1$

This is the proportion of the variation in the
data that is described by the regression.
19
Two different ways to assess worth of a
regression

Absolute size of slope: bigger is better
Size of error variance: smaller is better
$R^2$ close to one
Large F statistic

20
Multiple Regression

Cheese Example: In a study of cheddar cheese
from the La Trobe Valley of Victoria, Australia,
samples of cheese were analyzed to determine the
amount of acetic acid and hydrogen sulfide they
contained.
Overall scores for each cheese were obtained by
combining the scores from several tasters.
The goal is to predict the taste score based on
the acetic acid and hydrogen sulfide
content. (From Matt Wand)

21
Model

A simple model for taste is
$\text{Taste}_i = \beta_0 + \beta_1\,\text{acetic}_i + \beta_2\,\text{H2S}_i + \text{error}_i$,
$i = 1, \dots, n = 30$
Again the intercept and slopes are selected to
minimize the error sum of squares
$SSE = [\text{taste}_1 - (b_0 + b_1\,\text{acetic}_1 + b_2\,\text{H2S}_1)]^2 + \dots + [\text{taste}_{30} - (b_0 + b_1\,\text{acetic}_{30} + b_2\,\text{H2S}_{30})]^2$
Geometrically:
The simple linear model estimated a line. A model
with an intercept and 2 slopes estimates a
surface.
Note that you could add more predictors too.
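
With more than one predictor, the estimates that minimize SSE solve the normal equations (X'X) b = X'y, where X has a column of ones for the intercept. The following is a rough, dependency-free sketch of that idea rather than what Minitab does internally. The predictor values are hypothetical (not the cheese data), and the responses are generated without noise from a known plane so that the recovered coefficients can be checked.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def least_squares(rows, y):
    """Fit y = b0 + b1*x1 + ... via the normal equations (X'X) b = X'y."""
    X = [[1.0, *r] for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(xr[i] * xr[j] for xr in X) for j in range(k)] for i in range(k)]
    xty = [sum(xr[i] * yi for xr, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical (acetic, H2S) measurements, NOT the cheese data.
predictors = [(4.5, 3.1), (5.2, 5.0), (5.7, 7.5), (4.9, 3.9), (6.1, 9.0), (5.4, 6.2)]

# Noise-free responses from a known plane, so the fit is checkable.
true_b = (2.0, 3.0, -1.5)
taste = [true_b[0] + true_b[1] * a + true_b[2] * h for a, h in predictors]

b_hat = least_squares(predictors, taste)
print([round(v, 6) for v in b_hat])
```

Adding another predictor only adds a column to X; the same two functions apply unchanged.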

22
Minitab

Stat > Regression > Regression
Response is taste
Predictors are acetic and h2s
Output:
The regression equation is
taste = - 34.0 - 7.57 H2S + 14.8 acetic

Predictor      Coef   SE Coef       T       P
Constant     -33.99     26.53   -1.28   0.211
H2S          -7.570     3.474   -2.18   0.038
acetic       14.763     4.242    3.48   0.002

S = 12.98   R-Sq = 40.6%   R-Sq(adj) = 36.2%

Analysis of Variance
Source            DF       SS       MS      F       P
Regression         2   3114.0   1557.0   9.24   0.001
Residual Error    27   4548.9    168.5

23
Minitab

The regression equation is
taste = - 34.0 - 7.57 H2S + 14.8 acetic

Predictor      Coef   SE Coef       T       P
Constant     -33.99     26.53   -1.28   0.211
H2S          -7.570     3.474   -2.18   0.038
acetic       14.763     4.242    3.48   0.002

T = Coef / SE Coef
P-value is for test $H_0$: Coef = 0, $H_A$: Coef
is not 0 (if p-value $< \alpha$, then reject $H_0$)
$1-\alpha$ CI for Coef: Coef $\pm$ SE Coef $\times\ t_{\alpha/2,\ \text{error df}}$

Test statistic
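
The T column and the confidence interval formula can be checked against the Coef and SE Coef columns. This is a small sketch under one stated assumption: the critical value 2.052 for t with 27 error degrees of freedom at the 95% level is taken from a standard t table, not from the slides. The function names are illustrative.

```python
# Assumed critical value t_{0.025, 27} from a standard t table (not on the slide).
T_CRIT_27 = 2.052

# (Coef, SE Coef) pairs copied from the Minitab output.
coefs = {
    "Constant": (-33.99, 26.53),
    "H2S": (-7.570, 3.474),
    "acetic": (14.763, 4.242),
}

def t_stat(coef, se):
    """Test statistic for H0: coefficient = 0."""
    return coef / se

def conf_int(coef, se, t_crit):
    """Two-sided confidence interval: coef +/- t_crit * se."""
    half_width = t_crit * se
    return coef - half_width, coef + half_width

intervals = {}
for name, (coef, se) in coefs.items():
    intervals[name] = conf_int(coef, se, T_CRIT_27)
    lo, hi = intervals[name]
    print(f"{name:8s} T={t_stat(coef, se):6.2f}  95% CI=({lo:7.2f}, {hi:7.2f})")
```

An interval that excludes 0 corresponds to a p-value below 0.05, which matches the table: the intervals for H2S and acetic exclude 0, while the interval for the constant does not.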
24
Minitab

Analysis of Variance
Source            DF       SS       MS      F       P
Regression         2   3114.0   1557.0   9.24   0.001
Residual Error    27   4548.9    168.5
Total             29   7662.9

The regression equation is
taste = - 34.0 - 7.57 H2S + 14.8 acetic
Model is regression equation + error:
taste = - 34.0 - 7.57 H2S + 14.8 acetic + error
MSE = 168.5 is the estimated variance of the error.
F stat = MSR / MSE (this is the test statistic)
P-value is for test $H_0: \beta_1 = \beta_2 = 0$ (both slopes
0), $H_A$: at least one is not 0

This is the overall test of whether or not the regression is
useful.
25
Using the regression equation

taste = - 34.0 - 7.57 H2S + 14.8 acetic
If H2S = 3 and acetic = 5, then what is the
expected taste score? (NOTE that this is not an
extrapolation.)
For the value, just plug H2S = 3 and acetic = 5 into the
equation.
For a confidence interval (CI):
Stat > Regression > Regression, Options button,
prediction interval for new obs (enter the values in the order
that they're in the regression equation)

New Obs     Fit   SE Fit         95.0% CI           95.0% PI
1         17.11     3.17   (10.60, 23.63)   (-10.30, 44.53)

The prediction interval is wider than the CI since
prediction includes error variability as well as
variability in estimating the parameters.
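
The fit and both intervals can be approximately reproduced from numbers already shown: the coefficients, SE Fit = 3.17, and MSE = 168.5. The sketch below assumes the critical value 2.052 for t with 27 degrees of freedom (from a standard t table, not from the slides) and uses the usual formulas, in which the prediction interval adds the error variance MSE to the squared SE of the fit.

```python
import math

# Coefficients from the fitted equation (unrounded values from the output).
b_const, b_h2s, b_acetic = -33.99, -7.570, 14.763

# Plug in the new observation.
h2s, acetic = 3, 5
fit = b_const + b_h2s * h2s + b_acetic * acetic

se_fit = 3.17   # SE Fit reported by Minitab
mse = 168.5     # estimated error variance from the ANOVA table
t_crit = 2.052  # assumed t_{0.025, 27} from a standard t table

# CI for the mean taste score at these predictor values.
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)

# PI for one new cheese: parameter uncertainty plus error variability.
se_pred = math.sqrt(mse + se_fit ** 2)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)

print(f"fit = {fit:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

Small differences from the Minitab output in the second decimal place are expected, since the inputs here are rounded.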
26
Dummy (or indicator) variables

When some predictor variables are categorical,
regression can still be used.
Dummy variables are used to indicate the fabric of
each observation.

27
Regression Model for Burn Time Data

Burn time = $\mu_1$ if fabric 1, $\mu_2$ if fabric 2,
$\mu_3$ if fabric 3, $\mu_4$ if fabric 4, plus error
or
$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \varepsilon_i$
(x's are indicator variables)
$x_{1i} = 1$ if observation i is fabric 1 and 0 otherwise
$x_{2i} = 1$ if observation i is fabric 2 and 0 otherwise
$x_{3i} = 1$ if observation i is fabric 3 and 0 otherwise
$x_{4i} = 1$ if observation i is fabric 4 and 0 otherwise
Betas are fabric-specific means.
The model does not have an intercept.
(Stat > Regression > Regression, Options: Fit
intercept button)

28
An Equivalent Model

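$y_i = \gamma_0 + \gamma_2 x_{2i} + \gamma_3 x_{3i} + \gamma_4 x_{4i} + \varepsilon_i$
$x_{2i} = 1$ if observation i is fabric 2 and 0 otherwise
$x_{3i} = 1$ if observation i is fabric 3 and 0 otherwise
$x_{4i} = 1$ if observation i is fabric 4 and 0 otherwise
Fabric 1 mean = $\gamma_0$
Fabric 2 mean = $\gamma_0 + \gamma_2$
Fabric 3 mean = $\gamma_0 + \gamma_3$
Fabric 4 mean = $\gamma_0 + \gamma_4$
This model does have an intercept.

$\gamma_0$ is the mean for fabric 1. The rest of the $\gamma$'s are
offsets.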
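
To illustrate why the two codings are equivalent, the sketch below relies on the standard fact that, when the only predictors are group indicators, the least squares estimates are built from the group sample means. The burn times are hypothetical placeholders, not the data behind the slides, and the function names are illustrative.

```python
from statistics import mean

# Hypothetical burn times for four fabrics, NOT the lecture's data.
burn = {
    1: [17.8, 16.2, 17.5, 15.9],
    2: [11.2, 11.4, 9.8, 10.7],
    3: [11.8, 11.0, 10.0, 9.2],
    4: [14.9, 10.8, 12.8, 10.7],
}

# No-intercept model: each beta is that fabric's sample mean.
betas = {fabric: mean(times) for fabric, times in burn.items()}

# Intercept model: gamma0 is the fabric 1 mean, the rest are offsets from it.
gamma0 = betas[1]
gammas = {fabric: betas[fabric] - gamma0 for fabric in burn if fabric != 1}

def fitted_no_intercept(fabric):
    return betas[fabric]

def fitted_with_intercept(fabric):
    # Fabric 1 has no indicator in this coding, so its offset is zero.
    return gamma0 + gammas.get(fabric, 0.0)

for fabric in burn:
    print(fabric, fitted_no_intercept(fabric), fitted_with_intercept(fabric))
```

Both codings give the same fitted value for every observation, so they also share the same SSE and the same ANOVA table; only the interpretation of the coefficients changes.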
29
The regression equation is
Burn Time = 16.9 - 5.90 Fabric 2 - 6.35 Fabric 3 - 5.85 Fabric 4

Predictor       Coef   SE Coef       T       P
Constant     16.8500    0.5806   29.02   0.000
Fabric 2     -5.9000    0.8211   -7.19   0.000
Fabric 3     -6.3500    0.8211   -7.73   0.000
Fabric 4     -5.8500    0.8211   -7.12   0.000

S = 1.161   R-Sq = 87.2%   R-Sq(adj) = 83.9%

Analysis of Variance (Note that this is the same as before!)
Source            DF        SS       MS       F       P
Regression         3   109.810   36.603   27.15   0.000
Residual Error    12    16.180    1.348
Total             15   125.990

95% CIs for fabric means:
(Point estimate of mean) $\pm\ t_{0.025,12}\sqrt{MSE / 4}$
Fabric 2: $(16.85 - 5.90) \pm 2.179\sqrt{1.348 / 4} = 10.95 \pm 2.179(0.5806)$
(0.5806 is the std dev of the estimate of $\gamma_0 + \gamma_2$)
(As usual, we're assuming the errors are
indep and normal with constant variance.)
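
The same interval calculation can be repeated for every fabric using only the numbers in the output: the constant, the three offsets, MSE = 1.348, and the slide's critical value 2.179. The divisor 4 is taken from the slide's formula, i.e. it is assumed here that each fabric mean is based on 4 observations.

```python
import math

T_CRIT_12 = 2.179   # t_{0.025, 12}, as used on the slide
MSE = 1.348         # from the ANOVA table
N_PER_FABRIC = 4    # assumed from the slide's sqrt(MSE / 4)

constant = 16.85    # estimated mean for fabric 1
offsets = {"Fabric 1": 0.0, "Fabric 2": -5.90, "Fabric 3": -6.35, "Fabric 4": -5.85}

# Standard error of each fabric mean and the CI half-width.
se_mean = math.sqrt(MSE / N_PER_FABRIC)
half_width = T_CRIT_12 * se_mean

means = {name: constant + off for name, off in offsets.items()}
intervals = {name: (m - half_width, m + half_width) for name, m in means.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name}: mean={means[name]:.2f}  95% CI=({lo:.2f}, {hi:.2f})")
```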
30
Back to cheese

Suppose the cheeses come from two regions of
Australia and we want to include that info in the
model:
$\text{Taste}_i = \beta_0 + \beta_1\,\text{acetic}_i + \beta_2\,\text{H2S}_i + \beta_3\,\text{Region}_i + \text{error}_i$,
$i = 1, \dots, n = 30$
$\text{Region}_i = 1$ if the ith sample comes from region 1
and 0 otherwise. $\beta_3$ is the effect of region 1.
If $\beta_3 > 0$, then region 1 tends to
increase the mean score (and vice versa).