Announcements: - PowerPoint PPT Presentation

Provided by: johnstau

Transcript and Presenter's Notes


1
Announcements
  • Next Homework is on the Web
  • Due next Tuesday

2
  • Mosquito repellent experiment
  • 30 people were recruited for an experiment.
  • Groups of 10 were randomly assigned to one of
    three repellent types.
  • They were put into mosquito filled room for 10
    minutes (and told not to kill the mosquitos!).
  • Total number of bites in each group was counted
    after the experiment.
  • (Source Steve Gulyas, CRC Testing)

3
Estimate of Total Variability in the Data
$S^2 = \dfrac{(X_1-\bar{X})^2 + \cdots + (X_{30}-\bar{X})^2}{30-1} = \dfrac{11561.37}{29} = 398.67$
where $\bar{X} = (X_1 + \cdots + X_{30})/30$.
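As a rough illustration of this formula, the sketch below computes a sample variance in plain Python. The function name `sample_variance` and the five counts are made up for illustration; they are not the experiment's data. The last line simply repeats the slide's own arithmetic.

```python
def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

# Hypothetical bite counts, NOT the mosquito data.
counts = [12, 30, 25, 41, 17]
s2 = sample_variance(counts)

# The slide's arithmetic: total sum of squares divided by n - 1.
s2_slide = 11561.37 / (30 - 1)
```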
4
Same data grouped by repellent type
Grouping the data by treatment explains some of
the variability! (Analysis of variance makes
this explanation more precise.)
5
ANOVA table
For the test of $H_0: \mu_A = \mu_B = \mu_C$

Source of Variation   df   Sum of Squares   Mean Square   F      P
repellent              2    6952.3          3476.1        20.4   0.0000
Error                 27    4609.1           170.7
Total                 29   11561.4

Estimate of average variance of counts across the
repellent types.
Variance of counts within each repellent type is
proportional to this.
Total variability in the data is proportional to
this.
Sum of squares treatment + sum of squares error
= sum of squares total: 6952.3 + 4609.1 = 11561.4
$R^2 = SS_{Treat} / SS_{Total} = 0.6013$ is the
fraction of variability accounted for by
treatment.
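The decomposition in the table can be checked on any grouped data set. The sketch below is a minimal illustration under stated assumptions: the function name `anova_sums_of_squares` and the three small groups are hypothetical, not the repellent data.

```python
def anova_sums_of_squares(groups):
    """Return (ss_treat, ss_error, ss_total) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_treat = 0.0
    ss_error = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Between-group part: group mean versus grand mean, once per observation.
        ss_treat += len(g) * (group_mean - grand_mean) ** 2
        # Within-group part: each observation versus its own group mean.
        ss_error += sum((x - group_mean) ** 2 for x in g)
    return ss_treat, ss_error, ss_total

# Hypothetical counts for three treatments, NOT the real data.
groups = [[10, 12, 14], [20, 22, 24], [30, 35, 40]]
ss_treat, ss_error, ss_total = anova_sums_of_squares(groups)
r_squared = ss_treat / ss_total  # fraction of variability explained by grouping
```

The treatment and error sums of squares always add up to the total, which is why grouping can only explain variability, never add to it.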
6
Explaining why ANOVA is an analysis of variance
For the test of $H_0: \mu_A = \mu_B = \mu_C$:
MST = 6952.3 / 2 = 3476.1. Sqrt(MST) describes the
standard deviation among the repellents.
MSE = 4609.1 / 27 = 170.7. Sqrt(MSE) describes the
standard deviation of the counts within each
repellent type.
F = MST / MSE = 20.4.
It makes sense that F is large and that the p-value
$\Pr(F_{3-1,\,30-3} > 20.4) \approx 0$ is small, because the
variance among treatments is much larger than the
variance within the units that get each
treatment. (Note that the F test assumes the
counts are independent and normally distributed
with the same variance.)
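The arithmetic on this slide can be reproduced directly from the sums of squares and degrees of freedom in the ANOVA table. This is only a sketch of that arithmetic; the variable names are chosen here for illustration.

```python
import math

ss_treat, df_treat = 6952.3, 3 - 1    # repellent row: 3 groups
ss_error, df_error = 4609.1, 30 - 3   # error row: 30 people, 3 groups

mst = ss_treat / df_treat             # mean square for treatment
mse = ss_error / df_error             # mean square error
f_stat = mst / mse                    # large when groups differ more than noise

sd_among = math.sqrt(mst)             # spread among the repellents
sd_within = math.sqrt(mse)            # spread of counts within a repellent
```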
7
It turns out that ANOVA is a special case of
regression. We'll come back to that in a class or
two. First, let's learn about regression
(chapters 12 and 13).
  • Simple Linear Regression example
  • Ingrid is a small business owner who wants to buy
    a fleet of Mitsubishi Sigmas. To save money, she
    decides to buy second-hand cars and wants to
    estimate how much to pay. To do this, she asks
    one of her employees to collect data on how much
    people have paid for these cars recently.
    (From Matt Wand)

8
Regression Plot
[Scatter plot of Price (dollars) against Age (years). Each point is a car.]
9
  • Plot suggests a simple model
  • Price of car = intercept + slope × car's age +
    error
  • or
  • $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \dots, 39$.
  • Estimate $\beta_0$ and $\beta_1$.
  • Outline for Regression
  • Estimating the regression parameters and ANOVA
    tables for regression
  • Testing and confidence intervals
  • Multiple regression models and ANOVA
  • Regression Diagnostics

10
  • Plot suggests a model
  • Price of car = intercept + slope × car's age +
    error
  • or
  • $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \dots, 39$.
  • Estimate $\beta_0$ and $\beta_1$ with $b_0$ and $b_1$. Find these
    with least squares.
  • In other words, find $b_0$ and $b_1$ to minimize the sum of
    squared errors
  • $SSE = [y_1 - (b_0 + b_1 x_1)]^2 + \cdots + [y_n - (b_0 + b_1 x_n)]^2$
  • See green line on next page.

Each term is the squared difference between the observed
y and the regression line ($b_0 + b_1 x$).
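For a single predictor, the minimizing values have a well-known closed form: the slope is the sum of cross-products divided by the sum of squares of x, and the intercept makes the line pass through the point of means. The sketch below is a minimal illustration; `fit_line`, `sse`, and the five (age, price) pairs are hypothetical and are not the Sigma data.

```python
def fit_line(xs, ys):
    """Return (b0, b1) that minimize the sum of squared errors."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical (age, price) pairs, NOT the data from the slides.
ages = [6, 8, 10, 12, 14]
prices = [6000, 5200, 4100, 3500, 2600]
b0, b1 = fit_line(ages, prices)
best = sse(ages, prices, b0, b1)
```

Moving either coefficient away from the fitted value can only increase `sse`, which is a quick way to convince yourself that the formulas really do minimize it.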
11
Regression Plot
Price = 8198.25 - 385.108 Age
S = 1075.07   R-Sq = 43.8%   R-Sq(adj) = 42.2%
[Scatter plot of Price against Age with the fitted line. A segment joins one data point to the line.]
This segment has length $y_i - (b_0 + b_1 x_i)$ for some i.
The squared length of this segment contributes one term to the Sum of Squared Errors (SSE).
12
Regression Plot
Do Minitab example
S = 1075.07   R-Sq = 43.8%   R-Sq(adj) = 42.2%
General Model: Price = $\beta_0 + \beta_1$ Age + error
Fitted Model: Price = 8198.25 - 385.108 Age
[Scatter plot of Price (dollars) against Age (years).]
13
  • Regression parameter estimates, $b_0$ and $b_1$,
    minimize
  • $SSE = [y_1 - (b_0 + b_1 x_1)]^2 + \cdots + [y_n - (b_0 + b_1 x_n)]^2$
  • Full model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
  • Suppose the errors ($\varepsilon_i$'s) are independent $N(0, \sigma^2)$.
  • What do you think a good estimate of $\sigma^2$ is?
  • $MSE = SSE/(n-2)$ is an estimate of $\sigma^2$.
  • Note how SSE looks like the numerator in the sample variance $s^2$.

14
(I divided price by 1000. Think about why this
doesn't matter.)

Source            DF   SS       MS       F       P
Regression         1   33.274   33.274   28.79   0.000
Residual Error    37   42.763    1.156
Total             38   76.038

  • Sum of Squares Total $= [y_1 - \bar{y}]^2 + \cdots + [y_{39} - \bar{y}]^2 = 76.038$
  • Sum of Squared Errors $= [y_1 - (b_0 + b_1 x_1)]^2 + \cdots + [y_n - (b_0 + b_1 x_n)]^2 = 42.763$
  • Sum of Squares for Regression = SSTotal - SSE
  • What do these mean?
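One way to see why dividing price by 1000 does not matter is to follow the rescaling through the sums of squares. The derivation below is a sketch; the constant $c$ and the primed symbols are notation introduced here, not the slide's.

```latex
y_i' = y_i / c \quad (c = 1000)
\;\Rightarrow\; b_0' = b_0 / c, \qquad b_1' = b_1 / c,
\\
SSE' = \sum_i \bigl[ y_i' - (b_0' + b_1' x_i) \bigr]^2 = SSE / c^2,
\qquad SS_{Total}' = SS_{Total} / c^2,
\\
F' = \frac{(SS_{Total}' - SSE')/(p-1)}{SSE'/(n-p)} = F,
\qquad
R'^2 = 1 - \frac{SSE'}{SS_{Total}'} = R^2 .
```

Only quantities measured in the units of price change: S becomes S / c, which agrees with $\sqrt{1.156} \approx 1.075$ in the table against S = 1075.07 on the plot.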

15
Regression Plot
Price = 8198.25 - 385.108 Age
S = 1075.07   R-Sq = 43.8%   R-Sq(adj) = 42.2%
[Scatter plot of Price against Age showing two lines: the overall mean of 3,656 and the regression line.]
16
(I divided price by 1000. Think about why this
doesn't really matter.)

Source            DF           SS       MS       F       P
Regression         1 = p-1     33.274   33.274   28.79   0.000
Residual Error    37 = n-p     42.763    1.156
Total             38 = n-1     76.038

p is the number of regression parameters (2 for now)
  • SSTotal $= [y_1 - \bar{y}]^2 + \cdots + [y_{39} - \bar{y}]^2 = 76.038$
  • SSTotal / 38 is an estimate of the variance
    around the overall mean.
  • (i.e. variance in the data without doing
    regression)
  • SSE $= [y_1 - (b_0 + b_1 x_1)]^2 + \cdots + [y_n - (b_0 + b_1 x_n)]^2 = 42.763$
  • MSE = SSE / 37 is an estimate of the variance
    around the line.
  • (i.e. variance that is not explained by the
    regression)
  • SSR = SSTotal - SSE
  • MSR = SSR / 1 is the variance in the data that is
    explained by the regression.
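The entries of this table follow from just two sums of squares plus n and p. A short sketch of the bookkeeping, with variable names chosen here for illustration:

```python
n, p = 39, 2                            # observations, regression parameters
ss_total, ss_error = 76.038, 42.763     # from the table (price in thousands)

ss_reg = ss_total - ss_error            # SSR = SSTotal - SSE
msr = ss_reg / (p - 1)                  # variance explained by the regression
mse = ss_error / (n - p)                # variance around the line
f_stat = msr / mse
var_around_mean = ss_total / (n - 1)    # variance without doing regression
```

The difference `ss_total - ss_error` comes out at 33.275 rather than the printed 33.274 only because the table entries are rounded.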

17
(I divided price by 1000. Think about why this
doesn't really matter.)

Source            DF           SS       MS       F       P
Regression         1 = p-1     33.274   33.274   28.79   0.000
Residual Error    37 = n-p     42.763    1.156
Total             38 = n-1     76.038

p is the number of regression parameters
  • A test of $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \ne 0$
  • Reject if the variance explained by the
    regression is high compared to the unexplained
    variability in the data. Reject if F is large.
  • F = MSR / MSE
  • p-value is $\Pr(F_{p-1,\,n-p} > MSR / MSE)$
  • Reject $H_0$ for any $\alpha$ greater than the p-value
  • (See Minitab example and confidence intervals for
    estimated parameters)
  • (Assuming errors are independent and normal.)

18
$R^2$
  • Another summary of a regression is
  • $R^2 = \dfrac{\text{Sum of Squares for Regression}}{\text{Sum of Squares Total}}$
  • $0 \le R^2 \le 1$

This is the proportion of the variation in the
data that is described by the regression.
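Applied to the car-price ANOVA table, this ratio reproduces the R-Sq value Minitab reports. The sketch also computes the adjusted version using its usual definition (one minus the ratio of mean squares), which the slides report but do not define, so treat that line as an assumption about what Minitab prints.

```python
ss_reg, ss_error, ss_total = 33.274, 42.763, 76.038   # car-price ANOVA table
n, p = 39, 2

r_squared = ss_reg / ss_total

# Assumed standard definition of adjusted R-squared:
# compare mean squares instead of sums of squares.
adj_r_squared = 1 - (ss_error / (n - p)) / (ss_total / (n - 1))

r_sq_percent = round(100 * r_squared, 1)
adj_r_sq_percent = round(100 * adj_r_squared, 1)
```

Both rounded percentages agree with the R-Sq and R-Sq(adj) values shown on the regression plot.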
19
Different ways to assess the worth of a
regression
  1. Absolute size of slope: bigger is better
  2. Size of error variance: smaller is better
  3. $R^2$ close to one
  4. Large F statistic

20
Multiple Regression
  • Cheese Example: In a study of cheddar cheese
    from the La Trobe Valley of Victoria, Australia,
    samples of cheese were analyzed to determine the
    amount of acetic acid and hydrogen sulfide they
    contained.
  • Overall scores for each cheese were obtained by
    combining the scores from several tasters.
  • The goal is to predict the taste score based on
    the acetic acid and hydrogen sulfide
    content. (From Matt Wand)

21
Model
  • A simple model for taste is
  • $\text{Taste}_i = \beta_0 + \beta_1\,\text{acetic}_i + \beta_2\,\text{H2S}_i + \text{error}_i$
  • $i = 1, \dots, n = 30$
  • Again the intercept and slopes are selected to
    minimize the error sum of squares
  • $SSE = [\text{taste}_1 - (b_0 + b_1\,\text{acetic}_1 + b_2\,\text{H2S}_1)]^2 + \cdots$
  • $\quad + [\text{taste}_{30} - (b_0 + b_1\,\text{acetic}_{30} + b_2\,\text{H2S}_{30})]^2$
  • Geometrically:
  • The simple linear model estimated a line. A model
    with an intercept and 2 slopes estimates a
    surface.
  • Note that you could add more predictors too
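With two predictors there is no one-line formula, but the minimizing coefficients solve a small linear system (the normal equations). The sketch below is one way to do this in plain Python and is not how Minitab computes it; `solve`, `fit_two_predictors`, and the data are hypothetical. The responses are generated from an exact plane so the result is easy to check.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_two_predictors(x1, x2, y):
    """Least-squares (b0, b1, b2) for y = b0 + b1*x1 + b2*x2 + error."""
    rows = [(1.0, u, v) for u, v in zip(x1, x2)]
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

# Hypothetical measurements, NOT the cheese data.
acetic = [4.5, 5.0, 5.5, 6.0, 6.5]
h2s = [3.0, 5.0, 4.0, 7.0, 6.0]
taste = [1 + 2 * a + 3 * h for a, h in zip(acetic, h2s)]  # exact plane, no noise
b0, b1, b2 = fit_two_predictors(acetic, h2s, taste)
```

Adding more predictors only adds columns to the design rows and enlarges the system being solved.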

22
Minitab
  • Stat > Regression > Regression
  • Response is taste
  • Predictors are acetic and h2s
  • Output:

The regression equation is
taste = -34.0 - 7.57 H2S + 14.8 acetic

Predictor   Coef      SE Coef   T       P
Constant    -33.99    26.53     -1.28   0.211
H2S          -7.570    3.474    -2.18   0.038
acetic       14.763    4.242     3.48   0.002

S = 12.98   R-Sq = 40.6%   R-Sq(adj) = 36.2%

Analysis of Variance
Source            DF   SS       MS       F      P
Regression         2   3114.0   1557.0   9.24   0.001
Residual Error    27   4548.9    168.5

23
Minitab
The regression equation is
taste = -34.0 - 7.57 H2S + 14.8 acetic

Predictor   Coef      SE Coef   T       P
Constant    -33.99    26.53     -1.28   0.211
H2S          -7.570    3.474    -2.18   0.038
acetic       14.763    4.242     3.48   0.002

  • T = Coef / SE Coef (this is the test statistic)
  • P-value is for the test of $H_0$: Coef = 0 versus $H_A$: Coef
    is not 0 (if p-value < $\alpha$, then reject $H_0$)
  • $(1-\alpha)$ CI for Coef: Coef ± $t_{\alpha/2,\,df}$ × SE Coef, where df is the error df
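The T column and the confidence intervals can be rebuilt from the Coef and SE Coef columns alone. In the sketch below, the error degrees of freedom (27) come from the ANOVA table, and the critical value 2.052 for $t_{0.025,27}$ is a standard t-table value typed in by hand rather than computed; `t_and_ci` is a hypothetical helper name.

```python
T_CRIT_27 = 2.052   # t(0.025, 27) from a t table; assumed, not computed here

# (Coef, SE Coef) pairs copied from the Minitab output.
coefs = {
    "Constant": (-33.99, 26.53),
    "H2S": (-7.570, 3.474),
    "acetic": (14.763, 4.242),
}

def t_and_ci(coef, se, t_crit):
    """Return the t statistic and the (lower, upper) confidence interval."""
    t_stat = coef / se
    half_width = t_crit * se
    return t_stat, (coef - half_width, coef + half_width)

results = {name: t_and_ci(c, se, T_CRIT_27) for name, (c, se) in coefs.items()}
```

An interval that excludes 0 corresponds to rejecting $H_0$: Coef = 0 at the 5% level, which matches the two slopes having p-values below 0.05 while the constant does not.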
24
Minitab
Analysis of Variance
Source            DF   SS       MS       F      P
Regression         2   3114.0   1557.0   9.24   0.001
Residual Error    27   4548.9    168.5
Total             29   7662.9

  • The regression equation is
  • taste = -34.0 - 7.57 H2S + 14.8 acetic
  • Model is regression equation + error
  • taste = -34.0 - 7.57 H2S + 14.8 acetic + error
  • MSE = 168.5 estimates the variance of the error.
  • F stat = MSR / MSE (this is the test statistic)
  • P-value is for the test of $H_0: \beta_1 = \beta_2 = 0$ (both slopes
    are 0) versus $H_A$: at least one is not 0

This is the overall test of whether or not the regression is
useful.
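When the numerator has exactly 2 degrees of freedom, as it does here, the upper tail of the F distribution has a simple closed form, so the p-value can be checked without statistical software. The sketch below relies on that special case only; `f_upper_tail_df1_2` is a hypothetical helper name, and the formula does not apply to other numerator degrees of freedom.

```python
def f_upper_tail_df1_2(f, df2):
    """Pr(F > f) for an F distribution with 2 and df2 degrees of freedom.

    Valid ONLY when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)

msr = 3114.0 / 2          # regression mean square
mse = 4548.9 / 27         # residual mean square
f_stat = msr / mse
p_value = f_upper_tail_df1_2(f_stat, 27)
```

The result rounds to the 0.001 shown in the P column.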
25
Using the regression equation
  • taste = -34.0 - 7.57 H2S + 14.8 acetic
  • If H2S = 3 and acetic = 5, then what is the
    expected taste score? (NOTE that this is not an
    extrapolation)
  • For the value, just plug H2S = 3 and acetic = 5 into
    the equation. For a confidence interval (CI):
  • Stat > Regression > Regression, Options button,
    prediction intervals for new observations (enter the
    values in the order that they appear in the regression equation)

New Obs   Fit     SE Fit   95.0% CI          95.0% PI
1         17.11   3.17     (10.60, 23.63)    (-10.30, 44.53)

The prediction interval is wider than the CI since the
prediction includes error variability as well as
variability in estimating the parameters.
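Both intervals can be approximated from numbers already on the slides. The sketch below uses the unrounded coefficients from the Coef column, the reported SE Fit and MSE, and a hand-entered t-table value of 2.052 for $t_{0.025,27}$ (an assumption, not something computed here). The prediction standard error combines the two sources of variability as $\sqrt{MSE + (\text{SE Fit})^2}$.

```python
import math

b0, b_h2s, b_acetic = -33.99, -7.570, 14.763   # Coef column of the output
mse = 168.5                                     # residual mean square
se_fit = 3.17                                   # SE Fit reported by Minitab
T_CRIT_27 = 2.052                               # t(0.025, 27), table value (assumed)

def predict(h2s, acetic):
    """Expected taste score from the fitted equation."""
    return b0 + b_h2s * h2s + b_acetic * acetic

fit = predict(3, 5)

# CI: uncertainty in the estimated mean only.
ci = (fit - T_CRIT_27 * se_fit, fit + T_CRIT_27 * se_fit)

# PI: adds the error variance of a single new observation.
se_pred = math.sqrt(mse + se_fit ** 2)
pi = (fit - T_CRIT_27 * se_pred, fit + T_CRIT_27 * se_pred)
```

Small differences in the second decimal place from the Minitab intervals are expected, because SE Fit and MSE are rounded on the slide.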
26
Dummy (or indicator) variables
  • When some predictor variables are categorical,
    regression can still be used.
  • Dummy variables are used to indicate the category
    (here, the fabric) of each observation

27
Regression Model for Burn Time Data
  • Burn time = $\mu_1$ if fabric 1, $\mu_2$ if fabric 2,
    $\mu_3$ if fabric 3, $\mu_4$ if fabric 4, + error, or
  • $y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \varepsilon_i$
  • (x's are indicator variables)
  • $x_{1i} = 1$ if observation i is fabric 1 and 0 otherwise
  • $x_{2i} = 1$ if observation i is fabric 2 and 0 otherwise
  • $x_{3i} = 1$ if observation i is fabric 3 and 0 otherwise
  • $x_{4i} = 1$ if observation i is fabric 4 and 0 otherwise
  • Betas are fabric-specific means.
  • The model does not have an intercept.
  • (Stat > Regression > Regression, Options, Fit
    intercept button)

28
An Equivalent Model
  • $y_i = \gamma_0 + \gamma_2 x_{2i} + \gamma_3 x_{3i} + \gamma_4 x_{4i} + \varepsilon_i$
  • $x_{2i} = 1$ if observation i is fabric 2 and 0
    otherwise
  • $x_{3i} = 1$ if observation i is fabric 3 and 0
    otherwise
  • $x_{4i} = 1$ if observation i is fabric 4 and 0
    otherwise
  • Fabric 1 mean = $\gamma_0$
  • Fabric 2 mean = $\gamma_0 + \gamma_2$
  • Fabric 3 mean = $\gamma_0 + \gamma_3$
  • Fabric 4 mean = $\gamma_0 + \gamma_4$
  • This model does have an intercept.

$\gamma_0$ is the mean for fabric 1. The rest of the $\gamma$'s are
offsets.
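The two parameterizations describe the same fitted means. For indicator-only models, the least-squares estimates are just the group averages (no-intercept form) or the baseline average plus differences from it (intercept form). The sketch below illustrates that correspondence; the helper names and the eight burn times are hypothetical, not the course data.

```python
def group_means(labels, ys):
    """Mean response for each category label."""
    totals, counts = {}, {}
    for lab, y in zip(labels, ys):
        totals[lab] = totals.get(lab, 0.0) + y
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: totals[lab] / counts[lab] for lab in totals}

def indicators(labels, levels):
    """One 0/1 column per level, in the order given by `levels`."""
    return {lev: [1 if lab == lev else 0 for lab in labels] for lev in levels}

# Hypothetical burn times, two per fabric (NOT the real data).
fabric = [1, 1, 2, 2, 3, 3, 4, 4]
burn = [17.0, 18.0, 11.0, 12.0, 10.0, 11.0, 10.5, 11.5]

x = indicators(fabric, [1, 2, 3, 4])      # dummy columns x1..x4
means = group_means(fabric, burn)         # betas of the no-intercept model

g0 = means[1]                                     # intercept: fabric 1 mean
offsets = {k: means[k] - g0 for k in (2, 3, 4)}   # g2, g3, g4
```

Dropping the fabric 1 column and adding an intercept changes what each coefficient means, but not the fitted mean for any fabric.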
29
The regression equation is
Burn Time = 16.9 - 5.90 Fabric 2 - 6.35 Fabric 3 - 5.85 Fabric 4

Predictor   Coef       SE Coef   T       P
Constant    16.8500    0.5806    29.02   0.000
Fabric 2    -5.9000    0.8211    -7.19   0.000
Fabric 3    -6.3500    0.8211    -7.73   0.000
Fabric 4    -5.8500    0.8211    -7.12   0.000

S = 1.161   R-Sq = 87.2%   R-Sq(adj) = 83.9%

Analysis of Variance (Note that this is the same as before!)
Source            DF   SS        MS       F       P
Regression         3   109.810   36.603   27.15   0.000
Residual Error    12    16.180    1.348
Total             15   125.990

95% CIs for fabric means:
(Point estimate of mean) ± $t_{0.025,12} \sqrt{MSE / 4}$
Fabric 2: (16.85 - 5.90) ± 2.179 sqrt(1.348 / 4) = 10.95 ± 2.179(0.5806)
(0.5806 is the standard error of the estimate of $\gamma_0 + \gamma_2$.)
(As usual, we're assuming the errors are
independent and normal with constant variance.)
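The fabric 2 interval can be finished numerically from the output. This is a sketch of the slide's own formula; the per-fabric sample size of 4 is read off the `MSE / 4` term (and is consistent with 16 observations across 4 fabrics), and the variable names are chosen here for illustration.

```python
import math

g0, g2 = 16.85, -5.90          # Constant and Fabric 2 coefficients
mse, n_per_fabric = 1.348, 4   # residual mean square, observations per fabric
t_crit = 2.179                 # t(0.025, 12), as given on the slide

mean_fabric2 = g0 + g2                       # point estimate of the fabric 2 mean
se_mean = math.sqrt(mse / n_per_fabric)      # standard error of that estimate
ci = (mean_fabric2 - t_crit * se_mean, mean_fabric2 + t_crit * se_mean)
```

The same recipe gives the intervals for fabrics 3 and 4 by swapping in their coefficients; for fabric 1 the point estimate is the constant alone.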
30
Back to cheese
  • Suppose the cheeses come from two regions of
    Australia and we want to include that info in the
    model
  • $\text{Taste}_i = \beta_0 + \beta_1\,\text{acetic}_i + \beta_2\,\text{H2S}_i + \beta_3\,\text{Region}_i + \text{error}_i$
  • $i = 1, \dots, n = 30$
  • $\text{Region}_i = 1$ if the ith sample comes from region 1
    and 0 otherwise. $\beta_3$ is the effect of region 1.
  • If $\beta_3 > 0$, then region 1 tends to
    increase the mean score (and vice versa)