Title: Regression and Analysis of Variance II
1Regression and Analysis of Variance II
- Chapter 13 Analysis of Covariance and Other
Methods for Adjusting Continuous Data
2Homework 1
- All even problems
- Due 2/5 in class
3 Three Reasons for Considering Control
- When assessing an association between a dependent variable and a set of study variables, we control for extraneous variables (called covariates) in order
  - to assess interaction between study variables and covariates,
  - to correct for confounding, and
  - to increase the precision in estimating the association.
4 Interaction and Confounding
- Interaction and confounding are two methodological concepts for quantifying the relationship of one or more independent variables to a dependent variable.
- Interaction exists when the association is different at different values of the covariates.
- Confounding exists if different associations between a dependent variable and one or more study variables result when covariates are included or ignored in data analysis.
5 What Is Analysis of Covariance?
- Analysis of covariance (ANACOVA, ANCOVA) is a special regression procedure, which is used to adjust or correct for problems of confounding.
- An ANACOVA model is a regression model. In such a model, study variables are categorical, covariates may be measured on any scale, and there is no interaction between the study variables and the covariates.
6 Choice of Covariates in an ANACOVA Model
- Covariates should be confounding variables.
- Covariates should not depend upon study variables in any way.
7 Development of the ANACOVA Model
- For simplicity, consider only one study variable, which is categorical and takes k values corresponding to k groups.
- Suppose that z1, z2, ..., zk-1 are k - 1 dummy variables for the study variable, defined (reference coding) as zi = 1 if the observation falls in group i and zi = 0 otherwise, with group k as the baseline.
- Suppose that x1, x2, ..., xp are p covariates.
8 Development of the ANACOVA Model (cont'd)
- The ANACOVA model is
  Y = β0 + β1x1 + ... + βpxp + βp+1z1 + ... + βp+k-1zk-1 + E,
- where E is the error term and the coefficients βp+1, ..., βp+k-1 carry the (adjusted) group effects.
9 Point Estimation of Interest Quantities
- The parameters can be estimated by the least squares method. Denote these estimates by β̂0, β̂1, ..., β̂p+k-1.
- Instead of the unadjusted group means (mean of y for each category), we usually are interested in the adjusted means. The adjusted mean for a category is the predicted value obtained by evaluating the model in that category with the covariates set to their overall means across all categories.
- For the baseline category (k), the adjusted mean is
  β̂0 + β̂1 x̄1 + ... + β̂p x̄p.
- For the jth category, the adjusted mean is
  β̂0 + β̂1 x̄1 + ... + β̂p x̄p + β̂p+j.
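As a sketch of the adjusted-mean calculation, here is a Python (rather than SAS) version on entirely hypothetical data: one covariate x and k = 3 groups with two reference-cell dummies. All names and effect sizes are made up for illustration.

```python
import numpy as np

# Hypothetical data: k = 3 groups, one covariate x, response y
rng = np.random.default_rng(0)
group = np.repeat([1, 2, 3], 10)
x = rng.normal(50, 5, size=30)
eff = np.where(group == 1, 3.0, np.where(group == 2, 1.0, 0.0))
y = 2.0 + 0.4 * x + eff + rng.normal(0, 1.0, size=30)

# Design matrix: intercept, covariate, two dummies (group 3 = baseline)
z1 = (group == 1).astype(float)
z2 = (group == 2).astype(float)
X = np.column_stack([np.ones(30), x, z1, z2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Adjusted means: evaluate the fitted model at the overall mean of x
xbar = x.mean()
adj_baseline = beta[0] + beta[1] * xbar   # baseline group (3)
adj_group1 = adj_baseline + beta[2]       # group 1
adj_group2 = adj_baseline + beta[3]       # group 2
print(adj_group1, adj_group2, adj_baseline)
```

Note that, by construction, differences between adjusted means equal the dummy-variable coefficients; the covariate merely shifts all groups by a common amount.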
10 Confidence Intervals
- Formula: point estimate ± (critical value) × (standard error).
- Confidence intervals for single coefficients can be obtained using t critical values.
- Confidence intervals for a linear combination of coefficients can also be obtained using t critical values. In order to calculate the standard error, we need to know the covariance matrix of the estimates of the coefficients.
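The covariance-matrix calculation can be sketched in Python on a small hypothetical fit (the data and the chosen linear combination below are invented for illustration):

```python
import numpy as np
from scipy.stats import t

# Hypothetical fit: y on one covariate and one dummy (n = 8)
X = np.array([[1, 1.0, 0], [1, 2.0, 0], [1, 3.0, 0], [1, 4.0, 0],
              [1, 1.5, 1], [1, 2.5, 1], [1, 3.5, 1], [1, 4.5, 1]], float)
y = np.array([2.1, 3.0, 3.9, 5.2, 4.0, 5.1, 6.1, 6.8])

n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
mse = resid @ resid / (n - p)
cov_beta = mse * XtX_inv            # covariance matrix of the estimates

# CI for a linear combination c'beta (here: the dummy coefficient alone)
c = np.array([0.0, 0.0, 1.0])
est = c @ beta
se = np.sqrt(c @ cov_beta @ c)      # standard error from the covariance matrix
crit = t.ppf(0.975, df=n - p)       # t critical value
ci = (est - crit * se, est + crit * se)
print(ci)
```

Any contrast of coefficients (e.g. c = [0, 1, -1]) can be handled the same way; only the vector c changes.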
11Hypothesis Testing
- Testing for equality of all the adjusted means is
of interest and is equivalent to testing H0 ?p1
?pk-1 0, which is carried out by the
multiple partial F test. - The test statistic is
12ANACOVA Example Using SAS
- Problem 1 (Page 274)
- Problem 3 (Page 275)
- Problem 5 (Page 275) SAS Codes
- Problem 7 (Page 281)
- Problem 9 (Page 281)
- Problem 11 (Page 281)
- Problem 13(Page 282)
- Problem 15 (Page 284)
13Regression and Analysis of Variance II
- Chapter 14 Regression Diagnostics
14Homework 2
- Problems 20, 22, 23, 24, 28
- Due Feb 12, 2009
15- This chapter introduces methods for
- Detecting outliers
- Checking regression assumptions
- Detecting the presence of collinearity
16 Descriptive Statistical Analysis
- (Possible outliers) Examine the 5 largest and 5 smallest values for every numeric variable.
  - Impossible values are set to missing.
  - When outliers are removed, this action, and any justification for it, should be documented and presented along with the results.
- Examine the appropriate descriptive statistics for each variable.
  - For categorical variables, produce frequency tables to detect unusual values.
  - For continuous data, produce the range.
- Examine scatterplots.
  - For simple linear regression with both variables x and y continuous, plot y vs. x (for checking linearity and detecting outliers). Calculate the Pearson correlation, r.
  - For multiple regression, produce partial regression plots and calculate partial correlations.
- Calculate the correlation matrix for the independent variables. A strong correlation signals collinearity problems.
17 Residual Analysis
- A residual for an observation is the difference between the observation and its predicted value.
- Standardized residuals
- Studentized residuals
- Jackknife residuals
- Residual analyses can
  - detect outliers
  - detect violations of model assumptions
18 Standardized Residuals
- zi = ei / S, where ei is the ith residual and S = sqrt(MSE).
19 Studentized Residuals
- ri = ei / (S · sqrt(1 - hi)), where hi is the ith leverage value.
20 Jackknife Residuals
- r(-i) = ei / (S(-i) · sqrt(1 - hi)), where S(-i) is computed with the ith observation deleted; equivalently, r(-i) = ei · sqrt((n - k - 2) / (SSE · (1 - hi) - ei²)).
21 Hypothesis Testing for Outliers
- Under all model assumptions, both the studentized and the jackknife residuals are t distributed.
- To test whether each observation is an outlier, either the studentized residual or the jackknife residual may be used as the test statistic.
- It is suggested that the Bonferroni procedure or other multiple testing procedures be used.
- When the Bonferroni procedure is used, the significance level is α/n for each test, where α is the family-wise type I error rate.
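The jackknife-residual outlier test with the Bonferroni correction can be sketched in Python; the data below are hypothetical, with one outlier deliberately planted at index 9:

```python
import numpy as np
from scipy.stats import t

# Hypothetical straight-line data with one planted outlier at index 9
x = np.arange(10, dtype=float)
noise = np.array([0.2, -0.1, 0.3, -0.2, 0.1, -0.3, 0.2, 0.1, -0.2, 0.0])
y = 1.0 + 2.0 * x + noise
y[9] += 8.0                                   # the planted outlier
X = np.column_stack([np.ones(10), x])

n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                # leverage values
e = y - H @ y                                 # ordinary residuals
sse = e @ e

# Jackknife (externally studentized) residuals
r_jack = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))

# Bonferroni: test each of the n observations at level alpha/n
alpha = 0.05
crit = t.ppf(1 - alpha / (2 * n), df=n - p - 1)
outliers = np.where(np.abs(r_jack) > crit)[0]
print(outliers)
```

Only the planted observation exceeds the Bonferroni-adjusted t critical value; the well-behaved points do not.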
22 Cook's Distance
- Cook's distance measures the extent to which the estimates of the regression coefficients change when an observation is deleted from the analysis.
- Let d(i) denote the Cook's distance for observation i; then
  d(i) = ri² · hi / [(k + 1)(1 - hi)],
  where ri is the studentized residual and hi the leverage value.
23 Distribution of Cook's Distance
- Cook's distance can be used as a test statistic for the hypothesis H0: the ith observation is an outlier, against the hypothesis H1: the ith observation is not an outlier.
- The distribution of the distance is still unknown. Muller and Chen (1997) performed simulations and tabulated some critical values for given sample size, n, and given number of parameters, k.
24 Assessing Linearity, Homoscedasticity, and Independence Assumptions Using Residual Plots
- The theoretical foundation of residual plots is that the residuals are uncorrelated with both the predicted values and the values of each of the predictor variables. That is, corr(e, ŷ) = 0 and corr(e, xj) = 0 for each predictor xj.
- Thus, common residual plots are
  - residuals against predicted values
  - residuals against predictor values
25 Assessing the Normality Assumption
- Goodness-of-fit tests
  - Kolmogorov-Smirnov test
  - Anderson-Darling test
  - Shapiro-Wilk test
- Normal probability plot
- Residuals against normal percentiles
- Normality holds if points are on a line
- Normal quantile-quantile plot
- Residuals against normal quantiles
- Normality holds if points are on a line
26 Some Remedies for Assumption Violations
- Data transformations
  - to stabilize the variance of the dependent variable, y, if the homoscedasticity assumption is violated
  - to normalize the dependent variable, if the normality assumption is violated
  - to linearize the regression model, if the linearity assumption is violated
- Commonly used transformations: log(y), sqrt(y), 1/y
27 Weighted Least Squares Analysis
- Used when the variance homogeneity assumption and/or the independence assumption do not hold.
- Suppose the ith observation on y has variance σi² = σ²/wi, where the wi's are all known. Then the regression coefficients βj, j = 0, 1, ..., k, are determined by minimizing (page 304)
  Σi wi (yi - β0 - β1xi1 - ... - βkxik)².
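The weighted minimization above amounts to solving weighted normal equations. A minimal Python sketch, with hypothetical data and weights chosen arbitrarily:

```python
import numpy as np

# Hypothetical heteroscedastic data: Var(y_i) = sigma^2 / w_i, w_i known
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.3, 7.8, 10.4, 11.7])
w = np.array([4.0, 4.0, 2.0, 2.0, 1.0, 1.0])    # known weights

# Minimize sum_i w_i * (y_i - b0 - b1*x_i)^2 via weighted normal equations
X = np.column_stack([np.ones_like(x), x])
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta)   # [b0, b1]
```

Equivalently, one can multiply each row of X and y by sqrt(w_i) and run ordinary least squares; both routes give the same coefficients.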
28 Collinearity
- Collinearity exists when there are strong linear relationships among the independent variables.
- Symptoms of collinearity
- Effects of collinearity
  - on regression coefficients
  - on predicted values
  - on variance-covariances and standard errors
- Approaches for diagnosing the presence of collinearity
- Remedies for the collinearity problem
29 Using Eigenvalues to Determine the Presence of Collinearity
- The correlation matrix (k by k) of the k independent variables has k eigenvalues. Suppose they are arranged in descending order, λ1 ≥ ... ≥ λk.
- To determine the presence of collinearity, statisticians use three kinds of statistics:
  - Condition indices: sqrt(λ1/λj), j = 1, ..., k. These describe the degree to which the data are ill-conditioned, i.e., the degree to which small changes in data values result in large changes in parameter estimates. Threshold: 30.
  - Condition number: sqrt(λ1/λk)
  - Variance proportions (page 313)
- For condition indices larger than 30, the variance proportions should be examined in order to determine which predictor variables are primarily responsible for the large condition index. Predictors with variance proportions larger than 0.5 can be considered to be involved in the collinearity problem. If the intercept is involved in the collinearity problem, it's recommended that intercept-adjusted collinearity diagnostics be examined.
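The eigenvalue-based diagnostics can be sketched in Python; the predictors below are hypothetical, with one built as a near-exact linear combination of the other two so that the collinearity is visible:

```python
import numpy as np

# Hypothetical predictors: x3 is nearly a linear combination of x1 and x2
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.01, size=100)   # near-collinear
X = np.column_stack([x1, x2, x3])

# Eigenvalues of the correlation matrix, in descending order
lam = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
cond_indices = np.sqrt(lam[0] / lam)    # sqrt(lam1/lamj), j = 1, ..., k
cond_number = cond_indices[-1]          # sqrt(lam1/lamk)
print(cond_indices, cond_number > 30)   # threshold 30 flags the problem
```

This mirrors what PROC REG's COLLIN option reports (there on the cross-products matrix rather than the correlation matrix).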
30Collinearity Diagnostics
- Step 1 produce correlation matrix of predictor
variables and plot one predictor against another. - Step 2 Examine the variance inflation factor
(VIF) values for each predictor. - Step 3 Examine condition indices and variance
proportions.
31 Remedies for Collinearity Problems
- Drop predictors that are correlated with others. Drop the least scientifically interesting predictors.
- Use dummy variables properly.
- Limit interaction terms in a model.
- Use centered data; this can alleviate collinearity problems. Warning: it may render the usual collinearity diagnostics, VIF values and condition indices, ineffective.
- Regress on principal components, or some of them.
- Ridge regression
32 SAS Output Dictionary
- Leverage values: measure how far an observation is from the center point of the independent variables (not the dependent variable). Observations with values larger than 2(k+1)/n are considered to be potentially highly influential, where k is the number of predictors and n is the sample size.
- DFFITS: measures how much an observation has affected its fitted value. Values larger than 2·sqrt((k+1)/n) in absolute value are considered highly influential.
- DFBETAS: measures how much an observation has affected the estimate of a regression coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values larger than 2/sqrt(n) in absolute value are considered highly influential.
- Cook's D: measures the aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. Values larger than 4/n are considered highly influential.
- COVRATIO: measures the impact of each observation on the variances (and standard errors) of the regression coefficients and their covariances. Values outside the interval 1 ± 3(k+1)/n are considered highly influential.
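Two of these quantities, leverage and Cook's D, can be computed from scratch in a few lines of Python, with the slide's cutoffs applied; the data here are hypothetical:

```python
import numpy as np

# Hypothetical regression (k = 2 predictors, n = 12); cutoffs from the slide
rng = np.random.default_rng(2)
n, k = 12, 2
Xp = rng.normal(size=(n, k))
y = 1.0 + Xp @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), Xp])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                        # leverage values
e = y - H @ y
mse = e @ e / (n - k - 1)
r_stud = e / np.sqrt(mse * (1 - h))   # internally studentized residuals

cooks_d = r_stud**2 * h / ((k + 1) * (1 - h))   # Cook's D

high_leverage = h > 2 * (k + 1) / n   # slide's leverage cutoff
high_cooks = cooks_d > 4 / n          # slide's Cook's D cutoff
print(high_leverage.sum(), high_cooks.sum())
```

A useful sanity check is that the leverages always sum to the number of fitted parameters, k + 1.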
33 Example (Problem 19, page 329, with data): Effect of 0.25 ppm sulfur dioxide on airway resistance in freely breathing, heavily exercising, asthmatic subjects.
- Abstract: We sought to determine whether 0.25 ppm sulfur dioxide in filtered air causes bronchoconstriction when inhaled by freely breathing, heavily exercising, asthmatic subjects. Nineteen asthmatic volunteers exercised at 750 kilogram meters/min for 5 min in an exposure chamber that contained filtered air at ambient temperature and humidity or, on another day, filtered air plus 0.25 ppm sulfur dioxide. The order of exposure to sulfur dioxide and to filtered air alone was randomized, and the experiments were double-blinded. Specific airway resistance, measured by constant-volume, whole-body plethysmography, increased from 6.38 ± 2.07 cm H2O × s (mean ± SD) before exercise to 11.32 ± 8.97 after exercise on days when subjects breathed filtered air alone, and from 5.70 ± 1.93 to 13.33 ± 7.54 on days when subjects breathed 0.25 ppm sulfur dioxide in filtered air. The increase in specific airway resistance on days when subjects breathed 0.25 ppm sulfur dioxide was only slightly greater than on days when they breathed filtered air, but the difference was significant. To determine whether 0.25 ppm sulfur dioxide causes greater bronchoconstriction in asthmatic subjects exercising more vigorously, 9 subjects then repeated the experiment exercising at 1,000 instead of 750 kilogram meters/min. Specific airway resistance increased from 6.71 ± 2.25 to 13.59 ± 7.57 on days when subjects breathed filtered air alone and from 5.23 ± 1.23 to 12.54 ± 6.17 on days they breathed 0.25 ppm sulfur dioxide in filtered air. The increase in specific airway resistance on the 2 days was not significantly different.
34 SAS code for reading the data:

data Ch14q19;
  input AGE Sex $ Height Weight FEV @@;  /* Sex is character, so it needs $ */
  datalines;
24 M 175 78.0 4.7   36 M 172 67.6 4.3   28 F 171 98.0 3.5
25 M 166 65.5 4.0   26 F 166 65.0 3.2   22 M 176 65.5 4.7
27 M 185 85.5 4.3   27 M 171 76.3 4.7   36 M 185 79.0 5.2
24 M 182 88.2 4.2   26 M 180 70.5 3.5   29 M 163 75.0 3.2
33 F 180 68.0 2.6   31 M 180 65.0 2.0   30 M 180 70.4 4.0
22 M 168 63.0 3.9   27 M 168 91.2 3.0   46 M 178 67.0 4.5
36 M 173 62.0 2.4
;
run;
proc print data=Ch14q19; run;

For those not familiar with SAS: if you have the data saved somewhere, you may use the infile statement instead, as in the following:

data one;
  infile 'yourfiledirectory\Ch14q19.txt' firstobs=2;
  input AGE Sex $ Height Weight FEV;
run;
proc print; run;
35 Diagnostics code (PROC REG must run first, since it creates the RESULTS data set):

proc reg data=Ch14q19;
  model FEV = AGE Height Weight / vif collin influence;
  output out=results p=predicted h=leverage
         student=standardized rstudent=jackknife cookd=Cookdistance;
  title 'Proc reg output: Forced Expiratory Volume (Y) regressed on AGE, Weight, and Height';
run;

proc gplot data=results;
  plot standardized*predicted;
  plot jackknife*predicted;
  title 'Plot of Jackknife residuals against predicted values';
run;

proc univariate data=results normal plot;
  var jackknife;
  histogram jackknife / normal;
  probplot jackknife / normal;
  qqplot jackknife / normal;
run;

data outliers;
  set results;
  if (abs(jackknife) gt tinv(0.95, 19-3-2)) OR (Cookdistance gt 1)
     OR (leverage gt 2*(3+1)/19);
run;
proc print data=outliers; run;
quit;
36 Collinearity Diagnostics

                                     ------- Proportion of Variation -------
Number  Eigenvalue  Condition Index  Intercept    AGE      Height      Weight
1       3.95433      1.00000         0.00008665   0.00212  0.00007991  0.00109
2       0.03589     10.49649         0.00075655   0.62719  0.00045217  0.14783
3       0.00911     20.83984         0.03745      0.34070  0.02926     0.84941
4       0.00067443  76.57184         0.96170      0.02998  0.97021     0.00167
37Regression and Analysis of Variance II
- Chapter 15 Polynomial Regression
38Homework 3
39 Preview
- Polynomial regression is a special case of multiple regression, in which the regression function is a polynomial of a single predictor variable.
- The general form of a polynomial regression model is
  Y = β0 + β1X + β2X² + ... + βkX^k + E.
40Chapter Example (Ch15q01, page 370)
SOLN_NUM X Y LN_Y
1 6 0.029 -3.54
1 6 0.032 -3.442
1 6 0.027 -3.612
1 8 0.079 -2.538
1 8 0.072 -2.631
1 8 0.088 -2.43
1 10 0.181 -1.709
1 10 0.165 -1.802
1 10 0.201 -1.604
1 12 0.425 -0.856
1 12 0.384 -0.957
1 12 0.472 -0.751
1 14 1.13 0.122
1 14 1.02 0.02
1 14 1.249 0.222
1 16 2.812 1.034
1 16 2.465 0.902
1 16 3.099 1.131
41 Fitting a Parabola Using the Least Squares Method
- A parabola is the graph of a quadratic equation (order-2 polynomial).
- Let's fit a quadratic model to our previous data. The polynomial model is
  Y = β0 + β1X + β2X² + E.
- We wish to estimate these β's. As for a general multiple regression model, we minimize the sum of squared errors
  Σi (Yi - β0 - β1Xi - β2Xi²)².
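The quadratic fit can be sketched in Python on a subset of the Ch15q01 data shown earlier (two replicates per X value are used here just to keep the block short):

```python
import numpy as np

# A few (X, Y) pairs from the Ch15q01 data above
x = np.array([6, 6, 8, 8, 10, 10, 12, 12, 14, 14, 16, 16], float)
y = np.array([0.029, 0.032, 0.079, 0.072, 0.181, 0.165,
              0.425, 0.384, 1.13, 1.02, 2.812, 2.465])

# Least-squares fit of the quadratic model Y = b0 + b1*X + b2*X^2 + E
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
sse = ((y - pred) ** 2).sum()
ssy = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / ssy
print(beta, r2)
```

Because the response grows sharply with X, the estimated curvature term b2 comes out positive and the quadratic captures most of the variation.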
42 ANOVA Table for Second-order Polynomial Regression
- An ANOVA (analysis of variance) table can be constructed for any regression model.
- For a second-order polynomial regression model, the ANOVA table constructed from the Ch15q01 data looks like the following:
43 Table 15.2, Page 352

data table5_1;  /* one outlier removed */
  input SBP Age @@;
  Age2 = Age*Age;
  cards;
144 39  138 45  145 47  162 65  142 46  170 67  124 42  158 67
154 56  162 64  150 56  140 59  110 34  128 42  130 48  135 45
114 17  116 20  124 19  136 36  142 50  120 39  120 21  160 44
158 53  144 63  130 29  125 25  175 69
;
run;
proc reg data=table5_1;
  model SBP = Age Age2 / ss1;
  plot SBP*Age;
run;
quit;
44 SAS code for the Ch15q01 quadratic fit:

data Ch15q01;
  input X Y LN_Y @@;
  X2 = X*X;  /* create second-order term */
  cards;
6 0.029 -3.54     6 0.032 -3.442    6 0.027 -3.612
8 0.079 -2.538    8 0.072 -2.631    8 0.088 -2.43
10 0.181 -1.709   10 0.165 -1.802   10 0.201 -1.604
12 0.425 -0.856   12 0.384 -0.957   12 0.472 -0.751
14 1.13 0.122     14 1.02 0.02      14 1.249 0.222
16 2.812 1.034    16 2.465 0.902    16 3.099 1.131
;
run;
proc reg;
  model Y = X X2 / ss1;  /* ss1 allows construction of the ANOVA table on page 352 */
run;
quit;
45 Inferences Associated with Second-order Polynomial Regression
- Is the overall 2nd-order polynomial model significant? Compare this model with the intercept-only model. (Overall F test and R²)
- Does the 2nd-order model provide significantly more predictive power than the straight-line model? (Test H0: β2 = 0, using a t test or partial F test.)
- Given that a 2nd-order model is more appropriate than a straight-line model, should we add higher-order terms to the 2nd-order model? (Lack-of-fit test)
46 Fitting Higher-order Models
- If a quadratic model is not adequate, a higher-order polynomial model may be needed.
- A kth-order polynomial has at most k - 1 bends, or relative extrema.
- Generally, the maximum-order polynomial that may be fit is of order one less than the number of distinct X-values.
47 Lack-of-fit Tests
- Given that a lower-order polynomial has been fitted and the highest-order term tests significant, should we be confident that a higher-order model is not needed?
- We still need to conduct a lack-of-fit test by comparing the current model with the highest-order possible polynomial model.
- For a lack-of-fit test to be possible, at least one X-value should involve replicates.
48 Lack-of-fit Test of a Second-order Model: Example

Obs  SBP  Age     Obs  SBP  Age     Obs  SBP  Age
 1   114  17      11   124  42      21   154  56
 2   124  19      12   128  42      22   150  56
 3   116  20      13   160  44      23   140  59
 4   120  21      14   138  45      24   144  63
 5   125  25      15   135  45      25   162  64
 6   130  29      16   142  46      26   162  65
 7   110  34      17   145  47      27   170  67
 8   136  36      18   130  48      28   158  67
 9   144  39      19   142  50      29   175  69
10   120  39      20   158  53
- 5 X-values (39, 42, 45, 56, 67) have repeats and 19 X-values have no repeats, so the total number of distinct X-values is 5 + 19 = 24.
- The highest-order polynomial that can be fit is a polynomial of order 24 - 1 = 23.
- Directly fitting such a 23rd-order polynomial is difficult! Once it's fitted, a multiple partial F test may be used to test H0: no lack of fit of the 2nd-order model.
- The lack-of-fit statistic is
  F = [(SSE(2nd-order) - SSPE) / (c - 3)] / [SSPE / (n - c)],
  where SSPE is the pure-error sum of squares and c = 24 is the number of distinct X-values.
49 Orthogonal Polynomials
- Collinearity problems can arise in work with higher-order polynomial models.
- Such collinearity problems can be overcome using orthogonal polynomials.
- When fitting a kth-order (natural) polynomial, we need to create k orthogonal polynomials, Xi = Pi(x), i = 1, 2, ..., k, where Pi(x) = a0i + a1i x + ... + aii x^i, such that each pair of distinct polynomials is orthogonal over the observed x-values.
50 Some Results
- When the two models (using a natural polynomial or an orthogonal polynomial) are fitted, R² and the partial or overall F tests stay the same.
- There is no collinearity problem when fitting an orthogonal polynomial.
- The backward selection procedure is made easier with an orthogonal model.
51Orthogonal Polynomial Coefficients
- If X-values are equally spaced and they repeat
equally frequently, then Table A.7 on Page 836 of
the textbook can be used to determine the
corresponding Xi-values. - Refer to an example on page 362-363.
52 Lack-of-fit Test Using Orthogonal Polynomials
- Fit only the highest-order orthogonal polynomial model. All lower-order models will have the same slope estimates!
- Conduct a multiple F test for H0: no lack of fit of the primary model.
53 Strategies for Choosing a Polynomial Model
- First, choose the full model to be of third order or lower.
- Second, proceed backward in a stepwise fashion, starting with the largest power term, and sequentially delete non-significant terms, stopping at the first power term that is significant; this term and all lower-order terms should be retained in the final model.
- Third, conduct a multiple partial F test for lack of fit.
- Lastly, perform residual analysis using jackknife residuals.
54Sample Questions for This Chapter
- x y
- 6 0.029
- 6 0.032
- 6 0.027
- 8 0.079
- 8 0.072
- 8 0.088
- 10 0.181
- 10 0.165
- 10 0.201
- 12 0.425
- 12 0.384
- 12 0.472
- 14 1.13
- 14 1.02
- 14 1.249
- 16 2.812
- 16 2.465
- 16 3.099
- We wish to regress y on x using a polynomial. What is the highest-order polynomial possible?
- If we fit a quadratic polynomial, how many degrees of freedom does the residual have? How many degrees of freedom does the lack of fit have? How many degrees of freedom does the pure error have?
55Regression and Analysis of Variance II
- Chapter 16
- Selecting the Best Regression Equation
56Homework 4
57 Problem Description
- Given one response variable Y and a set of predictors X1, X2, ..., Xk, we want to determine the best subset of the k predictor variables and the corresponding best-fitting regression equation.
58Steps in Selecting the Best Regression Model
- Start from the maximum model.
- Specify a criterion for model selection.
- Specify a strategy for variable selection.
- Conduct the specified analysis
- Evaluate the reliability of the selected model.
59 Step 1: Specifying the Maximum Model
- The maximum model should not be too large. Large models may overfit the data; that is, include variables with truly zero regression coefficients (called a type I error). Overfitting introduces no bias but leads to large variance.
- On the other hand, the maximum model should not be too small. Small models may underfit the data; that is, exclude variables with truly non-zero regression coefficients (called a type II error). Underfitting introduces bias but less variance.
- Conclusion: there is a bias-variance trade-off when fitting regression models.
60 Step 2: Specifying a Criterion for Model Selection
- A selection criterion is an index that can be computed for each candidate model and used to compare models.
- According to the criterion, candidate models can be ordered from best to worst.
- Many criteria are possible, for example, Rp², Fp, MSE(p), and Cp.
- All these criteria compare the maximum model (full model) with k predictors and a restricted model (reduced model) with p predictors (p ≤ k).
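One of these criteria, Mallows' Cp, can be sketched in Python using the usual formula Cp = SSE(p)/MSE(full) - [n - 2(p + 1)]; the data below are hypothetical, built so that only one predictor truly matters:

```python
import numpy as np

def mallows_cp(X_full, X_sub, y):
    """Cp = SSE(p)/MSE(full) - (n - 2*(p+1)), p = predictors in the subset."""
    n = len(y)
    def sse(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return ((y - X @ b) ** 2).sum()
    k = X_full.shape[1] - 1                 # predictors in the full model
    p = X_sub.shape[1] - 1                  # predictors in the subset model
    mse_full = sse(X_full) / (n - k - 1)
    return sse(X_sub) / mse_full - (n - 2 * (p + 1))

# Hypothetical data where only x1 matters
rng = np.random.default_rng(3)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2])
cp_good = mallows_cp(X_full, np.column_stack([ones, x1]), y)  # adequate subset
cp_bad = mallows_cp(X_full, np.column_stack([ones, x2]), y)   # drops the real predictor
print(cp_good, cp_bad)
```

An adequate subset gives Cp close to p + 1, while omitting an important predictor inflates Cp dramatically.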
61Comparing the Full Model and the Reduced Model
Using Rp2
62 There are several approaches to explaining R² in OLS. These different approaches lead to various calculations of pseudo-R²s in regressions with categorical outcome variables (Chapter 22).

R-squared as explained variability: The denominator of the ratio can be thought of as the total variability in the dependent variable, or how much y varies from its mean. The numerator of the ratio can be thought of as the variability in the dependent variable that is not predicted by the model. Thus, this ratio is the proportion of the total variability unexplained by the model. Subtracting this ratio from one gives the proportion of the total variability explained by the model. The more variability explained, the better the model.

R-squared as improvement from null model to fitted model: The denominator of the ratio can be thought of as the sum of squared errors from the null model, a model predicting the dependent variable without any independent variables. In the null model, each y value is predicted to be the mean of the y values. Consider being asked to predict a y value without having any additional information about what you are predicting. The mean of the y values would be your best guess if your aim is to minimize the squared difference between your prediction and the actual y value. The numerator of the ratio would then be the sum of squared errors of the fitted model. The ratio is indicative of the degree to which the model parameters improve upon the prediction of the null model. The smaller this ratio, the greater the improvement and the higher the R-squared.

R-squared as the square of the correlation: The term "R-squared" is derived from this definition. R-squared is the square of the correlation between the model's predicted values and the actual values. This correlation can range from -1 to 1, and so the square of the correlation ranges from 0 to 1. The greater the magnitude of the correlation between the predicted values and the actual values, the greater the R-squared, regardless of whether the correlation is positive or negative.

Source: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm
63Comparing the Full Model and the Reduced Model
Using Fp
64Comparing the Full Model and the Reduced Model
Using MSE(p)
65Comparing the Full Model and the Reduced Model
Using Mallows Cp
66The Criteria are Intimately Related
67 Step 3: Specifying a Strategy for Selecting Variables
- SAS provides nine methods of model selection implemented in PROC REG. These methods are specified with the SELECTION= option in the MODEL statement.
68 Backward Elimination (BACKWARD)
- The backward elimination technique begins by calculating F statistics for a model including all of the independent variables. Then the variables are deleted from the model one by one until all the variables remaining in the model produce F statistics significant at the SLSTAY level specified in the MODEL statement (or at the 0.10 level if the SLSTAY option is omitted). At each step, the variable showing the smallest contribution to the model is deleted.
69 Steps for Backward Elimination
- Step 1: Fit the maximum model.
- Step 2: Produce type II SS or t test results.
- Step 3: Focus on the term that has the lowest F or highest p-value.
  - If the term is non-significant at the specified level given by the SLSTAY (SLS) option, drop the term, refit a regression model for the remaining variables, and repeat the backward selection procedure.
  - If the term is significant, the backward selection procedure ends, and the selected model consists of the variables remaining in the model.
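The steps above can be sketched as a Python function (a simplified stand-in for PROC REG's BACKWARD method, using partial F tests on hypothetical data in which one predictor has no real effect):

```python
import numpy as np
from scipy.stats import f as f_dist

def backward_eliminate(X, names, y, slstay=0.10):
    """Drop the weakest predictor (largest partial-F p-value) until all are significant."""
    X, names = X.copy(), list(names)
    while len(names) > 0:
        n, cols = X.shape
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse_full = ((y - X @ b) ** 2).sum()
        mse = sse_full / (n - cols)
        pvals = []
        for j in range(1, cols):            # partial F for each non-intercept term
            Xr = np.delete(X, j, axis=1)
            br, *_ = np.linalg.lstsq(Xr, y, rcond=None)
            sse_red = ((y - Xr @ br) ** 2).sum()
            F = (sse_red - sse_full) / mse
            pvals.append(f_dist.sf(F, 1, n - cols))
        worst = int(np.argmax(pvals))
        if pvals[worst] <= slstay:
            break                           # everything remaining is significant
        X = np.delete(X, worst + 1, axis=1)
        names.pop(worst)
    return names

# Hypothetical data: x2 has no real effect
rng = np.random.default_rng(4)
n = 80
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 1.5 * x3 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])
selected = backward_eliminate(X, ["x1", "x2", "x3"], y)
print(selected)
```

The genuinely predictive terms survive the procedure; a null predictor is usually (though, at SLSTAY = 0.10, not always) eliminated.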
70 Other Variable Selection Procedures
- Forward selection (SELECTION=FORWARD, with the SLENTRY option)
- Stepwise selection (modified forward selection)
- Chunkwise selection
- Read pages 393-398.
71Step 4 Conducting the Analysis Using the
Selected Model
72 Step 5: Evaluating Reliability with Split Samples
- To assess the reliability of a chosen model, the most compelling way is to conduct a new study and test the fit of the chosen model to the new data.
- This approach is expensive.
- A split-sample analysis attempts to find the best model and assess its reliability. This analysis randomly splits the sample into a training group and a holdout (or validation) group. Both groups should be representative of the parent population.
- Use the training data to do model selection. Find the R² for the selected model, denoted R²(1).
- Find the predicted values for the validation data and calculate the square of the correlation between the predicted values and the y values of the validation data, denoted R²(2).
- If the shrinkage on cross-validation, R²(1) - R²(2), is less than 0.10, the selected model is considered reliable; if the shrinkage is 0.10 or more, the selected model is not considered reliable.
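The split-sample shrinkage calculation can be sketched in Python on hypothetical data (a correctly specified model, so the shrinkage should be small):

```python
import numpy as np

def r_squared(y, pred):
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Hypothetical data split randomly into training and holdout halves
rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

idx = rng.permutation(n)
train, hold = idx[:100], idx[100:]

# Fit on the training group; R2(1)
b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
r2_train = r_squared(y[train], X[train] @ b)

# R2(2): squared correlation between holdout y and its predicted values
pred_hold = X[hold] @ b
r2_hold = np.corrcoef(y[hold], pred_hold)[0, 1] ** 2

shrinkage = r2_train - r2_hold
print(r2_train, r2_hold, shrinkage)
```

An overfitted model selected by an aggressive search would typically show a much larger drop from R²(1) to R²(2).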
73 Regression and Analysis of Variance II
- Chapter 17 One-way Analysis of Variance
74 Homework 5
75 Preview
- Analysis of variance (ANOVA) is a technique for assessing how one or several nominal independent variables (called factors) affect a continuous dependent variable.
- ANOVA in which only one nominal independent variable is involved is called one-way ANOVA.
- ANOVA is usually employed in comparisons involving several population means.
- ANOVA is an extension of the independent two-sample t test.
- An ANOVA problem can be handled under the regression framework.
76 Factors and Levels
- A nominal independent variable with k categories is called a factor with k levels.
- For example, to compare three different diets, A, B, and C, 60 people are available and are randomly assigned to the three diets so that each treatment group contains 20 people. Here diet is the only factor, and it has 3 levels.
- How to assign? Put slips with 20 A's, 20 B's, and 20 C's in a bowl. Each person picks one.
77 Example: The following example studies the effect of bacteria on the nitrogen content of red clover plants. The (treatment) factor is bacteria strain, and it has six levels. Red clover plants are inoculated with the treatments, and nitrogen content is later measured in milligrams.

title1 'Nitrogen Content of Red Clover Plants';
data Clover;
  input Strain $ Nitrogen @@;
  datalines;
3DOK1  19.4  3DOK1  32.6  3DOK1  27.0  3DOK1  32.1  3DOK1  33.0
3DOK5  17.7  3DOK5  24.8  3DOK5  27.9  3DOK5  25.2  3DOK5  24.3
3DOK4  17.0  3DOK4  19.4  3DOK4   9.1  3DOK4  11.9  3DOK4  15.8
3DOK7  20.7  3DOK7  21.0  3DOK7  20.5  3DOK7  18.8  3DOK7  18.6
3DOK13 14.3  3DOK13 14.4  3DOK13 11.8  3DOK13 11.6  3DOK13 14.2
COMPOS 17.3  COMPOS 19.4  COMPOS 19.1  COMPOS 16.9  COMPOS 20.8
;
proc print; run;
proc glm data=Clover;
  class Strain;
  model Nitrogen = Strain;
run;
78 Another ANOVA Example
- (Ch17q01) Five treatments for fever blisters, including a placebo, were randomly assigned to 30 patients. For each of the five treatments, the number of days from initial appearance of the blisters until healing is given as follows:

Data Ch17q01;
  treatment = _n_;
  do i = 1 to 6;
    input time @@;
    output;
  end;
  drop i;
  cards;
5 8 7 7 10 8
4 6 6 3 5 6
6 4 4 5 4 3
7 4 6 6 3 5
9 3 5 7 7 6
;
Run;

- The only factor involved here has 5 levels (4 active treatments plus 1 placebo).
- Questions of interest:
  - Do the effects of the five treatments differ significantly with regard to healing fever blisters?
  - Are the 4 treatments on average more effective than the placebo in healing fever blisters?
  - Do the effects of the 4 treatments differ significantly with regard to healing fever blisters? If yes, which one is most effective?
79 Fixed Versus Random Factors
- A fixed factor is a factor whose levels are the only ones of interest.
- A random factor is a factor whose levels may be regarded as a sample from some large population of levels.
- The distinction is important in statistical analysis, since it affects the variances of estimators, and thus the results of significance tests.
- Examples of fixed factors: sex, age, marital status, education.
- Examples of random factors: subjects, litters, days.
- Fixed or random? If a factor has many potential levels, treat it as random.
80 Fixed-Effects One-Way ANOVA Model
- A fixed-effects one-way ANOVA model deals with the effect of a single fixed factor on a continuous response variable.
- A comparison of population means can be made through this model.
- Four assumptions must be made for the model:
  - A simple random sample has been selected from each of the k populations.
  - The dependent variable is normally distributed in each population. (normality)
  - The variance of the dependent variable is the same in each population (denoted σ²). (constant variance)
  - The k samples are independent. (independence)
- The fixed-effects one-way ANOVA model is
  Yij = μ + αi + Eij,  i = 1, 2, ..., k,
- where the effects of the ith treatment, αi, are subject to the following constraint:
  α1 + α2 + ... + αk = 0.
81 The Problem of Interest: Are Population Means All Equal?
- H0: μ1 = μ2 = ... = μk against
- HA: the k means are not all equal.
- If H0 is rejected, a follow-up study can be conducted to find which mean is different from the others.
82Fixed-Effects One-Way ANOVA Model Data
Configuration
83 ANOVA Table for the Fixed-Effects One-way ANOVA Model

Source    df      SS    MS    F Value
Between   k - 1   SST   MST   MST/MSE
Within    n - k   SSE   MSE
Total     n - 1   SSY

- where MST = SST/(k - 1) and MSE = SSE/(n - k).
- If the p-value is less than α, then reject the null hypothesis that all means are the same.
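The table's decomposition can be computed directly in Python for the Ch17q01 fever-blister data (a sketch of what PROC GLM/PROC ANOVA would report):

```python
import numpy as np
from scipy.stats import f

# The Ch17q01 fever-blister data: 5 treatments, 6 patients each
groups = [
    [5, 8, 7, 7, 10, 8],
    [4, 6, 6, 3, 5, 6],
    [6, 4, 4, 5, 4, 3],
    [7, 4, 6, 6, 3, 5],
    [9, 3, 5, 7, 7, 6],
]
y_all = np.concatenate([np.asarray(g, float) for g in groups])
n, k = y_all.size, len(groups)
grand = y_all.mean()

sst = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)          # between
sse = sum(((np.asarray(g, float) - np.mean(g)) ** 2).sum() for g in groups)  # within

mst, mse = sst / (k - 1), sse / (n - k)
F = mst / mse
p_value = f.sf(F, k - 1, n - k)
print(round(F, 2), round(p_value, 4))
```

The resulting p-value agrees with the 0.0136 shown for this example in the slide that follows.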
84Some Useful Results
- The following can be shown
-
85 Example (Ch17q01)

Source    df    SS    MS    F Value    Pr > F
Between    4                           0.0136
Within    25
Total     29
86 Regression Approach for the Fixed-Effects One-way ANOVA Model
- The fixed-effects one-way ANOVA model can be studied under the regression framework.
- Suppose that the factor has k levels. The idea is to create k - 1 dummy variables, X1, X2, ..., Xk-1. Then the ANOVA model is equivalent to the following regression model:
  Y = μ + α1X1 + α2X2 + ... + αk-1Xk-1 + E.
- There are two ways to code the dummy variables.
87Effect Coding Method
- This method codes the dummy variable as follows
88Reference Cell Coding Method
- Another way to define the k - 1 dummy variables
is given as follows - The kth population becomes the reference cell.
- With this coding method, the regression
coefficients are
- μ = μk, βi = μi - μ, i = 1, 2, …, k - 1.
- The regression model that results from this
coding method is equivalent to the fixed-effects
ANOVA model subject to the constraint
αk = 0.
89Which Coding Scheme to Use?
- The two coding methods produce coefficients (μ
and β1, β2, …, βk-1) that have different
interpretations. - It does not matter which coding method is used,
since both produce the same ANOVA table. - Also, the differences between any two effects
stay the same.
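The equivalence of the two coding schemes can be illustrated numerically. In this hedged sketch (made-up data; the `design` helper is illustrative, not from the text), the coefficient vectors differ between schemes, but both codings give identical fitted values and error sum of squares, hence the same ANOVA table:

```python
import numpy as np

# Hypothetical balanced one-way layout: k = 3 groups, 4 observations each.
y = np.array([5.0, 6.0, 5.5, 6.5,    # group 1
              8.0, 7.5, 8.5, 9.0,    # group 2
              4.0, 3.5, 5.0, 4.5])   # group 3
group = np.repeat([1, 2, 3], 4)

def design(coding):
    """Build the design matrix [1, X1, X2] for k = 3 under either scheme."""
    cols = [np.ones_like(y)]
    for i in (1, 2):                       # k - 1 = 2 dummy variables
        x = (group == i).astype(float)
        if coding == "effect":
            x[group == 3] = -1.0           # last group coded -1 on every dummy
        cols.append(x)
    return np.column_stack(cols)

sse = {}
for coding in ("effect", "reference"):
    X = design(coding)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse[coding] = ((y - X @ beta) ** 2).sum()
    print(coding, np.round(beta, 3))       # coefficients differ by scheme

print(round(sse["effect"], 4), round(sse["reference"], 4))  # SSE identical
```

Both design matrices span the same column space, which is why every ANOVA-table quantity agrees.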
90Multiple-comparison Procedures for Fixed-effects
One-way ANOVA
- When the test for equal means is significant,
our next step customarily is to determine which
means are different. - Examples of follow-up tests are
- H0: μ1 = μ3
- H0: (μ1 + μ2)/2 = μ3
- These comparisons may be of interest to us before
the data are collected, or they may arise in
completely exploratory studies only after the
data have been examined.
91Comparison-wise Or Experiment-wise Error Rate
- Suppose an experimenter wishes to perform m
tests, each having type I error rate α. That
is, each test is incorrectly rejected with
probability α. Let Tj be the event that the jth
test is incorrectly rejected, j = 1, 2, …, m.
Then P(Tj) = α. Since P(T1 ∪ T2 ∪ … ∪ Tm) ≤
Σj P(Tj), we have
- P(T1 ∪ T2 ∪ … ∪ Tm) ≤ mα.
- That is, the probability that at least one
test is incorrectly rejected is at most mα. - If one chooses to control the experiment-wise
error rate at α, the individual type I error rate
has to be controlled at α/m.
92Multiple Testing The Bonferroni Approach
- When performing m tests simultaneously, we fix
the overall significance level at α. That is, α
is the probability of incorrectly rejecting at
least one of the m tests. To achieve this overall
level α, we perform each individual test at level
α/m. - This is the Bonferroni approach.
- One disadvantage of this approach is that the
true overall significance level may be
considerably lower than α, and, in extreme
situations, it may be so low that none of the
tests will be rejected, thus producing too many
false negatives (low power).
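A minimal sketch of the rule (the p-values are hypothetical, not course data): with m = 4 tests and overall level α = 0.05, each individual test is compared against α/m = 0.0125.

```python
# Bonferroni: to keep the familywise error rate at alpha over m tests,
# carry out each individual test at level alpha / m.
alpha = 0.05
p_values = [0.004, 0.020, 0.030, 0.450]   # hypothetical per-test p-values
m = len(p_values)

rejections = [p < alpha / m for p in p_values]
print(alpha / m, rejections)
```

Only the smallest p-value survives the α/m cut here, illustrating the conservatism (reduced power) noted above.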
93Comparison-wise Or Experiment-wise Confidence
Level
- Suppose an experimenter wishes to calculate m
confidence intervals, each having a (1 - α)
confidence level. Then each interval will be
individually correct with probability (1 - α).
Let Sj be the event that the jth confidence
interval will be correct and NSj the event that
it will be incorrect, j = 1, 2, …, m. Then
P(Sj) = 1 - α and P(NSj) = α. Since
P(S1 ∩ S2 ∩ … ∩ Sm) ≥ 1 - Σj P(NSj), we have
- P(S1 ∩ S2 ∩ … ∩ Sm) ≥ 1 - mα.
- That is, the probability that the m intervals
will simultaneously be correct is at least
1 - mα. - If one chooses to have the experiment-wise
confidence level at 1 - α, the individual
confidence level has to be at 1 - α/m.
94Simultaneous Confidence Intervals The
Bonferroni Approach
- When constructing m confidence intervals
simultaneously for m contrasts, we fix the
overall confidence level at 1 - α. That is, 1 - α
is the probability that these confidence
intervals simultaneously contain the true values
of all the contrasts considered. To achieve this
overall confidence level 1 - α, we construct
confidence intervals for these m contrasts, each
at level 1 - α/m. - This is the Bonferroni approach (known as the
Bonferroni correction).
95Contrasts
- A contrast is defined as any linear function of
the population means, say, L = c1μ1 + c2μ2 + … + ckμk,
whose coefficients sum to zero: c1 + c2 + … + ck = 0.
96Simultaneous t Confidence Intervals
- For a single contrast, say, L = c1μ1 + c2μ2 + … +
ckμk, one may construct a t confidence interval
as
- L̂ ± t(n - k, 1 - α/2) · sqrt(MSE · Σ ci²/ni)
- When constructing confidence intervals for m
contrasts L1, L2, …, and Lm, each interval instead
uses the critical value t(n - k, 1 - α/(2m)).
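As a hedged sketch (made-up data; the contrast and m are illustrative), a Bonferroni-adjusted t interval for one of m = 3 contrasts replaces the usual 1 - α/2 quantile with 1 - α/(2m):

```python
import numpy as np
from scipy import stats

# Hypothetical one-way data; interval for L = mu1 - mu3, one of m = 3
# contrasts examined simultaneously at overall level alpha = 0.05.
groups = [np.array([26.0, 28, 23, 26, 27]),
          np.array([21.0, 20, 25, 19, 24]),
          np.array([18.0, 20, 17, 25, 21])]
c = np.array([1.0, 0.0, -1.0])            # contrast coefficients, sum to 0
m, alpha = 3, 0.05

k = len(groups)
n = sum(len(g) for g in groups)
means = np.array([g.mean() for g in groups])
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)

L_hat = c @ means
se = np.sqrt(mse * sum(ci ** 2 / len(g) for ci, g in zip(c, groups)))
t_crit = stats.t.ppf(1 - alpha / (2 * m), n - k)   # 1 - alpha/(2m) quantile
lower, upper = L_hat - t_crit * se, L_hat + t_crit * se
print(round(lower, 3), round(upper, 3))
```

The m intervals built this way are simultaneously correct with probability at least 1 - α.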
97Example 17.5, Page 442
98Example 17.5, Page 442 (contd)
- We construct simultaneous confidence intervals
for the following population mean differences
(special contrasts) -
Interpretation Each of these confidence
intervals has a 1 - 0.05/6 ≈ 0.9917 confidence
level, and they are simultaneously correct with
probability at least 1 - 0.05 = 0.95, which is called
the overall confidence level.
99Grouping by Simultaneous Confidence Intervals
100Simultaneous Infinite Scheffe Confidence
Intervals
101Simultaneous Tukey-Kramer Confidence Intervals
- While Scheffé's approach is general, Tukey's
method is only for constructing pairwise
confidence intervals. See Page 445. - Tukey's method results in shorter pairwise confidence
intervals than Scheffé's method.
102Which Approach? (Page 453)
- Scheffé's intervals tend to be long when k, the
number of population means involved, is large.
Their length does not depend on m, the number of
contrasts. - The simultaneous t intervals do not depend on k,
but become longer as m increases. - For pairwise comparisons, use the Tukey-Kramer
method. - For planned (a priori) multiple comparisons, use
Bonferroni's t method. For unplanned (a
posteriori) multiple comparisons, use Scheffé's
method. - The Bonferroni and Scheffé methods are robust, but
the Bonferroni method is insensitive (low power) and
Scheffé's method is the least powerful.
103SAS Codes for Multiple Comparison in Means
data table17_7;
  do substance = 1 to 4;
    do i = 1 to 10;
      input dosage @@; output;
    end;
  end;
  drop i;
cards;
29 28 23 26 26 19 25 29 26 28
17 25 24 19 28 21 20 25 19 24
17 16 21 22 23 18 20 17 25 21
18 20 25 24 16 20 20 17 19 17
;
run;

proc glm;
  class substance;
  model dosage = substance;
  means substance / alpha=0.05 Bon Scheffe Tukey;
run; quit;
104Orthogonal Contrasts
105Orthogonal Sample Contrasts and Partitioning the
Treatment Sum of Squares
- The treatment sum of squares, with k - 1 degrees of
freedom, can be partitioned into t statistically
independent sums of squares with 1 degree of
freedom, one for each of t orthogonal sample
contrasts, where t ≤ k - 1. - The maximum number of orthogonal sample contrasts
is k - 1.
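With equal sample sizes n per group, the sum of squares for a sample contrast L̂ = Σ ci ȳi is n·L̂²/Σ ci². The sketch below (hypothetical data and two illustrative orthogonal contrasts, not course data) shows the partition numerically:

```python
import numpy as np

# Hypothetical balanced data: k = 3 groups, n = 4 observations each.
n = 4
data = [np.array([5.0, 6.0, 5.5, 6.5]),
        np.array([8.0, 7.5, 8.5, 9.0]),
        np.array([4.0, 3.5, 5.0, 4.5])]
means = np.array([d.mean() for d in data])
grand = np.concatenate(data).mean()
sst = n * ((means - grand) ** 2).sum()    # treatment SS, k - 1 = 2 df

# Two orthogonal contrasts (c1 . c2 = 0 with equal n), 1 df each.
c1 = np.array([1.0, -1.0, 0.0])           # group 1 vs group 2
c2 = np.array([0.5, 0.5, -1.0])           # average of groups 1, 2 vs group 3
assert c1 @ c2 == 0

ss = [n * (c @ means) ** 2 / (c @ c) for c in (c1, c2)]
print(round(sst, 4), [round(s, 4) for s in ss])  # SS1 + SS2 equals SST
```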
106Partitioning the Treatment Sum of Squares (contd)
107Testing a Contrast
108ANOVA Table With Orthogonal Contrasts
Source          df      SS        MS        F Value
contrast1       1       SS1       MS1       MS1/MSE
contrast2       1       SS2       MS2       MS2/MSE
...
contrast(k-1)   1       SS(k-1)   MS(k-1)   MS(k-1)/MSE
Error           n - k   SSE       MSE
Total           n - 1   SST
109Random-Effects One-way ANOVA Model
- If the factor is a random factor, then its levels
should be treated as a random sample from a
N(a, σ²A) population. - For the model parameters to be identifiable, set
a = 0. - The model is
- Yij = μ + Ai + Eij,
- where the Ai's are iid N(0, σ²A) and, independent
of the Ai's, the Eij's are iid N(0, σ²), i = 1, 2, …, k
and j = 1, 2, …, ni. - Under this random-effects model, the null
hypothesis of no factor effect should be stated
as H0: σ²A = 0. - This model is also known as the variance
components model.
110Some Useful Results
- The following can be shown
-
111For a balanced design, n0 equals the common
sample size.
112Example A company supplies a customer with a
large number of batches of raw materials. The
customer makes three sample determinations from
each of 5 randomly selected batches to control
the quality of the incoming material. The model
is yij = μ + αi + εij, where the k = 5 levels
(e.g., the batches) are chosen at random from a
population with variance σ²A. The data are shown
below.

batch   1    2    3    4    5
        74   68   75   72   79
        76   71   77   74   81
        75   72   77   73   79

A 1-way ANOVA can be performed to
generate the following results
113 ANOVA
Source              DF    SS       MS       EMS
Between (batches)   4     147.74   36.935   σ² + 3σ²A
Within              10    17.99    1.799    σ²
Total               14    165.73

Verify the results. To test that there is no
difference between batches, that is, σ²A = 0,
calculate the F statistic from the ANOVA table,
which is F = 36.94/1.799 ≈ 20.5. If we had
chosen an α value of .01, then the critical F
value for 4 numerator and 10 denominator degrees
of freedom is 5.99, which indicates rejection. A
follow-up study then proceeds. Since these
batches were chosen via a random selection
process, it may be of interest to find out how
much of the variance in the experiment might be
attributed to batch differences (measured by σ²A)
and how much to random error (measured by σ²). In
order to answer these questions, we equate the MS
column and the EMS column (this is called the
method of moments). The estimates of σ² and σ²A
are 1.80 and 11.71, respectively. The total
variance of an observation is σ² + σ²A, which is
estimated by 1.80 + 11.71 = 13.51. So,
11.71/13.51, or 86.7%, of the total variance of an
observation is attributable to batch differences
and 13.3% to error variability within the
batches. ρ = σ²A/(σ² + σ²A) is the correlation
coefficient within a batch (the intraclass
correlation).
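The method-of-moments arithmetic in the batch example can be reproduced directly. The sketch below (plain NumPy, assuming the data layout shown in the example) equates MS to EMS and solves for the two variance components:

```python
import numpy as np

# Three determinations (rows of the slide's table) for each of 5 batches.
batches = np.array([[74.0, 76, 75],   # batch 1
                    [68.0, 71, 72],   # batch 2
                    [75.0, 77, 77],   # batch 3
                    [72.0, 74, 73],   # batch 4
                    [79.0, 81, 79]])  # batch 5
k, n = batches.shape                  # k = 5 batches, n = 3 per batch

grand = batches.mean()
ms_between = n * ((batches.mean(axis=1) - grand) ** 2).sum() / (k - 1)
ms_within = ((batches - batches.mean(axis=1, keepdims=True)) ** 2).sum() \
            / (k * (n - 1))

# Equate MS to EMS: E(MS_within) = sigma^2,
# E(MS_between) = sigma^2 + n * sigma^2_A.
sigma2_hat = ms_within
sigma2A_hat = (ms_between - ms_within) / n
share = sigma2A_hat / (sigma2_hat + sigma2A_hat)   # fraction due to batches
print(round(ms_between, 3), round(ms_within, 3), round(share, 3))
```

The printed mean squares match the ANOVA table above, and the batch share reproduces the 86.7% figure.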
114Regression and Analysis of Variance II
- Chapter 18 Randomized Blocks Special Case of
Two-Way ANOVA
115Homework 6
116SAS System Options
- The OPTIONS statement can appear at any place in
a program, except within data or card lines.
- options linesize=88 pagesize=64; (portrait)
- options linesize=179 pagesize=65; (landscape)
- options PAGENO=n will start the output file at
page number n. - Some Other Options...
- NONUMBER removes the page numbers from the output
file. - NODATE removes the date and time from the output
window. - NOCENTER left-justifies the output file.
- SKIP=n will tell SAS to skip n lines before
printing on a page. - MISSING='character' specifies the character to
print for missing numeric values.
117Two-Way ANOVA
- Analysis of variance (ANOVA) is a technique for
assessing how one or several nominal independent
variables (called factors) affect a continuous
dependent variable. - Two-way ANOVA involves two factors, each at two
or more levels. The levels of one factor might be
various teaching methods, for example, and the
levels of the other factor might be textbooks. - If there are I levels of one factor (called the
row factor) and J of the other (called the column
factor), there are I x J combinations. - Suppose there are nij observations that
correspond to the ith level of the row factor and
the jth level of the column factor. -
118Randomized Complete Block Design (RCBD)A
Special Two-Factor Design
- This design originated in agricultural
experiments. To compare the effects of I
different fertilizers, J relatively homogeneous
plots of land, or blocks, are selected, and each
is divided into I plots. Within each block, the
assignment of fertilizers to plots is made at
random. - This design is a multi-sample generalization of a
matched-pairs design. - Blocks can be litters of animals, batches of raw
material, individuals, time, gender, age group, - We will term the other factor treatment.
- Block is a nuisance factor, but
- Blocking helps eliminate the effects of a
confounding factor.
119Another Example of RCBD
- An RCBD might be used by a nutritionist who wants
to compare the effects of three different diets
on experimental animals. - To control for genetic variation in the animals,
the nutritionist might select three animals from
each of several litters and randomly determine
their assignments to the diets. Here, the litters
are blocks.
120Data Layout for a RCBD
                     Block
             1     2     ...   b     total
Treatment
     1       Y11   Y12   ...   Y1b   T1
     2       Y21   Y22   ...   Y2b   T2
     .       .     .           .     .
     k       Yk1   Yk2   ...   Ykb   Tk
   total     B1    B2    ...   Bb    G
121Cell Means for a RCBD
                     Block
             1     2     ...   b     mean
Treatment
     1       μ11   μ12   ...   μ1b   μ1.
     2       μ21   μ22   ...   μ2b   μ2.
     .       .     .           .     .
     k       μk1   μk2   ...   μkb   μk.
   mean      μ.1   μ.2   ...   μ.b   μ..
122ANOVA Table for a RCBD
123Fixed-Effects ANOVA Model for a RCBD
- If both factors are considered fixed, a classical
ANOVA model may be written in terms of these
effects
- Yij = μ + τi + βj + eij, i = 1, 2, …, k;
j = 1, 2, …, b
- where
- Yij = observed response associated with the
ith treatment in the jth block
- μ = overall mean
- τi = effect of treatment i
- βj = effect of block j
- Constraints: Σ τi = 0, Σ βj = 0
- Model fitting
124Assumptions Underlying the ANOVA Model
- Normality of the errors
- Independence of the errors
- Homogeneity of variance
- Additivity of the model, i.e., no interaction
125Regression Approach to the RCBD
- Using the effect coding scheme, we can create
k - 1 dummy variables X1, X2, …, Xk-1 for the k
treatments and b - 1 dummy variables Z1, Z2, …,
Zb-1 for the b blocks. - A regression model for the RCBD is
- Y = μ + Σ τiXi + Σ βjZj + e,  e ~ N(0, σ²)
- These coefficients correspond to the first k - 1
treatment effects and the first b - 1 block
effects; -Σ τi corresponds to the kth treatment
effect, and -Σ βj corresponds to the bth block
effect.
126F Test for the Equality of Treatment Means
- H0: μ1. = μ2. = … = μk. against
- H1: μ1., μ2., …, μk. are NOT all equal.
- F = MST/MSE
- where MST = SST/(k - 1) and
MSE = SSE/((k - 1)(b - 1)).
127Example Ch18q01, page 501
- The toxic effects of three chemicals (I, II, and
III) used in the tire-manufacturing industry were
investigated in an experiment. Three adjacent
1-inch squares were marked on the back of each of
eight rats, and the three chemicals were applied
separately to the three squares on each rat. The
squares were then rated from 0 to 10, depending
on the degree of irritation. The data are as
shown below.
                         Rat
            1    2    3    4    5    6    7    8    total
Chemical
     I      6    9    6    5    7    5    6    6    50
     II     5    9    9    8    8    7    7    7    60
     III    3    4    3    6    8    5    5    6    40
   total    14   22   18   19   23   17   18   19   150
Questions
(1) What are the blocks and treatments in this RCBD?
(2) State an appropriate fixed-effects ANOVA model
and the regression model for this RCBD.
(3) Do the data provide sufficient evidence to
indicate a significant difference in the toxic
effects of the three chemicals?
(4) Find a 98% t confidence interval for the true
difference in the toxic effects of chemicals I
and II.
(5) What proportion of total variation is explained
by the model?
(6) State the assumptions on which the validity of
the analysis depends.
128SAS Codes
data Ch18q01;
  do Chemical = 1 to 3;  /* or: do Chemical = 'I', 'II', 'III'; */
    do Rat = 1 to 8;
      input Rate @@; output;
    end;
  end;
cards;
6 9 6 5 7 5 6 6
5 9 9 8 8 7 7 7
3 4 3 6 8 5 5 6
;
run;

proc glm;
  class Chemical Rat;
  model Rate = Chemical Rat;
run; quit;
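The same analysis can be cross-checked outside SAS. This sketch (plain NumPy/SciPy, not part of the original SAS program) reproduces the RCBD F test for the chemical effect from the data table:

```python
import numpy as np
from scipy import stats

# Irritation ratings: rows = chemicals I-III, columns = rats 1-8.
y = np.array([[6, 9, 6, 5, 7, 5, 6, 6],
              [5, 9, 9, 8, 8, 7, 7, 7],
              [3, 4, 3, 6, 8, 5, 5, 6]], dtype=float)
k, b = y.shape                         # k = 3 treatments, b = 8 blocks
grand = y.mean()

ss_treat = b * ((y.mean(axis=1) - grand) ** 2).sum()   # chemicals
ss_block = k * ((y.mean(axis=0) - grand) ** 2).sum()   # rats
ss_total = ((y - grand) ** 2).sum()
ss_error = ss_total - ss_treat - ss_block              # residual SS

mst = ss_treat / (k - 1)
mse = ss_error / ((k - 1) * (b - 1))
f_stat = mst / mse
p_value = stats.f.sf(f_stat, k - 1, (k - 1) * (b - 1))
print(round(f_stat, 3), round(p_value, 4))
```

The F statistic and p-value printed here should agree with the Chemical line of the PROC GLM output.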
129Regression and Analysis of Variance II
- Chapter 19
- Two-Way ANOVA with Equal Cell Numbers
130Homework 7
131Two-Way ANOVA
- Two-way ANOVA involves two factors, each at two
or more levels. - If there are I levels of one factor (called the
row factor) and J of the other (called the column
factor), there are I x J combinations. - Suppose there are nij observations that
correspond to the ith level of the row factor and
the jth level of the column factor. - In Chapter 18, we considered the RCBD, where all nij
= 1. Instead of considering interaction models, we
considered additive models. The reason is this:
if we considered interaction, then the errors would
all be zero and inference would be impossible,
although some parameters could be estimated. With
additive models, the error term actually
represents interaction (so interaction is assumed
small and ignorable). - In this chapter, we consider two-way ANOVA where
all nij = n and n > 1. Interaction between
factors will be modeled. - Although two-way balanced layouts are rarely seen
in observational studies, they are generated
intentionally in controlled experiments where the
investigator chooses the levels of the factors
and the allocation of subjects to the various
factor combinations. -
132Data Layout for Two-Way ANOVA with Equal Cell
Numbers
133The Reasons for Replication
- There are two major reasons for having more than
one observation at each combination of factor
levels
- To be able to compute a pure estimate of the
experimental error (σ²) - To detect an interaction effect between the two factors
134Simple Effects, Main Effects, and Interactions
- The first step in examining a two-way layout
should always be to construct a table of cell
means (along with row and column means). - The difference between two row (column) means
holding the column (row) factor constant at a
given level measures the simple effect of the row
(column) factor. - The difference between two row means (or two
colum