1
Regression and Analysis of Variance II
  • Chapter 13: Analysis of Covariance and Other
    Methods for Adjusting Continuous Data

2
Homework 1
  • All even problems
  • Due 2/5 in class

3
Three Reasons for Considering Control
  • When assessing an association between a dependent
    variable and a set of study variables, we control
    for extraneous variables (called covariates) in
    order:
  • to assess interaction between study variables and
    covariates,
  • to correct for confounding, and
  • to increase the precision in estimating the
    association.

4
Interaction and Confounding
  • Interaction and confounding are two
    methodological concepts for quantifying the
    relationship of one or more independent variables
    to a dependent variable.
  • Interaction exists when the association is
    different at different values of the covariates.
  • Confounding exists if different associations
    between a dependent variable and one or more
    study variables result when covariates are
    included or ignored in data analysis.

5
What Is Analysis of Covariance?
  • Analysis of covariance (ANACOVA, ANCOVA) is a
    special regression procedure used to adjust, or
    correct, for problems of confounding.
  • An ANACOVA model is a regression model. In such a
    model, study variables are categorical,
    covariates may be measured on any scale, and
    there is no interaction between the study
    variables and the covariates.

6
Choice of Covariates in an ANACOVA Model
  • Covariates should be confounding variables.
  • Covariates should not depend upon the study
    variables in any way.

7
Development of the ANACOVA Model
  • For simplicity, consider only one study variable,
    which is categorical and takes k values
    corresponding to k groups.
  • Suppose that z1, z2, …, zk-1 are k − 1 dummy
    variables for the study variable, defined (using
    reference cell coding) as zj = 1 if an observation
    falls in group j and zj = 0 otherwise,
    j = 1, …, k − 1.
  • Suppose that x1, x2, …, xp are p covariates.

8
Development of the Covariance Model (cont'd)
  • The ANACOVA model is
  • Y = β0 + β1x1 + ⋯ + βpxp + βp+1z1 + ⋯ +
    βp+k-1zk-1 + E,
  • where E ~ N(0, σ²) and no interaction terms
    between the z's and the x's are included.

9
Point Estimation of Quantities of Interest
  • The parameters can be estimated by the least
    squares method. Denote these estimates by b0, b1,
    …, bp+k-1.
  • Instead of the unadjusted group means (the mean of
    y for each category), we are usually interested in
    the adjusted means. The adjusted mean for a
    category is the predicted value obtained by
    evaluating the model for that category with each
    covariate set to its overall mean across all
    categories (denote these means by xbar1, …,
    xbarp).
  • For the baseline category (k), the adjusted mean
    is b0 + b1·xbar1 + ⋯ + bp·xbarp.
  • For the jth category (j = 1, …, k − 1), the
    adjusted mean is b0 + bp+j + b1·xbar1 + ⋯ +
    bp·xbarp.
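In SAS, these adjusted means can be obtained as
least-squares means; a minimal sketch, assuming a
dataset mydata with response y, a categorical study
variable group, and covariates x1 and x2 (all names
hypothetical):

proc glm data=mydata;
  class group;                   /* categorical study variable with k levels */
  model y = group x1 x2;         /* covariates entered as continuous terms */
  lsmeans group / stderr pdiff;  /* adjusted (least-squares) means with tests */
run;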

10
Confidence Intervals
  • Formula:
  • Point Estimate ± (critical value) × (standard
    error)
  • Confidence intervals for single coefficients can
    be obtained using t critical values.
  • Confidence intervals for a linear combination of
    coefficients can also be obtained using t
    critical values. In order to calculate the
    standard error, we need to know the covariance
    matrix of the estimates of coefficients.

11
Hypothesis Testing
  • Testing for equality of all the adjusted means is
    of interest and is equivalent to testing H0:
    βp+1 = βp+2 = ⋯ = βp+k-1 = 0, which is carried out
    by the multiple partial F test.
  • The test statistic is
  • F = [(SSE(reduced) − SSE(full)) / (k − 1)] /
    MSE(full),
  • where the reduced model drops the k − 1 dummy
    variables; under H0, F has k − 1 and n − p − k
    degrees of freedom.

12
ANACOVA Example Using SAS
  • Problem 1 (Page 274)
  • Problem 3 (Page 275)
  • Problem 5 (Page 275) SAS Codes
  • Problem 7 (Page 281)
  • Problem 9 (Page 281)
  • Problem 11 (Page 281)
  • Problem 13 (Page 282)
  • Problem 15 (Page 284)

13
Regression and Analysis of Variance II
  • Chapter 14 Regression Diagnostics

14
Homework 2
  • Problems 20, 22, 23, 24, 28
  • Due Feb 12, 2009

15
  • This chapter introduces methods for
  • Detecting outliers
  • Checking regression assumptions
  • Detecting the presence of collinearity

16
Descriptive Statistical Analysis
  • (Possible outliers) Examine the 5 largest and 5
    smallest values for every numeric variable.
  • Impossible values should be set to missing.
  • When outliers are removed, this action, and any
    justification for it, should be documented and
    presented along with the results.
  • Examine the appropriate descriptive statistics for
    each variable:
  • For categorical variables, produce frequency
    tables to detect unusual values.
  • For continuous data, produce the range.
  • Examine scatterplots:
  • For simple linear regression with both variables x
    and y continuous, plot y vs. x (for checking
    linearity and detecting outliers). Calculate the
    Pearson correlation, r.
  • For multiple regression, produce partial
    regression plots and calculate partial
    correlations.
  • Calculate the correlation matrix for the
    independent variables; a strong correlation
    signals collinearity problems.

17
Residual Analysis
  • A residual for an observation is the difference
    between the observation and its predicted value
  • Standardized residuals
  • Studentized residuals
  • Jackknife residuals
  • Residual analysis can:
  • Detect outliers
  • Detect violations of model assumptions

18
Standardized Residuals
  • The ith standardized residual scales the raw
    residual ei by the estimated error standard
    deviation: zi = ei / sqrt(MSE).

19
Studentized Residuals
  • The ith studentized residual also accounts for the
    leverage hi of the ith observation:
    ri = ei / sqrt(MSE(1 − hi)).

20
Jackknife Residuals
  • The ith jackknife (studentized deleted) residual
    uses the error mean square MSE(−i) computed with
    the ith observation deleted:
    r(−i) = ei / sqrt(MSE(−i)(1 − hi)).
21
Hypothesis Testing for Outliers
  • Under all model assumptions, the jackknife
    residuals follow a t distribution exactly; the
    studentized residuals are approximately t
    distributed.
  • To test whether each observation is an outlier,
    either the studentized residual or the jackknife
    residual may be used as the test statistic.
  • It is suggested that the Bonferroni procedure or
    another multiple testing procedure be used.
  • When the Bonferroni procedure is used, the
    significance level is α/n for each test, where α
    is the family-wise type I error rate.
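For example, the per-test critical value for the
jackknife residuals can be computed with the TINV
function; a sketch assuming n = 19 observations and
k = 3 predictors, as in the FEV example later in
this chapter:

data _null_;
  n = 19; k = 3; alpha = 0.05;               /* family-wise error rate */
  tcrit = tinv(1 - alpha/(2*n), n - k - 2);  /* two-sided Bonferroni cutoff */
  put 'Bonferroni critical value for jackknife residuals: ' tcrit;
run;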

22
Cook's Distance
  • Cook's distance measures the extent to which the
    estimates of the regression coefficients change
    when an observation is deleted from the analysis.
  • Let d(i) denote the Cook's distance for the ith
    observation; then
  • d(i) = [ri^2 / (k + 1)] × [hi / (1 − hi)],
  • where ri is the studentized residual, hi is the
    leverage, and k is the number of predictors.

23
Distribution of Cook's Distance
  • Cook's distance can be used as a test statistic
    for the hypothesis H0: the ith observation is an
    outlier against the hypothesis H1: the ith
    observation is not an outlier.
  • The distribution of the distance is still unknown.
    Muller and Chen (1997) performed simulations and
    tabulated some critical values for given sample
    size, n, and given number of parameters, k.

24
Assessing the Linearity, Homoscedasticity, and
Independence Assumptions Using Residual Plots
  • The theoretical foundation of residual plots is
    that the residuals are uncorrelated with both the
    predicted values and the values of each predictor
    variable; that is, corr(residuals, predicted
    values) = 0 and corr(residuals, xj) = 0 for every
    predictor xj.
  • Thus, common residual plots are:
  • Residuals against predicted values
  • Residuals against predictor values

25
Assessing the Normality Assumption
  • Goodness-of-fit tests
  • Kolmogorov-Smirnov test
  • Anderson-Darling test
  • Shapiro-Wilk test
  • Normal probability plot
  • Residuals against normal percentiles
  • Normality holds if the points fall on a straight
    line
  • Normal quantile-quantile plot
  • Residuals against normal quantiles
  • Normality holds if the points fall on a straight
    line
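In SAS, both the plots and the goodness-of-fit
tests are available in PROC UNIVARIATE; a sketch
assuming the jackknife residuals were saved in a
dataset named results:

proc univariate data=results normal;            /* NORMAL requests the fit tests */
  var jackknife;
  qqplot jackknife / normal(mu=est sigma=est);  /* reference line from estimated mean and sd */
run;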

26
Some Remedies for Assumption Violations
  • Data transformations
  • To stabilize the variance of the dependent
    variable, y, if the homoscedasticity assumption
    is violated
  • To normalize the dependent variable, if the
    normality assumption is violated
  • To linearize the regression model, if the
    linearity assumption is violated
  • Commonly used transformations: log(y), sqrt(y),
    1/y
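A data step sketch of these transformations (the
input dataset mydata and response y are
hypothetical; log and sqrt require y > 0):

data transformed;
  set mydata;
  log_y  = log(y);   /* often linearizes multiplicative relationships */
  sqrt_y = sqrt(y);  /* milder variance-stabilizing transformation */
  inv_y  = 1/y;      /* strong transformation for severe right skew */
run;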

27
Weighted Least Squares Analysis
  • Used when the variance homogeneity assumption
    and/or the independence assumption does not hold.
  • Suppose the ith observation on y has variance
    σ²(i) = σ²/wi, where the wi are all known. Then
    the regression coefficients βj, j = 0, 1, …, k,
    are determined by minimizing the weighted sum of
    squares (page 304):
  • SSW(β) = Σi wi (yi − β0 − β1xi1 − ⋯ − βkxik)²
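In SAS, weighted least squares is requested with
the WEIGHT statement; a sketch assuming a dataset
with known weights in a variable w (names
hypothetical):

proc reg data=mydata;
  model y = x1 x2;
  weight w;   /* minimizes the weighted sum of squares sum(w*(y - yhat)**2) */
run;
quit;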

28
Collinearity
  • It exists when there are strong linear
    relationships among independent variables.
  • Symptoms of collinearity
  • Effects of collinearity
  • On regression coefficients
  • On predicted values
  • On variance-covariances and standard errors
  • Approaches for diagnosing the presence of
    collinearity
  • Remedies for the collinearity problem.

29
Using Eigenvalues to Determine the Presence of
Collinearity
  • The correlation matrix (k by k) of the k
    independent variables has k eigenvalues. Suppose
    they are arranged in descending order,
    λ1 ≥ ⋯ ≥ λk.
  • To determine the presence of collinearity,
    statisticians use three kinds of statistics:
  • Condition indices: sqrt(λ1/λj), j = 1, …, k.
  • These describe the degree to which the data are
    ill-conditioned, i.e., the degree to which small
    changes in data values result in large changes in
    parameter estimates. Threshold: 30.
  • Condition number: sqrt(λ1/λk).
  • Variance proportions: page 313.
  • For condition indices larger than 30, the variance
    proportions should be examined to determine which
    predictor variables are primarily responsible for
    the large condition index. Predictors with
    variance proportions larger than 0.5 can be
    considered involved in the collinearity problem.
    If the intercept is involved, it is recommended
    that intercept-adjusted collinearity diagnostics
    be examined.

30
Collinearity Diagnostics
  • Step 1: Produce the correlation matrix of the
    predictor variables and plot the predictors
    against one another.
  • Step 2: Examine the variance inflation factor
    (VIF) values for each predictor.
  • Step 3: Examine the condition indices and variance
    proportions.

31
Remedies for Collinearity Problems
  • Drop predictors that are correlated with others;
    drop the least scientifically interesting
    predictors.
  • Use dummy variables properly.
  • Limit interaction terms in a model.
  • Using centered data can alleviate collinearity
    problems. Warning: this may render the usual
    collinearity diagnostics, VIF values and condition
    indices, ineffective.
  • Regression on principal components, or some of
    them.
  • Ridge regression
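One way to center the predictors is PROC STANDARD;
a sketch (dataset and variable names hypothetical):

proc standard data=mydata mean=0 out=centered;  /* replace x1, x2 by deviations from their means */
  var x1 x2;
run;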

32
SAS Output Dictionary
  • Leverage values: Measure how far an observation
    is from the center point of the independent
    variables (not the dependent variable).
    Observations with values larger than 2(k+1)/n are
    considered potentially highly influential, where k
    is the number of predictors and n is the sample
    size.
  • DFFITS: Measures how much an observation has
    affected its fitted value. Values larger than
    2·sqrt((k+1)/n) in absolute value are considered
    highly influential.
  • DFBETAS: Measures how much an observation has
    affected the estimate of a regression coefficient
    (there is one DFBETA for each regression
    coefficient, including the intercept). Values
    larger than 2/sqrt(n) in absolute value are
    considered highly influential.
  • Cook's D: Measures the aggregate impact of each
    observation on the group of regression
    coefficients, as well as the group of fitted
    values. Values larger than 4/n are considered
    highly influential.
  • COVRATIO: Measures the impact of each observation
    on the variances (and standard errors) of the
    regression coefficients and their covariances.
    Values outside the interval 1 ± 3(k+1)/n are
    considered highly influential.

33
Example (Problem 19, page 329 with data) Effect
of 0.25 ppm sulfur dioxide on airway resistance
in freely breathing, heavily exercising,
asthmatic subjects.
  • Abstract: We sought to determine whether 0.25 ppm
    sulfur dioxide in filtered air causes
    bronchoconstriction when inhaled by freely
    breathing, heavily exercising, asthmatic subjects.
    Nineteen asthmatic volunteers exercised at 750
    kilogram meters/min for 5 min in an exposure
    chamber that contained filtered air at ambient
    temperature and humidity or, on another day,
    filtered air plus 0.25 ppm sulfur dioxide. The
    order of exposure to sulfur dioxide and to
    filtered air alone was randomized, and the
    experiments were double-blinded. Specific airway
    resistance, measured by constant-volume,
    whole-body plethysmography, increased from 6.38 ±
    2.07 cm H2O × s (mean ± SD) before exercise to
    11.32 ± 8.97 after exercise on days when subjects
    breathed filtered air alone, and from 5.70 ± 1.93
    to 13.33 ± 7.54 on days when subjects breathed
    0.25 ppm sulfur dioxide in filtered air. The
    increase in specific airway resistance on days
    when subjects breathed 0.25 ppm sulfur dioxide was
    only slightly greater than on days when they
    breathed filtered air, but the difference was
    significant. To determine whether 0.25 ppm sulfur
    dioxide causes greater bronchoconstriction in
    asthmatic subjects exercising more vigorously, 9
    subjects then repeated the experiment exercising
    at 1,000 instead of 750 kilogram meters/min.
    Specific airway resistance increased from 6.71 ±
    2.25 to 13.59 ± 7.57 on days when subjects
    breathed filtered air alone and from 5.23 ± 1.23
    to 12.54 ± 6.17 on days they breathed 0.25 ppm
    sulfur dioxide in filtered air. The increase in
    specific airway resistance on the 2 days was not
    significantly different.

34
data Ch14q19;
  input AGE Sex $ Height Weight FEV;  /* Sex is character, so it needs a $ */
datalines;
24 M 175 78.0 4.7
36 M 172 67.6 4.3
28 F 171 98.0 3.5
25 M 166 65.5 4.0
26 F 166 65.0 3.2
22 M 176 65.5 4.7
27 M 185 85.5 4.3
27 M 171 76.3 4.7
36 M 185 79.0 5.2
24 M 182 88.2 4.2
26 M 180 70.5 3.5
29 M 163 75.0 3.2
33 F 180 68.0 2.6
31 M 180 65.0 2.0
30 M 180 70.4 4.0
22 M 168 63.0 3.9
27 M 168 91.2 3.0
46 M 178 67.0 4.5
36 M 173 62.0 2.4
;
RUN;
PROC PRINT DATA=Ch14q19; run;

For those not familiar with SAS: if you have the
data saved somewhere, you may use the infile
statement, as in the following:

data one;
  infile 'yourfiledirectory\Ch14q19.txt' firstobs=2;
  input AGE Sex $ Height Weight FEV;
run;
proc print; run;
35
PROC reg DATA=Ch14q19;
  model FEV = AGE Height Weight / vif collin influence;
  output out=results p=predicted h=leverage
         student=standardized rstudent=jackknife
         cookd=Cookdistance;
  title 'Proc reg output: Forced Expiratory Volume (Y) regressed on AGE, Weight, and Height';
run;

proc gplot data=results;
  plot standardized*predicted;
  plot jackknife*predicted;
  title 'Plot of Jackknife residuals against predicted values';
run;

proc univariate data=results normal plot;
  var jackknife;
  histogram jackknife / normal;
  probplot jackknife / normal;
  qqplot jackknife / normal;
run;

data outliers;
  set results;
  if (abs(jackknife) gt tinv(0.95, 19-3-2)) OR
     (Cookdistance gt 1) OR
     (leverage gt 2*(3+1)/19);
run;
proc print data=outliers; run;
quit;
36
Collinearity Diagnostics

                                          ---------- Proportion of Variation ----------
Number  Eigenvalue   Condition Index   Intercept     AGE       Height       Weight
1       3.95433          1.00000       0.00008665    0.00212   0.00007991   0.00109
2       0.03589         10.49649       0.00075655    0.62719   0.00045217   0.14783
3       0.00911         20.83984       0.03745       0.34070   0.02926      0.84941
4       0.00067443      76.57184       0.96170       0.02998   0.97021      0.00167
37
Regression and Analysis of Variance II
  • Chapter 15 Polynomial Regression

38
Homework 3
  • Problems 2 and 14

39
Preview
  • Polynomial regression is a special case of
    multiple regression in which the regression
    function is a polynomial in a single predictor
    variable.
  • The general form of a polynomial regression model
    is
  • Y = β0 + β1X + β2X^2 + ⋯ + βkX^k + E

40
Chapter Example (Ch15q01, page 370)
SOLN_NUM X Y LN_Y
1 6 0.029 -3.54
1 6 0.032 -3.442
1 6 0.027 -3.612
1 8 0.079 -2.538
1 8 0.072 -2.631
1 8 0.088 -2.43
1 10 0.181 -1.709
1 10 0.165 -1.802
1 10 0.201 -1.604
1 12 0.425 -0.856
1 12 0.384 -0.957
1 12 0.472 -0.751
1 14 1.13 0.122
1 14 1.02 0.02
1 14 1.249 0.222
1 16 2.812 1.034
1 16 2.465 0.902
1 16 3.099 1.131
41
Fitting a Parabola Using the Least Squares Method
  • A parabola is the graph of a quadratic equation
    (order-2 polynomial).
  • Let's fit a quadratic model to our previous data.
    The polynomial model is
  • Y = β0 + β1X + β2X^2 + E
  • We wish to estimate these β's. As for a general
    multiple regression model, we minimize the sum of
    squared errors
  • SSE(β0, β1, β2) = Σi (Yi − β0 − β1Xi − β2Xi^2)^2

42
ANOVA Table for Second-order Polynomial
Regression
  • An ANOVA (Analysis of Variance) table can be
    constructed for any regression model.
  • For a second-order polynomial regression model,
    its ANOVA table looks like the following
    constructed from the Ch15q01 data

43
Table 15.2, Page 352
data table5_1; /* one outlier removed */
  input SBP Age @@;
  Age2 = Age*Age;
cards;
144 39  138 45  145 47  162 65  142 46  170 67  124 42  158 67  154 56
162 64  150 56  140 59  110 34  128 42  130 48  135 45  114 17  116 20
124 19  136 36  142 50  120 39  120 21  160 44  158 53  144 63  130 29
125 25  175 69
;
run;
proc reg data=table5_1;
  model SBP = Age Age2 / ss1;
  plot SBP*Age;
run;
quit;
44
data Ch15q01;
  input X Y LN_Y @@;
  X2 = X*X; /* create second-order term */
cards;
6 0.029 -3.54    6 0.032 -3.442   6 0.027 -3.612
8 0.079 -2.538   8 0.072 -2.631   8 0.088 -2.43
10 0.181 -1.709  10 0.165 -1.802  10 0.201 -1.604
12 0.425 -0.856  12 0.384 -0.957  12 0.472 -0.751
14 1.13 0.122    14 1.02 0.02     14 1.249 0.222
16 2.812 1.034   16 2.465 0.902   16 3.099 1.131
;
run;
proc reg;
  model Y = X X2 / ss1;  /* ss1 allows construction of an ANOVA table as on page 352 */
run;
quit;
45
Inferences Associated with Second-order
Polynomial Regression
  • Is the overall 2nd-order polynomial model
    significant? Compare this model with the
    intercept-only model. (Overall F test and R2)
  • Does the 2nd-order model provide significantly
    more predictive power than the straight-line
    model? (Test H0: β2 = 0, using a t test or partial
    F test.)
  • Given that a 2nd-order model is more appropriate
    than a straight-line model, should we add
    higher-order terms to the 2nd-order model?
    (Lack-of-fit test)

46
Fitting Higher-order Models
  • If a quadratic model is not adequate, a
    higher-order polynomial model may be needed.
  • A kth-order polynomial has at most k − 1 bends
    (relative extrema).
  • Generally, the maximum-order polynomial that may
    be fit is of order one less than the number of
    distinct X-values.

47
Lack-of-fit Tests
  • Given that a lower-order polynomial has been
    fitted and the highest-order term tests
    significant, should we be confident that a
    higher-order model is not needed?
  • We still need to conduct a lack-of-fit test by
    comparing the current model with the
    highest-order possible polynomial model.
  • For a lack-of-fit test to be possible, at least
    one X-value must involve replicates.

48
Lack-of-fit Test of a Second-order Model: Example

Obs  SBP  Age     Obs  SBP  Age     Obs  SBP  Age
 1   114   17     11   124   42     21   154   56
 2   124   19     12   128   42     22   150   56
 3   116   20     13   160   44     23   140   59
 4   120   21     14   138   45     24   144   63
 5   125   25     15   135   45     25   162   64
 6   130   29     16   142   46     26   162   65
 7   110   34     17   145   47     27   170   67
 8   136   36     18   130   48     28   158   67
 9   144   39     19   142   50     29   175   69
10   120   39     20   158   53

  • 5 X-values (39, 42, 45, 56, 67) have repeats and
    19 X-values have no repeats, so the total number
    of distinct X-values is 5 + 19 = 24.
  • The highest-order polynomial that can be fit is a
    polynomial of order 24 − 1 = 23.
  • Directly fitting such a 23rd-order polynomial is
    difficult! Once it is fitted, a multiple partial F
    test may be used to test H0: no lack of fit of the
    2nd-order model.
  • The lack-of-fit statistic is given by
  • F = [(SSE(2nd-order) − SSPE) / (24 − 3)] /
    [SSPE / (29 − 24)],
  • where SSPE is the pure-error sum of squares from
    the 23rd-order (saturated) model.

49
Orthogonal Polynomials
  • Collinearity problems can arise in work with
    higher-order polynomial models.
  • Such collinearity problems can be overcome using
    orthogonal polynomials.
  • When fitting a kth-order (natural) polynomial, we
    need to create k orthogonal polynomials, Xi =
    Pi(x), i = 1, 2, …, k, where Pi(x) = a0i + a1i·x +
    ⋯ + aii·x^i, such that the values of Pi(x) and
    Pj(x) over the observed data are uncorrelated for
    all i ≠ j.
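If SAS/IML is available, the ORPOL function
generates orthogonal polynomial values; a minimal
sketch using the distinct X-values from the Ch15q01
data:

proc iml;
  x = {6, 8, 10, 12, 14, 16};  /* distinct X-values */
  P = orpol(x, 3);             /* columns: constant, linear, quadratic, cubic */
  print P;
quit;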

50
Some Results
  • Whether the model is fitted with a natural
    polynomial or an orthogonal polynomial, R2 and the
    partial and overall F tests stay the same.
  • There is no collinearity problem when fitting an
    orthogonal polynomial.
  • The backward elimination procedure is made easier
    with an orthogonal model.

51
Orthogonal Polynomial Coefficients
  • If the X-values are equally spaced and repeat
    equally often, then Table A.7 on page 836 of the
    textbook can be used to determine the
    corresponding Xi-values.
  • Refer to the example on pages 362-363.

52
Lack-of-fit Test Using Orthogonal Polynomials
  • Fit only the highest-order orthogonal polynomial
    model. All lower-order models will have the same
    slope estimates!
  • Conduct a multiple partial F test for H0: no lack
    of fit of the primary model.

53
Strategies for Choosing a Polynomial Model
  • First, choose the full model to be of third order
    or lower.
  • Second, proceed backward in a stepwise fashion,
    starting with the largest power term, and
    sequentially delete non-significant terms,
    stopping at the first power term that is
    significant; this term and all lower-order terms
    should be retained in the final model.
  • Third, conduct a multiple partial F test for lack
    of fit.
  • Lastly, perform residual analysis using jackknife
    residuals.

54
Sample Questions for This Chapter
  • x y
  • 6 0.029
  • 6 0.032
  • 6 0.027
  • 8 0.079
  • 8 0.072
  • 8 0.088
  • 10 0.181
  • 10 0.165
  • 10 0.201
  • 12 0.425
  • 12 0.384
  • 12 0.472
  • 14 1.13
  • 14 1.02
  • 14 1.249
  • 16 2.812
  • 16 2.465
  • 16 3.099
  1. We wish to regress y on x using a polynomial.
    What is the highest-order polynomial possible?
  2. If we fit a quadratic polynomial, how many
    degrees of freedom does the residual have? How
    many degrees of freedom does the lack of fit
    have? How many degrees of freedom does the pure
    error have?

55
Regression and Analysis of Variance II
  • Chapter 16
  • Selecting the Best Regression Equation

56
Homework 4
  • Questions 2 and 12

57
Problem Description
  • Given one response variable Y and a set of
    predictors X1, X2, …, Xk, we want to determine the
    best subset of the k predictor variables and the
    corresponding best-fitting regression equation.

58
Steps in Selecting the Best Regression Model
  • Start from the maximum model.
  • Specify a criterion for model selection.
  • Specify a strategy for variable selection.
  • Conduct the specified analysis.
  • Evaluate the reliability of the selected model.

59
Step 1: Specifying the Maximum Model
  • The maximum model should not be too large. Large
    models may overfit the data; that is, they may
    include variables with truly zero regression
    coefficients (a type I error). Overfitting
    introduces no bias but leads to large variance.
  • On the other hand, the maximum model should not be
    too small. Small models may underfit the data;
    that is, they may exclude variables with truly
    non-zero regression coefficients (a type II
    error). Underfitting introduces bias but less
    variance.
  • Conclusion: there is a bias-variance trade-off
    when fitting regression models.

60
Step 2: Specifying a Criterion for Model Selection
  • A selection criterion is an index that can be
    computed for each candidate model and used to
    compare models.
  • According to the criterion, candidate models can
    be ordered from best to worst.
  • Many criteria are possible, for example:
  • Rp2, Fp, MSE(p), and Cp.
  • All these criteria compare the maximum model (full
    model) with k predictors to a restricted model
    (reduced model) with p predictors (p ≤ k).
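In PROC REG these criteria can drive all-subsets
screening; a sketch (dataset and predictor names
hypothetical):

proc reg data=mydata;
  model y = x1-x5 / selection=cp best=5;  /* list the 5 best subsets by Mallows' Cp */
run;
quit;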

61
Comparing the Full Model and the Reduced Model
Using Rp2
62
There are several approaches to explaining R2 in
OLS. These different approaches lead to various
calculations of pseudo-R2s for regressions with
categorical outcome variables (chapter 22).

R-squared as explained variability: The denominator
of the ratio can be thought of as the total
variability in the dependent variable, or how much
y varies from its mean. The numerator of the ratio
can be thought of as the variability in the
dependent variable that is not predicted by the
model. Thus, this ratio is the proportion of the
total variability unexplained by the model.
Subtracting this ratio from one gives the
proportion of the total variability explained by
the model. The more variability explained, the
better the model.

R-squared as improvement from the null model to the
fitted model: The denominator of the ratio can be
thought of as the sum of squared errors from the
null model, a model predicting the dependent
variable without any independent variables. In the
null model, each y value is predicted to be the
mean of the y values. Consider being asked to
predict a y value without having any additional
information about what you are predicting. The mean
of the y values would be your best guess if your
aim is to minimize the squared difference between
your prediction and the actual y value. The
numerator of the ratio would then be the sum of
squared errors of the fitted model. The ratio is
indicative of the degree to which the model
parameters improve upon the prediction of the null
model. The smaller this ratio, the greater the
improvement and the higher the R-squared.

R-squared as the square of the correlation: The
term "R-squared" is derived from this definition.
R-squared is the square of the correlation between
the model's predicted values and the actual values.
This correlation can range from -1 to 1, and so the
square of the correlation ranges from 0 to 1. The
greater the magnitude of the correlation between
the predicted values and the actual values, the
greater the R-squared, regardless of whether the
correlation is positive or negative.

Source:
http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm
63
Comparing the Full Model and the Reduced Model
Using Fp
  • Fp = [(SSE(p) − SSE(k)) / (k − p)] / MSE(k), the
    multiple partial F statistic for the k − p
    excluded predictors, with k − p and n − k − 1
    degrees of freedom.
64
Comparing the Full Model and the Reduced Model
Using MSE(p)
  • MSE(p) = SSE(p) / (n − p − 1), the residual mean
    square of the p-predictor model; smaller values
    are better.
65
Comparing the Full Model and the Reduced Model
Using Mallows' Cp
  • Cp = SSE(p) / MSE(k) − [n − 2(p + 1)]; models with
    Cp close to p + 1 show little bias.
66
The Criteria are Intimately Related
67
Step 3: Specifying a Strategy for Selecting
Variables
  • SAS provides nine methods of model selection
    implemented in PROC REG. These methods are
    specified with the SELECTION= option in the MODEL
    statement.

68
Backward Elimination (BACKWARD)
  • The backward elimination technique begins by
    calculating F statistics for a model including all
    of the independent variables. Then variables are
    deleted from the model one by one until all the
    variables remaining in the model produce F
    statistics significant at the SLSTAY= level
    specified in the MODEL statement (or at the 0.10
    level if the SLSTAY= option is omitted). At each
    step, the variable showing the smallest
    contribution to the model is deleted.
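A sketch of the corresponding PROC REG call
(dataset and predictor names hypothetical):

proc reg data=mydata;
  model y = x1-x5 / selection=backward slstay=0.10;  /* delete terms until all meet SLSTAY */
run;
quit;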

69
Steps for Backward Elimination
  • Step 1: Fit the maximum model.
  • Step 2: Produce type II SS or t test results.
  • Step 3: Focus on the term that has the lowest F
    (or highest p-value).
  • If the term is non-significant at the level
    specified by the SLS option, drop the term, refit
    a regression model on the remaining variables, and
    repeat the backward selection procedure.
  • If the term is significant, the backward selection
    procedure ends, and the selected model consists of
    the variables remaining in the model.

70
Other Variable Selection Procedures
  • Forward selection (SELECTION=FORWARD with the
    SLENTRY= option)
  • Stepwise selection (modified forward selection)
  • Chunkwise selection
  • Read pages 393-398.

71
Step 4 Conducting the Analysis Using the
Selected Model
  • Fitting and plotting

72
Step 5: Evaluating Reliability with Split Samples
  • The most compelling way to assess the reliability
    of a chosen model is to conduct a new study and
    test the fit of the chosen model to the new data.
  • This approach is expensive.
  • A split-sample analysis attempts to find the best
    model and assess its reliability. This analysis
    randomly splits the sample into a training group
    and a holdout (or validation) group. Both groups
    should be representative of the parent population.
  • Use the training data to do the model selection.
    Find the R2 for the selected model, denoted R2(1).
  • Find the predicted values for the validation data
    and calculate the square of the correlation
    between the predicted values and the y values of
    the validation data, denoted R2(2).
  • If the shrinkage on cross-validation, R2(1) −
    R2(2), is less than 0.10, the selected model is
    considered reliable; if the shrinkage is greater
    than 0.10, the selected model is not considered
    reliable.
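A sketch of the random 50/50 split (the dataset
name and the seed are arbitrary):

data training holdout;
  set mydata;
  if ranuni(20090212) < 0.5 then output training;  /* model-selection sample */
  else output holdout;                             /* validation sample */
run;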

73
Regression and Analysis of Variance II
  • Chapter 17 One-way Analysis of Variance

74
Homework 5
  • 2, 10(abcdfg), 22

75
Preview
  • Analysis of variance (ANOVA) is a technique for
    assessing how one or several nominal independent
    variables (called factors) affect a continuous
    dependent variable.
  • ANOVA in which only one nominal independent
    variable is involved is called one-way ANOVA.
  • ANOVA is usually employed in comparisons involving
    several population means.
  • ANOVA is an extension of the independent
    two-sample t test.
  • An ANOVA problem can be handled under the
    regression framework.

76
Factors and Levels
  • A nominal independent variable with k categories
    is called a factor with k levels.
  • For example, to compare three different diets, A,
    B, and C, 60 people are available and are
    randomly assigned to the three diets so that each
    treatment group contains 20 people. Here diet is
    the only factor which has 3 levels.
  • How to assign? Put slips with 20 A's, 20 B's, and
    20 C's in a bowl, and have each person pick one.
    A data-step sketch of such a randomization is
    shown below.
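A sketch of the random assignment (the input
dataset subjects, one record per person, is
hypothetical):

data shuffled;
  set subjects;
  u = ranuni(0);   /* random sort key */
run;
proc sort data=shuffled; by u; run;
data assigned;
  set shuffled;
  if _n_ <= 20 then diet = 'A';        /* first 20 in random order get A */
  else if _n_ <= 40 then diet = 'B';
  else diet = 'C';
run;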

77
Example The following example studies the
effect of bacteria on the nitrogen content of red
clover plants. The (treatment) factor is bacteria
strain, and it has six levels. Red clover plants
are inoculated with the treatments, and nitrogen
content is later measured in milligrams.
title1 'Nitrogen Content of Red Clover Plants';
data Clover;
  input Strain $ Nitrogen @@;
datalines;
3DOK1  19.4  3DOK1  32.6  3DOK1  27.0  3DOK1  32.1  3DOK1  33.0
3DOK5  17.7  3DOK5  24.8  3DOK5  27.9  3DOK5  25.2  3DOK5  24.3
3DOK4  17.0  3DOK4  19.4  3DOK4   9.1  3DOK4  11.9  3DOK4  15.8
3DOK7  20.7  3DOK7  21.0  3DOK7  20.5  3DOK7  18.8  3DOK7  18.6
3DOK13 14.3  3DOK13 14.4  3DOK13 11.8  3DOK13 11.6  3DOK13 14.2
COMPOS 17.3  COMPOS 19.4  COMPOS 19.1  COMPOS 16.9  COMPOS 20.8
;
proc print;
proc glm data=Clover;
  class strain;
  model Nitrogen = Strain;
run;
78
Another ANOVA Example
  • (Ch17q01) Five treatments for fever blisters,
    including a placebo, were randomly assigned to 30
    patients. For each of the five treatments, the
    number of days from initial appearance of the
    blisters until healing is given as follows

Data Ch17q01;
  treatment = _n_;      /* one treatment per input line */
  do i = 1 to 6;
    input time @@;
    output;
  end;
  drop i;
cards;
5 8 7 7 10 8
4 6 6 3 5 6
6 4 4 5 4 3
7 4 6 6 3 5
9 3 5 7 7 6
;
Run;

  • The only factor involved here has 5 levels
    (including 1 placebo).
  • Questions of interest:
  • Do the effects of the five treatments differ
    significantly with regard to healing fever
    blisters?
  • Are the 4 active treatments on average more
    effective than the placebo in healing fever
    blisters?
  • Do the effects of the 4 active treatments differ
    significantly with regard to healing fever
    blisters? If yes, which one is most effective?

79
Fixed Versus Random Factors
  • A fixed factor is a factor whose levels are the
    only ones of interest.
  • A random factor is a factor whose levels may be
    regarded as a sample from some large population
    of levels.
  • The distinction is important in statistical
    analysis, since it will affect the variance of
    estimators, thus the results of significance
    tests.
  • Examples of fixed factors: sex, age, marital
    status, education.
  • Examples of random factors: subjects, litters,
    days.
  • Fixed or random? If a factor has many potential
    levels, treat it as random.

80
Fixed-Effects One-Way ANOVA Model
  • A fixed-effects one-way ANOVA model deals with the
    effect of a single fixed factor on a continuous
    response variable.
  • A comparison of population means can be made
    through this model.
  • Four assumptions must be made for the model:
  • An SRS has been selected from each of the k
    populations.
  • The dependent variable is normally distributed in
    each population. (normality)
  • The variance of the dependent variable is the same
    in each population (denoted σ²). (constant
    variance)
  • The k samples are independent. (independence)
  • The fixed-effects one-way ANOVA model is
  • Yij = μ + αi + Eij, i = 1, 2, …, k; j = 1, …, ni,
  • where the effects of the ith treatment, αi, are
    subject to the following constraint:
  • α1 + α2 + ⋯ + αk = 0.

81
The Problem of Interest: Are the Population Means
All Equal?
  • H0: μ1 = μ2 = ⋯ = μk against
  • HA: the k means are not all equal.
  • If H0 is rejected, a follow-up study can be
    conducted to find which means differ from the
    others.

82
Fixed-Effects One-Way ANOVA Model Data
Configuration
83
ANOVA Table for the Fixed-Effects One-way ANOVA
Model

Source    df      SS     MS     F Value    Pr > F
Between   k - 1   SST    MST    MST/MSE
Within    n - k   SSE    MSE
Total     n - 1   SSY

where MST = SST/(k - 1) and MSE = SSE/(n - k).

If the p-value is less than α, then reject the
null hypothesis that all means are the same.
84
Some Useful Results
  • The following can be shown

85
Example (Ch17q01)

Source    df      SS     MS     F Value    Pr > F
Between   4                                0.0136
Within    25
Total     29
86
Regression Approach for the Fixed-Effects One-way
ANOVA Model
  • The fixed-effects one-way ANOVA model can be
    studied under the regression framework.
  • Suppose that the factor has k levels. The idea is
    to create k − 1 dummy variables, X1, X2, …, Xk-1.
    Then the ANOVA model is equivalent to the
    following regression model:
  • Y = μ + α1X1 + α2X2 + ⋯ + αk-1Xk-1 + E
  • There are two ways to code the dummy variables.

87
Effect Coding Method
  • This method codes the dummy variables as follows:
    Xi = 1 if the observation is from population i,
    Xi = −1 if it is from population k, and Xi = 0
    otherwise, for i = 1, 2, …, k − 1.
  • With this coding, μ is the average of the k
    population means and αi = μi − μ.
88
Reference Cell Coding Method
  • Another way to define the k − 1 dummy variables is
    as follows: Xi = 1 if the observation is from
    population i and Xi = 0 otherwise, for
    i = 1, 2, …, k − 1.
  • The kth population becomes the reference cell.
  • With this coding method, the regression
    coefficients are
  • μ = μk and αi = μi − μ = μi − μk, i = 1, 2, …,
    k − 1.
  • The regression model that results from this coding
    method is equivalent to the fixed-effects ANOVA
    model subject to the constraint αk = 0.

89
Which Coding Scheme to Use?
  • The two coding methods produce coefficients (μ
    and α1, α2, …, αk-1) that have different
    interpretations.
  • It does not matter which coding method is used,
    since both produce the same ANOVA table.
  • Also, the differences between any two effects stay
    the same.

90
Multiple-comparison Procedures for Fixed-effects
One-way ANOVA
  • When the test for equal means is significant, our
    next step customarily is to determine which means
    are different.
  • Examples of follow-up tests are
  • H0: μ1 = μ3
  • H0: (μ1 + μ2)/2 = μ3
  • These comparisons may be of interest to us before
    the data are collected, or they may arise in
    completely exploratory studies only after the data
    have been examined.

91
Comparison-wise or Experiment-wise Error Rate
  • Suppose an experimenter wishes to perform m tests,
    each having type I error rate α. That is, each
    test is incorrectly rejected with probability α.
    Let Tj be the event that the jth test is
    incorrectly rejected, j = 1, 2, …, m. Then
    P(Tj) = α. Since P(T1 ∪ T2 ∪ ⋯ ∪ Tm) ≤ Σj P(Tj),
    we have
  • P(T1 ∪ T2 ∪ ⋯ ∪ Tm) ≤ mα.
  • That is, the probability that at least one test is
    incorrectly rejected is at most mα.
  • If one chooses to control the experiment-wise
    error rate at α, the individual type I error rate
    has to be controlled at α/m.

92
Multiple Testing: The Bonferroni Approach
  • When performing m tests simultaneously, we fix the
    overall significance level at α. That is, α is the
    probability of incorrectly rejecting at least one
    of the m tests. To achieve this overall level α,
    we perform each individual test at level α/m.
  • This is the Bonferroni approach.
  • One disadvantage of this approach is that the true
    overall significance level may be considerably
    lower than α; in extreme situations, it may be so
    low that none of the tests will be rejected, thus
    producing too many false negatives (low power).

93
Comparison-wise or Experiment-wise Confidence
Level
  • Suppose an experimenter wishes to calculate m
    confidence intervals, each having a (1 − α)
    confidence level. Then each interval will be
    individually correct with probability 1 − α. Let
    Sj be the event that the jth confidence interval
    is correct and NSj the event that it is incorrect,
    j = 1, 2, …, m. Then P(Sj) = 1 − α and
    P(NSj) = α. Since P(S1 ∩ S2 ∩ ⋯ ∩ Sm) ≥ 1 −
    Σj P(NSj), we have
  • P(S1 ∩ S2 ∩ ⋯ ∩ Sm) ≥ 1 − mα.
  • That is, the probability that the m intervals are
    simultaneously correct is at least 1 − mα.
  • If one chooses to have the experiment-wise
    confidence level at 1 − α, the individual
    confidence level has to be at 1 − α/m.

94
Simultaneous Confidence Intervals: The Bonferroni
Approach
  • When constructing m confidence intervals
    simultaneously for m contrasts, we fix the overall
    confidence level at 1 − α. That is, 1 − α is the
    probability that these confidence intervals
    simultaneously contain the true values of all the
    contrasts considered. To achieve this overall
    confidence level 1 − α, we construct the
    confidence interval for each of the m contrasts at
    level 1 − α/m.
  • This is the Bonferroni approach (known as the
    Bonferroni correction).

95
Contrasts
  • A contrast is defined as a linear function of the
    population means,
  • L = c1μ1 + c2μ2 + ⋯ + ckμk, where the coefficients
    satisfy c1 + c2 + ⋯ + ck = 0.

96
Simultaneous t Confidence Intervals
  • For a single contrast, say, L c1?1 c2 ?2
    ck ?k, one may construct a t confidence interval
    as
  • When constructing confidence intervals for m
    contrasts L1, L2,, and Lm, the intervals are

97
Example 17.5, Page 442
98
Example 17.5, Page 442 (cont'd)
  • We construct simultaneous confidence intervals for
    the following population mean differences (special
    contrasts):

Interpretation: Each of these confidence intervals
has a 1 − 0.05/6 ≈ 0.9917 confidence level, and
they are simultaneously correct with probability at
least 1 − 0.05 = 0.95, which is called the overall
confidence level.
99
Grouping by Simultaneous Confidence Intervals
100
Simultaneous (Infinitely Many) Scheffé Confidence
Intervals
101
Simultaneous Tukey-Kramer Confidence Intervals
  • While Scheffé's approach is general, Tukey's
    method is only for constructing pairwise
    confidence intervals. See page 445.
  • Tukey's method results in shorter confidence
    intervals than Scheffé's method.

102
Which Approach? (Page 453)
  • Scheffé's intervals tend to be long when k, the
    number of population means involved, is large.
    Their length does not depend on m, the number of
    contrasts.
  • The simultaneous t intervals do not depend on k,
    but become longer as m increases.
  • For pairwise comparisons, use the Tukey-Kramer
    method.
  • For planned (a priori) multiple comparisons, use
    Bonferroni's t method. For unplanned (a
    posteriori) multiple comparisons, use Scheffé's
    method.
  • The Bonferroni and Scheffé methods are robust, but
    the Bonferroni method is insensitive (low power)
    and Scheffé's method is the least powerful.

103
SAS Code for Multiple Comparisons of Means

data table17_7;
  do substance = 1 to 4;   /* 4 substances, 10 doses each */
    do i = 1 to 10;
      input dosage @@;
      output;
    end;
  end;
  drop i;
cards;
29 28 23 26 26 19 25 29 26 28
17 25 24 19 28 21 20 25 19 24
17 16 21 22 23 18 20 17 25 21
18 20 25 24 16 20 20 17 19 17
;
Run;

proc glm;
  class substance;
  model dosage = substance;
  means substance / alpha=0.05 Bon Scheffe Tukey;
run;
quit;
104
Orthogonal Contrasts
105
Orthogonal Sample Contrasts and Partitioning the
Treatment Sum of Squares
  • The treatment sum of squares, with k − 1 degrees
    of freedom, can be partitioned into t
    statistically independent sums of squares, each
    with 1 degree of freedom, for t orthogonal sample
    contrasts, where t ≤ k − 1.
  • The maximum number of orthogonal sample contrasts
    is k − 1.

106
Partitioning the Treatment Sum of Squares (cont'd)
107
Testing a Contrast
108
ANOVA Table with Orthogonal Contrasts

Source           df      SS        MS        F Value
contrast 1       1       SS1       MS1       MS1/MSE
contrast 2       1       SS2       MS2       MS2/MSE
...              ...     ...       ...       ...
contrast (k-1)   1       SS(k-1)   MS(k-1)   MS(k-1)/MSE
Error            n - k   SSE       MSE
Total            n - 1   SST

The k - 1 contrast rows partition the model
(treatment) sum of squares.
109
Random-Effects One-way ANOVA Model
  • If the factor is a random factor, then its levels
    should be treated as a random sample from a
    population N(a, σ²A).
  • For the model parameters to be identifiable, set
    a = 0.
  • The model is
  • Yij = μ + Ai + Eij,
  • where the Ai are iid N(0, σ²A) and, independently
    of the Ai, the Eij are iid N(0, σ²), i = 1, 2, …,
    k and j = 1, 2, …, ni.
  • Under this random-effects model, the null
    hypothesis of no factor effect is stated as H0:
    σ²A = 0.
  • This model is also known as the variance
    components model.
110
Some Useful Results
  • The following can be shown

111
For a balanced design, n0 equals the common sample
size.
112
Example: A company supplies a customer with a
large number of batches of raw material. The
customer makes three sample determinations from
each of 5 randomly selected batches to control the
quality of the incoming material. The model is
yij = μ + αi + εij, where the k = 5 levels (the
batches) are chosen at random from a population
with variance σ²A. The data are shown below.

batch:   1    2    3    4    5
        74   68   75   72   79
        76   71   77   74   81
        75   72   77   73   79

A one-way ANOVA can be performed to generate the
following results:
113

ANOVA
Source             DF   SS      MS      EMS
Between (batches)  4    147.74  36.935  σ² + 3σ²A
Within             10   17.99   1.799   σ²
Total              14   165.73

Verify the results. To test that there is no
difference between batches, that is, σ²A = 0,
calculate the F statistic from the ANOVA table:
F = 36.94 / 1.799 = 20.5. If we had chosen an α
value of .01, then the critical F value for 4
numerator and 10 denominator degrees of freedom is
5.99, which indicates rejection. A follow-up study
then proceeds. Since these batches were chosen via
a random selection process, it may be of interest
to find out how much of the variance in the
experiment might be attributed to batch differences
(measured by σ²A) and how much to random error
(measured by σ²). To answer these questions, we
equate the MS column and the EMS column (this is
called the method of moments). The estimates of σ²
and σ²A are 1.80 and 11.71, respectively. The total
variance of an observation is σ² + σ²A, which is
estimated by 1.80 + 11.71 = 13.51. So, 11.71/13.51,
or 86.7%, of the total variance of an observation
is attributable to batch differences and 13.3% to
error variability within the batches. ρ = σ²A /
(σ² + σ²A) is the correlation coefficient within a
batch (the intraclass correlation).
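The expected mean squares can be reproduced with
the RANDOM statement in PROC GLM; a minimal sketch
with the batch data entered by hand (dataset and
variable names assumed):

data batches;
  input batch y @@;
cards;
1 74 1 76 1 75  2 68 2 71 2 72  3 75 3 77 3 77
4 72 4 74 4 73  5 79 5 81 5 79
;
run;
proc glm data=batches;
  class batch;
  model y = batch;
  random batch;   /* prints expected mean squares for the random factor */
run;
quit;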
114
Regression and Analysis of Variance II
  • Chapter 18: Randomized Blocks, a Special Case of
    Two-Way ANOVA

115
Homework 6
  • 3 and 12

116
SAS System Options
  • The OPTIONS statement can appear anywhere in a
    program, except within data or card lines.
  • options linesize=88 pagesize=64;  (portrait)
  • options linesize=179 pagesize=65;  (landscape)
  • options PAGENO=n; starts the output file at page
    number n.
  • Some other options:
  • NONUMBER removes the page numbers from the output
    file.
  • NODATE removes the date and time from the output
    window.
  • NOCENTER left-justifies the output file.
  • SKIP=n tells SAS to skip n lines before printing
    on a page.
  • MISSING='character' specifies the character to
    print for missing numeric values.

117
Two-Way ANOVA
  • Analysis of variance (ANOVA) is a technique for
    assessing how one or several nominal independent
    variables (called factors) affect a continuous
    dependent variable.
  • Two-way ANOVA involves two factors, each at two
    or more levels. The levels of one factor might be
    various teaching methods, for example, and the
    levels of the other factor might be textbooks.
  • If there are I levels of one factor (called the
    row factor) and J of the other (called the column
    factor), there are I x J combinations.
  • Suppose there are nij observations that
    correspond to the ith level of the row factor and
    the jth level of the column factor.

118
Randomized Complete Block Design (RCBD): A Special
Two-Factor Design
  • This design originated in agricultural
    experiments. To compare the effects of I different
    fertilizers, J relatively homogeneous plots of
    land, or blocks, are selected, and each is divided
    into I plots. Within each block, the assignment of
    fertilizers to plots is made at random.
  • This design is a multi-sample generalization of a
    matched-pairs design.
  • Blocks can be litters of animals, batches of raw
    material, individuals, time, gender, age group,
    etc.
  • We will term the other factor the treatment.
  • Block is a nuisance factor, but
  • blocking helps eliminate the effects of a
    confounding factor.

119
Another Example of RCBD
  • A RCBD might be used by a nutritionist who wants
    to compare the effects of three different diets
    on experimental animals.
  • To control for genetic variation in the animals,
    the nutritionist might select three animals from
    each of several litters and randomly determine
    their assignments to the diets. Here, the litters
    are blocks.

120
Data Layout for a RCBD

                        Block
Treatment     1    2   ...   b     total   mean
    1        Y11  Y12  ...  Y1b     T1
    2        Y21  Y22  ...  Y2b     T2
    .         .    .         .      .
    k        Yk1  Yk2  ...  Ykb     Tk
  total      B1   B2   ...  Bb      G
121
Cell Means for a RCBD

                        Block
Treatment     1    2   ...   b     mean
    1        μ11  μ12  ...  μ1b    μ1.
    2        μ21  μ22  ...  μ2b    μ2.
    .         .    .         .      .
    k        μk1  μk2  ...  μkb    μk.
   mean      μ.1  μ.2  ...  μ.b    μ..
122
ANOVA Table for a RCBD
  • Page 493

123
Fixed-Effects ANOVA Model for a RCBD
  • If both factors are considered fixed, a classical
    ANOVA model may be written in terms of these
    effects:
  • Yij = μ + τi + βj + eij, i = 1, 2, …, k;
    j = 1, 2, …, b,
  • where
  • Yij = observed response associated with the ith
    treatment in the jth block
  • μ = overall mean
  • τi = effect of treatment i
  • βj = effect of block j
  • Constraints: Σi τi = 0, Σj βj = 0
  • Model fitting
124
Assumptions Underlying the ANOVA Model
  • Normality of the errors
  • Independence of the errors
  • Homogeneity of variance
  • Additivity of the model, i.e., no interaction

125
Regression Approach to the RCBD
  • Using the effect coding scheme, we can create
    k − 1 dummy variables X1, X2, …, Xk-1 for the k
    treatments and b − 1 dummy variables Z1, Z2, …,
    Zb-1 for the b blocks.
  • A regression model for the RCBD is
  • Y = μ + Σi τiXi + Σj βjZj + e,  e ~ N(0, σ²),
  • with sums over i = 1, …, k − 1 and j = 1, …,
    b − 1.
  • These coefficients correspond to the first k − 1
    treatment effects and the first b − 1 block
    effects; −Σi τi corresponds to the kth treatment
    effect, and −Σj βj corresponds to the bth block
    effect.

126
F Test for the Equality of Treatment Means
  • H0: μ1. = μ2. = ⋯ = μk. against
  • H1: μ1., μ2., …, μk. are NOT all equal.
  • F = MST/MSE,
  • where MST = SST/(k − 1) and MSE =
    SSE/((k − 1)(b − 1)); under H0, F has k − 1 and
    (k − 1)(b − 1) degrees of freedom.

127
Example: Ch18q01, page 501
  • The toxic effects of three chemicals (I, II, and
    III) used in the tire-manufacturing industry were
    investigated in an experiment. Three adjacent
    1-inch squares were marked on the back of each of
    eight rats, and the three chemicals were applied
    separately to the three squares on each rat. The
    squares were then rated from 0 to 10, depending on
    the degree of irritation. The data are as shown
    below.

                          Rat
Chemical    1   2   3   4   5   6   7   8   total
   I        6   9   6   5   7   5   6   6    50
   II       5   9   9   8   8   7   7   7    60
   III      3   4   3   6   8   5   5   6    40
  total    14  22  18  19  23  17  18  19   150

Questions:
(1) What are the blocks and treatments in this
    RCBD?
(2) State an appropriate fixed-effects ANOVA model
    and the regression model for this RCBD.
(3) Do the data provide sufficient evidence to
    indicate a significant difference in the toxic
    effects of the three chemicals?
(4) Find a 98% t confidence interval for the true
    difference in the toxic effects of chemicals I
    and II.
(5) What proportion of the total variation is
    explained by the model?
(6) State the assumptions on which the validity of
    the analysis depends.
128
SAS Code

data Ch18q01;
  do Chemical = 1 to 3; /* or: do Chemical = 'I','II','III'; */
    do Rat = 1 to 8;
      input Rate @@;
      output;
    end;
  end;
cards;
6 9 6 5 7 5 6 6
5 9 9 8 8 7 7 7
3 4 3 6 8 5 5 6
;
run;
proc glm;
  class Chemical Rat;
  model Rate = Chemical Rat;
run;
quit;
129
Regression and Analysis of Variance II
  • Chapter 19: Two-Way ANOVA with Equal Cell Numbers

130
Homework 7
  • 2, 3, 8, and 14

131
Two-Way ANOVA
  • Two-way ANOVA involves two factors, each at two or
    more levels.
  • If there are I levels of one factor (called the
    row factor) and J of the other (called the column
    factor), there are I x J combinations.
  • Suppose there are nij observations that correspond
    to the ith level of the row factor and the jth
    level of the column factor.
  • In Chapter 18, we considered the RCBD, where all
    nij = 1. Instead of considering interaction
    models, we considered additive models. The reason
    is this: if we considered interaction, then the
    errors would all be zero and inference would be
    impossible, although some parameters could still
    be estimated. With additive models, the error term
    actually represents interaction (so interaction is
    assumed small and ignorable).
  • In this chapter, we consider two-way ANOVA where
    all nij = n and n > 1. Interaction between the
    factors will be modeled.
  • Although two-way balanced layouts are rarely seen
    in observational studies, they are generated
    intentionally in controlled experiments, where the
    investigator chooses the levels of the factors and
    the allocation of subjects to the various factor
    combinations.

132
Data Layout for Two-Way ANOVA with Equal Cell
Numbers
  • Page 519

133
The Reasons for Replication
  • There are two major reasons for having more than
    one observation at each combination of factor
    levels:
  • To be able to compute a pure estimate of the
    experimental error (σ²)
  • To detect an interaction effect between the two
    factors

134
Simple Effects, Main Effects, and Interactions
  • The first step in examining a two-way layout
    should always be to construct a table of cell
    means (along with row and column means).
  • The difference between two row (column) means,
    holding the column (row) factor constant at a
    given level, measures the simple effect of the row
    (column) factor.
  • The difference between two row means (or two colum