Title: Biostatistics and Computer Applications
1 Biostatistics and Computer Applications
- ANOVA of hierarchical data
- Experimental design
- ANOVA of common designs
- SAS programming
- 1/6/2003
2 Recap (Analysis of Variance)
- Analysis of variance
- One-way ANOVA
- Two-way ANOVA
- Multiple comparisons
3 Recap (Data for One-Way ANOVA)
k independent samples, n observations each, kn observations in total
4 Recap (One-way ANOVA)
5 Recap (Two-way ANOVA data)
6 Recap (Two-way ANOVA)
7 Recap (Multiple comparisons)
- PLSD (LSD, t test) method.
- Confidence interval (1-alpha)
8 Hierarchical (Nested classification) data
- If the experimental data have l groups, each group has u subgroups, each subgroup has v sub-subgroups, ..., and each of the last sub-subgroups has n observations, we call the data hierarchical (nested classification) data.
- The simplest case is two-level hierarchical data. It contains l groups, each group has m subgroups, and each subgroup has n observations. The total number of observations is lmn.
- Example: College -> year -> major -> student
9 Hierarchical data table (i = 1, 2, ..., l; j = 1, 2, ..., m; k = 1, 2, ..., n)
10 Linear mathematical model for hierarchical data
11 Linear mathematical model for hierarchical ANOVA
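The model formula itself did not survive on these slides. A standard way to write the two-level nested model is sketched below. The Greek letters are an assumption, chosen to line up with the SSt, SSd and SSe labels of the ANOVA summary table, and are not necessarily the lecturer's notation.

```latex
y_{ijk} = \mu + \tau_i + \delta_{ij} + \varepsilon_{ijk},
\qquad i = 1,\dots,l;\quad j = 1,\dots,m;\quad k = 1,\dots,n
```

Here \mu is the overall mean, \tau_i the effect of group i, \delta_{ij} the effect of subgroup j within group i, and \varepsilon_{ijk} the random error of observation k, usually assumed to be independent N(0, \sigma^2).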
12 ANOVA Total Variation Partitioning
- Total variation: SS(Total), written SST
- Variation due to group: SSt
- Variation due to subgroup within group: SSd
- Variation due to random sampling (error): SSe
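As a sketch of how this partition is usually written for balanced two-level data (l groups, m subgroups per group, n observations per subgroup), the sums of squares can be expressed with group, subgroup and grand means. The dot notation for means is an assumption introduced here, not taken from the slides.

```latex
\begin{aligned}
SS_T &= \sum_{i=1}^{l}\sum_{j=1}^{m}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{\cdot\cdot\cdot})^2 \\
SS_t &= mn \sum_{i=1}^{l} (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2 \\
SS_d &= n \sum_{i=1}^{l}\sum_{j=1}^{m} (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot})^2 \\
SS_e &= \sum_{i=1}^{l}\sum_{j=1}^{m}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij\cdot})^2 \\
SS_T &= SS_t + SS_d + SS_e
\end{aligned}
```

The degrees of freedom split the same way: lmn - 1 = (l - 1) + l(m - 1) + lm(n - 1).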
13 ANOVA Summary Table
Source of Variation     Degrees of Freedom   Sum of Squares   Mean Square   F
Group                   l - 1                SSt              MSt           MSt / MSd
Subgroup within group   l(m - 1)             SSd              MSd           MSd / MSe
Error                   lm(n - 1)            SSe              MSe
Total                   lmn - 1              SST
14 Expected mean square of hierarchical ANOVA
Source of Variation     Degrees of Freedom   Mean Square   Expected Mean Square
Group                   l - 1                MSt
Subgroup within group   l(m - 1)             MSd
Error                   lm(n - 1)            MSe
Total                   lmn - 1
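The expected mean square expressions are missing from the table. For reference, the textbook forms for a balanced design are sketched below, assuming fixed group effects and random subgroup effects. That assumption, and the symbols \sigma^2, \sigma_\delta^2 and \kappa_\tau^2, are introduced here and may differ from what the slide showed.

```latex
\begin{aligned}
E(MS_t) &= \sigma^2 + n\,\sigma_\delta^2 + mn\,\kappa_\tau^2,
  \qquad \kappa_\tau^2 = \frac{\sum_{i=1}^{l} \tau_i^2}{l - 1} \\
E(MS_d) &= \sigma^2 + n\,\sigma_\delta^2 \\
E(MS_e) &= \sigma^2
\end{aligned}
```

Under these expectations, MSt / MSd isolates the group term and MSd / MSe isolates the subgroup term, which matches the F ratios in the ANOVA summary table.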
15 ANOVA Null Hypotheses
- 1. No difference in means due to group
- H01: μ1 = μ2 = ... = μl
- 2. No difference in means due to subgroup within group
- H02: μ11 = μ12 = ... = μlm
- If H02 is accepted, test H01.
16 Example of hierarchical ANOVA
Lead concentrations were measured in 4 vegetables after the soil was supplied with a pesticide. Each vegetable was planted in 3 pots with contaminated soil. There were 5 plants per pot.
17 Result of ANOVA Table
Source of Variation    Degrees of Freedom   Sum of Squares   Mean Square   F
Vegetable              3                    76.74            25.58         405.80
Pot within vegetable   8                    0.63             0.078         1.31
Error                  48                   2.90             0.060
Total                  59                   80.27
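The sums of squares in this table can be checked by hand from the data listed in the SAS program. The following is a minimal Python sketch for balanced two-level nested data, not the lecturer's code. The function name `nested_anova` and the dictionary layout are illustrative choices. It also reports the group F ratio against two candidate error terms, because the tabled value of 405.80 appears to correspond to a pooled error (pot plus residual, 56 df) rather than to MSd alone; that reading is an assumption.

```python
# Lead concentrations copied from the DATALINES of the SAS program:
# vegetable -> 3 pots -> 5 plants per pot.
data = {
    "A": [[0.7, 0.6, 0.9, 0.5, 0.6], [0.9, 0.9, 0.7, 1.1, 0.7], [0.8, 0.6, 0.9, 1.0, 0.8]],
    "B": [[1.2, 1.4, 1.6, 1.2, 1.5], [1.1, 0.9, 1.3, 1.2, 1.0], [1.5, 1.4, 0.9, 1.3, 1.6]],
    "C": [[0.6, 0.6, 0.8, 0.9, 0.7], [0.5, 0.8, 0.9, 1.0, 0.6], [0.6, 1.2, 0.8, 0.9, 1.0]],
    "D": [[4.2, 3.7, 2.9, 3.5, 3.6], [2.9, 3.5, 3.8, 3.1, 3.5], [3.6, 3.5, 4.0, 3.3, 3.7]],
}


def nested_anova(data):
    """Two-level nested ANOVA for balanced data {group: [[obs, ...], ...]}."""
    groups = list(data.values())
    l, m, n = len(groups), len(groups[0]), len(groups[0][0])
    values = [y for g in groups for sub in g for y in sub]
    grand_mean = sum(values) / (l * m * n)

    ss_group = ss_sub = ss_err = 0.0
    for g in groups:
        g_mean = sum(sum(sub) for sub in g) / (m * n)
        ss_group += m * n * (g_mean - grand_mean) ** 2
        for sub in g:
            s_mean = sum(sub) / n
            ss_sub += n * (s_mean - g_mean) ** 2
            ss_err += sum((y - s_mean) ** 2 for y in sub)

    df_group, df_sub, df_err = l - 1, l * (m - 1), l * m * (n - 1)
    ms_group = ss_group / df_group
    ms_sub = ss_sub / df_sub
    ms_err = ss_err / df_err
    # ASSUMPTION: pooled error = subgroup + residual, used when H02 is accepted.
    ms_pooled = (ss_sub + ss_err) / (df_sub + df_err)
    return {
        "ss_group": ss_group,
        "ss_sub": ss_sub,
        "ss_err": ss_err,
        "ss_total": ss_group + ss_sub + ss_err,
        "f_sub": ms_sub / ms_err,
        "f_group_vs_sub": ms_group / ms_sub,
        "f_group_vs_pooled": ms_group / ms_pooled,
    }


result = nested_anova(data)
for name, value in result.items():
    print(f"{name:>18}: {value:.4f}")
```

The three sums of squares round to the tabled 76.74, 0.63 and 2.90. Small differences in the F ratios relative to the table come from the table's use of rounded mean squares.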
18 Multiple comparison
- Tukey fixed range method
- k = 4, df = 60 (56), q0.05 = 3.74 (table)
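As a rough illustration of how the tabled q value is used, the sketch below computes a Tukey least significant range, LSR = q * sqrt(MS_error / n), and compares every pair of vegetable means. Two things are assumptions here: the error mean square is taken as the pooled pot-plus-residual value with 56 df (suggested by the df on this slide), and the vegetable totals are the sums of the 15 observations per vegetable in the SAS DATALINES.

```python
import math
from itertools import combinations

q_crit = 3.74                          # q(0.05; k = 4, df = 60), as read from the slide
ms_error = (0.63 + 2.90) / (8 + 48)    # ASSUMPTION: pooled error mean square, 56 df
n_per_mean = 15                        # 3 pots x 5 plants per vegetable

# Vegetable totals summed from the DATALINES, converted to means.
totals = {"A": 11.7, "B": 19.1, "C": 11.9, "D": 52.8}
means = {veg: total / n_per_mean for veg, total in totals.items()}

# Least significant range: any pair of means further apart than this differs.
lsr = q_crit * math.sqrt(ms_error / n_per_mean)

significant = {
    (a, b): abs(means[a] - means[b]) > lsr
    for a, b in combinations(sorted(means), 2)
}

print(f"LSR = {lsr:.3f}")
for (a, b), flag in significant.items():
    diff = abs(means[a] - means[b])
    print(f"{a} vs {b}: difference = {diff:.3f}, significant = {flag}")
```

With these inputs only the A versus C comparison falls below the range. A different choice of error term, such as MS of pot within vegetable with 8 df, would need a different q value from the table.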
19 SAS program
- DATA Hierarchic;
- INPUT vegetable $ pot @;
- DO k = 1 TO 5;
- INPUT concentration @;
- OUTPUT;
- END;
- DATALINES;
- A 1 0.7 0.6 0.9 0.5 0.6
- A 2 0.9 0.9 0.7 1.1 0.7
- A 3 0.8 0.6 0.9 1.0 0.8
- B 1 1.2 1.4 1.6 1.2 1.5
- B 2 1.1 0.9 1.3 1.2 1.0
- B 3 1.5 1.4 0.9 1.3 1.6
- C 1 0.6 0.6 0.8 0.9 0.7
- C 2 0.5 0.8 0.9 1.0 0.6
- C 3 0.6 1.2 0.8 0.9 1.0
- D 1 4.2 3.7 2.9 3.5 3.6
- D 2 2.9 3.5 3.8 3.1 3.5
- D 3 3.6 3.5 4.0 3.3 3.7
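- ;
- PROC ANOVA;
- CLASS vegetable pot;
- MODEL concentration = vegetable pot(vegetable);
- TEST H=vegetable E=pot(vegetable);
- MEANS vegetable / TUKEY E=pot(vegetable);
- RUN;
We test the vegetable effect using the mean square of pot within vegetable as the error term.
We use two INPUT statements: the first reads vegetable and pot and holds the line with a trailing @, and the second reads the five concentrations inside the DO loop.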
20 Experimental design
- Experimental design is a planned interference in the natural order of events by the researcher.
- Why design?
- To make inferences about what produced, contributed to, or caused events.
- To gain such information without ambiguity.
21 Experimental design
- Terminology
- Experimental (and environmental) factor: a variable of specific experimental interest, for example fertilizer type or amount of nutrient.
- Treatment: a level of an experimental factor, or a combination of levels in a multi-factor experiment. Level refers to the degree or intensity of a factor.
- Random: refers to the property of completely chance events that are not predictable. Randomization eliminates systematic influence upon assignment.
- Control: refers to a group not exposed to the treatment.
- Block: refers to categories of subjects within a treatment group. Within a block, environmental factors are homogeneous.
22 Experimental design
- Principles of experimental design
- 1. Randomization.
- Assign treatments to each unit (plot) randomly (with the same probability).
- Provides an unbiased estimate of error (normal distribution).
- 2. Replication.
- Estimate random error.
- Increase the precision of the estimation.
- 3. Block control.
- One set of experiments with similar environmental conditions.
- Further decrease the standard error by separating the block effect.
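The randomization and block-control principles can be sketched in a few lines of Python: every block receives each treatment exactly once, in an independently shuffled order. The function name, the seed value and the choice of three blocks are illustrative assumptions. The five treatment labels A to E follow the labels on the slide.

```python
import random


def randomized_block_layout(treatments, n_blocks, seed=None):
    """Return one independently shuffled treatment order per block."""
    rng = random.Random(seed)      # a seed only makes the layout reproducible
    layout = []
    for _ in range(n_blocks):
        order = list(treatments)   # every block contains every treatment once
        rng.shuffle(order)         # each ordering is equally likely
        layout.append(order)
    return layout


layout = randomized_block_layout(["A", "B", "C", "D", "E"], n_blocks=3, seed=2003)
for number, block in enumerate(layout, start=1):
    print(f"Block {number}: {' '.join(block)}")
```

Replication shows up as the number of blocks. Dropping the block structure and shuffling all plots together would give a completely randomized layout instead.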
23 Experimental design
- According to number of factors
- Single factor experiment
- Detects the simple effect of one experimental factor
- Easy to apply and analyze.
- Multiple factors experiment
- 2, 3 factors and more
- Detects both main effects and interactions
- Lower standard error, so smaller true effects are easier to find
- Data are more difficult to analyze.
24 Experimental effects
- Experimental effects
- Simple effect: the change in response produced by a change in the level of a factor
- Main effect: the mean of the simple effects
- Interaction: the change in response caused by the interaction of experimental factors
- Example
- Test the effects of N and P on wheat yield, with two levels for N (n1, n2) and two levels for P (p1, p2). Yields (kg/plot) are shown in the table.
25 Experimental effects
No interaction!
Simple effect, main effect, interaction
26 Experimental effects
Positive interaction!
Simple effect, main effect, interaction
27 Experimental effects
Negative interaction!
Simple effect, main effect, interaction
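The yield tables for these three slides were not preserved, so the sketch below uses hypothetical yields to show how simple effects, main effects and the interaction are computed in a 2 x 2 experiment. The numbers are invented for illustration. Defining the interaction as half the difference between the two simple effects of N is one common convention and is assumed here.

```python
# HYPOTHETICAL yields (kg/plot); these are not the values from the slides.
yields = {
    ("n1", "p1"): 10.0, ("n2", "p1"): 16.0,
    ("n1", "p2"): 18.0, ("n2", "p2"): 28.0,
}


def factorial_effects(y):
    """Simple effects, main effects and interaction for a 2 x 2 factorial."""
    n_at_p1 = y[("n2", "p1")] - y[("n1", "p1")]   # simple effect of N at p1
    n_at_p2 = y[("n2", "p2")] - y[("n1", "p2")]   # simple effect of N at p2
    p_at_n1 = y[("n1", "p2")] - y[("n1", "p1")]   # simple effect of P at n1
    p_at_n2 = y[("n2", "p2")] - y[("n2", "p1")]   # simple effect of P at n2
    return {
        "simple_N": (n_at_p1, n_at_p2),
        "simple_P": (p_at_n1, p_at_n2),
        "main_N": (n_at_p1 + n_at_p2) / 2,        # mean of the simple effects of N
        "main_P": (p_at_n1 + p_at_n2) / 2,        # mean of the simple effects of P
        "interaction": (n_at_p2 - n_at_p1) / 2,   # assumed convention, see lead-in
    }


effects = factorial_effects(yields)
for name, value in effects.items():
    print(f"{name}: {value}")
```

An interaction of zero means the simple effects of N are the same at both P levels. A positive value means N helps more at p2 than at p1, and a negative value means the opposite.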
28 Experimental design
- According to unit arrangement
- Completely randomized experiment
- Random, replicate
- One or multiple factors
- Easy to apply and analyze
- Randomized block experiment
- Random, replicate and block control
- One or multiple factors
- Most commonly used
- Latin square experiment
- Random, replicate and block control on row and column
- One or more factors, but the number of treatments is limited (k = 5 to 10)
- Split plot experiment
- Special requirement for different factors
- Different precisions for factors
- Multiple factors only.
29 Completely randomized experiment
- One factor experiment
- This is exactly the same as one-way ANOVA.
- Multiple factors experiment
- Similar to the randomized block experiment; just remove the block effect from the model, as shown next.
30 Randomized block experiment (one factor)
- Example: We want to compare the yield of 7 barley varieties in a randomized block design with 3 replicates. Plots and yields per plot are shown below.
31 Randomized block experiment (one factor)
- DATA rbe1;
- INPUT block $ variety $ yield;
- DATALINES;
- I F 20
- I A 24
- I E 22
- I D 18
- I C 21
- I G 20
- I B 20
- II A 20
- II D 16
- II C 19
- II F 21
- II B 19
- II E 20
- II G 19
- III F 21
- III B 21
- ;
- PROC ANOVA;
- CLASS block variety;
- MODEL yield = variety block;
- MEANS variety / LSD ALPHA=0.05;
- RUN;
Here we are interested in the effect of variety, not block. Block is used to decrease the standard error. If the block effect is not significant, there is no large difference in environmental factors among blocks. If the block effect is significant, we are happy that we separated this effect from the model error. We do not do multiple comparisons for block.
32 Randomized block experiment (two factors)
- Example: We want to test the N and P effects on plant yield, with three levels for N (0, 5, 10 kg) and five levels for P (0, 2, 4, 6, 8 kg), giving 15 treatments in total. Randomized block design, replicated twice.
- DATA rbe2;
- INPUT block N $ 5-6 P $ 7-8 yield;
- DATALINES;
- 1   A2B2 5.0
- 1   A2B4 4.9
- 1   A1B1 4.3
- 1   A3B2 4.4
- 1   A1B5 4.7
- 1   A2B1 5.2
We use column input to read N (columns 5-6) and P (columns 7-8) as character values, so the data lines must be aligned to those columns.
33 Randomized block experiment (two factors)
- DATA rbe2;
- INPUT block N $ 5-6 P $ 7-8 yield;
- DATALINES;
- 1   A2B2 5.0
- 1   A2B4 4.9
- 1   A1B1 4.3
- 1   A3B2 4.4
- 1   A1B5 4.7
- 1   A2B1 5.2
- 1   A3B4 3.4
- 1   A1B4 4.8
- 1   A3B5 3.7
-
- 2   A2B3 3.4
- 2   A3B1 4.7
- 2   A3B3 3.4
- 2   A2B2 5.2
- 2   A1B4 4.0
- 2   A3B5 4.2
- ;
- PROC ANOVA;
- CLASS block n p;
- MODEL yield = n p n*p block;
- MEANS n p n*p / T;
- RUN;
If the interaction (n*p) is not significant, the best combination is simply the N level with the highest mean yield together with the P level with the highest mean yield. Otherwise, you need to compare the means of the N*P combinations.
34 Latin square experiment
- Test the effect of five N treatments (0 kg, 10 kg, 15 kg, 20 kg, 25 kg) on wheat yield using a Latin square design. (Treatment codes: 1 = 0 kg, 2 = 10 kg, 3 = 15 kg, 4 = 20 kg, 5 = 25 kg.)
35 Latin square design
- DATA latin;
- DO row = 1 TO 5;
- DO column = 1 TO 5;
- INPUT treatment yield @@;
- OUTPUT;
- END;
- END;
- DATALINES;
- 3 10.1 1 7.9 2 9.8 5 7.1 4 9.6
- 1 7.0 4 10.0 5 7.0 3 9.7 2 9.1
- 5 7.6 3 9.7 4 10.0 2 9.3 1 6.8
- 4 10.5 2 9.6 3 9.8 1 6.6 5 7.9
- 2 8.9 5 8.9 1 8.6 4 10.6 3 10.1
- ;
- PROC ANOVA;
- CLASS row column treatment;
- MODEL yield = treatment row column;
- MEANS treatment / T ALPHA=0.05;
- MEANS treatment / T ALPHA=0.01;
- RUN;
We focus on the treatment effect only.
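To make the Latin square analysis concrete, the Python sketch below takes the 5 x 5 layout from the DATALINES and checks that each treatment occurs once per row and once per column. It then partitions the total sum of squares into row, column, treatment and error parts. It is a hand calculation under the usual additive Latin square model, offered as an illustration and not as a reproduction of PROC ANOVA output.

```python
# (treatment code, yield) for each plot, rows and columns as in the DATALINES.
layout = [
    [(3, 10.1), (1, 7.9), (2, 9.8), (5, 7.1), (4, 9.6)],
    [(1, 7.0), (4, 10.0), (5, 7.0), (3, 9.7), (2, 9.1)],
    [(5, 7.6), (3, 9.7), (4, 10.0), (2, 9.3), (1, 6.8)],
    [(4, 10.5), (2, 9.6), (3, 9.8), (1, 6.6), (5, 7.9)],
    [(2, 8.9), (5, 8.9), (1, 8.6), (4, 10.6), (3, 10.1)],
]
k = len(layout)
codes = list(range(1, k + 1))

# Latin square property: every treatment once in each row and each column.
rows_ok = all(sorted(t for t, _ in row) == codes for row in layout)
cols_ok = all(sorted(layout[r][c][0] for r in range(k)) == codes for c in range(k))

grand_total = sum(y for row in layout for _, y in row)
correction = grand_total ** 2 / (k * k)

row_totals = [sum(y for _, y in row) for row in layout]
col_totals = [sum(layout[r][c][1] for r in range(k)) for c in range(k)]
trt_totals = {t: 0.0 for t in codes}
for row in layout:
    for t, y in row:
        trt_totals[t] += y

ss_total = sum(y * y for row in layout for _, y in row) - correction
ss_row = sum(t * t for t in row_totals) / k - correction
ss_col = sum(t * t for t in col_totals) / k - correction
ss_trt = sum(t * t for t in trt_totals.values()) / k - correction
ss_err = ss_total - ss_row - ss_col - ss_trt

df_trt, df_err = k - 1, (k - 1) * (k - 2)
f_trt = (ss_trt / df_trt) / (ss_err / df_err)

print("valid Latin square:", rows_ok and cols_ok)
print({t: round(total, 1) for t, total in trt_totals.items()})
print(f"SS treatment = {ss_trt:.4f}, SS error = {ss_err:.4f}, F = {f_trt:.2f}")
```

The treatment F ratio uses 4 and 12 degrees of freedom. Row and column sums of squares are removed from the error in the same way that the block effect is removed in a randomized block design.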
36 Split plot experiment
- To test the effect of warming on plant growth, we set four levels of increased temperature (A1: 3°, A2: 2°, A3: 1°, and A4: 0°, the control). We also test the effect of clipping: within each warming plot, we set three clipping levels (B1: clipping twice, in summer and winter; B2: clipping once, in winter; B3: no clipping). Test the warming and clipping effects.
37 Split plot experiment
- DATA splitplot;
- INPUT block 1 warming $ 2-3 clipping $ 5-6 yield;
- DATALINES;
- 1A3 B2 20
- 1A3 B1 18
- 1A3 B3 18
- 1A2 B3 20
- 1A2 B1 24
-
- 3A3 B3 18
- 3A3 B2 18
- 3A2 B3 23
- 3A2 B2 22
- 3A2 B1 25
- ;
- PROC ANOVA;
- CLASS block warming clipping;
- MODEL yield = block warming block*warming clipping warming*clipping;
- TEST H=warming block E=block*warming;
- MEANS warming / LSD E=block*warming;
- MEANS clipping / LSD CLDIFF;
- RUN;
We use different error terms for warming and clipping, both in the F tests and in the multiple comparisons.
38 What if data are missing? (PROC GLM)
- The GLM procedure uses the method of least squares to fit general linear models. It is a powerful procedure: you can perform regression, analysis of variance, analysis of covariance, multivariate analysis of variance, and partial correlation using PROC GLM.
- With PROC GLM, you can relate one or several continuous dependent variables to one or several independent variables. The independent variables may be either classification variables, which divide the observations into discrete groups, or continuous variables.
- For normal balanced data, you may use PROC ANOVA. For unbalanced data, you should use PROC GLM.
39 Deal with missing data
- PROC GLM <options>;
- CLASS variables;
- MODEL dependents = independents </ options>;
- TEST <H=effects> E=effect </ options>;
- MEANS effects </ options>;
- LSMEANS effects </ options>;
- OUTPUT <OUT=SAS-data-set> keyword=names <... keyword=names> </ option>;
- RANDOM effects </ options>;
40 Deal with missing data
- DATA unbalanced;
- DO variety = 1 TO 2;
- DO fertilizer = 1 TO 3;
- DO block = 1 TO 3;
- INPUT yield @@;
- OUTPUT;
- END;
- END;
- END;
- DATALINES;
- 7 6 8
- . 9 10
- 5 4 3
- 6 6 7
- 8 . 9
- 7 6 5
- ;
- PROC GLM;
- CLASS variety fertilizer block;
- MODEL yield = block variety fertilizer variety*fertilizer;
- MEANS variety fertilizer / LSD LINES;
- LSMEANS variety fertilizer / TDIFF;
- RUN;
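To see why this data set is unbalanced, the Python sketch below mimics the three nested DO loops of the DATA step: values are consumed in the order variety, fertilizer, block, and a period marks a missing yield. It then lists the missing cells and counts the observations left in each variety by fertilizer combination. The variable names are illustrative.

```python
# Yields exactly as listed in the DATALINES; "." is the SAS missing value.
raw = """
7 6 8
. 9 10
5 4 3
6 6 7
8 . 9
7 6 5
"""

tokens = iter(raw.split())
records = []
for variety in (1, 2):                 # same loop order as the SAS DO loops
    for fertilizer in (1, 2, 3):
        for block in (1, 2, 3):
            token = next(tokens)
            value = None if token == "." else float(token)
            records.append((variety, fertilizer, block, value))

missing = [(v, f, b) for v, f, b, y in records if y is None]

# Observations available per variety x fertilizer cell.
cell_counts = {}
for v, f, b, y in records:
    if y is not None:
        cell_counts[(v, f)] = cell_counts.get((v, f), 0) + 1

print("missing (variety, fertilizer, block):", missing)
print("observations per cell:", cell_counts)
```

Two cells end up with two observations in place of three. Unequal cell sizes like these are the situation in which PROC GLM is recommended over PROC ANOVA.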