Title: 5. Analysis of Variance IV / Análisis de Varianza IV
1. 5. Analysis of Variance IV / Análisis de Varianza IV
- Profesor Simon Wilson
- Departamento de Estadística y Econometría
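2. Estimating μ when H0 is accepted
- If we accept H0, then all levels have a common mean μ.
- Its estimate is the overall sample mean, $\hat{\mu} = \bar{y}_{..} = \frac{1}{n} \sum_{i=1}^{I} \sum_{j=1}^{n_i} y_{ij}$.
- Then we can estimate σ² by the overall sample variance, $\hat{s}_y^2 = \frac{1}{n-1} \sum_{i=1}^{I} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2$.
- This has n - 1 degrees of freedom.
- Use this to construct confidence intervals for μ and σ² in the usual manner.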
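As a rough illustration of the pooled estimates described above, the following Python sketch pools all levels and computes the overall mean and the variance with divisor n - 1. The observations and the names `groups`, `mu_hat` and `s2_hat` are hypothetical and are not taken from the course data.

```python
from statistics import mean, variance

# Hypothetical observations for I = 3 levels (illustrative values only).
groups = {
    "A": [10.0, 12.0, 11.0, 13.0],
    "B": [11.0, 14.0, 12.0, 13.0],
    "C": [12.0, 10.0, 13.0, 11.0],
}

# Under H0 every observation has the same mean, so all levels are pooled.
pooled = [y for ys in groups.values() for y in ys]
n = len(pooled)

mu_hat = mean(pooled)      # overall sample mean
s2_hat = variance(pooled)  # sample variance, divisor n - 1
dof = n - 1                # degrees of freedom of the variance estimate

print(f"n = {n}, mu_hat = {mu_hat:.3f}, s2_hat = {s2_hat:.3f}, dof = {dof}")
```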
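3. The coefficient of determination
- This is a measure of the relative size of the variability explained by the groups / una medida relativa de la variabilidad explicada por los grupos.
- It is a number between 0 and 1.
- $R^2 = \frac{\sum_{i=1}^{I} n_i (\bar{y}_{i.} - \bar{y}_{..})^2}{\sum_{i=1}^{I} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2}$, the variability explained by the groups divided by the total variability.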
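A minimal sketch of this ratio, assuming the usual definition of the coefficient of determination in one-way analysis of variance (variability explained by the groups divided by total variability). The data are hypothetical.

```python
from statistics import mean

# Hypothetical observations for three groups (illustrative values only).
groups = [
    [10.0, 12.0, 11.0, 13.0],
    [15.0, 14.0, 16.0, 15.0],
    [20.0, 19.0, 21.0, 20.0],
]

all_obs = [y for ys in groups for y in ys]
grand_mean = mean(all_obs)

# Variability explained by the groups: squared distance of each group
# mean from the grand mean, weighted by the group size.
explained = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups)

# Total variability of the observations around the grand mean.
total = sum((y - grand_mean) ** 2 for y in all_obs)

r_squared = explained / total
print(f"R^2 = {r_squared:.4f}")
```

A value close to 1 indicates that most of the variability is between groups; a value close to 0 indicates that the groups explain very little.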
4. The efficiency / eficacia of the F test (1)
- How good is this test?
- To simplify, assume that each group has m observations (so n = mI). Then $F = m \, s_m^2 / \hat{s}_R^2$,
- where $s_m^2$ is the variance between the group means and $\hat{s}_R^2$ is the residual variance.
- Clearly, for a given spread of the group means, F increases as m increases, so we are more likely to reject H0 as we increase m.
5. The efficiency of the F test (2)
- On the other hand, keeping m fixed, F increases if the denominator decreases.
- This happens if the experimental error σ is smaller (since the denominator estimates σ²).
- So the power of the F test increases when we
- increase the number of observations in each group
- are able to reduce the experimental error.
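The two effects can be checked numerically. The sketch below assumes a balanced design and writes the F statistic as m times the variance between the group means, divided by the pooled within-group variance. The function name `f_statistic` and the data are hypothetical.

```python
from statistics import mean, variance

def f_statistic(groups):
    """One-way ANOVA F for a balanced design: m * s_m^2 / s_R^2."""
    m = len(groups[0])
    if any(len(g) != m for g in groups):
        raise ValueError("this sketch assumes the same group size m everywhere")
    group_means = [mean(g) for g in groups]
    s_m2 = variance(group_means)              # variance between means (divisor I - 1)
    s_r2 = mean(variance(g) for g in groups)  # pooled within-group variance (balanced case)
    return m * s_m2 / s_r2

# Hypothetical data: three groups with m = 4 observations each.
base = [
    [10.0, 12.0, 11.0, 13.0],
    [15.0, 14.0, 16.0, 15.0],
    [20.0, 19.0, 21.0, 20.0],
]
f_base = f_statistic(base)

# More observations per group with the same group means: F grows.
f_more_data = f_statistic([g + g for g in base])

# Same group means, deviations within each group halved: F grows again.
shrunk = [[mean(g) + 0.5 * (y - mean(g)) for y in g] for g in base]
f_less_error = f_statistic(shrunk)

print(f_base, f_more_data, f_less_error)
```

Halving the within-group deviations divides the denominator by 4, so F is multiplied by 4 while the numerator is unchanged.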
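6. Analysis of the Difference in Means
- Let us suppose that we have rejected the null hypothesis. We believe that the means of the groups are different.
- The 100(1 - α)% confidence interval for the difference between two means μi and μk is $(\bar{y}_{i.} - \bar{y}_{k.}) \pm t_{\alpha/2} \, \hat{s}_R \sqrt{\frac{1}{n_i} + \frac{1}{n_k}}$, where $\hat{s}_R^2$ is the residual variance.
- Note: $t_{\alpha/2}$ is the upper α/2 point of the t distribution with n - I degrees of freedom.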
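A small sketch of this interval, assuming the standard form: difference of the two group means, plus or minus the t value times the residual standard deviation times the square root of 1/n_i + 1/n_k. The critical value is passed in from tables; 2.179 is the tabulated upper 2.5% point of the t distribution with 12 degrees of freedom. All other numbers and the name `diff_ci` are hypothetical.

```python
from math import sqrt

def diff_ci(mean_i, mean_k, n_i, n_k, s2_resid, t_crit):
    """Confidence interval for the difference between two group means.

    s2_resid is the residual variance estimate and t_crit is the t-table
    value for n - I degrees of freedom at the chosen confidence level.
    """
    half_width = t_crit * sqrt(s2_resid * (1.0 / n_i + 1.0 / n_k))
    diff = mean_i - mean_k
    return diff - half_width, diff + half_width

# Hypothetical example: I = 3 groups of 5 observations, so n - I = 12.
lower, upper = diff_ci(mean_i=15.0, mean_k=11.0, n_i=5, n_k=5,
                       s2_resid=9.0, t_crit=2.179)
print(f"95% CI for the difference: ({lower:.3f}, {upper:.3f})")
```

If the interval contains 0, the data are compatible with the two means being equal at that level.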
7. Analysis of the Difference in Means: multiple tests (1)
- Remember that we can do a t-test to see if any two of the means are different (recall the start of the whisky example).
- There are $\binom{I}{2} = I(I-1)/2$ such pairs of means.
- If H0 is true and the level of significance is 5%, we accept H0 for each test with probability 0.95.
- The probability that we accept H0 for all these tests is $0.95^{I(I-1)/2}$.
8. Analysis of the Difference in Means: multiple tests (2)
- This assumes that all the tests are independent.
- So even when H0 is true, when we do multiple tests on pairs of means, we accept H0 as true only with probability 0.45 and not 0.95.
- As I → ∞, this probability → 0.
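To see how quickly this probability falls, the sketch below counts the pairs of means and raises 0.95 to that power, under the same independence assumption. The function name `prob_accept_all` and the group counts are illustrative.

```python
from math import comb

def prob_accept_all(num_groups, alpha=0.05):
    """Number of pairwise tests and the probability of accepting H0 in all
    of them when H0 is true, assuming the tests are independent."""
    num_tests = comb(num_groups, 2)  # I(I - 1)/2 pairs of means
    return num_tests, (1.0 - alpha) ** num_tests

for num_groups in (3, 6, 10):
    num_tests, p_all = prob_accept_all(num_groups)
    print(f"I = {num_groups:2d}: {num_tests:2d} tests, "
          f"P(accept H0 in all) = {p_all:.3f}")
```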
9. The Bonferroni Method (1)
- The Bonferroni method is a way to solve this problem of multiple tests.
- It calculates a level of significance α for each test on pairs of means, such that the probability that H0 is accepted in all tests is at least 1 - αT (where αT is the overall level of significance that we want).
- If c is the total number of tests, then we want
- αT = P(reject H0 at least once) = P(reject H0 in test 1 OR reject H0 in test 2 OR ... OR reject H0 in test c)
- ≤ P(reject H0 in test 1) + ... + P(reject H0 in test c)
- = cα
10. The Bonferroni Method (2)
- So α = αT / c works.
- Of course, when c is large, this means α is very small, and the value of $t_{\alpha/2}$ is not in the tables.
- In this case we use an approximation to the t value computed from the normal value $z_\alpha$ ($z_\alpha$ from normal tables).
11. The Bonferroni Method (3)
- So, when the F test rejects H0 at the level of significance αT, we can further investigate each pair of means using the t-test at level of significance α.
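The Bonferroni calculation can be sketched in a few lines. The function name `bonferroni` is hypothetical, and the critical value returned is the normal value only, which is a reasonable stand-in for the t value when the degrees of freedom are large; it is not the approximation formula used in the course.

```python
from math import comb
from statistics import NormalDist

def bonferroni(num_groups, alpha_total=0.05):
    """Per-test significance level for all pairwise comparisons of means,
    plus the two-sided normal critical value at that level."""
    num_tests = comb(num_groups, 2)                # c = I(I - 1)/2
    alpha = alpha_total / num_tests                # per-test level
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return num_tests, alpha, z_crit

# Illustrative case: three groups, overall level 0.05.
num_tests, alpha, z_crit = bonferroni(3)
print(f"c = {num_tests}, alpha = {alpha:.4f}, z = {z_crit:.3f}")
```

With three groups there are three tests, so each one is carried out at level 0.05 / 3, about 0.0167.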
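12. Confidence Interval for the Variance
- Because $\frac{(n-I) \, \hat{s}_R^2}{\sigma^2} \sim \chi^2_{n-I}$, where $\hat{s}_R^2$ is the residual variance
- (Remember from Section 3)
- So a 100(1 - α)% confidence interval for σ² is $\left[ \frac{(n-I) \, \hat{s}_R^2}{\chi^2_{n-I,\,\alpha/2}}, \ \frac{(n-I) \, \hat{s}_R^2}{\chi^2_{n-I,\,1-\alpha/2}} \right]$, where $\chi^2_{n-I,\,p}$ is the value that a $\chi^2_{n-I}$ variable exceeds with probability p.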
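A minimal sketch of this interval, assuming the standard chi-squared result for the residual variance with n - I degrees of freedom. The chi-squared points are taken from tables (for 12 degrees of freedom, 23.337 and 4.404 are the upper and lower 2.5% points). The variance estimate and the name `variance_ci` are hypothetical.

```python
def variance_ci(s2_resid, dof, chi2_upper, chi2_lower):
    """Confidence interval for the variance.

    chi2_upper is the chi-squared value exceeded with probability alpha/2,
    chi2_lower the value exceeded with probability 1 - alpha/2, both for
    dof = n - I degrees of freedom.
    """
    return dof * s2_resid / chi2_upper, dof * s2_resid / chi2_lower

# Hypothetical example: residual variance 9.0 on 12 degrees of freedom.
var_lower, var_upper = variance_ci(s2_resid=9.0, dof=12,
                                   chi2_upper=23.337, chi2_lower=4.404)
print(f"95% CI for the variance: ({var_lower:.3f}, {var_upper:.3f})")
```

The interval is not symmetric around the estimate because the chi-squared distribution is skewed.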
13. Example: Whisky Data (see next slide)
- Construct 95% confidence intervals for
  - the difference between the means of each pair of whiskies (remember that our estimate for σ² is 9.356)
  - the variance σ².
- The Bonferroni method for the whisky data:
  - How many t-tests are there?
  - What is α if αT = 0.05?
  - Do the t-test with α for the J&B and Glenfiddich data (you may need that z0.992 = 2.41).
14. The Whisky Data
15. Model Diagnostics / Diagnosis del Modelo (1)
- What have we done up to now?
  - Estimate the model parameters μ1, ..., μI and σ² (point and interval estimates)
  - Test to see if the means are different, using the F-test
  - If H0 is rejected (i.e. the means are different):
    - Test the difference of each pair of means using the Bonferroni method
    - Estimate the difference in means (and give a confidence interval for the difference)
16. Model Diagnostics (2)
- We now must study whether the basic assumptions / las hipótesis básicas of the model are reasonable or not.
- Recall that the model for the data is
- $y_{ij} = \mu_i + u_{ij}$
- where the $u_{ij}$ are independent and normally distributed with mean 0 and variance σ².
17. Model Diagnostics (3)
- The perturbations $u_{ij}$ are estimated by the residuals $e_{ij} = y_{ij} - \bar{y}_{i.}$, where $\bar{y}_{i.}$ is the mean of group i.
- So, if our model agrees with the data, the residuals should be independent and normally distributed with mean 0 and variance σ².
18. Model Diagnostics (4)
- There is a problem with the residuals: their sum within each group is always 0.
- So they are not independent!!
- However, if n is big with respect to I, we can consider them to be almost independent.
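The constraint on the residuals is easy to verify numerically. This sketch computes each residual as the observation minus its group mean and then sums the residuals within each group. The data are hypothetical.

```python
from statistics import mean

# Hypothetical observations for three groups (illustrative values only).
groups = {
    "A": [10.0, 12.0, 11.0, 13.0],
    "B": [15.0, 14.0, 16.0, 15.0],
    "C": [20.0, 19.0, 21.0, 20.0],
}

# Residual = observation minus the mean of its own group.
residuals = {name: [y - mean(ys) for y in ys] for name, ys in groups.items()}

# Within each group the residuals sum to zero (up to rounding), so knowing
# all but one residual of a group determines the last one.
group_sums = {name: sum(es) for name, es in residuals.items()}
print(group_sums)
```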
19. Model Diagnostics (5)
- There are lots of ways to test the residuals.
- Are they normally distributed? Draw a histogram of the residuals. Does it look like a normal distribution? Do a goodness of fit test for the normal distribution.
- Here are some possible problems that the histogram will identify:
  - Residuals are in 2 or more groups.
  - An outlier / valor atípico is present.
  - Residuals are not symmetric.
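A quick text histogram is often enough for a first look at the shape of the residuals. The sketch below uses hypothetical data and an arbitrary bin width of 1; both are assumptions to adjust for real data.

```python
from collections import Counter
from math import floor
from statistics import mean

# Hypothetical observations for three groups (illustrative values only).
groups = [
    [10.0, 12.0, 11.0, 13.0, 11.5],
    [15.0, 14.0, 16.0, 15.5, 14.5],
    [20.0, 19.0, 21.0, 20.5, 24.0],
]

# Residual = observation minus the mean of its own group.
residuals = [y - mean(ys) for ys in groups for y in ys]

bin_width = 1.0  # arbitrary choice for this sketch
counts = Counter(floor(e / bin_width) for e in residuals)

# One row per bin, including empty bins, so that gaps and isolated
# values (possible outliers) are visible.
for b in range(min(counts), max(counts) + 1):
    left = b * bin_width
    print(f"[{left:5.1f}, {left + bin_width:5.1f}) " + "#" * counts[b])
```

Separate clusters, an isolated bar far from the rest, or a clearly lopsided shape correspond to the three problems listed above.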
20. Model Diagnostics (6)
- Are there outliers / valores atípicos? These are
values that are much larger or smaller than the
others. If they exist, you must investigate the
cause for such a value. If you can find no
reason for the outlier, then the model may not be
correct.
21. Model Diagnostics (7)
- The variance must be the same in all the groups. Draw the residuals as a function of their group mean: there should be no relationship between the group mean and the variability in the residuals.
- If the data is collected sequentially / secuencialmente, draw the residuals as a function of time. If the observations are independent then there should be no trend in the plot.
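As a numerical companion to the plot against the group mean, the sketch below tabulates each group mean next to the standard deviation of that group's residuals. The data are hypothetical and were deliberately chosen so that the spread grows with the mean, which is the pattern that signals unequal variances.

```python
from statistics import mean, stdev

# Hypothetical data in which the spread increases with the group mean.
groups = {
    "low":  [9.0, 10.0, 11.0, 10.0],
    "mid":  [18.0, 20.0, 22.0, 20.0],
    "high": [36.0, 40.0, 44.0, 40.0],
}

summary = []
for name, ys in groups.items():
    group_mean = mean(ys)
    resid = [y - group_mean for y in ys]
    # stdev uses divisor (group size - 1).
    summary.append((name, group_mean, stdev(resid)))

for name, group_mean, resid_sd in summary:
    print(f"{name:>4}: mean = {group_mean:5.1f}, residual sd = {resid_sd:.3f}")
```

If the model assumptions hold, the residual standard deviations should be of similar size whatever the group mean.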
22. Residuals for the Whisky Data
- The residuals for the whisky data are on the next slide, then a histogram, then a plot of residuals by level.
- From the histogram:
  - Does it look like a normal distribution?
  - Are there outliers?
- Is there a relationship between group and variance?
- Suppose we know that the data was collected sequentially. Plot the residuals sequentially.
23. Residuals from the Whisky Example
24. Histogram of Residuals
25. Residuals for each Level
26. What if the Residuals are not Normally Distributed?
- If the residuals are not normally distributed, then
  - We cannot trust our estimate of σ², and therefore also cannot trust the confidence intervals for the μi and for the differences in means.
  - However, the F test is still usually valid. This is because the F test only relies on the Central Limit Theorem. If the data are not normally distributed, then the F test is still approximately correct (and the approximation is better if n is larger).
27. What if the Variances are not Equal for each Group?
- Clearly, confidence intervals that use the estimate for σ² will be wrong.
- If all groups have approximately the same number of observations, i.e. ni ≈ n / I, i = 1, ..., I, then the F test still works.
- However, if the sizes of the groups are very different (say max ni / min ni > 2), the F test is not valid.
- A formal test of equality of variances is usually not worth it (this test needs normally distributed data).
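The rule of thumb on group sizes can be written as a small check. The function name `group_size_check` is hypothetical, and the threshold of 2 is the indicative value given above, not a sharp cut-off.

```python
def group_size_check(group_sizes, max_ratio=2.0):
    """Return the ratio max n_i / min n_i and whether it stays within the
    rule-of-thumb limit under which the F test tolerates unequal variances."""
    ratio = max(group_sizes) / min(group_sizes)
    return ratio, ratio <= max_ratio

# Hypothetical designs: one nearly balanced, one very unbalanced.
print(group_size_check([10, 12, 11]))
print(group_size_check([5, 20, 8]))
```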
28. What if the Observations are not Independent?
- This is usually a serious problem. The formulae for the confidence intervals for means are not valid.
- The F test is not valid either (the Central Limit Theorem needs independence).
- Remember that randomization / aleatorización is the best way of avoiding this problem.