Title: Sample Sizes for IE
Overview
- General question: How large does the sample need to be to credibly detect a given effect size?
- What does "credibly" mean here?
  - We can be reasonably sure that the difference between the treatment group and the comparison group is due to the program.
- Randomization removes bias, but it does not remove noise. To reduce noise, we need a large sample size. But how large is large?
Measuring Impact
- At the end of an experiment, we will compare the outcome of interest in the treatment and the comparison groups.
- We are interested in the difference:
  - Mean in treatment - Mean in control = Effect size
  - For example, the mean malaria prevalence in villages with ITN distribution vs. the mean malaria prevalence in villages with no ITNs (see the sketch below).
- To draw conclusions based on that effect size, we need it to be estimated with precision, since there is always variability in data.
- If there are many other unobserved factors affecting outcomes, it is harder to say whether the treatment had an effect.
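As a minimal sketch of this comparison, with made-up prevalence numbers purely for illustration:

```python
# Hypothetical end-of-experiment data: malaria prevalence in ITN and non-ITN villages.
import numpy as np

itn_villages = np.array([0.22, 0.18, 0.25, 0.20, 0.27])      # treatment group (assumed values)
no_itn_villages = np.array([0.31, 0.28, 0.35, 0.30, 0.33])   # comparison group (assumed values)

# Effect size = mean in treatment - mean in control
effect_size = itn_villages.mean() - no_itn_villages.mean()
print(f"estimated effect size: {effect_size:.3f}")            # negative: ITNs reduce prevalence
```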
Precise outcomes
Some noise
Very noisy
Confidence Intervals
- We only work with data from a sample of the population. In order to assess whether our conclusion is valid for the entire population, we need a measure of reliability.
- A 95% confidence interval for an effect size tells us that, for 95% of the samples we could have drawn from the same population, the estimated effect would have fallen into this interval.
- The standard error (SE) of the estimate captures both the size of the sample and the variability of the outcome:
  - it is larger with a small sample and with a more variable outcome (see the sketch below).
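A short sketch of how the standard error and a 95% confidence interval for a difference in means can be computed; the data are simulated and all numbers are assumptions for the example:

```python
# Simulated outcome data for a treatment and a comparison group (assumed values).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment = rng.normal(loc=0.30, scale=0.15, size=200)
comparison = rng.normal(loc=0.38, scale=0.15, size=200)

effect = treatment.mean() - comparison.mean()

# The SE of the difference grows with outcome variability and shrinks with sample size.
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + comparison.var(ddof=1) / len(comparison))

z = stats.norm.ppf(0.975)   # about 1.96 for a 95% interval
print(f"effect = {effect:.3f}, 95% CI = [{effect - z * se:.3f}, {effect + z * se:.3f}]")
```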
Two Types of Errors
- First type of error: concluding that there is an effect when in fact there is no effect.
- The level of your test is the probability that you will falsely conclude that the program has an effect when in fact it does not.
- So with a level of 5%, you can be 95% confident in the validity of your conclusion that the program had an effect.
- Commonly used levels are 5%, 10%, and 1%.
- Rule of thumb: if the effect size is more than twice the standard error, you can conclude with more than 95% certainty that the program had an effect (see the sketch below).
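A tiny sketch of this rule of thumb, with assumed numbers:

```python
# Assumed estimates: an 8 percentage point drop in prevalence with a standard error of 0.03.
effect, se = -0.08, 0.03
t_stat = effect / se
# An effect more than twice its SE corresponds roughly to significance at the 5% level.
print(abs(t_stat) > 1.96)   # True here, so we would reject "no effect"
```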
Two Types of Errors
- Second type of error: you fail to reject that the program had no effect, when in fact it does have an effect.
- The power of a test is the probability of finding a significant effect in the RCT, given that the program truly has an effect.
- Only with a significant effect can you cleanly influence policy.
- Power calculations are a tool to see how likely we are to find a significant effect for a given sample size (see the sketch below).
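For illustration, a sketch of the power of a given sample, assuming a standard two-sample t-test (using statsmodels) and assumed inputs:

```python
from statsmodels.stats.power import TTestIndPower

# Probability of detecting a standardized effect of 0.2 with 300 units per arm at the 5% level.
power = TTestIndPower().power(effect_size=0.2, nobs1=300, alpha=0.05, ratio=1.0)
print(f"power = {power:.2f}")
```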
What You Need for a Power Calculation
- Significance level
  - Conventionally set at 5%.
  - Lower levels (a smaller chance of a false positive) require a larger sample to detect the effect.
- Power level
  - A power level of 80% says that 80% of the time, if there is a true effect, you will be able to detect it in a given sample.
  - Larger sample, more power.
- The mean and the variability of the outcome in the comparison group
  - Taken from previous surveys conducted in similar settings.
  - The larger the variability, the larger the sample needed for a given power.
- The effect size that we want to detect
  - What is the smallest effect that should prompt a policy response?
  - The smaller the expected effect size, the larger the sample size needed (see the sketch below).
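Putting those four ingredients together, a minimal sketch that solves for the sample size per arm, assuming a two-sample t-test design and hypothetical inputs:

```python
from statsmodels.stats.power import TTestIndPower

n_per_arm = TTestIndPower().solve_power(
    effect_size=0.2,   # smallest standardized effect worth detecting (assumed)
    alpha=0.05,        # significance level
    power=0.80,        # desired power
    ratio=1.0,         # equal-sized treatment and comparison groups
)
print(f"required sample per arm: {n_per_arm:.0f}")   # roughly 394 under these assumptions
```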
How to Determine Effect Size
- What is the smallest effect that would justify adopting the program (in terms of cost-benefit)?
  - This sets the minimum effect size we would want to be able to detect.
- Common danger: using an effect size that is too optimistic leads to too small a sample size.
- How large an effect you can detect with a given sample depends on how variable the outcome is.
  - Example: if all children have very similar diarrhea prevalence without the program, a very small impact will be easy to detect.
- The standardized effect size is the effect size divided by the standard deviation of the outcome (see the sketch below).
- Common standardized effect sizes are 0.20 (small), 0.40 (medium), and 0.50 (large).
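A small sketch of the standardization step, with hypothetical numbers:

```python
from statsmodels.stats.power import TTestIndPower

raw_effect = 0.05    # e.g. a 5 percentage point drop in diarrhea prevalence (assumed)
outcome_sd = 0.25    # standard deviation of the outcome in the comparison group (assumed)

std_effect = raw_effect / outcome_sd   # standardized effect size = 0.20, a "small" effect
n_per_arm = TTestIndPower().solve_power(effect_size=std_effect, alpha=0.05, power=0.80)
print(f"standardized effect = {std_effect:.2f}, n per arm = {n_per_arm:.0f}")
```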
Design Factors to Take into Account
- Availability of a baseline
  - A baseline can help reduce the needed sample size, since it:
    - removes some variability in the data, increasing precision (see the sketch after this list)
    - can be used to stratify and create subgroups
- The level of randomization
  - Whenever treatment occurs at the group level, this reduces power relative to randomization at the individual level.
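A rough sketch of the baseline point, using the common approximation that controlling for a baseline measure of the outcome shrinks residual variance by a factor of (1 - rho^2), where rho is the baseline-endline correlation; all numbers are assumptions:

```python
from statsmodels.stats.power import TTestIndPower

raw_effect, sd, rho = 0.05, 0.25, 0.6   # effect, outcome SD, baseline-endline correlation (assumed)
solver = TTestIndPower()

n_no_baseline = solver.solve_power(effect_size=raw_effect / sd, alpha=0.05, power=0.80)

residual_sd = sd * (1 - rho**2) ** 0.5  # variability left after adjusting for the baseline
n_with_baseline = solver.solve_power(effect_size=raw_effect / residual_sd, alpha=0.05, power=0.80)

print(f"per arm: {n_no_baseline:.0f} without a baseline vs {n_with_baseline:.0f} with one")
```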
Cluster (Group) Randomization
Examples of the level of randomization:
- Rural Water Project (Water Guard): individual
- Rural Water Project (spring improvement): village
- Community-based Monitoring in Uganda: village
- HIV/AIDS Education: school
Implications from Group Design
- The outcomes for all the individuals within a unit may be correlated:
  - All villagers are affected by spring improvements at the same time.
  - All students at a school with trained teachers may have benefited from the information.
- The sample size needs to be adjusted for this correlation.
- The more correlation within the group, the more we need to adjust the standard errors (see the design-effect sketch below).
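A minimal sketch of the usual design-effect adjustment, with hypothetical numbers: under cluster randomization, the sample size from an individual-level calculation is inflated by 1 + (m - 1) * ICC, where m is the cluster size and ICC is the intra-cluster correlation.

```python
n_individual = 394   # sample per arm from an individual-level power calculation (assumed)
cluster_size = 20    # individuals surveyed per village or school (assumed)
icc = 0.05           # intra-cluster correlation of the outcome (assumed, from prior surveys)

design_effect = 1 + (cluster_size - 1) * icc
n_clustered = n_individual * design_effect
n_clusters_per_arm = n_clustered / cluster_size
print(f"design effect = {design_effect:.2f}, n per arm = {n_clustered:.0f} "
      f"({n_clusters_per_arm:.0f} clusters)")
```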
Implications
- It is extremely important to randomize an adequate number of groups.
- Typically, the number of individuals within groups matters less than the number of groups.
- Big increases in power usually happen only when the number of groups that are randomized increases (see the sketch below).
- If you randomize at the level of the district, with one treated district and one control district, you effectively have 2 observations!
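A small sketch (hypothetical numbers) of why the number of clusters matters more than cluster size: with the total number of individuals held fixed, the effective sample size depends mainly on how many clusters are randomized.

```python
icc = 0.05       # intra-cluster correlation (assumed)
total_n = 800    # total individuals per arm, held fixed (assumed)

for n_clusters, cluster_size in [(8, 100), (40, 20), (80, 10)]:
    deff = 1 + (cluster_size - 1) * icc
    effective_n = total_n / deff   # the sample is "as good as" this many independent units
    print(f"{n_clusters:3d} clusters of {cluster_size:3d}: effective n = {effective_n:.0f}")
```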
Conclusions
- Power calculations involve some guesswork.
  - Sometimes we do not have the right information to conduct them very precisely.
- However, it is important to do them to:
  - Avoid launching studies that will have no power at all (a waste of time and money).
  - Devote the appropriate resources to the studies that you decide to conduct (and not too much).
  - Determine, if you have a fixed budget, whether the project is feasible at all.
- Software: http://sitemaker.umich.edu/group-based/optimal_design_software