Title: No CLT
1No CLT No Problem? Enter the Bootstrap!
- John McGready
- Department of Biostatistics
- Johns Hopkins University
- http://www.biostat.jhsph.edu/jmcgread
2Slide 2
3Goals of Inferential Statistics
- Much of what we do in statistics involves trying
to talk about true characteristics of a process,
using an imperfect subset of information from the
process
Population Information (what we WANT)
Sample Information (what we have)
4Medical Expenditures
- Suppose we want to study the FY 2005 medical
expenditures for 13,000 employees in a
particular company
- However, the benefits administrator will only
give us one random sample of 200 employees
5Medical Expenditures
(True) mean 2.3 (True) sd 5.0
Median 0.59, Mean 2.3, sd 5.0
(Sample) mean 1.9 (Sample) sd 4.0
Median 0.57, Mean 2.0, sd 4.3
6Medical Expenditures
- Given the right skew, our first choice for
estimating the center of the distribution is to
work with the median
- We can only estimate the true median using the
sample median from our 200 observations
7Medical Expenditures
- We are interested in how good a guess the
sample median is of the true median
- We would also like to estimate a range of
possibilities for the true median (i.e., a
confidence interval)
8Medical Expenditures
- In order to understand how a sample median from
200 observations relates to the true median, let's
call our administrator and see if we can get
1,000 more random samples of size 200
- This way, we can compute 1,000 more sample
medians and see how variable they are
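The thought experiment on this slide can be sketched in code if we pretend to hold the full population. This is a minimal sketch, assuming a synthetic right-skewed population drawn from a lognormal distribution; the lognormal parameters and the seed are illustrative assumptions, not the company's actual data.

```python
import random
import statistics

random.seed(2005)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the 13,000 employees' expenditures (thousand $).
# The lognormal shape and its parameters are assumptions chosen to be right-skewed.
population = [random.lognormvariate(-0.5, 1.5) for _ in range(13_000)]
true_median = statistics.median(population)

# What we wish the administrator would allow: 1,000 random samples of 200,
# each drawn without replacement from the population.
sample_medians = [
    statistics.median(random.sample(population, 200))
    for _ in range(1_000)
]

# The spread of these medians is the (normally unobservable) sampling variability.
print(f"true median:            {true_median:.2f}")
print(f"mean of sample medians: {statistics.fmean(sample_medians):.2f}")
print(f"sd of sample medians:   {statistics.stdev(sample_medians):.2f}")
```

In practice only one sample is available, which is why the rest of the talk looks for ways to approximate this spread from that single sample.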
9Making the Call
10The Response
No Way!
11What to Do Now??
- Well, it seems we are out of luck
- Let's just estimate the mean instead, and use the
Central Limit Theorem to estimate a range of
possible values for the true mean
12Review: Sampling Behavior via the CLT
Standard error (spread)
13Sampling Behavior via the CLT
- Most (95%) of the sample means we could get from
samples of 200 would fall between the 2.5th and
97.5th percentiles of this distribution
- These percentiles correspond to true mean +/-
1.96 standard errors
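Written out, the statement above is the usual normal approximation for the sample mean. The symbols \mu (true mean), \sigma (true sd), and n (sample size) are notation introduced here, and the second argument of N is the standard error; the numbers are the true values from the medical expenditures example (mean 2.3, sd 5.0, n = 200, in thousand $).

```latex
\bar{X} \;\approx\; N\!\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right),
\qquad
\mu \pm 1.96\,\frac{\sigma}{\sqrt{n}}
= 2.3 \pm 1.96 \times \frac{5.0}{\sqrt{200}}
\approx 2.3 \pm 0.69
= (1.61,\ 2.99)
```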
14Sampling Behavior via the CLT
15Sampling Behavior via the CLT
- Rub #1
- If we knew the true mean, we wouldn't care about
possible mean values
- However, taking this one step further implies
that 95% of the sample means we could get will fall
within a known range of the truth
16Sampling Behavior via the CLT
17Sampling Behavior via the CLT
18Sampling Behavior via the CLT
- Rub #2
- If we only have one sample, we don't know the true
sampling distribution
- However, the CLT says it will be normal
- We estimate the spread from our sample data, and center it at
our sample mean
19Sampling Behavior via the CLT
- Our sample info
- Sample mean = 2.0 (thousand $)
- Sample standard deviation = 4.3 (thousand $)
- Sample estimate of standard error (spread of
sampling distribution):
- 4.3 / sqrt(200) ≈ 0.30 (thousand $)
20Sampling Behavior via the CLT
21Sampling Behavior via the CLT
22Sampling Behavior via the CLT
- True 95% CI
- Sample mean +/- 1.96(true standard error)
- (1.3, 2.7)
- Estimated 95% CI
- Sample mean +/- 1.97(estimated standard error)
- (1.4, 2.6)
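The two intervals above can be reproduced with a few lines of arithmetic. This sketch uses only numbers from the slides (sample mean 2.0, true sd 5.0, sample sd 4.3, n = 200); the helper name `interval` is introduced here for illustration. The 1.97 multiplier is taken from the slide as given; it is close to the 97.5th percentile of a t distribution with 199 degrees of freedom.

```python
import math

n = 200
sample_mean = 2.0  # thousand $, from the sample
true_sd = 5.0      # thousand $, known only because this is a teaching example
sample_sd = 4.3    # thousand $, from the sample

def interval(center, multiplier, sd, n):
    """Return (lower, upper) for center +/- multiplier * sd / sqrt(n)."""
    half_width = multiplier * sd / math.sqrt(n)
    return center - half_width, center + half_width

true_ci = interval(sample_mean, 1.96, true_sd, n)
estimated_ci = interval(sample_mean, 1.97, sample_sd, n)

print("true SE:         ", round(true_sd / math.sqrt(n), 2))    # 0.35
print("estimated SE:    ", round(sample_sd / math.sqrt(n), 2))  # 0.3
print("true 95% CI:     ", tuple(round(x, 1) for x in true_ci))       # (1.3, 2.7)
print("estimated 95% CI:", tuple(round(x, 1) for x in estimated_ci))  # (1.4, 2.6)
```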
23Another Approach to Estimating Sampling Distribution
- Instead of relying on the CLT, how about we simulate
the sampling distribution using just our sample of
200?
- Treat our sample as truth
- Resample multiple times (say 1,000), taking random
draws of 200 with replacement
24Resampling With Replacement
- Original sample (n = 4)
- Potential resample of same size
S1
S2
S3
S4
S2
S1
S3
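Resampling with replacement is a one-liner in most languages. A minimal sketch using Python's standard library; the function name `resample` and the seed are illustrative choices, and the labels S1 to S4 mirror the four-unit example on this slide.

```python
import random

def resample(sample, rng):
    """Draw len(sample) items from sample, with replacement."""
    return rng.choices(sample, k=len(sample))

rng = random.Random(4)  # arbitrary seed for a repeatable illustration
original = ["S1", "S2", "S3", "S4"]

# Each resample has the same size as the original; because draws are made
# with replacement, some units can appear more than once and others not at all.
for i in range(3):
    print(f"resample {i + 1}:", resample(original, rng))
```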
25Re-Sampling
26Bootstrap Estimate of Sampling Distribution
- Take 1,000 resamples
- Compute the mean of each re-sample
- Plot a distribution of the means
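The three steps above can be sketched as follows. This is an illustration under stated assumptions: the 200 "observed" values are simulated from a lognormal distribution as a stand-in for the real sample, `bootstrap_distribution` is a name introduced here, and the text histogram is only a crude substitute for a proper plot.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_resamples=1_000, seed=None):
    """Return the statistic computed on n_resamples with-replacement resamples."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical stand-in for the 200 observed expenditures (thousand $).
data_rng = random.Random(200)
sample = [data_rng.lognormvariate(-0.5, 1.5) for _ in range(200)]

# Steps 1 and 2: take 1,000 resamples and compute the mean of each.
boot_means = bootstrap_distribution(sample, statistics.fmean, n_resamples=1_000, seed=1)

# Step 3: a crude text "plot" with counts of bootstrap means in 10 equal-width bins.
lo, hi = min(boot_means), max(boot_means)
width = (hi - lo) / 10
counts = [0] * 10
for m in boot_means:
    counts[min(int((m - lo) / width), 9)] += 1
for i, c in enumerate(counts):
    print(f"{lo + i * width:6.2f} | {'#' * (c // 10)}")
```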
27Bootstrap Estimate of Sampling Distribution
28Bootstrap Estimate of Sampling Distribution
29Bootstrap 95% CIs
- How to get a 95% CI from the bootstrap distribution?
- Assume normality (normal bootstrap method)
- But estimate the standard error from the bootstrap
distribution
- Pick off the 2.5th and 97.5th percentiles (bootstrap
percentile method)
- Pick off adjusted percentiles (bias-corrected and
accelerated, BCa, method)
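The three methods listed above can be sketched with the standard library alone. This is a hedged illustration, not the software used for the talk: `bootstrap_cis` and `percentile` are names introduced here, the percentile helper uses simple linear interpolation, and the BCa part follows the usual textbook recipe (bias correction from the share of bootstrap values below the estimate, acceleration from leave-one-out values). The demo data are simulated and hypothetical.

```python
import random
import statistics
from statistics import NormalDist

def percentile(sorted_values, p):
    """Empirical quantile at probability p (0..1), by linear interpolation."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (pos - lower) * (sorted_values[upper] - sorted_values[lower])

def bootstrap_cis(sample, statistic, n_resamples=1_000, level=0.95, seed=None):
    """Return normal, percentile, and BCa bootstrap intervals for statistic(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    estimate = statistic(sample)
    boot = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_resamples))
    alpha = 1 - level
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)

    # Normal method: assume normality, but take the standard error from the
    # spread of the bootstrap distribution.
    se = statistics.stdev(boot)
    normal_ci = (estimate - z * se, estimate + z * se)

    # Percentile method: read off the 2.5th and 97.5th percentiles directly.
    percentile_ci = (percentile(boot, alpha / 2), percentile(boot, 1 - alpha / 2))

    # BCa: shift the percentiles using a bias correction z0 and an
    # acceleration a estimated from leave-one-out (jackknife) values.
    below = sum(b < estimate for b in boot)
    below = min(max(below, 1), n_resamples - 1)  # keep inv_cdf away from 0 and 1
    z0 = std_normal.inv_cdf(below / n_resamples)
    jack = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    jack_mean = statistics.fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(p):
        zp = std_normal.inv_cdf(p)
        return std_normal.cdf(z0 + (z0 + zp) / (1 - a * (z0 + zp)))

    bca_ci = (percentile(boot, adjusted(alpha / 2)),
              percentile(boot, adjusted(1 - alpha / 2)))
    return {"normal": normal_ci, "percentile": percentile_ci, "bca": bca_ci}

# Hypothetical stand-in for the 200 observed expenditures (thousand $).
data_rng = random.Random(200)
sample = [data_rng.lognormvariate(-0.5, 1.5) for _ in range(200)]
for name, (lo, hi) in bootstrap_cis(sample, statistics.fmean, seed=1).items():
    print(f"{name:10s} {lo:.2f} - {hi:.2f}")
```

Passing `statistics.median` instead of `statistics.fmean` gives the corresponding intervals for a median, with no other change to the code.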
30 95% CIs
- True mean = 2.3
- Method: 95% CI
- CLT Estimate: 1.40 - 2.60
- Bootstrap Normal: 1.39 - 2.60
- Bootstrap Percentile: 1.41 - 2.58
- BCa: 1.47 - 2.68
31We Could Do with 10,000 Resamples
32Bootstrap 95% CIs: Mean
- Empirical Coverage Probabilities (%) [1]
- Method: 1K resamps / 10K resamps
- CLT Estimate [2]: 93.4
- Bootstrap Normal [2]: 93.2 / 92.5
- Bootstrap Percentile: 92.4 / 91.6
- BCa: 92.3 / 93.4
- [1] To be thorough, should also look at average
width
- [2] Some intervals could contain illegal (negative)
values
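An empirical coverage probability is the share of simulated samples whose interval contains the true value. The sketch below shows the idea for the percentile method only, on a much smaller scale than the table (200 simulated samples, 200 resamples each) so that it runs quickly. The population is synthetic and hypothetical, the function names are introduced here, and the result should not be expected to match the figures on this slide.

```python
import random
import statistics

def percentile_ci(sample, statistic, n_resamples, rng, level=0.95):
    """Bootstrap percentile interval (nearest-rank percentiles)."""
    n = len(sample)
    boot = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_resamples))
    tail = (1 - level) / 2
    lo_idx = int(tail * (n_resamples - 1))
    hi_idx = int(round((1 - tail) * (n_resamples - 1)))
    return boot[lo_idx], boot[hi_idx]

def empirical_coverage(population, true_value, statistic, n=200,
                       n_samples=200, n_resamples=200, seed=0):
    """Fraction of simulated samples whose interval contains true_value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(population, n)  # one simulated "study"
        lo, hi = percentile_ci(sample, statistic, n_resamples, rng)
        hits += lo <= true_value <= hi
    return hits / n_samples

# Hypothetical right-skewed population (thousand $); parameters are assumptions.
pop_rng = random.Random(13_000)
population = [pop_rng.lognormvariate(-0.5, 1.5) for _ in range(13_000)]

coverage = empirical_coverage(population, statistics.fmean(population), statistics.fmean)
print(f"empirical coverage of the percentile interval: {coverage:.1%}")
```

Recording `hi - lo` for each simulated sample in the same loop would give the average interval width mentioned in the first footnote.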
33What's The Big Deal?
- Why not just use the CLT?
- For many statistics, we do not have a CLT-based (or a
good CLT-based) approach
- Median
- Ratio of mean to sd
- Correlation coefficients
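The appeal of the bootstrap is that the same resampling loop works for any statistic, including the three listed above. A minimal sketch, assuming simulated (hypothetical) data; `bootstrap_se`, `mean_to_sd`, and `pearson_r` are names introduced here. For the correlation, whole (x, y) pairs are resampled together so that the pairing is preserved.

```python
import random
import statistics

def bootstrap_se(rows, statistic, n_resamples=1_000, seed=None):
    """Bootstrap standard error of statistic, resampling whole rows with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    reps = [statistic(rng.choices(rows, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(reps)

def mean_to_sd(values):
    """Ratio of the sample mean to the sample standard deviation."""
    return statistics.fmean(values) / statistics.stdev(values)

def pearson_r(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: skewed costs, plus a second variable correlated with them.
rng = random.Random(33)
costs = [rng.lognormvariate(-0.5, 1.5) for _ in range(200)]
pairs = [(c, c + rng.gauss(0, 2)) for c in costs]

print("bootstrap SE, median:     ", round(bootstrap_se(costs, statistics.median, seed=1), 3))
print("bootstrap SE, mean/sd:    ", round(bootstrap_se(costs, mean_to_sd, seed=1), 3))
print("bootstrap SE, correlation:", round(bootstrap_se(pairs, pearson_r, seed=1), 3))
```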
34Getting a 95% CI for a Median
35 95% CIs for Median
- True median = 0.59
- Method: 95% CI (1,000 Reps)
- CLT Estimate: NA
- Bootstrap Normal: 0.44 - 0.71
- Bootstrap Percentile: 0.39 - 0.68
- BCa: 0.39 - 0.68
36Bootstrap 95% CIs: Median
- Empirical Coverage Probabilities (%) [1]
- Method: 1K resamps / 10K resamps
- Bootstrap Normal [2]: 94.1 / 94.4
- Bootstrap Percentile: 93.9 / 95.0
- BCa: 94.0 / 95.2
- [1] To be thorough, should also look at average
width
- [2] Some intervals could contain illegal (negative)
values
37Wrap Up
- Pros/Cons of bootstrap
- Theoretical Justification