Title: Bootstrap
1 Bootstrap
- Chingchun Huang
- Vision Lab, NCTU
2 Introduction
- A data-based simulation method
- For statistical inference
  - Finding estimators of the parameter of interest
  - Confidence intervals for the parameter of interest
3 An example
- Definitions of two statistics for a random variable
  - Average: the sample mean
  - Standard error: the standard deviation of the sample means
- Calculation of the two statistics
  - Carry out the measurement many times
- Observations from these two statistics (see the sketch below)
  - The standard error decreases as N increases
  - The sample mean becomes more reliable as N increases
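A minimal sketch of these two statistics, assuming only NumPy; the distribution and its parameters below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Carry out the "measurement" many times: M experiments, N samples each.
M, N = 1000, 25
samples = rng.normal(loc=10.0, scale=2.0, size=(M, N))

sample_means = samples.mean(axis=1)       # one average per experiment
std_error = sample_means.std(ddof=1)      # std. deviation of the sample means

print(f"average of sample means: {sample_means.mean():.3f}")
print(f"standard error:          {std_error:.3f}")  # shrinks as N grows
```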
4 Central limit theorem
- Averages taken from any distribution (your experimental data) will have a normal distribution
- The error for such a statistic will decrease slowly as the number of observations increases
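A quick simulation of both claims; the exponential distribution here is an arbitrary skewed stand-in for "any distribution":

```python
import numpy as np

rng = np.random.default_rng(1)

for N in (5, 50, 500):
    # Averages of N draws from a skewed (exponential) distribution.
    means = rng.exponential(scale=1.0, size=(10_000, N)).mean(axis=1)
    print(f"N={N:3d}  mean={means.mean():.3f}  std={means.std(ddof=1):.4f}")
# The spread of the averages shrinks like 1/sqrt(N), and their histogram
# looks increasingly normal as N grows.
```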
5 Averages of N.D.
[Figures: the normal distribution, the χ² distribution, and averages of the χ² distribution]
6 Uniform distribution
[Figures: the uniform distribution and averages of the uniform distribution]
7 Consequences of the central limit theorem
- But nobody tells you how big the sample has to be...
- Should we believe a measurement of the average?
- How about statistics other than the average?
- Bootstrap: the technique to the rescue
8 Basic idea of bootstrap
- Originally, from some list of data, one computes an object (e.g. a statistic).
- Create an artificial list by randomly drawing elements from that list. Some elements will be picked more than once.
  - Nonparametric mode (later)
  - Parametric mode (later)
- Compute a new object.
- Repeat 100-1000 times and look at the distribution of these objects (see the sketch below).
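A minimal sketch of this loop in NumPy; the function name and the sample data are illustrative, not from the slides:

```python
import numpy as np

def bootstrap(data, statistic, B=1000, rng=None):
    """Distribution of `statistic` over B artificial lists drawn with replacement."""
    rng = rng or np.random.default_rng()
    data = np.asarray(data)
    n = len(data)
    # Some elements are picked more than once, others not at all.
    return np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(B)])

data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9]   # any list of observations
reps = bootstrap(data, np.median)
print(f"median: {np.median(data):.2f}, bootstrap spread: {reps.std(ddof=1):.2f}")
```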
9 A simple example
- Data available comparing grades before and after leaving graduate school
- Some linear correlation between grades: r = 0.776
- But how reliable is this result (r = 0.776)? See the sketch below.
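A sketch of how the reliability of r can be probed by bootstrapping; the grade data is not reproduced in the slides, so the arrays below are hypothetical placeholders:

```python
import numpy as np

# Hypothetical placeholder grades; the real data is not in the slides.
before = np.array([3.2, 3.5, 2.9, 3.8, 3.1, 3.6, 2.7, 3.4, 3.0, 3.7])
after  = np.array([3.4, 3.6, 2.8, 3.9, 3.3, 3.5, 2.9, 3.6, 3.1, 3.8])

rng = np.random.default_rng(2)
n, B = len(before), 1000
r_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)   # resample (before, after) pairs together
    r_star[b] = np.corrcoef(before[idx], after[idx])[0, 1]

print(f"r = {np.corrcoef(before, after)[0, 1]:.3f}, "
      f"bootstrap std. error = {r_star.std(ddof=1):.3f}")
```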
10 A simple example
13 A simple example
14 Confidence intervals
- Consider a situation similar to the one before.
- The parameter of interest is θ (e.g. the mean).
- θ̂ is an estimator of θ based on the sample.
- We are interested in finding the confidence interval for the parameter.
15 The percentile algorithm
- Input the level 1 − 2α for the confidence interval.
- Generate B bootstrap samples.
- Compute θ̂*(b) for b = 1, …, B.
- Arrange the θ̂*(b) values in increasing order.
- Compute the α and (1 − α) percentiles of the ordered values.
- The C.I. is given by the (α)-th and (1 − α)-th percentile values.
- Example percentiles of θ̂*:

  Percentile  5     10    16    50    84     90     95
  Value       49.7  56.4  62.7  86.9  112.3  118.7  126.7
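A sketch of the percentile algorithm; the data below is synthetic, since only the percentile table survives from the slide:

```python
import numpy as np

def percentile_ci(data, statistic, alpha=0.05, B=1000, rng=None):
    """Percentile-method interval covering roughly 1 - 2*alpha."""
    rng = rng or np.random.default_rng()
    data = np.asarray(data)
    n = len(data)
    # Ordered bootstrap replications theta*(1) <= ... <= theta*(B).
    theta_star = np.sort([statistic(data[rng.integers(0, n, size=n)])
                          for _ in range(B)])
    # C.I. endpoints are the alpha-th and (1 - alpha)-th percentiles.
    return (np.percentile(theta_star, 100 * alpha),
            np.percentile(theta_star, 100 * (1 - alpha)))

data = np.random.default_rng(3).normal(86.9, 20.0, size=100)  # synthetic
print(percentile_ci(data, np.mean, alpha=0.05))  # ~90% interval
```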
16 How many bootstraps?
- No clear answer to this.
- Rule of thumb: try it 100 times, then 1000 times, and see whether your answers change by much (see the sketch below).
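The rule of thumb as a sketch: run the same bootstrap at B = 100 and B = 1000 and compare the answers (data and statistic are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=50)
n = len(data)

for B in (100, 1000):
    reps = np.array([data[rng.integers(0, n, size=n)].mean()
                     for _ in range(B)])
    print(f"B={B:5d}  bootstrap std. error = {reps.std(ddof=1):.4f}")
# If the two answers barely differ, B = 100 was already enough.
```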
17 How many bootstraps?
B           50      100     200     500     1000    2000    3000
Std. Error  0.0869  0.0804  0.0790  0.0745  0.0759  0.0756  0.0755
18 Convergence
- These histograms show the distribution of the correlation coefficient for the bootstrap samples, here for B = 200 and B = 500.
19 Cont'd
20 Cont'd
- B = 3000, B = 4000
- Now it can be seen that the sampling distributions of the correlation coefficient are more or less identical.
21 Cont'd
- The graph above shows the similarity between the bootstrap distribution and the direct enumeration of random samples from the empirical distribution.
22 Is it reliable?
- Observations
  - Good agreement for normal (Gaussian) distributions
  - Skewed distributions tend to be more problematic, particularly in the tails
- A tip: for now, nobody is going to shoot you down for using it.
23 Schematic representation of the bootstrap procedure
24 Bootstrap
- The bootstrap can be used either non-parametrically or parametrically.
- In nonparametric mode, it avoids restrictive and sometimes dangerous parametric assumptions about the form of the underlying population.
- In parametric mode, it can provide more accurate estimates of errors than traditional methods.
25 Parametric bootstrap
[Diagram: in the real world, a probability model P yields the observed samples x, from which the statistic of interest is computed. In the bootstrap world, the estimated probability model P̂ yields bootstrap samples x*, from which the bootstrap replications are computed.]
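A minimal sketch of the bootstrap-world half of this diagram, assuming a Gaussian model purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=5.0, scale=2.0, size=40)   # "real world" samples

# Estimated probability model: fit the parameters of the assumed Gaussian.
mu_hat, sigma_hat = x.mean(), x.std(ddof=1)

# Bootstrap samples come from the *fitted model*, not from the list itself.
B = 1000
reps = rng.normal(mu_hat, sigma_hat, size=(B, len(x))).mean(axis=1)
print(f"parametric bootstrap std. error of the mean: {reps.std(ddof=1):.3f}")
```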
26 Bootstrap
- The technique was extended, modified and refined to handle a wide variety of problems, including
- (1) confidence intervals and hypothesis tests,
- (2) linear and nonlinear regression,
- (3) time series analysis, and other problems.
27 Example: one-dimensional smoothing
Fit a cubic spline (N = 50 training data points)
28 The bootstrap and maximum likelihood method
Least squares: fit μ(x) = Σ_j β_j h_j(x), where the h_j(x) are cubic-spline basis functions. The least-squares estimate is β̂ = (HᵀH)⁻¹ Hᵀ y, where H_ij = h_j(x_i), with estimated covariance Var(β̂) = (HᵀH)⁻¹ σ̂², where σ̂² is the estimated residual variance.
29 The bootstrap and maximum likelihood method
Nonparametric bootstrap: repeat B = 200 times:
- draw a dataset of N = 50 with replacement from the training data z_i = (x_i, y_i)
- fit a cubic spline
Construct a 95% pointwise confidence interval: at each x_i, compute the mean and find the 2.5 and 97.5 percentiles (see the sketch below).
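A sketch of this recipe in NumPy. The training data is synthetic, and the truncated-power basis below is one convenient way to fit a cubic spline by least squares; the slides do not specify the basis:

```python
import numpy as np

def spline_basis(x, knots):
    """Truncated-power basis for a cubic spline with the given interior knots."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None)**3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(6)
N, B = 50, 200
x = np.sort(rng.uniform(0.0, 3.0, N))              # synthetic training inputs
y = np.sin(2.0 * x) + rng.normal(0.0, 0.3, N)      # synthetic responses
knots = np.quantile(x, [0.25, 0.5, 0.75])

H = spline_basis(x, knots)                         # design matrix on the grid
fits = np.empty((B, N))
for b in range(B):
    idx = rng.integers(0, N, size=N)               # resample z_i = (x_i, y_i)
    beta = np.linalg.lstsq(spline_basis(x[idx], knots), y[idx], rcond=None)[0]
    fits[b] = H @ beta                             # evaluate on the original x

lo, hi = np.percentile(fits, [2.5, 97.5], axis=0)  # 95% pointwise band
```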
30 The bootstrap and maximum likelihood method
Parametric bootstrap: we assume that the model errors are Gaussian. Repeat B = 200 times:
- draw a dataset of N = 50 with replacement from the training data z_i = (x_i, y_i)
- fit a cubic spline on the z_i and estimate μ̂ and σ̂
- simulate new responses z_i* = (x_i, y_i*) with y_i* = μ̂(x_i) + ε_i, ε_i ~ N(0, σ̂²)
- fit a cubic spline on the z_i*
Construct a 95% pointwise confidence interval: at each x_i, compute the mean and find the 2.5 and 97.5 percentiles (a sketch follows below).
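Continuing the previous sketch (reusing H, x, y, N, B, rng, and fits from it): this shows one standard form of the parametric step, simulating new Gaussian responses around a single fitted curve; the slide's initial resample-with-replacement step is omitted here for brevity:

```python
# Parametric variant, reusing H, y, N, B, rng and fits from the sketch above.
beta_hat = np.linalg.lstsq(H, y, rcond=None)[0]
mu_hat = H @ beta_hat                                # fitted curve at each x_i
sigma_hat = np.sqrt(np.sum((y - mu_hat) ** 2) / (N - H.shape[1]))

for b in range(B):
    y_star = mu_hat + rng.normal(0.0, sigma_hat, N)  # simulate new responses
    fits[b] = H @ np.linalg.lstsq(H, y_star, rcond=None)[0]

lo, hi = np.percentile(fits, [2.5, 97.5], axis=0)    # 95% pointwise band
```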
31 The bootstrap and maximum likelihood method
Parametric bootstrap
Conclusion: least squares = parametric bootstrap as B → ∞ (only because of the Gaussian errors).
32 Some notes
- The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates.
- The basic idea behind the bootstrap is very simple, and goes back at least two centuries.
- The bootstrap method is not a way of reducing the error! It only tries to estimate it.
- Bootstrap methods depend only on the bootstrap samples; they do not depend on the underlying distribution.
33 A general data set-up
- We have dealt with
  - the standard error
  - the confidence interval
- Under the assumption that the distribution is either unknown or very complicated.
- The situation can be more general
  - like regression,
  - sometimes using maximum likelihood estimation.
34 Conclusion
- The bootstrap allows the data analyst to
  - assess the statistical accuracy of complicated procedures, by exploiting the power of the computer.
- The use of the bootstrap either
  - relieves the analyst from having to do complex mathematical derivations, or
  - provides an answer where no analytical answer can be obtained.
35 Addendum: the jackknife
- The jackknife is a special kind of bootstrap.
- Each bootstrap subsample has all but one of the original elements of the list.
- For example, if the original list has 10 elements, then there are 10 jackknife subsamples (see the sketch below).
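A minimal jackknife sketch; the standard error computed at the end uses the usual jackknife formula:

```python
import numpy as np

def jackknife(data, statistic):
    """One subsample per element, each leaving that one element out."""
    data = np.asarray(data)
    n = len(data)
    return np.array([statistic(np.delete(data, i)) for i in range(n)])

data = np.arange(1.0, 11.0)         # 10 elements -> 10 jackknife subsamples
reps = jackknife(data, np.mean)
n = len(data)
se = np.sqrt((n - 1) / n * np.sum((reps - reps.mean()) ** 2))
print(f"jackknife std. error of the mean: {se:.3f}")
```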
36 Introduction (continued)
- Definition of Efron's nonparametric bootstrap: given a sample of n independent identically distributed (i.i.d.) observations X1, X2, …, Xn from a distribution F, and a parameter θ of the distribution F with a real-valued estimator θ̂(X1, X2, …, Xn), the bootstrap estimates the accuracy of the estimator by replacing F with Fn, the empirical distribution, where Fn places probability mass 1/n at each observation Xi.
37 Introduction (continued)
- Let X1*, X2*, …, Xn* be a bootstrap sample, that is, a sample of size n taken with replacement from Fn.
- The bootstrap estimates the variance of θ̂(X1, X2, …, Xn) by computing or approximating the variance of θ̂* = θ̂(X1*, X2*, …, Xn*), as sketched below.
38 Introduction (continued)
- The bootstrap is similar to earlier techniques, which are also called resampling methods:
- (1) jackknife,
- (2) cross-validation,
- (3) delta method,
- (4) permutation methods, and
- (5) subsampling.
39 Bootstrap Remedies
- In the past decade, researchers have found remedies for many of the problems where the bootstrap is inconsistent, giving good modified bootstrap solutions that are consistent.
- For both problems described thus far, a simple procedure called the m-out-of-n bootstrap has been shown to lead to consistent estimates.
40 The m-out-of-n Bootstrap
- This idea was proposed by Bickel and Ren (1996) for handling doubly censored data.
- Instead of sampling n times with replacement from a sample of size n, they suggest doing it only m times, where m is much less than n.
- To get the consistency results, both m and n need to get large, but at different rates. We need m = o(n), that is, m/n → 0 as m and n both → ∞.
- This method leads to consistent bootstrap estimates in many cases where the ordinary bootstrap has problems, particularly (1) the mean with infinite variance and (2) extreme value distributions (a sketch follows below).
- Don't know why.
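A sketch of the m-out-of-n idea; the function name and the extreme-value example are illustrative, not from Bickel and Ren:

```python
import numpy as np

def m_out_of_n_bootstrap(data, statistic, m, B=1000, rng=None):
    """Resample only m << n points per replication."""
    rng = rng or np.random.default_rng()
    data = np.asarray(data)
    return np.array([statistic(data[rng.integers(0, len(data), size=m)])
                     for _ in range(B)])

rng = np.random.default_rng(8)
x = rng.uniform(0.0, 1.0, 10_000)   # the maximum is an extreme-value statistic
reps = m_out_of_n_bootstrap(x, np.max, m=100, B=1000, rng=rng)
print(f"spread of the subsample maximum: {reps.std(ddof=1):.4f}")
```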
41 Examples where the bootstrap fails
- Athreya (1987) shows that the bootstrap estimate of the sample mean is inconsistent when the population distribution has an infinite variance (see the sketch below).
- Angus (1993) provides similar inconsistency results for the maximum and minimum of a sequence of independent identically distributed observations.
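Athreya's infinite-variance case can be illustrated numerically: for a Cauchy population, the bootstrapped standard error of the mean never settles down as n grows. This is a sketch, not a proof:

```python
import numpy as np

rng = np.random.default_rng(9)
for n in (100, 1_000, 10_000):
    x = rng.standard_cauchy(n)      # infinite-variance population
    reps = np.array([x[rng.integers(0, n, size=n)].mean()
                     for _ in range(500)])
    print(f"n={n:6d}  bootstrap std. error of the mean = {reps.std(ddof=1):.3f}")
# For a finite-variance population this would shrink like 1/sqrt(n);
# here it stays erratic, dominated by the few largest observations.
```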