Title: Stats 120A
1Stats 120A
- Review of CIs, hypothesis tests and more
2Sample/Population
- Last time we collected height/armspan data. Is
this a sample or a population?
3Gallup Poll, 1/9/07
- "As you may know, the Bush administration is
considering a temporary but significant increase
in the number of U.S. troops in Iraq to help
stabilize the situation there. Would you favor or
oppose this?"
4Results
- Results based on 1004 randomly selected adults (> 18 years) interviewed Jan 5-7, 2007.
- 61% are opposed.
- "For results based on this sample, one can say with 95% confidence that the maximum error attributable to sampling and other random effects is ±3 percentage points."
5Pop Quiz
- Is the value 61% a statistic or a parameter?
- The margin of error is given as 3%. What does the margin of error measure?
- a) the variability in the sample
- b) the variability in the population
- c) the variability in repeated sampling
6Sampling paradigm
- In the U.S., the proportion of adults who are opposed to a surge is p (or p*100%).
- We take a random sample of n = 1004.
- The proportion of our sample ("p-hat") is an estimate of the proportion in the population.
7A simulation
- Choose a value to serve as p (say p = .6)
- Our "data" consist of 1004 numbers: 0's represent those in favor, 1's are those opposed.
- 589 out of 1004 say "opposed", so p-hat = 589/1004 = .5866
- mean(x) = .5866
- sd(x) = .4926
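A minimal sketch of how such a data set could be generated in R, assuming p = .6 as on the slide. The seed and the variable names are illustrative assumptions, and a different random draw will not reproduce the 589 shown above.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 1004
p <- 0.6                         # assumed population proportion

# 1 = opposed (probability p), 0 = in favor (probability 1 - p)
x <- sample(c(0, 1), n, replace = TRUE, prob = c(1 - p, p))

phat <- sum(x) / n               # sample proportion
mean(x)                          # same number as phat for 0/1 data
sd(x)                            # close to sqrt(phat * (1 - phat))
```

For 0/1 data the mean is the sample proportion, which is why the slide can report p-hat and mean(x) as the same number.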
8xbar = .5866, s = .493
9How do we know sample proportion is a good
estimate of population proportion?
- Law of Large Numbers
- sample averages (and proportions) converge on population values
- implying that for finite values, the sample proportion might be close if the sample size is large
10Coin flips: sample proportion "settles down" to
0.5
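The "settling down" can be sketched by tracking the running proportion of heads as fair-coin flips accumulate. The number of flips (5000) and the seed are arbitrary choices for illustration, not taken from the slides.

```r
set.seed(2)                                        # arbitrary seed
flips <- rbinom(5000, size = 1, prob = 0.5)        # 1 = heads, 0 = tails

# Proportion of heads after 1 flip, 2 flips, ..., 5000 flips
running_phat <- cumsum(flips) / seq_along(flips)

running_phat[c(10, 100, 1000, 5000)]               # snapshots at a few stopping points

plot(running_phat, type = "l", log = "x",
     xlab = "number of flips", ylab = "proportion of heads")
abline(h = 0.5, lty = 2)                           # the true value
```

Early stopping points can land well away from 0.5, while later ones tend to sit close to it.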
11So if we stop earlier, say n = 10
p-hat = .60
12Which raises the question
- If we stop early, how far away will our sample proportion be from the true value?
- Or, in a survey setting, if we take a finite sample of n = 1004, how far off from the population proportion are we likely to be?
13A simulation might help
- Assume p = .60 (population proportion)
- Take a sample of n = 1004 and find p-hat.
- Save this value
- Repeat above 3 steps 10000 times.
14The R code (for the record)
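    phat <- c()                            # will hold the 10000 sample proportions
    for (i in 1:10000) {
      # one survey: 1004 answers, 0 = in favor (prob .4), 1 = opposed (prob .6)
      x <- sample(c(0, 1), 1004, replace = TRUE, prob = c(.4, .6))
      temp <- sum(x) / 1004                # p-hat for this survey
      phat <- c(phat, temp)                # save it
    }
    hist(phat)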
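The loop above grows phat one element at a time. An equivalent vectorized sketch, offered as an alternative rather than as the lecture's method, draws the number of "opposed" answers in each survey directly from a binomial distribution. The seed and the names n, p and reps are assumptions for illustration.

```r
set.seed(3)          # arbitrary seed
n <- 1004            # survey size
p <- 0.6             # assumed population proportion
reps <- 10000        # number of simulated surveys

# rbinom() returns, for each survey, how many of the n respondents are "opposed"
phat <- rbinom(reps, size = n, prob = p) / n

hist(phat)
```

Summing n independent 0/1 draws with success probability p gives a Binomial(n, p) count, so both versions sample p-hat from the same distribution.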
15each dot represents one survey of 1004 people
1610,000 sample proportions, n = 1004
17Observe that...
- sample proportions are centered on the true population value p = .60
- variability is not great: smallest is .54, biggest is .66
- distribution is bell-shaped
18We've just witnessed the Central Limit Theorem
- If samples are independent and random and sufficiently large
- means (and proportions) follow a nearly Normal distribution
- the mean of the Normal is the mean of the population
- the SD of the Normal (aka the standard error) is the population SD divided by sqrt(n)
19CLT applied to sample proportions
- p-hat has an approximately Normal distribution
- mean is p
- SE is sqrt(p(1-p)/n)
- For our simulation, p = .60, so our p-hats will be centered on .6 with an SD of sqrt(.6*.4/1004) = 0.0155
20We saw
- Normal
- mean(phat) = 0.600 (expected .6)
- sd(phat) = 0.01554 (expected 0.0155)
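A sketch of this comparison between simulation and theory. It regenerates the sample proportions with rbinom() under an arbitrary seed, so the simulated values will differ slightly from the 0.600 and 0.01554 reported above.

```r
set.seed(4)                                   # arbitrary seed
n <- 1004
p <- 0.6
phat <- rbinom(10000, size = n, prob = p) / n # 10000 simulated sample proportions

se_theory <- sqrt(p * (1 - p) / n)            # CLT standard error, about 0.0155

c(simulated_mean = mean(phat), expected_mean = p)
c(simulated_sd = sd(phat), expected_sd = se_theory)

# Share of simulated p-hats within 2 SE of p; a Normal shape suggests roughly 95%
mean(abs(phat - p) <= 2 * se_theory)
```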
21In practice, we don't know p
- but we can get a good approximation to the standard error using
- sqrt(phat(1-phat)/n)
- rather than
- sqrt(p(1-p)/n)
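To see why the substitution does little harm here, the sketch below evaluates the standard error formula at a few plausible p-hat values between the .54 and .66 extremes seen in the simulation. The helper name se and the particular values are illustrative assumptions.

```r
n <- 1004

# Hypothetical helper: standard error of a sample proportion
se <- function(prop, n) sqrt(prop * (1 - prop) / n)

se(0.60, n)                                  # SE using the true p of the simulation

phat_values <- c(0.54, 0.58, 0.61, 0.66)     # p-hats a survey might plausibly produce
round(se(phat_values, n), 4)                 # plug-in SEs, all close to the value above
```

Because p(1-p) changes slowly near p = .5, a p-hat that is off by a few percentage points moves the SE by well under 0.001.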
22So if we take a random sample of n = 1004
- and we see p-hat = .61, we know that
- The true value of p can't be far away.
- SE = sqrt(.61*.39/1004) = 0.0154
- So 68% of the time we do this, p will be within 0.0154 of p-hat
- And 95% of the time it will be within 2*0.0154 ≈ 0.03
23Which leads us to conclude
- that the true proportion of the population that opposes a surge is somewhere in the interval from .61 - .03 = 0.58 to .61 + .03 = 0.64
24Confidence intervals
- This is an example of a 95% confidence interval.
- Because 95% of all samples will produce a p-hat that is within 2 standard errors of the true value, we are 95% confident that ours is a "good" interval.
25Formula
- A 95% CI for a proportion is
- estimate +/- 2*(Standard Error)
- p-hat +/- 2*sqrt(phat(1-phat)/n)
- 0.61 +/- 2*sqrt(.61*.39/1004)
- (.58, .64)
- note: substituting p-hat for p in the SE means we get an approximate value
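The formula can be wrapped in a small function. This is a sketch: prop_ci and its mult argument are hypothetical names introduced here, and the multiplier 2 is the rounded value used on the slide.

```r
# Hypothetical helper: approximate CI for a proportion, estimate +/- mult * SE
prop_ci <- function(phat, n, mult = 2) {
  se <- sqrt(phat * (1 - phat) / n)          # plug-in standard error
  c(lower = phat - mult * se, upper = phat + mult * se)
}

ci <- prop_ci(0.61, 1004)
round(ci, 2)                                 # lower 0.58, upper 0.64
```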
26What does 95% mean?
- If we repeat this infinitely many times
- take a sample of n = 1004 from population
- calculate sample proportion
- find an interval using +/- 2 SE
- then 95% of these CIs will contain the truth and 5% will not.
- We see only one: (.58, .64). It is either good or bad, but we are confident it is good.
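This repeated-sampling interpretation can be checked by simulation. The sketch below uses 10000 repetitions in place of "infinitely many" and assumes a true p of .6; both choices, and the seed, are illustrative.

```r
set.seed(5)                                   # arbitrary seed
n <- 1004
p <- 0.6                                      # assumed true proportion
reps <- 10000

phat <- rbinom(reps, size = n, prob = p) / n  # one p-hat per simulated survey
se_hat <- sqrt(phat * (1 - phat) / n)         # plug-in SE for each survey

lower <- phat - 2 * se_hat
upper <- phat + 2 * se_hat

covered <- lower <= p & p <= upper            # TRUE when the interval contains p
mean(covered)                                 # expected to be near 0.95
```

Each individual interval either contains p or does not; the 95% describes the long-run share of TRUE values in covered.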
27Where did the 95% come from?
- It came from the normal curve.
- The CLT told us that p-hat followed an (approx) normal distribution.
- For Normals, 68% of probability is within 1 standard deviation of the mean, 95% within 2, 99.7% within 3.
- A normal table gives other probabilities
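In R, pnorm() plays the role of the normal table. A short sketch reproducing the 68 / 95 / 99.7 figures:

```r
k <- c(1, 2, 3)                          # number of standard deviations

# Probability that a Normal value falls within k SDs of its mean
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)                    # 0.6827 0.9545 0.9973
```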
28Change confidence level by changing the width of
margin of error
[Figure: Normal curve centered at p-hat = 0.61, axis marked at -0.015 and .015. Labeled widths: 1 SE = 68%, 1.6 SE = 90%, 2 SEs = 95%, 3 SEs = 99.7%.]
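The multipliers in the figure are rounded. A sketch of how more exact multipliers, and the resulting margins of error for p-hat = .61 and n = 1004, could be obtained with qnorm(); the variable names are illustrative.

```r
conf_level <- c(0.68, 0.90, 0.95, 0.997)

# Two-sided multiplier: the Normal quantile leaving (1 - level)/2 in each tail
mult <- qnorm(1 - (1 - conf_level) / 2)

se <- sqrt(0.61 * 0.39 / 1004)           # plug-in SE from the survey

data.frame(conf_level,
           multiplier = round(mult, 2),
           margin = round(mult * se, 3))
```

Higher confidence requires a larger multiplier and therefore a wider interval from the same data.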
29The CLT applies to
- any linear combination of the observations
- assuming observations are randomly sampled, and independent
- it does NOT matter what the distribution of the population looks like
- if n is small, the distribution will be only approximately normal, and this might be a very poor approximation
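A sketch illustrating the last two points with a strongly skewed population. The Exponential(1) population and the sample sizes 5 and 200 are assumptions chosen for illustration, not examples from the lecture.

```r
set.seed(6)                                           # arbitrary seed
reps <- 10000

# Mean of one random sample of size n from an Exponential(1) population
# (a right-skewed population with mean 1 and SD 1)
sample_mean <- function(n) mean(rexp(n, rate = 1))

means_small <- replicate(reps, sample_mean(5))        # small n
means_large <- replicate(reps, sample_mean(200))      # large n

par(mfrow = c(1, 2))
hist(means_small, main = "n = 5")                     # still visibly skewed
hist(means_large, main = "n = 200")                   # close to bell-shaped

c(simulated_sd = sd(means_large), expected_sd = 1 / sqrt(200))
```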
30the CLT does NOT apply to
- non-linear combinations, such as the sample median or the standard deviation
- non-random samples
- samples that are dependent
31simulation
- http://onlinestatbook.com/stat_sim/sampling_dist/index.html
32Summary
- Confidence level is a statement about the sampling process, not the sample
- Margin of error is determined to achieve the desired confidence level
- We can calculate the confidence level only if we know the sampling distribution, i.e. the probability distribution of the sample statistic
33Pop Quiz
- Is the value 61% a statistic or a parameter?
- The margin of error is given as 3%. What does the margin of error measure?
- a) the variability in the sample
- b) the variability in the population
- c) the variability in repeated sampling
35For next time
- In WWII, the German army produced tanks with sequential serial numbers. The Allies captured a few tanks and wanted to infer the total number of tanks produced.
- Suppose you had captured 10 tanks. Come up with three estimators for the total number of tanks.
- Data: 911, 5146, 6083, 944, 11944, 9365, 6087, 6647, 7076, 12275