Title: The eternal tension in statistics...
1. The eternal tension in statistics...
2. Between what you really, really want (the population) but can never get to...
3. So you have to make do with the sample: you can estimate the population and make educated guesses,
4. but the bottom line is you can never have the population.
5. An investigator usually wants to generalize about a class of individuals or things (the population). For example, in forecasting the results of elections, population = voters; for the Furniture.com class group, population = all potential users.
6.
- Parameters: Usually there are some numerical facts about the population which you want to estimate.
- Statistic: You can do that by measuring the same aspect in the sample (Descriptive Statistics).
- Depending on the accuracy of measurement and the representativeness of your sample, you can make inferences about the population (Inferential Statistics).
7.
- One person's sample is another person's population.
- IS 271 students are a sample for the larger student population of UC Berkeley.
- IS 271 students could be the population for some other study.
8. The 1936 election: the Literary Digest poll
- Candidates: Democrat F. D. Roosevelt and Republican Alfred Landon
- The Literary Digest had called the winner in every election since 1916
- Its prediction: Roosevelt will get 43%
- It polled 2.4 million people!
9. The election results (Roosevelt's percentage)
- The election result: 62%
- The Digest prediction: 43%
- Gallup's prediction of the Digest prediction: 44%
- Gallup's prediction of the election result: 56%
10. Why the Digest went wrong: how they picked their sample
- Selection Bias: A systematic tendency on the part of the sampling procedure to exclude one kind of person or another from the sample.
- Sample Size: When a selection procedure is biased, making the sample larger does not help; it just repeats the mistake on a larger scale.
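The point about sample size can be checked with a small simulation. This is only an illustrative sketch: the 62% true share is taken from the election result, while the function name `biased_poll` and the `bias` value of 0.5 are made-up assumptions, not a model of the Digest's actual procedure.

```python
import random

def biased_poll(true_share, n, bias, rng):
    """Poll n people when supporters are under-represented in the sampling frame.

    true_share: fraction of the population supporting the candidate.
    bias: hypothetical relative chance (0..1) that a supporter is in the frame,
          compared with a non-supporter.
    """
    # Share of supporters among the people the procedure can actually reach.
    frame_share = true_share * bias / (true_share * bias + (1 - true_share))
    hits = sum(rng.random() < frame_share for _ in range(n))
    return hits / n

rng = random.Random(0)
for n in (1_000, 100_000):
    # The estimate settles near the frame share (about 0.45), not the true 0.62.
    print(n, round(biased_poll(0.62, n, bias=0.5, rng=rng), 3))
```

A larger `n` only shrinks the chance error around the biased value; it does not move the estimate toward the true share.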
11. How they picked their sample
- Non-Response Bias: Non-respondents can differ from respondents in ways beyond the obvious fact that they did not respond!
- Lower-income and upper-income people tend not to respond, so the middle class is over-represented.
- Non-Response Bias: One can give more weight to people who were available but hard to get.
12. For example: predicting elections
- Non-Voters: Gallup uses a few questions to predict whether people will vote at all. The election forecast is based only on those likely to vote.
- Undecided: Asks people who they are leaning towards as of today.
- Non-Response Bias: One can give more weight to people who were available but hard to get.
- Ratio Estimation: Looks at the sample obtained and compares it to the population. If there are too many educated people, they are given less weight.
- Interviewer Bias: Build redundancy into the questionnaire to check for consistency. Also reinterview a small sample to check for consistency.
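Ratio estimation as described above can be sketched as a simple reweighting. The group labels, counts, and population shares below are invented for illustration, and `weighted_share` is a hypothetical helper; this is not Gallup's actual procedure.

```python
def weighted_share(sample, population_shares):
    """Estimate support after reweighting groups to match the population.

    sample: list of (group, supports) pairs, supports being True or False.
    population_shares: dict mapping group -> share of the population.
    """
    n = len(sample)
    counts = {}
    for group, _ in sample:
        counts[group] = counts.get(group, 0) + 1
    # Weight = population share / sample share, so over-represented groups count less.
    weights = {g: population_shares[g] / (counts[g] / n) for g in counts}
    total = sum(weights[g] for g, _ in sample)
    support = sum(weights[g] for g, s in sample if s)
    return support / total

# Hypothetical sample: college graduates are 60% of the sample but 30% of the population.
sample = ([("college", True)] * 40 + [("college", False)] * 20
          + [("no_college", True)] * 10 + [("no_college", False)] * 30)
shares = {"college": 0.3, "no_college": 0.7}

print(sum(s for _, s in sample) / len(sample))   # raw share: 0.5
print(round(weighted_share(sample, shares), 3))  # reweighted share: 0.375
```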
13. Distribution of brown M&Ms
Yellow 20%
Brown 30%
Orange 10%
Blue 10%
Red 20%
Green 10%
14. The distribution of the population
15. Sample 1
16. Sample 2
17. Sample 3
18. [Figure: Population, Sample 1, Sample 2, Sample 3, 5 Samples]
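The comparison between the population and repeated samples can be reproduced with a short sketch. The color percentages are the ones listed for the M&M population; the sample size of 50 and the helper name `draw_sample` are assumptions, since the slides do not say how large each sample was.

```python
import random
from collections import Counter

# Population percentages by color, as listed on the slide.
POPULATION = {"Yellow": 20, "Brown": 30, "Orange": 10,
              "Blue": 10, "Red": 20, "Green": 10}

def draw_sample(n, rng):
    """Draw n candies at random and return the percentage of each color."""
    colors = list(POPULATION)
    weights = [POPULATION[c] for c in colors]
    picks = rng.choices(colors, weights=weights, k=n)
    counts = Counter(picks)
    return {c: 100 * counts[c] / n for c in colors}

rng = random.Random(1)
for i in range(1, 4):
    # Each sample deviates from the population by a different chance error.
    print(f"Sample {i}:", draw_sample(50, rng))
```

Each sample deviates somewhat from the population percentages, and the deviations tend to shrink as the sample size grows.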
19. How much is each sample going to deviate from the population? (How big is the chance error for each sample likely to be?)
Computation of Standard Error: √(number of samples) × SD of sample
Data: 9, 7, 6, 9, 11, 12
Mean = 9, Standard Deviation = 2.2, Standard Error = 4.4
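The arithmetic can be sketched in a few lines. Two points are assumptions rather than anything stated on the slide: the SD is computed by dividing by n (giving about 2.1, while dividing by n - 1 gives about 2.3, and the slide rounds to 2.2), and the number of draws is passed in explicitly, since the slide's 4.4 equals √4 × 2.2.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Standard deviation of a list, dividing by n (an assumption; see lead-in)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def standard_error(n_draws, sd_value):
    """Square root of the number of draws, times the SD, as in the slide's formula."""
    return math.sqrt(n_draws) * sd_value

data = [9, 7, 6, 9, 11, 12]
print(mean(data))              # 9.0
print(round(sd(data), 2))      # 2.08
print(standard_error(4, 2.2))  # 4.4, using the slide's rounded SD and 4 draws
```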
20. Why is knowing the chance error important?
- Allows us to estimate the accuracy of our estimates and whether we are justified in using inferential statistics.
- Allows us to make inferences about the population.
21. If there is a lot of spread in the samples, the SD is big and it will be hard to predict how accurate the sample will be. So the standard error will be big as well.
Standard Deviation (SD) and Standard Error (SE): SD refers to a list of numbers. How far are most numbers from the mean? SE refers to the variability in samples. How variable is each sample going to be?
22. Should the sample for Texas be larger than that for Rhode Island?
23. Surprisingly, no.
Analogy: Suppose you took a drop of liquid for analysis. If the liquid is well mixed, it would not matter whether the drop came from a small or a large bottle, or whether the sample is 1% or 0.1% of the population.
The statistical rationale: The accuracy of sampling is related to the standard deviation of the sample. Example: Election of 1992, voters who chose Clinton: 46% of voters in New Mexico, SD 0.50; 37% of voters in Texas, SD 0.48. Therefore the accuracy of a sample in Texas and in New Mexico will be similar.
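The two SD figures can be checked directly, treating each voter as 1 (chose Clinton) or 0 (did not). The sketch also uses the standard formula for the standard error of a sample percentage, SD / √n; the sample size of 2,500 is a hypothetical value chosen only for illustration.

```python
import math

def sd_zero_one(p):
    """SD of a population coded 1 (chose the candidate) or 0 (did not)."""
    return math.sqrt(p * (1 - p))

def se_percent(p, n):
    """Standard error, in percentage points, of a sample percentage from n draws."""
    return 100 * sd_zero_one(p) / math.sqrt(n)

SAMPLE_SIZE = 2500  # hypothetical sample size, the same in both states
for state, p in (("New Mexico", 0.46), ("Texas", 0.37)):
    print(f"{state}: SD = {sd_zero_one(p):.2f}, "
          f"SE = {se_percent(p, SAMPLE_SIZE):.2f} points")
```

Both states come out at roughly one percentage point for the same sample size, even though Texas has a far larger population.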
24. Types of Samples
- The convenience sample: The more convenient elementary units are chosen from a population.
- The judgement sample: Units are chosen according to a judgement made by someone who is familiar with the relevant characteristics of the population.
- The random sample: Units are chosen randomly with a known probability.
25.
- Quota Sampling: Each interviewer is assigned a fixed quota of subjects fitting certain demographic characteristics. Within the quota, the interviewer picks a judgement sample.
- Problems: quotas might not be representative, and judgement sampling is open to bias.
26. Types of Random Sample
- Simple Random Sample: Every unit of the population has an equal chance of being chosen.
- A systematic random sample: One unit is chosen on a random basis; additional elementary units are taken at evenly spaced intervals until the desired number of units is obtained.
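The two schemes above can be sketched as follows. The function names and the population of 100 numbered units are assumptions made for illustration.

```python
import random

def simple_random_sample(units, k, rng):
    """Every unit has an equal chance of being chosen (drawn without replacement)."""
    return rng.sample(units, k)

def systematic_sample(units, k, rng):
    """Pick a random starting unit, then take every step-th unit after it."""
    step = len(units) // k       # spacing between chosen units
    start = rng.randrange(step)  # the one unit chosen on a random basis
    return [units[start + i * step] for i in range(k)]

rng = random.Random(7)
units = list(range(100))         # hypothetical population of 100 numbered units
print(simple_random_sample(units, 10, rng))
print(systematic_sample(units, 10, rng))
```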
27.
- The stratified random sample: Obtained by independently selecting a separate simple random sample from each population stratum. A population can be divided into different groups based on some characteristic or variable like income or education.
- The cluster sample: Obtained by selecting clusters from the population on the basis of simple random sampling. The sample comprises a census of each random cluster selected. For example, a cluster may be something like a village, a school, or a state.
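A minimal sketch of both designs, assuming the population is already grouped into named strata or clusters. The function names, the income strata, and the villages are hypothetical.

```python
import random

def stratified_sample(strata, k_per_stratum, rng):
    """Draw a separate simple random sample from each stratum.

    strata: dict mapping stratum name -> list of units.
    """
    return {name: rng.sample(units, k_per_stratum)
            for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random, then take a census of each chosen cluster.

    clusters: dict mapping cluster name -> list of units.
    """
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(11)
strata = {"low_income": list(range(0, 50)), "high_income": list(range(50, 80))}
villages = {f"village_{i}": [f"v{i}_p{j}" for j in range(5)] for i in range(6)}

print(stratified_sample(strata, 3, rng))  # 3 units from every stratum
print(cluster_sample(villages, 2, rng))   # everyone in 2 randomly chosen villages
```

The stratified design samples within every group, whereas the cluster design samples the groups themselves and then includes every unit in each chosen group.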