Title: GATHERING
1. GATHERING AND PRODUCING DATA
2. How Data are Obtained
- Census
  - Everyone is included
- Observational Study
  - Observes individuals and measures variables but does not attempt to influence responses
  - Includes surveys and polls
- Experiment
  - Deliberately imposes some treatment on individuals in order to observe their responses
  - In medicine, this is called a clinical trial
3. 3 BIG ideas
- Examine a part of the whole: take a sample from a population
- Randomization ensures the sample is representative
- The size of the sample is what's important, not the size of the population
4. Big Idea 1: Examine Part of the Whole
- We are studying an entire population of individuals (or subjects), but looking at everyone is practically impossible.
  - How many support the U.S. role in Iraq?
  - What percent of the tomato shipment is bad?
  - How many children are obese?
  - What's the price of gas at the pump across Minnesota?
- Settle for looking at a smaller group, a sample, selected from the population.
- Sampling is natural! Think about cooking. You taste (sample) a small part to get an idea about the dish as a whole.
5. Populations and parameters, samples and statistics (This stuff is important!)
- A parameter is a numerical quantity that describes a population.
- A statistic is a numerical quantity that describes the sample.
- We study a population by looking at a sample. We infer about a parameter by using statistics from the sample.
- Notation: use Greek letters for parameters and Latin letters for statistics
6. Example: Polling
- Minneapolis Star Tribune: A Gallup Poll, conducted Aug. 16-18, 1999, asked, "Do you consider pro-wrestling to be a sport, or not?" Of the people polled, 19% said, "Yes." (Results were based on telephone interviews with a randomly selected national sample of 1,028 adults, 18 years and older.)
- What's the population, parameter, sample, statistic?
  - Population: Americans, 18 years and older
  - Sample: The 1,028 people who were polled
  - Parameter: The proportion of American adults who believe pro-wrestling is a sport. (Called the population proportion.)
    - p = ?
  - Statistic: The proportion of people in the sample who said they believe pro-wrestling is a sport. (Called the sample proportion.)
    - p̂ = 0.19
7. Example: Surveying a lot shipment
- A carload of ball bearings has an average diameter of 2.502 centimeters. This is within the specifications for acceptance of the lot by the purchaser. An inspector happens to inspect 100 bearings from the lot and finds the average diameter of these to be 2.499 cm. This is within the specified limits, so the entire lot is accepted.
- What's the population, parameter, sample, statistic?
  - Population: The carload of ball bearings
  - Sample: The 100 ball bearings that were inspected
  - Parameter: The average diameter of the ball bearings in the carload.
    - µ = 2.502 cm (The population mean.)
  - Statistic: The average diameter of the 100 ball bearings in the sample.
    - x̄ = 2.499 cm (The sample mean.)
8. Big Idea 2: Randomization
- Randomization makes sure that on average the sample looks like the rest of the population.
- Randomization makes it possible to use quantitative tools (probability) to draw inferences about the population when we see only a sample.
- Randomization protects against bias.
9. Who will you vote for in 2008? Some examples of biased samples
- 100 people at the Mall of America
- 100 people in front of the Metrodome after a Twins game
- 100 friends, family and relatives
- 100 people who volunteered to answer a survey question on your web site
- 100 people who answered their phone during supper time
- The first 100 people you see after you wake up in the morning
10. Bias: the bane of sampling
- Samples that systematically misrepresent individuals in the population are said to be biased.
- Bias is the systematic failure of a sample to represent its population
- There is usually no way to fix a biased sample and no way to salvage useful information from it.
- The best way to avoid bias is to select individuals for the sample at random. The value of deliberately introducing randomness is one of the great insights of Statistics.
11. Simple Random Sample (SRS)
- Suppose we want to draw a sample of size n from some population
- For a simple random sample, every possible subset of size n has an equal chance to be selected and to become the sample.
  - Such samples guarantee that each individual has an equal chance of being selected.
  - Each combination of people also has an equal chance of being selected.
- The sampling frame is a list of the population from which the sample is drawn. From the sampling frame, we can choose an SRS using random numbers.
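Drawing an SRS from a sampling frame can be sketched in a few lines of Python. This is only an illustration: the frame of 800 ID numbers and the function name `simple_random_sample` are made up for the example, and `random.sample` stands in for "using random numbers".

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct individuals from the sampling frame.

    random.sample picks without replacement, so every subset of
    size n is equally likely to become the sample.
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: ID numbers 1..800 (an assumption for illustration)
frame = list(range(1, 801))
sample = simple_random_sample(frame, 40, seed=1)
print(sorted(sample))
```

Passing a `seed` only makes the draw repeatable for checking; leave it out for a fresh random sample each time.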
12. SRS and Sampling Variability
- Samples drawn at random generally differ from one another.
- These differences lead to different values for the variables we measure.
- Sample-to-sample differences are called sampling variability
- This is different from bias!
- Example: Everyone pick 10 Skittles at random from The Bowl and count how many reds.
  - The variability of the different sample counts is sampling variability.
  - If half the class peeked and tried to get more reds, the differences would reflect bias.
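The Skittles exercise can be imitated with a small simulation. The bowl's contents (500 candies, 20% red) and the class size of 30 are assumptions chosen for illustration, not figures from the slides.

```python
import random

rng = random.Random(2024)

# Hypothetical bowl: 500 Skittles, 100 of them red (assumed mix)
bowl = ["red"] * 100 + ["other"] * 400

def count_reds(bowl, n=10):
    """Pick n Skittles at random (no peeking) and count the reds."""
    return sum(1 for candy in rng.sample(bowl, n) if candy == "red")

# One honest random draw per student in a hypothetical class of 30
counts = [count_reds(bowl) for _ in range(30)]
print(counts)                     # the counts differ from student to student
print(sum(counts) / len(counts))  # class average number of reds per draw
```

Every student used the same fair procedure, so the spread in `counts` is sampling variability, not bias.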
13. Sources of sampling error
- In the context of using a sample to estimate a population parameter, sampling variability is sometimes called sampling error.
- Taking an SRS of 3 students to estimate the average height of all students will have a large sampling error, but it is not biased.
- Taking a sample of 300 basketball players to estimate the average height of all students will produce less variability, but the sample is biased.
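The contrast between a small unbiased sample and a large biased one can be seen in a rough simulation. Everything numeric here is hypothetical: a student body of 5,000 with 600 basketball players, and heights drawn from made-up normal distributions.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical student body (all numbers are assumptions for illustration)
others = [rng.gauss(170, 8) for _ in range(4400)]   # heights in cm
players = [rng.gauss(195, 6) for _ in range(600)]   # basketball players
population = others + players
true_mean = statistics.mean(population)

# Repeat each sampling scheme many times and record the estimate each time
srs_of_3 = [statistics.mean(rng.sample(population, 3)) for _ in range(500)]
players_300 = [statistics.mean(rng.sample(players, 300)) for _ in range(500)]

for label, estimates in [("SRS of 3", srs_of_3), ("300 players", players_300)]:
    centre = statistics.mean(estimates)
    spread = statistics.stdev(estimates)
    print(f"{label:12s} centre={centre:6.1f} spread={spread:5.2f} true={true_mean:6.1f}")
```

Under these assumptions the SRS estimates scatter widely but centre on the true mean, while the basketball-player estimates cluster tightly around the wrong value.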
14. More complex sampling designs
- Simple random sampling is not the only way to sample.
- More complicated designs may save time or money or help avoid sampling problems.
  - Stratified sampling
  - Cluster sampling
  - Systematic sampling
  - Multi-stage sampling
- All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample.
15. Stratified sampling
- Suppose we want a sample of 240 Carleton students
- We also want to ensure discipline representation
- The student body divides as
  - Arts and Literature 20%
  - Humanities 15%
  - Social Sciences 30%
  - Mathematics and Natural Sciences 35%
- For the sample, select
  - 240 x .20 = 48 Arts and Lit students
  - 240 x .15 = 36 Humanities students
  - 240 x .30 = 72 Social science students
  - 240 x .35 = 84 Natural science students
- Within each discipline, choose an SRS
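The allocation arithmetic and the within-stratum draws can be sketched as follows. The shares and the sample size of 240 come from the slide; the student body of 2,000, the ID strings, and the function names are hypothetical.

```python
import random

def proportional_allocation(shares, n):
    """Give each stratum a share of the sample equal to its population share."""
    return {stratum: round(n * share) for stratum, share in shares.items()}

def stratified_sample(frame_by_stratum, allocation, seed=None):
    """Draw a separate SRS inside every stratum."""
    rng = random.Random(seed)
    return {s: rng.sample(frame_by_stratum[s], k) for s, k in allocation.items()}

shares = {
    "Arts and Literature": 0.20,
    "Humanities": 0.15,
    "Social Sciences": 0.30,
    "Mathematics and Natural Sciences": 0.35,
}
allocation = proportional_allocation(shares, 240)
print(allocation)

# Hypothetical frame: 2,000 students split by the same shares (an assumption)
frame_by_stratum = {
    s: [f"{s} #{i}" for i in range(round(2000 * share))]
    for s, share in shares.items()
}
sample = stratified_sample(frame_by_stratum, allocation, seed=5)
print({s: len(chosen) for s, chosen in sample.items()})
```

The sketch relies on the rounded stratum sizes adding up to the target, which they do here (48 + 36 + 72 + 84 = 240); with other shares a rounding adjustment might be needed.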
16. Stratified Sampling
- The population is divided into homogeneous groups, called strata, before the sample is selected.
- Then simple random sampling is used within each stratum before the results are combined.
- Advantages
  - Sample will be representative for the strata
  - Reduces sampling variability
- Disadvantages
  - May be logistically difficult, if even possible, to implement
  - Must have information about the population
- Note: a stratified sample is not an SRS
17. Cluster sampling
- Sometimes stratifying isn't practical and simple random sampling is difficult. Splitting the population into clusters can make sampling more practical.
- Suppose you want to do a face-to-face survey of attitudes in Minnesota based on a sample of size 600.
- Choosing 600 people at random, finding their addresses, and meeting them in person is costly and time-consuming.
- Another idea: Choose some cities at random. Then some streets at random, and then some blocks at random. Interview everyone on the selected blocks.
- The blocks are the clusters.
- If you know there are about 20 people per block, then choose a random sample of 30 blocks (600 / 20 = 30).
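A simplified sketch of the block-selection step follows. It skips the city and street stages and samples blocks directly; the frame of 5,000 blocks and the 15 to 25 residents per block are assumptions made up for the example.

```python
import random

rng = random.Random(11)

# Hypothetical frame of blocks, each holding roughly 20 residents (assumed)
blocks = {
    b: [f"block{b}-person{i}" for i in range(rng.randint(15, 25))]
    for b in range(5000)
}

target_size = 600
people_per_block = 20
n_blocks = target_size // people_per_block   # 600 / 20 = 30 clusters

# Randomly choose whole blocks, then interview everyone on them
chosen_blocks = rng.sample(sorted(blocks), n_blocks)
interviewees = [person for b in chosen_blocks for person in blocks[b]]

print(n_blocks, len(interviewees))
```

Because block sizes vary, the number of people interviewed lands near 600 rather than exactly on it.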
18. Cluster sampling in the news: The Lancet study on Iraq casualties
- In October 2006, The Lancet published "Mortality after the 2003 invasion of Iraq: a cross-sectional cluster sample survey"
- The study was controversial because of its findings that hundreds of thousands of Iraqis (most likely about 650,000) had been killed since the U.S. invasion.
- Earlier reports, including those from the U.S. and British governments, had put the number at about 30,000.
- The study was based on cluster sampling, a common methodology in public health and human rights work
- The clusters were groups of 40 houses in close proximity whose locations were chosen based on population demographics.
19. Cluster Sampling
- If each cluster fairly represents the population, cluster sampling will give an unbiased sample.
- Advantage
  - Easier to implement depending on context
- Disadvantage
  - Greater sampling variability, so less statistical accuracy
20. Multistage Sampling
- Most surveys conducted by the government or professional polling organizations use some combination of stratified and cluster sampling as well as simple random sampling.
- The Current Population Survey is how the government estimates the unemployment rate
  - Counties are divided into 2,007 Primary Sampling Units
  - PSUs are divided into smaller census blocks, and the blocks are grouped into strata. Households in each block are grouped into clusters of about 4 households each
  - The final sample consists of these clusters, and interviewers go to all households in the chosen clusters.
21. Systematic Samples
- Sometimes we draw a sample by selecting individuals systematically.
  - For example, you might survey every 10th person on an alphabetical list of students.
- To make it random, you must still start the systematic selection from a randomly selected individual.
- When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample.
- Systematic sampling can be much less expensive than true random sampling.
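Taking every 10th person from a random starting point might look like this in Python. The list of 200 student names is hypothetical, and `systematic_sample` is just an illustrative helper.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Take every k-th individual, starting from a random position in the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)   # random start makes the selection random
    return frame[start::k]

# Hypothetical alphabetical list of 200 students (an assumption)
students = [f"student{i:03d}" for i in range(1, 201)]
sample = systematic_sample(students, 10, seed=3)
print(len(sample), sample[:3])
```

Only the starting point is random; once it is fixed, the rest of the sample is determined, which is why the order of the list must be unrelated to the responses.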
22. Sampling Example
- Hospital administrators are concerned about the possibility of drug abuse among employees. They plan to pick a sample of 40 from 800 employees, and administer a drug test. What's the sampling strategy?
  - Randomly select 10 doctors, 10 nurses, 10 office staff, and 10 support staff for the test.
  - Each employee has a 4-digit ID number. Randomly choose 40 numbers.
  - At the start of each shift, choose every 20th person who arrives for work.
  - There are 40 departments of 20 employees each. Randomly choose two departments (say radiology and ER) and test all the people who work in those departments.
23. Big Idea 3: Sample size is key, not population size
- How large a sample size do we need for the sample to be reasonably representative of the population?
- In general, it's the size of the sample, not the size of the population, that makes the difference in sampling.
- The fraction of the population that you've sampled doesn't matter. It's the sample size itself that's important
- Back to cooking: If the soup is mixed enough a tablespoon will suffice, whether you're sampling from a saucepan or from a barrel.
24. How big a sample?
- Most professional polls choose a sample size of about 1,000 people.
- These polls report a margin of error of about 3%. That means that with high confidence their estimates are within 3% of the true population parameter value.
- The margin of error for a sample of 1,000 people is essentially the same for Minneapolis (pop. 400,000), Minnesota (pop. 5 million), and the U.S. (pop. 290 million)
- But the bad news is that if you want similar accuracy at Carleton, you need to poll over half the student body.
- Coming Attractions: Margin of Error and [...]. But you'll have to wait until we get to Statistical Inference to learn why.
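The claim that population size barely matters can be checked numerically. The slides do not give a formula, so this sketch assumes the common textbook one: a 95% margin of error for a proportion, z * sqrt(p(1 - p) / n) with p = 0.5, multiplied by the finite population correction sqrt((N - n) / (N - 1)). The function name is hypothetical.

```python
import math

def margin_of_error(n, N=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumed textbook formula (not taken from the slides); the finite
    population correction is applied when a population size N is given.
    """
    moe = z * math.sqrt(p * (1 - p) / n)
    if N is not None:
        moe *= math.sqrt((N - n) / (N - 1))
    return moe

populations = [
    ("Minneapolis", 400_000),
    ("Minnesota", 5_000_000),
    ("U.S.", 290_000_000),
]
margins = {name: margin_of_error(1000, N) for name, N in populations}
for name, moe in margins.items():
    print(f"{name:12s} {moe:.4f}")
```

Under this assumed formula all three margins come out at about 0.031, which is the "about 3%" that polls of 1,000 people report.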
25. How to Sample Badly
- Advice columnist Ann Landers once asked parents, "If you had it to do over again, would you have children?"
- Do you think responses were representative of public opinion?
- Over 100,000 people responded, and 70% answered "No!"
- A later survey, more carefully designed, showed 90% of parents are happy with their decision to have children.
- In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted. But such samples are almost always biased toward those with strong opinions or those who are strongly motivated.
- Since the sample is not representative, the resulting voluntary response bias invalidates the survey.
26. What Can Go Wrong? or, How to Sample Badly
- In convenience sampling, we simply include the individuals who are convenient. But they may not be representative of the population.
  - A psychology professor performs an experiment using his classroom.
  - A company samples opinions by using its own customers.
  - Sampling mice from a large cage to study how a drug affects physical activity: The lab assistant reaches into the cage to select the mice one at a time until 10 are chosen. But which mice will likely be chosen?
27. Other problems
- Under-coverage
  - In some survey designs a portion of the population is not sampled or has a smaller representation in the sample than it has in the population.
  - Using telephone directories for a phone survey.
    - Half the households in large cities are unlisted.
    - About 5% of households don't have phones.
  - Random digit dialing only partially addresses this problem
    - Misses students in dorms, inmates in prison, soldiers in the military, homeless people. And it's too expensive to call Hawaii or Alaska.
- Non-response
  - No survey succeeds in getting responses from everyone.
  - The problem is that those who don't respond may differ from those who do.
  - The Bureau of Labor Statistics gets a 6-7% non-response rate.
  - But it's common for opinion polls and market research studies to have a 75-80% non-response rate.
28. What Else Can Go Wrong?
- Response bias refers to anything in the survey design that influences the responses
- In particular, the wording of a question can have a big impact on the responses
29. Some classic statistical mistakes: The Literary Digest Poll
- 1936 presidential election: Franklin Delano Roosevelt vs. Alf Landon
- The Literary Digest had called every presidential election since 1916
- Sample size: 2.4 million!
- They predicted Roosevelt would lose with 43% of the vote
- In fact it was a landslide for Roosevelt at 62%
30. Literary Digest poll
- Context
  - Midst of the Great Depression
  - 9 million unemployed; real income down 1/3
  - Landon's program: Cut spending
  - Roosevelt's program: Balance people's budgets before the government's budget
- How the polling was done
  - Survey sent to 10 million people
  - And 2.4 million responded (that's huge!)
31. A huge sample, but The Literary Digest poll was biased
- The sampling frame was not representative of the electorate: selection bias
  - Based on magazine subscription lists, drivers' registrations, country club memberships, phone numbers (when telephones were a luxury)
  - Biased toward better off groups (who were more Republican)
- Voluntary response bias
  - Main issue was the economy
  - The anti-Roosevelt forces were angry, and had a higher response rate!
32.
Year  Sample size  Winner      Gallup prediction (%)  Election result (%)  Error (points)
1936  50,000       Roosevelt   55.7                   62.5                 -6.8
1940  50,000       Roosevelt   52.0                   55.0                 -3.0
1944  50,000       Roosevelt   51.5                   53.8                 -2.3
1948  50,000       Truman      44.5                   49.5                 -5.0
1952  5,385        Eisenhower  51.0                   55.4                 -4.4
1956  8,144        Eisenhower  59.5                   57.8                 1.7
1960  8,015        Kennedy     51.0                   50.1                 0.9
1964  6,625        Johnson     64.0                   61.3                 2.7
1968  4,414        Nixon       43.0                   43.5                 -0.5
1972  3,689        Nixon       62.0                   61.8                 0.2
1976  3,439        Carter      48.0                   50.1                 -2.1
1980  3,500        Reagan      47.0                   50.8                 -3.8
1984  3,456        Reagan      59.0                   59.2                 -0.2
1988  4,089        Bush        56.0                   53.9                 2.1
1992  2,019        Clinton     49.0                   43.3                 5.7
1996  2,417        Clinton     52.0                   50.1                 1.9
2000  3,129        Bush        48.0                   47.9                 0.1
2004  1,866        Bush        49.0                   51.0                 -2.0
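The Error column can be recomputed from the other two, reading it as Gallup's prediction minus the election result (an interpretation inferred from the rows, not stated on the slide). The sketch below also compares the average size of the error for the 50,000-person polls against the later, much smaller ones; splitting the rows at that sample size is a choice made for this illustration.

```python
# (year, sample size, Gallup prediction %, election result %) copied from the table
rows = [
    (1936, 50000, 55.7, 62.5), (1940, 50000, 52.0, 55.0),
    (1944, 50000, 51.5, 53.8), (1948, 50000, 44.5, 49.5),
    (1952, 5385, 51.0, 55.4), (1956, 8144, 59.5, 57.8),
    (1960, 8015, 51.0, 50.1), (1964, 6625, 64.0, 61.3),
    (1968, 4414, 43.0, 43.5), (1972, 3689, 62.0, 61.8),
    (1976, 3439, 48.0, 50.1), (1980, 3500, 47.0, 50.8),
    (1984, 3456, 59.0, 59.2), (1988, 4089, 56.0, 53.9),
    (1992, 2019, 49.0, 43.3), (1996, 2417, 52.0, 50.1),
    (2000, 3129, 48.0, 47.9), (2004, 1866, 49.0, 51.0),
]

def error(prediction, result):
    """Prediction minus result, in percentage points (assumed definition)."""
    return round(prediction - result, 1)

def mean_abs_error(subset):
    """Average size of the error, ignoring its sign."""
    return sum(abs(error(p, r)) for _, _, p, r in subset) / len(subset)

large_polls = [row for row in rows if row[1] >= 50000]   # 1936-1948
small_polls = [row for row in rows if row[1] < 50000]    # 1952-2004

print(f"50,000-person polls: {mean_abs_error(large_polls):.1f} points")
print(f"smaller polls:       {mean_abs_error(small_polls):.1f} points")
```

Under this reading, the four polls of 50,000 missed by about 4.3 points on average, while the later polls of a few thousand missed by about 2.0.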
33. The Year the Polls Elected Dewey
- 1948 Election: Harry Truman versus Thomas Dewey
- Every major poll (including Gallup) predicted Dewey would win by 5 percentage points
34. What went wrong?
- Pollsters chose their samples using quota sampling. Each interviewer was assigned a fixed quota of subjects in certain categories (race, sex, age).
- For instance, an interviewer in St. Louis was required to talk to 13 people
  - 6 live in the suburbs, 7 in the central city
  - 7 men and 6 women
  - Of the 7 men (similar for women)
    - 3 under 40 years old, 4 over 40
    - 1 black, 6 white.
- In each category, interviewers were free to choose.
- But this left room for human choice and inevitable bias.
- Republicans were easier to reach. They had telephones, permanent addresses, nicer neighborhoods.
- So interviewers ended up with too many Republicans.
- Quota sampling was abandoned for random sampling.
35. Do you believe the poll? What questions should you ask?
- Who carried out the survey?
- What is the population?
- How was the sample selected?
- How large was the sample?
- What was the response rate?
- How were subjects contacted?
- When was the survey conducted?
- What are the exact questions asked?
36. To summarize . . .
- We are often interested in a population and some parameter that describes the population.
- We select a sample from that population and use a statistic from the sample to estimate the unknown parameter
- To obtain a good estimate, the sample must be as representative of the population as possible. And randomization, on average, ensures a representative sample
- Possible sources of error are sampling variability and bias.
  - To reduce sampling variability, take a bigger sample
  - To reduce bias, get a better sampling design
- It's the sample size, not the population size, that matters