GATHERING - PowerPoint PPT Presentation

Slides: 37
Provided by: james308

1
GATHERING AND PRODUCING DATA
2
How Data are Obtained
  • Census
  • Everyone is included
  • Observational Study
  • Observes individuals and measures variables but
    does not attempt to influence responses
  • Includes surveys and polls
  • Experiment
  • Deliberately imposes some treatment on
    individuals in order to observe their responses
  • In medicine, this is called a clinical trial

3
3 BIG ideas
  1. Examine a part of the whole: take a sample from
    a population
  2. Randomization ensures that, on average, the
    sample is representative
  3. The size of the sample is what's important, not
    the size of the population

4
Big Idea 1: Examine Part of the Whole
  • We are studying an entire population of
    individuals (or subjects), but looking at
    everyone is practically impossible.
  • How many support the U.S. role in Iraq?
  • What percent of the tomato shipment is bad?
  • How many children are obese?
  • What's the price of gas at the pump across
    Minnesota?
  • Settle for looking at a smaller group, a sample,
    selected from the population.
  • Sampling is natural! Think about cooking. You
    taste (sample) a small part to get an idea about
    the dish as a whole.

5
Populations and parameters, samples and
statistics (This stuff is important!)
  • A parameter is a numerical quantity that
    describes a population.
  • A statistic is a numerical quantity that
    describes the sample.
  • We study a population by looking at a sample. We
    infer about a parameter by using statistics from
    the sample.
  • Notation: use Greek letters for parameters and
    Latin letters for statistics.

6
Example: Polling
  • Minneapolis Star Tribune: A Gallup Poll,
    conducted Aug. 16-18, 1999, asked, "Do you
    consider pro-wrestling to be a sport, or not?" Of
    the people polled, 19% said, "Yes." (Results were
    based on telephone interviews with a randomly
    selected national sample of 1,028 adults, 18
    years and older.)
  • What's the population, parameter, sample,
    statistic?
  • Population: Americans, 18 years and older
  • Sample: The 1,028 people who were polled
  • Parameter: The proportion of American adults who
    believe pro-wrestling is a sport. (Called the
    population proportion.)
  • p = ? (unknown)
  • Statistic: The proportion of people in the sample
    who said they believe pro-wrestling is a sport.
    (Called the sample proportion.) p̂ = 0.19

7
Example: Surveying a lot shipment
  • A carload of ball bearings has an average
    diameter of 2.502 centimeters. This is within the
    specifications for acceptance of the lot by the
    purchaser. An inspector happens to inspect 100
    bearings from the lot and finds the average
    diameter of these to be 2.499 cm. This is within
    the specified limits, so the entire lot is
    accepted.
  • What's the population, parameter, sample,
    statistic?
  • Population: The carload of ball bearings
  • Sample: The 100 ball bearings that were inspected
  • Parameter: The average diameter of the ball
    bearings in the carload.
  • µ = 2.502 cm (The population mean.)
  • Statistic: The average diameter of the 100 ball
    bearings in the sample.
  • x̄ = 2.499 cm (The sample mean.)
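
The parameter/statistic distinction in the ball bearing example can be sketched in a few lines of Python. This is only an illustration: the carload of 10,000 bearings and the spread of 0.01 cm are made-up assumptions, and the names `population`, `mu`, and `x_bar` are hypothetical.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 bearing diameters in cm (made-up values).
population = [rng.gauss(2.502, 0.01) for _ in range(10_000)]

# Parameter: describes the whole population (usually unknown in practice).
mu = statistics.mean(population)

# Statistic: describes only the 100 bearings that were inspected.
sample = rng.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"parameter mu    = {mu:.4f} cm")
print(f"statistic x_bar = {x_bar:.4f} cm")
```

The two numbers are close but not identical, which is the point: the statistic is an estimate of the parameter, computed from a part of the whole.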

8
Big Idea 2: Randomization
  • Randomization makes sure that on average the
    sample looks like the rest of the population.
  • Randomization makes it possible to use
    quantitative tools (probability) to draw
    inferences about the population when we see only
    a sample.
  • Randomization protects against bias.

9
Who will you vote for in 2008? Some examples of
biased samples
  • 100 people at the Mall of America
  • 100 people in front of the Metrodome after a
    Twins game
  • 100 friends, family and relatives
  • 100 people who volunteered to answer a survey
    question on your web site
  • 100 people who answered their phone during supper
    time
  • The first 100 people you see after you wake up in
    the morning

10
Bias: the bane of sampling
  • Samples that systematically misrepresent
    individuals in the population are said to be
    biased.
  • Bias is the systematic failure of a sample to
    represent its population
  • There is usually no way to fix a biased sample
    and no way to salvage useful information from it.
  • The best way to avoid bias is to select
    individuals for the sample at random. The value
    of deliberately introducing randomness is one of
    the great insights of Statistics.

11
Simple Random Sample (SRS)
  • Suppose we want to draw a sample of size n from
    some population
  • For a simple random sample, every possible subset
    of size n has an equal chance to be selected and
    to become the sample.
  • Such samples guarantee that each individual has
    an equal chance of being selected.
  • Each combination of people also has an equal
    chance of being selected.
  • The sampling frame is a list of the population
    from which the sample is drawn. From the sampling
    frame, we can choose an SRS using random numbers.
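
Drawing an SRS from a sampling frame might look like the following minimal sketch. It assumes the frame is simply a list of ID numbers; the function name `simple_random_sample` and the frame of 2,000 IDs are hypothetical. Python's standard `random.sample` draws without replacement, so every subset of size n is equally likely.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Return n distinct individuals; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: student ID numbers 1..2000.
frame = list(range(1, 2001))
srs = simple_random_sample(frame, 10, seed=1)
print(srs)
```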

12
SRS and Sampling Variability
  • Samples drawn at random generally differ from one
    another.
  • These differences lead to different values for
    the variables we measure.
  • Sample-to-sample differences are called sampling
    variability
  • This is different from bias!
  • Example: Everyone picks 10 Skittles at random from
    The Bowl and counts how many are red.
  • The variability of the different sample counts is
    sampling variability.
  • If half the class peeked and tried to get more
    reds, the differences would reflect bias.
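
The Skittles exercise can be imitated in code. The sketch below rests on assumptions that are not in the slides: a bowl of 500 candies with 20% reds, 200 repeated draws, and a made-up model of peeking in which a student grabs 15 candies and keeps the 10 with the most reds.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical bowl: 100 red Skittles out of 500 (made-up proportion of 20%).
bowl = ["red"] * 100 + ["other"] * 400

def count_reds(candies):
    return sum(1 for candy in candies if candy == "red")

def honest_draw():
    """Pick 10 Skittles at random and count the reds."""
    return count_reds(rng.sample(bowl, 10))

def peeking_draw():
    """Assumed cheating model: grab 15 at random, keep the 10 with the most reds."""
    return min(10, count_reds(rng.sample(bowl, 15)))

honest_counts = [honest_draw() for _ in range(200)]
peeking_counts = [peeking_draw() for _ in range(200)]

honest_mean = sum(honest_counts) / len(honest_counts)
peeking_mean = sum(peeking_counts) / len(peeking_counts)
print("first ten honest counts:", honest_counts[:10])
print(f"honest mean  = {honest_mean:.2f} (expected value 2.0)")
print(f"peeking mean = {peeking_mean:.2f} (expected value about 3.0 under this model)")
```

The honest counts scatter around 2 reds per draw: that scatter is sampling variability. The peeking counts are shifted as a group, which is what bias looks like.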

13
Sources of sampling error
  • In the context of using a sample to estimate a
    population parameter, sampling variability is
    sometimes called sampling error.
  • Taking an SRS of 3 students to estimate the
    average height of all students will have a large
    sampling error, but it is not biased.
  • Taking a sample of 300 basketball players to
    estimate the average height of all students will
    produce less variability, but the sample is
    biased.
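
A small simulation can make the contrast between the two height samples concrete. All heights here are invented: the sketch assumes 2,000 students, of whom 300 are basketball players who are taller on average.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical student body (heights in cm, all numbers made up).
players = [rng.gauss(195, 7) for _ in range(300)]
others = [rng.gauss(170, 9) for _ in range(1700)]
students = players + others
true_mean = statistics.mean(students)

# Repeat the tiny SRS many times: single estimates scatter widely,
# but on average they land on the true mean (unbiased, high variability).
srs_means = [statistics.mean(rng.sample(students, 3)) for _ in range(2000)]

# The 300 basketball players give a stable estimate that misses the target.
biased_estimate = statistics.mean(players)

print(f"true mean height          = {true_mean:.1f}")
print(f"average of SRS estimates  = {statistics.mean(srs_means):.1f}")
print(f"spread of SRS estimates   = {statistics.stdev(srs_means):.1f}")
print(f"basketball-only estimate  = {biased_estimate:.1f}")
```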

14
More complex sampling designs
  • Simple random sampling is not the only way to
    sample.
  • More complicated designs may save time or money
    or help avoid sampling problems.
  • Stratified sampling
  • Cluster sampling
  • Systematic sampling
  • Multi-stage sampling
  • All statistical sampling designs have in common
    the idea that chance, rather than human choice,
    is used to select the sample.

15
Stratified sampling
  • Suppose we want a sample of 240 Carleton students
  • We also want to ensure that each discipline is
    represented
  • The student body divides as
  • Arts and Literature: 20%
  • Humanities: 15%
  • Social Sciences: 30%
  • Mathematics and Natural Sciences: 35%
  • For the sample, select
  • 240 x 0.20 = 48 Arts and Literature students
  • 240 x 0.15 = 36 Humanities students
  • 240 x 0.30 = 72 Social Sciences students
  • 240 x 0.35 = 84 Mathematics and Natural Sciences
    students
  • Within each discipline, choose an SRS
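
The allocation arithmetic and the within-stratum SRS can be sketched as follows. The shares and the total of 240 come from the slide; the rosters (2,000 made-up student labels) and the function names are hypothetical.

```python
import random

def proportional_allocation(total_n, shares):
    """Sample size per stratum: total_n times the stratum's population share."""
    return {name: round(total_n * share) for name, share in shares.items()}

def stratified_sample(rosters, allocation, seed=None):
    """Draw an SRS of the allocated size within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(rosters[name], allocation[name]) for name in rosters}

shares = {
    "Arts and Literature": 0.20,
    "Humanities": 0.15,
    "Social Sciences": 0.30,
    "Mathematics and Natural Sciences": 0.35,
}
allocation = proportional_allocation(240, shares)
print(allocation)

# Hypothetical rosters: 2,000 students split according to the shares above.
sizes = {name: round(2000 * share) for name, share in shares.items()}
rosters = {name: [f"{name} #{i}" for i in range(size)] for name, size in sizes.items()}
chosen = stratified_sample(rosters, allocation, seed=2)
```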

16
Stratified Sampling
  • The population is divided into homogeneous
    groups, called strata, before the sample is
    selected.
  • Then simple random sampling is used within each
    stratum before the results are combined.
  • Advantages
  • Sample will be representative for the strata
  • Reduces sampling variability
  • Disadvantages
  • May be logistically difficult, if even possible,
    to implement
  • Must have information about the population
  • Note: a stratified sample is not an SRS

17
Cluster sampling
  • Sometimes stratifying isn't practical and simple
    random sampling is difficult. Splitting the
    population into clusters can make sampling more
    practical.
  • Suppose you want to do a face-to-face survey of
    attitudes in Minnesota based on a sample of size
    600.
  • Choosing 600 people at random, finding their
    addresses, and meeting them in person is costly
    and time-consuming.
  • Another idea: Choose some cities at random. Then
    some streets at random, and then some blocks at
    random. Interview everyone on the selected
    blocks.
  • The blocks are the clusters.
  • If you know there are about 20 people per block,
    then choose a random sample of 30 blocks.
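
A simplified sketch of the block-level step is shown below. It collapses the city, street, and block stages into a single random choice of blocks, and the frame of 1,000 blocks with 15 to 25 residents each is a made-up assumption.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical frame of clusters: 1,000 blocks with about 20 residents each.
blocks = {
    block_id: [f"block{block_id}-resident{i}" for i in range(rng.randint(15, 25))]
    for block_id in range(1000)
}

# Randomly choose 30 blocks, then interview everyone on each chosen block.
chosen_blocks = rng.sample(sorted(blocks), 30)
interviewees = [person for b in chosen_blocks for person in blocks[b]]

print(len(chosen_blocks), "blocks,", len(interviewees), "people")
```

Note that the sample size is only approximately 600, because block sizes vary; the random choice is made among clusters, not among individuals.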

18
Cluster sampling in the news: The Lancet study on
Iraq casualties
  • In October 2006, The Lancet published "Iraq
    mortality after the 2003 invasion: a
    cross-sectional cluster sample survey"
  • The study was controversial because of its
    finding that hundreds of thousands of Iraqis
    (most likely about 650,000) had been killed since
    the U.S. invasion.
  • Earlier reports, including those from the U.S.
    and British governments, had put the number at
    about 30,000.
  • The study was based on cluster sampling, a common
    methodology in public health and human rights
    work
  • The clusters were groups of 40 houses in close
    proximity whose locations were chosen based on
    population demographics.

19
Cluster Sampling
  • If each cluster fairly represents the population,
    cluster sampling will give an unbiased sample.
  • Advantage
  • Easier to implement depending on context
  • Disadvantage
  • Greater sampling variability, so less statistical
    accuracy

20
Multistage Sampling
  • Most surveys conducted by the government or
    professional polling organizations use some
    combination of stratified and cluster sampling as
    well as simple random sampling.
  • The Current Population Survey is how the
    government estimates the unemployment rate
  • Counties are divided into 2,007 Primary Sampling
    Units (PSUs)
  • PSUs are divided into smaller census blocks, and
    the blocks are grouped into strata. Households in
    each block are grouped into clusters of about 4
    households each
  • The final sample consists of these clusters and
    interviewers go to all households in the chosen
    clusters.

21
Systematic Samples
  • Sometimes we draw a sample by selecting
    individuals systematically.
  • For example, you might survey every 10th person
    on an alphabetical list of students.
  • To make it random, you must still start the
    systematic selection from a randomly selected
    individual.
  • When there is no reason to believe that the order
    of the list could be associated in any way with
    the responses sought, systematic sampling can
    give a representative sample.
  • Systematic sampling can be much less expensive
    than true random sampling.
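
A minimal sketch of systematic sampling with a random start, assuming an ordered list of 200 made-up student labels and a step of 10 (the function name `systematic_sample` is hypothetical):

```python
import random

def systematic_sample(ordered_list, k, seed=None):
    """Take every k-th item, starting at a random position among the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return ordered_list[start::k]

# Hypothetical alphabetical list of 200 students.
students = [f"student{i:03d}" for i in range(200)]
sample = systematic_sample(students, 10, seed=5)
print(sample)
```

Only the starting point is random; once it is fixed, the rest of the sample is determined, which is why the order of the list must be unrelated to the responses.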

22
Sampling Example
  • Hospital administrators are concerned about the
    possibility of drug abuse among employees. They
    plan to pick a sample of 40 from 800 employees,
    and administer a drug test. What's the sampling
    strategy?
  • Randomly select 10 doctors, 10 nurses, 10 office
    staff, and 10 support staff for the test.
  • Each employee has a 4-digit ID number. Randomly
    choose 40 numbers.
  • At the start of each shift, choose every 20th
    person who arrives for work.
  • There are 40 departments of 20 employees each.
    Randomly choose two departments (say radiology
    and ER) and test all the people who work in those
    departments.

23
Big Idea 3: Sample size is key, not population
size
  • How large a sample size do we need for the sample
    to be reasonably representative of the
    population?
  • In general, it's the size of the sample, not the
    size of the population, that makes the difference
    in sampling.
  • The fraction of the population that you've
    sampled doesn't matter (as long as the population
    is much larger than the sample). It's the sample
    size itself that's important.
  • Back to cooking: If the soup is mixed enough, a
    tablespoon will suffice, whether you're sampling
    from a saucepan or from a barrel.

24
How big a sample?
  • Most professional polls choose a sample size of
    about 1,000 people.
  • These polls report a margin of error of about
    3%. That means that with high confidence their
    estimates are within 3 percentage points of the
    true population parameter value.
  • The margin of error for a sample of 1,000 people
    is essentially the same for Minneapolis (pop.
    400,000), Minnesota (pop. 5 million), and the
    U.S. (pop. 290 million).
  • But the bad news is that if you want similar
    accuracy at Carleton, you need to poll over half
    the student body.
  • Coming Attractions: the formulas for the Margin
    of Error. But you'll have to wait until we get to
    Statistical Inference to learn why.
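
As a preview, the claim about the three populations can be checked numerically. The sketch below uses the standard textbook approximation for the 95% margin of error of a sample proportion, with a finite population correction. That formula is an assumption brought in here, not something stated on the slides, and it uses the worst case p = 0.5.

```python
import math

def margin_of_error(n, population_size, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion from an SRS."""
    standard_error = math.sqrt(p * (1 - p) / n)
    # Finite population correction: it matters only when n is a
    # noticeable fraction of the population.
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return z * standard_error * fpc

populations = {
    "Minneapolis": 400_000,
    "Minnesota": 5_000_000,
    "U.S.": 290_000_000,
}
margins = {place: margin_of_error(1000, size) for place, size in populations.items()}
for place, moe in margins.items():
    print(f"{place:12s} margin of error = {moe:.4f}")
```

For all three populations the result rounds to 0.031, that is, about 3 percentage points.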

25
How to Sample Badly
  • Advice columnist Ann Landers once asked parents,
    "If you had it to do over again, would you have
    children?"
  • Do you think responses were representative of
    public opinion?
  • Over 100,000 people responded, and 70% answered
    "No!"
  • A later survey, more carefully designed, showed
    90% of parents are happy with their decision to
    have children.
  • In a voluntary response sample, a large group of
    individuals is invited to respond, and all who do
    respond are counted. But such samples are almost
    always biased toward those with strong opinions
    or those who are strongly motivated.
  • Since the sample is not representative, the
    resulting voluntary response bias invalidates the
    survey.
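
How strongly motivated respondents can distort a result is easy to simulate. The sketch below assumes a population in which 90% of parents would answer "yes", and made-up response rates in which unhappy parents are far more likely to write in; none of these rates come from the actual survey.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 parents: 90% would have children again.
parents = ["yes"] * 90_000 + ["no"] * 10_000

# Assumed response rates (made up): unhappy parents are much more motivated.
response_rate = {"yes": 0.02, "no": 0.40}

# Voluntary response: each person decides whether to reply.
responses = [answer for answer in parents if rng.random() < response_rate[answer]]
share_no_voluntary = responses.count("no") / len(responses)

# Simple random sample of 1,000 from the same population.
srs = rng.sample(parents, 1000)
share_no_srs = srs.count("no") / len(srs)

print(f"'No' among voluntary responses: {share_no_voluntary:.0%}")
print(f"'No' in a random sample:        {share_no_srs:.0%}")
```

Under these assumptions, thousands of voluntary responses point the wrong way while a random sample of 1,000 stays close to the true 10%.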

26
What Can Go Wrong? or, How to Sample Badly
  • In convenience sampling, we simply include the
    individuals who are convenient. But they may not
    be representative of the population.
  • A psychology professor performs an experiment
    using his classroom.
  • A company samples opinions by using its own
    customers.
  • Sampling mice from a large cage to study how a
    drug affects physical activity: The lab assistant
    reaches into the cage to select the mice one at a
    time until 10 are chosen. But which mice will
    likely be chosen?

27
Other problems
  • Under-coverage
  • In some survey designs a portion of the
    population is not sampled or has a smaller
    representation in the sample than it has in the
    population.
  • Using telephone directories for a phone survey.
  • Half the households in large cities are unlisted.
  • About 5% of households don't have phones.
  • Random digit dialing only partially addresses
    this problem.
  • Misses students in dorms, inmates in prison,
    soldiers in the military, homeless people. And
    it's too expensive to call Hawaii or Alaska.
  • Non-response
  • No survey succeeds in getting responses from
    everyone.
  • The problem is that those who don't respond may
    differ from those who do.
  • The Bureau of Labor Statistics gets a 6-7%
    non-response rate.
  • But it's common for opinion polls and market
    research studies to have a 75-80% non-response
    rate.

28
What Else Can Go Wrong?
  • Response bias refers to anything in the survey
    design that influences the responses
  • In particular, the wording of a question can have
    a big impact on the responses

29
Some classic statistical mistakes: The Literary
Digest Poll
  • 1936 presidential election: Franklin Delano
    Roosevelt vs. Alf Landon
  • The Literary Digest had correctly called every
    presidential election since 1916
  • Sample size: 2.4 million!
  • They predicted Roosevelt would lose, with 43% of
    the vote
  • In fact it was a landslide for Roosevelt, with
    62% of the vote

30
Literary Digest poll
  • Context
  • Midst of the Great Depression
  • 9 million unemployed; real income down by 1/3
  • Landon's program: Cut spending
  • Roosevelt's program: Balance people's budgets
    before the government's budget
  • How the polling was done
  • Survey sent to 10 million people
  • And 2.4 million responded (that's huge!)

31
A huge sample, but The Literary Digest poll was
biased
  • The sampling frame was not representative of the
    electorate: selection bias
  • Based on magazine subscription lists, drivers'
    registrations, country club memberships, phone
    numbers (when telephones were a luxury)
  • Biased toward better-off groups (who were more
    Republican)
  • Voluntary response bias
  • Main issue was the economy
  • The anti-Roosevelt forces were angry, and had a
    higher response rate!

32
Year  Sample size  Winner      Gallup prediction (%)  Election result (%)  Error (points)
1936  50,000       Roosevelt   55.7                   62.5                 -6.8
1940  50,000       Roosevelt   52.0                   55.0                 -3.0
1944  50,000       Roosevelt   51.5                   53.8                 -2.3
1948  50,000       Truman      44.5                   49.5                 -5.0
1952   5,385       Eisenhower  51.0                   55.4                 -4.4
1956   8,144       Eisenhower  59.5                   57.8                  1.7
1960   8,015       Kennedy     51.0                   50.1                  0.9
1964   6,625       Johnson     64.0                   61.3                  2.7
1968   4,414       Nixon       43.0                   43.5                 -0.5
1972   3,689       Nixon       62.0                   61.8                  0.2
1976   3,439       Carter      48.0                   50.1                 -2.1
1980   3,500       Reagan      47.0                   50.8                 -3.8
1984   3,456       Reagan      59.0                   59.2                 -0.2
1988   4,089       Bush        56.0                   53.9                  2.1
1992   2,019       Clinton     49.0                   43.3                  5.7
1996   2,417       Clinton     52.0                   50.1                  1.9
2000   3,129       Bush        48.0                   47.9                  0.1
2004   1,866       Bush        49.0                   51.0                 -2.0

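
The Error column is the Gallup prediction minus the election result. A short sketch, with the table's figures typed in by hand, recomputes that column and compares the average size of the error for the four polls with 50,000 respondents against the later, much smaller polls. The split at 1952 is chosen here for illustration only.

```python
# (year, sample size, Gallup prediction %, election result %), from the table.
polls = [
    (1936, 50000, 55.7, 62.5), (1940, 50000, 52.0, 55.0),
    (1944, 50000, 51.5, 53.8), (1948, 50000, 44.5, 49.5),
    (1952, 5385, 51.0, 55.4), (1956, 8144, 59.5, 57.8),
    (1960, 8015, 51.0, 50.1), (1964, 6625, 64.0, 61.3),
    (1968, 4414, 43.0, 43.5), (1972, 3689, 62.0, 61.8),
    (1976, 3439, 48.0, 50.1), (1980, 3500, 47.0, 50.8),
    (1984, 3456, 59.0, 59.2), (1988, 4089, 56.0, 53.9),
    (1992, 2019, 49.0, 43.3), (1996, 2417, 52.0, 50.1),
    (2000, 3129, 48.0, 47.9), (2004, 1866, 49.0, 51.0),
]

# Error = prediction minus result, in percentage points.
errors = {year: round(pred - result, 1) for year, _, pred, result in polls}

def mean_abs_error(years):
    return sum(abs(errors[y]) for y in years) / len(years)

early = [year for year, n, _, _ in polls if n == 50000]
late = [year for year, n, _, _ in polls if n < 50000]
early_mae = mean_abs_error(early)
late_mae = mean_abs_error(late)

print(f"mean absolute error, n = 50,000 polls: {early_mae:.2f} points")
print(f"mean absolute error, smaller polls:    {late_mae:.2f} points")
```

The polls with 50,000 respondents miss by about 4.3 points on average, while the later polls with a few thousand respondents miss by about 2.0, so a larger sample did not by itself give a better prediction.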
33
The Year the Polls Elected Dewey
  • 1948 Election: Harry Truman versus Thomas Dewey
  • Every major poll (including Gallup) predicted
    Dewey would win by 5 percentage points

34
What went wrong?
  • Pollsters chose their samples using quota
    sampling. Each interviewer was assigned a fixed
    quota of subjects in certain categories (race,
    sex, age).
  • For instance, an interviewer in St. Louis was
    required to talk to 13 people:
  • 6 living in the suburbs, 7 in the central city
  • 7 men and 6 women
  • Of the 7 men (similar for women): 3 under 40
    years old, 4 over 40; 1 black, 6 white.
  • Within each category, interviewers were free to
    choose whom to interview.
  • But this left room for human choice and
    inevitable bias.
  • Republicans were easier to reach. They had
    telephones, permanent addresses, nicer
    neighborhoods.
  • So interviewers ended up with too many
    Republicans.
  • Quota sampling was abandoned for random sampling.

35
Do you believe the poll? What questions should
you ask?
  • Who carried out the survey?
  • What is the population?
  • How was the sample selected?
  • How large was the sample?
  • What was the response rate?
  • How were subjects contacted?
  • When was the survey conducted?
  • What are the exact questions asked?

36
To summarize . . .
  • We are often interested in a population and some
    parameter that describes the population.
  • We select a sample from that population and use a
    statistic from the sample to estimate the unknown
    parameter
  • To obtain a good estimate, the sample must be as
    representative of the population as possible.
    Randomization ensures that, on average, the
    sample is representative.
  • Possible sources of error are sampling
    variability and bias.
  • To reduce sampling variability, take a bigger
    sample
  • To reduce bias, get a better sampling design
  • It's the sample size, not the population size,
    that matters