FPP 16-18 - PowerPoint PPT Presentation

Provided by: Elizabet377

Transcript and Presenter's Notes


1
Expected Values, Standard Errors, Central Limit
Theorem
  • FPP 16-18

2
Statistical inference
  • Up to this point we have focused primarily on
    exploratory type statistical analyses (with a
    little probability thrown in).
  • We will now dive into the realm of statistical
    inference
  • The ideas associated with sampling distributions,
    p-values, and confidence intervals are more
    abstract and are therefore slightly harder
  • These concepts are also very powerful
  • For good if used correctly
  • For bad if used incorrectly

3
Statistics vs probability modeling
  • Probability: know the truth, want to estimate
    the chances that data occur
  • Statistics: know the data that occur, want to
    infer about the truth

4
Coin toss
  • Suppose we tossed a coin 50 times. We are
    interested to know if this coin is fair.
  • If the coin is fair, then a straightforward
    model that mimics reality is
  • # heads = 0.5 × (# of tosses)
  • It should be fairly obvious that the number of
    heads won't be exactly 25. How far away from 25
    would convince us that the coin isn't fair?
  • Statistical model
  • # heads = 0.5 × (# of tosses) + chance error
  • This chance error will help us answer the
    question: how many heads is too many for the
    coin to be fair?
  • We will study this chance error quite rigorously.

5
Study of chance error
  • Plan of attack for study of chance error
  • Law of averages
  • Sampling distributions
  • Central limit theorem
  • Our main tool will be so-called box models

6
Law of averages
  • What does the law of averages say?
  • Toss a coin
  • As the # of tosses increases, what happens to
  • # heads − 0.5 × (# tosses)?
  • % heads − 50%?
  • In words
  • As the number of tosses goes up
  • The difference between the number of heads and
    half the number of tosses gets bigger
  • The difference between the percentage of heads
    and 50% gets smaller (if coin is fair)
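
The two statements above can be checked with a small simulation. This is only a sketch: the function names, the seed, and the choice of 100 versus 10,000 tosses are illustrative assumptions, not part of the slides.

```python
import random

def toss_summary(n_tosses, rng):
    """Toss a fair coin n_tosses times; return (absolute error, percentage error)."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    abs_error = heads - 0.5 * n_tosses          # number of heads minus half the tosses
    pct_error = 100 * heads / n_tosses - 50     # percentage of heads minus 50%
    return abs_error, pct_error

def typical_errors(n_tosses, n_repeats=200, seed=0):
    """Average size (absolute value) of both errors over repeated experiments."""
    rng = random.Random(seed)                   # fixed seed so runs are repeatable
    results = [toss_summary(n_tosses, rng) for _ in range(n_repeats)]
    avg_abs = sum(abs(a) for a, _ in results) / n_repeats
    avg_pct = sum(abs(p) for _, p in results) / n_repeats
    return avg_abs, avg_pct

for n in (100, 10_000):
    avg_abs, avg_pct = typical_errors(n)
    print(f"{n:>6} tosses: |#heads - n/2| ~ {avg_abs:.1f}, |%heads - 50| ~ {avg_pct:.2f}")
```

With more tosses the first column should grow while the second shrinks, which is the pattern the slide describes in words.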

7
Law of averages
  • A die is thrown some number of times, and the
    object is to guess the total number of spots.
    There is a one-dollar penalty for each spot that
    the guess is off. For instance, if you guess 200
    and the total is 215, you lose $15.
  • Which do you prefer: 50 throws or 100?

8
Chance processes
  • When tossing a coin
  • Actual # heads ≠ Expected # heads
  • What is the likely size of the difference?
  • Strategy: find an analogy between the process
    being studied and drawing numbers at random from
    a box (box model)

9
Box models
  • A so-called box model is a good starting point
    for statistical inference
  • The purpose of these very simple models is to
    analyze chance variability
  • They are a construction for learning about
    characteristics of populations
  • They help us incorporate the probability
    techniques we learned in studying chance error.

10
Box Model
  • A die is thrown some number of times, and the
    object is to guess the total number of spots.
    What is the typical total number of spots after
    50 throws? After 100 throws?
  • Create a box model for this
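
One way to set up the box for the die, sketched in Python. The box holds one ticket per face; the helper name `ev_and_se_of_sum` is a hypothetical one chosen for this illustration, and the SD uses divisor N (the population SD).

```python
from math import sqrt
from statistics import mean, pstdev

die_box = [1, 2, 3, 4, 5, 6]      # one ticket per face of the die
box_avg = mean(die_box)           # 3.5
box_sd = pstdev(die_box)          # population SD (divisor N), about 1.71

def ev_and_se_of_sum(n_draws, avg, sd):
    """Expected value and standard error of the sum of n_draws draws with replacement."""
    return n_draws * avg, sqrt(n_draws) * sd

for n in (50, 100):
    ev, se = ev_and_se_of_sum(n, box_avg, box_sd)
    print(f"{n} throws: expect {ev:.0f} spots, give or take about {se:.1f}")
```

The typical miss is larger for 100 throws than for 50, because the SE of a sum grows with the square root of the number of draws.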

11
Constructing Box models
  • A quiz has 25 multiple choice questions. Each
    question has 5 possible answers, one of which is
    correct. A correct answer is worth 4 points, but
    a point is taken off for each incorrect answer.
    A student answers all of the questions by
    guessing randomly.
  • What is the box model for this scenario?
  • What is the expected score on the quiz?
  • What is the range of scores?
  • What is the SD of scores?
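
A possible box for the quiz, as a sketch: each guess is one draw from a box with a +4 ticket (the correct answer) and four −1 tickets (the incorrect answers), and the score is the sum of 25 draws. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

# One ticket per answer choice: +4 for the correct one, -1 for each of the 4 wrong ones.
box = [4, -1, -1, -1, -1]
n_questions = 25

box_avg = mean(box)                        # 0
box_sd = pstdev(box)                       # 2.0 (divisor N)

expected_score = n_questions * box_avg     # expected value of the sum of 25 draws
se_score = sqrt(n_questions) * box_sd      # 5 * 2 = 10
lowest = n_questions * min(box)            # every guess wrong
highest = n_questions * max(box)           # every guess right

print(f"expected score {expected_score}, give or take {se_score:.0f}")
print(f"possible scores run from {lowest} to {highest}")
```

Here the "SD of scores" is read as the SE of the sum, that is, the typical distance of a guessed score from its expected value.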

12
Duke donor example
  • Population: 119,106 graduates of Duke
  • Variable: donation amount in $ to the Duke Annual
    Fund in 2001
  • Box model
  • Make a ticket for every alumnus containing
    his/her donation amount
  • Put all these tickets in a hypothetical box.

13
Box models: typical questions
  • Pick 100 tickets at random from the box, with
    replacement
  • Before collecting the data, what do you expect
    the sum of these 100 alumni donations to equal?
  • What do you think is a typical deviation from
    this expected value?
  • We can answer these questions with a box model
  • Before collecting the data, how many of the 100
    alumni do you expect to be donators?
  • What do you think is a typical deviation from
    this expected value?
  • To answer these questions we need another box model

14
Characteristics of alumni donations
  • For the 119,106 alumni
  • Average of all donations: $735
  • SD of donations: $23,827
  • 42,938 donated (36%)
  • 76,168 did not donate (64%)

15
Learning about the sample sum
  • When we sample randomly, the sum of the 100
    tickets will differ for different samples
  • What is the expected value (EV) of the sample sum?
  • E(sample sum) = n × (average of box) = nµ
  • What is a typical deviation of a sample sum from
    this expected value?
  • Standard error (SE) of sum = √n × (SD of box)

16
Sample sum of donations for 100 alumni
  • So the sum of the 100 alumni donations should be
  • E(sample sum) = 100 × $735 = $73,500
  • give or take the SE
  • SE = √100 × $23,827 = $238,270
  • How sure are we about the sum of donations using
    a sample of 100?
  • Key idea
  • If we take independent samples of 100 alumni over
    and over again, recording the sum of each sample
    then
  • The average of the sample sums should be around
    $73,500
  • The SD of the sample sums should be around
    $238,270

17
Box model for binary (dichotomous) outcomes
  • 42,938 donated and 76,168 did not
  • Make a box with tickets comprised of 42,938 ones
    and 76,168 zeros.
  • Average of box = fraction of ones = 0.36 = p
  • SD of box = 0.48
  • Shortcut for SD for binary box models (and only
    for binary box models): SD = √(p(1 − p))
  • Sample 100 tickets out of the box with
    replacement.
  • What does this process remind you of?
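
As a check on the 0-1 box, the sketch below builds the tickets from the counts on the slide and compares the SD computed from every ticket with the usual shortcut for 0-1 boxes, √(p(1 − p)). Variable names are illustrative.

```python
from math import sqrt
from statistics import pstdev

donated, did_not = 42_938, 76_168          # counts from the slide
box = [1] * donated + [0] * did_not        # one 0-1 ticket per alumnus

p = sum(box) / len(box)                    # fraction of ones in the box
sd_direct = pstdev(box)                    # SD computed from all tickets (divisor N)
sd_shortcut = sqrt(p * (1 - p))            # shortcut, valid only for 0-1 boxes

print(f"p = {p:.4f}, SD = {sd_direct:.4f}, shortcut = {sd_shortcut:.4f}")
```

The two SD values should agree up to rounding, and both round to 0.48.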

18
Sample number of donators out of 100 alumni
  • The number of donators in the sample equals the
    sample sum of the 0-1 tickets
  • Thus, the expected number of donators is
  • EV of sample sum = n × (average of box)
    = 100 × 0.36
    = 36
  • The typical deviation of the sample sum from
    its expected value is
  • The standard error (SE) of sum
    = √n × (SD of box)
    = 10 × 0.48 = 4.8

19
Sample number of donators out of 100 alumni
  • Hence, the number of alumni who donated out of a
    random sample of 100 should be 36, give or take
    around 5 people (SE 4.8).
  • Compared to the average donation per alumnus, how
    confident are we that any given sample of 100
    will produce 36 donors?
  • Key idea
  • If we take independent samples of 100 alumni over
    and over again, recording the number of donators
    in each sample
  • The average of the sample number of donators
    should be around 36
  • The SD of the sample numbers of donators should
    be around 4.8
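
The key idea can be tried out by simulation. The sketch below assumes p = 0.36 exactly and uses an arbitrary seed and 5,000 repeated samples; these choices are illustrative, not from the slides.

```python
import random
from statistics import mean, pstdev

rng = random.Random(1)          # fixed seed so the run is repeatable
p, n, n_samples = 0.36, 100, 5_000

# Each sample: draw n tickets with replacement from a 0-1 box, count the ones.
counts = [sum(rng.random() < p for _ in range(n)) for _ in range(n_samples)]

print(f"average of sample counts: {mean(counts):.2f}")    # should be near 36
print(f"SD of sample counts:      {pstdev(counts):.2f}")  # should be near 4.8
```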

20
Chance error / Standard Error
  • Standard error allows us to assess how big the
    chance error will be in the model
  • sum = expected value + chance error
  • Chance error is the difference between an
    observed value and the expected value

21
A problem from the text
  • 100 draws are made with replacement from a box
    containing the seven numbers 101, 102, 103, 104,
    105, 106, 107
  • Suppose you were betting. The closer your guess
    is to the sample sum, the more money you win.
    What number would you guess?
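  • Use the expected value as your guess:
    100 × 104 = 10,400
  • How much would you expect the sample sum to be
    off from the expected value of the sum?
  • This is the standard error: √100 × 2.16 = 21.6
    (2.16 is the SD of the box computed with divisor
    n − 1; with divisor n the SD is exactly 2 and the
    SE is 20)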
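
The arithmetic for this problem can be reproduced as follows. The sketch computes the SE twice, once with the population SD (divisor N) and once with the sample SD (divisor N − 1), since the two conventions give slightly different answers for a seven-ticket box.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

box = [101, 102, 103, 104, 105, 106, 107]
n_draws = 100

ev_sum = n_draws * mean(box)               # 100 * 104 = 10400
se_pop = sqrt(n_draws) * pstdev(box)       # divisor N:     10 * 2.00 = 20.0
se_samp = sqrt(n_draws) * stdev(box)       # divisor N - 1: 10 * 2.16 = 21.6

print(f"EV of sum = {ev_sum}")
print(f"SE with divisor N = {se_pop:.1f}, with divisor N - 1 = {se_samp:.1f}")
```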

22
Difference between SD and SE
  • SD is the typical deviation from the average in a
    box. SD is a property of the box; it doesn't
    depend on random sampling
  • SE is the typical deviation from the expected
    value in a random sample. SE results from random
    sampling
  • SE gives an idea of how large the chance error is
  • Sum of draws is likely to be around its expected
    value, but to be off by a chance error similar in
    size to its SE
  • Sum of draws = EV + chance error

23
EV and SE of the sample average or percent
  • Since sample average (or %) = sample sum / n, we
    get
  • Just like sample sums, sample averages and sample
    percentages are subject to chance variation
  • EV for sample average (or %) = EV of sample sum
    / n
    = avg. of box
  • SE for sample average (or %) = SE for sample
    sum / n
    = SD of box / √n

24
Common theme for SE of sample average and sample
percentage
  • For a binary variable, the population SD is
    √(p(1 − p))
  • So both the sample average and sample percentage
    have a standard error of the form
  • SE = population SD / √n

25
Sample averages and percentages
  • In a random sample of 100 alumni, we expect the
    sample average donation to equal $735, give or
    take $2,382.70. We expect 36% to donate, give or
    take 4.8%
  • If we take independent samples of 100 alumni over
    and over again, recording the average donation
    and the percentage of donators in each sample
  • The average of the sample averages of donations
    should be around $735
  • The SD of the sample averages of donations should
    be around $2,382.70
  • The average of the sample percentages of donators
    should be around 0.36
  • The SD of the sample percentages of donators
    should be around 0.048

26
Law of averages
  • Plot the SE of sample average donation for an
    increasing sample taken from the box
  • As n increases, the SE of the sample average
    decreases
  • This is called the law of averages
  • Vegas was built on this law
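
In place of the plot, the sketch below tabulates the SE of the sample average donation for a few sample sizes, using the box SD of $23,827 from the earlier slide. The function name and the sample sizes are illustrative choices.

```python
from math import sqrt

BOX_SD = 23_827   # SD of donations in dollars, from the slides

def se_of_average(n, box_sd=BOX_SD):
    """SE of the sample average for n draws with replacement: SD of box / sqrt(n)."""
    return box_sd / sqrt(n)

for n in (100, 400, 1_600, 6_400):
    print(f"n = {n:>5}: SE of sample average = ${se_of_average(n):,.2f}")
```

Each fourfold increase in n cuts the SE in half, so the SE shrinks, but slowly.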

27
Shape of chance process
  • The expected value and the standard error provide
    a measure of center and spread for the chance
    process
  • What about the shape?
  • Book introduces something called the probability
    histogram
  • This is a histogram of the samples taken from the
    box model.
  • What shape will this histogram take on?

28
Parameters vs statistics
  • A parameter is a number that describes the
    population
  • a fixed number
  • in practice, we don't know its value
  • A statistic is a number that describes a sample
  • its value is known when we have taken a sample
  • value can change from sample to sample
  • often used to estimate an unknown parameter

29
Sampling distributions
  • Box model is trying to motivate ideas surrounding
    a sampling distribution
  • All statistics have a sampling distribution
  • Formal definition
  • The sampling distribution of a statistic is the
    distribution of values taken by the statistic in
    all possible samples of the same size from the
    same population.
  • Note that a statistic's sampling distribution
    depends on the sample size

30
Sampling distribution construction
  • From a given population exhaust all possible
    samples of size n
  • For each sample compute the statistic
  • Treat these statistics as the data and plot a
    histogram
  • The histogram displays the sampling distribution
  • I believe FPP calls these distributions
    probability histograms
  • Note that these distributions are highly
    dependent on the sample size
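
The construction can be carried out exactly when the box and the sample size are small. The sketch below uses a hypothetical six-ticket box (a die) and samples of size 2 drawn with replacement, with the sample average as the statistic; none of these choices come from the slides.

```python
from collections import Counter
from itertools import product

box = [1, 2, 3, 4, 5, 6]   # hypothetical small population (a die)
n = 2                      # sample size

# Every ordered sample of size n drawn with replacement: 6**2 = 36 samples.
sample_means = [sum(sample) / n for sample in product(box, repeat=n)]

# Tally the statistic; each sample is equally likely, so counts give chances.
sampling_dist = Counter(sample_means)
for value in sorted(sampling_dist):
    chance = sampling_dist[value] / len(sample_means)
    print(f"{value:>4}: {'#' * sampling_dist[value]} ({chance:.3f})")
```

The printed bars form a crude text histogram of the sampling distribution; changing `n` changes its shape.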

31
Silly example

32
Approximating sampling distributions
  • What if the population is such that exhausting
    all samples of size n is impossible?
  • The sampling distribution can be well
    approximated using a ton of samples instead of
    all samples
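
A sketch of the approximation, assuming a six-ticket box and samples of size 25 (for which listing all 6^25 samples is out of reach). The seed and the number of samples are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(2)        # fixed seed so the run is repeatable
box = [1, 2, 3, 4, 5, 6]
n, n_samples = 25, 20_000     # 6**25 possible samples is far too many to list

# Draw many samples with replacement and record the sample average of each.
sample_means = [sum(rng.choices(box, k=n)) / n for _ in range(n_samples)]

ev = mean(box)                # EV of the sample average
se = pstdev(box) / sqrt(n)    # SE of the sample average

print(f"simulated: center {mean(sample_means):.3f}, spread {pstdev(sample_means):.3f}")
print(f"theory:    EV     {ev:.3f}, SE     {se:.3f}")
```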

33
Cool applet

34
Central Limit Theorem
  • When dealing with a statistic that uses a sum of
    some sort we can theoretically show what the
    sampling distribution will be like through the
    Central Limit Theorem

35
The central limit theorem
  • Take many random samples with replacement from a
    box model, all of the samples of size n. When n
    is sufficiently large, the distribution of the
    sample average (or sample %) is well-described by
    a normal curve
  • The mean of this normal curve is the EV and the
    standard deviation for this normal curve is the SE
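
One rough way to see the theorem at work is to sample from a box that is clearly not normal and check the 68% and 95% rules for the sample average. The right-skewed box below is hypothetical, as are the seed and sample counts.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(3)                    # fixed seed for repeatability
box = [1, 1, 1, 2, 2, 3, 10]              # hypothetical right-skewed box
n, n_samples = 100, 5_000

ev = mean(box)                            # EV of the sample average
se = pstdev(box) / sqrt(n)                # SE of the sample average

sample_avgs = [sum(rng.choices(box, k=n)) / n for _ in range(n_samples)]

# Under a normal curve, about 68% of values fall within 1 SE of the EV
# and about 95% fall within 2 SEs.
within_1 = sum(abs(a - ev) <= se for a in sample_avgs) / n_samples
within_2 = sum(abs(a - ev) <= 2 * se for a in sample_avgs) / n_samples

print(f"within 1 SE: {within_1:.3f}, within 2 SEs: {within_2:.3f}")
```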

36
The Central Limit Theorem
  • What does the CLT give us? A ton of stuff
  • We can find probabilities and percentiles using
    the normal table
  • Can predict fairly accurately how unlikely it is
    to observe a given sample mean
  • Can assess rather accurately how likely it is
    that a population mean lies within an interval

37
Central Limit Theorem
  • What happens if the distribution of the original
    variable is not symmetric (or think about the
    distribution of the values on the tickets in a
    box)?
  • The central limit theorem still kicks in (the
    sample size n just needs to be bigger)
  • What happens if the distribution of the original
    variable is bimodal?
  • The central limit theorem still kicks in (the
    sample size n just needs to be bigger)
  • This is absolutely a fantastic result !!!

38
Does CLT apply?
  • A box consists of 9 ones and 1 zero. A random
    sample of size 50 is drawn with replacement from
    the box and the number of ones is counted.
  • A box consists of the ages of the 100 students in
    our stat class (assume that the mean is 20 and SD
    is 1). A random sample of size 50 is drawn with
    replacement from the box and the 25th
    percentile is computed.

39
Central Limit Theorem: M&Ms
  • Pick 50 M&Ms at random (from a bag).
  • How likely is it to have less than 40% yellow and
    brown M&Ms in the bag?
  • Assume 50% of all M&Ms are yellow and brown
    (source: M&Ms home page)
  • For a sample proportion of yellow and brown M&Ms
  • EV = 0.5 and SE = √(0.5 × 0.5 / 50) ≈ 0.071
40
Size of sample
  • For binomial (categorical data with two
    categories) data, the CLT usually kicks in pretty
    well when both of the following conditions on
    sample size are met

41
CLT and M&Ms
  • Since n = 50, CLT applies
  • The probability of getting less than 40% yellow
    and brown M&Ms in a bag of 50 is
    P(Z < (0.40 − 0.50) / 0.071) = P(Z < −1.41) ≈ 0.08
  • It is somewhat unusual to get less than 40%
    yellow and brown M&Ms (about 8 chances in 100)
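
The "8 chances in 100" figure can be reproduced with the normal approximation. The sketch reads the 40 on the slide as 40% of a bag of 50 and uses Python's `statistics.NormalDist` in place of a normal table.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.5, 50                             # assumed share of yellow and brown; bag size
se = sqrt(p * (1 - p) / n)                 # SE of the sample proportion, about 0.071
z = (0.40 - p) / se                        # 40% in standard units, about -1.41
chance = NormalDist().cdf(z)               # area under the normal curve left of z

print(f"SE = {se:.3f}, z = {z:.2f}, chance = {chance:.3f}")
```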

42
CLT household example
  • The average size of U.S. households is 2.6
    people. The SD of household size is 1.42.
    (These are true values from the U.S. Census).
  • Pick 200 houses at random in the U.S.
  • How likely is it that we'll get a sample average
    household size of 3 or more?

43
CLT household example
  • For a sample average of 200 households
  • EV = 2.6 and SE = 1.42 / √200 ≈ 0.10
  • The chance of getting an average household size
    greater than 3 equals the area under the standard
    normal curve to the right of (3 − 2.6) / 0.10 = 4.
    This is a very small chance
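
The same calculation in Python, as a sketch using the figures on the slide (average 2.6, SD 1.42, n = 200) and `statistics.NormalDist` for the tail area.

```python
from math import sqrt
from statistics import NormalDist

box_avg, box_sd, n = 2.6, 1.42, 200
se = box_sd / sqrt(n)                      # SE of the sample average, about 0.10
z = (3 - box_avg) / se                     # 3 people in standard units, about 4
chance = 1 - NormalDist().cdf(z)           # area to the right of z

print(f"SE = {se:.3f}, z = {z:.2f}, chance = {chance:.6f}")
```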

44
Alumni donations example
  • In a random sample of 100 alumni, what is the
    chance that more than half donated?

45
Alumni donations example
  • What is the chance that the sample average of
    donations from 100 randomly picked alumni will be
    between $50 and $100?
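
Both alumni questions (more than half donating, and an average between 50 and 100, read here as dollars) can be approached with the normal approximation, using the box figures from the earlier slides. This is a sketch only: the donation box has an SD far larger than its average, so it is heavily skewed, and whether n = 100 is large enough for the approximation to be trusted is a separate question.

```python
from math import sqrt
from statistics import NormalDist

n = 100
std_normal = NormalDist()

# Chance that more than half of the sample donated (0-1 box, p = 0.36).
p = 0.36
se_pct = sqrt(p * (1 - p) / n)                       # 0.048
chance_majority = 1 - std_normal.cdf((0.50 - p) / se_pct)

# Chance that the sample average donation is between $50 and $100.
avg, sd = 735, 23_827
se_avg = sd / sqrt(n)                                # 2382.70
chance_between = (std_normal.cdf((100 - avg) / se_avg)
                  - std_normal.cdf((50 - avg) / se_avg))

print(f"P(more than half donate)       ~ {chance_majority:.4f}")
print(f"P(average between $50 and $100) ~ {chance_between:.4f}")
```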

46
CLT under three conditions
  • If the original variable follows a normal
    distribution, there is no need for the CLT: we
    know the sampling distribution of a sum
    theoretically
  • If the distribution of the original variable is
    symmetric and unimodal, then the CLT holds for a
    small sample size (say, less than 15)
  • If the distribution is skewed or not unimodal,
    then the CLT holds only for a larger sample size
  • How large depends on the sharpness of the skew.
    In this class we will follow convention and say
    30.

47
[Diagram with labels: Parameter µ, Inference, Statistic, Sample]