The Analysis of Categorical Data and Goodness-of-Fit Tests

Provided by: shis152


1
Chapter 12
  • The Analysis of Categorical Data and
    Goodness-of-Fit Tests

2
12.1 Chi-Square Tests for Univariate Categorical
Data
  • Examples of Univariate Categorical Data
  • Each student in a sample of 100 is classified as
    full-time or part-time. (two categories)
  • Each airline passenger in a sample of 50 is
    classified based on type of ticket: coach,
    business class, or first class. (three
    categories).
  • Each voter in a sample of 100 is asked which of
    the five city council members he or she favors
    for mayor. (five categories).

3
One-way Frequency Table for Univariate
Categorical Data
  • Fees keep American taxpayers from using credit
    cards to make tax payments. One hundred randomly
    selected taxpayers are asked whether they will use
    a credit card to pay their taxes next year. The
    outcome of the survey is summarized in a one-way
    frequency table with four response categories.

The manager of a tax preparation company
might be interested in determining whether the
four possible responses occur equally often, that
is, the long-run proportion of responses in each
of the four categories is ¼.
4
One-way Frequency Table for Univariate
Categorical Data
  • Each item returned to a department store is
    classified according to how the return was
    resolved: cash refund, credit to charge account,
    merchandise exchange, or return refused (four
    categories). The observations from a sample of
    100 returns are summarized in a one-way frequency
    table consisting of k = 4 cells.

The customer relations manager for the
department store might be interested in
determining whether the four possible
dispositions for a return request occur equally
often, that is, the long-run proportion of
returns in each of the four categories is ¼.
5
Notation
  • k = number of categories of a categorical
    variable
  • p1 = true proportion for Category 1
  • p2 = true proportion for Category 2
  • ...
  • pk = true proportion for Category k
  • (Note: p1 + p2 + ... + pk = 1.)
  • The hypotheses to be tested have the form
  • H0: p1 = hypothesized proportion for Category 1,
    p2 = hypothesized proportion for Category 2,
    ...,
    pk = hypothesized proportion for Category k.
  • Ha: H0 is not true. At least one of the true
    category proportions differs from the
    corresponding hypothesized value.

6
Example 12.1 Births and the Lunar Cycle
  • A common legend is that more babies than expected
    are born during certain phases of the lunar
    cycle. The following data are from a sample of
    randomly selected births that occurred during 24
    lunar cycles.

7
Example: Births and the Lunar Cycle
  • If there is no relationship between number of
    births and the lunar cycle, then the number of
    births in each lunar cycle category should be
    proportional to the number of days included in
    that category.
  • There are a total of 699 days in the 24 lunar
    cycles considered, and 24 of those days are in the
    new moon category. If there is no relationship
    between number of births and lunar cycle, then the
    proportion of births in the new moon category is
    p1 = 24/699 = 0.0343.
  • Similarly, we can find the proportions of births
    for the other lunar cycle categories.

8
(No Transcript)
9
Example: Births and the Lunar Cycle
  • If there is no relationship between number of
    births and the lunar cycle, then
  • H0: p1 = 0.0343, p2 = 0.2175, p3 = 0.0343,
    p4 = 0.2132,
  • p5 = 0.0343, p6 = 0.2146, p7 = 0.0343,
    p8 = 0.2175
  • Ha: H0 is not true.
  • If H0 is true, the expected count for Category 1
    (new moon) is n(p1) = 222,784(0.0343) = 7641.49.
  • And the expected count for Category 2 is
    n(p2) = 222,784(0.2175) = 48,455.52.
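
The expected-count arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the sample size n = 222,784 is the total number of births in the example's data, the proportions are the ones listed in H0, and the variable names are our own.

```python
# Expected counts under H0: expected count = n * hypothesized proportion.
# n is the total number of births in the sample (222,784).
n = 222_784

# Hypothesized proportions p1..p8 as listed in H0 (Category 1 is new moon).
hypothesized = [0.0343, 0.2175, 0.0343, 0.2132, 0.0343, 0.2146, 0.0343, 0.2175]

# Sanity check: the hypothesized proportions should sum to 1 (up to rounding).
assert abs(sum(hypothesized) - 1.0) < 1e-9

expected = [n * p for p in hypothesized]

for category, count in enumerate(expected, start=1):
    print(f"Category {category}: expected count = {count:.2f}")
```

Every expected count here is far above 5, which is the condition the chi-square test needs.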

10
Example: Births and the Lunar Cycle
  • Expected counts for other categories are computed
    similarly.

11
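The Goodness-of-Fit Statistic X²
  • First we compute the quantity
    $\frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$
  • for each cell, where, for a sample of size n,
    $\text{expected cell count} = n \times (\text{hypothesized value of the corresponding population proportion})$
  • The X² statistic is the sum of these quantities
    for all k cells:
    $X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$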
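
A minimal Python sketch of this computation follows. The function name and the four observed counts are hypothetical (they are not data from these slides); the counts are tested against equal proportions of 1/4, as in the four-category examples earlier.

```python
def chi_square_statistic(observed, hypothesized):
    """Return (X^2, expected counts) for a one-way frequency table."""
    n = sum(observed)                          # sample size
    expected = [n * p for p in hypothesized]   # expected count = n * proportion
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2, expected


# Hypothetical counts for k = 4 categories (illustration only).
observed = [30, 20, 25, 25]
hypothesized = [0.25, 0.25, 0.25, 0.25]

x2, expected = chi_square_statistic(observed, hypothesized)
print(expected)  # [25.0, 25.0, 25.0, 25.0]
print(x2)        # 2.0
```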

12
Chi-Square Distribution
  • The goodness-of-fit statistic, X², is a
    quantitative measure of the extent to which the
    observed counts differ from those expected when
    H0 is true.
  • Therefore, large values of X² suggest rejection
    of H0.
  • For a test procedure based on the X² statistic,
    the associated P-value is the area under the
    appropriate chi-square curve and to the right of
    the computed X² value. (Appendix Table 8)
  • Reject H0 if P-value < significance level α.
  • Example: Find the P-value if X² = 4.93 and df = 2.

From Appendix Table 8, P-value ≈ 0.085.
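
The table lookup can be cross-checked numerically. The sketch below assumes an even number of degrees of freedom, for which the upper-tail area of a chi-square distribution has a simple closed form; the function name is our own, and odd df would need a statistics library or the table instead.

```python
import math


def chi2_upper_tail_even_df(x2, df):
    """Area to the right of x2 under the chi-square curve, for even df only.

    Uses the closed form exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = x2 / 2
    terms = (half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * sum(terms)


p_value = chi2_upper_tail_even_df(4.93, 2)
print(round(p_value, 3))  # 0.085
```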
13
Goodness-of-Fit Test Procedure
  • Hypotheses
  • H0: p1 = hypothesized proportion for Category 1,
    ..., pk = hypothesized proportion for Category k
  • Ha: H0 is not true.
  • Test statistic
    $X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$

P-value: For a test procedure based on the
X² statistic, the associated P-value is the area
under the appropriate chi-square curve and to the
right of the computed X² value (Appendix Table
8). Reject H0 if P-value < significance
level α.
14
Goodness-of-Fit Test Procedure
  • When H0 is true and all expected counts are at
    least 5, X² has approximately a chi-square
    distribution with df = k - 1.
  • The P-value associated with the computed test
    statistic value is the area to the right of X²
    under the df = k - 1 chi-square curve.
  • Upper-tail areas for chi-square distributions are
    found in Appendix Table 8.
  • Assumptions
  • Observed cell counts are based on a random
    sample.
  • The sample size is large. The sample size is
    large enough for the chi-square test to be
    appropriate as long as every expected cell count
    is at least 5.

15
Example: Births and the Lunar Cycle Revisited
  • Test the hypothesis that number of births is
    unrelated to lunar cycle using the data in
    Example 12.1. Choose α = 0.05.
  • df = 8 - 1 = 7. The computed value of X² is less
    than 12.01 (the smallest entry in the df = 7
    column), so P-value > 0.10.
  • Fail to reject H0, because P-value > α.
  • There is not enough evidence to conclude that
    number of births and lunar cycle are related.

16
Example: Hybrid Car Purchases
  • The table on the right lists sales of hybrid cars
    in the top five states in 2004. Use the X²
    goodness-of-fit test and a significance level of
    α = 0.01 to test the hypothesis that hybrid sales
    for these 5 states are proportional to the 2004
    population (see table below) for these states.

17
Solution to Example: Hybrid Car Purchases
  • If the hybrid sales for the 5 states are
    proportional to their 2004 populations,
    then the expected counts for hybrid sales in
    these states are
  • Expected count for California = 406(0.495) =
    200.970
  • Expected count for Virginia = 406(0.103) = 41.818
  • Expected count for Washington = 406(0.085) =
    34.510
  • Expected count for Florida = 406(0.240) = 97.440
  • Expected count for Maryland = 406(0.077) = 31.262

18
Solution to Example: Hybrid Car Purchases
  • Let p1, p2, p3, p4, p5 denote the true proportions
    of hybrid car sales for California, Virginia,
    Washington, Florida and Maryland, respectively.
  • Assumptions: The sample was a random sample. All
    expected counts are > 5, so it is appropriate to
    use the chi-square test.
  • H0: p1 = 0.495, p2 = 0.103, p3 = 0.085,
    p4 = 0.240, p5 = 0.077
  • Ha: H0 is not true.
  • Test statistic:
    $X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}} = 59.49$
  • P-value: All expected counts exceed 5, so
    the P-value can be based on a chi-square
    distribution with df = 5 - 1 = 4. From Appendix
    Table 8, P-value < 0.001 ≈ 0 (the test
    statistic 59.49 > 18.47, and any value > 18.47
    has right-tail area < 0.001 when df = 4).
  • Conclusion: Reject H0 since P-value < α.
    There is evidence that hybrid car sales are not
    proportional to population size for at least one
    of the five states.

On the next slides we use Excel to solve this problem.
19
Click fx, and an Insert Function dialog box
appears. Select Statistical in the select a
category box. In the Select a function list,
choose CHITEST. Then click OK.
20
As soon as you input the Actual_range (observed
frequencies) and Expected_range, you can see the
P-value in Formula result = 3.70981E-12
(3.70981 × 10⁻¹² ≈ 0).
21
Exercise: Color of Stolen Cars
  • Does the color of a car influence the chance
    that it will be stolen? The AP reported that for
    a random sample of 830 stolen vehicles, 140 were
    white, 100 were blue, 270 were red, 230 were
    black, and 90 were other colors. Use the X²
    goodness-of-fit test and a significance level of
    α = 0.01 to test the hypothesis that the
    proportions stolen are identical to the population
    color proportions. It is known that 15% of all
    cars are white, 15% are blue, 35% are red, 30%
    are black, and 5% are other colors.

Answer: P-value < 0.001. There is convincing
evidence that at least one of the color
proportions for stolen cars differs from the
corresponding proportion for all cars.
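
One way to check this answer is to run the whole test procedure in Python. The sketch below uses only the counts and percentages given in the exercise; the variable names are our own, and the P-value line relies on the closed-form upper-tail area that holds specifically for df = 4.

```python
import math

# Observed counts: white, blue, red, black, other (from the exercise).
observed = [140, 100, 270, 230, 90]
# Hypothesized proportions: the color proportions for all cars.
hypothesized = [0.15, 0.15, 0.35, 0.30, 0.05]
alpha = 0.01

n = sum(observed)                              # 830 stolen vehicles
expected = [n * p for p in hypothesized]       # expected count = n * proportion

# Assumption check: every expected cell count should be at least 5.
assert all(e >= 5 for e in expected)

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                         # k - 1 = 4

# Upper-tail area for df = 4 only: exp(-x/2) * (1 + x/2).
half = x2 / 2
p_value = math.exp(-half) * (1 + half)

print(f"X^2 = {x2:.2f}, df = {df}")            # X^2 is about 66.33
print(f"P-value = {p_value:.2e}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The largest contribution to X² comes from the "other colors" cell, where 90 vehicles were observed against an expected count of 41.5.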