Title: The Analysis of Categorical Data and Goodness-of-Fit Tests
1Chapter 12
- The Analysis of Categorical Data and
Goodness-of-Fit Tests
212.1 Chi-Square Tests for Univariate Categorical
Data
- Examples of Univariate Categorical Data
- Each student in a sample of 100 is classified as
full-time or part-time. (two categories)
- Each airline passenger in a sample of 50 is
classified based on type of ticket: coach,
business class, or first class. (three
categories)
- Each voter in a sample of 100 is asked which of
the five city council members he or she favors
for mayor. (five categories)
3One-way Frequency Table for Univariate
Categorical Data
- Fees keep American taxpayers from using credit
cards to make tax payments. A random sample of 100
taxpayers is asked whether they will use a credit
card to pay taxes next year. The following are the
outcomes of the survey.
The manager of a tax preparation company
might be interested in determining whether the
four possible responses occur equally often, that
is, whether the long-run proportion of responses in
each of the four categories is ¼.
4One-way Frequency Table for Univariate
Categorical Data
- Each item returned to a department store is
classified according to how it was resolved: cash
refund, credit to charge account, merchandise
exchange, or return refused. (four categories) The
observations from a sample of 100 returns are
summarized in a one-way frequency table consisting
of k = 4 cells.
The customer relations manager for the
department store might be interested in
determining whether the four possible
dispositions for a return request occur equally
often, that is, whether the long-run proportion of
returns in each of the four categories is ¼.
5Notation
- k = number of categories of a categorical
variable
- p1 = true proportion for Category 1
- p2 = true proportion for Category 2
- ...
- pk = true proportion for Category k
- (Note: p1 + p2 + ... + pk = 1.)
- The hypotheses to be tested have the form
- H0: p1 = hypothesized proportion for Category 1,
p2 = hypothesized proportion for Category 2,
...,
pk = hypothesized proportion for Category k.
- Ha: H0 is not true. At least one of the true
category proportions differs from the
corresponding hypothesized value.
6Example 12.1 Births and the Lunar Cycle
- A common legend is that more babies than expected
are born during certain phases of the lunar cycle.
The following data are from a sample of randomly
selected births that occurred during 24 lunar cycles.
7Example Births and the Lunar Cycle
- If there is no relationship between number of
births and the lunar cycle, then the number of
births in each lunar cycle category should be
proportional to the number of days included in
that category.
- There are a total of 699 days in the 24 lunar
cycles considered, and 24 of those days are in the
new moon category. If there is no relationship
between number of births and lunar cycle, the
proportion of births in the new moon category
should be p1 = 24/699 ≈ 0.0343.
- Similarly, we can find the hypothesized proportions
of births for the other lunar cycle categories.
9Example Births and the Lunar Cycle
- If there is no relationship between number of
births and the lunar cycle, then
- H0: p1 = 0.0343, p2 = 0.2175, p3 = 0.0343, p4 = 0.2132,
p5 = 0.0343, p6 = 0.2146, p7 = 0.0343, p8 = 0.2175
- Ha: H0 is not true.
- If H0 is true, the expected count for Category 1
(new moon) is n(0.0343), where n is the total
number of births in the sample.
- And the expected count for Category 2 is n(0.2175).
10Example Births and the Lunar Cycle
- Expected counts for other categories are computed
similarly.
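As a quick arithmetic check on this example, the following Python sketch recomputes the new moon proportion from the day counts and confirms that the eight hypothesized proportions in H0 sum to 1. The variable and function names are illustrative and not part of the original slides; the sample size n is not given in this transcript, so the helper is defined but not evaluated here.

```python
# Hypothesized proportions p1 ... p8 from H0 in the lunar cycle example.
h0_proportions = [0.0343, 0.2175, 0.0343, 0.2132,
                  0.0343, 0.2146, 0.0343, 0.2175]

# New moon category: 24 of the 699 days in the 24 lunar cycles.
new_moon_proportion = 24 / 699

print(round(new_moon_proportion, 4))   # 0.0343
print(round(sum(h0_proportions), 4))   # 1.0


def expected_counts(n, proportions):
    """Expected cell counts under H0: n times each hypothesized proportion."""
    return [n * p for p in proportions]
```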
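11The Goodness-of-Fit Statistic X²
- First we compute the quantity
\[ \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}} \]
- for each cell, where, for a sample of size n,
\[ \text{expected cell count} = n \cdot (\text{hypothesized value of the corresponding population proportion}) \]
- The X² statistic is the sum of these quantities
for all k cells:
\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}} \]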
12Chi-Square Distribution
- The goodness-of-fit statistic, X², is a
quantitative measure of the extent to which the
observed counts differ from those expected when
H0 is true.
- Therefore, large values of X² suggest rejection
of H0.
- For a test procedure based on the X² statistic,
the associated P-value is the area under the
appropriate chi-square curve and to the right of
the computed X² value. (Appendix Table 8)
- Reject H0 if P-value < significance level α.
- Find the P-value if X² = 4.93 and df = 2.
From Appendix Table 8, P-value ≈ 0.085.
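The table lookup above can also be checked numerically. The following is a minimal Python sketch, not part of the original slides, that computes the area to the right of a value under a chi-square curve with a whole-number df, using the standard closed-form series for the chi-square upper tail. The function name chi_square_upper_tail is an illustrative choice.

```python
import math


def chi_square_upper_tail(x, df):
    """Area to the right of x under the chi-square curve (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus 2 * (standard normal density at sqrt(x))
    # times the sum of x^(r - 1/2) / (1*3*...*(2r-1)) for r = 1 .. (df-1)/2
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(math.sqrt(half)) + 2.0 * density * total


# The slide's example: X^2 = 4.93 with df = 2.
print(round(chi_square_upper_tail(4.93, 2), 3))   # 0.085
```

Where SciPy is installed, scipy.stats.chi2.sf(4.93, 2) should give the same upper-tail area; the standard-library version is used here only to keep the sketch self-contained.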
13Goodness-of-Fit Test Procedure
- Hypotheses
- H0: p1 = hypothesized proportion for Category 1,
...,
pk = hypothesized proportion for Category k
- Ha: H0 is not true.
- Test statistic
\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}} \]
- P-value: For a test procedure based on the
X² statistic, the associated P-value is the area
under the appropriate chi-square curve and to the
right of the computed X² value. (Appendix Table
8)
- Reject H0 if P-value < significance
level α.
14Goodness-of-Fit Test Procedure
- When H0 is true and all expected counts are at
least 5, X² has approximately a chi-square
distribution with df = k - 1.
- The P-value associated with the computed test
statistic value is the area to the right of X²
under the df = k - 1 chi-square curve.
- Upper-tail areas for chi-square distributions are
found in Appendix Table 8.
- Assumptions
- Observed cell counts are based on a random
sample.
- The sample size is large. The sample size is
large enough for the chi-square test to be
appropriate as long as every expected cell count
is at least 5.
15Example Births and the Lunar Cycle Revisited
- Test the hypothesis that number of births is
unrelated to lunar cycle using the data in
Example 12.1. Use α = 0.05.
- df = 8 - 1 = 7. The computed value of X² is less
than 12.01 (the smallest entry in the df = 7 column),
so P-value > .10.
- Fail to reject H0, because P-value > α.
- There is not enough evidence to conclude that
number of births and lunar cycle are related.
16Example Hybrid Car Purchases
- The table on the right lists sales of hybrid cars
in the top five states in 2004. Use the X²
goodness-of-fit test and a significance level of
α = .01 to test the hypothesis that hybrid sales for
these 5 states are proportional to the 2004
populations (see table below) of these states.
17Solution to Example Hybrid Car Purchase
- If the hybrid sales for the 5 states are
proportional to their 2004 populations,
- then the expected counts for hybrid sales in
these states are
- Expected count for California = 406(.495) = 200.970
- Expected count for Virginia = 406(.103) = 41.818
- Expected count for Washington = 406(.085) = 34.510
- Expected count for Florida = 406(.240) = 97.440
- Expected count for Maryland = 406(.077) = 31.262
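The expected counts are each the total number of sales, n = 406, multiplied by the state's share of the five-state population. A short Python sketch of that arithmetic, with illustrative variable names that are not part of the original slides, is:

```python
n = 406  # total hybrid sales across the five states, as used on the slide

# Each state's share of the combined 2004 population of the five states.
population_share = {
    "California": 0.495,
    "Virginia": 0.103,
    "Washington": 0.085,
    "Florida": 0.240,
    "Maryland": 0.077,
}

# Expected count under H0 = n * hypothesized proportion.
expected = {state: n * share for state, share in population_share.items()}

for state, count in expected.items():
    print(f"{state}: {count:.3f}")
print(f"Total: {sum(expected.values()):.3f}")   # Total: 406.000
```

Because the shares sum to 1, the expected counts must sum to the sample size, which is a useful check on the hand arithmetic.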
18Solution to Example Hybrid Car Purchase
- Let p1, p2, p3, p4, p5 denote the true proportions
of hybrid car sales for California, Virginia,
Washington, Florida, and Maryland, respectively.
- Assumptions: The sample was a random sample. All
expected counts are > 5, so it is appropriate to
use the chi-square test.
- H0: p1 = 0.495, p2 = 0.103, p3 = 0.085, p4 = 0.240,
p5 = 0.077
- Ha: H0 is not true.
- Test statistic: X² = 59.49
- P-value: All expected counts exceed 5, so
the P-value can be based on the chi-square
distribution with df = 5 - 1 = 4. From Appendix
Table 8, P-value < 0.001 ≈ 0 (the test
statistic 59.49 > 18.47, and with df = 4 any value
> 18.47 has right-tail area < 0.001).
- Conclusion: Reject H0 since P-value < α.
There is evidence that hybrid car sales are not
proportional to population size for at least one
of the five states.
On the next slides we use Excel to solve this problem.
19Click fx, and an Insert Function dialog box
appears. Select Statistical in the Select a
category box. In the Select a function list,
choose CHITEST. Then click OK.
20As soon as you input the Actual_range (observed
frequencies) and Expected_range, you can see the
P-value in Formula result: 3.70981E-12 (=
3.70981 × 10⁻¹² ≈ 0).
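The Excel result can be cross-checked without the observed counts, starting from the rounded test statistic 59.49 reported earlier. This is a hedged sketch rather than a reproduction of what CHITEST does internally: it relies on the fact that for df = 4 the chi-square upper-tail area has the closed form exp(-x/2)(1 + x/2). The small difference from Excel's figure is presumably due to rounding of the test statistic.

```python
import math

x2 = 59.49   # rounded test statistic from the hybrid car example (df = 4)

# For df = 4, area to the right of x is exp(-x/2) * (1 + x/2).
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)

print(f"{p_value:.2e}")   # roughly 3.7e-12, in line with Excel's 3.70981E-12
```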
21Exercise Color of Stolen Cars
- Does the color of a car influence the chance
that it will be stolen? The AP reported that for
a random sample of 830 stolen vehicles, 140 were
white, 100 were blue, 270 were red, 230 were
black, and 90 were other colors. Use the X²
goodness-of-fit test and a significance level of
α = .01 to test the hypothesis that the proportions
stolen are identical to the population color
proportions. It is known that 15% of all cars are
white, 15% are blue, 35% are red, 30% are black,
and 5% are other colors.
Answer: P-value < .001. There is convincing
evidence that at least one of the color
proportions for stolen cars differs from the
corresponding proportion for all cars.
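One way to verify the answer is to carry out the steps of the goodness-of-fit procedure in a few lines of Python. The sketch below is illustrative and not part of the original slides. It assumes the population figures 15, 15, 35, 30, and 5 are percentages, and it uses 18.47 as the chi-square value with right-tail area 0.001 for df = 4 (a standard value, stated here as an assumption rather than read from Appendix Table 8).

```python
observed = {"white": 140, "blue": 100, "red": 270, "black": 230, "other": 90}
hypothesized = {"white": 0.15, "blue": 0.15, "red": 0.35,
                "black": 0.30, "other": 0.05}

n = sum(observed.values())                        # 830 stolen vehicles
expected = {color: n * p for color, p in hypothesized.items()}

# Large-sample assumption: every expected count should be at least 5.
assert all(count >= 5 for count in expected.values())

# X^2 = sum over cells of (observed - expected)^2 / expected
x2 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = len(observed) - 1

CUTOFF_001_DF4 = 18.47   # assumed cut-off: right-tail area 0.001 when df = 4

print(f"X^2 = {x2:.2f}, df = {df}")               # X^2 = 66.33, df = 4
print("P-value < .001" if x2 > CUTOFF_001_DF4 else "P-value >= .001")
```

Most of the statistic comes from the "other" category, where 90 thefts were observed against an expected count of 41.5.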