The Analysis of Categorical Data and Goodness-of-Fit Tests

Description:

Waning crescent. 152. 48230. Total. 699. 222,784. Example: ... Waning crescent. 48230. 48455.52. The Goodness-of-Fit Statistic X². First we compute the quantity ...

Slides: 22

Provided by: shis152


1
Chapter 12

The Analysis of Categorical Data and
Goodness-of-Fit Tests

2
12.1 Chi-Square Tests for Univariate Categorical
Data

Examples of Univariate Categorical Data
Each student in a sample of 100 is classified as
full-time or part-time. (two categories)
Each airline passenger in a sample of 50 is
classified based on type of ticket: coach,
business class, or first class. (three
categories)
Each voter in a sample of 100 is asked which of
the five city council members he or she favors
for mayor. (five categories).

3
One-way Frequency Table for Univariate
Categorical Data

Fees keep American taxpayers from using credit
cards to make tax payments. One hundred randomly
selected taxpayers were asked whether they will
use a credit card to pay taxes next year. The
outcomes of the survey are as follows:

The manager of a tax preparation company
might be interested in determining whether the
four possible responses occur equally often, that
is, whether the long-run proportion of responses
in each of the four categories is ¼.

4
One-way Frequency Table for Univariate
Categorical Data

Each item returned to a department store is
classified according to how it was resolved: cash
refund, credit to charge account, merchandise
exchange, or return refused (four categories).
The observations from a sample of 100 returns are
summarized in a one-way frequency table
consisting of k = 4 cells:

The customer relations manager for the
department store might be interested in
determining whether the four possible
dispositions for a return request occur equally
often, that is, whether the long-run proportion
of returns in each of the four categories is ¼.

5
Notation

k = number of categories of a categorical
variable
p1 = true proportion for Category 1
p2 = true proportion for Category 2
...
pk = true proportion for Category k
(Note: p1 + p2 + ... + pk = 1.)
The hypotheses to be tested have the form
H0: p1 = hypothesized proportion for Category 1,
    p2 = hypothesized proportion for Category 2,
    ...
    pk = hypothesized proportion for Category k.
Ha: H0 is not true. At least one of the true
category proportions differs from the
corresponding hypothesized value.

6
Example 12.1: Births and the Lunar Cycle

A common legend is that more babies than expected
are born during certain phases of the lunar
cycle. The following data are from a sample of
randomly selected births during 24 lunar cycles.

7
Example: Births and the Lunar Cycle

If there is no relationship between the number of
births and the lunar cycle, then the number of
births in each lunar cycle category should be
proportional to the number of days included in
that category.
There are a total of 699 days in the 24 lunar
cycles considered, and 24 of those days are in
the new moon category. If there is no
relationship between the number of births and the
lunar cycle, the proportion of births in the new
moon category should be p1 = 24/699 = 0.0343.
Similarly, we can find the proportions of births
for the other lunar cycle categories.

8
(No Transcript)

9
Example: Births and the Lunar Cycle

If there is no relationship between the number of
births and the lunar cycle, then
H0: p1 = 0.0343, p2 = 0.2175, p3 = 0.0343, p4 = 0.2132,
    p5 = 0.0343, p6 = 0.2146, p7 = 0.0343, p8 = 0.2175
Ha: H0 is not true.
With n = 222,784 births in the sample, if H0 is
true, the expected count for Category 1
(new moon) is
n(p1) = 222,784(0.0343) = 7641.49
and the expected count for Category 2 is
n(p2) = 222,784(0.2175) = 48,455.52

10
Example: Births and the Lunar Cycle

Expected counts for other categories are computed
similarly.
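As a rough check on these expected counts, the sketch below multiplies each hypothesized proportion from slide 9 by the sample size. The total of n = 222,784 births is taken from the presentation's description line, and the variable names are illustrative only.

```python
# Hypothesized proportions p1..p8 under H0 (slide 9).
proportions = [0.0343, 0.2175, 0.0343, 0.2132, 0.0343, 0.2146, 0.0343, 0.2175]

# Assumed sample size: total births listed in the presentation's description.
n = 222_784

# The expected count for a category is n times its hypothesized proportion.
expected = [n * p for p in proportions]

for category, count in enumerate(expected, start=1):
    print(f"Category {category}: expected count = {count:.2f}")
```

Because the hypothesized proportions sum to 1, the expected counts should sum to n, which is a quick sanity check on the arithmetic.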

11
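The Goodness-of-Fit Statistic X²

First we compute the quantity
$$\frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
for each cell, where, for a sample of size n,
$$\text{expected cell count} = n \times (\text{hypothesized value of the corresponding population proportion})$$
The X² statistic is the sum of these quantities
for all k cells:
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$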
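The computation of the statistic can be sketched in a few lines of Python. The function name and the observed counts below are hypothetical (the counts are not from the slides); they simply illustrate four categories with hypothesized proportions of ¼ each.

```python
def chi_square_statistic(observed, proportions):
    """Sum of (observed - expected)^2 / expected over all cells."""
    n = sum(observed)  # sample size
    total = 0.0
    for obs, p in zip(observed, proportions):
        expected = n * p  # expected cell count under H0
        total += (obs - expected) ** 2 / expected
    return total


# Hypothetical observed counts for four categories (n = 100).
observed = [30, 20, 25, 25]
proportions = [0.25, 0.25, 0.25, 0.25]

x2 = chi_square_statistic(observed, proportions)
print(x2)  # 2.0
```

Each expected count here is 25, so the two cells that are off by 5 contribute 1 each and the other two contribute 0.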

12
Chi-Square Distribution

The goodness-of-fit statistic, X², is a
quantitative measure of the extent to which the
observed counts differ from those expected when
H0 is true.
Therefore, large values of X² suggest rejection
of H0.
For a test procedure based on the X² statistic,
the associated P-value is the area under the
appropriate chi-square curve and to the right of
the computed X² value. (Appendix Table 8)
Reject H0 if P-value < significance level α.

Find the P-value if X² is 4.93 and df = 2.

From Appendix Table 8, P-value ≈ 0.085.
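Instead of a table lookup, the upper-tail area can be computed directly. The following is a minimal sketch that uses only the Python standard library; the function name is an assumption, and the method (a recurrence on the regularized upper incomplete gamma function) is one of several ways to get this area, not necessarily how Appendix Table 8 was produced.

```python
import math


def chi_square_upper_tail(x2, df):
    """Area to the right of x2 under the chi-square curve with df degrees of freedom."""
    if x2 < 0 or df < 1:
        raise ValueError("x2 must be non-negative and df must be a positive integer")
    t = x2 / 2.0
    # Starting values: Q(1, t) = exp(-t) for even df, Q(1/2, t) = erfc(sqrt(t)) for odd df.
    if df % 2 == 0:
        a = 1.0
        area = math.exp(-t)
    else:
        a = 0.5
        area = math.erfc(math.sqrt(t))
    # Recurrence Q(a + 1, t) = Q(a, t) + t^a * exp(-t) / Gamma(a + 1), up to a = df / 2.
    while a < df / 2.0:
        area += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return area


p_value = chi_square_upper_tail(4.93, 2)
print(round(p_value, 3))  # 0.085
```

For df = 2 the recurrence is never entered and the area reduces to exp(-x2/2), which agrees with the table value of about 0.085.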
13
Goodness-of-Fit Test Procedure

Hypotheses
H0: p1 = hypothesized proportion for Category 1,
    ...
    pk = hypothesized proportion for Category k
Ha: H0 is not true.
Test statistic
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$

P-value: For a test procedure based on the
X² statistic, the associated P-value is the area
under the appropriate chi-square curve and to the
right of the computed X² value (Appendix Table
8). Reject H0 if P-value < significance
level α.

14
Goodness-of-Fit Test Procedure

When H0 is true and all expected counts are at
least 5, X² has approximately a chi-square
distribution with df = k - 1.
The P-value associated with the computed test
statistic value is the area to the right of X²
under the df = k - 1 chi-square curve.
Upper-tail areas for chi-square distributions are
found in Appendix Table 8.
Assumptions
Observed cell counts are based on a random
sample.
The sample size is large. The sample size is
large enough for the chi-square test to be
appropriate as long as every expected cell count
is at least 5.
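The sample-size condition and the degrees of freedom can be checked before running the test. The sketch below uses a hypothetical helper name and applies it to the earlier tax-survey setting (100 taxpayers, four categories with hypothesized proportion ¼ each).

```python
def check_conditions(n, proportions, minimum=5):
    """Return expected counts, whether all are at least `minimum`, and df = k - 1."""
    expected = [n * p for p in proportions]
    large_enough = all(count >= minimum for count in expected)
    df = len(proportions) - 1
    return expected, large_enough, df


expected, large_enough, df = check_conditions(100, [0.25, 0.25, 0.25, 0.25])
print(expected, large_enough, df)  # [25.0, 25.0, 25.0, 25.0] True 3
```

The random-sample assumption cannot be verified from the counts themselves; it depends on how the data were collected.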

15
Example: Births and the Lunar Cycle Revisited

Test the hypothesis that the number of births is
unrelated to the lunar cycle using the data in
Example 12.1. Choose α = 0.05.
df = 8 - 1 = 7. The computed value of X² is less
than 12.01 (the smallest entry in the df = 7
column of Appendix Table 8), so P-value > 0.10.
Fail to reject H0, because P-value > α.
There is not enough evidence to conclude that the
number of births and the lunar cycle are related.

16
Example: Hybrid Car Purchases

The table on the right lists sales of hybrid cars
in the top five states in 2004. Use the X²
goodness-of-fit test and a significance level of
α = 0.01 to test the hypothesis that hybrid sales
for these 5 states are proportional to the 2004
populations of these states (see table below).

17
Solution to Example: Hybrid Car Purchases

If the hybrid sales for the 5 states are
proportional to their 2004 populations,
then the expected counts for hybrid sales in
these states are
Expected count for California = 406(0.495) = 200.970
Expected count for Virginia = 406(0.103) = 41.818
Expected count for Washington = 406(0.085) = 34.510
Expected count for Florida = 406(0.240) = 97.440
Expected count for Maryland = 406(0.077) = 31.262
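The arithmetic on this slide can be double-checked with a short script. This is only a sketch: the total of 406 sales and the population proportions are the values used in the slide's own expected-count calculations, and the dictionary layout is an illustrative choice.

```python
# Total hybrid sales across the five states (the multiplier used on this slide).
n = 406

# 2004 population proportions for the five states, as used on this slide.
proportions = {
    "California": 0.495,
    "Virginia": 0.103,
    "Washington": 0.085,
    "Florida": 0.240,
    "Maryland": 0.077,
}

expected = {state: n * p for state, p in proportions.items()}

for state, count in expected.items():
    print(f"{state}: {count:.3f}")

# The expected counts should add up to n because the proportions sum to 1.
print(f"Total: {sum(expected.values()):.3f}")
```

Checking that the expected counts sum to the sample size is a quick way to catch a slip in any single product.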

18
Solution to Example: Hybrid Car Purchases

Let p1, p2, p3, p4, p5 denote the true proportions
of hybrid car sales for California, Virginia,
Washington, Florida, and Maryland, respectively.
Assumptions: The sample was a random sample. All
expected counts are greater than 5, so it is
appropriate to use the chi-square test.
H0: p1 = 0.495, p2 = 0.103, p3 = 0.085, p4 = 0.240,
    p5 = 0.077
Ha: H0 is not true.
Test statistic
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}} = 59.49$$

P-value: All expected counts exceed 5, so
the P-value can be based on the chi-square
distribution with df = 5 - 1 = 4. From Appendix
Table 8, P-value < 0.001 ≈ 0 (the test
statistic 59.49 > 18.47, and with df = 4 any
value greater than 18.47 has a right-tail area
< 0.001).
Conclusion: Reject H0, since P-value < α.
There is evidence that hybrid car sales are not
proportional to population size for at least one
of the five states.
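For df = 4 the area to the right of a value x under the chi-square curve has a simple closed form, so the size of this P-value can be checked by hand. The worked equation below is a sketch that uses the rounded statistic 59.49 from this slide:

```latex
P(\chi^2_4 > x) = e^{-x/2}\left(1 + \frac{x}{2}\right)
\quad\Longrightarrow\quad
P(\chi^2_4 > 59.49) = e^{-29.745}\,(1 + 29.745) \approx 3.7 \times 10^{-12}
```

This is far below 0.001, in line with the table-based conclusion.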

The next slides use Excel to solve this problem.

19
Click fx, and an Insert Function dialog box
appears. Select Statistical in the "select a
category" box. In the "Select a function" list,
choose CHITEST. Then click OK.

20
As soon as you input the Actual_range (observed
frequencies) and Expected_range, you can see the
P-value in Formula result = 3.70981E-12
(= 3.70981 × 10^-12 ≈ 0).

21
Exercise: Color of Stolen Cars

Does the color of a car influence the chance
that it will be stolen? The AP reported that for
a random sample of 830 stolen vehicles, 140 were
white, 100 were blue, 270 were red, 230 were
black, and 90 were other colors. Use the X²
goodness-of-fit test and a significance level of
α = 0.01 to test the hypothesis that the
proportions stolen are identical to the
population color proportions. It is known that
15% of all cars are white, 15% are blue, 35% are
red, 30% are black, and 5% are other colors.

Answer: P-value < 0.001. There is convincing
evidence that at least one of the color
proportions for stolen cars differs from the
corresponding proportion for all cars.
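One way to work through this exercise is sketched below, using the counts and population percentages given in the exercise. The variable names are illustrative, and the P-value line relies on the closed form for the upper-tail area that holds only when df = 4.

```python
import math

# Observed counts of stolen vehicles by color (from the exercise).
observed = {"white": 140, "blue": 100, "red": 270, "black": 230, "other": 90}

# Hypothesized proportions: color shares among all cars (from the exercise).
proportions = {"white": 0.15, "blue": 0.15, "red": 0.35, "black": 0.30, "other": 0.05}

n = sum(observed.values())  # 830 stolen vehicles
expected = {color: n * proportions[color] for color in observed}

# Sample-size condition: every expected count should be at least 5.
assert all(count >= 5 for count in expected.values())

# Goodness-of-fit statistic: sum of (observed - expected)^2 / expected.
x2 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = len(observed) - 1  # 5 categories, so df = 4

# Upper-tail area for df = 4 only: exp(-x/2) * (1 + x/2).
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)

alpha = 0.01
print(f"X^2 = {x2:.2f}, df = {df}")  # X^2 = 66.33, df = 4
print("Reject H0" if p_value < alpha else "Fail to reject H0")  # Reject H0
```

The "other" category contributes most of the statistic: 90 observed against an expected 41.5.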
