Title: Categorical Data Chapter 10
1. Categorical Data (Chapter 10)
- Understand what is meant by categorical data.
- Different distributions for discrete random variables.
- Formulating and testing hypotheses for categorical data.
Terms: multinomial experiment, multinomial distribution, expected outcome, chi-square measure, chi-square distribution, count data, binomial distribution, comparing binomial proportions, contingency table, logistic regression, Poisson regression.
2. Situation
A local doctor suspects that there is a seasonal trend in the occurrence of the common cold. She thinks the trend is as follows: Winter 40%, Spring 40%, Summer 10%, Fall 10%. The information below was collected from a random sample of 1,000 cases of patients with the common cold over the past year: Winter 374, Spring 292, Summer 169, Fall 165.
Would you agree with the doctor's estimates, based on the sample information? Perform a test using Pr(Type I error) = 0.05. Conclusions?
3. QUESTIONS TO ASK
What are the key characteristics of the sample data collected?
The data represent counts in different categories.
What is the basic experiment?
Each of the 1,000 individuals with the common cold was questioned as to when they had their cold. Before questioning, each has a certain probability of being in one of the four classes. After questioning, they are placed in the appropriate class.
We call any experiment of n trials where each trial can have one of k possible outcomes a Multinomial Experiment.
For individual j, the response, $y_j$, indicates which outcome was observed. Possible outcomes are the integers 1, 2, ..., k.
4. The Multinomial Experiment
- The experiment consists of n identical trials.
- Each trial results in one of k possible outcomes.
- The probability that a single trial will result in outcome i is $p_i$, $i = 1, 2, \ldots, k$ ($\sum_i p_i = 1$), and remains constant from trial to trial.
- The trials are independent (the response of one trial does not depend on the response of any other).
- The response of interest is $n_i$, the number of trials resulting in a particular outcome i ($\sum_i n_i = n$).
The multinomial distribution provides the probability distribution for the number of observations resulting in each of the k outcomes:
$P(n_1, n_2, \ldots, n_k) = \frac{n!}{n_1!\, n_2! \cdots n_k!}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$, where $0! = 1$.
This tells us the probability of observing exactly $n_1, n_2, \ldots, n_k$.
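As a rough illustration of the multinomial distribution described above, the sketch below evaluates the standard multinomial probability, n!/(n1! ... nk!) times the product of the p_i raised to n_i, on the log scale. The function name `multinomial_pmf` and the three-outcome toy example are illustrative assumptions, not part of the original slides.

```python
import math

def multinomial_pmf(counts, probs):
    """Probability of observing exactly `counts` in a multinomial experiment.

    Works on the log scale (via lgamma) so that large n does not overflow.
    """
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    n = sum(counts)
    # log of n! / (n1! n2! ... nk!); lgamma(m + 1) = log(m!), so 0! = 1 is handled
    log_p = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    for c, p in zip(counts, probs):
        if c > 0:
            if p == 0.0:
                return 0.0  # an outcome with zero probability was observed
            log_p += c * math.log(p)
    return math.exp(log_p)

# Hypothetical toy example: n = 3 trials, k = 3 outcomes (3! * 0.5 * 0.3 * 0.2).
toy = multinomial_pmf([1, 1, 1], [0.5, 0.3, 0.2])

# Common cold counts under the doctor's hypothesized proportions.
cold = multinomial_pmf([374, 292, 169, 165], [0.40, 0.40, 0.10, 0.10])

print(f"toy example: {toy:.4f}")
print(f"common cold data: {cold:.3e}")
```

With n = 1,000 the probability of any single exact set of counts is tiny even when the hypothesized proportions are right, which is one reason a summary measure of fit is more practical than the raw probability.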
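5. From the common cold example
Observed: Winter $n_1 = 374$, Spring $n_2 = 292$, Summer $n_3 = 169$, Fall $n_4 = 165$.
Hypothesized: Winter 40% ($p_1 = 0.40$), Spring 40% ($p_2 = 0.40$), Summer 10% ($p_3 = 0.10$), Fall 10% ($p_4 = 0.10$).
The multinomial probability of exactly these counts under the hypothesized proportions is
$P(374, 292, 169, 165) = \frac{1000!}{374!\,292!\,169!\,165!}\,(0.40)^{374}(0.40)^{292}(0.10)^{169}(0.10)^{165}$.
If this probability is high, then we would say that there is a good likelihood that the observed data come from a multinomial experiment with the hypothesized probabilities. Otherwise we have the probabilities wrong. How do we measure the goodness of fit between the hypothesized probabilities and the observed data?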
6. In a multinomial experiment of n trials with hypothesized probabilities $p_i$, $i = 1, 2, \ldots, k$, the expected number of responses in each outcome class is given by
$E_i = n p_i$, where $p_i$ is the cell probability.
A reasonable measure of goodness of fit is to compare the observed class frequencies to the expected class frequencies. It turns out (Pearson, 1900) that the following statistic is one of the best for this purpose:
$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - E_i)^2}{E_i}$, where $n_i$ is the observed cell count and $E_i$ is the expected cell count.
This statistic has (approximately) a chi-square distribution with df = k - 1, provided there are no sparse counts: (i) no $E_i$ is less than 1, and (ii) no more than 20% of the $E_i$ are less than 5.
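7.
Class    Hypothesized         Observed   Expected
Winter   40% (p1 = 0.40)      374        400
Spring   40% (p2 = 0.40)      292        400
Summer   10% (p3 = 0.10)      169        100
Fall     10% (p4 = 0.10)      165        100
Pr(Type I error) = α = 0.05
$\chi^2 = \frac{(374-400)^2}{400} + \frac{(292-400)^2}{400} + \frac{(169-100)^2}{100} + \frac{(165-100)^2}{100} = 120.71$, with df = 4 - 1 = 3 (tabulated critical value 7.815).
Since 120.71 > 7.815, we conclude it is unlikely that the hypothesized proportions are the true ones.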
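The goodness-of-fit calculation for the common cold data can be sketched in a few lines of Python. This is a minimal sketch using only the standard library: the helper names are illustrative, the closed-form tail probability used here is specific to 3 degrees of freedom, and 7.815 is the standard tabulated 5% critical value for df = 3.

```python
import math

def chi_square_gof(observed, probs):
    """Pearson chi-square statistic and expected counts for hypothesized proportions."""
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

observed = [374, 292, 169, 165]          # Winter, Spring, Summer, Fall
hypothesized = [0.40, 0.40, 0.10, 0.10]  # the doctor's proportions

stat, expected = chi_square_gof(observed, hypothesized)
p_value = chi2_sf_df3(stat)
critical_value = 7.815                   # tabulated chi-square value, df = 3, alpha = 0.05

print(f"expected counts: {expected}")
print(f"chi-square = {stat:.2f}, p-value = {p_value:.2e}")
print("reject H0" if stat > critical_value else "do not reject H0")
```

Most of the statistic comes from the Summer and Fall cells, where the observed counts are well above the expected 100.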
8. Summary: Chi-Square Goodness of Fit Test
H0: $p_i = p_{i0}$ for categories $i = 1, 2, \ldots, k$ (specified cell proportions for k categories).
Ha: At least two of the true population cell proportions differ from the specified proportions.
Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(n_i - E_i)^2}{E_i}$, with $E_i = n p_{i0}$.
Rejection region: Reject H0 if $\chi^2$ exceeds the tabulated critical value for the chi-square distribution with df = k - 1 and Pr(Type I error) = α.
9. Binomial Experiment: Multinomial Experiment with two classes
Since the sum of the proportions is equal to 1, we have $p_2 = 1 - p_1$.
Since the cell frequencies sum to the total sample size, $n_2 = n - n_1$.
If p is the probability of a success and y is the number of successes in n trials, then
$P(y) = \frac{n!}{y!\,(n-y)!}\, p^y (1-p)^{n-y}$, $y = 0, 1, \ldots, n$.
10. Normal Approximation to the Binomial and CI for p
In general, for sufficiently large n (a common rule of thumb is $np \ge 5$ and $n(1-p) \ge 5$), the probability of observing y or more successes can be approximated by an appropriate normal distribution (see Section 4.13).
What about a confidence interval (CI) for p? Using a similar argument as for y, we obtain the 100(1-α)% CI
$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$.
Use $\hat{p} = y/n$ in place of p when p is unknown.
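A minimal sketch of the large-sample interval for p, with the sample proportion substituted for the unknown p in the standard error. The function name and the data (45 successes in 100 trials) are hypothetical and serve only to illustrate the arithmetic.

```python
import math
from statistics import NormalDist

def proportion_ci(y, n, conf=0.95):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = y / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)          # z_{alpha/2}
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)   # p_hat replaces the unknown p
    return p_hat - half_width, p_hat + half_width

# Hypothetical data: y = 45 successes in n = 100 trials.
low, high = proportion_ci(45, 100, conf=0.95)
print(f"95% CI for p: ({low:.3f}, {high:.3f})")
```

The interval relies on the normal approximation, so it should only be trusted when n is large enough for that approximation to hold.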
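11. Approximate Statistical Test for p
H0: $p = p_0$ ($p_0$ specified)
Ha: 1. $p > p_0$  2. $p < p_0$  3. $p \ne p_0$
Test statistic: $z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}}$
Note: under H0, $\sigma_{\hat{p}} = \sqrt{\frac{p_0(1-p_0)}{n}}$.
Rejection region:
1. Reject if $z > z_{\alpha}$
2. Reject if $z < -z_{\alpha}$
3. Reject if $|z| > z_{\alpha/2}$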
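The three rejection rules can be sketched as a single function. This sketch assumes the usual large-sample statistic, in which the standard error is computed from the null value p0; the function name, the `alternative` labels, and the data (60 successes in 100 trials, p0 = 0.5) are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_z_test(y, n, p0, alternative="greater", alpha=0.05):
    """Approximate z test of H0: p = p0; returns (z, reject_H0)."""
    p_hat = y / n
    se0 = math.sqrt(p0 * (1 - p0) / n)     # standard error evaluated under H0
    z = (p_hat - p0) / se0
    std_normal = NormalDist()
    if alternative == "greater":           # Ha: p > p0
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":            # Ha: p < p0
        reject = z < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":       # Ha: p != p0
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, reject

# Hypothetical data: 60 successes in 100 trials, testing p0 = 0.5.
z, reject = proportion_z_test(60, 100, 0.5, alternative="greater")
print(f"z = {z:.3f}, reject H0: {reject}")
```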
12. Sample Size needed to meet a pre-specified confidence in p
Suppose we wish to estimate p to within E with confidence 100(1-α)%. What sample size should we use?
$n = \frac{z_{\alpha/2}^2\, p(1-p)}{E^2}$
Since p is unknown, do one of the following:
1. Substitute our best guess.
2. Use p = 0.5 (worst-case estimate).
Example: We have been contracted to perform a survey to determine what fraction of students eat lunch on campus. How many students should we interview if we wish to be 95% confident of being within 2% of the true proportion?
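The lunch survey question can be worked through with the usual sample-size formula n = z² p(1 - p) / E². The sketch below assumes the margin "within 2" means E = 0.02 (two percentage points) and uses the worst-case value p = 0.5; the function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, conf=0.95, p_guess=0.5):
    """Smallest n with z_{alpha/2} * sqrt(p(1-p)/n) <= margin, using p_guess for p."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)

# Worst case (p = 0.5), 95% confidence, margin E = 0.02 (assumed reading of "within 2").
n_needed = sample_size_for_proportion(0.02, conf=0.95)
print(f"students to interview: {n_needed}")
```

Under these assumptions about 2,401 students are needed. A best guess for p further from 0.5 gives a smaller n, which is why p = 0.5 is the conservative choice.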
13. Comparing Two Binomial Proportions
Situation: Two sets of 60 ninth-graders were taught Algebra I by different methods (self-paced versus formal lectures). At the end of the 4-month period, a comprehensive, standardized test was given to both groups, with results:
Experimental group: n = 60, 39 scored above 80.
Traditional group: n = 60, 28 scored above 80.
Is this sufficient evidence to conclude that the experimental group performed better than the traditional group?
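14.
                        Population        Example
                        1       2         1       2
Population proportion   p1      p2
Sample size             n1      n2        60      60
Number of successes     y1      y2        39      28
Sample proportion       y1/n1   y2/n2     0.65    0.467
100(1-α)% confidence interval for $p_1 - p_2$:
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{p}_1 - \hat{p}_2}$; use $\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$.
Ex: the 90% CI is 0.183 ± 1.645(0.089) ⇒ (0.037, 0.330).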
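The interval for the algebra example can be reproduced with the unpooled standard error, which matches the 0.089 quoted above. The function name is illustrative. Because the code carries unrounded values, its endpoints may differ from the slide's in the third decimal place.

```python
import math
from statistics import NormalDist

def two_proportion_ci(y1, n1, y2, n2, conf=0.90):
    """Large-sample CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = y1 / n1, y2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # 1.645 for a 90% interval
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

# Experimental group: 39 of 60 above 80; traditional group: 28 of 60.
diff, se, (low, high) = two_proportion_ci(39, 60, 28, 60, conf=0.90)
print(f"difference = {diff:.3f}, SE = {se:.3f}, 90% CI = ({low:.3f}, {high:.3f})")
```

The interval excludes zero, which agrees with the one-sided test at α = 0.05 that follows.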
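15. Statistical Test for Comparing Two Binomial Proportions
H0: $p_1 - p_2 = 0$ (or $p_1 = p_2 = p$)
Ha: 1. $p_1 - p_2 > 0$  2. $p_1 - p_2 < 0$  3. $p_1 - p_2 \ne 0$
Test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$
Note: under H0, z is approximately standard normal for large samples.
Rejection region:
1. Reject if $z > z_{\alpha}$
2. Reject if $z < -z_{\alpha}$
3. Reject if $|z| > z_{\alpha/2}$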
16. Test Statistic
                        Population        Example
                        1       2         1       2
Population proportion   p1      p2
Sample size             n1      n2        60      60
Number of successes     y1      y2        39      28
Sample proportion       y1/n1   y2/n2     0.65    0.467
$z = \frac{0.65 - 0.467}{0.089} = 2.056$
Since 2.056 is greater than $z_{0.05} = 1.645$, we reject H0 and conclude Ha: $p_1 > p_2$.
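A sketch of the test statistic for the algebra example. The default uses the unpooled standard error, which is what reproduces the 2.056 above (the slide's figure uses the rounded values 0.183 and 0.089). The `pooled=True` option shows a common alternative that estimates a single p under H0; it is included as an assumption for comparison and is not taken from the slides.

```python
import math
from statistics import NormalDist

def two_proportion_z(y1, n1, y2, n2, pooled=False):
    """z statistic for H0: p1 - p2 = 0."""
    p1, p2 = y1 / n1, y2 / n2
    if pooled:
        # Alternative (assumption, not from the slides): pool both samples under H0.
        p = (y1 + y2) / (n1 + n2)
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    else:
        # Unpooled standard error, consistent with the 0.089 used on the slides.
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

z_unpooled = two_proportion_z(39, 60, 28, 60)
z_pooled = two_proportion_z(39, 60, 28, 60, pooled=True)
z_crit = NormalDist().inv_cdf(0.95)   # one-sided test, alpha = 0.05

print(f"unpooled z = {z_unpooled:.3f}, pooled z = {z_pooled:.3f}, critical = {z_crit:.3f}")
```

Both versions exceed the critical value here, so the conclusion does not depend on the choice of standard error for these data.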
17. Contingency Tables: Tests for independence and homogeneity
How to test hypotheses of independence (association) and homogeneity (similarity) for general two-way cross-classifications of count data.
Terms: contingency table, cross-classification table, measure of association, independence in two-way tables, chi-square test for independence or homogeneity.
18. Test of Independence or Association
A university conducted a study concerning the classification of faculty teaching evaluations by students. A sample of 467 faculty is randomly selected, and each person is classified according to rank (Instructor, Assistant Professor, etc.) and teaching evaluation (Above Average, Average, Below Average).
Data can be formatted into a cross-tabulation or contingency table.
Each person has two categorical responses.
19. What are we interested in from this two-way classification table?
Is the level of teaching evaluation related to rank?
Are Professors more likely to be judged above average than other ranks?
H0: Teaching Evaluation and Rank are independent variables.
Two variables that have been categorized in a two-way table are independent if the probability that a measurement is classified into a given cell of the table is equal to the probability of being classified into that row times the probability of being classified into that column. This must be true for all cells of the table.
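20. The independence assumption
Under independence, the probability of cell (i, j) is the product of its row and column probabilities: $\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$.
Observed: $n_{ij}$, the count in row i and column j.
Expected: $E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$ (row total times column total, divided by the overall total).
21. Observed Counts
           Instructor  Assistant  Associate  Professor  Total
Above          36         62         45         50       193
Average        48         50         35         43       176
Below          30         13         20         35        98
Total         114        125        100        128       467
22. Expected Counts
           Instructor  Assistant  Associate  Professor  Total
Above         47.11      51.66      41.33      52.90     193
Average       42.96      47.11      37.69      48.24     176
Below         23.92      26.23      20.99      26.86      98
Total        114        125        100        128        467
Assumptions: no $E_{ij} < 1$, and no more than 20% of the $E_{ij} < 5$.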
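The expected counts under independence are each row total times column total divided by the overall total. The sketch below computes them for the faculty data and checks the two sparse-count conditions. The layout (rows Above, Average, Below; columns Instructor, Assistant, Associate, Professor) follows the counts given in the software listings later in the deck; the function name is illustrative.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: Above, Average, Below; columns: Instructor, Assistant, Associate, Professor.
observed = [
    [36, 62, 45, 50],
    [48, 50, 35, 43],
    [30, 13, 20, 35],
]
expected = expected_counts(observed)

# Sparse-count checks: no E_ij < 1, and no more than 20% of the E_ij < 5.
cells = [e for row in expected for e in row]
smallest = min(cells)
share_below_5 = sum(e < 5 for e in cells) / len(cells)

for row in expected:
    print("  ".join(f"{e:6.2f}" for e in row))
print(f"smallest expected count = {smallest:.2f}, share below 5 = {share_below_5:.0%}")
```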
23. Individual Cell Chi-Square Values
$\chi^2 = \sum_{i,j} \frac{(n_{ij} - E_{ij})^2}{E_{ij}} = 17.435$ with df = (3 - 1)(4 - 1) = 6 and p-value 0.008 ⇒ Reject H0.
There is evidence of an association between rank and evaluation. Note that we observed fewer Assistant Professors getting below-average evaluations (13) than we would expect under independence (26.2). The chi-square value for this cell is 6.67.
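The individual cell contributions, the overall statistic, and its p-value can be sketched as follows. The standardized residual of a cell is (observed - expected) / sqrt(expected), and its square is that cell's chi-square contribution. The tail-probability helper uses a closed form that is valid only for even degrees of freedom (df = 6 here); both function names are illustrative.

```python
import math

def chi_square_independence(observed):
    """Pearson chi-square statistic, df, and standardized residuals for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    residuals = []
    for row, r in zip(observed, row_totals):
        # (observed - expected) / sqrt(expected), with expected = r * c / n
        residuals.append([(o - r * c / n) / math.sqrt(r * c / n)
                          for o, c in zip(row, col_totals)])
    stat = sum(res ** 2 for row in residuals for res in row)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, residuals

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Rows: Above, Average, Below; columns: Instructor, Assistant, Associate, Professor.
observed = [
    [36, 62, 45, 50],
    [48, 50, 35, 43],
    [30, 13, 20, 35],
]
stat, df, residuals = chi_square_independence(observed)
p_value = chi2_sf_even_df(stat, df)
below_assistant = residuals[2][1]   # Below-average evaluations, Assistant Professors

print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.4f}")
print(f"Assistant/Below residual = {below_assistant:.2f}, "
      f"cell chi-square = {below_assistant ** 2:.2f}")
```

Squaring the Assistant/Below residual recovers the 6.67 quoted above, the largest single contribution to the total.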
24. Minitab
Input data in this way:
rank  eval  count
1     1     30
1     2     48
1     3     36
2     1     13
2     2     50
2     3     62
3     1     20
3     2     35
3     3     45
4     1     35
4     2     43
4     3     50
STAT > TABLES > Cross Tabs
Classification Variables: rank eval
Check Chi-square Analysis, and Above and Std. residual
Frequencies are in: count
25. Tabulated Statistics: eval, rank
Rows: eval   Columns: rank

            1        2        3        4      All
1          30       13       20       35       98
        23.92    26.23    20.99    26.86    98.00
         1.24    -2.58    -0.22     1.57       --
2          48       50       35       43      176
        42.96    47.11    37.69    48.24   176.00
         0.77     0.42    -0.44    -0.75       --
3          36       62       45       50      193
        47.11    51.66    41.33    52.90   193.00
        -1.62     1.44     0.57    -0.40       --
All       114      125      100      128      467
       114.00   125.00   100.00   128.00   467.00
           --       --       --       --       --

Chi-Square = 17.435, DF = 6, P-Value = 0.008

Cell Contents -- Count
                 Exp Freq
                 Std. Resid

The standardized residuals are the signed square roots of the individual chi-square values.
26. SAS
  options ls=79 ps=40 nocenter;
  data eval;
    input job $ rating $ number;
    datalines;
  Instructor Above 36
  Instructor Average 48
  Instructor Below 30
  Assistant Above 62
  Assistant Average 50
  Assistant Below 13
  Associate Above 45
  Associate Average 35
  Associate Below 20
  Professor Above 50
  Professor Average 43
  Professor Below 35
  ;
  run;
  proc freq data=eval;
    weight number;
    table job*rating / chisq;
  run;
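Table of job by rating

job        rating
Frequency |
Percent   |
Row Pct   |
Col Pct   |   Above | Average |   Below |   Total
----------+---------+---------+---------+
Assistan  |      62 |      50 |      13 |     125
          |   13.28 |   10.71 |    2.78 |   26.77
          |   49.60 |   40.00 |   10.40 |
          |   32.12 |   28.41 |   13.27 |
----------+---------+---------+---------+
Associat  |      45 |      35 |      20 |     100
          |    9.64 |    7.49 |    4.28 |   21.41
          |   45.00 |   35.00 |   20.00 |
          |   23.32 |   19.89 |   20.41 |
----------+---------+---------+---------+
Instruct  |      36 |      48 |      30 |     114
          |    7.71 |   10.28 |    6.42 |   24.41
          |   31.58 |   42.11 |   26.32 |
          |   18.65 |   27.27 |   30.61 |
----------+---------+---------+---------+
Professo  |      50 |      43 |      35 |     128
          |   10.71 |    9.21 |    7.49 |   27.41
          |   39.06 |   33.59 |   27.34 |
          |   25.91 |   24.43 |   35.71 |
----------+---------+---------+---------+
Total           193       176        98       467
              41.33     37.69     20.99    100.00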
27. The FREQ Procedure
Statistics for Table of job by rating

Statistic                      DF       Value      Prob
Chi-Square                      6     17.4354    0.0078
Likelihood Ratio Chi-Square     6     18.7430    0.0046
Mantel-Haenszel Chi-Square      1     10.8814    0.0010
Phi Coefficient                        0.1932
Contingency Coefficient                0.1897
Cramer's V                             0.1366

Sample Size = 467
28. SPSS
First you need to tell SPSS that each observation must be weighted by the cell count.
DATA > WEIGHT CASES
Then you choose the analysis.
ANALYZE > DESCRIPTIVE STATISTICS > CROSSTABS
30. R
> score <- c(36,48,30,62,50,13,45,35,20,50,43,35)
> mscore <- matrix(score,3,4)
> mscore
     [,1] [,2] [,3] [,4]
[1,]   36   62   45   50
[2,]   48   50   35   43
[3,]   30   13   20   35
> chisq.test(mscore)

        Pearson's Chi-squared test

data:  mscore
X-squared = 17.4354, df = 6, p-value = 0.00781

> out <- chisq.test(mscore)
> out[1:length(out)]
$statistic
X-squared
 17.43537
31.
$method
[1] "Pearson's Chi-squared test"

$data.name
[1] "mscore"

$observed
     [,1] [,2] [,3] [,4]
[1,]   36   62   45   50
[2,]   48   50   35   43
[3,]   30   13   20   35

$expected
         [,1]     [,2]     [,3]     [,4]
[1,] 47.11349 51.65953 41.32762 52.89936
[2,] 42.96360 47.10921 37.68737 48.23983
[3,] 23.92291 26.23126 20.98501 26.86081

$residuals
           [,1]       [,2]       [,3]       [,4]
[1,] -1.6191155  1.4386830  0.5712511 -0.3986361
[2,]  0.7683695  0.4211764 -0.4377528 -0.7544218
[3,]  1.2424774 -2.5834003 -0.2150237  1.5704402

The residuals are the signed square roots of the individual chi-square values.
32. Test of Homogeneity
Suppose we wish to determine if there is an association between a rare disease and another, more common categorical variable (e.g. smoking). We cannot just take a random sample of subjects and hope to get enough cases (subjects with the disease). One solution is to choose a fixed number of cases and a fixed number of controls, and classify each according to whether they are smokers or not.
The same chi-square test of independence applies here, but since we are sampling within subpopulations (with fixed margin totals), it is now called a chi-square test of homogeneity (of distributions).
In general, if the column categories represent c distinct subpopulations, random samples of size $n_1, n_2, \ldots, n_c$ are selected from each and classified into the r values of a categorical variable represented by the rows of the contingency table. The hypothesis of interest is whether there is a difference in the distribution of subpopulation units among the r levels of the categorical variable, i.e. whether the subpopulations are homogeneous or not.
33. Example: Myocardial Infarction (MI)
Data were collected to determine if there is an association between myocardial infarction and smoking in women. 262 women suffering from MI were classified according to whether they had ever smoked or not. Two controls (patients with other acute disorders) were matched to every case.
Is the incidence of smoking the same for MI and non-MI sufferers?
H0: the incidence of MI is homogeneous with respect to smoking.
H0: $\pi_{11} = \pi_{12}$ and $\pi_{21} = \pi_{22}$, where $\pi_{ij}$ is the proportion of subpopulation j (MI yes, MI no) falling in smoking category i.
34. Example: MI results in MTB
Stat > Tables > Chi-Square Test

Chi-Square Test: MI Yes, MI No
Expected counts are printed below observed counts

         MI Yes    MI No    Total
1           172      173      345
         115.74   229.26
2            90      346      436
         146.26   289.74
Total       262      519      781

Chi-Sq = 27.352 + 13.808 + 21.643 + 10.926 = 73.729
DF = 1, P-Value = 0.000

Conclude there is evidence of lack of homogeneity of incidence of MI with respect to smoking.
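For a 2x2 table the Pearson statistic has a well-known shortcut, n(ad - bc)² divided by the product of the four margin totals, and with 1 degree of freedom the upper-tail probability equals erfc(sqrt(χ²/2)). The sketch below applies both to the MI counts. No continuity correction is used, which matches the Minitab figure above; the function name is illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))   # chi-square upper tail, 1 df
    return stat, p_value

# Row 1: ever smoked (172 MI, 173 controls); row 2: never smoked (90 MI, 346 controls).
stat, p_value = chi_square_2x2(172, 173, 90, 346)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.2e}")
```

The p-value is far below any usual α, which is why Minitab prints it as 0.000.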
35. Example: MI and Odds Ratios
For women sufferers of MI, the proportion who ever smoked is 172/262 = 0.656. In other words, the odds that a woman MI sufferer is a smoker are 0.656/(1 - 0.656) = 1.9.
For women non-sufferers of MI, the proportion who ever smoked is 173/519 = 0.333. In other words, the odds that a woman non-MI sufferer is a smoker are 0.333/(1 - 0.333) = 0.5.
We can now calculate the odds ratio of being a smoker for MI sufferers relative to non-sufferers: OR = 1.9/0.5 = 3.8 (3.82 from the unrounded odds).
The odds of being a smoker are about 4 times as high among MI sufferers as among non-MI sufferers. Put another way, a randomly selected MI sufferer is about twice as likely (0.656/0.333) to be a smoker as a randomly selected non-MI sufferer.
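The odds and odds ratio arithmetic can be checked directly from the counts. A minimal sketch with illustrative variable names; it works from the unrounded proportions, so the odds ratio comes out as 3.82 rather than the 3.8 obtained from the rounded odds 1.9 and 0.5.

```python
def odds(p):
    """Convert a proportion into odds, p / (1 - p)."""
    return p / (1 - p)

smokers_mi, total_mi = 172, 262        # MI sufferers who ever smoked / all MI sufferers
smokers_ctrl, total_ctrl = 173, 519    # controls who ever smoked / all controls

p_mi = smokers_mi / total_mi
p_ctrl = smokers_ctrl / total_ctrl

odds_ratio = odds(p_mi) / odds(p_ctrl)   # odds of smoking, MI versus non-MI
proportion_ratio = p_mi / p_ctrl         # ratio of the smoking proportions

print(f"odds (MI) = {odds(p_mi):.2f}, odds (non-MI) = {odds(p_ctrl):.2f}")
print(f"odds ratio = {odds_ratio:.2f}, ratio of proportions = {proportion_ratio:.2f}")
```

The odds ratio (about 3.8) and the ratio of proportions (about 2) answer different questions, so the two numbers should not be used interchangeably.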