Title: Introduction to Categorical Data Analysis July 22, 2004
1Introduction to Categorical Data Analysis, July 22, 2004
2Categorical data
- The t-test, ANOVA, and linear regression all assumed outcome variables that were continuous (normally distributed).
- Even their non-parametric equivalents assumed at least many levels of the outcome (discrete quantitative or ordinal).
- We haven't discussed the case where the outcome variable is categorical.
3Types of Variables: a taxonomy
Categorical
- binary: 2 categories
- nominal: more categories
- ordinal: order matters
Quantitative
- discrete: numerical
- continuous: uninterrupted
4Overview of statistical tests
- Independent variable = predictor
- Dependent variable = outcome
- e.g., BMD (outcome) as a function of pounds, age, and amenorrheic (1/0) (predictors)
7Difference in proportions
- Example: You poll 50 people from random districts in Florida as they exit the polls on election day 2004. You also poll 50 people from random districts in Massachusetts. 49% of pollees in Florida say that they voted for Kerry, and 53% of pollees in Massachusetts say they voted for Kerry. Is there enough evidence to reject the null hypothesis that the states voted for Kerry in equal proportions?
8Null distribution of a difference in proportions
9Null distribution of a difference in proportions
10Answer to Example
- We saw a difference of 4% between Florida and Massachusetts.
- The null distribution predicts chance variation between the two states of about 10% (the standard error of the difference).
- P(our data | null distribution) = P(Z > .04/.10 = .4) > .05
- Not enough evidence to reject the null.
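The arithmetic behind this answer can be sketched in Python. This is a minimal sketch, assuming the figures in the example (49% and 53% of 50 pollees each) and the usual pooled standard error for a difference in proportions under the null; the function name `two_proportion_z` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    """Z statistic and standard error for H0: the two proportions are equal."""
    # Pooled proportion under the null hypothesis
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference in sample proportions under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se, se

# Florida: 49% of 50 pollees; Massachusetts: 53% of 50 pollees
z, se = two_proportion_z(0.49, 50, 0.53, 50)
one_sided_p = 1 - NormalDist().cdf(z)   # P(Z > z)
two_sided_p = 2 * one_sided_p

print(f"SE = {se:.3f}, Z = {z:.2f}")
print(f"one-sided p = {one_sided_p:.3f}, two-sided p = {two_sided_p:.3f}")
```

The standard error comes out at roughly .10 and Z at roughly .4, so either p-value is well above .05.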
11Chi-square test for comparing proportions (of a categorical variable) between groups
I. Chi-Square Test of Independence: When both your predictor and outcome variables are categorical, they may be cross-classified in a contingency table and compared using a chi-square test of independence. A contingency table with R rows and C columns is an R x C contingency table.
12Example
- Asch, S.E. (1955). Opinions and social pressure.
Scientific American, 193, 31-35.
13The Experiment
- A Subject volunteers to participate in a visual perception study.
- Everyone else in the room is actually a conspirator in the study (unbeknownst to the Subject).
- The experimenter reveals a pair of cards.
14The Task Cards
Standard line
Comparison lines A, B, and C
15The Experiment
- Everyone goes around the room and says which comparison line (A, B, or C) is correct; the true Subject always answers last, after hearing all the others' answers.
- The first few times, the 7 conspirators give the correct answer.
- Then, they start purposely giving the (obviously) wrong answer.
- 75% of Subjects tested went along with the group's consensus at least once.
16Further Results
- In a further experiment, group size (number of conspirators) was varied from 2 to 10.
- Does the group size alter the proportion of subjects who conform?
17The Chi-Square test
Apparently, conformity is less likely with fewer or with more group members.
18- 20 + 50 + 75 + 60 + 30 = 235 conformed out of 500 experiments.
- Overall likelihood of conforming = 235/500 = .47
19Expected frequencies if no association between group size and conformity
20- Do observed and expected counts differ more than expected due to chance?
21Chi-Square test
Rule of thumb: if the chi-square statistic is much greater than its degrees of freedom, this indicates statistical significance. Here 85 >> 4.
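The statistic being compared with its degrees of freedom is the sum, over all cells of the table, of (observed - expected)^2 / expected, where each expected count is what the overall conformity rate of .47 would predict. The sketch below works through that sum in Python. It uses the conformer counts from the slides (20, 50, 75, 60, 30), but the per-condition sample size of 100 is an assumption: the slides give only the total of 500 experiments across five conditions.

```python
# Conformers per group-size condition, as listed on the slides
conformed = [20, 50, 75, 60, 30]
# ASSUMPTION (not stated in the slides): 100 subjects in each condition
group_n = [100] * 5

total_n = sum(group_n)
p_conform = sum(conformed) / total_n      # 235/500 = 0.47

chi_square = 0.0
for obs_yes, n in zip(conformed, group_n):
    obs_no = n - obs_yes
    exp_yes = n * p_conform               # expected conformers if no association
    exp_no = n * (1 - p_conform)          # expected non-conformers if no association
    chi_square += (obs_yes - exp_yes) ** 2 / exp_yes
    chi_square += (obs_no - exp_no) ** 2 / exp_no

df = (2 - 1) * (len(conformed) - 1)       # (rows - 1) * (columns - 1)
print(f"chi-square = {chi_square:.1f} on {df} df")
```

Under the equal-split assumption the statistic comes out near 79.5 rather than the 85 quoted on the slide, so the slide's underlying table presumably differs in some cell counts. Either value is far larger than the 4 degrees of freedom.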
22The Chi-Square distribution is a sum of squared normal deviates
The expected value and variance of a chi-square: E(x) = df, Var(x) = 2(df)
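A quick simulation can make both statements concrete. This is an illustrative sketch only: it builds chi-square draws directly as sums of squared standard normal deviates and compares the sample mean and variance with df and 2(df). The choice of df = 4, the number of draws, and the seed are arbitrary.

```python
import random
from statistics import fmean, pvariance

random.seed(0)      # fixed seed so the run is repeatable
df = 4              # arbitrary choice for illustration
n_draws = 20000

# Each draw is the sum of `df` squared standard normal deviates
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(df))
         for _ in range(n_draws)]

print(f"mean     = {fmean(draws):.2f} (theory: {df})")
print(f"variance = {pvariance(draws):.2f} (theory: {2 * df})")
```

The simulated values should land close to, but not exactly on, the theoretical ones, with the gap shrinking as `n_draws` grows.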
24Caveat
- When the sample size is very small in any cell (expected count <5), Fisher's exact test is used as an alternative to the chi-square test.
25Example of Fisher's Exact Test
26Fisher's Tea-tasting experiment
Claim: Fisher's colleague (call her Cathy) claimed that, when drinking tea, she could distinguish whether milk or tea was added to the cup first. To test her claim, Fisher designed an experiment in which she tasted 8 cups of tea (4 cups had milk poured first, 4 had tea poured first).
Null hypothesis: Cathy's guessing abilities are no better than chance.
Alternative hypotheses:
Right-tail: She guesses right more than expected by chance.
Left-tail: She guesses wrong more than expected by chance.
27Fisher's Tea-tasting experiment
Experimental Results
28Fisher's Exact Test
Step 1: Identify tables that are as extreme or more extreme than what actually happened. Here she identified 3 out of 4 of the milk-poured-first teas correctly. Is that good luck or real talent? The only way she could have done better is if she identified 4 of 4 correctly.
29Fisher's Exact Test
Step 2: Calculate the probability of the tables (assuming fixed marginals).
30Step 3: To get the left-tail and right-tail p-values, consider the probability mass function.
Probability mass function of X, where X = the number of correct identifications of the cups with milk poured first
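Steps 1 to 3 can be sketched in a few lines of Python. With all margins fixed (4 cups milk-first, 4 tea-first, and 4 cups guessed as milk-first), X follows a hypergeometric distribution. The function name `table_probability` is made up for illustration, and the two-sided rule used here (sum every table no more probable than the observed one) is one common convention.

```python
from math import comb

def table_probability(x, milk_first=4, tea_first=4, guessed_milk=4):
    """P(X = x) with all margins fixed (hypergeometric distribution)."""
    return (comb(milk_first, x) * comb(tea_first, guessed_milk - x)
            / comb(milk_first + tea_first, guessed_milk))

# Probability of each possible table, indexed by X = 0..4
pmf = {x: table_probability(x) for x in range(5)}
observed = 3  # she identified 3 of the 4 milk-first cups

right_tail = sum(p for x, p in pmf.items() if x >= observed)  # as many or more correct
left_tail = sum(p for x, p in pmf.items() if x <= observed)   # as many or fewer correct
# Two-sided: all tables no more probable than the observed one
# (small tolerance guards against floating-point ties)
two_sided = sum(p for p in pmf.values() if p <= pmf[observed] + 1e-12)

for x, p in pmf.items():
    print(f"P(X = {x}) = {p:.4f}")
print(f"right-tail p = {right_tail:.4f}")
print(f"left-tail p  = {left_tail:.4f}")
print(f"two-sided p  = {two_sided:.4f}")
```

The right-tail, left-tail, and two-sided values (17/70, 69/70, and 34/70) agree with the 0.2429, 0.9857, and 0.4857 reported in the SAS output for the same table.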
31SAS code and output for generating Fisher's Exact statistics for 2x2 table
32
data tea;
  input MilkFirst GuessedMilk Freq;
  datalines;
1 1 3
1 0 1
0 1 1
0 0 3
;
run;

data tea; *Fix quirky reversal of SAS 2x2 tables;
  set tea;
  MilkFirst=1-MilkFirst;
  GuessedMilk=1-GuessedMilk;
run;

proc freq data=tea;
  tables MilkFirst*GuessedMilk /exact;
  weight freq;
run;
33SAS output
Statistics for Table of MilkFirst by GuessedMilk

Statistic                      DF     Value      Prob
------------------------------------------------------
Chi-Square                      1    2.0000    0.1573
Likelihood Ratio Chi-Square     1    2.0930    0.1480
Continuity Adj. Chi-Square      1    0.5000    0.4795
Mantel-Haenszel Chi-Square      1    1.7500    0.1859
Phi Coefficient                      0.5000
Contingency Coefficient              0.4472
Cramer's V                           0.5000

WARNING: 100% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)         3
Left-sided Pr <= F          0.9857
Right-sided Pr >= F         0.2429

Table Probability (P)       0.2286
Two-sided Pr <= P           0.4857

Sample Size = 8
34Introduction to the 2x2 Table
35Introduction to the 2x2 Table
36Cohort Studies
[Diagram: the target population is divided into two groups and followed over TIME; each group ends as Disease or Disease-free.]
37The Risk Ratio, or Relative Risk (RR)
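The slide's formula did not survive transcription; the standard definition is the risk of disease among the exposed divided by the risk among the unexposed. A minimal sketch in Python follows. The counts are hypothetical, invented purely for illustration (they are not the data from the slides), and `risk_ratio` is an illustrative name.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of disease in the exposed divided by risk in the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# HYPOTHETICAL cohort counts, invented for illustration (not the slide's data)
rr = risk_ratio(exposed_cases=40, exposed_total=1000,
                unexposed_cases=20, unexposed_total=1000)
print(f"RR = {rr:.2f}")
```

With these made-up counts the exposed group has twice the risk of the unexposed group.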
38Hypothetical Data
39Case-Control Studies
- Sample on disease status and ask retrospectively about exposures (for rare diseases).
- Marginal probabilities of exposure for cases and controls are valid.
- Doesn't require knowledge of the absolute risks of disease.
- For rare diseases, can approximate relative risk.
40Case-Control Studies
[Diagram: cases (Disease) and controls (No Disease) are sampled from the target population; each group is classified as exposed in the past or not exposed.]
41The Odds Ratio (OR)
42The Odds Ratio
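As a hedged illustration, the odds ratio for a 2x2 table is the cross-product ratio ad/bc. The sketch below uses hypothetical counts, invented for illustration with a deliberately rare disease, to show the OR landing close to the relative risk, which is the approximation mentioned for case-control studies. The cell labels a, b, c, d follow a common textbook layout and are an assumption about the slide's notation.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2x2 table.

    a = exposed with disease,   b = exposed without disease
    c = unexposed with disease, d = unexposed without disease
    """
    return (a * d) / (b * c)

# HYPOTHETICAL counts for a rare disease (not the slide's data)
a, b, c, d = 40, 960, 20, 980

or_estimate = odds_ratio(a, b, c, d)
rr_estimate = (a / (a + b)) / (c / (c + d))   # relative risk from the same table

print(f"OR = {or_estimate:.2f}, RR = {rr_estimate:.2f}")
```

The two measures drift apart as the disease becomes more common, because the odds p/(1 - p) stop being close to the risk p.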
43Properties of the OR (simulation)
44Properties of the lnOR
Standard deviation
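The formula on this slide did not survive transcription. The standard large-sample result is that ln(OR) is approximately normal with standard error sqrt(1/a + 1/b + 1/c + 1/d), which gives a confidence interval on the log scale that is then exponentiated. The sketch below assumes that is the formula intended; the counts are hypothetical and `odds_ratio_ci` is an illustrative name.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio with a large-sample confidence interval built on ln(OR)."""
    or_estimate = (a * d) / (b * c)
    # Approximate standard error of ln(OR)
    se_ln_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Two-sided critical value, about 1.96 for a 95% interval
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(or_estimate) - z_crit * se_ln_or)
    upper = exp(log(or_estimate) + z_crit * se_ln_or)
    return or_estimate, se_ln_or, lower, upper

# HYPOTHETICAL 2x2 counts (not the slide's data)
or_estimate, se_ln_or, lower, upper = odds_ratio_ci(40, 960, 20, 980)
print(f"OR = {or_estimate:.2f}, SE(ln OR) = {se_ln_or:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Working on the log scale is the usual design choice because the OR itself is right-skewed, so the resulting interval is asymmetric around the estimate.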
45Hypothetical Data
30
30
46Example: Cell phones and brain tumors (cross-sectional data)
47Same data, but use Chi-square test or Fisher's exact
48Same data, but use Odds Ratio