Title: Categorical Data Analysis Week 1 April 18
1. Categorical Data Analysis: Week 1 (April 18, April 20)
- Dingcai Cao
- d-cao_at_uchicago.edu
2. Categorical Data Analysis
- Textbook: Introduction to Categorical Data Analysis, by Alan Agresti
- Recommended reading: Categorical Data Analysis Using the SAS System, by Maura E. Stokes, Charles S. Davis, and Gary G. Koch
- Office hours: By appointment or after class
3. Scales of Measurement
- Nominal: identity
- Ordinal: identity + magnitude
- Interval: identity + magnitude + equal distance
- Ratio: identity + magnitude + equal distance + absolute (true) zero
4. Categorical data analysis strategies
- Hypothesis testing: Is there an association?
  - Chi-square test, Fisher's exact test, etc.
  - Chapters 1, 2, 3
- Modeling: What is the nature of the association?
  - Logistic regression, log-linear modeling
  - Chapters 4, 5, 6
5. Two-way contingency tables
- Contingency table: A table with cells containing frequency counts of combinations of different levels of two or more categorical variables.
- Two-way table: A contingency table that cross-classifies two variables.
6. The 2x2 Table
- Research question: Is one sex more likely than the other to believe in an afterlife?
- Statistical question: Is belief in an afterlife independent of gender?
7. Probabilities for contingency tables
- Joint probability: $\pi_{ij} = P(X = i, Y = j)$, the probability that $(X, Y)$ falls in the cell in row $i$ and column $j$.
- Sample joint probability: $p_{ij} = n_{ij}/n$.
- For the afterlife data, $n = 435 + 147 + 375 + 134 = 1091$.
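8. Probabilities for contingency tables
- Marginal probability: $\pi_{i+}$ or $\pi_{+j}$, the row or column totals of the joint probabilities.
- Sample marginal probability: $p_{i+}$ or $p_{+j}$, the row or column totals of the sample joint probabilities.
- For the afterlife data, $n = 435 + 147 + 375 + 134 = 1091$.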
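As a minimal sketch of the joint and marginal definitions above, the following Python computes the sample proportions from the four afterlife counts. The variable names are illustrative and not part of the slides.

```python
# Counts from the belief-in-afterlife table (rows: gender, columns: belief).
counts = {
    ("Female", "Yes"): 435, ("Female", "No"): 147,
    ("Male", "Yes"): 375, ("Male", "No"): 134,
}
n = sum(counts.values())  # total sample size

# Sample joint proportions p_ij = n_ij / n.
joint = {cell: c / n for cell, c in counts.items()}

# Sample marginal proportions: sum the joint proportions over the other variable.
row_marginal = {}
col_marginal = {}
for (gender, belief), p in joint.items():
    row_marginal[gender] = row_marginal.get(gender, 0.0) + p
    col_marginal[belief] = col_marginal.get(belief, 0.0) + p

print("n =", n)
print("joint:", joint)
print("row marginals:", row_marginal)
print("column marginals:", col_marginal)
```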
10. Probabilities for contingency tables
- Conditional probability: Probability of Y at each level of X, or probability of X at each level of Y.
- $P(\text{Gender} = \text{Female} \mid Y = \text{Yes}) = 435/(435 + 375) = 0.54$ (column probability)
- $P(Y = \text{Yes} \mid \text{Gender} = \text{Female}) = 435/(435 + 147) = 0.75$ (row probability)
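The two conditional probabilities can be checked with a few lines of Python. This is only a sketch; the variable names are made up for illustration.

```python
# Afterlife counts: female/male by yes/no.
n_fy, n_fn, n_my, n_mn = 435, 147, 375, 134

# Column probability: share of females among those who answered Yes.
p_female_given_yes = n_fy / (n_fy + n_my)

# Row probability: share of Yes answers among females.
p_yes_given_female = n_fy / (n_fy + n_fn)

print(round(p_female_given_yes, 2))  # 0.54
print(round(p_yes_given_female, 2))  # 0.75
```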
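11. Playing with SAS
/* Afterlife data: one record per cell of the 2x2 table */
DATA BELIEF;
  INPUT GENDER $ BELIEF $ COUNT;
  DATALINES;
FEMALE YES 435
FEMALE NO 147
MALE YES 375
MALE NO 134
;
/* WEIGHT tells PROC FREQ that COUNT holds cell frequencies */
PROC FREQ DATA=BELIEF;
  WEIGHT COUNT;
  TABLES GENDER*BELIEF;
RUN;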
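12. COMPARING PROPORTIONS
- Difference of proportions: $\pi_1 - \pi_2$
- Sample difference of proportions: $p_1 - p_2$
- Standard error: $SE = \sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2}$, where $n_1$ and $n_2$ are the two group sample sizes
- Confidence interval: $(p_1 - p_2) \pm z_{\alpha/2}\, SE$
- Better to use when the proportions are not close to 0 or 1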
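A minimal sketch of the difference of proportions for the afterlife data, assuming females are group 1 and males group 2, and using the usual large-sample (Wald) 95% interval with z = 1.96. These choices are assumptions for illustration, not taken from the slide.

```python
from math import sqrt

# Group 1 = females, group 2 = males (an assumed ordering).
yes1, n1 = 435, 435 + 147
yes2, n2 = 375, 375 + 134

p1 = yes1 / n1
p2 = yes2 / n2
diff = p1 - p2

# Large-sample standard error of p1 - p2.
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = 1.96  # approximate 97.5th percentile of the standard normal
lower, upper = diff - z * se, diff + z * se

print(f"p1 - p2 = {diff:.4f}, SE = {se:.4f}")
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```

The interval covers 0, so on this scale the data do not separate the two groups.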
13. COMPARING PROPORTIONS
- Relative risk: $\pi_1/\pi_2$
- Sample relative risk: $p_1/p_2$
- Standard error: too complicated to talk about
- Confidence interval: too complicated. Rely on SAS to do the computation.
- Better to use when the proportions are near 0 or 1
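The slide leaves the interval to SAS. For readers who want to see one common approach, the sketch below computes the sample relative risk and a large-sample interval built on the log scale. The log-scale standard error used here is a standard textbook approximation and is an assumption; the slide does not say which method SAS reports.

```python
from math import exp, log, sqrt

# Group 1 = females, group 2 = males (an assumed ordering).
yes1, n1 = 435, 435 + 147
yes2, n2 = 375, 375 + 134

p1, p2 = yes1 / n1, yes2 / n2
rr = p1 / p2

# Approximate standard error of log(rr) for two independent binomial samples.
se_log_rr = sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))

z = 1.96  # approximate 97.5th percentile of the standard normal
rr_lower = exp(log(rr) - z * se_log_rr)
rr_upper = exp(log(rr) + z * se_log_rr)

print(f"relative risk = {rr:.4f}")
print(f"95% CI: ({rr_lower:.4f}, {rr_upper:.4f})")
```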
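14. COMPARING PROPORTIONS
- Odds: $\text{odds}_1 = \pi_1/(1 - \pi_1)$, $\text{odds}_2 = \pi_2/(1 - \pi_2)$
- Odds ratio: $\theta = \text{odds}_1/\text{odds}_2 = \dfrac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)}$
- Sample odds ratio: $\hat{\theta} = \dfrac{n_{11} n_{22}}{n_{12} n_{21}}$
- Confidence interval: $\log\hat{\theta} \pm z_{\alpha/2}\sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$, then exponentiate the endpoints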
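A short sketch of the sample odds ratio for the afterlife data, with the usual large-sample interval computed on the log scale. The cell labels and z = 1.96 are illustrative assumptions.

```python
from math import exp, log, sqrt

# 2x2 cell counts: rows female/male, columns yes/no.
n11, n12 = 435, 147
n21, n22 = 375, 134

# Sample odds ratio as the cross-product ratio.
odds_ratio = (n11 * n22) / (n12 * n21)

# Large-sample standard error of the log odds ratio.
se_log_or = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

z = 1.96  # approximate 97.5th percentile of the standard normal
or_lower = exp(log(odds_ratio) - z * se_log_or)
or_upper = exp(log(odds_ratio) + z * se_log_or)

print(f"odds ratio = {odds_ratio:.4f}")
print(f"95% CI: ({or_lower:.4f}, {or_upper:.4f})")
```

An odds ratio of 1 corresponds to independence, so an interval containing 1 is consistent with no association.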
15. Playing with SAS
DATA BELIEF;
  INPUT GENDER $ BELIEF $ COUNT;
  DATALINES;
FEMALE YES 435
FEMALE NO 147
MALE YES 375
MALE NO 134
;
/* RISKDIFF: difference of proportions; MEASURES: odds ratio and relative risks */
PROC FREQ DATA=BELIEF;
  WEIGHT COUNT;
  TABLES GENDER*BELIEF / RISKDIFF MEASURES;
RUN;
16-18. Playing with SAS
[SAS output slides, including the relative risk estimates; output not reproduced]
19. TESTS OF INDEPENDENCE
Independence: Two variables are said to be statistically independent if the conditional distributions of Y are identical at each level of X. Equivalently, all joint probabilities equal the product of their marginal probabilities: $\pi_{ij} = \pi_{i+}\pi_{+j}$ for $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$.
- Test of independence
  - $H_0$: $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$
  - $H_1$: $\pi_{ij} \neq \pi_{i+}\pi_{+j}$ for at least one pair $(i, j)$
- Pearson Chi-square test
- Likelihood-ratio test
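20. TESTS OF INDEPENDENCE: Pearson Chi-Square test
                   Y
  X          Level 1   Level 2   Total
  Level 1    n11       n12       n1+
  Level 2    n21       n22       n2+
  Total      n+1       n+2       n
- Estimated expected frequencies under independence: $\hat{\mu}_{ij} = n_{i+} n_{+j}/n$
- Test statistic: $X^2 = \sum_{i,j} (n_{ij} - \hat{\mu}_{ij})^2 / \hat{\mu}_{ij}$, approximately chi-squared with $df = (I - 1)(J - 1)$ for large samples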
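As a sketch of how the Pearson statistic is assembled from the table, the code below computes the expected counts under independence and $X^2$ for the afterlife data. The p-value line uses the fact that a chi-squared variable with 1 degree of freedom is a squared standard normal, so it applies to 2x2 tables only.

```python
from math import erfc, sqrt

# Afterlife table: rows female/male, columns yes/no.
table = [[435, 147],
         [375, 134]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected counts under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic.
x2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))

# Upper-tail probability for df = 1.
p_value = erfc(sqrt(x2 / 2))

print(f"X^2 = {x2:.3f}, p = {p_value:.3f}")
```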
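21. TESTS OF INDEPENDENCE: Likelihood-ratio test
                   Y
  X          Level 1   Level 2   Total
  Level 1    n11       n12       n1+
  Level 2    n21       n22       n2+
  Total      n+1       n+2       n
- Estimated expected frequencies under independence: $\hat{\mu}_{ij} = n_{i+} n_{+j}/n$
- Test statistic: $G^2 = 2\sum_{i,j} n_{ij}\log(n_{ij}/\hat{\mu}_{ij})$, approximately chi-squared with $df = (I - 1)(J - 1)$ for large samples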
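The likelihood-ratio statistic uses the same expected counts as the Pearson test. A minimal sketch for the afterlife data follows, assuming all observed counts are positive so that the logarithm is defined.

```python
from math import log

# Afterlife table: rows female/male, columns yes/no.
table = [[435, 147],
         [375, 134]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Likelihood-ratio statistic: 2 * sum of observed * log(observed / expected).
g2 = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        mu_hat = r * c / n  # expected count under independence
        g2 += 2 * table[i][j] * log(table[i][j] / mu_hat)

print(f"G^2 = {g2:.3f}")
```

For these data $G^2$ comes out very close to the Pearson $X^2$, as is typical when the sample is large and the association is weak.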
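22. Playing with SAS
DATA BELIEF;
  INPUT GENDER $ BELIEF $ COUNT;
  DATALINES;
FEMALE YES 435
FEMALE NO 147
MALE YES 375
MALE NO 134
;
/* CHISQ: Pearson and likelihood-ratio tests; NOCOL NOROW NOPCT suppress percentages */
PROC FREQ DATA=BELIEF;
  WEIGHT COUNT;
  TABLES GENDER*BELIEF / CHISQ NOCOL NOROW NOPCT;
RUN;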
24. TESTS OF INDEPENDENCE: Fisher's exact test
What if the large-sample conditions for the chi-squared tests are not satisfied? Use Fisher's exact test.
25. TESTS OF INDEPENDENCE: Fisher's exact test
Fisher's exact test relies on the hypergeometric distribution. For a 2x2 table with an odds ratio of 1 (the independence null hypothesis) and fixed row and column totals, the probability of a given count $n_{11}$ is
$P(n_{11}) = \dfrac{\binom{n_{1+}}{n_{11}}\binom{n_{2+}}{n_{+1} - n_{11}}}{\binom{n}{n_{+1}}}$
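To make the hypergeometric calculation concrete, here is a sketch for the tea-tasting counts (3, 1, 1, 3). The function name is hypothetical. The two-sided p-value sums the probabilities of all tables no more likely than the observed one, which is one common convention and an assumption here.

```python
from math import comb

def hypergeom_prob(n11, row1, row2, col1):
    """P(n11) for a 2x2 table with fixed margins under independence."""
    return comb(row1, n11) * comb(row2, col1 - n11) / comb(row1 + row2, col1)

# Tea-tasting margins: 4 cups poured milk first, 4 tea first, 4 guessed "milk".
row1, row2, col1 = 4, 4, 4
observed = 3  # milk-first cups correctly guessed as milk first

# All values of n11 compatible with the margins.
support = range(max(0, col1 - row2), min(row1, col1) + 1)
probs = {k: hypergeom_prob(k, row1, row2, col1) for k in support}

# One-sided p-value: tables at least as favorable as the observed one.
p_one_sided = sum(p for k, p in probs.items() if k >= observed)

# Two-sided p-value: tables no more probable than the observed one.
p_obs = probs[observed]
p_two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```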
26. Playing with SAS
/* Fisher Tea Taster Data */
DATA TEA;
  INPUT POUR $ GUESS $ COUNT;
  DATALINES;
MILK MILK 3
MILK TEA 1
TEA MILK 1
TEA TEA 3
;
/* For a 2x2 table, CHISQ also reports Fisher's exact test */
PROC FREQ DATA=TEA;
  WEIGHT COUNT;
  TABLES POUR*GUESS / CHISQ NOCOL NOROW NOPCT;
RUN;
28. TESTING INDEPENDENCE FOR ORDINAL DATA
The $X^2$ and $G^2$ tests treat both classifications as nominal. What if the rows or the columns are ordinal?
LINEAR TREND OR CORRELATION TEST
The idea is to calculate the Pearson correlation, $r$, based on the scores assigned to the row and column categories. A statistic for testing the null hypothesis of independence against the two-sided alternative hypothesis of nonzero true correlation is given by $M^2 = (n - 1) r^2$. $M^2$ has approximately a chi-squared distribution with $df = 1$.
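The following sketch computes $r$ and $M^2$ for the infant malformation counts that appear in the SAS program later in the deck, with row scores 1 and 2 and the alcohol scores 0, 0.5, 1.5, 4.0, 7.0 from the same data step. The variable names are illustrative. The p-value line relies on $df = 1$.

```python
from math import erfc, sqrt

# Rows: malformation (1, 2); columns: alcohol score.
counts = [[17066, 14464, 788, 126, 37],
          [48, 38, 5, 1, 1]]
row_scores = [1, 2]
col_scores = [0, 0.5, 1.5, 4.0, 7.0]

n = sum(sum(row) for row in counts)

# Weighted means of the row and column scores.
mean_u = sum(u * sum(row) for u, row in zip(row_scores, counts)) / n
mean_v = sum(v * sum(col) for v, col in zip(col_scores, zip(*counts))) / n

# Weighted sums of squares and cross-products.
s_uu = s_vv = s_uv = 0.0
for i, u in enumerate(row_scores):
    for j, v in enumerate(col_scores):
        c = counts[i][j]
        s_uu += c * (u - mean_u) ** 2
        s_vv += c * (v - mean_v) ** 2
        s_uv += c * (u - mean_u) * (v - mean_v)

r = s_uv / sqrt(s_uu * s_vv)   # Pearson correlation of the scores
m2 = (n - 1) * r ** 2          # linear trend statistic
p_value = erfc(sqrt(m2 / 2))   # chi-squared upper tail, df = 1

print(f"n = {n}, r = {r:.4f}, M^2 = {m2:.2f}, p = {p_value:.4f}")
```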
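29. TESTING INDEPENDENCE FOR ORDINAL DATA
LINEAR TREND OR CORRELATION TEST
Choice of scores: midranks. A midrank is the average rank of all subjects in a category.
For variable X, with row totals $n_{1+}, \ldots, n_{I+}$:
- Midrank of Level 1: $(1 + n_{1+})/2$
- Midrank of Level 2: $n_{1+} + (1 + n_{2+})/2$
- Midrank of Level I: $(n_{1+} + \cdots + n_{I-1,+}) + (1 + n_{I+})/2$
For variable Y, with column totals $n_{+1}, \ldots, n_{+J}$:
- Midrank of Level 1: $(1 + n_{+1})/2$
- Midrank of Level 2: $n_{+1} + (1 + n_{+2})/2$
- Midrank of Level J: $(n_{+1} + \cdots + n_{+,J-1}) + (1 + n_{+J})/2$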
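A small sketch of the midrank calculation, applied to the column totals of the infant malformation data used in the SAS program that follows. The function name `midranks` is hypothetical.

```python
def midranks(totals):
    """Average rank of the subjects in each ordered category."""
    scores = []
    below = 0  # number of subjects in all lower categories
    for t in totals:
        # Ranks in this category run from below + 1 to below + t.
        scores.append(below + (1 + t) / 2)
        below += t
    return scores

# Cell counts by alcohol level for the two malformation rows.
row1 = [17066, 14464, 788, 126, 37]
row2 = [48, 38, 5, 1, 1]
alcohol_totals = [a + b for a, b in zip(row1, row2)]

alcohol_midranks = midranks(alcohol_totals)
print(alcohol_totals)
print(alcohol_midranks)
```

Midranks depend only on the category totals, so the top three alcohol levels, which hold few subjects, receive nearly identical scores.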
30. Playing with SAS
/* @@ lets several observations share one data line */
DATA INFANTS;
  INPUT MALFORM ALCOHOL COUNT @@;
  DATALINES;
1 0 17066 1 0.5 14464 1 1.5 788 1 4.0 126 1 7.0 37
2 0 48 2 0.5 38 2 1.5 5 2 4.0 1 2 7.0 1
;
/* Linear trend test using the ALCOHOL values as column scores */
PROC FREQ DATA=INFANTS;
  TITLE "LINEAR TREND TEST EQ. 2.5.1";
  WEIGHT COUNT;
  TABLES MALFORM*ALCOHOL / CHISQ CMH1 ALL;
RUN;
/* Same test with midrank-type (ridit) scores */
PROC FREQ DATA=INFANTS;
  TITLE "LINEAR TREND TEST MIDRANK SCORE";
  WEIGHT COUNT;
  TABLES MALFORM*ALCOHOL / CMH1 SCORES=RIDIT;
RUN;