Title: Categorical Data Analysis
1Categorical Data Analysis and Logistic Regression
2Outline
- Two-way contingency tables: RR, odds ratio, chi-square tests
- Three-way contingency tables: conditional independence, homogeneous association, common odds ratio
- Logistic regression: dichotomous response
- Logistic regression: polytomous response
3First example: Aspirin and heart attacks
- Clinical trial: 2×2 table of aspirin use and MI (myocardial infarction)
- Test whether regular intake of aspirin reduces mortality from cardiovascular disease
- Prospective sampling design: cohort studies, clinical trials
- Data set

           Myocardial Infarction
Group      Yes      No        Total
Placebo    189      10,845    11,034
Aspirin    104      10,933    11,037
4Second example: Smoking and heart attacks
- Case-control study: 2×2 table of smoking status and MI
- Compare ever-smokers with nonsmokers in terms of the proportion who suffered MI
- Retrospective sampling design: case-control study, cross-sectional design
- Remark: observational studies vs. experimental studies
- Data set

Ever-Smoker    Myocardial Infarction    Controls
Yes            172                      173
No             90                       346
Total          262                      519
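5Comparing proportions in a 2×2 table
- Difference: $\pi_1 - \pi_2$
- Relative risk: $RR = \pi_1 / \pi_2$
  - Useful when both proportions are close to 0 or 1; in that case RR is more informative than the difference
- Response is independent of group when $\pi_1 - \pi_2 = 0$, equivalently $RR = 1$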
6Example (revisited)
- 1st example
  - $\hat\pi_1 - \hat\pi_2 = 0.0171 - 0.0094 = 0.0077$, 95% CI (0.005, 0.011)
  - Taking aspirin diminishes the risk of heart attack
  - $\widehat{RR} = 0.0171 / 0.0094 = 1.82$, 95% CI (1.43, 2.3)
  - Risk of MI is at least 43% higher for the placebo group
- 2nd example
  - $\hat\pi_1 - \hat\pi_2$ and $\widehat{RR}$: not estimable, meaningless even though they can be computed
  - Estimate proportions in the reverse direction
  - Proportion of smoking given MI status: $172/262 = 0.656$ (suffered MI), $173/519 = 0.333$ (did not suffer MI)
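The first-example figures can be checked with a short sketch. It assumes the usual large-sample (Wald) intervals, with the relative-risk interval built on the log scale; the function name `compare_proportions` is illustrative and not part of the slides.

```python
import math

def compare_proportions(x1, n1, x2, n2, z=1.96):
    """Difference of proportions and relative risk with large-sample (Wald) intervals."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rr = p1 / p2
    # The RR interval is built on the log scale and then exponentiated.
    se_log_rr = math.sqrt((1 - p1) / x1 + (1 - p2) / x2)
    return {
        "diff": diff,
        "diff_ci": (diff - z * se_diff, diff + z * se_diff),
        "rr": rr,
        "rr_ci": (rr * math.exp(-z * se_log_rr), rr * math.exp(z * se_log_rr)),
    }

# Aspirin study: MI counts out of group totals (placebo first, then aspirin).
aspirin = compare_proportions(189, 11034, 104, 11037)
for name, value in aspirin.items():
    print(name, value)
```

The printed values should agree with the slide's difference, relative risk, and intervals up to rounding.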
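7Association measure: odds ratio
- Defn: $\theta = \dfrac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$, the odds of success in row 1 divided by the odds of success in row 2
- Meaning
  - When two variables are independent, i.e., $\pi_1 = \pi_2$: $\theta = 1$
  - When odds of success (in row 1) > (in row 2): $\theta > 1$
  - When odds of success (in row 1) < (in row 2): $\theta < 1$
- Remark: when both variables are responses, $\theta = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}$ (called the cross-product ratio), using joint probabilities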
8Properties of odds ratio
- Values of $\theta$ farther from 1 in a given direction represent stronger association
- When one value is the inverse of the other, two values of $\theta$ represent the same strength of association, but in opposite directions
- Not changed when the table orientation reverses
  - Unnecessary to identify one classification as a response variable
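9Example (revisited)
- 1st example
  - $\hat\theta = \dfrac{189 \times 10933}{104 \times 10845} = 1.83$, 95% CI (1.44, 2.33)
  - Estimated odds of MI are 83% higher for the placebo group
- 2nd example
  - $\hat\theta = \dfrac{172 \times 346}{173 \times 90} = 3.8$
  - Rough estimate of RR: 3.8
  - Women who had ever smoked were about four times as likely to suffer MI as women who had never smoked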
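A minimal sketch of the odds-ratio calculation for both examples, assuming the standard large-sample interval on the log scale. The helper name `odds_ratio_ci` is illustrative.

```python
import math

def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio (cross-product ratio) with a Wald interval on the log scale."""
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, lower, upper

# Rows: placebo, aspirin; columns: MI yes, MI no.
aspirin_or = odds_ratio_ci(189, 10845, 104, 10933)
# Rows: ever-smoker yes, no; columns: MI cases, controls.
smoking_or = odds_ratio_ci(172, 173, 90, 346)

print("aspirin:", aspirin_or)
print("smoking:", smoking_or)
```

Because the odds ratio treats rows and columns symmetrically, the same function applies to the case-control table even though the relative risk is not directly estimable there.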
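10Independence tests
- Hypothesis: $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i, j$ (row and column variables are independent)
- Two chi-square tests
  - Under $H_0$, estimated expected frequency $\hat\mu_{ij} = n_{i+} n_{+j} / n$
  - Pearson's $X^2 = \sum_{i,j} (n_{ij} - \hat\mu_{ij})^2 / \hat\mu_{ij}$
  - Likelihood ratio (LR) statistic $G^2 = 2 \sum_{i,j} n_{ij} \log(n_{ij} / \hat\mu_{ij})$
  - For a large sample, $X^2$ and $G^2$ follow a chi-squared null distribution with $df = (I-1)(J-1)$
- Remark: when the expected frequencies are large enough (a common rule of thumb is $\hat\mu_{ij} \ge 5$), the chi-squared approximation is good. If not, apply Fisher's exact test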
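As an illustration, the two statistics can be computed for the aspirin table in plain Python. This is a sketch under stated assumptions: all cell counts are positive, and the p-value helper `chi2_sf_df1` (an illustrative name) is valid only for $df = 1$, where the chi-squared tail probability reduces to a complementary error function.

```python
import math

def independence_tests(table):
    """Pearson X^2 and likelihood-ratio G^2 for a two-way table of positive counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # estimated expected frequency under H0
            x2 += (obs - expected) ** 2 / expected
            g2 += 2 * obs * math.log(obs / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

aspirin_table = [[189, 10845], [104, 10933]]
x2, g2, df = independence_tests(aspirin_table)
print(x2, g2, df, chi2_sf_df1(x2), chi2_sf_df1(g2))
```

For tables larger than 2×2 the statistics are computed the same way, but the p-value would need a general chi-squared distribution function from a statistics library.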
11Example: AZT use and AIDS
- Development of AIDS symptoms by AZT use and race
- Study on the effects of AZT in slowing the development of AIDS symptoms
- Data set

                    Symptoms
Race     AZT Use    Yes    No    Total
White    Yes        14     93    107
         No         32     81    113
Black    Yes        11     52    63
         No         12     43    55
12Three questions of interest in a three-way table
- Conditional independence? When controlling for race, are AZT treatment and development of AIDS symptoms independent?
  - Use Cochran-Mantel-Haenszel (CMH) test
  - Summarizes the information from the partial tables
- Homogeneous association? Are the odds ratios between AZT treatment and development of AIDS symptoms the same for each race?
  - Use Breslow-Day test
- Common odds ratio? Use Mantel-Haenszel estimate
13Example (AZT use and AIDS revisited)
- CMH = 6.8 (df = 1) with $p$-value = 0.0091
  - Not independent!
- Breslow-Day = 1.39 (df = 1) with $p$-value = 0.2384
  - Homogeneous association!
- Common odds ratio = 0.49
  - For each race, estimated odds of developing symptoms are half as high for those who took AZT
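The CMH statistic and the Mantel-Haenszel estimate can be sketched directly from the two partial tables. This version assumes no continuity correction; the function name `cmh_and_mh` is illustrative. Results should land close to the reported 6.8 and 0.49, though small differences from rounding or software conventions are possible. The Breslow-Day test is not covered here.

```python
import math

def cmh_and_mh(strata):
    """CMH statistic (no continuity correction) and Mantel-Haenszel common odds ratio.

    strata: list of 2x2 tables [[a, b], [c, d]], one per level of the control variable.
    """
    deviation = variance = 0.0
    mh_num = mh_den = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        deviation += a - row1 * col1 / n                      # observed minus expected count
        variance += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
        mh_num += a * d / n
        mh_den += b * c / n
    return deviation ** 2 / variance, mh_num / mh_den

# Rows: AZT yes, AZT no; columns: symptoms yes, symptoms no.
white = [[14, 93], [32, 81]]
black = [[11, 52], [12, 43]]
cmh, common_or = cmh_and_mh([white, black])
p_value = math.erfc(math.sqrt(cmh / 2))  # chi-squared tail probability, df = 1
print(cmh, p_value, common_or)
```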
14Overview of types of generalized linear models (GLMs)
- Three components: random component (response variable), linear predictor (linear combination of covariates), link function
- Types of GLMs

Random Component    Link                 Systematic Component    Model
Normal              Identity             Continuous              Regression
Normal              Identity             Categorical             Analysis of variance
Normal              Identity             Mixed                   Analysis of covariance
Binomial            Logit                Mixed                   Logistic regression
Poisson             Log                  Mixed                   Loglinear
Multinomial         Generalized logit    Mixed                   Multinomial response
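15Logistic regression with a quantitative covariate
- Model: $\mathrm{logit}[\pi(x)] = \log\dfrac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x$
- Another representation: $\pi(x) = \dfrac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
- Odds: $\dfrac{\pi(x)}{1 - \pi(x)} = \exp(\alpha + \beta x) = e^{\alpha}(e^{\beta})^{x}$
  - Odds at level $x + 1$ equals the odds at $x$ multiplied by $e^{\beta}$
- Curve ascends ($\beta > 0$) or descends ($\beta < 0$)
- The rate of change increases as $|\beta|$ increases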
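The multiplicative effect on the odds can be checked numerically. The values of `alpha` and `beta` below are hypothetical and chosen only for illustration; they are not estimates from any data set in these slides.

```python
import math

def success_prob(x, alpha, beta):
    """pi(x) under the logistic model logit[pi(x)] = alpha + beta * x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(x, alpha, beta):
    p = success_prob(x, alpha, beta)
    return p / (1.0 - p)

alpha, beta = -3.0, 0.5  # hypothetical parameter values
# A one-unit increase in x multiplies the odds by exp(beta), whatever the starting x.
ratio_low = odds(1.0, alpha, beta) / odds(0.0, alpha, beta)
ratio_high = odds(9.0, alpha, beta) / odds(8.0, alpha, beta)
print(ratio_low, ratio_high, math.exp(beta))
```

The odds ratio per unit of $x$ is constant, while the change in probability per unit of $x$ is not, which is why the effect is usually reported on the odds scale.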
16Example: Horseshoe crabs
- Binary response: $Y = 1$ if a female crab has at least one satellite, $Y = 0$ otherwise
- Covariate: female crab's width (cm)
- Data set

Width (cm)     Number of Cases    Number Having Satellites
< 23.25        14                 5
23.25-24.25    14                 4
24.25-25.25    28                 17
25.25-26.25    39                 21
26.25-27.25    22                 15
27.25-28.25    24                 20
28.25-29.25    18                 15
> 29.25        14                 14
17Example: Horseshoe crabs
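18Goodness-of-fit tests
- Working model $M$: $N$ = number of settings, $p$ = number of parameters in $M$
- Hypothesis: $H_0$: $M$ fits the data
- Pearson's statistic: $X^2 = \sum (\text{observed} - \text{fitted})^2 / \text{fitted}$
- Deviance statistic: $G^2 = 2 \sum \text{observed} \times \log(\text{observed} / \text{fitted})$
- $X^2$ and $G^2$ approximately follow a chi-square null distribution with $df = N - p$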
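19Inference for parameters
- Interval estimation: $\hat\beta \pm z_{\alpha/2}\, SE$
- Two significance tests of $H_0: \beta = 0$
  - Wald test: use $z^2 = (\hat\beta / SE)^2$
  - Likelihood ratio test: use $-2(L_0 - L_1)$, where $L_0$ and $L_1$ are the maximized values of the log-likelihood function under $H_0$ and under the fitted model
  - Both tests have a large-sample chi-squared null distribution with $df = 1$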
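20Example (Horseshoe crabs revisited)
- Fitted model: $\mathrm{logit}[\hat\pi(x)] = \hat\alpha + \hat\beta x$ with $\hat\beta = 0.497$
  - $\hat\pi(x)$ is larger at larger width ($\hat\beta > 0$)
  - There is a 64% increase in estimated odds of a satellite for each centimeter increase in width ($e^{\hat\beta} = e^{0.497} = 1.64$)
- Goodness of fit: Pearson $X^2$ with $p$-value = 0.506; deviance $G^2$ with $p$-value = 0.4012
- 95% CI for $\beta$: (0.298, 0.697)
- Significance test: Wald = 23.9 (df = 1) with $p$-value < 0.0001; LRT = 31.3 (df = 1) with $p$-value < 0.0001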
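The reported numbers are mutually consistent, which a short sketch can confirm. It assumes the 95% interval for $\beta$ is a symmetric Wald interval, so the estimate is its midpoint and the standard error is its half-width divided by 1.96; this is an assumption about how the interval was produced, not something the slide states.

```python
import math

ci_low, ci_high = 0.298, 0.697  # reported 95% CI for beta
z = 1.96

beta_hat = (ci_low + ci_high) / 2        # midpoint of a symmetric Wald interval
se = (ci_high - ci_low) / (2 * z)        # half-width divided by z
wald = (beta_hat / se) ** 2              # Wald chi-squared statistic, df = 1
odds_multiplier = math.exp(beta_hat)     # multiplicative change in odds per cm of width
odds_ci = (math.exp(ci_low), math.exp(ci_high))

print(beta_hat, se, wald, odds_multiplier, odds_ci)
```

Exponentiating the interval endpoints gives an interval for the per-centimeter odds multiplier.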
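21Logistic regression with qualitative predictors: AIDS symptoms data
- Model: $\mathrm{logit}[P(Y = 1)] = \alpha + \beta_1 x + \beta_2 z$, where $x = 1$ for AZT use (0 otherwise) and $z = 1$ for white subjects (0 for black subjects)
- Use indicator variables for representing categories of predictors
- Logits implied by indicator variables

x    z    Logit
0    0    $\alpha$
1    0    $\alpha + \beta_1$
0    1    $\alpha + \beta_2$
1    1    $\alpha + \beta_1 + \beta_2$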
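22Logistic regression with qualitative predictors: AIDS symptoms data
- $\beta_1$: difference between two logits (i.e., log of odds ratio) at a fixed category of $z$
  - $[\alpha + \beta_1 + \beta_2 z] - [\alpha + \beta_2 z] = \beta_1$
- Homogeneous association model: the odds ratio $e^{\beta_1}$ is the same at each category of $z$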
23Equivalence of contingency table analysis and logistic regression
- Conditional independence: CMH test vs. test of $H_0: \beta_1 = 0$
- Homogeneous association: Breslow-Day test vs. goodness-of-fit test
- Common odds ratio estimate: Mantel-Haenszel estimate vs. $\exp(\hat\beta_1)$
24Computer Output for Model with AIDS Symptoms Data
Log Likelihood    -167.5756

Analysis of Maximum Likelihood Estimates
Parameter    Estimate    Std Error    Wald Chi-Square    Pr > ChiSq
Intercept    -1.0736     0.2629       16.6705            <.0001
azt          -0.7195     0.2790       6.6507             0.0099
race         0.0555      0.2886       0.0370             0.8476

LR Statistics
Source    DF    Chi-Square    Pr > ChiSq
azt       1     6.87          0.0088
race      1     0.04          0.8473

Obs    race    azt    y     n      pi_hat     lower      upper
1      1       1      14    107    0.14962    0.09897    0.21987
2      1       0      32    113    0.26540    0.19668    0.34774
3      0       1      11    63     0.14270    0.08704    0.22519
4      0       0      12    55     0.25472    0.16953    0.36396
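The `pi_hat` column can be reproduced from the three parameter estimates. The sketch below plugs the reported coefficients into the inverse logit; the dictionary and function names are illustrative. The exponentiated `azt` coefficient is also printed, since it plays the role of the common odds ratio.

```python
import math

# Estimates copied from the output above (main-effects model, no interaction).
coef = {"intercept": -1.0736, "azt": -0.7195, "race": 0.0555}

def fitted_prob(azt, race):
    """Estimated probability of developing symptoms for given indicator values."""
    eta = coef["intercept"] + coef["azt"] * azt + coef["race"] * race
    return 1.0 / (1.0 + math.exp(-eta))

# Keys are (race, azt), matching the column order of the listing.
fitted = {(race, azt): fitted_prob(azt, race) for race in (1, 0) for azt in (1, 0)}
azt_odds_ratio = math.exp(coef["azt"])

for key, value in fitted.items():
    print(key, round(value, 5))
print("exp(azt coefficient):", azt_odds_ratio)
```

The exponentiated coefficient should come out near 0.49, in line with the Mantel-Haenszel estimate reported earlier.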
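25Logistic regression with mixed predictors: Horseshoe crabs data
- Model: $\mathrm{logit}(\pi) = \alpha + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x$, where $c_1, c_2, c_3$ are color indicators and $x$ is width
- For color = medium light, $\mathrm{logit}(\pi) = (\alpha + \beta_1) + \beta_4 x$
- For color = medium, $\mathrm{logit}(\pi) = (\alpha + \beta_2) + \beta_4 x$
- For color = medium dark, $\mathrm{logit}(\pi) = (\alpha + \beta_3) + \beta_4 x$
- For the baseline color ($c_1 = c_2 = c_3 = 0$), $\mathrm{logit}(\pi) = \alpha + \beta_4 x$; color effects are interpreted controlling for width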
26Computer Output for Model for Horseshoe Crabs Data
                                       Likelihood Ratio
                                       95% Confidence Limits
Parameter    Estimate    Std. Error    Lower       Upper      Chi-Square    Pr > ChiSq
intercept    -12.7151    2.7618        -18.4564    -7.5788    21.20         <.0001
c1           1.3299      0.8525        -0.2738     3.1354     2.43          0.1188
c2           1.4023      0.5484        0.3527      2.5260     6.54          0.0106
c3           1.1061      0.5921        -0.0279     2.3138     3.49          0.0617
width        0.4680      0.1055        0.2713      0.6870     19.66         <.0001

LR Statistics
Source    DF    Chi-Square    Pr > ChiSq
width     1     26.40         <.0001
color     3     7.00          0.0720
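To see what these estimates imply, the sketch below evaluates the fitted probability of a satellite for each color at one width. The width of 26.0 cm is a hypothetical value picked for illustration, and `satellite_prob` is an illustrative helper, not part of the original output.

```python
import math

# Estimates copied from the output above.
coef = {"intercept": -12.7151, "c1": 1.3299, "c2": 1.4023, "c3": 1.1061, "width": 0.4680}

def satellite_prob(width, color=None):
    """Fitted probability; color is 'c1', 'c2', 'c3', or None for the baseline color."""
    eta = coef["intercept"] + coef["width"] * width
    if color is not None:
        eta += coef[color]
    return 1.0 / (1.0 + math.exp(-eta))

width = 26.0  # hypothetical width in cm
probs = {color: satellite_prob(width, color) for color in ("c1", "c2", "c3", None)}
# Odds ratio comparing color c1 with the baseline color at any fixed width.
c1_vs_baseline = math.exp(coef["c1"])

print(probs)
print(c1_vs_baseline)
```

Because the model has no color-by-width interaction, the color odds ratios do not depend on the width chosen.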
27Estimated probabilities for primary food choice
28Logistic regression: polytomous response
- Model categorical responses with more than two categories
- Two ways
  - Use generalized logits for a nominal response
  - Use cumulative logits for an ordinal response
- Notation
  - $J$ = number of categories
  - $\{\pi_1, \ldots, \pi_J\}$ = response probabilities with $\sum_j \pi_j = 1$
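29Generalized logit model: nominal response
- Baseline-category logit: pair each category with a baseline category
  - $\log(\pi_j / \pi_J)$, $j = 1, \ldots, J - 1$, when $J$ is the baseline
- Model with a predictor $x$
  - $\log(\pi_j / \pi_J) = \alpha_j + \beta_j x$, $j = 1, \ldots, J - 1$
  - The effects vary according to the category paired with the baseline
  - These $J - 1$ equations determine equations for all other pairs of categories
  - E.g., for a pair of categories $(a, b)$: $\log(\pi_a / \pi_b) = \log(\pi_a / \pi_J) - \log(\pi_b / \pi_J) = (\alpha_a - \alpha_b) + (\beta_a - \beta_b) x$
- Remark: the estimated effect for a given pair of categories is the same no matter which category is the baseline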
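A small sketch of how baseline-category logits translate into response probabilities, and how the logit for a non-baseline pair follows from the fitted equations. The coefficients below are hypothetical values for a three-category response; they are not estimates from any data set in these slides.

```python
import math

def baseline_category_probs(x, alphas, betas):
    """Response probabilities from J-1 baseline-category logits; the last category is the baseline."""
    exps = [math.exp(a + b * x) for a, b in zip(alphas, betas)]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps] + [1.0 / denom]

# Hypothetical coefficients for categories 1 and 2 versus baseline category 3.
alphas = [1.5, 5.0]
betas = [-0.1, -2.0]
x = 2.0

probs = baseline_category_probs(x, alphas, betas)
# Logit for the pair (1, 2), computed from the probabilities ...
log_odds_12 = math.log(probs[0] / probs[1])
# ... and implied by differencing the two baseline-category equations.
implied_12 = (alphas[0] - alphas[1]) + (betas[0] - betas[1]) * x

print(probs, log_odds_12, implied_12)
```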
30Example: Alligator food choice
- 59 alligators sampled in Lake George, Florida
- Response: primary food type found in the alligator's stomach
  - Fish (1), Invertebrate (2), Other (3, baseline category)
- Predictor: alligator length $x$, which varies from 1.24 to 3.89 m
- ML prediction equations
  - $\log(\hat\pi_1 / \hat\pi_3) = \hat\alpha_1 + \hat\beta_1 x$
  - $\log(\hat\pi_2 / \hat\pi_3) = \hat\alpha_2 + \hat\beta_2 x$
  - Larger alligators seem to select fish rather than invertebrates
- Independence test (food choice vs. length): LRT = 16.8006 (df = 2) with $p$-value = 0.0002
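31Cumulative logit model: ordinal response
- Logit of a cumulative probability
  - $\mathrm{logit}[P(Y \le j)] = \log\dfrac{P(Y \le j)}{1 - P(Y \le j)} = \log\dfrac{\pi_1 + \cdots + \pi_j}{\pi_{j+1} + \cdots + \pi_J}$, $j = 1, \ldots, J - 1$
  - Categories 1 to $j$ combined, categories $j + 1$ to $J$ combined
- Cumulative proportional odds model with a predictor $x$
  - $\mathrm{logit}[P(Y \le j)] = \alpha_j + \beta x$, $j = 1, \ldots, J - 1$
  - The effect of $x$ is identical for all $J - 1$ cumulative logits
  - Any one curve for $P(Y \le j)$ is identical to any of the others shifted to the right or to the left
  - For $x_1$ and $x_2$, the log of the odds ratio, $\log\dfrac{P(Y \le j \mid x_1) / P(Y > j \mid x_1)}{P(Y \le j \mid x_2) / P(Y > j \mid x_2)}$, is $\beta(x_1 - x_2)$
    - Proportional to the difference between $x$ values
    - Same for each cumulative probability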
32Example: Political ideology and party affiliation
- Response: political ideology on a five-point ordinal scale
- Predictor: political party (Democratic, Republican)

                   Political Ideology
Political Party    Very Liberal    Slightly Liberal    Moderate    Slightly Conservative    Very Conservative
Democratic         80              81                  171         41                       55
Republican         30              46                  148         84                       99
33Example: Political ideology and party affiliation
- Parameter inference
  - $\hat\beta = 0.975$, $SE = 0.129$
  - Democrats tend to be more liberal than Republicans
  - Wald = 57.1 (df = 1) with $p$-value < 0.0001: strong evidence of an association
  - 95% CI for $\beta$: (0.72, 1.23), or (2.1, 3.4) for the cumulative odds ratio $e^{\beta}$
  - Odds of a response in the liberal direction are at least twice as high for Democrats as for Republicans
- Goodness-of-fit
  - $p$-value = 0.2957: the model fits adequately
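The proportional odds assumption can be examined informally by computing the sample cumulative odds ratio at each of the four cut points of the ideology scale. This is a descriptive sketch rather than a model fit; the helper name `cumulative_odds` is illustrative, and the interval endpoints are the ones reported on the slide.

```python
import math

# Counts from the ideology table, ordered from very liberal to very conservative.
democratic = [80, 81, 171, 41, 55]
republican = [30, 46, 148, 84, 99]

def cumulative_odds(counts):
    """Sample odds of falling at or below each cut point j = 1, ..., J-1."""
    total = sum(counts)
    running = 0
    result = []
    for count in counts[:-1]:
        running += count
        result.append(running / (total - running))
    return result

# Democrat-to-Republican cumulative odds ratio at each cut point.
sample_ors = [d / r for d, r in zip(cumulative_odds(democratic), cumulative_odds(republican))]

ci_low, ci_high = 0.72, 1.23                       # reported 95% CI for beta
or_ci = (math.exp(ci_low), math.exp(ci_high))      # interval for the common cumulative odds ratio

print(sample_ors)
print(or_ci)
```

If the four sample ratios are of similar size, a single common effect is a reasonable summary, which is what the adequate goodness-of-fit result suggests.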
34Other logit forms for ordinal response categories
- Adjacent-categories logit
  - $\log(\pi_{j+1} / \pi_j)$, $j = 1, \ldots, J - 1$
  - Adjacent-categories logits determine the logits for all pairs of response categories
- Continuation-ratio logit
  - Form 1: $\log\dfrac{\pi_1 + \cdots + \pi_j}{\pi_{j+1}}$, $j = 1, \ldots, J - 1$
    - Contrast each category with a grouping of categories from lower levels of the response scale
  - Form 2: $\log\dfrac{\pi_j}{\pi_{j+1} + \cdots + \pi_J}$, $j = 1, \ldots, J - 1$
    - Contrast each category with a grouping of categories from higher levels of the response scale
35Summary
- Two-way contingency tables: RR, odds ratio, chi-square tests
- Three-way contingency tables: conditional independence, homogeneous association, common odds ratio
- Logistic regression: dichotomous response
- Logistic regression: polytomous response