Categorical Data Analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Categorical Data Analysis

Description:

Title: PowerPoint Presentation Last modified by: jkim Created Date: 1/1/1601 12:00:00 AM Document presentation format: Other titles – PowerPoint PPT presentation

Slides: 37
Provided by: www1Suwo

Transcript and Presenter's Notes

Title: Categorical Data Analysis


1
Categorical Data Analysis Logistic Regression

2
Outline
  • Two-way contingency tables: RR, odds ratio, chi-square tests
  • Three-way contingency tables: conditional independence, homogeneous association, common odds ratio
  • Logistic regression: dichotomous response
  • Logistic regression: polytomous response

3
First example: Aspirin & heart attacks
  • Clinical trial: 2×2 table of aspirin use and myocardial infarction (MI)
  • Test whether regular intake of aspirin reduces mortality from cardiovascular disease
  • Data set
  • Prospective sampling design: cohort studies, clinical trials

             Myocardial Infarction
Group        Yes      No        Total
Placebo      189      10,845    11,034
Aspirin      104      10,933    11,037
4
Second example: Smoking & heart attacks
  • Case-control study: 2×2 table of smoking status and MI
  • Compare ever-smokers with nonsmokers in terms of the proportion who suffered MI
  • Data set
  • Retrospective sampling design: case-control study, cross-sectional design
  • Remark: observational studies vs. experimental studies

Ever-Smoker    Myocardial Infarction    Controls
Yes            172                      173
No             90                       346
Total          262                      519
5
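Comparing proportions in a 2×2 table
  • Difference of proportions: $\pi_1 - \pi_2$
  • Relative risk: $RR = \pi_1 / \pi_2$
  • Useful when both proportions are close to 0 or 1
  • RR is more informative than the difference in that case
  • Response is independent of group when $\pi_1 = \pi_2$, i.e., $\pi_1 - \pi_2 = 0$ and $RR = 1$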

6
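Example (revisited)
  • 1st example
  • Difference of proportions: 0.0171 − 0.0094 = 0.0077, 95% CI (0.005, 0.011)
  • Taking aspirin appears to diminish the risk of heart attack
  • Relative risk: $RR = 0.0171 / 0.0094 = 1.82$, 95% CI (1.43, 2.30)
  • Risk of MI is at least 43% higher for the placebo group
  • 2nd example
  • Proportions of MI given smoking status: not estimable under this sampling design (meaningless even though they can be computed)
  • Estimate proportions in the reverse direction
  • Proportion of smoking given MI status: 172/262 = 0.656 (suffered MI), 173/519 = 0.333 (did not suffer MI)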
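
The aspirin figures on this slide can be reproduced from the 2×2 table. The sketch below assumes large-sample Wald intervals (the slide does not say which interval method was used), and the function name is only illustrative.

```python
import math

def risk_difference_and_ratio(y1, n1, y2, n2, z=1.96):
    """Difference of proportions and relative risk with Wald-type 95% CIs.

    Assumption: large-sample (Wald) intervals; the RR interval is built
    on the log scale and then exponentiated.
    """
    p1, p2 = y1 / n1, y2 / n2
    diff = p1 - p2
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rr = p1 / p2
    se_log_rr = math.sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
    return {
        "diff": diff,
        "diff_ci": (diff - z * se_diff, diff + z * se_diff),
        "rr": rr,
        "rr_ci": (rr * math.exp(-z * se_log_rr), rr * math.exp(z * se_log_rr)),
    }

# Placebo row first, aspirin row second (counts from the first example).
aspirin = risk_difference_and_ratio(189, 11034, 104, 11037)
print(aspirin)
```

Under these assumptions the intervals agree with the slide's values up to rounding.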

7
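Association measure: odds ratio
  • Defn: $\theta = \dfrac{\text{odds}_1}{\text{odds}_2} = \dfrac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$
  • Meaning
  • When two variables are independent, i.e., $\pi_1 = \pi_2$: $\theta = 1$
  • When odds of success (in row 1) > (in row 2): $\theta > 1$
  • When odds of success (in row 1) < (in row 2): $0 < \theta < 1$
  • Remark: When both variables are responses, $\theta = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}$ (called the cross-product ratio), using joint probabilities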

8
Properties of odds ratio
  • Values of $\theta$ farther from 1 in a given direction represent stronger association
  • When one value is the inverse of the other, the two values of $\theta$ represent the same strength of association, but in opposite directions
  • Not changed when the table orientation reverses
  • Unnecessary to identify one classification as a response variable

9
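Example (revisited)
  • 1st example
  • $\hat\theta = \dfrac{189 \times 10{,}933}{104 \times 10{,}845} = 1.83$, 95% CI (1.44, 2.33)
  • Estimated odds of MI are 83% higher for the placebo group
  • 2nd example
  • $\hat\theta = \dfrac{172 \times 346}{173 \times 90} = 3.8$
  • Rough estimate of RR: 3.8
  • Women who had ever smoked were about four times as likely to suffer MI as women who had never smoked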
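
As a check on the two estimates above, the sample odds ratio and a Wald interval on the log scale can be computed directly from the cell counts. This is a minimal sketch; the helper name is illustrative and the interval method is an assumption, since the slides do not state how the CI was obtained.

```python
import math

def odds_ratio(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio (cross-product ratio) with a Wald CI on the log scale."""
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    ci = (theta * math.exp(-z * se_log), theta * math.exp(z * se_log))
    return theta, ci

# Aspirin trial: rows = placebo, aspirin; columns = MI yes, MI no.
aspirin_or, aspirin_ci = odds_ratio(189, 10845, 104, 10933)
# Smoking study: rows = ever-smoker yes, no; columns = MI cases, controls.
smoking_or, smoking_ci = odds_ratio(172, 173, 90, 346)

print(round(aspirin_or, 2), [round(v, 2) for v in aspirin_ci])
print(round(smoking_or, 2), [round(v, 2) for v in smoking_ci])
```

Because the odds ratio does not depend on which variable is treated as the response, the same function applies to the case-control table even though the relative risk itself is not estimable there.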

10
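Independence tests
  • Hypothesis $H_0$: $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i, j$
  • Two chi-square tests
  • Under $H_0$, estimated expected frequency $\hat\mu_{ij} = n_{i+} n_{+j} / n$
  • Pearson's $X^2 = \sum_{i,j} \dfrac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$
  • Likelihood ratio (LR) statistic $G^2 = 2 \sum_{i,j} n_{ij} \log\dfrac{n_{ij}}{\hat\mu_{ij}}$
  • For a large sample, $X^2$ and $G^2$ follow a chi-squared null distribution with $df = (I - 1)(J - 1)$
  • Remark: When the expected frequencies are large enough (a common rule of thumb is $\hat\mu_{ij} \ge 5$), the chi-squared approximation is good. If not, apply Fisher's exact test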
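
The two statistics can be illustrated on the aspirin table from the first example. The sketch below uses only the standard library; the df = 1 tail probability uses the identity P(χ²₁ > x) = erfc(√(x/2)), and the function names are illustrative rather than taken from the slides.

```python
import math

def independence_tests(table):
    """Pearson X^2 and likelihood-ratio G^2 for an I x J table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # estimated expected frequency
            x2 += (obs - expected) ** 2 / expected
            if obs > 0:  # a zero cell contributes nothing to G^2
                g2 += 2 * obs * math.log(obs / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

x2, g2, df = independence_tests([[189, 10845], [104, 10933]])
print(round(x2, 2), round(g2, 2), df, chi2_sf_df1(x2))
```

Both statistics are far in the upper tail for these data, which is consistent with the confidence intervals for the difference, RR, and odds ratio excluding the null values.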

11
Example: AZT use & AIDS
  • Development of AIDS symptoms by AZT use and race
  • Study on the effects of AZT in slowing the development of AIDS symptoms
  • Data set

                     Symptoms
Race     AZT Use     Yes     No     Total
White    Yes         14      93     107
         No          32      81     113
Black    Yes         11      52     63
         No          12      43     55
12
Three questions of interest in a 2×2×K table
  • Conditional independence? When controlling for race, are AZT treatment and development of AIDS symptoms independent?
  • Use the Cochran-Mantel-Haenszel (CMH) test
  • Summarizes the information from the K partial tables
  • Homogeneous association? Are the odds ratios of AZT treatment and development of AIDS symptoms the same for each race?
  • Use the Breslow-Day test
  • Common odds ratio? Use the Mantel-Haenszel estimate

13
Example (AZT use & AIDS revisited)
  • CMH = 6.8 (df = 1) with p-value = 0.0091
  • Not conditionally independent!
  • Breslow-Day = 1.39 (df = 1) with p-value = 0.2384
  • No evidence against homogeneous association!
  • Common odds ratio = 0.49
  • For each race, the estimated odds of developing symptoms are half as high for those who took AZT
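
The CMH statistic and the Mantel-Haenszel estimate quoted above can be approximated from the two partial tables. This is a sketch under the assumption that no continuity correction is applied; the Breslow-Day statistic is more involved and is not attempted here. The function name is illustrative.

```python
def cmh_and_mh(tables):
    """CMH statistic (no continuity correction) and Mantel-Haenszel common odds ratio.

    tables: one 2x2 table [[n11, n12], [n21, n22]] per stratum.
    """
    deviation = variance = numerator = denominator = 0.0
    for (n11, n12), (n21, n22) in tables:
        n = n11 + n12 + n21 + n22
        r1, r2 = n11 + n12, n21 + n22      # row totals
        c1, c2 = n11 + n21, n12 + n22      # column totals
        deviation += n11 - r1 * c1 / n     # observed minus expected under H0
        variance += r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))
        numerator += n11 * n22 / n
        denominator += n12 * n21 / n
    return deviation ** 2 / variance, numerator / denominator

# Rows = AZT yes/no, columns = symptoms yes/no, one table per race.
white = [[14, 93], [32, 81]]
black = [[11, 52], [12, 43]]
cmh, mh_odds_ratio = cmh_and_mh([white, black])
print(round(cmh, 2), round(mh_odds_ratio, 2))
```

The CMH value computed this way rounds to the 6.8 reported on the slide, and the Mantel-Haenszel estimate rounds to 0.49.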

14
Overview of types of generalized linear models (GLMs)
  • Three components: random component (response variable), linear predictor (linear combination of covariates), link function
  • Types of GLMs

Random Component    Link                 Systematic Component    Model
Normal              Identity             Continuous              Regression
Normal              Identity             Categorical             Analysis of variance
Normal              Identity             Mixed                   Analysis of covariance
Binomial            Logit                Mixed                   Logistic regression
Poisson             Log                  Mixed                   Loglinear
Multinomial         Generalized logit    Mixed                   Multinomial response
15
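Logistic regression with a quantitative covariate
  • Model: $\text{logit}[\pi(x)] = \log\dfrac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x$
  • Other representations
  • $\pi(x) = \dfrac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
  • Odds: $\dfrac{\pi(x)}{1 - \pi(x)} = \exp(\alpha + \beta x) = e^{\alpha}(e^{\beta})^{x}$
  • Odds at level $x + 1$ equal the odds at $x$ multiplied by $e^{\beta}$
  • Curve ascends ($\beta > 0$) or descends ($\beta < 0$)
  • The rate of change increases as $|\beta|$ increases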
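
A small numerical illustration of the odds interpretation: the values of alpha and beta below are hypothetical, chosen only to show that a one-unit increase in x multiplies the odds by exp(beta).

```python
import math

def prob(x, alpha, beta):
    """Success probability under the logistic regression model."""
    return 1 / (1 + math.exp(-(alpha + beta * x)))

def odds(x, alpha, beta):
    """Odds of success at covariate value x."""
    p = prob(x, alpha, beta)
    return p / (1 - p)

# Hypothetical parameter values, not estimates from any data set in the slides.
alpha, beta = -3.0, 0.5

odds_multiplier = odds(5, alpha, beta) / odds(4, alpha, beta)
print(odds_multiplier, math.exp(beta))  # the two values agree
print(prob(6, alpha, beta))             # alpha + beta * 6 = 0, so the probability is 0.5
```

With beta > 0 the curve ascends, and the probability equals 0.5 where alpha + beta * x = 0.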

16
Example: Horseshoe crabs
  • Binary response
  • $Y = 1$ if a female crab has at least one satellite, $Y = 0$ otherwise
  • Covariate: female crab's width (cm)
  • Data set

Width            Number Cases    Number Having Satellites
< 23.25          14              5
23.25-24.25      14              4
24.25-25.25      28              17
25.25-26.25      39              21
26.25-27.25      22              15
27.25-28.25      24              20
28.25-29.25      18              15
> 29.25          14              14
17
Example Horseshoe crabs
18
Goodness-of-fit tests
  • Working model $M$: $N$ = number of settings, $p$ = number of parameters in $M$
  • Hypothesis $H_0$: $M$ fits the data
  • Pearson's statistic $X^2$
  • Deviance statistic $G^2$
  • $X^2$ and $G^2$ approximately follow a chi-square null distribution with $df = N - p$

19
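Inference for parameters
  • Interval estimation: $\hat\beta \pm z_{\alpha/2}\, SE$
  • Two significance tests for $H_0$: $\beta = 0$
  • Wald test: Use $z = \hat\beta / SE$ or $z^2$
  • Likelihood ratio test: Use $-2(L_0 - L_1)$, where $L$ is the maximized log-likelihood function
  • The two tests have a large-sample chi-squared null distribution with $df = 1$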

20
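Example (Horseshoe crabs revisited)
  • Fitted model: $\text{logit}[\hat\pi(x)] = \hat\alpha + \hat\beta x$ with $\hat\beta = 0.497$
  • $\hat\pi(x)$ is larger at larger width ($\hat\beta > 0$)
  • There is a 64% increase in the estimated odds of a satellite for each centimeter increase in width ($e^{\hat\beta} = e^{0.497} = 1.64$)
  • Goodness of fit: Pearson's $X^2$ with p-value = 0.506
  • Goodness of fit: deviance $G^2$ with p-value = 0.4012
  • 95% CI for $\beta$: (0.298, 0.697)
  • Significance test: Wald = 23.9 (df = 1) with p-value < 0.0001; LRT = 31.3 (df = 1) with p-value < 0.0001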
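
The quantities on this slide are linked by simple arithmetic. The sketch below starts from the reported 95% CI for the width effect and assumes it is a symmetric Wald interval (estimate ± 1.96 SE), which is an assumption about how the interval was built; the recovered estimate and standard error are therefore only approximate.

```python
import math

# Reported 95% CI for the width coefficient (from the slide).
ci_low, ci_high = 0.298, 0.697
z = 1.96

# Assumption: the CI is a Wald interval, so its midpoint is the estimate
# and its half-width is z * SE.
beta_hat = (ci_low + ci_high) / 2
se = (ci_high - ci_low) / (2 * z)

wald_chi_square = (beta_hat / se) ** 2          # compare with Wald = 23.9
odds_multiplier = math.exp(beta_hat)            # compare with the 64% increase
odds_ci = (math.exp(ci_low), math.exp(ci_high)) # CI for the per-cm odds multiplier

print(round(beta_hat, 3), round(se, 3))
print(round(wald_chi_square, 1), round(odds_multiplier, 2))
print([round(v, 2) for v in odds_ci])
```

Exponentiating the interval endpoints gives an interval for the multiplicative effect of one centimeter of width on the odds of having a satellite.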

21
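Logistic regression with qualitative predictors: AIDS symptoms data
  • Use indicator variables for representing categories of predictors: $x = 1$ for AZT use (0 otherwise), $z = 1$ for white subjects (0 for black subjects)
  • Model: $\text{logit}(\pi) = \alpha + \beta_1 x + \beta_2 z$
  • Logits implied by indicator variables

x    z    Logit
0    0    $\alpha$
1    0    $\alpha + \beta_1$
0    1    $\alpha + \beta_2$
1    1    $\alpha + \beta_1 + \beta_2$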
22
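Logistic regression with qualitative predictors: AIDS symptoms data
  • $\beta_1$: difference between two logits (i.e., log of the odds ratio between AZT use and symptoms) at a fixed category of race
  • Homogeneous association model: with no interaction term, this odds ratio $e^{\beta_1}$ is the same for each race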

23
Equivalence of contingency table & logistic regression
  • Conditional independence: CMH test vs. test of $H_0$: $\beta_1 = 0$
  • Homogeneous association: Breslow-Day test vs. goodness-of-fit test
  • Common odds ratio estimate: Mantel-Haenszel estimate vs. $e^{\hat\beta_1}$

24
Computer Output for Model with AIDS Symptoms Data

Log Likelihood    -167.5756

Analysis of Maximum Likelihood Estimates
Parameter    Estimate    Std Error    Wald Chi-Square    Pr > ChiSq
Intercept    -1.0736     0.2629       16.6705            <.0001
azt          -0.7195     0.2790       6.6507             0.0099
race         0.0555      0.2886       0.0370             0.8476

LR Statistics
Source    DF    Chi-Square    Pr > ChiSq
azt       1     6.87          0.0088
race      1     0.04          0.8473

Obs    race    azt    y     n      pi_hat     lower      upper
1      1       1      14    107    0.14962    0.09897    0.21987
2      1       0      32    113    0.26540    0.19668    0.34774
3      0       1      11    63     0.14270    0.08704    0.22519
4      0       0      12    55     0.25472    0.16953    0.36396
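
The pi_hat column of this output follows from the parameter estimates by inverting the logit. A minimal check, using the estimates exactly as printed (the variable names are illustrative):

```python
import math

# Estimates as printed in the output above.
intercept, b_azt, b_race = -1.0736, -0.7195, 0.0555

def pi_hat(race, azt):
    """Fitted probability of symptoms for a given race/azt indicator setting."""
    eta = intercept + b_azt * azt + b_race * race
    return 1 / (1 + math.exp(-eta))

fitted = {(race, azt): pi_hat(race, azt) for race in (1, 0) for azt in (1, 0)}
for setting, p in fitted.items():
    print(setting, round(p, 5))

# The exponentiated azt coefficient estimates the common odds ratio.
common_odds_ratio = math.exp(b_azt)
print(round(common_odds_ratio, 2))
```

The exponentiated azt coefficient is close to the Mantel-Haenszel estimate of 0.49 from the contingency-table analysis.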
25
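Logistic regression with mixed predictors: Horseshoe crabs data
  • Model: $\text{logit}(\pi) = \alpha + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x$, with color indicators $c_1, c_2, c_3$ and width $x$
  • For color = medium light ($c_1 = 1$): $\text{logit}(\hat\pi) = (-12.715 + 1.330) + 0.468x = -11.385 + 0.468x$
  • For color = medium ($c_2 = 1$): $\text{logit}(\hat\pi) = (-12.715 + 1.402) + 0.468x = -11.313 + 0.468x$
  • For color = medium dark ($c_3 = 1$): $\text{logit}(\hat\pi) = (-12.715 + 1.106) + 0.468x = -11.609 + 0.468x$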
  • For controlling

26
Computer Output for Model for Horseshoe Crabs Data

                                        Likelihood Ratio 95%
Parameter    Estimate    Std. Error     Confidence Limits        Chi-Square    Pr > ChiSq
intercept    -12.7151    2.7618         -18.4564    -7.5788      21.20         <.0001
c1           1.3299      0.8525         -0.2738     3.1354       2.43          0.1188
c2           1.4023      0.5484         0.3527      2.5260       6.54          0.0106
c3           1.1061      0.5921         -0.0279     2.3138       3.49          0.0617
width        0.4680      0.1055         0.2713      0.6870       19.66         <.0001

LR Statistics
Source    DF    Chi-Square    Pr > ChiSq
width     1     26.40         <.0001
color     3     7.00          0.0720
27
Estimated probabilities for primary food choice
28
Logistic regression: polytomous response
  • Model categorical responses with more than two categories
  • Two ways
  • Use generalized logits for a nominal response
  • Use cumulative logits for an ordinal response
  • Notation
  • $J$: number of categories
  • $\{\pi_1, \ldots, \pi_J\}$: response probabilities with $\sum_{j} \pi_j = 1$

29
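Generalized logit model: nominal response
  • Baseline-category logit: Pair each category with a baseline category
  • $\log\dfrac{\pi_j}{\pi_J}$, $j = 1, \ldots, J - 1$, when $J$ is the baseline
  • Model with a predictor $x$: $\log\dfrac{\pi_j}{\pi_J} = \alpha_j + \beta_j x$, $j = 1, \ldots, J - 1$
  • The effects vary according to the category paired with the baseline
  • These pairs of categories determine equations for all other pairs of categories
  • E.g., for a pair of categories $(a, b)$: $\log\dfrac{\pi_a}{\pi_b} = \log\dfrac{\pi_a}{\pi_J} - \log\dfrac{\pi_b}{\pi_J} = (\alpha_a - \alpha_b) + (\beta_a - \beta_b)x$
  • Remark: Parameter estimates for a given pair of categories are the same no matter which category is the baseline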
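
To make the baseline-category structure concrete, the sketch below converts a set of logits into response probabilities and checks the identity for a non-baseline pair. All parameter values are hypothetical and serve only as an illustration.

```python
import math

def baseline_category_probs(x, alphas, betas):
    """Response probabilities for J categories from J-1 baseline-category logits.

    alphas[j], betas[j] belong to category j+1; the last category is the baseline.
    """
    terms = [math.exp(a + b * x) for a, b in zip(alphas, betas)]
    denom = 1 + sum(terms)
    return [t / denom for t in terms] + [1 / denom]

# Hypothetical parameters for J = 3 categories (category 3 is the baseline).
alphas, betas = [0.5, 2.0], [0.2, -1.0]
x = 1.5
probs = baseline_category_probs(x, alphas, betas)

# Logit for the pair (1, 2) implied by the two baseline-category equations.
log_ratio_12 = math.log(probs[0] / probs[1])
implied = (alphas[0] - alphas[1]) + (betas[0] - betas[1]) * x
print(probs, log_ratio_12, implied)
```

The probabilities sum to one by construction, and the logit for categories 1 and 2 equals the difference of the two baseline-category equations.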

30
Example: Alligator food choice
  • 59 alligators sampled in Lake George, Florida
  • Response: primary food type found in the alligator's stomach
  • Fish (1), Invertebrate (2), Other (3, baseline category)
  • Predictor: alligator length, which varies from 1.24 to 3.89 (m)
  • ML prediction equations
  • Larger alligators seem to select fish rather than invertebrates
  • Independence test: food choice vs. length
  • LRT = 16.8006 (df = 2) with p-value = 0.0002
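
The reported p-value can be checked by hand: with two length effects tested (one per non-baseline food category), the reference distribution has 2 degrees of freedom, and for df = 2 the chi-squared upper-tail probability has the closed form exp(-x/2). A minimal sketch, with an illustrative function name:

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2)

p_value = chi2_sf_df2(16.8006)
print(round(p_value, 4))
```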

31
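Cumulative logit model: ordinal response
  • Logit of a cumulative probability: $\text{logit}[P(Y \le j)] = \log\dfrac{P(Y \le j)}{1 - P(Y \le j)}$, $j = 1, \ldots, J - 1$
  • Categories 1 to $j$ combined, categories $j + 1$ to $J$ combined
  • Cumulative proportional odds model with a predictor $x$: $\text{logit}[P(Y \le j)] = \alpha_j + \beta x$
  • The effect of $x$ is identical for all $J - 1$ cumulative logits
  • Any one curve for $P(Y \le j)$ is identical to any of the others shifted to the right or shifted to the left
  • For $x_1$ and $x_2$, the log of the odds ratio of cumulative probabilities is $\beta(x_1 - x_2)$
  • Proportional to the difference between the $x$ values
  • Same for each cumulative probability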

32
Example: Political ideology & party affiliation
  • Response: political ideology on a five-point ordinal scale
  • Predictor: political party (Democratic, Republican)

                   Political Ideology
Political Party    Very Liberal    Slightly Liberal    Moderate    Slightly Conservative    Very Conservative
Democratic         80              81                  171         41                       55
Republican         30              46                  148         84                       99
33
Example: Political ideology & party affiliation
  • Parameter inference
  • $\hat\beta = 0.975$, $SE = 0.129$
  • Democrats tend to be more liberal than Republicans
  • Wald = 57.1 (df = 1) with p-value < 0.0001
  • Strong evidence of an association
  • 95% CI for $\beta$: (0.72, 1.23), or for $e^{\beta}$: (2.1, 3.4)
  • Odds of a response in the liberal direction are at least twice as high for Democrats as for Republicans
  • Goodness-of-fit
  • Deviance with p-value = 0.2957: adequate fit!
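
One informal way to see why the proportional odds model fits these data is to compute the sample cumulative odds ratio (Democratic vs. Republican) at each of the four cutpoints of the ideology scale. The sketch below uses the counts from the table; it is a descriptive check, not the ML fit reported on the slide.

```python
from itertools import accumulate

# Counts from the ideology table, ordered from very liberal to very conservative.
democratic = [80, 81, 171, 41, 55]
republican = [30, 46, 148, 84, 99]

def cumulative_odds(counts):
    """Sample odds of Y <= j versus Y > j for each cutpoint j = 1, ..., J-1."""
    total = sum(counts)
    cumulative = list(accumulate(counts))[:-1]  # drop the last sum (all categories)
    return [c / (total - c) for c in cumulative]

odds_ratios = [
    d / r for d, r in zip(cumulative_odds(democratic), cumulative_odds(republican))
]
print([round(v, 2) for v in odds_ratios])
```

The four sample odds ratios are of similar size and all fall inside or near the reported interval (2.1, 3.4) for the common cumulative odds ratio, which is what the model's single party effect assumes.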

34
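Other logit forms for ordinal response categories
  • Adjacent-categories logit: $\log\dfrac{\pi_{j+1}}{\pi_j}$, $j = 1, \ldots, J - 1$
  • Adjacent-categories logits determine the logits for all pairs of response categories
  • Continuation-ratio logit
  • Form 1: $\log\dfrac{\pi_1 + \cdots + \pi_j}{\pi_{j+1}}$, $j = 1, \ldots, J - 1$
  • Contrast each category with a grouping of categories from lower levels of the response scale
  • Form 2: $\log\dfrac{\pi_j}{\pi_{j+1} + \cdots + \pi_J}$, $j = 1, \ldots, J - 1$
  • Contrast each category with a grouping of categories from higher levels of the response scale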

35
Summary
  • Two-way contingency tables: RR, odds ratio, chi-square tests
  • Three-way contingency tables: conditional independence, homogeneous association, common odds ratio
  • Logistic regression: dichotomous response
  • Logistic regression: polytomous response

36
References
  • Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley, New York. (The 2nd edition is also available.)
  • Stokes, M.E., Davis, C.S., and Koch, G.G. (2000). Categorical Data Analysis Using the SAS System, Second Ed. SAS Institute Inc., Cary.