Chapter 11: Analyzing the Association Between Categorical Variables
(PowerPoint presentation transcript)

1
Chapter 11: Analyzing the Association Between
Categorical Variables
  • Section 11.1: What is Independence and What is
    Association?

2
Learning Objectives
  1. Comparing Percentages
  2. Independence vs. Dependence

3
Learning Objective 1: Example: Is There an
Association Between Happiness and Family Income?
4
Learning Objective 1: Example: Is There an
Association Between Happiness and Family Income?
  • The percentages in a particular row of a table
    are called conditional percentages
  • They form the conditional distribution for
    happiness, given a particular income level (a
    short code sketch follows below)

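As an illustration (not part of the original slides), a conditional distribution can be computed by dividing each cell count in a row by that row's total. The income-by-happiness counts below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical counts: rows = income level, columns = happiness category
counts = pd.DataFrame(
    {"Not Too Happy": [21, 53, 94],
     "Pretty Happy": [159, 372, 249],
     "Very Happy": [110, 221, 83]},
    index=["Above Average", "Average", "Below Average"],
)

# Conditional distribution of happiness given income:
# divide each row by its row total so that each row sums to 1
conditional = counts.div(counts.sum(axis=1), axis=0)
print(conditional.round(3))
print("Row sample sizes:")
print(counts.sum(axis=1))
```
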
5
Learning Objective 1: Example: Is There an
Association Between Happiness and Family Income?
6
Learning Objective 1: Example: Is There an
Association Between Happiness and Family Income?
  • Guidelines when constructing tables with
    conditional distributions
  • Make the response variable the column variable
  • Compute conditional proportions for the response
    variable within each row
  • Include the total sample sizes

7
Learning Objective 2: Independence vs. Dependence
  • For two variables to be independent, the
    population percentage in any category of one
    variable is the same for all categories of the
    other variable
  • For two variables to be dependent (or
    associated), the population percentages in the
    categories are not all the same

8
Learning Objective 2: Independence vs. Dependence
  • Are race and belief in life after death
    independent or dependent?
  • The conditional distributions in the table are
    similar but not exactly identical
  • It is tempting to conclude that the variables are
    dependent

9
Learning Objective 2: Independence vs. Dependence
  • Are race and belief in life after death
    independent or dependent?
  • The definition of independence between variables
    refers to a population
  • The table is a sample, not a population

10
Learning Objective 2: Independence vs. Dependence
  • Even if variables are independent, we would not
    expect the sample conditional distributions to be
    identical
  • Because of sampling variability, each sample
    percentage typically differs somewhat from the
    true population percentage

11
Chapter 11: Analyzing the Association Between
Categorical Variables
  • Section 11.2: How Can We Test Whether Categorical
    Variables Are Independent?

12
Learning Objectives
  1. A Significance Test for Categorical Variables
  2. What Do We Expect for Cell Counts if the
    Variables Are Independent?
  3. How Do We Find the Expected Cell Counts?
  4. The Chi-Squared Test Statistic
  5. The Chi-Squared Distribution
  6. The Five Steps of the Chi-Squared Test of
    Independence

13
Learning Objectives
  7. Chi-Squared is Also Used as a Test of
    Homogeneity
  8. Chi-Squared and the Test Comparing Proportions in
    2x2 Tables
  9. Limitations of the Chi-Squared Test

14
Learning Objective 1: A Significance Test for
Categorical Variables
  • Create a table of frequencies divided into the
    categories of the two variables
  • The hypotheses for the test are
  • H0: The two variables are independent
  • Ha: The two variables are dependent
    (associated)
  • The test assumes random sampling and a large
    sample size (expected cell counts of at least 5
    in every cell)

15
Learning Objective 2: What Do We Expect for Cell
Counts if the Variables Are Independent?
  • The count in any particular cell is a random
    variable
  • Different samples have different count values
  • The mean of its distribution is called an
    expected cell count
  • This is found under the presumption that H0 is
    true

16
Learning Objective 3: How Do We Find the Expected
Cell Counts?
  • Expected Cell Count: For a particular cell,
    expected cell count = (row total × column total) / total sample size
    (a code sketch follows below)
  • The expected frequencies are values that have the
    same row and column totals as the observed
    counts, but for which the conditional
    distributions are identical (this is the
    assumption of the null hypothesis).

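As an illustration (not from the slides), the expected counts can be computed directly from the row totals, column totals, and overall total; the observed table below is a hypothetical placeholder.

```python
import numpy as np

# Hypothetical observed contingency table (rows × columns)
observed = np.array([[21, 159, 110],
                     [53, 372, 221],
                     [94, 249,  83]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count for each cell = row total * column total / total sample size
expected = row_totals * col_totals / n
print(np.round(expected, 1))
```
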
17
Learning Objective 3: How Do We Find the Expected
Cell Counts? Example
18
Learning Objective 4: The Chi-Squared Test
Statistic
  • The chi-squared statistic summarizes how far the
    observed cell counts in a contingency table fall
    from the expected cell counts for a null
    hypothesis

19
Learning Objective 4: Example: Happiness and
Family Income
  • State the null and alternative hypotheses for
    this test
  • H0: Happiness and family income are independent
  • Ha: Happiness and family income are dependent
    (associated)

20
Learning Objective 4: Example: Happiness and
Family Income
  • Report the χ² statistic and explain how it was
    calculated
  • To calculate the statistic, for each cell,
    calculate (observed count - expected count)² / expected count
  • Sum the values for all the cells (a code sketch
    follows below)
  • The χ² value is 73.4

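A minimal sketch (not from the slides) of computing the χ² statistic by summing (observed - expected)²/expected over all cells; the table below is hypothetical, so it will not reproduce the 73.4 reported for the actual survey data.

```python
import numpy as np

# Hypothetical observed table (the real slide data are not reproduced here)
observed = np.array([[21, 159, 110],
                     [53, 372, 221],
                     [94, 249,  83]])

expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)) / observed.sum()

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
chi_sq = ((observed - expected) ** 2 / expected).sum()
print(f"X2 = {chi_sq:.1f}")
```
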
21
Learning Objective 4: Example: Happiness and
Family Income
22
Learning Objective 4: The Chi-Squared Test
Statistic
  • The larger the χ² value, the greater the
    evidence against the null hypothesis of
    independence and in support of the alternative
    hypothesis that happiness and income are
    associated

23
Learning Objective 5: The Chi-Squared Distribution
  • To convert the χ² test statistic to a
    P-value, we use the sampling distribution of the
    χ² statistic
  • For large sample sizes, this sampling
    distribution is well approximated by the
    chi-squared probability distribution

24
Learning Objective 5: The Chi-Squared Distribution
25
Learning Objective 5: The Chi-Squared Distribution
  • Main properties of the chi-squared distribution
  • It falls on the positive part of the real number
    line
  • The precise shape of the distribution depends on
    the degrees of freedom
  • df = (r - 1)(c - 1), where r is the number of rows
    and c is the number of columns (a code sketch
    follows below)

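As an illustration (not from the slides), right-tail probabilities of the chi-squared distribution are available from software; the sketch below assumes a hypothetical 3x3 table, so df = (3 - 1)(3 - 1) = 4.

```python
from scipy.stats import chi2

r, c = 3, 3                # rows and columns of a hypothetical table
df = (r - 1) * (c - 1)     # degrees of freedom = 4

print(chi2.mean(df))       # the mean of the distribution equals df
print(chi2.sf(9.49, df))   # right-tail probability above 9.49 is about 0.05
```
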
26
Learning Objective 5: The Chi-Squared Distribution
  • Main properties of the chi-squared distribution
  • The mean of the distribution equals the df value
  • It is skewed to the right
  • The larger the χ² value, the greater the
    evidence against H0: independence

27
Learning Objective 5: The Chi-Squared Distribution
28
Learning Objective 6: The Five Steps of the
Chi-Squared Test of Independence
  • 1. Assumptions
  • Two categorical variables
  • Randomization
  • Expected counts ≥ 5 in all cells

29
Learning Objective 6: The Five Steps of the
Chi-Squared Test of Independence
  • 2. Hypotheses
  • H0: The two variables are independent
  • Ha: The two variables are dependent (associated)

30
Learning Objective 6: The Five Steps of the
Chi-Squared Test of Independence
  • 3. Test Statistic
    χ² = sum over all cells of (observed count - expected count)² / expected count

31
Learning Objective 6: The Five Steps of the
Chi-Squared Test of Independence
  • 4. P-value: Right-tail probability above the
    observed χ² value, for the chi-squared
    distribution with df = (r - 1)(c - 1)
  • 5. Conclusion: Report the P-value and interpret in
    context
  • If a decision is needed, reject H0 when P-value
    ≤ significance level (the five steps are
    illustrated in the code sketch below)

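A minimal sketch (not from the slides) of carrying out these five steps with SciPy's chi2_contingency; the 2x3 table below is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: 2 groups (rows) × 3 response categories (columns)
observed = np.array([[30, 50, 20],
                     [45, 35, 20]])

chi_sq, p_value, df, expected = chi2_contingency(observed, correction=False)

print("Expected counts (check that all are at least 5):")
print(np.round(expected, 1))
print(f"X2 = {chi_sq:.2f}, df = {df}, P-value = {p_value:.4f}")
# Reject H0 of independence if the P-value is at or below the chosen significance level
```
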
32
Learning Objective 7: Chi-Squared is Also Used as
a Test of Homogeneity
  • The chi-squared test does not depend on which is
    the response variable and which is the
    explanatory variable
  • When a response variable is identified and the
    population conditional distributions are
    identical, they are said to be homogeneous
  • The test is then referred to as a test of
    homogeneity

33
Learning Objective 8: Chi-Squared and the Test
Comparing Proportions in 2x2 Tables
  • In practice, contingency tables of size 2x2 are
    very common. They often occur in summarizing the
    responses of two groups on a binary response
    variable.
  • Denote the population proportion of success by p1
    in group 1 and p2 in group 2
  • If the response variable is independent of the
    group, p1 = p2, so the conditional distributions
    are equal
  • H0: p1 = p2 is equivalent to H0: independence
    (this equivalence is illustrated in the code
    sketch below)

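As an illustration (not from the slides), for a 2x2 table the two-proportion z test with a pooled standard error and the chi-squared test are equivalent (z² = χ²); the counts below are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Hypothetical 2x2 table: rows = group 1 / group 2, columns = success / failure
table = np.array([[40, 60],
                  [25, 75]])

chi_sq, p_chi, df, _ = chi2_contingency(table, correction=False)

# Two-proportion z test with pooled standard error
n1, n2 = table.sum(axis=1)
p1, p2 = table[0, 0] / n1, table[1, 0] / n2
p_pool = table[:, 0].sum() / (n1 + n2)
z = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(f"X2 = {chi_sq:.3f}, z^2 = {z ** 2:.3f}")               # the two agree
print(f"P-values: {p_chi:.4f} vs {2 * norm.sf(abs(z)):.4f}")  # and so do the P-values
```
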
34
Learning Objective 8: Example: Aspirin and Heart
Attacks Revisited
35
Learning Objective 8: Example: Aspirin and
Heart Attacks Revisited
  • What are the hypotheses for the chi-squared test
    for these data?
  • The null hypothesis is that whether a doctor has
    a heart attack is independent of whether he takes
    placebo or aspirin
  • The alternative hypothesis is that there's an
    association

36
Learning Objective 8: Example: Aspirin and
Heart Attacks Revisited
  • Report the test statistic and P-value for the
    chi-squared test
  • The test statistic is 25.01 with a P-value of
    0.000 (rounded to three decimal places)
  • This is very strong evidence that the population
    proportion of heart attacks differed for those
    taking aspirin and for those taking placebo (see
    the code sketch below)

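A sketch of this test in SciPy. The transcript above does not show the table itself, so the counts below, 189 heart attacks among 11,034 physicians on placebo and 104 among 11,037 on aspirin, are an assumption based on the commonly cited Physicians' Health Study figures.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed counts (not shown in the transcript): rows = placebo, aspirin;
# columns = heart attack, no heart attack
table = np.array([[189, 11034 - 189],
                  [104, 11037 - 104]])

chi_sq, p_value, df, expected = chi2_contingency(table, correction=False)
print(f"X2 = {chi_sq:.2f}, df = {df}, P-value = {p_value:.1e}")
# X2 comes out near 25 with a P-value far below 0.001
```
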
37
Learning Objective 8: Example: Aspirin and
Heart Attacks Revisited
  • The sample proportions indicate that the aspirin
    group had a lower rate of heart attacks than the
    placebo group

38
Learning Objective 9: Limitations of the
Chi-Squared Test
  • If the P-value is very small, strong evidence
    exists against the null hypothesis of
    independence
  • But
  • The chi-squared statistic and the P-value tell us
    nothing about the nature or the strength of the
    association

39
Learning Objective 9: Limitations of the
Chi-Squared Test
  • We know that there is statistical significance,
    but the test alone does not indicate whether
    there is practical significance as well

40
Learning Objective 9: Limitations of the
Chi-Squared Test
  • The chi-squared test is often misused. Some
    examples are
  • when some of the expected frequencies are too
    small
  • when separate rows or columns are dependent
    samples
  • when the data are not random
  • when quantitative data are classified into
    categories, which results in a loss of information

41
Learning Objective 10: Goodness-of-Fit
Chi-Squared Tests
  • The chi-squared test can also be used for testing
    particular proportion values for a categorical
    variable.
  • The null hypothesis is that the distribution of
    the variable follows a given probability
    distribution; the alternative is that it does not
  • The test statistic is calculated in the same
    manner, where the expected counts are what would
    be expected in a random sample from the
    hypothesized probability distribution
  • For this particular case, the test statistic is
    referred to as a goodness-of-fit statistic (a
    code sketch follows below).

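A minimal goodness-of-fit sketch (not from the slides), testing whether a hypothetical die is fair with SciPy's chisquare.

```python
from scipy.stats import chisquare

# Hypothetical observed counts for the six faces of a die rolled 120 times
observed = [25, 17, 15, 23, 24, 16]

# Under H0 (a fair die) each face has probability 1/6, so each expected count is 20
expected = [120 / 6] * 6

statistic, p_value = chisquare(observed, f_exp=expected)
print(f"X2 = {statistic:.2f}, P-value = {p_value:.3f}")
```
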
42
Chapter 11: Analyzing the Association Between
Categorical Variables
  • Section 11.3: How Strong is the Association?

43
Learning Objectives
  1. Analyzing Contingency Tables
  2. Measures of Association
  3. Difference of Proportions
  4. The Ratio of Proportions: Relative Risk
  5. Properties of the Relative Risk
  6. Large Chi-square Does Not Mean There's a Strong
    Association

44
Learning Objective 1: Analyzing Contingency Tables
  • Is there an association?
  • The chi-squared test of independence addresses
    this
  • When the P-value is small, we infer that the
    variables are associated

45
Learning Objective 1: Analyzing Contingency Tables
  • How do the cell counts differ from what
    independence predicts?
  • To answer this question, we compare each observed
    cell count to the corresponding expected cell
    count

46
Learning Objective 1: Analyzing Contingency Tables
  • How strong is the association?
  • Analyzing the strength of the association reveals
    whether the association is an important one, or
    if it is statistically significant but weak and
    unimportant in practical terms

47
Learning Objective 2: Measures of Association
  • A measure of association is a statistic or a
    parameter that summarizes the strength of the
    dependence between two variables
  • A measure of association is useful for comparing
    associations

48
Learning Objective 3: Difference of Proportions
  • An easily interpretable measure of association is
    the difference between the proportions making a
    particular response

Case (a) exhibits the weakest possible
association: no association. The difference of
proportions is 0.
Case (b) exhibits the strongest possible
association. The difference of proportions is 1.
49
Learning Objective 3: Difference of Proportions
  • In practice, we don't expect data to follow
    either extreme (0% difference or 100%
    difference), but the stronger the association,
    the larger the absolute value of the difference
    of proportions

50
Learning Objective 3: Difference of Proportions
Example: Do Student Stress and Depression Depend
on Gender?
  • Which response variable, stress or depression,
    has the stronger sample association with gender?
  • The difference of proportions between females and
    males was 0.35 - 0.16 = 0.19 for feeling stressed
  • The difference of proportions between females and
    males was 0.08 - 0.06 = 0.02 for feeling depressed

51
Learning Objective 3: Difference of Proportions
Example: Do Student Stress and Depression Depend
on Gender?
  • In the sample, stress (with a difference of
    proportions of 0.19) has a stronger association
    with gender than depression has (with a
    difference of proportions of 0.02)

52
Learning Objective 4: The Ratio of Proportions:
Relative Risk
  • Another measure of association is the ratio of
    two proportions, p1/p2
  • In medical applications in which the proportion
    refers to an adverse outcome, it is called the
    relative risk

53
Learning Objective 4: Example: Relative Risk
for Seat Belt Use and Outcome of Auto Accidents
  • Treating the auto accident outcome as the
    response variable, find and interpret the
    relative risk

54
Learning Objective 4: Example: Relative Risk
for Seat Belt Use and Outcome of Auto Accidents
  • The adverse outcome is death
  • The relative risk is formed for that outcome
  • For those who wore a seat belt, the proportion
    who died equaled 510/412,878 = 0.00124
  • For those who did not wear a seat belt, the
    proportion who died equaled 1601/164,128 =
    0.00975

55
Learning Objective 4: Example: Relative Risk
for Seat Belt Use and Outcome of Auto Accidents
  • The relative risk is the ratio
    0.00124/0.00975 = 0.127
  • The proportion of subjects wearing a seat belt
    who died was 0.127 times the proportion of
    subjects not wearing a seat belt who died (a code
    sketch follows below)

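A short sketch (illustrative, not from the slides) computing these proportions and the relative risk in both directions from the counts quoted above.

```python
# Counts quoted on the slides: deaths and totals by seat belt use
deaths_belt, total_belt = 510, 412_878
deaths_no_belt, total_no_belt = 1601, 164_128

p_belt = deaths_belt / total_belt           # about 0.00124
p_no_belt = deaths_no_belt / total_no_belt  # about 0.00975

print(f"Relative risk (belt vs. no belt): {p_belt / p_no_belt:.3f}")   # about 0.127
print(f"Relative risk (no belt vs. belt): {p_no_belt / p_belt:.1f}")   # about 7.9
```
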
56
Learning Objective 4: Example: Relative Risk
for Seat Belt Use and Outcome of Auto Accidents
  • Many find it easier to interpret the relative
    risk by reordering the rows of data so that the
    relative risk has a value above 1.0

57
Learning Objective 4: Example: Relative Risk
for Seat Belt Use and Outcome of Auto Accidents
  • Reversing the order of the rows, we calculate the
    ratio 0.00975/0.00124 = 7.9
  • The proportion of subjects not wearing a seat
    belt who died was 7.9 times the proportion of
    subjects wearing a seat belt who died

58
Learning Objective 4: Example: Relative Risk
for Seat Belt Use and Outcome of Auto Accidents
  • A relative risk of 7.9 represents a strong
    association
  • This is far from the value of 1.0 that would
    occur if the proportion of deaths were the same
    for each group
  • Wearing a seat belt has a practically significant
    effect in enhancing the chance of surviving an
    auto accident

59
Learning Objective 5: Properties of the Relative
Risk
  • The relative risk can equal any nonnegative
    number
  • When p1 = p2, the variables are independent and
    the relative risk = 1.0
  • Values farther from 1.0 (in either direction)
    represent stronger associations

60
Learning Objective 6: Large χ² Does Not Mean
There's a Strong Association
  • A large chi-squared value provides strong
    evidence that the variables are associated
  • It does not imply that the variables have a
    strong association
  • This statistic merely indicates (through its
    P-value) how certain we can be that the variables
    are associated, not how strong that association is

61
Chapter 11: Analyzing the Association Between
Categorical Variables
  • Section 11.4: How Can Residuals Reveal the
    Pattern of Association?

62
Learning Objectives
  1. Association Between Categorical Variables
  2. Residual Analysis

63
Learning Objective 1: Association Between
Categorical Variables
  • The chi-squared test and measures of association
    such as (p1 - p2) and p1/p2 are fundamental
    methods for analyzing contingency tables
  • The P-value for χ² summarizes the strength of
    evidence against H0: independence

64
Learning Objective 1: Association Between
Categorical Variables
  • If the P-value is small, then we conclude that
    somewhere in the contingency table the population
    cell proportions differ from independence
  • The chi-squared test does not indicate whether
    all cells deviate greatly from independence or
    perhaps only some of them do so

65
Learning Objective 2: Residual Analysis
  • A cell-by-cell comparison of the observed counts
    with the counts that are expected when H0 is true
    reveals the nature of the evidence against H0
  • The difference between an observed and expected
    count in a particular cell is called a residual

66
Learning Objective 2: Residual Analysis
  • The residual is negative when fewer subjects are
    in the cell than expected under H0
  • The residual is positive when more subjects are
    in the cell than expected under H0

67
Learning Objective 2: Residual Analysis
  • To determine whether a residual is large enough
    to indicate strong evidence of a deviation from
    independence in that cell, we use an adjusted
    form of the residual: the standardized residual

68
Learning Objective 2: Residual Analysis
  • The standardized residual for a cell is
    (observed count - expected count)/se
  • A standardized residual reports the number of
    standard errors that an observed count falls from
    its expected count
  • The se describes how much the difference would
    tend to vary in repeated sampling if the
    variables were independent
  • Its formula is complex
  • Software can be used to find its value (a code
    sketch follows below)
  • A large standardized residual value provides
    evidence against independence in that cell

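A sketch (not from the slides) of computing standardized residuals for a hypothetical 2x3 table. The se used here, sqrt(expected × (1 - row proportion) × (1 - column proportion)), is one common form of the adjusted residual, so treat the details as an assumption rather than the slides' exact formula.

```python
import numpy as np

# Hypothetical observed table: rows = gender, columns = religiosity category
observed = np.array([[170, 340, 90],
                     [110, 360, 130]])

n = observed.sum()
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row * col / n

# Standardized (adjusted) residual: (observed - expected) / se,
# with se = sqrt(expected * (1 - row proportion) * (1 - column proportion))
se = np.sqrt(expected * (1 - row / n) * (1 - col / n))
std_residuals = (observed - expected) / se
print(np.round(std_residuals, 2))
```
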
69
Learning Objective 2: Example: Standardized
Residuals for Religiosity and Gender
  • To what extent do you consider yourself a
    religious person?

70
Learning Objective 2: Example: Standardized
Residuals for Religiosity and Gender
  • Interpret the standardized residuals in the table
  • The table exhibits large positive residuals for
    the cells for females who are very religious and
    for males who are not at all religious.
  • In these cells, the observed count is much larger
    than the expected count
  • There is strong evidence that the population has
    more subjects in these cells than if the
    variables were independent

71
Learning Objective 2: Example: Standardized
Residuals for Religiosity and Gender
  • The table exhibits large negative residuals for
    the cells for females who are not at all
    religious and for males who are very religious
  • In these cells, the observed count is much
    smaller than the expected count
  • There is strong evidence that the population has
    fewer subjects in these cells than if the
    variables were independent

72
Chapter 11: Analyzing the Association Between
Categorical Variables
  • Section 11.5: What if the Sample Size is Small?
  • Fisher's Exact Test

73
Learning Objectives
  1. Fisher's Exact Test
  2. Example Using Fisher's Exact Test
  3. Summary of Fisher's Exact Test of Independence
    for 2x2 Tables

74
Learning Objective 1: Fisher's Exact Test
  • The chi-squared test of independence is a
    large-sample test
  • When the expected frequencies are small (any of
    them being less than about 5), small-sample tests
    are more appropriate
  • Fisher's exact test is a small-sample test of
    independence

75
Learning Objective 1: Fisher's Exact Test
  • The calculations for Fisher's exact test are
    complex
  • Statistical software can be used to obtain the
    P-value for the test that the two variables are
    independent
  • The smaller the P-value, the stronger the
    evidence that the variables are associated

76
Learning Objective 2: Fisher's Exact Test
Example: Tea Tastes Better with Milk Poured First?
  • This is an experiment conducted by Sir Ronald
    Fisher
  • His colleague, Dr. Muriel Bristol, claimed that
    when drinking tea she could tell whether the milk
    or the tea had been added to the cup first

77
Learning Objective 2: Fisher's Exact Test
Example: Tea Tastes Better with Milk Poured First?
  • Experiment
  • Fisher asked her to taste eight cups of tea
  • Four had the milk added first
  • Four had the tea added first
  • She was asked to indicate which four had the milk
    added first
  • The order of presenting the cups was randomized

78
Learning Objective 2: Fisher's Exact Test
Example: Tea Tastes Better with Milk Poured First?
  • Results

79
Learning Objective 2: Fisher's Exact Test
Example: Tea Tastes Better with Milk Poured First?
  • Analysis

80
Learning Objective 2: Fisher's Exact Test
Example: Tea Tastes Better with Milk Poured First?
  • The one-sided version of the test pertains to the
    alternative that her predictions are better than
    random guessing
  • Does the P-value suggest that she had the ability
    to predict better than random guessing?

81
Learning Objective 2: Fisher's Exact Test
Example: Tea Tastes Better with Milk Poured First?
  • The P-value of 0.243 does not give much evidence
    against the null hypothesis
  • The data did not support Dr. Bristol's claim that
    she could tell whether the milk or the tea had
    been added to the cup first (a code sketch
    reproducing this P-value follows below)

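A sketch with SciPy that reproduces a one-sided P-value of about 0.243. The transcript does not show the actual counts, so the table below, in which she correctly identifies three of the four milk-first cups, is an assumption consistent with the reported P-value.

```python
from scipy.stats import fisher_exact

# Rows: milk poured first / tea poured first (truth)
# Columns: guessed "milk first" / guessed "tea first"
# Assumed result: 3 of the 4 milk-first cups identified correctly
table = [[3, 1],
         [1, 3]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"One-sided P-value = {p_value:.3f}")  # about 0.243
```
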
82
Learning Objective 3: Summary of Fisher's Exact
Test of Independence for 2x2 Tables
  • Assumptions
  • Two binary categorical variables
  • Data are random
  • Hypotheses
  • H0: the two variables are independent (p1 = p2)
  • Ha: the two variables are associated
    (p1 ≠ p2, or p1 > p2, or p1 < p2)

83
Learning Objective 3: Summary of Fisher's Exact
Test of Independence for 2x2 Tables
  • Test Statistic
  • First cell count (this determines the others,
    given the margin totals)
  • P-value
  • Probability that the first cell count equals the
    observed value or a value even more extreme, as
    predicted by Ha
  • Conclusion
  • Report the P-value and interpret in context. If
    a decision is required, reject H0 when P-value
    ≤ significance level