Title: Chi-Square Tests
1Chapter 20
2What is the chi-square test?
- The chi-square test is like the binomial test except that events do not have to be dichotomous.
- However, they do have to break down into distinct categories.
- For example, possible events might be A, B, C, or
- A, B, C, D, or
- Any finite number of distinct outcomes.
3Example
- Recall that, in the chapter on the binomial distribution, we wanted to analyze dichotomous events.
- However, many times there are more than two outcomes.
- Suppose you are planning a large party and you want to know what soft drinks to get.
- You scour the web for statistics on soft drink consumption and find the following data.
4Example
- The data only cover up to 2003, but you figure things haven't changed that much since then.
- You decide to use the old data.
- You have another problem.
- There are a large number of soda products and you can't buy them all.
5Example
- So you decide to buy the appropriate amounts of the top two.
- Everyone else, you serve beer.
- Your party purchases break down as follows.
6Example
- 48 people are coming to your party.
- You figure out how many guests are in each category as follows:
- Expected Coke drinkers: expected X = 48 × .18 = 8.64
- Expected Pepsi drinkers: expected Y = 48 × .12 = 5.76
- Expected to drink beer: expected Z = 48 × .70 = 33.6
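The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the dictionary and function names are made up for this sketch, and the shares (.18, .12, .70) are the ones used in the calculation above.

```python
# Market shares used in the calculation above (beer covers "everyone else").
GUESTS = 48
shares = {"Coke": 0.18, "Pepsi": 0.12, "Beer": 0.70}

def expected_counts(shares, n):
    """Expected number of guests per category: n times each share."""
    return {name: n * share for name, share in shares.items()}

expected = expected_counts(shares, GUESTS)
for name, count in expected.items():
    print(f"{name}: {count:.2f}")
```

Note that the expected counts do not have to be whole numbers; they only have to add up to the number of guests.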
7Example
- The day of your party finally comes and you get the requests for beverages as shown in the table below.
- Being obsessed with statistics, you wonder if the market shares for the various soft drinks have changed since 2003 or if the differences between predicted and actual are simply due to sampling error.
8A test statistic for the multinomial case
- We need a test statistic that can handle any number of events.
- But first some notation:
- χ² is our test statistic.
- $f_{e_i}$ will be the expected number of events of type i.
- $f_{o_i}$ will be the observed number of events of type i.
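With this notation, the statistic is conventionally written as Pearson's chi-square sum. The symbol k, for the number of event types, is introduced here only for the sketch:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(f_{o_i} - f_{e_i})^2}{f_{e_i}}
```

Each term measures how far an observed count is from its expected count, scaled by the expected count.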
9χ² in our example
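A minimal sketch of the calculation for the party example. The expected counts come from the earlier slide (48 guests with shares .18, .12, .70); the observed counts below are HYPOTHETICAL placeholders chosen for illustration, not the actual beverage requests.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Expected counts for 48 guests (Coke, Pepsi, beer).
expected = [48 * 0.18, 48 * 0.12, 48 * 0.70]
# HYPOTHETICAL observed requests, for illustration only.
observed = [10, 8, 30]

stat = chi_square(observed, expected)
print(f"chi-square = {stat:.3f}")
```

To use it on the real party data, replace `observed` with the counts from the request table.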
10χ² as an extension of the binomial test
- Is the χ² test something completely new?
- No, χ² is an extension of the binomial test.
- To show this, we will prove that χ² = z² in the case where there are just 2 possible events.
11χ² as an extension of the binomial test
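One way the proof can go, sketched under assumed notation: N trials, P the probability of the first event, Q = 1 − P, X the observed count of the first event, and the binomial z taken to be z = (X − NP)/√(NPQ). The expected counts are then NP and NQ, and the observed counts are X and N − X.

```latex
\begin{aligned}
\chi^2 &= \frac{(X - NP)^2}{NP} + \frac{\bigl((N - X) - NQ\bigr)^2}{NQ} \\
       &= \frac{(X - NP)^2}{NP} + \frac{(X - NP)^2}{NQ}
          \qquad \text{since } (N - X) - NQ = -(X - NP) \\
       &= (X - NP)^2 \, \frac{Q + P}{NPQ} \\
       &= \frac{(X - NP)^2}{NPQ} = z^2
\end{aligned}
```

The last step uses P + Q = 1.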
12χ² as an extension of the binomial test
- So, in the case of 2 events, χ² is just a different way of writing z².
- But the new form has the advantage of being generalizable.
- Furthermore, we can now see the advantage of using the squared form.
- If we didn't square each term, the positive and negative deviations would cancel each other out.
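The two-event equivalence can also be checked numerically. The sketch below assumes the usual binomial z, (X − NP)/√(NPQ), and uses hypothetical numbers (60 occurrences of the first event in 100 trials with P = .5).

```python
import math

def z_statistic(x, n, p):
    """Binomial z: (X - NP) / sqrt(NPQ)."""
    return (x - n * p) / math.sqrt(n * p * (1 - p))

def chi_square_two_events(x, n, p):
    """Chi-square with two categories: the event and its complement."""
    expected = [n * p, n * (1 - p)]
    observed = [x, n - x]
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# HYPOTHETICAL data: 60 of 100 trials, P = 0.5.
z = z_statistic(60, 100, 0.5)
chi2 = chi_square_two_events(60, 100, 0.5)
print(z ** 2, chi2)
```

Both printed values should agree up to floating-point rounding.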
13Correcting χ² for bias
- Like the binomial z, χ² must sometimes be corrected for bias.
- Use the corrected formula when there are 2 events and N < 100.
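The corrected formula is presumably the usual continuity (Yates) correction, which shrinks each absolute deviation by one half before squaring. That identification is an assumption made for this sketch:

```latex
\chi^2_{\text{corrected}} = \sum_{i=1}^{2} \frac{\bigl(|f_{o_i} - f_{e_i}| - 0.5\bigr)^2}{f_{e_i}}
```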
14What distribution does the χ² statistic follow?
- It follows the χ² distribution, of course.
- But what does it look like?
- Assuming the null hypothesis, it looks like this:
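To get a feel for the shape, the density of the χ² distribution can be evaluated directly. This is a small sketch using only the standard library; the function name and the sample points are arbitrary choices for illustration.

```python
import math

def chi2_pdf(x, df):
    """Density of the chi-square distribution with df degrees of freedom (x > 0)."""
    if x <= 0:
        return 0.0
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))

# Print the density at a few points for several degrees of freedom.
for df in (2, 3, 5):
    row = ", ".join(f"f({x})={chi2_pdf(x, df):.3f}" for x in (1, 3, 6, 10))
    print(f"df={df}: {row}")
```

The distribution is skewed to the right, and its bulk moves to larger values as the degrees of freedom increase.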
15Back to our example
- We were wondering if the mismatch between beverages and our guests' requests was due to outdated market share statistics.
- Let us now answer that question.
16Back to our example
- But what is $\chi^2_{crit}$?
- It is the value that bounds the upper .05 area of the χ² distribution.
- Degrees of freedom are the number of possible events − 1.
- Use table A14.
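If the table is not at hand, the critical value can be approximated numerically. The sketch below is one possible approach, not the textbook's method: it computes the upper-tail area from closed forms that hold for integer degrees of freedom and then finds the cutoff by bisection. The function names are made up for this illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail area P(chi-square > x) for a positive integer df."""
    if x <= 0:
        return 1.0
    half_x = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half_x / k
            total += term
        return math.exp(-half_x) * total
    # Odd df: complementary error function plus a finite correction sum.
    total = 0.0
    for k in range((df - 1) // 2):
        total += half_x ** (k + 0.5) / math.gamma(k + 1.5)
    return math.erfc(math.sqrt(half_x)) + math.exp(-half_x) * total

def chi2_critical(alpha, df, hi=1000.0):
    """Find x whose upper-tail area is alpha (the tail area decreases in x)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Three possible events (Coke, Pepsi, beer) give df = 3 - 1 = 2.
print(round(chi2_critical(0.05, 2), 2))
```

The same `chi2_sf` function gives a p-value when called with a computed χ² and its df.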
17Example
- So, what can we conclude about our party
planning?
18No preference hypothesis testing
- Now suppose we want to do an analysis like we did in the binomial chapter.
- There, executives were trying to find out if their pilot study gave significant results.
- Suppose that, since the pilot was inconclusive, they decided to run a more comprehensive study.
19No preference hypothesis testing
- Since the top 4 soft drinks have the greatest impact on their profits, they decide to focus on those, restricting the population of interest to Coke, Pepsi, and their diet counterparts.
- They run a taste test on 100 tasters and find the following breakdown of events.
20No preference hypothesis testing
- The predicted numbers of tasters are all the same because we are testing a null hypothesis that there is no difference in taste.
- Otherwise we use the same procedure.
21No preference hypothesis testing
- Calculate χ²
- df = 4 − 1 = 3
- $\chi^2_{crit}$ = 7.81 from table A14
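A sketch of the no-preference procedure: every category gets the same expected count, N divided by the number of categories. The taste-test counts below are HYPOTHETICAL, chosen only so the example runs; the critical value 7.81 is the one quoted above.

```python
def no_preference_chi_square(observed):
    """Chi-square against equal expected counts in every category."""
    n = sum(observed)
    fe = n / len(observed)  # same expected count for every category
    stat = sum((fo - fe) ** 2 / fe for fo in observed)
    df = len(observed) - 1
    return stat, df

# HYPOTHETICAL counts for 100 tasters (Coke, Pepsi, Diet Coke, Diet Pepsi).
observed = [30, 28, 22, 20]
stat, df = no_preference_chi_square(observed)

CRIT_05_DF3 = 7.81  # critical value for df = 3 at the .05 level
print(f"chi-square = {stat:.2f}, df = {df}, reject H0: {stat > CRIT_05_DF3}")
```

The same function would also serve for the die exercise, with six categories and the counts from 30 rolls.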
22Exercises
- Page 641
- 1, 3, 4, 6, 7
- Roll a die 30 times and test to see if it is
loaded.
23SPSS for Chi-squared
- Using the grades.sav data.
- Does the ethnicity of our class reflect the ethnicity of our (imaginary) University?
- Analyze -> Nonparametric Tests -> Chi-Square
- Add ethnic to the box.
- Set Expected Frequencies to 20, 20, 20, 20, 25 as the ethnic makeup of Imaginary U.
- OK
- In the output, notice that χ² is 51.14 and p = .000.
24SPSS Exercises
- Using the grades.sav data:
- Are the section sizes significantly different from one another?
- Are the freshman, sophomore, junior, and senior classes equally represented?
- The US census says that the US population breaks down as follows (2004 data):
- 236,057,761 Whites
- 37,502,320 Blacks
- 2,824,751 Native Americans
- 12,326,216 Asians
- 41,322,070 Hispanics
- Is the class in grades.sav representative of the
US population?
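For the census exercise, the population counts first have to be turned into expected counts for a class of a given size. A minimal sketch follows; the class size of 105 is an ASSUMPTION for illustration and should be replaced by the actual number of cases in grades.sav, and the dictionary keys are shorthand labels chosen here.

```python
# 2004 census counts listed above.
census_2004 = {
    "White": 236_057_761,
    "Black": 37_502_320,
    "Native American": 2_824_751,
    "Asian": 12_326_216,
    "Hispanic": 41_322_070,
}

def expected_from_population(counts, n):
    """Scale population counts down to expected counts for a sample of size n."""
    total = sum(counts.values())
    return {group: n * count / total for group, count in counts.items()}

CLASS_SIZE = 105  # ASSUMED sample size; use the real number of students.
expected = expected_from_population(census_2004, CLASS_SIZE)
for group, fe in expected.items():
    print(f"{group}: {fe:.2f}")
```

These expected counts (or the raw census counts, as relative weights) are what would be entered as the expected frequencies for the test.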
25Two variable contingency tables
- Obviously, not all of the complexity of life can be captured by observing one type of event, even if it has many possible outcomes.
- Often there are multiple types of events.
- And sometimes those types of events are related in interesting ways.
26Two variable contingency table: Example
- A study by Krupinski et al. (1998) investigated factors for suicides of inpatients with depressive psychoses.
- They considered a number of factors that might be related to suicide:
- Gender
- Age
- Marital status
- Whether they had children
- Number of siblings
- Time of last inpatient treatment
- Previous suicide attempt
- Abuse or addiction to alcohol or drugs
- Stress
- Voluntary treatment
- Diagnoses
27Two variable contingency table: Example
- We will look at their contingency table for
suicide vs. age group.
28Two variable contingency table: Example
- Our null hypothesis will be that the differences in the number of patients in each cell are due to sampling inhomogeneities.
- Sampling inhomogeneities are not the same as sampling error.
- We just happen to have more of some types than others available.
29Two variable contingency table: Example
- For example, we have more people not committing suicide than committing suicide in our available sample.
- We also have more people > 60 than any other age group.
- These 2 facts together mean that we should expect a lot of people in the (>60, No-Suicide) cell.
- This is true, independent of the relationship between suicide and age.
30Two variable contingency table: Example
- Likewise we can find cells with small counts.
31Two variable contingency table: Example
- In general, the expected count in a cell depends on the row and column sums as follows:
- $f_e = \frac{(\text{row sum}) \times (\text{column sum})}{N}$
- Where N is the total count over all cells.
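The row-and-column rule can be sketched as a short function. The table below is HYPOTHETICAL (three age groups by suicide / no suicide) and is not the Krupinski et al. data; it only shows the mechanics.

```python
def expected_table(observed):
    """Expected count per cell: row sum * column sum / N."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]

# HYPOTHETICAL 3 x 2 table: rows are age groups, columns are (suicide, no suicide).
observed = [
    [5, 35],
    [5, 55],
    [10, 90],
]

for row in expected_table(observed):
    print([round(fe, 2) for fe in row])
```

Large row and column sums produce large expected counts regardless of any relationship between the two variables, which is the point made on the earlier slide.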
32Two variable contingency table: Example
- So for example the expected count in the (>60, No-Suicide) cell is
33Two variable contingency table: Example
- We can fill in (in parentheses) the expected count for all cells using this formula.
- Are we going to get statistical significance?
34Two variable contingency table: Example
- Our null hypothesis: suicide and age are independent.
- Our alternative hypothesis: __________________
35Two variable contingency table: Example
- Calculating χ²
- This is exactly as before except we will now sum over all cells.
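Summing over all cells can be sketched as follows. The table is again HYPOTHETICAL, not the study's data, and the function also returns the degrees of freedom, (rows − 1)(columns − 1), that go with the statistic.

```python
def contingency_chi_square(observed):
    """Chi-square over all cells of a two-way table, plus its degrees of freedom."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_sums[i] * col_sums[j] / n  # expected count for this cell
            stat += (fo - fe) ** 2 / fe
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return stat, df

# HYPOTHETICAL 3 x 2 table: rows are age groups, columns are (suicide, no suicide).
observed = [
    [5, 35],
    [5, 55],
    [10, 90],
]
stat, df = contingency_chi_square(observed)
print(f"chi-square = {stat:.3f}, df = {df}")
```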
36Two variable contingency table: Example
- Now we need a $\chi^2_{crit}$
- For which we need a df.
- df = (rows − 1)(columns − 1) = 5 × 1 = 5
- $\chi^2_{crit}$ = 11.07 (α = .05) from table A14
- p = .08
37Two variable contingency table: Example
- So what would you do if you were a hospital administrator?
- $\chi^2_{crit}$ = 9.24 (α = .10) from table A14
38Other predictors of suicide
39Exercises
40Cramér's Phi
- What measure do we have of the strength of
association between 2 variables?
41Cramér's Phi
- We could compute the correlation between the 2 variables in our contingency table.
- There is a shortcut to accomplish this based on χ², which we've already computed:
- $\phi_C = \sqrt{\frac{\chi^2}{N(L - 1)}}$
- Where N is the total sample size and L is the number of rows or columns, whichever is smaller.
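The shortcut can be sketched as a small function, taking Cramér's phi in its standard form, the square root of χ² divided by N(L − 1). The inputs in the example call are HYPOTHETICAL and do not come from the suicide study.

```python
import math

def cramers_phi(chi2, n, n_rows, n_cols):
    """Cramér's phi: sqrt(chi2 / (N * (L - 1))), with L = min(rows, columns)."""
    smaller = min(n_rows, n_cols)
    if smaller < 2:
        raise ValueError("the table needs at least 2 rows and 2 columns")
    return math.sqrt(chi2 / (n * (smaller - 1)))

# HYPOTHETICAL inputs: chi-square of 9.0 from a 2 x 3 table with 100 cases.
phi = cramers_phi(chi2=9.0, n=100, n_rows=2, n_cols=3)
print(f"Cramér's phi = {phi:.2f}")
```

Like a correlation, the result lies between 0 (no association) and 1 (perfect association).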
42Cramér's Phi
- Let's see if our suicide prediction study has a large association.
- As correlations go, this is very small.
43SPSS for contingency tables
- Using grades.sav
- Let's see if ethnicity and sex are associated in our class.
- In other words, can we predict the gender of a classmate from their ethnicity?
- Analyze -> Descriptive Statistics -> Crosstabs.
- Move sex into the upper box.
- Move ethnic into the middle box.
- In Cells
- Select Observed and Expected.
- In Statistics choose Phi and Cramér's V.
- OK.
- In the output note Pearson Chi-Square, its Significance (p), and Cramér's V, which is Cramér's Phi.
44SPSS Exercises
- Using divorce.sav
- Test the following pairs of variables for significant association and for strength of association:
- Current family income vs current marital status
- Current family income vs employment status
- Current marital status vs employment status