Title: Cross Tabs and Chi-Squared
1. Cross Tabs and Chi-Squared
- Testing for a Relationship Between Nominal/Ordinal Variables
2. Cross Tabs and Chi-Squared
- The test you choose depends on level of measurement:

  Independent                     Dependent                       Statistical Test
  Dichotomous                     Interval-ratio or Dichotomous   Independent Samples t-test
  Nominal or Dichotomous          Nominal or Dichotomous          Cross Tabs
  Nominal or Dichotomous          Interval-ratio or Dichotomous   ANOVA
  Interval-ratio or Dichotomous   Interval-ratio                  Correlation and OLS Regression
3. Cross Tabs and Chi-Squared
- We are asking whether there is a relationship between two nominal (or ordinal) variables; this includes dichotomous variables.
- (Even though one may use cross tabs for ordinal variables, it is generally better to treat them as interval variables and use more powerful statistical techniques whenever you can.)
4. Cross Tabs and Chi-Squared
- Cross tabs and Chi-Squared will tell you whether classification on one nominal or ordinal variable is related to classification on a second nominal or ordinal variable.
- For example:
  - Are rural Americans more likely to vote Republican in presidential races than urban Americans?
    - Variables: Classification of Region and Party Vote
  - Are white people more likely to drive SUVs than blacks or Hispanics?
    - Variables: Race and Type of Vehicle
5. Cross Tabs and Chi-Squared
- The statistical focus will be on the number of people in a sample who are classified in patterned ways on two variables.
- Why?
  - Means and standard deviations are meaningless for nominal variables.
6. Cross Tabs and Chi-Squared
- The procedure starts with a cross classification of the cases in categories of each variable.
- Example: data on male and female support for SJSU football from 650 students, put into a matrix:

            Yes    No    Maybe   Total
  Female    185    200     65     450
  Male       80     65     55     200
  Total     265    265    120     650
7. Cross Tabs and Chi-Squared
- In the example, I can see that the campus is divided on the issue.
- But are there associations between sex and attitudes?
8. Cross Tabs and Chi-Squared
- But are there associations between sex and attitudes?
- An easy way to get more information is to convert the frequencies to percentages.
- Example: data on male and female support for SJSU football from 650 students, put into a matrix (row percentages in parentheses):

            Yes          No           Maybe        Total
  Female    185 (41%)    200 (44%)     65 (14%)    450 (99%*)
  Male       80 (40%)     65 (33%)     55 (28%)    200 (101%*)
  Total     265 (41%)    265 (41%)    120 (18%)    650 (100%)

  * Percentages do not add to 100 due to rounding.
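The conversion from frequencies to row percentages can be sketched in Python. This is only an illustration and not part of the original slides; the names `observed` and `row_percentages` are made up for the example, and the percentages are printed to one decimal place so that rounding does not push a row total off 100.

```python
# Observed counts from the football example (rows: sex, columns: answer).
answers = ["Yes", "No", "Maybe"]
observed = {
    "Female": [185, 200, 65],
    "Male": [80, 65, 55],
}

def row_percentages(counts):
    """Convert one row of counts to percentages of that row's total."""
    total = sum(counts)
    return [100 * c / total for c in counts]

percent = {sex: row_percentages(counts) for sex, counts in observed.items()}

for sex, pcts in percent.items():
    cells = ", ".join(f"{a}: {p:.1f}%" for a, p in zip(answers, pcts))
    print(f"{sex:6} {cells}")
```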
9. Cross Tabs and Chi-Squared
- We can see that in the sample men are less likely to oppose football, but no more likely to say yes than women; men are more likely to say maybe.
10. Cross Tabs and Chi-Squared
- Using percentages to describe relationships is valid statistical analysis. These are descriptive statistics! However, they are not inferential statistics.
- What can we say about the population?
- Could we have gotten sample statistics like these from a population where there is no association between sex and attitudes about starting football?
- This is where the Chi-Squared Test of Independence comes in handy.
11. Cross Tabs and Chi-Squared
- The whole idea behind the Chi-Squared test of independence is to determine whether the patterns of frequencies in your cross classification table could have occurred by chance, or whether they represent systematic assignment to particular cells.
- For example, were women more likely to answer no than men, or could the deviation in responses by sex have occurred because of random sampling or chance alone?
12. Cross Tabs and Chi-Squared
- A number called Chi-Squared, $\chi^2$, tells us whether the numbers in our sample deviate from what would be expected by chance.
- Its formula:
  $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
  - $f_o$ = observed frequency in each cell
  - $f_e$ = expected frequency in each cell
- A bigger $\chi^2$ will result as our sample data deviate more and more from what would be expected by chance.
- A big $\chi^2$ will imply that there is a relationship between our two nominal variables.
13. Cross Tabs and Chi-Squared
- Calculating $\chi^2$ begins with the concept of a deviation of observed data from what is expected by chance alone.
- Deviation in $\chi^2$ = Observed frequency - Expected frequency
- Observed frequency is just the number of cases in each cell of the cross classification table. For example, 185 women said yes, they support football at SJSU. 185 is the observed frequency.
- Expected frequency is the number of cases that would be in a cell of the cross classification table if people in each group of one variable had the same propensity as every other group to give each answer on the second variable.
14. Cross Tabs and Chi-Squared
- Data on male and female support for SJSU football from 650 students:

            Yes    No    Maybe   Total
  Female    185    200     65     450
  Male       80     65     55     200
  Total     265    265    120     650

- Expected frequency (if our variables were unrelated):
  - Since females comprise 69.2% of the sample, we'd expect 69.2% of the Yes answers to come from females, 69.2% of the No answers to come from females, and 69.2% of the Maybe answers to come from females. On the other hand, 30.8% of the Yes, No, and Maybe answers should come from men.
- Therefore, to calculate the expected frequency for each cell, you do this:
  $f_e = \frac{\text{cell's row total}}{\text{table total}} \times \text{cell's column total}$
  or
  $f_e = \frac{\text{cell's column total}}{\text{table total}} \times \text{cell's row total}$
- The idea is that you find the percent of persons in one category on the first variable, and expect to find that same percent of the people in each of the other variable's categories.
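The expected-frequency rule can be sketched in Python as follows. This is a minimal illustration, not something from the original slides; the function name `expected_frequencies` and the list-of-rows layout are assumptions made for the example.

```python
# Observed counts from the football example
# (rows: Female, Male; columns: Yes, No, Maybe).
observed = [
    [185, 200, 65],
    [80, 65, 55],
]

def expected_frequencies(table):
    """Expected count per cell if the two variables were unrelated:
    (cell's row total / table total) * cell's column total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [
        [r / grand_total * c for c in col_totals]
        for r in row_totals
    ]

expected = expected_frequencies(observed)
for row in expected:
    print([round(fe, 1) for fe in row])
```

Rounded to one decimal place, this gives 183.5, 183.5, and 83.1 for women and 81.5, 81.5, and 36.9 for men.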
15. Cross Tabs and Chi-Squared
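- Now you know how to calculate the expected frequency (and the observed frequency is obvious). Cells 1 to 3 run across the Female row (Yes, No, Maybe) and cells 4 to 6 across the Male row.
  $f_{e1} = (450/650) \times 265 = 183.5$      $f_{e4} = (200/650) \times 265 = 81.5$
  $f_{e2} = (450/650) \times 265 = 183.5$      $f_{e5} = (200/650) \times 265 = 81.5$
  $f_{e3} = (450/650) \times 120 = 83.1$       $f_{e6} = (200/650) \times 120 = 36.9$
- You already saw how to calculate the deviations too: $D_c = f_o - f_e$
  $D_1 = 185 - 183.5 = 1.5$       $D_4 = 80 - 81.5 = -1.5$
  $D_2 = 200 - 183.5 = 16.5$      $D_5 = 65 - 81.5 = -16.5$
  $D_3 = 65 - 83.1 = -18.1$       $D_6 = 55 - 36.9 = 18.1$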
16. Cross Tabs and Chi-Squared
- Deviations ($D_c = f_o - f_e$):
  $D_1 = 1.5$, $D_2 = 16.5$, $D_3 = -18.1$, $D_4 = -1.5$, $D_5 = -16.5$, $D_6 = 18.1$
- Now, we want to add up the deviations.
- What would happen if we added these deviations together? (The positive and negative deviations would cancel, and the sum would be zero.)
- To get rid of negative deviations, we square each one (like in computing standard deviations).
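A quick check of why squaring is needed, sketched in Python. This is an illustration only and not part of the slides; the variable names are assumptions. Expected counts are computed at full precision from the table totals, so the raw deviations cancel to zero up to floating-point error, while their squares do not.

```python
# Observed counts (rows: Female, Male; columns: Yes, No, Maybe).
observed = [[185, 200, 65], [80, 65, 55]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Deviation = observed - expected, one value per cell.
deviations = [
    fo - r * c / grand_total
    for row, r in zip(observed, row_totals)
    for fo, c in zip(row, col_totals)
]

total_deviation = sum(deviations)
total_squared = sum(d ** 2 for d in deviations)

print(f"sum of deviations:         {total_deviation:.6f}")
print(f"sum of squared deviations: {total_squared:.2f}")
```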
17. Cross Tabs and Chi-Squared
- To get rid of negative deviations, we square each one (like in standard deviations):
  $(D_1)^2 = (1.5)^2 = 2.25$          $(D_4)^2 = (-1.5)^2 = 2.25$
  $(D_2)^2 = (16.5)^2 = 272.25$       $(D_5)^2 = (-16.5)^2 = 272.25$
  $(D_3)^2 = (-18.1)^2 = 327.61$      $(D_6)^2 = (18.1)^2 = 327.61$
18. Cross Tabs and Chi-Squared
- Just how large is each of these squared deviations?
- The next step is to give the deviations a metric. The squared deviations are compared relative to what was expected. In other words, we divide by what was expected.
- You've already calculated what was expected in each cell ($f_{e1}$ through $f_{e6}$).
- Relative deviations-squared: small values indicate little deviation from what was expected, while larger values indicate much deviation from what was expected.
  $(D_1)^2 / f_{e1} = 2.25 / 183.5 = 0.012$        $(D_4)^2 / f_{e4} = 2.25 / 81.5 = 0.028$
  $(D_2)^2 / f_{e2} = 272.25 / 183.5 = 1.484$      $(D_5)^2 / f_{e5} = 272.25 / 81.5 = 3.340$
  $(D_3)^2 / f_{e3} = 327.61 / 83.1 = 3.942$       $(D_6)^2 / f_{e6} = 327.61 / 36.9 = 8.878$
19. Cross Tabs and Chi-Squared
- The next step will be to see what the total of the relative deviations-squared is:
  Sum of relative deviations-squared = 0.012 + 1.484 + 3.942 + 0.028 + 3.340 + 8.878 = 17.684
- This number is also what we call Chi-Squared, or $\chi^2$. So:
  $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = 17.684$
- Of what good is knowing this number?
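The whole hand calculation (expected counts, deviations, squaring, dividing by the expected count, and summing) can be sketched as one Python function. This is an illustration under assumed names (`chi_squared`, `observed`), not code from the slides. Because it carries full precision instead of rounding each intermediate value, it gives about 17.67 rather than the 17.684 obtained by hand.

```python
# Observed counts (rows: Female, Male; columns: Yes, No, Maybe).
observed = [[185, 200, 65], [80, 65, 55]]

def chi_squared(table):
    """Sum of (fo - fe)^2 / fe over every cell of a cross classification table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    total = 0.0
    for row, r in zip(table, row_totals):
        for fo, c in zip(row, col_totals):
            fe = r * c / grand_total  # expected count if the variables were unrelated
            total += (fo - fe) ** 2 / fe
    return total

chi2 = chi_squared(observed)
print(f"chi-squared = {chi2:.2f}")
```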
20. Cross Tabs and Chi-Squared
- This value, $\chi^2$, would form an identifiable shape in repeated sampling if the two variables were unrelated to each other.
- That shape depends only on the number of rows and columns. We technically refer to this as the degrees of freedom.
- For $\chi^2$, df = (rows - 1)(columns - 1)
21. Cross Tabs and Chi-Squared
- For $\chi^2$, df = (rows - 1)(columns - 1)
- [Figure: $\chi^2$ distributions for df = 1, 5, 10, and 20]
- FYI: This should remind you of the normal distribution, except that it changes shape depending on the nature of your variables.
22. Cross Tabs and Chi-Squared
- Think of the Power!!!!
- We can use the known properties of the $\chi^2$ distribution to identify the probability that we would get our sample's $\chi^2$ if our variables were unrelated!
- This is exciting!
23. Cross Tabs and Chi-Squared
- If our $\chi^2$ in a particular analysis were under the shaded area or beyond, what could we say about the population given our sample?
- [Figure: $\chi^2$ distribution with the upper 5% of $\chi^2$ values shaded]
24. Cross Tabs and Chi-Squared
- Answer: We'd reject the null, saying that it is highly unlikely that we could get such a large chi-squared value from a population where the two variables are unrelated.
25. Cross Tabs and Chi-Squared
- So, what is the critical $\chi^2$ value?
26. Cross Tabs and Chi-Squared
- That depends on the particular problem because the distribution changes depending on the number of rows and columns.
- [Figure: $\chi^2$ distributions for df = 1, 5, 10, and 20, each with its own critical $\chi^2$]
27. Cross Tabs and Chi-Squared
- According to Table C, with $\alpha$-level = .05:
  - if df = 1, critical $\chi^2$ = 3.84
  - if df = 5, critical $\chi^2$ = 11.07
  - if df = 10, critical $\chi^2$ = 18.31
  - if df = 20, critical $\chi^2$ = 31.41
28. Cross Tabs and Chi-Squared
- In our football problem above, we had a chi-squared of 17.68 in a cross classification table with 2 rows and 3 columns.
- Our chi-squared distribution for that table would have df = (2 - 1)(3 - 1) = 2.
- According to Table C, with $\alpha$-level = .05, critical Chi-Squared is 5.99.
- Since 17.68 > 5.99, we reject the null.
- We reject that our sample could have come from a population where sex was not related to attitudes toward football.
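The decision rule can be sketched in Python. This is only an illustration: the lookup holds just the Table C critical values quoted in these slides for an alpha-level of .05, and the names `CRITICAL_05`, `degrees_of_freedom`, and `reject_null` are assumptions made for the example.

```python
# Critical chi-squared values at alpha = .05, keyed by degrees of freedom.
# Only the values quoted in the slides are included; this is not a full table.
CRITICAL_05 = {1: 3.84, 2: 5.99, 5: 11.07, 10: 18.31, 20: 31.41}

def degrees_of_freedom(n_rows, n_cols):
    """df = (rows - 1)(columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def reject_null(chi2, n_rows, n_cols):
    """True if chi2 exceeds the critical value for this table's df."""
    df = degrees_of_freedom(n_rows, n_cols)
    return chi2 > CRITICAL_05[df]

df = degrees_of_freedom(2, 3)        # football table: 2 rows, 3 columns
decision = reject_null(17.68, 2, 3)
print(f"df = {df}, critical = {CRITICAL_05[df]}, reject null: {decision}")
```

A table whose df is not in the lookup raises a `KeyError`, which is deliberate here: the sketch refuses to guess a critical value it was not given.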
29. Cross Tabs and Chi-Squared
- Now let's get formal.
- 7 steps to the Chi-squared test of independence:
  1. Set $\alpha$-level (e.g., .05)
  2. Find critical $\chi^2$ (depends on df and $\alpha$-level)
  3. State the null and alternative hypotheses
     - Ho: The two nominal variables are independent
     - Ha: The two variables are dependent on each other
  4. Collect data
  5. Calculate $\chi^2$: $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
  6. Make decision about the null hypothesis
  7. Report the P-value
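The final step, reporting the P-value, needs the area of the $\chi^2$ distribution beyond the sample's $\chi^2$. For the special case df = 2 (as in the 2 x 3 football table), that tail area has the closed form exp(-chi2 / 2); other df values need a statistics package or a printed table. The sketch below covers only that special case, and the function name `p_value_df2` is an assumption made for the example.

```python
import math

def p_value_df2(chi2):
    """Upper-tail probability of a chi-squared value when df = 2 only.
    With 2 degrees of freedom the distribution is exponential with mean 2,
    so P(X >= chi2) = exp(-chi2 / 2)."""
    if chi2 < 0:
        raise ValueError("chi-squared cannot be negative")
    return math.exp(-chi2 / 2)

p = p_value_df2(17.68)            # football example: 2 rows, 3 columns
critical = -2 * math.log(0.05)    # the value with 5% of the area beyond it

print(f"p = {p:.5f}, critical chi-squared = {critical:.2f}")
```

The critical value recovered this way rounds to 5.99, and the P-value for 17.68 is well below .05, consistent with rejecting the null.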
30. Cross Tabs and Chi-Squared
- Afterwards, what have you found?
  - If Chi-Squared is not significant, you cannot conclude that your variables are related.
  - If Chi-Squared is significant, your variables are related.
  - That's all!
- Chi-Squared cannot tell you anything like the strength or direction of association. For purely nominal variables, there is no direction of association.
- Chi-Squared is a large-sample test. If dealing with small samples, look up appropriate tests. (A condition of the test: no expected frequency lower than 5 in any cell.)
- The larger the sample size, the easier it is for Chi-Squared to be significant.
- For a 2 x 2 table, Chi-Squared gives the same result as an Independent Samples t-test for proportions and ANOVA.
31. Cross Tabs and Chi-Squared
- If you want to know how you depart from independence, you may:
  - Check percentages (conditional distributions) in your cross classification table.
  - Do a residual analysis.
    - The difference between observed and expected counts in a cell behaves like a significance test when divided by a standard error for the difference.
    - That s.e. = $\sqrt{f_e (1 - \text{cell's row proportion})(1 - \text{cell's column proportion})}$
    - $Z = \frac{f_o - f_e}{\text{s.e.}}$
32. Cross Tabs and Chi-Squared
- Residual Analysis
- Let's do cell 5 (Male, No)!
  - s.e. = $\sqrt{f_e (1 - \text{cell's row proportion})(1 - \text{cell's column proportion})}$ and $Z = \frac{f_o - f_e}{\text{s.e.}}$
  - Cell 5's row proportion = 200/650 = .308; column proportion = 265/650 = .408
  - s.e. = $\sqrt{81.5 \times (.692) \times (.592)}$ = 5.78
  - Z = (65 - 81.5) / 5.78 = -2.85
  - Since 2.85 > 1.96, there is a significant difference in cell 5.
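The same residual can be computed for every cell at once. The sketch below is an illustration, not part of the slides; it applies the standard-error formula given above, and the function name `adjusted_residuals` is an assumption. Working at full precision, cell 5 comes out at about -2.86 rather than the hand-rounded -2.85.

```python
import math

# Observed counts (rows: Female, Male; columns: Yes, No, Maybe).
observed = [[185, 200, 65], [80, 65, 55]]

def adjusted_residuals(table):
    """Z = (fo - fe) / sqrt(fe * (1 - row proportion) * (1 - column proportion))
    for every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for row, r in zip(table, row_totals):
        z_row = []
        for fo, c in zip(row, col_totals):
            fe = r * c / n
            se = math.sqrt(fe * (1 - r / n) * (1 - c / n))
            z_row.append((fo - fe) / se)
        result.append(z_row)
    return result

z = adjusted_residuals(observed)
for label, z_row in zip(["Female", "Male"], z):
    print(label, [round(v, 2) for v in z_row])
```

Any cell whose Z is beyond 1.96 in absolute value departs significantly from independence at the .05 level.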
33. Cross Tabs and Chi-Squared
- Further topics you could explore:
  - Strength of Association
    - Discussing outcomes in terms of difference of proportions
    - Reporting Odds Ratios (likelihood of a group giving one answer versus other answers, or the group giving an answer relative to other groups giving that answer)
  - Strength and Direction of Association for Ordinal Variables
    - Gamma (an inferential statistic, so check for significance)
      - Ranges from -1 to 1
      - Valence (the sign) indicates direction of relationship
      - Magnitude indicates strength of relationship
      - Chi-squared and Gamma can disagree when there is a nonrandom pattern that has no direction. Chi-squared will catch it; gamma won't.
    - Kendall's tau-b
    - Somers' d
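As a small taste of the first two topics, here is a hedged Python sketch of a difference of proportions and an odds ratio for the "No" answer in the football table. It is not from the slides; the choice of the "No" answer and the helper names `proportion` and `odds` are assumptions made for illustration.

```python
# Observed counts from the football example.
observed = {
    "Female": {"Yes": 185, "No": 200, "Maybe": 65},
    "Male": {"Yes": 80, "No": 65, "Maybe": 55},
}

def proportion(group, answer):
    """Share of a group giving a particular answer."""
    counts = observed[group]
    return counts[answer] / sum(counts.values())

def odds(group, answer):
    """Odds of a group giving one answer versus any other answer."""
    p = proportion(group, answer)
    return p / (1 - p)

diff_no = proportion("Female", "No") - proportion("Male", "No")
odds_ratio_no = odds("Female", "No") / odds("Male", "No")

print(f"difference of proportions (No): {diff_no:.3f}")
print(f"odds ratio (No, women vs. men): {odds_ratio_no:.2f}")
```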