Title: Action Research Data Manipulation and Crosstabs
1 Action Research: Data Manipulation and Crosstabs
2 Parametric vs. Nonparametric
- Statistical tests fall into two broad categories: parametric and nonparametric
- Parametric methods
- Require data at higher levels of measurement (interval and/or ratio scales)
- Are more mathematically powerful than nonparametric statistics
- But often require more assumptions about the data, such as having a normal distribution, or equal variances
3 Parametric vs. Nonparametric
- Nonparametric methods
- Use nominal or ordinal scale data
- Still allow us to test for a relationship, and its strength and direction (direction only if ordinal)
- Often have easier prerequisites for being tested (e.g. no distribution requirements)
- Ratio or interval scale data may be recoded to become nominal or ordinal data, and hence be used with nonparametric tests
4 Significance and Association
- Both are useful for inferring population values from samples (inferential statistics)
- Significance establishes whether chance can be ruled out as the most likely explanation of differences
- Association shows the nature, strength, and/or direction of the relationship between two (or among three or more) variables
- Need to show significance before association is meaningful
5 Common Tests of Significance
- We've been introduced to three common tests of significance
- z test (large samples of ratio or interval data)
- t test (small samples of ratio or interval data)
- F test (ANOVA)
- Shortly we'll explore a fourth one
- Pearson's chi-square, χ² (used for nominal or ordinal scale data)
χ is the Greek letter chi, pronounced "kye", rhymes with "rye"
6 Common Measures of Association
- Association measures often range in value from -1 to 1 (but not always!)
- Absence of association between variables generally means a result of 0
- Examples
- Pearson's r (for interval or ratio scale data)
- Yule's Q (ordinal data in a 2x2 table)
- Gamma (ordinal data in a table larger than 2x2)
A 2x2 table has 2 rows and 2 columns of data.
7 Common Measures of Association
- Notice these are all for nominal scale data
- Phi (φ, "fee") (nominal data in a 2x2 table)
- Contingency Coefficient (nominal table larger than 2x2)
- Cramér's V (nominal, larger than 2x2)
- Lambda (λ), nominal data
- Eta (η), nominal data
8 Significance and Association
- Tests of significance and measures of association are often used together
- But you can have statistical significance without having association
9 Significance and Association Examples
- Ratio data: You might use F to determine if there is a significant relationship, then use r from a regression to measure its strength
- Ordinal data: You might run a chi-square to determine statistical significance in the frequencies of two variables, and then run a Yule's Q to show the relationship between the variables
10 Crosstabs
- Brief digression to introduce crosstabs before discussing nonparametric methods
- A crosstab is a table, often used to display data sorted by two nominal or ordinal variables at once, to study the relationship between variables that each have a small number of possible values
- Generally contains basic descriptive statistics, such as frequency counts and percentages
11 Crosstabs
- Used to check the distribution of data, and as a foundation for more complex tests
- Look for gaps or sparse data (little or no contribution to the data set)
- Rule of thumb: put the independent variable in the columns and the dependent variable in the rows
12 Percentages
- Can show both column and row percentages in crosstabs, rather than just frequency counts (or show both counts and percentages)
- Make sure percentages add to 100!
- Raw frequency counts of variables don't always provide an accurate picture
- Unequal numbers of subjects in groups (N) might make the numbers appear skewed
13 Crosstabs Example
- Open data set GSS91 political.sav
- Use Analyze / Descriptive Statistics / Crosstabs...
- Set the Row(s) as region, and the Column(s) as relig
- Note the default scope of an SPSS crosstab is to show frequency Counts, with row and column totals
14 Crosstabs Example
15 Crosstabs Example
- Repeat the same example with percentages selected under the Cells button to get detailed data in each cell
- Percent within that region (Row)
- Percent within that religious preference (Column)
- Percent of total data set (divide by Total N)
- Gets a bit messy to show this much!
16 Crosstabs Example
17 Recoding
- An interval or ratio scaled variable, like age or salary, may have too many distinct values to use in a crosstab
- Recoding lets you combine values into a single new variable, also called "collapsing the codes"
- Also helpful for creating histogram variables (e.g. ranges of age or income)
18 Recoding Example
- Use Transform / Recode / Into Different Variables
- Move age from the dropdown list for the Numeric Variable
- Define the new Output Variable to have Name "agegroup" and Label "Age Group"
- Click the Change button to use agegroup
- Click on the Old and New Values button
19 Recoding Example
- For the Old Value, enter a Range of 18 to 30
- Assign this to a New Value of 1
- Click on Add
- Repeat to define ages 31-50 as agegroup New Value 2, 51-75 as 3, and 76-200 as 4
- Click Continue, and now a new variable exists as defined
20 Recoding Example
21 Recoding Example
- Now generate a crosstab with agegroup as the columns, and region as the rows
22 Second Recoding Example
- Prof. Yonker had a previous INFO515 class surveyed for their height (in inches) and desired salaries (per year)
- Rather than analyze ratio data with few frequencies larger than one, she recoded
- Heights into "Dwarves" for people below average height, and "Giants" for those above
- Desired salaries were recoded into "Cheap" and "Expensive", again below and above average
23 Second Recoding Example
- The resulting crosstab looked like this
24 Pearson Chi-Square Test
- The chi-square test measures how much observed (actual) frequencies (fo) differ from expected frequencies (fe)
- Is a nonparametric test, a.k.a. the Goodness of Fit statistic
- Does not require assumptions about the shape of the population distribution
- Does not require variables be measured on an interval or ratio scale
25 Chi-Square Concept
- The chi-square test is like the ANOVA test
- ANOVA tested whether there was a difference among several means, i.e. whether the means differ from each other in some way
- Chi-square tests whether the frequency distribution differs from a random one: is there a significant difference among frequencies?
- Allows us to test for a relationship (but not its strength or direction, if there is one)
26 Chi-Square Null Hypothesis
- Null hypothesis is that the frequencies in cells are independent of each other (there is no relationship among them)
- Each case is independent of every other case; that is, the value of the variable for one individual does not influence the value for another individual
- Chi-square works better for small sample sizes (less than hundreds of samples)
- WARNING: Almost any really large table will have a significant chi-square
27 Assumptions for Chi-Square
- A random sample is the expected basis for comparison
- Each case can fall into only one cell
- No zero values are allowed for the observed frequency, fo
- And no expected frequencies, fe, less than one
- At least 80% of expected frequencies, fe, should be greater than or equal to five (5)
28 Expected Frequency
- The expected frequency for a cell is based on the fraction of cases which would fall into it randomly, given the same general row and column count proportions as the actual data set
- fe = (row total)(column total) / N
- So if 90 people live in New England, and 335 are in Age Group 1 from a total sample of 1500, then we would expect fe = 90 × 335 / 1500 = 20.1 people in that cell
See slide 21
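The slide's fe arithmetic can be checked with a tiny sketch (the function name is ours, for illustration):

```python
def expected_frequency(row_total, col_total, n):
    """fe = (row total)(column total) / N for one crosstab cell."""
    return row_total * col_total / n

# The slide's example: 90 New Englanders, 335 in Age Group 1, N = 1500
fe = expected_frequency(90, 335, 1500)
print(round(fe, 1))  # 20.1
```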
29 Expected Frequency
- So the general formula for the expected frequency of a given cell is fe = (actual row total)(actual column total) / N
- Notice that this is NOT using the average expected frequency for every cell, fe = N / ((# of rows)(# of columns))
30 Calculating Chi-Square
- The chi-square value for each cell is the observed frequency minus the expected one, squared, divided by the expected frequency: chi-square per cell = (fo - fe)² / fe
- Sum this for all cells in the crosstab
- For the cell on slide 28, the actual frequency was 25, so chi-square for that cell is (25 - 20.1)² / 20.1 = 1.195
- Note: chi-square is always positive
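A minimal sketch of the per-cell sum, using the slide's fo = 25, fe = 20.1 cell:

```python
def chi_square(observed, expected):
    """Sum of (fo - fe)^2 / fe over all cells, given parallel lists."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Just the one cell from the slide: fo = 25, fe = 20.1
cell = chi_square([25], [20.1])
print(round(cell, 3))  # 1.195
```

For a full crosstab you would pass all cells' observed and expected frequencies in the two lists.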
31 Calculating Chi-Square
- Page 36/37 of the Action Research handout has an example of chi-square calculation, where fo is the observed (actual) frequency and fe is the expected frequency
- E.g. fe for the first cell is 20 × 30 / 60 = 10.0
- Chi-square for each cell is (fo - fe)² / fe
- Sum chi-square for all cells in the table
No comments about fe fi fo fum! Is that clear?!?!
32 Interpreting Chi-Square
- When the total chi-square is larger than the critical value, reject the null hypothesis
- See Action Research handout page 42/43 for critical chi-square (χ²) values
- Look up the critical value using the df value, which is based on the number of rows and columns in the crosstab: df = (rows - 1)(columns - 1)
- For the example on slide 21, df = (9 - 1)(4 - 1) = 8 × 3 = 24
33 Interpreting Chi-Square
- Or you can be lazy and use the old standby:
- If the significance is less than 0.050, reject the null hypothesis
34 Chi-Square Example
- Open data set GSS91 political.sav
- Use Analyze / Descriptive Statistics / Crosstabs...
- Set the Row(s) as region, and the Column(s) as agegroup
- Click on Statistics and select the Chi-square test
Notice we're still using the Crosstab command!
35 Chi-Square Example
36 Chi-Square Example
- Note that we correctly predicted the df value of 24
- SPSS is ready to warn you if too many cells expected a count below five, or had expected counts below one
- The significance is below 0.050, indicating we reject the null hypothesis
- The total chi-square for all cells is 43.260
37 Chi-Square Example
- The critical chi-square value can be looked up on page 42/43 of Yonker
- For df = 24, and significance level 0.050, we get a critical chi-square of 36.415
- Since the actual chi-square (43.260) is greater than the critical value (36.415), reject the null hypothesis
- Chi-square often shows significance falsely for large sample sizes (hence the earlier warning)
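Instead of a printed table, the critical value can be looked up with SciPy (assuming scipy is installed; chi2.ppf is the inverse CDF of the chi-square distribution):

```python
from scipy.stats import chi2

# Critical chi-square for alpha = 0.05 with df = 24
critical = chi2.ppf(0.95, df=24)
print(round(critical, 3))  # 36.415

# Compare against the statistic reported in the SPSS output
observed = 43.260
print(observed > critical)  # True -> reject the null hypothesis
```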
38 Chi-Square Example
- What are the other tests? They don't apply here...
- The Likelihood Ratio test is specifically for log-linear models
- The Linear-by-Linear Association test is a function of Pearson's r, so it only applies to interval or ratio scale variables
- Notice that SPSS doesn't realize those tests don't apply, and blindly presents results for them
39 One-Variable Chi-Square Test
- To check only one variable's distribution, there is another way to run chi-square
- Null hypothesis is that the variable is evenly distributed across all of its categories
- Hence all expected frequencies are equal for each category, unless you specify otherwise
- Expected range can also be specified
40 Other Chi-Square Example
- Use Analyze / Nonparametric Tests / Chi-square
- NOT using the Crosstab command here
- Add region to the Test Variable List
- Now df is the number of categories in the variable, minus one: df = (# of categories) - 1
- Significance is interpreted the same
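A sketch of the one-variable test with scipy.stats.chisquare, using made-up counts for a hypothetical four-category variable (null hypothesis: evenly distributed, so expected defaults to the mean count):

```python
from scipy.stats import chisquare

# Hypothetical counts for a 4-category variable; N = 80, so fe = 20 each
observed = [18, 22, 25, 15]
stat, p = chisquare(observed)   # df = 4 categories - 1 = 3
print(round(stat, 2))  # 2.9
print(p > 0.05)        # True -> cannot reject H0; roughly even distribution
```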
41 Other Chi-Square Example
42 Other Chi-Square Example
- So in this case, the region variable has nine categories, for a df of 9 - 1 = 8
- Critical chi-square for df = 8 is 15.507, so the actual value of 290 shows these data are not evenly distributed across regions
- Significance below 0.050 still, in keeping with our fine long-established tradition, rejects the null hypothesis
43 Whodunit?
- The chi-square value by itself doesn't tell us which of the cells are major contributors to the statistical significance
- We compute the standardized residual to address that issue
- This hints at which cells contribute a lot to the total chi-square
44 Residuals
- The Residual is the Observed value minus the Expected value for some data point
- Residual = fo - fe
- If this variable is evenly distributed, the Residuals should have a normal distribution
- Plots of residuals are sometimes used to check data normality (i.e. how normal is this data's distribution?)
45 Standardized Residual
- The Standardized Residual is the Residual divided by the standard deviation of the residuals
- When the absolute value of the Standardized Residual for a cell is greater than 2, you may conclude that it is a major contributor to the overall chi-square value
- Analogous to the original t test, looking for |t| > 2
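A sketch of the per-cell computation. In SPSS crosstabs the standardized residual for a cell is computed as (fo - fe) / √fe, whose square is exactly that cell's chi-square contribution:

```python
import math

def standardized_residual(fo, fe):
    """(fo - fe) / sqrt(fe): per-cell standardized residual as SPSS crosstabs
    report it; its square equals the cell's chi-square contribution."""
    return (fo - fe) / math.sqrt(fe)

# The earlier example cell: fo = 25, fe = 20.1
r = standardized_residual(25, 20.1)
print(round(r, 2))       # 1.09
print(round(r * r, 3))   # 1.195 -- matches the cell's chi-square
```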
46 Standardized Residual
- Extreme values of the Standardized Residual (e.g. minimum, maximum) can also help identify extreme data points
- The meaning of "residual" is the same for regression analysis, BTW, where residuals are an optional output
47 Standardized Residual Example
- In the crosstab region-agegroup example
- Click Cells and select Standardized Residuals
- In this case, the worst cell is the combination of the W. Nor. Central region and Age Group 4, which produced a standardized residual of 2.1
48 Standardized Residual Example
49 Crosstab Statistics for 2x2 Tables
- 2x2 tables appear so often that many tests have been developed specifically for them
- Equality of proportions
- McNemar chi-square
- Yates' Correction
- Fisher Exact Test
50 Crosstab Statistics for 2x2 Tables
- Equality of proportions tests whether the proportion on one variable is the same across two different values of another variable
- e.g. Do homeowners vote as often as renters?
- McNemar chi-square tests for frequencies in a 2x2 table where samples are dependent (such as pre-test and post-test results)
51 Crosstab Statistics for 2x2 Tables
- Yates' Correction for Continuity chi-square is refined for small observed frequencies
- Chi-square per cell = (|fo - fe| - 0.5)² / fe
- Corrections are too conservative; don't use!
- The Fisher Exact Test assumes row/column frequencies remain fixed, and computes all possible tables; gives a significance value like chi-square
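For reference only (the slide advises against using the correction), the Yates-corrected per-cell term subtracts 0.5 from the absolute deviation before squaring:

```python
def yates_cell(fo, fe):
    """Yates continuity-corrected cell term: (|fo - fe| - 0.5)^2 / fe."""
    return (abs(fo - fe) - 0.5) ** 2 / fe

# Same cell as earlier (fo = 25, fe = 20.1); compare 1.195 uncorrected
print(round(yates_cell(25, 20.1), 3))  # 0.963
```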
52 Nominal Measures of Association
- Each measure is tested against the null hypothesis that it is zero, using different scales
- Phi
- Cramér's V
- Contingency Coefficient
- All three are zero iff chi-square is zero
- "iff" is mathspeak for "if and only if"
53 Nominal Measures of Association
- The usual significance criterion is used for all three
- If significance < 0.050, reject the null hypothesis, hence the association is significant
- Notice that direction is meaningless for nominal variables, so only the strength of an association can be determined
54 Phi
- For a 2x2 table, Phi and Cramér's V are equal to Pearson's r
- Phi (φ) can be > 1, making it an unusual measure of association
- Phi = sqrt(chi-square / N)
- Phi = 0 means no association
- Phi near or over 1 means strong association
55 Cramér's V
- Cramér's V ≤ 1
- V = sqrt(chi-square / (N(k - 1))), where k is the smaller of the number of columns or rows
- Is a better measure than the Contingency Coefficient for tables larger than 2x2
56 Contingency Coefficient
- a.k.a. C, Pearson's C, or Pearson's Contingency Coefficient
- Most widely used measure based on chi-square
- Requires only nominal data
- C has a value of 0 when there is no association
57 Contingency Coefficient
- The maximum possible value of C is the square root of (the number of columns minus 1, divided by the number of columns): Cmax = sqrt((columns - 1) / columns)
- C = sqrt(chi-square / (chi-square + N))
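All three chi-square-based measures are one-liners. The sketch below plugs in the earlier example's chi-square of 43.260, with an assumed N of 1500 (the sample size used in the expected-frequency slide) and k = 4 (the smaller table dimension):

```python
import math

def phi(chi_sq, n):
    """Phi = sqrt(chi-square / N)."""
    return math.sqrt(chi_sq / n)

def cramers_v(chi_sq, n, k):
    """V = sqrt(chi-square / (N(k - 1))), k = min(rows, columns)."""
    return math.sqrt(chi_sq / (n * (k - 1)))

def contingency_c(chi_sq, n):
    """C = sqrt(chi-square / (chi-square + N))."""
    return math.sqrt(chi_sq / (chi_sq + n))

chi_sq, n, k = 43.260, 1500, 4  # assumed values from the worked example
print(round(phi(chi_sq, n), 2))           # 0.17
print(round(cramers_v(chi_sq, n, k), 3))  # 0.098
print(round(contingency_c(chi_sq, n), 3)) # 0.167
```

All three agree the association is weak here, even though the chi-square was significant, illustrating the slide's point that significance and association are different questions.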