Title: The Tabachnick and Fidell Sample Problem
Our second problem comes from the Logistic
Regression chapter from Barbara G. Tabachnick and
Linda S. Fidell, Using Multivariate Statistics,
Third Edition. The problem is to examine the
relationship between work status of women
(employed outside the home or housewife) and four
attitudinal predictors: locus of control,
attitude toward current marital status, attitude
toward women's rights, and attitude toward
housework. The data were collected as part of a
study on women's health and drugs. The data for
this problem are in WorkStatusOfWomen.SAV.
Tabachnick and Fidell Sample Problem
Stage 1: Define the Research Problem
In this stage, the following issues are addressed:
- Relationship to be analyzed
- Specifying the dependent and independent variables
- Method for including independent variables
Relationship to be analyzed
The goal of this analysis is to determine the
relationship between the dependent variable
WORKSTAT 'Current work status', and the
independent variables of CONTROL 'Locus of
control', ATTMAR 'Satisfaction with current
marital status', ATTROLE 'Attitude toward role
of women', and ATTHOUSE 'Attitude toward
housework'.
Specifying the dependent and independent variables
- The dependent variable is WORKSTAT 'Current work
  status', a dichotomous variable, where 1
  indicates working outside the home and 0
  indicates a housewife.
- The independent variables are:
  - CONTROL 'Locus of control'
  - ATTMAR 'Satisfaction with current marital status'
  - ATTROLE 'Attitude toward role of women'
  - ATTHOUSE 'Attitude toward housework'
Method for including independent variables
Since we are interested in the relationship
between the dependent variable and all of the
independent variables, we will use direct entry
of the independent variables.
Stage 2: Develop the Analysis Plan (Sample Size Issues)
In this stage, the following issues are addressed:
- Missing data analysis
- Minimum sample size requirement: 15-20 cases per independent variable
Missing data analysis
There is no missing data in this problem.
Minimum sample size requirement: 15-20 cases per
independent variable
The data set has 440 cases and 4 independent
variables for a ratio of 110 to 1, well in excess
of the requirement that we have 15-20 cases per
independent variable.
Stage 2: Develop the Analysis Plan (Measurement Issues)
In this stage, the following issues are addressed:
- Incorporating nonmetric data with dummy variables
- Representing Curvilinear Effects with Polynomials
- Representing Interaction or Moderator Effects
Incorporating Nonmetric Data with Dummy Variables
There are no nonmetric independent variables.
Representing Curvilinear Effects with Polynomials
We do not have any evidence of curvilinear
effects at this point in the analysis.
Representing Interaction or Moderator Effects
We do not have any evidence at this point in the
analysis that we should add interaction or
moderator variables.
Stage 3: Evaluate Underlying Assumptions
In this stage, the following issues are addressed:
- Nonmetric dependent variable with two groups
- Metric or dummy-coded independent variables
Nonmetric dependent variable having two groups
The dependent variable WORKSTAT 'Current work
status' is a dichotomous variable.
Metric or dummy-coded independent variables
The independent variables CONTROL 'Locus of
control', ATTMAR 'Satisfaction with current
marital status', ATTROLE 'Attitude toward role
of women', and ATTHOUSE 'Attitude toward
housework' are all metric variables.
Stage 4: Estimation of Logistic Regression and
Assessing Overall Fit (Model Estimation)
In this stage, the following issues are addressed:
- Compute logistic regression model
Compute the logistic regression
The steps to obtain a logistic regression
analysis are detailed on the following screens.
Requesting a Logistic Regression
Specifying the Dependent Variable
Specifying the Independent Variables
Specify the method for entering variables
Specifying Options to Include in the Output
Specifying the New Variables to Save
Complete the Logistic Regression Request
Stage 4: Estimation of Logistic Regression and
Assessing Overall Fit (Assessing Model Fit)
In this stage, the following issues are addressed:
- Significance test of the model log likelihood (Change in -2LL)
- Measures Analogous to R²: Cox and Snell R² and Nagelkerke R²
- Hosmer-Lemeshow Goodness-of-fit
- Classification matrices as a measure of model accuracy
- Check for Numerical Problems
- Presence of outliers
Initial statistics before independent variables
are included
The initial log likelihood function (-2 Log
Likelihood or -2LL) is a statistical measure like
the total sum of squares in regression. If our
independent variables have a relationship to the
dependent variable, we will improve our ability
to predict the dependent variable accurately, and
the -2LL value will decrease. The
initial -2LL value is 607.922 on step 0, before
any variables have been added to the model.
Significance test of the model log likelihood
The difference between the initial -2LL and the
-2LL for the model containing the independent
variables is the model chi-square value (21.779 =
607.922 - 586.143), which is tested for statistical
significance. This test is analogous to the
F-test for R² or change in R² value in multiple
regression, which tests whether or not the
improvement in the model associated with the
additional variables is statistically significant.
In this problem the model chi-square value of
21.779 has a reported significance of 0.000
(p < 0.001), less than 0.05, so we conclude that
there is a significant relationship between the
dependent variable and the set of independent
variables.
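The arithmetic behind this test can be sketched in a few lines. This is a minimal illustration, not SPSS's own routine: the function name chi2_sf_even_df is our own, and the 4 degrees of freedom are an assumption based on the four predictors being entered together in one step. The closed-form tail probability used here is valid only for an even number of degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square value (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

initial_neg2ll = 607.922   # -2LL at step 0 (from the text)
model_neg2ll = 586.143     # -2LL with the four predictors (from the text)
df = 4                     # assumption: one df per predictor entered

model_chi_square = initial_neg2ll - model_neg2ll
p_value = chi2_sf_even_df(model_chi_square, df)

print(f"model chi-square = {model_chi_square:.3f}")
print(f"p-value          = {p_value:.6f}")
```

Under these assumptions the p-value is far below 0.05, which is why SPSS displays it as 0.000 when rounding to three decimals.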
Measures Analogous to R²
The next SPSS outputs indicate the strength of
the relationship between the dependent variable
and the independent variables, analogous to the
R² measures in multiple regression.
The Cox and Snell R² measure operates like R²,
with higher values indicating greater model fit.
However, this measure is limited in that it
cannot reach the maximum value of 1, so
Nagelkerke proposed a modification that had the
range from 0 to 1. We will rely upon
Nagelkerke's measure as indicating the strength
of the relationship. The strength of the
relationship would be regarded as weak given a
Nagelkerke R² value of 0.064.
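As a rough check, both measures can be recomputed from the -2LL values quoted earlier. This sketch uses the standard textbook formulas for the Cox and Snell and Nagelkerke measures and assumes all 440 cases entered the analysis; the variable names are our own.

```python
import math

n_cases = 440              # sample size stated in the text
initial_neg2ll = 607.922   # -2LL for the model with no predictors
model_neg2ll = 586.143     # -2LL for the model with the four predictors

# Cox and Snell: 1 - (L0 / L1)^(2/n), written in terms of -2LL values
cox_snell = 1.0 - math.exp(-(initial_neg2ll - model_neg2ll) / n_cases)

# The largest value Cox and Snell can reach for these data: 1 - L0^(2/n)
cox_snell_max = 1.0 - math.exp(-initial_neg2ll / n_cases)

# Nagelkerke rescales Cox and Snell so that the maximum is 1
nagelkerke = cox_snell / cox_snell_max

print(f"Cox and Snell R2 = {cox_snell:.4f}")
print(f"Nagelkerke R2    = {nagelkerke:.4f}")
```

The Nagelkerke value from this calculation agrees with the 0.064 reported above to within rounding.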
Correspondence of Actual and Predicted Values of
the Dependent Variable
The final measure of model fit is the Hosmer and
Lemeshow goodness-of-fit statistic, which
measures the correspondence between the actual
and predicted values of the dependent variable.
In this case, better model fit is indicated by a
smaller difference in the observed and predicted
classification. A good model fit is indicated by
a nonsignificant chi-square value.
The goodness-of-fit measure has a value of 9.2611
which has the desirable outcome of
nonsignificance.
The Classification Matrices as a Measure of Model
Accuracy
The classification matrices in logistic
regression serve the same function as the
classification matrices in discriminant analysis,
i.e. evaluating the accuracy of the model.
If the predicted and actual group memberships are
the same, i.e. 1 and 1 or 0 and 0, then the
prediction is accurate for that case. If
predicted group membership and actual group
membership are different, the model "misses" for
that case. The overall percentage of accurate
predictions (77.4% in this case) is the measure
of a model that I rely on most heavily for this
analysis as well as for discriminant analysis
because it has a meaning that is readily
communicated, i.e. the percentage of cases for
which our model predicts accurately.
To evaluate the accuracy of the model, we compute
the proportional by chance accuracy rate and the
maximum by chance accuracy rate, if appropriate.
The proportional by chance accuracy rate is
equal to 0.530 (0.623² + 0.377²). A 25%
increase over the proportional by chance accuracy
rate would equal 0.663. Our model accuracy rate
of 77.4% meets this criterion. Since one of our
groups contains 62.3% of the cases, we might also
apply the maximum by chance criterion. A 25%
increase over the largest group would equal
0.778. Our model accuracy rate of 77.4% almost
meets this criterion.
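The two by-chance benchmarks can be sketched as follows. This is only an illustration using the rounded group proportions quoted above (0.623 and 0.377); the function name by_chance_criteria is our own. Because the inputs are rounded, the maximum by chance criterion comes out near 0.779 rather than the 0.778 quoted in the text.

```python
def by_chance_criteria(group_proportions, improvement=0.25):
    """Return the proportional and maximum by-chance rates and their 25% criteria."""
    proportional = sum(p * p for p in group_proportions)  # sum of squared proportions
    maximum = max(group_proportions)                      # share of the largest group
    return {
        "proportional": proportional,
        "proportional_criterion": proportional * (1.0 + improvement),
        "maximum": maximum,
        "maximum_criterion": maximum * (1.0 + improvement),
    }

rates = by_chance_criteria([0.623, 0.377])
model_accuracy = 0.774  # overall accuracy rate reported in the text

for name, value in rates.items():
    print(f"{name:24s} {value:.3f}")
print("meets proportional criterion:", model_accuracy >= rates["proportional_criterion"])
print("meets maximum criterion:     ", model_accuracy >= rates["maximum_criterion"])
```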
Stacked Histogram
SPSS provides a visual image of the
classification accuracy in the stacked histogram
as shown below. To the extent to which the
cases in one group cluster on the left and the
other group clusters on the right, the predictive
accuracy of the model will be higher. As we can
see in this plot, the two groups are completely
overlapping. The available variables are not
useful in distinguishing between housewives and
women employed outside the home. This makes some
sense to me from the perspective that the
selection of independent variables all imply that
whether a subject works outside the home is
strictly a matter of choice or personality
dynamics, not impacted by external forces, like
family economics.
Check for Numerical Problems
There are several numerical problems that can occur in
logistic regression that are not detected by SPSS
or other statistical packages: multicollinearity
among the independent variables, zero cells for a
dummy-coded independent variable because all of
the subjects have the same value for the
variable, and "complete separation" whereby the
two groups in the dependent variable can be
perfectly separated by scores on one of the
independent variables. All of these problems
produce large standard errors (over 2) for the
variables included in the analysis and very often
produce very large B coefficients as well. If we
encounter large standard errors for the predictor
variables, we should examine frequency tables,
one-way ANOVAs, and correlations for the
variables involved to try to identify the source
of the problem.
The standard errors and B coefficients are not
excessively large, so there is no evidence of a
numeric problem with this analysis.
Presence of outliers
There are two outputs to alert us to outliers
that we might consider excluding from the
analysis: a listing of residuals and Cook's
distance scores saved to the data set. SPSS provides
a casewise list of residuals that identifies cases
whose residual is above or below a certain number
of standard deviation units. As in multiple
regression, there are a variety of ways to compute
the residual. In logistic regression, the
residual is the difference between the observed
probability of the dependent variable event and
the predicted probability based on the model.
The standardized residual is the residual divided
by an estimate of its standard deviation. The
deviance is calculated by taking the square root
of -2 x the log of the predicted probability for
the observed group and attaching a negative sign
if the event did not occur for that case. Large
values for deviance indicate that the model does
not fit the case well. The studentized residual
for a case is the change in the model deviance if
the case is excluded. Discrepancies between the
deviance and the studentized residual may
identify unusual cases. (See the SPSS chapter on
Logistic Regression Analysis for additional
details.) In the output for our problem, SPSS
informs us that there are no outliers in this
analysis.
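The deviance calculation described above can be written out directly. This is a minimal sketch of that verbal definition, not SPSS's internal code; the function name and the example probabilities are hypothetical and are not taken from the data set.

```python
import math

def deviance_residual(event_occurred, predicted_prob_event):
    """Deviance residual for one case, following the verbal definition in the text."""
    # Predicted probability of the group the case was actually observed in
    if event_occurred:
        p_observed_group = predicted_prob_event
    else:
        p_observed_group = 1.0 - predicted_prob_event
    magnitude = math.sqrt(-2.0 * math.log(p_observed_group))
    # Attach a negative sign if the event did not occur for this case
    return magnitude if event_occurred else -magnitude

# Hypothetical cases: (event occurred?, predicted probability of the event)
examples = [(True, 0.90), (True, 0.50), (False, 0.50), (False, 0.95)]
for occurred, prob in examples:
    print(occurred, prob, round(deviance_residual(occurred, prob), 3))
```

A case that is predicted well (event occurred, predicted probability 0.90) has a deviance near zero, while a case that is predicted badly (event did not occur, predicted probability 0.95) has a large negative deviance.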
Cook's Distance
SPSS has an option to compute Cook's distance as
a measure of influential cases and add the score
to the data editor. I am not aware of a precise
formula for determining what cutoff value should
be used, so we will rely on the more traditional
method for interpreting Cook's distance which is
to identify cases that either have a score of 1.0
or higher, or cases which have a Cook's distance
substantially different from the others. The
prescribed method for detecting unusually large
Cook's distance scores is to create a scatterplot
of Cook's distance scores versus case id.
Request the Scatterplot
Specifying the Variables for the Scatterplot
The Scatterplot of Cook's Distances
On the plot of Cook's distances shown below, we
see no cases that exceed the 1.0 rule of thumb
for influential cases. We do, however, identify
cases that have relatively larger Cook's distance
values (above 0.06) than the majority of cases.
I ran the logistic regression procedure again
without these six cases, but found that the
overall accuracy of the model increased only from
60.00% to 60.37%. With such a minor improvement,
I decided to retain all cases in this analysis.
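The same screening can be done without a plot once Cook's distance has been saved for each case. The sketch below applies the 1.0 rule of thumb and the 0.06 visual cutoff mentioned above; the function name and the distance values are hypothetical illustrations, not values from the data set.

```python
def flag_influential(cooks_by_case, rule_of_thumb=1.0, relative_cutoff=0.06):
    """Split case ids into those at or above the rule of thumb and those only relatively large."""
    over_rule = sorted(
        case for case, d in cooks_by_case.items() if d >= rule_of_thumb
    )
    relatively_large = sorted(
        case for case, d in cooks_by_case.items()
        if relative_cutoff < d < rule_of_thumb
    )
    return over_rule, relatively_large

# Hypothetical Cook's distance values keyed by case id (illustration only)
example = {1: 0.004, 2: 0.071, 3: 0.012, 4: 0.065, 5: 0.002}

over_rule, relatively_large = flag_influential(example)
print("at or above 1.0:", over_rule)
print("above 0.06 only:", relatively_large)
```

Cases in the second list are candidates for the kind of sensitivity check described above, i.e. re-running the analysis without them and comparing accuracy rates.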
Stage 5: Interpret the Results
In this section, we address the following issues:
- Identifying the statistically significant predictor variables
- Direction of relationship and contribution to dependent variable
Identifying the statistically significant
predictor variables
The coefficients are found in the column labeled
B. The test that a coefficient is not zero,
i.e. that the variable changes the odds of the
dependent variable event, uses the Wald statistic
instead of the t-test that was used for the individual B
coefficients in the multiple regression equation.
Only one independent variable, ATTROLE 'Attitude
toward role of women', has a statistically
significant relationship to work status.
Direction of relationship and contribution to
dependent variable
The sign of the coefficient for the only
statistically significant predictor, ATTROLE
'Attitude toward role of women', is negative,
indicating an inverse relationship, i.e. higher
scores on this variable were associated with
a lower likelihood of belonging to the dependent
variable group coded 1, Working Outside the Home.
Women's Attitude Toward Role is a scale that
measures conservative versus liberal attitudes,
with higher scores indicating more conservative
attitudes. Given that Working Outside the Home is
the option coded 1 on the Work Status variable,
we can interpret the odds ratio (Exp(B) = .936)
associated with ATTROLE as indicating that the
more conservative a respondent's attitude, the
less likely she is to be working outside the home.
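To make the odds ratio concrete, the sketch below converts Exp(B) = .936 back to the B coefficient and to a percentage change in the odds. The 10-point comparison is a hypothetical illustration of how the effect compounds across scale points, not a result reported in the output.

```python
import math

exp_b = 0.936                      # odds ratio for ATTROLE reported in the text

b = math.log(exp_b)                # the B coefficient implied by Exp(B)
pct_change_per_point = (exp_b - 1.0) * 100.0

# Hypothetical comparison: two women whose ATTROLE scores differ by 10 points
points = 10
odds_multiplier = exp_b ** points

print(f"B                         = {b:.4f}")
print(f"change in odds per point  = {pct_change_per_point:.1f}%")
print(f"odds multiplier, {points} points = {odds_multiplier:.3f}")
```

Each additional point on the scale multiplies the odds of working outside the home by .936, i.e. lowers them by about 6.4%; across a 10-point difference the odds are roughly halved.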
Stage 6: Validate The Model
In this stage, we are normally concerned with the
following issues:
- Creating the Selection Variable
- Computing the Split-half Analysis
- The Output for the Validation Analysis
Conducting the Validation Analysis
To validate the logistic regression, we can
randomly divide our sample into two groups, a
screening sample and a validation sample. The
analysis is computed for the screening sample and
used to predict membership on the dependent
variable in the validation sample. If the model
in the screening sample is valid, we would expect
the accuracy rates for both samples to be
about the same. In the double cross-validation
strategy, we reverse the designation of the
screening and validation sample and re-run the
analysis. We can then compare the significant
independent variables found for both screening
samples. If the two screening analyses contain a
very different set of significant variables, it
indicates that the variables might have achieved
significance because of the sample size and not
because of the strength of the relationship. Our
findings about these individual variables would
be that the predictive utility of these variables is
not generalizable.
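The random split described above can be sketched as follows. This is a generic illustration rather than the SPSS procedure itself: the function name, the seed value, and the use of 0-based case positions are all assumptions made for the example.

```python
import random

def make_split_variable(n_cases, seed):
    """Assign each case a 0/1 selection value at random, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the split can be recreated
    return [1 if rng.random() > 0.5 else 0 for _ in range(n_cases)]

n_cases = 440                  # sample size stated in the text
split = make_split_variable(n_cases, seed=12345)  # hypothetical seed value

# First analysis: cases with split == 1 screen, cases with split == 0 validate
screening_first = [i for i, s in enumerate(split) if s == 1]
validation_first = [i for i, s in enumerate(split) if s == 0]

# Second analysis (double cross-validation): reverse the two roles
screening_second, validation_second = validation_first, screening_first

print(len(screening_first), len(validation_first))
```

Fixing the seed matters because the comparison of significant predictors across the two screening samples is only reproducible if the same split can be regenerated.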
Set the Starting Point for Random Number
Generation
Compute the Variable to Randomly Split the Sample
into Two Halves
Specify the Cases to Include in the First
Screening Sample
Specify the Value of the Selection Variable for
the First Validation Analysis
Specify the Value of the Selection Variable for
the Second Validation Analysis
Generalizability of the Logistic Regression Model
The model chi-square value for the second
validation analysis does not achieve statistical
significance, suggesting that the finding of an
overall statistical relationship is suspect, and
should not be generalized to the larger
population of women. Moreover, the accuracy
rates for the validation samples fall below the
25% increase over the proportional by chance
accuracy rates, implying that by chance alone, we
could do almost as good a job at predicting work
status as what we obtained in the statistical
results.