Title: SW388R7
1Logistic Regression - Stepwise Entry of Variables
- Sample Problem
- Steps in Solving Problems
2Level of Measurement - question
The first question requires us to examine the
level of measurement requirements for binary
logistic regression. Binary logistic regression
requires that the dependent variable be
dichotomous and the independent variables be
metric or dichotomous.
3Level of Measurement - evidence and answer
True with caution is the correct answer, since we
satisfy the level of measurement requirements,
but include ordinal level variables in the
analysis.
4Sample Size - question
The second question asks about the sample size
requirements for binary logistic regression. To
answer this question, we will run a baseline
logistic regression to obtain some basic data
about the problem and solution. The phrase
stepwise entry dictates the method for
including variables in the model.
5Request stepwise logistic regression
Select the Regression Binary Logistic command
from the Analyze menu.
6Selecting the dependent variable
First, highlight the dependent variable uswary in
the list of variables.
Second, click on the right arrow button to move
the dependent variable to the Dependent text box.
7Adding the independent variables
First, move the predictors to the Covariates list
box.
8Specifying the method for including variables
In our stepwise logistic regression, we specify
the Forward Conditional method for adding
variables. This is one of the available methods
for doing stepwise logistic regression.
9Adding options to the output
To add a summary of steps at the end of the
analysis and specifications for stepwise method,
click on the Options button.
10Set the option for listing outliers
First, mark the checkbox for Casewise listing of
residuals, accepting the default of outliers
outside 2 standard deviations.
Second, click on the At last step option to
display the table of outliers only at the end of
the analysis.
11Specifications for stepwise method
Click on the Continue button to close the dialog
box.
We can change the criteria for adding and
removing variables from the analysis by changing
the probability for entry and removal. We will
use the default level of significance of 0.05 for
entry and 0.10 for removal.
12Completing the logistic regression request
Click on the OK button to request the output for
the logistic regression.
13Sample size - ratio of cases to variables
The minimum ratio of valid cases to independent
variables for stepwise logistic regression is 10
to 1, with a preferred ratio of 50 to 1. In this
analysis, there are 136 valid cases and 3
independent variables. The ratio of cases to
independent variables is 45.33 to 1, which
satisfies the minimum requirement. However, the
ratio of 45.33 to 1 does not satisfy the
preferred ratio of 50 to 1. A caution should be
added to the interpretation of the analysis and a
split sample validation should be conducted.
True with caution is the correct answer to the
question about sample size.
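The ratio check above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on this slide; the function name `sample_size_answer` and its parameter names are assumptions, not part of SPSS.

```python
def sample_size_answer(valid_cases: int, n_predictors: int,
                       minimum: float = 10.0,
                       preferred: float = 50.0) -> tuple[float, str]:
    """Return the cases-to-variables ratio and the answer it implies."""
    ratio = valid_cases / n_predictors
    if ratio < minimum:
        # Below the 10-to-1 minimum the analysis should not be run.
        answer = "inappropriate application of a statistic"
    elif ratio < preferred:
        # Meets the minimum but not the preferred 50-to-1 ratio.
        answer = "true with caution"
    else:
        answer = "true"
    return ratio, answer


ratio, answer = sample_size_answer(136, 3)
print(f"{ratio:.2f} to 1 -> {answer}")  # prints "45.33 to 1 -> true with caution"
```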
14Outliers in the analysis - question
Outliers in logistic regression are defined as
cases that have a studentized residual of +/-2.0
or larger.
15Outliers in the analysis - evidence and answer
Using the criterion of studentized residuals
greater than +/-2.0, SPSS did not identify any
outliers and did not print the Casewise
List. SPSS informs us of this in a footnote that
appears in place of the Casewise List output. The
correct answer to the outlier question is true.
Since there were no outliers, there is no revised
model to run and no decision to use one or the
other model. We will interpret the baseline model.
16Multicollinearity and Numerical Problems -
question
Multicollinearity in the logistic regression
solution is detected by examining the standard
errors for the b coefficients. A standard error
larger than 2.0 indicates numerical problems,
such as multicollinearity among the independent
variables, cells with a zero count for a
dummy-coded independent variable because all of
the subjects have the same value for the
variable, and 'complete separation' whereby the
two groups in the dependent event variable can be
perfectly separated by scores on one of the
independent variables. Analyses that indicate
numerical problems should not be interpreted.
17Multicollinearity and Numerical Problems -
evidence and answer
The standard error for the one variable included
in the analysis, "total family income", was 0.033.
None of the independent variables in this
analysis had a standard error larger than
2.0. True is the correct answer.
SPSS does not output the standard error for
variables not included in the equation, so we
cannot tell if any variables were excluded
because of multicollinearity.
18Overall Relationship - question
The presence of a relationship between the
dependent variable and combination of independent
variables is based on the statistical
significance of the model chi-square at the step
when the last variable was entered into the
analysis. Only one variable, total family
income, is indicated to be a useful predictor of
group membership. Total family income is the most
useful predictor if it is the first variable
entered into the analysis.
19Overall Relationship - evidence and answer - 1
There was only one step in this stepwise
analysis. At the end of that step, the
probability of the model chi-square (9.001) was
p=0.003, less than or equal to the level of
significance of 0.05. The null hypothesis that
there is no difference between the model with
only a constant and the model with independent
variables was rejected. The existence of a
relationship between the independent variables
and the dependent variable was supported.
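As a rough check on the reported probability, the upper-tail probability of a chi-square statistic can be computed directly. The sketch below assumes the model chi-square has 1 degree of freedom, which is an assumption based on only one predictor having been entered; the function name `chi_square_p_1df` is hypothetical.

```python
import math


def chi_square_p_1df(chi_square: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    With 1 df, the chi-square statistic is the square of a standard normal
    score, so the tail probability equals erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


p_value = chi_square_p_1df(9.001)   # model chi-square from the slide
significant = p_value <= 0.05       # compare to the level of significance
print(f"p = {p_value:.3f}, significant: {significant}")
```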
20Overall Relationship - evidence and answer - 2
On step 1, the variable INCOME98, or total family
income was included in the analysis. The
statement that it is the most useful predictor is
supported.
The answer to this question is true with
caution. Caution in interpreting the
relationship should be exercised because the
ordinal level variable "highest academic degree"
[degree] was treated as metric, the ordinal level
variable "total family income" [income98] was
treated as metric, the ordinal level variable
"satisfaction with financial situation" [satfin]
was treated as metric, and the available sample
was less than the preferred number of cases.
21Relationship of Individual Independent Variables
to Dependent Variable
The probability of the Wald statistic for the
variable total family income was p=0.004, less
than or equal to the level of significance of
0.05. The null hypothesis that the b coefficient
for total family income was equal to zero was
rejected. Total family income is an ordinal
variable that is coded so that higher numeric
values are associated with survey respondents who
had higher total family incomes.
The value of Exp(B) was 0.909, which implies a
decrease in the odds of 9.1% for each one-unit
increase in total family income (0.909 - 1.0 =
-0.091). This supports the relationship that
"survey respondents who had higher total family
incomes were 9.1% less likely to have been more
positive that the United States would fight in
another world war within the next ten years."
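The conversion from Exp(B) to a percentage change in the odds is simple arithmetic, sketched here for illustration. The function name `percent_change_in_odds` is an assumption.

```python
def percent_change_in_odds(exp_b: float) -> float:
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (exp_b - 1.0) * 100.0


change = percent_change_in_odds(0.909)  # Exp(B) for total family income
print(f"{change:.1f}%")  # prints "-9.1%"
```

A negative result is a decrease in the odds and a positive result is an increase, so an Exp(B) of 1.0 would mean no change.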
22Individual Relationships - Academic degree -
question
To answer the question about an individual
relationship, we look to the significance of the
Wald test of the B coefficient, the
interpretation of the odds ratio, and the step
summary, which lists the order of entry.
23Individual Relationships - Academic degree -
evidence and answer
The independent variable "highest academic
degree" [degree] was not included in the stepwise
logistic regression analysis. False is the
correct answer.
24Individual Relationships - Family income -
question
To answer the question about an individual
relationship, we look to the significance of the
Wald test of the B coefficient, the
interpretation of the odds ratio, and the step
summary, which lists the order of entry.
25Individual Relationships - Family income -
evidence and answer
In the Step Summary table, "total family income"
[income98] was added to the logistic regression
analysis in step 1. This makes it the best
predictor.
26Individual Relationships - Family income -
evidence and answer
The probability of the Wald statistic for the
variable "total family income" [income98] was
p=0.004, less than or equal to the level of
significance of 0.05. The null hypothesis that
the b coefficient for "total family income"
[income98] was equal to zero was rejected.
"Total family income" [income98] is an ordinal
variable that is coded so that higher numeric
values are associated with survey respondents who
had higher total family incomes. The value of
Exp(B) was 0.909, which implies a decrease in the
odds of 9.1% (0.909 - 1.0 = -0.091). The
correct interpretation of the relationship is
that 'survey respondents who had higher total
family incomes were 9.1% less likely to have been
more positive that the United States would fight
in another world war within the next ten years.'
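The Wald probability can be approximately reproduced from numbers that appear in these slides. The sketch below rests on two assumptions: B is recovered as the natural log of the rounded Exp(B) of 0.909, and the standard error is the rounded 0.033 reported earlier, so the result is only approximate. The function name `wald_test` is hypothetical.

```python
import math


def wald_test(b: float, standard_error: float) -> tuple[float, float]:
    """Wald statistic (B / SE) squared and its probability with 1 degree of freedom."""
    wald = (b / standard_error) ** 2
    # For 1 df, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


b = math.log(0.909)                 # B recovered from the rounded Exp(B)
wald, p_value = wald_test(b, 0.033)  # rounded standard error from the slides
print(f"Wald = {wald:.2f}, p = {p_value:.3f}")
```

Because both inputs are rounded, the Wald value printed here may differ slightly from the one in the SPSS output table.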
27Individual Relationships - Family income -
evidence and answer
True with caution is the correct answer. Caution
in interpreting the relationship should be
exercised because the ordinal level variable
"total family income" [income98] was treated as
metric and the available sample was less than
the preferred number of cases.
28Individual Relationships - financial satisfaction -
question
To answer the question about an individual
relationship, we look to the significance of the
Wald test of the B coefficient, the
interpretation of the odds ratio, and the step
summary, which lists the order of entry.
29Individual Relationships - financial satisfaction -
evidence and answer
The independent variable "satisfaction with
financial situation" [satfin] was not included in
the stepwise logistic regression analysis. False
is the correct answer.
30Classification Accuracy - question
The independent variables could be characterized
as useful predictors distinguishing survey
respondents who have been more positive that the
United States would fight in another world war
within the next ten years from survey respondents
who have been less positive about this if
the classification accuracy rate was
substantially higher than the accuracy attainable
by chance alone. Operationally, the
classification accuracy rate should be 25% or
more higher than the proportional by chance
accuracy rate.
31Classification Accuracy - evidence and answer -
by chance accuracy rate
The proportional by chance accuracy rate was
computed by calculating the proportion of cases
for each group based on the number of cases in
each group in the classification table at Step 0.
The proportion in the No group was 0.603, making
the proportion in the Yes group 0.397 (1.0 -
0.603 = 0.397). The proportions of cases in each
group are then squared and summed (0.603² + 0.397²
= 0.521). The proportional by chance accuracy
criterion is 25% higher, or 65.2% (1.25 x 52.1% =
65.2%).
32Classification Accuracy - evidence and answer
The classification accuracy rate computed by SPSS
was 67.6%, which was greater than or equal to the
proportional by chance accuracy criterion of 65.2%
(1.25 x 52.1% = 65.2%). The criterion for
classification accuracy is satisfied. The answer
to the question is true.
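The by chance accuracy computation can be sketched as follows. This is only an illustration of the arithmetic; the function name `by_chance_criterion` and its `margin` parameter are assumptions.

```python
def by_chance_criterion(proportions: list[float],
                        margin: float = 0.25) -> tuple[float, float]:
    """Return the proportional by chance accuracy rate and the criterion above it."""
    # Square the proportion of cases in each group and sum the squares.
    chance = sum(p * p for p in proportions)
    # The criterion is the chance rate raised by the margin (25% by default).
    return chance, (1.0 + margin) * chance


chance, criterion = by_chance_criterion([0.603, 0.397])  # Step 0 group proportions
useful = 0.676 >= criterion  # accuracy rate computed by SPSS
print(f"chance = {chance:.3f}, criterion = {criterion:.3f}, useful: {useful}")
```

The criterion is computed from the unrounded chance rate, which is why it comes to 65.2% rather than the 65.1% that multiplying the rounded 52.1% by 1.25 would give.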
33Validation - question
For a stepwise logistic regression, the 75-25
cross-validation must verify the overall
contribution of the independent variables
included in the analysis. In addition, the
pattern of significance for the individual
relationships between the dependent variable and
the predictors for the training sample should be
the same as the pattern for the full data
set. And finally, the classification accuracy
rate for the validation sample must be within 2%
of the accuracy rate for the training sample.
34Validation analysis - set the random number seed
To set the random number seed, select the Random
Number Seed command from the Transform menu.
35Set the random number seed
First, click on the Set seed to option button to
activate the text box.
Second, type in the random seed stated in the
problem.
Third, click on the OK button to complete the
dialog box. Note that SPSS does not provide
you with any feedback about the change.
36Validation analysis - compute the split variable
To enter the formula for the variable that will
split the sample in two parts, click on the
Compute command.
37The formula for the split variable
First, type the name for the new variable, split,
into the Target Variable text box.
Second, the formula for the value of split is
shown in the text box. The uniform(1) function
generates a random decimal number between 0 and
1. The random number is compared to the value
0.75. If the random number is less than or equal
to 0.75, the value of the formula will be 1, the
SPSS numeric equivalent to true. If the random
number is larger than 0.75, the formula will
return a 0, the SPSS numeric equivalent to false.
Third, click on the OK button to complete the
dialog box.
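The logic of the split formula can be illustrated outside SPSS. The Python sketch below mimics comparing a uniform random number to 0.75; it will not reproduce the same assignments as SPSS, because the two random number generators differ. The function name `make_split` and the seed value 12345 are hypothetical stand-ins, since the actual seed is the one stated in the problem.

```python
import random


def make_split(n_cases: int, seed: int, training_share: float = 0.75) -> list[int]:
    """Assign each case 1 (training sample) or 0 (validation sample)."""
    rng = random.Random(seed)  # seeded so the same split can be recreated
    # 1 when the random number is <= 0.75, otherwise 0, as in the formula.
    return [1 if rng.random() <= training_share else 0 for _ in range(n_cases)]


split = make_split(136, seed=12345)  # hypothetical seed, 136 valid cases
print(f"training cases: {sum(split)}, validation cases: {len(split) - sum(split)}")
```

Because assignment is random, the training sample will be close to, but usually not exactly, 75% of the cases.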
38Running the logistic regression again with the
training sample
We repeat the logistic regression analysis for
the training sample.
Select the Regression Binary Logistic command
from the Analyze menu.
39Using "split" as the selection variable
First, scroll down the list of variables and
highlight the variable split.
Second, click on the right arrow button to move
the split variable to the Selection Variable text
box.
40Setting the value of split to select cases
When the variable named split is moved to the
Selection Variable text box, SPSS adds "?" after
the name to prompt us to enter a specific value
for split.
Click on the Rule button to enter a value for
split.
41Completing the value selection
First, type the value for the first half of the
sample, 1, into the Value text box.
Second, click on the Continue button to complete
the value entry.
42Requesting output for the validation sample
Click on the OK button to request the output.
When the value entry dialog box is closed, SPSS
adds the value we entered after the equal sign.
This specification now tells SPSS to include in
the analysis only those cases that have a value
of 1 for the split variable.
43CROSS-VALIDATION - 1
In the cross-validation analysis, the
relationship between the independent variables
and the dependent variable was statistically
significant. The probability for the model
chi-square (7.572) testing overall relationship
was p=0.006.
The significance of the overall relationship
between the individual independent variables and
the dependent variable supports the validation
analysis.
44CROSS-VALIDATION - 2
The relationship between family income and
"expectation about war" [uswary] was
statistically significant for the model using the
full data set (p=0.004). Similarly, the
relationship in the cross-validation analysis was
statistically significant. In the
cross-validation analysis, the probability for
the test of relationship between family income
and "expectation about war" [uswary] was p=0.008,
which was less than or equal to the level of
significance of 0.05.
45CROSS-VALIDATION - 5
The classification accuracy rate for the model
using the training sample was 67.0%, compared to
66.7% for the validation sample. The shrinkage in
classification accuracy for the validation
analysis is the difference between the accuracy
for the training sample (67.0%) and the accuracy
for the validation sample (66.7%), which equals
0.3% in this analysis. The shrinkage was within
the 2% criterion for minimal shrinkage, small
enough to support a conclusion that the logistic
regression model based on this analysis would be
effective in predicting scores for cases other
than those included in the calculation of the
regression analysis.
The validation analysis supports the
generalizability of the findings. The answer to
the question is true.
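The shrinkage check is a subtraction followed by a comparison, sketched here for illustration. The function name `shrinkage_check` and the `tolerance` parameter are assumptions; accuracies are given in percent.

```python
def shrinkage_check(training_accuracy: float, validation_accuracy: float,
                    tolerance: float = 2.0) -> tuple[float, bool]:
    """Return the shrinkage in accuracy and whether it is within the tolerance."""
    shrinkage = training_accuracy - validation_accuracy
    # The absolute difference is used, so a validation rate above the
    # training rate is treated the same way as one below it.
    return shrinkage, abs(shrinkage) <= tolerance


shrinkage, supported = shrinkage_check(67.0, 66.7)
print(f"shrinkage = {shrinkage:.1f}, validation supported: {supported}")
```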
46Summary of Findings - question
The final question is a summary of the findings
of the analysis: overall relationship, individual
relationships, and usefulness of the model.
Cautions are added, if needed, for sample size
and level of measurement issues.
47Summary of Findings - evidence and answer
True with caution is the correct answer.
48Stepwise binary logistic regression - level of
measurement
Question: Variables included in the analysis
satisfy the level of measurement requirements?
Dependent dichotomous? Independent variables
metric or dichotomous?
Inappropriate application of a statistic
No
Yes
Ordinal independent variable included in analysis?
True with caution
True
49Stepwise binary logistic regression - sample size
Question: Number of variables and cases satisfy
sample size requirements?
Run baseline logistic regression, using stepwise
method for including variables identified in the
research question. Record classification
accuracy for evaluation of the effect of removing
outliers.
Ratio of cases to independent variables at least
10 to 1?
Inappropriate application of a statistic
Ratio of cases to independent variables at least
50 to 1?
True with caution
True
50Stepwise binary logistic regression - detecting
outliers
Question: Outliers were not detected in the
analysis?
Outliers for the solution identified by
studentized residuals > 2.0?
True
False
51Stepwise binary logistic regression - selecting
model for interpretation
Question: Interpret baseline model or model
excluding outliers?
Outliers for the solution identified by
studentized residuals > 2.0?
Run revised logistic regression excluding
outliers, using method for including variables
identified in research question.
Classification accuracy omitting outliers better
than baseline by 2% or more?
Pick baseline logistic regression for
interpretation
Pick logistic regression that omits outliers for
interpretation
False
True
52Stepwise binary logistic regression -
multicollinearity or numerical problems
Question: No evidence of multicollinearity or
numerical problems?
Standard errors of coefficients indicate presence
of numerical problems (s.e. > 2.0)?
False
If numerical problem found, halt analysis until
problem is resolved.
True
53Stepwise binary logistic regression - overall
relationship
Question: Overall relationship between
independent variables and dependent variable?
Relationship confirmed by significance of model
chi-square for predictors at last step?
False
Caution for ordinal variable or sample size not
meeting preferred requirements?
True with caution
True
54Stepwise binary logistic regression -
relationships between IV's and DV - 1
Question: Interpretation of relationship between
independent variable and dependent variable
groups?
Entry order from Step Summary matches order
stated in problem?
False
Individual relationship confirmed by significance
of Wald statistic?
False
Direction and size of odds ratio interpreted
correctly?
False
55Stepwise binary logistic regression -
relationships between IV's and DV - 2
Caution for ordinal variable or sample size not
meeting preferred requirements?
True with caution
True
56Stepwise binary logistic regression -
classification accuracy
Question: Classification accuracy sufficient to
be characterized as a useful model?
Overall accuracy rate is 25% greater than
proportional by chance accuracy rate?
False
True
57Stepwise binary logistic regression - validation
- 1
Question: Validation analysis supports
generalizability of model?
Compute 75-25 split variable. Re-run logistic
regression, using method for including variables
identified in the research question.
Probability of model chi-square for predictors at
last step <= level of significance?
False
58Stepwise binary logistic regression - validation
- 2
Significance of predictors in training sample
matches pattern for model using full data set?
False
Shrinkage in classification accuracy for holdout
sample < 2%?
False
59Stepwise binary logistic regression - summary of
findings - 1
Question: Summary of findings correctly stated,
including cautions?
Overall relationship correctly stated?
False
Individual relationship with IV and DV correctly
stated?
False
Classification accuracy supports useful model?
False
60Stepwise binary logistic regression - summary of
findings - 2
Satisfies preferred ratio of cases to IV's of 50
to 1?
True with caution
One or more IV's are ordinal level variables?
True with caution
True