Title: SW388R7
Complete Regression Analysis
- Additional issues in regression analysis
- Assumption of independence of errors
- Studentized residuals
- Multicollinearity
- Validation analysis
- Sample problem
- Steps in complete regression analysis
Assumption of independence of errors - 1
- Multiple regression assumes that the errors are independent and there is no serial correlation. Errors are the residuals, or differences between the actual score for a case and the score estimated using the regression equation.
- No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case.
- The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50.
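For reference, the Durbin-Watson statistic is computed from the successive residuals e_t:

d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}

When neighboring residuals are uncorrelated, the numerator averages about twice the denominator, which is why values near 2 indicate independence; values near 0 indicate positive serial correlation, and values near 4 indicate negative serial correlation.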
Assumption of independence of errors - 2
- Serial correlation is more of a concern in analyses that involve time series.
- If it does occur in relationship analyses, its presence can usually be assessed by changing the sequence of cases and running the analysis again.
- If the problem with serial correlation disappears, it may be ignored.
- If the problem with serial correlation remains, it can usually be handled by using the difference between successive data points as the dependent variable.
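As a rough sketch of this differencing remedy in SPSS syntax (the dependent variable name y and the new name diff_y are placeholders, not from the problem):

* Use the difference between successive data points as the dependent variable.
* LAG(y) is the value of y for the preceding case; the first case becomes system-missing.
COMPUTE diff_y = y - LAG(y).
EXECUTE.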
Studentized residuals - 1
- We have previously examined outliers as univariate outliers, because of the score on the dependent variable, or as multivariate outliers, because of the combination of scores on the independent variables.
- A case can also be an outlier on the combination of scores on the dependent and independent variables.
- For example, suppose most people with high education and high income had high prestige jobs. A person might have high education and low income, but still be in a high prestige job. This person might be an outlier in a regression solution of the relationship of education and income to occupational prestige.
Studentized residuals - 2
- Detection of outliers in the regression solution is based on an analysis of the residuals, or prediction error not accounted for by the regression.
- Standardizing the residuals means we can detect outliers with large residuals using the same z-score criterion we used for univariate outliers (±3.29).
- Studentizing means that the case being considered was omitted from the calculation of the standard deviation used to compute the z-score. If we leave a case with a large residual in the calculations, it will reduce the z-score the case gets assigned and make it look like less of an outlier.
- SPSS computes and saves studentized residuals using the regression procedure, just as it calculates Mahalanobis distance.
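In the usual notation, with e_i the residual for case i, h_ii its leverage, and s_(i) the standard deviation of the residuals computed with case i omitted, the studentized residual is:

t_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}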
Multicollinearity - 1
- Multicollinearity is a problem in regression analysis that occurs when two independent variables are highly correlated, e.g., r = 0.90 or higher.
- The relationship between the independent variables and the dependent variable is distorted by the very strong relationship between the independent variables, leading to the likelihood that our interpretation of relationships will be incorrect.
- In the worst case, if the variables are perfectly correlated, the regression cannot be computed.
- SPSS guards against the failure to compute a regression solution by arbitrarily omitting the collinear variable from the analysis.
Multicollinearity - 2
- Multicollinearity is detected by examining the tolerance for each independent variable. Tolerance is the amount of variability in one independent variable that is not explained by the other independent variables.
- Tolerance values less than 0.10 indicate collinearity.
- If we discover collinearity in the regression output, we should reject the interpretation of the relationships as false until the issue is resolved.
- Multicollinearity can be resolved by combining the highly correlated variables through principal component analysis, or by omitting one of the collinear variables from the analysis.
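Tolerance can be stated as a formula: if R_j^2 is the proportion of variance in independent variable j explained by regressing it on the other independent variables, then

\text{Tolerance}_j = 1 - R_j^2

so a tolerance below 0.10 means that more than 90% of that variable's variance is shared with the other independent variables.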
Validation analysis
- Multivariate statistics are powerful mathematical techniques that respond to subtle patterns in the data, sometimes appearing to provide a meaningful solution to a research question when the solution should be attributed to the values for a small number of cases in the study.
- When values are predicted for the same cases that were used to derive the statistical model, the accuracy may be optimistic or over-stated.
- The purpose of validation analysis is to test the generalizability of the regression analysis to the population represented by the sample in the analysis.
75/25 cross-validation
- While there are different strategies for validation analysis, we will use a 75/25 cross-validation.
- In this strategy, we have SPSS randomly assign 75% of the cases in the analysis to a group called the training sample, and the remaining 25% of the cases to another group called the validation sample.
- The training sample is used to create the statistical model, which is evaluated on the validation sample.
- If the training model is as effective at predicting values for the validation sample as it was at predicting values for its own cases (i.e., the R² are similar in size), the validation is successful.
Criteria for successful cross-validation
- The validation is considered successful if the pattern of statistical significance for both the overall relationship and the individual relationships between the independent variables and the dependent variable for the training sample agrees with the model for the full sample.
- While it is expected that the R² for the validation sample will be lower than the R² for the training sample, the difference (called shrinkage) should be no larger than 2%.
- If the validation is not successful, the answer to the analysis is false.
Cross-validation and sample size
- When we compare a model with 75% of the cases to the model with 100% of the cases, we obviously lose power due to the reduction in sample size.
- To offset the loss in power, we increase the preferred sample size to be 33% larger than the minimum sample size.
- The preferred sample size assures that the training sample still satisfies the minimum sample size requirement.
- If we do not have enough cases to satisfy the preferred sample size, we add a caution to the analysis. If the validation is not successful, it may be due to the reduced sample size available to the analysis.
Notes
- Findings are stated on the results for the analysis of the full data set, not the validation analyses.
- If our validation analysis does not support the findings of the analysis on the full data set, we will declare the answer to the problem to be false. There is, however, another common option, which is to cite statistical significance only for independent variables supported by the validation analysis as well as the full data set analysis. All other variables are considered to be non-significant. Generally, it is the independent variables with the weakest individual relationship to the dependent variable that fail to validate.
Problem 1
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
- 3. False
- 4. Inappropriate application of a statistic
Dissecting problem 1 - 1
The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.05 as alpha for the regression, but 0.01 for testing assumptions.
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
The random number seed (2044128) for the cross-validation is provided.
After testing for missing data problems and evaluating assumptions and outliers, we will decide whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.
Dissecting problem 1 - 2
The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ].
When a problem states that there is a relationship between some independent variables and a dependent variable, we do standard multiple regression.
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
- 3. False
- 4. Inappropriate application of a statistic
The variable that is the target of the relationship is the dependent variable (DV): "number of children" [childs].
Dissecting problem 1 - 3
In order for a problem to be true, we will have to find that there is a statistically significant relationship between the set of IVs and the DV, and the strength of the relationship stated in the problem must be correct.
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
- 3. False
- 4. Inappropriate application of a statistic
In addition, the relationship or lack of relationship between the individual IVs and the DV must be identified correctly, and must be characterized correctly.
LEVEL OF MEASUREMENT
Multiple regression requires that the dependent variable be metric and the independent variables be metric or dichotomous.
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
- 3. False
- 4. Inappropriate application of a statistic
"Number of children" [childs] is interval, satisfying the metric level of measurement requirement for the dependent variable. "Age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] are interval, satisfying the metric or dichotomous level of measurement requirement for independent variables.
PATTERNS OF MISSING DATA - 1
Run the script to check missing data. Move the variables included in the analysis, mark the option for missing data, and click the OK button.
PATTERNS OF MISSING DATA - 2
One variable was missing data for more than 5% of the cases in the data set: "age when first child was born" [agekdbrn] was missing data for 28.5% of the cases in the data set (77 of 270 cases). A missing/valid dichotomous variable was created for this variable to test whether the group of cases with missing data differed significantly from the group of cases with valid data on the other variables included in the analysis.
The variables "number of children" [childs], "age" [age], and "highest year of school completed" [educ] were missing data for less than 5% of the cases in the data set. T-tests and chi-square tests to compare cases with missing data to cases with valid data for the other variables included in the analysis were not conducted.
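A minimal sketch of the comparison the script performs, in SPSS syntax (the indicator name miss_agekdbrn is hypothetical; the script chooses its own names):

* Flag cases missing agekdbrn: 1 = missing, 0 = valid.
COMPUTE miss_agekdbrn = MISSING(agekdbrn).
EXECUTE.
* Compare the missing and valid groups on the other metric variables.
T-TEST GROUPS=miss_agekdbrn(0 1)
  /VARIABLES=childs age educ.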
PATTERNS OF MISSING DATA - 3
There were significant differences in the statistical tests comparing cases with missing data to cases with valid data. Cases who had missing data for the variable "age when first child was born" [agekdbrn] had an average score on the variable "number of children" [childs] that was 2.36 units lower than the average for cases who had valid data (t = -23.265, p < 0.001).
PATTERNS OF MISSING DATA - 4
Since there were significant differences in the statistical tests comparing cases with missing data to cases with valid data, a caution was added to the interpretation of any findings, pending further analysis of the missing data pattern.
Cases who had missing data for the variable "age when first child was born" [agekdbrn] had an average score on the variable "age" [age] that was 8.45 units lower than the average for cases who had valid data (t = -3.667, p < 0.001).
The baseline regression - 1
After we check for violations of assumptions and outliers, we will make a decision whether we should interpret the model that includes the transformed variables and omits outliers (the revised model), or whether we will interpret the model that uses the untransformed variables and includes all cases, including the outliers (the baseline model). In order to make this decision, we run the baseline regression before we examine assumptions and outliers, and record the R² for the baseline model. If using transformations and omitting outliers substantially improves the analysis (a 2% increase in R²), we interpret the revised model. If the increase is smaller, we interpret the baseline model.
To run the baseline model, select Regression | Linear from the Analyze menu.
The baseline regression - 2
Specify the dependent and independent variables. Select Enter as the Method for including variables to produce a standard multiple regression.
Click on the Statistics button to select the statistics we will need for the analysis.
The baseline regression - 3
Retain the default checkboxes for Estimates and Model fit to obtain the baseline R², which will be used to determine whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.
Mark the Descriptives checkbox to get the number of cases available for the analysis, and the Collinearity diagnostics checkbox to get tolerance values for testing multicollinearity.
Mark the checkbox for the Durbin-Watson statistic, which will be used to test the assumption of independence of errors.
Click on the Continue button to close the dialog box.
The baseline regression - 4
Click on the OK button to obtain the output for the baseline model.
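For readers who prefer syntax, the dialog choices above paste to roughly the following REGRESSION command (a sketch; the exact subcommands SPSS pastes vary by version):

REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /DEPENDENT childs
  /METHOD=ENTER age agekdbrn educ
  /RESIDUALS DURBIN.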
R² for the baseline model
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 19.8%. The relationship is statistically significant, though we would not stop if it were not significant, because the lack of significance may be a consequence of violation of assumptions or the inclusion of outliers.
The R² of 0.198 is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.
Run the script to test normality
First, move the variables to the list boxes based on the role that the variable plays in the analysis and its level of measurement.
Second, click on the Normality option button to request that SPSS produce the output needed to evaluate the assumption of normality.
Third, mark the checkboxes for the transformations that we want to test in evaluating the assumption.
Fourth, click on the OK button to produce the output.
Normality of the dependent variable: number of children
The dependent variable "number of children" [childs] satisfied the criteria for a normal distribution. The skewness of the distribution (0.752) was between -1.0 and +1.0, and the kurtosis of the distribution (0.342) was between -1.0 and +1.0.
Normality of the independent variable: age
The independent variable "age" [age] satisfied the criteria for a normal distribution. The skewness of the distribution (0.595) was between -1.0 and +1.0, and the kurtosis of the distribution (-0.351) was between -1.0 and +1.0.
Normality of the second independent variable: age when first child was born
The independent variable "age when first child was born" [agekdbrn] did not satisfy the criteria for a normal distribution. Both the skewness (1.087) and kurtosis (1.890) fell outside the range from -1.0 to +1.0.
Normality of transformed variables for age when first child was born
Multiple transformations satisfied the criteria for normality. The "log of age when first child was born [LGAGEKDB = LG10(AGEKDBRN)]" satisfied the criteria for a normal distribution (skewness = 0.388, kurtosis = 0.032), as did the "square root of age when first child was born [SQAGEKDB = SQRT(AGEKDBRN)]" (skewness = 0.715, kurtosis = 0.734) and the "inverse of age when first child was born [INAGEKDB = -1/(AGEKDBRN)]" (skewness = -0.165, kurtosis = -0.435). The "inverse of age when first child was born [INAGEKDB = -1/(AGEKDBRN)]" was substituted for "age when first child was born" [agekdbrn] in the analysis because it had the smallest skew (-0.165) of the alternatives.
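The chosen transformation can also be created directly with a COMPUTE statement, using the formula given above:

* Inverse transformation; the negation preserves the original ordering of values.
COMPUTE INAGEKDB = -1/AGEKDBRN.
EXECUTE.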
Normality of the third independent variable: highest year of school completed
The independent variable "highest year of school completed" [educ] did not satisfy the criteria for a normal distribution. The skewness of the distribution (-0.137) was between -1.0 and +1.0, but the kurtosis of the distribution (1.246) fell outside the range from -1.0 to +1.0.
Normality of transformed independent variables for highest year of school completed
Since the distribution was skewed to the left, it was necessary to reflect, or reverse code, the values for the variable before computing the transformation. Neither the logarithmic (skew = -1.969, kurtosis = 5.053), the square root (skew = -0.846, kurtosis = 1.647), nor the inverse transformation (skew = -4.188, kurtosis = 18.338) induced normality in the variable "highest year of school completed" [educ]. A caution was added to the findings.
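A sketch of the reflect-and-transform computation in syntax; the reflection constant of 21 (one more than an assumed maximum of 20 years for educ) and the name LGEDUC are hypothetical:

* Reflect (reverse code) educ so the left skew becomes a right skew,
* then apply the logarithmic transformation.
COMPUTE LGEDUC = LG10(21 - educ).
EXECUTE.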
Run the script to test linearity
First, since the transformation "inverse of age when first child was born [INAGEKDB = -1/(AGEKDBRN)]" was incorporated in the analysis in the evaluation of normality, additional transformations for linearity were not considered. Remove it from the list of metric independent variables.
Second, click on the Linearity option button to request that SPSS produce the output needed to evaluate the assumption of linearity. When the linearity option is selected, a default set of transformations to test is marked.
Third, click on the OK button to produce the output.
Linearity test: age and number of children
The assessment of the linearity of the relationship between "number of children" [childs] and "age" [age] indicated that the relationship could be considered linear because the probability associated with the correlation coefficient for the relationship (r = 0.281) was statistically significant (p < 0.001) and none of the statistically significant transformations for age had a relationship that was substantially stronger. The relationship between the untransformed variables was assumed to satisfy the assumption of linearity.
Linearity test: highest year of school completed and number of children
The assessment of the linear relationship between "number of children" [childs] and "highest year of school completed" [educ] indicated that the relationship was weak, rather than nonlinear. The statistical probabilities associated with the correlation coefficients measuring the relationship with the untransformed independent variable (r = -0.101, p = 0.099), the logarithmic transformation (r = 0.083, p = 0.174), the square root transformation (r = 0.099, p = 0.106), and the inverse transformation (r = 0.029, p = 0.634) were all greater than the level of significance for testing assumptions (0.01). There was no evidence that the assumption of linearity was violated.
Run the script to test homogeneity of variance
There were no nonmetric variables in this analysis, so the test of homogeneity of variance was not conducted.
Including the transformed variable in the data set
In the evaluation for normality, we resolved a problem with normality for age when first child was born by using an inverse transformation. We need to add this transformed variable to the data set so that we can incorporate it in our detection of outliers. We can use the script to compute transformed variables and add them to the data set. We select an assumption to test (Normality is the easiest), mark the check box for the transformation we want to retain, and clear the check box "Delete variables created in this analysis."
NOTE: this will leave the transformed variable in the data set. To remove it, you can delete the column or close the data set without saving.
Including the transformed variable in the data set
First, move the variable AGEKDBRN to the list box for Metric independent variables.
Second, click on the Normality option button to request that SPSS do the test for normality, including the transformation we will mark.
Third, mark the transformation we want to retain (Inverse) and clear the checkboxes for the other transformations.
Fourth, clear the check box for the option "Delete variables created in this analysis".
Fifth, click on the OK button.
Including the transformed variable in the data set
If we scroll to the rightmost column in the data editor, we see that the inverse of AGEKDBRN is included in the data set.
Including the transformed variable in the list of variables in the script
If we scroll to the bottom of the list of variables, we see that the inverse of AGEKDBRN is not included in the list of available variables.
To tell the script to add the inverse of AGEKDBRN to the list of variables in the script, click on the Reset button. This will start the script over again, with a new list of variables from the data set.
Including the transformed variable in the list of variables in the script
If we scroll to the bottom of the list of variables now, we see that the inverse of AGEKDBRN is included in the list of available variables.
Run the script to detect outliers
Move the variables to the list boxes for the dependent and independent variables, including transformed variables that we have decided to use.
Note: when detecting outliers, clear the check box for deleting variables to keep SPSS from deleting the variables immediately after it creates them.
Click on the Detect outliers option button to request that SPSS create the variables needed to detect outliers.
Click on the OK button to produce the output.
Outliers in the data set
The data set was sorted in descending order by zchilds to show that there are two univariate outliers in the data set. One of the univariate outliers was also an outlier on the studentized residual criterion. There are no multivariate outliers in the data set.
Removing the outliers from the regression using transformations and omitting outliers
Our next step is to run the revised regression model that uses transformed variables and omits outliers. Our first step in this process is to tell SPSS to exclude the outliers from the analysis. We accomplish this by telling SPSS to include in the analysis all of the cases that are not outliers.
First, select the Select Cases command from the Data menu.
Specifying the condition to omit outliers
First, mark the "If condition is satisfied" option button to indicate that we will enter a specific condition for including cases.
Second, click on the If button to specify the criteria for inclusion in the analysis.
The formula for omitting outliers
To eliminate the outliers, we request that the cases that are not outliers be selected into the analysis. The formula specifies that we should include cases if the standard score for the dependent variable (zchilds), regardless of sign, is less than or equal to 3.29; the probability for Mahalanobis D² is higher than the benchmark level of significance of 0.001; and the studentized residual, regardless of sign, is less than or equal to 3.29. The abs(), or absolute value, function tells SPSS to ignore the sign of the value.
After typing in the formula, click on the Continue button to close the dialog box.
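As syntax, the inclusion condition looks roughly like the following; mah_1 and sre_1 are the default names SPSS assigns to saved Mahalanobis distances and studentized residuals, and may differ in a given session (3 is the number of independent variables):

* Keep cases that are not univariate, multivariate, or regression outliers.
COMPUTE filter_$ = (ABS(zchilds) <= 3.29
    AND 1 - CDF.CHISQ(mah_1, 3) > 0.001
    AND ABS(sre_1) <= 3.29).
FILTER BY filter_$.
EXECUTE.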
Completing the request for the selection
To complete the request, we click on the OK button.
Omitted outlier cases in the data set
SPSS identifies the excluded cases by drawing a slash mark through the case number. The omitted cases were the univariate outliers we detected above.
SPSS creates a special variable called filter_$ to select cases in or out of the analysis. Cases with a value of 0 for filter_$ are omitted, and cases with a value of 1 for filter_$ are included.
The revised regression using transformations and omitting outliers
We run the regression again, without the outliers, which we selected out with the Select If command. Select the Regression | Linear command from the Analyze menu.
The revised regression: substituting transformed variables
Remove the variable AGEKDBRN from the list of independent variables. Include the inverse of the variable, INAGEKDB.
Click on the Statistics button to select the statistics we will need for the analysis.
The revised regression: selecting statistics
Retain the default checkboxes for Estimates and Model fit to obtain the revised R², which will be used to determine whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.
Mark the Descriptives checkbox to get the number of cases available for the analysis, and the Collinearity diagnostics checkbox to get tolerance values for testing multicollinearity.
Mark the checkbox for the Durbin-Watson statistic, which will be used to test the assumption of independence of errors.
Click on the Continue button to close the dialog box.
The revised regression: obtaining output
Click on the OK button to obtain the output for the revised model.
SELECTION OF MODEL FOR INTERPRETATION
Prior to any transformations of variables to satisfy the assumptions of multiple regression and the removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 19.8%. After substituting transformed variables and removing outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 17.7%. Since the revised regression analysis using transformations and omitting outliers explained less variance than the baseline regression analysis with all cases and no transformations, the baseline regression analysis with all cases and no transformed variables was interpreted.
Restoring the outliers to the data set - 1
If the revised model had been the preference for interpretation, the analysis could have proceeded using the output for the revised model. Since the revised model did not improve the analysis by removing outliers and using transformed variables, we revert to the baseline model with all cases included and untransformed variables. To accomplish this, we will first restore all cases to the data set and then run the regression again with the untransformed variables.
First, select the Select Cases command from the Data menu.
Restoring the outliers to the data set - 2
First, mark the All cases option button on the Select panel.
Second, click on the OK button to close the dialog box.
Restoring the outliers to the data set - 3
The slash marks through the case numbers are removed, indicating that no cases will be excluded from the analysis. Note that SPSS does not delete or change the values of the filter_$ variable. To remove it from the data set, we need to delete the column.
Restoring the order of the cases
The cases in the data set had been sorted in the examination of outliers. Before re-running the regression commands, the data set should be restored to its original order.
Click on the header for the caseid column to highlight the column. With the column highlighted, right click on the column header and select Sort Ascending from the popup menu.
Re-running the baseline regression - 1
Having decided to use the baseline model for the interpretation of this analysis, the SPSS regression output was re-created.
To run the baseline model, select Regression | Linear from the Analyze menu.
Re-running the baseline regression - 2
Specify the dependent and independent variables. Select Enter as the Method for including variables to produce a standard multiple regression.
Click on the Statistics button to select the statistics we will need for the analysis.
Re-running the baseline regression - 3
Retain the default checkboxes for Estimates and Model fit to obtain the baseline R², which will be used to determine whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.
Mark the Descriptives checkbox to get the number of cases available for the analysis, and the Collinearity diagnostics checkbox to get tolerance values for testing multicollinearity.
Mark the checkbox for the Durbin-Watson statistic, which will be used to test the assumption of independence of errors.
Click on the Continue button to close the dialog box.
Re-running the baseline regression - 4
Click on the OK button to obtain the output for the baseline model.
Assumption of independence of errors: the Durbin-Watson statistic
Having selected a regression model for interpretation, we can now examine the final assumption, independence of errors. The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals, i.e., the assumption of independence of errors, which requires that the residuals or errors in prediction do not follow a pattern from case to case. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50.
The Durbin-Watson statistic for this problem is 2.043, which falls within the acceptable range. If the Durbin-Watson statistic were not in the acceptable range, we would add a caution to the findings for a violation of regression assumptions.
SAMPLE SIZE
The 192 cases available for the analysis satisfied the minimum sample size of 107 for the standard multiple regression (104 cases + 3 independent variables). In addition, the 192 cases satisfied the preferred sample size of 143 needed for the 75/25 cross-validation (107 × 1.33 ≈ 143). If we failed to satisfy the preferred sample size, a caution would be added to the interpretation.
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES
Based on the results in the ANOVA table (F(3, 188) = 15.436, p < 0.001), there was an overall relationship between the dependent variable "number of children" [childs] and the independent variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ]. Since the probability of the F statistic (p < 0.001) was less than or equal to the level of significance (0.05), the null hypothesis that Multiple R was equal to 0 was rejected. The research hypothesis that there was a relationship between the set of independent variables and the dependent variable was supported.
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES
The Multiple R for the relationship between the combined set of independent variables and the dependent variable was 0.445, which would be characterized as moderate using the rule of thumb that a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong. The relationship between the independent variables and the dependent variable was correctly characterized as moderate.
MULTICOLLINEARITY
Multicollinearity occurs when one independent variable is so strongly correlated with one or more of the other independent variables that its relationship to the dependent variable is likely to be misinterpreted. Its potential unique contribution to explaining the dependent variable is minimized by its strong relationship to the other independent variables. Multicollinearity is indicated when the tolerance value for an independent variable is less than 0.10. The tolerance values for all of the independent variables are larger than 0.10. Multicollinearity is not a problem in this regression analysis.
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1
Based on the statistical test of the b coefficient (t = 3.826, p < 0.001) for the independent variable "age" [age], the null hypothesis that the slope or b coefficient was equal to 0 was rejected. The research hypothesis that there was a relationship between age and number of children was supported. The positive sign of the b coefficient (0.021) meant that the relationship between age and number of children was a direct relationship, implying that higher numeric values for the independent variable (age) were associated with higher numeric values for the dependent variable (number of children). The statement in the problem that "survey respondents who were older had more children" is correct.
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2
Based on the statistical test of the b coefficient (t = -6.132, p < 0.001) for the independent variable "age when first child was born" [agekdbrn], the null hypothesis that the slope or b coefficient was equal to 0 was rejected. The research hypothesis that there was a relationship between age when first child was born and number of children was supported. The negative sign of the b coefficient (-0.103) meant that the relationship between age when first child was born and number of children was an inverse relationship, implying that higher numeric values for the independent variable (age when first child was born) were associated with lower numeric values for the dependent variable (number of children). The statement in the problem that "survey respondents who were older when first child was born had fewer children" is correct.
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3
Based on the statistical test of the b coefficient (t = 2.026, p = 0.044) for the independent variable "highest year of school completed" [educ], the null hypothesis that the slope or b coefficient was equal to 0 was rejected. The research hypothesis that there was a relationship between highest year of school completed and number of children was supported. The positive sign of the b coefficient (0.069) meant that the relationship between highest year of school completed and number of children was a direct relationship, implying that higher numeric values for the independent variable (highest year of school completed) were associated with higher numeric values for the dependent variable (number of children). The statement in the problem that "survey respondents who had completed more years of school had more children" is correct.
Validation analysis: set the random number seed
The problem states: "Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed."
To set the random number seed, select the Random Number Seed command from the Transform menu.
Set the random number seed
First, click on the "Set seed to" option button to activate the text box.
Second, type in the random seed stated in the problem.
Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.
Validation analysis: compute the split variable
To enter the formula for the variable that will split the sample in two parts, click on the Compute command.
The formula for the split variable
First, type the name for the new variable, split, into the Target Variable text box.
Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.75. If the random number is less than or equal to 0.75, the value of the formula will be 1, the SPSS numeric equivalent of true. If the random number is larger than 0.75, the formula will return a 0, the SPSS numeric equivalent of false.
Third, click on the OK button to complete the dialog box.
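The same two steps in syntax form (a sketch of what the dialogs produce):

* Set the seed so the random split is reproducible.
SET SEED = 2044128.
* split = 1 for roughly 75% of cases (training sample), 0 otherwise (validation sample).
COMPUTE split = uniform(1) <= 0.75.
EXECUTE.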
The split variable in the data editor
In the data editor, the split variable shows a random pattern of zeros and ones. To select the cases for the training sample, we select the cases where split = 1.
Repeat the regression for the validation
To repeat the multiple regression analysis for the validation sample, select Regression | Linear from the Analyze menu.
Using "split" as the selection variable
First, scroll down the list of variables and highlight the variable split.
Second, click on the right arrow button to move the split variable to the Selection Variable text box.
Setting the value of split to select cases
When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.
Click on the Rule button to enter a value for split.
Completing the value selection
First, type the value for the training sample, 1, into the Value text box.
Second, click on the Continue button to complete the value entry.
Requesting output for the validation analysis
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.
Click on the OK button to request the output.
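In syntax, the selection variable corresponds to the SELECT subcommand of REGRESSION (a sketch; with SELECT, SPSS fits the model on the selected cases and also reports summary statistics for the unselected cases):

REGRESSION
  /SELECT=split EQ 1
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT childs
  /METHOD=ENTER age agekdbrn educ.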
CROSS-VALIDATION - 1
The validation analysis requires that the regression model for the 75% training sample replicate the pattern of statistical significance found for the full data set.
In the analysis of the 75% training sample, the relationship between the set of independent variables and the dependent variable was statistically significant, F(3, 144) = 12.418, p < 0.001, as was the overall relationship in the analysis of the full data set, F(3, 188) = 15.436, p < 0.001.
Relationship of Individual Independent Variables to Dependent Variable - 1
The pattern of significance for the individual relationships between the dependent variable and the independent variables was the same for the analysis using the full data set and the 75% training sample.
The relationship between age when first child was born and number of children was statistically significant in both the analysis using the full data set (t = -6.132, p < 0.001) and the analysis using the 75% training sample (t = -5.673, p < 0.001).
Relationship of Individual Independent Variables to Dependent Variable - 2
The relationship between age and number of children was statistically significant in both the analysis using the full data set (t = 3.826, p < 0.001) and the analysis using the 75% training sample (t = 3.393, p = 0.001).
Relationship of Individual Independent Variables to Dependent Variable - 3
The relationship between highest year of school completed and number of children was statistically significant in both the analysis using the full data set (t = 2.026, p = 0.044) and the analysis using the 75% training sample (t = 2.415, p = 0.017).
The pattern of statistical significance of the independent variables for the analysis using the 75% training sample matched the pattern identified in the analysis of the full data set.
Comparison of Training Sample and Validation Sample
The total proportion of variance explained in the model using the training sample was 20.5% (0.453²), compared to 17.3% (0.416²) for the validation sample. The shrinkage in R² for the validation analysis is the difference between the R² for the training sample (20.5%) and the R² for the validation sample (17.3%), which equals 3.2% in this analysis. The shrinkage in R² was larger than the 2% criterion for minimal shrinkage, too large to support a conclusion that the regression model based on this analysis would be effective in predicting scores for cases other than those included in the sample.
The validation analysis raised serious questions about the generalizability of the findings of the analysis to the population represented by the sample in the data set.
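The shrinkage computation, written out:

\text{shrinkage} = R^2_{\text{training}} - R^2_{\text{validation}} = 0.453^2 - 0.416^2 \approx 0.205 - 0.173 = 0.032

which exceeds the 0.02 (2%) criterion for minimal shrinkage.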
Answering the problem question - 1
We have found that there is a statistically significant relationship between the set of IVs and the DV (p < 0.001), and the Multiple R was 0.445, which would be characterized as a moderate relationship.
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
- 3. False
- 4. Inappropriate application of a statistic
Answering the problem question - 2
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
- 3. False
- 4. Inappropriate application of a statistic
The b coefficient associated with age was statistically significant (p < 0.001), so there was an individual relationship to interpret. The b coefficient (0.021) was positive, indicating a direct relationship in which higher numeric values for age are associated with higher numeric values for number of children. Therefore, the positive value of b implies that survey respondents who were older had more children.
Answering the problem question - 3
For the independent variable age when first child was born, the probability of the t statistic (t = -6.132) for the b coefficient was p < 0.001, which is less than the level of significance of 0.05. The research hypothesis that there was a relationship between age when first child was born and number of children was supported. The negative sign of the b coefficient (-0.103) meant the relationship between age when first child was born and number of children was an inverse relationship, implying that higher numeric values for the independent variable (age when first child was born) were associated with lower numeric values for the dependent variable (number of children). Survey respondents who were older when their first child was born had fewer children.
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
Answering the problem question - 4
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
Based on the statistical test of the b coefficient (t = 2.026, p = 0.044) for the independent variable "highest year of school completed" [educ], the research hypothesis that there was a relationship between highest year of school completed and number of children was supported. The positive sign of the b coefficient (0.069) meant that the relationship between highest year of school completed and number of children was a direct relationship, implying that higher numeric values for the independent variable (highest year of school completed) were associated with higher numeric values for the dependent variable (number of children). Survey respondents who had completed more years of school had more children.
Answering the problem question - 5
- In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.05 for the statistical analysis. Use a level of significance of 0.01 for evaluating missing data and assumptions. Validate the results of your regression analysis by conducting a 75/25 cross-validation, using 2044128 as the random number seed.
- The variables "age" [age], "age when first child was born" [agekdbrn], and "highest year of school completed" [educ] have a moderate relationship to the variable "number of children" [childs].
- Survey respondents who were older had more children. Survey respondents who were older when their first child was born had fewer children. Survey respondents who had completed more years of school had more children.
- 1. True
- 2. True with caution
- 3. False
- 4. Inappropriate application of a statistic
Though the relationships stated in the problem were correct, the validation analysis raised serious questions about the generalizability of the findings of the analysis to the population represented by the sample in the data set. The answer to the question is false.
Complete regression analysis: level of measurement
The following is a guide to the decision process for answering questions about complete regression analysis.
Is the dependent variable metric and the independent variables metric or dichotomous?
- No → Incorrect application of a statistic.
- Yes → continue with the analysis.
Complete regression analysis: analyzing missing data
Is any variable missing data for more than 5% of the cases in the data set?
- Yes → Create a missing/valid group variable to use in t-tests with the other metric variables in the analysis and chi-square tests with the other nonmetric variables in the analysis.
- No → proceed to the baseline regression.
Probability of the t-tests or chi-square tests < level of significance?
- Yes → Add a caution to the interpretation, requiring further work to understand the pattern.
- No → proceed.
Run the baseline regression with all cases and untransformed variables. Record the R² for comparison with the R² for the revised model.
Complete regression analysis: assumption of normality
Does the dependent variable satisfy the criteria for a normal distribution?
- Yes → proceed to the metric IVs.
- No → test whether a log, square root, or inverse transformation satisfies normality. If one does, use it in the revised model (if more than one transformation satisfies normality, use the one with the smallest skew); if none does, add a caution for violation of normality.
Do the metric IVs satisfy the criteria for a normal distribution?
- Yes → proceed.
- No → test whether a log, square root, or inverse transformation satisfies normality. If one does, use it in the revised model; if none does, add a caution for violation of normality.
Complete regression analysis: assumption of linearity
If an independent variable was transformed to satisfy normality, skip the check for linearity. If the dependent variable was transformed for normality, use the transformed dependent variable in the test for linearity.
Probability of the Pearson correlation (r) < level of significance?
- Yes → check whether a transformed IV has a significant correlation (r) that is larger than the untransformed r by 0.20 or more. If so, use the transformation in the revised model (if more than one transformation satisfies linearity, use the one with the largest r); otherwise treat the untransformed relationship as linear.
- No → check whether the probability of the correlation (r) for the relationship with any transformation of the IV is < the level of significance. If so, use the transformation in the revised model; if not, the relationship is weak and no caution is needed.
Complete regression analysis: assumption of homogeneity of variance
If the dependent variable was transformed for normality, substitute the transformed dependent variable in the test for the assumption of homogeneity of variance.
Probability of the Levene statistic < level of significance?
- Yes → Add a caution for violation of homoscedasticity.
- No → proceed.
Complete regression analysis: detecting outliers
If any variables were transformed for normality or linearity, substitute the transformed variables in the detection of outliers.
Is the standard score for the dependent variable for any case > 3.29, or the probability of Mahalanobis D² < 0.001, or the studentized residual > 3.29?
- Yes → Exclude the outliers from the revised model.
- No → retain all cases.
Run the revised regression using the transformed variables.
Complete regression analysis: picking the regression model for interpretation
Is the R² for the revised regression greater than the R² for the baseline regression by 2% or more?
- Yes → Pick the revised regression with transformations and omitting outliers for interpretation.
- No → Pick the baseline regression with untransformed variables and all cases for interpretation.
Complete regression analysis: assumption of independence of errors
Are the residuals independent (Durbin-Watson between 1.5 and 2.5)?
- No → Add a caution for violation of independence of errors.
- Yes → proceed.
Complete regression analysis: minimum sample size requirement
Is the number of cases available greater than the minimum required sample (104 + the number of independent variables)?
- No → Inappropriate application of a statistic.
- Yes → proceed.
Complete regression analysis: overall relationship
Probability of the ANOVA test of the regression less than or equal to the level of significance?
- No → False.
Strength of the relationship for the included variables interpreted correctly?
- No → False.
- Yes → proceed.
Complete regression analysis: multicollinearity
Tolerance for all IVs greater than 0.10, indicating no multicollinearity?
- No → False.
- Yes → proceed.
Complete regression analysis: individual relationships
Probability of the relationship between each IV and the DV < level of significance?
- No → False.
Direction of the relationship between each IV and the DV interpreted correctly?
- No → False.
- Yes → Set the random seed and randomly split the sample into a 75% training sample and a 25% validation sample.
Complete regression analysis: validation analysis
Probability of the ANOVA test for the training sample < level of significance?
- No → False.
Pattern of significance for the independent variables in the training sample matches the pattern for the full data set?
- No → False.
Shrinkage in R² (R² for the training sample - R² for the validation sample) < 2%?
- No → False.
- Yes → proceed.
Complete regression analysis: answering the question
Problematic pattern of missing data or violation of regression assumptions?
- Yes → True with caution.
Any independent variable or the dependent variable ordinal level of measurement?
- Yes → True with caution.
Number of cases available satisfies the preferred number needed for the validation?
- No → True with caution.
- Yes → True.