Regression Analysis with SPSS

1
Regression Analysis with SPSS
  • Robert A. Yaffee, Ph.D.
  • Statistics, Mapping and Social Science Group
  • Academic Computing Services
  • Information Technology Services
  • New York University
  • Office 75 Third Ave Level C3
  • Tel 212.998.3402
  • E-mail yaffee_at_nyu.edu
  • February 04

2
Outline
  • Conceptualization
  • Schematic Diagrams of Linear Regression processes
  • Using SPSS, we plot and test relationships for
    linearity
  • Nonlinear relationships are transformed to linear
    ones
  • General Linear Model
  • Derivation of Sums of Squares and ANOVA
  • Derivation of intercept and regression
    coefficients
  • The Prediction Interval and its derivation
  • Model Assumptions
  • Explanation
  • Testing
  • Assessment
  • Alternatives when assumptions are unfulfilled

3
Conceptualization of Regression Analysis
  • Hypothesis testing
  • Path Analytical Decomposition of effects

4
Hypothesis Testing
  • For example, Hypothesis 1: X is statistically
    significantly related to Y.
  • The relationship is positive (as X increases, Y
    increases) or negative (as X increases, Y
    decreases).
  • The magnitude of the relationship is small,
    medium, or large.
  • If the magnitude is small, then a unit change in
    X is associated with a small change in Y.

5
Regression Analysis: Have a clear notion of what
you can and cannot do with regression analysis
  • Conceptualization
  • A Path Model of a Regression Analysis

6
In a path analysis, Yi is endogenous. It is the
outcome of several paths.
Direct effects on Y3: C, E, F
Indirect effects on Y3: BF, BDF
Total effects = direct effects + indirect effects
7
Interaction coefficient: C. X1 and X2 must be in the
model for the interaction to be properly specified.
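A minimal sketch of that specification, assuming NumPy is available. The function name and the toy numbers are made up for illustration: both main effects stay in the design matrix alongside the product term.

```python
import numpy as np

def fit_with_interaction(x1, x2, y):
    """Least-squares fit of y = a + b1*x1 + b2*x2 + c*(x1*x2)."""
    x1, x2, y = (np.asarray(v, dtype=float) for v in (x1, x2, y))
    # X1 and X2 are kept in the model next to their product.
    X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [a, b1, b2, c]

# Illustrative data generated from known coefficients (not from the slides).
x1 = [0, 1, 2, 3, 0, 1, 2, 3]
x2 = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1 + 2 * u + 3 * v + 0.5 * u * v for u, v in zip(x1, x2)]
a, b1, b2, c = fit_with_interaction(x1, x2, y)
```

Here c plays the role of the interaction coefficient C in the path diagram.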
8
A Precursor to Modeling with Regression
  • Data Exploration Run a scatterplot matrix and
    search for linear relationships with the
    dependent variable.

9
Click on graphs and then on scatter
10
When the scatterplot dialog box appears, select
Matrix
11
A Matrix of Scatterplots will appear
Search for distinct linear relationships
12
(No Transcript)
13
(No Transcript)
14
Decomposition of the Sums of Squares
15
Graphical Decomposition of Effects
16
Decomposition of the sum of squares
17
Decomposition of the sum of squares
  • Total SS = model SS + error SS
  • and if we divide by df
  • This yields the Variance Decomposition: We have
    the total variance = model variance + error
    variance

18
F test for significance and R2 for magnitude of
effect
  • R2 = Model var / total var
  • F test for model significance
    = Model var / Error var
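
The decomposition and the two ratios can be sketched for a one-predictor regression in plain Python. The function name and the five data points are hypothetical. R2 is computed here as model SS over total SS, and F as the model mean square over the error mean square.

```python
def anova_decomposition(x, y):
    """Sums of squares, R2 and F for a bivariate least-squares fit."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    fitted = [a + b * xi for xi in x]
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_model = sum((fi - my) ** 2 for fi in fitted)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    df_model, df_error = 1, n - 2  # one predictor plus an intercept
    return {
        "ss_total": ss_total,
        "ss_model": ss_model,
        "ss_error": ss_error,
        "r2": ss_model / ss_total,
        "f": (ss_model / df_model) / (ss_error / df_error),
    }

# Made-up illustration data.
out = anova_decomposition([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```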

19
ANOVA tests the significance of the Regression
Model
20
The Multiple Regression Equation
  • We proceed to the derivation of its components
  • The intercept a
  • The regression parameters, b1 and b2

21
Derivation of the Intercept
22
Derivation of the Regression Coefficient
23
  • If we recall that the formula for the
    correlation coefficient can be expressed as
    follows
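
The formulas on slides 21 to 23 did not survive in this transcript. As a sketch only, the standard textbook expressions for the bivariate case (an assumption about what the slides showed) are b = Sxy / Sxx, a = mean(Y) - b * mean(X), and r = Sxy / sqrt(Sxx * Syy), so that b = r * (sd of Y / sd of X). The function name and data are hypothetical.

```python
import math

def bivariate_fit(x, y):
    """Return intercept a, slope b and correlation r for y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx                    # regression coefficient
    a = my - b * mx                  # intercept
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    return a, b, r

# Made-up illustration data.
a, b, r = bivariate_fit([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```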

24
(No Transcript)
25
Extending the bivariate case to the multiple
linear regression case
26
It is also easy to extend the bivariate
intercept to the multivariate case as follows.
27
Significance Tests for the Regression Coefficients
  • We find the significance of the parameter
    estimates by using the F or t test.
  • The R2 is the proportion of variance explained.

28
F and T tests for significance for overall model
29
Significance tests
  • If we are using a type II sum of squares, we
    are dealing with the ballantine. DV variance
    explained = a + b

30
Significance tests
  • T tests for statistical significance

31
Significance tests
  • Standard Error of intercept

Standard error of regression coefficient
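
The standard-error formulas from this slide are not in the transcript. A minimal sketch for the one-predictor case, assuming the usual least-squares expressions: SE(b) = s / sqrt(Sxx) and SE(a) = s * sqrt(1/n + mean(X)^2 / Sxx), where s^2 is the error SS divided by n - 2. Each t value is the estimate divided by its standard error. The function name and data are hypothetical.

```python
import math

def bivariate_t_tests(x, y):
    """Estimate, standard error and t value for intercept and slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                       # residual variance
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1 / n + mx ** 2 / sxx))
    return {"a": (a, se_a, a / se_a), "b": (b, se_b, b / se_b)}

# Made-up illustration data.
result = bivariate_t_tests([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

With a single predictor, the squared t value of the slope equals the model F statistic.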
32
Programming Protocol
After invoking SPSS, proceed to File, Open, Data
33
Select a Data Set (we choose employee.sav) and
click on open
34
We open the data set
35
To inspect the variable formats, click on
variable view on the lower left
36
Because gender is a string variable, we need to
recode gender into a numeric format
37
We autorecode gender by clicking on transform and
then autorecode
38
We select gender and move it into the variable
box on the right
39
Give the variable a new name and click on add new
name
40
Click on ok and the numeric variable sex is
created
It has values 1 for female and 2 for male, and
those value labels are inserted.
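What the autorecode step does can be sketched outside SPSS in plain Python: each distinct string is assigned a consecutive integer in sorted order, and the original strings are kept as value labels. The function name is hypothetical, and the codes "f" and "m" are an assumption about how gender is stored in the data set.

```python
def autorecode(values):
    """Map distinct strings to 1, 2, ... in sorted order; keep labels."""
    labels = dict(enumerate(sorted(set(values)), start=1))
    lookup = {text: code for code, text in labels.items()}
    return [lookup[v] for v in values], labels

# Assumed string codes; not read from employee.sav.
gender = ["m", "f", "f", "m"]
sex, value_labels = autorecode(gender)
```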
41
To invoke regression analysis, click on Analyze
42
Click on Regression and then linear
43
Select the dependent variable Current Salary
44
Enter it in the dependent variable box
45
Entering independent variables
  • These variables are entered in blocks. First come
    the potentially confounding covariates that have
    to be entered.
  • We enter time on job, beginning salary, and
    previous experience.

46
After entering the covariates, we click on next
47
We now enter the hypotheses we wish to test
  • We are testing for minority or sex differences in
    salary after controlling for the time on job,
    previous experience, and beginning salary.
  • We enter minority and numeric gender (sex)
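
Block entry amounts to comparing a covariates-only model with a model that adds the hypothesis variables. A minimal sketch with NumPy, using made-up numbers rather than the employee data; the function names are hypothetical, and the F for the change in R2 is the usual nested-model test.

```python
import numpy as np

def r_squared(X, y):
    """R2 of a least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def f_change(block1, block2, y):
    """R2 for block 1, R2 for blocks 1+2, and F for the R2 change."""
    block1 = np.asarray(block1, dtype=float)
    block2 = np.asarray(block2, dtype=float)
    y = np.asarray(y, dtype=float)
    full = np.column_stack([block1, block2])
    r2_reduced = r_squared(block1, y)
    r2_full = r_squared(full, y)
    q = block2.shape[1]                       # variables added in block 2
    df_error = len(y) - full.shape[1] - 1
    f = ((r2_full - r2_reduced) / q) / ((1.0 - r2_full) / df_error)
    return r2_reduced, r2_full, f

# Made-up data: one covariate (block 1) and one 0/1 group variable (block 2).
covariate = [[1], [2], [3], [4], [5], [6], [7], [8]]
group = [[0], [1], [0], [1], [0], [1], [0], [1]]
y = [2.5, 6.5, 5.5, 11.5, 10.5, 14.5, 13.5, 19.5]
r2_reduced, r2_full, f = f_change(covariate, group, y)
```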

48
After entering these variables, click on
statistics
49
We select the following statistics from the
dialog box and click on continue
50
Click on plots to obtain the plots dialog box
51
We click on OK to run the regression analysis
52
Navigation window (left) and output window (right)
This shows that SPSS is reading the variables
correctly
53
Variables Entered and Model Summary
54
Omnibus ANOVA
Significance Tests for the Model at each stage of
the analysis
55
Full Model Coefficients
56
We omit insignificant variables and rerun the
analysis to obtain trimmed model coefficients
57
Beta weights
  • These are standardized regression coefficients
    used to compare the contribution to the
    explanation of the variance of the dependent
    variable within the model.
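
A beta weight can be obtained from an unstandardized coefficient as beta = b * (sd of X) / (sd of Y). A minimal NumPy sketch with a hypothetical function name and made-up data:

```python
import numpy as np

def beta_weights(X, y):
    """Standardized coefficients from an ordinary least-squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    b = coef[1:]                               # drop the intercept
    return b * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Made-up illustration data with a single predictor.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
betas = beta_weights(np.array(x).reshape(-1, 1), y)
```

With one predictor the beta weight equals the correlation between X and Y.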

58
T tests and signif.
  • These are the tests of significance for each
    parameter estimate.
  • The significance levels have to be less than .05
    for the parameter to be statistically significant.

59
Assumptions of the Linear Regression Model
  • Linear Functional form
  • Fixed independent variables
  • Independent observations
  • Representative sample and proper specification of
    the model (no omitted variables)
  • Normality of the residuals or errors
  • Equality of variance of the errors (homogeneity
    of residual variance)
  • No multicollinearity
  • No autocorrelation of the errors
  • No outlier distortion

60
Explanation of the Assumptions
  • 1. Linear Functional form
  • Does not detect curvilinear relationships
  • Independent observations
  • Representative samples
  • Autocorrelation inflates the t, r, and F
    statistics and warps the significance tests
  • Normality of the residuals
  • Permits proper significance testing
  • Equality of variance
  • Heteroskedasticity precludes generalization and
    external validity
  • This also warps the significance tests
  • Multicollinearity prevents proper parameter
    estimation. It may also preclude computation of
    the parameter estimates completely if it is
    serious enough.
  • Outlier distortion may bias the results. If
    outliers have high influence and the sample is
    not large enough, then they may seriously bias
    the parameter estimates.

61
Diagnostic Tests for the Regression Assumptions
  • Linearity tests: regression curve fitting
  • No level shifts: one regime
  • Independence of observations: runs test
  • Normality of the residuals: Shapiro-Wilk or
    Kolmogorov-Smirnov test
  • Homogeneity of variance of the residuals:
    White's general specification test
  • No autocorrelation of residuals: Durbin-Watson or
    ACF or PACF of residuals
  • Multicollinearity: correlation matrix of
    independent variables; condition index or
    condition number
  • No serious outlier influence: tests of additive
    outliers; pulse dummies
  • Plot residuals and look for high leverage of
    residuals
  • Lists of standardized residuals
  • Lists of studentized residuals
  • Cook's distance or leverage statistics

62
Explanation of Diagnostics
  • Plots show linearity or nonlinearity of
    relationship
  • Correlation matrix shows whether the independent
    variables are collinear and correlated.
  • A representative sample is obtained through
    probability sampling

63
Explanation of Diagnostics
  • Tests for Normality of the residuals. The
    residuals are saved and then subjected to either
    of
  • Kolmogorov-Smirnov Test: tests the limit of the
    theoretical cumulative normal distribution
    against your residual distribution.
  • Nonparametric Tests
  • 1 sample K-S test
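
The K-S statistic itself is the largest gap between the empirical cumulative distribution of the residuals and the normal cumulative distribution. A minimal plain-Python sketch, assuming the normal curve uses the sample mean and standard deviation of the residuals; the function name is hypothetical and no p-value is computed.

```python
import math

def ks_statistic_normal(residuals):
    """Largest distance between the empirical CDF and a fitted normal CDF."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    d = 0.0
    for i, r in enumerate(sorted(residuals), start=1):
        cdf = 0.5 * (1.0 + math.erf((r - mean) / (sd * math.sqrt(2.0))))
        # Compare against the step of the empirical CDF on both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Made-up residuals for illustration.
d = ks_statistic_normal([-1.0, 1.0])
```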

64
Collinearity Diagnostics
65
More Collinearity Diagnostics
  • Condition number
    = maximum eigenvalue / minimum eigenvalue.
  • If condition numbers are between 100 and 1000,
    there is moderate to strong collinearity

If the condition index > 30, then there is strong
collinearity
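
A minimal NumPy sketch of these quantities, assuming they are computed from the eigenvalues of the correlation matrix of the independent variables (SPSS may scale the matrix differently, so its printed values can differ). The condition index is taken here as the square root of the largest eigenvalue over each eigenvalue. The function name and data are hypothetical.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Condition number and condition indices of the predictor correlations."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)     # symmetric matrix
    condition_number = eigenvalues.max() / eigenvalues.min()
    condition_indices = np.sqrt(eigenvalues.max() / eigenvalues)
    return condition_number, condition_indices

# Made-up data: two moderately correlated predictors.
X = np.column_stack([[1, 2, 3, 4, 5], [2, 4, 5, 4, 5]])
cn, ci = collinearity_diagnostics(X)
```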
66
Outlier Diagnostics
  • Residuals: the predicted value minus the actual
    value. This is otherwise known as the error.
  • Studentized residuals: the residuals divided by
    their standard errors, computed without the ith
    observation.
  • Leverage, called the hat diag: the measure of
    influence of each observation.
  • Cook's distance: the change in the statistics
    that results from deleting the observation.
    Watch this if it is much greater than 1.0.

67
Outlier detection
  • Outlier detection involves determining whether
    the residual (error = predicted - actual) is an
    extreme negative or positive value.
  • After running the regression, we may plot the
    residuals versus the fitted values to determine
    which errors are large.

68
Create Standardized Residuals
  • A standardized residual is a residual divided by
    its standard deviation.

69
Limits of Standardized Residuals
  • If the standardized residuals have values in
    excess of 3.5 or below -3.5, they are outliers.
  • If the absolute values are less than 3.5, as
    these are, then there are no outliers
  • While outliers by themselves only distort mean
    prediction when the sample size is small enough,
    it is important to gauge the influence of
    outliers.
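
The 3.5 screening rule can be sketched in plain Python. The function names are hypothetical, the residuals are made up, and the standard deviation is taken as the square root of the error SS over n minus the number of estimated parameters, which is one common choice.

```python
import math

def standardized_residuals(residuals, n_params):
    """Divide each residual by the residual standard deviation."""
    n = len(residuals)
    s = math.sqrt(sum(e * e for e in residuals) / (n - n_params))
    return [e / s for e in residuals]

def flag_outliers(z, limit=3.5):
    """Indices of standardized residuals beyond +/- limit."""
    return [i for i, value in enumerate(z) if abs(value) > limit]

# Made-up residuals from a fit with an intercept and one slope.
z = standardized_residuals([-0.8, 0.6, 1.0, -0.6, -0.2], n_params=2)
outliers = flag_outliers(z)
```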

70
Outlier Influence
  • Suppose we had a different data set with two
    outliers.
  • We tabulate the standardized residuals and obtain
    the following output

71
Outlier a does not distort and outlier b does.
72
Studentized Residuals
  • Alternatively, we could form studentized
    residuals. These are distributed as a t
    distribution with df = n-p-1, though they are not
    quite independent. Therefore, we can
    approximately determine if they are statistically
    significant or not.
  • Belsley et al. (1980) recommended the use of
    studentized residuals.

73
Studentized Residual
These are useful in estimating the statistical
significance of a particular observation, for
which a dummy variable indicator is formed. The
t value of the studentized residual will indicate
whether or not that observation is a
significant outlier. The Stata command to generate
studentized residuals, here named rstudt, is
predict rstudt, rstudent
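For readers without that command, the same quantity can be sketched with NumPy under the usual deleted-residual formula: each residual is divided by s(i) * sqrt(1 - h_i), where s(i) is the residual standard deviation estimated without observation i. The function name and data are hypothetical, and residuals are taken as actual minus fitted.

```python
import numpy as np

def rstudent(x, y):
    """Externally studentized residuals for a least-squares fit."""
    X = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    n, p = X.shape                       # p counts the intercept too
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y                        # actual minus fitted
    sse = e @ e
    # Residual variance with observation i left out, on n - p - 1 df.
    s2_deleted = (sse - e ** 2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_deleted * (1.0 - h))

# Made-up illustration data.
t = rstudent([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```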
74
Influence of Outliers
  • Leverage is measured by the diagonal components
    of the hat matrix.
  • The hat matrix comes from the formula for the
    regression of Y.

75
Leverage and the Hat matrix
  • The hat matrix transforms Y into the predicted
    scores.
  • The diagonals of the hat matrix indicate which
    values will be outliers or not.
  • The diagonals are therefore measures of leverage.
  • Leverage is bounded by two limits: 1/n and 1.
    The closer the leverage is to unity, the more
    leverage the value has.
  • The trace of the hat matrix = the number of
    parameters (p) in the model.
  • When the leverage > 2p/n, there is high
    leverage according to Belsley et al. (1980), cited
    in Long, J.F. Modern Methods of Data Analysis
    (p.262). For smaller samples, Velleman and Welsch
    (1981) suggested that 3p/n is the criterion.
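
These properties can be checked numerically. A minimal NumPy sketch, assuming the hat matrix H = X (X'X)^-1 X' with an intercept column in X, so that p counts the intercept; the function name and the five x values are made up.

```python
import numpy as np

def leverage(x):
    """Diagonal of the hat matrix and the number of parameters p."""
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # fitted values = H @ y
    return np.diag(H), X.shape[1]

# Made-up predictor values; the last one sits far from the rest.
h, p = leverage([1, 2, 3, 4, 10])
n = len(h)
high = np.flatnonzero(h > 2 * p / n)       # the 2p/n rule of thumb
```

In this example only the last observation exceeds 2p/n, and the leverages sum to p.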

76
Cook's D
  • Another measure of influence.
  • This is a popular one. The formula for it is

Cook and Weisberg (1982) suggested that values of
D that exceed the 50% point of the F distribution
(df = p, n-p) are large.
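
The slide's formula is missing from the transcript. A sketch using one standard form, D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2), where e_i is the residual, h_i the leverage, p the number of parameters and s^2 the residual variance. This is an assumption about the form the slide used. The function name and data are hypothetical, and the simpler 4/n screening rule is used for flagging instead of the F percentile.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's D for each observation of a least-squares fit."""
    X = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = (e @ e) / (n - p)                 # residual variance
    return e ** 2 * h / (p * s2 * (1.0 - h) ** 2)

# Made-up illustration data.
d = cooks_distance([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
flagged = np.flatnonzero(d > 4 / len(d))   # 4/n rule of thumb
```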
77
Using Cook's D in SPSS
  • Cook is the option /R
  • Finding the influential outliers
  • List cook, if cook > 4/n
  • Belsley suggests 4/(n-k-1) as a cutoff

78
DFbeta
  • One can use the DFbetas to ascertain the
    magnitude of influence that an observation has on
    a particular parameter estimate if that
    observation is deleted.
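
A minimal sketch of the idea by brute force with NumPy: refit the model once per deleted observation and record how far each coefficient moves. This gives the raw (unscaled) DFBETA; scaled versions divide by a standard error. The function name and data are hypothetical.

```python
import numpy as np

def dfbeta(x, y):
    """Row i holds (full-sample coefficients) - (coefficients without i)."""
    X = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    full, *_ = np.linalg.lstsq(X, y, rcond=None)
    rows = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i      # drop observation i
        reduced, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        rows.append(full - reduced)
    return np.array(rows)                  # columns: intercept, slope

# Made-up illustration data.
changes = dfbeta([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```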

79
Programming Diagnostic Tests
Testing homoskedasticity
Select histogram, normal probability plot, and
insert zresid in Y and zpred in X
Then click on continue
80
Click on Save to obtain the Save dialog box
81
We select the following
Then we click on continue, go back to the Main
Regression Menu and click on OK
82
Check for linear Functional Form
  • Run a matrix plot of the dependent variable
    against each independent variable to be sure that
    the relationship is linear.

83
Move the variables to be graphed into the box on
the upper right, and click on OK
84
Residual Autocorrelation check
See significance tables for this statistic
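The slide image is not in the transcript. Assuming the statistic referred to is the Durbin-Watson d listed among the diagnostic tests, a minimal sketch of its computation from residuals in time order is given here; the function name and residual series are made up.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Made-up residual series: alternating signs versus runs of equal sign.
d_alternating = durbin_watson([1, -1, 1, -1])
d_runs = durbin_watson([1, 1, -1, -1])
```

Values near 2 suggest no first-order autocorrelation; values toward 0 suggest positive and values toward 4 negative autocorrelation.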
85
Run the autocorrelation function from the Trends
Module for a better analysis
86
Testing for Homogeneity of variance
87
Normality of residuals can be visually inspected
from the histogram with the superimposed normal
curve. Here we check the skewness for symmetry
and the kurtosis for peakedness
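A minimal sketch of the two shape measures using simple moment formulas; SPSS reports bias-adjusted versions, so its values differ slightly in small samples. The function name and the sample are hypothetical. Kurtosis is expressed as excess over the normal value of 3.

```python
def skewness_kurtosis(values):
    """Moment-based skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Made-up symmetric sample: skewness is zero, tails are lighter than normal.
skew, kurt = skewness_kurtosis([-2, -1, 0, 1, 2])
```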
88
Kolmogorov-Smirnov Test: An objective test of
normality
89
(No Transcript)
90
(No Transcript)
91
Multicollinearity test with the correlation
matrix
92
(No Transcript)
93
(No Transcript)
94
Alternatives to Violations of Assumptions
  • 1. Nonlinearity: Transform to linearity if
    there is nonlinearity or run a nonlinear
    regression.
  • 2. Nonnormality: Run a least absolute
    deviations regression or a median regression
    (available in other packages or generalized
    linear models: SPLUS glm, STATA glm, or SAS
    PROC MODEL or PROC GENMOD).
  • 3. Heteroskedasticity: weighted least squares
    regression (SPSS) or White estimator (SAS,
    Stata, SPLUS). One can use a robust regression
    procedure (SAS, STATA, or SPLUS) to downweight
    outlier effects in the estimation.
  • 4. Autocorrelation: Run AREG in the SPSS Trends
    module or either the Prais or Newey-West
    procedure in STATA.
  • 5. Multicollinearity: components regression or
    ridge regression or proxy variables. 2sls in
    SPSS or ivreg in Stata or SAS PROC MODEL or PROC
    SYSLIN.
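
As one illustration of item 3, weighted least squares can be sketched with NumPy by scaling each row by the square root of its weight and then running ordinary least squares. The function name, data and weights are made up; in practice the weights would come from a model of the error variance.

```python
import numpy as np

def weighted_least_squares(x, y, weights):
    """Coefficients minimizing the weighted sum of squared residuals."""
    X = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.sqrt(np.asarray(weights, dtype=float))
    # Scaling rows by sqrt(weight) turns WLS into an ordinary LS problem.
    coef, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return coef                             # [intercept, slope]

# Made-up illustration data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
equal = weighted_least_squares(x, y, [1, 1, 1, 1, 1])
downweighted = weighted_least_squares(x, y, [0, 1, 1, 1, 1])
```

Equal weights reproduce the ordinary least-squares fit, and a weight of zero removes an observation from the fit.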

95
Model Building Strategies
  • Specific to General: Cohen and Cohen
  • General to Specific: Hendry and Richard
  • Extreme Bounds analysis: E. Leamer

96
Nonparametric Alternatives
  • If there is nonlinearity, transform to linearity
    first.
  • If there is heteroskedasticity, use robust
    standard errors with STATA or SAS or SPLUS.
  • If there is non-normality, use quantile
    regression with bootstrapped standard errors in
    STATA or SPLUS.
  • If there is autocorrelation of residuals, use
    Newey-West autoregression or First order
    autocorrelation correction with Areg. If there
    is higher order autocorrelation, use Box Jenkins
    ARIMA modeling.