Multivariate Data Analysis Using SPSS

1
Multivariate Data Analysis Using SPSS

John Zhang
ARL, IUP

2
Topics

A Guide to Multivariate Techniques
Preparation for Statistical Analysis
Review ANOVA
Review ANCOVA
MANOVA
MANCOVA
Repeated Measure Analysis
Factor Analysis
Discriminant Analysis
Cluster Analysis

3
Guide-1

Correlation: 1 IV, 1 DV; relationship
Regression: 1 IV, 1 DV; relation/prediction
T test: 1 IV (cat.), 1 DV; group differences
One-way ANOVA: 1 IV (2 cat.), 1 DV; group differences
One-way ANCOVA: 1 IV (2 cat.), 1 DV, 1 covariate; group differences
One-way MANOVA: 1 IV (2 cat.), 2 DVs; group differences

4
Guide-2

One-way MANCOVA: 1 IV (2 cat.), 2 DVs, 1 covariate; group differences
Factorial MANOVA: 2 IVs (2 cat.), 2 DVs; group differences
Factorial MANCOVA: 2 IVs (2 cat.), 2 DVs, 1 covariate; group differences
Discriminant Analysis: 2 IVs, 1 DV (cat.); group prediction
Factor Analysis: explore the underlying structure

5
Preparation for Stat. Analysis-1

Screen data
SPSS Utility procedures
Frequency procedure
Missing data analysis (missing data should be
random)
Check if patterns exist
Drop data case-wise
Drop data variable-wise
Impute missing data

6
Preparation for Stat. Analysis-2

Outliers (generally, statistical procedures are sensitive to outliers)
Univariate case: boxplot
Multivariate case: Mahalanobis distance (a chi-square statistic); a point is an outlier when its p-value is < .001
Treatment
Drop the case
Report two analyses (one with the outlier, one without)

7
Preparation for Stat. Analysis-3

Normality
Testing univariate normal
Q-Q plot
Skewness and kurtosis: both should be 0 for normal data; treat the data as not normal when the p-value is < .01 or .001
Kolmogorov-Smirnov statistic: a significant result means not normal.
Testing multivariate normal
Scatterplots should be elliptical
Each variable must be normal
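
As a rough illustration of the skewness and kurtosis check, the sketch below computes both from population moments. This is an assumption made for simplicity: SPSS reports sample-adjusted versions with standard errors, so its numbers will differ slightly. The function name and the two data lists are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) based on population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]        # hypothetical, symmetric data
right_skewed = [1, 1, 1, 2, 10]    # hypothetical, long right tail

sym_skew, sym_kurt = skewness_and_kurtosis(symmetric)
rs_skew, rs_kurt = skewness_and_kurtosis(right_skewed)
print(sym_skew, sym_kurt)
print(rs_skew, rs_kurt)
```

Values far from 0 on either measure suggest trying a transformation before running the multivariate procedures.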

8
Preparation for Stat. Analysis-4

Linearity
A linear combination of variables makes sense
Two variables (or combinations of variables) are linearly related
Check for linearity
Residual plot in regression
Scatterplots

9
Preparation for Stat. Analysis-5

Homoscedasticity: the covariance matrices are equal across groups
Box's M test: tests the equality of the covariance matrices across groups
Sensitive to normality
Levene's test: tests the equality of variances across groups
Not sensitive to normality

10
Preparation for Stat. Analysis-Example-1

Steps in preparation for stat. analysis
Check variable coding, recode if necessary
Examine missing data
Check for univariate outliers, normality, homogeneity of variances (Explore)
Test for homogeneity of variances (ANOVA)
Check for multivariate outliers (Regression > Save > Mahalanobis)
Check for linearity (scatterplots, residual plots in regression)

11
Preparation for Stat. Analysis-Example-2

Use dataset gssft.sav
Objective: we are interested in investigating group differences (satjob2) in income (income91), age (age_2) and education (educ)
Check for coding: need to recode rincome91 into rincome_2 (22, 98, 99 become system missing)
Transform > Recode > Into Different Variable

12
Preparation for Stat. Analysis-Example-3

Check for missing values
Use Frequencies for categorical variables
Use Descriptive Statistics for measurement variables
For categorical variables
If missing values are < 5%, use the list-wise option
If > 5%, define the missing value as a new category
For measurement variables
If missing values are < 5%, use the list-wise option
If between 5% and 15%, use Transform > Replace Missing Values. Replacing less than 15% of the data has little effect on the outcome
If greater than 15%, consider dropping the variable or subject
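
The decision rule above can be written down as a small helper. This is only a sketch: the function name and return labels are hypothetical, the thresholds are read as percentages, and the handling of exactly 5% and exactly 15% is an assumption, since the slide does not say which side those boundary cases fall on.

```python
def missing_data_action(n_missing, n_total, categorical):
    """Suggest a treatment for missing values following the rule of thumb."""
    pct = 100.0 * n_missing / n_total
    if pct < 5:
        return "listwise"                # drop cases with missing values
    if categorical:
        return "new category"            # treat "missing" as its own category
    if pct <= 15:
        return "replace missing values"  # e.g. Transform > Replace Missing Values
    return "consider dropping"           # drop the variable or the subjects


print(missing_data_action(3, 100, categorical=True))
print(missing_data_action(10, 100, categorical=True))
print(missing_data_action(10, 100, categorical=False))
print(missing_data_action(20, 100, categorical=False))
```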

13
Preparation for Stat. Analysis-Example-4

Check for missing values in satjob2
Analyze > Descriptive Statistics > Frequencies
Check for missing values in rincome_2
Analyze > Descriptive Statistics > Descriptives
Replace the missing values in rincome_2
Transform > Replace Missing Values

14
Preparation for Stat. Analysis-Example-5

Check for univariate outliers, normality, homogeneity of variances
Analyze > Descriptive Statistics > Explore
Put rincome_2, age_2, and educ into the Dependent List box; put satjob2 into the Factor List box
There are outliers in rincome_2; let's change those outliers to the acceptable min or max value
Transform > Recode > Into Different Variable
Put rincome_2 into the Original Variable box, type rincome_3 as the new name
Replace all values < 3 by 4; all other values remain the same

15
Preparation for Stat. Analysis-Example-6

Explore rincome_3 again: not normal
Transform rincome_3 into rincome_4 by ln or sqrt
Explore rincome_4
Check for multivariate outliers
Analyze > Regression > Linear
Put id (dummy variable) into the Dependent box, put rincome_4, age_2, and educ into the Independent box
Click Save, then check the Mahalanobis box
Compare the Mahalanobis distance with the chi-square critical value at p = .001 and df = number of independent variables
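
To make the Mahalanobis comparison concrete, here is a minimal sketch restricted to two variables, so that the 2x2 covariance matrix can be inverted by hand and the chi-square p-value for df = 2 has the closed form exp(-d²/2). The function name and the data (a small grid plus one far-away point) are hypothetical; with three variables, as in the slide, one would use a matrix library and a chi-square table instead.

```python
import math


def mahalanobis_p_values(points):
    """Return (squared Mahalanobis distance, chi-square p-value with df = 2) per point."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    results = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        # Upper-tail chi-square probability for df = 2.
        results.append((d2, math.exp(-d2 / 2)))
    return results


points = [(x, y) for x in range(5) for y in range(4)] + [(30, 30)]
results = mahalanobis_p_values(points)
outliers = [pt for pt, (d2, p) in zip(points, results) if p < 0.001]
print(outliers)
```

Only cases whose p-value falls below .001 are flagged, which matches the cutoff used on the slides.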

16
Preparation for Stat. Analysis-Example-7

Check for multivariate normality
Each variable must be univariate normal
Construct a scatterplot matrix; each scatterplot should have an elliptical shape
Check for homoscedasticity
Univariate (ANOVA, Levene's test)
Multivariate (MANOVA, Box's M test; use the .01 significance level)

17
Review ANOVA -1

One-way ANOVA: tests the equality of group means
Assumptions: independent observations, normality, homogeneity of variance
Two-way ANOVA: tests three hypotheses simultaneously
Test the interaction of the levels of the two independent variables
Interaction occurs when the effect of one factor depends on the level of the second factor
Test the two independent variables separately (main effects)

18
Review ANCOVA -1

Idea: the difference on a DV often does not depend on just one or two IVs; it may also depend on other measurement variables. ANCOVA takes such dependency into account,
i.e. it removes the effect of one or more covariates
Assumptions: in addition to the regular ANOVA assumptions, we need
A linear relationship between the DV and the covariates
The slope of the regression line is the same for each group
The covariates are reliable and measured without error

19
Review ANCOVA -2

Homogeneity of slopes (homogeneity of regression): there is no interaction between the IVs and the covariate
If the interaction between the covariate and the IVs is significant, ANCOVA should not be conducted
Example: determine if hours worked per week (hrs2) differ by gender (sex) and between those satisfied or dissatisfied with their job (satjob2), after adjusting for their income (or equalizing on income)

20
Review ANCOVA -3

Analyze > GLM > Univariate
Move hrs2 into the DV box; move sex and satjob2 into the Fixed Factor box; move rincome_2 into the Covariate box
Click Model > Custom
Highlight all variables and move them to the Model box
Make sure the Interaction option is selected
Click Options
Move sex and satjob2 into the Display Means box
Click Descriptive Stat., Estimates of effect size, and Homogeneity tests
This tests the homogeneity of regression slopes

21
Review ANCOVA -4

If no interaction is found in the previous step, then repeat the previous step but click Model > Factorial instead of Model > Custom

22
Review ANOVA -2

A significant interaction means the two IVs in combination have a significant effect on the DV; thus, it does not make sense to interpret the main effects.
Assumptions: the same as one-way ANOVA
Example: the impact of gender (sex) and age (agecat4) on income (rincome_2)
Explore (omitted)
Analyze > GLM > Univariate
Click Model > click Full factorial > Continue
Click Options > click Descriptive Stat., Estimates of effect size, Homogeneity tests
Click Post Hoc > click LSD, Bonferroni, Scheffe > Continue
Click Plots > put one IV into Horizontal and the other into Separate Lines

23
MANOVA-1

Characteristics
Similar to ANOVA
Multiple DVs
The DVs are correlated and a linear combination makes sense
It tests whether mean differences among k groups on a combination of DVs are likely to have occurred by chance
The idea of MANOVA is to find a linear combination that separates the groups optimally, and perform ANOVA on the linear combination

24
MANOVA-2

Advantages
The chance of discovering what actually changed as a result of the different treatments increases
May reveal differences not shown in separate ANOVAs
Without inflation of Type I error
The use of multiple ANOVAs ignores some very important info (the fact that the DVs are correlated)

25
MANOVA-3

Disadvantages
More complicated
ANOVA is often more powerful
Assumptions
Independent random samples
Multivariate normal distribution in each group
Homogeneity of covariance matrix
Linear relationship among DVs

26
MANOVA-4

Steps in carrying out MANOVA
Check for assumptions
If MANOVA is not significant, stop
If MANOVA is significant, carry out univariate ANOVAs
If a univariate ANOVA is significant, do post hoc tests
If homoscedasticity holds, use Wilks' Lambda; if not, use Pillai's Trace. In general, all 4 statistics should be similar.

27
MANOVA-5

Example: an experiment looking at the memory effects of different instructions. 3 groups of human subjects learned nonsense syllables as they were presented and were administered two memory tests: recall and recognition. The first group of subjects was instructed to like or dislike the syllables as they were presented (to generate affect). A second group was instructed that they would be tested (induce anxiety?). The 3rd group was told to count the syllables as they were presented (interference). The objective is to assess group differences in memory

28
MANOVA-6

How to do it?
File > Open Data
Open the file As9.por in the Instruct > Zhang Multivariate Short Course folder
Analyze > GLM > Multivariate
Move recall and recog into the Dependent Variable box; move group into the Fixed Factors box
Click Options; move group into the Display Means box (this will display the marginal means predicted by the model; these means may differ from the observed means if there are covariates or the model is not factorial)
The Compare main effects box is for testing every pair of the estimated marginal means for the selected factors.
Click Estimates of effect size and Homogeneity of variance

29
MANOVA-7

Push buttons
Plots: create a profile plot for each DV displaying group means
Post Hoc: post hoc tests for marginal means
Save: save predicted values, etc.
Contrasts: perform planned comparisons
Model: specify the model
Options
Display Means for: display the estimated means predicted by the model
Compare main effects: test for significant differences between every pair of estimated marginal means for each of the main effects

30
MANOVA-8

Observed power: produces a statistical power analysis for your study
Parameter estimates: check this when you need a predictive model
Spread vs. level plot: visual display of homogeneity of variance

31
MANOVA-9

Example 2: check for the impact of job satisfaction (satjob) and gender (sex) on income (rincome_2) and education (educ) (in gssft.sav)
Screen data: transform educ to educ2 to eliminate cases with 6 or less
Check for assumptions: Explore
MANOVA

32
MANCOVA-1

Objective: test for mean differences among groups for a linear combination of DVs after adjusting for the covariate.
Example: to test if there are differences in productivity (measured by income and hours worked) for individuals in different age groups after adjusting for education level

33
MANCOVA-2

Assumptions: similar to ANCOVA
SPSS how-to
Analyze > GLM > Multivariate
Move rincome_2 and educ2 to the DV box; move sex and satjob into the IV box; move age to the Covariate box
Check for homogeneity of regression
Click Model > Custom; highlight all variables and move them to the Model box
If the covariate-IV interaction is not significant, repeat the process but select Full factorial under Model

34
Repeated Measure Analysis-1

Objective: test for significant differences in means when the same observation appears in multiple levels of a factor
Examples of repeated measure studies
Marketing: compare customers' ratings of 4 different brands
Medicine: compare test results before, immediately after, and six months after a procedure
Education: compare performance test scores before and after an intervention program

35
Repeated Measure Analysis-2

The logic of repeated measures: SPSS performs repeated measures ANOVA by computing contrasts (differences) across the repeated measures factor's levels for each subject, then testing whether the means of the contrasts are significantly different from 0; any between-subject tests are based on the means of the subjects.
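
The contrast logic can be illustrated with a single contrast. The sketch below computes one contrast score per subject and the one-sample t statistic for testing whether its mean is 0. The function name, the scores (5 subjects under 4 conditions), and the contrast weights are hypothetical; this is not the full SPSS procedure, which tests several contrasts jointly.

```python
import math


def contrast_t_statistic(scores, weights):
    """Per-subject contrast scores and the t statistic for H0: mean contrast = 0."""
    contrasts = [sum(w * s for w, s in zip(weights, row)) for row in scores]
    n = len(contrasts)
    mean = sum(contrasts) / n
    var = sum((c - mean) ** 2 for c in contrasts) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return contrasts, t


# Hypothetical scores: one row per subject, one column per condition.
scores = [
    [10, 12, 9, 14],
    [11, 14, 10, 15],
    [9, 12, 8, 13],
    [12, 13, 11, 17],
    [10, 14, 9, 15],
]
# Contrast comparing condition 1 with condition 2.
contrasts, t = contrast_t_statistic(scores, weights=(1, -1, 0, 0))
print(contrasts, t)
```

The t statistic has n - 1 degrees of freedom; a large absolute value indicates that the two conditions differ.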

36
Repeated Measure Analysis-3

Assumptions
Independent observations
Normality
Homogeneity of variances
Sphericity: if two or more contrasts are to be pooled (the test of the main effect is based on this pooling), then the contrasts should be equally weighted and uncorrelated (equal variances and uncorrelated contrasts). This assumption is equivalent to the covariance matrix of the contrasts being diagonal with equal diagonal elements.

37
Repeated Measure Analysis-4

Example 1: a study in which 5 subjects were tested in each of 4 drug conditions
Open data file
File > Open > Data; select Repmeas1.por
SPSS repeated measures procedure
Analyze > GLM > Repeated Measures
Within-Subject Factor Name (the name of the repeated measure factor): a repeated measure factor is expressed as a set of variables
Replace factor1 with Drug
Number of Levels: the number of repeated measurements
Type 4

38
Repeated Measure Analysis-5

The Measure pushbutton has two functions
For multiple dependent measures (e.g. we recorded 4 measures of physiological stress under each of the drug conditions)
To label the factor levels
Click Measure; type memory in the Measure Name box; click Add
Click Define: here we link the repeated measure factor levels to variable names and define between-subject factors and covariates
Move drug1 to drug4 to the Within-Subjects box
You can move a selected variable with the up and down buttons

39
Repeated Measure Analysis-6

Model button: by default a complete model
Contrasts button: specify particular contrasts
Plots button: create profile plots that graph factor-level estimated marginal means for up to 3 factors at a time
Post Hoc: provides post hoc tests for between-subject factors
Save button: allows you to save predicted values, residuals, etc.
Options: similar to MANOVA
Click Descriptive; click Transformation Matrix (it provides the contrasts)

40
Repeated Measure Analysis-7

Interpret the results
Look at the descriptive statistics
Look at the test for sphericity
If the sphericity test is significant, use the multivariate results (test on the contrasts). It tests whether all of the contrast variables are zero in the population
If the sphericity test is not significant, use the Sphericity Assumed result
Look at the tests for within-subject contrasts: they test the linear trend, the quadratic trend, etc.
These may not make sense in some applications, as in this example (but they make sense in terms of time and dosage)

41
Repeated Measure Analysis-8

Transformation matrix: provides info on what the linear contrast is, etc.
The first table is for the average across the repeated measure factor (here the coefficients are all .5, which means each variable is weighted equally; normalization requires that the sum of the squares equals 1)
The second table defines the contrasts for the corresponding repeated measure factor
Linear: increases by a constant, etc.
Linear and quadratic are orthogonal, etc.
Having concluded there are memory differences due to drug condition, we want to know which conditions differ from which others

43
Repeated Measure Analysis-9

Repeat the analysis, except under the Options button, move drug into Display Means, click Compare main effects and select the Bonferroni adjustment
Transformation Coefficients (M Matrix): it shows how the variables are created for comparison. Here, we compare the drug conditions, so the M matrix is an identity matrix
Suppose we want to test each adjacent pair of means: drug1 vs. drug2, drug2 vs. drug3, drug3 vs. drug4
Repeated Measures > Define > Contrasts > select Repeated

44
Repeated Measure Analysis-10

Example 2: a marketing experiment was devised to evaluate whether viewing a commercial produces improved ratings for a specific brand. Ratings on 3 brands were obtained from subjects before and after viewing the commercial. Since the hope was that the commercial would improve ratings of only one brand (A), researchers expected a significant brand by pre-post commercial interaction. There are two between-subjects factors: sex and brand used by the subject

45
Repeated Measure Analysis-11

SPSS how-to
Analyze > GLM > Repeated Measures
Replace factor1 with prepost in the Within-Subject Factor box; type 2 in the Number of Levels box; click Add
Type brand in the Within-Subject Factor box; type 3 in the Number of Levels box; click Add
Click Measure; type measure in the Measure Name box; click Add
Note: SPSS expects 2 between-subject factors

46
Repeated Measure Analysis-12

Click the Define button; move the appropriate variables into place; move sex and user into the Between-Subject Factor box
Click the Options button; move sex, user, prepost and brand into the Display Means box
Click the Homogeneity tests and Descriptive boxes
Click Plots; move user into the Horizontal Axis box and brand into the Separate Lines box
Click Continue, then OK

47
Factor Analysis-1

The main goal of factor analysis is data
reduction. A typical use of factor analysis is in
survey research, where a researcher wishes to
represent a number of questions with a smaller
number of factors
Two questions in factor analysis
How many factors are there, and what do they represent (interpretation)?
Two technical aids
Eigenvalues
Percentage of variance accounted for

48
Factor Analysis-2

Two types of factor analysis
Exploratory: introduced here
Confirmatory: SPSS AMOS
Theoretical basis
Correlations among variables are explained by underlying factors
An example of a mathematical 1-factor model for two variables
V1 = L1 × F1 + E1
V2 = L2 × F1 + E2

49
Factor Analysis-3

Each variable is composed of a common factor (F1) multiplied by a loading coefficient (L1, L2: the lambdas or factor loadings) plus a random component
V1 and V2 correlate because they share the common factor, and their correlation should relate to the factor loadings; thus, the factor loadings can be estimated from the correlations
A set of correlations can yield different factor loadings (i.e. the solutions are not unique)
One should pick the simplest solution
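
As a worked illustration of why loadings can be estimated from correlations and yet are not unique, assume (this is not stated on the slides) that V1, V2 and F1 are standardized and that E1 and E2 are uncorrelated with F1 and with each other. The numbers in the last line are hypothetical.

```latex
\begin{align*}
V_1 &= L_1 F_1 + E_1, \qquad V_2 = L_2 F_1 + E_2 \\
\operatorname{Corr}(V_1, V_2) &= L_1 L_2 \operatorname{Var}(F_1) = L_1 L_2 \\
r_{12} = 0.36 &\;\Rightarrow\; (L_1, L_2) = (0.6,\ 0.6) \ \text{or}\ (0.9,\ 0.4) \ \text{or}\ \dots
\end{align*}
```

With only one observed correlation, many pairs of loadings reproduce it equally well, which is why additional variables and a simplicity criterion are needed.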

50
Factor Analysis-4

A factor solution needs to be confirmed
By a different factor method
By a different sample
More on terminology
Factor loading: interpreted as the Pearson correlation between the variable and the factor
Communality: the proportion of variability for a given variable that is explained by the factors
Extraction: the process by which the factors are determined from a large set of variables

51
Factor Analysis-5

Principal component: one of the extraction methods
A principal component is a linear combination of observed variables that is independent of (orthogonal to) the other components
The first component accounts for the largest amount of variance in the input data; the second component accounts for the largest amount of the remaining variance
Orthogonal components are uncorrelated

52
Factor Analysis-6

Possible application of principal components
E.g. in survey research, it is common to have many questions addressing one issue (e.g. customer service). It is likely that these questions are highly correlated. It is problematic to use these variables in some statistical procedures (e.g. regression). One can use factor scores, computed from the factor loadings on each orthogonal component

53
Factor Analysis-7

Principal component vs. other extraction methods
Principal components focus on accounting for the maximum amount of variance (the diagonal of a correlation matrix)
Other extraction methods (e.g. principal axis factoring) focus more on accounting for the correlations between variables (off-diagonal correlations)
Principal components can be defined as a unique combination of variables, but the other factor methods cannot
Principal components are used for data reduction but are more difficult to interpret

54
Factor Analysis-8

Number of factors
Eigenvalues are often used to determine how many factors to take
Take as many factors as there are eigenvalues greater than 1
An eigenvalue represents the amount of standardized variance in the variables accounted for by a factor
The amount of standardized variance in a variable is 1
The sum of the eigenvalues of the retained factors, divided by the number of variables, is the proportion of variance accounted for
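
The eigenvalue rule can be checked by hand on the eigenvalues that SPSS prints. The sketch below takes a list of eigenvalues from a correlation matrix, keeps those greater than 1, and reports the percentage of variance they account for. The function name and the six eigenvalues are hypothetical; the only fact relied on is that the eigenvalues of a correlation matrix sum to the number of variables.

```python
def eigenvalue_rule(eigenvalues):
    """Number of factors with eigenvalue > 1 and the % of variance they explain."""
    n_vars = len(eigenvalues)  # total standardized variance equals n_vars
    retained = [ev for ev in eigenvalues if ev > 1.0]
    pct_variance = 100.0 * sum(retained) / n_vars
    return len(retained), pct_variance


# Hypothetical eigenvalues for six variables (they sum to 6).
eigenvalues = [3.2, 1.4, 0.6, 0.4, 0.25, 0.15]
n_factors, pct_variance = eigenvalue_rule(eigenvalues)
print(n_factors, round(pct_variance, 1))
```

A scree plot is a common complement to this rule, since an eigenvalue just above or just below 1 should not be decisive on its own.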

55
Factor Analysis-9

Rotation
Objective: to facilitate interpretation
Orthogonal rotation: done when data reduction is the objective and factors need to be orthogonal
Varimax: attempts to simplify interpretation by maximizing the variances of the variable loadings on each factor
Quartimax: simplifies the solution by finding a rotation that produces high and low loadings across factors for each variable
Oblique rotation: use when there is reason to allow factors to be correlated
Oblimin and Promax (Promax runs fast)

56
Factor Analysis-10

Factor scores: if you are satisfied with a factor solution
You can request that a new set of variables be created that represents the scores of each observation on the factors (difficult to interpret)
You can use the lambda coefficients to judge which variables are highly related to the factor, then compute the sum or the mean of these variables for further analysis (easy to interpret)

57
Factor Analysis-11

Sample size: the sample size should be about 10 to 15 times the number of variables (as in other multivariate procedures)
Number of methods: there are 8 factoring methods, including principal components
Principal axis: accounts for correlations between the variables
Unweighted least-squares: minimizes the residual between the observed and the reproduced correlation matrix

58
Factor Analysis-12

Generalized least-squares: similar to unweighted least-squares but gives more weight to the variables with stronger correlations
Maximum likelihood: generates the solution that is most likely to have produced the correlation matrix
Alpha factoring: considers variables as a sample; does not use factor loadings
Image factoring: decomposes the variables into a common part and a unique part, then works with the common part

59
Factor Analysis-13

Recommendations
Principal components and principal axis are the most commonly used methods
When there is multicollinearity, use principal components
Rotations are often done. Try to use Varimax

60
Factor Analysis-14

Example 1: whether a small number of athletic skills account for performance in the ten separate decathlon events
File > Open > Data; select Olymp88.por
Looking at correlations
Analyze > Correlate > Bivariate
Principal components with orthogonal rotation
Analyze > Data Reduction > Factor
Select all variables except score
Click the Extraction button > click Scree plot
Check off Unrotated factor solution
Click Continue

61
Factor Analysis-15

Click the Rotation button > click Varimax and Loading plots; click Continue
Click the Options button > click Sorted by size; click the Suppress absolute values box and change .1 to .3; click Continue
Click Descriptives > Univariate descriptives, KMO and Bartlett's test of sphericity (KMO measures how well the sample data are suited for factor analysis: .9 is great and less than .5 is not acceptable; Bartlett's test tests the sphericity of the correlation matrix); click Continue
Click OK

62
Factor Analysis-16

Try to validate the first factor solution using a different method
Analyze > Data Reduction > Factor
Click Extraction > select Principal axis factoring; click Continue
Click Rotation > select Direct Oblimin (leave the delta value at 0, the most oblique value possible); type 50 in the Max Iterations box; click Continue
Click the Scores button > click Save as variables (this involves solving a system of equations for the factors; regression is one of the methods to solve the equations); click Continue
Click OK

63
Factor Analysis-17

Note: the Pattern matrix gives the standardized linear weights and the Structure matrix gives the correlations between variables and factors (in principal component analysis, the component matrix gives both the factor loadings and the correlations)

64
Discriminant Analysis-1

Discriminant analysis characterizes the relationship between a set of IVs and a categorical DV with relatively few categories
It creates a linear combination of the IVs that best characterizes the differences among the groups
Predictive discriminant analysis focuses on creating a rule to predict group membership
Descriptive DA studies the relationship between the DV and the IVs.

65
Discriminant Analysis-2

Possible applications
Whether a bank should offer a loan to a new
customer?
Which customer is likely to buy?
Identify patients who may be at high risk for
problems after surgery

66
Discriminant Analysis-3

How does it work?
Assume the population of interest is composed of distinct populations
Assume the IVs follow a multivariate normal distribution
DA seeks a linear combination of the IVs that best separates the populations
If we have k groups, we need k-1 discriminant functions
A discriminant score is computed for each function
This score is used to classify cases into one of the categories

67
Discriminant Analysis-4

There are three methods to classify group membership
Maximum likelihood method: assign a case to group k if the probability of membership is greater in group k than in any other group
Fisher (linear) classification functions: assign a case to group k if its score on the function for group k is greater than its score on any other function
Distance function: assign a case to group k if its distance to the centroid of that group is the minimum
Note: SPSS uses the maximum likelihood method
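
The distance-function rule is the easiest of the three to sketch. The example below assigns a case to the group with the nearest centroid using plain Euclidean distance, which is a simplifying assumption: discriminant analysis normally measures distance in a way that accounts for the covariance of the IVs. The function names, group labels, and data are hypothetical.

```python
import math


def centroid(rows):
    """Mean of each variable across the cases in one group."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def classify_by_distance(case, groups):
    """Return the label of the group whose centroid is closest to the case."""
    centroids = {label: centroid(rows) for label, rows in groups.items()}
    return min(centroids, key=lambda label: math.dist(case, centroids[label]))


# Hypothetical training cases: two IVs per case, two groups.
groups = {
    "buy": [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)],
    "no_buy": [(8.0, 8.0), (9.0, 9.0), (10.0, 10.0)],
}
predicted = classify_by_distance((2.5, 3.0), groups)
print(predicted)
```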

68
Discriminant Analysis-5

Basic steps in DA
Identify the variables
Screen data: look for outliers, variables that may not be good predictors, etc.
Run DA
Check for the correct prediction rate
Check for the importance of individual predictors
Validate the model

69
Discriminant Analysis-6

Assumptions
IVs are either dichotomous or measurement
Normality
Homogeneity of variances

70
Discriminant Analysis-7

Example 1: VCR buyers filled out a survey; we want to determine which set of demographic information and attitudes best predicts which customers may buy another VCR
File > Open Data > CSM.por
Explore the data
Analyze > Classify > Discriminant
Move age, complain, educ, fail, pinnovat, preliabl, puse, qual, use, and value into the Independents box
Move buyyes into the Grouping Variable box
Click Define Range; type 1 for Min and 2 for Max
Click Continue

71
Discriminant Analysis-8

Click Statistics > click Box's M and Fisher's; Continue
Click the Classify button > click Summary table and Separate-groups; Continue
Click the Save button > click Discriminant scores; Continue
Click OK
How are the original variables related to the discriminant score?
Graphs > Scatter > click Define
Move pinnovat into X and dis1_1 into Y; move buyyes into the Set Markers by box

72
Discriminant Analysis-9

Since Box's M test was significant, one can ask SPSS to run DA using the separate covariances option (under Classify) and compare the results
From the 1st analysis, we see that age was not important; one can redo the analysis without age and compare the results

73
Discriminant Analysis-10

Validate the model: leave-one-out classification
Repeat the analysis; click Classify > click Leave-one-out classification; click Continue
Example 2: predict smoking and drinking habits
Analyze > Classify > Discriminant
Move smkdrnk into the Grouping Variable box; move age, attend, black, class, educ, sex and white into the IV list
Click Statistics > select Fisher's and Box's M; Continue
Click Classify > Summary table, Combined-groups, Territorial map; Continue
Click OK

74
Cluster Analysis-1

Cluster analysis is an exploratory data analysis technique designed to reveal groups
How?
By distance: observations that are close together should be in the same group, and observations in different groups should be far apart
Applications
Plants and animals into ecological groups
Companies for product usage

75
Cluster Analysis-2

Two types of method
Hierarchical: requires observations to remain together once they have joined a cluster
Complete linkage
Between-groups average linkage
Ward's method
Nonhierarchical: no such requirement
The researcher must pick the number of clusters to run (K-means algorithm)
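
For readers who want to see what the K-means algorithm does with the chosen number of clusters, here is a minimal sketch. It alternates between assigning each observation to its nearest center and moving each center to the mean of its cluster. The function name, the data, and the starting centers are hypothetical, and SPSS's own implementation chooses starting centers differently.

```python
def k_means(points, centers, max_iter=100):
    """Basic K-means; len(centers) is the number of clusters the researcher picked."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            idx = dists.index(min(dists))
            labels.append(idx)
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return labels, centers


# Hypothetical standardized observations forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centers)
```

Unlike the hierarchical methods, an observation here can change cluster from one iteration to the next.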

76
Cluster Analysis-3

Recommendations
For relative small samples, use hierarchical
(less than a few hundred)
For large samples, use K-means
Example 1: evaluating 20 types of beer
File > Open > Data; select beer.por
Analyze > Descriptive Statistics > Descriptives
Move cost, calories, sodium, and alcohol into the variable list
Click Save standardized values; OK

77
Cluster Analysis-4

Analyze > Classify > Hierarchical Cluster
Move cost, calories, sodium, and alcohol into the Variables list box
Move Beer into the Label Cases by box
Click Plots > click Dendrogram; click None in the Icicle area; Continue
Click Method > select Z scores from the Standardize drop-down list; Continue
Click Save > click Range of solutions; range 2-5 clusters; Continue
OK

78
Cluster Analysis-5

Additional analysis
Look at the last 4 columns of the data (clu5_1 to clu2_1); they contain memberships for each solution between 5 and 2 clusters
Analyze > Descriptive Statistics > Frequencies
Move clu2_1 to clu5_1 to the Variable box
OK
Obtain mean profile for clusters
Graphs > Line > Summaries of separate variables
Click Define > move zcost, zcalorie, zsodium, and zalcohol to the Lines Represent box
Click clu4_1 and move it to the Category box

79
Path Analysis-1

Path analysis is a technique based on regression
to establish causal relationship
Start with a diagram with causal flow
Direct causal effects model (regression)
The direct causal effect of an IV on a DV is the regression coefficient (the number of units the DV changes for a 1-unit change in the IV)
Building on the DCEM
Two forms of causal model
Diagram
Equation (structural equation)

80
Path Analysis-2

An example of a causal model
Structural equation
Z4 = p41 Z1 + p42 Z2 + p43 Z3 + e4
p: path coefficient
e: disturbance
Z4: endogenous variable
Z1: exogenous variable
Path diagram
The indirect effect is the product of the path coefficients along the path
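
A small numeric illustration of the indirect effect may help. The sketch assumes a hypothetical diagram in which Z1 affects Z4 directly (p41) and also through Z2 (p21, then p42). The path p21 and all coefficient values are invented for the example and do not come from the slides.

```python
from math import prod


def indirect_effect(path_coefficients):
    """Indirect effect along one path: the product of its path coefficients."""
    return prod(path_coefficients)


# Hypothetical standardized path coefficients.
p41 = 0.30  # direct path Z1 -> Z4
p21 = 0.50  # path Z1 -> Z2
p42 = 0.40  # path Z2 -> Z4

direct = p41
indirect = indirect_effect([p21, p42])  # Z1 -> Z2 -> Z4
total = direct + indirect
print(direct, indirect, total)
```

When several indirect paths connect the same two variables, their products are added together before adding the direct effect.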

81
Path Analysis-3

Steps in path analysis
Create a path diagram
Use regression to estimate structural equation
coefficients
Assess the model
Compare the observed and reproduced correlations (reproduced correlations will be computed by hand)

82
Path Analysis-4

Research questions
Is our model, which describes the causal effects among the variables region of the world, status as a developing nation, number of doctors, and male life expectancy, consistent with the observed correlations among these variables?
If our model is consistent, what are the estimated direct, indirect, and total causal effects among the variables?

83
Path Analysis-5

Legal path
No path may pass through the same variable more
than once
No path may go backward on an arrow after going
forward on another arrow
No path may include more than one double headed
curve arrow

84
Path Analysis-6

Component labels
D: direct effect (just one straight arrow)
I: indirect effect (more than one straight arrow)
S: spurious effect (there is a backward arrow)
U: effect is uncertain (starts with a double-headed curved arrow)

85
Path Analysis-7

If the model is in question (some of the reproduced correlations differ from the observed correlations by more than .05)
Test all missing paths (run additional regressions and check the significance of the coefficients)
Remove existing paths if their coefficients are not significant

86
Logistic regression - Motivations

When the dependent variable is dichotomous, regular regression is not appropriate
We want to predict a probability
OLS regression predictions could be any numbers, not just numbers between 0 and 1
When dealing with proportions, the variance depends on the mean, so the equal variance assumption of OLS is violated

87
Motivations-Continue

Fit an S-shaped curve to the data

88
What is Logistic Regression?

Regressions of the form
ln(Odds) = B0 + B1*X1 + ... + Bk*Xk
ln(Odds) is called a logit
Odds = Prob/(1 - Prob)
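
The relationship between probability, odds, and logit can be sketched with three small functions. The function names are illustrative choices, not part of SPSS.

```python
import math

def odds(prob):
    """Odds = Prob / (1 - Prob), for 0 < prob < 1."""
    return prob / (1.0 - prob)

def logit(prob):
    """Logit = natural log of the odds."""
    return math.log(odds(prob))

def prob_from_logit(value):
    """Invert the logit to recover the probability."""
    return 1.0 / (1.0 + math.exp(-value))

for p in (0.1, 0.5, 0.9):
    print(p, round(odds(p), 3), round(logit(p), 3))
```

A probability of .5 corresponds to odds of 1 and a logit of 0, and the logit is symmetric around that point.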

89
Application of Logistic Regression

When to use it?
When the dependent variable is dichotomous
Objectives
Run a logistic regression
Apply a stepwise logistic regression
Use an ROC (receiver operating characteristic) curve
to assess the model

90
Assumptions of logistic regression

The independent variables are interval or dichotomous
All relevant predictors are included, no
irrelevant predictors are included, and the form of
the relationship is linear
The expected value of the error term is zero
There is no autocorrelation

91
Assumptions of logistic regression Cont.

There is no correlation between the error and the
independent variables
There is an absence of perfect multicollinearity
between the independent variables
A large sample is needed (rule of thumb: n
should be > 30 times the number of parameters)

92
Note on assumptions

No need for normality of errors
No need for equal variance

93
Example

Objective: to predict low birth weight babies
Variables
Low: 1 if < 2500 grams, 0 if >= 2500 grams
LWT: weight at last menstrual cycle
Age
Smoke
PTL: number of premature deliveries
HT: history of hypertension
UI: uterine irritability
FTV: number of physician visits during first trimester
Race: 1 = white, 2 = black, 3 = other

94
Example

File > Open > Data > select SPSS Portable type >
select Birthwt (in Regression)
Analyze > Regression > Binary Logistic
Move low to the Dependent list box
Move age, ftv, ht, ptl, race, smoke,
and ui into the Covariate list box

95
Example (cont.)

Click the Categorical button
Place race in the Categorical Covariates box
Click Continue, click Save
Click the Probability and Group Membership check
boxes
Click Continue and then the Options button

96
Example (cont.)

Click on the Classification plots and
Hosmer-Lemeshow goodness of fit checkboxes
Click Continue, then OK

97
Logistic outputs

Initial summary output: information on the dependent and
categorical variables
Block 0: based on a model that includes only a
constant; provides baseline information
Block 1 (Method = Enter): contains the model information
Chi-square: tests whether all the coefficients are 0 (similar
to F in regression)

98
Logistic outputs (cont.)

The Model chi-square value is the difference between
the initial and final -2LL (a small value of -2LL
indicates a good fit; -2LL = 0 indicates a perfect
fit)
The Step and Block rows display the results of the last
step and block (they are the same here because we
are not using stepwise regression)
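
As a rough sketch of how -2LL and the model chi-square relate, the function below computes -2LL from 0/1 outcomes and predicted probabilities. The outcomes and "fitted" probabilities are hypothetical and are not maximum likelihood estimates from any real model.

```python
import math

def neg2_log_likelihood(y, p):
    """-2LL for 0/1 outcomes y and predicted probabilities p (0 < p < 1)."""
    return -2.0 * sum(
        yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )

# Hypothetical outcomes.
y = [1, 0, 0, 1, 0, 0]

# Initial (constant-only) model: every case gets the overall event rate.
base_rate = sum(y) / len(y)
initial = neg2_log_likelihood(y, [base_rate] * len(y))

# Final model: hypothetical predicted probabilities, for illustration only.
final = neg2_log_likelihood(y, [0.8, 0.2, 0.1, 0.7, 0.3, 0.2])

# Model chi-square = initial -2LL minus final -2LL.
model_chi_square = initial - final
print(round(initial, 3), round(final, 3), round(model_chi_square, 3))
```

As the predicted probabilities approach the observed 0/1 values, -2LL approaches 0, which is the "perfect fit" case.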

99
Logistic outputs (cont.)

The goodness-of-fit statistic -2LL is 203.554
Cox & Snell R square: similar to R-square in OLS
Nagelkerke R square (preferred because it can reach 1)
Hosmer and Lemeshow test: tests that there is no
difference between expected and observed counts,
i.e., we prefer a non-significant result
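
The two pseudo R-square measures can be sketched from the -2LL values using their usual textbook definitions. Only the final -2LL (203.554) comes from the slide; the initial -2LL and the sample size below are hypothetical placeholders because the slides do not report them.

```python
import math

def cox_snell_r2(initial_neg2ll, final_neg2ll, n):
    """Cox & Snell R square: 1 - exp(-(initial -2LL - final -2LL) / n)."""
    return 1.0 - math.exp(-(initial_neg2ll - final_neg2ll) / n)

def nagelkerke_r2(initial_neg2ll, final_neg2ll, n):
    """Nagelkerke R square: Cox & Snell rescaled by its maximum possible value."""
    max_cox_snell = 1.0 - math.exp(-initial_neg2ll / n)
    return cox_snell_r2(initial_neg2ll, final_neg2ll, n) / max_cox_snell

FINAL_NEG2LL = 203.554    # reported on the slide
INITIAL_NEG2LL = 230.0    # hypothetical, not from the slides
N_CASES = 180             # hypothetical, not from the slides

cs = cox_snell_r2(INITIAL_NEG2LL, FINAL_NEG2LL, N_CASES)
nk = nagelkerke_r2(INITIAL_NEG2LL, FINAL_NEG2LL, N_CASES)
print(round(cs, 3), round(nk, 3))
```

The rescaling is why Nagelkerke can reach 1 for a perfect fit (final -2LL of 0), whereas Cox & Snell cannot.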

100
Logistic outputs (cont.)

Classification table: can our model predict
accurately?
Overall accuracy is 73%
We do much better on higher birth weight
The model does a poor job on lower birth weight
A significant model does not necessarily have high
predictive accuracy
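
A classification table of this kind can be sketched as follows. The outcomes and predicted probabilities are hypothetical, chosen so that the model does well on one group and poorly on the other. They are not the birth weight results.

```python
def classification_table(actual, predicted_prob, cutoff=0.5):
    """Count true/false positives and negatives at the given cutoff."""
    table = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            table["tp"] += 1
        elif predicted == 0 and y == 0:
            table["tn"] += 1
        elif predicted == 1 and y == 0:
            table["fp"] += 1
        else:
            table["fn"] += 1
    return table

# Hypothetical data: 1 = event (e.g. low birth weight), 0 = no event.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.2, 0.4, 0.6, 0.7, 0.3, 0.4, 0.2]

table = classification_table(actual, probs)
overall = (table["tp"] + table["tn"]) / len(actual)
accuracy_event = table["tp"] / (table["tp"] + table["fn"])
accuracy_no_event = table["tn"] / (table["tn"] + table["fp"])
print(table, overall, accuracy_event, round(accuracy_no_event, 3))
```

Overall accuracy can look acceptable while accuracy within the event group stays low, which is the pattern the slide describes.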

101
Interpretation of the coefficients

E.g. HT (hypertension)
B = 1.736: hypertension in the mother increases the
log odds by 1.736
Exp(B) = 5.831: hypertension in the mother
increases the odds of having a low birth weight baby by a
factor of 5.831
What is the change in probability?
If the original odds are 1:100 (p = .0099), they
change to 5.831:100 (p = .0551); if the original
odds are 1:1 (p = .5), they change to about 5:1 (p = .83)
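
The probability change can be checked by converting odds to probabilities before and after multiplying by Exp(B). This sketch uses the slide's Exp(B) of 5.831; the helper name is an illustrative choice.

```python
EXP_B_HT = 5.831  # Exp(B) for HT as reported on the slide

def prob_from_odds(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

for base_odds in (1 / 100, 1 / 1):
    before = prob_from_odds(base_odds)
    after = prob_from_odds(base_odds * EXP_B_HT)
    print(f"odds {base_odds:.2f}: p {before:.4f} -> {after:.4f}")
```

Using the unrounded ratio of 5.831:1, the second case comes out near .85. The slide's .83 corresponds to rounding the odds to 5:1.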

102
Interpretation of the coefficients (cont.)

Categorical variable Race
First an overall effect
Race(1) (white): the effect of being white is
significant, decreasing the odds by a factor of .4
compared with those of the "other" category
The effect of being black is not significant
compared with "other"

103
Making prediction

Suppose a mother
Age 20
Weight 130 pounds
Smoke
No hypertension or premature labor
Has uterine irritability
White
Two visits to her doctor

104
Making prediction (cont.)

P(event) = 1/(1 + exp(-(a + b1*X1 + ... + bk*Xk)))
P = .397
Predicted not to have a low birth weight baby because
the probability is less than .5
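
The prediction formula can be sketched as below. The slides do not list the fitted coefficients, so the intercept, coefficients, and predictor values here are hypothetical placeholders; only the .5 cut-off and the reported probability of .397 come from the slide.

```python
import math

def predicted_probability(intercept, coefs, values):
    """P(event) = 1 / (1 + exp(-(a + b1*X1 + ... + bk*Xk)))."""
    z = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1.0 / (1.0 + math.exp(-z))

def classify(prob, cutoff=0.5):
    """Predict the event (1) only when the probability reaches the cutoff."""
    return 1 if prob >= cutoff else 0

# Hypothetical intercept, coefficients, and predictor values.
p = predicted_probability(-1.0, [0.02, 0.5], [20, 1])
print(round(p, 3), classify(p))

# The slide's reported probability falls below the cutoff.
print(classify(0.397))
```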

105
Checking classification

Need to study the characteristics of mispredicted
cases
Transform > Compute > Pred_err = 1 if
Analyze > Compare Means (LWT vs Pred_err)
The mean LWT for mispredicted cases is much lower than
for correctly predicted cases

106
Residual Analysis

Analyze > Regression > Logistic > click Save > check
Cook's, Leverage, Unstandardized, Logit, and
Standardized
Examining the data
Cook's distance and leverage should be small (if a case
has no influence on the regression result, the
values would be 0)
res_1 is the residual of the probability (e.g., the 1st
case has predicted probability .29804 and the actual
value of low is 0, so res_1 = 0 - .29804 = -.29804)
zre_1 is the standardized residual of the probabilities
lre_1 is the residual in terms of the logit
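
The three residuals can be sketched for the slide's first case (predicted probability .29804, actual value 0). The formulas for the standardized and logit residuals are assumptions based on the usual definitions (residual divided by sqrt(p(1-p)), and residual divided by p(1-p)); the slides do not state them.

```python
import math

def residuals(actual, prob):
    """Return (unstandardized, standardized, logit) residuals for one case.

    Assumed definitions:
      unstandardized = actual - prob
      standardized   = (actual - prob) / sqrt(prob * (1 - prob))
      logit          = (actual - prob) / (prob * (1 - prob))
    """
    raw = actual - prob
    variance = prob * (1.0 - prob)
    return raw, raw / math.sqrt(variance), raw / variance

res_1, zre_1, lre_1 = residuals(0, 0.29804)
print(round(res_1, 5), round(zre_1, 4), round(lre_1, 4))
```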

107
ROC curve (Receiver Operating Characteristic)

Sensitivity: true positive rate
Specificity: true negative rate
Changing the cut-off point (.5) changes both the
sensitivity and the specificity
The ROC curve can help us select an optimal cut-off
point
Graph > ROC Curve > move pre_1 to Test Variable,
low to State Variable, type 1 in Value
of State Variable, check With diagonal
reference line and Coordinate points of the ROC
Curve

108
ROC curve interpretation

Vertical axis: sensitivity (true positive rate)
Horizontal axis: false positive rate (1 - specificity)
Diagonal reference line
Gives the trade-off between sensitivity and the false
positive rate
Pay attention to the region where the curve rises
rapidly
The 1st column of the coordinates of the curve gives
the cut-off probability
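
The coordinate points of an ROC curve can be sketched by sweeping the cut-off over the predicted probabilities and recording sensitivity against 1 - specificity (the false positive rate). The outcomes and scores are hypothetical, and `roc_points` is an illustrative helper, not SPSS output.

```python
def roc_points(actual, scores):
    """Return (cutoff, sensitivity, false positive rate) for each distinct
    score, classifying a case as positive when score >= cutoff."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 0)
        points.append((cutoff, tp / positives, fp / negatives))
    return points

# Hypothetical outcomes (1 = event) and predicted probabilities.
actual = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

points = roc_points(actual, scores)
for cutoff, sensitivity, fpr in points:
    print(cutoff, round(sensitivity, 3), round(fpr, 3))
```

Lowering the cut-off raises sensitivity but also raises the false positive rate, which is the trade-off the curve displays.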

109
Residual Analysis Cont.

Examine the distribution of zre_1
Graph > Interactive > Histogram > drag zre_1 to the X axis,
click Histogram, click Normal Curve
Note: this plot need not show normality
Finding influential cases
Graph > Scatterplot > Define > move id to the X axis, coo_1
to the Y axis
Multicollinearity
Use OLS regression to check (?)

110
Multinomial Logistic Regression

The dependent variable is categorical with two or
more categories
It is an extension of the logistic regression
The assumptions are the assumptions for logistic
regression plus the assumption that the dependent
variable has a multinomial distribution

111
Example

Objective: predict credit risk (3
categories) based on financial and demographic
variables
Variables
Age
Income
Gender
Marital (single, married, divsepwid)
Numkids: number of dependent children

112
Example Cont.

Numcards: number of credit cards
Howpaid: how often paid (weekly, monthly)
Mortgage: have a mortgage (y, n)
Storecar: number of store credit cards
Loans: number of other loans
Risk: 1 = bad risk-loss, 2 = bad risk-profit, 3 = good risk

113
How does it work?

Let f(j) be the probability of being in outcome
category j
f(1) = P(bad risk-loss)
f(2) = P(bad risk-profit)
f(3) = P(good risk)
g(1) = f(1)/f(3)
g(2) = f(2)/f(3)
g(3) = f(3)/f(3) = 1

114
How does it work? Cont.

Fit the model
ln(g(1)) = A1 + B11*X1 + ... + B1k*Xk
ln(g(2)) = A2 + B21*X1 + ... + B2k*Xk
ln(g(3)) = ln(1) = 0 = A3 + B31*X1 + ... + B3k*Xk

115
How does it work? Cont.
116
Example Cont.

File > Open > Data > select Risk > Open
Move risk into dependent list box
Move marital and mortgage into the Factor(s) list
box
Move income and numkids into the Covariate(s)
list box
Click model button
Click cancel button

117
Example (Cont.)

Click Statistics button
Check the Classification table check box
Click Continue
Click Save
The Multinomial Logistic regression in SPSS
version 10 will only save model info in an XML
(Extensible Markup Language) format
Click cancel
Click OK

118
Multinomial output

Model Fit, Pseudo R-square, and Likelihood ratio
tests are similar to those in logistic regression
The Parameter Estimates table is different
There are two sets of parameters
One set for the probability ratio
(bad risk-loss)/(good risk)
Another set for the probability ratio
(bad risk-profit)/(good risk)

119
Interpretation of coefficients

Income in the bad risk-loss section
It is significant
Exp(B) = .962: the expected probability ratio
decreases slightly (by a factor of .962) for a one-unit
increase in income
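
Because Exp(B) acts multiplicatively, the effect of a larger change in income compounds. A minimal sketch using the slide's Exp(B) of .962; the choice of a 10-unit increase is only an illustration.

```python
EXP_B_INCOME = 0.962  # Exp(B) for income as reported on the slide

def ratio_multiplier(units):
    """Factor applied to the probability ratio (bad risk-loss)/(good risk)
    when income rises by the given number of units."""
    return EXP_B_INCOME ** units

for units in (1, 5, 10):
    print(units, round(ratio_multiplier(units), 3))
```

A per-unit factor close to 1 can still add up to a substantial change over many units.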

120
How to predict?

F(1): the chance of being in the bad risk-loss group
F(2): the chance of being in the bad risk-profit group
F(3): the chance of being in the good risk group
F(j) = g(j)/sum(g(i))
g(j) = exp(model j)

121
How to predict? (cont.)

Suppose an individual
Single, has a mortgage
No children
Income of 35,000 pounds
g(1) = .218
g(2) = .767
g(3) = 1

122
How to predict?

F(1) = .218/(.218 + .767 + 1) = .110
F(2) = .386
F(3) = .504
The individual is classified as good risk
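
The arithmetic on this slide can be reproduced with a short sketch. The g values are the ones given on the previous slide; the dictionary keys are illustrative labels.

```python
# g(j) values from the slides; g for the reference category (good risk) is 1.
g = {"bad risk-loss": 0.218, "bad risk-profit": 0.767, "good risk": 1.0}

# F(j) = g(j) / sum of all g(i)
total = sum(g.values())
F = {category: value / total for category, value in g.items()}

# Classify into the category with the highest probability.
predicted = max(F, key=F.get)

for category, prob in F.items():
    print(category, round(prob, 3))
print("predicted:", predicted)
```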

123
Multinomial Logistic Reg. With Interaction

Analyze > Regression > Multinomial Logistic > click
Model, select Custom > specify your model (all main
effects and the interaction between Marital and
Mortgage)
Interpret the results as usual

124
Interaction effects in logistic Regression

It is similar to OLS regression
Add interaction terms to the model as
crossproducts
In SPSS, highlighting two variables (holding down
the Ctrl key) and moving them into the variable box
creates the interaction term
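
Outside SPSS, a cross-product interaction term is just the product of the two predictor columns. A minimal sketch with hypothetical data and column names:

```python
# Hypothetical cases with two predictors (names are illustrative).
rows = [
    {"x1": 1.0, "x2": 0.0},
    {"x1": 2.0, "x2": 1.0},
    {"x1": 3.0, "x2": 1.0},
]

# Add the interaction term as the cross-product of the two predictors.
for row in rows:
    row["x1_x2"] = row["x1"] * row["x2"]

print([row["x1_x2"] for row in rows])
```

The new column is then entered into the model alongside the two main effects.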
