Multivariate Data Analysis Using SPSS

1
Multivariate Data Analysis Using SPSS
  • John Zhang
  • ARL, IUP

2
Topics
  • A Guide to Multivariate Techniques
  • Preparation for Statistical Analysis
  • Review ANOVA
  • Review ANCOVA
  • MANOVA
  • MANCOVA
  • Repeated Measure Analysis
  • Factor Analysis
  • Discriminant Analysis
  • Cluster Analysis

3
Guide-1
  • Correlation: 1 IV, 1 DV; relationship
  • Regression: 1 IV, 1 DV; relation/prediction
  • T test: 1 IV (cat.), 1 DV; group difference
  • One-way ANOVA: 1 IV (2 cat.), 1 DV; group difference
  • One-way ANCOVA: 1 IV (2 cat.), 1 DV, 1 covariate;
    group difference
  • One-way MANOVA: 1 IV (2 cat.), 2 DVs; group
    difference

4
Guide-2
  • One-way MANCOVA: 1 IV (2 cat.), 2 DVs, 1
    covariate; group difference
  • Factorial MANOVA: 2 IVs (2 cat.), 2 DVs; group
    difference
  • Factorial MANCOVA: 2 IVs (2 cat.), 2 DVs, 1
    covariate; group difference
  • Discriminant Analysis: 2 IVs, 1 DV (cat.);
    group prediction
  • Factor Analysis: explore the underlying structure

5
Preparation for Stat. Analysis-1
  • Screen data
  • SPSS Utility procedures
  • Frequency procedure
  • Missing data analysis (missing data should be
    random)
  • Check if patterns exist
  • Drop data case-wise
  • Drop data variable-wise
  • Impute missing data

6
Preparation for Stat. Analysis-2
  • Outliers (generally, statistical procedures are
    sensitive to outliers)
  • Univariate case: boxplot
  • Multivariate case: Mahalanobis distance (a
    chi-square statistic); a point is an outlier
    when its p-value is < .001
  • Treatment
  • Drop the case
  • Report two analyses (one with the outlier, one
    without)

7
Preparation for Stat. Analysis-3
  • Normality
  • Testing univariate normality
  • Q-Q plot
  • Skewness and kurtosis: both should be 0 for a
    normal distribution; treat the variable as not
    normal when the p-value is < .01 or < .001
  • Kolmogorov-Smirnov statistic: a significant
    result means not normal
  • Testing multivariate normality
  • Scatterplots should be elliptical
  • Each variable must be normal
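
As a rough illustration of the skewness and kurtosis check above, the sketch below computes both from population (biased) moments using NumPy. The function name and the sample data are hypothetical, and statistical packages typically apply small-sample corrections, so their reported values will differ slightly.

```python
import numpy as np

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)   # second central moment (variance)
    m3 = np.mean(centered ** 3)   # third central moment
    m4 = np.mean(centered ** 4)   # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0   # 0 for a normal distribution
    return skew, excess_kurt

# Hypothetical data: one symmetric sample, one right-skewed sample.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_skewed)
```

A symmetric sample gives a skewness of 0, while the long right tail in the second sample gives a positive skewness.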

8
Preparation for Stat. Analysis-4
  • Linearity
  • Linear combinations of variables make sense
  • Two variables (or combinations of variables) are
    linearly related
  • Check for linearity
  • Residual plot in regression
  • Scatterplots

9
Preparation for Stat. Analysis-5
  • Homoscedasticity: the covariance matrices are
    equal across groups
  • Box's M test: tests the equality of the covariance
    matrices across groups
  • Sensitive to normality
  • Levene's test: tests the equality of variances
    across groups
  • Not sensitive to normality

10
Preparation for Stat. Analysis-Example-1
  • Steps in preparation for stat. analysis
  • Check variable coding; recode if necessary
  • Examine missing data
  • Check for univariate outliers, normality, and
    homogeneity of variances (Explore)
  • Test for homogeneity of variances (ANOVA)
  • Check for multivariate outliers (Regression > Save >
    Mahalanobis)
  • Check for linearity (scatterplots; residual plots
    in regression)

11
Preparation for Stat. Analysis-Example-2
  • Use dataset dssft.sav
  • Objective: we are interested in investigating
    group differences (satjob2) in income (income91),
    age (age_2) and education (educ)
  • Check coding: need to recode rincome91 into
    rincome_2 (22, 98, 99 become system-missing)
  • Transform > Recode > Into Different Variable

12
Preparation for Stat. Analysis-Example-3
  • Check for missing values
  • Use Frequencies for categorical variables
  • Use Descriptive Statistics for measurement variables
  • For categorical variables
  • If less than 5% of values are missing, use the
    List-wise option
  • If more than 5%, define the missing value as a new
    category
  • For measurement variables
  • If less than 5% of values are missing, use the
    List-wise option
  • If between 5% and 15%, use Transform > Replace
    Missing Values. Replacing less than 15% of the data
    has little effect on the outcome
  • If greater than 15%, consider dropping the
    variable or subject
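
The decision rule above can be written down as a small function. This is only a sketch: the function name and return strings are made up for illustration, and it assumes the thresholds on the slide are percentages of missing values (5% and 15%).

```python
def missing_data_action(n_missing, n_total, is_categorical):
    """Suggest a treatment for missing values, following the slide's rule.

    Assumption: the 5 and 15 cut-offs are percentages of missing cases.
    """
    pct_missing = 100.0 * n_missing / n_total
    if pct_missing < 5:
        return "list-wise deletion"
    if is_categorical:
        return "define missing as a new category"
    if pct_missing <= 15:
        return "replace missing values"
    return "consider dropping the variable or subject"

# Hypothetical counts: 10 missing values out of 100 cases.
action_categorical = missing_data_action(10, 100, is_categorical=True)
action_measurement = missing_data_action(10, 100, is_categorical=False)
```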

13
Preparation for Stat. Analysis-Example-4
  • Check for missing values in satjob2
  • Analyze > Descriptive Statistics > Frequencies
  • Check for missing values in rincome_2
  • Analyze > Descriptive Statistics > Descriptives
  • Replace the missing values in rincome_2
  • Transform > Replace Missing Values

14
Preparation for Stat. Analysis-Example-5
  • Check for univariate outliers, normality, and
    homogeneity of variances
  • Analyze > Descriptive Statistics > Explore
  • Put rincome_2, age_2, and educ into the Dependent
    List box; put satjob2 into the Factor List box
  • There are outliers in rincome_2; let's change
    those outliers to the acceptable min or max value
  • Transform > Recode > Into Different Variable
  • Put rincome_2 into the Original Variable box; type
    rincome_3 as the new name
  • Replace all values < 3 with 4; all other values
    remain the same

15
Preparation for Stat. Analysis-Example-6
  • Explore rincome_3 again: not normal
  • Transform rincome_3 into rincome_4 by ln or sqrt
  • Explore rincome_4
  • Check for multivariate outliers
  • Analyze > Regression > Linear
  • Put id (dummy variable) into the Dependent box; put
    rincome_4, age_2, and educ into the Independent box
  • Click Save, then check the Mahalanobis box
  • Compare the Mahalanobis distance with the chi-square
    critical value at p = .001 and df = number of
    independent variables
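
A minimal NumPy sketch of the same comparison outside SPSS: compute each case's squared Mahalanobis distance from the centroid of the predictors and flag cases above the chi-square critical value at p = .001. The function names and the simulated data are illustrative assumptions; the critical values are standard chi-square table entries.

```python
import numpy as np

# Chi-square critical values at p = .001, keyed by degrees of freedom
# (standard table values).
CHI2_CRIT_001 = {1: 10.828, 2: 13.816, 3: 16.266, 4: 18.467}

def squared_mahalanobis(data):
    """Squared Mahalanobis distance of each row from the column means."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(x, rowvar=False))  # sample covariance
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

def flag_multivariate_outliers(data):
    """True where the distance exceeds the critical value (df = n. of variables)."""
    x = np.asarray(data, dtype=float)
    return squared_mahalanobis(x) > CHI2_CRIT_001[x.shape[1]]

# Simulated stand-in for three predictors, with one planted extreme case.
rng = np.random.default_rng(0)
cases = rng.normal(size=(50, 3))
cases[0] = [8.0, -8.0, 8.0]
flags = flag_multivariate_outliers(cases)
```

With three predictors the cut-off is 16.266, so the planted case in row 0 is flagged.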

16
Preparation for Stat. Analysis-Example-7
  • Check for multivariate normality
  • Each variable must be univariate normal
  • Construct a scatterplot matrix; each scatterplot
    should have an elliptical shape
  • Check for homoscedasticity
  • Univariate (ANOVA, Levene's test)
  • Multivariate (MANOVA, Box's M test; use the .01
    significance level)

17
Review ANOVA -1
  • One-way ANOVA: tests the equality of group means
  • Assumptions: independent observations, normality,
    homogeneity of variance
  • Two-way ANOVA: tests three hypotheses
    simultaneously
  • Tests the interaction of the levels of the two
    independent variables
  • Interaction occurs when the effects of one factor
    depend on the level of the second factor
  • Tests the effect of each of the two independent
    variables separately
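
To make the one-way test concrete, here is a small sketch of the F statistic computed from between-group and within-group sums of squares. The function name and the three groups of scores are hypothetical.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical scores for three groups.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

Here the between-group sum of squares is 26 and the within-group sum of squares is 6, so F = (26/2)/(6/6) = 13 on 2 and 6 degrees of freedom.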

18
Review ANCOVA -1
  • Idea: the difference on a DV often does not just
    depend on one or two IVs; it may depend on other
    measurement variables. ANCOVA takes such
    dependency into account.
  • i.e. it removes the effect of one or more
    covariates
  • Assumptions: in addition to the regular ANOVA
    assumptions, we need
  • Linear relationship between DV and covariates
  • The slope of the regression line is the same for
    each group
  • The covariates are reliable and measured
    without error

19
Review ANCOVA -2
  • Homogeneity of slopes (homogeneity of regression):
    there is no interaction between the IVs and the
    covariate
  • If the interaction between the covariate and the IVs
    is significant, ANCOVA should not be conducted
  • Example: determine whether hours worked per week
    (hrs2) differ by gender (sex) and between those
    satisfied and dissatisfied with their job (satjob2),
    after adjusting for income (i.e. equalizing on
    income)

20
Review ANCOVA -3
  • Analyze > GLM > Univariate
  • Move hrs2 into the DV box; move sex and satjob2 into
    the Fixed Factor box; move rincome_2 into the
    Covariate box
  • Click Model > Custom
  • Highlight all variables and move them to the Model
    box
  • Make sure the Interaction option is selected
  • Click Options
  • Move sex and satjob2 into the Display Means box
  • Click Descriptive Statistics, Estimates of effect
    size, and Homogeneity tests
  • This tests the homogeneity of regression slopes

21
Review ANCOVA -4
  • If no interaction is found in the previous
    step, repeat that step but click
    Model > Factorial instead of Model > Custom

22
Review ANOVA -2
  • A significant interaction means the two IVs in
    combination have a significant effect on the
    DV; thus, it does not make sense to interpret the
    main effects.
  • Assumptions: the same as one-way ANOVA
  • Example: the impact of gender (sex) and age
    (agecat4) on income (rincome_2)
  • Explore (omitted)
  • Analyze > GLM > Univariate
  • Click Model > click Full factorial > Continue
  • Click Options > click Descriptive Statistics,
    Estimates of effect size, Homogeneity tests
  • Click Post Hoc > click LSD, Bonferroni, Scheffe >
    Continue
  • Click Plots > put one IV into Horizontal Axis and
    the other into Separate Lines

23
MANOVA-1
  • Characteristics
  • Similar to ANOVA
  • Multiple DVs
  • The DVs are correlated and a linear combination
    makes sense
  • It tests whether mean differences among k groups
    on a combination of DVs are likely to have
    occurred by chance
  • The idea of MANOVA is to find a linear combination
    that separates the groups optimally, and to
    perform ANOVA on that linear combination

24
MANOVA-2
  • Advantages
  • The chance of discovering what actually changed
    as a result of the different treatments
    increases
  • May reveal differences not shown in separate
    ANOVAs
  • Without inflation of Type I error
  • The use of multiple ANOVAs ignores some very
    important information (the fact that the DVs are
    correlated)

25
MANOVA-3
  • Disadvantages
  • More complicated
  • ANOVA is often more powerful
  • Assumptions
  • Independent random samples
  • Multivariate normal distribution in each group
  • Homogeneity of covariance matrix
  • Linear relationship among DVs

26
MANOVA-4
  • Steps in carrying out MANOVA
  • Check the assumptions
  • If MANOVA is not significant, stop
  • If MANOVA is significant, carry out univariate
    ANOVAs
  • If a univariate ANOVA is significant, do Post Hoc
    tests
  • If homoscedasticity holds, use Wilks' Lambda; if
    not, use Pillai's Trace. In general, all 4
    statistics should be similar.
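
For readers who want to see where two of these statistics come from, the sketch below computes Wilks' Lambda and Pillai's Trace from the within-group (W) and between-group (B) sums-of-squares-and-cross-products matrices. The function name and the data (two DVs, three groups) are hypothetical, and no significance test is attempted here.

```python
import numpy as np

def manova_statistics(groups):
    """Return (Wilks' Lambda, Pillai's Trace) for a one-way MANOVA layout."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_rows = np.vstack(groups)
    grand_mean = all_rows.mean(axis=0)
    n_vars = all_rows.shape[1]
    within = np.zeros((n_vars, n_vars))
    between = np.zeros((n_vars, n_vars))
    for g in groups:
        centered = g - g.mean(axis=0)
        within += centered.T @ centered
        diff = (g.mean(axis=0) - grand_mean).reshape(-1, 1)
        between += len(g) * (diff @ diff.T)
    total = within + between
    wilks = np.linalg.det(within) / np.linalg.det(total)
    pillai = np.trace(between @ np.linalg.inv(total))
    return wilks, pillai

# Hypothetical (recall, recognition) scores for three instruction groups.
group_a = [[10, 14], [12, 15], [11, 13], [13, 17]]
group_b = [[8, 12], [9, 11], [7, 13], [10, 12]]
group_c = [[5, 10], [6, 9], [4, 11], [7, 10]]
wilks, pillai = manova_statistics([group_a, group_b, group_c])
```

Wilks' Lambda near 0 and Pillai's Trace well above 0 both indicate group separation; identical groups give Lambda = 1 and Trace = 0.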

27
MANOVA-5
  • Example: an experiment looking at the memory
    effects of different instructions. 3 groups of
    human subjects learned nonsense syllables as they
    were presented and were administered two memory
    tests: recall and recognition. The first group of
    subjects was instructed to like or dislike the
    syllables as they were presented (to generate
    affect). A second group was told that they
    would be tested (induce anxiety?). The 3rd group
    was told to count the syllables as they were
    presented (interference). The objective is to
    assess group differences in memory

28
MANOVA-6
  • How to do it?
  • File > Open Data
  • Open the file As9.por in the Instruct > Zhang
    Multivariate Short Course folder
  • Analyze > GLM > Multivariate
  • Move recall and recog into the Dependent Variables
    box; move group into the Fixed Factors box
  • Click Options; move group into the Display Means
    box (this displays the marginal means
    predicted by the model; these means may
    differ from the observed means if there are
    covariates or the model is not factorial). The
    Compare main effects box tests every
    pair of estimated marginal means for the
    selected factors.
  • Click Estimates of effect size and Homogeneity
    of variance

29
MANOVA-7
  • Push buttons
  • Plots: create a profile plot for each DV
    displaying group means
  • Post Hoc: Post Hoc tests for marginal means
  • Save: save predicted values, etc.
  • Contrast: perform planned comparisons
  • Model: specify the model
  • Options
  • Display Means for: display the estimated means
    predicted by the model
  • Compare main effects: test for significant
    differences between every pair of estimated
    marginal means for each of the main effects

30
MANOVA-8
  • Observed power: produces a statistical power
    analysis for your study
  • Parameter estimates: check this when you need a
    predictive model
  • Spread vs. level plot: a visual display of
    homogeneity of variance

31
MANOVA-9
  • Example 2: check the impact of job
    satisfaction (satjob) and gender (sex) on income
    (rincome_2) and education (educ) (in gssft.sav)
  • Screen data: transform educ to educ2 to eliminate
    cases with 6 or less
  • Check the assumptions: Explore
  • MANOVA

32
MANCOVA-1
  • Objective: test for mean differences among groups
    on a linear combination of DVs after adjusting
    for the covariate.
  • Example: test whether there are differences in
    productivity (measured by income and hours
    worked) among individuals in different age groups
    after adjusting for education level

33
MANCOVA-2
  • Assumptions: similar to ANCOVA
  • SPSS how-to
  • Analyze > GLM > Multivariate
  • Move rincome_2 and educ2 to the DV box; move sex and
    satjob into the IV box; move age to the Covariate box
  • Check for homogeneity of regression
  • Click Model > Custom; highlight all variables
    and move them to the Model box
  • If the covariate-by-IV interaction is not
    significant, repeat the process but select
    Full under Model

34
Repeated Measure Analysis-1
  • Objective: test for significant differences in
    means when the same observation appears in
    multiple levels of a factor
  • Examples of repeated measures studies
  • Marketing: compare customers' ratings of 4
    different brands
  • Medicine: compare test results before,
    immediately after, and six months after a
    procedure
  • Education: compare performance test scores
    before and after an intervention program

35
Repeated Measure Analysis-2
  • The logic of repeated measures: SPSS performs
    repeated measures ANOVA by computing contrasts
    (differences) across the levels of the repeated
    measures factor for each subject, then testing
    whether the means of the contrasts are
    significantly different from 0. Any between-subject
    tests are based on the means of the subjects.

36
Repeated Measure Analysis-3
  • Assumptions
  • Independent observations
  • Normality
  • Homogeneity of variances
  • Sphericity: if two or more contrasts are to be
    pooled (the test of the main effect is based on this
    pooling), then the contrasts should be equally
    weighted and uncorrelated (equal variances and
    uncorrelated contrasts). This assumption is
    equivalent to the covariance matrix of the
    contrasts being diagonal with equal diagonal
    elements.
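
The sphericity condition can be inspected directly by transforming the repeated measures with a set of orthonormal contrasts and looking at the covariance matrix of the resulting contrast scores. The sketch below is one way to do that with NumPy; the function names and the subject-by-condition scores are hypothetical, and it does not reproduce the formal test of sphericity.

```python
import numpy as np

def orthonormal_contrasts(n_levels):
    """Return an n_levels x (n_levels - 1) matrix of orthonormal contrasts."""
    # QR of [constant column | identity columns]: the first column of Q is
    # proportional to the constant, so the remaining columns are orthonormal
    # and each sums to zero.
    basis = np.column_stack([np.ones(n_levels), np.eye(n_levels)[:, : n_levels - 1]])
    q, _ = np.linalg.qr(basis)
    return q[:, 1:]

def contrast_covariance(scores):
    """Covariance matrix of the contrast scores (rows = subjects)."""
    x = np.asarray(scores, dtype=float)
    contrasts = orthonormal_contrasts(x.shape[1])
    return np.cov(x @ contrasts, rowvar=False)

# A compound-symmetric covariance matrix (equal variances, equal covariances)
# satisfies sphericity: the transformed matrix is 2 * identity.
contrasts = orthonormal_contrasts(4)
sigma_cs = 2.0 * np.eye(4) + 1.0
transformed = contrasts.T @ sigma_cs @ contrasts

# Hypothetical scores for 5 subjects under 4 conditions.
scores = [[30, 28, 16, 34], [14, 18, 10, 22], [24, 20, 18, 30],
          [38, 34, 20, 44], [26, 28, 14, 30]]
sample_cov = contrast_covariance(scores)
```

If sphericity holds in the population, `sample_cov` should look roughly like a constant times the identity matrix, apart from sampling noise.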

37
Repeated Measure Analysis-4
  • Example 1: a study in which 5 subjects were
    tested in each of 4 drug conditions
  • Open the data file
  • File > Open > Data; select Repmeas1.por
  • SPSS repeated measures procedure
  • Analyze > GLM > Repeated Measures
  • Within-Subject Factor Name (the name of the
    repeated measures factor): a repeated measures
    factor is expressed as a set of variables
  • Replace factor1 with Drug
  • Number of levels: the number of repeated
    measurements
  • Type 4

38
Repeated Measure Analysis-5
  • The Measure pushbutton has two functions
  • For multiple dependent measures (e.g. we recorded
    4 measures of physiological stress under each of
    the drug conditions)
  • To label the factor levels
  • Click Measure; type memory in the Measure Name box;
    click Add
  • Click Define: here we link the repeated measures
    factor levels to variable names and define
    between-subject factors and covariates
  • Move drug1 to drug4 into the Within-Subject box
  • You can move a selected variable with the up and
    down buttons

39
Repeated Measure Analysis-6
  • Model button: by default a complete model
  • Contrast button: specify particular contrasts
  • Plot button: create profile plots that graph
    factor-level estimated marginal means for up to 3
    factors at a time
  • Post Hoc: provides Post Hoc tests for
    between-subject factors
  • Save button: allows you to save predicted values,
    residuals, etc.
  • Options: similar to MANOVA
  • Click Descriptive; click Transformation Matrix
    (it provides the contrasts)

40
Repeated Measure Analysis-7
  • Interpret the results
  • Look at the descriptive statistics
  • Look at the test for Sphericity
  • If the sphericity test is significant, use the
    Multivariate results (test on the contrasts). It
    tests whether all of the contrast variables are
    zero in the population
  • If the sphericity test is not significant, use the
    Sphericity Assumed result
  • Look at the tests of within-subject contrasts:
    they test the linear trend, the quadratic trend,
    etc.
  • These may not make sense in some applications, as
    in this example (but they make sense in terms of
    time and dosage)

41
Repeated Measure Analysis-8
  • Transformation matrix: provides information on
    what the linear contrast, etc., are
  • The first table is for the average across the
    repeated measures factor (here the coefficients are
    all .5, meaning each variable is weighted equally;
    normalization requires that the sum of the
    squares equals 1)
  • The second table defines the contrasts for the
    corresponding repeated measures factor
  • Linear: increases by a constant, etc.
  • Linear and quadratic are orthogonal, etc.
  • Having concluded there are memory differences due
    to drug condition, we want to know which
    conditions differ from which others

43
Repeated Measure Analysis-9
  • Repeat the analysis, except under the Options
    button, move drug into Display Means, click Compare
    main effects and select the Bonferroni adjustment
  • Transformation Coefficients (M Matrix): shows
    how the variables are created for comparison.
    Here, we compare the drug conditions, so the M
    matrix is an identity matrix
  • Suppose we want to test each adjacent pair of
    means: drug1 vs. drug2, drug2 vs. drug3, drug3
    vs. drug4
  • Repeated Measures > Define > Contrast > select
    Repeated

44
Repeated Measure Analysis-10
  • Example 2: a marketing experiment was devised to
    evaluate whether viewing a commercial produces
    improved ratings for a specific brand. Ratings on
    3 brands were obtained from subjects before and
    after viewing the commercial. Since the hope was
    that the commercial would improve ratings of only
    one brand (A), researchers expected a significant
    brand by pre-post commercial interaction. There
    are two between-subjects factors: sex and brand
    used by the subject

45
Repeated Measure Analysis-11
  • SPSS how-to
  • Analyze > GLM > Repeated Measures
  • Replace factor1 with prepost in the
    Within-Subject Factor box; type 2 in the Number
    of Levels box; click Add
  • Type brand in the Within-Subject Factor box; type
    3 in the Number of Levels box; click Add
  • Click Measure; type measure in the Measure Name
    box; click Add
  • Note: SPSS expects 2 between-subject factors

46
Repeated Measure Analysis-12
  • Click the Define button; move the appropriate
    variables into place; move sex and user into the
    Between-Subject Factor box
  • Click the Options button; move sex, user, prepost
    and brand into the Display Means box
  • Click the Homogeneity tests and Descriptive boxes
  • Click Plots; move user into the Horizontal Axis box
    and brand into the Separate Lines box
  • Click Continue, then OK

47
Factor Analysis-1
  • The main goal of factor analysis is data
    reduction. A typical use of factor analysis is in
    survey research, where a researcher wishes to
    represent a number of questions with a smaller
    number of factors
  • Two questions in factor analysis
  • How many factors are there, and what do they
    represent (interpretation)?
  • Two technical aids
  • Eigenvalues
  • Percentage of variance accounted for

48
Factor Analysis-2
  • Two types of factor analysis
  • Exploratory: introduced here
  • Confirmatory: SPSS AMOS
  • Theoretical basis
  • Correlations among variables are explained by
    underlying factors
  • An example of a mathematical one-factor model for
    two variables
  • $V_1 = L_1 F_1 + E_1$
  • $V_2 = L_2 F_1 + E_2$

49
Factor Analysis-3
  • Each variable is composed of a common factor (F1)
    multiplied by a loading coefficient (L1, L2: the
    lambdas or factor loadings) plus a random
    component
  • V1 and V2 correlate because of the common factor,
    and their correlation should relate to the factor
    loadings; thus, the factor loadings can be
    estimated from the correlations
  • A set of correlations can yield different factor
    loadings (i.e. the solutions are not unique)
  • One should pick the simplest solution
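
A short simulation can illustrate both points: the correlation between V1 and V2 is driven by the shared factor, and different loadings can imply the same correlation. The sketch assumes standardized variables (unit-variance factor, error variances chosen so each V has variance 1); the loading values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_cases = 200_000

# Arbitrary illustrative loadings for the one-factor model.
L1, L2 = 0.8, 0.6
F1 = rng.standard_normal(n_cases)                          # common factor
E1 = rng.standard_normal(n_cases) * np.sqrt(1 - L1 ** 2)   # unique part of V1
E2 = rng.standard_normal(n_cases) * np.sqrt(1 - L2 ** 2)   # unique part of V2
V1 = L1 * F1 + E1
V2 = L2 * F1 + E2

observed_r = np.corrcoef(V1, V2)[0, 1]
implied_r = L1 * L2   # correlation implied by the loadings

# Non-uniqueness: a different pair of loadings with the same product
# implies exactly the same correlation between V1 and V2.
alternative_loadings = (0.96, 0.5)
alternative_r = alternative_loadings[0] * alternative_loadings[1]
```

Under these assumptions the simulated correlation lands close to L1 × L2 = 0.48, and the alternative loadings (0.96, 0.5) imply the same value, which is why a single correlation cannot pin down the loadings.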

50
Factor Analysis-4
  • A factor solution needs to be confirmed
  • By a different factor method
  • By a different sample
  • More on terminology
  • Factor loading: interpreted as the Pearson
    correlation between the variable and the factor
  • Communality: the proportion of variability in a
    given variable that is explained by the factors
  • Extraction: the process by which the factors are
    determined from a large set of variables

51
Factor Analysis-5
  • Principal components: one of the extraction
    methods
  • A principal component is a linear combination of
    observed variables that is independent of
    (orthogonal to) the other components
  • The first component accounts for the largest
    amount of variance in the input data; the second
    component accounts for the largest amount of the
    remaining variance
  • Orthogonal components are uncorrelated

52
Factor Analysis-6
  • Possible application of principal components
  • E.g. in survey research, it is common to have
    many questions addressing one issue (e.g.
    customer service). It is likely that these
    questions are highly correlated. It is
    problematic to use these variables in some
    statistical procedures (e.g. regression). One can
    instead use factor scores, computed from the
    factor loadings on each orthogonal component

53
Factor Analysis-7
  • Principal components vs. other extraction methods
  • Principal components focus on accounting for the
    maximum amount of variance (the diagonal of a
    correlation matrix)
  • Other extraction methods (e.g. principal axis
    factoring) focus more on accounting for the
    correlations between variables (off-diagonal
    correlations)
  • A principal component can be defined as a unique
    combination of variables, but the factors from
    other methods cannot
  • Principal components are used for data reduction
    but are more difficult to interpret

54
Factor Analysis-8
  • Number of factors
  • Eigenvalues are often used to determine how many
    factors to take
  • Take as many factors as there are eigenvalues
    greater than 1
  • An eigenvalue represents the amount of standardized
    variance in the variables accounted for by a
    factor
  • The amount of standardized variance in a variable
    is 1
  • The sum of the retained eigenvalues, divided by the
    number of variables, gives the percentage of
    variance accounted for
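
The eigenvalue rule can be tried on a small correlation matrix. In the sketch below the function name and the 4 × 4 matrix are hypothetical; the matrix is built from two uncorrelated pairs of variables so the eigenvalues are easy to verify by hand.

```python
import numpy as np

def eigenvalue_summary(corr):
    """Return eigenvalues (descending), count above 1, and % variance each."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr, dtype=float))[::-1]
    n_factors = int((eigenvalues > 1.0).sum())
    pct_variance = 100.0 * eigenvalues / eigenvalues.sum()
    return eigenvalues, n_factors, pct_variance

# Hypothetical correlations: variables 1-2 correlate .8, variables 3-4
# correlate .6, and the two pairs are uncorrelated with each other.
corr = [[1.0, 0.8, 0.0, 0.0],
        [0.8, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.6],
        [0.0, 0.0, 0.6, 1.0]]
eigenvalues, n_factors, pct_variance = eigenvalue_summary(corr)
```

The eigenvalues are 1.8, 1.6, 0.4 and 0.2. They sum to 4 (one unit of standardized variance per variable), two of them exceed 1, and those two factors account for (1.8 + 1.6)/4 = 85% of the variance.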

55
Factor Analysis-9
  • Rotation
  • Objective: to facilitate interpretation
  • Orthogonal rotation: done when data reduction is
    the objective and factors need to be orthogonal
  • Varimax: attempts to simplify interpretation by
    maximizing the variances of the variable loadings
    on each factor
  • Quartimax: simplifies the solution by finding a
    rotation that produces high and low loadings
    across factors for each variable
  • Oblique rotation: use when there is reason to
    allow factors to be correlated
  • Oblimin and Promax (Promax runs fast)
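
To show what an orthogonal rotation does to a loading matrix, here is a sketch of a commonly used SVD-based varimax iteration. It is a raw varimax without Kaiser normalization, so it is not claimed to match SPSS output; the function names and the loading matrix are hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a loading matrix with raw varimax; return (rotated, rotation)."""
    a = np.asarray(loadings, dtype=float)
    n_vars, n_factors = a.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = a @ rotation
        # Gradient-like target of the varimax criterion.
        col_ss = (rotated ** 2).sum(axis=0)
        target = rotated ** 3 - rotated @ np.diag(col_ss) / n_vars
        u, s, vt = np.linalg.svd(a.T @ target)
        rotation = u @ vt                 # nearest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return a @ rotation, rotation

def simplicity(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float((np.asarray(loadings) ** 2).var(axis=0).sum())

# Hypothetical simple structure, deliberately blurred by a 30-degree rotation.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.6], [0.0, 0.9]])
theta = np.deg2rad(30.0)
blur = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta), np.cos(theta)]])
unrotated = simple @ blur
rotated, rotation = varimax(unrotated)
```

Because the rotation is orthogonal, each variable's communality (row sum of squared loadings) is unchanged; only the distribution of loadings across factors changes.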

56
Factor Analysis-10
  • Factor scores: if you are satisfied with a factor
    solution
  • You can request that a new set of variables be
    created that represents the scores of each
    observation on the factors (difficult to
    interpret)
  • You can use the lambda coefficients to judge which
    variables are highly related to the factor, then
    compute the sum or the mean of these variables for
    further analysis (easy to interpret)

57
Factor Analysis-11
  • Sample size: the sample size should be about 10
    to 15 times the number of variables (as in other
    multivariate procedures)
  • Number of methods: there are 8 factoring methods,
    including principal components
  • Principal axis: accounts for correlations between
    the variables
  • Unweighted least squares: minimizes the residual
    between the observed and the reproduced
    correlation matrix

58
Factor Analysis-12
  • Generalized least squares: similar to unweighted
    least squares but gives more weight to the
    variables with stronger correlations
  • Maximum likelihood: generates the solution that is
    the most likely to have produced the correlation
    matrix
  • Alpha factoring: considers the variables as a
    sample, not using factor loadings
  • Image factoring: decomposes the variables into a
    common part and a unique part, then works with the
    common part

59
Factor Analysis-13
  • Recommendations
  • Principal components and principal axis are the
    most commonly used methods
  • When there is multicollinearity, use principal
    components
  • Rotations are often done. Try to use Varimax

60
Factor Analysis-14
  • Example 1: whether a small number of athletic
    skills account for performance in the ten
    separate decathlon events
  • File > Open > Data; select Olymp88.por
  • Look at the correlations
  • Analyze > Correlation > Bivariate
  • Principal components with orthogonal rotation
  • Analyze > Data Reduction > Factor
  • Select all variables except score
  • Click the Extraction button > click Scree Plot
  • Check off Unrotated factor solution
  • Click Continue

61
Factor Analysis-15
  • Click the Rotation button > click Varimax and
    Loading plots; click Continue
  • Click the Options button > click Sorted by size;
    click the Suppress absolute values box and change
    .1 to .3; click Continue
  • Click Descriptives > Univariate descriptives, KMO
    and Bartlett's test of sphericity (KMO measures how
    well the sample data are suited for factor
    analysis: .9 is great and less than .5 is not
    acceptable; Bartlett's test tests the sphericity
    of the correlation matrix); click Continue
  • Click OK

62
Factor Analysis-16
  • Try to validate the first factor solution using a
    different method
  • Analyze > Data Reduction > Factor Analysis
  • Click Extraction > select Principal axis factoring;
    click Continue
  • Click Rotation > select Direct Oblimin (leave the
    delta value at 0, the most oblique value possible);
    type 50 in the Max Iterations box; click Continue
  • Click the Scores button > click Save as variables
    (this involves solving a system of equations for
    the factors; regression is one of the methods to
    solve the equations); click Continue
  • Click OK

63
Factor Analysis-17
  • Note: the Pattern matrix gives the standardized
    linear weights and the Structure matrix gives the
    correlations between variables and factors (in
    principal component analysis, the component
    matrix gives both the factor loadings and the
    correlations)

64
Discriminant Analysis-1
  • Discriminant analysis characterizes the
    relationship between a set of IVs and a
    categorical DV with relatively few categories
  • It creates a linear combination of the IVs that
    best characterizes the differences among the
    groups
  • Predictive discriminant analysis focuses on
    creating a rule to predict group membership
  • Descriptive DA studies the relationship between
    the DV and the IVs.

65
Discriminant Analysis-2
  • Possible applications
  • Whether a bank should offer a loan to a new
    customer?
  • Which customer is likely to buy?
  • Identify patients who may be at high risk for
    problems after surgery

66
Discriminant Analysis-3
  • How does it work?
  • Assume the population of interest is composed of
    distinct populations
  • Assume the IVs follow a multivariate normal
    distribution
  • DA seeks a linear combination of the IVs that best
    separates the populations
  • If we have k groups, we need k-1 discriminant
    functions
  • A discriminant score is computed for each
    function
  • This score is used to classify cases into one of
    the categories

67
Discriminant Analysis-4
  • There are three methods to classify group
    membership
  • Maximum likelihood method: assign a case to group k
    if its probability of membership is greater in
    group k than in any other group
  • Fisher (linear) classification functions: assign
    a case to group k if its score on the
    function for group k is greater than its scores on
    the other functions
  • Distance function: assign a case to group k
    if its distance to the centroid of that group is
    the minimum
  • Note: SPSS uses the maximum likelihood method
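
The distance-function rule is the easiest of the three to sketch by hand: estimate each group's centroid and a pooled within-group covariance matrix, then assign a case to the group whose centroid is nearest in Mahalanobis distance. The function names and the two-group data below are hypothetical, and this is not the maximum likelihood rule the slides attribute to SPSS.

```python
import numpy as np

def fit_centroids(features, groups):
    """Return ({group: centroid}, inverse pooled within-group covariance)."""
    x = np.asarray(features, dtype=float)
    y = np.asarray(groups)
    labels = np.unique(y).tolist()
    centroids = {lab: x[y == lab].mean(axis=0) for lab in labels}
    pooled = np.zeros((x.shape[1], x.shape[1]))
    for lab in labels:
        centered = x[y == lab] - centroids[lab]
        pooled += centered.T @ centered
    pooled /= len(x) - len(labels)
    return centroids, np.linalg.inv(pooled)

def classify(case, centroids, inv_cov):
    """Assign the case to the group with the nearest centroid."""
    case = np.asarray(case, dtype=float)

    def squared_distance(centroid):
        diff = case - centroid
        return float(diff @ inv_cov @ diff)

    return min(centroids, key=lambda lab: squared_distance(centroids[lab]))

# Hypothetical training data: two predictors, groups coded 1 and 2.
features = [[1, 2], [2, 1], [2, 3], [3, 2], [6, 7], [7, 6], [7, 8], [8, 7]]
groups = [1, 1, 1, 1, 2, 2, 2, 2]
centroids, inv_cov = fit_centroids(features, groups)
predicted_low = classify([2.5, 2.0], centroids, inv_cov)
predicted_high = classify([7.0, 7.5], centroids, inv_cov)
```

Using the pooled covariance matrix reflects the homogeneity assumption discussed later; with unequal group covariances a separate matrix per group would be the natural variation.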

68
Discriminant Analysis-5
  • Basic steps in DA
  • Identify the variables
  • Screen data: look for outliers, variables that may
    not be good predictors, etc.
  • Run DA
  • Check for the correct prediction rate
  • Check for the importance of individual predictors
  • Validate the model

69
Discriminant Analysis-6
  • Assumptions
  • IVs are either dichotomous or measurement
  • Normality
  • Homogeneity of variances

70
Discriminant Analysis-7
  • Example 1: VCR buyers filled out a survey; we
    want to determine which set of demographic
    information and attitudes best predicts which
    customers may buy another VCR
  • File > Open Data > CSM.por
  • Explore the data
  • Analyze > Classify > Discriminant
  • Move age, complain, educ, fail, pinnovat,
    preliabl, puse, qual, use, and value into the
    Independents box
  • Move buyyes into the Grouping box
  • Click Define Range; type 1 for Min and 2 for Max
  • Click Continue

71
Discriminant Analysis-8
  • Click Statistics > click Box's M and Fisher's;
    Continue
  • Click the Classify button > click Summary table and
    Separate-groups; Continue
  • Click the Save button > click Discriminant Scores;
    Continue
  • Click OK
  • How are the original variables related to the
    discriminant score?
  • Graphs > Scatter > click Define
  • Move pinnovat into X and dis1_1 into Y; move
    buyyes into the Set Markers by box

72
Discriminant Analysis-9
  • Since Box's M test was significant, one can ask
    SPSS to run DA using the separate-covariances
    option (under Classify) and compare the results
  • From the 1st analysis, we see that age was not
    important; one can redo the analysis without
    age and compare the results

73
Discriminant Analysis-10
  • Validate the model: leave-one-out classification
  • Repeat the analysis; click Classify > click
    Leave-one-out classification; click Continue
  • Example 2: predict smoking and drinking habits
  • Analyze > Classify > Discriminant
  • Move smkdrnk into the Grouping Variable box; move
    age, attend, black, class, educ, sex and white
    into the IV list
  • Click Statistics > select Fisher's and Box's M;
    Continue
  • Click Classify > Summary table, Combined-groups,
    Territorial map; Continue
  • Click OK

74
Cluster Analysis-1
  • Cluster analysis is an exploratory data analysis
    technique designed to reveal groups
  • How?
  • By distance: observations that are close together
    should be in the same group, and observations in
    different groups should be far apart
  • Applications
  • Plants and animals into ecological groups
  • Companies for product usage

75
Cluster Analysis-2
  • Two types of method
  • Hierarchical: requires observations to remain
    together once they have joined a cluster
  • Complete linkage
  • Between-groups average linkage
  • Ward's method
  • Nonhierarchical: no such requirement
  • The researcher must pick the number of clusters to
    run (K-means algorithm)
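
As an illustration of the nonhierarchical approach, the sketch below is a bare-bones K-means loop: assign each observation to the nearest center, recompute the centers, and repeat until they stop moving. The function name and data are hypothetical, and the starting centers are naively taken as the first k observations, which is an assumption made here for reproducibility rather than a recommended practice.

```python
import numpy as np

def k_means(points, n_clusters, max_iter=100):
    """Return (labels, centers) from a basic K-means iteration."""
    x = np.asarray(points, dtype=float)
    centers = x[:n_clusters].copy()   # naive start: first k observations
    labels = np.zeros(len(x), dtype=int)
    for _ in range(max_iter):
        # Euclidean distance from every observation to every center.
        distances = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups of three observations.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(points, n_clusters=2)
```

Even though both starting centers fall in the first group, the iteration settles on the two visible clusters after a couple of passes. In practice the variables would be standardized first, as the beer example does with z-scores.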

76
Cluster Analysis-3
  • Recommendations
  • For relatively small samples, use hierarchical
    (less than a few hundred)
  • For large samples, use K-means
  • Example 1: evaluating 20 types of beer
  • File > Open > Data; select beer.por
  • Analyze > Descriptive Statistics > Descriptives
  • Move cost, calories, sodium, and alcohol into the
    variable list
  • Click Save standardized values; OK

77
Cluster Analysis-4
  • Analyze > Classify > Hierarchical Cluster
  • Move cost, calories, sodium, and alcohol into the
    Variable list box
  • Move Beer into the Label Cases by box
  • Click Plots > click Dendrogram; click None in the
    Icicle area; Continue
  • Click Method > select Z scores from the Standardize
    drop-down list; Continue
  • Click Save > click Range of solutions; range 2-5
    clusters; Continue
  • OK
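
The same workflow (standardize, then merge the closest clusters step by step) can be sketched in plain Python. This is an illustration under stated assumptions, not a reproduction of the SPSS procedure: it uses between-groups average linkage on plain Euclidean distance, and the six (cost, calories) rows are invented, not taken from beer.por.

```python
import math
from statistics import mean, stdev


def zscores(values):
    """Standardize one variable to mean 0 and (sample) SD 1."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with
    the smallest average pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean(euclidean(points[a], points[b])
                         for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Once merged, cases stay together (the hierarchical requirement).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical (cost, calories) rows for six beers.
beers = [(0.40, 145), (0.45, 150), (0.42, 148),
         (0.75, 70), (0.80, 68), (0.78, 72)]
columns = [zscores(col) for col in zip(*beers)]
standardized = list(zip(*columns))
clusters = average_linkage(standardized, 2)
```

Standardizing first keeps calories, which are measured on a much larger scale than cost, from dominating the distances.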

78
Cluster Analysis-5
  • Additional analysis
  • Look at the last 4 columns of the data (clu5_1 to
    clu2_1); they contain the memberships for each
    solution between 5 and 2 clusters
  • Analyze > Descriptive Statistics > Frequencies
  • Move clu2_1 to clu5_1 into the Variable box
  • OK
  • Obtain the mean profile for the clusters
  • Graphs > Line > Summaries of separate variables
  • Click Define > move zcost, zcalorie, zsodium, and
    zalcohol into the Lines Represent box
  • Click clu4_1 and move it into the Category box

79
Path Analysis-1
  • Path analysis is a technique based on regression
    to establish causal relationships
  • Start with a diagram with causal flow
  • Direct causal effects model (regression)
  • The direct causal effect of an IV on a DV is the
    coefficient (the number of units of change in the
    DV for a 1-unit change in the IV)
  • Building on the DCEM
  • Two forms of causal model
  • Diagram
  • Equation (structural equation)

80
Path Analysis-2
  • An example of a causal model
  • Structural equation
  • Z4 = p41 Z1 + p42 Z2 + p43 Z3 + e4
  • p: path coefficient
  • e: disturbance
  • Z4: endogenous variable
  • Z1: exogenous variable
  • Path diagram
  • An indirect effect is the product of the path
    coefficients along the path
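
A small Python sketch of the arithmetic: an indirect effect is the product of the coefficients along one path, and the total causal effect is the direct effect plus all indirect effects. The coefficient values and the Z1 -> Z2 -> Z4 route are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import prod


def indirect_effect(path_coefficients):
    """Multiply the path coefficients along one indirect path."""
    return prod(path_coefficients)


def total_effect(direct, indirect_paths):
    """Direct effect plus the sum of all indirect effects."""
    return direct + sum(indirect_effect(p) for p in indirect_paths)


# Hypothetical standardized coefficients (not from the slides).
p41 = 0.30   # direct path Z1 -> Z4
p21 = 0.50   # path Z1 -> Z2
p42 = 0.40   # path Z2 -> Z4

via_z2 = indirect_effect([p21, p42])     # Z1 -> Z2 -> Z4
total = total_effect(p41, [[p21, p42]])  # direct + indirect
```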

81
Path Analysis-3
  • Steps in path analysis
  • Create a path diagram
  • Use regression to estimate structural equation
    coefficients
  • Assess the model
  • Compare the observed and reproduced correlations
    (reproduced correlations will be computed by hand)

82
Path Analysis-4
  • Research questions
  • Is our model, which describes the causal effects
    among the variables region of the world,
    status as a developing nation, number of
    doctors, and male life expectancy, consistent
    with the observed correlations among these
    variables?
  • If our model is consistent, what are the
    estimated direct, indirect, and total causal
    effects among the variables?

83
Path Analysis-5
  • Legal path
  • No path may pass through the same variable more
    than once
  • No path may go backward on an arrow after going
    forward on another arrow
  • No path may include more than one double-headed
    curved arrow

84
Path Analysis-6
  • Component labels
  • D: direct effect (just one straight arrow)
  • I: indirect effect (more than one straight
    arrow)
  • S: spurious effect (there is a backward arrow)
  • U: effect is uncertain (starts with a double-headed
    curved arrow)

85
Path Analysis-7
  • If the model is in question (some of the
    reproduced correlations differ from the observed
    correlations by more than .05)
  • Test all missing paths (run additional
    regressions and check the significance of the
    coefficients)
  • Reduce the existing paths if their coefficients
    are not significant
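
The .05 rule above can be checked mechanically once the reproduced correlations have been worked out. The sketch below assumes both sets of correlations are stored in dictionaries keyed by variable pair; the function name and every number are hypothetical.

```python
def flag_discrepancies(observed, reproduced, tolerance=0.05):
    """Return the variable pairs whose observed and reproduced
    correlations differ by more than the tolerance."""
    return {
        pair: round(observed[pair] - reproduced[pair], 3)
        for pair in observed
        if abs(observed[pair] - reproduced[pair]) > tolerance
    }


# Hypothetical correlations, for illustration only.
observed = {("Z1", "Z4"): 0.62, ("Z2", "Z4"): 0.48}
reproduced = {("Z1", "Z4"): 0.60, ("Z2", "Z4"): 0.40}

flagged = flag_discrepancies(observed, reproduced)
```

Any pair that is flagged suggests the model is in question and points to where a missing path might be tested.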

86
Logistic regression - Motivations
  • When the dependent variable is dichotomous,
    regular regression is not appropriate
  • We want to predict probability
  • OLS regression predictions could be any numbers,
    not just numbers between 0 and 1
  • When dealing with proportions, the variance
    depends on the mean, so the equal-variance
    assumption in OLS is violated

87
Motivations-Continue
  • Fit an S-shaped curve to the data

88
What is Logistic Regression?
  • Regressions of the form
  • ln(Odds) = B0 + B1X1 + ... + BkXk
  • ln(Odds) is called a logit
  • Odds = Prob/(1 - Prob)

89
Application of Logistic Regression
  • When to use it?
  • When the dependent variable is dichotomous
  • Objectives
  • Run a logistic regression
  • Apply a stepwise logistic regression
  • Use the ROC (receiver operating characteristic)
    curve to assess the model

90
Assumptions of logistic regression
  • The indep. variables are interval or dichotomous
  • All relevant predictors are included, no
    irrelevant predictors are included, and the form
    of the relationship is linear
  • The expected value of the error term is zero
  • There is no autocorrelation

91
Assumptions of logistic regression Cont.
  • There is no correlation between the error and the
    independent variables
  • There is an absence of perfect multicollinearity
    between the independent variables
  • Need to have a large sample (rule of thumb: n
    should be > 30 times the number of parameters)

92
Note on assumptions
  • No need for normality of errors
  • No need for equal variance

93
Example
  • Objective: to predict low birth weight babies
  • Variables
  • Low: 1 if < 2500 grams, 0 if > 2500 grams
  • LWT: weight at last menstrual cycle
  • Age
  • Smoke
  • PTL: # of premature deliveries
  • HT: history of hypertension
  • UI: uterine irritability
  • FTV: # of physician visits during first trimester
  • Race: 1 = white, 2 = black, 3 = other

94
Example
  • File gt Open gt Data gt Select SPSS Portable type gt
    select Birthwt (in Regression)
  • Analyze gt Regression gt Binary Logistic
  • Move low to the Dependent list box
  • Move age, ftv, ht, ptl, race, smoke,
    and ui into the Covariate list box

95
Example (cont.)
  • Click the Categorical button
  • Place race in the Categorical Covariates box
  • Click Continue, click Save
  • Click the Probability and Group Membership check
    boxes
  • Click Continue and then the Options button

96
Example (cont.)
  • Click on the Classification plots and
    Hosmer-Lemeshow goodness of fit checkboxes
  • Click Continue, then OK

97
Logistic outputs
  • Initial summary output: info on the dependent and
    categorical variables
  • Block 0: based on a model that includes just a
    constant; provides baseline info
  • Block 1 (Method = Enter): includes the model info
  • Chi-square tests whether all the coeffs are 0
    (similar to F in regression)

98
Logistic outputs (cont.)
  • The Model chi-square value is the difference between
    the initial and final -2LL (a small value of -2LL
    indicates a good fit; -2LL = 0 indicates a perfect
    fit)
  • The Step and Block rows display the results of the
    last step and block (they are the same here because
    we are not using stepwise regression)
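
To make the -2LL arithmetic concrete, here is a small Python sketch. The outcomes and fitted probabilities are hypothetical, not from the birth weight data. The "initial" model is taken to be the constant-only model, which predicts the overall proportion of 1s for every case.

```python
import math


def minus2ll(outcomes, probabilities):
    """-2 times the log likelihood of 0/1 outcomes given predicted
    probabilities of the outcome being 1."""
    return -2 * sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(outcomes, probabilities)
    )


# Hypothetical outcomes and fitted probabilities.
y = [1, 0, 0, 1, 0, 0]
base_rate = sum(y) / len(y)
initial = minus2ll(y, [base_rate] * len(y))            # constant only
final = minus2ll(y, [0.7, 0.2, 0.1, 0.6, 0.3, 0.2])    # with predictors

model_chi_square = initial - final
perfect_fit = minus2ll([1, 0], [1.0, 0.0])             # -2LL of a perfect fit
```

A larger drop from the initial to the final -2LL means the predictors improve the fit more.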

99
Logistic outputs (cont.)
  • The goodness-of-fit statistic -2LL is 203.554
  • Cox & Snell R square: similar to R-square in OLS
  • Nagelkerke R square (preferred b/c it can reach 1)
  • Hosmer and Lemeshow test: tests that there is no
    difference between expected and observed counts,
    i.e. we prefer a non-significant result

100
Logistic outputs (cont.)
  • Classification table: can our model predict
    accurately?
  • Overall accuracy is 73%
  • We do much better on higher birth weight
  • Does a poor job on lower birth weight
  • A significant model doesn't mean high
    predictive accuracy

101
Interpretation of the coefficients
  • E.g. HT (hypertension)
  • B = 1.736: hypertension in the mother increases the
    log odds by 1.736
  • Exp(B) = 5.831: hypertension in the mother
    increases the odds of having a low birth weight
    baby by a factor of 5.831
  • What is the prob. change?
  • If the original odds are 1:100 (p = .0099), they
    change to 5.831:100 (p = .0551); if the original
    odds are 1:1 (p = .5), they change to about 5:1
    (p = .83)
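
The probability change depends on the starting odds, which a short Python sketch makes visible. The function names are chosen here for illustration; the odds ratio 5.831 and the starting odds come from the slide.

```python
def odds_to_prob(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)


def apply_odds_ratio(p, odds_ratio):
    """Multiply the odds implied by p by an odds ratio and convert
    the result back to a probability."""
    return odds_to_prob(p / (1 - p) * odds_ratio)


rare_before = odds_to_prob(1 / 100)       # odds of 1:100
rare_after = odds_to_prob(5.831 / 100)    # odds of 5.831:100
even_after = apply_odds_ratio(0.5, 5.831)  # starting from odds of 1:1
five_to_one = odds_to_prob(5)             # odds rounded to 5:1
```

The same factor of 5.831 moves a rare event by a few percentage points but moves an even-odds event by more than thirty.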

102
Interpretation of the coefficients (cont.)
  • Categorical variable: Race
  • First, an overall effect
  • Race(1), white: the effect of being white is
    significant, decreasing the odds compared with
    "other" by a factor of .4
  • The effect of being black is not significant
    compared with "other"

103
Making prediction
  • Suppose a mother
  • Age 20
  • Weighs 130 pounds
  • Smoke
  • No hypertension or premature labor
  • Has uterine irritability
  • White
  • Two visits to her doctor

104
Making prediction (cont.)
  • P(event) = 1/(1 + exp(-(a + b1X1 + ... + bkXk)))
  • P = .397
  • Predicted not to have a low birth weight baby
    because the prob. is less than .5
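
A sketch of the prediction step in Python. The intercept and coefficients below are hypothetical placeholders, not the fitted values from the SPSS output, so the resulting probability will not match the .397 on the slide. Only the formula and the .5 cut-off rule are taken from the text.

```python
import math


def predict_probability(intercept, coefficients, case):
    """P(event) = 1 / (1 + exp(-(a + b1*X1 + ... + bk*Xk)))."""
    z = intercept + sum(coefficients[name] * case[name]
                        for name in coefficients)
    return 1 / (1 + math.exp(-z))


def classify(probability, cutoff=0.5):
    """Predict the event (1) when the probability reaches the cutoff."""
    return 1 if probability >= cutoff else 0


# Hypothetical coefficients, for illustration only.
intercept = -1.0
coefficients = {"smoke": 0.9, "ui": 0.7, "age": -0.04}
mother = {"smoke": 1, "ui": 1, "age": 20}

p = predict_probability(intercept, coefficients, mother)
predicted_low = classify(p)
slide_prediction = classify(0.397)   # the probability reported on the slide
```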

105
Checking classification
  • Need to study the characteristics of mispredicted
    cases
  • Transform > Compute > Pred_err = 1 if
  • Analyze > Compare Means (LWT vs Pred_err)
  • The mean LWT for mispredicted cases is much lower
    than for correctly predicted cases

106
Residual Analysis
  • Analyze > Regression > Logistic > click Save > click
    Cook's, Leverage, Unstandardized, Logit, and
    Standardized
  • Examining data
  • Cook's and Leverage should be small (if a case
    has no influence on the regression result, the
    values would be 0)
  • Res_1 is the residual of the probability (e.g. the
    1st case has predicted prob. .29804 and its actual
    low value is 0, so res_1 = 0 - .29804 = -.29804)
  • Zre_1 is the standardized residual of the probs
  • Lre_1 is the residual in terms of the logit
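
The three residuals can be sketched in Python as follows. The unstandardized residual matches the worked example on the slide. The other two formulas are assumptions: a Pearson-style residual (divided by sqrt(p(1 - p))) and a logit residual (divided by p(1 - p)), which is the usual reading of these quantities but should be checked against the SPSS documentation.

```python
import math


def residual(actual, predicted):
    """Unstandardized residual: observed outcome minus predicted prob."""
    return actual - predicted


def standardized_residual(actual, predicted):
    """Residual divided by its estimated SD, sqrt(p * (1 - p))
    (assumed Pearson-style definition)."""
    return (actual - predicted) / math.sqrt(predicted * (1 - predicted))


def logit_residual(actual, predicted):
    """Residual on the logit scale: residual / (p * (1 - p))
    (assumed definition)."""
    return (actual - predicted) / (predicted * (1 - predicted))


# First case from the slide: predicted prob. .29804, actual low = 0.
res_1 = residual(0, 0.29804)
zre_1 = standardized_residual(0, 0.29804)
lre_1 = logit_residual(0, 0.29804)
```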

107
ROC curve (Receiver Operating Characteristic)
  • Sensitivity: true positive rate
  • Specificity: true negative rate
  • Changing the cut-off point (.5) changes both the
    sensitivity and specificity
  • ROC can help us select an optimal cut-off
    point
  • Graphs > ROC Curve > move pre_1 to Test Variable,
    low to State Variable, type 1 in the Value
    of State Variable, click With diagonal
    reference line and Coordinate points of the ROC
    Curve

108
ROC curve interpretation
  • Vertical axis: sensitivity (true positive rate)
  • Horizontal axis: false positive rate
    (1 - specificity)
  • Diagonal reference line
  • Gives the trade-off between sensitivity and the
    false positive rate
  • Pay attention to the area where the curve rises
    rapidly
  • The 1st column of the coordinates of the curve
    gives the cut-off prob.
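
Each point on an ROC curve comes from one cut-off. The Python sketch below computes a single point, using sensitivity and the false positive rate (1 - specificity), the pair of rates an ROC curve conventionally plots. The outcomes and predicted probabilities are hypothetical.

```python
def roc_point(actual, predicted_prob, cutoff):
    """Return (sensitivity, false positive rate) at one cut-off.
    A case is predicted positive when its probability >= cutoff."""
    pairs = list(zip(actual, predicted_prob))
    tp = sum(1 for a, p in pairs if a == 1 and p >= cutoff)
    fn = sum(1 for a, p in pairs if a == 1 and p < cutoff)
    fp = sum(1 for a, p in pairs if a == 0 and p >= cutoff)
    tn = sum(1 for a, p in pairs if a == 0 and p < cutoff)
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)   # equals 1 - specificity
    return sensitivity, false_positive_rate


# Hypothetical outcomes and predicted probabilities.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.8]

at_half = roc_point(actual, probs, 0.5)
at_lower = roc_point(actual, probs, 0.35)
```

In this made-up data, lowering the cut-off from .5 to .35 catches the one missed positive without adding false positives, the kind of steep rise worth looking for when choosing a cut-off.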

109
Residual Analysis Cont.
  • Examine the distribution of zre_1
  • Graphs > Interactive > Histogram > drag zre_1 to X
    axis, click Histogram, click Normal Curve
  • Note: this plot need not show normality
  • Finding influential cases
  • Graphs > Scatterplot > Define > move id to X axis,
    coo_1 to Y axis
  • Multicollinearity
  • Use OLS regression to check (?)

110
Multinomial Logistic Regression
  • The dependent variable is categorical with two or
    more categories
  • It is an extension of the logistic regression
  • The assumptions are those of logistic
    regression, plus the dependent variable has a
    multinomial distribution

111
Example
  • Objective: predict credit risk (3
    categories) based on financial and demographic
    variables
  • Variables
  • Age
  • Income
  • Gender
  • Marital (single, married, divsepwid)
  • Numkids: # of dependent children

112
Example Cont.
  • Numcards: # of credit cards
  • Howpaid: how often paid (weekly, monthly)
  • Mortgage: have a mortgage (y, n)
  • Storecar: # of store credit cards
  • Loans: # of other loans
  • Risk: 1 = bad risk-lost, 2 = bad risk-profit,
    3 = good risk

113
How does it work?
  • Let f(j) be the probability of being in outcome
    category j
  • f(1) = P(bad risk-lost)
  • f(2) = P(bad risk-profit)
  • f(3) = P(good risk)
  • g(1) = f(1)/f(3)
  • g(2) = f(2)/f(3)
  • g(3) = f(3)/f(3) = 1

114
How does it work? Cont.
  • Fit the model
  • ln(g(1)) = A1 + B11X1 + ... + B1kXk
  • ln(g(2)) = A2 + B21X1 + ... + B2kXk
  • ln(g(3)) = ln(1) = 0 = A3 + B31X1 + ... + B3kXk

115
How does it work? Cont.
116
Example Cont.
  • File gt Open gt Data gt Select Risk gt Open
  • Move risk into dependent list box
  • Move marital and mortgage into the Factor(s) list
    box
  • Move income and numkids into the Covariate(s)
    list box
  • Click model button
  • Click cancel button

117
Example (Cont.)
  • Click Statistics button
  • Check the Classification table check box
  • Click Continue
  • Click Save
  • The Multinomial Logistic regression in SPSS
    version 10 will only save model info in an XML
    (Extensible Markup Language) format
  • Click cancel
  • Click OK

118
Multinomial output
  • Model Fit and Pseudo R-square, Likelihood ratio
    test are similar to logistic regression
  • Parameter estimates table is different
  • There are two sets of parameters
  • One for the probability ratio of (bad
    risk-lost)/(good risk)
  • Another set for the prob. ratio of
    (bad risk-profit)/(good risk)

119
Interpretation of coefficients
  • Income in the bad-lost section
  • It is significant
  • Exp(B) = .962: the expected probability ratio
    decreases a little (by a factor of .962) for a
    one-unit increase in income

120
How to predict?
  • F(1): the chance of being in the bad-loss group
  • F(2): the chance of being in the bad-profit group
  • F(3): the chance of being in the good-risk group
  • F(j) = g(j)/sum(g(i))
  • g(j) = exp(model j)

121
How to predict? (cont.)
  • Suppose an individual
  • Single, has a mortgage
  • No children
  • Income of 35,000 pounds
  • g(1) = .218
  • g(2) = .767
  • g(3) = 1

122
How to predict?
  • F(1) = .218/(.218 + .767 + 1) = .110
  • F(2) = .386
  • F(3) = .504
  • The individual is classified as good risk
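
The calculation F(j) = g(j)/sum(g(i)) can be reproduced with a few lines of Python, using the g values given for this individual. The function name and the label list are chosen here for illustration.

```python
def category_probabilities(g):
    """F(j) = g(j) / sum of all g(i), where the reference category
    (good risk) has g = 1."""
    total = sum(g)
    return [value / total for value in g]


labels = ["bad risk-lost", "bad risk-profit", "good risk"]
g = [0.218, 0.767, 1.0]      # g(1), g(2), g(3) from the slides

F = category_probabilities(g)
rounded = [round(p, 3) for p in F]
predicted = labels[F.index(max(F))]   # the category with the largest F(j)
```

The case is assigned to the category with the largest probability, which here is good risk.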

123
Multinomial Logistic Reg. With Interaction
  • Analyze > Regression > Multinomial Logistic > click
    Model, select Custom > specify your model (all main
    effects and the interaction between Marital and
    Mortgage)
  • Interpret the results as usual

124
Interaction effects in logistic Regression
  • It is similar to OLS regression
  • Add interaction terms to the model as
    crossproducts
  • In SPSS, highlighting two variables (holding down
    the Ctrl key) and moving them into the variable box
    creates the interaction term