Multivariate Statistics - Transcript and Presenter's Notes
1
Multivariate Statistics
  • Confirmatory Factor Analysis I
  • W. M. van der Veld
  • University of Amsterdam

2
Overview
  • PCA: a review
  • Introduction to Confirmatory Factor Analysis
  • Digression: History and example
  • Intuitive specification
  • Digression: Decomposition rule
  • Identification
  • Exercise 1
  • An example

3
PCA: a review
4
PCA: a review
  • Principal Components analysis is a kind of data
    reduction
  • Start with a set of observed variables.
  • Transform the data so that a new set of variables
    is constructed from the observed variables (full
    component solution).
  • Hopefully, a small subset of the newly
    constructed variables carries as much of the
    original information as possible, so that we can
    reduce the number of variables in our study.
  • Full component solution
  • Has as many components as variables
  • Accounts for 100% of the variables' variance
  • Each variable has a final communality of 1.00
  • Truncated component solution
  • Has fewer components than variables (Kaiser
    criterion, scree plot, etc.)
  • Components can hopefully be interpreted
  • Accounts for <100% of the variables' variance
  • Each variable has a communality < 1.00

5
PCA: a review
  • Each principal component
  • Accounts for maximum available variance,
  • Is orthogonal (uncorrelated, independent) to all
    prior components (often called factors)
  • Full solution (as many factors as variables)
    accounts for all the variance.
[Figure: full solution vs. truncated solution]

6
PCA: a review
  • PCA is part of the toolset for exploring data.
  • It is used when the researcher does not have
    sufficient a priori evidence to form a hypothesis
    about the structure in the data.
  • But anyone who can read should be able to get an
    idea about the structure, i.e. what goes with
    what.
  • It is a highly empirical, data-driven technique.
  • Because one will always find components.
  • And when you find them you have to interpret
    them.
  • Interpretation of components measured by many
    variables can be quite complicated. Especially if
    you have no a priori ideas.
  • PCA at best suggests hypotheses for further
    research.
  • Confirmatory Factor Analysis has everything that
    PCA lacks.
  • Nevertheless, PCA is useful when the data are
    collected with a hypothesis in mind that you want
    to explore quickly.
  • And when you want to construct variables by
    definition.

7
Introduction to CFA
8
Introduction to CFA
  • Some variables of interest cannot be directly
    observed.
  • These unobserved variables are referred to as
    either
  • Latent variables, or
  • Factors
  • How do we obtain information about these latent
    variables, if they cannot be observed?
  • We do so by assuming that the latent variables
    have real observable consequences. And these
    consequences are related.
  • By studying the relations between the
    consequences, we can infer that there is or is
    not a latent variable.
  • History and example.

9
Digression: History and example
  • Sir Francis Galton (1822-1911)
  • Psychologist, etc.
  • Galton was one of the first experimental
    psychologists, and the founder of the field of
    enquiry now called Differential Psychology, which
    concerns itself with psychological differences
    between people, rather than common traits. He
    started virtually from scratch, and had to invent
    the major tools he required, right down to the
    statistical methods - correlation and regression
    - which he later developed.  These are now the
    nuts-and-bolts of the empirical human sciences,
    but were unknown in his time.  One of the
    principal obstacles he had to overcome was the
    treatment of differences on measures as
    measurement error, rather than as natural
    variability.
  • His influential study Hereditary Genius (1869)
    was the first systematic attempt to investigate
    the effect of heredity on intellectual abilities,
    and was notable for its use of the bell-shaped
    Normal Distribution, then called the "Law of
    Errors", to describe differences in intellectual
    ability, and its use of pedigree analysis to
    determine hereditary effects.

10
Digression: History and example
  • Charles Edward Spearman (1863-1945)
  • Psychologist, with a strong statistical
    background.
  • Spearman set out to estimate the intelligence of
    twenty-four children in the village school. In
    the course of his study, he realized that any
    empirically observed correlation between two
    variables will underestimate the "true" degree of
    relationship, to the extent that there is
    inaccuracy or unreliability in the measurement of
    those two variables.
  • Further, if the amount of unreliability is
    precisely known, it is possible to "correct" the
    attenuated observed correlation according to the
    formula (where r stands for the correlation
    coefficient):
    r(true) = r(observed) / √(reliability of
    variable 1 × reliability of variable 2)
  • Using his correction formula, Spearman found
    "perfect" relationships and inferred that
    "General Intelligence" or "g" was in fact
    something real, and not merely an arbitrary
    mathematical abstraction.
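As a quick illustration of Spearman's correction (a minimal Python sketch; the reliability values are hypothetical and not from the presentation):

```python
import math

def disattenuate(r_obs, rel_1, rel_2):
    """Spearman's correction for attenuation:
    r_true = r_obs / sqrt(rel_1 * rel_2)."""
    return r_obs / math.sqrt(rel_1 * rel_2)

# Hypothetical example: observed correlation .42 between two tests
# with reliabilities .70 and .80.
print(disattenuate(0.42, 0.70, 0.80))  # ≈ .56
```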

11
Digression: History and example
  • M are measures of math skills, and L are measures
    of language skills.
  • He then discovered yet another marvelous
    coincidence: the correlations were positive and
    hierarchical. These discoveries led Spearman to
    the eventual development of a two-factor theory
    of intelligence.

12
Digression: History and example
[Path diagram: a general factor g above two group
factors, Verbal Ability and Math Ability; L1, L2, L3
are indicators of Verbal Ability, and M1, M2, M3 of
Math Ability.]
13
Introduction to CFA
  • Factor analysis is a very powerful and elegant
    way to analyze data.
  • Among other reasons, because it is close to the
    core purpose of science.
  • CFA is a theory-testing model,
  • As opposed to a theory-generating method like
    EFA, incl. PCA.
  • The researcher begins with a hypothesis prior to
    the analysis.
  • The model or hypothesis specifies
  • which variables will be correlated with which
    factors
  • which factors are correlated.
  • The hypothesis should be based on a strong
    theoretical and/or empirical foundation (Stevens,
    1996).

14
Intuitive specification
  • xi are the observed variables,
  • ξ is the common latent variable that is the cause
    of the relations between the xi.
  • λi is the effect of ξ on xi.
  • δi are the unique components of the observed
    variables. They are also latent variables,
    commonly interpreted as random measurement error.

15
Intuitive specification
  • This causal diagram can be represented with the
    following set of equations:
    x1 = λ1ξ + δ1
    x2 = λ2ξ + δ2
    x3 = λ3ξ + δ3
  • With E(xi) = E(ξ) = 0, var(xi) = var(ξ) = 1, and
    E(δiδj) = E(δiξ) = 0 for i ≠ j.
  • Only the x variables are known.
  • We don't know anything about the other variables,
    ξ and δ.
  • How can we compute λi?
  • We need a theory about why there are correlations
    between the x variables.
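To make this model concrete, here is a minimal simulation sketch (Python with NumPy; my own illustration rather than the presenter's code), using the loadings .7, .6, .8 from the worked example that follows. The correlations between the simulated x variables come out as λiλj:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
lam = np.array([0.7, 0.6, 0.8])        # loadings λ1, λ2, λ3

xi = rng.standard_normal(n)            # latent variable ξ, var(ξ) = 1
# Unique components δi, scaled so that var(xi) = 1
delta = rng.standard_normal((3, n)) * np.sqrt(1 - lam[:, None] ** 2)
x = lam[:, None] * xi + delta          # xi = λi·ξ + δi

print(np.round(np.corrcoef(x), 2))     # off-diagonals ≈ .42, .56, .48
```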

16
Digression: Decomposition rule
  • In Structural Equation Modeling (includes CFA)
  • The correlation between two variables is equal to
    the sum of
  • - the direct effect,
  • - indirect effects,
  • - spurious relations and
  • - joint effects between these variables.

17
Digression: Basic concepts
18
Digression: Decomposition rule
  • In Structural Equation Modeling (which includes
    CFA):
  • The correlation between two variables is equal to
    the sum of the direct effect, indirect effects,
    spurious relations, and joint effects between
    these variables.
  • The indirect effects, spurious relations, and
    joint effects are equal to the products of the
    coefficients along a path going from one variable
    to the other, while one cannot pass the same
    variable twice and cannot go against the
    direction of the arrows.

19
Intuitive specification
  • So the theory, according to the model, is that
    the correlations between the x variables are
    spurious, due to the latent variable ξ.
  • ρ(x1x2) = λ1λ2, ρ(x1x3) = λ1λ3, ρ(x2x3) = λ2λ3.
  • Proof in the formal specification.
  • A simplified notation yields ρ12 = λ1λ2,
    ρ13 = λ1λ3, ρ23 = λ2λ3.
  • Here λ1, λ2, and λ3 are unknown, and ρ12, ρ13,
    and ρ23 are known.
  • This is solvable, e.g.:

20
Intuitive specification
  • The correlations between the x variables are
    known.
  • We know that:
    ρ12 = .42 = λ1λ2
    ρ13 = .56 = λ1λ3
    ρ23 = .48 = λ2λ3
  • λ2 = .42/λ1 and λ3 = .56/λ1, so that
    λ1 = √((.42 × .56)/.48) = .7
  • λ2 = .6
  • λ3 = .8
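The same arithmetic as a minimal Python sketch (the variable names are mine, not from the slides):

```python
import math

r12, r13, r23 = 0.42, 0.56, 0.48   # observed correlations

# From ρ12 = λ1λ2, ρ13 = λ1λ3, ρ23 = λ2λ3 it follows
# that λ1² = (ρ12 · ρ13) / ρ23.
lam1 = math.sqrt(r12 * r13 / r23)  # λ1 = .7
lam2 = r12 / lam1                  # λ2 = .6
lam3 = r13 / lam1                  # λ3 = .8
print(lam1, lam2, lam3)
```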

21
Intuitive specification
  • Notice that this is much more advanced than PCA.
  • It is based upon theory.
  • It is straightforward
  • It allows the estimation of an effect of
    something that we have not measured!

22
Identification
  • What would happen if the latent variable had four
    indicators?
  • We already had:
    ρ12 = λ1λ2 = .42
    ρ13 = λ1λ3 = .56
    ρ23 = λ2λ3 = .48
  • Now we also have:
    ρ14 = λ1λ4 = .35
    ρ24 = λ2λ4 = .30
    ρ34 = λ3λ4 = .50

23
Identification
  • Now we also have:
    ρ14 = λ1λ4 = .35
    ρ24 = λ2λ4 = .30
    ρ34 = λ3λ4 = .50
  • Of these three equations we need only one to
    determine the value of λ4 once we have solved λ1,
    λ2, and λ3.
  • Therefore, this model is called over-identified,
    with 2 degrees of freedom (df = 2):
  • df = (number of correlations) − (number of
    parameters to be estimated) = 6 − 4 = 2
  • If the number of degrees of freedom is larger
    than 0, a test of the model parameters is
    possible.

24
Identification
  • A test of the model:
    ρ14 = λ1λ4 = .7λ4 = .35
    ρ24 = λ2λ4 = .6λ4 = .30
    ρ34 = λ3λ4 = .8λ4 = .50
  • Let's use the first equation to estimate λ4; then
    λ4 = .35/.7 = .5
  • Now we know all coefficients, and two equations
    are not used yet.
  • These equations can be used to test the model,
    using the idea that the observed correlation
    should equal the estimated correlation, or
    ρ(obs) − ρ(est) = 0:
    ρ24 − λ2λ4 = .30 − .6 × .50 = 0 (good)
    ρ34 − λ3λ4 = .50 − .8 × .50 = 0.1 (less good)
  • These differences are called residuals. If the
    residuals are big, the model must be wrong.
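Continuing the same sketch, the test boils down to computing the residuals (again an illustration, not the presenter's code):

```python
lam4 = 0.35 / lam1          # λ4 = .35/.7 = .5, from the first equation

# Residuals: observed minus model-implied correlations
res24 = 0.30 - lam2 * lam4  # .30 − .6 × .5 = 0.0 (good)
res34 = 0.50 - lam3 * lam4  # .50 − .8 × .5 = 0.1 (less good)
print(res24, res34)
```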

25
Identification
  • With df > 0, a test of the model is possible.
  • With df = 0, no test of the model is possible.
  • This is the case with a one-factor model and 3
    observed variables.
  • With df < 0, no test of the model is possible,
    and the parameters cannot even be estimated.
  • However, if the parameters of the model can be
    constrained, degrees of freedom can be gained, so
    that the model can be estimated.
  • The issue of identification is connected to the
    issue of the number of unknowns versus the number
    of equations. Although not entirely, because
    there is more to it.
  • Generally we leave this issue to the computer
    program. If the model is not identified, you will
    get a message.

26
Exercise 1
  • Assume that E(xi) = E(ξj) = 0,
    var(xi) = var(ξj) = 1, and
    E(δiδj) = E(δiξ) = 0 for i ≠ j.
  • Formulate expressions for the correlations
    between the x variables in terms of the
    parameters of the model:
  • ρ12, ρ13, ρ14, ρ23, ρ24, ρ34.
  • ρ12 = λ11λ21
    ρ13 = λ11φ21λ32
    ρ14 = λ11φ21λ42
    ρ23 = λ21φ21λ32
    ρ24 = λ21φ21λ42
    ρ34 = λ32λ42
  • We can get the same result via the formal
    specification.
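These expressions can be checked symbolically. A sketch with SymPy (the two-factor loading pattern and the factor correlation φ21 follow the exercise; the code itself is my illustration):

```python
import sympy as sp

l11, l21, l32, l42, phi21 = sp.symbols(
    'lambda_11 lambda_21 lambda_32 lambda_42 phi_21')

# x1 and x2 load on ξ1; x3 and x4 load on ξ2
Lam = sp.Matrix([[l11, 0],
                 [l21, 0],
                 [0, l32],
                 [0, l42]])
Phi = sp.Matrix([[1, phi21],
                 [phi21, 1]])   # standardized factor correlation matrix

Sigma = Lam * Phi * Lam.T       # model-implied correlations (off-diagonals)
print(Sigma[0, 1])              # λ11·λ21
print(Sigma[0, 2])              # λ11·φ21·λ32
print(Sigma[2, 3])              # λ32·λ42
```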

27
An example (PCA)
  • What does the factor model look like for these
    items?
  • We suggested last time that there should be 1
    factor behind these items.

28
An example (PCA)
Varimax-rotated PCA solution, which claims a better
interpretation. There are two PCs, which can be
interpreted as positive (items 4, 7, 8) and negative
(the other items) loneliness. But this is strange,
because they should then be related!
[Table: PCA solution]
29
An example (CFA)
  • We did not look at the correlations, but some
    items were worded positively and some negatively.
  • Therefore, if the same response scale is used, we
    expect negative correlations. And they should be
    found in the dark yellow cells.

30
An example (CFA)
31
An example (CFA)
  • This model does not fit (p = 0).
  • So we must search for another model.
  • One possible reason that the correlations are
    positive where we expected them to be negative is
    acquiescence.
  • Acquiescence is the tendency to agree (confirm)
    with a statement, regardless of its content.
  • There are all kinds of psychological explanations
    for this response behavior of respondents: being
    nice, etc.
  • If people agree with both negative and positive
    items, we will find positive correlations between
    those items.
  • Can we correct for acquiescence?
  • YES!

32
An example (CFA)
  • In order to identify an acquiescence factor, I
    have to split the positive and negative items
    into two factors.
  • Theoretically it makes sense if those two
    loneliness factors are correlated −1, because
    they measure the same thing in opposite
    directions.

33
An example (CFA)
34
An example (CFA)
  • The result is a model that still does not fit,
    but it fits much better than before.
  • This model, however, shows that most variation
    and covariation is due to the acquiescence
    factor.
  • The loneliness factors that remain after
    correction for random measurement error and
    systematic error (acquiescence) are pure.
  • However, given the low loadings, what do these
    items have in common with the underlying factor?
    That is, the loneliness factor(s) explain only
    0.2 of the variance of even the best indicators
    (0.45² ≈ 0.20).

35
An example (CFA)
  • Conclusion
  • The PCA did not provide a theoretically sound
    solution.
  • Thinking about the items, and about what some
    respondents do when answering questions, led to a
    theory.
  • This theory could be tested, and it produced very
    sensible results.
  • However, the χ² was not acceptable: 74 with
    19 df, whereas with 19 df the χ² should be
    smaller than about 30.
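For reference, the cutoff of about 30 is the 5% critical value of the χ² distribution with 19 df, which can be verified with SciPy:

```python
from scipy.stats import chi2

# 5% critical value of the chi-square distribution with 19 df
print(chi2.ppf(0.95, df=19))  # ≈ 30.14; an observed χ² of 74 far exceeds it
```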