Title: Multivariate Statistics
1 Multivariate Statistics
- Confirmatory Factor Analysis I
- W. M. van der Veld
- University of Amsterdam
2 Overview
- PCA: a review
- Introduction to Confirmatory Factor Analysis
- Digression: History and example
- Intuitive specification
- Digression: Decomposition rule
- Identification
- Exercise 1
- An example
3 PCA: a review
4 PCA: a review
- Principal components analysis is a kind of data reduction:
- Start with a set of observed variables.
- Transform the data so that a new set of variables is constructed from the observed variables (full component solution).
- Hopefully, a small subset of the newly constructed variables carries as much of the information as possible, so that we can reduce the number of variables in our study.
- Full component solution:
- Has as many components as variables.
- Accounts for 100% of the variables' variance.
- Each variable has a final communality of 1.00.
- Truncated component solution:
- Has fewer components than variables (Kaiser criterion, scree plot, etc.).
- Components can hopefully be interpreted.
- Accounts for less than 100% of the variables' variance.
- Each variable has a communality less than 1.00.
5 PCA: a review
- Each principal component:
- Accounts for the maximum available variance.
- Is orthogonal to (uncorrelated with, independent of) all prior components (components are often called factors).
- The full solution (as many factors as variables) accounts for all the variance.
- [Figure: full solution vs. truncated solution.]
6 PCA: a review
- PCA is part of the toolset for exploring data:
- When the researcher does not have, a priori, sufficient evidence to form a hypothesis about the structure in the data.
- But anyone who can read should be able to get an idea about the structure, i.e. what goes with what.
- It is a highly empirical, data-driven technique:
- Because one will always find components.
- And when you find them, you have to interpret them.
- Interpretation of components measured by many variables can be quite complicated, especially if you have no a priori ideas.
- PCA at best suggests hypotheses for further research.
- Confirmatory factor analysis has everything that PCA lacks.
- Nevertheless, PCA is useful when the data are collected with a hypothesis in mind that you want to explore quickly.
- And when you want to construct variables by definition.
7 Introduction to CFA
8 Introduction to CFA
- Some variables of interest cannot be directly observed.
- These unobserved variables are referred to as either:
- Latent variables, or
- Factors.
- How do we obtain information about these latent variables if they cannot be observed?
- We do so by assuming that the latent variables have real, observable consequences, and that these consequences are related.
- By studying the relations between the consequences, we can infer that there is, or is not, a latent variable.
- History and example.
9 Digression: History and example
- Sir Francis Galton (1822-1911)
- Psychologist, etc.
- Galton was one of the first experimental psychologists, and the founder of the field of enquiry now called Differential Psychology, which concerns itself with psychological differences between people, rather than with common traits. He started virtually from scratch, and had to invent the major tools he required, right down to the statistical methods - correlation and regression - which he later developed. These are now the nuts-and-bolts of the empirical human sciences, but were unknown in his time. One of the principal obstacles he had to overcome was the treatment of differences on measures as measurement error, rather than as natural variability.
- His influential study Hereditary Genius (1869) was the first systematic attempt to investigate the effect of heredity on intellectual abilities, and was notable for its use of the bell-shaped Normal Distribution, then called the "Law of Errors", to describe differences in intellectual ability, and for its use of pedigree analysis to determine hereditary effects.
10 Digression: History and example
- Charles Edward Spearman (1863-1945)
- Psychologist, with a strong statistical background.
- Spearman set out to estimate the intelligence of twenty-four children in the village school. In the course of his study, he realized that any empirically observed correlation between two variables will underestimate the "true" degree of relationship, to the extent that there is inaccuracy or unreliability in the measurement of those two variables.
- Further, if the amount of unreliability is precisely known, it is possible to "correct" the attenuated observed correlation according to the formula (where r stands for the correlation coefficient): r(true) = r(observed) / √(reliability of variable 1 × reliability of variable 2).
- Using his correction formula, Spearman found "perfect" relationships and inferred that "General Intelligence", or "g", was in fact something real, and not merely an arbitrary mathematical abstraction.
11 Digression: History and example
- M are measures of math skills, and L are measures of language skills.
- He then discovered yet another marvelous coincidence: the correlations were positive and hierarchical. These discoveries led Spearman to the eventual development of a two-factor theory of intelligence.
12 Digression: History and example
[Path diagram: a general factor g above two group factors, Verbal Ability with indicators L1, L2, L3, and Math Ability with indicators M1, M2, M3.]
13 Introduction to CFA
- Factor analysis is a very powerful and elegant way to analyze data.
- Among other reasons, because it is close to the core of scientific purpose.
- CFA is a theory-testing method:
- As opposed to a theory-generating method like EFA, including PCA.
- The researcher begins with a hypothesis prior to the analysis.
- The model, or hypothesis, specifies:
- which variables will be correlated with which factors,
- which factors are correlated.
- The hypothesis should be based on a strong theoretical and/or empirical foundation (Stevens, 1996).
14 Intuitive specification
- xi are the observed variables.
- ξ is the common latent variable that is the cause of the relations between the xi.
- λi is the effect of ξ on xi.
- δi are the unique components of the observed variables. They are also latent variables, commonly interpreted as random measurement error.
15 Intuitive specification
- This causal diagram can be represented with the following set of equations: x1 = λ1ξ + δ1, x2 = λ2ξ + δ2, x3 = λ3ξ + δ3.
- With E(xi) = E(ξ) = 0, var(xi) = var(ξ) = 1, and E(δiδj) = E(δixj) = E(δiξ) = 0 (for i ≠ j).
- Only the x variables are known.
- We don't know anything about the other variables, ξ and δi.
- How can we compute the λi?
- We need a theory about why there are correlations between the x variables.
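As a sanity check on this setup, here is a small simulation (not from the slides; Python and the specific loading values are illustrative, taken from the numbers used later in this lecture) showing that data generated by the model xi = λiξ + δi reproduce correlations of approximately λiλj:

```python
import numpy as np

# Generate data that follow the one-factor model x_i = lambda_i * xi + delta_i,
# then check that the sample correlations approximate lambda_i * lambda_j.
rng = np.random.default_rng(0)
n = 200_000
lam = np.array([0.7, 0.6, 0.8])     # loadings (the values derived later)

xi = rng.standard_normal(n)                                 # latent variable, var = 1
delta = rng.standard_normal((n, 3)) * np.sqrt(1 - lam**2)   # unique components
x = lam * xi[:, None] + delta                               # observed variables, var ~ 1

r12 = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
print(round(r12, 2))   # close to lambda_1 * lambda_2 = .7 * .6 = .42
```

The correlation between x1 and x2 comes out near .42 even though ξ itself never appears in the data, which is exactly the "spurious correlation due to the latent variable" idea of the next slides.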
16 Digression: Decomposition rule
- In structural equation modeling (which includes CFA):
- The correlation between two variables is equal to the sum of:
  - the direct effect,
  - indirect effects,
  - spurious relations, and
  - joint effects between these variables.
17 Digression: Basic concepts
18 Digression: Decomposition rule
- In structural equation modeling (which includes CFA):
- The correlation between two variables is equal to the sum of the direct effect, indirect effects, spurious relations, and joint effects between these variables.
- The indirect effects, spurious relations, and joint effects are each equal to the product of the coefficients along a path going from one variable to the other, where one cannot pass the same variable twice and cannot go against the direction of the arrows.
19 Intuitive specification
- So the theory, according to the model, is that the correlations between the x variables are spurious, due to the latent variable ξ.
- ρ(x1,x2) = λ1λ2, ρ(x1,x3) = λ1λ3, ρ(x2,x3) = λ2λ3.
- Proof in the formal specification.
- A simplified notation yields: ρ12 = λ1λ2, ρ13 = λ1λ3, ρ23 = λ2λ3.
- Here λ1, λ2, and λ3 are unknown, while ρ12, ρ13, and ρ23 are known.
- This is solvable, e.g.:
20 Intuitive specification
- The correlations between the x variables are known.
- We know that:
- ρ12 = .42 = λ1λ2, ρ13 = .56 = λ1λ3, ρ23 = .48 = λ2λ3
- λ2 = .42/λ1 and λ3 = .56/λ1, so substituting into λ2λ3 = .48 gives:
- λ1 = √((.42 × .56)/.48) = .7
- λ2 = .6
- λ3 = .8
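The algebra on this slide can be transcribed in a few lines (Python is used here only as a calculator; the numbers are the slide's own):

```python
from math import sqrt

# Observed correlations from the slide.
r12, r13, r23 = 0.42, 0.56, 0.48

# The one-factor model implies r12 = l1*l2, r13 = l1*l3, r23 = l2*l3,
# so l1**2 = (r12 * r13) / r23.
lam1 = sqrt(r12 * r13 / r23)   # sqrt((.42 * .56) / .48) = .7
lam2 = r12 / lam1              # .42 / .7 = .6
lam3 = r13 / lam1              # .56 / .7 = .8
print(round(lam1, 2), round(lam2, 2), round(lam3, 2))   # prints 0.7 0.6 0.8
```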
21 Intuitive specification
- Notice that this is much more advanced than PCA:
- It is based upon theory.
- It is straightforward.
- It allows the estimation of the effect of something that we have not measured!
22 Identification
- What would happen if the latent variable had four indicators?
- We already had: ρ12 = λ1λ2 = .42, ρ13 = λ1λ3 = .56, ρ23 = λ2λ3 = .48
- Now we also have: ρ14 = λ1λ4 = .35, ρ24 = λ2λ4 = .30, ρ34 = λ3λ4 = .50
23 Identification
- Now we also have: ρ14 = λ1λ4 = .35, ρ24 = λ2λ4 = .30, ρ34 = λ3λ4 = .50
- Of these three equations we need only one to determine the value of λ4, once we have solved for λ1, λ2, and λ3.
- Therefore, this model is called over-identified, with 2 degrees of freedom (df = 2):
- df = (number of correlations) − (number of parameters to be estimated) = 6 − 4 = 2
- If the number of degrees of freedom is larger than 0, a test of the model parameters is possible.
24 Identification
- A test of the model:
- ρ14 = λ1λ4 = .7λ4 = .35, ρ24 = λ2λ4 = .6λ4 = .30, ρ34 = λ3λ4 = .8λ4 = .50
- Let's use the first equation to estimate λ4; then λ4 = .35/.7 = .5.
- Now we know all the coefficients, and two equations have not been used yet.
- These equations can be used to test the model, using the idea that the observed correlation should equal the estimated correlation, or ρ(obs) − ρ(est) = 0.
- ρ24 − λ2λ4 = .30 − .6λ4 = .30 − .6 × .50 = 0 (good); ρ34 − λ3λ4 = .50 − .8λ4 = .50 − .8 × .50 = 0.10 (less good)
- These differences are called residuals. If the residuals are big, the model must be wrong.
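The over-identification test above is just arithmetic; here it is written out as a short sketch (Python as calculator, numbers from the slides):

```python
# Loadings solved on the earlier slides, plus the three new correlations with x4.
lam1, lam2, lam3 = 0.7, 0.6, 0.8
r14, r24, r34 = 0.35, 0.30, 0.50

lam4 = r14 / lam1             # .35 / .7 = .5

# The two unused equations become residuals rho(obs) - rho(est).
res24 = r24 - lam2 * lam4     # .30 - .6 * .5 = 0.00 (good)
res34 = r34 - lam3 * lam4     # .50 - .8 * .5 = 0.10 (less good)

df = 6 - 4                    # (# of correlations) - (# of loadings) = 2
print(round(lam4, 2), round(res24, 2), round(res34, 2), df)
```

With two leftover equations (df = 2), any nonzero residual, like the 0.10 for ρ34, is evidence against the one-factor model.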
25 Identification
- With df > 0, a test of the model is possible.
- With df = 0, no test of the model is possible.
- This is the case with a one-factor model and 3 observed variables.
- With df < 0, no test of the model is possible, and the parameters cannot even be estimated.
- However, if parameters of the model can be constrained, degrees of freedom can be gained, so that the model can be estimated.
- The issue of identification is connected to the issue of the number of unknowns versus the number of equations, although not entirely, because there is more to it.
- Generally we leave this issue to the computer program. If the model is not identified, you will get a message.
26 Exercise 1
- Assume that E(xi) = E(ξi) = 0, var(xi) = var(ξi) = 1, and E(δiδj) = E(δixj) = E(δiξ) = 0 (for i ≠ j).
- Formulate expressions for the correlations between the x variables in terms of the parameters of the model:
- ρ12, ρ13, ρ14, ρ23, ρ24, ρ34.
- In this model x1 and x2 load on ξ1, x3 and x4 load on ξ2, and φ21 is the correlation between the two factors:
- ρ12 = λ11λ21, ρ13 = λ11φ21λ32, ρ14 = λ11φ21λ42, ρ23 = λ21φ21λ32, ρ24 = λ21φ21λ42, ρ34 = λ32λ42
- We can get the same result via the formal specification.
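These decomposition expressions can be checked numerically; the parameter values below are made up for illustration (the slides give no numbers for this exercise):

```python
# Hypothetical values: x1, x2 load on factor 1; x3, x4 load on factor 2.
lam11, lam21 = 0.7, 0.6   # loadings on factor 1 (assumed)
lam32, lam42 = 0.8, 0.5   # loadings on factor 2 (assumed)
phi21 = 0.4               # correlation between the two factors (assumed)

# Implied correlations, following the decomposition rule: within a factor,
# the product of the two loadings; across factors, the path passes through phi21.
rho12 = lam11 * lam21
rho13 = lam11 * phi21 * lam32
rho14 = lam11 * phi21 * lam42
rho23 = lam21 * phi21 * lam32
rho24 = lam21 * phi21 * lam42
rho34 = lam32 * lam42
print(round(rho12, 3), round(rho13, 3), round(rho34, 3))
```

Note that the within-factor correlations (ρ12, ρ34) do not involve φ21, while all four cross-factor correlations are shrunk by it.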
27 An example (PCA)
- What does the factor model look like for these items?
- We suggested last time that there should be one factor behind these items.
28 An example (PCA)
Varimax-rotated PCA solution, which claims a better interpretation. There are two PCs, which can be interpreted as positive (items 4, 7, 8) and negative (the other items) loneliness. But this is strange, because they should then be related!
[Table: PCA solution.]
29 An example (CFA)
- We did not look at the correlations, but some items were worded positively and some negatively.
- Therefore, if the same response scale is used, we expect negative correlations, and they should be found in the dark yellow cells.
30 An example (CFA)
31 An example (CFA)
- This model does not fit (p = 0).
- So we must search for another model.
- One possible explanation for finding positive correlations where negative ones were expected is acquiescence.
- Acquiescence is the tendency to agree (confirm) with a statement, regardless of its content.
- There are all kinds of psychological explanations for this response behavior, such as respondents wanting to be nice.
- If people agree with both negative and positive items, we will find positive correlations between those items.
- Can we correct for acquiescence?
- YES!
32 An example (CFA)
- In order to identify an acquiescence factor, I have to split the positive and negative items into two factors.
- Theoretically it makes sense if those two loneliness factors are correlated −1, because they are the same but opposite.
33 An example (CFA)
34 An example (CFA)
- The result is a model that still does not fit, but much better than before.
- This model, however, shows that most variation and covariation is due to the acquiescence factor.
- The loneliness factors that remain after correction for random measurement error and systematic error (acquiescence) are pure.
- However, given the low loadings, what do these items have in common with the underlying factor? The loneliness factor(s) explain only 0.20 of the variance of even the best indicators (0.45² ≈ 0.20).
35 An example (CFA)
- Conclusion:
- The PCA did not provide a theoretically sound solution.
- Thinking about the items, and about what some respondents do when answering questions, led to a theory.
- This theory could be tested, and it produced very sensible results.
- However, the χ² was not acceptable: 74 with df = 19, whereas with 19 degrees of freedom the χ² should be no larger than about 30.