Title: Michael Fahey
1 Identifying dietary patterns in a multivariate
mixture: non-normality versus correlation
- Michael Fahey
- Ian White
- July, 2007
2 Outline
- motivation
- semi-normal mixture model
- sensitivity to
- marginal non-normality
- correlation among responses
3 Motivation
- diet assessment in epidemiology
- foods eaten together interact
- holistic description using multiple dietary responses
- latent characteristic (dietary pattern) induces correlation among responses
4 Notation
$f(y)$ is the multivariate probability density function
for a set of food responses $y$
$p(x)$ is the probability of membership of latent class $x$
$f(y \mid x)$ is the class-specific density, given latent class $x$
5 Mixture distribution
Define dietary patterns as sub-groups having
different food consumption probability
distributions. To find sub-groups, decompose the
mixture distribution into a sum of K class-specific
distributions:
$$f(y) = \sum_{x=1}^{K} p(x)\, f(y \mid x)$$
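As a minimal sketch of this decomposition, the mixture density can be evaluated as a weighted sum of class-specific densities. The example assumes univariate normal classes purely for illustration (the talk concerns multivariate, semi-continuous responses); the function name `mixture_pdf` and all parameter values are hypothetical.

```python
from statistics import NormalDist

def mixture_pdf(y, weights, components):
    """Evaluate f(y) = sum over classes x of p(x) * f(y | x)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    return sum(w * c.pdf(y) for w, c in zip(weights, components))

# Hypothetical two-class example (K = 2); values are illustrative only.
weights = [0.6, 0.4]
components = [NormalDist(mu=0.0, sigma=1.0), NormalDist(mu=3.0, sigma=0.5)]
density_at_one = mixture_pdf(1.0, weights, components)
```

Fitting the model runs in the opposite direction: the weights and class-specific parameters are estimated from data, typically by maximum likelihood.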
6 Food consumption data
- 8000 women in the EPIC-Norfolk cohort who completed a dietary questionnaire
- eight food groups reflecting different dietary components and characteristics in data
- consumption calculated in (g/d) or (g/d + 1) on log, square root and Box-Cox scales
8 Semi-continuous data
9 Semi-normal (SMN) mixture
- two-part density for semi-continuous responses
- binary part (whether food was consumed)
- continuous part (if consumed, how much)
- the two parts considered together: if food not
consumed, continuous part set to missing
10 Two-part density
Define $d = 1$ if $y > 0$, and $d = 0$ if $y = 0$, and
the PDF for the two-part density is
$$f(y, d) = (1 - p)^{1 - d}\,[\,p\, g(y)\,]^{d}$$
Here, $p$ is the probability of consumption, $g(y)$
is scale dependent, and in the mixture $f(y \mid x)$
is replaced by $f(y, d \mid x)$ for semi-continuous
data.
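A small sketch of a two-part log density may make the construction concrete. It assumes the common form in which a non-consumer contributes only the binary part, log(1 - p), and a consumer contributes log p plus the log density of the amount. The names `two_part_logpdf` and `lognormal_logpdf`, and the choice of a log-normal amount, are illustrative assumptions rather than the model fitted in the talk.

```python
import math
from statistics import NormalDist

def two_part_logpdf(y, p, g_logpdf):
    """Log density of one semi-continuous response.

    y        : observed amount (0 means the food was not consumed)
    p        : probability of consumption, 0 < p < 1
    g_logpdf : log density of the amount among consumers (assumed form)
    """
    if y < 0:
        raise ValueError("consumption cannot be negative")
    d = 1 if y > 0 else 0          # consumption indicator
    if d == 0:
        return math.log(1.0 - p)   # binary part only; continuous part missing
    return math.log(p) + g_logpdf(y)

def lognormal_logpdf(y, mu=3.0, sigma=1.0):
    """Illustrative assumption: log(y) is normal among consumers."""
    normal = NormalDist(mu, sigma)
    # The -log(y) term is the Jacobian of the log transformation.
    return math.log(normal.pdf(math.log(y))) - math.log(y)

not_eaten = two_part_logpdf(0.0, 0.8, lognormal_logpdf)
eaten = two_part_logpdf(20.0, 0.8, lognormal_logpdf)
```

The Jacobian term is one way to see why g(y) is scale dependent: the same amounts give different density values on the raw and log scales.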
11 Sensitivity to non-normality
- fit SMN mixture on raw, log, square root, Box-Cox, rankit scales
- means and variances of responses vary by class, covariances constrained to zero
- vary K from 1 to 4 and evaluate model: BIC, pseudo-class residual analysis
12 Rankit transformation
Raw scale data are ranked, $r_i$, and assigned their
expected value as a standard normal deviate.
For example, the median on the raw scale
has rankit 0 and the 95th percentile has rankit
1.65.
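The transformation can be sketched as follows. The exact scoring formula used in the talk is not shown, so this example assumes the simple plotting position (r - 0.5) / n as an approximation to the expected normal order statistic, with tied values given their average rank. The function name `rankits` is hypothetical.

```python
from statistics import NormalDist

def rankits(values):
    """Replace each value by an approximate expected normal order statistic.

    Assumptions: plotting position (r - 0.5) / n, and tied values
    receive the average of the ranks they span.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1      # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]

# Five made-up intakes; the middle value (12.0) maps to a rankit of 0.
scores = rankits([12.0, 3.5, 40.0, 7.0, 19.0])
```

Because only ranks are used, the result is the same whatever monotone scale the raw data were recorded on.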
13 Change in BIC by scale
$\Delta\mathrm{BIC} = \mathrm{BIC}(K-1) - \mathrm{BIC}(K)$
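A brief sketch of the comparison, assuming the usual "smaller is better" form BIC = -2 log L + (number of parameters) x log(N), under which a positive change favours adding the K-th class. The log-likelihoods and parameter counts below are made up for illustration and are not results from the talk.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Schwarz criterion in the 'smaller is better' form (assumed convention)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def delta_bic(bic_by_k):
    """Map K -> BIC(K-1) - BIC(K) for each K whose predecessor is present."""
    return {k: bic_by_k[k - 1] - bic_by_k[k]
            for k in sorted(bic_by_k) if k - 1 in bic_by_k}

# Made-up (log-likelihood, parameter count) pairs for K = 1, 2, 3.
fits = {1: (-5200.0, 16), 2: (-5050.0, 33), 3: (-5020.0, 50)}
bic_by_k = {k: bic(ll, p, n_obs=1000) for k, (ll, p) in fits.items()}
changes = delta_bic(bic_by_k)
```

In this invented example the second class improves BIC while the third does not, which is the pattern a plot of the change against K is meant to expose.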
14 Pseudo-residual analysis
- randomly assign women to one latent class using $p(x \mid y, d)$ and compute residuals
- pseudo-class residuals have good properties (Wang et al., JASA, 2005)
- repeat random assignments M times for each woman to create N × M data
- use Q-Q plots to evaluate class-specific multivariate normality on log and rankit scales
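The random-assignment step can be sketched like this. The sketch assumes the posterior class probabilities are already available from a fitted model; the names `pseudo_class_draws` and `residual` and the example posteriors are hypothetical. It covers only the drawing and a simple mean residual, not the Q-Q plots.

```python
import random

def pseudo_class_draws(posteriors, n_draws, seed=0):
    """Draw class labels from each subject's posterior class probabilities.

    posteriors : list of per-subject lists of class probabilities
    n_draws    : M, the number of random assignments per subject
    Returns N * M tuples of (subject, draw, class).
    """
    rng = random.Random(seed)
    rows = []
    for subject, probs in enumerate(posteriors):
        classes = range(len(probs))
        labels = rng.choices(classes, weights=probs, k=n_draws)
        rows.extend((subject, m, x) for m, x in enumerate(labels))
    return rows

def residual(y, class_means, x):
    """Residual of a response from the mean of its assigned class."""
    return y - class_means[x]

# Hypothetical posteriors for N = 3 women and K = 2 classes.
posteriors = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
draws = pseudo_class_draws(posteriors, n_draws=20)
```

A woman with an ambiguous posterior appears in both classes across the M draws, so her uncertainty is carried into the residual plots rather than hidden by a single modal assignment.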
15 Pseudo-residuals: log scale
16 Pseudo-residuals: rankit scale
17 Sensitivity to correlation
- vary class-specific covariance matrix, $\Sigma_K$
- diagonal: $\Sigma_K = D_K$
- linear factor: $\Sigma_K = \lambda\lambda^T + D_K$
- where $\lambda$ is the coefficient of a class-constant linear factor, $f$, such that $E(y) = \mu + \lambda f$
- unconstrained: $\Sigma_K = C_K + D_K$
- where $C_K$ is a matrix of covariances among continuous responses, and zero elsewhere
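The three covariance structures can be written out as a short sketch using plain nested lists. It assumes the linear factor has unit variance, so that the factor coefficients (called `loadings` here) enter as an outer product; the function names are hypothetical.

```python
def diagonal_cov(variances):
    """Diagonal structure: variances only, all covariances zero."""
    n = len(variances)
    return [[variances[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]

def factor_cov(loadings, variances):
    """Linear factor structure: outer product of loadings plus a diagonal.

    Assumes a single factor with unit variance.
    """
    if len(loadings) != len(variances):
        raise ValueError("loadings and variances must have equal length")
    n = len(loadings)
    return [[loadings[i] * loadings[j] + (variances[i] if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def unconstrained_cov(covariances, variances):
    """Unconstrained structure: free off-diagonal covariances plus a diagonal."""
    n = len(variances)
    return [[variances[i] if i == j else covariances[i][j]
             for j in range(n)] for i in range(n)]

sigma_factor = factor_cov([1.0, 2.0], [0.5, 0.5])
```

The factor structure adds one coefficient per response, whereas the unconstrained structure adds one covariance per pair of continuous responses, so the parameter count grows much faster under the latter.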
18 Covariance matrix: rankit scale
19 Pseudo-residuals: rankit, $\Sigma_K = C_K + D_K$
20 Concluding remarks 1
- marginal normality leads to
- 1) lower latent dimensionality
- 2) better fitting class-specific models
- SMN model removes non-normality due to clumping at zero; identification shifts to association involving binary responses
21 Concluding remarks 2
- two latent classes identified in associations involving binary responses
- correlation among continuous responses may be uninteresting, e.g. measurement error, association not caused by dietary pattern
- shifting emphasis from non-normality to correlation localises identification
22 Notation
$f(y \mid z, \theta)$ is the multivariate probability density
function for a set of food variables $y$ given
covariates $z$ and parameter vector $\theta = (\alpha, \beta)$
$p(x \mid z, \alpha)$ is the probability of membership of latent class
$x$ given covariates $z$ and parameter vector $\alpha$
$f(y \mid x, z, \beta)$ is the class-specific density, given latent class
$x$, covariates $z$ and parameter vector $\beta$
23 Norfolk dietary questionnaire
24 Model details (1)
- means and variances of y vary by class
- covariances among the y constrained to zero
- latent class probabilities and the mean of the y depend on one covariate, z
- z = the log of total energy intake (log kcal)
25 Model details (2)
- Model dependence on covariates as follows
26 Sensitivity to non-normality (OLD)
- fit SMN mixtures on raw, log, square root, Box-Cox, rankit scales; vary K from 1 to 4
- evaluate model fit: BIC, pseudo-class residual analysis
- examine classification agreement between any two candidate solutions: Rand index
27 Rand index (RI) 1
- compares two modal classifications by considering all pairs of women
- agreement if a pair are either
- in the same class on each classification
- or in different classes on each classification
- RI = (count of agreements) / (number of pairs)
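The pair-counting definition translates directly into a short sketch. The function name `rand_index` is hypothetical, and the loop over all pairs is quadratic in the number of subjects, so it is meant to show the definition rather than to be efficient for a cohort of thousands.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Proportion of pairs on which two classifications agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("classifications must cover the same subjects")
    n = len(labels_a)
    n_pairs = n * (n - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two subjects")
    agreements = sum(
        # A pair agrees if it is together in both or apart in both.
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )
    return agreements / n_pairs

same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first comparison uses the same partition with the class labels swapped, so every pair agrees; the second crosses the two partitions and only two of the six pairs agree.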
28 Rand index (RI) 2
- values range from 0 to 1
- probability that classifications on a randomly chosen pair agree
- adjusted RI takes chance agreement into account
- invariant to permutation of class labels
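One widely used chance correction is the Hubert and Arabie adjusted Rand index, sketched below from the contingency table of the two classifications. Whether the talk used this particular adjustment is an assumption; the function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index (assumed form of the adjustment)."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two classifications of the same n >= 2 subjects")
    # Pairs placed together by both, by A alone, and by B alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)   # chance agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0   # degenerate case, e.g. both put everyone in one class
    return (together_both - expected) / (maximum - expected)

relabelled = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_adj = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Swapping the class labels leaves the value at 1, which illustrates the permutation invariance, while the crossed partitions score below zero because they agree less often than chance would predict.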
29 Classification agreement (RI) among scales, K = 4
30 Classification: log vs others
31 Covariance matrix: log scale
32 Number of parameters
Semi-normal models with K and $\Sigma_K$ as given
33 MVN patterns: deviation from grand mean
34 SMN: deviation from grand mean
35 Posterior probabilities
Average modal posterior probabilities by class
36 Hard versus soft classification
hard: modal classification; soft: mixing
proportions estimated from model
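The contrast can be sketched from a set of posterior class probabilities. The soft proportions are computed here by averaging posteriors, which coincides with the estimated mixing proportions at the maximum likelihood solution of a mixture without covariates; treating that as a stand-in for the model estimates is an assumption. The function names and the example posteriors are hypothetical.

```python
def hard_proportions(posteriors):
    """Share of subjects whose modal (highest-posterior) class is each class."""
    k = len(posteriors[0])
    counts = [0] * k
    for probs in posteriors:
        modal = max(range(k), key=lambda x: probs[x])
        counts[modal] += 1
    return [c / len(posteriors) for c in counts]

def soft_proportions(posteriors):
    """Average posterior probability of each class across subjects."""
    k = len(posteriors[0])
    n = len(posteriors)
    return [sum(probs[x] for probs in posteriors) / n for x in range(k)]

# Hypothetical posteriors for four women and two classes.
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.2, 0.8]]
hard = hard_proportions(posteriors)
soft = soft_proportions(posteriors)
```

In this invented example the hard split is 75% to 25%, while the soft split is closer to even because the two borderline women contribute partly to each class.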
37 Predicting cancer risk
Cox regression: crude effect of SMN latent class
patterns on incidence of 49 cancers in
EPIC-Norfolk