Title: Introduction to Linkage and Association for Quantitative Traits
1Introduction to Linkage and Association for Quantitative Traits
- Michael C Neale
- Boulder, Colorado Workshop, March 2, 2009
2Overview
- A brief history of SEM
- Regression
- Maximum likelihood estimation
- Models
- Twin data
- Sib pair linkage analysis
- Association analysis
3Origins of SEM
- Regression analysis
- Reversion: Galton 1877, biological phenomenon
- Yule 1897, Pearson 1903: general statistical context
- Initially Gaussian X and Y; Fisher 1922: Y|X
- Path Analysis
- Sewall Wright 1918 1921
- Path Diagrams of regression and covariance
relationships
4Structural Equation Modeling Basics
- Two kinds of relationships
- Linear regression X -> Y (single-headed arrow)
- Unspecified covariance X <-> Y (double-headed arrow)
- Four kinds of variable
- Squares: observed variables
- Circles: latent (not observed) variables
- Triangles: constant (zero variance), for specifying means
- Diamonds: observed variables used as moderators (on paths)
5Linear Regression Covariance SEM
[Path diagram: X -> Y with regression path b; Var(X) on X and Res(Y) on Y]
- Models covariances only
- Of historical interest
6Linear Regression SEM with means
[Path diagram: X -> Y with regression path b; Var(X) on X and Res(Y) on Y; constant 1 with mean paths Mu(x) to X and Mu(y) to Y]
- Models means and covariances
7Linear Regression SEM Individual-level
$Y_i = a + b X_i$
[Path diagram with nodes Yi, D and the constant 1; paths a, b and Xi; residual Res(Y) on Yi]
- Models mean and covariance of Y only
- Must have raw (individual-level) data
- Xi is a definition variable
- Mean of Y different for every observation
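The individual-level model above can be sketched in a few lines. This is a minimal illustration, assuming normally distributed residuals; the function name `minus2_loglik`, the parameter `res_var` and the data values are hypothetical and chosen only to show how the expected mean of Y changes with each observation's Xi.

```python
import math

def minus2_loglik(a, b, res_var, xs, ys):
    """-2 ln L for Y_i ~ Normal(a + b * X_i, res_var), with X_i treated as fixed."""
    total = 0.0
    for x, y in zip(xs, ys):
        mean_i = a + b * x  # the expected mean differs for every observation
        dev = y - mean_i
        total += math.log(2 * math.pi * res_var) + dev * dev / res_var
    return total

# Hypothetical raw data, roughly following Y = 1 + 2X
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

fit_good = minus2_loglik(1.0, 2.0, 0.05, xs, ys)  # parameters near the truth
fit_bad = minus2_loglik(0.0, 0.0, 0.05, xs, ys)   # ignores X entirely
print(fit_good < fit_bad)
```

A smaller -2 ln L indicates a better fit, so the parameters close to the generating values are preferred.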
8Single Factor Covariance Model
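9Two Factor Model with Covs and Means
[Path diagram: factors F1 and F2 with means mF1 and mF2 and two paths fixed at 1.00; loadings l1, l2, l3, ..., lm on observed variables S1, S2, S3, ..., Sm; residuals e1 to e4; means mS1, mS2, mS3, ..., mSm from the constant 1]
N.B. Not identified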
10Factor model essentials
- In SEM the factors are typically assumed to be normally distributed
- May have more than one latent factor
- The errors are typically assumed to be normally distributed as well
- May be applied to binary or ordinal data
- Threshold model
11Multifactorial Threshold Model
Normal distribution of liability. Affected when liability x > t
[Plot: normal density of liability x (x axis from -4 to 4, density from 0 to 0.5) with threshold t marked]
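As a small illustration of the threshold model, the proportion affected is the area of the liability distribution above t. The sketch below assumes a standard normal liability (mean 0, variance 1); the function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prevalence(threshold):
    """Proportion affected: P(liability > threshold) under a standard normal liability."""
    return 1.0 - normal_cdf(threshold)

for t in (0.0, 1.0, 2.0):
    print(t, round(prevalence(t), 4))
```

Higher thresholds correspond to rarer conditions: a threshold at the mean gives 50% affected, and the proportion falls as t moves to the right.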
12Measuring Variation
- Distribution
- Population
- Sample
- Observed measures
- Probability density function (pdf)
- Smoothed out histogram
- f(x) ≥ 0 for all x
13Flipping Coins
4 coins, 5 outcomes
[Bar chart: probability (0 to 0.4) of each outcome: HHHH, HHHT, HHTT, HTTT, TTTT]
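The five outcome probabilities for four fair coins follow from the binomial distribution. The sketch below is illustrative only; the function name is hypothetical, and the coins are assumed to be fair and independent.

```python
from math import comb

def head_count_probabilities(n_coins):
    """Probability of k heads (k = 0..n_coins) for fair, independent coins."""
    return [comb(n_coins, k) / 2 ** n_coins for k in range(n_coins + 1)]

probs = head_count_probabilities(4)
# Order: 0 heads (TTTT), 1 head (HTTT), 2 heads (HHTT), 3 heads (HHHT), 4 heads (HHHH)
print(probs)  # [0.0625, 0.25, 0.375, 0.25, 0.0625]
```

With more coins the same calculation gives a histogram that looks increasingly like a smooth bell curve.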
14Bank of China Coin Toss
Infinite outcomes
[Plot: bell-shaped distribution of Heads-Tails (x axis from -4 to 4, density from 0 to 0.5)]
De Moivre 1733, Gauss 1827
15Variance: Average squared deviation
Normal distribution
[Plot: normal curve with mean μ; deviation di of observation xi from the mean; x axis from -3 to 3]
Variance $= \sum_i d_i^2 / N$
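16Deviations in two dimensions
[Scatterplot with the means μx and μy marked]
17Deviations in two dimensions: dx × dy
[Scatterplot: deviations dx and dy of a point from the means μx and μy]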
18Covariance
- Measure of association between two variables
- Closely related to variance
- Useful to partition variance
- Analysis of Variance term coined by Fisher
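Variance as the average squared deviation, and covariance as the average product of deviations dx × dy, can be sketched as follows. The function names and data are hypothetical; the divisor is N, matching the variance formula on the earlier slide rather than the N - 1 used for an unbiased sample estimate.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Average product of deviations from the means (divisor N)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def variance(xs):
    """Average squared deviation: the covariance of a variable with itself."""
    return covariance(xs, xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(variance(xs), variance(ys), covariance(xs, ys))
```

Writing the variance as a special case of the covariance shows how closely the two quantities are related.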
19Variance covariance matrix
Univariate Twin/Sib Data
Var(Twin1)        Cov(Twin1,Twin2)
Cov(Twin2,Twin1)  Var(Twin2)
- Suitable for modeling when no missing data
- Good conceptual perspective
20Maximum Likelihood Estimates Nice Properties
- 1. Asymptotically unbiased
- Large-sample estimate of p -> population value
- 2. Minimum variance (efficient)
- Smallest variance of all estimates with property 1
- 3. Functionally invariant
- If g(a) is a one-to-one function of parameter a
- and $\mathrm{MLE}(a) = \hat{a}$
- then $\mathrm{MLE}(g(a)) = g(\hat{a})$
- See http://wikipedia.org
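Functional invariance can be checked numerically. The sketch below assumes a normal model and hypothetical data: it computes the ML estimate of the variance (divisor N), takes its square root, and compares that with a direct grid search for the standard deviation that minimizes -2 ln L. All names are illustrative.

```python
import math

def minus2_loglik(data, mu, var):
    """-2 ln L of independent normal observations."""
    return sum(math.log(2 * math.pi * var) + (x - mu) ** 2 / var for x in data)

data = [4.1, 5.3, 6.2, 4.8, 5.6]  # hypothetical observations
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)  # MLE of the variance
sd_hat = math.sqrt(var_hat)  # g(a) = sqrt(a) applied to the MLE

# Independent check: search directly over the standard deviation
grid = [0.01 * k for k in range(1, 501)]
sd_grid_best = min(grid, key=lambda s: minus2_loglik(data, mu_hat, s * s))
print(sd_hat, sd_grid_best)
```

The grid minimum agrees with the square root of the variance estimate up to the grid spacing, as the invariance property predicts.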
21Full Information Maximum Likelihood (FIML)
Calculate height of curve for each raw data vector
22Height of normal curve: μx = 0
Probability density function
[Plot: normal curve centered at μx = 0 with height φ(xi) marked at data point xi; x axis from -3 to 3]
φ(xi) is the likelihood of data point xi for particular mean and variance estimates
23Height of normal curve at xi: μx = .5
Function of mean
[Plot: normal curve centered at μx = .5 with height φ(xi) marked at data point xi; x axis from -3 to 3]
Likelihood of data point xi increases as μx approaches xi
24Likelihood of xi as a function of μ
Likelihood function
[Plot: likelihood L(xi) against μx (x axis from -3 to 3), with the maximum marked MLE and xi marked on the axis]
L(xi) is the likelihood of data point xi for particular mean and variance estimates
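The likelihood of a single data point as a function of the mean can be sketched as follows. The data point xi = 1.0, the fixed variance of 1 and the candidate means are assumptions made only for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Height of the normal curve with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

xi = 1.0  # hypothetical data point
candidate_means = [-1.0, 0.0, 0.5, 1.0, 1.5]
likelihoods = {m: normal_pdf(xi, m, 1.0) for m in candidate_means}

# The likelihood rises as the mean approaches xi and peaks when they coincide
best_mean = max(likelihoods, key=likelihoods.get)
print(best_mean)
```

For one observation the maximum likelihood estimate of the mean is the observation itself, which is where the likelihood function peaks.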
25Height of normal curve at xi
Function of variance
[Plot: three normal curves with mean μx and variances var 1, var 2 and var 3, with heights φ(xi) marked at data point xi; x axis from -3 to 3]
Likelihood of data point xi changes as variance of distribution changes
26Height of normal curve at x1 and x2
[Plot: two normal curves with mean μx and variances var 1 and var 2, with heights φ(x1) and φ(x2) marked under each curve; x axis from -3 to 3]
x1 has higher likelihood with var1 whereas x2 has higher likelihood with var2
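The same point can be made numerically. In this sketch the mean is 0, and the values of x1, x2, var1 and var2 are hypothetical: x1 lies close to the mean and x2 lies far from it.

```python
import math

def normal_pdf(x, mean, var):
    """Height of the normal curve with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

x1, x2 = 0.5, 2.5      # hypothetical data points
var1, var2 = 1.0, 4.0  # hypothetical small and large variances

x1_prefers_var1 = normal_pdf(x1, 0.0, var1) > normal_pdf(x1, 0.0, var2)
x2_prefers_var2 = normal_pdf(x2, 0.0, var2) > normal_pdf(x2, 0.0, var1)
print(x1_prefers_var1, x2_prefers_var2)
```

A point near the mean is more likely under the narrow curve, while a point in the tail is more likely under the wide curve, so the best variance is a compromise across all observations.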
27Height of bivariate normal density function
Likelihood varies as $f(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$
[Plot: bivariate normal density surface over axes x and y]
28Likelihood of Independent Observations
- Chance of getting two heads
- $L(x_1 \ldots x_n) = \prod_{i=1}^{n} L(x_i)$
- L(xi) typically < 1
- Avoid vanishing $L(x_1 \ldots x_n)$
- Computationally convenient log-likelihood
- ln(ab) = ln(a) + ln(b)
- Minimization more manageable than maximization
- Minimize -2 ln(L)
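The reason for working with log-likelihoods can be demonstrated directly. The sketch below uses hypothetical data of 2000 values and a standard normal model: the raw product of likelihoods underflows to zero in floating point, whereas -2 ln L stays finite.

```python
import math

def normal_pdf(x, mean, var):
    """Height of the normal curve with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical data: 2000 values between -1 and 1
data = [0.1 * (i % 21 - 10) for i in range(2000)]

product = 1.0
for x in data:
    product *= normal_pdf(x, 0.0, 1.0)  # each factor is below 1

minus2lnL = -2 * sum(math.log(normal_pdf(x, 0.0, 1.0)) for x in data)
print(product, minus2lnL)
```

Summing logs replaces a vanishing product with a manageable sum, and multiplying by -2 turns maximization into minimization on a scale that is convenient for likelihood ratio tests.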
29Likelihood Ratio Tests
- Comparison of likelihoods
- Consider ratio L(data, model 1) / L(data, model 2)
- ln(a/b) = ln(a) - ln(b)
- Log-likelihood ratio: lnL(data, model 1) - lnL(data, model 2)
- Useful asymptotic feature when model 2 is a submodel of model 1
- -2 (lnL(data, model 2) - lnL(data, model 1)) ~ $\chi^2$
- df = number of parameters of model 1 - number of parameters of model 2
- BEWARE of gotchas!
- Estimates of $a^2$, $q^2$ etc. have implicit bound of zero
- Distributed as 50:50 mixture of 0 and $\chi^2$ (1 df)
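A likelihood ratio test for a single parameter, with and without the boundary correction, might be sketched as below. The -2 ln L values are hypothetical, and the function names are illustrative. The 1 df chi-square tail probability is computed from the complementary error function, since a chi-square with 1 df is the square of a standard normal.

```python
import math

def chi2_sf_1df(x):
    """P(chi-square with 1 df > x), using chi2_1 = Z**2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

def lrt_statistic(minus2lnL_full, minus2lnL_sub):
    """Difference in -2 ln L between the submodel and the full model."""
    return minus2lnL_sub - minus2lnL_full

def p_value_boundary(stat):
    """p-value when the tested parameter is bounded at zero: 50:50 mixture of 0 and chi2_1."""
    return 0.5 * chi2_sf_1df(stat) if stat > 0 else 1.0

# Hypothetical fits: the full model estimates one extra variance component
stat = lrt_statistic(4000.00, 4003.84)
p_naive = chi2_sf_1df(stat)
p_mixture = p_value_boundary(stat)
print(round(stat, 2), round(p_naive, 4), round(p_mixture, 4))
```

Ignoring the boundary makes the test conservative: for a positive test statistic the mixture p-value is half the naive one.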
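30Two Group ACE Model for twin data
[Path diagram: latent A, C and E (variances 1) for each twin, with paths a, c and e to phenotypes PT1 and PT2; correlation between the two A factors 1 (MZ) or .5 (DZ); correlation between the two C factors 1; mean m for each twin from the constant 1]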
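The expected covariance matrices implied by the ACE path diagram can be sketched as follows, using the standard path-tracing results: variance a² + c² + e², and twin covariance a² + c² for MZ pairs or .5a² + c² for DZ pairs. The function name and the parameter values are hypothetical.

```python
def ace_expected_covariance(a, c, e, zygosity):
    """Expected 2x2 twin covariance matrix under the ACE model (path coefficients a, c, e)."""
    r_a = 1.0 if zygosity == "MZ" else 0.5  # correlation between additive genetic factors
    var = a * a + c * c + e * e
    cov = r_a * a * a + c * c
    return [[var, cov], [cov, var]]

# Hypothetical path coefficients
mz = ace_expected_covariance(0.6, 0.5, 0.3, "MZ")
dz = ace_expected_covariance(0.6, 0.5, 0.3, "DZ")
print(mz)
print(dz)
```

The two groups share the same three parameters, and it is the difference between the MZ and DZ covariances that identifies a² separately from c².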
31Linkage vs Association
- Linkage
- Family-based
- Matching/ethnicity generally unimportant
- Few markers for genome coverage (300-400 STRs)
- Can be weak design
- Good for initial detection; poor for fine-mapping
- Powerful for rare variants
- Association
- Families or unrelated individuals
- Matching/ethnicity crucial
- Many markers required for genome coverage ($10^5$ to $10^6$ SNPs)
- Powerful design
- OK for initial detection; good for fine-mapping
- Powerful for common variants; rare variants generally impossible
32Identity by Descent (IBD)
Number of alleles shared IBD at a locus, parents AB and CD. Three subgroups of sibpairs.
(rows: genotype of sib 1; columns: genotype of sib 2)
      AC  AD  BC  BD
AC     2   1   1   0
AD     1   2   0   1
BC     1   0   2   1
BD     0   1   1   2
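The IBD table can be generated by enumeration. The sketch below assumes parents with genotypes AB and CD, so that every allele is distinguishable, and represents each offspring as a pair (allele from the first parent, allele from the second parent). The names are hypothetical.

```python
from collections import Counter
from itertools import product

def ibd_sharing(sib1, sib2):
    """Number of alleles shared IBD: one possible match per parent."""
    return int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])

offspring = list(product("AB", "CD"))  # AC, AD, BC, BD
table = {
    ("".join(s1), "".join(s2)): ibd_sharing(s1, s2)
    for s1 in offspring
    for s2 in offspring
}

counts = Counter(table.values())
proportions = {k: counts[k] / len(table) for k in (0, 1, 2)}
print(proportions)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

Because the sixteen cells are equally likely, sib pairs fall into the IBD 0, 1 and 2 subgroups in the proportions 1/4, 1/2 and 1/4.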
33Partitioned Twin Analysis
- Nance & Neale (1989) Behav Genet 19:1
- Separate DZ pairs into subgroups
- IBD=0, IBD=1, IBD=2
- Correlate Q with coefficients 0, .5 and 1
- Compute statistical power
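34Partitioned Twin Analysis: Three DZ groups
[Path diagrams for the IBD1, IBD2 and IBD0 groups. In each, latent A, C, D, E and Q for each sibling lead to phenotypes P1 and P2; sibling correlations are .5 for A, 1 for C and .25 for D; the correlation between Q1 and Q2 is .5 in the IBD1 group, 1 in the IBD2 group and 0 in the IBD0 group]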
35Problem 1 with Partitioned Twin Analysis: Low Power
36Problem 2: IBD is not known with certainty
- Markers may not be fully informative
- Only so much heterozygosity in, e.g., a 20-allele microsatellite marker
- Less in a SNP
- Unlikely to have typed the exact locus we are looking for
- Genome is big!
37IBD pairs vary in similarity
Effect of selecting concordant pairs
[Figure: sib pairs in the IBD2, IBD1 and IBD0 groups, with threshold t marked on both axes]
38Improving Power for Linkage
- Increase marker density (yaay SNP chips)
- Change design
- Families
- Larger Sibships
- Selected samples
- Multivariate data
- More heritable traits with less error
39Problem 2: IBD is not known with certainty
- Markers may not be fully informative
- Only so much heterozygosity in, e.g., a 20-allele microsatellite marker
- Less in a SNP
- Unlikely to have typed the locus that causes variation
- Genome is big!
- "The Universe is Big. Really big. It may seem like a long way to the corner chemist, but compared to the Universe, that's peanuts." (D. Adams)
41Using Merlin/Genehunter etc
- Several Faculty experts
- Goncalo Abecasis
- Sarah Medland
- Stacey Cherny
- Possible to use Merlin via Mx GUI
42Pi-hat approach
- 1. Pick a putative QTL location
- 2. Compute p(IBD0), p(IBD1), p(IBD2) given marker data (use Mapmaker/sibs or Merlin)
- 3. Compute $\hat{\pi}_i = p(IBD2) + .5\,p(IBD1)$
- 4. Fit model
- Repeat 1-4 as necessary for different locations across genome
Elston Stewart
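Step 3 of the pi-hat approach is a one-line calculation once the IBD probabilities are available. The sketch below takes those probabilities as given, for example as produced by a program such as Merlin; the function name and the input check are illustrative additions.

```python
def pi_hat(p_ibd0, p_ibd1, p_ibd2):
    """Estimated proportion of alleles shared IBD: p(IBD2) + .5 * p(IBD1)."""
    total = p_ibd0 + p_ibd1 + p_ibd2
    if abs(total - 1.0) > 1e-8:
        raise ValueError("IBD probabilities must sum to 1")
    return p_ibd2 + 0.5 * p_ibd1

print(pi_hat(0.25, 0.5, 0.25))  # uninformative marker data: prior expectation
print(pi_hat(0.0, 0.0, 1.0))    # pair known to share both alleles IBD
print(pi_hat(0.1, 0.3, 0.6))    # partially informative marker data
```

When markers are uninformative the estimate falls back to the prior value of .5 for every pair, which is why linkage power depends so strongly on marker informativeness.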
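43Basic Linkage (QTL) Model
$\hat{\pi}_i = p(IBD_i = 2) + .5\,p(IBD_i = 1)$ (individual-level)
[Path diagram: latent F, Q and E (variances 1) for each sibling, with paths f, q and e to phenotypes P1 and P2; F1 and F2 correlated 1; Q1 and Q2 correlated Pihat]
- Q: QTL additive genetic
- F: family environment
- E: random environment
- 3 estimated parameters: q, f and e
- Every sibship may have different model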
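Tracing the paths in the basic linkage model gives variance q² + f² + e² and sib covariance π̂q² + f², so each pair has its own expected covariance matrix. The sketch below assumes bivariate normality and a common mean; the function names, parameter values and phenotype scores are hypothetical.

```python
import math

def sibpair_expected_covariance(q, f, e, pihat):
    """Expected 2x2 covariance matrix for a sib pair with IBD proportion pihat."""
    var = q * q + f * f + e * e
    cov = pihat * q * q + f * f
    return [[var, cov], [cov, var]]

def pair_minus2_loglik(y1, y2, mean, cov_matrix):
    """-2 ln L of one sib pair under a bivariate normal distribution."""
    (v1, c), (_, v2) = cov_matrix
    det = v1 * v2 - c * c
    d1, d2 = y1 - mean, y2 - mean
    quad = (v2 * d1 * d1 - 2 * c * d1 * d2 + v1 * d2 * d2) / det
    return 2 * math.log(2 * math.pi) + math.log(det) + quad

# A concordant pair (hypothetical scores) evaluated under two IBD proportions
fit_ibd0 = pair_minus2_loglik(1.0, 1.0, 0.0, sibpair_expected_covariance(0.5, 0.5, 0.7, 0.0))
fit_ibd2 = pair_minus2_loglik(1.0, 1.0, 0.0, sibpair_expected_covariance(0.5, 0.5, 0.7, 1.0))
print(fit_ibd0, fit_ibd2)
```

Summing this quantity over all pairs, each with its own π̂, and minimizing over q, f and e gives the fit of the linkage model at one location.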
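44Association Model
$LDL_{1i} = a + b\,Geno_{1i}$
$Var(LDL_i) = R$
$Cov(LDL_1, LDL_2) = C$
C may be $f(\hat{\pi}_i)$ in joint linkage and association
[Path diagram: constant 1 with intercept paths a to LDL1 and LDL2; genotypes Geno1 and Geno2 enter through G1 and G2 with paths b; residual variances R and covariance C]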
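45Between/Within Fulker Association Model
Model for the means
$LDL_{1i} = .5b\,Geno_1 + .5b\,Geno_2 + .5w\,Geno_1 - .5w\,Geno_2 = .5\,(b(Geno_1 + Geno_2) + w(Geno_1 - Geno_2))$
[Path diagram: Geno1 and Geno2 enter through G1 and G2 and combine into S (paths 0.50, 0.50) and D (paths 0.50, -0.50); S leads to B with path b and D leads to W with path w; B and W reach LDL1 and LDL2 with paths 1.00, 1.00 and 1.00, -1.00; means m from the constant M; residual variances R and covariance C]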
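The between/within model for the means can be sketched as below. The expression for the first sibling follows the equation on the slide; the intercept m and the expression for the second sibling (with the sign of the within term reversed) are assumptions read from the path labels, and the function name and values are hypothetical.

```python
def between_within_means(m, b, w, geno1, geno2):
    """Expected means for two siblings under a between/within decomposition."""
    sib_mean = 0.5 * (geno1 + geno2)   # between-family component (S)
    half_diff = 0.5 * (geno1 - geno2)  # within-family component (D)
    mu1 = m + b * sib_mean + w * half_diff
    mu2 = m + b * sib_mean - w * half_diff  # assumed mirror image for sibling 2
    return mu1, mu2

# When b equals w the model reduces to an ordinary regression on own genotype
print(between_within_means(0.0, 1.0, 1.0, 2, 0))
# Siblings with identical genotypes differ only through the between effect
print(between_within_means(3.0, 0.3, 0.1, 1, 1))
```

Comparing b with w separates association that could reflect population stratification (between families) from association that holds within families.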