Title: Introduction to linkage analysis
1Introduction to linkage analysis
Harald H.H. Göring
Course Study Design and Data Analysis for
Genetic Studies, Universidad del Zulia,
Maracaibo, Venezuela, 9-10 April 2005
2Marker loci
- There are many different types of polymorphisms, e.g.
  - single nucleotide polymorphism (SNP)
    AAACATAGACCGGTT
    AAACATAGCCCGGTT
  - microsatellite/variable number of tandem repeat (VNTR)
    AAACATAGCACACA----CCGGTT
    AAACATAGCACACACACCGGTT
  - insertion/deletion (indel)
    AAACATAGACCACCGGTT
    AAACATAG--------CCGGTT
  - restriction fragment length polymorphism (RFLP)
3Tracing chromosomal inheritance using marker
locus genotypes
4Tracing chromosomal inheritance (fully
informative situation)
5Linkage analysis: locus with known genotypes
6Linkage analysis
- In linkage analysis, one evaluates statistically
whether or not the alleles at 2 loci co-segregate
during meiosis more often than expected by
chance. If the evidence of increased
co-segregation is convincing, one generally
concludes that the 2 loci are linked, i.e. are
located on the same chromosome (syntenic loci).
The degree of co-segregation provides an estimate
of the proximity of the 2 loci, with near
complete co-segregation for very tightly linked
loci.
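7Let's step back to Mendel
8One of Mendel's pea crosses
[Figure: two-locus pea cross followed through the P1, F1 and F2 generations]
P1 -> F1: Mendel's law of uniformity
F1 -> F2: Mendel's law of independent assortment
F2 observed counts: 315 : 108 : 101 : 32
F2 observed ratio (approx.): 9 : 3 : 3 : 1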
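9 [Figure: single-locus cross followed through the P1, F1 and F2 generations]
P1 -> F1: Mendel's law of uniformity
F1 -> F2: Mendel's law of segregation
F2 genotype proportions: 25% : 50% : 25% (in expectation)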
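10 [Figure: single-locus cross followed through the P1, F1 and F2 generations]
P1 -> F1: Mendel's law of uniformity
F1 -> F2: Mendel's law of segregation
F2 genotype proportions: 25% : 50% : 25% (in expectation)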
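11 [Figure: two-locus cross followed through the P1, F1 and F2 generations]
P1 -> F1: Mendel's law of uniformity
F1 -> F2: Mendel's law of independent assortment
F2 two-locus genotype proportions (in expectation): four classes at 6.25%, four classes at 12.5%, one class at 25%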
12Co-segregation (due to linkage)
[Figure: two linked loci, one with alleles 1 and 2, the other with alleles a and b, followed through three generations]
P1 generation (diploid): 1/1 a/a and 2/2 b/b
gametes (haploid): 1-a and 2-b
Mendel's law of uniformity
F1 generation (diploid): 1/2 a/b
gametes (haploid): 1-a and 2-b
Mendel's law of segregation
F2 generation (diploid): 1/1 a/a (25%), 1/2 a/b (50%), 2/2 b/b (25%)
13Recombination
- Recombination between 2 loci is said to have
occurred if an individual received, from one
parent, alleles (at these 2 loci) that originated
in 2 different grandparents.
14Who is a recombinant?
[Figure: pedigree with 10 offspring, each scored as recombinant (R) or non-recombinant (N)]
N N N R N N N N R N
15Possible explanations for recombination
[Figure: grandparental genotypes 1/1 a/a and 2/2 b/b; gametes scored N, R, R, N]
I   loci on different chromosomes
II  homologous recombination during meiosis
III genotyping error
16Recombination fraction
- The recombination fraction between 2 loci is
defined as the proportion of meioses resulting in
a recombinant gamete. For loci on different
chromosomes (or for loci far apart on the same,
large chromosome), the recombination fraction is
0.5. Such loci are said to be unlinked. For loci
close together on the same chromosome, the
recombination fraction is < 0.5. Such loci are
said to be linked. The closer the loci, the
smaller the recombination fraction (≥ 0).
17Estimation of recombination fraction
[Figure: pedigree with 10 offspring, each scored as recombinant (R) or non-recombinant (N)]
N N N R N N N N R N
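When phase is known, the recombination fraction can be estimated by simply counting recombinants among the informative meioses. The following is a minimal Python sketch using the 10 meioses scored above (2 recombinants, 8 non-recombinants); the function name is illustrative and not part of the course material.

```python
def estimate_theta(labels):
    """Estimate the recombination fraction as the proportion of recombinant meioses."""
    recombinants = labels.count("R")
    informative = recombinants + labels.count("N")
    if informative == 0:
        raise ValueError("no informative meioses")
    return recombinants / informative

# Meioses as scored on the slide: N = non-recombinant, R = recombinant
meioses = list("NNNRNNNNRN")
theta_hat = estimate_theta(meioses)
print(f"estimated recombination fraction: {theta_hat}")
```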
18Missing phase information: Who is a recombinant??
1/2
3/3
a/b
c/c
19Missing phase and genotype information: Who is a
recombinant??
?/?
3/3
1/2
c/c
a/b
20Missing phase and genotype information: Who is a
recombinant???
?/?
?/?
c/c
a/b
21Likelihood
- The likelihood of a hypothesis (e.g. specific
parameter value(s)) on a given dataset,
L(hypothesis | data), is defined to be proportional
to the probability of the data given the
hypothesis, P(data | hypothesis):
L(hypothesis | data) = constant × P(data | hypothesis)
- Because of the proportionality constant, a
likelihood by itself has no interpretation.
- The likelihood ratio (LR) of 2 hypotheses is
meaningful if the 2 hypotheses are nested (i.e.,
one hypothesis is contained within the other).
- Under certain conditions, maximum likelihood
estimates are asymptotically unbiased and
asymptotically efficient. Likelihood theory
describes how to interpret a likelihood ratio.
22Evaluating the evidence of linkage: lod score
The lod (logarithm of odds) score is defined as
the logarithm (to the base 10) of the likelihood
ratio of 2 hypotheses on a given dataset.
In linkage analysis, typically the different
hypotheses refer to different values of the
recombination fraction.
23Who is a recombinant?
[Figure: pedigree with 10 offspring, each scored as recombinant (R) or non-recombinant (N)]
N N N R N N N N R N
24Example lod score calculation
θ      Z(θ)
0      -∞
0.1    0.644
0.2    0.837
0.3    0.725
0.4    0.439
0.5    0
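The tabulated values are consistent with comparing the likelihood at a given recombination fraction θ against the likelihood at θ = 0.5 for 2 recombinants and 8 non-recombinants, i.e. Z(θ) = log10[θ^2 (1-θ)^8 / 0.5^10]. A minimal Python sketch under that assumption (the function name is illustrative):

```python
import math

def lod_phase_known(theta, n_rec, n_nonrec):
    """lod score Z(theta) = log10[L(theta) / L(0.5)] for phase-known meioses."""
    n = n_rec + n_nonrec
    likelihood = theta ** n_rec * (1 - theta) ** n_nonrec
    if likelihood == 0:
        # The data are impossible under this theta
        return float("-inf")
    return math.log10(likelihood / 0.5 ** n)

# 2 recombinants and 8 non-recombinants, as in the pedigree example
for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(theta, lod_phase_known(theta, 2, 8))
```

The lod score peaks at θ = 0.2, the proportion of recombinants observed.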
25Missing phase information: Who is a recombinant??
1/2
3/3
a/b
c/c
26Example lod score calculation (missing phase
information)
P(data | θ) = P(phase 1) P(data | phase 1, θ)
            + P(phase 2) P(data | phase 2, θ)
θ      Z(θ)
0      -∞
0.1    0.343
0.2    0.536
0.3    0.427
0.4    0.175
0.5    0
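With phase unknown, the likelihood is averaged over the two possible phases. The sketch below assumes both phases are equally likely a priori (0.5 each) and that the data show 2 recombinants out of 10 meioses under one phase and therefore 8 under the other; under these assumptions it reproduces the tabulated values. The function name is illustrative.

```python
import math

def lod_phase_unknown(theta, k, n):
    """lod score when two phases are equally likely a priori.

    Under one phase the data show k recombinants out of n meioses;
    under the other phase they show n - k recombinants.
    """
    phase1 = theta ** k * (1 - theta) ** (n - k)
    phase2 = theta ** (n - k) * (1 - theta) ** k
    likelihood = 0.5 * phase1 + 0.5 * phase2
    if likelihood == 0:
        return float("-inf")
    return math.log10(likelihood / 0.5 ** n)

for theta in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(theta, lod_phase_unknown(theta, 2, 10))
```

Compared with the phase-known case, every lod score is lower: the uncertainty about phase costs information.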
27Missing phase and genotype information: Who is a
recombinant???
?/?
?/?
c/c
a/b
28Example lod score calculation (missing phase and
genotype information)
Assuming 3 equally frequent alleles, i.e. P(1) =
P(2) = P(3) = 0.333
θ      Z(θ)
0      -0.304
0.1    0.204
0.2    0.346
0.3    0.264
0.4    0.096
0.5    0
θ      Z(θ)
0      -0.378
0.1    0.183
0.2    0.332
0.3    0.253
0.4    0.091
0.5    0
Assuming P(1) = 0.495, P(2) = 0.495, P(3) = 0.010
29 [Figure: comparison of the three preceding examples]
- known phase, known genotypes
- unknown phase, known genotypes
- unknown phase, unknown genotypes
30Interpretation of lod score
- The traditional threshold for declaring evidence
of linkage statistically significant is a lod
score of 3, or a likelihood ratio of 1000:1,
meaning the likelihood of linkage on the data is
1000 times higher than the likelihood of no
linkage on the data.
- Asymptotically, a lod score of 3 has a point-wise
significance level (p-value) of 0.0001. In other
words, the probability of obtaining a lod score
of at least this magnitude by chance is 0.0001.
- Due to the many linkage tests being conducted as
part of a genome-wide linkage scan, a lod score
of 3 has a genome-wide significance level of
approximately 0.05.
31P-value
The p-value is defined as the probability of
obtaining an outcome at least as extreme as
observed by chance (i.e. when the null hypothesis
is true).
Example: Testing whether a coin is fair.
H0: P(head) = 0.5; H1: P(head) ≠ 0.5 (2-sided
alternative hypothesis). You observe 1 head out
of 10 coin tosses. The p-value then is the
probability of observing exactly 1 head in 10
trials (observed outcome), or 0 heads in 10 trials
(more extreme outcome), or 9 (equally extreme
outcome) or 10 (more extreme outcome) heads in 10
trials.
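The coin example can be worked through numerically. The sketch below sums the binomial probabilities of all outcomes that are no more likely than the observed one under H0, which for a fair coin gives exactly the four outcomes listed (0, 1, 9 and 10 heads). Function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_value(k, n, p=0.5):
    """Sum the probabilities of all outcomes at least as extreme as k,
    i.e. no more likely than the observed outcome under H0."""
    observed = binom_pmf(k, n, p)
    return sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= observed * (1 + 1e-12)
    )

# 1 head observed in 10 tosses
p_value = two_sided_p_value(1, 10)
print(p_value)
```

The result is (1 + 10 + 10 + 1) / 1024, i.e. about 0.021.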
32P-value
The p-value is defined as the probability of
obtaining an outcome at least as extreme as
observed by chance (i.e. when the null hypothesis
is true).
Example: Testing whether 2 loci are linked.
H0: P(recombination) = 0.5; H1: P(recombination) < 0.5
(1-sided alternative hypothesis). You observe 0
recombinants and 10 non-recombinants in 10
informative meioses. The p-value then is the
probability of observing exactly 0 recombinants
in 10 trials (observed outcome; there is no more
extreme outcome).
33Lod score
Example: Testing whether 2 loci are linked.
H0: P(recombination) = 0.5; H1: P(recombination) < 0.5
(1-sided alternative hypothesis). You observe 0
recombinants and 10 non-recombinants in 10
informative meioses. The p-value then is the
probability of observing exactly 0 recombinants
in 10 trials (observed outcome; there is no more
extreme outcome).
In the ideal case, 10 fully informative meioses
may suffice to obtain significant evidence of
linkage.
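The claim about 10 fully informative meioses can be checked directly. With 0 recombinants, the likelihood is maximized at a recombination fraction of 0, where it equals 1, while under no linkage it equals 0.5^10. A minimal sketch (variable names are illustrative):

```python
import math

n_meioses = 10  # all non-recombinant, fully informative

# One-sided p-value under H0 (recombination fraction 0.5):
# probability of 0 recombinants in 10 meioses
p_value = 0.5 ** n_meioses

# lod score at the maximum likelihood estimate (recombination fraction 0):
# L(0) = 1 for 0 recombinants, L(0.5) = 0.5 ** 10
lod = math.log10(1.0 / 0.5 ** n_meioses)

print(p_value, lod)
```

The p-value is 1/1024 (about 0.001) and the lod score is 10 × log10(2), just above the traditional threshold of 3.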
34Lod score and significance level
lod score (point-wise) p-value
0.588 0.05
1.175 0.01
2.000 0.001
3.000 0.0001
4.000 0.00001
5.000 0.000001
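The correspondence in this table can be approximated from asymptotic likelihood theory. The sketch below assumes that, under no linkage, 2 × ln(10) × lod follows a 50:50 mixture of a point mass at 0 and a chi-square distribution with 1 degree of freedom (the usual assumption for this 1-sided test; it is not spelled out on the slide). The chi-square tail is computed with the standard library's `math.erfc`.

```python
import math

def pointwise_p_value(lod):
    """Asymptotic 1-sided point-wise p-value for a lod score.

    Assumes 2*ln(10)*lod is distributed, under H0, as a 50:50 mixture of
    a point mass at 0 and a chi-square variable with 1 degree of freedom.
    """
    if lod <= 0:
        return 1.0
    chi_square = 2 * math.log(10) * lod
    # Upper tail of chi-square(1 df) is erfc(sqrt(x / 2)); halve it for the mixture
    return 0.5 * math.erfc(math.sqrt(chi_square / 2))

for lod in (0.588, 1.175, 2.0, 3.0, 4.0, 5.0):
    print(lod, pointwise_p_value(lod))
```

The computed values agree with the tabulated ones to the rounding used in the table.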
35Linkage analysis reduces multiple testing problem
- Linkage analysis is so useful because it greatly
reduces the multiple testing problem:
3,000,000,000 bp of DNA are interrogated in 500
independent linkage tests for human data. This is
possible because a meiotic recombination event
occurs on average only once every 100,000,000 bp.
- No specification of prior hypotheses is therefore
necessary, as all possible hypotheses can be
screened.
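The link between the point-wise significance level of 0.0001 and a genome-wide level of about 0.05 can be illustrated with a simple calculation, treating the 500 linkage tests as fully independent (a simplifying assumption made here for illustration):

```python
def genome_wide_significance(pointwise_p, n_independent_tests):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - pointwise_p) ** n_independent_tests

# Point-wise p-value of a lod score of 3, and 500 independent tests
alpha_genome = genome_wide_significance(0.0001, 500)
print(alpha_genome)
```

The result is close to 0.05, in line with the traditional lod score threshold of 3.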
36Linkage analysis: trait locus with unknown
genotypes
37Statistical gene mapping with trait phenotypes
38Many different types of linkage methods
- penetrance model-based linkage analysis
(classical linkage analysis)
- penetrance model-free linkage analysis
(model-free or non-parametric linkage analysis)
  - affected sib-pair linkage analysis
  - affected relative-pair linkage analysis
  - regression-based linkage analysis
  - variance components-based linkage analysis
39Variation within each linkage method
- 2-point analysis vs. multiple 2-point analysis
vs. multi-point analysis
- exact calculation vs. approximation (e.g., MCMC)
- qualitative trait vs. quantitative trait
- rare simple mendelian diseases vs. common
complex multifactorial diseases
40Penetrance-model-based linkage analysis
41Segregation analysis
In segregation analysis, one attempts to
characterize the mode of inheritance of a trait,
by statistically examining the segregation
pattern of the trait through a sample of related
individuals. In a way, heritability analysis is
a form of segregation analysis. In heritability
analysis, the analysis is not focused on
characterization of the segregation pattern per
se, but on quantification of inheritance assuming
a given mode of inheritance (such as, generally,
additivity/co-dominance).
42Relationship between genotypes and phenotypes
(penetrances) at the ABO blood group locus
penetrance = P(phenotype | genotype)
            Phenotype (blood group)
Genotype    A    B    AB   O
A/A         1    0    0    0
A/B         0    0    1    0
A/O         1    0    0    0
B/B         0    1    0    0
B/O         0    1    0    0
O/O         0    0    0    1
43Probability model correlating trait phenotypes
and trait locus genotypes: penetrances
penetrance = P(phenotype | genotype)
Ex. fully-penetrant dominant disease without
phenocopies
               Phenotype
Genotype       unaffected   affected
+/+            1            0
D/+ or +/D     0            1
D/D            0            1
44Statistical gene mapping with trait
phenotypes: simple dominant inheritance model
45Linkage analysis: trait locus (genotypes based on
assumed dominant inheritance model)
46Example of multipoint lod score curve
Pseudoxanthoma elasticum
From Le Saux et al. (1999) Pseudoxanthoma
elasticum maps to an 820 kb region of the p13.1
region of chromosome 16. Genomics 62:1-10
47Genetic heterogeneity
[Figure: four scenarios, drawn along a time axis]
- locus homogeneity, allelic homogeneity
- locus homogeneity, allelic heterogeneity
- locus heterogeneity, allelic homogeneity (at each locus)
- locus heterogeneity, allelic heterogeneity (at each locus)
48Pros and cons of penetrance-model-based linkage
analysis
+ potentially very powerful (under suitable
penetrance model)
+ statistically well-behaved
- requires specification of penetrance model
- not powerful at all under unsuitable penetrance model
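49Effects of model misspecification
[Figure: two pedigrees with identical marker genotypes, analysed under different inheritance models; which parent is informative differs between the models]
dominant inheritance: P(aff. | D/D or D/+) = 1, P(aff. | +/+) = 0
  parental labels: informative, uninformative
  trait genotypes shown: +/+, D/+ (parents); D/+, +/+, D/+ (offspring)
  marker genotypes shown: 1/2, 3/4 (parents); 1/3, 1/4, 2/3 (offspring)
recessive inheritance: P(aff. | D/D) = 1, P(aff. | +/+ or D/+) = 0
  parental labels: uninformative, informative
  trait genotypes shown: D/+, D/D (parents); D/D, D/+, D/D (offspring)
  marker genotypes shown: 1/2, 3/4 (parents); 1/3, 1/4, 2/3 (offspring)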
50Pros and cons of penetrance-model-based linkage
analysis
+ potentially very powerful (under suitable
penetrance model)
+ statistically well-behaved
- requires specification of penetrance model
- not powerful at all under unsuitable penetrance model
- modeling flexibility limited
- computationally intensive
51Mendelian vs. complex traits
- simple mendelian disease
  - genotypes of a single locus cause disease
  - often little genetic (locus) heterogeneity
    (sometimes even little allelic heterogeneity)
  - little interaction between genotypes at different genes
  - often hardly any environmental effects
  - often low prevalence
  - often early onset
  - often clear mode of inheritance
  - good pedigrees for gene mapping can often be found
  - often straightforward to map
- complex multifactorial disease
  - genotypes of a single locus merely increase risk of disease
  - genotypes of many different genes (and various
    environmental factors) jointly and often
    interactively determine the disease status
  - important environmental factors
  - often high prevalence
  - often late onset
  - no clear mode of inheritance
  - not easy to find good pedigrees for gene mapping
  - difficult to map
52A quantitative trait is not necessarily complex
observed trait phenotypes
53Fundamental problem in complex trait gene mapping
correlation to be detected
etiology given ascertainment
genetic distance (linkage, allelic
association)
54Etiological complexity
[Figure: trait phenotype influenced by gene 1, gene 2, gene 3, other gene(s), environmental factors 1, 2 and 3, and other environmental factor(s)]
55How to improve power to detect correlations
between trait phenotypes and trait locus
genotypes?
etiology
56How to simplify the etiological architecture?
- choose tractable trait
  - Are there sub-phenotypes within trait?
  - age of onset
  - severity
  - combination of symptoms (syndrome)
  - endophenotype or biomarker vs. disease
  - quantitative vs. qualitative (discrete)
  - Dichotomizing quantitative phenotypes leads to loss of information.
  - simple/cheap measurement vs. uncertain/expensive diagnosis
  - not as clinically relevant, but with simpler etiology
- given trait, choose appropriate study design/ascertainment protocol
  - study population
  - genetic heterogeneity
  - environmental heterogeneity
  - random ascertainment vs. ascertainment based on phenotype of interest
  - single or multiple probands
  - concordant or discordant probands
  - pedigrees with apparent mendelian inheritance?
  - inbred pedigrees?
57Affected sib-pair linkage analysis
58Identity-by-state (IBS) vs. identity-by-descent
(IBD)
If IBD then necessarily IBS (assuming absence of
mutation event). If IBS then not necessarily IBD
(unless a locus is 100% informative, i.e. has an
infinite number of alleles, each with
infinitesimally small allele frequency).
59Probabilistic inference of IBD
IBD
1 0 0.5 1 1
2 1.5 1 0.5 0
0.25 0.5
NIBD
p
60Rationale of affected sib-pair linkage analysis
- A pair of sibs affected with the same disorder
is expected to share the alleles at the trait
locus/loci, and also alleles at linked
loci, more often (> 50%) than a random pair of
sibs (50%).
61Basic concept of affected sib pair linkage
analysis
62Affected sib pair linkage analysis (mean test)
[Figure: example pedigree; marker genotypes shown: 1/2, 3/4, 1/3, 1/4]
                          NIBD   IBD
counts in example ped.    1      1
total counts in dataset
Conditional on the fact that both sibs are
affected, test if
63Affected sib pair linkage analysis (mean test)
                          NIBD   IBD
probability
counts in ex.             1      1
total counts
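One simple way to carry out a test of this kind is a 1-sided exact binomial test on the allele counts, asking whether the proportion of alleles shared IBD among affected sib pairs exceeds the 50% expected without linkage. This is a sketch under that assumption, not necessarily the exact statistic used in the course; the dataset totals below are hypothetical.

```python
from math import comb

def mean_test_p_value(n_ibd, n_nibd):
    """1-sided exact binomial p-value for H0: P(IBD) = 0.5 vs. H1: P(IBD) > 0.5."""
    n = n_ibd + n_nibd
    # Number of equally likely outcomes with at least n_ibd alleles shared IBD
    favourable = sum(comb(n, k) for k in range(n_ibd, n + 1))
    return favourable / 2 ** n

# Example pedigree from the slide: 1 allele IBD, 1 allele not IBD
print(mean_test_p_value(1, 1))

# Hypothetical dataset totals (not from the slides): 60 IBD, 40 not IBD
print(mean_test_p_value(60, 40))
```

A single sib pair carries essentially no evidence on its own; the test only becomes informative once counts are accumulated over many pairs.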
64Penetrance-model based linkage analysis on
affected sib pair
65Penetrance-model-based linkage analysis on
affected sib pair
assuming a rare recessive trait w/o phenocopies
66Penetrance-based linkage analysis on affected sib
pair
(assuming a rare, recessive trait w/o
phenocopies)
67Relationship of affected sib-pair linkage
analysis and penetrance-model-based linkage
analysis
For an affected sib-pair of unaffected parents,
affected sib-pair linkage analysis and
penetrance-model-based linkage analysis assuming
a rare recessive trait w/o phenocopies are
identical.
68Penetrance-based linkage analysis on affected sib
pair
Assuming a rare, recessive trait w/o
phenocopies, the father is no longer
informative.
Penetrance-based linkage analysis is then no
longer equivalent to affected sib pair linkage
analysis.
69Pseudo-marker analog of affected sib pair
linkage analysis (mean test)
pseudo-marker genotypes
70Take home message regarding relationship of
penetrance-model-based and model-free
approaches to gene mapping
- The perceived differences between
penetrance-model-based and many popular
model-free methods are more related to the
underlying study design than the statistical
methodology.
- A deterministic pseudo-marker genotype
assignment algorithm can be used to mimic popular
model-free approaches, allowing joint analysis
of different data structures for linkage and/or
LD in a framework identical to penetrance-based
analysis.
- These pseudo-marker statistics are generally
better behaved and more powerful than their
conventional model-free analogs.
71Regression-based methods for linkage analysis of
quantitative traits
The basic rationale behind this approach (in its
various forms) is that pairs of individuals (of a
given relationship) with similar phenotypes are
expected to be more similar to each other
genetically at/near loci influencing the trait of
interest than pairs of relatives (of the same
relationship) who have dissimilar phenotypes. The
degree of phenotypic similarity therefore should
be reflected in the proportion of alleles that
individuals share IBD at/near trait loci.
72Haseman-Elston sib pair linkage test for
quantitative traits
[Figure: regression of D², the squared phenotypic difference between 2 sibs, on the proportion of alleles shared IBD (0, 0.5, 1)]
Statistical inference: Is the regression slope < 0?
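The regression idea can be sketched in a few lines of Python: compute the squared phenotypic difference for each sib pair, regress it on the proportion of alleles shared IBD by ordinary least squares, and inspect the sign of the slope. The sib-pair data below are hypothetical and chosen only for illustration; a real analysis would also need a significance test for the slope.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical sib pairs: (phenotype of sib 1, phenotype of sib 2, IBD proportion)
sib_pairs = [
    (10.0, 12.0, 0.0), (11.0, 9.5, 0.0),
    (10.5, 12.0, 0.5), (9.0, 10.0, 0.5), (11.0, 10.0, 0.5), (12.5, 11.0, 0.5),
    (10.0, 10.5, 1.0), (11.0, 12.0, 1.0),
]

ibd = [pair[2] for pair in sib_pairs]
squared_diff = [(pair[0] - pair[1]) ** 2 for pair in sib_pairs]

slope, intercept = ols_slope_intercept(ibd, squared_diff)
print(slope, intercept)
```

In this toy dataset the slope is negative: pairs sharing more alleles IBD have more similar phenotypes, which is the pattern expected near a trait locus.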
73Variance components-based linkage analysis
74Rationale of variance components-based linkage
analysis
- The pattern of phenotypic similarity among
pedigree members should be reflected by the
pattern of IBD sharing among them at chromosomal
loci influencing the trait of interest.
75Variance components approach: multivariate normal
distribution (MVN)
In variance components analysis, the phenotype is
generally assumed to follow a multivariate normal
distribution.
[Equation: multivariate normal density, annotated with]
- no. of individuals (in a pedigree)
- n × n covariance matrix
- phenotype vector
- mean phenotype vector
76Modeling the resemblance among relatives
heritability analysis
linkage analysis
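77Matrix of estimated allele sharing among relatives
[Figure: pedigree with parents P (1/2) and M (3/3) and sibs S1, S2, S3 (each 1/3)]
Expected sharing based on relationship alone:
      P     M     S1    S2    S3
P     1     0     0.5   0.5   0.5
M           1     0.5   0.5   0.5
S1                1     0.5   0.5
S2                      1     0.5
S3                            1
Estimated sharing given the marker genotypes:
      P     M     S1    S2    S3
P     1     0     0.5   0.5   0.5
M           1     0.5   0.5   0.5
S1                1     0.75  0.75
S2                      1     0.75
S3                            1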
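To show how such sharing matrices enter a variance components model, the sketch below builds a phenotypic covariance matrix as a weighted sum of the relationship-based sharing matrix (polygenic component), the marker-based sharing matrix (locus-specific component) and the identity matrix (individual-specific environment). This decomposition is a commonly used formulation and is an assumption here, since the slide's own formula did not survive; the variance component values are hypothetical.

```python
def symmetric(upper_rows):
    """Expand upper-triangular rows into a full symmetric matrix."""
    n = len(upper_rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper_rows):
        for offset, value in enumerate(row):
            full[i][i + offset] = full[i + offset][i] = float(value)
    return full

# Order of individuals: P, M, S1, S2, S3 (values copied from the two matrices above)
expected = symmetric([[1, 0, .5, .5, .5], [1, .5, .5, .5], [1, .5, .5], [1, .5], [1]])
estimated = symmetric([[1, 0, .5, .5, .5], [1, .5, .5, .5], [1, .75, .75], [1, .75], [1]])

def covariance(var_polygenic, var_locus, var_env):
    """Phenotypic covariance matrix under the assumed additive decomposition."""
    n = len(expected)
    return [
        [
            expected[i][j] * var_polygenic
            + estimated[i][j] * var_locus
            + (var_env if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]

# Hypothetical variance components, for illustration only
omega = covariance(var_polygenic=0.4, var_locus=0.3, var_env=0.3)
for row in omega:
    print(row)
```

Sibs who share more alleles at the locus than expected from their relationship alone (0.75 vs. 0.5 here) receive a larger covariance whenever the locus-specific variance is greater than zero.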
78Variance components-based lod score
79Sample size requirements to detect linkage to a
QTL with a lod score of 3 and 80% power
80Pros and cons of variance-components-based
linkage analysis
+ no need to specify inheritance model
+ robust to allelic heterogeneity at a locus
+ modeling flexibility
+ computationally feasible even on large pedigrees
- generally assumes additive inheritance model
- modeling restrictions
- not always well-behaved statistically
(depending on phenotypic distribution and
ascertainment)
- generally less powerful than penetrance-model-based
linkage analysis under suitable model
81Choice of covariates
Covariates ought to be included in the likelihood
model if they are known to influence the
phenotype of interest and if their own genetic
regulation does not overlap the genetic
regulation of the target phenotype. Typical
examples include sex and age. In the analysis
of height, information on nutrition during
childhood should probably be included during
analysis. However, known growth hormone levels
probably should not be.
82Choice of covariates
83Choice of covariates
84Choice of covariates: special case of
treatment/medication
85Before treatment/medication of affected
individuals
unaffected
affected
86After (partially effective) treatment /
medication of affected individuals
apparent effect of covariate
unaffected
affected
87Choice of covariates: special case of
treatment/medication
- If medication is ineffective/partially effective,
including treatment as a covariate is worse than
ignoring it in the analysis.
- If medication is very effective, such that the
phenotypic mean of individuals after treatment is
equal to the phenotypic mean of the population as
a whole, then including medication as a covariate
has no effect.
- If medication is extremely effective, such that
the phenotypic mean of individuals after
treatment is better than the phenotypic mean of
the population as a whole, then including
medication as a covariate is better than ignoring
it, but still far from satisfying.
- Either censor individuals or, better, infer or
integrate over their phenotypes before treatment,
based on information on efficacy etc.
88Two-point vs. multi-point linkage analysis
- In linkage analysis, one always examines whether
or not the alleles at 2 loci tend to co-segregate
during meiosis.
- In two-point linkage analysis, chromosomal
inheritance is inferred from the observed trait
phenotypes on the one hand (locus 1) and from a
single (genotyped) marker locus on the other hand
(locus 2).
- In multi-point linkage analysis, chromosomal
inheritance is inferred from the observed trait
phenotypes on the one hand (locus 1) and from
multiple (genotyped) marker loci on the other
hand (locus 2).
89Pros and cons of multi-point linkage analysis
+ Genotypes at multiple markers contain at least
as much and generally more information to infer
chromosomal inheritance than genotypes at a
single marker, resulting in greater power to
detect linkage.
+ The number of independent tests in genome-wide
linkage analysis is somewhat reduced in
multi-point linkage analysis vs. two-point
linkage analysis.
- Multi-point linkage analysis requires knowledge
of the genetic marker map (marker order and
inter-marker recombination fractions). If this
information is incorrect, power can be reduced
and/or the false positive rate can be increased.
- Multi-point linkage analysis is more
susceptible to genotyping errors.
- Multi-point linkage analysis typically assumes
linkage equilibrium between markers. If this does
not hold, power can be reduced and/or the false
positive rate can be increased.
- Multi-point linkage analysis is computationally
more demanding than two-point linkage analysis.
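90Genetic map vs. physical map
[Figure: four markers m1, m2, m3, m4 with inter-marker recombination fractions θ12, θ23, θ34]
genetic map: positions x1, x2, x3, x4 (in cM)
physical map: positions y1, y2, y3, y4 (in Mb)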
91Genetic map distance vs. recombination fraction
Def. of recombination fraction: probability that
recombination takes place between 2 chromosomal
positions during meiosis. Recombination fractions
are not additive, i.e., for 3 loci and
recombination fractions θ12 and θ23, θ13 ≠ θ12 +
θ23.
Def. of genetic map distance (Morgan, M):
distance in which 1 recombination event is
expected to take place or, equivalently, average
distance between recombination events.
1 centi-Morgan (cM) is equal to 1/100
Morgan. Genetic map distances are additive, i.e.
for 3 loci and map distances x12 cM and x23 cM,
x13 = x12 + x23 cM.
Neither recombination fractions nor genetic map
distances are easily converted into physical map
distances.
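The contrast between additive map distances and non-additive recombination fractions can be illustrated with a map function. The sketch below uses Haldane's map function, which assumes no crossover interference; this choice is an assumption made for illustration and is not taken from the slides. The map distances are hypothetical.

```python
import math

def haldane_theta(map_distance_morgans):
    """Recombination fraction for a given map distance under Haldane's map function."""
    return 0.5 * (1 - math.exp(-2 * map_distance_morgans))

# Hypothetical map distances between loci 1-2 and 2-3, in Morgans (10 cM each)
x12, x23 = 0.10, 0.10

theta12 = haldane_theta(x12)
theta23 = haldane_theta(x23)
theta13 = haldane_theta(x12 + x23)  # map distances add up

print(theta12, theta23, theta13, theta12 + theta23)
```

Here θ13 is smaller than θ12 + θ23, because double recombinants between loci 1 and 3 go undetected. Under Haldane's assumptions, θ13 = θ12 + θ23 - 2 θ12 θ23.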
92Why a genome-wide linkage scan may fail
- The sample size is too small.
- The marker genotypes are not sufficiently
informative (low heterozygosity and/or large gaps
in marker map).
- There is no major gene.
- The chosen analytical approach is unsuitable.
- Bad luck!
93A fairytale of 2 traits
94Heritability estimates
trait A: 45-82%
trait B: 63-92%
95Quantitative trait A (sample 1)
large, randomly ascertained pedigrees
no. of phenotyped individuals: 268
trait heritability estimate: 0.55
96Quantitative trait B (sample 1)
large, randomly ascertained pedigrees
no. of phenotyped individuals: 324
trait heritability estimate: 0.88
97Quantitative trait A (sample 1)
98Quantitative trait A (samples 1--2)
99Quantitative trait A (samples 1--3)
100Quantitative trait A (samples 1--3 combined)
101Quantitative trait B (sample 1)
102Quantitative trait B (samples 1--2)
103Quantitative trait B (samples 1--3)
104Quantitative trait B (samples 1--4)
105Quantitative trait B (samples 1--5)
106Quantitative trait B (samples 1--6)
107Quantitative trait B (samples 1--7)
108Quantitative trait B (samples 1--8)
109Quantitative trait B (samples 1--9)
110quantitative trait A: lipoprotein A
(concentration in serum)
quantitative trait B: height (in adults)
111Heritability of adult height (additive
heritability, adjusted for sex and age)
study    sample   sample size   heritability estimate
TOPS     TOPS     2199          0.78
FLS      FLS      705           0.83
GAIT     GAIT     324           0.88
SAFHS    SAFHS    903           0.76
SAFDS    SAFDS    737           0.92
SHFS     AZ       643           0.80
SHFS     DK       675           0.81
SHFS     OK       647           0.79
Jiri     Jiri     616           0.63
total             7449
112 Polygenic or oligogenic ?
113Height (9 samples)