Title: Parametric linkage analysis and lod scores
1Parametric linkage analysis and lod scores
- Steve Horvath
- Depts. of Human Genetics and Biostatistics
- UCLA
2Contents
- the big picture: meiotic mapping techniques
- genetic distances and genetic maps
- map functions
- LOD (log of the odds) score analysis
- 2-point analysis
- testing for linkage between a marker and an
affectation status locus
- example: rare, fully penetrant, dominant
Mendelian disease
- more general disease models
- parameters in parametric linkage analysis
- multipoint analysis algorithms for LOD scores
- significance levels, thresholds and false
positives
3The big picture: locating (mapping) disease
genes
4Meiotic mapping identifies DNA segments
that contain disease genes
[Figure: reverse genetics, going from trait (traits 1, 2, 3) to DNA]
- Mapping is part of the positional cloning
strategy.
- It works well for Mendelian diseases,
- which correspond to rare, highly penetrant disease
alleles
5Different ways of expressing the goal of genomics
- goal: find stretches of DNA that are risk factors
for a disease.
- known as reverse genetics if you start with the
phenotype (e.g. affectation status)
- a.k.a. positional cloning (Collins FS)
- 3-step procedure (adapted)
- first, meiotic mapping (linkage, linkage
disequilibrium)
- second, physical mapping (includes sequencing)
- third, find the mutation and verify its functional role
6Different kinds of meiotic mapping methods
- parametric (better: model-based) lod score
analysis
- single point
- multipoint
- non-parametric (better: model-free) linkage
analysis
- allele sharing methods
- key concept: identity by descent
- confusing factoid: non-parametric methods are
sometimes equivalent to parametric methods (Knapp
M, 1993?)
- association studies, linkage disequilibrium
mapping
- family-based methods (TDT, FBAT)
- population-based methods (chi-square test,
log-linear model)
7What do meiotic mapping methods have in common?
- based on meiosis
- made possible through the violation of Mendel's
law of independent assortment
- crossing over effects, recombination, ....
- recombination fraction θ
- requires genetic markers, and sometimes the
distances between them (genetic map)
- usually test the hypothesis of no linkage, H0: θ = 1/2
- but sometimes test for no linkage disequilibrium
8What is parametric linkage analysis?
- A meiotic mapping technique based on
constructing a disease gene transmission model to
explain the inheritance of a disease in
pedigrees.
- Meaning will become clear....
9Genetic markers
- desirable properties of genetic markers
- locus-specific
- polymorphic in the studied population
- many heterozygotes
- easily genotyped
- quality measures for markers
- heterozygosity: homozygotes are uninformative!
- or Polymorphism Information Content (PIC)
- probability that the parent is heterozygous x
probability that the offspring is informative
10Important co-dominant genetic markers
- microsatellites
- variations in the number of tandem repeats
- high level of polymorphism
- even distribution across the genome
- 2nd generation map
- SNPs
- single nucleotide polymorphisms
- bi-allelic codominant marker
- heterozygosity is at most 50 percent
- 3rd generation map
11Genetic distances and genetic maps
Will be very relevant for multipoint linkage
studies.
12The recombination fraction is a measure of
distance between 2 loci
- recombination fraction θ = the probability that a
recombinant gamete is transmitted
- If two loci are on different chromosomes, they
will segregate independently
- recombination fraction θ = .5
- if two loci are right next to each other, they
will segregate together during meiosis
- recombination fraction θ = 0
- terminology
- θ < .5: the loci are linked
- θ = .5: the loci are far apart (they are not
linked)
13Genetic distance (unit is Morgan) = expected no.
of cross-over pts per gamete
- notation: let a and b be 2 points in the genome.
- Nab = number of chiasmata between them
- chiasmata = crossing-over points
- Definition: the genetic (map) distance is
d = E(Nab)/2
- Why the factor of 2? We want the no. of crossovers per
gamete, and each chiasma involves only 2 of the 4 chromatids.
- Example: if on average 49 crossovers per cell
in meiosis
- then total genetic map distance = 49/2 = 24.5
Morgans
- 1 Morgan = 100 centimorgan
14There is a relationship between crossing over and
recombination fraction
- Mather's formula: θ = .5 P(Nab > 0)
- for small distances, d is approximately equal to θ,
- since in this case E(Nab) ≈ P(Nab > 0)
- P(Nab > 0) is related to d = E(Nab)/2
- different probability models for Nab lead to
different relationships between θ and d.
- each sensible relationship between θ and d is
called a map function
- Great reference: Lange K, Mathematical and
Statistical Methods for Genetic Analysis,
Springer
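As a worked illustration of how a probability model for Nab yields a map function, suppose (an assumption made here only for illustration) that Nab is Poisson with mean 2d, i.e. crossovers occur without interference. Mather's formula then gives:

```latex
\begin{aligned}
P(N_{ab} > 0) &= 1 - e^{-2d} \\
\theta &= \tfrac{1}{2}\,P(N_{ab} > 0) = \tfrac{1}{2}\left(1 - e^{-2d}\right) \\
d &= -\tfrac{1}{2}\ln(1 - 2\theta)
\end{aligned}
```

For small d, the first-order expansion of the exponential gives θ ≈ d, in line with the small-distance approximation above.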
15The mathematical relationship between
recombination fraction and genetic distance is
called a mapping function
- Haldane's mapping function
- d = -.5 ln(1 - 2θ)
- the distance d is measured in Morgans (multiply by 100 for centimorgan)
- exact if crossovers occur at random (no
interference)
- Kosambi's mapping function
- d = .25 ln[(1 + 2θ)/(1 - 2θ)]
- again the distance is measured in Morgans
- suitable if there is (crossover) interference
- one cross-over prevents another from taking place
nearby
- widely used
16- Note for both mapping functions
- if θ = .5, d = infinite Morgans (infinite
distance)
- if θ = 0, d = 0 M (0 distance)
- if θ = .27, Haldane: d = .39 Morgans = 39 cM; Kosambi: d = .30
Morgans = 30 cM
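The two map functions can be checked numerically. The following is a minimal sketch, not taken from the slides: the function names are illustrative, distances are returned in Morgans, and the inverse functions are the standard algebraic inversions of the two formulas.

```python
import math


def haldane_distance(theta: float) -> float:
    """Map distance in Morgans under Haldane's function (no interference)."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * theta)


def kosambi_distance(theta: float) -> float:
    """Map distance in Morgans under Kosambi's function (with interference)."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))


def haldane_theta(d: float) -> float:
    """Recombination fraction for a Haldane distance d (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def kosambi_theta(d: float) -> float:
    """Recombination fraction for a Kosambi distance d (Morgans)."""
    return 0.5 * math.tanh(2.0 * d)


if __name__ == "__main__":
    for theta in (0.01, 0.10, 0.27, 0.40):
        h = 100.0 * haldane_distance(theta)  # convert Morgans to cM
        k = 100.0 * kosambi_distance(theta)
        print(f"theta={theta:.2f}  Haldane={h:6.1f} cM  Kosambi={k:6.1f} cM")
```

For θ = .27 this gives roughly 39 cM (Haldane) and 30 cM (Kosambi); for small θ the two functions nearly coincide.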
17Men are genetically shorter than women
- Total male map length = 2851 cM
- Total female map length = 4296 cM (excluding the X)
- Thus, over the 3000 Mb (megabases) autosomal genome
- 1 male cM averages 1.05 Mb
- 1 female cM averages 0.70 Mb (sex-averaged: 0.88 Mb)
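The Mb-per-cM figures follow from dividing the autosomal genome size by each map length. A small sketch of that arithmetic, using only the numbers on this slide (the variable names are illustrative):

```python
AUTOSOMAL_MB = 3000.0   # approximate autosomal genome size in megabases
MALE_MAP_CM = 2851.0    # total male map length in cM
FEMALE_MAP_CM = 4296.0  # total female map length in cM (excluding the X)

male_mb_per_cm = AUTOSOMAL_MB / MALE_MAP_CM
female_mb_per_cm = AUTOSOMAL_MB / FEMALE_MAP_CM
# Plain average of the two sex-specific values.
sex_averaged_mb_per_cm = (male_mb_per_cm + female_mb_per_cm) / 2.0

if __name__ == "__main__":
    print(f"male:         {male_mb_per_cm:.2f} Mb per cM")
    print(f"female:       {female_mb_per_cm:.2f} Mb per cM")
    print(f"sex-averaged: {sex_averaged_mb_per_cm:.2f} Mb per cM")
```

The division gives about 1.05 Mb per male cM and about 0.70 Mb per female cM; a value of 0.88 Mb is the average of these two.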
18Meiotic versus physical maps
- meiotic maps measure distances in genetic
distances, i.e. centimorgan
- pretty coarse and often inaccurate
- problem 1: which marker order?
- problem 2: which mapping function?
- physical maps measure distances in base pairs
- extremely high resolution: allows you to find the
actual mutation
- Connection between the 2 maps
- rule of thumb: 1 cM equals 1 million base pairs
- but this thumb is very crooked!!!
19Computing the lod score
20The likelihood
- likelihood = probability of the data given the
parameters
- likelihoods are useful for estimation and for
testing
- example: phase-known, fully informative case
- observed data: R = no. of recombinants, NR = no. of
non-recombinants
- parameter: the recombination fraction
θ = Pr(recombination)
- likelihood is proportional to θ^R (1 - θ)^NR
- maximum likelihood estimate: the value of θ that maximizes
the likelihood, here R/(R + NR)
- use the log of the likelihood for mathematical
convenience
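The phase-known likelihood and its maximizer can be sketched in a few lines. This is an illustrative sketch only: the function names and the example counts (2 recombinants, 8 non-recombinants) are made up, and the estimate is capped at 1/2 because a recombination fraction cannot exceed 1/2.

```python
import math


def log_likelihood(theta: float, recombinants: int, nonrecombinants: int) -> float:
    """Log of theta^R * (1 - theta)^NR for phase-known, fully informative meioses."""

    def term(count: int, prob: float) -> float:
        # A zero count contributes nothing, even when prob is 0.
        if count == 0:
            return 0.0
        return count * math.log(prob) if prob > 0.0 else -math.inf

    return term(recombinants, theta) + term(nonrecombinants, 1.0 - theta)


def mle_theta(recombinants: int, nonrecombinants: int) -> float:
    """Maximum likelihood estimate of theta, restricted to [0, 1/2]."""
    total = recombinants + nonrecombinants
    if total == 0:
        raise ValueError("need at least one informative meiosis")
    return min(recombinants / total, 0.5)


if __name__ == "__main__":
    R, NR = 2, 8  # hypothetical counts
    grid = [i / 100 for i in range(0, 51)]
    best_on_grid = max(grid, key=lambda t: log_likelihood(t, R, NR))
    print("closed-form estimate:", mle_theta(R, NR))
    print("grid-search estimate:", best_on_grid)
```

The grid search is included only to show that the closed-form estimate and a brute-force maximization of the log-likelihood agree.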
21Advantages of max. likelihood estimation
- advantages
- asymptotically most efficient,
- high precision
- asymptotically consistent
- it will converge closer and closer to the true
value
- asymptotically unbiased
- corresponding likelihood ratio test enjoys
similar optimality criteria
22How to compute lod scores?
Lod scores are computed for each pedigree (i)
as Z_i(θ) = log10[ L_i(θ) / L_i(θ = 1/2) ].
For a given value of θ, pedigree-specific
lod scores are summed across the F families
to yield an overall lod score: Z(θ) = Z_1(θ) + ... + Z_F(θ)
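A minimal sketch of this summation for the simplest situation, assuming every family contributes phase-known, fully informative meioses so that its likelihood is proportional to θ^R (1 - θ)^NR. The function names and the family counts are hypothetical.

```python
import math


def family_lod(theta: float, recombinants: int, nonrecombinants: int) -> float:
    """log10 of L(theta) / L(1/2) for one family with phase-known meioses."""

    def term(count: int, prob: float) -> float:
        if count == 0:
            return 0.0
        return count * math.log10(prob) if prob > 0.0 else -math.inf

    n = recombinants + nonrecombinants
    log_l_theta = term(recombinants, theta) + term(nonrecombinants, 1.0 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null


def total_lod(theta: float, families) -> float:
    """Sum the family-specific lod scores at a common value of theta."""
    return sum(family_lod(theta, r, nr) for r, nr in families)


if __name__ == "__main__":
    # Hypothetical (recombinants, non-recombinants) counts for three families.
    families = [(1, 9), (0, 5), (2, 6)]
    for theta in (0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
        print(f"theta={theta:.2f}  lod={total_lod(theta, families):6.2f}")
```

Because the families are assumed independent, their likelihoods multiply, which is why the log10 likelihood ratios add.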
23Example lod score calculation
[Pedigree drawing]
Message: disease status is not required....
242 point parametric linkage analysis
252 point parametric linkage analysis
- Setting
- genotype of 1 marker locus is known for family
members
- the genotypes of the other locus (disease
susceptibility locus) are unknown
- but the disease locus phenotype (affectation
status) is known
- GOAL
- test whether the disease locus and marker are
linked
- Q: Why is it important?
- A: If they are linked, the disease locus must be
close to the marker, i.e. we have localized the
disease gene.
26Test for linkage is carried out in 3 steps
Step 1: use the disease status to infer the
underlying disease locus genotypes
Step 2: count the number of recombinations and
non-recombinations for the different possible
paternal phases
Step 3: compute the lod score and
check whether it is bigger than 3.0
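Steps 2 and 3 can be sketched for the case where the paternal phase is unknown. This sketch assumes the two possible phases are equally likely a priori, so the likelihood is the average over both phases; the counts are whatever is observed under one arbitrarily chosen phase (under the other phase the roles of recombinant and non-recombinant swap). Names and example counts are hypothetical.

```python
import math


def phase_unknown_lod(theta: float, count_a: int, count_b: int) -> float:
    """Lod score when the parental phase is unknown.

    count_a: children who are recombinant under phase 1 (non-recombinant under phase 2)
    count_b: children who are non-recombinant under phase 1 (recombinant under phase 2)
    """
    n = count_a + count_b
    likelihood = 0.5 * (
        theta ** count_a * (1.0 - theta) ** count_b
        + theta ** count_b * (1.0 - theta) ** count_a
    )
    null_likelihood = 0.5 ** n  # theta = 1/2: both phases give the same value
    if likelihood == 0.0:
        return -math.inf
    return math.log10(likelihood / null_likelihood)


def is_significant(lod: float, threshold: float = 3.0) -> bool:
    """Step 3: compare the lod score with the classical threshold of 3.0."""
    return lod > threshold


if __name__ == "__main__":
    grid = [i / 100 for i in range(0, 51)]
    best = max(grid, key=lambda t: phase_unknown_lod(t, 1, 9))
    print("theta with the highest lod on the grid:", best)
    print("significant:", is_significant(phase_unknown_lod(best, 1, 9)))
```

With ten children and no apparent recombinants, the phase-unknown lod at θ = 0 is 9 log10(2), about 2.7, one log10(2) less than in the phase-known case.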
27DATA for a single pedigree
rare, fully penetrant, dominant disease
Grandpa: unaffected, marker genotype 22; Grandma: affected, marker genotype 11;
father: affected
28Step 1-3
- STEP 1
- we assume that the disease locus carries 2
alleles
- since the disease is dominant and fully penetrant,
the genotypes of the unaffecteds must equal dd
- the genotype of the grandma is Dd or DD. Since
the disease is rare, it is probably Dd.
- thus we get the same pedigree as described
earlier
- STEPs 2-3 were already carried out earlier.
29Parameters in parametric linkage analysis
30Glitch for non-Mendelian diseases
- the relation between disease locus genotypes and
affectation status is in general very complex and
can no longer be solved by inspection
- need powerful statistical and computational
methods
- start with the likelihood (easy to write down)
- compute the likelihood (hard)
31Most general form of the likelihood of pedigree
data
- the index j runs over all founders (their genotype
priors require specifying allele frequencies)
- product (k,l,m) is taken over all
parent-offspring triples.
- transmission probabilities depend on θ
- for multiple markers (multipoint analysis) need
to specify
- a mapping function, e.g., Kosambi
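A common way to write this likelihood, following the form used in Lange's book, is sketched below. The symbols are assumptions and not taken from the slide: X_i is the phenotype and G_i the genotype of person i, Pen is the penetrance, Prior is the founder genotype probability, and Tran is the transmission probability for a parent-offspring triple (k, l, m).

```latex
L \;=\; \sum_{G_1} \cdots \sum_{G_n}
  \prod_{i=1}^{n} \mathrm{Pen}(X_i \mid G_i)
  \prod_{j \,\in\, \text{founders}} \mathrm{Prior}(G_j)
  \prod_{(k,l,m)} \mathrm{Tran}(G_m \mid G_k, G_l)
```

The outer sums run over all genotype configurations of the n pedigree members, which is what makes the likelihood easy to write down but hard to compute.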
32Marker parameters
- notation: marker alleles denoted here by 1, 2,
...
- relation between marker genotype and phenotype
- usually known (example: ABO blood group)
- SNPs and microsatellites are codominant: the relation
is trivial
- allele frequencies p1, p2, ...
- if parents are unavailable, the results may
depend critically on getting them right. The same
holds for homozygosity mapping.
- vary between different populations
- but can be estimated from the pedigree data
- genetic marker map for multiple markers
- marker order
- genetic distance
- increasingly accurate because of DNA sequencing
33Disease locus parameters
- notation: often 2 alleles, D (bad) and d (normal)
- allele frequencies pD and pd
- penetrances = P(affected | genotype)
- fDD = P(affected | genotype DD)
- fDd = P(affected | genotype Dd)
- fdd = P(affected | genotype dd)
- liability classes
- fancy terminology for letting penetrances vary between
individuals
- example: different penetrances for men and
women,
- or age dependence: young versus old
34The biology is modeled through penetrance values
- fully penetrant, dominant disease, no
phenocopies
- fDD = fDd = 1, fdd = 0
- fully penetrant, recessive disease, no
phenocopies
- fDD = 1, fDd = fdd = 0
- no effect
- fDD = fDd = fdd
- incomplete penetrance: fDD < 1
- definition: phenocopies are affecteds who do not
carry the disease allele
- phenocopies are present if fdd > 0
- for the experts: imprinting is modeled by using 4
penetrances and keeping track of maternally and
paternally transmitted alleles
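The penetrance models can be written down as small tables and combined with genotype frequencies. The sketch below assumes Hardy-Weinberg proportions at the disease locus (an assumption not stated on the slide); the function names and the allele frequency of 0.001 are illustrative.

```python
def genotype_frequencies(p_disease: float) -> dict:
    """Hardy-Weinberg genotype frequencies for disease allele frequency pD."""
    q = 1.0 - p_disease
    return {"DD": p_disease ** 2, "Dd": 2.0 * p_disease * q, "dd": q ** 2}


def prevalence(p_disease: float, penetrance: dict) -> float:
    """P(affected) = sum over genotypes of frequency times penetrance."""
    freqs = genotype_frequencies(p_disease)
    return sum(freqs[g] * penetrance[g] for g in freqs)


def genotype_given_affected(p_disease: float, penetrance: dict) -> dict:
    """Bayes' rule: P(genotype | affected)."""
    freqs = genotype_frequencies(p_disease)
    k = prevalence(p_disease, penetrance)
    return {g: freqs[g] * penetrance[g] / k for g in freqs}


def has_phenocopies(penetrance: dict) -> bool:
    """Phenocopies are present if fdd > 0."""
    return penetrance["dd"] > 0.0


# Penetrance tables (fDD, fDd, fdd) for the fully penetrant models.
DOMINANT = {"DD": 1.0, "Dd": 1.0, "dd": 0.0}
RECESSIVE = {"DD": 1.0, "Dd": 0.0, "dd": 0.0}

if __name__ == "__main__":
    posterior = genotype_given_affected(0.001, DOMINANT)
    print("P(genotype | affected), rare dominant disease:", posterior)
    print("prevalence, rare recessive disease:", prevalence(0.001, RECESSIVE))
```

For a rare dominant disease almost all affecteds are heterozygous, which is the reasoning behind treating an affected person as "probably Dd".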
352-point versus multipoint linkage analysis
36Two point mapping
- computerized lod score analysis is the best way to
analyze complex pedigrees for linkage with
Mendelian traits
- use computer software, e.g., Mendel
- the result of a linkage analysis is a table of
lod scores at various recombination fractions
- the result can be plotted to give curves,
- regions with lod > 3 are linked and those with
lod < -2 are excluded
- the curve will peak at the most likely
recombination fraction
37Output of a 2 point linkage analysis
[Figure: lod score curve with regions marked "significant" and "excluded"]
- Equivalently, consider the table
θ     0.01   0.10   0.20   0.30   0.35   0.40   0.45   0.50
lod   -5.0   -2.0    1.0    3.3    4.0    3.0    1.0    0.0
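Reading such a table can be sketched as follows. The values are the ones in the table above; the thresholds are assumptions for illustration (lod above 3 taken as evidence for linkage, lod below the conventional value of -2 taken as exclusion, both as strict inequalities), and the function name is made up.

```python
THETAS = [0.01, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50]
LODS = [-5.0, -2.0, 1.0, 3.3, 4.0, 3.0, 1.0, 0.0]


def summarize(thetas, lods, linked_above=3.0, excluded_below=-2.0):
    """Return the peak of the lod table plus the linked and excluded theta values."""
    peak_index = max(range(len(lods)), key=lods.__getitem__)
    return {
        "peak": (thetas[peak_index], lods[peak_index]),
        "linked": [t for t, z in zip(thetas, lods) if z > linked_above],
        "excluded": [t for t, z in zip(thetas, lods) if z < excluded_below],
    }


if __name__ == "__main__":
    summary = summarize(THETAS, LODS)
    print("most likely recombination fraction and its lod:", summary["peak"])
    print("theta values with lod above 3:", summary["linked"])
    print("theta values with lod below -2:", summary["excluded"])
```

On this table the lod score peaks at θ = 0.35 with a value of 4.0, and very tight linkage (θ = 0.01) is excluded.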
38Multipoint mapping is more efficient than two
point mapping
- idea: analyze data for more than 2 loci
simultaneously
- helps overcome limited informativeness of
markers
- especially relevant for SNPs
- peak heights depend crucially on the precise
distances between markers and on the mapping
function, which is problematic
- highest peak marks the most likely location
- powerful method for scanning the genome in 20-Mb
segments
39Standard lod score analysis is not without
problems
- genotyping errors and misdiagnosis cause loss of power
- they lead to spurious recombinants, which inflates the
length of the genetic map
- multi-locus maps can detect such errors by
checking for double recombinants
- locus heterogeneity is always a pitfall
- mutations in unlinked loci may produce the same
clinical phenotype
- use Genehunter or Homog to test for homogeneity
- computational difficulties limit the pedigrees
that can be analyzed (na
not really....)
40Comparing different multipoint linkage analysis
algorithms
41Limitations of the different methods
Slide from webpage http://watson.hgen.pitt.edu/docs/simwalk2.html
42Computation times of the algorithms.
General-Pedigree Linkage Analysis Packages
43Critical values for linkage tests
44Distinction between pointwise (nominal) and
genome-wide significance
- pointwise p-value = probability of exceeding the
observed value at a given point, under H0: θ = 1/2
- genome-wide p-value = probability that the observed value
will be exceeded anywhere in the genome
- reality check about p-values
- if the p-value is below the chosen significance level, the finding is significant
- the smaller the p-value, the higher the
statistical significance
- genome-wide p-value ≥ pointwise p-value
45Lod score thresholds should ensure a .05
genome-wide false positive rate
- genome-wide false positive rate alpha = chance of a
false positive result occurring anywhere during a
whole genome scan
- for single point, classically want lod > 3.0
- multipoint threshold for a Mendelian disease: 3.3
- Lander and Schork 1994
- multipoint threshold for a complex disease
- 3.3-4.0 (depends on the study design, Lander and
Kruglyak 1995)
- pointwise p-value for significant linkage:
5 × 10^(-5)
46How to relate the pointwise (αP) to the
genome-wide false positive rate (αG).
- conservative Bonferroni correction
- αP = αG/(no. of potential pointwise tests)
- Example: no. of potential pointwise tests = no. of
potential SNPs = 1 million, αG = .05, so αP =
5 × 10^(-8)
- ignores dependencies (linkage) between markers
- Lander and Kruglyak 1995 found the asymptotic
relation
- αG(T) ≈ [C + 9.2 ρ G T] αP(T)
- T = threshold lod score
- C = number of chromosomes = 23
- ρ = crossover rate, depends on the relationship being
studied, e.g., sibs
- G = length of the genome in Morgans = 33
- for sibpairs use 3.6 for IBD testing and 4.0 for
IBS testing
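Both corrections can be evaluated numerically. The sketch below rests on two assumptions that are not on the slide: the pointwise p-value of a lod score T is taken from the usual one-sided asymptotic approximation, P(Z > sqrt(2 ln(10) T)) for a standard normal Z, and the crossover rate ρ = 1 is used purely as an illustrative value. The function names are made up.

```python
import math


def bonferroni_pointwise_alpha(alpha_genome: float, n_tests: int) -> float:
    """Conservative pointwise level: genome-wide level divided by the number of tests."""
    return alpha_genome / n_tests


def pointwise_p_from_lod(lod: float) -> float:
    """One-sided asymptotic pointwise p-value for a lod score (assumed approximation)."""
    z = math.sqrt(2.0 * math.log(10.0) * lod)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal


def genome_wide_alpha(lod: float, rho: float,
                      chromosomes: int = 23, genome_morgans: float = 33.0) -> float:
    """Lander-Kruglyak relation: alphaG(T) ~ [C + 9.2 * rho * G * T] * alphaP(T)."""
    return (chromosomes + 9.2 * rho * genome_morgans * lod) * pointwise_p_from_lod(lod)


if __name__ == "__main__":
    print("Bonferroni, 1 million tests:", bonferroni_pointwise_alpha(0.05, 1_000_000))
    for threshold in (3.0, 3.3, 3.6, 4.0):
        print(f"T={threshold}: pointwise p={pointwise_p_from_lod(threshold):.2e}, "
              f"genome-wide (rho=1)={genome_wide_alpha(threshold, rho=1.0):.3f}")
```

Under these assumptions a threshold of 3.3 corresponds to a pointwise p-value of about 5 × 10^(-5) and a genome-wide rate of about .05.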
47Linkage findings are controversial because of the high
false positive rate.
- The smart money knows
- want to see a lod score > 4 (or even 5)
- meiotic mapping techniques fail at detecting
complex disease genes
- if the disease is complex, it is a false
positive.
- if the effect is real, 2 point linkage analysis
performs pretty well
- How to avoid arguments over a finding?
- replicate the finding in a different sample
- find the mutation