Title: BMI 731 Winter 2005 Chapter 1: SNP Analysis
1. BMI 731 - Winter 2005, Chapter 1: SNP Analysis
- Catalin Barbacioru
- Department of Biomedical Informatics
- Ohio State University
2. Biological Background
- Cells are the fundamental working units of every living system.
- The nucleus contains a large DNA (deoxyribonucleic acid) molecule, which carries the genetic instructions.
- A DNA molecule consists of two strands that wrap around each other to resemble a twisted ladder.
- Each strand is a chain of nucleotides, each composed of one sugar molecule, one phosphate molecule, and a base.
- Four different bases are present in DNA: adenine (A), thymine (T), cytosine (C), and guanine (G).
- The particular order of the bases arranged along the sugar-phosphate backbone is called the DNA sequence.
4. Biological Background
- The two strands of the DNA molecule are held together at their bases by weak hydrogen bonds.
- The four bases pair in a set manner: adenine (A) pairs with thymine (T), while cytosine (C) pairs with guanine (G). These pairs of bases are known as base pairs (bp).
- The DNA is organized into separate long segments called chromosomes. The number of chromosomes differs across organisms (46 for humans, i.e. 23 pairs, with each parent contributing 23 chromosomes).
5. Glossary
- Allele: Alternative form of a gene. One of the different forms of a gene that can exist at a single locus.
- Genotype: The specific allelic composition of a cell, either of the entire cell or, more commonly, for a certain gene or a set of genes.
- Haplotype: A set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination).
6. Glossary
- Locus: A point in the genome, identified by a marker, which can be mapped by some means.
- Marker: Also known as a genetic marker, a segment of DNA with an identifiable physical location on a chromosome whose inheritance can be followed. A marker can be a gene, or it can be some section of DNA with no known function.
- Mutation: A permanent structural alteration in DNA.
7. Glossary
- Hardy-Weinberg equilibrium: The stable frequency distribution of genotypes AA, Aa, and aa, in the proportions $p^2$, $2pq$, and $q^2$ respectively (where $p$ and $q$ are the frequencies of the alleles A and a), that is a consequence of random mating in the absence of mutation, migration, natural selection, or random drift.
- Linkage disequilibrium: When the observed frequencies of haplotypes in a population do not agree with the haplotype frequencies predicted by multiplying together the frequencies of the individual genetic markers in each haplotype.
8. A Little Population Genetics
- Population genetics (and evolutionary genetics) deals with groups of organisms and families, usually natural populations.
- We can discern two strands of thought in the area. One is the study of very large, idealized groups or populations, where models can be deterministic.
- The other deals with smaller populations, where chance plays a larger role (so-called genetic drift).
9. Genotype and allele frequencies
- One question of crucial interest is this: how common are the different alleles at a given locus in a given population?
The percentages are our best estimate of the probability that an individual will carry that genotype in the population of London, Oxford and Cambridge. The observed heterozygosity is 49.6%.
10. There is another population described in this table: the population of gametes that gave rise to the individuals tested.
The percentages here are our best estimate of the probability that a sperm or egg taken from that population will carry that particular allele. If the frequency of the commonest allele at a particular locus is less than 99%, we call this a polymorphic locus or polymorphism.
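The two views of the sample (individuals and gametes) can be sketched in a few lines of Python. This is only an illustration: the function name `allele_summary` and the genotype counts are hypothetical and are not the data from the table referred to above.

```python
def allele_summary(genotype_counts):
    """Summarize one locus from genotype counts.

    genotype_counts maps an allele pair such as ("A", "a") to the number
    of individuals carrying that genotype.
    """
    n = sum(genotype_counts.values())  # number of individuals
    allele_counts = {}
    heterozygotes = 0
    for (first, second), count in genotype_counts.items():
        # Each individual contributes two gametes, one per allele.
        allele_counts[first] = allele_counts.get(first, 0) + count
        allele_counts[second] = allele_counts.get(second, 0) + count
        if first != second:
            heterozygotes += count
    allele_freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    observed_het = heterozygotes / n
    # Polymorphic if the commonest allele has frequency below 99%.
    polymorphic = max(allele_freqs.values()) < 0.99
    return allele_freqs, observed_het, polymorphic


# Hypothetical counts, chosen only for illustration.
counts = {("A", "A"): 30, ("A", "a"): 50, ("a", "a"): 20}
freqs, het, poly = allele_summary(counts)
print(freqs, het, poly)
```

With these made-up counts the allele frequencies are 0.55 and 0.45, and half of the individuals are heterozygous.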
11. Hardy-Weinberg equilibrium
- Hardy-Weinberg equilibrium describes the relationship between the gametic (allele) frequencies and the resulting genotypic frequencies. It holds if the following properties are true for the given locus:
- 1. Random mating (panmixia): the choice of a mate is not influenced by his/her genotype at the locus.
- 2. The locus does not affect the chance of mating at all, either by altering fertility or by decreasing survival to reproductive age.
12.
- If these properties hold, then the probability that two gametes will meet and give rise to a new genotype is simply the product of the allele frequencies (à la binomial):
- $P(AA) = P(A) \times P(A) = p_A^2$
- $P(aa) = P(a) \times P(a) = p_a^2$
- $P(Aa) = 1 - P(AA) - P(aa) = 2 \times P(A) \times P(a) = 2 p_A p_a$
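As a small illustration of these product rules, the following sketch computes the three genotype probabilities from a single allele frequency. The function name `hwe_genotype_probs` is an assumption made for this example.

```python
def hwe_genotype_probs(p_A):
    """Genotype probabilities under Hardy-Weinberg equilibrium."""
    p_a = 1.0 - p_A
    return {
        "AA": p_A * p_A,      # two A gametes meet
        "Aa": 2 * p_A * p_a,  # A from one parent, a from the other (two orders)
        "aa": p_a * p_a,      # two a gametes meet
    }


probs = hwe_genotype_probs(0.6)
print(probs)
```

For an allele frequency of 0.6 the three probabilities are approximately 0.36, 0.48 and 0.16, which sum to 1.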
13. Tests for HWE
- For a two-allele case, the disequilibrium coefficient is
- $D = P_{AA} - p_A^2$
- where $P_{AA} = P(AA)$ is the probability of the AA genotype and $p_A = P(A)$ is the probability of allele A.
- If $n_{AA}$, $n_{Aa}$, $n_{aa}$ are the numbers of individuals with genotypes AA, Aa and aa respectively, from a total of $n$ individuals, then estimators of the above probabilities are
- $P_{AA} = n_{AA}/n$, $P_{Aa} = n_{Aa}/n$, $P_{aa} = n_{aa}/n$, where $n = n_{AA} + n_{Aa} + n_{aa}$
- $p_A = (2n_{AA} + n_{Aa})/2n$, $p_a = (2n_{aa} + n_{Aa})/2n$, and $p_a + p_A = 1$
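The estimators above translate directly into code. The sketch below is a minimal illustration; the function name `hwe_disequilibrium` and the counts are hypothetical.

```python
def hwe_disequilibrium(n_AA, n_Aa, n_aa):
    """Estimate the allele frequency p_A and the coefficient D = P_AA - p_A^2."""
    n = n_AA + n_Aa + n_aa
    P_AA = n_AA / n                     # observed AA genotype frequency
    p_A = (2 * n_AA + n_Aa) / (2 * n)   # each AA carries two A alleles, each Aa one
    D = P_AA - p_A ** 2
    return p_A, D


p_A, D = hwe_disequilibrium(30, 50, 20)  # made-up counts
print(p_A, D)
```

A positive D indicates an excess of homozygotes relative to HWE, and a negative D an excess of heterozygotes.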
15. Chi-square test for HWE
- The goodness-of-fit chi-squared statistic is
- $X_A^2 = \sum_{\text{genotypes}} (\text{Obs} - \text{Exp})^2 / \text{Exp}$
- $= (nD)^2 / (n p_A^2) + (-2nD)^2 / (2 n p_A p_a) + (nD)^2 / (n p_a^2)$
- $= n D^2 / (p_A^2 (1 - p_A)^2)$
- and the test rejects the null hypothesis $H_0$ of HWE at the 5% level if
- $X_A^2 > 3.84$
- The usual problem with this test is that it is sensitive to small expected values. An alternative version (Yates) applies a continuity correction:
- $X_A^2 = \sum_{\text{genotypes}} (|\text{Obs} - \text{Exp}| - 0.5)^2 / \text{Exp}$
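A minimal sketch of the goodness-of-fit test follows. The function name `hwe_chi_square` and the example counts are hypothetical, and clamping the corrected difference at zero is a choice made here so that the Yates term never grows when the uncorrected difference is smaller than 0.5.

```python
def hwe_chi_square(n_AA, n_Aa, n_aa, yates=False):
    """Chi-square statistic for HWE; requires both alleles to be present."""
    n = n_AA + n_Aa + n_aa
    p_A = (2 * n_AA + n_Aa) / (2 * n)
    p_a = 1.0 - p_A
    observed = [n_AA, n_Aa, n_aa]
    expected = [n * p_A ** 2, 2 * n * p_A * p_a, n * p_a ** 2]
    correction = 0.5 if yates else 0.0
    return sum(
        max(abs(obs - exp) - correction, 0.0) ** 2 / exp
        for obs, exp in zip(observed, expected)
    )


x2 = hwe_chi_square(50, 20, 30)  # made-up counts with too few heterozygotes
print(x2, "reject HWE" if x2 > 3.84 else "do not reject HWE")
```

The uncorrected statistic agrees with the closed form $n D^2 / (p_A^2 (1 - p_A)^2)$, which gives a quick check on any implementation.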
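16. Fisher (exact) test for HWE
- Under the HWE hypothesis, the probability of the observed set of genotypic counts $n_{AA}$, $n_{Aa}$ and $n_{aa}$ in a sample of size $n$ is multinomial,
$\Pr(n_{AA}, n_{Aa}, n_{aa}) = \frac{n!}{n_{AA}!\, n_{Aa}!\, n_{aa}!} (p_A^2)^{n_{AA}} (2 p_A p_a)^{n_{Aa}} (p_a^2)^{n_{aa}}$
whereas the allele counts $n_A = 2n_{AA} + n_{Aa}$ and $n_a = 2n_{aa} + n_{Aa}$ are binomially distributed if HWE holds:
$\Pr(n_A, n_a) = \frac{(2n)!}{n_A!\, n_a!} p_A^{n_A} p_a^{n_a}$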
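17. Fisher (exact) test for HWE
- Putting these together, the probability of the observed genotypic counts, assuming HWE and conditional on the observed allele counts, is
$\Pr(n_{AA}, n_{Aa}, n_{aa} \mid n_A, n_a) = \frac{n!\, n_A!\, n_a!\, 2^{n_{Aa}}}{n_{AA}!\, n_{Aa}!\, n_{aa}!\, (2n)!}$
which can be expressed in terms of the number of A alleles $n_A$ and the number of heterozygotes $n_{Aa}$. We reject the HWE hypothesis if the total conditional probability of the observed sample and of all samples that are no more probable is less than the significance level (type I error rate $\alpha$), usually 0.05.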
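The exact test can be sketched by enumerating every heterozygote count that is compatible with the observed allele counts. The conditional probability used here is the standard ratio of the multinomial genotype probability to the binomial allele probability, $n!\, n_A!\, n_a!\, 2^{n_{Aa}} / (n_{AA}!\, n_{Aa}!\, n_{aa}!\, (2n)!)$. The function names are hypothetical, and exact fractions are used, so this version is only practical for modest sample sizes.

```python
from fractions import Fraction
from math import factorial


def hwe_conditional_prob(n_AA, n_Aa, n_aa):
    """P(genotype counts | allele counts) under HWE, as an exact fraction."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa
    n_a = 2 * n_aa + n_Aa
    numerator = factorial(n) * factorial(n_A) * factorial(n_a) * 2 ** n_Aa
    denominator = (factorial(n_AA) * factorial(n_Aa) * factorial(n_aa)
                   * factorial(2 * n))
    return Fraction(numerator, denominator)


def hwe_exact_p_value(n_AA, n_Aa, n_aa):
    """Sum the probabilities of all samples no more probable than the observed one."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa
    observed = hwe_conditional_prob(n_AA, n_Aa, n_aa)
    total = Fraction(0)
    # The heterozygote count has the same parity as n_A and cannot exceed
    # the count of the rarer allele.
    for het in range(n_A % 2, min(n_A, 2 * n - n_A) + 1, 2):
        hom_A = (n_A - het) // 2
        hom_a = n - het - hom_A
        prob = hwe_conditional_prob(hom_A, het, hom_a)
        if prob <= observed:
            total += prob
    return float(total)


print(hwe_exact_p_value(50, 20, 30))  # made-up counts
```

HWE is rejected when the returned value falls below the chosen significance level.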
18. HWE test - Example
The test rejects HWE at the 5% significance level.
19. Power and sample size of tests for HWE
- Statistical tests of hypotheses are subject to two kinds of error: a true hypothesis may be rejected (type I error, with probability $\alpha$, the significance level) or a false hypothesis may not be rejected (type II error, with probability $\beta$, i.e. 1 - power of the test).
- For the chi-square test, theory provides that, in large samples, $X^2$ is distributed approximately as a chi-square with 1 d.f. when the hypothesis is true and as a noncentral chi-square when the hypothesis is false, i.e.
- $X^2 \sim \chi^2(1)$ when $H_0$ is true
- $X^2 \sim \chi^2(1, \lambda)$ when $H_0$ is false
- where $\lambda$ is the noncentrality parameter (see tables).
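Instead of reading tables, the power for a given noncentrality parameter can be computed directly. The sketch below relies on the fact that a noncentral chi-square with 1 d.f. is the square of a normal variable with unit variance and mean equal to the square root of the noncentrality parameter. The function names are assumptions for this example, and only the standard library is used.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def chi_square_power(noncentrality, critical_value=3.84):
    """P(X2 > critical_value) when X2 is noncentral chi-square with 1 d.f."""
    shift = sqrt(noncentrality)
    cutoff = sqrt(critical_value)
    # X2 = (Z + shift)^2 with Z standard normal, so X2 exceeds the critical
    # value when Z + shift > cutoff or Z + shift < -cutoff.
    return (1.0 - normal_cdf(cutoff - shift)) + normal_cdf(-cutoff - shift)


print(chi_square_power(0.0))   # type I error rate when H0 is true
print(chi_square_power(10.5))  # power for noncentrality 10.5
```

With a noncentrality of 0 this returns roughly 0.05, and with 10.5 roughly 0.90.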
20. Power and sample size of tests for HWE
- The disequilibrium coefficient $D$ required for attaining 90% power at a 0.05 significance level for the chi-square test satisfies
$|D| = p_A (1 - p_A) \sqrt{10.5 / n}$
Alternatively, the number of samples required in order to attain 90% power at a 0.05 significance level for the chi-square test when the disequilibrium coefficient is $D$ is
$n = 10.5\, p_A^2 (1 - p_A)^2 / D^2$
Both follow from setting the noncentrality parameter $n D^2 / (p_A^2 (1 - p_A)^2)$ equal to 10.5. If the required power is 50% or 80%, then 10.5 is replaced by 3.84 or 7.85.
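Both calculations amount to solving $n D^2 / (p_A^2 (1 - p_A)^2) = 10.5$ for either $n$ or $D$. The following is a minimal sketch under that assumption; the function names and the example values are hypothetical.

```python
from math import ceil, sqrt


def required_sample_size(D, p_A, noncentrality=10.5):
    """Smallest n for which n * D^2 / (p_A^2 (1 - p_A)^2) reaches the target."""
    return ceil(noncentrality * (p_A * (1.0 - p_A)) ** 2 / D ** 2)


def detectable_D(n, p_A, noncentrality=10.5):
    """Magnitude of D detectable with n individuals at the target power."""
    return sqrt(noncentrality / n) * p_A * (1.0 - p_A)


n = required_sample_size(D=0.05, p_A=0.5)
print(n, detectable_D(n, p_A=0.5))
```

Passing a different `noncentrality` value changes the target power while keeping the 0.05 significance level.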
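21. Linkage disequilibrium: Gametic disequilibrium at two loci
- Measures the association of two alleles at two different loci.
- Given two biallelic loci with alleles A, a and B, b respectively, let the disequilibrium coefficient be
- $D_{AB} = p_{AB} - p_A p_B$.
- The (ML) estimator of $D_{AB}$ is $\hat{D}_{AB} = \hat{p}_{AB} - \hat{p}_A \hat{p}_B$, where the hats denote sample frequencies.
- A chi-square statistic for the hypothesis of no disequilibrium, $H_0: D_{AB} = 0$, is
$X_{AB}^2 = n \hat{D}_{AB}^2 / (\hat{p}_A (1 - \hat{p}_A) \hat{p}_B (1 - \hat{p}_B))$
where $n$ is the number of gametes sampled, and the test rejects $H_0$ if $X_{AB}^2 > 3.84$.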
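When gametes (haplotypes) are observed directly, the estimate and its test statistic take only a few lines. The sketch below assumes the usual 1 d.f. statistic $n D_{AB}^2 / (p_A (1 - p_A) p_B (1 - p_B))$ with $n$ the number of gametes; the function name `gametic_ld` and the counts are hypothetical.

```python
def gametic_ld(n_AB, n_Ab, n_aB, n_ab):
    """Estimate D_AB and the chi-square statistic from gamete counts.

    Requires both loci to be polymorphic in the sample.
    """
    n = n_AB + n_Ab + n_aB + n_ab     # number of gametes
    p_A = (n_AB + n_Ab) / n           # frequency of allele A
    p_B = (n_AB + n_aB) / n           # frequency of allele B
    D_AB = n_AB / n - p_A * p_B
    chi2 = n * D_AB ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D_AB, chi2


D_AB, chi2 = gametic_ld(40, 10, 10, 40)  # made-up counts
print(D_AB, chi2, "reject H0" if chi2 > 3.84 else "do not reject H0")
```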
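22. Linkage disequilibrium: Gametic disequilibrium at two loci
- An exact test for gametic linkage disequilibrium depends on the probabilities of all possible samples of gametic numbers for the observed allele numbers. Under the assumption of no linkage disequilibrium, the probability of the gametic numbers is
$\Pr(n_{AB}, n_{Ab}, n_{aB}, n_{ab}) = \frac{n!}{n_{AB}!\, n_{Ab}!\, n_{aB}!\, n_{ab}!} (p_A p_B)^{n_{AB}} (p_A p_b)^{n_{Ab}} (p_a p_B)^{n_{aB}} (p_a p_b)^{n_{ab}}$
and the allele probabilities are
$\Pr(n_A, n_a) = \frac{n!}{n_A!\, n_a!} p_A^{n_A} p_a^{n_a}$ and $\Pr(n_B, n_b) = \frac{n!}{n_B!\, n_b!} p_B^{n_B} p_b^{n_b}$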
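23. Linkage disequilibrium: Gametic disequilibrium at two loci
- Taking the ratio between these quantities gives the probability of the gametic numbers conditional on the allele numbers,
$\Pr(n_{AB}, n_{Ab}, n_{aB}, n_{ab} \mid n_A, n_a, n_B, n_b) = \frac{n_A!\, n_a!\, n_B!\, n_b!}{n!\, n_{AB}!\, n_{Ab}!\, n_{aB}!\, n_{ab}!}$
which depends on $n$, $n_{AB}$, $n_A$ and $n_B$ only. As in the case of HWE, the total probability of the observed sample and of all samples that are no more probable is compared with the chosen significance level.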
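Because the conditional probability depends only on $n$, $n_{AB}$, $n_A$ and $n_B$, the exact test can be sketched by looping over the possible values of $n_{AB}$. The probability used is the standard hypergeometric form $n_A!\, n_a!\, n_B!\, n_b! / (n!\, n_{AB}!\, n_{Ab}!\, n_{aB}!\, n_{ab}!)$. The function names are hypothetical, and exact fractions keep the example simple at the cost of speed.

```python
from fractions import Fraction
from math import factorial


def gametic_conditional_prob(n_AB, n_Ab, n_aB, n_ab):
    """P(gamete counts | allele counts) under no linkage disequilibrium."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A, n_a = n_AB + n_Ab, n_aB + n_ab
    n_B, n_b = n_AB + n_aB, n_Ab + n_ab
    numerator = (factorial(n_A) * factorial(n_a)
                 * factorial(n_B) * factorial(n_b))
    denominator = (factorial(n) * factorial(n_AB) * factorial(n_Ab)
                   * factorial(n_aB) * factorial(n_ab))
    return Fraction(numerator, denominator)


def gametic_exact_p_value(n_AB, n_Ab, n_aB, n_ab):
    """Sum the probabilities of all tables no more probable than the observed one."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A = n_AB + n_Ab
    n_B = n_AB + n_aB
    observed = gametic_conditional_prob(n_AB, n_Ab, n_aB, n_ab)
    total = Fraction(0)
    # With the allele counts fixed, n_AB determines the other three cells.
    for k in range(max(0, n_A + n_B - n), min(n_A, n_B) + 1):
        prob = gametic_conditional_prob(k, n_A - k, n_B - k, n - n_A - n_B + k)
        if prob <= observed:
            total += prob
    return float(total)


print(gametic_exact_p_value(4, 0, 0, 4))  # made-up counts
```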
24. Linkage disequilibrium: Genotypic disequilibrium
- When genotypes are scored, it is often not possible to distinguish between the two double heterozygotes AB/ab and Ab/aB, so that the gametic frequencies cannot be inferred directly. Under the assumption of random mating, in which genotypic frequencies are assumed to be the products of gametic frequencies, it is possible to estimate gametic frequencies. A measure of (digenic) linkage disequilibrium between alleles A and B is written $\Delta_{AB}$.
25. Linkage disequilibrium: Genotypic disequilibrium
- If the 9 genotypic classes are numbered as
then an (ML) estimator for $\Delta_{AB}$ is
26. Linkage disequilibrium: Genotypic disequilibrium
- The chi-square test statistic for LD is
Note the explicit way in which departures from HWE are included in this expression.
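For illustration only, the sketch below computes a genotype-based measure of this kind using the composite (Burrows) coefficient as it is commonly presented in the population-genetics literature. Treat the formulas as assumptions rather than as the expressions from these slides: the coefficient is taken to be $n_{AB}/n - 2 p_A p_B$ with $n_{AB} = 2 n_{AABB} + n_{AABb} + n_{AaBB} + n_{AaBb}/2$, and the statistic is $n \Delta_{AB}^2$ divided by $(p_A(1 - p_A) + D_A)(p_B(1 - p_B) + D_B)$, where $D_A$ and $D_B$ are the single-locus HWE disequilibrium coefficients. The function name and the counts are hypothetical, and genotypes are keyed by allele copy numbers rather than by a class numbering.

```python
def composite_ld(counts):
    """Composite disequilibrium from unphased two-locus genotype counts.

    counts maps (copies of A, copies of B), each 0, 1 or 2, to the number
    of individuals. Formulas are the commonly cited composite (Burrows)
    forms and are an assumption of this sketch.
    """
    n = sum(counts.values())
    p_A = sum(a * c for (a, _), c in counts.items()) / (2 * n)
    p_B = sum(b * c for (_, b), c in counts.items()) / (2 * n)
    # Single-locus departures from HWE.
    D_A = sum(c for (a, _), c in counts.items() if a == 2) / n - p_A ** 2
    D_B = sum(c for (_, b), c in counts.items() if b == 2) / n - p_B ** 2
    # Weighted count of A-B pairings; double heterozygotes count one half.
    n_AB = (2 * counts.get((2, 2), 0) + counts.get((2, 1), 0)
            + counts.get((1, 2), 0) + 0.5 * counts.get((1, 1), 0))
    delta = n_AB / n - 2 * p_A * p_B
    chi2 = n * delta ** 2 / ((p_A * (1 - p_A) + D_A) * (p_B * (1 - p_B) + D_B))
    return delta, chi2


# Made-up sample in which the two loci are perfectly associated.
print(composite_ld({(2, 2): 5, (0, 0): 5}))
```

The single-locus coefficients $D_A$ and $D_B$ appear in the denominator, so departures from HWE at either locus change the test statistic.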
27.
- $r^2$ (sometimes written $\Delta^2$) is the squared statistical correlation between two sites, and takes the value 1 if only two haplotypes are present. It is arguably the most relevant measure for association between susceptibility loci and SNPs. For example, suppose SNP1 is involved in disease susceptibility, but we genotype cases and controls at a nearby site SNP2. Then, to achieve the same power to detect associations at SNP2 as we would have at SNP1, we need to increase our sample size by a factor of $1/r^2$.
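The correlation measure and the resulting sample-size inflation can be sketched as follows, assuming the standard definition $r^2 = D_{AB}^2 / (p_A (1 - p_A) p_B (1 - p_B))$. The function names and the frequencies are hypothetical.

```python
def r_squared(p_AB, p_A, p_B):
    """Squared correlation between alleles at two biallelic sites."""
    D_AB = p_AB - p_A * p_B
    return D_AB ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))


def sample_size_inflation(r2):
    """Factor by which the sample must grow when typing the nearby site."""
    return 1.0 / r2


r2 = r_squared(p_AB=0.4, p_A=0.5, p_B=0.5)  # made-up frequencies
print(r2, sample_size_inflation(r2))
```

With these made-up frequencies the correlation is about 0.36, so roughly 2.8 times as many samples would be needed at the nearby site.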
28.
- These measures are defined for pairs of sites, but for some applications we might instead want to measure how strong LD is across an entire region that contains many polymorphic sites: for example, for testing whether the strength of LD differs significantly among loci or across populations, or whether there is more or less LD in a region than predicted under a particular model. Measuring LD across a region is not straightforward, but one approach is to use the measure $\rho$, which measures how much recombination would be required under a particular population model to generate the LD that is seen in the data. The development of methods for estimating $\rho$ is now an active research area. This type of method can potentially also provide a statistically rigorous approach to the problem of determining whether LD data provide evidence for the presence of recombination hotspots.