BMI 731 Winter 2005 Chapter 1: SNP Analysis


Learn more at: http://bmi.osu.edu


1
BMI 731 - Winter 2005, Chapter 1: SNP Analysis

Catalin Barbacioru
Department of Biomedical Informatics
Ohio State University

2
Biological Background

Cells are the fundamental working units of every
living system.
The nucleus contains a large DNA
(deoxyribonucleic acid) molecule, which carries
the genetic instructions.
A DNA molecule consists of two strands that wrap
around each other to resemble a twisted ladder.
Each strand is a chain of nucleotides, each composed
of one sugar molecule, one phosphate molecule, and a base.
Four different bases are present in DNA: adenine
(A), thymine (T), cytosine (C), and guanine (G).
The particular order of the bases arranged along
the sugar-phosphate backbone is called the DNA
sequence.

3
Biological Background
4
Biological Background

The two strands of the DNA molecule are held together
at their bases by weak hydrogen bonds.
The four bases pair in a set manner: adenine (A)
pairs with thymine (T), while cytosine (C) pairs
with guanine (G). These pairs of bases are known
as base pairs (bp).
The DNA is organized into separate long segments
called chromosomes, where the number of
chromosomes differs across organisms (46 for
humans, or 23 pairs; each parent contributes 23
chromosomes).

5
Glossary

Allele: Alternative form of a gene. One of the
different forms of a gene that can exist at a
single locus.
Genotype: The specific allelic composition of a
cell, either of the entire cell or, more commonly,
for a certain gene or a set of genes.
Haplotype: A set of closely linked genetic
markers present on one chromosome which tend to
be inherited together (not easily separable by
recombination).

6
Glossary

Locus: A point in the genome, identified by a
marker, which can be mapped by some means.
Marker: Also known as a genetic marker, a segment
of DNA with an identifiable physical location on
a chromosome whose inheritance can be followed. A
marker can be a gene, or it can be some section
of DNA with no known function.
Mutation: A permanent structural alteration in
DNA.

7
Glossary

Hardy-Weinberg equilibrium: The stable frequency
distribution of genotypes AA, Aa, and aa, in the
proportions $p^2$, $2pq$, and $q^2$ respectively (where
p and q are the frequencies of the alleles A and
a), that is a consequence of random mating in the
absence of mutation, migration, natural
selection, or random drift.
Linkage disequilibrium: When the observed
frequencies of haplotypes in a population do
not agree with the haplotype frequencies predicted by
multiplying together the frequencies of the individual
genetic markers in each haplotype.

8
A Little Population Genetics

Population genetics (and evolutionary genetics)
deal with groups of organisms and families,
usually natural populations.
We can discern two strands of thought in the
area. One is the study of very large, idealized
groups or populations, where models can
be deterministic.
The other deals with smaller populations,
where chance plays a larger role
(so-called genetic drift).

9
Genotype and allele frequencies

One question of crucial interest is this: how
common are the different alleles at a given locus
in a given population?

The percentages are our best estimate of the
probability that an individual will carry that
genotype in the population of London, Oxford and
Cambridge. The observed heterozygosity is 49.6%.
10
There is another population described in this
table. It is the population of gametes that gave
rise to the individuals tested.
The percentages here are our best estimate of the
probability that a sperm or egg taken from that
population will carry that particular allele. If
the frequency of the commonest allele at a
particular locus is less than 99%, we call this a
polymorphic locus or polymorphism.
11
Hardy-Weinberg equilibrium

Hardy-Weinberg equilibrium describes the
relationship between the gametic or allele
frequencies and the resulting genotypic
frequencies. It holds if the following properties
are true for the given locus:
1. Random mating or panmixia: the choice of a
mate is not influenced by his/her genotype at the
locus.
2. The locus does not affect the chance of mating
at all, either by altering fertility or
decreasing survival to reproductive age.

If these properties hold, then the probability
that two gametes will meet and give rise to a new
genotype is simply the product of the allele
frequencies (à la binomial):
$P(AA) = P(A) \times P(A) = p_A^2$
$P(aa) = P(a) \times P(a) = p_a^2$
$P(Aa) = 1 - P(AA) - P(aa) = 2 \times P(A) \times P(a) = 2 p_A p_a$.
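As a minimal sketch of the relationship above, the following Python function turns an allele frequency into the expected genotype frequencies under HWE. The function name `hwe_genotype_freqs` is hypothetical and chosen only for illustration.

```python
def hwe_genotype_freqs(p_A: float) -> dict:
    """Expected genotype frequencies under HWE for a biallelic locus.

    p_A is the frequency of allele A; the frequency of allele a is 1 - p_A.
    """
    if not 0.0 <= p_A <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p_a = 1.0 - p_A
    return {
        "AA": p_A * p_A,        # P(AA) = p_A^2
        "Aa": 2.0 * p_A * p_a,  # P(Aa) = 2 p_A p_a
        "aa": p_a * p_a,        # P(aa) = p_a^2
    }


if __name__ == "__main__":
    print(hwe_genotype_freqs(0.3))
```

The three frequencies always sum to 1, because they are the terms of the expansion of $(p_A + p_a)^2$.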

13
Tests for HWE

For a two-allele case, the disequilibrium coefficient
is
$D = P_{AA} - p_A^2$
where $P_{AA} = P(AA)$ is the probability of the AA genotype
and
$p_A = P(A)$ is the probability of allele A.
If $n_{AA}$, $n_{Aa}$, $n_{aa}$ are the numbers of individuals
with genotypes AA, Aa and aa respectively, from a
total of n individuals, then estimators of the
above probabilities are
$\hat{P}_{AA} = n_{AA}/n$, $\hat{P}_{Aa} = n_{Aa}/n$, $\hat{P}_{aa} = n_{aa}/n$, where $n = n_{AA} + n_{Aa} + n_{aa}$
$\hat{p}_A = (2n_{AA} + n_{Aa})/(2n)$, $\hat{p}_a = (2n_{aa} + n_{Aa})/(2n)$ and $\hat{p}_a + \hat{p}_A = 1$
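The estimators above can be sketched in a few lines of Python. The function name `estimate_hwe_quantities` and the dictionary keys are hypothetical, chosen only for illustration.

```python
def estimate_hwe_quantities(n_AA: int, n_Aa: int, n_aa: int) -> dict:
    """Estimate allele frequencies and the disequilibrium coefficient D
    from genotype counts at a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("at least one individual is required")
    P_AA = n_AA / n                    # observed frequency of genotype AA
    p_A = (2 * n_AA + n_Aa) / (2 * n)  # each AA carries two A alleles, each Aa one
    p_a = (2 * n_aa + n_Aa) / (2 * n)
    D = P_AA - p_A ** 2                # D = P_AA - p_A^2
    return {"n": n, "p_A": p_A, "p_a": p_a, "D": D}


if __name__ == "__main__":
    print(estimate_hwe_quantities(30, 40, 30))
```

A positive D indicates more AA homozygotes (and fewer heterozygotes) than expected under HWE; a negative D indicates the opposite.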

14
Chi-square test for HWE

Then under HWE

15
Chi-square test for HWE

The goodness-of-fit chi-squared statistic is
$X_A^2 = \sum_{genotypes} (Obs - Exp)^2 / Exp$
$= (nD)^2/(n p_A^2) + (-2nD)^2/(2n p_A p_a) + (nD)^2/(n p_a^2)$
$= n D^2 / (p_A^2 (1 - p_A)^2)$
and the test rejects the assumption of HWE ($H_0$)
if
$X_A^2 > 3.84$,
the 5% critical value of a chi-square distribution with 1 d.f.
The usual problem associated with this test is that
it is sensitive to small expected values. An
alternative version, with Yates' continuity
correction, is
$X_A^2 = \sum_{genotypes} (|Obs - Exp| - 0.5)^2 / Exp$
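The following is a minimal sketch of the chi-square test for HWE, with an optional Yates correction. The function name `hwe_chi_square` and the example counts are hypothetical; the expected counts are computed from the estimated allele frequencies, as in the formulas above.

```python
def hwe_chi_square(n_AA: int, n_Aa: int, n_aa: int, yates: bool = False) -> float:
    """Goodness-of-fit chi-square statistic for HWE at a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p_A = (2 * n_AA + n_Aa) / (2 * n)
    p_a = 1.0 - p_A
    if p_A in (0.0, 1.0):
        raise ValueError("locus is monomorphic in this sample")
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p_A ** 2, 2 * n * p_A * p_a, n * p_a ** 2)
    statistic = 0.0
    for obs, exp in zip(observed, expected):
        diff = abs(obs - exp)
        if yates:
            diff -= 0.5  # continuity correction, as in the formula above
        statistic += diff ** 2 / exp
    return statistic


if __name__ == "__main__":
    x2 = hwe_chi_square(30, 40, 30)
    print(x2, "reject HWE" if x2 > 3.84 else "do not reject HWE")
```

For these illustrative counts the uncorrected statistic equals $n D^2 / (p_A^2 p_a^2)$, which is a useful check that the sum over genotypes and the closed form agree.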

16
Fisher (exact) test for HWE

Under the HWE hypothesis, the probability of the
observed set of genotypic counts $n_{AA}$, $n_{Aa}$ and $n_{aa}$
in a sample of size n is

whereas the allele counts $n_A$ and $n_a$ are
binomially distributed if HWE holds
17
Fisher (exact) test for HWE

Putting these together, the probability of the observed
genotypic frequencies, assuming HWE, conditional
on the observed allele frequencies is

which can be expressed in terms of the number of A
alleles, $n_A$, and the number of heterozygotes, $n_{Aa}$. We
reject the HWE hypothesis if the p-value, obtained by
summing this conditional probability over the observed
outcome and all outcomes that are no more probable, is
less than the significance level (type I error rate α),
usually 0.05.
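The formulas on these slides did not survive in the transcript. As a sketch, and assuming the standard result that the ratio of the multinomial genotype probability to the binomial allele probability is $n!\, n_A!\, n_a!\, 2^{n_{Aa}} / (n_{AA}!\, n_{Aa}!\, n_{aa}!\, (2n)!)$, the exact test can be written as below. The function names are hypothetical, and the p-value convention used here (sum over all outcomes no more probable than the observed one) is the usual one for exact tests rather than something stated on the slide.

```python
from fractions import Fraction
from math import factorial


def hwe_conditional_prob(n_AA: int, n_Aa: int, n_aa: int) -> Fraction:
    """Probability of the genotype counts given the allele counts, under HWE."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa
    n_a = 2 * n_aa + n_Aa
    numerator = factorial(n) * factorial(n_A) * factorial(n_a) * 2 ** n_Aa
    denominator = (factorial(n_AA) * factorial(n_Aa) * factorial(n_aa)
                   * factorial(2 * n))
    return Fraction(numerator, denominator)


def hwe_exact_p_value(n_AA: int, n_Aa: int, n_aa: int) -> float:
    """Sum the conditional probabilities of all outcomes that are
    no more probable than the observed one (allele counts held fixed)."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa
    n_a = 2 * n - n_A
    observed = hwe_conditional_prob(n_AA, n_Aa, n_aa)
    p_value = Fraction(0)
    # The heterozygote count has the same parity as n_A and cannot exceed
    # the count of the rarer allele.
    for het in range(n_A % 2, min(n_A, n_a) + 1, 2):
        hom_A = (n_A - het) // 2
        hom_a = (n_a - het) // 2
        prob = hwe_conditional_prob(hom_A, het, hom_a)
        if prob <= observed:
            p_value += prob
    return float(p_value)


if __name__ == "__main__":
    print(hwe_exact_p_value(30, 40, 30))
```

Exact rational arithmetic (`Fraction`) is used so that the comparison `prob <= observed` is not affected by floating-point rounding; for large samples a log-factorial implementation would be faster.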
18
HWE test - Example
Causes rejection of HWE at the 5% significance level
19
Power and sample size of tests for HWE

Statistical tests of hypotheses are subject to
two kinds of error: a true hypothesis may be
rejected (type I error, with probability α, the
significance level) or a false hypothesis may not be
rejected (type II error, with probability β, which is
1 minus the power of the test).
For the chi-square test, theory provides that, in
large samples, $X^2$ is distributed approximately as
a chi-square with 1 d.f. when the hypothesis is
true and as a noncentral chi-square when the
hypothesis is false, i.e.
$X^2 \sim \chi^2_{(1)}$ when $H_0$ is true
$X^2 \sim \chi^2_{(1, \lambda)}$ when $H_0$ is false
where $\lambda$ is the noncentrality parameter (see
tables).

20
Power and sample size of tests for HWE

The disequilibrium coefficient, D, required for
attaining 90% power at a 0.05 significance level
for the chi-square test is

Alternatively, the number of samples required in
order to attain 90% power at a 0.05
significance level for the chi-square test, when
the disequilibrium coefficient is D, is

If the required power is 50% or 80%, then 10.5
is replaced by 3.84 or 7.85, respectively.
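The two formulas for this slide are missing from the transcript. The sketch below assumes that the noncentrality parameter equals the chi-square statistic evaluated at the true values, $n D^2 / (p_A^2 (1 - p_A)^2)$, and that 10.5 is the noncentrality giving 90% power at a 0.05 level with 1 d.f. The function names are hypothetical. For 1 d.f. the noncentral chi-square is the square of a shifted standard normal, so the power can be computed from the normal distribution alone.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def noncentrality(n: float, D: float, p_A: float) -> float:
    """Assumed noncentrality parameter: n D^2 / (p_A^2 (1 - p_A)^2)."""
    return n * D ** 2 / (p_A ** 2 * (1.0 - p_A) ** 2)


def chi_square_power(ncp: float, alpha: float = 0.05) -> float:
    """Power of a 1-d.f. chi-square test with noncentrality ncp.

    Uses X^2 = (Z + sqrt(ncp))^2 with Z standard normal.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    shift = sqrt(ncp)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def required_sample_size(D: float, p_A: float, ncp_target: float = 10.5) -> float:
    """Sample size at which the noncentrality reaches ncp_target."""
    return ncp_target * p_A ** 2 * (1.0 - p_A) ** 2 / D ** 2


if __name__ == "__main__":
    n = required_sample_size(D=0.05, p_A=0.5)
    print(n, chi_square_power(noncentrality(n, 0.05, 0.5)))
```

Solving the same relation for D instead of n gives the smallest disequilibrium coefficient detectable with a given sample size.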
21
Linkage disequilibrium: Gametic disequilibrium at
two loci

Measures the association of two alleles at two
different loci.
Given two biallelic loci with alleles A, a and B,
b respectively, let the disequilibrium
coefficient be
$D_{AB} = p_{AB} - p_A p_B$.
The maximum-likelihood (ML) estimator of $D_{AB}$ is
$\hat{D}_{AB} = \hat{p}_{AB} - \hat{p}_A \hat{p}_B$.
A chi-square statistic for the hypothesis of no
disequilibrium, $H_0: D_{AB} = 0$, is the test statistic

and the test rejects $H_0$ if $X_{AB}^2 > 3.84$.
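The test statistic itself is missing from the transcript. As a sketch, and assuming the commonly used form $X_{AB}^2 = n \hat{D}_{AB}^2 / (\hat{p}_A (1 - \hat{p}_A) \hat{p}_B (1 - \hat{p}_B))$ with n the number of gametes sampled, the estimator and the statistic can be computed as follows. The function name `gametic_ld` is hypothetical.

```python
def gametic_ld(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> tuple:
    """Estimate D_AB from gamete (haplotype) counts and compute the
    assumed 1-d.f. chi-square statistic for H0: D_AB = 0."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / n  # frequency of allele B at the second locus
    D_AB = p_AB - p_A * p_B
    denominator = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denominator == 0.0:
        raise ValueError("both loci must be polymorphic in the sample")
    chi_square = n * D_AB ** 2 / denominator
    return D_AB, chi_square


if __name__ == "__main__":
    d, x2 = gametic_ld(40, 10, 10, 40)
    print(d, x2, "reject H0" if x2 > 3.84 else "do not reject H0")
```

With these illustrative counts the haplotypes AB and ab are over-represented, so the estimate of $D_{AB}$ is positive.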
22
Linkage disequilibrium: Gametic disequilibrium at
two loci

An exact test for gametic linkage disequilibrium
depends on the probabilities of all possible
samples of gametic numbers for the observed
allele numbers. Under the assumption of no
linkage disequilibrium

and the allele probabilities are
23
Linkage disequilibrium: Gametic disequilibrium at
two loci

Taking the ratio between these quantities gives
the probability of gametic numbers conditional on
allele numbers

which depends on n, $n_{AB}$, $n_A$ and $n_B$ only. As in
the case of HWE, this probability is used to compute a
p-value, which is compared with the chosen significance level.
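The conditional probability is not preserved in the transcript. The sketch below assumes the standard result that, with the allele counts fixed, the number of AB gametes follows a hypergeometric distribution, as in Fisher's exact test for a 2 x 2 table; this is consistent with the statement that the probability depends only on n, $n_{AB}$, $n_A$ and $n_B$. The function names are hypothetical.

```python
from math import comb


def gametic_conditional_prob(n: int, n_AB: int, n_A: int, n_B: int) -> float:
    """P(n_AB | n_A, n_B) under no linkage disequilibrium (hypergeometric)."""
    return comb(n_A, n_AB) * comb(n - n_A, n_B - n_AB) / comb(n, n_B)


def gametic_exact_p_value(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> float:
    """Sum the probabilities of all tables with the same allele counts
    that are no more probable than the observed one."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A = n_AB + n_Ab
    n_B = n_AB + n_aB
    observed = gametic_conditional_prob(n, n_AB, n_A, n_B)
    lowest = max(0, n_A + n_B - n)  # smallest feasible AB count
    highest = min(n_A, n_B)         # largest feasible AB count
    p_value = 0.0
    for k in range(lowest, highest + 1):
        prob = gametic_conditional_prob(n, k, n_A, n_B)
        if prob <= observed * (1.0 + 1e-9):  # small tolerance for float ties
            p_value += prob
    return p_value


if __name__ == "__main__":
    print(gametic_exact_p_value(40, 10, 10, 40))
```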
24
Linkage disequilibrium: Genotypic disequilibrium

When genotypes are scored, it is often not
possible to distinguish between the two double
heterozygotes AB/ab and Ab/aB, so that the
gametic frequencies cannot be inferred. Under the
assumption of random mating, in which genotypic
frequencies are assumed to be the products of
gametic frequencies, it is possible to estimate
gametic frequencies. A measure of (digenic)
linkage disequilibrium between alleles A and B
is

25
Linkage disequilibrium: Genotypic disequilibrium

If the 9 genotypic classes are numbered as

then an (ML) estimator for $\Delta_{AB}$ is
26
Linkage disequilibrium: Genotypic disequilibrium

The chi-square test statistic for LD is

Note the explicit way in which departures from HWE
are included in this expression.
27

$r^2$ represents the statistical correlation between
two sites, and takes the value 1 if only two
haplotypes are present. It is arguably the most
relevant measure for association between
susceptibility loci and SNPs. For example,
suppose SNP1 is involved in disease
susceptibility, but we genotype cases and
controls at a nearby site SNP2. Then, to achieve
the same power to detect associations at SNP2 as
we would have at SNP1, we need to increase our
sample size by a factor of $1/r^2$.
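The transcript does not define the correlation measure explicitly. As a sketch, assuming the usual definition $r^2 = D_{AB}^2 / (p_A (1 - p_A) p_B (1 - p_B))$ with $D_{AB} = p_{AB} - p_A p_B$, the measure and the resulting sample-size factor can be computed as follows. The function names and the example frequencies are hypothetical.

```python
def r_squared(p_AB: float, p_A: float, p_B: float) -> float:
    """Squared correlation between alleles at two biallelic sites."""
    D_AB = p_AB - p_A * p_B
    return D_AB ** 2 / (p_A * (1.0 - p_A) * p_B * (1.0 - p_B))


def sample_size_factor(r2: float) -> float:
    """Factor by which the sample must grow when testing a nearby
    marker instead of the susceptibility site itself."""
    if r2 <= 0.0:
        raise ValueError("r^2 must be positive for the marker to be informative")
    return 1.0 / r2


if __name__ == "__main__":
    r2 = r_squared(p_AB=0.4, p_A=0.5, p_B=0.5)
    print(r2, sample_size_factor(r2))
```

When only the haplotypes AB and ab are present, $p_{AB} = p_A = p_B$ and the function returns 1, so no increase in sample size is needed.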

These measures are defined for pairs of sites,
but for some applications we might instead want
to measure how strong LD is across an entire
region that contains many polymorphic sites: for
example, for testing whether the strength of LD
differs significantly among loci or across
populations, or whether there is more or less LD
in a region than predicted under a particular
model. Measuring LD across a region is not
straightforward, but one approach is to use the
measure $\rho$, which measures how much recombination
would be required under a particular population
model to generate the LD that is seen in the
data. The development of methods for estimating
$\rho$ is now an active area of research. This type of
method can potentially also provide a statistically
rigorous approach to the problem of determining
whether LD data provide evidence for the presence
of recombination hotspots.