Title: CSE280b: Population Genetics
1CSE280b Population Genetics
- Vineet Bafna/Pavel Pevzner
www.cse.ucsd.edu/classes/sp05/cse291
2Population Genetics
- Individuals in a species (population) are phenotypically different.
- Often these differences are inherited (genetic).
- Studying these differences is important!
- Q: How predictive are these differences?
3EX: Population Structure
- 377 locations (loci) were sampled in 1000 people from 52 populations.
- 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003).
- Genetic differences can predict ethnicity.
4Scope of these lectures
- Basic terminology
- Key principles
- Sources of variation
- HW equilibrium
- Linkage
- Coalescent theory
- Recombination/Ancestral Recombination Graph
- Haplotypes/Haplotype phasing
- Population sub-structure
- Structural polymorphisms
- Medical genetics basis: association mapping/pedigree analysis
5Alleles
- Genotype: genetic makeup of an individual
- Allele: a specific variant at a location
- The notion of alleles predates the concept of gene, and DNA.
- Initially, alleles referred to variants that described a measurable phenotype (round/wrinkled seed).
- Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.
- Humans are diploid: they have 2 copies of each chromosome.
- They may have heterozygosity/homozygosity at a location.
- Other organisms (plants) have higher forms of ploidy.
- Additionally, some sites might have 2 allelic forms, or even many allelic forms.
6What causes variation in a population?
- Mutations (may lead to SNPs)
- Recombinations
- Other genetic events (gene conversion)
- Structural Polymorphisms
7Single Nucleotide Polymorphisms
Infinite Sites Assumption: each site mutates at most once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
8Short Tandem Repeats
Sequence                    Repeat count
GCTAGATCATCATCATCATTGCTAG   4
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATCATCATTGC   5
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATCATCATTGC   5
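The repeat counts on this slide can be reproduced with a short script. This is only a sketch: the function name `longest_tandem_run` is made up for illustration, and the repeat unit is assumed to be `ATC` (reading the unit as `TCA` gives the same counts).

```python
import re

def longest_tandem_run(seq: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq)]
    return max(runs, default=0)

# The six sequences shown on the slide.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]

# Assumed repeat unit: "ATC".
counts = [longest_tandem_run(s, "ATC") for s in sequences]
print(counts)
```

Applied to several loci per individual, the same count yields the vector of repeat lengths that serves as a fingerprint.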
9STR can be used as a DNA fingerprint
- Consider a collection of regions with variable length repeats.
- Variable length repeats will lead to variable length DNA
- Vector of lengths is a finger-print
4 2 3 3 5 1 3 2 3 1 5 3
individuals
loci
10Recombination
00000000 11111111 00011111
11Gene Conversion
- Gene Conversion versus crossover
- Hard to distinguish in a population
12Structural polymorphisms
- Large scale structural changes (deletions/insertions/inversions) may occur in a population.
13Topic 1 Basic Principles
- In a stable population, the distribution of alleles obeys certain laws
- Not really, and the deviations are interesting
- HW Equilibrium
- (due to mixing in a population)
- Linkage (dis)-equilibrium
- Due to recombination
14Hardy Weinberg equilibrium
- Consider a locus with 2 alleles, A, a
- p (respectively, q) is the frequency of A (resp. a) in the population
- 3 genotypes: AA, Aa, aa
- Q: What is the frequency of each genotype?
- If various assumptions are satisfied (such as random mating, no natural selection), then
- P(AA) = p^2
- P(Aa) = 2pq
- P(aa) = q^2
15Hardy Weinberg why?
- Assumptions
- Diploid
- Sexual reproduction
- Random mating
- Bi-allelic sites
- Large population size,
- Why? Each individual randomly picks his two chromosomes. Therefore, Prob(Aa) = pq + qp = 2pq, and so on.
16Hardy Weinberg Generalizations
- Multiple alleles A_1, ..., A_k with frequencies p_1, ..., p_k
- By HW, P(A_i A_i) = p_i^2 and P(A_i A_j) = 2 p_i p_j for i ≠ j
- Multiple loci?
17Hardy Weinberg Implications
- The allele frequency does not change from generation to generation. Why?
- It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the disease?
- Males are 100 times more likely to have the red type of color blindness than females. Why?
- Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
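The questions on this slide can be worked through numerically. The following is a minimal sketch under the HW assumptions; the function names are hypothetical, and the phenylketonuria calculation assumes a single recessive disease allele with frequency q, so that q^2 = 1/10,000.

```python
import math

def hw_genotype_frequencies(p: float) -> dict:
    """Genotype frequencies at a bi-allelic locus under Hardy-Weinberg."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def carrier_frequency(affected_fraction: float) -> float:
    """Heterozygote (carrier) frequency 2pq for a recessive disease.

    Assumes the affected fraction equals q^2.
    """
    q = math.sqrt(affected_fraction)
    return 2 * (1.0 - q) * q

freqs = hw_genotype_frequencies(0.7)

# Allele frequency in the next generation: P(AA) + P(Aa)/2 = p^2 + pq = p.
next_p = freqs["AA"] + freqs["Aa"] / 2

pku_carriers = carrier_frequency(1 / 10_000)  # roughly 2 percent
print(freqs, next_p, pku_carriers)
```

For the color-blindness question, one reading (assuming an X-linked recessive allele of frequency q) is that males are affected with probability q and females with probability q^2, so a 100-fold ratio corresponds to q = 0.01.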
18Recombination
00000000 11111111 00011111
19What if there were no recombinations?
- Life would be simpler
- Each individual sequence would have a single parent (even for higher ploidy)
- The relationship is expressed as a tree.
20The Infinite Sites Assumption
[Figure: a tree rooted at 0 0 0 0 0 0 0 0. A mutation at site 3 gives 0 0 1 0 0 0 0 0; below it, a mutation at site 5 gives 0 0 1 0 1 0 0 0 and a mutation at site 8 gives 0 0 1 0 0 0 0 1.]
- The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
- Some phenotypes could be linked to the polymorphisms
- Some of the linkage is destroyed by recombination
21Infinite sites assumption and Perfect Phylogeny
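- Each site is mutated at most once in the history.
- All descendants must carry the mutated value, and all others must carry the ancestral value
[Figure: a tree with mutation i on one edge; the subtree below that edge has 1 in position i, the rest of the tree has 0 in position i.]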
22Perfect Phylogeny
- Assume an evolutionary model in which no recombination takes place, only mutation.
- The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.
23The 4-gamete condition
- A column i partitions the set of species into two sets, i0 and i1.
- A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.
- EX: i is heterogeneous w.r.t. {A, D, E}
Species: A B C D E F
i:       0 0 0 1 1 1
i0 = {A, B, C}
i1 = {D, E, F}
244 Gamete Condition
- 4 Gamete Condition
- There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is homogeneous w.r.t. at least one of i0 and i1.
- Equivalent to
- There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the following 4 rows do not all exist: (0,0), (0,1), (1,0), (1,1)
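The pairwise check can be sketched in a few lines. The function names and the small matrices below are illustrative assumptions; rows are species (or haplotypes) and columns are sites.

```python
from itertools import combinations

def violates_four_gamete(rows, i, j):
    """True if columns i and j together show all of (0,0), (0,1), (1,0), (1,1)."""
    gametes = {(row[i], row[j]) for row in rows}
    return len(gametes) == 4

def admits_perfect_phylogeny(rows):
    """Apply the 4-gamete test to every pair of columns."""
    n_cols = len(rows[0]) if rows else 0
    return not any(
        violates_four_gamete(rows, i, j)
        for i, j in combinations(range(n_cols), 2)
    )

# Hypothetical data: the first matrix passes, the second contains all 4 gametes.
compatible = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
incompatible = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(admits_perfect_phylogeny(compatible), admits_perfect_phylogeny(incompatible))
```

The test costs O(m^2 n) for n rows and m columns, which is fine for a check but slower than constructing the tree directly.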
254-gamete condition proof (only if)
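- Depending on which edge the mutation j occurs, either i0 or i1 should be homogeneous.
- (only if) Every perfect phylogeny satisfies the 4-gamete condition
- (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?
[Figure: a tree with mutation i on an edge separating the i0 side from the i1 side; mutation j lies on an edge within one of the two sides.]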
26Handling recombination
- A tree is not sufficient as a sequence may have 2 parents
- Recombination leads to loss of correlation between columns
27Linkage (Dis)-equilibrium (LD)
- Consider sites A, B
- Case 1: No recombination
- Each new individual chromosome chooses a parent from the existing haplotypes
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
New individual: 1 0
28Linkage (Dis)-equilibrium (LD)
- Consider sites A, B
- Case 2: diploidy and recombination
- Each new individual chooses a parent from the existing alleles
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
New individual: 1 1
29Linkage (Dis)-equilibrium (LD)
- Consider sites A, B
- Case 1: No recombination
- Each new individual chooses a parent from the existing haplotypes
- Pr[A,B = 0,1] = 0.25
- Linkage disequilibrium
- Case 2: Extensive recombination
- Each new individual simply chooses an allele independently at each site
- Pr[A,B = 0,1] = Pr[A = 0] Pr[B = 1] = 0.5 × 0.25 = 0.125
- Linkage equilibrium
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
30LD
- In the absence of recombination,
- Correlation between columns
- The joint probability Pr[A=a, B=b] is different from P(a)P(b)
- With extensive recombination
- Pr(a,b) = P(a)P(b)
31Measures of LD
- Consider two bi-allelic sites with alleles marked with 0 and 1
- Define
- P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]
- P0* = Pr[Allele 0 in locus 1]
- P*0 = Pr[Allele 0 in locus 2]
- Linkage equilibrium if P00 = P0* P*0
- D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1)
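As a concrete check, D can be computed for the eight two-site haplotypes used in the earlier LD slides. This is a sketch; `ld_coefficient` is a hypothetical helper that returns the signed difference between the joint frequency and the product of the two single-site frequencies, and D is its absolute value.

```python
# Two-site haplotypes (A, B) from the LD example slides.
haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

def ld_coefficient(haps, a=0, b=0):
    """Signed LD: Pr[A=a, B=b] - Pr[A=a] * Pr[B=b]."""
    n = len(haps)
    p_ab = sum(1 for x, y in haps if x == a and y == b) / n
    p_a = sum(1 for x, _ in haps if x == a) / n
    p_b = sum(1 for _, y in haps if y == b) / n
    return p_ab - p_a * p_b

d00 = ld_coefficient(haplotypes, 0, 0)  # 0.25 - 0.5 * 0.75
d01 = ld_coefficient(haplotypes, 0, 1)  # 0.25 - 0.5 * 0.25
print(abs(d00), abs(d01))
```

Both absolute values agree, which illustrates why D can be defined from either pair of alleles.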
32LD over time
- With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear
- Let D(t) = LD at time t; P0*(t) and P*0(t) are the frequencies of allele 0 at locus 1 and at locus 2
- P00(t) = (1-r) P00(t-1) + r P0*(t-1) P*0(t-1)
- D(t) = P00(t) - P0*(t) P*0(t) = P00(t) - P0*(t-1) P*0(t-1) (HW: allele frequencies do not change)
- D(t) = (1-r) D(t-1) = (1-r)^t D(0)
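The recurrence can be iterated on all four haplotype frequencies and compared with the closed form (1-r)^t D(0). This is a sketch under the slide's assumptions (random mating, fixed r); the function names and the starting frequencies are illustrative.

```python
def next_generation(freqs, r):
    """One generation of random mating with recombination rate r.

    freqs maps a haplotype (a, b) to its frequency.
    """
    p_a = {a: freqs[(a, 0)] + freqs[(a, 1)] for a in (0, 1)}
    p_b = {b: freqs[(0, b)] + freqs[(1, b)] for b in (0, 1)}
    return {
        (a, b): (1 - r) * freqs[(a, b)] + r * p_a[a] * p_b[b]
        for a in (0, 1)
        for b in (0, 1)
    }

def d_value(freqs):
    """Signed LD: P00 minus the product of the allele-0 frequencies."""
    p0_locus1 = freqs[(0, 0)] + freqs[(0, 1)]
    p0_locus2 = freqs[(0, 0)] + freqs[(1, 0)]
    return freqs[(0, 0)] - p0_locus1 * p0_locus2

# Assumed starting point: the haplotype frequencies of the LD example.
freqs = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.5, (1, 1): 0.0}
r, generations = 0.1, 10
d0 = d_value(freqs)
for _ in range(generations):
    freqs = next_generation(freqs, r)

print(d_value(freqs), (1 - r) ** generations * d0)
```

The allele frequencies stay fixed throughout, so only the association between the two sites decays.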
33LD over distance
- Assumption
- Recombination rate increases linearly with distance
- LD decays exponentially with distance.
- The assumption is reasonable, but recombination rates vary from region to region, adding to complexity
- This simple fact is the basis of disease association mapping.
34LD and disease mapping
- Consider a mutation that is causal for a disease.
- The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
- Consider every polymorphism, and check
- There might be too many polymorphisms
- Multiple mutations (even at a single locus) that lead to the same disease
- Instead, consider a dense sample of polymorphisms that span the genome
35LD can be used to map disease genes
LD
0 1 1 0 0 1
D N N D D N
- LD decays with distance from the disease allele.
- By plotting LD, one can short list the region
containing the disease gene.
36LD and disease gene mapping problems
- Marker density?
- Complex diseases
- Population sub-structure
37Population Genetics
- Often we look at these equilibria (Linkage/HW) and their deviations in specific populations
- These deviations offer insight into evolution.
- However, what is Normal?
- A combination of empirical (simulation) and theoretical insight helps distinguish between expected and unexpected.
38Topic 2 Simulating population data
- We described various population genetic concepts (HW, LD), and their applicability
- The values of these parameters depend critically upon the population assumptions.
- What if we do not have infinite populations
- No random mating (Ex: geographic isolation)
- Sudden growth
- Bottlenecks
- Ad-mixture
- It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?
39Wright Fisher Model of Evolution
- Fixed population size from generation to generation
- Random mating
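A forward simulation of this model can be sketched as follows. It assumes a single bi-allelic locus, no mutation or selection, and a fixed number of chromosome copies `n_copies`; each copy in the next generation picks its parent uniformly at random. The function name and parameters are illustrative.

```python
import random

def wright_fisher(n_copies, p0, generations, seed=None):
    """Track the frequency of allele A under Wright-Fisher sampling."""
    rng = random.Random(seed)
    count = round(p0 * n_copies)
    trajectory = [count / n_copies]
    for _ in range(generations):
        p = count / n_copies
        # Each of the n_copies children copies a random parent,
        # so it carries A with probability p.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        trajectory.append(count / n_copies)
    return trajectory

trajectory = wright_fisher(n_copies=100, p0=0.5, generations=50, seed=1)
print(trajectory[-1])
```

In a finite population the frequency drifts and eventually fixes at 0 or 1, which is one way the infinite-population HW baseline breaks down.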
40Coalescent model
- Insight 1
- Separate the genealogy from allelic states (mutations)
- First generate the genealogy (who begat whom)
- Assign an allelic state (0) to the ancestor. Drop mutations on the branches.
41Coalescent theory
- Insight 2
- Much of the genealogy is irrelevant, because it disappears.
- Better to go backwards
42Coalescent theory (Kingman)
- Input
- (Fixed population (N individuals), random mating)
- Consider 2 individuals.
- Probability that they coalesce in the previous
generation (have the same parent)
- Probability that they do not coalesce after t
generations
43Coalescent theory
- Consider k individuals.
- Probability that no pair coalesces after 1 generation
- Probability that no pair coalesces after t generations
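The formulas for these probabilities did not survive in the transcript. Under the stated model (fixed population of N individuals, each choosing its parent uniformly at random), the standard expressions are as follows. This is a reconstruction rather than the slide's own text, and the approximations assume k is much smaller than N.

```latex
\begin{align*}
\Pr[\text{two individuals coalesce in the previous generation}] &= \frac{1}{N},\\
\Pr[\text{two individuals do not coalesce in } t \text{ generations}]
  &= \left(1-\frac{1}{N}\right)^{t} \approx e^{-t/N},\\
\Pr[\text{no pair among } k \text{ coalesces in one generation}]
  &= \prod_{i=1}^{k-1}\left(1-\frac{i}{N}\right) \approx 1-\binom{k}{2}\frac{1}{N},\\
\Pr[\text{no pair among } k \text{ coalesces in } t \text{ generations}]
  &\approx e^{-\binom{k}{2}\,t/N}.
\end{align*}
```

Measuring time in units of N generations turns the last line into an exponential waiting time with rate k(k-1)/2.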
44Coalescent approximation
- Insight 3
- Topology is independent of coalescent times
- If you have n individuals, generate a random binary topology
- Iterate (until one individual)
- Pick a pair at random, and coalesce
- Insight 4
- To generate coalescent times, there is no need to go back generation by generation
45Coalescent approximation
- At any step, there are 1 < k ≤ n individuals
- To generate time to coalesce (k to k-1 individuals)
- Pick a number from exponential distribution with rate k(k-1)/2
- Mean time to coalescence: 2/(k(k-1))
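Combining the random topology with the exponential waiting times gives a compact simulator. This is a sketch, not the lecture's code: the function name is hypothetical, leaves are numbered 0 to n-1, internal nodes get the next free integers, and time is in the scaled units in which k lineages coalesce at rate k(k-1)/2.

```python
import random

def simulate_coalescent(n, seed=None):
    """Return a list of coalescence events (waiting_time, child_a, child_b, parent)."""
    rng = random.Random(seed)
    lineages = list(range(n))  # currently active lineages
    next_id = n                # label for the next internal node
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time while k lineages remain: exponential with rate k(k-1)/2.
        wait = rng.expovariate(k * (k - 1) / 2)
        # Topology: pick a pair uniformly at random and merge it.
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_id)
        events.append((wait, a, b, next_id))
        next_id += 1
    return events

events = simulate_coalescent(6, seed=0)
tmrca = sum(wait for wait, _, _, _ in events)
print(len(events), tmrca)
```

The loop runs exactly n-1 times regardless of the population size, which is the point of Insight 4.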
46Typical coalescents
- 4 random examples with n = 6 (Note that we do not need to specify N. Why?)
- Expected time to coalesce?
47Coalescent properties
- Expected time for the last step (k = 2): 1
- The last step is about half of the total time to coalesce
- Studying larger number of individuals does not change numbers tremendously
- EX: Number of mutations in a population is proportional to the total branch length of the tree
- E(Ttot) = 2 (1 + 1/2 + ... + 1/(n-1))
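These expectations follow from the mean waiting time 2/(k(k-1)) while k lineages remain. A short sketch (hypothetical function names, time in the same scaled units) sums them directly:

```python
def expected_tmrca(n):
    """Expected time to the most recent common ancestor of n individuals."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))

def expected_total_branch_length(n):
    """Expected total branch length: k lineages persist for 2/(k(k-1)) on average."""
    return sum(k * 2 / (k * (k - 1)) for k in range(2, n + 1))

for n in (2, 6, 100):
    print(n, expected_tmrca(n), expected_total_branch_length(n))
```

The first sum telescopes to 2(1 - 1/n), so it stays below 2 for every n, while the last step alone contributes 1. The total branch length grows only logarithmically in n.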
48Variants (exponentially growing populations)
- If the population is growing exponentially, the branch lengths become similar, or even star-like. Why?
- With appropriate scaling of time, the same process can be extended to various scenarios: male-female, hermaphrodite, segregation, migration, etc.
49Simulating population data
- Generate a coalescent (topology + branch lengths)
- For each branch length, drop mutations with rate θ
- Generate sequence data
- Note that the resulting sequences admit a perfect phylogeny.
- Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths)
- Also, note that all pairs of positions are correlated (should have high LD).
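The whole pipeline fits in one short routine. The sketch below makes several assumptions that are not stated on the slide: mutations fall on each lineage as a Poisson process with rate theta/2 per unit of scaled time, every mutation creates a new site (infinite sites), and the ancestral state is 0. The names `simulate_sequences` and `poisson` are illustrative.

```python
import random

def poisson(rng, mean):
    """Poisson sample: count rate-1 exponential arrivals before time `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_sequences(n, theta, seed=None):
    """Simulate n binary sequences on a random coalescent tree."""
    rng = random.Random(seed)
    # Each active lineage is represented by the set of samples below it.
    lineages = [frozenset([i]) for i in range(n)]
    sites = []  # one entry per mutation: the set of samples carrying it
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2)
        # Drop mutations on each of the k branches spanning this interval.
        for lineage in lineages:
            for _ in range(poisson(rng, theta / 2 * wait)):
                sites.append(lineage)
        a, b = rng.sample(lineages, 2)
        lineages = [l for l in lineages if l not in (a, b)] + [a | b]
    return [[1 if i in site else 0 for site in sites] for i in range(n)]

sample = simulate_sequences(6, theta=5.0, seed=1)
for row in sample:
    print("".join(map(str, row)))
```

Because every site is carried by exactly the samples below one branch, no pair of columns can show all four gametes.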
50Coalescent with Recombination
- An individual may have one parent, or 2 parents
51ARG Coalescent with recombination
- Given mutation rate θ, recombination rate ρ, population size 2N (diploid), sample size n.
- How can you generate the ARG (topology + branch lengths) efficiently?
- How will you generate sequences for n individuals?
- Given sequence data, can you reconstruct the ARG (topology)?
52Recombination
- Define r as the probability of recombining.
- Note that the parameter is a scaled value which will be defined later
- Assume k individuals in a generation. The following might happen:
- An individual arises because of a recombination event between two individuals (it will have 2 parents).
- Two individuals coalesce
- Neither (each individual has a distinct parent)
- Multiple events (low probability)
53Recombination
- We ignore the case of multiple (> 1) events in one generation
- Pr(No recombination) = 1 - kr
- Pr(No coalescence) ≈ 1 - k(k-1)/(4N)
- Consider scaled time in units of 2N generations. Thus the number of individuals increases with rate kr2N, and decreases with rate k(k-1)/2
- The value 2rN is usually small, and therefore, the process will ultimately coalesce to a single individual (MRCA)
54ARG
- Let k = n
- Define the event rates in scaled time: recombination kr2N, coalescence k(k-1)/2
- Iterate until k = 1
- Choose time from an exponential distribution with rate kr2N + k(k-1)/2
- Pick event as recombination with probability kr2N / (kr2N + k(k-1)/2)
- If event is recombination, choose an individual to recombine, and a position, else choose a pair to coalesce.
- Update k, and continue
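The iteration can be sketched as an event loop over the lineage count k. The rates used here are assumptions read off the previous slide: with time in units of 2N generations, recombination adds a lineage at rate k * rec and coalescence removes one at rate k(k-1)/2, where `rec` stands for the scaled per-lineage rate 2rN. The sketch records only event times and types, not which individuals or positions are involved.

```python
import random

def simulate_arg_events(n, rec, seed=None, max_events=100_000):
    """Return a list of (time, event_type) until one lineage remains."""
    rng = random.Random(seed)
    k, t, events = n, 0.0, []
    while k > 1 and len(events) < max_events:
        coal_rate = k * (k - 1) / 2
        rec_rate = k * rec
        total = coal_rate + rec_rate
        t += rng.expovariate(total)
        if rng.random() < rec_rate / total:
            events.append((t, "recombination"))
            k += 1  # the recombinant lineage splits into two parents
        else:
            events.append((t, "coalescence"))
            k -= 1
    return events

events = simulate_arg_events(5, rec=0.5, seed=3)
n_rec = sum(1 for _, kind in events if kind == "recombination")
n_coal = len(events) - n_rec
print(n_rec, n_coal)
```

Since coalescence grows quadratically in k and recombination only linearly, the count is pulled back toward 1, which is why the process terminates. The `max_events` cap is only a safety guard for very large rates.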
55Simulating sequences on the ARG
- Generate topology and branch lengths as before
- For each recombination, generate a position.
- Next generate mutations at random on branch lengths
- For a mutation, select a position as well.
56Recombination events and ρ
- Given ρ, n, can you compute the expected number of recombination events?
- It can be shown that E(n, ρ) ≈ ρ log(n)
- The question that people are really interested in
- Given a set of sequences from a population, compute the recombination rate ρ
- Given a population, reconstruct the most likely history (as an ancestral recombination graph)
- We will address this question in subsequent lectures
57An algorithm for constructing a perfect phylogeny
- We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later.
- In any tree, each node (except the root) has a single parent.
- It is sufficient to construct a parent for every node.
- In each step, we add a column and refine some of the nodes containing multiple children.
- Stop if all columns have been considered.
58Inclusion Property
- For any pair of columns i,j
- i < j if and only if i1 ⊇ j1
- Note that if i < j then the edge containing i is an ancestor of the edge containing j
[Figure: a tree in which the edge labeled i lies above the edge labeled j.]
59Example
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
Initially, there is a single clade r, and each node has r as its parent
60Sort columns
- Sort columns according to the inclusion property (note that the columns are already sorted here).
- This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
61Add first column
- In adding column i
- Check each edge and decide which side you belong.
- Finally add a node if you can resolve a clade
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
[Figure: after adding column 1, root r has children B, D, and a new node u; u has children A, C, E.]
62Adding other columns
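- Add other columns on edges using the ordering property
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
[Figure: final tree rooted at r. Edge 1 leads to a node holding E; below it, edge 2 leads to a node holding A, and edge 4 below that leads to C. Edge 3 leads from r to a node holding B, and edge 5 below that leads to D.]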
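One way to realize the construction is to sort the columns and then insert each row into the tree as the path of its mutated columns. This is a sketch of that idea rather than the lecture's exact procedure: the function name, the node labels (`r`, `v1`, ...), and the dictionary layout are assumptions, and 0 is taken as the ancestral state.

```python
def build_perfect_phylogeny(matrix, names):
    """Return (children, leaves) for a rooted perfect phylogeny.

    children[node] maps a column index (0-based) to the child reached by
    that mutation; leaves[node] lists the rows attached at that node.
    Raises ValueError if some column would need two different edges.
    """
    n_cols = len(matrix[0])
    # Read each column as a binary number (row 1 = most significant bit)
    # and sort in decreasing order.
    order = sorted(range(n_cols),
                   key=lambda j: tuple(row[j] for row in matrix),
                   reverse=True)
    children, leaves, placed = {"r": {}}, {"r": []}, set()
    for row, name in zip(matrix, names):
        node = "r"
        for j in order:
            if row[j] != 1:
                continue
            if j not in children[node]:
                if j in placed:
                    raise ValueError("no perfect phylogeny for this matrix")
                child = f"v{j + 1}"
                children[node][j] = child
                children[child], leaves[child] = {}, []
                placed.add(j)
            node = children[node][j]
        leaves[node].append(name)
    return children, leaves

example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
children, leaves = build_perfect_phylogeny(example, "ABCDE")
print(children)
print(leaves)
```

On the example matrix this places E below edge 1, A below edge 2, C below edge 4, B below edge 3, and D below edge 5. A column that would have to label two different edges signals that no perfect phylogeny exists.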
63Unrooted case
- Switch the values in each column, so that 0 is the majority element.
- Apply the algorithm for the rooted case