CSE280b: Population Genetics - PowerPoint PPT Presentation

Slides: 66
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes


1
CSE280b Population Genetics
  • Vineet Bafna/Pavel Pevzner

www.cse.ucsd.edu/classes/sp05/cse291
2
Population Genetics
  • Individuals in a species (population) are
    phenotypically different.
  • Often these differences are inherited (genetic).
  • Studying these differences is important!
  • Q: How predictive are these differences?

3
EX: Population Structure
  • 377 locations (loci) were sampled in 1000 people
    from 52 populations.
  • 6 genetic clusters were obtained, 5 of which
    corresponded to major geographic regions
    (Rosenberg et al., Science 2002)
  • Genetic differences can predict ethnicity.

4
Scope of these lectures
  • Basic terminology
  • Key principles
  • Sources of variation
  • HW equilibrium
  • Linkage
  • Coalescent theory
  • Recombination/Ancestral Recombination Graph
  • Haplotypes/Haplotype phasing
  • Population sub-structure
  • Structural polymorphisms
  • Medical genetics basis: association
    mapping/pedigree analysis

5
Alleles
  • Genotype: genetic makeup of an individual
  • Allele: a specific variant at a location
  • The notion of alleles predates the concept of
    gene, and DNA.
  • Initially, alleles referred to variants that
    described a measurable phenotype (round/wrinkled
    seed)
  • Now, an allele might be a nucleotide on a
    chromosome, with no measurable phenotype.
  • Humans are diploid: they have 2 copies of each
    chromosome.
  • They may have heterozygosity/homozygosity at a
    location
  • Other organisms (plants) have higher forms of
    ploidy.
  • Additionally, some sites might have 2 allelic
    forms, or even many allelic forms.

6
What causes variation in a population?
  • Mutations (may lead to SNPs)
  • Recombinations
  • Other genetic events (gene conversion)
  • Structural Polymorphisms

7
Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at
most once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
8
Short Tandem Repeats
Sequence                    Repeat count
GCTAGATCATCATCATCATTGCTAG   4
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATCATCATTGC   5
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATTGCTAGTTA   3
GCTAGATCATCATCATCATCATTGC   5
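A minimal sketch of how the repeat counts on this slide could be computed. It assumes the repeat unit is ATC (the slide does not name the motif), and `max_tandem_repeats` is a hypothetical helper, not part of the lecture.

```python
import re


def max_tandem_repeats(seq: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `seq`."""
    # Each match is one maximal run of the motif, e.g. "ATCATCATC".
    runs = re.findall(f"(?:{re.escape(motif)})+", seq)
    return max((len(run) // len(motif) for run in runs), default=0)


sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]
# One repeat count per sequence; a vector of such counts across
# several loci is the "fingerprint" idea.
counts = [max_tandem_repeats(s, "ATC") for s in sequences]
print(counts)
```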
9
STR can be used as a DNA fingerprint
  • Consider a collection of regions with variable
    length repeats.
  • Variable length repeats will lead to variable
    length DNA
  • Vector of lengths is a finger-print

[Figure: matrix of repeat counts indexed by individuals and loci, with entries 4 2 3 3 5 1 3 2 3 1 5 3]
10
Recombination
00000000 + 11111111 -> 00011111
11
Gene Conversion
  • Gene Conversion versus crossover
  • Hard to distinguish in a population

12
Structural polymorphisms
  • Large scale structural changes
    (deletions/insertions/inversions) may occur in a
    population.

13
Topic 1 Basic Principles
  • In a stable population, the distribution of
    alleles obeys certain laws
  • Not really, and the deviations are interesting
  • HW Equilibrium (due to mixing in a population)
  • Linkage (dis)-equilibrium (due to recombination)

14
Hardy Weinberg equilibrium
  • Consider a locus with 2 alleles, A, a
  • p (respectively, q) is the frequency of A (resp.
    a) in the population
  • 3 genotypes: AA, Aa, aa
  • Q: What is the frequency of each genotype?
  • If various assumptions are satisfied (such as
    random mating, no natural selection), then
  • $P_{AA} = p^2$
  • $P_{Aa} = 2pq$
  • $P_{aa} = q^2$

15
Hardy Weinberg why?
  • Assumptions
  • Diploid
  • Sexual reproduction
  • Random mating
  • Bi-allelic sites
  • Large population size
  • Why? Each individual randomly picks his two
    chromosomes. Therefore, Prob(Aa) = pq + qp = 2pq,
    and so on.

16
Hardy Weinberg Generalizations
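  • Multiple alleles with frequencies $p_1, p_2, \ldots, p_k$
  • By HW, $\Pr[\text{homozygous } i] = p_i^2$ and
    $\Pr[\text{heterozygous } i, j] = 2 p_i p_j$
  • Multiple loci?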

17
Hardy Weinberg Implications
  • The allele frequency does not change from
    generation to generation. Why?
  • It is observed that 1 in 10,000 Caucasians has
    the disease phenylketonuria. The disease
    mutation(s) are all recessive. What fraction of
    the population carries the disease allele?
  • Males are 100 times more likely to have the red
    type of color blindness than females. Why?
  • Conclusion: While the HW assumptions are rarely
    satisfied, the principle is still important as a
    baseline assumption, and significant deviations
    are interesting.
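
A worked version of the phenylketonuria question, as a hedged sketch. It assumes a single bi-allelic locus in HW equilibrium, so affected individuals have frequency q^2 and carriers have frequency 2pq. The function names are illustrative.

```python
import math


def genotype_frequencies(p: float) -> tuple[float, float, float]:
    """Hardy-Weinberg frequencies (AA, Aa, aa) for allele frequency p of A."""
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


def carrier_fraction(disease_prevalence: float) -> float:
    """Fraction of heterozygous carriers for a recessive disease.

    Assumes prevalence = q^2, i.e. only aa individuals are affected.
    """
    q = math.sqrt(disease_prevalence)
    _, carriers, _ = genotype_frequencies(1.0 - q)
    return carriers


# Prevalence 1 in 10,000 gives q = 0.01, so about 2% of people are carriers.
print(round(carrier_fraction(1 / 10_000), 4))
```

Under these assumptions carriers are roughly 200 times more common than affected individuals, which is why rare recessive alleles persist mostly in heterozygotes.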

18
Recombination
00000000 + 11111111 -> 00011111
19
What if there were no recombinations?
  • Life would be simpler
  • Each individual sequence would have a single
    parent (even for higher ploidy)
  • The relationship is expressed as a tree.

20
The Infinite Sites Assumption
[Figure: tree of sequences, with the mutated position marked on each edge]
00000000 --(3)--> 00100000
00100000 --(5)--> 00101000
00100000 --(8)--> 00100001
  • The different sites are linked. A 1 in position
    8 implies 0 in position 5, and vice versa.
  • Some phenotypes could be linked to the
    polymorphisms
  • Some of the linkage is destroyed by
    recombination

21
Infinite sites assumption and Perfect Phylogeny
  • Each site is mutated at most once in the history.
  • All descendants must carry the mutated value, and
    all others must carry the ancestral value

[Figure: a mutation at site i on one tree edge; sequences below that edge have 1 in position i, all others have 0 in position i]
22
Perfect Phylogeny
  • Assume an evolutionary model in which no
    recombination takes place, only mutation.
  • The evolutionary history is explained by a tree
    in which every mutation is on an edge of the
    tree. All the species in one sub-tree contain a
    0, and all species in the other contain a 1. Such
    a tree is called a perfect phylogeny.

23
The 4-gamete condition
  • A column i partitions the set of species into two
    sets, i0 and i1
  • A column is homogeneous w.r.t. a set of species
    if it has the same value for all species in the
    set. Otherwise, it is heterogeneous.
  • EX: i is heterogeneous w.r.t. {A, D, E}

Species  i
A        0
B        0
C        0
D        1
E        1
F        1
i0 = {A, B, C}
i1 = {D, E, F}
24
4 Gamete Condition
  • 4 Gamete Condition
  • There exists a perfect phylogeny if and only if
    for all pairs of columns (i,j), j is homogeneous
    w.r.t. i0 or i1 (it is not heterogeneous w.r.t.
    both).
  • Equivalent to
  • There exists a perfect phylogeny if and only if
    for all pairs of columns (i,j), the following 4
    rows do not all occur
  • (0,0), (0,1), (1,0), (1,1)
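
The pairwise check can be sketched as follows. This is an illustrative implementation of the 4-gamete test as stated above, run on the 5-species example matrix used later in these slides. The function names are assumptions.

```python
from itertools import combinations


def has_four_gametes(col_i, col_j) -> bool:
    """True if all of (0,0), (0,1), (1,0), (1,1) occur in the two columns."""
    return len(set(zip(col_i, col_j))) == 4


def satisfies_four_gamete_condition(rows) -> bool:
    """Check every pair of columns of a binary matrix given as rows."""
    columns = list(zip(*rows))
    return not any(has_four_gametes(a, b) for a, b in combinations(columns, 2))


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
print(satisfies_four_gamete_condition(example))
```

The check costs O(m^2 n) for m columns and n rows, which is fine for small matrices; the sorting-based construction at the end of these slides avoids the quadratic pair enumeration.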

25
4-gamete condition proof (only if)
  • Depending on which edge the mutation j occurs on,
    either i0 or i1 must be homogeneous w.r.t. j.
  • (only if) Every perfect phylogeny satisfies the
    4-gamete condition
  • (if) If the 4-gamete condition is satisfied, does
    a perfect phylogeny exist?

[Figure: tree with mutations i and j on edges; edge i separates i0 from i1]
26
Handling recombination
  • A tree is not sufficient as a sequence may have 2
    parents
  • Recombination leads to loss of correlation
    between columns

27
Linkage (Dis)-equilibrium (LD)
  • Consider sites A & B
  • Case 1: No recombination
  • Each new individual chromosome chooses a parent
    from the existing haplotypes

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
New chromosome: 1 0
28
Linkage (Dis)-equilibrium (LD)
  • Consider sites A & B
  • Case 2: Diploidy and recombination
  • Each new individual chooses a parent from the
    existing alleles

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
New chromosome: 1 1
29
Linkage (Dis)-equilibrium (LD)
  • Consider sites A & B
  • Case 1: No recombination
  • Each new individual chooses a parent from the
    existing haplotypes
  • Pr[A,B = 0,1] = 0.25
  • Linkage disequilibrium
  • Case 2: Extensive recombination
  • Each new individual simply chooses an allele
    independently at each site
  • Pr[A,B = 0,1] = Pr[A = 0] Pr[B = 1] = 0.5 × 0.25 = 0.125
  • Linkage equilibrium

A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
30
LD
  • In the absence of recombination,
  • Correlation between columns
  • The joint probability Pr[A=a, B=b] is different
    from P(a)P(b)
  • With extensive recombination
  • Pr(a,b) = P(a)P(b)

31
Measures of LD
  • Consider two bi-allelic sites with alleles marked
    with 0 and 1
  • Define
  • $P_{00}$ = Pr[allele 0 at locus 1 and allele 0 at locus 2]
  • $P_{0*}$ = Pr[allele 0 at locus 1], and
    $P_{*0}$ = Pr[allele 0 at locus 2]
  • Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$
  • $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}|$
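
A small sketch of this measure, computed on the eight two-site haplotypes used in the LD slides above (01, 01, 00, 00, 10, 10, 10, 10). The function name is an assumption; D is the absolute difference between the joint frequency of (0,0) and the product of the two marginal frequencies of allele 0.

```python
from collections import Counter


def linkage_disequilibrium(haplotypes) -> float:
    """D = |P00 - P0* P*0| for a list of (locus1, locus2) allele pairs."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p00 = counts[(0, 0)] / n
    p0_first = sum(1 for a, _ in haplotypes if a == 0) / n   # allele 0 at locus 1
    p0_second = sum(1 for _, b in haplotypes if b == 0) / n  # allele 0 at locus 2
    return abs(p00 - p0_first * p0_second)


sample = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
print(linkage_disequilibrium(sample))
```

For this sample P00 = 0.25 while the product of the marginals is 0.5 × 0.75 = 0.375, so D = 0.125: the sites are in linkage disequilibrium.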

32
LD over time
  • With random mating, and fixed recombination rate
    r between the sites, Linkage Disequilibrium will
    disappear
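  • Let $D^{(t)}$ be the LD at time t
  • $P^{(t)}_{00} = (1-r) P^{(t-1)}_{00} + r P^{(t-1)}_{0*} P^{(t-1)}_{*0}$
  • $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$
    (HW: allele frequencies do not change)
  • $D^{(t)} = (1-r) D^{(t-1)} = (1-r)^t D^{(0)}$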
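
To make the geometric decay concrete, here is a short sketch that iterates the recurrence D(t) = (1 - r) D(t-1) and compares it with the closed form (1 - r)^t D(0). The function names and the numbers in the example are illustrative assumptions.

```python
import math


def ld_trajectory(d0: float, r: float, generations: int) -> list[float]:
    """Iterate D(t) = (1 - r) * D(t - 1), starting from D(0) = d0."""
    values = [d0]
    for _ in range(generations):
        values.append((1 - r) * values[-1])
    return values


def generations_to_halve(r: float) -> float:
    """Solve (1 - r)^t = 1/2 for t; requires 0 < r < 1."""
    return math.log(0.5) / math.log(1 - r)


trajectory = ld_trajectory(d0=0.125, r=0.01, generations=100)
print(trajectory[-1], 0.125 * (1 - 0.01) ** 100)
print(generations_to_halve(0.01))
```

With r = 0.01 the half-life is about 69 generations, so tightly linked sites keep their correlation for a long time while loosely linked sites lose it quickly.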

33
LD over distance
  • Assumption
  • Recombination rate increases linearly with
    distance
  • LD decays exponentially with distance.
  • The assumption is reasonable, but recombination
    rates vary from region to region, adding to
    complexity
  • This simple fact is the basis of disease
    association mapping.

34
LD and disease mapping
  • Consider a mutation that is causal for a disease.
  • The goal of disease gene mapping is to discover
    which gene (locus) carries the mutation.
  • Consider every polymorphism, and check
  • There might be too many polymorphisms
  • Multiple mutations (even at a single locus) that
    lead to the same disease
  • Instead, consider a dense sample of polymorphisms
    that span the genome

35
LD can be used to map disease genes
[Figure: LD plotted along the chromosome]
Marker allele:  0 1 1 0 0 1
Phenotype:      D N N D D N   (D = disease, N = normal)
  • LD decays with distance from the disease allele.
  • By plotting LD, one can short list the region
    containing the disease gene.

36
LD and disease gene mapping problems
  • Marker density?
  • Complex diseases
  • Population sub-structure

37
Population Genetics
  • Often we look at these equilibria (Linkage/HW)
    and their deviations in specific populations
  • These deviations offer insight into evolution.
  • However, what is Normal?
  • A combination of empirical (simulation) and
    theoretical insight helps distinguish between
    expected and unexpected.

38
Topic 2 Simulating population data
  • We described various population genetic concepts
    (HW, LD), and their applicability
  • The values of these parameters depend critically
    upon the population assumptions.
  • What if we do not have infinite populations?
  • No random mating (e.g., geographic isolation)
  • Sudden growth
  • Bottlenecks
  • Ad-mixture
  • It would be nice to have a simulation of such a
    population to test various ideas. How would you
    do this simulation?

39
Wright Fisher Model of Evolution
  • Fixed population size from generation to
    generation
  • Random mating
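
A forward-in-time sketch of this model for one bi-allelic site. It assumes a fixed number of chromosome copies per generation, with each copy in the next generation picking its parent uniformly at random. Mutation and selection are deliberately left out, and the function name is hypothetical.

```python
import random


def wright_fisher(num_copies: int, initial_count: int,
                  generations: int, seed=None) -> list[int]:
    """Track the count of one allele under pure random drift."""
    rng = random.Random(seed)
    count = initial_count
    trajectory = [count]
    for _ in range(generations):
        freq = count / num_copies
        # Each copy in the next generation inherits the allele with
        # probability equal to its current frequency (binomial sampling).
        count = sum(rng.random() < freq for _ in range(num_copies))
        trajectory.append(count)
    return trajectory


print(wright_fisher(num_copies=20, initial_count=10, generations=30, seed=1))
```

Simulating every individual in every generation is what makes this approach expensive for large populations, which motivates the backward-in-time coalescent on the following slides.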

40
Coalescent model
  • Insight 1
  • Separate the genealogy from allelic states
    (mutations)
  • First generate the genealogy (who begat whom)
  • Assign an allelic state (0) to the ancestor. Drop
    mutations on the branches.

41
Coalescent theory
  • Insight 2
  • Much of the genealogy is irrelevant, because it
    disappears.
  • Better to go backwards

42
Coalescent theory (Kingman)
  • Input
  • (Fixed population (N individuals), random mating)
  • Consider 2 individuals.
  • Probability that they coalesce in the previous
    generation (have the same parent): $1/N$
  • Probability that they do not coalesce after t
    generations: $(1 - 1/N)^t \approx e^{-t/N}$

43
Coalescent theory
  • Consider k individuals.
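  • Probability that no pair coalesces after 1
    generation: $\prod_{i=1}^{k-1} (1 - i/N) \approx 1 - \binom{k}{2}/N$
  • Probability that no pair coalesces after t
    generations: $\approx (1 - \binom{k}{2}/N)^t \approx e^{-\binom{k}{2} t/N}$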

44
Coalescent approximation
  • Insight 3
  • Topology is independent of coalescent times
  • If you have n individuals, generate a random
    binary topology
  • Iterate (until one individual)
  • Pick a pair at random, and coalesce
  • Insight 4
  • To generate coalescent times, there is no need to
    go back generation by generation

45
Coalescent approximation
  • At any step, there are 1 < k ≤ n individuals
  • To generate time to coalesce (k to k-1
    individuals)
  • Pick a number from exponential distribution with
    rate $k(k-1)/2$
  • Mean time to coalescence: $2/(k(k-1))$

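The two ingredients (random pair merges for the topology, exponential waiting times with rate k(k-1)/2 for the branch lengths) can be sketched as below. Time is in scaled units, and the function name and return format are assumptions made for illustration.

```python
import random


def simulate_coalescent(n: int, seed=None):
    """Return a list of (time, merged_sample_ids) coalescence events."""
    rng = random.Random(seed)
    lineages = [(i,) for i in range(n)]  # each lineage lists the samples below it
    events = []
    time = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time while k lineages remain: exponential, rate k(k-1)/2.
        time += rng.expovariate(k * (k - 1) / 2)
        # Topology: pick a pair uniformly at random and merge it.
        a, b = rng.sample(range(k), 2)
        merged = lineages[a] + lineages[b]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
        events.append((time, merged))
    return events


for time, merged in simulate_coalescent(6, seed=1):
    print(f"t = {time:.3f}: {sorted(merged)}")
```

Note that the population size never appears: it is absorbed into the time scaling, so the same code serves any N.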
46
Typical coalescents
  • 4 random examples with n = 6 (Note that we do not
    need to specify N. Why?)
  • Expected time to coalesce?

47
Coalescent properties
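  • Expected time for the last step (k = 2):
    $2/(2 \cdot 1) = 1$
  • The last step is about half of the total time to
    coalesce: $E(T_{MRCA}) = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2(1 - 1/n)$
  • Studying larger number of individuals does not
    change numbers tremendously
  • EX: Number of mutations in a population is
    proportional to the total branch length of the
    tree
  • $E(T_{tot}) = \sum_{k=2}^{n} k \cdot \frac{2}{k(k-1)} = 2 \sum_{k=1}^{n-1} \frac{1}{k}$
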
48
Variants (exponentially growing populations)
  • If the population is growing exponentially, the
    branch lengths become similar, or even star-like.
    Why?
  • With appropriate scaling of time, the same
    process can be extended to various scenarios
    male-female, hermaphrodite, segregation,
    migration, etc.

49
Simulating population data
  • Generate a coalescent (topology + branch lengths)
  • On each branch, drop mutations with rate θ
  • Generate sequence data
  • Note that the resulting sequences form a perfect
    phylogeny.
  • Given such sequence data, can you reconstruct the
    coalescent tree? (Only the topology, not the
    branch lengths)
  • Also, note that all pairs of positions are
    correlated (should have high LD).
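
A hedged sketch of the whole pipeline: build a coalescent, drop mutations on each branch as a Poisson process, and read off binary sequences under the infinite sites assumption (every mutation opens a new column). The parameter `mutation_rate` is a hypothetical per-unit-branch-length rate and must be positive; how it relates to the slide's rate symbol is not specified here.

```python
import random


def simulate_sequences(n: int, mutation_rate: float, seed=None):
    """Return n binary sequences (lists of 0/1), one column per mutation."""
    rng = random.Random(seed)
    # Each lineage: (set of samples below it, time the lineage started).
    lineages = [({i}, 0.0) for i in range(n)]
    sites = []  # each site is the set of samples carrying the mutated allele
    time = 0.0

    def drop_mutations(leaves, length):
        # Poisson process along the branch via exponential gaps.
        position = rng.expovariate(mutation_rate)
        while position < length:
            sites.append(frozenset(leaves))
            position += rng.expovariate(mutation_rate)

    while len(lineages) > 1:
        k = len(lineages)
        time += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(range(k), 2)
        for idx in (a, b):  # the two merging branches end at this time
            leaves, start = lineages[idx]
            drop_mutations(leaves, time - start)
        merged = (lineages[a][0] | lineages[b][0], time)
        lineages = [l for idx, l in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
    return [[1 if i in site else 0 for site in sites] for i in range(n)]


for row in simulate_sequences(6, mutation_rate=1.0, seed=3):
    print("".join(map(str, row)))
```

Because every column corresponds to one branch of a single tree, any two columns are nested or disjoint, so the output always passes the 4-gamete test.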

50
Coalescent with Recombination
  • An individual may have one parent, or 2 parents

51
ARG Coalescent with recombination
  • Given mutation rate θ, recombination rate ρ,
    population size 2N (diploid), sample size n.
  • How can you generate the ARG (topology + branch
    lengths) efficiently?
  • How will you generate sequences for n
    individuals?
  • Given sequence data, can you reconstruct the ARG
    (topology)?

52
Recombination
  • Define r as the probability of recombining.
  • Note that the parameter is a scaled value which
    will be defined later
  • Assume k individuals in a generation. The
    following might happen
  • An individual arises because of a recombination
    event between two individuals (It will have 2
    parents).
  • Two individuals coalesce
  • Neither (Each individual has a distinct parent)
  • Multiple events (low probability)

53
Recombination
  • We ignore the case of multiple (> 1) events in
    one generation
  • Pr(No recombination) $\approx 1 - kr$
  • Pr(No coalescence) $\approx 1 - \binom{k}{2}/(2N)$
  • Consider scaled time in units of 2N generations.
    Thus the number of individuals increases with rate
    $kr \cdot 2N$, and decreases with rate $\binom{k}{2}$
  • The value 2rN is usually small, and therefore,
    the process will ultimately coalesce to a single
    individual (MRCA)

54
ARG
  • Let k = n
  • Define $\rho = 2rN$
  • Iterate until k = 1
  • Choose time from an exponential distribution with
    rate $k\rho + \binom{k}{2}$
  • Pick event as recombination with probability
    $k\rho / (k\rho + \binom{k}{2})$
  • If event is recombination, choose an individual
    to recombine, and a position, else choose a pair
    to coalesce.
  • Update k, and continue
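
The iteration can be sketched as a birth-death process on the number of lineages k. This sketch assumes the two competing rates from the previous slide in scaled time: recombination at rate k * rho, with rho standing for the scaled value 2rN, and coalescence at rate k(k-1)/2. It records only the event sequence, not the full graph, and the function name is hypothetical.

```python
import random


def simulate_arg_events(n: int, rho: float, seed=None):
    """Return a list of (time, kind, detail) events until one lineage is left."""
    rng = random.Random(seed)
    k, time, events = n, 0.0, []
    while k > 1:
        rec_rate = k * rho
        coal_rate = k * (k - 1) / 2
        total = rec_rate + coal_rate
        time += rng.expovariate(total)  # time to the next event of either kind
        if rng.random() < rec_rate / total:
            # Recombination: one lineage splits into two parents at a
            # breakpoint drawn uniformly on the unit interval.
            events.append((time, "recombination", rng.random()))
            k += 1
        else:
            events.append((time, "coalescence", None))
            k -= 1
    return events


events = simulate_arg_events(n=5, rho=0.5, seed=7)
print(sum(1 for e in events if e[1] == "recombination"), "recombinations")
```

Since coalescence grows quadratically in k and recombination only linearly, k cannot run away, and the loop ends at a single ancestor.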

55
Simulating sequences on the ARG
  • Generate topology and branch lengths as before
  • For each recombination, generate a position.
  • Next generate mutations at random on branch
    lengths
  • For a mutation, select a position as well.

56
Recombination events and ρ
  • Given ρ, n, can you compute the expected number
    of recombination events?
  • It can be shown that E(n, ρ) ≈ ρ log(n)
  • The question that people are really interested in
  • Given a set of sequences from a population,
    compute the recombination rate ρ
  • Given a population reconstruct the most likely
    history (as an ancestral recombination graph)
  • We will address this question in subsequent
    lectures

57
An algorithm for constructing a perfect phylogeny
  • We will consider the case where 0 is the
    ancestral state, and 1 is the mutated state. This
    restriction will be removed later.
  • In any tree, each node (except the root) has a
    single parent.
  • It is sufficient to construct a parent for every
    node.
  • In each step, we add a column and refine some of
    the nodes containing multiple children.
  • Stop if all columns have been considered.

58
Inclusion Property
  • For any pair of columns i, j whose sets of 1s
    overlap,
  • i < j if and only if i1 ⊇ j1
  • Note that if i < j then the edge containing i is an
    ancestor of the edge containing j

[Figure: edge i lies above edge j in the tree]
59
Example
   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
Initially, there is a single clade r, and each
node has r as its parent
60
Sort columns
  • Sort columns according to the inclusion property
    (note that the columns are already sorted here).
  • This can be achieved by considering the columns
    as binary representations of numbers (most
    significant bit in row 1) and sorting in
    decreasing order

   1 2 3 4 5
A  1 1 0 0 0
B  0 0 1 0 0
C  1 1 0 1 0
D  0 0 1 0 1
E  1 0 0 0 0
61
Add first column
  • In adding column i
  • Check each edge and decide which side you belong.
  • Finally add a node if you can resolve a clade

[Figure: tree after adding column 1. The root r has children B, D, and a new node u; u has children A, C, E.]
62
Adding other columns
  • Add other columns on edges using the ordering
    property

[Figure: final tree, with mutations (column numbers) on edges]
r
  edge 1
    E
    edge 2
      A
      edge 4
        C
  edge 3
    B
    edge 5
      D
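
The construction can be sketched as follows for the rooted case (0 ancestral, 1 mutated). The sketch sorts columns as binary numbers in decreasing order, then walks each species through its 1-columns in that order, so each column becomes one node below the edge carrying its mutation. The node names `m1`, `m2`, ... and the function name are illustrative choices; species names are assumed not to clash with them.

```python
def build_perfect_phylogeny(names, rows, root="r"):
    """Return {node: parent}; raise ValueError if no perfect phylogeny exists."""
    num_cols = len(rows[0])
    # Read each column as a binary number with the first row as the most
    # significant bit, and sort in decreasing order.
    order = sorted(range(num_cols),
                   key=lambda c: [row[c] for row in rows],
                   reverse=True)
    parent = {}
    for name, row in zip(names, rows):
        node = root
        for c in order:
            if row[c] != 1:
                continue
            mutation_node = f"m{c + 1}"
            # A column must hang below the same node for every species.
            if parent.setdefault(mutation_node, node) != node:
                raise ValueError(f"column {c + 1} conflicts: no perfect phylogeny")
            node = mutation_node
        parent[name] = node  # attach the species below its last mutation
    return parent


matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
tree = build_perfect_phylogeny("ABCDE", matrix)
for node, par in sorted(tree.items()):
    print(f"{node} <- {par}")
```

On the example matrix this reproduces the tree above: E below mutation 1, A below 2, C below 4, B below 3, and D below 5.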
63
Unrooted case
  • Switch the values in each column, so that 0 is
    the majority element.
  • Apply the algorithm for the rooted case

64
(No Transcript)
65
(No Transcript)