Title: Inferring human demographic history from DNA sequence data
1Inferring human demographic history from DNA
sequence data
- Apr. 28, 2009
- J. Wall
- Institute for Human Genetics, UCSF
2Standard model of human evolution
3Standard model of human evolution (Origin and spread of genus Homo)
2 to 2.5 Mya
4Standard model of human evolution (Origin and spread of genus Homo)
?
?
1.6 to 1.8 Mya
5Standard model of human evolution (Origin and spread of genus Homo)
0.8 to 1.0 Mya
6Standard model of human evolution: Origin and spread of modern humans
150 to 200 Kya
7Standard model of human evolution: Origin and spread of modern humans
100 Kya
8Standard model of human evolution: Origin and spread of modern humans
40 to 60 Kya
9Standard model of human evolution: Origin and spread of modern humans
15 to 30 Kya
10Estimating demographic parameters
- How can we quantify this qualitative scenario into an explicit model?
- How can we choose a model that is both biologically feasible and computationally tractable?
- How do we estimate parameters and quantify uncertainty in parameter estimates?
11Estimating demographic parameters
- Calculating full likelihoods (under realistic models including recombination) is computationally infeasible.
- So, compromises need to be made if one is interested in parameter estimation.
12African populations
10 populations, 229 individuals
13African populations
Mandenka (Bantu)
Biaka (Pygmies)
San (Bushmen)
61 autosomal loci, 350 Kb of sequence data
14A simple model of African population history
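[Figure: two-population split model with parameters T (split time), m (migration rate), and g1, g2 (growth rates); populations: Biaka (or San) and Mandenka]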
15Estimation method
- We use a composite-likelihood method (cf. Plagnol and Wall 2006) that uses information from the joint frequency spectrum, such as:
  - Numbers of segregating sites
  - Numbers of shared and fixed differences
  - Tajima's D
  - FST
  - Fu and Li's D
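As an illustration of one of the summary statistics listed above, here is a minimal sketch of Tajima's D using the standard published formula. The input encoding (haplotypes as strings of 0/1 characters) and the function name `tajimas_d` are assumptions made for this example, not something taken from the talk.

```python
import math

def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length haplotypes coded as '0'/'1' strings."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    # Count of the '1' allele at each site
    counts = [sum(int(h[i]) for h in haplotypes) for i in range(n_sites)]
    seg = [c for c in counts if 0 < c < n]   # segregating sites only
    s = len(seg)
    if s == 0:
        return 0.0  # convention chosen for this sketch: no variation -> 0
    # Mean number of pairwise differences
    pi = sum(2.0 * c * (n - c) / (n * (n - 1)) for c in seg)
    # Constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# Toy data: only singletons (excess of rare variants) vs. only
# intermediate-frequency variants.
singletons = ["1000", "0100", "0010", "0001"]
intermediate = ["1100", "1100", "0011", "0011"]
print(tajimas_d(singletons), tajimas_d(intermediate))
```

An excess of rare variants pushes D below zero and an excess of intermediate-frequency variants pushes it above zero, which is why the statistic carries information about growth and structure.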
17Estimating likelihoods
[Figure: polymorphisms in Pop1 and Pop2 classified as Pop 1 private polymorphisms, Pop 2 private polymorphisms, or shared polymorphisms]
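The classification of sites into private and shared polymorphisms can be sketched as follows. This is only an illustration: the 0/1 haplotype encoding and the function name `classify_sites` are assumptions, and fixed differences are counted as well because they appear in the list of summary statistics.

```python
def classify_sites(pop1, pop2):
    """Count private, shared and fixed-difference sites.

    pop1, pop2: lists of equal-length haplotypes coded as '0'/'1' strings.
    """
    counts = {"private1": 0, "private2": 0, "shared": 0, "fixed": 0}
    for i in range(len(pop1[0])):
        alleles1 = {h[i] for h in pop1}
        alleles2 = {h[i] for h in pop2}
        poly1 = len(alleles1) > 1
        poly2 = len(alleles2) > 1
        if poly1 and poly2:
            counts["shared"] += 1       # polymorphic in both samples
        elif poly1:
            counts["private1"] += 1     # polymorphic only in Pop 1
        elif poly2:
            counts["private2"] += 1     # polymorphic only in Pop 2
        elif alleles1 != alleles2:
            counts["fixed"] += 1        # monomorphic in both, different alleles
    return counts

pop1 = ["1010", "0011", "0010"]
pop2 = ["1100", "0000", "0000"]
print(classify_sites(pop1, pop2))
```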
22Estimating likelihoods
- We assume these other statistics are multivariate normal.
- Then, we run simulations to estimate the means and the covariance matrix.
- This accounts (in a crude way) for dependencies across different summary statistics.
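A minimal sketch of this step, assuming the simulated summary statistics are available as a list of equal-length vectors (one vector per simulated data set). The function names are hypothetical, and the pure-Python Cholesky factorization is just one convenient way to evaluate the multivariate normal density.

```python
import math

def mean_and_cov(samples):
    """Sample mean vector and (unbiased) covariance matrix of simulated statistics."""
    n, k = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(k)]
    cov = [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / (n - 1)
            for b in range(k)] for a in range(k)]
    return mean, cov

def cholesky(m):
    """Lower-triangular L with L * L^T = m (m must be positive definite)."""
    k = len(m)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def mvn_loglik(x, mean, cov):
    """Log-density of observed statistics x under N(mean, cov)."""
    k = len(x)
    low = cholesky(cov)
    # Forward substitution: solve low * z = x - mean
    z = []
    for i in range(k):
        z.append((x[i] - mean[i] - sum(low[i][t] * z[t] for t in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(k))
    return -0.5 * (k * math.log(2 * math.pi) + log_det + sum(v * v for v in z))

# Sanity check: standard bivariate normal evaluated at its mean
print(mvn_loglik([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In use, `mean_and_cov` would be applied to statistics simulated at one point of the parameter grid, and `mvn_loglik` evaluated at the statistics observed in the real data.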
23Composite likelihood
- We form a composite likelihood by assuming these two classes of summary statistics are independent of each other.
- We estimate the (composite) likelihood over a grid of values of g1, g2, T and M and tabulate the MLE.
- We also use standard asymptotic assumptions to estimate confidence intervals.
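The grid search can be sketched for a single parameter as below. Two things here are assumptions rather than details from the talk: the two log-likelihood components are toy functions, and the confidence interval uses the common likelihood-ratio cutoff of 1.92 log-likelihood units (half the 95% quantile of a chi-square with one degree of freedom) as one reading of "standard asymptotic assumptions". The real analysis uses a four-parameter grid.

```python
def grid_mle_with_ci(grid, loglik_a, loglik_b, drop=1.92):
    """Grid-search MLE and approximate 95% CI for one parameter.

    loglik_a, loglik_b: log-likelihood functions for the two classes of
    summary statistics; treating them as independent means their
    log-likelihoods simply add.
    """
    total = [loglik_a(v) + loglik_b(v) for v in grid]
    best = max(total)
    mle = grid[total.index(best)]
    # Keep every grid value whose log-likelihood is within `drop` of the maximum
    inside = [v for v, ll in zip(grid, total) if best - ll <= drop]
    return mle, min(inside), max(inside)

# Toy components (hypothetical): quadratic log-likelihoods peaking at 3 and 5
component_a = lambda t: -0.5 * (t - 3) ** 2
component_b = lambda t: -0.5 * (t - 5) ** 2
print(grid_mle_with_ci(list(range(9)), component_a, component_b))
```

With a coarse grid the interval endpoints are only as precise as the grid spacing, which is one reason reported intervals from grid-based methods tend to look rounded.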
24Estimates (with 95% CIs)
Parameter    Man-Bia          Man-San
g1 (000s)    0 (0-3.8)        0 (0-3.8)
g2 (000s)    4 (0-7.9)        2 (0-11)
T (000s)     450 (300-640)    100 (77-550)
M (= 4Nm)    10 (8.4-12)      3 (2.2-4)
26Fit of the null model
- How well does the demographic null model fit the patterns of genetic variation found in the actual data?
- Quite well. The model accurately reproduces both the summary statistics used in the original fitting (e.g., Tajima's D in each population) and other aspects of the data (e.g., estimates of ρ = 4Nr).
28Population growth
[Figure: population size versus time]
spread of agriculture and animal husbandry?
31Ancestral structure in Africa
- At face value, these results suggest that population structure within Africa is old, and predates the migration of modern humans out of Africa.
- Is there any evidence for additional (unknown) ancient population structure within Africa?
32Model of ancestral structure
[Figure: two-population model (parameters T, g1, m, g2; populations Biaka (or San) and Mandenka) extended with an archaic human population]
33Standard model of human evolution: Origin and spread of modern humans
100 Kya
34Admixture mapping
[Figure: chromosomes made up of modern human DNA and Neandertal DNA]
Orange chunks are 10 to 100 Kb in length
39Genealogy with archaic ancestry
[Figure: genealogy of modern humans and archaic humans; time axis running to the present]
40Genealogy without archaic ancestry
[Figure: genealogy of modern humans and archaic humans; time axis running to the present]
41Our main questions
- What pattern does archaic ancestry produce in DNA sequence polymorphism data (from extant humans)?
- How can we use data to
  - estimate the contribution of archaic humans to the modern gene pool (c)?
  - test whether c > 0?
42Genealogy with archaic ancestry (Mutations added)
[Figure: genealogy of modern humans and archaic humans with mutations added; time axis running to the present]
46Patterns in DNA sequence data
Sequence 1  A T C C A C A G C T G
Sequence 2  A G C C A C G G C T G
Sequence 3  T G C G G T A A C C T
Sequence 4  A G C C A C A G C T G
Sequence 5  T G T G G T A A C C T
Sequence 6  A G C C A T A G A T G
Sequence 7  A G C C A T A G A T G
We call the sites in red congruent sites: these are sites inferred to be on the same branch of an unrooted tree.
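One way to find such sites programmatically is to group sites by the split of sequences they induce, as sketched below. This treats two biallelic sites as congruent when they divide the sequences into the same two groups, which is an interpretation of "on the same branch of an unrooted tree" rather than a definition quoted from the talk. The function names are hypothetical.

```python
from collections import defaultdict

SEQS = [
    "ATCCACAGCTG",  # Sequence 1
    "AGCCACGGCTG",  # Sequence 2
    "TGCGGTAACCT",  # Sequence 3
    "AGCCACAGCTG",  # Sequence 4
    "TGTGGTAACCT",  # Sequence 5
    "AGCCATAGATG",  # Sequence 6
    "AGCCATAGATG",  # Sequence 7
]

def split_at(seqs, i):
    """Split of sequence numbers (1-based) induced by site i, ignoring allele labels.

    Returns None if the site is not biallelic.
    """
    groups = defaultdict(set)
    for idx, s in enumerate(seqs):
        groups[s[i]].add(idx + 1)
    if len(groups) != 2:
        return None
    return frozenset(frozenset(g) for g in groups.values())

def congruent_groups(seqs):
    """Lists of 1-based site positions that induce the same split."""
    by_split = defaultdict(list)
    for i in range(len(seqs[0])):
        sp = split_at(seqs, i)
        if sp is not None:
            by_split[sp].append(i + 1)
    return [sites for sites in by_split.values() if len(sites) > 1]

print(congruent_groups(SEQS))  # -> [[1, 4, 5, 8, 10, 11]]
```

For the seven sequences shown, sites 1, 4, 5, 8, 10 and 11 all separate sequences 3 and 5 from the rest.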
47Linkage disequilibrium (LD)
- LD is the nonrandom association of alleles at different sites.

Low LD                 High LD
(high recombination)   (low recombination)
A C                    A C
A T                    A C
A C                    A C
A T                    A C
G C                    G T
G T                    G T
G C                    G T
G T                    G T
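The two panels can be made quantitative with the usual r² measure of LD. The sketch below computes r² for the eight two-site haplotypes in each panel; the function name `r_squared` is an assumption, and the code assumes both sites are polymorphic in the sample.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic sites, given two-character haplotype strings."""
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]         # reference allele at each site
    p_a = sum(h[0] == a for h in haplotypes) / n      # allele frequency at site 1
    p_b = sum(h[1] == b for h in haplotypes) / n      # allele frequency at site 2
    p_ab = sum(h == a + b for h in haplotypes) / n    # haplotype frequency
    d = p_ab - p_a * p_b                              # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

low_ld = ["AC", "AT", "AC", "AT", "GC", "GT", "GC", "GT"]
high_ld = ["AC", "AC", "AC", "AC", "GT", "GT", "GT", "GT"]
print(r_squared(low_ld), r_squared(high_ld))  # -> 0.0 1.0
```

In the low-LD panel every combination of alleles occurs equally often, so D = 0; in the high-LD panel only two haplotypes exist, so the alleles at one site perfectly predict the other.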
48Measuring congruence
- To measure the level of congruence in SNP data from larger regions we define a score function
  $S = \max_{(i_1, \ldots, i_k)} S(i_1, \ldots, i_k)$
- where $S(i_1, \ldots, i_k) = \sum_{j=1}^{k-1} S(i_j, i_{j+1})$
- and $S(i_j, i_{j+1})$ is a function of both congruence (or near-congruence) and physical distance between $i_j$ and $i_{j+1}$.
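Reading the score as the maximum, over ordered subsets of sites, of the sum of pairwise scores between consecutive chosen sites, it can be computed with a simple dynamic program. This is a sketch under that assumption. The pairwise score used in the example (a bonus of 1000 plus physical distance for congruent pairs, a large penalty otherwise) is a toy stand-in, not the scoring function used in the talk.

```python
def max_chain_score(n_sites, pair_score):
    """Maximum over ordered subsets of sites of the summed consecutive pair scores.

    pair_score(i, j) gives the score for consecutive chosen sites i < j.
    A subset with a single site has score 0.
    """
    best = [0.0] * n_sites   # best[j]: best score of a chain ending at site j
    for j in range(n_sites):
        for i in range(j):
            best[j] = max(best[j], best[i] + pair_score(i, j))
    return max(best) if best else 0.0

# Toy example (all values hypothetical)
positions = [100, 400, 900, 1500]   # physical positions in bp
splits = ["x", "y", "x", "x"]       # sites with the same label are congruent

def toy_pair_score(i, j):
    if splits[i] == splits[j]:
        return 1000 + (positions[j] - positions[i])
    return -10 ** 6

print(max_chain_score(len(positions), toy_pair_score))  # -> 3400.0
```

The best chain here uses sites 0, 2 and 3 (1800 + 1600); the dynamic program runs in time quadratic in the number of sites instead of enumerating all subsets.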
49An example (CHRNA4)
How often is S from simulations greater than or equal to the S value from the actual data? p = 0.025
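The question on this slide is an empirical p-value: the fraction of simulated S values that are at least as large as the observed one. A minimal sketch, with made-up numbers standing in for real simulation output:

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated values greater than or equal to the observed value."""
    return sum(s >= observed for s in simulated) / len(simulated)

# Hypothetical values, for illustration only
simulated_scores = list(range(20))   # stand-in for S values simulated under the null model
observed_score = 18
print(empirical_p_value(observed_score, simulated_scores))  # -> 0.1
```

The resolution of such a p-value is limited by the number of simulations: with N simulated data sets, the smallest nonzero value it can take is 1/N.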
53S is sensitive to ancient admixture
54General approach
- We use the model parameters estimated before (growth rates, migration rate, split time) as a demographic null model.
- Is our null model sufficient to explain the patterns of LD in the data?
- We test this by comparing the observed S values with the distribution of S values calculated from data simulated under the null model.
55Distribution of p-values (Mandenka and San)
[Figure: histogram of p-values; axes: p-value, frequency]
Global p-value = 2.5 × 10^-5
57Estimating ancient admixture rates
The global p-values for S are highly significant in every population that we've studied! If we estimate the ancient admixture rate in our (composite) likelihood framework, we can exclude the hypothesis of no ancient admixture for all populations studied.
58A region on chromosome 4
19 mutations (from 6 Kb of sequence) separate 3 Biaka sequences from all of the other sequences in our sample. Simulations suggest this cannot be caused by recent population structure (p < 10^-3). This corresponds to isolation lasting 1.5 million years!
60Possible explanations
- Isolation followed by later mixing is a recurrent feature of human population history.
- Mixing between archaic humans and modern humans happened at least once prior to the exodus of modern humans out of Africa.
- Some other feature of population structure is unaccounted for in our simple models.
61Acknowledgments
- Collaborators
  - Mike Hammer (U. of Arizona)
  - Vincent Plagnol (Cambridge University)
- Samples
  - Fondation Jean Dausset (CEPH)
  - Y Chromosome Consortium (YCC)
- Funding
  - National Science Foundation
  - National Institutes of Health