Title: Approximate Bayesian Methods in Genetic Data Analysis
1Approximate Bayesian Methods in Genetic Data
Analysis
Mark A. Beaumont, University of Reading
2Acknowledgements
- Wenyang Zhang, University of Kent
- David Balding, Imperial College, London
- Dave Tallmon, Juneau, Alaska
- Arnaud Estoup, Montpellier
- BBSRC, NERC
3General Problem
In population genetics the data we observe have
many possible unobservable causes, which
generally follow a hierarchical structure. For
example, genetic data depend on some unknown
genealogical history, which in turn depends on
the mutation model, demographic history, and the
effects of selection. These, in turn, depend on
the ecology of the organism. Therefore we have
many competing explanations for the data and we
wish to choose among them. How to do this?
4Be pragmatic: take a Bayesian approach
Bayesian analysis offers a flexible framework for
modelling uncertainty. MCMC has made this
possible for population genetic problems.
5Problems with MCMC-based methods of genealogical
inference
MCMC is useful, but:
- Slow, with problems of convergence.
- Difficult to code up.
- Difficult to modify flexibly to different
scenarios.
- Difficulty addressing the questions that
biologists want answered. (Hence the rise of
cladistic, network-based methods like NCA.)
6Method for Sampling from Posterior Distribution
Consider parameters $\Phi$ and data $D$.
- Simulate samples $(\Phi_i, D_i)$ from the joint density $P(\Phi, D)$:
- First simulate from the prior: $\Phi_i \sim P(\Phi)$.
- Then simulate from the likelihood: $D_i \sim P(D \mid \Phi_i)$.
The posterior distribution for any given $D$ can be
estimated by the proportion of all simulated
points that correspond to that particular $D$ and $\Phi$,
divided by the proportion of points corresponding
to $D$ (ignoring $\Phi$).
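As a minimal sketch of the scheme above, assume a toy model that is not from the talk: a uniform prior on a success probability `phi` and binomial data with 10 trials. Because the data are discrete, simulated data sets can match the observed value exactly, and the retained parameter values are draws from the posterior. The function name and all settings are illustrative assumptions.

```python
import numpy as np

def exact_rejection_posterior(observed, n_sims=200_000, n_trials=10, seed=0):
    """Keep the prior draws whose simulated data equal the observed data exactly."""
    rng = np.random.default_rng(seed)
    # Step 1: draw parameters from the prior (uniform on [0, 1], an assumption).
    phi = rng.uniform(0.0, 1.0, size=n_sims)
    # Step 2: draw one data set from the likelihood for each parameter value.
    data = rng.binomial(n_trials, phi)
    # Step 3: points matching the observed data are draws from P(phi | D).
    return phi[data == observed]

samples = exact_rejection_posterior(observed=7)
# For this toy model the exact posterior is Beta(8, 4), with mean 8/12.
print(len(samples), samples.mean())
```

For this toy case the sample mean can be checked against the known Beta(8, 4) posterior mean, which is what makes it a convenient sanity test of the sampling scheme.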
7Prior: $p(\Phi)$
Marginal likelihood: $p(D)$
Likelihood: $p(D \mid \Phi)$
Posterior distribution: $p(\Phi \mid D)$
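The four quantities listed on this slide are tied together by Bayes' theorem. The relation is standard and is written out here only for reference, with the marginal likelihood expressed as an integral over the parameters:

```latex
p(\Phi \mid D) = \frac{p(D \mid \Phi)\, p(\Phi)}{p(D)},
\qquad
p(D) = \int p(D \mid \Phi)\, p(\Phi)\, d\Phi
```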
8Replace the data with summary statistics
- Key Points
- For most problems, we can't hit the data exactly.
- But similar data may have similar posterior
distributions.
- If we replace the data with summary statistics,
then it is easier to decide how similar data
sets are to each other.
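The key points above lead to rejection sampling on summary statistics: keep the simulated parameter values whose summaries fall within a tolerance of the observed ones. The sketch below assumes a toy Poisson model, a uniform prior on (0, 50), the sample mean and variance as summaries, and a tolerance set as a distance quantile. None of these choices come from the talk.

```python
import numpy as np

def abc_rejection(phi, S, s_obs, accept_fraction=0.01):
    """Keep the prior draws whose simulated summaries lie closest to s_obs."""
    scale = S.std(axis=0)  # put every statistic on a unit-variance scale
    dist = np.linalg.norm((S - s_obs) / scale, axis=1)
    delta = np.quantile(dist, accept_fraction)  # tolerance as a distance quantile
    return phi[dist <= delta], delta

rng = np.random.default_rng(1)
n_sims, n_obs = 50_000, 20

# Simulate from the prior, then from the (toy) likelihood.
phi = rng.uniform(0.0, 50.0, size=n_sims)
data = rng.poisson(phi[:, None], size=(n_sims, n_obs))
S = np.column_stack([data.mean(axis=1), data.var(axis=1)])

# Hypothetical observed data, generated here with a true rate of 10.
observed = rng.poisson(10.0, size=n_obs)
s_obs = np.array([observed.mean(), observed.var()])

accepted, delta = abc_rejection(phi, S, s_obs)
print(len(accepted), accepted.mean(), delta)
```

Setting the tolerance through `accept_fraction` fixes the number of retained points in advance. A fixed distance threshold would work equally well.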
10History
- Tavaré et al. (1997, Genetics): specify $P(S \mid \Phi)$, use rejection to estimate $P(\Phi \mid S)$.
- Fu and Li (1997, MBE): use $S$ and rejection to estimate the posterior distribution of coalescence times (i.e. $P(G \mid S)$).
- Weiss and von Haeseler (1998, Genetics): use rejection to estimate the likelihood $P(S \mid \Phi)$.
- Pritchard et al. (1999, MBE): use rejection to estimate $P(\Phi, G \mid S)$.
- Wall (2000, MBE): uses rejection to estimate $P(S \mid \Phi)$.
- Beaumont et al. (2002, Genetics): use regression/rejection to estimate $P(\Phi \mid S)$.
- Marjoram et al. (2003, PNAS): use MCMC and rejection to estimate $P(\Phi \mid S)$.
Notation: $\Phi$ = demographic/mutational parameters; $S$ = summary statistics; $G$ = genealogy.
12Beaumont, Zhang, and Balding (2002) Approximate
Bayesian Computation in Population Genetics.
Genetics 162: 2025-2035.
- This is a problem of density estimation. We want
to use information about the relationship between
the summary statistics and the parameters in the
vicinity of the observed summary statistics.
- Keep the idea of accepting points close to those
observed in the data.
- Use multiple regression to correct for the
relationship between summary statistics and
parameter values.
- Downweight points further away from the observed
values.
- The idea is that we should be able to accept many
more points.
13Local Linear Regression
Assume we have observed a $d$-dimensional vector of
summary statistics $s$, and we have $n$ random draws
of a (scalar) parameter $\Phi_1, \ldots, \Phi_n$ and corresponding
summary statistics $S_1, \ldots, S_n$. We scale $s$ and $S_1, \ldots, S_n$
so that $S_1, \ldots, S_n$ have unit variance.
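14We want to minimize, over $\alpha$ and $\beta$,
$$\sum_{i=1}^{n} \left\{ \Phi_i - \alpha - (S_i - s)^T \beta \right\}^2 K_\delta\left(\lVert S_i - s \rVert\right)$$
where $\Phi_i$ are the simulated parameter values, $S_i$ the simulated summary statistics, $s$ the observed summary statistics, and $K_\delta$ is the Epanechnikov kernel with bandwidth $\delta$ and normalising constant $c$:
$$K_\delta(t) = \begin{cases} c\,\delta^{-1} \left\{ 1 - (t/\delta)^2 \right\}, & t \le \delta \\ 0, & t > \delta \end{cases}$$
15The solution is
$$(\hat{\alpha}, \hat{\beta}^T)^T = (X^T W X)^{-1} X^T W \boldsymbol{\Phi}$$
where $X$ is the $n \times (d+1)$ matrix whose $i$-th row is $(1, (S_i - s)^T)$, $W$ is the diagonal matrix with entries $K_\delta(\lVert S_i - s \rVert)$, and $\boldsymbol{\Phi} = (\Phi_1, \ldots, \Phi_n)^T$.
16Our best estimate of the posterior mean is then
$$\hat{\alpha} = e_1^T (X^T W X)^{-1} X^T W \boldsymbol{\Phi}$$
where $e_1$ is a $(d+1)$-length vector $(1, 0, \ldots, 0)$.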
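The weighted least-squares fit on these slides can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the author's code: the function names, the toy linear model used to exercise it, and the bandwidth `delta=1.0` are all hypothetical, and the summary statistics are assumed to be already scaled to unit variance.

```python
import numpy as np

def epanechnikov(t, delta):
    """Epanechnikov kernel for distances t >= 0; zero beyond the bandwidth delta."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= delta, 0.75 / delta * (1.0 - (t / delta) ** 2), 0.0)

def local_linear_fit(phi, S, s_obs, delta):
    """Weighted regression of phi on (S - s_obs).

    Returns (alpha_hat, beta_hat, weights); alpha_hat is the estimate of the
    posterior mean at s_obs.
    """
    dist = np.linalg.norm(S - s_obs, axis=1)
    w = epanechnikov(dist, delta)
    # Design matrix: a column of ones followed by the centred summaries.
    X = np.column_stack([np.ones(len(phi)), S - s_obs])
    XtW = X.T * w  # same as X^T W with W = diag(w)
    coef = np.linalg.solve(XtW @ X, XtW @ phi)
    return coef[0], coef[1:], w

# Toy check on simulated points with a known linear relationship.
rng = np.random.default_rng(2)
S = rng.normal(size=(5000, 2))
phi = 1.5 + S @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=5000)
s_obs = np.zeros(2)

alpha_hat, beta_hat, w = local_linear_fit(phi, S, s_obs, delta=1.0)
print(alpha_hat, beta_hat)
```

Solving the normal equations with `np.linalg.solve` avoids forming the matrix inverse explicitly. The constant 0.75 in the kernel cancels in the fit, so any positive normalising constant gives the same coefficients.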
17[Figure: Parameter plotted against Summary Statistic, with a Weight scale from 0 to 1]
18Obtaining posterior densities and other summaries
using regression approach.
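We assume that the error distribution is
constant in the interval and adjust the parameter
values as
$$\Phi_i^* = \Phi_i - (S_i - s)^T \hat{\beta}$$
where $\hat{\beta}$ is the vector of fitted regression slopes.
The posterior density for $\Phi$ can be approximated as
$$\hat{p}(\Phi \mid s) = \frac{\sum_i K_\Delta(\Phi_i^* - \Phi)\, K_\delta(\lVert S_i - s \rVert)}{\sum_i K_\delta(\lVert S_i - s \rVert)}$$
where $K_\Delta(t)$ is another Epanechnikov kernel with
bandwidth $\Delta$. Alternatively, we can use some other
density estimation method.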
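The adjustment and the kernel density estimate can be sketched as follows. The function takes the regression slopes and the kernel weights as inputs. In the toy usage the slope is simply set to its true value, and the bandwidth, grid, and simulated points are all illustrative assumptions.

```python
import numpy as np

def epanechnikov(t, bandwidth):
    """Epanechnikov kernel, symmetric in t and integrating to one."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= bandwidth,
                    0.75 / bandwidth * (1.0 - (t / bandwidth) ** 2), 0.0)

def adjusted_density(phi, S, s_obs, beta_hat, weights, grid, big_delta):
    """Regression-adjust phi towards s_obs, then estimate its density on a grid."""
    # Adjustment: remove the fitted linear trend in the summary statistics.
    phi_star = phi - (S - s_obs) @ beta_hat
    # Weighted kernel density estimate of the adjusted values.
    k = epanechnikov(grid[:, None] - phi_star[None, :], big_delta)
    density = (k * weights).sum(axis=1) / weights.sum()
    return phi_star, density

# Toy usage: one summary statistic, slope assumed known (2.0).
rng = np.random.default_rng(3)
S = rng.normal(size=(5000, 1))
phi = 5.0 + 2.0 * S[:, 0] + rng.normal(scale=0.5, size=5000)
s_obs = np.zeros(1)
weights = epanechnikov(S[:, 0], 1.0)  # weights from distance to s_obs

grid = np.linspace(2.0, 8.0, 601)
phi_star, density = adjusted_density(
    phi, S, s_obs, np.array([2.0]), weights, grid, big_delta=0.3)
print(grid[density.argmax()])
```

Because the weights appear in both numerator and denominator, the estimated density integrates to one over a grid wide enough to cover the adjusted values.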
19Model Comparison
As noted in Pritchard et al. (1999), we can compare
two models, $M_1$ and $M_2$, by evaluating the marginal
distribution of the summary statistics at $s$, i.e.
$P(s \mid M_1)$ and $P(s \mid M_2)$.
We could use the original Pritchard method (proportion
of points within the tolerance window). Alternatively,
use multivariate kernel methods to estimate the
density.
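The proportion-within-tolerance approach can be sketched as below. The two toy models (plain Poisson counts versus overdispersed counts), the observed summaries, the tolerance, and the equal prior model probabilities are all assumptions made for illustration.

```python
import numpy as np

def acceptance_fraction(S, s_obs, scale, tol):
    """Proportion of simulated summaries within the tolerance window around s_obs."""
    dist = np.linalg.norm((S - s_obs) / scale, axis=1)
    return np.mean(dist <= tol)

def model_probabilities(S1, S2, s_obs, tol, prior1=0.5):
    """Approximate posterior probabilities of two models from acceptance fractions."""
    scale = np.vstack([S1, S2]).std(axis=0)  # common scaling for both models
    f1 = acceptance_fraction(S1, s_obs, scale, tol)
    f2 = acceptance_fraction(S2, s_obs, scale, tol)
    p1 = prior1 * f1 / (prior1 * f1 + (1.0 - prior1) * f2)
    return p1, 1.0 - p1

def summaries(counts):
    return np.column_stack([counts.mean(axis=1), counts.var(axis=1)])

rng = np.random.default_rng(4)
n_sims, n_obs = 20_000, 30

# Model 1: Poisson counts with rate phi drawn from a uniform prior.
phi1 = rng.uniform(0.0, 20.0, size=n_sims)
S1 = summaries(rng.poisson(phi1[:, None], size=(n_sims, n_obs)))

# Model 2: overdispersed counts (gamma-distributed rate multipliers, mean one).
phi2 = rng.uniform(0.0, 20.0, size=n_sims)
rates = phi2[:, None] * rng.gamma(2.0, 0.5, size=(n_sims, n_obs))
S2 = summaries(rng.poisson(rates))

s_obs = np.array([8.0, 8.0])  # hypothetical observed mean and variance
p1, p2 = model_probabilities(S1, S2, s_obs, tol=0.1)
print(p1, p2)
```

If no simulated point from either model falls inside the window, the ratio is undefined, so the tolerance has to be wide enough for at least one model to produce accepted points.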
21Example: estimation of $\theta$ ($2N\mu$) in a population
with constant size
- Simulate 100 data sets: 445 chromosomes, 8 linked
microsatellite loci (SMM).
- $\theta = 10$.
- Summary statistics: mean heterozygosity, mean
variance in allele length, number of distinct
haplotypes.
- Rectangular priors (0, 50).
- Point estimate: posterior mean.
- Also use MCMC (Batwing) to estimate posterior
mean (flat prior).
- Compare Mean Square Error of different methods.
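The three summary statistics named above can be computed from a matrix of haplotypes as sketched below. The input layout (one row per chromosome, one column per locus, entries as allele lengths in repeat units) and the specific estimators (unbiased expected heterozygosity, sample variance with `ddof=1`) are assumptions. The talk does not say which estimators were used.

```python
import numpy as np

def microsat_summaries(haplotypes):
    """Mean heterozygosity, mean variance in allele length, number of haplotypes.

    haplotypes: integer array of shape (n_chromosomes, n_loci).
    """
    n = haplotypes.shape[0]
    het = []
    for locus in haplotypes.T:
        _, counts = np.unique(locus, return_counts=True)
        freqs = counts / n
        # Unbiased expected heterozygosity at this locus (an assumed choice).
        het.append(n / (n - 1) * (1.0 - np.sum(freqs ** 2)))
    mean_het = float(np.mean(het))
    # Sample variance of allele length per locus, averaged over loci.
    mean_var = float(haplotypes.var(axis=0, ddof=1).mean())
    # Distinct multi-locus haplotypes are the distinct rows.
    n_haplotypes = len(np.unique(haplotypes, axis=0))
    return mean_het, mean_var, n_haplotypes

# Tiny hand-checkable example: 4 chromosomes, 2 loci.
haps = np.array([[10, 12],
                 [10, 12],
                 [11, 12],
                 [12, 14]])
mean_het, mean_var, n_haplotypes = microsat_summaries(haps)
print(mean_het, mean_var, n_haplotypes)
```

The same function would be applied to both the observed data and every simulated data set, so that the two sets of summaries are directly comparable.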
22Accuracy in the estimation of scaled mutation
rate $\theta = 2N\mu$
- Data: linked microsatellite loci.
- Summary statistics: mean variance in length,
mean heterozygosity, number of haplotypes.
[Figure: relative mean square error against tolerance for standard rejection, regression, and MCMC]
26Main Conclusion
The regression method allows a much larger
proportion of points to be used than the
rejection method. This means that more summary
statistics can be used in the regression method
without compromising accuracy.
27Generalisations
- You want to investigate a system which gives
rise to genetical and/or ecological data.
- Construct a (complicated) model
(individual-based, stage-structured,
genealogical) that gives rise to the same type
of data.
- Put priors on all the parameters.
- Decide on the parameters you want to make
inferences about.
- Choose summary statistics. Measure these from
your data.
- Perform simulations.
- Construct posterior distributions for the
parameters of interest, using e.g. the regression
methods here.
28Some Papers using Approximate Bayesian approaches
Pritchard method:
- Estoup et al. (2002, Genetics): Demographic history of invasion of islands by
cane toads. 10 microsatellite loci, 22 allozyme
loci. 4/3 summary statistics, 6 demographic
parameters.
- Estoup and Clegg (2003, Molecular
Ecology): Demographic history of colonisation of
islands by silvereyes.
Regression method:
- Tallmon et al. (2004, Genetics):
Estimating effective population size by temporal
method. One main parameter of interest (Ne), 4
summary statistics, tested on up to
- Estoup et al. (2004, Evolution, in press): Demographic
history of invasion of Australia by cane toads.
75/63 summary statistics, model comparison, up to
5 demographic parameters.
29From Tallmon, Luikart, and Beaumont (Genetics,
2004).
[Figure: panel labelled Coalescent]
30Future Work
- How to choose suitable summary statistics? Need
for data mining techniques: projection pursuit,
orthogonalisation, stepwise regression. Because
the method is quick, we can use e.g. MSE,
integrated squared error, coverage, etc. as an
ultimate criterion.
- Improve conditional density estimation.
- Improve choice of bandwidth in kernel.
- Use of transformations (e.g. log-linear
modelling).
- Quantile regression.