Approximate Bayesian Methods in Genetic Data Analysis

1
Approximate Bayesian Methods in Genetic Data
Analysis
Mark A. Beaumont, University of Reading
2
Acknowledgements
Wenyang Zhang, University of Kent
David Balding, Imperial College, London
Dave Tallmon, Juneau, Alaska
Arnaud Estoup, Montpellier
BBSRC, NERC
3
General Problem
In population genetics the data we observe have
many possible unobservable causes, which
generally follow a hierarchical structure. For
example, genetic data depend on some unknown
genealogical history, which in turn depends on
the mutation model, demographic history, and the
effects of selection. These, in turn, depend on
the ecology of the organism. Therefore we have
many competing explanations for the data and we
wish to choose among them. How to do this?
4
Be pragmatic: take a Bayesian approach
Bayesian analysis offers a flexible framework for
modelling uncertainty. MCMC has made this
possible for population genetic problems.
5
Problems with MCMC-based methods of genealogical
inference
MCMC is useful, but:

Slow; problems of convergence.
Difficult to code up.
Difficult to modify flexibly to different
scenarios.
Difficulty addressing the questions that
biologists want answered. (Hence the rise of
cladistic, network-based methods like NCA.)

6
Method for Sampling from Posterior Distribution
Consider parameters Φ and data D.
Simulate samples (Φi, Di) from the joint density P(Φ, D):
First simulate from the prior, Φi ~ P(Φ).
Then simulate from the likelihood, Di ~ P(D | Φi).
The posterior distribution for any given D can be
estimated by the proportion of all simulated
points that correspond to that particular D and Φ,
divided by the proportion of points corresponding
to D (ignoring Φ).
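The sampling scheme on this slide can be sketched for a toy problem in which the data are discrete, so that simulated data can match the observed data exactly. The model here (a Uniform(0, 1) prior and a binomial likelihood with 10 trials) and the function names are assumptions chosen for illustration; they do not come from the talk.

```python
import random

def simulate_joint(n_draws, n_trials, rng):
    """Draw (parameter, data) pairs from the joint density."""
    pairs = []
    for _ in range(n_draws):
        phi = rng.random()  # prior draw: Uniform(0, 1), an assumed toy prior
        # likelihood draw: number of successes in n_trials Bernoulli(phi) trials
        d = sum(rng.random() < phi for _ in range(n_trials))
        pairs.append((phi, d))
    return pairs

def posterior_draws(pairs, d_obs):
    """Keep the parameter values whose simulated data equal the observed data."""
    return [phi for phi, d in pairs if d == d_obs]

rng = random.Random(1)
pairs = simulate_joint(50_000, 10, rng)
accepted = posterior_draws(pairs, d_obs=7)
posterior_mean = sum(accepted) / len(accepted)
print(len(accepted), round(posterior_mean, 3))
```

For this toy model the exact posterior is Beta(8, 4), with mean 2/3, so the mean of the accepted draws should be close to that value.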
7
Prior p(Φ)
Marginal likelihood p(D)
Likelihood p(D | Φ)
Posterior distribution p(Φ | D)
8
Replace the data with summary statistics

Key Points
For most problems, we can't hit the data exactly.
But similar data may have similar posterior
distributions.
If we replace the data with summary statistics,
then it is easier to decide how similar data
sets are to each other.
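A minimal sketch of rejection on a summary statistic, assuming a toy model that is not from the talk: 50 normal observations with unknown mean, a flat prior on (-5, 5), the sample mean as the single summary statistic, and a hypothetical tolerance `tol`. Simulated parameter values are kept when their summary statistic falls within the tolerance of the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

def summary(data):
    """Summary statistic: the sample mean of each data set."""
    return data.mean(axis=-1)

# "Observed" data, simulated here with a true mean of 2.0 for illustration
n_obs = 50
observed = rng.normal(2.0, 1.0, size=n_obs)
s_obs = summary(observed)

# Simulate from the prior, then from the likelihood
n_sims = 20_000
phi = rng.uniform(-5.0, 5.0, size=n_sims)                   # assumed flat prior
sims = rng.normal(phi[:, None], 1.0, size=(n_sims, n_obs))  # one data set per draw
S = summary(sims)

# Accept draws whose summary statistic is close to the observed one
tol = 0.05
accepted = phi[np.abs(S - s_obs) <= tol]
print(len(accepted), accepted.mean(), s_obs)
```

Shrinking `tol` brings the accepted sample closer to the exact posterior given the summary statistic, at the cost of accepting fewer points.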

9
(No Transcript)
10
History
Tavaré et al. (1997, Genetics): specify P(S | Φ), use rejection to estimate P(Φ | S).
Fu and Li (1997, MBE): use S and rejection to estimate the posterior distribution of coalescence times (i.e. P(G | S)).
Weiss and von Haeseler (1998, Genetics): use rejection to estimate the likelihood P(S | Φ).
Pritchard et al. (1999, MBE): use rejection to estimate P(Φ, G | S).
Wall (2000, MBE): uses rejection to estimate P(S | Φ).
Beaumont et al. (2002, Genetics): use regression/rejection to estimate P(Φ | S).
Marjoram et al. (2003, PNAS): use MCMC and rejection to estimate P(Φ | S).
Φ: demographic/mutational parameters. S: summary statistics. G: genealogy.
11
(No Transcript)
12
Beaumont, Zhang, and Balding (2002). Approximate
Bayesian Computation in Population Genetics.
Genetics 162: 2025-2035.

This is a problem of density estimation. We want
to use information about the relationship between
the summary statistics and the parameters in the
vicinity of the observed summary statistics.
Keep the idea of accepting points close to those
observed in the data.
Use multiple regression to correct for
relationship between summary statistics and
parameter values.
Downweight points further away from the observed
values.
The idea is that we should be able to accept many
more points.

13
Local Linear Regression
Assume we have observed a d-dimensional vector of
summary statistics s, and we have n random draws
of a (scalar) parameter, Φ1,…,Φn, and corresponding
summary statistics S1,…,Sn. We scale s and S1,…,Sn
so that S1,…,Sn have unit variance.
14
We want to minimize
where
Epanechnikov kernel
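The formulas on this slide were images and are absent from the transcript. A plausible reconstruction, assuming the usual kernel-weighted least-squares criterion for local-linear regression, is given below. Here \Phi_i is the i-th parameter draw, S_i its summary statistics and s the observed summary statistics; the intercept \alpha, the slope vector \beta, the bandwidth \delta and the normalising constant c are notation assumed for this sketch.

```latex
\[
\sum_{i=1}^{n} \left\{ \Phi_i - \alpha - (S_i - s)^{T}\beta \right\}^{2}
K_{\delta}\!\left(\lVert S_i - s \rVert\right),
\qquad
K_{\delta}(t) =
\begin{cases}
c\,\delta^{-1}\left\{ 1 - (t/\delta)^{2} \right\}, & t \le \delta,\\
0, & t > \delta.
\end{cases}
\]
```

Points further than \delta from the observed summary statistics receive zero weight, and points inside the window are downweighted smoothly with distance.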
15
The solution is
where
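The formulas for this slide are also missing from the transcript. Assuming the criterion is a regression of \Phi_i on S_i - s with intercept \alpha and slopes \beta, weighted by a kernel K_\delta of bandwidth \delta (all assumed notation), the minimiser is the standard weighted least-squares solution:

```latex
\[
\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix}
= (X^{T} W X)^{-1} X^{T} W \Phi,
\qquad
X = \begin{pmatrix}
1 & S_{11} - s_1 & \cdots & S_{1d} - s_d \\
\vdots & \vdots & \ddots & \vdots \\
1 & S_{n1} - s_1 & \cdots & S_{nd} - s_d
\end{pmatrix},
\qquad
W = \operatorname{diag}\!\left( K_{\delta}(\lVert S_i - s \rVert) \right),
\]
```

with \Phi = (\Phi_1, \dots, \Phi_n)^{T} the vector of parameter draws and S_{ij} the j-th summary statistic of the i-th simulation.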
16
Our best estimate of the posterior mean is then
where e1 is a vector of length d+1, (1,0,…,0).
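A minimal NumPy sketch of the local-linear fit, under stated assumptions: the weights are Epanechnikov weights of a hypothetical bandwidth `delta`, the fit is ordinary weighted least squares, and the intercept (the first coefficient, picked out by e1) is taken as the posterior-mean estimate. The toy model at the bottom (flat prior, one noisy summary statistic) is invented for illustration.

```python
import numpy as np

def epanechnikov(t, delta):
    """Epanechnikov weights up to a constant: 1 - (t/delta)^2 for t <= delta, else 0."""
    u = np.asarray(t) / delta
    return np.where(u <= 1.0, 1.0 - u**2, 0.0)

def local_linear_fit(phi, S, s_obs, delta):
    """Weighted least-squares fit of phi on (S - s_obs); returns (alpha, beta, weights)."""
    scale = S.std(axis=0)                    # scale so each statistic has unit variance
    diff = (S - s_obs) / scale
    w = epanechnikov(np.linalg.norm(diff, axis=1), delta)
    keep = w > 0                             # points outside the window carry zero weight
    X = np.column_stack([np.ones(keep.sum()), diff[keep]])
    XtW = X.T * w[keep]                      # X^T W without forming the diagonal matrix
    coef = np.linalg.solve(XtW @ X, XtW @ phi[keep])
    return coef[0], coef[1:], w              # intercept = posterior-mean estimate

# Toy check: flat prior, one summary statistic equal to the parameter plus noise
rng = np.random.default_rng(0)
n = 20_000
phi = rng.uniform(-5.0, 5.0, size=n)
S = (phi + rng.normal(0.0, 0.2, size=n))[:, None]   # shape (n, d) with d = 1
s_obs = np.array([2.0])
alpha, beta, w = local_linear_fit(phi, S, s_obs, delta=0.3)
print(alpha, beta)
```

In this toy setting the true posterior mean at `s_obs` is about 2.0, so `alpha` should land near that value even though the window admits a large share of the simulated points.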
17
(Figure: axes Parameter and Summary Statistic; Weight scale from 0 to 1.)
18
Obtaining posterior densities and other summaries
using regression approach.
We assume that the error variance is constant in the
interval and adjust the parameter values as
The posterior density for Φ can be approximated as
where KΔ(t) is another Epanechnikov kernel with
bandwidth Δ. Alternatively, we can use some other
density estimation method.
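The two formulas on this slide are missing from the transcript. A reconstruction under stated assumptions: \hat{\beta} is the vector of fitted slopes from the local-linear regression of the parameter draws \Phi_i on S_i - s, K_\delta is the kernel used to weight the simulated points by their distance from s, and the density estimate is a kernel density estimate on the adjusted values with those same weights.

```latex
\[
\Phi_i^{*} = \Phi_i - (S_i - s)^{T}\hat{\beta},
\qquad
\hat{\pi}(\Phi \mid s) =
\frac{\sum_{i} K_{\Delta}(\Phi_i^{*} - \Phi)\, K_{\delta}(\lVert S_i - s \rVert)}
     {\sum_{i} K_{\delta}(\lVert S_i - s \rVert)}.
\]
```

The adjustment projects each accepted draw to the value it would be expected to take had its summary statistics equalled s exactly, which is why a wider acceptance window can be tolerated.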
19
Model Comparison
As noted in Pritchard et al. (1999), we can compare
two models, M1 and M2, by evaluating the marginal
distribution of the summary statistics at s, i.e.
by comparing p(s | M1) with p(s | M2).
We could use the original Pritchard method (proportion
of points within the tolerance window). Alternatively,
use multivariate kernel methods to estimate the
density.
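A minimal sketch of the proportion-within-tolerance comparison, using two invented toy models that differ only in their noise standard deviation (1.0 under M1, 3.0 under M2). The summary statistics, prior, tolerance and equal prior model probabilities are all assumptions for illustration. The proportion of simulations accepted under each model estimates the marginal density of the summary statistics at s, up to a common constant.

```python
import numpy as np

def simulate_summaries(noise_sd, n_sims, n_obs, rng):
    """Simulate data sets under one model and reduce each to (mean, standard deviation)."""
    mu = rng.uniform(-5.0, 5.0, size=n_sims)   # assumed flat prior on the mean
    x = rng.normal(mu[:, None], noise_sd, size=(n_sims, n_obs))
    return np.column_stack([x.mean(axis=1), x.std(axis=1, ddof=1)])

def acceptance_fraction(S, s_obs, tol):
    """Proportion of simulated summaries within Euclidean distance tol of s_obs."""
    return np.mean(np.linalg.norm(S - s_obs, axis=1) <= tol)

rng = np.random.default_rng(0)
n_obs = 50
observed = rng.normal(1.0, 1.0, size=n_obs)    # "observed" data, generated under M1
s_obs = np.array([observed.mean(), observed.std(ddof=1)])

f1 = acceptance_fraction(simulate_summaries(1.0, 20_000, n_obs, rng), s_obs, tol=0.2)
f2 = acceptance_fraction(simulate_summaries(3.0, 20_000, n_obs, rng), s_obs, tol=0.2)

# With equal prior model probabilities, the posterior probability of M1 is
# approximated by the share of accepted points that came from M1.
post_m1 = f1 / (f1 + f2)
print(f1, f2, post_m1)
```

The summary statistics are left unscaled here for brevity; in practice they would usually be scaled before distances are computed.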
20
(No Transcript)
21
Example: estimation of θ (2Nμ) in a population
with constant size

Simulate 100 sets: 445 chromosomes, 8 linked
microsatellite loci (SMM).
θ = 10.
Summary statistics: mean heterozygosity, mean
variance in allele length, number of distinct
haplotypes.
Rectangular priors (0, 50).
Point estimate: posterior mean.
Also use MCMC (Batwing) to estimate the posterior
mean (flat prior).
Compare the mean square error of the different methods.
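The three summary statistics named on this slide can be computed from a matrix of haplotypes as sketched below. The input layout (one row per chromosome, one column per locus, entries being allele lengths in repeat units) and the choice of the unbiased expected-heterozygosity estimator are assumptions; the slide does not say which estimator was used.

```python
import numpy as np

def summary_statistics(haps):
    """Return (mean heterozygosity, mean variance in allele length, distinct haplotypes)."""
    haps = np.asarray(haps)
    n = haps.shape[0]                        # number of chromosomes sampled
    het = []
    for locus in haps.T:                     # one column per microsatellite locus
        _, counts = np.unique(locus, return_counts=True)
        p = counts / n                       # allele frequencies at this locus
        het.append(n / (n - 1) * (1.0 - np.sum(p**2)))   # assumed unbiased estimator
    mean_het = float(np.mean(het))
    mean_var = float(np.mean(np.var(haps, axis=0, ddof=1)))
    n_haplotypes = len(np.unique(haps, axis=0))          # distinct rows
    return mean_het, mean_var, n_haplotypes

# Tiny made-up sample: 4 chromosomes typed at 2 linked loci
haps = [[10, 12],
        [10, 12],
        [11, 12],
        [11, 13]]
mean_het, mean_var, n_haplotypes = summary_statistics(haps)
print(mean_het, mean_var, n_haplotypes)
```

Applied to each simulated data set and to the observed data, a function of this kind yields the vectors of summary statistics that the rejection and regression steps compare.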

22
Accuracy in the estimation of the scaled mutation
rate θ = 2Nμ
Data: linked microsatellite loci.
Summary statistics: mean variance in length, mean
heterozygosity, number of haplotypes.
(Figure: relative mean square error against tolerance
for standard rejection, regression and MCMC.)
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
Main Conclusion
The regression method allows a much larger
proportion of points to be used than the
rejection method. This means that more summary
statistics can be used in the regression method
without compromising accuracy.
27
Generalisations

You want to investigate a system which gives
rise to genetical and/or ecological data.
Construct a (complicated) model
(individual-based, stage-structured,
genealogical) that gives rise to the same type
of data.
Put priors on all the parameters.
Decide on the parameters you want to make
inferences about.
Choose summary statistics. Measure these from
your data.
Perform simulations.
Construct posterior distributions for the
parameters of interest, using e.g. the regression
methods here.

28
Some Papers using Approximate Bayesian approaches
Pritchard method:
Estoup et al. (2002, Genetics): Demographic history of invasion of islands by cane toads. 10 microsatellite loci, 22 allozyme loci. 4/3 summary statistics, 6 demographic parameters.
Estoup and Clegg (2003, Molecular Ecology): Demographic history of colonisation of islands by silvereyes.
Regression method:
Tallmon et al. (2004, Genetics): Estimating effective population size by temporal method. One main parameter of interest (Ne), 4 summary statistics, tested on up to ...
Estoup et al. (2004, Evolution, in press): Demographic history of invasion of Australia by cane toads. 75/63 summary statistics, model comparison, up to 5 demographic parameters.
29
From Tallmon, Luikart, and Beaumont (Genetics,
2004).
Coalescent
30
Future Work
How to choose suitable summary statistics?
Need for data mining techniques.
Projection pursuit.
Orthogonalisation.
Stepwise regression.
Because the method is quick, can use e.g. MSE, integrated squared error, coverage etc. as an ultimate criterion.
Improve conditional density estimation.
Improve choice of bandwidth in kernel.
Use of transformations (e.g. log-linear modelling).
Quantile regression.
