Title: L4: Counting Recombination events
1 L4: Counting Recombination events
- Iteratively estimate
- (Z(0),P(0)), (Z(1),P(1)), ..., (Z(m),P(m))
- After convergence, Z(m) is the answer.
- Iteration
- Guess Z(0)
- For m = 1, 2, ...
- Sample P(m) from Pr(P | X, Z(m-1))
- Sample Z(m) from Pr(Z | X, P(m))
- How is this sampling done?
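One way to picture the two sampling steps is a small Gibbs sampler. This is a minimal sketch under stated assumptions, not the method used in the lecture: individuals are haploid, allele frequencies have a uniform Dirichlet prior, and Z has a uniform prior. The names `sample_P`, `sample_Z`, and `gibbs` are hypothetical.

```python
import math
import random

def sample_dirichlet(alphas, rng):
    # Dirichlet draw via normalized Gamma variates.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_P(X, Z, K, n_alleles, rng):
    """Sample P from Pr(P | X, Z): one Dirichlet per (population, locus)."""
    n_loci = len(X[0])
    P = []
    for k in range(K):
        freqs = []
        for l in range(n_loci):
            counts = [1.0] * n_alleles  # uniform prior (assumption)
            for x, z in zip(X, Z):
                if z == k:
                    counts[x[l]] += 1
            freqs.append(sample_dirichlet(counts, rng))
        P.append(freqs)
    return P

def sample_Z(X, P, K, rng):
    """Sample Z from Pr(Z | X, P), independently for each individual."""
    Z = []
    for x in X:
        logw = [sum(math.log(max(P[k][l][a], 1e-300)) for l, a in enumerate(x))
                for k in range(K)]
        top = max(logw)
        weights = [math.exp(v - top) for v in logw]  # rescaled for stability
        Z.append(rng.choices(range(K), weights=weights)[0])
    return Z

def gibbs(X, K, n_alleles=2, iters=200, seed=0):
    rng = random.Random(seed)
    Z = [rng.randrange(K) for _ in X]  # guess Z(0)
    P = None
    for _ in range(iters):
        P = sample_P(X, Z, K, n_alleles, rng)  # P(m) given Z(m-1)
        Z = sample_Z(X, P, K, rng)             # Z(m) given P(m)
    return Z, P
```

Here `X[i][l]` is the allele index of individual i at locus l. Population labels are only identified up to permutation, so two runs may swap the names of the clusters.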
3 Allowing for admixture
- Define q(i,k) as the fraction of individual i's genome that
originated from population k.
- Iteration
- Guess Z(0)
- For m = 1, 2, ...
- Sample P(m), Q(m) from Pr(P,Q | X, Z(m-1))
- Sample Z(m) from Pr(Z | X, P(m), Q(m))
4 Estimating Z (admixture case)
- Instead of estimating Pr(Z(i) = k | X, P, Q) (the origin
of individual i is k), we estimate the origin of each
allele copy of individual i separately.
5 Results on admixture prediction: simulated data
6 Results: Thrush data
- For each individual, q(i) is plotted as the
distance to the opposite side of the triangle.
- The assignment is reliable, and there is evidence
of admixture.
7 Population Structure
- 377 locations (loci) were sampled in 1000 people
from 52 populations.
- 6 genetic clusters were obtained, 5 of which
corresponded to major geographic regions (Rosenberg
et al. Science 2002)
8 NJ versus Structure: thrush data
- Objective function is different in standard
clustering algorithms!
9 Population sub-structure: research problem
- Systematically explore the effect of admixture.
Can admixture be predicted for a locus, or for an
individual?
- The sampling approach may or may not be
appropriate. Formulate as an optimization/learning
problem.
- (w/out admixture) Assign individuals to
sub-populations so as to maximize linkage
equilibrium and Hardy-Weinberg equilibrium in
each of the sub-populations.
- (w/ admixture) Assign (individuals, loci) to
10 Admixture mapping
11 Estimating Recombination Rates
12 Recombination in human chromosome 22 (Mb scale)
Dawson et al. Nature 2002
Q: Can we give a direct count of the number of
recombination events?
13 Recombination hot-spots (fine scale)
14 Recombination rates (chimp/human)
- Fine scale recombination rates differ between
chimp and human.
- The six hot-spots seen in human are not seen in
chimp.
15 Combinatorial Bounds for estimating recombination
- Recall that the expected number of recombinations ≈ ρ log n
- Procedure
- Generate N random ARGs that result in the given
sample.
- Compute the mean of the number of recombinations.
- Alternatively, generate a summary statistic s
from the population.
- For each ρ, generate many populations, and
compute the mean and variance of s (this only
needs to be done once).
- Use this to select the most likely ρ.
- What is the correct summary statistic?
- Today, we talk about the min. number of
recombination events as a possible summary
statistic. It is not the most natural, but it is
the most interesting computationally.
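The summary-statistic procedure on this slide can be sketched as a table lookup. This is a sketch under stated assumptions: the simulator is supplied by the caller, and the distribution of s for each ρ is approximated by a Gaussian with the tabulated mean and variance. The names `simulate_stat`, `build_table`, and `most_likely_rho` are hypothetical.

```python
import math
import random
from statistics import mean, pvariance

def build_table(simulate_stat, rhos, n_reps=200, seed=0):
    """For each rho, simulate many populations and record mean and variance of s.

    simulate_stat(rho, rng) is a caller-supplied simulator (assumption) that
    returns the summary statistic for one simulated population.
    """
    rng = random.Random(seed)
    table = {}
    for rho in rhos:
        draws = [simulate_stat(rho, rng) for _ in range(n_reps)]
        table[rho] = (mean(draws), pvariance(draws))
    return table

def most_likely_rho(table, s_obs, min_var=1e-9):
    """Pick the rho whose Gaussian approximation gives s_obs the highest likelihood."""
    def loglik(rho):
        mu, var = table[rho]
        var = max(var, min_var)  # guard against zero variance
        return -0.5 * math.log(2 * math.pi * var) - (s_obs - mu) ** 2 / (2 * var)
    return max(table, key=loglik)
```

The table is built once and reused for every observed sample, which is the point of the "only needs to be done once" remark.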
16 The Infinite Sites Assumption: the 4 gamete condition
0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 1
- Consider a history without recombination. No pair
of sites shows all four gametes 00, 01, 10, 11.
- A pair of sites with all 4 gametes implies a
recombination event.
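The four-gamete condition is easy to check directly. The following is a minimal sketch assuming binary (0/1) alleles and a haplotype matrix given as a list of rows; `has_four_gametes` and `incompatible_pairs` are hypothetical names.

```python
from itertools import combinations

def has_four_gametes(M, i, j):
    """True if sites i and j show all four gametes 00, 01, 10, 11."""
    return len({(row[i], row[j]) for row in M}) == 4

def incompatible_pairs(M):
    """All site pairs (i, j), i < j, that imply a recombination event."""
    n_sites = len(M[0])
    return [(i, j) for i, j in combinations(range(n_sites), 2)
            if has_four_gametes(M, i, j)]

# The matrix shown on this slide: a history without recombination.
M = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 1],
]
print(incompatible_pairs(M))  # []
```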
17 Hudson and Kaplan
- Any pair of sites (i,j) containing 4 gametes must
admit a recombination event between them.
- Disjoint (non-overlapping) intervals must contain
distinct recombination events, which can be
summed! This gives a lower bound on the number of
recombination events.
- Based on simulations, this bound is not tight.
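The bound described above can be sketched as an interval-scheduling problem: collect every site pair with four gametes and pick as many non-overlapping intervals as possible. This is a sketch assuming binary alleles and that intervals sharing only an endpoint count as disjoint; `hudson_kaplan_bound` is a hypothetical name.

```python
from itertools import combinations, product

def hudson_kaplan_bound(M):
    """Maximum number of disjoint site intervals that each show four gametes."""
    n_sites = len(M[0])
    intervals = [(i, j) for i, j in combinations(range(n_sites), 2)
                 if len({(row[i], row[j]) for row in M}) == 4]
    # Greedy by right endpoint is optimal for choosing disjoint intervals.
    intervals.sort(key=lambda ij: ij[1])
    count, last_end = 0, -1
    for i, j in intervals:
        if i >= last_end:  # touching at an endpoint is allowed
            count += 1
            last_end = j
    return count

# All eight haplotypes on three sites.
M = [list(h) for h in product([0, 1], repeat=3)]
print(hudson_kaplan_bound(M))  # 2
```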
18 Myers and Griffiths '03: Idea 1
- Let B(i,j) be a lower bound on the number of
recombinations between sites i and j.
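- Partition the sites at breakpoints i1 = 1 < i2 < i3 < i4 < i5 < i6,
and let R(P) be the sum of B over consecutive breakpoints.
- Can we compute max_P R(P) efficiently?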
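One answer to the question above is dynamic programming over the right-most breakpoint. The sketch below assumes R(P) is the sum of B over consecutive breakpoints of the partition P and that B is non-negative; `composite_bound` is a hypothetical name and B is passed in as a function.

```python
from itertools import product

def composite_bound(n_sites, B):
    """max over partitions P of the sum of B(i, j) over consecutive breakpoints.

    R[k] is the best bound for sites 0..k, via R[k] = max_{j<k} R[j] + B(j, k).
    """
    R = [0] * n_sites
    for k in range(1, n_sites):
        R[k] = max(R[j] + B(j, k) for j in range(k))
    return R[-1]

# Local bound used here: 1 if the pair shows four gametes, else 0.
M = [list(h) for h in product([0, 1], repeat=3)]

def four_gamete_bound(i, j):
    return 1 if len({(row[i], row[j]) for row in M}) == 4 else 0

print(composite_bound(3, four_gamete_bound))  # 2
```

This takes O(n^2) evaluations of B for n sites, so the maximization itself is cheap; the cost lies in computing good local bounds.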
19 The Rm bound
20 Improved lower bounds
- The Rm bound also gives a general technique for
combining local lower bounds into an overall
lower bound.
- In the example, Rm = 2, but we cannot give any ARG
with 2 recombination events.
- Can we improve upon Hudson and Kaplan to get
better local lower bounds?
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
21 Hudson and Kaplan: Idea 2
- Consider the history of individuals. Let Ht
denote the number of distinct haplotypes at time t.
- One of three things might happen at time t:
- Mutation: Ht increases by at most 1
- Recombination: Ht increases by at most 1
- Coalescence: Ht does not increase
22 The RH bound
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Ex: R ≥ 8 - 3 - 1 = 4
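The example follows from the counting argument: starting from one ancestral haplotype, each of the S mutations and each of the R recombinations adds at most one new haplotype, so H ≤ 1 + S + R and hence R ≥ H - S - 1. A minimal sketch, assuming every column is a segregating site; `haplotype_bound` is a hypothetical name.

```python
from itertools import product

def haplotype_bound(M):
    """RH = H - S - 1, clamped at 0 since a negative bound says nothing."""
    H = len({tuple(row) for row in M})  # number of distinct haplotypes
    S = len(M[0])                       # number of sites
    return max(0, H - S - 1)

# All eight haplotypes on three sites, as in the example.
M = [list(h) for h in product([0, 1], repeat=3)]
print(haplotype_bound(M))  # 4
```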
23 RH bound
- In general, RH can be quite weak
- consider the case when S > H
- However, it can be improved
- Partitioning idea: sum RH over disjoint intervals
- Apply to any subset of columns. Ex: Apply RH to
the yellow columns
000000000000000
000000000000001
000000010000000
000000010000001
100000000000000
100000000000001
100000010000000
111111111111111
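The subset idea can be tried by brute force on small examples. The sketch below makes no assumption about which columns the slide highlights; it simply searches all column subsets up to a given size and keeps the best value of H - S - 1. The names `rh_on_columns` and `best_subset_rh` are hypothetical, and the exhaustive search is only practical for small subset sizes.

```python
from itertools import combinations

def rh_on_columns(M, cols):
    """H - S - 1 restricted to the chosen columns."""
    H = len({tuple(row[c] for c in cols) for row in M})
    return H - len(cols) - 1

def best_subset_rh(M, max_size=3):
    """Best RH value over all column subsets of size at most max_size."""
    n_sites = len(M[0])
    best, best_cols = 0, ()
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_sites), size):
            value = rh_on_columns(M, cols)
            if value > best:
                best, best_cols = value, cols
    return best, best_cols

M = [
    "000000000000000",
    "000000000000001",
    "000000010000000",
    "000000010000001",
    "100000000000000",
    "100000000000001",
    "100000010000000",
    "111111111111111",
]
print(rh_on_columns(M, range(15)))  # -8: useless on the full matrix
print(best_subset_rh(M))            # (4, (0, 7, 14))
```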
24 The Rs bound
- Compute the minimum number of recombination
events R in any ARG. Note that we do not
explicitly construct the ARG.
- Consider a matrix M with H rows and S
columns.
- The rows correspond to haplotypes.
- Columns correspond to sites.
25 Rs bound: Observation 1
- Non-informative column: If a site s contains at
most one 1, or at most one 0, then in any history, it can
be obtained by adding a mutation to a branch.
- Ex: if a is the haplotype containing the 1, the mutation can
simply be added to the branch leading to a without increasing
the number of recombination events.
- R(M) = R(M-s)
0 0 0 1
26 Rs bound: Observation 2
- Redundant rows: If two rows h1 and h2 are
identical, then
- R(M) = R(M-h1)
27 Rs bound: Observation 3
- Suppose M has no non-informative columns or
redundant rows.
- Then, at least one of the haplotypes is a
recombinant.
- There exists h s.t.
- R(M) = R(M-h) + 1
- Which h should you choose?
28 Rs bound (Procedural)
- Procedure Compute_Rs(M)
- If ∃ non-informative column s
- return (Compute_Rs(M-s))
- Else if ∃ redundant row h
- return (Compute_Rs(M-h))
- Else
- return (1 + min_h(Compute_Rs(M-h)))
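The procedure above can be written out as a memoized recursion. This is a sketch for small inputs only, assuming binary alleles and adding a base case (at most one row, or no columns left) that the slide leaves implicit; `compute_rs` and `_rs` are hypothetical names.

```python
from functools import lru_cache
from itertools import product

def compute_rs(M):
    """Rs lower bound for a 0/1 haplotype matrix given as a list of rows."""
    return _rs(tuple(sorted(tuple(row) for row in M)))

@lru_cache(maxsize=None)
def _rs(rows):
    # Base case (assumption): nothing left to explain.
    if len(rows) <= 1 or not rows[0]:
        return 0
    # Observation 1: drop a non-informative column.
    for s in range(len(rows[0])):
        ones = sum(r[s] for r in rows)
        if ones <= 1 or len(rows) - ones <= 1:
            return _rs(tuple(sorted(r[:s] + r[s + 1:] for r in rows)))
    # Observation 2: drop redundant rows (all duplicates at once).
    if len(set(rows)) < len(rows):
        return _rs(tuple(sorted(set(rows))))
    # Observation 3: some haplotype is a recombinant; try each one.
    return 1 + min(_rs(rows[:h] + rows[h + 1:]) for h in range(len(rows)))

M = [list(h) for h in product([0, 1], repeat=3)]
print(compute_rs(M))  # 4
```

Rows are kept sorted so that equivalent sub-matrices hit the same cache entry; the number of distinct sub-matrices still grows exponentially with the number of haplotypes.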
30 Additional results/problems
- Using dynamic programming, Rs can be computed in
2^n poly(mn) time.
- Also, Rs can be augmented to handle
intermediates.
- Are there poly. time lower bounds?
- The number of connected components in the
conflict graph is a lower bound (BB04).
- Fast algorithms for computing ARGs with minimum
recombination.
- Poly. time to get ARG with 0 recombination
- Poly. time to get ARGs that are galled trees
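The conflict-graph bound mentioned above is straightforward to compute in polynomial time. The sketch assumes the conflict graph has one vertex per site and an edge between two sites that show all four gametes, and it counts only components containing at least one edge; that reading of BB04 is an assumption, as is the name `conflict_component_bound`.

```python
from itertools import combinations, product

def conflict_component_bound(M):
    """Number of connected components of the conflict graph that contain an edge."""
    n_sites = len(M[0])
    parent = list(range(n_sites))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    in_conflict = set()
    for i, j in combinations(range(n_sites), 2):
        if len({(row[i], row[j]) for row in M}) == 4:
            parent[find(i)] = find(j)
            in_conflict.update((i, j))
    return len({find(s) for s in in_conflict})

M = [list(h) for h in product([0, 1], repeat=3)]
print(conflict_component_bound(M))  # 1
```

On the eight-haplotype example all three sites fall into one component, so this bound gives 1.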
31 Underperforming lower bounds
- Sometimes, Rs can be quite weak
- An RI lower bound that uses intermediates can
help (BB05)
32 LPL data set
- 71 individuals, 9.7Kbp genomic sequence
- Rm = 22, Rh = 70