Title: L5: Estimating Recombination Rates
1L5 Estimating Recombination Rates
2Review
- mM min. number of recombination events in any
explanation of the haplotypes in M - Last time, we covered 3 lower bounds on mM
- The only exact algorithm that is known is super
exponential. Not even an exponential time
algorithm is known. - Can we get efficient upper bounds that are tight.
- Idea An Rs like method can be used to get an
upper bound.
3Upper bounds
- Rs bound
- Procedure Compute_Rs(M)
- If ? non-informative column
- return (Compute_Rs(M-s))
- else if ? redundant row
- return (Compute_Rs(M-h))
- else
- return (1 minh(Compute_Rs(M-h))
- Upper Bound
- Procedure Compute_U(M)
- if ? non-informative column
- return (Compute_U(M-s))
- else if ? redundant row
- return (Compute_U(M-h))
- else
- return(minh(f(h,M-h)Compute_U(M-h))
Number of recombinations needed to explain h
4Many approaches to estimating ?
51. Counting methods
- Rm
- Rh
- Rs
- ARG with min number of recombinations
- These numbers correlate with ? but how do we get
a value for ? given this number - These numbers still have value in defining
hot-spots of recombination (showing variance in
local recombination rates) - They generally underestimate the true number of
recombinations
62. Model based approaches
- Full likelihood approaches
- Approximate likelihood approaches
Fearnhead, Donnelly
7Approximate Likelihood approaches
- Two locus sampling
- 4 gamete violation implies recombination.
- Generalization
- Define vector n n00, n01 , n10, n11 for a
pair of loci - The distribution of n depends upon ?, ?
- Can we compute Pr(n ?, ?)? Then, we can iterate
to get the Max likelihood estimator for ?.
8Two locus method
- Generate MANY random ARGs with n n00 n01 n10
n11 leaves. - For each ARG, generate the two trees
corresponding to the two loci - Drop 2 mutations at random, to get a value for n
- How can you make this more efficient?
- Given an ARG (topology), we know the edge pairs
that would generate desired n. -
9Two locus estimation
10Multi locus estimator
- For a site with multiple loci, assume each pair
to be independent, each generating a vector ni - Assume recombination rate (per bp) to be constant
in the region
11Performance of the 2 locus estimator
- The composite likelihood estimator performs
well in practice. - Note that the values of ? can be pre-computed
making this a fast method. - Note that this plot does not describe the variance
12Performancs 90/10 percentile
13Research 2 locus versus other statistics
- Q1 Can we use some of the counting based methods
as summary statistic? - It is better than composite likelihood in that
- It does not assume independence between loci.
- There is a direct linear relationship (expected
number of recombination events is ? log n) - Variation might be better.
- Can we compute Pr(Rh ?, ?) efficiently? In a
sense, it does not matter, because we can
pre-compute the numbers. - Incorporate distance constraints in computing
these summary statistics. It is reasonable to
assume that the rate is constant per bp within a
window.
14Research Problem
- Recombination hot-spots are NOT correlated
between humans and Chimps. - 99 sequence identity
- Virtually no overlap between hot-spots (generated
using pop. Genetics). - What can cause this?
- Method
- Europeans/Africans share hot-spots
- Concordance with sperm typing
- Population sub-structure? Not (as shown by
structure) - Genomic factors
15Genomic factors
- Recombination is elevated in GC rich regions
- Epigenetic factors (such as acetylation,
methylation) that affect chromatin structure
might be key. - Yeast is a useful model for studying
recombination - In yeast, recombination hotspots can be
eliminated by insertion of transposable elements! - Can differential insertion of Alus explain the
differences between chimps/humans?
16Haplotype Phasing
17Genotypes and Haplotypes
- Each individual has two copies of each
chromosome. - At each site, each chromosome has one of two
alleles - Current Genotyping technology doesnt give phase
0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0
0
2 1 2 1 0 0 1 2 0
Genotype for the individual
18- Why is haplotype phasing important ?
19Haplotype Phasing
- Haplotype Phasing is the resolution of a genotype
into the two haplotypes. - Haplotypes increase the power of an association
between marker loci and phenotypic traits - Current approaches to Haplotyping
- Via technological innovations (expensive)
- Statistical Methods (ML, Phase,PL)
- This lecture, we will consider a combinatorial
approach to the phasing problem - Efficient, provable quality of solution
- Not completely generalizable (as yet)
20The Perfect Phylogeny Model
- We assume that the evolution of extant haplotypes
can be displayed on a rooted, directed tree, with
the all-0 haplotype at the root, where each site
changes from 0 to 1 on exactly one edge, and each
extant haplotype is created by accumulating the
changes on a path from the root to a leaf, where
that haplotype is displayed. - In other words, the extant haplotypes evolved
along a perfect phylogeny with all-0 root.
12345
00000
1
4
3
00010
2
10100
5
10000
01010
01011
Extant Haplotypes
21Haplotyping via Perfect Phylogeny
PPH Given a set of genotypes, find an explaining
set of haplotypes that fits a perfect phylogeny
00
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
1
2
b
00
a
a
b
c
c
01
01
10
10
10
22The Alternative Explanation
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
1 2
a 2 2
b 0 2
c 1 0
23The 4 Gamete Test for Perfect Phylogeny
- Arrange the haplotypes in a matrix, two
haplotypes for each individual. - Then (with no duplicate columns), the haplotypes
fit a unique perfect phylogeny if and only if no
two columns contain all four pairs (Buneman) - 0,0 and 0,1 and 1,0 and 1,1
00
10
01
11
24The Alternative Explanation
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
1 2
a 2 2
b 0 2
c 1 0
25The Tree Explanation Again
0 0
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
1
2
b
0 0
a
b
a
c
c
0 1
0 1
26The Combinatorial Problem
- Input A ternary matrix (0,1,2) M with N rows
- Output A binary matrix M created from M by
replacing each 2 in M with a 0 and 1, such that
M passes the 4 gamete test - Gusfield (Recomb2002) proposed a solution which
used a reduction to Matroids. - We present a (slightly inefficient) solution
using elementary techniques - Independently by (Eskin, Halperin, Karp02)
27Initial Observations
- Forced Expansions
- EX 1 If two columns(sites) of M contain the
following rows - 2 0
- 0 2
- Then M will contain a row with 1 0 and a row
with 0 1 in those columns. - EX 2 Similarly, if two columns of M contain the
rows - 2 1
- 2 0
- Then M will contain rows with 1 1 and 0 0 in
those columns
28Initial Observations
If a forced expansion of two columns creates rows
0 1, and 1 0 in those columns, then any 2 2 in
those columns must be set to be 0 1 1 0
We say that two columns are forced out-of-phase.
22
If a forced expansion of two columns creates 1 1,
and 0 0 in those columns, then any 2 2 in those
columns must be set to be 1 1 0 0 We
say that two columns are forced in-phase.
22
29Immediate Failure
It can happen that the forced expansion of
cells creates a 4x2 submatrix that fails the
4-Gamete Test. In that case, there is no PPH
solution for M.
20 12 02
Example
Will fail the 4-Gamete Test
30An O(ns2)-time Algorithm
- Find all the forced phase relationships by
considering columns in pairs. - Find all the inferred, invariant, phase
relationships. - Find a set of column pairs whose phase
relationship can be arbitrarily set, so that all
the remaining phase relationships can be
inferred. - Result An implicit representation of all
solutions to the PPH problem.
31A Running Example
1 2 3 4 5 6 7
1 2 2 2 0 0 0
2 0 2 0 0 0 2
1 2 2 2 0 2 0
1 2 2 0 2 0 0
2 2 0 0 0 2 0
0 0 0 0 0 0 0
A B C D E F
32Companion Graph G_c
1
1 2 3 4 5 6 7
1 2 2 2 0 0 0
2 0 2 0 0 0 2
1 2 2 2 0 2 0
1 2 2 0 2 0 0
2 2 0 0 0 2 0
0 0 0 0 0 0 0
A B C D E F
- Each node represents a column in M, and each
edge indicates that the pair of columns has a row
with 2s in both columns. - The algorithm builds this graph, and then checks
whether any pair of nodes is forced in or out of
phase.
33Phasing Edges in G_c
1
- Each Red edge indicates that the columns are
forced in-phase. - Each Blue edge indicates that the columns are
forced out-of-phase.
7
6
3
4
2
5
Let G_f be the sub-graph of G_c defined by the
red and blue edges.
34Connected Components in G_f
1
- Graph G_f has three connected components
.
35Phase-parity Lemma
- Lemma 1 There is a solution to the PPH problem
for M if and only if there is a coloring of the
black edges of G_c with the following property - For any triangle in G_c containing at least
one black edge, the coloring makes either 0 or 2
of the edges - blue (i.e., out of phase)
Thats nice, but how do we assign the colors?
36A Weak Triangulation Rule
1
- Theorem 1 If there are any black edges whose
ends are in the same connected component of G_f,
at least one edge is in a triangle where the
other edges are not black - In every PPH solution, it must be colored so
that the triangle has an even number of Blue (out
of Phase) edges. - This an inferred coloring.
37(No Transcript)
387
6
3
4
2
5
Graph G_f
397
6
3
4
2
5
Graph G_f
407
6
3
4
2
5
Graph G_f
41Corollary
- Inside any connected component of G_f, ALL the
phase relationships on edges (columns of M) are
uniquely determined, either as forced
relationships based on pair-wise column
comparisons, or by triangle-based inferred
colorings. - Hence, the phase relationships of all the columns
in a connected component of G_f are INVARIANT
over all the solutions to the PPH problem. - The black edges in G_f can be ordered so that the
inferred colorings can be done in linear time.
Modification of DFS.
42Phase Parity Lemma Proof
2 X
Y 2
2 2
If X ? 2, and Y ? 2, Then the two columns are
forced
43Phase Parity Lemma proof
A B C
- Lemma If a triangle contains a black edge, then
a PPH solution exists only if there are 0 or 2
blue edges in the final coloring. - Proof
- No black edge unless x2, or y2 or z2
(previous lemma) - If there is a row with all 2s, then there must be
an even number of blue edges
2 2 y
x 2 2
2 z 2
B
A
C
44Proof of Weak Triangulation Theorem
A
- Arbitrary chordless cycles are possible in the
graph, with forced edges. - See example. The pattern 0,2 2,0 and 2,2
implies a blue (out of phase) edge - A single unforced edge changes the picture
E
B
D
C
A B C D E
2 2 0 0 0
0 2 2 0 0
0 0 2 2 0
0 0 0 2 2
2 0 0 0 2
45Proof of Weak Triangulation Theorem
K
K
- Let (J,J) be a black edge connecting a long
path J,K,K,J of forced edges - In the Matrix, x ? 2, otherwise there is a chord.
Likewise y?2 - By previous lemma, (J,J) is forced
J
J
K J J K
2 2 x
y 2 2
2 2
46Finishing the Solution
- Problem A connected component C of G may
contain several connected components of G_f, so
any edge crossing two components of G_f will
still be black. How should they be colored?
471
- How should we color the remaining black edges in
a connected component C of G_c?
48Answer
- For a connected component C of G with k
connected components of Gf, select any subset S
of k-1 black edges in C, so that S together with
the red and blue edges span all the nodes of C. - Arbitrarily, color each edge in S either red or
blue. - Infer the color of any remaining black edges by
- successive use of the triangle rule.
7
6
3
4
2
5
497
3
2
5
50Theorem 2
- Any selected S works (allows the triangle rule to
work) and any coloring of the edges in S
determines the colors of any remaining black
edges. - Different colorings of S determine different
colorings of the remaining black edges. - Each different coloring of S determines a
different solution to the PPH problem. - All PPH solutions can be obtained in this way,
i.e. using just one selected S set, but coloring
it in all 2(k-1) ways.
51Corollary
- In a single connected component C of G with k
connected components in Gf, there are exactly
2(k-1) different solutions to the PPH problem in
the columns of M represented by C. - If G_c has r connected components and t connected
components of G_f, then there are exactly 2(t-r)
solutions to the PPH problem. - There is one unique PPH solution if and only if
each connected component in G is a connected
component in G_f.
52Algorithm
- Build Graph G and find its connected components.
Solve each connected component C of G separately. - Find the forced (red or blue) edges. Let Gf be
the subgraph of C containing colored edges. - Find each connected component of Gf and make the
inferred edge colorings (phase decisions). - Find a spanning tree of uncolored edges in C, and
color those edges arbitrarily, and follow the
inferred edge colorings
53Conclusion
- In the special case of blocks with no
recombination, and no recurrent mutations, the
haplotypes satisfy a perfect phylogeny - Given a set of genotypes, there is an efficient
(O(ns2)) algorithm for representing all possible
haplotype solutions that satisfy a prefect
phylogeny - Efficiency
- Input is size O(ns),
- All operations except building the graph are
O(nss2) - Valid PPH only if s O(n). Is O(ns) possible?
- Current best solution is O(nsn(1-e) s2) using
Matrix Multiplication idea - Future work involves combining this with some
heuristics to deal with general cases (lo
recombination/hi recombination)
54Simulated Data
- Coalescent model (Hudson)
- No Recombination
- 400 chromosomes, 100 sites
- Infinite sites
- Recombination
- 100 chromosomes
- Infinite sites
- R4.0 2501
- Pr(Recombination) 410(-9) between adjacent
bases
55Error Measurement
- Discrepancy 1 (Num Haplotypes incorrectly
predicted) - Switch Error 2
01010 00101 01010 10101
02222 22222
00101 01010 00000 11111
56No Recombination
57No Recombination
58Choosing between solutions
59Choosing between solutions
60Choosing between solutions
61Conclusion
- Extremely low error rates (lt 1 discrepancy) if
no recombination - Randomly choosing between equivalent solutions is
sufficient - Other measures (Parsimony, Likelihood, Entropy)
do not improve the quality of solution
62With Recombination