L5: Estimating Recombination Rates - PowerPoint PPT Presentation

About This Presentation

Title:

L5: Estimating Recombination Rates

Description:

This lecture, we will consider a combinatorial approach to the phasing problem ... forced relationships based on pair-wise column comparisons, or by triangle-based ... – PowerPoint PPT presentation

Number of Views:26

Avg rating:3.0/5.0

Slides: 62

Provided by: vineet50

Learn more at: https://cseweb.ucsd.edu

Category:

more less

Transcript and Presenter's Notes

Title: L5: Estimating Recombination Rates

1
L5 Estimating Recombination Rates
2
Review

mM min. number of recombination events in any
explanation of the haplotypes in M
Last time, we covered 3 lower bounds on mM
The only exact algorithm that is known is super
exponential. Not even an exponential time
algorithm is known.
Can we get efficient upper bounds that are tight.
Idea An Rs like method can be used to get an
upper bound.

3
Upper bounds

Rs bound
Procedure Compute_Rs(M)
If ? non-informative column
return (Compute_Rs(M-s))
else if ? redundant row
return (Compute_Rs(M-h))
else
return (1 minh(Compute_Rs(M-h))

Upper Bound
Procedure Compute_U(M)
if ? non-informative column
return (Compute_U(M-s))
else if ? redundant row
return (Compute_U(M-h))
else
return(minh(f(h,M-h)Compute_U(M-h))

Number of recombinations needed to explain h
4
Many approaches to estimating ?
5
1. Counting methods

Rm
Rh
Rs
ARG with min number of recombinations
These numbers correlate with ? but how do we get
a value for ? given this number
These numbers still have value in defining
hot-spots of recombination (showing variance in
local recombination rates)
They generally underestimate the true number of
recombinations

6
2. Model based approaches

Full likelihood approaches
Approximate likelihood approaches

Fearnhead, Donnelly
7
Approximate Likelihood approaches

Two locus sampling
4 gamete violation implies recombination.
Generalization
Define vector n n00, n01 , n10, n11 for a
pair of loci
The distribution of n depends upon ?, ?
Can we compute Pr(n ?, ?)? Then, we can iterate
to get the Max likelihood estimator for ?.

8
Two locus method

Generate MANY random ARGs with n n00 n01 n10
n11 leaves.
For each ARG, generate the two trees
corresponding to the two loci
Drop 2 mutations at random, to get a value for n
How can you make this more efficient?
Given an ARG (topology), we know the edge pairs
that would generate desired n.

9
Two locus estimation
10
Multi locus estimator

For a site with multiple loci, assume each pair
to be independent, each generating a vector ni
Assume recombination rate (per bp) to be constant
in the region

11
Performance of the 2 locus estimator

The composite likelihood estimator performs
well in practice.
Note that the values of ? can be pre-computed
making this a fast method.
Note that this plot does not describe the variance

12
Performancs 90/10 percentile
13
Research 2 locus versus other statistics

Q1 Can we use some of the counting based methods
as summary statistic?
It is better than composite likelihood in that
It does not assume independence between loci.
There is a direct linear relationship (expected
number of recombination events is ? log n)
Variation might be better.
Can we compute Pr(Rh ?, ?) efficiently? In a
sense, it does not matter, because we can
pre-compute the numbers.
Incorporate distance constraints in computing
these summary statistics. It is reasonable to
assume that the rate is constant per bp within a
window.

14
Research Problem

Recombination hot-spots are NOT correlated
between humans and Chimps.
99 sequence identity
Virtually no overlap between hot-spots (generated
using pop. Genetics).
What can cause this?
Method
Europeans/Africans share hot-spots
Concordance with sperm typing
Population sub-structure? Not (as shown by
structure)
Genomic factors

15
Genomic factors

Recombination is elevated in GC rich regions
Epigenetic factors (such as acetylation,
methylation) that affect chromatin structure
might be key.
Yeast is a useful model for studying
recombination
In yeast, recombination hotspots can be
eliminated by insertion of transposable elements!
Can differential insertion of Alus explain the
differences between chimps/humans?

16
Haplotype Phasing
17
Genotypes and Haplotypes

Each individual has two copies of each
chromosome.
At each site, each chromosome has one of two
alleles
Current Genotyping technology doesnt give phase

0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0
0
2 1 2 1 0 0 1 2 0
Genotype for the individual
18

Why is haplotype phasing important ?

19
Haplotype Phasing

Haplotype Phasing is the resolution of a genotype
into the two haplotypes.
Haplotypes increase the power of an association
between marker loci and phenotypic traits
Current approaches to Haplotyping
Via technological innovations (expensive)
Statistical Methods (ML, Phase,PL)
This lecture, we will consider a combinatorial
approach to the phasing problem
Efficient, provable quality of solution
Not completely generalizable (as yet)

20
The Perfect Phylogeny Model

We assume that the evolution of extant haplotypes
can be displayed on a rooted, directed tree, with
the all-0 haplotype at the root, where each site
changes from 0 to 1 on exactly one edge, and each
extant haplotype is created by accumulating the
changes on a path from the root to a leaf, where
that haplotype is displayed.
In other words, the extant haplotypes evolved
along a perfect phylogeny with all-0 root.

12345
00000
1
4
3
00010
2
10100
5
10000
01010
01011
Extant Haplotypes
21
Haplotyping via Perfect Phylogeny
PPH Given a set of genotypes, find an explaining
set of haplotypes that fits a perfect phylogeny
00
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
1
2
b
00
a
a
b
c
c
01
01

10
10
10
22
The Alternative Explanation
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
1 2
a 2 2
b 0 2
c 1 0
23
The 4 Gamete Test for Perfect Phylogeny

Arrange the haplotypes in a matrix, two
haplotypes for each individual.
Then (with no duplicate columns), the haplotypes
fit a unique perfect phylogeny if and only if no
two columns contain all four pairs (Buneman)
0,0 and 0,1 and 1,0 and 1,1

00
10
01
11
24
The Alternative Explanation
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
1 2
a 2 2
b 0 2
c 1 0
25
The Tree Explanation Again
0 0
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
1
2
b
0 0
a
b
a
c
c
0 1
0 1
26
The Combinatorial Problem

Input A ternary matrix (0,1,2) M with N rows
Output A binary matrix M created from M by
replacing each 2 in M with a 0 and 1, such that
M passes the 4 gamete test
Gusfield (Recomb2002) proposed a solution which
used a reduction to Matroids.
We present a (slightly inefficient) solution
using elementary techniques
Independently by (Eskin, Halperin, Karp02)

27
Initial Observations

Forced Expansions
EX 1 If two columns(sites) of M contain the
following rows
2 0
0 2
Then M will contain a row with 1 0 and a row
with 0 1 in those columns.
EX 2 Similarly, if two columns of M contain the
rows
2 1
2 0
Then M will contain rows with 1 1 and 0 0 in
those columns

28
Initial Observations
If a forced expansion of two columns creates rows
0 1, and 1 0 in those columns, then any 2 2 in
those columns must be set to be 0 1 1 0
We say that two columns are forced out-of-phase.
22
If a forced expansion of two columns creates 1 1,
and 0 0 in those columns, then any 2 2 in those
columns must be set to be 1 1 0 0 We
say that two columns are forced in-phase.
22
29
Immediate Failure
It can happen that the forced expansion of
cells creates a 4x2 submatrix that fails the
4-Gamete Test. In that case, there is no PPH
solution for M.
20 12 02
Example
Will fail the 4-Gamete Test
30
An O(ns2)-time Algorithm

Find all the forced phase relationships by
considering columns in pairs.
Find all the inferred, invariant, phase
relationships.
Find a set of column pairs whose phase
relationship can be arbitrarily set, so that all
the remaining phase relationships can be
inferred.
Result An implicit representation of all
solutions to the PPH problem.

31
A Running Example
1 2 3 4 5 6 7
1 2 2 2 0 0 0
2 0 2 0 0 0 2
1 2 2 2 0 2 0
1 2 2 0 2 0 0
2 2 0 0 0 2 0
0 0 0 0 0 0 0
A B C D E F
32
Companion Graph G_c
1
1 2 3 4 5 6 7
1 2 2 2 0 0 0
2 0 2 0 0 0 2
1 2 2 2 0 2 0
1 2 2 0 2 0 0
2 2 0 0 0 2 0
0 0 0 0 0 0 0
A B C D E F

Each node represents a column in M, and each
edge indicates that the pair of columns has a row
with 2s in both columns.
The algorithm builds this graph, and then checks
whether any pair of nodes is forced in or out of
phase.

33
Phasing Edges in G_c
1

Each Red edge indicates that the columns are
forced in-phase.
Each Blue edge indicates that the columns are
forced out-of-phase.

7
6
3
4
2
5
Let G_f be the sub-graph of G_c defined by the
red and blue edges.
34
Connected Components in G_f
1

Graph G_f has three connected components

.
35
Phase-parity Lemma

Lemma 1 There is a solution to the PPH problem
for M if and only if there is a coloring of the
black edges of G_c with the following property
For any triangle in G_c containing at least
one black edge, the coloring makes either 0 or 2
of the edges
blue (i.e., out of phase)

Thats nice, but how do we assign the colors?
36
A Weak Triangulation Rule
1

Theorem 1 If there are any black edges whose
ends are in the same connected component of G_f,
at least one edge is in a triangle where the
other edges are not black
In every PPH solution, it must be colored so
that the triangle has an even number of Blue (out
of Phase) edges.
This an inferred coloring.

37
(No Transcript)
38
7
6
3
4
2
5
Graph G_f
39
7
6
3
4
2
5
Graph G_f
40
7
6
3
4
2
5
Graph G_f
41
Corollary

Inside any connected component of G_f, ALL the
phase relationships on edges (columns of M) are
uniquely determined, either as forced
relationships based on pair-wise column
comparisons, or by triangle-based inferred
colorings.
Hence, the phase relationships of all the columns
in a connected component of G_f are INVARIANT
over all the solutions to the PPH problem.
The black edges in G_f can be ordered so that the
inferred colorings can be done in linear time.
Modification of DFS.

42
Phase Parity Lemma Proof
2 X
Y 2
2 2
If X ? 2, and Y ? 2, Then the two columns are
forced
43
Phase Parity Lemma proof
A B C

Lemma If a triangle contains a black edge, then
a PPH solution exists only if there are 0 or 2
blue edges in the final coloring.
Proof
No black edge unless x2, or y2 or z2
(previous lemma)
If there is a row with all 2s, then there must be
an even number of blue edges

2 2 y
x 2 2
2 z 2
B
A
C
44
Proof of Weak Triangulation Theorem
A

Arbitrary chordless cycles are possible in the
graph, with forced edges.
See example. The pattern 0,2 2,0 and 2,2
implies a blue (out of phase) edge
A single unforced edge changes the picture

E
B
D
C
A B C D E
2 2 0 0 0
0 2 2 0 0
0 0 2 2 0
0 0 0 2 2
2 0 0 0 2
45
Proof of Weak Triangulation Theorem
K
K

Let (J,J) be a black edge connecting a long
path J,K,K,J of forced edges
In the Matrix, x ? 2, otherwise there is a chord.
Likewise y?2
By previous lemma, (J,J) is forced

J
J
K J J K
2 2 x
y 2 2
2 2
46
Finishing the Solution

Problem A connected component C of G may
contain several connected components of G_f, so
any edge crossing two components of G_f will
still be black. How should they be colored?

47
1

How should we color the remaining black edges in
a connected component C of G_c?

48
Answer

For a connected component C of G with k
connected components of Gf, select any subset S
of k-1 black edges in C, so that S together with
the red and blue edges span all the nodes of C.
Arbitrarily, color each edge in S either red or
blue.
Infer the color of any remaining black edges by
successive use of the triangle rule.

7
6
3
4
2
5
49
7
3
2
5
50
Theorem 2

Any selected S works (allows the triangle rule to
work) and any coloring of the edges in S
determines the colors of any remaining black
edges.
Different colorings of S determine different
colorings of the remaining black edges.
Each different coloring of S determines a
different solution to the PPH problem.
All PPH solutions can be obtained in this way,
i.e. using just one selected S set, but coloring
it in all 2(k-1) ways.

51
Corollary

In a single connected component C of G with k
connected components in Gf, there are exactly
2(k-1) different solutions to the PPH problem in
the columns of M represented by C.
If G_c has r connected components and t connected
components of G_f, then there are exactly 2(t-r)
solutions to the PPH problem.
There is one unique PPH solution if and only if
each connected component in G is a connected
component in G_f.

52
Algorithm

Build Graph G and find its connected components.
Solve each connected component C of G separately.
Find the forced (red or blue) edges. Let Gf be
the subgraph of C containing colored edges.
Find each connected component of Gf and make the
inferred edge colorings (phase decisions).
Find a spanning tree of uncolored edges in C, and
color those edges arbitrarily, and follow the
inferred edge colorings

53
Conclusion

In the special case of blocks with no
recombination, and no recurrent mutations, the
haplotypes satisfy a perfect phylogeny
Given a set of genotypes, there is an efficient
(O(ns2)) algorithm for representing all possible
haplotype solutions that satisfy a prefect
phylogeny
Efficiency
Input is size O(ns),
All operations except building the graph are
O(nss2)
Valid PPH only if s O(n). Is O(ns) possible?
Current best solution is O(nsn(1-e) s2) using
Matrix Multiplication idea
Future work involves combining this with some
heuristics to deal with general cases (lo
recombination/hi recombination)

54
Simulated Data

Coalescent model (Hudson)
No Recombination
400 chromosomes, 100 sites
Infinite sites
Recombination
100 chromosomes
Infinite sites
R4.0 2501
Pr(Recombination) 410(-9) between adjacent
bases

55
Error Measurement

Discrepancy 1 (Num Haplotypes incorrectly
predicted)
Switch Error 2

01010 00101 01010 10101
02222 22222
00101 01010 00000 11111
56
No Recombination
57
No Recombination
58
Choosing between solutions
59
Choosing between solutions
60
Choosing between solutions
61
Conclusion

Extremely low error rates (lt 1 discrepancy) if
no recombination
Randomly choosing between equivalent solutions is
sufficient
Other measures (Parsimony, Likelihood, Entropy)
do not improve the quality of solution

62
With Recombination

Write a Comment

User Comments (0)