Title: L6: Haplotype phasing
1L6 Haplotype phasing
2Genotypes and Haplotypes
- Each individual has two copies of each chromosome.
- At each site, each chromosome has one of two alleles.
- Current genotyping technology doesn't give phase.
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype for the individual: 2 1 2 1 0 0 1 2 0
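The encoding used in the example can be read off the three rows: a site is 0 or 1 when both chromosomes agree, and 2 when they differ. A minimal sketch of that mapping follows; the function name is hypothetical and not from the slides.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (2 = heterozygous site)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]

# The two haplotypes shown on the slide.
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype_from_haplotypes(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Phasing is the inverse problem: every 2 in the genotype hides which chromosome carries the 1.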
3Haplotype Phasing
- Haplotype phasing is the resolution of a genotype into the two haplotypes.
- Haplotypes increase the power of an association between marker loci and phenotypic traits.
- Current approaches to haplotyping:
  - Via technological innovations (expensive)
  - Statistical methods (ML, Phase, PL)
  - Combinatorial approach to the phasing problem
    - Efficient, provable quality of solution
    - Not completely generalizable (as yet)
4Clark's idea
- Using the HWE principle, infer phase using homozygous sites.
- Not described as an algorithm, but as a methodology to infer phase.
- 0 1 1 1 0 0 1 1 0
- 1 1 0 2 0 0 2 0 0
- 2 1 2 0 0 0 0 0 0
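One common way to turn this methodology into a procedure is sketched below, under stated assumptions: fully homozygous genotypes seed a list of known haplotypes, and an ambiguous genotype is resolved as soon as a known haplotype is compatible with it. The function names and the toy genotypes are hypothetical and are not taken from the slide; the result can depend on the order in which genotypes are visited.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = []
    for g, h in zip(genotype, hap):
        if g == 2:
            other.append(1 - h)      # heterozygous site: partner has the other allele
        elif g == h:
            other.append(h)          # homozygous site: partner has the same allele
        else:
            return None              # hap is incompatible with this genotype
    return tuple(other)

def clark_inference(genotypes):
    """Greedy inference: resolve genotypes using haplotypes already known."""
    known, resolved = [], {}
    for i, g in enumerate(genotypes):
        if 2 not in g:               # homozygous everywhere: phase is unambiguous
            resolved[i] = (tuple(g), tuple(g))
            if tuple(g) not in known:
                known.append(tuple(g))
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                c = complement(g, h)
                if c is not None:
                    resolved[i] = (h, c)
                    if c not in known:
                        known.append(c)   # the new haplotype can resolve later genotypes
                    progress = True
                    break
    return resolved

toy = [(0, 1, 1), (2, 1, 2), (2, 2, 1)]   # hypothetical genotypes
print(clark_inference(toy))
```

Genotypes that no known haplotype explains stay unresolved, which is one reason the approach is usually described as a heuristic.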
5Maximum likelihood estimation of phase
- Input: genotypes 1, ..., m with counts n_1, n_2, ..., n_m
- Output: haplotype frequencies (also individual haplotype assignments)
- Define (unknown) genotype probabilities P_1, P_2, ..., P_m
- Likelihood function (based on genotype probabilities)
6Genotypes and Haplotypes
- Let c_j be the number of haplotype pairings that will give us genotype j. Then P_j is the sum of Pr(h_k, h_l) over those c_j pairings.
- Use HWE to compute Pr(h_k, h_l)
7Likelihood using haplotype frequencies
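The equations for this slide were not captured. A sketch of the form the likelihood usually takes in EM-based haplotype frequency estimation, written with the symbols of the preceding slides (n_j, P_j, c_j, Pr(h_k, h_l)); the haplotype frequencies p_k and the total count n are notation assumed here:

```latex
\begin{align}
L(P_1,\dots,P_m) &= \frac{n!}{n_1!\,n_2!\cdots n_m!}\,\prod_{j=1}^{m} P_j^{\,n_j},
  \qquad n = \sum_{j=1}^{m} n_j \\
P_j &= \sum_{i=1}^{c_j} \Pr(h_{ik}, h_{il}) \\
\Pr(h_k, h_l) &=
  \begin{cases}
    p_k^{2}      & k = l \\
    2\,p_k\,p_l  & k \neq l
  \end{cases}
  \qquad \text{(HWE)}
\end{align}
```

Substituting the second and third lines into the first expresses the likelihood in the haplotype frequencies p_k alone, which is what the E and M steps iterate on.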
8The Expectation Step
- Q: Given haplotype frequencies, what are the paired haplotype frequencies?
- A: Initially [equation not preserved in the transcript]
- Subsequently (g-th iteration) [equation not preserved in the transcript]
9The M Step
- δ_it is 0, 1, or 2 (the number of times haplotype t occurs in paired haplotype i)
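The E and M steps can be sketched as a gene-counting loop. This is a minimal sketch, not the slide's exact update equations (those were not captured): it assumes uniform starting frequencies, a fixed number of iterations, and hypothetical function names.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs that produce `genotype` (2 = heterozygous)."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotype_counts, iterations=50):
    """genotype_counts maps a genotype tuple to its count n_j."""
    expansions = {g: compatible_pairs(g) for g in genotype_counts}
    haps = sorted({h for ps in expansions.values() for pair in ps for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}      # assumption: uniform start
    total = 2 * sum(genotype_counts.values())      # number of chromosomes
    for _ in range(iterations):
        expected = defaultdict(float)
        for g, n in genotype_counts.items():
            if n == 0:
                continue
            # E step: HWE probability of each pair, normalised within genotype g.
            weights = [freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
                       for a, b in expansions[g]]
            z = sum(weights)
            for (a, b), w in zip(expansions[g], weights):
                expected[a] += n * w / z           # a homozygous pair adds twice,
                expected[b] += n * w / z           # matching a delta value of 2
        # M step: new frequency = expected haplotype count / chromosomes.
        freq = {h: expected[h] / total for h in haps}
    return freq

counts = {(2, 2): 1, (0, 0): 2}                    # hypothetical data
print(em_haplotype_frequencies(counts))
```

With these counts the double heterozygote is pulled toward the pairing that reuses the common haplotype (0, 0), which is the behaviour the likelihood rewards.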
10Bayesian approach to phasing
- Idea: small variants of common haplotypes should also be considered common even though they have low frequency.
11Phase
12Phase
- As described, each haplotype arises from the prior set only through mutations. Recombination is not considered.
- In subsequent versions, recombination is explicitly considered in the equation.
13Phase results
- Phase versus EM versus Clark
- Error rate: proportion of individuals incorrectly predicted
14Combinatorial Approach to Haplotyping
15The Perfect Phylogeny Model
- We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.
- In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.
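[Figure: perfect phylogeny on sites 1 2 3 4 5 with root 00000; edges labeled with the mutating sites 1, 4, 3, 2, 5; node labels 00010, 10100, 10000, 01010, 01011; caption "Extant Haplotypes"]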
16Haplotyping via Perfect Phylogeny
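PPH: Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
Genotypes (sites 1 2):
a 2 2
b 0 2
c 1 0
Explaining haplotypes (sites 1 2):
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
[Figure: tree with root 00; the edge for site 1 leads to haplotype 10 (a, c, c), the edge for site 2 leads to haplotype 01 (a, b), and b's haplotype 00 sits at the root]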
17The Alternative Explanation
Haplotypes (sites 1 2):
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree is possible for this explanation.
Genotypes (sites 1 2):
a 2 2
b 0 2
c 1 0
18The 4 Gamete Test for Perfect Phylogeny
- Arrange the haplotypes in a matrix, two haplotypes for each individual.
- Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all four pairs (Buneman):
  - 0,0 and 0,1 and 1,0 and 1,1
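The test is a direct check over column pairs. A minimal sketch follows (function name hypothetical), applied to the two explanations of the genotypes a = 2 2, b = 0 2, c = 1 0 from the neighbouring slides:

```python
from itertools import combinations

def passes_four_gamete_test(haplotypes):
    """True if no pair of columns contains all of 00, 01, 10 and 11."""
    if not haplotypes:
        return True
    n_sites = len(haplotypes[0])
    for c, d in combinations(range(n_sites), 2):
        gametes = {(row[c], row[d]) for row in haplotypes}
        if len(gametes) == 4:
            return False
    return True

tree_explanation = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
alternative      = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
print(passes_four_gamete_test(tree_explanation))  # True
print(passes_four_gamete_test(alternative))       # False
```

The alternative explanation fails because phasing individual a as 1 1 / 0 0 adds the pair 1,1 to columns that already show 0,0 and 0,1 and 1,0.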
19The Alternative Explanation
Haplotypes (sites 1 2):
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree is possible for this explanation.
Genotypes (sites 1 2):
a 2 2
b 0 2
c 1 0
20The Tree Explanation Again
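Genotypes (sites 1 2):
a 2 2
b 0 2
c 1 0
Explaining haplotypes (sites 1 2):
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
[Figure: tree with root 0 0 and one edge per site (1 and 2); haplotype 0 1 is labeled a and b, b's haplotype 0 0 is at the root, and the remaining leaves are a, c, c]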
21The Combinatorial Problem
- Input: a ternary matrix (0,1,2) M with N rows.
- Output: a binary matrix M' with 2N rows, created from M by replacing each 2 in M with a 0 and a 1 (one in each of the two rows derived from that genotype), such that M' passes the 4 gamete test.
- Gusfield (RECOMB 2002) proposed a solution which used a reduction to matroids.
- We present a (slightly inefficient) solution using elementary techniques.
- Independently by (Eskin, Halperin, Karp '02)
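To make the input and output concrete, here is a brute-force sketch that enumerates every way of replacing the 2s and keeps the expansions that pass the 4 gamete test. It is exponential and is only meant as a reference for tiny matrices; it is not the algorithm developed on the following slides, and the function names are hypothetical.

```python
from itertools import combinations, product

def phasings(row):
    """Yield each unordered haplotype pair that explains one genotype row."""
    het = [i for i, v in enumerate(row) if v == 2]
    if not het:
        yield tuple(row), tuple(row)
        return
    first, rest = het[0], het[1:]
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(row), list(row)
        h1[first], h2[first] = 0, 1     # pin the first 2 so swapped pairs are not repeated
        for i, b in zip(rest, bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)

def four_gamete_ok(haps):
    """True if no column pair of the binary matrix shows all four pairs."""
    if not haps:
        return True
    n_sites = len(haps[0])
    return all(len({(h[c], h[d]) for h in haps}) < 4
               for c, d in combinations(range(n_sites), 2))

def brute_force_pph(M):
    """Return every expansion M' of M (a list of 2N haplotypes) that passes the test."""
    solutions = []
    for choice in product(*(list(phasings(r)) for r in M)):
        haps = [h for pair in choice for h in pair]
        if four_gamete_ok(haps):
            solutions.append(haps)
    return solutions

M = [(2, 2), (0, 2), (1, 0)]            # genotypes a, b, c from the earlier slides
for haps in brute_force_pph(M):
    print(haps)
```

For this M the only surviving expansion phases individual a as 0 1 / 1 0, matching the tree explanation shown earlier.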
22Initial Observations
- Forced expansions
- EX 1: If two columns (sites) of M contain the following rows
  - 2 0
  - 0 2
- then M' will contain a row with 1 0 and a row with 0 1 in those columns.
- EX 2: Similarly, if two columns of M contain the rows
  - 2 1
  - 2 0
- then M' will contain rows with 1 1 and 0 0 in those columns.
23Initial Observations
If a forced expansion of two columns creates rows 0 1 and 1 0 in those columns, then any 2 2 in those columns must be set to be 0 1 / 1 0. We say that the two columns are forced out-of-phase.
If a forced expansion of two columns creates 1 1 and 0 0 in those columns, then any 2 2 in those columns must be set to be 1 1 / 0 0. We say that the two columns are forced in-phase.
24Immediate Failure
It can happen that the forced expansion of cells creates a 4x2 submatrix that fails the 4-Gamete Test. In that case, there is no PPH solution for M.
Example (rows of two columns):
2 0
1 2
0 2
Will fail the 4-Gamete Test.
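The pairwise rule can be sketched as follows: collect the pairs that every expansion must contain in two columns (from rows that are not 2 2), then see which phase, if any, those pairs force. The table, function name, and return labels are illustrative choices, not notation from the slides.

```python
# Pairs that any expansion must contain for a row (x, y) that is not (2, 2).
FORCED_PAIRS = {
    (0, 0): {(0, 0)}, (0, 1): {(0, 1)}, (1, 0): {(1, 0)}, (1, 1): {(1, 1)},
    (2, 0): {(0, 0), (1, 0)}, (2, 1): {(0, 1), (1, 1)},
    (0, 2): {(0, 0), (0, 1)}, (1, 2): {(1, 0), (1, 1)},
}

def forced_phase(col_c, col_d):
    """Classify two columns of M as 'in-phase', 'out-of-phase', 'fail', or None."""
    forced = set()
    for pair in zip(col_c, col_d):
        if pair != (2, 2):
            forced |= FORCED_PAIRS[pair]
    in_phase = {(0, 0), (1, 1)} <= forced     # a 2 2 row must become 1 1 / 0 0
    out_phase = {(0, 1), (1, 0)} <= forced    # a 2 2 row must become 0 1 / 1 0
    if in_phase and out_phase:
        return "fail"                         # all four pairs are already forced
    if in_phase:
        return "in-phase"
    if out_phase:
        return "out-of-phase"
    return None                               # not forced by this pair of columns

print(forced_phase([2, 0], [0, 2]))        # rows 2 0 and 0 2 (EX 1)
print(forced_phase([2, 1], [0, 2]))        # rows 2 0 and 1 2
print(forced_phase([2, 1, 0], [0, 2, 2]))  # rows 2 0, 1 2, 0 2 (the failing example)
```

The three calls return out-of-phase, in-phase, and fail respectively; the last one is the immediate-failure example, where both phases are forced at once.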
25An O(ns^2)-time Algorithm
- Find all the forced phase relationships by considering columns in pairs.
- Find all the inferred, invariant, phase relationships.
- Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred.
- Result: an implicit representation of all solutions to the PPH problem.
26A Running Example
   1 2 3 4 5 6 7
A  1 2 2 2 0 0 0
B  2 0 2 0 0 0 2
C  1 2 2 2 0 2 0
D  1 2 2 0 2 0 0
E  2 2 0 0 0 2 0
F  0 0 0 0 0 0 0
27Companion Graph G_c
[Figure: companion graph G_c drawn next to the running example matrix M]
- Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2s in both columns.
- The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.
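Building G_c only needs the positions of the 2s in each row. A minimal sketch on the running example matrix follows; columns are numbered from 1 as on the slide, and the function name is hypothetical.

```python
from itertools import combinations

def companion_graph(M):
    """Edges (c, d) of G_c: column pairs that share a row with 2 in both columns."""
    edges = set()
    for row in M:
        twos = [j + 1 for j, v in enumerate(row) if v == 2]   # 1-based column numbers
        edges.update(combinations(twos, 2))
    return sorted(edges)

running_example = [
    (1, 2, 2, 2, 0, 0, 0),   # A
    (2, 0, 2, 0, 0, 0, 2),   # B
    (1, 2, 2, 2, 0, 2, 0),   # C
    (1, 2, 2, 0, 2, 0, 0),   # D
    (2, 2, 0, 0, 0, 2, 0),   # E
    (0, 0, 0, 0, 0, 0, 0),   # F
]
print(companion_graph(running_example))
```

Each edge found this way is then checked against the pairwise forcing rule; edges that are not forced stay uncolored.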
28Phasing Edges in G_c
- Each red edge indicates that the columns are forced in-phase.
- Each blue edge indicates that the columns are forced out-of-phase.
[Figure: G_c on nodes 1 to 7 with the forced edges colored red or blue]
Let G_f be the subgraph of G_c defined by the red and blue edges.
29Connected Components in G_f
- Graph G_f has three connected components.
30Phase-parity Lemma
- Lemma 1: There is a solution to the PPH problem for M if and only if there is a coloring of the black edges of G_c with the following property:
  - For any triangle in G_c containing at least one black edge, the coloring makes either 0 or 2 of the edges blue (i.e., out of phase).
That's nice, but how do we assign the colors?
31A Weak Triangulation Rule
- Theorem 1: If there are any black edges whose ends are in the same connected component of G_f, at least one such edge is in a triangle where the other edges are not black.
- In every PPH solution, it must be colored so that the triangle has an even number of blue (out-of-phase) edges.
- This is an inferred coloring.
33-35 Graph G_f [Figures on three consecutive slides showing nodes 2, 3, 4, 5, 6, 7; edge drawings not captured]
36Corollary
- Inside any connected component of G_f, ALL the phase relationships on edges (columns of M) are uniquely determined, either as forced relationships based on pair-wise column comparisons, or by triangle-based inferred colorings.
- Hence, the phase relationships of all the columns in a connected component of G_f are INVARIANT over all the solutions to the PPH problem.
- The black edges in G_f can be ordered so that the inferred colorings can be done in linear time. Modification of DFS.
37Phase Parity Lemma Proof
Rows in two columns:
2 X
Y 2
2 2
If X ≠ 2 and Y ≠ 2, then the two columns are forced.
38Phase Parity Lemma proof
- Lemma: If a triangle contains a black edge, then a PPH solution exists only if there are 0 or 2 blue edges in the final coloring.
- Proof:
  - No black edge unless x = 2, or y = 2, or z = 2 (previous lemma).
  - If there is a row with all 2s, then there must be an even number of blue edges.
Rows creating the triangle (columns A B C):
A B C
2 2 y
x 2 2
2 z 2
[Figure: triangle on nodes A, B, C]
39Proof of Weak Triangulation Theorem
- Arbitrary chordless cycles are possible in the graph, with forced edges.
- See example. The pattern 0 2, 2 0, and 2 2 implies a blue (out-of-phase) edge.
- A single unforced edge changes the picture.
[Figure: cycle on nodes A, B, C, D, E]
A B C D E
2 2 0 0 0
0 2 2 0 0
0 0 2 2 0
0 0 0 2 2
2 0 0 0 2
40Proof of Weak Triangulation Theorem
- Let (J, J') be a black edge connecting a long path J, K, K', J' of forced edges.
- In the matrix, x ≠ 2, otherwise there is a chord. Likewise y ≠ 2.
- By the previous lemma, (J, J') is forced.
[Figure: path through nodes K, J, J', K']
K  J  J' K'
2  2  x
   y  2  2
   2  2
41Finishing the Solution
- Problem: a connected component C of G_c may contain several connected components of G_f, so any edge crossing two components of G_f will still be black. How should they be colored?
42
- How should we color the remaining black edges in a connected component C of G_c?
43Answer
- For a connected component C of G_c with k connected components of G_f, select any subset S of k-1 black edges in C, so that S together with the red and blue edges span all the nodes of C.
- Arbitrarily color each edge in S either red or blue.
- Infer the color of any remaining black edges by successive use of the triangle rule.
[Figure: graph on nodes 2, 3, 4, 5, 6, 7]
44[Figure: graph on nodes 7, 3, 2, 5]
45Theorem 2
- Any selected S works (allows the triangle rule to work), and any coloring of the edges in S determines the colors of any remaining black edges.
- Different colorings of S determine different colorings of the remaining black edges.
- Each different coloring of S determines a different solution to the PPH problem.
- All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.
46Corollary
- In a single connected component C of G_c with k connected components in G_f, there are exactly 2^(k-1) different solutions to the PPH problem in the columns of M represented by C.
- If G_c has r connected components and t connected components of G_f, then there are exactly 2^(t-r) solutions to the PPH problem.
- There is one unique PPH solution if and only if each connected component in G_c is a connected component in G_f.
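The count in the corollary only needs the two component counts r and t. A minimal union-find sketch follows; the edge lists are for the running example, with the G_f edges derived by applying the pairwise forcing rule to its matrix (an assumption of this sketch, not read from the slide figures). The formula applies when at least one PPH solution exists.

```python
def count_components(nodes, edges):
    """Number of connected components, via union-find with path halving."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})

def pph_solution_count(nodes, gc_edges, gf_edges):
    """2^(t - r), valid when the instance has at least one PPH solution."""
    r = count_components(nodes, gc_edges)
    t = count_components(nodes, gf_edges)
    return 2 ** (t - r)

nodes = range(1, 8)
gc_edges = [(1, 2), (1, 3), (1, 6), (1, 7), (2, 3), (2, 4), (2, 5), (2, 6),
            (3, 4), (3, 5), (3, 6), (3, 7), (4, 6)]
# Assumed forced (red or blue) edges, see lead-in.
gf_edges = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 6), (4, 6)]
print(count_components(nodes, gf_edges))             # 3
print(pph_solution_count(nodes, gc_edges, gf_edges))  # 4
```

Columns that never share a 2 with another column are isolated in both graphs, so they add one to r and one to t and leave the exponent unchanged.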
47Algorithm
- Build graph G_c and find its connected components. Solve each connected component C of G_c separately.
- Find the forced (red or blue) edges. Let G_f be the subgraph of C containing colored edges.
- Find each connected component of G_f and make the inferred edge colorings (phase decisions).
- Find a spanning tree of uncolored edges in C, color those edges arbitrarily, and follow the inferred edge colorings.
48Conclusion
- In the special case of blocks with no recombination and no recurrent mutations, the haplotypes satisfy a perfect phylogeny.
- Given a set of genotypes, there is an efficient (O(ns^2)) algorithm for representing all possible haplotype solutions that satisfy a perfect phylogeny.
- Efficiency:
  - Input is of size O(ns).
  - All operations except building the graph are O(ns + s^2).
  - Valid PPH only if s = O(n). Is O(ns) possible?
  - Current best solution is O(ns + n^(1-ε) s^2), using a matrix multiplication idea.
- Future work involves combining this with some heuristics to deal with general cases (low recombination / high recombination).
49Simulated Data
- Coalescent model (Hudson)
- No recombination
  - 400 chromosomes, 100 sites
  - Infinite sites
- Recombination
  - 100 chromosomes
  - Infinite sites
  - R = 4.0, 2501
  - Pr(Recombination) = 4 × 10^(-9) between adjacent bases
50Error Measurement
- Discrepancy 1 (number of haplotypes incorrectly predicted)
- Switch Error 2
01010 00101 01010 10101
02222 22222
00101 01010 00000 11111
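The slide does not spell out how switch error is computed. A sketch under the usual definition (an assumption here): walk along the heterozygous sites and count how many times the predicted haplotype changes between matching and not matching the true one. Inputs are strings over 0/1 (haplotypes) and 0/1/2 (genotype); the function name is hypothetical.

```python
def switch_errors(true_hap, pred_hap, genotype):
    """Count phase switches between consecutive heterozygous sites."""
    het = [i for i, g in enumerate(genotype) if g == "2"]
    # True where the predicted haplotype carries the same allele as the true one.
    agree = [true_hap[i] == pred_hap[i] for i in het]
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)

print(switch_errors("01010", "00101", "02222"))  # 0: same pair, other chromosome listed
print(switch_errors("01010", "00000", "22222"))  # 4
print(switch_errors("01010", "01101", "22222"))  # 1
```

Unlike a count of wrong haplotypes, this measure charges a prediction only once for a single misplaced phase boundary, however many sites follow it.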
51No Recombination
52No Recombination
53Choosing between solutions
54Choosing between solutions
55Choosing between solutions
56Conclusion
- Extremely low error rates (< 1 discrepancy) if no recombination
- Randomly choosing between equivalent solutions is sufficient
- Other measures (Parsimony, Likelihood, Entropy) do not improve the quality of solution
57With Recombination
58Problems
- Many of the earlier problems (structure, recombination rate, etc.) correspond to phased data.
- Can they be resolved for unphased data?