L6: Haplotype phasing - PowerPoint PPT Presentation

About This Presentation

Title:

L6: Haplotype phasing

Description:

L6: Haplotype phasing. Genotypes and Haplotypes. Each ... At each site, each chromosome has one of two alleles ... Input: A ternary matrix (0,1,2) M with N rows ... – PowerPoint PPT presentation

Number of Views:349

Avg rating:3.0/5.0

Slides: 58

Provided by: vineet50

Learn more at: https://cseweb.ucsd.edu

Category:

more less

Transcript and Presenter's Notes

Title: L6: Haplotype phasing

1
L6 Haplotype phasing
2
Genotypes and Haplotypes

Each individual has two copies of each
chromosome.
At each site, each chromosome has one of two
alleles
Current Genotyping technology doesnt give phase

0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0
0
2 1 2 1 0 0 1 2 0
Genotype for the individual
3
Haplotype Phasing

Haplotype Phasing is the resolution of a genotype
into the two haplotypes.
Haplotypes increase the power of an association
between marker loci and phenotypic traits
Current approaches to Haplotyping
Via technological innovations (expensive)
Statistical Methods (ML, Phase,PL)
Combinatorial approach to the phasing problem
Efficient, provable quality of solution
Not completely generalizable (as yet)

4
Clarks idea

Using the HWE principle, infer phase using
homozygous sites.
Not described as an algorithm, but as a
methodology to infer phase.

0 1 1 1 0 0 1 1 0
1 1 0 2 0 0 2 0 0
2 1 2 0 0 0 0 0 0

5
Maximum likelihood estimation of phase

Input Genotypes 1m with counts n1, n2,..
Output Haplotype frequencies (also individual
haplotype assignments)
Define (unknown) genotype probabilities P1,P2,P3
Likelihood Function (based on genotype
probabilities)

6
Genotypes and Haploptypes

Let cj be the number of haplotype pairings that
will give us genotype j, Then
Use HWE to compute Pr(hk,hl)

7
Likelihood using haplotype frequencies
8
The Expectation Step

Q Given haplotype frequencies, what are the
paired haplotype frequencies
A Initially
Subsequently, (gth iteration)

9
The M Step

?it is 0, 1, or 2 ( of times haplotype t occurs
in paired haplotype t)

10
Bayesian approach to phasing

Idea Small variants of common haplotypes should
also be considered common even though they have
low frequency

11
Phase
12
Phase

As described, each haplotype arises from the
prior set only through mutations. Recombination
is not considered
In subsequent versions, recombination is
explicitly considered in the equation

13
Phase results

Phase versus EM versus Clark
Error rate Proportion of individuals incorrectly
predicted

14
Combinatorial Approach to Haplotyping
15
The Perfect Phylogeny Model

We assume that the evolution of extant haplotypes
can be displayed on a rooted, directed tree, with
the all-0 haplotype at the root, where each site
changes from 0 to 1 on exactly one edge, and each
extant haplotype is created by accumulating the
changes on a path from the root to a leaf, where
that haplotype is displayed.
In other words, the extant haplotypes evolved
along a perfect phylogeny with all-0 root.

12345
00000
1
4
3
00010
2
10100
5
10000
01010
01011
Extant Haplotypes
16
Haplotyping via Perfect Phylogeny
PPH Given a set of genotypes, find an explaining
set of haplotypes that fits a perfect phylogeny
00
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
1
2
b
00
a
a
b
c
c
01
01

10
10
10
17
The Alternative Explanation
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
1 2
a 2 2
b 0 2
c 1 0
18
The 4 Gamete Test for Perfect Phylogeny

Arrange the haplotypes in a matrix, two
haplotypes for each individual.
Then (with no duplicate columns), the haplotypes
fit a unique perfect phylogeny if and only if no
two columns contain all four pairs (Buneman)
0,0 and 0,1 and 1,0 and 1,1

00
10
01
11
19
The Alternative Explanation
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
1 2
a 2 2
b 0 2
c 1 0
20
The Tree Explanation Again
0 0
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
1
2
b
0 0
a
b
a
c
c
0 1
0 1
21
The Combinatorial Problem

Input A ternary matrix (0,1,2) M with N rows
Output A binary matrix M created from M by
replacing each 2 in M with a 0 and 1, such that
M passes the 4 gamete test
Gusfield (Recomb2002) proposed a solution which
used a reduction to Matroids.
We present a (slightly inefficient) solution
using elementary techniques
Independently by (Eskin, Halperin, Karp02)

22
Initial Observations

Forced Expansions
EX 1 If two columns(sites) of M contain the
following rows
2 0
0 2
Then M will contain a row with 1 0 and a row
with 0 1 in those columns.
EX 2 Similarly, if two columns of M contain the
rows
2 1
2 0
Then M will contain rows with 1 1 and 0 0 in
those columns

23
Initial Observations
If a forced expansion of two columns creates rows
0 1, and 1 0 in those columns, then any 2 2 in
those columns must be set to be 0 1 1 0
We say that two columns are forced out-of-phase.
22
If a forced expansion of two columns creates 1 1,
and 0 0 in those columns, then any 2 2 in those
columns must be set to be 1 1 0 0 We
say that two columns are forced in-phase.
22
24
Immediate Failure
It can happen that the forced expansion of
cells creates a 4x2 submatrix that fails the
4-Gamete Test. In that case, there is no PPH
solution for M.
20 12 02
Example
Will fail the 4-Gamete Test
25
An O(ns2)-time Algorithm

Find all the forced phase relationships by
considering columns in pairs.
Find all the inferred, invariant, phase
relationships.
Find a set of column pairs whose phase
relationship can be arbitrarily set, so that all
the remaining phase relationships can be
inferred.
Result An implicit representation of all
solutions to the PPH problem.

26
A Running Example
1 2 3 4 5 6 7
1 2 2 2 0 0 0
2 0 2 0 0 0 2
1 2 2 2 0 2 0
1 2 2 0 2 0 0
2 2 0 0 0 2 0
0 0 0 0 0 0 0
A B C D E F
27
Companion Graph G_c
1
1 2 3 4 5 6 7
1 2 2 2 0 0 0
2 0 2 0 0 0 2
1 2 2 2 0 2 0
1 2 2 0 2 0 0
2 2 0 0 0 2 0
0 0 0 0 0 0 0
A B C D E F

Each node represents a column in M, and each
edge indicates that the pair of columns has a row
with 2s in both columns.
The algorithm builds this graph, and then checks
whether any pair of nodes is forced in or out of
phase.

28
Phasing Edges in G_c
1

Each Red edge indicates that the columns are
forced in-phase.
Each Blue edge indicates that the columns are
forced out-of-phase.

7
6
3
4
2
5
Let G_f be the sub-graph of G_c defined by the
red and blue edges.
29
Connected Components in G_f
1

Graph G_f has three connected components

.
30
Phase-parity Lemma

Lemma 1 There is a solution to the PPH problem
for M if and only if there is a coloring of the
black edges of G_c with the following property
For any triangle in G_c containing at least
one black edge, the coloring makes either 0 or 2
of the edges
blue (i.e., out of phase)

Thats nice, but how do we assign the colors?
31
A Weak Triangulation Rule
1

Theorem 1 If there are any black edges whose
ends are in the same connected component of G_f,
at least one edge is in a triangle where the
other edges are not black
In every PPH solution, it must be colored so
that the triangle has an even number of Blue (out
of Phase) edges.
This an inferred coloring.

32
(No Transcript)
33
7
6
3
4
2
5
Graph G_f
34
7
6
3
4
2
5
Graph G_f
35
7
6
3
4
2
5
Graph G_f
36
Corollary

Inside any connected component of G_f, ALL the
phase relationships on edges (columns of M) are
uniquely determined, either as forced
relationships based on pair-wise column
comparisons, or by triangle-based inferred
colorings.
Hence, the phase relationships of all the columns
in a connected component of G_f are INVARIANT
over all the solutions to the PPH problem.
The black edges in G_f can be ordered so that the
inferred colorings can be done in linear time.
Modification of DFS.

37
Phase Parity Lemma Proof
2 X
Y 2
2 2
If X ? 2, and Y ? 2, Then the two columns are
forced
38
Phase Parity Lemma proof
A B C

Lemma If a triangle contains a black edge, then
a PPH solution exists only if there are 0 or 2
blue edges in the final coloring.
Proof
No black edge unless x2, or y2 or z2
(previous lemma)
If there is a row with all 2s, then there must be
an even number of blue edges

2 2 y
x 2 2
2 z 2
B
A
C
39
Proof of Weak Triangulation Theorem
A

Arbitrary chordless cycles are possible in the
graph, with forced edges.
See example. The pattern 0,2 2,0 and 2,2
implies a blue (out of phase) edge
A single unforced edge changes the picture

E
B
D
C
A B C D E
2 2 0 0 0
0 2 2 0 0
0 0 2 2 0
0 0 0 2 2
2 0 0 0 2
40
Proof of Weak Triangulation Theorem
K
K

Let (J,J) be a black edge connecting a long
path J,K,K,J of forced edges
In the Matrix, x ? 2, otherwise there is a chord.
Likewise y?2
By previous lemma, (J,J) is forced

J
J
K J J K
2 2 x
y 2 2
2 2
41
Finishing the Solution

Problem A connected component C of G may
contain several connected components of G_f, so
any edge crossing two components of G_f will
still be black. How should they be colored?

42
1

How should we color the remaining black edges in
a connected component C of G_c?

43
Answer

For a connected component C of G with k
connected components of Gf, select any subset S
of k-1 black edges in C, so that S together with
the red and blue edges span all the nodes of C.
Arbitrarily, color each edge in S either red or
blue.
Infer the color of any remaining black edges by
successive use of the triangle rule.

7
6
3
4
2
5
44
7
3
2
5
45
Theorem 2

Any selected S works (allows the triangle rule to
work) and any coloring of the edges in S
determines the colors of any remaining black
edges.
Different colorings of S determine different
colorings of the remaining black edges.
Each different coloring of S determines a
different solution to the PPH problem.
All PPH solutions can be obtained in this way,
i.e. using just one selected S set, but coloring
it in all 2(k-1) ways.

46
Corollary

In a single connected component C of G with k
connected components in Gf, there are exactly
2(k-1) different solutions to the PPH problem in
the columns of M represented by C.
If G_c has r connected components and t connected
components of G_f, then there are exactly 2(t-r)
solutions to the PPH problem.
There is one unique PPH solution if and only if
each connected component in G is a connected
component in G_f.

47
Algorithm

Build Graph G and find its connected components.
Solve each connected component C of G separately.
Find the forced (red or blue) edges. Let Gf be
the subgraph of C containing colored edges.
Find each connected component of Gf and make the
inferred edge colorings (phase decisions).
Find a spanning tree of uncolored edges in C, and
color those edges arbitrarily, and follow the
inferred edge colorings

48
Conclusion

In the special case of blocks with no
recombination, and no recurrent mutations, the
haplotypes satisfy a perfect phylogeny
Given a set of genotypes, there is an efficient
(O(ns2)) algorithm for representing all possible
haplotype solutions that satisfy a prefect
phylogeny
Efficiency
Input is size O(ns),
All operations except building the graph are
O(nss2)
Valid PPH only if s O(n). Is O(ns) possible?
Current best solution is O(nsn(1-e) s2) using
Matrix Multiplication idea
Future work involves combining this with some
heuristics to deal with general cases (lo
recombination/hi recombination)

49
Simulated Data

Coalescent model (Hudson)
No Recombination
400 chromosomes, 100 sites
Infinite sites
Recombination
100 chromosomes
Infinite sites
R4.0 2501
Pr(Recombination) 410(-9) between adjacent
bases

50
Error Measurement

Discrepancy 1 (Num Haplotypes incorrectly
predicted)
Switch Error 2

01010 00101 01010 10101
02222 22222
00101 01010 00000 11111
51
No Recombination
52
No Recombination
53
Choosing between solutions
54
Choosing between solutions
55
Choosing between solutions
56
Conclusion

Extremely low error rates (lt 1 discrepancy) if
no recombination
Randomly choosing between equivalent solutions is
sufficient
Other measures (Parsimony, Likelihood, Entropy)
do not improve the quality of solution

57
With Recombination
58
Problems