L6: Haplotype phasing

Slides: 58
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu

1
L6 Haplotype phasing
2
Genotypes and Haplotypes
  • Each individual has two copies of each
    chromosome.
  • At each site, each chromosome has one of two
    alleles
  • Current genotyping technology doesn't give phase

Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype for the individual: 2 1 2 1 0 0 1 2 0
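
A minimal sketch of the encoding used on this slide, assuming 0 and 1 mark sites where both chromosomes carry the same allele and 2 marks a heterozygous site. The function name `genotype` is illustrative only.

```python
def genotype(h1, h2):
    """Combine two haplotypes (sequences of 0/1) into a 0/1/2 genotype.

    A site where both haplotypes agree keeps that allele;
    a site where they differ is recorded as 2 (phase unknown).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Phasing is the inverse problem: recovering `h1` and `h2` from the 0/1/2 vector alone.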
3
Haplotype Phasing
  • Haplotype Phasing is the resolution of a genotype
    into the two haplotypes.
  • Haplotypes increase the power of an association
    between marker loci and phenotypic traits
  • Current approaches to Haplotyping
  • Via technological innovations (expensive)
  • Statistical methods (ML, Phase, PL)
  • Combinatorial approach to the phasing problem
  • Efficient, provable quality of solution
  • Not completely generalizable (as yet)

4
Clark's idea
  • Using the HWE principle, infer phase using
    homozygous sites.
  • Not described as an algorithm, but as a
    methodology to infer phase.
  • 0 1 1 1 0 0 1 1 0
  • 1 1 0 2 0 0 2 0 0
  • 2 1 2 0 0 0 0 0 0

5
Maximum likelihood estimation of phase
  • Input: genotypes 1..m with counts n_1, n_2, ..., n_m
  • Output: haplotype frequencies (also individual
    haplotype assignments)
  • Define (unknown) genotype probabilities P_1, P_2, P_3, ...
  • Likelihood function (based on genotype
    probabilities)
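
Assuming the usual multinomial sampling model, with n = n_1 + ... + n_m individuals in total, the likelihood of the genotype probabilities can be written as follows. This is a standard form offered as a sketch, not necessarily the exact expression shown on the slide.

```latex
L(P_1, \dots, P_m) \;=\; \frac{n!}{n_1!\, n_2! \cdots n_m!} \; \prod_{j=1}^{m} P_j^{\,n_j}
```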

6
Genotypes and Haplotypes
  • Let c_j be the number of haplotype pairings that
    give genotype j. Then
  • Use HWE to compute Pr(h_k, h_l)
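
A hedged reconstruction of the two relations this slide refers to, assuming p_k denotes the population frequency of haplotype h_k and that the i-th pairing compatible with genotype j consists of haplotypes h_{ik} and h_{il}:

```latex
P_j \;=\; \sum_{i=1}^{c_j} \Pr(h_{ik}, h_{il}),
\qquad
\Pr(h_k, h_l) \;=\;
\begin{cases}
p_k^2 & \text{if } k = l,\\
2\, p_k\, p_l & \text{if } k \neq l.
\end{cases}
```

The second relation is the Hardy-Weinberg assumption: the two haplotypes of an individual are drawn independently from the population.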

7
Likelihood using haplotype frequencies
8
The Expectation Step
  • Q: Given haplotype frequencies, what are the
    paired haplotype frequencies?
  • A: Initially
  • Subsequently (g-th iteration)

9
The M Step
  • δ_it is 0, 1, or 2 (number of times haplotype t occurs
    in paired haplotype i)
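
The E and M steps can be sketched in a few lines. This is a minimal illustration under assumptions, not the exact procedure from the slides: it starts from uniform haplotype frequencies, enumerates every compatible pairing (exponential in the number of heterozygous sites), and uses the HWE pair probabilities. All function names are illustrative.

```python
from collections import Counter
from itertools import product


def compatible_pairs(g):
    """All unordered haplotype pairs that produce genotype g (2 = heterozygous)."""
    het = [i for i, x in enumerate(g) if x == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, iters=50):
    """Estimate haplotype frequencies from 0/1/2 genotypes by EM."""
    counts = Counter(tuple(g) for g in genotypes)
    n = sum(counts.values())
    pairs = {g: compatible_pairs(g) for g in counts}
    haps = sorted({h for ps in pairs.values() for pr in ps for h in pr})
    p = {h: 1.0 / len(haps) for h in haps}  # assumed uniform start
    for _ in range(iters):
        expected = dict.fromkeys(haps, 0.0)
        for g, n_j in counts.items():
            # E step: HWE probability of each pairing, normalised per genotype
            w = [p[a] * p[b] * (1 if a == b else 2) for a, b in pairs[g]]
            total = sum(w)
            for (a, b), w_i in zip(pairs[g], w):
                post = w_i / total
                # each pairing contributes both of its haplotypes
                expected[a] += n_j * post
                expected[b] += n_j * post
        # M step: expected haplotype counts divided by 2n chromosomes
        p = {h: c / (2 * n) for h, c in expected.items()}
    return p


freqs = em_haplotype_freqs([(0, 0), (0, 0), (1, 1), (2, 2)])
```

In this toy input the double heterozygote is resolved almost entirely as 00/11, because those haplotypes are already supported by the homozygous individuals.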

10
Bayesian approach to phasing
  • Idea: Small variants of common haplotypes should
    also be considered common even though they have
    low frequency

11
Phase
12
Phase
  • As described, each haplotype arises from the
    prior set only through mutations. Recombination
    is not considered
  • In subsequent versions, recombination is
    explicitly considered in the equation

13
Phase results
  • Phase versus EM versus Clark
  • Error rate: proportion of individuals incorrectly
    predicted

14
Combinatorial Approach to Haplotyping
15
The Perfect Phylogeny Model
  • We assume that the evolution of extant haplotypes
    can be displayed on a rooted, directed tree, with
    the all-0 haplotype at the root, where each site
    changes from 0 to 1 on exactly one edge, and each
    extant haplotype is created by accumulating the
    changes on a path from the root to a leaf, where
    that haplotype is displayed.
  • In other words, the extant haplotypes evolved
    along a perfect phylogeny with all-0 root.

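[Figure: perfect phylogeny over sites 1-5 with root 00000; each edge is labeled with the site that changes on it (1, 4, 3, 2, 5).]
Extant Haplotypes: 00010, 10100, 10000, 01010, 01011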
16
Haplotyping via Perfect Phylogeny
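PPH: Given a set of genotypes, find an explaining
set of haplotypes that fits a perfect phylogeny.

Genotypes:
  1 2
a 2 2
b 0 2
c 1 0

Explaining haplotypes:
  1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0

[Figure: perfect phylogeny with root 00. The edge for site 1 leads to haplotype 10 (a, c, c); the edge for site 2 leads to haplotype 01 (a, b); the remaining haplotype of b is 00.]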
17
The Alternative Explanation
Genotypes:
  1 2
a 2 2
b 0 2
c 1 0

Alternative haplotypes:
  1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0

No tree is possible for this explanation.
18
The 4 Gamete Test for Perfect Phylogeny
  • Arrange the haplotypes in a matrix, two
    haplotypes for each individual.
  • Then (with no duplicate columns), the haplotypes
    fit a unique perfect phylogeny if and only if no
    two columns contain all four pairs (Buneman)
  • The four pairs are 0,0 and 0,1 and 1,0 and 1,1.

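
The test is straightforward to sketch in code. The function name is illustrative; the two inputs are the tree explanation and the alternative explanation of genotypes a = 2 2, b = 0 2, c = 1 0 from the neighbouring slides.

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """Return True if no pair of columns shows all four pairs 00, 01, 10, 11."""
    if not haplotypes:
        return True
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(row[i], row[j]) for row in haplotypes}
        if len(seen) == 4:
            return False
    return True


tree_explanation = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
alternative = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
print(passes_four_gamete_test(tree_explanation))  # True
print(passes_four_gamete_test(alternative))       # False
```

Checking every pair of columns against every row costs O(n s^2) for n rows and s sites.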
19
The Alternative Explanation
Genotypes:
  1 2
a 2 2
b 0 2
c 1 0

Alternative haplotypes:
  1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0

No tree is possible for this explanation.
20
The Tree Explanation Again
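Genotypes:
  1 2
a 2 2
b 0 2
c 1 0

Explaining haplotypes:
  1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0

[Figure: perfect phylogeny with root 0 0. The edge for site 1 leads to haplotype 1 0 (a, c, c); the edge for site 2 leads to haplotype 0 1 (a, b); the remaining haplotype of b is 0 0.]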
21
The Combinatorial Problem
  • Input: A ternary matrix (0,1,2) M with N rows
  • Output: A binary matrix M' with 2N rows, created from M by
    replacing each 2 in M with a 0 and a 1 (one in each of the
    two rows for that genotype), such that M' passes the
    4 gamete test
  • Gusfield (RECOMB 2002) proposed a solution which
    used a reduction to matroids.
  • We present a (slightly inefficient) solution
    using elementary techniques
  • Independently by (Eskin, Halperin, Karp 02)

22
Initial Observations
  • Forced Expansions
  • EX 1: If two columns (sites) of M contain the
    following rows
  • 2 0
  • 0 2
  • Then M' will contain a row with 1 0 and a row
    with 0 1 in those columns.
  • EX 2: Similarly, if two columns of M contain the
    rows
  • 2 1
  • 2 0
  • Then M' will contain rows with 1 1 and 0 0 in
    those columns

23
Initial Observations
If a forced expansion of two columns creates rows
0 1 and 1 0 in those columns, then any 2 2 in
those columns must be set to 0 1 and 1 0.
We say that the two columns are forced out-of-phase.
If a forced expansion of two columns creates rows 1 1
and 0 0 in those columns, then any 2 2 in those
columns must be set to 1 1 and 0 0.
We say that the two columns are forced in-phase.
24
Immediate Failure
It can happen that the forced expansion of
cells creates a 4x2 submatrix that fails the
4-Gamete Test. In that case, there is no PPH
solution for M.
Example: the rows 2 0, 1 2 and 0 2 in two columns
will fail the 4-Gamete Test.
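
The pairwise check behind forced expansions and immediate failure can be sketched as follows. This is a minimal illustration assuming the 4 gamete rule as stated in these slides; the function names and the return values "in", "out", "fail" and None are choices made for this example.

```python
def forced_gametes(M, i, j):
    """Pairs that every expansion of M must contain in columns i and j.

    Rows with a 2 in both columns are skipped: their phase is still open.
    """
    forced = set()
    for row in M:
        a, b = row[i], row[j]
        if a == 2 and b == 2:
            continue
        # a 2 in one column yields both alleles there, paired with the fixed allele
        for x in ((0, 1) if a == 2 else (a,)):
            for y in ((0, 1) if b == 2 else (b,)):
                forced.add((x, y))
    return forced


def forced_phase(M, i, j):
    """Classify a column pair as 'in', 'out', 'fail', or None (not forced)."""
    g = forced_gametes(M, i, j)
    if len(g) == 4:
        return "fail"  # all four pairs are unavoidable: no PPH solution
    if {(0, 1), (1, 0)} <= g:
        return "out"   # a 2 2 row must become 0 1 and 1 0
    if {(0, 0), (1, 1)} <= g:
        return "in"    # a 2 2 row must become 0 0 and 1 1
    return None


print(forced_phase([[2, 0], [0, 2]], 0, 1))          # out
print(forced_phase([[2, 0], [1, 2], [0, 2]], 0, 1))  # fail
```

Running this over all column pairs takes O(n s^2) time for n rows and s columns.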
25
An O(ns^2)-time Algorithm
  • Find all the forced phase relationships by
    considering columns in pairs.
  • Find all the inferred, invariant, phase
    relationships.
  • Find a set of column pairs whose phase
    relationship can be arbitrarily set, so that all
    the remaining phase relationships can be
    inferred.
  • Result: An implicit representation of all
    solutions to the PPH problem.

26
A Running Example
   1 2 3 4 5 6 7
A  1 2 2 2 0 0 0
B  2 0 2 0 0 0 2
C  1 2 2 2 0 2 0
D  1 2 2 0 2 0 0
E  2 2 0 0 0 2 0
F  0 0 0 0 0 0 0
27
Companion Graph G_c
   1 2 3 4 5 6 7
A  1 2 2 2 0 0 0
B  2 0 2 0 0 0 2
C  1 2 2 2 0 2 0
D  1 2 2 0 2 0 0
E  2 2 0 0 0 2 0
F  0 0 0 0 0 0 0
  • Each node represents a column in M, and each
    edge indicates that the pair of columns has a row
    with 2s in both columns.
  • The algorithm builds this graph, and then checks
    whether any pair of nodes is forced in or out of
    phase.
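
Building the companion graph from the definition above is short. This sketch assumes columns are numbered from 1, as in the running example; the function name is illustrative.

```python
from itertools import combinations


def companion_graph(M):
    """Edges (i, j) with i < j (1-based columns) such that some row has a 2 in both."""
    edges = set()
    for row in M:
        twos = [c + 1 for c, x in enumerate(row) if x == 2]
        edges.update(combinations(twos, 2))
    return edges


running_example = [
    [1, 2, 2, 2, 0, 0, 0],
    [2, 0, 2, 0, 0, 0, 2],
    [1, 2, 2, 2, 0, 2, 0],
    [1, 2, 2, 0, 2, 0, 0],
    [2, 2, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(sorted(companion_graph(running_example)))
```

On the running example this yields 13 edges, for instance (2, 3) from the first row and (1, 7) from the second.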

28
Phasing Edges in G_c
  • Each Red edge indicates that the columns are
    forced in-phase.
  • Each Blue edge indicates that the columns are
    forced out-of-phase.

Let G_f be the sub-graph of G_c defined by the
red and blue edges.
29
Connected Components in G_f
  • Graph G_f has three connected components.

30
Phase-parity Lemma
  • Lemma 1: There is a solution to the PPH problem
    for M if and only if there is a coloring of the
    black (not yet colored) edges of G_c with the
    following property:
  • For any triangle in G_c containing at least
    one black edge, the coloring makes either 0 or 2
    of the edges blue (i.e., out of phase).

That's nice, but how do we assign the colors?
31
A Weak Triangulation Rule
  • Theorem 1: If there are any black edges whose
    ends are in the same connected component of G_f,
    at least one such edge is in a triangle where the
    other two edges are not black.
  • In every PPH solution, it must be colored so
    that the triangle has an even number of blue (out
    of phase) edges.
  • This is an inferred coloring.
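
The triangle rule amounts to a parity computation: if red is encoded as 0 and blue as 1, the third edge of a triangle is the XOR of the other two. Below is a minimal sketch under that assumed encoding. It simply repeats until nothing changes, so it is not the linear-time ordering mentioned later, and it does not check for conflicting triangles.

```python
RED, BLUE = 0, 1  # in phase, out of phase


def infer_colors(edges, colors):
    """Color black edges by repeated use of the triangle rule.

    edges:  set of frozenset({u, v}) for all edges of G_c
    colors: dict mapping already colored edges to RED or BLUE
    Returns a new dict; edges that cannot be inferred stay absent.
    """
    colors = dict(colors)
    nodes = set().union(*edges) if edges else set()
    changed = True
    while changed:
        changed = False
        for e in edges:
            if e in colors:
                continue
            u, v = tuple(e)
            for w in nodes - e:
                a, b = frozenset((u, w)), frozenset((v, w))
                if a in colors and b in colors:
                    # an even number of blue edges in the triangle
                    colors[e] = colors[a] ^ colors[b]
                    changed = True
                    break
    return colors


triangle = {frozenset((1, 2)), frozenset((2, 3)), frozenset((1, 3))}
known = {frozenset((1, 2)): BLUE, frozenset((2, 3)): BLUE}
print(infer_colors(triangle, known)[frozenset((1, 3))])  # 0, i.e. red
```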


32
(No Transcript)
33
Graph G_f
34
Graph G_f
35
Graph G_f
36
Corollary
  • Inside any connected component of G_f, ALL the
    phase relationships on edges (columns of M) are
    uniquely determined, either as forced
    relationships based on pair-wise column
    comparisons, or by triangle-based inferred
    colorings.
  • Hence, the phase relationships of all the columns
    in a connected component of G_f are INVARIANT
    over all the solutions to the PPH problem.
  • The black edges in G_f can be ordered so that the
    inferred colorings can be done in linear time.
    Modification of DFS.

37
Phase Parity Lemma Proof
2 X
Y 2
2 2
If X ≠ 2 and Y ≠ 2, then the two columns are
forced.
38
Phase Parity Lemma proof
  • Lemma: If a triangle contains a black edge, then
    a PPH solution exists only if there are 0 or 2
    blue edges in the final coloring.
  • Proof:
  • No black edge unless x = 2, or y = 2, or z = 2
    (previous lemma)
  • If there is a row with all 2s, then there must be
    an even number of blue edges

A B C
2 2 y
x 2 2
2 z 2
39
Proof of Weak Triangulation Theorem
  • Arbitrary chordless cycles are possible in the
    graph, with forced edges.
  • See example. The pattern 0,2; 2,0; and 2,2
    implies a blue (out of phase) edge
  • A single unforced edge changes the picture

A B C D E
2 2 0 0 0
0 2 2 0 0
0 0 2 2 0
0 0 0 2 2
2 0 0 0 2
40
Proof of Weak Triangulation Theorem
  • Let (J,J') be a black edge connecting a long
    path J,K,K',J' of forced edges
  • In the matrix, x ≠ 2, otherwise there is a chord.
    Likewise y ≠ 2
  • By the previous lemma, (J,J') is forced

K  J  J' K'
2  2  x
   y  2  2
   2  2
41
Finishing the Solution
  • Problem: A connected component C of G_c may
    contain several connected components of G_f, so
    any edge crossing two components of G_f will
    still be black. How should they be colored?

42
  • How should we color the remaining black edges in
    a connected component C of G_c?

43
Answer
  • For a connected component C of G_c with k
    connected components of G_f, select any subset S
    of k-1 black edges in C, so that S together with
    the red and blue edges span all the nodes of C.
  • Arbitrarily color each edge in S either red or
    blue.
  • Infer the color of any remaining black edges by
    successive use of the triangle rule.

44
(No Transcript)
45
Theorem 2
  • Any selected S works (allows the triangle rule to
    work) and any coloring of the edges in S
    determines the colors of any remaining black
    edges.
  • Different colorings of S determine different
    colorings of the remaining black edges.
  • Each different coloring of S determines a
    different solution to the PPH problem.
  • All PPH solutions can be obtained in this way,
    i.e. using just one selected S set, but coloring
    it in all 2^(k-1) ways.

46
Corollary
  • In a single connected component C of G_c with k
    connected components in G_f, there are exactly
    2^(k-1) different solutions to the PPH problem in
    the columns of M represented by C.
  • If G_c has r connected components and t connected
    components of G_f, then there are exactly 2^(t-r)
    solutions to the PPH problem.
  • There is one unique PPH solution if and only if
    each connected component in G_c is a connected
    component in G_f.
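
The count in this corollary needs only the two component counts. A small sketch, assuming a PPH solution exists and that the edge sets of G_c and G_f are already known; the function names and the union-find helper are illustrative.

```python
def count_components(nodes, edges):
    """Number of connected components, using a simple union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


def count_pph_solutions(nodes, gc_edges, gf_edges):
    """2^(t - r): r components of G_c, t components of G_f (solution assumed to exist)."""
    r = count_components(nodes, gc_edges)
    t = count_components(nodes, gf_edges)
    return 2 ** (t - r)


# Toy graph: a triangle 1-2-3 with only edge (1, 2) forced, plus an isolated column 4
print(count_pph_solutions([1, 2, 3, 4], [(1, 2), (2, 3), (1, 3)], [(1, 2)]))  # 2
```

Isolated columns form a component in both graphs, so they cancel out of the exponent.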

47
Algorithm
  • Build graph G_c and find its connected components.
    Solve each connected component C of G_c separately.
  • Find the forced (red or blue) edges. Let G_f be
    the subgraph of C containing colored edges.
  • Find each connected component of G_f and make the
    inferred edge colorings (phase decisions).
  • Find a spanning tree of uncolored edges in C,
    color those edges arbitrarily, and follow the
    inferred edge colorings

48
Conclusion
  • In the special case of blocks with no
    recombination, and no recurrent mutations, the
    haplotypes satisfy a perfect phylogeny
  • Given a set of genotypes, there is an efficient
    (O(ns^2)) algorithm for representing all possible
    haplotype solutions that satisfy a perfect
    phylogeny
  • Efficiency
  • Input is size O(ns),
  • All operations except building the graph are
    O(ns + s^2)
  • Valid PPH only if s = O(n). Is O(ns) possible?
  • Current best solution is O(ns + n^(1-ε) s^2) using
    Matrix Multiplication idea
  • Future work involves combining this with some
    heuristics to deal with general cases (low
    recombination/high recombination)

49
Simulated Data
  • Coalescent model (Hudson)
  • No Recombination
  • 400 chromosomes, 100 sites
  • Infinite sites
  • Recombination
  • 100 chromosomes
  • Infinite sites
  • R4.0 2501
  • Pr(Recombination) = 4 x 10^(-9) between adjacent
    bases

50
Error Measurement
  • Discrepancy 1 (Num Haplotypes incorrectly
    predicted)
  • Switch Error 2

01010 00101 01010 10101
02222 22222
00101 01010 00000 11111
51
No Recombination
52
No Recombination
53
Choosing between solutions
54
Choosing between solutions
55
Choosing between solutions
56
Conclusion
  • Extremely low error rates (< 1 discrepancy) if
    no recombination
  • Randomly choosing between equivalent solutions is
    sufficient
  • Other measures (Parsimony, Likelihood, Entropy)
    do not improve the quality of solution

57
With Recombination
58
Problems
  • Many of the earlier problems (structure/recombination
    rate, etc.) correspond to phased data.
  • Can they be resolved for unphased data?