Title: Haplotyping via Perfect Phylogeny: A Direct Approach
1Haplotyping via Perfect Phylogeny: A Direct Approach
- Dan Gusfield
- CS, UC Davis
Joint work with V. Bafna, G. Lancia and S. Yooseph
2Genotypes and Haplotypes
- Each individual has two copies of each chromosome.
- At each site, each chromosome has one of two
alleles (states) denoted by 0 and 1 (motivated by SNPs).
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
Two haplotypes per individual
Merge the haplotypes
2 1 2 1 0 0 1 2 0
Genotype for the individual
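The merge step can be sketched in a few lines of Python. This is only an illustration: the function name `merge` is made up here, and the example reads the 18 digits on the slide as two 9-site haplotypes.

```python
def merge(hap1, hap2):
    """Merge two 0/1 haplotypes into one 0/1/2 genotype.

    A site where both haplotypes agree keeps that value;
    a site where they differ is recorded as 2.
    """
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must have the same number of sites")
    return [a if a == b else 2 for a, b in zip(hap1, hap2)]


# The two haplotypes from the slide (assumed split: 9 sites each).
hap1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
hap2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge(hap1, hap2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Note that the merge loses information: at every site marked 2 we no longer know which haplotype carried the 1.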
3Haplotyping Problem
- Biological Problem: For disease association
studies, haplotype data is more valuable than
genotype data, but haplotype data is hard to
collect. Genotype data is easy to collect.
- Computational Problem: Given a set of n
genotypes, determine the original set of n
haplotype pairs that generated the n genotypes.
This is hopeless without a genetic model.
4The Perfect Phylogeny Model
- We assume that the evolution of extant haplotypes
can be displayed on a rooted, directed tree, with
the all-0 haplotype at the root, where each site
changes from 0 to 1 on exactly one edge, and
each extant haplotype is created by accumulating
the changes on a path from the root to a leaf,
where that haplotype is displayed.
- In other words, the extant haplotypes evolved
along a perfect phylogeny with all-0 root.
5The Perfect Phylogeny Model
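[Figure: a perfect phylogeny over sites 1 2 3 4 5. The
ancestral haplotype 00000 is at the root, the site
mutations 1, 4, 3, 2 and 5 label the edges, and the
extant haplotypes are at the leaves. Haplotype labels
in the figure: 00010, 10100, 10000, 01010, 01011.]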
6Justification for Perfect Phylogeny Model
- In the absence of recombination each haplotype of
any individual has a single parent, so tracing
back the history of the haplotypes in a
population gives a tree.
- Recent strong evidence for long regions of DNA
with no recombination. Key to the NIH haplotype
mapping project (see NY Times October 30, 2002).
- Mutations are rare at selected sites, so are
assumed non-recurrent.
- Connection with coalescent models.
7The Haplotype Phylogeny Problem
Given a set of genotypes S, find an explaining
set of haplotypes that fits a perfect phylogeny.
A haplotype pair explains a genotype if the merge
of the haplotypes creates the genotype. Example:
The merge of 0 1 and 1 0 explains 2 2.
Genotype matrix S (columns are sites)
  1 2
a 2 2
b 0 2
c 1 0
8The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
9The Haplotype Phylogeny Problem (PPH problem)
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
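1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
[Figure: the perfect phylogeny for this explanation.
The root is 00; site 1 mutates on one edge and site 2
on another. Leaf haplotypes: b 00; a 01, b 01;
a 10, c 10, c 10.]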
10The Alternative Explanation
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
1 2
a 2 2
b 0 2
c 1 0
11When does a set of haplotypes fit a perfect
phylogeny?
- Classic NASC (necessary and sufficient condition):
Arrange the haplotypes in a matrix, two haplotypes
for each individual. Then (with no duplicate columns),
the haplotypes fit a unique perfect phylogeny if and
only if no two columns contain all three pairs
0,1 and 1,0 and 1,1.
This is the 3-Gamete Test.
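The test is easy to state in code. The following is a minimal sketch (the function name is illustrative, not taken from any released program) that checks every pair of columns of a binary haplotype matrix, using the two explanations of the genotypes a, b, c from the earlier slides as examples.

```python
from itertools import combinations


def passes_three_gamete_test(haplotypes):
    """Return True if no two columns contain all of 0,1 and 1,0 and 1,1."""
    num_sites = len(haplotypes[0])
    forbidden = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(num_sites), 2):
        seen = {(row[i], row[j]) for row in haplotypes}
        if forbidden <= seen:  # all three pairs occur in columns i, j
            return False
    return True


# Two haplotypes per individual, in the order a, a, b, b, c, c.
tree_explanation = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
alternative = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]

print(passes_three_gamete_test(tree_explanation))  # True
print(passes_three_gamete_test(alternative))       # False
```

This brute-force check looks at all column pairs, so it takes time proportional to the number of rows times the square of the number of sites.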
12We can remove the red words to obtain
another true statement. Also, we can consider an
unrooted version of the problem, where the
4-gamete test is used, but in this talk we
consider the simpler, rooted version. See the
full paper for the unrooted version.
13The Alternative Explanation
1 2
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
1 2
a 2 2
b 0 2
c 1 0
14The Tree Explanation Again
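1 2
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
1 2
a 2 2
b 0 2
c 1 0
[Figure: the perfect phylogeny for this explanation,
with root 0 0, site 1 mutating on one edge and site 2
on another, and the haplotypes of a, b and c at the
leaves.]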
15The case of the unknown root
- The 3-Gamete Test is for the case when the root is
assumed to be the all-0 vector. When the root is not
known, the NASC is that the submatrix
  00
  01
  10
  11
must not appear in the matrix. This is called the
4-Gamete Test.
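For comparison with the rooted case, here is a small sketch of the 4-Gamete Test under the same conventions (binary rows, all column pairs checked). The function name and the two toy matrices are illustrative only.

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """Return True if no two columns contain all of 00, 01, 10 and 11."""
    num_sites = len(haplotypes[0])
    all_four = {(0, 0), (0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(num_sites), 2):
        seen = {(row[i], row[j]) for row in haplotypes}
        if all_four <= seen:
            return False
    return True


# Contains 01, 10 and 11 but not 00: fails the rooted 3-Gamete Test,
# yet passes the unrooted 4-Gamete Test.
three_pairs = [(0, 1), (1, 0), (1, 1)]
# Contains all four combinations: fails the 4-Gamete Test.
four_pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(passes_four_gamete_test(three_pairs))  # True
print(passes_four_gamete_test(four_pairs))   # False
```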
16Solving the Haplotype Phylogeny Problem (PPH) in
nearly linear O(nm alpha(nm)) time
Gusfield, RECOMB, April 2002
- Simple Tools based on classical Perfect Phylogeny
Problem.
- Complex Tools based on Graph Realization
Problem (graphic matroid realization).
- But in this talk, we develop a simpler, but
somewhat slower version.
17Program PPH
- Program PPH solves the perfect phylogeny
haplotyping problem using the graph realization
approach. It solves problems with 50 sites and
100 individuals in about 1 second.
- Program PPH can be obtained at
www.cs.ucdavis.edu/gusfield
18The Combinatorial Problem
Input: A ternary matrix (0,1,2) M with 2N
rows partitioned into N pairs of rows, where
the two rows in each pair are identical.
Def: If a pair of rows (r,r') in the partition have
entry values of 2 in a column j, then positions
(r,j) and (r',j) are called Mates.
19Output: A binary matrix M' created from M
by replacing each 2 in M with either 0 or 1,
such that
a) A position is assigned 0 if and only if its Mate
is assigned 1.
b) M' passes the 3-Gamete Test, i.e., does
not contain a 3x2 submatrix (after row and
column permutations) with all three
combinations 0,1 1,0 and 1,1.
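As a sanity check on the problem statement, conditions (a) and (b) can be verified mechanically for any candidate M'. The sketch below assumes that rows 2i and 2i+1 of M form the i-th pair of the partition; the function name is hypothetical.

```python
from itertools import combinations


def is_pph_output(M, M_prime):
    """Check conditions (a) and (b) for a candidate binary matrix M_prime."""
    rows, cols = len(M), len(M[0])
    # Assumption: rows 2i and 2i+1 form the i-th pair of the partition.
    for r in range(0, rows, 2):
        for j in range(cols):
            top, bottom = M_prime[r][j], M_prime[r + 1][j]
            if M[r][j] == 2:
                # (a) Mates must receive complementary values.
                if {top, bottom} != {0, 1}:
                    return False
            elif top != M[r][j] or bottom != M[r + 1][j]:
                # Entries that were already 0 or 1 must stay unchanged.
                return False
    # (b) 3-Gamete Test on every pair of columns.
    for i, j in combinations(range(cols), 2):
        seen = {(row[i], row[j]) for row in M_prime}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


# The small genotype example (a, b, c) with each row doubled.
M = [[2, 2], [2, 2], [0, 2], [0, 2], [1, 0], [1, 0]]
tree_explanation = [[1, 0], [0, 1], [0, 0], [0, 1], [1, 0], [1, 0]]
alternative = [[1, 1], [0, 0], [0, 0], [0, 1], [1, 0], [1, 0]]

print(is_pph_output(M, tree_explanation))  # True
print(is_pph_output(M, alternative))       # False: fails the 3-Gamete Test
```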
20Initial Observations
- If two columns of M contain the following rows
  2 0
  2 0 (mates)
  0 2
  0 2 (mates)
then M' will contain a row with 1 0 and a
row with 0 1 in those columns.
- This is a forced expansion.
21Initial Observations
- Similarly, if two columns of M contain the mates
  2 1
  2 1
then M' will contain a row with 1 1 in those
columns.
- This is a forced expansion.
22If a forced expansion of two columns creates 0 1
and 1 0 in those columns, then any
2 2
2 2
in those columns must be set to be
0 1
1 0
We say that two columns are forced out-of-phase.
If a forced expansion of two columns creates 1 1
in those columns, then any
2 2
2 2
in those columns must be set to be
1 1
0 0
We say that two columns are forced in-phase.
23Example
  1 2 3
a 1 2 2
a 1 2 2
b 2 0 2
b 2 0 2
c 1 2 2
c 1 2 2
d 1 2 2
d 1 2 2
e 2 2 0
e 2 2 0
Columns 1 and 2, and 1 and 3 are forced
in-phase. Columns 2 and 3 are forced
out-of-phase.
24Immediate Failure
It can happen that the forced expansion of
cells creates a 3x2 submatrix that fails the
3-Gamete Test. In that case, there is no PPH
solution for M.
Example
2 0
2 0
1 1
1 1
0 2
0 2
Will fail the 3-Gamete Test.
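The forced relationships and the immediate-failure case can be sketched together. The code below is an illustration, not Program DPPH: it assumes the genotypes are given with one row per individual (rather than as doubled rows), uses 0-based column indices, and the names `forced_gametes` and `forced_phase` are invented for this sketch.

```python
def forced_gametes(genotypes, i, j):
    """Collect the 0/1 pairs that must appear in columns i, j of any expansion."""
    forced = set()
    for row in genotypes:
        x, y = row[i], row[j]
        if x == 2 and y == 2:
            continue  # a 2 2 row forces nothing by itself
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        forced.update((a, b) for a in xs for b in ys)
    return forced


def forced_phase(genotypes, i, j):
    """Return 'in', 'out', 'fail', or None (not forced) for columns i, j."""
    forced = forced_gametes(genotypes, i, j)
    need_in = (1, 1) in forced
    need_out = (0, 1) in forced and (1, 0) in forced
    if need_in and need_out:
        return "fail"  # 0 1, 1 0 and 1 1 are all forced: no PPH solution
    if need_in:
        return "in"
    if need_out:
        return "out"
    return None


# The three-column example (individuals a..e), one row per individual.
example = [[1, 2, 2], [2, 0, 2], [1, 2, 2], [1, 2, 2], [2, 2, 0]]
print(forced_phase(example, 0, 1))  # in
print(forced_phase(example, 0, 2))  # in
print(forced_phase(example, 1, 2))  # out

# The immediate-failure example.
failing = [[2, 0], [1, 1], [0, 2]]
print(forced_phase(failing, 0, 1))  # fail
```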
25An O(nm^2)-time Algorithm
- Find all the forced phase relationships by
considering columns in pairs.
- Find all the inferred, invariant, phase
relationships.
- Find a set of column pairs whose phase
relationship can be arbitrarily set, so that all
the remaining phase relationships can be
inferred.
- Result: An implicit representation of all
solutions to the PPH problem.
26A running example.
  1 2 3 4 5 6 7
a 1 2 2 2 0 0 0
a 1 2 2 2 0 0 0
b 2 0 2 0 0 0 2
b 2 0 2 0 0 0 2
c 1 2 2 2 0 2 0
c 1 2 2 2 0 2 0
d 1 2 2 0 2 0 0
d 1 2 2 0 2 0 0
e 2 2 0 0 0 2 0
e 2 2 0 0 0 2 0
27Graph G
Each node represents a column in M, and each edge
indicates that the pair of columns has a row with
2s in both columns. The algorithm builds
this graph, and then checks whether any pair of
nodes is forced in or out of phase.
[Figure: graph G on nodes 1 through 7.]
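Building G from the running example takes only a few lines. In this sketch the matrix is written with one row per individual (a through e), columns are numbered from 1 as on the slides, and `build_graph` is an illustrative name.

```python
from itertools import combinations

# Running example, one row per individual a..e.
M = [
    [1, 2, 2, 2, 0, 0, 0],  # a
    [2, 0, 2, 0, 0, 0, 2],  # b
    [1, 2, 2, 2, 0, 2, 0],  # c
    [1, 2, 2, 0, 2, 0, 0],  # d
    [2, 2, 0, 0, 0, 2, 0],  # e
]


def build_graph(genotypes):
    """Return the edge set of G: column pairs sharing a row with 2s in both."""
    edges = set()
    for row in genotypes:
        twos = [col + 1 for col, value in enumerate(row) if value == 2]
        edges.update(combinations(twos, 2))  # pairs come out as (small, large)
    return edges


G = build_graph(M)
print(sorted(G))
```

For this matrix the sketch yields 13 edges; for instance (2,5) is an edge because of individual d, while (4,5) is not an edge because no row has 2s in both columns.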
28Graph Gc
Each Red edge indicates that the columns
are forced in-phase. Each Blue edge
indicates that the columns are forced
out-of-phase. Edges that are not forced are
drawn dashed.
Let Gf be the subgraph of Gc defined by the red
and blue edges.
[Figure: graph Gc on nodes 1 through 7.]
29Graph Gf has three connected components.
[Figure: graph Gf on nodes 1 through 7.]
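The three components can be reproduced from the running example. The following sketch is self-contained and makes some assumptions of its own: one matrix row per individual, a simple union-find for the components, and invented function names. The red and blue edges it reports are computed by the sketch, since the figure itself is not reproduced in this text.

```python
from itertools import combinations

M = [
    [1, 2, 2, 2, 0, 0, 0],  # a
    [2, 0, 2, 0, 0, 0, 2],  # b
    [1, 2, 2, 2, 0, 2, 0],  # c
    [1, 2, 2, 0, 2, 0, 0],  # d
    [2, 2, 0, 0, 0, 2, 0],  # e
]


def forced_color(genotypes, i, j):
    """'red' (in-phase), 'blue' (out-of-phase) or None for 0-based columns i, j."""
    forced = set()
    for row in genotypes:
        x, y = row[i], row[j]
        if x == 2 and y == 2:
            continue
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        forced.update((a, b) for a in xs for b in ys)
    need_in = (1, 1) in forced
    need_out = (0, 1) in forced and (1, 0) in forced
    if need_in and need_out:
        raise ValueError("immediate failure: no PPH solution")
    return "red" if need_in else "blue" if need_out else None


def forced_edges(genotypes):
    """Edges of Gf, keyed by 1-based column pairs (small, large)."""
    colors = {}
    for i, j in combinations(range(len(genotypes[0])), 2):
        color = forced_color(genotypes, i, j)  # also detects immediate failure
        is_edge_of_g = any(row[i] == 2 and row[j] == 2 for row in genotypes)
        if color and is_edge_of_g:
            colors[(i + 1, j + 1)] = color
    return colors


def components(nodes, edges):
    """Connected components via union-find, sorted by smallest node."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=min)


Gf = forced_edges(M)
parts = components(range(1, 8), Gf)
print(Gf)
print(parts)  # [{1, 2, 3, 4, 6}, {5}, {7}]
```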
30The Central Theorem
- There is a solution to the PPH problem for M if
and only if there is a coloring of the dashed
edges of Gc with the following property:
For any triangle (i,j,k) in Gc, where there
is one row containing 2s in all three columns i, j
and k (any triangle containing at least one
dashed edge will be of this type), the
coloring makes either 0 or 2 of the edges blue
(out-of-phase).
- Nice, but how do we find such a coloring?
31Note on CMU talk Feb. 28, 2003
In that talk I oversimplified the central
theorem, focusing only on the triangles with at
least one dashed edge. This approach can be made
to work, but wasn't quite right as stated in the
talk. The statement in the prior slide is
correct.
32Triangle Rule
Graph Gf
Theorem 1: If there are any dashed edges whose
ends are in the same connected component of Gf,
at least one such edge is in a triangle where the
other edges are not dashed, and in every
PPH solution, it must be colored so that the
triangle has an even number of Blue (out-of-phase)
edges. This is an inferred coloring.
[Figure: graph Gf on nodes 1 through 7.]
36Corollary
Inside any connected component of Gf, ALL the
phase relationships on edges (columns of M) are
uniquely determined, either as forced
relationships based on pairwise column
comparisons, or by triangle-based inferred
colorings. Hence, the phase relationships of all
the columns in a connected component of Gf are
INVARIANT over all the solutions to the PPH
problem.
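A naive version of the inferred colorings can be sketched as repeated application of the triangle rule: a dashed edge gets colored as soon as it closes a triangle whose other two edges are colored, and it is blue exactly when those two edges differ in color (so the triangle has 0 or 2 blue edges). This is a simple fixed-point loop, not the linear-time DFS ordering. The forced colors below are an assumption computed from the running example matrix, and `infer_colors` is an illustrative name.

```python
def infer_colors(colored, dashed):
    """Color dashed edges by successive use of the triangle rule.

    Edges are 1-based column pairs (small, large); colors are 'red' or 'blue'.
    Dashed edges that never close a suitable triangle stay uncolored.
    """
    colors = dict(colored)
    pending = set(dashed)
    nodes = {v for edge in list(colors) + list(pending) for v in edge}

    def color_of(u, v):
        return colors.get((min(u, v), max(u, v)))

    progress = True
    while pending and progress:
        progress = False
        for u, v in sorted(pending):
            for w in nodes - {u, v}:
                c1, c2 = color_of(u, w), color_of(v, w)
                if c1 and c2:
                    # Even number of blue edges in triangle (u, v, w).
                    colors[(u, v)] = "blue" if c1 != c2 else "red"
                    pending.discard((u, v))
                    progress = True
                    break
    return colors


# Assumed forced colors for the running example, and the dashed edges
# whose ends lie in the same component of Gf.
forced = {(1, 2): "red", (1, 3): "red", (1, 6): "red",
          (2, 3): "blue", (3, 6): "blue", (4, 6): "blue"}
dashed_inside = [(2, 4), (2, 6), (3, 4)]

result = infer_colors(forced, dashed_inside)
print({e: result[e] for e in dashed_inside})
```

On this input the sketch colors (2,6) and (3,4) red first, and then (2,4) blue on the second pass, once one of its triangles has two colored edges.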
37The dashed edges in Gf can be ordered so that the
inferred colorings can be done in linear time.
Modification of DFS. See the paper for details,
or assign it as a homework exercise.
38Finishing the Solution
- Problem: A connected component C of G may
contain several connected components of Gf, so
any edge crossing two components of Gf will still
be dashed. How should they be colored?
39How should we color the remaining dashed edges in
a connected component C of Gc?
[Figure: graph Gc on nodes 1 through 7.]
40Answer
For a connected component C of G with k
connected components of Gf, select any subset S
of k-1 dashed edges in C, so that S together
with the red and blue edges span all the nodes of
C. Arbitrarily, color each edge in S either red
or blue. Infer the color of any remaining dashed
edges by successive use of the triangle rule.
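One way to select such a set S is a Kruskal-style greedy pass: start from the components of Gf and keep a dashed edge only if it joins two components that are not yet connected. This is a sketch under assumptions of its own (the edge lists below are computed from the running example, and the function name is made up); since any valid S is allowed, the particular edges chosen depend only on the scan order.

```python
def choose_free_edges(nodes, colored_edges, dashed_edges):
    """Pick dashed edges that, together with the colored edges, span the nodes."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # First merge the nodes inside each component of Gf.
    for u, v in colored_edges:
        parent[find(u)] = find(v)

    chosen = []
    for u, v in sorted(dashed_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # this edge crosses two components
            parent[root_u] = root_v
            chosen.append((u, v))
    return chosen


# Assumed edge lists for the running example (one component of G, k = 3).
colored = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 6), (4, 6)]
dashed = [(1, 7), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 7)]

S = choose_free_edges(range(1, 8), colored, dashed)
print(S)  # [(1, 7), (2, 5)]
```

With k = 3 components of Gf the sketch returns k-1 = 2 edges, each of which may then be colored red or blue arbitrarily.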
41Pick and color edges (2,5) and (3,7). The
remaining dashed edges are colored by using the
triangle rule.
[Figure: graph Gc on nodes 1 through 7.]
43Theorem 2
- Any selected S works (allows the triangle rule to
work) and any coloring of the edges in S
determines the colors of any remaining dashed
edges.
- Different colorings of S determine different
colorings of the remaining dashed edges.
- Each different coloring of S determines a
different solution to the PPH problem.
- All PPH solutions can be obtained in this way,
i.e. using just one selected S set, but coloring
it in all 2^(k-1) ways.
44How does the coloring determine a PPH solution?
Each component of G is handled independently.
So, assume only one component of G. Arbitrarily
set the 2s in column 1, say as
1
0
  1 2 3 4 5 6 7
a 1 2 2 2 0 0 0
a 1 2 2 2 0 0 0
b 1 0 2 0 0 0 2
b 0 0 2 0 0 0 2
c 1 2 2 2 0 2 0
c 1 2 2 2 0 2 0
d 1 2 2 0 2 0 0
d 1 2 2 0 2 0 0
e 1 2 0 0 0 2 0
e 0 2 0 0 0 2 0
45For j from 2 to m: if a row in column j has a 2,
scan to the left for a column j' in M with a 2
in that row. If j' is found, use the phase
relationship between j' and j to set those 2s in
col. j. Otherwise, set them arbitrarily.
  1 2 3 4 5 6 7
a 1 1 2 2 0 0 0
a 1 0 2 2 0 0 0
b 1 0 2 0 0 0 2
b 0 0 2 0 0 0 2
c 1 1 2 2 0 2 0
c 1 0 2 2 0 2 0
d 1 1 2 0 2 0 0
d 1 0 2 0 2 0 0
e 1 1 0 0 0 2 0
e 0 0 0 0 0 2 0
46PPH solution derived from the edge coloring
  1 2 3 4 5 6 7
a 1 1 0 0 0 0 0
a 1 0 1 1 0 0 0
b 1 0 1 0 0 0 0
b 0 0 0 0 0 0 1
c 1 1 0 0 0 1 0
c 1 0 1 1 0 0 0
d 1 1 0 0 1 0 0
d 1 0 1 0 0 0 0
e 1 1 0 0 0 1 0
e 0 0 0 0 0 0 0
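The left-scan procedure can be sketched as follows. Because each individual's 2s are resolved only from that individual's own earlier columns, the sketch loops over individuals first and columns second, which gives the same result as going column by column. The full edge coloring in `phase` is an assumption: it combines the forced and inferred colors computed from the running example with a choice for the free edges that is consistent with the solution matrix shown above. The name `expand` is illustrative.

```python
def expand(genotypes, phase, default=(1, 0)):
    """Turn genotypes (one row per individual) into haplotype pairs.

    phase maps 1-based column pairs (small, large) to 'red' (in-phase)
    or 'blue' (out-of-phase). default is the arbitrary setting used for
    the first 2 in a row.
    """
    haplotypes = []
    for row in genotypes:
        top, bottom = list(row), list(row)
        for j, value in enumerate(row):
            if value != 2:
                continue
            # Scan to the left for the nearest column with a 2 in this row.
            left = next((k for k in range(j - 1, -1, -1) if row[k] == 2), None)
            if left is None:
                top[j], bottom[j] = default
            elif phase[(left + 1, j + 1)] == "red":
                top[j], bottom[j] = top[left], bottom[left]
            else:
                top[j], bottom[j] = bottom[left], top[left]
        haplotypes += [top, bottom]
    return haplotypes


M = [
    [1, 2, 2, 2, 0, 0, 0],  # a
    [2, 0, 2, 0, 0, 0, 2],  # b
    [1, 2, 2, 2, 0, 2, 0],  # c
    [1, 2, 2, 0, 2, 0, 0],  # d
    [2, 2, 0, 0, 0, 2, 0],  # e
]
# Assumed complete coloring of the 13 edges of G.
phase = {
    (1, 2): "red", (1, 3): "red", (1, 6): "red", (1, 7): "blue",
    (2, 3): "blue", (2, 4): "blue", (2, 5): "red", (2, 6): "red",
    (3, 4): "red", (3, 5): "blue", (3, 6): "blue", (3, 7): "blue",
    (4, 6): "blue",
}

solution = expand(M, phase)
for hap in solution:
    print(*hap)
```

With these colors the sketch prints the same ten rows as the solution matrix above.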
47A biologically more meaningful restatement?
Once a PPH solution is found we use the
connected components of Gf to partition
the columns (sites) into blocks. Inside each
block, the haplotype pairs are fixed. But in any
block, all the shaded 0s and 1s can be
switched, changing the complete haplotypes,
formed from all the blocks.
48Starting from a PPH Solution, if all shaded
cells in a block switch value, then the result
is also a PPH solution, and any PPH solution can
be obtained in this way, i.e. by choosing in
each block whether to switch or not.
  1 2 3 4 6 5 7
a 1 1 0 0 0 0 0
a 1 0 1 1 0 0 0
b 1 0 1 0 0 0 0
b 0 0 0 0 0 0 1
c 1 1 0 0 1 0 0
c 1 0 1 1 0 0 0
d 1 1 0 0 0 1 0
d 1 0 1 0 0 0 0
e 1 1 0 0 1 0 0
e 0 0 0 0 0 0 0
49Corollary
- In a single connected component C of G with k
connected components in Gf, there are exactly
2^(k-1) different solutions to the PPH problem in
the columns of M represented by C.
- If G has r connected components and t connected
components of Gf, then there are exactly 2^(t-r)
solutions to the PPH problem.
- There is one unique PPH solution if and only if
each connected component in G is a connected
component in Gf.
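The count in the corollary only needs the two component counts. A minimal sketch, assuming a PPH solution exists and using edge lists computed from the running example (the function names are made up for this illustration):

```python
def count_components(nodes, edges):
    """Number of connected components of the graph (nodes, edges)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


def count_pph_solutions(nodes, g_edges, gf_edges):
    """2^(t-r), valid only when at least one PPH solution exists."""
    r = count_components(nodes, g_edges)
    t = count_components(nodes, gf_edges)
    return 2 ** (t - r)


nodes = list(range(1, 8))
# Assumed edges of Gf (forced) and of G (forced plus dashed).
gf_edges = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 6), (4, 6)]
g_edges = gf_edges + [(1, 7), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 7)]

print(count_pph_solutions(nodes, g_edges, gf_edges))  # 4
```

Here r = 1 and t = 3, so the running example has 2^(3-1) = 4 solutions, matching the two free edges that can each be colored in two ways.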
50Algorithm
- Build Graph G and find its connected components.
Solve each connected component C of G separately.
- Find the forced (red or blue) edges. Let Gf be
the subgraph of C containing colored edges.
- Find each connected component of Gf and make the
inferred edge colorings (phase decisions).
- Find a spanning tree of uncolored edges in C, and
color those edges arbitrarily, and follow the
inferred edge colorings.
51Secondary information and optimization
- The partition shows explicitly what added phase
information is useful and what is redundant.
Phase information for an edge is redundant if and
only if the edge is inside a component of Gf.
Apply this successively as additional phase
information is obtained.
- Problem: Minimize the number of haplotype pairs
(individuals) that need be laboratory determined
in order to find the correct tree.
- Minimize the number of (individual, site1, site2)
triples whose phase relationship needs to be
determined, in order to find the correct tree.
52The implicit representation of all solutions
provides a framework for solving these secondary
problems, as well as other problems involving the
use of additional information, and specific
tree-selection criteria.
53A Phase-Transition
Problem: As the ratio of sites to genotypes
changes, how does the probability that the PPH
solution is unique change? For greatest utility,
we want genotype data where the PPH solution is
unique. Intuitively, as the ratio of genotypes
to sites increases, the probability of uniqueness
increases.
54Frequency of a unique solution with 50 and 100
sites, 5 rule and 2500 datasets per entry
geno.   Frequency of unique solution
10      0.0018
20      0.0032
22      0.7646
40      0.7488
42      0.9611
70      0.994
130     0.999
140     1

10      0
20      0
22      0.78
40      0.725
42      0.971
60      0.983
100     0.999
110     1
55Program DPPH
Program DPPH implements the solution to the PPH
problem discussed in this talk. It can be
obtained at wwwcsif.cs.ucdavis.edu/gusfield/
56Observed running times
The following are typical running times
of Program DPPH running on an 800 MHz Mac G4
Powerbook. The first number is the number of
genotypes and the second the number of
sites.
20,30     0.01 sec      400,500    14.8 sec
50,50     0.02 sec      400,600    23.5 sec
50,100    0.09 sec      500,1000   117.94 sec
100,100   0.16 sec      500,2000   770 sec
300,300   3.8 sec
57The full paper
Technical Report from UCD, July 17, 2002 can be
found on the recent papers page
through wwwcsif.cs.ucdavis.edu/gusfield