L5: Estimating Recombination Rates - PowerPoint PPT Presentation

Slides: 62
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

Title: L5: Estimating Recombination Rates


1
L5 Estimating Recombination Rates
2
Review
  • m_M = min. number of recombination events in any
    explanation of the haplotypes in M
  • Last time, we covered 3 lower bounds on m_M
  • The only exact algorithm that is known is
    super-exponential. Not even an exponential-time
    algorithm is known.
  • Can we get efficient upper bounds that are tight?
  • Idea: An R_s-like method can be used to get an
    upper bound.

3
Upper bounds
  • R_s bound
  • Procedure Compute_Rs(M)
  •   if ∃ non-informative column s
  •     return (Compute_Rs(M-s))
  •   else if ∃ redundant row h
  •     return (Compute_Rs(M-h))
  •   else
  •     return (1 + min_h(Compute_Rs(M-h)))
  • Upper Bound
  • Procedure Compute_U(M)
  •   if ∃ non-informative column s
  •     return (Compute_U(M-s))
  •   else if ∃ redundant row h
  •     return (Compute_U(M-h))
  •   else
  •     return (min_h(f(h,M-h) + Compute_U(M-h)))

f(h,M-h) = number of recombinations needed to explain h
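
The Compute_Rs recursion on this slide can be sketched in Python for small matrices. This is a minimal sketch, not the lecture's code. It assumes haplotypes are given as 0/1 strings, treats a column as non-informative when its rarer allele appears at most once, treats a row as redundant when it duplicates another row, and adds a base case (at most one row, or no columns left) that the slide leaves implicit. The names `compute_rs` and `_rs` are illustrative.

```python
from functools import lru_cache


def compute_rs(haplotypes):
    """R_s-style bound for a list of equal-length 0/1 strings (sketch)."""
    return _rs(tuple(sorted(haplotypes)))


@lru_cache(maxsize=None)
def _rs(rows):
    # Assumed base case: nothing left to explain.
    if len(rows) <= 1 or len(rows[0]) == 0:
        return 0
    n_cols = len(rows[0])
    # Drop a non-informative column (rarer allele seen at most once).
    for j in range(n_cols):
        ones = sum(r[j] == "1" for r in rows)
        if min(ones, len(rows) - ones) <= 1:
            reduced = tuple(sorted(r[:j] + r[j + 1:] for r in rows))
            return _rs(reduced)
    # Drop redundant (duplicate) rows.
    if len(set(rows)) < len(rows):
        return _rs(tuple(sorted(set(rows))))
    # Otherwise remove each row in turn and charge one recombination.
    return 1 + min(_rs(rows[:i] + rows[i + 1:]) for i in range(len(rows)))


print(compute_rs(["00", "01", "10", "11"]))  # all four gametes present
```

The recursion is exponential in the number of rows, so memoization (`lru_cache`) only helps on small inputs.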
4
Many approaches to estimating ρ
5
1. Counting methods
  • Rm
  • Rh
  • Rs
  • ARG with min number of recombinations
  • These numbers correlate with ρ, but how do we get
    a value for ρ given this number?
  • These numbers still have value in defining
    hot-spots of recombination (showing variance in
    local recombination rates)
  • They generally underestimate the true number of
    recombinations

6
2. Model based approaches
  • Full likelihood approaches
  • Approximate likelihood approaches

Fearnhead, Donnelly
7
Approximate Likelihood approaches
  • Two locus sampling
  • 4 gamete violation implies recombination.
  • Generalization
  • Define vector n = (n00, n01, n10, n11) for a
    pair of loci
  • The distribution of n depends upon ρ, θ
  • Can we compute Pr(n | ρ, θ)? Then, we can iterate
    to get the max likelihood estimator for ρ.
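
As a small illustration of the two-locus vector, the sketch below counts (n00, n01, n10, n11) for a chosen pair of loci. It assumes haplotypes are 0/1 strings; the function name `two_locus_counts` is illustrative and not from the lecture.

```python
from collections import Counter


def two_locus_counts(haplotypes, i, j):
    """Return (n00, n01, n10, n11) for loci i and j of 0/1 haplotype strings."""
    counts = Counter((h[i], h[j]) for h in haplotypes)
    return tuple(counts[(a, b)] for a in "01" for b in "01")


haps = ["00", "01", "10", "11", "00"]
n = two_locus_counts(haps, 0, 1)
print(n)
# All four entries positive means the pair violates the 4 gamete condition.
print(all(c > 0 for c in n))
```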

8
Two locus method
  • Generate MANY random ARGs with
    n = n00 + n01 + n10 + n11 leaves.
  • For each ARG, generate the two trees
    corresponding to the two loci
  • Drop 2 mutations at random, to get a value for n
  • How can you make this more efficient?
  • Given an ARG (topology), we know the edge pairs
    that would generate desired n.

9
Two locus estimation
10
Multi locus estimator
  • For a site with multiple loci, assume each pair
    to be independent, each generating a vector ni
  • Assume recombination rate (per bp) to be constant
    in the region
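
The two assumptions on this slide (independent pairs, constant per-bp rate) suggest a composite log-likelihood that sums one term per pair of loci. The sketch below is a minimal illustration under those assumptions. `pair_loglik(n, rho)` is a hypothetical caller-supplied function standing in for a precomputed two-locus table, and the grid search is just one simple way to maximize.

```python
def composite_loglik(pairs, rho_per_bp, pair_loglik):
    """Sum two-locus log-likelihoods over pairs.

    pairs: list of (n, distance_bp), where n = (n00, n01, n10, n11).
    The rate for a pair is rho_per_bp * distance_bp (constant per-bp rate).
    """
    return sum(pair_loglik(n, rho_per_bp * d) for n, d in pairs)


def max_composite_rho(pairs, grid, pair_loglik):
    """Return the grid value of rho_per_bp with the largest composite score."""
    return max(grid, key=lambda r: composite_loglik(pairs, r, pair_loglik))


# Toy stand-in for a lookup table: peaks when the pairwise rate equals 2.0.
def toy_loglik(n, rho):
    return -(rho - 2.0) ** 2


pairs = [((3, 1, 1, 5), 1000)]
print(max_composite_rho(pairs, [0.001, 0.002, 0.003], toy_loglik))
```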

11
Performance of the 2 locus estimator
  • The composite likelihood estimator performs
    well in practice.
  • Note that the values of ? can be pre-computed
    making this a fast method.
  • Note that this plot does not describe the variance

12
Performance: 90/10 percentile
13
Research: 2-locus versus other statistics
  • Q1: Can we use some of the counting-based methods
    as summary statistics?
  • It is better than composite likelihood in that
  • It does not assume independence between loci.
  • There is a direct linear relationship (expected
    number of recombination events is ρ log n)
  • Variation might be better.
  • Can we compute Pr(R_h | ρ, θ) efficiently? In a
    sense, it does not matter, because we can
    pre-compute the numbers.
  • Incorporate distance constraints in computing
    these summary statistics. It is reasonable to
    assume that the rate is constant per bp within a
    window.

14
Research Problem
  • Recombination hot-spots are NOT correlated
    between humans and chimps.
  • 99% sequence identity
  • Virtually no overlap between hot-spots (generated
    using population genetics).
  • What can cause this?
  • Method
  • Europeans/Africans share hot-spots
  • Concordance with sperm typing
  • Population sub-structure? No (as shown by
    structure)
  • Genomic factors

15
Genomic factors
  • Recombination is elevated in GC rich regions
  • Epigenetic factors (such as acetylation,
    methylation) that affect chromatin structure
    might be key.
  • Yeast is a useful model for studying
    recombination
  • In yeast, recombination hotspots can be
    eliminated by insertion of transposable elements!
  • Can differential insertion of Alus explain the
    differences between chimps/humans?

16
Haplotype Phasing
17
Genotypes and Haplotypes
  • Each individual has two copies of each
    chromosome.
  • At each site, each chromosome has one of two
    alleles
  • Current genotyping technology doesn't give phase

Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype for the individual: 2 1 2 1 0 0 1 2 0
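
The encoding in this example (0 or 1 where both copies agree, 2 where they differ) can be written as a short function. This is a minimal sketch; the name `genotype` is illustrative.

```python
def genotype(h1, h2):
    """Combine two haplotypes into a 0/1/2 genotype (2 = heterozygous site)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype(h1, h2))
```

Phasing is the inverse problem: many haplotype pairs map to the same genotype, so the function above loses information.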
18
  • Why is haplotype phasing important ?

19
Haplotype Phasing
  • Haplotype Phasing is the resolution of a genotype
    into the two haplotypes.
  • Haplotypes increase the power of an association
    between marker loci and phenotypic traits
  • Current approaches to Haplotyping
  • Via technological innovations (expensive)
  • Statistical Methods (ML, Phase,PL)
  • This lecture, we will consider a combinatorial
    approach to the phasing problem
  • Efficient, provable quality of solution
  • Not completely generalizable (as yet)

20
The Perfect Phylogeny Model
  • We assume that the evolution of extant haplotypes
    can be displayed on a rooted, directed tree, with
    the all-0 haplotype at the root, where each site
    changes from 0 to 1 on exactly one edge, and each
    extant haplotype is created by accumulating the
    changes on a path from the root to a leaf, where
    that haplotype is displayed.
  • In other words, the extant haplotypes evolved
    along a perfect phylogeny with all-0 root.

[Figure: perfect phylogeny on sites 1-5 rooted at 00000, with one
edge per site; haplotypes labelled on the tree: 00010, 10100,
10000, 01010, 01011 (Extant Haplotypes)]
21
Haplotyping via Perfect Phylogeny
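PPH: Given a set of genotypes, find an explaining
set of haplotypes that fits a perfect phylogeny
Genotypes (sites 1 2):
a 2 2
b 0 2
c 1 0
Explaining haplotypes (sites 1 2):
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
[Figure: perfect phylogeny with root 00; site 1 changes on the edge
to leaf 10 (a, c, c), site 2 changes on the edge to leaf 01 (a, b),
and b also appears at 00]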
22
The Alternative Explanation
Genotypes (sites 1 2):
a 2 2
b 0 2
c 1 0
Haplotypes (sites 1 2):
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
23
The 4 Gamete Test for Perfect Phylogeny
  • Arrange the haplotypes in a matrix, two
    haplotypes for each individual.
  • Then (with no duplicate columns), the haplotypes
    fit a unique perfect phylogeny if and only if no
    two columns contain all four pairs
    0,0 and 0,1 and 1,0 and 1,1 (Buneman)

Forbidden rows in a pair of columns: 00, 10, 01, 11
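
The test on this slide can be checked directly by scanning every pair of columns. This is a minimal sketch assuming haplotypes are 0/1 strings; the name `passes_four_gamete_test` is illustrative. The two inputs are the tree explanation and the alternative explanation of genotypes a = 2 2, b = 0 2, c = 1 0 from the neighbouring slides.

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """True if no pair of columns shows all of 00, 01, 10 and 11."""
    n_cols = len(haplotypes[0])
    for i, j in combinations(range(n_cols), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if len(seen) == 4:  # binary data: four distinct pairs = all gametes
            return False
    return True


tree_explanation = ["10", "01", "00", "01", "10", "10"]
alternative = ["11", "00", "00", "01", "10", "10"]
print(passes_four_gamete_test(tree_explanation))
print(passes_four_gamete_test(alternative))
```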
24
The Alternative Explanation
Genotypes (sites 1 2):
a 2 2
b 0 2
c 1 0
Haplotypes (sites 1 2):
a 1 1
a 0 0
b 0 0
b 0 1
c 1 0
c 1 0
No tree possible for this explanation
25
The Tree Explanation Again
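Genotypes (sites 1 2):
a 2 2
b 0 2
c 1 0
Explaining haplotypes (sites 1 2):
a 1 0
a 0 1
b 0 0
b 0 1
c 1 0
c 1 0
[Figure: perfect phylogeny with root 0 0; site 1 changes on one
edge, site 2 on the other; a and b appear at leaf 0 1, b at 0 0,
and a, c, c at leaf 1 0]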
26
The Combinatorial Problem
  • Input: A ternary matrix (0,1,2) M with N rows
  • Output: A binary matrix M' created from M by
    replacing each 2 in M with a 0 and 1 (each row of
    M expands into two rows of M'), such that
    M' passes the 4 gamete test
  • Gusfield (RECOMB 2002) proposed a solution which
    used a reduction to Matroids.
  • We present a (slightly inefficient) solution
    using elementary techniques
  • Independently by (Eskin, Halperin, Karp 02)


27
Initial Observations
  • Forced Expansions
  • Ex. 1: If two columns (sites) of M contain the
    following rows
  • 2 0
  • 0 2
  • Then M' will contain a row with 1 0 and a row
    with 0 1 in those columns.
  • Ex. 2: Similarly, if two columns of M contain the
    rows
  • 2 1
  • 2 0
  • Then M' will contain rows with 1 1 and 0 0 in
    those columns

28
Initial Observations
If a forced expansion of two columns creates rows
0 1 and 1 0 in those columns, then any 2 2 in
those columns must be set to 0 1 and 1 0.
We say that the two columns are forced out-of-phase.
If a forced expansion of two columns creates rows
1 1 and 0 0 in those columns, then any 2 2 in those
columns must be set to 1 1 and 0 0.
We say that the two columns are forced in-phase.
29
Immediate Failure
It can happen that the forced expansion of
cells creates a 4x2 submatrix that fails the
4-Gamete Test. In that case, there is no PPH
solution for M.
Example: the rows 2 0, 1 2 and 0 2 in two columns
will fail the 4-Gamete Test
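
The forced-expansion rules and the immediate-failure check can be sketched for a single pair of columns. This is a minimal illustration, not the lecture's code: it assumes genotype rows are lists over {0, 1, 2}, and the names `forced_pairs` and `forced_phase` are illustrative. Rows with 2 in both columns are skipped because their expansion is exactly what is being decided.

```python
def forced_pairs(matrix, c1, c2):
    """Haplotype pairs that must appear in columns c1, c2 of any expansion."""
    pairs = set()
    for row in matrix:
        x, y = row[c1], row[c2]
        if x == 2 and y == 2:
            continue  # not forced: the phase of this row is the open question
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        pairs.update((a, b) for a in xs for b in ys)
    return pairs


def forced_phase(matrix, c1, c2):
    """Return 'fail', 'out' (out-of-phase), 'in' (in-phase) or None."""
    pairs = forced_pairs(matrix, c1, c2)
    if len(pairs) == 4:
        return "fail"  # all four gametes are already forced
    if {(0, 1), (1, 0)} <= pairs:
        return "out"
    if {(0, 0), (1, 1)} <= pairs:
        return "in"
    return None


print(forced_phase([[2, 0], [0, 2]], 0, 1))          # rows 2 0 and 0 2
print(forced_phase([[2, 0], [1, 2], [0, 2]], 0, 1))  # rows 2 0, 1 2, 0 2
```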
30
An O(ns^2)-time Algorithm
  • Find all the forced phase relationships by
    considering columns in pairs.
  • Find all the inferred, invariant, phase
    relationships.
  • Find a set of column pairs whose phase
    relationship can be arbitrarily set, so that all
    the remaining phase relationships can be
    inferred.
  • Result An implicit representation of all
    solutions to the PPH problem.

31
A Running Example
   1 2 3 4 5 6 7
A  1 2 2 2 0 0 0
B  2 0 2 0 0 0 2
C  1 2 2 2 0 2 0
D  1 2 2 0 2 0 0
E  2 2 0 0 0 2 0
F  0 0 0 0 0 0 0
32
Companion Graph G_c
[Figure: the running-example matrix M (columns 1-7, rows A-F) and
its companion graph G_c on nodes 1-7]
  • Each node represents a column in M, and each
    edge indicates that the pair of columns has a row
    with 2s in both columns.
  • The algorithm builds this graph, and then checks
    whether any pair of nodes is forced in or out of
    phase.

33
Phasing Edges in G_c
  • Each Red edge indicates that the columns are
    forced in-phase.
  • Each Blue edge indicates that the columns are
    forced out-of-phase.

[Figure: graph G_c on column nodes 1-7 with red and blue edges]
Let G_f be the sub-graph of G_c defined by the
red and blue edges.
34
Connected Components in G_f
  • Graph G_f has three connected components
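
The construction of G_c, its red and blue edges, and the components of G_f can be reproduced on the running example. This is a minimal sketch under stated assumptions: rows A-F and columns 1-7 are read from the running-example matrix, the edge coloring follows the pairwise forced-expansion rules, and the names `edge_color`, `companion_graph` and `count_components` are illustrative.

```python
from itertools import combinations

# Running example: rows A-F, columns 1-7 (2 = heterozygous site).
M = [
    [1, 2, 2, 2, 0, 0, 0],  # A
    [2, 0, 2, 0, 0, 0, 2],  # B
    [1, 2, 2, 2, 0, 2, 0],  # C
    [1, 2, 2, 0, 2, 0, 0],  # D
    [2, 2, 0, 0, 0, 2, 0],  # E
    [0, 0, 0, 0, 0, 0, 0],  # F
]


def edge_color(matrix, i, j):
    """'red' = forced in-phase, 'blue' = forced out-of-phase, 'black' = unforced."""
    pairs = set()
    for row in matrix:
        x, y = row[i], row[j]
        if x == 2 and y == 2:
            continue
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        pairs.update((a, b) for a in xs for b in ys)
    if len(pairs) == 4:
        raise ValueError("immediate failure: no PPH solution")
    if {(0, 1), (1, 0)} <= pairs:
        return "blue"
    if {(0, 0), (1, 1)} <= pairs:
        return "red"
    return "black"


def companion_graph(matrix):
    """Map each G_c edge (1-based column pair sharing a 2 2 row) to its color."""
    n_cols = len(matrix[0])
    return {
        (i + 1, j + 1): edge_color(matrix, i, j)
        for i, j in combinations(range(n_cols), 2)
        if any(r[i] == 2 and r[j] == 2 for r in matrix)
    }


def count_components(nodes, edges):
    """Number of connected components, using a small union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


colors = companion_graph(M)
forced = sorted(e for e, c in colors.items() if c != "black")
print(forced)
print(count_components(range(1, 8), forced))  # components of G_f
```

Isolated nodes count as components of G_f here, which is what gives three components for this matrix.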
35
Phase-parity Lemma
  • Lemma 1: There is a solution to the PPH problem
    for M if and only if there is a coloring of the
    black edges of G_c (the edges that are neither
    red nor blue) with the following property:
  • For any triangle in G_c containing at least
    one black edge, the coloring makes either 0 or 2
    of the edges blue (i.e., out of phase)


That's nice, but how do we assign the colors?
36
A Weak Triangulation Rule
  • Theorem 1: If there are any black edges whose
    ends are in the same connected component of G_f,
    at least one such edge is in a triangle where the
    other edges are not black
  • In every PPH solution, it must be colored so
    that the triangle has an even number of Blue (out
    of Phase) edges.
  • This is an inferred coloring.
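
The triangle rule can be sketched as a parity computation plus a simple fixed-point loop. This is a minimal illustration, not the linear-time method: the edge colors below are hard-coded as an assumption (they are what the pairwise forced-expansion rules give on the running example), and the names `third_edge_color` and `propagate` are illustrative.

```python
def third_edge_color(c1, c2):
    """Color forced on a black edge by the two colored edges of its triangle."""
    # The triangle needs 0 or 2 blue edges, so the third edge is blue
    # exactly when one (not both) of the other two is blue.
    return "blue" if (c1 == "blue") != (c2 == "blue") else "red"


def propagate(colors):
    """Repeatedly color black edges that close a triangle of colored edges."""
    colors = dict(colors)
    nodes = {v for edge in colors for v in edge}

    def get(u, v):
        return colors.get((min(u, v), max(u, v)))

    changed = True
    while changed:
        changed = False
        for (u, v), c in list(colors.items()):
            if c != "black":
                continue
            for w in nodes - {u, v}:
                cu, cv = get(u, w), get(v, w)
                if cu in ("red", "blue") and cv in ("red", "blue"):
                    colors[(u, v)] = third_edge_color(cu, cv)
                    changed = True
                    break
    return colors


# Assumed edge colors of G_c for the running example (columns 1-7).
gc = {
    (1, 2): "red", (1, 3): "red", (1, 6): "red", (1, 7): "black",
    (2, 3): "blue", (2, 4): "black", (2, 5): "black", (2, 6): "black",
    (3, 4): "black", (3, 5): "black", (3, 6): "blue", (3, 7): "black",
    (4, 6): "blue",
}
result = propagate(gc)
print(result[(2, 6)], result[(3, 4)], result[(2, 4)])
print(sorted(e for e, c in result.items() if c == "black"))
```

The edges still black after propagation are the ones that join different connected components of G_f.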


37
(No Transcript)
38
Graph G_f
39
Graph G_f
40
Graph G_f
41
Corollary
  • Inside any connected component of G_f, ALL the
    phase relationships on edges (columns of M) are
    uniquely determined, either as forced
    relationships based on pair-wise column
    comparisons, or by triangle-based inferred
    colorings.
  • Hence, the phase relationships of all the columns
    in a connected component of G_f are INVARIANT
    over all the solutions to the PPH problem.
  • The black edges in G_f can be ordered so that the
    inferred colorings can be done in linear time.
    Modification of DFS.

42
Phase Parity Lemma Proof
Rows in a pair of columns:
2 X
Y 2
2 2
If X ≠ 2 and Y ≠ 2, then the two columns are
forced
43
Phase Parity Lemma proof
  • Lemma: If a triangle contains a black edge, then
    a PPH solution exists only if there are 0 or 2
    blue edges in the final coloring.
  • Proof
  • No black edge unless x = 2, or y = 2, or z = 2
    (previous lemma)
  • If there is a row with all 2s, then there must be
    an even number of blue edges

[Figure: triangle on columns A, B, C]
A B C
2 2 y
x 2 2
2 z 2
44
Proof of Weak Triangulation Theorem
  • Arbitrary chordless cycles are possible in the
    graph, with forced edges.
  • See example. The pattern 0,2 2,0 and 2,2
    implies a blue (out of phase) edge
  • A single unforced edge changes the picture

[Figure: chordless cycle on nodes A, B, C, D, E]
A B C D E
2 2 0 0 0
0 2 2 0 0
0 0 2 2 0
0 0 0 2 2
2 0 0 0 2
45
Proof of Weak Triangulation Theorem
  • Let (J,J') be a black edge connecting a long
    path J,K,...,K',J' of forced edges
  • In the matrix, x ≠ 2, otherwise there is a chord.
    Likewise y ≠ 2
  • By previous lemma, (J,J') is forced

K J J' K'
2 2 x
  y 2  2
  2 2
46
Finishing the Solution
  • Problem: A connected component C of G_c may
    contain several connected components of G_f, so
    any edge crossing two components of G_f will
    still be black. How should they be colored?

47
  • How should we color the remaining black edges in
    a connected component C of G_c?

48
Answer
  • For a connected component C of G_c with k
    connected components of G_f, select any subset S
    of k-1 black edges in C, so that S together with
    the red and blue edges span all the nodes of C.
  • Arbitrarily color each edge in S either red or
    blue.
  • Infer the color of any remaining black edges by
    successive use of the triangle rule.

[Figure: example graph with nodes labelled 7, 6, 3, 4, 2, 5]
49
[Figure: graph with nodes labelled 7, 3, 2, 5]
50
Theorem 2
  • Any selected S works (allows the triangle rule to
    work) and any coloring of the edges in S
    determines the colors of any remaining black
    edges.
  • Different colorings of S determine different
    colorings of the remaining black edges.
  • Each different coloring of S determines a
    different solution to the PPH problem.
  • All PPH solutions can be obtained in this way,
    i.e. using just one selected S set, but coloring
    it in all 2^(k-1) ways.

51
Corollary
  • In a single connected component C of G_c with k
    connected components in G_f, there are exactly
    2^(k-1) different solutions to the PPH problem in
    the columns of M represented by C.
  • If G_c has r connected components and t connected
    components of G_f, then there are exactly 2^(t-r)
    solutions to the PPH problem.
  • There is one unique PPH solution if and only if
    each connected component in G_c is a connected
    component in G_f.
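
The counting formula can be evaluated by counting connected components of the two graphs. This is a minimal sketch: the edge lists are hard-coded as an assumption (they are the G_c edges and the forced red/blue edges that the pairwise rules give on the running example), and the names `count_components` and `count_pph_solutions` are illustrative.

```python
def count_components(nodes, edges):
    """Number of connected components, using a small union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


def count_pph_solutions(nodes, gc_edges, gf_edges):
    """2^(t-r): r components in G_c, t components in G_f."""
    r = count_components(nodes, gc_edges)
    t = count_components(nodes, gf_edges)
    return 2 ** (t - r)


# Assumed edge sets for the running example (columns 1-7).
gc_edges = [(1, 2), (1, 3), (1, 6), (1, 7), (2, 3), (2, 4), (2, 5),
            (2, 6), (3, 4), (3, 5), (3, 6), (3, 7), (4, 6)]
gf_edges = [(1, 2), (1, 3), (1, 6), (2, 3), (3, 6), (4, 6)]
print(count_pph_solutions(range(1, 8), gc_edges, gf_edges))
```

When every edge of G_c is forced, the two graphs have the same components and the count is 1, matching the uniqueness condition in the last bullet.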

52
Algorithm
  • Build graph G_c and find its connected components.
    Solve each connected component C of G_c separately.
  • Find the forced (red or blue) edges. Let G_f be
    the subgraph of C containing colored edges.
  • Find each connected component of G_f and make the
    inferred edge colorings (phase decisions).
  • Find a spanning tree of uncolored edges in C,
    color those edges arbitrarily, and follow the
    inferred edge colorings

53
Conclusion
  • In the special case of blocks with no
    recombination, and no recurrent mutations, the
    haplotypes satisfy a perfect phylogeny
  • Given a set of genotypes, there is an efficient
    (O(ns^2)) algorithm for representing all possible
    haplotype solutions that satisfy a perfect
    phylogeny
  • Efficiency
  • Input is size O(ns),
  • All operations except building the graph are
    O(ns + s^2)
  • Valid PPH only if s = O(n). Is O(ns) possible?
  • Current best solution is O(ns + n^(1-ε) s^2) using
    Matrix Multiplication idea
  • Future work involves combining this with some
    heuristics to deal with general cases (lo
    recombination/hi recombination)

54
Simulated Data
  • Coalescent model (Hudson)
  • No Recombination
  • 400 chromosomes, 100 sites
  • Infinite sites
  • Recombination
  • 100 chromosomes
  • Infinite sites
  • R4.0 2501
  • Pr(Recombination) = 4 × 10^(-9) between adjacent
    bases

55
Error Measurement
  • Discrepancy 1 (Num Haplotypes incorrectly
    predicted)
  • Switch Error 2

01010 00101 01010 10101
02222 22222
00101 01010 00000 11111
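
A switch-error count can be sketched as follows. This is a minimal illustration assuming the common definition (the number of phase switches between consecutive heterozygous sites needed to turn the predicted pair into the true pair), which may differ in detail from the measure used on this slide. The name `switch_errors` is illustrative, and the predicted pair is assumed to be consistent with the same genotype as the true pair.

```python
def switch_errors(true_pair, predicted_pair):
    """Count phase switches between consecutive heterozygous sites."""
    t1, t2 = true_pair
    p1, _ = predicted_pair
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # At each heterozygous site, does the first predicted haplotype
    # carry the same allele as the first true haplotype?
    agree = [p1[i] == t1[i] for i in het]
    return sum(a != b for a, b in zip(agree, agree[1:]))


print(switch_errors(("01010", "10101"), ("00000", "11111")))
print(switch_errors(("01010", "00101"), ("00101", "01010")))
```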
56
No Recombination
57
No Recombination
58
Choosing between solutions
59
Choosing between solutions
60
Choosing between solutions
61
Conclusion
  • Extremely low error rates (< 1% discrepancy) if
    no recombination
  • Randomly choosing between equivalent solutions is
    sufficient
  • Other measures (Parsimony, Likelihood, Entropy)
    do not improve the quality of solution

62
With Recombination