Title: Haplotype Blocks
1Haplotype Blocks
- An Overview
- A. Polanski
- Department of Statistics
- Rice University
2Key Papers
- N. Patil et al. (2001), Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21, Science, vol. 294, pp. 1719-1723
- N. Wang et al. (2002), Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: The Interplay of Population History, Recombination, and Mutation, Am. J. Hum. Genet., vol. 71, pp. 1227-1234
- K. Zhang et al. (2002), A Dynamic Programming Algorithm for Haplotype Block Partitioning, PNAS, vol. 99, pp. 7335-7339
3Supplementary Papers
- R. Hudson, N. Kaplan (1985), Statistical Properties of the Number of Recombination Events in the History of a Sample of DNA Sequences, Genetics, vol. 111, pp. 147-164
- R. Hudson (2002), Generating Samples under a Wright-Fisher Neutral Model of Genetic Variation, Bioinformatics, vol. 18, pp. 337-338
- D. Reich et al. (2001), Linkage Disequilibrium in the Human Genome, Nature, vol. 411, pp. 199-204
4What are Haplotype Blocks?
- Haplotype block: a sequence of contiguous markers on DNA, homogeneous according to some criterion
- Markers: Single Nucleotide Polymorphisms (SNPs)
5Data (Patil et al. 2001)
- Chromosome 21
- The two copies of chromosome 21 were physically separated using a rodent-human somatic cell hybrid technique
- Sample of 20 copies of chromosome 21 (32397439 bases)
- Found 35989 SNPs
6Fig. 2 from (Patil et al. 2001)
7SNP no. i
01000000000000000000100000000000000100001110000000
00100000001001000000001001000000000000000000001000
000001101000010101010 0000000010000000000010000000
00010010000100000000000000101100100100101000100100
0000000010010001011000000001101010010101010 000000
00010001000101100010100000000101000110000000000101
00000000000100000100110000011101001000000110000110
001000100011010 0000000000000100010010001010000000
01010001100000000001010000000000010000010011000001
1101001000000110000110001000100011010 000000001000
00000000100001000001001000000000000000000010010010
01001010001001000000000010010001011000000001100100
000000000 0010000000100001000010010000000000010000
01100000000001010000000010010011010001000000001000
0001001000001001110100000000000 000000001000000000
00100001001101001000000000000000000010010010010010
10001001000000000010010001011000000001100100000000
000 1000100000000000000001000001000101000000000000
00000100000000100100000100100100000010000000010000
1000000001101010010101010 000000000000100000001000
00000000000100000110000000000000000010010000000010
01000000000000000000001000000001101000010101010 00
00000010000000000010000100000100100000000000000000
00100100110100101000100100000000001001000101100000
0001100100000000000 100010000000000000000100000100
01010000000000000000010000000010010000000010010000
00000000000000001000010001101010010101010 00001000
00000000100001000000000101000000000000000000000000
00100101000000100100000000000000000000100000000110
1000010101010 100010000000000000000100000100010100
00000000000000010000000010010000010010010000001000
00000100001000000001101010010101010 00000001001000
00000010010000000000011000011010000000010100000010
10010010010001001000001010000100100000100111010000
0000000 100010000000001000000100000100010100000000
00000000010000000010010000010010010000001000000001
00001000000001101010010101010 00000000001000000000
10010000000000010000011000000000010100000000100100
10010001000000001000000100100000100111010100000000
1 000000000010000000001001000000000001000001101000
00000101000000101001001001000100100000100000010010
01001001110100000000000 00010010000100000010001000
00001010000000011001111110000000110000000000000010
011101010000001010100100000000001000001011110 0000
10000000000010000100000000010100000000000000000000
00000010010100000010010000000000000000000010000000
01101000010101010 00010100000000000010000000000000
10000010011101000010000000100000000000000010010001
010000001000100100100000001000001011010
20 chromosomes (rows)
i = 1, 2, ..., 35989
8Problems
9How do we determine boundaries between blocks?
- The average value of the standardized coefficient of linkage disequilibrium is greater than some threshold (Wang et al. 2002, Reich et al. 2001)
- Infer the sites in the sample of DNA sequences where recombination events happened in the past (Wang et al. 2002, Hudson, 2002)
- Chromosome coverage: the minimum number of SNPs that account for the majority of haplotypes (Patil et al. 2001, Zhang et al. 2002)
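The first criterion can be illustrated with a minimal sketch. It assumes that the "standardized coefficient of linkage disequilibrium" is Lewontin's |D'| and that each SNP is given as a list of phased 0/1 alleles, one per chromosome. The function name `d_prime` is hypothetical and is not taken from any of the cited papers.

```python
def d_prime(snp_a, snp_b):
    """Return |D'| for two biallelic SNPs given as equal-length 0/1 lists
    (one entry per phased chromosome). Returns 0.0 if either SNP is
    monomorphic in the sample."""
    n = len(snp_a)
    p_a = sum(snp_a) / n    # frequency of allele 1 at the first SNP
    p_b = sum(snp_b) / n    # frequency of allele 1 at the second SNP
    # frequency of chromosomes carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b    # raw disequilibrium coefficient D
    # largest |D| attainable given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0
    return abs(d) / d_max
```

A block criterion of this kind would average such pairwise values over the SNPs in a candidate block and compare the average to a chosen threshold.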
10What evolutionary forces are responsible for haplotype block formation?
- Mutation
- Genetic drift
- Recombination
- Recombination hot spots
11Methods
12Method 1 (Wang et al. 2002)
Infer sites in the sample of DNA sequences where
recombination events happened in the past
history
13Three gamete condition
- Consider a pair of SNPs, SNP1 and SNP2. If there was no recombination between SNP1 and SNP2, they must satisfy the three-gamete condition
[Diagram: ancestral haplotype AC; mutation A→G at SNP1 gives GC; mutation C→T at SNP2 then gives GT]
SNP1 SNP2
A C
G C
G T
14Four gamete test (Hudson and Kaplan, 1985)
- If we see all four gametes at SNP1 and SNP2 (4GT):
SNP1 SNP2
A C
G C
G T
A T
then there must have been a recombination event between these sites in their past history
15Array of pairwise 4GT test results
$D = [d_{ij}]$, $d_{ij} = \begin{cases} 0, & \text{if there are fewer than 4 gametes} \\ 1, & \text{if there are 4 gametes} \end{cases}$
What is the minimal number of recombinations that could explain the observed data? Statistic FR (Hudson and Kaplan, 1985)
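A minimal sketch of the array D, together with one simple way to read blocks off it. The left-to-right scan in `blocks_from_4gt` is an assumption made for illustration; it is not claimed to be the exact procedure of Wang et al. (2002), and it does not compute the Hudson and Kaplan statistic. Both function names are hypothetical.

```python
def pairwise_4gt(haplotypes):
    """haplotypes: list of equal-length 0/1 strings, one per chromosome.
    Returns D as a list of lists with D[i][j] = 1 if SNPs i and j show
    all four gametes, and 0 otherwise."""
    n_snps = len(haplotypes[0])
    # columns[i] holds the alleles of SNP i across all chromosomes
    columns = [[h[i] for h in haplotypes] for i in range(n_snps)]
    d = [[0] * n_snps for _ in range(n_snps)]
    for i in range(n_snps):
        for j in range(i + 1, n_snps):
            if len(set(zip(columns[i], columns[j]))) == 4:
                d[i][j] = d[j][i] = 1
    return d


def blocks_from_4gt(d):
    """Assumed scan (illustration only): extend the current block while the
    next SNP shows fewer than four gametes with every SNP already in the
    block; otherwise close the block and start a new one at that SNP.
    Returns inclusive (first SNP, last SNP) index pairs."""
    blocks, start = [], 0
    for j in range(1, len(d)):
        if any(d[i][j] for i in range(start, j)):
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, len(d) - 1))
    return blocks
```

With the five chromosomes `["000", "100", "110", "001", "111"]`, SNPs 0 and 1 are compatible while SNP 2 shows four gametes with both, so the scan returns the blocks `(0, 1)` and `(2, 2)`.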
16Fig. 1 from Wang et al., 2002
[Figure: array D with Block 1, Block 2, and Block 3 marked]
17Wang et al., 2002 - Study
- R. Hudson's program for simulating genealogies with mutation, drift, and recombination under various demographic scenarios
- Study of the dependence of average block length on different factors
- Comparison of simulation results to data from Patil et al., 2001
18Dependence of average lengths of blocks on
recombination frequency
19... on sample size
20... on mutation intensity
21Comparison to data from Patil et al. 2001
- Compute the distribution of haplotype block lengths in the data from Patil et al. 2001
- Try to tune the parameters θ and R to obtain a similar distribution in the simulations
22... Failed
23Try a mixture of two different recombination
frequencies - better
24Method 2 (Patil et al. 2001)
Chromosome coverage: the minimum number of SNPs that account for the majority of haplotypes
25Fig. 2 from (Patil et al. 2001)
26Problem formulation
- Define block boundaries to minimize the number of SNPs that distinguish at least α percent of the haplotypes in each block
27Common haplotypes
- Those represented more than once in the block
28Condition
- Common haplotypes must constitute at least α = 80 percent of all haplotypes in the block
- Blocks that do not satisfy this are not allowed
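The condition can be checked directly by counting haplotypes within a block. A minimal sketch, assuming phased haplotypes without missing data, stored as equal-length 0/1 strings, and a block given by inclusive SNP indices (the function names and the `alpha` parameter are hypothetical):

```python
from collections import Counter


def common_coverage(haplotypes, start, end):
    """Fraction of chromosomes whose haplotype over SNPs start..end
    (inclusive, 0-based) is common, i.e. occurs more than once."""
    segments = [h[start:end + 1] for h in haplotypes]
    counts = Counter(segments)
    n_common = sum(c for c in counts.values() if c > 1)
    return n_common / len(segments)


def block_allowed(haplotypes, start, end, alpha=0.8):
    """True if common haplotypes make up at least a fraction alpha
    of the sample within the block."""
    return common_coverage(haplotypes, start, end) >= alpha
```

For example, with the sample `["00", "00", "01", "01", "11"]` four of the five chromosomes carry a common haplotype, so the block just meets the 80 percent condition.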
29Fragment of Fig. 2 from Patil et al., 2001
30Notation
- B: block, defined by the numbers of its SNPs,
- e.g., B = {45, 46, ..., 50}, or B = {i, i+1, ..., j}
- L(B): length of the block (number of SNPs)
- f(B): minimum number of SNPs required to distinguish common haplotypes
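A minimal sketch of how f(B) could be computed by exhaustive search, which is only practical for short blocks. It makes two assumptions not stated on the slide: haplotypes are phased 0/1 strings without missing data, and every block needs at least one representative SNP, so f(B) is never 0. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def min_distinguishing_snps(haplotypes, start, end):
    """f(B) for the block of SNPs start..end (inclusive, 0-based):
    the smallest number of SNPs whose alleles tell all common
    haplotypes (those seen more than once) apart."""
    segments = [h[start:end + 1] for h in haplotypes]
    common = [s for s, c in Counter(segments).items() if c > 1]
    width = end - start + 1
    # Try subsets of increasing size; k starts at 1 by assumption.
    for k in range(1, width + 1):
        for cols in combinations(range(width), k):
            signatures = {tuple(s[c] for c in cols) for s in common}
            if len(signatures) == len(common):
                return k
    return width  # only reached for an empty block
```

In the sample `["000", "000", "011", "011", "101", "101", "110"]` the common haplotypes are 000, 011, and 101. No single SNP separates all three, but the first two SNPs do, so f(B) is 2.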
31Greedy Solution
[Figure: the 0/1 haplotype matrix shown on slide 7, with Start and End markers delimiting the current candidate block]
0. Fix Start = End
1. Increment End
2. Compute ratio L(B)/f(B)
3. Stop at max
4. Go to 0
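One reading of the five steps above, as a minimal sketch. It assumes phased 0/1 haplotype strings without missing data, at least one representative SNP per block, a cap `max_len` on block length to keep the exhaustive search for f(B) small, and a single-SNP fallback when no allowed block starts at the current position. All names are hypothetical, and the sketch is not claimed to reproduce the exact procedure of Patil et al. (2001).

```python
from collections import Counter
from itertools import combinations


def common_haplotypes(haplotypes, start, end):
    """Return (distinct common haplotypes, their coverage fraction)
    for SNPs start..end (inclusive, 0-based)."""
    segments = [h[start:end + 1] for h in haplotypes]
    counts = Counter(segments)
    common = [s for s, c in counts.items() if c > 1]
    coverage = sum(counts[s] for s in common) / len(segments)
    return common, coverage


def f(common, width):
    """Minimum number of SNPs (at least 1, by assumption) that
    distinguish the common haplotypes; exhaustive search."""
    for k in range(1, width + 1):
        for cols in combinations(range(width), k):
            if len({tuple(s[c] for c in cols) for s in common}) == len(common):
                return k
    return width


def greedy_blocks(haplotypes, alpha=0.8, max_len=12):
    """Partition SNPs into blocks, each time choosing the End that
    maximizes L(B)/f(B) among allowed blocks beginning at Start."""
    n_snps = len(haplotypes[0])
    blocks, start = [], 0
    while start < n_snps:
        best_end, best_ratio = start, 0.0   # fallback: single-SNP block
        for end in range(start, min(n_snps, start + max_len)):
            common, coverage = common_haplotypes(haplotypes, start, end)
            if coverage < alpha:
                continue                    # block not allowed
            width = end - start + 1
            ratio = width / f(common, width)
            if ratio > best_ratio:          # ties keep the shorter block
                best_end, best_ratio = end, ratio
        blocks.append((start, best_end))
        start = best_end + 1                # next block starts after End
    return blocks
```

For the sample `["0000", "0000", "1100", "1100", "1111", "1111"]` the scan from the first SNP reaches ratio 2 at the second SNP and never exceeds it, so the result is two blocks of two SNPs each.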
32Results
- 4563 representative SNPs (13%)
- 4135 blocks
33Method 3 (Zhang et al. 2002)
- Solves the same problem of 80% chromosome coverage, but with dynamic programming, which finds an optimal partition instead of a greedy one
34Dynamic programming solution
[Figure: the 0/1 haplotype matrix shown on slide 7, with SNP position i marked and SNPs 1, 2, ..., i partitioned into blocks B1(i), B2(i), B3(i)]
Optimal partition of SNPs 1, 2, ..., i
Assume that for all i = 1, 2, ..., j-1 we know the optimal block partition B1(i), B2(i), ..., Bk(i) that minimizes the total number of representative SNPs, f(B1(i)) + f(B2(i)) + ... + f(Bk(i))
35Bellman's equation
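The equation on this slide did not survive extraction. A reconstruction in the notation of slide 30 is given here as a sketch. It assumes that S_j, a symbol introduced here, denotes the minimum total number of representative SNPs over all allowed partitions of SNPs 1, ..., j, and that "allowed" means the block satisfies the 80 percent common-haplotype condition:

```latex
S_0 = 0, \qquad
S_j = \min_{\substack{1 \le i \le j \\ B = \{i, \ldots, j\}\ \text{allowed}}}
      \bigl[\, S_{i-1} + f(B) \,\bigr], \qquad j = 1, 2, \ldots
```

The last block of an optimal partition of SNPs 1, ..., j must start at some SNP i, and the SNPs before it must themselves be optimally partitioned, which is what the minimum over i expresses.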
36Results
- 3582 representative SNPs (compared to 4563 from the greedy algorithm)
- 2575 blocks (compared to 4135 blocks from the greedy algorithm)
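The dynamic programming idea of Method 3 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Zhang et al. (2002): haplotypes are phased 0/1 strings without missing data, every block needs at least one representative SNP, f(B) is found by exhaustive search, and block length is capped at `max_len`. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def block_cost(haplotypes, start, end, alpha):
    """Return f(B) for SNPs start..end (inclusive, 0-based),
    or None if the block is not allowed."""
    segments = [h[start:end + 1] for h in haplotypes]
    counts = Counter(segments)
    common = [s for s, c in counts.items() if c > 1]
    if sum(counts[s] for s in common) < alpha * len(segments):
        return None                       # coverage condition fails
    width = end - start + 1
    for k in range(1, width + 1):         # at least 1 SNP, by assumption
        for cols in combinations(range(width), k):
            if len({tuple(s[c] for c in cols) for s in common}) == len(common):
                return k
    return width


def dp_blocks(haplotypes, alpha=0.8, max_len=12):
    """Return (minimum total number of representative SNPs, blocks)."""
    n = len(haplotypes[0])
    inf = float("inf")
    best = [0] + [inf] * n    # best[j]: optimum for the first j SNPs
    prev = [0] * (n + 1)      # prev[j]: start index of the last block
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            if best[i] == inf:
                continue
            cost = block_cost(haplotypes, i, j - 1, alpha)
            if cost is not None and best[i] + cost < best[j]:
                best[j], prev[j] = best[i] + cost, i
    if best[n] == inf:
        raise ValueError("no allowed partition under these settings")
    blocks, j = [], n
    while j > 0:              # walk back through the stored block starts
        blocks.append((prev[j], j - 1))
        j = prev[j]
    return best[n], blocks[::-1]
```

For the sample `["0000", "0000", "1100", "1100", "1111", "1111"]` the optimum is 2 representative SNPs. A single block of four SNPs and two blocks of two SNPs both reach that total, and this sketch returns the single block because ties keep the earliest block start.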
37Conclusions
- Studying haplotype block partitions is important for:
- 1. Constructing haplotype maps for genetic traits
- 2. Understanding recombination in the human genome
38To expect
- A lot of papers in this area appearing in
scientific journals