Title: Introduction to SNP and Haplotype Analysis
1Introduction to SNP and Haplotype Analysis
Yao-Ting Huang
Kun-Mao Chao
Algorithms and Computational Biology
Lab, Department of Computer Science and Information
Engineering, National Taiwan University, Taiwan.
2Genetic Variations
- Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
- All humans share 99% of the same DNA sequence.
- Genetic variations in a coding region may change a codon and thereby alter the amino acid sequence.
3Single Nucleotide Polymorphism
- A single nucleotide polymorphism (SNP), pronounced "snip", is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is kept through heredity.
- SNP: a single DNA base variation found in more than 1% of the population.
- Mutation: a single DNA base variation found in less than 1% of the population.

Sequence           SNP     Mutation
C T T A G C T T    94%     99.9%
C T T A G T T T    6%      0.1%
4Mutations and SNPs
[Figure: observed genetic variations traced back to a common ancestor]
5Single Nucleotide Polymorphism
- SNPs are the most frequent form of genetic variation.
- 90% of human genetic variation comes from SNPs.
- SNPs occur about once every 300 to 600 base pairs.
- Millions of SNPs have been identified (e.g., by HapMap and Perlegen).
- SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.
6Single Nucleotide Polymorphism
- A SNP is usually assumed to be a binary variable.
- The probability of a repeat mutation at the same SNP locus is quite small.
- Tri-allelic cases are usually attributed to genotyping errors.
- The nucleotide at a SNP locus is called
  - a major allele (if its allele frequency is > 50%), or
  - a minor allele (if its allele frequency is < 50%).

Sequence             Allele              Frequency
A C T T A G C T T    T (major allele)    94%
A C T T A G C T C    C (minor allele)    6%
7Haplotypes
- A haplotype is a set of linked SNPs on the same chromosome.
- A haplotype can be viewed simply as a binary string, since each SNP is binary.
8Genotypes
- The use of haplotype information has been limited because the human genome is diploid.
- In large sequencing projects, genotypes rather than haplotypes are collected because of cost considerations.
9Problems of Genotypes
- Genotypes only tell us the alleles at each SNP locus.
- But we don't know how the alleles at different SNP loci are connected.
- There can be several possible haplotype pairs for the same genotype, and we don't know which haplotype pair is the real one.
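The ambiguity can be made concrete with a minimal sketch. The numeric encoding (0 = homozygous major, 1 = homozygous minor, 2 = heterozygous) and the function name `resolving_pairs` are assumptions of this example, not notation from the slides.

```python
from itertools import product

def resolving_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    Assumed encoding: 0 = homozygous major, 1 = homozygous minor,
    2 = heterozygous. Haplotypes are binary tuples (0 = major, 1 = minor).
    """
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    # Each heterozygous site can place its minor allele on either haplotype.
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

# A genotype that is heterozygous at two loci has two candidate pairs.
pairs = resolving_pairs((2, 0, 2))
for h1, h2 in pairs:
    print(h1, h2)
```

A genotype with k heterozygous loci (k >= 1) has 2^(k-1) candidate pairs, which is why the genotype alone does not determine the haplotypes.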
10Research Directions of SNPs and Haplotypes in
Recent Years
SNP Database
Haplotype Inference
Tag SNP Selection
Maximum Parsimony
Perfect Phylogeny
Statistical Methods
Haplotype block
LD bin
Prediction Accuracy
11Haplotype Inference
- The problem of inferring haplotypes from a set of genotypes is called haplotype inference.
- This problem is known to be not only NP-hard but also APX-hard.
- Most combinatorial methods adopt the maximum parsimony model to solve this problem.
- This model assumes that the number of distinct haplotypes in a natural population is small.
- The solution to this problem is a minimum set of haplotypes that can explain the given genotypes.
12Maximum Parsimony
- Find a minimum set of haplotypes to explain the
given genotypes.
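For very small inputs, the maximum parsimony objective can be illustrated by exhaustive search. This is only a sketch: the genotype encoding (0/1 = homozygous, 2 = heterozygous) and the function names are assumptions of this example, and the search is exponential, so it is not one of the algorithms discussed in these slides.

```python
from itertools import combinations, product

def resolving_pairs(genotype):
    """All unordered haplotype pairs that explain a genotype (2 = heterozygous)."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs

def min_haplotype_set(genotypes):
    """Smallest haplotype set such that every genotype has a pair inside it."""
    options = [resolving_pairs(g) for g in genotypes]
    candidates = sorted({h for pairs in options for pair in pairs for h in pair})
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            if all(any(a in chosen and b in chosen for a, b in pairs)
                   for pairs in options):
                return subset
    return ()

# Hypothetical genotypes over two SNP loci.
genotypes = [(2, 2), (0, 2), (2, 0)]
solution = min_haplotype_set(genotypes)
print(solution)
```

Here the first genotype could be explained by 00/11 or by 01/10; parsimony prefers 01/10 because those haplotypes are already needed by the other two genotypes.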
13Related Works
- Statistical methods
  - Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
  - Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
- Combinatorial methods
  - Gusfield (2003) proposed an integer linear programming algorithm.
  - Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
  - Brown and Harrower (2004) proposed a new integer linear formulation of this problem.
14Our Results
- We formulated this problem as an integer quadratic programming (IQP) problem.
- We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem.
- This algorithm finds an O(log n)-approximate solution.
- We implemented this algorithm in MATLAB and compared it with existing methods.
- Huang, Y.-T., Chao, K.-M., and Chen, T., 2005, An Approximation Algorithm for Haplotype Inference by Maximum Parsimony, Journal of Computational Biology, 12: 1261-1274.
15Problem Formulation
- Input
- A set of n genotypes and m possible haplotypes.
- Output
- A minimum set of haplotypes that can explain the
given genotypes.
16Integer Quadratic Programming (IQP)
- Define x_i as an integer variable with value 1 or -1.
  - x_i = 1 if the i-th haplotype is selected.
  - x_i = -1 if the i-th haplotype is not selected.
- Minimizing the number of selected haplotypes amounts to minimizing an integer quadratic function of the x_i.
17Integer Quadratic Programming (IQP)
- Each genotype must be resolved by at least one pair of haplotypes.
- For genotype G1, an integer quadratic constraint must be satisfied: at least one of its candidate haplotype pairs must be selected in full (e.g., h1 and h2 are both selected, or another resolving pair is).
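The equations for the objective and the constraint are not present in this text. The sketch below shows one formulation that is consistent with the +1/-1 encoding; it is an assumption of this example and may differ from the exact form used in the cited paper. The indicator (1 + x_i) / 2 equals 1 for a selected haplotype and 0 otherwise, and the product of two indicators is 1 only when both haplotypes of a pair are selected.

```python
def indicator(xi):
    """Map x_i in {+1, -1} to 1 (selected) or 0 (not selected)."""
    return (1 + xi) // 2

def num_selected(x):
    """Objective: number of selected haplotypes."""
    return sum(indicator(xi) for xi in x)

def genotype_resolved(x, candidate_pairs):
    """Constraint: at least one candidate pair (i, j) is fully selected.

    The product of two indicators makes the constraint quadratic in x.
    """
    return sum(indicator(x[i]) * indicator(x[j]) for i, j in candidate_pairs) >= 1

# Hypothetical instance: G1 is resolved by (h1, h2) or by (h3, h4),
# written with 0-based indices.
g1_pairs = [(0, 1), (2, 3)]
x_good = [1, 1, -1, -1]   # h1 and h2 selected
x_bad = [1, -1, 1, -1]    # one haplotype from each pair: G1 is unresolved

print(num_selected(x_good), genotype_resolved(x_good, g1_pairs))
print(num_selected(x_bad), genotype_resolved(x_bad, g1_pairs))
```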
18Integer Quadratic Programming (IQP)
- Maximum parsimony: find a minimum set of haplotypes to resolve all genotypes.
- We use the SDP-relaxation technique to solve this IQP problem.
19The Flow of the Iterative SDP Relaxation Algorithm
Integer Quadratic Programming
Semidefinite Programming
Vector Formulation
Vector Solution
SDP Solution
Integral Solution
20Research Directions of SNPs and Haplotypes in
Recent Years
SNP Database
Haplotype Inference
Tag SNP Selection
Maximum Parsimony
Perfect Phylogeny
Statistical Methods
Haplotype block
LD bin
Prediction Accuracy
21Problems of Using SNPs for Association Studies
- The number of SNPs is still too large to be used for association studies.
- There are millions of SNPs in the human genome.
- To reduce the SNP genotyping cost, we wish to use as few SNPs as possible for association studies.
- Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs.
- There are many definitions of tag SNPs.
- We will first study one definition of tag SNPs based on the haplotype block model.
22Haplotype Blocks and Tag SNPs
- Recent studies have shown that the chromosome can be partitioned into haplotype blocks interspersed with recombination hotspots (Daly et al.; Patil et al.).
- Within a haplotype block, little or no recombination has occurred.
- The SNPs within a haplotype block tend to be inherited together.
- Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block.
- We only need to genotype the tag SNPs instead of all SNPs within a haplotype block.
23Recombination Hotspots and Haplotype Blocks
24A Haplotype Block Example
- Chromosome 21 was partitioned into 4,135 haplotype blocks over 24,047 SNPs by Patil et al. (Science, 2001).
- Blue box: major allele
- Yellow box: minor allele
25Examples of Tag SNPs
- Suppose we wish to determine which haplotype pattern an unknown haplotype sample matches.
- We can genotype all SNPs to identify the haplotype sample.
[Figure: haplotype patterns P1-P4 and an unknown haplotype sample over SNP loci S1-S12; each box marks a major or minor allele]
26Examples of Tag SNPs
- In fact, it is not necessary to genotype all SNPs.
- SNPs S3, S4, and S5 can form a set of tag SNPs.
[Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the alleles at S3, S4, and S5 shown separately for each pattern]
27Examples of Wrong Tag SNPs
- SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 would be indistinguishable.
[Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the alleles at S1, S2, and S3 shown separately for each pattern]
28Examples of Tag SNPs
- SNPs S1 and S12 can form a set of tag SNPs.
- This set of SNPs is the minimum solution in this example.
[Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the alleles at S1 and S12 shown separately for each pattern]
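The allele matrix of the figure is not recoverable from this text, so the following sketch uses a small hypothetical block (4 patterns over 5 SNP loci, 0 = major, 1 = minor) to show what "a set of tag SNPs" and "minimum solution" mean. The function names are illustrative, and the exhaustive search is only practical for tiny blocks.

```python
from itertools import combinations

def distinguishes_all(patterns, snps):
    """True if every pair of patterns differs at some SNP locus in `snps`."""
    return all(
        any(p[s] != q[s] for s in snps)
        for p, q in combinations(patterns, 2)
    )

def minimum_tag_snps(patterns):
    """Smallest set of loci (0-based indices) that separates all patterns."""
    n = len(patterns[0])
    for k in range(1, n + 1):
        for snps in combinations(range(n), k):
            if distinguishes_all(patterns, snps):
                return snps
    return tuple(range(n))

# Hypothetical haplotype patterns: rows are patterns, columns are SNP loci.
patterns = [
    (0, 0, 0, 1, 0),
    (0, 0, 1, 1, 0),
    (1, 0, 0, 0, 1),
    (1, 1, 1, 0, 1),
]

print(distinguishes_all(patterns, (0, 1)))  # first two patterns look identical
print(minimum_tag_snps(patterns))
```

Four patterns can never be separated by a single binary SNP, so two tag SNPs is the best possible outcome for any block with four distinct patterns.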
29Problem Formulation
- The relation between SNPs and haplotype patterns can be formulated as a bipartite graph.
- S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4).
- S2 can distinguish (P1, P4), (P2, P4), and (P3, P4).
[Figure: haplotype patterns P1-P4 over SNPs S1-S4, and a bipartite graph joining each SNP to the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) that it distinguishes]
30Observation
- A set of SNPs forms a set of tag SNPs if each pair of patterns is connected by at least one edge to a SNP in the set.
- e.g., S1 and S3 can form a set of tag SNPs.
- e.g., S1 and S2 cannot be tag SNPs.
[Figure: bipartite graph over the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4); each pair of patterns is connected by at least one edge]
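The bipartite view can be sketched as a covering problem. In the example below, the allele columns for S1 and S2 are chosen to reproduce the pairs listed in the text; the column for S3 is hypothetical (the figure is not recoverable), picked so that it separates (P1, P2) and (P3, P4). The greedy rule is a generic set-cover heuristic, shown only for illustration.

```python
from itertools import combinations

def covered_pairs(column):
    """Pattern pairs (1-based) that a SNP distinguishes: its bipartite edges."""
    return {
        (i + 1, j + 1)
        for i, j in combinations(range(len(column)), 2)
        if column[i] != column[j]
    }

def greedy_tag_snps(columns):
    """Repeatedly pick the SNP that covers the most still-uncovered pairs."""
    edges = {name: covered_pairs(col) for name, col in columns.items()}
    n = len(next(iter(columns.values())))
    uncovered = set(combinations(range(1, n + 1), 2))
    chosen = []
    while uncovered:
        best = max(edges, key=lambda s: len(edges[s] & uncovered))
        if not edges[best] & uncovered:
            break  # the remaining pairs cannot be separated by any SNP
        chosen.append(best)
        uncovered -= edges[best]
    return chosen, uncovered

# Alleles of patterns P1..P4 at each SNP (0 = major, 1 = minor).
columns = {
    "S1": (0, 0, 1, 1),  # separates (1,3), (1,4), (2,3), (2,4), as in the text
    "S2": (0, 0, 0, 1),  # separates (1,4), (2,4), (3,4), as in the text
    "S3": (0, 1, 0, 1),  # hypothetical column
}

chosen, uncovered = greedy_tag_snps(columns)
print(chosen, uncovered)
print((1, 2) in covered_pairs(columns["S1"]) | covered_pairs(columns["S2"]))
```

The last line shows why S1 and S2 alone fail: neither of them has an edge to the pair (1,2).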
31Problems of Finding Tag SNPs
- The problem of finding the minimum set of tag SNPs is known to be NP-hard.
- This problem is the minimum test set problem.
- A number of methods have been proposed to find the minimum set of tag SNPs (Bafna et al.; Zhang et al.).
- In reality, we may fail to obtain some tag SNPs if they do not pass the data quality threshold.
- In the current genotyping environment, the missing rate of SNPs is around 5 to 10%.
- We proposed two greedy algorithms and one linear programming relaxation algorithm to solve this problem.
32References
- Huang, Y.-T., Zhang, K., Chen, T., and Chao, K.-M., 2005, Selecting Additional Tag SNPs for Tolerating Missing Data in Genotyping, BMC Bioinformatics, 6: 263.
- Chang, C.-J., Huang, Y.-T., and Chao, K.-M., 2006, A Greedier Approach for Finding Tag SNPs, Bioinformatics, 22: 685-691.
33Research Directions of SNPs and Haplotypes in
Recent Years
SNP Database
Haplotype Inference
Tag SNP Selection
Maximum Parsimony
Perfect Phylogeny
Statistical Methods
Haplotype block
LD bin
Prediction Accuracy
34Linkage Disequilibrium
- The problem of finding tag SNPs can also be solved from a statistical point of view.
- We can measure the correlation between SNPs and identify sets of highly correlated SNPs.
- For each set of correlated SNPs, only one SNP needs to be genotyped, and it can be used to predict the values of the other SNPs.
- Linkage disequilibrium (LD) is a measure that estimates such correlation between two SNPs.
- We will formally introduce LD in detail later.
35Linkage Disequilibrium Bins
- The statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs.
- An LD bin is a set of SNPs such that SNPs within the same bin are highly correlated with each other.
- The value of a single SNP in an LD bin can predict the values of the other SNPs in the same bin.
- These methods try to identify the minimum set of LD bins.
36An Example of LD Bins (1/3)
- SNP1 and SNP2 cannot form an LD bin.
- e.g., A in SNP1 may be paired with either G or A in SNP2.
Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
37An Example of LD Bins (2/3)
- SNP1, SNP3, and SNP6 can form an LD bin.
- Any SNP in this bin is sufficient to predict the values of the others.
Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
38An Example of LD Bins (3/3)
- There are three LD bins, and only three tag SNPs
are required to be genotyped (e.g., SNP1, SNP2,
and SNP4).
Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
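The bins in this example can be checked mechanically. The sketch below uses perfect one-to-one allele correspondence as a stand-in for "highly correlated" (an assumption that fits this table, where correlated SNPs agree exactly); the function names are illustrative.

```python
def perfectly_correlated(a, b):
    """True if the alleles of two SNPs map one-to-one across individuals."""
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))

def ld_bins(table):
    """Group SNPs into bins of mutually perfectly correlated SNPs."""
    bins = []
    for name, column in table.items():
        for b in bins:
            if perfectly_correlated(table[b[0]], column):
                b.append(name)
                break
        else:
            bins.append([name])
    return bins

# Rows of the table above: one string per individual, SNP1..SNP6.
rows = ["AGACGT", "TGCCGC", "AAATAT", "TGCTAC",
        "TACCGC", "TGCTAC", "AAATAT", "AAATAT"]
table = {f"SNP{i + 1}": tuple(r[i] for r in rows) for i in range(6)}

bins = ld_bins(table)
print(bins)
print(perfectly_correlated(table["SNP1"], table["SNP2"]))
```

Taking the first SNP of each bin gives SNP1, SNP2, and SNP4, the three tag SNPs named in the example.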
39Difference between Haplotype Blocks and LD bins
- Haplotype blocks are based on the assumption that SNPs in close proximity tend to be correlated with each other.
- Recombination is less likely to occur between nearby SNPs.
- LD bins can group correlated SNPs that are distant from each other.
- A disease is usually affected by multiple genes rather than a single one.
- The SNPs in one LD bin can be shared by other bins.
- The SNPs in a haplotype block do not appear in another block.
40Introduction to Linkage Disequilibrium
- A, B: major alleles; a, b: minor alleles
- P_A: probability of allele A at SNP1
- P_a: probability of allele a at SNP1
- P_B: probability of allele B at SNP2
- P_b: probability of allele b at SNP2
- P_AB: probability of haplotype AB
- P_ab: probability of haplotype ab

                 SNP2
                 B       b       Total
SNP1    A        P_AB    P_Ab    P_A
        a        P_aB    P_ab    P_a
        Total    P_B     P_b     1.0
41Linkage Equilibrium
- P_AB = P_A P_B
- P_Ab = P_A P_b = P_A (1 - P_B)
- P_aB = P_a P_B = (1 - P_A) P_B
- P_ab = P_a P_b = (1 - P_A) (1 - P_B)

                 SNP2
                 B       b       Total
SNP1    A        P_AB    P_Ab    P_A
        a        P_aB    P_ab    P_a
        Total    P_B     P_b     1.0
42Linkage Disequilibrium
- P_AB ≠ P_A P_B
- P_Ab ≠ P_A P_b = P_A (1 - P_B)
- P_aB ≠ P_a P_B = (1 - P_A) P_B
- P_ab ≠ P_a P_b = (1 - P_A) (1 - P_B)

                 SNP2
                 B       b       Total
SNP1    A        P_AB    P_Ab    P_A
        a        P_aB    P_ab    P_a
        Total    P_B     P_b     1.0
43An Example of Linkage Disequilibrium
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
SNP1: P_A = 1/3, P_C = 2/3
SNP2: P_G = 2/3, P_C = 1/3
- Suppose we have three haplotypes AG, CG, and CC.
- There is no AC haplotype, i.e., P_AC = 0.
- Note that P_AC = 0 while P_A P_C = 1/9, so P_AC ≠ P_A P_C.
- These two SNPs are in linkage disequilibrium.
44An Example of Linkage Equilibrium
Before recombination:
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
After recombination:
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
-- A -- -- -- C -- -- --
SNP1: P_A = 1/2, P_C = 1/2
SNP2: P_G = 1/2, P_C = 1/2
- After recombination,
  - P_AG = P_A P_G = 1/4,
  - P_CG = P_C P_G = 1/4,
  - P_CC = P_C P_C = 1/4, and
  - P_AC = P_A P_C = 1/4.
- These two SNPs are in linkage equilibrium.
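The two examples can be verified numerically. The sketch below computes the standard LD coefficient D, the difference between the observed haplotype frequency and the product of the allele frequencies; the function name `ld_d` and the use of exact fractions are choices of this example.

```python
from fractions import Fraction

def ld_d(haplotypes, a, b):
    """D = P_ab - P_a * P_b for allele `a` at SNP1 and allele `b` at SNP2.

    Each haplotype is a two-character string: SNP1 allele, then SNP2 allele.
    """
    n = len(haplotypes)
    p_a = Fraction(sum(h[0] == a for h in haplotypes), n)
    p_b = Fraction(sum(h[1] == b for h in haplotypes), n)
    p_ab = Fraction(sum(h == a + b for h in haplotypes), n)
    return p_ab - p_a * p_b

before = ["AG", "CG", "CC"]        # no AC haplotype
after = ["AG", "CG", "CC", "AC"]   # AC created by recombination

print(ld_d(before, "A", "C"))  # 0 - (1/3)(1/3): disequilibrium
print(ld_d(after, "A", "C"))   # 1/4 - (1/2)(1/2): equilibrium
```

D is zero exactly when the haplotype frequency equals the product of the allele frequencies, which is the linkage equilibrium condition.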
45Linkage Disequilibrium
- There are many formulas for computing LD between two SNPs, and most of them are normalized to the range [-1, 1] or [0, 1].
  - LD = 1 (perfect positive correlation)
  - LD = 0 (no correlation, or linkage equilibrium)
  - LD = -1 (perfect negative correlation)
  - LD = 0.8 (strong positive correlation)
  - LD = 0.12 (weak positive correlation)
46Linkage Disequilibrium Formulas
- Mathematical formulas for computing LD
  - r^2 (also written Δ^2)
  - D
  - Chi-square test
  - P value
47Correlation Coefficient
- The correlation between two random variables A and B can be measured by the correlation coefficient r_AB = Cov(A, B) / sqrt(Var(A) Var(B)).
48Examples of Computing LD
Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A T A A G T
2 G T C C T T
3 G A C A G T
4 G A C C T T
5 G A C A G C
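For two biallelic SNPs, the squared correlation coefficient reduces to r^2 = D^2 / (P_A P_a P_B P_b), where D = P_AB - P_A P_B. The following sketch applies this standard formula to the table above, treating each row as one observed haplotype; the function name and the choice of the first row's allele as the reference allele are assumptions of this example, and both SNPs must be polymorphic.

```python
def r_squared(x, y):
    """Squared correlation between two biallelic SNP columns.

    The allele seen in the first individual is used as the reference allele;
    r^2 does not depend on this choice. Both SNPs must be polymorphic.
    """
    n = len(x)
    a, b = x[0], y[0]
    p_a = sum(v == a for v in x) / n
    p_b = sum(v == b for v in y) / n
    p_ab = sum(u == a and v == b for u, v in zip(x, y)) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Rows of the table above: one string per individual, SNP1..SNP6.
rows = ["ATAAGT", "GTCCTT", "GACAGT", "GACCTT", "GACAGC"]
snp = {i + 1: [r[i] for r in rows] for i in range(6)}

print(round(r_squared(snp[1], snp[3]), 3))  # SNP1 and SNP3 always agree
print(round(r_squared(snp[1], snp[2]), 3))  # only partly correlated
print(round(r_squared(snp[4], snp[5]), 3))  # SNP4 and SNP5 always agree
```

For SNP1 and SNP2: P_A = 1/5, P_T = 2/5, P_AT = 1/5, so D = 3/25 and r^2 = (9/625) / (24/625) = 0.375.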
49Minimum Clique Cover Problem
- This problem asks for a minimum set of LD bins.
- The minimum LD value required between two SNPs in one bin is usually set to 0.8.
- This problem is known to be the minimum clique cover problem (Huang and Chao, 2005).
- Consider each SNP as a node of a graph.
- There is an edge between two nodes iff the LD between the two SNPs is ≥ 0.8.
50Relaxation of This Problem
- The minimum clique cover problem is hard to approximate.
- The relaxed problem asks for a minimum set of LD bins such that at least one SNP in each LD bin has r^2 ≥ 0.8 with every other SNP in the same bin.
- The relaxed problem is known to be the minimum dominating set problem.
- The minimum dominating set problem is still NP-hard but is easier to approximate.
51Minimum Dominating Set Problem
- Given a graph G = (V, E), a minimum dominating set C is a minimum set of nodes such that each node in V is either in C or has at least one edge connecting it to a node in C.
- Consider each node as a SNP and each edge as strong LD (r^2 ≥ 0.8) between two SNPs.
- The minimum dominating set of this graph is the set of tag SNPs.
- We can use only this set of SNPs to predict the other SNPs.
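A common way to approximate a minimum dominating set is the greedy rule "pick the node that dominates the most still-undominated nodes". The sketch below illustrates that generic heuristic on a hypothetical LD graph; it is not necessarily the algorithm used by the authors, and the node names and edges are made up for the example.

```python
def greedy_dominating_set(nodes, edges):
    """Greedy approximation of a minimum dominating set.

    `edges` lists pairs of SNPs in strong LD (e.g., r^2 >= 0.8).
    """
    # Closed neighborhoods: a chosen node dominates itself and its neighbors.
    neighbors = {v: {v} for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    undominated = set(nodes)
    chosen = []
    while undominated:
        best = max(nodes, key=lambda v: len(neighbors[v] & undominated))
        chosen.append(best)
        undominated -= neighbors[best]
    return chosen

# Hypothetical LD graph: s1-s2-s3 form a path, s4-s5 are linked, s6 is isolated.
nodes = ["s1", "s2", "s3", "s4", "s5", "s6"]
edges = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]

tags = greedy_dominating_set(nodes, edges)
print(tags)
```

In this graph s1 and s3 are not in strong LD with each other, so a clique cover needs four bins, while three tag SNPs suffice under the dominating set relaxation because s2 predicts both s1 and s3.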
52Experimental Data Sets
- Hinds et al. (2005) identified 1,586,383 SNPs across three human populations: Americans of African, European, and Asian ancestry.
- The database provides both genotype data and inferred haplotype data.