Introduction to SNP and Haplotype Analysis presentation

About This Presentation

Title:

Introduction to SNP and Haplotype Analysis

Slides: 53

Provided by: chi115

Transcript and Presenter's Notes

Title: Introduction to SNP and Haplotype Analysis

1
Introduction to SNP and Haplotype Analysis
Yao-Ting Huang
Kun-Mao Chao
Algorithms and Computational Biology Lab,
Department of Computer Science and Information
Engineering, National Taiwan University, Taiwan.
2
Genetic Variations

The genetic variations in DNA sequences (e.g.,
insertions, deletions, and mutations) have a
major impact on genetic diseases and phenotypic
differences.
All humans share 99% of the same DNA sequence.
Genetic variations in the coding region may
change the codon of an amino acid and alter the
amino acid sequence.

3
Single Nucleotide Polymorphism

A single nucleotide polymorphism (SNP),
pronounced "snip", is a genetic variation in which a
single nucleotide (i.e., A, T, C, or G) is
altered and the change is kept through heredity.
SNP: a single DNA base variation found in > 1% of the population.
Mutation: a single DNA base variation found in < 1% of the population.

C T T A G C T T (94%) vs. C T T A G T T T (6%): SNP
C T T A G C T T (99.9%) vs. C T T A G T T T (0.1%): mutation
4
Mutations and SNPs
Observed genetic variations
Common Ancestor
5
Single Nucleotide Polymorphism

SNPs are the most frequent form among various
genetic variations.
90% of human genetic variations come from SNPs.
SNPs occur about every 300-600 base pairs.
Millions of SNPs have been identified (e.g.,
HapMap and Perlegen).
SNPs have become the preferred markers for
association studies because of their high
abundance and high-throughput SNP genotyping
technologies.

6
Single Nucleotide Polymorphism

A SNP is usually assumed to be a binary variable.
The probability of repeat mutation at the same
SNP locus is quite small.
The tri-allele cases are usually considered to be
the effect of genotyping errors.
The nucleotide at a SNP locus is called
a major allele (if its allele frequency is > 50%), or
a minor allele (if its allele frequency is < 50%).

A C T T A G C T T (T: major allele, 94%)
A C T T A G C T C (C: minor allele, 6%)
7
Haplotypes

A haplotype stands for a set of linked SNPs on
the same chromosome.
A haplotype can be simply considered as a binary
string since each SNP is binary.

8
Genotypes

The use of haplotype information has been limited
because the human genome is diploid.
In large sequencing projects, genotypes instead
of haplotypes are collected because of cost
considerations.

9
Problems of Genotypes

Genotypes only tell us the alleles at each SNP
locus.
But we don't know how the alleles at
different SNP loci are connected.
There can be several possible haplotype pairs for
the same genotype.

We don't know which haplotype pair is the real one.
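The ambiguity can be made concrete with a small sketch. The encoding below is an assumption made for illustration (it is not taken from the slides): a genotype is a string over '0' (homozygous major), '1' (homozygous minor), and '2' (heterozygous), and a haplotype is a binary string. The hypothetical helper `haplotype_pairs` lists every unordered pair of haplotypes that could explain a genotype.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Return all unordered haplotype pairs that explain `genotype`.

    Assumed encoding: '0' = homozygous major, '1' = homozygous minor,
    '2' = heterozygous (one copy of each allele).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    # Try every assignment of alleles to the heterozygous sites of the
    # first haplotype; the second haplotype gets the opposite alleles.
    for bits in product("01", repeat=len(het_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, b in zip(het_sites, bits):
            h1[site] = b
            h2[site] = "1" if b == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


print(haplotype_pairs("0212"))
# [('0010', '0111'), ('0011', '0110')]
```

A genotype with k heterozygous sites (k >= 1) has 2^(k-1) candidate pairs, which is why the genotype alone does not determine the haplotypes.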
10
Research Directions of SNPs and Haplotypes in
Recent Years
SNP Database
Haplotype Inference: Maximum Parsimony, Perfect Phylogeny, Statistical Methods
Tag SNP Selection: Haplotype Block, LD Bin, Prediction Accuracy
11
Haplotype Inference

The problem of inferring the haplotypes from a
set of genotypes is called haplotype inference.
This problem is known to be both NP-hard and
APX-hard.
Most combinatorial methods use the maximum
parsimony model to solve this problem.
This model assumes that the number of distinct
haplotypes in a natural population is small.
The solution of this problem is a minimum set of
haplotypes that can explain the given genotypes.

12
Maximum Parsimony

Find a minimum set of haplotypes to explain the
given genotypes.
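As an illustration of the maximum parsimony objective, here is a minimal exhaustive-search sketch. It is not one of the algorithms discussed in this talk and it runs in exponential time, so it is only usable on toy inputs. The genotype encoding ('2' = heterozygous) and the function names are assumptions of this sketch.

```python
from itertools import combinations, product


def pairs_for(genotype):
    """Unordered haplotype pairs explaining a genotype ('2' = heterozygous)."""
    het = [i for i, g in enumerate(genotype) if g == "2"]
    out = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, "10"[int(b)]  # complementary alleles
        out.add(frozenset(("".join(h1), "".join(h2))))
    return out


def max_parsimony(genotypes):
    """Smallest haplotype set that resolves every genotype (brute force)."""
    options = [pairs_for(g) for g in genotypes]
    candidates = sorted(set().union(*(p for opt in options for p in opt)))
    # Try subsets in order of increasing size; the first hit is a minimum.
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            if all(any(p <= chosen for p in opt) for opt in options):
                return sorted(chosen)
    return []


print(max_parsimony(["02", "20", "22"]))
# ['00', '01', '10']
```

In this toy input the genotype "22" could be explained by {00, 11} or by {01, 10}; parsimony prefers {01, 10} because both haplotypes are already needed for the other two genotypes.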

13
Related Works

Statistical methods
Niu et al. (2002) developed a PL-EM algorithm
called HAPLOTYPER.
Stephens and Donnelly (2003) designed an MCMC
algorithm based on Gibbs sampling called PHASE.
Combinatorial methods
Gusfield (2003) proposed an integer linear
programming algorithm.
Wang and Xu (2003) developed a branch-and-bound
algorithm called HAPAR to find the optimal
solution.
Brown and Harrower (2004) proposed a new integer
linear formulation of this problem.

14
Our Results

We formulated this problem as an integer
quadratic programming (IQP) problem.
We proposed an iterative semidefinite programming
(SDP) relaxation algorithm to solve the IQP
problem.
This algorithm finds an O(log n)-approximate
solution.
We implemented this algorithm in MATLAB and
compared it with existing methods.
Huang, Y.-T., Chao, K.-M., and Chen, T., 2005,
An Approximation Algorithm for Haplotype
Inference by Maximum Parsimony, Journal of
Computational Biology, 12: 1261-1274.

15
Problem Formulation

Input
A set of n genotypes and m possible haplotypes.
Output
A minimum set of haplotypes that can explain the
given genotypes.

16
Integer Quadratic Programming (IQP)

Define x_i as an integer variable with value 1 or
-1.
x_i = 1 if the i-th haplotype is selected.
x_i = -1 if the i-th haplotype is not selected.
Minimizing the number of selected haplotypes is
equivalent to minimizing an integer quadratic
function of the x_i (the formula is not preserved
in this transcript).

17
Integer Quadratic Programming (IQP)

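Each genotype must be resolved by at least one
pair of haplotypes.
For genotype G1, an integer quadratic constraint
must be satisfied (the formula is not preserved
in this transcript).
For example, G1 is resolved if h1 and h2 are both
selected, or if another pair that explains G1 is
selected.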
18
Integer Quadratic Programming (IQP)
Maximum parsimony: find a minimum set of haplotypes
to resolve all genotypes.
We use the SDP-relaxation technique to solve this
IQP problem.
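The formulas on the IQP slides did not survive in this transcript. One formulation consistent with the text is sketched below. It is an assumption, not necessarily the authors' exact notation: R(G_t) is a hypothetical symbol for the set of index pairs (j, k) such that haplotypes h_j and h_k together explain genotype G_t.

```latex
\begin{aligned}
\text{minimize}\quad
  & \sum_{i=1}^{m} \frac{(1 + x_i)^2}{4} \\
\text{subject to}\quad
  & \sum_{(j,k) \in R(G_t)} \frac{(1 + x_j)(1 + x_k)}{4} \ge 1,
    \qquad t = 1, \dots, n, \\
  & x_i \in \{-1, 1\}, \qquad i = 1, \dots, m.
\end{aligned}
```

With x_i in {-1, 1}, each term (1 + x_i)^2 / 4 equals 1 when haplotype i is selected and 0 otherwise, so the objective counts selected haplotypes. Each product term equals 1 only when both haplotypes of a resolving pair are selected, so the constraint requires at least one complete pair per genotype.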
19
The Flow of the Iterative SDP Relaxation Algorithm
Integer Quadratic Programming
Semidefinite Programming
Vector Formulation
Vector Solution
SDP Solution
Integral Solution
20
Research Directions of SNPs and Haplotypes in
Recent Years
SNP Database
Haplotype Inference: Maximum Parsimony, Perfect Phylogeny, Statistical Methods
Tag SNP Selection: Haplotype Block, LD Bin, Prediction Accuracy
21
Problems of Using SNPs for Association Studies

The number of SNPs is still too large to be used
for association studies.
There are millions of SNPs in the human genome.
To reduce the SNP genotyping cost, we wish to use
as few SNPs as possible for association studies.
Tag SNPs are a small subset of SNPs that is
sufficient for performing association studies
without losing the power of using all SNPs.
There are many definitions of tag SNPs.
We will first study one definition of tag SNPs
based on the haplotype block model.

22
Haplotype Blocks and Tag SNPs

Recent studies have shown that the chromosome can
be partitioned into haplotype blocks interspersed
by recombination hotspots (Daly et al.; Patil et
al.).
Within a haplotype block, little or no
recombination has occurred.
The SNPs within a haplotype block tend to be
inherited together.
Within a haplotype block, a small subset of SNPs
(called tag SNPs) is sufficient to distinguish
each pair of haplotype patterns in the block.
We only need to genotype tag SNPs instead of all
SNPs within a haplotype block.

23
Recombination Hotspots and Haplotype Blocks
24
A Haplotype Block Example

Patil et al. (Science, 2001) partitioned
chromosome 21 into 4,135 haplotype blocks
covering 24,047 SNPs.
Blue box: major allele
Yellow box: minor allele

25
Examples of Tag SNPs
[Figure: haplotype patterns P1-P4 and an unknown haplotype sample over SNP loci S1-S12; boxes mark major and minor alleles.]

Suppose we wish to identify an unknown
haplotype sample.
We can genotype all SNPs to identify the
haplotype sample.
26
Examples of Tag SNPs
[Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the columns for S3, S4, and S5 extracted.]

In fact, it is not necessary to genotype all
SNPs.
SNPs S3, S4, and S5 can form a set of tag SNPs.
27
Examples of Wrong Tag SNPs
[Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the columns for S1, S2, and S3 extracted.]

SNPs S1, S2, and S3 cannot form a set of tag
SNPs because P1 and P4 would be ambiguous.
28
Examples of Tag SNPs
[Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the columns for S1 and S12 extracted.]

SNPs S1 and S12 can form a set of tag SNPs.
This set of SNPs is the minimum solution in this
example.
29
Problem Formulation
The relation between SNPs and haplotypes can be
formulated as a bipartite graph.
S1 can distinguish (P1, P3), (P1, P4), (P2, P3),
and (P2, P4).
S2 can distinguish (P1, P4), (P2, P4), and (P3, P4).

[Figure: bipartite graph with SNPs S1-S4 on one side and the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) on the other.]
30
Observation

A set of SNPs can form a set of tag SNPs if each
pair of patterns is connected by an edge to at
least one SNP in the set.
e.g., S1 and S3 can form a set of tag SNPs.
e.g., S1 and S2 cannot be tag SNPs.

[Figure: the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), each connected by at least one edge.]
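The edge-coverage condition can be checked mechanically. In the sketch below, the allele columns for S1 and S2 are reconstructed from the pattern pairs listed for them on the previous slide. The column for S3 is an assumption (the figure is not preserved), chosen so that S1 and S3 together separate every pair, as the slide states.

```python
from itertools import combinations

# Allele of patterns P1..P4 at each SNP (0 = major, 1 = minor).
# S1 and S2 follow the pairs listed in the text; S3 is an assumed column.
SNPS = {
    "S1": (0, 0, 1, 1),
    "S2": (0, 0, 0, 1),
    "S3": (0, 1, 0, 1),
}


def distinguished_pairs(column):
    """Pattern pairs (1-based indices) on which this SNP column differs."""
    return {
        (i + 1, j + 1)
        for i, j in combinations(range(len(column)), 2)
        if column[i] != column[j]
    }


def is_tag_set(names, snps=SNPS):
    """True if the named SNPs together distinguish every pair of patterns."""
    n = len(next(iter(snps.values())))
    all_pairs = {(i + 1, j + 1) for i, j in combinations(range(n), 2)}
    covered = set().union(*(distinguished_pairs(snps[s]) for s in names))
    return covered == all_pairs


print(sorted(distinguished_pairs(SNPS["S1"])))  # [(1, 3), (1, 4), (2, 3), (2, 4)]
print(is_tag_set(["S1", "S3"]))  # True
print(is_tag_set(["S1", "S2"]))  # False: (P1, P2) is never separated
```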
31
Problems of Finding Tag SNPs

The problem of finding the minimum set of tag
SNPs is known to be NP-hard.
This problem is equivalent to the minimum test set
problem.
A number of methods have been proposed to find
the minimum set of tag SNPs (Bafna et al.; Zhang
et al.).
In practice, we may fail to obtain some tag SNPs
if they do not pass the data quality threshold.
In the current genotyping environment, the
missing rate of SNPs is around 5-10%.
We proposed two greedy algorithms and one linear
programming relaxation algorithm to solve this
problem.
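For orientation, the following is a generic greedy set-cover heuristic for the basic tag SNP problem without missing data. It is not the authors' greedy algorithms (which also tolerate missing tag SNPs); the function name and the toy patterns are assumptions of this sketch.

```python
from itertools import combinations


def greedy_tag_snps(snps):
    """Greedy heuristic: repeatedly take the SNP that separates the most
    pattern pairs that are still unseparated."""
    n = len(next(iter(snps.values())))
    uncovered = set(combinations(range(n), 2))
    covers = {
        name: {(i, j) for i, j in uncovered if col[i] != col[j]}
        for name, col in snps.items()
    }
    chosen = []
    while uncovered:
        # Ties are broken by insertion order of the dictionary.
        best = max(covers, key=lambda s: len(covers[s] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            raise ValueError("some patterns are identical at every SNP")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy patterns P1..P4 (0 = major allele, 1 = minor allele); made-up data.
patterns = {"S1": (0, 0, 1, 1), "S2": (0, 0, 0, 1), "S3": (0, 1, 0, 1)}
print(greedy_tag_snps(patterns))  # ['S1', 'S3']
```

The greedy choice does not guarantee a minimum set in general; it is shown only to illustrate how pair coverage drives the selection.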

32
References

Huang, Y.-T., Zhang, K., Chen, T., and Chao,
K.-M., 2005, Selecting Additional Tag SNPs for
Tolerating Missing Data in Genotyping, BMC
Bioinformatics, 6: 263.
Chang, C.-J., Huang, Y.-T., and Chao, K.-M.,
2006, A Greedier Approach for Finding Tag SNPs,
Bioinformatics, 22: 685-691.

33
Research Directions of SNPs and Haplotypes in
Recent Years
SNP Database
Haplotype Inference: Maximum Parsimony, Perfect Phylogeny, Statistical Methods
Tag SNP Selection: Haplotype Block, LD Bin, Prediction Accuracy
34
Linkage Disequilibrium

The problem of finding tag SNPs can also be
solved from a statistical point of view.
We can measure the correlation between SNPs and
identify sets of highly correlated SNPs.
For each set of correlated SNPs, only one SNP
needs to be genotyped, and it can be used to
predict the values of the other SNPs.
Linkage disequilibrium (LD) is a measure that
estimates such correlation between two SNPs.
LD is introduced formally later in this talk.

35
Linkage Disequilibrium Bins

The statistical methods for finding tag SNPs are
based on the analysis of LD among all SNPs.
An LD bin is a set of SNPs such that SNPs within
the same bin are highly correlated with each
other.
The value of a single SNP in one LD bin can
predict the values of other SNPs of the same bin.
These methods try to identify the minimum set of
LD bins.

36
An Example of LD Bins (1/3)

SNP1 and SNP2 cannot form an LD bin.
e.g., A in SNP1 may imply either G or A in SNP2.

Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
37
An Example of LD Bins (2/3)

SNP1, SNP3, and SNP6 can form an LD bin.
Any SNP in this bin is sufficient to predict the
values of the others.

Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
38
An Example of LD Bins (3/3)

There are three LD bins, and only three tag SNPs
are required to be genotyped (e.g., SNP1, SNP2,
and SNP4).

Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A G A C G T
2 T G C C G C
3 A A A T A T
4 T G C T A C
5 T A C C G C
6 T G C T A C
7 A A A T A T
8 A A A T A T
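The three bins can be recovered from the table with a short sketch. It uses the strictest possible criterion, perfect correlation (each allele at one SNP determines the allele at the other), which is an assumption made for this example; real analyses use a threshold on an LD measure instead.

```python
# The eight individuals from the table, columns SNP1..SNP6.
TABLE = """\
A G A C G T
T G C C G C
A A A T A T
T G C T A C
T A C C G C
T G C T A C
A A A T A T
A A A T A T"""

rows = [line.split() for line in TABLE.splitlines()]
columns = list(zip(*rows))  # columns[k] holds SNP(k+1) for all individuals


def perfectly_correlated(a, b):
    """True if each allele of a maps to exactly one allele of b and back."""
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))


bins = []
for k, col in enumerate(columns):
    # Put the SNP in the first bin whose members it all predicts perfectly.
    for b in bins:
        if all(perfectly_correlated(col, columns[j]) for j in b):
            b.append(k)
            break
    else:
        bins.append([k])

named = [[f"SNP{k + 1}" for k in b] for b in bins]
print(named)
# [['SNP1', 'SNP3', 'SNP6'], ['SNP2'], ['SNP4', 'SNP5']]
```

Picking one SNP from each bin (for example SNP1, SNP2, and SNP4) gives the three tag SNPs.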
39
Difference between Haplotype Blocks and LD bins

Haplotype blocks are based on the assumption that
SNPs in close proximity tend to be
correlated with each other.
The probability that recombination occurs
between them is lower.
LD bins can group correlated SNPs that are distant
from each other.
A disease is usually affected by multiple genes
rather than a single one.
The SNPs in one LD bin can be shared by other
bins.
The SNPs in a haplotype block do not appear in
another block.

40
Introduction to Linkage Disequilibrium
A, B: major alleles; a, b: minor alleles
P_A: probability of allele A at SNP1
P_a: probability of allele a at SNP1
P_B: probability of allele B at SNP2
P_b: probability of allele b at SNP2
P_AB: probability of haplotype AB
P_ab: probability of haplotype ab

               SNP2
SNP1     B       b       Total
A        P_AB    P_Ab    P_A
a        P_aB    P_ab    P_a
Total    P_B     P_b     1.0
41
Linkage Equilibrium

P_AB = P_A P_B
P_Ab = P_A P_b = P_A (1 - P_B)
P_aB = P_a P_B = (1 - P_A) P_B
P_ab = P_a P_b = (1 - P_A) (1 - P_B)

               SNP2
SNP1     B       b       Total
A        P_AB    P_Ab    P_A
a        P_aB    P_ab    P_a
Total    P_B     P_b     1.0
42
Linkage Disequilibrium

P_AB ≠ P_A P_B
P_Ab ≠ P_A P_b = P_A (1 - P_B)
P_aB ≠ P_a P_B = (1 - P_A) P_B
P_ab ≠ P_a P_b = (1 - P_A) (1 - P_B)

               SNP2
SNP1     B       b       Total
A        P_AB    P_Ab    P_A
a        P_aB    P_ab    P_a
Total    P_B     P_b     1.0
43
An Example of Linkage Disequilibrium
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
SNP1: P_A = 1/3, P_C = 2/3
SNP2: P_G = 2/3, P_C = 1/3

Suppose we have three haplotypes AG, CG, and CC.
There is no AC haplotype, i.e., P_AC = 0.
Note that P_AC = 0 and P_A P_C = 1/3 × 1/3 = 1/9
(with P_C taken at SNP2), so P_AC ≠ P_A P_C.
These two SNPs are in linkage disequilibrium.

44
An Example of Linkage Equilibrium
Before recombination
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
After recombination
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
-- A -- -- -- C -- -- --
SNP1: P_A = 1/2, P_C = 1/2
SNP2: P_G = 1/2, P_C = 1/2

After recombination,
P_AG = P_A P_G = 1/4,
P_CG = P_C P_G = 1/4,
P_CC = P_C P_C = 1/4, and
P_AC = P_A P_C = 1/4.
These two SNPs are in linkage equilibrium.
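Both examples can be checked numerically with the basic disequilibrium coefficient D = P_ab - P_a P_b, which is zero exactly when the haplotype frequency equals the product of the allele frequencies. The function name `ld_d` and the two-letter haplotype strings are assumptions of this sketch; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def ld_d(haplotypes, a, b):
    """D = P_ab - P_a * P_b for allele a at SNP1 and allele b at SNP2.

    Each haplotype is a two-character string: SNP1 allele, then SNP2 allele.
    """
    n = len(haplotypes)
    p_a = Fraction(sum(h[0] == a for h in haplotypes), n)
    p_b = Fraction(sum(h[1] == b for h in haplotypes), n)
    p_ab = Fraction(sum(h == a + b for h in haplotypes), n)
    return p_ab - p_a * p_b


before = ["AG", "CG", "CC"]        # no AC haplotype
after = ["AG", "CG", "CC", "AC"]   # AC created by recombination

print(ld_d(before, "A", "C"))  # -1/9, i.e. P_AC = 0 but P_A * P_C = 1/9
print(ld_d(after, "A", "C"))   # 0, linkage equilibrium
```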

45
Linkage Disequilibrium

There are many formulas to compute LD between two
SNPs, and most of them are normalized to the
range [-1, 1] or [0, 1].
LD = 1 (perfect positive correlation)
LD = 0 (no correlation, i.e., linkage equilibrium)
LD = -1 (perfect negative correlation)
LD = 0.8 (strong positive correlation)
LD = 0.12 (weak positive correlation)

46
Linkage Disequilibrium Formulas

Mathematical formulas for computing LD
r^2 or Δ^2
D
Chi-square Test.
P value.

47
Correlation Coefficient

The correlation between two random variables A
and B can be measured by the correlation
coefficient.
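The formula on this slide is not preserved in the transcript. The standard form of the correlation coefficient, and its specialization to two biallelic SNPs in the notation introduced earlier (P_A, P_B, P_AB), are given below; treating each SNP as a 0/1 indicator variable is the assumption behind the second line.

```latex
\begin{aligned}
\rho_{A,B} &= \frac{\operatorname{Cov}(A, B)}{\sigma_A \, \sigma_B}, \\[4pt]
r &= \frac{P_{AB} - P_A P_B}{\sqrt{P_A (1 - P_A) \, P_B (1 - P_B)}},
\qquad
r^2 = \frac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A) \, P_B (1 - P_B)}.
\end{aligned}
```

The numerator is zero under linkage equilibrium, and r^2 reaches 1 when each allele at one SNP determines the allele at the other.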

48
Examples of Computing LD
Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
1 A T A A G T
2 G T C C T T
3 G A C A G T
4 G A C C T T
5 G A C A G C
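A minimal sketch of computing r^2 on this table follows. It assumes that each row can be treated as one haplotype and that each SNP is biallelic; the function name `r_squared` is hypothetical. The allele carried by individual 1 is used as the reference allele, which does not affect r^2 for biallelic SNPs.

```python
from fractions import Fraction

# Rows of the table above, one string per individual (SNP1..SNP6).
ROWS = ["ATAAGT", "GTCCTT", "GACAGT", "GACCTT", "GACAGC"]


def r_squared(rows, i, j):
    """r^2 between SNP columns i and j (0-based), rows taken as haplotypes."""
    n = len(rows)
    a, b = rows[0][i], rows[0][j]  # reference alleles
    p_a = Fraction(sum(r[i] == a for r in rows), n)
    p_b = Fraction(sum(r[j] == b for r in rows), n)
    p_ab = Fraction(sum(r[i] == a and r[j] == b for r in rows), n)
    d = p_ab - p_a * p_b
    # Undefined (division by zero) if either SNP shows only one allele.
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


print(r_squared(ROWS, 0, 2))  # SNP1 vs SNP3 -> 1
print(r_squared(ROWS, 3, 4))  # SNP4 vs SNP5 -> 1
print(r_squared(ROWS, 0, 1))  # SNP1 vs SNP2 -> 3/8
```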
49
Minimum Clique Cover Problem

This problem asks for a minimum set of LD bins.
The minimum LD value required between two SNPs in
one bin is usually set to 0.8.
This problem is known to be the minimum clique
cover problem (Huang and Chao, 2005).
Consider each SNP as a node of a graph.
There is an edge between two nodes iff the LD
between the two SNPs is ≥ 0.8.

50
Relaxation of This Problem

The minimum clique cover problem is hard to
approximate.
The relaxed problem asks for a minimum set of LD
bins such that at least one SNP in each LD bin has
r^2 ≥ 0.8 with every other SNP in the same bin.
The relaxed problem is known to be the minimum
dominating set problem.
The minimum dominating set problem is still
NP-hard but is easier to approximate.

51
Minimum Dominating Set Problem

Given a graph G = (V, E), the minimum dominating
set C is a minimum set of nodes such that each
node in V is either in C or has at least one edge
to a node in C.
Consider each node as a SNP and each edge as
strong LD (r^2 ≥ 0.8) between two SNPs.
The minimum dominating set of this graph is the
set of tag SNPs.
We can use only this set of SNPs to predict the
other SNPs.
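A simple way to get a feasible (not necessarily minimum) dominating set is the standard greedy heuristic sketched below. It is shown for illustration and is not claimed to be the method used by the authors. The graph is hypothetical: its edges stand for pairs of SNPs assumed to have r^2 ≥ 0.8.

```python
def greedy_dominating_set(adjacency):
    """Greedy heuristic: repeatedly pick the node that dominates the most
    not-yet-dominated nodes (a node dominates itself and its neighbours)."""
    undominated = set(adjacency)
    chosen = []
    while undominated:
        best = max(
            adjacency,
            key=lambda v: len(({v} | adjacency[v]) & undominated),
        )
        chosen.append(best)
        undominated -= {best} | adjacency[best]
    return chosen


# Hypothetical LD graph: an edge means r^2 >= 0.8 between two SNPs.
graph = {
    "SNP1": {"SNP3", "SNP6"},
    "SNP2": set(),
    "SNP3": {"SNP1", "SNP6"},
    "SNP4": {"SNP5"},
    "SNP5": {"SNP4"},
    "SNP6": {"SNP1", "SNP3"},
}
print(greedy_dominating_set(graph))  # ['SNP1', 'SNP4', 'SNP2']
```

Every SNP outside the returned set is adjacent to a chosen SNP, so genotyping only the chosen SNPs is enough to predict the rest under the assumed LD threshold.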