Introduction to SNP and Haplotype Analysis

1
Introduction to SNP and Haplotype Analysis
Yao-Ting Huang
Kun-Mao Chao
Algorithms and Computational Biology Lab,
Department of Computer Science and Information
Engineering, National Taiwan University, Taiwan.
2
Genetic Variations
  • The genetic variations in DNA sequences (e.g.,
    insertions, deletions, and mutations) have a
    major impact on genetic diseases and phenotypic
    differences.
  • All humans share 99% of the same DNA sequence.
  • Genetic variations in the coding region may
    change the codon of an amino acid and alter the
    amino acid sequence.

3
Single Nucleotide Polymorphism
  • A single nucleotide polymorphism (SNP),
    pronounced "snip", is a genetic variation in
    which a single nucleotide (i.e., A, T, C, or G)
    is altered and the change is kept through
    heredity.
  • SNP: a single DNA base variation found in > 1%
    of the population.
  • Mutation: a single DNA base variation found in
    < 1% of the population.

Sequence           Mutation   SNP
C T T A G C T T    99.9%      94%
C T T A G T T T    0.1%       6%
4
Mutations and SNPs
(Figure: observed genetic variations and their common ancestor.)
5
Single Nucleotide Polymorphism
  • SNPs are the most frequent form among various
    genetic variations.
  • 90% of human genetic variations come from SNPs.
  • SNPs occur about every 300-600 base pairs.
  • Millions of SNPs have been identified (e.g.,
    HapMap and Perlegen).
  • SNPs have become the preferred markers for
    association studies because of their high
    abundance and the availability of high-throughput
    SNP genotyping technologies.

6
Single Nucleotide Polymorphism
  • A SNP is usually assumed to be a binary variable.
  • The probability of repeat mutation at the same
    SNP locus is quite small.
  • The tri-allele cases are usually considered to be
    the effect of genotyping errors.
  • The nucleotide at a SNP locus is called
  • a major allele (if its allele frequency is > 50%), or
  • a minor allele (if its allele frequency is < 50%).

A C T T A G C T T    T: major allele (94%)
A C T T A G C T C    C: minor allele (6%)
7
Haplotypes
  • A haplotype stands for a set of linked SNPs on
    the same chromosome.
  • A haplotype can be simply considered as a binary
    string since each SNP is binary.

8
Genotypes
  • The use of haplotype information has been limited
    because the human genome is diploid.
  • In large sequencing projects, genotypes instead
    of haplotypes are collected because of cost
    considerations.

9
Problems of Genotypes
  • Genotypes only tell us the alleles at each SNP
    locus.
  • But we don't know how the alleles at different
    SNP loci are connected.
  • There could be several possible haplotypes for
    the same genotype.

We don't know which haplotype pair is real.
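
The ambiguity described above can be made concrete with a small sketch. The encoding is an assumption for illustration, not taken from the slides: each genotype site is '0' (homozygous major), '1' (homozygous minor), or '2' (heterozygous), and haplotypes are binary strings. The function name `compatible_pairs` is hypothetical.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs that could produce `genotype`.

    Assumed encoding: '0' = homozygous major, '1' = homozygous minor,
    '2' = heterozygous (one copy of each allele).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    # Every assignment of 0/1 to the heterozygous sites of the first
    # haplotype fixes the second haplotype as its complement there.
    for bits in product("01", repeat=len(het_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(het_sites, bits):
            h1[site] = bit
            h2[site] = "1" if bit == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


# A genotype with two heterozygous sites has two possible haplotype pairs.
print(compatible_pairs("0212"))
```

With k heterozygous sites there are 2^(k-1) candidate pairs when k ≥ 1, which is why genotype data alone cannot tell us which pair is real.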
10
Research Directions of SNPs and Haplotypes in
Recent Years
  • SNP Database
  • Haplotype Inference: Maximum Parsimony, Perfect
    Phylogeny, Statistical Methods
  • Tag SNP Selection: Haplotype Block, LD Bin,
    Prediction Accuracy
11
Haplotype Inference
  • The problem of inferring the haplotypes from a
    set of genotypes is called haplotype inference.
  • This problem is known to be not only NP-hard but
    also APX-hard.
  • Most combinatorial methods use the maximum
    parsimony model to solve this problem.
  • This model assumes that the number of distinct
    haplotypes in a natural population is small.
  • The solution to this problem is a minimum set of
    haplotypes that can explain the given genotypes.

12
Maximum Parsimony
  • Find a minimum set of haplotypes to explain the
    given genotypes.
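
As a rough illustration of the maximum parsimony objective, the sketch below tries candidate haplotype subsets in order of increasing size and returns the first one that explains every genotype. It is an exhaustive search for toy inputs only, not any of the published algorithms; the genotype encoding ('0', '1', '2' for heterozygous) and the function names are assumptions.

```python
from itertools import combinations


def explains(h1, h2, genotype):
    """True if haplotypes h1 and h2 together produce `genotype`."""
    for a, b, g in zip(h1, h2, genotype):
        if g == "2":
            if a == b:  # heterozygous site needs two different alleles
                return False
        elif a != g or b != g:  # homozygous site needs both alleles equal to g
            return False
    return True


def min_haplotype_set(genotypes, candidates):
    """Smallest subset of `candidates` explaining all genotypes (brute force)."""
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(
                any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                for g in genotypes
            ):
                return list(subset)
    return None  # no subset of the candidates explains every genotype


print(min_haplotype_set(["22", "20"], ["00", "01", "10", "11"]))
```

The search is exponential in the number of candidate haplotypes, which is consistent with the hardness results mentioned earlier.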

13
Related Works
  • Statistical methods:
  • Niu et al. (2002) developed a PL-EM algorithm
    called HAPLOTYPER.
  • Stephens and Donnelly (2003) designed an MCMC
    algorithm based on Gibbs sampling called PHASE.
  • Combinatorial methods:
  • Gusfield (2003) proposed an integer linear
    programming algorithm.
  • Wang and Xu (2003) developed a branch-and-bound
    algorithm called HAPAR to find the optimal
    solution.
  • Brown and Harrower (2004) proposed a new integer
    linear formulation of this problem.

14
Our Results
  • We formulated this problem as an integer
    quadratic programming (IQP) problem.
  • We proposed an iterative semidefinite programming
    (SDP) relaxation algorithm to solve the IQP
    problem.
  • This algorithm finds an O(log n)-approximate
    solution.
  • We implemented this algorithm in MATLAB and
    compared it with existing methods.
  • Huang, Y.-T., Chao, K.-M., and Chen, T., 2005,
    An Approximation Algorithm for Haplotype
    Inference by Maximum Parsimony, Journal of
    Computational Biology, 12: 1261-1274.

15
Problem Formulation
  • Input
  • A set of n genotypes and m possible haplotypes.
  • Output
  • A minimum set of haplotypes that can explain the
    given genotypes.

16
Integer Quadratic Programming (IQP)
  • Define x_i as an integer variable with value 1 or
    -1.
  • x_i = 1 if the i-th haplotype is selected.
  • x_i = -1 if the i-th haplotype is not selected.
  • Minimizing the number of selected haplotypes is
    equivalent to minimizing an integer quadratic
    function of these variables.
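
One quadratic objective consistent with the definitions above is sketched here. It is an assumption for illustration and may differ in form from the one used by the authors. Each term equals 1 when x_i = 1 and 0 when x_i = -1, so the sum counts the selected haplotypes among the m candidates.

```latex
\min \sum_{i=1}^{m} \frac{(1 + x_i)^2}{4},
\qquad x_i \in \{1, -1\} \ \text{for } i = 1, \dots, m
```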

17
Integer Quadratic Programming (IQP)
  • Each genotype must be resolved by at least one
    pair of haplotypes.
  • For genotype G1, an integer quadratic constraint
    over the haplotype pairs that resolve G1 must be
    satisfied.
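
A constraint of this kind can be written as follows. This is a sketch under stated assumptions, not necessarily the authors' exact form: R(G_1) is hypothetical notation for the set of index pairs (j, k) such that haplotypes h_j and h_k together resolve G_1. With x_j, x_k in {1, -1}, each term is 1 only when both haplotypes are selected and 0 otherwise, so the sum is at least 1 exactly when some resolving pair is fully selected.

```latex
\sum_{(j,k) \in R(G_1)} \frac{(1 + x_j)(1 + x_k)}{4} \;\ge\; 1
```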

(Figure: example in which h1 and h2 are selected.)
18
Integer Quadratic Programming (IQP)
  • Maximum parsimony: find a minimum set of
    haplotypes to resolve all genotypes.
  • We use the SDP-relaxation technique to solve this
    IQP problem.
19
The Flow of the Iterative SDP Relaxation Algorithm
(Flowchart: Integer Quadratic Programming → Vector Formulation → Semidefinite Programming → SDP Solution → Vector Solution → Integral Solution)
21
Problems of Using SNPs for Association Studies
  • The number of SNPs is still too large to be used
    for association studies.
  • There are millions of SNPs in the human genome.
  • To reduce the SNP genotyping cost, we wish to use
    as few SNPs as possible for association studies.
  • Tag SNPs are a small subset of SNPs that is
    sufficient for performing association studies
    without losing the power of using all SNPs.
  • There are many definitions of tag SNPs.
  • We will first study one definition of tag SNPs
    based on the haplotype block model.

22
Haplotype Blocks and Tag SNPs
  • Recent studies have shown that the chromosome can
    be partitioned into haplotype blocks interspersed
    by recombination hotspots (Daly et al., Patil et
    al.).
  • Within a haplotype block, little or no
    recombination has occurred.
  • The SNPs within a haplotype block tend to be
    inherited together.
  • Within a haplotype block, a small subset of SNPs
    (called tag SNPs) is sufficient to distinguish
    each pair of haplotype patterns in the block.
  • We only need to genotype tag SNPs instead of all
    SNPs within a haplotype block.

23
Recombination Hotspots and Haplotype Blocks
24
A Haplotype Block Example
  • Chromosome 21 was partitioned into 4,135
    haplotype blocks over 24,047 SNPs by Patil et al.
    (Science, 2001).
  • Blue box: major allele
  • Yellow box: minor allele

25
Examples of Tag SNPs
  • Suppose we wish to identify an unknown
    haplotype sample.
  • We can genotype all SNPs to identify the
    haplotype sample.

(Figure: haplotype patterns P1-P4 and an unknown haplotype sample over SNP loci S1-S12; boxes mark major and minor alleles.)
26
Examples of Tag SNPs
  • In fact, it is not necessary to genotype all
    SNPs.
  • SNPs S3, S4, and S5 can form a set of tag SNPs.

(Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the columns for S3, S4, and S5 shown separately.)
27
Examples of Wrong Tag SNPs
  • SNPs S1, S2, and S3 cannot form a set of tag
    SNPs because P1 and P4 will be ambiguous.

(Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the columns for S1, S2, and S3 shown separately.)
28
Examples of Tag SNPs
  • SNPs S1 and S12 can form a set of tag SNPs.
  • This set of SNPs is the minimum solution in this
    example.

(Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the columns for S1 and S12 shown separately.)
29
Problem Formulation
  • The relation between SNPs and haplotypes can be
    formulated as a bipartite graph.
  • S1 can distinguish (P1, P3), (P1, P4), (P2, P3),
    and (P2, P4).
  • S2 can distinguish (P1, P4), (P2, P4), (P3, P4).

(Figure: haplotype patterns P1-P4 over SNPs S1-S4, and a bipartite graph connecting each SNP to the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) that it distinguishes.)
30
Observation
  • A set of SNPs can form a set of tag SNPs if each
    pair of patterns is connected by at least one
    edge to a SNP in the set.
  • e.g., S1 and S3 can form a set of tag SNPs.
  • e.g., S1 and S2 cannot be tag SNPs.

(Figure: pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4); each pair of patterns is connected by at least one edge.)
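
The edge-coverage condition above can be checked mechanically. In this sketch the alleles at S1 and S2 are chosen to match the distinguishing pairs listed on the previous slide. The alleles at S3 are not given in the transcript, so the S3 column is a hypothetical choice that makes S1 and S3 a valid tag set. Alleles are coded '0' and '1', and the function names are illustrative.

```python
from itertools import combinations

# Columns are S1, S2, S3. The S3 column is an assumed example.
patterns = {"P1": "000", "P2": "001", "P3": "100", "P4": "111"}


def distinguished_pairs(patterns, snp):
    """Pattern pairs that differ at SNP index `snp` (the edges of that SNP)."""
    names = sorted(patterns)
    return {
        (a, b)
        for a, b in combinations(names, 2)
        if patterns[a][snp] != patterns[b][snp]
    }


def is_tag_set(patterns, snps):
    """True if every pair of patterns is covered by at least one chosen SNP."""
    all_pairs = set(combinations(sorted(patterns), 2))
    covered = set()
    for snp in snps:
        covered |= distinguished_pairs(patterns, snp)
    return covered == all_pairs


print(is_tag_set(patterns, [0, 2]))  # S1 and S3
print(is_tag_set(patterns, [0, 1]))  # S1 and S2: (P1, P2) stays uncovered
```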
31
Problems of Finding Tag SNPs
  • The problem of finding the minimum set of tag
    SNPs is known to be NP-hard.
  • This problem is the minimum test set problem.
  • A number of methods have been proposed to find
    the minimum set of tag SNPs (Bafna et al., Zhang
    et al.).
  • In practice, we may fail to obtain some tag SNPs
    if they do not pass the data quality threshold.
  • In the current genotyping environment, the
    missing rate of SNPs is around 5-10%.
  • We proposed two greedy algorithms and one linear
    programming relaxation algorithm to solve this
    problem.
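
For intuition, a generic greedy set-cover heuristic for choosing tag SNPs might look like the sketch below. It is not the greedy algorithms proposed by the authors, and it does not handle missing data. It repeatedly picks the SNP that distinguishes the most still-undistinguished pattern pairs. The input format (equal-length strings of '0'/'1') and the function name are assumptions.

```python
from itertools import combinations


def greedy_tag_snps(haplotypes):
    """Greedily choose SNP indices that distinguish all distinct patterns."""
    n_snps = len(haplotypes[0])
    # Only pairs of patterns that differ somewhere can ever be distinguished.
    uncovered = {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if haplotypes[i] != haplotypes[j]
    }
    chosen = []
    while uncovered:
        # Pick the SNP covering the most uncovered pairs (ties: lowest index).
        best = max(
            range(n_snps),
            key=lambda s: sum(
                haplotypes[i][s] != haplotypes[j][s] for i, j in uncovered
            ),
        )
        chosen.append(best)
        uncovered = {
            (i, j)
            for i, j in uncovered
            if haplotypes[i][best] == haplotypes[j][best]
        }
    return sorted(chosen)


print(greedy_tag_snps(["000", "001", "100", "111"]))
```

A greedy choice is not guaranteed to be minimum, but it always terminates because every remaining pair differs at some SNP.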

32
References
  • Huang, Y.-T., Zhang, K., Chen, T., and Chao,
    K.-M., 2005, Selecting Additional Tag SNPs for
    Tolerating Missing Data in Genotyping, BMC
    Bioinformatics, 6: 263.
  • Chang, C.-J., Huang, Y.-T., and Chao, K.-M.,
    2006, A Greedier Approach for Finding Tag SNPs,
    Bioinformatics, 22: 685-691.

34
Linkage Disequilibrium
  • The problem of finding tag SNPs can also be
    solved from a statistical point of view.
  • We can measure the correlation between SNPs and
    identify sets of highly correlated SNPs.
  • For each set of correlated SNPs, only one SNP
    needs to be genotyped, and it can be used to
    predict the values of the other SNPs.
  • Linkage disequilibrium (LD) is a measure that
    estimates such correlation between two SNPs.
  • We will formally introduce LD later.

35
Linkage Disequilibrium Bins
  • The statistical methods for finding tag SNPs are
    based on the analysis of LD among all SNPs.
  • An LD bin is a set of SNPs such that SNPs within
    the same bin are highly correlated with each
    other.
  • The value of a single SNP in one LD bin can
    predict the values of other SNPs of the same bin.
  • These methods try to identify the minimum set of
    LD bins.

36
An Example of LD Bins (1/3)
  • SNP1 and SNP2 cannot form an LD bin.
  • e.g., A in SNP1 may imply either G or A in SNP2.

Individual  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6
1           A     G     A     C     G     T
2           T     G     C     C     G     C
3           A     A     A     T     A     T
4           T     G     C     T     A     C
5           T     A     C     C     G     C
6           T     G     C     T     A     C
7           A     A     A     T     A     T
8           A     A     A     T     A     T
37
An Example of LD Bins (2/3)
  • SNP1, SNP3, and SNP6 can form an LD bin.
  • Any SNP in this bin is sufficient to predict the
    values of others.

Individual  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6
1           A     G     A     C     G     T
2           T     G     C     C     G     C
3           A     A     A     T     A     T
4           T     G     C     T     A     C
5           T     A     C     C     G     C
6           T     G     C     T     A     C
7           A     A     A     T     A     T
8           A     A     A     T     A     T
38
An Example of LD Bins (3/3)
  • There are three LD bins, and only three tag SNPs
    are required to be genotyped (e.g., SNP1, SNP2,
    and SNP4).

Individual  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6
1           A     G     A     C     G     T
2           T     G     C     C     G     C
3           A     A     A     T     A     T
4           T     G     C     T     A     C
5           T     A     C     C     G     C
6           T     G     C     T     A     C
7           A     A     A     T     A     T
8           A     A     A     T     A     T
39
Difference between Haplotype Blocks and LD bins
  • Haplotype blocks are based on the assumption that
    SNPs in close proximity tend to be correlated
    with each other.
  • Recombination is less likely to occur between
    nearby SNPs.
  • LD bins can group correlated SNPs that are
    distant from each other.
  • A disease is usually affected by multiple genes
    instead of a single one.
  • The SNPs in one LD bin can be shared by other
    bins.
  • The SNPs in a haplotype block do not appear in
    another block.

40
Introduction to Linkage Disequilibrium
  • A, B: major alleles; a, b: minor alleles
  • $P_A$: probability of allele A at SNP1
  • $P_a$: probability of allele a at SNP1
  • $P_B$: probability of allele B at SNP2
  • $P_b$: probability of allele b at SNP2
  • $P_{AB}$: probability of haplotype AB
  • $P_{ab}$: probability of haplotype ab

                 SNP2
                 B           b           Total
SNP1   A         $P_{AB}$    $P_{Ab}$    $P_A$
       a         $P_{aB}$    $P_{ab}$    $P_a$
       Total     $P_B$       $P_b$       1.0
41
Linkage Equilibrium
  • $P_{AB} = P_A P_B$
  • $P_{Ab} = P_A P_b = P_A (1 - P_B)$
  • $P_{aB} = P_a P_B = (1 - P_A) P_B$
  • $P_{ab} = P_a P_b = (1 - P_A)(1 - P_B)$

                 SNP2
                 B           b           Total
SNP1   A         $P_{AB}$    $P_{Ab}$    $P_A$
       a         $P_{aB}$    $P_{ab}$    $P_a$
       Total     $P_B$       $P_b$       1.0
42
Linkage Disequilibrium
  • $P_{AB} \neq P_A P_B$
  • $P_{Ab} \neq P_A P_b = P_A (1 - P_B)$
  • $P_{aB} \neq P_a P_B = (1 - P_A) P_B$
  • $P_{ab} \neq P_a P_b = (1 - P_A)(1 - P_B)$

                 SNP2
                 B           b           Total
SNP1   A         $P_{AB}$    $P_{Ab}$    $P_A$
       a         $P_{aB}$    $P_{ab}$    $P_a$
       Total     $P_B$       $P_b$       1.0
43
An Example of Linkage Disequilibrium
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
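SNP1: $P_A = 1/3$, $P_C = 2/3$
SNP2: $P_G = 2/3$, $P_C = 1/3$
  • Suppose we have three haplotypes AG, CG, and CC.
  • There is no AC haplotype, i.e., $P_{AC} = 0$.
  • Note that $P_{AC} = 0$ and $P_A P_C = 1/9$ (with
    $P_C$ taken at SNP2), so $P_{AC} \neq P_A P_C$.
  • These two SNPs are in linkage disequilibrium.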

44
An Example of Linkage Equilibrium
Before recombination:
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
After recombination:
-- A -- -- -- G -- -- --
-- C -- -- -- G -- -- --
-- C -- -- -- C -- -- --
-- A -- -- -- C -- -- --
SNP1: $P_A = 1/2$, $P_C = 1/2$
SNP2: $P_G = 1/2$, $P_C = 1/2$
  • After recombination,
  • $P_{AG} = P_A P_G = 1/4$,
  • $P_{CG} = P_C P_G = 1/4$,
  • $P_{CC} = P_C P_C = 1/4$, and
  • $P_{AC} = P_A P_C = 1/4$.
  • These two SNPs are in linkage equilibrium.
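
The two examples above (haplotypes AG, CG, CC before recombination, plus AC afterwards) can be checked with exact fractions. This is a minimal sketch that assumes each haplotype is a two-character string holding the SNP1 allele followed by the SNP2 allele. The function name is illustrative.

```python
from fractions import Fraction


def joint_vs_product(haplotypes, allele1, allele2):
    """Return (P of haplotype allele1+allele2, P(allele1 at SNP1) * P(allele2 at SNP2))."""
    n = len(haplotypes)
    p1 = Fraction(sum(h[0] == allele1 for h in haplotypes), n)
    p2 = Fraction(sum(h[1] == allele2 for h in haplotypes), n)
    p12 = Fraction(sum(h == allele1 + allele2 for h in haplotypes), n)
    return p12, p1 * p2


# Before recombination: joint probability 0 versus product 1/9 (disequilibrium).
print(joint_vs_product(["AG", "CG", "CC"], "A", "C"))
# After recombination: both are 1/4 (equilibrium).
print(joint_vs_product(["AG", "CG", "CC", "AC"], "A", "C"))
```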

45
Linkage Disequilibrium
  • There are many formulas to compute LD between two
    SNPs, and most of them are normalized to the
    range [-1, 1] or [0, 1].
  • LD = 1 (perfect positive correlation)
  • LD = 0 (no correlation, i.e., linkage equilibrium)
  • LD = -1 (perfect negative correlation)
  • LD = 0.8 (strong positive correlation)
  • LD = 0.12 (weak positive correlation)

46
Linkage Disequilibrium Formulas
  • Mathematical formulas for computing LD:
  • $r^2$ or $\Delta^2$
  • D'
  • Chi-square test
  • P value

47
Correlation Coefficient
  • The correlation between two random variables A
    and B can be measured by the correlation
    coefficient.
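
For two biallelic SNPs, the correlation coefficient takes the standard form below, written in the notation of the earlier slides. The slide's own formula is not preserved in this transcript, so this should be read as the textbook definition rather than a copy of the slide. Its square, $r^2$, is the LD measure used in the following slides.

```latex
r = \frac{P_{AB} - P_A P_B}{\sqrt{P_A \, P_a \, P_B \, P_b}}
  = \frac{P_{AB} - P_A P_B}{\sqrt{P_A (1 - P_A) \, P_B (1 - P_B)}}
```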

48
Examples of Computing LD
Individual  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6
1           A     T     A     A     G     T
2           G     T     C     C     T     T
3           G     A     C     A     G     T
4           G     A     C     C     T     T
5           G     A     C     A     G     C
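
A minimal sketch of computing $r^2$ for pairs of SNPs in this table follows. It assumes each row is a single phased haplotype and each SNP has exactly two alleles. It uses the standard definition $r^2 = (P_{AB} - P_A P_B)^2 / (P_A P_a P_B P_b)$, with the allele carried by individual 1 taken as the reference allele at each SNP. The function name is illustrative.

```python
# Rows of the table above, one string per individual (SNP1..SNP6).
rows = ["ATAAGT", "GTCCTT", "GACAGT", "GACCTT", "GACAGC"]
columns = ["".join(col) for col in zip(*rows)]


def r_squared(col_a, col_b):
    """r^2 between two biallelic SNP columns (both must be polymorphic)."""
    ref_a, ref_b = col_a[0], col_b[0]
    n = len(col_a)
    p_a = sum(x == ref_a for x in col_a) / n
    p_b = sum(y == ref_b for y in col_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(col_a, col_b)) / n
    d = p_ab - p_a * p_b  # deviation from linkage equilibrium
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


print(r_squared(columns[0], columns[2]))  # SNP1 vs SNP3: approximately 1.0
print(r_squared(columns[0], columns[1]))  # SNP1 vs SNP2: approximately 0.375
```

Under these assumptions SNP1 and SNP3 are perfectly correlated, as are SNP4 and SNP5, while SNP1 and SNP2 fall well below the usual 0.8 threshold.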
49
Minimum Clique Cover Problem
  • This problem asks for a minimum set of LD bins.
  • The minimum LD value required between two SNPs in
    one bin is usually set to 0.8.
  • This problem is known to be the minimum clique
    cover problem (Huang and Chao, 2005).
  • Consider each SNP as a node in a graph.
  • There is an edge between two nodes iff the LD
    of the two SNPs is at least 0.8.

50
Relaxation of This Problem
  • The minimum clique cover problem is hard to
    approximate.
  • The relaxed problem asks for a minimum set of LD
    bins such that at least one SNP in each LD bin
    has $r^2 \geq 0.8$ with all other SNPs in the
    same bin.
  • The relaxed problem is known to be the minimum
    dominating set problem.
  • The minimum dominating set problem is still
    NP-hard but is easier to approximate.

51
Minimum Dominating Set Problem
  • Given a graph G = (V, E), a minimum dominating
    set C is a minimum set of nodes such that each
    node in V is either in C or has at least one
    edge connecting it to a node in C.
  • Consider each node as a SNP and each edge as
    strong LD ($r^2 \geq 0.8$) between two SNPs.
  • The minimum dominating set of this graph is the
    set of tag SNPs.
  • We can use only this set of SNPs to predict the
    other SNPs.
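
A common way to approximate a minimum dominating set is a greedy heuristic: repeatedly pick the node that dominates the most not-yet-dominated nodes. The sketch below illustrates that generic heuristic and is not claimed to be the authors' method. The example graph is an assumption, with edges drawn between the perfectly correlated SNPs of the earlier eight-individual LD bin table.

```python
def greedy_dominating_set(adjacency):
    """Greedy approximation of a minimum dominating set.

    `adjacency` maps each node to the set of its neighbours (strong-LD SNPs).
    """
    undominated = set(adjacency)
    chosen = []
    while undominated:
        # A node dominates itself and its neighbours; ties go to the
        # alphabetically first node.
        best = max(
            sorted(adjacency),
            key=lambda v: len(({v} | adjacency[v]) & undominated),
        )
        chosen.append(best)
        undominated -= {best} | adjacency[best]
    return chosen


# Hypothetical strong-LD graph over six SNPs.
graph = {
    "SNP1": {"SNP3", "SNP6"},
    "SNP2": set(),
    "SNP3": {"SNP1", "SNP6"},
    "SNP4": {"SNP5"},
    "SNP5": {"SNP4"},
    "SNP6": {"SNP1", "SNP3"},
}
print(greedy_dominating_set(graph))  # ['SNP1', 'SNP4', 'SNP2']
```

Each chosen SNP acts as a tag SNP for itself and its neighbours in the graph.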

52
Experimental Data Sets
  • Hinds et al. (2005) identified 1,586,383 SNPs
    across three human populations.
  • Americans of African, European, and Asian ancestry.
  • The database provides both genotype data and
    inferred haplotype data.