Title: Single Nucleotide Polymorphisms
1Single Nucleotide Polymorphisms
Instructor: Yao-Ting Huang
Bioinformatics Laboratory, Department of Computer
Science and Information Engineering, National Chung
Cheng University.
2Genetic Variants
- We are distinguished from each other by genetic variants.
- Single Nucleotide Polymorphisms (SNPs)
- Insertion/deletion
- Copy Number Polymorphism (CNP)
- Inversion
3Genetic Variants Over Time
[Figure: mutations accumulate over time, from a common ancestor to the variants observed in the present population.]
4SNPs and Haplotypes
- A Single Nucleotide Polymorphism (SNP), pronounced "snip", is a single DNA base variation observed in the human population.
- A haplotype is a set of linked SNPs on the same chromosome.
5Single Nucleotide Polymorphism
- We only consider SNPs observed with sufficient frequency in the population.
- SNP: the minor allele frequency is at least 5%.
- Mutation: the minor allele frequency is less than 5%.
SNP (both alleles are common):
C T T A G C T T
C T T A G T T T
Mutation (one allele is rare):
A C T T A G C T T (99.9%)
A C T T A G T T T (0.1%)
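The 5% rule above can be sketched in a few lines of Python. The function names and the example allele counts are illustrative assumptions, not part of the slides; the threshold is the one stated on this slide.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Frequency of the second most common allele (0.0 if only one allele is seen)."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / len(alleles)

def classify_site(alleles, threshold=0.05):
    """Label a variable site 'SNP' if its minor allele frequency reaches the threshold."""
    return "SNP" if minor_allele_frequency(alleles) >= threshold else "mutation"

# Hypothetical samples: one site with two common alleles, one with a rare allele.
common_site = ["C"] * 60 + ["T"] * 40
rare_site = ["C"] * 999 + ["T"] * 1
print(classify_site(common_site), classify_site(rare_site))
```

The sketch assumes at most two alleles matter at a site, which matches the later remark that a SNP locus usually shows only two alleles.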
6Single Nucleotide Polymorphism
- All humans share 99.9% of the same DNA sequence.
- SNPs occur about every 200-600 base pairs.
- 90% of human genome variation comes from SNPs.
- The human genome contains about four million SNPs.
- Because the probability of recurrent mutation at the same locus is quite low, we usually observe only two alleles at a SNP locus.
7Single Nucleotide Polymorphism
- The SNPs differ among members in the human
population.
DNA sequences of 6 individuals:
Individual  Eye color  Sequence          Haplotype
1           Black      GATATTCGTACGGA-T  AG-
2           Brown      GATGTTCGTACTGAAT  GTA
3           Black      GATATTCGTACGGA-T  AG-
4           Blue       GATATTCGTACGGAAT  AGA
5           Brown      GATGTTCGTACTGAAT  GTA
6           Brown      GATGTTCGTACTGAAT  GTA
Haplotype frequencies: AG- 2/6, GTA 3/6, AGA 1/6
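As a small illustration, the haplotype frequencies on this slide can be recomputed from the six aligned sequences by keeping only the columns where the individuals differ. The helper names below are assumptions made for this sketch.

```python
from collections import Counter

# The six aligned sequences from the slide.
sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGA-T",
    "GATATTCGTACGGAAT",
    "GATGTTCGTACTGAAT",
    "GATGTTCGTACTGAAT",
]

def variable_columns(seqs):
    """Indices of alignment columns where at least two sequences differ."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]

def haplotype_counts(seqs):
    """Count each haplotype, i.e. the string of alleles at the variable columns."""
    cols = variable_columns(seqs)
    return Counter("".join(s[i] for i in cols) for s in seqs)

print(variable_columns(sequences))
print(haplotype_counts(sequences))
```

The three variable columns play the role of the SNP loci, and the counts correspond to the 2/6, 3/6, and 1/6 frequencies listed above.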
8Discovery of SNPs
- The DNA of two individuals differs by less than 0.1%.
- Hinds et al. identified 1,586,383 Single Nucleotide Polymorphisms across three human populations (Science, 2005).
9The HapMap Project
- The International HapMap project aims to provide a map of SNPs in the human genome (269 individuals from 4 populations).
- Phase I: 1,007,329 SNPs.
- Phase II (ongoing): 4.6 million SNPs.
10Haplotype vs. Genotype
- The collection of haplotypes has been limited because the human genome is diploid.
- In the above projects, genotypes rather than haplotypes are collected because of cost.
11Haplotype vs. Genotype
- Genotypes only tell us the alleles at each SNP locus.
- But we don't know how the alleles at different SNP loci are connected.
- There could be several possible haplotype pairs for the same genotype.
- We don't know which haplotype pair is real.
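To make the ambiguity concrete, the following sketch enumerates every haplotype pair that is consistent with one genotype. The representation of a genotype as a list of allele pairs and the function name are assumptions for illustration only.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    genotype: list of (allele, allele) tuples, one tuple per SNP locus.
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        h2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

# Two heterozygous loci: the genotype alone cannot tell these two pairs apart.
print(haplotype_pairs([("A", "C"), ("T", "G")]))
```

A genotype with k heterozygous loci has 2^(k-1) candidate pairs, so the ambiguity grows quickly with the number of SNPs.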
12Three Possible Genotypes at Each SNP Locus
- At SNP1, it is possible to observe three genotypes, (A, C), (A, A), and (C, C), in the population.
- (A, C): heterozygous (one major and one minor allele).
- (A, A): homozygous wild type (two major alleles).
- (C, C): homozygous mutant (two minor alleles).
[Figure: example genotype G3 shown at loci SNP1 and SNP2.]
13Haplotype Inference
- Inferring the haplotypes from a set of genotypes is called haplotype inference.
- Without further assumptions, this problem cannot be solved.
- Most combinatorial methods adopt the maximum parsimony model to solve this problem.
- Methods based on this model search for a minimum set of haplotypes that can explain all genotypes.
- This problem has been shown to be APX-hard (Lancia et al., 2005).
14Maximum Parsimony
- Find a minimum set of haplotypes that can explain
all genotypes.
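The maximum parsimony objective can be illustrated with an exhaustive search on a tiny instance. This is only a brute-force sketch, not any of the methods discussed in these slides. The 0/1/2 genotype encoding ('2' marks a heterozygous locus) and the function names are assumptions, and the running time is exponential in the number of SNPs.

```python
from itertools import combinations, combinations_with_replacement, product

def explains(h1, h2, genotype):
    """True if haplotypes h1 and h2 (strings over '01') combine into the genotype."""
    for a, b, g in zip(h1, h2, genotype):
        if g == "2":
            if a == b:  # a heterozygous locus needs two different alleles
                return False
        elif not (a == b == g):  # a homozygous locus needs two copies of allele g
            return False
    return True

def min_haplotype_set(genotypes):
    """Smallest set of haplotypes such that every genotype is explained by some pair."""
    n_loci = len(genotypes[0])
    candidates = ["".join(bits) for bits in product("01", repeat=n_loci)]
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            pairs = list(combinations_with_replacement(subset, 2))
            if all(any(explains(a, b, g) for a, b in pairs) for g in genotypes):
                return subset
    return None

print(min_haplotype_set(["22", "02", "00"]))
```

Even for this three-genotype example the search has to try every subset of the four candidate haplotypes, which hints at why approximation algorithms are of interest.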
15Related Works
- Statistical methods
- Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
- Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
- Combinatorial methods
- Gusfield (2003) proposed an integer linear programming formulation for this problem.
- Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
- Brown and Harrower (2004) proposed a new integer linear programming formulation for this problem.
16Our Results
- Huang et al., "An approximation algorithm for haplotype inference by maximum parsimony," Journal of Computational Biology, 2005.
Yao-Ting Huang
17Approximation Approaches to NP-hard problems
- Formulate the problem as an integer linear programming problem.
- Relax it to a Linear Programming (LP) problem and solve it.
- Gusfield and Brown formulate the haplotype inference problem as an integer programming problem.
- Formulate the problem as an integer quadratic programming (IQP) problem.
- Relax it to a Semi-Definite Programming (SDP) problem and solve it.
- We formulate the haplotype inference problem as an IQP problem.
18Integer Quadratic Programming
- Define x_i as an integer variable with value 1 or -1.
- x_i = 1 if the i-th haplotype is selected.
- x_i = -1 if the i-th haplotype is not selected.
- Finding a minimum set of haplotypes amounts to minimizing the following function.
19Integer Quadratic Programming
- Each genotype must be explained by at least one pair of haplotypes.
- For genotype G1, the following inequality must be satisfied.
Suppose h1 and h2 are selected.
20Integer Quadratic Programming
Constraint Functions
Find a minimum set of haplotypes
which can explain all genotypes.
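The formulas on these slides did not survive in the text. One way to write an IQP that is consistent with the +1/-1 encoding of x_i is sketched below. The notation P_k, for the set of index pairs (i, j) such that haplotypes h_i and h_j together explain genotype G_k, is an assumption introduced here, and the exact form used on the slides may differ.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{i=1}^{m} \frac{1 + x_i}{2} \\
\text{subject to}\quad & \sum_{(i,j) \in P_k} \frac{(1 + x_i)(1 + x_j)}{4} \;\ge\; 1
                         \qquad \text{for every genotype } G_k, \\
                       & x_i \in \{-1, +1\} \qquad \text{for } i = 1, \dots, m .
\end{aligned}
```

Each term (1 + x_i)/2 is 1 when haplotype i is selected and 0 otherwise, so the objective counts the selected haplotypes. Each product (1 + x_i)(1 + x_j)/4 is 1 only when both haplotypes of a pair are selected, so the constraint requires at least one explaining pair per genotype.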
21An Iterative Semi-definite Programming Relaxation
Algorithm
Integer Quadratic Programming -> Vector Formulation -> Semi-definite Programming -> SDP Solution -> Vector Solution -> Integral Solution
22Relaxation
Integer Quadratic Programming -> Vector Formulation
- We relax x_i into an (m+1)-dimensional unit vector y_i.
- Replace the integer constant 1 with another unit vector y_0 = (1, 0, ..., 0).
23SDP Formulation
Vector Formulation
- Let Y = (y_0 y_1 ... y_m)^T (y_0 y_1 ... y_m).
24Reformulation
Vector Formulation
25Solving SDP
Semidefinite Programming
- The SDP problem can be solved in polynomial time by algorithms such as the interior point method.
- We can obtain the SDP solution matrix Y.
26Decomposition
SDP Solution
- Recall that Y = (y_0 y_1 ... y_m)^T (y_0 y_1 ... y_m).
- Use the incomplete Cholesky decomposition method to obtain the vector solutions y_0, y_1, ..., y_m.
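The decomposition step recovers vectors whose pairwise inner products reproduce Y. The sketch below uses an eigendecomposition from NumPy rather than the incomplete Cholesky method named on the slide, because it also handles a rank-deficient Y; the function name and the random test matrix are assumptions.

```python
import numpy as np

def vectors_from_gram(Y):
    """Return an array whose rows y_0..y_m satisfy rows @ rows.T ~= Y for PSD Y."""
    eigvals, eigvecs = np.linalg.eigh(Y)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off values
    return eigvecs * np.sqrt(eigvals)      # scale each eigenvector column

# Build a hypothetical Gram matrix from random unit vectors, then recover vectors.
rng = np.random.default_rng(0)
original = rng.standard_normal((5, 3))
original /= np.linalg.norm(original, axis=1, keepdims=True)
Y = original @ original.T
recovered = vectors_from_gram(Y)
print(np.allclose(recovered @ recovered.T, Y))
```

The recovered vectors generally differ from the original ones by a rotation, which does not matter for the rounding step because only inner products are used.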
27Randomized Rounding
Vector Solution -> Integral Solution
- Randomly generate two unit vectors z_1 and z_2.
- Set x_i = 1 if both (z_1 · y_i)(z_1 · y_0) > 0 and (z_2 · y_i)(z_2 · y_0) > 0.
- Set x_i = -1 otherwise.
We will discuss this later.
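The rounding rule can be written down directly with NumPy. This is a minimal sketch assuming the vectors are stored as rows with y_0 first; the function names and the toy input are illustrative.

```python
import numpy as np

def random_unit_vector(dim, rng):
    """Draw a uniformly random direction by normalizing a Gaussian sample."""
    z = rng.standard_normal(dim)
    return z / np.linalg.norm(z)

def round_solution(vectors, rng):
    """Round vectors (row 0 is y_0, row i is y_i) to x in {+1, -1}^m."""
    y0, ys = vectors[0], vectors[1:]
    z1 = random_unit_vector(vectors.shape[1], rng)
    z2 = random_unit_vector(vectors.shape[1], rng)
    # y_i must lie on the same side as y_0 for both random directions.
    same_side_1 = (ys @ z1) * (y0 @ z1) > 0
    same_side_2 = (ys @ z2) * (y0 @ z2) > 0
    return np.where(same_side_1 & same_side_2, 1, -1)

# Toy input: y_1 equals y_0 and y_2 is its opposite.
vectors = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
print(round_solution(vectors, np.random.default_rng(1)))
```

A vector equal to y_0 is always rounded to 1 and a vector opposite to y_0 is always rounded to -1; vectors in between are selected with a probability that depends on their angle to y_0.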
28Iterative Process
Integer Quadratic Programming
- Check whether all inequalities are satisfied.
- If not, repeat this algorithm only for the unsatisfied inequalities.
- If so, we are done.
29Analysis of the SDP-relaxation Algorithm
- Recall the randomized rounding:
- Randomly generate two unit vectors z_1 and z_2.
- Set x_i = 1 if both (z_1 · y_i)(z_1 · y_0) > 0 and (z_2 · y_i)(z_2 · y_0) > 0.
- Set x_i = -1 otherwise.
- We will show that the randomized rounding outputs a solution whose expected value E[w] is at least as good as the optimal solution.
30Analysis of the SDP-relaxation Algorithm
- The randomized rounding method can output a solution whose expected value E[w] is at least as good as the optimal solution.
- We will show OPT(IQP) ≥ OPT(SDP) ≥ E[w].
- The solution space of SDP includes that of IQP, so we already have OPT(IQP) ≥ OPT(SDP).
- We can set y_i = (1, 0, 0, 0, ...) when x_i = 1.
- We can set y_i = (-1, 0, 0, 0, ...) when x_i = -1.
31Analysis of the SDP-relaxation Algorithm
- We still need to prove the second inequality in
- OPT(IQP) ≥ OPT(SDP) ≥ E[w].
32Analysis of the SDP-relaxation Algorithm
- Recall x_i = 1 if both (z_1 · y_i)(z_1 · y_0) > 0 and (z_2 · y_i)(z_2 · y_0) > 0.
- Note that cos θ = v_i · v_j for unit vectors v_i and v_j separated by angle θ.
- Let the angle between vectors y_0 and y_i be θ.
- Recall that cos θ > 0 when θ ∈ [0, π/2) or θ ∈ (3π/2, 2π].
33Analysis of the SDP-relaxation Algorithm
- Recall x_i = 1 if both (z_1 · y_i)(z_1 · y_0) > 0 and (z_2 · y_i)(z_2 · y_0) > 0.
- Let the angle between vectors y_0 and y_i be θ.
- (z_1 · y_i)(z_1 · y_0) > 0 if z_1 is within the region of angle (π-θ) or the opposite region.
- (z_2 · y_i)(z_2 · y_0) > 0 if z_2 is within the region of angle (π-θ) or the opposite region.
- x_i = 1 with probability ((π-θ)/π)^2.
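The probability ((π-θ)/π)^2 can be checked numerically. The sketch below is a Monte Carlo estimate in two dimensions, which is enough because the rule only depends on the angle between y_0 and y_i; the function name, trial count, and seed are arbitrary assumptions.

```python
import numpy as np

def selection_probability(theta, trials=200_000, seed=0):
    """Estimate P(x_i = 1) when the angle between y_0 and y_i is theta."""
    rng = np.random.default_rng(seed)
    y0 = np.array([1.0, 0.0])
    yi = np.array([np.cos(theta), np.sin(theta)])
    # Gaussian samples give uniformly random directions; only signs matter here.
    z1 = rng.standard_normal((trials, 2))
    z2 = rng.standard_normal((trials, 2))
    ok1 = (z1 @ yi) * (z1 @ y0) > 0
    ok2 = (z2 @ yi) * (z2 @ y0) > 0
    return float(np.mean(ok1 & ok2))

theta = np.pi / 3
estimate = selection_probability(theta)
formula = ((np.pi - theta) / np.pi) ** 2
print(estimate, formula)
```

The two printed numbers should agree up to sampling noise, since each random direction keeps y_i and y_0 on the same side with probability (π-θ)/π and the two directions are independent.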
34Analysis of the SDP-relaxation Algorithm
35Analysis of the SDP-relaxation Algorithm
- We now complete the proof of
- OPT(IQP) ≥ OPT(SDP) ≥ E[w].
36Simulation Methods
- The haplotypes are used to validate the result.
- We randomly pair two haplotypes to generate a
genotype.
Haplotype data: h1, h2, ..., hm
Genotype data: G1 = (h1, h4), G2 = (h2, hm), ..., Gn = (h1, h2)
Solution: G1 = (h1, h4), G2 = (h1, h2), ..., Gn = (h1, h2)
Methods: SDPHapInfer, HAPAR, HAPLOTYPER, PHASE
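The simulation setup described above can be sketched as follows. The 0/1 haplotype strings, the use of '2' for heterozygous loci, sampling pairs with replacement, and all function names are assumptions of this sketch rather than details given in the slides.

```python
import random

def make_genotype(h1, h2):
    """Combine two haplotypes (strings over '01'); '2' marks a heterozygous locus."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def simulate(num_haplotypes, num_snps, num_genotypes, seed=0):
    """Generate random haplotypes, then genotypes from randomly paired haplotypes."""
    rng = random.Random(seed)
    haplotypes = ["".join(rng.choice("01") for _ in range(num_snps))
                  for _ in range(num_haplotypes)]
    data = []
    for _ in range(num_genotypes):
        h1, h2 = rng.choice(haplotypes), rng.choice(haplotypes)
        # Keep the true pair next to the genotype so a solution can be validated.
        data.append((make_genotype(h1, h2), (h1, h2)))
    return haplotypes, data

haplotypes, data = simulate(num_haplotypes=10, num_snps=20, num_genotypes=5)
for genotype, true_pair in data:
    print(genotype, true_pair)
```

Keeping the true pair alongside each genotype is what allows the inferred haplotypes to be compared against the ones that generated the data.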
37Results
- We prove that SDPHapInfer gives an O(log n)-approximate solution with high probability, where n is the number of genotypes.
- We implement SDPHapInfer in MATLAB.
- We compare the number of haplotypes found by different methods on simulated data sets.
38Experimental Results (1/2)
[Figure: error rate versus number of genotypes, on 100 simulated data sets of 10 haplotypes with 20 SNPs.]
39The Challenge
- Inferring haplotypes for long genotypes is still a challenging problem.
- Existing methods are forced to
- partition the genotypes into small segments,
- infer haplotypes in each segment,
- and concatenate the inferred haplotypes to construct a final solution.
40The First Application of SDP on Approximation
Algorithms
- A randomized 0.878-approximation algorithm for the MAXCUT problem was developed using the SDP relaxation technique.
- The LP relaxation can only achieve an approximation ratio of 0.5.
- An upper bound on the achievable ratio has been shown to be 0.941.
- Goemans, M. and Williamson, D., ACM STOC 1994.
41The MAXCUT Problem
- Given an undirected graph G with n nodes x_1, x_2, ..., x_n, find a cut that maximizes the number of edges on the cut.
- Let x_i be 1 if the vertex is on one side of the cut, and -1 if the vertex is on the other side of the cut.
42Integer Quadratic Programming
- Define a_ij to be 1 if the edge (x_i, x_j) exists and 0 otherwise.
- Relax the integer constraint on x_i so that x_i becomes a unit-length vector in dimension m.
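The objective on this slide was lost in extraction. The standard way to write MAXCUT with the +1/-1 encoding and a_ij as defined above, together with its vector relaxation, is shown below as a reference sketch; the slide's own layout may have differed.

```latex
\begin{aligned}
\text{(IQP)}\quad        & \max \sum_{i<j} a_{ij}\,\frac{1 - x_i x_j}{2}
                           \qquad \text{subject to } x_i \in \{-1, +1\}, \\[4pt]
\text{(relaxation)}\quad & \max \sum_{i<j} a_{ij}\,\frac{1 - v_i \cdot v_j}{2}
                           \qquad \text{subject to } \lVert v_i \rVert = 1 .
\end{aligned}
```

The term (1 - x_i x_j)/2 equals 1 exactly when x_i and x_j have opposite signs, so the IQP objective counts the edges on the cut. Replacing each x_i by a unit vector v_i and each product by an inner product gives the relaxed problem.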
43Semidefinite Programming Formulation
- Let X be (v_1, v_2, ..., v_n)^T · (v_1, v_2, ..., v_n).
44Randomized Rounding Method
- Once X is found, perform Cholesky decomposition to obtain the vector solutions v_1, v_2, ..., v_n.
- Pick a random unit vector r and
- set x_i = 1 if v_i · r ≥ 0,
- set x_i = -1 if v_i · r < 0.
- Note that cos θ = v_i · v_j.
- The edge (v_i, v_j) is on the cut iff (v_i · r) and (v_j · r) have different signs.
[Figure: unit vectors v_i and v_j separated by angle θ, with the random vector r.]
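The hyperplane rounding step can be sketched with NumPy. Solving the SDP itself is outside this sketch, so three unit vectors at 120 degrees are used as a stand-in vector solution for a triangle graph; the function name and the example are assumptions.

```python
import numpy as np

def round_cut(vectors, edges, rng):
    """Split nodes by the sign of v_i . r and count the edges on the cut."""
    r = rng.standard_normal(vectors.shape[1])  # random direction; its length is irrelevant
    x = np.where(vectors @ r >= 0, 1, -1)
    cut = sum(1 for i, j in edges if x[i] != x[j])
    return x, cut

# Triangle graph with stand-in unit vectors spread 120 degrees apart.
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
vectors = np.column_stack([np.cos(angles), np.sin(angles)])
edges = [(0, 1), (1, 2), (0, 2)]
x, cut = round_cut(vectors, edges, np.random.default_rng(0))
print(x, cut)
```

For this example any random direction separates one vector from the other two, so the rounded cut contains two of the three edges, which is the best possible cut of a triangle.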
45Analysis
- Let C denote the size of the cut found by the above algorithm.
- The probability that each edge (x_i, x_j) is in the solution depends on the angle θ between v_i and v_j.
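The formula for this slide is missing from the text. The standard analysis of hyperplane rounding, written with θ_ij for the angle between v_i and v_j (a symbol introduced here), runs as follows.

```latex
\begin{aligned}
\Pr[\text{edge } (x_i, x_j) \text{ is on the cut}]
  &= \frac{\theta_{ij}}{\pi} = \frac{\arccos(v_i \cdot v_j)}{\pi}, \\[4pt]
E[C] &= \sum_{i<j} a_{ij}\,\frac{\theta_{ij}}{\pi}
      \;\ge\; \alpha \sum_{i<j} a_{ij}\,\frac{1 - \cos\theta_{ij}}{2},
\qquad
\alpha = \min_{0 < \theta \le \pi} \frac{2}{\pi}\,\frac{\theta}{1 - \cos\theta} \approx 0.878 .
\end{aligned}
```

The sum on the right is the value of the relaxed problem, which is at least the size of the maximum cut, so the expected cut size is at least about 0.878 times the optimum.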
46Analysis
- The randomized rounding partitions the nodes by a hyperplane.
[Figure: a hyperplane with normal vector r separates the nodes labeled 1 from the nodes labeled -1.]
47Linear Algebra Background
- A symmetric n×n matrix A is positive semidefinite iff x^T A x ≥ 0 for every x ∈ R^n.
- A = B^T B for some m×n matrix B.
- All the eigenvalues of A are non-negative.
- The inner product of symmetric matrices A and B is