Single Nucleotide Polymorphisms

1
Single Nucleotide Polymorphisms
Instructor: Yao-Ting Huang
Bioinformatics Laboratory, Department of Computer
Science and Information Engineering, National Chung
Cheng University.
2
Genetic Variants
  • We are distinguished from each other by genetic
    variants.
  • Single Nucleotide Polymorphisms (SNP)
  • Insertion/deletion
  • Copy Number Polymorphism (CNP)
  • Inversion

3
Genetic Variants Over Time
Variants observed in a population
Mutations over time
Common Ancestor
time
present
4
SNPs and Haplotypes
  • A Single Nucleotide Polymorphism (SNP),
    pronounced snip, is a single DNA base variation
    observed in the human population.
  • A haplotype stands for a set of linked SNPs on
    the same chromosome.

5
Single Nucleotide Polymorphism
  • We only consider SNPs observed with sufficient
    frequency in the population.
  • SNP: the minor allele frequency is at least 5%.
  • Mutation: the minor allele frequency is less than
    5%.

[Figure: SNP example: C T T A G C T T vs. C T T A G T T T.
Mutation example: A C T T A G C T T (99.9%) vs.
A C T T A G T T T (0.1%).]
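
The frequency rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the threshold is 5% of sampled chromosomes; the function names and the 94/6 allele counts are hypothetical, while the 99.9%/0.1% split mirrors the figure.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Frequency of the second most common allele (0.0 if only one allele)."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / len(alleles)

def classify_locus(alleles, threshold=0.05):
    """Label a locus by its minor allele frequency (threshold is an assumption)."""
    maf = minor_allele_frequency(alleles)
    if maf == 0.0:
        return "monomorphic"
    return "SNP" if maf >= threshold else "mutation"

common = ["C"] * 94 + ["T"] * 6    # hypothetical counts, MAF = 6%
rare = ["C"] * 999 + ["T"] * 1     # 99.9% vs. 0.1%, as in the figure

print(classify_locus(common), classify_locus(rare))
```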
6
Single Nucleotide Polymorphism
  • All humans share 99.9% of the same DNA sequence.
  • SNPs occur about every 200 to 600 base pairs.
  • 90% of human genome variation comes from SNPs.
  • The human genome contains about four million
    SNPs.
  • Because the probability of recurrent mutation at
    the same locus is quite low, we usually observe
    only two alleles at a SNP locus.

7
Single Nucleotide Polymorphism
  • The SNPs differ among members in the human
    population.

DNA sequences of 6 individuals
Individual  Eye color  DNA sequence      Haplotype
1           Black      GATATTCGTACGGA-T  AG-
2           Brown      GATGTTCGTACTGAAT  GTA
3           Black      GATATTCGTACGGA-T  AG-
4           Blue       GATATTCGTACGGAAT  AGA
5           Brown      GATGTTCGTACTGAAT  GTA
6           Brown      GATGTTCGTACTGAAT  GTA
Haplotype frequencies: AG- 2/6, GTA 3/6, AGA 1/6
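
The haplotype counts on this slide can be reproduced from the six sequences. The sketch below keeps only the alignment columns where the individuals differ and counts the resulting strings; the helper names are hypothetical.

```python
from collections import Counter

sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGA-T",
    "GATATTCGTACGGAAT",
    "GATGTTCGTACTGAAT",
    "GATGTTCGTACTGAAT",
]

def variable_sites(seqs):
    """Indices of alignment columns where not all sequences agree."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

def haplotype_counts(seqs):
    """Count the haplotypes formed by the variable columns only."""
    sites = variable_sites(seqs)
    return Counter("".join(s[i] for i in sites) for s in seqs)

sites = variable_sites(sequences)
counts = haplotype_counts(sequences)
print(sites, dict(counts))
```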
8
Discovery of SNPs
  • The DNA of two individuals differs by less than
    0.1%.
  • Hinds et al. identified 1,586,383 Single
    Nucleotide Polymorphisms across three human
    populations (Science, 2005).

9
The HapMap Project
  • The International HapMap project aims to provide
    a map of SNPs in the human genome (269
    individuals from 4 populations).
  • Phase I: 1,007,329 SNPs.
  • Phase II (ongoing): 4.6 million SNPs.

10
Haplotype vs. Genotype
  • The collection of haplotypes has been limited
    because the human genome is diploid.
  • In the above projects, genotypes instead of
    haplotypes are collected due to cost
    considerations.

11
Haplotype vs. Genotype
  • Genotypes only tell us the alleles at each SNP
    locus.
  • But we don't know the connection of alleles at
    different SNP loci.
  • There could be several possible haplotype pairs
    for the same genotype.

[Figure: one genotype with two alternative haplotype pairs.]
We don't know which haplotype pair is real.
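
The ambiguity can be made concrete by enumerating every haplotype pair that is consistent with a genotype. This is a minimal sketch; the genotype encoding (one pair of alleles per locus) and the example alleles are assumptions made for illustration.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    genotype: list of (allele, allele) tuples, one per SNP locus.
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # choice[k] says which allele of locus k goes to the first haplotype
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

# Hypothetical genotype, heterozygous at both loci
genotype = [("A", "C"), ("T", "G")]
print(haplotype_pairs(genotype))
```

A genotype with k heterozygous loci has 2^(k-1) such pairs for k >= 1, which is why the genotype alone does not determine the haplotypes.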
12
Three Possible Genotypes at Each SNP Locus
  • At SNP1, it is possible to observe three
    genotypes (A, C), (A, A), and (C, C) in the
    population.
  • (A, C): heterozygous (one major and one minor
    allele).
  • (A, A): homozygous wild type (two major alleles).
  • (C, C): homozygous mutant (two minor alleles).

[Figure: example genotype G3 at loci SNP1 and SNP2.]
13
Haplotype Inference
  • Inferring the haplotypes from a set of genotypes
    is called haplotype inference.
  • Without further assumptions, this problem cannot
    be solved.
  • Most combinatorial methods consider the maximum
    parsimony model to solve this problem.
  • Methods based on this model search for a minimum
    set of haplotypes which can explain all
    genotypes.
  • This problem has been shown to be APX-hard (Lancia
    et al., 2005).

14
Maximum Parsimony
  • Find a minimum set of haplotypes that can explain
    all genotypes.
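
For very small inputs the maximum parsimony objective can be solved by exhaustive search. The sketch below is only an illustration of the objective, not one of the algorithms discussed in this talk; the genotype encoding, function names, and example data are assumptions, and the running time is exponential.

```python
from itertools import combinations, product

def explaining_pairs(genotype):
    """All unordered haplotype pairs that combine into the genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(frozenset((h1, h2)))
    return pairs

def max_parsimony(genotypes):
    """Smallest haplotype set in which every genotype has an explaining pair."""
    options = [explaining_pairs(g) for g in genotypes]
    candidates = sorted({h for pairs in options for pair in pairs for h in pair})
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(pair <= chosen for pair in pairs) for pairs in options):
                return subset
    return ()

# Hypothetical genotypes over two loci
genotypes = [
    [("A", "C"), ("T", "G")],
    [("A", "A"), ("T", "G")],
    [("A", "C"), ("G", "G")],
]
print(max_parsimony(genotypes))
```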

15
Related Works
  • Statistical methods
  • Niu et al. (2002) developed a PL-EM algorithm
    called HAPLOTYPER.
  • Stephens and Donnelly (2003) designed an MCMC
    algorithm based on Gibbs sampling called PHASE.
  • Combinatorial methods
  • Gusfield (2003) proposed an integer linear
    programming formulation for this problem.
  • Wang and Xu (2003) developed a branch-and-bound
    algorithm called HAPAR to find the optimal
    solution.
  • Brown and Harrower (2004) proposed a new integer
    linear programming formulation for this problem.

16
Our Results
  • Huang et al., "An approximation algorithm for
    haplotype inference by maximum parsimony," Journal
    of Computational Biology, 2005.

Yao-Ting Huang
17
Approximation Approaches to NP-hard Problems
  • Formulate the problem as an integer linear
    programming (ILP) problem.
  • Relax it to a linear programming (LP) problem and
    solve it.
  • Gusfield and Brown formulate the haplotype
    inference problem as an integer linear program.
  • Formulate the problem as an integer quadratic
    programming (IQP) problem.
  • Relax it to a semidefinite programming (SDP)
    problem and solve it.
  • We formulate the haplotype inference problem as
    an IQP problem.

18
Integer Quadratic Programming
  • Define xi as an integer variable with value 1 or
    -1.
  • xi = 1 if the i-th haplotype is selected.
  • xi = -1 if the i-th haplotype is not selected.
  • Finding a minimum set of haplotypes means
    minimizing the number of xi that equal 1.

19
Integer Quadratic Programming
  • Each genotype must be explained by at least one
    pair of haplotypes.
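  • For genotype G1, the selected haplotypes must
    include both members of at least one pair that
    explains G1.

[Figure: G1 is explained if h1 and h2 are both selected, or if
another explaining pair is selected.]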
20
Integer Quadratic Programming
Constraint Functions
  • Maximum parsimony

Find a minimum set of haplotypes
which can explain all genotypes.
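
The objective and constraint formulas for these slides are not present in the transcript. Under the encoding above (x_i = 1 if haplotype h_i is selected, x_i = -1 otherwise), one natural way to write the program is shown below. It is offered as an assumption consistent with the definitions, not as the author's exact formulation; m denotes the number of candidate haplotypes.

```latex
\min \; \sum_{i=1}^{m} \frac{1 + x_i}{2}
\qquad \text{subject to} \qquad
\sum_{(h_j,\, h_k)\ \text{explains}\ G} \frac{1 + x_j}{2} \cdot \frac{1 + x_k}{2} \;\ge\; 1
\quad \text{for every genotype } G,
\qquad x_i \in \{-1, 1\}.
```

Each term (1 + x_i)/2 is 1 for a selected haplotype and 0 otherwise, so the objective counts selected haplotypes, and each product in the constraint is 1 exactly when both haplotypes of an explaining pair are selected.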
21
An Iterative Semi-definite Programming Relaxation Algorithm
Integer Quadratic Programming → Vector Formulation →
Semi-definite Programming → SDP Solution →
Vector Solution → Integral Solution
22
Relaxation
Integer Quadratic Programming
Vector Formulation
  • We relax xi into an (m+1)-dimensional unit vector
    yi.
  • Replace the integer constant 1 with another unit
    vector y0 = (1, 0, ..., 0).

23
SDP Formulation
Vector Formulation
  • Let Y = (y0, y1, ..., ym)^T (y0, y1, ..., ym).

24
Reformulation
Vector Formulation
25
Solving SDP
Semidefinite Programming
  • The SDP problem can be solved by algorithms such
    as the interior point method in polynomial time.
  • We can obtain the SDP solution matrix Y.

26
Decomposition
SDP Solution
  • Recall that Y = (y0, y1, ..., ym)^T (y0, y1, ..., ym).
  • Use the incomplete Cholesky decomposition method
    to obtain vector solutions y0, y1, ..., ym.
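
The decomposition step can be sketched in plain Python. Note the substitution: this sketch uses an ordinary Cholesky factorization that skips zero pivots, not the incomplete Cholesky routine named on the slide. The function name gram_to_vectors and its tol parameter are hypothetical.

```python
import math

def gram_to_vectors(Y, tol=1e-12):
    """Factor a positive semidefinite Gram matrix as Y = L L^T.

    Row i of L is a vector y_i with y_i . y_j == Y[i][j].
    """
    n = len(Y)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = Y[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if pivot > tol:
            L[j][j] = math.sqrt(pivot)
            for i in range(j + 1, n):
                partial = sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = (Y[i][j] - partial) / L[j][j]
        # A zero pivot means the direction is already spanned; leave zeros.
    return L

c = math.sqrt(0.5)
Y = [[1.0, 0.0, c],
     [0.0, 1.0, c],
     [c, c, 1.0]]          # Gram matrix of three unit vectors, rank 2
vectors = gram_to_vectors(Y)
```

Any factorization Y = L L^T gives valid vectors, because the rounding step depends only on inner products between the vectors.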

27
Randomized Rounding
Integral Solution
Vector Solution
  • Randomly generate two unit vectors z1 and z2.
  • Set xi = 1 if
  • (z1 · yi)(z1 · y0) > 0, and
  • (z2 · yi)(z2 · y0) > 0.
  • Set xi = -1 otherwise.
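
The rounding rule above can be sketched as follows. This is a minimal illustration, assuming the vectors are given as lists with y0 first; the random directions are drawn as Gaussian vectors and left unnormalized, since only the signs of the inner products matter. All function names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_direction(dim, rng):
    """Gaussian vector; only its direction affects the sign tests."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def round_solution(vectors, rng):
    """vectors[0] is y0; returns x1..xm, each 1 or -1."""
    y0 = vectors[0]
    z1 = random_direction(len(y0), rng)
    z2 = random_direction(len(y0), rng)
    xs = []
    for yi in vectors[1:]:
        same_side_1 = dot(z1, yi) * dot(z1, y0) > 0
        same_side_2 = dot(z2, yi) * dot(z2, y0) > 0
        xs.append(1 if same_side_1 and same_side_2 else -1)
    return xs

# y1 equals y0 (always selected), y2 is opposite to y0 (never selected)
example = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
print(round_solution(example, random.Random(0)))
```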

We will discuss this later
28
Iterative Process
Integer Quadratic Programming
  • Check whether all inequalities are satisfied.
  • If not, repeat this algorithm only for the
    unsatisfied inequalities.
  • If so, we are done.

29
Analysis of the SDP-relaxation Algorithm
  • Recall the randomized rounding:
  • Randomly generate two unit vectors z1 and z2.
  • Set xi = 1 if
  • (z1 · yi)(z1 · y0) > 0, and
  • (z2 · yi)(z2 · y0) > 0.
  • Set xi = -1 otherwise.
  • We will show that the randomized rounding outputs
    a solution whose expected value E[w] is at least
    as good as the optimal solution.

30
Analysis of the SDP-relaxation Algorithm
  • The randomized rounding method can output a
    solution whose expected value E[w] is at least
    as good as the optimal solution.
  • We will show OPT(IQP) ≥ OPT(SDP) ≥ E[w].
  • The solution space of SDP includes that of IQP,
    so we already have OPT(IQP) ≥ OPT(SDP).
  • We can set yi = (1, 0, 0, 0, ...) if xi = 1.
  • We can set yi = (-1, 0, 0, 0, ...) if xi = -1.

31
Analysis of the SDP-relaxation Algorithm
  • We still need to prove the relation between
    OPT(SDP) and E[w] in
  • OPT(IQP) ≥ OPT(SDP) ≥ E[w].
32
Analysis of the SDP-relaxation Algorithm
  • Recall xi = 1 if
  • (z1 · yi)(z1 · y0) > 0, and
  • (z2 · yi)(z2 · y0) > 0.
  • Note that cos θ = vi · vj for unit vectors vi and
    vj separated by angle θ.
  • Let the angle between vectors y0 and yi be θ.
  • Recall that cos θ > 0 when θ is in [0, π/2) or
    (3π/2, 2π].

33
Analysis of the SDP-relaxation Algorithm
  • Recall xi = 1 if
  • (z1 · yi)(z1 · y0) > 0, and
  • (z2 · yi)(z2 · y0) > 0.
  • Let the angle between vectors y0 and yi be θ.
  • (z1 · yi)(z1 · y0) > 0 if z1 is within region
    (π-θ) or the opposite region.
  • (z2 · yi)(z2 · y0) > 0 if z2 is within region
    (π-θ) or the opposite region.
  • xi = 1 with probability ((π-θ)/π)^2.
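
The probability ((π-θ)/π)^2 can be checked numerically. The sketch below is a Monte Carlo estimate in two dimensions, where a uniformly random unit vector is easy to draw; the function name, trial count, and seed are arbitrary choices made for illustration.

```python
import math
import random

def selection_probability(theta, trials=50_000, seed=0):
    """Estimate P(xi = 1) when the angle between y0 and yi is theta."""
    rng = random.Random(seed)
    y0 = (1.0, 0.0)
    yi = (math.cos(theta), math.sin(theta))

    def passes_test():
        # Draw a uniformly random unit vector z and apply one sign test.
        phi = rng.uniform(0.0, 2.0 * math.pi)
        z = (math.cos(phi), math.sin(phi))
        return (z[0] * yi[0] + z[1] * yi[1]) * (z[0] * y0[0] + z[1] * y0[1]) > 0

    hits = 0
    for _ in range(trials):
        first = passes_test()    # test with z1
        second = passes_test()   # test with z2, drawn independently
        if first and second:
            hits += 1
    return hits / trials

theta = math.pi / 3
estimate = selection_probability(theta)
predicted = ((math.pi - theta) / math.pi) ** 2   # equals 4/9 for theta = pi/3
print(estimate, predicted)
```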

34
Analysis of the SDP-relaxation Algorithm
35
Analysis of the SDP-relaxation Algorithm
  • We now complete the proof of
  • OPT(IQP) ≥ OPT(SDP) ≥ E[w].

36
Simulation Methods
  • The haplotypes are used to validate the result.
  • We randomly pair two haplotypes to generate a
    genotype.

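[Figure: haplotype data (h1, h2, ..., hm) are randomly paired into
genotype data (G1 = h1 h4, G2 = h2 hm, ..., Gn = h1 h2). SDPHapInfer,
HAPAR, HAPLOTYPER, and PHASE each infer a solution (for example
G1 = h1 h4, G2 = h1 h2, ..., Gn = h1 h2), which is compared with the
original haplotypes.]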
37
Results
  • We prove that SDPHapInfer gives an
    O(log n)-approximation with high probability,
    where n is the number of genotypes.
  • We implement SDPHapInfer in MATLAB.
  • We compare the number of haplotypes found by
    different methods on simulated data sets.

38
Experimental Results (1/2)
[Figure: error rate versus number of genotypes, for 100 simulated
data sets of 10 haplotypes with 20 SNPs.]
39
The Challenge
  • Inferring haplotypes for long genotypes is still
    a challenging problem.
  • Existing methods are forced to
  • partition the genotypes into small segments,
  • infer haplotypes in each segment,
  • and concatenate the inferred haplotypes to
    construct a final solution.

40
The First Application of SDP on Approximation
Algorithms
  • A randomized 0.878-approximation algorithm for
    the MAXCUT problem was developed using the SDP
    relaxation technique.
  • LP relaxation can only achieve a 0.5
    approximation ratio.
  • An upper bound of 0.941 on the achievable
    approximation ratio has been shown.
  • Goemans, M. and Williamson, D., ACM STOC 1994.

41
The MAXCUT Problem
  • Given an undirected graph with n nodes
    G = {x1, x2, ..., xn}, find a cut that maximizes
    the number of edges on the cut.
  • Let xi be 1 if the vertex is on one side of the
    cut, and -1 if the vertex is on the other side of
    the cut.

42
Integer Quadratic Programming
  • Define aij to be 1 if the edge (xi, xj) exists
    and 0 otherwise.

[Figure: example graph on vertices x1, x2, x3, x4.]
  • Relax the integer constraint on xi so that xi
    becomes a unit-length vector in dimension m.
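
The objective formula for this slide is missing from the transcript. The standard quadratic form for MAXCUT maximizes the sum over edges of (1 - xi xj)/2, which contributes 1 when the endpoints are on opposite sides and 0 otherwise. The brute-force sketch below evaluates that form on a hypothetical 4-cycle; it is an illustration of the objective, not an efficient algorithm, and the function names are hypothetical.

```python
from itertools import product

def cut_size(edges, x):
    """Sum over edges of (1 - x[i] * x[j]) / 2 for x in {1, -1}^n."""
    return sum((1 - x[i] * x[j]) // 2 for i, j in edges)

def max_cut_brute_force(n, edges):
    """Try all 2^n assignments and return (best cut size, assignment)."""
    best = max(product((1, -1), repeat=n), key=lambda x: cut_size(edges, x))
    return cut_size(edges, best), best

# Hypothetical graph: a cycle on four vertices (0-indexed)
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(max_cut_brute_force(4, cycle))
```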

43
Semidefinite Programming Formulation
[Figure: example graph on vertices x1, x2, x3, x4.]
  • Let X be (v1, v2, ..., vn)^T · (v1, v2, ..., vn).

44
Randomized Rounding Method
  • Once X is found, perform Cholesky decomposition
    to obtain the vector solutions v1, v2, ..., vn.
  • Pick a random unit vector r.
  • Set xi = 1 if vi · r ≥ 0.
  • Set xi = -1 if vi · r < 0.
  • Note that cos θ = vi · vj, where θ is the angle
    between vi and vj.
  • The edge (vi, vj) is on the cut iff (vi · r)
    and (vj · r) have different signs.
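
Given the vectors, the rounding step itself is short. The sketch below assumes the unit vectors are already available (here they are hand-picked for a 4-cycle rather than produced by an SDP solver) and draws r as an unnormalized Gaussian vector, which does not change any sign. The function names are hypothetical.

```python
import random

def hyperplane_round(vectors, rng):
    """Assign 1 or -1 to each vector by the sign of its inner product with r."""
    r = [rng.gauss(0.0, 1.0) for _ in vectors[0]]
    return [1 if sum(a * b for a, b in zip(v, r)) >= 0 else -1 for v in vectors]

def cut_size(edges, x):
    """Number of edges whose endpoints received different labels."""
    return sum(1 for i, j in edges if x[i] != x[j])

# Hand-picked vectors for a 4-cycle: neighbors point in opposite directions
vectors = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
x = hyperplane_round(vectors, random.Random(1))
print(x, cut_size(edges, x))
```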

[Figure: vectors vi and vj separated by angle θ, with the random
vector r.]
45
Analysis
  • Let C denote the size of the cut found by the
    above algorithm.
  • The probability that an edge (xi, xj) is on the
    cut is θ/π, where θ is the angle between vi and
    vj.
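
For reference, the standard Goemans-Williamson argument behind the 0.878 ratio can be written as follows. The notation (theta_ij for the angle between v_i and v_j, alpha for the ratio) is introduced here for illustration.

```latex
\Pr\bigl[\operatorname{sign}(v_i \cdot r) \ne \operatorname{sign}(v_j \cdot r)\bigr]
  = \frac{\theta_{ij}}{\pi},
\qquad \theta_{ij} = \arccos(v_i \cdot v_j)

E[C] = \sum_{i<j} a_{ij}\, \frac{\theta_{ij}}{\pi}
  \;\ge\; \alpha \sum_{i<j} a_{ij}\, \frac{1 - \cos\theta_{ij}}{2}
  \;\ge\; \alpha \cdot \mathrm{OPT},
\qquad
\alpha = \min_{0 < \theta \le \pi} \frac{2}{\pi} \cdot \frac{\theta}{1 - \cos\theta} \approx 0.878
```

The middle sum is the optimal value of the SDP relaxation, which is at least OPT because every cut corresponds to a feasible SDP solution.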

[Figure: vectors vi and vj separated by angle θ, with the random
vector r.]
46
Analysis
  • The randomized rounding partitions the nodes by a
    hyperplane.

[Figure: the hyperplane with normal vector r separates nodes labeled
1 from nodes labeled -1.]
47
Linear Algebra Background
  • A symmetric n×n matrix A is positive semidefinite
    iff x^T A x ≥ 0 for every x in R^n.
  • Equivalently, A = B^T B for some m×n matrix B.
  • Equivalently, all the eigenvalues of A are
    non-negative.
  • The inner product of symmetric matrices A and B is
    the sum over all i, j of aij bij, which equals
    tr(AB).
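
The link between the factored form and the quadratic-form condition can be checked directly: if A = B^T B, then x^T A x equals the squared length of Bx and so cannot be negative. The sketch below uses a hypothetical 3×2 matrix B and random test vectors; the function names are hypothetical.

```python
import random

def gram(B):
    """Return A = B^T B for an m x n matrix B given as a list of rows."""
    n = len(B[0])
    return [[sum(row[i] * row[j] for row in B) for j in range(n)] for i in range(n)]

def quadratic_form(A, x):
    """Compute x^T A x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

B = [[1, 2], [3, 4], [5, 6]]       # hypothetical 3 x 2 matrix
A = gram(B)                        # 2 x 2, symmetric, positive semidefinite

rng = random.Random(0)
samples = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]
print(A, min(quadratic_form(A, x) for x in samples))
```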