1
Linkage Disequilibrium Bins
Bioinformatics and Computational Molecular Biology (Fall 2005) Final Project
An Algorithm for Tag SNP Selection from SNP Database
Working directory and program files: http://bunny.idv.tw/sbb/work/bcc
Detailed experimental report: http://www.bunny.idv.tw/sbb/work/bcc/report.html
R94922059 ???(sbb)
2
Overview
  • Basic concepts
  • SNP
  • Tag SNP
  • Introduction to "Tag SNP Selection"
  • Algorithm
  • Linkage Disequilibrium
  • Minimum Dominating Set Problem
  • Applications
  • Experiment result
  • Summary

3
Basic concept - SNP
  • Single Nucleotide Polymorphisms
  • A genetic variation when a single nucleotide
    (i.e., A, T, C, or G) is altered and kept through
    heredity.
  • The most frequent form among various genetic
    variations
  • The preferred markers for association studies
    because of their high abundance and
    high-throughput SNP genotyping technologies

4
Basic concept - SNP
[Figure: observed genetic variations descending from a common ancestor]
5
Basic concept - Haplotypes
  • A haplotype stands for a set of linked SNPs on
    the same chromosome.
  • A haplotype can be simply considered as a binary
    string since each SNP is binary.

6
Tag SNP Selection
  • Tag SNP
  • A small subset of SNPs that is sufficient for
    performing association studies without losing the
    power of using all SNPs
  • Within a haplotype block, its tag SNPs are
    sufficient to distinguish each pair of haplotype
    patterns in the block.

7
Tag SNP
  • Suppose we wish to distinguish an unknown
    haplotype sample.
  • We can genotype all SNPs to identify the
    haplotype sample.

[Figure: haplotype patterns P1 to P4 and an unknown haplotype sample across SNP loci S1 to S12; each cell shows a major or minor allele]
8
Tag SNP - Example
  • In fact, it is not necessary to genotype all
    SNPs.
  • SNPs S3, S4, and S5 can form a set of tag SNPs.

[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with the reduced table of S3, S4, and S5 for P1 to P4]
9
Tag SNP - Wrong Example
  • SNPs S1, S2, and S3 cannot form a set of tag
    SNPs because P1 and P4 would be indistinguishable.

[Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with the reduced table of S1, S2, and S3 for P1 to P4]
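
The two examples above amount to a simple test: a set of SNPs is a valid tag set if no two haplotype patterns share the same alleles on it. The sketch below illustrates that test. The allele values in `PATTERNS` are hypothetical (the slide figures are not available) and are chosen only so that S1 to S3 confuse P1 with P4 while S3 to S5 do not; the function names are likewise illustrative.

```python
from itertools import combinations

# Hypothetical haplotype patterns over six SNPs (0 = major allele, 1 = minor allele).
# These values are invented for illustration; they are not the data from the slides.
PATTERNS = {
    "P1": [0, 0, 0, 1, 0, 1],
    "P2": [0, 1, 1, 0, 0, 1],
    "P3": [1, 0, 1, 1, 0, 0],
    "P4": [0, 0, 0, 0, 1, 1],
}


def is_tag_set(patterns, snp_indices):
    """Return True if every pattern has a distinct signature on the chosen SNP columns."""
    signatures = {tuple(hap[i] for i in snp_indices) for hap in patterns.values()}
    return len(signatures) == len(patterns)


def ambiguous_pairs(patterns, snp_indices):
    """List the pairs of patterns that look identical on the chosen SNP columns."""
    def signature(hap):
        return tuple(hap[i] for i in snp_indices)

    return [
        (a, b)
        for a, b in combinations(sorted(patterns), 2)
        if signature(patterns[a]) == signature(patterns[b])
    ]
```

With these made-up patterns, columns 2, 3, 4 (S3, S4, S5) separate all four patterns, while columns 0, 1, 2 (S1, S2, S3) leave P1 and P4 ambiguous.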
10
Tag SNP Selection
  • The problem of finding the minimum set of tag
    SNPs is known to be NP-hard.
  • This problem is the minimum test set problem.
  • A number of methods have been proposed to find
    the minimum set of tag SNPs (Bafna et al.; Zhang
    et al.).
  • In reality, we may fail to obtain some tag SNPs
    if they do not pass the data quality threshold.
  • In the current genotyping environment, the
    missing rate of SNPs is around 5 to 10%.
  • We proposed two greedy algorithms and one linear
    programming relaxation algorithm to solve this
    problem.
  • Linkage Disequilibrium!

11
Linkage Disequilibrium
  • The problem of finding tag SNPs can also be
    approached from a statistical point of view.
  • We can measure the correlation between SNPs and
    identify sets of highly correlated SNPs.
  • Linkage Disequilibrium (LD) is a measure that
    estimates such correlation between two SNPs.

12
Linkage Disequilibrium Bins
  • The statistical methods for finding tag SNPs are
    based on the analysis of LD among all SNPs.
  • An LD bin is a set of SNPs such that SNPs within
    the same bin are highly correlated with each
    other.
  • The value of a single SNP in one LD bin can
    predict the values of other SNPs of the same bin.
  • These methods try to identify the minimum set of
    LD bins.

13
LD Bins Example (1/3)
  • SNP1 and SNP2 cannot form an LD bin.
  • For example, A in SNP1 may imply either G or A in
    SNP2.

14
LD Bins Example (2/3)
  • SNP1, SNP2, and SNP3 can form an LD bin.
  • Any SNP in this bin is sufficient to predict the
    values of others.

15
LD Bins Example (3/3)
  • There are three LD bins, and only three tag SNPs
    are required to be genotyped (e.g., SNP1, SNP2,
    and SNP4).

16
Linkage Disequilibrium
  • A, B: major alleles; a, b: minor alleles
  • $P_A$: probability of allele A at SNP1
  • $P_a$: probability of allele a at SNP1
  • $P_B$: probability of allele B at SNP2
  • $P_b$: probability of allele b at SNP2
  • $P_{AB}$: probability of haplotype AB
  • $P_{ab}$: probability of haplotype ab

[Figure: alleles A and b at loci SNP1 and SNP2]
17
Linkage Disequilibrium
  • $P_{AB} \neq P_A P_B$
  • $P_{Ab} \neq P_A P_b = P_A (1 - P_B)$
  • $P_{aB} \neq P_a P_B = (1 - P_A) P_B$
  • $P_{ab} \neq P_a P_b = (1 - P_A)(1 - P_B)$

18
Linkage Equilibrium
  • $P_{AB} = P_A P_B$
  • $P_{Ab} = P_A P_b = P_A (1 - P_B)$
  • $P_{aB} = P_a P_B = (1 - P_A) P_B$
  • $P_{ab} = P_a P_b = (1 - P_A)(1 - P_B)$

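The difference between the disequilibrium and equilibrium cases is commonly summarized by a single coefficient D. The definition below is the standard one, stated here as background rather than taken from the slides; it uses the haplotype and allele probabilities defined earlier.

```latex
D = P_{AB} - P_A P_B
\quad\Longrightarrow\quad
\begin{aligned}
P_{AB} &= P_A P_B + D, & P_{Ab} &= P_A P_b - D,\\
P_{aB} &= P_a P_B - D, & P_{ab} &= P_a P_b + D.
\end{aligned}
```

Linkage equilibrium corresponds to D = 0, in which case all four haplotype probabilities reduce to the products listed above.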
19
Linkage Disequilibrium
  • There are many formulas to compute LD between two
    SNPs, and most of them are usually normalized
    to the range [-1, 1] or [0, 1].
  • LD = 1 (perfect positive correlation)
  • LD = 0 (no correlation or linkage equilibrium)
  • LD = -1 (perfect negative correlation)
  • LD = 0.8 (strong positive correlation)
  • LD = 0.12 (weak positive correlation)

20
Linkage Disequilibrium Formulas
  • Mathematical formulas for computing LD
  • r2 or ?2
  • D
  • Chi-square Test.
  • P value.
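
As an illustration of the first measure in the list, the sketch below computes r2 for two biallelic SNPs from phased haplotype data, using the standard definition r2 = D^2 / (P_A P_a P_B P_b). The 0/1 coding, the function name `ld_r2`, and the choice to return 0 for a monomorphic SNP are assumptions made for this example, not details from the project.

```python
def ld_r2(snp1, snp2):
    """r^2 between two biallelic SNPs.

    Each argument is a list of 0/1 alleles, one entry per haplotype,
    in the same haplotype order (the 0/1 coding is an assumption).
    """
    n = len(snp1)
    if n == 0 or n != len(snp2):
        raise ValueError("both SNPs need the same non-zero number of haplotypes")

    p1 = sum(snp1) / n                                  # frequency of allele 1 at SNP1
    p2 = sum(snp2) / n                                  # frequency of allele 1 at SNP2
    p12 = sum(a & b for a, b in zip(snp1, snp2)) / n    # frequency of haplotype 1-1

    d = p12 - p1 * p2                                   # disequilibrium coefficient D
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        # A monomorphic SNP makes r^2 undefined; returning 0 is a choice made here.
        return 0.0
    return d * d / denom
```

Because r2 squares D, it lies in [0, 1] and does not distinguish positive from negative correlation.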

21
Minimum Clique Cover Problem
  • This problem asks for a minimum set of LD bins.
  • The minimum LD value required between two SNPs in
    one bin is usually set to 0.8.
  • This problem is known to be the minimum clique
    cover problem (by Chao, K.-M., 2005).
  • Consider each SNP as a node of a graph.
  • There exists an edge between two nodes iff the LD
    of these two SNPs is at least 0.8.
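
The graph construction described in this slide can be sketched as follows. The SNP names and pairwise LD values are hypothetical, and the adjacency-set representation is just one convenient choice; a bin in the clique-cover sense is then a set of nodes that are all pairwise connected.

```python
from itertools import combinations


def build_ld_graph(snps, ld, threshold=0.8):
    """Adjacency sets: an edge joins two SNPs whose pairwise LD is >= threshold.

    `ld` maps unordered SNP pairs to an LD value; missing pairs count as 0.
    """
    graph = {s: set() for s in snps}
    for a, b in combinations(snps, 2):
        value = ld.get((a, b), ld.get((b, a), 0.0))
        if value >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def is_clique(graph, nodes):
    """True if every pair of the given nodes is joined by an edge."""
    return all(b in graph[a] for a, b in combinations(nodes, 2))


# Hypothetical pairwise LD values, invented for illustration only.
SNPS = ["s1", "s2", "s3", "s4"]
LD = {("s1", "s2"): 0.95, ("s1", "s3"): 0.90, ("s2", "s3"): 0.60, ("s3", "s4"): 0.10}
GRAPH = build_ld_graph(SNPS, LD)
```

In this made-up example s1 and s2 form a valid bin, but s1, s2, and s3 do not, because the LD between s2 and s3 is below the threshold.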

22
Relaxation of This Problem
  • The minimum clique cover problem is hard to
    approximate.
  • The relaxed problem asks for a minimum set of LD
    bins such that at least one SNP in each LD bin has
    $r^2 \geq 0.8$ with every other SNP in the same bin.
  • The relaxed problem is known to be the minimum
    dominating set problem.
  • The minimum dominating set problem is still
    NP-hard but is easier to approximate.

23
Minimum Dominating Set Problem
  • Given a graph G(V, E), a minimum dominating set
    C is a smallest set of nodes such that each
    node in V is either in C or has at least one
    edge connecting it to a node in C.
  • Consider each node as a SNP and each edge as
    strong LD ($r^2 \geq 0.8$) between two SNPs.
  • The minimum dominating set of this graph is the
    set of tag SNPs.
  • We need to genotype only this set of SNPs to
    predict the other SNPs.
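
A minimal sketch of the definition, assuming the LD graph is stored as a dictionary of adjacency sets. The example graph and the helper names `is_dominating_set` and `assign_bins` are hypothetical; `assign_bins` simply places each non-tag SNP in the bin of the first listed tag SNP it is adjacent to.

```python
def is_dominating_set(graph, chosen):
    """True if every node is in `chosen` or adjacent to a node in `chosen`."""
    chosen = set(chosen)
    return all(v in chosen or bool(graph[v] & chosen) for v in graph)


def assign_bins(graph, tags):
    """Group SNPs into LD bins, one bin per tag SNP.

    Assumes `tags` is a dominating set of `graph`; a SNP adjacent to
    several tags goes to the first one in `tags`.
    """
    bins = {t: [t] for t in tags}
    for v in graph:
        if v in bins:
            continue
        for t in tags:
            if t in graph[v]:
                bins[t].append(v)
                break
    return bins


# Hypothetical LD graph: s1 is in strong LD with s2 and s3, and s4 with s5.
GRAPH = {
    "s1": {"s2", "s3"},
    "s2": {"s1"},
    "s3": {"s1"},
    "s4": {"s5"},
    "s5": {"s4"},
}
```

Here `["s1", "s4"]` is a dominating set, so genotyping s1 and s4 would be enough to predict the other three SNPs in this made-up graph.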

24
MDS Problem
  • Here we introduce 3 algorithms:
  • Greedy: approximate solution
  • Iterative Greedy: approximate solution
  • Block Prune Search: optimal solution

25
MDS Greedy
  • Approximate solution
  • ????????(?????????)??????
  • ???????dominating vertex,??????????????LD
    bin,????????????
  • Time complexity: O(n)
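
As a rough illustration of a greedy approach to the dominating set formulation, the sketch below uses the textbook heuristic that repeatedly picks the node dominating the most not-yet-dominated nodes. This is an assumption for illustration and not necessarily the procedure used in this project; in particular, this simple version does not run in O(n). The example graph is hypothetical.

```python
def greedy_dominating_set(graph):
    """Textbook greedy heuristic for a (not necessarily minimum) dominating set.

    `graph` maps each node to the set of its neighbours. Ties are broken
    by sorted node order so that the result is deterministic.
    """
    undominated = set(graph)
    chosen = []
    while undominated:
        # Pick the node whose closed neighbourhood covers the most undominated nodes.
        best = max(
            sorted(graph),
            key=lambda v: len(({v} | graph[v]) & undominated),
        )
        chosen.append(best)
        undominated -= {best} | graph[best]
    return chosen


# Hypothetical LD graph, invented for illustration only.
GRAPH = {
    "s1": {"s2", "s3"},
    "s2": {"s1"},
    "s3": {"s1"},
    "s4": {"s5"},
    "s5": {"s4"},
}
```

Each chosen node acts as a tag SNP, and the nodes it newly dominates form its LD bin.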

26
MDS Iterative Greedy
  • ?Greedy??,??????dominating vertex?????????????????
    ??
  • ???????????dominating vertex,???????????????LD
    bin,??????????????
  • ???????????,???????
  • ????dominating vertex???????????
  • Time complexity: O(n + E + n log(n)) in the
    worst case; O(E + (E / n) log(E / n)) in the
    average case.

27
MDS Block Prune Search
  • ????????connected-components,???block(??block?????
    ??????????????)
  • ?????????????
  • Prune Search ??????,??????????????????,??????????
    ????,????????????
  • MDS????????????????????dominate??????,??prune?lo
    wer bound
  • ?lower bound??????G'(V', E')??????,????????????
    ??????(a1,a2,a3,...,aV'),?????? k ?? k
    a1a2a3...ak gt V'
  • ?? (???? k) gt ????????,????????????

28
Experiment
  • Program: C++ (STL)
  • Input: haplotype data from
    http://genome.perlegen.com/browser/download.html
    (24 human chromosomes)
  • Output: tag SNPs, the smallest number of LD bins
    found, and some statistics
