Title: Linkage Disequilibrium Bins
1Linkage Disequilibrium Bins
Bioinformatics and Computational Molecular Biology (Fall 2005) Final Project
An Algorithm for Tag SNP Selection from SNP Database
Working directory and program files: http://bunny.idv.tw/sbb/work/bcc
Detailed experimental report: http://www.bunny.idv.tw/sbb/work/bcc/report.html
R94922059 ???(sbb)
2Overview
- Basic concepts
- SNP
- Tag SNP
- Introduction to "Tag SNP Selection"
- Algorithm
- Linkage Disequilibrium
- Minimum Dominating Set Problem
- Applications
- Experiment result
- Summary
3Basic concept - SNP
- Single Nucleotide Polymorphisms
- A genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and kept through heredity.
- The most frequent form among various genetic variations.
- The preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.
4Basic concept - SNP
[Figure: observed genetic variations descending from a common ancestor]
5Basic concept - Haplotypes
- A haplotype stands for a set of linked SNPs on the same chromosome.
- A haplotype can be simply considered as a binary string since each SNP is binary (major or minor allele).
6Tag SNP Selection
- Tag SNP
- A small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs.
- Within a haplotype block, its tag SNPs are sufficient to distinguish each pair of haplotype patterns in the block.
7Tag SNP
- Suppose we wish to distinguish an unknown haplotype sample.
- We can genotype all SNPs to identify the haplotype sample.
[Figure: haplotype patterns P1-P4 and an unknown haplotype sample across SNP loci S1-S12; the legend marks major and minor alleles]
8Tag SNP - Example
- In fact, it is not necessary to genotype all SNPs.
- SNPs S3, S4, and S5 can form a set of tag SNPs.
[Figure: haplotype patterns P1-P4 across SNP loci S1-S12, with S3, S4, and S5 extracted into a separate table for P1-P4]
9Tag SNP - Wrong Example
- SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 will be ambiguous.
[Figure: haplotype patterns P1-P4 across SNP loci S1-S12, with S1, S2, and S3 extracted into a separate table for P1-P4]
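The two examples above come down to one test: a SNP subset is a valid tag set only if no two haplotype patterns agree on every chosen SNP. A minimal sketch of that test follows. The pattern values, the six-locus layout, and the function names are hypothetical and chosen for illustration; they are not the data from the original figures.

```python
from itertools import combinations

# Hypothetical haplotype patterns over six SNP loci (S1..S6).
# 0 = major allele, 1 = minor allele. These values are made up for
# illustration; they are not the patterns from the original figures.
PATTERNS = {
    "P1": [0, 0, 0, 0, 0, 1],
    "P2": [1, 0, 1, 0, 1, 1],
    "P3": [0, 1, 1, 1, 0, 0],
    "P4": [0, 0, 0, 1, 1, 0],
}


def ambiguous_pairs(patterns, snp_indices):
    """Return pairs of pattern names that agree on every chosen SNP."""
    pairs = []
    items = sorted(patterns.items())
    for (name_a, hap_a), (name_b, hap_b) in combinations(items, 2):
        if all(hap_a[i] == hap_b[i] for i in snp_indices):
            pairs.append((name_a, name_b))
    return pairs


def is_tag_set(patterns, snp_indices):
    """A SNP subset is a valid tag set if no pair of patterns is ambiguous."""
    return not ambiguous_pairs(patterns, snp_indices)


if __name__ == "__main__":
    print(is_tag_set(PATTERNS, [2, 3, 4]))       # indices of S3, S4, S5
    print(ambiguous_pairs(PATTERNS, [0, 1, 2]))  # indices of S1, S2, S3
```

With this made-up data, S3, S4, and S5 separate all four patterns, while S1, S2, and S3 leave P1 and P4 indistinguishable.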
10Tag SNP Selection
- The problem of finding the minimum set of tag SNPs is known to be NP-hard.
- This problem is the minimum test set problem.
- A number of methods have been proposed to find the minimum set of tag SNPs (Bafna et al.; Zhang et al.).
- In reality, we may fail to obtain some tag SNPs if they do not pass the data-quality threshold.
- In the current genotyping environment, the missing rate of SNPs is around 5% to 10%.
- We proposed two greedy algorithms and one linear programming relaxation algorithm to solve this problem.
- Linkage Disequilibrium!
11Linkage Disequilibrium
- The problem of finding tag SNPs can also be solved from a statistical point of view.
- We can measure the correlation between SNPs and identify sets of highly correlated SNPs.
- Linkage Disequilibrium (LD) is a measure that estimates such correlation between two SNPs.
12Linkage Disequilibrium Bins
- The statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs.
- An LD bin is a set of SNPs such that SNPs within the same bin are highly correlated with each other.
- The value of a single SNP in one LD bin can predict the values of the other SNPs in the same bin.
- These methods try to identify the minimum set of LD bins.
13LD Bins Example (1/3)
- SNP1 and SNP2 can not form an LD bin.
- e.g., A in SNP1 may imply either G or A in SNP2.
14LD Bins Example (2/3)
- SNP1, SNP2, and SNP3 can form an LD bin.
- Any SNP in this bin is sufficient to predict the
values of others.
15LD Bins Example (3/3)
- There are three LD bins, and only three tag SNPs
are required to be genotyped (e.g., SNP1, SNP2,
and SNP4).
16Linkage Disequilibrium
- A, B: major alleles; a, b: minor alleles
- $P_A$: probability of the A allele at SNP1
- $P_a$: probability of the a allele at SNP1
- $P_B$: probability of the B allele at SNP2
- $P_b$: probability of the b allele at SNP2
- $P_{AB}$: probability of the AB haplotype
- $P_{ab}$: probability of the ab haplotype
[Figure: example haplotype with allele A at SNP1 and allele b at SNP2]
17Linkage Disequilibrium
- $P_{AB} \neq P_A P_B$
- $P_{Ab} \neq P_A P_b = P_A (1 - P_B)$
- $P_{aB} \neq P_a P_B = (1 - P_A) P_B$
- $P_{ab} \neq P_a P_b = (1 - P_A)(1 - P_B)$
[Figure: table with axes SNP1 and SNP2]
18Linkage Equilibrium
- $P_{AB} = P_A P_B$
- $P_{Ab} = P_A P_b = P_A (1 - P_B)$
- $P_{aB} = P_a P_B = (1 - P_A) P_B$
- $P_{ab} = P_a P_b = (1 - P_A)(1 - P_B)$
[Figure: table with axes SNP1 and SNP2]
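The disequilibrium and equilibrium slides differ only in whether each haplotype frequency factors into its allele frequencies. A standard way to quantify the gap, given here as a textbook definition rather than something stated on the slides, is the coefficient $D$. Using $P_{Ab} = P_A - P_{AB}$, $P_{aB} = P_B - P_{AB}$, and $P_{ab} = 1 - P_A - P_B + P_{AB}$, all four deviations reduce to the same quantity up to sign:

```latex
\begin{aligned}
D &= P_{AB} - P_A P_B \\
P_{Ab} - P_A P_b &= (P_A - P_{AB}) - P_A (1 - P_B) = -D \\
P_{aB} - P_a P_B &= (P_B - P_{AB}) - (1 - P_A) P_B = -D \\
P_{ab} - P_a P_b &= (1 - P_A - P_B + P_{AB}) - (1 - P_A)(1 - P_B) = D
\end{aligned}
```

Linkage equilibrium is therefore the case $D = 0$, and any nonzero $D$ puts all four haplotype frequencies out of equilibrium at once.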
19Linkage Disequilibrium
- There are many formulas to compute LD between two SNPs, and most of them are normalized to the range -1 to 1 or 0 to 1.
- LD = 1 (perfect positive correlation)
- LD = 0 (no correlation or linkage equilibrium)
- LD = -1 (perfect negative correlation)
- LD = 0.8 (strong positive correlation)
- LD = 0.12 (weak positive correlation)
20Linkage Disequilibrium Formulas
- Mathematical formulas for computing LD
- $r^2$ or $\Delta^2$
- D
- Chi-square Test.
- P value.
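As a rough illustration of the first two measures, the sketch below computes $D$, $|D'|$, and $r^2$ from the AB haplotype frequency and the two major-allele frequencies. It uses the standard textbook definitions ($D = P_{AB} - P_A P_B$, $r^2 = D^2 / (P_A P_a P_B P_b)$, and $D'$ as $D$ scaled by its largest attainable magnitude); the function name and argument names are hypothetical, and nothing here is taken from the author's program.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic SNPs.

    p_ab: frequency of the AB haplotype (major allele at both SNPs).
    p_a:  frequency of the major allele A at SNP1.
    p_b:  frequency of the major allele B at SNP2.
    Both SNPs are assumed polymorphic (frequencies strictly between 0 and 1).
    """
    q_a = 1.0 - p_a  # frequency of minor allele a
    q_b = 1.0 - p_b  # frequency of minor allele b
    d = p_ab - p_a * p_b

    # |D'| rescales |D| by the largest magnitude D could reach
    # given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * q_b, q_a * p_b)
    else:
        d_max = min(p_a * p_b, q_a * q_b)
    d_prime_abs = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two SNPs.
    r2 = d * d / (p_a * q_a * p_b * q_b)
    return d, d_prime_abs, r2


if __name__ == "__main__":
    print(ld_measures(0.5, 0.5, 0.5))   # A always travels with B
    print(ld_measures(0.25, 0.5, 0.5))  # haplotype frequency factors exactly
```

The first call describes two SNPs whose alleles always co-occur, so $r^2$ reaches its maximum; the second is a linkage-equilibrium case where $D$ is zero.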
21Minimum Clique Cover Problem
- This problem asks for a minimum set of LD bins.
- The minimum LD value required between two SNPs in one bin is usually set to 0.8.
- This problem is known to be the minimum clique cover problem (Chao, K.-M., 2005).
- Consider each SNP as a node in a graph.
- There exists an edge between two nodes iff the LD of these two SNPs is $\geq 0.8$.
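The graph construction can be sketched as follows, taking the edge rule to be "LD at least 0.8" in line with the "minimum LD value" wording. The pairwise values, SNP names, and function names are hypothetical. The clique check shows why a valid bin in this formulation needs every pair of its SNPs to be joined by an edge.

```python
from itertools import combinations


def build_ld_graph(snps, r2, threshold=0.8):
    """Return an adjacency dict with an edge wherever LD >= threshold.

    r2 maps frozenset({snp_i, snp_j}) to a pairwise LD value; pairs that
    are missing from the mapping are treated as having no LD.
    """
    graph = {s: set() for s in snps}
    for u, v in combinations(snps, 2):
        if r2.get(frozenset((u, v)), 0.0) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def is_clique(graph, nodes):
    """True if every pair of the given nodes is joined by an edge."""
    return all(v in graph[u] for u, v in combinations(nodes, 2))


# Hypothetical pairwise LD values, made up for illustration.
SNPS = ["SNP1", "SNP2", "SNP3", "SNP4"]
R2 = {
    frozenset(("SNP1", "SNP2")): 0.95,
    frozenset(("SNP2", "SNP3")): 0.85,
    frozenset(("SNP1", "SNP3")): 0.40,
    frozenset(("SNP3", "SNP4")): 0.10,
}
LD_GRAPH = build_ld_graph(SNPS, R2)

if __name__ == "__main__":
    print(is_clique(LD_GRAPH, ["SNP1", "SNP2"]))
    print(is_clique(LD_GRAPH, ["SNP1", "SNP2", "SNP3"]))
```

In this toy graph SNP1 and SNP3 are each strongly linked to SNP2 but not to each other, so the three cannot share one bin under the clique rule.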
22Relaxation of This Problem
- The minimum clique cover problem is hard to approximate.
- The relaxed problem asks for a minimum set of LD bins such that at least one SNP in each LD bin has $r^2 \geq 0.8$ with every other SNP in the same bin.
- The relaxed problem is known to be the minimum dominating set problem.
- The minimum dominating set problem is still NP-hard but is easier to approximate.
23Minimum Dominating Set Problem
- Given a graph G(V, E), the minimum dominating set C is the smallest set of nodes such that every node in V is either in C or connected by an edge to a node in C.
- Consider each node as a SNP and each edge as strong LD ($r^2 \geq 0.8$) between two SNPs.
- The minimum dominating set of this graph is the set of tag SNPs.
- We can use this set of SNPs alone to predict the other SNPs.
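The dominating-set condition is easy to check directly: every SNP must either be a tag SNP or be in strong LD with one. The sketch below does this for a small hypothetical LD graph; the graph and the function name are assumptions for illustration.

```python
def is_dominating_set(graph, chosen):
    """True if every node is in `chosen` or adjacent to a node in `chosen`."""
    chosen = set(chosen)
    return all(v in chosen or bool(graph[v] & chosen) for v in graph)


# Hypothetical LD graph: SNP2 is in strong LD with SNP1 and SNP3,
# and SNP4 is not in strong LD with any other SNP.
GRAPH = {
    "SNP1": {"SNP2"},
    "SNP2": {"SNP1", "SNP3"},
    "SNP3": {"SNP2"},
    "SNP4": set(),
}

if __name__ == "__main__":
    print(is_dominating_set(GRAPH, {"SNP2", "SNP4"}))
    print(is_dominating_set(GRAPH, {"SNP2"}))
```

An isolated SNP such as SNP4 has no neighbour that could predict it, so it must be a tag SNP itself.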
24MDS Problem
- Here we introduce three algorithms:
- Greedy
- Approximate solution
- Iterative Greedy
- Approximate solution
- Block Prune Search
- Optimal solution
25MDS Greedy
- Approximate solution
- ????????(?????????)??????
- ???????dominating vertex,??????????????LD bin,????????????
- Time Complexity O(n)
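For orientation, here is a minimal sketch of one common greedy heuristic for the dominating set problem. It may differ from the variant described on this slide, and it does not run in the O(n) time quoted above. At each step it picks the SNP that covers the most still-uncovered SNPs and groups those into one LD bin. The graph and all names are hypothetical.

```python
def greedy_ld_bins(graph):
    """Greedy dominating set: returns {tag SNP: sorted members of its bin}."""
    uncovered = set(graph)
    bins = {}
    while uncovered:
        # A node covers itself and its neighbours. Ties are broken by
        # name so that the result is deterministic.
        best = min(
            graph,
            key=lambda v: (-len((graph[v] | {v}) & uncovered), v),
        )
        newly_covered = (graph[best] | {best}) & uncovered
        bins[best] = sorted(newly_covered)
        uncovered -= newly_covered
    return bins


# Hypothetical LD graph, made up for illustration.
GRAPH = {
    "SNP1": {"SNP2", "SNP3"},
    "SNP2": {"SNP1"},
    "SNP3": {"SNP1"},
    "SNP4": {"SNP5"},
    "SNP5": {"SNP4"},
    "SNP6": set(),
}

if __name__ == "__main__":
    for tag, members in greedy_ld_bins(GRAPH).items():
        print(tag, members)
```

Each key of the returned dictionary is a tag SNP, and its value lists the SNPs that tag is used to predict.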
26MDS Iterative Greedy
- ?Greedy??,??????dominating vertex???????????????????
- ???????????dominating vertex,???????????????LD bin,??????????????
- ???????????,???????
- ????dominating vertex???????????
- Time Complexity: O(n E n log(n)) in the worst case; O(E (E / n) log(E / n)) in the average case.
27MDS Block Prune Search
- ????????connected-components,???block(??block???????????????????)
- ?????????????
- Prune Search ??????,??????????????????,??????????????,????????????
- MDS????????????????????dominate??????,??prune?lower bound
- ?lower bound??????G'(V', E')??????,??????????????????(a1,a2,a3,...,aV'),?????? k ?? k a1a2a3...ak > V'
- ?? (???? k) > ????????,????????????
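The surviving fragments (a sorted sequence a1, a2, ..., a smallest k, and a comparison against V') appear to describe a degree-based lower bound for pruning. The sketch below shows a standard bound of that kind, under the assumption that a_i is the number of nodes vertex i can dominate (its degree plus one) and that the comparison is "at least |V'|". It is an interpretation, not a confirmed reconstruction of the author's method, and all names are hypothetical.

```python
def dominating_set_lower_bound(graph):
    """Smallest k whose k largest closed neighbourhoods could cover all nodes.

    No dominating set of the graph can be smaller than this k, so a
    search branch whose size would exceed the best solution found so far
    can be pruned.
    """
    n = len(graph)
    # a_i = degree + 1: the nodes vertex i can dominate, itself included.
    cover_sizes = sorted((len(nbrs) + 1 for nbrs in graph.values()), reverse=True)
    covered, k = 0, 0
    while covered < n:
        covered += cover_sizes[k]
        k += 1
    return k


# Hypothetical LD graph, made up for illustration.
GRAPH = {
    "SNP1": {"SNP2", "SNP3"},
    "SNP2": {"SNP1"},
    "SNP3": {"SNP1"},
    "SNP4": {"SNP5"},
    "SNP5": {"SNP4"},
    "SNP6": set(),
}

if __name__ == "__main__":
    print(dominating_set_lower_bound(GRAPH))
```

When a heuristic solution already matches this bound, it is optimal and the search for that block can stop.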
28Experiment
- Program
- C++ STL
- Input
- Haplotype data from http://genome.perlegen.com/browser/download.html
- 24 Human chromosomes
- Output
- Tag SNPs
- Smallest number of LD bins found
- Some statistics