Title: CrossParallel Likelihood between any 2 SNPs:
Family Trios Phasing and Missing Data Recovery
Dumitru Brinza, Jingwu He, Weidong Mao and Alexander Zelikovsky
Computer Science Department
Diploid organisms, haplotypes, genotypes and SNPs
New Phasing Method 2-SNP Statistics (2SNP)
Greedy method for Trio Phasing
- Proposed by Halperin et al. in "Perfect phylogeny and haplotype assignment" (2004)
- For each trio we introduce four partial haplotypes with SNP values 0, 1 and ?
- The algorithm iteratively finds the complete haplotype that covers the maximum possible number of partial haplotypes, removes this set of resolved partial haplotypes, and continues in that manner
- The drawback of this method is that it introduces errors into the trio constraints
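The covering loop of the greedy method can be sketched in Python. This is only an illustration under stated assumptions, not the published implementation: the names `covers` and `greedy_cover` are hypothetical, candidate haplotypes are enumerated by brute force over all 2^m strings (feasible only for very short haplotypes), ties are broken by enumeration order, and trio constraints are ignored, which is where the drawback noted in the list comes from.

```python
from itertools import product


def covers(complete, partial):
    """True if the complete 0/1 haplotype agrees with every known site of the partial one."""
    return all(p == "?" or p == c for c, p in zip(complete, partial))


def greedy_cover(partials):
    """Repeatedly pick the complete haplotype covering the most remaining partial haplotypes."""
    m = len(partials[0])
    remaining = list(partials)
    chosen = []
    while remaining:
        # Assumption: brute-force candidate enumeration, for illustration only.
        candidates = ("".join(bits) for bits in product("01", repeat=m))
        best = max(candidates, key=lambda h: sum(covers(h, p) for p in remaining))
        chosen.append(best)
        # Remove the partial haplotypes that are now resolved.
        remaining = [p for p in remaining if not covers(best, p)]
    return chosen


print(greedy_cover(["0?1", "?01", "1?0", "11?"]))  # ['001', '110']
```

Every partial haplotype is covered by at least one complete haplotype, so each iteration removes at least one element and the loop terminates.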
- Diploid: two haplotypes (different copies of each chromosome)
- SNP: single nucleotide site where two or more different nucleotides occur in a large percentage of the population
  - 0: wild type/major (frequency) allele
  - 1: mutation/minor (frequency) allele
- Haplotype: description of a single copy
  - Example: 00110101 (0 is for major, 1 is for minor allele)
- Genotype: description of the mixed two copies
  - Example: 01122110 (0 = 00, 1 = 11, 2 = 01)
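The 0/1/2 genotype encoding can be illustrated with a short Python sketch. The function name `genotype` is hypothetical; the haplotype pairs below are just two of the pairs consistent with the example genotype 01122110, which also shows why phasing is ambiguous.

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotypes into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles give a homozygous site (0 or 1); different alleles give 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Two different haplotype pairs that produce the same genotype.
print(genotype("01101110", "01110110"))  # 01122110
print(genotype("01100110", "01111110"))  # 01122110
```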
Phasing Genotype Graph Coloring
- Genotype graph for genotype g is a complete graph G(g) where
  - Vertices: heterozygous SNPs in g
  - (i,j)-edge weight w(i,j): cross/parallel likelihood of phasing
- Phasing of 2 heterozygous SNPs
  - Parallel edge: 22 is phased as 00 and 11
  - Cross edge: 22 is phased as 01 and 10
- Graph coloring
  - Color all vertices in two colors such that any 2 vertices connected by a parallel edge have the same color, and any 2 vertices connected by a cross edge have opposite colors
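The coloring rule can be sketched in Python as follows. This is a minimal sketch under assumptions: the function name `phase_by_coloring` and the edge format `(i, j, kind)` are hypothetical, and the edges are assumed to form a tree (or forest) over the heterozygous positions, so the two-coloring can never run into a conflict.

```python
from collections import deque


def phase_by_coloring(g, tree_edges):
    """Phase genotype g (string over 0/1/2) from parallel/cross tree edges."""
    het = [i for i, c in enumerate(g) if c == "2"]
    adj = {i: [] for i in het}
    for i, j, kind in tree_edges:
        adj[i].append((j, kind))
        adj[j].append((i, kind))
    color = {}
    for start in het:
        if start in color:
            continue
        color[start] = 0  # arbitrary color for the first vertex of a component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, kind in adj[u]:
                if v not in color:
                    # Parallel edge: same color. Cross edge: opposite color.
                    color[v] = color[u] if kind == "parallel" else 1 - color[u]
                    queue.append(v)
    # Color c puts allele c on the first haplotype and 1 - c on the second.
    h1 = "".join(str(color[i]) if c == "2" else c for i, c in enumerate(g))
    h2 = "".join(str(1 - color[i]) if c == "2" else c for i, c in enumerate(g))
    return h1, h2


print(phase_by_coloring("2122", [(0, 2, "parallel"), (2, 3, "cross")]))
# ('0101', '1110')
```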
Integer Linear Program for PPTP
- For each trio we introduce four template haplotypes over 0, 1, 2, ?
  - 0, 1 correspond to fully resolved haplotypes, 2 comes in SNPs corresponding to the genotype 2s, ? marks unconstrained SNPs
- Variables
  - for each possible haplotype i, x_i ∈ {0, 1}
  - for each heterozygous SNP j in each template, y_j ∈ {0, 1}
Cross/Parallel Phasing Likelihood
- Cross/Parallel likelihood between any 2 SNPs i and j
  - Fcross: total number of 01 and 10 haplotypes in SNPs i and j
  - Fparallel: total number of 00 and 11 haplotypes in SNPs i and j
  - Fexp_cross: expected number of 01 and 10 haplotypes in SNPs i and j based on single-SNP frequencies
  - Fexp_parallel: expected number of 00 and 11 haplotypes in SNPs i and j based on single-SNP frequencies
- Fcross and Fparallel are adjusted assuming that any two pairs of haplotypes form a genotype with the same probability
- Positive value means cross, negative value means parallel
- Larger absolute value gives more confidence
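The four quantities can be computed as in the Python sketch below. Several things here are assumptions: the counts are taken from a list of already known haplotypes (the method itself starts from genotypes), the function names are hypothetical, and `illustrative_weight` is a stand-in score rather than the formula used by 2SNP. It is a log ratio of observed to expected counts with a pseudocount, chosen only because it is positive for cross and negative for parallel and grows in absolute value with stronger evidence.

```python
import math


def pair_counts(haplotypes, i, j):
    """Return (Fcross, Fparallel, Fexp_cross, Fexp_parallel) for SNPs i and j."""
    n = len(haplotypes)
    f_cross = sum(1 for h in haplotypes if h[i] != h[j])  # 01 and 10
    f_parallel = n - f_cross                              # 00 and 11
    p_i = sum(h[i] == "1" for h in haplotypes) / n        # single-SNP frequencies
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    exp_cross = n * (p_i * (1 - p_j) + (1 - p_i) * p_j)
    exp_parallel = n - exp_cross
    return f_cross, f_parallel, exp_cross, exp_parallel


def illustrative_weight(f_cross, f_parallel, exp_cross, exp_parallel, pseudo=0.5):
    """Stand-in score (an assumption): > 0 suggests cross, < 0 suggests parallel."""
    cross = math.log((f_cross + pseudo) / (exp_cross + pseudo))
    parallel = math.log((f_parallel + pseudo) / (exp_parallel + pseudo))
    return cross - parallel


counts = pair_counts(["00", "00", "11", "11"], 0, 1)
print(counts)                        # (0, 4, 2.0, 2.0)
print(illustrative_weight(*counts))  # negative, i.e. parallel
```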
Inferring Haplotypes (Phasing) from Population
Genotypes
- Haplotypes increase the power of an association between marker loci and phenotypic traits
- Genotype phasing is the resolution of a genotype into the two haplotypes
- Physical phasing is too expensive
  - Soon a single (Affy) chip will allow finding all (about 10^7) SNPs of a single human
- Computational phasing (inferring) is much cheaper
  - Statistical methods (Phase, Phamily, PL, HAPLOTYPER, SNPHAP, GERBIL)
  - Combinatorial methods (Parsimony, HAPINF, Perfect Phylogeny, HAP)
Family Trio Phasing without Recombinations
[Diagram labels: phasings of trio data as unrelated individuals (GERBIL, PHASE, 2-SNP Statistics (2SNP)); projections to the closest phasings w/o recombinations; trio-phasings w/o recombinations]
Family Trio Phasing
- Frequently genotype data consist of family trios: two parents and one offspring
- Trio data allow correct recovery of offspring haplotypes in almost all cases
- Example of unique phasing
- Unique example of ambiguous phasing
Missing Data Recovery Problem
- Collect statistics on haplotype/genotype frequencies for any 2 SNPs
- For each 2 SNPs compute weights reflecting the likelihood of cross/parallel
- For each genotype g
  - Find the Maximum Spanning Tree (MST) for the genotype graph G(g)
  - Color G(g) vertices and phase based on colors
- Missing data recovery
  - Recover each missing site (?s) based on the closest haplotype with the phased site (Zhang et al.)
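The spanning-tree step can be sketched with Kruskal's algorithm in Python. This is a sketch under assumptions: the function name `max_spanning_tree` and the input format are hypothetical, and edges are ranked by the absolute value of their weight, on the reading that a larger absolute value means more confidence. The sign of each kept edge then decides whether it is treated as cross (positive) or parallel (negative).

```python
def max_spanning_tree(vertices, weights):
    """Kruskal's algorithm on |w|; weights maps (i, j) -> signed weight w(i, j)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Most confident edges first (assumption: confidence = absolute weight).
    for (i, j), w in sorted(weights.items(), key=lambda e: abs(e[1]), reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it creates no cycle
            parent[ri] = rj
            tree.append((i, j, "cross" if w > 0 else "parallel"))
    return tree


# Heterozygous SNPs 0, 2 and 3 of some genotype, with made-up weights.
print(max_spanning_tree([0, 2, 3], {(0, 2): -3.0, (2, 3): 2.0, (0, 3): 0.5}))
# [(0, 2, 'parallel'), (2, 3, 'cross')]
```

The least confident edge (0, 3) is dropped because it would close a cycle, so the remaining tree can always be two-colored consistently.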
[Diagram label: pure parsimonious trio-phasing w/o recombinations]
- Real data often miss some SNPs
  - Daly et al. data (Crohn's disease): 10-16%
  - Gabriel et al. data (HapMap): 7-10%
- How to reconstruct missing values, and how to verify the reconstruction method?
  - Scramble an extra 10% and reconstruct them
  - Karp-Halperin (2004) have an error rate of 2.8%
- Parental haplotypes may recombine, which makes it impossible to recover parental haplotypes
  - ASSUMPTION: no recombinations in parents
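The erase-and-reconstruct check can be sketched in Python as follows. Everything named here is hypothetical (`erase_sites`, `recover_closest`, `recovery_error`), and the recovery rule is only one simple reading of "closest haplotype": a missing site is copied from the haplotype that has that site known and disagrees with the current one on the fewest sites known in both.

```python
import random


def erase_sites(rows, fraction, rng):
    """Replace a fraction of the known entries with '?'; return new rows and erased positions."""
    known = [(r, c) for r, row in enumerate(rows) for c, a in enumerate(row) if a != "?"]
    erased = rng.sample(known, round(fraction * len(known)))
    cells = [list(row) for row in rows]
    for r, c in erased:
        cells[r][c] = "?"
    return ["".join(row) for row in cells], erased


def recover_closest(haplotypes):
    """Fill each '?' from the closest haplotype that has the site known (assumed rule)."""
    out = []
    for r, h in enumerate(haplotypes):
        row = list(h)
        for c, a in enumerate(h):
            if a != "?":
                continue
            best, best_dist = None, None
            for s, other in enumerate(haplotypes):
                if s == r or other[c] == "?":
                    continue
                # Distance: mismatches over sites known in both haplotypes.
                dist = sum(x != y for x, y in zip(h, other) if x != "?" and y != "?")
                if best_dist is None or dist < best_dist:
                    best, best_dist = other, dist
            if best is not None:
                row[c] = best[c]
        out.append("".join(row))
    return out


def recovery_error(truth, recovered, erased):
    """Fraction of erased entries that were reconstructed incorrectly."""
    wrong = sum(truth[r][c] != recovered[r][c] for r, c in erased)
    return wrong / len(erased) if erased else 0.0


truth = ["00010010000", "00010010000", "00110001001", "00110001001"]
masked, erased = erase_sites(truth, 0.10, random.Random(0))
print(recovery_error(truth, recover_closest(masked), erased))
```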
[Flowchart labels: input unrelated genotypes, statistics collection, genotype graph MST, graph coloring, phasing, missing data recovery]
Family Trio Phasing Validation
- Phasing methods can be validated on simulated data (haplotypes are known)
- The validation on real data is usually performed on the trio data
- Trio validation cannot be applied since a trio phasing method may rely on both offspring and parent genotypes
- Validation by erasing a randomly chosen SNP
[Figure: validation by erasing a SNP. Genotype 0?210022002 with an erased site (?), its phasing 0?010010000 / 0?110001001, and haplotypes s1 to s4, including 00110101000, 00010010000 and 00110001001]
Our contributions
- Formulating the Pure-Parsimony Trio Phasing Problem (PPTP) and the Trio Missing Data Recovery Problem (TMDRP)
- Two new methods, one greedy and one based on integer linear programming (ILP), for solving PPTP and TMDRP
- New 2-SNP Statistics (2SNP) phasing method for unrelated individuals
- Extensive experimental validation of the proposed methods and comparison with previously known methods
Data Sets
- Daly et al. data (Crohn's disease), derived from the 616 kilobase region of human Chromosome 5q31
  - 129 family trios of genotypes with 103 SNPs
- Gabriel et al. data (HapMap), consisting of SNPs from 62 regions of the human genome
  - 29 family trios of genotypes with about 3000 SNPs
- Using the MS simulator we simulate
  - Daly et al. data by generating 258 populations, each population with 100 individuals and each haplotype with 103 SNPs, then randomly choosing one haplotype from each population. We only simulate the parents' haplotypes, then we obtain family trio haplotypes and genotypes by randomly matching the parental haplotypes.
Unrelated Individuals Phasing Validation
- Phasing methods can be validated on simulated data (haplotypes are known)
- The validation on real data is usually performed on the trio data
  - Offspring haplotypes are mostly known (inferred from the parents' haplotypes)
- Error types
  - Single-site error
    - Number of SNPs in offspring phased haplotypes which differ from the SNPs inferred from trio data, divided by (total number of SNPs) x (total number of haplotypes)
  - Individual error
    - Number of incorrectly phased offspring genotypes (those with at least one single-site error), divided by the total number of genotypes
  - Switching error
    - Minimum number of switches which should be done in the pair of haplotypes of an offspring phased genotype such that both haplotypes coincide with the haplotypes inferred from trio data, divided by the total number of heterozygous positions in offspring genotypes
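The three error measures can be sketched in Python as follows. The function names are hypothetical, and two points are assumptions: the order of the two haplotypes within an individual is treated as arbitrary (the better of the two pairings is used), and the individual error counts genotypes with at least one single-site error. The predicted and reference pairs are assumed to explain the same genotype.

```python
def site_mismatches(pred, true):
    """Mismatched sites between a predicted and a reference haplotype pair."""
    (p1, p2), (t1, t2) = pred, true
    direct = sum(a != b for a, b in zip(p1 + p2, t1 + t2))
    swapped = sum(a != b for a, b in zip(p1 + p2, t2 + t1))
    return min(direct, swapped)  # haplotype order within a pair is arbitrary


def single_site_error(preds, trues):
    wrong = sum(site_mismatches(p, t) for p, t in zip(preds, trues))
    total = sum(2 * len(t[0]) for t in trues)  # SNPs x haplotypes
    return wrong / total


def individual_error(preds, trues):
    bad = sum(site_mismatches(p, t) > 0 for p, t in zip(preds, trues))
    return bad / len(trues)


def switching_error(preds, trues):
    switches = het_total = 0
    for (p1, p2), (t1, _t2) in zip(preds, trues):
        het = [i for i in range(len(p1)) if p1[i] != p2[i]]
        # At each heterozygous site, does predicted haplotype 1 follow reference haplotype 1?
        follows = [p1[i] == t1[i] for i in het]
        switches += sum(a != b for a, b in zip(follows, follows[1:]))
        het_total += len(het)
    return switches / het_total if het_total else 0.0


preds = [("0110", "1001")]
trues = [("0101", "1010")]
print(single_site_error(preds, trues))  # 0.5
print(individual_error(preds, trues))   # 1.0
print(switching_error(preds, trues))    # 0.25
```

In the example a single switch after the second site turns the predicted pair into the reference pair, which is why the switching error is much smaller than the single-site error.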
Previous work
- PHASE: Bayesian statistical method (Stephens et al., 2001, 2003)
- HAPLOTYPER: proposed a Monte Carlo approach (Niu et al., 2002)
- Phamily: phases the trio families based on PHASE (Ackerman et al., 2003)
- Greedy method for phasing and missing data recovery (Halperin and Karp, 2004)
- GERBIL: statistical method using maximum likelihood (ML), MST and expectation-maximization (EM) (Kimmel and Shamir, 2005)
- SNPHAP: uses ML/EM assuming Hardy-Weinberg equilibrium (Clayton et al., 2004)
Experimental Results
- The results for five phasing methods on the real data sets of Daly et al. and Gabriel et al. and on simulated data. The second column corresponds to the ratio of erased data. C corresponds to the error of offspring, P to the error of parents, and T to the total error.
2SNP Results, Comparison with other Phasing Methods
- Phasing methods
  - PHASE, GERBIL, HAPLOTYPER, 2SNP
- Reported errors
  - Single-site error, individual error, switching error
  - Errors are reported in percent
- Running time
  - Time is reported in hours (h), minutes (m), seconds (s)
  - Running time is not stable (the average is reported)
- Data sets
  - Daly et al. offspring data, 129 genotypes with 103 SNPs
  - Gabriel et al. offspring data, 29 genotypes in 61 blocks with 50 SNPs on average
  - Forton et al., 128 genotypes with 89 SNPs obtained by random matching of 256 randomly chosen real haplotypes
  - MS simulated data with recombination ratio (0, 4, 16), 100 genotypes and 103 SNPs
Problem Formulation: Family Trio Phasing w/o Recombinations (TPP)
- Given a set of family trios of genotypes, each with m sites corresponding to m SNPs
  - 0: homozygote with major allele, 1: homozygote with minor allele, 2: heterozygote, ?: missing SNP value
- Find for each trio four haplotypes h1, h2, h3, h4, each with m 0-1 sites, such that
  - h1 and h2 explain the father's genotype, h3 and h4 explain the mother's genotype, h1 and h3 explain the offspring's genotype
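The trio constraint can be checked with a few lines of Python. The names `explains` and `is_trio_phasing` are hypothetical, and a missing genotype value (?) is treated as compatible with any pair of alleles, which is an assumption consistent with it being unconstrained.

```python
def explains(h_a, h_b, g):
    """True if haplotypes h_a and h_b are consistent with genotype g (0/1/2/?)."""
    if not (len(h_a) == len(h_b) == len(g)):
        raise ValueError("haplotypes and genotype must have the same length")
    for a, b, s in zip(h_a, h_b, g):
        if s == "?":
            continue              # missing value: no constraint (assumption)
        if s == "2":
            if a == b:
                return False      # heterozygous site needs two different alleles
        elif not (a == b == s):
            return False          # homozygous site needs both alleles equal to s
    return True


def is_trio_phasing(h1, h2, h3, h4, father, mother, child):
    """h1, h2 explain the father; h3, h4 the mother; h1, h3 the offspring."""
    return (explains(h1, h2, father)
            and explains(h3, h4, mother)
            and explains(h1, h3, child))


print(is_trio_phasing("00", "01", "01", "11", "02", "21", "02"))  # True
# Swapping the father's haplotypes still explains the father but not the offspring.
print(is_trio_phasing("01", "00", "01", "11", "02", "21", "02"))  # False
```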
Missing Data Recovery in Family Trios
- The results for missing data recovery on the real and simulated data sets with five methods. The second column corresponds to the ratio of erased data. C corresponds to the error of offspring, P to the error of parents, and T to the total error.
Pure-Parsimony Trio Phasing (PPTP)
- It is easy to find a feasible solution to TPP (there is an exponential number of feasible solutions)
- We pursue a parsimonious objective, i.e., minimization of the total number of haplotypes
- The drawback of PP is that when the number of SNPs becomes large (as well as the number of recombinations), the quality of pure parsimony phasing diminishes
  - Partition the genotypes into blocks
  - In the case of trio data we do not have the joining blocks problem
- Pure-Parsimony Trio Phasing (PPTP): given 3n genotypes corresponding to n family trios, find the minimum number of distinct haplotypes explaining all trios
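The parsimony objective itself is simple to state in code. In the Python sketch below, `distinct_haplotypes` is a hypothetical helper that evaluates the objective for a given set of trio phasings; it does not search for the minimum, which is the hard part of PPTP.

```python
def distinct_haplotypes(trio_phasings):
    """Objective value: number of distinct haplotypes used across all trios.

    trio_phasings is a list of (h1, h2, h3, h4) tuples, one per trio.
    """
    return len({h for trio in trio_phasings for h in trio})


# Two trios whose phasings reuse the same three haplotypes.
phasings = [("00", "01", "01", "11"), ("00", "11", "01", "01")]
print(distinct_haplotypes(phasings))  # 3
```

Among all feasible phasings of the same trios, the pure-parsimony solution is one for which this count is smallest.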