Title: Which SNP genotyping errors are most costly and when
1Which SNP genotyping errors are most costly and
when?
- Stephen J. Finch
- Stony Brook University
2Acknowledgments
- Joint work
- Derek Gordon (Rockefeller University)
- Sun Jung Kang (Duke University)
- Five papers are the material for this talk, with
additional coauthors:
- Michael Nothnagel and Jurg Ott in paper 1
- Mark Levenstien and Jurg Ott in paper 2
- Abe Brown and Jurg Ott in paper 4
3Acknowledgments
- Colleagues
- Nancy Mendell
- Kenny Ye
- Stony Brook students (work in progress)
- Nathan Tintle (repeated sampling)
- Qing Wang (LRT for mixtures)
- Kwangmi Ahn, Rose Saint Fleur
- Undergraduates: Alex Borress, Josh Ren, Jelani
Wiltshire
4First Paper
- Gordon, D., Finch, S.J., Nothnagel, M., Ott, J.
(2002). Power and sample size calculations for
case-control genetic association tests when
errors are present: application to single
nucleotide polymorphisms. Human Heredity, 54,
22-33.
5Second Paper
- Gordon D., Levenstien M.A., Finch S.J., and Ott
J. (2003). Errors and linkage disequilibrium
interact multiplicatively when computing sample
sizes for genetic case-control association
studies. Pacific Symposium on Biocomputing
490-501.
6Third Paper
- Kang, S.J., Gordon, D., Finch, S.J. (2004). What
SNP Genotyping Errors Are Most Costly for Genetic
Association Studies. Genetic Epidemiology, 26,
132-141.
7Fourth Paper
- Kang, S.J., Gordon, D., Brown, A.M., Ott, J.,
Finch, S.J. (2004). Tradeoff between No-Call
Reduction in Genotyping Error Rate and Loss of
Sample Size for Genetic Case/Control Association
Studies. Pacific Symposium on Biocomputing
8Fifth Paper
- Kang, S.J., Finch, S.J., Gordon, D. (2004).
Quantifying the cost of SNP genotyping errors in
genetic model based association studies. Human
Heredity, In press.
9PAWE Web Site
- http://linkage.rockefeller.edu/pawe/pawe.cgi
10Review Paper
- Gordon, D., Finch, S.J. (2004). Factors affecting
statistical power to detect genetic association.
Submitted for publication.
11Background
- Definition of SNPs
- SNP genotyping measurements
- Specification of error models
- Tests of association
- Two supplementary measurement approaches
12Definition of SNP
- A gene with two possible alleles (here A and B)
- A is the more common allele in the controls
- Three possible genotypes
- AA, index = 1 (more common homozygote)
- AB, index = 2 (heterozygote)
- BB, index = 3 (less common homozygote)
13Measure of Cost
- The percentage increase in the minimum sample
size necessary to maintain constant Type I and
Type II error rates associated with an increase
of 1% in a genotyping error rate is our measure
of the cost of a genotyping error.
- MSSN is our abbreviation for this measure.
14SNP Genotyping Measurements
- Two dye intensities are measured: R and G.
- Measurements are typically taken at two or three
time points.
- Ratio $F = R/(R+G)$ is used to classify into
genotypes.
- Genotyping error: an event in which an observed
genotype is different from the true genotype.
15SNP Genotyping Measurements (Raw Data)
16Scatterplot of SNP Dye Intensities
17Scatterplot of SNP Fraction by Cycle Time
18Approaches to Replication
- Sutcliffe studied the reclassification of
subjects using the same classification procedure
at all remeasurements.
- Tenenbein studied the reclassification of
subjects using a virtually perfect instrument for
the second reclassification.
19Regenotyping Results
- There is a common perception that genotyping
error is negligible.
- One test is to regenotype a set of data.
- COGA provided such data to last GAW.
- Tintle et al. (2004) analyzed it.
20Regenotyping Results Summed over All SNPs (COGA
GAW Data)
21Observations on Table
- Homozygote to homozygote inconsistencies are
extremely rare.
- CIDR missing rate is 6.7%.
- Affymetrix missing rate is 6.1%.
- Double missing rate is 1.7%, much higher than the
0.4% expected under independence (0.067 x 0.061 is
about 0.004), suggesting some subjects may be
consistently more difficult to genotype.
22Regenotyping Definitions
- Consistency: Two genotypes on a SNP for a
regenotyped subject exist and are the same.
- Nonreplication: One genotype on a SNP for a
regenotyped subject exists, and data are missing
for the other genotype. Note that we treat two
missing genotypes as replicated.
- SNP nonreplication rate: the number of
nonreplications divided by the sum of the number
of replications and the number of
nonreplications.
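The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the example data are hypothetical, and it assumes that any pair with two calls present (whether or not they agree) counts as a replication, which the slide does not state explicitly.

```python
from typing import List, Optional, Tuple

Genotype = Optional[str]  # "AA", "AB", "BB", or None for a missing call


def is_consistent(first: Genotype, second: Genotype) -> bool:
    """Consistency: both genotypes exist and are the same."""
    return first is not None and first == second


def is_nonreplication(first: Genotype, second: Genotype) -> bool:
    """Nonreplication: exactly one of the two genotypes is missing."""
    return (first is None) != (second is None)


def nonreplication_rate(pairs: List[Tuple[Genotype, Genotype]]) -> float:
    """Nonreplications / (replications + nonreplications) for one SNP."""
    nonreplications = sum(is_nonreplication(a, b) for a, b in pairs)
    # Assumption: every other pair (two calls present, or two missing)
    # is counted as a replication.
    replications = len(pairs) - nonreplications
    return nonreplications / (replications + nonreplications)


# Hypothetical regenotyping data for one SNP: (first call, second call).
pairs = [("AA", "AA"), ("AA", "AB"), ("AB", None), (None, None)]
rate = nonreplication_rate(pairs)
```

Here one of the four pairs has exactly one missing call, so `rate` is one quarter.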
24Critical assumptions about errors
- Regardless of the nature of errors, they are random
and independent.
- Error model is the same for cases (affecteds) and
controls (unaffecteds).
25Mote-Anderson Model 1965 Penetrance Table
(most general)
26Simple but Realistic Error Model
- Homozygote to homozygote error rates set to zero
- All other error rates set to equal error rate
27Three Component Normal Mixture
- Given AA, F is normal($-\mu$, 1)
- Given AB, F is normal(0, 1)
- Given BB, F is normal($\mu$, 1)
- Symmetric cutpoints create an error model that
has equal error rates for all errors except
homozygote to homozygote errors.
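A minimal sketch of how symmetric cutpoints turn the three-component mixture into an error model. It assumes the cutpoints sit halfway between adjacent component means, and the separation value 4.0 is a hypothetical choice; neither is taken from the slides.

```python
from math import erf, sqrt


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def call_probabilities(mean: float, cut: float):
    """P(call AA), P(call AB), P(call BB) when F ~ Normal(mean, 1).

    F below -cut is called AA, F above +cut is called BB, otherwise AB.
    """
    p_aa = normal_cdf(-cut - mean)
    p_bb = 1.0 - normal_cdf(cut - mean)
    return (p_aa, 1.0 - p_aa - p_bb, p_bb)


def error_matrix(separation: float, cut: float):
    """Rows: true AA, AB, BB. Columns: observed AA, AB, BB."""
    return [call_probabilities(m, cut) for m in (-separation, 0.0, separation)]


separation = 4.0                       # hypothetical distance between means
matrix = error_matrix(separation, cut=separation / 2.0)
e12, e13 = matrix[0][1], matrix[0][2]  # AA called AB, AA called BB
e21, e23 = matrix[1][0], matrix[1][2]  # AB called AA, AB called BB
```

With these settings the four errors involving the heterozygote are essentially equal, while the homozygote to homozygote rate `e13` is many orders of magnitude smaller.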
28Tests of Association
- Case-control study. The ratio of number of
controls to number of cases is k.
- We use the 2x3 chi-squared test of independence
(simplest non-trivial case).
- Mitra found the noncentrality parameter of the
chi-squared test of association, which is needed
for power and sample size calculations.
- Recommended (Sasieni) test is the test of trend
(Armitage).
29Test Statistic
- Pearson's chi-squared statistic on 2 x 3 tables
- Example Table
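The example table from this slide did not survive in the transcript. As a stand-in, the sketch below computes Pearson's chi-squared statistic on a 2 x 3 table of invented genotype counts; the counts and function names are hypothetical. With 2 degrees of freedom the p-value has the closed form exp(-statistic / 2), so no statistics library is needed.

```python
from math import exp


def pearson_chi_square(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_df2(statistic: float) -> float:
    """Upper-tail probability of a chi-squared variable with 2 df."""
    return exp(-statistic / 2.0)


# Hypothetical counts: rows are cases and controls, columns AA, AB, BB.
counts = [[60, 30, 10],
          [70, 25, 5]]
statistic = pearson_chi_square(counts)
p_value = p_value_df2(statistic)
```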
30Effect of Misclassification Errors on Tests of
Association
- Bross found that level of significance is
unchanged when the same error mechanism affects
cases and controls and that parameter estimates
are biased.
- Mote and Anderson found that the power is reduced
(level of significance constant) when there are
misclassification errors.
31Notation
- Count parameters
- $N_A$: number of cases in the absence of errors
- $N_U$: number of controls in the absence of errors
- $N_A^*$: number of cases in the presence of errors
- $N_U^*$: number of controls in the presence of
errors
32What is needed for asymptotic power calculations?
33Genetic model free parameterization
- Specify the genotype probabilities directly
- Assuming Hardy Weinberg Equilibrium (HWE), all
probabilities specified with two parameters ( p,
q )
34Genetic model free parameterization
- Specify the genotype probabilities directly
- Not assuming HWE, can specify all probabilities
with four parameters
35Genetic Model Specification
- $p_1$ = allele frequency of SNP marker 1 allele
- $p_2$ = allele frequency of SNP marker 2 allele = $1 - p_1$
- $p_d$ = allele frequency of disease locus d allele
- $p_+$ = allele frequency of disease wild-type allele
= $1 - p_d$
36Genetic Model Specification
- $D$ = disequilibrium (non-scaled, as defined in Hartl
and Clark)
- $D_{MAX} = \min(p_1 p_d, p_2 p_+)$
- $D' = D / D_{MAX}$
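The scaling of the disequilibrium coefficient can be sketched as below. The bound uses the two allele-frequency products exactly as they appear on the slide; other sign conventions for D pair the frequencies differently, so treat this as an illustration under that assumption. The function name and the numerical values are hypothetical.

```python
def scaled_disequilibrium(d: float, p1: float, pd: float) -> float:
    """Scaled disequilibrium: D divided by its maximum value.

    p1 is the frequency of the SNP marker 1 allele and pd the frequency
    of the disease allele; the other two frequencies are complements.
    """
    p2 = 1.0 - p1        # SNP marker 2 allele
    p_plus = 1.0 - pd    # disease wild-type allele
    d_max = min(p1 * pd, p2 * p_plus)  # bound as written on the slide
    return d / d_max


# Hypothetical values: D = 0.03 with p1 = 0.3 and pd = 0.2.
example = scaled_disequilibrium(0.03, p1=0.3, pd=0.2)
```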
37Genetic Model Specification (penetrance
parameters)
38Results
- Demonstrate analytic solution of asymptotic
power using standard chi-square test of genotypic
association
39Genotype Frequencies in the Presence of Errors
40Noncentrality Parameter
- We assume $N_U = k N_A$.
- Using Mitra's work (1958),
41Noncentrality Parameter
- Let $\lambda = k N_A g$, where $g$ is the bracketed function for
genotypes measured without error.
- Let $\lambda^* = k N_A^* g^*$, where $g^*$ is the bracketed function
using frequencies for genotypes observed with
error.
42To maintain constant asymptotic power
- We choose $N_A^*$ so that $\lambda^* = \lambda$.
43Paper 1 Findings
- Noncentrality parameter for the 2x3 chi-squared
test of independence from Mitra to describe
asymptotic power.
- Increase in error rate (three error models)
requires a corresponding increase in sample size
to maintain Type I and Type II error rates.
- Regression analysis of increase in MSSN as
function of error rate in a number of published
models.
- Interaction of linkage disequilibrium ($D'$) and
measure of overall error rate (S).
44Paper 2 Findings
- Linkage Disequilibrium (LD) and errors interact
in a non-linear fashion.
- The increase in sample size necessary to maintain
constant asymptotic power and level of
significance as a function of S (sum of error
rates) is smallest when $D' = 1$ (perfect LD).
- The increase grows monotonically as $D'$ decreases
to 0.5 for all studies.
45Paper 3 Method
- Saturated error model (called Mote-Anderson in
PAWE software).
- Taylor series expansion of the ratio of sample
sizes expressed with the non-centrality
parameters.
- The coefficients of each error parameter give the
MSSN for a 1% increase in that error rate.
46Recall the Noncentrality Parameters
- Let $\lambda = k N_A g$, where $g$ is the bracketed function.
- Let $\lambda^* = k N_A^* g^*$, where $g^*$ is the bracketed function
using frequencies for genotypes observed with
error.
- Then, when $\lambda^* = \lambda$ (that is, equal power for both
specifications), $N_A^*/N_A = g/g^*$.
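The ratio of sample sizes can be sketched numerically. The bracketed function itself is missing from the transcript, so the code assumes the commonly used form for two multinomial samples, the sum over genotypes of (p0j - p1j)^2 / (p0j + k p1j), with p0j the case and p1j the control genotype frequencies. The error matrix is the simple model with equal rates and no homozygote to homozygote errors; the genotype frequencies are hypothetical.

```python
def bracketed_g(case_freqs, control_freqs, k=1.0):
    """Assumed form of the bracketed function (noncentrality per k * N_A)."""
    return sum((p0 - p1) ** 2 / (p0 + k * p1)
               for p0, p1 in zip(case_freqs, control_freqs))


def simple_error_matrix(rate):
    """Rows: true AA, AB, BB. Columns: observed AA, AB, BB."""
    return [[1.0 - rate, rate, 0.0],
            [rate, 1.0 - 2.0 * rate, rate],
            [0.0, rate, 1.0 - rate]]


def observed_freqs(true_freqs, matrix):
    """Genotype frequencies after applying the misclassification matrix."""
    return [sum(true_freqs[i] * matrix[i][j] for i in range(3))
            for j in range(3)]


def sample_size_ratio(case_freqs, control_freqs, rate, k=1.0):
    """Cases needed with errors divided by cases needed without errors."""
    matrix = simple_error_matrix(rate)
    g_true = bracketed_g(case_freqs, control_freqs, k)
    g_obs = bracketed_g(observed_freqs(case_freqs, matrix),
                        observed_freqs(control_freqs, matrix), k)
    return g_true / g_obs


# Hypothetical genotype frequencies (AA, AB, BB).
cases = [0.49, 0.42, 0.09]
controls = [0.64, 0.32, 0.04]
ratio = sample_size_ratio(cases, controls, rate=0.01)
```

A ratio above 1 is the extra sample size needed to hold power constant once errors are present.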
47MSSN Function
- $N_A^*/N_A \approx 1 + C_{12}e_{12} + C_{13}e_{13} + C_{21}e_{21} + C_{23}e_{23} + C_{31}e_{31} + C_{32}e_{32}$.
- Suppose $C_{13} = 7$. Then every 1% increase in $e_{13}$
requires a 7% increase in sample size to maintain
constant power.
48MSSN Coefficients
- The MSSN coefficient associated with the error
rate of misclassifying the more common homozygote
as the heterozygote is given by
49MSSN Coefficients
- Similar expressions hold for the other five MSSN
coefficients.
50Example of Sample Size increase in presence of
errors
51Example Error Model Penetrance
52Comparison of Genotype Frequencies
- Columns: without error, with 1% error, with 3% error
53Sample size in presence of errors
- Assume we want 0.80 power at 0.05 level of
significance. Let k = 1.
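A sketch of how a target power translates into a number of cases for the 2 degree of freedom test. It relies on the noncentral chi-squared distribution being a Poisson mixture of central chi-squared distributions, which have a closed-form tail for even degrees of freedom. The function names and the value of g in the example are hypothetical.

```python
from math import ceil, exp, log


def power_df2(noncentrality, alpha=0.05, terms=500):
    """Power of a 2 df chi-squared test for a given noncentrality."""
    half_x = -log(alpha)         # half the critical value: exp(-x/2) = alpha
    half_nc = noncentrality / 2.0
    weight = exp(-half_nc)       # Poisson weight for j = 0
    term = 1.0                   # half_x**j / j!
    partial = 1.0                # sum of half_x**i / i! for i = 0..j
    power = 0.0
    for j in range(terms):
        # Central chi-squared tail with 2 + 2j df is alpha * partial.
        power += weight * alpha * partial
        weight *= half_nc / (j + 1)
        term *= half_x / (j + 1)
        partial += term
    return power


def required_noncentrality(power=0.80, alpha=0.05):
    """Bisection for the noncentrality that gives the target power."""
    low, high = 0.0, 100.0
    for _ in range(100):
        mid = (low + high) / 2.0
        if power_df2(mid, alpha) < power:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def cases_needed(g_value, k=1.0, power=0.80, alpha=0.05):
    """Smallest N_A with k * N_A * g at or above the required value."""
    return ceil(required_noncentrality(power, alpha) / (k * g_value))


noncentrality = required_noncentrality()   # 0.80 power at the 0.05 level
n_cases = cases_needed(g_value=0.05)       # hypothetical g, with k = 1
```

Running the same calculation with the value of g computed from error-prone genotype frequencies gives the larger sample size needed in the presence of errors.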
54Cost coefficients for our example
- Coefficient Type of error
- More common hom to het
- Het to more common hom
- Het to less common hom
- Less common hom to het
55Simplest non-trivial case to develop insights
- Assume HWE, cases and controls
- $p_a$ = 0.2, 0.3, 0.4, 0.5
- $p_u = p_a + d$, $d = 0.01$
- $P_{01} = (1 - p_a)^2$, $P_{02} = 2(1 - p_a) p_a$, $P_{03} = p_a^2$
- $P_{11} = (1 - p_u)^2$, $P_{12} = 2(1 - p_u) p_u$, $P_{13} = p_u^2$
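The cost coefficients for this setup can be approximated numerically. The sketch below replaces the Taylor-series coefficients with a finite-difference slope of the sample size ratio in a single error rate, so it is a stand-in and not the published closed form. It assumes the control frequency is the case frequency plus d, k = 1, and the same form of the bracketed function as in Mitra-style two-sample noncentrality calculations; all names are hypothetical.

```python
def hwe_freqs(p):
    """Genotype frequencies (AA, AB, BB) under HWE; p is the B frequency."""
    return [(1.0 - p) ** 2, 2.0 * p * (1.0 - p), p ** 2]


def bracketed_g(case_freqs, control_freqs, k=1.0):
    """Assumed form of the bracketed function (noncentrality per k * N_A)."""
    return sum((p0 - p1) ** 2 / (p0 + k * p1)
               for p0, p1 in zip(case_freqs, control_freqs))


def with_single_error(freqs, true_idx, obs_idx, rate):
    """Move a fraction `rate` of one true genotype into another (0-based)."""
    observed = list(freqs)
    observed[true_idx] -= rate * freqs[true_idx]
    observed[obs_idx] += rate * freqs[true_idx]
    return observed


def cost_coefficient(p_case, true_idx, obs_idx, d=0.01, k=1.0, step=1e-4):
    """Finite-difference slope of the sample size ratio in one error rate."""
    case, control = hwe_freqs(p_case), hwe_freqs(p_case + d)
    g_true = bracketed_g(case, control, k)
    g_obs = bracketed_g(with_single_error(case, true_idx, obs_idx, step),
                        with_single_error(control, true_idx, obs_idx, step), k)
    return (g_true / g_obs - 1.0) / step


# Coefficients for errors made on the more common homozygote (index 0).
frequencies = (0.05, 0.2, 0.3, 0.4, 0.5)
c12 = {p: cost_coefficient(p, 0, 1) for p in frequencies}  # AA called AB
c13 = {p: cost_coefficient(p, 0, 2) for p in frequencies}  # AA called BB
```

A coefficient of 2 means that a 1% error rate of that type calls for roughly 2% more subjects. Both coefficients grow as the minor allele frequency shrinks.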
56Results for MSSN Coefficients, d = 0.01
- Figure axes: cost; case SNP minor allele frequency
57Conclusion: What happens to MSSN coefficients as
minor SNP allele frequency approaches 0?
- Costly errors are those made on the more common
homozygote.
58Extension to non-HWE generalizing example
- MSSN coefficients $C_{12}$ and $C_{13}$ have infinite
limits.
- Additionally, $C_{23}$ may have infinite limit.
59How to perform calculations in practice?
64Paper 5 Findings
- MSSN coefficients with infinite limit hold when
studying usual genetic models.
- Recessive models can have $C_{23}$ with infinite limit
as minor SNP allele frequency goes to zero.
- Dominant models have a notably different behavior
with fewer MSSN coefficients with infinite
limit. Behavior can be more problematic.
- MSSN coefficients are complex functions that
should be studied on a case-by-case basis.
65Paper 5 Definitions
- Total MSSN is defined to be
66Total MSSN by Freq of Disease Allele and Minor
SNP Allele
67Coefficient of Heterozygote to Less Common
Homozygote Error
68Dominant Model, Total MSSN
69Possible Strategies to Counter Effects of SNP
Genotyping Errors
- Increase sample size to compensate for loss of
power. Use small Type I and Type II error rates
in designing studies. (This works.)
- When a three component normal mixture describes
the measurements that are the basis of
genotyping, use no-call rules to lessen error
rates and reduce consequent cost.
70Possible Strategies to Counter Effects of SNP
Genotyping Errors
- Use the same genotyping classification procedure
and regenotype subjects (Tintle's problem).
- Use a perfect genotyping classification procedure
on some of the subjects (Gordon et al.)
71Increase sample size
- Use PAWE software to identify whether the problem
under consideration has the possibility of large
MSSN coefficients.
- Good design (using small Type I and Type II error
rates) can yield protocols that are less
sensitive to the consequences of SNP genotyping
errors.
72Power in presence of errors
A study design in which type I error rate is low
and power is high is less sensitive to genotyping
error rate
73No Call Regions for Three Component Normal
Mixture Model
74Power Using No Call Rules
75No-Call Rules (Paper 4)
- The gain (less reduction in power) from a reduced
error rate using no call is almost exactly
balanced by the loss of power due to reduced
sample size.
- That is, there is only so much information in the
sample.
- Conclusion: Use all of the data without resorting
to no call procedures.
76Regenotype Subjects
- Tintle will report on this approach in the next
seminar.
77Double Sampling
- See the following paper.
- Gordon, D., Yang, Y., Haynes, C., Finch, S.J.,
Mendell, N.R., Brown, A.M., Haroutunian, V.
(2004) "Increasing power for tests of genetic
association in the presence of phenotype and/or
genotype error by use of double-sampling."
Statistical Applications in Genetics and
Molecular Biology.
78Summary
- 1. We have described quantitatively the magnitude
of the effect of genotype errors on case/control
association studies: how much power is lost or
(equivalently) how much increase in sample size is
necessary to maintain constant power.
- We have quantified this magnitude for the
chi-square test of independence
(http://linkage.rockefeller.edu/pawe)
79Summary
- 2. Under HWE, cost coefficients of both error
types made on the more common homozygote have
infinite limits as SNP minor allele frequency
approaches 0
80Recommendations
- 1. Researchers should increase sample size to
maintain specification of type I error rate and
power in case/control studies.
- A study design in which type I error rate is low
and power is high is less sensitive to genotyping
error rate.
81Recommendations
- 2. Researchers designing SNP genotyping
technologies should avoid designs where
homozygote-to-homozygote misclassifications might
occur with non-zero probability
82References
- Armitage, P. (1955). Tests for linear trends in
proportions and frequencies. Biometrics, 11,
375-386.
- Bross, I. (1954). Misclassification in 2 x 2 tables.
Biometrics, 10, 478-486.
- Hartl, D.L., Clark, A.G. (1989). Principles of
Population Genetics, 2nd ed. Sunderland:
Sinauer Associates.
- Mitra, S.K. (1958). On the limiting power function of
the frequency chi-square test. Annals of
Mathematical Statistics, 29(4), 1221-1233.
- Mote, V.L., Anderson, R.L. (1965). An investigation of
the effect of misclassification on the properties
of chi-square tests in the analysis of categorical
data. Biometrika, 52, 95-109.
83References
- Sasieni, P.D. (1997). From genotypes to genes: doubling
the sample size. Biometrics, 53(4), 1253-1261.
- Sutcliffe, J.P. (1965). A probability model for
errors of classification. I. General
considerations. Psychometrika, 30, 73-96.
- Sutcliffe, J.P. (1965). A probability model for
errors of classification. II. Particular cases.
Psychometrika, 30, 129-155.
84References
- Tenenbein, A. (1970). A double sampling scheme for
estimating from binomial data with
misclassifications. Journal of the American
Statistical Association, 65, 1350-1361.
- Tenenbein, A. (1972). A double sampling scheme for
estimating from misclassified multinomial data
with applications to sampling inspection.
Technometrics, 14, 187-202.
- Tintle, N., Ahn, K., Mendell, N.R., Gordon, D.,
Finch, S.J. (2004). Using Replicated SNP
Genotypes for COGA. Genetic Analysis Workshop
contribution.