Which SNP genotyping errors are most costly and when presentation


1
Which SNP genotyping errors are most costly and
when?

Stephen J. Finch
Stony Brook University

2
Acknowledgments

Joint work
Derek Gordon (Rockefeller University)
Sun Jung Kang (Duke University)
Five papers are the material for this talk with
additional coauthors
Michael Nothnagel and Jurg Ott in paper 1
Mark Levenstien and Jurg Ott in paper 2
Abe Brown and Jurg Ott in paper 4

3
Acknowledgments

Colleagues
Nancy Mendell
Kenny Ye
Stony Brook students (work in progress)
Nathan Tintle (repeated sampling)
Qing Wang (LRT for mixtures)
Kwangmi Ahn, Rose Saint Fleur
Undergraduates Alex Borress, Josh Ren, Jelani
Wiltshire

4
First Paper

Gordon, D., Finch, S.J., Nothnagel, M., Ott, J.
(2002). Power and sample size calculations for
case-control genetic association tests when
errors are present: application to single
nucleotide polymorphisms. Human Heredity, 54,
22-33.

5
Second Paper

Gordon D., Levenstien M.A., Finch S.J., and Ott
J. (2003). Errors and linkage disequilibrium
interact multiplicatively when computing sample
sizes for genetic case-control association
studies. Pacific Symposium on Biocomputing
490-501.

6
Third Paper

Kang, S.J., Gordon, D., Finch, S.J. (2004). What
SNP Genotyping Errors Are Most Costly for Genetic
Association Studies. Genetic Epidemiology, 26,
132-141.

7
Fourth Paper

Kang, S.J., Gordon, D., Brown, A.M., Ott, J.,
Finch, S.J. (2004). Tradeoff between No-Call
Reduction in Genotyping Error Rate and Loss of
Sample Size for Genetic Case/Control Association
Studies. Pacific Symposium on Biocomputing

8
Fifth Paper

Kang, S.J., Finch, S.J., Gordon, D. (2004).
Quantifying the cost of SNP genotyping errors in
genetic model based association studies. Human
Heredity, In press.

9
PAWE Web Site

http://linkage.rockefeller.edu/pawe/pawe.cgi

10
Review Paper

Gordon, D., Finch, S.J. (2004). Factors affecting
statistical power to detect genetic association.
Submitted for publication.

11
Background

Definition of SNPs
SNP genotyping measurements
Specification of error models
Tests of association
Two supplementary measurement approaches

12
Definition of SNP

A gene with two possible alleles (here A and B)
A is the more common allele in the controls
Three possible genotypes
AA, index = 1 (more common homozygote)
AB, index = 2 (heterozygote)
BB, index = 3 (less common homozygote)

13
Measure of Cost

The percentage increase in the minimum sample
size necessary to maintain constant Type I and
Type II error rates associated with an increase
of 1% in a genotyping error rate is our measure
of the cost of a genotyping error.
MSSN is our abbreviation for this measure.

14
SNP Genotyping Measurements

Two dye intensities are measured: R and G.
Measurements are typically taken at two or three
time points.
The ratio F = R/(R+G) is used to classify into
genotypes.
Genotyping error: an event in which an observed
genotype is different from the true genotype.

15
SNP Genotyping Measurements (Raw Data)
16
Scatterplot of SNP Dye Intensities
17
Scatterplot of SNP Fraction by Cycle Time
18
Approaches to Replication

Sutcliffe studied the reclassification of
subjects using the same classification procedure
at all remeasurements.
Tenenbein studied the reclassification of
subjects using a virtually perfect instrument for
the second reclassification.

19
Regenotyping Results

There is a common perception that genotyping
error is negligible.
One test is to regenotype a set of data.
COGA provided such data to the last GAW.
Tintle et al. (2004) analyzed them.

20
Regenotyping Results Summed over All SNPs (COGA
GAW Data)
21
Observations on Table

Homozygote to homozygote inconsistencies are
extremely rare.
CIDR missing rate is 6.7%.
Affymetrix missing rate is 6.1%.
Double missing rate is 1.7%, much higher than the
0.4% expected under independence, suggesting some
subjects may be consistently more difficult to
genotype.
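
The independence benchmark quoted on this slide can be checked with one line of arithmetic. A minimal sketch, reading the slide's three rates as percentages (the variable names are illustrative, not from the talk):

```python
# Missing rates reported on the slide, written as fractions.
cidr_missing = 0.067
affymetrix_missing = 0.061
observed_double_missing = 0.017

# Under independence, the chance that both platforms miss the same genotype
# is the product of the two marginal missing rates.
expected_double_missing = cidr_missing * affymetrix_missing

print(f"expected under independence: {expected_double_missing:.2%}")
print(f"observed / expected: {observed_double_missing / expected_double_missing:.1f}")
```

The product rounds to the 0.4% quoted on the slide, so the observed double missing rate is roughly four times what independence would predict.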

22
Regenotyping Definitions

Consistency: Two genotypes on a SNP for a
regenotyped subject exist and are the same.
Nonreplication: One genotype on a SNP for a
regenotyped subject exists, and data are missing
for the other genotype. Note that we treat two
missing genotypes as replicated.
SNP nonreplication rate: the number of
nonreplications divided by the sum of the number
of replications and the number of
nonreplications.
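
One way to read these definitions in code is sketched below. It assumes that a "replication" is any subject whose two genotypes are both present or both missing, and a "nonreplication" is any subject with exactly one genotype missing; the slide does not spell out how discordant but present pairs are counted, so that choice is an assumption. The function name and the sample calls are hypothetical.

```python
from typing import Optional, Sequence

def nonreplication_rate(first: Sequence[Optional[str]],
                        second: Sequence[Optional[str]]) -> float:
    """Nonreplication rate for one SNP; None marks a missing genotype."""
    if len(first) != len(second):
        raise ValueError("both runs must cover the same subjects")
    nonreplications = 0
    replications = 0
    for g1, g2 in zip(first, second):
        if (g1 is None) != (g2 is None):
            nonreplications += 1   # exactly one of the two calls is missing
        else:
            replications += 1      # both present, or both missing (assumption)
    total = replications + nonreplications
    return nonreplications / total if total else float("nan")

# Hypothetical calls for five subjects genotyped twice (not data from the talk).
run1 = ["AA", "AB", None, "BB", None]
run2 = ["AA", None, None, "BB", "AB"]
rate = nonreplication_rate(run1, run2)
print(rate)
```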

23
(No Transcript)
24
Critical assumptions about errors

Regardless of nature of errors, they are random
and independent
Error model is same for cases (affecteds) and
controls (unaffecteds)

25
Mote-Anderson Model (1965): Penetrance Table
(most general)

26
Simple but Realistic Error Model

Homozygote to homozygote error rates set to zero
All other error rates set to equal error rate

27
Three Component Normal Mixture

Given AA, F is normal(-?, 1)
Given AB, F is normal(0, 1)
Given BB, F is normal(?,1)
Symmetric cutpoints create an error model that
has equal error rates for all errors except
homozygote to homozygote errors.
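
The link between the mixture and the error model can be sketched numerically. The example below is an illustration under stated assumptions: the separation between component means is called `delta` (the transcript has lost the slide's symbol), its value is hypothetical, and the cutpoints are placed halfway between adjacent means.

```python
import math

def normal_cdf(x: float, mean: float = 0.0) -> float:
    """CDF of a unit-variance normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def mixture_error_matrix(delta: float, cut: float):
    """Rows: true genotype (AA, AB, BB); columns: called genotype.

    F ~ N(-delta, 1), N(0, 1), N(+delta, 1) for AA, AB, BB.
    Calls: AA if F < -cut, AB if -cut <= F <= cut, BB if F > cut.
    """
    rows = []
    for mean in (-delta, 0.0, delta):
        p_aa = normal_cdf(-cut, mean)
        p_bb = 1.0 - normal_cdf(cut, mean)
        rows.append([p_aa, 1.0 - p_aa - p_bb, p_bb])
    return rows

# Hypothetical separation; symmetric cutpoints halfway between the means.
delta = 4.0
E = mixture_error_matrix(delta, cut=delta / 2)
for label, row in zip(("AA", "AB", "BB"), E):
    print(label, ["%.5f" % p for p in row])
```

With symmetric cutpoints the four homozygote/heterozygote error rates come out nearly equal, while the homozygote to homozygote rates are orders of magnitude smaller, which is the pattern the slide describes.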

28
Tests of Association

Case-control study. The ratio of number of
controls to number of cases is k.
We use the 2x3 chi-squared test of independence
(simplest non-trivial case).
Mitra found the noncentrality parameter of the
chi-squared test of association which is needed
for power and sample size calculations.
Recommended (Sasieni) test is test of trend
(Armitage).

29
Test Statistic

Pearson's chi-squared statistic on 2 x 3 tables
Example Table
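
The slide's example table did not survive in the transcript. As a stand-in, here is a minimal sketch of Pearson's statistic on a 2 x 3 case-control genotype table; the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def pearson_chi_square(table):
    """Pearson statistic for an r x c table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical genotype counts (AA, AB, BB); not data from the talk.
cases = [50, 40, 10]
controls = [60, 35, 5]
chi2 = pearson_chi_square([cases, controls])

# With 2 degrees of freedom the chi-squared survival function is exp(-x/2).
p_value = math.exp(-chi2 / 2.0)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```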

30
Effect of Misclassification Errors on Tests of
Association

Bross found that the level of significance is
unchanged when the same error mechanism affects
cases and controls, and that parameter estimates
are biased.
Mote and Anderson found that the power is reduced
(level of significance constant) when there are
misclassification errors.

31
Notation

Count parameters
NA: number of cases in the absence of errors
NU: number of controls in the absence of errors
NA*: number of cases in the presence of errors
NU*: number of controls in the presence of
errors

32
What is needed for asymptotic power calculations?
33
Genetic model free parameterization

Specify the genotype probabilities directly
Assuming Hardy Weinberg Equilibrium (HWE), all
probabilities specified with two parameters ( p,
q )

34
Genetic model free parameterization

Specify the genotype probabilities directly
Not assuming HWE, can specify all probabilities
with four parameters

35
Genetic Model Specification

p1 = allele frequency of SNP marker 1 allele
p2 = allele frequency of SNP marker 2 allele
= 1 - p1
pd = allele frequency of disease locus d allele
p+ = allele frequency of disease wild-type allele
= 1 - pd

36
Genetic Model Specification

D = disequilibrium (non-scaled, as defined in
Hartl and Clark)
DMAX = min(p1 pd, p2 p+), where p+ = 1 - pd
D' = D / DMAX
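
A small sketch of the scaling step, assuming a positive D and taking DMAX exactly as written on the slide, with the wild-type frequency computed as 1 - pd. The function name and the numeric inputs are hypothetical.

```python
def scaled_disequilibrium(D: float, p1: float, pd: float) -> float:
    """Return D' = D / DMAX with DMAX = min(p1 * pd, p2 * p_plus)."""
    p2 = 1.0 - p1          # frequency of the SNP marker 2 allele
    p_plus = 1.0 - pd      # frequency of the disease-locus wild-type allele
    d_max = min(p1 * pd, p2 * p_plus)
    return D / d_max

# Hypothetical values chosen only to illustrate the scaling.
d_prime = scaled_disequilibrium(D=0.02, p1=0.7, pd=0.1)
print(round(d_prime, 4))
```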

37
Genetic Model Specification (penetrance
parameters)
38
Results

Demonstrate analytic solution of asymptotic
power using standard chi-square test of genotypic
association

39
Genotype Frequencies in the Presence of Errors
40
Noncentrality Parameter

We assume NU = kNA.
Using Mitra's work (1958),

41
Noncentrality Parameter

Let λ = kNAg, where g is the bracketed function for
genotypes measured without error.
Let λ* = kNA*g*, where g* is the bracketed function
using frequencies for genotypes observed with
error and NA* is the number of cases in the
presence of errors.

42
To maintain constant asymptotic power

We choose the number of cases in the presence of
errors, NA*, so that λ* = λ.

43
Paper 1 Findings

Noncentrality parameter for the 2x3 chi-squared
test of independence from Mitra to describe
asymptotic power.
Increase in error rate (three error models)
requires a corresponding increase in sample size
to maintain Type I and Type II error rates.
Regression analysis of increase in MSSN as
function of error rate in a number of published
models.
Interaction of linkage disequilibrium (D) and
measure of overall error rate (S).

44
Paper 2 Findings

Linkage Disequilibrium (LD) and errors interact
in a non-linear fashion.
The increase in sample size necessary to maintain
constant asymptotic power and level of
significance as a function of S (sum of error
rates) is smallest when D' = 1 (perfect LD).
The increase grows monotonically as D' decreases
to 0.5 for all studies.

45
Paper 3 Method

Saturated error model (called Mote-Anderson in
PAWE software).
Taylor series expansion of the ratio of sample
sizes expressed with the non-centrality
parameters.
The coefficients of each error parameter give the
MSSN for a 1% increase in that error rate.

46
Recall the Noncentrality Parameters

Let λ = kNAg, where g is the bracketed function.
Let λ* = kNA*g*, where g* is the bracketed function
using frequencies for genotypes observed with
error.
Then, when λ* = λ (that is, equal power for both
specifications), NA*/NA = g/g*.
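
The sample size ratio can be sketched numerically. The transcript does not show the bracketed function, so the code below assumes the standard form for a 2 x 3 table with k controls per case, g = sum over genotypes of (p0i - p1i)^2 / (p0i + k p1i), where p0i and p1i are the case and control genotype frequencies. The genotype frequencies and the 1% error rate are hypothetical; the error model is the simple one from the talk (no homozygote to homozygote errors, all other error rates equal).

```python
def g(case_freqs, control_freqs, k=1.0):
    """Assumed bracketed function: sum of (p0i - p1i)^2 / (p0i + k * p1i)."""
    return sum((a - u) ** 2 / (a + k * u)
               for a, u in zip(case_freqs, control_freqs))

def observe(freqs, E):
    """Observed genotype frequencies given E[i][j] = P(call j | true i)."""
    return [sum(freqs[i] * E[i][j] for i in range(3)) for j in range(3)]

# Hypothetical true genotype frequencies (AA, AB, BB).
cases = [0.60, 0.33, 0.07]
controls = [0.70, 0.26, 0.04]

# Simple error model: no hom-to-hom errors, every other error rate equals e.
e = 0.01
E = [[1 - e, e, 0.0],
     [e, 1 - 2 * e, e],
     [0.0, e, 1 - e]]

g_true = g(cases, controls)
g_obs = g(observe(cases, E), observe(controls, E))
ratio = g_true / g_obs   # number of cases with errors / number without
print(f"sample size must grow by {100 * (ratio - 1):.1f}%")
```

Because the same error mechanism is applied to cases and controls, the observed frequencies are closer together than the true ones, so the ratio is greater than one.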

47
MSSN Function

(NA*/NA) ≈ 1 + C12 e12 + C13 e13 + C21 e21 +
C23 e23 + C31 e31 + C32 e32, where NA* is the
number of cases needed in the presence of errors.
Suppose C13 = 7. Then every 1% increase in e13
requires a 7% increase in sample size to maintain
constant power.
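
The linear approximation is easy to evaluate once the coefficients are known. In the sketch below only C13 = 7 comes from the slide; the other coefficients are left at zero purely to keep the example short, and the dictionary layout is an illustrative choice.

```python
# MSSN coefficients C_ij: percent increase in sample size per 1% increase
# in the error rate e_ij (true genotype i called as genotype j).
# Only C13 = 7 comes from the slide; the rest are placeholders set to zero.
coefficients = {"12": 0.0, "13": 7.0, "21": 0.0,
                "23": 0.0, "31": 0.0, "32": 0.0}

def sample_size_ratio(error_rates):
    """First-order approximation: 1 + sum of C_ij * e_ij."""
    return 1.0 + sum(coefficients[ij] * e for ij, e in error_rates.items())

# A 1% rate of the more common homozygote being called as the less common one.
ratio = sample_size_ratio({"13": 0.01})
print(f"{100 * (ratio - 1):.0f}% more subjects")
```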

48
MSSN Coefficients

The MSSN coefficient associated with the error
rate of misclassifying the more common homozygote
as the heterozygote is given by

49
MSSN Coefficients

Similar expressions hold for the other five MSSN
coefficients.

50
Example of Sample Size increase in presence of
errors

Suppose we have

51
Example Error Model Penetrance
52
Comparison of Genotype Frequencies

Without error, with 1% error, with 3% error

53
Sample size in presence of errors

Assume we want 0.80 power at the 0.05 level of
significance. Let k = 1.
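
The slide's numbers are not in the transcript, but the first step of such a calculation can be sketched with the standard library alone: find the noncentrality parameter at which a 2-df chi-squared test reaches the target power. The function names are hypothetical, and the approach (a Poisson mixture of central chi-squared tails, then bisection) is one convenient way to do this, not necessarily the one used in the talk or in PAWE.

```python
import math

def power_chi2_df2(noncentrality: float, alpha: float = 0.05) -> float:
    """Power of a 2-df chi-squared test with the given noncentrality."""
    # For 2 df the central survival function is exp(-x/2), so the critical
    # value has a closed form: x_crit / 2 = -log(alpha).
    half_crit = -math.log(alpha)
    half_nc = noncentrality / 2.0
    # A noncentral chi-squared is a Poisson(half_nc) mixture of central
    # chi-squared variables with 2 + 2j df, whose survival function is
    # exp(-x/2) * sum_{m <= j} (x/2)^m / m!.
    power = 0.0
    poisson = math.exp(-half_nc)   # Poisson weight for j = 0
    term = 1.0                     # (x/2)^m / m! for m = 0
    partial = 1.0                  # running sum over m <= j
    for j in range(200):
        power += poisson * math.exp(-half_crit) * partial
        poisson *= half_nc / (j + 1)
        term *= half_crit / (j + 1)
        partial += term
    return power

def required_noncentrality(target_power=0.80, alpha=0.05):
    """Bisection for the noncentrality that gives the target power."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if power_chi2_df2(mid, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

lam = required_noncentrality()
print(f"required noncentrality: {lam:.3f}")
```

Given a value of the bracketed function g for the design at hand, the number of cases then follows from the relation between the noncentrality parameter, k, the number of cases, and g stated on the earlier slides.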

54
Cost coefficients for our example

Coefficient Type of error
More common hom to het
Het to more common hom
Het to less common hom
Less common hom to het

55
Simplest non-trivial case to develop insights

Assume HWE in cases and controls
pa = 0.2, 0.3, 0.4, 0.5
pu differs from pa by d, with d = 0.01
P01 = (1 - pa)^2, P02 = 2(1 - pa)pa, P03 = pa^2
P11 = (1 - pu)^2, P12 = 2(1 - pu)pu, P13 = pu^2
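
The behavior of the cost coefficients in this setup can be explored numerically. The sketch below is an approximation under several assumptions: the control frequency is taken as pu = pa + d (the transcript has lost the sign), the bracketed function is assumed to be the sum of (P0i - P1i)^2 / (P0i + k P1i), and each coefficient is estimated as a finite-difference slope of the sample size ratio rather than by the Taylor expansion used in the paper. The value pa = 0.05 is added to the slide's list only to show the trend toward small frequencies.

```python
def hwe(p):
    """Genotype frequencies (AA, AB, BB) for minor allele frequency p."""
    return [(1 - p) ** 2, 2 * (1 - p) * p, p ** 2]

def g(cases, controls, k=1.0):
    """Assumed bracketed function of the noncentrality parameter."""
    return sum((a - u) ** 2 / (a + k * u) for a, u in zip(cases, controls))

def with_error(freqs, i, j, rate):
    """Move a fraction `rate` of true genotype i into called genotype j."""
    out = list(freqs)
    out[i] -= rate * freqs[i]
    out[j] += rate * freqs[i]
    return out

def mssn_coefficient(pa, i, j, d=0.01, k=1.0, step=1e-6):
    """Finite-difference slope of the sample size ratio in error rate e_ij."""
    cases, controls = hwe(pa), hwe(pa + d)   # assumes pu = pa + d
    g_true = g(cases, controls, k)
    g_obs = g(with_error(cases, i, j, step),
              with_error(controls, i, j, step), k)
    return (g_true / g_obs - 1.0) / step

for pa in (0.5, 0.4, 0.3, 0.2, 0.05):
    c12 = mssn_coefficient(pa, 0, 1)   # more common hom called as het
    c21 = mssn_coefficient(pa, 1, 0)   # het called as more common hom
    print(f"pa={pa:.2f}  C12={c12:.2f}  C21={c21:.2f}")
```

A slope of 2 here reads as a 2% increase in sample size per 1% increase in that error rate. Under these assumptions the coefficient for errors on the more common homozygote grows as the minor allele frequency shrinks, while the heterozygote coefficient stays modest.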

56
Results for MSSN Coefficients (d = 0.01)
(Figure: cost versus case SNP minor allele frequency)
57
Conclusion: What happens to MSSN coefficients as
minor SNP allele frequency approaches 0?
Costly errors are those made on the more common
homozygote.
58
Extension to non-HWE: generalizing the example

MSSN coefficients C12 and C13 have infinite
limits.
Additionally, C23 may have infinite limit.

59
How to perform calculations in practice?

Use PAWE webtool.

60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
Paper 5 Findings

MSSN coefficients with infinite limit hold when
studying usual genetic models.
Recessive models can have C23 with infinite limit
as minor SNP allele frequency goes to zero.
Dominant models have a notably different behavior
with fewer MSSN coefficients with infinite
limit. Behavior can be more problematic.
MSSN coefficients are complex functions that
should be studied on a case-by-case basis.

65
Paper 5 Definitions

Total MSSN is defined to be

66
Total MSSN by Freq of Disease Allele and Minor
SNP Allele
67
Coefficient of Heterozygote to Less Common
Homozygote Error
68
Dominant Model, Total MSSN
69
Possible Strategies to Counter Effects of SNP
Genotyping Errors

Increase sample size to compensate for loss of
power. Use small Type I and Type II error rates
in designing studies. (This works.)
When a three component normal mixture describes
the measurements that are the basis of
genotyping, use no-call rules to lessen error
rates and reduce consequent cost.

70
Possible Strategies to Counter Effects of SNP
Genotyping Errors

Use the same genotyping classification procedure
and regenotype subjects (Tintle's problem).
Use a perfect genotyping classification procedure
on some of the subjects (Gordon et al.).

71
Increase sample size

Use PAWE software to identify whether the problem
under consideration has the possibility of large
MSSN coefficients.
Good design (using small Type I and Type II error
rates) can yield protocols that are less
sensitive to the consequences of SNP genotyping
errors.

72
Power in presence of errors
A study design in which type I error rate is low
and power is high is less sensitive to genotyping
error rate
73
No Call Regions for Three Component Normal
Mixture Model
74
Power Using No Call Rules
75
No-Call Rules (Paper 4)

The gain (less reduction in power) from a reduced
error rate using no call is almost exactly
balanced by the loss of power due to reduced
sample size.
That is, there is only so much information in the
sample.
Conclusion: Use all of the data without resorting
to no-call procedures.

76
Regenotype Subjects

Tintle will report on this approach in the next
seminar.

77
Double Sampling

See the following paper.
Gordon, D., Yang, Y., Haynes, C., Finch, S.J.,
Mendell, N.R., Brown, A.M., Haroutunian, V.
(2004) "Increasing power for tests of genetic
association in the presence of phenotype and/or
genotype error by use of double-sampling."
Statistical Applications in Genetics and
Molecular Biology.

78
Summary

1. We have described quantitatively the magnitude
of the effect of genotype errors on case/control
association studies: how much power is lost or
(equivalently) how much of an increase in sample
size is necessary to maintain constant power.
- We have quantified this magnitude for the
chi-square test of independence
(http://linkage.rockefeller.edu/pawe)

79
Summary

2. Under HWE, cost coefficients of both error
types made on the more common homozygote have
infinite limits as SNP minor allele frequency
approaches 0

80
Recommendations

1. Researchers should increase sample size to
maintain specification of type I error rate and
power in case/control studies
A study design in which type I error rate is low
and power is high is less sensitive to genotyping
error rate

81
Recommendations

2. Researchers designing SNP genotyping
technologies should avoid designs where
homozygote-to-homozygote misclassifications might
occur with non-zero probability.

82
References

Armitage, P., Tests for linear trends in
proportions and frequencies. Biometrics, 1955.
11: p. 375-386.
Bross, I., Misclassification in 2 x 2 tables.
Biometrics, 1954. 10: p. 478-486.
Hartl, D.L. and A.G. Clark, Principles of
population genetics. 2nd ed. 1989, Sunderland:
Sinauer Associates.
Mitra, S.K., On the limiting power function of
the frequency chi-square test. Annals of
Mathematical Statistics, 1958. 29(4): p.
1221-1233.
Mote VL, Anderson RL (1965) An investigation of
the effect of misclassification on the properties
of chi-square tests in the analysis of categorical
data. Biometrika 52: 95-109.

83
References

Sasieni, P.D., From genotypes to genes: doubling
the sample size. Biometrics, 1997. 53(4): p.
1253-61.
Sutcliffe, J.P. (1965) A probability model for
errors of classification. I. General
considerations. Psychometrika, 30, 73-96.
Sutcliffe, J.P. (1965) A probability model for
errors of classification. II. Particular cases.
Psychometrika, 30, 129-155.

84
References

Tenenbein, A. 1970. A double sampling scheme for
estimating from binomial data with
misclassifications. Journal of the American
Statistical Association 65: 1350-1361.
Tenenbein, A. 1972. A double sampling scheme for
estimating from misclassified multinomial data
with applications to sampling inspection.
Technometrics 14: 187-202.
Tintle, N., Ahn, K., Mendell, N.R., Gordon, D.,
Finch, S.J. (2004). Using Replicated SNP
Genotypes for CoGA. Genetics Analysis Workshop
contribution.
