Which SNP genotyping errors are most costly and when

1
Which SNP genotyping errors are most costly and
when?
  • Stephen J. Finch
  • Stony Brook University

2
Acknowledgments
  • Joint work
  • Derek Gordon (Rockefeller University)
  • Sun Jung Kang (Duke University)
  • Five papers are the material for this talk with
    additional coauthors
  • Michael Nothnagel and Jurg Ott in paper 1
  • Mark Levenstien and Jurg Ott in paper 2
  • Abe Brown and Jurg Ott in paper 4

3
Acknowledgments
  • Colleagues
  • Nancy Mendell
  • Kenny Ye
  • Stony Brook students (work in progress)
  • Nathan Tintle (repeated sampling)
  • Qing Wang (LRT for mixtures)
  • Kwangmi Ahn, Rose Saint Fleur
  • Undergraduates Alex Borress, Josh Ren, Jelani
    Wiltshire

4
First Paper
  • Gordon, D., Finch, S.J., Nothnagel, M., Ott, J.
    (2002). Power and sample size calculations for
    case-control genetic association tests when
    errors are present: application to single
    nucleotide polymorphisms. Human Heredity, 54,
    22-33.

5
Second Paper
  • Gordon D., Levenstien M.A., Finch S.J., and Ott
    J. (2003). Errors and linkage disequilibrium
    interact multiplicatively when computing sample
    sizes for genetic case-control association
    studies. Pacific Symposium on Biocomputing,
    490-501.

6
Third Paper
  • Kang, S.J., Gordon, D., Finch, S.J. (2004). What
    SNP Genotyping Errors Are Most Costly for Genetic
    Association Studies. Genetic Epidemiology, 26,
    132-141.

7
Fourth Paper
  • Kang, S.J., Gordon, D., Brown, A.M., Ott, J.,
    Finch, S.J. (2004). Tradeoff between No-Call
    Reduction in Genotyping Error Rate and Loss of
    Sample Size for Genetic Case/Control Association
    Studies. Pacific Symposium on Biocomputing

8
Fifth Paper
  • Kang, S.J., Finch, S.J., Gordon, D. (2004).
    Quantifying the cost of SNP genotyping errors in
    genetic model based association studies. Human
    Heredity, In press.

9
PAWE Web Site
  • http://linkage.rockefeller.edu/pawe/pawe.cgi

10
Review Paper
  • Gordon, D., Finch, S.J. (2004). Factors affecting
    statistical power to detect genetic association.
    Submitted for publication.

11
Background
  • Definition of SNPs
  • SNP genotyping measurements
  • Specification of error models
  • Tests of association
  • Two supplementary measurement approaches

12
Definition of SNP
  • A gene with two possible alleles (here A and B)
  • A is the more common allele in the controls
  • Three possible genotypes
  • AA, index = 1 (more common homozygote)
  • AB, index = 2 (heterozygote)
  • BB, index = 3 (less common homozygote)

13
Measure of Cost
  • Our measure of the cost of a genotyping error is
    the percentage increase in the minimum sample
    size necessary to maintain constant Type I and
    Type II error rates when a genotyping error rate
    increases by 1%.
  • MSSN is our abbreviation for this measure.

14
SNP Genotyping Measurements
  • Two dye intensities are measured: R and G.
  • Measurements are typically taken at two or three
    time points.
  • The ratio F = R/(R+G) is used to classify into
    genotypes.
  • Genotyping error: an event in which an observed
    genotype is different from the true genotype.
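
The classification step can be sketched as follows. This is only an illustration: it assumes F = R/(R+G), two hypothetical cutpoints (1/3 and 2/3), and that low F corresponds to AA. None of these choices comes from the slides, and real platforms use their own calling rules.

```python
def dye_fraction(r, g):
    """Fraction of the total intensity contributed by the R dye: F = R / (R + G)."""
    total = r + g
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return r / total


def call_genotype(f, low=1 / 3, high=2 / 3):
    """Classify a fraction F into a genotype.

    The cutpoints and the orientation (low F -> AA) are assumptions
    made for illustration only.
    """
    if f < low:
        return "AA"
    if f > high:
        return "BB"
    return "AB"


for r, g in [(10, 90), (50, 50), (90, 10)]:
    print(r, g, call_genotype(dye_fraction(r, g)))
```

A genotyping error in this sketch is any case where `call_genotype` returns a label different from the subject's true genotype.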

15
SNP Genotyping Measurements (Raw Data)
16
Scatterplot of SNP Dye Intensities
17
Scatterplot of SNP Fraction by Cycle Time
18
Approaches to Replication
  • Sutcliffe studied the reclassification of
    subjects using the same classification procedure
    at all remeasurements.
  • Tenenbein studied the reclassification of
    subjects using a virtually perfect instrument for
    the second reclassification.

19
Regenotyping Results
  • There is a common perception that genotyping
    error is negligible.
  • One test is to regenotype a set of data.
  • COGA provided such data to the last GAW.
  • Tintle et al. (2004) analyzed it.

20
Regenotyping Results Summed over All SNPs (COGA
GAW Data)
21
Observations on Table
  • Homozygote to homozygote inconsistencies are
    extremely rare.
  • CIDR missing rate is 6.7%.
  • Affymetrix missing rate is 6.1%.
  • Double missing rate is 1.7%, much higher than the
    0.4% expected under independence, suggesting some
    subjects may be consistently more difficult to
    genotype.
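
The independence comparison in the last bullet is a one-line calculation: if missingness on the two platforms were independent, the double missing rate would be the product of the two marginal rates. A small check using the rates quoted above (variable names are illustrative):

```python
cidr_missing = 0.067      # CIDR missing rate quoted on the slide
affy_missing = 0.061      # Affymetrix missing rate quoted on the slide
observed_double = 0.017   # observed double missing rate quoted on the slide

# Under independence, P(both missing) = P(CIDR missing) * P(Affymetrix missing).
expected_double = cidr_missing * affy_missing
excess_ratio = observed_double / expected_double

print(f"expected under independence: {expected_double:.2%}")
print(f"observed / expected: {excess_ratio:.1f}")
```

The product is about 0.41%, so the observed 1.7% is roughly four times what independence would predict.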

22
Regenotyping Definitions
  • Consistency: two genotypes on a SNP for a
    regenotyped subject exist and are the same.
  • Nonreplication: one genotype on a SNP for a
    regenotyped subject exists, and data are missing
    for the other genotype. Note that we treat two
    missing genotypes as replicated.
  • SNP nonreplication rate: the number of
    non-replications divided by the sum of the number
    of replications and the number of
    non-replications.
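
A minimal sketch of the nonreplication rate, under one reading of the definitions above. It assumes that a "replication" is either a consistent pair or a doubly missing pair, and that pairs where both genotypes exist but differ are left out of this particular rate; the slides do not say how such pairs are handled, so that choice is an assumption.

```python
def nonreplication_rate(pairs):
    """Nonreplication rate for one SNP.

    pairs: list of (first_call, second_call); None marks a missing genotype.
    """
    replications = 0
    nonreplications = 0
    for first, second in pairs:
        n_missing = (first is None) + (second is None)
        if n_missing == 1:
            nonreplications += 1          # exactly one genotype missing
        elif n_missing == 2 or first == second:
            replications += 1             # both missing, or both present and equal
        # Assumption: discordant pairs (both present, different) count as neither.
    total = replications + nonreplications
    return nonreplications / total if total else float("nan")


example = [("AA", "AA"), ("AB", None), (None, None), ("AA", "AB")]
print(nonreplication_rate(example))
```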

24
Critical assumptions about errors
  • Regardless of nature of errors, they are random
    and independent
  • Error model is same for cases (affecteds) and
    controls (unaffecteds)

25
Mote-Anderson Model 1965 Penetrance Table
(most general)

26
Simple but Realistic Error Model
  • Homozygote to homozygote error rates set to zero
  • All other error rates set to equal error rate
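
This error model can be written as a 3 x 3 matrix of misclassification probabilities and applied to true genotype frequencies to obtain the frequencies that would be observed. The sketch below assumes a single common error rate `e` and uses illustrative function names; the true frequencies in the example are made up.

```python
def simple_error_matrix(e):
    """Rows: true genotype (AA, AB, BB); columns: observed genotype."""
    return [
        [1 - e, e, 0.0],       # AA: only AA -> AB errors (no AA -> BB)
        [e, 1 - 2 * e, e],     # AB: errors to either homozygote
        [0.0, e, 1 - e],       # BB: only BB -> AB errors (no BB -> AA)
    ]


def observed_frequencies(true_freqs, error_matrix):
    """Observed frequency of genotype j = sum_i true_i * P(observe j | true i)."""
    return [
        sum(true_freqs[i] * error_matrix[i][j] for i in range(3))
        for j in range(3)
    ]


# Illustrative true frequencies (AA, AB, BB) and a 1% error rate.
print(observed_frequencies([0.64, 0.32, 0.04], simple_error_matrix(0.01)))
```

Each row of the matrix sums to 1, so the observed frequencies still sum to 1.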

27
Three Component Normal Mixture
  • Given AA, F is normal(-μ, 1)
  • Given AB, F is normal(0, 1)
  • Given BB, F is normal(μ, 1)
  • Symmetric cutpoints create an error model that
    has equal error rates for all errors except
    homozygote to homozygote errors.
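
The error rates implied by this mixture follow directly from the normal CDF. The sketch below assumes component means -mu, 0, mu with unit variance and calls of AA below -cut, BB above +cut, and AB in between; `mu` and `cut` are illustrative names, and the example values are arbitrary.

```python
from statistics import NormalDist


def mixture_error_rates(mu, cut):
    """Return (hom_to_het, hom_to_hom, het_to_hom) error rates.

    F ~ N(-mu, 1), N(0, 1), N(mu, 1) for AA, AB, BB; the call is AA if
    F < -cut, BB if F > cut, and AB otherwise.
    """
    phi = NormalDist().cdf
    hom_to_het = phi(cut + mu) - phi(mu - cut)   # AA -> AB (same as BB -> AB)
    hom_to_hom = phi(-(cut + mu))                # AA -> BB (same as BB -> AA)
    het_to_hom = phi(-cut)                       # AB -> AA (same as AB -> BB)
    return hom_to_het, hom_to_hom, het_to_hom


# Cutpoints halfway between adjacent means: cut = mu / 2.
print(mixture_error_rates(mu=4.0, cut=2.0))
```

With cutpoints halfway between adjacent means, the homozygote-to-heterozygote and heterozygote-to-homozygote rates differ only by the (tiny) homozygote-to-homozygote rate, which is the sense in which symmetric cutpoints give nearly equal error rates.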

28
Tests of Association
  • Case-control study. The ratio of number of
    controls to number of cases is k.
  • We use the 2x3 chi-squared test of independence
    (simplest non-trivial case).
  • Mitra found the noncentrality parameter of the
    chi-squared test of association which is needed
    for power and sample size calculations
  • Recommended (Sasieni) test is test of trend
    (Armitage).

29
Test Statistic
  • Pearson's chi-squared on 2 x 3 tables
  • Example Table
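
The example table itself did not survive in the transcript. As a stand-in, the sketch below computes Pearson's chi-squared statistic for a 2 x 3 table of genotype counts; the counts are hypothetical. It assumes every genotype column has at least one observation, and it uses the fact that the chi-squared survival function with 2 degrees of freedom is exp(-x/2).

```python
import math


def pearson_chi2_2x3(cases, controls):
    """Pearson chi-squared statistic and p-value (2 df) for a 2 x 3 table."""
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for a, u in zip(cases, controls):
        col = a + u
        for obs, row in ((a, n_cases), (u, n_controls)):
            expected = row * col / n   # expected count under independence
            stat += (obs - expected) ** 2 / expected
    # With 2 degrees of freedom, P(chi2 > x) = exp(-x / 2).
    return stat, math.exp(-stat / 2)


# Hypothetical counts of (AA, AB, BB) in cases and controls.
stat, p_value = pearson_chi2_2x3([60, 30, 10], [70, 25, 5])
print(round(stat, 3), round(p_value, 3))
```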

30
Effect of Misclassification Errors on Tests of
Association
  • Bross found that the level of significance is
    unchanged when the same error mechanism affects
    cases and controls, and that parameter estimates
    are biased.
  • Mote and Anderson found that the power is reduced
    (level of significance constant) when there are
    misclassification errors.

31
Notation
  • Count parameters
  • NA = number of cases in the absence of errors
  • NU = number of controls in the absence of errors
  • NA* = number of cases in the presence of errors
  • NU* = number of controls in the presence of
    errors

32
What is needed for asymptotic power calculations?
33
Genetic model free parameterization
  • Specify the genotype probabilities directly
  • Assuming Hardy Weinberg Equilibrium (HWE), all
    probabilities specified with two parameters ( p,
    q )

34
Genetic model free parameterization
  • Specify the genotype probabilities directly
  • Not assuming HWE, can specify all probabilities
    with four parameters

35
Genetic Model Specification
  • p1 = allele frequency of SNP marker 1 allele
  • p2 = allele frequency of SNP marker 2 allele
    = 1 - p1
  • pd = allele frequency of disease locus d allele
  • p+ = allele frequency of disease wild-type allele
    = 1 - pd

36
Genetic Model Specification
  • D = disequilibrium (non-scaled, as defined in
    Hartl and Clark)
  • DMAX = min(p1 pd, p2 p+)
  • D' = D / DMAX

37
Genetic Model Specification (penetrance
parameters)
38
Results
  • Demonstrate analytic solution of asymptotic
    power using standard chi-square test of genotypic
    association

39
Genotype Frequencies in the Presence of Errors
40
Noncentrality Parameter
  • We assume NU = k NA.
  • Using Mitra's work (1958),

41
Noncentrality Parameter
  • Let λ = k NA g, where g is the bracketed function
    for genotypes measured without error.
  • Let λ* = k NA* g*, where g* is the bracketed
    function using frequencies for genotypes observed
    with error.
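
The bracketed function is not preserved in the transcript. The sketch below assumes the usual noncentrality parameter for a chi-squared test comparing two multinomial samples, which with NU = k NA takes the form k NA Σ (p0i - p1i)² / (p0i + k p1i), where p0i and p1i are the genotype frequencies in cases and controls. Treat this form, and the function names, as assumptions rather than the authors' exact expression.

```python
def g_function(case_freqs, control_freqs, k=1.0):
    """Assumed bracketed term: sum_i (p0i - p1i)^2 / (p0i + k * p1i)."""
    return sum(
        (p0 - p1) ** 2 / (p0 + k * p1)
        for p0, p1 in zip(case_freqs, control_freqs)
    )


def noncentrality(n_cases, case_freqs, control_freqs, k=1.0):
    """Noncentrality parameter k * NA * g, with k * NA controls."""
    return k * n_cases * g_function(case_freqs, control_freqs, k)


# Illustrative genotype frequencies (AA, AB, BB) for cases and controls.
print(noncentrality(100, [0.5, 0.4, 0.1], [0.6, 0.3, 0.1], k=1.0))
```

The same function gives g when called with error-free frequencies and the with-error version when called with the frequencies observed under an error model.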

42
To maintain constant asymptotic power
  • We choose NA* so that λ* = λ.

43
Paper 1 Findings
  • Noncentrality parameter for the 2x3 chi-squared
    test of independence from Mitra to describe
    asymptotic power.
  • Increase in error rate (three error models)
    requires a corresponding increase in sample size
    to maintain Type I and Type II error rates.
  • Regression analysis of increase in MSSN as
    function of error rate in a number of published
    models.
  • Interaction of linkage disequilibrium (D') and
    measure of overall error rate (S).

44
Paper 2 Findings
  • Linkage Disequilibrium (LD) and errors interact
    in a non-linear fashion.
  • The increase in sample size necessary to maintain
    constant asymptotic power and level of
    significance as a function of S (sum of error
    rates) is smallest when D' = 1 (perfect LD).
  • The increase grows monotonically as D' decreases
    to 0.5 for all studies.

45
Paper 3 Method
  • Saturated error model (called Mote-Anderson in
    PAWE software).
  • Taylor series expansion of the ratio of sample
    sizes expressed with the non-centrality
    parameters.
  • The coefficient of each error parameter gives the
    MSSN for a 1% increase in that error rate.

46
Recall the Noncentrality Parameters
  • Let λ = k NA g, where g is the bracketed function.
  • Let λ* = k NA* g*, where g* is the bracketed
    function using frequencies for genotypes observed
    with error.
  • Then, when λ* = λ (that is, equal power for both
    specifications), NA*/NA = g/g*.
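
The ratio of sample sizes can be computed numerically for a concrete case. The sketch below is self-contained and rests on several assumptions: genotype frequencies follow Hardy-Weinberg equilibrium, all error rates equal `e` except homozygote-to-homozygote errors (which are zero), and the bracketed function has the standard two-sample chi-squared form. The allele frequencies in the example are illustrative.

```python
def hwe(p):
    """Genotype frequencies (AA, AB, BB) when the B allele has frequency p."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def with_errors(freqs, e):
    """Observed frequencies when every error except hom <-> hom has rate e."""
    aa, ab, bb = freqs
    return [
        aa * (1 - e) + ab * e,
        aa * e + ab * (1 - 2 * e) + bb * e,
        ab * e + bb * (1 - e),
    ]


def g(case, control, k=1.0):
    """Assumed bracketed term: sum_i (p0i - p1i)^2 / (p0i + k * p1i)."""
    return sum((a - u) ** 2 / (a + k * u) for a, u in zip(case, control))


def sample_size_ratio(pa, pu, e, k=1.0):
    """Ratio of cases needed with errors to cases needed without errors."""
    g_clean = g(hwe(pa), hwe(pu), k)
    g_error = g(with_errors(hwe(pa), e), with_errors(hwe(pu), e), k)
    return g_clean / g_error


print(sample_size_ratio(pa=0.20, pu=0.21, e=0.01))
```

Because misclassification can only blur the difference between case and control frequencies, the ratio is at least 1, and it equals 1 when `e` is 0.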

47
MSSN Function
  • (NA* / NA) ≈ 1 + C12 e12 + C13 e13 + C21 e21 +
    C23 e23 + C31 e31 + C32 e32.
  • Suppose C13 = 7. Then every 1% increase in e13
    requires a 7% increase in sample size to maintain
    constant power.
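
The linear approximation is easy to evaluate once the coefficients are known. In the sketch below, error rates are entered as proportions (0.01 for 1%), and the dictionary keys (true index, observed index) are an illustrative convention. Only the value C13 = 7 comes from the slide.

```python
def mssn_ratio(coefficients, error_rates):
    """First-order approximation: ratio = 1 + sum_ij C_ij * e_ij.

    Both arguments are dicts keyed by (true_index, observed_index);
    error rates are proportions, so 0.01 means 1%.
    """
    return 1.0 + sum(
        c * error_rates.get(key, 0.0) for key, c in coefficients.items()
    )


# Slide example: C13 = 7 and a 1% rate for the 1 -> 3 error.
print(mssn_ratio({(1, 3): 7.0}, {(1, 3): 0.01}))
```

With C13 = 7 and e13 = 0.01 the ratio is 1.07, that is, a 7% larger sample.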

48
MSSN Coefficients
  • The MSSN coefficient associated with the error
    rate of misclassifying the more common homozygote
    as the heterozygote is given by

49
MSSN Coefficients
  • Similar expressions hold for the other five MSSN
    coefficients.

50
Example of Sample Size increase in presence of
errors
  • Suppose we have

51
Example Error Model Penetrance
52
Comparison of Genotype Frequencies
  • Without error; with 1% error; with 3% error

53
Sample size in presence of errors
  • Assume we want 0.80 power at the 0.05 level of
    significance. Let k = 1.
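
The sample size calculation for this setting can be sketched with the standard library alone. The code below assumes the test has 2 degrees of freedom, writes the noncentral chi-squared distribution as a Poisson mixture of central chi-squared distributions, and finds the noncentrality needed for the target power by bisection. The function names and the value of `g_value` in the example are hypothetical.

```python
import math


def power_df2(lam, alpha=0.05, terms=200):
    """Power of a 2-df chi-squared test with noncentrality lam."""
    half_x = -math.log(alpha)    # critical value x satisfies exp(-x/2) = alpha
    half_lam = lam / 2.0
    pois = math.exp(-half_lam)   # Poisson(lam/2) weight for j = 0
    term = math.exp(-half_x)     # (x/2)^i * exp(-x/2) / i! for i = 0
    tail = term                  # survival function of central chi2, 2 df
    total = 0.0
    for j in range(terms):
        total += pois * tail
        pois *= half_lam / (j + 1)
        term *= half_x / (j + 1)
        tail += term             # survival function with 2 * (j + 2) df
    return total


def required_lambda(power=0.80, alpha=0.05):
    """Noncentrality giving the requested power (bisection on [0, 100])."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if power_df2(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def required_cases(g_value, k=1.0, power=0.80, alpha=0.05):
    """Smallest NA with k * NA * g at or above the required noncentrality."""
    return math.ceil(required_lambda(power, alpha) / (k * g_value))


print(round(required_lambda(), 3), required_cases(g_value=0.01, k=1.0))
```

Passing the error-free g and then the with-error g to `required_cases` gives the two sample sizes whose ratio is the cost of the errors.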

54
Cost coefficients for our example
  • Coefficient Type of error
  • More common hom to het
  • Het to more common hom
  • Het to less common hom
  • Less common hom to het

55
Simplest non-trivial case to develop insights
  • Assume HWE, cases and controls
  • pa = 0.2, 0.3, 0.4, 0.5
  • pu = pa + d, d = 0.01
  • P01 = (1 - pa)^2, P02 = 2(1 - pa) pa, P03 = pa^2
  • P11 = (1 - pu)^2, P12 = 2(1 - pu) pu, P13 = pu^2

56
Results for MSSN Coefficients, d = 0.01
(Plot of cost against case SNP minor allele frequency)
57
Conclusion: What happens to MSSN coefficients as
minor SNP allele frequency approaches 0?
  • Costly errors are those made on the more common
    homozygote.
58
Extension to non-HWE generalizing example
  • MSSN coefficients C12 and C13 have infinite
    limits.
  • Additionally, C23 may have infinite limit.

59
How to perform calculations in practice?
  • Use PAWE webtool.

64
Paper 5 Findings
  • MSSN coefficients with infinite limit hold when
    studying usual genetic models.
  • Recessive models can have C23 with infinite limit
    as minor SNP allele frequency goes to zero.
  • Dominant models have a notably different behavior
    with fewer MSSN coefficients with infinite
    limit. Behavior can be more problematic.
  • MSSN coefficients are complex functions that
    should be studied on a case-by-case basis.

65
Paper 5 Definitions
  • Total MSSN is defined to be

66
Total MSSN by Freq of Disease Allele and Minor
SNP Allele
67
Coefficient of Heterozygote to Less Common
Homozygote Error
68
Dominant Model, Total MSSN
69
Possible Strategies to Counter Effects of SNP
Genotyping Errors
  • Increase sample size to compensate for loss of
    power. Use small Type I and Type II error rates
    in designing studies. (This works.)
  • When a three component normal mixture describes
    the measurements that are the basis of
    genotyping, use no-call rules to lessen error
    rates and reduce consequent cost.

70
Possible Strategies to Counter Effects of SNP
Genotyping Errors
  • Use the same genotyping classification procedure
    and regenotype subjects (Tintle's problem).
  • Use a perfect genotyping classification procedure
    on some of the subjects (Gordon et al.)

71
Increase sample size
  • Use PAWE software to identify whether the problem
    under consideration has the possibility of large
    MSSN coefficients.
  • Good design (using small Type I and Type II error
    rates) can yield protocols that are less
    sensitive to the consequences of SNP genotyping
    errors.

72
Power in presence of errors
A study design in which type I error rate is low
and power is high is less sensitive to genotyping
error rate
73
No Call Regions for Three Component Normal
Mixture Model
74
Power Using No Call Rules
75
No-Call Rules (Paper 4)
  • The gain (less reduction in power) from a reduced
    error rate using no call is almost exactly
    balanced by the loss of power due to reduced
    sample size.
  • That is, there is only so much information in the
    sample.
  • Conclusion: Use all of the data without resorting
    to no-call procedures.

76
Regenotype Subjects
  • Tintle will report on this approach in the next
    seminar.

77
Double Sampling
  • See the following paper.
  • Gordon, D., Yang, Y., Haynes, C., Finch, S.J.,
    Mendell, N.R., Brown, A.M., Haroutunian, V.
    (2004) "Increasing power for tests of genetic
    association in the presence of phenotype and/or
    genotype error by use of double-sampling."
    Statistical Applications in Genetics and
    Molecular Biology.

78
Summary
  • 1. We have described quantitatively the magnitude
    of the effect of genotype errors on case/control
    association studies: how much power is lost or
    (equivalently) how much increase in sample size
    is necessary to maintain constant power.
  • - We have quantified this magnitude for the
    chi-square test of independence
    (http://linkage.rockefeller.edu/pawe)

79
Summary
  • 2. Under HWE, cost coefficients of both error
    types made on the more common homozygote have
    infinite limits as SNP minor allele frequency
    approaches 0

80
Recommendations
  • 1. Researchers should increase sample size to
    maintain specification of type I error rate and
    power in case/control studies
  • A study design in which type I error rate is low
    and power is high is less sensitive to genotyping
    error rate

81
Recommendations
  • 2. Researchers designing SNP genotyping
    technologies should avoid designs where
    homozygote-to-homozygote misclassifications might
    occur with non-zero probability

82
References
  • Armitage, P., Tests for linear trends in
    proportions and frequencies. Biometrics, 1955.
    11: p. 375-386.
  • Bross, I., Misclassification in 2 x 2 tables.
    Biometrics, 1954. 10: p. 478-486.
  • Hartl, D.L. and A.G. Clark, Principles of
    population genetics. 2nd ed. 1989, Sunderland:
    Sinauer Associates.
  • Mitra, S.K., On the limiting power function of
    the frequency chi-square test. Annals of
    Mathematical Statistics, 1958. 29(4): p.
    1221-1233.
  • Mote, V.L. and R.L. Anderson, An investigation of
    the effect of misclassification on the properties
    of chi-square tests in the analysis of categorical
    data. Biometrika, 1965. 52: p. 95-109.

83
References
  • Sasieni, P.D., From genotypes to genes: doubling
    the sample size. Biometrics, 1997. 53(4): p.
    1253-61.
  • Sutcliffe, J.P. (1965) A probability model for
    errors of classification. I. General
    considerations. Psychometrika, 30, 73-96.
  • Sutcliffe, J.P. (1965) A probability model for
    errors of classification. II. Particular cases.
    Psychometrika, 30, 129-155.

84
References
  • Tenenbein, A. 1970. A double sampling scheme for
    estimating from binomial data with
    misclassifications. Journal of the American
    Statistical Association 65: 1350-1361.
  • Tenenbein, A. 1972. A double sampling scheme for
    estimating from misclassified multinomial data
    with applications to sampling inspection.
    Technometrics 14: 187-202.
  • Tintle, N., Ahn, K., Mendell, N.R., Gordon, D.,
    Finch, S.J. (2004). Using Replicated SNP
    Genotypes for CoGA. Genetics Analysis Workshop
    contribution.