Association Studies: Statistical And Study Design Issues

1
Lyle J. Palmer, PhD
From Genome to Disease, NHLBI Symposium, July 23/24, 2003, Washington
Association Studies: Statistical And Study Design Issues
2
This lecture
  • Association in context
  • Genetic association analysis: some statistical
    issues
  • Study design: problems and issues
  • The future for association analysis

3
Some challenges
  • Field is young and changing rapidly
  • Literature can be difficult
  • Example: Haseman-Elston is a standard linkage
    method
  • 4-5 variations on HE recently proposed
  • Hard to compare approaches
  • Software often free, but sometimes not well-tested

4
Some challenges - continued
  • Methods are sometimes oversold
  • Sure thing of the day
  • Collecting X affected sib pairs
  • Collecting X unrelated cases and controls
  • Isolated populations
  • SNPs

6
Broad Genetic Epidemiology Study Design
Categories
  • Linkage Analysis
  • Follows meiotic events through families for
    co-segregation of disease and particular genetic
    variants
  • Large Families
  • Sibling Pairs (or other family pairs)
  • Works VERY well for Mendelian diseases
  • Association Studies
  • Detect association between genetic variants and
    disease across families; exploits linkage
    disequilibrium
  • Case-Control designs
  • Cohort designs
  • Parents and affected child trios (TDT)
  • May be more appropriate for complex diseases

7
Allelic architecture and mapping strategy
[Figure: magnitude of effect plotted against frequency in population; regions labelled "Unlikely to exist" and "Fn. Studies"]
Slide thanks to D. Altshuler
8
What determines the allelic architecture?
Evolutionary selection
Slide thanks to D. Altshuler
9
Power comparison: sib-pair linkage versus
population association
Risch, Nature 2000
10
Association Study Approaches
  • Candidate gene search
  • Limited variants or haplotypes based on prior
    knowledge: expert opinion, linkage peaks
  • Genome-wide scan
  • Dense set of markers throughout genome

Family-based and population-based designs
12
Phenotype-genotype association
  • In practical terms, an observed statistical
    association between an allele and a phenotypic
    trait will be due to one of three situations:
  • The finding could be due to chance or artifact,
    e.g., confounding or selection bias;
  • The allele is in linkage disequilibrium with an
    allele at another locus that directly affects the
    expression of the phenotype; or
  • The allele itself is functional and directly
    affects the expression of the phenotype.

13
Candidate Allele Testing
  • Test markers for association with disease
    predisposition
  • One approach: perform standard single-locus
    chi-squared tests

By alleles: OR = ad/bc; test: chi-squared, 1 df
By genotype: test: chi-squared, 2 df
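
A minimal sketch of the two single-locus tests named on this slide, using only the Python standard library. The table layout (cases in the first row, controls in the second) and the function names are assumptions for illustration; the slide does not show the underlying tables.

```python
import math

def allelic_test(a, b, c, d):
    """Allele-count table (assumed layout):
                 allele 1   allele 2
    cases           a          b
    controls        c          d
    Returns (odds_ratio, chi2, p_value) for the 1-df Pearson test."""
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return odds_ratio, chi2, math.erfc(math.sqrt(chi2 / 2))

def genotype_test(cases, controls):
    """cases, controls: counts of genotypes (11, 12, 22).
    Returns (chi2, p_value) for the 2-df Pearson test."""
    total = sum(cases) + sum(controls)
    chi2 = 0.0
    for row in (cases, controls):
        for j, observed in enumerate(row):
            expected = sum(row) * (cases[j] + controls[j]) / total
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of chi-squared with 2 df: P(X > x) = exp(-x / 2)
    return chi2, math.exp(-chi2 / 2)

print(allelic_test(60, 40, 40, 60))
print(genotype_test((10, 20, 20), (20, 20, 10)))
```

The allelic test counts chromosomes rather than people, so its sample size is twice the number of individuals.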
14
Single-marker logistic regression
  • H0: βi = 0
  • Genetic model interpretations
  • Assume the 11 genotype coding represents the genotype
    with lowest absolute risk (baseline)
  • β1 = β2 = 0: no association with that
    polymorphism
  • β1 = 0, β2 > 0: (completely) recessive
  • β1 = β2 > 0: (completely) dominant
  • 0 < β1 < β2: additive or multiplicative
  • Note: this can be extended through GLM to many
    types of outcomes (rather than simply odds of
    disease/not disease, as above)
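
A small sketch of how the two genotype coefficients can be read off a genotype table. It assumes the model uses one indicator variable each for genotypes 12 and 22 with 11 as baseline (the slide does not show the model equation). Under that coding the model is saturated, so the maximum likelihood estimates equal the empirical log odds ratios against the baseline genotype. The function name and counts are hypothetical.

```python
import math

def genotype_log_odds_ratios(cases, controls):
    """cases, controls: dicts of counts keyed by genotype '11', '12', '22'.
    Returns (beta1, beta2): log odds ratios of 12 and 22 versus baseline 11."""
    def odds(genotype):
        return cases[genotype] / controls[genotype]
    beta1 = math.log(odds("12") / odds("11"))
    beta2 = math.log(odds("22") / odds("11"))
    return beta1, beta2

# Hypothetical counts: only the 22 genotype carries excess risk,
# which matches the recessive pattern listed above.
beta1, beta2 = genotype_log_odds_ratios(
    cases={"11": 100, "12": 100, "22": 200},
    controls={"11": 100, "12": 100, "22": 100},
)
print(beta1, beta2)
```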

15
LD Gene Mapping
  • General idea
  • Exploit the phenomenon of linkage disequilibrium
    (LD) between alleles of closely linked markers to
    identify genetic regions associated with disease
    status.
  • i.e., Test for LD between marker loci and
    disease allele
  • LD strength (magnitude) ∝ 1 / r
  • LD will be highest at areas of the genome that
    are closest to the disease locus
  • Use this to pinpoint (localize, fine-map) the
    disease gene region
  • E.g.
  • Fine-mapping
  • From linkage analysis, may have 10 cM candidate
    region. Next add dense set of markers in
    significant region and perform LD analysis to
    narrow region much further.
  • What about whole-genome approach?
  • Some have suggested genome-wide LD studies are
    feasible with densely spaced markers.
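
As an illustration of what "testing for LD" between two loci measures, here is a sketch of three common pairwise LD statistics computed from haplotype and allele frequencies. The slide does not say which measure it has in mind, so the choice of D, D' and r-squared is an assumption. Note that r-squared here is a squared correlation and is not the "r" in the bullet on LD strength.

```python
def ld_measures(p_ab, p_a, p_b):
    """p_ab: frequency of the haplotype carrying allele A at locus 1 and
    allele B at locus 2; p_a, p_b: the two allele frequencies.
    Returns (D, D_prime, r_squared)."""
    d = p_ab - p_a * p_b
    # The largest |D| attainable for these allele frequencies depends on the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared

print(ld_measures(p_ab=0.4, p_a=0.5, p_b=0.5))
```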

17
SNP association with disease allele
[Figure: a gene with several marker SNPs positioned around a disease allele]
  • How closely must SNPs be spaced?
  • 30,000 to 1,000,000 SNPs to span the genome?
  • Numbers will depend on local haplotype structure,
    amount of LD
  • ~30 kb blocks: 100,000 independent
    SNPs to span genome
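
The 100,000 figure can be reproduced with simple arithmetic, assuming a genome of roughly 3 billion base pairs and one independent SNP per block. Both assumptions are mine; the slide states only the block size and the total.

```python
GENOME_LENGTH_BP = 3_000_000_000  # assumed round figure, not stated on the slide
BLOCK_LENGTH_BP = 30_000          # 30 kb blocks, as on the slide

def snps_to_span_genome(genome_bp, block_bp, snps_per_block=1):
    """Number of independent SNPs if each block needs `snps_per_block` markers."""
    return genome_bp // block_bp * snps_per_block

print(snps_to_span_genome(GENOME_LENGTH_BP, BLOCK_LENGTH_BP))  # 100000
```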

18
Multiple testing
  • Multiple comparisons - 50,000 genes. Even if only
    one functional locus/gene tested, very high
    number of false positives
  • Solutions
  • Simulation: empirical p-values
  • Replication
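
A sketch of the scale of the problem and of one of the listed solutions. The 0.05 significance level, the Bonferroni threshold, and the particular permutation statistic (difference in mean allele count between cases and controls) are assumptions chosen for illustration, not taken from the slide.

```python
import random

def expected_false_positives(n_tests, alpha):
    """Expected number of nominally significant results if every null is true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def empirical_p_value(case_values, control_values, n_perm=10_000, seed=0):
    """Permutation p-value for the absolute difference in mean allele count
    (0, 1 or 2 copies per person), shuffling case/control labels."""
    rng = random.Random(seed)

    def mean_diff(x, y):
        return abs(sum(x) / len(x) - sum(y) / len(y))

    observed = mean_diff(case_values, control_values)
    pooled = list(case_values) + list(control_values)
    n_cases = len(case_values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean_diff(pooled[:n_cases], pooled[n_cases:]) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero
    return (hits + 1) / (n_perm + 1)

print(expected_false_positives(50_000, 0.05))
print(bonferroni_threshold(50_000))
```

With 50,000 independent tests at a nominal 0.05 level, about 2,500 false positives are expected even when no locus is associated.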

19
What is a haplotype?
  • Some definitions
  • Haplotype: set of particular alleles at separate
    loci on the same transmitted chromosome
  • Linkage disequilibrium (LD): association between
    those particular alleles due to their proximity
    on the same chromosome (due to linkage)
  • Haplotype-based analyses provide increased
    informativity
  • Each allele (or mutation) is associated with a
    particular evolutionary history and will thus
    have a unique chromosomal background, or
    haplotype.
  • More Powerful

20
Motivation for Haplotype-based analysis
  • Advantage of combinatorial approach
  • Haplotypes important from population genetics
    standpoint
  • Increase ability to identify regions that are IBD
  • Biologically, combinations of alleles in a region
    may be functionally important, so set of variants
    on a chromosome may be the causative composite
    allele rather than a particular nucleotide at a
    particular SNP
  • Haplotype analyses can be more powerful than
    single-locus analyses when LD is exploited

21
Haplotype vs Single-locus Analyses
  • Consider a 2-locus system with a disease-bearing
    haplotype
  • A-dx-B

OR(A-B) = 2.0
Slide thanks to Dani Fallin
22
Haplotype Determination Options
  • Collect and genotype family members
  • ?150 effort, cost
  • Family members not available
  • Laboratory-based techniques
  • Chromosome isolation
  • Long-range PCR
  • Limited results
  • Time consuming
  • Cost-prohibitive
  • Statistical estimation
  • Sequential rules (Clark, 1990)
  • Likelihood-based E-M algorithms
  • (Hill, 1974; Long et al., '95; Hawley & Kidd, '95;
    Excoffier & Slatkin, '95; Fallin & Schork, 2000)
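
To make the statistical-estimation option concrete, here is a minimal sketch of an E-M algorithm for the simplest case: two biallelic SNPs, where only double heterozygotes have ambiguous phase. It is not the implementation from any of the cited papers; the genotype coding (count of allele "1" at each locus) and the function name are assumptions.

```python
def em_haplotype_freqs(genotypes, max_iter=200, tol=1e-12):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), the count (0, 1 or 2) of allele "1"
    at locus 1 and locus 2. Returns a dict keyed by haplotype (a1, a2)."""
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freqs = {h: 0.25 for h in haps}
    n_chrom = 2 * len(genotypes)
    for _ in range(max_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: split the double heterozygote between the two
                # possible phases in proportion to their current probability
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is unambiguous when at most one locus is heterozygous
                a = (g1 // 2, (g1 + 1) // 2)
                b = (g2 // 2, (g2 + 1) // 2)
                counts[(a[0], b[0])] += 1
                counts[(a[1], b[1])] += 1
        # M-step: new frequencies are the expected haplotype counts per chromosome
        new_freqs = {h: counts[h] / n_chrom for h in haps}
        converged = max(abs(new_freqs[h] - freqs[h]) for h in haps) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs

sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_freqs(sample))
```

With more loci the number of compatible haplotype pairs per individual grows quickly, which is why published methods add further machinery beyond this two-locus case.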

23
Relative importance of low risk alleles
  • Population attributable fraction: the proportion
    of disease that would be eliminated if the allele
    were eliminated from the population
  • For GRR < 2, alleles with frequency < 0.15 have very
    little impact on disease in the population.
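
A sketch of how the population attributable fraction varies with allele frequency and genotype relative risk (GRR). It uses Levin's formula and treats carriers under Hardy-Weinberg proportions as the "exposed" group (a dominant model). Both choices are assumptions; the slide does not state the formula or genetic model behind its figure.

```python
def carrier_frequency(allele_freq):
    """Proportion of people carrying at least one copy, under Hardy-Weinberg."""
    return 1 - (1 - allele_freq) ** 2

def population_attributable_fraction(exposure_prevalence, relative_risk):
    """Levin's formula: PAF = p(RR - 1) / (1 + p(RR - 1))."""
    excess = exposure_prevalence * (relative_risk - 1)
    return excess / (1 + excess)

for allele_freq in (0.01, 0.05, 0.15, 0.30):
    paf = population_attributable_fraction(carrier_frequency(allele_freq), 1.5)
    print(f"allele frequency {allele_freq:.2f}: PAF {paf:.3f}")
```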

24
Study design
  • Targeting SNPs likely to be important in the
    population
  • For GRR < 2, alleles/haplotypes with frequency
    < 0.10-0.15 have little impact on disease in the
    population
  • If the goal is to develop predictive or
    diagnostic tests, such alleles are of little
    commercial interest
  • Studies can be designed to have high power for
    moderate (≥ 0.10) allele frequencies and GRR ≥ 2;
    sample sizes on the order of 1000 cases and 1000
    controls are a good start

25
Association Studies Potential Causes of
Inconsistent Results
  • Population stratification: differences between
    cases and controls; most often cited reason
  • Genetic heterogeneity: different genetic
    mechanisms in different populations
  • Random error: false positive/false negative
    results
  • Study design/analysis problems
  • Poorly defined phenotypes
  • Failure to correct for sub-group analysis and
    multiple comparisons
  • Poor control group selection
  • Small sample sizes
  • Failure to attempt replication
  • Silverman & Palmer, Am J Respir Cell Mol Biol 2000
  • Cardon & Bell, Nat. Rev. Genet. 2001

26
Population stratification
  • If cases and controls have different genetic
    backgrounds (are from different genetic
    sub-populations),
  • There may be inherent gene frequency differences,
    increasing the possibility of a false positive
    (or negative) result
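
A sketch of how stratification alone can produce an association. Each sub-population has its own disease prevalence and allele frequency, and within each one the allele is unrelated to disease. Pooling them still gives an allelic odds ratio different from 1. All numbers and names are hypothetical.

```python
def pooled_allelic_odds_ratio(strata):
    """strata: list of (population_share, disease_prevalence, allele_freq).
    Within every stratum the allele frequency is the same in cases and
    controls, so any pooled association is due to stratification alone."""
    case_weights = [share * prev for share, prev, _ in strata]
    control_weights = [share * (1 - prev) for share, prev, _ in strata]
    freqs = [freq for _, _, freq in strata]
    freq_cases = sum(w * f for w, f in zip(case_weights, freqs)) / sum(case_weights)
    freq_controls = sum(w * f for w, f in zip(control_weights, freqs)) / sum(control_weights)
    return (freq_cases / (1 - freq_cases)) / (freq_controls / (1 - freq_controls))

# Two equally sized sub-populations; the second has both higher disease
# prevalence and higher allele frequency.
print(pooled_allelic_odds_ratio([(0.5, 0.10, 0.2), (0.5, 0.20, 0.6)]))
```

If either the prevalence or the allele frequency is the same across sub-populations, the pooled odds ratio returns to 1, so both must differ for confounding to arise.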

27
Knowler, W. C., R. C. Williams, et al. (1988).
Gm3;5,13,14 and type 2 diabetes mellitus: an
association in American Indians with genetic
admixture. Am J Hum Genet 43(4): 520-6.
28
Designs for Family-based LD studies
  • Sibling controls (discordant siblings)
  • Case-parent trios
  • Nuclear families
  • Extended families

[Figure labels: Ancestral Population, X, Disease, Genetic marker]
29
Study Designs used for LD mapping
  • Family-based Designs for Association Studies
  • Advantages
  • Not susceptible to confounding due to population
    substructure
  • Tests for linkage and association
  • Can test for parent-of-origin effects
  • Disadvantages
  • Inefficient recruitment, only heterozygous
    parents informative
  • Often cannot test for environmental main-effects
  • Family members often not available (e.g.,
    late-onset diseases)

30
TDT (transmission-disequilibrium test)
  • Basic idea of TDT
  • Disease alleles are transmitted from parents to
    offspring
  • Marker alleles in LD with these alleles will also
    be transmitted preferentially to affected
    offspring
  • Test if heterozygous parents transmit a
    particular marker allele to affected offspring
    more frequently than expected
  • Looks for excess transmission of particular
    alleles from parents to affected children
  • Controls are non-transmitted alleles
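
The test described above is usually computed as a McNemar-type statistic on transmissions from heterozygous parents. The slide gives no formula, so the form below, (b - c)^2 / (b + c) referred to a chi-squared distribution with 1 df, is the commonly used one and is shown as a sketch with hypothetical counts.

```python
import math

def tdt(transmitted, not_transmitted):
    """transmitted: heterozygous parents who passed the marker allele to an
    affected child; not_transmitted: those who passed the other allele.
    Returns (chi2, p_value) for the 1-df test."""
    b, c = transmitted, not_transmitted
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

print(tdt(transmitted=60, not_transmitted=40))
```

Homozygous parents do not enter the statistic, which is why only heterozygous parents are informative.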

31
Population stratification
  • Random panels of SNPs can be used to test for
    population sub-structure (many new methods).
  • First studies - little empirical evidence of
    stratification in large samples from North
    America, Japan, Latin America and Europe.
  • Problems with Trios and TDT
  • Inefficient in genotyping and sampling
  • Difficult (or impossible) to collect
  • Can only really have a big problem if doing poor
    epidemiology.
  • Potential for bias has been greatly exaggerated.
  • Fear of population stratification led to
    substantial changes in study design and analytic
    methods and widespread adoption of trios design
  • Now no reason to adopt a family-based design
    solely to protect against stratification

Cardon LR, Palmer LJ. Population stratification
and spurious allelic association. The Lancet
2003; 361: 598-604.
32
Why Case/Control?
  • Advantages
  • Methodology is well-known
  • Convenient to collect
  • Common
  • Very large samples
  • More efficient recruitment than family-based
    sampling
  • Simultaneous assessment of disease allele
    frequency, penetrance, and AR
  • Unrelated controls can provide increased power
  • Limitations
  • 1. Possible Population Stratification
  • 2. Need for highly dense marker sets (capture LD)
  • 3. Lack of phase information
  • Lack of consistency of results

These can be overcome!
1. Assessment and genomic control of stratification
2. SNP maps
3. Imputed haplotypes
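
A sketch of the genomic control idea mentioned in point 1. It assumes the common formulation in which an inflation factor is estimated as the median 1-df chi-squared statistic across unlinked markers divided by its expected median under the null (about 0.455). The slide names the method but gives no formula; the function name and example values are hypothetical.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549  # approximate median of chi-squared with 1 df

def genomic_control(chi2_stats):
    """chi2_stats: 1-df test statistics from unlinked null markers.
    Returns (inflation_factor, adjusted_statistics)."""
    # An estimate below 1 would inflate the statistics, so it is floored at 1
    inflation = max(1.0, median(chi2_stats) / CHI2_1DF_MEDIAN)
    return inflation, [stat / inflation for stat in chi2_stats]

inflation, adjusted = genomic_control([0.2, 0.9098, 5.0])
print(inflation, adjusted)
```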
33
Sample size requirements for case-control
analyses of SNPs (2 controls per case;
detectable difference of OR ≥ 1.5; power = 80%).
Statistical power: an increasing concern
Palmer, L. J. and W. O. C. M. Cookson (2001).
Using Single Nucleotide Polymorphisms (SNPs) as
a means to understanding the pathophysiology of
asthma. Respiratory Research 2: 102-112.
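
A rough sketch of a sample size calculation in the spirit of this slide, using the standard normal-approximation formula for comparing two proportions with unequal group sizes. The two-sided significance level of 0.05 and the treatment of the SNP as a binary exposure with frequency p0 in controls are assumptions. The slide does not state how its figure was computed, so this will not necessarily reproduce it.

```python
import math
from statistics import NormalDist

def cases_needed(p0, odds_ratio, controls_per_case=2, alpha=0.05, power=0.80):
    """p0: exposure (e.g. allele or carrier) frequency in controls.
    Returns the number of cases required; controls = cases * controls_per_case."""
    k = controls_per_case
    # Exposure frequency in cases implied by the odds ratio
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    p_bar = (p1 + k * p0) / (1 + k)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt((1 + 1 / k) * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p0 * (1 - p0) / k)
    ) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)

for p0 in (0.05, 0.10, 0.30):
    print(p0, cases_needed(p0, odds_ratio=1.5))
```

The required number of cases rises steeply as the exposure frequency falls, which is the concern the slide title points to.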
34
Growing utilization of population-based designs
  • Increasingly apparent for many diseases that
    population-based studies of unrelated
    individuals, in which case-control and cohort
    studies serve as standard designs for genetic
    association analysis, may be a practical and
    powerful approach
  • Power
  • Ease and efficiency of collection
  • Cohort design: longitudinal data and prospective
    assessment
  • Birth cohorts
  • Value of historical cohorts
  • e.g., Nurses' Health Study: 120,000 nurses
    recruited in mid-70s; 98% follow-up of living
    cohort to current day.
  • Pharmacogenetics
  • GxE studies

35
An example: UK Biobank
  • Focus on binary outcomes and gene-environment
    interactions
  • Large cohort: 500,000 individuals
  • Age range at recruitment: 45-64 yrs → 45-69 yrs
  • Comprehensive exposure assessment at recruitment
  • Lifestyle factors, environmental exposures
  • Personal and family history of health and disease
  • Subsequent monitoring via NHS information systems
  • Power set on number of events in 10 years
  • ? 40M-60M

Others: EPIC, ISIS, Million Women Study
36
Failure to replicate association
  • Failure to replicate genetic association studies
    is a problem of genuine concern
  • But - more often involves poor study design and
    execution, in particular a lack of appreciation
    for the sample sizes required to detect modest
    genetic effects and over-interpretation of
    marginal results, than undetected population
    stratification.
  • Complex human diseases
  • Initial detection and replication will likely be
    very difficult.
  • Multiple testing, laboratory/measurement error,
    and positive publication and investigator-reporting
    biases
  • Population stratification is only one (and
    possibly amongst the least) of many possible
    reasons for non-replication of association
    results.

Cardon LR, Palmer LJ. Population stratification
and spurious allelic association. The Lancet
2003; 361: 598-604.
37
Study reproducibility: risk allele frequency
  • Risk allele frequency has a much greater impact
    on power than disease prevalence for allele
    frequencies < 0.2 and > 0.8

38
Outstanding Genetics Issues
  • Extent of linkage disequilibrium across the
    genome
  • Age of expected mutations in populations
  • Effect of selection on specific genes
  • Allele frequency differences between populations
  • Population demographics
  • Whether or when subtle population differences are
    important factors in association studies
  • Effect of stratification on haplotypes

39
Ultimate Goal