Title: Association Studies: Statistical And Study Design Issues
1Lyle J. Palmer, PhD
From Genome to Disease, NHLBI Symposium, July 23/24, 2003, Washington
Association Studies: Statistical And Study Design Issues
2This lecture
- Association in context
- Genetic association analysis: some statistical
issues
- Study design: problems and issues
- The future for association analysis
3Some challenges
- Field is young and changing rapidly
- Literature can be difficult
- Example: Haseman-Elston is a standard linkage
method
- 4-5 variations on HE recently proposed
- Hard to compare approaches
- Software often free, but sometimes not well-tested
4Some challenges - continued
- Methods are sometimes oversold
- Sure thing of the day
- Collecting X affected sib pairs
- Collecting X unrelated cases and controls
- Isolated populations
- SNPs
5(No Transcript)
6Broad Genetic Epidemiology Study Design
Categories
- Linkage Analysis
- Follows meiotic events through families for
co-segregation of disease and particular genetic
variants
- Large Families
- Sibling Pairs (or other family pairs)
- Works VERY well for Mendelian diseases
- Association Studies
- Detect association between genetic variants and
disease across families; exploits linkage
disequilibrium
- Case-Control designs
- Cohort designs
- Parents-affected child trios (TDT)
- May be more appropriate for complex diseases
7Allelic architecture and mapping strategy
[Figure: magnitude of effect plotted against frequency in population; region labels include "Unlikely to exist" and "Fn. Studies"]
Slide thanks to D. Altshuler
8What determines the allelic architecture?
Evolutionary selection
Slide thanks to D. Altshuler
9Power comparison: sib-pair linkage versus
population association
Risch, Nature 2000
10Association Study Approaches
- Candidate gene search
- Limited variants or haplotypes based on prior
knowledge: expert opinion, linkage peaks
- Genome-wide scan
- Dense set of markers throughout genome
Family-based and population-based designs
11(No Transcript)
12Phenotype-genotype association
- In practical terms, an observed statistical
association between an allele and a phenotypic
trait will be due to one of three situations:
- The finding could be due to chance or artifact,
e.g., confounding or selection bias
- The allele is in linkage disequilibrium with an
allele at another locus that directly affects the
expression of the phenotype, or
- The allele itself is functional and directly
affects the expression of the phenotype.
13Candidate Allele Testing
- Test markers for association with disease
predisposition
- One approach: perform standard single-locus
chi-squared tests
By alleles: OR = ad/bc; test: $\chi^2_1$ (1 df)
By genotype: test: $\chi^2_2$ (2 df)
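The allele-based test can be sketched as follows. This is a minimal illustration, assuming a 2x2 table of allele counts with cases in the first row and controls in the second (the slide's table layout did not survive, so the meaning of a, b, c, d here is an assumption). The function name and the counts are hypothetical.

```python
import math

def allelic_test(a, b, c, d):
    """Odds ratio and 1-df Pearson chi-squared for a 2x2 allele-count table.

    Assumed layout (not given in the transcript):
                   allele 1   allele 2
        cases          a          b
        controls       c          d
    """
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-squared variable with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

# Hypothetical allele counts, for illustration only
or_, chi2, p = allelic_test(a=240, b=760, c=180, d=820)
print(f"OR = {or_:.2f}, chi2 = {chi2:.2f}, p = {p:.4f}")
```

The genotype-based test works the same way on a 2x3 table of genotype counts, giving a statistic with 2 df.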
14Single-marker logistic regression
- H0: $\beta_i = 0$
- Genetic model interpretations
- Assume 11 genotype coding represents genotype
with lowest absolute risk (baseline)
- $\beta_1 = \beta_2 = 0$: no association with that
polymorphism
- $\beta_1 = 0$, $\beta_2 > 0$: (completely) recessive
- $\beta_1 = \beta_2 > 0$: (completely) dominant
- $0 < \beta_1 < \beta_2$: additive or multiplicative
- Note: This can be extended through GLM to many
types of outcomes (rather than simply odds of
disease/not disease, as above)
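The model equation itself is missing from the transcript. A plausible reconstruction, assuming indicator variables for the 12 and 22 genotypes with 11 as the baseline (the names $x_{12}$ and $x_{22}$ are assumptions, not taken from the slide), would be:

```latex
\[
\operatorname{logit} P(D = 1 \mid G) = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22},
\qquad
x_{12} = \mathbf{1}[G = 12], \quad x_{22} = \mathbf{1}[G = 22]
\]
\[
\mathrm{OR}_{12\ \mathrm{vs}\ 11} = e^{\beta_1},
\qquad
\mathrm{OR}_{22\ \mathrm{vs}\ 11} = e^{\beta_2}
\]
```

Under this coding, the special case $\beta_2 = 2\beta_1$ corresponds to a per-allele effect that is additive on the log-odds scale, i.e. multiplicative on the odds.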
15LD Gene Mapping
- General idea
- Exploit the phenomenon of linkage disequilibrium
(LD) between alleles of closely linked markers to
identify genetic regions associated with disease
status.
- i.e., test for LD between marker loci and
disease allele
- LD strength (magnitude) ∝ 1 / r
- LD will be highest at areas of the genome that
are closest to the disease locus
- Use this to pinpoint (localize, fine-map) the
disease gene region
- E.g.
- Fine-mapping
- From linkage analysis, may have 10 cM candidate
region. Next add dense set of markers in
significant region and perform LD analysis to
narrow region much further.
- What about whole-genome approach?
- Some have suggested genome-wide LD studies are
feasible with densely spaced markers.
16(No Transcript)
17SNP association with disease allele
[Figure: a gene containing a disease allele, flanked by marker SNPs]
- How closely must SNPs be spaced?
- 30,000 to 1,000,000 SNPs to span the genome?
- Numbers will depend on local haplotype structure,
amount of LD
- ~30kb blocks → 100,000 independent
SNPs to span genome
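The block-size arithmetic can be checked with a short sketch. It assumes a genome of roughly 3 billion base pairs and one tagging SNP per LD block; both figures are assumptions for illustration, not values stated on the slide.

```python
GENOME_LENGTH_BP = 3_000_000_000  # assumed approximate genome length

def snps_to_span_genome(block_size_bp, snps_per_block=1):
    """SNPs needed if every LD block is tagged by a fixed number of SNPs."""
    return GENOME_LENGTH_BP // block_size_bp * snps_per_block

for block_kb in (3, 30, 100):
    n = snps_to_span_genome(block_kb * 1000)
    print(f"{block_kb} kb blocks: {n:,} SNPs")
```

Under these assumptions, block sizes between about 3 kb and 100 kb reproduce the quoted range of 1,000,000 down to 30,000 SNPs, with 30 kb blocks giving 100,000.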
18Multiple testing
- Multiple comparisons: 50,000 genes. Even if only
one functional locus/gene tested, very high
number of false positives
- Solutions
- Simulation: empirical p-values
- Replication
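One way to obtain an empirical p-value by simulation is a permutation test. The sketch below is a minimal example, assuming each chromosome is recorded as 1 (carries the allele) or 0, and using the absolute difference in allele frequency as the test statistic. The function name, the statistic, and the number of permutations are illustrative choices, not the speaker's method.

```python
import random

def empirical_p_value(case_alleles, control_alleles, n_perm=2000, seed=0):
    """Permutation p-value for a case-control difference in allele frequency."""
    rng = random.Random(seed)
    n_cases = len(case_alleles)
    pooled = list(case_alleles) + list(control_alleles)

    def freq_diff(values):
        cases, controls = values[:n_cases], values[n_cases:]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    observed = freq_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any link between allele and case status
        if freq_diff(pooled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 60/100 case alleles vs 30/100 control alleles
p = empirical_p_value([1] * 60 + [0] * 40, [1] * 30 + [0] * 70)
print(f"empirical p = {p:.4f}")
```

In a multi-marker scan, the same idea extends to permuting case status once per replicate and recording the largest statistic across all markers, which accounts for the number of tests performed.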
19What is a haplotype?
- Some definitions
- Haplotype: set of particular alleles at separate
loci on the same transmitted chromosome
- Linkage Disequilibrium (LD): association between
those particular alleles due to their proximity
on the same chromosome (due to linkage)
- Haplotype-based analyses provide increased
informativity
- Each allele (or mutation) is associated with a
particular evolutionary history and will thus
have a unique chromosomal background, or
haplotype.
- More powerful
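The slides do not name a specific LD measure. As an illustration, the sketch below computes three commonly used ones (D, D' and r^2) for two biallelic loci from a haplotype frequency and the two allele frequencies. The function name is hypothetical, and both loci are assumed to be polymorphic.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B (each strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b  # deviation from independence
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Hypothetical frequencies, for illustration only
print(ld_measures(p_ab=0.30, p_a=0.40, p_b=0.50))
```

With this definition D' keeps the sign of D; many reports quote its absolute value instead.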
20Motivation for Haplotype-based analysis
- Advantage of combinatorial approach
- Haplotypes important from population genetics
standpoint
- Increase ability to identify regions that are IBD
- Biologically, combinations of alleles in a region
may be functionally important, so set of variants
on a chromosome may be the causative composite
allele rather than a particular nucleotide at a
particular SNP
- Haplotype analyses can be more powerful than
single-locus analyses when LD is exploited
21Haplotype vs Single-locus Analyses
- Consider a 2-locus system with a disease-bearing
haplotype
- A-dx-B
OR(A-B) = 2.0
Slide thanks to Dani Fallin
22Haplotype Determination Options
- Collect and genotype family members
- ?150 effort, cost
- Family members not available
- Laboratory-based techniques
- Chromosome isolation
- Long-range PCR
- Limited results
- Time consuming
- Cost-prohibitive
- Statistical estimation
- Sequential rules (Clark, 1990)
- Likelihood-based E-M algorithms
- (Hill, 1974; Long et al., '95; Hawley & Kidd, '95;
Excoffier & Slatkin, '95; Fallin & Schork, 2000)
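To make the likelihood-based E-M idea concrete, here is a minimal sketch for the simplest case of two biallelic SNPs, where only double heterozygotes have ambiguous phase. It assumes Hardy-Weinberg proportions and genotypes coded as counts (0, 1 or 2) of allele "1" at each locus. It is an illustration of the general technique, not the algorithm of any one of the cited papers, and the function name is hypothetical.

```python
def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), each the count (0, 1 or 2) of allele "1".
    Returns a dict keyed by haplotype (allele at locus 1, allele at locus 2).
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: split the double heterozygote between its two phases
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is unambiguous: at most one locus is heterozygous
                a = [0, 0] if g1 == 0 else [1, 1] if g1 == 2 else [0, 1]
                b = [0, 0] if g2 == 0 else [1, 1] if g2 == 2 else [0, 1]
                counts[(a[0], b[0])] += 1
                counts[(a[1], b[1])] += 1
        # M-step: haplotype frequencies are expected counts per chromosome
        total = 2 * len(genotypes)
        freq = {h: c / total for h, c in counts.items()}
    return freq

# Hypothetical sample, for illustration only
sample = [(0, 0)] * 10 + [(2, 2)] * 10 + [(1, 1)] * 10
print(em_haplotype_frequencies(sample))
```

A fixed iteration count is used for brevity; a practical implementation would stop on convergence of the likelihood and try several starting values.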
23Relative importance of low risk alleles
- Population attributable fraction: the proportion
of disease that would be eliminated if the allele
was eliminated from the population
- For GRR < 2, alleles with frequency < 0.15 have very
little impact on disease in the population.
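The slide does not give a formula for the population attributable fraction. One common choice is Levin's formula, sketched below under two loudly stated assumptions: a dominant model (carriers of at least one copy share the same relative risk) and Hardy-Weinberg proportions. The function names and the grid of values are illustrative only.

```python
def carrier_frequency(q):
    """Fraction of people carrying at least one copy, assuming Hardy-Weinberg."""
    return 1 - (1 - q) ** 2

def attributable_fraction(exposed_fraction, relative_risk):
    """Levin's population attributable fraction."""
    excess = exposed_fraction * (relative_risk - 1)
    return excess / (1 + excess)

# Tabulate PAF over a grid of allele frequencies and genotype relative risks
for q in (0.01, 0.05, 0.15, 0.30):
    for grr in (1.2, 1.5, 2.0):
        paf = attributable_fraction(carrier_frequency(q), grr)
        print(f"q = {q:.2f}, GRR = {grr:.1f}: PAF = {paf:.3f}")
```

The table makes it easy to see how quickly the attributable fraction shrinks as either the allele frequency or the relative risk falls; other genetic models would give different numbers.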
24Study design
- Targeting SNPs likely to be important in the
population
- For GRR < 2, alleles/haplotypes with frequency
< 0.10-0.15 have little impact on disease in the
population
- If the goal is to develop predictive or
diagnostic tests, such alleles are of little
commercial interest
- Studies can be designed to have high power for
moderate (≥ 0.10) allele frequencies and GRR ≥ 2:
sample sizes on the order of 1000 cases and 1000
controls are a good start
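A rough power calculation for such a design can be sketched with a normal approximation to a two-sided comparison of allele frequencies. The sketch assumes a per-allele odds ratio as a stand-in for the GRR, treats the two alleles of each person as independent, and uses the control allele frequency as the population frequency. These are simplifying assumptions, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def allelic_power(control_freq, odds_ratio, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided allelic test of two proportions."""
    p0 = control_freq
    odds = odds_ratio * p0 / (1 - p0)
    p1 = odds / (1 + odds)                  # implied allele frequency in cases
    m1, m0 = 2 * n_cases, 2 * n_controls    # alleles sampled per group
    p_bar = (m1 * p1 + m0 * p0) / (m1 + m0)
    se_null = sqrt(p_bar * (1 - p_bar) * (1 / m1 + 1 / m0))
    se_alt = sqrt(p1 * (1 - p1) / m1 + p0 * (1 - p0) / m0)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability of exceeding the critical value in the expected direction
    return NormalDist().cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)

for n in (200, 500, 1000):
    print(n, round(allelic_power(0.10, 2.0, n, n), 3))
```

A stricter alpha, as needed when many markers are tested, can be passed through the `alpha` argument and reduces power for the same sample size.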
25Association Studies: Potential Causes of
Inconsistent Results
- Population stratification: differences between
cases and controls; most often cited reason
- Genetic heterogeneity: different genetic
mechanisms in different populations
- Random error: false positive/false negative
results
- Study design/analysis problems
- Poorly defined phenotypes
- Failure to correct for sub-group analysis and
multiple comparisons
- Poor control group selection
- Small sample sizes
- Failure to attempt replication
- Silverman & Palmer, Am J Respir Cell Mol Biol 2000
- Cardon & Bell, Nat. Rev. Genet. 2001
26Population stratification
- If cases and controls have different genetic
backgrounds (are from different genetic
sub-populations), there may be inherent gene
frequency differences, increasing the possibility
of a false positive (or negative) result
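A small numerical sketch shows how this can happen. It assumes two hypothetical sub-populations that differ in both carrier frequency and disease risk, with no association between carrier status and disease inside either one. All numbers are invented for illustration.

```python
def expected_table(n, carrier_freq, disease_risk):
    """Expected 2x2 counts when carrier status and disease are independent."""
    a = n * carrier_freq * disease_risk              # carrier cases
    b = n * (1 - carrier_freq) * disease_risk        # non-carrier cases
    c = n * carrier_freq * (1 - disease_risk)        # carrier controls
    d = n * (1 - carrier_freq) * (1 - disease_risk)  # non-carrier controls
    return a, b, c, d

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

# Hypothetical sub-populations (values are illustrative assumptions)
pop1 = expected_table(1000, carrier_freq=0.2, disease_risk=0.1)
pop2 = expected_table(1000, carrier_freq=0.6, disease_risk=0.3)
pooled = tuple(x + y for x, y in zip(pop1, pop2))

print("OR within population 1:", round(odds_ratio(*pop1), 3))
print("OR within population 2:", round(odds_ratio(*pop2), 3))
print("OR in pooled sample:   ", round(odds_ratio(*pooled), 3))
```

The within-population odds ratios are 1 by construction, while the pooled odds ratio is not, which is the false positive pattern described above.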
27Knowler, W. C., R. C. Williams, et al. (1988).
Gm3;5,13,14 and type 2 diabetes mellitus: an
association in American Indians with genetic
admixture. Am J Hum Genet 43(4): 520-6.
28Designs for Family-based LD studies
- Sibling controls (discordant siblings)
- Case-parent trios
- Nuclear families
- Extended families
[Figure: diagram labeled "Ancestral Population", "Disease" and "Genetic marker"]
29Study Designs used for LD mapping
- Family-based Designs for Association Studies
- Advantages
- Not susceptible to confounding due to population
substructure
- Tests for linkage and association
- Can test for parent-of-origin effects
- Disadvantages
- Inefficient recruitment, only heterozygous
parents informative
- Often cannot test for environmental main-effects
- Family members often not available (e.g.,
late-onset diseases)
30TDT (transmission-disequilibrium test)
- Basic idea of TDT
- Disease alleles are transmitted from parents to
offspring
- Marker alleles in LD with these alleles will also
be transmitted preferentially to affected
offspring
- Test if heterozygous parents transmit a
particular marker allele to affected offspring
more frequently than expected
- Looks for excess transmission of particular
alleles from parents to affected children
- Controls are non-transmitted alleles
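The slide gives the idea but not the statistic. In its usual biallelic form the TDT is a McNemar-type test on transmissions from heterozygous parents, which can be sketched as below. The function name and the counts are hypothetical.

```python
import math

def tdt(transmitted, not_transmitted):
    """TDT statistic and p-value.

    transmitted: times heterozygous parents passed the marker allele
                 to an affected child
    not_transmitted: times they passed the other allele instead
    """
    b, c = transmitted, not_transmitted
    chi2 = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-squared variable with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 60 transmissions vs 40 non-transmissions
chi2, p = tdt(60, 40)
print(f"TDT chi2 = {chi2:.2f}, p = {p:.4f}")
```

Only heterozygous parents contribute to the counts, which is why homozygous parents are uninformative for this test.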
31Population stratification
- Random panels of SNPs can be used to test for
population sub-structure (many new methods).
- First studies: little empirical evidence of
stratification in large samples from North
America, Japan, Latin America and Europe.
- Problems with Trios and TDT
- Inefficient in genotyping and sampling
- Difficult (or impossible) to collect
- Can only really have a big problem if doing poor
epidemiology.
- Potential for bias has been greatly exaggerated.
- Fear of population stratification led to
substantial changes in study design and analytic
methods and widespread adoption of trios design
- Now no reason to adopt a family-based design
solely to protect against stratification
Cardon LR, Palmer LJ. Population stratification
and spurious allelic association. The Lancet
2003;361:598-604.
32Why Case/Control?
- Advantages
- Methodology is well-known
- Convenient to collect
- Common
- Very large samples
- More efficient recruitment than family-based
sampling
- Simultaneous assessment of disease allele
frequency, penetrance, and AR
- Unrelated controls can provide increased power
- Limitations
- 1. Possible Population Stratification
- 2. Need for highly dense marker sets (capture LD)
- 3. Lack of phase information
- Lack of consistency of results
These can be overcome!
1. Assessment and genomic control of stratification
2. SNP maps
3. Imputed haplotypes
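Genomic control of stratification can be sketched in a few lines. The version below assumes 1-df chi-squared statistics from a panel of presumed-null markers, estimates the inflation factor from their median, and never deflates below 1. The function name and the example values are hypothetical.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-squared distribution with 1 df

def genomic_control(test_chi2, null_marker_chi2):
    """Return (corrected statistic, inflation factor lambda)."""
    lam = max(1.0, median(null_marker_chi2) / CHI2_1DF_MEDIAN)
    return test_chi2 / lam, lam

# Hypothetical statistics from unlinked markers, for illustration only
null_stats = [0.2, 0.5, 0.9, 1.4, 0.7, 0.3, 1.1]
corrected, lam = genomic_control(8.0, null_stats)
print(f"lambda = {lam:.2f}, corrected chi2 = {corrected:.2f}")
```

A lambda close to 1 suggests little inflation from stratification; larger values shrink every test statistic by the same factor.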
33Sample size requirements for case-control
analyses of SNPs (2 controls per case;
detectable difference of OR ≥ 1.5; power = 80%).
Statistical power: an increasing concern
Palmer, L. J. and W. O. C. M. Cookson (2001).
Using Single Nucleotide Polymorphisms (SNPs) as
a means to understanding the pathophysiology of
asthma. Respiratory Research 2: 102-112.
34Growing utilization of population-based designs
- Increasingly apparent for many diseases that
population-based studies of unrelated
individuals, in which case-control and cohort
studies serve as standard designs for genetic
association analysis, may be a practical and
powerful approach
- Power
- Ease and efficiency of collection
- Cohort design: longitudinal data and prospective
assessment
- Birth cohorts
- Value of historical cohorts
- e.g., Nurses' Health Study: 120,000 nurses
recruited in mid-70s; 98% follow-up of living
cohort to current day.
- Pharmacogenetics
- GxE studies
35An example: UK Biobank
- Focus on binary outcomes and gene-environment
interactions
- Large cohort: 500,000 individuals
- Age range at recruitment: 45-64 yrs → 45-69 yrs
- Comprehensive exposure assessment at recruitment
- Lifestyle factors, environmental exposures
- Personal and family history of health and disease
- Subsequent monitoring via NHS information systems
- Power set on number of events in 10 years
- £40M-60M
Others: EPIC, ISIS, Million Women Study
36Failure to replicate association
- Failure to replicate genetic association studies
is a problem of genuine concern
- But it more often involves poor study design and
execution, in particular a lack of appreciation
for the sample sizes required to detect modest
genetic effects and over-interpretation of
marginal results, than undetected population
stratification.
- Complex human diseases
- Initial detection and replication will likely be
very difficult.
- Multiple testing, laboratory/measurement error,
and positive publication and
investigator-reporting biases
- Population stratification is only one (and
possibly amongst the least) of many possible
reasons for non-replication of association
results.
Cardon LR, Palmer LJ. Population stratification
and spurious allelic association. The Lancet
2003;361:598-604.
37Study reproducibility: risk allele frequency
- Risk allele frequency has a much greater impact
on power than disease prevalence for allele
frequencies < 0.2 and > 0.8
38Outstanding Genetics Issues
- Extent of linkage disequilibrium across the
genome
- Age of expected mutations in populations
- Effect of selection on specific genes
- Allele frequency differences between populations
- Population demographics
- Whether or when subtle population differences are
important factors in association studies
- Effect of stratification on haplotypes
39Ultimate Goal