Title: Association Studies: Statistical And Study Design Issues
1Genomics
Flemming Pociot, MD, DMSc
Steno Diabetes Center
STAR Research in Diabetology
Epidemiology
Association Studies: Statistical and Study Design Issues
2This lecture
- Association in context
- Genetic association analysis: some statistical
issues
- Study design: problems and issues
- The future for association analysis
3Some challenges
- Field is young and changing rapidly
- Literature can be difficult
- Example: Haseman-Elston is a standard linkage
method
- 4-5 variations on HE recently proposed
- Hard to compare approaches
- Software often free, but sometimes not well-tested
4Some challenges - continued
- Methods are sometimes oversold.
- Sure thing of the day
- Collecting X affected sib pairs
- Collecting X unrelated cases and controls
- Isolated populations
- SNPs
5Evidence for genetic effect on phenotype
[Figure: activity plotted against IBD sharing (0, 1, 2)
and against genotype (aa, ab, bb); variance by
relationship (population, MZ, sibs)]
6How to analyze the genetics underlying a specific
trait or phenotype?
7(No Transcript)
8Broad Genetic Epidemiology Study Design
Categories
- Linkage Analysis
- Follows meiotic events through families for
co-segregation of disease and particular genetic
variants
- Large Families
- Sibling Pairs (or other family pairs)
- Works VERY well for Mendelian diseases
- Association Studies
- Detect association between genetic variants and
disease across families; exploits linkage
disequilibrium
- Case-Control designs
- Cohort designs
- Parents and affected child trios (TDT)
- May be more appropriate for complex diseases
9Allelic architecture and mapping strategy
[Figure: magnitude of effect plotted against frequency
in population; regions labelled "Unlikely to exist"
and "Functional studies"]
10What determines the allelic architecture?
Evolutionary selection
11Association Study Approaches
- Candidate gene search
- Limited variants or haplotypes based on prior
knowledge: expert opinion, linkage peaks
- Genome-wide scan
- Dense set of markers throughout genome
Family-based and population-based designs
12Phenotype-genotype association
- In practical terms, an observed statistical
association between an allele and a phenotypic
trait will be due to one of three situations:
- The finding could be due to chance or artifact,
e.g., confounding or selection bias
- The allele is in linkage disequilibrium with an
allele at another locus that directly affects the
expression of the phenotype, or
- The allele itself is functional and directly
affects the expression of the phenotype.
13Candidate Allele Testing
- Test markers for association with disease
predisposition
- One approach: perform standard single-locus
chi-squared tests
By alleles: $\mathrm{OR} = ad/bc$; test: $\chi^2$, 1 df
By genotype: test: $\chi^2$, 2 df
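The two single-locus tests can be sketched in Python as follows. This is a minimal illustration, not the lecture's own analysis: the genotype counts are hypothetical, allele b is assumed to be the putative risk allele, and the cell layout (a, b = risk and other allele counts in cases; c, d = the same in controls) is an assumption about the 2x2 table behind OR = ad/bc.

```python
import math

def chi2_stat(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

def chi2_pvalue(stat, df):
    """Upper-tail p-value; closed forms are used for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")

def allele_counts(genotypes):
    """(aa, ab, bb) genotype counts -> [risk allele b, other allele a] counts."""
    aa, ab, bb = genotypes
    return [2 * bb + ab, 2 * aa + ab]

# Hypothetical genotype counts in the order (aa, ab, bb)
cases = [30, 50, 20]
controls = [45, 45, 10]

# By alleles: 2x2 table, odds ratio ad/bc, chi-squared with 1 df
(a, b), (c, d) = allele_counts(cases), allele_counts(controls)
odds_ratio = (a * d) / (b * c)
allelic_stat = chi2_stat([[a, b], [c, d]])
allelic_p = chi2_pvalue(allelic_stat, df=1)

# By genotype: 2x3 table, chi-squared with 2 df
genotypic_stat = chi2_stat([cases, controls])
genotypic_p = chi2_pvalue(genotypic_stat, df=2)

print(f"OR = {odds_ratio:.2f}, allelic p = {allelic_p:.4f}, genotypic p = {genotypic_p:.4f}")
```

The allelic test counts chromosomes rather than people, so it implicitly assumes the two alleles within an individual are independent; the genotypic test does not.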
14LD Gene Mapping
- General idea
- Exploit the phenomenon of linkage disequilibrium
(LD) between alleles of closely linked markers to
identify genetic regions associated with disease
status.
- i.e., test for LD between marker loci and
disease allele
- LD strength (magnitude) $\propto 1/r$
- LD will be highest at areas of the genome that
are closest to the disease locus
- Use this to pinpoint (localize, fine-map) the
disease gene region
- E.g.
- Fine-mapping
- From linkage analysis, may have 10 cM candidate
region. Next add dense set of markers in
significant region and perform LD analysis to
narrow region much further.
- What about whole-genome approach?
- Some have suggested genome-wide LD studies are
feasible with densely spaced markers.
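To make "LD strength" concrete, here is a hedged Python sketch of the usual pairwise LD measures (D, D' and r^2) and of the textbook decay of D with recombination. The function names and the example frequencies are illustrative assumptions, and the decay formula D_t = D_0 (1 - theta)^t is the standard population-genetics result rather than something stated on the slide.

```python
def ld_measures(p_ab, p_a, p_b):
    """D, D' and r^2 from the haplotype frequency P(AB) and allele frequencies P(A), P(B)."""
    d = p_ab - p_a * p_b
    # D' scales D by the largest magnitude it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

def ld_after(d0, theta, generations):
    """Expected D after some generations with recombination fraction theta."""
    return d0 * (1 - theta) ** generations

# Hypothetical frequencies: P(AB) = 0.40, P(A) = P(B) = 0.50
d, d_prime, r2 = ld_measures(0.40, 0.50, 0.50)

# Tightly linked markers keep their LD far longer than loosely linked ones
near = ld_after(d, theta=0.001, generations=100)
far = ld_after(d, theta=0.05, generations=100)
print(d, d_prime, r2, near, far)
```

The contrast between `near` and `far` is the reason LD is expected to peak close to the disease locus.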
15(No Transcript)
16Multiple testing
- Multiple comparisons: 30,000 genes. Even if only
one functional locus/gene tested, very high
number of false positives
- Solutions
- Simulation: empirical p-values
- Replication
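One common way to obtain an empirical p-value by simulation is a label permutation test on the largest statistic across all markers. The Python sketch below assumes that approach; the test statistic (absolute case-control difference in mean allele count), the data and every name are hypothetical choices for illustration.

```python
import random

def max_stat(genos, labels):
    """Largest absolute case-control difference in mean allele count over all markers."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    best = 0.0
    for marker in genos:  # one list of 0/1/2 allele counts per marker
        case_sum = sum(g for g, y in zip(marker, labels) if y == 1)
        ctrl_sum = sum(marker) - case_sum
        best = max(best, abs(case_sum / n_case - ctrl_sum / n_ctrl))
    return best

def empirical_p(genos, labels, n_perm=1000, seed=0):
    """Permutation p-value for the maximum statistic, adjusted for the number of markers."""
    rng = random.Random(seed)
    observed = max_stat(genos, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        if max_stat(genos, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 20 cases then 20 controls, 10 markers
labels = [1] * 20 + [0] * 20
data_rng = random.Random(1)
genos = [[2] * 20 + [0] * 20]  # one marker that tracks disease status perfectly
genos += [[data_rng.choice((0, 1, 2)) for _ in labels] for _ in range(9)]  # nine noise markers

p_value = empirical_p(genos, labels, n_perm=200)
print(p_value)
```

Because each permutation recomputes the maximum over every marker, the resulting p-value already accounts for the number of tests and for correlation between markers.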
17(No Transcript)
18Sharing identical by descent
Alleles shared IBD: 2, 1, 0
Expected ratio: 0.25, 0.5, 0.25
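These expected proportions can be derived by enumeration. The Python sketch below assumes the pair in question are full siblings (the slide does not say so explicitly) and lists every equally likely combination of parental alleles the two sibs can receive.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sib_ibd_distribution():
    """Probability that two full sibs share 0, 1 or 2 alleles identical by descent."""
    counts = Counter()
    # Each sib receives one of two paternal and one of two maternal alleles,
    # giving 2**4 = 16 equally likely transmission patterns for the pair.
    for pat1, mat1, pat2, mat2 in product((0, 1), repeat=4):
        shared = int(pat1 == pat2) + int(mat1 == mat2)
        counts[shared] += 1
    total = sum(counts.values())
    return {k: Fraction(v, total) for k, v in counts.items()}

ibd = sib_ibd_distribution()
print(ibd)
```

Linkage methods based on affected sib pairs look for marker regions where observed sharing departs from this null distribution.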
19Identity by state (IBS) is not the same as
identity by descent (IBD)
20What is a haplotype?
- Some definitions
- Haplotype: set of particular alleles at separate
loci on the same transmitted chromosome
- Linkage Disequilibrium (LD): association between
those particular alleles due to their proximity
on the same chromosome (due to linkage)
- Haplotype-based analyses provide increased
informativity
- Each allele (or mutation) is associated with a
particular evolutionary history and will thus
have a unique chromosomal background, or
haplotype.
- More powerful
21Motivation for Haplotype-based analysis
- Advantage of combinatorial approach
- Haplotypes important from population genetics
standpoint
- Increase ability to identify regions that are IBD
- Biologically, combinations of alleles in a region
may be functionally important, so set of variants
on a chromosome may be the causative composite
allele rather than a particular nucleotide at a
particular SNP
- Haplotype analyses can be more powerful than
single-locus analyses when LD is exploited
22Haplotype vs Single-locus Analyses
- Consider a 2-locus system with a disease-bearing
haplotype
- A-dx-B
OR(A-B) = 2.0
23Haplotype Determination Options
- Collect and genotype family members
- ?150 effort, cost
- Family members not available
- Laboratory-based techniques
- Chromosome isolation
- Long-range PCR
- Limited results
- Time consuming
- Cost-prohibitive
- Statistical estimation
- Sequential rules (Clark, 1990)
- Likelihood-based E-M algorithms
- (Hill, 1974; Long et al., 1995; Hawley & Kidd, 1995;
Excoffier & Slatkin, 1995; Fallin & Schork, 2000)
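The likelihood-based E-M idea can be sketched for the simplest case of two biallelic SNPs, where only double heterozygotes have ambiguous phase. The Python below is a simplified illustration in the spirit of those methods, not a reimplementation of any of the cited algorithms; the allele labels, the 3x3 count layout and the example data are assumptions.

```python
def em_haplotype_freqs(n, n_iter=100, tol=1e-10):
    """Estimate AB, Ab, aB, ab frequencies from unphased two-SNP genotype counts.

    n[i][j] = number of individuals with i copies of allele A at locus 1
    and j copies of allele B at locus 2 (i, j in 0..2).
    """
    total_haps = 2 * sum(sum(row) for row in n)
    # Haplotype counts that are certain, from every genotype except the double heterozygote
    base = {
        "AB": 2 * n[2][2] + n[2][1] + n[1][2],
        "Ab": 2 * n[2][0] + n[2][1] + n[1][0],
        "aB": 2 * n[0][2] + n[1][2] + n[0][1],
        "ab": 2 * n[0][0] + n[1][0] + n[0][1],
    }
    double_het = n[1][1]
    freqs = {h: 0.25 for h in base}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote carries AB/ab rather than Ab/aB
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts
        new = {
            "AB": (base["AB"] + double_het * w) / total_haps,
            "ab": (base["ab"] + double_het * w) / total_haps,
            "Ab": (base["Ab"] + double_het * (1 - w)) / total_haps,
            "aB": (base["aB"] + double_het * (1 - w)) / total_haps,
        }
        change = max(abs(new[h] - freqs[h]) for h in new)
        freqs = new
        if change < tol:
            break
    return freqs

# Hypothetical sample in complete LD: 25 ab/ab, 50 double heterozygotes, 25 AB/AB
counts = [[25, 0, 0], [0, 50, 0], [0, 0, 25]]
freqs = em_haplotype_freqs(counts)
print(freqs)
```

With more than two loci the number of compatible haplotype pairs grows quickly, which is where the published methods differ in how they enumerate or prune configurations.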
24Relative importance of low risk alleles
- Population attributable fraction: the proportion
of disease that would be eliminated if the allele
was eliminated from the population
- For GRR < 2, alleles with frequency < 0.15 have very
little impact on disease in the population.
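As a rough illustration of how the population attributable fraction depends on allele frequency and genotype relative risk (GRR), the Python sketch below uses PAF = 1 - 1/E[RR]. It assumes Hardy-Weinberg proportions and a multiplicative model (risk GRR per copy of the allele); neither assumption comes from the slide, and other genetic models give different values.

```python
def paf_multiplicative(q, grr):
    """Population attributable fraction for a risk allele of frequency q.

    Assumes Hardy-Weinberg genotype proportions and relative risks of
    1, grr and grr**2 for 0, 1 and 2 copies of the allele.
    """
    p = 1 - q
    mean_rr = p * p + 2 * p * q * grr + q * q * grr * grr  # equals (p + q*grr)**2
    return 1 - 1 / mean_rr

# Hypothetical grid of allele frequencies and relative risks
for q in (0.01, 0.05, 0.15, 0.30):
    row = ", ".join(f"GRR {g}: {paf_multiplicative(q, g):.3f}" for g in (1.2, 1.5, 2.0))
    print(f"q = {q:.2f} -> {row}")
```

The function makes it easy to see how quickly the attributable fraction shrinks as the allele becomes rarer or its effect weaker.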
25Study design
- Targeting SNPs likely to be important in the
population
- For GRR < 2, alleles/haplotypes with frequency
< 0.10-0.15 have little impact on disease in the
population
- If the goal is to develop predictive or
diagnostic tests, such alleles are of little
commercial interest
- Studies can be designed to have high power for
moderate (≥ 0.10) allele frequencies and GRR ≥ 2;
sample sizes on the order of 1000 cases and 1000
controls are a good start
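A back-of-envelope power calculation for this kind of design can be sketched in Python as below. It is an approximation under stated assumptions, not the calculation behind the slide: a two-sided allelic (two-proportion) z-test, a multiplicative model, control allele frequency equal to the population frequency, and only the upper rejection tail counted.

```python
import math
from statistics import NormalDist

def allelic_power(q, grr, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided allelic test for a risk allele of frequency q."""
    norm = NormalDist()
    # Under a multiplicative model the risk allele frequency in cases is q*grr / (1 - q + q*grr)
    p_case = q * grr / (1 - q + q * grr)
    p_ctrl = q  # assumption: controls reflect the general population
    alleles_case, alleles_ctrl = 2 * n_cases, 2 * n_controls
    p_bar = (alleles_case * p_case + alleles_ctrl * p_ctrl) / (alleles_case + alleles_ctrl)
    se_null = math.sqrt(p_bar * (1 - p_bar) * (1 / alleles_case + 1 / alleles_ctrl))
    se_alt = math.sqrt(p_case * (1 - p_case) / alleles_case
                       + p_ctrl * (1 - p_ctrl) / alleles_ctrl)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf((abs(p_case - p_ctrl) - z_crit * se_null) / se_alt)

# Hypothetical scenarios around the design suggested above
for q, grr in ((0.10, 2.0), (0.10, 1.3), (0.02, 1.3)):
    print(q, grr, round(allelic_power(q, grr, 1000, 1000), 3))
```

Lowering `alpha` to reflect multiple testing reduces the power for any fixed sample size, which is one reason genome-wide designs need larger samples than single candidate tests.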
26Association Studies: Potential Causes of
Inconsistent Results
- Population stratification: differences between
cases and controls; most often cited reason
- Genetic heterogeneity: different genetic
mechanisms in different populations
- Random error: false positive/false negative
results
- Study design/analysis problems
- Poorly defined phenotypes
- Failure to correct for sub-group analysis and
multiple comparisons
- Poor control group selection
- Small sample sizes
- Failure to attempt replication
27Population stratification
- If cases and controls have different genetic
backgrounds (are from different genetic
sub-populations),
- There may be inherent gene frequency differences,
increasing the possibility of a false positive
(or negative) result
- Association is due to ancestral population of
origin rather than to linkage disequilibrium
between the disease and marker loci
[Diagram: population of origin linked to both
genetic marker and disease]
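The confounding described here can be shown with a small expected-value calculation. In the Python sketch below the marker has no effect on disease within either sub-population, yet the pooled odds ratio is well away from 1 because allele frequency and disease prevalence both differ by population. All numbers and names are hypothetical.

```python
def pooled_allelic_or(strata):
    """Expected pooled allelic odds ratio when the marker is unrelated to disease within strata.

    strata: list of (population share, marker allele frequency, disease prevalence).
    """
    case_allele = sum(w * prev * f for w, f, prev in strata)
    case_other = sum(w * prev * (1 - f) for w, f, prev in strata)
    ctrl_allele = sum(w * (1 - prev) * f for w, f, prev in strata)
    ctrl_other = sum(w * (1 - prev) * (1 - f) for w, f, prev in strata)
    return (case_allele * ctrl_other) / (case_other * ctrl_allele)

# Two equally sized sub-populations differing in allele frequency AND prevalence
mixed = pooled_allelic_or([(0.5, 0.10, 0.01), (0.5, 0.50, 0.05)])

# A single homogeneous population: no spurious association
homogeneous = pooled_allelic_or([(1.0, 0.30, 0.02)])

# Allele frequencies differ but prevalence does not: also no spurious association
same_prevalence = pooled_allelic_or([(0.5, 0.10, 0.03), (0.5, 0.50, 0.03)])

print(round(mixed, 2), round(homogeneous, 2), round(same_prevalence, 2))
```

The last case shows that a frequency difference alone is not enough; the population of origin must be related to both the marker and the disease for the bias to arise.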
28(No Transcript)
29Designs for Family-based LD studies
- Sibling controls (discordant siblings)
- Case-parent trios
- Nuclear families
- Extended families
30Study Designs used for LD mapping
- Family-based Designs for Association Studies
- Advantages
- Not susceptible to confounding due to population
substructure
- Tests for linkage and association
- Can test for parent-of-origin effects
- Disadvantages
- Inefficient recruitment, only heterozygous
parents informative
- Often cannot test for environmental main-effects
- Family members often not available (e.g.,
late-onset diseases)
31TDT (transmission-disequilibrium test)
- Basic idea of TDT
- Disease alleles are transmitted from parents to
offspring
- Marker alleles in LD with these alleles will also
be transmitted preferentially to affected
offspring
- Test if heterozygous parents transmit a
particular marker allele to affected offspring
more frequently than expected
- Looks for excess transmission of particular
alleles from parents to affected children
- Controls are non-transmitted alleles
- For each individual, have 2x2 table of 0s, 1s, or
2s
- Use all such tables to get a matched chi-square
test for excess occurrence in cells b and c:
McNemar's test
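The McNemar-type statistic can be sketched in Python for a biallelic marker. Here b counts heterozygous parents who transmit the allele of interest and c counts those who transmit the other allele, so the statistic is (b - c)^2 / (b + c) on 1 df. The input format and the example counts are assumptions made for illustration.

```python
import math

def tdt(transmissions, allele="A"):
    """Transmission-disequilibrium test for one allele of a biallelic marker.

    transmissions: one (transmitted, non-transmitted) allele pair per parent
    of an affected child.
    """
    b = c = 0
    for transmitted, untransmitted in transmissions:
        if transmitted == untransmitted:
            continue  # homozygous parent: uninformative
        if transmitted == allele:
            b += 1
        elif untransmitted == allele:
            c += 1
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-squared upper tail, 1 df
    return b, c, stat, p_value

# Hypothetical data: 60 heterozygous parents transmit A, 40 transmit a,
# and 30 homozygous A/A parents contribute nothing to the test
data = [("A", "a")] * 60 + [("a", "A")] * 40 + [("A", "A")] * 30
b, c, stat, p_value = tdt(data)
print(b, c, stat, round(p_value, 4))
```

Only the 100 heterozygous parents enter the statistic, which illustrates why recruitment for trio designs is less efficient than the raw family count suggests.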
32(No Transcript)
33Population stratification
- Random panels of SNPs can be used to test for
population sub-structure (many new methods).
- First studies: little empirical evidence of
stratification in large samples from North
America, Japan, Latin America and Europe.
- Problems with Trios and TDT
- Inefficient in genotyping and sampling
- Difficult (or impossible) to collect
- Stratification can only really be a big problem
if doing poor epidemiology.
- Potential for bias has been greatly exaggerated.
- Fear of population stratification led to
substantial changes in study design and analytic
methods and widespread adoption of trios design
- Now no reason to adopt a family-based design
solely to protect against stratification
Cardon LR, Palmer LJ. Population stratification
and spurious allelic association. The Lancet
2003;361:598-604.
34Why Case/Control?
- Advantages
- Methodology is well-known
- Convenient to collect
- Common
- Very large samples
- More efficient recruitment than family-based
sampling
- Simultaneous assessment of disease allele
frequency, penetrance, and AR
- Unrelated controls can provide increased power
- Limitations
- 1. Possible Population Stratification
- 2. Need for highly dense marker sets (capture LD)
- Lack of phase information
- Lack of consistency of results
These can be overcome!
1. Assessment and genomic control of stratification
2. SNP maps
3. Imputed haplotypes
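Genomic control is usually described as estimating an inflation factor from markers believed to be unlinked to the disease, then deflating the test statistic of interest. The Python sketch below assumes the median-based form of that idea for 1-df chi-squared statistics; the statistics shown are made up.

```python
from statistics import NormalDist, median

# Median of a chi-squared distribution with 1 df, about 0.455
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def genomic_control(candidate_stat, null_stats):
    """Return (inflation factor lambda, corrected statistic).

    null_stats: 1-df chi-squared statistics from markers assumed to be
    unrelated to disease. Lambda is floored at 1 so the correction never
    inflates the candidate statistic.
    """
    lam = max(1.0, median(null_stats) / CHI2_1DF_MEDIAN)
    return lam, candidate_stat / lam

# Hypothetical null-marker statistics from a stratified sample
null_stats = [0.10, 0.35, 0.62, 0.91, 1.40, 2.75, 4.10]
lam, corrected = genomic_control(10.0, null_stats)
print(round(lam, 2), round(corrected, 2))
```

The median is used rather than the mean so that a few genuinely associated markers among the "null" panel do not distort the estimate.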
35Sample size requirements for case-control
analyses of SNPs (2 controls per case;
detectable difference of OR ≥ 1.5; power = 80%).
Statistical power an increasing concern
Palmer LJ, Cookson WOCM (2001).
Using Single Nucleotide Polymorphisms (SNPs) as
a means to understanding the pathophysiology of
asthma. Respiratory Research 2:102-112.
36Growing utilization of population-based designs
- Increasingly apparent for many diseases that
population-based studies of unrelated
individuals, in which case-control and cohort
studies serve as standard designs for genetic
association analysis, may be a practical and
powerful approach
- Power
- Ease and efficiency of collection
- Cohort design: longitudinal data and prospective
assessment
- Birth cohorts
- Value of historical cohorts
- e.g., Nurses' Health Study: 120,000 nurses
recruited in mid-70s; 98% follow-up of living
cohort to current day.
- Pharmacogenetics
- GxE studies
37An example: UK Biobank
- Focus on binary outcomes and gene-environment
interactions
- Large cohort: 500,000 individuals
- Age range at recruitment: 45-64 yrs → 45-69 yrs
- Comprehensive exposure assessment at recruitment
- Lifestyle factors, environmental exposures
- Personal and family history of health and disease
- Subsequent monitoring via NHS information systems
- Power set on number of events in 10 years
- ? 40M-60M
38Failure to replicate association
- Failure to replicate genetic association studies
is a problem of genuine concern
- But it more often involves poor study design and
execution, in particular a lack of appreciation
for the sample sizes required to detect modest
genetic effects and over-interpretation of
marginal results, than undetected population
stratification.
- Complex human diseases
- Initial detection and replication will likely be
very difficult.
- Multiple testing, laboratory/measurement error,
and positive publication and
investigator-reporting biases
- Population stratification is only one (and
possibly amongst the least) of many possible
reasons for non-replication of association
results.
Cardon LR, Palmer LJ. Population stratification
and spurious allelic association. The Lancet
2003;361:598-604.
39Study reproducibility: risk allele frequency
- Risk allele frequency has a much greater impact
on power than disease prevalence for allele
frequencies < 0.2 and > 0.8
40Outstanding Genetics Issues
- Extent of linkage disequilibrium across the
genome
- Age of expected mutations in populations
- Effect of selection on specific genes
- Allele frequency differences between populations
- Population demographics
- Whether or when subtle population differences are
important factors in association studies
- Effect of stratification on haplotypes