Title: Biostatistical Basics for Genetic Epidemiology
1Biostatistical Basics for Genetic Epidemiology
- Interdisciplinary Genetic Research Course
- Shanghai
- October 6, 2008
- Kung-Yee Liang
- Department of Biostatistics
- Johns Hopkins School of Public Health
2A Brief Outline
- Some basic concepts in epidemiology
- Designs
- Confounding and effect modification
- Some statistical tools
- Mantel-Haenszel method
- Logistic regression
- Interpretations
- Cautions on conventional inferences
- Matched analysis
- Polytomous regression
- Use in genetic epidemiology
3Epidemiology
- A discipline to study the distribution of diseases (disorders) to provide the basis for developing and evaluating preventive procedures and public health practices
- Disease control and prevention through health education and intervention
- Health policy/expenditure implementations
- Clinical implications: prognosis, treatment strategy
4Issues in Identifying Risk Factors
- Designs
- Measures of association
- Confounding
- Effect modification
- Data analysis
5Designs
- Prospective (cohort): exposed and unexposed subjects are followed up prospectively and events of interest are observed over time
- Study multiple endpoints
- Temporal and causal relationship
- Usually large scale: time consuming and costly
- Loss to follow-up (survivorship bias)
- Breslow and Day (1987). The Design and Analysis of Cohort Studies, IARC
6Designs (cont)
- Retrospective (case-control): affected (cases) and unaffected (controls) subjects are ascertained and exposure information is collected retrospectively
- More efficient time-wise and budget-wise
- Subject to biases: recall, detection, etc.
- More difficult to establish temporal and causal relationship
- Association
- Breslow and Day (1981). The Analysis of Case-Control Studies, IARC
7Measures of Association
Odds Ratio (OR) for disease
- RR = 1 iff OR = 1: no association
- RR > 1 iff OR > 1: positive association
- The larger the RR (OR), the greater the association
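As a minimal sketch of the two measures, the snippet below computes the relative risk (RR) and odds ratio (OR) from a 2x2 table. The cell layout (a, b = exposed with and without disease; c, d = unexposed with and without disease) and all counts are assumptions chosen for illustration, not data from the slides.

```python
def relative_risk(a, b, c, d):
    """Risk of disease in the exposed divided by risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of disease in the exposed divided by odds in the unexposed."""
    return (a / b) / (c / d)


# Hypothetical counts: a, b = exposed with / without disease,
#                      c, d = unexposed with / without disease.
rr = relative_risk(30, 70, 10, 90)       # 0.30 / 0.10
odds = odds_ratio(30, 70, 10, 90)        # (30/70) / (10/90)

# Hypothetical table with equal risk in both groups: RR and OR are both 1.
rr_null = relative_risk(20, 80, 10, 40)
or_null = odds_ratio(20, 80, 10, 40)

print(rr, odds, rr_null, or_null)
```

In the first table both measures exceed 1, and in the second both equal 1, matching the "iff" statements above.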
8Confounding
- The distortion of the true association between the disease and the risk factor due to the association of other factors with both disease and exposure, the latter association with the disease being causal.
9Confounding (cont)
2x2 table (row and column labels not preserved):
         350    200  |   550
         150    300  |   450
         500    500  |  1000
10Confounding (cont)
- Confounding: E ← C → D
- Mediation: E → C → D
- For the latter, the intermediate variable is not a confounder but rather a mediating variable
- Smoking → chronic cough → lung cancer
- How to avoid this confusion
- Substantive knowledge
11Table 3.2: The relationships between outpatient expenditure and BMI categories in Taiwan (Tobit censored model)
Significance levels: p < 0.05, p < 0.01, p < 0.001
12Table 4.2: The relationships between outpatient expenditure and BMI categories using Tobit censored model (control for chronic disease)
Significance levels: p < 0.01, p < 0.001
13Confounding (cont)
- It appears that once adjusting for chronic illnesses such as diabetes mellitus, hypertension and CVD, the effect of BMI on outpatient expenditures (OE) and physician visits (PV) disappears
- What is more likely is that the effect of BMI on OE and PV is indeed real, as it is mediated through those illnesses
14How to Deal with Confounding?
- Stratification
- Sub-divide population into groups that are homogeneous with regard to confounding variables
- Post stratification
- Frequency matching
- Individual matching
15Stratification (cont)
Data are imbalanced
16Stratification (cont)
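- Frequency matching
  Counts shown (labels not preserved): 50 and 100 in total, split as 30 and 60 in one stratum and 20 and 40 in the other, so both groups have the same distribution across strata
- Individual (one-to-one) matching
                       Control
                       Exposed    Unexposed
  Case   Exposed       27         29
         Unexposed     3          4
  Total: 63 matched pairs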
17Effect Modification
- The effect of risk factor on the disease (the association between risk factor and disease) is dependent on (modified by) the level of the confounding variable
18Effect Modification (cont)
- Also known as interaction between risk factor and confounder
- The confounder is called effect modifier
- Useful to identify high risk group
- It is model dependent
- Quantitative interaction: same direction (2 versus 5)
- Qualitative interaction: different direction (0.5 versus 5)
- More problematic
19Statistical Analysis
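- Post-stratification / frequency matching
- Mantel-Haenszel (M-H) estimator
- Mantel-Haenszel test statistic
- Stratum $i$ ($i = 1, \ldots, K$): 2x2 table with cells $a_i$, $b_i$ (first row) and $c_i$, $d_i$ (second row), margins $n_i$, $m_i$, $t_i$ and stratum total $N_i$
- Mantel and Haenszel (1959), JNCI; Robins, Breslow and Greenland (1986), Biometrics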
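The formulas for the two Mantel-Haenszel quantities are not legible in this transcript, so the sketch below uses the standard textbook forms: the estimator sum(a_i d_i / N_i) / sum(b_i c_i / N_i), and the test statistic [sum(a_i) - sum(E(a_i))]^2 / sum(Var(a_i)) without continuity correction. The cell layout (a, b = exposed with and without disease; c, d = unexposed with and without disease) and the two strata are hypothetical, chosen so that the crude odds ratio is distorted by the stratifying variable.

```python
def mantel_haenszel(strata):
    """Return (M-H odds ratio, M-H chi-square) for a list of (a, b, c, d) tables."""
    sum_ad = sum_bc = 0.0
    sum_a = sum_expected = sum_variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        sum_ad += a * d / n
        sum_bc += b * c / n
        exposed, unexposed = a + b, c + d
        diseased, healthy = a + c, b + d
        sum_a += a
        # Hypergeometric mean and variance of a under no association
        sum_expected += exposed * diseased / n
        sum_variance += exposed * unexposed * diseased * healthy / (n * n * (n - 1))
    or_mh = sum_ad / sum_bc
    chi_square = (sum_a - sum_expected) ** 2 / sum_variance
    return or_mh, chi_square


# Hypothetical strata of a confounder; the odds ratio is 1 within each stratum.
strata = [(80, 20, 40, 10), (10, 40, 20, 80)]

# Collapsing over strata gives the crude table (90, 60, 60, 90).
crude = tuple(sum(cells) for cells in zip(*strata))
crude_or = crude[0] * crude[3] / (crude[1] * crude[2])

or_mh, chi_square = mantel_haenszel(strata)
print(crude_or, or_mh, chi_square)
```

Here the crude odds ratio suggests an association that vanishes once the strata are combined with the M-H weights.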
20One-to-One Matching
                     Control
                     Exposed    Unexposed
Case   Exposed       a          b
       Unexposed     c          d
- a, d: concordant pairs
- b, c: discordant pairs
- Mantel-Haenszel estimator: b/c
- McNemar test statistic
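A short sketch of the matched-pair calculations, using the pair counts 27, 29, 3 and 4 that appear earlier in the deck. Taking b = 29 and c = 3 as the discordant pairs is an assumption, though it agrees with the odds ratio 29/3 quoted in the endometrial cancer example. The McNemar statistic is written in its usual form (b - c)^2 / (b + c), with the continuity-corrected variant alongside.

```python
# Matched-pair counts: a, d concordant; b, c discordant (assignment assumed).
a, b, c, d = 27, 29, 3, 4

total_pairs = a + b + c + d

# Mantel-Haenszel estimator for one-to-one matching uses discordant pairs only.
or_matched = b / c

# McNemar chi-square statistics on 1 degree of freedom.
mcnemar = (b - c) ** 2 / (b + c)
mcnemar_corrected = (abs(b - c) - 1) ** 2 / (b + c)

print(total_pairs, or_matched, mcnemar, mcnemar_corrected)
```

The concordant pairs (a and d) do not enter either quantity.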
21Example Revisited
- D: oesophageal cancer
- E: ≥ 80 g/day alcohol consumption
- OR = 5.2
- The odds (risk) of oesophageal cancer for those who drink more than 80 g/day is about five times as high as for those who drink less.
22Example Revisited (cont)
- D: endometrial cancer
- E: estrogen usage
- OR = 29/3 = 9.67
- The odds (risk) of endometrial cancer is about tenfold higher among estrogen users.
23Limitations of the Stratification / M-H Approach
- All the variables (confounder, risk factor) are required to be discrete
- One risk factor at a time
- Should NOT use it with qualitative interaction
- Implications
- Can't establish dose-response relationship
- The larger the exposure, the higher the risk
- Can't examine joint effects of several risk factors simultaneously
24An Alternative
- Logistic regression model
- $Y = 1\ (0)$ if affected (unaffected)
- $Z_1, \ldots, Z_q$: confounding variables
- $X_1, \ldots, X_p$: risk factors of interest
- $\log \frac{\Pr(Y = 1)}{\Pr(Y = 0)}$: log odds
- All of the regression coefficients have log odds ratio interpretations
25How to Interpret Logistic Regression Coefficients?
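- $\log \frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \alpha + \beta_1 X_1$
- Log odds by level of $X_1$:
    $X_1 = 1$: $\alpha + \beta_1$
    $X_1 = 0$: $\alpha$
- Difference: $\beta_1$ = log odds ratio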
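For a single binary risk factor the link between the coefficients and a 2x2 table can be checked directly: the maximum likelihood estimates have a closed form, with the intercept equal to the log odds among the unexposed and the slope equal to the log odds ratio. The sketch below assumes hypothetical counts (a, b = exposed cases and non-cases; c, d = unexposed cases and non-cases), and the names `alpha` and `beta1` simply mirror the intercept and slope.

```python
import math


def logistic_coefficients(a, b, c, d):
    """Closed-form fit of log odds = alpha + beta1 * x1 for binary x1."""
    alpha = math.log(c / d)          # log odds of disease when x1 = 0
    beta1 = math.log(a / b) - alpha  # difference in log odds = log odds ratio
    return alpha, beta1


def probability(alpha, beta1, x1):
    """Pr(Y = 1 | x1) implied by the logistic model."""
    return 1 / (1 + math.exp(-(alpha + beta1 * x1)))


# Hypothetical counts, not taken from the slides.
a, b, c, d = 30, 70, 10, 90
alpha, beta1 = logistic_coefficients(a, b, c, d)

table_or = (a * d) / (b * c)
print(math.exp(beta1), table_or)
print(probability(alpha, beta1, 1), probability(alpha, beta1, 0))
```

Exponentiating the slope recovers the table odds ratio, and the fitted probabilities reproduce the observed proportions 30/100 and 10/100.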
26How to Interpret Logistic Regression
Coefficients? (cont)
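- $\log \frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \alpha + \beta_1 X_1 + \beta_2 X_2$
- Log odds by level of $(X_1, X_2)$:
    $(1, 0)$: $\alpha + \beta_1$
    $(0, 1)$: $\alpha + \beta_2$
    $(0, 0)$: $\alpha$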
27How to Interpret Logistic Regression
Coefficients? (cont)
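- $\log \frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \alpha + \gamma_1 Z_1 + \beta_1 X_1$
- Log odds by level of $(Z_1, X_1)$:
    $(1, 1)$: $\alpha + \gamma_1 + \beta_1$
    $(1, 0)$: $\alpha + \gamma_1$
    $(0, 1)$: $\alpha + \beta_1$
    $(0, 0)$: $\alpha$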
28How to Interpret Logistic Regression
Coefficients? (cont)
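- $\log \frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \alpha + \gamma_1 Z_1 + \beta_1 X_1 + \delta Z_1 X_1$
- Log odds by level of $(Z_1, X_1)$:
    $(1, 1)$: $\alpha + \gamma_1 + \beta_1 + \delta$
    $(1, 0)$: $\alpha + \gamma_1$
    $(0, 1)$: $\alpha + \beta_1$
    $(0, 0)$: $\alpha$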
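To make the role of the interaction term concrete, the sketch below evaluates the log odds in the four cells and takes differences within each level of Z1. The coefficient values are hypothetical, picked so that the odds ratio for X1 is 2 when Z1 = 0 and 5 when Z1 = 1 (a quantitative interaction); the names `gamma1` and `delta` are labels assumed here for the confounder and interaction coefficients.

```python
import math


def log_odds(alpha, gamma1, beta1, delta, z1, x1):
    """Log odds under a model with main effects and a Z1-by-X1 interaction."""
    return alpha + gamma1 * z1 + beta1 * x1 + delta * z1 * x1


def odds_ratio_for_x1(alpha, gamma1, beta1, delta, z1):
    """Odds ratio comparing x1 = 1 with x1 = 0 at a fixed level of z1."""
    difference = (log_odds(alpha, gamma1, beta1, delta, z1, 1)
                  - log_odds(alpha, gamma1, beta1, delta, z1, 0))
    return math.exp(difference)


# Hypothetical coefficients, not estimates from the slides.
alpha, gamma1 = -2.0, 0.5
beta1 = math.log(2.0)
delta = math.log(2.5)

or_z0 = odds_ratio_for_x1(alpha, gamma1, beta1, delta, 0)  # exp(beta1)
or_z1 = odds_ratio_for_x1(alpha, gamma1, beta1, delta, 1)  # exp(beta1 + delta)
print(or_z0, or_z1)
```

With delta = 0 the two odds ratios coincide, which is the no-interaction model.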
29How to Interpret Logistic Regression
Coefficients? (cont)
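- $\log \frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \alpha + \beta_1 X_1$, $X_1$ continuous
- Log odds by value of $X_1$:
    $X_1 = 5$: $\alpha + 5\beta_1$
    $X_1 = 4$: $\alpha + 4\beta_1$
    $X_1 = 31$: $\alpha + 31\beta_1$
    $X_1 = 30$: $\alpha + 30\beta_1$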
30How to Interpret Logistic Regression
Coefficients? (cont)
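- $\log \frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \alpha + \gamma_1 Z_1 + \beta_1 X_1$, with $X_1$ continuous and $Z_1 = 1\ (0)$
- $\log \frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \alpha + \gamma_1 Z_1 + \beta_1 X_1 + \delta Z_1 X_1$, with $X_1$ continuous and $Z_1 = 1\ (0)$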
31In General
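- $X = (X_1, \ldots, X_p)$, $0 = (0, \ldots, 0)$
- $\mathrm{OR}(X : 0) = \frac{\Pr(Y = 1 \mid X, Z) / \Pr(Y = 0 \mid X, Z)}{\Pr(Y = 1 \mid 0, Z) / \Pr(Y = 0 \mid 0, Z)} = e^{\beta_1 X_1} \times e^{\beta_2 X_2} \times \cdots \times e^{\beta_p X_p}$
- Multiplicative (log-linear) models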
32In General (cont)
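- $X = (x_1 + 1, x_2, \ldots, x_p)$, $X' = (x_1, x_2, \ldots, x_p)$
- $\mathrm{OR}(X : X') = \mathrm{OR}(X : 0) / \mathrm{OR}(X' : 0) = \frac{e^{\beta_1 (x_1 + 1) + \beta_2 x_2 + \cdots + \beta_p x_p}}{e^{\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p}} = e^{\beta_1}$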
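The cancellation above can be checked numerically. In this sketch the coefficient vector and covariate values are hypothetical; the only point is that raising the first covariate by one unit, with the others held fixed, multiplies the odds by exp(beta_1) whatever the starting value.

```python
import math


def odds_ratio_vs_zero(betas, x):
    """OR comparing covariate vector x with the all-zero vector."""
    return math.prod(math.exp(beta * value) for beta, value in zip(betas, x))


def odds_ratio(betas, x, x_ref):
    """OR comparing x with x_ref, as a ratio of the two ORs against zero."""
    return odds_ratio_vs_zero(betas, x) / odds_ratio_vs_zero(betas, x_ref)


# Hypothetical coefficients and covariate values.
betas = [0.3, -0.2, 0.1]

or_5_vs_4 = odds_ratio(betas, [5, 1, 2], [4, 1, 2])
or_31_vs_30 = odds_ratio(betas, [31, 1, 2], [30, 1, 2])
print(or_5_vs_4, or_31_vs_30, math.exp(betas[0]))
```

Both comparisons return the same value, which is why a single coefficient summarizes the effect of a continuous covariate in this model.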
33For Matched Case-Control Studies
- The conventional logistic regression method (e.g. SAS PROC LOGISTIC) is NOT adequate
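A common alternative for matched sets is conditional logistic regression, which conditions on each matched set so that set-specific intercepts drop out; naming it here is an assumption, since the transcript does not say which method the slide recommends. As a minimal sketch for one-to-one matching with a single binary exposure, the conditional log-likelihood depends only on the discordant pairs, b * beta - (b + c) * log(1 + exp(beta)), and the code maximizes it with Newton's method. The function name and the use of b = 29, c = 3 (the discordant counts from the endometrial cancer example) are illustrative choices.

```python
import math


def conditional_mle(b, c, iterations=25):
    """Maximize b*beta - (b + c)*log(1 + exp(beta)) by Newton's method."""
    beta = 0.0
    for _ in range(iterations):
        p = 1 / (1 + math.exp(-beta))        # exp(beta) / (1 + exp(beta))
        score = b - (b + c) * p              # first derivative
        information = (b + c) * p * (1 - p)  # minus the second derivative
        beta += score / information
    return beta


# Discordant pair counts: case exposed / control unexposed, and the reverse.
b, c = 29, 3
beta_hat = conditional_mle(b, c)
print(math.exp(beta_hat), b / c)
```

The conditional estimate of the odds ratio agrees with the Mantel-Haenszel estimator b/c for matched pairs.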
34For Matched Case-Control Studies
- Matching variables can be used to examine their interactions with risk factors (effect modification)

Variable              Model 1          Model 2          Model 3
Estrogen use (EST)    2.074 (0.421)    1.431 (0.826)    2.074 (0.421)
EST × AGE1                             0.847 (1.034)
EST × AGE2                             0.780 (1.154)
EST × AGE†                                              0.385 (0.616)
† AGE = 0, 1 or 2
35For Matched Case-Control Studies
- 3. How many matching variables to consider?
- Many confounding variables
- May not find matched controls if all confounders are considered for matching
- May run into over-matching problem
- Recommendations
- No more than two or three strong confounders
- The rest are adjusted through regression
36For Matched Case-Control Studies
- 4. How many controls to match per case?
- As a rule of thumb: 4 matched controls per case
- Efficiency with R controls per case, relative to an unlimited number of controls: R/(R + 1)
- More controls may be needed if
- The risk factor considered is rare
- The underlying degree of association is high
- Breslow et al. (1983), JASA
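The rule of thumb can be read off the formula R/(R + 1). The sketch below tabulates it for R = 1 to 10, interpreting the value as the efficiency of R matched controls per case relative to an unlimited number of controls; that interpretation is an assumption stated here, and the function name is illustrative.

```python
def relative_efficiency(r):
    """Efficiency of r controls per case, using the formula R / (R + 1)."""
    return r / (r + 1)


efficiency = {r: relative_efficiency(r) for r in range(1, 11)}

# Gain from adding one more control per case.
gain = {r: efficiency[r + 1] - efficiency[r] for r in range(1, 10)}

for r in range(1, 11):
    print(r, round(efficiency[r], 3))
```

Four controls per case already reach 0.8, and each further control adds progressively less.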
37Polytomous Logistic Regression
- It is common that the response variable has three or more categories
- Cell types in lung cancer
- Severity of injury
- Subtypes in oral cleft
- Cleft lip with or without cleft palate (CLP)
- Cleft palate only (CP)
38Polytomous Logistic Regression (cont)
                 Oral cleft
C2          CP       CLP      Control
Present     27       32       24
Absent      97       177      142
OR          1.65     1.07     1.0
- C2: target allele for the candidate gene transforming growth factor alpha (TGFA)
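The odds ratios in the table can be reproduced from the counts by comparing each cleft subtype with the control column. This is a small sketch; the dictionary layout and function name are illustrative.

```python
# (C2 present, C2 absent) counts for each outcome group, from the table.
counts = {
    "CP": (27, 97),
    "CLP": (32, 177),
    "control": (24, 142),
}


def odds_ratio_vs_control(group):
    """Odds of carrying C2 in the given group divided by the odds in controls."""
    present, absent = counts[group]
    control_present, control_absent = counts["control"]
    return (present / absent) / (control_present / control_absent)


or_cp = odds_ratio_vs_control("CP")
or_clp = odds_ratio_vs_control("CLP")
print(round(or_cp, 2), round(or_clp, 2))
```

Each subtype is compared with the same control group, which is the structure the polytomous model formalizes.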
39Polytomous Logistic Regression (cont)
- Polytomous logistic regression model
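- $Y = 0, 1, 2, \ldots, C$
- $\log \frac{\Pr(Y = j)}{\Pr(Y = 0)} = \alpha_j + \beta_j^t x, \quad j = 1, 2, \ldots, C$
- $\beta_j$: change in log odds ($Y = j$ versus $Y = 0$) per unit change in $x$
- $\beta_j - \beta_k$: change in log odds ($Y = j$ versus $Y = k$) per unit change in $x$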
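As a sketch of how the model turns coefficients into category probabilities, the function below uses category 0 as the reference and a single scalar covariate for simplicity. All coefficient values are hypothetical. The checks at the end confirm the two interpretations listed above: a unit change in x shifts the log odds of category j versus 0 by beta_j, and of category j versus k by beta_j - beta_k.

```python
import math


def category_probabilities(alphas, betas, x):
    """Return [Pr(Y=0), Pr(Y=1), ..., Pr(Y=C)] for a scalar covariate x.

    alphas[j-1] and betas[j-1] belong to category j = 1, ..., C.
    """
    scores = [math.exp(a + b * x) for a, b in zip(alphas, betas)]
    denominator = 1 + sum(scores)
    return [1 / denominator] + [s / denominator for s in scores]


# Hypothetical coefficients for C = 2 non-reference categories.
alphas = [-1.0, -0.5]
betas = [0.8, 0.3]

p_at_2 = category_probabilities(alphas, betas, 2.0)
p_at_3 = category_probabilities(alphas, betas, 3.0)

# Change in log odds per unit change in x.
change_1_vs_0 = math.log(p_at_3[1] / p_at_3[0]) - math.log(p_at_2[1] / p_at_2[0])
change_1_vs_2 = math.log(p_at_3[1] / p_at_3[2]) - math.log(p_at_2[1] / p_at_2[2])
print(change_1_vs_0, change_1_vs_2)
```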
40Polytomous Logistic Regression (cont)
Variable         CP/control        CLP/control
Intercept (α)    2.756 (0.753)     3.388 (0.679)
TGFA             0.045 (0.406)     -0.025 (0.580)
MS               0.821 (0.370)     1.071 (0.329)
TGFA × MS        0.580 (0.746)     -0.279 (0.714)
MA               0.108 (0.024)     -0.112 (0.022)
MS: maternal smoking; MA: maternal age
41Summary
- We have discussed
- Some basic concepts in epidemiology
- Designs
- Cohort vs case-control
- Confounding and effect modification
- Some statistical tools
- Mantel-Haenszel procedure
- Logistic regression
- Matching vs not
- Dichotomous vs polytomous
42Summary (cont)
- These designs and methods are useful for genetic epidemiological research
- Detection of familial aggregation
- Identification of genetic subtypes
- Test for genetic association
- Examination of gene-environment interaction
- Liang and Beaty (2000), Stat Meth in Med Res