Title: Epidemiology for mathematicians
1 Epidemiology for mathematicians: Looking at wildflowers from horseback
- David Ozonoff, MD, MPH
- Boston University
- School of Public Health
DIMACS Working Group on Order Theory in Epidemiology, March 7, 2005
2 Tutorial overview and goals
- The landscape of epidemiology
- What is epidemiology?
- Who is an epidemiologist?
- Who employs them?
- Kinds of epidemiology
- How epidemiologists think
- What kinds of things do they work with?
- What kinds of things are they interested in?
3 Tutorial overview and goals, cont'd
- Some language and concepts of epidemiology
- Language of occurrence measures
- Study designs
- Causal inference
4 I. Landscape, perspective, language
- What is epidemiology?
- Who is an epidemiologist?
- Who employs epidemiologists?
- Flavors of epidemiology: descriptive, analytic
- Epi and mathematics: models and patterns
- Some examples of epidemiological thinking
5 Some definitions of epidemiology
- Study of health and illness in populations (Kleinbaum, Kupper and Morgenstern)
- Study of the distribution and determinants of disease frequency in human populations (MacMahon and Pugh; Susser)
- Study of the occurrence of illness (Rothman I)
- Theoretical epidemiology: the discipline of how to study the occurrence of phenomena of interest in the health field (Miettinen). NB: not illness centered
6 Some more (cynical) definitions
- Rothman II: Unfortunately, there seem to be more definitions of epidemiology than there are epidemiologists. Some have defined it in terms of its methods. While the methods of epidemiology may be distinctive, it is more typical to define a branch of science in terms of its subject matter rather than its tools. If the subject of epidemiologic inquiry is taken to be the occurrence of disease and other health outcomes, it is reasonable to infer that the ultimate goal of most epidemiologic research is the elaboration of causes that can explain patterns of disease occurrence.
- Schneiderman: Epidemiology is the practice of criticizing other epidemiologists
7 Consensus notions
- Deals with populations, not individuals
- Deals with (frequency of) occurrences of health-related events
- Has a (major but not exclusive) concern with causes (determinants) of disease patterns in populations
8 Remarks
- Public health perspective
- Flavors: analytic versus descriptive epidemiology
- Causal inference assumptions:
- Disease occurrence is not random.
- Systematic investigation of different populations can identify causal and preventive factors
- Observational versus experimental sciences
- Chronic disease and infectious disease epidemiology
- What is theoretical epidemiology?
9 Some examples
- Do environmental exposures increase risk of disease?
- John Snow: cholera epidemic of 1854
- Contaminated water and leukemia in Woburn, MA
- Are vitamin supplements beneficial?
- Does vitamin E lower risk of Alzheimer's disease?
- Folic acid and risk of neural tube (birth) defects
- Do behavioral interventions reduce risk behaviors?
- Community-based studies to change diets
- Peer interventions to reduce HIV-risk behaviors
10 Who is an epidemiologist?
- Relatively new in medical science
- Precursors: John Graunt (17th century), John Snow (19th century)
- Rise as a profession: Wade Hampton Frost at JHU
- 1950s and 1960s: CDC and consolidation as professional discipline, still mainly physicians
- 1960s: infectious disease -> chronic disease epi
- Professionalization
- Doctoral degrees in epidemiology
- Now most epidemiologists are not docs
11 Who employs epidemiologists?
- Public sector
- State and federal health officials
- Communicable and chronic disease programs
- Infectious disease, outbreak investigations
- Cancer registries, environmental studies, program areas in substance abuse, health services, etc.
- Research at CDC, NIH, academia, etc.
- Private sector
- Industry (chemical companies, drug companies)
- Consultants
- Academia, NGOs
12 Flavors of epidemiology
- Descriptive epidemiology
- Analytic epidemiology (finding risk factors,
a.k.a. causes)
13 Descriptive epidemiology
- Describe patterns of disease by person, place, time
- Good for monitoring the public's health (e.g., surveillance, vital events)
- Used for administrative purposes (e.g., planning)
- Good for generating hypotheses
14 NB: Disease patterns and the science of patterns
15 Description
- Two kinds
- Tabulations or summaries only (no inference or estimation)
- Inference
- Prediction to other populations (generalization: surveys and polling)
- True value in face of noise
- May also assume the data are produced by an underlying population model and try to describe it
- Parametric: particular functional form assumed
- Parameter: value that indexes a family of functions, e.g., mean and standard deviation of the Normal distribution
- Non-parametric: data-driven estimate of underlying density or distribution
16 A word about models and patterns (our usage)
- Models are high-level, global descriptions of all or most of a dataset
- Descriptive or inferential component
- Examples:
- Regression models, mixture models, Markov models
- Patterns are local features of data
- Perhaps only a few people or a few variables
- Also descriptive or inferential
- Descriptive: look for people with unusual features
- Inferential: predict which people have unusual features
- Examples: association rules, mode or gap in density function, outliers, inflection point in regression, symptom clusters, geographic hot spots, predict disease from symptoms
17 Models and patterns, cont'd
- Epidemiologists use both but are more interested in patterns, i.e., more interested in structure that is local than structure that is global
- George Box: all models are wrong but some models are useful (describes the epi viewpoint)
- But epidemiologists tend to think of patterns as real, even if misleading
18 Warning: the word "model" differs by context but is usually some kind of metaphor
- Metaphor: a figure of speech literally denoting one kind of thing but used to represent or reason about another kind of thing
- Examples: fashion model, model citizen (represent an ideal); scale model; animal model; mathematical model; model of an axiomatic system; regression model
19 Question: What do we learn from the following examples?
Describing populations by person, place and time; illustrating how epidemiologists think
20 Person (age, sex, race): Death rates per 100,000 US population from coronary disease by age and sex, 1981
Age      White Men   White Women
25-34            9             4
35-44           60            16
45-54          265            71
55-64          708           243
65-74         1670           769
75-84         3752          2359
85+           8596          7215
21 Place
- Where are the rates of disease the highest and lowest?
- Malignant Melanoma of Skin
22 Place
23 A Variation on Place: Migrant Studies
Mortality rates (per 100,000) due to stomach cancer
Japanese in Japan                      58.4
Japanese Immigrants to California      29.9
Sons of Japanese Immigrants            11.7
Native Californians (Caucasians)        8.0
24 Time: Does frequency of disease differ now from in the past?
25 What is a Population? How an epidemiologist would put it
- Group of people with a common characteristic like age, race, sex, geographic location, occupation, etc.
- Two types of populations, based on whether membership is permanent or transient
- Fixed population or cohort: membership is permanent and defined by an event
- Ex. Atomic bomb survivors, persons born in 1980
- Dynamic population: membership is transient and defined by being in or out of a "state"
- Ex. Members of HMO Blue, residents of the City of Boston
26 First step, summary description
- Tabulate data by selected features of person, place, time
- What are characteristics of population members? (How many of each sex, race, etc.) And combinations of these features (How many white women? Employed? Etc.)
27 Constructing contingency table from raw data
- Raw data consists of a listing of each subject and his or her attributes
28 One-way tables
- A one-dimensional contingency table (CT) is just a frequency table, i.e., a table that gives the number of subjects with each attribute
29 Two-way tables
- Most contingency tables are (at least) two-way, i.e., they cross-classify two attributes
30 Or in more familiar form
Sex by handedness and age
But this is only part of the possible two-way tables, as it does not represent handedness versus age, for example
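As a minimal sketch of the tabulation steps above, the following builds one-way and two-way tables from a raw listing of subjects. The six records, the attribute names and the helper names `one_way` and `two_way` are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical raw data: one record per subject, listing that subject's attributes.
subjects = [
    {"id": 1, "sex": "F", "handedness": "right", "age_group": "under 40"},
    {"id": 2, "sex": "F", "handedness": "left", "age_group": "40 and over"},
    {"id": 3, "sex": "M", "handedness": "right", "age_group": "under 40"},
    {"id": 4, "sex": "M", "handedness": "right", "age_group": "40 and over"},
    {"id": 5, "sex": "F", "handedness": "right", "age_group": "40 and over"},
    {"id": 6, "sex": "M", "handedness": "left", "age_group": "under 40"},
]


def one_way(records, attribute):
    """One-way table: number of subjects with each value of one attribute."""
    return Counter(record[attribute] for record in records)


def two_way(records, row_attribute, column_attribute):
    """Two-way table: counts keyed by (row value, column value)."""
    return Counter(
        (record[row_attribute], record[column_attribute]) for record in records
    )


sex_counts = one_way(subjects, "sex")
sex_by_hand = two_way(subjects, "sex", "handedness")
# The pairing that a "sex by handedness and age" layout leaves out:
hand_by_age = two_way(subjects, "handedness", "age_group")

for cell, count in sorted(hand_by_age.items()):
    print(cell, count)
```

Each two-way table is a separate pass over the same listing, which is why a layout showing only some pairings of attributes is only part of what the raw data supports.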
31 What is a Population? How a mathematician might put it
- A population is a triple, (G, M, I)
- Two sets, G and M: G is a set of people or subjects, M is a set of features the subjects might have
- A relation I, I ⊆ G × M
- Interpretation: r = (g, m) ∈ I means that subject g ∈ G has attribute m ∈ M
32 Contingency tables (cross-tabs)
- Mainstay of data preparation, inspection and analysis
- Requires study-design-based operations
- Sampling → set of n subjects in set G
- Variable selection (classification scheme) → set of m variables in set M
- E.g., age, sex, disease status (as indicator variables)
- Measurement → binary relation I ⊆ G × M
- E.g., ordered pair (case 2, female=yes) is a typical member of I
- We call the triple (G, M, I) a data structure for the contingency table (also called a formal context in the FCA literature)
- Simple formulation allows use of rich mathematical theory
- Much more about this from Alex Pogel
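A small sketch may make the triple (G, M, I) concrete. The subjects, the indicator attributes and the helper names below are hypothetical; only the pair (case 2, female=yes) echoes the slide. The point is that a contingency-table cell is just the size of the set of subjects related to every attribute that defines the cell.

```python
# Hypothetical formal context (G, M, I): labels and attributes are illustrative only.
G = {"case 1", "case 2", "case 3", "case 4"}          # subjects
M = {"female=yes", "over 50=yes", "diseased=yes"}     # indicator attributes
I = {                                                 # relation I, a subset of G x M
    ("case 1", "diseased=yes"),
    ("case 2", "female=yes"),
    ("case 2", "diseased=yes"),
    ("case 3", "female=yes"),
    ("case 4", "over 50=yes"),
}


def attributes_of(subject):
    """All attributes m with (subject, m) in I."""
    return {m for (g, m) in I if g == subject}


def subjects_with(*attributes):
    """All subjects g that have every one of the listed attributes."""
    return {g for g in G if all((g, m) in I for m in attributes)}


def cell_count(*attributes):
    """Contingency-table cell: number of subjects sharing the listed attributes."""
    return len(subjects_with(*attributes))


print(attributes_of("case 2"))
print(cell_count("female=yes", "diseased=yes"))
```

Storing I as a set of ordered pairs is one design choice among several; a 0/1 matrix with rows indexed by G and columns by M carries the same information.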
33 Quantification: Disease frequency
- Goal will be to see if occurrence of disease differs in populations with different characteristics or experiences (note: comparison is at the heart of this)
- Quantify disease occurrence in a population at a certain point or period of time
- Population (counting, absolute scale)
- How big?
- Composition?
- Occurrence (counting, absolute scale)
- Existing cases? New cases?
- Time
- Calendar time? (NB: interval scale, preserved under positive linear transformation)
- Duration of time (NB: ratio scale, preserved under similarity transformation)
- More about this in Fred Roberts's tutorial
34 Ex. Hypothetical Frequency of AIDS in Two Cities
          New cases   Time period   Population
City A           58   1985              25,000
City B           35   1985-86            7,000
- Annual "rate" of AIDS
- City A: 58/25,000/1 yr = 232/100,000/yr
- City B: 35/7,000/2 yrs = 17.5/7,000/yr = 250/100,000/yr
- Make it easy to compare rates (i.e., make them commensurable) by using the same population unit (say, per 100,000 people) and time period (say, 1 year)
- NB: Commensurability is a property of the underlying relational system used in measurement (treated in Roberts's tutorial)
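The rescaling in this example can be sketched in a few lines. The function name `annual_rate_per_100k` is an illustrative choice; the inputs are the two cities' figures from the slide.

```python
def annual_rate_per_100k(new_cases, population, years):
    """New cases per 100,000 people per year (common population unit and time unit)."""
    return new_cases / population / years * 100_000


city_a = annual_rate_per_100k(new_cases=58, population=25_000, years=1)
city_b = annual_rate_per_100k(new_cases=35, population=7_000, years=2)

# Rounded for display; the values are about 232 and 250 per 100,000 per year.
print(round(city_a, 1), round(city_b, 1))
```

Once both figures share the same units, City B's smaller case count turns out to correspond to the higher annual rate.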
35 Three kinds of quantitative measures of frequency of occurrence
Used to relate number of cases of disease, size of population, time
- Proportion: numerator is a subset of the denominator, often expressed as a percentage
- Ratio: division of one number by another; the numbers don't have to be related
- Rate: time (sometimes space) is an intrinsic part of the denominator; the term is often misused (e.g., birthrate)
- Need to specify if the measure represents events or people
36 (Point) Prevalence (P): Quantifies number of existing cases of disease in a population at a point in time
- P = number of existing cases of disease (at a given point in time) / total population
- Ex. City A has 7,000 people with arthritis on Jan 1st, 2002
- Population of City A = 70,000
- Prevalence of arthritis on Jan 1st = 0.10 or 10%
Prevalence is a proportion
37 Incidence quantifies the number of
- (a) new cases of disease that
- (b) develop in a population at risk
- (c) during a specified time period
- Three key ideas
- New disease events or, for diseases that can occur more than once, usually the first occurrence of disease
- Population at risk (candidate population): can't have the disease already, should have the relevant organs
- Enough time must pass for a person to move from health to disease
38 Two Types of Incidence Measures
- Cumulative incidence (attack rate), abbreviated Cum Inc or CI
- Incidence rate (incidence density), abbreviated I, IR or ID
39 Incidence rate (I, IR) = number of new cases of disease / total person-time of observation
- Also called incidence density (ID)
40 Accrual of Person-Time
[Figure: follow-up timelines for three subjects on a calendar axis; x marks the outcome of interest]
Subject 1: 1.1 person-years (PY), ends with outcome (x)
Subject 2: 1.2 PY, ends with outcome (x)
Subject 3: 2.2 PY, no outcome
Total: 4.5 PY
x = outcome of interest; incidence rate = 2/4.5 PY
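The person-time calculation above can be sketched as follows, using the three subjects from the figure. The tuple layout and the function name `incidence_rate` are illustrative assumptions.

```python
# (label, person-years of follow-up, whether the outcome occurred)
follow_up = [
    ("Subject 1", 1.1, True),
    ("Subject 2", 1.2, True),
    ("Subject 3", 2.2, False),
]


def incidence_rate(records):
    """New cases divided by total person-time of observation."""
    person_time = sum(person_years for _, person_years, _ in records)
    events = sum(1 for _, _, had_outcome in records if had_outcome)
    return events / person_time


rate = incidence_rate(follow_up)
print(f"{rate:.3f} cases per person-year")
```

Every subject contributes time to the denominator for as long as they are observed and at risk, whether or not they ever become a case.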
41 Some Ways to Accrue 100 PY
- 100 people followed 1 year each = 100 py
- 10 people followed 10 years each = 100 py
- 50 people followed 1 year plus 25 people followed 2 years = 100 py
- Time unit for person-time: year, month or day
- Person-time: person-year, person-month, person-day
42 Ex. (Cohort) study of risk of breast cancer among women with hyperthyroidism
- Followed 1,762 women -> 30,324 py
- Average of 17 years of follow-up per woman
- Ascertained 61 cases of breast cancer
- Incidence rate = 61/30,324 py = .00201/y = 201/100,000 py (.00201 x 100,000 p / 100,000 p)
43 Dimensions
Measure                Numerator / denominator   Dimension
Prevalence             people / people           none
Cumulative incidence   people / people           none
Incidence rate         people / people-time      time^-1
44 Types of (instantaneous) rates
Relative rate (person-time or incidence rate)
Absolute rate (used in infectious disease epi and
health services)
Also where units do not involve time, such as
accidents per passenger mile or cases per square
area
45 Relationship between prevalence and incidence
- P = IR x D
- Prevalence depends on incidence rate and duration of disease (duration lasts from onset of disease to its termination)
- If incidence is low but duration is long -> prevalence is relatively high
- If incidence is high but duration is short -> prevalence is relatively low
- This is an example of Little's equation in queuing theory: time-avg number of units in the system = arrival rate x avg delay time/unit
- This equation is true if ...
46 Conditions for equation to be true
- Steady state
- IR constant
- Distribution of durations constant
- Prevalence of disease is low (less than 10%)
- In queuing theory terms: strictly stationary process in steady state conditions
47 Figuring duration from prevalence and incidence
- Lung cancer incidence rate = 45.9/100,000 py
- Prevalence of lung cancer = 23/100,000
- D = P / IR = (23/100,000 p) / (45.9/100,000 py) = 0.5 years
- Conclusion: Individuals with lung cancer survive on average 6 months from diagnosis to death
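A short sketch of this calculation, using the lung cancer figures from the slide. The second function is an addition: it uses the standard steady-state relation in which the prevalence odds P/(1 - P), not P itself, equal IR x D, which is presumably why low prevalence is listed as a condition. The function names are illustrative.

```python
def duration_low_prevalence(prevalence, incidence_rate):
    """Mean duration from P = IR x D (approximation for low prevalence)."""
    return prevalence / incidence_rate


def duration_from_odds(prevalence, incidence_rate):
    """Mean duration from P / (1 - P) = IR x D (steady state, any prevalence)."""
    return prevalence / (1.0 - prevalence) / incidence_rate


p = 23 / 100_000        # prevalence (a proportion)
ir = 45.9 / 100_000     # incidence rate per person-year

print(round(duration_low_prevalence(p, ir), 3))   # about half a year
print(round(duration_from_odds(p, ir), 3))        # nearly identical when P is small
```

At a prevalence this low the two versions agree to several decimal places; they drift apart as prevalence grows.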
48 Uses of Prevalence and Incidence Measures
- Prevalence: administration, planning
- Incidence: etiologic research (problems with prevalence since it combines IR and D), planning
49 Common measures of disease frequency for public health
- Crude death (mortality) rate = total number of deaths from all causes per 1,000 people, for one year
- (also cause-specific, age-specific, race-specific death rate)
- Live-birth rate = total number of live births per 1,000 people (sometimes women of childbearing age), for one year
- Infant mortality rate = deaths of infants under 1 year of age per 1,000 live births, for one year
50 Frequency measures used in infectious disease epidemiology
- Attack rate = # cases of disease that develop during a defined period / # in pop. at risk at start of period
- (usually used for infectious disease outbreaks)
- Case fatality rate = # of deaths / # cases of disease, for a defined period of time
- Survival rate = # living cases / # cases of disease, for a defined period of time
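As an illustration of these three measures, here is a sketch for a made-up outbreak. The counts (200 people at risk, 50 cases, 5 deaths) are hypothetical and not from the tutorial.

```python
# Hypothetical outbreak counts for one defined period.
at_risk_at_start = 200
cases = 50
deaths_among_cases = 5

attack_rate = cases / at_risk_at_start                        # cases / population at risk
case_fatality_rate = deaths_among_cases / cases               # deaths / cases
survival_rate = (cases - deaths_among_cases) / cases          # living cases / cases

print(f"attack rate        = {attack_rate:.0%}")
print(f"case fatality rate = {case_fatality_rate:.0%}")
print(f"survival rate      = {survival_rate:.0%}")
```

All three are proportions despite being called rates: each numerator is a subset of its denominator, and case fatality and survival sum to one over the same period.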
51 Tutorial part 2: Exposure - Disease Relationship
52 Reprise: Epidemiology is a science within public health
- This means that it adopts a population perspective
- As a science, it is also quantitative
- As a science, it is also interested in explanation and prediction, not just describing
53 Questions asked by communities
- Exposure-driven questions
- What will happen to me, my family, my community?
- Outcome-driven questions
- Why me, why my child, why us?
- Mixed
- Are we sicker than our neighbors?
54 The usual notion of causation: John Stuart Mill's Method of Difference
- A causes B if, all else being held constant, a change in A is accompanied by a subsequent change in B.
- This of course does not mean that nothing else can produce a change in B.
- The formal method to detect such an occurrence is the Experiment, whereby all things are held constant except A and B, A is varied, and B is observed
55 Experimental vs. Observational Science
- Epidemiology is an observational science
- We do not control the independent variable (or most other variables)
- What is the implication of this for the status of epidemiology as a science?
- What does it mean about epidemiology's ability to prove causation?
56 Sources of information
- Case studies
- Experimental studies
- Observational studies
Once results are observed, it remains to explain
or interpret the observation, whether the result
is a difference or a lack of a difference in the
compared entities.
57 Types of observational study designs
- Descriptive
- Case study and case-series
- No comparison: person, place and time
- Cross-sectional comparison (Are we sicker than our neighbors?)
- Ecological (comparing communities/environments, not individual level)
- Notice how descriptive and analytic shade into each other (as per examples we did earlier)
- Cohort (What's going to happen to me?)
- Analog of the laboratory experiment
- Case-control (Why me?)
58 Central idea: compare frequencies of occurrence in two groups
- Example: Summarize relationship between exposure and disease by comparing two measures of disease frequency
- Overall rate of disease in an exposed group says nothing about whether exposure is a risk factor for (causes) a disease
- This can be evaluated by comparing disease incidence in an exposed group to another group that is not exposed (a comparison group)
- Comparison or contrast is the essence of epidemiology
59 Two Main Options for Comparing Disease Frequencies
- 1. Calculate ratio of two measures of disease frequency (a measure in exposed group and a measure in unexposed comparison group)
- 2. Calculate difference between two measures of disease frequency (a measure in exposed group and a measure in unexposed comparison group)
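The two options can be sketched with a pair of incidence rates. The numbers below (30 cases in 10,000 person-years among the exposed, 10 cases in 10,000 person-years among the unexposed) are hypothetical, as are the function names.

```python
def rate_ratio(rate_exposed, rate_unexposed):
    """Option 1: relative comparison (dimensionless)."""
    return rate_exposed / rate_unexposed


def rate_difference(rate_exposed, rate_unexposed):
    """Option 2: absolute comparison (keeps the units of the rates)."""
    return rate_exposed - rate_unexposed


# Hypothetical incidence rates, in cases per person-year.
exposed = 30 / 10_000
unexposed = 10 / 10_000

ratio = rate_ratio(exposed, unexposed)
difference = rate_difference(exposed, unexposed)

print(f"ratio: {ratio:.1f}")
print(f"difference: {difference * 10_000:.0f} per 10,000 person-years")
```

A ratio of 1 or a difference of 0 both mean no association, but the two measures answer different questions: how many times more frequent versus how many extra cases.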
60 At the heart of an epidemiological study ...
- Lies a comparison
- Between 2 rates, ratios, proportions
- Is the difference/lack of difference due to
- Bias?
- Chance?
- Real effect?
61 Determinants of the comparison
- Compared measures differ or they don't (the set of values is linearly ordered)
- Either way, the comparison may be affected by
- Chance (sample variation)
- Bias
- Real effect or lack of effect
- To interpret the comparison and evaluate the last factor, we need to account for effects of the first two
62 Role of statistics
- Evaluates role that chance might play in the absence of any other factor
- Also used for summary purposes or to express a model mathematically
- Not the main preoccupation of epidemiologists, however
- Bias is the main preoccupation of epidemiologists
63 Evaluating the role of bias
- Epidemiology is an observational discipline, so uncontrolled variables abound
- Most of training is in recognizing and accounting for sources of bias, often extremely subtle
- Less emphasis on role of chance, often handed over to biostatisticians
- Extent to which content area (real effect) is taken into account varies with investigator and who collaborators are
64 I. Definition of Bias
- Bias is a systematic error that results in an incorrect (invalid) estimate of the measure of association
- A. Can create spurious association when there really is none (bias away from the null)
- B. Can mask an association when there really is one (bias towards the null)
- C. Bias is primarily introduced by the investigator or study participants
65 I. Definition of Bias (cont)
- D. Bias does not mean that the investigator is prejudiced or not objective
- E. Bias can arise in all study types: experimental, cohort, case-control
- F. Bias occurs in the design and conduct of a study. It cannot be fixed in the analysis phase.
- G. Two main types of bias are selection and information bias, but there are many other types of bias
- H. We will consider only selection and information bias for purposes of illustration of epidemiologic practice
66 II. Selection Bias
- A. Results from procedures used to select subjects into a study that lead to a result different from what would have been obtained from the entire population targeted for study
- B. Most likely to occur in case-control or retrospective cohort studies because exposure and outcome have occurred at time of study selection
67 Selection Bias in a Case-Control Study
- A. Occurs when controls or cases are more (or less) likely to be included in the study if they have been exposed; that is, inclusion in the study is not independent of exposure
- B. Result: The relationship between exposure and disease observed among study participants is different from the relationship between exposure and disease in individuals who would have been eligible but were not included. The OR from a study that suffers from selection bias will incorrectly represent the relationship between exposure and disease in the overall study population
68 Selection Bias: Case-Control Study
- Question: Do PAP smears prevent cervical cancer? Cases diagnosed at a city hospital. Controls randomly sampled from households in the same city by canvassing the neighborhood on foot. Here is the true relationship:
OR = (100)(100) / (150)(150) = .44. There was a 56% reduced risk of cervical cancer among women who had PAP smears as compared to women who did not. (40% of cases had PAP smears versus 60% of controls)
69 Selection Bias: Case-Control Study (cont)
- Recall: Cases come from the hospital and controls come from the neighborhood around the hospital.
- Now for the bias: Only controls who were at home at the time the researchers came around to recruit for the study were actually included in the study. Women at home were more likely not to work and were less likely to have regular checkups and PAP smears. Therefore, being included in the study as a control is not independent of the exposure. The resulting data are as follows:
70 Selection Bias (cont)
OR = (100)(150) / (150)(100) = 1.0. There is no association between PAP smears and the risk of cervical cancer. Here, 40% of cases and 40% of controls had PAP smears.
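The two odds ratios in this example can be reproduced with a small sketch. The cell counts are read off the products and percentages quoted on the slides (100 of 250 cases exposed; 150 of 250 controls exposed in truth, 100 of 250 in the biased sample); the function and variable names are illustrative.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product ratio of a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# True relationship: 40% of cases and 60% of controls had PAP smears.
true_or = odds_ratio(
    exposed_cases=100, unexposed_cases=150,
    exposed_controls=150, unexposed_controls=100,
)

# Biased sample: at-home controls had fewer PAP smears (40%), cases unchanged.
biased_or = odds_ratio(
    exposed_cases=100, unexposed_cases=150,
    exposed_controls=100, unexposed_controls=150,
)

print(f"true OR   = {true_or:.2f}")
print(f"biased OR = {biased_or:.2f}")
```

Only the control row changes between the two tables, which is enough to move the estimate from a protective association all the way to the null.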
71 Selection Bias: Case-Control Study (cont)
- Ramifications of using women who were at home during the day as controls
- These women were not representative of the whole study population that produced the cases. They did not accurately represent the distribution of exposure in the study population that produced the cases, and so they gave a biased estimate of the association.
72 When interpreting study results, ask yourself these questions
- Given conditions of the study, could bias have occurred?
- Is bias actually present?
- Are consequences of the bias large enough to distort the measure of association in an important way?
- Which direction is the distortion? Is it towards the null or away from the null?
73 Imputation of Causality
- What are the roles of
- Bias: the critique checklist
- Chance: statistical significance
- Real effect
- The Hill viewpoints
- Not necessary criteria (not even criteria)
- Not a checklist
- The way it's really done...
74 Marks of causality
- Strength of association
- Biologically plausible
- Biological gradient (dose-response)
- Appropriate temporal relationship
- Specificity
- Consistency
75 The Fundamental Question (according to Hill)
- "Clearly none of these nine viewpoints can bring indisputable evidence for or against a cause-and-effect hypothesis and equally none can be required as a sine qua non. What they can do, with greater or less strength, is to help us to answer the fundamental question--is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect?"
76 How it's really done
- Assemble the evidence from the literature. What are the pieces of the jigsaw?
- How do you decide?
- Where do they fit?
- How do you decide?
77 Interpretation
- Evaluate the evidence (a study) for internal validity
- Evaluate the evidence for external validity
- Bottom line
- What roles are played by bias, chance, real effect?
78 Assemble the jigsaw pieces into a picture
- The picture is your version of causality
- Your picture may disagree with that of other scientists
- Disagreement among scientists is the rule, not the exception
79 Mathematics in epidemiology
- Traditional
- Evaluate role of chance (statistical hypothesis testing, estimation)
- Descriptive (compact summary or generative model)
- Infectious disease epidemiology: dynamics
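80 Comparing chronic and infectious disease epidemiology
[Figure: two compartment diagrams, one with compartments S and P (chronic disease) and one with compartments S, I and R (infectious disease), connected by arrows labeled with transition rates]
81 [Figure repeated, with a legend for the transition rates]
- Rates on the arrows: birth rate or migration in-rate; incidence rate or infectivity rate; mortality and recovery rates, with mortality split into a case fatality rate and a background mortality rate
- Prevalence rate = P/(S+P)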
82 Comparing chronic and infectious epi (cont'd)
- Chronic
- Usually concentrate on the incidence rate because interested in etiology
- Have to account for the fact that the incidence rate is a function of calendar time and age, exposure (?metric), sex, race, SES, occupation, co-morbid conditions, latency
- But not usually population size or density, number of other cancer cases, etc.
- Infectious
- Interest in the incidence (infectivity) rate is usually limited to its value as a parameter: we know the etiology
- Interested in dynamics over time and space, existence of thresholds or periods, effect of parameters and initial conditions like size of initial population, infectivity, mode of contact
Difference is one of emphasis and interest, not concepts
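To illustrate what "dynamics" and "thresholds" mean on the infectious side, here is a minimal sketch of a textbook S-I-R model stepped forward with Euler's method. It is a simplification and not the tutorial's own model: it leaves out births, migration and mortality, and the names `beta` (transmission rate), `gamma` (recovery rate) and all numeric values are assumptions made for illustration.

```python
def simulate_sir(beta, gamma, s0, i0, rec0, days, dt=0.1):
    """Euler steps for dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    s, i, r = float(s0), float(i0), float(rec0)
    n = s + i + r                      # closed population: no births or deaths
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((k * dt, s, i, r))
    return history


def peak_infected(history):
    """Largest number simultaneously infected over the run."""
    return max(i for _, _, i, _ in history)


# Hypothetical parameters: beta/gamma above 1 versus below 1.
above = simulate_sir(beta=0.5, gamma=0.25, s0=990, i0=10, rec0=0, days=120)
below = simulate_sir(beta=0.2, gamma=0.25, s0=990, i0=10, rec0=0, days=120)

print(round(peak_infected(above)), round(peak_infected(below)))
```

In this sketch the number infected grows only when beta/gamma exceeds 1 (more precisely, when beta x S/N exceeds gamma), a threshold that depends on the state of the population. That dependence is the kind of feature the chronic-disease side usually does not need to consider.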
83 Some new uses for mathematics in epidemiology
- Formalization and theoretical tools
- Pattern and rule detection (data mining)
- Descriptive modeling
- Prediction from data
- Classification
- Taxonomy
- Data organization and retrieval from large databases
- Patient confidentiality/coding/cryptography
- Multi-scale inference
- Network construction/applications, etc.