Title: Selected Problems in Epidemiology
1. Selected Problems in Epidemiology
- Nina H. Fefferman, Ph.D.
- Co-Director Tufts Univ. InForMID
2. Data mining in public health is not new, but it is more complicated
A small historical example: Cholera, John Snow, 1854
During the height of the miasma theory of disease:
1) There was a cholera outbreak in London
2) John Snow became irrationally convinced that cholera came from contaminated drinking water
So Snow went to the London Registrar-General. He looked at where those who died from cholera got their water, and when.
"The experiment was on the grandest scale. No fewer than 300,000 people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into two groups without their choice, and, in most cases, without their knowledge; one group being supplied with water containing the sewage of London, and, amongst it, whatever might have come from the cholera patients, the other group having water quite free from such impurity."
John Snow, On the Mode of Communication of Cholera, Second Edition, 1855
3. Snow's findings
Water supplier                    Number of houses   Deaths from cholera   Deaths per 10,000 houses
Southwark and Vauxhall Company              40,046                 1,263                        315
Lambeth Company                             26,107                    98                         37
Rest of London                             256,423                 1,422                         59
Before 1852, your chances of getting cholera were not correlated with which of the two water companies supplied your water.
In the epidemic of 1853-54, your chances of dying from cholera if your water came from Southwark and Vauxhall were more than eight times greater than if it came from Lambeth.
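The "more than eight times" figure can be recomputed from the house and death counts in Snow's table. The sketch below is only an arithmetic check; the function and variable names are illustrative, and the recomputed rates can differ from the printed table in the last digit because of rounding.

```python
# Counts copied from Snow's table for the two water companies.
suppliers = {
    "Southwark and Vauxhall Company": {"houses": 40_046, "deaths": 1_263},
    "Lambeth Company": {"houses": 26_107, "deaths": 98},
}


def deaths_per_10k(houses, deaths):
    """Cholera deaths per 10,000 houses supplied."""
    return deaths / houses * 10_000


rates = {name: deaths_per_10k(**row) for name, row in suppliers.items()}

# How many times higher the death rate was for Southwark and Vauxhall customers.
rate_ratio = rates["Southwark and Vauxhall Company"] / rates["Lambeth Company"]

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 10,000 houses")
print(f"Rate ratio (Southwark and Vauxhall / Lambeth): {rate_ratio:.1f}")
```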
4. And then it got really impressive
Then cholera recurred in the Soho district of London. About 600 people died from cholera in a 10-day period.
Once again, Snow took the operational death-certificate data from the Registrar-General.
This time he plotted the data on a clustering diagram, using a stacked histogram technique plotted on a map of Soho, to do the data mining.
5. Lives saved due to real-time data mining
Based upon this map, Snow was able to convince the London Board of Guardians to remove the pump handle from the public pump located on Broad Street.
The outbreak of cholera subsided with this operational change.
It was later revealed that the Broad Street well was contaminated by an underground cesspool located at 40 Broad Street, which was just three feet from the well.
The Broad Street pump without a handle remains today as a tribute to Snow.
6. Modern problems: happening on every scale imaginable
- Genetic: We know what we're looking at and what we're looking for, just not how to find it
- Single Defined Population: We know who we're looking at and what we're looking for, but not how to find it
- Undefined Population: We don't know who to look at, but we know what to look for
- Undefined Everything: We want to save lives, but don't know what to do at all
7. Genetic Epidemiology
Chromosome   Sequence Length (in base pairs)
1            245,203,898
2            243,315,028
3            199,411,731
4            191,610,523
5            180,967,295
6            170,740,541
7            158,431,299
8            145,908,738
9            134,505,819
10           135,480,874
11           134,978,784
12           133,464,434
13           114,151,656
14           105,311,216
15           100,114,055
16            89,995,999
17            81,691,216
18            77,753,510
19            63,790,860
20            63,644,868
21            46,976,537
22            49,476,972
X            152,634,166
Y             50,961,097
- You have good reason to believe that a disease has a genetic component
- You have the sequenced genomes of some afflicted people
- The human genome is huge
- Normally, one-tenth of a single percent of DNA (about 3 million bases) differs from one person to the next
- Luckily, junk DNA makes up at least 50% of the human genome
- But we still know of about 1.4 million locations where single nucleotide polymorphisms (SNPs) occur in humans
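The "about 3 million bases" figure follows from the chromosome lengths listed on this slide. A minimal check, using only those lengths and the one-tenth-of-a-percent figure quoted above (the variable names are illustrative):

```python
# Sequence lengths in base pairs, copied from the table on this slide.
CHROMOSOME_LENGTHS = {
    "1": 245_203_898, "2": 243_315_028, "3": 199_411_731, "4": 191_610_523,
    "5": 180_967_295, "6": 170_740_541, "7": 158_431_299, "8": 145_908_738,
    "9": 134_505_819, "10": 135_480_874, "11": 134_978_784, "12": 133_464_434,
    "13": 114_151_656, "14": 105_311_216, "15": 100_114_055, "16": 89_995_999,
    "17": 81_691_216, "18": 77_753_510, "19": 63_790_860, "20": 63_644_868,
    "21": 46_976_537, "22": 49_476_972, "X": 152_634_166, "Y": 50_961_097,
}

# Fraction of bases that differ between two people, as quoted on the slide.
FRACTION_DIFFERING = 0.001

total_bases = sum(CHROMOSOME_LENGTHS.values())
differing_bases = total_bases * FRACTION_DIFFERING

print(f"Total sequence length: {total_bases:,} base pairs")
print(f"Expected differing bases: about {differing_bases / 1e6:.1f} million")
```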
8. So we need Data Mining
A paper on something like this: Rodin et al. 2005 J Comput Biol. 12(1):1-11. Mining Genetic Epidemiology Data with Bayesian Networks: Application to APOE Gene Variation and Plasma Lipid Levels
- This type of examination is called a large-scale genotype-phenotype association study
- Classical statistical methods (e.g., multivariable regression, contingency table analysis) are ill suited for high-dimensional problems because they are single inference procedures
- We need joint inference procedures
- Methods for combining results across multiple single inference procedures are inefficient
- In this type of case, data-mining methods are hypothesis-generating and classical statistical methods are hypothesis-testing
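One way to see why combining many single inference procedures is inefficient is to count the tests. The sketch below assumes one test per known SNP location (the 1.4 million figure quoted earlier) at a conventional 0.05 significance level, and uses the Bonferroni correction as an example of a combining method. The significance level and the choice of Bonferroni are assumptions made for illustration, not something the talk specifies.

```python
N_SNPS = 1_400_000  # known SNP locations, as quoted in the talk
ALPHA = 0.05        # assumed per-test significance level

# If no SNP were truly associated with the disease, this many tests
# would still come out "significant" by chance alone.
expected_false_positives = N_SNPS * ALPHA

# Bonferroni correction: divide the significance level by the number of tests.
# The resulting per-test threshold is so strict that real but modest
# associations are easily missed.
bonferroni_threshold = ALPHA / N_SNPS

print(f"Expected chance findings without correction: {expected_false_positives:,.0f}")
print(f"Per-test threshold with Bonferroni: {bonferroni_threshold:.2e}")
```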
9. A single defined population: We know who we're looking at and what we're looking for, but not how to find it
In an adverse reaction study for a new vaccine or drug:
- We know who to watch (those who receive the treatment)
- We know what we're looking for (bad things that happen to them)
- How do we find it? We also have to monitor people who don't get the treatment and see what happens to them
- We wind up with a huge set of all bad things that happen to lots of people
This leads to a lot of problems
10. Example problems in data mining for adverse events
A reference and paper on something like this: http://www.fda.gov/cder/aers/default.htm or Niu et al. 2001 Vaccine. 19(32):4627-34.
- Health care providers report adverse reactions by patients to any drug
- Unfortunately, many patients need to take several drugs at once, so all will be reported with the same event
- And there's reporting bias: results don't reflect the overall population (only the people who needed the drug in the first place, but that's probably the portion we're worried about anyway)
Explicit example: Sudden Infant Death Syndrome (SIDS) and the polio vaccine
- You can easily find a statistical association between the two
- Does this mean the polio vaccine is dangerous? Not necessarily: the polio vaccine is mainly given to infants, who are the only possible victims of SIDS
- Receiving the polio vaccine increases your likelihood of being an infant, which significantly increases your chance of SIDS
- We would need to see if there is an association within infants
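The SIDS example can be made concrete by comparing a crude risk ratio with one computed within infants only. The counts below are invented purely for illustration (they are not real surveillance data) and are chosen so that the vaccine has no effect within infants.

```python
# (age group, vaccination status) -> (number of people, SIDS cases).
# All numbers are hypothetical and exist only to illustrate confounding by age.
cohort = {
    ("infant", "vaccinated"): (9_000, 9),
    ("infant", "unvaccinated"): (1_000, 1),
    ("older", "vaccinated"): (1_000, 0),
    ("older", "unvaccinated"): (89_000, 0),
}


def risk(groups):
    """SIDS cases divided by people, pooled over the given groups."""
    people = sum(cohort[g][0] for g in groups)
    cases = sum(cohort[g][1] for g in groups)
    return cases / people


def risk_ratio(age_groups):
    """Risk in the vaccinated divided by risk in the unvaccinated."""
    vaccinated = risk([(age, "vaccinated") for age in age_groups])
    unvaccinated = risk([(age, "unvaccinated") for age in age_groups])
    return vaccinated / unvaccinated


crude_rr = risk_ratio(["infant", "older"])  # ignores age entirely
infant_rr = risk_ratio(["infant"])          # looks within infants only

print(f"Crude risk ratio: {crude_rr:.0f}")
print(f"Risk ratio within infants: {infant_rr:.1f}")
```

With these made-up counts the crude comparison shows a large association that disappears once the comparison is restricted to infants.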
11. Undefined Population: We don't know who to look at, but we know what to look for
Example: Figuring out the source of a food-borne outbreak
(Good news: we know some diseases are caused by food-borne pathogens)
- We can hypothesize that a certain activity is somehow related to the source, like the food at a party being contaminated
- Unfortunately, there can be a lot of food at one large party
- You might not know if the food at the party is actually the culprit
- You need to ask if people at the party got sick
- If they did, you need to know which particular food at the party is contaminated
The normal process here is to call everyone at the party and conduct a survey (see handout)
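Survey answers of this kind are commonly summarized as attack rates: for each food, the fraction who became ill among those who ate it, compared with those who did not. The sketch below uses invented responses and food names. It is an illustration only, does not reproduce the survey in the handout, and ranks foods by the difference in attack rates as one simple choice among several.

```python
# Each response: (became ill?, foods eaten). All data here is hypothetical.
responses = [
    (True, {"chicken salad", "punch"}),
    (True, {"chicken salad", "cake"}),
    (True, {"chicken salad"}),
    (False, {"punch", "cake"}),
    (False, {"cake"}),
    (False, {"chicken salad", "punch"}),
    (True, {"punch"}),
    (False, {"punch", "cake"}),
]


def attack_rate(people):
    """Fraction of the given respondents who became ill."""
    return sum(ill for ill, _ in people) / len(people)


foods = sorted(set().union(*(eaten for _, eaten in responses)))

table = []
for food in foods:
    ate = [r for r in responses if food in r[1]]
    did_not_eat = [r for r in responses if food not in r[1]]
    rate_ate = attack_rate(ate)
    rate_not = attack_rate(did_not_eat)
    table.append((rate_ate - rate_not, food, rate_ate, rate_not))

# The food with the largest excess attack rate is the leading suspect.
table.sort(reverse=True)
for difference, food, rate_ate, rate_not in table:
    print(f"{food:15s} ate: {rate_ate:.2f}  did not eat: {rate_not:.2f}  "
          f"difference: {difference:+.2f}")
```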
12. These surveys can generate a huge amount of data, and there's no guarantee that the party was the source of the outbreak
- Horror scenario from a data perspective: food poisoning at the Republican National Convention
- We wouldn't know:
- Which day
- Which location
- Which caterer
- How many people were made ill
- How do you figure out what, how, and who in real time?
- Part of the problem is to get the answer before more people become sick, so you want to narrow the focus of your investigation as you go: ask fewer people, ask fewer questions, because all these surveys take time
13. Undefined Everything: We want to save lives, but don't know what to do at all
Cancer (you'll hear more about this later in the program from Dmitriy Fradkin)
- Huge numbers of people diagnosed
- Huge numbers of possible contributing risks: environmental exposure to carcinogens, genetic predisposition, cancer-causing viruses
- Huge numbers of confounding factors: differences in diagnosis, treatment, outcome, co-morbidity
14. Let's say we're worried about the beginning of an outbreak of H5N1 avian flu
- It will probably start out looking like normal flu
- How quickly we can figure out where it is will determine how quickly we can try active intervention strategies
- We don't know where it will start: International travel? Near airports? International bird migration patterns? Along the coasts? Depending on time of year?
- Once it's here, we don't really know how it will spread
- Maybe we want an early warning system for cities: is the disease present or absent?
15. These are the types of epidemiological problems we face. What are the kinds of practical constraints we have to expect?
- There are many data collectors: insurance companies, HMOs, public health agencies
- Issues of data control: Who controls the data? Is each entity found only at a single site? Do different sites contain different types of data?
- How can we make sure the data isn't redundant and therefore skewing our information?
- How can we make sure we get all the pertinent data at the same time? Or at least, how fast is fast enough to figure out what we need as quickly as possible?
16. And Privacy and Ethics
For more information, see http://www.hipaa.org/
- Individual privacy concerns limit the willingness of the data custodians to share it, even with government agencies such as the U.S. Centers for Disease Control
- In many cases, data is shared only after it has been de-identified according to HIPAA regulations
- This removes a lot of useful information and doesn't really do a whole lot to protect privacy, but that's another issue (see Fefferman et al. 2005 J. Public Health Policy 26(4):430-449)
- We need a whole different slew of data mining techniques to mine data blind (when we don't know what we're seeing, what the numbers represent, how much they've been aggregated to represent averages, or what we're looking for)
17. And other problems
- Sometimes we don't know where the best source of data is
- We can monitor some cities more closely
- We can monitor certain diseases (notifiable diseases), although this is constrained by having to verify by lab test
- Sometimes our expectations of normal levels of disease set the wrong benchmark for when we should start being concerned
- Different diseases have different normal incidence, which means that an increase of 10 cases per year of one disease is an outbreak, but it would take an increase of 1000 in another to be unusual
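A rough way to see why the same increase matters for one disease and not another is to measure the excess in units of expected year-to-year variation. The sketch below assumes annual counts behave like Poisson counts (variance equal to the mean), which real surveillance data often violates. The two baseline figures are hypothetical and stand in for a rare and a common disease.

```python
import math


def excess_z_score(baseline_cases, observed_cases):
    """Standard deviations above baseline, assuming Poisson-distributed counts."""
    return (observed_cases - baseline_cases) / math.sqrt(baseline_cases)


# Hypothetical annual baselines for a rare and a common disease.
RARE_BASELINE = 25
COMMON_BASELINE = 40_000

z_rare_plus_10 = excess_z_score(RARE_BASELINE, RARE_BASELINE + 10)
z_common_plus_10 = excess_z_score(COMMON_BASELINE, COMMON_BASELINE + 10)
z_common_plus_1000 = excess_z_score(COMMON_BASELINE, COMMON_BASELINE + 1000)

print(f"Rare disease, +10 cases:     z = {z_rare_plus_10:.2f}")
print(f"Common disease, +10 cases:   z = {z_common_plus_10:.2f}")
print(f"Common disease, +1000 cases: z = {z_common_plus_1000:.2f}")
```

Under these assumptions, ten extra cases stand out against the rare baseline and are invisible against the common one, where it takes on the order of a thousand extra cases to look unusual.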
18. BOTULISM, FOODBORNE: Number of reported cases, by year - United States, 1983-2003
19. ESCHERICHIA COLI, ENTEROHEMORRHAGIC O157:H7: Number of reported cases - United States and U.S. territories, 2003
Sometimes we expect something intermediate
20. SALMONELLOSIS: Incidence per 100,000 population, by year - United States, 1973-2003
And sometimes we expect the numbers to be reasonably large
21. ACQUIRED IMMUNODEFICIENCY SYNDROME (AIDS): Number of reported cases, by year - United States and U.S. territories, 1983-2003
And sometimes our methods of surveillance themselves create issues
Total number of AIDS cases includes all cases reported to CDC as of December 31, 2003. Total includes cases among residents in U.S. territories and 220 cases among persons with unknown state of residence.
22. Sometimes our problems are prospective; sometimes our problems are retrospective
- In outbreak detection and biosurveillance, we want to find unusual disease incidence early
- In adverse reaction trials, we want to know overall effects; we don't particularly care about the time scales on which they act
- In classic epidemiological investigations, we are looking for the source of exposure to prevent further infection
23. Advances in technology have caused a shift in our data mining needs
- It used to be that the bottleneck to appropriate analysis was figuring out where to look for the data and collecting it: a pre-processing problem
- Due to advances in reporting technology, we're very close to getting real-time reporting for mortality data, and we're getting there for incidence data (for at least some diseases)
- Now we have to figure out how to find meaningful results in the chaos and clutter
24. Data mining techniques can be tailored to handle all of these problems
We haven't covered all of the problems, but as you can see, we need better techniques, and we need more people working on the use of these techniques.
Thanks for attending this workshop. We need you!