What's Strange About Recent Events (WSARE)

1
What's Strange About Recent Events (WSARE)

Weng-Keen Wong (University of Pittsburgh)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)

This work was funded by DARPA, the State of
Pennsylvania, and NSF
2
Motivation
Suppose we have real-time access to Emergency
Department data from hospitals around a city
(with patient confidentiality preserved)
3
The Problem
From this data, can we detect if a disease
outbreak is happening?
4
The Problem
From this data, can we detect if a disease
outbreak is happening?
We're talking about non-specific disease
detection
5
The Problem
From this data, can we detect if a disease
outbreak is happening? How early can we detect
it?
6
The Problem
From this data, can we detect if a disease
outbreak is happening? How early can we detect
it?
The question we're really asking:
What's strange about recent events?
7
Traditional Approaches

What about using traditional anomaly detection?
Typically assume data is generated by a model

Finds individual data points that have low
probability with respect to this model
These outliers have rare attributes or
combinations of attributes

Need to identify anomalous patterns, not isolated
data points

8
Traditional Approaches
What about monitoring aggregate daily counts of
certain attributes?

We've now turned multivariate data into
univariate data
Lots of algorithms have been developed for
monitoring univariate data

Time series algorithms
Regression techniques
Statistical Quality Control methods
Need to know a priori which attributes to form
daily aggregates for!

9
Traditional Approaches

What if we don't know what attributes to monitor?
What if we want to exploit the spatial, temporal
and/or demographic characteristics of the
epidemic to detect the outbreak as early as
possible?

10
Traditional Approaches

We need to build a univariate detector to monitor
each interesting combination of attributes

Diarrhea cases among children
Number of cases involving people working in
southern part of the city
Respiratory syndrome cases among females
Number of cases involving teenage girls living
in the western part of the city
Viral syndrome cases involving senior citizens
from eastern part of city
Botulinic syndrome cases
Number of children from downtown hospital
And so on
11
Traditional Approaches

We need to build a univariate detector to monitor
each interesting combination of attributes

Diarrhea cases among children
Number of cases involving people working in
southern part of the city
Respiratory syndrome cases among females
Number of cases involving teenage girls living
in the western part of the city
Viral syndrome cases involving senior citizens
from eastern part of city
Botulinic syndrome cases
Number of children from downtown hospital
And so on
You'll need hundreds of univariate detectors! We
would like to identify the groups with the
strangest behavior in recent events.
12
One Possible Approach
Today's Records
Yesterday's Records
Last Year's Records
13
One Possible Approach
Today's Records
Yesterday's Records
Last Year's Records
Idea: Can use association rules to find patterns
in today's records that weren't there in past data
14
One Possible Approach
Recent records (from today)
Baseline records (from 7 days ago)
Find which rules predict unusually high proportions in recent records when compared to the baseline, e.g.
52/200 records from recent have Gender = Male AND Age = Senior
90/180 records from baseline have Gender = Male AND Age = Senior
15
Which rules do we report?

Search over all rules up to a maximum number of
components
For each rule, form a 2x2 contingency table (recent vs. baseline records, matching vs. not matching the rule)
Perform Fisher's Exact Test to get a p-value for each rule (call this the score)
Report the rule with the lowest score
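
The scoring step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names are hypothetical, and the choice of a two-sided test is an assumption, since the slides do not say which alternative is used. The example counts are the ones quoted on the previous slide.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of a table with x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def rule_score(recent_match, recent_total, baseline_match, baseline_total):
    """Score a rule: rows are recent/baseline, columns are match/no match."""
    return fisher_exact_two_sided(
        recent_match, recent_total - recent_match,
        baseline_match, baseline_total - baseline_match,
    )

# Counts from the Gender = Male AND Age = Senior example
score = rule_score(52, 200, 90, 180)
print(f"score = {score:.3g}")
```

A lower score means the recent and baseline proportions differ more than chance alone would explain.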

16
Problems with the Approach

1. Multiple Hypothesis Testing
2. A Changing Baseline

17
Problem 1: Multiple Hypothesis Testing

Can't interpret the rule scores as p-values
Suppose we reject the null hypothesis when score $< \alpha$, where $\alpha = 0.05$
For a single hypothesis test, the probability of making a false discovery $= \alpha$
Suppose we do 1000 tests, one for each possible rule
Probability(false discovery) could be as bad as $1 - (1 - 0.05)^{1000} \gg 0.05$
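
The arithmetic behind this bound is easy to check. The sketch below assumes the tests are independent, which is the worst case the slide alludes to; the function name is illustrative only.

```python
def prob_any_false_discovery(alpha, num_tests):
    """Chance of at least one false discovery across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests

for n in (1, 10, 100, 1000):
    print(n, prob_any_false_discovery(0.05, n))
```

With a single test the value equals alpha, and it approaches 1 as the number of tests grows.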

18
Randomization Test

Take the recent cases and the baseline cases.
Shuffle the date field to produce a randomized
dataset called DBRand
Find the rule with the best score on DBRand.

19
Randomization Test
Repeat the procedure on the previous slide for
1000 iterations. Determine how many scores from
the 1000 iterations are better than the original
score.
For example, if the original score would place in the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised.
Corrected p-value of the rule is (number of better scores) / (number of iterations)
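
A compact sketch of the randomization test follows. It makes several simplifying assumptions that are not in the slides: rules have a single component (attribute = value), the score is a two-sided Fisher's exact p-value, and ties with the original score are counted as "better" to stay conservative. All function names are hypothetical.

```python
import random
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    probs = [comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(max(0, col1 - row2), min(row1, col1) + 1)]
    p_obs = comb(row1, a) * comb(row2, col1 - a) / denom
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))

def best_rule_score(records, is_recent):
    """Lowest score over all single-component rules (attribute = value)."""
    n_recent = sum(is_recent)
    n_base = len(records) - n_recent
    rules = {(k, v) for rec in records for k, v in rec.items()}
    best = 1.0
    for attr, val in rules:
        r = sum(1 for rec, flag in zip(records, is_recent)
                if flag and rec.get(attr) == val)
        b = sum(1 for rec, flag in zip(records, is_recent)
                if not flag and rec.get(attr) == val)
        best = min(best, fisher_p(r, n_recent - r, b, n_base - b))
    return best

def randomization_p_value(records, is_recent, iterations=1000, seed=0):
    """Corrected p-value: fraction of shuffled datasets whose best score
    is at least as good as the best score on the original data."""
    rng = random.Random(seed)
    original = best_rule_score(records, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # shuffle the date field to build DBRand
        if best_rule_score(records, labels) <= original:
            better += 1
    return better / iterations
```

Shuffling only the recent/baseline label keeps every record intact, so the best score on each shuffled dataset reflects what the rule search finds when there is no real difference between the two periods.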
20
Reporting Multiple Rules on each Day

But reporting only the best scoring rule can hide
other more interesting anomalous patterns!
For example
The best scoring rule is statistically
significant but not a public health concern
The top 5 scoring rules indicate anomalous
patterns in 5 neighboring zip codes but
individually their p-values do not cause an alarm
to be raised

21
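Our Solution: FDR

False Discovery Rate (Benjamini and Hochberg)
Can determine which of these p-values are significant
Specifically, given an $\alpha_{FDR}$, FDR guarantees that the expected fraction of false positives among the tests in which the null hypothesis is rejected is at most $\alpha_{FDR}$
Given an $\alpha_{FDR}$, FDR produces a threshold below which any p-values in the history are considered significant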

22
Our Solution: FDR

Once we have the set of all possible rules and
their scores,
use FDR to determine which ones are significant
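
The thresholding step can be illustrated with the standard Benjamini-Hochberg step-up procedure. This is a generic sketch rather than the WSARE code; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha_fdr=0.05):
    """Return (threshold, flags): p-values at or below threshold are significant.

    threshold is None when no p-value passes the procedure.
    """
    m = len(p_values)
    threshold = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # Keep the largest p-value that sits under its rank-scaled cutoff
        if p <= rank * alpha_fdr / m:
            threshold = p
    if threshold is None:
        return None, [False] * m
    return threshold, [p <= threshold for p in p_values]

# Hypothetical rule scores
scores = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
threshold, significant = benjamini_hochberg(scores, alpha_fdr=0.05)
print(threshold, sum(significant))
```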

23
Problem 2: A Changing Baseline
From Goldenberg, A., Shmueli, G., Caruana, R.
A., and Fienberg, S. E. (2002). Early
statistical detection of anthrax outbreaks by
tracking over-the-counter medication sales.
Proceedings of the National Academy of Sciences
(pp. 5237-5249)
24
Problem 2: A Changing Baseline

Baseline is affected by temporal trends in health
care data
Seasonal effects in temperature and weather
Day of Week effects
Holidays
Etc.
Choosing the wrong baseline distribution can
affect the detection time and false positives rate

25
Generating the Baseline

Taking into account that today is a public
holiday
Taking into account that this is Spring
Taking into account recent heatwave
Taking into account recent flu levels
Taking into account that there's a known natural
food-borne outbreak in progress

26
Generating the Baseline

Taking into account that today is a public
holiday
Taking into account that this is Spring
Taking into account recent heatwave
Taking into account recent flu levels
Taking into account that there's a known natural
food-borne outbreak in progress

Use a Bayes net to model the joint probability
distribution of the attributes
27
Obtaining Baseline Data
All Historical Data
1. Learn Bayesian network using Optimal Reinsertion (Moore and Wong 2003)
Today's Environment
2. Generate baseline given today's environment
Baseline
28
Environmental Attributes

Divide the data into two types of attributes
Environmental attributes: attributes that cause trends in the data, e.g. day of week, season, weather, flu levels
Response attributes: all other non-environmental attributes

29
Environmental Attributes

When learning the Bayesian network structure, do
not allow environmental attributes to have
parents.
Why?
We are not interested in predicting their
distributions
Instead, we use them to predict the distributions
of the response attributes
Side benefit: we can speed up the structure
search by avoiding DAGs that assign parents to
the environmental attributes

Season
Day of Week
Weather
Flu Level
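
One way to picture the no-parents constraint is as a filter on the edges a structure search may propose. The sketch below is illustrative only: the helper names are hypothetical and it does not reproduce the Optimal Reinsertion search itself.

```python
# Environmental attributes named on the slide
ENVIRONMENTAL = {"Season", "Day of Week", "Weather", "Flu Level"}

def allowed_edge(parent, child):
    """An edge parent -> child is allowed only if the child is not environmental."""
    return parent != child and child not in ENVIRONMENTAL

def candidate_edges(attributes):
    """All directed edges the structure search may consider."""
    return [(p, c) for p in attributes for c in attributes if allowed_edge(p, c)]

# "Gender" and "Respiratory" are stand-in response attributes
edges = candidate_edges(["Season", "Flu Level", "Gender", "Respiratory"])
print(len(edges))
```

Every edge pointing into an environmental attribute is dropped, which shrinks the space of DAGs the search has to visit.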
30
Generate Baseline Given Today's Environment
Suppose we know the following for today:
Day of Week = Monday
Flu Level = High
Season = Winter
Weather = Snow
We fill in these values for the environmental attributes in the learned Bayesian network
We sample 10000 records from the Bayesian network and make this data set the baseline
Baseline
31
Generate Baseline Given Today's Environment
Suppose we know the following for today:
Day of Week = Monday
Flu Level = High
Season = Winter
Weather = Snow
We fill in these values for the environmental attributes in the learned Bayesian network
Sampling is easy because environmental attributes are at the top of the Bayes net
We sample 10000 records from the Bayesian network and make this data set the baseline
Baseline
32
Generate Baseline Given Today's Environment
Suppose we know the following for today:
Day of Week = Monday
Flu Level = High
Season = Winter
Weather = Snow
We fill in these values for the environmental attributes in the learned Bayesian network
An alternate possible technique is to use inference
We sample 10000 records from the Bayesian network and make this data set the baseline
Baseline
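
Because the environmental attributes have no parents, generating the baseline reduces to clamping them to today's values and sampling the response attributes in parent-before-child order. The toy network below is a sketch under loud assumptions: the response attributes (`Respiratory`, `Fever`) and every probability in the tables are invented for illustration and are not taken from the learned network.

```python
import random

# Hypothetical P(Respiratory = True | Flu Level, Season)
P_RESPIRATORY = {
    ("High", "Winter"): 0.30, ("High", "Summer"): 0.20,
    ("Low", "Winter"): 0.10, ("Low", "Summer"): 0.05,
}
# Hypothetical P(Fever = True | Respiratory)
P_FEVER = {True: 0.6, False: 0.1}

def sample_baseline(environment, n=10000, seed=0):
    """Draw n records with environmental attributes fixed to today's values."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        rec = dict(environment)  # environmental attributes are clamped
        key = (rec["Flu Level"], rec["Season"])
        rec["Respiratory"] = rng.random() < P_RESPIRATORY[key]
        rec["Fever"] = rng.random() < P_FEVER[rec["Respiratory"]]
        records.append(rec)
    return records

today = {"Day of Week": "Monday", "Flu Level": "High",
         "Season": "Winter", "Weather": "Snow"}
baseline = sample_baseline(today)
print(len(baseline), sum(r["Respiratory"] for r in baseline) / len(baseline))
```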
33
What's Strange About Recent Events (WSARE) 3.0
1. Obtain Recent and Baseline datasets (All Data -> Recent Data and Baseline)
2. Search for rule with best score
3. Determine p-value of best scoring rule
4. If p-value is less than threshold, signal alert
34
Simulator
35
Simulation

100 different data sets
Each data set consisted of a two year period
Anthrax release occurred at a random point during
the second year
Algorithms allowed to train on data from the
current day back to the first day in the
simulation
Any alerts before actual anthrax release are
considered a false positive
Detection time calculated as first alert after
anthrax release. If no alerts raised, cap
detection time at 14 days
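
The evaluation rules above can be written down in a few lines. This sketch assumes days are integer indices, that an alert on the release day itself counts as a detection, and that detection times longer than 14 days are also capped at 14; the slides only state the cap for the no-alert case. The function name is hypothetical.

```python
def evaluate_alerts(alert_days, release_day, cap_days=14):
    """Return (false_positive_count, detection_time_in_days)."""
    # Alerts raised before the release are false positives
    false_positives = sum(1 for d in alert_days if d < release_day)
    # Delay of every alert raised on or after the release day
    delays = [d - release_day for d in alert_days if d >= release_day]
    detection_time = min(delays) if delays else cap_days
    return false_positives, min(detection_time, cap_days)

print(evaluate_alerts([100, 400, 503, 510], release_day=500))
print(evaluate_alerts([10], release_day=500))
```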

36
Other Algorithms used in Simulation

Control Chart: mean + multiplier * standard deviation
Moving Average: 7-day window
ANOVA Regression: linear regression with extra covariates for season, day of week, count from yesterday
WSARE 2.0: create baseline using raw historical data
WSARE 2.5: use raw historical data that matches environmental attributes
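
For comparison, the two simplest univariate detectors in this list can be sketched as below. The multiplier value and the daily counts are hypothetical, and the slides do not say how the moving average is turned into an alert, so only the average itself is computed.

```python
from statistics import mean, stdev

def control_chart_alert(history, today_count, multiplier=2.0):
    """Alert when today's count exceeds mean + multiplier * standard deviation."""
    return today_count > mean(history) + multiplier * stdev(history)

def moving_average(history, window=7):
    """Average daily count over the most recent `window` days."""
    return mean(history[-window:])

daily_counts = [10, 12, 11, 9, 10, 11, 10, 12, 9, 11]  # made-up daily totals
print(control_chart_alert(daily_counts, 20))
print(moving_average(daily_counts))
```

Both detectors watch a single aggregate count, which is why one would be needed for every attribute combination of interest.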

37
Results on Simulation
38
Results on Actual ED Data from 2001

1. Sat 2001-02-13 SCORE = -0.00000004 PVALUE = 0.00000000
14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13 SCORE = -0.00000464 PVALUE = 0.00000000
12.42% ( 58/467) of today's cases have Respiratory Syndrome = True
6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30 SCORE = -0.00000013 PVALUE = 0.00000000
1.44% ( 9/625) of today's cases have 100 < Age < 110
0.08% ( 8/10000) of baseline have 100 < Age < 110
4. Sun 2001-08-08 SCORE = -0.00000007 PVALUE = 0.00000000
83.80% (481/574) of today's cases have Unknown Syndrome = False
74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02 SCORE = -0.00000087 PVALUE = 0.00000000
14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False

39
Limitations of WSARE

Works on categorical data
Works on lower dimensional, dense data
Cannot monitor aggregate counts; relies on
changes in ratios
Assumes that given the environmental variables,
the baseline ratios are fairly stationary over
time

40
Related Work

Contrast sets (Bay and Pazzani)
Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance (Brossette et al.)
Spatial Scan Statistic (Kulldorff)
WRSARE: What's Really Strange About Recent Events (Singh and Moore)

P(Age = Senior, Gender = Male | Season = Winter, Day of Week = Monday)
41
Bayesian Biosurveillance of Disease Outbreaks
To appear in UAI04 (Cooper, Dash, Levander, Wong,
Hogan, Wagner)
42
Conclusion

One approach to biosurveillance: one algorithm monitoring millions of signals derived from multivariate data, instead of hundreds of univariate detectors
WSARE is best used as a general purpose safety
net in combination with other detectors
Careful evaluation of statistical significance
Modeling historical data with Bayesian Networks
to allow conditioning on unique features of today

Software: http://www.autonlab.org/
