Title: Statistical Analysis of Web-Generated Data
1 SURVEILLANCE TOOLS FOR PUBLIC HEALTH DATA AND MESSAGE STREAMS
DAVID MADIGAN
DEPARTMENT OF STATISTICS, INSTITUTE OF BIOSTATISTICS, DIMACS
RUTGERS UNIVERSITY
2 OVERVIEW
- Brief description of two activities at DIMACS
- The DIMACS Working Group on Adverse Event and Disease Surveillance
- - 50 members: public health, universities, industry
- - National Science Foundation
- Monitoring Message Streams Project
- - A dozen researchers and programmers
- - Intelligence Agencies
http://www.stat.rutgers.edu/madigan/
3 SURVEILLANCE WORKING GROUP
- WG meetings plus week-long tutorial on analytic methods
- Coordinated closely with the National Syndromic Surveillance Conferences
4 SURVEILLANCE WORKING GROUP
- Challenge: Find anomalies in streams of public health data (disease incidence, medicine sales, ED chief complaints, adverse events)
- Why?
- - Detect disease outbreaks
- - Post-marketing surveillance of medical product safety
- - Bioterrorism
5 CLASSICAL SURVEILLANCE METHODS
- Find anomalies/changepoints in single streams
Sequential Probability Ratio Test (Wald, 1948)
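
To make the single-stream idea concrete, here is a minimal sketch of a sequential probability ratio test applied to a stream of counts. The slide names only the test itself; the Poisson count model, the function name `sprt_poisson`, the baseline and outbreak rates `lam0` and `lam1`, and the default error rates are all assumptions chosen for illustration.

```python
import math


def sprt_poisson(counts, lam0, lam1, alpha=0.01, beta=0.01):
    """Sketch of Wald's SPRT for H0: rate = lam0 vs H1: rate = lam1.

    Assumes independent Poisson counts (an illustrative assumption).
    Returns (decision, n): decision is "H1", "H0", or "continue",
    and n is the number of observations consumed.
    """
    upper = math.log((1 - beta) / alpha)  # decide H1 at or above this
    lower = math.log(beta / (1 - alpha))  # decide H0 at or below this
    llr = 0.0
    for n, x in enumerate(counts, start=1):
        # Log-likelihood ratio contribution of one Poisson count x.
        llr += x * math.log(lam1 / lam0) - (lam1 - lam0)
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "continue", len(counts)


# Hypothetical daily counts: baseline rate 5 per day, outbreak rate 10 per day.
print(sprt_poisson([12, 13], lam0=5, lam1=10))
print(sprt_poisson([4, 4, 4], lam0=5, lam1=10))
```

The test stops as soon as the accumulated evidence crosses either threshold, which is why it suits data that arrive one observation at a time.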
6 NEW SURVEILLANCE CHALLENGES
- Find anomalies/changepoints in multivariate,
heterogeneous streams
ED chief complaint
7 SCAN STATISTICS
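
The slide gives no detail on the method, so the following is only a rough sketch of the general idea behind a temporal scan statistic: slide a window over a stream of counts, take the largest window total, and ask how unusual that total is. The fixed window width, the permutation-based reference distribution, and the names `max_window_sum` and `scan_p_value` are assumptions made for illustration; scan statistics used in practice typically rely on a likelihood ratio and variable window sizes.

```python
import random


def max_window_sum(counts, width):
    """Return (best_sum, start_index) over all windows of the given width."""
    if not 1 <= width <= len(counts):
        raise ValueError("width must be between 1 and len(counts)")
    window = sum(counts[:width])
    best, best_start = window, 0
    for start in range(1, len(counts) - width + 1):
        # Slide the window one step: add the new value, drop the old one.
        window += counts[start + width - 1] - counts[start - 1]
        if window > best:
            best, best_start = window, start
    return best, best_start


def scan_p_value(counts, width, n_sims=999, seed=0):
    """Monte Carlo p-value for the max window sum under random reordering."""
    rng = random.Random(seed)
    observed, _ = max_window_sum(counts, width)
    shuffled = list(counts)
    hits = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        if max_window_sum(shuffled, width)[0] >= observed:
            hits += 1
    return (1 + hits) / (1 + n_sims)


# Hypothetical daily counts with a two-day cluster.
daily = [1, 0, 2, 9, 8, 1, 0]
print(max_window_sum(daily, 2))
print(scan_p_value(daily, 2))
```

Taking the maximum over all windows, and comparing it with the maximum under the reference distribution, accounts for the fact that many windows were examined.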
8 MONITORING MESSAGE STREAMS
- Finding interesting messages in large streams
- Interesting:
- - Human analyst should read
- - Important new topic
- - Significant with respect to subsequent events
9 FIVE COMPONENT PROCESS
- Representation: representing documents in a computer
- Compression: massive streams -> compression
- Matching: which messages are similar?
- Learning: statistical methods for learning from data
- Fusion: mix n match
- Best-ever accuracy on some standard test problems
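
As a small illustration of the representation and matching components, here is a sketch that turns messages into weighted bag-of-words vectors and compares them by cosine similarity. The slides do not say which representation or similarity measure the project used; the smoothed TF-IDF weighting, the tokenizer, and the function names are assumptions for this example only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(messages):
    """Representation: map each message to a {term: weight} dictionary."""
    token_lists = [tokenize(m) for m in messages]
    n = len(token_lists)
    # Number of messages in which each term appears at least once.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({
            t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Matching: cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical messages, invented for this example.
messages = [
    "fever and cough reported in clinic",
    "cough and fever cases rising",
    "stock prices fell sharply",
]
vecs = tfidf_vectors(messages)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

Sparse dictionaries keep only the terms that occur in a message, which matters when the vocabulary is large and the stream is long.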
10 SUMMARY
- Dramatically increased interest in public health surveillance
- - Homeland security motivation
- - Public health benefits
- Exciting analytical challenges
22-year-old woman w/ nausea, vomiting, and a dull pain in her back for three weeks. Woman had eaten a tube sock.