Title: DIMACS Working Group on Data Mining and Epidemiology
- What are the challenges for mathematical scientists in the defense against disease?
- This question led DIMACS, the Center for Discrete Mathematics and Theoretical Computer Science, to launch a special focus on this topic.
DIMACS Special Focus on Computational and Mathematical Epidemiology, 2002-2005
Anthrax
- Post-September 11 events soon led to an emphasis on bioterrorism.
smallpox
Working Groups
- Interdisciplinary, international groups of researchers.
- Come together at DIMACS.
- Informal presentations, lots of time for discussion.
- Emphasis on collaboration.
- Return as a full group or in subgroups to pursue problems/approaches identified in first meeting.
- By invitation, but contact the organizer.
- Junior researchers welcomed. Nominate them.
Working Groups
- WGs on Large Data Sets
  - Adverse Event/Disease Reporting, Surveillance, and Analysis
  - Spin-off: Health Care Data Privacy and Confidentiality
  - Data Mining and Epidemiology
- WGs on Analogies between Computers and Humans
  - Analogies between Computer Viruses/Immune Systems and Human Viruses/Immune Systems
  - Distributed Computing, Social Networks, and Disease Spread Processes
WGs on Methods/Tools of TCS
- Phylogenetic Trees and Rapidly Evolving Diseases
- Order-Theoretic Aspects of Epidemiology
- WGs on Computational Methods for Analyzing Large Models for Spread/Control of Disease
  - Spatio-temporal and Network Modeling of Diseases
  - Methodologies for Comparing Vaccination Strategies
WGs on Mathematical Sciences Methodologies
- Mathematical Models and Defense Against Bioterrorism
- Predictive Methodologies for Infectious Diseases
- Statistical, Mathematical, and Modeling Issues in the Analysis of Marine Diseases
Data Mining and Epidemiology
- Interest sparked in part by availability of large
and disparate computerized databases on subjects
relating to disease
- Early warning is critical in public health.
- This is a crucial factor underlying governments' plans to place networks of sensors/detectors to warn of a bioterrorist attack.
- Sensors will be a source of huge amounts of data.
The BASIS System
The DIMACS Bioterrorism Sensor Location Project
Data Mining and Epidemiology: Some Research Issues
1. Streaming Data Analysis
- When you only have one shot at the data.
- Widely used to detect trends and sound alarms in applications in telecommunications and finance.
- AT&T uses this to detect fraudulent use of credit cards or impending billing defaults.
- Columbia has developed methods for detecting fraudulent behavior in financial systems.
- Uses algorithms based in TCS.
- Needs modification to apply to disease detection.
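As a rough illustration of the "one shot at the data" constraint, here is a minimal sketch of a one-pass alarm on a stream of daily case counts. It is not the method used by AT&T or Columbia. The class name, the exponentially weighted mean/variance update, and all parameter values (including the floor of 1.0 on the standard deviation) are assumptions chosen for illustration.

```python
import math


class StreamingAlarm:
    """One-pass detector: constant memory, each observation is seen once."""

    def __init__(self, alpha=0.1, threshold=3.0, warmup=5):
        self.alpha = alpha          # weight given to each new observation
        self.threshold = threshold  # alarm if value exceeds mean by this many std devs
        self.warmup = warmup        # observations to absorb before any alarm
        self.mean = 0.0
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Consume one observation; return True if it should raise an alarm."""
        self.n += 1
        if self.n == 1:
            self.mean = float(x)
            return False
        # Floor the std dev at 1.0 (an assumed value) so that a very quiet
        # baseline does not make every small fluctuation look anomalous.
        std = max(math.sqrt(self.var), 1.0)
        alarm = self.n > self.warmup and x > self.mean + self.threshold * std
        if not alarm:
            # Only non-alarming values update the baseline, so an outbreak
            # does not immediately become the new normal.
            diff = x - self.mean
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return alarm


detector = StreamingAlarm()
daily_counts = [10, 12, 11, 9, 10, 11, 10, 12, 40, 11]  # made-up data
alarm_days = [day for day, count in enumerate(daily_counts) if detector.update(count)]
print(alarm_days)
```

The detector keeps only three numbers of state, which is the property that matters when the stream is too large to store or revisit.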
Research Issues:
- Modify methods of data collection, transmission, processing, and visualization.
- Explore use of decision trees, vector-space methods, Bayesian and neural nets.
- How are the results of monitoring systems best reported and visualized?
- To what extent can they trigger fast and safe automated responses?
- How are relevant queries best expressed, giving the user sufficient power while implicitly restraining him/her from incurring unwanted computational overhead?
2. Cluster Analysis
- Used to extract patterns from complex data.
- Application of traditional clustering algorithms hindered by extreme heterogeneity of the data.
- Newer clustering methods based on TCS for clustering heterogeneous data need to be modified for infectious disease and bioterrorist applications.
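To make the heterogeneity problem concrete, the sketch below clusters records that mix numeric and categorical fields. It uses a Gower-style dissimilarity and a simple "leader" clustering pass. Both are stand-ins picked for brevity, not the TCS-based methods the slide refers to. The field names, ranges, and threshold are hypothetical.

```python
def mixed_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity in [0, 1] between two records (dicts).

    Numeric fields contribute |a - b| / range; all other fields
    contribute 0 if equal and 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            span = numeric_ranges[key]
            total += abs(a[key] - b[key]) / span if span else 0.0
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def leader_clusters(records, numeric_ranges, max_dist):
    """Assign each record to the first cluster whose leader is close enough."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if mixed_distance(rec, cluster[0], numeric_ranges) <= max_dist:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])  # no leader was close: start a new cluster
    return clusters


# Hypothetical case records mixing a numeric and a categorical field.
cases = [
    {"age": 30, "symptom": "fever"},
    {"age": 32, "symptom": "fever"},
    {"age": 70, "symptom": "rash"},
]
groups = leader_clusters(cases, numeric_ranges={"age": 40}, max_dist=0.3)
print(len(groups))
```

A purely numeric distance would have no sensible way to compare "fever" with "rash", which is one reason traditional algorithms struggle on such data.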
3. Visualization
- Large data sets are sometimes best understood by
visualizing them.
3. Visualization (continued)
- Sheer data sizes require new visualization regimes, which require suitable external memory data structures to reorganize tabular data to facilitate access, usage, and analysis.
- Visualization algorithms become harder when data arises from various sources and each source contains only partial information.
4. Data Cleaning
- Disease detection problem: very dirty data
4. Data Cleaning (continued)
- Very dirty data due to:
  - manual entry
  - lack of uniform standards for content and formats
  - data duplication
  - measurement errors
- TCS-based methods of data cleaning:
  - duplicate removal
  - merge purge
  - automated detection
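The duplicate-removal and merge/purge items can be sketched with a sorted-neighborhood pass: normalize records, sort them by a key, and compare each record only with a few recent neighbors. This is a simplified illustration under assumptions. The record layout (name, date of birth), the similarity threshold, and the window size are hypothetical, and `difflib` stands in for whatever matcher a real system would use.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lower-case the name, drop commas, and collapse whitespace."""
    name, dob = record
    return (" ".join(name.lower().replace(",", " ").split()), dob.strip())


def similar(a, b, min_ratio=0.85):
    """Treat two records as duplicates if DOB matches and names are close."""
    return a[1] == b[1] and SequenceMatcher(None, a[0], b[0]).ratio() >= min_ratio


def merge_purge(records, window=3):
    """Sorted-neighborhood deduplication.

    Exact duplicates vanish in the set; near duplicates are caught by
    comparing each record with the last (window - 1) records kept.
    """
    cleaned = sorted({normalize(r) for r in records}, key=lambda r: (r[1], r[0]))
    kept = []
    for rec in cleaned:
        recent = kept[-(window - 1):] if window > 1 else []
        if not any(similar(rec, prev) for prev in recent):
            kept.append(rec)
    return kept


# Made-up records showing manual-entry noise and inconsistent formats.
raw = [
    ("John Smith", "1970-01-02"),
    ("john  smith", "1970-01-02"),
    ("Jon Smith", "1970-01-02 "),
    ("Mary Jones", "1980-05-06"),
]
print(merge_purge(raw))
```

Sorting first keeps the number of comparisons roughly linear in the data size, at the cost of missing duplicates whose keys sort far apart.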
5. Dealing with Natural Language Reports
- Devise effective methods for translating natural language input into formats suitable for analysis.
- Develop computationally efficient methods to provide automated responses consisting of follow-up questions.
- Develop semi-automatic systems to generate queries based on dynamically changing data.
6. Cryptography and Security
- Devise effective methods for protecting privacy of individuals about whom data is provided to biosurveillance teams (data from emergency department visits, doctor visits, prescriptions).
- Develop ways to share information between databases of intelligence agencies while protecting privacy.
6. Cryptography and Security (continued)
- Specifically: How can we make a simultaneous query to two data sets without compromising information in those data sets? (E.g., is individual xx included in both sets?)
- Issues include:
  - ensuring accuracy and reliability of responses
  - authentication of queries
  - policies for access control and authorization
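One known way to approach the "is individual xx in both sets?" question is private set intersection. The sketch below simulates a Diffie-Hellman-style variant in a single process: each side masks hashed identifiers with its own secret exponent, and only doubly masked values are compared. The slides do not name this technique, so it is shown purely as an assumption. The modulus is toy-sized, the function names are hypothetical, and nothing here is suitable for real use without a vetted cryptographic library.

```python
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized, NOT a secure parameter choice


def hash_to_group(identifier):
    """Map an identifier to a nonzero integer modulo P."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P or 1


def mask(values, secret):
    """Raise every value to a secret exponent modulo P."""
    return [pow(v, secret, P) for v in values]


def private_intersection_size(ids_a, ids_b):
    """Simulate both parties; in practice each step runs on a different site."""
    key_a = secrets.randbelow(P - 3) + 2  # known only to party A
    key_b = secrets.randbelow(P - 3) + 2  # known only to party B
    a_once = mask([hash_to_group(i) for i in ids_a], key_a)  # sent A -> B
    b_once = mask([hash_to_group(i) for i in ids_b], key_b)  # sent B -> A
    a_twice = set(mask(a_once, key_b))  # B adds its mask to A's values
    b_twice = set(mask(b_once, key_a))  # A adds its mask to B's values
    # Exponentiation commutes, so shared identifiers produce equal values.
    return len(a_twice & b_twice)


# Made-up patient identifiers held by two separate databases.
print(private_intersection_size(["p1", "p2", "p3"], ["p3", "p4", "p1"]))
```

Neither side ever sees the other's raw identifiers, only masked values. The issues listed above (authentication, access control, reliability of responses) are outside what this sketch addresses.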
7. Spatio-Temporal Mining of Sensor Data
- Sensors provide observations of the state of the world localized in space and time.
- Finding trends in data from individual sensors: time series data mining.
- Detecting general correlations in multiple time series of observations.
- This has been studied in statistics, database theory, knowledge discovery, data mining.
- Complications: proximity relationships based on geography; complex chronological effects.
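The combination of geographic proximity and correlation across time series can be illustrated with a small sketch: report pairs of sensors that are both close together and strongly correlated. The sensor layout, the Pearson correlation measure, and the thresholds are assumptions for illustration. The sketch also ignores the chronological complications (such as time lags) mentioned above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def correlated_neighbors(sensors, radius, min_corr):
    """Return sensor pairs within `radius` whose readings correlate >= min_corr.

    `sensors` maps a sensor name to ((x, y) position, list of readings).
    """
    names = sorted(sensors)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (pos_a, series_a), (pos_b, series_b) = sensors[a], sensors[b]
            close = math.dist(pos_a, pos_b) <= radius
            if close and pearson(series_a, series_b) >= min_corr:
                pairs.append((a, b))
    return pairs


# Made-up sensor positions and readings.
sensors = {
    "s1": ((0, 0), [1, 2, 3, 4, 5]),
    "s2": ((1, 0), [2, 4, 6, 8, 11]),
    "s3": ((10, 10), [1, 2, 3, 4, 5]),  # same trend as s1 but far away
    "s4": ((0, 1), [5, 3, 4, 1, 2]),    # nearby but moving the opposite way
}
print(correlated_neighbors(sensors, radius=2.0, min_corr=0.9))
```

Comparing every pair is quadratic in the number of sensors, so a large network would need spatial indexing before the correlation step.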