DIMACS Working Group on Data Mining and Epidemiology

Slides: 28
Provided by: fredro

1
DIMACS Working Group on Data Mining and
Epidemiology
2
  • What are the challenges for mathematical
    scientists in the defense against disease?
  • This question led DIMACS, the Center for Discrete
    Mathematics and Theoretical Computer Science, to
    launch a special focus on this topic.

3
DIMACS Special Focus on Computational and
Mathematical Epidemiology 2002-2005
Anthrax
4
  • Post-September 11 events soon led to an emphasis
    on bioterrorism.

  • smallpox

5
Working Groups
6
Working Groups Continued
  • Interdisciplinary, international groups of
    researchers.
  • Come together at DIMACS.
  • Informal presentations, lots of time for
    discussion.
  • Emphasis on collaboration.
  • Return as a full group or in subgroups to pursue
    problems/approaches identified in first meeting.
  • By invitation, but contact the organizer.
  • Junior researchers welcomed. Nominate them.

7
Working Groups
  • WGs on Large Data Sets
      • Adverse Event/Disease Reporting, Surveillance,
        and Analysis
      • Spin-off: Health Care Data Privacy and
        Confidentiality
      • Data Mining and Epidemiology

8
WGs on Analogies between Computers and Humans
  • Analogies between Computer Viruses/Immune Systems
    and Human Viruses/Immune Systems
  • Distributed Computing, Social Networks, and
    Disease Spread Processes

9
WGs on Methods/Tools of TCS
  • Phylogenetic Trees and Rapidly Evolving Diseases
  • Order-Theoretic Aspects of Epidemiology

10
WGs on Computational Methods for Analyzing Large
Models for Spread/Control of Disease
  • Spatio-temporal and Network Modeling of Diseases
  • Methodologies for Comparing Vaccination
    Strategies

11
WGs on Mathematical Sciences Methodologies
  • Mathematical Models and Defense Against
    Bioterrorism
  • Predictive Methodologies for Infectious Diseases
  • Statistical, Mathematical, and Modeling Issues in
    the Analysis of Marine Diseases

12
Data Mining and Epidemiology
  • Interest sparked in part by availability of large
    and disparate computerized databases on subjects
    relating to disease

13
  • Early warning is critical in public health
  • This is a crucial factor underlying governments'
    plans to place networks of sensors/detectors to
    warn of a bioterrorist attack
  • Sensors will be a source of huge amounts of data

The BASIS System
14
The DIMACS Bioterrorism Sensor Location Project
15
Data Mining and Epidemiology: Some Research
Issues
16
1. Streaming Data Analysis
  • When you only have one shot at the data
  • Widely used to detect trends and sound alarms in
    applications in telecommunications and finance
  • AT&T uses this to detect fraudulent use of credit
    cards or impending billing defaults
  • Columbia has developed methods for detecting
    fraudulent behavior in financial systems
  • Uses algorithms based in TCS
  • Needs modification to apply to disease detection
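
As a rough illustration of the "one shot at the data" constraint, the sketch below scores each incoming value against running statistics and then discards it, so memory use stays constant. It is not the method used by any of the organizations named on the slide; the class name StreamingAlarm, the threshold of 3 standard deviations, the warm-up length, and the daily counts are all assumptions made up for this example.

```python
import math


class StreamingAlarm:
    """One-pass detector: constant memory, each value is seen exactly once."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # alarm when |z-score| exceeds this (assumed value)
        self.warmup = warmup        # observations required before any alarm
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations (Welford)

    def update(self, x):
        """Score x against the history so far, then fold x into the history."""
        alarm = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                alarm = True
        # Welford's online update of mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return alarm


# Hypothetical daily case counts; day 8 contains an injected spike.
daily_counts = [12, 15, 11, 14, 13, 12, 16, 13, 48, 14]
detector = StreamingAlarm()
alarms = [day for day, count in enumerate(daily_counts) if detector.update(count)]
print(alarms)
```

Note that the spike is still folded into the running statistics after it raises an alarm, which inflates the variance afterward. A disease-surveillance variant would likely need to handle that, along with seasonality and reporting delays, which is one reading of "needs modification" above.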

17
Research Issues
  • Modify methods of data collection, transmission,
    processing, and visualization
  • Explore use of decision trees, vector-space
    methods, Bayesian and neural nets
  • How are the results of monitoring systems best
    reported and visualized?
  • To what extent can they trigger fast and safe
    automated responses?
  • How are relevant queries best expressed, giving
    the user sufficient power while implicitly
    restraining him/her from incurring unwanted
    computational overhead?

18
2. Cluster Analysis
  • Used to extract patterns from complex data
  • Application of traditional clustering algorithms
    hindered by extreme heterogeneity of the data
  • Newer clustering methods based on TCS for
    clustering heterogeneous data need to be modified
    for infectious disease and bioterrorist
    applications.

19
3. Visualization
  • Large data sets are sometimes best understood by
    visualizing them.

20
3. Visualization (continued)
  • Sheer data sizes require new visualization
    regimes, which require suitable external memory
    data structures to reorganize tabular data to
    facilitate access, usage, and analysis.
  • Visualization algorithms become harder when data
    arises from various sources and each source
    contains only partial information.

21
4. Data Cleaning
  • Disease detection problem: very dirty data

22
4. Data Cleaning (continued)
  • Very dirty data due to
      • manual entry
      • lack of uniform standards for content and formats
      • data duplication
      • measurement errors
  • TCS-based methods of data cleaning
      • duplicate removal
      • merge/purge
      • automated detection
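
A minimal sketch of duplicate removal in a merge/purge style, assuming a sorted-neighborhood approach: records are normalized, sorted, and each one is compared only with a few recently kept neighbors. The slide does not specify an algorithm, so the function names, the similarity threshold of 0.85, the window size, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase, drop commas, and collapse whitespace before comparing."""
    return " ".join(record.lower().replace(",", " ").split())


def similar(a, b, threshold=0.85):
    """Treat two normalized strings as duplicates above an assumed threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def merge_purge(records, window=3):
    """Sort by normalized text, then compare each record with nearby kept ones."""
    ordered = sorted(records, key=normalize)
    kept = []
    for rec in ordered:
        key = normalize(rec)
        recent = kept[-(window - 1):] if window > 1 else []
        if not any(similar(key, normalize(k)) for k in recent):
            kept.append(rec)
    return kept


# Made-up records with formatting noise and one spelling variant.
reports = [
    "Smith, John  1970-03-02",
    "smith john 1970-03-02",
    "Smyth, John 1970-03-02",
    "Jones, Mary 1985-11-20",
]
deduped = merge_purge(reports)
print(deduped)
```

Sorting first keeps the number of comparisons roughly linear in the number of records, at the cost of missing duplicates whose normalized forms sort far apart (for example, a typo in the first character).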

23
5. Dealing with Natural Language Reports
  • Devise effective methods for translating natural
    language input into formats suitable for
    analysis.
  • Develop computationally efficient methods to
    provide automated responses consisting of
    follow-up questions.
  • Develop semi-automatic systems to generate
    queries based on dynamically changing data.

24
6. Cryptography and Security
  • Devise effective methods for protecting privacy
    of individuals about whom data is provided to
    biosurveillance teams -- data from emergency
    dept. visits, doctor visits, prescriptions
  • Develop ways to share information between
    databases of intelligence agencies while
    protecting privacy.

25
6. Cryptography and Security (continued)
  • Specifically: how can we make a simultaneous
    query to two datasets without compromising
    information in those data sets? (E.g., is
    individual xx included in both sets?)
  • Issues include
      • ensuring accuracy and reliability of responses
      • authentication of queries
      • policies for access control and authorization
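
One known family of techniques for the "is this individual in both sets?" question is private set intersection. The toy sketch below uses commutative blinding by modular exponentiation: each party raises hashed identifiers to its own secret power, and values blinded by both parties can be compared without either side seeing the other's raw identifiers. This is an illustration chosen for this example, not a protocol named in the presentation. The modulus is far too small for real use, the patient identifiers are invented, and authentication and access control are not addressed at all.

```python
import hashlib
import math
import secrets

P = 2**127 - 1        # a Mersenne prime; toy-sized, NOT a secure parameter
Q = (P - 1) // 2      # order of the subgroup of squares modulo P


def hash_to_group(identifier):
    """Map an identifier to a square modulo P."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return pow(int.from_bytes(digest, "big") % P, 2, P)


def new_key():
    """Pick a secret exponent that acts as a bijection on the subgroup."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, Q) == 1:
            return k


def blind(values, secret):
    """Raise every value to a secret power; blinding by two keys commutes."""
    return {pow(v, secret, P) for v in values}


# Each party holds a private key and a private dataset (hypothetical IDs).
key_a, key_b = new_key(), new_key()
set_a = {"patient-001", "patient-002", "patient-003"}
set_b = {"patient-003", "patient-004"}

# Round 1: each party blinds its own hashed IDs and sends them across.
from_a = blind({hash_to_group(x) for x in set_a}, key_a)
from_b = blind({hash_to_group(x) for x in set_b}, key_b)

# Round 2: each party blinds what it received with its own key.
double_a = blind(from_a, key_b)   # computed by party B
double_b = blind(from_b, key_a)   # computed by party A

shared_count = len(double_a & double_b)
print(shared_count)
```

In this sketch the comparison reveals only how many individuals appear in both sets. Whether that leakage is acceptable, and who is allowed to run the query, are policy questions of the kind listed above.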

26
7. Spatio-Temporal Mining of Sensor Data
  • Sensors provide observations of the state of the
    world localized in space and time.
  • Finding trends in data from individual sensors:
    time series data mining.
  • Detecting general correlations in multiple time
    series of observations.
  • This has been studied in statistics, database
    theory, knowledge discovery, and data mining.
  • Complications: proximity relationships based on
    geography; complex chronological effects.
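
To make "correlations in multiple time series" concrete, the sketch below searches for the time lag at which readings from one sensor best match later readings from another, using the Pearson correlation coefficient. The helper names, the maximum lag, and the two sensor series are assumptions for illustration. The sketch ignores the geographic proximity relationships that the slide lists as a complication.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def best_lag(upstream, downstream, max_lag=3):
    """Return (lag, r) maximizing correlation of upstream[t] with downstream[t + lag]."""
    scores = []
    for lag in range(max_lag + 1):
        xs = upstream[: len(upstream) - lag]
        ys = downstream[lag:]
        scores.append((pearson(xs, ys), lag))
    r, lag = max(scores)
    return lag, r


# Made-up readings: sensor_b repeats sensor_a two time steps later.
sensor_a = [1, 2, 8, 3, 1, 2, 9, 2, 1, 1]
sensor_b = [0, 0, 1, 2, 8, 3, 1, 2, 9, 2]
lag, r = best_lag(sensor_a, sensor_b)
print(lag, round(r, 3))
```

A lag found this way is only a hint of a relationship between two sites; with many sensors, the number of pairs and lags to test grows quickly, and spurious matches become more likely.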
