HOMELAND SECURITY RESEARCH AT DIMACS

Transcript and Presenter's Notes


1
HOMELAND SECURITY RESEARCH AT DIMACS
2
Working Group on Adverse Event/Disease Reporting,
Surveillance, and Analysis
  • Health surveillance is a core activity in public
    health
  • Concerns about bioterrorism have attracted
    attention to new surveillance methods, such as
    monitoring:
      • OTC drug sales
      • Subway worker absenteeism
      • Ambulance dispatches
  • This spawns a need for novel statistical methods
    for surveillance of multiple data streams.

3
Working Group on Privacy and Confidentiality of
Health Data
  • Privacy concerns are a major stumbling block to
    public health surveillance, in particular
    bioterrorism surveillance.
  • Challenge: produce anonymous data specific enough
    for research.
  • Exploring ways to remove identifiers (Social
    Security number, telephone number, zip code) from
    data sets.
  • Exploring ways to aggregate, remove information
    from data sets.

4
Working Group on Analogies between Computer
Viruses and Biological Viruses
  • Can ideas for defending against biological
    viruses lead to ideas for defending against
    computer viruses?
  • Concern about large gap between initial time of
    attack and implementation of defensive strategies
  • Public health approach: once a virus has
    infected a machine, it tries to connect to as
    many computers as possible, as fast as possible.
    A throttle limits the rate at which a computer
    can connect to new computers.
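
The throttle in the last bullet can be sketched as follows. This is a minimal illustration, not the working group's design: the class name `ConnectionThrottle`, the working-set size, and the rate of one new host per time step are all assumptions chosen for the example.

```python
from collections import deque


class ConnectionThrottle:
    """Limit how fast a machine may contact hosts it has not contacted recently.

    Hypothetical sketch: the working-set size and the one-release-per-tick
    rate are illustrative assumptions, not values from the presentation.
    """

    def __init__(self, working_set_size=4):
        self.recent = deque(maxlen=working_set_size)  # recently contacted hosts
        self.pending = deque()                        # delayed requests to new hosts

    def request(self, host):
        """Return True if the connection may proceed now, False if it is queued."""
        if host in self.recent:
            return True   # familiar host: no delay
        if host not in self.pending:
            self.pending.append(host)
        return False      # new host: must wait for a tick

    def tick(self):
        """Release at most one queued new host per time step."""
        if not self.pending:
            return None
        host = self.pending.popleft()
        self.recent.append(host)
        return host


throttle = ConnectionThrottle()
# A worm-like burst: 100 distinct new hosts requested within one time step.
allowed_now = sum(throttle.request(f"10.0.0.{i}") for i in range(100))
released = [throttle.tick() for _ in range(3)]
backlog = len(throttle.pending)
```

In this sketch, traffic that mostly revisits a few familiar hosts is not delayed, while a burst of connections to new hosts is released only one host per time step and leaves a long backlog.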

5
Working Group on Modeling Social Responses to
Bioterrorism
  • Models of the spread of infectious disease
    commonly assume passive bystanders and rational
    actors who will comply with health authorities.
  • It is not clear how well this assumption applies
    to situations like a bioterrorist attack using
    smallpox or plague.

1947, NYC, smallpox outbreak
  • An interdisciplinary group is discussing how to
    incorporate social behavior into models, models
    of public health decision making, and risk
    communication.

6
The Bioterrorism Sensor Location Problem
  • Early warning is critical
  • This is a crucial factor underlying the
    government's plans to place networks of
    sensors/detectors to warn of a bioterrorist attack

The BASIS System
7
Two Fundamental Problems
  • Sensor Location Problem (SLP):
      • Choose an appropriate mix of sensors
      • Decide where to locate them for best protection
        and early warning
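
One way to make the location half of the SLP concrete is as a coverage problem: each candidate site covers a set of zones, and a fixed number of sensors should cover as many zones as possible. The sketch below applies a greedy heuristic to that formulation. The formulation itself, the function name `greedy_sensor_sites`, and the toy site and zone names are assumptions for illustration; the slides do not say which model the project used.

```python
def greedy_sensor_sites(coverage, budget):
    """Pick up to `budget` sites, each time taking the site that covers
    the most zones not yet covered (greedy maximum-coverage heuristic)."""
    chosen, covered = [], set()
    if not coverage:
        return chosen, covered
    for _ in range(budget):
        # Sort names so ties are broken deterministically.
        best = max(sorted(coverage), key=lambda s: len(coverage[s] - covered))
        if not coverage[best] - covered:
            break  # no remaining site adds new coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered


# Hypothetical candidate sites and the zones a sensor there would monitor.
coverage = {
    "station_a": {"z1", "z2", "z3"},
    "station_b": {"z3", "z4"},
    "station_c": {"z4", "z5", "z6"},
    "station_d": {"z1", "z6"},
}
sites, zones = greedy_sensor_sites(coverage, budget=2)
```

The greedy rule is a heuristic and is not guaranteed to find the best placement. A fuller model would also need to handle the mix of sensor types, their costs, and detection delay.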

8
Two Fundamental Problems
  • Pattern Interpretation Problem (PIP): when
    sensors set off an alarm, help public health
    decision makers decide:
      • Has an attack taken place?
      • What additional monitoring is needed?
      • What was its extent and location?
      • What is an appropriate response?

9
Monitoring Message Streams: Algorithmic Methods
for Automatic Processing of Messages
Supported by Interagency KD-D Group
10
OBJECTIVE
Monitor huge streams of textualized communication
to automatically detect pattern changes and
"significant" events
Motivation: monitoring email traffic
11
TECHNICAL PROBLEM
  • Given stream of text in any language.
  • Decide whether "new events" are present in the
    flow of messages.
  • Event: new topic or topic with unusual level of
    activity.
  • Retrospective or Supervised Event
    Identification: classification into pre-existing
    classes.

12
TECHNICAL PROBLEM
  • Batch filtering: given relevant documents up
    front.
  • Adaptive filtering: pay for information about
    relevance as process moves along.

13
MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR
UNSUPERVISED LEARNING
  • Classes change: new classes appear or existing
    classes change meaning
  • A difficult problem in statistics
  • Recent new C.S. approaches
  • Semi-supervised learning:
      • Algorithm suggests a new class
      • Human analyst labels it and determines its
        significance
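
The semi-supervised loop in the last bullets could look roughly like this. The sketch assumes documents are already term-frequency dictionaries, compares each document to one stored vector per class using cosine similarity, and treats a fixed threshold as the trigger for suggesting a new class. The threshold value, the function names, and the `analyst` callback are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def process(doc, centroids, analyst, threshold=0.3):
    """Assign `doc` to the closest known class, or suggest a new class."""
    best = max(centroids, key=lambda c: cosine(doc, centroids[c]), default=None)
    if best is not None and cosine(doc, centroids[best]) >= threshold:
        return best
    label = analyst(doc)          # the human analyst labels the suggested class
    centroids[label] = dict(doc)  # seed the new class with this document
    return label


asked = []


def analyst(doc):
    """Stand-in for a human analyst: records each request and returns a label."""
    asked.append(doc)
    return "outbreak"


centroids = {"sports": {"game": 2, "team": 1}}
first = process({"team": 1, "game": 1}, centroids, analyst)
second = process({"fever": 2, "rash": 1}, centroids, analyst)
third = process({"fever": 1, "clinic": 1}, centroids, analyst)
```

Here the analyst is consulted only for the second message. The third message is close enough to the newly created class to be assigned automatically.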

14
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
  • (1). Compression of Text -- to meet storage and
    processing limitations
  • (2). Representation of Text -- put in form
    amenable to computation and statistical analysis
  • (3). Matching Scheme -- computing similarity
    between documents
  • (4). Learning Method -- build on judged examples
    to determine characteristics of document cluster
    (event)
  • (5). Fusion Scheme -- combine methods (scores) to
    yield improved detection/clustering.

15
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II
  • These distinctions are somewhat arbitrary.
  • Many approaches to message processing overlap
    several of these components of automatic message
    processing.
  • Existing methods don't exploit the full power of
    the 5 components, synergies among them, and/or an
    understanding of how to apply them to text data.

16
COMPRESSION
  • Reduce the dimension before statistical analysis.
  • We often have just one shot at the data as it
    comes streaming by

17
COMPRESSION II
  • Recent results: one pass through the data can
    reduce volume significantly without degrading
    performance significantly.

We believe that sophisticated dimension reduction
methods in a preprocessing stage followed by
sophisticated statistical tools in a
detection/filtering stage can be a very powerful
approach. Our methods so far give us some
confidence that we are right.
18
COMPRESSION III
  • Three directions of work involving adaptation of
    nearest neighbor (NN) algorithms from theoretical
    computer science:
      • Use of random projections into real subspaces.
        (Still promising, though not competitive for our
        data.)
      • Random projections into Hamming cubes
      • Efficient discovery of deviant cases in stream
        of vectorized entities
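
To illustrate what a random projection into a Hamming cube can look like, the sketch below maps each real-valued vector to a short bit string, one bit per random hyperplane, so that vectors pointing in similar directions tend to get bit strings with small Hamming distance. Sign-of-random-hyperplane hashing is only one standard construction and is an assumption here; the slides do not specify which projection the project used. The function names and toy vectors are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw `n_bits` random Gaussian direction vectors in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def to_hamming(vec, hyperplanes):
    """Map a real vector to a bit list: one bit per hyperplane side."""
    return [1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
            for plane in hyperplanes]


def hamming(a, b):
    """Number of positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=6, n_bits=64)
doc_a = [3, 0, 1, 0, 2, 0]   # toy term-frequency vector
doc_b = [6, 0, 2, 0, 4, 0]   # same direction as doc_a (scaled copy)
doc_c = [0, 4, 0, 5, 0, 1]   # shares no terms with doc_a

code_a, code_b, code_c = (to_hamming(d, planes) for d in (doc_a, doc_b, doc_c))
d_ab = hamming(code_a, code_b)
d_ac = hamming(code_a, code_c)
```

Each 6-dimensional vector becomes 64 bits here only for readability. The compression benefit appears when the original dimension (the vocabulary size) is far larger than the number of bits.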

19
MORE SOPHISTICATED STATISTICAL APPROACHES BEING
STUDIED
  • Representations: Boolean representations;
    weighting schemes
  • Matching Schemes: Boolean matching; nonlinear
    transforms of individual feature values
  • Learning Methods: new kernel-based methods; more
    complex Bayes classifiers; boosting
  • Fusion Methods: combining scores based on ranks,
    linear functions, or nonparametric schemes
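
As a small illustration of rank-based fusion, the sketch below converts each method's scores to ranks and averages the ranks, which sidesteps the problem that different methods score on incompatible scales. Averaging ranks is just one simple choice and is an assumption here; the method names and scores are made up.

```python
def ranks(scores):
    """Map each document to its rank (1 = highest score) under one method."""
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return {doc: i + 1 for i, doc in enumerate(ordered)}


def fuse_by_rank(*score_tables):
    """Average each document's rank across methods; lower is better."""
    tables = [ranks(t) for t in score_tables]
    docs = tables[0].keys()
    return {d: sum(t[d] for t in tables) / len(tables) for d in docs}


# Hypothetical scores from two detectors on incompatible scales.
kernel_scores = {"m1": 0.91, "m2": 0.40, "m3": 0.75}
bayes_scores = {"m1": -2.0, "m2": -9.5, "m3": -1.2}

fused = fuse_by_rank(kernel_scores, bayes_scores)
ranking = sorted(fused, key=lambda d: (fused[d], d))
```

The sketch assumes every method scores the same set of documents. Ties in the fused rank are broken by document name.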

20
DATA SETS USED
  • No readily available data set has all the
    characteristics of data on which we expect our
    methods to be used
  • However, many of our methods depend essentially
    only on term frequencies by document.
  • Thus, many available data sets can be used for
    experimentation.
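
"Term frequencies by document" means a count of how often each term occurs in each message, which any text collection can supply. A minimal sketch, assuming a crude lowercase, letters-only tokenizer and two made-up messages:

```python
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split on non-letters, and count each term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical messages, not drawn from any of the project's data sets.
docs = {
    "msg1": "Smallpox case reported; second smallpox case suspected.",
    "msg2": "Subway absenteeism reported higher than usual.",
}
tf = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}
```

A real pipeline would likely add steps such as stop-word removal or stemming, which are omitted here.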

21
DATA SETS USED II
  • TREC (Text Retrieval Conference) data:
    time-stamped subsets of the data (order 10^5 to
    10^6 messages)
  • Reuters Corpus Vol. 1 (8 x 10^5 messages)
  • Medline Abstracts (order 10^7 with human indexing)

22
THE MONITORING MESSAGE STREAMS PROJECT TEAM
Endre Boros, RUTCOR
Paul Kantor, SCILS
Dave Lewis, Consultant
Ilya Muchnik, DIMACS/CS
S. Muthukrishnan, CS
David Madigan, Statistics
Rafail Ostrovsky, Telcordia Technologies
Fred Roberts, Rutgers
Martin Strauss, AT&T Labs
Wen-Hua Ju, Avaya Labs (collaborator)