Title: HOMELAND SECURITY RESEARCH AT DIMACS
2. Working Group on Adverse Event/Disease Reporting, Surveillance, and Analysis
- Health surveillance is a core activity in public health.
- Concerns about bioterrorism have attracted attention to new surveillance methods:
  - OTC drug sales
  - Subway worker absenteeism
  - Ambulance dispatches
- This spawns a need for novel statistical methods for surveillance of multiple data streams.
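As a rough illustration of what watching several data streams at once can look like, the sketch below runs a one-sided CUSUM detector on each stream independently. CUSUM is a standard change-detection technique chosen here for illustration; it is not a method named in these slides. The stream names, baselines, counts, slack, and threshold are all invented.

```python
def cusum_alarms(counts, baseline_mean, baseline_sd, slack=0.5, threshold=4.0):
    """Return the indices at which a one-sided CUSUM statistic exceeds the threshold."""
    s = 0.0
    alarms = []
    for day, count in enumerate(counts):
        z = (count - baseline_mean) / baseline_sd  # standardized daily count
        s = max(0.0, s + z - slack)                # accumulate only upward drift
        if s > threshold:
            alarms.append(day)
            s = 0.0                                # restart after raising an alarm
    return alarms

# Hypothetical streams: (daily counts, assumed baseline mean, assumed baseline sd)
streams = {
    "otc_drug_sales": ([100, 102, 98, 101, 99, 120, 125, 130], 100.0, 5.0),
    "ambulance_dispatches": ([40, 41, 39, 40, 42, 41, 40, 39], 40.0, 2.0),
}
alarms_by_stream = {
    name: cusum_alarms(counts, mean, sd) for name, (counts, mean, sd) in streams.items()
}
```

Treating each stream separately is the simplest possible design; combining evidence across streams is the harder statistical problem the slide points to.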
3. Working Group on Privacy and Confidentiality of Health Data
- Privacy concerns are a major stumbling block to public health surveillance, in particular bioterrorism surveillance.
- Challenge: produce anonymous data specific enough for research.
- Exploring ways to remove identifiers (Social Security number, telephone number, zip code) from data sets.
- Exploring ways to aggregate or remove information from data sets.
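The two ideas above, removing identifiers and aggregating, can be sketched as follows. This is a minimal illustration under assumed field names (`ssn`, `phone`, `zip`, `diagnosis`) and an assumed rule of keeping three zip digits; none of these come from the slides, and dropping fields this way does not by itself guarantee anonymity.

```python
from collections import Counter

IDENTIFIER_FIELDS = {"ssn", "phone"}  # hypothetical field names

def deidentify(record, zip_digits=3):
    """Drop direct identifiers and coarsen the zip code to its leading digits."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    if "zip" in cleaned:
        z = cleaned["zip"]
        cleaned["zip"] = z[:zip_digits] + "*" * (len(z) - zip_digits)
    return cleaned

def aggregate_counts(records, keys=("zip", "diagnosis")):
    """Replace individual records by counts per (coarse zip, diagnosis) cell."""
    return Counter(tuple(r[k] for k in keys) for r in records)

# Invented example records
records = [
    {"ssn": "000-00-0001", "phone": "555-0100", "zip": "08854", "diagnosis": "flu"},
    {"ssn": "000-00-0002", "phone": "555-0101", "zip": "08855", "diagnosis": "flu"},
    {"ssn": "000-00-0003", "phone": "555-0102", "zip": "07102", "diagnosis": "rash"},
]
anonymous = [deidentify(r) for r in records]
counts = aggregate_counts(anonymous)
```

The tension named in the challenge shows up directly in `zip_digits`: fewer digits means more privacy but less geographic detail for research.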
4. Working Group on Analogies between Computer Viruses and Biological Viruses
- Can ideas for defending against biological viruses lead to ideas for defending against computer viruses?
- Concern about the large gap between the initial time of attack and the implementation of defensive strategies.
- "Public health" approach: Once a virus has infected a machine, it tries to connect it to as many computers as possible, as fast as possible. A "throttle" limits the rate at which a computer can connect to new computers.
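The throttle idea can be sketched in a few lines. This is an illustrative toy, not the working group's design: the class name, the size of the recent-host list, and the rule of releasing one new host per time unit are all assumptions.

```python
from collections import deque

class ConnectionThrottle:
    """Toy throttle: known hosts pass at once, new hosts are released one per tick."""

    def __init__(self, recent_size=4):
        self.recent = deque(maxlen=recent_size)  # hosts contacted recently
        self.pending = deque()                   # delayed requests to new hosts

    def request(self, host):
        """Return True if the connection may proceed now, False if it is delayed."""
        if host in self.recent:
            return True
        if host not in self.pending:
            self.pending.append(host)
        return False

    def tick(self):
        """Call once per time unit: release at most one delayed new host."""
        if not self.pending:
            return None
        host = self.pending.popleft()
        self.recent.append(host)
        return host

throttle = ConnectionThrottle()
immediate = [throttle.request(h) for h in ["a", "b", "c", "a"]]
released = [throttle.tick() for _ in range(4)]
```

Normal traffic mostly revisits a small set of hosts and is barely delayed, while a machine trying to reach many new hosts quickly builds up a long pending queue, which is itself a signal of infection.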
5. Working Group on Modeling Social Responses to Bioterrorism
- Models of the spread of infectious disease commonly assume passive bystanders and rational actors who will comply with health authorities.
- It is not clear how well this assumption applies to situations like a bioterrorist attack using smallpox or plague.
1947, NYC, smallpox outbreak
- An interdisciplinary group is discussing incorporating social behavior into models, models of public health decision-making, and risk communication.
6. The Bioterrorism Sensor Location Problem
- Early warning is critical.
- This is a crucial factor underlying governments' plans to place networks of sensors/detectors to warn of a bioterrorist attack.
The BASIS System
7. Two Fundamental Problems
- Sensor Location Problem (SLP):
  - Choose an appropriate mix of sensors.
  - Decide where to locate them for best protection and early warning.
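One simple way to make the location question concrete is to treat it as a coverage problem: each candidate site covers a set of zones, and sites are picked greedily to cover as many zones as possible within a budget. This formulation, the greedy heuristic, and the site and zone data below are assumptions made for illustration; the slides do not say how the SLP is modeled.

```python
def greedy_sensor_placement(coverage, budget):
    """Pick up to `budget` sites, each time taking the site that covers the most new zones."""
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(
            (site for site in coverage if site not in chosen),
            key=lambda site: len(coverage[site] - covered),
            default=None,
        )
        if best is None or not (coverage[best] - covered):
            break  # no remaining site adds coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical candidate sites and the zones each would cover
coverage = {
    "site_A": {1, 2, 3},
    "site_B": {3, 4},
    "site_C": {4, 5, 6, 7},
    "site_D": {1, 7},
}
chosen, covered = greedy_sensor_placement(coverage, budget=2)
```

A realistic model would also have to account for the mix of sensor types, detection delay, and cost, which this sketch ignores.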
8. Two Fundamental Problems
- Pattern Interpretation Problem (PIP): When sensors set off an alarm, help public health decision makers decide:
  - Has an attack taken place?
  - What additional monitoring is needed?
  - What was its extent and location?
  - What is an appropriate response?
9. Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Supported by Interagency KD-D Group
10. OBJECTIVE
Monitor huge streams of textualized communication to automatically detect pattern changes and "significant" events.
Motivation: monitoring email traffic
11. TECHNICAL PROBLEM
- Given a stream of text in any language.
- Decide whether "new events" are present in the flow of messages.
- Event: a new topic, or a topic with an unusual level of activity.
- Retrospective or "Supervised" Event Identification: classification into pre-existing classes.
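A very small sketch of the "new event" decision: each incoming message is compared with all earlier ones, and it is flagged as new when its best cosine similarity falls below a threshold. The whitespace tokenizer, the raw term counts, the threshold of 0.3, and the sample messages are assumptions for illustration only, not the project's method.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def flag_new_events(messages, threshold=0.3):
    """Flag a message as new when it resembles no earlier message closely enough."""
    seen, flags = [], []
    for text in messages:
        vec = Counter(text.lower().split())
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags

# Invented messages
messages = [
    "shipment arrives friday at the port",
    "the shipment arrives at the port friday",
    "power outage reported downtown tonight",
]
flags = flag_new_events(messages)
```

Comparing against every earlier message is quadratic in the stream length, which is one reason compression and fast nearest-neighbor methods matter for huge streams.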
12. TECHNICAL PROBLEM
- Batch filtering: given relevant documents up front.
- Adaptive filtering: "pay" for information about relevance as the process moves along.
13. MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR "UNSUPERVISED" LEARNING
- Classes change: new classes appear, or existing classes change meaning
- A difficult problem in statistics
- Recent new C.S. approaches
- "Semi-supervised Learning":
  - Algorithm suggests a new class
  - Human analyst labels it and determines its significance
14. COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
- (1) Compression of Text -- to meet storage and processing limitations
- (2) Representation of Text -- put in a form amenable to computation and statistical analysis
- (3) Matching Scheme -- computing similarity between documents
- (4) Learning Method -- build on judged examples to determine characteristics of a document cluster ("event")
- (5) Fusion Scheme -- combine methods (scores) to yield improved detection/clustering
15. COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II
- These distinctions are somewhat arbitrary.
- Many approaches to message processing overlap several of these components.
- Existing methods don't exploit the full power of the 5 components, synergies among them, and/or an understanding of how to apply them to text data.
16. COMPRESSION
- Reduce the dimension before statistical analysis.
- We often have just one shot at the data as it comes "streaming by".
17. COMPRESSION II
- Recent results: One pass through the data can reduce volume significantly without degrading performance significantly.
We believe that sophisticated dimension reduction
methods in a preprocessing stage followed by
sophisticated statistical tools in a
detection/filtering stage can be a very powerful
approach. Our methods so far give us some
confidence that we are right.
18. COMPRESSION III
- Three directions of work involving adaptation of nearest neighbor (NN) algorithms from theoretical computer science:
  - Use of random projections into real subspaces. (Still promising, though not competitive for our data.)
  - Random projections into Hamming cubes.
  - Efficient discovery of deviant cases in a stream of vectorized entities.
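To give a feel for projecting into a Hamming cube, the sketch below keeps only the sign of a handful of random Gaussian projections, so each vector becomes a short bit string and similarity is estimated by Hamming distance. Sign random projection is one standard construction and is used here as an assumption; the slides do not say which construction the project uses. The dimension, bit count, seed, and vectors are invented.

```python
import random

def make_projector(dim, bits, seed=0):
    """Build a map from real vectors of length `dim` to bit strings of length `bits`."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def project(vec):
        # One bit per random hyperplane: which side of the plane the vector lies on.
        return [1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0 for plane in planes]

    return project

def hamming(a, b):
    """Number of positions in which two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

project = make_projector(dim=5, bits=16)
doc = [3.0, 0.0, 1.0, 0.0, 2.0]
near = [3.0, 0.0, 1.0, 0.5, 2.0]   # small perturbation of doc
opposite = [-x for x in doc]        # points in the opposite direction
d_near = hamming(project(doc), project(near))
d_opposite = hamming(project(doc), project(opposite))
```

Vectors separated by a small angle tend to agree on most bits, so nearest-neighbor search can work on compact bit strings instead of the original high-dimensional vectors.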
19. MORE SOPHISTICATED STATISTICAL APPROACHES BEING STUDIED
- Representations: Boolean representations; weighting schemes
- Matching Schemes: Boolean matching; nonlinear transforms of individual feature values
- Learning Methods: new kernel-based methods; more complex Bayes classifiers; boosting
- Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes
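As one concrete reading of "combining scores based on ranks", the sketch below converts each method's scores to ranks and orders documents by their mean rank. The mean-rank rule, the tie-break by document id, and the score tables are assumptions for illustration; the slides do not specify the fusion rule.

```python
def ranks(scores):
    """Map each document id to its rank under one method (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: position for position, doc in enumerate(ordered, start=1)}

def fuse_by_rank(*score_tables):
    """Order the documents scored by every method by their mean rank."""
    rank_tables = [ranks(table) for table in score_tables]
    docs = set.intersection(*(set(table) for table in score_tables))
    mean_rank = {d: sum(r[d] for r in rank_tables) / len(rank_tables) for d in docs}
    return sorted(docs, key=lambda d: (mean_rank[d], d))  # ties broken by id

# Invented scores from two hypothetical methods on different scales
method_a = {"d1": 0.9, "d2": 0.4, "d3": 0.7, "d4": 0.1}
method_b = {"d1": 12.0, "d2": 25.0, "d3": 30.0, "d4": 1.0}
fused = fuse_by_rank(method_a, method_b)
```

Working with ranks sidesteps the fact that the two methods score on incomparable scales, at the cost of discarding how large the score gaps are.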
20. DATA SETS USED
- No readily available data set has all the characteristics of data on which we expect our methods to be used.
- However: Many of our methods depend essentially only on term frequencies by document.
- Thus, many available data sets can be used for experimentation.
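"Term frequencies by document" amounts to a table with one row per document and one column per term. A minimal sketch, assuming lowercase whitespace tokenization and two invented documents:

```python
from collections import Counter

def term_frequency_table(documents):
    """Return a sorted vocabulary and one row of term counts per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Invented documents
documents = [
    "Outbreak reported in city",
    "city officials deny outbreak outbreak",
]
vocabulary, rows = term_frequency_table(documents)
```

Any corpus that can be reduced to such a table can serve as a test bed, whatever its original subject matter.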
21. DATA SETS USED II
- TREC (Text Retrieval Conference) data: time-stamped subsets of the data (order 10^5 to 10^6 messages)
- Reuters Corpus Vol. 1 (8 x 10^5 messages)
- Medline Abstracts (order 10^7 with human indexing)
22. THE MONITORING MESSAGE STREAMS PROJECT TEAM
Endre Boros, RUTCOR
Paul Kantor, SCILS
Dave Lewis, Consultant
Ilya Muchnik, DIMACS/CS
S. Muthukrishnan, CS
David Madigan, Statistics
Rafail Ostrovsky, Telcordia Technologies
Fred Roberts, Rutgers
Martin Strauss, AT&T Labs
Wen-Hua Ju, Avaya Labs (collaborator)