Title: An Overview of Data Analytics at DIMACS and DyDAn
1. An Overview of Data Analytics at DIMACS and DyDAn
2. What is DIMACS?
- DIMACS was formed as an NSF Science and Technology Center in 1989 to foster research and education programs at the interface between discrete mathematics and theoretical computer science
- Built around multi-year themes called special foci
- Hosts related workshops and education programs and leads related research projects
- Primary areas of research: discrete mathematics, theoretical computer science, and their applications; interfaces between mathematics and biology; homeland security
- DIMACS has both industry and academic institutional partners and nearly 300 affiliated scientists
- Many of the world's leaders in discrete mathematics and theoretical computer science and their applications
- Statisticians, biologists, psychologists, chemists, epidemiologists, and engineers
- None are paid by DIMACS, but they join in DIMACS projects
3. A Selection of DIMACS Projects
- Bioterrorism Sensor Location
- Port of Entry Inspection Algorithms
- Monitoring Message Streams
- Author Identification
- Computational and Mathematical Epidemiology
- Adverse Event/Disease Reporting/Surveillance/Analysis
- Bioterrorism Working Group
- Modeling Social Responses to Bioterrorism
- Predicting Disease Outbreaks from Remote Sensing and Media Data
- Communication Security and Information Privacy
4. What is DyDAn?
- DyDAn is a DHS Center of Excellence for research on advanced methods for information analysis
- Established as one of four centers for research in discrete sciences
- DyDAn serves as coordinator of the 4 centers, based at Rutgers, University of Illinois, USC, and University of Pittsburgh
- DyDAn is based at Rutgers and has 5 university and 2 industry partners
- 40 researchers in the fields of mathematics, computer science, statistics, operations research, engineering, and biology
- DyDAn is based at DIMACS
DyDAn is developing novel technologies to find patterns and relationships in dynamic, nonstationary, massive datasets
5. DyDAn Researchers Work On
- Counter-terrorism
- Intelligence analysis
- Disease surveillance (natural/man-caused)
- Customs and border protection
- Law enforcement
- Data management in emergency situations
- Nuclear detection/sensors
- Image, audio, and text analysis
Avian flu
Containers for Inspection
We hope to make DyDAn an informatics resource for
the homeland security enterprise
6. Project: Sensor Management for Nuclear Detection
7. Project: Universal Information Graphs
- A variety of different massive data sources are available to analysts: Web, Internet, Calls, Email, Transportation, ...
- Problem: coordinate information from multiple sources to identify interesting collaborative information networks
- Model each data source as a large multigraph, but there will be too much information to actually fuse all these multigraphs into one...
- We want to virtually fuse these disparate multigraphs
- Develop computationally efficient node rank functions (as in Web search ranking)
- Develop linkage metrics between nodes to understand patterns of communication
- Approximate linkage metrics with limited time and space resources
- Hierarchy Tree tools developed by team members offer a uniform method for large data exploration and are particularly well suited for External Memory Graphs
- I/O and screen bottlenecks are handled uniformly
- Hierarchical slices allow the incorporation of different data types
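As a rough illustration of "virtual fusion" with a node rank function, the sketch below runs a PageRank-style power iteration over several multigraphs by streaming each source's edge list in turn, so the merged graph is never materialized. This is not the project's method; the function name `node_rank`, the edge-list representation, the damping factor, and the toy `calls` and `email` sources are all assumptions made for the example.

```python
from collections import defaultdict


def node_rank(sources, damping=0.85, iterations=50):
    """PageRank-style rank over several multigraphs without merging them.

    sources: list of edge lists. Each edge is a (src, dst) pair and may
    repeat; parallel edges in a multigraph simply count more than once.
    """
    nodes = sorted({n for edges in sources for edge in edges for n in edge})
    out_degree = defaultdict(int)
    for edges in sources:
        for src, _dst in edges:
            out_degree[src] += 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Stream each source in turn; the fused graph is never built.
        for edges in sources:
            for src, dst in edges:
                new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Two hypothetical data sources over an overlapping set of nodes.
calls = [("a", "b"), ("a", "b"), ("b", "c")]
email = [("c", "a"), ("b", "c"), ("d", "c")]
ranks = node_rank([calls, email])
```

Only the per-node rank and out-degree tables are held in memory, which is the property that matters when the edge lists themselves live in external memory.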
8. Project: Monitoring Message Streams
- Algorithmic Methods for Automatic Processing of Messages
- Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events
- Components of automatic message processing: 1) text compression, 2) text representation, 3) matching scheme, 4) learning method, 5) fusion scheme
- Project premise: existing methods don't exploit the full power of the 5 components, synergies among them, and/or an understanding of how to apply them to text data
- Our approach is to develop/explore methods for each component and then to combine them
- In the first phase of the project, we did over 5000 complete experiments with different combinations of methods
- Nearest neighbor
- Bayesian methods: the Bayesian Regression software we developed constitutes the most efficient software in the world for ultra-high dimensional Bayesian logistic regression
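To make "Bayesian logistic regression" on text concrete, here is a minimal sketch of a maximum a posteriori (MAP) fit with a zero-mean Gaussian prior on each weight, using sparse bag-of-words documents. It is not the Bayesian Regression software described above: the choice of a Gaussian prior, the plain gradient-descent optimizer, the function names, and the four toy documents are all assumptions for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, doc):
    """Probability that a sparse document {feature: value} is in the class."""
    return sigmoid(sum(weights.get(f, 0.0) * v for f, v in doc.items()))


def fit_map_logistic(docs, labels, prior_variance=1.0,
                     learning_rate=0.1, epochs=200):
    """Gradient descent on the negative log posterior.

    The Gaussian prior contributes w / prior_variance to the gradient,
    which shrinks the weights of rarely seen features toward zero.
    """
    weights = {}
    for _ in range(epochs):
        grad = {f: w / prior_variance for f, w in weights.items()}
        for doc, y in zip(docs, labels):
            error = predict(weights, doc) - y
            for f, v in doc.items():
                grad[f] = grad.get(f, 0.0) + error * v
        for f, g in grad.items():
            weights[f] = weights.get(f, 0.0) - learning_rate * g
    return weights


# Hypothetical toy messages; "bias" plays the role of an intercept.
docs = [
    {"attack": 1, "bias": 1},
    {"threat": 1, "bias": 1},
    {"lunch": 1, "bias": 1},
    {"meeting": 1, "bias": 1},
]
labels = [1, 1, 0, 0]
weights = fit_map_logistic(docs, labels)
```

Storing weights and documents as dictionaries keeps the cost of each update proportional to the number of nonzero features, which is the usual reason sparse representations are used in very high-dimensional text problems.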
9. MMS Goal
Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events
Motivation: monitoring email traffic, news, communiques, faxes, voice intercepts (with speech recognition)
Emphasis in this phase of the project: Entity Resolution
10. MMS Key Goal
- Produce an entity resolution module that is
  - robust, general, well-founded
  - based on our current Bayesian logistic regression framework
  - able to be integrated with software for a variety of applications
11. Outline
- Bayesian logistic regression
- Advances
  - Using domain knowledge to reduce the need for training data
  - Speeding up training and classification
  - Online training
12. Speeding Up Classification
- Completed new version of BMRclassify
- Replaces old BBRclassify and BMRclassify
- More flexible
- Can apply 1000s of binary and polytomous classifiers simultaneously
- Allows meaningful names for features
- Inverted index to classifier suites for speed
- 25x speedup over old BMRclassify and BBRclassify
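The idea of an inverted index over a classifier suite can be sketched as follows: instead of looping over every classifier for each document, map each feature to the classifiers that give it a nonzero weight, so a document only touches the classifiers that share a feature with it. This is an illustration only, not the BMRclassify implementation; the function names, the linear scoring without intercepts, and the example suite are assumptions.

```python
from collections import defaultdict


def build_index(classifiers):
    """Invert {classifier: {feature: weight}} into {feature: [(classifier, weight)]}."""
    index = defaultdict(list)
    for name, weights in classifiers.items():
        for feature, weight in weights.items():
            index[feature].append((name, weight))
    return index


def score_all(index, document):
    """Linear scores for every classifier sharing a feature with the document.

    document: sparse {feature: value}. Classifiers that share no feature
    with the document are never visited and are absent from the result.
    """
    scores = defaultdict(float)
    for feature, value in document.items():
        for name, weight in index.get(feature, ()):
            scores[name] += weight * value
    return dict(scores)


# Hypothetical suite of three sparse linear classifiers with named features.
suite = {
    "weapons": {"rifle": 2.0, "ammo": 1.5},
    "travel": {"flight": 1.0, "visa": 2.5},
    "finance": {"wire": 3.0},
}
index = build_index(suite)
scores = score_all(index, {"rifle": 1, "visa": 2, "hello": 4})
# scores == {"weapons": 2.0, "travel": 5.0}
```

The work per document scales with the number of (feature, classifier) pairs actually hit rather than with the total number of classifiers, which is where a speedup would come from when thousands of sparse classifiers are applied at once.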
13. Rutgers DIMACS KDDMMS Clustering and Entity Resolution
- The intelligence problem
  - documents
  - entities (people, organizations)
  - first-order associations (between different types)
  - second-order associations (within types)
- Example
  - people (in the role of authors)
  - scientific publications
  - research groups: invisible colleges
14. Current ER activity
- Join the multiple-models approach with the multiple options presented by the BMR modeling package
- Integrate all methods with online algorithms
- Reality checks: synthesized or challenge data
- Models for whether a pair of items is from the same agent or from different agents
- Uses BXR to identify salient features for this task, which carry over to agents who have never been seen before
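One way to see why a pairwise model can carry over to unseen agents: if the features describe the relationship between two items (for example, how similar their vocabularies are) rather than the identity of any agent, the same weights apply to any pair. The sketch below assumes two simple similarity features and hand-set weights; the feature choice, function names, and weight values are hypothetical and are not taken from BXR or from the project's models.

```python
import math


def pair_features(doc_a, doc_b):
    """Symmetric features for a pair of sparse documents {term: count}."""
    shared = set(doc_a) & set(doc_b)
    union = set(doc_a) | set(doc_b)
    dot = sum(doc_a[t] * doc_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in doc_a.values()))
    norm_b = math.sqrt(sum(v * v for v in doc_b.values()))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    jaccard = len(shared) / len(union) if union else 0.0
    return {"cosine": cosine, "jaccard": jaccard, "bias": 1.0}


def same_agent_probability(features, weights):
    """Logistic model over pair features; no agent identity is used."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, set by hand for illustration rather than fitted.
weights = {"cosine": 4.0, "jaccard": 2.0, "bias": -3.0}

doc1 = {"graph": 2, "stream": 1}
doc2 = {"graph": 1, "stream": 1, "sensor": 1}
doc3 = {"flu": 3}
p_close = same_agent_probability(pair_features(doc1, doc2), weights)
p_far = same_agent_probability(pair_features(doc1, doc3), weights)
```

In practice the weights would be learned from labeled same-agent and different-agent pairs; the point of the sketch is only that nothing in the feature vector names a particular agent.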
15. Collaborative Tools
- Paul Kantor (LIS)
- Barry Sopher (Economics)
- Rutgers AEF Support
16. Collaboration requires
- The right software
- The right incentive mechanisms
- mechanism design
- research laboratory at Rutgers
- experiments in collaboration
- find the right mix
- build systems that make the rewards reliable, prompt, and automatic
17. AntWorld
18. Potential DOD applications
- Intelligence activity
  - asynchronous collaboration
  - 24/7 monitoring of open source and SIGINT traffic
  - sharing of hypotheses and insights
- Coordination among warfighters
  - combine roving information to provide red/green indications