An Overview of Data Analytics at DIMACS and DyDAn

Provided by: tamicar

1
An Overview of Data Analytics at DIMACS and DyDAn
  • Paul Kantor
  • Fred Roberts

2
What is DIMACS?
  • DIMACS was formed as an NSF Science and
    Technology center in 1989 to foster research
    and education programs at the interface between
    discrete math and theoretical computer science
  • Built around multi-year themes called special
    foci
  • Host related workshops and education programs and
    lead related research projects
  • Primary areas of research: discrete math,
    theoretical CS, and their applications; interfaces
    between mathematics and biology; homeland
    security
  • DIMACS has both industry and academic
    institutional partners and nearly 300 affiliated
    scientists
  • Many of the world's leaders in discrete
    mathematics and theoretical computer science and
    their applications
  • Statisticians, biologists, psychologists,
    chemists, epidemiologists, and engineers
  • None are paid by DIMACS, but they join in DIMACS
    projects

3
A Selection of DIMACS Projects
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

4
What is DyDAn?
  • DyDAn is a DHS Center of Excellence for research
    on advanced methods for information analysis
  • Established as one of four centers for research
    in discrete sciences
  • DyDAn serves as coordinator of the 4 centers,
    based at Rutgers, University of Illinois, USC,
    and University of Pittsburgh
  • DyDAn is based at Rutgers and has 5 university
    and 2 industry partners
  • 40 researchers in fields of mathematics,
    computer science, statistics, operations
    research, engineering and biology
  • DyDAn is based at DIMACS

DyDAn is developing novel technologies to find
patterns and relationships in dynamic,
nonstationary, massive datasets
5
DyDAn Researchers Work On
  • Counter-terrorism
  • Intelligence analysis
  • Disease surveillance (natural/man-caused)
  • Customs and border protection
  • Law enforcement
  • Data management in emergency situations
  • Nuclear detection/sensors
  • Image, audio, and text analysis

Avian flu
Containers for Inspection
We hope to make DyDAn an informatics resource for
the homeland security enterprise
6
Project: Sensor Management for Nuclear Detection
7
Project: Universal Information Graphs
  • A variety of different massive data sources are
    available to analysts: Web, Internet, Calls,
    Email, Transportation, ...
  • Problem: Coordinate information from multiple
    sources to identify interesting collaborative
    information networks
  • Model each data source as a large multigraph, but
    there will be too much information to actually
    fuse all these multigraphs into one...
  • We want to virtually fuse these disparate
    multigraphs
  • Develop computationally-efficient node rank
    functions (as in Web search ranking)
  • Develop linkage metrics between nodes to
    understand patterns of communication
  • Approximate linkage metrics with limited time and
    space resources.
  • Hierarchy Tree tools developed by team members
    offer a uniform method for large data
    exploration. Particularly well suited for
    External Memory Graphs.
  • I/O and screen bottlenecks are handled uniformly.
  • Hierarchical slices allow the incorporation of
    different data types.
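
The slide mentions node rank functions "as in Web search ranking" without naming a specific algorithm. As a rough illustration only, the sketch below assumes a PageRank-style power iteration on a single directed multigraph, where parallel edges simply count as extra weight. The function name `node_rank`, the damping value, and the edge-list input format are all assumptions made for this example, not the project's actual method.

```python
from collections import defaultdict


def node_rank(edges, damping=0.85, iterations=50):
    """PageRank-style power iteration over a directed multigraph.

    `edges` is a list of (source, target) pairs. A pair that appears
    several times is treated as several parallel edges, so it carries
    proportionally more weight. (Hypothetical helper for illustration.)
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_weight = defaultdict(float)   # total outgoing edge count per node
    pair_weight = defaultdict(float)  # (source, target) -> parallel-edge count
    for src, dst in edges:
        out_weight[src] += 1.0
        pair_weight[(src, dst)] += 1.0

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), weight in pair_weight.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Example: two parallel edges a->b, plus a->c, b->c, c->a.
ranks = node_rank([("a", "b"), ("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
```

The ranks form a probability distribution over nodes. Each iteration touches every distinct edge once, which is the kind of cost profile that matters when the graphs are too large to fuse or hold in memory.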

8
Project: Monitoring Message Streams
  • Algorithmic Methods for Automatic Processing of
    Messages
  • Monitor huge communication streams, in
    particular, streams of textualized communication
    to automatically detect pattern changes and
    "significant" events
  • Components of automatic message processing: 1)
    text compression, 2) text representation, 3)
    matching scheme, 4) learning method, 5) fusion
    scheme
  • Project premise: Existing methods don't exploit
    the full power of the 5 components, synergies
    among them, and/or an understanding of how to
    apply them to text data
  • Our approach is to develop/explore methods for
    each component and then to combine them
  • In the first phase of the project, we did over
    5000 complete experiments with different
    combinations of methods
  • Nearest neighbor
  • Bayesian methods - the Bayesian Regression
    software we developed constitutes the most
    efficient software in the world for ultra-high
    dimensional Bayesian logistic regression.
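
To make "Bayesian logistic regression" on text concrete, here is a minimal sketch, not the project's software. It assumes a zero-mean Gaussian prior on the weights (the slides do not say which prior is used) and finds the MAP estimate by plain batch gradient ascent over sparse documents. The function names, the learning rate, and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_map_logistic(docs, labels, prior_variance=1.0,
                     learning_rate=0.1, epochs=200):
    """MAP logistic regression under an assumed Gaussian prior.

    docs:   list of sparse documents, each {feature_name: value}
    labels: list of 0/1 class labels
    """
    weights = {}
    for _ in range(epochs):
        gradient = {}
        for doc, y in zip(docs, labels):
            score = sum(weights.get(f, 0.0) * v for f, v in doc.items())
            error = y - sigmoid(score)
            for f, v in doc.items():
                gradient[f] = gradient.get(f, 0.0) + error * v
        for f, g in gradient.items():
            w = weights.get(f, 0.0)
            # Log-likelihood gradient plus log-prior gradient (-w / variance).
            weights[f] = w + learning_rate * (g - w / prior_variance)
    return weights


def predict_proba(weights, doc):
    """Probability of the positive class for one sparse document."""
    return sigmoid(sum(weights.get(f, 0.0) * v for f, v in doc.items()))


# Toy data, invented for this example.
docs = [{"flu": 1.0, "outbreak": 1.0}, {"budget": 1.0, "meeting": 1.0},
        {"outbreak": 1.0, "cases": 1.0}, {"budget": 1.0, "report": 1.0}]
labels = [1, 0, 1, 0]
weights = fit_map_logistic(docs, labels)
```

The prior acts as a penalty that pulls weights toward zero, which is what keeps the fit well behaved when the number of features far exceeds the number of training documents.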

9
MMS Goal
Monitor huge communication streams, in
particular, streams of textualized communication
to automatically detect pattern changes and
"significant" events
Motivation: monitoring email traffic, news,
communiques, faxes, voice intercepts (with speech
recognition)
Emphasis in this phase of the project: Entity
Resolution
10
MMS Key Goal
  • Produce an entity resolution module that is
  • robust, general, well-founded
  • based on our current Bayesian logistic regression
    framework
  • can be integrated with software for a variety of
    applications.

11
Outline
  • Bayesian logistic regression
  • Advances
  • Using domain knowledge to reduce the need for
    training data
  • Speeding up training and classification
  • Online training

12
Speeding Up Classification
  • Completed new version of BMRclassify
  • Replaces old BBRclassify and BMRclassify
  • More flexible
  • Can apply 1000s of binary and polytomous
    classifiers simultaneously
  • Allows meaningful names for features
  • Inverted index to classifier suites for speed
  • 25x speedup over old BMRclassify and BBRclassify
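
The idea of an inverted index over a suite of classifiers can be sketched as follows. This is an illustration under assumptions, not BMRclassify's actual design: it assumes linear classifiers with sparse weight vectors keyed by feature name, and the names `build_inverted_index` and `score_document` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(classifiers):
    """Map each feature to the classifiers that give it a nonzero weight.

    classifiers: {classifier_name: {feature_name: weight}}
    returns:     {feature_name: [(classifier_name, weight), ...]}
    """
    index = defaultdict(list)
    for name, weights in classifiers.items():
        for feature, weight in weights.items():
            if weight != 0.0:
                index[feature].append((name, weight))
    return index


def score_document(index, intercepts, doc):
    """Score one sparse document against every classifier in a single pass.

    Only (feature, classifier) pairs with a nonzero weight are visited,
    so the cost depends on the document's features, not on the total
    number of classifiers.
    """
    scores = dict(intercepts)
    for feature, value in doc.items():
        for name, weight in index.get(feature, ()):
            scores[name] = scores.get(name, 0.0) + weight * value
    return scores


# Two toy classifiers with named features (values invented).
classifiers = {
    "health": {"flu": 2.0, "outbreak": 1.5},
    "finance": {"budget": 1.0, "outbreak": -0.5},
}
intercepts = {"health": -1.0, "finance": -0.5}
index = build_inverted_index(classifiers)
scores = score_document(index, intercepts, {"flu": 1.0, "outbreak": 2.0})
```

With thousands of classifiers, most of them assign zero weight to any given feature, so looking up each document feature once is much cheaper than evaluating every classifier in turn.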

13
Rutgers DIMACS KDDMMS Clustering and Entity
Resolution
  • The intelligence problem
  • documents
  • entities (people, organizations)
  • first order associations (between different
    types)
  • second order associations (within types)
  • Example
  • people (in the role of authors)
  • scientific publications
  • research groups - invisible colleges

14
Current ER activity
  • Join the multiple-models approach with the multiple
    options offered by the BMR modeling package.
  • Integrate all methods with online algorithms
  • Reality checks: synthesized or challenge data
  • Models for whether a pair of items is from the same
    agent or from different agents
  • Uses BXR to identify salient features for this
    task, which carry over to agents who have never
    been seen before.
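
One way to read the pairwise approach: describe each pair of items by features of the pair itself rather than of any particular agent, so the same model applies to agents never seen in training. The sketch below is a guess at that pattern, not the BXR feature set. The two features, the weights, and the function names are invented for illustration.

```python
import math


def mean_word_length(words):
    """Average word length, or 0.0 for an empty document."""
    return sum(len(word) for word in words) / len(words) if words else 0.0


def pair_features(doc_a, doc_b):
    """Features describing how similar two texts are (hypothetical set)."""
    words_a, words_b = doc_a.lower().split(), doc_b.lower().split()
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return {
        "vocab_jaccard": len(set_a & set_b) / len(union) if union else 0.0,
        "mean_word_length_gap": abs(mean_word_length(words_a)
                                    - mean_word_length(words_b)),
    }


def same_agent_probability(weights, intercept, doc_a, doc_b):
    """Logistic score that the two texts come from the same agent."""
    features = pair_features(doc_a, doc_b)
    z = intercept + sum(weights.get(name, 0.0) * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Weights would normally be learned from labeled pairs; these are made up.
weights = {"vocab_jaccard": 4.0, "mean_word_length_gap": -1.0}
p = same_agent_probability(weights, -2.0,
                           "the committee met on monday",
                           "the committee met again on friday")
```

Because the features are symmetric in the two items and say nothing about who the agent is, a model trained on pairs from known agents can still score pairs from new ones.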

15
Collaborative Tools
  • Paul Kantor (LIS)
  • Barry Sopher (Economics)
  • Rutgers AEF Support

16
Collaboration requires
  • The right software
  • the right incentive mechanisms
  • mechanism design
  • research laboratory at Rutgers
  • experiments in collaboration
  • find the right mix
  • build systems that make the rewards
  • reliable, prompt, automatic

17
AntWorld
18
Potential DOD applications
  • Intelligence activity
  • asynchronous collaboration
  • 24/7 monitoring of open source and SIGINT traffic
  • sharing of hypotheses and insights
  • Coordination among warfighters
  • combine roving information to provide
    red/green indications
