Title: Automated Learning Group
1 Automated Learning Group
2 ALG Mission
- The specific mission of the Automated Learning Group is:
- To collaborate with researchers to develop novel computational methods and the scientific foundation for using historical data to improve future decision making
- To work closely with industrial, government, and academic partners to explore new application areas for such methods
- To transfer the resulting software technology into real-world applications
3 ALG Research, Development, Technology Transfer Model
4 What Is It?
Overview of Knowledge Discovery
- Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
- The understandable patterns are used to:
- Make predictions about, or classifications of, new data
- Explain existing data
- Summarize the contents of a large database to support decision making
- Create graphical data visualizations to aid humans in discovering complex patterns
5 Why Do We Need Data Mining?
Overview of Knowledge Discovery
- Data volumes are too large for classical analysis approaches
- Large numbers of records (10^8-10^12 bytes)
- High-dimensional data (10^2-10^4 attributes)
- How do you explore millions of records, with tens, hundreds, or thousands of fields, and find patterns?
- As databases grow, using traditional query languages for the decision support process becomes infeasible
- Many queries of interest are difficult to state in a query language (the query formulation problem):
- Find all cases of fraud
- Find all individuals likely to buy a Ford Explorer
- Find all documents that are similar to this customer's problem
6 Computational Knowledge Discovery
7 Knowledge Discovery Process
Overview of Knowledge Discovery
8 Required Effort for Each KDD Step
Overview of Knowledge Discovery
- Arrows indicate the direction we want the effort to go
9 Three Primary Paradigms
Overview of Knowledge Discovery
- Predictive Modeling: a supervised learning approach where classification or prediction of one of the attributes is desired
- Classification is the prediction of predefined classes
- Naive Bayesian, Decision Trees, and Neural Networks
- Regression is the prediction of continuous data
- Neural Networks and Decision (Regression) Trees
- Discovery: an unsupervised learning approach for exploratory data analysis
- Association Rules and Link Analysis
- Clustering and Self-Organizing Maps
- Deviation Detection: identifying outliers in the data
- Information Visualization
10 Advantages of a Framework for Analytics
- Provides a scalable environment from the desktop to web services to grid services
- Employs a visual programming system for the data/work flow paradigm
- Provides the capability to build custom applications
- Provides the capability to access data management tools
- Contains data mining algorithms for prediction and discovery
- Provides data transformations for standard operations
- Provides an integrated environment for models and visualization
- Supports an extensible interface for creating one's own algorithms
- Provides access to distributed computing capabilities
11 D2K - Data To Knowledge
D2K Overview
- D2K is a flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization
12 D2K and Its Many Components
D2K Overview
- D2K Infrastructure: the D2K API, data flow environment, distributed computing framework, and runtime system
- D2K Modules: computational units written in Java that follow the D2K API
- D2K Itineraries: modules that are connected to form an application
- D2K Toolkit: the user interface for specifying and executing itineraries, which provides the rapid application development environment
- D2K-Driven Applications: applications that use D2K modules with a custom user interface
- D2K Streamline (SL): a task-driven system that uses D2K modules
- D2K Web/Grid Services: enable web deployment
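The module/itinerary relationship can be illustrated with a toy sketch. The `Itinerary` class below is hypothetical and far simpler than the real D2K API (no typed ports, no distributed scheduling); it only shows the idea of wiring computational units into a data flow and running data through them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical sketch of the module/itinerary idea (NOT the real
 * D2K API): each "module" is a computational unit, and an itinerary
 * connects modules into a linear data-flow application.
 */
class Itinerary {
    private final List<Function<Object, Object>> modules = new ArrayList<>();

    // Connect another module onto the end of the data flow.
    public Itinerary connect(Function<Object, Object> module) {
        modules.add(module);
        return this;
    }

    // Push one datum through the pipeline, module by module.
    public Object run(Object input) {
        Object data = input;
        for (Function<Object, Object> m : modules) {
            data = m.apply(data);
        }
        return data;
    }
}
```

In the real system, itineraries are specified visually in the D2K Toolkit rather than in code, and modules run under a distributed runtime.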
13 D2K Streamline (D2K SL)
D2K SL
- Provides a step-by-step interface to guide the user through data analysis
- Supports returning to earlier steps to rerun with different parameters
- Uses the D2K infrastructure transparently
- Uses the same D2K modules
- Provides a way to capture different experiments
14 Success Story: Predictive Analytics
- The Problem
- Predict the number of products a customer will purchase, to enable increased conquest, cross-sell, and upsell sales
- The Solution
- Built data models to predict what customers were ready to buy, and how many
- Computed customer buying propensities
- The Results
- Achieved an increase in conquest customer sales lift by accurately predicting optimal groups for directed cross-sell/upsell activity
- An increase of more than 50 percent in the number of sales calls on potential customers
- The average number of engines sold to truck fleet customers rose 67 percent
- In 1998, the number of promising sales targets identified jumped to 35 percent, and the number of engines sold grew to 6.75 per truck fleet customer
- Why It Worked
- We added analytics to a process that had been based on the estimation of experts. The shift from a process built on professional experience to one that was data driven reduced opportunities for misinterpretation of market dynamics.
15 Success Story: Predictive Analytics
- The Problem
- Predict the length of time a customer will keep a product (subscribe to a service)
- The Solution
- Built data models to predict customer behavior
- Why It Worked
- We added analytics to a process that had been based on the estimation of experts. The shift from a process built on professional experience to one that was data driven reduced opportunities for misinterpretation of market dynamics.
16 Earth, Space, and Environmental Sciences
- Grids are being built to work with distributed earth, space, and environmental science data stores. A next step is to undertake distributed data analysis utilizing remote data.
EMO Analysis Environment
- EMO: Evolutionary-based Multiobjective Optimization for Hazard Management
- Barbara Minsker, Civil and Environmental Engineering
- MAEViz: Multi-modal Data Integration and Information Visualization
- Dan Abrams, Civil Engineering
- MUSTSIM: Real-time Data Stream Fusion and Information Visualization
- Amr Elnashai, Dan Kuchma, and Bill Spencer, Civil Engineering
17 Bioinformatics
- Now that the human genome has been sequenced, attention is turning to mining proteomic and structural biological data, looking for patterns that arise when examining data from a wide variety of different omic data sets.
Phylomat
- Phylomat
- Rex Gaskins, Cell and Structural Biology
- Disease Susceptibility
- Larry Schook, Animal Science
- Constructing Biological Networks
- David Rivier, Cell and Structural Biology
18 Social Sciences and Humanities
- Although science is leading the way, the exploring, analyzing, and mining of social science data stores is beginning to change these fields, too.
DISCUS Collaboration
- Distributed Innovation and Scalable Collaboration in Uncertain Settings (DISCUS)
- David Goldberg, General Engineering
- Music Information Retrieval (MIR)
- Stephen Downie, Graduate School of LIS
- Ticket to Work, Job Demands
- Tanya Gallagher, College of Applied Life Studies
- Mining Bugzilla
- Les Gasser, Graduate School of LIS
- Multi-modal Global Economic Modeling
- Gerald Nelson, Agricultural and Consumer Economics
- Concept Modeling in War Periodicals
- Bruce Rosenstock, LAS-Religion
19 Homeland Defense
- Mining homeland defense data is difficult because the data is massive, distributed, complex, and heterogeneous.
MAIDS Analysis
- Mining Alarming Incidents in Data Streams (MAIDS)
- Jiawei Han, Computer Science
- Distributed Innovation and Scalable Collaboration in Uncertain Settings (DISCUS)
- David Goldberg, General Engineering
- NIBRS: Mining the National Incident-Based Reporting System
- Tracy McGee, Illinois State Police
- Intelligence Gathering from AP News Feeds
20 Knowledge Extraction from Streaming Text
- Information extraction is the process of using advanced automated machine learning approaches to identify entities in text documents and to extract that information, along with the relationships those entities may have in the text
- This project demonstrates information extraction of names, places, and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed.
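A crude illustration of the extraction step: the real system uses trained machine-learning models, but the toy heuristic below (runs of capitalized words treated as candidate entities) shows the input/output shape of pulling names out of a news sentence. The class name and regex are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Toy stand-in for entity extraction from streaming text: find runs
 * of capitalized words as candidate named entities. A real extractor
 * uses trained models and also classifies entity type (person,
 * place, organization); this only illustrates the data shape.
 */
class EntitySketch {
    private static final Pattern CAP_RUN =
        Pattern.compile("\\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\\b");

    public static List<String> extract(String text) {
        List<String> entities = new ArrayList<>();
        Matcher m = CAP_RUN.matcher(text);
        while (m.find()) {
            entities.add(m.group());
        }
        return entities;
    }
}
```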
21 D2K Web Service Architecture
D2K Web Service
- Any web-enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.
- Itineraries and modules are stored on the web service machine and loaded over the network by the D2K servers.
- Job results are also stored in the web service tier.
- Results are returned to clients upon request.
- A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.
- Remote D2K servers handle itinerary processing. Where possible, modules should load any data from remote locations.
22 MAIDS Stream Mining Architecture
- MAIDS aims to:
- Discover changes, trends, and evolution characteristics in data streams
- Construct clusters and classification models from data streams
- Explore frequent patterns and similarities among data streams
23 MAIDS Stream Characteristics
Current ALG Projects
- Huge volumes of continuous data, possibly infinite
- Fast changing, requiring fast, real-time response
- The data stream model nicely captures our data processing needs today
- Random access is expensive: a single linear scan algorithm (you can only have one look)
- Store only a summary of the data seen thus far
- Most stream data is at a pretty low level or is multi-dimensional in nature, and needs multi-level and multi-dimensional processing
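The "single linear scan, store only a summary" constraint can be made concrete with a toy sufficient-statistics class (illustrative only, not MAIDS code): each arriving value updates a fixed-size summary in O(1), and the raw stream is never stored or revisited.

```java
/**
 * Single-pass stream summary: since random access to a stream is
 * expensive (one look only), keep just sufficient statistics
 * (count, sum, min, max), updated in one linear scan with O(1)
 * memory regardless of stream length.
 */
class StreamSummary {
    private long count = 0;
    private double sum = 0;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    public void observe(double x) {
        count++;
        sum += x;
        if (x < min) min = x;
        if (x > max) max = x;
    }

    public long count() { return count; }
    public double mean() { return sum / count; }
    public double min()  { return min; }
    public double max()  { return max; }
}
```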
24 Features of MAIDS
- General-purpose tool for data stream analysis
- Processes high-rate and multi-dimensional data
- Adopts a flexible tilted time window framework
- Facilitates multi-dimensional analysis using a stream cube architecture
- Integrates multiple data mining functions
- Provides a user-friendly interface for automatic analysis and on-demand analysis
- Facilitates setting alarms for monitoring
- Built in D2K as D2K modules and leveraged in the D2K Streamline tool
25 Statistics Query Engine
- Answers user queries on data statistics, such as count, max, min, average, and regression
- Uses the tilted time window
- Uses an efficient data structure, the H-tree, for partial computation of data cubes
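A minimal sketch of the tilted-time-window idea (an invented simplification, not the MAIDS structure): recent time intervals are kept at fine granularity while older intervals are merged into coarser buckets, so memory grows only logarithmically in stream length while the total count is preserved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Logarithmic tilted time window sketch for stream counts: each
 * level holds at most two buckets; when a level overflows, its two
 * oldest buckets are merged into one bucket at the next (coarser)
 * level. Recent time stays fine-grained, old time gets coarser.
 */
class TiltedTimeWindow {
    private final List<Deque<Long>> levels = new ArrayList<>();

    // Insert the count observed in the newest unit time interval.
    public void insert(long count) {
        carry(0, count);
    }

    private void carry(int level, long count) {
        while (levels.size() <= level) {
            levels.add(new ArrayDeque<>());
        }
        Deque<Long> buckets = levels.get(level);
        buckets.addFirst(count);
        if (buckets.size() > 2) {
            // Merge the two oldest buckets into one coarser bucket.
            long a = buckets.removeLast();
            long b = buckets.removeLast();
            carry(level + 1, a + b);
        }
    }

    // Total count retained across all granularities (lossless here).
    public long total() {
        long t = 0;
        for (Deque<Long> l : levels) {
            for (long c : l) t += c;
        }
        return t;
    }

    public int numLevels() { return levels.size(); }
}
```

The natural tilted window described in the MAIDS papers uses calendar granularities (quarters, hours, days); the dyadic merging above is just the simplest shape of the same trade-off.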
26 Stream Data Classifier
- Builds models to make predictions
- Uses a Naïve Bayesian classifier with boosting
- Uses the tilted time window to track time-related information
- Sets alarms to monitor events
27 Stream Pattern Finder
- Finds frequent patterns at multiple time granularities
- Keeps a precise/compressed history in the tilted time window
- Mines only the itemsets of interest using the FP-tree algorithm
- Mines evolution and dramatic changes of frequent patterns
28 Stream Data Clustering
- Two stages: micro-clustering and macro-clustering
- Uses micro-clustering for incremental, online processing and maintenance
- Uses the tilted time frame
- Detects outliers when new clusters are formed
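The online micro-clustering stage can be sketched with a one-dimensional cluster-feature summary (count, linear sum, sum of squares) in the spirit of BIRCH/CluStream-style clustering features; this is an illustration, not the MAIDS implementation.

```java
/**
 * Sketch of a micro-cluster for the online stage of stream
 * clustering: a cluster-feature summary (n, linear sum, sum of
 * squares) that absorbs points incrementally and can be merged
 * with another micro-cluster during macro-clustering. One
 * dimension only, for clarity.
 */
class MicroCluster {
    private long n = 0;
    private double ls = 0;   // linear sum of absorbed points
    private double ss = 0;   // sum of squared points

    public void absorb(double x) {
        n++;
        ls += x;
        ss += x * x;
    }

    // Macro-clustering can combine micro-clusters additively.
    public void merge(MicroCluster other) {
        n += other.n;
        ls += other.ls;
        ss += other.ss;
    }

    public double centroid() { return ls / n; }

    // Variance from sufficient statistics: E[x^2] - (E[x])^2.
    public double variance() {
        double mean = ls / n;
        return ss / n - mean * mean;
    }

    public long size() { return n; }
}
```

Because the summary is additive, a new point far from every centroid (relative to cluster variance) can be flagged as an outlier and seeded as its own micro-cluster, matching the bullet above.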
29 The ALG Team
- Staff
- Loretta Auvil
- Peter Bajcsy
- Colleen Bushell
- Dora Cai
- David Clutter
- Lisa Gatzke
- Vered Goren
- Chris Navarro
- Greg Pape
- Tom Redman
- Barry Sanders
- Duane Searsmith
- Andrew Shirk
- Anca Suvaiala
- David Tcheng
- Michael Welge
- Students
- John Cassel
- Sang-Chul Lee
- Xiaolei Li
- Martin Urban
- Bei Yu