Title: Collection of general data mining briefings
1. Data and Applications Security: Introduction to Data Mining
Dr. Bhavani Thuraisingham Guest Lecture
February 25, 2008
2. Objective of the Unit
- This unit provides an introduction to data mining
3. Outline of Data Mining
- What is Data Mining?
- Data warehousing vs data mining
- Steps to Data Mining
- Need for Data Mining
- Example Applications
- Technologies for Data Mining
- Why Data Mining Now?
- Preparation for Data Mining
- Data Mining Tasks, Methodology, Techniques
- Commercial Developments
- Status, Challenges, and Directions
4. What is Data Mining?
5. Data Warehouses vs Data Mining
- Goal: Improved business efficiency
- Improve marketing (advertise to the most likely buyers)
- Inventory reduction (stock only needed quantities)
- Information source: Historical business data
- Example: Supermarket sales records
- Size ranges from 50k records (research studies) to terabytes (years of data from chains)
- Data is already being warehoused
- Sample question: What products are generally purchased together?
- The answers are in the data; we need to MINE the data
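The sample question above ("what products are generally purchased together?") can be illustrated with a minimal Python sketch that counts how often pairs of products appear in the same basket. The basket contents, the function names `pair_counts` and `frequent_pairs`, and the `min_support` threshold are all assumptions made for illustration; they are not from the lecture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales records: each set is one customer's basket.
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair always appears in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def frequent_pairs(baskets, min_support):
    """Return pairs whose support (fraction of baskets) is at least min_support."""
    n = len(baskets)
    return {
        pair: count / n
        for pair, count in pair_counts(baskets).items()
        if count / n >= min_support
    }

top = frequent_pairs(baskets, min_support=0.4)
for pair, support in sorted(top.items(), key=lambda kv: -kv[1]):
    print(pair, support)
```

Counting every pair is only practical for small product catalogs; commercial tools use more scalable algorithms, which this sketch does not attempt to reproduce.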
6. What Does Warehousing do for Data Mining?
- Difficult to mine disparate data sources
- Data warehouse integrates the disparate data sources into a single logical entity
- Maintains integrity of the data
- Scrubbing and Cleaning
- Formats the data for querying and mining
- Multidimensional data
7. Is it Necessary to Have a Data Warehouse for Data Mining?
- Key to successful data mining is having good data
- Data warehousing integrates heterogeneous data sources, formats the data, and facilitates interactive query processing
- Having a data warehouse is good for data mining, but perhaps not essential
- Data mining tools could be used directly on good/clean databases
8. What's going on in data mining?
- What are the technologies for data mining?
- Database management, data warehousing, machine learning, statistics, pattern recognition, visualization, parallel processing
- What can data mining do for you?
- Data mining outcomes: Classification, Clustering, Association, Anomaly detection, Prediction, Estimation, ...
- How do you carry out data mining?
- Data mining techniques: Decision trees, Neural networks, Market-basket analysis, Link analysis, Genetic algorithms, ...
- What is the current status?
- Many commercial products mine relational databases
- What are some of the challenges?
- Mining unstructured data; extracting useful patterns; web mining; data mining, national security and privacy
9. Steps to Data Mining
(Flow diagram; boxes listed in their logical order)
- Data Sources
- Integrate data sources
- Clean/modify data sources
- Mine the data
- Examine results/Prune results
- Report final results
- Take actions
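The steps on this slide can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the record layout, the store data, the function names, and the use of simple purchase counting as the "mining" step are all hypothetical stand-ins, not the lecture's method.

```python
def integrate(sources):
    """Integrate: merge records from several sources into one list."""
    return [record for source in sources for record in source]

def clean(records):
    """Clean: drop records with missing fields and normalize product names."""
    return [
        {"customer": r["customer"], "product": r["product"].strip().lower()}
        for r in records
        if r.get("customer") and r.get("product")
    ]

def mine(records):
    """Mine: count purchases per product (a stand-in for a real technique)."""
    counts = {}
    for r in records:
        counts[r["product"]] = counts.get(r["product"], 0) + 1
    return counts

def prune(results, min_count):
    """Examine/prune: keep only patterns seen at least min_count times."""
    return {k: v for k, v in results.items() if v >= min_count}

def report(results):
    """Report: format results, most frequent first, ties broken by name."""
    ordered = sorted(results.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{product}: {count}" for product, count in ordered]

# Hypothetical data sources with inconsistent formatting and a bad record.
store_a = [
    {"customer": "c1", "product": "Milk "},
    {"customer": "c2", "product": "bread"},
]
store_b = [
    {"customer": "c3", "product": "milk"},
    {"customer": None, "product": "eggs"},   # missing customer: dropped by clean()
    {"customer": "c4", "product": "Bread"},
    {"customer": "c5", "product": "tea"},
]

lines = report(prune(mine(clean(integrate([store_a, store_b]))), min_count=2))
print("\n".join(lines))
```

The final step, taking actions, is a business decision based on the report and is deliberately left outside the code.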
10. Knowledge Directed to Data Mining
(Flow diagram; boxes listed in their logical order, plus an Expert System box)
- Data Sources
- Integrate data sources
- Clean/modify data sources
- Mine the data
- Examine results/Prune results
- Report final results
- Take actions
- Expert System
11. Need for Data Mining
- Large amounts of current and historical data being stored
- As databases grow larger, decision-making directly from the data is not possible; need knowledge derived from the stored data
- Data from multiple data sources and multiple domains
- Medical, Financial, Military, etc.
- Need to analyze the data
- Support for planning (historical supply and demand trends)
- Yield management (scanning airline seat reservation data to maximize yield per seat)
- System performance (detect abnormal behavior in a system)
- Mature database analysis (clean up the data sources)
12. Example Applications
- A medical supplies company increases sales by targeting in its advertising certain physicians who are likely to buy the products
- A credit bureau limits losses by selecting candidates who are likely not to default on their payment
- An intelligence agency determines abnormal behavior of its employees
- An investigation agency finds fraudulent behavior of some people
13. Integration of Multiple Technologies
(Diagram: Data Mining integrates the following technologies)
- Data Warehousing
- Machine Learning
- Database Management
- Statistics
- Parallel Processing
- Visualization
14. Why Data Mining Now?
- Large amounts of data are being produced
- Data is being organized
- Technologies are developing for database management, data warehousing, parallel processing, machine intelligence, etc.
- It is now possible to mine the data and get patterns and trends
- Interesting applications exist
15. Preparation for Data Mining
- Getting the data into the right format
- Data warehousing
- Scrubbing and cleaning the data
- Some idea of application domain
- Determining the types of outcomes
- e.g., Clustering, classification
- Evaluation of tools
- Getting the staff trained in data mining
16. Some Types of Data Mining (Data Mining Tasks/Outcomes)
- Classification: grouping records into meaningful subclasses
- e.g., Marketing organization has a list of people living in Manhattan all owning cars costing over 20K
- Sequence Detection
- John always buys groceries after going to the bank
- Data dependency analysis: identifying potentially interesting dependencies or relationships among data items
- If John, James, and Jane meet, Bill is also present
- Deviation detection: discovery of significant differences between an observation and some reference
- Anomalous instances
- Discrepancies between observed and expected values
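Deviation detection, as described above, compares an observation against some reference. One simple way to sketch this in Python is a z-score test: flag any observation that lies more than a chosen number of standard deviations from the mean of a reference sample. The z-score approach, the threshold of 2.0, the function name `deviations`, and the data are all assumptions for illustration; the lecture does not prescribe a specific method.

```python
from statistics import mean, stdev

def deviations(observations, reference, threshold=2.0):
    """Flag observations that differ from the reference mean by more than
    `threshold` reference standard deviations.

    Assumes the reference sample has at least two values and is not constant
    (otherwise the standard deviation is undefined or zero).
    """
    mu = mean(reference)
    sigma = stdev(reference)
    flagged = []
    for index, value in enumerate(observations):
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flagged.append((index, value, round(z, 2)))
    return flagged

# Hypothetical expected behavior (reference) and newly observed values.
reference = [100, 102, 98, 101, 99, 100, 103, 97]
observed = [101, 99, 130, 100, 70]

anomalies = deviations(observed, reference)
for index, value, z in anomalies:
    print(f"observation {index} = {value} deviates (z = {z})")
```

Here the observations at positions 2 and 4 are flagged, since both lie far outside the spread of the reference values.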
17. Data Mining Methodology (or Approach)
- Top-down
- Hypothesis testing
- Validate beliefs
- Bottom-up
- Discover patterns
- Directed
- Some idea what you want to get
- Undirected
- Start from fresh
18. Some Data Mining Techniques
- Market Basket analysis
- Decision Trees
- Neural networks
- Rough sets and fuzzy logic
- Inductive logic programming
19. Commercial Developments in Data Mining: Some Early Products
- Information Discovery - IDIS
- WizSoft - WizWhy
- Hugin - Hugin
- IBM - Intelligent Miner
- Red Brick - DataMind (became part of Informix and now part of IBM)
- Neo Vista - Decision Series
- Reduct Systems - Datalogic/R
- Lockheed Martin - Recon
- Nicesoft - Nicel
- SAS - Enterprise Miner
- Recent products will be discussed in Unit 9
20. Current Status, Challenges and Directions
- Status
- Data Mining is now a technology
- Several prototypes and tools exist; many or almost all of them work on relational databases
- Challenges
- Mining large quantities of data; dealing with noise and uncertainty
- Directions
- Mining multimedia and text databases; Web mining (structure, usage and content); data mining, national security and privacy