Introduction to Data Mining

About This Presentation

Title:

Introduction to Data Mining

Description:

Peter Bajcsy, Ph.D. Research Scientist. Adjunct Assistant ... Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001 ... – PowerPoint PPT presentation

Slides: 79

Provided by: peterb80

Learn more at: http://isda.ncsa.uiuc.edu

Transcript and Presenter's Notes

Title: Introduction to Data Mining

1
Introduction to Data Mining
2
Course Overview

Introduction to Knowledge Discovery in Databases
and Data Mining
Why Data Mining? What is Data Mining? On What
Kind of Data?
Applications of Data Mining
Application Domains and Examples
Knowledge Discovery in Databases and Data Mining
Process
Processing Steps
Data Quality, Preparation, and Transformations
Data Mining Tools
D2K, SAS, Clementine, Intelligent Miner,
Insightful Miner, K-Wiz
Data Mining Methods
Association Rules
Decision Trees
Information Visualization
Summary

3
Acknowledgement

Contributions
Michael Welge, Loretta Auvil, Lisa Gatzke,
Automated Learning Group, National Center for
Supercomputing Applications (NCSA), University of
Illinois at Urbana-Champaign
Jiawei Han, Computer Science, University of
Illinois at Urbana-Champaign

4
Literature

Data Mining: Concepts and Techniques by J. Han and
M. Kamber, Morgan Kaufmann Publishers, 2001
Pattern Classification by R. Duda, P. Hart and D.
Stork, 2nd edition, John Wiley & Sons, 2001

5
Introduction to Knowledge Discovery in Databases
and Data Mining
6
Computational Knowledge Discovery
7
Terminology

Data Mining
A step in the knowledge discovery process
consisting of particular algorithms (methods)
that, under some acceptable objective, produce a
particular enumeration of patterns (models) over
the data.
Knowledge Discovery Process
The process of using data mining methods
(algorithms) to extract (identify) what is deemed
knowledge according to the specifications of
measures and thresholds, using a database along
with any necessary preprocessing or
transformations.

8
Terminology - A Working Definition

Data Mining is a decision support process in
which we search for patterns of information in
data.
Data Mining is a process of discovering
advantageous patterns in data.
A pattern is a conservative statement about a
probability distribution.
Webster: A pattern is (a) a natural or chance
configuration, (b) a reliable sample of traits,
acts, tendencies, or other observable
characteristics of a person, group, or
institution.

9
Data Mining On What Kind of Data?

Relational Databases
Data Warehouses
Transactional Databases
Advanced Database Systems
Object-Relational
Spatial and Temporal
Time-Series
Multimedia
Text
Heterogeneous, Legacy, and Distributed
WWW

Structure - 3D Anatomy
Function - 1D Signal
Metadata - Annotation
10
Data Mining: Confluence of Multiple Disciplines
20x20: 2^400 ≈ 10^120 patterns
11
Why Do We Need Data Mining ?

Data volumes are too large for classical analysis
approaches
Large number of records (10^8 to 10^12 bytes)
High dimensional data (10^2 to 10^4 attributes)

How do you explore millions of records, tens or
hundreds of fields, and find patterns?
12
Why Do We Need Data Mining ?

Leverage the organization's data assets
Only a small portion (typically 5-10%) of the
collected data is ever analyzed
Data that may never be analyzed continues to be
collected, at great expense, out of fear that
something that may prove important in the future
would otherwise be missing
Growth rates of data preclude the traditional,
manually intensive approach

13
Why Do We Need Data Mining?

As databases grow, supporting the decision
process with traditional query languages becomes
infeasible
Many queries of interest are difficult to state
in a query language (the query formulation problem)
Find all cases of fraud
Find all individuals likely to buy a Ford
Expedition
Find all documents that are similar to this
customer's problem

QUERY
(Latitude, Longitude)2
RESULT
(Latitude, Longitude)1
14
What is It?

Knowledge Discovery in Databases is the
non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data.
The understandable patterns are used to
Make predictions or classifications about new
data
Explain existing data
Summarize the contents of a large database to
support decision making
Graphical data visualization to aid humans in
discovering deeper patterns

15
Applications of Data Mining
16
Data Mining Applications

Market analysis
Risk analysis and management
Fraud detection and detection of unusual patterns
(outliers)
Text mining (news group, email, documents) and
Web mining
Stream data mining
DNA and bio-data analysis

17
Market Analysis

Where does the data come from?
Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies
Target marketing
Find clusters of model customers who share the
same characteristics: interest, income level,
spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis
Associations/co-relations between product sales,
prediction based on such association
Customer profiling
What types of customers buy what products
(clustering or classification)
Customer requirement analysis
Identify the best products for different
customers
Predict what factors will attract new customers

18
Corporate Analysis and Risk Management

Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis
(financial-ratio, trend analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and a class-based
pricing procedure
set pricing strategy in a highly competitive
market

19
Fraud Detection and Mining Unusual Patterns

Approaches: Clustering and model construction for
frauds, outlier analysis
Applications: Health care, retail, credit card
service, telecommunications
Auto insurance: ring of collisions
Money laundering: suspicious monetary
transactions
Medical insurance
Professional patients, ring of doctors, and ring
of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
Phone call model: destination of the call,
duration, time of day or week. Analyze patterns
that deviate from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is
due to dishonest employees
Anti-terrorism

20
Data Mining and Business Intelligence
21
Knowledge Discovery in Databases Process
22
KDD Process

Develop an understanding of the application
domain
Relevant prior knowledge, problem objectives,
success criteria, current solution, inventory
resources, constraints, terminology, cost and
benefits
Create target data set
Collect initial data, describe, focus on a subset
of variables, verify data quality
Data cleaning and preprocessing
Remove noise, outliers, missing fields, time
sequence information, known trends, integrate
data
Data Reduction and projection
Feature subset selection, feature construction,
discretizations, aggregations

23
KDD Process

Selection of data mining task
Classification, segmentation, deviation
detection, link analysis
Select data mining approach
Data mining to extract patterns or models
Interpretation and evaluation of patterns/models
Consolidating discovered knowledge

24
Knowledge Discovery
25
Required effort for each KDD Step

Arrows indicate the direction in which we hope
the effort will shift.

26
Data Mining Tools
27
Commercial and Research Tools

Data To Knowledge
http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/d2k/
SAS
http://www.sas.com/
Clementine
http://www.spss.com/spssbi/clementine/
Intelligent Miner
http://www-3.ibm.com/software/data/iminer/
Insightful Miner
http://www.insightful.com/products/product.asp?PID=26
K-Wiz
http://www.thinkanalytics.com/products/factsheets/Kwiz_product_brief.htm

28
Software Engineering in Data Mining

Conceptual Software Hierarchy
Operating System (Windows, Mac OS, UNIX, Linux)
Programming Language (Java)
Modules: Sequences of Programming Language
Commands
Itineraries: Linked Modules
Streamlines: Linked Itineraries
Software for
Users with Various Levels of Programming Skills
Collaborating Users

29
D2K - Software Environment for Data Mining

Visual programming system employing a scalable
framework
Robust computational infrastructure
Enable processor intensive apps, support
distributed computing
Enable data intensive apps, support
multi-processor, shared memory architectures,
thread pooling
Very low granularity, fast data flow paradigm,
integrated control flow
Reduction of development time
Increase code reuse and sharing
Expedite custom software developments
Relieve distributed computing burden
Flexible and extensible architecture
Create plug and play subsystem architectures, and
standard APIs
Rapid application development (RAD) environment
Integrated environment for models and
visualization

30
D2K Architecture

D2K Infrastructure
Defines the D2K API
D2K Modules
Computational unit written in Java that follows
the D2K API
D2K Itineraries
A group of modules that are connected to form an
application
D2K ToolKit
User interface
D2K Driven Applications
Applications that use D2K modules
D2K SL

31
Data Flow Programming Environment D2K
Tool Menu
Tool Bar
Side Tab Panes
Workspace
Jump Up Panes
32
D2K Programming and Runtime Environment
33
Streamlined Data Mining Environment D2K SL
KDD Steps
Workspace
KDD Options
Session
34
Data Mining Techniques in D2K

Discovery
Association Rules, Link Analysis, Self Organizing
Maps
Predictive Modeling
Classification: Naive Bayesian, Neural Networks,
Decision Trees
Regression: Neural Networks, Regression Trees
Deviation Detection
Visualization
Text To Knowledge (T2K)
Image To Knowledge (I2K)
----------------------
Audio, Touch, Scent and Savor To Knowledge
Knowledge To Wisdom (K2W)

35
Data Mining at Work
[Figure: example projects arranged by number of
data sources (Single, Multiple, Numerous) and by
project objective (Automation, Diagnostics,
Decision Support)]
Territorial Ratemaking
Functional Foods
Precision Farming
Transaction Management
Bio-Informatics
Heterogeneous Data Visualization
Effluent Quality Control
Web Information Retrieval, Archival and Clustering
Crime Data Analysis
Data Fusion and Visualization
Auto Loss Ratio Predictions
Target Marketing
Cost Prediction (Warranty, Insurance Claims)
Survey Study of Disability
Warranty Clustering
36
Examples of Data Mining Methods
37
Three Primary Data Mining Paradigms

Discovery
Example: Association Rules
Predictive Modeling
Classification Example: Decision Trees
Deviation Detection
Visualization

38
Association Rules and Market Basket Analysis
39
What is Market Basket Analysis?

Customer Analysis
Market Basket Analysis uses the information about
what a customer purchases to give us insight into
who they are and why they make certain purchases.
Product Analysis
Market Basket Analysis gives us insight into the
merchandise by telling us which products tend to
be purchased together and which are most amenable
to purchase.

40
Market Basket Example
Where should detergents be placed in the store to
maximize their sales?
Are window cleaning products purchased when
detergents and orange juice are bought together?
Is soda typically purchased with bananas? Does
the brand of soda make a difference?
How are the demographics of the neighborhood
affecting what customers are buying?
41
Association Rules

There has been a considerable amount of research
in the area of Market Basket Analysis. Its appeal
comes from the clarity and utility of its
results, which are expressed in the form of
association rules.
Given
A database of transactions
Each transaction contains a set of items
Find all rules X -> Y that correlate the presence
of one set of items X with another set of items Y
Example: When a customer buys bread and butter,
they buy milk 85% of the time

42
Results Useful, Trivial, or Inexplicable?

While association rules are easy to understand,
they are not always useful.
Useful: On Fridays, convenience store customers
often purchase diapers and beer together.
Trivial: Customers who purchase maintenance
agreements are very likely to purchase large
appliances.
Inexplicable: When a new Super Store opens, one
of the most commonly sold items is light bulbs.

43
How Does It Work?
Transactions
1: Orange Juice, Soda
2: Milk, Orange Juice, Window Cleaner
3: Orange Juice, Detergent
4: Orange Juice, Detergent, Soda
5: Window Cleaner, Soda
Co-Occurrence of Products
                OJ  Window Cleaner  Milk  Soda  Detergent
OJ               4               1     1     2          1
Window Cleaner   1               2     1     1          0
Milk             1               1     1     0          0
Soda             2               1     0     3          1
Detergent        1               0     0     1          2
44
How Does It Work?

The co-occurrence table contains some simple
patterns
Orange juice and soda are more likely to be
purchased together than any other two items
Detergent is never purchased with window cleaner
or milk
Milk is never purchased with soda or detergent
These simple observations are examples of
Associations and may suggest a formal rule like
If a customer purchases soda, THEN the customer
also purchases orange juice

                OJ  Window Cleaner  Milk  Soda  Detergent
OJ               4               1     1     2          1
Window Cleaner   1               2     1     1          0
Milk             1               1     1     0          0
Soda             2               1     0     3          1
Detergent        1               0     0     1          2
45
How Good Are the Rules?

In the data, two of five transactions include
both soda and orange juice. These two
transactions support the rule. The support for
the rule is two out of five, or 40%.
Soda appears in three of the five transactions,
and two of those also contain orange juice, so
the rule If soda, THEN orange juice has a
confidence of two out of three, or about 67%.
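
A minimal Python sketch of the two measures, applied to the five transactions listed on the first "How Does It Work?" slide. The function and variable names are illustrative and do not come from the presentation. Note that soda occurs in three of the listed transactions (the fifth is window cleaner and soda), so the computed confidence of If soda, THEN orange juice is 2/3.

```python
from fractions import Fraction

# The five market baskets from the "How Does It Work?" slide.
TRANSACTIONS = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in TRANSACTIONS if itemset <= basket)
    return Fraction(hits, len(TRANSACTIONS))


def confidence(condition, result):
    """Of the transactions containing `condition`, the fraction that also contain `result`."""
    return support(condition | result) / support(condition)


rule_support = support({"soda", "orange juice"})
rule_confidence = confidence({"soda"}, {"orange juice"})
print(rule_support, rule_confidence)
```

Exact fractions are used instead of floats so that the ratios can be compared directly with the counts on the slide.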

46
Confidence and Support - How Good Are the Rules

A rule must have some minimum user-specified
confidence
1 & 2 -> 3 has 90% confidence if, when a
customer bought 1 and 2, in 90% of the cases the
customer also bought 3.
A rule must have some minimum user-specified
support
1 & 2 -> 3 should hold in some minimum percentage
of transactions to have value.

47
Confidence and Support
Transaction ID   Items
1                1, 2, 3
2                1, 3
3                1, 4
4                2, 5, 6
For minimum support = 50% (2 transactions) and
minimum confidence = 50%:
Frequent One Item Set   Support
{1}                     75%
{2}                     50%
{3}                     50%
{4}                     25%
Frequent Two Item Set   Support
{1, 2}                  25%
{1, 3}                  50%
{1, 4}                  25%
{2, 3}                  25%
For the rule 1 -> 3:
Support = Support({1, 3}) = 50%
Confidence(1 -> 3) = Support({1, 3}) / Support({1}) = 66%
Confidence(3 -> 1) = Support({1, 3}) / Support({3}) = 100%
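
The thresholds on this slide can be applied mechanically. The sketch below, in Python, enumerates the two-item sets of the four transactions above, keeps those that reach the 50% minimum support, and then keeps the rules that reach the 50% minimum confidence. The names `MIN_SUPPORT`, `MIN_CONFIDENCE`, `support` and `rules` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Transaction ID -> items, as listed on the "Confidence and Support" slide.
TRANSACTIONS = {1: {1, 2, 3}, 2: {1, 3}, 3: {1, 4}, 4: {2, 5, 6}}
MIN_SUPPORT = Fraction(1, 2)     # 50%, i.e. 2 of 4 transactions
MIN_CONFIDENCE = Fraction(1, 2)  # 50%


def support(itemset):
    """Fraction of transactions containing all items of `itemset`."""
    hits = sum(1 for items in TRANSACTIONS.values() if set(itemset) <= items)
    return Fraction(hits, len(TRANSACTIONS))


all_items = sorted(set().union(*TRANSACTIONS.values()))
frequent_pairs = [pair for pair in combinations(all_items, 2)
                  if support(pair) >= MIN_SUPPORT]

# For each frequent pair {a, b}, test both directions a -> b and b -> a.
rules = {}
for a, b in frequent_pairs:
    for condition, result in ((a, b), (b, a)):
        conf = support((condition, result)) / support((condition,))
        if conf >= MIN_CONFIDENCE:
            rules[(condition, result)] = conf

for (condition, result), conf in rules.items():
    print(f"{condition} -> {result}: confidence {conf}")
```

Only the pair {1, 3} survives the support threshold, and both of its rules pass the confidence threshold, which matches the figures worked out on the slide.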
48
Association Examples

Find all rules that have Diet Coke as a result.
These rules may help plan what the store should
do to boost the sales of Diet Coke.
Find all rules that have Yogurt in the
condition. These rules may help determine what
products may be impacted if the store
discontinues selling Yogurt.
Find all rules that have Brats in the condition
and mustard in the result. These rules may help
in determining the additional items that have to
be sold together to make it highly likely that
mustard will also be sold.
Find the best k rules that have Yogurt in the
result.

49
The Basic Process

Choosing the right set of items
Taxonomies
Generation of rules
If condition Then result
Negation
Overcoming the practical limits imposed by
thousands or tens of thousands of products
Minimum Support Pruning

50
Choosing the Right Set of Items
Frozen Foods
General
Frozen Desserts
Frozen Vegetables
Frozen Dinners
Partial Product Taxonomy
Frozen Yogurt
Frozen Fruit Bars
Ice Cream
Peas
Carrots
Mixed
Other
Rocky Road
Cherry Garcia
Specific
Chocolate
Strawberry
Vanilla
Other
51
Example - Minimum Support Pruning / Rule
Generation
Scan Database, Find Level of Support
Transaction ID   Items
1                1, 3, 4
2                2, 3, 5
3                1, 2, 3, 5
4                2, 5
Itemset   Support
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3
After pruning:
Itemset   Support
{2}       3
{3}       3
{5}       3
Scan Database, Find Pairings, Find Level of Support
Itemset   Support
{2, 3}    2
{2, 5}    3
{3, 5}    2
After pruning:
Itemset   Support
{2, 5}    3
Two rules with the highest support for the
two-item set: 2 -> 5 and 5 -> 2
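
The level-by-level pruning in this example can be sketched in Python as follows. This is an Apriori-style illustration, not code from the presentation; the minimum support count of 3 is inferred from the example (items with support 1 and 2 are dropped, items with support 3 are kept), and the names `MIN_SUPPORT_COUNT`, `support_count` and `frequent_itemsets` are hypothetical.

```python
from itertools import combinations

# The four transactions from the pruning example.
TRANSACTIONS = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUPPORT_COUNT = 3  # assumption inferred from the example


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def frequent_itemsets():
    """Return {itemset: support count} for every itemset meeting the threshold."""
    items = sorted({i for t in TRANSACTIONS for i in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i])) >= MIN_SUPPORT_COUNT]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support_count(itemset)
        # Pair up surviving k-itemsets to form candidate (k+1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1}
        # Keep a candidate only if all its k-subsets survived and it is frequent itself.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and support_count(c) >= MIN_SUPPORT_COUNT]
        k += 1
    return frequent


result = frequent_itemsets()
for itemset, count in result.items():
    print(sorted(itemset), count)
```

Because items 1 and 4 are pruned in the first pass, no pair containing them is ever counted, which is how minimum support pruning limits the number of candidates.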
52
Other Association Rule Applications

Quantitative Association Rules
Age = 35..40 and Married = Yes -> NumCars = 2
Association Rules with Constraints
Find all association rules where the prices of
items are > 100 dollars
Temporal Association Rules
Diaper -> Beer (1% support, 80% confidence)
Diaper -> Beer (20% support) 7:00-9:00 PM weekdays
Optimized Association Rules
Given a rule (l < A < u) and X -> Y, find values
for l and u such that the support is greater than
a certain threshold and that maximize support and
confidence.
Check Balance = 30,000..50,000 -> Certificate of
Deposit (CD) = Yes

53
Strengths of Market Basket Analysis

It produces easy to understand results
It supports undirected data mining
It works on variable length data
Rules are relatively easy to compute

54
Weaknesses of Market Basket Analysis

Its computational cost grows exponentially with
the number of items
It is difficult to determine the optimal number
of items
It discounts rare items
It provides only limited support for attributes
of the data

55
Decision Tree Learning
56
Example Supervised Learning with Decision Trees
57
Decision Tree Learning

Start with data at the root node
Select an attribute and form a logical test on
that attribute
Branch on each outcome of the test, and move the
subset of examples satisfying that outcome to the
corresponding child node
Recurse on each child node
A termination rule specifies when to declare a
node a leaf
Note: this is a one-step look-ahead,
non-backtracking search through the space of all
decision trees
Critical Steps
Formulation of good logical tests
Selection measure for attributes
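
The steps above can be sketched as a short recursive procedure in Python. This is an illustration under stated assumptions: the selection measure is the number of misclassified examples after a split (the error-based measure used later in this presentation; information gain is a common alternative), the termination rule stops at pure nodes or when no attribute is left, and all function names are hypothetical. The data are the five Sunny-outlook rows (days 1, 2, 8, 9, 11) of the PlayTennis training examples.

```python
from collections import Counter

ROWS = [
    {"Temperature": "Hot", "Humidity": "High", "Wind": "Light"},
    {"Temperature": "Hot", "Humidity": "High", "Wind": "Strong"},
    {"Temperature": "Mild", "Humidity": "High", "Wind": "Light"},
    {"Temperature": "Cool", "Humidity": "Normal", "Wind": "Light"},
    {"Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong"},
]
LABELS = ["No", "No", "No", "Yes", "Yes"]


def errors(labels):
    """Examples misclassified when a node predicts its majority label."""
    return len(labels) - Counter(labels).most_common(1)[0][1]


def partition(rows, labels, attribute):
    """Group rows and labels by the value each row has for `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        group_rows, group_labels = groups.setdefault(row[attribute], ([], []))
        group_rows.append(row)
        group_labels.append(label)
    return groups


def build_tree(rows, labels, attributes):
    # Termination rule (assumed): pure node, or no attribute left to test.
    if errors(labels) == 0 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # One-step look-ahead: choose the test that leaves the fewest errors.
    def errors_after(attribute):
        return sum(errors(ls) for _, ls in partition(rows, labels, attribute).values())

    best = min(attributes, key=errors_after)
    remaining = [a for a in attributes if a != best]
    # Branch on each outcome and recurse on each child, without backtracking.
    return (best, {value: build_tree(rs, ls, remaining)
                   for value, (rs, ls) in partition(rows, labels, best).items()})


def classify(tree, row):
    """Follow the tests from the root until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children[row[attribute]]
    return tree


tree = build_tree(ROWS, LABELS, ["Temperature", "Humidity", "Wind"])
print(tree)
```

Integer error counts are used rather than error rates so that ties between attributes are resolved exactly, by list order, instead of by floating-point rounding.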

58
Decision Trees

Classifiers
Instances (unlabeled examples) represented as
attribute (feature) vectors
Internal Nodes: Tests for Attribute Values
Typical: equality test (e.g., Wind = ?)
Inequality, other tests possible
Branches: Attribute Values
One-to-one correspondence (e.g., Wind = Strong,
Wind = Light)
Leaves: Assigned Classifications (Class Labels)

59
Decision Tree for Concept PlayTennis
Outlook?
  Sunny: Humidity?
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind?
    Strong: No
    Light: Yes
60
Decision Trees and Decision Boundaries
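How to Visualize Decision Trees? Example:
Dividing Instance Space into Axis-Parallel
Rectangles
[Figure: instance space with axes x (ticks at 1
and 3) and y (ticks at 5 and 7), divided into
axis-parallel rectangles]
More than two variables?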
61
An Illustrative Example
Training Examples for Concept PlayTennis
Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No
62
Constructing a Decision Tree for PlayTennis
The Initial Decision Tree with One Leaf
[9+, 5-]
E(D) = min(9/14, 5/14) = 5/14 ≈ 36%
Question: What attribute A and what value of A
should we split on?

Goal: maximize error reduction E, where the error
reduction relative to attribute A is the expected
reduction in error due to splitting on A

63
Constructing a Decision Tree for PlayTennis
Potential Splits of Root Node
Root: [9+, 5-]
Temperature: Cool [3+, 1-], Hot [2+, 2-], Mild [4+, 2-]
Outlook: Sunny [2+, 3-], Rain [3+, 2-], Overcast [4+, 0-]
Humidity: High [3+, 4-], Normal [6+, 1-]
Wind: Light [6+, 2-], Strong [3+, 3-]
E(Split/Outlook) = (5/14) - ((5/14)(min(2/5,3/5)) + (4/14)(min(4/4,0/4)) + (5/14)(min(3/5,2/5))) ≈ 7%
E(Split/Temperature) = (5/14) - ((4/14)(min(3/4,1/4)) + (6/14)(min(4/6,2/6)) + (4/14)(min(2/4,2/4))) = 0
E(Split/Humidity) = (5/14) - ((7/14)(min(3/7,4/7)) + (7/14)(min(6/7,1/7))) ≈ 7%
E(Split/Wind) = (5/14) - ((8/14)(min(6/8,2/8)) + (6/14)(min(3/6,3/6))) = 0
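
The four error reductions can be checked with a few lines of Python. The sketch below takes the positive and negative counts of each branch from the "Potential Splits of Root Node" slide and evaluates the same formula with exact fractions; the names `SPLITS`, `node_error` and `error_reduction` are illustrative.

```python
from fractions import Fraction

ROOT = (9, 5)  # 9 positive and 5 negative examples at the root

# (positive, negative) counts for each branch of each candidate split.
SPLITS = {
    "Outlook": [(2, 3), (3, 2), (4, 0)],      # Sunny, Rain, Overcast
    "Temperature": [(3, 1), (2, 2), (4, 2)],  # Cool, Hot, Mild
    "Humidity": [(3, 4), (6, 1)],             # High, Normal
    "Wind": [(6, 2), (3, 3)],                 # Light, Strong
}


def node_error(pos, neg):
    """Error of a node that predicts its majority class: min(p, n) / (p + n)."""
    return Fraction(min(pos, neg), pos + neg)


def error_reduction(root, branches):
    """Root error minus the size-weighted error of the child nodes."""
    total = sum(root)
    after = sum(Fraction(p + n, total) * node_error(p, n) for p, n in branches)
    return node_error(*root) - after


reductions = {attr: error_reduction(ROOT, branches)
              for attr, branches in SPLITS.items()}
for attr, gain in reductions.items():
    print(f"{attr}: {gain} ({float(gain):.0%})")
```

Outlook and Humidity tie at 1/14 while Temperature and Wind give no reduction, so with this measure the choice between Outlook and Humidity at the root depends on how ties are broken.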
64
Constructing a Decision Tree for PlayTennis

Top-Down Induction
For discrete-valued attributes, terminates in
O(n) splits
Makes at most one pass through data set at each
level (why?)

Outlook?
1,2,3,4,5,6,7,8,9,10,11,12,13,14 [9+, 5-]
Humidity?
Wind?
Yes
Yes
No
Yes
No
65
Strengths Of Decision Trees

Decision trees are able to generate
understandable results
Decision trees perform classification without
requiring much computation
Decision trees can handle both continuous and
categorical variables
Decision trees provide a clear indication of
which attributes are most important for
prediction or classification

66
Weaknesses of Decision Trees

Error-prone with too many classes
Quick partitioning of data results in fast
deterioration in attribute selection quality
Trouble with non-rectangular regions

67
Visualization
68
Visualization Example: Naïve Bayesian
Three Flower Types: Petal and Sepal Based
Classification
69
Naïve Bayesian Visualization

The right hand pane shows the distribution of the
classes.
The left hand pane shows the attributes and each
of their values. They are listed by order of
significance.
The message box shows details about each pie
chart when brushed.
Clicking on a pie chart shows how knowing this
information can change the overall class
prediction.
Clicking on multiple pie charts calculates
conditional probabilities.
Zoom in and out using the right mouse button.

Notice: Iris-versicolor has a 33% likelihood
70
Rule Association Visualization

Read rules down the column
Example - the rule in the column labeled as 2 is:
if petal-width = Binned(, 2.) then
flower-type = Iris-setosa
Support: 25%
Confidence: 100%

71
Discovery Using Rule Association

What services are purchased together?
What products or transactions are executed by
customers on a single visit to your website?
What are the relationships in the data?

72
Parallel Coordinates - Visualization

Each vertical line represents a field with the
minimum and maximum values represented at bottom
and top.
Each record is drawn as a line that connects
its values at each field
Lines are colored based on the output field
Clicking on the label boxes allows the lines to
be rearranged
Zooming is accomplished by dragging a box over
the desired area. Clicking returns to the
original view.

73
Scatterplots - Visualization
74
Image To Knowledge (I2K) Data Visualization

Hyperspectral image with 120 bands

75
Image To Knowledge (I2K) Visualization of Results

Classification Results
Class labels per pixel
Class labels per geographical entity
Class labels of aggregations
Alignment Results
Overlays
Summary Charts
Image Operations
Enhancements
Image Restoration
Filtering

76
T2K - Text to Knowledge Topic Evolution

Any chronologically ordered text
News feeds
Email

77
Protein Consumption Dynamics

Objective
To understand, through database visualization,
global protein consumption patterns by providing
a means to directly compare historical and
simulated data.
Presented at the Global Soy Forum - 1999

78
Data Comparison, Reduction and Synthesis

Goal
Development of a 3D visualization tool for
multi-channel on-board sensor data. This tool
allows for multiple time series comparison,
reduction and synthesis.
Related Projects
Derivative Monitoring
Real-time System Monitoring

79
Summary