Title: Introduction to Data Mining
1Introduction to Data Mining
2Course Overview
- Introduction to Knowledge Discovery in Databases and Data Mining
  - Why Data Mining? What is Data Mining? On What Kind of Data?
- Applications of Data Mining
  - Application Domains and Examples
- Knowledge Discovery in Databases and Data Mining Process
  - Processing Steps
  - Data Quality, Preparation, and Transformations
- Data Mining Tools
  - D2K, SAS, Clementine, Intelligent Miner, Insightful Miner, K-Wiz
- Data Mining Methods
  - Association Rules
  - Decision Trees
  - Information Visualization
- Summary
3Acknowledgement
- Contributions
  - Michael Welge, Loretta Auvil, Lisa Gatzke, Automated Learning Group, National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign
  - Jiawei Han, Computer Science, University of Illinois at Urbana-Champaign
4Literature
- Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann Publishers, 2001
- Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001
5Introduction to Knowledge Discovery in Databases
and Data Mining
6Computational Knowledge Discovery
7Terminology
- Data Mining
  - A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
- Knowledge Discovery Process
  - The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
8Terminology - A Working Definition
- Data Mining is a decision support process in which we search for patterns of information in data.
- Data Mining is a process of discovering advantageous patterns in data.
- A pattern is a conservative statement about a probability distribution.
- Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution.
9Data Mining On What Kind of Data?
- Relational Databases
- Data Warehouses
- Transactional Databases
- Advanced Database Systems
- Object-Relational
- Spatial and Temporal
- Time-Series
- Multimedia
- Text
- Heterogeneous, Legacy, and Distributed
- WWW
[Figure: example data types: Structure (3D Anatomy), Function (1D Signal), Metadata (Annotation)]
10Data Mining Confluence of Multiple Disciplines
20x20 grid: 2^400 ≈ 10^120 possible patterns
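The size of that pattern space can be checked with exact integer arithmetic. The sketch below assumes the figure counts on/off patterns on a 20x20 grid, so each of the 400 cells is binary; that reading is an assumption, not something the slide states.

```python
# Number of distinct patterns on a 20x20 grid, assuming each cell is binary.
cells = 20 * 20
patterns = 2 ** cells           # Python integers are exact at this size
digits = len(str(patterns))     # number of decimal digits in 2^400

print(f"2^{cells} has {digits} decimal digits, i.e. it is on the order of 10^{digits - 1}")
```

Exhaustive search over a space this large is out of reach, which is why pattern discovery relies on pruning and heuristics.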
11Why Do We Need Data Mining ?
- Data volumes are too large for classical analysis approaches
  - Large number of records (10^8 to 10^12 bytes)
  - High dimensional data (10^2 to 10^4 attributes)
How do you explore millions of records, tens or hundreds of fields, and find patterns?
12Why Do We Need Data Mining ?
- Leverage the organization's data assets
  - Only a small portion (typically 5-10%) of the collected data is ever analyzed
  - Data that may never be analyzed continues to be collected, at great expense, out of fear that something which may prove important in the future is missing.
  - Growth rates of data preclude the traditional, manually intensive approach
13Why Do We Need Data Mining?
- As databases grow, supporting the decision support process with traditional query languages becomes infeasible
  - Many queries of interest are difficult to state in a query language (query formulation problem)
    - find all cases of fraud
    - find all individuals likely to buy a Ford Expedition
    - find all documents that are similar to this customer's problem
[Figure: QUERY at (Latitude, Longitude)_2, RESULT at (Latitude, Longitude)_1]
14What is It?
- Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
- The understandable patterns are used to
  - Make predictions or classifications about new data
  - Explain existing data
  - Summarize the contents of a large database to support decision making
- Graphical data visualization to aid humans in discovering deeper patterns
15Applications of Data Mining
16Data Mining Applications
- Market analysis
- Risk analysis and management
- Fraud detection and detection of unusual patterns (outliers)
- Text mining (news group, email, documents) and Web mining
- Stream data mining
- DNA and bio-data analysis
17Market Analysis
- Where does the data come from?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
- Cross-market analysis
  - Associations/co-relations between product sales, and prediction based on such associations
- Customer profiling
  - What types of customers buy what products (clustering or classification)
- Customer requirement analysis
  - Identify the best products for different customers
  - Predict what factors will attract new customers
18Corporate Analysis & Risk Management
- Finance planning and asset evaluation
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning
  - summarize and compare the resources and spending
- Competition
  - monitor competitors and market directions
  - group customers into classes and set a class-based pricing procedure
  - set pricing strategy in a highly competitive market
19Fraud Detection & Mining Unusual Patterns
- Approaches: clustering and model construction for frauds, outlier analysis
- Applications: health care, retail, credit card service, telecomm.
  - Auto insurance: ring of collisions
  - Money laundering: suspicious monetary transactions
  - Medical insurance
    - Professional patients, ring of doctors, and ring of references
    - Unnecessary or correlated screening tests
  - Telecommunications: phone-call fraud
    - Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  - Retail industry
    - Analysts estimate that 38% of retail shrink is due to dishonest employees
  - Anti-terrorism
20Data Mining and Business Intelligence
21Knowledge Discovery in Databases Process
22KDD Process
- Develop an understanding of the application domain
  - Relevant prior knowledge, problem objectives, success criteria, current solution, inventory of resources, constraints, terminology, costs and benefits
- Create target data set
  - Collect initial data, describe it, focus on a subset of variables, verify data quality
- Data cleaning and preprocessing
  - Remove noise and outliers; handle missing fields, time sequence information, and known trends; integrate data
- Data reduction and projection
  - Feature subset selection, feature construction, discretizations, aggregations
23KDD Process
- Selection of data mining task
  - Classification, segmentation, deviation detection, link analysis
- Select data mining approach
- Data mining to extract patterns or models
- Interpretation and evaluation of patterns/models
- Consolidating discovered knowledge
24Knowledge Discovery
25Required effort for each KDD Step
- Arrows indicate the direction we hope the effort
should go.
26Data Mining Tools
27Commercial and Research Tools
- Data To Knowledge (D2K)
  - http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/d2k/
- SAS
  - http://www.sas.com/
- Clementine
  - http://www.spss.com/spssbi/clementine/
- Intelligent Miner
  - http://www-3.ibm.com/software/data/iminer/
- Insightful Miner
  - http://www.insightful.com/products/product.asp?PID=26
- K-Wiz
  - http://www.thinkanalytics.com/products/factsheets/Kwiz_product_brief.htm
28Software Engineering in Data Mining
- Conceptual Software Hierarchy
  - Operating System (Windows, Mac OS, UNIX, Linux)
  - Programming Language (Java)
  - Modules: Sequences of Programming Language Commands
  - Itineraries: Linked Modules
  - Streamlines: Linked Itineraries
- Software for
  - Users with Various Levels of Programming Skills
  - Collaborating Users
29D2K - Software Environment for Data Mining
- Visual programming system employing a scalable framework
- Robust computational infrastructure
  - Enable processor intensive apps, support distributed computing
  - Enable data intensive apps, support multi-processor, shared memory architectures, thread pooling
  - Very low granularity, fast data flow paradigm, integrated control flow
- Reduction of development time
  - Increase code reuse and sharing
  - Expedite custom software developments
  - Relieve distributed computing burden
- Flexible and extensible architecture
  - Create plug and play subsystem architectures, and standard APIs
- Rapid application development (RAD) environment
- Integrated environment for models and visualization
30D2K Architecture
- D2K Infrastructure
  - Defines the D2K API
- D2K Modules
  - Computational units written in Java that follow the D2K API
- D2K Itineraries
  - A group of modules that are connected to form an application
- D2K ToolKit
  - User interface
- D2K Driven Applications
  - Applications that use D2K modules
- D2K SL
31Data Flow Programming Environment D2K
Tool Menu
Tool Bar
Side Tab Panes
Workspace
Jump Up Panes
32D2K Programming and Runtime Environment
33Streamlined Data Mining Environment D2K SL
KDD Steps
Workspace
KDD Options
Session
34Data Mining Techniques in D2K
- Discovery
  - Association Rules, Link Analysis, Self Organizing Maps
- Predictive Modeling
  - Classification: Naive Bayesian, Neural Networks, Decision Trees
  - Regression: Neural Networks, Regression Trees
- Deviation Detection
- Visualization
- Text To Knowledge (T2K)
- Image To Knowledge (I2K)
- ----------------------
- Audio, Touch, Scent and Savor To Knowledge
- Knowledge To Wisdom (K2W)
35Data Mining at Work
[Chart: example projects plotted by number of Data Sources (Single, Multiple, Numerous) against Project Objectives (Automation, Diagnostics, Decision Support)]
- Territorial Ratemaking
- Functional Foods
- Precision Farming
- Transaction Management
- Bio-Informatics
- Heterogeneous Data Visualization
- Effluent Quality Control
- Web Information Retrieval, Archival and Clustering
- Crime Data Analysis
- Data Fusion and Visualization
- Auto Loss Ratio Predictions
- Target Marketing
- Cost Prediction (Warranty, Insurance Claims)
- Survey Study of Disability
- Warranty Clustering
36Examples of Data Mining Methods
37Three Primary Data Mining Paradigms
- Discovery
  - Example: Association Rules
- Predictive Modeling
  - Classification Example: Decision Trees
- Deviation Detection
  - Visualization
38Association Rules and Market Basket Analysis
39What is Market Basket Analysis?
- Customer Analysis
  - Market Basket Analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases.
- Product Analysis
  - Market Basket Analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase.
40Market Basket Example
- Where should detergents be placed in the store to maximize their sales?
- Are window cleaning products purchased when detergents and orange juice are bought together?
- Is soda typically purchased with bananas? Does the brand of soda make a difference?
- How are the demographics of the neighborhood affecting what customers are buying?
41Association Rules
- There has been a considerable amount of research in the area of Market Basket Analysis. Its appeal comes from the clarity and utility of its results, which are expressed in the form of association rules.
- Given
  - A database of transactions
  - Each transaction contains a set of items
- Find all rules X -> Y that correlate the presence of one set of items X with another set of items Y
  - Example: When a customer buys bread and butter, they buy milk 85% of the time
42Results Useful, Trivial, or Inexplicable?
- While association rules are easy to understand, they are not always useful.
  - Useful: On Fridays, convenience store customers often purchase diapers and beer together.
  - Trivial: Customers who purchase maintenance agreements are very likely to purchase large appliances.
  - Inexplicable: When a new Super Store opens, one of the most commonly sold items is light bulbs.
43How Does It Work?
Transactions
Transaction 1: Orange juice, soda
Transaction 2: Milk, orange juice, window cleaner
Transaction 3: Orange juice, detergent
Transaction 4: Orange juice, detergent, soda
Transaction 5: Window cleaner, soda

Co-Occurrence of Products
                 OJ   Window Cleaner   Milk   Soda   Detergent
OJ                4         1            1      2        2
Window Cleaner    1         2            1      1        0
Milk              1         1            1      0        0
Soda              2         1            0      3        1
Detergent         2         0            0      1        2
44How Does It Work?
- The co-occurrence table contains some simple patterns
  - Orange juice is purchased with soda, and with detergent, more often than any other pair of items (twice each)
  - Detergent is never purchased with window cleaner or milk
  - Milk is never purchased with soda or detergent
- These simple observations are examples of associations and may suggest a formal rule like
  - IF a customer purchases soda, THEN the customer also purchases orange juice
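The co-occurrence counts can be recomputed directly from the five transactions listed above. The following is a minimal sketch, not the tool used in the course; the item abbreviations and the function name `co_occurrence` are chosen here for illustration.

```python
from collections import Counter
from itertools import combinations_with_replacement

# The five market baskets from the example (item names abbreviated).
baskets = [
    {"OJ", "Soda"},
    {"Milk", "OJ", "Window Cleaner"},
    {"OJ", "Detergent"},
    {"OJ", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

def co_occurrence(baskets):
    """Count, for every ordered pair (a, b), the baskets containing both.
    The diagonal entry (a, a) is the number of baskets containing a."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations_with_replacement(sorted(basket), 2):
            counts[(a, b)] += 1
            if a != b:
                counts[(b, a)] += 1   # keep the table symmetric
    return counts

counts = co_occurrence(baskets)
items = ["OJ", "Window Cleaner", "Milk", "Soda", "Detergent"]
for row in items:
    print(f"{row:15}", *(counts[(row, col)] for col in items))
```

Pairs that never occur together simply return 0, because a `Counter` yields zero for missing keys.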
45How Good Are the Rules?
- In the data, two of five transactions include both soda and orange juice. These two transactions support the rule. The support for the rule is two out of five, or 40%.
- Soda appears in three transactions, and two of them also contain orange juice, so the rule IF soda, THEN orange juice has a confidence of two out of three, or about 67%. The one exception is the transaction containing only window cleaner and soda.
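Support and confidence for this rule can be computed from the five example transactions. This is a sketch under the usual definitions (support is the fraction of transactions containing all items of the rule; confidence is that support divided by the support of the condition); the helper names `support` and `confidence` are illustrative.

```python
from fractions import Fraction

# The five market baskets from the example (item names abbreviated).
baskets = [
    {"OJ", "Soda"},
    {"Milk", "OJ", "Window Cleaner"},
    {"OJ", "Detergent"},
    {"OJ", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= basket for basket in baskets), len(baskets))

def confidence(condition, result, baskets):
    """Support of condition and result together, divided by support of condition."""
    both = set(condition) | set(result)
    return support(both, baskets) / support(condition, baskets)

print("support(soda, OJ)      =", support({"Soda", "OJ"}, baskets))
print("confidence(soda -> OJ) =", confidence({"Soda"}, {"OJ"}, baskets))
print("confidence(OJ -> soda) =", confidence({"OJ"}, {"Soda"}, baskets))
```

Exact fractions are used so that the results can be read as "two out of five" or "two out of three" without rounding.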
46Confidence and Support - How Good Are the Rules
- A rule must have some minimum user-specified confidence
  - 1 & 2 -> 3 has a 90% confidence if, when a customer bought 1 and 2, the customer also bought 3 in 90% of the cases.
- A rule must have some minimum user-specified support
  - 1 & 2 -> 3 should hold in some minimum percentage of transactions to have value.
47Confidence and Support
Transaction ID   Items
1                1, 2, 3
2                1, 3
3                1, 4
4                2, 5, 6

For minimum support = 50% (2 transactions) and minimum confidence = 50%:

Frequent One Item Set   Support
{1}                     75%
{2}                     50%
{3}                     50%
{4}                     25%

Frequent Two Item Set   Support
{1, 2}                  25%
{1, 3}                  50%
{1, 4}                  25%
{2, 3}                  25%

For the rule 1 -> 3:
  Support = Support({1, 3}) = 50%
  Confidence(1 -> 3) = Support({1, 3}) / Support({1}) = 50% / 75% ≈ 66.7%
  Confidence(3 -> 1) = Support({1, 3}) / Support({3}) = 50% / 50% = 100%
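The thresholds on this slide can be applied mechanically: keep the two-item sets that reach minimum support, then keep each direction of the rule that reaches minimum confidence. The sketch below does that for the four transactions shown; the variable and function names are illustrative, and only one-item conditions are considered.

```python
from fractions import Fraction
from itertools import combinations

transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
MIN_SUPPORT = Fraction(1, 2)      # 50%, i.e. 2 of 4 transactions
MIN_CONFIDENCE = Fraction(1, 2)   # 50%

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))

items = sorted(set().union(*transactions))
rules = []
for pair in combinations(items, 2):
    pair_support = support(set(pair))
    if pair_support < MIN_SUPPORT:
        continue                              # minimum support pruning
    for lhs, rhs in (pair, pair[::-1]):       # try both rule directions
        conf = pair_support / support({lhs})
        if conf >= MIN_CONFIDENCE:
            rules.append((lhs, rhs, pair_support, conf))

for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}: support {float(s):.0%}, confidence {float(c):.1%}")
```

Only the pair {1, 3} survives the support threshold, so the two rules reported are 1 -> 3 and 3 -> 1.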
48Association Examples
- Find all rules that have Diet Coke as a result. These rules may help plan what the store should do to boost the sales of Diet Coke.
- Find all rules that have Yogurt in the condition. These rules may help determine what products may be impacted if the store discontinues selling Yogurt.
- Find all rules that have Brats in the condition and mustard in the result. These rules may help in determining the additional items that have to be sold together to make it highly likely that mustard will also be sold.
- Find the best k rules that have Yogurt in the result.
49The Basic Process
- Choosing the right set of items
  - Taxonomies
- Generation of rules
  - If condition Then result
  - Negation
- Overcoming the practical limits imposed by thousands or tens of thousands of products
  - Minimum Support Pruning
50Choosing the Right Set of Items
Partial Product Taxonomy (from General to Specific)
- Frozen Foods
  - Frozen Desserts: Frozen Yogurt, Frozen Fruit Bars, Ice Cream
  - Frozen Vegetables: Peas, Carrots, Mixed, Other
  - Frozen Dinners
- Most specific level: Rocky Road, Cherry Garcia, Chocolate, Strawberry, Vanilla, Other
51Example - Minimum Support Pruning / Rule
Generation
Transaction ID   Items
1                1, 3, 4
2                2, 3, 5
3                1, 2, 3, 5
4                2, 5

Pass 1: Scan Database, Find Level of Support
Itemset   Support
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

After pruning (itemsets with support below 3 are dropped)
Itemset   Support
{2}       3
{3}       3
{5}       3

Pass 2: Find Pairings, Scan Database, Find Level of Support
Itemset   Support
{2, 3}    2
{2, 5}    3
{3, 5}    2

After pruning
Itemset   Support
{2, 5}    3

The two rules with the highest support for the two-item set: 2 -> 5 and 5 -> 2
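The scan, pair, and prune steps of this example can be sketched as a small level-wise search in the style of the Apriori algorithm. The minimum support count of 3 is an assumption inferred from which itemsets survive in the example; the function names are illustrative.

```python
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_COUNT = 3   # assumed threshold, consistent with the itemsets kept in the example

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def frequent_itemsets(min_count):
    """Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if count({i}) >= min_count]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = count(itemset)
        k += 1
        # Candidate k-itemsets: unions of two frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Keep a candidate only if all its (k-1)-subsets are frequent
        # and it reaches the minimum support itself.
        level = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and count(c) >= min_count
        ]
    return frequent

frequent = frequent_itemsets(MIN_COUNT)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Pruning at each level is what keeps the search practical: any itemset containing an infrequent subset can be discarded without scanning the database for it.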
52Other Association Rule Applications
- Quantitative Association Rules
  - Age = 35..40 and Married = Yes -> NumCars = 2
- Association Rules with Constraints
  - Find all association rules where the prices of items are > 100 dollars
- Temporal Association Rules
  - Diaper -> Beer (1% support, 80% confidence)
  - Diaper -> Beer (20% support) 7:00-9:00 PM weekdays
- Optimized Association Rules
  - Given a rule (l < A < u) and X -> Y, find values for l and u such that the support is greater than a certain threshold and the support and confidence are maximized.
  - Check Balance = 30,000 .. 50,000 -> Certificate of Deposit (CD) = Yes
53Strengths of Market Basket Analysis
- It produces easy to understand results
- It supports undirected data mining
- It works on variable length data
- Rules are relatively easy to compute
54Weaknesses of Market Basket Analysis
- It is an exponential-growth algorithm: computation grows exponentially with the number of items
- It is difficult to determine the optimal number of items
- It discounts rare items
- It is limited in the support that it provides for attributes
55Decision Tree Learning
56Example Supervised Learning with Decision Trees
57Decision Tree Learning
- Start with data at the root node
- Select an attribute and form a logical test on that attribute
- Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node
- Recurse on each child node
- A termination rule specifies when to declare a node a leaf node
- Note: this is a one-step look-ahead, non-backtracking search through the space of all decision trees
- Critical Steps
  - Formulation of good logical tests
  - Selection measure for attributes
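The steps above can be sketched as a short recursive procedure. This is a minimal illustration, not the course's implementation: it assumes discrete attributes, uses misclassification error as the selection measure, and stops when a node is pure or no attributes remain. The function names and the four toy examples are made up for the demonstration.

```python
from collections import Counter

def majority(labels):
    """Most common label among the examples at a node."""
    return Counter(labels).most_common(1)[0][0]

def error(labels):
    """Fraction of examples misclassified by predicting the majority label."""
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)

def split_error(examples, attr):
    """Weighted error after branching on every value of attr."""
    groups = {}
    for features, label in examples:
        groups.setdefault(features[attr], []).append(label)
    return sum(len(g) / len(examples) * error(g) for g in groups.values())

def build_tree(examples, attrs):
    labels = [label for _, label in examples]
    # Termination rule (assumed): pure node, or no attributes left to test.
    if error(labels) == 0 or not attrs:
        return majority(labels)
    # One-step look-ahead: choose the attribute with the lowest error after the split.
    best = min(attrs, key=lambda a: split_error(examples, a))
    branches = {}
    for features, label in examples:
        branches.setdefault(features[best], []).append((features, label))
    rest = [a for a in attrs if a != best]
    # Recurse on each child node; there is no backtracking.
    return (best, {value: build_tree(subset, rest) for value, subset in branches.items()})

# Hypothetical toy data: the label depends only on Wind.
examples = [
    ({"Humidity": "High", "Wind": "Strong"}, "No"),
    ({"Humidity": "High", "Wind": "Light"}, "Yes"),
    ({"Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Humidity": "Normal", "Wind": "Light"}, "Yes"),
]
tree = build_tree(examples, ["Humidity", "Wind"])
print(tree)
```

Other selection measures, such as information gain, plug into the same skeleton by replacing `split_error`.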
58Decision Trees
- Classifiers
  - Instances (unlabeled examples) represented as attribute (feature) vectors
- Internal Nodes: Tests for Attribute Values
  - Typical: equality test (e.g., Wind = ?)
  - Inequality, other tests possible
- Branches: Attribute Values
  - One-to-one correspondence (e.g., Wind = Strong, Wind = Light)
- Leaves: Assigned Classifications (Class Labels)
59Decision Tree for Concept PlayTennis
Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Light  -> Yes
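Classifying an instance with this tree means following one branch per test until a leaf is reached. The sketch below encodes the PlayTennis tree as nested tuples, reading the figure as Outlook at the root, Humidity under Sunny, and Wind under Rain; the representation and the function name `classify` are illustrative choices.

```python
# Internal node: (attribute, {attribute value: subtree}); leaf: class label string.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Light": "Yes"}),
})

def classify(tree, instance):
    """Walk from the root to a leaf, following the branch that matches the instance."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

print(classify(TREE, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Light"}))
print(classify(TREE, {"Outlook": "Rain", "Humidity": "High", "Wind": "Light"}))
```

Only the attributes along the chosen path are consulted, so an Overcast instance is classified without looking at Humidity or Wind.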
60Decision Trees and Decision Boundaries
How to Visualize Decision Trees? Example: Dividing Instance Space into Axis-Parallel Rectangles
[Figure: the x-y plane divided into axis-parallel rectangles; axis ticks at x = 1, x = 3 and y = 5, y = 7]
More than two variables?
61An Illustrative Example
Training Examples for Concept PlayTennis
Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No
62Constructing a Decision Tree for PlayTennis
The Initial Decision Tree with One Leaf: [9+, 5-]
E(D) = min(9/14, 5/14) = 5/14 ≈ 36%
Question: What attribute A and what value of A should we split on?
- Goal: maximize the error reduction E, where the error reduction relative to attribute A is the expected reduction in error due to splitting on A
63Constructing a Decision Tree for PlayTennis
Potential Splits of Root Node [9+, 5-]

Attribute     Value      Examples
Outlook       Sunny      [2+, 3-]
              Rain       [3+, 2-]
              Overcast   [4+, 0-]
Temperature   Cool       [3+, 1-]
              Hot        [2+, 2-]
              Mild       [4+, 2-]
Humidity      High       [3+, 4-]
              Normal     [6+, 1-]
Wind          Light      [6+, 2-]
              Strong     [3+, 3-]

E(Split/Outlook) = (5/14) - ((5/14)(min(2/5,3/5)) + (4/14)(min(4/4,0/4)) + (5/14)(min(3/5,2/5))) = 1/14 ≈ 7%
E(Split/Temperature) = (5/14) - ((4/14)(min(3/4,1/4)) + (6/14)(min(4/6,2/6)) + (4/14)(min(2/4,2/4))) = 0
E(Split/Humidity) = (5/14) - ((7/14)(min(3/7,4/7)) + (7/14)(min(6/7,1/7))) = 1/14 ≈ 7%
E(Split/Wind) = (5/14) - ((8/14)(min(6/8,2/8)) + (6/14)(min(3/6,3/6))) = 0
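The four error reductions can be recomputed from the 14 PlayTennis training examples. The sketch below uses exact fractions and the same measure as the slide: the error of a node is the minority share of its examples, and the error reduction is the root error minus the weighted error of the children. The names `error` and `error_reduction` are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# The 14 training examples: Outlook, Temperature, Humidity, Wind, PlayTennis.
DATA = [row.split() for row in """\
Sunny Hot High Light No
Sunny Hot High Strong No
Overcast Hot High Light Yes
Rain Mild High Light Yes
Rain Cool Normal Light Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Light No
Sunny Cool Normal Light Yes
Rain Mild Normal Light Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Light Yes
Rain Mild High Strong No""".splitlines()]

def error(labels):
    """Misclassification error of a node: the share of its minority class."""
    counts = Counter(labels)
    return Fraction(min(counts["Yes"], counts["No"]), len(labels))

def error_reduction(attr):
    """Root error minus the weighted error of the children after splitting on attr."""
    col = ATTRS.index(attr)
    groups = defaultdict(list)
    for row in DATA:
        groups[row[col]].append(row[-1])
    after = sum(Fraction(len(g), len(DATA)) * error(g) for g in groups.values())
    return error([row[-1] for row in DATA]) - after

for attr in ATTRS:
    print(f"E(Split/{attr}) = {error_reduction(attr)}")
```

Outlook and Humidity tie at 1/14 under this measure, so the choice between them at the root needs a tie-breaking rule or a finer selection measure.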
64Constructing a Decision Tree for PlayTennis
- Top-Down Induction
  - For discrete-valued attributes, terminates in O(n) splits
  - Makes at most one pass through data set at each level (why?)

Outlook? (examples 1,2,3,4,5,6,7,8,9,10,11,12,13,14: [9+, 5-])
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Light  -> Yes
65Strengths Of Decision Trees
- Decision trees are able to generate understandable results
- Decision trees perform classification without requiring much computation
- Decision trees can handle both continuous and categorical variables
- Decision trees provide a clear indication of which attributes are most important for prediction or classification
66Weaknesses Of Decision Trees
- Error-prone with too many classes
- Quick partitioning of data results in fast deterioration in attribute selection quality
- Trouble with non-rectangular regions
67Visualization
68Visualization Example Naïve Bayesian
Three Flower Types Petal and Sepal Based
Classification
69Naïve Bayesian Visualization
- The right hand pane shows the distribution of the classes.
- The left hand pane shows the attributes and each of their values. They are listed in order of significance.
- The message box shows details about each pie chart when brushed.
- Clicking on a pie chart shows how knowing this information can change the overall class prediction.
- Clicking on multiple pie charts calculates conditional probabilities.
- Zoom in and out using the right mouse button.
Notice Iris-versicolor has a 33% likelihood
70Rule Association Visualization
- Read rules down the column
- Example: the rule in the column labeled 2 is
  - if petal-width = Binned(, 2.) then flower-type = Iris-setosa
  - Support: 25%
  - Confidence: 100%
71Discovery Using Rule Association
- What services are purchased together?
- What products or transactions are executed by customers on a single visit to your website?
- What are the relationships in the data?
72Parallel Coordinates - Visualization
- Each vertical line represents a field, with the minimum and maximum values shown at the bottom and top.
- Each record has a line that connects its values across the fields
- Lines are colored based on the output field
- Clicking on the label boxes allows the lines to be rearranged
- Zooming is accomplished by dragging a box over the desired area. Clicking returns to the original view.
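The geometry behind a parallel coordinates plot is a per-field rescaling: each field gets its own vertical axis, and a value is placed between that field's minimum (bottom) and maximum (top). The sketch below computes the polyline points for each record; it is not the D2K implementation, and the function name, record layout, and sample numbers are made up for illustration.

```python
def to_parallel_coordinates(records, fields):
    """Return one polyline per record as a list of (axis index, height in [0, 1])."""
    lo = {f: min(r[f] for r in records) for f in fields}
    hi = {f: max(r[f] for r in records) for f in fields}

    def scale(field, value):
        span = hi[field] - lo[field]
        # A constant field has no range; place its values mid-axis.
        return 0.5 if span == 0 else (value - lo[field]) / span

    return [[(x, scale(f, r[f])) for x, f in enumerate(fields)] for r in records]

# Hypothetical records with two numeric fields.
records = [
    {"petal_width": 0.0, "sepal_length": 10.0},
    {"petal_width": 5.0, "sepal_length": 30.0},
    {"petal_width": 10.0, "sepal_length": 20.0},
]
lines = to_parallel_coordinates(records, ["petal_width", "sepal_length"])
for line in lines:
    print(line)
```

Rearranging the label boxes corresponds to passing the fields in a different order, which changes which axes are adjacent and therefore which relationships are easy to see.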
73Scatterplots - Visualization
74Image To Knowledge (I2K) Data Visualization
- Hyperspectral image with 120 bands
75Image To Knowledge (I2K) Visualization of Results
- Classification Results
- Class labels per pixel
- Class labels per geographical entity
- Class labels of aggregations
- Alignment Results
- Overlays
- Summary Charts
- Image Operations
- Enhancements
- Image Restoration
- Filtering
76T2K - Text to Knowledge Topic Evolution
- Any chronologically ordered text
- News feeds
- Email
77Protein Consumption Dynamics
- Objective
  - To understand, through database visualization, global protein consumption patterns by providing a means to directly compare historical and simulated data.
- Presented at the Global Soy Forum - 1999
78Data Comparison, Reduction & Synthesis
- Goal
  - Development of a 3D visualization tool for multi-channel on-board sensor data. This tool allows for multiple time series comparison, reduction and synthesis.
- Related Projects
  - Derivative Monitoring
  - Real-time System Monitoring
79Summary
- Curious? Puzzled?
- Found Application? Domain Specific Questions?
- Learn !
- Become Familiar with Data Mining Terminology
- Introduction to Data Mining
- Look For Tools
- Apply Data Mining Techniques to Problems
- Ask For Help