Title: Chapter 2 Data Mining Processes and Knowledge Discovery
1 Chapter 2: Data Mining Processes and Knowledge Discovery
- Identify actionable results
2 Contents
- Describes the Cross-Industry Standard Process for Data Mining (CRISP-DM), a set of phases that can be used in data mining studies
- Discusses each phase in detail
- Gives an example illustration
- Discusses a knowledge discovery process
3 CRISP-DM
- Cross-Industry Standard Process for Data Mining
- One of the first comprehensive attempts toward a standard process model for data mining
- Independent of industry sector and technology
4 CRISP-DM Phases
- Business (or problem) understanding
- Data understanding
- A systematic process to try to make sense of the massive amounts of data generated from daily operations
- Data preparation
- Transform and create the data set for modeling
- Modeling
- Evaluation
- Check good models; evaluate to ensure nothing is missing
- Deployment
5 Business Understanding
- Solve a specific problem
- Determine business objectives, assess the current situation, establish data mining goals, and develop a project plan
- Clear definition helps
- Measurable success criteria
- Convert business objectives to a set of data-mining goals
- What to achieve in technical terms, such as
- What types of customers are interested in each of our products?
- What are typical profiles of customers
6 Data Understanding
- Initial data collection, data description, data exploration, and the verification of data quality
- Three issues considered in data selection
- Set up a concise and clear description of the problem. For example, a retail DM project may seek to identify spending behaviors of female shoppers who purchase seasonal clothes.
- Identify the relevant data for the problem description, such as demographic, credit card transactional, and financial data
- Select the variables that are relevant and important for the project
7 Data Understanding (cont.)
- Data types
- Demographic data (income, education, age)
- Socio-graphic data (hobby, club membership)
- Transactional data (sales record, credit card spending)
- Quantitative data are measurable using numerical values
- Qualitative data, also known as categorical data, contain both nominal and ordinal data (see also p. 22)
- Related data can come from many sources
- Internal
- ERP (or MIS)
- Data Warehouse
- External
- Government data
- Commercial data
- Created
- Research
8 Data Preparation
- Once the available data sources are identified, the data need to be selected, cleaned, built into the desired form, and formatted
- Clean data: formats, gaps, filters, outliers, redundancies (see p. 22)
- Unified numerical scales
- Nominal data
- Code (such as gender data: male and female)
- Ordinal data
- Nominal code or scale (excellent, fair, poor)
- Cardinal data (Categorical, A, B, C levels)
9 Types of Data
Type         Features    Synonyms
Numerical    Continuous  Range
             Integer     Range
Binary       Yes/No      Flag
Categorical  Finite      Set
Date/Time                Range
String                   Typeless
Text                     String
- Range: numeric values (integer, real, or date/time)
- Set: data with distinct multiple values (numeric, string, or date/time)
- Typeless: for other types of data
10 Data Preparation (Cont.)
- Several statistical methods and visualization tools can be used to preprocess the selected data
- Statistics such as max, min, mean, and mode can be used to aggregate or smooth the data
- Scatter plots and box plots can be used to filter outliers
- More advanced techniques, such as regression analysis, cluster analysis, decision trees, or hierarchical analysis, may be applied in data preprocessing
- In some cases, data preprocessing could take over 50% of the time of the entire data mining process
- Shortening data processing time can reduce much of the total computation time in data mining
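As a minimal sketch of the box-plot approach to filtering outliers mentioned above, the following Python function applies the common 1.5 x IQR whisker rule. The function name, the multiplier `k`, and the sample values are illustrative assumptions, not taken from the slides.

```python
import statistics

def filter_outliers(values, k=1.5):
    """Keep values within k * IQR of the quartiles (the box-plot whisker rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical spending amounts with one extreme value
spending = [12, 15, 14, 13, 16, 15, 14, 250]
cleaned = filter_outliers(spending)  # the value 250 falls outside the whiskers
```

Whether a flagged value should be dropped or investigated is a judgment call; the rule only identifies candidates.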
11 Data Preparation: data transformation
- Data transformation uses simple mathematical formulations or learning curves to convert different measurements of selected, clean data into a unified numerical scale for the data analysis
- Data transformation can be used to
- Transform from numerical to numerical scales, to shrink or enlarge the given data. For example, (x - min)/(max - min) shrinks the data into the interval [0, 1].
- Recode categorical data to numerical scales. Categorical data can be ordinal (less, moderate, strong) or nominal (red, yellow, blue, ...). For example, 1 = yes, 0 = no.
- See p. 24 for more details
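The two transformations described on this slide can be sketched in a few lines of Python. The function names and sample values below are illustrative assumptions; the handling of a constant column (all values equal) is a choice made here, not something the slides specify.

```python
def min_max_scale(values):
    """Rescale values into [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: map a constant column to 0.0 to avoid division by zero
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

def recode(values, codes):
    """Replace each categorical value with its numerical code."""
    return [codes[v] for v in values]

scaled = min_max_scale([20, 35, 50])                       # [0.0, 0.5, 1.0]
flags = recode(["yes", "no", "yes"], {"yes": 1, "no": 0})  # [1, 0, 1]
```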
12 Modeling
- Data modeling is where the data mining software is used to generate results for various situations. Data visualization and cluster analysis are useful for initial analysis.
- Depending on the data type,
- If the task is to group data, discriminant analysis is applied.
- If the purpose is estimation, regression is appropriate if the data are continuous (and logistic regression if not).
- Neural networks could be applied for both tasks.
- Data Treatment
- Training set for development of the model.
- Test set for testing the model that is built.
- Maybe others for refining the model
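The split into a training set and a test set can be sketched as follows. This is a minimal illustration: the function name, the 75/25 proportion, and the fixed seed are assumptions, since the slide does not state how the records are divided.

```python
import random

def split_records(records, train_fraction=0.75, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical case identifiers A1..A16
cases = [f"A{i}" for i in range(1, 17)]
train, test = split_records(cases)
```

A further slice of the data could be held out in the same way if a separate set is wanted for refining the model.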
13 Data mining techniques
- Techniques
- Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns. See also p. 25 for an example.
- Classification: the methods are intended for learning different functions that map each item of the selected data into one of a predefined set of classes. Two key research problems related to classification results are the evaluation of misclassification and prediction power (C4.5).
- Mathematical modeling is often used to construct classification methods, such as binary decision trees (CART), neural networks (nonlinear), linear programming (boundary), and statistics.
- See also pp. 25-26 for more explanation
14 Data mining techniques (Cont.)
- Clustering: takes ungrouped data and uses automatic techniques to put the data into groups
- Clustering is unsupervised and does not require a learning set (Chapter 5)
- Prediction: related to regression techniques; seeks to discover the relationship between the dependent and independent variables
- Sequential patterns: seeks to find similar patterns in data transactions over a business period
- The mathematical models behind sequential patterns are logic rules, fuzzy logic, and so on
- Similar time sequences: applied to discover sequences similar to a known sequence over both past and current business periods
15 Evaluation
- Does model meet business objectives?
- Any important business objectives not addressed?
- Does model make sense?
- Is model actionable?
[Figure: PDCA and CRISP-DM]
16 Deployment
- DM can be used to verify previously held hypotheses or for knowledge discovery
- DM models can be applied to business purposes, including prediction or identification of key situations
- Ongoing monitoring and maintenance
- Evaluate performance against success criteria
- Market reaction and competitor changes (remodel or fine-tune)
17 Example
- Training set for computer purchase
- 16 records
- 5 attributes
- Goal
- Find classifier for consumer behavior
18 Database (1st half)
Case Age Income Student Credit Gender Buy?
A1 31-40 High No Fair Male Yes
A2 gt40 Medium No Fair Female Yes
A3 gt40 Low Yes Fair Female Yes
A4 31-40 Low Yes Excellent Female Yes
A5 30 Low Yes Fair Female Yes
A6 gt40 Medium Yes Fair Male Yes
A7 30 Medium Yes Excellent Male Yes
A8 31-40 Medium No Excellent Male Yes
19 Database (2nd half)
Case Age Income Student Credit Gender Buy?
A9 31-40 High Yes Fair Male Yes
A10 30 High No Fair Male No
A11 30 High No Excellent Female No
A12 gt40 Low Yes Excellent Female No
A13 30 Medium No Fair Male No
A14 gt40 Medium No Excellent Female No
A15 30 Unknown No Fair Male Yes
A16 gt40 Medium No N/A Female No
20 Data Selection
- Gender has weak relationship with purchase
- Based on correlation
- Drop gender
- Selected Attribute Set
- Age, Income, Student, Credit
21 Data Preprocessing
- Income unknown in Case 15
- Credit not available in Case 16
- Drop these noisy cases
22 Data Transformation
- Assign numerical values to each attribute
- Age: 30 = 3, 31-40 = 2, gt40 = 1
- Income: High = 3, Medium = 2, Low = 1
- Student: Yes = 2, No = 1
- Credit: Excellent = 2, Fair = 1
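The attribute coding on this slide can be written as lookup tables. The sketch below uses the category labels exactly as they appear in the database slides (including `30` and `gt40`); the names `AGE`, `INCOME`, `STUDENT`, `CREDIT`, and `encode` are illustrative assumptions.

```python
# Numerical codes for each selected attribute, as listed on the slide
AGE = {"30": 3, "31-40": 2, "gt40": 1}
INCOME = {"High": 3, "Medium": 2, "Low": 1}
STUDENT = {"Yes": 2, "No": 1}
CREDIT = {"Excellent": 2, "Fair": 1}

def encode(age, income, student, credit):
    """Convert one record's categorical values to a tuple of numerical codes."""
    return (AGE[age], INCOME[income], STUDENT[student], CREDIT[credit])

# Case A1 from the database: 31-40, High, No, Fair
a1 = encode("31-40", "High", "No", "Fair")        # (2, 3, 1, 1)
# Case A4 from the database: 31-40, Low, Yes, Excellent
a4 = encode("31-40", "Low", "Yes", "Excellent")   # (2, 1, 2, 2)
```

Cases with values outside the tables, such as the unknown income in A15, raise a `KeyError`, which matches the decision to drop those cases during preprocessing.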
23 Data Mining
- Categorize output
- Buys = C1; doesn't buy = C2
- Conduct analysis
- Model says B8 and B10 don't buy; the rest do
- Of the actual yes, all 7 correct
- Of the actual no, 2 correct and 1 not
- Confusion matrix
24 Data Interpretation and Test Data Set
Case Actual Model
B1 Yes Yes (1)
B2 Yes Yes (2)
B3 Yes Yes (3)
B4 Yes Yes (4)
B5 Yes Yes (5)
B6 Yes Yes (6)
B7 Yes Yes (7)
B8 (do not) No No
B9 No Yes
B10 (do not) No No
25 Confusion Matrix
            Model Buy  Model Not  Totals
Actual Buy  7          0          7
Actual Not  1          2          3
Totals      8          2          10
26 Measures
- Correct classification rate
- 9/10 = 0.90
- Cost function
- Cost of error
- Model says buy, actual no: 20
- Model says no, actual buy: 200
- 1 x 20 + 0 x 200 = 20
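Both measures can be recomputed from the confusion matrix counts. The sketch below is a minimal illustration; the dictionary layout and variable names are assumptions, while the counts and error costs come from the slides.

```python
# Confusion matrix counts from the test set, keyed by (actual, model)
confusion = {
    ("buy", "buy"): 7, ("buy", "not"): 0,
    ("not", "buy"): 1, ("not", "not"): 2,
}
# Cost of each kind of error, keyed the same way
error_cost = {("not", "buy"): 20, ("buy", "not"): 200}

total = sum(confusion.values())
correct = sum(n for (actual, model), n in confusion.items() if actual == model)
rate = correct / total                                    # 9/10
cost = sum(n * error_cost.get(key, 0) for key, n in confusion.items())  # 1 x 20 + 0 x 200
```

With unequal error costs, a model with a lower classification rate can still be preferable if it avoids the expensive kind of error.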
27 Goals
- Avoid broad concepts
- Gain insight, discover meaningful patterns, learn interesting things
- Can't measure attainment
- Narrow and specify
- Identify customers likely to renew; reduce churn
- Rank order by propensity (favor) to
28 Goals
- Description: what is
- understand
- explain
- discover knowledge
- Prescription: what should be done
- classify
- predict
29 Goal
- Method A
- four rules, explains 70%
- Method B
- fifty rules, explains 72%
- BEST?
- To gain understanding: Method A is better
- minimum description length (MDL)
- To reduce cost of mailing: Method B is better
30 Measurement
- Accuracy
- How well does model describe observed data?
- Confidence levels
- a proportion of the time between lower and upper limits
- Comprehensibility
- Whole or parts?
31 Measuring Predictive
- Classification and prediction
- error rate = incorrect/total
- requires evaluation set be representative
- Estimators
- predicted - actual (MAD, MSE, MAPE)
- variance = sum of (predicted - actual)^2
- standard deviation = square root of variance
- distance - how far off
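The three estimator measures named above (MAD, MSE, MAPE) can be sketched using their usual definitions: mean absolute deviation, mean squared error, and mean absolute percentage error. The function names and sample numbers are illustrative assumptions; note that MAPE is undefined when an actual value is zero.

```python
def mad(actual, predicted):
    """Mean absolute deviation: average of |predicted - actual|."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of (predicted - actual)^2."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100 * sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual and predicted values
actual = [100, 200, 400]
predicted = [110, 190, 400]
errors = (mad(actual, predicted), mse(actual, predicted), mape(actual, predicted))
```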
32 Statistics
- Population - entire group studied
- Sample - subset from population
- Bias - difference between sample average and population average
- mean, median, mode
- distribution
- significance
- correlation, regression (hamming distance)
33 Classification Models
- LIFT = probability in class by sample divided by probability in class by population
- if population probability is 20% and
- sample probability is 30%,
- LIFT = 0.3/0.2 = 1.5
- Best lift is not necessarily best; sufficient sample size is needed as confidence increases
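The lift calculation above can be expressed as a small function. This is a sketch under assumed inputs: the function name and the counts (30 of 100 in the sample, 200 of 1,000 in the population) are hypothetical values chosen to reproduce the 30% and 20% probabilities on the slide.

```python
def lift(sample_hits, sample_size, population_hits, population_size):
    """Probability of the class in the sample divided by its probability in the population."""
    sample_prob = sample_hits / sample_size
    population_prob = population_hits / population_size
    return sample_prob / population_prob

# 30% in the selected sample versus 20% in the population gives a lift of about 1.5
value = lift(30, 100, 200, 1000)
```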
34 Lift Chart
35 Measuring Impact
- Ideal measure: net present value (NPV), because of expenditure
- Mass mailing may be better
- Depends on
- fixed cost
- cost per recipient
- cost per respondent
- value of positive response
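To see how the four factors listed above interact, the following sketch computes the single-period profit of a mailing campaign. It is an illustration only: the function name and every number are hypothetical, and it does not discount cash flows, so it is a simpler quantity than NPV.

```python
def campaign_profit(recipients, response_rate, fixed_cost,
                    cost_per_recipient, cost_per_respondent, value_per_response):
    """Revenue from positive responses minus fixed and variable costs."""
    respondents = recipients * response_rate
    revenue = respondents * value_per_response
    cost = (fixed_cost
            + recipients * cost_per_recipient
            + respondents * cost_per_respondent)
    return revenue - cost

# Hypothetical mass mailing: many recipients, low response rate, low fixed cost
mass = campaign_profit(10000, 0.02, 1000, 1, 5, 100)
# Hypothetical targeted mailing: fewer recipients, higher response rate,
# but a higher fixed cost (for example, to build the model)
targeted = campaign_profit(2000, 0.06, 3000, 1, 5, 100)
```

With these made-up numbers the mass mailing comes out ahead (about 8,000 versus 6,400), which illustrates why a model with good lift does not automatically yield the better campaign.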
36 Bottom Line
37 Example Application
- Telephone industry
- Problem: unpaid bills
- Data mining used to develop models to predict nonpayment as early as possible
- See p. 27
38 Knowledge Discovery Process
1 Data Selection: learning the application domain; creating target data set
2 Data Preprocessing: data cleaning and preprocessing
3 Data Transformation: data reduction and projection
4 Data Mining: choosing function; choosing algorithms; data mining
5 Data Interpretation: interpretation; using discovered knowledge
39 1. Business Understanding
- Predict which customers would be insolvent
- In time for firm to take preventive measures (and avert losing good customers)
- Hypothesis
- Insolvent customers would change calling habits and phone usage during a critical period before and immediately after termination of the billing period
40 2. Data Understanding
- Static customer information available in files
- Bills, payments, usage
- Used data warehouse to gather and organize data
- Coded to protect customer privacy
41 Creating Target Data Set
- Customer files
- Customer information
- Disconnects
- Reconnections
- Time-dependent data
- Bills
- Payments
- Usage
- 100,000 customers over 17-month period
- Stratified (hierarchical) sampling to assure all
groups appropriately represented
42 3. Data Preparation
- Filtered out incomplete data
- Deleted inexpensive calls
- Reduced data volume by about 50%
- Low number of fraudulent cases
- Cross-checked with phone disconnects
- Lagged data made synchronization necessary
43 Data Reduction and Projection
- Information grouped by account
- Customer data aggregated by 2-week periods
- Discriminant analysis on 23 categories
- Calculated average owed by category (significant)
- Identified extra charges (significant)
- Investigated payment by installments (not significant)
44 Choosing Data Mining Function
- Classes
- Most possibly solvent (99.3%)
- Most possibly insolvent (0.7%)
- Costs of error widely different
- New data set created through stratified sampling
- Retained all insolvent
- Altered distribution to 90% solvent
- Used 2,066 cases total
- Critical period identified
- Last 15 two-week periods before service interruption
- Variables defined by counting measures in two-week periods
- 46 variables as candidate discriminant factors
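The rebalancing step (retain all insolvent cases and alter the distribution to 90% solvent) can be sketched as below. This is a simplified illustration, not the study's procedure: it draws a plain random sample of the solvent cases, and the function name, seed, and customer identifiers are assumptions.

```python
import random

def rebalance(solvent, insolvent, solvent_share=0.9, seed=0):
    """Keep every insolvent case; sample solvent cases to reach the target share."""
    # Number of solvent cases needed so that they form solvent_share of the result
    n_solvent = round(len(insolvent) * solvent_share / (1 - solvent_share))
    n_solvent = min(n_solvent, len(solvent))  # cannot sample more than exist
    rng = random.Random(seed)
    return rng.sample(solvent, n_solvent) + list(insolvent)

# Hypothetical identifiers: 1,000 solvent and 10 insolvent customers
solvent_ids = [f"S{i}" for i in range(1000)]
insolvent_ids = [f"I{i}" for i in range(10)]
sample = rebalance(solvent_ids, insolvent_ids)  # 90 solvent + 10 insolvent
```

Oversampling the rare class this way gives the classifiers enough insolvent examples to learn from, at the price of a training distribution that no longer matches the population.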
45 4. Modeling
- Discriminant Analysis
- Linear model
- SPSS stepwise forward selection
- Decision Trees
- Rule-based classifier, C5, C4.5
- Neural Networks
- Nonlinear model
46 Data Mining
- Training set: about 2/3
- Rest: test
- Discriminant analysis
- Used 17 variables
- Equal costs: 0.875 correct
- Unequal costs: 0.930 correct
- Rule-based: 0.952 correct
- Neural network: 0.929 correct
47 5. Evaluation
- 1st objective: to maximize accuracy of predicting insolvent customers
- Decision tree classifier best
- 2nd objective: to minimize error rate for solvent customers
- Neural network model close to decision tree
- Used all 3 on case-by-case basis
48 Coincidence Matrix: Combined Models
                  Model insolvent  Model solvent  Unclass  Totals
Actual insolvent  19               17             28       64
Actual solvent    1                626            27       654
Totals            20               643            55       718
49 6. Implementation
- Every customer examined using all 3 algorithms
- If all 3 agreed, used that classification
- If disagreement, categorized as unclassified
- Correct on test data: 0.898
- Only 1 actually solvent customer would have been disconnected
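The agreement rule and the reported accuracy can be checked with a short sketch. The function name and data layout are illustrative assumptions; the counts are the cell values from the coincidence matrix for the combined models.

```python
def combine(predictions):
    """Return the shared label if all models agree, otherwise 'unclassified'."""
    first = predictions[0]
    return first if all(p == first for p in predictions) else "unclassified"

# Coincidence matrix cell counts, keyed by (actual, combined model output)
matrix = {
    ("insolvent", "insolvent"): 19, ("insolvent", "solvent"): 17,
    ("insolvent", "unclassified"): 28,
    ("solvent", "insolvent"): 1, ("solvent", "solvent"): 626,
    ("solvent", "unclassified"): 27,
}
total = sum(matrix.values())  # 718 test cases
correct = matrix[("insolvent", "insolvent")] + matrix[("solvent", "solvent")]
accuracy = correct / total    # 645/718, about 0.898
```

For example, `combine(["solvent", "solvent", "insolvent"])` returns `"unclassified"`. Counting unclassified cases as not correct reproduces the 0.898 figure on the slide.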