Chapter 2 Data Mining Processes and Knowledge Discovery - PowerPoint PPT Presentation

1
Chapter 2: Data Mining Processes and Knowledge
Discovery
  • Identify actionable results

2
Contents
  • Describes the Cross-Industry Standard Process for
    Data Mining (CRISP-DM), a set of phases that can
    be used in data mining studies
  • Discusses each phase in detail
  • Gives an illustrative example
  • Discusses a knowledge discovery process

3
CRISP-DM
  • Cross-Industry Standard Process for Data Mining
  • One of the first comprehensive attempts at a
    standard process model for data mining
  • Independent of both industry sector and technology

4
CRISP-DM Phases
  • Business (or problem) understanding
  • Data understanding
    • A systematic process to make sense of the massive
      amounts of data generated from daily operations
  • Data preparation
    • Transform and create the data set for modeling
  • Modeling
  • Evaluation
    • Check that the models are good and evaluate them
      to assure nothing important is missing
  • Deployment

5
Business Understanding
  • Solve a specific problem
  • Determine business objectives, assess the current
    situation, establish data mining goals, and develop
    a project plan
  • A clear problem definition helps
  • Define measurable success criteria
  • Convert business objectives into a set of
    data-mining goals
  • State what to achieve in technical terms, such as
    • What types of customers are interested in each of
      our products?
    • What are the typical profiles of our customers?

6
Data Understanding
  • Initial data collection, data description, data
    exploration, and verification of data quality
  • Three issues are considered in data selection
    • Set up a concise and clear description of the
      problem. For example, a retail DM project may
      seek to identify the spending behaviors of female
      shoppers who purchase seasonal clothes.
    • Identify the data relevant to the problem
      description, such as demographic, credit card
      transaction, and financial data
    • Select the variables that are important for the
      project

7
Data Understanding (cont.)
  • Data types
    • Demographic data (income, education, age)
    • Socio-graphic data (hobbies, club memberships)
    • Transactional data (sales records, credit card
      spending)
    • Quantitative data are measurable using numerical
      values
    • Qualitative data, also known as categorical data,
      contain both nominal and ordinal data (see also
      page 22)
  • Relevant data can come from many sources
    • Internal
      • ERP (or MIS)
      • Data warehouse
    • External
      • Government data
      • Commercial data
    • Created
      • Research

8
Data Preparation
  • Once the available data sources are identified, the
    data need to be selected, cleaned, and built into
    the desired format
  • Clean the data: fix formats, fill gaps, filter out
    outliers and redundancies (see page 22)
  • Unify numerical scales
    • Nominal data: code as numbers (such as gender
      data, male and female)
    • Ordinal data: code on an ordered scale
      (excellent, fair, poor)
    • Cardinal data (categorical: A, B, C levels)

9
Types of Data
Type         Features    Synonyms
Numerical    Continuous  Range
Integer                  Range
Binary       Yes/No      Flag
Categorical  Finite      Set
Date/Time                Range
String                   Typeless
Text                     String

  • Range: numeric values (integer, real, or date/time)
  • Set: data with distinct multiple values (numeric,
    string, or date/time)
  • Typeless: for other types of data
10
Data Preparation (Cont.)
  • Several statistical methods and visualization
    tools can be used to preprocess the selected data
  • Statistics such as max, min, mean, and mode can be
    used to aggregate or smooth the data
  • Scatter plots and box plots can be used to filter
    outliers
  • More advanced techniques, such as regression
    analysis, cluster analysis, decision trees, or
    hierarchical analysis, may be applied in data
    preprocessing
  • In some cases, data preprocessing could take over
    50% of the time of the entire data mining process
  • Shortening data preprocessing time can reduce much
    of the total computation time in data mining
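The simple statistics and box-plot-style outlier filtering described above can be sketched in Python using only the standard library (a minimal illustration; the data values and the conventional 1.5×IQR cutoff are assumptions, not from the chapter):

```python
import statistics

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR], the box-plot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

data = [12, 14, 13, 15, 14, 13, 120, 15, 14]   # 120 is an obvious outlier

# Aggregate/smooth with simple statistics, as the slide suggests
print(min(data), max(data), statistics.mean(data), statistics.mode(data))

# Box-plot-style filtering drops the outlier
print(iqr_filter(data))
```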

11
Data Preparation: data transformation
  • Data transformation uses simple mathematical
    formulations or learning curves to convert the
    different measurements of the selected, cleaned
    data into a unified numerical scale for analysis
  • Data transformation can be used to
    • Transform from one numerical scale to another, to
      shrink or enlarge the given data, such as
      (x - min)/(max - min) to shrink the data into the
      interval [0, 1]
    • Recode categorical data to numerical scales.
      Categorical data can be ordinal (less, moderate,
      strong) or nominal (red, yellow, blue, ...),
      coded, for example, as 1 = yes, 0 = no (see also
      page 24)

See page 24 for more details.
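Both transformations can be sketched in a few lines (a minimal illustration; the income values and the specific category codes are assumptions):

```python
# Min-max scaling: (x - min)/(max - min) maps values into [0, 1]
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 35_000, 50_000, 80_000]
print(min_max(incomes))   # [0.0, 0.25, 0.5, 1.0]

# Recoding: ordinal codes preserve order; nominal codes are arbitrary labels
ordinal = {"less": 1, "moderate": 2, "strong": 3}
nominal = {"no": 0, "yes": 1}
print([ordinal[x] for x in ["strong", "less"]], nominal["yes"])
```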
12
Modeling
  • Data modeling is where the data mining software is
    used to generate results for various situations.
    Data visualization and cluster analysis are useful
    for initial analysis.
  • Depending on the data type
    • if the task is to group data, discriminant
      analysis is applied
    • if the purpose is estimation, regression is
      appropriate when the data are continuous (and
      logistic regression when they are not)
    • neural networks can be applied to both tasks
  • Data treatment
    • Training set: for development of the model
    • Test set: for testing the model that is built
    • Possibly other sets for refining the model
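The training/test treatment above can be sketched as a simple random split using the standard library (a minimal illustration; the 100 records and the two-thirds ratio are assumptions, not from this slide):

```python
import random

records = list(range(100))      # stand-in for 100 data records
random.seed(42)                 # fixed seed so the split is reproducible
random.shuffle(records)

cut = int(len(records) * 2 / 3)  # two-thirds for training, the rest for testing
train, test = records[:cut], records[cut:]
print(len(train), len(test))     # 66 34
```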

13
Data mining techniques
  • Techniques
    • Association: the relationship of a particular
      item in a data transaction to other items in the
      same transaction is used to predict patterns (see
      page 25 for an example)
    • Classification: methods intended for learning
      functions that map each item of the selected data
      into one of a predefined set of classes. Two key
      research problems related to classification
      results are the evaluation of misclassification
      and prediction power (e.g., C4.5).
    • Mathematical models often used to construct
      classification methods are binary decision trees
      (CART), neural networks (nonlinear), linear
      programming (boundary), and statistics
  • See also pages 25-26 for more explanation

14
Data mining techniques (Cont.)
  • Clustering: takes ungrouped data and uses automatic
    techniques to put the data into groups
    • Clustering is unsupervised and does not require a
      learning set (Chapter 5)
  • Prediction: related to regression techniques;
    discovers the relationship between dependent and
    independent variables
  • Sequential patterns: seeks to find similar patterns
    in data transactions over a business period
    • The mathematical models behind sequential
      patterns include logic rules, fuzzy logic, and
      so on
  • Similar time sequences: applied to discover
    sequences similar to a known sequence over both
    past and current business periods

15
Evaluation
  • Does model meet business objectives?
  • Any important business objectives not addressed?
  • Does model make sense?
  • Is model actionable?

[Figure: the PDCA cycle shown alongside the CRISP-DM phases]
16
Deployment
  • DM can be used to verify previously held
    hypotheses or for knowledge discovery
  • DM models can be applied for business purposes,
    including prediction or identification of key
    situations
  • Ongoing monitoring and maintenance
  • Evaluate performance against success criteria
  • Market reaction and competitor changes (may call
    for remodeling or fine-tuning)

17
Example
  • Training set for computer purchase
  • 16 records
  • 5 attributes
  • Goal
  • Find classifier for consumer behavior

18
Database (1st half)
Case  Age    Income  Student  Credit     Gender  Buy?
A1    31-40  High    No       Fair       Male    Yes
A2    >40    Medium  No       Fair       Female  Yes
A3    >40    Low     Yes      Fair       Female  Yes
A4    31-40  Low     Yes      Excellent  Female  Yes
A5    ≤30    Low     Yes      Fair       Female  Yes
A6    >40    Medium  Yes      Fair       Male    Yes
A7    ≤30    Medium  Yes      Excellent  Male    Yes
A8    31-40  Medium  No       Excellent  Male    Yes
19
Database (2nd half)
Case  Age    Income   Student  Credit     Gender  Buy?
A9    31-40  High     Yes      Fair       Male    Yes
A10   ≤30    High     No       Fair       Male    No
A11   ≤30    High     No       Excellent  Female  No
A12   >40    Low      Yes      Excellent  Female  No
A13   ≤30    Medium   No       Fair       Male    No
A14   >40    Medium   No       Excellent  Female  No
A15   ≤30    Unknown  No       Fair       Male    Yes
A16   >40    Medium   No       N/A        Female  No
20
Data Selection
  • Gender has a weak relationship with purchase
  • Based on correlation
  • Drop gender
  • Selected Attribute Set
  • Age, Income, Student, Credit

21
Data Preprocessing
  • Income unknown in Case 15
  • Credit not available in Case 16
  • Drop these noisy cases

22
Data Transformation
  • Assign numerical values to each attribute
    • Age: ≤30 = 3; 31-40 = 2; >40 = 1
    • Income: High = 3; Medium = 2; Low = 1
    • Student: Yes = 2; No = 1
    • Credit: Excellent = 2; Fair = 1
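The codes above can be applied programmatically; a minimal sketch (the dictionary and function names are illustrative, and case A1 is taken from the training set):

```python
# Attribute codes from the slide above
age = {"<=30": 3, "31-40": 2, ">40": 1}
income = {"High": 3, "Medium": 2, "Low": 1}
student = {"Yes": 2, "No": 1}
credit = {"Excellent": 2, "Fair": 1}

def encode(case):
    """Map one (age, income, student, credit) record to its numeric codes."""
    a, i, s, c = case
    return (age[a], income[i], student[s], credit[c])

# Case A1 from the training set: 31-40, High, No, Fair
print(encode(("31-40", "High", "No", "Fair")))   # (2, 3, 1, 1)
```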

23
Data Mining
  • Categorize the output
    • Buys = C1; doesn't buy = C2
  • Conduct the analysis
    • The model says A8 and A10 don't buy; the rest do
    • Of the actual yes cases, 7 correct and 1 not
    • Of the actual no cases, 2 correct
  • Confusion matrix

24
Data Interpretation and Test Data Set
  • Test on independent data

Case  Actual  Model
B1    Yes     Yes
B2    Yes     Yes
B3    Yes     Yes
B4    Yes     Yes
B5    Yes     Yes
B6    Yes     Yes
B7    Yes     Yes
B8    No      No
B9    No      Yes
B10   No      No
25
Confusion Matrix
             Model: Buy  Model: Not  Totals
Actual: Buy       7           0          7
Actual: Not       1           2          3
Totals            8           2         10

(9/10 right)
26
Measures
  • Correct classification rate
    • 9/10 = 0.90
  • Cost function
    • cost per error:
      • model says buy, actual no: 20
      • model says no, actual buy: 200
    • total cost: 1 × 20 + 0 × 200 = 20
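The two measures above can be reproduced directly from the test-set confusion matrix (a minimal sketch; the error counts and costs are the ones on these slides):

```python
# Correct classification rate: 9 of 10 test cases classified right
correct, total = 9, 10
rate = correct / total
print(rate)   # 0.9

# Cost function: a false "buy" costs 20, a false "no" costs 200;
# the test set had 1 false "buy" and 0 false "no"
false_buy, false_no = 1, 0
cost = false_buy * 20 + false_no * 200
print(cost)   # 20
```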

27
Goals
  • Avoid broad concepts
    • "Gain insight," "discover meaningful patterns,"
      "learn interesting things"
    • Can't measure attainment of such goals
  • Narrow and specify
    • Identify customers likely to renew; reduce churn
    • Rank order by propensity (i.e., favorability)

28
Goals
  • Description: what is
    • understand
    • explain
    • discover knowledge
  • Prescription: what should be done
    • classify
    • predict

29
Goal
  • Method A
    • four rules, explains 70%
  • Method B
    • fifty rules, explains 72%
  • Which is best?
    • To gain understanding, Method A is better
      (minimum description length, MDL)
    • To reduce the cost of a mailing, Method B is
      better

30
Measurement
  • Accuracy
    • How well does the model describe the observed
      data?
  • Confidence levels
    • the proportion of the time the value falls
      between the lower and upper limits
  • Comprehensibility
    • Whole or parts?

31
Measuring Predictive Accuracy
  • Classification prediction
    • error rate = incorrect / total
    • requires the evaluation set to be representative
  • Estimators
    • error = predicted - actual (MAD, MSE, MAPE)
    • variance = sum of (predicted - actual)²
    • standard deviation = square root of variance
    • distance: how far off the prediction is
32
Statistics
  • Population: the entire group studied
  • Sample: a subset drawn from the population
  • Bias: the difference between the sample average and
    the population average
  • mean, median, mode
  • distribution
  • significance
  • correlation, regression (Hamming distance)

33
Classification Models
  • LIFT: the probability of being in a class within
    the sample divided by the probability of being in
    that class within the population
    • if the population probability is 20% and the
      sample probability is 30%,
    • LIFT = 0.3 / 0.2 = 1.5
  • The best lift is not necessarily best overall; a
    sufficient sample size is needed as confidence
    increases
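The lift example above works out as follows (a minimal sketch of the calculation; the function name is illustrative):

```python
def lift(sample_rate, population_rate):
    """Response rate in the targeted sample over the population rate."""
    return sample_rate / population_rate

# The slide's example: 30% in the sample vs. 20% in the population
print(round(lift(0.3, 0.2), 6))   # 1.5
```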

34
Lift Chart
35
Measuring Impact
  • Ideal: maximize the net present value (NPV) gained
    because of the expenditure
  • A mass mailing may be better
  • Depends on
    • fixed cost
    • cost per recipient
    • cost per respondent
    • value of a positive response

36
Bottom Line
  • Return on investment

37
Example Application
  • Telephone industry
  • Problem Unpaid bills
  • Data mining used to develop models to predict
    nonpayment as early as possible

See page 27.
38
Knowledge Discovery Process
1 Data Selection: learning the application domain;
  creating the target data set
2 Data Preprocessing: data cleaning and preprocessing
3 Data Transformation: data reduction and projection
4 Data Mining: choosing the function; choosing the
  algorithms; data mining
5 Data Interpretation: interpretation; using the
  discovered knowledge
39
1 Business Understanding
  • Predict which customers would become insolvent
  • In time for the firm to take preventive measures
    (and avert losing good customers)
  • Hypothesis
    • Insolvent customers would change their calling
      habits and phone usage during a critical period
      before and immediately after the end of the
      billing period

40
2 Data Understanding
  • Static customer information available in files
  • Bills, payments, usage
  • Used a data warehouse to gather and organize the
    data
  • Coded to protect customer privacy

41
Creating Target Data Set
  • Customer files
  • Customer information
  • Disconnects
  • Reconnections
  • Time-dependent data
  • Bills
  • Payments
  • Usage
  • 100,000 customers over 17-month period
  • Stratified (hierarchical) sampling to assure all
    groups were appropriately represented

42
3 Data Preparation
  • Filtered out incomplete data
  • Deleted inexpensive calls
  • Reduced data volume by about 50%
  • Low number of fraudulent cases
    • Cross-checked with phone disconnects
  • Lagged data made synchronization necessary

43
Data Reduction and Projection
  • Information grouped by account
  • Customer data aggregated by two-week periods
  • Discriminant analysis on 23 categories
    • Calculated average amount owed by category
      (significant)
    • Identified extra charges (significant)
    • Investigated payment by installments (not
      significant)

44
Choosing Data Mining Function
  • Classes
    • Most probably solvent (99.3%)
    • Most probably insolvent (0.7%)
  • Costs of the two error types are widely different
  • New data set created through stratified sampling
    • Retained all insolvent cases
    • Altered the distribution to 90% solvent
    • Used 2,066 cases total
  • Critical period identified
    • Last 15 two-week periods before service
      interruption
  • Variables defined by counting measures within
    two-week periods
    • 46 variables as candidate discriminant factors
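The stratified re-sampling step above (keep every rare insolvent case, subsample the solvent majority down to 90%) can be sketched as follows. A minimal illustration with made-up counts; the chapter's actual data set had 2,066 cases:

```python
import random

random.seed(0)
# Made-up labelled cases: a rare insolvent class and a large solvent class
cases = ([("insolvent", i) for i in range(207)]
         + [("solvent", i) for i in range(30_000)])

insolvent = [c for c in cases if c[0] == "insolvent"]
solvent = [c for c in cases if c[0] == "solvent"]

# Keep every insolvent case; subsample solvent cases so the new set
# is 90% solvent (i.e., 9 solvent cases per insolvent case)
balanced = insolvent + random.sample(solvent, 9 * len(insolvent))
print(len(balanced), len(insolvent) / len(balanced))
```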

45
4 Modeling
  • Discriminant Analysis
  • Linear model
  • SPSS stepwise forward selection
  • Decision Trees
  • Rule-based classifier, C5, C4.5
  • Neural Networks
  • Nonlinear model

46
Data Mining
  • Training set: about two-thirds of the data
  • The rest used for testing
  • Discriminant analysis
    • Used 17 variables
    • Equal costs: 0.875 correct
    • Unequal costs: 0.930 correct
  • Rule-based classifier: 0.952 correct
  • Neural network: 0.929 correct

47
5 Evaluation
  • 1st objective: maximize the accuracy of predicting
    insolvent customers
    • The decision tree classifier was best
  • 2nd objective: minimize the error rate for solvent
    customers
    • The neural network model was close to the
      decision tree
  • Used all 3 models on a case-by-case basis

48
Coincidence Matrix: Combined Models
                  Model: insolvent  Model: solvent  Unclassified  Totals
Actual insolvent        19                17             28           64
Actual solvent           1               626             27          654
Totals                  20               643             55          718
49
6 Implementation
  • Every customer examined using all 3 algorithms
  • If all 3 agreed, that classification was used
  • If they disagreed, the customer was categorized as
    unclassified
  • Correct on test data: 0.898
  • Only 1 actually solvent customer would have been
    disconnected