Chapter 2: Overview of the Data Mining Process

1
Chapter 2: Overview of the Data Mining Process
2
Introduction
  • Data Mining
  • Predictive analysis
  • Tasks of Classification and Prediction
  • Core of Business Intelligence
  • Database Methods
  • OLAP
  • SQL
  • These do not involve statistical modeling

3
Core Ideas in Data Mining
  • Analytical Methods Used in Predictive Analytics
  • Classification
  • Used with categorical response variables
  • E.g., will a purchase be made or not made?
  • Prediction
  • Predict (estimate) the value of a continuous
    response variable
  • Prediction is used with categorical responses as
    well
  • Association Rules
  • Affinity analysis: what goes with what
  • Seeks correlations among data

4
Core Ideas in Data Mining
  • Data Reduction
  • Reduce variables
  • Group together similar variables
  • Data Exploration
  • View data as evidence
  • Get a feel for the data
  • Data Visualization
  • Graphical representation of data
  • Locate trends, correlations, etc.

5
Supervised Learning
  • "Supervised learning" algorithms are those used
    in classification and prediction.
  • Data is available in which the value of the
    outcome of interest is known.
  • "Training data" are the data from which the
    classification or prediction algorithm "learns,"
    or is "trained," about the relationship between
    predictor variables and the outcome variable.
  • This process results in a model
  • Classification Model
  • Predictive Model

6
Supervised Learning
  • The model is then run with another sample of
    data, the "validation data"
  • The outcome is known, but we wish to see how well
    the model performs
  • If many different models are being tried out, a
    third sample of known outcomes, the "test data,"
    is used with the final, selected model to predict
    how well it will do.
  • The model can then be used to classify or predict
    the outcome of interest in new cases where the
    outcome is unknown.

7
Supervised Learning
  • Linear regression analysis is an example of
    supervised learning
  • The Y variable is the (known) outcome variable
  • The X variable is some predictor variable.
  • A regression line is drawn to minimize the sum of
    squared deviations between the actual Y values
    and the values predicted by this line.
  • The regression line can now be used to predict Y
    values for new values of X for which we do not
    know the Y value.
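
To make this concrete, here is a minimal Python sketch of fitting and using a regression line. The x and y values are hypothetical, and numpy's polyfit stands in for whatever tool is actually used.

  import numpy as np

  # Hypothetical predictor (X) and known outcome (Y) values
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

  # Fit Y = a + b*X by least squares; polyfit returns [slope, intercept]
  b, a = np.polyfit(x, y, deg=1)

  # Predict Y for a new X whose Y value is unknown
  print(a + b * 6.0)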

8
Unsupervised Learning
  • No outcome variable to predict or classify
  • No learning from cases
  • Unsupervised learning methods
  • Association Rules
  • Data Reduction Methods
  • Clustering Techniques

9
The Steps in Data Mining
  • 1. Develop an understanding of the purpose of the
    data mining project
  • Is it a one-shot effort to answer a question or
    questions, or an application (an ongoing
    procedure)?
  • 2. Obtain the dataset to be used in the analysis.
  • Random sampling from a large database to capture
    records to be used in an analysis
  • Pulling together data from different databases.
  • Internal (e.g., past purchases made by customers)
  • External (credit ratings).
  • Usually the analysis to be done requires only
    thousands or tens of thousands of records.

10
The Steps in Data Mining
  • 3. Explore, clean, and preprocess the data
  • Verifying that the data are in reasonable
    condition.
  • How should missing data be handled?
  • Are the values in a reasonable range, given what
    you would expect for each variable?
  • Are there obvious outliers?
  • Data are reviewed graphically
  • For example, a matrix of scatter plots showing
    the relationship of each variable with each other
    variable.
  • Ensure consistency in the definitions of fields,
    units of measurement, time periods, etc.
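
As an illustration of the graphical review described above, a scatter-plot matrix can be produced with pandas. This is a minimal sketch; the file name is hypothetical.

  import pandas as pd
  import matplotlib.pyplot as plt
  from pandas.plotting import scatter_matrix

  df = pd.read_csv("customers.csv")   # hypothetical dataset

  # Quick checks: reasonable ranges, obvious outliers, missing values
  print(df.describe())
  print(df.isna().sum())

  # Matrix of scatter plots: each variable against each other variable
  scatter_matrix(df, figsize=(8, 8), diagonal="hist")
  plt.show()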

11
The Steps in Data Mining
  • 4. Reduce the data
  • If supervised learning is involved, separate the
    data into training, validation, and test datasets.
  • Eliminating unneeded variables
  • Transforming variables
  • Turning "money spent" into "spent > $100" vs.
    "spent ≤ $100" (see the sketch at the end of this
    slide)
  • Creating new variables
  • A variable that records whether at least one of
    several products was purchased
  • Make sure you know what each variable means, and
    whether it is sensible to include it in the
    model.
  • 5. Determine the data mining task
  • Classification, prediction, clustering, etc.
  • 6. Choose the data mining techniques to be used
  • Regression, neural nets, hierarchical clustering,
    etc.
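
A short pandas sketch of the step-4 transformations mentioned above; the column names and values are hypothetical.

  import pandas as pd

  df = pd.DataFrame({
      "spent":    [35.0, 250.0, 99.0, 180.0],
      "bought_a": [0, 1, 0, 1],
      "bought_b": [1, 0, 0, 1],
  })

  # Transform: "money spent" becomes a binary "spent > $100" indicator
  df["spent_gt_100"] = (df["spent"] > 100).astype(int)

  # New variable: was at least one of several products purchased?
  df["any_purchase"] = (df[["bought_a", "bought_b"]].max(axis=1) > 0).astype(int)
  print(df)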

12
The Steps in Data Mining
  • 7. Use algorithms to perform the task.
  • An iterative process: trying multiple models, and
    often multiple variants of the same algorithm
    (choosing different variables or settings within
    the algorithm).
  • When appropriate, feedback from the algorithm's
    performance on validation data is used to refine
    the settings.
  • 8. Interpret the results of the algorithms.
  • Choose the best algorithm to deploy,
  • Use final choice on the test data to get an idea
    how well it will perform.
  • 9. Deploy the model.
  • Integrate the model into operational systems
  • Run it on real records to produce decisions or
    actions.
  • For example, the model might be applied to a
    purchased list of possible customers, and the
    action might be "include in the mailing if the
    predicted amount of purchase is > $10."

13
Preliminary Steps
  • Organization of datasets
  • Records in rows
  • Variables in columns
  • In supervised learning, one of these will be the
    outcome variable
  • Labels go in the first or last column
  • Sampling from a database
  • Use a sample to create, validate, and test the
    model
  • Oversampling rare events
  • If the value of the response variable is seldom
    found in the data, increase the sample size (see
    the sketch below)
  • Adjust the algorithm as necessary
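
One common way to oversample a rare response, sketched in pandas with hypothetical column names. Model results must later be adjusted for the artificial class balance.

  import pandas as pd

  df = pd.DataFrame({
      "x": range(10),
      "response": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],   # the rare event
  })

  rare = df[df["response"] == 1]
  common = df[df["response"] == 0]

  # Resample the rare class with replacement to match the common class
  rare_up = rare.sample(n=len(common), replace=True, random_state=1)
  balanced = pd.concat([common, rare_up])
  print(balanced["response"].value_counts())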

14
Preliminary Steps (Pre-processing and Cleaning the Data)
  • Types of variables
  • Continuous: assumes any real numerical value
    (generally within a specified range)
  • Categorical: assumes one of a limited number of
    values
  • Text (e.g., Payments ∈ {current, not current,
    bankrupt})
  • Numerical (e.g., Age ∈ [0, 120])
  • Nominal (payments)
  • Ordinal (age)

15
Preliminary Steps (Pre-processing and Cleaning the Data)
  • Handling categorical variables
  • If a categorical variable is ordered, it can be
    used as a continuous variable (e.g., age, level
    of credit, etc.)
  • Use dummy variables when the range of values is
    not large
  • e.g., variable occupation ∈ {student, unemployed,
    employed, retired}
  • Create binary (yes/no) dummy variables (see the
    sketch after this list)
  • Student yes/no
  • Unemployed yes/no
  • Employed yes/no
  • Retired yes/no
  • Variable selection
  • The more predictor variables, the more records
    are needed to build the model
  • Reduce the number of variables whenever appropriate
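
A minimal sketch of the dummy-variable encoding in pandas, using the occupation values from the slide:

  import pandas as pd

  df = pd.DataFrame({
      "occupation": ["student", "unemployed", "employed", "retired"]
  })

  # One binary (yes/no) dummy variable per category
  dummies = pd.get_dummies(df["occupation"], prefix="occ")
  print(dummies.astype(int))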

16
Preliminary Steps (Pre-processing and Cleaning the Data)
  • Overfitting
  • Building a model: describing relationships among
    variables in order to predict future outcome
    (dependent) values on the basis of future
    predictor (independent) values.
  • Avoid explaining variation in the data that is
    nothing more than chance variation; avoid
    mislabeling noise in the data as if it were a
    signal.
  • Caution: if the dataset is not much larger than
    the number of predictor variables, it is very
    likely that a spurious relationship will creep
    into the model.

17
Overfitting
18
Preliminary Steps (Pre-processing and Cleaning the Data)
  • How many variables and how much data?
  • A good rule of thumb is to have ten records for
    every predictor variable.
  • For classification procedures:
  • At least 6 × m × p records,
  • where m = number of outcome classes and p =
    number of variables
  • (e.g., with m = 2 classes and p = 10 variables,
    at least 120 records)
  • Compactness or parsimony is a desirable feature
    in a model.
  • A matrix of x-y plots can be useful in variable
    selection.
  • You can see at a glance the x-y plots for all
    variable combinations.
  • A straight line would be an indication that one
    variable is exactly correlated with another.
  • We would want to include only one of them in our
    model.
  • Weed out irrelevant and redundant variables from
    the model
  • Consult a domain expert whenever possible

19
Preliminary Steps (Pre-processing and Cleaning the Data)
  • Outliers
  • Values that lie far away from the bulk of the
    data are called outliers
  • No statistical rule can tell us whether such an
    outlier is the result of an error
  • These are judgments best made by someone with
    "domain" knowledge
  • If the number of records with outliers is very
    small, they might be treated as missing data (a
    screening sketch follows below).
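
No statistical rule is decisive, but a common screening heuristic (an assumption here, not from the slides) flags values more than three standard deviations from the mean for expert review:

  import numpy as np

  # Hypothetical variable: twenty typical values plus one suspicious value
  values = np.array([12, 14, 13, 15, 14, 13, 12, 15, 14, 13] * 2 + [98])

  # z-score: each value's distance from the mean in standard deviations
  z = (values - values.mean()) / values.std()

  # Flag candidates for domain-expert review; do not delete automatically
  print(values[np.abs(z) > 3])   # -> [98]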

20
Preliminary Steps (Pre-processing and Cleaning the Data)
  • Missing values
  • If the number of records with missing values is
    small, those records might be omitted
  • The more variables, the more records will be
    dropped
  • Solution: use the average value, computed from
    records with valid data, for the variable with
    missing data (sketched below)
  • This reduces variability in the data set
  • Human judgment can be used to determine the best
    way to handle missing data
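
The average-value solution described above, as a pandas sketch with hypothetical columns:

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({
      "age":    [25, np.nan, 40, 35],
      "income": [50_000, 62_000, np.nan, 58_000],
  })

  # Option 1: drop records with missing values (fine if they are few)
  dropped = df.dropna()

  # Option 2: fill each gap with the column mean from valid records
  # (note that this reduces the variability of the data set)
  imputed = df.fillna(df.mean())
  print(imputed)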

21
Preliminary Steps (Pre-processing and Cleaning the Data)
  • Normalizing (standardizing) the data
  • To normalize the data, we subtract the mean from
    each value and divide by the standard deviation
    of the resulting deviations from the mean
  • This expresses each value as a number of standard
    deviations away from the mean: the z-score
  • Needed if variables are in different units, e.g.,
    hours, thousands of dollars, etc.
  • Clustering algorithms measure variables' values
    by distance from each other, and need a standard
    scale for distance.
  • Data mining software, including XLMiner,
    typically has an option that normalizes the data
    in those algorithms where it may be required.
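
The z-score computation spelled out in Python; tools such as XLMiner offer this as a built-in option, and the columns here are hypothetical.

  import pandas as pd

  df = pd.DataFrame({
      "hours":     [10, 20, 30, 40],
      "dollars_k": [1.5, 2.0, 8.0, 12.5],
  })

  # Subtract each column's mean and divide by its standard deviation,
  # so every variable is measured in standard deviations from its mean
  z = (df - df.mean()) / df.std()
  print(z)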

22
Preliminary Steps
  • Use and creation of partitions
  • Training partition
  • The largest partition
  • Contains the data used to build the various
    models
  • Same training partition is generally used to
    develop multiple models.
  • Validation partition
  • Used to assess the performance of each model,
  • Used to compare models and pick the best one.
  • In classification and regression trees algorithms
    the validation partition may be used
    automatically to tune and improve the model.
  • Test partition
  • Sometimes called the "holdout" or "evaluation"
    partition; it is used to assess the performance
    of a chosen model with new data.
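
A minimal sketch of the three-way partition, assuming a pandas DataFrame; the 60/20/20 proportions are an illustrative choice, not from the slides.

  import pandas as pd

  df = pd.read_csv("data.csv")   # hypothetical dataset

  # Shuffle once, then slice into 60% / 20% / 20%
  shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
  n = len(shuffled)
  train = shuffled.iloc[:int(0.6 * n)]                   # build the models
  validation = shuffled.iloc[int(0.6 * n):int(0.8 * n)]  # compare, pick best
  test = shuffled.iloc[int(0.8 * n):]                    # assess chosen model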

23
The Three Data Partitions and Their Role in the
Data Mining Process
24
Simple Regression Example
25
Simple Regression Model
  • Make a prediction about the starting salary of a
    current college graduate
  • Data set of starting salaries of recent college
    graduates

Compute the average salary from the data set. How
certain are we of this prediction? There is
variability in the data.
26
Simple Regression Model
  • Use total variation as an index of uncertainty
    about our prediction
  • The smaller the amount of total variation, the
    more accurate (certain) our prediction will be.

27
Simple Regression Model
  • How do we explain the variability? Perhaps it
    depends on the student's GPA

28
Simple Regression Model
  • Find a linear relationship between GPA and
    starting salary
  • As GPA increases or decreases, starting salary
    increases or decreases

29
Simple Regression Model
  • Least Squares Method to find the regression model
  • Choose a and b in the regression model (equation)
    so as to minimize the sum of the squared
    deviations: actual Y value minus predicted Y
    value (Y-hat)
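
The least squares formulas, sketched in Python on hypothetical (GPA, salary) pairs; the slides' actual data set is not reproduced here, so these coefficients will not match a = 4,779 and b = 5,370.

  import numpy as np

  # Hypothetical (GPA, starting salary) pairs
  gpa = np.array([2.5, 3.0, 3.2, 3.6, 3.9])
  salary = np.array([18_000, 20_500, 21_500, 24_000, 26_000])

  # b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2);  a = y_bar - b*x_bar
  dx = gpa - gpa.mean()
  b = (dx * (salary - salary.mean())).sum() / (dx ** 2).sum()
  a = salary.mean() - b * gpa.mean()

  y_hat = a + b * gpa       # predicted salaries
  u_hat = salary - y_hat    # residuals; they sum to (essentially) zero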

30
Simple Regression Model
  • How good is the model?

a = 4,779, b = 5,370 (a computer program
computed these values)
  • u-hat is a residual value
  • The sum of all u-hats is zero
  • The sum of all squared u-hats is the total
    variation not explained by the model
  • The unexplained variation is 7,425,726

31
Simple Regression Model
Total Variation = 23,000,000
32
Simple Regression Model
Total Unexplained Variation = 7,425,726
33
Simple Regression Model
  • Relative Goodness of Fit
  • Summarize the improvement in prediction from
    using the regression model
  • Compute R², the coefficient of determination

The regression model (equation) is a better
predictor than guessing the average salary: the
GPA is a more accurate predictor of starting
salary than the average. R² is the performance
measure for the model. Predicted Starting
Salary = 4,779 + 5,370 × GPA.
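
Checking the slides' totals, R² = 1 - (unexplained variation / total variation):

  # R² from the variation totals reported on the slides
  r_squared = 1 - 7_425_726 / 23_000_000
  print(round(r_squared, 3))   # 0.677: about 68% of the variation is explained
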
34
Building a Model - An Example with Linear
Regression
35
Problems
  • Problem 2.11, page 33